Microsoft: Rita Hui presented “SONiC for AI with SRv6” at MPLS & SRv6 WC Paris 2025
Summary
TL;DR: In this video, Rita Hui, a manager on Microsoft's SONiC team, discusses the role of SONiC (Software for Open Networking in the Cloud) in supporting AI workloads in Microsoft’s data centers. She describes the network architecture, including its use of SRv6 (Segment Routing over IPv6) to steer traffic, reduce latency, and improve reliability. As demand from AI applications grows, particularly GPU-intensive training, SONiC provides an efficient and scalable networking foundation. Rita also emphasizes the importance of community contributions to SONiC's development, describing its open-source framework and how it fosters innovation in data center networking.
Takeaways
- 😀 SONiC is an open-source network operating system developed by Microsoft, primarily used in large-scale data centers.
- 😀 The primary goal of SONiC is to ensure high availability and efficiency for both internal and external customers, such as Office 365 and Azure.
- 😀 The network infrastructure is designed to be highly scalable, reliable, and low-latency, which is crucial for supporting AI workloads and large data transfers.
- 😀 AI workloads require massive amounts of training data, which places a strain on the existing network infrastructure, necessitating more efficient network solutions.
- 😀 The backend network of a data center requires high-speed NICs (200G, 400G, or even 800G) to handle the data demands of AI training clusters.
- 😀 SRv6 is used for routing AI traffic across GPU servers, offering precise control over traffic flows, optimized performance, and network failure detection.
- 😀 SRv6 allows for source routing, detailed path enumeration, and tight integration with AI workload scheduling, which benefits AI training processes.
- 😀 The use of SRv6 reduces the dependency on traditional dynamic routing protocols and enables more efficient management of network traffic.
- 😀 The two-layer network topology uses SRv6 across four hops (T0, T1, and two destination NICs), allowing scalable and efficient data routing without changes to packet formats.
- 😀 Microsoft continues to contribute to the SONiC community, focusing on enhancing SRv6 features, and encourages community participation through various resources, including mailing lists and weekly meetings.
Q & A
What is SONiC, and why did Microsoft develop it?
-SONiC is an open-source network operating system developed by Microsoft. It was created to run the network in the company's global data centers, which must provide high availability and efficiency for internal services like Office 365 and external services like Azure. SONiC helps ensure reliable network performance at large scale.
What are the main requirements for the network infrastructure supporting AI workloads?
-The network infrastructure for AI workloads must be highly reliable, scalable, and low-latency. AI applications require large amounts of data, which puts a strain on existing networks that were not initially designed for the bulk parallel nature of AI workloads. High-speed connections such as 200Gb, 400Gb, and 800Gb NICs are also essential.
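To make the effect of NIC speed concrete, here is a minimal back-of-the-envelope sketch. The 10 GB payload is a hypothetical figure chosen for illustration and does not come from the talk, and the calculation assumes ideal line rate with no protocol overhead or congestion.

```python
def transfer_seconds(payload_gigabytes: float, nic_gbps: float) -> float:
    """Ideal time to move a payload at line rate, ignoring all overhead."""
    payload_gigabits = payload_gigabytes * 8  # 1 byte = 8 bits
    return payload_gigabits / nic_gbps


if __name__ == "__main__":
    # Hypothetical 10 GB payload (an assumption, not a number from the talk).
    for speed_gbps in (200, 400, 800):
        seconds = transfer_seconds(10, speed_gbps)
        print(f"{speed_gbps}G NIC: {seconds:.2f} s")
```

Doubling the NIC speed halves the ideal transfer time, which is why backend networks for training clusters move to faster NICs as payloads grow.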
How does SONiC support Microsoft’s AI-related network needs?
-SONiC supports Microsoft’s AI infrastructure by providing an open-source, scalable network operating system. It can be deployed across several tiers of data center switches (Tier 0, Tier 1, and Tier 2) and helps manage AI traffic efficiently using Segment Routing over IPv6 (SRv6), which supports high reliability and low latency.
What is Segment Routing over IPv6 (SRv6), and why is it beneficial for AI traffic?
-SRv6 is a method of routing network traffic in which the source specifies the path a packet takes (source routing), chosen from an enumerated set of paths. It is beneficial for AI traffic because it allows precise control over traffic flows, helps keep latency low, and provides better coupling between network paths and AI workload scheduling. With SRv6, the traffic source can also detect network failures and recompute paths without relying on traditional dynamic routing protocols.
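The idea of source routing can be sketched with the generic SRv6 Segment Routing Header model from RFC 8754: the source writes an ordered list of segment identifiers (SIDs), and each segment endpoint advances to the next one. This is a simplified illustration only. The talk summary does not say which SRv6 encoding Microsoft uses, and the class name, function names, and addresses (taken from the 2001:db8::/32 documentation prefix) are all hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv6Address
from typing import List


@dataclass
class SrhPacket:
    """Simplified packet model: IPv6 destination plus SRH fields."""
    dst: IPv6Address
    segment_list: List[IPv6Address]  # stored in reverse order, as in RFC 8754
    segments_left: int               # index of the active segment


def encapsulate(path: List[str]) -> SrhPacket:
    """Source side: encode the chosen path as a segment list."""
    sids = [IPv6Address(s) for s in path]
    return SrhPacket(
        dst=sids[0],                       # first segment is the initial destination
        segment_list=list(reversed(sids)),
        segments_left=len(sids) - 1,
    )


def advance(pkt: SrhPacket) -> bool:
    """Segment endpoint: move to the next segment; False if none remain."""
    if pkt.segments_left == 0:
        return False
    pkt.segments_left -= 1
    pkt.dst = pkt.segment_list[pkt.segments_left]
    return True


if __name__ == "__main__":
    # Hypothetical SIDs for T0 -> T1 -> T0 -> destination NIC.
    path = ["2001:db8:0:1::", "2001:db8:1:1::", "2001:db8:0:2::", "2001:db8:ff::2"]
    pkt = encapsulate(path)
    print("hop:", pkt.dst)
    while advance(pkt):
        print("hop:", pkt.dst)
```

Because the whole path is carried in the packet, intermediate switches do not need per-flow state or a routing-protocol decision to forward it.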
What are the key advantages of using SRv6 in Microsoft's AI data centers?
-The key advantages of using SRv6 include: 1) Source routing for precise traffic control, 2) Path enumeration for low-latency management, 3) Tight coupling with AI workload scheduling, and 4) Network failure detection at the source, which allows traffic rerouting without relying on traditional dynamic routing protocols.
What is the two-layer topology used for AI traffic in Microsoft’s backend network?
-The two-layer topology consists of Tier 0 (rack-level switches) connected to GPU servers, Tier 1 switches linking to data center spines, and redundant paths leading to regional spine routers. SRv6 is used to carry AI traffic between GPUs and ensure high reliability and scalability.
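Path enumeration over such a topology can be sketched as follows. The sketch assumes every T0 switch connects to every T1 switch, which the summary does not state, and all switch and NIC names are hypothetical.

```python
from typing import List


def enumerate_paths(src_t0: str, dst_t0: str,
                    t1_switches: List[str], dst_nic: str) -> List[List[str]]:
    """List every T0 -> T1 -> T0 -> NIC path between two racks.

    Assumes a full mesh between the T0 and T1 tiers (an assumption made
    for this sketch, not a detail from the talk).
    """
    if src_t0 == dst_t0:
        # Same rack: traffic never needs to leave the T0 switch.
        return [[src_t0, dst_nic]]
    return [[src_t0, t1, dst_t0, dst_nic] for t1 in t1_switches]


if __name__ == "__main__":
    paths = enumerate_paths("t0-a", "t0-b", ["t1-1", "t1-2", "t1-3"], "nic-b7")
    for p in paths:
        print(" -> ".join(p))
```

Under this assumption, the number of distinct paths between two racks equals the number of T1 switches, so the source has a small, fully known set of paths to choose from.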
How does Microsoft handle network failures within the AI traffic management system?
-When a network failure occurs, the source of the traffic (such as the GPU server) detects it and recalculates the path for the traffic. This rerouting is done using SRv6, without relying on traditional routing protocols, ensuring that network performance remains optimal even during failures.
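A minimal sketch of source-side rerouting is shown below. The summary does not describe how failures are actually detected or how a replacement path is chosen, so the failure report here is a plain method call and the selection rule (flow ID modulo the number of healthy paths) is an assumption. All names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


class SourcePathSelector:
    """Keeps a set of pre-enumerated paths and avoids ones that cross failed nodes."""

    def __init__(self, paths: Sequence[Sequence[str]]) -> None:
        self._paths: List[Tuple[str, ...]] = [tuple(p) for p in paths]
        self._failed: Set[str] = set()

    def report_failure(self, node: str) -> None:
        """Record that a switch or NIC on some path is unreachable."""
        self._failed.add(node)

    def report_recovery(self, node: str) -> None:
        """Record that a previously failed node is usable again."""
        self._failed.discard(node)

    def healthy_paths(self) -> List[Tuple[str, ...]]:
        """Return the paths that do not traverse any failed node."""
        return [p for p in self._paths if not self._failed.intersection(p)]

    def pick(self, flow_id: int) -> Tuple[str, ...]:
        """Choose a healthy path for a flow; the source needs no routing protocol."""
        healthy = self.healthy_paths()
        if not healthy:
            raise RuntimeError("no healthy path available")
        return healthy[flow_id % len(healthy)]


if __name__ == "__main__":
    selector = SourcePathSelector([
        ["t0-a", "t1-1", "t0-b", "nic-b7"],
        ["t0-a", "t1-2", "t0-b", "nic-b7"],
    ])
    print("before failure:", selector.pick(flow_id=0))
    selector.report_failure("t1-1")
    print("after failure: ", selector.pick(flow_id=0))
```

Since the decision is made entirely at the source, traffic can move to another path as soon as the failure is noticed, without waiting for a routing protocol to converge.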
What role does the smart NIC play in AI training within Microsoft’s data centers?
-The smart NICs are responsible for handling AI traffic between GPUs. They provide reliable communication through queue pairs (QPs), which are scheduled for point-to-point data transfers. These NICs also support protocols that help detect congestion or network failures, enabling rerouting and improving overall network performance.
How does Microsoft plan to scale the AI infrastructure without altering the network design?
-Microsoft plans to scale the AI infrastructure by adding more servers and switches to the network. Since SRv6 allows for flexible path enumeration and source routing, the underlying network design remains the same, and the network can scale horizontally without needing major redesigns.
How can individuals contribute to the SONiC project?
-Individuals can contribute to the SONiC project in several ways, such as testing, contributing code, providing platform support, or improving tooling. Microsoft provides resources like the SONiC GitHub repository, mailing lists, and weekly community meetings to help people get involved.