Cisco Artificial Intelligence and Machine Learning Data Center Networking Blueprint

Cisco
13 Jun 2023 · 40:57

Summary

TL;DR: Nemanja Kamenica, a technical marketing engineer at Cisco, introduces the AI/ML data center networking blueprint, a comprehensive guide to optimizing networks for AI workloads. He highlights the importance of AI in various industries, from healthcare to retail, and explains the differentiation between training and inference AI clusters. The presentation delves into the technical requirements for these clusters, including high bandwidth and low latency, and showcases Cisco's Nexus 9000 switches as a solution. Through a detailed demo, Kamenica demonstrates congestion management techniques like PFC and ECN, crucial for maintaining a lossless network environment for AI clusters. The session concludes with resources Cisco provides to help customers implement these AI networking solutions.

Takeaways

  • πŸ“Š AI and ML workloads require specialized network configurations to handle their unique data traffic patterns efficiently.
  • πŸ’» There are two main types of AI clusters: distributed training clusters for model training, which require high bandwidth and low latency, and production inference clusters for model deployment, which prioritize real-time responses and high availability.
  • πŸ“¦ On-premises AI clusters offer full control and constant availability, whereas cloud-based solutions provide flexibility and scalability with cost considerations.
  • πŸ›  Key challenges in building AI clusters include managing rapidly doubling data volumes and models, necessitating scalable infrastructure to maintain performance and accuracy.
  • πŸ“š Cisco's data center networking solutions, particularly the Nexus 9000 series, are designed to support AI/ML workloads with low latency, high throughput, and efficient node-to-node communication.
  • πŸ”§ RDMA over Converged Ethernet (RoCEv2) is critical for efficient AI cluster operations, enabling direct memory access between nodes to reduce latency and increase throughput.
  • 🚨 Network configurations for AI workloads must be non-blocking, lossless, and capable of handling congestion through mechanisms like Priority Flow Control (PFC) and Explicit Congestion Notification (ECN).
  • πŸ“‘ Effective congestion management is crucial to prevent data loss and ensure continuous operation, especially in distributed AI training scenarios where data synchronization between nodes is constant.
  • πŸ“± Cisco's Nexus Dashboard Fabric Controller and other tools offer advanced features for network visibility, congestion management, and QoS configuration, aiding in the deployment and management of AI clusters.
  • πŸ’» Custom automation scripts and templates provided by Cisco can streamline the configuration of networks for AI/ML clusters, aligning with the automation of endpoint provisioning.

Q & A

  • What is the significance of AI/ML in modern data center networking according to Cisco's blueprint?

    -The significance of AI/ML in modern data center networking lies in optimizing network configurations to support the unique demands of AI/ML workloads, such as high bandwidth, low latency, and lossless data transport, which are crucial for efficient model training and inference.

  • What are the two major types of AI clusters mentioned in the transcript?

    -The two major types of AI clusters mentioned are the distributed training cluster, used for training AI models, and the inference cluster, used for applying the trained models to new data.

  • Why is high node-to-node communication important in a distributed training cluster?

    -High node-to-node communication is crucial in a distributed training cluster because it enables the rapid exchange and processing of data samples between nodes, which is essential for updating and refining AI models efficiently.

  • What are the key network requirements for a distributed training cluster?

    -The key network requirements for a distributed training cluster include high bandwidth and low latency to facilitate quick data exchange and computation among nodes, which leads to faster model training times.

  • What does a production inference cluster require from the network?

    -A production inference cluster requires the network to be real-time and highly available to handle numerous user requests simultaneously without significant delays.

  • What are the benefits and challenges of deploying AI clusters on-premises versus in the public cloud?

    -Deploying AI clusters on-premises offers full control, data security, and constant availability but requires significant infrastructure and management. In contrast, public cloud deployment offers flexibility and scalability but can lead to increased costs and concerns over data security.

  • Why is network congestion management critical in AI/ML clusters?

    -Network congestion management is critical in AI/ML clusters to avoid data loss and ensure efficient communication between nodes, which is essential for the accurate and timely training and inference of AI models.

  • What are RoCEv2 and RDMA, and why are they important for AI/ML workloads?

    -RoCEv2 (RDMA over Converged Ethernet version 2) and RDMA (Remote Direct Memory Access) are technologies that provide low latency and high throughput data transport. They are crucial for AI/ML workloads to enable fast and efficient data transfer between nodes in the network.

  • How does Cisco's Nexus 9000 series support AI/ML data center requirements?

    -Cisco's Nexus 9000 series supports AI/ML data center requirements by providing low latency, high bandwidth, and advanced features such as flow control and congestion management, which are essential for handling the demands of AI/ML workloads.

  • What role does quality of service (QoS) play in managing AI/ML network traffic?

    -Quality of Service (QoS) plays a critical role in managing AI/ML network traffic by prioritizing different types of traffic, ensuring that important data, such as AI model updates, are transmitted quickly and reliably across the network.
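That prioritization idea can be sketched as a toy scheduler. This is a hypothetical Python model, not Cisco's implementation: a strict-priority queue (e.g. for congestion-notification packets) always drains first, and the remaining traffic classes are served round-robin. The class names and API are illustrative.

```python
from collections import deque
from itertools import cycle

class QosScheduler:
    """Toy QoS model: strict-priority traffic drains first,
    other classes are served round-robin (illustrative only)."""

    def __init__(self, classes):
        self.strict = deque()
        self.queues = {c: deque() for c in classes}
        self._rr = cycle(classes)

    def enqueue(self, pkt, cls=None, strict=False):
        if strict:
            self.strict.append(pkt)
        else:
            self.queues[cls].append(pkt)

    def dequeue(self):
        if self.strict:                    # strict-priority queue first
            return self.strict.popleft()
        for _ in self.queues:              # then one round-robin sweep
            q = self.queues[next(self._rr)]
            if q:
                return q.popleft()
        return None
```

Real switches implement this in hardware with configurable bandwidth weights per class; the round-robin here only stands in for that fair-sharing stage.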

Outlines

00:00

πŸ” Introduction to AI/ML Data Center Networking

Nemanja Kamenica, a technical marketing engineer at Cisco, introduces the AI/ML data center networking blueprint, a guide designed to optimize network infrastructure for AI/ML workloads. He explains the significance of AI in various industries today and its potential future impact. Nemanja outlines the importance of creating efficient networks for AI clusters, mentioning two types of AI clusters: distributed training clusters, which require high bandwidth and low latency for node-to-node communication, and inference clusters, designed for real-time responses and high availability. He emphasizes the unique network requirements of each cluster type.

05:06

🌐 Network Requirements for AI Clusters

This section delves into the specific network requirements for effectively running AI clusters, focusing on the need for a scalable, real-time, and always available network infrastructure. Nemanja discusses the benefits of building clusters on-premises for control and data sovereignty, as well as the flexibility and scalability offered by public cloud solutions. He highlights the challenges of scaling infrastructure to match the rapid growth of AI models and data, stressing the importance of a robust network, ample compute resources, and efficient data storage and management.

10:09

πŸš€ High-Performance Networking with Cisco Nexus

Nemanja showcases Cisco's Nexus 9000 switches, ideal for AI/ML clusters due to their low latency and high bandwidth capabilities. He explains the role of RDMA over Converged Ethernet (RoCEv2) in providing efficient, low-latency communication between nodes, crucial for distributed AI model training. This technology enables direct memory access, bypassing CPU processing for faster data exchange. The Nexus product line, with its versatile operating systems and high-speed interfaces, is presented as a solution for the demanding requirements of AI networking.

15:11

πŸ› οΈ Advanced Congestion Management Techniques

This part explains Cisco's advanced congestion management solutions, including Priority Flow Control (PFC) and Explicit Congestion Notification (ECN), to maintain network performance and avoid data loss during high-traffic conditions. Nemanja uses an example to illustrate how these mechanisms work together to manage congestion, ensuring a lossless Ethernet environment. This segment underscores the importance of configuring the network to handle the specific demands of AI workloads, preventing congestion and ensuring efficient data flow.

20:16

πŸ”§ Demonstrating Network Configuration for AI Clusters

Nemanja presents a practical demonstration of configuring a network for AI clusters, focusing on Quality of Service (QoS) settings to manage traffic priorities and handle congestion. He explains the setup of a network with multiple leaf switches and hosts, detailing how traffic is managed to prevent congestion using Nexus Dashboard Fabric Controller. This demonstration shows the importance of proper QoS configuration in maintaining network performance and ensuring smooth operation of AI workloads.

25:16

πŸ’‘ Deploying and Managing AI Network Configurations

In this section, Nemanja outlines the process of deploying QoS configurations to manage AI cluster traffic efficiently, focusing on differentiating between RoCE and CNP (Congestion Notification Packet) traffic for optimal delivery. He introduces a watchdog feature to prevent PFC storms, which can cause network stalls. This part highlights the steps to apply these configurations across the network, demonstrating Cisco's commitment to providing tools and templates for easy network management and ensuring high performance for AI clusters.

30:19

πŸ“Š Monitoring and Optimizing Network Performance

Nemanja discusses the capabilities of Cisco's Nexus Dashboard Insights for monitoring and optimizing network performance for AI clusters. He explains how the system tracks and reports on network traffic, congestion, and device performance, providing valuable data for fine-tuning the network. This monitoring tool, combined with advanced features like PFC and ECN, allows network administrators to manage congestion effectively and maintain optimal conditions for AI workloads.

35:20

πŸš€ Real-World Application and Future Directions

Concluding the presentation, Nemanja reflects on the practical application of Cisco's networking solutions in real-world AI clusters, emphasizing the availability of the discussed technologies and their importance in efficient AI operations. He addresses a question about the nature of AI traffic, explaining that it can vary from bursty to sustained, depending on the algorithm and task at hand. He stresses that proper network design is crucial to avoid costly downtime and ensure successful AI model training and inference.

Keywords

πŸ’‘AI/ML Data Center Networking Blueprint

The AI/ML Data Center Networking Blueprint refers to a comprehensive guide developed by Cisco to assist customers in designing and implementing networks optimized for artificial intelligence (AI) and machine learning (ML) workloads. This blueprint outlines best practices, architectural considerations, and Cisco's specific solutions (like Nexus 9000 switches) to ensure high performance, scalability, and efficiency in data center networks supporting AI/ML applications. The script mentions this blueprint as a resource for customers to create the best possible network infrastructure for AI/ML clusters.

πŸ’‘Distributed Training Cluster

A Distributed Training Cluster in the context of AI/ML workloads is a type of cluster designed specifically for training AI models. It consists of multiple compute nodes that communicate intensely with each other to process large datasets, learn from this data, and build the AI model. This process requires high bandwidth and low-latency communication between nodes, as highlighted in the script. The emphasis is on the infrastructure's need to support node-to-node communication efficiently, which is critical for reducing training time and improving the model's accuracy.

πŸ’‘Inference Cluster

An Inference Cluster is designed for deploying trained AI models to make predictions or decisions based on new data. Unlike the distributed training cluster, an inference cluster's focus is on real-time, highly available services to end-users. This requires the network to be optimized for quick response times and high availability to handle potentially thousands of requests simultaneously without significant bandwidth demands, as discussed in the script. This differentiation underscores the diverse network requirements based on the AI workload stage (training vs. inference).

πŸ’‘RDMA over Converged Ethernet (RoCEv2)

RDMA over Converged Ethernet (RoCEv2) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. It enables high-throughput, low-latency communication between nodes in a network, which is crucial for efficient AI/ML workloads. The script mentions RoCEv2 as a key component of Cisco's AI/ML data center networking solution, providing the necessary transport capabilities for distributed training clusters by allowing data to bypass the software stack, reducing communication overhead.
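The layering described above can be illustrated with a short mock-up. This is a hypothetical Python sketch of the encapsulation only (Ethernet / IPv4 / UDP / InfiniBand base transport header); checksums are left at zero and the opcode, P_Key, and source port values are illustrative, though UDP destination port 4791 is the registered RoCEv2 port.

```python
import struct

def rocev2_frame(src_mac, dst_mac, src_ip, dst_ip, dest_qp, psn, payload):
    """Build a minimal mock RoCEv2 frame: Ethernet / IPv4 / UDP / BTH."""
    eth = dst_mac + src_mac + struct.pack("!H", 0x0800)   # EtherType = IPv4
    udp_len = 8 + 12 + len(payload)                       # UDP hdr + BTH + payload
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45,            # version 4, IHL 5
                     0x02,            # ToS: DSCP 0, ECN = ECT(0), ECN-capable
                     20 + udp_len,    # total length
                     0, 0,            # identification, flags/fragment
                     64, 17, 0,       # TTL, protocol = UDP, checksum placeholder
                     src_ip, dst_ip)
    udp = struct.pack("!HHHH", 49152, 4791, udp_len, 0)   # dst port 4791 = RoCEv2
    bth = struct.pack("!BBHII",
                      0x04, 0,             # opcode (illustrative), flags
                      0xFFFF,              # P_Key (illustrative)
                      dest_qp & 0xFFFFFF,  # reserved byte + 24-bit dest QP
                      psn & 0xFFFFFF)      # reserved byte + 24-bit PSN
    return eth + ip + udp + bth + payload

# Hypothetical endpoints: a 2-byte payload between two 10.0.0.x hosts.
frame = rocev2_frame(b"\xaa" * 6, b"\xbb" * 6,
                     b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02",
                     dest_qp=0x12, psn=7, payload=b"hi")
```

Because everything above the UDP header is ordinary IP, any Ethernet switch or router can forward the frame, which is the point the script makes about routing and switching RoCEv2 traffic.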

πŸ’‘Priority Flow Control (PFC) and Explicit Congestion Notification (ECN)

Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) are mechanisms used in networks to manage data flow and congestion, ensuring lossless transport. PFC prevents packet loss by pausing specific data flows in congested network conditions, while ECN provides a way for network elements to signal impending congestion before losses occur. In the script, these mechanisms are described as essential for maintaining high performance in AI/ML data center networks by managing congestion and maintaining data integrity.
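A minimal sketch of the ECN half of this, assuming a simplified linear WRED-style marking curve (real switches use per-queue, per-class thresholds configured in hardware): below the minimum threshold nothing is marked, above the maximum everything is, and in between the marking probability rises linearly.

```python
import random

def mark_probability(depth, wred_min, wred_max):
    """Simplified WRED marking curve over queue depth."""
    if depth <= wred_min:
        return 0.0
    if depth >= wred_max:
        return 1.0
    return (depth - wred_min) / (wred_max - wred_min)

def maybe_mark(tos, depth, wred_min, wred_max):
    """If the sender set ECT(0)/ECT(1) in the two low ToS bits, a
    congested switch may rewrite them to CE instead of dropping."""
    if (tos & 0b11) in (0b01, 0b10) and \
            random.random() < mark_probability(depth, wred_min, wred_max):
        return tos | 0b11          # CE: congestion experienced
    return tos
```

The receiver echoes the CE mark back to the sender, which slows down before the queue overflows; PFC is the harder backstop if this soft signal is not enough.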

πŸ’‘Nexus 9000 Switches

Nexus 9000 switches are Cisco's flagship data center networking solution, offering high performance, scalability, and flexibility for modern data center requirements. In the context of AI/ML workloads, these switches provide the low-latency, high-bandwidth connectivity required for efficient distributed training and inference clusters. The script emphasizes their role in the AI/ML Data Center Networking Blueprint, highlighting their capabilities to support RoCEv2 and advanced congestion management features.

πŸ’‘Lossless Ethernet

Lossless Ethernet refers to networking configurations and protocols designed to prevent packet loss in Ethernet networks, which is crucial for applications requiring high reliability and performance, such as AI/ML workloads. The script describes the implementation of lossless Ethernet in AI/ML clusters through the use of PFC and ECN, ensuring that the network can handle high volumes of data without losing packets, thereby improving the efficiency and reliability of AI model training and inference.
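The pause behavior behind lossless Ethernet can be modeled with a toy queue: crossing a high (xoff) threshold sends a PFC pause frame upstream for that priority, and draining below a lower (xon) threshold resumes transmission. The thresholds and cell counts here are illustrative, not platform defaults.

```python
class PfcQueue:
    """Toy model of a PFC-protected ingress queue (illustrative thresholds)."""

    def __init__(self, xoff, xon):
        self.xoff, self.xon = xoff, xon
        self.depth = 0
        self.paused = False      # True while upstream is told to pause

    def enqueue(self, cells):
        self.depth += cells
        if not self.paused and self.depth >= self.xoff:
            self.paused = True   # send PFC PAUSE for this priority

    def dequeue(self, cells):
        self.depth = max(0, self.depth - cells)
        if self.paused and self.depth <= self.xon:
            self.paused = False  # send PFC resume (pause quanta = 0)
```

The gap between xoff and xon gives the in-flight packets somewhere to land, which is why no packet is dropped even though the sender reacts with some delay.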

πŸ’‘Congestion Management

Congestion Management encompasses the techniques and mechanisms used to control data flow in a network, preventing or mitigating congestion that can lead to packet loss and performance degradation. In the script, congestion management is a critical topic, with detailed explanations on how Cisco's solutions, including PFC, ECN, and RoCEv2, work together in AI/ML data center networks to ensure seamless, efficient data transfer between nodes in both training and inference clusters.

πŸ’‘Data Parallelization

Data Parallelization is a method used in distributed computing where a large dataset is divided into smaller chunks that are processed simultaneously across multiple computing nodes. This approach is essential for training AI models efficiently, as it significantly reduces the time required to process large volumes of data. The script references data parallelization in the context of distributed training clusters, illustrating how AI/ML workloads benefit from the high-throughput, low-latency communication enabled by Cisco's networking solutions.
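The compute-exchange-average-update loop described above can be sketched in a few lines; `grad_fn`, the shard contents, and the learning rate are hypothetical stand-ins, and the list-based "all-reduce" stands in for the network exchange that RoCEv2 accelerates.

```python
def data_parallel_step(params, shards, grad_fn, lr=0.1):
    """One simplified data-parallel step: each node computes a gradient
    on its own shard, gradients are all-reduced (averaged), and every
    node applies the identical averaged update."""
    grads = [grad_fn(params, shard) for shard in shards]  # per-node compute
    avg = [sum(g) / len(grads) for g in zip(*grads)]      # all-reduce: mean
    return [p - lr * g for p, g in zip(params, avg)]      # same update everywhere

# Hypothetical example: one parameter pulled toward the data mean.
grad_fn = lambda params, shard: [params[0] - sum(shard) / len(shard)]
new_params = data_parallel_step([0.0], [[1.0, 2.0], [3.0, 4.0]], grad_fn)
```

Every gradient exchange in the `zip(*grads)` step crosses the network in a real cluster, which is why the step time is bounded by the slowest link and why lossless, low-latency transport matters.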

πŸ’‘AI Model Training and Inference

AI Model Training is the process of teaching an AI system to make predictions or decisions by learning from data. Inference is the stage where the trained model is used to make predictions on new data. The script highlights the distinct network requirements for these stages, with training requiring intensive node-to-node communication and inference needing real-time, highly available network services. These distinctions underscore the need for a flexible, high-performance network infrastructure as provided by Cisco's AI/ML Data Center Networking Blueprint.

Highlights

Nemanja Kamenica introduces the AI/ML data center networking blueprint by Cisco, aimed at optimizing network creation for AI/ML workloads.

The importance of AI in various industries such as healthcare, finance, public sector, and entertainment is emphasized, highlighting AI's broad impact.

Discussion on two types of AI clusters: distributed training clusters for training AI models and production inference clusters for deploying trained models.

The significance of infrastructure in AI cluster performance, focusing on the need for high bandwidth, low latency transport for distributed training clusters.

The option for enterprises to create AI clusters on-premises or in the public cloud, each with distinct advantages and considerations.

Challenges in scaling infrastructure to keep up with rapidly doubling model sizes and data, emphasizing the need for large networks with extensive GPU and CPU power.

The critical role of the network in AI clusters, requiring non-blocking transport, lossless Ethernet, and features like PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) for congestion management.

Introduction to Cisco's Nexus 9000 switches as a solution for AI/ML clusters, offering low latency, high bandwidth forwarding.

Explanation of RoCEv2 (RDMA over Converged Ethernet version 2) technology and its benefits in providing low-latency, high-throughput transport for AI workloads.

Detailed discussion on congestion management techniques using ECN and PFC to maintain a lossless network, crucial for AI cluster efficiency.

Demonstration of QoS (Quality of Service) configuration and its impact on managing congestion in a network, showing practical application of theoretical concepts.

Insights into the behavior of AI clusters, highlighting the bursty nature of AI traffic and the necessity of a well-designed network to handle such patterns.

The economic implications of network congestion on AI training tasks, emphasizing the cost efficiency of a well-managed network infrastructure.

Future updates and tools for AI/ML network management, including custom QoS templates and automation scripts for easier deployment.

A case study on a customer's successful deployment of an AI cluster using Cisco's networking solutions, validating the blueprint's effectiveness.

Transcripts

00:10 My name is Nemanja Kamenica and I'm a technical marketing engineer at Cisco.

00:14 And I'm here to talk about the AI/ML data center networking blueprint.

00:19 It's a set of material which we created to explain how you can create the network in the best

00:26 possible way for your AI/ML workloads.

00:31 So what I have today in the session is why AI is important today and

00:37 what it will bring in the future, and then I'll explain how you should create the networks for your

00:44 AI cluster.

00:45 I'm going to show a brief demo of everything I'm going to talk about up to that point.

00:50 And then I'm going to show you the set of collateral which we have,

00:55 which we have created to enable you as a customer to do this.

01:01 So first let's talk about AI and what AI can do today and what it will do in the

01:08 future. There are a set of use cases and industries

01:11 which will benefit from AI.

01:13 You have all seen ChatGPT; you've probably asked it some serious questions,

01:20 some silly questions, and it always responds to you.

01:23 So that's one way to entertain yourself. But there are other industries which will

01:28 benefit from it. For example, healthcare will be able to do

01:32 medical research and medical risk research. In financial services,

01:38 they might be able to advance trading algorithms and do more trading using AI.

01:45 In the public sector,

01:47 they might be able to optimize public transport paths and

01:51 have more users using those paths and public transport in the cities. In media and

01:57 entertainment industries, creating subtitles and doing translation using AI is a great

02:03 help. In manufacturing, finding

02:09 flaws in the product will be enabled by AI, and doing that in a

02:15 scalable way. And in the end, in retail, personalized

02:20 recommendations for any retail items could be done with AI.

02:25 So the capabilities and use cases are vast and wide, and as such AI will be

02:31 deployed by many enterprises.

02:34 So when we talk about enterprises deploying this, there are two major types of AI

02:41 clusters. We have something we call a distributed

02:44 training cluster, which is basically a cluster which trains your AI model.

02:50 So imagine that you want to have an AI model which recognizes a particular color.

02:57 I'm oversimplifying this, but let's say it recognizes the color.

03:01 You need to give it a set of colors. It's going to go and do that task,

03:06 and eventually it will say, hey, this is red, this is pink,

03:09 this is yellow. So what that cluster requires from

03:13 the infrastructure is that node-to-node communication in this case is high.

03:18 The reason is the nodes take those samples, then they do a calculation,

03:24 then they exchange the calculations, they make an average, and then they update the model.

03:29 So all of this which I said happens in node-to-node communication;

03:34 it happens over the network.

03:36 And as such, the network is required to provide high-bandwidth,

03:42 low-latency transport. In this case,

03:45 the key metric for this type of cluster would be

03:52 training time. So, the shortest period of time to train

03:56 your model. What does it require from the infrastructure?

04:01 It would be a large network with GPUs, and a lot of GPU and CPU power.

04:07 So that's our distributed training cluster.

04:10 There is another way to create a cluster,

04:13 and that happens when you have finished training

04:16 your model: you have a production inference cluster.

04:20 So now you've trained the model to recognize a color, the red color.

04:26 Now you're going to deploy that model and users will be able to go and ask that model: hey,

04:31 what is this color, or is this a red color?

04:34 So it's a different type of cluster which will do this task.

04:38 What is required from this cluster is to be real time and highly available.

04:44 Any time any of the users asks this, you need to provide an answer.

04:50 Obviously, you're not going to have a single user doing this.

04:53 You will have hundreds or maybe thousands of users asking the model to do this.

04:59 And as such, you might not need a lot of bandwidth, but you need to have a lot of

05:06 availability, high availability of that cluster.

05:11 So what is required from the network?

05:13 It's a smaller network, a smaller number of devices, but it has to be real time.

05:19 It has to be always available.

05:22 So how do you create those clusters? First,

05:27 you can create that cluster on-prem, in your own data center, for you to serve your own

05:33 enterprise needs.

05:35 The benefit of this is that the cluster is always available,

05:39 so there could always be somebody using that cluster.

05:43 Also, you have the flexibility to create whatever size of cluster

05:49 you want. Different parts of the enterprise can have a time

05:56 share of that cluster, so they use the cluster at different times, and

06:00 all your data is stored on-prem. So you do not export that data or

06:05 send it anywhere; it's present in your cluster.

06:09 Another way for you to do this, if you're an enterprise, is to do it in a public cloud.

06:14 All of the public clouds do enable you, or do have an offering,

06:18 to run your AI/ML training cluster in the public cloud.

06:23 Flexibility is the benefit of that. So I'm going to

06:27 get a number of instances with a particular bandwidth,

06:32 and I'm going to pay only when I use it.

06:34 But what will happen is that as you grow your cluster,

06:39 your cost is going to grow. So it might be fine

06:43 for a certain time, it might be fine forever.

06:45 But there are different cases and different use cases which can be solved.

06:51 So let's say you are an enterprise and you've decided to build a cluster on-prem.

06:57 So you're going to build your own data center

07:00 which will run AI/ML workloads.

07:05 What would be the key set of challenges which you need to understand before you do that?

07:10 The first one is that the model which you're creating, and the data which that model will

07:17 use, will double every two months. So as such,

07:24 you would need to scale, you would need to improve your infrastructure to be able

07:29 to keep up with that model.

07:31 What that bigger model provides you is more accuracy.

07:36 So if you're training for something complex, it's probably going to take time to train that

07:41 model. If you increase the size of

07:44 the cluster, your training time is going to go down, and so on and so on.

07:51 However, today most of the training happens on cluster sizes

07:57 up to 512 GPUs. So I would suggest you start by looking

08:02 for something which is 512 GPUs or higher.

08:07 The next challenge which you need to solve would be the key components which you need to have

08:14 in your cluster. So what do you need to have? Compute nodes,

08:19 which will contain GPUs and CPUs.

08:22 But you would also need to have a network; that network obviously needs to provide

08:27 transport between all of those nodes.

08:29 And then you would have some kind of storage system which will contain that data and will be

08:34 able to provide that training data into the cluster,

08:38 and later collect all that data from the cluster into a storage device.

08:43 Then there are two software components. A job scheduling and orchestration component

08:49 will basically tell you which GPUs, and where they are, and which of them will

08:55 work in this process. Some of them

09:00 may require a lower-latency path, and as such

09:04 you might want to have them on the same switch or

09:07 nearby switches; some of them may not require low latency and

09:11 can run at a higher scale, so you might distribute them as well.

09:15 The next component would be the software framework for the AI model.

09:20 That's how your AI model will work and how it will be trained,

09:24 and that's what you can buy. Since we are here talking about data center networking,

09:30 obviously the focus of the presentation will be on the network in this case.

09:37 So let's look at what the network for an AI cluster will provide in this case, or

09:43 what kind of requirements are put on the network to do this.

09:49 So when we talk about AI training networks, there are two requirements, or two components,

09:55 which are there. The first one would be the transport which you provide.

09:59 In this case, this is RoCEv2.

10:02 I don't know if any of you know what RoCEv2 stands for: RDMA over Converged Ethernet

10:09 version 2. I'll explain what that means in just a moment.

10:12 As a technology it provides a low-latency, high-throughput type of transport from the

10:18 endpoints, and it does have some requirements for the network. From the network

10:24 side, in the Nexus product line, in the Nexus 9000 switches, we do have a

10:31 set of switches which provide low-latency and high-bandwidth

10:37 forwarding, with up to 25.6 Tbps present in our

10:43 equipment, and they can satisfy all the needs of

10:48 the AI/ML cluster.

10:51 So, just a brief overview; I'm pretty sure you will know,

10:56 but in case anybody missed it, the Nexus 9000 is the data center portfolio from Cisco.

11:03 It's the flagship of data center networking.

11:06 It runs at speeds from 100 Mbps to 400 Gbps,

11:11 and it runs two operating systems:

11:15 it has a standalone mode and it has an ACI mode as well,

11:20 so you can run whatever you choose.

11:23 And all of the switches are based on Cisco silicon,

11:26 so there are certain features which are enabled particularly on this product line which

11:32 we use over here.

11:35 So let's go to the next component, which is RoCEv2: as I mentioned, RDMA over

11:42 Converged Ethernet.

11:43 So if you look at the bottom packet in this picture, you will see that it has a

11:48 Layer 2 header and it has an IP header, which means source MAC,

11:52 destination MAC, source IP, destination IP, and a UDP header.

11:56 And after that, the whole transport is InfiniBand. So you kind of have InfiniBand encapsulated in an

12:03 Ethernet packet, or frame.

12:05 So what that provides you, because it's RDMA-based transport,

12:11 is low latency and high throughput produced from the host itself.

12:18 What this packet allows is that you can route it,

12:25 you can switch it, you can do whatever you want with it, and the network will forward it.

12:31 An Ethernet network will forward this packet.

12:34 How does this work in a kind of high-level, end-to-end system?

12:39 If you do have an AI cluster using GPUs, RoCEv2 allows you direct

12:46 memory access and direct memory communication between memory chunks

12:52 in that GPU cluster, on the GPU compute nodes.

12:56 So how is this done? Through RDMA, the NIC is capable of going directly into

13:03 the memory of the GPU, getting that information, putting it on the network,

13:08 and sending it to wherever it needs to go next. The receiving NIC receives it and puts it into GPU

13:14 memory. So that's how they exchange. The benefit of this is

13:18 that you are bypassing the stack, you're bypassing the kernel,

13:22 and as such you provide low-latency, high-throughput transport from the endpoint itself.

So what is required from the network to provide matching capabilities? In this case we use RDMA over Converged Ethernet, RoCEv2, across the network for end-to-end communication. What is required from the network is non-blocking transport, so you have to build a non-blocking fabric. Then you have to create lossless Ethernet. How do you create lossless Ethernet? Using priority flow control, PFC, and ECN, explicit congestion notification.

I will get to those on my next slide, with a brief overview and explanation of how all this works. What we have here is a network where both ECN and PFC are configured to manage congestion. In this network I have three hosts: host A, host B, and host C. I have three leaf switches and a spine switch. Each switch has been configured to support WRED with ECN and to support PFC. For WRED with ECN we have set the two thresholds at the bottom, the yellow one and the red one. The yellow one is the WRED minimum threshold. Just to explain that part as well: every time buffer utilization goes over the minimum threshold, the system will mark IP packets with ECN, the congestion experienced bits. The ECN bits are part of the IP packet, in the ToS field, the last two least significant bits.
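To make "the last two bits of the ToS field" concrete, here is a small illustrative sketch; the codepoint values come from RFC 3168, and DSCP 24 matches the RoCE traffic class used later in the demo:

```python
# Illustrative sketch: where the ECN bits live in the IP ToS/DSCP byte
# (RFC 3168). The upper six bits carry DSCP; the last two carry ECN.

ECN_CODEPOINTS = {
    0b00: "Not-ECT (not ECN-capable)",
    0b10: "ECT(0)   (ECN-capable, as sent by the hosts here)",
    0b01: "ECT(1)   (ECN-capable, alternate codepoint)",
    0b11: "CE       (congestion experienced, set by the switch)",
}

def split_tos(tos_byte: int) -> tuple[int, int]:
    """Return (dscp, ecn) from a ToS byte."""
    return tos_byte >> 2, tos_byte & 0b11

# DSCP 24 with ECT(0), as in the demo's RoCE traffic:
dscp, ecn = split_tos((24 << 2) | 0b10)
print(dscp, ECN_CODEPOINTS[ecn])
```

The "ECN 10" and "11" values mentioned in the example below are exactly the ECT(0) and CE codepoints here.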

So just so we are clear. Then we have two PFC thresholds, the green line and the higher red line. Basically, if WRED with ECN does not resolve the congestion, PFC will kick in and resolve it.
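The two-stage division of labor can be sketched like this; the threshold numbers are hypothetical, purely to show the ordering (WRED with ECN reacts first, PFC is the backstop):

```python
# Illustrative sketch of the two-stage congestion response described
# above. Threshold values are hypothetical; on the switch they are
# queue-depth settings, with the WRED min/max kept below the PFC
# threshold so ECN marking reacts first.

WRED_MIN, WRED_MAX = 100, 300  # hypothetical queue depths (packets)
PFC_XOFF = 800                 # hypothetical PFC pause threshold

def congestion_action(queue_depth: int) -> str:
    """What the egress queue does at a given depth."""
    if queue_depth >= PFC_XOFF:
        return "send PFC pause upstream"  # severe congestion
    if queue_depth >= WRED_MAX:
        return "mark all ECN-capable packets CE"
    if queue_depth >= WRED_MIN:
        return "mark some packets CE (probabilistic)"
    return "forward normally"

for depth in (50, 150, 400, 900):
    print(depth, "->", congestion_action(depth))
```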

In my example, I have host A and host B talking to host C. Each of them is sending packets sourced from itself and destined to host C, and you will see that those packets carry the ECN-enabled bits. They are carrying ECN 10, which tells the network: these packets are ECN-capable; if you experience congestion, you may mark them with bits 11. So in this case we congest the port that goes to host C, and you will see that packets arrive with 10, reach leaf three, and get marked there with 11. Host C sees this and knows congestion happened somewhere on the path. What happens next is that host C generates a CNP, a congestion notification packet, and tells both host A and host B that they should slow down. So you see the source is host C, the destinations are host A and host B, and both of them receive the CNP. If this is mild congestion, say a temporary burst rather than a lot of sustained traffic, WRED with ECN will resolve it: host A and host B slow down, and the congestion is dealt with. This path provides the lowest latency overhead for the system, and the system is able to cope with the congestion and resolve it.
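The sender-side reaction to CNPs can be sketched roughly as follows; this is not the exact NIC algorithm (real RoCE NICs use DCQCN-style congestion control), and the constants are hypothetical, but it shows the shape: cut the rate on a CNP, creep back up when CNPs stop:

```python
# Illustrative sketch (not the exact NIC algorithm): the sender-side
# reaction to CNPs described above. Constants are hypothetical.

def adjust_rate(rate_gbps: float, got_cnp: bool) -> float:
    if got_cnp:
        return rate_gbps * 0.5           # back off on congestion signal
    return min(rate_gbps + 5.0, 100.0)   # recover toward line rate

rate = 100.0
for cnp in (True, True, False, False, False):
    rate = adjust_rate(rate, cnp)        # 100 -> 50 -> 25 -> 30 -> 35 -> 40
print(rate)
```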

However, let's say host A and host B in this example have not slowed down; they keep pumping a lot of traffic into the network. What will leaf three do in this case? It keeps buffering and buffering, and eventually it hits the PFC thresholds. Once it hits the PFC threshold, it stops sending traffic down to host C and generates a PFC, priority flow control, frame up to spine one: I am not going to take any more traffic from you. So now no traffic is coming from spine one. Spine one stops sending and keeps buffering the traffic coming in from host A and host B; eventually it goes over its WRED thresholds and marks ECN, but that does not mitigate the congestion, so it goes over its PFC thresholds and starts sending PFC down to leaf one and leaf two. Basically, we have per-hop behavior: leaf three sends to spine one, spine one sends to leaf one and leaf two, and leaf one and leaf two start buffering. At that point they go over their PFC thresholds and eventually send pause frames down to the hosts. Is there any buffer overlap there? There could be some; the whole system is built so that both hosts stop, because both could be contributing to the congestion. In reality, I have simplified this example a lot. There might not be a PFC frame arriving at both hosts, because they might not be contributing the same amount of traffic to the congestion; some of them may receive no pause frames or ECN-marked packets at all, and some may receive a lot, so the signals can be distributed unevenly between them. In this case, host A and host B have stopped, so we have mitigated the congestion; the buffers drain, and eventually we re-establish the full flow of traffic. The whole point is that no packet was dropped and no data was wasted in this situation.
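The per-hop back-pressure chain can be sketched like this; the device names follow the example above, and the buffer numbers are hypothetical:

```python
# Illustrative sketch of the per-hop PFC back-pressure described above:
# when a device's queue fills past its pause threshold, it pauses its
# upstream neighbor, and the pressure walks hop by hop toward the
# senders. Numbers are hypothetical.

PAUSE_THRESHOLD = 10  # hypothetical per-device buffer limit (packets)

def propagate_pfc(path: list[str], incoming: int) -> list[str]:
    """Return the devices that end up sending PFC pause upstream."""
    paused = []
    backlog = incoming
    for device in path:  # congested leaf first, then toward the senders
        if backlog > PAUSE_THRESHOLD:
            paused.append(device)
            backlog -= PAUSE_THRESHOLD  # this hop absorbs what it can
    return paused

# leaf three is the congested leaf in front of host C in the example
print(propagate_pfc(["leaf-3", "spine-1", "leaf-1"], incoming=25))
```

With a mild backlog only the congested leaf pauses; with a heavy one the pause reaches the spine and then the sending-side leaves, just as in the walkthrough.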

So now I am going to show you the demo of these congestion management mechanisms, PFC and ECN. But before I do that, let me explain the topology and what happens in the demo. I have a network built out in the system: six leaf switches and five hosts, and on leaf six, which is leaf 206, I have attached a storage device. Let's say my hosts 1 through 5 have finished training and want to write something to the storage device, so all of them start sending to it at the same time. The storage device is connected to the leaf switch on interface 1/11; this is important for the demo. All my hosts are connected to the network with 100-gig links, my first four leaves connect to the spines with 100-gig links, and then I have 400-gig links from the spines down to leaf 205 and leaf 206. So when hosts 1 through 5 start writing that data down to the storage device, they send to the spines, and the spines send everything to leaf 206, congesting that port 1/11.
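Quick arithmetic on why port 1/11 congests; the 100-gig host links come from the talk, while the storage port speed is not stated explicitly, so treating it as 100 Gb/s here is an assumption:

```python
# Back-of-the-envelope on the demo's incast: five hosts, each on a
# 100 Gb/s link, all write to one storage device behind a single port.
# The storage port speed is assumed to be 100 Gb/s (not stated in the
# talk); the host link speeds are from the talk.

hosts = 5
host_link_gbps = 100
storage_port_gbps = 100  # assumption

offered = hosts * host_link_gbps
print(offered, "Gb/s offered into a", storage_port_gbps, "Gb/s port")
print("oversubscription:", offered / storage_port_gbps)
```

Without ECN and PFC, that 5:1 incast can only be resolved by dropping packets, which is exactly what the first half of the demo shows.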

What I have here is the configuration management tool from the data center group, called Nexus Dashboard Fabric Controller, which will help me configure the network end to end. Here is the fabric I showed you earlier: it has those six leaves and the three spines, with all the connections present. As I mentioned, the main action will be on leaf 206. So I go to my fabric, look at the switches, and find leaf 206. On that leaf I am going to show you the QoS configuration: I run 'show running-config ipqos' to look at the running QoS configuration, and when I execute this command, there is none. In the first line, which I will highlight right now, the system has absolutely no QoS configured.

So let's go to our hosts, and I will show you how they are reacting to this, as well as how the switch is reacting. Give me one second, I am just going to zoom in so we can see everything that is going on. Here I have my five hosts on the left. You can see that they are writing; 'w' means writing to my storage device, and you can see the bandwidth fluctuating: at one point in time they show 0, 612, or 102,000 megabits per second. They are all writing at the same time, but very inconsistently. What is happening on my switches? Here at the bottom I can see the switch is constantly dropping packets; my switch 206 on interface 1/11 is constantly dropping some of the packets. Why is this happening? I have no QoS, I am not generating any ECN marks, and I am not generating any PFC frames. And this counter here on 1/11 shows that I have received some PFC frames. What that tells you is that my host is configured to do all of this, but my network is not. So the host keeps saying: hey, slow down, slow down; but the network does not understand and will not do anything about it.

So I go back to my fabric and configure QoS. I go to the Advanced tab, find my QoS configuration, and select the AI cluster QoS template. A little bit of backstory here: what I am using is NDFC, and if anybody in the audience tracks the versions, this is 12.1.2, the current Nexus Dashboard Fabric Controller release. This template is a custom template that I created. However, in the next release, 12.1.3, we will ship multiple templates for multiple hardware devices, because some of them have different queue levels, and we want to do this properly. So in the upcoming release we will provide a set of templates corresponding to all the hardware in the portfolio. I save this configuration, this intent, and then deploy it onto my switches.

Since everything we are doing is on leaf 206, I will check what configuration will be pushed down to that switch and walk you through it. Here is my QoS configuration. Anything arriving with DSCP 24 is my RoCE traffic; any data traffic coming out of the server will be RoCE traffic. Anything matched by DSCP 48 is my CNP traffic. I want to treat them differently: my data plane should be delivered in a fast and reliable way by the QoS policy, but any congestion notification should reach my sources as fast as possible, without waiting behind anything. So I put CNP traffic in a strict-priority queue, which is my queue 7.
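The classification just described can be sketched as a small lookup; the DSCP values and queue numbers are the ones quoted in the talk:

```python
# Illustrative sketch of the classification policy described above:
# DSCP 24 (RoCE data) goes to queue 3 with WRED/ECN and PFC; DSCP 48
# (CNP) goes to strict-priority queue 7. Values are from the talk.

def classify(dscp: int) -> dict:
    if dscp == 48:  # CNP: jump the line so senders slow down ASAP
        return {"queue": 7, "strict_priority": True, "pfc": False}
    if dscp == 24:  # RoCE data: lossless queue with WRED/ECN + PFC
        return {"queue": 3, "strict_priority": False, "pfc": True}
    return {"queue": 0, "strict_priority": False, "pfc": False}  # default

print(classify(24)["queue"], classify(48)["queue"])
```

The design choice is the point: data must be lossless, but the congestion signal must be fast, so the two classes get opposite treatments.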

play26:10

So if I go scroll down further in configuration, I do have on my Q three configured

play26:16

W at E C N. So uh you see over here Q three um and then

play26:23

random attack minimal threshold. So that's that yellow line,

play26:26

which I showed you in the previous diagram, then maximal threshold.

play26:30

That's the red line, lower lead red line in the diagram.

play26:33

And then the last keyword, this E C M tells me, hey, if,

play26:38

if you do detect any congestion, if you're W and detect any congestion mark all of those,

play26:43

all of those packets with E C M higher here, you see class type queuing C out uh

play26:50

eight Q Q seven priority Q level one. That's where my uh C N P pockets,

play26:56

strict priority Q going straight down to uh the sources in this case.

play27:02

So, and further down, I do have a P F C configuration for Q three.

play27:07

Um saying for the um network Q S Q three, P F C uh

play27:14

pause P F C cost three which tells a Q three needs to be paused.

play27:18

If congestion happens for P F C, um the thresholds are adjusted to serve the purpose

play27:25

and how I explained in the previous example. So uh P F C have stayed uh default,

play27:31

we have put our W thresholds fairly low in the queue.

play27:34

So any minor congestion will be mitigated by E C N and W red.

play27:39

Uh Any severe congestion will be mitigated by P F C.

Now I go to the interface. Here I have told the interface to enable PFC, which is 'priority-flow-control mode on'. Then I configure the watchdog. The watchdog is a feature we have for cases where a PFC storm happens. A PFC storm is when a malfunctioning device, which could be a host or a switch, is constantly sending pause frames. Eventually those pause frames, as I showed in the example, propagate to everybody in the network, and your network basically stalls. To prevent this, we have a watchdog: it sets a timer and checks whether any packets are stuck in a paused queue longer than the timer allows. If they are, they are cleared out of the buffer, and you prevent a PFC storm.
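The watchdog idea can be sketched like this; the interval value is hypothetical (on the switch it is a configurable timer):

```python
# Illustrative sketch of the PFC watchdog described above: if a queue
# stays paused past a timer, flush it instead of letting the pause
# propagate forever. The interval is hypothetical.

WATCHDOG_INTERVAL_MS = 100  # hypothetical; configurable on the switch

def watchdog_check(paused_ms: int, queue: list) -> list:
    """Return the queue after a watchdog pass; flushed if stuck too long."""
    if paused_ms >= WATCHDOG_INTERVAL_MS:
        return []  # drop the stuck packets, break the PFC storm
    return queue

print(watchdog_check(250, ["p1", "p2"]))  # flushed
print(watchdog_check(50, ["p1", "p2"]))   # left alone
```

Deliberately dropping a stuck queue trades a little loss for keeping the rest of the fabric from stalling, which is the watchdog's whole purpose.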

I also attach classification policies saying: for any traffic coming in over this interface, please put it in the right queue and apply QoS to it. So I deploy this configuration and go back to my hosts. What will happen here? Pay less attention to the left, more attention to the right; I will try to zoom in so this is clear. You can see that our host is generating PFC, and at this point in time we have some ECN packets being marked. Remember, I told you that ECN kicks in first, and if ECN is not able to resolve the congestion, PFC kicks in and resolves it. So what happens here? The hosts and the network both know that QoS is configured. For a moment the hosts try to gauge what is going on and then start sending traffic, and you see the ECN count continue to go up, and PFC continue to be received by the switch, generated, and sent up to the spine. And then you see here that there are no packet drops; the drop counter does not increase anymore, which tells you that PFC and ECN have done their job and prevented any loss from congestion. So we have a lossless network to offer that AI cluster.

Going back to my NDFC, I will show you that the configuration I wanted to push has been applied. I go back to leaf 206 and run 'show running-config ipqos', just as I did before. When I execute it, you see the same configuration I pushed earlier: the classification policy sorting out RoCE traffic and CNP traffic, the WRED configuration on the switch, and then the PFC configuration saying generate PFC for queue 3, along with the PFC watchdog and the classification and queuing for this leaf.

So we have a set of features present in our Nexus 9000 switches.

The first one gains you visibility into your network traffic and what is happening in the network. It is called the flow table: the system tracks every packet traversing the switch and is able to collect information such as the 5-tuple flow information, ingress interface and queue information, and flow start and flow stop times, and it will indicate whether a packet was dropped or whether a burst happened within a particular flow. The way you leverage this is to export it, a direct hardware export to a controller. What we have is Nexus Dashboard Insights, which collects all of that information, as well as information such as PFC interface counters, using streaming telemetry from the switches to bring the data into Nexus Dashboard Insights.
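The kind of record the flow table is described as collecting can be sketched as a simple structure; this is illustrative, not the actual export format, and the field values here are made up:

```python
# Illustrative sketch (not the actual export format): the kind of
# flow record the flow-table feature is described as collecting:
# 5-tuple, interface/queue, start/stop times, drop/burst indications.

from dataclasses import dataclass

@dataclass
class FlowRecord:
    src_ip: str
    dst_ip: str
    protocol: int   # e.g. 17 = UDP (RoCEv2 rides on UDP)
    src_port: int
    dst_port: int
    ingress_if: str
    queue: int
    start_ns: int
    stop_ns: int
    dropped: bool
    burst_seen: bool

# Hypothetical sample record for a RoCEv2 flow landing in queue 3:
rec = FlowRecord("10.0.0.1", "10.0.0.9", 17, 49152, 4791,
                 "Ethernet1/1", 3, 0, 5_000_000, False, True)
print(rec.dst_port)  # 4791, the RoCEv2 UDP port
```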

So with Nexus Dashboard Insights you are able, as I mentioned, to look at PFC counters and ECN counters. You can use those to tune your network, especially in the stage where you are bringing up the cluster: you can understand how the thresholds relate to each other and to congestion, which kinds of congestion they should prevent when it is more severe, and which kinds they can head off earlier. What we have done, and this is coming in a subsequent release, the August release of Nexus Dashboard Insights, is add a nice graph of interface statistics and the traffic flowing through. And if you look here, we have a congestion parameter: we account for every PFC frame transmitted, every PFC frame received, and all of the ECN-marked packets. So eventually you will be able to correlate congestion, drill down on a particular interface, understand what that interface has sent or received and how it is behaving in the network, and from that perspective manage the congestion.

So what we have done is create a blueprint explaining everything I have covered in this session, but in far more detail and with far more information, so you can go and read that document. We have also created an automation script based on Ansible. Let's say you are deploying this type of infrastructure and you have already automated the provisioning of your endpoints, your hosts, with an Ansible script. Then, when you want to deploy the network, you can use this script to deploy the network in the same way you deploy the endpoints: the script pushes the configuration to NDFC, and NDFC pushes it down to the switches, so that part is automated as well. And the last thing I want to mention is that customers have used our Nexus switches to deploy this and have their AI clusters working, performing all the functions I have explained. Everything I have mentioned in this presentation is shipping and available, except the few items I called out, so you can go and deploy this Ethernet network for your AI/ML cluster.

And I kind of feel like this next part maybe would have helped give us a little more context at the beginning. You spent a great deal of time talking about traffic control and congestion management. Could you talk to us a little about how AI clusters generate traffic that would require something like this? I mean, are they bursty? Are they just big floods? Are they sustained? Help us understand why everything you have shown us is necessary.

Good question. In a nutshell, it comes down to which algorithm you use to train your cluster; each function of the cluster will have its own characteristics and properties. Speaking generally, a lot of clusters have all-to-all communication. Let's say I have the 512 GPUs I mentioned as an example. While they are training the model, they will pull data from the storage device, the data they use for training. Once they pull it, they use something called data parallelization. Say I picked a picture of a dog, whatever the task is: each of the 512 GPUs is going to take a piece of that picture and process it. Say I am training this cluster to recognize the eyes on a dog; let's use that as the example. The picture is of the whole dog, so one GPU gets the tail, one gets a leg, one gets the belly, and one gets an eye, and they each look for the eye. Once each of them has its piece of information and looks for the eye, the majority of them will respond to the others: hey, I don't have it, which means that information is sent to everybody in the cluster. So pulling from the storage device is going to be somewhat bursty, but it is one-to-many; that is one traffic pattern. Then, once I have my piece, done my calculation, and figured out where the eye is, or that I don't have one, I update everybody else: I send a packet to everyone in the cluster. As such, you need rich connectivity between those nodes. Next, those that have recognized the eye send to everybody else: I do have an eye, so we can update the model; this is how I recognized it. The cluster comes together, concludes what the parameters for the dog's eye are, and writes that information back to the model, which again sits on a storage device. So all of the nodes come together and push that result down to a single storage device. And with that, the model update is done; the operation is complete for this particular example.
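Quick arithmetic on the all-to-all pattern just described, using the talk's 512-GPU example: one exchange round in which every worker updates every other worker is N × (N − 1) point-to-point messages:

```python
# Back-of-the-envelope on the all-to-all pattern described above:
# with N workers each updating every other worker, one exchange
# round is N * (N - 1) point-to-point messages.

def all_to_all_messages(n_workers: int) -> int:
    return n_workers * (n_workers - 1)

print(all_to_all_messages(512))  # 261632 messages per exchange round
```

That many simultaneous messages per round, many rounds per second, is why the fabric needs rich non-blocking connectivity and careful congestion management.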

So I can push a new picture and repeat the same operation, again and again, writing results down to the storage device. This operation happens many times per second, and the whole training can last multiple weeks. So why does everything I have talked about in this session matter? Imagine you are a hyperscaler doing very complex training of an AI cluster. Say your training run lasts two days, which is even a short period of time, but on the first day, after 12 hours, congestion happens in the network: you drop some amount of traffic and your cluster stops working; basically, the job falls over. Now consider the cost. Each GPU, and say you had 512 of them, worked for 12 hours, and each GPU-hour costs you, say, $20; calculate the expense you just wasted. If you have designed the network properly, you are preventing that waste, and you will be able to finish the operation cleanly, without the network getting in the way.
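The speaker's back-of-the-envelope cost works out as follows, using the numbers from the talk:

```python
# The back-of-the-envelope cost from the talk: 512 GPUs, 12 hours of
# work lost to one congestion-induced failure, at $20 per GPU-hour
# (the speaker's example rate).

gpus = 512
hours_lost = 12
cost_per_gpu_hour = 20

wasted = gpus * hours_lost * cost_per_gpu_hour
print(f"${wasted:,} of wasted compute")  # $122,880
```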