Cisco Artificial Intelligence and Machine Learning Data Center Networking Blueprint
Summary
TL;DR: Nemanja Kamenica, a technical marketing engineer at Cisco, introduces the AI/ML data center networking blueprint, a comprehensive guide to optimizing networks for AI workloads. He highlights the importance of AI in various industries, from healthcare to retail, and explains the distinction between training and inference AI clusters. The presentation delves into the technical requirements for these clusters, including high bandwidth and low latency, and showcases Cisco's Nexus 9000 switches as a solution. Through a detailed demo, Kamenica demonstrates congestion management techniques like PFC and ECN, crucial for maintaining a lossless network environment for AI clusters. The session concludes with resources Cisco provides to help customers implement these AI networking solutions.
Takeaways
- 📊 AI and ML workloads require specialized network configurations to handle their unique data traffic patterns efficiently.
- 💻 There are two main types of AI clusters: distributed training clusters for model training, which require high bandwidth and low latency, and production inference clusters for model deployment, which prioritize real-time responses and high availability.
- 📦 On-premises AI clusters offer full control and constant availability, whereas cloud-based solutions provide flexibility and scalability with cost considerations.
- 🛠 Key challenges in building AI clusters include managing rapidly doubling data volumes and models, necessitating scalable infrastructure to maintain performance and accuracy.
- 📚 Cisco's data center networking solutions, particularly the Nexus 9000 series, are designed to support AI/ML workloads with low latency, high throughput, and efficient node-to-node communication.
- 🔧 RDMA over Converged Ethernet (RoCEv2) is critical for efficient AI cluster operations, enabling direct memory access between nodes to reduce latency and increase throughput.
- 🚨 Network configurations for AI workloads must be non-blocking, lossless, and capable of handling congestion through mechanisms like Priority Flow Control (PFC) and Explicit Congestion Notification (ECN).
- 📡 Effective congestion management is crucial to prevent data loss and ensure continuous operation, especially in distributed AI training scenarios where data synchronization between nodes is constant.
- 📱 Cisco's Nexus Dashboard Fabric Controller and other tools offer advanced features for network visibility, congestion management, and QoS configuration, aiding in the deployment and management of AI clusters.
- 💻 Custom automation scripts and templates provided by Cisco can streamline the configuration of networks for AI/ML clusters, aligning with the automation of endpoint provisioning.
Q & A
What is the significance of AI/ML in modern data center networking according to Cisco's blueprint?
-The significance of AI/ML in modern data center networking lies in optimizing network configurations to support the unique demands of AI/ML workloads, such as high bandwidth, low latency, and lossless data transport, which are crucial for efficient model training and inference.
What are the two major types of AI clusters mentioned in the transcript?
-The two major types of AI clusters mentioned are the distributed training cluster, used for training AI models, and the inference cluster, used for applying the trained models to new data.
Why is high node-to-node communication important in a distributed training cluster?
-High node-to-node communication is crucial in a distributed training cluster because it enables the rapid exchange and processing of data samples between nodes, which is essential for updating and refining AI models efficiently.
What are the key network requirements for a distributed training cluster?
-The key network requirements for a distributed training cluster include high bandwidth and low latency to facilitate quick data exchange and computation among nodes, which leads to faster model training times.
What does a production inference cluster require from the network?
-A production inference cluster requires the network to be real-time and highly available to handle numerous user requests simultaneously without significant delays.
What are the benefits and challenges of deploying AI clusters on-premises versus in the public cloud?
-Deploying AI clusters on-premises offers full control, data security, and constant availability but requires significant infrastructure and management. In contrast, public cloud deployment offers flexibility and scalability but can lead to increased costs and concerns over data security.
Why is network congestion management critical in AI/ML clusters?
-Network congestion management is critical in AI/ML clusters to avoid data loss and ensure efficient communication between nodes, which is essential for the accurate and timely training and inference of AI models.
What are RoCEv2 and RDMA, and why are they important for AI/ML workloads?
-RoCEv2 (RDMA over Converged Ethernet version 2) and RDMA (Remote Direct Memory Access) are technologies that provide low latency and high throughput data transport. They are crucial for AI/ML workloads to enable fast and efficient data transfer between nodes in the network.
How does Cisco's Nexus 9000 series support AI/ML data center requirements?
-Cisco's Nexus 9000 series supports AI/ML data center requirements by providing low latency, high bandwidth, and advanced features such as flow control and congestion management, which are essential for handling the demands of AI/ML workloads.
What role does quality of service (QoS) play in managing AI/ML network traffic?
-Quality of Service (QoS) plays a critical role in managing AI/ML network traffic by prioritizing different types of traffic, ensuring that important data, such as AI model updates, are transmitted quickly and reliably across the network.
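As a concrete sketch of this idea, NX-OS classification for AI traffic might look like the following. The class names, interface, and qos-group numbers are illustrative assumptions; only the DSCP values (24 for RoCE data, 48 for CNP) come from the demo later in this session.

```
class-map type qos match-all ROCEv2
  match dscp 24
class-map type qos match-all CNP
  match dscp 48
policy-map type qos QOS_CLASSIFICATION
  class ROCEv2
    set qos-group 3
  class CNP
    set qos-group 7
interface Ethernet1/1
  service-policy type qos input QOS_CLASSIFICATION
```

Mapping RoCE and CNP traffic into separate qos-groups lets the egress queuing policy treat them differently, for example giving CNP a priority queue so congestion feedback reaches senders quickly.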
Outlines
🔍 Introduction to AI/ML Data Center Networking
Nemanja Kamenica, a technical marketing engineer at Cisco, introduces the AI/ML data center networking blueprint, a guide designed to optimize network infrastructure for AI/ML workloads. He explains the significance of AI in various industries today and its potential future impact. Nemanja outlines the importance of creating efficient networks for AI clusters, mentioning two types of AI clusters: distributed training clusters, which require high bandwidth and low latency for node-to-node communication, and inference clusters, designed for real-time responses and high availability. He emphasizes the unique network requirements of each cluster type.
🌐 Network Requirements for AI Clusters
This section delves into the specific network requirements for effectively running AI clusters, focusing on the need for a scalable, real-time, and always available network infrastructure. Nemanja discusses the benefits of building clusters on-premises for control and data sovereignty, as well as the flexibility and scalability offered by public cloud solutions. He highlights the challenges of scaling infrastructure to match the rapid growth of AI models and data, stressing the importance of a robust network, ample compute resources, and efficient data storage and management.
🚀 High-Performance Networking with Cisco Nexus
Nemanja showcases Cisco's Nexus 9000 switches, ideal for AI/ML clusters due to their low latency and high bandwidth capabilities. He explains the role of RDMA over Converged Ethernet (RoCEv2) in providing efficient, low-latency communication between nodes, crucial for distributed AI model training. This technology enables direct memory access, bypassing CPU processing for faster data exchange. The Nexus product line, with its versatile operating systems and high-speed interfaces, is presented as a solution for the demanding requirements of AI networking.
🛠️ Advanced Congestion Management Techniques
This part explains Cisco's advanced congestion management solutions, including Priority Flow Control (PFC) and Explicit Congestion Notification (ECN), to maintain network performance and avoid data loss during high-traffic conditions. Nemanja uses an example to illustrate how these mechanisms work together to manage congestion, ensuring a lossless Ethernet environment. This segment underscores the importance of configuring the network to handle the specific demands of AI workloads, preventing congestion and ensuring efficient data flow.
🔧 Demonstrating Network Configuration for AI Clusters
Nemanja presents a practical demonstration of configuring a network for AI clusters, focusing on Quality of Service (QoS) settings to manage traffic priorities and handle congestion. He explains the setup of a network with multiple leaf switches and hosts, detailing how traffic is managed to prevent congestion using Nexus Dashboard Fabric Controller. This demonstration shows the importance of proper QoS configuration in maintaining network performance and ensuring smooth operation of AI workloads.
💡 Deploying and Managing AI Network Configurations
In this section, Nemanja outlines the process of deploying QoS configurations to manage AI cluster traffic efficiently, focusing on differentiating between RoCE and CNP traffic for optimal delivery. He introduces a watchdog feature to prevent PFC storms, which can cause network stalls. This part highlights the steps to apply these configurations across the network, demonstrating Cisco's commitment to providing tools and templates for easy network management and ensuring high performance for AI clusters.
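On NX-OS, the PFC watchdog mentioned here is enabled globally and per interface; a hedged sketch (the interface name is an assumption, and interval tuning options vary by platform):

```
! Enable the PFC watchdog so a stuck pause condition (a PFC storm)
! cannot stall a no-drop queue indefinitely
priority-flow-control watch-dog-interval on
interface Ethernet1/11
  priority-flow-control mode on
  priority-flow-control watch-dog-interval on
```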
📊 Monitoring and Optimizing Network Performance
Nemanja discusses the capabilities of Cisco's Nexus Dashboard Insights for monitoring and optimizing network performance for AI clusters. He explains how the system tracks and reports on network traffic, congestion, and device performance, providing valuable data for fine-tuning the network. This monitoring tool, combined with advanced features like PFC and ECN, allows network administrators to manage congestion effectively and maintain optimal conditions for AI workloads.
🚀 Real-World Application and Future Directions
Concluding the presentation, Nemanja reflects on the practical application of Cisco's networking solutions in real-world AI clusters, emphasizing the availability of the discussed technologies and their importance in efficient AI operations. He addresses a question about the nature of AI traffic, explaining that it can vary from bursty to sustained, depending on the algorithm and task at hand. He stresses that proper network design is crucial to avoid costly downtime and ensure successful AI model training and inference.
Keywords
💡AI/ML Data Center Networking Blueprint
💡Distributed Training Cluster
💡Inference Cluster
💡RDMA over Converged Ethernet (RoCEv2)
💡Priority Flow Control (PFC) and Explicit Congestion Notification (ECN)
💡Nexus 9000 Switches
💡Lossless Ethernet
💡Congestion Management
💡Data Parallelization
💡AI Model Training and Inference
Highlights
Nemanja Kamenica introduces the AI/ML data center networking blueprint by Cisco, aimed at optimizing network creation for AI/ML workloads.
The importance of AI in various industries such as healthcare, finance, public sector, and entertainment is emphasized, highlighting AI's broad impact.
Discussion on two types of AI clusters: distributed training clusters for training AI models and production inference clusters for deploying trained models.
The significance of infrastructure in AI cluster performance, focusing on the need for high bandwidth, low latency transport for distributed training clusters.
The option for enterprises to create AI clusters on-premises or in the public cloud, each with distinct advantages and considerations.
Challenges in scaling infrastructure to keep up with rapidly doubling model sizes and data, emphasizing the need for large networks with extensive GPU and CPU power.
The critical role of the network in AI clusters, requiring non-blocking transport, lossless Ethernet, and features like PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) for congestion management.
Introduction to Cisco's Nexus 9000 switches as a solution for AI/ML clusters, offering low latency, high bandwidth forwarding.
Explanation of RoCEv2 (RDMA over Converged Ethernet version 2) technology and its benefits in providing low latency, high throughput transport for AI workloads.
Detailed discussion on congestion management techniques using ECN and PFC to maintain a lossless network, crucial for AI cluster efficiency.
Demonstration of QoS (Quality of Service) configuration and its impact on managing congestion in a network, showing practical application of theoretical concepts.
Insights into the behavior of AI clusters, highlighting the bursty nature of AI traffic and the necessity of a well-designed network to handle such patterns.
The economic implications of network congestion on AI training tasks, emphasizing the cost efficiency of a well-managed network infrastructure.
Future updates and tools for AI/ML network management, including custom QoS templates and automation scripts for easier deployment.
A case study on a customer's successful deployment of an AI cluster using Cisco's networking solutions, validating the blueprint's effectiveness.
Transcripts
My name is Nemanja Kamenica and I'm a technical marketing engineer at Cisco.
I'm here to talk about the AI/ML data center networking blueprint.
It's a set of material we created to explain how you can build the network in the best
possible way for your AI/ML workloads.
What I have for today's session is why AI is important today and
what it will bring in the future, and then an explanation of how you should build the networks for your
AI cluster.
I'm going to show a brief demo of everything I talk about up to that point.
And then I'm going to show you the set of collateral we have created
to enable you, as a customer, to do this.
So first let's talk about AI and what it can do today and what it will do in the
future. There is a set of use cases and industries
that will benefit from AI.
You have all seen ChatGPT; you have probably asked it some serious questions,
some silly questions, and it always responds to you.
So that's one way to entertain yourself. But there are other industries that will
benefit from it. For example, healthcare will be able to do
medical research. In financial services,
they might be able to advance trading algorithms and do more trading using AI.
In the public sector,
they might be able to optimize public transport paths and
get more users onto those paths and onto public transport in cities. In media and
entertainment, creating subtitles and doing translation using AI is a great
help. In manufacturing, finding
flaws in the product will be enabled by AI, in a
scalable way. And finally in retail, personalized
recommendations for any retail offering could be done with AI.
So the capabilities and use cases are vast and wide, and as such AI will be
deployed by many enterprises.
So when we talk about enterprises deploying this, there are two major types of AI
clusters. We have something we call a distributed
training cluster, which is basically a cluster that trains your AI model.
Imagine that you want an AI model that recognizes a particular color.
I'm oversimplifying this, but let's say it distinguishes colors.
You give it a set of color samples, it works through that task,
and eventually it says, hey, this is red, this is pink,
this is yellow. What that cluster requires from
the infrastructure is that node-to-node communication in this case is high.
The reason is that the nodes take those samples, do a calculation,
exchange the results, average them, and then update the model.
All of that node-to-node exchange
happens over the network.
As such, the network is required to provide high-bandwidth,
low-latency transport. In this case,
the key metric for this type of cluster is
training time: the shortest possible time to train
your model. What does that require from the infrastructure?
A large network with a lot of GPU and CPU power.
So that's our distributed training cluster.
There is another kind of cluster.
It comes into play when you have finished training
your model: the production inference cluster.
Now that you've trained the model to recognize a red color,
you deploy that model, and users can go and ask it: hey,
what is this color, or is this a red color?
So it's a different type of cluster doing a different task.
What is required from this cluster is to be real time and highly available.
Any time any user asks, you need to provide an answer.
Obviously, you're not going to have a single user doing this.
You will have hundreds or maybe thousands of users querying the model.
As such, you might not need a lot of bandwidth, but you do need
high availability of that cluster.
So what is required from the network?
It's a smaller network with fewer devices, but it has to be real time.
It has to be always available.
So how do you create those clusters? First,
you can create the cluster on-prem, in your own data center, to serve your own
enterprise needs.
The benefit of this is that the cluster is always available,
so somebody can always be using it.
You also have the flexibility to size the cluster
however you want, and different parts of the enterprise can time-share
it, using the cluster at different times. And
all your data is stored on-prem, so you do not export that data or
send it anywhere. It stays in your cluster.
The other way to do this, if you're an enterprise, is in a public cloud.
All of the public clouds have an offering that lets you
run your AI/ML training cluster there.
Flexibility is the benefit: you get
a number of instances with a particular bandwidth,
and you pay only when you use them.
But as you grow your cluster,
your cost grows. That might be fine for
a certain time, or it might be fine forever.
There are different use cases, and they can be solved differently.
So let's say you are an enterprise and you have decided to build a cluster on-prem.
You're going to build your own data center
that will run AI/ML workloads.
What are the key challenges you need to understand before you do that?
The first is that the model you're creating, and the data that model
uses, will double roughly every two months. As such,
you will need to scale — to keep improving your infrastructure to be able
to keep up with that model.
What a bigger model gives you is more accuracy.
If you're training for something complex, it will probably take time to train that
model; if you increase the size of the cluster,
your training time goes down, and so on.
However, today most training happens on cluster sizes of
up to 512 GPUs. So if you're starting out,
plan for something on the order of 512 GPUs or higher.
The next challenge you need to solve is the key components you need
in your cluster. You need compute nodes,
which contain the GPUs and CPUs.
You also need a network, which obviously has to provide
transport between all of those nodes.
Then you need some kind of storage system that holds the data and
feeds the training data into the cluster,
and later collects the results from the cluster back onto a storage device.
Then there are two software components. The first is the job scheduling and
orchestration component, which basically tells you which GPUs you have, where they are,
and which of them will take part in a given job. Some jobs
may require a lower-latency path, and as such
you might want those GPUs on the same switch or
on nearby switches; others may not require low latency but need
higher scale, so you might distribute them as well.
The next component is the software framework for the AI model —
how your AI model runs and how it is trained —
and that is something you can buy. Since we are here talking about data center networking,
the focus of this presentation will obviously be on the network.
So let's look at what the network for an AI cluster has to provide,
or what kind of requirements are put on the network.
When we talk about AI training networks, there are two requirements or two components.
The first is the transport you provide.
In this case, this is RoCEv2.
I don't know if any of you know what RoCEv2 stands for: RDMA over Converged Ethernet
version 2. I'll explain what that means in just a moment.
As a technology, it provides low-latency, high-throughput transport from the
endpoints, and it does have some requirements for the network. On the network
side, in the Nexus 9000 product line we have a
set of switches that provide low-latency, high-bandwidth
forwarding — up to 25.6 Tbps in our
equipment — and they can satisfy all of the needs
of an AI/ML cluster.
Just a brief overview — I'm pretty sure you know this,
but in case anybody missed it, the Nexus 9000 is the data center portfolio from Cisco.
It's our flagship data center networking line.
It supports speeds from 100 Mbps up to 400 Gbps.
And it runs two operating systems:
it has a standalone (NX-OS) mode and it has an ACI mode as well,
so you can run whichever you choose.
All of the switches are based on Cisco silicon,
so there are certain features enabled particularly on this product line that
we use here.
So let's go to the next component, which is RoCEv2 — RDMA over
Converged Ethernet, as I mentioned.
If you look at the bottom packet in this picture, you see that it has a
Layer 2 header and an IP header — meaning source MAC,
destination MAC, source IP, destination IP — and then a UDP header.
After that, the whole transport is InfiniBand. So you essentially have InfiniBand encapsulated
in an Ethernet packet, or frame.
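The encapsulation described above can be pictured roughly like this (RoCEv2 is carried in UDP with well-known destination port 4791, with the InfiniBand Base Transport Header and payload inside):

```
+-----------------+---------------+------------------+-------------+------------+-----+
| Ethernet header | IP header     | UDP header       | InfiniBand  | InfiniBand | FCS |
| (src/dst MAC)   | (src/dst IP)  | (dst port 4791)  | BTH         | payload    |     |
+-----------------+---------------+------------------+-------------+------------+-----+
```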
What that gives you, because it's RDMA-based transport,
is low latency and high throughput from the host itself.
And because of this packet format, you can route it,
you can switch it, you can do whatever you want with it, and the network will forward it.
An Ethernet network will forward this packet.
Here is how this works in a high-level, end-to-end system.
If you have an AI cluster using GPUs, RoCEv2 allows direct
memory access and direct memory-to-memory communication between the
GPU compute nodes.
With RDMA, the NIC is capable of going directly into
GPU memory, getting that information, putting it on the network,
and sending it to wherever it needs to go next. The receiving NIC takes it off the wire and puts it into GPU
memory. That's how they exchange data. The benefit is
that you are bypassing the network stack and the kernel,
and as such you get low-latency, high-throughput transport from the endpoint itself.
So what is required from the network to provide matching capabilities?
In this case, we use RoCEv2 in the network for end-to-end
communication, and what is required from the network is non-blocking transport.
You have to create a non-blocking network.
Then you have to create lossless Ethernet.
You create lossless Ethernet using Priority Flow Control (PFC) and
Explicit Congestion Notification (ECN).
And I'm going to get into those on my next slide.
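A minimal NX-OS sketch of the lossless pieces described here — declaring one no-drop class and enabling PFC on the ports. The class name, CoS value, and interface are assumptions, not taken from the session:

```
! Make qos-group 3 (the RoCE class) no-drop by pausing its CoS
policy-map type network-qos QOS_NETWORK
  class type network-qos c-8q-nq3
    pause pfc-cos 3
    mtu 9216
system qos
  service-policy type network-qos QOS_NETWORK
! Enable PFC on the host- and fabric-facing interfaces
interface Ethernet1/1
  priority-flow-control mode on
```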
Here is a brief overview and explanation of how this all works.
In this setup we have configured both ECN and PFC
to manage congestion.
In this network, I have three hosts:
host A, host B, and host C.
I have three leaf switches and a spine switch here.
Each switch has been configured to support WRED ECN and to
support PFC.
For WRED ECN, we have set the two thresholds at the bottom,
the yellow one and the red one; the yellow one is the WRED minimum threshold.
Just to explain that part as well: every time
buffer utilization goes over the minimum threshold, the system will start marking
IP packets with the ECN congestion-experienced
bits. The ECN bits are part of the IP packet, in the ToS field —
the two least significant bits —
just so we are clear. Then we have two PFC thresholds,
the green line and the higher red line.
Basically, if WRED ECN doesn't resolve the congestion,
PFC will kick in and resolve it.
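The two WRED thresholds described here map to an NX-OS queuing policy like the following sketch. The numeric thresholds and bandwidth share are illustrative assumptions; real values depend on the platform's buffer sizes:

```
policy-map type queuing QOS_EGRESS
  class type queuing c-out-8q-q3
    bandwidth remaining percent 60
    ! Start ECN-marking packets once the queue passes the minimum
    ! threshold, with rising probability up to the maximum threshold
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
```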
So in my example, host A and host B are talking to host
C. They are sending packets
sourced and destined appropriately, from each host to host C.
And you can see that they are carrying the ECN-enabled bits:
they are carrying ECN 10. This tells the network:
hey, these packets are ECN capable; if you experience congestion, you may mark them
with bits 11.
In this case, we congest the port that goes to host C, and you can see that packets
arrive with 10, reach leaf 3, and are marked with 11 there.
Host C sees this and says: hey, there was congestion somewhere on the path.
What happens next is that host C generates a CNP — a congestion notification
packet —
and tells both hosts, host A and host B, that they should slow down.
So you see that the source is host C and the destinations are host B and host A.
Both of them receive the CNP. If this is,
let's say, mild congestion — not a lot of traffic,
just a temporary thing —
WRED ECN alone will solve the congestion. At this point
host A and host B slow down, and we're done with the congestion.
This gives the system the lowest latency overhead,
and the system is able to cope
with the congestion and resolve it.
However, let's say that host A and host B in this example
haven't slowed down. They keep pumping a lot of traffic into
the network. What will leaf 3 do?
In this case, it keeps buffering and buffering, and eventually it hits the PFC
threshold. Once it hits the PFC threshold,
it stops sending any traffic down to host C —
hey, I'm not going to pump any more traffic to you — and it generates a PFC (Priority Flow
Control) pause frame up to spine 1.
So now there is no traffic coming from spine 1 either.
Spine 1 stops sending traffic and keeps buffering the traffic
coming in from host A and host B. Eventually it crosses its WRED thresholds
and marks ECN, but that does not mitigate the congestion.
So it crosses its PFC thresholds and starts sending PFC down to leaf 1 and leaf
2. We have per-hop behavior:
leaf 3 sends to spine 1, spine 1 sends to leaf 1 and leaf 2,
and leaf 1 and leaf 2 start buffering. At this point
they cross their PFC thresholds and eventually send pause frames down to the
hosts. Could there be some buffer overlap?
There could be. The whole system is built
so that both hosts stop, because both could be contributing to the congestion.
In reality, I have simplified this example a lot.
A PFC frame might not reach both hosts, because
they might not be contributing the same amount of traffic to
this congestion. Some of them may receive no pause
frames or ECN-marked packets at all, and some may receive a lot.
So there can be that distribution between them.
So in this case, host A and host B have
stopped, and we have mitigated the congestion. At this point
in time, the buffers drain, and eventually we re-establish the full flow of traffic in
this case. The whole point is that no packet has
been dropped, and none of the
data has been wasted in this situation.
So now I'm going to show you the demo of these congestion management
mechanisms with PFC and ECN.
But before I do that, let me explain the topology and what is happening in the demo.
I have a network that has been built in the system:
six leaf switches and five hosts, and on leaf
206 I have attached a storage device.
Let's say that my hosts 1 through 5 have finished training and want to write
something to that storage device, so all of them start sending to the storage
device at the same time.
The storage device is connected to the leaf switch on
interface 1/11 — this is important for the demo.
All my hosts are connected to the network with 100 Gbps connections,
and my first four leaves are connected to the spines with 100 Gbps links.
Then I have 400 Gbps links from the spines down to leaf 205 and leaf 206.
So when hosts 1 through 5 start sending and writing that information down
to the storage device, what will happen is that
they send to the spines, and the spines send everything to leaf 206,
congesting that port 1/11.
What I have here is the configuration management tool from the
data center group, called Nexus Dashboard Fabric Controller, which helps me configure the
network end to end.
Here is the fabric I showed you earlier.
It has those six leaves and those three
spines, with all the connections present,
as I mentioned. Everything —
the main action — will be on this leaf 206.
So I go to my fabric, look at the switches, and find my leaf 206.
And what I'm going to do on that leaf is show you the QoS configuration.
I run show running-config ipqos to look at the running configuration for
QoS, and when I execute this command, there is none.
So in the first line, which I'll highlight right now,
you can see the system has absolutely no QoS configured.
So let's go to our hosts. I'm going to show you how they react
to this, as well as how the switch reacts.
Give me one second, I'm just going to zoom in so we can see everything that's going on.
Here I have my five hosts on the left.
You will see that they are writing something; the "w" means I'm writing to my
storage device, and you see that the bandwidth is fluctuating.
At any point in time their rates are all over the place, from zero to thousands of megabits per second.
They're writing at the same time, but very inconsistently, because of what is happening on
my switches.
Here at the bottom, I can see that the switch is constantly dropping packets: my
switch 206, on that interface 1/11, is constantly dropping some of the packets.
Why is this happening? I don't have any QoS; I don't generate any
ECN packets and I don't generate any PFC packets.
And this number over here on 1/11 shows that I have received some PFC
packets. What that tells you is that my host is
configured to do all of this, but my network is not.
So the host will be saying, hey, slow down, slow down, but the network doesn't understand,
so it will not do anything about it.
So I'm going back to my fabric, and I'm going to configure QoS.
I'll go to the Advanced tab, find my QoS configuration, and
select the AI cluster QoS template. A little bit of story about this:
what I'm running here is NDFC 12.1.2, the current Nexus Dashboard Fabric Controller release,
and this template is a custom template which I have created.
However, in the next release, 12.1.3,
we will have multiple templates configured for multiple hardware devices, because some of
them may have different queue levels, so this can be done in a proper way.
In the upcoming release,
we will create a set of templates corresponding to all the hardware we have in the
portfolio. So I go and save this intent,
and then I deploy that configuration onto my switches.
Since everything we are doing is on leaf 206,
I'm going to check what configuration we will push down to that switch, and I'm going to
explain it to you.
Here is my QoS configuration.
Anything coming in with DSCP 24 is my RoCE traffic;
any data traffic coming out of the server will be RoCE traffic.
Anything matched by DSCP 48 is my CNP (congestion notification packet) traffic.
I want to treat them differently: my data plane will be delivered in
a fast and reliable way with QoS, but I want congestion notifications
to reach my sources as fast as possible, without waiting for anybody.
So I'm going to put them in a strict priority queue, which is my queue 7.
If I scroll down further in the configuration, I have WRED with ECN configured on my
queue 3. You see queue 3 over here, then the
random-detect minimum threshold, which is that yellow line
I showed you in the previous diagram, then the maximum threshold,
which is the red line in the diagram.
And the last keyword, ecn, says: if
WRED detects any congestion, mark all of those
packets with ECN instead of dropping them. Higher up, you see class type queuing
c-out-8q-q7, priority level 1. That's where my CNP packets go: a
strict priority queue, heading straight down to the sources.
And further down, I have the PFC configuration for queue 3.
Under network-qos, for queue 3, "pause pfc-cos 3" tells the switch
that queue 3 needs to be paused if congestion happens.
For PFC, the thresholds are adjusted to serve the purpose
I explained in the previous example: the PFC thresholds stayed at their defaults,
and we have put our WRED thresholds fairly low in the queue.
So any minor congestion will be mitigated by ECN and WRED,
and any severe congestion will be mitigated by PFC.
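Put together, the policy described above looks roughly like the following NX-OS configuration. This is a hedged sketch, not the exact NDFC template: the class-map and policy-map names and the WRED threshold values are illustrative, but the structure (DSCP 24 RoCE traffic into queue 3 with WRED ECN, DSCP 48 CNPs into strict-priority queue 7, PFC pause on CoS 3) follows what is walked through in the demo:

```
! classify RoCEv2 data traffic and congestion notification packets
class-map type qos match-all ROCE-TRAFFIC
  match dscp 24
class-map type qos match-all CNP-TRAFFIC
  match dscp 48
policy-map type qos QOS-CLASSIFICATION
  class ROCE-TRAFFIC
    set qos-group 3
  class CNP-TRAFFIC
    set qos-group 7

! queue 3: WRED with ECN marking between the low and high thresholds
! queue 7: strict priority, so CNPs are never queued behind data
policy-map type queuing QOS-EGRESS
  class type queuing c-out-8q-q3
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
  class type queuing c-out-8q-q7
    priority level 1

! make queue 3 the lossless class: pause it with PFC under severe congestion
policy-map type network-qos QOS-NETWORK
  class type network-qos c-8q-nq3
    pause pfc-cos 3
    mtu 9216

system qos
  service-policy type network-qos QOS-NETWORK
  service-policy type queuing output QOS-EGRESS
```

The key design point is the ordering: the WRED thresholds sit low in the queue so ECN marking fires on minor congestion, while PFC, with its default thresholds higher up, only pauses the queue when ECN alone cannot drain it.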
Now I go to the interface, and here I have told the interface:
enable PFC, which is priority-flow-control mode on.
Then I configure the watchdog. The watchdog is a feature we have
that serves cases where a PFC storm happens.
A PFC storm is when you have a malfunctioning device,
which could be a host or a switch, that is constantly sending pause frames.
Eventually those pause frames, as I showed in the earlier example, are propagated to everybody in the
network, and your network basically stalls.
To prevent this, we have a watchdog that sets a timer and checks whether
any packets stay in the queue longer than the timer. If they do,
they are cleared out of the buffer, and a PFC storm is prevented.
I also attach classification policies, saying: any traffic coming in
over this interface,
please put it in the right queue and apply QoS to it.
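On the interface itself, the three pieces just mentioned, PFC on, the watchdog, and the classification policy, would look roughly like this. Again this is an illustrative sketch: the interface number matches the demo narrative and the policy name is a placeholder, not a verified config dump:

```
interface Ethernet1/11
  ! enable priority flow control on the port
  priority-flow-control mode on
  ! flush stuck frames so a PFC storm cannot stall the fabric
  priority-flow-control watch-dog-interval on
  ! classify incoming traffic into the right queues
  service-policy type qos input QOS-CLASSIFICATION
```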
So I'm going to deploy this configuration, and then go back to
my hosts. What will happen here?
Pay less attention to the left, more attention to the right.
I'll try to zoom in so this is clear.
You see that our host is generating PFC, and at
this point in time we have some ECN packets being marked.
Remember, I told you that ECN kicks in first, and then, if ECN is not capable of resolving the
congestion, PFC kicks in and resolves it.
So what happens here? The host and the
network both know that QoS is configured.
In this case, for a moment, the hosts will try to gauge what's going on and then
start sending traffic. You will see that ECN marking continues
to go up, and that PFC continues to
be received by the switch and generated and sent toward the spine.
And then you see over here that there are no packet drops present; the drop counter
doesn't increase anymore, which tells you that PFC and ECN have done their job
and prevented any loss from congestion.
So we have a lossless network to provide for that AI cluster.
So I'm going back to my NDFC, and I'm going to show you
that the configuration I wanted to push has been applied.
I go back to leaf 206 and run show running-config ipqos, similar to what I
did before.
When I execute it, you will see the same configuration I pushed earlier.
We have the classification policy sorting out RoCE traffic and
CNP traffic. We have the WRED configuration on the switch.
And then we have the PFC configuration saying: generate PFC for queue
3, and configure the PFC watchdog as well,
plus classification and queuing QoS for this leaf.
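For this verification step, the show commands involved would be along these lines (the command names are standard NX-OS; exact output formats vary by release):

```
! dump the QoS-related running configuration, as in the demo
show running-config ipqos
! per-queue counters, including WRED/ECN marks and drops on the congested port
show queuing interface ethernet 1/11
! PFC pause frames sent and received per interface
show interface priority-flow-control
```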
So we have a set of features present in our
Nexus 9000 switches.
The first one gains you visibility into your network traffic and what's happening in the
network. That feature is called the flow table: the system
tracks every packet traversing the switch and is able to
collect information like the 5-tuple, flow information, interface and queue information,
and flow-start and flow-stop times,
and it will indicate whether a packet was dropped or whether a
burst happened on a particular flow.
The way you leverage this feature is to export the records,
directly from hardware, into a controller.
What we have is Nexus Dashboard Insights, which collects all of that information, as
well as information like PFC interface counters,
using streaming telemetry from the switches to import that information into Nexus
Dashboard Insights.
With Nexus Dashboard Insights you are able to, as I mentioned,
look at PFC counters and ECN counters.
You can use those to tune your network,
particularly in the stage where you are establishing the cluster.
You can understand how those thresholds relate to each other and
how they relate to congestion: which severe congestion they should prevent,
and which congestion they can catch earlier.
What we have done, and this is coming in a subsequent release,
the August release of Nexus Dashboard Insights,
is a nice graph of interface statistics and
the traffic going through. And if you look here,
we have a congestion parameter.
We account for every PFC packet transmitted,
every PFC packet received, and any ECN-marked packets.
So eventually you will be able to correlate congestion, drill down on a
particular interface, and understand what that interface has sent or received
and how it's behaving in the network. From that perspective,
you will be able to manage that congestion.
So what we have done is create a blueprint explaining everything
I have told you in this session,
but going into way more detail and giving you way more information, so you can
go and read this document. We have also created an automation
script based on Ansible.
Let's say you are deploying this type of infrastructure,
and you have already automated the provisioning of your endpoint hosts
with an Ansible script.
Then, when you want to deploy the network, you can use this script to deploy the
network the same way you deploy the endpoints.
The script pushes the configuration down to NDFC, and NDFC
pushes it down to the switches, so that part is automated as well.
And the last thing I want to mention is that we have customers who have
used our Nexus switches to deploy this, and who have their AI clusters working
and performing all the functions I have explained.
So everything I have mentioned in this presentation is
shipping and available, except the few items I called out as upcoming.
You can go deploy this Ethernet network for your AI/ML cluster today.
I kind of feel like this maybe would have helped give us a little bit more context
in the beginning. You spent a great deal of time talking about
traffic control and congestion management. Could you talk to us a little bit about how
AI clusters generate traffic that would require something like this?
I mean, are they bursty? Are they big floods?
Are they sustained?
Help us understand why all of what you've shown us is necessary.
Good question. In a nutshell, it comes down to what
algorithm you use to train your cluster.
Each function of the cluster will have its own capabilities and properties.
Speaking generally, a lot of clusters have all-to-all
types of communication. Which means, let's say I have the
512 GPUs I mentioned as an example.
While they are training the model,
they are basically going to pull information from the storage device: the
data they use to train on.
Once they pull it, they do something called data parallelization.
Let's say I picked a picture of a dog, whatever it is.
Each of the 512 GPUs is going to get a piece of that
picture and process it. Say I'm training
this cluster to recognize the eyes of a dog,
just as an example. The picture is of a whole dog,
so somebody will get the tail, somebody will get a leg, somebody will get the belly, and somebody
will get an eye, and they're going to try to find that eye.
Once each of them gets its piece of the data,
they look for the eye, and the majority of them will respond to each other:
hey, I don't have it. Which means that information is sent to everybody in the
cluster. So, definitely,
pulling from the storage device is going to be somewhat bursty,
but that's a one-to-many traffic pattern.
Then, once I have my piece and have done my calculation, I've figured out where the eye
is, or that I don't have an eye, and I update everybody else:
I send a packet to everybody in the cluster.
As such, I need rich connectivity between those nodes as well.
The next point is that those who have recognized the eye will send to
everybody else: hey, I do have an eye, so we can update the
model, and this is how I recognized it. The cluster
will come together and conclude
where the eye is and what the parameters for that eye of the dog are.
And it's going to go back and write that information to the model, which again lives on a
storage device. So all of them come together and write, pushing
that information down to that single storage device.
With that, the operation is done for this particular example.
So I can push a new picture and repeat the same operation again and again,
writing back to the storage device each time. This operation happens many times per second,
and the whole training can last multiple weeks.
So how does everything I have talked about in this session matter?
Imagine you are a hyperscaler doing very
complex training of an AI cluster.
What will happen? Say your training lasts
two days, even a short period of time, but at some point on the
first day, after 12 hours, congestion happens in the network.
You drop some amount of traffic and your cluster stops
working. Basically, the job falls over.
Now think about the cost of each GPU: let's say you had
512 of them, they worked for 12 hours, and each hour costs you,
say, $20. Calculate the expense you just
wasted. If you have designed the
network in a proper way, you are basically preventing
that wasted money, and you will be able to finish that operation
in a proper way, without the network getting in the way.
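Using the speaker's illustrative numbers, the wasted spend from one congestion-induced failure is easy to put a figure on:

```
512 GPUs × 12 hours × $20 per GPU-hour = $122,880
```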