What Can I Get You? An Introduction to Dynamic Resource Allocation - Freddy Rolland & Adrian Chiris

CNCF [Cloud Native Computing Foundation]
25 Jun 2023 · 29:25

Summary

TL;DR: In this video, software engineers Freddy Rolland and Adrian Chiris from Nvidia's cloud operations team discuss Dynamic Resource Allocation (DRA), a new Kubernetes API for resource management. They cover the resources available to workloads, the limitations of the device plugin framework, and introduce the Container Device Interface (CDI). The talk walks through Kubernetes resource allocation, including CPU, memory, storage, and device plugin resources, and explains DRA's benefits, such as sharing resources, handling unlimited resources, and providing configuration flexibility. The presentation also outlines the process of building a DRA driver, the role of CDI in exposing devices to containers, and concludes with a Q&A session.

Takeaways

  • 😀 Dynamic Resource Allocation (DRA) is a new alpha API for requesting resources in Kubernetes, presented by engineers whose day-to-day work is enabling networking technologies in Kubernetes.
  • 🔧 Kubernetes can allocate various resources for different workloads, including CPU, memory, storage, and device plugin resources like GPUs.
  • 📈 The device plugin framework has limitations, such as the inability to share resources and the lack of advanced configuration options.
  • 🚀 The DRA API addresses these limitations by providing a more flexible and vendor-controlled approach to resource allocation.
  • 💾 Storage options in Kubernetes include scratch space for temporary data and persistent storage solutions like NFS mounts and CSI (Container Storage Interface).
  • 🔌 Device plugins are necessary for utilizing specialized hardware within Kubernetes, but they have constraints that DRA aims to overcome.
  • 🔄 The DRA API introduces concepts like ResourceClass, ResourceClaim, and ResourceClaimTemplate, providing more control and flexibility (see the sketch after this list).
  • 📝 The allocation process in DRA can occur immediately or be delayed until a pod referencing the resource claim is created, influencing pod scheduling.
  • 🛠️ Implementing a DRA driver involves defining a name, CRDs, coordination mechanisms, and providing implementations for the controller and node plugin.
  • 🔗 CDI (Container Device Interface) is a specification for exposing devices to containers, which is utilized by container runtimes like containerd and CRI-O.
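
As a rough illustration of how these objects fit together, the sketch below builds a minimal pod that consumes a GPU through a resource claim template. Field names follow the alpha resource.k8s.io API as described in the talk and may change between Kubernetes versions; the driver and template names are placeholders, not taken from the talk.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal sketch of a pod that consumes a device through DRA instead of a
// counted device plugin resource. All names are illustrative.
func main() {
	pod := map[string]any{
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   map[string]any{"name": "dra-example"},
		"spec": map[string]any{
			"containers": []any{
				map[string]any{
					"name":  "app",
					"image": "ubuntu:22.04",
					"resources": map[string]any{
						// New in DRA: containers reference claims by name
						// instead of requesting a counted resource.
						"claims": []any{map[string]any{"name": "gpu"}},
					},
				},
			},
			// Pod-level list mapping each claim name to its source:
			// a pre-created ResourceClaim or a ResourceClaimTemplate.
			"resourceClaims": []any{
				map[string]any{
					"name": "gpu",
					"source": map[string]any{
						"resourceClaimTemplateName": "my-gpu-template",
					},
				},
			},
		},
	}
	out, _ := json.MarshalIndent(pod, "", "  ")
	fmt.Println(string(out)) // JSON is also accepted by kubectl as a manifest.
}
```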

Q & A

  • What is Dynamic Resource Allocation (DRA) in Kubernetes?

    -Dynamic Resource Allocation (DRA) is a new API for requesting resources in Kubernetes, allowing for more flexible and efficient allocation of resources such as GPUs or network devices to workloads.

  • Why is there a need for Device Plugins in Kubernetes?

    -Device Plugins are needed in Kubernetes because Kubernetes does not natively support specialized hardware like GPUs or network interfaces. Device Plugins help to utilize these resources within Kubernetes workloads.

  • What limitations does the Device Plugin framework have?

    -The Device Plugin framework has limitations such as not supporting shared resources, difficulty in handling unlimited resources, and a lack of support for advanced configurations for different instances of the same resource.

  • What is Container Storage Interface (CSI) and how does it relate to storage in Kubernetes?

    -Container Storage Interface (CSI) is a standard for exposing storage systems to containerized workloads in Kubernetes. It allows storage vendors to implement their own plugins for provisioning and managing storage, separate from the Kubernetes core code.
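
Since the talk repeatedly draws the analogy from CSI to DRA, here is a small sketch of the CSI-side pair: a StorageClass naming a driver plus free-form string parameters, and a PVC requesting a volume from it. The provisioner name is a hypothetical placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch of the CSI objects the talk compares DRA against.
func main() {
	objects := []map[string]any{
		{
			"apiVersion":  "storage.k8s.io/v1",
			"kind":        "StorageClass",
			"metadata":    map[string]any{"name": "fast-nfs"},
			"provisioner": "example.com/nfs-driver", // hypothetical CSI driver
			// Parameters are string-to-string only, one limitation DRA lifts
			// by pointing at a vendor CRD instead.
			"parameters": map[string]any{"share": "exports/data"},
		},
		{
			"apiVersion": "v1",
			"kind":       "PersistentVolumeClaim",
			"metadata":   map[string]any{"name": "data"},
			"spec": map[string]any{
				"accessModes":      []any{"ReadWriteOnce"},
				"storageClassName": "fast-nfs",
				"resources": map[string]any{
					"requests": map[string]any{"storage": "10Gi"},
				},
			},
		},
	}
	for _, obj := range objects {
		out, _ := json.MarshalIndent(obj, "", "  ")
		fmt.Println(string(out))
	}
}
```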

  • How does the Dynamic Resource Allocation (DRA) solve the issues with the Device Plugin framework?

    -DRA solves the issues with the Device Plugin framework by providing a more flexible and vendor-controlled approach to resource allocation, allowing for shared resources, no requirement for pre-defining resource limits, and advanced configurations for each resource instance.

  • What is the role of the centralized controller in a DRA resource driver?

    -The centralized controller in a DRA resource driver coordinates with the Kubernetes scheduler to decide which nodes can service incoming resource claims, allocates resources, and handles allocation and deallocation requests.

  • What are the two allocation modes used in DRA?

    -The two allocation modes used in DRA are immediate allocation, where the resource is allocated immediately upon resource claim creation, and delayed allocation, also known as wait for first consumer, where the allocation is delayed until a pod referencing the claim is created.
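
A minimal sketch of how the two modes appear on a ResourceClaim, assuming the alpha resource.k8s.io API shape described in the talk (field names may differ across versions; the class name is a placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch of the two allocation modes on a ResourceClaim.
func main() {
	claim := func(name, mode string) map[string]any {
		return map[string]any{
			"apiVersion": "resource.k8s.io/v1alpha2",
			"kind":       "ResourceClaim",
			"metadata":   map[string]any{"name": name},
			"spec": map[string]any{
				"resourceClassName": "example-gpu", // placeholder class
				"allocationMode":    mode,
			},
		}
	}
	// Immediate: the driver allocates as soon as the claim exists; pods
	// referencing it later are pulled to the chosen node.
	// WaitForFirstConsumer: allocation is delayed until a pod references
	// the claim, so resource availability joins the scheduling decision.
	for _, c := range []map[string]any{
		claim("gpu-now", "Immediate"),
		claim("gpu-later", "WaitForFirstConsumer"),
	} {
		out, _ := json.MarshalIndent(c, "", "  ")
		fmt.Println(string(out))
	}
}
```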

  • How does the Kubernetes scheduler integrate with DRA during the scheduling process?

    -The Kubernetes scheduler integrates with DRA by considering resource claims as part of the pod scheduling decision. It creates a pod scheduling context to coordinate with the centralized controller to determine suitable nodes for the pod based on resource availability.
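
The coordination object is sketched below under the assumption that it matches the v1alpha2 PodSchedulingContext shape (earlier alphas used a different name, so treat the exact fields as an approximation): the scheduler proposes potential nodes, each DRA controller writes back unsuitable nodes per claim, and the selected node records the final decision.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rough shape of the pod scheduling coordination object (approximate).
func main() {
	ctx := map[string]any{
		"apiVersion": "resource.k8s.io/v1alpha2",
		"kind":       "PodSchedulingContext",
		"metadata":   map[string]any{"name": "my-pod"}, // same name as the pod
		"spec": map[string]any{
			"potentialNodes": []any{"node-a", "node-b", "node-c"},
			"selectedNode":   "node-b", // set once a decision is made
		},
		"status": map[string]any{
			"resourceClaims": []any{
				map[string]any{
					"name":            "gpu",
					"unsuitableNodes": []any{"node-c"}, // driver's veto list
				},
			},
		},
	}
	out, _ := json.MarshalIndent(ctx, "", "  ")
	fmt.Println(string(out))
}
```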

  • What is the Container Device Interface (CDI) and its significance in DRA?

    -Container Device Interface (CDI) is a specification that describes how a device should be exposed to a container. It is significant in DRA as it provides a standardized way to export devices to containers, allowing for better integration with the container runtime.
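
For reference, a sketch of what a CDI device specification looks like, following the publicly documented CDI format (cdiVersion, kind, devices, containerEdits); the vendor, device name and paths are placeholders, not values from the talk.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch of a CDI device specification of the kind a DRA kubelet plugin
// would generate for the container runtime.
func main() {
	spec := map[string]any{
		"cdiVersion": "0.5.0",
		"kind":       "example.com/gpu", // referenced as example.com/gpu=gpu0
		"devices": []any{
			map[string]any{
				"name": "gpu0",
				"containerEdits": map[string]any{
					"deviceNodes": []any{
						map[string]any{"path": "/dev/example-gpu0"},
					},
					"env":    []any{"EXAMPLE_VISIBLE_DEVICES=0"},
					"mounts": []any{},
					"hooks":  []any{},
				},
			},
		},
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out))
}
```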

  • What are the key components required to implement a DRA driver?

    -To implement a DRA driver, you need to define a name for your driver, create custom resource definitions (CRDs), establish communication between the controller and the node plugin, provide a default implementation of your resource class, and implement both the controller and the node plugin with the necessary business logic.
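
The skeleton below paraphrases the two halves of a driver as the talk describes them. It is not the upstream Go interface from the dynamic-resource-allocation helper package or the kubelet gRPC API; method names mirror the talk, and the parameter types are simplified placeholders.

```go
package driver

import "context"

// ControllerDriver is implemented by the centralized controller half.
type ControllerDriver interface {
	// Getters for the vendor CRDs referenced by ResourceClass / ResourceClaim.
	GetClassParameters(ctx context.Context, classParamsRef string) (any, error)
	GetClaimParameters(ctx context.Context, claimParamsRef string) (any, error)

	// Allocate a claim. selectedNode is empty for immediate allocation
	// (the driver picks a node) and set for delayed allocation.
	Allocate(ctx context.Context, claim string, selectedNode string) (AllocationResult, error)

	// Deallocate is called when the ResourceClaim is deleted.
	Deallocate(ctx context.Context, claim string) error

	// UnsuitableNodes narrows the scheduler's potentialNodes list during
	// the wait-for-first-consumer negotiation.
	UnsuitableNodes(ctx context.Context, potentialNodes []string) ([]string, error)
}

// AllocationResult carries the opaque resource handle ("string blob") that
// is later passed to the kubelet plugin, plus where the resource lives.
type AllocationResult struct {
	ResourceHandle string
	NodeName       string
	Shareable      bool // must be set for claims shared by several pods
}

// NodeServer is implemented by the node-local kubelet plugin (a DaemonSet).
// Both calls must be idempotent and finish quickly (the talk mentions a
// ~10 second budget in current Kubernetes).
type NodeServer interface {
	// NodePrepareResource consumes the resource handle and returns CDI
	// device IDs for the container runtime.
	NodePrepareResource(ctx context.Context, resourceHandle string) (cdiDeviceIDs []string, err error)
	// NodeUnprepareResource cleans up when the consuming pod is deleted.
	NodeUnprepareResource(ctx context.Context, resourceHandle string) error
}
```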

Outlines

00:00

💻 Introduction to Dynamic Resource Allocation in Kubernetes

Freddy Rolland and Adrian Chiris, software engineers at Nvidia, introduce the topic of Dynamic Resource Allocation (DRA) in Kubernetes. They discuss the importance of DRA, a new API for requesting resources within Kubernetes. The agenda for the talk includes an overview of available resources for workloads, the workings and limitations of the device plugin, an exploration of the Container Storage Interface (CSI), and a deep dive into the DRA flows. They also touch upon the Container Device Interface (CDI), the part of the container runtime required by DRA drivers. The paragraph sets the stage for a detailed discussion on how Kubernetes handles different types of workloads, especially those requiring specialized hardware like GPUs or networking capabilities.

05:03

🔌 Understanding Kubernetes Resources and Device Plugins

The paragraph delves into the types of resources available in Kubernetes, such as CPU, memory, and storage, and how they are allocated to workloads. It explains the role of the kubelet in reporting node status, which includes both built-in resources like CPU and memory, and device plugin resources like GPUs. The concept of 'requests' and 'limits' in Kubernetes is introduced, which helps the scheduler to place containers on nodes with sufficient resources. The paragraph also discusses the evolution from basic storage options to more advanced and flexible solutions like CSI, which allows storage vendors to implement their own plugins without being tied to Kubernetes release cycles. Additionally, it addresses the limitations of the device plugin framework, such as the inability to share resources and the lack of support for advanced configurations.
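
To illustrate the requests/limits mechanism described above, here is a minimal sketch of a container resources block; the scheduler considers only the requests when picking a node, while limits cap usage at runtime. Quantities and the image name are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal sketch of a container's requests and limits.
func main() {
	container := map[string]any{
		"name":  "app",
		"image": "nginx:1.25",
		"resources": map[string]any{
			"requests": map[string]any{"cpu": "500m", "memory": "256Mi"},
			"limits":   map[string]any{"cpu": "1", "memory": "512Mi"},
		},
	}
	out, _ := json.MarshalIndent(container, "", "  ")
	fmt.Println(string(out))
}
```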

10:05

🚀 The Emergence of the Dynamic Resource Allocation (DRA) API in Kubernetes

This section introduces the DRA API as a solution to the limitations of the device plugin framework. DRA, which landed as an alpha feature in Kubernetes 1.26, allows for more flexible and advanced resource management. It is designed to give vendors full control over resource management, similar to how CSI works for storage. The paragraph explains the components of DRA, including the resource class, the resource claim, and the use of Custom Resource Definitions (CRDs) to allow for vendor-specific parameters. It also discusses the difference between resource claim templates and resource claims, and how a template creates a new resource claim with each reference. The paragraph highlights the benefits of DRA, such as the ability to share resources between workloads, solve the issue of unlimited resources, and provide more flexibility in resource configuration.

15:06

🔄 Deep Dive into Resource Sharing and Allocation Modes in DRA

The paragraph explores how DRA enables resource sharing between different containers within the same pod or across different pods. It emphasizes the importance of the resource claim's name in facilitating sharing and mentions that the DRA driver implementer must mark a resource as shareable for such configurations to work. The discussion then moves to the two allocation modes in DRA: immediate allocation, where resources are allocated as soon as a resource claim is created, and delayed allocation, which waits until a pod references the claim before allocating resources. The paragraph provides a detailed explanation of the flow of events in both allocation modes, including the interactions between the centralized controller, the node-local kubelet plugin, and the Kubernetes scheduler.
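
A sketch of the sharing pattern just described: two containers in one pod reference the same pre-created ResourceClaim by name. Field names follow the alpha API, the claim and image names are placeholders, and the driver must have marked the allocation as shareable for the scheduler to accept this.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch of two containers in one pod sharing a single resource claim.
func main() {
	claimRef := map[string]any{"name": "shared-gpu"}
	container := func(name string) map[string]any {
		return map[string]any{
			"name":  name,
			"image": "ubuntu:22.04",
			"resources": map[string]any{
				"claims": []any{claimRef}, // both containers use the same claim
			},
		}
	}
	pod := map[string]any{
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   map[string]any{"name": "shared-gpu-pod"},
		"spec": map[string]any{
			"containers": []any{container("trainer"), container("monitor")},
			"resourceClaims": []any{
				map[string]any{
					"name": "shared-gpu",
					// An existing claim (not a template), so other pods can
					// reference the very same object to share across pods.
					"source": map[string]any{"resourceClaimName": "gpu-claim-1"},
				},
			},
		},
	}
	out, _ := json.MarshalIndent(pod, "", "  ")
	fmt.Println(string(out))
}
```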

20:06

🛠️ Building a DRA Resource Driver for Kubernetes

This section provides an overview of the process of creating a DRA resource driver, which involves defining a name for the driver, creating CRDs, and determining the communication method between the controller and the plugin. It outlines the key components of a DRA resource driver, including a centralized controller, a node-local kubelet plugin, and a set of CRDs. The paragraph explains the responsibilities of each component and the two allocation modes: immediate and delayed. It also discusses the driver interface in the controller, which includes methods for getting class and claim parameters, allocating and deallocating resources, reporting unsuitable nodes, and preparing and unpreparing resources on the node. The paragraph concludes with a brief mention of CDI, which is used to expose devices to containers, and provides a reference to an example driver that serves as a starting point for developers looking to create their own DRA drivers.

25:07

⚙️ Driver Interface and CDI in DRA Resource Drivers

The final paragraph focuses on the driver interface within the controller and the role of Container Device Interface (CDI) in DRA resource drivers. It describes the methods of the driver interface, such as getting class and claim parameters, allocating resources, and handling resource deallocation and preparation. The paragraph also explains the importance of CDI, which is a specification for describing how a device should be exposed to a container. It mentions that CDI is consumed by the container runtime to export devices to containers. The paragraph concludes with a list of resources for further reference and a summary of the key points covered in the presentation.

Keywords

💡Kubernetes

Kubernetes is an open-source container orchestration platform used to automate the deployment, scaling, and management of containerized applications. In the video, Kubernetes is central to the discussion as the environment where Dynamic Resource Allocation (DRA) is being implemented. The script mentions Kubernetes when discussing how resources are allocated to workloads and the role of the Kubelet in reporting node status.

💡Dynamic Resource Allocation (DRA)

Dynamic Resource Allocation, or DRA, refers to the ability of a system to allocate resources on-the-fly to different tasks or processes as needed. In the context of the video, DRA is a new API for requesting resources within Kubernetes, aiming to provide more flexibility and control over resource management. The script explains DRA as a solution for specialized hardware utilization within Kubernetes.

💡Device Plugin

A Device Plugin in Kubernetes is a mechanism that allows custom devices to be allocated to containers. It enables the use of specialized hardware within Kubernetes pods. The video script discusses the limitations of the device plugin framework and how DRA can overcome these limitations, such as the inability to share resources or configure resources differently.

💡Container Storage Interface (CSI)

CSI is a specification for a standard interface that allows storage systems to be exposed to containerized applications within Kubernetes. The script mentions CSI as an evolution from the earlier volume plugins, giving storage vendors more flexibility to implement and release their own plugins without being tightly coupled with Kubernetes release cycles.

💡Persistent Volume Claim (PVC)

A Persistent Volume Claim in Kubernetes is a request for a certain amount of storage by a user. It is a way for users to consume storage in a cluster without having to interact with the underlying storage system. The video script uses PVC as an example of how storage resources are requested and allocated within a Kubernetes environment.
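
For context, a minimal sketch of how a pod consumes an already-created PVC, which is the pattern the talk later mirrors with resource claims. Names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch of a pod mounting a volume backed by a pre-created PVC.
func main() {
	pod := map[string]any{
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   map[string]any{"name": "with-volume"},
		"spec": map[string]any{
			"containers": []any{
				map[string]any{
					"name":  "app",
					"image": "nginx:1.25",
					"volumeMounts": []any{
						map[string]any{"name": "data", "mountPath": "/data"},
					},
				},
			},
			"volumes": []any{
				map[string]any{
					"name": "data",
					// References the pre-created claim by name.
					"persistentVolumeClaim": map[string]any{"claimName": "data"},
				},
			},
		},
	}
	out, _ := json.MarshalIndent(pod, "", "  ")
	fmt.Println(string(out))
}
```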

💡Node Status

In Kubernetes, the node status provides information about the current state of a node, including the resources available and the resources that are allocatable for future workloads. The script refers to the node status when explaining how Kubelet reports the capacity and allocatable resources of a node.
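
A small sketch of the node status fields described above: capacity is everything the node has, allocatable is what remains for future workloads, and the nvidia.com/gpu entry is the kind of countable resource a device plugin advertises. Quantities are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch of a node status fragment with built-in and device plugin resources.
func main() {
	status := map[string]any{
		"capacity": map[string]any{
			"cpu":            "32",
			"memory":         "128Gi",
			"nvidia.com/gpu": "2",
		},
		"allocatable": map[string]any{
			"cpu":            "31500m",
			"memory":         "126Gi",
			"nvidia.com/gpu": "2",
		},
	}
	out, _ := json.MarshalIndent(status, "", "  ")
	fmt.Println(string(out))
}
```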

💡Resource Class

A Resource Class in the context of DRA is a new concept introduced to define the characteristics of a resource. It is similar to a Storage Class in CSI but for other types of resources. The video script explains how a Resource Class is used to specify the driver and parameters for a resource, providing a structured way to request resources.
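
A sketch of a ResourceClass as described above: it names the DRA driver and can point at a vendor-defined CRD for class-level parameters. The driver and CRD names are placeholders, and the fields follow the alpha resource.k8s.io API, which may change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch of a ResourceClass created by the cluster admin.
func main() {
	class := map[string]any{
		"apiVersion": "resource.k8s.io/v1alpha2",
		"kind":       "ResourceClass",
		"metadata":   map[string]any{"name": "example-gpu"},
		"driverName": "gpu.example.com", // the DRA driver bound to this class
		"parametersRef": map[string]any{
			"apiGroup": "gpu.example.com",
			"kind":     "GpuClassParameters", // vendor CRD with free-form spec
			"name":     "default",
		},
	}
	out, _ := json.MarshalIndent(class, "", "  ")
	fmt.Println(string(out))
}
```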

💡Container Device Interface (CDI)

CDI is a specification, consumed by the container runtime, that describes how devices are made available to containers. The script mentions CDI as a prerequisite for DRA, indicating that the container runtime must support CDI to utilize DRA effectively.

💡Resource Claim

A Resource Claim in DRA is analogous to a PVC but for device resources. It is a request for a specific type of resource, and it is used to allocate resources to a pod. The video script describes how a Resource Claim is created and used to reference a specific resource, allowing for more complex and flexible resource management.

💡GRPC

gRPC is a high-performance, open-source universal RPC framework that can run anywhere. In the context of Kubernetes and the video, gRPC is used by the Device Plugin to expose an interface to Kubelet for resource management. The script mentions gRPC as the communication protocol between Kubelet and the Device Plugin.
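
The following is a paraphrased view of the kubelet-to-device-plugin contract the talk refers to; it is not the generated protobuf API from the Kubernetes project, and the types are simplified for illustration.

```go
package deviceplugin

import "context"

// Device is a single countable instance, e.g. one GPU.
type Device struct {
	ID     string // opaque ID, not a user-visible name
	Health string // "Healthy" or "Unhealthy"
}

// DevicePlugin paraphrases the gRPC service a device plugin exposes to the
// kubelet after registering itself.
type DevicePlugin interface {
	// ListAndWatch streams the current device list and pushes updates on
	// every change (a streaming RPC in the real API).
	ListAndWatch(ctx context.Context, updates chan<- []Device) error

	// Allocate is called by the kubelet just before a container starts and
	// returns the runtime instructions (device nodes, mounts, env vars)
	// needed to access the assigned devices.
	Allocate(ctx context.Context, deviceIDs []string) (ContainerRuntimeEdits, error)
}

// ContainerRuntimeEdits summarizes what the plugin hands back to the kubelet
// for the container runtime.
type ContainerRuntimeEdits struct {
	DevicePaths []string          // e.g. /dev/nvidia0
	Mounts      map[string]string // hostPath -> containerPath
	Env         map[string]string
}
```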

Highlights

Introduction to Dynamic Resource Allocation (DRA) in Kubernetes by Nvidia software engineers.

DRA is a new API for requesting resources in Kubernetes, enhancing resource management.

Explaining the different resources available for workloads in Kubernetes.

Discussion on how to request resources such as CPU, memory, and storage.

Overview of the Device Plugin framework and its role in exposing specialized hardware to Kubernetes workloads.

Limitations of the Device Plugin framework and its impact on resource allocation.

Introduction to Container Storage Interface (CSI) and its advantages over in-tree storage volume plugins.

Deep dive into the DRA flows and the steps to build a custom DRA driver.

Explanation of Container Device Interface (CDI) and why DRA drivers require it.

The variety of resources that can be allocated to workloads, including specialized hardware.

How Kubelet reports node status and manages resource allocation.

The process of allocating CPU, memory, and other built-in resources in Kubernetes.

Storage options in Kubernetes, including scratch space and persistent storage.

The evolution from in-tree volume plugins to CSI for storage management.

The concept of dynamic provisioning in Kubernetes and its benefits.

The necessity and functionality of Device Plugins for specialized hardware utilization.

The issues with the current Device Plugin framework, such as lack of shared resources and configuration limitations.

Introduction to DRA and its main APIs as a solution to the limitations of Device Plugins.

The anatomy of a DRA resource driver and its components.

Explaining the allocation modes in DRA: immediate allocation and wait-for-first-consumer.

How to implement a resource driver for DRA, including defining CRDs and driver interface.

Resources and tools available for developing DRA drivers, including example drivers and helper packages.

Conclusion and Q&A session wrapping up the discussion on DRA in Kubernetes.

Transcripts

00:00

So, I'm Freddy Rolland and with me is Adrian Chiris. We are software engineers at Nvidia, part of the cloud operations team in the networking business unit. Our day-to-day work is to enable networking technologies in Kubernetes. Today we'll talk about Dynamic Resource Allocation, also known as DRA, a new API for requesting resources in Kubernetes.

Okay, so let's take a look at the agenda. First we'll go over the different resources available for your workload and how you actually request them. Then we'll talk about the device plugin, how it works and what its limitations are. Then we'll go over DRA and its main APIs. After that we'll do a deep dive into the DRA flows, and we'll also go over the steps you need to take in order to build your own DRA driver. Lastly we'll cover CDI. CDI is the Container Device Interface, the part of the container runtime that is required by DRA drivers. Okay, let's start.

01:09

So Kubernetes is all about running workloads inside containers, right? But not every workload has the same requirements. For example, if you have a CNF application like a router or a firewall, you have very specific networking requirements, or if you're using DPDK for that application you need hugepages. And in AI, for example, GPUs are required both for training and inference; in training you will need multiple GPUs across multiple nodes, and you may also require some fast networking in order to be able to efficiently share data between them, maybe using GPUDirect or RDMA.

So what are the resources we can allocate to our workload? First we have the regular ones: CPU, memory and hugepages. Then we have storage-related resources, and finally we also have the device plugin resources. What are device plugin resources? For example, nvidia.com/gpu.

Okay, so where do we see these resources? In the node status we actually have two sections: the first one is capacity, the second one is allocatable. Capacity is the whole pool of resources that we have on this specific node, and allocatable is what is still available to schedule future workloads. The kubelet is in charge of reporting the node status, and it is also in charge of reporting the available resources. In the first part you see what we can call the built-in resources, like CPU and memory, and in the second part we have some examples of device plugin resources.

Okay, next. Here is an example of allocating CPU, memory and hugepages. Under the spec of your pod, on each container, you have two sections, requests and limits. The scheduler will look at the requests part and search for a node that has enough resources to actually satisfy this request, and according to that it will decide where this pod will eventually be scheduled.

03:29

In storage we have several options. First we have the ephemeral storage, which some call the scratch space. If, for example, you want to download some large file or keep some state that doesn't need to survive in your local files, you can use this one, but you need to understand that it is not persisted: if your pod is restarted, all your data is lost.

Regarding persistent storage, we have a few options. The first one is what we call the in-tree storage volume plugins. In this example we have an NFS mount: you can just specify the NFS server and all the needed parameters and you'll get the mount inside your pod. Why is it called in-tree? Because the implementation of these volume plugins is part of the Kubernetes core code, and it was actually not very convenient for storage vendors to have this code inside the Kubernetes code base, because it is tightly coupled with the cadence of Kubernetes releases: if they have a bug or want to release a new feature, they need to wait for the next release. So as an evolution from the in-tree volume plugins we got CSI. CSI is the Container Storage Interface, and it gave the storage vendors full freedom to implement their own components and release them on their own cadence, so they can fix bugs and add features; they just need to implement the APIs defined by CSI.

So what do we have in CSI? We have a storage class. In the storage class you have a name and the CSI driver that will eventually provision and expose these volumes to your pod. In addition you have the possibility to have a bunch of parameters. These parameters are free-form, meaning you can put whatever you want there, but they are very limited in structure because they are just a string-to-string key map. Next we have the persistent volume claim. In the volume claim you specify some parameters, for example access mode and size, and most importantly you can also specify the storage class name, which actually states which provider will eventually provision your volume.

Dynamic resource allocation takes its main approach from this API: it takes the idea of a storage class and a claim and extends it to any resource, not only storage.

Okay, so how do you actually request the volume inside the pod? You have a volumes part under the spec, and there you can say which PVC you want to have in your workload. In this case the PVC was already created before.

06:26

Okay, the next method that we have is the device plugin. So why do we need a device plugin? Sometimes your node has specialized hardware. For example, here we have a BlueField DPU, an A100 GPU and a ConnectX NIC, and we want to be able to utilize this hardware inside our workload. And as we saw, Kubernetes does not support specialized hardware; there is only a limited set of resources that it is aware of. So here comes the device plugin to help us actually utilize these resources.

How does it work? The device plugin is a kubelet plugin, meaning it runs on the node. It will first register itself with the kubelet and say, okay, this is the resource that I'm handling, and then it will expose a gRPC interface to the kubelet. The most important method here is ListAndWatch: the kubelet asks the plugin for a list of the available resources, and it is a streaming API, so if there is a change in the status the device plugin can update the kubelet with that change. The second important part is Allocate. Allocate is called by the kubelet just before creating the pod, and the device plugin gives the kubelet a list of instructions to be passed on to the container runtime, explaining exactly what needs to be done to be able to access this resource.

Okay, so as I mentioned, we can see these resources also in the node status; here we have two examples, one GPU resource and one SR-IOV resource. And how do you actually request them inside your pod? Under resources you have the requests, and then it goes like domain slash name of the resource; here we are requesting one GPU and one SR-IOV resource. As you can see, this interface is, you could say, countable: it's just a number.

So what are the issues with the device plugin framework? First of all, you cannot have shared resources. Let's say, for example, that you have a GPU that is able to work with different workloads at the same time; using the device plugin you cannot do that. Why is that? Because a resource doesn't have a name, it's just a number, so you cannot ask for a specific device or for one that has already been allocated to another workload. The second point is unlimited resources. If you are familiar, for example, with KubeVirt, which runs VMs inside Kubernetes, they have a device plugin for KVM and it advertises a count of 1000, and it really doesn't make sense, because KVM is not a limited resource, it's just a capability of the CPU. But since they want to use all the things that are part of the device plugin framework, they still need to publish a count, so it's kind of hardcoded, and actually it doesn't have any meaning. And you don't have the possibility to do advanced configuration. Let's say, for example, that you have two GPUs and you want a different configuration on each of them; the device plugin framework doesn't give you the possibility to do that, everything will be configured the same.

09:55

So here comes DRA to actually address all of these issues that we mentioned. What is DRA? It is a new way of requesting resources in Kubernetes. It started in 1.26. You will need a container runtime that supports CDI, the Container Device Interface; you can see here the versions from which containerd and CRI-O already have this support. It is still in alpha, meaning that if you want to start trying it you need to enable a feature gate. The idea behind it is to give an alternative to the device plugin framework that we mentioned earlier, and, similar to CSI, to give full control to the vendors: like we mentioned, storage vendors can release on their own cadence, and we want to do the same regarding resources. It actually takes the same approach: if you remember, we had a storage class, now we have a resource class, and we have a resource claim. So the idea is similar, but in addition we have some things that are a little better. For each resource class you can have a CRD defined by the vendor that acts as class parameters. If you remember, we had the list of strings in the storage class; now the vendor of the DRA driver can put whatever it wants into the parameters, and it can be really much more complex than what we had before. In addition, the resource claim has the same thing: you can point to a vendor-defined CRD with a lot of parameters for each resource claim. And we also have a resource claim template, which we will explain in a few slides.

Okay, so first of all, how does the spec of the pod change? That's the most important thing for you as an end user. It's a little bit more verbose, but we need to keep in mind that it gives us a lot more flexibility when using these resources. On the left we have the device plugin configuration with the count that we mentioned earlier, so we want two GPUs. In the new way you have a new section under resources called claims, and there you give a list of names, the names of the resource claims that you want to use. Then you also have a new section called resource claims, and there you need to configure, for each claim that you want to use, what its source is. In this example it is a resource claim template that is configured on the right, and each time we reference this resource claim template a new resource claim is created with the spec defined in the template. So the idea is that every time you use a resource claim template a new resource claim is created; it's not reusing an existing one. And lastly, we can see that in the spec we have a reference to a resource class.

Okay, let's take a look at the resource class. First of all, all the examples here are from an existing DRA driver for GPUs that has been implemented by Kevin Klues from Nvidia. He also gave a great talk about it with Alexey Fomenko from Intel; you can check it out from the last KubeCon, we'll give a link at the end. The resource class will define, first of all, the name of the resource and then the DRA driver that will actually be bound to this resource. It is created, same as the storage class, by the cluster admin. Next, we mentioned that we also have the possibility to have parameters for the resource class. How do we do that? We just configure a reference in the form of API group, kind and name, which is a CRD that the DRA driver implements, and then you can have specific parameters; in this example the parameters specify whether the GPUs are shareable.

Okay, so we have a resource claim template and a resource claim; what is the difference? Like I mentioned earlier, a resource claim template creates a new resource claim each time it is referenced, while a resource claim always refers to the exact same object. Now, we mentioned that the resource claim can also have parameters, and that gives us a lot of possibilities. Here in this example we have a GPU selector on the resource claim, meaning we actually want either a default GPU or a V100 with less than 16 GB of memory. You can imagine that there is a lot of flexibility in how you can configure your resources, with the same type of resource but a different configuration on each instance.

Okay, next, how can we actually share resources between workloads? Here is an example with different containers in the same pod: you just point to the same claim. Since we now have a name, it's quite easy, so we have the GPU claim name, and then in the resource claims section you define the source, where you reference your pre-created resource, and then you can actually refer to it from two different containers in the same pod. And it goes the same regarding sharing between different pods: again you use the name of the pre-created resource. One thing to mention is that the DRA driver implementer needs to specify in the resource claim allocation that this resource is actually shareable, otherwise the scheduler won't allow this kind of configuration.

So we saw that DRA comes and solves the sharing issue that we mentioned, as we just saw. It also solves the unlimited resources issue, because you don't have to expose the number of resources you want to support, it's not required, and you can easily implement a DRA driver that doesn't have any limits. And the last one is a lot more flexibility regarding configuration: each different instance of the same resource can easily have a different configuration. Now Adrian will take us on a deeper dive into the different flows.

16:42

All right, thanks Freddy for providing us an overview of DRA. We're a bit short on time, but let's try to make it. I will go through some high-level flows here to understand what happens a bit under the hood with DRA, then we'll see what is required to implement a resource driver and some helpers for that, and then hopefully we'll have some time for questions.

All right, so what is the anatomy of a DRA resource driver? Essentially it is composed of two coordinating components: a centralized controller, which runs with high availability, and a node-local kubelet plugin running as a daemon set, and we also have a set of CRDs, as Freddy explained. The centralized controller coordinates with the Kubernetes scheduler to decide on which nodes an incoming resource claim can be serviced; it allocates the resource claim once the scheduler picks the node, and it is also in charge of deallocation. The kubelet plugin is essentially in charge of all the node-local operations: it will publish the node-local state to the central controller, it will perform the resource preparation requests coming from the kubelet (we'll see that later), and it will also perform unprepare requests. As for the CRDs, each resource driver can define its own driver-specific resource class parameters and resource claim parameters, and additional CRDs can optionally be added, for example to store the global state or the per-node state, to keep track of allocated resources. And that's it.

In regards to the allocation modes, there are two allocation modes. One is immediate allocation, which means that the allocation happens immediately for a resource claim: once the resource claim is created, the resource driver allocates the resource on a specific node, and then a pod which references the claim will get scheduled onto that node. Delayed allocation, also known as wait for first consumer, delays the allocation of the resource claim until a pod references it. At that point the resource availability is considered as part of the pod scheduling, in the sense that the entire request of the pod, the resources, CPUs, device plugin resources and other claims, is taken into consideration in the scheduling decision, and we'll see how this happens.

Right, let's dig into the immediate flow. The flow is the same at the beginning: the admin will deploy the DRA resource driver, the kubelet plugin and the CRDs, and will define a resource class. A user will create the resource claim for the resource class. At that point the centralized controller picks that up and proceeds with allocation of this resource: it will allocate it on some node in the cluster. Once it's allocated it will update the resource claim status with a resource handle. This contains essentially a string blob which is passed through the system, essentially by the kubelet plugin to the DRA driver again, as well as setting the node on which the resource was allocated. At that point a user will create a pod which references that resource claim, and the Kubernetes scheduler will kick in here, inspect the pod, see that it has a resource claim reference, and proceed with scheduling this pod onto the node where the resource was allocated. It's a long process, right? So once the node was selected, the kubelet picks that up; it will again see that this pod is referencing a resource claim, and it will call the kubelet plugin via gRPC, passing in the claim information. The kubelet plugin will perform the preparation needed and return a set of CDI device identifiers (we'll discuss them at the end), which are then passed to the container runtime, and the container is spun up exposing the devices.

All right, that was the immediate allocation; now we'll sort of complete the picture for the delayed allocation. The initial flow is essentially the same: the admin will deploy whatever is needed, the user will create the resource claim. One thing to note is that at that point the centralized controller does not kick in; again, it's wait for first consumer, so it will not kick in. The user will create a pod referencing the resource claim, and at that point the Kubernetes scheduler picks that up. It essentially looks at the pod, looks at the resource claim, and creates an object called a pod scheduling context. This object is used to coordinate operations between the different DRA drivers and the Kubernetes scheduler for the pod. It will set a set of potential nodes; essentially these are nodes where the pod may run. On the other hand, the central controller will read those potential nodes and will try to narrow down the list by updating this object with a set of unsuitable nodes, a subset of nodes on which this pod should not be scheduled. This operation is repeated for all resource drivers until a scheduling decision is made. Once this scheduling decision is made, the Kubernetes scheduler will update the pod scheduling context with the selected node, so a node was chosen. At that point the centralized controller will pick up the selected node and proceed with the allocation onto that node, same as in immediate allocation.

So this was a quick rundown of the two allocation modes and how they work with Kubernetes, and now let's discuss at a high level how you would write a DRA driver.

22:32

So essentially what you would need: first of course you need to define a name for your driver, and define the CRDs which are to be referenced in the resource class and resource claim parameters; essentially these are custom parameters for your resource, which may be global or per resource allocation. You decide how the controller and the plugin are going to coordinate or communicate: is it per-node CRDs, is it gRPC with some database, or a combination of the two. The key concept here is that you essentially need to represent the following: the set of available resources in the cluster or on the node, the set of allocated resources, and the set of prepared resources. You will in addition need to provide a default implementation of your resource class, to be distributed with your driver so users can use it. And then of course there is the implementation: the implementation of the controller and the implementation of the kubelet plugin. Both of them include some boilerplate code, in order to interact with the Kubernetes APIs in the controller case or to interact with the kubelet, as well as, of course, the business logic for the two.

Okay, so this was a long list, so to help you do that, essentially what we have is a bunch of packages provided by the Kubernetes ecosystem. The first one is the controller package from the dynamic resource allocation controller project, which implements most of the boilerplate code to interact with the Kubernetes DRA API objects. It defines a driver interface which you need to implement, and we'll go over that; once you implement it, you provide it to the New method, you get a controller and you just call Run. I'm oversimplifying it a bit, but that's how it works at a high level. For the kubelet part there is an implementation of the registration with the kubelet over gRPC, so the registration is already provided for you; you just need to provide the gRPC implementation for the node server, which is the gRPC server that will prepare and unprepare resources, and again call a Run method there as well. The gRPC API is defined in the kubelet APIs in the Kubernetes project. That's for the Kubernetes part. We also have a bunch of CDI helpers here, so you can reference them later; essentially they will help you create CDI device specifications to be used later on by the container runtime. And I think most importantly here is the example driver: there is a DRA example driver which is fully functional on top of mock GPUs. You just need a kind cluster to bring it up, and there is a pretty good readme with step-by-step instructions on how to run it. There you can inspect the different parts; it serves as a reference implementation which you can use as a starting point and extend or rewrite.

All right, so in regards to the driver interface in the controller: that's the driver interface, it has a couple of methods, we'll quickly go over them. There are the get class parameters and get claim parameters methods; nothing too fancy here. We discussed the vendor-specific CRDs for the class and the claim; these are the getters for them, and they return the specific instance of the CRD. There is the allocate call, which essentially performs the allocation of a resource. Notice the selected node field: it is empty in the case of immediate allocation, where you need to choose your own node, and it has a value in the case of delayed allocation, because of the whole pod scheduling context flow which we went through. You will get the claim, the claim parameters, the resource class and the class parameters, and you need to return an allocation result. This struct will eventually contain that string blob with information about the allocated resource, as well as the node where the resource is available. The deallocate call essentially deallocates the resource; it's called when the resource claim is deleted, and it should free resources which were created by this claim. Unsuitable nodes gets called during the wait-for-first-consumer flow, where we need to negotiate with the scheduler on which nodes we can be scheduled: it accepts the potential nodes, and it needs to update, in the passed-in claim allocation objects, the unsuitable nodes for each claim; again, as discussed before, you update the struct with the nodes you don't want to be scheduled on.

For the node part, there are the node prepare and unprepare resource calls; these run on each node, in the kubelet plugin. Node prepare resource will prepare the resource: it will generate a CDI device specification and return the CDI device IDs. One thing to note here is that you get the resource handle in the request, which is that string blob we talked about earlier. Another thing to note: the call must be idempotent, and you have under 10 seconds to finish the call, currently at least, with Kubernetes. Node unprepare resource does the opposite of prepare resource. I didn't mention that the first one gets called when the pod is created and it references a claim; this one gets called when a pod is deleted, and you need to perform cleanup for the resource, and again this call must be idempotent as well.

And let's talk a little bit about what CDI is, as we mentioned it before a couple of times. CDI stands for Container Device Interface. It's essentially a JSON-formatted specification which describes how a device should be exposed to a container. It contains information such as device nodes which need to be exposed, like char devices, as well as environment variables, host mounts and hooks that need to be run. It's sort of a standardized way to export devices to containers, and it is consumed by the container runtime, like containerd or CRI-O, to expose devices to the container. And that's an example of a CDI device specification; it just contains what I said, and you can dig into it later.

The next thing is just a link to a couple of resources which we referenced throughout this presentation, so it's all here, you can check them later. And with that I think we are done, 12 seconds to go. Thank you.

[Applause]
