AWS Batch on EKS

Containers from the Couch
9 Nov 2022 · 56:27

Summary

TLDR: In this episode of 'Containers from the Couch', AWS's Jeremy Cowan, Sai Vennam, Angel Pizarro, and Jason Rupert discuss the new capabilities of AWS Batch on EKS. They explore the fully managed service for running batch jobs on Kubernetes, its history, and the benefits of using it for high-throughput workloads. The conversation covers topics like job queues, compute environments, and the integration of AWS Batch with EKS, providing insights into how customers can leverage this service for their batch processing needs.

Takeaways

  • AWS Batch is a fully managed service designed for running batch jobs and high-throughput workloads like genomics, financial risk analysis, and AI/ML training.
  • AWS Batch supports running jobs on EC2 instances with ECS and on ECS Fargate, and has recently introduced support for EKS clusters.
  • The motivation for offering AWS Batch on EKS is to leverage the scalability and just-in-time allocation of cloud resources for batch workloads, which differs from traditional on-prem batch processing.
  • AWS Batch uses its own scheduler and scaling system, rather than Kubernetes' default scheduler, to efficiently manage compute resources for job processing.
  • AWS Batch is optimized for both maximal throughput and cost-efficiency, balancing these two factors when scaling compute resources for jobs.
  • Users are responsible for creating their EKS clusters, while AWS Batch manages the compute environment within the EKS cluster to run batch workloads.
  • AWS Batch supports job dependencies and array jobs, allowing complex workflows such as MapReduce to be executed as a series of batch jobs.
  • AWS Batch does not use Karpenter for scaling compute instances, opting instead for its own managed scaling approach that integrates with EKS.
  • AWS Batch allows customers to provide a launch template for their nodes, enabling customization and adherence to hardened image policies.
  • AWS Batch integrates with monitoring tools like Prometheus and Grafana, allowing users to track resource usage and job performance within EKS.
  • AWS Batch on EKS is in its early stages, with plans to add features like persistent volume support and multi-node parallel job types based on customer feedback.

Q & A

  • What is the role of Jeremy Cowan in the episode?

    -Jeremy Cowan is a developer advocate at AWS and the host of the episode, facilitating the discussion about AWS Batch and its integration with Kubernetes.

  • What is the main topic of discussion in this episode of 'Containers from the Couch'?

    -The main topic is running batch jobs on Kubernetes with AWS Batch, exploring the new solutions and integrations offered by AWS for managing batch workloads on EKS.

  • What is AWS Batch and why was it created?

    -AWS Batch is a fully managed service designed for running batch jobs at scale. It was created to handle high-scale workloads such as genomics, financial risk analysis, and AI/ML training by providing a compute scheduler for these batch workloads.

  • How does AWS Batch differ from traditional on-premises batch processing?

    -AWS Batch leverages the scalability of the cloud and just-in-time allocation of resources, making scheduling more flexible and efficient compared to the capped resources and shared infrastructure of traditional on-premises batch processing.

  • What is the significance of the term 'compute environment' in AWS Batch?

    -A compute environment in AWS Batch represents the types and amounts of resources available to jobs. It defines the minimum and maximum number of vCPUs a cluster can have and specifies the target container platform, such as ECS on EC2, Fargate, or an EKS cluster. (A minimal API sketch appears at the end of this Q&A.)

  • What is a 'job queue' in the context of AWS Batch?

    -A job queue is the central resource in AWS Batch to which all work is submitted. It holds information about the jobs, such as their CPU and memory requirements, and is connected to one or more compute environments that execute those jobs. (See the second sketch at the end of this Q&A.)

  • How does AWS Batch handle scaling for batch workloads?

    -AWS Batch uses workload-aware scaling to make aggregate decisions on how to scale up or down compute resources based on job queues, requirements of the jobs, and cost considerations, optimizing for both throughput and cost efficiency.

  • Why did AWS choose to integrate AWS Batch with EKS instead of creating a separate control plane?

    -AWS chose to integrate with EKS to leverage existing customer infrastructure and preferences. Many customers were already using EKS for their workloads, and integrating AWS Batch allows them to manage batch workloads within the same environment they are familiar with.

  • What is the role of 'jobs' and 'job dependencies' in AWS Batch?

    -Jobs in AWS Batch are individual tasks to be executed, and job dependencies define the order in which they run. For example, a reduce job can depend on the completion of all map jobs in a MapReduce architecture. (See the array-job sketch at the end of this Q&A.)

  • How does AWS Batch support the execution of multi-node parallel jobs?

    -AWS Batch supports multi-node parallel jobs through a job type designed for ECS. While this feature is not yet available for EKS, it is under consideration for future releases to accommodate machine learning workloads and other use cases that require this design pattern.

  • What kind of access does AWS Batch require to an EKS cluster?

    -AWS Batch requires the ARN of the EKS cluster and the service role to access the cluster. It uses these to integrate with the EKS API and manage the scaling of nodes for batch workloads within the customer's VPC.

  • How can customers provide feedback or get support for AWS Batch?

    -Customers can provide feedback or seek support through various channels, including reaching out via the 'Contact Us' page on the AWS website or engaging with AWS representatives on social media platforms like Twitter.
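
For the compute environment question above, here is a minimal boto3 sketch of creating an EKS compute environment. The cluster ARN, networking IDs, role name, and namespace are placeholder values, not values from the episode:

```python
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="my-eks-ce",
    type="MANAGED",
    state="ENABLED",
    eksConfiguration={
        "eksClusterArn": "arn:aws:eks:us-east-1:123456789012:cluster/my-cluster",
        "kubernetesNamespace": "my-aws-batch-namespace",
    },
    computeResources={
        "type": "EC2",
        "minvCpus": 0,  # scale to zero when the queue is empty
        "maxvCpus": 128,
        "instanceTypes": ["m6i"],
        "subnets": ["subnet-0abc123"],
        "securityGroupIds": ["sg-0abc123"],
        "instanceRole": "my-eks-node-instance-profile",
    },
)
```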
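
For the job queue question, a matching sketch that creates a queue and attaches it to that compute environment (same hypothetical names):

```python
import boto3

batch = boto3.client("batch")

# A queue can fan out to several compute environments, tried in order.
batch.create_job_queue(
    jobQueueName="my-eks-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "my-eks-ce"},
    ],
)
```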
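
And for jobs and job dependencies, a hedged sketch of the MapReduce shape described above: one array job fans out the map step, and a job that depends on the array job's ID starts only after every child succeeds (queue and job definition names are placeholders):

```python
import boto3

batch = boto3.client("batch")

# Map step: a single call fans out into 10,000 indexed child jobs.
map_job = batch.submit_job(
    jobName="map-step",
    jobQueue="my-eks-queue",
    jobDefinition="map-job-def",
    arrayProperties={"size": 10000},
)

# Reduce step: a dependency on the array job waits for all of its children.
batch.submit_job(
    jobName="reduce-step",
    jobQueue="my-eks-queue",
    jobDefinition="reduce-job-def",
    dependsOn=[{"jobId": map_job["jobId"]}],
)
```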

Outlines

00:00

Introduction to AWS Batch on Kubernetes

The episode begins with an introduction to the panelists, Jeremy Cowan, Sai Vennam, Angel Pizarro, and Jason Rupert, who are all AWS employees with various roles in developer advocacy and engineering. They discuss the new capabilities of AWS Batch, a fully managed service for running batch jobs, on ECS and now EKS. The conversation sets the stage for a deeper dive into the service's features, motivations, and the team's experiences in bringing this solution to market.

05:02

AWS Batch Service and its Evolution

This section delves into the history and functionality of AWS Batch, a service designed for running batch workloads at scale. The panelists discuss the service's inception, its evolution from supporting EC2 instances to ECS and Fargate, and now its expansion to EKS. They highlight the importance of workload-aware scaling and the intelligent coupling of scheduling and scaling operations for efficiency. The conversation also touches on the learnings from the service's operational experience and how they've informed the development of AWS Batch on EKS.

10:05

๐Ÿ” Understanding Batch Workloads and Concepts

The panel clarifies key concepts related to batch workloads, such as jobs, job queues, and compute environments. They explain how these elements interact within the AWS Batch service and the importance of workload-aware scaling. The discussion also introduces the idea of using AWS Batch in conjunction with workflow systems like Step Functions, Apache Airflow, and others, emphasizing the flexibility of AWS Batch in various use cases.

15:05

AWS Batch's Approach to Compute Scaling

The conversation shifts to AWS Batch's unique approach to compute scaling, particularly in contrast to other Kubernetes scaling solutions like Karpenter. The panelists explain that AWS Batch does not use Karpenter and instead relies on its own managed service for scaling compute instances. They discuss the rationale behind this decision, emphasizing the service's operational learnings and the need for tight integration between scheduling and scaling for optimal performance.

20:05

Launching AWS Batch on EKS

The panelists announce the launch of AWS Batch on EKS, detailing how it works with existing EKS clusters. They describe the process of integrating AWS Batch's scheduling and orchestration management planes into an EKS cluster, allowing for the submission and execution of batch jobs on Kubernetes nodes managed by AWS Batch. The explanation includes a visual demonstration of the workflow and the components involved.

25:07

Technical Deep Dive into AWS Batch on EKS

This section provides a technical deep dive into the specifics of running AWS Batch on EKS. The panelists discuss the types of workloads suitable for batch processing, the differences between batch jobs and long-running microservices, and the importance of data locality for cost and performance. They also address questions about multi-node job support and the demo showcases the practical aspects of using AWS Batch with EKS.

30:08

๐Ÿ› ๏ธ AWS Batch Compute Environment Management

The panelists explain the concept of a compute environment in AWS Batch, which is akin to a managed node group in EKS. They discuss the responsibilities of the end user in creating and maintaining the EKS cluster, while AWS Batch manages the data plane for running batch workloads. The conversation also covers the ability to scale the compute environment to zero and the integration of monitoring tools like Prometheus and Grafana.

35:09

AWS Batch's Support for Custom AMIs and Security

Addressing a question from the audience, the panelists discuss AWS Batch's support for custom AMIs and hardened images for EKS nodes. They explain how customers can provide a launch template that AWS Batch will use to launch instances, ensuring that the customer's security and compliance requirements are met. The discussion also touches on the managed service's ability to detect and handle errors in the scaling process.
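
To make the launch-template mechanism concrete, here is a minimal sketch; everything except the launchTemplate field mirrors an ordinary EKS compute environment, and all of the names are placeholders:

```python
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="hardened-eks-ce",
    type="MANAGED",
    eksConfiguration={
        "eksClusterArn": "arn:aws:eks:us-east-1:123456789012:cluster/my-cluster",
        "kubernetesNamespace": "my-aws-batch-namespace",
    },
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 128,
        "instanceTypes": ["m6i"],
        "subnets": ["subnet-0abc123"],
        "securityGroupIds": ["sg-0abc123"],
        "instanceRole": "my-eks-node-instance-profile",
        # Customer-owned launch template, e.g. one referencing a hardened AMI.
        "launchTemplate": {"launchTemplateName": "my-hardened-nodes"},
    },
)
```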

40:10

AWS Batch's Integration with EKS and Future Plans

The panelists discuss the motivations behind integrating AWS Batch with EKS, leveraging existing customer preferences and the widespread use of EKS. They also share future plans for AWS Batch, including potential support for persistent volumes and multi-node parallel jobs. Additionally, they mention the importance of customer feedback in shaping the service's roadmap and improving the user experience.

45:12

๐Ÿ—ฃ๏ธ Closing Remarks and Feedback Channels

In the final section, the panelists wrap up the discussion with closing remarks, expressing excitement about the potential applications of AWS Batch on EKS. They encourage audience members to provide feedback through various channels, including social media and AWS contact forms. The panel also highlights upcoming workshops and sessions at the re:Invent conference, offering opportunities for further learning and engagement with the AWS Batch service.

Keywords

AWS

AWS, or Amazon Web Services, is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments. In the context of the video, AWS is the provider of various services discussed, such as AWS Batch and Amazon EKS (Elastic Kubernetes Service), which are integral to the cloud solutions being explored.

AWS Batch

AWS Batch is a fully managed service designed to run batch computing workloads. It is optimized for batch jobs that need to run many instances of a job in parallel. In the video, AWS Batch is highlighted for its ability to run batch jobs on Kubernetes with the introduction of support for Amazon EKS.

Containers from the Couch

This appears to be the name of the show or series that the video is a part of. It suggests a casual, informative discussion about container-related topics, likely aimed at developers or IT professionals interested in cloud services and containerization.

EKS (Elastic Kubernetes Service)

EKS is a managed service provided by AWS that makes it easy to run Kubernetes, an open-source container orchestration system, on AWS without needing to manage the infrastructure. The video discusses the integration of AWS Batch with EKS to facilitate the running of batch jobs on Kubernetes clusters.

Batch Jobs

Batch jobs refer to a collection of tasks that are processed as a group, often used in computing for workloads that can be broken down into individual tasks. The video emphasizes the use of AWS Batch to manage and execute these batch jobs efficiently on Kubernetes.

Compute Environment

In the context of AWS Batch, a compute environment defines the infrastructure that AWS Batch uses to run your jobs. It specifies the type and amount of compute resources, like EC2 instances or EKS clusters, that can be used to run batch jobs. The script mentions configuring compute environments for use with ECS and EKS.

Job Queue

A job queue in AWS Batch is used to organize and manage the execution of batch jobs. Jobs are submitted to a queue, and the service manages the distribution of these jobs to the appropriate compute environment. The script discusses how job queues are connected to compute environments and how they hold job information.

Scalability

Scalability refers to the ability of a system to handle a growing amount of work by adding resources or by making the system more efficient. The video talks about the scalability of AWS Batch on EKS, emphasizing the service's ability to scale up and down compute resources based on job requirements.

Workload

A workload in the context of cloud computing and batch processing refers to the tasks or jobs that a system is designed to perform. The script mentions different types of workloads, such as microservices, batch workloads, and high-throughput workloads, and how AWS Batch is optimized for these.

HPC Services

HPC Services stands for High Performance Computing Services. In the script, one of the guests is introduced as a principal developer advocate on the HPC Services team, indicating a focus on services that cater to high-performance computing needs, such as scientific research or data-intensive applications.

ECS (Elastic Container Service)

ECS is a fully managed container orchestration service provided by AWS. It allows you to run and scale containerized applications. The video discusses the use of ECS in conjunction with AWS Batch for running batch jobs and how the introduction of EKS support expands the options for container orchestration.

KubeCon

Kubecon is a conference for the Kubernetes community. It is mentioned in the script as the event where support for EKS clusters with AWS Batch was announced, indicating an important milestone in the service's development and community engagement.

Highlights

Introduction of a new episode of 'Containers from the Couch' with Jeremy Cowan, a developer advocate at AWS.

Guest introductions including Sai Vennam, a developer advocate on the ECS product team, and new guests from AWS to discuss batch jobs on Kubernetes with AWS Batch.

Angel Pizarro, a principal developer advocate on the HPC Services team, shares his background in research computing and life sciences.

Jason Rupert, a principal engineer on AWS Batch, discusses his role in building the product on ECS and EKS.

The motivation behind offering AWS Batch for EKS is explained, highlighting the service's history and its focus on batch workloads.

Explanation of AWS Batch as a fully managed service for running batch and parallel jobs on EC2 instances with ECS and ECS Fargate.

Discussion on the unique approach of AWS Batch in managing compute resources for batch workloads, differentiating it from traditional on-prem infrastructure.

Introduction of AWS Batch's central resource concepts, such as compute environments and job queues.

The relationship between jobs, job queues, and running jobs with batch is clarified with a visual aid.

Historical context of batch processing and its relevance to modern industries, including finance and science, is provided.

AWS Batch's integration with workflow systems like Step Functions, Apache Airflow, and domain-specific workflow languages in life sciences and genomics.

Addressing the question of whether AWS Batch uses Karpenter for scaling compute instances and the explanation of its custom approach.

The announcement of AWS Batch support for EKS clusters and how it works with existing EKS clusters.

Demo of AWS Batch on EKS, showcasing the process of submitting jobs, scaling clusters, and monitoring with Grafana and Prometheus.

Exploration of the types of workloads suitable for AWS Batch, differentiating them from long-running microservices.

Discussion on AWS Batch's support for multi-node jobs and the distinction between ECS and EKS in this regard.

The managed service aspect of AWS Batch, emphasizing the abstraction of complexity and the focus on undifferentiated heavy lifting for batch workloads.

Questions from the audience about using AWS Batch on EKS with hardened images and the response regarding custom AMIs and launch templates.

Insights into the future of AWS Batch on EKS, including potential features like persistent volume support and multi-node parallel job types.

Conclusion and invitation for feedback from users, emphasizing the importance of community engagement for the development of AWS Batch.

Transcripts

00:00 [Music] Here we go.

00:28 Jeremy Cowan: Hello and welcome to another episode of Containers from the Couch. I'm Jeremy Cowan, a developer advocate at AWS, and joining me today are several guests. We have my colleague Sai Vennam here. Sai, do you want to quickly introduce yourself?

00:45 Sai Vennam: Hey folks, you may have seen me on Containers from the Couch before; I'm a developer advocate on the ECS product team. Today I'm excited to have guests from our sister teams on the AWS Batch side on the show to talk about a new solution they have. I'll pass it to them to introduce themselves.

01:06 Angel Pizarro: Hey folks, I'll go first. My name is Angel Pizarro, and I'm a principal developer advocate on the HPC Services team. I have a background in research computing, specifically in the life sciences, so most of the workloads and customers I work with are looking at a mixture of running microservices for some management, but a lot of their workloads are more along the lines of what we call batch workloads or high-throughput workloads, which we'll talk about and get into later today.

01:38 Jason Rupert: Hi everyone, my name is Jason Rupert. I'm a principal engineer on AWS Batch. I've been with Batch for about six years, helping originally to build our product on ECS, and now I help build it on EKS as well.

02:01 Jeremy: Great. As you could probably infer from our guest introductions, today we're going to be talking about running batch jobs on Kubernetes with AWS Batch. Now, AWS Batch has been in the market for a while; it has support for running jobs on EC2 instances with ECS and on ECS Fargate. Can you tell us a little about the Batch service itself and the motivation behind offering it for EKS?

02:36 Angel: Sure, I can take that one. Batch is a fully managed service for running batch workloads; like we said, it runs on Elastic Container Service, and it has since its inception. Actually, let me step back even further. The way folks run batch workloads, and workloads that are high scale, whether that's genomics (where I come from), financial risk analysis, or AI and ML training, is that you typically want to do the same item of work sometimes hundreds, sometimes millions of times. For that, you typically send all of those processes into something called a scheduler, basically a compute scheduler. Back in the day, when folks were running their own on-prem infrastructure, they were basically capped on that shared on-prem resource, so the scheduler needed to balance its compute resources across a lot of different groups that all needed a lot of work done. Enter the cloud, where we have a lot of scalability and just-in-time allocation of resources: scheduling becomes a different thing for batch workloads. That was the original inception of Batch, and Jason was there, so Jason, maybe you can give us a little of the history of what happened.

04:19 Jason: Yeah, a common theme was: how can we take undifferentiated heavy lifting and remove it from customers who may have been running a queue and working through work items on EC2 compute. At the time we built it, containers had obviously been taking off over the years, so we decided that since folks like to package in containers, we would support that and run it on ECS. ECS is also a managed AWS service, we were building a managed AWS service, and we overlaid the two on EC2 to orchestrate these workloads: put the work in jobs, turn the jobs into ECS tasks, and run them as containers. One theme that will probably come up a lot in this conversation is that we do workload-aware scaling for the compute. We look at the job queues, the jobs, and the requirements of the jobs, and make aggregate decisions on how to scale compute up and down for those jobs. That's something we launched with originally, and it has been a big part of the service. Over that time we very much learned that if you want to schedule and scale and be workload aware, those two intelligent operations need to be coupled together, at least the intelligence of them. Interface-wise, and in how we architect the sub-services internally, there are clean interfaces between them and they run as separate sub-services, but they really need to know what each other is doing to be efficient and optimal. So that's what we built then, and we approached the EKS support in the same way. I'll stop there.

06:25 Jeremy: You introduced a couple of terms there that folks who are new to batch workloads might be unfamiliar with, like jobs and job queues. Can you quickly explain what those are and how they relate to running jobs with Batch?

06:46 Angel: Yeah. If you can share my screen, I have a quick slide that makes this easy to point out. Basically, Batch has four central resource concepts, and we can start from the lowest layer, the compute, and work our way backwards. A compute environment is the representation of the types and amounts of resources that you want to make available to your jobs. Essentially, it defines the minimum and maximum number of CPUs a cluster could have, and it also says what the target container platform is; this is where you would specify ECS on EC2, or Fargate, or an EKS cluster. I've got a demo later on that will show how these things are connected together. Stepping up from that, you have a job queue. That's where you submit all your work, and a job queue is connected to one or more compute environments. You can imagine the job queue holding all of the information Jason was mentioning: how many jobs there are, and for each job, how many CPUs and how much memory it needs, whether it uses a GPU, and what architecture you're looking at, AMD or Intel. It looks at the aggregate requirements and sends that over to the compute environment, and the compute environment decides how to scale in order to do two things: one, maximize the throughput for the things that are in the queue right now, and two, do it at the right cost. Batch is optimized for both, balancing throughput and cost.

08:38 Sai: Sorry, I just wanted to interrupt real quick and say: I'm looking at these terms on this page, in this architecture, and a lot of them are so reminiscent of batch processing back in the day. Companies would set aside time nightly to run batch jobs, with a limited set of resources and an environment to work with; they'd have a mainframe, and it was critical to use that nightly window on that compute environment to run jobs. Now I'm looking at this architecture, and a lot of these terms, like compute environment (that's the mainframe back in the day), jobs, and job queues, feel really reminiscent of the batch processing that banks did back then, right?

09:32 Angel: That's exactly right, and we have banks doing batch processing today on Batch; it's batch and turtles all the way down. But yes, that's exactly where the requirements came from: folks really still need to do batch processing, and it's not just in "boring" industries like finance. It's also in science and other industries at large. So you have these very similar concepts: your queue, the compute you run on, and your job templates and job scripts that you run again and again. All of that is in AWS Batch today. One thing we'll talk about later is that because these are general concepts, there's a whole host of other types of schedulers and resource allocators, and then a layer above that there are workflow systems, which use something like Batch as the leaf node of an execution. Think of a workflow that manages the full life cycle of a machine-learning model-training pipeline: each individual step might need a different number of CPUs, a different architecture type, or accelerators. You don't want to put all of that into one job; you want to define each specific part of the workflow tightly, so you have the exact amount of compute you need for it, and then tie everything together at the higher level. Batch is the leaf node that takes the requirements for each individual step. So Batch can work with a workflow engine like Step Functions, as it does today, and in fact other very popular workflow engines have plugins for Batch as well, including Apache Airflow and Flyte, and there are a lot of domain-specific workflow languages, especially in the life sciences and genomics, that talk to AWS Batch.

11:56 Jeremy: Now, a couple of questions have come in while we've been talking, and Angel, you and I anticipated this first one when we were talking offline: does Batch use Karpenter for scaling the compute instances? I think you took a different approach with Batch. Jason, do you want to take that one?

12:22 Jason: Yeah, I will. To answer the question directly: no, Batch does not use Karpenter under the hood. And working back from that, Batch doesn't require you to install any controller, custom resource definition, or operator just to get the basic functionality out of it. We are an overlay on top of EKS, so we've taken a somewhat different approach to the system than some of the other projects the Kubernetes community might have seen that do batch. We did that based on what made sense for our managed service, as a continuation of an overlay that was already working on ECS and could work on EKS. One of the big driving factors was that we run our own scheduler. The kube-scheduler is a little bit involved, but we mostly bypass it, and we do that because we already have an existing scheduler and scaling system in Batch, so we get to reuse a lot of it, along with many years of learning from operating it. We inspect the job to decide where it should go, we scale nodes based on that, those nodes are added to the Kubernetes cluster, we place the work onto those nodes, and when the work is done we scale them down. Again, that goes back to the theme: if you want to do this kind of workload right, the scaling and the scheduling need to be coupled together; they need to know what each other is doing and planning to do. Karpenter itself has some of that built in as an open source framework, and it's certainly great at what it does. For our managed service, this was simply the model that fit best, for those reasons.

14:47 Angel: And I think the key phrase there is managed service. When you look at Karpenter today, it's deployed within your cluster as a service that you run. Batch is just there as an API endpoint you send jobs to, and it integrates with your EKS cluster. And, as Jason mentioned, we bring things we've learned from operational experience. As Andy Jassy used to say before he went to the other side of the wall, there's no compression algorithm for experience, and our scaling and operations team really has six years of it, running on AWS global infrastructure. I think we're about to publish this: our current biggest public run was over a million vCPUs concurrently for a single workload, running across six regions or so, and we've since beaten that by four to five times and are about to beat it again. These are huge workloads that we're scaling across regions; they are separate Batch instances in each region, but you can still look at the operations and management of the fleets across entire regions. If you want to do that yourself within your clusters, you have the option of Karpenter, which is great. We feel that as a managed service we have a lot to offer in relieving that undifferentiated heavy lifting.

16:30 Jason: And to add to that: those large runs are on Batch on ECS right now. This is our first release of Batch on EKS, and we're working to build up that same scalability with EKS as well. We're sort of humbled by what we've been learning; as you said, we're a sister team to the Kubernetes team at Amazon, and we learn a lot from them. We're going to continue to iterate and apply the things we've learned in the past, plus the things we're learning about Kubernetes as we move forward, based on the feedback customers give us and what we learn ourselves.

17:19 Sai: So we've hinted at this a lot, and hopefully we've answered the question about Karpenter, but we've been beating around the bush. Batch has been supported on ECS for a while now, and obviously, from the title of this show and what we've been talking about, we've got something new to talk about today, something released recently. So let's talk about it: what do we have here that users can start doing today?

17:46 Angel: Right. What we released at KubeCon (thanks for that, Sai; I've been talking about this for two weeks since KubeCon) is support for EKS clusters with Batch. The way this works is that we work with your existing EKS cluster and hook our scheduling and orchestration management planes into it. There's a nice animation I did for the blog post; let me see if this works, because walking through it will be useful. Do you see the initial animation? Great. This is how it works. Imagine you have your existing EKS cluster: you have your EKS management plane with the main hosts and etcd, and you might have some workloads stretching across AZs in EKS managed node groups, scaled with the Cluster Autoscaler, Karpenter, or another technology. A data scientist comes in and sees that there is a compute environment and a job queue attached to that EKS cluster, so they start submitting jobs into the job queue. The compute environment sees that it has jobs ready to go, so it stands up its own Batch-managed Auto Scaling group, outside of what is controlled by the EKS management plane, and starts launching instances. Since these instances are pre-configured with the EKS optimized AMI, and we tell them where the control plane is, the kubelet starts doing its thing and reports that the node is ready for daemon sets. Daemon sets that you've configured correctly, with the proper tolerations, are placed on the instance, and then it gets an instance-ready signal, at which point Batch starts placing jobs directly on the instances it launched. Batch already knew about the pods that were going to be created; part of the job definition is the specification of what container to run and all of the pod limits and requests. Today we're using the node name as the mechanism for placing those pods exactly; we might shift that later as we learn more about Kubernetes and really stress how fast we can put pods onto an instance. But generally, you just keep scaling until there's no more work, and you can have multiple compute environments.
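
From the user's side, that entire loop is driven by ordinary Batch API calls. Here is a minimal sketch, assuming a job queue and job definition like the hypothetical ones earlier on this page, that submits one job and watches it move through the lifecycle while Batch brings up a node:

```python
import time

import boto3

batch = boto3.client("batch")

# Submit one job to a queue attached to an EKS compute environment.
job = batch.submit_job(
    jobName="hello-eks",
    jobQueue="my-eks-queue",
    jobDefinition="my-eks-job-def",
)

# Typical progression: SUBMITTED -> RUNNABLE -> STARTING -> RUNNING -> SUCCEEDED.
while True:
    status = batch.describe_jobs(jobs=[job["jobId"]])["jobs"][0]["status"]
    print(status)
    if status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(15)
```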

20:41 Jeremy: Great. We've been talking about Batch for a while now; is it possible for us to get a look at it?

Angel: Sure, why not. Let me bring up my screen here.

21:00 Sai: Very quickly, there's a question I want to address before we jump into the demo. It asks: can I build simplified MapReduce types of architectures? Angel, I want to take a crack at this first, and you let me know if I'm right. I'm guessing the idea is that Batch is a way to run jobs, as long as you can simplify the work down to a job. MapReduce is a framework for dealing with large structures of data, so if you can reduce the work down to a job, then yes, you can run it on Batch on ECS or EKS; there's really nothing limiting you there, right?

21:43 Angel: Well, yes, there's nothing limiting you; of course, the devil is in the details. If it's a single process and you just want it running on a single instance, you say, give me something with a couple hundred CPUs, and it will run on that one instance as a threaded model. If you're working with something really large, where you need to access individual objects in S3 and do some mapping and sorting, then you can take advantage of job dependencies. Two advanced features that Batch provides help here. The first is array jobs: do this thing 10,000 times, for instance, as a single API request, and each individual job in that 10,000-wide array gets an index. That index is an environment variable that says "I am index 10", so you can grab the tenth chunk of data to do the map step with, go about your business, and write your output. Then, at the tail end, for the reduce function, you define a dependency on the map job you submitted: don't start until all of the results from the map process are ready for the reduce step. So you can do it at the level of Batch. Another way, as we mentioned before, is workflow systems, which do that automatically; they wouldn't send the second, reduce job until it was ready to be sent, because they have the results coming back.
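
On the worker side of that pattern, each child of an array job can locate its slice of the input through the AWS_BATCH_JOB_ARRAY_INDEX environment variable that Batch sets automatically. A tiny sketch, with a purely hypothetical bucket layout:

```python
import os

# Batch sets this in every child of an array job: "0", "1", ..., "9999".
index = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])

# Hypothetical layout: the input was pre-split into one S3 object per index.
input_key = f"s3://my-bucket/chunks/chunk-{index:05d}.json"
print(f"map worker {index} processing {input_key}")
```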

23:15 Sai: Right, that makes sense. I'll bring your demo back up, but one thing I want to point out is the type of workload we're talking about. We keep calling them jobs, they're running on Kubernetes, and you showed in that architecture that each one is a container. Generally on this show, when we talk Kubernetes, we think pods, long-running workloads, applications, back-end services, front-end services. That's different from what we're going to demo today, and from what Batch was really made to handle, right?

23:43 Angel: Yeah, it is, and maybe I'll put this up real quick. This is what you were describing. In a microservices world, you have services that need to span availability zones, have replicability, often with response times in the milliseconds, and they run indefinitely. Batch is different. For one, you want jobs packed onto instances for throughput and cost savings, and when a workload is data heavy, you also benefit a lot from having the data resources local to the AZ, so spreading across AZs is not always the most optimal pattern for batch jobs.

24:35 Sai: Yeah, that makes sense: high availability is not the point there, and I like that you put that on the slide, because you're really just concerned with getting the job done. Now, we did get a question related to this: does AWS Batch support a multi-node option? We'll get to the second part of the question in a moment.

24:54 Angel: This is asking about the case where you have a process with a main node and sub-processes doing work, and they may need to communicate with each other. In the strongly coupled, high-performance computing world, these are MPI (Message Passing Interface) jobs; you can also think of task workers coordinated by a head node. For Batch on ECS today, there is a multi-node parallel job type. It isn't there yet for EKS, but we're really trying to get it out the door as fast as possible, because we know a lot of machine learning workloads leverage that design pattern, which actually comes from HPC land.

25:40 Sai: Got it. For the second part of that question, how about we revisit it; I know we've got a demo to show, and maybe we can cover it as part of the demo.

Angel: Absolutely.

25:49 Angel: So here's your EKS cluster. I have Grafana metrics coming in through Prometheus, and you can see we have some pods here. Where's the namespace... I define a namespace here for Batch. As for the nodes that are up today, there's a node group managed by EKS, and just for the demo's sake I also have a node that was added by the Batch compute environment, which I'll show in a second. That's just so jobs start right away and we're not waiting the minute or so it takes for a node to come up; you can set the minimum cluster size to zero for Batch.

26:37 Sai: And just to be clear about these managed node groups: yes, the groups are managed by EKS, but the user who wants to use Batch on EKS has to actually create them, and is responsible for them?

26:52 Angel: No, no, let me back up a step. Here we go: this is a compute environment that's up today, but say you want to create one in the console. You'd say, here's the Kubernetes service, you name it, you give it the instance role (I forget which one it is for my cluster, so we'll just pick one real quick), the EKS cluster, and then the namespace to launch into. I'm not going to start this one, because I have one ready, but the gist is that Batch now knows where your EKS cluster is, and it knows the instance role to use for the nodes and pods. It will go ahead and create the Auto Scaling groups if one isn't already defined for the compute environment, so it's managing that completely. The things you as the EKS cluster owner need to give Batch are two: the ARN for the cluster, and the service role. Also, this API endpoint needs to be public right now. Access is authenticated through IAM; we use the service account mapping and config authorizations that you use for every other AWS service integrating with EKS. We use that public endpoint to find out more about the cluster, for instance which Kubernetes version it runs, store that information, and then use it as needed for launches.

28:54 Jason: To add on there: you can think of Batch's compute environment as Batch's version of a managed node group. We don't use that term, just to avoid confusion, but if you have your EKS cluster, you might have self-managed nodes, maybe your own ASGs, or maybe a managed node group running some microservices, maybe the DNS service. If you add Batch by adding a compute environment, Batch is going to manage its own nodes for the jobs in the job queue, as a managed orchestration. They'll look like self-managed nodes in EKS right now, but they're managed by Batch, and Batch will add and remove them through ASGs. You can look under the covers and see that; if you go to the Auto Scaling console, the EC2 console, or the EKS console, you're obviously going to see those resources. But hopefully you don't have to look at them much, because Batch is doing its job. So again: think of Batch's compute environment as Batch's version of a managed node group.

30:23 Sai: Now let me summarize. The end user is responsible for creating an EKS cluster. They pass the cluster ARN and the role used to access the cluster to Batch. Batch then creates the data plane, its version of a node group, that will actually run Batch workloads. So while the end user is still responsible for the cluster itself, they're not responsible for the data plane where the workloads actually run. And if the user wanted, they could feasibly create other managed node groups in that same cluster and do other things there as well; it's just the one node group Batch is using that is really for Batch to handle and manage, and that complexity is abstracted from you, because Batch is handling the management of that group.

31:14 Angel: Yeah, and in fact the cluster I have right now has a managed node group: the node group running the Prometheus and Grafana services that you see on the left side there.

31:27 Sai: And then we start to see why it makes sense that you took this approach of having the user create the cluster: some management capability is left with the user, while you still abstract away the complexity of the data plane itself. We've got a question here that tees into this: what was the motivation to use EKS for Batch, instead of just using Batch-managed resources?

31:53 Angel: Because a lot of customers have been choosing EKS for their workloads and standardizing on it; it's really as simple as that. A lot of them came to us, through the teams that were supporting them, asking us to support batch workloads. The scaling and operations of batch workloads on Kubernetes are really very different from microservices, for the reasons we covered before, and a central tenet of Batch is to remove that undifferentiated heavy lifting for batch workloads specifically. So we felt we could offer a lot to customers who are trying to run these workloads on Kubernetes today, and we'll see what the feedback is once folks really start hammering on it.

32:41 Sai: Awesome, and we're getting really great questions. I don't want to derail the demo, but one more question that I think is critical: is the data plane running in the customer's VPC?

32:52 Angel: By data plane, do you mean the node group of the EKS cluster?

Jason: Yeah, the node group; I'll take this one. Again, we're running our version of a node group, and we are launching the resources into the customer's VPC. We don't have direct access to the customer's jobs or resources; we orchestrate bringing the node up and then submit a pod to the Kubernetes API server, which then runs on that node. But yes, the nodes are in the customer's VPC.

Angel: That's correct, exactly.

33:32 Sai: So, we have a couple more great questions, but I want to put those on hold. Let's get to the demo, and I think the demo will probably address the questions about Grafana and Prometheus and the scraping.

play33:43

um where are we uh this is not the one I

play33:45

want uh right compute environment you'll

play33:49

see some things here

play33:51

um basically uh Let me refresh this

play33:53

because I think this is a whole

play33:55

um yep uh so uh I set a minimum number

play34:00

of bcpus to four and a desire and like

play34:03

desired State here is is uh basically

play34:06

what batch once the um what's the

play34:08

cluster to go to and a maximum of 128

play34:10

across a fleet of instances right uh and

play34:13

I set the m6 I family and these are just

play34:17

subnets and Security Group to launch us

play34:19

right like like somebody just said yes

play34:21

it is launching in in uh in in your in

play34:24

your VPC and in your resources the what

play34:27

we see on our management plane are what

play34:29

was the job request right so you said

play34:32

you you know here's the job definition

play34:34

template and and the things that you're

play34:36

providing to the API so that we can make

play34:38

those uh scheduling and scaling

play34:40

decisions
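For reference, a compute environment like the one in the demo can also be created from the CLI. The sketch below is a minimal, hedged example: the environment name, ARNs, subnet and security group IDs, and namespace are all placeholders, and the full set of required fields (for example the instance role) is spelled out in the Batch on EKS getting started guide.

    # Minimal sketch: create a managed Batch compute environment that
    # targets an existing EKS cluster (all IDs and ARNs are placeholders).
    aws batch create-compute-environment --cli-input-json '{
      "computeEnvironmentName": "demo-eks-ce",
      "type": "MANAGED",
      "state": "ENABLED",
      "eksConfiguration": {
        "eksClusterArn": "arn:aws:eks:us-east-1:111122223333:cluster/demo-cluster",
        "kubernetesNamespace": "aws-batch"
      },
      "computeResources": {
        "type": "EC2",
        "minvCpus": 4,
        "maxvCpus": 128,
        "instanceTypes": ["m6i"],
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "securityGroupIds": ["sg-cccc3333"],
        "instanceRole": "arn:aws:iam::111122223333:instance-profile/eksInstanceProfile"
      }
    }'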

play34:41

[34:44] A couple more things that aren't shown well here, but they matter. You did not specify this on creation of the compute environment, but we inspect the cluster and keep track of what version of Kubernetes it's running. So if you upgrade your cluster to a newer version of Kubernetes, the compute environment is continually checking whether the cluster is still a valid target that we're going to launch into, and if those two versions diverge, this status will go to INVALID. So when you update a cluster, you have a follow-on responsibility to tell Batch "I've updated my cluster; here's the new version" through an UpdateComputeEnvironment call, and then we'll check that it's valid and it'll go back to VALID again.

[35:35] Is that so that you can provision instances using the right AMI?

[35:40] Yeah, that's right. And we do support divergence, just like a managed node group: if you update your EKS control plane to a new version, you then go update your managed node group to that version, and it's the same thing happening with Batch's compute environment. You can say, okay, I want to go from 1.22 to 1.23, start using that version, and we'll pick out the right AMI for the right instance type, the EKS-optimized AMI. That's how we're using that there.
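The follow-up call Angel describes might look roughly like this hedged sketch; the environment name is a placeholder, and the exact fields for Kubernetes version bumps are documented with the UpdateComputeEnvironment API, so treat this as an approximation.

    # After upgrading the EKS control plane (for example 1.22 -> 1.23),
    # tell Batch about it so it can pick a matching EKS-optimized AMI.
    aws batch update-compute-environment \
      --compute-environment demo-eks-ce \
      --compute-resources '{
        "updateToLatestImageVersion": true,
        "ec2Configuration": [{
          "imageType": "EKS_AL2",
          "imageKubernetesVersion": "1.23"
        }]
      }'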

play36:19

[36:19] So here's what we have. Here's the Batch dashboard, and I have a couple of job definitions here. There's this one, which is a simple Python program that I have in a private ECR repository, and it's calculating pi. If you look at the job definition, it uses a pod property, a service account name, that has access to both S3 and ECR. This is different from the default, because this is a private repository. It only does a thousand iterations, so it takes about 10 seconds.
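A job definition along those lines could be registered like the sketch below. It's an illustrative example rather than the exact demo artifact: the definition name, image URI, command, resource requests, and service account name are all assumptions.

    # Minimal sketch of an EKS job definition that runs as a named
    # Kubernetes service account (all names and the image are placeholders).
    aws batch register-job-definition --cli-input-json '{
      "jobDefinitionName": "calculate-pi",
      "type": "container",
      "eksProperties": {
        "podProperties": {
          "serviceAccountName": "batch-s3-ecr-sa",
          "containers": [{
            "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/calc-pi:latest",
            "command": ["python3", "pi.py", "--iterations", "1000"],
            "resources": {
              "requests": {"cpu": "1", "memory": "512Mi"}
            }
          }]
        }
      }
    }'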

play37:08

[37:08] So we could go ahead and submit this job and call it "Jobby Job McFace" or whatever, and pick the Batch queue it hits. This is the job dependencies feature I was referring to before: if you have a dependent job you can set it up in the console, and you can also do it through our APIs. You can optionally override some things. If you want this to be a much higher priority than other things in the queue, you can set a higher scheduling priority, as in "I need this thing now, in the first available slot". Other things are job attempts: if there's a retry, say after a Spot reclamation, maybe you want to retry that. And there's a retry strategy: if there was something wrong with my application, say my container just had bad data going into it, then actually don't retry; I really only want to retry if I get interrupted by Spot. So you have these different conditions. You can also override the command. And the best one is... where are we, next page... oh, okay, sorry, go back here. I forgot to erase those. I'm going to do this 100 times. Maybe I'll do it a thousand times, sure.
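Submitting that run from the CLI might look like the hedged sketch below. The array size gives the "do this a thousand times" behavior, and the evaluateOnExit clauses approximate the "retry only on Spot interruption" policy Angel describes; the queue and definition names are placeholders, and the status-reason pattern is an assumption.

    # Submit a 1,000-wide array job with a conditional retry policy:
    # retry when the host reclaimed the instance, otherwise exit.
    aws batch submit-job \
      --job-name jobby-job-mcface \
      --job-queue demo-eks-queue \
      --job-definition calculate-pi \
      --array-properties size=1000 \
      --retry-strategy '{
        "attempts": 3,
        "evaluateOnExit": [
          {"onStatusReason": "Host EC2*", "action": "RETRY"},
          {"onReason": "*", "action": "EXIT"}
        ]
      }'
    # A dependent job would add: --depends-on jobId=<parent-job-id>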

play38:23

[38:25] So it's going to do this thing a thousand times, and the reason I'm doing that is so I can scale the cluster, and you can see the CPU usage coming out on the other end after a bit. So, submitted. This is what you get: information about how many jobs are runnable, how many are starting, successes and failures. We'll check on it, but it's essentially as easy as that. Now this thing is pulling the image off of ECR and it's going to be launching instances. These are the job details right here.

[38:59] If you go to the compute environment that we had before, you'll see that it's still at the same desired vCPUs, because with batch processing you don't want immediate scaling like you would with microservices that really need fast response times. You can be a little bit lax about how quickly you provision resources. It's a trade-off, right? If you want jobs to start right away you can have warm capacity, or maybe you want a certain percentage of warm capacity. Beyond that, you can let Batch scale up and down if you're willing to accept the lower cost and a little bit of delay on startup.

[39:49] Right, and you can see that pod memory usage is starting to go up in Grafana right here. Basically, you can do a pre-warm by setting desired capacity higher (you'll see Batch update this in a minute or two) or by setting minimum vCPUs higher, and it'll immediately start launching instances. But that's it. It's as simple as that.
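As a hedged sketch, the pre-warm knob (and the scale-to-zero behavior that comes up later in the Q&A) is just the minvCpus/desiredvCpus settings on the compute environment; the environment name here is a placeholder.

    # Pre-warm: keep at least 16 vCPUs of capacity around.
    aws batch update-compute-environment \
      --compute-environment demo-eks-ce \
      --compute-resources '{"minvCpus": 16, "desiredvCpus": 16}'

    # Scale to zero when the queue is empty: no Batch nodes, no cost.
    aws batch update-compute-environment \
      --compute-environment demo-eks-ce \
      --compute-resources '{"minvCpus": 0}'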

play40:10

[40:13] These aren't very long jobs (they take about 10 seconds each), so you can see 63 of them have already completed before we've even finished scaling out. And where is Batch getting that information? Is it querying the Kubernetes API to get the status of the jobs?

[40:30] Yeah, I'll take that one. In our getting started guide, which should be linked here, the customer grants RBAC permissions for Batch to basically watch pods and nodes. With those permissions, we take a job from the job queue and turn it into a pod, then we watch the state transitions it goes through in the Kubernetes cluster, and we update the job on our side to match.
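For flavor, the kind of RBAC grant Jason is describing looks something like the sketch below. This is not the exact manifest from the getting started guide (which is the authoritative source, including the aws-auth mapping for the Batch role); the role name and rules here are illustrative.

    # batch-rbac.yaml: illustrative ClusterRole letting Batch watch and
    # manage its pods and nodes (name and rules are placeholders; see the
    # getting started guide for the real manifests).
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: aws-batch-cluster-role
    rules:
      - apiGroups: [""]
        resources: ["pods", "nodes", "namespaces"]
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "patch"]
    # apply with: kubectl apply -f batch-rbac.yaml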

play41:00

[41:04] Okay. And let's say things go horribly wrong and you need to troubleshoot. Is there a way for me to view the logs?

[41:11] I've got you right here. These are the log groups going into CloudWatch Logs. I deployed the Fluent Bit log collector as a DaemonSet; you can use whatever log aggregator you want on the back end.

[41:28] Okay, so when you were mentioning earlier how you can schedule DaemonSets onto instances provisioned through Batch, this would be a good example of where you might want to run something like Fluent Bit.

[41:41] Exactly. And if we go here and say kube-control (I refuse to say "kube-cuddle"): get nodes, all of them. You can see what's running across these nodes. Get pods across all namespaces, and you'll see there's a bunch of stuff happening here, and here are the Batch nodes. Where's the Fluent Bit... here are the Fluent Bit pods across all the instances, including this one. There were four in the node group, and then, I guess within the last 68 seconds, this one right here is new: it just got added by Batch, and Fluent Bit got deployed to it before we started running jobs on it.

[42:36] The Auto Scaling group, right? That additional node that's come up, and then obviously...

[42:42] Exactly. It's just launched another instance here that you see showed up.

[42:46] Wow, okay.
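The inspection Angel is narrating corresponds to ordinary kubectl commands; the DaemonSet namespace query below is an assumption, since the demo doesn't show where Fluent Bit was installed.

    kubectl get nodes                            # Batch-launched nodes appear alongside the managed node group
    kubectl get pods --all-namespaces            # job pods plus Prometheus, Grafana, Fluent Bit
    kubectl get daemonset -A | grep -i fluent    # one Fluent Bit pod per node, new nodes included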

play42:50

[42:53] And if you look at the compute environment here, it had a desired vCPUs of... let's see what it does... and now it's saying "I want my full fleet", because it realizes there's a bunch of work in the queue, so please scale up as much as you can. I don't have that many resources available in this specific account, so it shouldn't scale much beyond what we have, but it's saying: give me as much as you can, as fast as possible.

[43:17] So, Angel, could I scale a managed node group to zero and then depend on Batch to scale it up when I need it?

[43:25] Absolutely. That's the minimum vCPUs setting again. I set it above zero just so we wouldn't have to wait those two minutes for jobs to start, but you can set it to zero and it'll scale to zero, and you won't have any Batch nodes when you have no Batch jobs.

[43:43] But just to add on to that, for clarity: that's for Batch, for the batch workloads in the job queue. If you have web services or microservices running as Deployments and ReplicaSets on your cluster, Batch is not scaling for those. For those you can use Karpenter, or the Cluster Autoscaler with the managed node groups that EKS provides. Batch is very focused on playing nicely within the cluster and launching Batch nodes for Batch workloads. Batch will not place its pods on nodes that don't belong to Batch, at least in this first version. Maybe someday we support that, but the current version does not step outside of the allocation it's doing.

[44:32] There you go. So here I just searched for "batch", but you can search on the job ID itself, and it will show the specific log for the container that had an error, along with a bunch of information in CloudWatch.

[44:54] Nice.
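Searching CloudWatch Logs by job ID also works from the CLI, per this hedged sketch; the log group name depends entirely on how the Fluent Bit output was configured, so treat both the group name and the job ID as placeholders.

    # Find log events for one Batch job by its job ID
    # (log group name and job ID are illustrative).
    aws logs filter-log-events \
      --log-group-name /aws/eks/demo-cluster/fluent-bit \
      --filter-pattern '"d1b8c19a-1234-5678-9abc-def012345678"'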

play44:55

[44:58] Very cool. It could be a good time to go through some questions; we've got a lot of great ones coming up in chat. One question here from tesser wrecked (I love that tag): if they wanted to use Batch on EKS, they're implying you would have to run on EC2 with Amazon-owned AMIs in the EKS cluster. That's a non-starter for them: they only run hardened images for EKS nodes, and that policy would be violated by running an unknown AWS-built Batch AMI. I think you probably have a good answer to this one, right, guys?

[45:36] Yeah, I'll take this one, because I do want to expand on it a little bit as well. We do support custom, customer-provided AMIs as overrides, and we support that through launch templates. A customer can provide us a launch template, and the launch template really allows them to customize their nodes: they can override user data, which we will merge into ours to make sure things keep functioning right for our service. We take that launch template and make what we call a managed launch template out of it, which is a merge of their configuration and our configuration, and we launch instances with that. That's the way a customer can provide an AMI. As long as the node joins the cluster and is reporting, and Kubernetes is healthy and doing its thing, that should work.
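A hedged sketch of that pattern: pin a hardened AMI in a launch template, then reference the template from the compute environment at creation time (the AMI ID and names are placeholders).

    # 1) Launch template pinning a hardened AMI (ID is a placeholder).
    aws ec2 create-launch-template \
      --launch-template-name batch-eks-hardened \
      --launch-template-data '{"ImageId": "ami-0123456789abcdef0"}'

    # 2) Then, inside "computeResources" when creating the compute environment:
    #    "launchTemplate": {"launchTemplateName": "batch-eks-hardened"}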

play46:37

[46:39] And we do have some mechanisms to help customers out, too. This is where the six years of Batch comes into focus. If you look at some autoscaling systems, when things aren't working they'll just leave the capacity sitting around, and you can burn through some money if you're not paying attention to that. We've iterated over the years: say you have a bad AMI, or a bad configuration of that nature. We'll let it run for a little bit so you can debug it, but we'll eventually scale it back down and invalidate the compute environment, giving you a message that says, hey, we noticed something is wrong with the cluster or with the compute environment: maybe it's a launch template, maybe it's a network config, maybe it's RBAC. I'll stop there.

[47:30] Yeah, and this is really where that managed service comes into play: we do stop scaling and sending jobs, because we have seen errors like this before. Making things stop scaling when you're seeing repeated errors is something you would have to implement yourself if you were deploying your own solution.

play47:56

[47:59] Awesome. And folks, by the way, let's keep pounding these guys with questions; we've definitely got ten more minutes here, so if you have any more, feel free to drop them in chat. Another interesting one here from clever Maya: what they don't get is why not create a Batch control plane instead of these integrations. I'm guessing they're referring to the integration with EKS specifically. I want to answer this very quickly and argue that there is a Batch control plane, and I think you are abstracting a lot of the complexity, from what I've seen. But I know we asked this a little bit earlier; maybe a little more detail on why the EKS integration is built the way it is today?

play48:39

[48:42] Is this question really asking why we aren't providing a resource provider that you deploy within EKS? I mean, I guess... you know, it's a managed service; it really boils back down to that. Maybe, Jason, you can take this one.

[49:03] Well, I think you could answer this in two ways. As already put forth, we do have a managed control plane; it's just hidden. Now, if you're talking about the EKS control plane and us creating the cluster on your behalf, which might be what this question is also asking, we chose in our first version not to do that, given that customers already have pretty opinionated ways of doing Kubernetes. That felt like stepping outside the bounds of batch workloads and into what the customer's organization and compliance are doing for their clusters, and we just want to work nicely within that. Although we'll certainly consider feedback on whether we should also be creating clusters for customers. But the control plane part, like scaling nodes for the batch workload: we're handling that, and it is hidden from you. You give us configurations and constraints for those scaling operations within what our resource calls a compute environment, and that part is our control plane. I'll leave it there; I think those are a couple of ways the question could be answered.

play50:22

[50:25] Absolutely. And honestly, we got an answer here from one of our guests as well: the EKS integration is nice because you can leverage Kubernetes capabilities and open-source tools. Here I'm guessing we're using Prometheus and Grafana to scrape metrics, which really is only possible if Batch gives you access into that cluster to configure and manage these tools. And of course, as I think Angel said at the beginning, with so many AWS customers using EKS, it made sense to offer Batch as an angle for them to use it. Let's keep going down the questions. Folks want to know how they can get started. Are there any blogs or workshops available, or sessions at the upcoming re:Invent? Anything we can share here?

play51:11

[51:14] Yeah, absolutely. I think you guys have links to the docs for the getting started guide. There's also a self-paced workshop where I have a few simple examples. We're going to be giving that workshop at re:Invent; it's an embargoed session right now but should be out this week, so look for CMP335 in the catalog in the next couple of days. Jason and I also have a chalk talk session, so if you want to put some direct and more pointed questions to Jason, come to Containers, I believe CON309. And there's a general Batch talk as well. So there's a workshop area I'm at, there's a chalk talk session, and there's a general Batch breakout session at re:Invent. In terms of what you can do self-paced, our documentation should be a good place to go, and there is a post on the AWS News Blog (Jeff Barr's blog) about the feature, with pointers to the workshop and the documentation, so that might be the quickest way to get to everything.

[52:20] Nice.

play52:23

And what's next for AWS Batch? What are you looking at doing in the future with EKS? You talked about potentially provisioning whole clusters instead of relying on a customer...

[52:38] The provisioning-whole-clusters idea is something we really need to be careful with, because it was a core design choice to leverage existing clusters in the customer's account, within that shared responsibility model. If we see enough feedback that it's something we should revisit... that's actually a major feature, and it wouldn't come out anytime soon, because by managing Kubernetes clusters we'd essentially become EKS, right?

[53:05] I think we'd really want to do that carefully, and if we did do it, it would be more of a collaboration with the other service teams at AWS. Things we are looking at closely, though, are multi-node parallel workflows, and looking at what early customers are trying out and finding problems with, to get that back onto our roadmap. One thing we don't support today is persistent volumes in the way that Kubernetes wants. There is a way, through launch templates, where you can mount parallel file systems or other storage that matters for batch workloads, and then do host volume mount support through the pod, but that's sub-optimal and it's not really the way Kube folks are used to working. So we're looking at persistent volume support as a near-term feature release.
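The interim host-volume approach Angel mentions might look like the fragment below inside an EKS job definition. It's a hedged sketch, assuming the launch template already mounts a file system at /fsx on the host; all names, paths, and the image are placeholders.

    # Fragment of a register-job-definition payload: expose a host-mounted
    # file system to the job's container (illustrative only).
    "eksProperties": {
      "podProperties": {
        "volumes": [
          {"name": "shared-fs", "hostPath": {"path": "/fsx"}}
        ],
        "containers": [{
          "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/calc-pi:latest",
          "volumeMounts": [
            {"name": "shared-fs", "mountPath": "/data"}
          ]
        }]
      }
    }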

play54:12

[54:15] And we're also, I think, going to learn a lot from our customers. We want to work backwards from them and hear what feedback they have after they've tried it out. We know some things we want to work on; Angel touched on that. Another area is making it easier to get started: we're looking to have a pull request to get the integration into eksctl, so that customers can set up the RBAC for the integration more easily. So helping adoption, the getting-started experience, is an obvious one that we know we'd like to improve. And then, as people are using it: what are the things they like, don't like, and want to see added, and what's potentially a blocker or making it more difficult for them to use it.

[55:12] Yeah.

play55:14

[55:14] What is the best way for folks to give feedback? Angel, I see that you've got your Twitter handle there; I don't know if that's the best way, or if you have...

[55:24] Actually, it is, yeah. Until Twitter is not a thing (which, you know, we don't know), definitely there. Or otherwise, the Contact Us page. You'd be surprised how quickly somebody from AWS will get back to you when you submit something through a contact form.

[55:50] Great. Well, I want to thank our guests Angel and Jason for joining us today to tell us all about AWS Batch on EKS. I think it's going to be interesting to see what customers do with it over the course of the next few months; I'm certainly eager to hear the feedback and what's to come. So thanks, everybody, for joining, and we'll talk to you soon.

[56:16] Thanks for joining. Thank you for having us. Thanks, bye.

[Music]


Related Tags
AWS Batch, Kubernetes, Batch Jobs, Cloud Computing, DevOps, ECS, EKS, Auto Scaling, Workload Scheduling, Managed Services