AWS Batch on EKS
Summary
TL;DR: In this episode of 'Containers from the Couch', AWS developers Jeremy Cowan, Sai Vennam, Angel Pizarro, and Jason Rupert discuss the new capabilities of AWS Batch on EKS. They explore the fully managed service for running batch jobs on Kubernetes, its history, and the benefits of using it for high-throughput workloads. The conversation covers topics like job queues, compute environments, and the integration of AWS Batch with EKS, providing insights into how customers can leverage this service for their batch processing needs.
Takeaways
- AWS Batch is a fully managed service designed for running batch jobs and high-throughput workloads like genomics, financial risk analysis, and AI/ML training.
- AWS Batch supports running jobs on EC2 instances with ECS and ECS Fargate, and has recently introduced support for EKS clusters.
- The motivation for offering AWS Batch on EKS is to leverage the scalability and just-in-time allocation of cloud resources for batch workloads, which differs from traditional on-prem batch processing.
- AWS Batch uses its own scheduler and scaling system, rather than Kubernetes' default scheduler, to efficiently manage compute resources for job processing.
- AWS Batch is optimized for both maximal throughput and cost-efficiency, balancing these two factors when scaling compute resources for jobs.
- Users are responsible for creating their EKS clusters, while AWS Batch manages the compute environment within the EKS cluster to run batch workloads.
- AWS Batch supports job dependencies and array jobs, allowing complex workflows such as MapReduce to be executed as a series of batch jobs.
- AWS Batch does not use Karpenter for scaling compute instances, opting instead for its own managed scaling approach that integrates with EKS.
- AWS Batch allows customers to provide a launch template for their nodes, enabling customization and adherence to hardened image policies.
- AWS Batch integrates with monitoring tools like Prometheus and Grafana, allowing users to track resource usage and job performance within EKS.
- AWS Batch on EKS is in its early stages, with plans to add features like persistent volume support and multi-node parallel job types based on customer feedback.
Q & A
What is the role of Jeremy Cowan in the episode?
-Jeremy Cowan is a developer advocate at AWS and the host of the episode, facilitating the discussion about AWS Batch and its integration with Kubernetes.
What is the main topic of discussion in this episode of 'Containers from the Couch'?
-The main topic is running batch jobs on Kubernetes with AWS Batch, exploring the new solutions and integrations offered by AWS for managing batch workloads on EKS.
What is AWS Batch and why was it created?
-AWS Batch is a fully managed service designed for running batch jobs at scale. It was created to handle high-scale workloads such as genomics, financial risk analysis, AI, or ML training, by providing a compute scheduler for these batch workloads.
How does AWS Batch differ from traditional on-premises batch processing?
-AWS Batch leverages the scalability of the cloud and just-in-time allocation of resources, making scheduling more flexible and efficient compared to the capped resources and shared infrastructure of traditional on-premises batch processing.
What is the significance of the term 'compute environment' in AWS Batch?
-A compute environment in AWS Batch represents the types and amounts of resources available for jobs. It defines the minimum and maximum number of CPUs a cluster could have and specifies the target container platform, such as ECS, EC2, Fargate, or an EKS cluster.
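As an illustration (not from the episode), a minimal EC2-backed compute environment might be described with a payload like the following. Field names follow the AWS Batch `CreateComputeEnvironment` API; the environment name, role ARN, subnet, and security group values are placeholders:

```python
# Sketch of a CreateComputeEnvironment request payload for AWS Batch.
# All names, ARNs, subnets, and security groups below are placeholders.
compute_environment = {
    "computeEnvironmentName": "batch-ce",
    "type": "MANAGED",  # AWS Batch manages scaling of the underlying compute
    "state": "ENABLED",
    "computeResources": {
        "type": "EC2",            # target platform: EC2 instances running ECS
        "minvCpus": 0,            # allows the environment to scale down to zero
        "maxvCpus": 256,          # upper bound on aggregate cluster size
        "instanceTypes": ["optimal"],  # let Batch pick suitable instance types
        "subnets": ["subnet-11111111"],
        "securityGroupIds": ["sg-11111111"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    },
}
# This dict would typically be passed to boto3, e.g.:
#   boto3.client("batch").create_compute_environment(**compute_environment)
```

The `minvCpus`/`maxvCpus` pair is what encodes the "minimum and maximum number of CPUs a cluster could have" mentioned above.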
What is a 'job queue' in the context of AWS Batch?
-A job queue is a central resource in AWS Batch where all work is submitted. It holds information about the jobs, such as the number of CPUs and memory requirements, and is connected to one or more compute environments to manage the execution of these jobs.
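A hedged sketch of the corresponding `CreateJobQueue` payload, assuming the compute environment name `batch-ce` from above (all names are placeholders):

```python
# Sketch of a CreateJobQueue request payload. The queue fans jobs out to
# one or more compute environments in priority order.
job_queue = {
    "jobQueueName": "batch-queue",
    "state": "ENABLED",
    "priority": 1,  # relative scheduling priority among queues
    "computeEnvironmentOrder": [
        # Batch tries the first environment before spilling to later ones,
        # which is how a queue connects to "one or more" environments.
        {"order": 1, "computeEnvironment": "batch-ce"},
    ],
}
```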
How does AWS Batch handle scaling for batch workloads?
-AWS Batch uses workload-aware scaling to make aggregate decisions on how to scale up or down compute resources based on job queues, requirements of the jobs, and cost considerations, optimizing for both throughput and cost efficiency.
Why did AWS choose to integrate AWS Batch with EKS instead of creating a separate control plane?
-AWS chose to integrate with EKS to leverage existing customer infrastructure and preferences. Many customers were already using EKS for their workloads, and integrating AWS Batch allows them to manage batch workloads within the same environment they are familiar with.
What is the role of 'jobs' and 'job dependencies' in AWS Batch?
-Jobs in AWS Batch are individual tasks that need to be executed, and job dependencies define the order in which these jobs should run. For example, a reduce function might depend on the completion of all map jobs in a MapReduce architecture.
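To make the MapReduce example concrete, here is a sketch of how the pattern might be expressed as two `SubmitJob` payloads, using an array job for the map phase and a dependency for the reduce phase. The job definition names, queue name, and job ID are placeholders:

```python
# Sketch: map phase as an array job (one child job per index).
map_job = {
    "jobName": "map",
    "jobQueue": "batch-queue",
    "jobDefinition": "map-job-def",
    "arrayProperties": {"size": 10000},  # 10,000 children from one API call
}

# Sketch: the reduce job depends on the array job's parent ID, so it will
# not start until all children of the map job have completed.
reduce_job = {
    "jobName": "reduce",
    "jobQueue": "batch-queue",
    "jobDefinition": "reduce-job-def",
    "dependsOn": [{"jobId": "<map-job-id-returned-by-submit_job>"}],
}
```

In practice the `jobId` for the dependency comes from the response of the first `submit_job` call.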
How does AWS Batch support the execution of multi-node parallel jobs?
-AWS Batch supports multi-node parallel jobs through a job type designed for ECS. While this feature is not yet available for EKS, it is under consideration for future releases to accommodate machine learning workloads and other use cases that require this design pattern.
What kind of access does AWS Batch require to an EKS cluster?
-AWS Batch requires the ARN of the EKS cluster and the service role to access the cluster. It uses these to integrate with the EKS API and manage the scaling of nodes for batch workloads within the customer's VPC.
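As an illustration, an EKS-type compute environment carries the cluster ARN and a Kubernetes namespace in its `eksConfiguration`. The sketch below follows the AWS Batch API shape; the ARNs, subnet, security group, and namespace are placeholders:

```python
# Sketch of an EKS-targeted compute environment. Batch uses the cluster ARN
# to talk to the EKS API and creates job pods in the given namespace.
eks_compute_environment = {
    "computeEnvironmentName": "batch-eks-ce",
    "type": "MANAGED",
    "eksConfiguration": {
        "eksClusterArn": "arn:aws:eks:us-east-1:123456789012:cluster/my-cluster",
        "kubernetesNamespace": "batch-jobs",  # namespace Batch submits pods into
    },
    "computeResources": {
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 128,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-11111111"],       # nodes launch in the customer's VPC
        "securityGroupIds": ["sg-11111111"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/eksInstanceRole",
    },
}
```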
How can customers provide feedback or get support for AWS Batch?
-Customers can provide feedback or seek support through various channels, including reaching out via the 'Contact Us' page on the AWS website or engaging with AWS representatives on social media platforms like Twitter.
Outlines
Introduction to AWS Batch on Kubernetes
The episode begins with an introduction to the panelists, Jeremy Cowan, Sai Vennam, Angel Pizarro, and Jason Rupert, who are all AWS employees with various roles in developer advocacy and engineering. They discuss the new capabilities of AWS Batch, a fully managed service for running batch jobs that now supports Kubernetes via EKS in addition to ECS. The conversation sets the stage for a deeper dive into the service's features, motivations, and the team's experiences in bringing this solution to market.
AWS Batch Service and its Evolution
This section delves into the history and functionality of AWS Batch, a service designed for running batch workloads at scale. The panelists discuss the service's inception, its evolution from supporting EC2 instances to ECS and Fargate, and now its expansion to EKS. They highlight the importance of workload-aware scaling and the intelligent coupling of scheduling and scaling operations for efficiency. The conversation also touches on the learnings from the service's operational experience and how they've informed the development of AWS Batch on EKS.
Understanding Batch Workloads and Concepts
The panel clarifies key concepts related to batch workloads, such as jobs, job queues, and compute environments. They explain how these elements interact within the AWS Batch service and the importance of workload-aware scaling. The discussion also introduces the idea of using AWS Batch in conjunction with workflow systems like Step Functions, Apache Airflow, and others, emphasizing the flexibility of AWS Batch in various use cases.
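As a hedged illustration of the workflow-system integration mentioned above: AWS Step Functions can run a Batch job as a single task state and wait for it to finish via the `batch:submitJob.sync` service integration. The state below is sketched as a Python dict of the Amazon States Language JSON; the job, queue, and definition names are placeholders:

```python
# Sketch of a Step Functions task state that submits one AWS Batch job
# and waits for it to complete (the ".sync" integration pattern).
submit_batch_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
        "JobName": "training-step",        # placeholder names
        "JobQueue": "batch-queue",
        "JobDefinition": "training-job-def",
    },
    "End": True,
}
```

This is the "leaf node" pattern described in the episode: the workflow engine sequences steps, while Batch sizes and runs the compute for each one.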
AWS Batch's Approach to Compute Scaling
The conversation shifts to AWS Batch's approach to compute scaling, particularly in contrast to other Kubernetes scaling solutions like Karpenter. The panelists explain that AWS Batch does not use Karpenter and instead relies on its own managed service for scaling compute instances. They discuss the rationale behind this decision, emphasizing the service's operational learnings and the need for tight integration between scheduling and scaling for optimal performance.
Launching AWS Batch on EKS
The panelists announce the launch of AWS Batch on EKS, detailing how it works with existing EKS clusters. They describe the process of integrating AWS Batch's scheduling and orchestration management planes into an EKS cluster, allowing for the submission and execution of batch jobs on Kubernetes nodes managed by AWS Batch. The explanation includes a visual demonstration of the workflow and the components involved.
Technical Deep Dive into AWS Batch on EKS
This section provides a technical deep dive into the specifics of running AWS Batch on EKS. The panelists discuss the types of workloads suitable for batch processing, the differences between batch jobs and long-running microservices, and the importance of data locality for cost and performance. They also address questions about multi-node job support and the demo showcases the practical aspects of using AWS Batch with EKS.
AWS Batch Compute Environment Management
The panelists explain the concept of a compute environment in AWS Batch, which is akin to a managed node group in EKS. They discuss the responsibilities of the end user in creating and maintaining the EKS cluster, while AWS Batch manages the data plane for running batch workloads. The conversation also covers the ability to scale the compute environment to zero and the integration of monitoring tools like Prometheus and Grafana.
AWS Batch's Support for Custom AMIs and Security
Addressing a question from the audience, the panelists discuss AWS Batch's support for custom AMIs and hardened images for EKS nodes. They explain how customers can provide a launch template that AWS Batch will use to launch instances, ensuring that the customer's security and compliance requirements are met. The discussion also touches on the managed service's ability to detect and handle errors in the scaling process.
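A sketch of how a customer-owned launch template, for example one referencing a hardened AMI, is attached to a compute environment's `computeResources`. Field names follow the AWS Batch API; the template name and network values are placeholders:

```python
# Sketch: compute resources that point Batch at a customer launch template,
# so nodes are launched from the customer's hardened image and settings.
compute_resources_with_lt = {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 64,
    "instanceTypes": ["optimal"],
    "subnets": ["subnet-11111111"],
    "securityGroupIds": ["sg-11111111"],
    "launchTemplate": {
        "launchTemplateName": "hardened-nodes",  # placeholder template name
        "version": "$Latest",                    # or pin a specific version
    },
}
```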
AWS Batch's Integration with EKS and Future Plans
The panelists discuss the motivations behind integrating AWS Batch with EKS, leveraging existing customer preferences and the widespread use of EKS. They also share future plans for AWS Batch, including potential support for persistent volumes and multi-node parallel jobs. Additionally, they mention the importance of customer feedback in shaping the service's roadmap and improving the user experience.
Closing Remarks and Feedback Channels
In the final section, the panelists wrap up the discussion with closing remarks, expressing excitement about the potential applications of AWS Batch on EKS. They encourage audience members to provide feedback through various channels, including social media and AWS contact forms. The panel also highlights upcoming workshops and sessions at the re:Invent conference, offering opportunities for further learning and engagement with the AWS Batch service.
Mindmap
Keywords
AWS
AWS Batch
Containers from the Couch
EKS (Elastic Kubernetes Service)
Batch Jobs
Compute Environment
Job Queue
Scalability
Workload
HPC Services
ECS (Elastic Container Service)
KubeCon
Highlights
Introduction of a new episode of 'Containers from the Couch' with Jeremy Cowan, a developer advocate at AWS.
Guest introductions including Sai Vennam, a developer advocate on the EKS product team, and new guests from AWS to discuss batch jobs on Kubernetes with AWS Batch.
Angel Pizarro, a principal developer advocate on the HPC Services team, shares his background in research computing and life sciences.
Jason Rupert, a principal engineer on AWS Batch, discusses his role in building the product on ECS and EKS.
The motivation behind offering AWS Batch for EKS is explained, highlighting the service's history and its focus on batch workloads.
Explanation of AWS Batch as a fully managed service for running batch and parallel jobs on EC2 instances with ECS and ECS Fargate.
Discussion on the unique approach of AWS Batch in managing compute resources for batch workloads, differentiating it from traditional on-prem infrastructure.
Introduction of AWS Batch's central resource concepts, such as compute environments and job queues.
The relationship between jobs, job queues, and running jobs with batch is clarified with a visual aid.
Historical context of batch processing and its relevance to modern industries, including finance and science, is provided.
AWS Batch's integration with workflow systems like Step Functions, Apache Airflow, and domain-specific workflow languages in life sciences and genomics.
Addressing the question of whether AWS Batch uses Karpenter for scaling compute instances and the explanation of its custom approach.
The announcement of AWS Batch support for EKS clusters and how it works with existing EKS clusters.
Demo of AWS Batch on EKS, showcasing the process of submitting jobs, scaling clusters, and monitoring with Grafana and Prometheus.
Exploration of the types of workloads suitable for AWS Batch, differentiating them from long-running microservices.
Discussion on AWS Batch's support for multi-node jobs and the distinction between ECS and EKS in this regard.
The managed service aspect of AWS Batch, emphasizing the abstraction of complexity and the focus on undifferentiated heavy lifting for batch workloads.
Questions from the audience about using AWS Batch on EKS with hardened images and the response regarding custom AMIs and launch templates.
Insights into the future of AWS Batch on EKS, including potential features like persistent volume support and multi-node parallel job types.
Conclusion and invitation for feedback from users, emphasizing the importance of community engagement for the development of AWS Batch.
Transcripts
[Music]
here we go
hello and welcome to another episode of
containers from the couch I'm Jeremy
Cowan I am a developer Advocate at AWS
and joining me today are several guests
we have my colleague Sai Vennam here Sai
you want to quickly introduce yourself
hey folks Sai Vennam here you may have
seen me on containers from the couch
before I'm a developer advocate on the
EKS product team and today excited to
have our guests uh from from our sister
teams at AWS Batch on the show uh to
talk about a new solution they have uh
I'll pass it to them to introduce them
so
hey folks I'll go first uh my name is
Angel Pizarro I'm a principal developer
Advocate on the HPC Services team I uh
have a background in research Computing
and specifically in the Life Sciences so
most of the workloads and customers that
I work with are looking at a mixture of
running microservices for some
management but also a lot of their
workloads are more along what we call
batch workloads or high throughput
workloads which we'll talk about and get
into later today
hi everyone my name is Jason Rupert I am
a principal engineer on AWS batch I've
been with a batch for about six years
helping them originally build
our product on ECS
and now I help them build on eks as well
great and as you probably could infer
from our guest introductions today we're
going to be talking about running batch
jobs on kubernetes with AWS batch now
AWS batch has been in the market for a
while now it has support for running
jobs on ec2 instances with ECS and ECS
fargate
um what uh can you tell us a little
about the the batch service itself and
the motivation behind offering it for
eks
uh sure I can I can take that one
um so
the
okay so let me take a step back uh batch
is sort of a fully managed service for
running batch workloads like we said it
runs on an elastic container service uh
it has since its Inception uh we we took
this really interesting Tech of it
um actually let me step back even
further
there the way that folks run batch
workloads and and workloads that are
sort of uh High scale uh either for you
know workloads like like where I come
from from genomics or you're a Financial
Risk analysis or you're an AI or ml
training person you typically want to do
the same item of work uh sometimes
hundreds sometimes millions of times and
for that you typically want to send all
of those processes into something that's
called a scheduler right a basically a
compute scheduler and it back in the day
when folks are running their on-prem
infrastructure uh they were basically
capped as a shared resource uh on that
on-prem infrastructure so that scheduler
would need to balance out its compute
resources a lot across a lot of
different groups that needed a lot of
work done right
enter in the cloud uh where we have a
lot of scalability and just in time
allocation of resources scheduling
becomes a different thing for batch
workloads and that was sort of the the
original Inception of of batch and Jason
was there so Jason maybe you can give us
a little bit of the history of what
happened there
yeah I think a common theme was you know
how can we do undifferentiated heavy
lifting and remove that from the
customers that may have been you know
running a queue and in work for
work items on ec2 compute and so
um at the time we built we we you know
containers obviously had been taking off
over the years and and so we had desired
you know hey folks like to package in
containers so we'll support that we'll
run it on ECS ECS is also managed AWS
service we're building to manage the AWS
service and we'll overlay the two with
ec2 to to orchestrate these workloads
put them in jobs put the jobs turn them
into ECS tasks and and and run them as a
container on on these workloads and and
one thing that a theme that'll probably
come up a lot in this in this
conversation today is is that what we
learned is is that we we do the workload
aware uh
scaling for the compute so we're we're
looking at the job cues the jobs the the
the requirements of the jobs and making
aggregate decisions on how to scale up
and down compute uh for those jobs and
and that's something that we originally
launched with and has been a big part of
the service
and uh we
um you know over that time we very much
learned that if you want to schedule and
scale and be workload aware you know
those two intelligent operations they
they need to be sort of I would say
coupled together at least the
intelligence of them I mean interface
wise and and subservice wise how we
architect it internally they're they're
nice interfaces in between them in their
separate they're separate sub Services
running them but they really need to
know what each other is doing to be
efficient and optimal at that and so
um that's what we we built then and and
uh we uh you know approached the eks
support
um to in the same way
um I'll stop there and we can uh yeah
you you introduced a couple of terms
there that uh those folks who may be new
to batch workloads might be unfamiliar
with like jobs and job cues can you you
quickly explain what what those are and
how how they relate to like running
running jobs with batch yeah you know if
you can if you can share my screen for I
just have a quick slide and it's pretty
easy to point stuff out at um and
hopefully my my uh Mouse will go over uh
basically batch has for for uh Central
resource Concepts uh and we can start
out from you know the lowest layer at
the compute first where you have a and
work our way backwards uh a compute
environment this is sort of the
representation of the types and amounts
of resources that you want to make
available to your job essentially is
defining what's the minimum and maximum
number of CPUs a cluster could have it
also says what's the target container
platform so this is where you would
specify your target being ECS and ec2 or
fargate or an eks cluster you would
Define a compute environment I've got a
demo later on that'll show how these
things are are connected together
and then stepping away from that you
have a job queue that's where you submit
all your work right and that job queue
is connected to one or more compute
environments so uh you can imagine the
job queue is really holding all of that
information that that Jason uh was
mentioning you know how many jobs for
each job how many CPUs and how much
memory does it use a GPU uh what
architecture are you looking at AMD or
Intel right so it's looking at um it's
looking at the aggregate requirements
and sending that over to the compute
environment and then the compute
environment decides how to scale in
order to do two things uh one what's the
maximal throughput that we can do for
the things that are in the queue right
now at what uh what cost right so batch
is optimized for for both of those to
sort of balance out throughput uh and
and cost and then yeah go ahead sorry I
just wanted to interrupt real quick and
say you know I'm looking at at these
terms here on this page in this
architecture and a lot of these terms
are so reminiscent to me of of batch
processing you know I'd say kind of back
in the day uh you know companies would
set aside time nightly to run batch jobs
and have a limited set of resources and
an environment to work with right they'd
have a Mainframe and it was so critical
to use this uh nightly period to use
that compute environment to run jobs and
now I'm looking at this architecture in
here a lot of these terms are are kind
of reminiscent of that you know like
things like compute environment that
that's like you know the Mainframe back
in the day and we have jobs and job cues
and it seems like a lot of the tech here
is really reminiscent of uh you know
batch processing you know that Banks had
done back in the day right that's
exactly right and we have Banks doing
batch processing today on batch you know
it's like batch and Turtles all the way
down uh but yeah that's that's exactly
where where the requirements came from
is that folks really still need to do
um you know batch processing is a thing
that's that's not just in in boring uh
you know boring Industries like like
Finance they're also in science they're
also in uh uh you know other other
Industries at large so you know you have
these very similar concepts of like your
cue or the compute that you run on or
your job templates your job scripts
right that you run again and again uh
and all of that's in in AWS batch today
and that's
um one one thing we'll talk about later
is because these are general concepts
there's like a whole host of
um whole host of other types of
schedulers other other resource
allocators and then a layer above that
there's like workflow systems and
workflow systems use something like
batch as the leaf node of an execution
right so you would think of something
like a workflow that manages the full
life cycle of a machine learning a model
training workflow and each individual
step might need a different number of
CPUs or different uh different
architecture type and accelerators and
you don't want to put all of that into
one job you actually want to you know
just in that one specific in that one
specific part of that workflow Define it
tightly so you have the exact amount of
compute you need for it and then tie it
together at that higher level and
batches that that leaf node where it's
it's taking the requirements for each
individual step so batch could work with
a workflow engine
um like step functions for example it
does today and in fact other other other
very popular workflow engines have
plugins for batch as well including
things like Apache air flow or flight
there's a lot of sort of domain specific
workflow language that add so you're
talking about especially in the Life
Sciences and genomics
domain specific workflow languages that
talk to AWS batch
now there have been a couple of
questions that have come in since we've
been talking and awesome Angela you and
I uh anticipated this question earlier
when we were talking offline so the
first question is around whether batch
uses Karpenter for uh for for scaling
the compute instances I think you took a
different approach with batch
Jason you want to take that out yeah I
will yeah we we did uh
um so to to directly answer the question
no batch does not use Karpenter under
the hood and and uh working back from
that batch does it require you to
install any controller or Uh custom
resource definition or in an operator uh
just to get the the basic the
functionality out of it um we are in
overlay on top of ECS so we we've taken
a little bit of a a different approach
to this to the system I would say then
maybe
um uh the kubernetes community might
might see some of the other projects
that are out there that do batch and we
did that uh uh based on how how our
service how our managed service made
sense
um as a continuation of an overlay that
was already working on ECS and and could
on eks and so what made sense for for
our service and uh was to to have this
overlay approach and one of the big
driving factors in that was because we
run our own our own scheduler uh
essentially I mean the the kube
scheduler is a little bit involved but we
mostly we mostly bypass that scheduler
and we we're doing this because we have
that existing scheduler and scaling
system in batch now and we got to reuse
a lot of that in in many in many years
of learning from that and and so
um we you know inspect the job to decide
where we um we scale nodes based on that
those are added those nodes are added to
a kubernetes cluster then we place the
the um work onto those nodes and the
work when the work is done we scale them
down and and again that goes back to the
theme that you really can't if you want
to do this kind of workload right at the
the scaling in the um scheduling need to
be coupled together
um they need to kind of know what each
other is doing what they're planning on
doing and so
um compare that with Karpenter and
Karpenter itself it also has some of
those built in as an open source
framework and it certainly is great at
what it's doing and we just you know for
our managed service we we
um that was the model that fit for us
best because of those reasons yeah and I
think the key word there is manage
service right uh when you look at
Karpenter today it's it's deployed
within your cluster as a service that
you're running
um batch you know it's just there as
an API endpoint you send jobs to that
will integrate with your with your eks
cluster
um and there you know so Jason mentioned
you know bringing things that we've
learned are operational
um
experience right like like Andy
Jassy used to say before he went to the
other side of the wall uh there's
there's no compression algorithm for for
experience
um and our scaling and operations team
really has six years of it of running on
on on AWS uh Global infrastructure and I
think we're about to uh publish our our
current biggest public run was uh over a
million uh vCPUs concurrently for
single workload that was running across
uh six regions or so
uh we beat that by four to five times
and are about to beat it again so huge
huge workloads that that we're scaling
across regions and everything else those
are separate batch instances in in each
region but still you know you can look
at the the operations and management of
what's going on with the fleets across
the entire regions uh and that's
if you want to do that uh yourself
within your clusters you have that
option of a carpenter right which is
great
um we feel that the managed service has
as as a managed service we have a lot to
offer to relieve that undifferentiated
heavy lifting and to add to that just
you know those large runs those are on
on batch on ECS right now and and we're
we're you know this is our first you
know release of this batch on eks and
we're also working to try to to build up
that that same scalability you know with
ECS or with eks as well and and so we're
we're we're we're sort of humbled by
learning learning what we are we've been
leaning like as you said we're a system
sister team to the kubernetes uh team at
Amazon and we learn a lot from them and
so we're going to continue to you know
iterate on that and and apply things
that we've learned in the past plus
things that we're learning about
kubernetes as we as we move forward
based on you know what customers are
giving us feedback and what we're
learning or ourselves
so we've we've hinted about this a lot
hopefully I think we've answered the
question about Carpenter but we've been
kind of beating around the bush here we
mentioned it a couple times look batch
has been supported on ECS uh for a while
now and obviously from the title of this
show and what we've been talking about
uh we've got something new to talk about
here today right yeah that we we've
released recently uh so so let's let's
talk about that what do we have here
that that our users can start doing
today
right uh so so what we released at
kubecon uh thanks for that Sai I just I
just took it as given because I've been
talking about this for two weeks since
KubeCon uh but yeah so we released
support uh for eks clusters
um with batch
um and the way that this works is that
we're working with existing eks clusters
uh to hook in our our
scheduling and orchestration uh
management planes into that eks cluster
so there's a there's a nice GIF that I
did for the for the blog post if you
want to um so this is what we had here
actually let me see if this this works
let's see if this works hold on because
this walking through it will be good do
you guys see the the first initial sort
of Animation there great awesome this is
how it works right so imagine you have
your existing eks cluster uh you have
your eks management plane with the uh
with the main host and etcd and you
might have some workloads that are
stretching across azs in eks managed
node groups uh they can be scaled with
the cluster autoscaler or
Karpenter or another technology
um you have a data scientist come in and
they start uh sort of allocating
um they see that they have a compute
environment a job queue that's attached
to that eks cluster
so they start submitting jobs into that
job queue uh the the compute
environment sees that it has jobs ready
to go so it stands up its own batch
managed Auto scale group this is outside
of what's being controlled by the eks
management node and then starts
launching instances but since these
instances are pre-configured with the
eks optimized Ami and we tell it what
where the uh where the main node is
kubelet starts doing its thing says hey
this node is ready for daemon sets daemon
sets that you've configured correctly
with the proper tolerations are placed on
the instance and then it gets an
instance ready signal at which point in
time batch starts placing jobs directly
on the instances that it launched right
so it knew about the pods that were that
were going to be created that's part of
the job definition is the specification
of what container to run and all of the
the Pod limits and requests uh
requirements uh and then places those
pods on there today we're using sort of
node name
as the mechanism for placing those pods
on exactly we might shift that uh later
as we learn more about kubernetes and
really stress how fast we can put pods
onto the onto an instance but you know
generally you just keep scaling blah
blah blah you don't need to see this uh
but you just keep scaling until there's
no more work and you can have multiple
uh uh uh compute environments there
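(As an aside, a job definition like the one described here, a container image plus pod requests and limits, might look roughly like the following as a `RegisterJobDefinition` payload. Field names follow the AWS Batch EKS API; the image, command, and resource values are placeholders.)

```python
# Sketch of an EKS job definition: the pod spec Batch uses when it places a
# job onto a node it launched. Image and resource values are placeholders.
eks_job_definition = {
    "jobDefinitionName": "eks-hello",
    "type": "container",
    "eksProperties": {
        "podProperties": {
            "containers": [
                {
                    "image": "public.ecr.aws/amazonlinux/amazonlinux:2",
                    "command": ["echo", "hello from batch"],
                    "resources": {
                        # requests/limits are what Batch's workload-aware
                        # scaling aggregates when it sizes the fleet
                        "requests": {"cpu": "1", "memory": "1024Mi"},
                        "limits": {"cpu": "1", "memory": "1024Mi"},
                    },
                }
            ]
        }
    },
}
```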
great so we've been we've been talking
about uh batch for a while
um is it possible for us to get a look
at it sure sure why not
um where are we I need to bring up my
thingy here oh here's the yeah I did it
as a GIF, fun. So, very quickly, there was a question I want to address before we jump into the demo: can I build simplified MapReduce-type architectures? Angel, I want to take a crack at this question first, and then you let me know if I'm right. I'm guessing the idea here is that Batch is a way to run jobs, as long as you can simplify the work down to a job, and MapReduce is a framework for dealing with large structures of data. If you can reduce that down to a job, then yes, you can run it on Batch on ECS or EKS; there's really nothing limiting you there, right? Well, yeah, there's nothing limiting you, but of course the devil's in the details. If it's a single process and you just want it running on a single instance, you just say, give me something with a couple hundred CPUs, and it'll do it on that one instance as a threaded model. If you're working with something really large and you need to access individual objects in S3 and do some mapping and sorting, then you can take advantage of job dependencies. Two advanced aspects that Batch provides are, first, array jobs: do this thing 10,000 times, for instance, and you get an index for each individual job from that 10,000-wide array, all from a single API request. You can use that index, which is set as an environment variable, to say: I am index 10, grab the tenth chunk of data to do the map step with, then go about your business and output it. And at the tail end, for the reduce function, you would define a dependency on the map job you ran: don't start until I have all of the results from that map process ready for the reduce step. So you can do it at the level of Batch. The other way, as we mentioned before, is workflow systems, which would do that automatically: they wouldn't send that second reduce job until it was ready to send, because they had the results coming back from the map.
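The map side of the pattern described above can be sketched in a few lines. AWS Batch sets the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable in each child of an array job; the partitioning scheme and function name here are hypothetical.

```python
import os

# Each child of a Batch array job receives AWS_BATCH_JOB_ARRAY_INDEX (0-based).
# Hypothetical map step: use the index to pick this worker's slice of the input.
def my_chunk(items: list, total_children: int) -> list:
    index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    # simple striped partitioning: child i takes items i, i+n, i+2n, ...
    return items[index::total_children]

os.environ["AWS_BATCH_JOB_ARRAY_INDEX"] = "2"   # simulate child #2 locally
work = my_chunk(list(range(10)), total_children=4)
print(work)  # child 2 of 4 processes items [2, 6]
```

The reduce job would then be submitted with a dependency on the array job, so it only runs once every child has written its output.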
Right, that makes sense, and I'll bring your demo back up here. But one thing I just want to point out: the types of workloads we're talking about, we keep calling them jobs. They're running on Kubernetes, and as that architecture showed, each one is a container. Generally on this show, when we talk Kubernetes, we think pods, we think long-running workloads, we think applications, back-end services, front-end services. That's different from what we're going to be demoing today and what Batch was really made to handle, right? Yeah, it is, and maybe I'll just put this up real quick. This is what you were describing: the difference is that you have services in a microservices world that need to span availability zones and have replicas, and they might have response times in the milliseconds; they run indefinitely. Batch is different. For one, you want these things packed onto instances for throughput and cost savings, and for a data-heavy workload you also benefit a lot from having those data resources local to the AZ, so spreading across AZs is not always the most optimal pattern for batch jobs. Right, and I think that makes sense: high availability is not a thing here, and I like that you put that there, because you're really just concerned about getting the job done.
Right, exactly. Now we did get a question kind of related to this: does AWS Batch support a multi-node option? We'll get to the second part of the question in a second. This is asking about cases where you might have a process with a main node and sub-processes doing work, and they might need to communicate with each other. In the strongly coupled, high performance computing world these are MPI, or message passing interface, jobs; you can also think of task workers coordinated by a main node. For Batch on ECS today there is a multi-node parallel job type. That's not there for EKS yet, but we're really trying to get it out the door as fast as possible, because we know a lot of machine learning workloads leverage that design pattern, which actually comes from the HPC world. Got it.
And for the second part of that question, how about we revisit it? I know we've got a demo we want to show; maybe we can cover it as part of the demo. Absolutely. So here's your EKS cluster. I have some Grafana metrics coming in through Prometheus. You can see we have some pods here; I defined a namespace for Batch nodes. As for the nodes themselves that are up today, you have a specific node group: these are managed by EKS. And just for the demo's sake, I do have a node here that was added by the Batch compute environment, which I'll show in a second, just so that jobs start right away and we're not waiting the minute or so for it to spin up. But you can set the minimum cluster size to zero for Batch. And just to be clear about these managed node groups: yes, the groups are managed by EKS, but the user that wants to use Batch on EKS has to actually create these managed node groups, and they're kind of responsible for them?
No, no, so let me back up a step. When you're defining... here we go, let me go over here. This is a compute environment that's up today, but say you want to create one in the console: you go ahead and say, here's the Kubernetes service, you name it, blah blah blah, you give it the instance role (for my cluster I forget which one it is, but we'll just pick one real quick) and the EKS cluster, and then you give it the namespace in which to launch. I'm not going to start this, because I have one ready, but the gist is that Batch now knows where your EKS cluster is. So imagine this thing we just created: it knows the instance role to use for nodes and pods, and it will go ahead and create those auto scaling groups if there isn't one defined already for the compute environment. So it's managing that completely. The things that you as an EKS cluster owner need to give Batch are two: the ARN for the cluster and the service role. Those are the two things. And then this API endpoint needs to be public. Right now it's authenticated access through IAM; we use the service account mappings and ConfigMap authorization that you use for every other AWS service to integrate with EKS. But we use that public endpoint to find out more about the cluster, like for instance what Kubernetes version it is, store that information, and then use it as we need to for launches.
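Those two pieces of input, the cluster ARN and the namespace, show up in the CreateComputeEnvironment API as the `eksConfiguration` block. Here is a hedged sketch of such a request; the ARNs, names, subnets, and security group are placeholders, and the field names should be checked against the AWS Batch API reference.

```python
# A sketch of a CreateComputeEnvironment request for an EKS compute
# environment. All names and ARNs are placeholders.
request = {
    "computeEnvironmentName": "my-eks-ce",
    "type": "MANAGED",
    "state": "ENABLED",
    "serviceRole": "arn:aws:iam::111122223333:role/AWSBatchServiceRole",
    "eksConfiguration": {
        # the two things the cluster owner must provide to Batch:
        "eksClusterArn": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster",
        "kubernetesNamespace": "my-batch-namespace",
    },
    "computeResources": {
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 128,
        "instanceTypes": ["m6i"],
        "subnets": ["subnet-aaaa1111"],
        "securityGroupIds": ["sg-bbbb2222"],
        "instanceRole": "arn:aws:iam::111122223333:instance-profile/eksInstanceRole",
    },
}

# To actually create it (requires AWS credentials):
# import boto3
# boto3.client("batch").create_compute_environment(**request)
```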
So, to add on there, you could think of Batch's compute environment as Batch's version of the managed node group. Right. We don't use the term managed node group, just to try to avoid confusion, but if you have your EKS cluster, you might have self-managed nodes, maybe your own ASGs, or maybe a managed node group running some microservices, maybe the DNS service. And if you add Batch to it, by adding the compute environment, Batch is going to come along and manage its own nodes for the jobs in the job queue. It does that as a managed orchestration, and they'll look like self-managed nodes in EKS right now, but they're managed by Batch, and Batch will add and remove them through ASGs. You can look under the covers and see that: if you go to the auto scaling console, or the EC2 console, or the EKS console, you're obviously going to see those resources. But hopefully you don't have to look at them much, because Batch is doing its job. So you can think of Batch's compute environment as Batch's version of a managed node
group. Now let me summarize: the end user is responsible for creating an EKS cluster, and they need to pass that cluster, the ARN and the role that provides access, to Batch. Batch then creates the, sorry, the data plane, the node group that'll actually run Batch workloads. So while the end user is still responsible for the cluster itself, they're not responsible for the data plane where the workloads actually run. And if the user wanted, they could feasibly create other managed node groups in that same cluster and do other things in it as well. It's just the one node group that Batch is using that's really for Batch to handle and manage, and that complexity is abstracted from you, because Batch is handling the management of it. Yeah, and in fact the cluster I have right now has a managed node group, the one running the Prometheus and Grafana services you see on the left side there. And then we start to see why it makes sense that you took this approach of having the user create the cluster: there's still some management capability left to the user, while you still abstract the complexity of the data
plane itself. We've got a question here that kind of tees into this: what was the motivation to use EKS for Batch, instead of just using Batch-managed resources? Well, it's because a lot of customers have been choosing EKS for their workloads and standardizing on it; it's really as simple as that. A lot of them came to us, through the teams that were supporting them, asking for batch workload support. The scaling and operation of batch workloads on Kubernetes is really very different from microservices, for the reasons we covered before, and a central tenet of Batch is to remove that undifferentiated heavy lifting for batch workloads specifically. So we felt we could offer a lot to our customers today that are trying to run these workloads on Kubernetes, and we'll see what the feedback is once folks really start hammering at it.
Awesome, and we're getting really great questions here. I don't want to derail from the demo, but one more question that I think is kind of critical: is the data plane running in the customer's VPC? So, by data plane do you mean the node group of the EKS cluster? Yeah, the node group. I'll take this one. Again, we're running our version of a node group, and we are launching the resources into the customer's VPC. We don't have direct access to the customer's jobs or resources; we're orchestrating bringing the node up and then submitting a pod to the Kubernetes API server, which will then run on that node. But it is in the customer's VPC, the nodes, that's correct. Exactly. So we have a couple more great questions, but I want to put those on hold. Let's get to the demo, and I think the demo will probably address the questions about Grafana and Prometheus and the scraping. Sure.
Where are we... this is not the one I want... right, compute environment. You'll see some things here; let me refresh this. So, I set a minimum of four vCPUs; the desired state here is basically what Batch wants the cluster to go to, and a maximum of 128 across a fleet of instances. I set the m6i family, and these are just the subnets and security group to launch with. Like somebody just said, yes, it is launching in your VPC and with your resources. What we see on our management plane is the job request: the job definition template and the things you're providing to the API, so that we can make those scheduling and scaling decisions.
A couple more things that aren't really shown well, but they're here: you did not specify this on creation of the compute environment, but we inspect the cluster and keep track of what version of Kubernetes it is. So if you upgrade your cluster to a newer version of Kubernetes, the compute environment is doing continual checks of whether it's still a valid target for us to launch into, and if those two versions diverge, this status will go to INVALID. So when you update a cluster, you have a follow-on responsibility to tell Batch: I've updated my cluster, here's the new version, through an UpdateComputeEnvironment call, and then we'll check it and it'll go valid again. Is that so you can provision instances using the right AMI? Yeah, that's right. And we do support divergence, just like a managed node group: if you update your EKS control plane to a version, you then go update your managed node group to that version in EKS. It's the same thing with Batch's compute environment. You can say, okay, I want to go from 1.22 to 1.23, start using that version, and we'll pick out the EKS optimized AMI for the right instance type. That's how we use that. Right, so
here's what we have on our node today. Here's the Batch dashboard. I have a couple of job definitions here. Basically there's this one, a simple Python program that I have in a private ECR repository, that's calculating pi. If you look at the job definition, it uses a pod property, a service account name, that has access to both S3 and ECR; that's different from the default, because this is a private repository. It only does a thousand iterations, so it takes about 10 seconds.
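The demo's actual program isn't shown, but a hypothetical stand-in for a pi-calculating container entrypoint might look like this: a Leibniz-series estimate whose iteration count comes from an environment variable supplied by the job definition. The `ITERATIONS` variable name is an assumption for illustration.

```python
import os

# Hypothetical stand-in for the demo's pi job: estimate pi with the Leibniz
# series (pi/4 = 1 - 1/3 + 1/5 - ...), with the iteration count supplied
# through the job definition's environment.
def estimate_pi(iterations: int) -> float:
    total = 0.0
    for k in range(iterations):
        total += (-1.0) ** k / (2 * k + 1)
    return 4.0 * total

iterations = int(os.environ.get("ITERATIONS", "1000"))
print(f"pi ~= {estimate_pi(iterations):.6f} after {iterations} iterations")
```

At a thousand iterations this converges to roughly three decimal places, which is why each job finishes in seconds.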
So we could go ahead and submit this job, and call it Jobby Job McFace or whatever; you submit it to the Batch queue. This is the job dependencies feature I was referring to before: if you have a dependent job, you can set it in the console, and you can also do it through our APIs.
You can optionally override some things. If you want this to be a much higher priority than other things in the queue, you can set a higher priority, like: I need this thing now, in the first available slot, so put in a positive priority and it jumps ahead. There are also job attempts: if there's a retry, like a Spot reclamation, maybe you want to retry that. And a retry strategy: if there was something wrong with my application, like my container just had bad data going into it, actually don't retry; I really only want to retry if I get interrupted by Spot. So you have these different conditions.
You can also override the command, and the best one is... where are we, next page... sorry, go back here, I forgot to erase those. I'm going to do this a hundred times. Maybe I'll do it a thousand times. Sure.
So it's going to do this thing a thousand times, and the reason I'm doing that is so I can scale the cluster, and you can see the CPU usage coming out on the other end after a bit. So, submitted. This is what you get: information about how many are runnable, how many are starting, successes and failures, and we'll check back, but it's essentially as easy as that. Now this thing is pulling the image off of ECR, and it's going to be launching instances; these are the job details right here. If you go to the compute environment we had before, you'll see the desired vCPUs haven't moved yet, because with batch processing you don't want immediate scaling like you would with microservices that really need fast response times. You can be a little bit lax about how quickly you provision resources; it's a trade-off. If you want jobs to start right away, you can have warm capacity, or maybe a certain percentage of warm capacity. Or you can let Batch scale up and down, if you're willing to take lower cost and a little bit of delay on startup. Right, and you can see that pod memory usage is starting to go up in Grafana right here. Basically you can do a pre-warm by setting desired capacity higher (you'll see Batch update this in a minute or two) or minimum vCPUs higher, and it'll immediately start launching instances. But that's it; it's as simple as that.
These aren't very long jobs; they take about 10 seconds each, so you see 63 of them are already completed before we've even scaled out. And where is Batch getting that information? Is it querying the Kubernetes API to get the status of the jobs?
Yeah, I'll take that one. What we do, and this is in our getting started guide, which should be linked here, is the customer provides RBAC permissions for Batch to basically watch pods and nodes. With those permissions, we take a job from the job queue and turn it into a pod, and then we watch the state transitions it goes through in the Kubernetes cluster and update the job on our side from that.
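The watch-based translation implies a mapping from Kubernetes pod phases to Batch job statuses. The real mapping is internal to AWS Batch; this is only an illustrative guess at its shape.

```python
# Illustrative only: the kind of pod-phase -> job-status translation the
# watch-based design implies. The actual mapping is internal to AWS Batch.
POD_PHASE_TO_JOB_STATUS = {
    "Pending": "STARTING",
    "Running": "RUNNING",
    "Succeeded": "SUCCEEDED",
    "Failed": "FAILED",
}

def job_status_for(pod_phase: str) -> str:
    # jobs with no pod yet remain RUNNABLE in the queue
    return POD_PHASE_TO_JOB_STATUS.get(pod_phase, "RUNNABLE")

print(job_status_for("Running"))    # RUNNING
print(job_status_for("Succeeded"))  # SUCCEEDED
```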
Okay. And let's say things go horribly wrong and you need to troubleshoot; is there a way for me to view the logs? I've got you right here. These are the log groups going into CloudWatch Logs. I deployed the Fluent Bit log collector as a daemon set; you can use whatever log aggregator you want on the back end. Okay, so when you were mentioning earlier how you can schedule daemon sets onto instances provisioned through Batch, this would be a good example of where you might want to run something like Fluent Bit. Exactly. And if we go here and say kubectl (I refuse to say "kube-cuddle") get nodes, all of them, you can see what's running across these nodes, and kubectl get pods across all namespaces...
you'll see there's a bunch of stuff happening here, and here are the Batch nodes. What were we doing... where's the Fluent Bit... here are the Fluent Bit pods across all the instances, including this one. There were four currently in the node group, and then, I guess 68 seconds ago, this guy right here is new: it just got added by Batch, and Fluent Bit got deployed to it before we started running jobs on it. That's the auto scaling group bringing up the additional node, and then obviously... exactly, it's just launched another instance here, which you can see showed up. Wow, okay. And if you look at the compute environment here, it had a desired vCPUs; let's see what it does... now it's saying, I want my full fleet, because I realize there's a bunch of work in the queue, so please scale up as much as you can. I don't have that many resources available in this specific account, so it shouldn't scale much beyond what we have, but it's saying: give me as much as you can, as fast
as possible. So, Angel, could I scale a node group to zero and then depend on Batch to scale it up when I need it? Absolutely. That's the minimum vCPUs that you set. Again, I bumped it up just so we wouldn't have to wait those two minutes for jobs to start, but you can set it to zero and it'll scale to zero, and you won't have any Batch nodes when you have no Batch jobs. And just to add on to that, for clarity: that's for the Batch workloads in the job queue. If you have web services or microservices running as Deployments or ReplicaSets on your cluster, Batch is not scaling for those. You can use Karpenter, or the Cluster Autoscaler with the managed node groups that EKS provides, for those. Batch is very focused on playing nicely within the cluster: launching Batch nodes for Batch workloads. Batch will not place its pods on nodes that don't belong to Batch, at least in this first version. Maybe someday we'll support that, but the current version is not stepping outside of the allocation that it's doing.
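The scale-to-zero versus pre-warm choice described above maps to the compute environment's minimum and desired vCPUs, which can be changed after creation with UpdateComputeEnvironment. This is a hedged sketch; the environment name is a placeholder, and the calls are commented out since they need AWS credentials.

```python
# Sketch: scale-to-zero vs. pre-warming via UpdateComputeEnvironment.
# "my-eks-ce" is a placeholder compute environment name.
scale_to_zero = {
    "computeEnvironment": "my-eks-ce",
    # no Batch nodes at all when the job queue is empty
    "computeResources": {"minvCpus": 0},
}

pre_warm = {
    "computeEnvironment": "my-eks-ce",
    # keep warm capacity so jobs start immediately, at extra cost
    "computeResources": {"minvCpus": 16},
}

# import boto3
# boto3.client("batch").update_compute_environment(**scale_to_zero)
```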
There you go. Here I just searched for "batch", but you can search on the job ID itself, and it will bring up the specific log for the container that had an error and show you a bunch of information in CloudWatch. Nice.
Very cool. This could be a good time to go through some questions; we've got a lot of great ones coming up in chat. One question here from Tesseract (I love that tag): they're asking about using Batch on EKS, and implying that you'd have to run EC2 instances with Amazon-owned AMIs in the EKS cluster. That's a non-starter for them: they only run hardened images for EKS nodes, and that policy would be violated by running an unknown AWS Batch AMI. I think you probably have a good answer to this one, right, guys?
Yeah, I'll take this one, because I do want to expand on it a little. We do support custom customer AMIs as overrides, and we support that through launch templates. A customer can provide us a launch template, which really allows them to customize their nodes: they can override user data, which we merge into ours to make sure things keep functioning right for our service. So they provide a launch template, we take it and make what we call a managed launch template out of it, which is a merge of their configuration and our configuration, and we launch instances with that. That is how a customer can provide an AMI, and as long as the node joins the cluster and Kubernetes is healthy and doing its thing, that should work. And we have some mechanisms to help customers out, too. This is where the six years of Batch come into focus: if you look at some auto scaling systems, when things aren't working they'll just leave resources around, and you can burn through some money if you're not paying attention. We've iterated over the years, so say you have a bad AMI or a bad configuration: we'll let it run for a little bit so you can debug it, but we'll eventually scale it back down and invalidate the compute environment, giving you a message saying, hey, we noticed something is wrong with the cluster, or with the compute environment; maybe it's the launch template, maybe it's a network config, maybe it's RBAC. So, yeah, I'll stop there. Yeah,
and this is really where that managed service comes into play: we do stop scaling and sending jobs, because we have seen errors like this before. Triggering things to stop scaling because you're seeing repeated errors is something you would otherwise have to implement yourself if you were deploying your own solution.
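The hardened-image setup described above boils down to two fields in the compute environment's compute resources: a customer launch template and an AMI override. This is a hedged sketch; the template name, AMI ID, subnets, and security group are placeholders, and the field names should be verified against the AWS Batch API reference.

```python
# Sketch: pointing a compute environment at a customer launch template and
# a hardened AMI. Per the discussion above, Batch merges this template with
# its own settings into a "managed" launch template before launching nodes.
compute_resources = {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 128,
    "subnets": ["subnet-aaaa1111"],
    "securityGroupIds": ["sg-bbbb2222"],
    "launchTemplate": {
        # customer-owned template: user data here is merged with Batch's
        "launchTemplateName": "hardened-eks-nodes",
        "version": "$Latest",
    },
    "ec2Configuration": [
        # override the default EKS optimized AMI with a hardened one
        {"imageType": "EKS_AL2", "imageIdOverride": "ami-0123456789abcdef0"}
    ],
}
```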
Awesome. And folks, by the way, let's keep pounding these guys with questions; we've definitely got 10 more minutes, so if you have any more, feel free to drop them in chat. Another very interesting one here from Clever Maya: what they don't get is why not create a Batch control plane, instead of these integrations? I'm guessing they're referring to the integration with EKS specifically. I want to answer this quickly and argue that there is a Batch control plane, and I think you are abstracting a lot of the complexity, from what I've seen. But I know we asked this a little earlier; maybe a bit more detail on why the EKS integration is being built the way it is today?
Is this question really asking why we aren't providing a resource provider that you deploy within EKS? I guess... okay, managed service; it really boils back down to that. Maybe, Jason, you can take this one. Well, I think you could answer it in two ways. As already put forth, we do have a managed control plane that is hidden. Now, if the question is about EKS's control plane and us creating the cluster on your behalf, that might be what's being asked. We chose in our first version not to do that, given that customers already have pretty opinionated ways of doing Kubernetes, and we didn't want to step out of the bounds of batch workloads and into what the customer's organization and compliance are doing for their clusters; we just want to work nicely within that. Although we will certainly consider feedback on whether we should also be creating clusters for customers. But the control plane part, like scaling nodes for the batch workload, we're handling that; it is hidden from you. You are giving us configurations, defining constraints for those scaling operations within what our resource calls a compute environment, and so that part is our control plane. I'll leave it there; I think those are a couple of ways it could be answered.
Absolutely, and honestly we got an answer here from one of our guests as well: the EKS integration is nice because we can leverage Kubernetes capabilities and open source tools. Here, I'm guessing, we're using Prometheus and Grafana to scrape metrics, which is really only possible because Batch gives you access into that cluster to configure and manage these tools. And of course, I think Angel said this in the beginning: with so many AWS customers using EKS, it made sense to offer Batch as an avenue for them to use the service. Let's keep going down the questions. Folks want to know how they can get started: are there any blogs or workshops available, or sessions at the upcoming re:Invent? Anything we can share here?
Yeah, absolutely. I think you have links to the docs for the getting started guide. There's also a self-paced workshop with a few simple examples, and we're going to be giving that workshop at re:Invent. It's an embargoed session right now, but it should be out this week, so look for CMP335; it should be in the catalog in the next couple of days. Jason and I also have a chalk talk session, so if you want to ask Jason some direct and more pointed questions, come to, I believe, session CON309. And there's a general Batch talk as well. So there's the workshop I'm at, the chalk talk session, and a general Batch breakout session at re:Invent. In terms of what you can do self-paced, our documentation should be a good place to go. There's a post on the AWS News Blog, Jeff Barr's blog, about the feature, with pointers to the workshop and the documentation, so that might be the quickest way to get to everything. Nice. And what's
next for AWS Batch? What are you looking at doing in the future with EKS? You talked about potentially provisioning whole clusters instead of relying on a customer's. Yeah, the provisioning-whole-clusters idea is something we really need to be careful with, because it was a core design choice to leverage existing clusters in customers' accounts, in that shared responsibility model. If we see enough feedback that we should revisit it, that's actually a major feature, and it wouldn't come out anytime soon, because by managing Kubernetes clusters we'd essentially become EKS, right? We'd really want to do that carefully, and if we did, I think it would be more of a collaboration with the other service teams at AWS. Things we are looking at closely, though, are the managed multi-node parallel workflows, and also watching what early customers try out and find problems with, to get that back onto our roadmap. One thing we don't support today is persistent volumes in the way that Kubernetes wants. There is a way, through launch templates, to mount parallel file systems or other things that matter for batch workloads, and then do host volume mounts through the pod, but that's sub-optimal and not really the way Kubernetes folks are used to working. So we're looking at persistent volume support as a near-term feature release. And we're also going to learn a lot from our customers; we want to work backwards from them and hear what feedback they have after they've tried it out. We know some things we want to work on; Angel touched on that. Another area is making it easier to get started: we're looking to have a pull request to add the integration into eksctl, so that customers can set up their RBAC more easily for the integration. So helping adoption, helping people get started, is an obvious one we'd like to improve. And then, as people use it, what do they like and not like and want to see added, and what's potentially a blocker or something making it harder to use. Yeah.
What is the best way for folks to give feedback? Angel, I see you've got your Twitter handle there; I don't know if that's the best way, or if you have another. It is, actually. I mean, until Twitter is not a thing, which, you know, we don't know... until Twitter is not a thing, definitely there. Or otherwise the Contact Us page; you'd be surprised how quickly somebody from AWS will get back to you when you submit something through a contact form. Great. Well, I want to thank our guests, Angel and Jason, for joining us today to tell us all about AWS Batch on EKS. I think it's going to be interesting to see what customers do with it over the next few months; I'm certainly eager to hear the feedback and what's to come. So thanks, everybody, for joining, and we'll talk to you soon. Thanks for joining. Thank you for having us. Thanks, bye.