By Wolfgang Unger
Working with AWS Batch to run containers in parallel is not quite simple.
There are a bunch of configurations you must set up correctly to launch the right instance types
and the right number of instances.
This tutorial will try to clarify the important configurations and parameters you need to run
Batch as expected.
Job Definition
The first important resource is the Job Definition of the Batch Job.
These are the most important parameters:
Image
Points to the ECR Docker image (has no effect on the compute environment and the number of
instances)
Memory
Required RAM
vCPUs
How many vCPUs the single job will need to run
Number of GPUs
How many GPUs the single job will need to run
In CDK code these values can be configured like this:
resource_requirements = [
    batch.CfnJobDefinition.ResourceRequirementProperty(type="VCPU", value="4"),
    batch.CfnJobDefinition.ResourceRequirementProperty(type="MEMORY", value="8192"),
    batch.CfnJobDefinition.ResourceRequirementProperty(type="GPU", value="1"),
]
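For context, here is a minimal sketch of how these resource requirements could be plugged into a complete Job Definition. The construct id, the image URI and the surrounding stack (self) are placeholders for illustration, not values from the original setup:

from aws_cdk import aws_batch as batch

# Minimal container job definition using the resource requirements defined above;
# the image URI is a placeholder for your own ECR image.
job_definition = batch.CfnJobDefinition(
    self, "GpuJobDefinition",
    type="container",
    container_properties=batch.CfnJobDefinition.ContainerPropertiesProperty(
        image="123456789012.dkr.ecr.eu-central-1.amazonaws.com/my-image:latest",
        resource_requirements=resource_requirements,
    ),
)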
Batch Compute Environment
For the Batch Compute Environment the following settings are important:
Instance Types
The allowed instance types you want to launch. You can define only the family (g4dn or c5d) or also
include the size (g4dn.xlarge).
You can define a list of allowed instances (c5d, m5ad, m5d, r5d) or (g4dn.xlarge, p2.xlarge).
You need to select the instance types based on the requirements of your workloads.
So if you have memory-intensive or compute-intensive jobs running, please adapt these instance types
to your needs.
Just take a look at the AWS documentation: AWS Docu Instance Types
If you need, for example, NVIDIA GPUs, you can select the G4 or the P2 instances.
We will see some examples of how the GPU value in the job definition and the instance type affect the
number of launched instances.
Maximum vCPUs
You can limit the maximum vCPUs and thereby the number of launched instances with this value. If
limited, for example, to 16, this value should not be exceeded even though the calculated desired
count could be higher.
Desired vCPUs
This value is not configured by you; it is calculated by AWS when jobs are sent to Batch. See
more on this value in the test results and conclusion below.
Allocation Strategy
For example BEST_FIT_PROGRESSIVE.
For more information please read: AWS Docu Allocation Strategies
Provisioning model
EC2 or Fargate. If you need concrete instance types, set it to EC2.
EC2 configuration
For NVIDIA GPU instances this is important and must be set to ECS_AL2_NVIDIA.
In CDK code this is:
compute_resources=aws_batch.CfnComputeEnvironment.ComputeResourcesProperty(
    type=compenvtype,
    allocation_strategy="BEST_FIT_PROGRESSIVE",
    ec2_configuration=[
        aws_batch.CfnComputeEnvironment.Ec2ConfigurationObjectProperty(
            image_type="ECS_AL2_NVIDIA",
        ),
    ],
    # further properties (instance_types, maxv_cpus, subnets, instance_role, ...) omitted here
),
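To put this snippet into context, here is a minimal sketch of a complete managed EC2 compute environment in CDK. The construct id, subnet, security group and instance profile values are placeholders for your own resources and not taken from the original setup:

from aws_cdk import aws_batch

compute_environment = aws_batch.CfnComputeEnvironment(
    self, "GpuComputeEnvironment",
    type="MANAGED",
    compute_resources=aws_batch.CfnComputeEnvironment.ComputeResourcesProperty(
        type="EC2",  # provisioning model: EC2 instead of FARGATE
        allocation_strategy="BEST_FIT_PROGRESSIVE",
        instance_types=["g4dn.xlarge"],  # family only (g4dn) or family plus size
        maxv_cpus=16,  # limits the number of launched instances
        minv_cpus=0,
        subnets=["subnet-xxxxxxxx"],  # placeholder
        security_group_ids=["sg-xxxxxxxx"],  # placeholder
        instance_role="arn:aws:iam::123456789012:instance-profile/BatchInstanceProfile",  # placeholder
        ec2_configuration=[
            aws_batch.CfnComputeEnvironment.Ec2ConfigurationObjectProperty(
                image_type="ECS_AL2_NVIDIA",  # required for NVIDIA GPU instances
            ),
        ],
    ),
)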
Service Quotas
In each account there are service quotas ( limits) on the EC2 machines, allowed to launch in
parallel.
It is not possible to launch more EC2 instances on the dedicated Instance Types of the quota even
though the submitted jobs and the desired vCPU count would require a higher amount of instances.
This behaviour can be confusing, when triggering jobs, the desired vCPU count is set correctly but
he correct number of launched EC2 instances is not reached. I will not exceed the amount of the
quota.
In batch this means, for example only 8 G4dn Instances will be launched, the desired vCPU would
demand 16, but this value will be irrelevant in this case.
The jobs will then be queued and only processed once another job was finished
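One way to observe this queueing (a sketch assuming boto3 and a hypothetical queue name) is to list the jobs that are waiting in the RUNNABLE state:

import boto3

batch_client = boto3.client("batch")

# Jobs waiting for compute capacity sit in the RUNNABLE state;
# "my-job-queue" is a placeholder for your own queue name.
response = batch_client.list_jobs(jobQueue="my-job-queue", jobStatus="RUNNABLE")
for job in response["jobSummaryList"]:
    print(job["jobId"], job["jobName"])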
Test and Results when running Batch Jobs
Service Quotas
See the point above: the AWS service quotas will prevent launching the expected number of EC2
instances, and this can be quite confusing when observing the behaviour of the compute
environment.
The soft limit can easily be increased by a request on the 'Service Quotas' page and should be set to
a value that allows launching enough instances.
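To check or raise the limit programmatically, here is a sketch using the Service Quotas API via boto3; the quota code below is assumed to be the one for 'Running On-Demand G and VT instances', please verify it for your account:

import boto3

quotas = boto3.client("service-quotas")

# Assumed quota code for "Running On-Demand G and VT instances" (measured in vCPUs)
G_INSTANCE_QUOTA_CODE = "L-DB2E81BA"

quota = quotas.get_service_quota(ServiceCode="ec2", QuotaCode=G_INSTANCE_QUOTA_CODE)
print("Current limit (vCPUs):", quota["Quota"]["Value"])

# Uncomment to request an increase of the soft limit
# quotas.request_service_quota_increase(
#     ServiceCode="ec2", QuotaCode=G_INSTANCE_QUOTA_CODE, DesiredValue=64
# )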
Desired vCPUs
When sending jobs to Batch, the Batch Compute Environment will first calculate the desired vCPU count
to scale up the compute resources.
This value is not only calculated from the vCPUs of a single job (defined in the job definition) and
the number of jobs; as a baseline, with 4 vCPUs in the job definition and 4 jobs sent to Batch, this
would end up in 16 desired vCPUs (4x4).
The allowed instance types of the compute environment will also be considered by AWS Batch when
calculating this value.
If only a family is defined in the allowed types (g4dn), or the matching instance size is defined
(g4dn.xlarge for a 4 vCPU job definition), the compute environment will set this value to 16, which is
fine and the expected value.
But if you only allow bigger instances, for example g4dn.2xlarge (which have 8 vCPUs), and the
settings of job and compute environment (for example the GPU) require launching 1 instance per job,
the desired vCPU count will be set to 32, because 4 instances must be launched and each instance has
8 vCPUs. In this case the calculation job-def vCPUs x number of jobs is overwritten by
instance vCPUs x number of jobs.
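As a rough illustration of this behaviour, here is a simplified model of the calculation described above, not the actual AWS Batch scheduler logic; the one-instance-per-job flag and instance sizes are assumptions:

def desired_vcpus(job_vcpus, num_jobs, instance_vcpus, one_instance_per_job):
    # Simplified model: if every job needs its own instance (e.g. GPU=1 per job),
    # the instance size dictates the desired vCPU count
    if one_instance_per_job:
        return instance_vcpus * num_jobs
    # otherwise the job definition vCPUs drive the value,
    # but never below the vCPUs of the smallest allowed instance
    return max(job_vcpus * num_jobs, instance_vcpus)

# 4 jobs with 4 vCPUs, only g4dn.2xlarge (8 vCPUs) allowed, GPU forces 1 instance per job -> 32
print(desired_vcpus(job_vcpus=4, num_jobs=4, instance_vcpus=8, one_instance_per_job=True))
# 4 jobs with 4 vCPUs, g4dn.xlarge (4 vCPUs) allowed, jobs can share instances -> 16
print(desired_vcpus(job_vcpus=4, num_jobs=4, instance_vcpus=4, one_instance_per_job=False))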
Amount of launched instances
After AWS Batch has calculated the desired vCPUs, it will launch the EC2 instances for the batch
jobs.
A number of aspects affect this number, and this can be confusing.
Service Quotas
can prevent launching enough instances, see above; always increase them to a sufficient amount!
Relationship between job definition hardware requirements and allowed instances in the compute environment
The definition of hardware resources in the job definition in combination with the number of jobs will
be the most important factor for the number of launched instances.
If 4 jobs can run on 1 instance of an allowed instance type, only 1 instance will be launched, but all
hardware requirements must fit on that instance, meaning:
Memory
vCPU
GPU - if defined and required
Example 1 - No GPUs
So (first example without GPU): if vCPUs in the job definition are 4 and memory is 16 GB, this would
allow launching the following instance types and numbers of instances.
I am using g4dn instances in this example (to compare with example 2), but the GPUs are of no
relevance in the first example.
This example would also work with instance types like m5 (m5.xlarge, m5.2xlarge, m5.4xlarge).
| allowed instance type | instance type hardware | no of jobs | desired vCPUs | number of launched instances | Comment |
|---|---|---|---|---|---|
| g4dn.xlarge | 4 vCPU - 16 GB | 1 | 4 | 1 | |
| g4dn.xlarge | 4 vCPU - 16 GB | 2 | 8 | 2 | vCPU (2x4) requires a 2nd instance |
| g4dn.xlarge | 4 vCPU - 16 GB | 4 | 16 | 4 | vCPU (4x4) requires 4 instances |
| g4dn.xlarge & g4dn.2xlarge | 4/8 vCPU - 16/32 GB | 1 | 4 | 1 | desired vCPU count still 4, because a g4dn.xlarge can be launched |
| g4dn.2xlarge | 8 vCPU - 32 GB | 1 | 8 | 1 | desired vCPU count is 8 not 4, because only a g4dn.2xlarge can be launched which already has the 8 vCPUs |
| g4dn.2xlarge | 8 vCPU - 32 GB | 2 | 8 | 1 | |
| g4dn.2xlarge | 8 vCPU - 32 GB | 4 | 16 | 2 | vCPU (4x4) requires a 2nd instance |
| g4dn.4xlarge | 16 vCPU - 64 GB | 1 | 16 | 1 | |
| g4dn.4xlarge | 16 vCPU - 64 GB | 2 | 16 | 1 | |
| g4dn.4xlarge | 16 vCPU - 64 GB | 4 | 16 | 1 | all jobs can run on one machine |
Please notice that the combination of RAM and vCPUs must allow running 2 or more jobs on one instance.
If the vCPUs were only 2 in the job definition but memory still 16 GB, this would still lead to the
same results as in the table above.
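A small sketch (not AWS code, just the reasoning from the note above) showing that the tighter of the memory and vCPU limits decides how many jobs fit on one instance:

import math

def jobs_per_instance(instance_vcpus, instance_mem_gb, job_vcpus, job_mem_gb):
    # The tighter of the vCPU and memory limits decides how many jobs fit on one instance
    return min(instance_vcpus // job_vcpus, instance_mem_gb // job_mem_gb)

def instances_needed(num_jobs, per_instance):
    return math.ceil(num_jobs / per_instance)

# job definition with only 2 vCPUs but still 16 GB memory on a g4dn.xlarge (4 vCPU, 16 GB):
per_instance = jobs_per_instance(4, 16, job_vcpus=2, job_mem_gb=16)  # memory limits this to 1 job
print(instances_needed(num_jobs=4, per_instance=per_instance))       # still 4 instances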
Example 2 - With GPU
Now taking a look at Job Definitions which require GPU=1.
The GPU will not be shared, meaning if a job requires 1 GPU it will always need its own instance (if
not using multi-GPU instances like g4dn.12xlarge), even though the memory and vCPUs would allow
running more jobs on one instance.
Only with these multi-GPU instances is it possible to run multiple containers on one machine, but the
configuration of these machines does not match the RAM and vCPUs of the jobs well, so I would not
recommend using them.
Job definition: 4 vCPUs, 1 GPU, 16 GB memory
| allowed instance type | instance type hardware | no of jobs | desired vCPUs | number of launched instances | Comment |
|---|---|---|---|---|---|
| g4dn.xlarge | 4 vCPU - 16 GB - 1 GPU | 1 | 4 | 1 | |
| g4dn.xlarge | 4 vCPU - 16 GB - 1 GPU | 2 | 8 | 2 | vCPU, RAM and GPU require launching a 2nd instance |
| g4dn.xlarge | 4 vCPU - 16 GB - 1 GPU | 4 | 16 | 4 | |
| g4dn.2xlarge | 8 vCPU - 32 GB - 1 GPU | 1 | 8 | 1 | desired vCPU count is 8 not 4, because only a g4dn.2xlarge can be launched which already has the 8 vCPUs |
| g4dn.2xlarge | 8 vCPU - 32 GB - 1 GPU | 2 | 8 | 2 | 2 instances launched because of the GPU |
| g4dn.2xlarge | 8 vCPU - 32 GB - 1 GPU | 4 | 16 | 4 | 4 instances launched because of the GPU |
| g4dn.4xlarge | 16 vCPU - 64 GB - 1 GPU | 1 | 16 | 1 | desired vCPU count is 16 not 4, because no smaller instance can be launched |
| g4dn.4xlarge | 16 vCPU - 64 GB - 1 GPU | 2 | 16 | 2 | 2 instances are launched because of the GPU |
| g4dn.4xlarge | 16 vCPU - 64 GB - 1 GPU | 4 | 16 | 4 | 4 instances are launched because of the GPU |
| g4dn.12xlarge | 48 vCPU - 192 GB - 4 GPU | 1 | 48 | 1 | |
| g4dn.12xlarge | 48 vCPU - 192 GB - 4 GPU | 2 | 48 | 1 | all jobs can run on one instance (but waste of vCPU and RAM) |
| g4dn.12xlarge | 48 vCPU - 192 GB - 4 GPU | 4 | 48 | 1 | all jobs can run on one instance (but waste of vCPU and RAM) |
Conclusion
Running containers in AWS Batch is not quite simple. You need to investigate which is the best
combination of the hardware requirements of the Job Definition and the hardware setup of
the Batch compute environment.
Try to tune the allowed instance type(s) to the Job Definition hardware settings.
If you only allow big instances, which can run for example 4 or 8 containers, they become oversized
the moment only one job is still running while the other 3 are already completed.
If you need GPUs, be aware that the GPU defined in the Job Definition is not shared, so running 2
jobs with GPU=1 in the Job Definition requires 2 EC2 instances (each with hardware GPU=1), even if
the vCPUs and memory would allow running 2 or more jobs on one instance.
If you run a lot of jobs, don't get confused if not enough EC2 instances are launched. Take a look
at the service quotas of your account.