AWS Batch with Spot Block

May 24, 2018

AWS Batch is designed to make processing bulk workloads a much simpler task for developers, system administrators and anyone who needs to run on-demand workloads but doesn’t want to manage infrastructure.

Batch has three primary components: a compute environment, a job queue, and a job. Batch relies on ECS, the Elastic Container Service (formerly the EC2 Container Service), to manage the available infrastructure; in fact, a Batch “compute environment” is really an abstraction for an ECS cluster. A job queue can span one or more compute environments and connects submitted jobs to the compute infrastructure that runs them. Every job is a unique execution of a given container image running in a compute environment.

Understanding Batch Compute Environments

Compute environments, being tied to an ECS cluster, have two primary methods of operation – managed and unmanaged. A managed compute environment will automatically add and remove capacity to the compute cluster as jobs come and go from the job queue based on demand (number of jobs in the queue) and the available capacity (how many instances Batch can provision).

A compute environment can be limited to one or more families of EC2 instances or can be configured to pick and choose from a set of instance families based on optimal resource usage. This is useful, for example, if you want to build a compute environment that has a guaranteed set or type of resources; you could build a compute environment that only provisions P-family instances if you need GPUs for your jobs or R-family instances if your workload is memory-intensive.

The AWS Spot Market

The AWS spot market is designed to allow for optimal usage of resources by providing a marketplace for unused resources where users can leverage on-demand capacity at a particular price. This model allows users to say, for example, that they want access to 20 c5.4xlarge instances but only if it costs them $0.05/hour. If supply and demand for those instances is such that the going rate for c5.4xlarge instances falls to (or below) that price, then the capacity will be provisioned and run until the price goes above the requested rate or the instances are turned off.

For users that have fully adopted the “on-demand” model of computing this is a great thing – you can gain access to infrastructure at a steep discount (often 70% off or more) as long as you’re not picky about when your tasks run and are able to handle periodic failure in the form of partial completion of work and repeated or incomplete tasks.

Spot market block requests

The downside to the spot market is that it is a market, so supply and demand are in full effect. If demand outstrips supply, and the value of your instances goes above your bid, your capacity will be put back on the market for others willing to pay more. This is great if you have workloads that you can pause and resume – you get a notice that your capacity is going to be taken away – but not all workloads have that level of flexibility. This is where spot block comes into play. Spot block allows you to take advantage of the spot market but add an additional criterion to your bid that says “I want to pay $0.05 for 20 c5.4xlarge instances and I need them for 3 hours”; by doing this you can now take advantage of the spot market for jobs that need some minimum duration in order to make progress.

Spot and Batch

Batch natively supports taking advantage of the spot market; you can configure your compute environment to use the Spot market instead of relying on on-demand instances. This lets you set a price you are willing to pay for infrastructure and then Batch will use that to place bids on the Spot market for you. Again, this works really well for short-lived jobs or jobs that can repeat or be interrupted without issue. For some customers, this doesn’t match up with their compute needs – imagine big-data jobs in science or health care where each step can sometimes take hours to complete. Jobs like these need Spot instances with a guaranteed minimum duration, which makes them a perfect match for Spot’s block requests. This can be achieved by using a combination of an unmanaged environment, some EC2 user data and some API calls, which we will take a look at below.

Unmanaged environments to the rescue!

Because Batch compute environments are linked to an ECS cluster we can create an unmanaged compute environment (one in which Batch does not dynamically provision compute) and add our own instances to the cluster manually so that they can work on jobs in the queue. When Batch creates a new compute environment it will also create a corresponding ECS cluster; we can leverage this to make our own requests for Spot block instances.

In order to join an ECS cluster, an EC2 instance needs two things: the ECS agent and the name of the ECS cluster to join (specified in the configuration file for the agent). Once the instance is provisioned it will join the cluster and begin working.

Of course, this is nice but how do I do this? Fortunately it’s pretty simple; at a high level there are three steps to making this work:

  1. Create a compute environment and associated components
  2. Submit some jobs to the queue we created
  3. Submit a spot block request configured to join the ECS cluster

Once you do this, after the spot market request is fulfilled, the instances will join the ECS cluster and Batch will be able to run the jobs you enqueued.

Create Our Compute Environment

Below are some CloudFormation bits and shell scripts that will do most of the work for you.

SpotBlock:
  Type: AWS::Batch::ComputeEnvironment
  Properties:
    Type: UNMANAGED
    ComputeEnvironmentName: spot-block-test
    ServiceRole: BatchServiceRole
    State: ENABLED
Compute Environment CloudFormation Snippet

JobQueue:
  Type: AWS::Batch::JobQueue
  Properties:
    ComputeEnvironmentOrder:
      - Order: 1
        ComputeEnvironment: !Ref SpotBlock
    State: ENABLED
    Priority: 1
    JobQueueName: SpotBlockQueue
Job Queue CloudFormation Snippet

JobDefinition:
  Type: 'AWS::Batch::JobDefinition'
  Properties:
    Type: container
    JobDefinitionName: hello-world
    ContainerProperties:
      Memory: 256
      Privileged: false
      Vcpus: 1
      Image: hello-world
Job Definition CloudFormation Snippet

Configuration of the instances

ECS operates by running an agent on EC2 instances that communicates with the ECS backend to make capacity available for deployment of containers. To do this, an instance needs to be configured and able to talk to ECS. Using an AMI with the ECS agent and some EC2 user data will achieve this; below is the policy and information on generating that user data.

EC2 Instance Role Policy

In order for your EC2 instances to join the ECS cluster and begin to process jobs from your job queue, they need to have a policy that permits them to make the required ECS API calls and to write to CloudWatch Logs. Below is a policy that is fairly permissive – you can scope it down to more specific resources as needed, but for this example it requires the least configuration.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:CreateCluster",
        "ecs:DeregisterContainerInstance",
        "ecs:DiscoverPollEndpoint",
        "ecs:Poll",
        "ecs:RegisterContainerInstance",
        "ecs:StartTelemetrySession",
        "ecs:Submit*",
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
IAM Policy for ECS Instance Role

EC2 User Data for ECS

In addition to a policy you will need the name of the ECS cluster associated with the newly created unmanaged environment. You can get this using the AWS CLI and then use that data in a script to configure the ECS agent as outlined below.

# If you don't have jq installed, you can just manually grep the output
compute_environment_name="spot-block-test"
region="us-east-1"

# Get the cluster ARN (jq -r prints the raw string without surrounding quotes)
ecs_cluster_arn=$(aws batch describe-compute-environments --region "${region}" | jq -r --arg name "${compute_environment_name}" '.computeEnvironments[] | select(.computeEnvironmentName == $name) | .ecsClusterArn')

# Use the ARN to get the name
echo "Getting name for ECS Cluster ${ecs_cluster_arn}"
ecs_cluster_name=$(aws ecs describe-clusters --region "${region}" --clusters "${ecs_cluster_arn}" | jq -r '.clusters[].clusterName')

echo "Cluster name: ${ecs_cluster_name}"
Get the ECS cluster name
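
If you would rather skip the second API call, the cluster name can also be read off the ARN itself. The sketch below assumes the ARN has the usual arn:aws:ecs:<region>:<account>:cluster/<name> shape; the function name and the account ID in the example are placeholders, not part of the original setup.

```shell
# Hypothetical helper: derive the ECS cluster name from its ARN.
# Assumes the ARN ends in "cluster/<name>".
cluster_name_from_arn() {
  # Strip everything up to and including the last "/".
  printf '%s\n' "${1##*/}"
}

# Example with a placeholder account ID:
cluster_name_from_arn "arn:aws:ecs:us-east-1:123456789012:cluster/spot-block-test_Batch_e029eaeb-42c2-3471-8bf3-bad88bfcdc7d"
```

The describe-clusters call remains the safer route if you want AWS to confirm that the cluster actually exists.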

Using the cluster name from the above script, we can generate a script for our EC2 user data. This script will configure the ECS agent to join the cluster associated with the unmanaged compute environment.

#!/bin/bash
echo "ECS_CLUSTER=spot-block-test_Batch_e029eaeb-42c2-3471-8bf3-bad88bfcdc7d" >> /etc/ecs/ecs.config
EC2 user data to join the cluster
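
Since the cluster name will differ for every compute environment, it can help to generate the user data rather than type it by hand. This is a minimal sketch; make_user_data is a made-up helper name, and it assumes the cluster name contains no double quotes.

```shell
# Hypothetical helper: print an EC2 user data script that points the
# ECS agent at the given cluster.
make_user_data() {
  # $1 = ECS cluster name
  printf '#!/bin/bash\n'
  printf 'echo "ECS_CLUSTER=%s" >> /etc/ecs/ecs.config\n' "$1"
}

# Example: print the script for the cluster used in this post.
make_user_data "spot-block-test_Batch_e029eaeb-42c2-3471-8bf3-bad88bfcdc7d"
```

Redirect the output to a file such as ecs-user-data.sh if you want to encode it from disk.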

Put this into a file and generate the Base64 version of the data; on most UNIX systems the base64 utility will work for this (if you don’t have such a tool, there are web-based Base64-encoders; just make sure you don’t divulge any secrets when using them!). The -b 64 flag shown here is the macOS way to wrap output at 64 characters; GNU base64 uses -w 64 instead. For example:

$> base64 -b 64 ecs-user-data.sh
IyEvYmluL2Jhc2gKZWNobyAiRUNTX0NMVVNURVI9c3BvdC1ibG9jay10ZXN0X0Jh
dGNoX2UwMjllYWViLTQyYzItMzQ3MS04YmYzLWJhZDg4YmZjZGM3ZCIgPj4gL2V0
Yy9lY3MvZWNzLmNvbmZpZwo=
Base64 Encoding of our script

If you want to verify that this is the right data, you can decode it (again using base64; -D is the macOS decode flag, while GNU base64 uses -d):

echo "IyEvYmluL2Jhc2gKZWNobyAiRUNTX0NMVVNURVI9c3BvdC1ibG9jay10ZXN0X0Jh
dGNoX2UwMjllYWViLTQyYzItMzQ3MS04YmYzLWJhZDg4YmZjZGM3ZCIgPj4gL2V0
Yy9lY3MvZWNzLmNvbmZpZwo=" | base64 -D

#!/bin/bash
echo "ECS_CLUSTER=spot-block-test_Batch_e029eaeb-42c2-3471-8bf3-bad88bfcdc7d" >> /etc/ecs/ecs.config
Validate Base64 Encoding of our script
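
Because the wrap and decode flags differ between macOS and GNU base64, a pair of small wrappers can sidestep the difference. This is a sketch under the assumption that your base64 accepts the long --decode option (both the GNU and macOS versions document it); the function names are invented for illustration.

```shell
# Hypothetical helpers: encode stdin to a single Base64 line and decode it back.
encode_user_data() {
  # Strip any line wrapping so the result fits in one JSON string.
  base64 | tr -d '\n'
}

decode_user_data() {
  # Assumes the local base64 supports the long --decode option.
  base64 --decode
}

# Round trip: encode a short string, then decode it again.
printf 'hello world\n' | encode_user_data | decode_user_data
```

Producing a single line up front also means the value can be pasted directly into a JSON string later on.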

Once you have created the policy and the compute environment and generated the corresponding EC2 user data string, you can use them to submit a spot request.

Submitting a job

Submission of a Batch job is simple: using the AWS CLI or the web console you can create a job using the SubmitJob Batch API referencing our job and job queue definitions. The below command will do this:

aws batch submit-job \
   --job-name TestSpotBlock \
   --job-queue SpotBlockQueue \
   --job-definition hello-world
Command to submit a job to our queue

This will submit a single job named TestSpotBlock to our SpotBlockQueue queue using the hello-world job definition from above (which runs the hello-world Docker image).

Verify that the job was created

We can validate that the job was created in the queue:

aws batch list-jobs --job-queue SpotBlockQueue --job-status RUNNABLE
Command to list jobs in the queue

You should see that one job is in the queue and that its current state is RUNNABLE – as there is no capacity available in the compute environment, the job will wait until capacity is brought online. (Without --job-status, list-jobs returns only jobs that are RUNNING, so the new job would not show up.)

Making a Spot Block Request

Now we need to add some capacity to our compute environment as it is an unmanaged pool. To do this we just need to add hosts to our ECS cluster in some fashion; here we will submit a spot block request but you can also just provision EC2 capacity directly if you like. As noted above, you need to inject the ECS cluster name into the EC2 instances that are provisioned so that they join the ECS cluster and begin taking on work. EC2 provides a nice mechanism for this through user data, which allows you to add arbitrary configuration data to EC2 instances when they are brought online.

Using this information, we can generate the data needed for our spot market request. Below is an example of how we can request t2.micro instances in us-east-1c – these will be launched with the pre-configured ECS AMI in our desired subnet in the us-east-1 region, with our ECS instance role, security group and SSH key associated with them.

{
    "ImageId": "ami-a7a242da",
    "KeyName": "jewart",
    "SecurityGroupIds": [ "sg-55964223" ],
    "InstanceType": "t2.micro",
    "Placement": {
      "AvailabilityZone": "us-east-1c"
    },
    "IamInstanceProfile": {
      "Arn": "arn:aws:iam::332639962540:instance-profile/ecsInstanceRole"
    },
    "UserData": "IyEvYmluL2Jhc2gKZWNobyAiRUNTX0NMVVNURVI9c3BvdC1ibG9jay10ZXN0X0JhdGNoX2UwMjllYWViLTQyYzItMzQ3MS04YmYzLWJhZDg4YmZjZGM3ZCIgPj4gL2V0Yy9lY3MvZWNzLmNvbmZpZwo=",
    "SubnetId": "subnet-c8d30ae4"
}
spot-block-request.json
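
JSON strings cannot contain raw line breaks, so the Base64 user data has to sit on a single line in this file. One way to avoid hand-editing is to generate a trimmed-down launch specification from variables. The sketch below is only illustrative: build_launch_spec is a made-up helper, it does no JSON escaping (it assumes none of the values contain quotes or backslashes), and it leaves out the key pair, security group and placement fields shown above.

```shell
# Hypothetical helper: print a minimal launch specification as JSON.
# No escaping is performed; values must not contain quotes or backslashes.
build_launch_spec() {
  # $1 = AMI ID, $2 = instance type, $3 = subnet ID,
  # $4 = instance profile ARN, $5 = single-line Base64 user data
  cat <<EOF
{
    "ImageId": "$1",
    "InstanceType": "$2",
    "SubnetId": "$3",
    "IamInstanceProfile": {
      "Arn": "$4"
    },
    "UserData": "$5"
}
EOF
}

# Example with placeholder values:
build_launch_spec "ami-a7a242da" "t2.micro" "subnet-c8d30ae4" \
  "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole" \
  "aGVsbG8gd29ybGQK"
```

Redirecting the output to spot-block-request.json gives you a file you can extend with the remaining fields.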

To make a one-time market request of one instance for one hour (the duration must be a multiple of 60 minutes) at a price of three cents per hour you would issue the following command:

aws ec2 request-spot-instances \
    --spot-price "0.03" \
    --block-duration-minutes 60 \
    --instance-count 1 \
    --type "one-time" \
    --launch-specification file://spot-block-request.json
Command to request our spot instances
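
If you script these requests, it may be worth checking the duration before calling the API. The sketch below assumes the accepted values are the multiples of 60 from 60 through 360 minutes; the upper bound is my understanding of the spot block limit rather than something stated in this post, and valid_block_duration is a made-up name.

```shell
# Hypothetical helper: succeed only if $1 is a usable block duration.
# Assumes valid durations are multiples of 60 between 60 and 360 minutes.
valid_block_duration() {
  # Reject empty or non-numeric input first.
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 60 ] && [ "$1" -le 360 ] && [ $(( $1 % 60 )) -eq 0 ]
}

# Example: 180 minutes passes the check, 90 does not.
if valid_block_duration 180; then echo "180 is valid"; fi
if ! valid_block_duration 90; then echo "90 is not valid"; fi
```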

And that’s it! When the market can accommodate your request, your instance will be provisioned in the AZ you specified for the time window requested; the instance will join the ECS cluster on boot and be handed work to process.

What Happened?

Here you see how you can combine AWS Batch with the EC2 Spot Market block requests to ensure that you have workers that are able to run for a minimum duration. As a quick recap, the steps involved were:

  1. Create an unmanaged Batch compute environment
  2. Create a Batch job queue that uses our environment
  3. Create a Batch job definition
  4. Create an ECS policy for our EC2 instances
  5. Gather the information about the ECS cluster created for our compute environment
  6. Generate EC2 user data so that our EC2 instances will join our new ECS cluster when they are provisioned
  7. Enqueue some jobs in our new queue
  8. Make an EC2 Spot Market block request and wait for it to join our ECS cluster in order to process jobs