AWS Batch and GPU-based Workloads

Running AWS Batch on GPU instances is a boon to engineers and scientists who want to do GPU-based programming (such as CUDA) without managing infrastructure. For those who come from a background where job schedulers like Torque are used, this model will seem quite familiar. Batch does need a few specific pieces of configuration before it will work with GPUs.

In particular, the underlying EC2 host carries the required NVIDIA components, and those drivers and shared libraries need to be exposed to the container so that it can use them. Using the AMI build script that I have put on GitHub, you can generate an EC2 AMI that has the nvidia-docker wrapper, the NVIDIA kernel drivers and the ECS agent pre-installed. All of the CloudFormation templates needed to make this work can be found in the repository, so I will only show the important parts of the job definition template here.

Note: The image generated by the above repository uses version 1.0 of the nvidia-docker project, which only supports CUDA versions below 9.x.

Job Definition CloudFormation Template

GPUJobDefinition:
  Type: 'AWS::Batch::JobDefinition'
  Properties:
    Type: container
    JobDefinitionName: !Sub "nvidia-smi-test-${AWS::StackName}"
    RetryStrategy:
      Attempts: 1
    ContainerProperties:
      # Mount the "nvidia" volume inside the container
      MountPoints:
        - ReadOnly: false
          SourceVolume: nvidia
          ContainerPath: /usr/local/nvidia
      # Expose the host's NVIDIA driver directory as a volume named "nvidia"
      Volumes:
        - Host:
            SourcePath: /var/lib/nvidia-docker/volumes/nvidia_driver/latest
          Name: nvidia
      Command:
        - nvidia-smi
      Memory: 2000
      # Required so the container can reach the GPU hardware
      Privileged: true
      ReadonlyRootFilesystem: true
      Vcpus: 2
      Image: nvidia/cuda

What This Does

The Volumes, MountPoints and Privileged settings above have the following effect:

Assign the path /var/lib/nvidia-docker/volumes/nvidia_driver/latest to a container volume named nvidia
Map the volume nvidia as /usr/local/nvidia inside the container
Run the container in privileged mode
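
The same three settings can also be expressed outside CloudFormation. The following is a minimal sketch, assuming you register the job definition with boto3 instead of a template; the function names gpu_container_properties, undeclared_volumes and register_gpu_job_definition are hypothetical helpers of my own and are not part of the repository. The values mirror the template above.

```python
# Host directory that nvidia-docker 1.0 populates with the driver files
# (the same path used in the template above).
DRIVER_HOST_PATH = "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"


def gpu_container_properties(image="nvidia/cuda", command=("nvidia-smi",)):
    """Build a containerProperties dict mirroring the CloudFormation template."""
    return {
        "image": image,
        "vcpus": 2,
        "memory": 2000,
        "command": list(command),
        # Privileged mode, so the container can reach the GPU hardware.
        "privileged": True,
        "readonlyRootFilesystem": True,
        # Expose the host driver directory as a volume named "nvidia" ...
        "volumes": [
            {"name": "nvidia", "host": {"sourcePath": DRIVER_HOST_PATH}},
        ],
        # ... and map that volume to /usr/local/nvidia inside the container.
        "mountPoints": [
            {
                "sourceVolume": "nvidia",
                "containerPath": "/usr/local/nvidia",
                "readOnly": False,
            },
        ],
    }


def undeclared_volumes(props):
    """Return mount point volume names that have no matching volume entry."""
    declared = {volume["name"] for volume in props.get("volumes", [])}
    return [
        mount["sourceVolume"]
        for mount in props.get("mountPoints", [])
        if mount["sourceVolume"] not in declared
    ]


def register_gpu_job_definition(job_name, region_name=None):
    """Register the job definition with AWS Batch (needs boto3 and credentials)."""
    import boto3  # imported here so the helpers above work without boto3

    batch = boto3.client("batch", region_name=region_name)
    return batch.register_job_definition(
        jobDefinitionName=job_name,
        type="container",
        retryStrategy={"attempts": 1},
        containerProperties=gpu_container_properties(),
    )
```

Calling undeclared_volumes(gpu_container_properties()) before registering is a cheap way to catch a mount point that names a volume you forgot to declare.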

Why Privileged Mode?

Privileged mode is required because the NVIDIA libraries and tools need access to the underlying hardware for a number of their operations. Privileged mode lets a container perform operations that would normally be sandboxed or not permitted by the cgroups mechanism in Linux. For this reason, the Privileged flag must be set on the job definition so that your jobs can access the GPU as required.

This does come with some additional security implications, so it is important to understand the risks it brings. You can read more in the security section of the Docker website to understand what this means for your containers.
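
If a job cannot see the GPU, it can help to check from inside the container whether the driver mount and the GPU device nodes are visible at all. The following is a rough sketch of such a check, not something taken from the repository: the function name visible_gpu_paths is hypothetical, and it assumes the NVIDIA driver exposes device nodes under /dev with names starting with nvidia (for example /dev/nvidia0 and /dev/nvidiactl) and that the driver volume is mounted at /usr/local/nvidia as in the job definition above.

```python
import glob
import os


def visible_gpu_paths(dev_dir="/dev", driver_dir="/usr/local/nvidia"):
    """Report which NVIDIA device nodes and driver mount the process can see."""
    # Device nodes such as nvidia0 and nvidiactl live directly under dev_dir.
    device_nodes = sorted(glob.glob(os.path.join(dev_dir, "nvidia*")))
    return {
        "device_nodes": device_nodes,
        # True only if the driver volume is mounted as a directory.
        "driver_mounted": os.path.isdir(driver_dir),
    }


if __name__ == "__main__":
    # Run as the job command (or before it) to log what the container can see.
    print(visible_gpu_paths())
```

An empty device_nodes list suggests the container is not running in privileged mode, while driver_mounted being False points at the volume or mount point configuration.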