Using AWS Batch with GPU instances is a boon to engineers and scientists who want to do GPU-based programming (such as CUDA) without managing infrastructure. For those who come from a background where job schedulers like Torque are used, this model will seem quite familiar. To make Batch work with GPUs, a few specific things need to be configured.
In particular, the underlying EC2 host will have the required NVIDIA components on it, and those drivers and shared libraries need to be exposed to the container so that it has access to them. Using the AMI build script that I have put on GitHub, you can generate an EC2 AMI that has the nvidia-docker wrapper, the NVIDIA kernel drivers, and the ECS agent pre-installed. All of the CloudFormation templates needed to make this work can be found in the repository, so I will only show and highlight the important parts of the job template.
Note: The image generated by the above repository uses version 1.0 of the nvidia-docker project, which only supports CUDA < 9.x.
Job Definition CloudFormation Template
GPUJobDefinition:
  Type: 'AWS::Batch::JobDefinition'
  Properties:
    Type: container
    JobDefinitionName: !Sub "nvidia-smi-test-${AWS::StackName}"
    RetryStrategy:
      Attempts: 1
    ContainerProperties:
      MountPoints:
        - ReadOnly: false
          SourceVolume: nvidia
          ContainerPath: /usr/local/nvidia
      Volumes:
        - Host:
            SourcePath: /var/lib/nvidia-docker/volumes/nvidia_driver/latest
          Name: nvidia
      Command:
        - nvidia-smi
      Memory: 2000
      Privileged: true
      ReadonlyRootFilesystem: true
      Vcpus: 2
      Image: nvidia/cuda
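For readers who would rather register the job definition through the API than through CloudFormation, the same settings can be written as a plain Python dictionary. This is only a sketch: the function name `build_job_definition` and the stack name `my-stack` are made up for illustration, and the key names are assumed to follow the lower-camel-case form used by the Batch `RegisterJobDefinition` API.

```python
import json

# Host path where nvidia-docker 1.0 exposes the driver files (taken from the template).
DRIVER_HOST_PATH = "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"


def build_job_definition(stack_name: str) -> dict:
    """Return request parameters mirroring the CloudFormation job definition.

    Hypothetical helper: it only builds a dictionary and makes no AWS calls.
    """
    return {
        "jobDefinitionName": f"nvidia-smi-test-{stack_name}",
        "type": "container",
        "retryStrategy": {"attempts": 1},
        "containerProperties": {
            "image": "nvidia/cuda",
            "vcpus": 2,
            "memory": 2000,
            "command": ["nvidia-smi"],
            "privileged": True,
            "readonlyRootFilesystem": True,
            # Declare the host directory as a volume named "nvidia" ...
            "volumes": [
                {"name": "nvidia", "host": {"sourcePath": DRIVER_HOST_PATH}}
            ],
            # ... and mount that volume inside the container.
            "mountPoints": [
                {
                    "sourceVolume": "nvidia",
                    "containerPath": "/usr/local/nvidia",
                    "readOnly": False,
                }
            ],
        },
    }


if __name__ == "__main__":
    # "my-stack" is a placeholder stack name.
    print(json.dumps(build_job_definition("my-stack"), indent=2))
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("batch").register_job_definition(**build_job_definition("my-stack"))`.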
What This Does
The MountPoints, Volumes, and Privileged settings above have the following effect:
- Assign the path /var/lib/nvidia-docker/volumes/nvidia_driver/latest to a container volume named nvidia
- Map the volume nvidia as /usr/local/nvidia inside the container
- Run the container in privileged mode
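To make these settings more concrete, the sketch below translates them into the arguments of a roughly equivalent `docker run` invocation. This is an approximation for illustration only: the ECS agent talks to the Docker daemon directly rather than building a command line, and the helper name `docker_run_args` is hypothetical.

```python
import shlex


def docker_run_args(host_path, container_path, image, command,
                    privileged=True, read_only_root=True):
    """Build an approximate `docker run` argument list for the job settings."""
    args = ["docker", "run"]
    if privileged:
        # Corresponds to Privileged: true in the job definition.
        args.append("--privileged")
    if read_only_root:
        # Corresponds to ReadonlyRootFilesystem: true.
        args.append("--read-only")
    # Volumes + MountPoints together become a host-to-container bind mount.
    args += ["--volume", f"{host_path}:{container_path}"]
    args.append(image)
    args += list(command)
    return args


if __name__ == "__main__":
    args = docker_run_args(
        host_path="/var/lib/nvidia-docker/volumes/nvidia_driver/latest",
        container_path="/usr/local/nvidia",
        image="nvidia/cuda",
        command=["nvidia-smi"],
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(a) for a in args))
```

Running a command like this by hand on the GPU host can be a quick way to check the AMI before involving Batch.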
Why Privileged Mode?
Privileged mode is required because the NVIDIA libraries and tools need access to the underlying hardware for a number of their operations. Privileged mode lets a container perform operations that would normally be sandboxed or not permitted, such as accessing host devices, which are restricted via the cgroups mechanism in Linux. For this reason we need to set the Privileged flag on the job definition so that your jobs can access the GPU as required.
Privileged mode does come with additional security implications, so it is important to understand the risks it brings. You can read more in the security section of the Docker website to understand what this means for your containers.