AWS Glue recently added support for running a “plain old Python script”, as it were, enabling users to conveniently inject some non-Spark functionality into their ETL processes. One side effect of this is that users can also use Glue as a poor man’s cron, as long as they use Python. This adds a third tool to AWS’s toolbelt of “ways to run a task periodically without managing servers”, alongside:
- Invoking a Lambda with a CloudWatch timer
- Executing an ECS task periodically on Fargate
Glue offers a new type of job, the Python shell (pythonshell) job type. A job of this type is implicitly a singleton: it runs on one machine and does not execute inside of a Spark context. This limits the functionality largely to pre-processing tasks that you might otherwise have done on a Spark head node, but it also means that startup time and footprint are reduced. A pythonshell job has a few additional configuration options, including a “dependencies file”, which is just a Python egg that the Glue wrapper will download from S3 and install into the local runtime before executing your job. Any dependencies you have can be packaged into this egg so that they are installed at run time.
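A Python shell job script itself is just an ordinary Python file. As a minimal sketch of what the my_sample_script.py used later in this post might contain (the use of requests here is only to show a dependency being pulled in from the egg, and the URL is a placeholder):

import logging

import requests  # installed at run time from the dependencies egg


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("yakshaver")


def main():
    # A trivial "non-Spark" task: fetch a URL and log the response status.
    response = requests.get("https://httpbin.org/get")
    logger.info("Fetched %s with status %s", response.url, response.status_code)


if __name__ == "__main__":
    main()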
Building a Python egg
This task is quite straightforward; using Python’s excellent setuptools we can simply create a new setup.py file similar to the following:
from setuptools import setup

setup(
    name = "yakshaver_dependencies",
    version = "1.0",
    author = "John Ewart",
    author_email = "john@johnewart.net",
    description = ("A quick example Python egg with dependencies"),
    install_requires = ['docutils>=0.3', 'requests']
)
This package description says to build an egg called yakshaver_dependencies that depends upon docutils (>=0.3) and requests. In order to build the egg, issue the following command:
python setup.py bdist_egg
You will see some output from setuptools, the end of which should look a bit like this:
copying yakshaver.egg-info/PKG-INFO -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying yakshaver.egg-info/SOURCES.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying yakshaver.egg-info/dependency_links.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying yakshaver.egg-info/requires.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying yakshaver.egg-info/top_level.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating dist
creating 'dist/yakshaver-0.0.4-py3.6.egg' and adding 'build/bdist.macosx-10.11-x86_64/egg' to it
(As a side note, egg files are just ZIP files with a particular structure, so you can always examine their contents by unzipping the file.)
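For instance, listing the archive produced above with unzip should show the EGG-INFO metadata alongside your packaged modules:

unzip -l dist/yakshaver-0.0.4-py3.6.egg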
Now that I’ve laid an egg, where do I put it?
In order for Glue to find your egg, you will need to upload it to an S3 bucket that Glue has access to. You may need to create a new bucket and an accompanying IAM / bucket policy that permits Glue to fetch from there (a sketch of such a policy follows the commands below), or, if your artifacts are free of any sensitive information, you can always put them in a public bucket to get started. Simply use the AWS CLI to do this (or any other tool you have handy):
aws s3 cp ./dist/yakshaver-0.0.4-py3.6.egg s3://glue-artifact-bucket/eggs/yakshaver-0.0.4-py3.6.egg
aws s3 cp ./my_sample_script.py s3://glue-artifact-bucket/scripts/my_sample_script.py
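As a rough sketch of the IAM piece mentioned above, and purely as an assumption about your setup, a read-only statement like the following could be attached to the role your Glue job runs under (the bucket name matches the example commands above):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::glue-artifact-bucket/*"
        }
    ]
}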
From here you can create a new Glue job with the pythonshell type, again via the CLI (you can also do this from the AWS console).
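A minimal sketch of that create-job call is below; the job name and role are placeholders, the S3 paths match the uploads above, and details such as --max-capacity and passing the egg via the --extra-py-files default argument may vary with your Glue and CLI versions:

aws glue create-job \
    --name yakshaver-nightly \
    --role MyGlueServiceRole \
    --command Name=pythonshell,ScriptLocation=s3://glue-artifact-bucket/scripts/my_sample_script.py,PythonVersion=3 \
    --default-arguments '{"--extra-py-files": "s3://glue-artifact-bucket/eggs/yakshaver-0.0.4-py3.6.egg"}' \
    --max-capacity 0.0625

Once created, the job can be started ad hoc with aws glue start-job-run, or run on a schedule by attaching a Glue trigger, which is what makes the “poor man’s cron” use case possible.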
Artifacts and complete example
You can find this complete example in my Git repository.