AWS Glue recently added support for running a “plain old Python script”, as it were, enabling users to conveniently inject some non-Spark functionality into their ETL processes. One side effect of this is that users can also use Glue as a poor man’s cron, as long as they use Python. This adds a third tool to AWS’s toolbelt of “ways to run a task periodically without managing servers”, alongside:
- Invoking a Lambda with a CloudWatch timer
- Executing an ECS task periodically on Fargate
Glue offers a new type of job, the Python shell (pythonshell) job type. A job of this type is implicitly a singleton: it runs on one machine and does not execute inside of a Spark context. This limits the functionality largely to pre-processing tasks that you might otherwise have done on a Spark head node, but it also means that startup time and footprint are reduced. A pythonshell job has a few additional configuration options, including a “dependencies file”, which is just a Python egg that the Glue wrapper will download from S3 and install into the local runtime before executing your job. Any dependencies you have can be packaged into this egg so that they are installed at run time.
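A Python shell job script itself is just an ordinary Python file. As a minimal sketch of what the my_sample_script.py used later in this post might contain (the use of requests here is only to show a dependency being pulled in from the egg, and the URL is a placeholder):

import logging

import requests  # installed at run time from the dependencies egg


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("yakshaver")


def main():
    # A trivial "non-Spark" task: fetch a URL and log the response status.
    response = requests.get("https://httpbin.org/get")
    logger.info("Fetched %s with status %s", response.url, response.status_code)


if __name__ == "__main__":
    main()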
Building a Python egg
This task is quite straightforward; using Python’s excellent setuptools we can simply create a new setup.py file similar to the following:
from setuptools import setup

setup(
    name = "yakshaver_dependencies",
    version = "1.0",
    author = "John Ewart",
    author_email = "john@johnewart.net",
    description = ("A quick example Python egg with dependencies"),
    install_requires = ['docutils>=0.3', 'requests']
)
This package description says to build an egg called yakshaver_dependencies that depends upon docutils (>=0.3) and requests. In order to build the egg, issue the following command:
python setup.py bdist_egg
You will see some output from setuptools, the end of which should look a bit like this:
copying yakshaver.egg-info/PKG-INFO -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying yakshaver.egg-info/SOURCES.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying yakshaver.egg-info/dependency_links.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying yakshaver.egg-info/requires.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying yakshaver.egg-info/top_level.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating dist
creating 'dist/yakshaver-0.0.4-py3.6.egg' and adding 'build/bdist.macosx-10.11-x86_64/egg' to it
(As a side note, egg files are just ZIP files with a particular structure, so you can always examine their contents by unzipping the file.)
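For instance, listing the archive produced above with unzip should show the EGG-INFO metadata alongside your packaged modules:

unzip -l dist/yakshaver-0.0.4-py3.6.egg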
Now that I’ve laid an egg, where do I put it?
In order for Glue to find your egg, you will need to upload it to an S3 bucket that Glue has access to. You may need to create a new bucket and an accompanying IAM / bucket policy that permits Glue to fetch from there (a sketch of such a policy follows the commands below), or, if your artifacts are free of any sensitive information, you can always put them in a public bucket to get started. Simply use the AWS CLI to do this (or any other tool you have handy):
aws s3 cp ./dist/yakshaver-0.0.4-py3.6.egg s3://glue-artifact-bucket/eggs/yakshaver-0.0.4-py3.6.egg
aws s3 cp ./my_sample_script.py s3://glue-artifact-bucket/scripts/my_sample_script.py
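As a rough sketch of the IAM piece mentioned above, and purely as an assumption about your setup, a read-only statement like the following could be attached to the role your Glue job runs under (the bucket name matches the example commands above):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::glue-artifact-bucket/*"
        }
    ]
}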
From here you can create a new Glue job with the pythonshell type, again via the CLI (you can also do this from the AWS console).
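A minimal sketch of that create-job call is below; the job name and role are placeholders, the S3 paths match the uploads above, and details such as --max-capacity and passing the egg via the --extra-py-files default argument may vary with your Glue and CLI versions:

aws glue create-job \
    --name yakshaver-nightly \
    --role MyGlueServiceRole \
    --command Name=pythonshell,ScriptLocation=s3://glue-artifact-bucket/scripts/my_sample_script.py,PythonVersion=3 \
    --default-arguments '{"--extra-py-files": "s3://glue-artifact-bucket/eggs/yakshaver-0.0.4-py3.6.egg"}' \
    --max-capacity 0.0625

Once created, the job can be started ad hoc with aws glue start-job-run, or run on a schedule by attaching a Glue trigger, which is what makes the “poor man’s cron” use case possible.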
Artifacts and complete example
You can find this complete example in my Git repository.