hacker / welder / mechanic / carpenter / photographer / musician / writer / teacher / student

Musings of an Earth-bound carbon-based life form.

Background work is something most substantial web services have to wrestle with. It’s inherent in building any complex system with many moving parts. You simply can’t do everything during the lifecycle of an HTTP request – it’s either too costly or too time-consuming and users won’t wait around for things to finish. As a result, a lot of the work is going to be offloaded to some sort of asynchronous process and then attended to at a later date.

What makes background processes so hard? Primarily the tools – most tools have little to no reporting capability built in, so when something goes wrong it’s hard to see and/or track what’s happening from end to end. Once you disconnect a job from the request for work it becomes a “ghost in the machine” so to speak, has its own life-cycle that is not directly related to the completion of the request. As a result, tracing the steps that were taken can become difficult.

There have been many solutions to this problem. Among them are the likes of: BackgroundRB, beanstalkd, celery, Resque, and (my personal favorite) Gearman. Like any technology, each has their own pros and cons.

Experiences

BackgroundRB

This was our first attempt at asynchronous job processing. Every job is submitted to the BackgroundRB database for later execution, which requires BackgroundRB to operate as a database-backed monolithic process that forks and runs a handful of background workers, each of which handles some types of jobs. You can arguably scale horizontally with threaded workers and a handful of these BackgroundRB processes, but it makes it harder to scale out because your BackgroundRB process is closely tied to the database which means your requesters (clients) and workers (BdRB processes) must be adjacent to the database. On top of that, there are few tools to monitor and manage your queue.

Gearman

Gearman is an interesting one in that there are many implementations of the Gearman server (Ruby, C, Perl) and libraries for almost every popular programming language: C, C++, C#, Ruby, Perl, PHP, Python, Scala, even Erlang. Additionally because it is a network of clients, managers and workers, it provides horizontal scaling for free. You need more workers? Fire up some cloud servers. You need some more managers? Fire up some more cloud servers. You bring up more clients? Just point them at the managers and go. These alone provide huge benefits to developing a scalable system without having to be hindered by your background processes. Additionally, the Gearman protocol dictates that a job will be completed (it may fail, or error out) by a worker, so transactional problems (network connectivity, workers crashing, etc.) is handled for you, so you may fire and forget.

The downside to Gearman is transparency. In the C daemon, persistence is handled by one of many different database servers in the event that the daemon were to crash. But this persistence is provided as a failsafe only, not as a reporting engine. To do that you would need to set up some sort of database replication so that you could pull reports from the mirrored database because the daemon is quite fickle about read locks since it is an in-memory queue that just happens to be backed by a database for failure.

As we’ve been using Gearman internally, we typically find ourselves saying “wouldn’t it be neat if…?” so rather than complain we’ve just gone and built a system ourselves. That system is Gearman HQ, which handles load-balancing, failover, pure-persistence, reporting, alerts and system monitoring all in one.

Queue management problems

Time-dependent Scheduling

One thing Gearman (and other systems as well) lacks is the notion of time-dependent scheduling. In the Gearman protocol specification is support for future-scheduled jobs. This is presently only in my fork of the gearman daemon where I support future scheduling. I decided to do this because maintaining a separate system like cron just didn’t cut it, and also it provides flexibility in being able to decide when in the future a job will run.

Under some conditions, you may want to reschedule yourself in an hour when there’s a transient issue (network problem, external API returns an error), or maybe you want to reschedule yourself further and further in the future when an issue occurs (like SMTP exponential back-off) and then eventually fail, or perhaps you just need to reschedule for next month or next week based on a customer’s subscription level. Whatever the reason, it’s always better to be flexible with scheduling. To try to implement that using cron would have just been a nightmare.

Scaling

One of the beautiful things about Gearman is that it is designed as a client-server system for managing and processing discrete jobs. A job has a beginning and an end, as well as a named queue. This makes it dead simple to add more workers if you write them correctly, as well as having the added benefit of not tying you down to a specific language. As the broker speaks the common “gearman protocol”, any language having a library (of which there are many) can be used to write a worker or a client for getting things done.

This makes horizontal scaling a much easier thing to do, as adding workers is flexible – all they need is network access to the daemon to be able to do their job. However, the infrastructure management to create high-availability still requires some substantial technical know-how and then that provides another level of things that need to be monitored.

Transparency

Insight is the key to being able to manage any infrastructure, and background jobs are no different. Most of the currently available tools fall short in this area. Which is part of the reason we’re building Gearman HQ, to provide job queue management tools that make your life easier. Knowing things like when your queue is getting too large, the average runtime or time-to-completion is too high, the percentage of failing jobs, etc. are all vital to keeping track of your job queue. While there are solutions to some of these problems with the existing infrastructure components, it involves setting up multiple services such as Munin or similar tools to monitor your queues and your workers.

Another issue is finding a specific job in the queue and seeing where it is, if it’s been processed, what the result of the job was, and other bits of information. Currently that is nigh impossible short of setting up reporting systems with slave databases and a lot of other legwork.

Overly Complex Workflow

Any time you hand data off from one component to another is another chance for failure, one more link in the chain to follow and hunt down when something goes wrong. Some things I’ve learned in managing asynchronous processes are to keep things as simple as possible. Sure, it might be neat to process results with worker A and then hand them off to worker B to be put into the database. My suggestion is that unless you really need to do this, don’t. When something goes wrong, you have one more place you have to hunt down, one more component you need to test. Asynchronous work is already hard enough to manage, why make it any harder on yourself?

A solution I’ve been comfortable with on a particular project has been to use a RESTful API to handle results from workers. This way you have your worker and it makes a call to the API to handle updating or fetching of data. This way you can easily log if the API had an issue so that your trail is very sequential. The workflow looks like this:

  • Client requests work
  • Manager stores work
  • Worker picks up work
    • Fetches needed data from API
    • Processes data
    • Stores results via API

Whereas the original workflow looked like this:

  • Worker picks up work
    • Processes data
    • Submits job to the system to have some data updated
  • Another worker picks up the results
    • Stores the data in the database (maybe)

In scenario B, when storing the data in the database fails for some reason, worker A has no knowledge of the failure and no way to recover. Whereas in scenario A, the worker can detect that the API had an error (you are using meaningful HTTP error codes, aren’t you?) and gracefully handle it, either rescheduling itself in the future, or at least logging the failure right there in the stream of activity rather than having to hunt down another worker’s error message in a different log file with a potentially wildly different time stamp (there may have been 5K other database update requests before yours).

Overview

There are lots of solutions out there, which can make choosing the right tool a little daunting. I’d suggest looking for tools that:

  1. Provide insight into the system
  2. Scale well with your growth
  3. Fit your scheduling needs (i.e constant-schedule or dynamic schedule)
  4. Are easy to work with
  5. Support your programming language of choice

I know that some folks will say that “high performance” isn’t on there, and I’ll admit that’s true. I typically hold fast to the “premature optimization is the root of all evil” philosophy. Admittedly performance matters, but solving problems properly matters as well (and often-times is more important). There are challenges with every tool out there, but with patience and a well-thought out plan and a little bit of black magic, asynchronous jobs will make your life infinitely easier and sometimes even solve things that would otherwise be impossible.