
Sorry your tests are so slow...

… but if they weren’t, you probably wouldn’t have time to read this fascinating story

For a time, I worked on a certain popular social reading site. It’s an amazing community of people (both those who work there and those who use the site!) and a really great application as far as I’m concerned. I can’t speak for these days, but at the time the majority of the code was Ruby on Rails, and it had been steadily growing over the years to become pretty enormous (think many hundreds of thousands of lines of code). I suspect many of you can guess where I’m going with this – the codebase became so unwieldy that new features took forever. This wasn’t any fault of the developers, who were a bunch of awesome and talented folks, but rather the simple fact that the code was so intertwined and the tests took forever to run. As I’ve mentioned elsewhere, this is a common side-effect of complex pieces of software, particularly those with many developers. Tests are the only way that we can reasonably assure ourselves that the changes we’ve made don’t wreak havoc on our product as a whole.

Release the … killer snail?

This complexity translated into developers deliberately avoiding automatic releases as part of the build pipeline. It wasn’t really feasible to run two to three hours of tests on every commit when you have the potential for hundreds of commits per day. Instead we would queue up changes and then periodically cut a new release. What felt like it should be “RELEASE THE KRAKEN!” (as we used to say in Slack) turned out to be more like “RELEASE THE KILLER SNAIL!” given how long it took to get a build out the door. Suffice to say, the pleasures of working on an aging Rails / JavaScript monolith are many — or at least I’m sure my therapist would agree.

In addition, working on the application locally was painful; getting a new developer set up took a long time because the wiki was often out of date and the dependencies were complicated. Because the long test cycle was such an interruption, developers were forced to choose between running a subset of tests (sometimes a subset of size zero) or just pushing their changes after manually spot-checking them locally and letting the CI system run all the tests. Compounding the problem, because the tests were so numerous (tens of thousands, if memory serves), any failing test meant wasted time. So much so that when tests were flaky and the reason was not obvious, they would be commented out or removed entirely to get the build to pass.

Stuck between a rock and a hard place…

Obviously this was nobody’s first choice of how to build software, but it became the only way to manage the release process and remain reasonably sane as a team. During these times I often found myself wondering things like: “How long has it been since I last had coffee?” or “Hey, I wonder what’s on Hacker News?”. This was often followed by “Why on Earth do we have to run 50,000,000 tests for such a small change?”

Ultimately, developers want to build products and get them into the hands of users; we want this process to be as fast as possible. At the same time, we want to not be paged at 2AM so we also want this to be as safe as possible. These are in some ways conflicting concepts - speed and safety - we write tests for safety but the cost of running them decreases our ability to release things quickly.

A million monkeys running tests

I can’t speak for everyone but as a developer myself, I hate waiting - I am an incredibly impatient person when it comes to technology in particular. It’s very easy for me to fall out of the zone if I have to spend too much time running tests or fighting with tools. As such, I am always looking for ways to make my development life better and there are a lot of good tools out there — but the one thing I never seemed able to address was a way to deal with long-running tests.

There are a number of existing ways to tackle long-running test suites, but in the end they pretty much all boil down to the same solution: do the work concurrently — parallelizing wherever possible — to reduce the wall-clock time. This reduces the perceived wait; assuming you have nearly infinite hardware available, running thousands of tests in parallel lets a team get through even a very large suite quickly. Unfortunately, I don’t know of many teams that have nearly infinite compute capacity, other than really large companies with incredibly large engineering budgets.

What if we could just do less work?

The second approach (which is more appealing to me) is to do less work! (Fans of the inimitable Larry Wall may remember that laziness is one of the defining characteristics he believes makes a great programmer; sounds like maybe he’s on to something). If this is possible, I could not only save perceived time but also cut down on the compute resources required, which is like a two-for-one sale!

This seems like an exceptionally obvious solution to saving time - just do fewer things and it will take less time - so why don’t we already do this? Tools like make have been able to accomplish this for decades; certainly modern tools should be able to do this as well.

Where is all the time spent?

As I mentioned earlier, a lot of time is spent waiting for tests to validate the correctness of our software. The compilation phase is often incredibly fast; computers are really good at dealing with structured, confined behavior. The messy part is simulating human interaction with our software - exercising as much of our code as possible requires a lot of time and repetitions of slightly modified behavior. So, given that most of our time is spent running tests, how do we do less work? I did some thinking and came up with this list:

  1. Don’t write tests in the first place.
  2. Write the absolute bare minimum of tests.
  3. Delete tests that take too long / behave weirdly / don’t pass randomly.
  4. Guess and check during development / run a random sampling.
  5. Run only the tests that are affected by a given change to the source code.

Solutions one through four will get the job done but they clearly sacrifice the benefits of writing those tests if we’re not going to run them. I will be honest and say that writing tests is not always my strong suit but the first four options don’t really solve the problem without making our lives worse. This left me with the burning question:

What if I could run only the tests that were directly affected by my change?

The critical requirement is to cut down the number of tests being run without reducing the safety they bring. If we reduced the overall safety of the build then we might as well just run a random sampling of tests. I thought about how I might, as a developer, manually identify the minimal set of tests required to tell me that my changes to the program will not break existing functionality.

The only thing I could think of is to be able to identify exactly which tests use the code I changed, but that would require identifying not only those tests but also tests that use any code in the system that uses the code I changed. It was obvious that this last part was the hardest part of correctly selecting tests manually; it would be nearly impossible for me to determine all of that by hand. I figured there had to be a way to take the information we know, or can observe, about the process and algorithmically determine which tests were relevant for a given set of changes.

How might we go about solving this? 🤔

In thinking about how to automate this process I came to the conclusion that there was only one way to solve this problem: let the computer do it for me. Computers are great at keeping track of data, much better than humans, but the problem still remained: how to go about teaching the computer to determine the minimum set of required tests for a given change?

Programs are just machines

At the end of the day, a program is just like a machine: you put something into it, it does some work, and something comes out the other side. For example, if you input a series of keystrokes into a typewriter, you get a novel. If you input a different series of keystrokes you will get a different novel! Similarly, our program, and its tests, behave in the same way – our tests are designed to provide our program with a variety of inputs and ensure that the output is correct. Thinking along these lines, can we consider our builds to simply be a machine with input and output? If so, what are those inputs and outputs?

Source code in, build out

If we treat our build process as a recipe, we see that our source code is one of the primary ingredients:

  1. Get a snapshot of the source code from the refrigerator
  2. Take our source code, run it through a compiler (if needed)
  3. With our compiled program, run a series of tests to ensure correct flavor
  4. Wrap up the program and serve it

Recipes are supposed to be reproducible; in theory if another person takes the exact same ingredients, and follows the recipe to the letter, they will have the same dish. This is hard in a kitchen because we can’t replicate ingredients (yet!) and so there will always be some deviation in the inputs to the recipe. Fortunately for us, computers are able to provide precise copies of our input (the source code) at any given time; computers are just giant calculators that use math to perform the various steps our recipe requires.

Computers do lots of math

In mathematics there is a property certain operations have called idempotence. Simply put, an idempotent function produces the same result whether you apply it once or many times: \(f(f(x)) = f(x)\). One well-known idempotent operation is the absolute value, or abs(x) in many programming languages. We know that, for example, \(abs(-4) = 4\), and that \(abs(4) = 4\) as well. Therefore we can say that \(abs(abs(-4)) = abs(-4)\), which makes the absolute value operation idempotent.
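In code, the property is easy to check (a toy sketch, not from the original post):

```python
# A function f is idempotent when f(f(x)) == f(x) for every input x.
def is_idempotent_on(f, samples):
    """Check f(f(x)) == f(x) for each sample input."""
    return all(f(f(x)) == f(x) for x in samples)

print(is_idempotent_on(abs, [-4, 0, 4, -2.5]))       # True: abs is idempotent
print(is_idempotent_on(lambda x: x + 1, [0, 1, 2]))  # False: increment is not
```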

It turns out that we are (or at least should be) able to apply this to building software. If we treat our build recipe as a function, and use the state of our source code at a point in time as its input, the output should always be the same. If we use a commit in a source control system to represent our code at a given point in time, we can say that \(build(commit_{i}) = commit_{i}\), since our build process should not alter our source code. Given this, applying the build process a second time must produce the exact same output, which makes our build process idempotent: we can apply it to our source code as many times as we like and get the same result.

\(build(build(commit_{i})) = build(commit_{i})\)

Certainly we should be able to treat our builds as idempotent, but what if we apply this line of thinking to our tests as well? In theory we should be able to take any test in our commit, run it multiple times, and get the same results. Assuming that’s the case, and we can say that \(test_{m}(commit_{i}) = test_{m}(test_{m}(commit_{i}))\), does this get us any closer to solving our problem?

The short answer is that it does! As long as our tests exhibit idempotence, we can expand our analysis further by decomposing our definition of our tests one step further:

\(test_{m}(commit_{i}) = test_{m}(\{line \in commit_{i} \mid line \text{ is executed by } test_{m}\})\)

This holds true because our test executes some set of lines in our codebase every time it runs. The outcome could theoretically change between executions, but that would mean the test either has a side-effect or is impacted by external factors — a test whose behavior we cannot control. With this new definition we can say that if the contents of the lines executed by a test do not change, the outcome of the test will not change either.

Same input? Same output!

This is a very powerful observation; it allows us to say that, if we can compute the entire set of lines invoked by a given test, and none of those lines change between two runs, the results will be identical. If the test passed at a point in time, \(T_{0}\), then it is guaranteed to pass at a time in the future, \(T_{1}\), so long as the lines it executes have not changed.

Fortunately for us, computers are excellent record keepers, which means we are able to store and recall all the data we need to exploit this property of our tests. If we observe the system in such a way that we can record all of the inputs to a particular test (i.e. which lines of code were executed), and at a later time determine whether or not those inputs changed, then we can let the computer do all the hard work for us. (And that’s exactly what we did!)
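To make that concrete, here is a toy sketch (the file contents and the fingerprint helper are made up for illustration) of recording the lines a test executed, hashing their contents, and deciding later whether the test can be skipped:

```python
import hashlib

def fingerprint(source_lines, executed):
    """Hash the content of exactly the lines a test executed.
    source_lines: {path: [line strings]}; executed: {path: {line numbers}}."""
    h = hashlib.sha256()
    for path in sorted(executed):
        for lineno in sorted(executed[path]):
            h.update(f"{path}:{lineno}:{source_lines[path][lineno - 1]}".encode())
    return h.hexdigest()

# Hypothetical snapshot of the codebase at T0, and the lines test_m touched.
code_t0 = {"calc.py": ["def add(a, b):", "    return a + b",
                       "def sub(a, b):", "    return a - b"]}
executed_by_test = {"calc.py": {1, 2}}   # test_m only ran add()

baseline = fingerprint(code_t0, executed_by_test)

# At T1 only sub() changed (lines 3-4), so test_m's inputs are identical.
code_t1 = {"calc.py": ["def add(a, b):", "    return a + b",
                       "def sub(a, b):", "    return a - b + 0"]}
print(fingerprint(code_t1, executed_by_test) == baseline)  # True -> safe to skip test_m
```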

Turning dreams into reality

What if we could make this a product? Wouldn’t that be awesome? I certainly thought so. Taking this approach of minimizing work through observability to deliver tools that others could use to save themselves time became the foundation of our company. There were many bumps in the road; the rest of this post focuses on the journey I’ve had since then.

This stuff is hard…

Computer science has two hard problems: cache invalidation, naming things, and off-by-one errors.

I’ve built a lot of things over the years, but the confluence of everything needed to solve these problems has presented its own unique challenges. Funnily enough, I’ve never been as keenly aware of this old adage as I have been while building our product.

Cache invalidation

It turns out a lot of builds have steps that can be avoided — do you really need to run apt-get install on every build? To avoid this, many CI systems allow you to cache different parts of your build, but you have to do this manually; this means that you need to know ahead of time which portions of your process are safe to cache and which aren’t. The problem is that if you don’t know when to invalidate your cache, you end up skipping build and test steps that should have run.

Since humans are not always great at knowing when to invalidate caches, we thought: “hey, why not let the computer do the work for us? What could be so hard about this?” (spoiler alert — the answer is: a lot). So we built tools that use system hooks to transparently trace the behavior of programs using eBPF (more on that later), and then built an entire CI system that does intelligent caching to speed up builds. This approach does a great job of cutting out unnecessary operations during a build without requiring that the user decide what to cache. Trading space for time is increasingly cost-effective these days; at around two cents per gigabyte on AWS S3, you can store hundreds of builds for almost nothing.
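The automatic-invalidation half of the idea boils down to content-addressed cache keys: hash the exact bytes of a step's inputs and reuse the cached result only while the key is unchanged. A toy sketch (the lockfile and its contents are hypothetical):

```python
import hashlib

def cache_key(files):
    """Key a cached build step by the exact bytes of its declared inputs.
    files: {name: bytes}. Any change to any input changes the key."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()[:16]

# Reuse the cached `bundle install` output only while the lockfile is unchanged.
inputs = {"Gemfile.lock": b"rails (6.1.4)\n"}
key_before = cache_key(inputs)

inputs["Gemfile.lock"] = b"rails (7.0.0)\n"   # dependency bumped
print(cache_key(inputs) == key_before)         # False -> cache invalidated, rerun the step
```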

Off-by-one errors

Kernel programming results in death by a thousand paper cuts. It took us four different iterations of our eBPF tracing technology to get to a point where the system was reliable: the first in Python, followed by an attempt in Go, then a “mostly-functional-but-still-subject-to-race-conditions” C++ implementation, until we finally landed on a stable rewrite in Rust! (Sometime in the near future I’ll write about all of the challenges encountered along the way.)

The first attempt: process tracing

At its heart, our first pass at acceleration had three components:

  1. Observe the system calls made during build / test
  2. Build a map of the processes executed and which files were involved
  3. Use that map to understand which processes could be skipped based on changes to files between commits
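Boiled down to a toy sketch (the trace records and commands here are made up), those three steps look something like this:

```python
# Hypothetical trace records: (pid, command, file opened), as observed via eBPF.
trace = [
    (101, "npm run build", "web/app.js"),
    (101, "npm run build", "web/util.js"),
    (202, "go test ./...", "api/server.go"),
]

def processes_to_skip(trace, changed_files):
    """A process may be skipped if none of the files it touched have changed."""
    touched = {}
    for pid, cmd, path in trace:
        touched.setdefault((pid, cmd), set()).add(path)
    return [cmd for (pid, cmd), files in touched.items()
            if not files & changed_files]

# Only the JavaScript changed, so the Go tests can be skipped.
print(processes_to_skip(trace, {"web/app.js"}))  # ['go test ./...']
```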

Quis custodiet ipsos custodes?

(Or, who will watch the watchers?) The Linux kernel has a really nice mechanism for observing the execution of processes and the system calls that they make. This system is called eBPF, or the extended Berkeley Packet Filter. It is called extended because it has grown well beyond filtering network packets: eBPF programs can attach to probes and tracepoints throughout the kernel, including system calls. For those who are unaware of this technology, it is very useful — folks like Brendan Gregg (from Netflix) have used it to do some really impressive performance analysis of systems in production.

How hard could this be?

Those words will probably be inscribed on my headstone; perhaps unsurprisingly, it turned out that there are a lot of challenges to building such a system:

  1. Kernel programming is hard and has a slow feedback cycle.
  2. SAFE kernel programming brings some really tight restrictions.
  3. Using a higher-level language to transpile into kernel code that is guaranteed to not cause kernel panics has yet more restrictions - (shocking, I’m sure).
  4. Tracing processes inside of Docker containers from outside the container is a little like trying to play a game of “Guess Who?”
  5. Getting the data out of the kernel fast enough to not lose some of it is hard.
  6. Hypervisors and CPU schedulers alternate a VM’s workload between different processors, which causes eBPF messages about system calls to behave in weird ways.
  7. Using this data to determine when and what to skip relies on all of the above working perfectly - errors or missing data can lead to critical steps in the build and test process being skipped.
  8. Processes often have inter-dependencies (i.e. step 1 may write a file that step 3 reads) so all this data needs to be correlated.

In spite of these challenges, we built a system (aptly named skipper!) that was able to collect and use this information to skip processes. It worked by taking the data collected by our observer and generating a graph representing the build’s process tree and the files each process depended on. Once we had this information, we used it to determine which processes could be skipped during the next build.

Automating this process

In order to make this work, we needed to build a CI system that was able to:

  1. Automatically observe the build process
  2. Perform the build in an isolated environment that we could snapshot (i.e. containers)
  3. Save the state of the build and use that as input for the next build
  4. Determine which cache was the correct one for the next build (basically step 0 for the next build)

This technology worked really well for the projects our first customers had, which were primarily mono-repo style projects composed of many small steps and multiple components in different languages. For a mono-repo with a mix of Go and JavaScript, when a change is made only to the JavaScript part of the repository, why re-run all the Go tests (or vice versa)? We hosted their entire CI, which allowed us to save even more time by automatically caching things for the customer between builds; rather than selectively caching parts that were thought to be safe, we cached everything and invalidated the parts we knew to have changed. Build times for customers went from 20-30 minutes down to as little as 5 minutes (even less for non-code changes), which vastly improved their velocity.

Skipping processes works well but it doesn’t work for every project

After applying process-based acceleration to a number of projects we noticed that, for certain classes of systems, process level acceleration could result in very meaningful time savings (some as much as 90%) but there was another class of systems where the optimizations were not as good. Unfortunately it turned out that in practice “not as good” is more like “almost worthless” — when a monolithic Python test process runs every single test, the chance that any individual change in the source code will cause all those tests to run again rapidly approaches 100%.

It also turns out that if you have a Rails project that has tens of thousands of tests, and they are run in N parallel groups, you now have N times the system-call data (most of which is worthless because they’re just Ruby opening a bunch of things over and over again) which adds up to gigabytes of system call data. This results in having the exact opposite of the desired effect because the time it takes to analyze that data far outweighs the time you saved by not running bundle install or apt-get install for the build. So for this new family of problems, our process-acceleration needed a sidekick to help out.

Round two: FIGHT! (or, language-level acceleration)

If we couldn’t skip the step because it nearly always has to run, we needed a way to make builds that execute inside one single process faster to compensate. I looked at ways we could apply our existing thinking to Ruby itself and thought “what if we were to trace all the function calls inside the RSpec tests and then use that data to skip tests?”

In a caffeine-fueled weekend of monkey-patching RSpec and hacking together some tracing code, I wrote the first version of our language acceleration technology (“intelligent test selection”) and deployed it for our customer. Lo and behold, their RSpec test time was cut from over an hour to under 15 minutes most of the time! I was amazed — I thought that maybe there would be a 50% savings if I was lucky, but it turns out that my intuition was correct: many changes have zero effect on large parts of the system.

The 80/80/80 rule

Unfortunately, a weekend project is not exactly a foolproof solution — the first 80% is getting the thing working, the next 80% is getting it to work on real-world projects, and the final 80% is making it usable and safe for other people. As with most prototypes, there were a lot of assumptions made that just didn’t hold when we tried to integrate it into a wide range of Ruby and Rails projects. Fortunately I have an amazing team that has taken many of my mad science projects and turned them into awesome products!

That was cool, let’s do it again! (or, what about Python?)

This new approach worked pretty well for Ruby, which was awesome — but we had a customer who wanted their Python / Django application tests to run faster, as over an hour of waiting for tests was becoming very painful for their developers (and not just the time it took — their laptops were getting very hot. Ouch!). This will be easy-peasy, I thought; Ruby has a tracing API, surely Python does too! It does (sys.settrace), but a tracer written in pure Python is far too slow to be practical. So I looked at the popular coverage library, figuring it would have a way to repeat what we had done for Ruby. It sure does — its tracer is written in C precisely because the pure-Python hook can’t keep up — so rather than re-implement it, I decided to hook into code coverage to get the data we needed. Sure enough, it worked — and by “worked” I mean the initial version took two to three times as long to run the tests as before — but the time savings on subsequent builds more than paid for the overhead. Thankfully that overhead is down to around 15-20% these days rather than 100-200%, which makes it that much better!
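As an illustration of what the tracer collects — coverage.py does this in C for speed, but Python's built-in sys.settrace hook shows the idea (fine for a demo, far too slow for a real test suite; the code under test is made up):

```python
import sys
from collections import defaultdict

def trace_lines(test_fn):
    """Run test_fn and record every (file, line) it executes, using the
    built-in sys.settrace hook. Returns {filename: {line numbers}}."""
    executed = defaultdict(set)
    def tracer(frame, event, arg):
        if event == "line":
            executed[frame.f_code.co_filename].add(frame.f_lineno)
        return tracer  # keep tracing lines inside each new frame
    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)
    return executed

# Hypothetical code under test, plus a test that exercises it.
def add(a, b):
    return a + b

def test_add():
    assert add(2, 3) == 5

executed = trace_lines(test_add)
print(sum(len(lines) for lines in executed.values()) > 0)  # True
```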

But, as it turns out, Python programming style is not always the same as Rails programming style, which meant there were dragons lying in wait when we tried this new technology out for a customer. Because Django often has all of the models in a single file (in this case, models.py), the probability that this file changed in any given commit was, in hindsight, unsurprisingly very high. As a result, given that almost every single one of the 20,000 tests depended on a model, the time savings were somewhere around 10% for this particular customer. This was a huge disappointment for us because, while it saved some time, the overhead at this point in the evolution was still 50-60%, so there just wasn’t enough time saved to make it usable. The question now was — if an often-changed file has far-reaching effects, can we reduce those effects by looking at something smaller?

Building an AST for fun and profit!

I looked for ways we could decrease the number of whale nodes in the dependency graph (i.e components that have many things depending on them) to help address this new problem. After some conversations I realized that, if we could consider only the part of the file being changed, we could in fact cut down the number of things depending on it; if only one model in a large models.py file changes, not everything that depends on a model needs to be re-evaluated, only things that depend on the model being changed.

It turns out that there is a way to do this — by analyzing the diff and building an AST for each changed file. Once we have built the AST we can determine exactly which components of it are affected by the change. With this we can figure out exactly which models (or functions) changed and use that information, instead of just file-level data, to determine which tests depend on those changes. This was a game-changer — it took the time savings from 10% to almost 80% in most builds. We did some in-depth analysis for this customer using data from their existing CI and determined that across a hundred or so commits, these savings were seen in roughly 75% of them! Those that were not still finished in under 50% of the original testing time, and only when something very far-reaching was touched, like a user model.
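A minimal sketch of the idea using Python's ast module (the functions being diffed are made up): hash each function's source segment in the old and new versions of a file, and the differing hashes are the changed functions.

```python
import ast, hashlib

def function_hashes(source):
    """Map each function name in a module to a hash of its source segment."""
    tree = ast.parse(source)
    out = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            seg = ast.get_source_segment(source, node)
            out[node.name] = hashlib.sha256(seg.encode()).hexdigest()
    return out

old = "def price(x):\n    return x * 2\n\ndef ship(x):\n    return x + 5\n"
new = "def price(x):\n    return x * 3\n\ndef ship(x):\n    return x + 5\n"

before, after = function_hashes(old), function_hashes(new)
changed = {name for name in after if before.get(name) != after[name]}
print(changed)  # {'price'} -- only tests depending on price() need to run
```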

Trace all the things!

Observing test execution and following every function call being made provides data about what serves as an input to the test. By following each function (and, in turn, the functions it calls) we are able to build a model of the inputs to our test functions both directly (i.e. what data they take) and indirectly. These indirect dependencies come in the form of the dependency chain of the function’s immediate dependencies (and subsequently their dependencies) and extend to other pieces of code (external libraries and internal code are of particular interest).

During the execution of the testing process, relevant information about which functions are called and what code is executed is recorded. This information is necessary to be able to understand exactly what serves as an input to any given test. For example, consider the following call graph for a hypothetical test, TestShippingCalculator:

Function map for a test

Because the dependency graph records not only the files involved during the execution of a test but also the functions themselves, we are able to provide a high-precision decision-making process. By knowing exactly which functions have been changed, we are able to minimize the number of tests that we need to execute - in the example above, we know that if any of the functions TestShippingCalculator calls during execution changes, we need to re-run that test. If any function not in this portion of the graph is changed, we can safely bypass this test.

We couple the set of source code changes with the known dependency graph and then use that data to select our tests; specifically, we look at what has changed between the state of the code at the point in time the referenced graph was generated and the state of the code being tested now. This change information comes in the form of a diff generated between these two points in time. In its simplest form, a diff represents the set of functions that have changed between the code at a given time, \(T_{0}\), and a later time, \(T_{1}\), as in the example below.

Observing state change and using it as input
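The selection step itself is then just an intersection between the diff and the recorded graph — a toy sketch with made-up test and function names:

```python
# Hypothetical dependency graph recorded at T0: test -> functions it executed.
graph = {
    "TestShippingCalculator": {"ship.cost", "ship.weight", "util.round2"},
    "TestUserSignup":         {"user.create", "util.round2"},
}

def select_tests(graph, changed_functions):
    """Run only the tests whose recorded inputs intersect the diff."""
    return sorted(t for t, deps in graph.items() if deps & changed_functions)

print(select_tests(graph, {"ship.cost"}))    # ['TestShippingCalculator']
print(select_tests(graph, {"util.round2"}))  # both tests depend on it
```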

What I think is even more awesome is that, because source control systems involve branching and moving back through history, we are able to identify the optimal graph to use when applying our optimizations — letting us run the smallest possible subset of tests while maintaining correctness.
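A toy sketch of that graph selection (the commit ids are hypothetical): walk back through a commit's ancestors until we find one with a recorded dependency graph, then diff against that commit.

```python
# Hypothetical linearized history: child -> parent commit.
parents = {"c4": "c3", "c3": "c2", "c2": "c1", "c1": None}
graphs_on_disk = {"c2"}   # commits for which a dependency graph was recorded

def nearest_graph(commit):
    """Walk ancestors until we find a commit with a recorded graph."""
    while commit is not None:
        if commit in graphs_on_disk:
            return commit
        commit = parents[commit]
    return None

print(nearest_graph("c4"))  # 'c2' -- diff c2..c4 and select tests using c2's graph
```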

Running head-first into lots of sharp corners

“The road to success is paved with lots of scraped knees.” - Me?

Building new technology is hard; it involves failing a lot and getting back up to try again. This definitely applied to us — there were lots of times when it seemed like things just weren’t going to work, or that we wouldn’t get the performance we needed out of the system, but we kept trying. Getting our acceleration to work for customers often involved writing patches against their stack and applying them at build time, only to have them stop working when the code changed.

(Side note: this is a really great way to both make your customer happy and drive yourself insane at the same time — having to understand their stack well enough to add your technology without breaking their code is kind of like shooting at a moving target from horseback while blindfolded.)

Most of the hard parts come from working with system-level mechanisms (notably eBPF and language-specific internals), combined with the fact that programmers can generate systems of effectively unbounded complexity — there are many corner cases you don’t know about until you run into them (and some of them are really sharp!)

Nobody gets there alone

As excited as I am about our tech, I’m even more excited about having such an amazing team full of awesome people. There’s no way this would have been possible without them, their amazing talent, and their desire to improve the lives of as many other developers as possible. When you’re done reading Hacker News, come take a look and see what we’re up to! Thanks again to my co-founder, Yves, and my family for all their support!