Systems programming is hard - there are dragons everywhere

As part of the acceleration tech we are building at YourBase, I’ve done a lot of low-level programming again these last few years. I’ve been poking at things like language internals, eBPF, the Linux kernel and container internals.

Below are some of the more memorable bumps in the road that provided many hours of entertainment (I’m sure there were more but I may have blocked them out)…

eBPF programming

Kernel programming is challenging — you have limited resources and one wrong de-reference can cause the system to panic (i.e BSOD). I have written device drivers for network cards and serial devices in the past (I learned the “joys” of sk_buff and pointers to pointers to pointers while doing that) but that was, in my opinion, more straightforward than writing an eBPF program. There were a lot of sharp corners that I ran into when writing our system-level tracing — stack size limitations, collecting the correct system calls, lack of data about absolute paths, virtualization causing event out-of-order delivery issues, monitoring processes inside containers, and data loss issues; each of these required numerous attempts to get just right.

Are you sure all this stuff will fit in that box?

Because eBPF allows, effectively, for arbitrary programs to run inside the kernel it imposes a set of limitations. One of these limitations is a hard limit on stack size (to prevent un-bounded stack overflows to take down the kernel), making non-determinstic loops unusable (i.e no while) and requiring that all loops be unrolled. In addition to the limit on the stack size itself, each stack frame has a hard limit on the number of bytes, meaning that you have to store data outside the eBPF program if you want to do anything meaningful. This data gets written to a ring buffer that exists in user space, but requires you to keep track of that data in some shared structures available to you.

Go left at the fork()… or, was it right?

In order to trace the system calls of interest, the observer needs to know a few things: when to start (you can’t collect everything, that would be unmanagable), which system calls belong to which process (including its children), and when you have finished collecting the data. To do this we wrote a program that dynamically generates eBPF code and injects it into the kernel. This eBPF program, once injected, uses kernel probes to collect all the information about a given process that it is interested in. However, in addition to the initial process itself, this program also needs to follow any sub-processes that are created by catching any system call that could result in a new process ( fork(), clone(), exec*() and relatives) and begin to trace their behavior as well.

This program lives in userspace and receives a signal to start doing its work, at which point it uses a templated program written in a C-like language (thanks to the incredibly handy BCC toolkit), re-writes it with the required information, transpiles it to the underlying eBPF language and then injects that into the kernel. At the same time, this program is also responsible for extracting data that the eBPF program writes into its event buffer and flush it to disk fast enough to keep up with the stream of system events. Once the build is done, it needs to ensure it has all the data needed so that it can then cleanup by removing the eBPF program and associated probes from the kernel.

All paths do not lead to Rome

In the Linux kernel, processes know where they are at the time a system call is made as part of its process metadata. That means that system calls we care about (open, in particular) do not carry the full path to the file being opened. Since a process can be spawned anywhere in the file system, that makes it challenging to determine exactly which file is being considered, given that there are possibly hundreds of files called README.txt on the system. In order to overcome this, we needed to figure out that we could walk the filesystem pointer in the metadata - but constructing a full path would likely exceed our stack limitations as mentioned earlier. Through a lot of trial and error we realized we could emit those components as events and then re-assemble them in userspace; but did I mention that there’s no guarantee of ordering or proximity with the events in this buffer?

Spooky action at a distance (or, virtualization is weird)

One of the things that we didn’t identify, until it stopped working correctly in production, was that virtual machines have a subtly different behavior than bare metal (at least for our purposes). eBPF runs in the kernel, which on a bare metal system has direct access to the CPU and scheduling. However, because virtual machines are scheduled by the hypervisor, you are not always going to get the same behavior. In our case, it was entirely possible for some set of system calls to be handled by a virtual CPU that was being pre-empted by another VM temporarily. This had the interesting side-effect of delivering these system calls to our observer out of order or delayed enough to cause problems.

Containers are islands of their own

The way the system works is that the agent responsible for handling builds runs on the host, while builds run inside of containers. This presented a problem: the process inside the container (by design!) had no way of contacting the monitor on the outside to signal that it should trace the build (and where to start). We didn’t want to expose the external components to the builds for safety reasons so we had to slightly modify the build process to use a wrapper that was responsible for signaling to the monitor using esoteric system calls and waiting for the monitor to signal back that it was ready, and then wait at the end until the monitor was done recording all the data related to the build before completing.

Holy giant ring-buffer of doom!

It took rewriting the monitor in a few different times in order to get our memory consumption and CPU overhead down to a reasonable level. Our first pass in Python required that we allocate massive amounts of memory (sometimes up to 4GB) and involved lots of CPU time (an entire CPU in some cases) in order to avoid losing data during the build. This was because the overhead involved in shuffling data from the system buffer into the Python process took was significant enough that the eBPF program would periodically overwrite the head of the ring if we didn’t make it large enough. A second pass in Go required less CPU and memory but not enough to be meaningful — it consumed 1GB instead of 4GB, for some of the same reasons, and had lower CPU cost but wasn’t worth deploying. The third attempt, written in C++ during a long car ride to California, cut the memory overhead to under 100MB and the CPU utilization to less than 1% and ran in production for quite sometime. We were confident in the technology but it had a few memory safety issues and race conditions that we resolved by re-writing it (yet again — fourth time’s the charm?) in Rust, which has been running in production flawlessly for over a year now.