April 9, 2015

Thoughts on Dependencies

Dependency management is one of those things that, often, nobody thinks about until there is a problem. One of the great things about package managers like Maven, Bundler, pip and their relatives is that adding a new dependency is a snap. It can take less than thirty seconds to add a new Ruby gem to a Gemfile and install it via Bundler. This is a great convenience to developers; in the “old days” you would need to download and build every C / Perl / Ruby / Java library that you needed.

The value of libraries

Third-party libraries often provide large chunks of functionality that decrease your development time when building a product. For example, user-authentication mechanisms or web crawlers can be extremely time-consuming to implement. Even an initial version of a simple implementation could take days or weeks to build, and getting it right takes longer still: the first 90% of development often happens quickly, followed by the second 90% that is required to make sure everything works correctly.

Why would anyone want to re-implement a user-authentication mechanism? Chances are that most projects would not want to. There are lots of edge-cases to consider, and a community-managed project is quite likely to have addressed most, if not all, of those already through the course of its development. As a result, you can increase your development velocity by leveraging the combined development effort of the community at large. Bug fixes and improvements come in through the project’s contributors, and you don’t need to do a lot of maintenance for that specific set of functionality. Everybody wins.

The downside of dependencies

There is a flip-side to this: a dependency is like a relationship. You take from it, and maybe you give some back, but ultimately you are tying up some part of your future with this new dependency. If you depend on the Apache Commons IO package you are implicitly saying “I trust that the Commons IO package will bring me more value than hassle.” More often than not this is a reasonable assumption, but it’s a good idea to weigh the pros and cons before consuming any dependency.

Some potential problems that might stem from upstream dependencies:

  • Updating a particular library causes a widespread bug
  • Taking a new version of a library requires numerous updates to other dependencies
  • System libraries need to be updated for a new version of a dependency
  • Taking (or avoiding) a new version of a library leads to a security vulnerability
  • Deployed artifacts now consume an astronomical amount of space for every deployment
  • Two, or more, dependencies cause irreconcilable conflicts at build- or (even worse) run-time
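
The last item can be made concrete with the version classes that ship with RubyGems. This is only a sketch: the two gems, the shared library and all of the version numbers are invented for illustration.

```ruby
require "rubygems"

# Hypothetical constraints: gem_a wants shared_lib 1.x (1.4 or newer),
# gem_b wants shared_lib 2.0 or newer.
gem_a_needs = Gem::Requirement.new("~> 1.4")
gem_b_needs = Gem::Requirement.new(">= 2.0")

# Hypothetical list of published shared_lib versions.
available = %w[1.4.0 1.9.2 2.0.0 2.1.3].map { |v| Gem::Version.new(v) }

# A version is usable only if it satisfies every requirement at once.
def compatible_versions(versions, *requirements)
  versions.select { |v| requirements.all? { |r| r.satisfied_by?(v) } }
end

usable = compatible_versions(available, gem_a_needs, gem_b_needs)
puts usable.empty? ? "irreconcilable conflict" : "can use #{usable.map(&:to_s).join(', ')}"
```

With these made-up constraints no single version satisfies both gems, which is the situation a dependency manager cannot resolve on its own.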

Dependency overload

Consider an imaginary Rails application that we are working on. We need an HTTP client (of course!) so we add httparty. Later on we find that we need a parallelizing HTTP client, so we add typhoeus and faraday. Then we want to add a web-crawler to our app, so we add magic-spider-foo (which just depends on curb). At this point we have no fewer than three completely different HTTP clients in our project, along with all of their dependencies, both pure-Ruby and native (typhoeus depends on libcurl, as does curb, and each may have wildly different expectations of the underlying version). Additionally, various parts of the system now make HTTP calls through totally different clients, so we have completely different code and behavior in different portions of our code when calling upstream services. (This means we need to test all of these configurations and can’t necessarily share test components between them.)

Native dependencies in languages like Ruby, Java or Python can be challenging to manage because they often require very specific underlying library versions that do not always map well to pre-built packages for your platform. As a result they can be difficult to deploy and manage without the aid of an additional configuration management tool such as Puppet or Chef to ensure that the system libraries are available, or without managing packages for your particular deployment platforms.

Ruby is not unique in this; any sufficiently complex programming language (which is to say all of them) has the potential for a wild garden of dependencies. For example, Maven offers the <exclusions> tag that permits you to suppress transitive dependencies (effectively pruning your dependency graph by hand). This is used to prevent compile-time or runtime issues from occurring when you have conflicting APIs (i.e., conflicting sub-versions of a project that are not compatible). This tells us that this is a hard problem™ and that sometimes a human being has to intervene on behalf of the dependency manager.

Not all libraries are created equal

The quality bar is not the same for all projects. In particular, many open source projects see many contributors and sometimes even change ownership, which leads to discontinuity in vision and oversight. This is not to say that open source projects are bad; in fact they are incredibly useful and provide a lot of value. It is important to evaluate the quality of a project when taking a dependency on it; effectively you are counting on that project’s functionality to provide you with enough value to justify not implementing it yourself.

Copy-pasting code

There are times when it makes more sense to copy and paste some code rather than consume an entire library. If you only need a handful of methods, as long as they are independent, it may make more sense to simply lift that portion of code (with appropriate credit and licensing compliance) into your application. One does not always need something the size of the entire Apache commons-lang package to implement the isEmpty method on a string.
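
To give a sense of scale, here is a rough Ruby analogue of that kind of check. It is a sketch, not code taken from any library: the module and method names are made up, and it assumes that “empty” means nil or zero-length.

```ruby
# Hypothetical helper that lives in the application itself instead of
# arriving with a whole utility library. Assumes "empty" means nil or
# zero-length; whitespace-only strings count as non-empty.
module StringHelpers
  def self.empty_string?(value)
    value.nil? || value.empty?
  end
end

puts StringHelpers.empty_string?(nil)  # true
puts StringHelpers.empty_string?("")   # true
puts StringHelpers.empty_string?(" ")  # false
```

A few lines like these are cheap to own and test; the trade-off changes quickly once the lifted code has dependencies of its own.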

Some folks may cry that this is not “DRY” enough or that you “might miss out on bug fixes upstream”. Certainly this is true for the entirety of commons-lang: nobody should re-implement all of it in their application, because these libraries exist to make your life easier. Even so, I have personally encountered a number of projects where an attempt to maximize the body of shared code did more harm than good. In particular, systems that rely on a multitude of services suffer when you tie them together using too much common code.

Adopting a new pet

So you’ve decided that you need to take on a new dependency - great! There are a number of things to consider when taking on a third-party library:

  • Licensing
  • Performance
  • Security
  • Functionality

License management

Software licensing is a jungle - there are so many licenses available and a number of them are in direct conflict with what most people would consider business requirements. A lot of businesses do not want to (or simply cannot) release their source code for the things they are working on. That doesn’t mean that these companies are bad, or evil, or that they don’t contribute in other ways or on other projects. What it does mean though, is that it can be very challenging to know exactly what licenses are in the third-party dependencies that you have taken.

Dealing with licenses

Use a tool, such as Pivotal’s LicenseFinder, that can analyze your dependency graph and generate a report of the licenses in use. The great thing about LicenseFinder is that it works with a number of package-management systems across a variety of languages, including Maven, Bundler, pip and other popular tools. It can even scan licenses in polyglot projects where your Ruby application may have some Node.js dependencies, for example.

Performance problems

Lots of libraries experience performance regressions. Oftentimes the maintainers don’t even know it; they don’t use the library the way that you do and, even if they were aware of your usage, they likely don’t have a way to test the things you do.

Catching performance regressions

When writing unit and integration tests (you are doing this, aren’t you?) make sure that you are factoring in performance tests that execute your code with timing data. Keep track of how that performance changes and consider setting a baseline for these tests. If the timing exceeds a certain threshold it’s worth investigating; this is good practice even if you don’t have any upstream dependencies.
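
A minimal sketch of such a check in plain Ruby follows. The helper names, the baseline of 0.05 seconds and the 50% tolerance are all assumptions for illustration; in a real project the baseline would come from your own recorded measurements.

```ruby
# Runs the block several times and returns the fastest wall-clock time in
# seconds; taking the minimum reduces noise from a busy machine.
def best_time(runs: 5)
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end.min
end

# Hypothetical budget: the recorded baseline plus some allowed slack.
def within_budget?(elapsed, baseline:, tolerance: 0.5)
  elapsed <= baseline * (1 + tolerance)
end

# Stand-in workload; replace with the code path that calls your dependency.
elapsed = best_time { (1..10_000).map { |n| n.to_s }.join(",") }
puts within_budget?(elapsed, baseline: 0.05) ? "ok" : "investigate"
```

Timing tests are noisy, so a generous tolerance and a stable build machine matter more than the exact numbers used here.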

Security issues

Security issues are particularly hairy; the larger your dependency graph, the larger the surface area for exposure. Every dependency you take has the potential to expose your software to bugs and security flaws. If you don’t have a team dedicated to tracking vulnerabilities in the wild, notifying you when they appear and helping you patch them, you will be on the hook for all of this yourself.

Working to avoid security holes

Minimize your dependencies where it makes sense; know what sort of functionality each dependency has and where potential holes might affect you. Follow the mailing lists and keep up-to-date with the change logs of your dependencies. Security holes can appear in any library so the smaller your chain of dependencies the fewer things you have to keep track of. Admittedly this can be a lot of work, particularly for complex or rapidly changing dependencies; it helps to have a team that focuses on security for your organization but not everyone has access to such a resource.

Ensuring proper behavior

Closely related to performance issues are regressions in system behavior. Bugs (or “features” as they are sometimes called) can introduce side-effects in your stack. These new behaviors can cause localized or systemic issues in your application. Consider the ability of any Ruby library to monkey-patch core parts of the Ruby standard library. Imagine if one of your dependencies had taken it upon itself to alter something like Net::HTTP to its liking: what might happen to the rest of your application’s HTTP requests? (Hint: It’s not a pretty sight; I’ve been bitten by this once.)
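
The mechanism is easy to reproduce with a toy class. The sketch below deliberately uses a made-up TinyClient rather than Net::HTTP, and the “tracking” parameter is invented; it only shows how reopening a class changes behavior for every caller in the process.

```ruby
# Stand-in for a shared core class such as an HTTP client (hypothetical).
class TinyClient
  def get(path)
    "GET #{path}"
  end
end

before = TinyClient.new.get("/status")

# Imagine this reopening lives deep inside a dependency. Ruby lets any
# loaded code redefine the method for the entire process.
class TinyClient
  def get(path)
    "GET #{path}?tracking=1"
  end
end

after = TinyClient.new.get("/status")
puts before  # GET /status
puts after   # GET /status?tracking=1
```

Nothing in the calling code changed between the two requests, which is what makes this kind of regression hard to trace.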

Minimizing disruption due to poorly behaving libraries

Using continuous integration and writing comprehensive tests that not only perform localized unit tests but that exercise your dependencies at a variety of depths will help to catch these types of issues. In our case, updating a gem caused a cascading failure through our tests. This let us catch the issue before it made its way to production; things would have been less pleasant had we not been able to do so. Another helpful feature of most modern package managers is the ability to pin versions; in this way you can accept what should be minor changes (say, point releases) without much oversight but require manual changes to import new major versions of a dependency.
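
In Bundler this kind of pin is typically written with the pessimistic operator, for example gem "some_gem", "~> 2.3" in a Gemfile (the gem name and versions here are hypothetical). The sketch below uses the requirement classes from RubyGems to show which releases such a constraint would let through.

```ruby
require "rubygems"

# "~> 2.3" accepts 2.3 and later releases in the 2.x series, but not 3.0.
pinned = Gem::Requirement.new("~> 2.3")

# Hypothetical releases of the pinned gem.
%w[2.3.0 2.9.1 3.0.0].each do |candidate|
  accepted = pinned.satisfied_by?(Gem::Version.new(candidate))
  puts "#{candidate}: #{accepted ? 'accepted' : 'needs a manual bump'}"
end
```

A tighter pin such as "~> 2.3.0" would accept only 2.3.x releases; how tight to go depends on how much you trust the project’s versioning discipline.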

Weigh your options

It’s hard to get a good feeling for just how well-behaved or large a dependency is going to be; some things that seem rather innocuous can have far-reaching effects on your application’s stability, security and performance. Even mature, well-established projects that have been releasing software for years will periodically introduce bugs that can affect your code. Through a combination of observation, testing and general awareness you can maximize the value that you get from third-party packages while (hopefully) minimizing the blast radius when things do go wrong.