The week I joined Ramp, CI test times for our Python monolith were creeping up towards 20 minutes. For people like me, that's plenty of time to lose momentum and get pulled into something else.
We have about 5000 tests in this part of the codebase today, most of which are functional in nature. The ratio of functional or integration to unit tests is a hot topic (and one that I don't want to take up more space talking about). We took a careful look and decided that restructuring a ton of tests didn't offer us much business value. Our belief was the best path forward was:
Integration/Functional tests are often a source of discovery. The more jobs an application has, the more layered it typically becomes. With layers comes dependency, which means something you didn't know even existed may break when you make a change. Sometimes this indicates tight coupling, and certainly we've been guilty of that from time to time, but sometimes it's a natural product of a single domain being represented in different systems. Having a short feedback loop can save a meaningful amount of time in the discovery process. We encourage a lot of internal mobility for engineers — a robust test suite and fast discovery process are crucial for spinning up quickly on a different project. We're also hiring a lot, so our median-tenure engineer (like me) is still making discoveries regularly.
Long CI runs can really derail momentum. When making a change that affects both data and code, there's often a multi-step dance to reduce the risk of production issues. A simple example is changing a column name: we can't atomically update both the code and the storage layer, so we take multiple steps. For cases like this, CI runtime can account for the vast majority of the time it takes to deliver the change into production.
Long test suite durations can also mean time lost waiting to observe riskier changes. While most of the deploy process is automated, I'm still on the hook for my change until it lands safely in production. When I know I've got to check dashboards in 20 minutes, I'm probably not going to start anything significant. Cutting that wait time gives me back quality focus time.
After weighing options/impact, we invested in three areas to improve test speeds.
Bumping concurrency is usually the first lever that organizations pull when they're unhappy with test run durations. We had an upper bound of 4 parallel workers and had not yet root-caused the limit.
We run our tests on AWS CodeBuild using Docker Compose to host the application and dependencies. In CI, our test setup code recreates our database from scratch, applying all database migrations from the beginning of time. We had been using the pytest-xdist plugin in multi-process mode for a while, which spins up a number of new test workers on the same machine. By default, each worker dutifully executes all test setup code in parallel, which for us meant multiple full db resets.
The number of concurrent schema changes meant that (1) Postgres would run out of working memory at higher concurrencies and (2) we were performing a bunch of redundant work. By restructuring our database setup code to coordinate which worker gets to actually apply the schema, we were able to support higher parallelism.
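A minimal sketch of that kind of coordination, assuming the workers share a directory on the same machine. The function name `apply_schema_once` and the marker file names are hypothetical, and this is not our actual setup code: the first worker to atomically create a claim file applies the schema, and every other worker waits for a "done" marker.

```python
import os
import time
from pathlib import Path
from typing import Callable


def apply_schema_once(
    shared_dir: Path,
    apply_schema: Callable[[], None],
    timeout: float = 600.0,
    poll: float = 0.05,
) -> bool:
    """Run apply_schema in exactly one worker.

    Returns True if this caller applied the schema, False if it waited
    for another worker to do so. All names here are illustrative.
    """
    claim = shared_dir / "schema.claim"
    done = shared_dir / "schema.done"
    try:
        # O_EXCL makes file creation atomic: exactly one worker wins.
        fd = os.open(claim, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another worker owns the setup; wait until it reports completion.
        deadline = time.monotonic() + timeout
        while not done.exists():
            if time.monotonic() > deadline:
                raise TimeoutError("schema setup did not finish in time")
            time.sleep(poll)
        return False
    os.close(fd)
    apply_schema()
    done.touch()
    return True
```

In a pytest setup this would typically be called from a session-scoped fixture in every worker. One caveat of the sketch: if the winning worker crashes while applying the schema, the others only find out when the timeout expires.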
This let us double the machine size, which halved our CI test times. We haven't taken this path further yet because there's a big jump in AWS instance sizes from general1.large (8 vCPU) to general1.2xlarge (72 vCPU).
When developing locally, you often know exactly what you want to test and you just want the answer quickly. When I started at Ramp, running a single test locally created a new database from scratch and applied all the migrations from the beginning of time.
We borrowed a page from some large web frameworks' books to cut time here: we stopped destroying the database after tests. When possible, the test harness applies any pending migrations on top. When it is unable to (e.g. the database and code versions cannot be reconciled), it just rebuilds from scratch. This simple change shaved off 10-30 seconds per test run, depending on whose laptop it ran on, whether they ran it in containers, etc.
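The decision between "apply pending migrations" and "rebuild" can be sketched as a small pure function. This is an illustration under assumptions, not our harness: it assumes migrations have ordered string ids, and the name `plan_database_setup` is made up for the example.

```python
from typing import List, Tuple


def plan_database_setup(applied: List[str], known: List[str]) -> Tuple[str, List[str]]:
    """Decide how to bring an existing test database up to date.

    applied: migration ids already recorded in the database, oldest first.
    known:   migration ids present in the checked-out code, oldest first.
    """
    if applied == known[: len(applied)]:
        # The database history is a prefix of the code's history:
        # keep the database and apply only what is missing.
        return "migrate", known[len(applied):]
    # Histories diverged (for example after switching branches): start over.
    return "rebuild", list(known)
```

For example, a database at `["001", "002"]` with code at `["001", "002", "003"]` only needs `"003"` applied, while a database containing a migration the code does not know about gets rebuilt.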
"The fastest tests are those that are skipped" - someone out there, probably.
Having spent time at larger organizations with robust build systems and compiled/statically typed languages, I was spoiled by seeing CI systems only run tests for changed code/modules. In my spare time during my first month at Ramp, I wanted to see how far we could get without significantly overhauling our tools.
TL;DR: we made some progress towards an "incremental" pytest runner that only executes the tests that are relevant to a given code change. There are limitations with the approach we chose -- we'll touch on those because it is an interesting topic, not because we recommend this approach. Because the cost of accidentally skipping a test that might fail is really high, we execute our incremental runner in parallel to our regular suite. This executes 30-40% of the test suite on average and catches nearly all the same issues. This usually gives engineers actionable information in a few minutes rather than waiting the full 10.
Looking at the partial and full test runs for each commit shows that any given test failure has about a 90% chance of being executed in the partial run. Closing that last 10% requires a lot more sophistication with this approach. Reorganizing our code and using something like Bazel starts to look like a smaller investment, but it's not the highest priority today.
Knowing what I do now, would I invest in this again? No. Was it a great learning experience? Absolutely.
There are a number of open source plugins that I looked at and tried with our codebase:
As I mentioned earlier, we use pytest-xdist to parallelize at the process level. When I evaluated these options, there was no support for parallelization, so we didn't have a plug and play solution for CI. Running 25% of the tests on 25% of the cores isn't a huge win for us.
The most common approach here was really compelling, though: measure the code coverage (lines executed) during each individual test run to build a map from each test to the code it depends on. At test execution time, we check what code has changed and map that to a list of tests to run. pytest-testmon has a great explanation of the thinking here and links to some limitations of using coverage.py.
While there is a lot to like and learn from in the open source plugins, we found some shortcomings with the coverage based approach and did things a bit differently. Note: we simplified the problem for ourselves by considering all code within a file as changed if any code within that file had been changed. This works well enough for our codebase, but it is not optimal and may not work well for everyone's.
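At file granularity, the selection step reduces to a set intersection. A minimal sketch, assuming the dependency map has already been recorded; `select_tests` and the example test ids and paths are hypothetical:

```python
from typing import Dict, Iterable, List, Set


def select_tests(test_to_files: Dict[str, Set[str]], changed_files: Iterable[str]) -> List[str]:
    """Return the tests whose recorded dependencies overlap the changed files."""
    changed = set(changed_files)
    return sorted(test for test, files in test_to_files.items() if files & changed)


# Hypothetical dependency map: test id -> files executed while it ran.
snapshot = {
    "tests/test_cards.py::test_issue": {"app/cards.py", "app/models.py"},
    "tests/test_users.py::test_signup": {"app/users.py", "app/models.py"},
    "tests/test_health.py::test_ping": {"app/health.py"},
}

affected = select_tests(snapshot, ["app/cards.py"])
```

Note that a test which is itself new or modified has no entry in an older map, so a real runner would also need to select changed test files directly.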
[Diagram: the Test Runner (pre-merge) and the Snapshot Builder (post-merge) both read from Git, and both are connected to S3.]
Our pytest plugin has two modes: one that builds a snapshot, and --run-incremental. The former writes the test/code dependency data to a "snapshot" file. The latter figures out which code files have changed since that snapshot and runs only the affected tests.
We generate a new "snapshot" after each commit to our main branch and store the data (~200 KB compressed) in S3. When the test runner is invoked, it looks through its git history until it finds a match that exists in S3. Eventually it gives up and runs the whole test suite if it can't find a match, but we haven't seen that in practice.
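The lookup can be sketched as a walk over recent commits, newest first. This is a hedged illustration: `recent_commits` and `find_snapshot_commit` are made-up names, the search depth of 50 is arbitrary, and the storage check is passed in as a function so the sketch stays independent of any particular S3 client.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def recent_commits(limit: int = 50) -> List[str]:
    """List up to `limit` commit hashes reachable from HEAD, newest first."""
    result = subprocess.run(
        ["git", "rev-list", f"--max-count={limit}", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def find_snapshot_commit(
    commits: Iterable[str], snapshot_exists: Callable[[str], bool]
) -> Optional[str]:
    """Return the newest commit that has a stored snapshot, or None."""
    for sha in commits:
        if snapshot_exists(sha):
            return sha
    return None  # caller falls back to running the whole suite
```

A caller would pass something like `find_snapshot_commit(recent_commits(), exists_in_store)`, where `exists_in_store` is whatever check the storage layer offers, and run the full suite when the result is `None`.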
We considered looking at file system and git data to determine whether a given code file was at all different from a previous run. This definition of "changed" ignores code equivalence, so adding or removing a trailing newline counts as a change, but it is very simple to work with.
When using the file system, there are a few obvious data points: modified times, file size, and contents. These have different accuracy and cost profiles for this use case. Modified time is cheap, but yields false positives. File size is cheap, but yields false negatives. Hashing the contents is expensive but accurate (modulo formatting changes).
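To make the trade-off concrete, here is a small sketch that collects all three signals for a file. The `Fingerprint` type and `fingerprint` function are hypothetical names for illustration, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import os
from pathlib import Path
from typing import NamedTuple


class Fingerprint(NamedTuple):
    mtime_ns: int  # cheap, but changes on any rewrite, even with identical bytes
    size: int      # cheap, but identical for same-length edits
    sha256: str    # requires reading the whole file, but tracks the contents


def fingerprint(path: Path) -> Fingerprint:
    """Collect the three change signals for one file."""
    stat = os.stat(path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return Fingerprint(stat.st_mtime_ns, stat.st_size, digest)
```

Editing `x = 1` to `x = 2` leaves the size unchanged (a false negative for size) while the hash differs; rewriting a file with identical bytes may bump the modified time (a false positive) while the hash stays the same.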
We landed on using git commits as the basis for comparison, as it turns out to be less expensive to calculate and requires less code/storage. Executing git diff --name-only origin/main basically gets the job done. Arguably the downside is a coupling to our version control system, but we'll outgrow this pytest plugin long before we outgrow git.
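Wrapping that command takes only a few lines. A sketch, with hypothetical helper names; in the incremental runner the base would be the commit of the snapshot being compared against rather than always `origin/main`:

```python
import subprocess
from typing import List


def parse_name_only(output: str) -> List[str]:
    """Turn `git diff --name-only` output into a list of paths."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def changed_files(base: str = "origin/main") -> List[str]:
    """Files that differ between the working tree and `base`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_name_only(result.stdout)
```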
Though JSON is one of the more expensive choices we could make here, it works well enough for a codebase of our size. Standard compression provides a lot of value with a data set like this.
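As an illustration of why compression helps so much, a snapshot that maps test ids to file paths repeats the same strings over and over. The sketch below assumes gzip as the "standard compression" and a flat test-to-files mapping as the layout; both are assumptions for the example, not a description of our file format.

```python
import gzip
import json
from typing import Dict, List


def dump_snapshot(test_to_files: Dict[str, List[str]]) -> bytes:
    """Serialize the dependency map as compact JSON, then gzip it."""
    raw = json.dumps(test_to_files, sort_keys=True, separators=(",", ":"))
    return gzip.compress(raw.encode("utf-8"))


def load_snapshot(blob: bytes) -> Dict[str, List[str]]:
    """Inverse of dump_snapshot."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical data: many tests touching the same handful of files.
example = {
    f"tests/test_cards.py::test_case_{i}": ["app/cards.py", "app/models.py", "app/db.py"]
    for i in range(200)
}
blob = dump_snapshot(example)
```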
One pretty basic issue with the coverage-only approach can be found in this simple example:
Adding a say_hi method to the B class would result in changed behavior that isn't picked up by code coverage.
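The example itself is not reproduced above, so here is a hypothetical reconstruction, assuming a base class A that defines say_hi and a subclass B with an empty body. Tracing the test shows that only code belonging to A (and the test) executes, so nothing attributes the test to B's definition:

```python
import sys


class A:
    def say_hi(self):
        return "hi from A"


class B(A):
    pass  # no code in this body runs when B().say_hi() is called


def test_say_hi():
    assert B().say_hi() == "hi from A"


def executed_code_objects(func):
    """Return the code objects of every Python function called by func()."""
    seen = set()

    def tracer(frame, event, arg):
        if event == "call":
            seen.add(frame.f_code)
        return None  # per-line tracing is not needed for this demo

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(previous)
    return seen


executed = executed_code_objects(test_say_hi)
executed_names = {code.co_name for code in executed}
```

Here A and B share a file, but if B lived in its own file, an execution-based map would never link the test to it, and a say_hi override added there later would go untested by the incremental run.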
While we don't have a ton of inheritance in our application — we use service functions, mostly — there are still quite a few places where we leverage it. Notably, we use Flask SQLAlchemy models, which supplies a DSL to map SQLAlchemy columns to a Python class. When interacting with one of these model classes, coverage measures a lot of executions in SQLAlchemy internals and often none on the model class itself. Something as routine as adding a column to our SQLAlchemy model declaration didn't trigger any tests using a coverage-only approach.
To address this case, we needed to understand the inheritance chain for any method call. There might be a way to do this with coverage.py, but I didn't find it. I was aware of Dropbox's pyannotate and thought that their runtime based type evaluation sounded pretty similar to this. After poking around the code for a bit, I stumbled across this method. There's a lot to it that we won't unpack here, but it's a really interesting idea and may be worth a read if you've got some time. We took inspiration from this and bypassed coverage.py in favor of using sys.settrace directly.
The sys.settrace documentation shows a handful of event types, and I found it helpful to trace some of our existing code to get a sense for what it all meant. From testing, observing a mix of events, including exception events, produced results reliably similar to the coverage.py library.
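A stripped-down tracer along these lines might look like the sketch below. It is an assumption-laden illustration: it records only call and line events, and `trace_lines` and `sample` are names made up for the example.

```python
import sys
from collections import defaultdict
from typing import Callable, Dict, Set


def trace_lines(func: Callable[[], object]) -> Dict[str, Set[int]]:
    """Run func() and return {filename: line numbers executed}."""
    executed: Dict[str, Set[int]] = defaultdict(set)

    def tracer(frame, event, arg):
        if event in ("call", "line"):
            executed[frame.f_code.co_filename].add(frame.f_lineno)
        return tracer  # returning the tracer keeps line events coming for this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(previous)
    return dict(executed)


def sample():
    total = 0
    total += 1
    return total


executed = trace_lines(sample)
```

For file-level granularity, only the keys of the returned dictionary matter; the line numbers are kept here to show how close this is to what a coverage tool records.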
To solve the inheritance problem, we inspect call events a bit more closely. We infer that the call is a method on a class by the presence of self or cls in the function arguments. This is not universally a reasonable assumption, but we adhere to that naming convention at Ramp. The arguments can be resolved to their type pretty easily. This block could be safer and probably faster, but got us as far as we wanted:
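import inspect

def trace(frame, event, arg):
    # for `call` events...
    argvalues = inspect.getargvalues(frame)
    for object_type in ("self", "cls"):
        if object_type in argvalues.locals:
            value = argvalues.locals[object_type]
            # `cls` is already a class; `self` is an instance of one.
            _class = value if inspect.isclass(value) else type(value)
            # Hang onto _class for later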
The _class value can be mapped at runtime to its inheritance chain. Conveniently, each member of that inheritance chain can be mapped to the file it was defined in:
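import inspect
from typing import Optional, Type

def _resolve_class_to_file(_class: Type) -> Optional[str]:
    """Look up the file path where this class was defined."""
    module = inspect.getmodule(_class)
    if hasattr(module, "__file__"):
        return module.__file__
    return None  # e.g. builtins have no file

def _resolve_inheritance_chain_to_file(_class: Type) -> None:
    # Walk back the inheritance chain. No cycles to worry about.
    classes_to_check = [_class]
    while classes_to_check:
        class_to_check = classes_to_check.pop()
        resolved_file_name = _resolve_class_to_file(class_to_check)
        # Assumed step: record resolved_file_name for the current test here.
        classes_to_check.extend(class_to_check.__bases__)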
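As a self-contained variant of the same idea, the sketch below uses inspect.getmro to enumerate the chain instead of walking it by hand, and returns the set of files. `files_for_class` and `BadPayload` are hypothetical names, and the standard library's json.JSONDecodeError stands in for an application base class.

```python
import inspect
import json
from typing import Set, Type


def files_for_class(cls: Type) -> Set[str]:
    """Files that define cls or any class it inherits from."""
    files = set()
    for klass in inspect.getmro(cls):
        module = inspect.getmodule(klass)
        path = getattr(module, "__file__", None)
        if path:  # builtins such as ValueError and object have no file
            files.add(path)
    return files


class BadPayload(json.JSONDecodeError):
    """Hypothetical subclass used only for this demonstration."""


files = files_for_class(BadPayload)
```

The result includes the file that defines JSONDecodeError (json/decoder.py) even though no line in it needs to execute for the subclass to depend on it.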
We can now construct a more complete list of all the code files that a test depends on than we could by using coverage.py alone. This comes at a pretty appreciable cost: it doubled our "snapshot" generation time. Given we run the "snapshot" generation out of band, the cost is measured in USD cents rather than in time spent waiting.
We bumped into other issues as well. For one, code that is executed before tests run (like defining a constant dictionary literal) is not properly blamed. There is a long tail of gaps with the coverage based approach, but inheritance traversal seemed the most interesting one to share.
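The import-time gap is easy to reproduce. In this sketch a tiny module is "imported" before tracing starts, the way application modules are loaded before any individual test runs; the module source, the file name `rates_demo.py`, and the helper name are all invented for the example.

```python
import sys

SOURCE = '''\
RATES = {"usd": 1.0, "eur": 1.1}

def convert(amount, currency):
    return amount * RATES[currency]
'''

namespace = {}
# Line 1 (the dictionary literal) runs here, before any test is traced.
exec(compile(SOURCE, "rates_demo.py", "exec"), namespace)


def lines_seen_during(func):
    """Line numbers of rates_demo.py executed while func() runs."""
    seen = set()

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == "rates_demo.py":
            seen.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(previous)
    return seen


seen = lines_seen_during(lambda: namespace["convert"](10, "eur"))
```

Only the body of convert is observed. With file-level granularity the test is still linked to the file in this case, but a constant defined in a different file from the function that reads it would be missed entirely.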
There are plenty of ways we could improve this, but I don't feel it is worth further investment for where we are today.
Our engineering team is still pretty small. We all contribute to making our tools and processes work better for us. If you're considering new roles and that kind of ownership or environment sounds appealing, we're hiring software engineers at all parts of the stack.