In July of this year, our Python monolith's test suite took 45 minutes to complete. This slowed down developers across the company and resulted in larger backlogs of unmerged changes.
After a week of optimization work, the test suite runs in under 5 minutes.
Two years ago, Max Lahey posited that tests aid in uncovering implicit dependencies (especially important in growing organizations and codebases) and that slow tests hurt engineering productivity and observability.
The harm to developer efficiency is acute. A builder’s most valuable resources are context and focus, and slow tests squander both. The problem compounds: skipped tests let more bugs reach production, and then the slow suite becomes the bottleneck to diagnosing issues and validating potential fixes. Finally, as the test suite slows down, the number of pending PRs grows.
As merge queue depth increases, so does the amount of context a developer must keep loaded in short-term memory, since any number of change sets may behave strangely once they finally land in production. Engineers would often block off half a day just to babysit the merge queue while it slowly drained.
“Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.” - Rob Pike
When we began, we knew the total execution time was 45 minutes and that about 38 of those minutes were spent on the `pytest` invocation. At this point, we knew that slow tests were affecting engineering velocity company-wide, that the vast majority of the slowness was attributable to a `pytest` call, and that it had been two years since a deep dive.
We needed to find the bottlenecks in our `pytest` runs. We observed the resource usage of a machine running the test suite and discovered it was CPU bound. We increased the speed of the CPU and the tests ran ~50% faster. The single largest speedup took the least time to figure out, and we did it first. Try the obvious improvements first!
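We watched resource usage on the runner itself to make the CPU-bound call, but the same idea can be sketched in miniature: compare process CPU time against wall-clock time (a ratio near 1.0 suggests a CPU-bound workload; near 0.0, an I/O- or wait-bound one). This is an illustrative sketch, not how we instrumented our runners:

```python
import time

def cpu_bound_ratio(workload):
    """Fraction of wall time the process spent on CPU.

    Near 1.0 suggests the workload is CPU bound; near 0.0 suggests it
    spends most of its time waiting (I/O, sleeps, locks).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall if wall else 0.0

# A compute-heavy loop should score near 1.0; a sleep near 0.0.
busy_ratio = cpu_bound_ratio(lambda: sum(i * i for i in range(2_000_000)))
idle_ratio = cpu_bound_ratio(lambda: time.sleep(0.2))
print(busy_ratio, idle_ratio)
```

In practice, watching `top` or your cloud provider's metrics on the actual test machine gives the same answer with less ceremony.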
After increasing CPU speed, we toyed with increasing thread count; however, increasing parallelization did not immediately yield a speedup. After continued profiling, we observed our test DB buckling under the increased load. After both increasing the number of DBs and tuning Postgres flags (e.g. connection settings), we shaved off several more minutes.
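The specific flags we tuned aren't listed here, but for disposable test databases a common set of Postgres settings trades durability for speed. The values below are hypothetical illustrations, not our configuration, and are unsafe for any database whose data you care about:

```ini
# postgresql.conf tweaks for a throwaway test database -- NOT for production
fsync = off                 # skip flushing WAL to disk on checkpoints
synchronous_commit = off    # don't wait for WAL writes before acknowledging commits
full_page_writes = off      # acceptable only because the data is recreatable
max_connections = 400       # headroom for many parallel test workers
shared_buffers = 2GB        # keep the test working set in memory
```

Because a corrupted test database can simply be rebuilt, the durability guarantees these settings disable buy nothing during CI runs.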
Two days in, we had achieved a ~70% speedup of our test suite (a ~30 minute saving). Through profiling and optimizing, we had accumulated a list of potential opportunities for further time savings. Next, we needed to validate and measure each opportunity as quickly as possible, so we split up, with each engineer focusing on the area where they had the most context.
Downloading the last-built docker image from the registry took between 1 and 2 minutes per build on the old AWS-hosted test runners. Because the new runners we experimented on were self-hosted, local docker caching should eliminate this overhead. Implementing this improvement shaved over a minute off the total time, a nontrivial gain given that the total duration was around fifteen minutes.
When running parallelized tests with `pytest-xdist`, we noticed that different workers finished at different times. This meant the algorithm assigning tests to workers was inefficient and operating on incomplete information. We quickly wrote a more effective packing algorithm (courtesy of ChatGPT) on a fork of `pytest` as a proof of concept. This ended up saving 2-3 minutes.
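Our fork isn't reproduced here, but the core idea is standard greedy bin packing: sort tests by historical duration, then always hand the next-longest test to the least-loaded worker, so no straggler lands on an already-full worker at the end. A minimal sketch (the function name and data shape are ours, not `pytest-xdist` internals):

```python
import heapq

def assign_tests(durations, n_workers):
    """Greedy longest-processing-time-first assignment of tests to workers.

    durations: mapping of test name -> historical runtime in seconds.
    Returns one list of test names per worker, balanced by total runtime.
    """
    # Min-heap of (current load, worker index) so the least-loaded
    # worker is always at the top.
    heap = [(0.0, i) for i in range(n_workers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(n_workers)]
    # Place the longest tests first; short tests then fill the gaps.
    for name, duration in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        buckets[worker].append(name)
        heapq.heappush(heap, (load + duration, worker))
    return buckets

# Four tests across two workers: both workers end up with 6s of work.
buckets = assign_tests({"a": 4.0, "b": 3.0, "c": 3.0, "d": 2.0}, 2)
```

The key insight is the one the text describes: the default scheduler operates on incomplete information, whereas feeding in historical durations lets the packer balance total runtime rather than test count.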
While profiling subsequent test runs, we noticed high disk usage, likely due to our recent introduction of multiple databases. From the initial decision to parallelize across multiple databases we had been concerned that I/O could become a bottleneck, so we were eager to attempt optimizations on this front. We tried mounting everything in RAM but saw no performance gains. We moved on, and later realized our intuitions were wrong (for reasons native to Postgres transaction handling); nonetheless, the shift to RAM was trivial to implement and had significant potential upside, so trying it was the correct tradeoff.
These experiments all had the potential for asymmetric upside, as they provided opportunities to test hypotheses quickly, allowing low-lift iterations to potentially yield valuable results. The ideas originated from spending time deeply understanding the test dependencies (how Postgres handles database connections), infrastructure (hardware limitations), and tooling (the internals of `pytest` and GitHub Actions-level caching).
After a week of work, new improvements were yielding significantly smaller gains. Towards the end of the week, a few hours of tuning shaved only a few seconds (< 5%) off the new run time, whereas at the start of the week the same amount of tuning time had saved 20 minutes (~50% of the then-current run time). This was our cue to stop: we were no longer earning high returns on our time, and further focus would risk merely enjoying the comfort of a project we now had context on.
Graphing the time savings per change makes the diminishing returns on invested effort immediately evident:
Grouping by day, the curve is even clearer:
In our experience, all projects look like this. Notably, this does not devalue projects with high upfront cost (e.g. building a payments platform to move money). At Ramp the value of such a platform is so large once it’s examined on a longer time scale that some initial patience to set the right foundation is obviously worthwhile. However, even when building platforms, engineers should identify the potential for asymmetric outcomes at every stage.
Every project should end with a backlog of marginal, long-tail improvements. Solely executing on a single project effectively pales in comparison to the impact of regularly considering the primary optimization function (e.g. customer value over unit time) and then identifying the highest leverage next steps through that lens.
In partnership with Ramp’s world-class Applied AI team, it took us about ten minutes to get access to servers with significantly faster clock speeds, more CPU cores, and more RAM to experiment with.
These machines were provisioned ad hoc and not production ready (e.g. no Terraform or GitHub runner). However, productionizing a machine for experimentation would have been a premature optimization and an incredibly low leverage way to spend time.
Engineers should be able to experiment without hindrance so long as they operate within minimal, pre-defined guardrails.
When we finally productionized the system, we worked closely with a number of engineers far more familiar with our production CI/CD deployment.
This applies to more than infrastructure: e.g. put up PRs to test what refactors might look like, or try to `pip install` a PDF library locally and profile PDF creation. Aim to build and validate new ideas very quickly and productionize them later.
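That prototype-and-profile loop can be as light as wrapping the candidate call in the standard library's `cProfile`. Here `make_pdf` is a hypothetical stand-in for whatever library call you are evaluating:

```python
import cProfile
import pstats

def profile_top(fn, *args, n=10, **kwargs):
    """Run fn under cProfile and print the n most expensive calls by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(n)
    return result

# Hypothetical stand-in for the PDF-generation call you want to evaluate.
def make_pdf(pages):
    return "".join(f"page {i}\n" for i in range(pages))

doc = profile_top(make_pdf, 100)
```

Ten minutes with output like this tells you whether a library is worth productionizing before anyone writes Terraform for it.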
Every engineer should seek to understand why the problem they’re solving is important and the highest-leverage use of their time, and they should be able to relate the problem to a top-level optimization function. In our case, slow tests were inhibiting engineers’ ability to quickly and confidently deliver customer value. We believed that working on this was the highest-leverage use of our time and context when we began.
First, we built context on the problem: learning about CPU usage on test machines, GitHub Actions, docker caching, Postgres optimizations, and reading the `pytest` source code. When we understood enough to identify potential asymmetric outcomes, we moved to action.
Second, we validated whether our initial hypotheses were correct by building the minimal viable example to measure the result. Many were validated and some were disproven.
Finally, we constantly re-evaluated our approach as we learned more and we weighed continuing to work on tests given potential upside, opportunity cost, and context cost.
Organizations should empower engineers with the tools they need to effectively impact business goals and trust them to optimize appropriately - as Ramp does.
There were always too many things to try, never too few. This is common: teams always have more ideas than time. While discipline and effort can increase throughput, effective prioritization increases bandwidth more than anything else.
The genesis of the project was leverage: we believed and could prove that longer test times harmed productivity of the engineering organization. A speedup of the test suite and its accompanying increase in engineer productivity is multiplied over the entire organization, meaning even marginal improvements yield outsized results.
Each improvement’s impact was impossible to predict before trying it; for example, we had no idea higher parallelism would result in massive speedups. However, we did know the cost of attempting it was low and the potential upside high: the crux of asymmetric outcomes. While experience can help tune expectations, uncertainty remains inevitable. Test optimization serves as a good case study for this general principle because the impact of every change we made was both immediate and precisely quantifiable. While returns on investment are usually less immediate and quantifiable, they remain quite real.
Ramp has maintained a culture of velocity from the beginning through the present (over 150 engineers, 700 employees, and rapidly growing revenue). We do not maintain this velocity through faster typing, talking, or walking, but through leverage. We build tools and systems across the organization to empower everyone to do the highest-leverage work they can. When we commit to projects, we execute swiftly and with a consistent eye for the impact of actions on the goals of the project and the business.
We’re just getting started; the problems continue to grow, as do the opportunities. We’d love to have you join us on this journey.