One of our keys to success at Ramp is empowering engineers to iterate and get changes into production quickly. We do this while ensuring that newly committed code is safe to deploy by maintaining a set of Python unit / integration tests.
Over time, this test suite has grown larger and larger, now reaching over 12 minutes per test run. Even worse, engineers would sometimes have to wait up to 1-2 hours between queueing their code for merge and actually seeing it in production.
How can we make this process faster for engineers, allowing them to deploy their code faster?
The diagram below shows the prior state of how we merged pull requests at Ramp.
Code had to be up-to-date with master
and pass status checks such as linting and tests before merging.
We used kodiak to help merge PRs, which would keep track of a given PR’s spot in the merge queue, auto-update it, and auto-merge once it reached the front of the queue.
The upshot of this was that every new PR merged would make all other PRs out of date with master
. This meant that every single PR required a full new run of our test suite from scratch.
Although this was a reasonable requirement when test suite times were low, our codebase had a growing pytest
suite that took 12 minutes to run. And as the size of the engineering team grew, the queue of pull requests waiting to get merged increased as well, with the average pull request waiting over an hour (12 minutes * 5 PRs in the queue) to be merged.
Our goal: eliminate this merge queue to unblock engineers and help them ship code faster.
Our first attempt is illustrated by the diagram above. We would run tests on PRs as we already were doing, and as long as they passed, engineers could merge into master
in parallel without having to rerun tests.
Our thought process was that most engineers’ code doesn’t overlap with other engineers’ changes. For example, someone working on Travel shouldn’t prevent someone else from pushing a change to Bill Pay. And engineers working on overlapping features would almost always get an explicit merge conflict.
However, we still wanted to consider the possibility that two branches that passed tests locally could fail tests on master
after their changesets were merged.
To prevent this we added the requirement that before any deployment, tests would need to pass on master
.
After making these changes to the deploy pipeline, we removed the requirement that “Pull requests are up-to-date with master.”
Unfortunately, the following happened:
Our python stack uses Flask
+ SQLAlchemy
, with the following dependencies:
alembic~=1.7.5
Flask~=1.1.1
Flask-Migrate~=2.6.0
Flask-SQLAlchemy~=2.4.1
sqlalchemy~=1.4.36
We use alembic to manage database migrations. Each database migration generates a file in migrations/versions
that looks like the following:
# revision identifiers, used by Alembic.
revision = "37596116ac17"
down_revision = "a6b00f131756"
def upgrade():
op.create_table(...)
def downgrade():
op.drop_table(...)
If two developers branched off the same revision and generated different migrations, each of their local changes would pass tests.
# Developer A
revision = "37596116ac17"
down_revision = "a6b00f131756"
def upgrade():
op.create_table("table_A", ...)
def downgrade():
op.drop_table("table_A", ...)
# Developer B
revision = "9d37dc60b812"
down_revision = "a6b00f131756"
def upgrade():
op.create_table("table_B", ...)
def downgrade():
op.drop_table("table_B", ...)
However, if both of these migrations were merged into master
, from revision a6b00f131756
, we would get the following error:
Only a single head is supported. The script directory has multiple heads (due to branching), which must be resolved by manually editing the revision files to form a linear sequence.
Run `alembic branches` to see the divergence(s).
This is because there would be “multiple heads”, or two DB migrations that would cause alembic to be unable to determine the current head.
We can illustrate this further with the diagram below:
# Pull Request A
┌────────────────────────┐ ┌───────────────────────┐
│ master │─────▶│ revision A │
│ a6b00f131756 │ │ 37596116ac17 │
└────────────────────────┘ └───────────────────────┘
# Pull Request B
┌────────────────────────┐ ┌───────────────────────┐
│ master │─────▶│ revision B │
│ a6b00f131756 │ │ 9d37dc60b812 │
└────────────────────────┘ └───────────────────────┘
# Merged (multiple heads!)
┌────────────────────────┐ ┌───────────────────────┐
│ master │─────▶│ revision A │
│ a6b00f131756 │ │ 37596116ac17 │
└────────────────────────┘ └───────────────────────┘
│ ┌───────────────────────┐
│ │ revision B │
└───────────────────▶│ 9d37dc60b812 │
└───────────────────────┘
We needed to prevent these cases from happening to allow out-of-date merges into master
.
We realized that pull requests without DB migrations should be able to merge freely, while migration pull requests should be required to be up-to-date with master
.
Our first idea was to set up a custom GitHub action to enforce this. However, we found that there was no straightforward way to re-run checks every time master
was updated.
What if we could take advantage of merge conflicts to prevent pull requests from merging?
We decided to add a new file in the repo, migrations/migration-hash.txt
, that contained a hash of all the filenames in migrations/versions
.
Each time an engineer added a new migration, the hash would change, causing GitHub to have a merge conflict on migration-hash.txt
.
Here's the commands we used to generate the hash, and the file, respectively:
generate-migration-hash:
# We're using python3 rather than a *nix command intentionally, since developers
# are on different operating systems.
@ls migrations/versions/*.py | sort \
| python3 -c 'import hashlib; import sys; print(hashlib.md5(sys.stdin.read().encode("utf-8")).hexdigest())'
generate-migration-hash-file:
@make generate-migration-hash > migrations/migration-hash.txt
The diagram below shows how migration-hash.txt
would change as pull requests were merged in:
master
/ \
/ \
/ \
/ \
migration A migration B
hash(master + A) hash(master + B)
\ /
\ /
\ /
\ /
master -> A -> B
hash(master + A + B)
Note that:
migration-hash.txt
.master
.With this change, we were able to modify our deploy pipeline to look like the desired state:
The average time to merge decreased from 1+ hour to 12 minutes (the time it takes for our test suite to run).