When projects are new and developers are focused on shipping something quickly to prove out a concept or fill an immediate need, quality and best practices are sometimes left by the wayside. This is a tradeoff that can be worth making. But too often, temporary solutions and shortcuts become permanent, and as functionality or the number of projects grows, real issues start to surface. These issues must be solved, taking away development time. Perhaps worse, developers may begin to be wary of making changes, for fear of breaking something. The original goal may have been velocity, but at some point, velocity slows.
At Ramp, we encountered this in a repository of machine learning workflows. What started as a repository for centralizing these workflows eventually became a bit of a swamp, preventing us from moving as quickly as we wanted. To reverse this, we decided to take some time to modernize the codebase and adopt industry best practices around dependency management, formatting, linting, testing, and type checking.
All of this allowed us to prevent bugs and ultimately move faster and with more confidence.
Data science and machine learning happen across Ramp. However, at some point, data scientists and engineers started to develop workflows in a centralized repository. The repository was a monorepo of various unrelated projects that shared a single set of dependencies. This was great! Data scientists and engineers were able to collaborate in a single place and could learn from each other. Importantly, they could develop opinionated ways of doing things, and this standardization could further increase development velocity, as well as ease onboarding.
However, this path isn't always easy! As the size of the codebase and the number of contributors grew, we started to encounter pain points that hindered velocity and made changes risky. We weren't able to move as quickly as we wanted to.
Let's take a look at some of the problems we faced and how we solved them.
Previously, we were using a requirements.txt file to store project dependencies. For example:
pandas==2.1.4
scikit-learn==1.4.0
This was a reasonable approach in the Python ecosystem some time ago, and it sufficed initially. However, problems started to arise. The biggest problem is that it burdens developers with managing a complex web of dependencies.
If you pip install the dependencies above, you'll actually end up with far more dependencies than the ones listed, since those immediate, direct dependencies have their own dependencies, known as transitive dependencies. For example, pandas depends on python-dateutil, and python-dateutil depends on six. Of course, this is just one path, but with dozens of hard-coded package versions, an unruly web of transitive dependencies emerges.
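You can see this for yourself: after installing only the two packages above, listing the environment shows many more. The commands below are illustrative; the exact set of packages depends on your platform and the versions resolved at install time.
pip install -r requirements.txt
pip freeze   # prints every installed package, including transitive dependencies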
These transitive dependencies can lead to inconsistencies between environments. For instance, if two developers install the listed dependencies on their workstations, they may get different versions of these transitive dependencies! Worse, whenever we actually build and deploy applications, we'll get different versions there too. These inconsistencies make it difficult to debug problems, causing a ton of wasted effort by developers.
We started to use poetry to improve dependency management. Poetry is widely adopted across the community, allowing us to draw upon a wealth of external resources if we need help. Rather than using requirements.txt, poetry uses the newer pyproject.toml standard. This file contains direct dependencies:
[tool.poetry.dependencies]
python = "^3.11"
pandas = "^2.1.4"
scikit-learn = "^1.4.0"
These are the things we directly care about. But what about transitive dependencies? Poetry maintains a poetry.lock file, which pins every dependency, direct and transitive. It also records hashes, so you can be sure you're pulling the same artifacts each time.
Adding or upgrading a package can be done via a simple command:
poetry add mlflow
Behind the scenes, poetry will update the lock file.
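For a sense of what that looks like, each entry in poetry.lock is roughly of this form (abridged and illustrative; the exact fields vary by poetry version, and hashes are elided):
[[package]]
name = "six"
version = "1.16.0"
description = "Python 2 and 3 compatibility utilities"
files = [
    {file = "six-1.16.0-py2.py3-none-any.whl", hash = "sha256:..."},
]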
Poetry also manages virtual environments. To get started, a developer can simply clone the repository and run poetry install. Poetry will create a virtual environment and install everything into it. This is helpful, especially for new developers or developers who only want to make an occasional contribution.
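In practice, onboarding looks something like this (the repository URL is a placeholder):
git clone git@github.com:ramp/ml-workflows.git   # placeholder URL
cd ml-workflows
poetry install    # creates a virtual environment and installs all locked dependencies
poetry run pytest # run tools inside that environment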
Formatting doesn't seem important, and, in a sense, it isn't: the computer doesn't generally care how code is formatted. However, code is written for humans as much as machines, and people do have opinions about how code should be formatted. Sometimes this can result in unhelpful nitpicks on PRs. Other times it can result in engineers agonizing over how best to format some multi-line list comprehension. Neither outcome is great.
Fortunately, formatting fights are generally easy to solve. Code formatters like black or ruff make all of the decisions for you. Even if engineers don't like the resulting formatting, it's easy to stop caring so much.
We adopted ruff as our formatter of choice. Users can configure their IDEs to use it, and we can enforce formatting through CI checks. When developers create a pull request, we run a check to make sure everything is formatted appropriately. If the check fails, the pull request can't be merged.
Adopting an auto-formatter involves running the codebase through the formatter, resulting in many, many changes. After this is done, you can start enforcing checks for every PR.
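With ruff, both the one-time reformat and the ongoing check are single commands. These are shown as we might run them; your invocation may differ:
ruff format .          # rewrite files in place
ruff format --check .  # exit non-zero if anything would change, suitable for CI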
One pitfall is the effect on git blame. If you format an entire codebase, you'll show up on most lines in git blame. You might even show up for years if particular sections of the code don't see much activity! This isn't helpful for others, and it isn't good for you. To resolve this, we auto-formatted the codebase in a single commit, then added the commit to .git-blame-ignore-revs. The file looks something like this:
# Auto-format code with ruff
b1460f7fbddf7fde9d7d50114d4e8d46baf30594
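One way to make local git blame honor the file is to point git at it, either manually or via a shared setup script:
git config blame.ignoreRevsFile .git-blame-ignore-revs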
Developers can configure their IDEs to automatically format using ruff. We also provide a pre-commit hook configuration that fixes any violations.
Ruff is also adept at finding and fixing small errors. Consider some code that sets model hyperparameters:
model_hyperparameters = {
    "learning_rate": 0.05,
    "n_estimators": 10,
    "subsample": 1.0,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "min_weight_fraction_leaf": 0.0,
    "max_depth": 3,
    "min_impurity_decrease": 0.0,
    "min_samples_leaf": 2,
    "max_features": None,
}
Do you see the issue? We set the min_samples_leaf key twice. If somebody wants to change it, they might update only one occurrence, then either wonder why nothing changed or, worse, never notice the problem at all.
We can identify and flag these issues with a linter. A linter examines code to find common errors. A number of linters exist for Python, but we went with ruff (yes, it lints as well as formats!). If we run ruff check, we can see the error:
F601 Dictionary key literal `"min_samples_leaf"` repeated
When you first run a linter, it may flag a number of violations. You have several options for dealing with these.
The first option is to fix all of the issues. If you have a small codebase, or if the violations are relatively simple, this might be the way to go. If you're using ruff, you can run ruff check --fix to fix a number of errors.
However, not all errors may be easily fixable, and you might not have the proper context. In the above example, should min_samples_leaf be 1 or 2? One strategy might be to add files with violations to a list of exclusions, then tackle these files one at a time. This might work, but the approach has some serious shortcomings. First, the offending files may change and get worse over time as they're worked on. Second, it can be hard to carve out time to tackle each file.
A final approach is to ignore individual violations rather than entire files. For example, below, we ignore the duplicate key in our hyperparameter dictionary:
model_hyperparameters = {
    "learning_rate": 0.05,
    "n_estimators": 10,
    "subsample": 1.0,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "min_weight_fraction_leaf": 0.0,
    "max_depth": 3,
    "min_impurity_decrease": 0.0,
    "min_samples_leaf": 2,  # noqa: F601
    "max_features": None,
}
This provides a way to acknowledge all existing errors, while enforcing linting on all new code. Ruff actually makes this option very easy. You just have to run ruff check --add-noqa.
Ideally, developers who touch code will remove existing violations. This might be a long process, but at least new code will comply with linting rules. Of course, you can still resolve all violations in one large PR, or go file by file.
The downside here is that your code may now have # noqa comments littered throughout. These comments are ugly. However, if we take the stance that code must lint, then code that doesn't comply is functionally ugly, so the comments just bring that ugliness to the surface!
To ensure comments get cleaned up as code is improved, we enforced ruff's RUF100 rule, which flags unused # noqa directives.
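Assuming linting rules are configured in pyproject.toml, enabling the rule looks something like this:
[tool.ruff.lint]
extend-select = ["RUF100"]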
We run the linter on every pull request, checking for any violations and blocking the pull request if we find any. We also provide a pre-commit hook configuration.
Automated testing is another line of defense beyond linting. While linting can catch subtle bugs or issues by examining code, testing actually runs code and checks the results. It allows developers to catch corner cases and bad assumptions, and it gives developers more confidence when changing existing code.
Not long ago, we had few, if any, automated tests in our ML workflow repository. Automatically testing some machine learning code can be difficult, and, besides, at a small scale, manual testing is often good enough.
However, certain code might be highly leveraged in the codebase. For example, often we'll want to run an ML workload for each business on Ramp. This is a problem that lends itself well to parallelization across multiple processes, but provisioning hardware for and starting up each process takes time. So, we sometimes divide a large set of businesses into chunks of a given size. This reduces overhead and still allows us to parallelize work. We have a function called chunk that can be used for this. It takes a list of IDs (or anything else) and a maximum chunk size, then returns an iterator of chunks:
def chunk(xs: list[T], size: int) -> Iterator[list[T]]:
    ...
It's important that this works correctly. It needs to produce chunks consistent with size, and it needs to not drop any IDs. If there are issues, downstream use cases will be negatively affected. Not only does it need to work correctly when originally written, but it also needs to work correctly in the face of any changes. For example, if an engineer notices that the chunking can be optimized, then makes a change, we need to make sure that the results are still correct. Testing can help us out here.
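To make the discussion concrete, here's a minimal sketch of what such a function might look like (not necessarily our exact implementation):
from collections.abc import Iterator
from typing import TypeVar

T = TypeVar("T")


def chunk(xs: list[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive chunks of at most `size` elements, preserving order."""
    for i in range(0, len(xs), size):
        yield xs[i : i + size]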
We use pytest for testing, since it's widely used and has a rich ecosystem of supporting tools. It's also compatible with Python's built-in unittest framework.
Using pytest, we might have a test to make sure that our function preserves all of the input IDs:
def test_chunk_preserves_data():
    chunks = chunk([1, 2, 3, 4, 5], 2)
    assert list(itertools.chain(*chunks)) == [1, 2, 3, 4, 5]
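A second test might cover the size requirement in the same way; for example:
def test_chunk_respects_max_size():
    chunks = list(chunk([1, 2, 3, 4, 5], 2))
    assert all(len(c) <= 2 for c in chunks)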
Of course, tests aren't free to write. They take time and effort which might otherwise be spent developing new things. Some organizations measure code coverage, or the proportion of lines in the codebase that are covered by tests, and they might insist on a particular level of coverage. This isn't bad, but it's not quite appropriate for our ML workflow use case. Most code exists in individual workflows and models, and if there's a bug, it will probably only affect one particular application. Instead of focusing on broad coverage throughout the codebase, we pay attention to highly leveraged code, like shared utility functions.
Python has had type annotations for quite a while. Using annotations, you can add hints about what type a particular variable should hold, or what a function should return. For example:
def f(a: str, b: int) -> list[str]:
    ...
While type annotations can serve as documentation, they aren't enforced by Python itself. For that, you need a type checker. A few exist: the most well-known is probably mypy, but there are others like pyre and pyright. There are pros and cons to each, but we adopted mypy simply because it's widely used, both inside of Ramp and in the broader Python ecosystem.
In our case, contributors to our machine learning repository often added types to code. However, sometimes these types were subtly wrong, and type annotation adoption was inconsistent, even within the same section of code.
You might wonder whether type checking is beneficial for a language like Python. After all, to some people, Python's dynamic nature is part of the charm, and, besides, type annotations can add a lot of clutter. I'm certainly not an extremist here! I don't want to burden developers with extra steps, but type checking is a tool, and as long as we're using it as a means to some end, it can be quite useful.
Consider a function that queries data from a data warehouse:
def query_warehouse(query, is_qa):
    ...
The is_qa parameter indicates whether the query should be run in a QA database or a production database. From its name, it's probably a boolean.
Let's say we want to change it to env, for environment:
def query_warehouse(query, env):
    ...
One problem is that we don't know what env should be. We can make that clearer with a type annotation:
def query_warehouse(query: str, env: Literal["qa", "prod"]) -> None:
    ...
Above, we say that env should be a string that's either "qa" or "prod". We also document that query should be a string and that the function doesn't return anything.
However, this is only part of the appeal! A function to query the data warehouse is probably going to be widely used throughout the codebase, with many call sites. When we change the signature, we may break code. In a perfect world, we might have unit tests to catch this. However, remember that unit tests aren't free, and realistically, we're not going to cover all of the call sites with tests.
Type checking will actually catch this issue though! This is huge. Developers can confidently make changes—even big, sweeping changes that touch many parts of the codebase. Whole classes of bugs become impossible!
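For example, a stale call site that still passes is_qa will be flagged the next time mypy runs. The exact message depends on the mypy version, but it's something along these lines:
query_warehouse("select 1", is_qa=True)
# mypy: error: Unexpected keyword argument "is_qa" for "query_warehouse"  [call-arg]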
Unfortunately, adopting a type checker on an existing codebase can be difficult. When we first ran mypy, we found hundreds of errors! Fortunately, we can use the same strategy that we used for enforcing linting: add # type: ignore comments to existing errors, turn on enforcement, then progressively fix old code.
Unlike ruff, mypy doesn't have an option to automatically add # type: ignore to violations, so we wrote a script to automate this. Like # noqa, # type: ignore can also be qualified with specific error codes. We ended up with lots of code like this:
def f(a, b, c):  # type: ignore[no-untyped-def]
    ...
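The script itself doesn't need to be fancy. A simplified sketch (hypothetical, not our exact script) parses mypy's output and appends the appropriate comment to each offending line:
import re
import subprocess
from collections import defaultdict

# Matches lines like: path/to/file.py:12: error: Some message  [error-code]
ERROR_RE = re.compile(r"^(?P<path>[^:]+):(?P<line>\d+): error: .*\[(?P<code>[\w-]+)\]$")

result = subprocess.run(["mypy", "."], capture_output=True, text=True)

# Collect error codes by file and line number
errors: dict[str, dict[int, str]] = defaultdict(dict)
for raw in result.stdout.splitlines():
    match = ERROR_RE.match(raw.strip())
    if match:
        errors[match["path"]][int(match["line"])] = match["code"]

# Append a qualified ignore comment to each offending line
for path, lines in errors.items():
    with open(path) as f:
        contents = f.readlines()
    for lineno, code in lines.items():
        line = contents[lineno - 1].rstrip("\n")
        if "# type: ignore" not in line:
            contents[lineno - 1] = f"{line}  # type: ignore[{code}]\n"
    with open(path, "w") as f:
        f.writelines(contents)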
Importantly, we also made mypy complain about unused # type: ignore comments. Sometimes developers will actually fix a problem and not notice. For example, mypy complains if you call an untyped function. We ignored these cases in the initial ignore commit:
some_function(1, 2, 3) # type: ignore[no-untyped-call]
However, if some_function suddenly receives type annotations, the error no longer applies, and the call is fine. The comment should reflect this. Enforcing this is just a matter of adding a line to the configuration:
warn_unused_ignores = true
By default, mypy is rather permissive. We added extra configuration to disallow defining and calling untyped functions, and to have mypy check untyped functions, rather than ignoring them:
disallow_untyped_calls = true
disallow_untyped_defs = true
disallow_incomplete_defs = true
check_untyped_defs = true
disallow_untyped_decorators = true
Unfortunately, not all libraries provide type annotations. We instructed mypy to ignore most of these with a configuration section:
[[tool.mypy.overrides]]
module = [
    "boto3",
    "catboost",
]
ignore_missing_imports = true
For some critical libraries, though, we generated or wrote type stubs. We didn't aim for completeness here but instead quickly wrote some partial stubs to cover most of our usage.
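A partial stub only needs to describe the parts of an API that we actually call. As an illustration, for a hypothetical library somelib, a stub file might contain just:
# somelib.pyi -- partial, illustrative stub for a hypothetical library
from typing import Any

class Model:
    def __init__(self, **params: Any) -> None: ...
    def fit(self, X: Any, y: Any) -> "Model": ...
    def predict(self, X: Any) -> Any: ...
Mypy can be pointed at a local directory of stubs like this via its mypy_path setting.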
With the above improvements, we were able to get our codebase in good shape, and keep it there. However, quality itself isn't the end goal. Code is a means to an end. More importantly, these changes enabled faster iteration with greater confidence. Engineers and data scientists can make changes and ship new functionality, regardless of their level of experience with the codebase.
Of course, the journey isn't over. As Ramp continues to grow, we'll hit new challenges, and maintaining and even improving our velocity will require new solutions. For this particular ML codebase, we're looking at ways to further shift burdens from contributors to tooling.