At Ramp, we work hard to simplify finance and help thousands of businesses control spend, save time, and automate busy work. As one of our core competencies, machine learning is invaluable across many aspects of our value chain, and we apply ML across a number of different domains:
One perennial challenge with machine learning is the speed of moving models from prototype to production, and then iterating. Metaflow helped us to shorten this feedback cycle and increase our velocity.
One of our first machine learning models was a “riskiness” model. Our goal was simple: every day, predict the level of risk associated with tens of thousands of Ramp customers. This is the main credit model we use in our risk management process and is business-critical. It’s also a fairly simple model, built with scikit-learn and xgboost. It has about 20 features. We used an off-the-shelf vendor solution because it was available, and we immediately encountered issues that slowed us down:
Our setup also required a lot of platform involvement, including tuning resources, granting permissions, and reviewing PRs, which didn’t really allow for self-service workflows and resulted in a sub-optimal developer experience. It also meant that there wasn’t great visibility into what was running, and development took a long time. The riskiness model took months to build!
All of these pain points slowed down our velocity, frustrating both data scientists and their stakeholders. Long feedback loops and excessive friction are particularly painful at early stages where iteration is key. Ramp is well-known for our product velocity, and we set an extremely high bar for developer experience and fast feedback loops. Data science and ML cannot be exceptions to this and, in 2022, we couldn’t get machine learning models into production as fast as we would have liked.
It was clear that we needed a different solution.
Metaflow, in conjunction with developer experience improvements, solved our pain points. After adopting Metaflow, we were able to ship eight additional models in just ten months, whereas before it took many more months to launch a single model. We’re still early in our journey, and these are just the exciting first steps. Within the next six months, we plan to train and launch five to ten additional models, supported by at least that many production flows.
There are lots of ML platform choices out there, and many of them are great! In making our choice, we first narrowed the field to popular tools, because we wanted something well-supported by a vibrant community and team. Then, we optimized for simplicity and velocity:
For simplicity, we focused on both the infrastructure and end-user code. On the infrastructure side, we wanted to leverage AWS-managed services where possible. On the end-user side, we wanted our data scientists and machine learning engineers to be able to get up and running quickly.
Velocity naturally follows from simplicity. We wanted to be able to stand up infrastructure and basic models quickly.
Enter Metaflow. Metaflow is an open-source ML framework for training and managing ML models and building ML systems. It allows data scientists and MLEs to access all layers of the full stack of machine learning, from data and compute to versioning and deployment, while focusing on building models in Python. Metaflow also integrates with pre-existing workflow orchestrators, like Airflow, Argo Workflows, and AWS Step Functions, and compute infrastructure like Kubernetes or AWS Batch. With Metaflow, data scientists can:
- Prototype locally, running flows with a single python run_flow.py command.
- Run the same flow on remote compute simply by adding the --with batch flag to that same command.
This is all great! As long as the infrastructure is reliable, this significantly tightens feedback loops and removes undue friction on the path from prototype to production.
You may be wondering “Why not just Airflow for everything?” Indeed, we use Airflow extensively at Ramp. Airflow is a battle-tested, Python-based orchestrator and the de facto tool in data engineering! While this is true, it isn’t always the best tool for machine learning engineering:
Metaflow still includes a handy UI where data scientists can examine task progress:
Overall, Metaflow offers a simple workflow for developing models.
At Ramp, we use AWS extensively, and since Metaflow can be deployed on AWS-managed services, we decided to start there.
At its core, Metaflow can use AWS Batch to schedule jobs and run them on AWS-managed ECS clusters. Batch provides job queue functionality that can keep track of jobs. For each job queue, you can attach multiple compute environments. These compute environments can run on Fargate or EC2, and you can adjust the resources available and the scaling strategies.
Since we’re small, we created a single job queue to start with. At first, we attached a Fargate compute environment, since we’re very familiar with Fargate, and it’s fairly simple to get started with. However, we encountered several problems with this. First, Fargate startup times are quite long. Users who want to submit more interactive jobs regularly had to wait several minutes for jobs to start. Second, Fargate only supports certain combinations of CPU and memory. This is sometimes a headache for software engineers, and it’s definitely a headache for data scientists. Writing a flow might require referencing the AWS documentation, which isn’t ideal. Third, Fargate doesn’t support GPUs. This isn’t a problem for many models, but for more sophisticated models, this is an annoying limitation. So, we ended up quickly moving to EC2 compute environments.
As we grow, we anticipate having to tweak our AWS Batch setup. We may want to optimize the instance types we use, particularly for GPUs. We may also want to create multiple queues in order to separate production and ad hoc workflows, or to accommodate different priorities. We’re still early in our journey here.
We use Terraform to define our infrastructure, so we’re able to describe all of Metaflow’s infrastructure as code.
When running flows locally, Metaflow orchestrates steps from your laptop, even if the steps themselves run remotely. This is fine for short-running, interactive jobs. For longer-running or production jobs, we chose Metaflow’s Step Functions integration. Step Functions overlaps with Airflow in some ways (indeed, we very briefly used Step Functions for general orchestration). It handles transitioning between steps and retrying failures, and it can be triggered on a schedule. However, we only use it for managing flow execution. In order to handle scheduling or triggering, we trigger Step Functions executions from Airflow.
At first, we used the Amazon provider to integrate between Step Functions and Airflow—namely, StepFunctionStartExecutionOperator and StepFunctionExecutionSensor. However, this wasn’t pleasant. Data scientists had to create two boilerplate tasks to trigger execution, using Step Function state machine ARNs that make little sense in the context of Metaflow and data science. Then, if something went wrong, data scientists had to track down logs either in the AWS Console, the Metaflow UI, or both. This wasn’t ideal.
To remedy this, we created a MetaflowOperator. The operator handles triggering a flow and waiting for success or failure. Importantly, it also creates links to relevant logs.
More recently, Metaflow includes the capability to generate Airflow DAGs from Metaflow Flows. We haven’t used this functionality yet but look forward to digging deeper.
All of our production flows are contained in a single repository, sharing the same set of dependencies. This can be limiting, but it allows us to optimize for ease of maintenance and support. Whenever a data scientist creates a new flow or updates an existing one, we automatically create the associated Step Function state machine, allowing data scientists to then trigger it at will, either manually or from Airflow.
Currently, we have more than 6,000 flow runs. These include scheduled tasks that run every day as well as many ad hoc runs. A few data scientists regularly run and adjust processes, which results in bursts of runs in a row. A number of these runs are categorized as “production,” meaning they run on a specific schedule or are triggered by events like new data arriving in a database. Beyond those, many more runs are for research or one-off problems. For instance, I often run flows to benchmark models, and our data scientists are actively engaged in a wide range of projects.
Several of these models are essential for the business. For example, the riskiness model, along with a handful of other risk models, feeds into credit limits. If these models malfunction, fixing them is a very high priority. Other models are ongoing experiments that run in shadow mode to test specific hypotheses. So, we cover a wide range of production scenarios. Our data scientists and MLEs can easily and swiftly compose flows. These are all housed in a single repository and are easily accessible.
Because of the simplicity of Metaflow, data scientists are able to largely self-service on Metaflow. While many data platform teams aim for self-service capabilities, it's rare to see stakeholders actually use and appreciate these features. New users are typically able to onboard themselves within their first few days at Ramp. To help with this, we created a brief walkthrough that new users can follow along with to create and deploy an example Flow. Our team hasn’t had to spend much time helping users with Metaflow, nor have we had to spend much time debugging infrastructure issues. Things mostly just work, allowing data platform engineers to tackle other problems.