How Ramp Accelerated Machine Learning Development to Simplify Finance

September 15, 2023

Simplifying Finance with Machine Learning at Ramp

At Ramp, we work hard to simplify finance and help thousands of businesses control spend, save time, and automate busy work. As one of our core competencies, machine learning is invaluable across many aspects of our value chain, and we apply ML across a number of different domains:

  • Credit risk, such as predicting the probability that a Ramp customer will become delinquent.
  • Fraud, such as determining whether a card transaction is fraudulent.
  • Growth, such as predicting the probability that a potential lead will convert into a customer.
  • Product, such as suggesting an accounting code for a particular transaction and Ramp Intelligence, Ramp’s new suite of AI products.

One perennial challenge with machine learning is the speed of moving models from prototype to production, and then iterating. Metaflow helped us to shorten this feedback cycle and increase our velocity.

Long Feedback Loops, and Excessive Friction

One of our first machine learning models was a “riskiness” model. Our goal was simple: every day, predict the level of risk associated with tens of thousands of Ramp customers. This is the main credit model we use in our risk management process, and it is business-critical. It’s also a fairly simple model, built with scikit-learn and xgboost, with about 20 features. We used an off-the-shelf vendor solution because it was available, and we immediately encountered issues that slowed us down:

  • We couldn’t define and execute ML pipelines locally. We had to manually push our pipeline code to the vendor before we could run it.
  • Jobs took upwards of an hour to run, even for very small datasets.
  • Jobs were flaky, and when things went wrong, we had limited visibility into the causes. Sometimes jobs would fail seemingly for no reason, and logging was poor.
  • The platform didn’t work well with Docker containers. We’d get unintelligible errors when using Docker and there were features in the platform that weren’t available for containerized workloads.

Our setup also required a lot of Data Platform involvement, including tuning resources, granting permissions, and reviewing PRs, which didn’t really allow for self-service workflows and resulted in a sub-optimal developer experience. It also meant that there wasn’t great visibility into what was running, and development took a long time. The riskiness model took months to build!

All of these pain points slowed down our velocity, frustrating both data scientists and their stakeholders. Long feedback loops and excessive friction are particularly painful at early stages where iteration is key. Ramp is well-known for our product velocity, and we set an extremely high bar for developer experience and fast feedback loops. Data science and ML cannot be exceptions to this and, in 2022, we couldn’t get machine learning models into production as fast as we would have liked.

It was clear that we needed a different solution.

Choosing Metaflow

Metaflow, in conjunction with developer experience improvements, solved our pain points. After adopting Metaflow, we were able to ship eight additional models in just ten months, whereas before it took many more months to launch a single model. We’re still early in our journey, and these are just the exciting first steps. Within the next six months, we plan to train and launch five to ten additional models, supported by at least that many production flows.

Deployment
  • Before: Data scientists manually deployed code on the command line
  • After: Flows are automatically deployed

Dependency Management
  • Before: Container support was limited, so data scientists often ran into problems
  • After: Dependencies are Dockerized and standardized

Resource Requirements
  • Before: All of a job’s steps typically ran on the same resources, and tuning resources often required Data Platform involvement
  • After: Resources can be configured per step, and data scientists can use whatever resources they need, up to a large limit

Debugging
  • Before: Individual steps were difficult to retry, and logs were difficult to find
  • After: Individual steps can be retried, and logs are surfaced in the UI

Sharing Results
  • Before: Data scientists could share notebooks, but pointing to one specific run was challenging
  • After: Data scientists can link to individual runs, and results can be shared in Metaflow cards

There are lots of ML platform choices out there, and many of them are great! In making our choice, we first narrowed the field to popular tools, because we wanted something well-supported by a vibrant community and team. Then, we optimized for simplicity and velocity:

For simplicity, we focused on both the infrastructure and end-user code. On the infrastructure side, we wanted to leverage AWS-managed services where possible. On the end-user side, we wanted our data scientists and machine learning engineers to be able to get up and running quickly.

Velocity naturally follows from simplicity. We wanted to be able to stand up infrastructure and basic models quickly.

Enter Metaflow. Metaflow is an open-source ML framework for training and managing ML models and building ML systems. It allows data scientists and MLEs to access all layers of the full stack of machine learning, from data and compute to versioning and deployment, while focusing on building models in Python. Metaflow also integrates with pre-existing workflow orchestrators, like Airflow, Argo Workflows, and AWS Step Functions, and compute infrastructure like Kubernetes or AWS Batch. With Metaflow, data scientists can:

  • Define their ML pipelines purely in Python as a Metaflow Flow (see the minimal sketch just after this list).
  • Run those pipelines locally with a simple python run_flow.py run command.
  • Run those same pipelines in the cloud by adding a --with batch flag to that same command.
  • Push the Metaflow code to production, and trigger that same Flow from an Airflow DAG.
  • Visualize results using Metaflow “cards” visible from the UI.
  • Easily share results and cards via links.
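
To make this concrete, here’s a minimal sketch of what such a flow can look like. The flow name, step names, and “model” logic below are hypothetical, but the FlowSpec structure and run commands are standard Metaflow.

from metaflow import FlowSpec, step


class RiskTrainingFlow(FlowSpec):
    """A hypothetical training flow, shown only to illustrate the structure."""

    @step
    def start(self):
        # In a real flow, this would load features from the warehouse.
        self.features = [[0.1, 0.2], [0.4, 0.3]]
        self.labels = [0, 1]
        self.next(self.train)

    @step
    def train(self):
        # Train a model; any artifact assigned to self is versioned by Metaflow.
        self.model_description = f"model trained on {len(self.labels)} examples"
        self.next(self.end)

    @step
    def end(self):
        print(self.model_description)


if __name__ == "__main__":
    RiskTrainingFlow()

Running python risk_training_flow.py run executes the flow locally, while python risk_training_flow.py run --with batch runs the same steps on AWS Batch.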

This is all great! As long as the infrastructure is reliable, this significantly tightens feedback loops and removes undue friction on the path from prototype to production.

You may be wondering, “Why not just use Airflow for everything?” Indeed, we use Airflow extensively at Ramp, and it is a battle-tested, Python-based orchestrator that has become the de facto standard in data engineering. However, it isn’t always the best tool for machine learning engineering:

  • Airflow is meant to orchestrate compute workloads rather than run the workloads themselves. Airflow doesn’t process the data itself; instead, it might trigger a Spark job that processes the data. The same applies to machine learning.
  • Since Airflow is a general-purpose orchestrator, it’s used for a wide variety of applications that have different dependencies. ML libraries often have extremely quirky dependencies that don’t play nicely with other common dependencies.
  • With Airflow, we can’t run a pipeline locally and then immediately run it in the cloud with the flip of a switch. We have to merge and deploy it to a production cloud environment first.
  • Airflow doesn't come with "ML batteries included," so a considerable amount of time and additional libraries are needed to build a complete system. In contrast, Metaflow ships with features for inspecting and analyzing results in a notebook, creating model cards, and accessing large amounts of data quickly, making it easier for even an inexperienced data scientist to build systems independently (see the small example after this list).
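
For example, a data scientist can pull up the latest results of a flow from a notebook using Metaflow’s Client API, along these lines (the flow and artifact names here are hypothetical):

from metaflow import Flow

# Grab the latest successful run of a (hypothetical) flow and read its artifacts.
run = Flow("RiskTrainingFlow").latest_successful_run
print(run.id, run.finished_at)
print(run.data.model_description)  # any artifact assigned to self in a step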

Metaflow still includes a handy UI where data scientists can examine task progress.

Overall, Metaflow offers a simple workflow for developing models.

The Technical Details of Our Setup

At Ramp, we use AWS extensively, and since Metaflow can be deployed on AWS-managed services, we decided to start there.

At its core, Metaflow can use AWS Batch to schedule jobs and run them on AWS-managed ECS clusters. Batch provides job queue functionality that can keep track of jobs. For each job queue, you can attach multiple compute environments. These compute environments can run on Fargate or EC2, and you can adjust the resources available and the scaling strategies.
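
We define this infrastructure in Terraform (more on that below), but to make the relationship between queues and compute environments concrete, here is roughly what the same pieces look like through boto3. The names, subnet, and security group IDs are placeholders, not our real configuration.

import boto3

batch = boto3.client("batch")

# A managed Fargate compute environment (values are illustrative).
env = batch.create_compute_environment(
    computeEnvironmentName="metaflow-fargate",
    type="MANAGED",
    computeResources={
        "type": "FARGATE",
        "maxvCpus": 64,
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
    },
)

# A job queue that Metaflow submits jobs to, backed by that compute environment.
# (In practice, you would wait for the environment to become VALID before attaching it.)
batch.create_job_queue(
    jobQueueName="metaflow-jobs",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": env["computeEnvironmentArn"]},
    ],
)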

Since we’re small, we created a single job queue to start with. At first, we attached a Fargate compute environment, since we’re very familiar with Fargate and it’s fairly simple to get started with. However, we encountered several problems. First, Fargate startup times are quite long: users who wanted to submit more interactive jobs regularly had to wait several minutes for them to start. Second, Fargate only supports certain combinations of CPU and memory. This is sometimes a headache for software engineers, and it’s definitely a headache for data scientists; writing a flow might require referencing the AWS documentation, which isn’t ideal. Third, Fargate doesn’t support GPUs. This isn’t a problem for many models, but for more sophisticated ones it’s an annoying limitation. So, we quickly moved to EC2 compute environments.
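
From the data scientist’s side, the payoff is per-step control over resources. As a sketch (the flow, step names, and sizes are made up), a flow can keep light steps small and request a GPU only where it’s needed:

from metaflow import FlowSpec, batch, resources, step


class HeavyTrainingFlow(FlowSpec):

    @resources(cpu=2, memory=4096)
    @step
    def start(self):
        # Light preprocessing can run on a small container.
        self.data = list(range(1000))
        self.next(self.train)

    @batch(cpu=8, gpu=1, memory=16000)
    @step
    def train(self):
        # A GPU-backed step lands on an EC2 compute environment with GPUs attached.
        self.result = sum(self.data)
        self.next(self.end)

    @step
    def end(self):
        print(self.result)


if __name__ == "__main__":
    HeavyTrainingFlow()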

As we grow, we anticipate having to tweak our AWS Batch setup. We may want to optimize the instance types we use, particularly for GPUs. We may also want to create multiple queues in order to separate production and ad hoc workflows, or to accommodate different priorities. We’re still early in our journey here.

We use Terraform to define our infrastructure, so we’re able to describe all of Metaflow’s infrastructure as code.

When running flows locally, Metaflow orchestrates steps from your laptop, even if the steps themselves execute remotely. This is fine for short-running, interactive jobs. For longer-running or production jobs, we chose Metaflow’s Step Functions integration. Step Functions overlaps with Airflow in some ways (indeed, we very briefly used Step Functions for general orchestration): it handles transitioning between steps and retrying failures, and it can be triggered on a schedule. However, we only use it for managing flow execution. To handle scheduling and triggering, we kick off Step Functions executions from Airflow.

At first, we used the Amazon provider package to integrate Step Functions with Airflow, namely StepFunctionStartExecutionOperator and StepFunctionExecutionSensor. However, this wasn’t pleasant. Data scientists had to create two boilerplate tasks to trigger each execution, referencing Step Functions state machine ARNs that make little sense in the context of Metaflow and data science. And if something went wrong, data scientists had to track down logs in the AWS Console, the Metaflow UI, or both.
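
Concretely, the boilerplate looked roughly like this; the DAG name and state machine ARN below are placeholders, not our real configuration.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.step_function import (
    StepFunctionStartExecutionOperator,
)
from airflow.providers.amazon.aws.sensors.step_function import (
    StepFunctionExecutionSensor,
)

with DAG(dag_id="riskiness_flow_trigger", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # Kick off the state machine that backs the Metaflow flow.
    start_flow = StepFunctionStartExecutionOperator(
        task_id="start_riskiness_flow",
        state_machine_arn="arn:aws:states:us-east-1:111111111111:stateMachine:RiskinessFlow",
    )

    # Poll until the execution reaches a terminal state.
    wait_for_flow = StepFunctionExecutionSensor(
        task_id="wait_for_riskiness_flow",
        execution_arn=start_flow.output,
    )

    start_flow >> wait_for_flow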

To remedy this, we created a MetaflowOperator. The operator handles triggering a flow and waiting for success or failure. Importantly, it also creates links to relevant logs.
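
We haven’t open-sourced the operator, but a stripped-down sketch of the idea looks something like this. The class below is simplified and hypothetical (the real operator does more), and the Metaflow UI URL is a placeholder.

import time

import boto3
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator


class MetaflowOperator(BaseOperator):
    """Trigger a Metaflow flow via its Step Functions state machine and wait for it."""

    def __init__(self, *, flow_name: str, state_machine_arn: str, poke_interval: int = 60, **kwargs):
        super().__init__(**kwargs)
        self.flow_name = flow_name
        self.state_machine_arn = state_machine_arn
        self.poke_interval = poke_interval

    def execute(self, context):
        sfn = boto3.client("stepfunctions")
        execution_arn = sfn.start_execution(stateMachineArn=self.state_machine_arn)["executionArn"]

        # Surface links that are actually useful to a data scientist.
        self.log.info("Started %s: %s", self.flow_name, execution_arn)
        self.log.info("Metaflow UI: https://metaflow.example.com/%s", self.flow_name)

        # Poll until the execution reaches a terminal state.
        while True:
            status = sfn.describe_execution(executionArn=execution_arn)["status"]
            if status == "SUCCEEDED":
                return execution_arn
            if status in ("FAILED", "TIMED_OUT", "ABORTED"):
                raise AirflowException(f"{self.flow_name} finished with status {status}")
            time.sleep(self.poke_interval)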

More recently, Metaflow added the capability to generate Airflow DAGs from Metaflow Flows. We haven’t used this functionality yet, but we look forward to digging deeper.

All of our production flows are contained in a single repository, sharing the same set of dependencies. This can be limiting, but it allows us to optimize for ease of maintenance and support. Whenever a data scientist creates a new flow or updates an existing one, we automatically create the associated Step Function state machine, allowing data scientists to then trigger it at will, either manually or from Airflow.

Where We are Today

Currently, we have more than 6,000 Flow runs. These include scheduled runs that happen every day as well as many ad hoc ones; a handful of data scientists regularly run and iterate on flows, which results in multiple runs in quick succession. A number of these runs are categorized as “production,” meaning they run on a schedule or are triggered by events like new data arriving in a database. Beyond those, many more runs are for research or to solve one-off problems. For instance, I often run flows to benchmark models, and our data scientists are actively engaged in a wide range of projects.

Several of these models are essential for the business. For example, the riskiness model, along with a handful of other risk models, feeds into credit limits. If these models malfunction, fixing them is a very high priority. Other models are ongoing experiments designed to test specific inferences in a shadow mode. So, we cover a wide range of production scenarios. Our data scientists and MLEs can easily and swiftly compose flows. These are all housed in a single repository and are easily accessible.

Because of Metaflow’s simplicity, data scientists are largely able to self-serve. While many data platform teams aim for self-service capabilities, it’s rare to see stakeholders actually use and appreciate them. New users are typically able to onboard themselves within their first few days at Ramp. To help with this, we created a brief walkthrough that new users can follow to create and deploy an example Flow. Our team hasn’t had to spend much time helping users with Metaflow, nor have we had to spend much time debugging infrastructure issues. Things mostly just work, allowing data platform engineers to tackle other problems.

If this work interests you, we’ll be speaking about it on September 20 in more detail at the Airflow Summit with our friends from Outerbounds. You can sign up here!
