When making decisions with incomplete, noisy, or stale information, how can we make the most of the data we have and weigh the underlying uncertainty? In this post, we'll explore how Ramp navigates these decisions using multi-armed bandits to help customers directly link their bank accounts. We've talked about asymmetric outcomes before, and we'll dig into how our solution tackles the classic exploration vs exploitation tradeoff.
Ramp is a modern finance platform with corporate cards, automated expense management, bill pay, travel, procurement, and more. To get started, let's lay out the context for the problem we are solving: connecting to a business's bank account.
As part of our underwriting process, Ramp offers credit limits based on bank account data like recent banking transactions. To collect this data, we ask the customer to link their bank account during the application. By directly linking their account, the customer receives faster decisions and saves time by avoiding repeated manual document uploads.
Behind the scenes, we leverage third-party bank linking providers that allow the customer to search for their financial institution (i.e. the customer's bank, such as Chase or TD Bank) and securely link their associated bank accounts.
For each financial institution, we need to determine which provider to use in order to maximize the likelihood of a successful bank linking connection, taking into account noisy and imperfect information. Under the hood, providers use different strategies, and we don't know a priori which provider will have the highest success rate for a given financial institution. The historical volume and recency of attempts affect how precisely we can measure these rates and how much uncertainty remains.
A natural instinct is to run A/B tests to determine which provider is best. While technically possible, running A/B tests has several shortcomings:
Enter the multi-armed bandit: a reinforcement learning framework that dynamically balances exploration of options with exploitation of historically top-performing choices. Multi-armed bandits address the shortcomings of A/B testing:
Furthermore, multi-armed bandits are viable long-term solutions: we can keep the algorithm running indefinitely, and it will continue to learn and make decisions while incorporating new data.
Multi-armed bandits get their name from slot machines in casinos, which used to be called "one-armed bandits". A gambler playing slot machines does not know the probability of winning at any given machine, and so must trade off exploration (gathering more data on the expected payoff of each machine) with exploitation (playing the machine that seems to have the highest expected payoff so far).
Multi-armed bandits aren't always the right choice. They can be helpful in situations where any of the following are true: the outcomes are probabilistic, measurement of the outcomes is noisy, or the circumstances change over time.
For example, let's imagine we want to decide where to get dinner. We have a favorite burger joint: it has consistent quality, and we know we'll enjoy the meal. However, we're also curious about a new bistro that just opened up. Finally, there is a ramen place that we tried once and didn't like, but now there is a new chef. Using a multi-armed bandit strategy in this situation would help to maximize our long-term restaurant satisfaction by balancing our known preferences (burger joint) with exploration of new options (bistro) and re-evaluation of previously suboptimal choices (ramen place).
On the other hand, multi-armed bandits would not be useful for deciding which switch turns on a light in a room. Once we try every switch, we can be extremely confident that we know what will happen the next time we use each one. The outcomes are not probabilistic (each switch either always turns on the light of interest or never does), measurement is not noisy (we can easily tell whether the light is on), and circumstances do not change over time (the switches' behavior does not evolve, hopefully).
There are many solutions to the multi-armed bandit problem, but we chose Thompson Sampling, one of the most popular, because it is simple to implement and effective in practice. The crux of the algorithm involves four steps: model each option's success rate as a probability distribution, draw a random sample from each distribution, choose the option with the highest sample, and update the chosen option's distribution with the observed outcome.
We use the Beta distribution to model the success probability of each provider. The Beta distribution is parameterized by two values: α (number of successes + 1) and β (number of failures + 1). When there are few total attempts, the distribution is wide, which encodes our uncertainty about the true success rate. As we collect more data, the distribution becomes more peaked, indicating less uncertainty.
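As a quick illustration (the counts here are made up for the example, not real provider data), the spread of the Beta distribution shrinks as we accumulate attempts at the same observed success rate:

```python
import math

# Hypothetical counts: the same 50% observed success rate, with increasing data
for successes, failures in [(1, 1), (10, 10), (100, 100)]:
    alpha, beta = successes + 1, failures + 1
    mean = alpha / (alpha + beta)
    # Standard deviation of Beta(alpha, beta)
    std = math.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
    print(f"{successes + failures:>3} attempts: mean = {mean:.2f}, std = {std:.3f}")
```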
In the image above, we plot two Beta distributions: one with large uncertainty (orange, 1 success and 2 failures), and one with small uncertainty (green, 11 successes and 9 failures). Although the green distribution has a higher average, the orange distribution has a wider spread which affords it the chance to sample a higher value. This is a key result of Thompson Sampling: we can leverage uncertainty to explore high upside options.
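To make this concrete, here is a minimal sketch using the same two parameterizations as the plot; even though the green distribution has the higher mean, the wider orange distribution will sometimes produce the larger sample and win the attempt:

```python
import numpy as np

rng = np.random.default_rng()

# Orange: 1 success, 2 failures -> Beta(2, 3); green: 11 successes, 9 failures -> Beta(12, 10)
orange_draw = rng.beta(1 + 1, 2 + 1)
green_draw = rng.beta(11 + 1, 9 + 1)

# Thompson Sampling routes this attempt to whichever provider drew the higher value,
# so the more uncertain (orange) provider still gets picked some of the time.
chosen = "orange" if orange_draw > green_draw else "green"
print(round(orange_draw, 3), round(green_draw, 3), "->", chosen)
```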
Below we show a simple Python implementation of Thompson Sampling where we simulate one day of bank linking attempts. We have a total volume of attempts that we need to divide between the providers. We input the historical total counts and total successes for each provider, then simulate how much traffic is routed to each provider.
```python
import numpy as np


def simulate_day(
    volume: int,
    trials_success_list: list[tuple[int, int]],
    debug: bool = False,
) -> list[int]:
    """
    Given (trials, successes) counts for a list of variants, simulate one day with the given volume.
    Returns counts of draws for each variant.

    :param volume: volume for the day
    :param trials_success_list: list of (trials, successes) per variant
    :param debug: print draws for debugging
    :return: counts of draws for each variant
    """
    rng = np.random.default_rng()
    # Draw `volume` random numbers from the Beta distribution for each variant
    draws = np.array([
        rng.beta(success_i + 1, trial_i - success_i + 1, volume)
        for trial_i, success_i in trials_success_list
    ])
    # For each of the `volume` draws, select the variant with the highest value
    draw_results = np.argmax(draws, axis=0)
    # Count total results for each variant
    result_counts = [
        (draw_results == i).sum()
        for i in range(len(trials_success_list))
    ]
    if debug:
        print('Draws: \n', draws)
        print('Results: ', draw_results)
        print('Counts: ', result_counts)
    return result_counts
```
Below, we run this simulation with a volume of five attempts, using the Beta distributions from above. The first provider (orange) has a lower historical success rate (33%) than the second provider (green, 55%), but we end up choosing it 3 out of 5 times. This is due to its larger uncertainty: in this run, we end up exploring more than exploiting.
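For reference, a call matching that scenario might look like the following (the exact split varies from run to run, since the draws are random); the (trials, successes) pairs of (3, 1) and (20, 11) correspond to the orange and green distributions above:

```python
# Orange provider: 3 trials, 1 success (~33%); green provider: 20 trials, 11 successes (55%)
counts = simulate_day(volume=5, trials_success_list=[(3, 1), (20, 11)], debug=True)
print(counts)  # e.g. [3, 2]; the exact split changes with each run
```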
Before we could roll out Thompson Sampling to production, we needed to understand our data. We used simulation to anticipate how the algorithm would behave with the data we had. In order to apply the algorithm, we built a data collection feedback loop and crafted a measurement plan.
To understand how the algorithm behaves, we simulated Thompson Sampling for two providers under a variety of conditions.
In the example above, we simulated a situation where one provider has a stable success rate while the other has a linearly increasing rate. The dots are the daily generated success rates. The solid green line is the overall observed success rate of Thompson Sampling. The solid purple and orange lines are the observed success rates for each provider. In general, we found that Thompson Sampling was quick to exploit the best provider by diverting nearly all the volume to the one that was clearly better. On the other hand, when both providers were close in success rate, Thompson Sampling would adapt and split the volume more evenly.
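As a rough sketch of how such a simulation can be wired up (reusing simulate_day from above; the 60-day horizon, 100 daily attempts, and the specific true rates are illustrative values rather than our production numbers), we generate each day's outcomes from the true rates and feed the accumulated counts back into the algorithm:

```python
rng = np.random.default_rng()
days, daily_volume = 60, 100                  # illustrative values
stable_rates = np.full(days, 0.60)            # provider A: stable true success rate
rising_rates = np.linspace(0.40, 0.80, days)  # provider B: linearly increasing true rate

# Running (trials, successes) totals for each provider
totals = [(0, 0), (0, 0)]

for day in range(days):
    true_rates = [stable_rates[day], rising_rates[day]]
    # Thompson Sampling decides how much of today's volume each provider receives
    allocation = simulate_day(daily_volume, totals)
    # Simulate outcomes for each provider and fold them back into the running totals
    totals = [
        (trials + attempts, successes + rng.binomial(attempts, rate))
        for (trials, successes), attempts, rate in zip(totals, allocation, true_rates)
    ]
```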
To gauge the robustness of the algorithm and anticipate how long it would take to converge on the best provider, we investigated the impact of six factors through simulation. For all parameters, we used historical data to determine the feasible search space.
The most important finding was that the time for the algorithm to converge on the best provider was similar across a broad range of parameters. There were some notable exceptions to this, however:
When deciding on a rollout plan, we landed on a quasi-experimental design. Our primary metric measured the rate of businesses that successfully linked their bank accounts.
The baseline and treatment were sequential in time, in contrast to a true randomized experiment. We chose this design because ultimately we were not choosing between two viable variants: we knew the status quo was not maintainable long-term. We still wanted to measure the impact of Thompson Sampling, however, and using a sequential design afforded us a quicker measurement.
In the end, we observed a 10% increase in financial institution linking success rate, and a 25% decrease in Ramp customers with any manually uploaded bank statements. We took a deeper look at individual financial institutions and found that the algorithm adapted to changes as expected, and diverted traffic to the top-performing provider.
We've just seen how Ramp successfully applied Thompson Sampling to improve bank linking success rates, helping customers link their bank accounts more easily and speeding up the underwriting process. More direct connections translate to hours of saved customer time that would otherwise be spent searching for documentation and waiting for Ramp to review it. This framework of continuous learning and automated decision-making removed friction, increased velocity, and introduced adaptability. By implementing Thompson Sampling, we were able to account for uncertainty and capitalize on high-upside but uncertain options.