When making decisions with incomplete, noisy, or stale information, how can we make the most of the data we have and weigh the underlying uncertainty? In this post, we'll explore how Ramp navigates these decisions using multi-armed bandits to help customers directly link their bank accounts. We've talked about asymmetric outcomes before, and we'll dig into how our solution tackles the classic exploration vs exploitation tradeoff.

Ramp is a modern finance platform with corporate cards, automated expense management, bill pay, travel, procurement, and more. To get started, let's lay out the context for the problem we're solving: connecting to a business's bank account.

As part of our underwriting process, Ramp offers credit limits based on bank account data like recent banking transactions. To collect this data, we ask the customer to link their bank account during the application. By directly linking their account, the customer receives faster decisions and saves time by avoiding repeated manual document uploads.

Behind the scenes, we leverage third party *bank linking providers* which allow the customer to
search for financial institutions (i.e. the customer's bank: Chase, TD Bank, etc.) and securely link their
associated bank accounts.

For each financial institution, we need to determine which provider to use in order to maximize the likelihood of a successful bank linking connection, taking into account noisy and imperfect information. Under the hood, providers use different strategies, and we don't know a priori which provider will have the highest success rate for which financial institutions. Historical volume and recency of attempts affect how precisely we can measure these rates and how much uncertainty there is.

A natural instinct is to run A/B tests to determine which provider is best. While technically possible, running A/B tests has several shortcomings:

- **Static**: Measuring performance at a specific point in time leads to making decisions with outdated information.
- **Multiple Testing**: We need to support thousands of financial institutions. This requires a separate A/B test for each.
- **Interpreting Results**: When the experiment is over, there are still decisions to be made: measurements must be turned into actions. For example: route all traffic through provider A, or split traffic 50/50 between both providers.

Enter the multi-armed bandit: a reinforcement learning framework that dynamically balances exploration of options with exploitation of historically top-performing choices. Multi-armed bandits address the shortcomings of A/B testing:

- **Dynamic**: Multi-armed bandits are adaptable and continuously update their decisions based on new data.
- **Extensible**: Multi-armed bandits can be applied to any number of options, and independent bandits can be run in parallel.
- **Automated**: Decisions are made automatically by the algorithm: there is no need for interpretation.

Furthermore, multi-armed bandits are viable long-term solutions: we can keep the algorithm running indefinitely, and it will continue to learn and make decisions while incorporating new data.

Multi-armed bandits get their name from slot machines in casinos, which used to be called "one-armed bandits". A gambler playing slot machines does not know the probability of winning at any given machine, and so must trade off exploration (gathering more data on the expected payoff of each machine) with exploitation (playing the machine that seems to have the highest expected payoff so far).

Multi-armed bandits aren't always the right choice. They can be helpful in situations where any of the following are true:

- Outcomes are probabilistic
- Measurements are noisy
- Circumstances change over time

For example, let's imagine we want to decide where to get dinner. We have a favorite burger joint: it has consistent quality, and we know we'll enjoy the meal. However, we're also curious about a new bistro that just opened up. Finally, there is a ramen place that we tried once and didn't like, but now there is a new chef. Using a multi-armed bandit strategy in this situation would help to maximize our long-term restaurant satisfaction by balancing our known preferences (burger joint) with exploration of new options (bistro) and re-evaluation of previously suboptimal choices (ramen place).

On the other hand, multi-armed bandits would not be useful for deciding which switch turns on a light in a room. Once we try every switch, we can be extremely confident that we know what will happen next time we use each switch. The outcomes are not probabilistic (a given switch always either turns on the light of interest or it doesn't), measurement is not noisy (we can easily tell whether the light is on), and circumstances do not change over time (the wiring does not evolve, hopefully).

There are many solutions to the multi-armed bandit problem, but we chose one of the most popular, Thompson Sampling, because it is simple to implement and effective in practice. The crux of the algorithm involves four steps:

- As we collect data, model the probability distribution of success for each action
- At each decision point, draw a sample from the current success distributions for each action
- Choose the action with the highest sampled value
- Update the modeled distribution for the chosen action after observing the result
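As a hedged sketch, the loop above can be written in a few lines of Python. Here we model each action's success probability with a Beta distribution (discussed below), and `counts` holds hypothetical running (successes, failures) tallies for two actions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical running tallies for two actions: [successes, failures]
counts = [[3, 1], [10, 10]]

def choose_action() -> int:
    """Sample each action's Beta posterior and pick the highest draw."""
    samples = [rng.beta(s + 1, f + 1) for s, f in counts]
    return int(np.argmax(samples))

def record_outcome(action: int, success: bool) -> None:
    """Update the chosen action's tallies with the observed result."""
    counts[action][0 if success else 1] += 1

# One decision: sample, choose, observe, update
action = choose_action()
record_outcome(action, success=True)
```

Repeating this pair of calls for every incoming decision is all the machinery Thompson Sampling needs.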

We use the Beta distribution to model the success probability of each provider. The Beta distribution is parameterized by two values: α (number of successes + 1) and β (number of failures + 1). When there are few total attempts, the distribution is wide, which codifies our uncertainty about the true success rate. As we collect more data, the distribution becomes more peaked, indicating less uncertainty.

In the image above, we plot two Beta distributions: one with large uncertainty (orange, 1 success and 2 failures), and
one with small uncertainty (green, 11 successes and 9 failures). Although the green distribution has a higher average,
the orange distribution has a wider spread, which affords it the chance to sample a higher value. This is a key
result of Thompson Sampling: *we can leverage uncertainty to explore high upside options*.
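To make this concrete, here is an illustrative sketch (the sample size is arbitrary) that draws from those two distributions and measures how often the wide orange distribution out-samples the green one:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 100_000

# Orange: 1 success, 2 failures -> Beta(2, 3); green: 11 successes, 9 failures -> Beta(12, 10)
orange = rng.beta(1 + 1, 2 + 1, n)
green = rng.beta(11 + 1, 9 + 1, n)

# Despite its lower mean, the orange arm wins a meaningful share of head-to-head draws
orange_win_rate = (orange > green).mean()
print(f"Orange mean: {orange.mean():.2f}, Green mean: {green.mean():.2f}")
print(f"Orange wins {orange_win_rate:.0%} of paired samples")
```

Every paired draw the orange arm wins is a decision where the algorithm explores the less-proven option instead of exploiting the current leader.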

Below we show a simple Python implementation of Thompson Sampling where we simulate one day of bank linking attempts. We have a total volume of attempts that we need to divide between the providers. We input the historical total counts and total successes for each provider, then simulate how much traffic is routed to each provider.

```python
import numpy as np


def simulate_day(
    volume: int,
    trials_success_list: list[tuple[int, int]],
    debug: bool = False,
) -> list[int]:
    """
    Given (trials, successes) counts for a list of variants, simulate one day.

    :param volume: total number of attempts for the day
    :param trials_success_list: list of (trials, successes) per variant
    :param debug: print draws for debugging
    :return: counts of draws routed to each variant
    """
    rng = np.random.default_rng()
    # Draw `volume` samples from each variant's Beta distribution:
    # alpha = successes + 1, beta = failures + 1
    draws = np.array([
        rng.beta(success_i + 1, trial_i - success_i + 1, volume)
        for trial_i, success_i in trials_success_list
    ])
    # For each of the `volume` draws, select the variant with the highest sample
    draw_results = np.argmax(draws, axis=0)
    # Count how many draws each variant won
    result_counts = [
        int((draw_results == i).sum())
        for i in range(len(trials_success_list))
    ]
    if debug:
        print('Draws: \n', draws)
        print('Results: ', draw_results)
        print('Counts: ', result_counts)
    return result_counts
```

Below, we run this simulation with five attempts using the Beta distributions from above. The first provider (orange) has a lower historical success rate (33%) compared to the second provider (green, 55%) but we end up choosing it 3 out of 5 times. This is due to the larger uncertainty, and we end up exploring more than exploiting.
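A run like that can be sketched as follows; we restate a minimal version of `simulate_day` so the snippet is self-contained, and since the draws are random, the exact split varies from run to run:

```python
import numpy as np

rng = np.random.default_rng()

def simulate_day_minimal(volume: int, trials_success_list: list[tuple[int, int]]) -> list[int]:
    """Minimal restatement of simulate_day above, for a self-contained run."""
    draws = np.array([
        rng.beta(s + 1, t - s + 1, volume)
        for t, s in trials_success_list
    ])
    winners = np.argmax(draws, axis=0)
    return [int((winners == i).sum()) for i in range(len(trials_success_list))]

# Orange: 3 trials, 1 success (33%); green: 20 trials, 11 successes (55%)
counts = simulate_day_minimal(5, [(3, 1), (20, 11)])
print(counts)  # two counts summing to 5
```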

Before we could roll out Thompson Sampling to production, we needed to understand our data. We used simulation to anticipate how the algorithm would behave with the data we had. In order to apply the algorithm, we built a data collection feedback loop and crafted a measurement plan.

To understand how the algorithm behaves, we simulated Thompson Sampling for two providers under a variety of conditions.

- First, we generated true success rate trends for each provider, and then added noise on top.
- Next, we simulated daily bank linking attempt volume, and used the algorithm to decide how to split the volume between the providers.
- Finally, we used the generated success rate of each provider to simulate outcomes for each attempt that was routed to it.
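The three steps above can be sketched as a simple loop; the trends, noise level, daily volume, and horizon here are illustrative assumptions, not our production values:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n_days, daily_volume = 60, 200

# Step 1: true success rate trends plus noise -- one flat, one linearly improving
flat = np.clip(0.60 + rng.normal(0, 0.02, n_days), 0, 1)
rising = np.clip(np.linspace(0.50, 0.75, n_days) + rng.normal(0, 0.02, n_days), 0, 1)

trials = np.zeros(2, dtype=int)
successes = np.zeros(2, dtype=int)

for day in range(n_days):
    true_rates = [flat[day], rising[day]]
    # Step 2: Thompson Sampling splits the day's volume between the providers
    draws = np.array([
        rng.beta(successes[i] + 1, trials[i] - successes[i] + 1, daily_volume)
        for i in range(2)
    ])
    routed = np.argmax(draws, axis=0)
    for i in range(2):
        n_routed = int((routed == i).sum())
        # Step 3: simulate outcomes for the attempts routed to this provider
        n_success = rng.binomial(n_routed, true_rates[i])
        trials[i] += n_routed
        successes[i] += n_success

print("Volume share:", trials / trials.sum())
```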

In the example above, we simulated a situation where one provider has a stable success rate while the other has a linearly increasing rate. The dots are the daily generated success rates. The solid green line is the overall observed success rate of Thompson Sampling. The solid purple and orange lines are the observed success rates for each provider. In general, we found that Thompson Sampling was quick to exploit the best provider by diverting nearly all the volume to the one that was clearly better. On the other hand, when both providers were close in success rate, Thompson Sampling would adapt and split the volume more evenly.

To gauge the robustness of the algorithm and anticipate how long it would take to converge on the best provider, we investigated the impact of six factors through simulation. For all parameters, we used historical data to determine the feasible search space.

- **Effect Size**: Magnitude of the difference in success rate between providers
- **Noise Level**: Day-to-day random variation in success rate
- **Volume**: Number of daily bank linking attempts
- **Window Size**: Length of the trailing rolling window used to calculate the Beta distribution parameters
- **Trend Type**: How success rate changes over time (flat, linear, sinusoidal, step change)
- **Initial Conditions**: Parameters used to seed the Beta distribution at the start (cold start, correct rates with full volume, incorrect rates)
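As a hypothetical sketch of how a trailing window feeds the Beta parameters, the window length and daily counts below are made up for illustration:

```python
from collections import deque

# Keep only a trailing window of daily (trials, successes) counts for one provider
window_size = 14
daily_counts: deque[tuple[int, int]] = deque(maxlen=window_size)

# Hypothetical daily observations: (trials, successes)
for day_trials, day_successes in [(100, 60), (120, 70), (90, 50)]:
    daily_counts.append((day_trials, day_successes))

# Aggregate the window into Beta parameters: alpha = successes + 1, beta = failures + 1
trials = sum(t for t, _ in daily_counts)
successes = sum(s for _, s in daily_counts)
alpha, beta = successes + 1, trials - successes + 1
print(alpha, beta)  # 181, 131
```

Once the deque is full, appending a new day automatically drops the oldest one, so stale data ages out of the Beta parameters and the algorithm can adapt faster.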

The most important finding was that the time for the algorithm to converge on the best provider was similar across a broad range of parameters. There were some notable exceptions to this, however:

- Shorter window sizes enabled quicker adaptation, even at the expense of lower total volume
- Incorrect initial conditions were the most detrimental to performance, but the algorithm was still able to recover eventually

When deciding on a rollout plan, we landed on a quasi-experimental design. Our primary metric measured the rate of businesses that successfully linked their bank accounts.

- **Baseline**: Status quo (deterministically hardcoding which provider to use for each financial institution)
- **Treatment**: Thompson Sampling

The baseline and treatment were sequential in time, in contrast to a true randomized experiment. We chose this design because ultimately we were not choosing between two viable variants: we knew the status quo was not maintainable long-term. We still wanted to measure the impact of Thompson Sampling, however, and using a sequential design afforded us a quicker measurement.

In the end, we observed a 10% increase in financial institution linking success rate, and a 25% decrease in Ramp customers with any manually uploaded bank statements. We took a deeper look at individual financial institutions and found that the algorithm adapted to changes as expected, and diverted traffic to the top-performing provider.

We've just seen how Ramp successfully applied Thompson Sampling to improve bank linking success rates, thereby helping customers link their bank accounts more easily and speeding up the underwriting process. More direct connections translate to hours of saved customer time otherwise spent searching for documentation and waiting for Ramp to review it. This framework of continuous learning and automated decision-making removed friction, increased velocity, and introduced adaptability. By implementing Thompson Sampling, we were able to account for uncertainty and capitalize on high-upside but uncertain options.