Rate limiting is one of those classic computer science problems that people like to ask candidates in interviews. What makes it a good interview question—the nuanced edge cases and numerous design decisions—makes for tricky implementations in the real world.
Imagine, for example, a rate limit of 2 requests per second. This limit is not equal to 120 requests every minute, despite the latter averaging out to the former. This is because the latter supports burst uses (e.g. a user can exhaust all 120 requests in the first second, and then wait 59 seconds with no successful requests).
Ramp uses many different third-party APIs throughout the app, everything from Finicity and Teller for bank linking to Zendesk and Salesforce for support and sales. Each of these APIs comes with its own rate limiting requirements: some have a daily quota, others have a per-second (“peak”) quota for spikes, and still others have a quota per user or business for various time periods.
Periodically, Ramp takes action across our entire user set (e.g., sending all businesses a summary of their spend on Ramp in the last week). When we do this, we need to avoid overburdening our systems and negatively impacting other tasks. Each product use case may have different configurations, and our rate limiting approach must support them.
As we scale, we expect that there will be either accidental or intentional misuse of our API. To combat this, we currently have Cloudflare rate limit API usage by route, user IP, and other attributes. However, since IPs can easily be changed by an attacker, it's more effective to rate limit by app-level identities, such as an authenticated user’s ID. This level of granularity can be achieved only if we handle it as an app-level rate limit, since Cloudflare cannot authenticate our users.
Celery, a popular task queue we use at Ramp, provides basic rate limiting functionality for tasks via decorators. By placing actions we wish to rate limit inside tasks and rate limiting the tasks themselves, we can indirectly rate limit the action.
This approach is limited. For one, not all actions we want to rate limit can be placed in tasks, and even for those that could we needed to write supplemental code to handle 429 responses from third parties. More broadly, Celery rate limiting is per-worker and not global. It's possible to hack around this constraint, but that requires a task author to know…
…and then set the per-worker rate limit to be
(desired global rate limit)/(max # of queue workers), which still might rate limit too aggressively. If any of these variables changes, the rate limit must be changed as well. Additionally, in certain Celery queue configurations (e.g., with pre-fetching turned on), rate-limited tasks could eventually block a worker from taking on more tasks, grinding queues to a halt.
We set out to build a global rate limiter that could be adapted to fit these use cases. We had two initial goals:
Rate limiting is, luckily, a common (and solved) problem, so there are a lot of existing approaches that other companies have implemented. For example, Cloudflare's implementation modifies a sliding time window to approximate rate limits while GitHub's approach uses a fixed-window approach.
Ultimately we realized that their approaches were tailored to their problems and weren't the best fit for us, so we decided to do more research on other algorithms that might fit us.
We considered a handful of popular rate limiting algorithms:
Fixed-window rate limiting is the most straightforward. For each time interval (eg. every hour), store and increment a counter for every successful request, and reset the counter after every interval. Although easy to implement, this approach can allow up to 2x the specified rate limit in instantaneous spikes in the worst case scenario. For example, with a fixed-window interval resetting at the start of every hour that allows 100 requests per interval, a user can make 100 requests at 11:59, and then make another 100 requests at 12:01. This may not be desirable if our goal is to smooth out request load on our API.
A "smoother" approach to rate limiting is to use a sliding window. This limits the number of successful requests to exactly the specified rate limit for any time period interval. This seems ideal behaviorally, both allowing burst requests and preventing the 2x spikiness we saw in the fixed-window approach. However, this is trickier to implement, requiring us to keep track of all the request times. This requires more memory usage and may not scale well.
The "Leaky Bucket" algorithm simulates a leaking bucket. The volume of the bucket is set to the limit: the maximum number of requests for a given time interval, each new successful request adds a single unit to the bucket, and the bucket “leaks” (decrements) at a fixed rate. When the bucket is full, requests are denied. For a bucket of capacity N, it’s possible to make N simultaneous requests to fill up an empty bucket, but then make another N requests spread out at the same rate the bucket leaks. This means it’s able to allow 2x the max number of requests in a given time interval. On the other hand, this behavior can also accommodate traffic spikes, while enforcing an "average" rate equal to the specified rate.
The Generic cell rate algorithm (GCRA) is a special variant of the leaky bucket algorithm family. Instead of simulating a "leak", it lazily computes a "theoretical arrival time" (TAT) that the next request would have to meet. After each successful request, the TAT is increased by a small amount. If the TAT is larger than a pre-computed limit, requests are denied. The TAT never decreases, but the pre-computed limit steadily increases as time passes to allow more requests over time.
This algorithm behaves similarly to the leaky bucket: make requests too fast and the TAT will increase until it exceeds the rate limit and starts denying requests. Users would then need to slow down to allow the current time to catch up to the TAT so that more requests can be made. It turns out this method is very memory efficient since it only needs to store a few variables to do this. There are much better articles that summarize the exact process for calculating the TAT, including:
We ended up using GCRA for its implementation simplicity and compute/memory efficiency. We looked at several implementations from various sources (commented in the code below ), and updated them to ensure they:
The final code is below:
def check_allowed(self, key: str, limit: int, period: timedelta) -> bool:
Returns whether a request should be allowed
This implementation is modified from https://dev.to/astagi/rate-limiting-using-python-and-redis-58gk
1. uses higher precision floats instead of ints
2. returns true for allowing instead of returning true for rejecting
This rate limiter follows the GCRA algorithm. Resources:
period_in_seconds = int(period.total_seconds())
now = self._get_redis_time_now()
separation = period_in_seconds / limit
"rate_limiter_lock:" + key,
tat = max(float(self.redis.get(key) or now), now)
if tat - now <= period_in_seconds - separation:
new_tat = max(tat, now) + separation
self.redis.set(name=key, value=new_tat, ex=self.DEFAULT_TTL)
We also wrote a ton of unit tests :) and tested manually as well.
We have big plans for the rate limiter! We've already started using it for the three use cases described above, and are slowly introducing new use cases and thoroughly monitoring at each step. So far it seems promising. Because we built a generic framework, we also plan to add additional rate limiting algorithms to address novel use cases as they come up!
 At Ramp, we're big fans of leaving nice code comments for utils and complex code sections :)