At Ramp, we're very publicly proud of the fact that we live and breathe by the phrase "product velocity." In fact, in a recent interview, our head of product Geoff even went so far as to say that "our culture is velocity" (emphasis mine). But the funny thing about velocity is that humans are hardwired to take a very local view of it; in other words, it's easy to assume from the outside that what velocity means to us as an organization is simply striving to ship each and every feature as fast as we possibly can. So today, I want to share a story about a time we deliberately decided not to move fast, and how that ended up being the highest-velocity engineering decision I made in my time here.
In March 2021, Tong Galaxy, Pavel Asparouhov, and I were tasked with launching a new business unit to kill Bill.com, the dominant AP solution on the market. This meant building Ramp’s first-ever product outside of our core T&E platform (i.e., corporate cards and reimbursements), and the resulting engineering scope was massive: invoice OCR, approval chains, accounting integrations, vendor onboarding, and more. So there was a lot of organizational momentum towards expediting our BillPay product launch by re-using existing functionality as much as possible. Yet when it came to the actual rails to pay vendors via bank transfer, Pavel and I insisted that we build a new payments service rather than relying on Ramp’s existing infrastructure for money movement.
We ran into a sizable amount of pushback to our proposal almost immediately, much of it built around the premise of maximizing velocity in the form of time-to-market. Rebuilding payments from scratch meant pushing back our launch date by ~2 months—an eternity in Ramp time. But when we stepped back and took a global view of velocity instead, we knew taking the time to build a system that would scale with our needs would make it exponentially faster to release both new BillPay features as well as future products beyond AP.
While our approach may seem like a no-brainer in retrospect, at the time it was both highly non-obvious and unorthodox for Ramp. After all, we had gotten our expense management product to PMF in record time by building on top of hastily assembled payments integrations. So to illustrate how exactly we built broad conviction in this type of thinking, I’m going to focus on 3 different design principles that were top of mind when we architected our new payments service. For each one, I’ll highlight what exactly was broken in our existing system, how we "fixed" it, and some lessons we learned along the way.
Before Bill Pay existed, there were 3 main use cases for initiating bank transfers at Ramp: debit checks (micro-transactions to validate customer-provided bank account details), statement collections (charging customers for outstanding card balance), and personal reimbursements (enabling customers to pay back their employees for out-of-pocket expenses). Since all three of these products were launched in the early days at Ramp when engineering resources were at a premium, payments functionality was added to each product in an ad hoc, piecemeal manner by the corresponding team. The resulting system presented a ton of engineering friction anytime we encountered a net-new use case.
For example, before we built Reimbursements, every bank transfer at Ramp was represented as a row in a Transfer table containing a foreign key to the BankAccount object storing the customer bank account details to pull funds from. We then ran an hourly job to upload any uninitiated Transfers to our payments provider (JP Morgan, aka JPM). Adapting this data model for reimbursements was difficult because reimbursements required transactions to both business and employee bank accounts, the latter of which were stored in a totally separate UserBankAccount table. And while statement collection and debit check transfers can be initiated immediately upon creation, we only wanted to pay employees once we had successfully pulled funds from the corresponding business's bank account.
This meant we had to implement completely new data models (e.g., ReimbursementUserTransfer) and jobs for each new business use case: a ton of repeated code for what is essentially the same basic function of moving $X from Y's bank account to Z's. And we could see ourselves running into the exact same problem all over again when supporting bill payments, which would require paying vendor bank accounts on scheduled dates in the future. In the short term, creating a new table and adding a scheduler could be just one to two weeks' worth of work: much cheaper than rebuilding payments from the ground up. But we also knew BillPay would not be the last payments product we'd launch at Ramp, which meant what appeared to be a small fixed cost of building BillPay on top of the old system was actually an expensive variable cost for our product org as a whole.
So the first goal in building a brand new payments system was decoupling business logic from payments logic. By abstracting away the details of money movement into a compartmentalized service that is totally agnostic to the underlying business or product reason behind why we want to move money around, we can support an ever-growing number of payments use cases with minimal additional engineering overhead.
Our key insight was that we were building on top of the wrong primitives. What we needed were data models to represent more fundamental payment concepts like "bank account" and "transfer," rather than business concepts like "employee bank account" and "reimbursement transfer." So our first step was creating new generic tables: an ACHDetails table to represent any external counterparty's bank account name, account number, and routing number, and a TransferCanonical table to represent a single transfer between a specific Ramp-owned bank account and an ACHDetails. With this formulation, both business and employee bank accounts could be represented as foreign keys to a single ACHDetails, and statement collection, debit check, and reimbursement payments could all be represented as foreign keys to one or more TransferCanonicals, resulting in a simple unified interface.
We soon learned an important lesson: primitives may need to change over time! For example, our first iteration of the new payments platform looked something like the image below (the Dependency table is used to represent sequential relationships between transfers, i.e. only release the child TransferCanonical once the parent TransferCanonical has settled):
However, within a few days of processing payments in production, it became clear that we needed to build an additional layer of abstraction on top of TransferCanonical. The issue arose anytime a TransferCanonical failed due to an issue such as invalid account details or insufficient funds. Most times when this happens, the client wants to eventually retry the transfer. There are two ways to handle this:
The first is to reuse the same TransferCanonical object and post it again to the banking provider after making some modification (e.g. updating the associated ACHDetails). However, this approach causes us to lose our audit trail, since we'd no longer be able to reconstruct the full history of which ACHDetails we initially attempted to pay. Additionally, this would mean a TransferCanonical ID can no longer be mapped to a single, unique transfer in the provider's database, which makes idempotency and consistency much, much harder to guarantee.
The second is to create a brand-new TransferCanonical object with the updated information upon a retry. However, that means clients would need to track potentially many different TransferCanonical objects for the same logical transfer. Additionally, storing dependencies between parent and child transfers in an auditable way becomes much more difficult when each logical transfer is represented by potentially more than one TransferCanonical.
What we eventually realized was that we needed to distinguish between the intent to move money and an actual attempt to perform that money movement. TransferCanonical already encapsulated the latter of these two concepts, so we built a new InstructedTransfer primitive to encapsulate the former:
Any time a new transfer is initiated via a call to the payments service, we actually create two objects: an InstructedTransfer, and a TransferCanonical that points to that InstructedTransfer via a foreign key. Retries of failed TransferCanonicals result in a new TransferCanonical object pointing to the same InstructedTransfer, while the InstructedTransfer always maintains a pointer to the current, active TransferCanonical via the curr_transfer_id foreign key. This allows clients to treat InstructedTransfer as a single, immutable touchpoint and to safely retry transfers without losing any event history or relationship data!
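The intent-versus-attempt split can be sketched in a few lines. The object and field names (InstructedTransfer, TransferCanonical, curr_transfer_id) follow the text; the surrounding helper functions are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_ids = count(1)


@dataclass
class TransferCanonical:
    instructed_transfer_id: int  # FK back to the intent this attempt fulfills
    status: str = "PROCESSING"
    id: int = 0

    def __post_init__(self):
        self.id = next(_ids)


@dataclass
class InstructedTransfer:
    curr_transfer_id: Optional[int] = None  # pointer to the current, active attempt
    id: int = 0

    def __post_init__(self):
        self.id = next(_ids)


def create_transfer() -> tuple[InstructedTransfer, TransferCanonical]:
    intent = InstructedTransfer()
    attempt = TransferCanonical(instructed_transfer_id=intent.id)
    intent.curr_transfer_id = attempt.id
    return intent, attempt


def retry(intent: InstructedTransfer) -> TransferCanonical:
    # The failed attempt is kept around for the audit trail;
    # the intent simply re-points at a brand-new attempt.
    new_attempt = TransferCanonical(instructed_transfer_id=intent.id)
    intent.curr_transfer_id = new_attempt.id
    return new_attempt


intent, first = create_transfer()
first.status = "FAILED"
second = retry(intent)
```

Clients only ever hold the InstructedTransfer; after the retry, both attempts still map back to the same intent, so no history is lost.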
This design proved sufficient for our initial BillPay launch, but we ran into another limitation when we started work on physical check payments: while bills paid via check consisted of a single transfer or "leg" (a check cut from the business's bank account and mailed directly to the vendor), bills paid via ACH consisted of two legs (an ACH debit from business to Ramp, followed by an ACH credit from Ramp to vendor). Although we could easily model any of these legs as an InstructedTransfer, a BillPayment now had to store either one or two InstructedTransfer IDs, depending on the payment method. Making one of these IDs nullable didn't feel right, nor did forcing BillPay to create separate data models for check-based and ACH-based bill payments. What would happen if we ever wanted to add a third payment method, or support payments with three or more legs?
We realized we needed a single interface for clients of the platform to interact with a given payment, independent of how many legs it consisted of. In other words, a new primitive: the PaymentRequest. This abstraction allows us to group together an arbitrary number of payment legs or transfers; every PaymentRequest consists of one or more InstructedTransfers, each of which points to the PaymentRequest via its payment_request_id foreign key. This means client services like BillPay only need to store a reference to a single immutable object in order to initiate and track complex, multi-leg payments.
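A minimal sketch of that grouping, assuming a hypothetical constructor that takes a leg count (the real create_payment call is described later in the post):

```python
import uuid
from dataclasses import dataclass


@dataclass
class InstructedTransfer:
    payment_request_id: str  # every leg points back to its PaymentRequest


@dataclass
class PaymentRequest:
    uuid: str


def create_payment(num_legs: int) -> tuple[PaymentRequest, list[InstructedTransfer]]:
    pr = PaymentRequest(uuid=str(uuid.uuid4()))
    legs = [InstructedTransfer(payment_request_id=pr.uuid) for _ in range(num_legs)]
    return pr, legs


# Check: one leg, vendor paid directly. ACH: debit business, then credit vendor.
check_payment, check_legs = create_payment(num_legs=1)
ach_payment, ach_legs = create_payment(num_legs=2)
```

Whether the payment has one leg, two, or ten, the client stores exactly one UUID.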
When engineers think about scalability, they often think about low-level performance optimizations to an existing system: parallel processing, caching, pruning network requests. But building payments at Ramp taught me that scalability is a Day 1 concern: if you don't invest the time to craft the right primitives from the outset, your clients will continue to incur hidden costs that will slow down the growth rate of your product org as a whole. This tax may seem small when viewed locally: a new DB table here, a 1-week rewrite of a feature there. But added together, the engineering man-hours lost to poorly designed primitives function as a massive headwind to launching new products.
The proof is in the pudding: we haven't had to touch any of our core data models since we launched the final PaymentRequest refactor in July 2021, and today every single team at Ramp – BillPay, Reimbursements, Card, Bank Linking, Capital Markets, Flex, and more – relies on the same shared set of abstractions. In each case, onboarding a product onto the new payments service has taken less than a week, because product engineers don't need to write any custom payments code; they simply call a single function to initiate a payment by creating a new PaymentRequest. Designing data models with the flexibility to encapsulate diverse use cases like these within a single interface is how you turn a variable integration cost into a fixed one.
Unfortunately, figuring out what the right primitives are for your use case is more of an art than a science, but I found several simple heuristics helpful throughout our process of iteration:
For example, depending on its payment method, a transfer may need to point either to an ACHDetails or to a MailingAddress. Rather than storing these as nullable fields on TransferCanonical directly, we created separate metadata tables (e.g. CheckTransfer) that map back to TransferCanonical.
Going from a product-level object like a BillPayment to a canonical representation of a money transfer is only the first step in initiating a payment. Once you have a TransferCanonical, you still need to actually instruct the payment via a banking provider, process any success or failure messages sent back, and surface those status updates up to the product layer. In the legacy system, the infrastructure for doing all this was lifted straight out of the 1970s.
That's not hyperbole; anytime we wanted to "post" a new ACH transfer to our partner bank JPM via the legacy payments system, we had to generate a text file formatted according to a specification established by NACHA all the way back in 1974. These NACHA files must adhere to a fixed-width template in which every line is exactly 94 characters long; this allows banks to extract transfer-specific information by reading characters at specific positions within the file. Here's an example of what a NACHA entry for a single ACH payment looks like:
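To illustrate the fixed-width constraint, here is a sketch of building a single NACHA "entry detail" record in Python. The 94-character layout follows the standard field positions, but all values (the vendor ID, the trace number, the transaction code) are illustrative, not a real payment.

```python
def nacha_entry(routing: str, account: str, amount_cents: int,
                name: str, trace: str) -> str:
    """Build one 94-character NACHA entry detail record (illustrative values)."""
    record = (
        "6"                               # record type: entry detail
        + "22"                            # transaction code: credit to checking
        + routing[:8]                     # receiving bank routing (first 8 digits)
        + routing[8]                      # routing check digit
        + account.ljust(17)               # account number, space-padded to 17
        + str(amount_cents).zfill(10)     # amount in cents, zero-padded to 10
        + "VENDOR-001".ljust(15)          # individual/company identification
        + name.ljust(22)[:22]             # receiver name, fixed 22 characters
        + "  "                            # discretionary data
        + "0"                             # addenda record indicator
        + trace.zfill(15)                 # trace number
    )
    assert len(record) == 94              # every NACHA line is exactly 94 chars
    return record


line = nacha_entry("021000021", "000123456789", 125000,
                   "ACME SUPPLIES", "21000020000001")
```

Banks parse records like this purely by character position, which is exactly why a one-character formatting slip can invalidate an entire batch file.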
This meant we needed to run a daily job to manually convert each and every transfer into the above format and then collate the results into a single file with some additional metadata. Then, we had to upload that file via SFTP to a designated server at JPM that could handle at most one incoming connection at any given time. Finally, we needed to scan for updates such as submission confirmations or returns by parsing XML files (formatted according to a completely different specification!) uploaded by JPM to the same server at arbitrary times throughout the day; only once these files were processed would we know whether the transfers ultimately succeeded or failed.
As one might expect, the resulting system was incredibly brittle, with way too much surface area for unexpected errors or failure modes. Some of the many, many examples of this included:
While Ramp was able to scale to > $100MM in ARR in spite of these challenges, the cost of payment delays or errors was an order of magnitude higher for a B2B payments product like BillPay. In the same way that Stripe disrupted payment acceptance by abstracting away complex banking internals underneath a developer-friendly API, we decided to use API-first design principles to improve reliability for payments processing at Ramp. Our north star was a system that felt simple, expressive, and predictable at all levels: for engineers, Support and Ops teams, and end users.
The first step was as simple as modernizing our underlying payments stack: we wrote an integration with JPM's newly released ACH transfers API, allowing us to bypass NACHA files completely! Initiating transfers was now as easy as making a single POST request to a JPM API endpoint with a JSON body containing the transfer details, allowing us to use standard API primitives like idempotency keys, batching, and rate-limiters to prevent duplicate payments and smooth out traffic during periods of peak volume. Similarly, our new API integration also enabled us to both receive payment updates from JPM immediately via webhooks and retrieve the current status at any time via polling (aka a GET request), giving us constant real-time visibility into any payment's ground truth state.
After seeing the benefits of shifting from low-level NACHA file generation to a more cleanly abstracted API, we decided to model the client interface for the new payments system as a lightweight API as well. In particular, even though all payments logic lived within the same Python monolith as our user-facing products, we made the explicit decision that clients would only ever interact with payments as if it were functionally a separate "micro-service." This meant using code fences and Semgrep rules to prevent clients from ever querying raw payments tables or hitting internal JPM API methods directly; instead, every touchpoint between payments and one of its clients takes the form of a standardized function call (the monolith equivalent of an RPC). For example, to initiate a new payment, clients must always call the create_payment function. Internally, this function call results in the creation of a PaymentRequest, but clients at Ramp have no visibility into this object; instead, they simply receive back a UUID that uniquely identifies it. From that point onward, the client performs any and all desired operations on the payment by passing that UUID into a corresponding function call:
get_expected_settlement_date(payment_request_uuid) -> date
cancel_payment(payment_request_uuid) -> bool
retry_payment(payment_request_uuid, updated_recipient_details) -> bool
These standardized entry points create better auditability since they guarantee no one at Ramp ever initiates or updates a payment without a corresponding DB record. Crucially, they also mean that no users of the system ever need any context about its internal logic or the underlying payment rails in order to interact with it. Edge cases that previously could only be resolved via manual eng intervention (e.g. canceling an accidental payment and automatically issuing a refund if necessary, or providing definitive confirmation that a payment has arrived in the recipient's account) could all of a sudden be exposed as simple API endpoints. This not only allowed us to build internal dashboards to outsource operational tasks to Support agents; it actually enabled us to expose those very same API endpoints within the public Ramp app as fully self-service, user-driven actions, cutting out the need for operational resources entirely!
Unfortunately, not all of our API design principles worked this cleanly in practice. For example, in order to communicate payment updates back to clients, we built internal "webhooks": clients "subscribe" to updates for a payment by specifying an optional callback task (i.e. a signature registered in Celery, our async task queue system) when calling create_payment. Then, any time the internal state of a payment changes, we trigger the corresponding callback task and pass all relevant data back as keyword arguments.
While this approach sounds great in theory, we almost immediately ran into some thorny problems, all boiling down to a single root cause: Celery was inherently unreliable! Tasks could be dropped or delivered out of order, and ordering matters enormously for payments: a status update like RETURNED gets treated a lot differently than one like COMPLETED, so a client that processed updates in the wrong order could end up in an inconsistent state.
Some of these issues were more tractable within the limitations of the Celery framework than others, but attempting to impose features like reliability and priority-ordering on top of a system that hadn't been designed with those use cases in mind felt a bit like trying to change a tire on a moving car. Great engineers are often hardwired not to reinvent the wheel, but I kept returning to a simple question: if I really needed a reliable priority task queue so badly, how hard could it be to build it myself?
It turns out, not that hard! In particular, we started logging every payment update to the DB along with a created_at timestamp and a processed_at flag, and implemented a cron job to query for unprocessed updates from our DB and emit them to clients every minute. By making sure to only emit the earliest unprocessed update per payment within each run of the cron job, we could guarantee consistency, i.e. updates are always sent to clients in the order they are created. And by implementing a decorator allowing callback tasks to receive a single update, confirm it hasn't already been processed, perform any client-side logic, and finally set the processed_at flag all within one atomic block (via a row-level lock on the corresponding DB object), we could guarantee each update was processed idempotently, i.e. at most once. Together with running the job every minute, this was enough to guarantee reliability, since "exactly-once" semantics are just "at-most-once" semantics with retries!
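The core of the queue fits in a short sketch. This uses sqlite3 for portability; the real system used Postgres, where the atomic block would be a transaction holding a SELECT ... FOR UPDATE row lock. Table and column names follow the text; everything else is an assumption.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE payment_update (
    id INTEGER PRIMARY KEY,
    payment_id INTEGER,
    status TEXT,
    created_at INTEGER,
    processed_at INTEGER
)""")


def emit_unprocessed_updates(callback):
    # Only the EARLIEST unprocessed update per payment is emitted per run,
    # guaranteeing in-order delivery. (SQLite's min() aggregate returns the
    # other bare columns from the minimal row.)
    rows = db.execute(
        "SELECT id, payment_id, status, MIN(created_at) FROM payment_update "
        "WHERE processed_at IS NULL GROUP BY payment_id"
    ).fetchall()
    for update_id, payment_id, status, _created in rows:
        with db:  # one atomic block: run client logic, then mark processed
            callback(payment_id, status)
            db.execute(
                "UPDATE payment_update SET processed_at = 1 WHERE id = ?",
                (update_id,),
            )


db.executemany(
    "INSERT INTO payment_update (payment_id, status, created_at, processed_at) "
    "VALUES (?, ?, ?, NULL)",
    [(7, "PROCESSING", 100), (7, "RETURNED", 200)],
)
seen = []
emit_unprocessed_updates(lambda pid, s: seen.append(s))  # first cron run
emit_unprocessed_updates(lambda pid, s: seen.append(s))  # next run picks up the rest
```

Two "cron runs" deliver the two updates strictly in created_at order, and re-running the job never re-delivers a processed update.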
As an added bonus, building our own task queue natively in Postgres actually ended up unlocking a whole host of unexpected advantages:
At its heart, what "reliability" means to organizations at scale is predictable behavior: whether you perform a given action 10 times or 10,000 times, you have the confidence that it will always succeed or fail in ways that you expect. Especially for fintechs like Ramp, much of the challenge in building reliable systems boils down to delivering a consistent user experience on top of third-party systems that carry some irreducible uncertainty; you can never guarantee the way an external bank, service provider, or framework behaves because ultimately, you only control what you build and own yourself.
For the payments system, the first step in designing for reliability meant writing simpler, cleaner APIs that abstract away this uncertainty from end users. Not all payments can be refunded, but it's much easier to work within the bounds of that constraint if a client can identify that a given payment is ineligible for refund via a standard error response from an API call than if they need to manually inspect a text file or query across 3 DB tables themselves. As we eventually learned, however, sometimes standard APIs aren't enough. At the end of the day, it's always going to be much easier to reason about the behavior of systems you've built yourself than it is to enforce strict guarantees on those built and maintained by third parties; if there's a feature or invariant you absolutely cannot live without, it might make sense to build it in-house.
Some other learnings that might be helpful when thinking about reliability:
When the Card team built Ramp's first ever payments integration to support statement collections, they had a very specific use case in mind: collecting customer funds into a receivables account held at our partner bank JPM via ACH. As a result, they ended up engineering that assumption into the DNA of the system by building an integration tailored to the sole purpose of initiating ACH transfers into or out of a JPM account:
As discussed earlier, due to the way the initial payments data models had been designed, each successive engineering team at Ramp that wanted to take advantage of this infrastructure for money movement had to effectively build a new payments integration from scratch. Since the path of least resistance was simply to copy as much of the existing logic as possible, this led to an interesting side effect: that initial assumption kept getting replicated and reinforced across the codebase!
The end result was a system that was virtually impossible to extend to any other payment rail or banking provider without rebuilding all the payments integrations from the ground up. For our BillPay MVP, this setup was actually sufficient: every bill payment we processed would require ACH transfers into and out of a single FBO account we were opening at JPM. But as the company continued to scale, we could anticipate customers requesting the ability to process payments on Ramp via wire or paper check, and the engineering cost we'd incur in refactoring the legacy system to support these features would scale as a linear function of the total number of products dependent on that infrastructure:
In other words, our old approach to building systems meant that any tech debt we "borrowed" by not immediately designing for future use cases came at a very high interest rate; it would become increasingly more expensive to pay it off as more and more products were built on top. But by centralizing responsibility for building and maintaining payment integrations into a single shared service, we were essentially borrowing "free" money; integrating with a new bank or supporting a new payment rail in the future would always cost a fixed amount of eng resourcing, regardless of the number of products or teams that wanted that functionality.
This is one of the big benefits of building payments as a platform rather than a feature: anytime we want to make a material change to the way we process payments at Ramp, we simply have to implement the logic in one place and the change is instantly available to all products built on top. For instance, by implementing support for international payments for BillPay, we essentially get international reimbursements for free. And if a banking provider ever experiences any service degradation or unprecedented outage (cough cough SVB), we can simply make changes to a single part of our codebase to ensure all products continue functioning as normal.
To unlock this kind of leverage, we ensured the exact rails and provider used to process a given payment would always be completely substitutable from the outset. Each integration with one of our provider banks is an API client inheriting from the same abstract base class containing a small set of abstract methods (e.g. post_ach_transfer) that take standardized dataclasses as arguments (e.g. ACHTransferDetails). We also created fields to store the payment provider (e.g. JPM), method (e.g. ACH), and currency within TransferCanonical, and exposed these as parameters to end clients within the create_payment service. Finally, we built an engine to take a given TransferCanonical and dynamically 1) instantiate the corresponding API client based on its provider, and 2) use the appropriate dataclass and client method based on its payment method. As a result, whether clients want to initiate a pull transaction into a JPM bank account via ACH, a push transaction from an Increase bank account via physical check, or a Euro payout from our Wise wallet via SEPA, they call the exact same service and just swap values for a few basic parameters.
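A minimal sketch of that substitutable provider layer: the class and method names (post_ach_transfer, ACHTransferDetails) follow the text, while the client internals are stubs standing in for real bank API calls.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ACHTransferDetails:
    routing_number: str
    account_number: str
    amount_cents: int


class BankAPIClient(ABC):
    @abstractmethod
    def post_ach_transfer(self, details: ACHTransferDetails) -> str:
        """Post the transfer and return the provider-side external ID."""


class JPMClient(BankAPIClient):
    def post_ach_transfer(self, details: ACHTransferDetails) -> str:
        return f"jpm-{details.amount_cents}"  # stub: would call JPM's API


class IncreaseClient(BankAPIClient):
    def post_ach_transfer(self, details: ACHTransferDetails) -> str:
        return f"increase-{details.amount_cents}"  # stub: would call Increase's API


CLIENTS: dict[str, type[BankAPIClient]] = {
    "JPM": JPMClient,
    "INCREASE": IncreaseClient,
}


def post_transfer(provider: str, details: ACHTransferDetails) -> str:
    client = CLIENTS[provider]()  # dynamically instantiate the right client
    return client.post_ach_transfer(details)


external_id = post_transfer(
    "JPM", ACHTransferDetails("021000021", "000123456789", 100)
)
```

Adding a new provider means adding one subclass and one registry entry; no client code changes.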
If this seems like a totally obvious design choice, it was anything but. Fully abstracting away the complexity of how all these different payment networks function from end clients was an explicit tradeoff decision to take on a ton of additional technical complexity in order to build a cleaner, more usable API. For example, one of the consequences of our decision to maintain a single interface was that we had to enforce a standardized state machine on all payments initiated via the platform, regardless of the rails or provider they were processed on:
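The standardized state machine might be sketched like this. Only PROCESSING, COMPLETED, and RETURNED appear elsewhere in the post; the remaining states and the exact transition edges are illustrative assumptions.

```python
from enum import Enum


class PaymentState(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    COMPLETED = "completed"
    RETURNED = "returned"
    FAILED = "failed"


# Every payment, regardless of rail or provider, may only move along these edges.
TRANSITIONS = {
    PaymentState.CREATED: {PaymentState.PROCESSING, PaymentState.FAILED},
    PaymentState.PROCESSING: {PaymentState.COMPLETED, PaymentState.RETURNED},
    PaymentState.COMPLETED: set(),
    PaymentState.RETURNED: set(),
    PaymentState.FAILED: set(),
}


def advance(current: PaymentState, nxt: PaymentState) -> PaymentState:
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt


state = advance(PaymentState.CREATED, PaymentState.PROCESSING)
state = advance(state, PaymentState.COMPLETED)
```

Enforcing one transition table for every rail is exactly the "additional technical complexity" the tradeoff refers to: each rail's messy real-world lifecycle has to be mapped onto these shared states.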
However, the real-world life cycles of ACH transfers and paper checks are starkly different. ACH is a "no news is good news" system, which means we never receive a positive confirmation that an ACH transaction succeeded; instead, ACH transfers can be considered settled only after a designated time period (typically 4 business days) has passed without receiving a return. On the other hand, we do receive a notification from the bank once the recipient cashes a check; but since USPS is not a 100% reliable system, checks may never be successfully delivered and can therefore remain in the intermediate "processing" state indefinitely.
Stated differently, settlement for ACH transfers is time-based, while settlement for check transfers is event-based. The simplest way to model these distinctions (and in fact, the one employed by most widely used treasury APIs) would be to create distinct objects to represent ACH and check transfers, each with its own independent state machine, services, and documentation. Instead, in order to reap the benefits of a single platform-ized API, we had to implement both models of settlement within the same payments engine.
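Both settlement models can live behind a single function, as in this sketch (field names are assumptions consistent with the text, and the four-day window is simplified to calendar days):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Transfer:
    method: str                           # "ACH" or "CHECK"
    posted_at: datetime
    returned: bool = False                # ACH: a return ends the clock
    cashed_at: Optional[datetime] = None  # CHECK: settles on this event


def is_settled(t: Transfer, now: datetime) -> bool:
    if t.method == "ACH":
        # Time-based: "no news is good news" once the return window has closed.
        return not t.returned and now >= t.posted_at + timedelta(days=4)
    if t.method == "CHECK":
        # Event-based: settled only once the recipient actually cashes the check.
        return t.cashed_at is not None
    raise ValueError(f"unknown method: {t.method}")


now = datetime(2021, 7, 10, tzinfo=timezone.utc)
ach = Transfer("ACH", posted_at=datetime(2021, 7, 1, tzinfo=timezone.utc))
check = Transfer("CHECK", posted_at=datetime(2021, 7, 1, tzinfo=timezone.utc))
```

The ACH transfer settles by the mere passage of time, while the check stays in processing until an external event arrives: two models, one engine.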
Of course, not all tech debt can be fully anticipated or avoided. Our first implementation of time-based settlement logic for ACH transfers involved setting an expected_settlement_date field on the TransferCanonical once we posted it to its corresponding provider. Then, we could run a daily job to mark any transfers past their settlement date that were still in the PROCESSING state (i.e., had not received a return) as COMPLETED. However, just months into launching the platform, we ran into several limitations:
Each of these limitations involved time-based, rather than event-based, settlement in which dates proved too coarse-grained to capture intra-day or timezone-dependent behavior. As a result, we had to refactor our payments engine to use UTC timestamps instead of dates, which meant a costly DB and source code migration. Fortunately, the cost of this refactor was still orders of magnitude lower than if we had overfit our initial implementation of time-based settlement to ACH transfers, since we could reuse the general settlement date framework. By solving the original problem in an extensible way, we ended up paying minimal added interest on the tech debt we unknowingly incurred.
Trying to avoid tech debt completely is a losing strategy because you can never build for all possible future states, but that doesn't mean you can't control how expensive the debt you borrow ends up being. Part of "knowing your primitives" means knowing when to design abstractions with built-in flexibility to grow with your needs. Ultimately, treating extensibility as a first-class citizen can be as simple as taking the time to examine any assumptions you may have tacitly made when designing a system and stress-testing them under a limited set of reasonable projections. Once again, this exercise is much easier said than done, but some basic heuristics can prove helpful:
For example, we included a currency field in the TransferCanonical table from the outset even though internationalization was nowhere on Ramp's radar at the time, since the amount of any money transfer is meaningless without a corresponding currency. When it came time to support FX transfers a year later, this seemingly inconsequential decision saved us countless hours of pain, since we had already been forced to explicitly consider units in every numerical calculation or comparison from day 1.
Branching logic on payment attributes (e.g. an if/else chain in which cancel_payment invokes different actions depending on the method of the payment) creates a lot of surface area for bugs and leads to exponentially growing test suites. Instead, you can just use dictionaries to map enum values to functions and write a single unit test asserting that all enum values have been mapped. It's also important to collapse non-meaningful distinctions. For example, JPM allows clients to specify the external ID of posted transfers, while Increase generates it server-side. Instead of multiplexing this logic, we simply designed the API base client to always accept an internal ID and return an external ID; whether these end up being the same (as for JPM) or different (as for Increase) doesn't make any practical difference.
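The dictionary-dispatch pattern looks like this in practice; the enum values and handler bodies are illustrative, and the single exhaustiveness assertion replaces a per-branch test suite.

```python
from enum import Enum


class PaymentMethod(Enum):
    ACH = "ach"
    CHECK = "check"
    WIRE = "wire"


def _cancel_ach(payment_id: str) -> str:
    return f"reversed {payment_id}"      # stub: issue an ACH reversal


def _cancel_check(payment_id: str) -> str:
    return f"stopped {payment_id}"       # stub: place a stop-payment order


def _cancel_wire(payment_id: str) -> str:
    return f"recalled {payment_id}"      # stub: request a wire recall


# One place to look up behavior, instead of if/else chains at every call site.
CANCEL_HANDLERS = {
    PaymentMethod.ACH: _cancel_ach,
    PaymentMethod.CHECK: _cancel_check,
    PaymentMethod.WIRE: _cancel_wire,
}


def cancel_payment(payment_id: str, method: PaymentMethod) -> str:
    return CANCEL_HANDLERS[method](payment_id)


# The single exhaustiveness test: adding an enum value without a handler fails here.
assert set(CANCEL_HANDLERS) == set(PaymentMethod)
```

Adding a fourth method means writing one handler and one map entry; the assertion catches you if you forget the latter.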
At Ramp, one of our bread-and-butter sayings is "slope over intercept": instead of focusing on the here and now, we try to select for the people and opportunities with the highest growth potential. Platforms are our secret sauce for high-slope engineering because what they provide is acceleration: we trade off some velocity in the short term in order to continually increase the speed at which we launch new products in the future. This tends to be a difficult mentality for fast-growing startups to adopt because it is so counterintuitive; the immediately visible, first-order benefits of the conventional wisdom to "move fast and break things" too often motivate young product teams to push off questions of maintainability and reusability to the distant future. But in the long run, the second-order effects of investing upfront time into thinking through the right abstractions become exponentially more valuable:
The only way to consistently incentivize this kind of decision-making is to build a company culture where every single decision, no matter how small, always gets tied back to the core operating principle of long-term over short-term velocity. That could take the shape of measuring performance against broader company-wide outcomes rather than narrow project- or team-specific OKRs, empowering product-facing teams to build cross-org platforms, or taking bets on young engineers with zero fintech experience to own foundational infrastructure. Ultimately, the only way to prepare yourself for the marathon that is building a generational business is by taking the long view in everything you do.