Moving Fast by Moving Slow: How We Built Payments at Ramp

July 17, 2023

At Ramp, we’re very publicly proud of the fact that we live and breathe by the phrase “product velocity.” In fact, in a recent interview, our head of product Geoff even went as far as to say that “our culture is velocity” (emphasis mine). But the funny thing about velocity is that humans are hardwired to take a very local view of it; in other words, it’s easy to assume from the outside that what velocity means to us as an organization is simply striving to ship each and every feature as fast as we possibly can. So today, I want to share a story about a time that we deliberately decided not to move fast, and how that ended up being the highest-velocity engineering decision I made in my time here.

Move Slow and Fix Things

In March 2021, Tong Galaxy, Pavel Asparouhov, and I were tasked with launching a new business unit to kill Bill.com, the dominant AP solution on the market. This meant building Ramp’s first-ever product outside of our core T&E platform (i.e., corporate cards and reimbursements), and the resulting engineering scope was massive: invoice OCR, approval chains, accounting integrations, vendor onboarding, and more. So there was a lot of organizational momentum towards expediting our BillPay product launch by re-using existing functionality as much as possible. Yet when it came to the actual rails to pay vendors via bank transfer, Pavel and I insisted that we build a new payments service rather than relying on Ramp’s existing infrastructure for money movement.

We ran into sizable pushback on our proposal almost immediately, much of it built around the premise of maximizing velocity in the form of time-to-market. Rebuilding payments from scratch meant pushing back our launch date by ~2 months—an eternity in Ramp time. But when we stepped back and took a global view of velocity instead, we knew taking the time to build a system that would scale with our needs would make it exponentially faster to release both new BillPay features and future products beyond AP.

While our approach may seem like a no-brainer in retrospect, at the time it was both highly non-obvious and unorthodox for Ramp. After all, we had gotten our expense management product to PMF in record time by building on top of hastily assembled payments integrations. So to illustrate how exactly we built broad conviction in this type of thinking, I’m going to focus on 3 different design principles that were top of mind when we architected our new payments service. For each one, I’ll highlight what exactly was broken in our existing system, how we "fixed" it, and some lessons we learned along the way.

Know Your Primitives
What Was Broken

Before Bill Pay existed, there were 3 main use cases for initiating bank transfers at Ramp: debit checks (micro-transactions to validate customer-provided bank account details), statement collections (charging customers for outstanding card balance), and personal reimbursements (enabling customers to pay back their employees for out-of-pocket expenses). Since all three of these products were launched in the early days at Ramp when engineering resources were at a premium, payments functionality was added to each product in an ad hoc, piecemeal manner by the corresponding team. The resulting system presented a ton of engineering friction anytime we encountered a net-new use case.

For example, before we built Reimbursements, every bank transfer at Ramp was represented as a row in a Transfer table containing a foreign key to the BankAccount object storing the customer bank account details to pull funds from. We then ran an hourly job to upload any uninitiated Transfer’s to our payments provider (JP Morgan, aka JPM). Adapting this data model for reimbursements was difficult because reimbursements required pulling from business bank accounts and paying out to employee bank accounts, the latter of which were stored in a totally separate UserBankAccount table. And while statement collection and debit check transfers could be initiated immediately upon creation, we only wanted to pay employees once we had successfully pulled funds from the corresponding business's bank account.

This meant we had to implement completely new data models (e.g., ReimbursementBusinessTransfer and ReimbursementUserTransfer) and jobs for each new business use case—a ton of repeated code for what is essentially the same basic function of moving $X from Y's bank account to Z's. And we could see ourselves running into the same exact problem all over again when supporting bill payments, which would require paying vendor bank accounts on scheduled dates in the future. In the short term, creating a new table and adding a scheduler could be just 1-2 weeks worth of work: much cheaper than rebuilding payments from the ground up. But we also knew BillPay would not be the last payments product we'd launch at Ramp, which meant what appeared to be a small fixed cost of building BillPay on top of the old system was actually an expensive variable cost for our product org as a whole.

So the first goal in building a brand new payments system was decoupling business logic from payments logic. By abstracting away the details of money movement into a compartmentalized service that is totally agnostic to the underlying business or product reason behind why we want to move money around, we can support an ever-growing number of payments use cases with minimal additional engineering overhead.

How We Fixed It

Our key insight was that we were building on top of the wrong primitives. What we needed were data models to represent more fundamental payment concepts like "bank account" and "transfer", rather than business concepts of "employee bank account" and "reimbursement transfer." So our first step was creating new generic tables: an ACHDetails table to represent any external counterparty's bank account name, account number, and routing number, and a TransferCanonical table to represent a single transfer between a specific Ramp-owned bank account and an ACHDetails. With this formulation, both business and employee bank accounts could be represented as foreign keys to a single ACHDetails, and statement collection, debit check, and reimbursement payments could all be represented as foreign keys to one or more TransferCanonical's, resulting in a simple unified interface.
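To make this concrete, here's a minimal sketch of what those two primitives might look like as SQLAlchemy models. The columns beyond the ones described above (ramp_account_id, amount_cents, direction, status) are illustrative assumptions rather than our production schema:

```python
# Illustrative sketch only -- not the exact production schema.
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ACHDetails(Base):
    """Any external counterparty's bank account, regardless of who owns it."""
    __tablename__ = "ach_details"
    id = Column(Integer, primary_key=True)
    account_name = Column(String, nullable=False)
    account_number = Column(String, nullable=False)
    routing_number = Column(String, nullable=False)

class TransferCanonical(Base):
    """A single transfer between a Ramp-owned bank account and an ACHDetails."""
    __tablename__ = "transfer_canonical"
    id = Column(Integer, primary_key=True)
    ramp_account_id = Column(Integer, nullable=False)  # hypothetical FK to a Ramp-owned account
    ach_details_id = Column(Integer, ForeignKey("ach_details.id"), nullable=False)
    amount_cents = Column(Integer, nullable=False)
    direction = Column(String, nullable=False)  # "debit" (pull) or "credit" (push)
    status = Column(String, nullable=False, default="CREATED")
```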

We soon learned an important lesson: primitives may need to change over time! For example, our first iteration of the new payments platform looked something like the image below (the Dependency table is used to represent sequential relationships between transfers, i.e. only release the child TransferCanonical once the parent TransferCanonical has settled):

However, within a few days of processing payments in production, it became clear that we needed to build an additional layer of abstraction on top of TransferCanonical. The problem arose anytime a TransferCanonical failed, for example due to invalid account details or insufficient funds. Most times when this happens, the client eventually wants to retry the transfer. There are two ways to handle this:

  1. First, we could simply keep the same TransferCanonical object and post it again to the banking provider, after making some modification (e.g. updating the associated ACHDetails). However, this approach causes us to lose our audit trail, since we'd no longer be able to reconstruct the full history of which ACHDetails we initially attempted to pay. Additionally, this would mean a TransferCanonical ID can no longer be mapped to a single, unique transfer in the provider's database, which makes idempotency and consistency much, much harder to guarantee.
  2. Or, we could create a second TransferCanonical object with the updated information upon a retry. However, that means clients would need to track potentially many different TransferCanonical objects for the same logical transfer. Additionally, storing dependencies between parent and child transfers in an auditable way becomes much more difficult when each logical transfer is represented by potentially more than 1 TransferCanonical.

What we eventually realized was that we needed to distinguish between the intent to move money and an actual attempt to perform that money movement. TransferCanonical already encapsulated the latter of these two concepts, so we built a new InstructedTransfer primitive to encapsulate the former:

Any time a new transfer is initiated via a call to the payments service, we actually create two objects: an InstructedTransfer, and a TransferCanonical that points to that InstructedTransfer via a foreign key. Retries of failed TransferCanonical's result in a new TransferCanonical object pointing to the same InstructedTransfer, while the InstructedTransfer always maintains a pointer to the current, active TransferCanonical via the curr_transfer_id foreign key. This allows clients to treat InstructedTransfer as a single, immutable touchpoint and to safely retry transfers without losing any event history or relationship data!
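In code, a retry under this model might look roughly like the following sketch, reusing the illustrative columns from the earlier model sketch and assuming standard SQLAlchemy session semantics:

```python
def retry_transfer(session, instructed_transfer, updated_ach_details_id):
    """Hedged sketch of a retry: cut a new attempt rather than mutating the failed one."""
    failed_attempt = session.get(TransferCanonical, instructed_transfer.curr_transfer_id)
    assert failed_attempt.status == "FAILED"

    # The failed TransferCanonical stays in the DB untouched, preserving the audit trail
    # and keeping a 1:1 mapping between TransferCanonical IDs and provider-side transfers.
    new_attempt = TransferCanonical(
        instructed_transfer_id=instructed_transfer.id,
        ach_details_id=updated_ach_details_id,  # e.g. corrected account details
        amount_cents=failed_attempt.amount_cents,
        direction=failed_attempt.direction,
        status="CREATED",
    )
    session.add(new_attempt)
    session.flush()  # assigns new_attempt.id

    # The intent (InstructedTransfer) always points at the currently active attempt.
    instructed_transfer.curr_transfer_id = new_attempt.id
    session.commit()
    return new_attempt
```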

This design proved sufficient for our initial BillPay launch, but we ran into another limitation when we started work on physical check payments: while bills paid via check consisted of a single transfer or "leg" (a check cut from the business's bank account and mailed directly to the vendor), bills paid via ACH consisted of 2 legs (an ACH debit from business to Ramp, followed by an ACH credit from Ramp to vendor). Although we could easily model any of these legs as InstructedTransfer's, each BillPayment now had to store either one or two InstructedTransfer IDs, depending on the payment method. Making one of these IDs nullable didn't feel right, nor did forcing BillPay to create separate data models for check-based and ACH-based bill payments. What would happen if we ever wanted to add a third payment method, or support payments with 3 or more legs?

We realized we needed a single interface for clients of the platform to interact with a given payment, independent of how many legs it consisted of. In other words, a new primitive: the PaymentRequest.

The PaymentRequest abstraction allows us to group together an arbitrary number of payment legs or transfers; every PaymentRequest consists of 1 or more InstructedTransfer's, each of which points to the PaymentRequest via its payment_request_id foreign key. This means client services like BillPay only need to store a reference to a single immutable object in order to initiate and track complex, multi-leg payments.
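To make the grouping concrete, here's a hedged sketch of how the legs for an ACH bill payment versus a check bill payment might be constructed behind a single interface. The Leg dataclass is a simplified stand-in for InstructedTransfer, and the field names and RAMP_FBO_ACCOUNT constant are hypothetical:

```python
from dataclasses import dataclass

RAMP_FBO_ACCOUNT = "ramp_fbo_account"  # hypothetical identifier for Ramp's FBO account at JPM

@dataclass
class Leg:
    """Simplified stand-in for an InstructedTransfer, for illustration only."""
    payment_request_id: str
    source_account: str
    destination_account: str
    amount_cents: int

def build_legs(payment_request_id: str, method: str,
               business_account: str, vendor_account: str, amount_cents: int) -> list[Leg]:
    """Map a payment method to the legs that make up its PaymentRequest."""
    if method == "ACH":
        # Two legs: pull funds from the business into Ramp's FBO account,
        # then push them from the FBO account out to the vendor.
        return [
            Leg(payment_request_id, business_account, RAMP_FBO_ACCOUNT, amount_cents),
            Leg(payment_request_id, RAMP_FBO_ACCOUNT, vendor_account, amount_cents),
        ]
    if method == "CHECK":
        # One leg: a check cut from the business's account and mailed to the vendor.
        return [Leg(payment_request_id, business_account, vendor_account, amount_cents)]
    raise ValueError(f"Unsupported payment method: {method}")
```

Either way, the client only ever holds onto the PaymentRequest's ID; the number of legs underneath is an implementation detail of the platform.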

What We Learned

When engineers think about scalability, they often think about low-level performance optimizations to an existing system: parallel processing, caching, pruning network requests. But building payments at Ramp taught me that scalability is a Day 1 concern: if you don't invest the time to craft the right primitives from the outset, your clients will continue to incur hidden costs that will slow down the growth rate of your product org as a whole. This tax may seem small when viewed locally: a new DB table here, a 1-week rewrite of a feature there. But added together, the engineering man-hours lost to poorly designed primitives function as a massive headwind to launching new products.

The proof is in the pudding; we haven't had to touch any of our core data models since we launched the final PaymentRequest refactor in July 2021, and today every single team at Ramp – BillPay, Reimbursements, Card, Bank Linking, Capital Markets, Flex, and more – relies on the same shared set of abstractions. In each case, onboarding a product onto the new payments service has taken less than 1 week, because product engineers don't need to write any custom payments code; they simply need to call a single function to initiate a payment by creating a new PaymentRequest. Designing data models with the flexibility to encapsulate diverse use cases like these within a single interface is how you turn a variable integration cost into a fixed one.

Unfortunately, figuring out what the right primitives are for your use case is more of an art than a science, but I found several simple heuristics helpful throughout our process of iteration:

  • Avoid nullable columns. When your core data models include fields that may or may not be null depending on the use case, they’re likely not the right abstraction. Joiner tables are often a much better approach to multiplexing. For instance, depending on whether a transfer is initiated via ACH or check, we need to store a pointer either to an ACHDetails or to a MailingAddress. Rather than storing these as nullable fields on TransferCanonical directly, we created separate ACHTransfer and CheckTransfer metadata tables that map back to TransferCanonical (see the sketch after this list).
  • Rely on DB-level constraints, rather than on application code behaving correctly, to prevent inconsistent state. Storing data as unstructured JSON instead of flattened, typed columns, or creating implicit dependencies between the values of 2 different columns without a corresponding check constraint, are great ways to confuse future maintainers of your system and end up with a bunch of bugs. Since the database itself is ultimately the lowest level of ground truth, that's the layer where you should enforce almost all data validation rules.
  • If you have trouble concisely articulating the real-world concept an object logically represents, then you likely haven’t arrived at the correct data model yet.
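As a sketch of the joiner-table pattern (and of leaning on the DB rather than application code to enforce invariants), the method-specific metadata tables might look something like this; the column names are illustrative:

```python
# Illustrative sketch of the joiner-table pattern; column names are not the real schema.
from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ACHTransfer(Base):
    """ACH-specific metadata; a row exists only for transfers initiated via ACH."""
    __tablename__ = "ach_transfer"
    id = Column(Integer, primary_key=True)
    transfer_canonical_id = Column(Integer, nullable=False)  # FK to transfer_canonical.id
    ach_details_id = Column(Integer, nullable=False)         # FK to ach_details.id
    # The DB, not application code, guarantees at most one ACH metadata row per transfer.
    __table_args__ = (UniqueConstraint("transfer_canonical_id"),)

class CheckTransfer(Base):
    """Check-specific metadata; a row exists only for transfers initiated via check."""
    __tablename__ = "check_transfer"
    id = Column(Integer, primary_key=True)
    transfer_canonical_id = Column(Integer, nullable=False)  # FK to transfer_canonical.id
    mailing_address_id = Column(Integer, nullable=False)     # FK to mailing_address.id
    check_number = Column(String, nullable=True)             # hypothetical check-only field
    __table_args__ = (UniqueConstraint("transfer_canonical_id"),)
```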
You Only Control What You Own
What Was Broken

Going from a product-level object like a BillPayment to a canonical representation of a money transfer is only the first step in initiating a payment. Once you have a TransferCanonical, you still need to actually instruct the payment via a banking provider, process any success or failure messages sent back, and surface those status updates up to the product layer. In the legacy system, the infrastructure for doing all this was lifted straight out of the 1970s.

That's not hyperbole; anytime we wanted to "post" a new ACH transfer to our partner bank JPM via the legacy payments system, we had to generate a text file formatted according to a specification established by NACHA all the way back in 1974. These NACHA files must adhere to a fixed-width template in which every line is exactly 94 characters long; this allows banks to extract transfer-specific information by reading characters at specific positions within the file. Here's an example of what a NACHA entry for a single ACH payment looks like:
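To give a flavor of what generating one of these entries involves, here's a toy fixed-width serializer. The field names and widths below are simplified for illustration and do not match the actual NACHA entry layout:

```python
def to_fixed_width_entry(transaction_code: str, routing_number: str, account_number: str,
                         amount_cents: int, receiver_name: str, trace_number: str) -> str:
    """Toy example of fixed-width serialization; NOT the real NACHA entry detail layout."""
    entry = (
        "6"                                  # record type code for an entry detail record
        + transaction_code.rjust(2, "0")
        + routing_number.rjust(9, "0")
        + account_number.ljust(17)
        + str(amount_cents).rjust(10, "0")   # amounts are expressed in cents, zero-padded
        + receiver_name.ljust(22)[:22]
        + trace_number.rjust(15, "0")
    )
    # Every line in a NACHA file must be exactly 94 characters.
    return entry.ljust(94)[:94]
```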

This meant we needed to run a daily job to manually convert each and every transfer into the above format and then collate the results into a single file with some additional metadata. Then, we had to upload that file via SFTP to a designated server at JPM that could handle at most one incoming connection at any given time. Finally, we needed to scan for updates such as submission confirmations or returns by parsing XML files (formatted according to a completely different specification!) uploaded by JPM to the same server at arbitrary times throughout the day; only once these files were processed would we know whether the transfers ultimately succeeded or failed.

As one might expect, the resulting system was incredibly brittle, with way too much surface area for unexpected errors or failure modes. Some of the many, many examples of this included:

  • We were often seeing transfers fail to post to JPM because two workers attempted to connect to the SFTP server at the same time (e.g. mid-deploy, before the worker running the old code version was killed but after the worker running the new version was brought online). This also made parallelization (e.g., running the tasks to post statement payments and reimbursements at the same time on different workers) virtually impossible.
  • In order to reverse an erroneous payment, an engineer had to handcraft a text file and upload it directly to our JPM server. This not only made a highly sensitive operation incredibly slow, tedious, and error-prone, but it also crucially meant that there was no record of the refund anywhere in our DB that our Engineering, Finance, or Ops teams could track (leading to at least one case of double refunds)!
  • Any time a customer filed a ticket stating they had not received a given payment, the Support team had to loop in one of 2 engineers with the relevant domain knowledge to investigate; in turn, those engineers would then have to manually comb through a massive XML file to confirm whether the transfer had ever been successfully acknowledged by JPM, leading to suboptimal SLAs and a dangerously low bus factor.

While Ramp was able to scale to > $100MM in ARR in spite of these challenges, the cost of payment delays or errors was an order of magnitude higher for a B2B payments product like BillPay. In the same way that Stripe disrupted payment acceptance by abstracting away complex banking internals underneath a developer-friendly API, we decided to use API-first design principles to improve reliability for payments processing at Ramp. Our north star was a system that felt simple, expressive, and predictable at all levels: for engineers, Support and Ops teams, and end users.

How We Fixed It

The first step was as simple as modernizing our underlying payments stack: we wrote an integration with JPM's newly released ACH transfers API, allowing us to bypass NACHA files completely! Initiating transfers was now as easy as making a single POST request to a JPM API endpoint with a JSON body containing the transfer details, allowing us to use standard API primitives like idempotency keys, batching, and rate-limiters to prevent duplicate payments and smooth out traffic during periods of peak volume. Similarly, our new API integration also enabled us to both receive payment updates from JPM immediately via webhooks and retrieve the current status at any time via polling (aka a GET request), giving us constant real-time visibility into any payment's ground truth state.

After seeing the benefits of shifting from low-level NACHA file generation to a more cleanly abstracted API, we decided to model the client interface for the new payments system as a lightweight API as well. In particular, even though all payments logic lived within the same Python monolith as our user-facing products, we made the explicit decision that clients would only ever interact with payments as if it functionally were a separate "micro-service." This meant utilizing code fences and Semgrep rules to prevent clients from ever querying raw payments tables or hitting internal JPM API methods directly; instead, every touchpoint between payments and one of its clients takes the form of a standardized function call (the monolith equivalent of an RPC). For example, to initiate a new payment, clients must always call the create_payment function:
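As a rough approximation (the parameter names and types here are inferred from the description in this post rather than copied from the real signature), that entry point might look something like:

```python
from datetime import date
from enum import Enum
from typing import Optional
from uuid import UUID

class PaymentMethod(Enum):
    ACH = "ach"
    CHECK = "check"

def create_payment(
    *,
    client_reference_id: str,               # e.g. the BillPayment's own ID, for idempotency
    payment_method: PaymentMethod,
    amount_cents: int,
    currency: str,
    source_account_id: UUID,                # the account to pull funds from
    recipient_details_id: UUID,             # e.g. a reference to an ACHDetails or MailingAddress
    scheduled_date: Optional[date] = None,  # for bills paid on a future date
    callback_task: Optional[str] = None,    # registered Celery task to notify on status changes
) -> UUID:
    """Create a PaymentRequest (and its underlying transfers) and return its UUID."""
    ...
```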

Internally, this function call results in the creation of a PaymentRequest, but clients at Ramp have no visibility into this object; instead, they simply receive back a UUID that uniquely identifies it. From that point onward, the client performs any and all desired operations on the payment by passing that UUID into a corresponding function call:

  • get_expected_settlement_date(payment_request_uuid) -> date
  • cancel_payment(payment_request_uuid) -> bool
  • retry_payment(payment_request_uuid, updated_recipient_details) -> bool

These standardized entry points create better auditability since they guarantee no one at Ramp ever initiates or updates a payment without a corresponding DB record. Crucially, they also mean that no users of the system ever need any context about its internal logic or the underlying payment rails in order to interact with it. Edge cases that previously could only be resolved via manual eng intervention (e.g. canceling an accidental payment and automatically issuing a refund if necessary, or providing definitive confirmation that a payment has arrived in the recipient's account) could all of a sudden be exposed as simple API endpoints. This not only allowed us to build internal dashboards to outsource operational tasks to Support agents; it actually enabled us to expose those very same API endpoints within the public Ramp app as fully self-service, user-driven actions, cutting out the need for operational resources entirely!

Unfortunately, not all of our API design principles worked this cleanly in practice. For example, in order to communicate payment updates back to clients, we built internal "webhooks": clients "subscribe" to updates for a payment by specifying an optional callback task (i.e. a signature registered in Celery, our async task queue system) when calling create_payment. Then, any time the internal state of a payment changes, we trigger the corresponding callback task and pass all relevant data back as keyword arguments.
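For illustration, a BillPay-side callback might be as simple as the following sketch; the task name and payload fields are hypothetical:

```python
from celery import shared_task

@shared_task
def handle_bill_payment_update(payment_request_uuid: str, new_status: str, **kwargs) -> None:
    """Hypothetical BillPay-side callback the payments service triggers on state changes."""
    # In the real client this would update the corresponding BillPayment row;
    # a log line stands in for that here.
    print(f"payment {payment_request_uuid} moved to {new_status}")
```

BillPay would then hand this task's Celery signature to create_payment, and the payments service would invoke it with the relevant keyword arguments whenever the payment's state changed.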

While this approach sounds great in theory, we almost immediately ran into some thorny problems, all boiling down to a single root cause: Celery was inherently unreliable!

  • Some payment updates were being double-played.
  • Some payment updates were never being processed at all.
  • Even when all updates were eventually processed, they could end up arriving at the client layer out-of-order, leading to an inconsistent view of the payment's state (PROCESSED, then RETURNED gets treated a lot differently than RETURNED, then PROCESSED).

Some of these issues were more tractable within the limitations of the Celery framework than others, but attempting to impose features like reliability and priority-ordering on top of a system that hadn't been designed with those use cases in mind felt a bit like trying to change a tire on a moving car. Great engineers are often hardwired not to reinvent the wheel, but I kept returning to a simple question: if I really needed a reliable priority task queue so badly, how hard could it be to build it myself?

It turns out, not that hard! In particular, we started logging every payment update to the DB along with a created_at timestamp and a processed_at flag, and implemented a cron job to query for unprocessed updates from our DB and emit them to clients every minute. By making sure to only emit the earliest unprocessed update per payment within each run of the cron job, we could guarantee consistency, i.e. updates are always sent to clients in the order they are created. And by implementing a decorator allowing callback tasks to receive a single update, confirm it hasn't already been processed, perform any client-side logic, and finally set the processed_at flag all within one atomic block (via a row-level lock on the corresponding DB object), we could guarantee each update was processed idempotently, i.e. at most once. Together with running the task every minute, this was enough to guarantee reliability since "exactly-once" semantics are just "at-most-once" semantics with retries!
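Condensed into a sketch, the cron job and the decorator might look roughly like this; PaymentUpdate, session_scope, and trigger_callback_task are illustrative stand-ins for the update model, a transactional session context manager, and the Celery hand-off:

```python
from datetime import datetime, timezone
from functools import wraps

from sqlalchemy import func

def emit_pending_updates(session):
    """Cron entry point: emit the earliest unprocessed update for each payment."""
    earliest = (
        session.query(
            PaymentUpdate.payment_request_id,
            func.min(PaymentUpdate.created_at).label("min_created_at"),
        )
        .filter(PaymentUpdate.processed_at.is_(None))
        .group_by(PaymentUpdate.payment_request_id)
        .subquery()
    )
    updates = (
        session.query(PaymentUpdate)
        .join(
            earliest,
            (PaymentUpdate.payment_request_id == earliest.c.payment_request_id)
            & (PaymentUpdate.created_at == earliest.c.min_created_at),
        )
        .all()
    )
    for update in updates:
        trigger_callback_task(update)  # hands off to the client's registered Celery task

def idempotent_update_handler(handler):
    """Decorator for client callbacks: process each update at most once, under a row lock."""
    @wraps(handler)
    def wrapper(update_id, *args, **kwargs):
        with session_scope() as session:  # assumed transactional session context manager
            update = (
                session.query(PaymentUpdate)
                .filter(PaymentUpdate.id == update_id)
                .with_for_update()  # row-level lock held for the rest of the transaction
                .one()
            )
            if update.processed_at is not None:
                return  # a previous attempt already handled this update
            handler(update, *args, **kwargs)  # client-side logic runs in the same transaction
            update.processed_at = datetime.now(timezone.utc)
    return wrapper
```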

As an added bonus, building our own task queue natively in Postgres actually ended up unlocking a whole host of unexpected advantages:

  • We could easily identify and filter out duplicate JPM events by enforcing a corresponding unique constraint on the payment update table, turning idempotency from a dynamic code execution condition into a static data condition.
  • We could calculate metrics like average number of attempts per update, median update processing time, and error rates by callback task via a simple SQL query instead of having to parse Celery logs, leading to much better instrumentation and observability.
  • We could dynamically (i.e. in source code) tune retry / backoff schedules in response to unexpected spikes in payment / update volumes.
What We Learned

At its heart, what "reliability" means to organizations of scale is predictable behavior: whether you perform a given action 10 times or 10,000 times, you have the confidence that it will always succeed or fail in ways that you expect. Especially for fintechs like Ramp, so much of the challenge in building reliable systems often boils down to delivering a consistent user experience on top of third-party systems that have some inherent irreducible uncertainty; you can never guarantee the way an external bank, service provider, or framework behaves because ultimately, you only control what you build and own yourself.

For the payments system, the first step in designing for reliability meant writing simpler, cleaner APIs that abstract away this uncertainty from end users. Not all payments can be refunded, but it's much easier to work within the bounds of that constraint if a client can identify that a given payment is ineligible for refund via a standard error response from an API call than if they need to manually inspect a text file or query across 3 DB tables themselves. As we eventually learned, however, sometimes standard APIs aren't enough. At the end of the day, it's always going to be much easier to reason about the behavior of systems you've built yourself than it is to enforce strict guarantees on those built and maintained by third parties; if there's a feature or invariant you absolutely cannot live without, it might make sense to build it in-house.

Some other learnings that might be helpful when thinking about reliability:

  • Prevent things from breaking all at once. Before we moved to API-based initiation, we relied on staggered schedules for payment posting tasks to prevent 2 concurrent SFTP connections. This meant payments could go out as normal for weeks in a row before an unexpected spike in volume caused the processing time for a task to exceed the critical threshold, leading to a collision with the following task and an outage out of the blue. When you rely on heuristics that may unpredictably break down or require constant iteration such as manually tuned time delays, performance degrades in a step function with volume: everything's great, until it's f***ed. But when you rely on absolute system guarantees, performance degrades linearly with volume, and you can identify and alleviate bottlenecks much more proactively.
  • Avoid side effects at all costs. In the legacy payments system, if something went wrong serializing a single transfer into a NACHA entry due to missing or invalid data, the entire batch of payments would fail to be sent out to JPM. This meant errors had very high surface area; tens of thousands of payments could be delayed because of a single data integrity issue. With the new payments system, on the other hand, each failure to post a single transfer is isolated to that transfer; in other words, errors are far more localized, which makes them much easier (and crucially, much less stressful) to manage.
  • Always implement polling, even if just as a backstop. Relying exclusively on webhooks to get updates from third-party APIs means you're at the mercy of that third party functioning correctly. Even though we primarily use webhooks to receive updates from JPM, we also run a periodic back-up polling task; this has allowed us to proactively catch several outages that we might not otherwise have noticed until customers reached out.
Avoid High Interest Rate Tech Debt
What Was Broken

When the Card team built Ramp's first ever payments integration to support statement collections, they had a very specific use case in mind: collecting customer funds into a receivables account held at our partner bank JPM via ACH. As a result, they ended up engineering that assumption into the DNA of the system by building an integration tailored to the sole purpose of initiating ACH transfers into or out of a JPM account:

As discussed earlier, due to the way the initial payments data models had been designed, each successive engineering team at Ramp that wanted to take advantage of this infrastructure for money movement had to effectively build a new payments integration from scratch. Since the path of least resistance was simply to copy as much of the existing logic as possible, this led to an interesting side effect: that initial assumption kept getting replicated and reinforced across the codebase!

The end result was a system that was virtually impossible to extend to any other payment rail or banking provider unless we rebuilt all the payments integrations from the ground up. For our Bill Pay MVP, this setup was actually sufficient; every bill payment we processed would require ACH transfers into and out of a single FBO account we were opening at JPM. But as the company continued to scale, we could anticipate customers requesting the ability to process payments on Ramp via wire or paper check; and the engineering cost that we'd incur in refactoring the legacy system to support these features would scale as a linear function of the total number of products dependent on that infrastructure:

In other words, our old approach to building systems meant that any tech debt we "borrowed" by not immediately designing for future use cases came at a very high interest rate; it would become increasingly more expensive to pay it off as more and more products were built on top. But by centralizing responsibility for building and maintaining payment integrations into a single shared service, we were essentially borrowing "free" money; integrating with a new bank or supporting a new payment rail in the future would always cost a fixed amount of eng resourcing, regardless of the number of products or teams that wanted that functionality.

This is one of the big benefits of building payments as a platform rather than a feature: anytime we want to make a material change to the way we process payments at Ramp, we simply have to implement the logic in one place and the change is instantly available to all products built on top. For instance, by implementing support for international payments for BillPay, we essentially get international reimbursements for free. And if a banking provider ever experiences any service degradation or unprecedented outage (cough cough SVB), we can simply make changes to a single part of our codebase to ensure all products continue functioning as normal.

How We Fixed It

To unlock this kind of leverage, we ensured the exact rails and provider used to process a given payment would always be completely substitutable from the outset. Each integration with one of our provider banks is an API client inheriting from the same abstract base class containing a small set of abstract methods (e.g. post_ach_transfer) that take standardized dataclasses as arguments (e.g. ACHTransferDetails). We also created fields to store the payment provider (e.g. JPM), method (e.g. ACH), and currency within TransferCanonical, and exposed these as parameters to the end clients within the create_payment service. Finally, we built an engine to take a given TransferCanonical and dynamically 1) instantiate the corresponding API client based on its provider, and 2) use the appropriate dataclass and client method based on its payment method. As a result, whether clients want to initiate a pull transaction into a JPM bank account via ACH, a push transaction from an Increase bank account via physical check, or a Euro payout from our Wise wallet via SEPA, they call the same exact service and just swap values for a few basic parameters.
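In code, the shape of this abstraction is roughly the following; the class, enum, and field names are simplified from the description above, and the client bodies are stubbed:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum

class Provider(Enum):
    JPM = "jpm"
    INCREASE = "increase"

class PaymentMethod(Enum):
    ACH = "ach"
    CHECK = "check"

@dataclass
class ACHTransferDetails:
    routing_number: str
    account_number: str
    amount_cents: int
    currency: str

class BankClient(ABC):
    """Abstract base class that every provider integration implements."""

    @abstractmethod
    def post_ach_transfer(self, details: ACHTransferDetails) -> str:
        """Post an ACH transfer and return the provider's external ID."""

class JPMClient(BankClient):
    def post_ach_transfer(self, details: ACHTransferDetails) -> str:
        return "jpm-external-id"  # stub: the real client calls JPM's ACH API here

class IncreaseClient(BankClient):
    def post_ach_transfer(self, details: ACHTransferDetails) -> str:
        return "increase-external-id"  # stub

# The "engine": pick the client from the transfer's provider field, then the method.
PROVIDER_CLIENTS = {Provider.JPM: JPMClient, Provider.INCREASE: IncreaseClient}

def post_transfer(transfer) -> str:
    client = PROVIDER_CLIENTS[transfer.provider]()
    if transfer.payment_method == PaymentMethod.ACH:
        return client.post_ach_transfer(
            ACHTransferDetails(
                routing_number=transfer.routing_number,
                account_number=transfer.account_number,
                amount_cents=transfer.amount_cents,
                currency=transfer.currency,
            )
        )
    raise NotImplementedError(f"No handler yet for {transfer.payment_method}")
```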

If this seems like a totally obvious design choice, it was anything but. Fully abstracting away the complexity of how all these different payment networks function from end clients was an explicit tradeoff decision to take on a ton of additional technical complexity in order to build a cleaner, more usable API. For example, one of the consequences of our decision to maintain a single interface was that we had to enforce a standardized state machine on all payments initiated via the platform, regardless of the rails or provider they were processed on:
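A simplified approximation of such a shared state machine (the exact states and transitions here are a plausible guess based on the states mentioned elsewhere in this post, not the real machine) might look like:

```python
from enum import Enum

class PaymentState(Enum):
    CREATED = "created"
    PROCESSING = "processing"  # posted to the provider, awaiting settlement
    COMPLETED = "completed"    # settled, whether via a time window or a settlement event
    RETURNED = "returned"      # e.g. an ACH return for invalid details or insufficient funds
    CANCELED = "canceled"

# Allowed transitions, enforced in one place for every rail and provider.
ALLOWED_TRANSITIONS = {
    PaymentState.CREATED: {PaymentState.PROCESSING, PaymentState.CANCELED},
    PaymentState.PROCESSING: {PaymentState.COMPLETED, PaymentState.RETURNED, PaymentState.CANCELED},
    PaymentState.COMPLETED: set(),
    PaymentState.RETURNED: set(),
    PaymentState.CANCELED: set(),
}

def transition(current: PaymentState, new: PaymentState) -> PaymentState:
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.name} -> {new.name}")
    return new
```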

However, the real-world life cycles of ACH transfers and paper checks are starkly different. ACH is a "no news is good news" system, which means we never receive a positive confirmation that an ACH transaction succeeded; instead, ACH transfers can be considered settled only after a designated time period (typically 4 business days) has passed without receiving a return. On the other hand, we do receive a notification from the bank once the recipient cashes a check; but since USPS is not a 100% reliable system, checks may never be successfully delivered and can therefore remain in the intermediate "processing" state indefinitely.

Stated differently, settlement for ACH transfers is time-based, while settlement for check transfers is event-based. The simplest way to model these distinctions (and in fact, the one employed by most widely used treasury APIs) would be to create distinct objects to represent ACH and check transfers, each with its own independent state machine, services, and documentation. Instead, in order to reap the benefits of a single platform-ized API, we had to implement both models of settlement within the same payments engine.

Of course, not all tech debt can be fully anticipated or avoided. Our first implementation of time-based settlement logic for ACH transfers involved setting an expected_settlement_date field on the TransferCanonical once we posted it to its corresponding provider. Then, we could run a daily job to mark any transfers past their settlement date that were still in the PROCESSING state (i.e., had not received a return) as COMPLETED. However, just months into launching the platform, we ran into several limitations:

  • One of our new payment providers offered us same-day ACH functionality, i.e. ACH transfers that settle at end-of-business the same day they are initiated, provided we post them before a given cutoff time.
  • Wire transfers settle 30 minutes after being initiated, while book and RTP transfers settle instantaneously.
  • International payments may settle at a time that corresponds to different days in 2 different time zones.

All the above cases are instances of time-based, rather than event-based, settlement; however, dates were too coarse-grained to capture intra-day or timezone-dependent behavior. As a result, we had to refactor our payments engine to use UTC timestamps instead of dates, which meant a costly DB + source code migration. Fortunately, the cost of this refactor was still orders of magnitude lower than if we had overfit our initial implementation of time-based settlement to ACH transfers, since we could reuse the general settlement date framework. By solving the original problem in an extensible way, we ended up paying minimal added interest on the tech debt we unknowingly incurred.
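After that refactor, the unified settlement check might look roughly like this sketch; the field names are hypothetical:

```python
from datetime import datetime, timezone

def is_settled(transfer) -> bool:
    """Sketch of a unified settlement check across time-based and event-based rails."""
    if transfer.settlement_model == "EVENT_BASED":
        # e.g. checks: settled only once the bank tells us the recipient cashed it.
        return transfer.settlement_event_received
    # Time-based rails (standard ACH, same-day ACH, wires, RTP): settled once the
    # expected settlement timestamp has passed without a return coming back.
    return (
        datetime.now(timezone.utc) >= transfer.expected_settlement_at
        and not transfer.return_received
    )
```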

What We Learned

Trying to avoid tech debt completely is a losing strategy because you can never build for all possible future states, but that doesn't mean you can't control how expensive the debt you borrow ends up being. Part of "knowing your primitives" means knowing when to design abstractions with built-in flexibility to grow with your needs. Ultimately, treating extensibility as a first-class citizen can be as simple as taking the time to examine any assumptions you may have tacitly made when designing a system and stress-testing them under a limited set of reasonable projections. Once again, this exercise is much easier said than done, but some basic heuristics can prove helpful:

  • The number of columns on a data model should map to the number of degrees of freedom in the corresponding abstraction. For example, we explicitly decided to add a currency column to our TransferCanonical table from the outset even though internationalization was nowhere on Ramp's radar at the time, since the amount of any money transfer is meaningless without a corresponding currency. When it came time to support FX transfers a year later, this seemingly inconsequential decision saved us countless hours and pain since we had already been forced to explicitly consider units in every numerical calculation or comparison from day 1.
  • Avoid conditional logic when possible. Writing arbitrarily long case statements (e.g. cancel_payment invokes different actions depending on the method of the payment) creates a lot of surface area for bugs and leads to exponentially growing test suites. Instead, you can just use dictionaries to map enum values to functions and write a single unit test asserting that all enum values have been mapped (see the sketch after this list). It's also important to collapse non-meaningful distinctions. For example, JPM allows clients to specify the external ID of posted transfers, while Increase generates it server-side. Instead of multiplexing this logic, we simply designed the API base client to always accept an internal ID and return an external ID; whether these end up being the same (as for JPM) or different (Increase) doesn't make any practical difference.
  • 1 is too few, but 2 is already too many. There's no secret formula for when to build platforms and when to build features. But at Ramp, anytime a use case popped up two distinct times, a third almost invariably followed. Building a platform with just 2 users isn't that much more expensive than building the same feature twice; on the other hand, the cost of maintaining the same logic in 2 different places is potentially unbounded, depending on how often it needs to be updated.
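As a toy illustration of the enum-to-function mapping and its accompanying test (this is not the real cancel_payment, whose actual interface takes just a PaymentRequest UUID):

```python
from enum import Enum

class Method(Enum):
    ACH = "ach"
    CHECK = "check"

def cancel_ach_payment(payment_id: str) -> bool:
    return True  # stub: real logic would stop or reverse the pending ACH transfer

def cancel_check_payment(payment_id: str) -> bool:
    return True  # stub: real logic would void the check with the provider

# One mapping instead of an ever-growing case statement.
CANCEL_HANDLERS = {
    Method.ACH: cancel_ach_payment,
    Method.CHECK: cancel_check_payment,
}

def cancel_payment_by_method(payment_id: str, method: Method) -> bool:
    return CANCEL_HANDLERS[method](payment_id)

def test_all_methods_have_cancel_handlers():
    # A single unit test guarantees no one adds a Method without wiring up its handler.
    assert set(CANCEL_HANDLERS) == set(Method)
```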
Platforms All the Way Down

At Ramp, one of our bread-and-butter sayings is "slope over intercept"—instead of focusing on the here-and-now, we look for the people and opportunities with the highest growth potential. Platforms are our secret sauce for high-slope engineering because what they provide is acceleration: we trade off some velocity in the short term in order to continually increase the speed at which we launch new products in the future. This tends to be a difficult mentality for fast-growing startups to adopt because it is so counterintuitive; too often, the first-order benefits of "moving fast and breaking things" are so immediately visible that young product teams push off questions of maintainability and reusability to the distant future. But in the long run, the second-order effects of investing upfront time into thinking through the right abstractions become exponentially more valuable:

The only way to consistently incentivize this kind of decision-making is to build a company culture where every single decision, no matter how small, always gets tied back to the core operating principle of long-term over short-term velocity. That could take the shape of measuring performance against broader company-wide outcomes rather than narrow project- or team-specific OKRs, empowering product-facing teams to build cross-org platforms, or taking bets on young engineers with zero fintech experience to own foundational infrastructure. Ultimately, the only way to prepare yourself for the marathon that is building a generational business is by taking the long view in everything you do.
