The key to some of Ramp’s early growth was our then-opinionated product. The opinions represented best practices for the customers we were selling to, but also tradeoffs that we had made early in the product’s development. For example, all approval policies were determined solely by amount. While it was possible to require one group of approvers for a request above $100 and a different group for a request below, it was impossible to alter approvers based on department, accounting field, subsidiary, or any other field. This was a velocity tradeoff in the early days. By the fall of 2022, however, customers onboarding onto Ramp were breaking the assumptions made across the product early on and requesting significantly more configuration.
We solved this problem as most organizations would, by spinning up feature workstreams dedicated to specific versions of it: approvals, accounting field visibility, submission policies, custom expense notifications, and more. For certain, often larger, customers we went so far as to conditionalize logic on the customer (or group of customers) in the codebase itself. Pavel Asparohouv (see his article on scaling bill pay) and I were tasked with improving approvals functionality for our nascent procurement product before we quickly realized that we were examining a specific case of an underlying problem. Continued iterative solutions would have led to increasing state complexity, operational difficulty, and features built on incorrect, imprecise abstractions.
In the approvals case, feature requests were diverse: some customers wanted to route approvals based on budget, others on department, and some on HRIS fields. We could have embarked on each build individually, or we could have solved the underlying problem and built a more powerful, generic solution. We decided to slow down in order to go fast.
The approvals case was not unique. We asked others about instances where the business logic customization Ramp offered was limited and causing pain. From this, we accumulated a lengthy document of related instances.
The process of building a generic workflows platform resulted in code now used across the product, enabling everything from low-level transactional asynchronous task management up to a complex workflow builder UI. Most importantly, though, it solidified Ramp’s transition from a collection of excellent point-solutions to a unified platform where a customer codifies their expense policy, and Ramp does the rest.
Unlike most features, where the underlying implementation decisions are clear (e.g. a new set of CRUD endpoints), we did not know where to begin. We worked to understand versions of the same problem, catalog partial solutions already present in the codebase, and define requirements.
We viewed this as an orchestration problem: snippets of code (what we later termed “actions”) were dependent on boolean expressions (“conditions”). Sets of these dependencies could be assembled to create chunks of logic (“workflows”). We also knew that what we were developing would be a platform for significant future development. Thus, ease of development on the platform was of significant importance.
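The three abstractions above can be sketched with dataclasses. To be clear, the names, signatures, and the linear `execute` loop below are illustrative assumptions, not Ramp’s actual API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Condition:
    # A boolean expression evaluated against workflow state.
    predicate: Callable[[dict], bool]

@dataclass
class Action:
    # A snippet of code; its return value may feed later steps via state.
    name: str
    run: Callable[[dict], Any]

@dataclass
class Workflow:
    # A chunk of logic: actions gated by conditions.
    steps: list = field(default_factory=list)  # list of (Condition, Action)

    def execute(self, state: dict) -> dict:
        for condition, action in self.steps:
            if condition.predicate(state):
                # Record the action's return value in state for later steps.
                state[action.name] = action.run(state)
        return state

# Hypothetical usage: require a manager only above a $100 threshold.
wf = Workflow(steps=[
    (Condition(lambda s: s["amount"] > 100),
     Action("approver", lambda s: "manager")),
])
result = wf.execute({"amount": 250})
```

The point of the sketch is the dependency shape: an action runs only when its condition holds, and its result becomes state available to subsequent steps.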
When validating the above architecture, we discovered a few vectors of complexity: actions could be synchronous or asynchronous (and our existing task framework, celery, was unreliable, failing at a rate of roughly 7 tasks per 10,000), actions could run in parallel or in series, and the return value from one action needed to be available for use in subsequent actions.
Our first attempt at system design had actions (with a synchronous flag), conditions (that persisted a recursive boolean expression), and dependencies (requiring a condition to be true for an action to run). The ‘state’ of a workflow was a dictionary of key-value pairs and could be updated while running a workflow. A return value could then be written into state when the corresponding function ran.
This would generate C as conditions, D as actions, and establish dependencies between them.
We then took this proposed solution and validated it with all known applications of the platform and sketched outlines of code. We discarded this approach for two reasons. First, we would have had to write translation code to go from a linear representation of logic (how the examples are rendered) to a rule representation of logic; a theoretically possible, but unnecessary process. Second, action-action dependencies were exceedingly difficult to represent; they would have either required new edges (making the translation step even more complex), or adding “action-has-run” conditions to the dependency tree. This solution was no longer elegant.
We returned to the drawing board. The abstractions we had chosen were too specific. We re-envisioned the project as a graph. The building blocks then became vertices and edges, a vertex being either an action or a condition. “Executing” the workflow became traversing the graph, performing (or enqueuing) any action found, and waiting for conditions to become true. A workflow only needed to be “run” on initiation and whenever state was updated.
This solution was significantly simpler than the first, but more difficult to envision because it is significantly more abstract. Persisting a workflow as a tree in this way can be viewed as persisting a (simplified) abstract syntax tree into Postgres.
This solution worked for all the use cases we had examined, so we began to build. Building before validating every known use case would have been a mistake: validation led to a number of smaller tweaks to the design. The engine’s logic was trivially simple: load the graph, topologically sort it, and execute each node on the frontier; if a node is an action, mark it as visited, and if a condition, mark it as visited only if the condition is true. This logical simplicity gave us faith that our selected abstractions were proper.
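That traversal fits in a few lines. The graph encoding and names below are assumptions for illustration, not the real engine: each vertex maps to its children, a false condition blocks its subtree, and re-running after a state update picks up where traversal stopped.

```python
from collections import deque

def run_workflow(graph, kinds, state, visited=None):
    """Execute one 'run' of a workflow graph.

    graph: vertex -> list of child vertices (every vertex is a key).
    kinds: vertex -> ("action", fn) or ("condition", fn); fn takes state.
    Returns the visited set; call again whenever state is updated, since
    a condition that was false may have become true."""
    visited = set(visited or ())
    indegree = {v: 0 for v in graph}
    for children in graph.values():
        for child in children:
            indegree[child] += 1
    frontier = deque(v for v, d in indegree.items() if d == 0)
    while frontier:
        v = frontier.popleft()
        kind, fn = kinds[v]
        if v not in visited:
            if kind == "action":
                fn(state)            # perform (or enqueue) the action
                visited.add(v)
            elif fn(state):          # condition: visit only if true
                visited.add(v)
        if v in visited:             # only visited vertices unlock children
            for child in graph[v]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    frontier.append(child)
    return visited

# A two-vertex workflow: an action gated behind a condition on state.
log = []
graph = {"cond": ["act"], "act": []}
kinds = {"cond": ("condition", lambda s: s.get("approved", False)),
         "act": ("action", lambda s: log.append("notified"))}
state = {}
visited = run_workflow(graph, kinds, state)           # condition false: no-op
state["approved"] = True                              # state update...
visited = run_workflow(graph, kinds, state, visited)  # ...re-run the workflow
```

Note that the first run performs nothing because the condition is false; only after the state update does the second run visit the condition and execute the downstream action.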
As mentioned prior, our asynchronous task framework was brittle and failed at a small but significant rate (7 tasks per 10,000). For workflows, we did not want consumers of the platform to be concerned with retries or dropped tasks. To solve this, we built a Postgres-backed queue (replacing Redis or RabbitMQ) compatible with our celery deployment that retries tasks until success is reported. Because workflows operated at such a high layer of abstraction (an “action” was a Python function), we were able to build a solution that worked for workflows but also solved the problem generally for Ramp. In fact, many entirely internal use cases now use this workflows infrastructure to link Postgres transactions to asynchronous task management. Sidequests like these should be encouraged and pursued: they are difficult to predict, but their upside is immense. Often, correct abstractions solve seemingly unrelated problems: previously difficult tasks become special cases of a more general solution.
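The core idea of a database-backed queue can be pictured as a transactional-outbox pattern: tasks are rows written alongside the business data, then retried until success is recorded. The sketch below uses sqlite3 purely so it runs anywhere; the schema, function names, and retry policy are invented for illustration (a real Postgres version would claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` so multiple workers can poll safely).

```python
import sqlite3

# In-memory stand-in for a Postgres-backed task table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE task_queue (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    attempts INTEGER NOT NULL DEFAULT 0,
    done INTEGER NOT NULL DEFAULT 0)""")

def enqueue(name):
    # In production this INSERT shares a transaction with the business
    # write, so a task exists if and only if the transaction committed.
    conn.execute("INSERT INTO task_queue (name) VALUES (?)", (name,))

def drain(handlers, max_attempts=5):
    # Poll for unfinished tasks; a task stays queued until it succeeds.
    for _ in range(max_attempts):
        rows = conn.execute(
            "SELECT id, name FROM task_queue WHERE done = 0").fetchall()
        if not rows:
            return
        for task_id, name in rows:
            conn.execute(
                "UPDATE task_queue SET attempts = attempts + 1 WHERE id = ?",
                (task_id,))
            try:
                handlers[name]()
            except Exception:
                continue  # leave done = 0; the task will be retried
            conn.execute(
                "UPDATE task_queue SET done = 1 WHERE id = ?", (task_id,))

# Hypothetical flaky handler: fails once, then succeeds on retry.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")

enqueue("flaky")
drain({"flaky": flaky})
```

Because success is recorded in the same database that holds the task, a dropped or failed execution simply leaves the row unfinished for the next poll, which is the guarantee the platform wanted consumers to get for free.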
After writing the core engine, we attempted to implement features utilizing the platform. We ran into a number of problems.
Logic customers might define would not rely solely on if statements, but might include else statements. This posed an interesting challenge: the logic intrinsic to these operations was certainly supported by the core engine, but the ergonomics were poor. It is in theory possible to convert else statements into negated if statements, but doing so would be a ridiculous burden to place on consumers.
To solve this and similar problems, we wrote an SDK. The SDK served as a set of transformations from easier-to-use dataclasses into core workflows.
For example, an elif/else chain over conditions a, b, and c can be rewritten as independent guards, each negating the conditions before it:

```
if !a && b
if !(a || b) && c
if !(a || b || c)
```

If we could determine the general case for a transformation, the SDK could perform the substitution. This allowed us to expose functions, elifs, and more complex control structures to end consumers.
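A minimal sketch of one such transformation, compiling an if/elif/else chain into the negated guards shown above. The function name and the string expression format are hypothetical; the real SDK operates on dataclasses and recursive condition trees, not strings.

```python
def compile_branches(branches, else_action=None):
    """Compile an if/elif/.../else chain into independent guards.

    branches: list of (condition_name, action) pairs, in order.
    Returns (guard_expression, action) pairs where each guard negates
    every condition that came before it, so guards can run in any order."""
    compiled, seen = [], []
    for cond, action in branches:
        if not seen:
            guard = cond                                  # plain if
        elif len(seen) == 1:
            guard = f"!{seen[0]} && {cond}"               # first elif
        else:
            guard = f"!({' || '.join(seen)}) && {cond}"   # later elifs
        compiled.append((guard, action))
        seen.append(cond)
    if else_action is not None:
        # else: fires only when every earlier condition was false.
        compiled.append((f"!({' || '.join(seen)})", else_action))
    return compiled

# Hypothetical usage: three branches plus a default.
compiled = compile_branches(
    [("a", "route_to_x"), ("b", "route_to_y"), ("c", "route_to_z")],
    else_action="route_default",
)
```

The key property is that the output guards are mutually exclusive, so the engine can evaluate them as parallel condition-gated actions with no ordering dependency.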
Writing the SDK brought clarity to the project. A workflow as it exists in Postgres is akin to assembly: code that is easy to execute but difficult to create. A “core” workflow is harder to execute but easier to create; an SDK workflow is harder still to execute but even easier to create, and so on. We described the process of taking a higher-order workflow and converting it to Postgres as “compilation”, equating the process to that of code itself.
The first workflow was ridiculously slow: writing it to the database took over twenty seconds. Writing a workflow involved issuing dozens of queries to about ten database tables and incurred multiple N+1s. To solve this, Pavel rewrote the logic to write a workflow in a single massive query (complete with INSERT CTEs) instead of dozens. Such an optimization would have been premature before testing, and was prudent once the harm was measurable and optimizable. Writing workflows now takes under 200ms.
The trade of write complexity for execution complexity meant that the most important logic was simple and performant. Regardless of whether a workflow run is persisted or not, executing a workflow takes on average around 100ms.
Once we understood that what the SDK exposed was fully functional, we took to sketching workflows as a limited Python lexicon: supporting specific variable types and limited control structures — but equivalently powerful.
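A workflow sketched in such a restricted lexicon might read as ordinary Python narrowed to a few variable types, comparisons, and if/elif/else. The example below is purely illustrative, not a real Ramp policy or the actual sketching format:

```python
# Illustrative approval sketch using only the restricted constructs:
# state reads, comparisons, boolean operators, and if/elif/else.
def approval_workflow(state):
    amount = state["amount"]          # money literal
    department = state["department"]  # object reference, by name here
    if department == "Engineering" and amount > 1000:
        return ["manager", "finance"]
    elif amount > 100:
        return ["manager"]
    else:
        return []  # auto-approved
```

Anything expressible in this limited lexicon maps onto the engine’s conditions and actions, which is what makes the restricted form equivalently powerful.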
This is a visualization of executing a workflow; red blocks are conditions and orange blocks are actions. In this example, block 3 does not become true until the penultimate workflow run.
At this phase, we had built a workflows engine capable of persisting complex logical constructions into Postgres and executing them performantly. We had not built an interface that allowed customers to create these structures. To do so, we partnered with Jared Wasserman (engineer) and Andy Lucas (designer).
Much of the design thinking that underpins the product deserves immense credit but is out of scope for this article. The design principles most relevant here were a prioritization of legibility and constrained optionality. Many early ideas were discarded when tested against this framework.
Building the UI and its accompanying API was a new challenge: how do we take a workflows compilation and execution engine, inject relevant business context, and expose an easy-to-use component for engineers across the company?
Injecting business context into a workflow is non-trivial. Consider the condition entity is Rodda’s taco joint. Such a condition is common: objects in Ramp’s database are almost always more useful to customers than literals. Some literals (e.g. amount) are useful, but trivial to implement. Conditions that contain “objects” can easily be supported by the workflows engine by converting them to entity_uuid HAS_INTERSECTION [“<some_uuid>”]. <some_uuid> is selected by the user in the UI, and the entity_uuid variable refers to a value tracked in state. To support this generically, we built a workflows “configuration” that allows engineers to define objects tracked in the UI, how they are converted to workflows values, and what operators can be used on them.
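Such a configuration could look something like the registry below. Every field name, converter, and operator here is invented for illustration; only the shape (UI object, conversion to a workflow value, allowed operators) comes from the text:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FieldConfig:
    state_key: str              # variable tracked in workflow state
    to_workflow_value: Callable  # UI selection -> workflow literal
    operators: tuple             # operators the builder may expose

# Hypothetical registry: one entry per object or literal the UI supports.
CONFIG = {
    "entity": FieldConfig(
        state_key="entity_uuid",
        to_workflow_value=lambda selected: [e["uuid"] for e in selected],
        operators=("HAS_INTERSECTION",),
    ),
    "amount": FieldConfig(
        state_key="amount",
        to_workflow_value=lambda v: v,
        operators=(">", ">=", "<", "<=", "=="),
    ),
}

def build_condition(field, operator, ui_value):
    """Translate a UI selection into a core workflows condition triple."""
    cfg = CONFIG[field]
    if operator not in cfg.operators:
        raise ValueError(f"{operator} not supported for {field}")
    return (cfg.state_key, operator, cfg.to_workflow_value(ui_value))
```

With this shape, adding a new UI-facing object is a registry entry rather than a code change, which is the generality the configuration was built for.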
Some changes we made while iterating on the UI with customers required altering core workflows constructions. Up to this point, all vertices were constructed in a linear form, with vertices at the same level considered in parallel. This is not how users view workflows.
In this instance, if A and if C are not in parallel. In fact, to convert this to a graph, there is an implicit not A condition. Once that implicit condition is made explicit, the two if statements are in parallel and can be translated into graph form. To handle this case, we needed to change workflows core to allow multiple edges to be constructed to a single vertex. This represented a (predicted but necessary) change to the internal API from a simpler linear input structure to a more fully featured vertex-edge structure.
All of the constrained UI as it exists today is configured generically in workflows. This configuration captures much of the general case of converting UI and customer concepts to workflows concepts. With a central configuration, supporting new use cases requires almost no changes to the code itself; only changes to the config are needed.
We launched workflows to a small number of customers initially and watched intently as they used the product. This process was crucial to arriving at a polished product. All product before a customer uses it is a guess; only once a feature is released can you iterate. Optimizing for speed of iteration, rather than quality of guessing, is a crucial tradeoff.
Today, almost every surface of the product is run by workflows. Workflows powers who is required to approve card requests, reimbursements, bills, and transactions; whether receipts, memos, or accounting fields are required on transactions or reimbursements; whether given accounting options are visible to users; and whether a transaction is flagged as out of policy. Some of our most complex newer functionality, like a conditional form builder, is powered under the hood by the workflows engine. Even more runs internally on the workflows celery infrastructure, opaque to the outside.
We have yet to discover a use case that is not handled as originally envisioned when we built the platform. Not only have the abstractions scaled to support use cases we had not predicted, the code has as well. Because we push much of our complexity to the lowest level possible (Postgres), and because of the simplicity of the engine itself, over 45 million workflows have been run (more than a million per day), and the core engine has not changed since launch. Dozens of workflows run on every transaction swipe and on each reimbursement or bill submission. Every action or flow in the product runs a workflow, and we’re just getting started.
To illustrate the diversity of use cases supported, below are four examples of where this platform is leveraged in the product today.
This workflow governs when receipts, memos, and accounting fields are required for transactions and reimbursements.
This example governs which users (or user groups) must approve a bill before it is paid.
This is a constrained usage of workflows to allow for the configuration of advanced alerts and flags on transactions.
This workflow controls which accounting fields are visible based on user and other accounting attributes.
The workflows project represents the most extreme application yet at Ramp of the philosophy Shreyan describes. The targeted abstraction is broad: any customer-defined logic intended to be handled by Ramp.
We got a lot wrong: we should have built using vertex-edge from the beginning and we could have predicted some issues that slowed us down while building the API earlier.
The up-front investment was weeks of effort. Any individual application of workflows could have been built in some fraction of that time. However, measured over the medium term, workflows has saved the engineering organization collectively months of effort. Incredibly complex, frequently requested features now ship significantly faster by leveraging workflows.
In building workflows, we measured twice and cut once. Our first attempt at abstractions was wrong. We discovered this before writing a line of code, simply by imagining what an implementation would resemble. Since the abstractions were written, nothing has proved them incomplete. The most consequential changes were to internal API interfaces (from a linear to a vertex-edge data structure), and even these were envisioned before code was written. This was accomplished through constant re-evaluation of assumptions and validation of abstractions against real-world examples.
Some of workflows remains to be built. The original specification Pavel and I wrote included concepts supporting looping and generic triggers (a pub/sub model across the entire application). Though deprioritized due to lack of demand, such functionality will be easy for the platform to support when it’s needed. See the end vision and work towards it, even if it will not be completed immediately.
At Ramp, we view engineering not as the writing of code to accomplish some task, though that is one tool we leverage. We view engineering as the creation and implementation of technical abstractions intended to solve customer pain. These abstractions are always lossy: the real world has almost infinite complexity. And like philosophy, the creation of such abstractions is as imprecise as art, reliant on our ability to ingest as much context as possible and make asymmetric tradeoffs where necessary. When executed well, abstractions are a velocity accelerator and allow a complex product to be simple for users.
If any of this interests you, we’d love to have you join us.