Millions of transactions take place every month on Ramp. Our mission is to automate finance tasks for our customers. One example of this is accounting coding.
Accounting coding is the process of categorizing transactions based on each business's accounting system, also known as an Enterprise Resource Planning (ERP) system. With Ramp's accounting integrations, we are able to sync and understand each business's general ledger (GL) categories; however, employees are still responsible for coding their own transactions and expenses. Since most employees are not finance and accounting experts, accounting coding can be confusing and error-prone for them. Employees have to search through hundreds of different accounting codes, with little clarity on what each one represents. As a result, any human error during this process creates additional work for finance teams when closing the books.
Each stakeholder’s context and knowledge in the accounting coding process.
Developing a relational understanding between similar transactions has big implications for improving the Ramp experience, for both employees and their finance teams.
Therefore, we started to look for ways to represent transactional data so that we could group similar transactions together and semantically search across those transactions. The goal was to generate transaction embeddings that could cluster similar transactions by their respective GL category. This way, we could predict a GL category for a new transaction by matching accounting codes attached to similar transactions.
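As a sketch of that matching step (purely illustrative, not our production code), a k-nearest-neighbor vote over cosine similarities is enough to turn "find similar transactions" into a GL category prediction:

```python
import numpy as np

def predict_gl_category(query_emb, coded_embs, coded_labels, k=5):
    """Predict a GL category by majority vote over the k most similar
    already-coded transactions (illustrative sketch)."""
    # Cosine similarity reduces to a dot product after L2-normalization.
    query = query_emb / np.linalg.norm(query_emb)
    corpus = coded_embs / np.linalg.norm(coded_embs, axis=1, keepdims=True)
    sims = corpus @ query
    nearest = np.argsort(-sims)[:k]
    votes = [coded_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```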
We decided to train our model using a triplet loss function, which we'll get to in a bit. We represented transactions as documents with attached labels (chart of account codes). The dataset was then filled with triplet samples, each made up of three documents and their associated GL categories. Using triplet loss, the model learns the differences between similar and dissimilar documents and can cluster transactions by their respective GL categories.
Though there are many off-the-shelf embedding models and various approaches to training custom embedding models, we chose an architecture that could easily scale with the large variety of transactions on Ramp. We also needed searches across these transactions to be fast to serve downstream applications, so we chose a smaller embedding size for quick retrieval.
For a given transaction (e.g. a new laptop), we want to find merchants selling electronics among neighbors in an embedding space.
To train our model we used sentence_transformers, a popular Python library for training embedding models and computing dense vector text representations. Starting with a pre-trained encoder as a base model, we fine-tuned it on Ramp's transactional data using triplet loss. We then used cosine similarity as the distance metric in the learned latent space to quantify the similarity between two transactions.
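A minimal sketch of that setup with the sentence-transformers training API (the base model, example texts, and hyperparameters below are illustrative assumptions, not our production configuration):

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# A compact pre-trained encoder keeps embeddings small and retrieval fast;
# the specific base model here is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each sample is an (anchor, positive, negative) document triplet.
train_examples = [
    InputExample(texts=[
        "Marriott | hotel | conference stay",       # anchor: Travel
        "Delta Air Lines | flight | client visit",  # positive: also Travel
        "Home Depot | lumber | office repair",      # negative: Repairs & Maintenance
    ]),
    # ... many more triplets built from coded transactions
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```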
Triplet loss is a popular loss function for supervised similarity and metric learning. It evaluates samples in triplets, each consisting of an anchor, a positive sample, and a negative sample. The anchor is a document with a label, the positive sample is a document with the same label as the anchor, and the negative sample is a document with a different label. For instance, an anchor document and positive document might both be labeled with GL account “7000 Personal Expenses”, while the negative document is labeled with “6300 Utility Expenses”. Triplet loss teaches an embedding model to recognize the similarity or difference between documents: it learns to “push” negative samples away from the anchor while pulling positive samples toward it. The required separation between the anchor and the positive/negative samples is driven by a hyperparameter called the margin.
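Formally, for an anchor $a$, positive $p$, and negative $n$ under a distance function $d$, the standard triplet loss is:

$$\mathcal{L}(a, p, n) = \max\big(d(a, p) - d(a, n) + \text{margin},\ 0\big)$$

The loss reaches zero only once the negative sits at least a margin farther from the anchor than the positive does.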
Visualization of anchor, positive, and negative samples in triplet loss.
As a comparison, contrastive loss is another popular loss function for modeling similarities between inputs. Contrastive loss samples data in pairs, each labeled positive or negative, and aims to either place both samples close together or separate them by a given margin. The main difference between triplet loss and contrastive loss is how the margin hyperparameter is used. Since contrastive loss does not apply a margin to positive samples, it provides no further signal once a cluster of positive samples has been pulled tightly together.
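For reference, the standard contrastive loss for a pair $(x_1, x_2)$ with label $y$ (1 for a positive pair, 0 for a negative pair) is:

$$\mathcal{L}(x_1, x_2, y) = y \cdot d(x_1, x_2)^2 + (1 - y) \cdot \max\big(\text{margin} - d(x_1, x_2),\ 0\big)^2$$

The margin appears only in the negative term, which is why an already-tight positive cluster generates no further gradient.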
As illustrated in the figure below, triplet loss creates more opportunity for the vector space to cluster into distinct groups, whereas contrastive loss mainly succeeds at separating positive and negative samples from each other. Triplet loss maintains more tolerance for intra-class variance by enforcing margins around both positive and negative samples, while contrastive loss does not. As a result, contrastive loss can hit a local minimum sooner, while triplet loss keeps reorganizing the space.
Embeddings from contrastive loss (left) vs. triplet loss (right). Image credits.
A difficult aspect of implementing triplet loss is mining good triplet samples. Specifically, if we mine triplets at random, the model quickly learns the difference between coarse GL categories like “Travel” and “Repairs & Maintenance”, and the loss function reaches a plateau. But the nuance between GL categories like “Travel: Sales” and “Travel: Engineering” is important to accountants, so we need to be able to tell them apart, and this requires highly informed sampling strategies.
Triplet mining is a hot topic with many methods worth considering. In our case, and for the sake of code simplicity, we experimented with the loss function variants readily available in sentence-transformers. We found the BatchSemiHardTripletLoss function to produce the best results, sampling tricky triplets from a corpus of contextual transactions on the fly. Using large batch sizes with this loss function kept our training data relevant, ensuring we consistently fed informative triplet samples to the model during training.
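A sketch of the batch-wise variant (model name, texts, and label ids are illustrative): each sample is a single labeled document rather than a pre-built triplet, and semi-hard triplets are mined within each batch on the fly.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # base model is an assumption

# One document per sample; the integer label identifies its GL category.
train_examples = [
    InputExample(texts=["Delta Air Lines | flight | customer onsite"], label=17),  # Travel: Sales
    InputExample(texts=["United Airlines | flight | team offsite"], label=18),     # Travel: Engineering
    InputExample(texts=["Grainger | replacement parts"], label=42),                # Repairs & Maintenance
    # ... a large corpus of coded contextual transactions
]

# Large batches give the in-batch miner more candidate triplets per step.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=256)

# Mines "semi-hard" triplets: negatives farther from the anchor than the
# positive, but still within the margin, so they yield a useful gradient.
train_loss = losses.BatchSemiHardTripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```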
Examples of tricky cases in triplet loss training, highlighting hard positives on the left (similar items that appear to be different) and hard negatives on the right (dissimilar items that appear deceptively similar). These examples challenge the model’s ability to learn effective embeddings.
Before training, we had to understand how to represent transactions as strings that could then be converted into embeddings. We enriched transactions into contextual transactions, which combined the transaction features most relevant to accounting categorization.
To keep these embeddings relevant for external transactions and external use cases (e.g. coding external, non-Ramp transactions synced from an ERP), we had to ensure the embeddings were generalizable. This meant enriching transactions with just this contextual information, keeping them relevant and useful for Ramp businesses of any size and in any industry.
Ultimately, the labels for these embeddings were the corresponding GL categories, i.e. the accounting codes assigned to each transaction. Using these features and labels, we were able to represent transactions as stringified prompts for a custom embedding model, and to match query transactions against similar, already-coded transactions for reference.
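As an illustrative sketch of that flattening step (the field names and separator are hypothetical; the post doesn't enumerate the exact features):

```python
def to_contextual_string(txn: dict) -> str:
    """Flatten an enriched transaction into the string handed to the
    embedding model. Field names here are hypothetical."""
    parts = [
        txn.get("merchant_name"),
        txn.get("merchant_category"),
        txn.get("memo"),
        txn.get("amount"),
    ]
    return " | ".join(str(p) for p in parts if p)

# e.g. "Apple Store | Electronics | new laptop for new hire | 1999.00"
query_text = to_contextual_string({
    "merchant_name": "Apple Store",
    "merchant_category": "Electronics",
    "memo": "new laptop for new hire",
    "amount": "1999.00",
})
```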
We surface the predictions using the trained embeddings across multiple verticals and platforms, and use them both directly and as an input for downstream tasks. Many of these downstream tasks include match-based predictions or suggestion-based predictions to help streamline user interactions.
For example, whenever a user on Ramp needs to provide a code for a transaction, we now surface our GL coding suggestions in a dropdown.
If the prediction confidence for a GL category is high, we can make a more overt recommendation to the user, like using that as the default value for a given transaction.
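A sketch of that thresholding logic (the cutoff value and names are illustrative assumptions):

```python
DEFAULT_FILL_THRESHOLD = 0.9  # illustrative; tuned empirically in practice

def surface_suggestion(category: str, confidence: float) -> dict:
    """Decide how prominently to surface a predicted GL category."""
    if confidence >= DEFAULT_FILL_THRESHOLD:
        # High confidence: pre-fill the transaction's coding field.
        return {"default_value": category, "show_in_dropdown": True}
    # Otherwise, only rank the suggestion in the dropdown.
    return {"default_value": None, "show_in_dropdown": True}
```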
Outside of accounting, we also use transaction embeddings for analytics, giving our Growth and Data analysts the ability to quantify similarities among businesses, based on their spending and coding patterns. We’re also able to use similarities among transactions to synthesize relevant transactions as context for other LLM-enabled features, such as our recently-released suggested memos feature.
Ramp is in the business of saving customers both time and money. One of the ways to achieve that is by leveraging the right tools with the right data, at the right time. Our continued investment in engineering efforts to develop our own ML infrastructure and purpose-fit models allows us to improve automation at scale for our customers, while keeping their data private and their predictions personalized.
It’s day 1975 at Ramp and we still have a long road ahead of us. If you're interested in the kinds of opportunities and challenges we discussed in this article, we’d love to hear from you, and maybe have you join us on this journey.