For data professionals and decision-makers alike, classifying customers is important and challenging. At Ramp, industry classification used to rely on homegrown taxonomies patched together with translation layers, resulting in multiple sources of truth that were not auditable. Below, we show how migrating to a standardized system, powered by an in-house Retrieval-Augmented Generation (RAG) model, simplified workflows, improved data quality, and unlocked performance gains.
Ramp's mission is to save our customers time and money. A precise understanding of a customer's industry is vital to serving them well through many cross-cutting initiatives: from compliance, to portfolio monitoring, to sales targeting and product analytics. Industry classification, however, is challenging for a number of reasons. Industry boundaries are fuzzy, and the lack of a ground truth makes it hard to evaluate predictions. What's more, the data used to generate predictions can be sparse and have a non-uniform distribution. In this article, we'll discuss how we built an in-house industry classification model using Retrieval-Augmented Generation (RAG) and improved our understanding of our customers.
Having a consistent and accurate industry classification system is crucial. For example, if the Risk team and Sales team use separate taxonomies, they cannot have quick feedback loops on targeting and segmentation. Likewise, any communication with an external partner will require a translation to their preferred industry mappings.
There are two standard taxonomies for industry classification in the US:

- SIC (Standard Industrial Classification): four-digit codes developed by the US government in 1937.
- NAICS (North American Industry Classification System): six-digit codes that replaced SIC codes in the 1990s.

These codes are hierarchical — you can look at subsets of leading digits to get a broader classification. Both systems attempt to group industries by similar production processes. While a business can have more than one applicable code, the line of business that generates the most income is generally chosen as the primary code.
In the past, Ramp mainly used a third, homegrown, non-standard industry classification system. Businesses were classified using a stitched-together web of third-party data, Sales-entered data, and customer self-reporting. The Homegrown system had four common issues, the effects of which are illustrated by the example below.
Consider an actual Ramp customer: WizeHire, a hiring platform that helps small businesses grow. In the Homegrown system, WizeHire was classified as "Professional Services". This category is overly broad and can capture a wide spectrum of businesses like law firms, dating apps, and consulting firms. For Ramp's Sales and Marketing teams, this made it hard to understand what businesses like WizeHire need and how best to serve them. For Ramp's Risk team, this made it difficult to profile credit risk and satisfy compliance requirements in this segment.
Additionally, some teams would convert these Homegrown industry labels to SIC codes, while others would use NAICS codes directly. To go from the Homegrown system to NAICS or SIC codes, we needed to apply many-to-many mappings from the 100+ internal labels to thousands of codes. It was not out of the ordinary for one internal label to map to 50 potential NAICS codes.
We tamed this complexity by migrating all Ramp industry classification to NAICS codes. This allows internal teams to have a consistent, expressive taxonomy while also enabling easier communication with external partners who were already using NAICS codes.
Revisiting the previous example, within NAICS, WizeHire is classified as "561311 - Employment Placement Agencies". Furthermore, because NAICS codes are hierarchical, we can extract more general categories from this code. The full hierarchy is displayed below:
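- 56: Administrative and Support and Waste Management and Remediation Services
- 561: Administrative and Support Services
- 5613: Employment Services
- 56131: Employment Placement Agencies and Executive Search Services
- 561311: Employment Placement Agencies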
The combination of precise and pertinent labels with the ability to roll up into more general categories gives teams at Ramp the flexibility to decide which level of granularity is best for each use case. To enable the migration to NAICS, we needed a classification model that could predict six-digit NAICS codes for all Ramp businesses. Third-party solutions are a quick way to get good, general performance; however, Ramp has unique needs with complex data, and we decided to build an in-house model.
We chose a RAG system as our in-house industry classification model. RAG systems have three main stages:

- Embed: encode the knowledge base (here, NAICS codes and their descriptions) and each query (a business) into a shared vector space.
- Retrieve: use similarity scores between the query and knowledge base embeddings to surface the most relevant entries as recommendations.
- Generate: prompt an LLM to select a final answer from those recommendations.
One of the main benefits of using a RAG system is the ability to constrain the output of an LLM to the domain of a knowledge base (in our case, valid NAICS codes). Instead of an open-ended free response, we are giving the LLM a multiple choice question.
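As a hedged illustration of that framing (not our exact prompt), turning retrieved codes into a multiple-choice question might look like this:

```python
# A minimal sketch of framing classification as multiple choice. The prompt
# wording and the (code, title) recommendation format are illustrative
# assumptions, not our production prompt.
def build_prompt(business_description: str,
                 recommendations: list[tuple[str, str]]) -> str:
    choices = "\n".join(f"{code}: {title}" for code, title in recommendations)
    return (
        "Select the single best NAICS code for the business below.\n"
        f"Business: {business_description}\n"
        "Answer with exactly one code from these choices:\n"
        f"{choices}"
    )
```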
When developing any machine learning model, it is important to identify a set of relevant metrics to evaluate performance. Because we are working with a multi-stage system, we chose to break the problem down into two components and identified metrics for each stage. We took care to ensure that the metrics for each stage didn't interfere with each other and were aligned with the overall goal of the system.
The first stage was to generate curated recommendations from the knowledge base.
We chose accuracy at k (acc@k) as the primary metric for this stage: how often is the correct NAICS code in the top k recommendations? This is a sensible metric because it represents a ceiling on the performance of the full system. If the correct code is not in the top k recommendations, the LLM will not be able to select it.
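As a concrete sketch, acc@k over a labeled evaluation set could be computed as follows; the data layout here is an assumption for illustration, not our internal format:

```python
# A minimal sketch of acc@k. Each example pairs the true NAICS code with a
# similarity-ranked list of recommended codes (best first); the data layout
# and helper name are illustrative assumptions.
def acc_at_k(examples: list[tuple[str, list[str]]], k: int) -> float:
    hits = sum(true_code in recs[:k] for true_code, recs in examples)
    return hits / len(examples)
```

Sweeping k over a range of values produces the acc@k curves discussed below.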
The second stage was to select a final prediction from the recommendations. We chose to define a custom fuzzy-accuracy metric. Because NAICS codes are hierarchical, we want to make sure that predictions that are correct for part of the hierarchy are scored better than predictions that are completely wrong. For example, if the correct code is 123456, a prediction of 123499 should be scored better than 999999 because the first four digits are correct.
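The exact scoring isn't spelled out above, but a minimal sketch of one natural fuzzy-accuracy definition, the fraction of leading digits that match, could look like this:

```python
# One possible fuzzy-accuracy scoring: the fraction of leading digits that
# match. This specific formula is an illustrative assumption, not necessarily
# the exact weighting used in production.
def fuzzy_accuracy(predicted: str, actual: str) -> float:
    matched = 0
    for p, a in zip(predicted, actual):
        if p != a:
            break
        matched += 1
    return matched / len(actual)

# fuzzy_accuracy("123499", "123456") -> 0.667 (first four digits correct)
# fuzzy_accuracy("999999", "123456") -> 0.0   (completely wrong)
```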
Generating recommendations involves identifying the most relevant items from the knowledge base (NAICS codes) given a query (business). This part of the system has a variety of hyperparameters to choose, including which business attributes go into the query, which embedding model to use, and how many recommendations to generate.
For each parameter there are tradeoffs to consider. For example, certain business attributes may be more informative than others but may have higher missing rates. Additionally, different embedding models have different resource requirements that don't necessarily correlate with performance on the specific data we have.
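To make the retrieval stage concrete, here's a minimal sketch assuming precomputed knowledge base embeddings and a query embedding from whichever model is chosen; all names and shapes are illustrative:

```python
import numpy as np

# A minimal sketch of retrieval. `query_vec` is the embedding of the business,
# and `code_vecs`/`codes` are the precomputed knowledge base embeddings and
# their NAICS codes. Names and shapes are illustrative assumptions.
def top_k_codes(query_vec: np.ndarray, code_vecs: np.ndarray,
                codes: list[str], k: int) -> list[str]:
    # Cosine similarity between the query and every NAICS code embedding
    sims = code_vecs @ query_vec / (
        np.linalg.norm(code_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return [codes[i] for i in top]
```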
In the end, we profiled the performance of different configurations and created acc@k curves. Note that we can't determine the optimal number of recommendations to generate without considering the downstream LLM performance. If we naively optimized for acc@k, we would end up with a system that just recommends the whole knowledge base (the correct label is guaranteed to be present if we recommend all possible labels).
We found that optimizations in this stage led to significant performance boosts of up to 60% in acc@k. We also identified economical embedding models that could be used in production without sacrificing performance compared to the largest models.
We profiled performance with acc@k curves. We looked for groupings with the best performance (purple and pink curves) and selected those with the least resource requirements and best data coverage.
The second stage of the RAG system involves selecting a final prediction from the recommendations using an LLM. This part also has a variety of hyperparameters to choose, including how many recommendations to include in the prompt and how long or descriptive the information provided for each business and NAICS code should be.
Just like the first stage, there are a number of tradeoffs to consider. For example, including more recommendations in the prompt gives the LLM a better chance at finding the correct code, but it also increases the context size and can lead to degraded performance if the LLM is unable to focus on the most relevant recommendations. Likewise, longer or more descriptive information can help the LLM better understand a business or a NAICS code, but will also greatly increase the context size.
In the end, we chose a two-prompt system to get the best of both worlds. In the first prompt, we include many recommendations but leave out the most specific descriptions, asking the LLM to return a small list of the most relevant codes. In the second prompt, we provide more context for each shortlisted code and ask the LLM to choose the best one. For each parameter we searched, we found a 5%-15% improvement in fuzzy accuracy after optimization.
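A rough sketch of this two-prompt flow is below; the `llm` and `describe` callables and the shortlist size of five are hypothetical stand-ins, not our production services or prompts:

```python
from typing import Callable

# A sketch of the two-prompt flow. `llm` is assumed to return the parsed list
# of NAICS codes from the model's response, and `describe` renders a code with
# either a short or a detailed description. Both are hypothetical stand-ins.
def classify(business: str, recommendations: list[str],
             llm: Callable[[str], list[str]],
             describe: Callable[[str, bool], str]) -> str:
    # Prompt 1: many candidates with short descriptions -> small shortlist
    prompt_1 = (
        f"Business: {business}\nCandidates:\n"
        + "\n".join(describe(code, False) for code in recommendations)
        + "\nReturn the 5 most relevant NAICS codes."
    )
    shortlist = llm(prompt_1)
    # Prompt 2: few candidates with detailed descriptions -> final answer
    prompt_2 = (
        f"Business: {business}\nCandidates:\n"
        + "\n".join(describe(code, True) for code in shortlist)
        + "\nReturn the single best NAICS code."
    )
    return llm(prompt_2)[0]
```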
Piecing our findings together, we designed an online RAG system as shown in the diagram below. We have internal services that handle embeddings for new businesses and LLM prompt evaluations. Knowledge base embeddings are pre-computed and stored in ClickHouse for fast retrieval of recommendations using similarity scores. We log intermediate results using Kafka so that we can diagnose pathological cases and iterate on prompts.
Although RAG helps constrain LLM outputs, we have also added guardrails. While hallucinations are generally negative, we've found cases where the LLM predicts the correct code despite it not being present in the recommendations. To filter out just the "bad" hallucinations, we validate that the NAICS codes output by each LLM prompt are valid.
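A minimal version of that guardrail might look like the following, where `valid_naics_codes` is assumed to hold every code in the knowledge base:

```python
# A minimal guardrail sketch: accept an LLM-predicted code only if it is a
# real NAICS code. This filters malformed hallucinations while still allowing
# correct codes that weren't in the recommendations. Names are illustrative.
def guard(prediction: str, valid_naics_codes: set[str]) -> str | None:
    return prediction if prediction in valid_naics_codes else None
```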
Our RAG system design (dashed black box). Embeddings and LLM prompts are handled by internal services (green). Similarity scores are calculated using ClickHouse (orange). Intermediate results are logged using Kafka (orange).
Since deploying the RAG system, we've already realized a number of benefits.
Besides increased accuracy, we have control over the algorithm. We can (and do) make tweaks to any of the dozens of hyperparameters we searched to address concerns as they come up. Because we log all intermediate steps, we can pinpoint where issues are cropping up (retrieval vs re-ranking). From performance degradation to latency requirements to cost sensitivity, we can adjust the model on the fly. In contrast, with a third-party solution, we would be stuck with their roadmap, pricing, and iteration speed. Furthermore, we can audit and interpret the model's decisions. We ask the LLM for justifications to clarify the reasoning behind each prediction.
To demonstrate the impact of the new model, we've included examples below of how a business was classified in the Homegrown system compared to how they are classified in the NAICS-based RAG system. In the first table, we see three cases where the businesses were all very similar but were classified into different categories in the Homegrown system. Using our RAG model, however, these businesses are all categorized under the same NAICS code. In the second table, we see three instances where the Homegrown system categorized all businesses in the same, overly broad category. In contrast, the RAG model was able to correctly assign these businesses to more descriptive NAICS codes.
Examples of how businesses were categorized in the old, Homegrown system compared to the NAICS-based RAG system. In the first table we see cases where the Homegrown system classified similar businesses into separate categories, while the NAICS system correctly classifies them together. In the second table, we see cases where an overly broad category in the Homegrown system is split into more apt and descriptive categories with NAICS.
This model has greatly improved our data quality and solved many pain points across Ramp. Our teams are thrilled to start using it, as evidenced by comments we've received from affected stakeholders:
"This is a big deal — it will significantly upgrade our data quality and understanding of our customers."
"I've waited years for this."
"The existing classification wasn't nuanced enough to satisfy industry exclusion requirements. This is perfect."
"As we diversify our customer base, this will be an incredible driver of our business success."
To migrate from a tangle of taxonomies to a standardized industry classification system, we built an in-house RAG model. Ultimately, our model led to increased accuracy in industry classification with full control over updates, tuning, and costs. This model now helps Ramp’s internal teams work more cohesively and enables more precise communication with external partners. Our teams are excited about how this model has brought an increase in clarity and understanding of our customers, and how it's helping us better serve them.