How To Build Agents Users Can Trust

We recently prototyped, built, and shipped a suite of LLM-backed agents to further automate expense management on Ramp. While LLMs have improved significantly since ChatGPT first came out, it still takes some finesse to tame them. This is especially true for a finance product where it’s very easy to lose trust through low-quality or unexpected outputs.

In this article, we discuss how we optimized our suite of agents to deliver real business value, clearly explain their reasoning, recognize their own limitations, defer to human judgment, and ultimately build trust.

Choose problems where LLMs can shine

LLMs are best applied when your problem fits the following:

  1. Ambiguous - Simple heuristics can’t be applied
  2. High volume - Asking humans to do it instead would take an inordinate amount of time
  3. Asymmetric upside - The value of automation far exceeds the cost of occasional errors

Fortunately for us, finance is full of problems that fit this mold. They tend to manifest as tedious time sinks for users with a low chance of catastrophic failure, making them great targets for LLMs with the proper guardrails.

To illustrate our approach, we’ll focus on approving expenses. This has traditionally been a manager’s responsibility, but we’re now applying a “policy agent” that finance teams can trust to match or exceed human judgment. Since enabling the policy agent at Ramp, more than 65% of approvals have been fully handled by the agent.

Show your work

How you communicate an LLM’s reasoning can matter even more than the accuracy of the final decision. For example, we've found that deferring to users when lacking certainty builds both user trust and system reliability.

Explain outcomes with reasoning

Every decision needs a “why.” It’s not enough to tell users “You should approve this expense”; explaining the reasoning helps them verify the outcome, spot errors quickly, and understand how your LLM thinks.

That last point, understanding how your LLM thinks, matters to both developers and users:

  • Developers can use this thinking as a form of model observability to inform prompt and context improvements over time.
  • Users can reuse this reasoning as direction for what needs additional attention.

Cite your sources

LLM reasoning on its own can be flawed or filled with hallucinations. All facts and figures should be grounded in easily verifiable context coming from your product or the user.

For example, the policy agent links directly to sections of the user’s expense policy that its reasoning references.

Putting this together, here’s an example of how we present the reasoning and citations of the policy agent to users:

This first bullet point is output from an LLM explaining why the expense was approved. The info icon links directly to the section in the user’s expense policy describing the requirements.
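
For illustration only (this schema, the section IDs, and the URL are made up, not our production code), here’s a rough sketch of how a decision, its reasoning, and its citations might be modeled so that every claim resolves to a verifiable policy section:

```python
from dataclasses import dataclass

# Hypothetical in-product policy store: section id -> (title, url)
POLICY_SECTIONS = {
    "meals-3.2": ("Meals & Entertainment", "https://example.com/policy#meals-3.2"),
}

@dataclass
class AgentDecision:
    recommendation: str        # e.g. "approve"
    reasoning: str             # the "why" shown alongside the outcome
    cited_sections: list[str]  # ids of the policy sections the reasoning relies on

def render_for_user(decision: AgentDecision) -> str:
    """Render the outcome with its reasoning and verifiable citations."""
    lines = [f"Recommendation: {decision.recommendation}", f"Why: {decision.reasoning}"]
    for section_id in decision.cited_sections:
        title, url = POLICY_SECTIONS[section_id]
        lines.append(f"Source: {title} ({url})")
    return "\n".join(lines)

print(render_for_user(AgentDecision(
    recommendation="approve",
    reasoning="The $42 team lunch is under the $75 per-person meal limit.",
    cited_sections=["meals-3.2"],
)))
```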

Build escape hatches

Not every question has an answer. Forcing your LLM to come up with one and act on it is a recipe for hallucinations.

Early on, we gave our LLMs the ability to say “I’m not sure” and explain why. When the agent isn’t sure, we fall back to the pre-agent escalation process users are accustomed to. We can then use that explanation to:

  1. Tell the user what made the AI unsure, to help them focus on what may need additional review
  2. Track how the reasons for uncertainty change over time as you modify the system

It’s important that “unsure” doesn’t look like an error state in your product. Not only is it a valid outcome; sometimes it’s the ideal result given the information at hand. We’ve found that an LLM willing to admit its limits is a strong force for building user trust.

This receipt for golfing was accidentally submitted under the wrong transaction, and the agent was able to point that out thanks to the built-in escape hatch.
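
Here’s a rough sketch of how an escape hatch like this can be wired up. The helper functions are hypothetical stand-ins for whatever escalation and logging your product already has; the point is that “unsure” routes back to the pre-agent flow and the stated reason gets recorded:

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    decision: str     # "approve", "reject", or "unsure"
    explanation: str  # the agent's stated reasoning

def send_to_manager_review(expense_id: str, context: str) -> None:
    # Hypothetical stand-in for the pre-agent escalation flow users already know.
    print(f"{expense_id}: escalated to a manager ({context})")

def log_unsure_reason(expense_id: str, reason: str) -> None:
    # Hypothetical stand-in for analytics; lets us track how "unsure" causes shift over time.
    print(f"{expense_id}: unsure because {reason}")

def route_expense(expense_id: str, result: AgentResult) -> None:
    """Treat "unsure" as a first-class outcome, not an error state."""
    if result.decision == "unsure":
        send_to_manager_review(expense_id, context=result.explanation)
        log_unsure_reason(expense_id, result.explanation)
    else:
        print(f"{expense_id}: agent decided to {result.decision}")

route_expense("exp_123", AgentResult("unsure", "the receipt appears to belong to a different transaction"))
```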

Confidence scores are hallucinations

Asking LLMs to output confidence scores is like asking a human to give a number off the top of their head. It’s usually plausible, but it isn’t numerically meaningful and it certainly isn’t reproducible. This can be especially misleading because:

  1. Confidence scores are a genuine strength of traditional statistical models, so users assume an LLM’s score carries the same calibration.
  2. LLMs will eagerly give you a confidence of 70-80% with no indication that they almost always give that same score.

Instead, we use predefined categories, which are not only easier to explain but also more accurate:

  • Approve - Clear match between expense and policy
  • Reject - Clear conflict between expense and policy
  • Needs review - Edge case when the model is unsure

This forces the model to bucket uncertainty into actionable states: users don’t need a confidence score; they need to know what action to take.
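
As a sketch (assuming a structured-output setup with pydantic for validation; the schema itself is illustrative), constraining the model to an enum makes it impossible to emit anything outside the three buckets:

```python
from enum import Enum
from pydantic import BaseModel

class Decision(str, Enum):
    APPROVE = "approve"            # clear match between expense and policy
    REJECT = "reject"              # clear conflict between expense and policy
    NEEDS_REVIEW = "needs_review"  # edge case where the model is unsure

class PolicyVerdict(BaseModel):
    decision: Decision  # the model must pick one bucket; no free-floating percentages
    reasoning: str      # the "why" shown to the user

# Validating the model's JSON output against this schema rejects anything outside
# the three buckets, so uncertainty always maps to an actionable state.
verdict = PolicyVerdict.model_validate_json(
    '{"decision": "needs_review", "reasoning": "No receipt attached."}'
)
print(verdict.decision.value)  # "needs_review"
```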

Context should be collaborative

Most consumer LLM systems treat context as a fixed input that’s defined once and rarely revisited. However, we've found that the most effective approach is to make context collaborative, allowing users to actively shape and refine it alongside the AI.

After many iterations, we identified three practices that improve context over time:

  1. Bring the surfaces where users define their context into your platform
  2. Use that context for your decisions
  3. Allow users to modify that context in the platform when they disagree with the LLM output or reasoning

Applying this to our product, we:

  1. Brought the user’s expense policy (typically a PDF) onto Ramp so it’s always up to date
  2. Use that context to drive the agent’s decisioning
  3. Built a full editor for users to update their policy on Ramp when they disagree with the LLM output

This feedback loop not only reduces the amount of work humans need to do over time, but also improves policy accuracy. After all, if an LLM is getting tripped up on an ambiguous part of the policy, it’s likely a human would be just as confused.

Our in-house expense policy editor. Each section is used directly for policy decisions by the LLM, depending on how relevant it is to the expense at hand.
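
A minimal sketch of that loop, with made-up policy sections and a deliberately naive relevance check standing in for real ranking: because the prompt is assembled from the live, user-editable policy at decision time, edits take effect the next time the agent runs.

```python
from dataclasses import dataclass

@dataclass
class PolicySection:
    section_id: str
    title: str
    text: str

def load_policy_sections() -> list[PolicySection]:
    # Hypothetical: sections come from the in-product editor rather than a static PDF,
    # so user edits are reflected the next time the agent runs.
    return [
        PolicySection("travel-2.1", "Airfare", "Economy class is required for flights under 6 hours."),
        PolicySection("meals-3.2", "Meals", "Meals are reimbursable up to $75 per person."),
    ]

def relevant_sections(expense_category: str, sections: list[PolicySection]) -> list[PolicySection]:
    # Deliberately naive keyword match standing in for real relevance ranking.
    return [s for s in sections if expense_category.lower() in s.title.lower()]

def build_prompt(expense_description: str, expense_category: str) -> str:
    sections = relevant_sections(expense_category, load_policy_sections())
    policy_context = "\n".join(f"[{s.section_id}] {s.title}: {s.text}" for s in sections)
    return (
        "Decide whether this expense complies with the policy sections below.\n"
        f"Expense: {expense_description}\n"
        f"Policy:\n{policy_context}\n"
        "Cite section ids for every claim, or say you are unsure."
    )

print(build_prompt("Team lunch, $42 per person", "Meals"))
```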

Give users an autonomy slider

Every customer will have a different level of comfort letting AI act on their behalf. When giving your systems agency, it’s important that you’re not removing agency from your users.

Our solution was to reuse the workflow builder that’s already used across the product to customize and define processes. That same workflow builder defines exactly where and when agents can act.

The autonomy slider works both ways. Users can greenlight agents AND set hard stops. Not everything needs to be an LLM decision. We layer deterministic rules on top: dollar limits, vendor blocklists, category restrictions. These guardrails aren’t just safety nets; they’re how users tell the agent “I’ll never be comfortable with you touching this.”

The more conservative end of the autonomy slider, requiring human review on every expense above $50.

The more trusting end of the autonomy slider, only requiring human review when the agent thinks the expense is suspect. This is the policy we often reach for at Ramp.
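
Conceptually, those hard stops are deterministic checks evaluated before any agent decision is applied. A sketch with made-up thresholds:

```python
from dataclasses import dataclass

@dataclass
class Expense:
    amount: float
    vendor: str
    category: str

# Hypothetical customer-defined guardrails pulled from the workflow builder.
REVIEW_ABOVE_AMOUNT = 50.0
VENDOR_BLOCKLIST = {"Example Vendor Co"}
RESTRICTED_CATEGORIES = {"Gifts"}

def requires_human_review(expense: Expense) -> bool:
    """Deterministic checks evaluated before any agent decision is applied.
    If any rule trips, the expense goes to a human regardless of what the agent says."""
    if expense.amount > REVIEW_ABOVE_AMOUNT:
        return True
    if expense.vendor in VENDOR_BLOCKLIST:
        return True
    if expense.category in RESTRICTED_CATEGORIES:
        return True
    return False

print(requires_human_review(Expense(amount=120.0, vendor="Acme Travel", category="Travel")))  # True: over $50
```

The agent’s decision is only applied when these checks pass; otherwise its output is, at most, context for the human reviewer.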

Start with suggestions, graduate to actions

The earliest AI IDEs started with inline edit suggestions before moving to full agentic behavior. Similarly, we started with suggesting actions to humans before moving to taking those actions autonomously.

This built confidence: users learned the agent’s patterns, caught its mistakes, and, most importantly, saw it getting things right. Only after proving itself as a copilot did we let customers promote the agent to take real action.

The progression is deliberate: suggestions → acting on subsets → full autonomy. Each step validates the previous one, creating a natural trust curve that matches each customer's level of comfort.
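
One way to model that progression (purely illustrative, not our workflow builder’s implementation) is an explicit autonomy level that gates whether the agent’s output is applied or merely surfaced:

```python
from enum import Enum

class AutonomyLevel(Enum):
    SUGGEST_ONLY = 1    # agent drafts a recommendation; a human always confirms
    ACT_ON_SUBSET = 2   # agent acts on low-risk expenses, suggests on the rest
    FULL_AUTONOMY = 3   # agent acts whenever the guardrails allow it

def apply_decision(level: AutonomyLevel, is_low_risk: bool, agent_approves: bool) -> str:
    if level is AutonomyLevel.SUGGEST_ONLY:
        return "surface suggestion to reviewer"
    if level is AutonomyLevel.ACT_ON_SUBSET and not is_low_risk:
        return "surface suggestion to reviewer"
    return "auto-approve" if agent_approves else "escalate to reviewer"

print(apply_decision(AutonomyLevel.ACT_ON_SUBSET, is_low_risk=True, agent_approves=True))  # auto-approve
```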

Evals are the new unit tests

Just as unit tests are needed to responsibly evolve a traditional application over time, evals are needed to responsibly evolve your LLM systems over time. We’re continuously improving our process for evals, but some of the lessons we’ve learned so far are:

  • Crawl, walk, run - Scale your evals as your product matures. Start with something quick and easy and expand to provide deeper coverage and more precise insights.
  • Prioritize edge cases - Focus on ambiguous scenarios where the LLM is prone to errors or inconsistencies.
  • Turn failures into test cases - Every user-flagged error should be a candidate for your evals. This is especially useful for building your datasets early on.
  • Trust but verify - Users can be wrong, lazy, or both. Consider whether an extra review and labeling step is needed for your feedback loop.

Expanding on that last point, we’ve found that finance teams are nicer than you might expect: they’re frequently lenient and will approve reasonable but not strictly in-policy expenses. If we always took the user’s action as the ground truth, the LLM would have learned to be too lenient.

To address this, we created several golden datasets, each carefully reviewed by our team to define the correct decision based solely on information available within our system. This enables us to make more objective judgments, free from the affinity bias that can influence finance teams' decisions.
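
A stripped-down sketch of what such an eval can look like, with a placeholder agent call and made-up golden cases:

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    expense_id: str
    expected_decision: str  # labeled by our team, not copied from the user's action

def run_policy_agent(expense_id: str) -> str:
    # Placeholder for the real agent call; returns "approve", "reject", or "needs_review".
    return "approve"

def run_eval(cases: list[GoldenCase]) -> None:
    failures = []
    for case in cases:
        got = run_policy_agent(case.expense_id)
        if got != case.expected_decision:
            failures.append((case.expense_id, case.expected_decision, got))
    accuracy = 1 - len(failures) / len(cases)
    print(f"accuracy: {accuracy:.1%} over {len(cases)} golden cases")
    for expense_id, expected, got in failures:
        print(f"  mismatch on {expense_id}: expected {expected}, got {got}")

# Every user-flagged error is a candidate golden case after a review and labeling pass.
run_eval([GoldenCase("exp_123", "approve"), GoldenCase("exp_456", "needs_review")])
```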

Building agents

We’re committed to building LLM-powered agents that create value while staying trustworthy. The principles that we follow are anchored in:

  • Transparency through reasoning - Clearly communicate the rationale behind each LLM-generated decision, enabling users to understand and validate outcomes.
  • User-driven control - Give users tools to define what actions agents can or cannot autonomously handle.
  • Collaborative feedback loops - Agents should guide users towards better context and decisioning over time.
  • Continuous evaluation and improvement - Use evals to identify and address shortcomings while preventing regressions.

We're incredibly excited about the future of agents at Ramp. With the right context and collaboration from our users, we believe these agents can deliver a huge amount of economically valuable work and transform how finance teams operate.
