What It Takes to Build and Ship Enterprise-Grade AI Agents
A conversation with Decagon’s Director of Product, Bihan Jiang
Note: Decagon is a horizontal AI company, but the takeaways from this conversation are broadly applicable, regardless of whether you’re a vertical or horizontal AI company.
Enterprise adoption of AI is accelerating in a way we’ve never seen before – faster than cloud, mobile, or even SaaS. But that speed brings a new set of requirements: guardrails, evals, safety checks, latency constraints, and deeper stakeholder alignment. It also demands a level of efficient customer empowerment that goes beyond the forward-deployed engineer (FDE) model pioneered by Palantir.
This week, I sat down with Bihan Jiang, the Director of Product at Decagon, which has quickly become the go-to business platform that many of the world’s most respected enterprise companies rely on to easily build, manage, and scale AI agents that handle millions of customer interactions daily. In less than two years, Decagon is already serving a broad range of consumer industries at global scale, from banking and travel with customers like Chime and Hertz to applications and hardware with customers like Duolingo and Oura Ring. Decagon customers report increased CSAT, retention, and revenue thanks to the power of AI to create completely personalized concierge customer experiences, 24/7.
With early success come the requisite challenges and opportunities that fast growth and high demand bring, so we sat down to talk about what “enterprise-grade” actually means, how Bihan and the team think about guardrails and evals, how they empower customers to quickly deploy and measure AI return on investment, and why she’s been thinking a lot about Black Swan events.
As Bihan tells it, “Ensuring AI agents perform consistently, safely, and intelligently in the real world is one of the most exciting problems to work on.”
This post distills key parts of our robust conversation that apply to any founder building with AI, no matter the vertical or function.
What Enterprise-Grade Actually Means
Decagon defines enterprise-grade as the ability to build, manage, and deploy exceptional AI agents at global scale while maintaining world-class security and trust.
That can mean everything from just-in-time API tokens to zero tolerance for hallucinations. “Even a 1% hallucination rate is unacceptable,” Bihan noted. “If you have millions of conversations a day, that’s a huge chunk of your user base affected.”
So to ensure accuracy, Decagon built a layered guardrail stack. At the base are models fine-tuned specifically against prompt injection, hallucination, and factual errors. Supervisor models sit in the middle: they review outputs from other models along the generation pipeline, acting as a quality-control layer before anything reaches the user. A dedicated action-hallucination model sits at the top. It “triple checks” the final response for claims about actions: refunds, replacements, escalations, etc. Bihan says the goal is to prevent the worst-case scenario: “it can really hurt brand perception if the agent claims it shipped you a product and no product ever comes.”
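To make the layering concrete, here is a minimal sketch of what a veto-style guardrail pipeline like this can look like; the class names, checks, and escalation step are illustrative assumptions, not Decagon’s actual implementation.

```python
# A minimal sketch (not Decagon's code) of a layered guardrail pipeline:
# each layer can veto a draft response before it reaches the user.
from dataclasses import dataclass, field


@dataclass
class Draft:
    text: str
    claimed_actions: list[str] = field(default_factory=list)  # e.g. ["refund_issued"]


def supervisor_check(draft: Draft, executed_actions: set[str]) -> tuple[bool, str]:
    """Stand-in for a supervisor model that reviews the generator's output."""
    if "guaranteed" in draft.text.lower():
        return False, "unsupported guarantee in response"
    return True, ""


def action_check(draft: Draft, executed_actions: set[str]) -> tuple[bool, str]:
    """Stand-in for the final action-hallucination check: every action the draft
    claims must exist in the system of record before we assert it to the user."""
    missing = [a for a in draft.claimed_actions if a not in executed_actions]
    return (not missing, f"unverified actions: {missing}" if missing else "")


def run_guardrails(draft: Draft, executed_actions: set[str]) -> Draft | None:
    for layer in (supervisor_check, action_check):
        ok, reason = layer(draft, executed_actions)
        if not ok:
            print(f"blocked, escalating to a human: {reason}")
            return None
    return draft


# Example: the draft claims a refund that was never actually issued, so it is blocked.
draft = Draft("We've issued your refund!", claimed_actions=["refund_issued"])
run_guardrails(draft, executed_actions=set())
```

The design point worth copying is that the final layer validates claimed actions against a system of record rather than trusting the generator, which is exactly the “phantom shipment” failure mode Bihan wants to rule out.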
On top of all that, they have a product called Watchtower that reviews conversations after the fact and surfaces issues and improvement areas. This final layer serves as post-hoc QA plus analytics, constantly combing through transcripts for weak spots. As Bihan put it, “All of these layers exist so we catch problems before customers ever feel them.”
Evals and Simulation as a First-Class Product
Decagon has built a full evaluation system that is far more robust than most standard benchmarks available today:
Simulation-based tests and unit tests: Customers can build test suites that encode the behaviors they never want to see, and the behaviors they must see.
Customer-defined test cases:
“Never say we’ve issued a refund unless XYZ is true.”
“Always escalate if ABC appears in the message.”
Continuous monitoring: These tests run continuously as new workflows are added, new policies are configured, and new knowledge is ingested.
Bihan tells me that “Customers can run these tests, see a pass rate, and keep running them over time to make sure the agent’s performance isn’t degrading, especially as they add new guidelines, workflows, or knowledge.”
They also do synthetic sandboxing whereby customers provide real conversation transcripts that Decagon then turns into test personas that behave like actual customers. They run large-scale simulations with different intentions to see how the agent behaves before anything hits production.
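As a rough illustration of how behavioral rules like these might be encoded and scored, here is a small sketch that assumes conversations are available as simple transcript records; the rule names and fields are hypothetical, and in practice the transcripts being scored could just as well come from the synthetic persona simulations described above.

```python
# A hedged sketch of customer-defined behavioral tests: "never" and "always" rules
# become assertions over transcripts, and the suite reports a pass rate each run.
def never_claims_refund_without_approval(transcript: dict) -> bool:
    if "issued a refund" in transcript["agent_text"].lower():
        return transcript.get("refund_approved", False)
    return True


def always_escalates_on_legal_threat(transcript: dict) -> bool:
    if "lawsuit" in transcript["user_text"].lower():
        return transcript.get("escalated", False)
    return True


TEST_SUITE = [never_claims_refund_without_approval, always_escalates_on_legal_threat]


def run_suite(transcripts: list[dict]) -> float:
    results = [test(t) for t in transcripts for test in TEST_SUITE]
    return sum(results) / len(results)  # the pass rate to track over time


transcripts = [
    {"user_text": "Where is my order?", "agent_text": "We've issued a refund.",
     "refund_approved": False, "escalated": False},
    {"user_text": "I'm considering a lawsuit.", "agent_text": "Connecting you to a specialist.",
     "refund_approved": False, "escalated": True},
]
print(f"pass rate: {run_suite(transcripts):.0%}")  # 75% here; failures point to regressions
```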
As I wrote about last week, evals enable you to encode what “good” and “unacceptable” look like. With its unique approach, Decagon is productizing this by empowering its users to build their own test suites and monitor them on a continuous basis.
Turning Conversation Data into a Flywheel
Data feedback loops are at the core of any application-layer AI company. At Decagon, that loop starts with conversation data, which is a gold mine for both the agent and the customer.
Decagon’s post-conversation analytics system uses transcripts plus agent actions to surface the two or three highest-ROI improvements a customer could make. Internally, they call this ‘hill climbing’. “We help customers keep climbing the metrics they care about,” Bihan explained to me. “As the agent gets better and better, that’s a higher ROI for the customer. More conversations handled by the agent, higher NPS or CSAT, better retention and, eventually, revenue.”
Of course, there are real constraints. Sometimes the blocker isn’t the agent at all; it’s business logic or infra. For example, regulatory constraints might prevent automation on certain workflows. Or there might be missing APIs for key actions, so the agent literally can’t take a certain action. In those cases, the flywheel turns into a roadmap conversation: here are the “addressable” tickets the agent can handle with current policies and infra and here are the “non-addressable” ones that require policy changes or new engineering work to unlock.
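A simplified sketch of what this triage might look like, assuming each conversation is tagged with a driver, a resolution flag, and an addressability flag; the cost figure and field names are illustrative, not Decagon’s analytics.

```python
# Toy "hill climbing" triage: group unresolved conversations by driver, estimate the
# value of fixing each addressable one, and split out the non-addressable blockers.
from collections import Counter
from dataclasses import dataclass


@dataclass
class Conversation:
    driver: str             # e.g. "refund_policy_unclear", "no_api_for_address_change"
    resolved_by_agent: bool
    addressable: bool        # False when blocked by policy or a missing integration


COST_PER_HUMAN_TICKET = 6.00  # assumed fully loaded cost per human-handled ticket


def top_improvements(conversations: list[Conversation], k: int = 3):
    misses = Counter(
        c.driver for c in conversations if not c.resolved_by_agent and c.addressable
    )
    # Rough savings if the agent learned to handle each addressable driver.
    ranked = [(driver, count * COST_PER_HUMAN_TICKET) for driver, count in misses.most_common(k)]
    blocked = sorted({c.driver for c in conversations if not c.addressable})
    return ranked, blocked  # the second list becomes the roadmap conversation


convos = [
    Conversation("refund_policy_unclear", False, True),
    Conversation("refund_policy_unclear", False, True),
    Conversation("no_api_for_address_change", False, False),
    Conversation("password_reset", True, True),
]
print(top_improvements(convos))
# ([('refund_policy_unclear', 12.0)], ['no_api_for_address_change'])
```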
The gist is that Decagon’s product encourages rapid iteration so that customers can not only monitor and ensure performance but also learn from real interactions and continuously improve quality – all while guardrails, security, and data controls ensure enterprise-grade reliability and accuracy.
The data flywheel is thus fed not only by more logs, but also by a sharper view of constraints and by collaboration with the customer to expand what’s possible.
How Enterprises Deploy This Stuff
When companies choose to sign with Decagon, Bihan noted that it’s usually an exciting inflection point for that company in generally embracing the power of agentic AI: “There’s often a board-level directive to ‘bring AI into CX,’ which creates top-down momentum and we often end up being a part of many vibrant, cross-functional conversations on the customer’s side because there’s a lot of trust and alignment going into a decision this transformative.”
This requires Decagon to partner closely with each department on its concerns and questions. For example, legal needs to understand the risk posture. Security needs to vet data flows. CX leads want to control tone and escalation paths. IT and engineering need to prioritize integrations.
In other words, a big chunk of the job is internal change management for customers, not just technical implementation. This may get easier as AI becomes more widely adopted, but I suspect it will always be a somewhat lengthy process in most cases.
As for implementation, Decagon’s implementations team – a mix of Agent product management, engineering, and customer success – puts together a clear roadmap with the different stakeholders before deploying the product. Once that roadmap is in place, the deployment path is fairly consistent.
Phase 1: The agents tackle customer support by resolving issues instantly, automating complex workflows, and surfacing insights. For most companies, “this is the low-hanging fruit,” Bihan tells me. “It’s always where the fires are. The goal here is to automate obvious, repetitive workflows and show meaningful ROI as quickly as possible.”
Phase 2: Once the immediate value is clear, teams become forward-looking and realize that they’ve built an intelligent, conversational interface that truly understands their customers. “They wonder: why stop at support workflows? That same AI foundation, capable of understanding intent, recalling context, and acting autonomously, can power entirely new kinds of customer experiences that generate revenue and build brand loyalty.”
One of the clearest examples of this shift is in product discovery and recommendations. Bihan mentioned a global rental car customer as one example, where the workflow blends supporting a current rental with offering to extend the rental window. “Even though these users are calling in to support, if what they’re saying suggests it’s a good idea to extend, we can suggest that option, and if it’s a good fit for the driver’s situation, that directly increases our customer’s revenue.” So what looks like a cost-center conversation becomes a revenue event if the agent tactfully recognizes the opportunity through intent, context, and timing, and makes it easy to take action.
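As a toy illustration of that kind of trigger, here is a sketch of an intent- and context-aware suggestion check; the intent label, fields, and thresholds are assumptions for illustration, not the actual rental-car integration.

```python
# Illustrative only: surface an extension offer when intent and context suggest
# the driver genuinely needs more time, alongside the normal support answer.
from dataclasses import dataclass
from datetime import date


@dataclass
class RentalContext:
    return_date: date
    extension_available: bool


def suggest_actions(intent: str, message: str, ctx: RentalContext, today: date) -> list[str]:
    suggestions: list[str] = []
    days_left = (ctx.return_date - today).days
    mentions_delay = any(w in message.lower() for w in ("delayed", "stay longer", "extra day"))
    if ctx.extension_available and days_left <= 2 and (intent == "trip_change" or mentions_delay):
        suggestions.append("offer_rental_extension")  # only when it genuinely fits the situation
    return suggestions


ctx = RentalContext(return_date=date(2025, 6, 10), extension_available=True)
print(suggest_actions("trip_change", "My flight home is delayed a day", ctx, date(2025, 6, 9)))
# ['offer_rental_extension']
```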
How Enterprises Measure This Stuff
On the KPI side, there are two dimensions. The first is system and model quality (e.g., evals), which we covered above.
The second dimension is business value: ROI and impact. “Our job is to show ROI as rapidly as possible,” Bihan says. For Decagon, the most immediate metric is often deflection or resolution rate: the percentage of conversations the AI can handle end-to-end without human involvement. Customer satisfaction scores like NPS and CSAT also matter since they tie directly to retention and brand loyalty. As Bihan put it, “If the agent is fast, accurate, and helpful, customers actually feel better than they do talking to a human. You see it in the NPS lift immediately.”
Another key metric for Decagon is routing accuracy. When the agent does escalate, did it route the issue to the right human? Large enterprises often have deeply specialized support teams, and incorrect routing leads to longer handle times, frustrated customers, and operational drag. “It’s not just about how often we escalate,” Bihan noted. “It’s about escalating correctly. Routing accuracy is one of the highest-impact metrics for large CX teams.”
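Both metrics are straightforward to compute once each conversation is logged with an escalation flag and a routing outcome; here is a back-of-the-envelope sketch with illustrative field names.

```python
# Deflection/resolution rate and routing accuracy from simple conversation logs.
def deflection_rate(conversations: list[dict]) -> float:
    """Share of conversations handled end-to-end without a human."""
    resolved = sum(1 for c in conversations if not c["escalated"])
    return resolved / len(conversations)


def routing_accuracy(conversations: list[dict]) -> float:
    """Of the escalated conversations, how many reached the right team?"""
    escalated = [c for c in conversations if c["escalated"]]
    if not escalated:
        return 1.0
    correct = sum(1 for c in escalated if c["routed_team"] == c["correct_team"])
    return correct / len(escalated)


convos = [
    {"escalated": False},
    {"escalated": False},
    {"escalated": True, "routed_team": "billing", "correct_team": "billing"},
    {"escalated": True, "routed_team": "billing", "correct_team": "fraud"},
]
print(f"deflection: {deflection_rate(convos):.0%}, routing accuracy: {routing_accuracy(convos):.0%}")
# deflection: 50%, routing accuracy: 50%
```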
And then there’s the softer layer: brand perception and social proof. Bihan tells me this one is harder to quantify, but enterprises feel it. Customers who have a delightful support interaction often post about it on social channels, mention it in reviews, and tell their friends. In fact, I told Bihan that one of the best support experiences of my life was with Oura – so good that it turned me into a long-term advocate for the product. Only later did I realize Decagon was powering that interaction. Stories like that may not be quantifiable, but they meaningfully influence loyalty and purchasing behavior.
Black Swans and the Importance of Resilience
We ended our conversation by talking about how to plan in a world where the ground shifts every six months. Bihan has been reading The Black Swan by Nassim Taleb, and it’s actively shaping how she thinks about product strategy. The temptation in AI is to assume a smooth, upward trajectory: models get better, costs fall, capabilities expand. But real Black Swan events don’t follow trend lines. They come out of left field: abrupt shifts in model behavior, new constraints on context length or cost, regulatory changes, unexpected user preferences, or the sudden arrival of a cheaper model that seemingly resets the economics overnight.
If your product is overfitted to a single model, a single stack, or a single assumption about “how AI works today,” you’re fragile. Bihan’s view is that resilience is the real long-term defensibility. Can you swap in a new model without rewriting your whole product? Are your moats actually in workflows, data, evaluation, and customer relationships, rather than just in your current model wiring? Have you built full systems that improve themselves, independent of any single model?
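One common way to avoid that kind of fragility, sketched below under assumed names, is to keep the concrete model behind a narrow interface so a provider can be swapped without touching workflows, guardrails, or evals.

```python
# A hedged sketch of model-agnostic wiring: the product depends on a small protocol,
# and any provider that satisfies it can be dropped in.
from typing import Protocol


class ChatModel(Protocol):
    def complete(self, system: str, messages: list[dict]) -> str: ...


class AgentRuntime:
    def __init__(self, model: ChatModel):
        self.model = model  # the only place a concrete model is referenced

    def respond(self, system: str, messages: list[dict]) -> str:
        draft = self.model.complete(system, messages)
        # Guardrails, evals, and analytics run here regardless of which model produced the draft.
        return draft


class EchoModel:
    """Stand-in provider; in practice this would wrap a vendor SDK or a self-hosted model."""
    def complete(self, system: str, messages: list[dict]) -> str:
        return f"(echo) {messages[-1]['content']}"


runtime = AgentRuntime(EchoModel())  # swapping providers is a one-line change
print(runtime.respond("Be helpful.", [{"role": "user", "content": "Where's my order?"}]))
```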
Decagon’s approach is to assume change as the default. They’ve built a thick stack of guardrails, evals, analytics, and operational muscle precisely so the agent can evolve with the industry. The goal isn’t to freeze a perfect system or even just survive new breakthroughs; it’s to build a platform that thrives with the industry and its customers alike. As Bihan put it, “Change is the only constant, so you have to build like something big might change tomorrow. The companies that win will be the ones ready to shift fast.”



The "Black swan" metaphor is a really classy way to articulate flexibility in product design and UX. Analogous pattern to first rounds Applied Intelligence paradigm.