It is late evening in a hospital admissions queue. A man has his father on a stretcher beside him and one question for his insurer: has the cashless approval come through? On the line is our voice bot. If it gives a wrong answer, or pauses long enough that he hangs up, the admission stalls.
We built that bot: a production voice AI for health insurance that handles cashless pre-authorization status checks, document follow-ups, and policy lookups. The work taught us where enterprise voice AI actually breaks, and most of it has little to do with making the bot sound human.
Key Takeaways
- The hardest part of enterprise voice AI is trust: accurate facts, no dead air, and a full audit trail, all under a sub-second response budget.
- Hallucination has to be designed out. Every factual claim comes from a tool call, never from the model's memory.
- Turn-taking is domain-specific. Insurance callers read long IDs, so the bot waits for 1.5 seconds of silence rather than the chat-tuned 0.6.
- Compliance is built in. Every turn is logged with transcript, tool calls, and latency, and all inference stays in-region.
What does an enterprise voice bot actually have to do?
Our bot runs an Indic-first pipeline: Sarvam handles speech-to-text and text-to-speech, Mistral Large on AWS Bedrock runs the conversation, and every component stays inside India for data residency. Getting it to speak naturally took about a week. Getting it to be trustworthy took the rest of the project.
One thread runs through the five lessons that follow: in a regulated domain, callers need a bot that is correct, that never leaves them hanging, and that someone can audit afterwards.
Lesson 1: Make hallucination architecturally impossible
"Do not hallucinate" in a system prompt is a wish. A model trained on insurance documents will state a coverage limit with total confidence, and be wrong. So we made factual claims a function call. The bot may not state a case status, a sum insured, a room-rent limit, or a network-hospital decision from its own memory. Every such fact has to come from a tool call in the same turn.
Six tools back the conversation: pre-auth status, policy details, a document-upload SMS, a scoped knowledge search, escalation to a human, and call close. Knowledge answers are gated on confidence: above 0.85 the bot answers, between 0.65 and 0.85 it asks a clarifying question, and below 0.65 it escalates. Anything that needs coverage interpretation goes straight to a human, with no guessing. The bar we hold before any external pitch is zero hallucinated facts across a 200-call evaluation.
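The confidence gate described above can be sketched in a few lines. This is a minimal illustration and not the production code: the names `Action` and `gate_knowledge_answer` are hypothetical, and the behaviour at exactly 0.85 and 0.65 is an assumption, since the thresholds are only stated as ranges.

```python
from enum import Enum

# Thresholds taken from the text; handling of the exact boundary values is assumed.
ANSWER_THRESHOLD = 0.85
ESCALATE_THRESHOLD = 0.65


class Action(Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    ESCALATE = "escalate"


def gate_knowledge_answer(confidence: float, needs_interpretation: bool = False) -> Action:
    """Decide what the bot does with a knowledge-search result."""
    # Coverage interpretation always goes to a human, whatever the score.
    if needs_interpretation:
        return Action.ESCALATE
    if confidence > ANSWER_THRESHOLD:
        return Action.ANSWER
    if confidence >= ESCALATE_THRESHOLD:
        return Action.CLARIFY
    return Action.ESCALATE
```

Keeping the gate as a pure function makes it easy to replay against an evaluation set when the thresholds are tuned.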
Lesson 2: The bot can never go silent
In voice, silence reads as a broken system. The most dangerous failure we found was the bot going quiet after a tool returned, so we made a spoken reply mandatory on every turn. Tools never throw. They return a structured result with a caller-friendly message, so a missing policy number becomes "I couldn't find that policy, could you read it once more?" rather than dead air.
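One way to enforce "tools never throw" is a wrapper that converts any exception into a structured result carrying a sentence the bot can speak. The sketch below is illustrative only: `ToolResult`, `safe_tool`, `get_policy_details`, and the in-memory policy table are hypothetical stand-ins, not the real tool layer.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Optional


@dataclass
class ToolResult:
    ok: bool
    spoken_message: str          # always present, so the bot always has something to say
    data: Optional[dict] = None


def safe_tool(fallback_message: str) -> Callable:
    """Wrap a tool so that any exception becomes a speakable ToolResult."""
    def decorator(fn: Callable[..., ToolResult]) -> Callable[..., ToolResult]:
        @wraps(fn)
        def wrapper(*args, **kwargs) -> ToolResult:
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Never propagate: the voice loop must get a reply on every turn.
                return ToolResult(ok=False, spoken_message=fallback_message)
        return wrapper
    return decorator


# Hypothetical in-memory stand-in for the policy backend.
_POLICIES = {"P12345": {"status": "active"}}


@safe_tool("I couldn't find that policy, could you read it once more?")
def get_policy_details(policy_number: str) -> ToolResult:
    policy = _POLICIES[policy_number]  # raises KeyError for an unknown policy
    return ToolResult(ok=True, spoken_message="I found your policy.", data=policy)
```

Calling `get_policy_details("X999")` returns a failed `ToolResult` with the fallback sentence instead of raising, so the turn still ends in speech.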
Providers fail too. Bedrock throttles, the speech socket drops, the voice API times out. Each provider gets its own retry budget with exponential backoff and a tight per-request timeout, and if recovery still fails, the floor is always a human handoff. There is also a two-stage idle timer: after eight seconds of silence the bot asks "are you still there?", and after another eight it closes the call gracefully.
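The retry-then-handoff pattern can be sketched as below. The attempt count, base delay, and function names are assumptions chosen for illustration, and the wrapped `operation` is assumed to enforce its own per-request timeout.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,          # assumed retry budget for one provider
    base_delay_s: float = 0.1,      # assumed starting delay
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run operation, backing off exponentially; return None when the budget is spent."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt < max_attempts - 1:
                # Delays double on each retry: base, 2x base, 4x base, ...
                sleep(base_delay_s * (2 ** attempt))
    return None


def respond_or_handoff(operation: Callable[[], str], handoff: Callable[[], str]) -> str:
    """The floor is always a human handoff, never silence."""
    result = call_with_retry(operation)
    return result if result is not None else handoff()
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. In a live call the total backoff has to stay small enough that the caller does not hear dead air.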
Lesson 3: Turn-taking is domain-specific
Most voice frameworks ship with turn detection tuned for fast back-and-forth chat. That broke immediately for us. Insurance callers read long policy and case IDs, pause to find a document, and are often stressed or elderly. The default 0.6-second silence threshold cut them off mid-number. We switched to a 1.5-second timeout, which lets a caller pause naturally without the bot barging in.
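To see why the threshold matters, here is a simplified end-of-turn detector that works on per-frame voice-activity decisions. It is a sketch under assumptions: the 20 ms frame size and the function name are hypothetical, and real frameworks expose this as a configuration value rather than a loop you write yourself.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    frames: Iterable[bool],            # per-frame VAD output: True means speech
    frame_ms: int = 20,                # assumed frame size
    silence_timeout_ms: int = 1500,    # the 1.5-second threshold from the text
) -> Optional[int]:
    """Return the index of the frame where the turn ends, or None if it never does."""
    needed = -(-silence_timeout_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0             # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None
```

With a caller who pauses 0.8 seconds in the middle of an ID, a 600 ms timeout ends the turn during the pause, while 1500 ms waits until the caller has actually finished.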
Two more details mattered. Voice-activity detection has to run before speech-to-text, or audio piles up and no turn is ever signalled. And the bot reads every ID back before it looks anything up, which catches the P-versus-B and T-versus-D confusions that Indian-English speech-to-text tends to make. All of it still has to fit inside the sub-second response budget.
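A read-back step might look like the sketch below. The function names and the use of NATO spelling words for the confusable letters are assumptions made for illustration, not a description of the production prompts.

```python
# Spelling words for the letters the text names as commonly confused (assumed wording).
CONFUSABLE = {
    "P": "P as in Papa",
    "B": "B as in Bravo",
    "T": "T as in Tango",
    "D": "D as in Delta",
}


def spell_out_id(identifier: str) -> str:
    """Render an ID character by character so text-to-speech reads each one separately."""
    chars = [c.upper() for c in identifier if c.isalnum()]
    return ", ".join(CONFUSABLE.get(c, c) for c in chars)


def confirmation_prompt(identifier: str) -> str:
    """Build the read-back the bot speaks before any lookup is attempted."""
    return f"I heard {spell_out_id(identifier)}. Is that correct?"
```

The lookup tool is then called only after the caller confirms, so a misheard character costs one extra exchange instead of a wrong record.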
Lesson 4: Compliance has to be built in
Indian insurance sits under IRDAI and the DPDP Act, so audit and consent have to be built in from the start. The bot speaks a consent notice before the first real exchange, and the caller's first reply is the consent decision. Every turn after that is written to an audit log: the transcript, the tool calls and their results, and the latency, with a flag for whether personal data is masked. Audit logs are retained for seven years.
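A per-turn audit record with the fields listed above could be modelled like this. It is a sketch, not the actual schema: the class and field names are hypothetical, and `call_id`, `turn_index`, and `logged_at` are added assumptions so that records can be ordered and replayed.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ToolCallRecord:
    name: str
    arguments: dict
    result: dict


@dataclass
class TurnAuditRecord:
    call_id: str                 # assumed: groups the turns of one call
    turn_index: int              # assumed: preserves order for replay
    transcript: str
    tool_calls: List[ToolCallRecord]
    latency_ms: int
    pii_masked: bool             # whether personal data in this record is masked
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_audit_line(record: TurnAuditRecord) -> str:
    """Serialize one turn as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)
```

One JSON line per turn, written as the turn completes, is one simple layout that supports replaying a call end to end.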
All inference runs in-region, in Mumbai, with nothing crossing the border, and data is encrypted both in transit and at rest. Because the audit captures every turn as it happens, any call can be replayed end to end: the audio, the transcript, the model's reasoning, the tool calls, and the spoken reply.
If a voice bot cannot show what it said, what it looked up, and why, it cannot go live in insurance, banking, or healthcare. Build the audit trail into the pipeline from turn one.
Lesson 5: Go Indic-first on speech
Generic speech models struggle with Indian languages and with domain jargon. Terms like "cashless" and "pre-auth" and specific plan names get mangled, and callers code-switch between Hindi and English inside a single sentence. Using a native Indic speech provider was the starting point. On top of that, we bias the recognizer with a custom vocabulary of domain terms loaded at the start of each call, and we localize the prompts per language rather than translating one English prompt.
Small choices add up here: a professional Indian-English voice for the replies, pronunciation fixes for contractions, and language-specific examples that show code-switching the way callers actually speak.
So what is the most important pain area?
Pulling the five together, the pattern is clear. Today's models already sound fluent enough. Where enterprise voice AI fails is on trust: a fact it cannot prove, a silence it cannot recover from, or a conversation no one can audit.
This generalizes well beyond insurance. A bank cannot read out a balance it is unsure of. A government service cannot guess at eligibility. Wherever the outcome is regulated, three properties decide whether a voice bot is credible: every fact sourced, every turn answered, every call auditable.
The bottom line
The demand here is for people who can build voice AI that holds up in production, under real constraints, with guardrails and governance in place. That is the same discipline we bring to the rest of the software lifecycle, applied to a live phone call. If you are putting AI in front of customers where the answers have consequences, that is the bar to build to.
This is the kind of system we build with enterprise teams. If you are planning a voice AI where accuracy and compliance matter, book a strategy call. For how we think about governing AI across delivery, see What Is an AI-Native SDLC.
Frequently asked questions
What is the biggest challenge in enterprise voice AI?
Trust under constraints. The bot has to give accurate, sourced answers, never go silent, stay fast to the first response, and keep a full audit trail. In regulated domains like insurance, a hallucinated fact or a dropped call has real consequences.
How do you stop a voice bot from hallucinating?
Make it architectural. Require a tool call for every factual claim such as case status, sum insured, or coverage, gate knowledge-base answers on a confidence threshold, and escalate interpretation to a human. We hold the bar at zero hallucinated facts across a 200-call evaluation.
Why does turn-taking matter so much in voice AI?
Default voice frameworks are tuned for fast chat. Insurance callers read long policy and case IDs and often pause, so the bot waits for 1.5 seconds of silence before responding instead of the 0.6-second default, and confirms every ID by reading it back to catch speech-to-text errors.
How do you handle data residency and compliance?
For Indian insurance, all inference runs in-region in Mumbai, every turn is logged with transcript, tool calls, and latency, audit logs are retained for seven years, and consent is captured before the first interaction. Compliance is built into the pipeline from the start.