Your AI gets 90% accuracy. It fails 88% of real conversations

Per-turn accuracy and conversation accuracy are not the same measure: the first is what your test suite catches, the second is what your customers experience. At 90% per turn, a 20-turn system succeeds about 12% of the time in production, and a single early error, say on turn 6, compounds silently through every turn after it. The fix isn't a larger prompt; it's moving the constraints into the weights.

Brady Walker, Senior Content Marketing Manager
4 min

You built something. You tested it. You hit 90% accuracy per turn.

Now run that number through a 20-turn conversation.

0.90²⁰ ≈ 0.12.

If every turn has to succeed and turns fail independently, your 90%-accurate system has a 12% conversation success rate in production. It's failing 88% of the time, and your test suite has no idea.
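To make the arithmetic concrete, here is a minimal sketch. It assumes a conversation only succeeds if every turn succeeds and that turns fail independently, which is the same simplification the 0.90²⁰ figure relies on. The function names are illustrative, not part of any PolyAI tooling.

```python
def conversation_success_rate(per_turn_accuracy: float, turns: int) -> float:
    """Probability that every turn succeeds, assuming independent turns."""
    return per_turn_accuracy ** turns


def required_per_turn_accuracy(target_success: float, turns: int) -> float:
    """Per-turn accuracy needed to reach a target conversation success rate."""
    return target_success ** (1.0 / turns)


if __name__ == "__main__":
    # Forward direction: what a per-turn number means over 20 turns.
    for accuracy in (0.90, 0.95, 0.99):
        rate = conversation_success_rate(accuracy, turns=20)
        print(f"{accuracy:.0%} per turn -> {rate:.1%} per 20-turn conversation")

    # Inverse direction: what you would need per turn to hit 90% per conversation.
    needed = required_per_turn_accuracy(0.90, turns=20)
    print(f"90% per conversation needs about {needed:.2%} per turn")
```

Run the inverse and the gap becomes obvious: under these assumptions, a 90% conversation success rate over 20 turns needs roughly 99.5% accuracy on each turn.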

There's an architectural decision that closes that gap. PolyAI made it when they chose to train Raven instead of wrapping GPT-5. This post explains what that decision is and how to test the foundation for free.

The math your test suite can't do

90% per-turn accuracy measures one turn. A customer service conversation runs twenty, and every turn builds on the one before it.

A mishandled tool call on turn 6 corrupts the context state for every turn that follows. By turn 14, the model is reasoning from a broken thread. Sandbox tests can't surface this. Test cases are too clean: controlled inputs, expected tool responses, no concurrent load.

Three failure modes live somewhere your test suite never goes.

Latency under real load

Your staging environment isn't handling 200 simultaneous calls on shared cloud infrastructure. Your customers are. At 2 seconds per turn across 20 turns, that's 40 seconds of dead air per conversation.

Multilingual slip

When retrieval returns an English document, a general model's attention follows it, and it replies in English to a Spanish-speaking caller. This doesn't happen in your test suite because your test suite only ever calls it in English.

Hallucination under API failure

A backend times out. The tool returns an error code. A general model trained to always produce a helpful answer doesn't say it can't see the data; it estimates, confidently, from pattern.

The question that matters isn't "does it work in testing?" It's "what happens on turn 14 when the reservations API times out?" If your architecture doesn't have a trained answer to that, you're measuring your test cases, not your production system.
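One way to put the turn-14 question to your own system is a fault-injection test. The sketch below is hypothetical throughout: the stub reservations tool, the two toy agents, and the string check are stand-ins invented for illustration, and a real harness would call your actual agent and use a stricter check for fabricated details.

```python
from typing import Callable, Dict, List

ToolResult = Dict[str, str]
Agent = Callable[[int, ToolResult], str]


class ToolTimeout(Exception):
    """Raised by the stub backend to simulate a timed-out API call."""


def make_flaky_tool(fail_on_turn: int) -> Callable[[int], ToolResult]:
    """Build a hypothetical reservations lookup that times out on one turn."""
    def reservations_lookup(turn: int) -> ToolResult:
        if turn == fail_on_turn:
            raise ToolTimeout(f"reservations API timed out on turn {turn}")
        return {"status": "ok", "table": "12"}
    return reservations_lookup


def run_conversation(agent: Agent, tool: Callable[[int], ToolResult],
                     turns: int = 20) -> List[str]:
    """Drive the agent for a fixed number of turns, passing tool errors through."""
    replies = []
    for turn in range(1, turns + 1):
        try:
            result = tool(turn)
        except ToolTimeout as exc:
            result = {"status": "error", "detail": str(exc)}
        replies.append(agent(turn, result))
    return replies


def honest_agent(turn: int, result: ToolResult) -> str:
    """Toy agent that admits it cannot see the data when the tool fails."""
    if result["status"] != "ok":
        return "I can't see your reservation right now."
    return f"Your table is {result['table']}."


def guessing_agent(turn: int, result: ToolResult) -> str:
    """Toy agent that fills in a plausible value when the tool fails."""
    return f"Your table is {result.get('table', '7')}."


def states_reservation_detail(reply: str) -> bool:
    """Crude stand-in check: did the reply assert a table assignment?"""
    return "your table is" in reply.lower()


if __name__ == "__main__":
    tool = make_flaky_tool(fail_on_turn=14)
    for agent in (honest_agent, guessing_agent):
        reply = run_conversation(agent, tool)[13]  # index 13 is turn 14
        verdict = "FABRICATED" if states_reservation_detail(reply) else "ok"
        print(f"{agent.__name__}: {reply!r} -> {verdict}")
```

The point of the sketch is the shape of the test, not the toy agents: the failure is injected mid-conversation, and the assertion is about what the system says when it has no data.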

Every fix makes the next break more expensive

Every production failure looks like a prompt problem. Add a constraint, fix the symptom, ship again. That loop has a ceiling.

The language slip hits first. You add: "Always respond in the language the customer is using." Retrieval returns an English document, the model's attention follows it, and a Spanish-speaking customer gets an English reply. You make the instruction more specific. The prompt grows.

Hallucination is next. You add: "If you don't have information from the provided documents, say you don't know." A new edge case surfaces. The prompt grows.

Then verbosity: "Be concise. Limit responses to two sentences. No bullet points." The prompt grows.

Matt Henderson, PolyAI's VP of Research, described where this ends: "just another layer of paint" — impossible to read, impossible to audit. Instructions contradict each other. When something breaks, you don't know which constraint caused it.

A prompt can't fix a training distribution mismatch. You're using text to modify what weights already learned. The exit: move the constraints into the weights.

What changes when the constraints live in the weights

Purpose-built here means a three-stage post-training pipeline.

Supervised Fine-Tuning. Raven starts from an open-weights foundation model, then trains on hundreds of thousands of real, anonymized production calls. Large offline reasoning models generate gold-label responses from those conversations — too slow for live voice, but through SFT, Raven inherits their accuracy without their latency. Concise delivery, safe tool-calling, and knowing when to say "I don't know" become baseline defaults. Not instructions.

GRPO/DPO Fusion. GRPO is active practice: the model generates multiple responses, a reward function scores them, and the model adjusts. It is powerful, but when the model is genuinely confused and none of its responses score well, the hill-climbing stops. DPO is contrastive correction: when GRPO hits that wall, the pipeline injects a targeted pair showing what the model tried and what an expert would have said. The wall comes down. Used together, neither technique is left with its weakness exposed.
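To see why the hill-climbing stops, it helps to look at the textbook forms of the two objectives. The sketch below uses the standard group-relative advantage from GRPO and the standard DPO loss. It is not PolyAI's implementation: the post does not describe Raven's reward function or how the pipeline decides when to inject a pair, and the function names and the `beta` default are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: each reward normalized against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair of responses."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # One sampled response scores well: the group gives a usable signal.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))

    # Every sampled response scores equally badly: all advantages are zero,
    # so there is nothing for the policy update to climb toward.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))

    # A contrastive pair still yields a nonzero loss in that situation.
    print(dpo_loss(logp_chosen=-12.0, logp_rejected=-9.0,
                   ref_logp_chosen=-12.0, ref_logp_rejected=-9.0))
```

When every response in a group earns the same reward, the normalized advantages collapse to zero and the group carries no learning signal. A preference pair does not depend on the model having sampled a good answer itself, which is why it can supply a gradient at exactly that point.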

Auto-reasoning. On complex turns — date-time calculations, multi-step conditionals — Raven generates fewer than 40 tokens of internal reasoning before responding. That deliberation takes milliseconds. Latency doesn't move on turns that don't need it.

When the constraints live in the weights, your system prompt becomes an instruction sheet instead of a structural support beam.

Test it before you commit

PolyAI's Agent Studio and Agent Developer Kit are free for two months. You build on the same Raven foundation you'd ship on, with no prompt scaffolding and no wrapper to maintain. What you test is what you deploy.

Try PolyAI Agent Studio — free for 60 days