Four questions that separate real multilingual AI from a very convincing demo

Four questions to expose whether your multilingual AI is production-ready or just a convincing demo.

Brady Walker, Senior Content Marketing Manager

May 19, 2026

5 min

A customer calls in Mandarin. Your AI agent picks up. The conversation starts well, both sides in Mandarin. Then, somewhere around turn eight, the bot switches to English.

The model simply decided, for that call and that turn in the conversation, that English was the right language. It happens 4% of the time with GPT-5. A 96% success rate sounds like a high grade until you're running 10,000 calls a month and 400 of them end in a language nobody asked for. The vendor's demo sounded perfect. The support tickets did not.
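The arithmetic is worth running against your own volumes. The sketch below assumes the 4% figure applies per call, which is how the 400 calls above fall out. It also shows what the same rate would mean if it applied to each bot response instead, an assumption the figure as quoted here does not confirm. The function names and the 10-response call length are illustrative only.

```python
def expected_failed_calls(calls_per_month: int, per_call_rate: float) -> float:
    """Expected number of calls per month that hit the failure."""
    return calls_per_month * per_call_rate


def per_call_rate_from_per_response(per_response_rate: float, bot_responses: int) -> float:
    """Chance that at least one response in a call fails,
    assuming each response fails independently (a simplification)."""
    return 1.0 - (1.0 - per_response_rate) ** bot_responses


# Reading 1: 4% of calls fail.
failed = expected_failed_calls(10_000, 0.04)
print(round(failed))  # 400

# Reading 2 (hypothetical): 4% of responses fail, 10 bot responses per call.
compounded = per_call_rate_from_per_response(0.04, 10)
print(f"{compounded:.1%}")  # 33.5%
```

Under the second reading, roughly a third of ten-response calls would contain at least one language switch, which is why it matters whether a vendor quotes the rate per call or per response.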

That language switch is one of four ways multilingual AI fails in production before anyone thinks to test for it. For each failure mode, there's a diagnostic question worth taking into your next vendor conversation.

Your bot speaks the language. Understanding the culture is another matter.

Translation is the solved problem. Cultural intelligence is the next layer, and almost no vendor demos it, because it requires scenarios that don't come up in a standard contact center simulation.

A restaurant’s AI agent takes a reservation in Japanese. It needs a name for the booking, so it asks: "Can you spell your name for me?" The question doesn't make sense. In Japanese, you don't spell a name letter by letter. You say how it is read and, if the characters matter, explain which ones are used. Asking someone to spell their name is asking them to translate their own identity into a system that doesn't exist in their language.

Japanese formality compounds this. Default LLM output is casual, because the model was trained on general internet text. Customer service in Japanese requires a specific register of politeness that casual Japanese can't reach, and a bot trained without that register fails on every call. The words are correct. The grammar is fine. The cultural frame is wrong, and the caller knows it before the QA team does.

The diagnostic question: Can your system be tuned per language, with per-language style guides, native-speaker review, and prompt-level customization? Or is it one model, one configuration, deployed everywhere? A vendor with genuine cultural intelligence can show you all three. A vendor with a translation wrapper cannot.

Three languages. Three bots. One change request.

Most enterprise multilingual deployments aren't one deployment. They're two or three, one per language: English, Spanish, maybe Tagalog. Then leadership sees the ROI and asks about French, then Arabic, then Mandarin.

Three languages, three bots, three prompts, three maintenance tracks. Each prompt has to be written in the language of the bot, which means your team is maintaining scripts in languages they don't speak. One policy change means three updates, or six, or ten, in languages your team can't read. Co-review is effectively impossible. You're trusting machine translation to QA machine translation.

Over time, the bots drift. The English bot gets the update. The Spanish bot gets it two weeks later. The Japanese bot gets a version of it, imperfectly translated by someone who isn't a native speaker.
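If you are stuck with separate bots for now, drift can at least be made visible. The sketch below is a hypothetical bookkeeping check, not a feature of any particular platform: it assumes each language deployment records the last policy revision applied to its prompt, and it lists the ones that have fallen behind. The `Deployment` class, its fields, and the version numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    language: str        # language code for this bot, e.g. "es"
    policy_version: int  # last policy revision applied to this bot's prompt


def stale_deployments(deployments: list[Deployment], current_version: int) -> list[str]:
    """Return the languages whose prompt is behind the current policy revision."""
    return sorted(d.language for d in deployments if d.policy_version < current_version)


# Hypothetical state after one policy change reached only the English bot.
bots = [Deployment("en", 7), Deployment("es", 6), Deployment("ja", 5)]
print(stale_deployments(bots, current_version=7))  # ['es', 'ja']
```

A check like this only tells you which bots missed an update. It says nothing about whether the update they did receive was translated well, which still needs a native speaker.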

The diagnostic question: Is this one deployment or three? Can I write my knowledge base once, in English, and deploy in any language? A vendor with unified multilingual architecture can say yes cleanly.

What "we support Mandarin" actually guarantees

The compounding 4% failure rate is a symptom. The root cause is simpler: most AI vendors don't own their models. They wrap a general-purpose LLM and configure it for customer service. When the model decides, on a given turn, that English is the right response language, the vendor has no way to train that behavior out. They can prompt against it. The failure rate stays baked in.

The only real fix is to own the model, observe the failure type, run the gradient update, and make that failure impossible.

Raven, PolyAI's in-house LLM, is built specifically for this. When a language consistency failure appears in testing, the team runs the gradient update — the failure type gets trained out, not patched around. The result is 45 supported languages, 23 trained and tested by native speakers, with an architectural guarantee that the model cannot switch to English unless you tell it to.

The diagnostic question: What is your per-response language consistency rate? And do you own the model, or are you wrapping someone else's? The second question matters more. A vendor who can't train their own model can't give you a real answer to the first.
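One way to put a number on the first question yourself is to score every bot response in a sample of transcripts. The sketch below is a rough heuristic, not PolyAI's or any vendor's method: it treats a response as Mandarin when at least half of its letters are CJK ideographs. That cannot tell Mandarin from Cantonese or Japanese kanji, so a production check would need a proper language identification model. The function names, the threshold, and the sample transcripts are all made up for illustration.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def looks_mandarin(text: str, min_share: float = 0.5) -> bool:
    """Heuristic: at least min_share of the letters are CJK ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # Digits or punctuation only: do not count this as a language switch.
        return True
    return sum(is_cjk(ch) for ch in letters) / len(letters) >= min_share


def per_response_rate(calls: list[list[str]]) -> float:
    """Share of all bot responses that stayed in the expected language."""
    responses = [r for call in calls for r in call]
    return sum(looks_mandarin(r) for r in responses) / len(responses)


def per_call_rate(calls: list[list[str]]) -> float:
    """Share of calls in which every bot response stayed in the expected language."""
    return sum(all(looks_mandarin(r) for r in call) for call in calls) / len(calls)


# Hypothetical bot-side transcripts: two calls, one with a switch to English.
calls = [
    ["您好，请问有什么可以帮您？", "好的，请稍等。"],
    ["您好。", "Sure, I can help you with that.", "谢谢您的来电。"],
]
print(per_response_rate(calls))  # 0.8
print(per_call_rate(calls))      # 0.5
```

The two numbers diverge quickly on longer calls, so ask the vendor which one they are quoting.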

The demo sounds perfect. Your production deployment won't.

Every vendor demo sounds good. What the demo never shows you is a tonal language under load. Mandarin has four tones. Cantonese has six, or nine by the traditional count. A voice that handles "How can I help you?" flawlessly will start dropping tones on longer, more complex text, and your customers will notice before your QA team does.

The Washington Department of Licensing learned this publicly. Their bot's Spanish option read English scripts through a Spanish-accent text-to-speech voice. It said "please press uno" in English, with an accent. The calls went viral on TikTok. The agency apologized. The root cause was a Spanish-accent voice assigned to English text.

That failure was obvious. Most aren't. More often, it's a dropped tone, a slightly wrong register, an accent that doesn't match the market.

The diagnostic question: Can I hear a sample in my language across a full multi-turn conversation, not a greeting? Then evaluate across all four components: speech recognition, information retrieval, the LLM, and the voice. A vendor who can walk you through all four, with native-speaker validation, has done the work.

Most vendors will answer one or two of these questions confidently. The ones who can answer all four have built real multilingual infrastructure.

The Raven Guide breaks down what that infrastructure looks like in practice: the training methodology, the benchmark data, and the architectural decisions that separate a purpose-built customer service LLM from a general-purpose one wrapped in a prompt. Download it here.

Want to just start building agents today? Try PolyAI’s Agent Studio free here and get your first working agent built in 10 minutes.