Raven 3.5: The post-training recipe that beats GPT-5 for customer service

Matt Henderson, VP of Research

Raven 3.5 represents a major post-training upgrade to PolyAI's in-house LLM. Through significantly more training data and a purposefully designed fusion of Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) training techniques, we've delivered broad improvements across instruction following, multilingual quality, and conversational style. We've also added auto-reasoning, webchat support, and out-of-domain detection. All of this comes at the same sub-300ms latency that keeps Raven super responsive in live phone calls in a way generalist models aren't.

Key takeaways:

  • Raven 3.5 outperforms GPT-5 and Claude Sonnet 4.6 across all four of our customer service benchmarks.
  • Multilingual quality improves across 23 languages, with perfect language adherence.
  • Latency-optimized auto-reasoning knows when to think and when not to, adding quality without affecting perceived latency.
  • Raven 3.5 adds webchat support, out-of-domain detection, and emotion tags for text-to-speech.
  • Raven 3.5 remains fast, purpose-built, and specialist where generalist models remain generalist.

Voice agent benchmark

Our voice agent benchmark is built from thousands of real, anonymized customer service scenarios, judged by an independent LLM that scores model outputs on multiple dimensions. The benchmark measures instruction following, output quality on English and multilingual data, and output-style quality. Raven 3.5 brings significant improvements on a challenging date-time logic subset, which we also report separately.

Multilingual performance

Multilingual performance (measured over 23 languages) takes a large step forward thanks to an increase in dataset size, an improved base model, and more focused reward fine-tuning. Note that language adherence is built into the Raven models, meaning you can prompt the model in one language and ask it to respond in any other. This often trips up generalist models such as GPT and Claude, which struggle when source documents and prompt instructions are in English, but the model is asked to talk to the user in, for example, Spanish.


“We found that public models tend to reason in their most comfortable language, usually English, then translate while generating. Raven 3.5 avoids this by using its hidden reasoning trace as a target-language scratchpad, letting it think in Spanish before speaking Spanish.”

- Matt Henderson, VP of Research, PolyAI

Following style instructions

Raven 3.5 is also much better at following style instructions, while maintaining a suitable default style (as defined by our in-house voice agent style guides, specialized to 23 different languages). Where Raven 3 had its own strong default style, Raven 3.5 properly respects custom persona instructions on top of that foundation. If your agent is configured to be formal, warm, concise, or to address callers as Sir or Madam, it will honor that consistently without needing extensive examples to nudge it in the right direction.

Post-training is a dark art

Post-training Raven 3.5 is a hands-on and collaborative process. We spend a lot of time in the feedback loop, re-running the GRPO + DPO reinforcement fine-tuning stage and patching it with new targeted data, augmentations, and shaped rewards.

Most of the improvements in Raven 3.5 are due to our improved post-training recipe, specifically tweaked to optimize for voice agent performance.

Raven 3.5 trains on millions of anonymized conversations drawn from real deployments across banking, healthcare, retail, and hospitality, significantly more than Raven 3. During training, Raven is put inside the same agentic harness as it will be for live customer calls, allowing it to specialize to the task.

SFT, GRPO, and DPO

The initial supervised fine-tuning (SFT) stage uses high-quality teacher labels to warm up the model. But it is the iterative GRPO + DPO tuning stage that teaches the model to outperform generalist LLMs.

Combining GRPO and DPO has allowed us to get the best of both worlds. GRPO is super powerful, but it has a natural ceiling: when the model consistently fails at something, there's nothing good to reinforce. Mixing DPO examples helps to fill that gap. When the model struggles, our reward system generates a gold-standard response and pairs it against an inferior generation, showing the model exactly what good looks like. This combination has proven particularly effective for multilingual naturalness, where even subtle phrasing errors make a response feel unnatural to a native speaker.
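The pairing mechanic above can be sketched as the standard DPO objective: the gold-standard response is the "chosen" side, the inferior generation is the "rejected" side, and the loss rewards the policy for ranking them correctly relative to a frozen reference model. This is a minimal single-pair sketch, not PolyAI's actual training code, and the log-probability inputs are assumed to come from elsewhere in the pipeline.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response
    under the current policy or the frozen reference model.
    """
    # Implicit rewards are the policy/reference log-ratios.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    # -log sigmoid(margin): push the gold response above the inferior one.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probabilities the margin is zero and the loss is log 2; as the policy assigns more probability to the gold response, the loss falls, which is exactly the "clear signal for which direction to move" the pairing provides.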

Wrangling the rewards system

We optimize for multiple metrics that can sometimes conflict or interact in non-linear ways. For example, the style reward is only valid for text outputs (not for tool calls), and it is not always clear whether training should compromise on style to improve instruction following. We also teach the model to reason efficiently, but this risks the degenerate solution where the model learns to never reason. Multiple gated losses help us avoid these issues: we can pull (or not pull) the model in multiple directions for a single training example. We have become engineers of reward functions and GRPO loss combinations, battling LLMs that will take any opportunity to reward hack.
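The gating idea can be illustrated with a toy reward combiner. The field names and weights here are invented for illustration; the point is that each component reward is switched on or off per example, so a tool-call turn never pays a style penalty and the reasoning-efficiency term only fires when the turn actually needed reasoning.

```python
def combined_reward(sample):
    """Toy gated-reward combiner (all field names are illustrative).

    Each gate decides whether a component applies to this sample,
    so one example can pull the model in several directions at once.
    """
    total = sample["instruction_reward"]  # always applies
    # Style reward is only meaningful for spoken/text turns, not tool calls.
    if not sample["is_tool_call"]:
        total += sample["style_reward"]
    # Reward short reasoning, but only on turns that needed reasoning --
    # otherwise the degenerate optimum is to never reason at all.
    if sample["needed_reasoning"]:
        total += max(0.0, 1.0 - sample["reasoning_tokens"] / 40.0)
    return total
```

A tool-call sample collects only the instruction reward; a spoken turn that reasoned in 20 tokens also collects the style reward plus half of the efficiency bonus.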

We also introduced automatic penalties for failure patterns that only a team building voice AI would think to catch: sycophantic openers, responses that echo earlier messages verbatim, and ambiguous gender terms in languages like Polish and Spanish, where a wrong choice sounds immediately off to a native ear.
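Two of those checks are simple enough to sketch. The opener list below is invented; real detectors for patterns like gender agreement would be language-specific and far more involved.

```python
def failure_penalties(response, history,
                      openers=("great question", "i'm sorry to hear")):
    """Illustrative automatic penalties (the opener list is made up).

    `history` holds earlier assistant turns from the same conversation.
    """
    penalty = 0.0
    # Sycophantic opener check.
    lowered = response.lower()
    if any(lowered.startswith(o) for o in openers):
        penalty += 1.0
    # Verbatim echo of an earlier assistant message.
    if any(response.strip() == turn.strip() for turn in history):
        penalty += 1.0
    return penalty
```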

The result is a model that has been trained to a far higher standard.

“During training, when the model consistently fails at something, our reward system rewrites the output into a gold-standard response and pairs it against the original — showing the model what it should be doing and giving it a clear signal for which direction to move.”

- Matt Henderson, VP of Research, PolyAI

New capabilities

Beyond the core quality improvements, Raven 3.5 introduces several new capabilities worth highlighting.

Auto-reasoning

Raven 3.5 now decides for itself when a turn requires deeper deliberation and when it can respond directly. On average, it reasons on around half of turns, usually producing fewer than 40 tokens of internal reasoning in a few hundred milliseconds. It will also auto-reason during the agentic loop at points where the assistant has already started speaking a partial response. This allows Raven 3.5 to respond more accurately to complex requests, like calculating a booking date three weeks from now, resolving an ambiguous instruction, or handling an edge case, without adding latency to simpler turns. Surprisingly, training the model to reason improves its responses even when we disable reasoning at inference time.
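On the serving side, auto-reasoning means a response may or may not carry a hidden reasoning trace that must be stripped before anything reaches text-to-speech. A minimal sketch, assuming `<think>…</think>` delimiters (Raven's actual delimiters are internal):

```python
def split_reasoning(raw_output, open_tag="<think>", close_tag="</think>"):
    """Separate an optional hidden reasoning trace from the spoken reply.

    The tag names are assumptions for illustration. Returns
    (reasoning_or_None, reply).
    """
    if raw_output.startswith(open_tag) and close_tag in raw_output:
        end = raw_output.index(close_tag)
        reasoning = raw_output[len(open_tag):end].strip()
        reply = raw_output[end + len(close_tag):].strip()
        return reasoning, reply
    # The model chose not to reason on this turn.
    return None, raw_output.strip()
```

Turns without a trace pass straight through, which is how the simpler half of turns avoids any added latency.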

Webchat support

Raven 3.5 works across both voice and text channels from a single model, making it straightforward to deploy consistent, low-latency AI across your contact center and digital channels simultaneously. Auto-reasoning is enabled for webchat use cases.

Out-of-domain detection

When a caller asks something outside the agent's scope, Raven 3.5 flags it in its output with a special token. Beyond improving individual responses and preventing hallucinations, this signal can be tracked over time to surface knowledge base gaps, giving your team a clear view of where content needs to be added or updated.
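A downstream handler might watch for that marker along these lines. The token string here is an assumption for illustration; the actual token and its position in the output are internal to Raven.

```python
def check_out_of_domain(raw_output, ood_token="<out_of_domain>"):
    """Detect and strip a hypothetical out-of-domain marker token.

    Returns (is_ood, cleaned_reply). The is_ood flag can be logged
    over time to surface knowledge base gaps.
    """
    if ood_token in raw_output:
        return True, raw_output.replace(ood_token, "").strip()
    return False, raw_output.strip()
```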

Emotion tags for TTS

Raven 3.5 can be prompted to annotate responses with emotional context — [apologetic], [friendly], [informative] — for use with compatible text-to-speech models. A small detail with meaningful impact on how a response lands with a caller.
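A TTS integration would peel that leading tag off before synthesis. This is a sketch assuming the bracketed format shown above; the tag vocabulary and exact placement are configurable via prompting.

```python
import re

def extract_emotion(response):
    """Pull a leading [emotion] tag off a response for a compatible TTS engine.

    Returns (emotion_or_None, text_to_speak). Assumes the bracketed
    tag format from the examples above.
    """
    match = re.match(r"\[(\w+)\]\s*(.*)", response, re.DOTALL)
    if match:
        return match.group(1), match.group(2)
    return None, response
```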

This is what purpose-built looks like

Raven 3.5 is the next step in a thesis we've held since we started building our own model: a domain expert, trained with purpose, will always outperform a generalist at the job it was built for. Generalist models are improving. So is our understanding of what live customer service conversations actually demand, and we build that understanding into every stage of training.

The post-training recipe is where the work happens, and we're still finding new ways to push it. Every round of training surfaces new tensions to resolve, new failure modes to penalize, new places where a targeted DPO pair or a shaped reward can close a gap. We will continue to make our evaluations more challenging, take on new use cases, and maintain our lead versus the generalists.

Next on our roadmap is Raven Omni, an audio-native model that fuses speech recognition directly with the LLM. Instead of piping audio through a separate ASR system and handing text to the language model, Raven Omni applies the power of the LLM directly to speech understanding. The model hears what the caller actually said, not what a speech recognizer thinks they said.

Raven 3.5 is available now. To see what it looks like in action, get in touch.

Thank you to Paula Czarnowska, who led the development of Raven 3.5.