To build truly intelligent voice agents, we need to move beyond static preferences and let models learn from experience.
Stepping (back) into the era of experience
In The Era of Experience, David Silver and Richard Sutton argue that AI development is entering a new phase, where AI systems learn not by imitating humans, but by interacting with the world, learning from trial and error, and discovering strategies that go beyond human demonstrations.
We think they’re right. And at PolyAI, we’re already putting that philosophy into practice.
Today, most fine-tuned language models are improved through reinforcement learning from human feedback (RLHF) – combining supervised fine-tuning (SFT) with preference optimization methods like direct preference optimization (DPO). In that setup, a single assistant response is treated as an action, and the model is optimized to make each response align with human preferences or example outputs.
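To make that framing concrete, a typical turn-level preference example pairs one prompt with a preferred and a rejected response, and the model is scored on that single turn in isolation. The example below is purely illustrative, not taken from any real dataset:

```python
# Illustrative turn-level (DPO-style) preference example: one assistant turn
# is judged on its own, with no signal about how the conversation ends.
turn_level_example = {
    "prompt": "User: I'd like to move my appointment to Friday.",
    "chosen": "Of course. Could I take the name the booking is under?",
    "rejected": "Appointments cannot be changed over the phone.",
}
```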
But when you’re building agentic systems – like task-oriented voice assistants – this turn-level framing is a poor proxy for success.
A single assistant response might look fine in isolation, but could still lead the conversation off course. Real outcomes unfold over multiple steps. And real goals – like making a booking, transferring money, or getting help with a product – aren’t achieved in one turn.
That’s why we’ve moved beyond turn-level rewards. With conversation-level reinforcement, the model is trained on full interactions. It gets a reward signal based on whether the user achieved their goal – whether the booking was made, the transfer succeeded, or the flow completed cleanly.
This has several major advantages:
- You can optimize directly for deterministic goals.
- Rewards are easier to label and agree on.
- Non-experts can help improve systems without understanding the internal logic.
This is the foundation of our reinforcement fine-tuning (RFT) approach – and the focus of this post.
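As a minimal sketch, a conversation-level reward might combine an outcome check with a couple of behavioral shaping terms. The checks and weights below are illustrative, not our production reward:

```python
def conversation_reward(conversation: list[dict], goal_achieved: bool) -> float:
    """Score a whole conversation rather than a single turn.

    `conversation` is the full list of turns as {"role", "content"} dicts, and
    `goal_achieved` is an outcome check (e.g. the booking really exists in the
    backend after the call). All weights here are illustrative.
    """
    # Primary signal: did the user get what they called for?
    reward = 1.0 if goal_achieved else 0.0

    assistant_turns = [t["content"] for t in conversation if t["role"] == "assistant"]

    # Shaping terms: prefer shorter conversations...
    reward -= 0.01 * len(assistant_turns)

    # ...and penalize the agent repeating itself verbatim.
    reward -= 0.1 * (len(assistant_turns) - len(set(assistant_turns)))

    return reward
```

The key point is that nothing in this signal refers to an individual turn: a response is only "good" insofar as the conversation it belongs to ends well.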
By the way, this idea isn’t new to us. Many of the core team at PolyAI met in Steve Young’s lab at Cambridge, where we trained conversational agents using full-dialogue reinforcement learning with simulated users. The systems were simpler, and the models were small. But the core belief that agents should learn from interaction has continued to shape how we approach this work (though we have no complaints about leaving POMDPs and Gaussian processes behind).
Now, with today’s large models and rich simulation environments, we’re finally able to scale that intuition into something powerful, practical, and production-ready.
What is reinforcement fine-tuning (RFT)?
Raven is PolyAI’s in-house large language model, purpose-built for powering agentic, voice-based conversational assistants. Unlike general-purpose LLMs, Raven is trained with domain-specific context, function calling, and multi-turn dialogue in mind, making it fast, reliable, and easy to integrate into production systems.
Reinforcement fine-tuning is the final stage of Raven’s training. After initial fine-tuning and preference alignment, we let the model engage in full, simulated conversations with a user simulator and then learn from the outcomes.
The model is rewarded for good conversational behavior, including:
- Completing flows successfully
- Calling the correct functions
- Avoiding repetition, confusion, or frustration
- Getting the user to their goal efficiently and naturally
Importantly, the reward comes at the level of the entire conversation. This lets the model learn high-level strategies and develop behaviors that turn-level optimization would miss entirely.
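Stripped to its essentials, the loop looks roughly like the sketch below. This is not the actual Raven training code: `policy`, `simulator`, `optimizer`, and the `conversation_reward` helper from the earlier sketch are stand-ins, and the update shown is a generic baseline-subtracted policy gradient rather than the specific algorithm we use.

```python
def rft_step(policy, simulator, optimizer, num_rollouts: int = 16):
    """One reinforcement fine-tuning update over full simulated conversations.

    Roll out complete dialogues against the user simulator, score each with a
    conversation-level reward, and push up the log-probability of
    conversations that scored above average.
    """
    rollouts = []
    for _ in range(num_rollouts):
        scenario = simulator.sample_scenario()             # goal, attitude, style, profile
        conversation = simulator.run_dialogue(policy, scenario)
        reward = conversation_reward(conversation, scenario.goal_achieved(conversation))
        rollouts.append((conversation, reward))

    # A simple mean baseline reduces variance across the batch of rollouts.
    baseline = sum(r for _, r in rollouts) / len(rollouts)

    loss = 0.0
    for conversation, reward in rollouts:
        advantage = reward - baseline
        # Conversation-level credit assignment: one scalar advantage is shared
        # by everything the policy generated in this dialogue.
        loss = loss - advantage * policy.log_prob(conversation)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```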
The user simulator
To make reinforcement learning work in conversational systems, you need a user simulator that’s not just realistic, but also tightly controllable.
Our simulator is trained on large volumes of PII-scrubbed logs from real deployments, but it doesn’t simply replay past interactions. Instead, each simulated user is generated from a structured set of parameters, allowing us to control the shape and style of the conversation. These inputs include:
- Domain Description: The product or service context (e.g. banking, travel, customer support).
- Goal: The user’s reason for calling, expected resolution, and flexibility around the outcome.
- Attitude: Varying degrees of cooperativeness, politeness, engagement, or frustration.
- Style: How the user communicates — concise vs. verbose, formal vs. casual, grammatical vs. colloquial.
- User Profile: Synthetic metadata like name, subscription plan, app version, or product type.
We do not train on real user profile data. Instead, we synthesize user information, enabling realistic personalization without compromising user privacy.
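For illustration, a simulated-user specification along these lines might look like the following. The field names and values are hypothetical, not our internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class SimulatedUser:
    """One simulated caller, defined by the parameters described above."""
    domain: str                  # product or service context
    goal: str                    # reason for calling and what counts as success
    attitude: str                # cooperativeness, patience, frustration, ...
    style: str                   # concise vs. verbose, formal vs. casual, ...
    profile: dict = field(default_factory=dict)  # fully synthetic metadata

caller = SimulatedUser(
    domain="retail banking",
    goal="transfer 200 GBP from savings to current account; flexible on timing",
    attitude="impatient but cooperative",
    style="short, colloquial answers",
    profile={"name": "Alex Murray", "plan": "premium", "app_version": "4.2"},
)
```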
Because we can systematically vary these inputs, we can:
- Simulate a wide range of scenarios
- Reproduce edge cases and failure modes
- Generate consistent environments for reward-based learning
This level of control is essential for training robust agent behavior. It allows us to pose challenges, test generalization, and evolve the simulation in tandem with the model — a core advantage over relying on static datasets alone.
Why most architectures can’t do RFT (and why ours can)
RAG-based agentic systems introduce a challenge for reinforcement fine-tuning: they often rely on dynamically constructed prompts that are rebuilt at every turn of the conversation.
In many implementations, the model’s input context is reconstructed step by step. This can include:
- New documents retrieved based on the most recent utterance
- Task-specific instructions injected on the fly
- Changing sets of available functions or tools
While this works fine for single-turn inference, it poses a challenge for conversation-level reinforcement learning:
- Each step has its own prompt, breaking the continuity of the interaction.
- There’s no clean trajectory to train over – you effectively get a different training sequence per turn.
- You can’t efficiently backpropagate the reinforcement learning loss through the full conversation.
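The sketch below is a deliberately simplified caricature of this pattern. Because the retrieved documents and tool list can change every turn, an N-turn conversation yields N unrelated prompts rather than one trajectory:

```python
def build_turn_prompt(system_instructions: str,
                      retrieved_docs: list[str],
                      tools: list[str],
                      history: list[str],
                      user_msg: str) -> str:
    """RAG-style prompt assembly, rebuilt from scratch at every turn.

    Because `retrieved_docs` and `tools` differ from turn to turn, each call
    produces a different prompt: there is no single sequence to assign a
    conversation-level reward to, and no stable prefix to cache.
    """
    return "\n\n".join([
        system_instructions,
        "Relevant documents:\n" + "\n".join(retrieved_docs),
        "Available tools:\n" + "\n".join(tools),
        "Conversation so far:\n" + "\n".join(history),
        "User: " + user_msg,
    ])
```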
At PolyAI, we’ve architected things differently.
Our Raven model is trained so that retrieval and function calls happen within the flow of the conversation, not as preambles inserted at every step. This enables us to:
- Preserve the conversation as a single, contiguous context window for learning
- Maintain efficient prefix caching during sampling and at inference time (for much lower latency)
- Cleanly apply conversation-level reward signals for reinforcement learning
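Schematically, the whole interaction stays one growing sequence, with retrieval results and function calls appearing as turns inside it. The real Raven formats are internal; the markers and roles below are made up for illustration:

```python
# One contiguous trajectory: retrievals and function calls are turns inside a
# single growing context, so the full conversation is one training sequence
# and every step shares the same cached prefix.
trajectory = [
    {"role": "user",      "content": "I'd like to change my flight to Friday."},
    {"role": "assistant", "content": "<call> lookup_booking(reference='ABC123')"},
    {"role": "function",  "content": "{'flight': 'LHR -> JFK', 'date': 'Wednesday'}"},
    {"role": "assistant", "content": "I can move you to Friday at the same time. Shall I confirm?"},
    {"role": "user",      "content": "Yes, please."},
    {"role": "assistant", "content": "<call> change_booking(reference='ABC123', new_date='Friday')"},
]
```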
As an aside: many developers don’t realize that OpenAI’s function-calling APIs place function definitions at the start of the prompt. This means that if your available functions change mid-conversation, your entire prefix cache becomes invalid – even if the rest of the prompt is identical. That’s a big hit to latency and efficiency in real-world deployments.
By contrast, Raven’s design allows tools, retrievals, and decisions to unfold naturally over time – which is critical for training agentic behavior with reinforcement learning.
Why conversation-level learning is a game changer
Agentic planning over imitation
Turn-level training teaches a model to imitate good local behavior. But conversation-level training teaches it to achieve outcomes.
By giving the model the freedom to explore full conversations and see the consequences of its actions, we enable it to:
- Plan ahead
- Handle multi-step tasks
- Learn to recover from errors
- Balance efficiency and helpfulness
Easier, more accurate feedback
It’s easier for humans, even non-experts, to evaluate full conversations than to label isolated steps. We can ask:
- Did the user get what they wanted?
- Did the agent resolve the request cleanly?
- Was the user frustrated or confused?
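Answers to questions like these can be collected from non-expert reviewers and collapsed into a scalar training signal. A deliberately simple version, with made-up weights, might look like this:

```python
def reward_from_review(goal_met: bool, resolved_cleanly: bool, user_frustrated: bool) -> float:
    """Collapse yes/no conversation-level judgments into one scalar reward."""
    reward = 1.0 if goal_met else 0.0            # did the user get what they wanted?
    reward += 0.25 if resolved_cleanly else 0.0  # was the request handled cleanly?
    reward -= 0.5 if user_frustrated else 0.0    # penalize frustration or confusion
    return reward
```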
This makes reward modeling more scalable and intuitive, and ensures alignment with real-world success.
What comes next: audio-LLMs and real-time learning
Reinforcement learning on simulated conversations is just the beginning. We’re now working on fusing ASR and LLMs into unified audio-LLMs that operate directly on speech inputs and outputs.
This opens up powerful new capabilities:
- Turn-taking learned directly from audio
- Dialogue policies that adapt to uncertainty and noise
- Multilingual understanding baked in
By training on real voice interactions, these models will learn the rhythm, timing, and nuances of spoken conversations – and bring the power of LLMs to audio-native experiences.
Conclusion
We believe the future of language models lies in interaction, not just imitation.
Reinforcement fine-tuning, especially at the conversation level, is our first real step toward agents that learn from experience, adapt their behavior, and achieve goals over time.
This work builds on foundations laid years ago in dialogue RL research, now amplified by today’s model capabilities and simulation tools.
In an upcoming post, we’ll share how we prepare models for this final stage, using fine-tuning and preference optimization to build the strong, steerable foundations that make RFT so effective.
This post was co-authored with Leo Bujdei-Leonte and Paula Czarnowska; thanks to them both for their contributions and hard work.