Building voice agents: Where engineering meets human judgment
Legacy IVRs don't hear urgency. Voice agents do. Here's how the engineering and design decisions behind them, from latency to personalization, come together to create conversations that actually work for customers.
Think about calling your bank because you've spotted a suspicious charge. Your heart's racing. You need help, and you need it now.
Twenty years ago, a human would have answered. They'd hear the urgency in your voice, skip the small talk, and get straight to freezing your card and sending out a new one. That interaction worked because they could read the situation and respond accordingly.
Today, you get an IVR with instructions to press 1 for account services, press 2 for fraud, or listen to the full menu of options. The panic you're feeling doesn't register, and the system can't hear it.
Voice agents change this. They combine the empathy and adaptability of that human operator with the scale and availability of automated systems. But getting there requires deliberately designing the capabilities that come naturally to us: understanding intent, reading emotional cues, and adjusting responses to match the moment.
What is agent design?
Building conversational AI used to mean wrestling with rigid logic trees. (Have you ever pressed 1 to get to another menu, where you press 5, which brings you to another menu, and you realize you should’ve pressed 4 instead of 5 in that other menu, so you now have to hang up and call back?) Now, agent designers wrangle the intricacies of human speech to deliver natural conversations.
Agent design goes beyond writing scripts. Designers craft entire experiences, including how a customer service agent reasons, responds, and adapts to complex customer needs. It’s problem-solving at the intersection of language, psychology, and technology.
Get the foundation right
Getting voice agents to complete the tasks they're given requires teams of interaction specialists, prompt engineers, and agent designers dedicated to making conversations effortless. Their focus is matching the infrastructure with the right technology, then layering in obsessive attention to detail to help customers reach a solution efficiently. Our tech stack needs to accurately interpret intent, integrate with backend APIs, and execute transactions.
Ask the right design questions
Once the tech works, the design questions begin:
- How do we get information from the user? If we need a phone number, email address, or account details, what's the most reliable way to collect them?
- What order keeps them engaged? Sequencing matters. If you ask for too much too fast, people disengage. When we get the flow right, the conversation feels natural.
- When we give information back, what's the best way to do it? A 60-second explanation over the phone is easily forgotten. But sending an SMS with a link gives the user the option to resolve the issue step by step at their convenience. We have to match the channel to the content.
- How do we connect the conversation to back-end systems? The agent needs to pull data from CRMs, trigger workflows, and hand off context seamlessly. That middle layer has to be invisible to the user.
Adding delight to user interactions
Once the foundation is solid, we layer on what makes interactions feel human, what our agent designers call adding delight. Four elements transform a working agent into one people want to talk to:
1. Voice quality
Voice agents have to sound amazing. People open up to something that sounds familiar, friendly, and capable. But the tone has to match the situation. An agent booking a routine appointment can be warm and upbeat. An agent handling a billing dispute needs to sound steady and solution-focused. If you get the tone wrong, you break trust immediately.
Getting this right means applying:
- Natural intonation and timing: Pauses or unnatural rhythms can make conversations feel disjointed.
- Appropriate tone and register: Not so formal it feels cold, not so casual it feels socially unaware. This also varies across languages and cultures, which is why localization and expertise in target audiences matter.
- Cultural and linguistic adaptation: Each language and context requires its own nuance.
2. Latency
In human conversation, you can jump in to correct someone or ask a clarifying question. A two-second delay kills the moment, and the conversation feels broken. We aim for roughly one second from when someone stops speaking to when a voice agent responds. Anywhere between one second and 1.2 seconds feels natural. Push past 1.5 seconds, and people notice. That 200-300 millisecond difference changes how the conversation feels.
That one second gets divided across:
- Detecting when the user finished speaking
- Transcribing their words
- Processing intent
- Executing API calls
- Generating a response
- Converting it to speech
Keeping all of these within one second requires optimization at every step.
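The turn budget above can be sketched as a simple timing harness. This is an illustrative sketch, not PolyAI's actual pipeline: the stage names mirror the list above, and the `time.sleep` stubs stand in for real components.

```python
import time

# Illustrative turn-latency budget: the ~1 second window between the end
# of user speech and the start of the agent's reply, split across stages.
TURN_BUDGET_MS = 1000

def run_turn(stages):
    """Run pipeline stages in order, recording elapsed milliseconds per stage."""
    timings = {}
    start = time.perf_counter()
    for name, fn in stages:
        t0 = time.perf_counter()
        fn()  # in a real system, each stage would do actual work
        timings[name] = (time.perf_counter() - t0) * 1000
    total_ms = (time.perf_counter() - start) * 1000
    return timings, total_ms

# Hypothetical stage stubs standing in for real components.
stages = [
    ("endpoint_detection",  lambda: time.sleep(0.01)),
    ("transcription",       lambda: time.sleep(0.02)),
    ("intent_processing",   lambda: time.sleep(0.01)),
    ("api_calls",           lambda: time.sleep(0.02)),
    ("response_generation", lambda: time.sleep(0.02)),
    ("text_to_speech",      lambda: time.sleep(0.01)),
]

timings, total_ms = run_turn(stages)
over_budget = total_ms > TURN_BUDGET_MS
```

In practice the stages overlap (streaming transcription and speech synthesis run concurrently with generation), which is part of how the whole turn stays under a second.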
3. Conversational language
Old IVRs use overly formal, stilted language that signals you're talking to a robot. We write the way people actually talk.
General-purpose LLMs, like those behind ChatGPT, often default to wordy, overly formal phrasing that feels cold or unfamiliar. These models are trained to be all-purpose text generators. They aim for a tone that works everywhere, which means a somewhat impersonal, formal style. They're generally neutral and noncommittal in word choice so as not to seem too flashy or opinionated.
That approach works for long-form text generation. It doesn't work for voice conversations. When someone is navigating a task by speaking out loud, they need an agent that sounds present and engaged. Formal, cautious language creates distance. Conversational language builds trust, which is why PolyAI’s LLM, Raven, is built specifically for customer conversations, where responses need to be accurate, robust, and fast.
4. Personalized openings
People are rarely calling customer service because something went well. If someone's already frustrated, how you open the conversation matters.
If we can look up their account based on phone number, we can start with context: "I can see your order's out for delivery. Is that what you're calling about, or is there something else I can help with?"
Compare that to: "Thank you for calling. Please select from the following options."
One treats the user like a person. The other treats them like a ticket number. The difference shows up in engagement and resolution rates.
Designing for voice vs. chat
Everything we've covered so far focuses on voice, but agents need to work across multiple channels. How we present the same information changes based on the channel.
Customers typically want to spend as little time on the phone as possible, so being concise is best in most situations. Too much information at once can overwhelm the user. If the agent delivers a 40-second walkthrough without pausing for breath, the caller will have forgotten the first step by the time it finishes speaking.
Chat allows for longer paragraphs. You can use bullet points, paste links, and give more information up front because users can read at their own pace.
| Chat interaction | Voice interaction |
|---|---|
| To apply for a loan online: 1. Go to our website and click 'Apply now.' 2. Create an account or log in 3. Complete the application form with your income, employment details, and the requested loan amount 4. Upload proof of income 5. Submit, and we'll review within 24-48 hours [Apply now] | “Could you tell me your account number, please?” “It’s 4782-9301.” “And what's the loan amount you're looking for?” “About fifteen thousand.” |
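The contrast in the table comes down to rendering the same content differently per channel. A minimal sketch of that idea, assuming a hypothetical `render` function and channel names (not PolyAI's actual API):

```python
# One piece of knowledge, rendered per channel. Illustrative only:
# the steps are paraphrased from the loan example above.
LOAN_STEPS = [
    "Go to our website and click 'Apply now'",
    "Create an account or log in",
    "Complete the application form",
    "Upload proof of income",
    "Submit and wait for review (24-48 hours)",
]

def render(channel, steps):
    if channel == "chat":
        # Chat can show the whole numbered list at once; users read at their own pace.
        return "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    if channel == "sms":
        # SMS sends a link rather than a wall of text (placeholder link).
        return "You can apply here: <link>"
    if channel == "voice":
        # Voice walks through one step at a time and checks in with the caller.
        return f"First, {steps[0].lower()}. Shall I continue?"
    raise ValueError(f"unknown channel: {channel}")
```

The content lives in one place; only the presentation layer changes.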
Forward-deployed agent design
All of these design decisions—voice quality, latency, channel optimization—come from working directly with clients in real environments.
Our team works directly with clients, end to end. We visit call centers, talk to stakeholders, and listen to real calls before building anything. When clients come to us, they don't always know what they want the agent to say or do.
The process is collaborative problem-solving. We:
- Ask questions and clarify the scope
- Agree explicitly on what the agent should handle
- Make recommendations based on what we see: collect this piece of information first, validate based on this field, reformat your data if the current structure won't give good performance
Being explicit matters. Clients need to trust us, and clarity builds that trust.
The longer the partnership, the deeper the integration. Enterprises we work with often start with one use case, then expand. The focus is on the full customer journey, not just the minute and a half someone spends on a call.
This hands-on work feeds what we build. Insights from real conversations become features everyone can use. To make that possible at scale, we built Agent Studio.
Making agent design repeatable with Agent Studio
Agent Studio turns agent design into a scalable process that enterprises can manage in partnership with our teams.
Your teams can upload a knowledge base once, and it becomes a single source of truth across every channel. Builders can connect multiple knowledge sources (files, URLs, and integrated systems like Zendesk or Gladly) directly into the agent’s knowledge base with no need for manual restructuring or duplication.
One knowledge base adapts to whatever channel the customer chooses, making multimodal support possible. The information stays consistent, but the presentation changes based on whether someone's using voice, chat, or SMS. The same answer about loan applications becomes a step-by-step phone walkthrough, a formatted checklist in chat, or an SMS link.
This separation also makes multilingual support scalable. The same content works across languages, but the phrasing and cultural nuance adapt automatically.
The platform handles the structure. But to make conversations actually sound natural, we needed a model built specifically for customer service interactions.
Raven: The LLM custom-made for voice customer service
Most LLMs on the market were trained on chat data. The easiest way to spot one is wordiness. Long, structured answers that read well on screen can feel unnatural over the phone.
PolyAI agent designers work closely with our research team to train our in-house model, Raven, specifically for voice conversations. We documented the linguistic principles of what sounds natural, identified how popular LLMs violate those principles, and used that to train a model optimized for voice and chat.
Training our own model gives us control that external models can't provide. We can fine-tune for the exact performance characteristics we need — optimizing for voice-first conversations rather than chat. We can run models more efficiently, which matters when every millisecond counts. And we can make choices about what to prioritize rather than being constrained by what a research lab (that didn’t have our specific use case in mind) decided was important. The biggest, newest models aren't always the right answer for real-time voice applications. Having the right model is only part of the equation. How we deploy it matters just as much.
Voice-first architecture
When someone calls, we have to be ready for anything.
A restaurant might expect: "Is this Main Street Bistro?"
But they might get: "Hey, do you have a table in 15 minutes?"
We need to handle both from the first moment. That means our speech recognition and conversational AI can't make opinionated assumptions too early. We need solid baseline technology with good defaults, then use data from live conversations to understand common patterns without ignoring edge cases.
As the conversation progresses, we can get more precise. Once someone says, "I want to book a table," we know we'll need party size, time, and date. Now we can deploy specialized models and tools optimized for exactly that task.
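The progression from general handling to task-specific precision can be sketched as a slot-filling flow. The intent name, slots, and prompts below are illustrative assumptions, not PolyAI's implementation:

```python
# Sketch of progressive specialization: start general, then switch to a
# task-specific slot-filling flow once the booking intent is known.
BOOKING_SLOTS = ["party_size", "time", "date"]

QUESTIONS = {
    "party_size": "How many people will be joining you?",
    "time": "What time would you like?",
    "date": "Which day works for you?",
}

def next_prompt(intent, filled):
    """Return the next thing the agent should say, given the detected
    intent and the slots collected so far (a dict of slot -> value)."""
    if intent != "book_table":
        # Before intent is known, stay open-ended: no early assumptions.
        return "How can I help you today?"
    # Intent known: ask only for whichever required slot is still missing.
    for slot in BOOKING_SLOTS:
        if slot not in filled:
            return QUESTIONS[slot]
    return "Great, your table is booked."
```

A real system would also handle callers who volunteer several slots in one utterance ("table for four at seven tomorrow"), filling multiple slots per turn.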
When engineering and agent design unite
Agent designers identify what needs to happen in the conversation. Engineers determine the best technical approach to make it happen. For example, designers know that collecting an email address over a poor phone connection creates friction. Engineers know that phone numbers are much easier to collect accurately. Together, they redesign the flow by collecting the phone number, using it to look up the user's profile in the CRM, and skipping the email step entirely. The conversation changes slightly, but the outcome improves significantly.
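The redesigned flow above (phone number in, CRM profile out, email step skipped) could be sketched like this. The CRM dictionary and fallback message are hypothetical stand-ins:

```python
# Look the caller up by phone number (easy to collect accurately)
# instead of asking for an email address (hard over a bad line).
FAKE_CRM = {"+15551234567": {"name": "Alex", "email": "alex@example.com"}}

def identify_caller(phone_number):
    """Return (profile, fallback_message). Exactly one is None."""
    profile = FAKE_CRM.get(phone_number)
    if profile:
        # Found: the email step is skipped entirely.
        return profile, None
    # Not found: switch channels rather than spelling an email
    # address letter by letter over the phone.
    return None, "I'll text you a secure link to confirm your details."

profile, fallback = identify_caller("+15551234567")
```

The conversation changes only slightly for the caller, but the error-prone step disappears.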
It works the other way, too. Designers tell engineering what experience they need to create. Engineers explain what the technology can and can't do reliably. Then both teams figure out the best path forward: sometimes that means adjusting the conversation design, sometimes engineering builds new capabilities, and sometimes it means going back to the client to recommend they reformat their data. That’s why we play to the strengths of the technology instead of forcing the conversation.
Meeting customers where they are
Getting an email address over the phone with a bad connection is challenging, even for humans: spelling letter by letter, repeating, and confirming. The best outcome might be sending an SMS the customer can fill out themselves. The goal is to play to the strengths of the technology for each specific use case, which could mean swapping in settings optimized for collecting numbers versus email addresses. We change channels to meet customers where they are.
Sometimes you can avoid collecting an email entirely. If we can identify the user's profile in the CRM using their phone number or account number, which are much easier to collect accurately, we skip the friction altogether. Small tweaks to conversation flow create measurably better outcomes.
Designing conversations that work in the real world
Agent design requires equal measures of engineering discipline and human judgment. You have to understand how conversations unfold, where they break, and what makes people feel heard. Then translate that understanding into systems that perform reliably, at scale, in real time.
At PolyAI, we approach agent design as a craft: training models specifically for conversation and optimizing for natural human pacing.
The goal is simple: When someone calls, they get what they need without fighting the system. When that happens, the technology fades into the background, and the conversation just works.
Make every customer feel heard. Instantly. Speak to our team today.