Building voice agents: Where engineering meets human judgment

Legacy IVRs don't hear urgency. Voice agents do. Here's how the engineering and design decisions behind them, from latency to personalization, come together to create conversations that actually work for customers.

Tom Haynes Senior Content Manager
9 min

Think about calling your bank because you've spotted a suspicious charge. Your heart's racing. You need help, and you need it now.

Twenty years ago, a human would have answered. They'd hear the urgency in your voice, skip the small talk, and get straight to freezing your card and sending out a new one. That interaction worked because they could read the situation and respond accordingly.

Today, you get an IVR with instructions to press 1 for account services, press 2 for fraud, or listen to the full menu of options. The panic you're feeling doesn't register, and the system can't hear it.

Voice agents change this. They combine the empathy and adaptability of that human operator with the scale and availability of automated systems. But getting there requires deliberately designing the capabilities that come naturally to us: understanding intent, reading emotional cues, and adjusting responses to match the moment.

What is agent design?

Building conversational AI used to mean wrestling with rigid logic trees. (Have you ever pressed 1 to get to another menu, where you press 5, which brings you to another menu, and you realize you should’ve pressed 4 instead of 5 in that other menu, so you now have to hang up and call back?) Now, agent designers wrangle the intricacies of human speech to deliver natural conversations.

Agent design goes beyond writing scripts. Designers craft entire experiences, including how a customer service agent reasons, responds, and adapts to complex customer needs. It’s problem-solving at the intersection of language, psychology, and technology.

Get the foundation right

Getting voice agents to complete the tasks they are set requires teams of interaction specialists, prompt engineers, and agent designers dedicated to making conversations effortless. Their focus is to match infrastructure with the right technology, then layer in obsessive attention to detail to help customers reach a solution efficiently. Our tech stack needs to accurately interpret intent, integrate with backend APIs, and execute transactions.

Ask the right design questions

Once the tech works, the design questions begin:

  • How do we get information from the user? If we need a phone number, email address, or account details, what's the most reliable way to collect them?
  • What order keeps them engaged? Sequencing matters. If you ask for too much too fast, people disengage. When we get the flow right, the conversation feels natural.
  • When we give information back, what's the best way to do it? A 60-second explanation over the phone is easily forgotten. But sending an SMS with a link gives the user the option to resolve the issue step by step at their convenience. We have to match the channel to the content.
  • How do we connect the conversation to back-end systems? The agent needs to pull data from CRMs, trigger workflows, and hand off context seamlessly. That middle layer has to be invisible to the user.
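The third question above, matching the channel to the content, can be illustrated with a rough rule: estimate how long an answer would take to say aloud, and move it to SMS when it runs long. This is a minimal sketch and not a description of how PolyAI makes the decision. The speaking rate, the 20-second cutoff, and the function names are all assumptions made for illustration.

```python
# Hypothetical values: tune both against real call data.
WORDS_PER_SECOND = 2.5      # assumed speaking rate, about 150 words per minute
MAX_SPOKEN_SECONDS = 20.0   # assumed cutoff before an answer is "too long to say"


def estimated_spoken_seconds(text: str) -> float:
    """Rough estimate of how long the text takes to read aloud."""
    return len(text.split()) / WORDS_PER_SECOND


def choose_delivery(text: str) -> str:
    """Return 'speak' for short answers and 'sms_link' for long ones."""
    if estimated_spoken_seconds(text) > MAX_SPOKEN_SECONDS:
        return "sms_link"
    return "speak"


print(choose_delivery("Your card is frozen and a new one is on its way."))
```

A word count is a crude proxy for speaking time, but it is enough to flag the kind of 60-second explanation that callers are unlikely to remember.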

Adding delight to user interactions

Once the foundation is solid, we layer on what makes interactions feel human, what our agent designers call adding delight. Four elements transform a working agent into one people want to talk to:




Designing for voice vs. chat

Everything we've covered so far focuses on voice, but agents need to work across multiple channels. How we present the same information changes based on the channel.

Customers typically want to spend as little time on the phone as possible, so being concise is best in most situations. Too much information at once can overwhelm the user. If you try to give a 40-second walkthrough without pausing for breath, they'll forget the first step by the time the agent finishes speaking.

Chat allows for longer paragraphs. You can use bullet points, paste links, and give more information up front because users can read at their own pace.

Chat interaction

To apply for a loan online:

  1. Go to our website and click 'Apply now.'
  2. Create an account or log in
  3. Complete the application form with your income, employment details, and the requested loan amount
  4. Upload proof of income
  5. Submit, and we'll review within 24-48 hours

[Apply now]

Voice interaction

“Could you tell me your account number, please?”
“It’s 4782-9301.”
“And what's the loan amount you're looking for?”
“About fifteen thousand.”

Forward-deployed agent design

All of these design decisions—voice quality, latency, channel optimization—come from working directly with clients in real environments.

Our team works directly with clients, end to end. We visit call centers, talk to stakeholders, and listen to real calls before building anything. When clients come to us, they don't always know what they want the agent to say or do.

The process is collaborative problem-solving. We:

  • Ask questions and clarify the scope
  • Agree explicitly on what the agent should handle
  • Make recommendations based on what we see: collect this piece of information first, validate based on this field, reformat your data if the current structure won't give good performance

Being explicit matters. Clients need to trust us, and clarity builds that trust.

The longer the partnership, the deeper the integration. Enterprises we work with often start with one use case, then expand. The focus is on the full customer journey, not just the minute and a half someone spends on a call.

This hands-on work feeds what we build. Insights from real conversations become features everyone can use. To make that possible at scale, we built Agent Studio.



Making agent design repeatable with Agent Studio

Agent Studio turns agent design into a scalable process that enterprises can manage in partnership with our teams.

Your teams can upload a knowledge base once, and it becomes a single source of truth across every channel. Builders can connect multiple knowledge sources (files, URLs, and integrated systems like Zendesk or Gladly) directly into the agent’s knowledge base with no need for manual restructuring or duplication.

One knowledge base adapts to the channel the customer chooses to make multimodal support possible. The information stays consistent, but the presentation changes based on whether someone's using voice, chat, or SMS. The same answer about loan applications becomes a step-by-step phone walkthrough, a formatted checklist in chat, or an SMS link.
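To make the idea of one source and many presentations concrete, here is a minimal sketch of a single channel-neutral entry rendered three ways. It is not Agent Studio's API. The `KnowledgeEntry` class, the `render` function, the channel names, and the example URL are hypothetical and exist only to show the separation of content from presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeEntry:
    """One channel-neutral answer: a title, ordered steps, and a link."""
    title: str
    steps: tuple[str, ...]
    url: str


def render(entry: KnowledgeEntry, channel: str) -> list[str]:
    """Return the messages (or spoken turns) for the given channel."""
    numbered = list(enumerate(entry.steps, start=1))
    if channel == "chat":
        # Chat can show everything at once as a numbered checklist.
        lines = [f"{i}. {step}" for i, step in numbered]
        return ["\n".join([entry.title + ":", *lines, entry.url])]
    if channel == "voice":
        # Voice delivers one step per turn so the caller can keep up.
        return [f"Step {i}: {step}" for i, step in numbered]
    if channel == "sms":
        # SMS stays short and points to the full instructions.
        return [f"{entry.title}: {entry.url}"]
    raise ValueError(f"unsupported channel: {channel}")


loan = KnowledgeEntry(
    title="To apply for a loan online",
    steps=(
        "Go to our website and click 'Apply now'",
        "Create an account or log in",
        "Upload proof of income",
    ),
    url="https://example.com/apply",  # placeholder URL
)

for channel in ("chat", "voice", "sms"):
    print(channel, render(loan, channel))
```

Because the entry itself never mentions a channel, updating a step in one place changes the chat checklist and the phone walkthrough together.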

This separation also makes multilingual support scalable. The same content works across languages, but the phrasing and cultural nuance adapt automatically.

The platform handles the structure. But to make conversations actually sound natural, we needed a model built specifically for customer service interactions.



Raven: The LLM custom-made for voice customer service

Most LLMs on the market were trained on chat data. The easiest way to spot one is wordiness. Long, structured answers that read well on screen can feel unnatural over the phone.

PolyAI agent designers work closely with our research team to train our in-house model, Raven, specifically for voice conversations. We documented the linguistic principles of what sounds natural, identified how popular LLMs violate those principles, and used that to train a model optimized for voice and chat.

Voice-first architecture

When someone calls, we have to be ready for anything.

A restaurant might expect: "Is this Main Street Bistro?"

But they might get: "Hey, do you have a table in 15 minutes?"

We need to handle both from the first moment. That means our speech recognition and conversational AI can't make opinionated assumptions too early. We need solid baseline technology with good defaults, then use data from live conversations to understand common patterns without ignoring edge cases.

As the conversation progresses, we can get more precise. Once someone says, "I want to book a table," we know we'll need party size, time, and date. Now we can deploy specialized models and tools optimized for exactly that task.
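The narrowing described above resembles what is commonly called slot filling: once the intent is known, the agent tracks which details are still missing and asks only for those. The sketch below is one simple way to express that, under stated assumptions. The intent name, slot names, and prompts are hypothetical, and a production system would extract slot values with speech and language models instead of receiving them in a dictionary.

```python
from typing import Optional

# Hypothetical mapping from intent to the details it requires, in asking order.
REQUIRED_SLOTS = {
    "book_table": ("party_size", "date", "time"),
}

# Hypothetical prompts, one per slot.
PROMPTS = {
    "party_size": "How many people will be joining?",
    "date": "What day would you like to come in?",
    "time": "What time works for you?",
}


def next_prompt(intent: str, filled: dict[str, str]) -> Optional[str]:
    """Return the question for the first missing slot, or None when done."""
    for slot in REQUIRED_SLOTS.get(intent, ()):
        if slot not in filled:
            return PROMPTS[slot]
    return None


# "Hey, do you have a table in 15 minutes?" already supplies the time,
# so the agent should not ask for it again.
print(next_prompt("book_table", {"time": "in 15 minutes"}))
```

Skipping details the caller has already volunteered is what keeps the exchange from feeling like a form read aloud.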

When engineering and agent design unite

Agent designers identify what needs to happen in the conversation. Engineers determine the best technical approach to make it happen. For example, designers know that collecting an email address over a poor phone connection creates friction. Engineers know that phone numbers are much easier to collect accurately. Together, they redesign the flow by collecting the phone number, using it to look up the user's profile in the CRM, and skipping the email step entirely. The conversation changes slightly, but the outcome improves significantly.

It works the other way, too. Designers tell engineering what experience they need to create. Engineers explain what the technology can and can't do reliably. Then both teams figure out the best path forward. Sometimes that means adjusting the conversation design, sometimes it means engineering builds new capabilities, and sometimes it means going back to the client to recommend they reformat their data. That’s why we play to the strengths of the technology instead of forcing the conversation.

Meeting customers where they are

Getting an email address over the phone with a bad connection is challenging, even for humans: it means spelling letter by letter, repeating, and confirming. The best outcome might be sending an SMS where the customer can fill it out themselves. The goal is to play to the strengths of the technology for each specific use case, which could mean switching between settings optimized for collecting numbers and settings optimized for email addresses. We change channels to meet customers where they are.

Sometimes you can avoid collecting an email entirely. If we can identify the user's profile in the CRM using their phone number or account number, which are much easier to collect accurately, we skip the friction altogether. Small tweaks to conversation flow create measurably better outcomes.
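The flow change described here can be sketched as a lookup followed by a fallback: try to find the caller's profile by phone number, and only hand the email step to SMS when nothing is on file. This is an illustrative sketch and not PolyAI's implementation. The in-memory `CRM` dictionary stands in for a real CRM integration, and the function names and return values are hypothetical.

```python
# Hypothetical in-memory stand-in for a CRM keyed by normalized phone number.
CRM = {
    "+15550100199": {"name": "Alex", "email": "alex@example.com"},
    "+15550100142": {"name": "Sam", "email": None},
}


def normalize_phone(raw: str) -> str:
    """Keep digits only and add a leading '+', so formats compare equal."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return "+" + digits


def plan_email_step(caller_phone: str, crm: dict) -> str:
    """Decide how to obtain the caller's email address."""
    profile = crm.get(normalize_phone(caller_phone))
    if profile and profile.get("email"):
        # The email is already on file, so the question is never asked.
        return "skip_email_step"
    # No usable profile: let the caller type the address in an SMS form.
    return "send_sms_form"


print(plan_email_step("+1 (555) 010-0199", CRM))
```

In both branches the caller avoids spelling an email address aloud, which is the friction the redesigned flow is meant to remove.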

Designing conversations that work in the real world

Agent design requires equal measures of engineering discipline and human judgment. You have to understand how conversations unfold, where they break, and what makes people feel heard. Then translate that understanding into systems that perform reliably, at scale, in real time.

At PolyAI, we approach agent design as a craft: we train models specifically for conversation and optimize for natural human pacing.

The goal is simple: When someone calls, they get what they need without fighting the system. When that happens, the technology fades into the background, and the conversation just works.

