Listening, understanding, and speaking are the foundations of effective communication.
But let’s face it: real conversations don’t always follow a simple, linear path. We jump back and forth in time, slowing down to expand on some moments and fast-forwarding through others, even editing what we say as we go.
For example, sometimes you ask someone a question and they respond with another question. That isn't necessarily a way to avoid answering; in fact, it's often a way to seek clarity in the hope of giving a better, more helpful answer. In other words, conversations trend towards fluidity and unpredictability.
Effective communication tends to rely on two actions:
- Clarifying: Asking questions to make sure everyone is on the same page.
- Empathizing: Showing understanding and concern for the other person’s feelings or perspective.
While we tend to think of ourselves as at odds with artificial intelligence, these same principles apply to human-machine communication. For AI to communicate effectively, it needs to display the ability to:
- listen (process inputs)
- understand (interpret meaning)
- speak (respond clearly and empathetically)
When conversational AI powers contact center agents, it needs to do more than listen, understand, and speak: it must also clarify user requests and empathize with users through engaging, helpful conversation.
By studying the way humans communicate and applying it to human-machine interactions, we can design AI agents that are humanlike, approachable, and able to communicate clearly. This helps soften inherent skepticism around AI and build trust in automated systems.
Below, we’ll cover what it takes for an AI agent to hear customers accurately, clarify when needed, and build trust from the very first word.
Factors that affect listening and AI speech accuracy
During any conversation, the ability to listen is impacted by multiple external factors. Is the environment loud? Does the person appear to be listening to you? Are there interruptions?
These factors similarly make automating human-machine conversations more challenging. For example:
- Phone lines cut out
- Outside talking, television, and other background noises make it difficult to hear
- Every person has a different way of speaking, such as having an accent or using slang
These challenges can lead to speech recognition errors that make it difficult for AI to capture spoken language accurately.
Challenges in listening: ASR and accuracy
AI agents rely on automatic speech recognition (ASR) systems to transcribe spoken language into text that large language models (LLMs) can process. Many ASR providers offer satisfactory out-of-the-box performance, but because these models are built for general use cases, they are rarely tailored to a business's specific needs and objectives. For instance, models trained for dictation or voicemail transcription may struggle with the realities of real-time contact center conversations, such as overlapping speech, background noise, and industry-specific terminology.
Another issue with out-of-the-box models is that they are often dialect-specific. Enterprises should not assume that all calls in the US use American English, and if they operate globally (as they often do), they certainly cannot rely on English always being the primary language spoken.
Even the best-performing model needs additional support to match the accuracy of human hearing. This is where spoken language understanding (SLU) comes in.
Spoken language understanding: The foundation of an effective AI agent
When you mishear what someone has said to you, you either ask the person to repeat themselves or you work out what they’ve said based on the context of the conversation.
If an AI agent keeps asking customers to repeat their queries, the experience becomes tedious, and trust in the system plummets.
Even with advanced ASR systems, errors can still occur. This is why SLU is a critical layer of AI communication.
SLU refers to a set of techniques used to recover the speaker's intended meaning from erroneous ASR transcriptions. For example, if a customer says, “A table for eight, please,” but the ASR mishears it as “a table for hate,” the SLU layer can use context, such as knowing that a party size should be a number and that “eight” is a common size for larger group reservations, to infer the correct intent and proceed with booking a table for eight people.
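To make this concrete, here is a minimal sketch in Python of how an SLU layer might combine dialogue context with a table of common mishearings. The function names, confusion pairs, and context rule are illustrative assumptions for this example, not any particular product's implementation:

```python
from typing import Optional

# Illustrative table of ASR confusions for number words.
CONFUSION_PAIRS = {"hate": "eight", "ate": "eight", "won": "one",
                   "to": "two", "too": "two", "tree": "three"}

WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def infer_party_size(transcript: str) -> Optional[int]:
    """Infer a party size from a (possibly misrecognized) transcript.

    Context rule: the token following "for" or "of" is treated as the slot
    where a number word is expected, so corrections are applied only there.
    """
    tokens = [t.strip(",.?!") for t in transcript.lower().split()]
    for i, token in enumerate(tokens[:-1]):
        if token in ("for", "of"):
            candidate = CONFUSION_PAIRS.get(tokens[i + 1], tokens[i + 1])
            if candidate in WORD_TO_INT:
                return WORD_TO_INT[candidate]
    return None

print(infer_party_size("A table for hate, please"))  # -> 8
```

A real SLU layer would draw on much richer signals (dialogue state, phonetics, LLM-based rescoring), but the principle is the same: use what the conversation expects to repair what the ASR heard.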
How AI agents understand and communicate effectively
AI agents use several techniques to understand what users want and respond clearly. The table below summarizes eight techniques that help AI agents extract important details, detect when someone is speaking, and adapt to specific terminology so that conversations stay smooth and natural.
| Technique | What it does |
| --- | --- |
| Entity extraction | The process of identifying key pieces of information (entities) from what a user says and using those pieces to understand and fulfill the request. |
| Voice activity detection (VAD) | VAD determines when someone is speaking and when there is silence. It helps the AI agent know when to start and stop listening, which is important for detecting when a person has finished talking and for reducing the chance of interruptions. |
| Lexicon customization | Tailoring the AI's vocabulary to words or terms relevant to a use case, like brand names or industry-specific jargon. For example, if the AI is used in healthcare, you might add terms like “telemedicine” or “cardiologist” to its lexicon to improve accuracy. |
| Model ensemble | A group of AI models working together to achieve better performance. Each model specializes in a specific task, and their different outputs are combined to produce more reliable and accurate results. Think of it as multiple experts collaborating to solve a problem. |
| Contextual ASR biasing | Providing context on what type of input an ASR model should ‘listen out’ for (e.g., a ZIP code or an 8-digit alphanumeric string). |
| Phonetic fuzzy matching | Matching words that sound similar but might be mispronounced or transcribed incorrectly. For example, if someone says, “I want to transfer funds to my savins account,” the system recognizes that “savins” is likely “savings,” despite the slight mispronunciation. |
| Database verification | Verifying ASR transcripts against relevant databases (e.g., all US ZIP codes) or existing CRM records and using this information to pick the most relevant transcript. |
| Latency trade-off | Establishing how long an ASR model should listen for specific inputs and whether or not the customer can interrupt. |
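As an illustration of database verification (combined with a hint of contextual biasing, since we know the expected format), here is a minimal, hypothetical Python sketch: the ASR returns several candidate transcripts (an “n-best” list), and we keep the first one containing a value that exists in our reference data. The ZIP codes and function names are invented for the example.

```python
import re
from typing import Iterable, Optional

# Stand-in for a real reference database of valid ZIP codes.
VALID_ZIP_CODES = {"10001", "30301", "60601", "94105"}

def pick_zip_code(n_best: Iterable[str]) -> Optional[str]:
    """Return the first candidate containing a known 5-digit ZIP code."""
    for candidate in n_best:
        for match in re.findall(r"\b\d{5}\b", candidate):
            if match in VALID_ZIP_CODES:
                return match
    return None

# A spoken "nine four one oh five" might be transcribed several ways:
print(pick_zip_code(["94 105", "94105", "nine 4105"]))  # -> "94105"
```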
Want to dig further into speech recognition and spoken language understanding? This Deep Learning podcast episode covers the challenges faced when deploying voice assistants, including speech recognition errors, latency issues, model management, dialogue design, and telephony as a utility.
Measuring speech recognition accuracy: Word error rate
We know accurate understanding is essential for creating effective AI agents, but how is it measured?
Word error rate (WER) measures how well a speech recognition system performs by comparing what the AI transcribes to what was actually said. A lower WER means higher accuracy; a score of 0% indicates perfect recognition.
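Concretely, WER is the number of substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. A minimal sketch of that calculation in Python, using a word-level edit distance, might look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("a table for eight please",
                      "a table for hate please"))  # 1 error / 5 words = 0.2
```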
Reducing word error rate
Improving WER is about more than just accuracy. It’s about trust. When an AI agent understands customers correctly, they’re more likely to engage rather than ask to speak to a person.
One way to improve accuracy is by using custom ASR models designed specifically for customer service. It's important to ensure they are trained to handle the following (a simple lexicon sketch appears after the list):
- Industry terms or jargon
- Regional accents and dialects
- Common phrases and product names
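Exactly how you supply this vocabulary depends on the ASR vendor, but the idea can be sketched as a post-processing pass that maps common mis-transcriptions of domain terms back to their canonical form. The lexicon entries and function names below are illustrative assumptions, not any vendor's API:

```python
import re

# Illustrative domain lexicon: canonical term -> spellings the ASR tends to produce.
DOMAIN_LEXICON = {
    "telemedicine": ["tele medicine", "telly medicine"],
    "cardiologist": ["cardio logist"],
    "PolyAI": ["poly ai", "poly a i"],
}

def apply_lexicon(transcript: str) -> str:
    """Replace known mis-transcriptions with their canonical domain terms."""
    corrected = transcript
    for canonical, variants in DOMAIN_LEXICON.items():
        for variant in variants:
            corrected = re.sub(re.escape(variant), canonical, corrected,
                               flags=re.IGNORECASE)
    return corrected

print(apply_lexicon("I need to book a telly medicine appointment with a cardio logist"))
# -> "I need to book a telemedicine appointment with a cardiologist"
```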
Timing is everything: Balancing latency and interruptions
Interruptions are a natural part of conversation. Some are useful: when somebody offers you a drink and starts listing ‘tea, coffee, water, juice, beer…’, it's easier for both parties if you just interrupt, whether you don't want a drink at all or you've already heard the one you want.
Other interruptions are less useful and make conversations harder than they need to be, like when somebody interrupts you to ask a question when you are just about to answer.
One of the most crucial elements in building an AI agent is smooth, timely interaction. If the assistant takes too long to respond, users get frustrated; but responses that come too fast, or that cut the user off before they have finished, are just as damaging to harmonious human-machine communication.
There needs to be a balance where the system responds quickly and accurately without cutting off or confusing the user. Unfortunately, there’s no hard and fast rule on when you should allow customers to interrupt. It depends completely on what you deem important to your business and what your customer deems important.
That said, these are some useful questions to ask yourself:
- How important is it that the customer listens to everything? Do you want to ensure the customer does not interrupt the AI reading out terms and conditions?
- What is the potential impact of a customer interrupting? If you’re reading out a list, and they skip later options, how likely is this to result in an error further along the conversation?
- What is the impact on CX? Will allowing or preventing the customer from interrupting have a negative or positive impact on their experience?
- How do you want to balance CX and business processes? If customers are likely to skip the boring stuff, how do you strike a balance between your priorities and theirs?
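To make these trade-offs concrete, here is a minimal, hypothetical sketch of how barge-in and endpointing settings might be expressed per prompt type. The class and parameter names are invented for illustration, not taken from any particular platform:

```python
from dataclasses import dataclass

@dataclass
class TurnPolicy:
    """Illustrative per-prompt settings for balancing latency and interruptions."""
    allow_barge_in: bool    # can the caller interrupt this prompt?
    end_of_speech_ms: int   # silence (ms) before we treat the caller as finished
    max_listen_ms: int      # hard cap on how long we listen for this input

# Long legal disclosures: no interruptions while they are read out.
TERMS_AND_CONDITIONS = TurnPolicy(allow_barge_in=False, end_of_speech_ms=1200, max_listen_ms=15000)

# Listing menu options: let callers jump in as soon as they hear what they want.
MENU_OPTIONS = TurnPolicy(allow_barge_in=True, end_of_speech_ms=700, max_listen_ms=8000)

# Collecting a long reference number: wait longer before assuming the caller is done.
REFERENCE_NUMBER = TurnPolicy(allow_barge_in=True, end_of_speech_ms=1500, max_listen_ms=20000)
```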
Why listening is the foundation of great CX
Building AI agents that can truly listen means going beyond basic transcription. It requires context awareness, intelligent error correction, and the ability to balance accuracy with speed. When agents understand what’s being said—and what’s meant—they create smoother conversations, reduce customer frustration, and build trust from the very first interaction. Listening isn’t just the first step in a conversation. It’s the foundation of great customer experience.
Speak to our team today about how PolyAI can help you implement the world’s most lifelike and adaptable AI agent to deliver effortless CX at scale.