
Introducing Owl: A new speech recognition model from PolyAI

June 3, 2025


TL;DR

  • Owl is a new speech recognition model from PolyAI.
  • Owl is purpose-built for customer service over the phone.
  • With a word error rate of 0.122, Owl outperforms leading speech recognition models on phone-based customer service calls, including calls from diverse customer bases with a wide range of accents.
  • Owl outputs can be further enhanced with spoken language understanding (SLU) features in PolyAI Agent Studio.

Why does speech recognition matter?

In our experience, about 70% of errors made by AI voice agents are caused by Automatic Speech Recognition (ASR) models.

While off-the-shelf ASR models from the likes of Amazon, Google and NVIDIA have improved considerably over recent years, they still struggle with the unique challenges of customer service environments: the almost limitless variation in customers’ voices, and the lo-fi nature of the phone as an audio channel.

Unreliable speech recognition is one of the main reasons that most organizations are forced to start their automation journeys with text-based chatbots. After all, customers don’t want to repeat themselves over and over to a system that doesn’t understand what they’re saying.

At PolyAI, we automate millions of calls for some of the world’s best-known enterprise brands. We’ve taken what we’ve learned to create Owl – a new ASR designed specifically for customer service applications.

What makes a good ASR for enterprise customer service?

When we set about creating our own ASR, we knew it needed to provide human-level accuracy when understanding customers over the phone. It needed to account for different use cases and the variation in speech patterns across broad customer bases, and it needed to be fast, making automated conversations feel indistinguishable from human ones.

Owl is specifically designed to address the following:

  1. Domain-specific language: Owl is trained on synthesized data representing customer service calls from a range of industries, including healthcare, financial services, retail, travel, hospitality, and utilities.
  2. Accent and dialect handling: Because Owl is trained on data from different geolocations, it has heard and learned from different accents and dialects.
  3. High performance over phone lines: Owl is trained on samples of conversations that took place over the phone, so it performs well on that kind of low-quality audio input.
  4. Latency: As a specialized ASR model, Owl is smaller and more efficient, reducing response times and enhancing the naturalness of live interactions.

Creating the most accurate ASR for customer service

PolyAI Owl was developed using a strong pretrained model from NVIDIA as the foundation, combined with extensive proprietary training data drawn from real-life conversations across a variety of use cases and featuring a range of accents.

The development process involved experimenting with multiple architectures while maintaining a strict, multi-faceted evaluation protocol to ensure optimal performance. We optimized the model for real-time deployment using a high-performance inference server, ensuring minimal latency during live interactions.

Benchmarking Owl against other industry-leading ASR models

We benchmarked Owl against four of the leading ASR models on the market.

We evaluated Owl using an internally annotated dataset that comprehensively covers all relevant customer service use cases. This allowed for flexible evaluation approaches where results could be analyzed either by specific problem domains (such as healthcare, financial services, retail, etc.), use cases (addresses, numbers, etc.) or clients to understand performance in particular sectors.

We also aggregated all data to produce “global” results that demonstrated overall effectiveness across all customer service scenarios. Importantly, we evaluated the streaming version of the model since real-time streaming is our primary use case for live customer interactions.

This dual evaluation approach enabled both granular insights into domain-specific performance and broader understanding of general capabilities, providing a complete picture of how Owl compares to leading commercial ASR systems in real-world customer service environments.

Results

For the purposes of benchmarking PolyAI Owl against other off-the-shelf ASR models, we decided to evaluate based on word error rate (WER).

Word error rate measures the accuracy of automatic speech recognition systems. It is calculated as the number of word-level errors the system makes (substitutions, deletions, and insertions) divided by the number of words in the reference (correct) transcript.

A lower WER value indicates higher accuracy.
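
For illustration, here is a minimal Python sketch of how WER is typically computed: the word-level edit distance between the hypothesis and the reference, divided by the reference length. This is a generic implementation, not the evaluation code used in our benchmarks.

```python
# Minimal sketch: word error rate via word-level Levenshtein distance.
# WER = (substitutions + deletions + insertions) / words in reference.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("bill" -> "bell") and one deletion ("on") over
# a 5-word reference gives a WER of 0.4.
print(word_error_rate("pay my bill on friday", "pay my bell friday"))
```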

With a word error rate of 0.122, Owl proved to be the most accurate ASR of those we tested.

Further improving accuracy for AI agent deployments

Even with the best speech recognition models, there is always room for error. Some errors come from the model itself not being perfect, but others stem from the inherent ambiguity of language: we as humans frequently fail to understand one another, and we either make an educated guess or ask for clarification. That’s why our platform, Agent Studio, has two key features that can be applied before and after ASR for an even more accurate understanding of what the caller is saying.

  • Keyphrase boosting – bias the ASR model toward recognizing specific words and phrases. By curating a list of keyphrases relevant to your domain, you can improve transcription accuracy for those terms, resulting in better inputs for the large language model (LLM) and improved agent performance (see the sketch after this list).
  • Transcript corrections – review calls and correct transcription errors instantly in the platform to prevent these errors from being made again.
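
As a rough illustration of keyphrase boosting, here is a hypothetical configuration in the style commonly exposed by ASR APIs. The field names and boost value are illustrative assumptions, not the actual Agent Studio interface.

```python
# Hypothetical sketch of keyphrase boosting as commonly exposed by ASR
# APIs. The config shape and "boost" weighting are illustrative only.

DOMAIN_KEYPHRASES = [
    "copay",           # healthcare
    "direct debit",    # financial services
    "loyalty number",  # retail / hospitality
]

recognizer_config = {
    "language": "en-US",
    "speech_contexts": [
        # Each phrase is weighted so the decoder prefers it over
        # acoustically similar alternatives (e.g. "copay" vs "go pay").
        {"phrases": DOMAIN_KEYPHRASES, "boost": 15.0}
    ],
}
```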

It is also important to note that downstream systems can recover from ASR mistakes. Because it knows the full context of the conversation, an LLM can often “correct” and disambiguate mistakes the ASR might have made.
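
As a sketch of what that recovery can look like, the following hypothetical prompt builder hands an LLM the conversation so far alongside the latest transcript. The function and prompt format are illustrative assumptions, not part of Agent Studio.

```python
# Hypothetical sketch: letting an LLM repair a likely ASR error using
# conversation context. Any chat-style LLM client could consume the prompt.

def build_correction_prompt(history: list, transcript: str) -> str:
    context = "\n".join(history)
    return (
        "You are post-processing a phone-call transcript.\n"
        f"Conversation so far:\n{context}\n"
        f'Latest ASR output: "{transcript}"\n'
        "If the ASR output contains a likely mis-recognition given the "
        "context, return the corrected utterance; otherwise return it unchanged."
    )

prompt = build_correction_prompt(
    ["Agent: Which date would you like to book?"],
    "the twenty thirst of june",  # likely "the twenty-first of June"
)
print(prompt)
```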

Reducing ASR latency for AI agents

There are a number of factors that contribute to overall latency in speech recognition systems:

  • Audio chunk size
  • Processing time for each audio chunk
  • Potential slowdowns under heavy load (though this is easily mitigated through autoscaling)
  • Time required to finalize the transcript
  • End-of-speech detection

End-of-speech detection determines when a speaker has finished their turn. It is the primary driver of latency and a particularly challenging problem to solve.

One way of solving this is silence detection. For example, we might configure the ASR to decide that speech has ended once the user has been silent for 2 seconds. However, this approach doesn’t reflect natural conversational behavior: in human conversation, responses typically follow within a few hundred milliseconds, so a 2-second pause feels sluggish.

At PolyAI, we set this threshold to shorter durations (for example 500 milliseconds), and our system architecture supports handling multiple consecutive messages while enabling users to jump in and interrupt as they see fit. By combining fast ASR processing with short end-of-speech detection windows, we create a human-like conversational experience that feels natural and responsive.
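
For illustration, here is a minimal Python sketch of silence-based end-of-speech detection with a 500-millisecond window. The frame size and the caller-supplied voice activity check are simplifying assumptions, not a description of our production system.

```python
# Minimal sketch of silence-based end-of-speech detection. The 500 ms
# window mirrors the figure above; frame size is an assumption.

FRAME_MS = 20           # audio arrives in 20 ms chunks
END_OF_SPEECH_MS = 500  # silence window that ends the turn

def detect_end_of_speech(frames, is_speech):
    """Return the index of the frame where the turn ends, or None."""
    silent_ms = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if is_speech(frame):
            heard_speech = True
            silent_ms = 0  # any speech resets the silence window
        else:
            silent_ms += FRAME_MS
            # only end the turn once the caller has actually spoken
            if heard_speech and silent_ms >= END_OF_SPEECH_MS:
                return i
    return None

# Example: 30 frames of speech followed by 30 frames of silence (600 ms).
frames = [1.0] * 30 + [0.0] * 30
print(detect_end_of_speech(frames, is_speech=lambda f: f > 0.5))  # 54
```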

Conclusion

Great speech recognition is key to implementing AI agents that customers actually want to talk to. After all, there’s nothing worse than having to repeat yourself over and over just to be understood.

At PolyAI, we remain deeply committed to enabling optimal performance across all elements of the AI agent tech stack. To find out more, book a demo with PolyAI today.

Ready to hear it for yourself?

Get a personalized demo to learn how PolyAI can help you drive measurable business value.

Request a demo
