Creating a more accurate ASR model for customer service than Google

Automatic Speech Recognition (ASR) or Speech-to-Text is the process of transcribing spoken utterances into text. It enables voice assistants to understand what (in the most literal sense) a caller is saying. ASR systems use a combination of signal processing, machine learning, and natural language processing (NLP) techniques to perform this conversion.

While out-of-the-box ASR solutions can be helpful, they may not be tailored to a business’s specific needs and objectives.

Fine-tuning custom ASR models on in-house call data can improve accuracy. This leads to more fluid and natural customer interactions and reduces the chance of frustrating experiences over the phone.

In this post, we present the results of our experiment with the PolyAI dataset, which show that fine-tuning ASR models on in-house call data can lead to significant improvements in Word Error Rate (WER).

Out-of-the-box ASR solutions

Several ASR providers offer satisfactory out-of-the-box performance. However, these solutions are designed for generic use cases, which limits them. For example, models trained to transcribe YouTube videos may not be optimized for transcribing phone call audio.

Another drawback of these out-of-the-box models is that they are often dialect-specific. We cannot assume that all calls made in the US are using the American English dialect. This holds true for all regions in which we operate.

Motivations for fine-tuning

We suspect that even ASR models that are optimized for phone calls can be improved by fine-tuning, based on the assumption that certain characteristics of PolyAI calls are not present in out-of-the-box datasets.

By fine-tuning on the data collected in the US, we can include a wide selection of accents and dialects present during calls. We can also learn, from real PolyAI customer calls, the noise and speech characteristics that might be caused by the telephony infrastructure or dialogue design.

Given that out-of-the-box models generally perform well, we conclude that training a model from scratch would be counterproductive. It would require a dataset and time commitment many orders of magnitude larger.

Experiment setup

PolyAI data

The PolyAI dataset comprises 20 hours of caller speech collected from calls in the US and the UK in equal proportions. The recordings have been manually transcribed, and the transcriptions themselves have a word error rate (WER) of less than 1%.

The dataset does not contain any personally identifiable information (PII). We use an 80:20 train:test split for each locale. The data collected in the US is tagged with the en-US language code (based on collection region, not language properties).
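
A per-locale 80:20 split can be sketched as follows. This is a minimal illustration, not our actual pipeline: the record fields ("id", "locale"), the function name, and the seed are hypothetical, and the sketch splits by utterance count, whereas the figures in this post are reported in hours.

```python
import random
from collections import defaultdict

def split_by_locale(utterances, train_fraction=0.8, seed=0):
    """Split utterances into train/test sets separately for each locale."""
    # Group records by locale so each locale gets its own 80:20 split.
    by_locale = defaultdict(list)
    for utt in utterances:
        by_locale[utt["locale"]].append(utt)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for locale in sorted(by_locale):
        items = by_locale[locale]
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Toy data: 10 hypothetical utterances per locale.
utterances = [
    {"id": f"{locale}-{i}", "locale": locale}
    for locale in ("en-US", "en-GB")
    for i in range(10)
]
train, test = split_by_locale(utterances)
```

Splitting within each locale, rather than across the pooled data, keeps the US and UK test sets the same relative size.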

The PolyAI dataset consists of noisy 8 kHz phone call recordings, ranging from single-word utterances to long-form speech. It is a challenging dataset, so we expect relatively high word error rates even from the best models.

Training configuration

For the base model, we selected an off-the-shelf SOTA Conformer-CTC model pre-trained on a variety of US and UK speech data.

We fine-tuned three versions of the model based on subsets of the PolyAI dataset:

UK data only (8 hours of training data)
US data only (8 hours of training data)
Both UK and US data (16 hours of training data in total)

To prevent catastrophic forgetting, we always mixed the fine-tuning dataset with a subset of the base model training data in equal proportions.
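
The mixing step can be sketched roughly as below. This is an illustration under assumptions: the post does not say whether "equal proportions" is measured in hours or in utterances, nor how the base subset was sampled. The sketch balances by audio duration with random sampling, and the record fields ("seconds", "source") and function name are hypothetical.

```python
import random

def mix_equal_by_duration(finetune, base, seed=0):
    """Combine fine-tuning data with a random base subset of similar total duration."""
    target = sum(utt["seconds"] for utt in finetune)

    rng = random.Random(seed)
    pool = list(base)
    rng.shuffle(pool)

    # Take base utterances until their total duration reaches the target.
    replay, total = [], 0.0
    for utt in pool:
        if total >= target:
            break
        replay.append(utt)
        total += utt["seconds"]

    mixed = list(finetune) + replay
    rng.shuffle(mixed)  # interleave the two sources for training
    return mixed

# Toy data: 20 seconds of in-house audio, 200 seconds of base audio.
finetune = [{"id": f"ft-{i}", "seconds": 5.0, "source": "finetune"} for i in range(4)]
base = [{"id": f"base-{i}", "seconds": 2.0, "source": "base"} for i in range(100)]
mixed = mix_equal_by_duration(finetune, base)
```

Replaying part of the original training data alongside the new data is a common way to keep the model from drifting too far from what it already does well.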

Results

The results were calculated on 2-hour splits for each region.

Fine-tuning on just 8 hours of region-specific data produced models that beat the best out-of-the-box Google model on both mean and median WER in each region.
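
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below assumes the mean and median in the tables are taken over per-utterance WER values, which is our reading rather than something stated above. The example transcripts are invented, and no text normalization is applied.

```python
from statistics import mean, median

def utterance_wer(reference, hypothesis):
    """Word error rate of one hypothesis against one non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

# Invented (reference, hypothesis) pairs, mixing short and long utterances.
pairs = [
    ("yes", "yes"),
    ("table for two please", "table for two please"),
    ("i would like to book a room", "i would like to look a room"),
    ("no thanks", "no thank"),
    ("tomorrow", "tomorrow"),
]
wers = [utterance_wer(ref, hyp) for ref, hyp in pairs]
mean_wer = mean(wers)
median_wer = median(wers)
```

In this toy set, three of the five utterances are transcribed perfectly, so the median WER is 0 even though the mean is about 13%. The same effect would explain a 0% median next to a mean of around 20%.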

– Dataset lang: en-US –

	Provider	Model	Language	WER Mean	WER Median
0	PolyAI	fine-tuned	en-US	20.32%	0.00%
2	PolyAI	fine-tuned	en-ALL	22.19%	7.14%
3	Google	best	en-US	22.22%	7.69%
9	Google	worst	en-US	44.51%	33.33%

– Dataset lang: en-GB –

	Provider	Model	Language	WER Mean	WER Median
0	PolyAI	fine-tuned	en-GB	20.99%	8.33%
1	PolyAI	fine-tuned	en-ALL	22.77%	10.00%
2	Google	best	en-GB	25.15%	14.29%
9	Google	worst	en-GB	33.46%	25.00%

Our fine-tuned model achieved a median word error rate (WER) of 0% on US data, while the best Google ASR model achieved a median WER of 7.69%.

On the UK dataset, our model achieved an 8% median WER, compared to the 14% achieved by the best Google model.

On en-US, we improved the mean WER by about two percentage points compared to the best Google solution (20.32% versus 22.22%). On en-GB, we observed an improvement of about four percentage points in mean WER compared to the best Google UK model (20.99% versus 25.15%).
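
The mean WER differences can be recomputed directly from the tables above. The snippet below is a small worked example: the dictionary layout and function name are ours, and the relative improvement is an extra figure derived from the same numbers.

```python
# Mean WER values (in percent) copied from the results tables.
mean_wer = {
    "en-US": {"PolyAI fine-tuned": 20.32, "Google best": 22.22},
    "en-GB": {"PolyAI fine-tuned": 20.99, "Google best": 25.15},
}

def improvement(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = baseline - ours
    relative = absolute / baseline * 100
    return absolute, relative

results = {
    locale: improvement(row["PolyAI fine-tuned"], row["Google best"])
    for locale, row in mean_wer.items()
}

for locale, (absolute, relative) in results.items():
    print(f"{locale}: {absolute:.2f} points absolute, {relative:.1f}% relative")
```

This gives roughly 1.9 points on en-US and 4.2 points on en-GB, or about 9% and 17% fewer errors in relative terms.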

We also found that fine-tuning the same model on both regions’ data yields worse results than fine-tuning on a smaller, region-specific subset.

Conclusions

In our experiments with the PolyAI dataset, we found that fine-tuning ASR models on internal call data significantly improves WER, particularly when the data is collected from the same domain as the target use case. This approach allows businesses to create custom models tailored to their needs, resulting in greater accuracy and efficiency.

Fine-tuning ASR models on in-house call data is an important tool for businesses looking to improve accuracy, efficiency, and productivity. By leveraging the power of custom ASR models, businesses can gain valuable insights from their call data and use this information to drive growth and success.

Another significant advantage of this approach is that it removes the reliance on third-party providers, which allows for seamless single-tenant and BYOC implementations.