

The Evolution of Speech Recognition

In this episode of the PolyAI podcast, host Kylie Whitehead interviews Pawel Budzianowski, the head of machine learning at PolyAI. They dive into the past 70 years of speech recognition, moving from hidden Markov models to deep learning transformations and beyond. Plus, Seinfeld and Jordan Peele in the speech zeitgeist?

The journey of speech recognition began over 70 years ago, initially driven by Cold War-era efforts to monitor communications. Early systems like IBM's Shoebox could only recognize spoken digits and a handful of simple words.

Around 2012, deep learning revolutionized speech recognition by enabling systems to handle more complex data and improve accuracy, surpassing the performance of older models.

The future of speech recognition involves combining audio, text, and visual data to improve accuracy and context comprehension, potentially eliminating most errors currently observed.

Pawel Budzianowski

Head of Machine Learning

Kylie Whitehead

Senior Director of Marketing

"One of the biggest starting points for speech recognition was actually things that were happening during the Cold War between USA and Soviet union. Scientists were tasked to figure out how to spy on phone calls, that's where speech recognition and machine translation were born."

“Since the release of Dragon NaturallySpeaking in 1997, all of these systems up until 2017 were based on hidden Markov models. Like all of the AI models released before the neural network and deep learning revolution of the last five years, and all of the outcomes you see today, like ChatGPT, these models learn from data, which is a crucial aspect of AI, right? Just like humans, these systems learn from data, but they had this invisible ceiling: up to a point they were learning quite okay, but once you added more and more data, the system didn’t really improve, because that ceiling of capabilities was inherently inside of these models.”

“So after these models were released in 1997, a lot of work was put into tweaking them, making them better. We got to the point where continuous speech, even with British accents, could be recognized quite easily in favorable conditions, right? Yeah, imagine no rain, not much noise, no other speakers in the room, just you with your good clear audio.”

“Accents are a major hurdle for speech recognition. There might be people calling from a lot of different geographical places with different accents. We typically use multiple ASR systems so that we can combine hypotheses together, and that's typically the best solution.”
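As a rough sketch of what combining hypotheses can look like, the snippet below scores each candidate transcript by summing the confidence that different engines assign to it and keeps the highest-scoring one. The function name, the confidence values, and the simple utterance-level vote are illustrative assumptions, not PolyAI's production pipeline; real systems often align and vote at the word level (e.g. ROVER-style combination) rather than over whole utterances.

```python
from collections import defaultdict

def combine_hypotheses(hypotheses):
    """Pick a consensus transcript from several (text, confidence) pairs.

    hypotheses: list of (transcript, confidence) tuples, one per ASR engine.
    Engines that agree on the same transcript reinforce each other, so the
    transcript with the highest total confidence wins.
    """
    scores = defaultdict(float)
    surface = {}  # remember one original spelling per normalized key
    for text, confidence in hypotheses:
        key = text.strip().lower()  # light normalization so trivial differences don't split the vote
        scores[key] += confidence
        surface.setdefault(key, text.strip())
    best = max(scores, key=scores.get)
    return surface[best]

# Example with three hypothetical engines transcribing the same utterance.
print(combine_hypotheses([
    ("I'd like to book a table for two", 0.91),
    ("I'd like to book a table for two", 0.83),
    ("I'd like to cook a table for two", 0.64),
]))  # -> "I'd like to book a table for two"
```

The design choice here is deliberately simple: summing confidences rewards agreement across engines, which is the intuition behind ensembling ASR outputs for accents that any single model handles poorly.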