Speech Recognition & Spoken Language Understanding (SLU)
In this episode, Kylie speaks with Shawn Wen, co-founder and CTO of PolyAI, going in-depth on speech recognition technology. They cover the challenges of integrating voice assistants, the often-bumpy conversion of chatbots to voice, and the intricacies of optimizing speech recognition models for different types of users and contexts. Shawn highlights five major problems faced when deploying voice assistants: speech recognition errors, latency issues, model management, dialogue design, and telephony as a utility. Tune in to understand the complexities and solutions involved in creating highly effective voice assistants.
And usually, yeah, that's exactly what we heard from a lot of our prospects and clients: "Hey, we have a chatbot already. Why don't we just put a voice on top of it, and then the bot is good to go." The reality is there are a lot of challenges and nuances that you probably won't know about until you actually deploy a voice assistant.
"'Hey, we have a chatbot already. Why don't we just put a voice on top of it?' And then the bot is good to go. The reality is there are a lot of challenges and nuances."
Apple's voice assistant, Siri, is trying to solve a very big problem. It's a consumer-facing voice assistant, supposed to be your personal assistant, supposed to be able to help with everything in your day-to-day life: from singing a song or telling a joke to actually helping you set a reminder or manage your calendar.

It's quite a broad range of tasks that the voice assistant needs to handle, and it's a very challenging problem. Now, if you think about a voice assistant in the contact center, you don't need such a broad range of speech recognition coverage. A lot of the time, if you are a bank, clients call you because they have a problem with their bank accounts.

There's only a limited vocabulary of words that you need to factor in there. So by tailoring the speech recognition to the contact center, or even to particular industries, you already mitigate a lot of the problems you're going to see with consumer-facing voice assistants.
"If you think about voice assistant in the contact center, you don't need such a broad range of speech recognition coverage. A lot of times, you know, if you are a bank, clients calling you because they have a problem with their bank accounts."
"If you encounter a generative voice assistant these days, it's better to actually provide more context as much as possible. The automatic speech recognition (ASR) will still be able to transcribe it very accurately for long sentences, and then the underlying language model (LLM) is going to be very, very powerful to pick up the long sentences."
We sometimes see some dropped calls, and we see degradation in performance. They do actually point to a paper, which they published maybe two years ago, and that paper says this new paradigm of speech recognition is the way to go in the future, which is true. The major difference is that it makes the model lexicon-free.
The concept of lexicon-free is that, previously, the speech model would basically look up a lexicon of words and say, based on this particular audio segment, it should be one of these words, right? So you're basically picking from the words. Now the model is different. It's basically generating tokens.

And those tokens are subword units, and you can compose them back into words later on. So it becomes a so-called lexicon-free model. And in that case, some problems appear that you would never see in a lexicon-based model. For example, the word "yay" was always transcribed as "yay".

Now it could be transcribed as "yayyyy", because users are just much more excited and say "yay" for a very long time. So it's a cool feature that these models are now able to recognize this. But if you don't factor that into your language understanding components, your assistant is going to start to say, "Sorry, I didn't get that."
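As a toy illustration of the difference, here is how subword tokens compose back into words, using the SentencePiece-style "▁" word-boundary marker; the token sequence itself is made up for the example.

```python
# A toy illustration of lexicon-free decoding: the model emits subword tokens
# rather than whole dictionary words, so open-ended strings like "yayyyy" are
# reachable. "▁" marks a word boundary, following the SentencePiece convention.

def compose(tokens: list[str]) -> str:
    """Join subword tokens back into words."""
    return "".join(tokens).replace("▁", " ").strip()

# A lexicon-based model can only pick words that exist in its lexicon:
print(compose(["▁ya", "y"]))                 # -> "yay"
# A subword model can keep generating, so an excited caller becomes:
print(compose(["▁ya", "y", "y", "y", "y"]))  # -> "yayyyy"
```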
"The concept of less account free is that, previously, the speech model would basically look up a lexicon of words and to say, based on this particular audio segment, it should be one of these words, right? So you're basically picking from the words. Now the model is different. It's basically generating tokens."