The anatomy of a voice assistant
About the show
Hosted by Nikola Mrkšić, Co-founder and CEO of PolyAI, the Deep Learning with PolyAI podcast is the window into AI for CX leaders. We cut through the hype in customer experience, support, and contact center AI, helping decision-makers understand what really matters.
In this episode, we welcome Shawn Wen, Co-founder and CTO at PolyAI. Shawn provides an in-depth overview of the AI tech stack needed to build high-quality AI voice assistants. Inspired by Andreessen Horowitz’s recent publication on AI voice agents, the discussion covers the key components of this complex system: speech recognition, voice activity detection, the application of generative AI models, and the integration of these technologies into practical applications.
Shawn also explores the challenges of managing latency, how the caller's input affects which speech recognition model is selected, and the future of end-to-end AI systems. Join us as we unravel the complexities behind creating and optimizing effective voice AI solutions!
Building a voice assistant involves more than just large language models (LLMs). The stack includes speech recognition, streaming protocols, voice activity detection, and region-specific models for accents and languages.
Enterprises want control over what a voice assistant says. By overlaying business processes and logic, companies ensure the AI aligns with their specific needs and provides safe and appropriate responses.
Generative AI models are powerful, but they're not a "plug and play" solution. Businesses must implement additional logic, guardrails, and checks before letting the AI make decisions or respond to users.
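To make the guardrail idea concrete, here is a minimal Python sketch of business checks sitting between a model's draft reply and what the caller actually hears. It is an illustration only, not PolyAI's implementation: every name in it (`Draft`, `ALLOWED_INTENTS`, `BLOCKED_PHRASES`, `guarded_reply`, the fallback wording) is a hypothetical stand-in, and a real deployment would likely use far richer checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# All names below are hypothetical, chosen for illustration only.

@dataclass
class Draft:
    intent: str  # what the model believes the caller wants
    reply: str   # what the model proposes to say

# A check returns None when the draft passes, or a reason string when it fails.
Check = Callable[[Draft], Optional[str]]

# Assumed business rules: which intents may be handled automatically,
# and which phrases the assistant must never say on its own.
ALLOWED_INTENTS = {"opening_hours", "book_table", "cancel_booking"}
BLOCKED_PHRASES = ("refund approved", "legal advice")

FALLBACK = "Let me connect you with a colleague who can help with that."


def intent_is_allowed(draft: Draft) -> Optional[str]:
    if draft.intent not in ALLOWED_INTENTS:
        return f"intent '{draft.intent}' is not handled automatically"
    return None


def reply_is_safe(draft: Draft) -> Optional[str]:
    lowered = draft.reply.lower()
    for phrase in BLOCKED_PHRASES:
        if phrase in lowered:
            return f"reply contains blocked phrase '{phrase}'"
    return None


CHECKS: List[Check] = [intent_is_allowed, reply_is_safe]


def guarded_reply(draft: Draft, checks: List[Check] = CHECKS) -> str:
    """Return the model's reply only if every business check passes."""
    for check in checks:
        if check(draft) is not None:
            # The model's draft is discarded; the business logic decides.
            return FALLBACK
    return draft.reply
```

For example, `guarded_reply(Draft("opening_hours", "We are open from 9 to 5."))` passes the draft through unchanged, while a draft with the intent `"issue_refund"` is replaced by the fallback handoff line. The point of the design is that the model proposes and the business logic disposes.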
Highlights
Shawn Wen
14:00
"The LLM can make a lot of judgments by itself, but you're not going to let it make the entire business decision for you because that's way too risky, right?"
Kylie Whitehead
09:30
"You take into account as well the fact that people don't know how to speak to voice technologies. They've been trained to speak to them in keywords or in this really stilted and awkward way."