The anatomy of a voice assistant

In this episode, we welcome Shawn Wen, Co-founder and CTO at PolyAI. Shawn provides an in-depth overview of the AI tech stack essential for developing high-quality AI voice assistants. Inspired by Andreessen Horowitz’s recent publication on AI voice agents, the discussion covers key components of a complex system, including speech recognition, voice activity detection, the application of generative AI models, and integrating these technologies into practical applications.

Shawn also explores the challenges of managing latency, how the nature of the input shapes the choice of speech recognition model, and the future of end-to-end AI systems. Join us as we unravel the complexities behind creating and optimizing effective voice AI solutions!

Building a voice assistant involves more than just large language models (LLMs). The stack includes speech recognition, streaming protocols, voice activity detection, and region-specific models for accents and languages.
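To make the shape of that stack concrete, here is a minimal illustrative sketch of how the pieces fit together. Every function here is a hypothetical stand-in (not PolyAI's actual implementation): a real system would stream audio and call production VAD, ASR, and generative models.

```python
# Illustrative voice assistant pipeline: VAD -> speech recognition -> generation.
# All components are placeholder stubs; names and logic are assumptions for illustration.

def detect_voice_activity(audio_chunk: bytes) -> bool:
    """Voice activity detection: decide whether the chunk contains speech."""
    return len(audio_chunk) > 0  # placeholder heuristic, not a real VAD

def transcribe(audio_chunk: bytes, region: str = "en-US") -> str:
    """Speech recognition; real deployments swap in region-specific models
    to handle local accents and languages."""
    return "book a table for two"  # placeholder transcript

def generate_reply(transcript: str) -> str:
    """A generative model produces a candidate response from the transcript."""
    return f"Sure, I can help you {transcript}."

def handle_chunk(audio_chunk: bytes):
    """End-to-end flow for one audio chunk. Returns None when no speech is detected,
    so the assistant does not respond to silence or background noise."""
    if not detect_voice_activity(audio_chunk):
        return None
    transcript = transcribe(audio_chunk)
    return generate_reply(transcript)
```

The point of the sketch is the ordering: voice activity detection gates the rest of the pipeline, so the more expensive recognition and generation stages only run on actual speech.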

Enterprises want control over what a voice assistant says. By overlaying business processes and logic, companies ensure the AI aligns with their specific needs and provides safe and appropriate responses.

Generative AI models are powerful, but they're not a "plug and play" solution. Businesses must implement additional logic, guardrails, and checks before letting the AI make decisions or respond to users.
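The guardrail pattern described above can be sketched as a thin layer between the model's candidate response and the user. The topic list and checks below are invented examples, not PolyAI's rules; the point is that business logic, not the model, has the final say.

```python
# Hypothetical guardrail overlay: business logic vets every model response
# before it reaches the caller. Topics and rules below are illustrative assumptions.

ALLOWED_TOPICS = {"booking", "hours", "menu"}  # example business scope

def guard_response(candidate: str, topic: str) -> str:
    """Apply business-process checks to a model-generated candidate response."""
    # Out-of-scope requests are escalated rather than answered by the model.
    if topic not in ALLOWED_TOPICS:
        return "Let me transfer you to a team member who can help."
    # Example risk check: the model must not commit to refunds on its own.
    if "refund" in candidate.lower():
        return "I'll need to confirm that with a manager first."
    return candidate
```

In this pattern the LLM proposes, but deterministic checks dispose: high-risk decisions are routed to humans or hard-coded responses, which is what keeps the assistant's behavior aligned with company policy.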

Kylie Whitehead

Senior Director of Marketing, Brand, PolyAI

Shawn Wen

Co-founder and CTO, PolyAI

"The LLM can make a lot of judgments by itself, but you're not going to let it make the entire business decision for you because that's way too risky, right?"

"You take into account as well the fact that people don't know how to speak to voice technologies. They've been trained to speak to them in keywords or in this really stilted and awkward way."