Introducing Pheme: A new speech generation model from PolyAI

Introducing Pheme: PolyAI’s first take on creating a conversational and efficient speech generation model.

Ivan Vulić Senior Scientist

Jan 10, 2024

Overview

Introducing Pheme

At PolyAI, we’ve been creating voice assistants for enterprise customer service since 2017. We’ve found that voice quality is one of the key factors influencing the engagement (and, therefore, the efficacy) of voice assistants.

Decades of interacting with “conversational” IVRs have left consumers skeptical, with many resorting to mashing their keypads, shouting AGENT! or swearing aggressively in order to bypass the automated system and speak to a person.

Deploying voice assistants that sound like real people has enabled us to engage with callers, automating upwards of 50% of inbound customer service calls for well-known enterprise brands.

Today, PolyAI is thrilled to announce Pheme: PolyAI’s first take on creating a conversational and efficient speech generation framework. With Pheme, we aim to create high-quality yet compact and fast conversational speech generation models that are amenable to easy and quick training, productization, and domain and voice specialization.

Challenges

Challenges of speech generation

Until very recently, it was not possible to achieve anything resembling a natural-sounding voice with text-to-speech (TTS) systems. However, in recent years, owing to deep learning architectures such as Transformers, speech generation has seen remarkable progress, and voice synthesis is now often virtually indistinguishable from a real human voice.

Attempts at combining such advancements in speech generation with generative large language models have shown promise for a wide range of applications, but the integration of cutting-edge speech generation technology is still hindered by its high latency, large model size, and huge training data demands.

Moreover, even if synthesized speech is high-quality in terms of word errors made (or rather “not made”), this does not guarantee its smooth adoption: large TTS models trained in controlled environments with read or acted speech might sound monotonous, flat, and plain boring once deployed to real-life applications. Put simply, how happy would you be as a user conversing with a system that takes 30 seconds to respond to each of your questions and then, after all that waiting, sounds like it is reading a dishwasher instruction manual?

The majority of current work on neural Transformer-based TTS still trains and evaluates models in controlled ‘upper bound’ environments that rely on cleanly recorded read speech (e.g., audiobooks) or acted speech. Such environments effectively ignore the undeniable fact that human speech across many scenarios is conversational in nature, and that TTS systems must be highly adaptable to applications across many different domains and should also be useful ‘in the wild’ (e.g., over the phone). Moreover, conversational TTS synthesis poses additional challenges, as conversational speech conveys additional paralinguistic information (e.g., emotions, pitch) and is much richer in terms of prosodic variation and expressiveness.

Scale

Speech generation, at scale

Current research in speech generation has focused on voice quality, with little to no regard for efficiency. This means that while the research has delivered excellent quality results, it is far from production-ready for enterprise applications.

Pheme has been developed with enterprise readiness at its core, focusing on the following key efficiency factors.

Parameter efficiency. Simply put, more compact models with fewer parameters are less costly to serve, quicker to train, quicker to run, and easier and faster to adapt and specialize to various domains and use cases.
Data efficiency. Models that can be pre-trained on smaller-scale data are less expensive and quicker to develop. Starting from such pre-trained checkpoints, models that can quickly adapt to unseen scenarios and domains with limited amounts of in-domain or in-voice data are more versatile and cheaper at the same time.
Inference efficiency. Users simply won’t accept or tolerate slow and cumbersome speech generation, regardless of its final quality. Low latency is, therefore, key for driving the engagement and efficacy of voice assistants.

Architecture

What is Pheme, and how was it built?

Pheme is a neural, Transformer-based TTS framework that aims to maintain high-quality speech generation both in multi-speaker and single-speaker scenarios, while simultaneously providing the following features:

Synthesis of prosodically rich and natural-sounding speech rather than acted and artificial speech;
Compact, shareable, and easily fine-tunable models that also work with unseen voices through one-shot prompting;
Reduced pretraining time and data requirements;
High inference efficiency and low latency.

Pheme’s design stands on the shoulders of many recent developments from other researchers, which we further integrated, adapted, and/or extended to serve the purpose discussed above. This comprises the following:

Inspired by a range of recent TTS models (e.g., MQTTS, SoundStorm, USLM), we separate the design of the full system into two core components: (a) a text-to-semantics (T2S) module and (b) an acoustics-to-speech (A2S) module, conditioned by semantics coming from the T2S module.
We demonstrate that the key ingredient is indeed the disentanglement of semantic (T2S) and acoustic components (A2S), enabled by the recently published SpeechTokenizer component.
We make use of speech embeddings extracted by pyannote for increased fidelity and speech quality.
Data efficiency: Similar to MQTTS, we aim to train Transformer-based conversational TTS models with far less, and noisier, publicly available training data than, e.g., VALL-E or SoundStorm (e.g., 10x less data). Training can be performed with conversational, podcast, and noisy data like GigaSpeech.
Inference efficiency: Similar to SoundStorm, we run inference through parallel MaskGIT-style decoding, showing that we can achieve up to 15x speed-up compared to similarly sized autoregressive models such as MQTTS without any quality degradation, and even with gains in quality.
Parameter efficiency: We create very compact models. The released Pheme checkpoints come in two variants (100M and 300M parameters).
Further specialization: The released checkpoints can also be specialized on unseen voices with additional in-voice fine-tuning data. We also show that single-speaker quality can be improved through teacher-student training with (synthetic) data generated by third-party providers or by much larger speech generation models (which are therefore unsuitable for production).
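
To give a feel for why parallel MaskGIT-style decoding can be so much faster than autoregressive decoding, here is a minimal, generic sketch in Python. It is not Pheme's implementation: `toy_predict` is a hypothetical stand-in for a trained model (it returns random tokens and random confidences), and the cosine unmasking schedule and the names `maskgit_decode`, `steps`, and `MASK` are assumptions made for illustration. The point is only the control flow: all positions start masked, the model guesses every position at once, and the most confident guesses are committed at each step.

```python
import math
import random

MASK = -1  # placeholder id for a position that has not been decoded yet


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for a trained model.

    Returns one (token, confidence) guess per position. A real model
    would condition on the already committed tokens; this one is random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def maskgit_decode(length, vocab_size, steps=4, seed=0, predict=toy_predict):
    """Fill `length` token slots in at most `steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One model call proposes a token for every position at once.
        guesses = predict(tokens, vocab_size, rng)
        # Cosine schedule (an assumed, commonly used choice): how many
        # positions should still be masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit only the most confident guesses among masked positions.
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in best[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    # 50 tokens decoded with 8 model calls instead of 50 sequential ones.
    print(maskgit_decode(length=50, vocab_size=1024, steps=8))
```

In this sketch the number of model calls is bounded by `steps`, whereas an autoregressive decoder needs one call per generated token. The actual speed-up in practice depends on the model, the sequence length, and the hardware.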

You can read the paper with all the technical details and results on arXiv.

Demo

Try it yourself!

You can try out the demo on Hugging Face (with a selection of unseen voices sampled from GigaSpeech).

Want to hear it for yourself before using it? Head over to GitHub to listen to some samples of speech generated by Pheme!

Finally, you can train your own models from scratch or starting from the released Pheme checkpoints.

Future

What’s next for Pheme?

The Pheme framework is PolyAI’s first deep dive into the thriving universe of neural speech generation, and we will continue with our mission of developing and deploying natural, lifelike, efficient, and production-ready speech generation models as an integral part of our world-leading voice-based assistants.

Find out more about Pheme in our latest podcast episode.