Why first impressions matter with voice AI

Regardless of how advanced voice assistants have become, customers sometimes just don’t want to talk to them. Some will try workarounds like shouting ‘AGENT’ or pressing ‘0’ in an attempt to bypass the voice assistant and speak directly with a person.

To fully realize the benefits of a voice assistant for customer service, the customer must trust that the voice assistant can handle their request. This trust is not a given. It needs to be earned.

The first thing a voice assistant says plays a huge role in whether, and how, a customer will interact with the system. When building a voice assistant, we need to pay special attention to the first ‘turn’ of the conversation to earn callers’ trust and establish the rules of engagement.

First turns first // Why the first utterance matters

Before we explore how to earn a caller’s trust, it helps to first understand why customers don’t want to engage with your system. At PolyAI, we’ve used Speech Act Theory as a framework to dig into this.

Speech Act Theory is concerned with the way we use speech to trigger actions. A speech act is an utterance that not only conveys information but that the speaker also expects to result in an action. Think “Pass me the salt” or “Alexa, play Taylor Swift.”

When customers phone a company, they use language to perform an “act”. They expect that the language they use will trigger actions that solve their problems.

In customer service use cases, most speech acts can be classed as “requests” – requests for information about a hotel’s parking, requests to book a table at a restaurant, or requests to check the balance of a bank account, for example.

Before a customer even considers making a request, several conditions (often known as “felicity conditions”) need to be satisfied.

In the example of a restaurant booking, we need to convince the customer of the following before they even consider engaging with the voice assistant:

The voice assistant is capable of making a booking;
The voice assistant will understand that the customer has asked to make a booking;
The voice assistant is the best way to make a booking.

Before they will engage, the customer needs to be fully convinced that the voice assistant is capable of understanding what they want and taking the necessary action.

All of this needs to happen before a customer has a chance to shout ‘AGENT!’ or mash their phone’s keypad.

The opening prompt therefore carries a lot of implicit information, and that information will determine the success of your deployment.

Lessons in elocution // Voice quality

The first thing to consider is how the assistant’s voice sounds. You can build a beautiful, complex system that uses cutting-edge technology and flawlessly crafted utterances, but if the assistant’s voice is frustrating, a customer will likely either ask to speak to someone or just hang up.

The voice needs to be natural, human-like, and dynamic; otherwise, a user simply won’t want to engage with it. No one wants to hear a lifeless, robotic voice on a loop when their bank card has suddenly been blocked.

Small things matter here, from the pacing and intonation of the voice to the way in which audio is edited together or synthesized. Each customer request also requires an appropriate tone of voice – an assistant shouldn’t sound as happy about potential fraud as it does about opening a new account.

The main purpose of this attention to detail is to shift the user’s focus away from the voice and to the task at hand. The less a user notices the voice quality, the more likely they are to stay focussed on what they want to do and stay on the call.

At PolyAI, we have seen 40% higher engagement with voice assistants whose first turns are simple and natural-sounding than with ones that sound stiff and robotic.

Too much too early // Over- vs under-specification

Voice assistant designers often believe that stating up front what the assistant is capable of will encourage users to engage. We’ve found that the opposite is true.

Customer service representatives don’t answer the phone saying, “I’m a customer service representative. I can take payments or change your account details.”

In conversations that start with a list of possible topics or actions, customers rarely stray from these specific suggestions and often condense their language to sound more like the voice assistant’s. These more stilted, limited utterances from the caller give machine learning models less to work with, making the conversation feel even less natural.

Counterintuitively, under-specifying what we can do for a user can lead to more engagement. It starts the conversation with the assumption that we can help with anything. This makes a user more likely to believe that the voice assistant can handle their specific query and, in turn, more likely to actually tell us what they want.

Trust the system // The ‘how can I help’ principle

We’ve found that the best way to encourage users to engage with a voice assistant is to start every conversation with an open question: “How can I help?”

The ‘how can I help’ principle is a foundational pillar of customer-led conversations. We want customers to steer the conversation themselves, rather than being forced down a linear path that doesn’t lead to a solution. Letting the user define their problem and reacting appropriately is key to building trust. The more trust you build, the more willing a user will be to engage with your system.

Replicating natural language means that a user already knows how to engage with a voice assistant. They don’t need to guess what keywords might trigger the path they want to go down. Instead, they can speak freely about what they want to do.

This principle extends to every part of the conversation. Asking open questions and responding to what a user has just said, rather than making them choose from a list of options, puts the user in control of the conversation. As a result, users are more patient with bots and more willing to listen to the solutions they offer, leading to more zero-touch resolution, where the call is resolved without a human agent.

About the Author

Andrew Muir is Lead Dialogue Designer at PolyAI. He received his Master’s degree in Linguistics, Philology and Phonetics from the University of Oxford.