Behind the tech: How dialogue design really works
Summary
Ever wonder what really makes an AI voice feel human? In this episode of Deep Learning with PolyAI, an all-star panel from our dialogue design and engineering teams takes you behind the scenes to share how the hardest part of CX automation actually works.
From latency and personalization to multichannel design and our in-house Raven model, they break down the science (and the art) behind building conversational AI agents that are actually worth talking to.
You’ll learn:
✅ Why dialogue design is notoriously hard, and how PolyAI cracked it
✅ The key factors that make conversations “buttery smooth”
✅ How omnichannel and LLM-powered design are changing everything
✅ What’s next for dialogue design, from multimodal to configurable latency
PolyAI’s Jenn Cunningham guest hosts for this episode, moderating a discussion between Arkadiusz Kwapiszewski, Eshan Singhal, and Harry Swanson.
📌 Don’t forget to like, comment, and subscribe for more conversations at the cutting edge of AI and CX.
Key Takeaways
- Dialogue design becomes productized: PolyAI has transformed dialogue design from a niche social science into a scalable platform capability, allowing CX leaders to build natural conversations that work across industries and channels.
- Voice AI that feels human: Success depends not only on accuracy but on voice quality, latency, copywriting, and personalization — all areas where PolyAI leads by embedding dialogue design expertise directly into its tech.
- Forward-deployed design in action: PolyAI’s dialogue design team partners directly with clients — visiting call centers, analyzing real calls, and co-creating workflows that blend empathy, efficiency, and automation.
- Omnichannel and multimodal future: By separating knowledge from channel-specific delivery, PolyAI ensures a single knowledge base can power voice, chat, SMS, and multimodal interactions, keeping experiences consistent and human-like.
Transcript
Jenn Cunningham
00:12 – 00:58
Hi, everyone. Thanks for tuning in.
So I’m Jenn on our product marketing team. I am a guest host this week, and so welcome to deep learning with PolyAI.
We are a window into AI for CX leaders. I really encourage everyone listening to hit the subscribe button and leave us a five-star rating on your favorite podcast app if you haven’t already.
So now let’s chat a little bit about productizing dialogue design, which is what we’re all here to chat about today.
And I’m super excited. I am joined by three very smart people, from our dialogue design and engineering team.
So, guys, if you would like to go ahead and introduce yourselves.
Arkadiusz Kwapiszewski
00:58 – 01:34
Hi. I’m Arkadiusz.
I’m the head of dialogue design here at PolyAI. And dialogue designers, like, this might be a new term to you, but we have a whole team of linguists, human-computer interaction specialists, and prompt engineers who are basically dedicated to ensuring that the experience of interacting with our voice assistants and chatbots is absolutely top notch. You know, people who are obsessed with attention to detail and really love thinking about edge cases and,
Jenn Cunningham
01:34 – 01:35
Absolutely.
Arkadiusz Kwapiszewski
01:35 – 01:42
you know, making sure that everything is buttery smooth.
Harry Swanson
01:42 – 01:55
Hello. I’m Harry.
I’m a lead on the dialogue design team, and, yeah, I help make everything buttery smooth. Looking forward to chatting about it.
Jenn Cunningham
01:55 – 01:56
I love it.
Eshan Singhal
01:56 – 02:10
Hey. I’m Eshan.
I sit on our engineering team. Yeah, I work very, very closely with Arkadiusz, Harry, and all of the other dialogue designers to understand how we can match up the technology and infrastructure that we have at PolyAI
Jenn Cunningham
02:10 – 02:10
I…
Eshan Singhal
02:10 – 02:16
to what helps make conversations really buttery smooth. I guess that’s the phrase of the day.
Jenn Cunningham
02:16 – 03:04
…gonna surprise us today. I love it.
I love it. Off to a good start.
And I know Oliver from our dialogue design team in the US had been on the podcast a couple months ago talking just about what dialogue design is. So this is kind of a follow-up to that conversation as we’re thinking through it from a platform perspective, from a product perspective.
What does this really mean, to take social science and turn it into a product that can really be used to create fantastic customer experiences repeatedly at scale? So I guess with that, Harry, why don’t you kick us off? What actually makes an agent sound good?
Harry Swanson
03:04 – 04:24
I think that what underlies everything is that the thing has to work. Right? If the thing you’re trying to interact with doesn’t do what you need it to do, we’re off to a terrible start, and everything pretty that we wanna put on top isn’t gonna make an impact.
So our tech stack has to be perfectly optimized to understand what the user wants and be able to act on that, and then be able to integrate with any back end APIs that we need to. Then on top of that, we have what I think of as, kind of core dialogue design.
This is like thinking about interaction design. So if we need to get a load of information from the user, how do we get that from them? What’s the right order to do it in a way that will keep them engaged? If we need to give information back to the user, what’s the best way of doing that? And then how do we take these things and provide a kind of middle layer to start talking to these back end systems? And then, on top of the tech working and the kind of overall interaction design working, I think there are four key, like, UX polish things that make something sound really, really good.
First of all, there’s voice quality. So when we’re talking about voice agents specifically, the thing has to sound amazing.
Right? If you have these, like, terrible, like, robotic text to speech systems, people switch off. They get bored really, really quickly.
It just feels icky and wrong.
Jenn Cunningham
04:24 – 04:25
Yeah.
Harry Swanson
04:25 – 06:20
The second thing I think about on this is really, really snappy latency. Again, when you’re having a conversation with someone, you can jump in and, you know, say, yeah, correct.
If you do that two seconds too late, then I’m already saying the next thing and the conversation is terrible.
People switch off again. The third thing I like to think about is really, like, hyperconversational copywriting.
So the exact words we choose, the decisions we make there, how do we make them sound really natural, really realistic. A lot of, like, old school IVRs use really overly formal language, which again just makes people switch off.
And the fourth thing, which is the stuff that I think we as designers get really excited about in this, like, conversational polish, is personalization.
So if we can look up someone’s account, based on their phone number, then we can start a call by saying, oh, hey. Like, I can see that your order’s out for delivery.
Is that what you’re calling about today? Or is there something else I can help you with? Or if someone’s called up before and we’ve told them, look, try this thing and then call back, we can say, oh, you know, I see that you called us yesterday about this thing. Like, is that still a problem you’re experiencing? This stuff is really, really fun.
It’s really, really exciting. And I think the key thing is, like, these things can feel like kind of bits of design polish, but they’re really, really key to what we do because they drive user engagement.
And user engagement is key to us actually being able to do anything. Right? If people switch off, if they get bored, they just ask to speak to a human, if they clock that we’re a stupid robot, or they just hang up because they think, you know, I don’t need to do this right now, or maybe I can do it another way.
So, yeah, by doing these things that, like, feel nice and lovely, we actually, like, really drive success.
Jenn Cunningham
06:20 – 07:14
No. And it makes a lot of sense.
Right? I think especially on the personalization piece, people are rarely calling customer service because something has gone well.
So if you’re already in a pretty bad mood, let’s say, and then you call because your power’s out, and instead of, you know, waiting to get through or waiting through a messy IVR menu to tell someone, hey, I don’t have power, which is already a really stressful experience, you get, hey,
I noticed that you’re in an area that’s being affected by an outage. We’re gonna have a crew out in two hours.
You should have power back within, like, x amount of time. That’s gonna make me feel a lot better.
And I guess, like so, these are kind of different use cases. But, Arkadiusz, so why do CX agents need dialogue design then? Why not just have the robot kinda answer the phone in a soulless way? Right?
Arkadiusz Kwapiszewski
07:14 – 10:47
I think it goes back to what Harry said about needing the users to be engaged. Right? Like, when we ask, how can I help you, we want people to tell us. We absolutely want to solve their problems.
We want to make their day better. And it’s easier to open up to something that sounds familiar and friendly,
something that sounds like it can actually solve your problem, and that’s what we’re aiming for: a human-like interaction.
You know, this is the most natural way for people to interact. We talk to other people for hours every day.
We’re used to it. Right? This is how children learn language by talking to other people, by listening to other people speak.
And so this voice medium is just so natural for us that it’s a crime, really, that for decades we had to interact with these robotic IVRs and had to learn how they function. Right? Like, we’re looking ahead to the future.
We’re building the future right now and making sure that, yeah, people can talk to our agents, to our assistants, just like they would to another human being, and they can be understood, you know, however they phrase it, and we can get that problem resolved really, really fast and add delight to that process. We often talk on the dialogue design team about adding delight.
This is the cherry on top. You’re supposed to help the user, but you’re also supposed to make that interaction delightful.
Jenn Cunningham
10:47 – 11:27
And I know that we really think about this. If we think of, you know, the old world, where it was really like a black box, how are you thinking about implementing this type of logic on the platform itself? In terms of, like, even what should it sound like, what kind of questions should it be answering, how can we handle these really complex transactions? So I know we have a lot of back and forth, and we’re doing a lot of discussion and design with the client themselves.
But at the same time, there is a lot of kinda out of the box functionality there. So how do we kind of take this into the product itself?
Arkadiusz Kwapiszewski
11:27 – 13:57
Absolutely. The platform makes it super easy to get started.
So, you know, we often think of breadth and depth when it comes to design. So breadth would be your FAQs, like, the repository of knowledge that you have to answer, like, any question the user might ask.
Right? And we can have hundreds and hundreds of FAQs integrated into the bot. It’s super easy to set this up.
You literally just upload a PDF or scrape things from your website. You know, we have integrations available.
So, you know, in just a couple of clicks, you can make our assistants fully aware of, you know, all the knowledge specific to your company. Right? It’s really powerful.
And, yeah, the initial setup is really, really simple. And then you have a few decisions to make when it comes to these simple FAQs.
Like, let’s say it’s stuff like, you know, how do I apply for a loan on the website? You know, you can walk the user through the process, and over the phone, obviously, the agent is optimized to do this step by step, to allow, you know, a lot of waiting time if the user needs to find their way on the website, to allow for repetitions, things like that. Right? Like, we’ll basically have the process of guiding someone on the phone, like, first go here, then go here.
And this tends to work out of the box, which is really fantastic. But then you can also, you know, the platform supports lots of customization.
So you could add, like, an SMS offer for that specific use case. So instead of walking the user through there and then, maybe the user is actually, like, driving their car and they can’t actually do it right now, we can just send them a link for later.
Right? So, automatically, we can use the phone number they’re calling from, or we can collect an alternative one, send them a link, and they have it and don’t need to worry. Right? So, in this way, you know, we can handle, like, hundreds of different topics.
We automatically collect metrics for them, give you insights, like dashboards so you can understand what people are calling about, why they’re calling, and the resolution rates for all the different topics.
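The per-topic dashboards Arkadiusz describes boil down to aggregating call outcomes by topic. Here is a minimal sketch of that aggregation; the field names and topics are illustrative assumptions, not PolyAI’s actual schema.

```python
from collections import defaultdict

def resolution_rates(calls):
    """calls: iterable of (topic, resolved) pairs -> {topic: resolution rate}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for topic, resolved in calls:
        totals[topic] += 1
        if resolved:
            wins[topic] += 1
    return {topic: wins[topic] / totals[topic] for topic in totals}

# A toy call log: each entry is (topic, was the call resolved without a human?)
calls = [
    ("loan_application", True),
    ("loan_application", False),
    ("opening_hours", True),
    ("opening_hours", True),
]
```

Running `resolution_rates(calls)` on this toy log would report a 50% resolution rate for loan applications and 100% for opening hours, which is the kind of per-topic insight the dashboards surface.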
Jenn Cunningham
13:57 – 14:40
Yeah. And it seems like it’s a lot of just meeting people where they are, both in terms of, I think, one, from, like, just an exec management level of thinking about this thing.
I know a lot of people are looking for, like, a common knowledge base, a central one brain, that you can push to multiple channels. And so you have, one, how do you optimize for the process, but then, two, how do you think about this from an omnichannel perspective? How are you all kind of thinking through chat versus, well, I guess chat and text are two different channels, but both digital.
Right? Written channels and then talking. Right? Because we talk so differently to how we text, oftentimes.
Arkadiusz Kwapiszewski
14:40 – 15:29
Yeah. So the platform supports omnichannel.
It’s really easy to set up. And with this, you know, sending a text example, obviously, this is behavior specific to the voice channel.
And then and then when you set up your project for chat, the same behavior would automatically translate to just pasting the link in the chat box. Right? The response style would be different as well.
The voice channel requires us to be very concise because, obviously, people have relatively short attention spans over the phone. You know, if you try to give them, like, a forty-second walkthrough without pausing for breath, they will forget the first step by the time you finish.
Jenn Cunningham
15:29 – 15:30
Yeah.
Arkadiusz Kwapiszewski
15:30 – 16:02
But it’s really important. You know, it’s all about the cognitive load of the user. Like, we can’t overload the user.
Harry Swanson
16:02 – 16:43
This is where LLM powered conversation design really comes into its own. Right? So we don’t have to design a whole bot for chatting and another one for voice.
Your knowledge base can be the repository of information that we have. And then, through prompting, in terms of what we want and the output format we need, our LLM can convert that into, like, the ideal content for the voice agent, which is, yeah, much, much shorter sentences and asking the user if they need to wait for the next step and things like that, or a different version for chat, where we get nicely formatted lists and hyperlinks and things like this.
Eshan Singhal
16:43 – 17:40
And that was one of the key decisions you make when you’re building the platform: whether you want to separate the content of what you’re trying to say, so the actual content and the knowledge base, from how you say it.
So whether that’s the channel, when you’ve got things like SMS, chat, voice, you’ve got all these different ways you can present the information to the user. It also comes into play when you’re thinking about things like multiple languages as well, because often you’ve got the same content for both languages, but obviously how you present it to the user changes, because you’ll phrase it one way in one language and a different way in another.
And so, again, like Harry said, it’s where having, you know, these very powerful LLMs comes in really handy, because you can kind of tell the model what’s going on and you get a lot of that for free; it can adapt to those parameters as you give them to it.
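The separation the panel describes, one knowledge base, many channel renderings, can be sketched as injecting channel-specific style rules into the prompt alongside the shared content. This is an illustrative toy; `CHANNEL_STYLES` and `build_system_prompt` are assumed names, not PolyAI’s actual API.

```python
# One shared knowledge base, rendered differently per channel via prompt
# instructions that would be sent to the LLM alongside the content.
CHANNEL_STYLES = {
    "voice": (
        "Respond in one or two short spoken sentences. "
        "No bullet points, no links. Offer to pause between steps."
    ),
    "chat": (
        "Respond concisely with formatting. "
        "Use bullet points for steps and include relevant hyperlinks."
    ),
    "sms": (
        "Respond in one short message under 160 characters. "
        "Prefer sending a link over step-by-step instructions."
    ),
}

def build_system_prompt(knowledge: str, channel: str) -> str:
    """Combine the shared knowledge with the style rules for one channel."""
    return (
        "You are a customer service agent.\n"
        f"Style rules for this channel: {CHANNEL_STYLES[channel]}\n"
        f"Knowledge base:\n{knowledge}"
    )

kb = "To apply for a loan, go to the website, open 'Loans', and fill in the form."
voice_prompt = build_system_prompt(kb, "voice")
chat_prompt = build_system_prompt(kb, "chat")
```

The point of the sketch is that the knowledge line never changes per channel; only the style instructions do, which is what keeps the experience consistent across voice, chat, and SMS.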
Arkadiusz Kwapiszewski
17:40 – 20:29
And, actually, there is a really cool story about dialogue design and productizing dialogue design in here. So when it comes to voice and chat, multichannel, as Eshan said, things kinda work out of the box.
Right? You select the channel, and then, you know, we select the best models for you, and they’re trained to respond differently. The style is very different.
Right? Chat requires longer paragraphs, allows you to use bullet points for formatting and so on. Whereas the voice needs to be really human-like, really conversational.
It cannot be overly verbose. A lot of the LLMs on the market, a lot of the models are actually inherently chatbots.
Right? So the most natural mode of operation, the data they’ve been trained on, has been chat data. Right? The easiest way to spot an LLM is actually verbosity.
Like, they love responding in large paragraphs, producing a lot of text, you know, generating a lot of output tokens that we have to pay for. And, you know, this just is not suitable for voice.
Little mannerisms, you know, things like, the user says, oh, I want to register, and the model will respond, since you said you wanted to register, let’s continue with it.
It’s completely redundant. Humans don’t say that.
Right? So, you know, on the dialogue design team, we have put together a guide of best practices for writing utterances: you know, what are the guiding principles for what sounds natural, and how do the top LLMs violate these principles? What makes them sound like robots rather than humans? We have collaborated with the research team to use these guidelines to train our internal in-house model as well.
It’s called Raven. It’s fully optimized for voice conversations, and it’s an amazing example of how we’ve taken this, you know, internal linguistic expertise, defining the problem very precisely, giving a lot of examples, and then we’re able to productize that knowledge by training a model that now everyone can use.
Right? You don’t need to be a linguist now to produce fantastic conversational interactions. It just works out of the box, and it’s an amazing feeling.
Jenn Cunningham
20:29 – 21:23
Yeah. Well, I know I was doing some work on how you compare our model versus others.
And when you think of the in-house versus the external ones, just the latency that you get, and the long responses, it’s just, like, I was testing it, and I’m like, I hate this. For the other one, I’m just sitting there internally screaming. It was just so lengthy. I’m just trying to run a basic test call, and I’m like, oh my god.
This is misery. But I guess I know we’ve talked a lot just in terms of, like, the product, the platform.
So why don’t we talk a little bit in terms of, like, forward deployed dialogue design? So not just dialogue design in our algorithm or within a product itself, but actually for the teams. How are we really leveraging this?
Arkadiusz Kwapiszewski
21:23 – 24:36
Yes. So this term forward deployed is really hot right now.
You know, we think of this as problem solving with our clients, so it’s a very collaborative enterprise. I think the best example is just visiting our clients on-site.
Right? Visiting the call centers, talking to their stakeholders, and understanding exactly what their problems are, what issues they’re facing. I love going to call centers and listening to real calls before we even start working on the project.
You know, when our clients come to us, they don’t always fully know what they want to build and what the agent should actually say in response to specific questions. So we’re there to help and figure it out by, you know, asking questions, asking for clarifications, agreeing on some common scope of what the agent should do.
Right? It has to be really explicit, because we want our clients to trust us, and it’s really fun.
Jenn Cunningham
24:36 – 25:14
It’s great. I think when we’re thinking about forward deployed dialogue design, it’s how we are really seeing where the call center is today, where the customer service conversations are, and how we can make that better and more kind of human-like and interesting to engage with.
So then I guess, Harry, question for you. I think something that comes up a lot is just emotion and tone, you know, being empathetic, which are generally traits not associated with robots.
So how do we kinda think about that when we’re looking at a kind of customer automating customer service interactions over the phone or on digital channels?
Harry Swanson
25:14 – 25:52
Yeah. I mean, if we talk about personalization, accommodation, meeting the customer where they are, this is necessary.
Right? You have to talk to the customer in the way that they’re talking to you, to make them feel like they’re listened to and help them get the thing done that they need done. We have one favorite example of this in the company.
Do you have it, or do I need to dig it up?
Jenn Cunningham
25:52 – 26:14
Yes. Yeah.
I think I have it. I’ll see if it plays.
Because I think what this kind of sets up is, to the point earlier, that people are rarely calling customer service because things have gone well.
Harry Swanson
26:14 – 26:14
Yep.
Jenn Cunningham
26:14 – 27:55
And so sometimes you’re meeting people on, like, one of the worst days of their life. And how can you make them feel heard and supported also without having to sit on hold for a million years? So, yeah, let’s jump into this.
Can you guys hear that?
“Hi. Thanks for calling in. How can I help?”
“Oh, hi. My mom has passed away, and we need to find out how much bond there is for the probate of a will.”
“I’m very sorry to hear of your loss, but our dedicated team is here to help. Just so I’m sure, have you already sent your instructions to us?”
Harry Swanson
27:55 – 29:32
So, yeah, when we speak about someone calling up one of these systems on the worst day of their life, we see that here. Right? This is someone whose mother’s just died, and they need to get the money out of the account.
And you hear the change in tone, in voice quality, and emotion from that really, like, peppy, hi. Thanks for calling.
How can I help? And then we realize the situation we’re dealing with, and we change track. First of all, we always knew that we would have to deal with these calls as dialogue designers.
Our instinct when we first started working with these kinds of clients was that, you know, if someone calls up about a bereavement or something like this, we’ll just transfer the call. We don’t think that people are gonna want to be interacting with AI in a situation like this.
But we tried this out in this case, and, actually, what you see is really interesting. Both the resolution rate and the customer satisfaction scores are really, really high here.
If you guide a user through a workflow with empathy, if you meet them where they are, and, obviously, if you let them get through to a human if they want to, that can actually be a better solution. Sometimes you might not opt to talk to a person in a situation like this.
You just want your problem resolved really quickly and with empathy. And yeah.
And that’s what we’re doing here.
Jenn Cunningham
29:32 – 30:03
100%. And I think something else that’s kind of interesting about this specific project, right, is that you have kind of the worst day.
How can you help people on a really hard day? But then, on this same project, we have people who call up about prize drawings. And so the same agent will have to say, hey,
let me help you process, you know, a deceased relative’s account information. And then the next call could be, congratulations.
You’ve won a million pounds.
Harry Swanson
30:03 – 30:03
Yep.
Jenn Cunningham
30:03 – 34:11
Very different situations to think about. Just two extremes, really.
You have to be able to accommodate everything in between. Let’s kind of go a level deeper, right, in terms of what’s actually happening when we’re thinking about voice-first architecture: how do we think at a more granular level to be able to handle conversations?
Eshan Singhal
34:11 – 34:44
I think it’s really important to start thinking about, you know, what we’re trying to achieve when someone calls in. So, like, the classic example I give is that someone’s calling, like, a restaurant.
They may say something like, hey, is this x restaurant, or something? They may say hello, something really conversational, and ease themselves into the conversation. But you also might get the kind of other end of the spectrum where someone calls in and they’re like, hey.
Do you have a table in fifteen minutes for, like, me and my friend
Jenn Cunningham
34:44 – 34:44
Yes.
Eshan Singhal
34:44 – 37:11
which is, like, in the area. Right? And it’s very, very clear that they want to kind of achieve, you know, in this case, their task, which is to make a booking very, very quickly.
Right? And so, you know, we have to be able to handle both of those scenarios from the get go with the technology. Right? If you’re looking at both the, like, you know, whether it’s the speech recognition side or the kind of conversational side, you can’t have, like, very, very opinionated views on what the user might say, so early on into the conversation.
Right? So it’s really, really important that we start from a really good set of base technologies that have very good defaults. We kind of use the data that we get from live conversations to tweak that, understand the most common use cases, but don’t throw away the niche ones either.
So find that balance there as well. And then as we develop into the conversation, you can use the kind of more powerful but often more niche models or tools at your disposal, as you’ve kind of guided the customer into, I don’t know if we wanna say, like, a path, but kind of, you know, one of the things that we’re more comfortable with.
So in the scenario where someone says, you know, like, oh, hey.
Like, I’m just looking to book a table. You know, we know then we’re gonna ask questions like, okay.
How many people are coming? What time do you wanna come? What date do you wanna come? And when we’re asking those questions, we can then really, really pick the best model and tool for the scenario in order to do that. A really common theme that we see across all of our deployments is that the more you can kind of prime the technology to guess at what the user might say, the better your outcomes are going to be.
And that’s when someone like me has to work really closely with Arkadiusz or Harry or someone on the dialogue design team to understand how we can do that from a technology perspective, because, you know, we need to be able to educate users on how to get the best out of the technology, and then use the feedback that we get there to automate it as well. So now, when we know, okay,
we’re asking for this, we’ll automatically give you the best parameters. You don’t have to be, like, a power user in order to understand that.
You know, when someone asks the question, what time do you wanna come? And they say four, that’s 4PM rather
Jenn Cunningham
37:11 – 37:13
I know. Yep.
Eshan Singhal
37:13 – 37:30
than four people, which might be one of the other kinds of things that people would tell you in that same conversation. Right? Yeah.
So that’s one of the big challenges that we face, and it’s really important to kind of marry up the technology side with the dialogue design to achieve really good outcomes.
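Eshan’s “four” example can be sketched as a tiny context-priming rule: the question the agent just asked decides how a bare number is interpreted. The slot names and rules below are illustrative toys, not PolyAI’s actual implementation.

```python
# The same short answer ("four") is resolved differently depending on which
# question the agent just asked.
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

def interpret_answer(active_slot: str, utterance: str):
    """Use the active slot as context to disambiguate a bare number."""
    text = utterance.strip().lower()
    value = WORDS.get(text, int(text) if text.isdigit() else None)
    if value is None:
        return None  # not a bare number; hand off to a fuller parser
    if active_slot == "booking_time":
        # For a dinner booking, a small bare hour means the evening: 4 -> 16:00.
        hour = value + 12 if value < 10 else value
        return ("time", f"{hour:02d}:00")
    if active_slot == "party_size":
        return ("covers", value)
    return ("number", value)
```

So `interpret_answer("booking_time", "four")` yields a time of 16:00, while the identical utterance after the party-size question yields four covers, which is the priming effect Eshan describes.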
Arkadiusz Kwapiszewski
37:30 – 38:38
Yeah. I think a really good example of this is, like, collecting something like a phone number versus collecting an email address.
Right? So for a phone number, like, you know, as a user of the platform, all you need to do is, like, select that you’re collecting a phone number, and we’ll switch on the best settings for number recognition.
We can toggle DTMF collection as well, so the user can give the number through the keypad. And it’s, again, it’s really smooth.
Right? Like, we’re really good at this. But something like email addresses is much more complex to get over the phone.
I mean, even real agents, especially when the line is bad, would struggle to understand it. For example, my email address is my name and surname, so, you know, it would take a lot of spelling to get it right.
So what we can do instead is we just change the channel, and we send the user an SMS, and they can just reply with their email address, and we get it straight away. And it’s super flexible.
Right? And that’s, like, an aspect of dialogue design. But how do you basically play to the strengths of the technology and always use the most reliable solution for any given use case?
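As a rough illustration of playing to the technology’s strengths, the choice Arkadiusz describes, speech plus keypad for phone numbers, a channel switch to SMS for email addresses, might look like this. The method names are assumptions for the sketch, not PolyAI’s configuration.

```python
# Choose the most reliable collection method per data type and channel.
def collection_strategy(data_type: str, channel: str) -> dict:
    if channel != "voice":
        # In chat or SMS the user can simply type the value.
        return {"method": "typed", "fallback": None}
    if data_type == "phone_number":
        # Numbers are recognized well over speech; keypad entry as backup.
        return {"method": "speech", "fallback": "dtmf_keypad"}
    if data_type == "email":
        # Spelling an address aloud is error-prone, so switch channel:
        # text the caller and let them reply with the address.
        return {"method": "sms_reply", "fallback": "spell_aloud"}
    return {"method": "speech", "fallback": None}
```

The design choice here is that the conversation, not the caller, absorbs the awkwardness: the hardest data type is routed to the channel where it is easiest to capture.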
Eshan Singhal
38:38 – 41:00
Exactly. And it’s also about keeping the conversation natural as well.
Right? So in the example that you gave about collecting email, well, sometimes you can avoid collecting email if we can just find the user’s profile in a CRM or in a database.
And often the kind of key that you would use to look that up is going to be something like a phone number or an account number. And those we can collect really, really well.
And so we can kind of collaborate here and change the conversation slightly but still achieve the same outcome. Because, you know, even if we can’t collect an email address super successfully from scratch... I know, for example, like, my email address again, it’s my name, but I’ve got some numbers at the end of it because I didn’t get just my name.
I imagine Arkadiusz has a slightly easier time with his name. Mhmm.
So, you know, if I said, oh, it’s my name, surname, @gmail.com or whatever, the agent can match that up, especially now with, like, LLMs.
They can match things up to pieces of potential data really, really well. So often we actually identify the user really, really fast, and any kind of follow-up questions there are just to
confirm we’ve got the right person before we proceed, you know, especially in sensitive scenarios. You know, you’re talking about finance, health care.
It’s imperative that you make sure you’ve got the right person.
Harry Swanson
41:00 – 42:58
In terms of playing to the tech’s strengths as well, there’s a point about the forward deployed piece, where, yeah, between the design and engineering teams internally, we can work out, okay, I need this from the engineering team to have the design I want, and they need this from me so we can get the best performance.
But we can also go back to our client teams and say, look. Our recommendation in this case is to collect this piece of information first and then, you know, validate based on this thing where we can look it up.
You know, rather than looking it up zero shot, we can have a look at what’s already there. We can even ask them to reformat their data if we don’t think we’re gonna give them good performance there.
Right? So it’s a real team effort between, yeah, all our internal teams and the client we’re working with.
Jenn Cunningham
42:58 – 43:15
Eshan, the last piece, right, is really latency, kind of what I was complaining about before. When you have those long pauses in between questions, you don’t feel heard.
As a customer, it sucks. But I guess, how are we thinking about that from a technical side of things?
Eshan Singhal
43:15 – 44:05
Yeah. That’s a really good question.
I think one of the main things for us at least is that, you know, we have this concept of what we call user perceived latency. So after the user’s finished speaking, how long is it before, you know, the agent says something back to them? And if you look at that, that’s really a shared resource.
You know, we have to do everything from understanding what the user is saying, potentially transcribing it, even figuring out that they finished speaking. That’s actually quite a difficult problem as well.
Process that, execute the tasks that we need to execute for that turn. Maybe we’re making some API calls.
Maybe we’re sending some SMSes, whatever it is we’re doing, and respond to them, which normally also has a text-to-speech element. So you’ve got all these different models running, and latency is a shared resource across all of them.
So.
Jenn Cunningham
44:05 – 44:05
Yeah.
Eshan Singhal
44:05 – 46:28
because of that, you know, the way that we try to look at it is we try to keep the round-trip time roughly around one second. Anywhere between one second and about twelve hundred milliseconds is very, very natural for humans.
Once you push past one and a half seconds, it’s noticeable. And so you’re talking about two hundred to three hundred milliseconds of latency making a really big difference.
For us on the technology side, that means that the first thing that you do is basically orchestrate everything. You know, you have to understand exactly where that time is being spent.
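[Editor's note: a minimal sketch of the "shared resource" framing described above — each pipeline stage draws from a single turn-level budget. The stage names and millisecond figures are illustrative assumptions, not PolyAI's actual numbers.]

```python
# Hypothetical per-stage budgets for one conversational turn, in milliseconds.
PIPELINE_BUDGET_MS = {
    "endpointing": 200,      # detecting that the user has finished speaking
    "transcription": 150,    # speech-to-text on the final utterance
    "understanding": 250,    # deciding what to do this turn
    "api_calls": 200,        # e.g. a lookup, or sending an SMS
    "tts_first_audio": 200,  # time until the first synthesized audio plays
}

def perceived_latency_ms(budget):
    """Total round-trip time the caller experiences for the turn."""
    return sum(budget.values())

def within_natural_range(total_ms, ceiling_ms=1500):
    """Roughly 1s feels natural; past ~1.5s the pause becomes noticeable."""
    return total_ms <= ceiling_ms

total = perceived_latency_ms(PIPELINE_BUDGET_MS)
print(total, within_natural_range(total))  # 1000 True
```

Because the stages sum to one perceived number, shaving two to three hundred milliseconds off any single stage moves the whole experience — which is why the first step is instrumenting exactly where the time goes.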
Harry Swanson
46:28 – 47:01
Yeah. And this is absolutely something that I’ve been in the engineering team’s ear about for years.
Right? It’s like, I need it faster. I need it faster.
I need it faster. There have been so many strides in this latency portion in the last six months or so that people have started saying, wait.
This is too fast. Should we roll it back? And it’s like, no.
Because now I can design around it. Right? If it’s too fast, then we can add it back in.
If it’s too slow, we can’t take it away.
Jenn Cunningham
47:01 – 47:02
Yeah.
Eshan Singhal
47:02 – 51:21
Yes. And I think a lot of this just comes down to, you know, making sensible choices, especially when working over the voice channel.
The word agentic is obviously very, very hot in the industry right now. You can’t
have a completely agentic workflow that results in a model making five decision points before it figures out what to say to the user. We’re often working in one-shot scenarios, especially over voice.
It comes back to, again, what we were discussing earlier. If you’re working over chat, well, now you can actually have a slightly slower experience if you want to.
So something that we’re thinking about a lot here is making latency more configurable from a design perspective, which sounds really weird because it’s not a normal scenario to be in. And that also means that you can’t necessarily use the biggest models out there. If you look at the news, you’ll see these research labs pumping out bigger and bigger models.
A lot of those won’t be appropriate out of the box for our use cases. And that’s why we’ve seen so much benefit from training in house as well.
You not only get the ability to fine-tune those models to the performance that you want, but you can also run them much smaller and much faster. And you get that choice of what you want to do, rather than being constrained to what one of these research labs is offering.
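[Editor's note: a sketch of the configurable-latency idea — treat each channel's latency target as configuration, and let that budget bound how much sequential model work a turn can afford. The channel names, targets, and per-step cost below are invented for illustration, not PolyAI's actual figures.]

```python
# Hypothetical design-time latency targets per channel, in milliseconds.
CHANNEL_LATENCY_TARGETS_MS = {
    "voice": 1000,  # callers notice pauses past roughly 1.5 seconds
    "chat": 4000,   # typing indicators buy some headroom
    "sms": 10000,   # asynchronous, so latency is barely perceived
}

def max_reasoning_steps(channel, per_step_cost_ms=800):
    """How many sequential model calls fit inside the channel's budget.

    Voice collapses to one shot; slower channels can afford a more
    agentic, multi-step workflow (or a bigger, slower model).
    """
    budget = CHANNEL_LATENCY_TARGETS_MS[channel]
    return max(1, budget // per_step_cost_ms)

print(max_reasoning_steps("voice"))  # 1: effectively one-shot
print(max_reasoning_steps("chat"))
```

The same knob also motivates training smaller in-house models: shrinking the per-step cost is the only way to fit more reasoning into a fixed voice budget.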
Jenn Cunningham
51:21 – 51:56
Absolutely. And I guess with that, I feel like we’ve had a really, really comprehensive kind of discussion on all of the aspects of dialogue design.
And we’ve got a lot of really exciting things coming. I love the idea of latency being configurable.
For so long, I think it’s just been that we need to have the lowest latency possible, and now it’s actually: let’s think about how we optimize latency for a given situation. But what are you guys excited about in terms of what’s next for the field?
Eshan Singhal
51:56 – 52:08
I think one of the big things here is going to be an explosion of voice interfaces, but also combining that with multimodal. Right? So,
Jenn Cunningham
52:08 – 52:08
Yeah.
Eshan Singhal
52:08 – 52:38
you know, things like, hey, I’ve sent you a quick form to fill in, or I’ve sent you a picture to look at, and the user can interact with that whilst they’re still on the phone, interacting with the agent in real time and using that multimodality. I think it’s just having it all be within one interaction with the user.
Jenn Cunningham
52:38 – 52:38
Oh, yeah.
Eshan Singhal
52:38 – 53:28
I think historically these things have been kind of separate. It’s like you go to a website, you say something, then you call up.
The person you call has no idea what you were doing on the website. Whereas now I think all of these things are going to be merged into one very powerful experience.
I only expect that to get more and more common as time goes on.
Harry Swanson
53:28 – 54:09
Yeah. And
I think for our team, it’s this combination of going multimodal, omnichannel, and also being more deeply integrated with our clients. Right? I mean, we normally start by taking on one use case to prove that we’re actually any good at what we do.
And the longer we work with these clients, the more deeply we can integrate with their wider customer service solution. And that’s really exciting, because it means that we’re not just thinking about this minute or minute and thirty seconds that the PolyAI agent is on a call with a user.
We’re thinking about the weeks, months, years of the customer journey. This thing of,
Jenn Cunningham
54:09 – 54:09
Yeah.
Harry Swanson
54:09 – 54:21
like, oh, we pulled this thing back then, and now I can do this for you, and did you still want this? And then we can really start using that to drive real value. Right? I think that’s very exciting.
Arkadiusz Kwapiszewski
54:21 – 55:45
Yeah. What I’m excited for, for the dialogue design team, especially, is the development of our platform.
So I feel like with AgentStudio, every month feels like Christmas because we get a long list of feature releases and new cool functionalities that we can play with. And it’s fantastic.
Right? The feedback that we give about best practices for dialogue design and best practices for conversational AI gets built into the platform. More and more things just work out of the box, and we don’t need to worry about them anymore.
And we can just, like, you know, click a toggle. It just works.
It’s fantastic. And I feel like, with AI, the narrative that I like is that by automating the boring, repetitive tasks, we free ourselves up to do more creative work.
Like thinking about the user experience more deeply, taking a step back, asking why, thinking about the processes, the whole user journey, and the context of it. And that’s what we see happening. Our work and our priorities are shifting towards asking the deeper questions and challenging some of these assumptions, because the technology gets better and the platform gets more powerful. And that’s really fun.
Jenn Cunningham
55:45 – 56:18
Oh, exactly. I think it’s just a really exciting time to be in this space, as this technology becomes more mainstream and people are more willing to engage with AI over the phone and on digital channels.
You know, how do we reimagine conversations, and how can we solve people’s problems better? Awesome. Well, thank you guys so much for joining me today.
And for all of our listeners, definitely, make sure to review, rate, subscribe, and we will see you on another episode soon. Thanks so much.