What does truly multilingual CX sound like?
Summary
Every customer expresses themselves differently, and their choice of language is no exception. So how can brands bridge the gap when customers come to them speaking different languages?
In this episode of Deep Learning with PolyAI, Jenn Cunningham sits down with Matt Henderson, VP of Research, and Viola Lin, Product Manager, to explore the evolution of multilingual voice AI and what it means for global customer experience.
They unpack what “multilingual” really means in practice — from accurate translation to cultural intelligence — using PolyAI’s AI models and Agent Studio platform as a starting point to discuss how enterprises can deploy agents that sound natural, respectful, and consistent in over 45 languages.
They explore:
- Why multilingual CX should factor in much more than just translation, including cultural nuances
- How PolyAI handles dozens of languages in a single deployment
- Real-world challenges and what to do about them: from accents and formality to tone and gendered language
- The role of fine-tuning, voice selection, and design in making AI sound human
- How global brands use multilingual AI to deliver support that feels “local” everywhere
Key Takeaways
- True multilingual AI goes beyond translation: PolyAI’s new unified multilingual system lets enterprises build a single agent that can understand and respond naturally across dozens of languages — capturing cultural nuance, not just words.
- Raven sets a new standard for accuracy: Unlike general LLMs that can “slip” back into English mid-conversation, Raven maintains 99.9% language consistency and adapts tone, formality, and gender appropriately across languages.
- Cultural intelligence is the differentiator: From how names are “written” in Mandarin to the extra politeness needed in Japanese, PolyAI’s platform allows per-language style guides and voice tuning to create humanlike, culturally informed customer experiences.
- The future is speech-to-speech: With Raven iterations in development, PolyAI is training models that understand and respond directly to audio, improving latency and naturalness — moving closer to real-time, multilingual conversations without separate ASR pipelines.
Transcript
Jenn Cunningham
00:19 – 01:06
Hi, everyone. Thank you so much for joining Deep Learning with PolyAI.
We’re here to help CX leaders get a window into the latest and greatest developments in AI. I’m your guest host, Jenn Cunningham.
I’ll be subbing in for Nicola today to bring you insights from our PolyAI experts on multilingual capabilities and why they matter. And since this is a podcast, I have to encourage you to hit the subscribe button, give us a like on YouTube, or leave a five-star rating in your podcast app.
But without further ado, I’m going to hand it over to Matt and Viola. If you two could please introduce yourselves.
Matt Henderson
01:06 – 01:17
Hey. I’m Matt.
I’m VP of Research at PolyAI, leading a team that is, among other things, training LLMs. Excited to talk about that today.
Yeah.
Jenn Cunningham
01:17 – 01:17
Fantastic.
Viola Lin
01:17 – 01:27
Yeah. Hi.
I’m Viola. I’m a product manager at PolyAI.
I’ve been focusing on enabling multilingual capabilities on our agents.
Jenn Cunningham
01:27 – 03:19
Amazing. Thank you.
Thank you guys so much for being here. I’m so excited.
When the idea for this podcast episode first came about, I thought it would be really great to talk about our multilingual capabilities, because Matt, with Raven v3, we have these fantastic new capabilities built into the LLM, and Viola, you’ve also been working on all of these fantastic upgrades to Agent Studio from a platform perspective.
So it seems only fitting that we really talk about multilingual solutions, especially now. When we think about the history of PolyAI, we’re called PolyAI because our founders were all polyglots.
So it seems only fitting that we have a polyglot solution as well, one that can support different languages. I recently got back from our Vox event in San Diego, and the theme was fluency.
We were thinking about fluency as mastery in terms of conversational AI, but there’s also fluency in multiple languages, and being able to have a multilingual solution.
So, Matt, Viola, I’m curious to get your thoughts: what does multilingual really mean in the world of voice AI? Matt, maybe we can start with you.
Matt Henderson
03:19 – 04:35
Yeah. Totally.
So one distinction that’s interesting is whether we’re talking about a single conversation that’s multilingual. We do have use cases like that, where we don’t know if the person on the end of the line is going to speak to us in Spanish or English or something else, and we’re able to switch back and forth. Maybe we start talking to them in one language, they tell us they don’t speak it, and we switch. And there are interesting questions there, like what happens to the speaker identity in the text-to-speech.
We can have multilingual text-to-speech models that sound like the same person speaking each of those languages fluently. But there’s also multilingual in the sense of a project.
Say I have a business that spans multiple countries. Maybe I’m comfortable prompting the agent in English, but I want it to go out there and speak not just English but various other languages. We support dozens of languages there.
Each of those scenarios has interesting modeling challenges, and we’re able to do each of them. Is that what you’d say too, Viola?
Viola Lin
04:35 – 04:38
Yeah, I think so.
I think you covered everything.
Jenn Cunningham
04:38 – 04:39
Wait...
Matt Henderson
04:39 – 04:42
Okay.
Jenn Cunningham
04:42 – 05:14
Well, Viola, you’ve seen quite a lot in terms of multilingual capabilities. I know we’ve worked a bit on adding additional languages into the platform.
I know Arabic is one we’re really excited about. But can you talk a little about how we’ve evolved in terms of multilingual voice AI from a platform perspective? Historically, how were we deploying, and what were some of the pain points? And what’s really exciting about the awesome work you’ve been doing this whole year?
Viola Lin
05:14 – 06:46
Yeah. Really good questions.
I think in the old way, when we were building a multilingual bot, what we would do was build a separate bot per language. So for example, if you wanted your bot to speak English, Mandarin, and Japanese, we had to build three bots.
And for each bot, you had to write the prompt in the local language. So you had to prompt in English, in Japanese, and in Mandarin, and that comes with a lot of different challenges.
The first is that it’s really difficult to maintain. We’ve seen a lot of requirements from the US, for example.
Clients will need the bot to be able to speak English and Spanish. They’re basically sharing the same FAQs and the same workflows, but everything has to exist in two separate languages.
It’s really hard to maintain, and it’s really hard to do call reviews across different languages.
It’s also very hard to scale. If one day they want to add one more language, say French or German,
they have to build a separate bot. So historically that has been very difficult, but that was last year.
This year we have a new solution, which we’re really excited about. We’ve done a lot of evaluations and testing on new models.
So we’re introducing a unified multilingual system that allows users to build and manage all the languages in a single bot. That means you can build a single bot with all the prompts in English, and the bot will be able to understand and speak multiple languages.
Jenn Cunningham
06:46 – 06:46
No.
Viola Lin
06:46 – 08:32
And other than that, we’re also giving you the capability to fine-tune your bot. That’s why we call it the Language Hub in Agent Studio, because we believe multilingual is not only about translation; it’s also about delivering that superhuman experience in different languages.
I’ll give a few examples of what we mean by fine-tuning capabilities. We’ve run into a lot of language nuances across different languages.
For example, it’s really common in English, when a restaurant is taking a reservation, to ask the caller, “Hey, can you spell your name for me?” But the concept of spelling your name doesn’t really exist in Asian languages such as Mandarin and Japanese. Instead of asking how you spell your name, we ask how you write your name.
And in Arabic it’s even more interesting. You don’t have to ask how to spell or write the name at all, because the restaurant staff will know how it’s spelled once you tell them, “My name is Mohammed” or “My name is Ahmed.”
Those kinds of things are quite straightforward, but they really make a difference in whether callers feel the language has just been translated or the bot actually sounds like a real human. So in the Language Hub we offer in PolyAI, we give you per-language prompt tweaking, so you can make small adjustments for each language to make it sound even more natural.
We also give you extra space for adding a system prompt or style guide for each language, and we give you the ability to select different voices.
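To make that concrete, here is a minimal sketch of how per-language overrides like the ones Viola describes might be layered on top of a single English-prompted agent. The structure, field names, and voice IDs are hypothetical illustrations, not the actual Agent Studio or Language Hub schema.

```python
# Illustrative sketch only: a shared English prompt plus per-language style
# guides and voices. All identifiers here are hypothetical, not PolyAI's schema.

BASE_PROMPT = (
    "You are a restaurant reservation assistant. "
    "Collect the caller's name, party size, and preferred time."
)

LANGUAGE_OVERRIDES = {
    "zh-CN": {
        "voice": "mandarin_female_1",  # hypothetical voice ID
        "style_guide": "Ask how the caller writes their name, never how they spell it.",
    },
    "ja-JP": {
        "voice": "japanese_female_2",
        "style_guide": "Use polite keigo throughout; customer-service register, not casual speech.",
    },
    "ar-SA": {
        "voice": "arabic_male_1",
        "style_guide": "Do not ask the caller to spell their name; confirm it back to them instead.",
    },
}

def build_system_prompt(language: str) -> str:
    """Compose the shared English prompt with any per-language style guide."""
    override = LANGUAGE_OVERRIDES.get(language, {})
    parts = [BASE_PROMPT, f"Respond only in {language}."]
    if override.get("style_guide"):
        parts.append(override["style_guide"])
    return "\n".join(parts)

print(build_system_prompt("ja-JP"))
```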
Jenn Cunningham
08:32 – 09:58
Which is awesome. It’s so awesome.
I think especially in terms of what we’ve been hearing from the market: people don’t want to maintain all of these separate solutions, but there is still individual nuance required. So this is really opening up the platform to be a common knowledge base across countries and across languages.
But you can still deliver these nuanced, culturally informed experiences and conversations in a way that keeps customers engaged and makes them feel supported. Because otherwise it’s just not a fun experience.
Very rarely are you calling up because something’s gone right.
But maybe you’re calling up saying, hey, I had trouble booking online, I need to be able to book over the phone. Or maybe, and we were laughing because we saw this recently, someone had forgotten their anniversary.
So they were calling up: hey, do you have a table available in ten minutes? And the restaurant asks,
what’s the occasion? Oh, it’s an anniversary. So you forgot your anniversary,
and you need to book it right now. And if someone then asks, okay,
what’s your name, or how do you spell your name, when it should be how do you write your name,
it makes a big difference in terms of making people feel supported.
Matt Henderson
09:58 – 09:58
Yeah.
Jenn Cunningham
09:58 – 09:59
Right.
Matt Henderson
09:59 – 10:10
It can be kind of a vibe thing. I mean, you can tell, if you’re speaking in a non-English language, when you’re talking to something that’s just doing translation in the background. You know?
Viola Lin
10:10 – 10:10
Mhmm.
Matt Henderson
10:10 – 10:39
It will just sound like English word order, or English constructions forced into your language.
So, yeah, we’re taking advantage of all these amazing new models that are multilingual, but we’re also saying: when you need to get it right in a certain language, we make sure the model does the best it can, and there are also the right places for you to prompt it.
Jenn Cunningham
10:39 – 10:39
Right.
Matt Henderson
10:39 – 11:04
Even saying something like, “What can I help you with today?” The “I” and the “you” might be gendered in other languages, which is just not a thing in English, and the “help you with” construction, with “with” at the end, is the kind of thing a naive system would reproduce in another language, and it just doesn’t sound natural.
Yeah.
Jenn Cunningham
11:04 – 11:19
Makes sense. And, Matt, that’s actually a really great segue.
Do you want to talk about some of the fantastic work you’ve done with Raven v3 and some of the core differentiators you’re seeing, especially relative to other solutions on the market?
Matt Henderson
11:19 – 12:53
Yeah. Totally.
I think one of the appealing things about large language models, just large models trained on lots of data, is that they seem to miraculously do well across languages even if you didn’t really try to train them to be multilingual. They’ve just read everything that’s available on the internet. And you talked a bit about the history of this company and how we wanted it to be a polyglot system.
Nikola Mrkšić, the CEO, has a paper from around 2017 all about cross-lingual word embeddings, and back then people were doing all these really interesting and intricate techniques to make models work across languages,
to transfer your knowledge in English to different languages. But now we don’t think about that stuff. It just works.
Right? It’s trained on all these different languages, and when you prompt it in English but say “respond in French,” it has a go. What we’ve done with Raven v3 is make sure that works super well and super reliably.
So if you look at something like GPT-5 and we prompt it in English but tell it to respond in a different language, about 98% of the time it does do that. But
that remaining 2% is such a bad error. And for Chinese, for Mandarin, it’s something like 96%.
Jenn Cunningham
12:53 – 12:54
Yeah.
Matt Henderson
12:54 – 13:05
And so 4% of the time it just talks back to you in English, and then what happens? That goes into a Chinese text-to-speech model, and it’s such a bad experience.
Jenn Cunningham
13:05 – 13:16
So we’re saying 98%, 96%. What is the actual language consistency that you’re seeing with Raven?
Matt Henderson
13:16 – 13:54
So on our tests it’s almost 100%, so 99.9%.
And that’s the kind of thing where being able to train our own model really helps: if we see a type of error that can happen, we just train it out of the system.
We do the gradient update so it just can’t happen again. And we can teach the model exactly how we’re going to prompt it for this type of use case.
So this is a built-in feature of Raven v3: you can say what language you want the response to be in, and then it will just work.
Jenn Cunningham
13:54 – 13:54
Yeah.
Matt Henderson
13:54 – 14:08
So all of the content could be in English, all the FAQs and everything, and it will do its best job at speaking to you naturally and grammatically in the language you’ve requested.
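As a rough illustration of how a language-consistency number like the one Matt quotes might be measured, here is a small sketch that checks what fraction of model responses actually come back in the requested language. It uses the open-source langdetect package as a stand-in for language identification; it is not PolyAI’s evaluation harness, and generate_response is a hypothetical placeholder for whichever model is under test.

```python
# Illustrative consistency check, not PolyAI's evaluation code.
# pip install langdetect
from langdetect import detect

def generate_response(prompt: str, target_language: str) -> str:
    """Hypothetical placeholder for a call to the model under test."""
    raise NotImplementedError

def language_consistency(prompts: list[str], target_language: str) -> float:
    """Fraction of responses that come back in the requested language."""
    hits = 0
    for prompt in prompts:
        reply = generate_response(prompt, target_language)
        try:
            hits += int(detect(reply) == target_language)
        except Exception:
            pass  # empty or undetectable replies count as misses
    return hits / len(prompts)

# e.g. language_consistency(test_prompts, "zh-cn") might land around 0.96 for a
# general-purpose LLM, versus the ~0.999 reported for Raven v3 on internal tests.
```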
Jenn Cunningham
14:08 – 14:17
Yeah. Which is ideal.
And is that really out-of-the-box performance that you’re seeing too, where you can just talk to it?
Matt Henderson
14:17 – 14:39
Yeah. We train it so that works pretty well out of the box.
And then, of course, the more you prompt it specifically for that language, the better. But at the very least it’s not going to respond to you in English, which is the error that does come up with other models.
Jenn Cunningham
14:39 – 14:56
Oh, absolutely. And we have quite a few folks at the company who speak different languages.
So we also got native-speaker input across all of the different languages to make sure performance was up to par. Right?
Matt Henderson
14:56 – 16:21
Yeah. Yeah.
So, big things like I was saying: gendered pronouns, formality levels, and so on. Those are the kinds of things that came up through discussions with our dialogue designers who speak all these different languages.
Through their feedback we maintain a sort of style guide, instructions for our language model tuning, so that we get it right in each of those languages. And we also figured out the prompting format to say, okay,
speak in this tone, or avoid gendered pronouns unless it’s very clear which ones are appropriate to use, that kind of thing.
So style is a big thing, and that’s another difference between Raven v3 and Raven v2: we’ve really taught it a style guide that isn’t just in English but also covers these other languages. So if you talk to Raven v3 and compare it to other public models, you’ll see it’s much shorter in English and much more direct, friendly, and conversational, and it just makes the conversations go smoother.
Jenn Cunningham
16:21 – 16:43
Which is ideal. I mean, I know I got I still got into trouble.
I’ve been in The UK for four years, but I have very direct American English communication style, and sometimes step on a couple toasts. Like, I’m really not trying to be mean.
I’m trying to be efficient. I’m so sorry.
But, viola, I know you also did a bit of testing in terms of Japanese deployment.
Viola Lin
16:43 – 16:44
Yep.
Jenn Cunningham
16:44 – 16:49
as well. Right? So what did you see there in terms of tuning, formality, and the cultural nuance required?
Viola Lin
16:49 – 18:47
So Japanese is a language with different levels of formality and politeness. Because we’re serving the customer service industry, we have to be extra polite in Japanese.
When I first used something like GPT to generate customer-service dialogue or conversations, it wasn’t really polite. It was more casual, treating you like a friend.
But if we’re going to use it in customer service, we need that extra level of politeness, and all the terms being used have to be extra polite. I think that’s really important.
I was also supporting the teams doing testing in different languages on our bots. Hindi, for example, is a really interesting one.
It has gender considerations, because in the beginning we had an issue where the agent had a female voice but referred to itself as “he” when talking. That caused a bit of confusion.
So that’s really interesting.
Jenn Cunningham
18:47 – 19:09
So let’s zoom out a bit and look at the overall tech stack behind a really high-quality multilingual solution. We have ASR, we have RAG, we have the LLM, and then we have the voice component.
So, Viola, can you talk about how all of these things come together under the hood in Agent Studio to power conversations in a really natural way?
Viola Lin
19:09 – 22:09
Mhmm. Yeah.
Of course. This is why multilingual is a really complicated challenge: each language will have its own best model mix.
For example, English might have one good ASR model and a different good RAG model. But luckily, through our evaluations and our in-house development, we have found really good model mixes for different languages.
For ASR, which is automatic speech recognition, we found that GPT-4o Transcribe is actually pretty good across most languages. Only a few languages, such as Bulgarian, Croatian, and Serbian, have other models that perform better.
For RAG, we also have a really good multilingual retrieval model that is very helpful at picking the most relevant topic and ensuring accurate information is retrieved in most languages. And on the LLM side, we have Raven v3, which is designed for multilingual use in the customer service industry.
The tricky one is actually voice. It’s the most challenging component for us.
A common problem we see is that a voice might sound really good in the samples but then fail on longer text, especially for tonal languages.
For example, Mandarin has four tones and Cantonese has nine. In a short sample, the voice might sound pretty good saying “How can I help you?”, but when we input longer text, it fails on certain parts of the sentence or certain phrases.
So then it’s not actually a really good voice. Beyond that, we also see a lot of voices with another issue:
they speak with an English accent in languages that aren’t English. There are a few voices that claim to be multilingual.
For example, there might be a voice called, say, Alan, and he can speak English, German, and Spanish, but he might be really good only at English. When he speaks German, he has an English accent.
That’s not ideal. And it’s also
very common for languages to mix English with the local language. Take Japanese as an example:
speakers will usually speak Japanese and mix in English terms, but they pronounce the English in their own Japanese way. So imagine talking to a voice agent that speaks perfect Japanese, but when it switches to English for certain terms, it pronounces them in perfect English. You immediately know it’s a voice bot instead of a real person.
We don’t want that to happen.
So the voice piece is tricky, I would say. I think we value quality over claimed capability.
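The per-language model mix Viola describes could be tabulated roughly like the sketch below, with a default stack and a handful of language-specific overrides. Every identifier here (the alternative ASR and voice names in particular) is a hypothetical placeholder rather than a real model ID, and the mapping is illustrative only.

```python
# Illustrative only: a per-language mix of the four components discussed
# (ASR, RAG, LLM, voice). Identifiers are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class ModelMix:
    asr: str  # automatic speech recognition
    rag: str  # retrieval over the shared knowledge base
    llm: str  # response generation
    tts: str  # text-to-speech voice

DEFAULT_MIX = ModelMix(
    asr="gpt-4o-transcribe",        # said in the episode to work well across most languages
    rag="multilingual-rag",
    llm="raven-v3",
    tts="default-multilingual-voice",
)

# Languages where evaluation found a different ASR or voice performed better.
PER_LANGUAGE_MIX = {
    "bg": ModelMix(asr="alt-asr-bg", rag="multilingual-rag", llm="raven-v3", tts="bulgarian-voice-1"),
    "sr": ModelMix(asr="alt-asr-sr", rag="multilingual-rag", llm="raven-v3", tts="serbian-voice-1"),
}

def mix_for(language: str) -> ModelMix:
    """Pick the evaluated mix for a language, falling back to the default stack."""
    return PER_LANGUAGE_MIX.get(language, DEFAULT_MIX)
```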
Jenn Cunningham
22:09 – 22:59
No. And it’s really important.
Right? It’s interesting when you think of it from the deployment standpoint. I remember one of our very early forays into generative AI, I think it was our first generative deployment, Matt, where the bot hallucinated that it spoke Spanish, even though we had no prompting, no knowledge, nothing in there saying it spoke Spanish.
And it was just bizarre to hear, because it was saying everything with this thick English accent.
It kind of sounded like an A1 Spanish class just trying to read off the words. But at the same time, how do you maintain that consistency? How do you really help people build trust?
Viola Lin
22:59 – 23:36
Yeah. One other thing that’s also important is accent.
In English, for example, you have British English, American English, Canadian English, Irish English.
There are a lot of different accents within a language. Even in Spanish, you have Mexican Spanish, US Spanish, and Spain Spanish.
So on the voice side, and also in the ASR, we need to find really good models that can accommodate those variations, because local people are really good at noticing, oh, they’re not speaking my language.
It’s a different accent.
Jenn Cunningham
23:36 – 24:54
Right. And these are things we’ve all probably encountered in human-to-human conversations.
Especially in London, everyone is from everywhere, so I don’t really care if I talk to someone on the phone and they have an accent. But it’s different when it’s an automated system.
If you’re representing a business in Spain, and this is a robot,
why did you choose a Mexican accent, or, say, a Chilean accent? I don’t think someone would actually choose Chilean. But we’ve seen real-world implementations in support of these multilingual solutions.
I think back to our FedEx deployment. I honestly always forget the number of languages we’re currently deployed in, because we’ve been able to expand that geographic footprint so much.
I want to say we’re at 18 for a single project, but I may be mistaken, which is not great when you’re on the marketing team and meant to be the single source of truth for the marketing materials and stats. But how are we thinking about these language support tiers? When we say we can do a language, how do we make sure we’re able to deploy in a way that’s really a valuable solution?
Viola Lin
24:54 – 26:02
So currently, we’ve separated all the languages into three tiers. Tier one is the languages we have the most confidence in and where we have a lot of project deployments. That includes three languages: UK English, US English, and US Spanish.
We have the highest project volumes there, and we have a lot of confidence in them. Then in tier two we have around 28 languages.
We constantly get requests for them, and we have projects across those languages. And tier three is languages we’ve been getting demo requests for and have tested thoroughly.
So internally, how do we count a language as ready? We go through evaluations and testing across the four model dimensions we just mentioned: ASR, RAG, LLM, and text-to-speech, which is the voice.
We have to go through testing and evaluation for each of those models, and then test a bit on our example bot to see whether we have good support. So in total, PolyAI supports more than 45 languages with confidence.
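The readiness gate Viola outlines could be expressed roughly as below: a language only counts as supported once every one of the four model dimensions clears evaluation and an end-to-end check on an example bot passes. The threshold value and function shape are hypothetical, purely to illustrate the idea.

```python
# Illustrative readiness gate, not PolyAI's internal process.
# Thresholds and score scales are hypothetical.
COMPONENTS = ("asr", "rag", "llm", "tts")

def language_ready(eval_scores: dict[str, float], example_bot_passed: bool,
                   threshold: float = 0.9) -> bool:
    """All four model dimensions must clear the bar, plus an end-to-end bot test."""
    return all(eval_scores.get(c, 0.0) >= threshold for c in COMPONENTS) and example_bot_passed

# e.g. language_ready({"asr": 0.95, "rag": 0.93, "llm": 0.97, "tts": 0.91}, True) -> True
```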
Jenn Cunningham
26:02 – 26:06
Which is pretty crazy.
Matt Henderson
26:06 – 26:57
It’s a lot of work on evaluation. It’s not just running an English test set; it’s checking across dozens of other languages and all the different use cases in each of those languages.
But the cool thing, and I guess it wasn’t totally expected, is that a lot of the time a single model can do all the languages.
It used to be that multilingual was kind of a compromise you’d make. You’d say, okay, I can be really good at one language, or I throw them all together and I’m kind of medium at all of them.
But now the models are big enough. They have such big capacity
that I feel the different languages actually help each other. Right? There’s learning
shared across each of them, which is pretty cool. Yeah.
Jenn Cunningham
26:57 – 27:47
No. It’s pretty fantastic.
Also, when you think of Smart Analyst and how we’re able to analyze the different conversations and trends in the contact center, or in customer interactions across the board, you can basically combine your Spanish-speaking contact center and your English-speaking contact center. Sometimes these are all in one contact center, but sometimes they’re physically quite disparate.
So being able to bring it all together and get those cohesive insights is incredible. I know we also have some advanced features for supporting multilingual conversations over the phone, things like routing logic and being able to switch languages mid-call.
Viola Lin
27:47 – 31:30
Yeah. We implemented some more advanced logic tailored for multilingual capabilities.
For example, we have routing logic based on language detection. No matter which language you speak, we can usually tell in the first turn and then route you to the correct language.
Sometimes it really depends on the client; they might want to set the language explicitly in the first turn.
So for example, when the phone is answered, on the first turn we’ll ask, do you want to speak English or Spanish? And that’s the point where we route you to the right language. Another small trick we’ve been implementing a lot is mid-call language switching, which is kind of like a fake handoff.
For example, if I’m in a call speaking English and suddenly I want to switch to Spanish,
our agents can implement logic that says, okay, let me hand you off to our Spanish agent. You might hear a short ringtone or a bit of hold music, and then we hand you off to our Spanish bot. So it’s a
seamless experience for the callers. Other than that, I think Jenn mentioned a good point about Smart Analyst.
Smart Analyst is actually pretty good at giving analysis and evaluations of calls in different languages. Before, when we were building languages in different bots, it was hard to have an aggregated view for users to understand performance across languages.
And because the calls are in different languages, there was no way for me to do call review, because I don’t understand Spanish or German.
But now we have Smart Analyst,
so it can tell you what happened on a phone call and what kind of challenge the user was facing. So now we have quite a lot of enablement that allows people to view, review, and manage agents in a multilingual setting.
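Sketching the routing behaviours Viola describes, first-turn language detection (or an explicit menu) plus the mid-call “fake handoff”, might look something like the following. All of the call-object methods and function names here are hypothetical placeholders for illustration, not Agent Studio APIs.

```python
# Illustrative routing sketch; every object and method here is hypothetical.

def detect_language(first_utterance_audio) -> str:
    """Hypothetical first-turn language identification from audio."""
    raise NotImplementedError

def handle_first_turn(call, mode: str = "detect") -> None:
    if mode == "detect":
        # Route silently based on whatever language the caller opens with.
        call.language = detect_language(call.first_audio)
    else:
        # Some clients prefer an explicit menu on the first turn.
        choice = call.ask("Would you like to continue in English or Spanish?")
        call.language = "es" if "spanish" in choice.lower() else "en"

def switch_language_mid_call(call, new_language: str) -> None:
    """'Fake handoff': play brief hold audio, then carry on in the new language."""
    call.say("One moment while I transfer you to a Spanish-speaking agent.")
    call.play_hold_music(seconds=2)
    call.language = new_language  # same agent, different language setting
```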
Jenn Cunningham
31:30 – 32:15
When we think of the core LLM component behind it, we have Raven v3’s near-perfect language consistency.
That’s a real differentiator relative to other LLMs on the market today. And we’re also able to look at the multilingual component of conversations and know that it isn’t just about straight translation; it’s really about cultural intelligence and making sure customers feel heard and supported.
And, Viola, I think you did a really fantastic job of outlining how our platform and LLM work together to be scalable from a multilingual deployment perspective. But, Matt, one last question for you.
What’s next?
Matt Henderson
32:15 – 33:20
Raven v4, obviously. This is going to be our step towards speech-to-speech,
an audio-native language model. Something that’s come up a few times in this conversation is that oftentimes the voice part is the hard part of multilingual.
For the LLM, everything’s in text, and there are obviously plenty of challenges in getting the text side fluent.
But if you’re not able to understand across multiple languages and you’re not able to speak multiple languages, that’s a nonstarter.
So with Raven v4, we’re training the model to directly perceive audio, so it won’t need a separate multilingual speech recognizer.
It will implicitly be able to do multilingual speech recognition. That brings the power and reasoning capabilities of the LLM to the speech understanding task.
Jenn Cunningham
33:20 – 33:20
Yeah.
Matt Henderson
33:20 – 33:56
So normally we think of speech recognition as a self-contained, smaller model that takes audio and outputs text, and the LLM is the smart thing.
It’s orders of magnitude bigger, and it has much more context: all the instructions and the dialogue history. So we’re fusing that speech recognition model into the LLM, and we get multilingual out of that, as well as much more accurate understanding.
Viola Lin
33:56 – 34:04
Does that mean the latency will also improve if you’re using speech-to-speech in Raven v4?
Matt Henderson
34:04 – 34:51
It helps latency as well, because the model can straightaway start responding to the user. Rather than pipelining the two systems together and having to wait for speech recognition to finish, the model just hears what the user says and starts to reply.
So it’s also a path for us to improve latency.
And of course, that’s a big advantage of training our own models: we control the exact architecture they run on and where they run, and we’re not competing on the cloud the way we would be if we were just wrapping someone else’s model.
So, yeah, I see it as a path to maybe get a few more hundred milliseconds off, or something like that.
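To picture the latency argument Matt is making, here is a conceptual sketch of a pipelined turn versus an audio-native turn. Nothing here is PolyAI’s architecture; the objects and methods are stand-ins just to show where the waiting disappears.

```python
# Conceptual sketch only; 'asr', 'llm', and 'audio_native_model' are hypothetical stand-ins.

def pipelined_turn(audio, asr, llm):
    """ASR -> LLM pipeline: generation cannot begin until transcription finishes."""
    transcript = asr.transcribe(audio)  # full transcription completes first
    return llm.respond(transcript)      # only then does the reply start

def speech_to_speech_turn(audio_stream, audio_native_model):
    """Audio-native model: recognition is implicit, so the reply can start immediately."""
    return audio_native_model.respond(audio_stream)
```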
Jenn Cunningham
34:51 – 35:18
No. We’re always obsessing over latency here.
And it’s exciting from a speech-to-speech perspective, because I remember when we were looking at speech-to-speech from a bring-your-own-LLM standpoint, in our initial testing the models on the market just weren’t sufficient for enterprise customer service use cases. We couldn’t guarantee the quality.
So it’s really exciting to see that that’s coming, and that the technology is there.
Matt Henderson
35:18 – 35:46
Yeah. Yeah.
Speech-to-speech models are getting better. They definitely started off being bad at things like function calling and hallucinating weird things, and they’re improving, but this is just a way for us to fast-forward to the point where they’re really good at our use case, rather than
waiting for someone else to train a general-purpose model that also works for us. We’re focused on customer support and making it work across languages, reliably.
Yeah.
Jenn Cunningham
35:46 – 36:27
Correct. No.
And I think that is really the point to end on. We want to make customer service accessible, effective, and efficient.
So with that, I will thank you both. It’s Friday as we’re recording, so thank you all for listening in, and thank you, Matt and Viola, for joining me today.
For listeners interested in learning more, we’re going to have a blog post with a more technical deep dive on Raven v3. So thank you all for joining, and I encourage you to review and subscribe if you didn’t at the beginning of this podcast.
But, yeah, thank you so much.