Raven v2 and the Race to Smarter Voice Agents with Matt Henderson

Summary

In this episode, your host Nikola Mrkšić sits down with Matt Henderson, VP of Research at PolyAI, to unpack the state of large language models, their quirks, and what really matters when building smarter voice agents for customer service.

Join us for a discussion on:

  • Why GPT-5 may feel more incremental than game-changing
  • Why reasoning models can fail at surprisingly simple tasks
  • How PolyAI’s Raven outperforms generalist LLMs in latency-sensitive, real-world CX use cases
  • The balance between speed, accuracy, and reasoning for live customer interactions
  • What open-source models, quantization, and fine-tuning mean for enterprise AI strategies

👉 Learn more about Raven and conversational AI here: https://poly.ai/blog/polyai-raven-v2-large-language-model/

Key Takeaways

  • GPT-5 is incremental for voice: While GPT-5 was hyped, PolyAI’s own testing found it underwhelming for live voice tasks — sometimes less accurate than GPT-5 Mini. This highlights why PolyAI invests in purpose-built models for CX.
  • Raven v2 sets the bar: PolyAI’s in-house Raven v2 outperformed GPT-5 in latency-sensitive benchmarks, proving that specialized design beats general models when reliability and speed matter for enterprise voice AI.
  • Reasoning vs. real-time: GPT-5 emphasizes reasoning, but long deliberations don’t work in phone conversations. PolyAI is innovating with latency-aware reasoning — models that think just enough without keeping customers waiting.
  • From demos to deployment: Open-source releases and coding benchmarks grab headlines, but PolyAI focuses on enterprise-grade reliability — turning cutting-edge research into systems that work at scale in real contact centers.

Transcript

Nikola Mrkšić

00:05 – 00:45

Hi, everyone. Welcome to another episode of the PolyAI podcast.

Today with me is our VP of Research, Matt Henderson. My PhD followed Matt’s in the Dialogue Systems Group at Cambridge, and we’ve been working together for over a decade now.

I’m really excited to speak to you on this one. And, I think, like, you know, the first thing that I’d love to just maybe have you frame to the audience because I always love the way you explain these things.

How do you feel about GPT-5? And, like, you know, now that it’s cooled off... I think I was very enthusiastic when we spoke about it with Sean last week, but I’d love to maybe start with your take on what’s changed since that announcement.

 

Matt Henderson

00:45 – 00:53

Yeah. Thanks, Nikola.

And, yeah. Welcome to my first podcast appearance on any podcast.

Big moment for me.

 

Nikola Mrkšić

00:53 – 00:56

Remember the first one now? Okay.

 

Matt Henderson

00:56 – 00:58

Yeah. Yeah.

Yeah, GPT...

 

Nikola Mrkšić

00:58 – 01:01

you’ll be the next host, and you can do the rest of them.

 

Matt Henderson

01:01 – 02:02

...five. I learned from the best.

Yeah. GPT five was interesting, I think.

Clearly, they spent a long time reluctant to bump the version number up to five, and you can see that on the OpenAI models page there’s, you know, 4, 4o, 4.1, 4.5, o1, and so on. It’s a mess.

But now, I think you can see they wanted to have a clean, you know, “here’s 5, this is what you get, this is everything,” and market it as one model. But there’s not really one model.

It’s a router routing to different models. So the people on ChatGPT got a mixed experience, and I think what they say is there was a bug in the routing model, which meant that you were sometimes getting a less-clever, non-reasoning version, and so it was failing on the sort of easy prompts.
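The routing setup described here can be sketched as a tiny dispatcher. Everything below (the model names, the difficulty heuristic) is invented for illustration; the real router is presumably a learned classifier, not a hand-written rule:

```python
def route(prompt: str) -> str:
    """Toy sketch of 'one endpoint, several models behind it'."""
    # Invented heuristic: long or hard-looking prompts go to the
    # reasoning model; everything else goes to the cheap, fast model.
    looks_hard = len(prompt.split()) > 30 or "prove" in prompt.lower()
    return "reasoning-model" if looks_hard else "fast-model"

# The failure mode described above: this question *looks* trivial,
# so a heuristic router sends it to the cheap model.
chosen = route("How many b's are in the word blueberry?")
```

If the routing misfires (a bug, or just a question that looks easier than it is), the caller gets the weaker model with no visible sign that anything went wrong.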

 

Nikola Mrkšić

02:02 – 02:11

Do we know, in the back end, how many different models it was actually choosing between? Was it the whole catalog? Because we looked yesterday, and there were, like, thirty, forty models listed.

Right?

 

Matt Henderson

02:11 – 02:19

Yeah. Well, yeah, of the GPT-5 models: I think if you are an API customer, you get the choice and you get the control.

 

Nikola Mrkšić

02:19 – 02:20

Yeah.

 

Matt Henderson

02:20 – 02:37

You get the full model, the mini model. And I think the chat model is the one which is routing to all these different ones, and you can control the reasoning effort.

We’ve done a bit of evaluation of those models on, like, do they work as voice agents? We’ll get into that a bit later.

 

Nikola Mrkšić

02:37 – 02:38

Yep. Yep.

 

Matt Henderson

02:38 – 02:58

But, yeah, I mean, my high-level take is that people saw this as a relatively incremental change, even though it was a bump in version number. It’s incremental, and maybe there are some things it’s slightly better at.

We didn’t find it was any better on our own test cases for voice agent stuff.

 

Nikola Mrkšić

02:58 – 03:44

I think, like, I spent so long waiting for it. Right? When I started playing with it, I felt like it was new.

But it’s kind of, you know, I guess like a new version of a car or something like that. You know? Something’s shiny.

The dashboard looks different. It feels new, and you feel nice that you got something new and shiny.

And then as this went on, I have to admit that for how I use it, just as a consumer, I’ve not really noticed much of a difference.

I think the one, like, prosaic observation: fewer of those em dashes and things that just make it very GPT, less of the stylistic vocabulary that makes it feel like, you know, Daily Mail, with, like, three sentences of three words, comma, then stop. Right? But that’s just very minor.

I mean, that’s just style. Right?

 

Matt Henderson

03:44 – 03:45

Yeah.

 

Nikola Mrkšić

03:45 – 03:45

I think.

 

Matt Henderson

03:45 – 03:46

They were...

 

Nikola Mrkšić

03:46 – 03:48

is an Uber thing, but, yeah, it’s...

 

Matt Henderson

03:48 – 04:13

...offering some style options as well. I don’t know if you saw that.

You could pick, like, the robot version that’s much more concise, or, for example, the empathetic life-coach-style one.

But I mean, that’s not... if you’d said, well, I can’t wait for GPT-5, it’s coming out, you know, in August of 2025, and it’s gonna have slightly different styles...

Yeah. It’s not amazing.

 

Nikola Mrkšić

04:13 – 04:38

You... yeah. I mean, Sean mentioned last time that, you know, just day to day (and I think he’s, like, quite an active user of Codex and that whole suite), he noticed a bit of a change for the better.

But that could have been a 4.x push, right, in, kinda like, the war for market share among coders, with Claude and the others.

 

Matt Henderson

04:38 – 04:57

Yeah. And then Claude kind of got not as much attention.

They also released the new version of Opus, and I’ve heard really good things about its coding abilities. OpenAI beat it on whatever coding benchmark, SWE-bench, by less than one percentage point, which is noise.

But...

 

Nikola Mrkšić

04:57 – 04:58

Yeah.

 

Matt Henderson

04:58 – 05:23

Yeah, all these launches... I mean, I remember this from release times here: there’s a lot of timing and rumors going on. Okay, who’s gonna... when’s the next Llama coming out? When should we launch? If you launch before, then maybe you will be at the top of the benchmark for a bit longer. If you launch after, you can steal all the PR.

And if you launch on the same day, then who knows what’s gonna happen?

 

Nikola Mrkšić

05:23 – 05:40

It’s a really interesting experience. Right?

I mean, like, the same people who are, you know, competing on the benchmarks and, like, the very technical things are now the same people looking at impressions and views and coverage by the media and everything else. So, you know, it’s a full-body exercise. It’s interesting.

 

Matt Henderson

05:40 – 05:41

Yeah.

 

Nikola Mrkšić

05:41 – 05:42

Right?

 

Matt Henderson

05:42 – 05:42

Yeah.

 

Nikola Mrkšić

05:42 – 05:51

But, kinda like, looking at evaluation on our tasks: I think it’d be interesting for the audience just to hear a bit about what we’ve seen when we use GPT-5 for voice.

 

Matt Henderson

05:51 – 06:42

Yeah. Yeah.

Totally. So last week we announced our in-house model, Raven v2, which is powering the live voice assistants.

So it’s, like, the latency-sensitive, high-accuracy sort of use case, dealing in voice. And we released benchmarks.

So we missed GPT-5 in that blog post, but, you know, since then we have evaluated it alongside, you know, the Claude models, GPT-4o, and so on. We actually find that GPT-5 Mini does better on our evaluation than full GPT-5, which is weird.

GPT-5 is making some strange mistakes, which to me show that it really wants to reason.

 

Nikola Mrkšić

06:42 – 06:43

Yeah.

 

Matt Henderson

06:43 – 07:30

And it’s lost a bit of accuracy around calling tools and functions. So it would do strange things. Like, you know, we are always prompting the model to come up with a short conversational response, something you might say over the phone.

We don’t want what you typically want from a model, which is, like, markdown and a lot of bullet points, like code, you know, long-form text.

But GPT-5 is sometimes doing that. It’s outputting a bunch of reasoning; it’s outputting, like, hundreds of tokens.

And, no, we’re not gonna send this to text-to-speech and make the user listen to it. Yeah.

So that’s a bit strange. Yeah.

And it’s outputting tool calls, like, in the text. It’s like a bug,

 

Nikola Mrkšić

07:30 – 07:30

Oh.

 

Matt Henderson

07:30 – 07:38

basically. So, yeah, it’ll say, “I’m calling the X function now,” and then output some JSON, and it’s like, no. You’re not. You’re not.

So...

 

Nikola Mrkšić

07:38 – 07:38

Okay.

 

Matt Henderson

07:38 – 07:42

This, like, would be a pretty bad bug if we were to deploy that.

 

Nikola Mrkšić

07:42 – 07:42

Maybe...

 

Matt Henderson

07:42 – 07:42

it.

 

Nikola Mrkšić

07:42 – 07:49

we’re all getting a bit harsh. Maybe we just need to give them a week or two or three, right, for the whole thing to settle a bit, for bugs to be...

 

Matt Henderson

07:49 – 07:49

There’s...

 

Nikola Mrkšić

07:49 – 07:49

...corrected.

 

Matt Henderson

07:49 – 08:07

There’s also, you know, an implicit bias when we run these evaluations: the prompts that we’ve built were built on the models of the time.

And so when we rerun them, yes, you know, you could say GPT-5 needs to be prompted slightly differently.

So...

 

Nikola Mrkšić

08:07 – 08:22

Maybe just, for the audience: you mentioned that in API calls, you don’t necessarily have, like, the kinda, like, model-selection thing that they have in the background, where it would reason or not.

So how does that stack up with the model trying to reason?

 

Matt Henderson

08:22 – 09:20

We’re still evaluating, like, reasoning. So that’s maybe another interesting question: the GPT-5 launch is very focused on reasoning. Right? All of these models are kind of, by default, reasoners.

As an API customer, I can disable it. I don’t think they were guaranteeing it’s not gonna do reasoning, but you say, you know, make it “minimal,” I think is the term. I think in our earlier evaluations that does improve the accuracy, but I don’t have those numbers right now.

But the interesting question, I think, is, like, how can we take advantage of reasoning, which is slower, in a live, latency-sensitive use case? Because I think it would be too simple just to say, no, reasoning is not interesting.

You know, it’s too slow. What can we do to take advantage of that?

 

Nikola Mrkšić

09:20 – 10:21

You know what this really reminds me of? I don’t know if you ever did, like, ACM, those ICPC-style competitions, which are like, you know, the IOI but for university students competing. So it’s not, like, an individual sitting solving algorithm problems and coding them.

You have a team of three, but only one computer. Right? So there’s always, like, the Gatling-gun coder.

I was never one of those. The person who, like, implements, runs, writes testing suites, etcetera.

Then there’s, like, the guy who’s the math man. Right? The person who sits there in a corner and is given the hard problem, because out of the three problems, you know, maybe the other two are, like, a balanced one and a heavy coder one.

Those two go and hammer out problems one and two, because they can probably each do one on their own. And you take your smart mathematician.

Right? You go, hey. Figure it out.

Think about it really hard. Right? And then, like, later on, they have a good idea, whatever, and then whichever of those two is free starts coding it up.

And I feel like, you know, that reasoning model should really just be, like, running in the background.

 

Matt Henderson

10:21 – 10:21

Yeah.

 

Nikola Mrkšić

10:21 – 10:43

You start off outputting tokens, and it’s like, but having thought about it, actually, what you want to do in this situation is... maybe you also look at the output of the other model as it comes, you know, tokens being generated, and maybe a sentence in, you change course, that latent state now incorporating whatever smart thing that model came up with.

Right? But that really is, like, two models that are running in parallel. Right?

 

Matt Henderson

10:43 – 11:13

Yeah. Yeah.

And how do you make it, if you’re talking to this on the phone, how do you make it not sound insane? And, I mean, for a lot of our use cases over the phone, naively you would think, how much reasoning would this need? It’s not solving, like, university-level algorithm-challenge questions. You know? It’s like, I just want to change my booking at this restaurant, you know, something like that.

If every time you talk to it, it’s like, let me just think about that...
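The “mathematician in the corner” pattern the two are describing, a fast model answering immediately while a slow reasoning model thinks in the background, might be sketched like this. All the model functions and timings here are invented stand-ins, not PolyAI’s actual architecture:

```python
import asyncio

async def fast_reply(prompt: str) -> str:
    # Stand-in for a low-latency conversational model.
    await asyncio.sleep(0.01)
    return "Sure, let me check that booking for you."

async def slow_reasoner(prompt: str) -> str:
    # Stand-in for a slower reasoning model running in parallel.
    await asyncio.sleep(0.05)
    return "The booking can be moved to 7pm; confirm the party size first."

async def respond(prompt: str, budget: float) -> str:
    """Speak immediately; fold in the reasoner's conclusion if it lands in time."""
    reasoner = asyncio.create_task(slow_reasoner(prompt))
    opening = await fast_reply(prompt)  # no dead air for the caller
    try:
        insight = await asyncio.wait_for(reasoner, timeout=budget)
        return f"{opening} {insight}"
    except asyncio.TimeoutError:
        reasoner.cancel()  # out of budget: ship the fast answer alone
        return opening
```

With a generous latency budget the reply incorporates the reasoner’s conclusion; with a tight one, the caller just gets the fast model’s answer and the deliberation is discarded.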

 

Nikola Mrkšić

11:13 – 11:59

Yeah.

I mean, that’s what we have a lot of competitors doing there, just because their latency is, like, seven seconds. So that’s, like, just terrible.

Right? But, I mean, in theory, right, you just think of more complex enterprise use cases, complex knowledge bases.

You know, the promise was always: an LLM, you just give it access to all your enterprise KBs, whatever. And then, you know, maybe it has 30 versions of a document, and it can figure out which one’s real the way that a human might if tasked to do it for three days, asking others and figuring out what the real version is, when maybe no one’s touched it for a year. Like, in theory. But that gets very theoretical, honestly.

And I don’t think that there’s enough faith or acceptance among buyers, and they’re not technical people, to do things this way for, like, front-office use cases.

 

Matt Henderson

11:59 – 12:17

Yeah. Yeah.

Yeah. The next version of Raven, we’re sort of actively working on, I guess, latency-sensitive reasoning.

So, with, you know, RL coming back, latency can just be a sort of term in the reward function. You know? So...

 

Nikola Mrkšić

12:17 – 12:18

Yep.

 

Matt Henderson

12:18 – 12:42

maximize accuracy, but, you know, don’t leave the user waiting for too long, or say something while they’re waiting, that kind of thing. One of the tricky things that comes up there is that it’s not always obvious what types of questions require reasoning.
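The latency-as-a-reward-term idea can be sketched in a few lines. The weights and thresholds below are invented for illustration, not PolyAI’s actual reward:

```python
def reward(task_success: bool, latency_seconds: float,
           latency_weight: float = 0.5, max_wait: float = 2.0) -> float:
    """Reward correct answers, but penalize keeping the caller waiting.

    Illustrative values: latency is free up to `max_wait` seconds
    (a natural conversational pause), then costs `latency_weight`
    per extra second of silence.
    """
    r = 1.0 if task_success else 0.0
    r -= latency_weight * max(0.0, latency_seconds - max_wait)
    return r
```

Under a reward like this, an RL-trained agent learns to reason just long enough: thinking for four seconds wipes out the gain from getting the answer right.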

So for example, if you ask GPT-5 on ChatGPT, you know, how many b’s are in the word blueberry,

 

Nikola Mrkšić

12:42 – 12:43

Yeah.

 

Matt Henderson

12:43 – 12:47

that requires the heaviest-duty reasoning model. Right?

 

Nikola Mrkšić

12:47 – 12:47

Yeah.

 

Matt Henderson

12:47 – 13:14

And the router might look at that and think, that’s a very dumb, easy question. I don’t need my PhD-level model to solve that.

Then you get routed to, you know, GPT-5 Mini or Nano, whatever it’s called, and then it can’t do that question. So the types of things that the LLM needs reasoning for aren’t always intuitive, and they might just be down to the limitations of the architecture, in this case,

 

Nikola Mrkšić

13:14 – 13:14

Yeah.

 

Matt Henderson

13:14 – 13:20

the, like, tokenization, token-awareness stuff. Yeah.
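The token-awareness limitation Matt mentions can be illustrated with a toy tokenizer. The vocabulary and token IDs below are made up; real tokenizers work on learned subword vocabularies, but the point is the same, the model receives IDs, not characters:

```python
# Hypothetical two-piece vocabulary for illustration.
toy_vocab = {"blue": 1834, "berry": 9120}

def toy_tokenize(word: str) -> list[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids, rest = [], word
    while rest:
        for piece in sorted(toy_vocab, key=len, reverse=True):
            if rest.startswith(piece):
                ids.append(toy_vocab[piece])
                rest = rest[len(piece):]
                break
        else:
            raise ValueError(f"cannot tokenize {rest!r}")
    return ids

# What a human counts, character by character:
letter_count = "blueberry".count("b")      # 2
# What the model actually receives: just two opaque IDs.
token_ids = toy_tokenize("blueberry")      # [1834, 9120]
```

Counting letters requires unpacking what is inside each token, which is exactly the kind of “easy-looking” question that turns out to need deliberate, step-by-step reasoning.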

 

Nikola Mrkšić

13:20 – 13:28

Yeah. I mean, that’s super interesting.

I think, like, also, when you mentioned RL, I mean, we remember the kind of, like, you know, “twenty times success minus number of turns” reward.

 

Matt Henderson

13:28 – 13:28

Yeah.

 

Nikola Mrkšić

13:28 – 13:36

Right? So I mean, like, that’s at least, like, what we had in our world of, you know, POMDP-based reinforcement learning. But...

 

Matt Henderson

13:36 – 13:37

What?

 

Nikola Mrkšić

13:37 – 13:38

for that.

 

Matt Henderson

13:38 – 13:44

With an extra term to stop you from just hanging up on turn one and just minimizing your losses.

 

Nikola Mrkšić

13:44 – 14:22

Yeah. Yeah.

I mean, listen. I’ve often said in many of these client conversations, there’s an easy way to get 100% containment.

You just need to never pick up the phone, or hang up immediately. Right? And no one ever goes through to the human.

It has a pretty negative effect on customer experience. But, hey.

When you think of, kind of, like, things that make you excited, in terms of both the reasoning models and just, like, the raw, you know, performance of these things, both in speed and accuracy: what’s most exciting to you, kinda, like, that you think is coming in the next six months, both in terms of what we’re doing and external?

 

Matt Henderson

14:22 – 16:33

Yeah. Yeah.

Well, I guess when we talk about LLMs inside PolyAI, there are broadly two kinds of use cases. The first is, obviously, powering the live voice agent.

This is the latency-sensitive use case. And there we are building models in house that know about voice, and they understand the turn-taking and the sort of agent framework that we’re putting them in.

So we don’t get those kinds of mistakes where it starts outputting a bunch of markdown or hallucinating details from the user that they haven’t mentioned yet. You know? I think the big thing there is that most models have kind of built into them that they’re going to be in long-form chat use cases, doing long agentic reasoning and stuff.

Yeah. And in their reward, they’re like, you wanna solve everything in one go.

So they’d never really wanna ask the user for clarification. That’s very rare for them, but we often do. So that’s one use of LLMs.

And then the other use case is stuff like our smart analyst feature, where you’re able to ask what people are calling about and dive into the data, and AI-assisted agent building itself. And I think that’s pretty interesting. Even with a hint of a plateau in model progress, there are still just a lot of gains to be made in how you wrap these very powerful models into things that are useful.

So, you know, Claude Code is amazing, for example. There it’s about how you’re using the model to do useful stuff.

So how can we, you know, take inspiration from those types of agents and build them into agent studios and help people build, with this, like, PhD-level assistant?

Yeah.

 

Nikola Mrkšić

16:33 – 18:08

Yeah. I mean, it’s really crazy.

Right? I mean, like, I have to admit that, you know, well, you said it, kinda like, the hint of: maybe we’re not at superintelligence in, you know, three, six months. Maybe it’s gonna take a bit longer.

Right? And in using these things, you know, I think I’m a good proxy: like, you know, I used to be a competitive coder, and I’ve not really written a proper line of code in years.

Yesterday, I was, you know, installing libraries in my terminal so that I could even do the most basic things, because this laptop hasn’t been used for anything technical, other than inside our kind of, like, agent studio online, for about two years. Right? And what I found really interesting is, like, just the UX of Claude Code.

Like, it’s pretty slow. Right? But it’s fascinating and it’s really good.

And I think that if you’ve got the technical training, it’s really powerful. But, you know, the problem that we’ve kind of been tackling with enterprise customers is that, really, this class of citizen developers inside the contact center doesn’t really exist when it comes to people who share DNA with the kind of people that get amazed at Claude Code.

Right? So on the one hand, you’re trying to put relatively, kinda like, no-code primitives inside, of course LLM-powered, but still relatively simple: truncated knowledge bases, variants, things like that. And on the other hand, something that objectively, in the back end, is really, I think, aimed at, or at least optimized to benefit, a really good coder the most, in a way.

Right? Like...

 

Matt Henderson

18:08 – 18:08

Yeah.

 

Nikola Mrkšić

18:08 – 18:09

Yeah.

 

Matt Henderson

18:09 – 18:57

Yeah. And how can we build for that sort of spectrum of users? So, when we’re building software nowadays, we’re not just building for humans, people pointing and clicking on their computers; we should also be building for AI agents, things like Claude Code.

And a lot of the time, those things are complementary.

Like, you write documentation, and then it can be consumed by either. But, I mean, pointing and clicking, for example, is something that these models are terrible at.

You see some kind of publications on computer use, and then somebody tries to use these APIs to just add together two numbers in a calculator, and it just doesn’t work. Like...

 

Nikola Mrkšić

18:57 – 18:58

Yeah.

 

Matt Henderson

18:58 – 18:59

They excel at...

 

Nikola Mrkšić

18:59 – 19:00

Yeah.

 

Matt Henderson

19:00 – 19:00

yeah.

 

Nikola Mrkšić

19:00 – 19:43

Yeah. I mean, I’m just really excited about those things, because, like, for better or worse, for the broad spectrum of automating, in terms of customers and their, like, back ends and stuff, you really do need, or you would benefit from having, really strong, fast computer use. Because with their interfaces, the only way that you can sometimes do something is: this screen, that and that and that, take this one value, copy it, go into another set of screens, click another 20 things.

And, you know, whenever I call BA... I think up until recently, well, still, for my younger daughter, if I wanna see her booking (because she’s under two), you have to call, right, because literally they have not put it in the app.

 

Matt Henderson

19:43 – 19:44

Mhmm.

 

Nikola Mrkšić

19:44 – 20:45

And it’s interesting, because you just wait for humans to do it for so long. And, you know, I’ve often clocked it: a call to book plane tickets, simple return plane tickets for a family of four, takes, like, seventeen, twenty minutes.

But then, you know, being a nerd and doing what we do, I imagine it’s really just, like, humans doing those, like, computer-use things, and it’s just taking a very long time. So I hope they improve.

But in any case, back to maybe just the coding world then. I don’t think we really need clicking and stuff for the future ourselves. But it is interesting in that it really gives superpowers, further superpowers, to... you know, everyone goes, like, you know, a PM should now be able to produce everything.

But that’s, like, a technically oriented PM, and that really is a fringe of that technical world still. Right? And then there’s, like, all the regular, normal human beings using applications.

Right? And I don’t know; bridging that gap in the middle feels like something that’s not really yet, like, had a final word on it. Right?

 

Matt Henderson

20:45 – 21:09

Right. Yeah.

Yeah. That’s right.

It just may give you new powers that you just didn’t think you had before. You know? Like, build me a dashboard to show me what people are calling about in the last week, and then it just generates one and you have it there.

You don’t necessarily need to know how to code in TypeScript and create, like, a React app and stuff; you can see it kinda do it and follow along and maybe supervise it a bit.

 

Nikola Mrkšić

21:09 – 21:29

Yeah. Yeah.

I mean, like, you know, I’ve heard people on podcasts going, kinda like, hey, why would you ever even download an app? You should just describe what it is, and it builds it for you.

And, like, that’s kinda, like, you know, I think going at it just to prove a point, although it’s not impossible. Having said that, it is still quite slow.

Fascinating to see, but a lot of work. Right?

 

Matt Henderson

21:29 – 21:29

Yeah.

 

Nikola Mrkšić

21:29 – 21:53

But, like, when you think of just, like, the act of changing something simple, something you could maybe do through a graphical workflow, but still in a browser, in a platform, it is, like, it is not that kind of, like, wait. I feel like you would kinda lose people, with them not being the hyper-confident vibe coder living inside the terminal.

 

Matt Henderson

21:53 – 22:37

Yeah. And I think there’s probably a lot of benefit from making it faster by building sort of prepackaged environments. And think about, yeah, PolyAI: what is, like, the Lovable of building voice assistants? I can maybe start from scratch and vibe-code my way up to something, but then I’m like, well, how do I make it so I can call this thing? You know? Like, I want a phone number now, and I wanna serve, you know, thousands of calls every hour or something, and pay for speech recognition, etcetera, etcetera.

Like, I see that, you know, that’s a potential direction for us.

 

Nikola Mrkšić

22:37 – 23:09

Yeah. Yeah.

Yeah. Yeah.

It’s really, really interesting, and I’m really, really excited about that. Because, you know, I think that we’ve been trying to get way more of our customers to be, kinda like, standalone platform users, and we’ve gone a long way. But it’s still, like, quite a paradigm shift, because I think the world is adopting LLMs and, kinda like, the technology around them quite fast.

But whoever’s not a developer is still being asked to do things that are very different from your regular data entry. Right? So I feel like we have to distill a few of those steps into one relatively simple one.

 

Matt Henderson

23:09 – 23:33

Yeah. Totally.

I guess we’re sort of asking, yeah, what do we have now that we didn’t have before? Now we have these, like, super powerful LLMs that can just be agents. Well, what kind of superpowers does that give these people, these users of a platform, say? I mean, just reading thousands of words of context in a split second is one of them. You know?

 

Nikola Mrkšić

23:33 – 23:35

Yeah.

 

Matt Henderson

23:35 – 23:50

Instantly learning new software tools, and learning from in-context examples. Okay, we’ve built this, like, prototype banking use case, and, you know, you’ve got a new one. Say, I wanna do something a bit like that, but can you change it so that it authenticates me first? Or, you know?

 

Nikola Mrkšić

23:50 – 23:52

Yeah. Yeah.

 

Matt Henderson

23:52 – 24:13

Yeah. And then I guess one of the dangers is that it’s very easy to show demos of stuff like that working, and maybe the first version does something. But if you’re not literate in the code itself, you don’t see the bug, or you don’t see that something’s totally wrong.

And, yeah, we know a lot about the difference between a demo and something that’s deployable.

So yeah.

 

Nikola Mrkšić

24:13 – 24:30

Absolutely. I think that, you know, at the end of the day, like, in enterprise software, you’re paying for not just software and someone who can produce a demo, but a throat to choke over it working reliably for you as your environment maybe changes. And you’re, kind of, like, in the end, also paying to not have to know.

Right? You’re...

 

Matt Henderson

24:30 – 24:30

Yeah.

 

Nikola Mrkšić

24:30 – 24:32

paying for the convenience.

 

Matt Henderson

24:32 – 24:32

Yeah.

 

Nikola Mrkšić

24:32 – 24:57

Maybe just, like, finally on the call: OpenAI and all the models that are, like, in there. Did you have a chance to look at their open-source models? And, kinda like, do you have any thoughts for the audience around, like, what that means for their strategy? Are they just reactive, because DeepSeek and Llama and everyone have objectively just been kicking ass?

And, like, how do OpenAI and their methodology and worldview fit into all that?

 

Matt Henderson

24:57 – 25:03

Yeah. Yeah.

I mean, it’s in the name, I guess, but, you know: open. I was excited to see...

 

Nikola Mrkšić

25:03 – 25:08

Yeah. Let’s not quote Elon and Sam’s Twitter wars here, but, yeah.

 

Matt Henderson

25:08 – 26:10

Some good, some good drama there in these last couple days. Yeah, I was excited to see that, and then instantly started playing with it.

I think they haven’t thought too much about the developer experience of people that actually want to take these models and fine-tune them. Maybe it’s partly because there are a couple of new things that they did that came with those models.

They’re not just standard, like, plug-and-play. Well, first, they have this new way of formatting conversations, the OpenAI Harmony format, which, like, makes a whole lot more sense than whatever Hugging Face has, but it’s just not what the sort of ecosystem expects.

So, I mean, you saw, for example, people using the models, and they weren’t doing the formatting correctly. And so it’s just degraded quality, and it’s a totally silent error.

Right? And then, yeah, I have a lot of opinions on chat-template formatting and how the base libraries handle it.
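The silent-failure mode described here is easy to reproduce with a toy example. The two templates below are invented stand-ins, not the actual Harmony or Hugging Face formats; the point is that a model trained on one format will happily consume the other, nothing crashes, quality just quietly degrades:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]

def render_format_a(msgs):
    # Made-up special tokens, standing in for the format a model was trained on.
    return "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in msgs)

def render_format_b(msgs):
    # A different, equally plausible convention a library might apply by default.
    return "\n".join(f"### {m['role']}:\n{m['content']}" for m in msgs)

prompt_a = render_format_a(messages)
prompt_b = render_format_b(messages)
# Same conversation, two different strings reaching the model.
```

Feed `prompt_b` to a model trained on format A and it still produces fluent output; there is no error to catch, which is what makes mismatched chat templates so hard to debug.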

 

Nikola Mrkšić

26:10 – 26:11

Yep.

 

Matt Henderson

26:11 – 26:20

But yeah. And then the other sort of nonstandard thing is it’s a mixture of experts, with a new kind of quantization scheme.

 

Nikola Mrkšić

26:20 – 26:20

Yep.

 

Matt Henderson

26:20 – 26:28

So it’s, like, a 20B model, but it appears much smaller, because a lot of the weights are in 4-bit precision,

 

Nikola Mrkšić

26:28 – 26:28

Yeah.

 

Matt Henderson

26:28 – 27:27

and it requires special kernels to run. And those are only for inferencing the model, not for fine-tuning the model.

So, I mean, I was on, like, day one or day two trying to fine-tune this model, and then realizing that you can’t, or that you needed to expand it out to unquantized and try and fine-tune it there. And then you’re like, well, where’s the code to requantize? And so on.

Yeah. Long story short, maybe it’s me trying to do things too quickly, like, day two of the model being around, trying to train it. But I think part of it is, when you get this thing out, you say that it’s open. And I wouldn’t be too surprised if they thought of fine-tuning this thing being difficult as a feature, because a fine-tuned GPT-OSS model is just a direct competitor to one of their API models.

 

Nikola Mrkšić

27:27 – 27:53

I think Sean and I went down the rabbit hole trying to explain quantization for the audience last time, but maybe we can do it again as a final point here. I mean, you shrink the model, and it turns out that in inference, right, it’s almost as good.

Right? So why is it so much worse to fine-tune? It’s almost like someone deleted, like, the kinda rough workings of how you got somewhere, and then you’re just left with, like, the result of a problem, and you have to work backwards.

 

Matt Henderson

27:53 – 27:54

Yeah. Well...

 

Nikola Mrkšić

27:54 – 27:57

Okay. I’m trying to find a good analogy to explain to people why it’s hard.

 

Matt Henderson

27:57 – 28:18

Yeah. Totally.

Well, yeah. And, well, OpenAI claimed that they did train it with quantization, which is super interesting.

That would be cool. But that part isn’t open.

So, like, I’d love that code, and that would make it much more feasible to build things based on this model, but it’s not there. But yeah.

 

Nikola Mrkšić

28:18 – 28:23

Wait, in that context, would that mean that it was never really quantized? They just used lower bit precision? No?

 

Matt Henderson

28:23 – 28:33

Well, it sounds like... well, they’re not very clear. But what it sounds like is they did something called quantization-aware training, which we did back in the ConveRT days.

 

Nikola Mrkšić

28:33 – 28:34

Yep.

 

Matt Henderson

28:34 – 29:30

Alright? So our embeddings were in eight bits. But in training, we did quantization-aware training.

So the way that works is... and this is maybe an explanation for why it’s difficult to train quantized. Well, quantization just means that all of your numbers, like the activations, the model’s internal state, will be less precise.

They will be rounded to some nearby number. That’s always the case, because computers don’t maintain arbitrary precision.

But, you know, you might be doing a forward-pass computation where a value should be, you know, 0.661244, but that gets rounded to 0.1. And then you’ve just, like, kind of lost that information.

 

Nikola Mrkšić

29:30 – 29:54

So you hope that as you converge... it’s almost like dropout, where you would, like, turn off a random half. Here, you just, what, occasionally wipe those figures after a certain number, turn them to zeros, and hope that as you iterate and train, the learning adapts to it, and it starts to, what, converge into a space where those later digits matter very little? Or...

 

Matt Henderson

29:54 – 30:11

Yeah. Yeah.

So for, like, quantization-aware training, you’ll be simulating the quantization in the forward pass. But then for the gradients, you’d have higher precision.

Because if you didn’t have high precision for the gradients, a lot of the time you’d just be getting zero.

 

Nikola Mrkšić

30:11 – 30:12

Yeah.

 

Matt Henderson

30:12 – 30:12

gradient.

 

Nikola Mrkšić

30:12 – 30:12

Yeah. Yeah.

 

Matt Henderson

30:12 – 30:14

Yeah. So you kind of.

 

Nikola Mrkšić

30:14 – 30:17

Yeah.

We start at, like... okay.

 

Matt Henderson

30:17 – 30:36

Yeah. So when you do an update, you might not be changing anything, but because you’re storing the parameters and the gradients in a higher precision, they are actually moving between those, like, steps.

Yeah. So, yeah, there’s just a bunch of complications there.
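The mechanics described here can be sketched with a straight-through estimator: the forward pass sees quantized weights, but updates land on a full-precision master copy, so they accumulate until a weight crosses a grid boundary. This is a generic illustration (the grid spacing, toy loss, and learning rate are invented for the example), not anyone’s actual training code:

```python
import numpy as np

# Quantization-aware training sketch with a straight-through estimator.
# round() has zero gradient almost everywhere, so we pretend d(w_q)/d(w) = 1
# and apply gradients to a high-precision master copy of the weights.

def fake_quant(w, step=0.25):
    """Simulate quantization: round weights onto a grid with spacing `step`."""
    return np.round(w / step) * step

rng = np.random.default_rng(0)
w_master = rng.normal(size=4)       # full-precision master weights
x = rng.normal(size=4)              # a fixed toy input

lr = 0.01
start = fake_quant(w_master).copy()
for _ in range(200):
    w_q = fake_quant(w_master)      # forward pass uses the quantized weights
    # Toy loss = sum(w_q * x), so d(loss)/d(w_q) = x; the straight-through
    # estimator passes that gradient unchanged to the master copy.
    grad = x
    w_master -= lr * grad           # tiny high-precision updates keep moving

# Most single steps don't change fake_quant(w_master) at all, but the
# accumulated updates eventually jump weights to different grid points.
moved = fake_quant(w_master) != start
```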

 

Nikola Mrkšić

30:36 – 30:36

Well...

 

Matt Henderson

30:36 – 30:36

Yeah.

 

Nikola Mrkšić

30:36 – 30:43

I think, you know, with this we’ve definitely outdone the previous, overly complex explanation that Sean and I had. But...

 

Matt Henderson

30:43 – 30:45

Okay.

 

Nikola Mrkšić

30:45 – 30:59

It is really, really... I mean, I think it’s really interesting. And kinda like in classical deep learning fashion, there’s always some, you know, completely arts-based way of going after things that turns out to work.

Right?

 

Matt Henderson

30:59 – 31:03

Mhmm. An art, in an arts-based way.

 

Nikola Mrkšić

31:03 – 31:07

Well, it’s kinda like, you know, it’s hardly higher maths to say, like, I’m gonna, like, you know...

 

Matt Henderson

31:07 – 31:09

Yeah. Right.

Right.

 

Nikola Mrkšić

31:09 – 31:12

you know, these are the zeros here, but backpropagate and hold it at, like...

 

Matt Henderson

31:12 – 31:12

Yeah.

 

Nikola Mrkšić

31:12 – 31:20

get rounded to a different quantized... well, practically, this quantized state. Right? It’s crazy that that makes such a difference to these models.

Right?

 

Matt Henderson

31:20 – 31:20

Yeah.

 


Matt Henderson

31:21 – 31:48

There’s a whole bunch of other tricks as well. Like scaling: you scale the gradients up and then scale them back down again. It’s all, like, engineering work.

And I am very suspicious of deep learning papers. Do we still say deep learning? Papers that have all these derivations and, like, use the word theorem. You know, just give me an intuition for what you’ve done. This isn’t pure mathematics, it’s engineering.

And, like, give me an intuition for what the loss function is and...
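The gradient-scaling trick mentioned in passing can be shown in a few lines: tiny gradient components underflow to zero when stored in fp16, but scaling up before the cast and dividing back out afterwards preserves them. This mirrors the loss-scaling idea from mixed-precision training; the values and scale factor below are hypothetical:

```python
import numpy as np

# Loss/gradient scaling sketch: very small gradient components underflow to
# zero in fp16; multiplying by a large scale before the cast and dividing it
# back out afterwards preserves them. Hypothetical values for illustration.

true_grad = np.array([1e-8, 3e-5, 0.2], dtype=np.float64)

# Naive cast: 1e-8 is below fp16's smallest subnormal (~6e-8), so it
# flushes to exactly zero and the update is silently lost.
naive = true_grad.astype(np.float16)

# Scaled path: scale up into fp16's representable range, cast, then
# divide the scale back out in fp32 before applying the update.
scale = 2.0 ** 16
scaled = (true_grad * scale).astype(np.float16)
recovered = scaled.astype(np.float32) / scale
```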

 

Nikola Mrkšić

31:48 – 31:48

Totally.

 

Matt Henderson

31:48 – 31:49

yeah.

 

Nikola Mrkšić

31:49 – 32:38

I think Ilya Sutskever gave, I think, a commencement speech at the University of Toronto. And, you know, I remember, like, conferences ten years ago where he was just saying that, you know, you need to train yourself into being the best hyperparameter optimizer.

And I think he just, like, channeled the same thing at a bunch of these unsuspecting graduates, right... or actually people starting university, I guess. And he was like, just use every new iteration of AI, because you need to develop an intuition for how it’s changing, so that when the next thing hits and upsets the job market or your role, you get a feeling for how you can use it best first, so that you don’t end up being the one disrupted.

So I guess it’s kind of now applying for everyone who uses ChatGPT in their, like, workflows rather than just those optimizing these neural nets.

 

Matt Henderson

32:38 – 32:45

Yeah. Yeah.

Now we can get ChatGPT to optimize them for us.

 

Nikola Mrkšić

32:45 – 33:01

Alright. Well, on that note, I think we’ve filled our daily allotment of half an hour.

So thank you for joining me. It was a pleasure.

To everyone watching, let us know if we should talk more about quantization loss. Like, share, subscribe, and see you in the next one.

About the show

Hosted by Nikola Mrkšić, Co-founder and CEO of PolyAI, the Deep Learning with PolyAI podcast is the window into AI for CX leaders. We cut through hype in customer experience, support, and contact center AI — helping decision-makers understand what really matters.


Never miss an episode.

Subscribe