
GPT-5 Era Starts Now: The AI Arms Race Just Escalated


Summary

PolyAI co-founder and CTO Shawn Wen joins Nikola for a deep dive into the just-released OpenAI GPT-5 model. From model-flipping and 4o’s deprecation to curbing sycophancy and determining which LLM out there is #1 for coding, they cover it all. Join us for a discussion on:

  • OpenAI’s bold approach to upgrading and removing models
  • Hands-on experiences with GPT-5 and its capabilities for coding and other tasks
  • Trends in AI, including the increasing use of language models as digital concierges
  • Concerns about transparency in benchmarks

GPT-5 is now available in PolyAI’s Agent Studio. Learn more at https://poly.ai/

Key Takeaways

  • Enterprise resilience matters: OpenAI’s consumer-first rollout underscores why PolyAI builds robust adapters and infrastructure that protect CX leaders from sudden platform changes.
  • Voice is the frontier: While real-time APIs are still immature, PolyAI is investing in voice-first AI agents designed for reliability in customer-facing environments.
  • Coding improvements accelerate innovation: GPT-5’s stronger coding capabilities validate PolyAI’s approach of rapidly prototyping and deploying agentic features into our products.
  • Beyond hype to enterprise impact: Where others focus on benchmarks and announcements, PolyAI prioritizes practical reasoning, reduced hallucinations, and context depth to deliver measurable outcomes for customer service leaders.

Transcript

Hi, everyone, and welcome to another episode of the PolyAI podcast. Today, I’m here with my cofounder, Shawn.

I think, like most nerds, we’ve been really excited about GPT-5 launching. The moment I opened my app last night and saw I no longer had the old models, I was sad.

I like them. You know? I was just amazed by the balls of that man.

Which models did you use the most? I mean, for the most part, I never really used, like, deep reasoning that much.

I think 4o, a lot, actually. Good? Yeah.

Also, yeah, I think the quality is actually pretty good. How did you decide how to switch between the three models? Honestly, I used it for simple tasks, like rewriting some internal documents.

And for everything else, I switched to, like, you know, deep research more, doing some of the reasoning. You know, I always found it way too slow.

I’m too impatient. I just couldn’t honestly, I couldn’t. It’s fine.

It’s not right. I’ve been gone.

It’s fine. Yeah.

Because honestly, like, it wasn’t always good either. No.

Right? And I’m genuinely perplexed, when it comes to coding, about which model to use. I literally don’t know.

Yeah. Well, they had the recommendation.

Right? But it wasn’t really clear. I don’t... I’m not... yeah.

Yeah. It’s all confusing.

Look, I think they did the right thing. You know, when we update our platform, being a B2B company, we have to push things out gradually, re-educate customers, check for backwards compatibility, test use cases.

I kind of envy them for the consumer side. Yeah.

Just, okay, we’re gonna flip it. And on the consumer side, I no longer have access to any of the old models.

We just didn’t say it. Yeah.

We spent ten minutes trying to access it, and we can’t. Yeah.

On enterprise, they have sixty days. Yeah.

Right? So at that point, it’s gone. So anyone who’s built, you know, any kind of backwards compatibility or dependence on it Yeah.

Now I have to live with the new reality. Correct.

And, I think that’s just what market power is. Right? Yeah.

Well, I think, you know, it’s also that this market is shifting so fast, and I’m sure that every AI company now is, like, building adapters to every API endpoint these days, expecting these endpoints to shift very quickly, so you can switch to another model very quickly. With some of the products, it’s become crazier.

Like, you click on the LLM dropdown button and you see, like, 20 models in it. I was like, how would you even choose? I think that’s engineers overindulging themselves, because, like, it’s cool, but you never really use them all.

Right? Yeah. Like, this is good.

This is, like, the Apple experience. Right? Yeah.

Yeah. And there it is.

Yeah. I think they are doing the right thing with this.

But, you know, just before this session, I went online and, you know, read LinkedIn. People are not very happy about it.

You know? There are some people who may have been working with 4o for a long time and were like, why did you take down my 4o? It has become my friend. That was actually interesting.

Yeah. It is.

I mean, I think a lot of our wrapper competitors are now writing and presenting their ability to use Anthropic or OpenAI as an advantage. And we just hope that, you know, Daniel doesn’t get any ideas and takes theirs out as well.

Yeah. But look.

I really think it’s good. I think, you know, we’ve seen generations of companies do a slow move to the cloud and then this and then that.

And... Correct. The stack moved fast then, but much slower than now.

So I think, like, this makes a lot of sense. And, so okay.

I think there are three tiers. Even the free tier gets, like, the nano version.

Correct. On Pro, there are only two: the thinking one and the regular one that multiplexes.

Correct. Can you explain, like, for people, what is this mixture of models? Yeah.

Do we know more about how it works, or is it all kind of a black box? Yeah. I think, you know, based on the official announcement, they have switched to this, like, smart-routing kind of model architecture.

So basically, you have a bunch of models in the background, and depending on your query, it routes the query to different models. Now, I mean, we have seen these kinds of routers before, right? In theory. Did you want to talk about that? Well, no.
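(For the technically curious, here is a minimal sketch of what that kind of smart routing could look like. The model names, the heuristic, and the call_model stub are hypothetical, purely to illustrate the idea of one front door dispatching queries to different backends; it is not OpenAI’s actual router.)

```python
# Toy query router: one entry point that picks a backend per query.
# Model names and the routing heuristic are made up for illustration.

def call_model(model: str, query: str) -> str:
    # Stand-in for a real completion API call.
    return f"[{model}] response to: {query[:40]}..."

def route(query: str) -> str:
    """Pick a backend for the query with a crude heuristic (illustrative only)."""
    reasoning_markers = ("prove", "step by step", "debug", "plan out", "derive")
    if len(query) > 2000 or any(m in query.lower() for m in reasoning_markers):
        return "gpt-5-thinking"  # slower, deeper reasoning
    return "gpt-5-main"          # fast default for everyday queries

def answer(query: str) -> str:
    return call_model(route(query), query)

print(answer("Rewrite this internal document to be more concise."))
print(answer("Debug this stack trace and plan out a fix, step by step."))
```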

No. Look.

I mean, I think, look. It is whatever you put under the hood.

Right? So I think you, you can’t talk about what you saw at Siri. Apple’s Secret Service.

That’s true. But that was an ancient generation of technology.

Well, I didn’t work at Apple, so, you know... But it’s a very similar kind of thing. So I think, you know, the interesting thing to me is how they make sure that this smart routing doesn’t become the single point of failure. And, you know, the LLM is kind of meant to just be the default concierge: you just send everything to that one model.

Now they are building a concierge on top of a concierge, which, you know, I’m very curious to see. You know what I think is interesting? I remember, maybe nine months ago, Sam Altman was on a podcast talking about how he needs to get better at product.

Right? And I think, you know, being startup founders, there’s always a sentence like that that applies to one of us, probably both of us. True.

Yeah. And, we had our new CTO, Chris, join this week.

It’s really Super exciting. Yeah.

But, like, the thing that’s really, really funny to me is, like, when you bet the whole house on it, it kind of has to work. Right? So, you know, if you’ve got the talent density of OpenAI and you bet the whole house on it, you can bet they’ll find a way to iterate on it for all the constituencies that use it: a 5.1 or whatever, not announced but released in the next few days.

Yeah.

It’s gonna be a lot of day-and-night firefighting for the things they’ve broken for the important constituencies among their users, right? Yeah. So, yeah, I think, you know, what they did is remarkable and very impressive.

I would love to see how that continues to evolve in the consumer’s mind. I have already seen, like, a long Reddit thread of questions about how to get these old models back.

How do I actually still make a choice? Because it’s consumer behavior they have been used to for the past eighteen months, two years now. And now, like, we are changing back to that simpler world.

It’s going to be hard to change back. And look, I mean, I think on the consumer side, fine.

Right? Like, you make a decision if it’s substantially better. You don’t need to A/B test them.

Right? I think what’s surprising is that it wouldn’t really cost them all that much to give people a longer end of life for, like, the 4o models and such in the enterprise. And yet, they’re kind of telling you what they care about.

And, you know, I saw a bunch of stats now. Like at 700,000,000 weekly active users, right? It’s a lot.

A lot better than monthly, right? Huge. They got there something like, I think four or five times faster than Facebook did.

Back in the day, right? So, you know, they’re out there to win it and to be, I think, you know, in the Magnificent Seven, or whatever the new moniker will be when they... Yeah. Yeah.

When the IPO would be the new show hands, you know. That would be interesting.

Yeah. List of companies.

Yeah. Yeah.

I can’t do that acronym. So, okay.

I think that, looking at everything else that was added, you know, we were fixated on one word: sycophantic behavior. So difficult to say, but, you know, I had been trying to practice pronouncing it before this meeting.

But I just realized it’s basically a different way to say hallucination. Yeah.

Right? You almost think, like, the use of such a strong word is almost there to, like, hijack the narrative away from the Yeah. From the hallucination.

Correct. Yeah.

Yeah. But, you know, the results they have been showing in those experiments are actually very encouraging, because hallucination, or what they call, quote unquote, sycophantic behavior, is actually the big problem that everyone has been talking about since the launch of ChatGPT.

Hallucination always comes with LLMs; it’s actually covered a lot in the literature. I think so. True, well, yeah, true, true, true.

So now I think they have really put a huge focus on actually addressing that. And the results are definitely looking very promising.

And the way they explain how they actually fix these problems is basically, you know, a typical machine learning training recipe. You create an evaluation set, you evaluate on it, you measure your progress, and you create a training set.

And for the training set, you know, for example, an example training item would be an image recognition task. You show a picture to the LLM and say, what are the objects inside the picture? And then the LLM will generate the answer for you.

Now you still ask the same question but you don’t send the image anymore. Yeah.

You know, what are the objects in there? And the LLM should not hallucinate there. So there’s a very carefully curated set of questions, and this then becomes the training data, and you train your model together with your reward signal.
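(An editor’s aside for readers who want the recipe Shawn describes made concrete: below is a toy sketch of paired training examples where the same question is asked with and without the image, and the no-image case targets an admission rather than an invented answer. The field names are assumptions for illustration, not OpenAI’s actual training format.)

```python
# Toy construction of anti-hallucination training pairs, as described above.
# Purely illustrative; field names and data layout are made up.

def build_pairs(labelled_images):
    """labelled_images: list of (image_path, list_of_objects) tuples."""
    examples = []
    for image_path, objects in labelled_images:
        # Grounded case: image attached, the model should list the objects.
        examples.append({
            "image": image_path,
            "prompt": "What objects are in this picture?",
            "target": ", ".join(objects),
        })
        # Ungrounded case: same question, no image; the target is an admission,
        # and a reward signal would penalise any invented objects here.
        examples.append({
            "image": None,
            "prompt": "What objects are in this picture?",
            "target": "I can't see any image, so I can't tell you what's in it.",
        })
    return examples

pairs = build_pairs([("kitchen.jpg", ["kettle", "mug", "window"])])
print(len(pairs), "training examples")
```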

Okay. Well, look, I mean, I think that is really exciting.

This is just a good direction of travel. I think, when you look at voice in particular, the thing in the app still runs on the 4o infrastructure with some improvements to, like... Correct.

Interruptions, tonality, but it feels very tactical. I spent forty minutes on the way to work.

Talking back and forth with it, as I do most days, and I don’t see a very tangible difference.

What about you? I think, you know, it’s quite clear that in this particular release they have been focusing on hallucination reduction, on the reasoning capability, and also on the coding capability.

So maybe we should talk about that. Yeah.

Yeah. Yeah.

But on the voice part, it doesn’t seem like there have been a lot of changes there. I think, you know, definitely they are... We know they are working very hard on that with the real-time API.

And the big push to voice mode, right? Correct. Yeah.

It’s really good. I think that voice experience is really amazing.

I think the real-time APIs don’t actually have the same kind of quality, comparable audio quality, to the actual thing you get in the app. Yeah.

Correct. Yeah.

But, you know, things will change in the future. There is also a huge problem with voice mode, which is that the instruction following is actually even weaker, because the audio signal is a lot harder for LLMs to interpret.

I think they also did a demo video in one of the sections where they actually tell the model, hey, now just reply back to me in single words, which, you know, to normal people seems like a very simple feature, but if you know how these models work, it’s actually not an easy thing to do. Yeah.

Yeah. Yeah.

Yeah. It’s like strawberries all over again.

Correct. But, yeah, I’m very excited to see that actual GPT-five voice more in the moment to come because I don’t think this is actually their actual flagship voice model release.

Yeah. Yeah.

No. For sure.

For sure. Yeah.

I think, you know, looking at the coding side, with all the recent things like Cursor, Lovable, Windsurf being acquired, fired, etcetera. Right? You know, not since being a kid in Serbia watching Latin American telenovelas like Cassandra have I felt that similar things were happening in an environment that I’m closely following.

Yeah. So, like, you know, Anthropic, I feel, has been the backbone of a lot of the other ones.

Correct. It almost felt like they had, quote unquote, won coding.

Right? That was, like, their niche, and OpenAI was, like, the consumer... Yeah. Thing.

Now I don’t think, you know, paying, what, $200 a month for ChatGPT... I’ve watched you do things with it. Yeah.

I’m no longer really coding. So I live through this, man.

Yeah. But, what changed now? Yeah.

Well, first of all, I would like to say, I don’t think the market consensus is that Claude has won yet; Anthropic has not won yet. This market is so dynamic, so crazy.

I’m almost glad that I’m not in it, but at the same time, part of me is also like, oh damn, I’m not in that space. It moves so fast, but it is definitely not short on excitement.

It is so hard. Yeah, it is.

I think OpenAI’s coding models have always been there. And then they started to invest heavily, probably, like, a year or so ago.

And Anthropic has been very focused on this. I spoke to their people.

I talked to them: hey, we really focus on voice.

We want really low latency, and we managed to optimize our LLM to, actually, you know, sub-two-hundred-millisecond latency. They were like, okay.

If you have sub-two-hundred-millisecond latency, even if we have the new update coming in the summer, it’s probably not going to be comparable to that, because they have been quite clear that their focus is on long-context, reasoning-based models that can do really, really good coding tasks. I think this new release of GPT-5 is actually making the coding space even messier.

So, I actually tried to do a mock of our... so we have this Agent Studio product. We have been prototyping, like, new agentic features, and I am not a very good front-end developer, so I just prompted GPT.

What is impressive is that now I can actually take a screenshot of our product, send it to GPT, and say, hey, just create an app that looks exactly like this. It creates very similar apps.

So you can get... What is it using, React? Yeah. Is it JavaScript, you know, that I saw?

JavaScript-based, right? And then you can basically... and then I further prompted it to say, hey, now build me a copilot agent with a sidebar and with reasoning capabilities.

And then once I type in this query: you need to help me do something in the front-end UI. So, yeah, the GPT-5 model was not out yet yesterday when I played around with it, so I used o3, which was actually the best coding model at that time, or which I supposed it should be.

And then it just generates a lot of bugs. It generates one bug; I try to run it, but I cannot run it, so I just ask it to regenerate it again.

Or, like, you know, fix that bug, and then copy-paste the entire error message into it. It fixes the previous one and generates a new bug.

So I was struggling to get it to work. This morning, when GPT-5 officially launched, I switched to GPT-5.

I asked the model, hey, scrap everything you did yesterday and just redo the whole thing again. Yeah.

It works perfectly. So I think they are definitely catching up in that game.
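(For readers who want to try the screenshot-to-app experiment themselves, here is a rough sketch using the OpenAI Python SDK’s chat completions interface with an image attached. The file name, prompt wording, and the "gpt-5" model identifier are placeholders standing in for what Shawn describes, not an exact reproduction of his setup.)

```python
# Sketch: send a product screenshot and ask for a look-alike front end.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

with open("agent_studio_screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-5",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Create a JavaScript app that looks exactly like this screenshot, "
                     "then add a copilot agent in a sidebar with reasoning capabilities."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```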

And I don’t know, I think we will need the community’s take. I haven’t looked at the benchmarks yet and stuff.

No. Yeah.

I think the benchmarks... I don’t know. Benchmarks these days are a bit rigged.

Every model is actually powerful on its own test cases. I think when Grok got launched, one or two months ago, they were the best on the numbers.

Now the interesting thing is that OpenAI doesn’t share results compared to other models anymore. They’re like, now I’m the winner.

I’m going to show the results. It feels very Trumpian.

Everyone is the best at all times. Exactly.

You know, all the announcements are starting to feel the same. So, I mean, okay.

Looking at the coding stuff, the one thing that again is interesting, and we don’t know for a fact, we can’t even backtest it because they removed the old models, but you can’t help but wonder whether maybe you would have fixed that bug with one of the other models. It could be, right? Yeah.

It could be. I just don’t have the time to actually try it out.

Maybe this is my crooked Serbian mentality, but, you know, is that why it was removed? Correct. So you could figure it out with the API.

Correct. But, I think it’s really interesting because the other thing I really like about just thinning the number of models is I will buy in and believe that they will give me the best thing that they can give me whenever I call that, like, source of intelligence.

Right? Correct. K.

Give me one or two settings, maybe, for extremely slow but way more powerful things. Yeah.

That I kind of understand. Right? But kind of like needing to know exactly which flavor, it’s starting to feel like, you know, an English feast with, you know, 20 sets of cutlery.

Yes. Hopefully, you just need to go from the outside in.

But... Correct. If you didn’t know that, you’d be very confused choosing the right fork and spoon and... Yeah.

Looking for a different dish. So, I think that’s really, really good.

Right? Thinking back to that, not releasing benchmarks and stuff: the open-source models. Right? Yes.

So they’ve released... this happened a few days ago. Correct.

Yeah. Right.

The OSS models, they have, like, a 20-billion-parameter one. There’s also a 120-billion one.

Yeah. Yeah.

I mean, it feels like they’re at least trying to project confidence that the closed models are so much better that the open ones are just not part of the race. But yeah.

I can’t help but wonder if, you know, another open model reaches this performance three months from now. But yeah.

I think our VP of research, Matt, has already looked into that. And, his initial feedback was that, you know, it seemed to be running very fast, so it’s actually quite efficient.

The model checkpoint is quantized; it’s already been quantized when they release it. And the thing is, you can dequantize it, but they don’t actually release the script to let you quantize it back.

The problem is that fine-tuning requires you to operate on the unquantized, you know, parameter set. So, you know, it’s actually not that easy to take that open-source model and just fine-tune it on your own data.

So I don’t know whether this is intentional or just, you know, something that got missed in the process, but I also find it very hard to believe that it’s pure ignorance. Right? Yeah.

I mean, maybe for the audience that is less technical: quantization kind of refers to, let’s say, you had, like, I don’t know, 64-bit numbers, where you have large precision. Quantization is basically, say, turning them into, like, 16 bits.

So, kind of, like... Much more attention to your memory efficiency, you know, a smaller bit width.

Memory efficient, multiplications happen faster, and models overall, no matter what’s in them, run faster. Right? So it’s helpful just in general, you know, if you want to run it on the edge, if you want to run it... Yeah.

And tune it more cheaply. But, yeah, like, that quantization comes at the very end of it when you’ve kind of, like, settled the training process.

There’s less... well, I mean, the parameters are not shifting anymore, and you can lose maybe a bit of the precision of the final answer by doing it, but not much. But in doing that, you... yeah.

You’ve kind of, like, you know... I’m looking for the right analogy. Now you are more technical than me.

Okay. You know? No.

No. No shame.

I’m not sure the general audience actually understood that bit. I understand it.

You delete half the numbers at the end and yeah. Yeah.

Fair enough. Fair enough.
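(A tiny numeric example of the quantization trade-off just discussed: 32-bit floats squeezed into 8-bit integers with a shared scale factor, so they take a quarter of the memory at the cost of a little precision. This is a toy symmetric scheme for illustration, not the one used for the gpt-oss release.)

```python
# Toy symmetric quantization of float32 weights to int8 and back.
import numpy as np

weights = np.array([0.8123, -1.97, 0.0031, 2.5, -0.4487], dtype=np.float32)

scale = np.abs(weights).max() / 127.0                               # map the largest weight to +/-127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)   # 1 byte per weight instead of 4
dequantized = q.astype(np.float32) * scale                          # what you compute with afterwards

print("original   :", weights)
print("int8 codes :", q)
print("dequantized:", dequantized)
print("max error  :", np.abs(weights - dequantized).max())
```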

Fair enough. Very cool. Alright.

What have we forgotten about this whole thing here? We’ve talked about coding. We talked about the general consumer aspect and the deprecation of the old models.

Correct. There are menial things, well, not menial, they’re actually very meaningful, but I think the context window, the effective context window, increased something like four times.

Yeah. It’s still not Gemini level.

I think it’s like No. Yeah.

Two fifty or 500. Yeah.

I think Gemini is at 1,000,000, and then I think this update takes it to 400K.

Yeah, yeah, yeah, okay. Yeah.

Which I think is fine because, like, you know, context windows don’t really mean much. You actually need to actually look at an effective context window.

And the announcement is that that’s gone up to, like, something quite serious. Okay.

Yeah.

I think we’re excited too. We have it running in Agent Studio already.

Yeah. And I think what I’m excited about is the reasoning tasks, where you don’t have to have an extra type of parameter for... Yeah.

Well, I mean, you still do, because sometimes Claude’s better. Sometimes... we’ll see.

Maybe they are better at everything. In a few weeks, we can then settle it.

But Alright. Yeah.

Well, we’ll have to assess the impact, but I’m pretty sure that it will be better. With our clients, we just need to A/B test it a little bit and actually make sure it’s something we feel comfortable with. So, yeah. Alright.

Well, thank you all for listening and joining us in this episode. Thanks, Shawn.

Thank you. Hope to see you all on the next one.

Bye.

About the show

Hosted by Nikola Mrkšić, Co-founder and CEO of PolyAI, the Deep Learning with PolyAI podcast is the window into AI for CX leaders. We cut through hype in customer experience, support, and contact center AI — helping decision-makers understand what really matters.

