Demis Hassabis, CEO of Google DeepMind, sits down with host Logan Kilpatrick. In this episode, learn about the evolution from game-playing AI to today's thinking models, how projects like Genie 3 are building world models to help AI understand reality, and why new testing grounds like Kaggle's Game Arena are needed to evaluate progress on the path to AGI.
Watch on YouTube: https://www.youtube.com/watch?v=njDochQ2zHs

Chapters:
00:00 - Intro
01:16 - Recent GDM momentum
02:07 - Deep Think and agent systems
04:11 - Jagged intelligence
07:02 - Genie 3 and world models
10:21 - Future applications of Genie 3
13:01 - The need for better benchmarks and Kaggle Game Arena
19:03 - Evals beyond games
21:47 - Tool use for expanding AI capabilities
24:52 - Shift from models to systems
27:38 - Roadmap for Genie 3 and the omni model
29:25 - The quadrillion token club
🎧 Play snip - 1min️ (02:52 - 04:19)
Thinking Models
Demis Hassabis So what we mean by that is systems that can complete a whole task, right? And mostly, in the early days, that was playing a game really well, and there's an objective. And you have this model: today we have multimodal models, really powerful ones that model language and everything around us, but in those days we would have game models. And then you need some thinking or planning or reasoning capability on top, and this is obviously the way to get to AGI. And then, of course, once you have thinking, you can do deep thinking or extremely deep thinking and then have parallel planning. You can do planning and thoughts in parallel, then collapse onto the best one, make a decision, and then move on to the next one. There's still, I think, quite a lot of innovation that's required there. But it's exciting to see the rate of progress, even in the thinking part of things. And obviously for things like maths, for coding, for scientific problems, also for gaming, you're going to need to process and plan and basically do this thinking, and not just output the first thing that the model comes up with. That's unlikely to be good enough. So you want to go back and refine your own thought processes, which is in effect what the thinking systems do.
Logan Kilpatrick Yeah, I had not seen The Thinking Game, and I watched it probably a week and a half ago or something like that. I was scribbling down notes as I was watching, and I was like, wait...
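A rough way to picture "planning and thoughts in parallel, then collapse onto the best one" is a best-of-N loop over sampled plans. The sketch below is only an illustration: sample_plan and score_plan are hypothetical stand-ins, not anything Gemini-specific.

```python
import random

def sample_plan(state, rng):
    # Hypothetical stand-in: sample one candidate plan (a short chain of steps)
    # from a model, given the current state.
    return [f"step-{rng.randint(0, 9)}" for _ in range(3)]

def score_plan(state, plan):
    # Hypothetical evaluator: in a real system this might be a learned value
    # function, a verifier model, or the outcome of simulating the plan.
    return len(set(plan))  # toy score for illustration only

def think_in_parallel(state, num_candidates=8, seed=0):
    """Sample several plans 'in parallel', then collapse onto the best one."""
    rng = random.Random(seed)
    candidates = [sample_plan(state, rng) for _ in range(num_candidates)]
    best = max(candidates, key=lambda plan: score_plan(state, plan))
    return best  # commit to this plan, act, then repeat for the next decision

if __name__ == "__main__":
    print(think_in_parallel({"goal": "win the game"}))
```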
🎧 Play snip - 23sec️ (06:37 - 07:00)
Jagged Intelligences
Demis Hassabis But on the other hand, they can still make simple mistakes in high school maths or simple logic problems or simple games if they're posed in a certain way. So that must mean there's still something kind of missing. And these are kind of, you know, uneven intelligences, or jagged intelligences: in some dimensions they're really good, in other dimensions their weaknesses can be exposed quite easily.
🎧 Play snip - 1min️ (06:47 - 07:58)
Jagged Intelligences
Demis Hassabis So that must mean there's still something kind of missing. And these are kind of, you know, uneven intelligences, or jagged intelligences: in some dimensions they're really good, in other dimensions their weaknesses can be exposed quite easily.
Logan Kilpatrick Yeah, I want to come back to this. But before we do that, can we double click on Genie 3? I think there's an interesting segue here: models aren't great at playing games, and yet I saw a bunch of people commenting on Genie 3, and the reaction was absolute awe. I saw some very extreme comments, like "we're in a simulation" and "this is proof that anything is possible," because the Genie demos are so good. And it also obviously ties to solving RL with games. If you had to look back, now reflecting on the Genie 3 moment, has that turned out how you would have expected? It's not obvious to me that making models good at playing games results in the world model stuff that we have today.
Demis Hassabis Well, there are several branches of research and thinking coming together here, ideas coming together.
🎧 Play snip - 2min️ (08:27 - 10:14)
World Models
Demis Hassabis The reason we're doing that is we want to build what we call a world model, which is a model that actually understands the physics of the world, right? The physical structure, how things work, materials, liquids, and even the behaviors of living things, animals, human beings. That's obviously a critical part of our world. We don't just live in language and maths; there's the physical world that we exist in. So an AGI clearly needs to understand the physical world, partly so it can operate in the physical world, whether that's robotics (that's what's holding robotics back, it needs a world model) or things like Project Astra, our Gemini Live project about having a universal assistant that can assist you in everyday life, maybe existing on your phone or glasses. Clearly, that also would need to understand the spatial and temporal context that you're in. So you need a world model to really understand the world and how it operates. And one of the ways to prove that you've got a good world model is to be able to generate the world. There are many ways to test the efficacy and the depth of your world model, but one great way is to reverse it and generate something about the world: you turn on a tap and some liquid comes out of it, or there's a mirror and can you see yourself in the mirror, all of these things. And that's what Genie is going towards: building that world model and then expressing it, actually being able to generate worlds that are consistent. And that's the surprising thing about Genie 3: you look away, you come back, and that part of the world is the same as you left it...
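That "look away and come back" property can be read as a concrete consistency test: render a region, wander away, return, and check the region still matches. Below is a minimal sketch, assuming a hypothetical world_model.step(action) interface that returns rendered frames as arrays; it is not the actual Genie 3 API.

```python
import numpy as np

def frame_difference(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Mean absolute pixel difference between two same-shaped frames."""
    return float(np.mean(np.abs(frame_a.astype(float) - frame_b.astype(float))))

def consistency_check(world_model, region: str, tolerance: float = 5.0) -> bool:
    """Does a region of the generated world persist after we look away?

    world_model.step(action) is an assumed interface that advances the
    generated world by one action string and returns the rendered frame.
    """
    before = world_model.step(f"look_at:{region}")
    for _ in range(30):                    # wander off for a while
        world_model.step("turn_left")
    after = world_model.step(f"look_at:{region}")
    return frame_difference(before, after) <= tolerance

class DummyWorldModel:
    """Trivial stand-in that always renders the same frame, so the check passes."""
    def step(self, action: str) -> np.ndarray:
        return np.zeros((64, 64, 3), dtype=np.uint8)

if __name__ == "__main__":
    print(consistency_check(DummyWorldModel(), region="kitchen_tap"))
```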
🎧 Play snip - 2min️ (13:59 - 15:35)
Need for Better Benchmarks
Demis Hassabis But I think, intuitively, all of us have played around with these chatbots, and you see the edges of what they can do quite easily, right? And in my opinion, this is one of the things that's missing from these systems being full AGI: the consistency. It shouldn't be that easy for the average person to just find a trivial flaw in the system. It used to be counting the number of R's in strawberry, right? I think we've managed to solve that now. But there are still some pretty trivial things, things a school kid would trivially do, that these systems can't. So why is that is a good question. There are probably some missing capabilities in reasoning and planning and memory, where maybe one or two new innovations are still needed beyond just scaling. But it's also partly that maybe we need better benchmarks to eke out what these things are good at versus what they're not good at. These systems, including Gemini, are very general, but a lot of the benchmarks we use are starting to get saturated. So you look at some of the standard maths benchmarks like AIME: I think the latest result with Deep Think was 99.2%. So you're getting into the region of very diminishing returns, and there may even be an error in the test. They're getting rapidly saturated. So we're in need of new, harder benchmarks, but also broader ones, in my opinion: understanding world physics and intuitive physics and other things that we take for granted as humans and find easy. Things like physical intelligence...
🎧 Play snip - 1min️ (15:17 - 16:14)
Need for Better Benchmarks
Demis Hassabis And so they're getting rapidly saturated. So we're in need of new, harder benchmarks, but also broader ones, in my opinion: understanding world physics and intuitive physics and other things that we take for granted as humans and find easy. Things like physical intelligence as well, actually. We don't have really good benchmarks for these things. And also some safety benchmarks, too, testing for traits that you don't want, like deception, things like this. So I think there's actually really amazing work to be done in creating benchmarks that are really meaningful, that test slightly more complicated or subtle things than the sort of brute-force school exam type things that we have today. And that's why I'm so excited about Game Arena, because it is going a little bit back to our roots, of course, which is why we came up with it. But a...
🎧 Play snip - 1min️ (17:19 - 18:21)
Game Arena's Scaling
Demis Hassabis But as they get better, the test will get harder automatically. So it's not like AIME or GPQA, where you have to come up with harder science questions, and then who's going to create those questions? Are they leaked on the internet already? Each game is unique because it's created by the two players, so there's a kind of uniqueness about that, which is also nice for testing. And then the final thing is, just like we did with our own early games work, as the systems get better and better you can introduce more and more complex games into the Game Arena. So we started with chess, for obvious reasons: it's the classic one we test AI on, and it's close to my heart, of course. But the idea is we're going to expand it to potentially thousands of games, and then you'll get an overall score. So we're not really looking for systems that just play one game really well. They should be able to play across all games to a good standard, and it could be computer games as well as board games. And even more interestingly...
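As a toy illustration of that "overall score" idea, one could simply average per-game win rates, so a system only scores well if it plays everything to a decent standard. The aggregation below is my own sketch, not how Kaggle's Game Arena actually ranks models.

```python
from statistics import mean

def overall_score(per_game_results: dict[str, tuple[int, int]]) -> float:
    """Aggregate a model's results across many games into one number.

    per_game_results maps game name -> (wins, games_played). Averaging the
    per-game win rates rewards playing every game to a good standard rather
    than excelling at a single one.
    """
    win_rates = [wins / played for wins, played in per_game_results.values() if played]
    return mean(win_rates) if win_rates else 0.0

if __name__ == "__main__":
    results = {"chess": (42, 60), "go": (15, 40), "poker": (200, 500)}
    print(f"overall score: {overall_score(results):.3f}")
```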
🎧 Play snip - 2min️ (20:00 - 21:32)
Multi-objective AI
Demis Hassabis I mean, that's always been the hard challenge with reinforcement learning: in domains that are more messy or real-world-like, how do you specify the reward function, or the objective function, that you're trying to optimize? As humans, we don't have single objective functions, right? It's very messy. In fact, if I was to ask you what you're optimizing for on any given day, you might give a different answer, right? I think we're multi-objective, and we're continually weighting those different objectives against each other, depending on things like your emotional state, your physical environment, where you are in your career, all of these things. But somehow we muddle through with our brains and figure out roughly what the right north star is. And I think our general systems are going to have to do that too, where they learn to interpret what the human user is trying to achieve and then figure out how that translates into a set of useful reward functions to optimize against. And so there are lots of experiments going on here around things like metacognition or meta-RL, where you actually have another system on top that tries to work out what the reward functions are for the secondary system to optimize against.
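A toy way to make that two-level picture concrete: several weighted objective terms combined into one reward, with a "meta" layer guessing the weights from the user's request. The objective names and the keyword matching below are hypothetical, chosen only for illustration.

```python
def combined_reward(state: dict, weights: dict) -> float:
    """Weighted sum of several objective terms (names are made up for the sketch)."""
    objectives = {
        "task_progress": state.get("progress", 0.0),
        "speed": -state.get("time_spent", 0.0),
        "safety": -state.get("risk", 0.0),
    }
    return sum(weights[name] * value for name, value in objectives.items())

def infer_weights(user_request: str) -> dict:
    """Toy 'meta' layer: guess how to weight the objectives from the request.

    In the conversation this role belongs to a second system on top
    (metacognition / meta-RL); here it is just keyword matching.
    """
    weights = {"task_progress": 1.0, "speed": 0.3, "safety": 1.0}
    if "quick" in user_request or "fast" in user_request:
        weights["speed"] = 1.0
    if "careful" in user_request:
        weights["safety"] = 2.0
    return weights

if __name__ == "__main__":
    w = infer_weights("please do this quickly but be careful")
    print(w)
    print(combined_reward({"progress": 0.8, "time_spent": 0.2, "risk": 0.1}, w))
```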
🎧 Play snip - 1min️ (22:17 - 22:58)
Tool Use as Scaling Dimension
Logan Kilpatrick It also feels now like tools are this new scaling dimension, where as you give the models more powerful and different tools, they're able to do a bunch more stuff. And I'm curious how that new scaling dimension ties to this worldview of what we've been doing with games and some of these simulated RL environments. Is there a world where you give the model a physics simulator, and that's a tool that it has access to?
Demis Hassabis Yeah, I think tool use is going to be, you know, one of the most important capabilities for these AI systems. A lot of the reason the thinking part of these systems is so important is that you can use tools during the thinking, right?
🎧 Play snip - 2min️ (22:36 - 24:08)
AI Tool Use
Logan Kilpatrick Is there a world where you give the model a physics simulator, and that's a tool that it has access to?
Demis Hassabis Yeah, I think tool use is going to be, you know, one of the most important capabilities for these AI systems. A lot of the reason the thinking part of these systems is so important is that you can use tools during the thinking, right? You can call search, you can use some maths program, you can do some coding, come back, and then update your planning on what you're going to do. I think that's still actually fairly nascent at the moment, but it's going to be incredibly powerful once it becomes really reliable and the systems become good enough that they can use pretty sophisticated tools very reliably. And then the interesting thing is: what do you leave as a tool versus put into the main system, the main brain, so to speak? Now, with humans it's easy, because we're physically constrained, so anything that's not in our body is a tool; there's no question about what's a tool and what's our brain. But with a digital system, those things can get blurred. So should the capability to play chess, for example, be in the main model, or do you just use Stockfish or AlphaZero as a tool? And that tool could also be an AI system; it doesn't have to be a piece of software. It could actually be something like AlphaFold or whatever. And the question comes, I think...
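The "call search, run some code, come back, and update your plan" pattern is essentially a think-act-observe loop with a tool dispatcher. The sketch below uses a stubbed model and made-up tools; it is a generic illustration of the loop, not the Gemini tool-calling API.

```python
import math

# Two toy tools the 'model' is allowed to call while it thinks.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, {"sqrt": math.sqrt})),
    "search": lambda query: f"(pretend search results for '{query}')",
}

def fake_model(context):
    """Stub for the model's next thinking step.

    Returns either ('call', tool_name, argument) or ('answer', text).
    A real system would produce this from the model's reasoning trace.
    """
    if "calculator:" not in " ".join(context):
        return ("call", "calculator", "sqrt(2) * 100")
    return ("answer", "done, using " + context[-1])

def think_with_tools(question: str, max_steps: int = 5) -> str:
    context = [question]
    for _ in range(max_steps):
        kind, *rest = fake_model(context)
        if kind == "answer":
            return rest[0]
        tool_name, argument = rest
        observation = TOOLS[tool_name](argument)       # run the tool
        context.append(f"{tool_name}: {observation}")  # fold the result back into the plan
    return "gave up"

if __name__ == "__main__":
    print(think_with_tools("What is sqrt(2) times 100?"))
```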
🎧 Play snip - 2min️ (24:36 - 26:07)
AI Capability Evaluation
Demis Hassabis So it's very much an empirical question. Does adding that capability in help the other capabilities? If it does, do it. If it harms the other general capabilities, then maybe consider using it as a tool.
Logan Kilpatrick Interesting. One of the questions that developers, people who are building with our models, are always asking is around this: it's become very clear, and you actually just said this, that as the model is reasoning, it's making use of tools and doing all this stuff. Historically, the models were just weights: you give a token in, you get a token out. Now it feels like the models are becoming these entire systems themselves, and how people actually build applications on top of the models is changing, because the model is just doing more for you out of the box. I'm curious whether that transition, from the model just being weights somewhere to actually becoming a system, resonates with your worldview of how progress is happening, and whether we'll see that continue. And then I don't know if you have suggestions for people who are building stuff, as they think about this point of: what do I build as a tool versus what is the model just going to have empirically as part of it?
Demis Hassabis Yeah, I mean, you're right that the models are improving fast, and as they gain the tool capability, along with planning and thinking, it's kind of exponentially increasing what the system might be able to do, because obviously it can combine tools in novel ways and combinations.
🎧 Play snip - 1min️ (26:24 - 27:09)
Product Design Considerations
Demis Hassabis So there's still a lot of productization, I think, that has to be done on top. Now, the hard part, and we've talked about this before, is that in this new world it requires very interesting skills from a product manager or product designer type of role, because you've got to design, say, a product that's coming out in a year, so you've got to be really close to the technology and understand it well, to kind of intercept where that technology will be in a year's time and design for that, right? And I think whatever product polish you put on top of your product, it has to allow for the engine under the hood to be unplugged and plugged back in with a more advanced system, you know, one that's coming out every three to six months, basically...
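One way to act on that "unplugged and plugged back in" point is to keep the product code behind a thin model interface, so a newer engine can be dropped in without touching the rest. A minimal sketch with made-up class names follows.

```python
from typing import Protocol

class TextModel(Protocol):
    """The only surface the product depends on; the engine behind it can change."""
    def generate(self, prompt: str) -> str: ...

class ModelV1:
    def generate(self, prompt: str) -> str:
        return f"[v1] answer to: {prompt}"

class ModelV2:
    def generate(self, prompt: str) -> str:
        return f"[v2] better answer to: {prompt}"

class AssistantProduct:
    """Product polish lives here; it never hard-codes a specific model."""
    def __init__(self, model: TextModel):
        self.model = model

    def answer(self, question: str) -> str:
        return self.model.generate(question).strip()

if __name__ == "__main__":
    app = AssistantProduct(ModelV1())
    print(app.answer("What's new?"))
    app.model = ModelV2()  # engine swapped; the product code above is unchanged
    print(app.answer("What's new?"))
```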
🎧 Play snip - 17sec️ (29:32 - 29:50)
Greatest Game Ever
Logan Kilpatrick I feel like Genie's a good excuse for us to have a chance to make games and play them, and then DeepMind makes a video game. Yeah.
Demis Hassabis Well, you know, that's always been my secret plan: maybe post-AGI, once that's safely done over the line, go back with these tools and make the greatest game ever. That would be a real dream come true.