Demis Hassabis, CEO of Google DeepMind, sits down with host Logan Kilpatrick. In this episode, learn about the evolution from game-playing AI to today's thinking models, how projects like Genie 3 are building world models to help AI understand reality, and why new testing grounds like Kaggle's Game Arena are needed to evaluate progress on the path to AGI.
Watch on YouTube: https://www.youtube.com/watch?v=njDochQ2zHs

Chapters:
00:00 - Intro
01:16 - Recent GDM momentum
02:07 - Deep Think and agent systems
04:11 - Jagged intelligence
07:02 - Genie 3 and world models
10:21 - Future applications of Genie 3
13:01 - The need for better benchmarks and Kaggle Game Arena
19:03 - Evals beyond games
21:47 - Tool use for expanding AI capabilities
24:52 - Shift from models to systems
27:38 - Roadmap for Genie 3 and the omni model
29:25 - The quadrillion token club
🎧 Play snip - 1min️ (02:52 - 04:19)
Thinking Models
Demis Hassabis So what we mean by that is systems that can complete a whole task, right? And mostly, in the early days, that was playing a game really well, and there's an objective. And you have this model: today we have multimodal models, really powerful ones that model language and everything around us, but in those days we would have game models. And then you need some thinking or planning or reasoning capability on top, and this is obviously the way to get to AGI. And then, of course, once you have thinking, you can do deep thinking or extremely deep thinking and then have parallel planning. You can do planning and thoughts in parallel, then collapse onto the best one, make a decision, and then move on to the next one. There's still, I think, quite a lot of innovation that's required there. But it's exciting to see the rate of progress, even in the thinking part of things. And obviously for things like maths, for coding, for scientific problems, also for gaming, you're going to need to process and plan and basically do this thinking, and not just output the first thing that the model comes up with. That's unlikely to be good enough. So you want to go back and refine your own thought processes, which is in effect what the thinking systems do.
Logan Kilpatrick Yeah, I had not seen The Thinking Game, and I watched it probably a week and a half ago or something like that. I was scribbling down notes as I was watching, and I was like, wait...
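A rough way to picture "planning and thoughts in parallel, then collapse onto the best one" is a best-of-N loop over sampled plans. The sketch below is only an illustration: sample_plan and score_plan are hypothetical stand-ins, not anything Gemini-specific.

```python
import random

def sample_plan(state, rng):
    # Hypothetical stand-in: sample one candidate plan (a short chain of steps)
    # from a model, given the current state.
    return [f"step-{rng.randint(0, 9)}" for _ in range(3)]

def score_plan(state, plan):
    # Hypothetical evaluator: in a real system this might be a learned value
    # function, a verifier model, or the outcome of simulating the plan.
    return len(set(plan))  # toy score for illustration only

def think_in_parallel(state, num_candidates=8, seed=0):
    """Sample several plans 'in parallel', then collapse onto the best one."""
    rng = random.Random(seed)
    candidates = [sample_plan(state, rng) for _ in range(num_candidates)]
    best = max(candidates, key=lambda plan: score_plan(state, plan))
    return best  # commit to this plan, act, then repeat for the next decision

if __name__ == "__main__":
    print(think_in_parallel({"goal": "win the game"}))
```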
🎧 Play snip - 23sec️ (06:37 - 07:00)
Jagged Intelligences
Demis Hassabis But on the other hand, they can still make simple mistakes in high school maths or simple logic problems or simple games if they're posed in a certain way. So that must mean there's still something kind of missing. And these are kind of, you know, uneven intelligences, or jagged intelligences: in some dimensions they're really good, in other dimensions their weaknesses can be exposed quite easily.
🎧 Play snip - 1min️ (06:47 - 07:58)
Jagged Intelligences
Demis Hassabis So that must mean there's still something kind of missing. And these are kind of, you know, uneven intelligences, or jagged intelligences: in some dimensions they're really good, in other dimensions their weaknesses can be exposed quite easily.
Logan Kilpatrick Yeah, I want to come back to this. But before we do that, can we double click on Genie 3? I think there's an interesting segue here: models aren't great at playing games, and yet I saw a bunch of people commenting on Genie 3, and the reaction was absolute awe. I saw some very extreme comments, like "we're in a simulation" and "this is proof that anything is possible," because the Genie demos are so good. And it also obviously ties to solving RL with games. If you had to look back, now reflecting on the Genie 3 moment, has that turned out how you would have expected? It's not obvious to me that making models good at playing games results in the world model stuff that we have today.
Demis Hassabis Well, there are several branches of research and thinking coming together here, ideas coming together.
🎧 Play snip - 2min️ (08:27 - 10:14)
World Models
Demis Hassabis The reason we're doing that is we want to build what we call a world model, which is a model that actually understands the physics of the world, right? The physical structure, how things work, materials, liquids, and even the behaviors of living things, animals, human beings. That's obviously a critical part of our world. We don't just live in language and maths; there's the physical world that we exist in. So an AGI clearly needs to understand the physical world, partly so it can operate in the physical world, whether that's robotics (that's what's holding robotics back, it needs a world model) or things like Project Astra, our Gemini Live project about having a universal assistant that can assist you in everyday life, maybe existing on your phone or glasses. Clearly, that also would need to understand the spatial and temporal context that you're in. So you need a world model to really understand the world and how it operates. And one of the ways to prove that you've got a good world model is to be able to generate the world. There are many ways to test the efficacy and the depth of your world model, but one great way is to reverse it and generate something about the world: you turn on a tap and some liquid comes out of it, or there's a mirror and can you see yourself in the mirror, all of these things. And that's what Genie is going towards: building that world model and then expressing it, actually being able to generate worlds that are consistent. And that's the surprising thing about Genie 3: you look away, you come back, and that part of the world is the same as you left it...
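That "look away and come back" property can be read as a concrete consistency test: render a region, wander away, return, and check the region still matches. Below is a minimal sketch, assuming a hypothetical world_model.step(action) interface that returns rendered frames as arrays; it is not the actual Genie 3 API.

```python
import numpy as np

def frame_difference(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Mean absolute pixel difference between two same-shaped frames."""
    return float(np.mean(np.abs(frame_a.astype(float) - frame_b.astype(float))))

def consistency_check(world_model, region: str, tolerance: float = 5.0) -> bool:
    """Does a region of the generated world persist after we look away?

    world_model.step(action) is an assumed interface that advances the
    generated world by one action string and returns the rendered frame.
    """
    before = world_model.step(f"look_at:{region}")
    for _ in range(30):                    # wander off for a while
        world_model.step("turn_left")
    after = world_model.step(f"look_at:{region}")
    return frame_difference(before, after) <= tolerance

class DummyWorldModel:
    """Trivial stand-in that always renders the same frame, so the check passes."""
    def step(self, action: str) -> np.ndarray:
        return np.zeros((64, 64, 3), dtype=np.uint8)

if __name__ == "__main__":
    print(consistency_check(DummyWorldModel(), region="kitchen_tap"))
```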
🎧 Play snip - 2min️ (13:59 - 15:35)
Need for Better Benchmarks
Demis Hassabis But I think, intuitively, all of us have played around with these chatbots, and you see the edges of what they can do quite easily, right? And in my opinion, this is one of the things that's missing from these systems being full AGI: the consistency. It shouldn't be that easy for the average person to just find a trivial flaw in the system. It used to be counting the number of R's in strawberry, right? I think we've managed to solve that now. But there are still some pretty trivial things, things a school kid would trivially do, that these systems can't. So why is that is a good question. There are probably some missing capabilities in reasoning and planning and memory, where maybe one or two new innovations are still needed beyond just scaling. But it's also partly that maybe we need better benchmarks to eke out what these things are good at versus what they're not good at. These systems, including Gemini, are very general, but a lot of the benchmarks we use are starting to get saturated. So you look at some of the standard maths benchmarks like AIME: I think the latest result with Deep Think was 99.2%. So you're getting into the region of very diminishing returns, and there may even be an error in the test. They're getting rapidly saturated. So we're in need of new, harder benchmarks, but also broader ones, in my opinion: understanding world physics and intuitive physics and other things that we take for granted as humans and find easy. Things like physical intelligence...
🎧 Play snip - 1min️ (15:17 - 16:14)
Need for Better Benchmarks
Demis Hassabis And so they're getting rapidly saturated. So we're in need of new, harder benchmarks, but also broader ones, in my opinion: understanding world physics and intuitive physics and other things that we take for granted as humans and find easy. Things like physical intelligence as well, actually. We don't have really good benchmarks for these things. And also some safety benchmarks, too, testing for traits that you don't want, like deception, things like this. So I think there's actually really amazing work to be done in creating benchmarks that are really meaningful, that test slightly more complicated or subtle things than the sort of brute-force school exam type things that we have today. And that's why I'm so excited about Game Arena, because it is going a little bit back to our roots, of course, which is why we came up with it. But a...
🎧 Play snip - 1min️ (17:19 - 18:21)
Game Arena's Scaling
Demis Hassabis But as they get better, the test will get harder automatically. So it's not like AIME or GPQA, where you have to come up with harder science questions, and then who's going to create those questions? Are they leaked on the internet already? Each game is unique because it's created by the two players, so there's a kind of uniqueness about that, which is also nice for testing. And then the final thing is, just like we did with our own early games work, as the systems get better and better you can introduce more and more complex games into the Game Arena. So we started with chess, for obvious reasons: it's the classic one we test AI on, and it's close to my heart, of course. But the idea is we're going to expand it to potentially thousands of games, and then you'll get an overall score. So we're not really looking for systems that just play one game really well. They should be able to play across all games to a good standard, and it could be computer games as well as board games. And even more interestingly...
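As a toy illustration of that "overall score" idea, one could simply average per-game win rates, so a system only scores well if it plays everything to a decent standard. The aggregation below is my own sketch, not how Kaggle's Game Arena actually ranks models.

```python
from statistics import mean

def overall_score(per_game_results: dict[str, tuple[int, int]]) -> float:
    """Aggregate a model's results across many games into one number.

    per_game_results maps game name -> (wins, games_played). Averaging the
    per-game win rates rewards playing every game to a good standard rather
    than excelling at a single one.
    """
    win_rates = [wins / played for wins, played in per_game_results.values() if played]
    return mean(win_rates) if win_rates else 0.0

if __name__ == "__main__":
    results = {"chess": (42, 60), "go": (15, 40), "poker": (200, 500)}
    print(f"overall score: {overall_score(results):.3f}")
```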
🎧 Play snip - 2min️ (20:00 - 21:32)
Multi-objective AI
Demis Hassabis I mean, that's always been the hard challenge with reinforcement learning: in domains that are more messy or real-world-like, how do you specify the reward function, or the objective function, that you're trying to optimize? As humans, we don't have single objective functions, right? It's very messy. In fact, if I was to ask you what you're optimizing for on any given day, you might give a different answer, right? I think we're multi-objective, and we're continually weighting those different objectives against each other, depending on things like your emotional state, your physical environment, where you are in your career, all of these things. But somehow we muddle through with our brains and figure out roughly what the right north star is. And I think our general systems are going to have to do that too, where they learn to interpret what the human user is trying to achieve and then figure out how that translates into a set of useful reward functions to optimize against. And so there are lots of experiments going on here around things like metacognition or meta-RL, where you actually have another system on top that tries to work out what the reward functions are for the secondary system to optimize against.
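A toy way to make that two-level picture concrete: several weighted objective terms combined into one reward, with a "meta" layer guessing the weights from the user's request. The objective names and the keyword matching below are hypothetical, chosen only for illustration.

```python
def combined_reward(state: dict, weights: dict) -> float:
    """Weighted sum of several objective terms (names are made up for the sketch)."""
    objectives = {
        "task_progress": state.get("progress", 0.0),
        "speed": -state.get("time_spent", 0.0),
        "safety": -state.get("risk", 0.0),
    }
    return sum(weights[name] * value for name, value in objectives.items())

def infer_weights(user_request: str) -> dict:
    """Toy 'meta' layer: guess how to weight the objectives from the request.

    In the conversation this role belongs to a second system on top
    (metacognition / meta-RL); here it is just keyword matching.
    """
    weights = {"task_progress": 1.0, "speed": 0.3, "safety": 1.0}
    if "quick" in user_request or "fast" in user_request:
        weights["speed"] = 1.0
    if "careful" in user_request:
        weights["safety"] = 2.0
    return weights

if __name__ == "__main__":
    w = infer_weights("please do this quickly but be careful")
    print(w)
    print(combined_reward({"progress": 0.8, "time_spent": 0.2, "risk": 0.1}, w))
```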
🎧 Play snip - 1min️ (22:17 - 22:58)
Tool Use as Scaling Dimension
Logan Kilpatrick It also feels now like tools are this new scaling dimension, where as you give the models more powerful and different tools, they're able to do a bunch more stuff. And I'm curious how that new scaling dimension ties to this worldview of what we've been doing with games and some of these simulated RL environments. Is there a world where you give the model a physics simulator, and that's a tool that it has access to?
Demis Hassabis Yeah, I think tool use is going to be, you know, one of the most important capabilities for these AI systems. A lot of the reason the thinking part of these systems is so important is that you can use tools during the thinking, right?
🎧 Play snip - 2min️ (22:36 - 24:08)
AI Tool Use
Logan Kilpatrick Is there a world where you give the model a physics simulator, and that's a tool that it has access to?
Demis Hassabis Yeah, I think tool use is going to be, you know, one of the most important capabilities for these AI systems. A lot of the reason the thinking part of these systems is so important is that you can use tools during the thinking, right? You can call search, you can use some maths program, you can do some coding, come back, and then update your planning on what you're going to do. I think that's still actually fairly nascent at the moment, but it's going to be incredibly powerful once it becomes really reliable and the systems become good enough that they can use pretty sophisticated tools very reliably. And then the interesting thing is: what do you leave as a tool versus put into the main system, the main brain, so to speak? Now, with humans it's easy, because we're physically constrained, so anything that's not in our body is a tool; there's no question about what's a tool and what's our brain. But with a digital system, those things can get blurred. So should the capability to play chess, for example, be in the main model, or do you just use Stockfish or AlphaZero as a tool? And that tool could also be an AI system; it doesn't have to be a piece of software. It could actually be something like AlphaFold or whatever. And the question comes, I think...
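The "call search, run some code, come back, and update your plan" pattern is essentially a think-act-observe loop with a tool dispatcher. The sketch below uses a stubbed model and made-up tools; it is a generic illustration of the loop, not the Gemini tool-calling API.

```python
import math

# Two toy tools the 'model' is allowed to call while it thinks.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, {"sqrt": math.sqrt})),
    "search": lambda query: f"(pretend search results for '{query}')",
}

def fake_model(context):
    """Stub for the model's next thinking step.

    Returns either ('call', tool_name, argument) or ('answer', text).
    A real system would produce this from the model's reasoning trace.
    """
    if "calculator:" not in " ".join(context):
        return ("call", "calculator", "sqrt(2) * 100")
    return ("answer", "done, using " + context[-1])

def think_with_tools(question: str, max_steps: int = 5) -> str:
    context = [question]
    for _ in range(max_steps):
        kind, *rest = fake_model(context)
        if kind == "answer":
            return rest[0]
        tool_name, argument = rest
        observation = TOOLS[tool_name](argument)       # run the tool
        context.append(f"{tool_name}: {observation}")  # fold the result back into the plan
    return "gave up"

if __name__ == "__main__":
    print(think_with_tools("What is sqrt(2) times 100?"))
```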
🎧 Play snip - 2min️ (24:36 - 26:07)
AI Capability Evaluation
Demis Hassabis So it's very much an empirical question. Does adding that capability in help the other capabilities? If it does, do it. If it harms the other general capabilities, then maybe consider using it as a tool.
Logan Kilpatrick Interesting. One of the questions that developers, people who are building with our models, are always asking is around this: it's become very clear, and you actually just said this, that as the model is reasoning, it's making use of tools and doing all this stuff. Historically, the models were just weights: you give a token in, you get a token out. Now it feels like the models are becoming these entire systems themselves, and how people actually build applications on top of the models is changing, because the model is just doing more for you out of the box. I'm curious whether that transition, from the model just being weights somewhere to actually becoming a system, resonates with your worldview of how progress is happening, and whether we'll see that continue. And then I don't know if you have suggestions for people who are building stuff, as they think about this point of: what do I build as a tool versus what is the model just going to have empirically as part of it?
Demis Hassabis Yeah, I mean, you're right that the models are improving fast, and as they gain the tool capability, along with planning and thinking, it's kind of exponentially increasing what the system might be able to do, because obviously it can combine tools in novel ways and combinations.
🎧 Play snip - 1min️ (26:24 - 27:09)
Product Design Considerations
Demis Hassabis So there's still a lot of productization, I think, that has to be done on top. Now, the hard part, and we've talked about this before, is that in this new world it requires very interesting skills from a product manager or product designer type of role, because you've got to design, say, a product that's coming out in a year, so you've got to be really close to the technology and understand it well, to kind of intercept where that technology will be in a year's time and design for that, right? And I think whatever product polish you put on top of your product, it has to allow for the engine under the hood to be unplugged and plugged back in with a more advanced system, you know, one that's coming out every three to six months, basically...
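One way to act on that "unplugged and plugged back in" point is to keep the product code behind a thin model interface, so a newer engine can be dropped in without touching the rest. A minimal sketch with made-up class names follows.

```python
from typing import Protocol

class TextModel(Protocol):
    """The only surface the product depends on; the engine behind it can change."""
    def generate(self, prompt: str) -> str: ...

class ModelV1:
    def generate(self, prompt: str) -> str:
        return f"[v1] answer to: {prompt}"

class ModelV2:
    def generate(self, prompt: str) -> str:
        return f"[v2] better answer to: {prompt}"

class AssistantProduct:
    """Product polish lives here; it never hard-codes a specific model."""
    def __init__(self, model: TextModel):
        self.model = model

    def answer(self, question: str) -> str:
        return self.model.generate(question).strip()

if __name__ == "__main__":
    app = AssistantProduct(ModelV1())
    print(app.answer("What's new?"))
    app.model = ModelV2()  # engine swapped; the product code above is unchanged
    print(app.answer("What's new?"))
```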
🎧 Play snip - 17sec️ (29:32 - 29:50)
Greatest Game Ever
Logan Kilpatrick I feel like Genie's a good excuse for us to have a chance to make games and play them, and then DeepMind makes a video game. Yeah.
Demis Hassabis Well, you know, that's always been my secret plan: maybe post-AGI, once that's safely done over the line, go back with these tools and make the greatest game ever. That would be a real dream come true.