DeepMind's Genie 3: New World Model Aims for AGI
Google DeepMind has unveiled Genie 3, its latest foundation world model, which the AI lab frames as a crucial step toward artificial general intelligence (AGI), or human-like intelligence. The model is designed to generate simulated environments in which general-purpose AI agents can be trained.
“Genie 3 is the first real-time interactive general purpose world model,” stated Shlomi Fruchter, a research director at DeepMind, during a recent press briefing. He emphasized its departure from previous narrow world models, noting its ability to generate diverse environments, from photo-realistic to purely imaginary.
Currently in research preview and not publicly available, Genie 3 builds on advances from its predecessor, Genie 2, which could generate new environments for agents, and from DeepMind’s latest video generation model, Veo 3, known for its deep understanding of physics.
A significant leap in capability, Genie 3 can generate multiple minutes of interactive 3D environments at 720p resolution and 24 frames per second from a simple text prompt. This is a substantial improvement over Genie 2’s output of 10 to 20 seconds. The model also introduces “promptable world events,” allowing users to modify the generated world through text commands.
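To make “promptable world events” concrete, here is a minimal Python sketch of the interaction pattern described above. Genie 3 has no public API, so the `WorldSession` class and its methods are hypothetical stand-ins, not DeepMind’s actual interface.

```python
# Hypothetical sketch only: Genie 3 has no public API, so this class and
# its methods are invented to illustrate the interaction pattern.

class WorldSession:
    """Stand-in for an interactive world-model session started from text."""

    def __init__(self, prompt: str):
        self.prompt = prompt
        self.events: list[str] = []

    def step(self, action: str) -> str:
        # A real model would render the next video frame; we just describe it.
        return f"frame after {action!r} in {self.prompt!r} (events: {self.events})"

    def inject_event(self, event: str) -> None:
        # A "promptable world event": a text command that alters the
        # generated world mid-rollout, independent of the agent's actions.
        self.events.append(event)

session = WorldSession("a photo-realistic alpine village at dusk")
print(session.step("walk forward"))
session.inject_event("a herd of deer crosses the road")
print(session.step("turn left"))
```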
Crucially, Genie 3’s simulations maintain physical consistency over time. DeepMind highlights that this capability, where the model “remembers” what it has previously generated, was not explicitly programmed but emerged from its design.
Fruchter suggested that while Genie 3 holds promise for applications in education, gaming, or creative prototyping, its primary impact will be in training agents for general-purpose tasks, a step he deems essential for AGI. Jack Parker-Holder, a research scientist on DeepMind’s open-endedness team, echoed this sentiment: “We think world models are key on the path to AGI, specifically for embodied agents, where simulating real-world scenarios is particularly challenging.”
Genie 3 addresses this challenge without relying on a hard-coded physics engine. Instead, DeepMind explains, the model teaches itself how objects move, fall, and interact by remembering its generated sequences and reasoning over extended time horizons. Fruchter elaborated, “The model is auto-regressive, meaning it generates one frame at a time. It has to look back at what was generated before to decide what’s going to happen next. That’s a key part of the architecture.” This built-in memory allows Genie 3 to develop an intuitive grasp of physics, akin to a human’s understanding of real-world dynamics.
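Fruchter’s description maps onto a standard auto-regressive generation loop. The Python sketch below illustrates that loop under stated assumptions: the `next_frame` function is a hypothetical stand-in for the learned model, since DeepMind has not released Genie 3’s architecture or code.

```python
import numpy as np

def next_frame(history: list[np.ndarray], action: str) -> np.ndarray:
    """Hypothetical stand-in for the learned model: it would predict the
    next frame from all previously generated frames plus the latest action."""
    # A real model would run a neural network over `history`; we return noise.
    return np.random.rand(720, 1280, 3)

def rollout(seed_frame: np.ndarray, actions: list[str]) -> list[np.ndarray]:
    """Auto-regressive loop: each new frame is conditioned on everything
    generated so far, which is where the model's "memory" comes from."""
    frames = [seed_frame]
    for action in actions:
        # Look back at what was generated before to decide what happens next.
        frames.append(next_frame(frames, action))
    return frames

frames = rollout(np.zeros((720, 1280, 3)), ["forward", "left", "forward"])
print(len(frames))  # 4: the seed frame plus one 720p frame per action
```

Conditioning each frame on the full history is what makes the emergent consistency described above possible: objects the model has already generated persist because later frames are predicted from them.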
DeepMind also believes Genie 3 can push AI agents to learn from their own experiences, much as humans do. As a demonstration, DeepMind tested Genie 3 with a recent version of its Scalable Instructable Multiworld Agent (SIMA). In a simulated warehouse, SIMA was given goals like “approach the bright green trash compactor” or “walk to the packed red forklift.” According to Parker-Holder, the SIMA agent achieved these goals by observing the simulated world and sending actions to Genie 3, which then simulated the consequences of those actions, maintaining consistency throughout.
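The SIMA demonstration follows the classic agent-environment loop, with the world model playing the role of the environment. The Python sketch below illustrates that loop; the `Agent` and `WorldModel` classes are invented stand-ins, not DeepMind’s actual SIMA or Genie 3 interfaces.

```python
class Agent:
    """Hypothetical SIMA-like agent: maps an observation and a text goal
    to a navigation action. Not DeepMind's actual interface."""

    def act(self, observation: str, goal: str) -> str:
        # A real agent would run a policy over pixels; we pick a fixed action.
        return "move_toward_target"

class WorldModel:
    """Hypothetical Genie-3-like environment: renders the next observation
    for each action while keeping the scene consistent."""

    def step(self, action: str) -> str:
        return f"observation after {action}"

def run_episode(agent: Agent, world: WorldModel, goal: str, steps: int) -> str:
    observation = world.step("noop")  # initial view of the warehouse
    for _ in range(steps):
        action = agent.act(observation, goal)  # agent chooses an action
        observation = world.step(action)       # world model simulates the result
    return observation

print(run_episode(Agent(), WorldModel(),
                  "approach the bright green trash compactor", steps=10))
```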
Despite these advances, Genie 3 has limitations. Although the researchers tout its grasp of physics, a demonstration featuring a skier, for instance, did not accurately depict the movement of snow. The range of actions an agent can take remains limited, and while promptable world events allow for environmental interventions, those interventions are not necessarily performed by the agent itself. Modeling complex interactions among multiple independent agents in a shared environment also remains challenging. Finally, Genie 3 currently supports only a few minutes of continuous interaction, whereas comprehensive agent training would require hours.
Nevertheless, Genie 3 represents a compelling step forward. It aims to enable agents to move beyond simple reactions, fostering capabilities like planning, exploration, seeking out uncertainty, and improving through trial and error. This kind of self-driven, embodied learning is widely considered crucial for progress toward general intelligence. “We haven’t really had a Move 37 moment for embodied agents yet, where they can actually take novel actions in the real world,” Parker-Holder concluded, referencing the pivotal moment in the 2016 Go match when DeepMind’s AlphaGo made an unconventional, brilliant move that symbolized AI’s capacity for novel strategy. “But now, we can potentially usher in a new era,” he added.