Research

Genie 2: A large-scale foundation world model

Published 4 December 2024

Authors: Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna Mitenkova, Jane Wang, Jeff Clune, Demis Hassabis, Raia Hadsell, Adrian Bolton, Satinder Singh, Tim Rocktaschel

Generating unlimited diverse training environments for future general agents

Today we introduce Genie 2, a foundation world model capable of generating an endless variety of action-controllable, playable 3D environments for training and evaluating embodied agents. Based on a single prompt image, it can be played by a human or AI agent using keyboard and mouse inputs.

Games play a key role in the world of artificial intelligence (AI) research. Their engaging nature, unique blend of challenges, and measurable progress make them ideal environments to safely test and advance AI capabilities. Indeed, games have been important to Google DeepMind since our founding.
From our early work with Atari games, through breakthroughs such as AlphaGo and AlphaStar, to our research on generalist agents in collaboration with game developers, games have been center stage in our research. However, training more general embodied agents has traditionally been bottlenecked by the availability of sufficiently rich and diverse training environments. As we show, Genie 2 could enable future agents to be trained and evaluated in a limitless curriculum of novel worlds. Our research also paves the way for new, creative workflows for prototyping interactive experiences.

* Capabilities
* Rapid prototyping
* Deploying agents in world models
* Model architecture
* Responsible development

Emergent capabilities of a foundation world model

Until now, world models have largely been confined to modeling narrow domains. In Genie 1, we introduced an approach for generating a diverse array of 2D worlds. Today we introduce Genie 2, which represents a significant leap forward in generality. Genie 2 can generate a vast diversity of rich 3D worlds.

Genie 2 is a world model, meaning it can simulate virtual worlds, including the consequences of taking any action (e.g. jump, swim, etc.). It was trained on a large-scale video dataset and, like other generative models, demonstrates various emergent capabilities at scale, such as object interactions, complex character animation, physics, and the ability to model and thus predict the behavior of other agents.

Below are example videos of people interacting with Genie 2. For every example, the model is prompted with a single image generated by Imagen 3, Google DeepMind's state-of-the-art text-to-image model. This means anyone can describe a world they want in text, select their favorite rendering of that idea, and then step into and interact with that newly created world (or have an AI agent be trained or evaluated in it). At each step, a person or agent provides a keyboard and mouse action, and Genie 2 simulates the next observation. Genie 2 can generate consistent worlds for up to a minute, with the majority of examples shown lasting 10-20 seconds.
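To make this prompt-and-play protocol concrete, here is a minimal sketch of the loop in Python. All names (`WorldModel`, `Action`, `play`) are hypothetical stand-ins for illustration; Genie 2's actual interface is not public.

```python
# Minimal sketch of the prompt-and-play loop described above. All names here
# (WorldModel, Action, play) are hypothetical stand-ins, not Genie 2's real API.
from dataclasses import dataclass

@dataclass
class Action:
    """One frame's worth of keyboard and mouse input."""
    keys: frozenset = frozenset()   # currently pressed keys, e.g. frozenset({"W"})
    mouse_dx: float = 0.0           # mouse movement since the previous frame
    mouse_dy: float = 0.0

class WorldModel:
    """Stand-in for a learned world model: (frame history, action) -> next frame."""
    def reset(self, prompt_image):
        self.frames = [prompt_image]    # the world is seeded from a single image
    def step(self, action):
        nxt = self._predict(self.frames, action)
        self.frames.append(nxt)
        return nxt
    def _predict(self, frames, action):
        return frames[-1]               # placeholder; a trained model would
                                        # generate a genuinely new frame here

def play(world, policy, prompt_image, horizon=600):
    """Run a human or agent policy inside a world seeded from one prompt image."""
    world.reset(prompt_image)
    obs = prompt_image
    for _ in range(horizon):       # several hundred frames, i.e. up to a minute
        action = policy(obs)       # person or agent supplies a keyboard/mouse action
        obs = world.step(action)   # the model simulates the next observation
    return world.frames
```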
---------------------------------------------------------------------

Action controls

Genie 2 responds intelligently to actions taken by pressing keys on a keyboard, identifying the character and moving it correctly. For example, our model has to figure out that arrow keys should move the robot and not the trees or clouds. Example worlds, each prompted from a single image:

* A cute humanoid robot in the woods.
* A humanoid robot in Ancient Egypt.
* A first person view of a robot on a purple planet.
* A first person view of a robot in a loft apartment in a big city.

---------------------------------------------------------------------

Generating counterfactuals

We can generate diverse trajectories from the same starting frame, which means it is possible to simulate counterfactual experiences for training agents. In each row of the accompanying videos, every clip starts from the same frame but follows different actions taken by a human player.
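As a companion to the sketch above, here is how the same hypothetical `WorldModel` could be branched from a shared starting frame to produce counterfactual trajectories. Again, this is illustrative, not Genie 2's actual API.

```python
# Sketch: counterfactual rollouts by branching different action sequences from
# the same starting frame. Reuses the hypothetical WorldModel and Action from
# the earlier sketch; illustrative only.

def counterfactual_rollouts(make_world, prompt_image, action_sequences):
    """Roll out several action sequences from an identical starting frame.

    Because the model conditions only on the frame history and the actions
    taken, each branch yields a different trajectory from the same start,
    usable as counterfactual experience for training agents.
    """
    trajectories = []
    for actions in action_sequences:
        world = make_world()          # fresh model state for each branch
        world.reset(prompt_image)     # every branch starts from the same frame
        for action in actions:
            world.step(action)        # branches differ only in the actions taken
        trajectories.append(world.frames)
    return trajectories

# Example: two branches from one prompt image, "walk forward" vs. "turn left".
# branches = counterfactual_rollouts(WorldModel, image,
#                                    [[Action(frozenset({"W"}))] * 50,
#                                     [Action(mouse_dx=-5.0)] * 50])
```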
---------------------------------------------------------------------

Long horizon memory

Genie 2 is capable of remembering parts of the world that are no longer in view and then rendering them accurately when they become observable again.

---------------------------------------------------------------------

Long video generation with new generated content

Genie 2 generates new plausible content on the fly and maintains a consistent world for up to a minute.

---------------------------------------------------------------------

Diverse environments

Genie 2 can create different perspectives, such as first-person views, isometric views, or third-person driving videos.

---------------------------------------------------------------------

3D structures

Genie 2 learned to create complex 3D visual scenes.

---------------------------------------------------------------------

Object affordances and interactions

Genie 2 models various object interactions, such as bursting balloons, opening doors, and shooting barrels of explosives.

---------------------------------------------------------------------

Character animation

Genie 2 learned how to animate various types of characters doing different activities.

---------------------------------------------------------------------

NPCs

Genie 2 models other agents, and even complex interactions with them.

---------------------------------------------------------------------

Physics

Genie 2 models water effects.

Smoke

Genie 2 models smoke effects.

Gravity

Genie 2 models gravity.

Lighting

Genie 2 models point and directional lighting.

Reflections

Genie 2 models reflections, bloom and coloured lighting.

---------------------------------------------------------------------

Playing from real world images

Genie 2 can also be prompted with real-world images, where we see that it can model grass blowing in the wind or water flowing in a river.

---------------------------------------------------------------------

Genie 2 enables rapid prototyping

Genie 2 makes it easy to rapidly prototype diverse interactive experiences, enabling researchers to quickly experiment with novel environments to train and test embodied AI agents. For example, below we prompt Genie 2 with different images generated by Imagen 3 to model the difference between flying a paper plane, a dragon, a hawk, or a parachute, and to test how well Genie can animate different avatars.

Thanks to Genie 2's out-of-distribution generalization capabilities, concept art and drawings can be turned into fully interactive environments. This enables artists and designers to prototype quickly, which can bootstrap the creative process for environment design, further accelerating research. Here we show examples of research environment concepts made by our concept artist, Max Cant, rendered interactive by Genie 2.

---------------------------------------------------------------------

AI agents acting inside the world model

By using Genie 2 to quickly create rich and diverse environments for AI agents, our researchers can also generate evaluation tasks that agents have not seen during training. Below, we show examples of a SIMA agent, developed in collaboration with game developers, following instructions in unseen environments synthesized by Genie 2 from a single image prompt.

Image generated by Imagen 3. Prompt: "A screenshot of a third-person open world exploration game. The player is an adventurer exploring a forest. There is a house with a red door on the left, and a house with a blue door on the right. The camera is placed directly behind the player. #photorealistic # immersive"

The SIMA agent is designed to complete tasks in a range of 3D game worlds by following natural-language instructions. Here we used Genie 2 to generate a 3D environment with two doors, a blue and a red one, and instructed the SIMA agent to open each of them ("Open the blue door", "Open the red door"). In this example, SIMA controls the avatar via keyboard and mouse inputs, while Genie 2 generates the game frames.

We can also use SIMA to help evaluate Genie 2's capabilities. Here we test Genie 2's ability to generate consistent environments by instructing SIMA to look around and explore behind the house ("Turn around", "Go behind the house").

While this research is still at an early stage, with substantial room for improvement on both agent and environment generation capabilities, we believe Genie 2 is the path to solving a structural problem of training embodied agents safely, while achieving the breadth and generality required to progress towards AGI.
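The SIMA examples above follow a simple compose-and-evaluate loop: the world model renders frames, the language-conditioned agent supplies actions. Below is a hedged sketch of that loop; `agent` and `success` are hypothetical placeholders, and SIMA's real interface may differ.

```python
# Sketch of the evaluation loop: a language-conditioned agent acting inside a
# world generated from a single image. `agent` and `success` are hypothetical
# placeholders; SIMA's and Genie 2's real interfaces may differ.

def evaluate_in_generated_world(world, agent, prompt_image, instruction,
                                success, max_steps=300):
    """The world model generates the frames; the agent supplies the actions."""
    world.reset(prompt_image)
    obs = prompt_image
    for t in range(max_steps):
        action = agent(instruction, obs)  # e.g. instruction = "Open the blue door"
        obs = world.step(action)          # the world model renders the consequence
        if success(obs):                  # task-specific success detector
            return True, t
    return False, max_steps

# Example: probing world consistency rather than task success by simply
# rolling out an instruction and inspecting the frames afterwards.
# evaluate_in_generated_world(world, agent, image, "Go behind the house",
#                             success=lambda obs: False)
```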
Image generated by Imagen 3. Prompt: "An image of a computer game showing a scene from inside a rough hewn stone cave or mine. The viewer's position is a 3rd person camera based above a player avatar looking down towards the avatar. The player avatar is a knight with a sword. In front of the knight avatar there are x3 stone arched doorways and the knight chooses to go through any one of these doors. Beyond the first and inside we can see strange green plants with glowing flowers lining that tunnel. Inside and beyond the second doorway there is a corridor of spiked iron plates riveted to the cave walls leading towards an ominous glow further along. Through the third door we can see a set of rough hewn stone steps ascending to a mysterious destination."

In this world, we instructed the agent with "Go up the stairs", "Go where the plants are", and "Go to the middle door".

---------------------------------------------------------------------

Diffusion world model

Genie 2 is an autoregressive latent diffusion model, trained on a large video dataset. After passing through an autoencoder, latent frames from the video are passed to a large transformer dynamics model, trained with a causal mask similar to that used by large language models. At inference time, Genie 2 can be sampled autoregressively, taking in individual actions and past latent frames on a frame-by-frame basis. We use classifier-free guidance to improve action controllability. The samples in this blog post are generated by an undistilled base model, to show what is possible. A distilled version can be played in real time, with a reduction in the quality of the outputs.
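The paragraph above is the extent of the published detail, so the sketch below is necessarily speculative: a generic autoregressive latent-diffusion sampling loop with classifier-free guidance on the action. The denoiser signature, the null-action conditioning, the noise schedule, and the update rule are all assumptions, not Genie 2's actual implementation.

```python
# Speculative sketch of autoregressive latent-diffusion sampling with
# classifier-free guidance on the action. Denoiser signature, null action,
# schedule and update rule are all assumptions; Genie 2's implementation
# details are not public.
import torch

def cfg_denoise(denoiser, z, t, past_latents, action, null_action, w=2.0):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_cond = denoiser(z, t, past_latents, action)         # action-conditioned
    eps_uncond = denoiser(z, t, past_latents, null_action)  # action dropped out
    return eps_uncond + w * (eps_cond - eps_uncond)

def sample_next_latent(denoiser, past_latents, action, null_action, steps=20):
    """Diffuse one new latent frame, conditioned on past latents and an action."""
    z = torch.randn_like(past_latents[-1])                  # start from pure noise
    for i in reversed(range(steps)):
        t = torch.tensor(i / steps)                         # toy linear schedule
        eps = cfg_denoise(denoiser, z, t, past_latents, action, null_action)
        z = z - (1.0 / steps) * eps     # toy update; a real sampler (e.g. DDIM)
                                        # would follow the noise schedule
    return z

def rollout(denoiser, decode, prompt_latent, actions, null_action):
    """Autoregressive sampling: each new latent joins the causal history."""
    latents = [prompt_latent]
    for action in actions:              # one action per generated frame
        latents.append(sample_next_latent(denoiser, latents, action, null_action))
    return [decode(z) for z in latents] # autoencoder decoder maps latents to pixels
```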
---------------------------------------------------------------------

Developing our technologies responsibly

Genie 2 shows the potential of foundation world models for creating diverse 3D environments and accelerating agent research. This research direction is in its early stages, and we look forward to continuing to improve Genie's world generation capabilities in terms of generality and consistency. As with SIMA, our research is building towards more general AI systems and agents that can understand and safely carry out a wide range of tasks in a way that is helpful to people online and in the real world.

---------------------------------------------------------------------

Interesting outtakes

* While no action is taken, a ghost appears in a garden.
* The character prefers parkour over snowboarding.
* With great power comes great responsibility.

Acknowledgements

Genie 2 was led by Jack Parker-Holder, with technical leadership by Stephen Spencer, key contributions from Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi and Jessica Yung, and contributions from Michael Dennis, Sultan Kenjeyev and Shangbang Long.

Yusuf Aytar, Jeff Clune, Sander Dieleman, Doug Eck, Shlomi Fruchter, Raia Hadsell, Demis Hassabis, Georg Ostrovski, Pieter-Jan Kindermans, Nicolas Heess, Charles Blundell, Simon Osindero and Rushil Mistry gave advice. Past contributors include Ashley Edwards and Richie Steigerwald.

The Generalist Agents team was led by Vlad Mnih, with key contributions from Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang and Lei Zhang.

The SIMA team gave particular support through Frederic Besse, Tim Harley, Anna Mitenkova and Jane Wang.

Tim Rocktaschel, Satinder Singh and Adrian Bolton coordinated, managed and advised the overall project.

We'd also like to thank Zoubin Gharamani, Andy Brock, Ed Hirst, David Bridson, Zeb Mehring, Cassidy Hardin, Hyunjik Kim, Noah Fiedel, Jeff Stanway, Petko Yotov, Mihai Tiuca, Soheil Hassas Yeganeh, Nehal Mehta, Richard Tucker, Tim Brooks, Alex Cullum, Max Cant, Nik Hemmings, Richard Evans, Valeria Oliveira, Yanko Gitahy Oliveira, Bethanie Brownfield, Charles Gbadamosi, Giles Ruscoe, Guy Simmons, Jony Hudson, Marjorie Limont, Nathaniel Wong, Sarah Chakera and Nick Young.