Checking in on World Models
Getting World Models right will enable lots of exciting robots, but it's still a hard problem
Regular readers know that I am mildly obsessed with robots. Not the YouTube videos, but the idea of real machines, humanoid or otherwise, that handle multi-step physical tasks in messy, real environments. AI can theoretically enable robots to do tasks that were previously impossible. One of the important pieces needed for this revolution is the World Model. So, what is a World Model and how close are we to having them?
What a world model actually has to do
To manipulate the world, a robot needs a 3D understanding of its surroundings.1 Say “pick up the red pen,” and it needs to recognize which objects are pens, which pen is red, that the pen is two feet ahead, and how to move its hand to pick it up. Identifying the pen and the color red is totally solved. ChatGPT or Claude can easily handle this from images. Understanding the 3D environment is also mostly figured out: Waymo’s cars build a near-perfect picture of everything around them with lidar, and cheaper camera-only methods seem to work for Tesla.
The world model also needs an intuition for physics. When the robot reaches the pen, how much force does it apply to grab it, so it doesn’t crush or drop it? How slippery will the pen be? These questions are important even for simple tasks like moving objects around a warehouse. A human intuitively knows to handle a heavy box differently from a light but fragile box. World models are supposed to give that same intuition to robots.
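To make the grip question concrete, here is a rough sketch using the textbook Coulomb friction model for a two-finger pinch. The function name, the pen's mass, the friction coefficients, and the safety factor are all my own illustrative assumptions, not numbers from any real robot.

```python
G = 9.81  # gravitational acceleration, m/s^2


def min_grip_force(mass_kg: float, friction_coeff: float,
                   safety_factor: float = 1.5) -> float:
    """Normal force (newtons) each finger of a two-finger pinch must apply
    so that friction alone holds the object against gravity.

    Simplified Coulomb model: 2 * mu * N >= m * g.
    Hypothetical helper for illustration only.
    """
    if mass_kg <= 0 or friction_coeff <= 0:
        raise ValueError("mass and friction coefficient must be positive")
    return safety_factor * mass_kg * G / (2 * friction_coeff)


# Assumed values: a 10 g pen that is grippy (mu = 0.8) or slippery (mu = 0.2).
grippy = min_grip_force(0.010, 0.8)
slippery = min_grip_force(0.010, 0.2)
print(f"grippy pen:   {grippy:.3f} N per finger")
print(f"slippery pen: {slippery:.3f} N per finger")
```

In this simplified model the slippery pen needs four times the squeeze of the grippy one. A robot does not get the friction coefficient handed to it, which is roughly the gap a world model is supposed to fill.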
Next, the model has to understand how its actions change the world, not just predict what comes next, so a robot can imagine “what happens if I grab the pen that’s underneath some papers?” That allows it to gently move the papers first and then pick up the pen even if that wasn’t explicitly part of the instructions.
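A toy sketch of that "imagine before acting" loop, under heavy assumptions: the state is two booleans, the `predict` function stands in for a learned world model, and the planner just tries short action sequences in imagination. Real systems are nothing like this simple; the names and rules here are hypothetical.

```python
from itertools import permutations


def predict(state: dict, action: str) -> dict:
    """Hypothetical action-conditioned model: return the imagined next state."""
    nxt = dict(state)
    if action == "move_papers":
        nxt["pen_covered"] = False
    elif action == "grab_pen":
        # Grabbing only succeeds if nothing is on top of the pen.
        if not state["pen_covered"]:
            nxt["holding_pen"] = True
    return nxt


def plan(state: dict, actions: list, max_steps: int = 2):
    """Roll out short action sequences in imagination and return the
    first one that ends with the pen in hand, or None if none does."""
    for n in range(1, max_steps + 1):
        for seq in permutations(actions, n):
            s = state
            for a in seq:
                s = predict(s, a)
            if s["holding_pen"]:
                return list(seq)
    return None


start = {"pen_covered": True, "holding_pen": False}
print(plan(start, ["grab_pen", "move_papers"]))
```

Nobody told the planner to move the papers. It finds that step because the imagined grab fails while the pen is covered, which is the difference between predicting the next frame and predicting the consequences of an action.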
The real bottleneck is the data
I’ve looked at these models for a couple of years, and the part that I’ve never quite understood is how you train the model. Image models were trained on billions of pictures scraped off the internet. How do you get data at that scale for something that also has to capture 3D space, physics, and a robot’s own actions? You can’t just scrape it. There are roughly three approaches, and companies are placing different bets.
Teleoperation. A human puppets the robot through a task with a leader arm, a VR rig, or a motion-capture suit. The data is the highest quality, because the action and result are paired exactly, but throughput is brutal: one operator manages 5 to 50 demonstrations an hour, at a loaded cost of $118 to $200 per hour.2 Some companies like Neo, and some Chinese firms, are taking this painstaking approach. It works better for robots, such as housekeepers, that perform a relatively fixed set of tasks, and it will be harder for general-purpose bots.
Simulation. Another approach is to spin up thousands of virtual robots in a physics engine. A ping-pong robot3 at CES this year trained on 40 hours of pure simulation and hit reaction times near 0.02 seconds. The catch is that the simulated reality may not be quite good enough. NVIDIA is trying to make Cosmos 3 realistic enough to work for this purpose.
Human video. The internet is full of people folding laundry, lifting things, etc. The catch is that the video has no labels. A bot might be able to tell that a video shows a hand gripping a pen but not the grams of force or the joint angles required to do so. Tesla, Figure, and Physical Intelligence are all trying this approach.
I’m not sure which is the best option. Companies will probably rely on some combination: teleoperation for high-precision tasks, and simulation and human video for everything else. Eventually, simulated physics will be good enough for simulation to carry the load on its own, but at that point, the World Model problem is already kind of solved.
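Going back to the teleoperation numbers above, a quick back-of-the-envelope calculation shows why throughput matters so much. The hourly cost and demonstrations-per-hour ranges are the ones quoted earlier; the one-million-demonstration dataset size is purely my own assumption for scale.

```python
def cost_per_demo(hourly_cost_usd: float, demos_per_hour: float) -> float:
    """Loaded operator cost divided by demonstrations recorded per hour."""
    return hourly_cost_usd / demos_per_hour


best = cost_per_demo(118, 50)  # cheapest operator at the fastest pace
worst = cost_per_demo(200, 5)  # priciest operator at the slowest pace
print(f"${best:.2f} to ${worst:.2f} per demonstration")

target = 1_000_000  # assumed dataset size, for illustration only
print(f"${best * target:,.0f} to ${worst * target:,.0f} "
      f"for {target:,} demonstrations")
```

That works out to $2.36 to $40 per demonstration, or roughly $2.4 million to $40 million for a million demonstrations, which is still tiny next to the billions of images that image models trained on.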
Who is building what
Here are the leading models that already exist for simulating worlds:
Genie 3, from Google DeepMind, generates interactive, navigable worlds from a text prompt and holds them consistent for a few minutes. Waymo fine-tuned a version to simulate rare driving scenarios for its robotaxis.
NVIDIA’s Cosmos is an open model aimed at generating training data and environments for robots.
Marble, from World Labs, generates persistent, editable 3D worlds, sold into gaming and visual effects today with robotics as the longer game.
These are all very new, and I suspect there is very limited commercial use today.
On the robot brain side (the planners that turn a command into motion), Physical Intelligence is worth a look. Here it is tidying up homes it’s never seen before.4 As I mentioned before, Figure’s bot is doing a pilot at a BMW plant.
Just to be clear, the videos are cool, but we are still a few leaps away from a robot that can enter a previously unseen environment and perform any task. We are getting closer, though, and the improving models are the key to making this possible.
For those of you with companies in warehousing, manufacturing, or logistics: are robotics vendors delivering yet? Drop me a note or leave a comment.
Note: The opinions expressed in this article are my own and do not represent the views of Berkshire Partners.
1. Note that the first applications will not be physical robots but games and VR, where the cost of being slightly wrong is low. When Google opened up Genie 3, shares of game makers reportedly fell hard the same day (Unity around 21%, Roblox around 15%, Take-Two around 9%). A physics glitch in a video game is at worst a funny clip you post online. The same 2% error when a robot is holding a kitchen knife near your hand is at best a lawsuit.
2. Which is a lot to spend teaching a robot to fold towels
3. I couldn’t do a post about robots without at least one video
4. Its bed-making skills significantly exceed those of a 12-year-old boy!



