Keyon Vafa's Twitter Thread

Can an AI model predict perfectly and still have a terrible world model? What would that even mean? Our new ICML paper formalizes these questions. One result tells the story: A transformer trained on 10M solar systems nails planetary orbits. But it botches gravitational laws 🧵

Our paper aims to answer two questions:
1. What's the difference between prediction and world models?
2. Are there straightforward metrics that can test this distinction?
Our paper is about AI. But it's helpful to go back 400 years to answer these questions.

Perhaps the most influential world model had its start as a predictive model. Before we had Newton's laws of gravity, we had Kepler's predictions of planetary orbits. Kepler's predictions led to Newton's laws. So what did Newton add?

If you only care about orbits, Newton didn't add much. His laws give the same predictions. But Newton's laws went beyond orbits: the same laws explain pendula, cannonballs, and rockets. This motivates our framework: Predictions apply to one task. World models generalize to many.

Newton's laws are a kind of foundation model. They provide a place to start when working on new problems. A good foundation model should do the same. The No Free Lunch Theorem motivates a test: Every foundation model has an inductive bias. This bias reveals its world model.

We propose a method to measure these inductive biases. We call it an inductive bias probe. Two steps:
1. Fit a foundation model to many new, very small synthetic datasets
2. Analyze patterns in the functions it learns to find the model's inductive bias
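
The two steps above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the "foundation model" is replaced by ridge regression on a fixed feature matrix (an assumption standing in for fine-tuning), and the names `fit_predict` and `inductive_bias_probe` are hypothetical. Each synthetic dataset labels a handful of inputs with a random function of the true state. The probe then asks whether the learner's extrapolations move together for inputs that share a state more than for inputs that do not.

```python
import numpy as np

def fit_predict(features, train_idx, y_train, ridge=1e-3):
    """Stand-in for fine-tuning: ridge regression on fixed features (assumption)."""
    X = features[train_idx]
    w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y_train)
    return features @ w  # predictions for every input, seen or unseen

def inductive_bias_probe(features, states, n_datasets=200, n_train=8, seed=0):
    """Return mean prediction correlation for same-state and different-state pairs."""
    rng = np.random.default_rng(seed)
    n_inputs = len(states)
    n_states = int(states.max()) + 1
    preds = np.empty((n_datasets, n_inputs))
    for d in range(n_datasets):
        # Step 1: a tiny synthetic dataset whose labels depend only on the state.
        g = rng.normal(size=n_states)
        train_idx = rng.choice(n_inputs, size=n_train, replace=False)
        preds[d] = fit_predict(features, train_idx, g[states[train_idx]])
    # Step 2: look for patterns in the learned functions across datasets.
    corr = np.corrcoef(preds.T)
    same = states[:, None] == states[None, :]
    off_diag = ~np.eye(n_inputs, dtype=bool)
    return corr[same & off_diag].mean(), corr[~same].mean()
```

A learner whose features encode the state (for example `np.eye(5)[states]`) should show a large gap between the two numbers. A learner whose features ignore the state should show little or none.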

We apply these probes to orbital, lattice, and Othello problems. Starting with orbits: we encode solar systems as sequences and train a transformer on 10M solar systems (20B tokens). The model makes accurate predictions many timesteps ahead, including for our solar system.

But has the model discovered Newton's laws? When we fine-tune it to new tasks, its inductive bias isn't toward Newtonian states. When it extrapolates, it makes similar predictions for orbits with very different states, and different predictions for orbits with similar states.

To demonstrate, we fine-tuned the model to predict force vectors on a small dataset of planets in our solar system. A model that understands Newtonian mechanics should get these. But the transformer struggles.
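
For reference, the target of that fine-tuning task is easy to write down if you already have Newton's law. The sketch below computes the Newtonian force vector on one body due to another. The function name, the SI units, and the 2D positions are assumptions for illustration. The thread does not say how the paper represents forces.

```python
import numpy as np

G = 6.674e-11  # gravitational constant in m^3 kg^-1 s^-2

def gravitational_force(m1, m2, pos1, pos2):
    """Newtonian force on body 1 due to body 2, as a vector in newtons."""
    r_vec = np.asarray(pos2, dtype=float) - np.asarray(pos1, dtype=float)
    r = np.linalg.norm(r_vec)
    # Magnitude G*m1*m2/r^2, directed from body 1 toward body 2.
    return G * m1 * m2 / r**2 * (r_vec / r)
```

Calling it with Earth's mass and position and the Sun's mass at the origin gives a vector pointing from Earth toward the Sun.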

We then fine-tuned the model on a larger scale, to predict forces across 10K solar systems. We used symbolic regression to compare the recovered force law to Newton's law. It not only recovered a nonsensical law, it also recovered different laws for different galaxies.

Would more general models like LLMs do better? We tried providing o3, Claude Sonnet 4, and Gemini 2.5 Pro with a small number of force magnitudes in-context w/o saying what they are. These LLMs are explicitly trained on Newton's laws. But they can't get the rest of the forces.

We also apply these probes to lattice problems (think gridworld). Inductive biases are great when the number of states is small. But they deteriorate quickly. Recurrent and state-space models like Mamba consistently have better inductive biases than transformers.

If a foundation model's inductive bias isn't toward a given world model, what is it toward? One hypothesis: models confuse sequences that belong to different states but have the same legal *next* tokens. Example: Two different Othello boards can have the same legal next moves.
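
The same thing is easy to see in a toy lattice. The sketch below assumes a one-dimensional line of positions with moves `L`, `R`, and `stay`. This setup is an illustrative assumption, not necessarily the paper's exact lattice. Grouping positions by their legal next moves shows that every interior position looks identical from a next-token point of view, even though each is a distinct state.

```python
from collections import defaultdict

def legal_moves(position, n_states):
    """Legal next tokens at a position on a line of n_states cells."""
    moves = {"stay"}
    if position > 0:
        moves.add("L")
    if position < n_states - 1:
        moves.add("R")
    return frozenset(moves)

def group_by_legal_moves(n_states):
    """Map each set of legal next tokens to the positions that share it."""
    groups = defaultdict(list)
    for position in range(n_states):
        groups[legal_moves(position, n_states)].append(position)
    return dict(groups)
```

With five positions, the three interior positions all fall in one group. A model that tracks only "which tokens are legal next" has no reason to tell them apart.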

We fine-tune an Othello next-token prediction model to reconstruct boards. Even when the model reconstructs boards incorrectly, the reconstructed boards often get the legal next moves right. Models seem to construct "enough of" the board to calculate single next moves.

Inductive bias probes can test this hypothesis more generally. Models are much likelier to conflate two separate states when they share the same legal next-tokens.

Summary:
1. We propose inductive bias probes: a model's inductive bias reveals its world model
2. Foundation models can have great predictions with poor world models
3. One reason world models are poor: models group together distinct states that have similar allowed next-tokens

Last year we proposed different tests that studied single tasks. We now think that studying behavior on new tasks better captures what we want from foundation models: tools for new problems. It's what separates Newton's laws from Kepler's predictions. https://x.com/keyonV/status/18...

This is one way to evaluate world models. But there are many other interesting approaches. Plug: If you're interested in more, check out the Workshop on Assessing World Models I'm co-organizing next Friday at ICML. https://worldmodelworkshop.org

Paper: https://arxiv.org/abs/2507.069... Co-authors: Peter Chang (@petergchang), Ashesh Rambachan (@asheshrambachan), Sendhil Mullainathan (@m_sendhil)

@keyonV hey very cool work! Q: are you using the term LLM to mean a transformer or are you literally giving it text inputs?

@cgarciae88 Thanks! Main results are for transformers trained on orbits tokenized as X,Y coordinates. But we also try LLMs and text inputs here: https://x.com/keyonV/status/19...

@keyonV @gbrl_dick I’m not sure if this is the best terminology, but in my head I’ve been using the terms “models” for world models and “metaphors” for predictions. “Models vs metaphors” is kinda catchy

@AaronBergman18 @gbrl_dick “Metaphors” is great!

@keyonV How do you define inductive bias? Aren't inductive biases independent of the data?

@BlackHC Informally we define inductive bias as the types of functions a model tends to fit from finite data. So you’re right—it’s not about a single dataset but rather a model’s behavior across many datasets.

@keyonV Great work! Would you mind sharing more about the datasets used to train world models? Thanks

@DeryaKarl Thanks! Check out the codebase here or feel free to email with more questions https://github.com/keyonvafa/i...

@keyonV Super interesting! I wonder what the best interventions are for getting the models to learn better world models. I’d guess the real goal here is robust generalization? I’d still think more and better data is the best lever for improving robust generalization. I know you saw

@etash_guha It’s a good question. Besides architecture I think next-token pretraining objectives play a big role. You can get very good next-token performance with poor world models.

@keyonV > Can an AI model predict perfectly and still have a terrible world model? Of course, for a given isolated problem. This is called supervised learning and you can get as perfect as you want to be by providing training examples.

@keyonV This is really cool! I tested this with cellular automata (teach prediction vs learning the rules) and saw the same principle - https://www.strangeloopcanon.c...

@keyonV Really excellent stuff

@keyonV Great thread!

@keyonV @VoidAsuka @j0nathanj

@keyonV birds have biological knowledge of physics, but no capacity to generate a good explanation. maybe today’s llms are the equivalent of highly evolved biological intelligence.
