Published: June 20, 2024

New paper: How can you tell if a transformer has the right world model? We trained a transformer to predict directions for NYC taxi rides. The model was good: it could find shortest paths between new points. But had it built a map of NYC? We reconstructed its map and found this:

Image in tweet by Keyon Vafa

The map let us visually inspect the incoherent world model. But how should we evaluate world models in non-map settings? Our paper proposes new evaluation metrics for world model recovery.

Image in tweet by Keyon Vafa

The metrics can be applied when the true world model can be represented as a DFA. This includes settings like logical reasoning and game playing. The formalism of DFAs provides insights into world model recovery.
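The DFA framing is easy to make concrete. Here's a toy sketch (not from the paper, purely illustrative): a 2x2 grid world as a DFA whose states are intersections and whose tokens are compass moves.

```python
# Toy illustration of a world model as a DFA (hypothetical, not the paper's code).
# States are intersections on a 2x2 grid; tokens are compass moves.

class DFA:
    def __init__(self, transitions, start):
        self.transitions = transitions  # {(state, token): next_state}
        self.start = start

    def legal_tokens(self, state):
        """Tokens with a defined transition out of `state`."""
        return {t for (s, t) in self.transitions if s == state}

    def run(self, tokens):
        """Follow a token sequence; return the final state, or None if a move is illegal."""
        state = self.start
        for t in tokens:
            state = self.transitions.get((state, t))
            if state is None:
                return None
        return state

# Build the grid: states are (row, col); a move is legal when it stays in bounds.
moves = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
transitions = {}
for r in range(2):
    for c in range(2):
        for tok, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            if 0 <= nr < 2 and 0 <= nc < 2:
                transitions[((r, c), tok)] = (nr, nc)

grid = DFA(transitions, start=(0, 0))
```

In this framing, "having the right world model" means the model's behavior on token sequences matches the DFA's states and transitions.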

Image in tweet by Keyon Vafa

Finding 1: If every next token predicted by a model is legal in the underlying DFA, the model has the right world model.
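A rough sketch of what a next-token legality check could look like, assuming a model exposed as a `model_next(prefix)` function (a hypothetical interface, not the paper's code):

```python
# Hypothetical sketch: what fraction of the model's next-token predictions
# are legal moves in the true DFA? All names here are illustrative stand-ins.

def next_token_legal_rate(model_next, dfa_step, legal, start, sequences):
    """Fraction of prefixes where the model's predicted next token is legal."""
    hits = total = 0
    for seq in sequences:
        state = start
        for i, tok in enumerate(seq):
            pred = model_next(seq[:i])       # model's predicted next token
            hits += pred in legal(state)     # is it a legal move in the true DFA?
            total += 1
            state = dfa_step(state, tok)     # advance along the true sequence
    return hits / total if total else 0.0

# Toy world: states 0-1-2 on a line, with tokens "R" (right) and "L" (left).
def step(s, t): return s + 1 if t == "R" else s - 1
def legal(s): return {"R"} if s == 0 else ({"L"} if s == 2 else {"R", "L"})

# A "model" that always predicts "R": legal everywhere except at state 2.
always_right = lambda prefix: "R"
rate = next_token_legal_rate(always_right, step, legal, 0, [["R", "R", "L"]])
```

The same harness with a model that always predicts legally would score 1.0, which is what Finding 1 is about.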

Image in tweet by Keyon Vafa

Finding 2: Even though perfect next-token prediction implies world model recovery, it's an incomplete evaluation metric. Models can have near-perfect next-token prediction with completely wrong world models. So we need better metrics than next-token prediction.

The (classic) Myhill-Nerode theorem shows what we should be testing for: compression and distinction. Compression: if two sequences lead to the same state, a model shouldn't distinguish them. Distinction: if two sequences lead to distinct states, a model should distinguish them.
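Using a model's set of plausible next tokens as a stand-in for its full next-token distribution (a boolean simplification of the probabilistic tests; `true_legal` and `myopic_legal` are invented toy models), the two checks might be sketched like this:

```python
# Hedged sketch of Myhill-Nerode-style compression and distinction tests.
# `model_legal(prefix)` returns the set of next tokens the model accepts;
# in practice this would come from thresholding next-token probabilities.

def compression_ok(model_legal, seq_a, seq_b):
    """Same true state => the model should treat the two prefixes identically."""
    return model_legal(seq_a) == model_legal(seq_b)

def distinction_ok(model_legal, seq_a, seq_b):
    """Distinct true states => the model should distinguish the two prefixes."""
    return model_legal(seq_a) != model_legal(seq_b)

# Toy world: states 0..2 on a line, "R"/"L" moves.
def true_state(seq):
    s = 0
    for t in seq:
        s += 1 if t == "R" else -1
    return s

def true_legal(seq):   # a model with the right world model
    s = true_state(seq)
    return {"R"} if s == 0 else ({"L"} if s == 2 else {"R", "L"})

def myopic_legal(seq): # a flawed model that never distinguishes prefixes
    return {"R", "L"}
```

`["R"]` and `["R", "R", "L"]` both end in state 1, so a good model compresses them; `["R"]` and `["R", "R"]` end in different states, so a good model must distinguish them, which the flawed model fails to do.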

Image in tweet by Keyon Vafa

We demonstrate these tests by training a transformer to predict turn-by-turn directions of taxi rides in NYC. The model has impressive navigation capabilities. Given two points in NYC from a held-out set, it generates the true shortest path between them 96% of the time.

Image in tweet by Keyon Vafa

By some measures it looks like the model has recovered the true world model of NYC. Its predicted next turn is a legal turn >99.9% of the time. The transformer's representation even appears to encode the current location of taxi rides.

But the new evaluation metrics reach a different conclusion: the model is far from recovering the underlying DFA. This isn't just an artifact of the model being trained on deterministic routes. Models trained on noisy traversals or random walks have the same behavior.

Image in tweet by Keyon Vafa

We visualize the map implied by the model using a graph reconstruction algorithm. The map is incoherent: some streets have physically impossible orientations. Others require flyovers.
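One way such a reconstruction could work in miniature (a hypothetical sketch, not the paper's algorithm): expand the model's predicted moves breadth-first into an edge set and then look for physically impossible edges. `toy_moves` is an invented stand-in for sampling the transformer.

```python
# Hedged sketch: reconstruct the "implied map" by treating each location the
# model believes it is at as a node and each move it predicts as an edge.
from collections import defaultdict

def reconstruct_map(model_moves, apply_move, start_nodes, depth):
    """Breadth-first expansion of the model's predicted moves into a graph."""
    edges = defaultdict(set)
    frontier = set(start_nodes)
    for _ in range(depth):
        nxt = set()
        for node in frontier:
            for mv in model_moves(node):
                dest = apply_move(node, mv)
                if dest not in edges[node]:
                    edges[node].add(dest)
                    nxt.add(dest)
        frontier = nxt
    return edges

def toy_moves(node):
    # From the origin, this toy "model" predicts one real move and one
    # impossible diagonal hop -- the kind of incoherent edge the
    # reconstruction would expose on a real street grid.
    return ["E", "NE"] if node == (0, 0) else []

deltas = {"E": (0, 1), "NE": (-1, 1)}
def apply_move(node, mv):
    dr, dc = deltas[mv]
    return (node[0] + dr, node[1] + dc)

implied = reconstruct_map(toy_moves, apply_move, [(0, 0)], depth=2)
```

On a real grid, edges like the diagonal one here correspond to the impossible orientations and flyovers in the visualization.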

Image in tweet by Keyon Vafa

Why does it matter that the transformer has an incoherent world model? It still manages to find shortest paths. Incoherence implies fragility: we show the transformer's traversal capabilities break down when we add detours to the underlying map.
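The detour experiment can be sketched as a simple validity check: delete an edge from the true map and test whether a proposed route still exists hop by hop. Everything below is a toy stand-in for the NYC setup.

```python
# Hedged sketch of the detour stress test. `edges` is a toy directed road map;
# a real version would check the model's generated route against the
# perturbed map the same way.

def route_is_valid(edges, route, src, dst):
    """A route is valid if every hop is an existing edge and it ends at dst."""
    here = src
    for nxt in route:
        if (here, nxt) not in edges:
            return False
        here = nxt
    return here == dst

# Toy map: 0 -> 1 -> 2, plus a direct shortcut 0 -> 2.
edges = {(0, 1), (1, 2), (0, 2)}
detoured = edges - {(0, 2)}  # close the shortcut, forcing a detour via 1
```

A navigator with a coherent map reroutes through node 1 when the shortcut closes; a fragile one keeps emitting the now-invalid direct route.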

Image in tweet by Keyon Vafa

We apply these evaluation metrics to two other settings: game-playing and logic puzzles. In both settings the metrics reveal inconsistencies in underlying world models.

Image in tweet by Keyon Vafa

In many settings the true world model will be unknown, not a DFA. So why should we care about these evaluation metrics? One reason is that we need testbeds to understand and improve ML models. What can we say about an LLM’s world model if it can’t recover a map?

Transformers can do amazing things with incoherent world models. We saw this for taxi rides. But it makes them fragile for other tasks. And it’s a problem whenever we hope to use a transformer's world model to learn something about the world (e.g. protein generation, genetics).

Paper: https://arxiv.org/abs/2406.036... Code: https://github.com/keyonvafa/w... Co-authors: Justin Chen (@justinychen), Jon Kleinberg, Sendhil Mullainathan (@m_sendhil), Ashesh Rambachan (@asheshrambachan)

@keyonV cool work! I don't think humans build a perfect map in their head either. Do you think that's a necessary requisite for providing a shortest path, or just our assumption of what's necessary?

@JohnSigmon Good question! One of our results is that you can do shortest paths (out-of-sample, in-distribution) without the right map. But not having the map has bad consequences (e.g. add a few detours and things break down). We think a good world model prevents this kind of fragility

@keyonV Does it help if you keep training until it "groks"? That is, reach 100% training accuracy and then keep training until the test accuracy improves.

@keyonV I've been tinkering with a similar route prediction model for fun and it's been fascinating to see the real world reconstruction emerge from the raw data. I've been thinking about the problem you described in this thread a bit, and have not been smart enough to find the answer

Image in tweet by Keyon Vafa
Image in tweet by Keyon Vafa

@keyonV One of the more interesting threads I’ve read recently

@keyonV Solid work, and a very interesting read! Kudos!

@keyonV map(map) -> map

@keyonV The model clearly just figured out the mole peoples' tunnels that you are unaware of

@keyonV Awesome

@keyonV nice

@keyonV people search for things we know aren’t there, don’t find anything, write a paper about it, brag about it on Twitter. amazing

@keyonV Interesting, thanks! I'm wondering about the definition of incoherence: can the weird connections be explained by the model not having a prior that'd force it into links we'd consider physically possible?

@keyonV Amazing study man, I like how accessible you made it in this thread. It would be interesting to do the same with flight data 😀

@keyonV But but these things can’t have “world models” right, @Grady_Booch ?

@keyonV This is fascinating. I've thought a lot about how important it is for our mental models to match the relevant mechanisms in reality. If you think of a Rubik's cube as 54 colored squares, it's several orders of magnitude harder to solve than a model of 21 cubes.

@keyonV If anyone is interested in Geospatial and NLP/LLMs, check out these slides about my recent work. https://schumann.pub/assets/up...

Image in tweet by Keyon Vafa

@keyonV DFA is a deterministic finite automaton. You're welcome ;)

@keyonV Has anyone @yudapearl ? It is similar to his thought! Like the difference between GPS and a real map!
