Published: February 16, 2024

OpenAI's goal is to build a world simulator, one that learns the dynamics of the 3D real world from videos & language descriptions. They claim the work indicates that training on ever-larger datasets is a promising direction for learning such world models. But they are wrong:

The goal is that, given enough videos, the model can learn the physics of objects: occlusions, collisions, reflections, shadows etc. It's difficult to recover hard rules from such an approach, i.e. you can't learn much about the behavior of physical objects from videos alone.

Looking at their so-called "technical" report, which lacks technical detail other than that they used a diffusion transformer model trained on patches: the model compresses videos in the spatial & temporal dimensions to generate tokens that feed into a transformer.

The output is decoded back to pixel space by another model. There isn't much technical detail beyond cherry-picked examples, with little discussion of the limitations tucked right at the very bottom. The limitations are so severe that this isn't what most people think it is.
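
To make that description concrete, here is a rough sketch of the pipeline as I read it, in PyTorch. All module names, kernel sizes & dimensions below are my own assumptions for illustration; the report does not specify the actual architecture.

```python
import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    """Compresses a video in space & time into a sequence of latent patch tokens."""
    def __init__(self, latent_dim=256):
        super().__init__()
        # A single 3D convolution that downsamples time (x2) and space (x8).
        self.encode = nn.Conv3d(3, latent_dim, kernel_size=(2, 8, 8), stride=(2, 8, 8))

    def forward(self, video):            # video: (batch, 3, frames, height, width)
        latents = self.encode(video)     # (batch, latent_dim, frames/2, h/8, w/8)
        # Flatten the spacetime grid into a token sequence for the transformer.
        return latents.flatten(2).transpose(1, 2)   # (batch, num_tokens, latent_dim)

# The tokens feed a transformer backbone (the diffusion machinery is omitted here);
# a separate decoder model would map the latents back to pixel space.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)

clip = torch.randn(1, 3, 16, 64, 64)      # dummy 16-frame, 64x64 video
tokens = VideoCompressor()(clip)          # (1, 8*8*8, 256) spacetime patch tokens
features = backbone(tokens)               # transformer operates on the token sequence
```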

It isn't a world simulator, far from it. For starters, OpenAI's own videos claiming the model understands 3D rotations, object permanence, occlusions & reflections all show major flaws.

After having been trained on literally the entire internet, the model's failure to capture basic common-sense physics indicates that merely scraping more data & making the models bigger is not going to turn these data-driven statistical models into world simulators.

Yet OpenAI claims this is a great example of how ever-larger models will begin to understand the physical world. But they already trained on the freaking ENTIRE internet. What more do they want? To me this indicates the model's failure to learn basic hard rules from data alone.

Essentially the model is a gigantic autoencoder built on diffusion transformer models. The model, conditioned on text prompts, starts from noisy patches, then learns to denoise them so as to recreate the training patches as accurately as possible. That is known as autoassociation.

That is, given a training example X, the model outputs X' such that the vector distance between X & X' is as small as possible. The model essentially learns a compressed identity mapping, i.e. an autoencoder or autoassociative network.

The difference is that at runtime, or inference time, the input is just a text prompt without the original reference training images/videos, whose content is now condensed into the model weights as a basis set from which the model can reconstruct plausible outputs.
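
A minimal sketch of what that autoassociative objective looks like in code, again assuming PyTorch; the `model` callable, the noise schedule & the text conditioning are placeholders, not OpenAI's actual setup.

```python
import torch
import torch.nn.functional as F

def training_step(model, x, text_embedding, noise_level=0.5):
    """Given a training example X, learn to output X' as close to X as possible."""
    noise = torch.randn_like(x)
    x_noisy = (1 - noise_level) * x + noise_level * noise   # corrupt the patches
    x_prime = model(x_noisy, text_embedding)                # X' = denoised estimate
    return F.mse_loss(x_prime, x)       # minimize the distance between X and X'

def generate(model, text_embedding, shape, steps=50):
    """At inference there is no reference X: start from pure noise, conditioned
    only on the text prompt, and reconstruct a plausible output from the weights."""
    x = torch.randn(shape)
    for _ in range(steps):              # iterative denoising, heavily simplified
        x = model(x, text_embedding)
    return x
```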

The model fails to render scenes consistently from arbitrary viewpoints, showing that it is far from capturing the underlying behavior of a scene across multiple viewpoints. It generates people walking on steep mountain edges, or on thin air, without falling.

Objects disappear & pass through other objects. If this is a "world simulator" then I honestly don't understand what world we are actually simulating here. Sora is a blender of videos & images, not a simulator, because it doesn't encode the physical rules of the world.

You can see OpenAI titled their blog post "Video generation models as world simulators". This is wrong, because the model lacks the understanding to even work like a game engine; how can it possibly simulate the complex real world? It simply cannot do that.

The fact that they think more data can eventually give them a true world simulator is a real face-palm moment. Son, you just trained that abomination on the entire internet and it didn't learn basic common-sense physics; where the hell are you going to get more data from?

Some argue they can just train the model on synthetic scenes from game engines like Unreal Engine 5. Folks, the thing is you can't recover or distill an entire game engine into model parameters with statistical learning from input-output pairs produced by that engine.

You can't learn everything from data alone; you need a model that encodes the right inductive biases, the assumptions about the real world, to work. A game engine encodes such assumptions as algorithms coded by human programmers.
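
As a contrast, here is the kind of hard rule a game engine bakes in directly as code rather than hoping to induce it from pixels. This toy example (gravity plus a ground constraint, stepped with semi-implicit Euler) is mine, not taken from OpenAI or any particular engine.

```python
GRAVITY = -9.81  # m/s^2: an assumption about the world written down by a programmer

def step(position_y, velocity_y, dt=1 / 60):
    velocity_y += GRAVITY * dt          # objects always accelerate downward
    position_y += velocity_y * dt
    if position_y < 0.0:                # objects never fall through the ground
        position_y, velocity_y = 0.0, 0.0
    return position_y, velocity_y

# Drop a point mass from 10 m and simulate two seconds at 60 fps.
y, vy = 10.0, 0.0
for _ in range(120):
    y, vy = step(y, vy)
print(y)   # 0.0: it landed and stayed put; no walking on thin air possible
```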

There is mathematical proof that learning from input-output pairs alone is computationally intractable: extremely hard to do unless you can collect all possible inputs & outcomes from the phenomenon you wish to encode, at which point you are merely looking up responses.

You can see the limitations at the bottom of OpenAI's blog post. They still think more mountains of GPUs & data will get them a true world simulator. They built a text-conditioned video & image blender, not a world simulator. https://openai.com/research/vi...

They should also refrain from calling this research; it's product development. There is little in that "technical" report that can be called research: it's filled with cherry-picked examples & the limitations are only mentioned sparsely at the very bottom. Shame.

I can say with 99% confidence that more GPUs & more training data are not going to give them a true, accurate world simulator. It's simply not realistic to expect to recover the original data-generating process from input-output training pairs without making assumptions.

prompt: "A monkey playing chess in a park". Generated chess board is incorrect, there are a total of 7 large chess pieces, the monkey isn't even playing his just looking around confused. See: https://x.com/IrisVanRooij/sta...

Does this look like something that understands physics & how objects react to forces? https://x.com/hey_madni/status...

The proportions are wrong. The camera is weirdly attached to the dog & a tiny seabird flies by. This is not a physics (world) simulator. It's just mashing together things from the training data to generate a plausible output. https://x.com/hey_madni/status...

If Neural Radiance Fields (NeRF) had come out today for the first time, people would be saying game engines are dead. NeRF can interpolate between frames, but this requires images from fixed cameras with known 3D pose matrices & focal lengths recovered by other methods.

Sora by OpenAI can't even do what NeRF does, not even close; yet NeRF itself can't do what game engines do, not even close. These text-to-video generators are interesting, but:
- They are not physics simulators
- They won't lead to "AGI"
- They aren't that new

See a NeRF model interpolating between frames; it needs precise 3D pose information from fixed cameras to work. Sora can't do this, & if it were a physics simulator it should be able to; since it can't, it's not a physics simulator. https://www.google.com/imgres?...
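
For reference, a minimal sketch of the explicit camera information a NeRF consumes per training image: a camera-to-world pose matrix & a focal length (typically recovered by structure-from-motion tools such as COLMAP). The conventions & names here are assumed for illustration.

```python
import numpy as np

def get_rays(height, width, focal, cam_to_world):
    """One ray (origin + direction) per pixel, expressed in world coordinates."""
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    # Pinhole camera model: pixel coordinates -> camera-space ray directions.
    dirs = np.stack(
        [(i - width * 0.5) / focal, -(j - height * 0.5) / focal, -np.ones_like(i)],
        axis=-1,
    )
    rays_d = dirs @ cam_to_world[:3, :3].T              # rotate into world space
    rays_o = np.broadcast_to(cam_to_world[:3, 3], rays_d.shape)
    return rays_o, rays_d

# A NeRF is fit to many (image, pose, focal) triples of one fixed scene; only
# then can it interpolate novel viewpoints between those known cameras.
pose = np.eye(4)                                        # dummy camera-to-world matrix
origins, directions = get_rays(100, 100, focal=50.0, cam_to_world=pose)
```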

@ChombaBupe Case in point

@ChombaBupe Yeah, these videos are simulations of world models. They look cool on first viewing, but they are literally moving pictures, nothing more. The main problem with AI is that it is literally artificial intelligence: it's a simulation of intelligence.

@ChombaBupe Can't wait to test this, but it also looks like it has some similar issues to DALL-E in understanding text prompts. Not surprising given they are using the same captioning process as DALL-E. A world model that does not know left from right? 🤔

[Image in tweet by Chomba Bupe]

@ChombaBupe I feel like we need to come up with a better term than "learn" for what machines do so as to curb the anthropomorphic enthusiasm. "Refine" doesn't quite seem to capture it... any suggestions for denoting the training of un-telligence?

@MrPatricKennedy In machine-learning libraries they use the term "fit", as in curve fitting. It's more mathematically correct.
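
To illustrate the terminology: scikit-learn literally names the training step .fit(), in the curve-fitting sense. The toy data below is made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # inputs
y = np.array([1.0, 3.1, 4.9, 7.2])           # noisy outputs of roughly y = 2x + 1

model = LinearRegression().fit(X, y)          # "fitting" a curve to data,
print(model.coef_, model.intercept_)          # not "learning" in any human sense
```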

@ChombaBupe There’s this fundamental mismatch at the heart of all of the improvement-by-scraping model architectures, which is that objective functions that encode “correct” as reconstructing similar output from an input penalize learning chains of causal relationships as inefficient.

@ChombaBupe Yes, more data will probably have the opposite effect. Even if the model were Turing complete, the training takes the shortest path (memorizing). Current models can generate *almost* everything because they were trained on almost everything. More data = more shortcuts.

@ChombaBupe It's better than previous attempts. I think it's too soon to say whether they are right or not. Is this perfect? Not at all. Is it better than every single previous attempt put together? Probably.
