
Hila Chefer
@hila_chefer
VideoJAM is our new framework for improved motion generation from @AIatMeta. We show that video generators struggle with motion because the training objective favors appearance over dynamics. VideoJAM directly addresses this **without any extra data or scaling** 👇🧵
Why do video generators struggle with motion? We found that the pixel-based loss barely changes when video frames are shuffled—showing it is nearly **invariant to temporal incoherence**. This leads models to ignore motion and prioritize appearance
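To make that observation concrete, here is a toy numerical sketch (not the paper's experimental protocol; all shapes, noise levels, and the synthetic "video" below are made up for illustration): with highly redundant frames, a pixel-wise MSE loss computed against a temporally shuffled target lands in the same ballpark as the loss against the correctly ordered target.

```python
import torch

torch.manual_seed(0)

# Toy "video": T nearly identical frames, mimicking the temporal redundancy
# of real footage. Shapes and noise levels are arbitrary illustrations.
T, C, H, W = 16, 3, 32, 32
base = torch.rand(1, C, H, W)
video = base.repeat(T, 1, 1, 1) + 0.02 * torch.randn(T, C, H, W)

# Stand-in "model prediction": the video plus some reconstruction error.
pred = video + 0.05 * torch.randn_like(video)

# Pixel-based loss against ordered frames vs. a temporally shuffled target.
loss_ordered = torch.mean((pred - video) ** 2)
perm = torch.randperm(T)
loss_shuffled = torch.mean((pred - video[perm]) ** 2)

print(f"loss, ordered targets : {loss_ordered.item():.4f}")
print(f"loss, shuffled targets: {loss_shuffled.item():.4f}")
# The two values are of similar magnitude: the per-frame objective barely
# registers that the temporal order was destroyed.
```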
Our Solution: VideoJAM
VideoJAM instills an explicit motion prior by modifying the objective: the model predicts both appearance and motion from a **single learned representation**. This forces the model to capture both visuals and dynamics, improving motion understanding.
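A rough PyTorch sketch of that idea (the class and loss names, the motion encoding, and the loss weighting below are my own illustrative choices, not the paper's exact architecture): one shared representation feeds two output heads, and training sums an appearance term and a motion term.

```python
import torch
import torch.nn as nn

class JointAppearanceMotionHead(nn.Module):
    """Sketch: a single learned representation feeds two linear heads, one for
    appearance (pixels/latents) and one for motion (e.g. an optical-flow
    encoding). Names and shapes are assumptions for illustration."""
    def __init__(self, d_model: int, out_dim: int):
        super().__init__()
        self.to_appearance = nn.Linear(d_model, out_dim)  # appearance prediction
        self.to_motion = nn.Linear(d_model, out_dim)      # added motion prediction

    def forward(self, h: torch.Tensor):
        # h: [batch, tokens, d_model] -- the shared representation
        return self.to_appearance(h), self.to_motion(h)

def joint_loss(pred_app, pred_mot, target_app, target_mot, motion_weight=1.0):
    # Combined objective: the model must explain both appearance and dynamics.
    return (torch.mean((pred_app - target_app) ** 2)
            + motion_weight * torch.mean((pred_mot - target_mot) ** 2))

# Usage with dummy tensors (real motion targets would come from optical flow).
head = JointAppearanceMotionHead(d_model=1024, out_dim=64)
h = torch.randn(2, 256, 1024)
app, mot = head(h)
loss = joint_loss(app, mot, torch.randn_like(app), torch.randn_like(mot))
loss.backward()
```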
Inner-Guidance: Improving Motion at Inference
At inference, we introduce **Inner-Guidance**—a method that leverages the **model’s own motion predictions** as a dynamic guidance signal, steering the generation toward coherent, realistic motion.
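Schematically, this is in the spirit of classifier-free guidance, with the model's own motion prediction fed back as an extra condition. The sketch below is my reading of that idea with made-up interfaces, weights, and a dummy model; the paper's exact Inner-Guidance formulation differs in its details.

```python
import torch

def inner_guidance_step(model, x_t, t, text_emb, w_text=7.5, w_motion=2.0):
    """Schematic guided prediction combining three model evaluations.
    `model` is assumed to return (noise prediction, motion prediction);
    the motion prediction is fed back as a conditioning signal.
    Signatures and guidance weights are illustrative assumptions."""
    # 1) Unconditional pass (no text, no motion conditioning).
    eps_uncond, _ = model(x_t, t, text=None, motion=None)
    # 2) Text-conditioned pass; also yields the model's own motion prediction.
    eps_text, motion_pred = model(x_t, t, text=text_emb, motion=None)
    # 3) Pass conditioned on the model's own motion prediction.
    eps_motion, _ = model(x_t, t, text=text_emb, motion=motion_pred)
    # Combine: steer the sample toward the text AND toward coherent motion.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_motion * (eps_motion - eps_text))

# Minimal usage with a dummy model that has the assumed interface.
class DummyModel:
    def __call__(self, x_t, t, text=None, motion=None):
        return torch.randn_like(x_t), torch.randn_like(x_t)

x_t = torch.randn(1, 4, 16, 32, 32)  # noisy video latent [B, C, T, H, W]
eps_hat = inner_guidance_step(DummyModel(), x_t, t=500,
                              text_emb=torch.randn(1, 77, 768))
print(eps_hat.shape)
```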
🎬 Results
VideoJAM fine-tunes a pretrained video generator (DiT) on just 3M samples from its own training set—yet achieves remarkable motion coherence. It even outperforms highly competitive proprietary models like Sora and Kling in motion quality
This work was done during my internship at @AIatMeta 🎉 Huge thanks to my amazing collaborators @urielsinger @amit_zhr @YKirstain @adam_polyak90 Yaniv Taigman @liorwolf and @ShellySheynin.
Check out the project page for many more results and details: https://hila-chefer.github.io/...
Now on Hugging Face daily papers 🤗 https://huggingface.co/papers/...
And on arXiv 🥳 https://arxiv.org/abs/2502.024...