
Hong-Xing "Koven" Yu
@Koven_Yu
🤩Forget MoCap -- Let’s generate human interaction motions with *Real-world 3D scenes*!🏃🏞️ Introducing ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation. No training, No MoCap data! 🧵1/5 Web: https://awfuact.github.io/zero...
Generating 4D human-scene interaction motions is central to gaming, VR, and robotics. Yet existing methods generally require many paired motion-scene examples for training, which are expensive💸 and infeasible to collect across diverse real-world scenes ❌.
We propose ZeroHSI to generate 4D interactions without requiring any MoCap data. Instead, our main idea is to distill human motions from a well-trained video generation model 🎥 that has already seen many human videos.
The technical idea is simple: we generate a human-scene interaction video conditioned on the 3D scene, and then use differentiable human rendering to extract the 3D human motion from it.
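For intuition only, here is a minimal PyTorch sketch of that second step, under toy assumptions: `toy_render` is a hypothetical stand-in for a real differentiable human renderer (the actual system would fit a full human body model, not 2D joint splats), and `video_frames` stands in for frames produced by the video generation model. The point is just the optimization pattern: fit per-frame pose parameters by minimizing a photometric loss between the rendering and the generated video.

```python
# Illustrative sketch only -- not the paper's actual pipeline.
# `toy_render` stands in for a real differentiable human renderer,
# `video_frames` for frames from a video generation model.
import torch

T, J, H, W = 16, 24, 64, 64           # frames, joints, image size (toy scale)
video_frames = torch.rand(T, H, W)    # placeholder for generated video frames

# Per-frame pose parameters we optimize (stand-in for body-model pose).
pose = torch.zeros(T, J, 2, requires_grad=True)

def toy_render(p):
    """Splat each 2D joint as a small Gaussian -- a toy stand-in
    for a real differentiable human renderer."""
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")     # each (H, W)
    grid = torch.stack([gx, gy], dim=-1)               # (H, W, 2)
    # Squared distance from every pixel to every joint: (T, J, H, W)
    d2 = ((grid[None, None] - p[:, :, None, None, :]) ** 2).sum(-1)
    return torch.exp(-d2 / 0.01).sum(1).clamp(max=1.0)  # (T, H, W)

opt = torch.optim.Adam([pose], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    rendered = toy_render(pose)
    loss = ((rendered - video_frames) ** 2).mean()  # photometric loss
    loss.backward()   # gradients flow through the renderer into the poses
    opt.step()
```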
Our ZeroHSI works with both (1) static scenes and (2) dynamic scenes with interactable objects.
See our project website https://awfuact.github.io/zero... for more visualizations! Work done w/ Hongjie Li (a summer intern in our group), @jiaman01, and @jiajunwu_cs at @StanfordSVL @StanfordAILab.