Dream2Act: Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis

arXiv 2026
Weisheng Xu1*, Jian Li1*, Yi Gu1, Bin Yang1, Haodong Chen2, Shuyi Lin3, Mingqian Zhou4, Jing Tan1,
Qiwei Wu1, Xiangrui Jiang1, Taowen Wang1, Jiawen Wen1, Qiwei Liang1, Jiaxi Zhang1, Renjing Xu1†
1Hong Kong University of Science and Technology (Guangzhou), 2Harbin Institute of Technology, Shenzhen,
3Shenzhen University, 4University of Cambridge
*Equal Contribution, †Corresponding Author
Dream2Act Pipeline

Dream2Act is a robot-centric framework that enables zero-shot interaction through generative video synthesis, bypassing the morphology gap of human-centric retargeting.

Abstract

Equipping humanoid robots with versatile interaction skills typically requires either extensive task-specific policy training or explicit human-to-robot motion retargeting. However, learning-based policies are hindered by prohibitive data collection costs, limiting their scalability. Meanwhile, retargeting paradigms rely heavily on human-centric pose estimation (e.g., SMPL), which inevitably introduces a morphology gap: skeletal scale mismatches produce severe spatial misalignment when human motion is mapped onto the robot, compromising interaction success.

In this work, we propose Dream2Act, a robot-centric framework that enables zero-shot interaction through generative video synthesis. Given a third-person image of the predefined robot and the target object, our framework leverages video generation models to envision videos in which the physical robot completes the task with spatially aligned, morphology-consistent motion.

We evaluate Dream2Act on the Unitree G1 across four categories of whole-body mobile interaction tasks. Dream2Act achieves an overall task success rate of 37.5%, versus 0% for conventional retargeting pipelines, maintaining robot-consistent spatial alignment throughout execution and enabling reliable contact formation.

Zero-Shot Real-World Execution Results

Qualitative comparison on diverse spatially sensitive tasks: Ball Kicking, Box Hugging, Bag Punching, and Sofa Sitting.

Task: Ball Kicking

GVHMR (Human)

Baseline

Seedance2.0

Dream2Act (Ours)


Task: Box Hugging

GVHMR (Human)

Baseline

Seedance2.0

Dream2Act (Ours)


Task: Bag Punching

GVHMR (Human)

Baseline

Seedance2.0

Dream2Act (Ours)


Task: Sofa Sitting

GVHMR (Human)

Baseline

Seedance2.0

Dream2Act (Ours)

BibTeX

@misc{xu2026morphologyconsistenthumanoidinteractionrobotcentric,
      title={Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis}, 
      author={Weisheng Xu and Jian Li and Yi Gu and Bin Yang and Haodong Chen and Shuyi Lin and Mingqian Zhou and Jing Tan and Qiwei Wu and Xiangrui Jiang and Taowen Wang and Jiawen Wen and Qiwei Liang and Jiaxi Zhang and Renjing Xu},
      year={2026},
      eprint={2603.19709},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.19709}, 
}