Dream2Act: Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis

arXiv 2026
Weisheng Xu1*, Jian Li1*, Yi Gu1, Bin Yang1, Haodong Chen2, Shuyi Lin3, Mingqian Zhou4, Jing Tan1,
Qiwei Wu1, Xiangrui Jiang1, Taowen Wang1, Jiawen Wen1, Qiwei Liang1, Jiaxi Zhang1, Renjing Xu1†
1Hong Kong University of Science and Technology (Guangzhou), 2Harbin Institute of Technology, Shenzhen,
3Shenzhen University, 4University of Cambridge
*Equal Contribution, †Corresponding Author
Dream2Act Pipeline

Dream2Act is a robot-centric framework that enables zero-shot interaction through generative video synthesis, bypassing the morphology gap of human-centric retargeting.

Abstract

Equipping humanoid robots with versatile interaction skills typically requires either extensive task-specific policy training or explicit human-to-robot motion retargeting. However, learning-based policies are hindered by prohibitive data collection costs, limiting their scalability. Meanwhile, retargeting paradigms rely heavily on human-centric pose estimation (e.g., SMPL), which inevitably introduces a morphology gap: skeletal scale mismatches produce severe spatial misalignment when human motion is mapped onto the robot, compromising interaction success.

In this work, we propose Dream2Act, a robot-centric framework that enables zero-shot interaction through generative video synthesis. Given a third-person image of the predefined robot and the target object, our framework leverages video generation models to envision videos in which the physical robot completes the task with spatially aligned, morphology-consistent motion.

We evaluate Dream2Act on the Unitree G1 across four categories of whole-body mobile interaction tasks. Dream2Act achieves an overall task success rate of 37.5%, versus 0% for conventional retargeting pipelines, maintaining robot-consistent spatial alignment throughout execution and enabling reliable contact formation.

Zero-Shot Real-World Execution Results

Qualitative comparison on diverse spatially sensitive tasks: Ball Kicking, Box Hugging, Bag Punching, and Sofa Sitting.

Task: Ball Kicking

GVHMR (Human)

Baseline

Seedance2.0

Dream2Act (Ours)


Task: Box Hugging

GVHMR (Human)

Baseline

Seedance2.0

Dream2Act (Ours)


Task: Bag Punching

GVHMR (Human)

Baseline

Seedance2.0

Dream2Act (Ours)


Task: Sofa Sitting

GVHMR (Human)

Baseline

Seedance2.0

Dream2Act (Ours)

BibTeX

@misc{xu2026morphologyconsistenthumanoidinteractionrobotcentric,
      title={Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis}, 
      author={Weisheng Xu and Jian Li and Yi Gu and Bin Yang and Haodong Chen and Shuyi Lin and Mingqian Zhou and Jing Tan and Qiwei Wu and Xiangrui Jiang and Taowen Wang and Jiawen Wen and Qiwei Liang and Jiaxi Zhang and Renjing Xu},
      year={2026},
      eprint={2603.19709},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.19709}, 
}