📝 Publications

🖥️ * indicates equal contribution; entries are sorted by publication date

Preprint 2026

BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation
Yucheng Hu*, Jianke Zhang*, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, Wei Li, Jianyu Chen
Paper | Project

  • Intro: BagelVLA interleaves text, vision, and action reasoning to improve long-horizon manipulation and planning in a unified generative VLA framework.

ICLR 2026

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen
ICLR 2026 (High Score) | Paper | Project | Code | Twitter | Talk

  • TL;DR: We systematically evaluate how the base VLM affects VLA performance and posit a fundamental vision gap between the VQA capabilities of VLMs and actual action control.

Preprint 2025

PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization
Jiajun Zhang*, Jianke Zhang*, Zeyu Cui, Jiaxi Yang, Lei Zhang, Binyuan Hui, Qiang Liu, Zilei Wang, Liang Wang, Junyang Lin
Paper | Repo | Code

  • Intro: PlotCraft benchmarks complex visualization generation with 1k tasks and 48 chart types, and introduces PlotCraftor to improve hard multi-turn plotting tasks; it is used in Qwen3-Coder training.

Preprint 2025

UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning
Jianke Zhang*, Yucheng Hu*, Yanjiang Guo, Xiaoyu Chen, Yichen Liu, Wenna Chen, Chaochao Lu, Jianyu Chen
Paper | Project

  • Intro: UniCoD learns from both understanding and future-prediction objectives in a continuous (JEPA-style) representation space, using more than 1M instructional manipulation videos to strengthen generalist robot policies.

ICML 2025

UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Jianke Zhang*, Yanjiang Guo*, Yucheng Hu*, Xiaoyu Chen, Xiang Zhu, Jianyu Chen
ICML 2025 | Paper | Code

  • Intro: UP-VLA is the first action model to unify understanding and future-prediction objectives, improving both semantic reasoning and spatial awareness for embodied control.

ICML 2025 Spotlight

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Yucheng Hu*, Yanjiang Guo*, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen
ICML 2025 Spotlight | Paper | Project | Code | Twitter | 机器之心

  • Intro: VPP uses predictive representations from video diffusion models to improve generalization in robotic policy learning and dexterous manipulation.
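
A minimal sketch of the core idea, assuming a frozen pretrained video prediction backbone whose intermediate features feed a small action head; the module and parameter names are placeholders, not VPP's actual API:

```python
# Hypothetical sketch: a policy built on predictive visual representations.
import torch
import torch.nn as nn

class PredictiveFeaturePolicy(nn.Module):
    def __init__(self, video_model: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.video_model = video_model            # frozen pretrained video prediction model
        for p in self.video_model.parameters():
            p.requires_grad = False               # keep the predictive backbone frozen
        self.head = nn.Sequential(                # small trainable action head
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Assumption: the backbone returns (batch, feat_dim) features that
        # encode a prediction of how the scene will evolve.
        feats = self.video_model(frames)
        return self.head(feats)                   # predicted action
```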

ICRA 2025

Improving Vision-Language-Action Model with Online Reinforcement Learning
Yanjiang Guo*, Jianke Zhang*, Xiaoyu Chen*, Xiang Ji, Yen-Jen Wang, Yucheng Hu, Jianyu Chen
ICRA 2025 | Paper | Twitter

  • Intro: iRe-VLA alternates reinforcement learning and supervised learning stages to stabilize online post-training while improving VLA adaptation in interactive environments; it is the first attempt to enhance VLA models via RL.
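
A hedged sketch of the alternating recipe, assuming generic rl_step/sl_step callables rather than the paper's actual training code:

```python
# Illustrative sketch of an alternating RL / SL loop in the spirit of iRe-VLA.
from typing import Callable, List

def alternate_rl_sl(
    rl_step: Callable[[], List],      # one online-RL phase; returns successful rollouts
    sl_step: Callable[[List], None],  # one supervised phase over a trajectory set
    demos: List,
    num_rounds: int = 5,
) -> None:
    online_successes: List = []
    for _ in range(num_rounds):
        # RL phase: explore online; keep only successful trajectories.
        online_successes.extend(rl_step())
        # SL phase: re-fit on expert demos plus collected successes, which
        # anchors the policy and stabilizes online post-training.
        sl_step(demos + online_successes)
```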

CoRL 2024

HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Jianke Zhang*, Yanjiang Guo*, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
CoRL 2024 | Paper | Twitter | 机器之心

  • Intro: HiRT is the first to introduce System-1/System-2 theory into VLA, balancing low-frequency VLM reasoning with high-frequency visual control to cut latency and improve dynamic robot manipulation.
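
A minimal sketch of the hierarchical loop, assuming a slow VLM that refreshes a latent plan every k steps while a fast policy acts every step; all names are illustrative, not HiRT's API:

```python
# Hypothetical System-1 / System-2 control loop sketch.
import torch.nn as nn

def hierarchical_rollout(vlm: nn.Module, fast_policy: nn.Module,
                         get_obs, send_action, steps: int = 100, k: int = 10):
    latent = None
    for t in range(steps):
        obs = get_obs()                    # current camera observation
        if t % k == 0:                     # System 2: slow semantic reasoning
            latent = vlm(obs)              # latent plan, refreshed every k steps
        action = fast_policy(obs, latent)  # System 1: high-frequency control
        send_action(action)
```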

NeurIPS 2024

Prediction with Action: Visual Policy Learning via Joint Denoising Process
Yanjiang Guo*, Yucheng Hu*, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen
NeurIPS 2024 | Paper | Project | Code

  • Intro: PAD (the first WAM) incorporates future image prediction and robot action generation in a joint denoising process for stronger imitation learning and real-world generalization.
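
A hedged training-step sketch of joint denoising, with a toy noise schedule and a hypothetical model signature standing in for the paper's diffusion details:

```python
# Illustrative joint-denoising training step: future frame and action are
# noised and denoised together by one network, coupling the two modalities.
import torch
import torch.nn.functional as F

def joint_denoise_step(model, frame_next, action, obs_cond, num_steps=1000):
    t = torch.randint(0, num_steps, (frame_next.shape[0],))  # random timestep
    noise_img = torch.randn_like(frame_next)
    noise_act = torch.randn_like(action)
    alpha = 1.0 - t.float() / num_steps                      # toy schedule
    a_img = alpha.view(-1, 1, 1, 1)
    a_act = alpha.view(-1, 1)
    x_img = a_img.sqrt() * frame_next + (1 - a_img).sqrt() * noise_img
    x_act = a_act.sqrt() * action + (1 - a_act).sqrt() * noise_act
    # One network predicts the noise for both modalities at once,
    # conditioned on the current observation.
    pred_img, pred_act = model(x_img, x_act, t, obs_cond)
    return F.mse_loss(pred_img, noise_img) + F.mse_loss(pred_act, noise_act)
```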

Pattern Recognition (2023)

Cross-modal Adapter for Vision-Language Retrieval
Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Shiji Song, Gao Huang
Pattern Recognition | Paper | Code

  • Intro: This work proposes a parameter-efficient cross-modal adapter for vision-language retrieval, improving multimodal matching without fully retraining the backbone.
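
A minimal bottleneck-adapter sketch illustrating the parameter-efficient idea; the paper's specific cross-modal weight sharing is not reproduced here:

```python
# Standard bottleneck adapter: inserted into a frozen backbone layer,
# only these few parameters are trained for the retrieval task.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, dim)    # project back up
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))  # residual update
```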