📝 Publications
🖥️ * indicates equal contribution, sorted by publication date

BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation
Yucheng Hu*, Jianke Zhang*, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, Wei Li, Jianyu Chen
Paper | Project
- Intro: BagelVLA interleaves text, vision, and action reasoning to improve long-horizon manipulation and planning in a unified generative VLA framework.

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen
ICLR 2026 (High Score) | Paper | Project | Code | Twitter | Talk
- TL;DR: We systematically evaluate how the base VLM affects VLA performance, and posit a fundamental vision gap between the VQA capabilities of VLMs and actual action control.

PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization
Jiajun Zhang*, Jianke Zhang*, Zeyu Cui, Jiaxi Yang, Lei Zhang, Binyuan Hui, Qiang Liu, Zilei Wang, Liang Wang, Junyang Lin
Paper | Repo | Code
- Intro: PlotCraft benchmarks complex visualization generation with 1k tasks and 48 chart types, and introduces PlotCraftor to improve hard multi-turn plotting tasks. Used for Qwen3-Coder training.

UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning
Jianke Zhang*, Yucheng Hu*, Yanjiang Guo, Xiaoyu Chen, Yichen Liu, Wenna Chen, Chaochao Lu, Jianyu Chen
Paper | Project
- Intro: UniCoD learns from both understanding and future-prediction objectives in a continuous latent space (JEPA-style), using more than 1M instructional manipulation videos to strengthen generalist robot policies.

UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Jianke Zhang*, Yanjiang Guo*, Yucheng Hu*, Xiaoyu Chen, Xiang Zhu, Jianyu Chen
ICML 2025 | Paper | Code
- Intro: UP-VLA is the first action model to unify understanding and future-prediction objectives, improving both semantic reasoning and spatial awareness for embodied control.

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Yucheng Hu*, Yanjiang Guo*, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen
ICML 2025 Spotlight | Paper | Project | Code | Twitter | 机器之心
- Intro: VPP uses predictive representations from video diffusion models to improve generalization in robotic policy learning and dexterous manipulation.

Improving Vision-Language-Action Model with Online Reinforcement Learning
Yanjiang Guo*, Jianke Zhang*, Xiaoyu Chen*, Xiang Ji, Yen-Jen Wang, Yucheng Hu, Jianyu Chen
ICRA 2025 | Paper | Twitter
- Intro: iRe-VLA alternates reinforcement learning and supervised learning to stabilize online post-training while improving VLA adaptation in interactive environments. This is the first attempt to enhance VLAs via RL.

HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Jianke Zhang*, Yanjiang Guo*, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
CoRL 2024 | Paper | Twitter | 机器之心
- Intro: HiRT is the first to introduce System-1/System-2 theory into VLAs, balancing low-frequency VLM reasoning with high-frequency visual control to cut latency and improve dynamic robot manipulation.

Prediction with Action: Visual Policy Learning via Joint Denoising Process
Yanjiang Guo*, Yucheng Hu*, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen
NeurIPS 2024 | Paper | Project | Code
- Intro: PAD (the first WAM) incorporates future image prediction and robot action generation in a joint denoising process for stronger imitation learning and real-world generalization.

Cross-modal Adapter for Vision-Language Retrieval
Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Shiji Song, Gao Huang
Pattern Recognition | Paper | Code
- Intro: This work proposes a parameter-efficient cross-modal adapter for vision-language retrieval, improving multimodal matching without fully retraining the backbone.