BagelVLA: Enhancing Long-Horizon Manipulation via
Interleaved Vision-Language-Action Generation
Abstract
Equipping embodied agents with the ability to reason about tasks, foresee physical outcomes, and generate precise actions is essential for general-purpose manipulation. While recent Vision-Language-Action (VLA) models leverage pre-trained foundation models, they typically focus on either linguistic planning or visual forecasting in isolation and rarely integrate both capabilities to guide action generation, leading to suboptimal performance on complex, long-horizon manipulation tasks. To bridge this gap, we propose BagelVLA, a unified model that integrates linguistic planning, visual forecasting, and action generation within a single framework. Initialized from a pretrained unified understanding-and-generation model, BagelVLA is trained to interleave textual reasoning and visual prediction directly into the action execution loop. To couple these modalities efficiently, we introduce Residual Flow Guidance (RFG), which initializes the flow from the current observation and uses single-step denoising to extract predictive visual features that guide action generation with minimal latency. Extensive experiments demonstrate that BagelVLA outperforms existing baselines by a significant margin on multiple simulated and real-world benchmarks, particularly on tasks requiring multi-stage reasoning.
Model Architecture
BagelVLA adopts a Mixture-of-Transformers (MoT) architecture comprising three independent transformer experts specialized for the linguistic, visual, and action modalities. To tackle long-horizon tasks and improve semantic generalization, we formulate language-conditioned action learning as a long-sequence interleaved planning problem: tokens from all three modalities are structured into a unified sequence, and the model generates predictions for each modality conditioned on the interleaved context.
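For concreteness, below is a minimal PyTorch sketch of one possible MoT block over an interleaved sequence. The class and routing scheme (`MoTBlock`, shared attention with per-modality projections and feed-forward experts) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical modality ids for the interleaved sequence.
TEXT, IMAGE, ACTION = 0, 1, 2

class MoTBlock(nn.Module):
    """Simplified Mixture-of-Transformers block: attention is computed jointly
    over the interleaved sequence, while each modality routes through its own
    input projection and feed-forward expert."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # One expert (projection + FFN) per modality.
        self.in_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(3)
        ])
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # Route each token through the projection of its own modality expert.
        h = torch.zeros_like(x)
        for m in (TEXT, IMAGE, ACTION):
            mask = modality == m
            h[:, mask] = self.in_proj[m](x[:, mask])
        attn_out, _ = self.attn(self.norm1(h), self.norm1(h), self.norm1(h))
        x = x + attn_out                      # joint attention over all tokens
        out = torch.zeros_like(x)
        for m in (TEXT, IMAGE, ACTION):
            mask = modality == m
            out[:, mask] = self.ffn[m](self.norm2(x[:, mask]))
        return x + out

# Toy interleaved sequence: [instruction tokens | image tokens | action tokens].
modality = torch.tensor([TEXT] * 12 + [IMAGE] * 64 + [ACTION] * 8)
tokens = torch.randn(1, modality.numel(), 256)
print(MoTBlock(dim=256)(tokens, modality).shape)  # torch.Size([1, 84, 256])
```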
To address the high latency in combining visual generation with control, we introduce Residual Flow Guidance (RFG). Instead of generating future frames from scratch, RFG conditions on the current observation as a strong structural prior and performs single-step denoising to predict the residual change toward the next keyframe. RFG provides a lightweight predictive visual representation that captures task-relevant dynamics with minimal overhead. This substantially reduces the computational cost of foresight while preserving its utility for action generation.
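The sketch below illustrates this single-step idea under simplifying assumptions: the module name `ResidualFlowGuidance`, the MLP velocity network, and the unit Euler step are hypothetical stand-ins; only the overall recipe (initialize the flow at the current observation latent, predict the residual toward the next keyframe in one step, reuse the result as a foresight feature) follows the description above.

```python
import torch
import torch.nn as nn

class ResidualFlowGuidance(nn.Module):
    """Minimal RFG sketch: instead of denoising from Gaussian noise, the flow
    starts at the latent of the current observation and a single Euler step
    predicts the residual toward the next keyframe. The result serves as a
    lightweight predictive visual feature for the action head."""
    def __init__(self, latent_dim: int = 256, cond_dim: int = 256):
        super().__init__()
        self.velocity_net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 512), nn.GELU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_obs: torch.Tensor, cond: torch.Tensor):
        # Single-step denoising: predict the velocity (residual direction)
        # and take one Euler step of size 1 from the observation latent.
        v = self.velocity_net(torch.cat([z_obs, cond], dim=-1))
        z_goal_pred = z_obs + v        # predicted next-keyframe latent
        return z_goal_pred, v          # both can condition action generation

# Toy usage: the predicted latent/residual would be fed to the action expert.
rfg = ResidualFlowGuidance()
z_obs = torch.randn(1, 256)   # latent of the current observation
cond = torch.randn(1, 256)    # pooled language/subtask conditioning
z_goal, residual = rfg(z_obs, cond)
print(z_goal.shape, residual.shape)
```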
Experiment
We conduct extensive experiments on both simulated and real-world robotic tasks to evaluate BagelVLA. The simulated environments include the CALVIN and RoboTwin benchmarks, while the real-world evaluation covers dual-arm manipulation tasks on the AgileX robot platform.
Simulation Results
| Model | CALVIN ABC-D (Avg. Task Len.) | RoboTwin Clean (Success %) | RoboTwin Randomized (Success %) |
|---|---|---|---|
| π₀ | 3.648 | 46.42 | 16.34 |
| RDT | - | 34.50 | 13.72 |
| UP-VLA | 4.078 | 52.92 | 15.16 |
| VPP | 4.329 | - | - |
| BagelVLA (Ours) | 4.405 | 75.26 | 20.87 |
Real-World Basic Tasks (Success Rate, %)
| Model | Pick&Place (Seen) | Pick&Place (Unseen) | Water Flower | Stack Cubes | Stack Bowls | Sweep Rubbish | Average |
|---|---|---|---|---|---|---|---|
| π₀ | 95 | 55 | 50 | 65 | 70 | 55 | 65.0 |
| VPP | 85 | 45 | 60 | 50 | 55 | 45 | 59.5 |
| BagelVLA (Ours) | 95 | 85 | 60 | 80 | 90 | 80 | 75.5 |
Long-Horizon Planning Tasks
We designed two categories of long-horizon tasks: Stack Cubes in Requested Order and Calculate and Place Symbol Blocks. These tasks require both interleaved vision-language planning and instruction-following capability at the action level.
| Model | Stack Cubes (Easy) | Stack Cubes (Middle) | Stack Cubes (Hard) | Avg. Success Rate | Calculate & Place (Easy) | Calculate & Place (Middle) | Calculate & Place (Hard) | Avg. Success Rate |
|---|---|---|---|---|---|---|---|---|
| π₀ | 75 | 35 | 10 | 40.0 | 70 | 25 | 0 | 31.7 |
| VPP | 60 | 15 | 0 | 25.0 | 60 | 10 | 0 | 23.3 |
| BagelVLA (Ours) | 95 | 65 | 60 | 73.3 | 80 | 65 | 45 | 63.3 |
Visualization of Interleaved Planning
Given a global instruction and the current observation, BagelVLA leverages the context to identify the immediate subtask, predicts a goal image for that subtask, and subsequently generates actions.
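The following Python sketch mirrors that interleaved loop at inference time. All function names (`plan_subtask`, `predict_goal_image`, `generate_action_chunk`) and the toy environment are placeholders for illustration, not the released API.

```python
# Minimal sketch of the interleaved plan -> forecast -> act loop described above.

def plan_subtask(instruction, observation, history):
    """Textual reasoning: decide the next subtask from the interleaved context."""
    return f"subtask for: {instruction}"

def predict_goal_image(subtask, observation):
    """Visual forecasting: predict a goal image (a stub here) for the subtask."""
    return {"goal_for": subtask}

def generate_action_chunk(observation, subtask, goal_image):
    """Action generation conditioned on the subtask text and predicted goal."""
    return [0.0] * 7  # e.g. a 7-DoF action

def rollout(instruction, env, max_subtasks=10):
    history = []
    obs = env["observe"]()
    for _ in range(max_subtasks):
        subtask = plan_subtask(instruction, obs, history)   # 1) text plan
        goal = predict_goal_image(subtask, obs)              # 2) goal image
        action = generate_action_chunk(obs, subtask, goal)   # 3) actions
        obs = env["step"](action)
        history.append((subtask, goal))
        if env["done"]():
            break
    return history

# Toy environment callbacks, only to make the sketch executable end to end.
env = {"observe": lambda: "obs0", "step": lambda a: "obs1", "done": lambda: True}
print(rollout("stack the cubes in the requested order", env))
```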
Demo Videos
Complex Long-Horizon Tasks
Assembling building blocks to complete the equation 24+8=?
Basic Tasks
We deployed the robot on a variety of tasks, including pick-and-place, sweeping, stacking, pressing, closing, and pouring. We roll out the policy on unseen objects and scenes.
Pick up the broom, sweep the garbage into the dustpan.
Place the phone into the box.
Pour the fries on the right into the pink plate on the left.
Put the mango into the brown plate.
Put the peach into the pink plate.
Put the pear into the blue plate.
Stack all the blocks.
Stack all the bowls.
Put the cup on the blue plate.
Put the hat on the brown plate.
Put the magic cube into the green plate.
Stack the orange on the pink block.
Put the red block into the drawer and close it.
Press the buttons on the desktop to activate them.
Pick up the kettle, then water the flowers.