VLM4VLA

Revisiting Vision-Language Models in Vision-Language-Action Models

¹Tsinghua University, ²Qwen Team, Alibaba Inc.

We propose VLM4VLA, a unified training and evaluation framework designed for the systematic study of Vision-Language Models' impact on Vision-Language-Action model performance.

VLM4VLA Framework Overview


Abstract

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters, enabling fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are a poor predictor of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual modules, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action planning.

Unified Evaluation Framework

Build VLM4VLA, a scalable and fair evaluation framework that integrates different VLMs into VLAs in a unified and lightweight manner.

Comprehensive Experimental Study

Conduct comprehensive experiments to study the influence of VLM backbone on embodied manipulation tasks, covering VLM architecture, post-training fine-tuning data, and vision modules.

Practical Insights

Analyze experimental results to provide practical insights, offering a reference for backbone selection and performance baselines for the VLA community.

πŸ” Most Surprising Finding

We find that the performance requirements for VLMs in embodied manipulation tasks do not fully align with their VQA capabilities. Specifically, and contrary to common expectations, VLMs that perform well on general VQA benchmarks are not necessarily better when used in VLAs. Furthermore, across a range of auxiliary Embodied-QA tasks, we discover that fine-tuning on most of them leads to performance degradation in the resulting VLA.


Study Design

🎯 Fairness and Reproducibility: We employ a consistent model architecture and training/testing settings across multiple simulation environments to ensure fair and reproducible comparisons.
⚡ Minimalist Design: We encapsulate VLMs within a simple yet effective VLA framework (sketched below), thereby minimizing the influence of complex, extraneous policy designs on the comparison.
🧠 Leveraging Inherent Knowledge: The VLA design fully leverages the inherent knowledge of the VLM. Crucially, we ensure that the input sequence format is consistent with what each VLM was exposed to during its instruction-SFT phase. We exclude any robotic priors beyond vision and language, such as proprioceptive state, tactile feedback, or environmental rewards.
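For concreteness, a minimal sketch of this kind of lightweight adaptation is shown below in PyTorch. It is illustrative only and not the exact VLM4VLA implementation: the learnable action-query tokens, the MLP action head, and the assumption that the backbone's forward() accepts `inputs_embeds` and returns hidden states are our own simplifications.

```python
import torch
import torch.nn as nn

class VLM4VLAPolicy(nn.Module):
    """Wrap a pretrained VLM with a small set of new learnable parameters.

    Illustrative sketch only: it assumes a Hugging Face-style backbone whose
    forward() accepts `inputs_embeds` and can return per-token hidden states,
    and it predicts a short chunk of continuous actions with a small MLP head.
    """

    def __init__(self, vlm, hidden_size, action_dim, chunk_len=8, n_queries=8):
        super().__init__()
        self.vlm = vlm  # pretrained VLM backbone, fine-tuned as usual downstream
        # New learnable parameters: action query tokens + a small regression head.
        self.action_queries = nn.Parameter(0.02 * torch.randn(n_queries, hidden_size))
        self.action_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, inputs_embeds, attention_mask):
        # `inputs_embeds` holds the image + instruction tokens in exactly the
        # chat format the VLM saw during its instruction-SFT phase.
        b, n_q = inputs_embeds.shape[0], self.action_queries.shape[0]
        queries = self.action_queries.unsqueeze(0).expand(b, -1, -1)
        embeds = torch.cat([inputs_embeds, queries], dim=1)
        mask = torch.cat([attention_mask, attention_mask.new_ones(b, n_q)], dim=1)
        hidden = self.vlm(inputs_embeds=embeds, attention_mask=mask,
                          output_hidden_states=True).hidden_states[-1]
        # Read out the query positions and regress a chunk of continuous actions.
        q_hidden = hidden[:, -n_q:, :].mean(dim=1)
        return self.action_head(q_hidden).view(b, self.chunk_len, self.action_dim)
```

The point of the sketch is that only the query embeddings and the regression head are new; everything else is the unmodified pretrained VLM.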

Network Diagram

Experiments and Analysis

To ensure the reproducibility and fairness of our experiments, we test in three simulation environments, selecting the most challenging scenarios as our evaluation benchmarks: Calvin ABC-D, SimplerEnv Bridge, and Libero-Long.

1. VLM4VLA Performance Comparison

| Model (VLM Backbone) | Size | Calvin ABC-D ↑ | SimplerEnv ↑ | Libero-10 ↑ |
|---|---|---|---|---|
| Expert VLA Models | | | | |
| OpenVLA (Llama-2) | 7.7B | 2.548 | 4.2 | 53.7 |
| pi0 (Paligemma-1) | 3.1B | 3.509 | 60.4 | 46.0 |
| VLM4VLA Models | | | | |
| Qwen2.5VL-3B | 3.8B | 3.856 | 48.0 | 43.0 |
| Qwen2.5VL-7B | 8.3B | 4.057 | 46.9 | 45.0 |
| Qwen3VL-2B | 2.1B | 4.142 | 49.0 | 55.8 |
| Qwen3VL-4B | 4.4B | 3.943 | 56.3 | 44.4 |
| Qwen3VL-8B | 8.8B | 4.035 | 58.3 | 46.2 |
| Qwen3VL-30B-A3B | 31.1B | 4.075 | 44.8 | 46.8 |
| Paligemma-1 | 2.9B | 3.506 | 55.3 | 44.2 |
| Paligemma-2 | 3.0B | 3.406 | 57.3 | 46.2 |
| Kosmos-2 | 1.7B | 3.096 | 60.4 | 55.0 |
VLM Performance Comparison

Plotting VLM scores on multiple general-purpose QA benchmarks on the x-axis (representing VLM capability) against VLA performance in each simulation environment on the y-axis, and fitting a line between the two, we observe no clear positive correlation between VLM capability and VLA performance. This differs from previous expectations in the field.
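As a minimal sketch of this analysis, the linear fit and correlation can be computed as follows; the scores below are placeholders, not our measured values.

```python
import numpy as np

# Placeholder values for illustration only -- NOT the measured scores reported above.
vlm_benchmark_score = np.array([55.0, 60.0, 62.0, 68.0, 71.0, 74.0])  # x: general VQA capability
vla_success_rate    = np.array([48.0, 44.0, 56.0, 47.0, 50.0, 46.0])  # y: downstream VLA performance

# Least-squares linear fit y ~= a*x + b and Pearson correlation.
a, b = np.polyfit(vlm_benchmark_score, vla_success_rate, deg=1)
r = np.corrcoef(vlm_benchmark_score, vla_success_rate)[0, 1]
print(f"slope={a:.3f}, intercept={b:.1f}, Pearson r={r:.3f}")
# A |r| close to 0 (as we observe across benchmarks) indicates no clear
# positive correlation between VLM capability and VLA performance.
```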




2. Impact of Auxiliary Tasks

We study the impact of different VLM auxiliary tasks on VLA performance. Recent work has proposed using robotic data to construct VQA datasets for improving VLM backbones, but few studies have investigated whether this additional continual finetuning actually benefits VLAs in downstream tasks. We construct or collect several SFT tasks for the VLM, including VQA datasets and generation tasks; each is summarized below, and a sketch of a representative sample format follows the summaries.

RoboPoint

A pointing-task dataset collected in simulation. Given an image and a target requirement, the model must output 2D coordinates that satisfy it. Contains 1.432M samples.

Vica-332k

A spatial understanding dataset constructed from RGB-D datasets. It covers a wide range of capabilities, including size estimation, position understanding, distance estimation, and so on.

BridgeVQA

A spatial understanding question-answering dataset annotated from Bridge-v2, Fractal, and Calvin ABC data using VQASynth.

Robo2VLM

An action-oriented question-answering dataset built from 176k real robot trajectories, containing 667k VQA pairs.

RoboBrain2

A large-scale embodied VQA dataset and a VLM finetuned from Qwen2.5VL-7B. The tasks include pointing, planning, and trajectory marking.

Omni-Generation

A setting that integrates a diffusion model into Qwen2.5VL-7B and trains jointly on image generation, depth-map generation, and semantic-segmentation-map generation.
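As an illustration of how such auxiliary data plugs into a standard instruction-SFT pipeline, a single pointing-style sample might be packed as follows; the field names, chat layout, and coordinate convention are hypothetical and do not reproduce the exact schema of any dataset above.

```python
# Hypothetical pointing-style SFT sample in a chat/instruction format.
# Field names, the <image> placeholder, and the normalized (x, y) coordinate
# convention are illustrative only, not the exact schema of the datasets above.
sample = {
    "image": "episodes/ep_0001/frame_042.png",
    "conversations": [
        {"role": "user",
         "content": "<image>\nPoint to a free spot on the table "
                    "to the left of the red mug."},
        {"role": "assistant",
         "content": "(0.31, 0.62)"},  # answer given as a normalized 2D point
    ],
}
```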

Auxiliary Tasks Performance Results

Overall, all models underperform the original baseline, with most exhibiting only a slight degradation. For Qwen2.5VL-3B, the model finetuned on Vica-332k performs better than those finetuned on the other datasets. This could be attributed to the dataset's broad coverage and diverse task types, which may prevent the model from overfitting to a narrow set of capabilities and consequently degrading others.

Conclusion: Existing embodied VQA-style tasks do not offer a clear benefit for training end-to-end VLAs to execute downstream manipulation tasks. This suggests that VLAs may require broad, general capabilities, beyond just embodied skills, to perform well on downstream tasks.



3. Importance of Different VLM Modules

We find that freezing the vision encoder during VLM4VLA training leads to significant performance degradation for all models on both the Calvin and Simpler benchmarks. This strongly suggests that finetuning the vision encoder is crucial when adapting a VLM into a VLA.

| Model | Size | Calvin ABC-D ↑ | SimplerBridge ↑ |
|---|---|---|---|
| Qwen2.5VL-3B | 3.8B | 3.856 | 48.00 |
| + freeze vision encoder | 3.1B | 2.855 (-1.001) | 23.95 (-24.05) |
| + freeze word embedding | 3.4B | 3.849 (-0.007) | 46.88 (-1.12) |
| Qwen2.5VL-7B | 8.3B | 4.057 | 46.75 |
| + freeze vision encoder | 7.6B | 2.823 (-1.234) | 25.50 (-21.25) |
| + freeze word embedding | 7.8B | 3.874 (-0.183) | 48.96 (+2.21) |
| Paligemma-1 | 2.9B | 3.506 | 55.25 |
| + freeze vision encoder | 2.5B | 0.495 (-3.011) | 13.25 (-42.00) |
| + freeze word embedding | 2.7B | 3.485 (-0.021) | 52.25 (-3.00) |
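For reference, a minimal sketch of how these freezing ablations can be implemented on a Hugging Face-style backbone; the `model.visual` attribute follows Qwen2.5VL-style models, and attribute names differ across VLMs.

```python
import torch.nn as nn

def freeze_module(module: nn.Module) -> None:
    """Disable gradient updates for every parameter of a submodule."""
    for p in module.parameters():
        p.requires_grad = False

def apply_ablation(model: nn.Module, mode: str) -> nn.Module:
    """Freeze one VLM component before VLA training.

    Assumes a Hugging Face-style VLM: `model.visual` is the vision tower on
    Qwen2.5VL-style models, and get_input_embeddings() returns the word
    embedding table. Attribute names may differ for other backbones.
    """
    if mode == "freeze_vision_encoder":
        freeze_module(model.visual)
    elif mode == "freeze_word_embedding":
        freeze_module(model.get_input_embeddings())
    return model
```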


4. Analysis of the Gap Between VLM and VLA

We hypothesize that the visual gap may stem from the following two factors:

  1. Real images vs. simulated renderings (Real to Sim): During pretraining, VLMs are exposed to relatively few tabletop simulation renderings. As a result, the vision encoder (e.g., ViT) may lack effective high-level semantic representations for simulated images encountered in manipulation.
  2. Vision-language understanding vs. low-level action control: The visual features encoded by the VLM’s vision encoder are better aligned with language-output objectives typical of QA-style tasks, whereas low-level action control in robotics requires different visual cues and representations.
We aim to demonstrate that Factor 2 is also a significant source of the gap by incorporating action information during the VLM fine-tuning stage. Specifically, we employ the FAST action tokenizer to encode actions from the Bridge dataset, constructing a VQA dataset enriched with action control information, and use it to fine-tune the VLM. Subsequently, following the VLM4VLA protocol, we train this fine-tuned VLM on continuous actions and evaluate it in the Simpler-Bridge environment.
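A simplified sketch of this data construction is shown below; the per-step binning tokenizer is only a stand-in for FAST, which compresses action chunks before quantization, and the sample fields are hypothetical.

```python
import numpy as np

def actions_to_tokens(actions, n_bins=256, low=-1.0, high=1.0):
    """Simplified stand-in for an action tokenizer such as FAST.

    `actions` is a (T, D) chunk of continuous, normalized robot actions; each
    value is discretized into one of `n_bins` bins and written out as a short
    token string that can appear inside a VQA-style answer. (FAST itself
    compresses the chunk before quantization; this per-step binning is only a
    sketch of the general idea.)
    """
    clipped = np.clip(np.asarray(actions, dtype=float), low, high)
    bins = np.round((clipped - low) / (high - low) * (n_bins - 1)).astype(int)
    return " ".join(f"<act_{b}>" for b in bins.flatten())

# Hypothetical action-VQA sample built from a Bridge trajectory segment.
sample = {
    "image": "bridge/traj_0007/frame_012.jpg",
    "question": "<image>\nWhat actions should the robot take to "
                "'put the spoon in the pot'?",
    # Dummy 8-step, 7-DoF action chunk used purely for illustration.
    "answer": actions_to_tokens(np.zeros((8, 7))),
}
```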

We compare three settings of VLM fine-tuning:
  1. Baseline: no fine-tuning of the VLM at all;
  2. Freeze Vision FT: fine-tuning only the LLM (vision encoder kept frozen);
  3. Unfreeze Vision FT: fine-tuning both the LLM and the vision encoder.
Each of these backbones is then trained into a standard VLM4VLA policy, with the vision encoder either frozen or unfrozen during VLA training.

| VLM fine-tuning setting (Qwen3VL-4B) | SimplerBridge ↑ (vision encoder frozen during VLA training) | SimplerBridge ↑ (vision encoder unfrozen during VLA training) |
|---|---|---|
| Baseline | 27.6 | 56.3 |
| Freeze Vision FT | 28.0 (+0.4) | 56.3 (+0.0) |
| Unfreeze Vision FT | 45.7 (+18.1) | 59.4 (+3.1) |

The results reveal a critical insight: the necessity of fine-tuning the vision encoder stems from a "semantic gap" rather than from simulation artifacts, as VLM features optimized for reasoning lack the fine-grained representations required for control. While VLM pretraining remains indispensable for generalization (avoiding the performance collapse seen when training from scratch), the learning trajectories of VLMs and VLAs eventually diverge into different regions. This divergence, illustrated in the figure below, explains why a pronounced gap persists between the two despite their initial alignment, and why specific fine-tuning strategies are needed to bridge the difference between multimodal understanding and robotic manipulation.

Network Diagram