VLM4VLA

Revisiting Vision-Language Models in Vision-Language-Action Models

¹Tsinghua University, ²Qwen Team, Alibaba Inc.

We propose VLM4VLA, a unified training and evaluation framework designed for the systematic study of Vision-Language Models' impact on Vision-Language-Action model performance.

VLM4VLA Framework Overview


Abstract

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters, enabling fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are a poor predictor of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual modules, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action planning.

Unified Evaluation Framework

Build VLM4VLA, a scalable and fair evaluation framework that integrates different VLMs into VLAs in a unified and lightweight manner.

Comprehensive Experimental Study

Conduct comprehensive experiments to study the influence of VLM backbone on embodied manipulation tasks, covering VLM architecture, post-training fine-tuning data, and vision modules.

Practical Insights

Analyze experimental results to provide practical insights, offering a reference for backbone selection and performance baselines for the VLA community.

πŸ” Most Surprising Finding

We find that the performance requirements for VLMs in embodied manipulation tasks do not fully align with their VQA capabilities. Specifically, and contrary to common expectations, VLMs that perform well on general VQA benchmarks are not necessarily better when used in VLAs. Furthermore, across a range of auxiliary Embodied-QA tasks, we discover that fine-tuning on most of them leads to performance degradation in the resulting VLA.


Study Design

🎯 Fairness and Reproducibility: We employ a consistent model architecture and training/testing settings across multiple simulation environments to ensure fair and reproducible comparisons.
⚡ Minimalist Design: We encapsulate VLMs within a simple yet effective VLA framework (sketched below), thereby minimizing the influence of complex, extraneous policy designs on the comparison.
🧠 Leveraging Inherent Knowledge: The VLA design fully leverages the inherent knowledge of the VLM. Crucially, we ensure that the input sequence format is consistent with what each VLM was exposed to during its instruction-SFT phase. We exclude any robotic priors beyond vision and language, such as proprioceptive state, tactile feedback, or environmental rewards.
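For concreteness, a minimal sketch of this kind of lightweight adaptation is shown below in PyTorch. It is illustrative only and not the exact VLM4VLA implementation: the learnable action-query tokens, the MLP action head, and the assumption that the backbone's forward() accepts `inputs_embeds` and returns hidden states are our own simplifications.

```python
import torch
import torch.nn as nn

class VLM4VLAPolicy(nn.Module):
    """Wrap a pretrained VLM with a small set of new learnable parameters.

    Illustrative sketch only: it assumes a Hugging Face-style backbone whose
    forward() accepts `inputs_embeds` and can return per-token hidden states,
    and it predicts a short chunk of continuous actions with a small MLP head.
    """

    def __init__(self, vlm, hidden_size, action_dim, chunk_len=8, n_queries=8):
        super().__init__()
        self.vlm = vlm  # pretrained VLM backbone, fine-tuned as usual downstream
        # New learnable parameters: action query tokens + a small regression head.
        self.action_queries = nn.Parameter(0.02 * torch.randn(n_queries, hidden_size))
        self.action_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, inputs_embeds, attention_mask):
        # `inputs_embeds` holds the image + instruction tokens in exactly the
        # chat format the VLM saw during its instruction-SFT phase.
        b, n_q = inputs_embeds.shape[0], self.action_queries.shape[0]
        queries = self.action_queries.unsqueeze(0).expand(b, -1, -1)
        embeds = torch.cat([inputs_embeds, queries], dim=1)
        mask = torch.cat([attention_mask, attention_mask.new_ones(b, n_q)], dim=1)
        hidden = self.vlm(inputs_embeds=embeds, attention_mask=mask,
                          output_hidden_states=True).hidden_states[-1]
        # Read out the query positions and regress a chunk of continuous actions.
        q_hidden = hidden[:, -n_q:, :].mean(dim=1)
        return self.action_head(q_hidden).view(b, self.chunk_len, self.action_dim)
```

The point of the sketch is that only the query embeddings and the regression head are new; everything else is the unmodified pretrained VLM.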

Network Diagram

Experiments and Analysis

To ensure the reproducibility and fairness of our experiments, we test in three simulation environments, selecting the most challenging scenarios as our evaluation benchmarks: Calvin ABC-D, SimplerEnv Bridge, and Libero-Long.

1. VLM4VLA Performance Comparison

| Model (VLM Backbone) | Size | Calvin ABC-D ↑ | SimplerEnv ↑ | Libero-10 ↑ |
|---|---|---|---|---|
| Expert VLA Models | | | | |
| OpenVLA (Llama-2) | 7.7B | 2.548 | 4.2 | 53.7 |
| pi0 (Paligemma-1) | 3.1B | 3.509 | 60.4 | 46.0 |
| VLM4VLA Models | | | | |
| Qwen2.5VL-3B | 3.8B | 3.856 | 48.0 | 43.0 |
| Qwen2.5VL-7B | 8.3B | 4.057 | 46.9 | 45.0 |
| Qwen3VL-2B | 2.1B | 4.142 | 49.0 | 55.8 |
| Qwen3VL-4B | 4.4B | 3.943 | 56.3 | 44.4 |
| Qwen3VL-8B | 8.8B | 4.035 | 58.3 | 46.2 |
| Qwen3VL-30B-A3B | 31.1B | 4.075 | 44.8 | 46.8 |
| Paligemma-1 | 2.9B | 3.506 | 55.3 | 44.2 |
| Paligemma-2 | 3.0B | 3.406 | 57.3 | 46.2 |
| Kosmos-2 | 1.7B | 3.096 | 60.4 | 55.0 |
VLM Performance Comparison

Plotting VLM scores on multiple general-purpose QA benchmarks on the x-axis (representing VLM capability) against VLA performance in each simulation environment on the y-axis, and fitting a line between the two, we observe no clear positive correlation between VLM capability and VLA performance. This differs from previous expectations in the field.
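As a minimal sketch of this analysis, the linear fit and correlation can be computed as follows; the scores below are placeholders, not our measured values.

```python
import numpy as np

# Placeholder values for illustration only -- NOT the measured scores reported above.
vlm_benchmark_score = np.array([55.0, 60.0, 62.0, 68.0, 71.0, 74.0])  # x: general VQA capability
vla_success_rate    = np.array([48.0, 44.0, 56.0, 47.0, 50.0, 46.0])  # y: downstream VLA performance

# Least-squares linear fit y ~= a*x + b and Pearson correlation.
a, b = np.polyfit(vlm_benchmark_score, vla_success_rate, deg=1)
r = np.corrcoef(vlm_benchmark_score, vla_success_rate)[0, 1]
print(f"slope={a:.3f}, intercept={b:.1f}, Pearson r={r:.3f}")
# A |r| close to 0 (as we observe across benchmarks) indicates no clear
# positive correlation between VLM capability and VLA performance.
```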




2. Impact of Auxiliary Tasks

We study the impact of different VLM auxiliary tasks on VLA performance. Recent work has proposed using robotic data to construct VQA datasets for improving VLM backbones, but few studies have investigated whether this additional continual finetuning actually benefits VLAs in downstream tasks. We construct or collect several SFT tasks for the VLM, including VQA datasets and generation tasks; each is summarized below, and a sketch of a representative sample format follows the summaries.

RoboPoint

A pointing-task dataset collected in simulation. Given an image and a target requirement, the model must output 2D coordinates that satisfy it. Contains 1.432M samples.

Vica-332k

A spatial understanding dataset constructed from RGB-D datasets. It covers a wide range of capabilities, including size estimation, position understanding, distance estimation, and so on.

BridgeVQA

A spatial understanding question-answering dataset annotated from Bridge-v2, Fractal, and Calvin ABC data using VQASynth.

Robo2VLM

An action-oriented question-answering dataset built from 176k real robot trajectories, containing 667k VQA pairs.

RoboBrain2

A large-scale embodied VQA dataset and a VLM finetuned from Qwen2.5VL-7B. The tasks include pointing, planning, and trajectory marking.

Omni-Generation

A setting that integrates a diffusion model into Qwen2.5VL-7B and trains jointly on image generation, depth-map generation, and semantic-segmentation-map generation.
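As an illustration of how such auxiliary data plugs into a standard instruction-SFT pipeline, a single pointing-style sample might be packed as follows; the field names, chat layout, and coordinate convention are hypothetical and do not reproduce the exact schema of any dataset above.

```python
# Hypothetical pointing-style SFT sample in a chat/instruction format.
# Field names, the <image> placeholder, and the normalized (x, y) coordinate
# convention are illustrative only, not the exact schema of the datasets above.
sample = {
    "image": "episodes/ep_0001/frame_042.png",
    "conversations": [
        {"role": "user",
         "content": "<image>\nPoint to a free spot on the table "
                    "to the left of the red mug."},
        {"role": "assistant",
         "content": "(0.31, 0.62)"},  # answer given as a normalized 2D point
    ],
}
```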

Auxiliary Tasks Performance Results

Overall, all models underperform the original baseline, with most exhibiting only a slight degradation. For Qwen2.5VL-3B, the model finetuned on Vica-332k performs better than those finetuned on the other datasets. This could be attributed to the dataset's broad coverage and diverse task types, which may prevent the model from overfitting to a narrow set of capabilities and consequently degrading others.

Conclusion: Existing embodied VQA-style tasks do not offer a clear benefit for training end-to-end VLAs to execute downstream manipulation tasks. This suggests that VLAs may require broad, general capabilities, beyond just embodied skills, to perform well on downstream tasks.



3. Importance of Different VLM Modules

We find that freezing the vision encoder during VLM4VLA training leads to significant performance degradation for all models on both the Calvin and Simpler benchmarks. This strongly suggests that finetuning the vision encoder is crucial when adapting a VLM into a VLA.

| Model | Size | Calvin ABC-D ↑ | SimplerBridge ↑ |
|---|---|---|---|
| Qwen2.5VL-3B | 3.8B | 3.856 | 48.00 |
| + freeze vision encoder | 3.1B | 2.855 (-1.001) | 23.95 (-24.05) |
| + freeze word embedding | 3.4B | 3.849 (-0.007) | 46.88 (-1.12) |
| Qwen2.5VL-7B | 8.3B | 4.057 | 46.75 |
| + freeze vision encoder | 7.6B | 2.823 (-1.234) | 25.50 (-21.25) |
| + freeze word embedding | 7.8B | 3.874 (-0.183) | 48.96 (+2.21) |
| Paligemma-1 | 2.9B | 3.506 | 55.25 |
| + freeze vision encoder | 2.5B | 0.495 (-3.011) | 13.25 (-42.00) |
| + freeze word embedding | 2.7B | 3.485 (-0.021) | 52.25 (-3.00) |
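For reference, a minimal sketch of how these freezing ablations can be implemented on a Hugging Face-style backbone; the `model.visual` attribute follows Qwen2.5VL-style models, and attribute names differ across VLMs.

```python
import torch.nn as nn

def freeze_module(module: nn.Module) -> None:
    """Disable gradient updates for every parameter of a submodule."""
    for p in module.parameters():
        p.requires_grad = False

def apply_ablation(model: nn.Module, mode: str) -> nn.Module:
    """Freeze one VLM component before VLA training.

    Assumes a Hugging Face-style VLM: `model.visual` is the vision tower on
    Qwen2.5VL-style models, and get_input_embeddings() returns the word
    embedding table. Attribute names may differ for other backbones.
    """
    if mode == "freeze_vision_encoder":
        freeze_module(model.visual)
    elif mode == "freeze_word_embedding":
        freeze_module(model.get_input_embeddings())
    return model
```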


4. Analysis of the Gap Between VLM and VLA

We hypothesize that the visual gap may stem from the following two factors:

  1. Real images vs. simulated renderings (Real to Sim): During pretraining, VLMs are exposed to relatively few tabletop simulation renderings. As a result, the vision encoder (e.g., ViT) may lack effective high-level semantic representations for simulated images encountered in manipulation.
  2. Vision-language understanding vs. low-level action control: The visual features encoded by the VLM’s vision encoder are better aligned with language-output objectives typical of QA-style tasks, whereas low-level action control in robotics requires different visual cues and representations.
We aim to demonstrate that Factor 2 is also a significant source of the gap by incorporating action information during the VLM fine-tuning stage. Specifically, we employ the FAST action tokenizer to encode actions from the Bridge dataset, constructing a VQA dataset enriched with action control information, and use it to fine-tune the VLM. Subsequently, following the VLM4VLA protocol, we train this fine-tuned VLM on continuous actions and evaluate it in the Simpler-Bridge environment.
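A simplified sketch of this data construction is shown below; the per-step binning tokenizer is only a stand-in for FAST, which compresses action chunks before quantization, and the sample fields are hypothetical.

```python
import numpy as np

def actions_to_tokens(actions, n_bins=256, low=-1.0, high=1.0):
    """Simplified stand-in for an action tokenizer such as FAST.

    `actions` is a (T, D) chunk of continuous, normalized robot actions; each
    value is discretized into one of `n_bins` bins and written out as a short
    token string that can appear inside a VQA-style answer. (FAST itself
    compresses the chunk before quantization; this per-step binning is only a
    sketch of the general idea.)
    """
    clipped = np.clip(np.asarray(actions, dtype=float), low, high)
    bins = np.round((clipped - low) / (high - low) * (n_bins - 1)).astype(int)
    return " ".join(f"<act_{b}>" for b in bins.flatten())

# Hypothetical action-VQA sample built from a Bridge trajectory segment.
sample = {
    "image": "bridge/traj_0007/frame_012.jpg",
    "question": "<image>\nWhat actions should the robot take to "
                "'put the spoon in the pot'?",
    # Dummy 8-step, 7-DoF action chunk used purely for illustration.
    "answer": actions_to_tokens(np.zeros((8, 7))),
}
```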

We compare three settings of VLM fine-tuning:
  1. Baseline: no fine-tuning of the VLM at all;
  2. Freeze Vision FT: fine-tuning only the LLM (vision encoder kept frozen);
  3. Unfreeze Vision FT: fine-tuning both the LLM and the vision encoder.
Each of these backbones is then trained into a standard VLM4VLA policy, with the vision encoder either frozen or unfrozen during VLA training.

| VLM fine-tuning setting (Qwen3VL-4B) | SimplerBridge ↑ (vision encoder frozen during VLA training) | SimplerBridge ↑ (vision encoder unfrozen during VLA training) |
|---|---|---|
| Baseline | 27.6 | 56.3 |
| Freeze Vision FT | 28.0 (+0.4) | 56.3 (+0.0) |
| Unfreeze Vision FT | 45.7 (+18.1) | 59.4 (+3.1) |

The results reveal a critical insight: the necessity of fine-tuning the vision encoder stems from a "semantic gap" rather than from simulation artifacts, as VLM features optimized for reasoning lack the fine-grained representations required for control. While VLM pretraining remains indispensable for generalization (avoiding the performance collapse seen when training from scratch), the learning trajectories of VLMs and VLAs eventually diverge into different regions. This divergence, illustrated in the figure below, explains why a pronounced gap persists between the two despite their initial alignment, and why specific fine-tuning strategies are needed to bridge the difference between multimodal understanding and robotic manipulation.

Network Diagram