Unified-Action-Model · Robot Manipulation

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

UAM separates semantic understanding and control-oriented visual dynamics into two cooperating streams, allowing end-to-end VLA training without freezing the VLM or replaying auxiliary vision-language data.

Jianke Zhang*1 Yuanfei Luo*2 Yucheng Hu*1 Xiaoyu Chen1 Yanjiang Guo1 Ziyang Liu2 Hongbin Xu2 Tian Lan2 Jianyu Chen1,S

1Tsinghua University · 2ByteDance Seed

*Equal contribution · SCorresponding author

Overview of the embodiment tax, the UAM solution, and the Dorsal Expert visual-dynamics bridge.

Abstract

Reducing the embodiment tax.

Vision-language-action models are typically built by fine-tuning a pretrained vision-language model on action data. We show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax.

Core question

Can a VLA retain the general-purpose semantic capability of its underlying VLM without freezing parameters and without relying on auxiliary vision-language data?

Inspired by biological two-stream vision, UAM introduces a parallel Dorsal Expert as a control-oriented visual pathway. Initialized from a pretrained generative model and trained with visual-dynamics supervision, the Dorsal Expert absorbs visuomotor adaptation while the semantic VLM remains useful for recognition, language grounding, and instruction following.

With no parameter freezing, no gradient stopping, and no auxiliary vision-language co-training, UAM retains over 95% of the underlying VLM's multimodal capability while achieving the strongest average success among compared variants on OOD manipulation tasks.

>95% VLM capability retained
<5% Average embodiment tax
3k Real robot trajectories
0 VL replay or frozen VLM weights

Forgetting Problem

Action fine-tuning turns VLM competence into a bottleneck.

UAM starts from a measurement: if a VLA is initialized from a strong VLM and then trained only on robot actions, how much of the original multimodal ability remains?

Observation

Action tuning pays an embodiment tax.

We measure forgetting as the relative drop in VLM benchmark score after action tuning. Freeze-VLM keeps semantics but hurts action accuracy; unfrozen +MoT and +MLP improve control, yet sharply reduce VLM score across Qwen2.5 and PaliGemma.

$$\Delta = 1 - \frac{S(f_{\mathrm{VLA}})}{S(f_{\mathrm{VLM}})}$$
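As a minimal sketch of how the forgetting metric above could be computed, the snippet below evaluates the relative drop for one benchmark and averages it over several. The helper names, the per-benchmark averaging, and the scores are all assumptions made for illustration; they are not taken from the paper.

```python
def embodiment_tax(vla_score: float, vlm_score: float) -> float:
    """Relative drop: 1 - S(f_VLA) / S(f_VLM)."""
    if vlm_score <= 0:
        raise ValueError("base VLM score must be positive")
    return 1.0 - vla_score / vlm_score


def average_tax(vla_scores: dict, vlm_scores: dict) -> float:
    """Mean per-benchmark tax over benchmarks present in both dicts.

    Averaging per-benchmark ratios is an assumption; other aggregations
    (for example, normalizing a summed score) are equally plausible.
    """
    shared = sorted(set(vla_scores) & set(vlm_scores))
    if not shared:
        raise ValueError("no shared benchmarks")
    taxes = [embodiment_tax(vla_scores[k], vlm_scores[k]) for k in shared]
    return sum(taxes) / len(taxes)


# Hypothetical scores, for illustration only (not results from the paper).
base = {"bench_a": 80.0, "bench_b": 50.0}
tuned = {"bench_a": 76.0, "bench_b": 49.0}
print(round(average_tax(tuned, base), 4))
```

A tax of 0 means the action-tuned model matches the base VLM on that benchmark; a tax near 1 means the capability has collapsed.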

The diagnosis is a shared-pathway bottleneck: even MoT still asks the semantic encoder to carry language grounding, object semantics, pose, layout, interaction state, and dynamics.

Forgetting measurement comparing Freeze-VLM, MoT, and MLP VLA couplings.
Measuring the embodiment tax: unfrozen VLA training improves action accuracy but substantially reduces the VLM score.

Method

From a representational bottleneck to a Dorsal Expert.

The fix is not simply to add more parameters. UAM gives the VLA a second visual pathway dedicated to control-oriented visual dynamics, so the semantic VLM no longer has to absorb all action gradients by itself.

01

Semantic Expert

The pretrained VLM keeps language-grounded objects, attributes, spatial concepts, OCR, and instruction semantics.

02

Dorsal Expert

A parallel visual expert receives observations directly and produces control-oriented tokens for scene state and change.

03

Action Expert

The policy attends to both token streams through MoT routing, combining task grounding with control-relevant state.

UAM multi-expert framework with semantic, dorsal, and action experts.
UAM macro-architecture and Dorsal Expert design space.
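The three-expert layout can be pictured with a deliberately simplified sketch: one action query attends over the concatenation of semantic tokens and dorsal tokens. This is not the UAM implementation. It assumes single-head attention with no learned projections, and the function names (`attend`, `action_step`) and toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


def action_step(action_query, semantic_tokens, dorsal_tokens):
    """Attend over both streams at once; tokens serve as keys and values."""
    context = semantic_tokens + dorsal_tokens  # semantic first, then dorsal
    return attend(action_query, context, context)


# Toy 2-D tokens, for illustration only.
semantic = [[1.0, 0.0], [0.0, 1.0]]
dorsal = [[1.0, 1.0]]
out, weights = action_step([1.0, 0.0], semantic, dorsal)
print(len(weights), round(sum(weights), 6))
```

Because both streams sit in the same attention context, the split between them is decided by the attention weights rather than by a hand-written gate.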

Dorsal Expert Sweep

What makes a good Dorsal Expert?

The paper compares Dorsal Expert designs by two criteria: whether they improve action generalization, and whether they reduce pressure on the semantic expert measured by the forgetting metric.

Variant 1

Random init

Adding a same-sized but uninitialized second expert is not enough. It is close to the 2-expert baseline in simulation, but weaker on real-world tasks.

Variant 2

VLM init

A semantic prior gives the Dorsal path a useful start, and visual-token input is better than query-only input. It performs similarly to the 2-expert baseline on in-domain tasks, but worse on real-world generalization tasks.

Variant 3b: UAM

Generative init + visual dynamics

The adopted design uses a generative unified-multimodal initialization and an auxiliary visual-dynamics objective, making the Dorsal Expert load-bearing for control.

Dorsal Expert design sweep across real-world and simulated tasks.
Dorsal Expert design sweep: capacity alone is insufficient; a generative visual prior plus visual-dynamics supervision gives the strongest trade-off.
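One way to read "an auxiliary visual-dynamics objective" is as an extra loss term added to the action loss. The sketch below assumes mean-squared error for both terms, a predicted future visual feature as the dynamics target, and a scalar `dynamics_weight`; none of these choices are stated in the source, and the function names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    if not pred or len(pred) != len(target):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(action_pred, action_target,
               dynamics_pred, dynamics_target,
               dynamics_weight=1.0):
    """Action loss plus a weighted auxiliary visual-dynamics loss.

    Assumed form: L = L_action + dynamics_weight * L_dynamics.
    """
    action_loss = mse(action_pred, action_target)
    dynamics_loss = mse(dynamics_pred, dynamics_target)
    total = action_loss + dynamics_weight * dynamics_loss
    return total, action_loss, dynamics_loss


# Toy values, for illustration only.
total, a_loss, d_loss = joint_loss(
    action_pred=[0.0, 1.0], action_target=[0.0, 0.0],
    dynamics_pred=[1.0, 1.0], dynamics_target=[0.0, 1.0],
    dynamics_weight=0.5,
)
print(total, a_loss, d_loss)
```

Under this reading, the dynamics term gives the Dorsal Expert its own training signal, so it is pushed to model scene change rather than acting as spare capacity.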

Automatic Routing

UAM separates what from where and how.

The attention-map analysis explains why the design works. During action generation, the action expert draws on semantic tokens for task-relevant entities and on dorsal tokens for robot state, interaction regions, and broader scene context.

Semantic attention

Concentrates on target objects, goal regions, and language-grounded scene elements.

Dorsal attention

Shifts toward the robot arm, contact regions, and global visual state needed for action execution.

Attention maps showing semantic and dorsal experts attend to different regions during action generation.
Representation analysis: UAM naturally routes semantic-centric and dynamics-centric visual signals into different expert streams.
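A routing claim like this can be checked numerically by measuring how much of each expert's attention falls inside a region of interest. The sketch below is one possible way to do that, assuming attention maps are available as 2-D grids over image patches and regions are given as sets of (row, col) cells. The maps, regions, and the `region_mass` helper are hypothetical and do not reproduce the paper's analysis.

```python
def region_mass(attn_map, region):
    """Fraction of total attention mass on the (row, col) cells in `region`."""
    total = sum(sum(row) for row in attn_map)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    inside = sum(attn_map[r][c] for r, c in region)
    return inside / total


# Hypothetical 2x3 patch-level attention maps, for illustration only.
semantic_map = [[0.60, 0.10, 0.10],
                [0.10, 0.05, 0.05]]
dorsal_map = [[0.05, 0.05, 0.10],
              [0.10, 0.10, 0.60]]

target_region = {(0, 0)}  # patch covering the target object (assumed)
arm_region = {(1, 2)}     # patch covering the robot arm (assumed)

for name, attn in [("semantic", semantic_map), ("dorsal", dorsal_map)]:
    print(name,
          round(region_mass(attn, target_region), 3),
          round(region_mass(attn, arm_region), 3))
```

If the routing described above holds, the semantic map should score higher on the target region and the dorsal map higher on the arm and contact regions.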

Core Idea

The essence of UAM.

UAM keeps the VLM trainable, but changes what action tuning has to overwrite. A semantic stream preserves the pretrained "what" knowledge, a generative Dorsal stream learns visual dynamics for "where/how" control, and the action stream learns to combine them end-to-end without frozen weights, gradient blocking, or auxiliary vision-language replay.

Experiments

Semantic retention becomes action generalization.

After selecting Variant 3b as UAM, the experiments ask whether the semantic VLM is still competent and whether that retained competence helps real-robot generalization.

Multimodal Understanding

UAM preserves the VLM after action-only training.

Trained for 30,000 steps on 3k ALOHA trajectories with no vision-language co-training, UAM retains more than 95% of the original VLM capability and remains competitive across standard multimodal benchmarks.

| MMMU | MME-P | MME-S | MMBench | MM-Vet | MathVista | MMStar | TextVQA |
| 53.7 | 1607 | 2289 | 83.7 | 63.4 | 68.2 | 61.3 | 84.2 |

This contrasts with standard action-only VLA fine-tuning, where the VLM capability can collapse, and with co-training methods that still depend on extra multimodal data.

Success rates on out-of-distribution real-world manipulation tasks.
OOD real-world manipulation: UAM improves generalization under novel objects, distractors, compositions, and multilingual or code-mixed instructions.

Demo

Real-robot generalization examples.

We fine-tune UAM directly on 3,000 demonstrations, without any pretraining or co-training, and evaluate out-of-distribution generalization across unseen objects, novel object-target compositions, and instruction variation.

Pick up the yellow block (three configurations, one with multiple colored blocks)
Place the green apple to the pink plate (two configurations)
Put the mango into the transparent plate
Grasp the mangosteen into the purple plate (three configurations)
Grasp the baozi into the blue bowl
Grasp the 包子 into the blue bowl (code-mixed instruction; 包子 = baozi)
把包子放到蓝色的碗里 (Chinese instruction: "Put the baozi into the blue bowl")
Put the pear onto the blue block
Put the orange cup onto the yellow block
Pick up the scissors

Citation

BibTeX

@article{zhang2026uam,
  title   = {UAM: A Dual-Stream Perspective on Forgetting in VLA Training},
  author  = {Zhang, Jianke and Luo, Yuanfei and Hu, Yucheng and Chen, Xiaoyu and Guo, Yanjiang and Liu, Ziyang and Xu, Hongbin and Lan, Tian and Chen, Jianyu},
  journal = {arXiv preprint arXiv:2605.15735},
  year    = {2026}
}