Robotic world models should be selected for policy-relevant semantics, not only pixel reconstruction. Semantic encoders preserve the action, task-progress, and OOD-robustness signals that matter for planning and policy evaluation; VAE-style spaces remain competitive on low-level visual metrics, but visual fidelity alone does not explain downstream success.
We compare reconstruction-aligned and semantic latent spaces under a fixed action-conditioned latent diffusion world-model protocol on BridgeV2. Semantic spaces such as V-JEPA 2.1, Web-DINO, and SigLIP 2 generally improve CEM planning, action recoverability, task-success classification, VLA-in-the-loop success, and OOD robustness, while reconstruction latents mainly retain photometric advantages.
Latent Space Choice Changes What the World Model Preserves
The study varies only the representation space used by the latent transition model. Each encoder is evaluated along three axes: planning and downstream policy performance, pixel fidelity and geometry, and latent representation quality.
Across DiT-S models, semantic encoders move the frontier on action recoverability, task-success separability, planning error, VLA success, and robustness, while reconstruction latents remain strongest where photometric reconstruction is the dominant criterion.
Takeaway: A robotic world model latent space should preserve controllable, task-relevant state changes. Rendering plausible video is necessary, but not sufficient.
Policy and Planning Favor Semantic Latents
In policy-in-the-loop evaluation, OpenVLA rollouts are generated inside each world model and scored with VLM judges. Semantic encoders improve consensus success rate, Borda rank, robustness to distractor objects and OOD instructions, and CEM action recovery.
Consensus success rates are reported from two VLM judges, InternVL 3.5 and Qwen 3.6. The paper also cross-checks the trend with CEM planning, IDM probes, success probes, and visual/geometric metrics, so the conclusion is not driven by any single judge.
Takeaway: Semantic latent spaces provide a better proxy environment for policy evaluation because their dynamics remain more aligned with actions and task outcomes.
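The CEM action-recovery evaluation above can be sketched in a few lines. This is a minimal, hypothetical sketch: `dynamics` and `success_probe` stand in for the paper's learned latent transition model and task-success probe, and all interface names and hyperparameters here are assumptions, not the paper's actual code.

```python
# Hypothetical sketch of CEM planning in a frozen latent space.
# `dynamics(z, a)` and `success_probe(z)` are assumed interfaces for the
# learned transition model and the task-success probe.
import numpy as np

def cem_plan(z0, dynamics, success_probe, horizon=8, act_dim=7,
             pop=64, elites=8, iters=4, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        acts = rng.normal(mu, sigma, size=(pop, horizon, act_dim))
        scores = np.empty(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                z = dynamics(z, acts[i, t])   # autoregressive latent rollout
            scores[i] = success_probe(z)      # score the terminal latent
        # Refit the sampling distribution on the elite candidates.
        top = acts[np.argsort(scores)[-elites:]]
        mu, sigma = top.mean(axis=0), top.std(axis=0) + 1e-6
    return mu  # planned action sequence (mean of the elite distribution)
```

Because everything runs in latent space, this loop is cheap relative to pixel-space rollouts; the quality of the plan depends on how well the latent dynamics preserve action-conditioned state changes.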
Action and Task Semantics Survive Better in Representation Space
The transition model operates in latent space, so we directly probe generated latents. Inverse-dynamics models recover action chunks more reliably from semantic latents, and task-success probes retain more instruction-conditioned outcome information after rollout generation.
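An action-recoverability probe of the kind described above can be sketched as a simple regression: fit an inverse-dynamics model from consecutive latents to actions and report held-out R². This is an illustrative sketch only; the paper's actual IDM is a learned model over action chunks, and the linear probe here is a simplifying assumption.

```python
# Hypothetical sketch of an action-recoverability probe: fit a linear
# inverse-dynamics model a_t ~ W [z_t; z_{t+1}; 1] and report held-out R^2.
import numpy as np

def action_r2(z_t, z_next, actions, train_frac=0.8):
    X = np.concatenate([z_t, z_next], axis=1)              # (N, 2*latent_dim)
    X = np.concatenate([X, np.ones((len(X), 1))], axis=1)  # bias column
    n = int(train_frac * len(X))
    # Least-squares fit on the training split.
    W, *_ = np.linalg.lstsq(X[:n], actions[:n], rcond=None)
    pred = X[n:] @ W
    resid = ((actions[n:] - pred) ** 2).sum()
    total = ((actions[n:] - actions[n:].mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total  # held-out coefficient of determination
```

Running the same probe on ground-truth encoder latents and on world-model-generated latents shows how much action information survives rollout generation in each representation space.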
Encoder spaces induce different action-aligned trajectory geometries. V-JEPA 2.1 and Web-DINO show stronger action recoverability than reconstruction-aligned spaces, and this advantage largely persists after world-model generation.
Over long autoregressive rollouts, semantic latent spaces remain competitive beyond the training horizon, indicating that their advantage is not simply a short-horizon artifact.
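The long-horizon check above can be sketched as follows: roll the latent model autoregressively past the training horizon and track per-step alignment with ground-truth encoder latents. `dynamics` and the trajectory data are hypothetical stand-ins, and cosine similarity is one assumed choice of alignment metric.

```python
# Hypothetical sketch of the long-horizon rollout check: compare
# autoregressively generated latents against ground-truth encoder latents.
import numpy as np

def rollout_similarity(z0, actions, dynamics, z_true):
    sims = []
    z = z0
    for a, zt in zip(actions, z_true):
        z = dynamics(z, a)  # autoregressive step: feed predictions back in
        cos = z @ zt / (np.linalg.norm(z) * np.linalg.norm(zt) + 1e-8)
        sims.append(cos)
    return np.array(sims)  # per-step alignment with the ground truth
```

A representation whose advantage is only short-horizon would show this curve collapsing shortly after the training horizon; the reported result is that semantic spaces stay competitive beyond it.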
A Practical Recipe for Semantic-Space Diffusion World Models
The recipe keeps the comparison controlled: freeze the image encoder, use a fixed action-conditioned DiT transition backbone, and vary only the latent representation and decoder path. High-dimensional semantic spaces are made diffusion-friendly with a dimension-dependent noise-schedule shift, an optional S-VAE adapter, and a lightweight wide DDT head for native semantic features.
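A dimension-dependent noise-schedule shift of this kind can be sketched as a timestep warp, in the spirit of resolution-dependent timestep shifting for rectified flows: higher-dimensional latents lose less information at a given noise level, so sampling is pushed toward higher noise. The square-root scaling and base dimension below are illustrative assumptions, not the paper's exact rule.

```python
# Hypothetical sketch of a dimension-dependent noise-schedule shift.
# alpha > 1 warps timesteps toward the high-noise end for large latents.
import numpy as np

def shifted_timesteps(t, latent_dim, base_dim=4096):
    # Assumed scaling rule: shift grows with sqrt of the dimension ratio.
    alpha = np.sqrt(latent_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)
```

The warp is the identity when `latent_dim == base_dim`, preserves the endpoints t=0 and t=1, and monotonically increases every interior timestep when the latent dimension exceeds the base.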
Adapter dimension, transition-model scale, and multi-view training each affect the visual-policy tradeoff. The strongest pattern remains stable: semantic spaces are more useful when planning and policy performance matter.
Recipe: start with a strong semantic encoder, add compact adapter latents when decoding and efficiency matter, and evaluate with policy-facing metrics instead of choosing only by pixel fidelity.
Interactive Rollout Examples
Each section uses a single framed carousel. Switch examples with the arrows or dots; the play button synchronizes the videos in the active example only.
Rollouts with ground truth actions
Multi-view rollouts
VLA Policy attempts
VLA Policy with OOD objects
Citation
@article{nilaksh2026reconstruction,
title={Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models},
author={Nilaksh and Jha, Saurav and Zholus, Artem and Chandar, Sarath},
year={2026}
}