VLA-JEPA × SO-101 — Training Results

Training steps: 5,000 (~6.7 epochs)
Action loss: 0.020 (↓ 98.4% from step 10)
World-model loss: 0.138 (↓ 28% and stable)
Eval MSE score: 0.0015 (↓ 58% vs. step 2900)
Eval MAE score: 0.026 (↓ 59% vs. step 2900)
Model size: 2.77B params, fully trainable

Training loss over time

Sampled from training logs across the 5,000-step run (H100, batch 16, ~3h33m)
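As a sanity check, the run figures quoted in this report can be tied together with a few lines of arithmetic. This is a minimal sketch that assumes one training sample corresponds to one dataset frame and that "12K frames" means exactly 12,000. Both are assumptions, so the results are approximate.

```python
# Figures quoted in this report. The frame count is rounded ("12K"),
# so everything derived from it is approximate.
steps = 5_000
batch_size = 16
dataset_frames = 12_000  # assumption: "12K frames" taken as exactly 12,000

# Assumption: each sample in a batch is one dataset frame.
samples_seen = steps * batch_size
epochs = samples_seen / dataset_frames

# Wall-clock time reported as ~3h33m.
wall_clock_s = 3 * 3600 + 33 * 60
steps_per_s = steps / wall_clock_s

print(f"samples seen: {samples_seen}")
print(f"epochs: {epochs:.2f}")
print(f"throughput: {steps_per_s:.2f} steps/s")
```

Under these assumptions the epoch count comes out close to the ~6.7 epochs reported for the run.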

Eval checkpoint improvement

Periodic in-loop eval metrics, step 2900 → step 5000
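The "↓ N%" figures in this report are relative reductions between two checkpoints. The sketch below shows that calculation and its inverse. The helper names `relative_drop` and `implied_before` are hypothetical, and any starting value back-calculated from the rounded numbers here is only an estimate, not a logged metric.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after` (0.58 means 58% lower)."""
    if before == 0:
        raise ValueError("relative drop is undefined when the starting value is 0")
    return (before - after) / before


def implied_before(after: float, drop: float) -> float:
    """Starting value implied by a final value and a fractional drop."""
    if not 0 <= drop < 1:
        raise ValueError("drop must be in [0, 1)")
    return after / (1.0 - drop)


# Rough step-2900 values implied by the rounded step-5000 figures.
# These are estimates from rounded inputs, not values read from the logs.
approx_mse_2900 = implied_before(0.0015, 0.58)
approx_mae_2900 = implied_before(0.026, 0.59)
print(f"implied eval MSE at step 2900: ~{approx_mse_2900:.4f}")
print(f"implied eval MAE at step 2900: ~{approx_mae_2900:.3f}")
```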

Model vs. baseline (100 sampled frames)

Mean absolute error in real joint-degree units (lower is better). Compared against a "predict the dataset mean action" baseline.
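The comparison described above can be sketched as follows. This is an illustrative version, not the evaluation script used for this report: the function names are hypothetical, the toy data is invented for the example, and it assumes predictions and targets have already been converted back to degrees. Which split the dataset mean was computed on is not stated in the report, so the sketch simply takes it from whatever actions are passed in.

```python
from statistics import fmean


def per_joint_mae(pred, target):
    """MAE per joint. `pred` and `target` are lists of frames,
    each frame a list of joint angles in degrees."""
    n_joints = len(target[0])
    return [
        fmean(abs(p[j] - t[j]) for p, t in zip(pred, target))
        for j in range(n_joints)
    ]


def mean_action_baseline(actions, n_frames):
    """Baseline that predicts the per-joint mean action for every frame."""
    n_joints = len(actions[0])
    mean_action = [fmean(a[j] for a in actions) for j in range(n_joints)]
    return [list(mean_action) for _ in range(n_frames)]


def improvement(model_mae, baseline_mae):
    """Fractional MAE reduction vs. the baseline, per joint."""
    return [(b - m) / b if b > 0 else 0.0 for m, b in zip(model_mae, baseline_mae)]


# Toy data (invented for illustration): 4 frames, 2 joints, in degrees.
target = [[0.0, 10.0], [10.0, 12.0], [20.0, 10.0], [30.0, 12.0]]
pred = [[1.0, 10.0], [9.0, 12.0], [21.0, 11.0], [29.0, 12.0]]

model_mae = per_joint_mae(pred, target)
baseline_mae = per_joint_mae(mean_action_baseline(target, len(target)), target)
gains = improvement(model_mae, baseline_mae)

for j, (m, b, g) in enumerate(zip(model_mae, baseline_mae, gains)):
    print(f"joint {j}: model {m:.2f} deg, baseline {b:.2f} deg, improvement {g:.0%}")
```

A mean-action baseline is a useful floor because it scores well on joints that barely move, so beating it on a joint indicates the model is following that joint's motion rather than its average position.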

Joint	Model MAE (°)	Mean-baseline MAE (°)	Improvement
Overall

Predicted vs. recorded motion: held-out episode trace

Episode 40 ("pink lego brick into the transparent box"), sampled every 15 frames. Two joints shown where the model tracks the true trajectory shape most clearly.

What these numbers do and don't show

Real, from-scratch training. The 2.77B-parameter model (action head plus world model) was trained only on 50 SO-101 episodes (12K frames). No prior VLA-JEPA pretraining checkpoint was used; the only pretrained components were the generic Qwen3-VL-2B and V-JEPA2 backbones.
First joint-space embodiment for this architecture. The original VLA-JEPA paper and every config it ships (Droid, LIBERO, Bridge, RT-1, FR3) use an 8-dimensional end-effector (EEF) pose for action and state. SO-101 is a 6-DOF joint-space arm, a fundamentally different action representation that the codebase did not support. Getting it running required writing a new data config and embodiment mapping, plus fixes to the dataset format.
Training loss and eval metrics improved substantially over the run. The model beat a trivial mean-action baseline on every joint, and 5 of 6 joints track the shape of the recorded trajectory in a held-out episode trace.
Gripper actuation is the weak point: the model did not learn to open or close the gripper and predicts a near-constant value. The arm joints, by contrast, show real learned structure.
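One simple way to flag a near-constant prediction like the gripper's is to compare how much the predicted joint moves with how much the recorded joint moves over an episode. The sketch below is an assumption about how such a check could look, not the analysis used for this report. The function names, the 0.1 threshold, and the toy traces are all hypothetical.

```python
from statistics import pstdev


def motion_ratio(pred_series, true_series):
    """Ratio of predicted to recorded standard deviation for one joint.
    Near 0: the prediction barely moves. Near 1: similar range of motion."""
    true_sd = pstdev(true_series)
    if true_sd == 0:
        return float("nan")  # the recorded joint never moved; ratio is undefined
    return pstdev(pred_series) / true_sd


def is_near_constant(pred_series, true_series, threshold=0.1):
    """True if the prediction moves far less than the recording.
    The threshold is an arbitrary illustrative choice. A NaN ratio
    compares as False, so a joint that never moved is not flagged."""
    return motion_ratio(pred_series, true_series) < threshold


# Toy traces in degrees, invented for illustration.
gripper_true = [0.0, 0.0, 40.0, 40.0, 0.0, 0.0]    # opens, then closes
gripper_pred = [18.0, 19.0, 18.0, 19.0, 18.0, 19.0]  # hovers around one value
arm_true = [0.0, 10.0, 20.0, 30.0, 20.0, 10.0]
arm_pred = [1.0, 9.0, 21.0, 28.0, 19.0, 11.0]

print("gripper near-constant:", is_near_constant(gripper_pred, gripper_true))
print("arm joint near-constant:", is_near_constant(arm_pred, arm_true))
```

A check like this complements MAE: a constant prediction can still score a moderate MAE on a joint that sits at one value most of the time, whereas the motion ratio exposes that nothing was actually learned about the movement.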