| Metric | Value | Note |
|---|---|---|
| Training steps | 5,000 | ~6.7 epochs |
| Action loss | 0.020 | ↓ 98.4% from step 10 |
| World-model loss | 0.138 | ↓ 28% & stable |
| Eval MSE score | 0.0015 | ↓ 58% vs step 2900 |
| Eval MAE score | 0.026 | ↓ 59% vs step 2900 |
| Model size | 2.77B | params, fully trainable |

Training loss over time
Sampled from training logs across the 5,000-step run (H100, batch 16, ~3h33m)
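The ~6.7-epoch figure can be reproduced from the run settings quoted on this page: 5,000 steps at batch size 16 over roughly 12K frames. The check below is a minimal sketch, assuming one frame per training sample, no gradient accumulation, and a round 12,000 frames (the page only says "12K"). The per-step time is likewise an approximation derived from the ~3h33m wall-clock figure.

```python
# Back-of-the-envelope check of the run statistics quoted on this page.
steps = 5_000
batch_size = 16
dataset_frames = 12_000             # assumption: "12K frames" taken as a round 12,000
wall_clock_s = 3 * 3600 + 33 * 60   # ~3h33m, as quoted in the chart caption

samples_seen = steps * batch_size          # total training samples consumed
epochs = samples_seen / dataset_frames     # passes over the dataset
seconds_per_step = wall_clock_s / steps    # average wall-clock time per step

print(f"{samples_seen} samples seen, about {epochs:.1f} epochs")
print(f"about {seconds_per_step:.2f} s per step")
```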
Eval checkpoint improvement
Periodic in-loop eval metrics, step 2900 → step 5000
Model vs. baseline (100 sampled frames)
Mean absolute error in real joint-degree units (lower is better). Compared against a "predict the dataset mean action" baseline.
| Joint | Model MAE (°) | Mean-baseline MAE (°) | Improvement |
|---|---|---|---|
| Overall |
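
A comparison of this kind can be sketched as follows. This is not the evaluation script behind the table; it is a minimal illustration that assumes predictions and recorded actions are already in degrees, that the baseline is the per-joint mean of the training actions, and that the SO-101 joints carry the names listed in `JOINTS`. The arrays are synthetic stand-ins, so the printed numbers are not results from this run.

```python
import numpy as np

# Assumed joint names for a 6-DOF SO-101 arm; adjust to the dataset's own keys.
JOINTS = ["shoulder_pan", "shoulder_lift", "elbow_flex",
          "wrist_flex", "wrist_roll", "gripper"]

def per_joint_mae(pred, target):
    """Mean absolute error per joint for arrays of shape (frames, joints)."""
    return np.abs(pred - target).mean(axis=0)

def compare_to_mean_baseline(pred, target, train_actions):
    """Return model MAE, mean-baseline MAE, and % improvement per joint."""
    # Baseline: always predict the per-joint mean of the training actions.
    baseline = np.broadcast_to(train_actions.mean(axis=0), target.shape)
    model_mae = per_joint_mae(pred, target)
    base_mae = per_joint_mae(baseline, target)
    improvement = 100.0 * (1.0 - model_mae / base_mae)
    return model_mae, base_mae, improvement

# Synthetic stand-in data (NOT the run's data): 100 sampled frames, 6 joints.
rng = np.random.default_rng(0)
train_actions = rng.uniform(-90, 90, size=(1000, 6))
target = rng.uniform(-90, 90, size=(100, 6))
pred = target + rng.normal(0, 5, size=target.shape)

model_mae, base_mae, improvement = compare_to_mean_baseline(pred, target, train_actions)
for name, m, b, imp in zip(JOINTS, model_mae, base_mae, improvement):
    print(f"| {name} | {m:.2f} | {b:.2f} | {imp:.1f}% |")
print(f"| Overall | {model_mae.mean():.2f} | {base_mae.mean():.2f} | "
      f"{100.0 * (1.0 - model_mae.mean() / base_mae.mean()):.1f}% |")
```

The "Overall" row here averages the per-joint MAEs, which is one of several reasonable definitions.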
Predicted vs. recorded motion: held-out episode trace
Episode 40 ("pink lego brick into the transparent box"), sampled every 15 frames. Two joints shown where the model tracks the true trajectory shape most clearly.
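The page does not say how "tracks the trajectory shape" was judged. One hedged way to quantify it is a per-joint Pearson correlation between predicted and recorded traces after sampling every 15 frames. The sketch below uses that assumed metric on synthetic two-joint data; `subsample` and `shape_correlation` are hypothetical helper names.

```python
import numpy as np

def subsample(trace, stride=15):
    """Keep every `stride`-th frame of a (frames, joints) trace."""
    return trace[::stride]

def shape_correlation(pred, target):
    """Pearson correlation per joint; NaN where either series is constant."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom > 0, (p * t).sum(axis=0) / denom, np.nan)

# Synthetic 300-frame episode with two joints (NOT data from episode 40).
frames = np.arange(300)
target = np.stack([30 * np.sin(frames / 40), 20 * np.cos(frames / 60)], axis=1)
pred = 0.8 * target + 2.0  # same shape, different scale and offset

target_s = subsample(target)
pred_s = subsample(pred)
corr = shape_correlation(pred_s, target_s)
print(len(target_s), corr)
```

Correlation is insensitive to scale and offset, so it measures shape agreement only; it should be read alongside MAE rather than in place of it.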
What these numbers do and don't show
- Real, from-scratch training. The 2.77B-parameter action head + world model was trained entirely on 50 SO-101 episodes (12K frames) — no prior VLA-JEPA pretraining checkpoint was used, only generic pretrained Qwen3-VL-2B and V-JEPA2 backbones.
- First joint-space embodiment for this architecture. The original VLA-JEPA paper and every config it ships (Droid, LIBERO, Bridge, RT-1, FR3) use 8-dim EEF-pose action/state. SO-101 is 6-DOF joint-space — a fundamentally different action representation the codebase didn't support. Getting it running required a new data config, embodiment mapping, and dataset-format fixes built from scratch.
- Loss and eval metrics improved substantially across training: action loss fell 98.4% from step 10, and eval MSE and MAE fell 58% and 59% between steps 2900 and 5000. The model beats a trivial mean-action baseline on every joint, and 5 of 6 joints track the shape of the real motion trajectory in a held-out episode trace.
- Gripper actuation is the weak point — it did not learn to open/close, predicting a near-constant value. Everything else (arm joints) shows real learned structure.
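
A near-constant prediction such as the gripper's can be flagged automatically by comparing how much each predicted joint moves against how much the recorded joint moves. The sketch below is one assumed diagnostic, not part of the original evaluation: `range_ratio`, `flag_collapsed`, and the 0.1 threshold are illustrative choices, and the data is synthetic.

```python
import numpy as np

def range_ratio(pred, target):
    """Predicted / recorded standard deviation per joint, shape (joints,)."""
    target_std = target.std(axis=0)
    return pred.std(axis=0) / np.maximum(target_std, 1e-8)

def flag_collapsed(pred, target, threshold=0.1):
    """True for joints whose predictions barely move relative to the recording."""
    return range_ratio(pred, target) < threshold

# Synthetic example (NOT the run's data): one arm joint and one gripper.
t = np.linspace(0, 2 * np.pi, 200)
arm_true = 40 * np.sin(t)
grip_true = np.where(t > np.pi, 60.0, 5.0)    # closed, then open
target = np.stack([arm_true, grip_true], axis=1)

arm_pred = 0.9 * arm_true                     # follows the motion
grip_pred = np.full_like(t, 30.0)             # stuck at a constant value
pred = np.stack([arm_pred, grip_pred], axis=1)

flags = flag_collapsed(pred, target)
print(dict(zip(["arm_joint", "gripper"], flags.tolist())))
```

A check like this complements the baseline comparison: a constant prediction can still score a lower MAE than the mean-action baseline if it happens to sit closer to the median of the recorded values, so beating the baseline alone does not show that a joint learned to move.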