I was salty that our team didn't win the UC Berkeley AI Hackathon, so I rebuilt the project we had hoped to build: a JEPA world model for robots.

See the results

VLA-JEPA on SO-101 — Training Results

Qwen3-VL-2B + frozen V-JEPA2 + DiT-B action head · fine-tuned end-to-end on an H100
Training steps: 5,000 (~6.7 epochs)
Action loss: 0.020 (↓ 98.4% from step 10)
World-model loss: 0.138 (↓ 28% and stable)
Eval MSE score: 0.0015 (↓ 58% vs. step 2900)
Eval MAE score: 0.026 (↓ 59% vs. step 2900)
Model size: 2.77B params, fully trainable

Training loss over time

Sampled from training logs across the 5,000-step run (H100, batch 16, ~3h33m)
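As a sanity check on the run summary, the arithmetic below derives the implied dataset size and throughput from the reported figures. It is a rough sketch: it assumes batch 16 is the effective batch size (no gradient accumulation) and that one sample corresponds to one training example. Neither is stated in the post, and the inputs are rounded.

```python
# Back-of-the-envelope numbers implied by the run summary.
# Inputs are the rounded figures reported in the post, so outputs are approximate.
STEPS = 5_000
BATCH_SIZE = 16                     # assumed to be the effective batch size
EPOCHS = 6.7                        # reported as "~6.7 epochs"
WALL_CLOCK_S = 3 * 3600 + 33 * 60   # reported as "~3h33m"

samples_seen = STEPS * BATCH_SIZE
samples_per_epoch = samples_seen / EPOCHS        # implied dataset size
seconds_per_step = WALL_CLOCK_S / STEPS
samples_per_second = samples_seen / WALL_CLOCK_S

print(f"samples seen:       {samples_seen:,}")
print(f"implied epoch size: ~{samples_per_epoch:,.0f} samples")
print(f"time per step:      ~{seconds_per_step:.2f} s")
print(f"throughput:         ~{samples_per_second:.1f} samples/s")
```

Under those assumptions the run saw 80,000 samples at roughly 2.6 seconds per step, which implies a dataset of a little under 12,000 samples per epoch.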

Eval checkpoint improvement

Periodic in-loop eval metrics, step 2900 → step 5000
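The percentage drops quoted for the action loss and the eval metrics can be inverted to estimate the values they started from. The helper below is a small sketch of that arithmetic. `value_before` is a hypothetical name, and the results are only as precise as the rounded figures in the post. The world-model loss is left out because the post does not say which step its 28% drop is measured from.

```python
def value_before(current: float, pct_drop: float) -> float:
    """Invert "current is pct_drop percent lower than before" to recover "before"."""
    if not 0 <= pct_drop < 100:
        raise ValueError("pct_drop must be in [0, 100)")
    return current / (1 - pct_drop / 100)


# (final value, reported percentage drop), copied from the summary figures.
reported = {
    "action loss, vs step 10": (0.020, 98.4),
    "eval MSE, vs step 2900": (0.0015, 58.0),
    "eval MAE, vs step 2900": (0.026, 59.0),
}

implied = {name: value_before(final, drop) for name, (final, drop) in reported.items()}

for name, start in implied.items():
    print(f"{name}: implied earlier value ~{start:.4g}")
```

For example, a final action loss of 0.020 after a 98.4% drop implies a loss of about 1.25 at step 10.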

Model vs. baselines (100 sampled frames)

Mean absolute error in real joint-degree units (lower is better). Compared against a "predict the dataset mean action" baseline.
Joint | Model MAE (°) | Mean-baseline MAE (°) | Improvement
Overall
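The comparison against a "predict the dataset mean action" baseline can be sketched as follows. This is an illustrative, stdlib-only version rather than the evaluation code behind the numbers. The function names are hypothetical, the toy arrays are made up, and it assumes predictions and targets are already in joint degrees, with rows as frames and columns as joints.

```python
from statistics import fmean


def per_joint_mae(pred, target):
    """Mean absolute error per joint; rows are frames, columns are joints."""
    n_joints = len(target[0])
    return [fmean(abs(p[j] - t[j]) for p, t in zip(pred, target))
            for j in range(n_joints)]


def mean_action(actions):
    """Per-joint mean over a reference set of recorded actions."""
    n_joints = len(actions[0])
    return [fmean(row[j] for row in actions) for j in range(n_joints)]


def compare_to_mean_baseline(pred, target, reference_actions):
    """Return (model MAE, baseline MAE, improvement in percent), all per joint."""
    constant_row = mean_action(reference_actions)
    baseline_pred = [constant_row] * len(target)   # same guess for every frame
    model = per_joint_mae(pred, target)
    base = per_joint_mae(baseline_pred, target)
    improvement = [100 * (1 - m / b) if b > 0 else float("nan")
                   for m, b in zip(model, base)]
    return model, base, improvement


# Toy data (made up): 4 frames, 2 joints, values in degrees.
target = [[0.0, 10.0], [10.0, 20.0], [20.0, 30.0], [30.0, 40.0]]
pred = [[1.0, 12.0], [9.0, 18.0], [21.0, 32.0], [29.0, 38.0]]

model_mae, base_mae, gain = compare_to_mean_baseline(pred, target, target)
overall_model = fmean(model_mae)   # "Overall" row: average across joints
overall_base = fmean(base_mae)
print(model_mae, base_mae, gain, overall_model, overall_base)
```

In a real evaluation the mean action would normally come from the training split rather than from the frames being scored. The toy call passes `target` as the reference set only to keep the example short.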

Predicted vs. recorded motion — held episode trace

Episode 40 ("pink lego brick into the transparent box"), sampled every 15 frames. The two joints shown are those where the model tracks the true trajectory shape most clearly.

What these numbers do and don't show