
Overfit#11: V-JEPA 2


Motivation

Yann LeCun claims that LLMs are a dead end on the path towards AGI. According to him, we need new architectures that better mimic the way we think, learn, understand, and plan.

In 2022, he wrote a position paper introducing his vision of what a smarter AI could look like. JEPA [1] (Joint Embedding Predictive Architecture) is his attempt in that direction. Developed at Meta FAIR, JEPA is a family of non-generative models, trained with a slightly different SSL procedure, to learn world understanding and planning.

Two weeks ago (June 2025), Meta released a new model in the JEPA family: V-JEPA 2, which is the focus of this post.

Why did this new JEPA model catch my attention?

We have been hearing about JEPA models for two years now. I never took the time to deep-dive into the papers, because it still seemed early-stage. I wanted to wait a bit for the JEPA ecosystem to mature, and see how the research community reacted to these fresh ideas ...

V-JEPA 2 caught my attention because it is the first JEPA model with real-world applications. Image-JEPA and V-JEPA were mostly experimental papers, exploring how far this SSL approach could go. V-JEPA 2 is the continuation of these papers and showcases super cool applications, like conditioning an LLM for video question answering or zero-shot robot control.

Finally, pretraining models on large amounts of text or images has showcased interesting emergent properties. I was curious to study what properties emerge when a temporal dimension is added into the mix. Time is in fact needed to learn concepts like motion, gravity, planning ... useful concepts and properties for a world model.

Revolutionary approach, or simply a new flavor of the good old auto-encoders? Let's find out! 👇

Video from Meta AI blog post.

Recommended reading

Plenty of super talented writers have already covered JEPA in depth. Their content is top-notch. If you have some time, I highly recommend reading these first, and then coming back to this post:

If your time is limited, no worries. Let me give you a dense summary of JEPA, Image-JEPA and Video-JEPA.


JEPA in a nutshell

👉 To keep things simple, I will explain JEPA by introducing I-JEPA, its image application. We will generalize back to JEPA and other modalities later.

What are the limitations of Large Image Models?

Large image models are mostly trained in the signal space (pixels), leading them to pay attention to irrelevant details during learning. For instance, most image embedding models are trained to reconstruct masked or blurred images (MAE, iBOT). Yet asking a model to reconstruct a pixel-perfect image is an impossible task, because some details are almost random (the reflections of light on water, the exact position of blades of grass). Instead, we would prefer them to focus on learning meaningful, higher-level representations (embeddings/concepts).

Moreover, these models have no planning capabilities. Given an observation of an environment (an image), they are poor predictors of what comes next.

What is I-JEPA's solution?

To build a powerful world model capable of abstraction, I-JEPA [2] is trained in the latent space. If you are familiar with self-distillation techniques like DINOv2 (I wrote a dedicated post on it, check it out 🌟), the idea will feel natural: the model basically learns to predict the embeddings of the masked tokens of an image.

Image from Meta AI blog post.

I-JEPA is based on the Vision Transformer (ViT) architecture and therefore processes images as sequences of patches. The model is made of three ViT submodules:

  • a context encoder \(f_\theta\): the masked image \(x\), also called the context, is encoded using a context ViT encoder. The outputs are \(s_x\).
  • a target encoder \(f_{\bar{\theta}}\): the original image \(y\) (the bottom dog) is encoded using a target ViT encoder. The outputs are \(s_y\).

    Usually, the target encoder is an EMA of the context encoder, as in DINO. This stabilizes training and avoids collapse.

  • a predictor \(g_\phi\) that predicts the embeddings of the masked tokens of the image. Its outputs are \(\hat{s}_y\).

The regression loss is computed over the embeddings: \(\mathcal{L}(s_y, \hat{s}_y)\).
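Concretely, the I-JEPA paper writes this as (roughly) an average squared L2 distance between predicted and target patch embeddings, summed over the \(M\) masked target blocks \(B_i\):

\[
\mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \sum_{j \in B_i} \left\| \hat{s}_{y_j} - s_{y_j} \right\|_2^2
\]

Only the context encoder and the predictor receive gradients from this loss; the target encoder is updated by EMA.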


As you can see, the model respects the requirements we listed in the introduction:

  • The model learns abstract concepts: it learns to reconstruct embeddings, not pixels. This encourages learning abstractions rather than focusing on details.
  • The encoding logic (encoder) is separated from the reconstruction logic (predictor). The encoder focuses on compressing the information that is known, while the predictor's goal is to infer what is missing.

For now, the usefulness of the predictor may not be obvious; we will see with V-JEPA why decoupling the encoder from the predictor is so powerful for planning.

I-JEPA showcased impressive performance on representation learning image benchmarks. It outperforms iBOT and MAE (two famous training techniques), for a training budget around one order of magnitude smaller. This improved sample efficiency is probably due to the fact that I-JEPA does not try to reconstruct pixel-perfect images, but instead focuses on predicting high-level semantic embeddings. By avoiding the need to capture every visual detail (like reflections or textures), the model can dedicate its capacity to learning meaningful abstractions, which accelerates training and improves generalization.

Image from I-JEPA paper [2].

As we saw in previous posts, evaluating an embedding model is inherently ambiguous. The simplest way to validate the quality of the embeddings is usually to evaluate the performance of a simple classifier (like a logistic regression) trained directly on them.
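As a reminder of what such a linear probe looks like, here is a minimal sketch (my own example, not from the paper; the `.npy` files are placeholders for embeddings you would extract yourself with the frozen encoder):

```python
# Minimal linear-probe sketch: a logistic regression trained on frozen embeddings.
# The .npy files are placeholders: embeddings you would extract yourself with the
# frozen encoder (e.g. by average-pooling the patch embeddings of each image).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.load("train_embeddings.npy")  # shape (n_train, d)
y_train = np.load("train_labels.npy")      # shape (n_train,)
X_test = np.load("test_embeddings.npy")
y_test = np.load("test_labels.npy")

probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
probe.fit(X_train, y_train)
print("Linear-probe accuracy:", probe.score(X_test, y_test))
```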

For I-JEPA, the authors also wanted to check whether the learned features were purely abstract, or whether they still contained enough information to recover the masked pixels. They therefore trained a decoder model to reconstruct the masked pixels from the latent embeddings. They observed that this decoder (trained only for evaluation purposes) was able to reconstruct the images, even though the embeddings were never trained on a reconstruction task. Of course, the decoder has poor reconstruction metrics compared to models pretrained to reconstruct, but it showcases that the learned features are indeed meaningful.

Evaluation examples. Generations from the embeddings of I-JEPA.
Image from I-JEPA paper [2].

Generalization to JEPA

As we saw, I-JEPA is specific to images. But it can easily be generalized to any type of input, as long as it can be split into (masked) tokens.

  • Take an input sequence.
  • Mask some tokens.
  • Encode both the masked and unmasked sequences.
  • Predict the embeddings of the masked tokens using the predictor. The predictor also gets the mask as input condition for guidance.
  • The loss is simply computed over the embeddings.

Image Credit: LeCun's position paper [1].
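To make the recipe above concrete, here is a minimal PyTorch-style sketch of one JEPA training step. This is my own heavily simplified rendition, not the official code: the real context encoder only processes the visible tokens (here they are simply zeroed out), the real predictor receives learnable mask tokens with positional information, and the linear layers below are toy stand-ins for the ViT modules.

```python
import torch
import torch.nn.functional as F

def ema_update(target_encoder, context_encoder, tau=0.996):
    # The target encoder is an exponential moving average of the context encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(tau).add_((1.0 - tau) * p_c)

def jepa_step(context_encoder, target_encoder, predictor, tokens, mask):
    # tokens: (B, L, D) patch/tubelet tokens; mask: (B, L) bool, True = masked.
    with torch.no_grad():                        # stop-gradient: the teacher is never trained directly
        s_y = target_encoder(tokens)             # target embeddings of the full sequence
    context = tokens * (~mask).unsqueeze(-1)     # crude "context": masked tokens zeroed out
    s_x = context_encoder(context)               # embeddings of the visible context
    s_y_hat = predictor(s_x)                     # predict an embedding for every position
    return F.mse_loss(s_y_hat[mask], s_y[mask])  # regression loss on masked positions only

# Toy usage with linear stand-ins for the ViT modules:
D = 64
context_enc, target_enc, predictor = (torch.nn.Linear(D, D) for _ in range(3))
tokens, mask = torch.randn(8, 196, D), torch.rand(8, 196) < 0.75
loss = jepa_step(context_enc, target_enc, predictor, tokens, mask)
loss.backward()
ema_update(target_enc, context_enc)
```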

V-JEPA 2

Now that you have the fundamentals of JEPA, let's jump directly to V-JEPA 2, which is basically the same as V-JEPA [3] but with more data and bigger ViT models.

V-JEPA 2 [4] is a billion-parameter model, pretrained on over a million hours of video data, using JEPA's self-supervised training procedure.

Like its little brother I-JEPA, V-JEPA 2 works on patchified inputs. Let \(N\) be the number of frames in a video. The video tensor \((N, H, W)\) is split into a sequence of \(L\) tokens called tubelets, where each tubelet has shape \((2, 16, 16)\).

The 2 means that each token spans two consecutive frames. As usual in transformers, the positions of the patches are incorporated through positional encodings (3D RoPE).
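As a quick sanity check on the resulting token count (a toy calculation; the frame count and resolution below are illustrative, not necessarily the paper's exact training configuration):

```python
# Number of tubelet tokens for a video of n_frames at height x width resolution,
# with tubelets of shape (2, 16, 16).
def num_tubelets(n_frames, height, width, t=2, p=16):
    return (n_frames // t) * (height // p) * (width // p)

# Toy example: a 64-frame clip at 256x256 gives 32 * 16 * 16 = 8192 tokens.
print(num_tubelets(64, 256, 256))  # 8192
```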

Image from V-JEPA 2 paper [4].

The training procedure is similar to I-JEPA. The predictor is trained to recover the embeddings of masked tubelets from the embeddings of the unmasked tubelets. The teacher model is an EMA of the student, and a stop-gradient operation blocks the gradient flow to the teacher. A very high masking ratio is used, with around 90% of the pixels masked. Such a high ratio is likely possible because information in video is highly redundant: you can (and must) mask much more to reduce information leakage and force the model to learn rather than copy.

After training, we get a strong video patch embedder (encoder) and a versatile patch predictor.

More about the paper ...

👾 Github: https://github.com/facebookresearch/vjepa2

📚 Arxiv: https://arxiv.org/abs/2506.09985

Downstream applications

After SSL pre-training, V-JEPA 2 is essentially an embedding model. It is not useful out-of-the-box ... but it can easily become so after post-training.

The paper showcases multiple applications of V-JEPA 2. In fact (see the diagram below), after task-specific post-training (in green), one can evaluate its world understanding, prediction, or planning capabilities.

I will focus on the robotic application, which I find super impressive. Check out the paper if you are interested in the other applications.

Image from V-JEPA 2 paper [4].

V-JEPA 2-AC: Robotic control

This experiment focuses on zero-shot robotic planning for pick-and-place tasks using only image goals. In simpler terms: you show the robot a picture of the goal (e.g. a ball inside a cup), and the model figures out by itself how to reach that state, without any task-specific training, supervision, or reward function.

Video from Meta AI blog post.

Let's first see how we could use V-JEPA 2 to plan and control the robot. We'll then see how to post-train it to achieve our goals.

Inference

Let’s say we are at time \(t\) with a current image \(x_t\) (e.g. the ball next to the cup), and we want the scene to become like \(x_{\text{target}}\) (e.g. the ball inside the cup).

  1. Encode the current and target frames using the V-JEPA 2 encoder:

    • \(s_t = f_\theta(x_t)\)
    • \(s_{\text{target}} = f_\theta(x_{\text{target}})\)
  2. Sample a batch of candidate actions (e.g. 100 random action trajectories like joint positions or movements).

  3. For each action \(a_k\), use the predictor to simulate the future embedding:

    • \(\hat{s}_{t+1}^{(k)} = g_\phi(s_t, a_k)\)
  4. Compute the distance between the predicted embedding and the target one:

    • \(d_k = \| \hat{s}_{t+1}^{(k)} - s_{\text{target}} \|\)
  5. Choose the action trajectory that minimizes this distance over its N-step rollout.

  6. Execute the first action of the best trajectory \(a^*\), observe the new frame, and repeat the process until convergence.

In practice, they sampled actions using the Cross-Entropy Method (CEM). At each step, a distribution over actions is updated to concentrate on those that best match the target state. Only the first action of the best trajectory is executed before re-planning.
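Here is a minimal sketch of that planning loop, under several assumptions of mine: a `predictor(s, a)` callable standing in for the action-conditioned predictor, flat NumPy latents, and made-up hyper-parameters (horizon, action dimension, number of samples).

```python
import numpy as np

def cem_plan(predictor, s_t, s_target, horizon=3, act_dim=7,
             n_samples=100, n_elites=10, n_iters=5):
    # predictor(s, a) -> predicted next latent state (action-conditioned predictor stand-in).
    # s_t, s_target: current and goal latents (1D numpy arrays in this toy version).
    mu, sigma = np.zeros((horizon, act_dim)), np.ones((horizon, act_dim))
    for _ in range(n_iters):
        # Sample candidate action trajectories around the current distribution.
        actions = mu + sigma * np.random.randn(n_samples, horizon, act_dim)
        costs = np.empty(n_samples)
        for k in range(n_samples):
            s = s_t
            for a in actions[k]:                      # roll out the predictor in latent space
                s = predictor(s, a)
            costs[k] = np.linalg.norm(s - s_target)   # "energy" of this candidate trajectory
        elites = actions[np.argsort(costs)[:n_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]  # execute only the first action, then observe and re-plan (MPC style)
```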

The goal is given via a target image. That's all!
Image from V-JEPA 2 paper [4].

What’s particularly interesting here is that we don’t predict actions directly. Instead, we sample candidate actions and use the prediction error in latent space as a proxy for energy to evaluate them. The idea is simple but powerful: the better an action helps the model predict a future state close to the target, the lower its energy. This allows a non-generative model to be used for planning, by selecting actions that minimize prediction error, without ever explicitly generating the future.

Training

V-JEPA 2, as pretrained, is not immediately usable for robot control. Why?

Because during its SSL pretraining, the predictor never learns the causal impact of actions. It only learns to predict masked parts of videos, not how actions modify the world.

To adapt the frozen V-JEPA 2 encoder for robotic control, the authors retrain the predictor to learn how actions modify the visual world. This predictor takes in a temporally interleaved sequence of (action, state, frame embedding) tuples, where frame embeddings are extracted via the frozen encoder, and the state/action information comes from robot proprioception. The model is trained to predict the embedding of the next frame from the current (action, state, frame), and minimizes an \(\ell_1\) loss between the predicted and ground-truth embeddings.

In other words:

  • They freeze the encoder (trained on 1M hours of video).
  • Then, they post-train the predictor on 62 hours of robot interaction data (from the DROID dataset), where each frame is paired with the robot's joint velocities and states.
  • The updated predictor now takes as input:
    • an encoded state,
    • and a candidate action,
    • and predicts the embedding of the next frame.

Image from V-JEPA 2 paper [4].

In addition to predicting the next step, the predictor is trained with a rollout loss to simulate short multi-step futures, mimicking planning scenarios. Starting from an initial state and frame embedding, the predictor autoregressively simulates two steps ahead using a sampled sequence of actions. The predicted final embedding is compared to the ground truth embedding at \(T+1\), again using an \(\ell_1\) loss. This dual objective — teacher forcing + rollout loss — ensures the model not only predicts accurately step-by-step, but also remains stable and consistent over multiple planning steps.
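A rough sketch of this dual objective, assuming a `predictor(s, a)` that maps a latent state and an action to the next latent, and frame embeddings already extracted with the frozen encoder; this is my simplification, not the paper's exact interleaved formulation:

```python
import torch
import torch.nn.functional as F

def action_conditioned_losses(predictor, s, actions, rollout_steps=2):
    # s: (T+1, D) frozen-encoder embeddings of consecutive frames.
    # actions: (T, A) actions taken between consecutive frames.

    # Teacher forcing: predict s[t+1] from the ground-truth s[t] and a[t].
    preds = torch.stack([predictor(s[t], actions[t]) for t in range(len(actions))])
    teacher_forcing_loss = F.l1_loss(preds, s[1:])

    # Rollout: start from s[0] and feed the predictor its own predictions for a few steps.
    s_hat = s[0]
    for t in range(rollout_steps):
        s_hat = predictor(s_hat, actions[t])
    rollout_loss = F.l1_loss(s_hat, s[rollout_steps])

    return teacher_forcing_loss + rollout_loss
```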

What makes this application stand out

You end up with a zero-shot robotic planning system capable of solving pick-and-place tasks from image goals — without requiring dense rewards, task-specific demonstrations, or hand-labeled annotations.

The only requirements are:

  • a pretrained V-JEPA 2 encoder (frozen),
  • a small dataset of robot interactions (videos + action labels),
  • and some clever inference using CEM.

That’s SSL with real-world impact, and I find that pretty exciting.

Limitations

This SSL approach doesn't generalize to all robotic arms (yet?). It still requires collecting videos and robot positions for the particular robot you want to control.

This is cheaper than building a fully annotated dataset, but that remains a big limitation.


Concluding remarks

JEPA is an interesting approach that brings a bit of fresh air into the brute-force Transformer/next-token-prediction wave. On my side, I am especially curious to see how JEPA architectures will evolve over time to embrace the whole of Yann LeCun's position paper. In fact, the current architectures only address understanding and planning. What about the configurator? ... (once again, I warmly recommend reading the deep dives quoted above).

I hope you enjoyed reading this technical deep dive. If so, feel free to share it and connect 😊

References


  1. JEPA Paper: LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Version 0.9.2, 2022-06-27.

  2. I-JEPA Paper: Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., ... & Ballas, N. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15619-15629). 

  3. V-JEPA Paper: Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., ... & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471. 

  4. V-JEPA 2 Paper: Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., ... & Ballas, N. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv preprint arXiv:2506.09985.