[HN Gopher] V-JEPA 2 world model and new benchmarks for physical...
       ___________________________________________________________________
        
       V-JEPA 2 world model and new benchmarks for physical reasoning
        
       Author : mfiguiere
       Score  : 205 points
       Date   : 2025-06-11 14:43 UTC (8 hours ago)
        
 (HTM) web link (ai.meta.com)
 (TXT) w3m dump (ai.meta.com)
        
       | artificialprint wrote:
       | Throw ARC-AGI 2 at it!
        
         | jadbox wrote:
         | I suspect it wouldn't help too much. This model is meant for
         | physics-based world modeling, while nearly all the problems in
         | ARC are symbolic reasoning.
        
           | artificialprint wrote:
           | I'd say world modeling can provide the foundations from which
           | symbolic reasoning can emerge, after all this is how we
           | (humans) learn it too. There are a lot of tasks in arc that
           | are grounded in simple physics
        
             | littlestymaar wrote:
             | > I'd say world modeling can provide the foundations from
             | which symbolic reasoning can emerge, after all this is how
             | we (humans) learn it too
             | 
             | As usual comparisons with humans provide little practical
             | insight for what's achievable with ML. Humans don't have to
             | learn everything from scratch like ML models do, you aren't
             | expecting ML models to learn language out of a few
             | thousands of tokens just because humans can, so similarly
             | you shouldn't expect neural networks to learn reasoning
             | from world interaction alone.
        
         | falcor84 wrote:
          | Yes, ARC-AGI 2 seems to have a lot of challenges that involve
          | (a projection of) gravity and collisions, so I'd be quite
          | interested in seeing whether it would generalize.
        
       | ldjkfkdsjnv wrote:
        | Leadership at Meta is dropping the ball with these non-LLM AI
        | model side quests.
        
         | jadbox wrote:
          | LLMs were once a side quest. I hope Meta invests more in
         | alternatives as maybe we'll find something better. If not, then
         | meta just loses a bit of R&D budget. They are still heavily
         | invested in regular LLM development, so it's not like they are
         | trading one for the other.
        
           | linguistbreaker wrote:
           | I strongly agree. FAANG has the money to do the research.
           | LLMs are far from intelligent - AGI will require a number of
           | other advances.
        
         | energy123 wrote:
         | Is this a sarcastic compliment? Diversity in research agendas
         | is very important for pushing forward the frontier even if it's
         | not good for the company investing in the high risk research.
         | Good job, to an otherwise toxic company.
        
         | rvz wrote:
         | AI research is more than just LLMs.
        
       | TheAceOfHearts wrote:
       | > With these visual subgoals, V-JEPA 2 achieves success rates of
       | 65% - 80% for pick-and-placing new objects in new and unseen
       | environments.
       | 
       | How does this compare with existing alternatives? Maybe I'm just
       | lacking proper context, but a minimum 20% failure rate sounds
        | pretty bad? The paper compares their results with older
        | approaches, which apparently had something like a 15% success
        | rate, so an 80% success rate does seem like a significant
        | improvement. If I'm reading the paper correctly, the amount of
        | time required to compute and execute each action went down
        | from 4 minutes to 16 seconds, which also seems significant.
       | 
       | Having to specify an end goal as an image seems pretty limited,
       | but at least the authors acknowledge it in the paper:
       | 
       | > Second, as mentioned in Section 4, V-JEPA 2-AC currently relies
       | upon tasks specified as image goals. Although this may be natural
       | for some tasks, there are other situations where language-based
       | goal specification may be preferable. Extending the V-JEPA 2-AC
       | to accept language-based goals, e.g., by having a model that can
       | embed language-based goals into the V-JEPA 2-AC representation
       | space, is another important direction for future work. The
       | results described in Section 7, aligning V-JEPA 2 with a language
       | model, may serve as a starting point.
       | 
       | I think it would be interesting if the authors answered whether
       | they think there's a clear trajectory towards a model that can be
       | trained to achieve a >99% success rate.
        
         | ricardobeat wrote:
         | It's important to keep some perspective: there are zero robots
         | in the wild, at the moment, that use a world model to work on
         | tasks they weren't specifically trained on. This is cutting
         | edge research and an 80% success rate is astonishing!
        
           | gyudin wrote:
           | They don't use it because it's unsafe and potentially life
           | threatening lol
        
             | dghlsakjg wrote:
             | Plenty of things are unsafe and potentially life
             | threatening, including machines with pre-programmed
             | routines that we use today. We already have robots with
             | limited intelligence interacting safely with humans in
             | workplaces.
             | 
             | This learning technology didn't exist until this moment in
             | time. That probably has more to do with why no one is using
             | it in the wild.
        
               | lukan wrote:
                | Yes, you can just add other reliable safety measures,
                | meaning if a human comes too close, the robot stops.
               | 
               | Or the robot is supervised all the time.
               | 
               | Or just operates in an area without humans.
               | 
               | But so far this is research, not market ready.
        
           | refulgentis wrote:
            | I can buy this, given a very wide meaning of "specifically
            | trained on" and handwaving a bit about "as far as _I_
            | know", but then I read the actual wording of "new objects
            | in new and unseen environments", and remember these were
            | floating around Mountain View doing tasks involving new
            | objects in novel environments years ago. Then I kinda gotta
            | give up and admit to myself I'm distorting the conversation
            | by emphasizing positivity over ground truth.
        
           | vFunct wrote:
           | I'm surprised that's not how it's already done. I'd figure
           | some of the inner layers in LLMs were already "world models"
           | and that it's the outer layers that differentiated models
           | between text vs. images/robotics/other modes...
        
             | mjburgess wrote:
              | That's what the propaganda says, but whenever we explain
              | it isn't true, an army arrives to repeat ad copy from
              | their favourite tech guru.
             | 
             | All statistical models of the kind in use are
             | interpolations through historical data -- there's no magic.
             | So when you interpolate through historical texts, your
             | model is _of_ historical text.
             | 
              | Text is not a measure of the world: to say "the sky is
              | blue" is not even reliably associated with the blueness of
              | the sky, let alone the fact that the sky isn't blue (there
              | is no sky, and the atmosphere isn't blue).
             | 
              | These models appear to "capture more" only because when
              | you interpret the text you attribute meaning/understanding
              | to it as the cause of its generation -- but that wasn't
              | the cause; this is necessarily an illusion. There is no
              | model of the world in a model of historical text -- there
              | is a model of the world in your head which you associate
              | with text, and that association is exploited when you use
              | LLMs to do more than mere syntax transformation.
             | 
              | LLMs excel most at "fuzzy retrieval" and things like
              | coding -- the latter is principally a matter of syntax,
              | and the former of recollection. As soon as you require the
              | prompt-completion to maintain "semantic integrity" with
              | non-syntactical/retrievable constraints, it falls apart.
        
               | nightski wrote:
               | I feel like you are ignoring or dismissing the word
               | "interpolating", although a better word would likely be
               | generalization. I'd make the claim that it's very hard to
               | generalize without some form of world model. It's clear
               | to me that transformers do have some form of world model,
               | although not the same as what is being presented in
               | V-JEPA.
               | 
                | One other nitpick: you confine this to "historical
                | data", although other classes of data, such as simulated
                | and generated data, are also trained on.
        
               | mjburgess wrote:
                | I didn't say generalisation, because there isn't any.
                | Inductive learning does not generalise, it interpolates
                | -- if the region of your future prediction (here, prompt
                | completion) lies on or close to the interpolated region,
                | then the system is useful.
               | 
                | Generalisation is the opposite process: hypothesising a
                | universal and finding counter-examples to constrain the
                | universal generalisation. E.g., "all fire burns" is
                | hypothesised by a competent animal upon encountering
                | fire once.
               | 
               | Inductive "learners" take the opposite approach: fire
               | burns in "all these cases", and if you have a case
               | similar to those, then fire will burn you.
               | 
               | They can look the same within the region of
               | interpolation, but look very different when you leave it:
               | all of these systems fall over quickly when more than a
               | handful of semantic constraints are imposed. This number
               | is a measure of the distance from the interpolated
               | boundary (e.g., consider this interpretation of apple's
               | latest paper on reasoning in LLMs: the "environment
               | complexity" is nothing other than a measure of
               | interpolation-dissimilarity).
               | 
               | Early modern philosophers of science were very confused
               | by this, but it's in Aristotle plain-as-day, and it's
                | also extremely well established since the 80s, as the
                | development of formal computational stats necessitated
                | making this clear: interpolation is not generalisation.
                | The former does not get you robustness to irrelevant
                | permutation (ie., generalisation); it does not permit
               | considering counterfactual scenarios (ie.,
               | generalisation); it does not give you a semantics/theory
               | of the data generating process (ie., generalisation, ie.
               | a world model).
               | 
               | Interpolation is a model _of the data_. Generalisation
               | requires a model of the _data generating process_ , the
               | former does not give you the latter, though it can appear
               | to under strong experimental assumptions of known causal
               | models.
               | 
                | Here LLMs model the structure of language-as-symbolic-
                | ordering; that structure "in the interpolated region"
                | _expresses_ reasoning, but it isn't a model _of_
                | reasoning. It's a model of reasoning as captured in
                | historical cases of it.
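                | 
                | To make that concrete, here's a toy sketch (plain numpy
                | curve fitting, nothing specific to LLMs): the fit is
                | excellent inside the region covered by the data and
                | falls apart just outside it.
                | 
                |   # A flexible model fit on [0, 2*pi] tracks the
                |   # data generating process (sin) inside that
                |   # region and diverges badly outside it.
                |   import numpy as np
                | 
                |   rng = np.random.default_rng(0)
                |   x_tr = rng.uniform(0, 2 * np.pi, 200)
                |   y_tr = np.sin(x_tr) + 0.05 * rng.normal(size=200)
                | 
                |   coeffs = np.polyfit(x_tr, y_tr, deg=9)
                | 
                |   def rmse(x):
                |       err = np.polyval(coeffs, x) - np.sin(x)
                |       return np.sqrt(np.mean(err ** 2))
                | 
                |   # inside the interpolated region: tiny error
                |   print(rmse(np.linspace(0, 2 * np.pi, 500)))
                |   # just outside it: error explodes
                |   print(rmse(np.linspace(3 * np.pi, 4 * np.pi, 500)))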
        
               | jeremyjh wrote:
               | Aren't there papers showing that there is some kind of
               | world model emerging? Like representations of an Othello
               | board that we would recognize were found and manipulated
               | successfully in a small model.
        
               | mjburgess wrote:
               | There are two follow up papers showing the
               | representations are "entangled", a euphemism for
               | statistical garbage, but I can't be bothered at the
               | moment to find them.
               | 
                | However the whole issue of Othello is a non sequitur,
                | which indicates that people involved here don't really
                | seem to understand the issue, or what a world model is.
               | 
               | A "world model" is a model of a data generating process
               | which isn't reducible-to or constituted by its measures.
               | Ie., we are concerned for the case where there's a
               | measurement space (eg., that of the height of mercury in
               | a thermometer) and a target property space (eg., that of
               | the temperature of the coffee). So that there is gap
               | between the data-as-measure and its causes. In language
               | this gap is massive: the cause of my saying, "I'm hungry"
               | may have nothing to do with my hunger, even if it often
               | does. For "scientific measuring devices", these are
               | constructed to minimize this gap as much as possible.
               | 
               | In any case, with board games and other mathematical
               | objects, there is no gap. The data _is_ the game. The
               | "board state" is an abstract object _constituted by_ all
               | possible board states. The game  "is made out of" its
               | realisations.
               | 
                | However the world isn't made out of language, nor coffee
                | made out of thermometers. So a model _of_ the data isn't
                | a model of its generating process.
               | 
                | So whether an interpolation of board states "fully
                | characterises", in some way, an abstract mathematical
                | object "the game" is so irrelevant to the question that
                | it betrays a fundamental lack of understanding of even
                | what's at issue.
               | 
               | No one is arguing that a structured interpolative model
               | (ie., one given an inductive bias by an NN architecture)
               | doesn't _express_ properties of the underlying domain in
               | its structure. The question is what happens to this model
               | _of_ the data when you have _the same data generating
               | process_ , but you arent in the interpolated region.
               | 
                | This problem cannot arise, in the limit of large data,
                | for abstract games, by their nature: e.g., a model
                | classifying the input X into legal/illegal board states
                | _is_ the game.
               | 
                | Another way of phrasing this: ML/AI textbooks often
                | begin by assuming there's a function you're
                | approximating. But in the vast majority of cases where
                | NNs are used, there is no such function -- there is no
                | function tokens -> meanings (e.g., "I am hungry" is
                | ambiguous).
               | 
                | But in the abstract math case there is a function:
                | {boards} -> {Legal, Illegal} is a function; there are no
                | ambiguous boards.
                | 
                | So: of the infinite number of f* approximations to
                | f_game, _any_ is valid in the limit len(X) -> inf. Of
                | the infinite number of f*_lang approximations to
                | f_language, _all_ are invalid (each in their own way).
        
               | jeremyjh wrote:
               | > A "world model" is a model of a data generating process
               | which isn't reducible-to or constituted by its measures.
                | > However the world isn't made out of language, nor
                | coffee made out of thermometers. So a model of the data
                | isn't a model of its generating process.
               | 
               | So is V-JEPA 2 actually generating a world model, as
                | you've defined it here? It's still just sampling data -
                | visual data, tactile feedback, etc. is all reducible to
               | quantized data. It seems like you could build useful
               | models that seem to generalize without that. For example,
               | a model could learn to stop dropping things without ever
               | developing a theory of gravity.
               | 
               | Probably I'm still misunderstanding too much for this to
               | be useful, but what I've read from you in this thread is
               | way more useful to my understanding than what I've seen
               | before.
        
               | math_dandy wrote:
               | Could you give more details about what precisely you mean
               | by interpolation and generalization? The commonplace use
               | of "generalization" in the machine learning textbooks
               | I've been studying is model performance (whatever metric
               | is deemed relevant) on new data from the training
               | distribution. In particular, it's meaningful when you're
               | modeling p(y|x) and not the generative distribution
               | p(x,y).
        
               | abtinf wrote:
               | > army arrives to repeat adcopy from their favourite tech
               | guru
               | 
               | This is painfully accurate.
               | 
               | The conversations go like this:
               | 
               | Me: "guys, I know what I'm talking about, I wrote my
               | first neural network 30 years ago in middle school, this
               | tech is cool but it isn't magic and it isn't good enough
               | to do the thing you want without getting us sued or
               | worse."
               | 
               | Them: "Bro, I read a tweet that we are on the other side
               | of the singularity. We have six months to make money
               | before everything blows up."
        
           | londons_explore wrote:
           | 80% success rate is also potentially commercially viable if
           | the task is currently being done by a human.
           | 
           | Work that was once done by 10 humans can now be done by 10
           | robots + 2 humans for the 20% failure cases, at a lower total
           | cost.
        
             | zeroxfe wrote:
             | This really depends on the failure modes. In general,
             | humans fail in predictable, and mostly safe, ways. AIs fail
             | in highly unpredictable and potentially very dangerous
             | ways. (A human might accidentally drop a knife, an AI might
             | accidentally stab you with it.)
        
               | Maxion wrote:
               | Or, if controlling a robot arm, it would stab itself
               | through the conveyer belt at full torque.
        
         | DickingAround wrote:
         | I run thousands of robots in production. We can get a very high
         | success rate but only for the task they're designed for.
         | Production robots can't pick up stuff they drop yet. And this
          | '80%' level is not actually acceptable or even state of the
          | art for just pick-and-place, but it's compelling for a robot
          | that also
         | knows how to do other things with equal quality (if JEPA does
         | that).
        
         | deepGem wrote:
         | Currently,
         | 
         | You train a VLA (vision language action) model for a specific
         | pair of robotic arms, for a specific task. The end actuator
         | actions are embedded in the model (actions). So let's say you
          | train a pair of arms to pick an apple. You cannot zero-shot it
         | to pick up a glass. What you see in demos is the result of lots
         | of training and fine tuning (few shot) on specific object types
         | and with specific robotic arms or bodies.
         | 
         | The language intermediary embedding brings some generalising
         | skills to the table but it isn't much. The vision -> language
         | -> action translation is, how do I put this, brittle at best.
         | 
          | What these guys are showing is a zero-shot approach to new
         | tasks in new environments with 80% accuracy. This is a big
         | deal. Pi0 from Physical Intelligence is the best model to
         | compare I think.
        
         | robot wrote:
         | your comment is not aligned with how science is done. For
         | discoveries you certainly work with limited approaches and
         | certainly don't know if there is a "clear trajectory".
        
       | fidotron wrote:
       | You have to wonder if the model is going to end up recreating
       | Verlet integration in there somewhere, or if it's generating a
        | pile of those optical acceleration cancellation-type heuristics
        | in neural net form.
       | 
       | It's one of those ideas I've had around for a while that if you
       | fused decent object tracking with an understanding of Verlet
       | integration you should, in principle, start being able to measure
       | all sorts of physical quantities quite easily.
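        | 
        | As a rough sketch of what I mean (toy numbers, assuming you
        | already get per-frame positions out of a tracker): position
        | Verlet says x[t+1] = 2*x[t] - x[t-1] + a*dt^2, so three
        | consecutive tracked positions let you solve for the
        | acceleration directly.
        | 
        |   # Toy sketch: recover acceleration (here gravity) from
        |   # three tracked positions by inverting the position
        |   # Verlet update x[t+1] = 2*x[t] - x[t-1] + a*dt**2.
        |   dt = 1.0 / 60.0   # assume a 60 fps tracker
        | 
        |   # synthetic "tracked" heights of a falling object (m)
        |   x = [10.0 - 0.5 * 9.81 * (i * dt) ** 2 for i in range(3)]
        | 
        |   a_est = (x[2] - 2 * x[1] + x[0]) / dt ** 2
        |   print(a_est)      # ~ -9.81 m/s^2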
        
       | nlitened wrote:
       | I imagine that Russian-speaking team members had fun with naming
       | the model V-JEPA
        
         | Tiberium wrote:
         | For the curious: "zhopa" (which "JEPA" sounds like) means "ass"
          | in Russian. Also "V" (Russian "в") means "in" (although if we
          | get into specifics, the case would need to be "zhopu" or
          | "zhope" depending on the context).
        
           | koakuma-chan wrote:
           | Also the video thumbnail:
           | 
           | J.E.P.A.
        
       | jcelerier wrote:
       | > That kind of physical intuition isn't something adults obtain
       | after years of education--young children develop this intuition
       | by observing the world around them before they can even speak in
       | full sentences.
       | 
       | I mean, it still takes them much more time than it takes to train
       | even the largest LLMs we use (a couple months)
        
         | lukan wrote:
         | But they use way less energy for it.
        
         | dist-epoch wrote:
         | In wall clock time. If you count in input tokens/pixels, humans
         | learn with orders of magnitude less input data.
        
           | logicchains wrote:
           | That's not true at all; the amount of audiovisual data a
           | human is exposed to in even just one year is incredibly vast.
            | Sixty frames per second for sixteen hours per day gives
            | over a billion frames per year, and each frame at such a
            | high resolution would be hundreds of tokens.
        
             | dist-epoch wrote:
             | Let's take your numbers:
             | 
              | Human: 1000 tok * 60 fps * 16 h * 3600 s * 365 days ~ 1.3
              | Trillion tokens / year
             | 
             | GPT-4: 13 Trillion tokens
             | 
             | Llama-3: 15 Trillion tokens
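              | 
              | Back-of-the-envelope version, in case anyone wants to
              | tweak the assumptions (tokens per frame, waking hours
              | and fps are all guesses):
              | 
              |   # "visual tokens per year" estimate; every constant
              |   # here is an assumption, not a measurement
              |   tokens_per_frame = 1000   # "hundreds", rounded up
              |   fps = 60
              |   waking_hours = 16
              |   tokens_per_year = (tokens_per_frame * fps *
              |                      waking_hours * 3600 * 365)
              |   print(f"{tokens_per_year:.2e}")  # ~1.3e12, vs
              |                                    # ~1.3e13 for GPT-4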
        
             | cluckindan wrote:
             | That's why we tokenize very early in the vision pipeline.
             | 
             | Related: https://en.wikipedia.org/wiki/Form_constant
        
             | Vetch wrote:
             | This contains a common misstep (or misgeneralization of an
             | analogy) among those who are much more familiar with
             | computers than with the brain. The brain is not digital and
             | concepts like frames per second and resolution don't make
             | much sense for vision. First, there aren't frames, neuron
             | activity is asynchronous with changes to sensory neuron
             | firing rate responding to changes in the environment or
             | according to saliency.
             | 
             | Between the non-uniformity of receptor density (eg fovea vs
             | peripheral vision but this is general across all senses),
             | dynamic receptor fields and the fact that information is
             | encoded in terms of spike rate and timing patterns across
             | neural populations, the idea of pixels in some bitmap at
             | some resolution is beyond misleading. There is no pixel
             | data, just sparsely coded feature representations capturing
             | things like edges, textures, motion, color contrast and the
             | like, already, at the retina.
             | 
             | While hundreds of trillions of photons might hit our
              | photoreceptors, >99% of that is filtered and/or compressed
             | _before_ even reaching retinal ganglion cells. Only a tiny
             | fraction, about 10 million bits /sec, of the original
             | photon signal rate is transferred through the optic nerve
             | (per eye). This pattern of filtering and attentive
             | prioritization of information in signals continues as we go
             | from sensory fields to thalamus to higher cortical areas.
             | 
             | So while we might encounter factoids like: on the order of
             | a billion bits per second of data hit photoreceptors or
              | [10Mb/s transferred](https://www.britannica.com/science/information-theory/Physio...) along optic nerves, it's
             | important to keep in mind that a lot of the intuition
             | gained from digital information processing does not
             | transfer in any meaningful sense to the brain.
        
       | momojo wrote:
       | Why is Meta investing into this research? What's the potential
       | payoff?
        
         | dyauspitr wrote:
         | Physical robots as impressive as LLMs?
        
         | aaroninsf wrote:
          | The goal is a Large Phenomenological Model.
         | 
         | A good definition of "real AGI" might be, a multimodal model
         | which understands time-based media, space, and object behavior,
         | and hence true agency.
         | 
         | Phenomenology is the philosophy of "things as they seem," not
         | "knowledge (words) about things." Seem to our senses, not
         | understood through language.
         | 
         | LLM of course trade in language tokens.
         | 
         | We can extend their behavior with front ends which convert
         | other media types into such tokens.
         | 
         | But we can do better with multimodal models which are trained
         | directly on other inputs. E.g. integrating image classifiers
         | with language models architecturally.
         | 
         | With _those_ one can sort of understand time-based media, by
         | sampling a stream and getting e.g. transcripts.
         | 
          | But again, it's even better to build time-based multimodal
          | models, which directly ingest time-based media rather than
          | sampling. (Other architectures than transformers are going to
         | be required to do this well IMO...)
         | 
         | The bootstrapping continues. This work is about training models
         | to understand world and object properties by introducing
         | agency.
         | 
         | Significant footnote: implicitly models trained to interact
         | with the world necessarily have a "self model" which interacts
         | with the "world model." Presumably they are trained to preserve
         | their expensive "self." Hmmmmm....
         | 
         | When we have a model that knows about things not just as nodes
         | in a language graph but also how such things look, and sound,
         | and moves, and "feel" (how much mass do they have, how do they
         | move, etc.)...
         | 
         | ...well, that is approaching indistinguishable from one of us,
         | at least wrt embodiment and agency.
        
         | DesiLurker wrote:
          | Possibly with their investment into AR/VR and gaming they may
          | see a pathway to creating 'physical intelligence' and tap into
          | a much bigger untapped market. I mean, isn't Robotaxi the main
          | carrot Musk's been holding in front of Tesla investors for a
          | decade or so? Physical robots may provide a more 'incremental,
          | fault-tolerant' path to applications of AI.
        
         | esafak wrote:
         | There is a world of money in AGI, and they have the resources,
         | and notably the data, to achieve it.
        
         | seydor wrote:
         | physical robots arguing endlessly with physical people
        
         | kp1197 wrote:
         | Robots that can do anything.
        
       | cubefox wrote:
       | I think the fundamental idea behind JEPA (not necessarily this
       | concrete Meta implementation) will ultimately be correct:
       | predicting embeddings instead of concrete tokens. That's arguably
       | what animals do. Next-token prediction (a probability
       | distribution over the possible next tokens) works well for the
       | discrete domain of text, but it doesn't work well for a
       | continuous domain like video, which would be needed for real-time
       | robotics.
       | 
        | For text, with a two-byte tokenizer you get 2^16 (~65,000)
       | possible next tokens, and computing a probability distribution
       | over them is very much doable. But the "possible next frames" in
       | a video feed would already be an extremely large number. If one
       | frame is 1 megabyte uncompressed (instead of just 2 bytes for a
       | text token) there are 2^(8*2^20) possible next frames, which is
       | far too large a number. So we somehow need to predict only an
       | embedding of a frame, of how the next frame of a video feed will
       | look approximately.
       | 
       | Moreover, for robotics we don't want to just predict the next
       | (approximate) frame of a video feed. We want to predict future
       | sensory data more generally. That's arguably what animals do,
       | including humans. We constantly anticipate what happens to us in
       | "the future", approximately, _and where the farther future is
       | predicted progressively less exactly._ We are relatively sure of
       | what happens in a second, but less and less sure of what happens
       | in a minute, or a day, or a year.
        
         | abraxas wrote:
         | But how do you go from predicting embeddings (which could be
         | thought of as a type of lossy compression of the original data)
         | back out to something usable, say a sequence of image/video
         | tokens or a sequence of robot actions?
        
           | cubefox wrote:
           | A robot model would need to constantly convert the prediction
           | (an embedding) of the future observations, together with a
           | "plan" of what the robot tries to achieve, into an action.
           | Into some kind of movement which takes both the action plan
           | and the predicted sensory data into account.
           | 
           | That's very much an unsolved problem, and I don't know how
           | far Meta is along that path. Not very far, I assume.
        
             | NitpickLawyer wrote:
             | If I understand your post correctly, they're also doing
             | this:
             | 
             | > V-JEPA 2-AC is a latent action-conditioned world model
             | post-trained from V-JEPA 2 (using a small amount of robot
             | trajectory interaction data) that solves robot manipulation
             | tasks without environment-specific data collection or task-
             | specific training or calibration.
             | 
             | > After the actionless pre-training stage, the model can
             | make predictions about how the world might evolve--however,
             | these predictions don't directly take into account specific
             | actions that an agent would take. In the second stage of
             | training, we focus on making the model more useful for
             | planning by using robot data, which includes visual
             | observations (video) and the control actions that the robot
             | was executing. We incorporate this data into the JEPA
             | training procedure by providing the action information to
             | the predictor. After training on this additional data, the
             | predictor learns to account for specific actions when
             | making predictions and can then be used for control. We
             | don't need a lot of robot data for this second phase--in
             | our technical report, we show that training with only 62
             | hours of robot data already results in a model that can be
             | used for planning and control.
             | 
             | > We demonstrate how V-JEPA 2 can be used for zero-shot
             | robot planning in new environments and involving objects
             | not seen during training. Unlike other robot foundation
             | models--which usually require that some training data come
             | from the specific robot instance and environment where the
             | model is deployed--we train the model on the open source
             | DROID dataset and then deploy it directly on robots in our
             | labs. We show that the V-JEPA 2 predictor can be used for
             | foundational tasks like reaching, picking up an object, and
             | placing it in a new location.
             | 
             | > For short-horizon tasks, such as picking or placing an
             | object, we specify a goal in the form of an image. We use
             | the V-JEPA 2 encoder to get embeddings of the current and
             | goal states. Starting from its observed current state, the
             | robot then plans by using the predictor to imagine the
             | consequences of taking a collection of candidate actions
             | and rating the candidates based on how close they get to
             | the desired goal. At each time step, the robot re-plans and
             | executes the top-rated next action toward that goal via
             | model-predictive control. For longer horizon tasks, such as
             | picking up an object and placing it in the right spot, we
             | specify a series of visual subgoals that the robot tries to
             | achieve in sequence, similar to visual imitation learning
             | observed in humans. With these visual subgoals, V-JEPA 2
             | achieves success rates of 65% - 80% for pick-and-placing
             | new objects in new and unseen environments.
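              | 
              | In code the planning loop they describe is roughly the
              | following (a sketch; encoder/predictor and the helpers
              | are placeholders, not the actual V-JEPA 2 API):
              | 
              |   # Score candidate actions by how close the predicted
              |   # next embedding lands to the goal embedding (MPC).
              |   import numpy as np
              | 
              |   def plan_step(encoder, predictor, obs, goal_img,
              |                 n_candidates=64, action_dim=7):
              |       z_now = encoder(obs)
              |       z_goal = encoder(goal_img)
              |       acts = np.random.uniform(
              |           -1, 1, size=(n_candidates, action_dim))
              |       scores = [np.linalg.norm(predictor(z_now, a)
              |                                - z_goal)
              |                 for a in acts]
              |       return acts[int(np.argmin(scores))]
              | 
              |   # toy stand-ins so the sketch runs end to end
              |   enc = lambda img: img.reshape(-1)[:16]
              |   pred = lambda z, a: z + 0.01 * a.sum()
              |   obs = np.ones((4, 4, 3))
              |   goal = np.zeros((4, 4, 3))
              |   print(plan_step(enc, pred, obs, goal))
              | 
              | At each control step you'd execute the returned action,
              | observe, and re-plan, which is the model-predictive
              | control loop they mention.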
        
           | bobosha wrote:
            | This is where the memory bit comes in: if you have a memory
            | of past embeddings and associated label(s), it could be an
            | ANN (approximate nearest neighbour) query to fetch the most
            | similar embeddings and infer from them.
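            | 
            | Something like this, in toy form (numpy cosine similarity
            | over a small in-memory store; the store and labels are
            | made up for illustration):
            | 
            |   # Embedding "memory": fetch the most similar stored
            |   # embeddings and reuse their labels.
            |   import numpy as np
            | 
            |   mem = np.random.randn(1000, 128)    # stored embeddings
            |   labels = [f"episode_{i}" for i in range(1000)]
            | 
            |   def nearest(query, k=5):
            |       sims = mem @ query / (
            |           np.linalg.norm(mem, axis=1)
            |           * np.linalg.norm(query))
            |       top = np.argsort(-sims)[:k]
            |       return [(labels[i], float(sims[i])) for i in top]
            | 
            |   print(nearest(np.random.randn(128)))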
        
             | abraxas wrote:
              | But an embedding is more like a one-way hash, kind of like
             | sha1 or md5, no? You can get from input data to a hash
             | value but not the other way around, right? I know that
             | similarly placed embedding vectors will sit next to
             | semantically related vectors but these clusters could be
             | really sparse in such a massively dimensional hyperspace
             | and so the nearest values in a cache may be too far away to
             | be useful?
             | 
             | BTW I'm very much not an expert here and I'm just trying to
             | understand how this system works end to end. Don't take
             | anything I write here as authoritative.
        
         | kaivi wrote:
         | > We constantly anticipate what happens to us in "the future",
         | approximately, and where the farther future is predicted
         | progressively less exactly
         | 
         | There's then evidence of what's called Predictive Coding. When
         | that future happens, a higher level circuit decides how far off
         | we were, and then releases appropriate neuromodulators to re-
         | wire that circuit.
         | 
         | That would mean that to learn faster, you want to expose
         | yourself to situations where you are often wrong: be often
         | surprised and go down the wrong paths. Have a feedback
         | mechanism which will tell you when you're wrong. This is maybe
         | also why the best teachers are the ones who often ask the class
         | questions for which there are counter-intuitive answers.
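          | 
          | In update-rule form the intuition is roughly a delta rule
          | (a toy sketch, not a claim about real neural circuitry):
          | the bigger the prediction error, the bigger the change.
          | 
          |   # Error-driven update: wrong predictions teach more.
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(0)
          |   w = rng.normal(size=3)       # crude "circuit" parameters
          |   true_w = np.array([1.0, -2.0, 0.5])
          |   lr = 0.1
          | 
          |   for _ in range(200):
          |       x = rng.normal(size=3)      # observed context
          |       target = x @ true_w         # what actually happens
          |       error = target - x @ w      # surprise
          |       w += lr * error * x         # update scales with it
          | 
          |   print(w)                     # approaches [1, -2, 0.5]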
        
           | cubefox wrote:
           | > There's then evidence of what's called Predictive Coding.
           | When that future happens, a higher level circuit decides how
           | far off we were, and then releases appropriate
           | neuromodulators to re-wire that circuit.
           | 
           | Yes, and ideally there would be whole backpropagation passes
           | which update the entire model depending on how much the
           | current observation diverges from past predictions. (Though
           | brains use an updating mechanism which diverges from the
           | backpropagation algorithm.)
        
         | nulld3v wrote:
         | The JEPA models give me hope that the future isn't just more
         | tokens, more context, and more chain-of-thought.
        
       | siavosh wrote:
       | Does someone know how the "semantic" embeddings are learned? That
       | seems like perhaps the main technical challenge here.
        
       | iLoveOncall wrote:
       | "World model" and "physical reasoning" is such a lie.
       | 
       | Those models don't have any understanding of physics, they just
       | regurgitate what they see in their vision-based training set,
       | just like any image or video generation model does.
       | 
       | Monkey see other monkey cannot go through wall, monkey don't try
       | go through wall.
        
         | rayboy1995 wrote:
         | > Monkey see other monkey cannot go through wall, monkey don't
         | try go through wall.
         | 
         | I mean... we are just monkeys. Did we not learn this way when
         | we were younger?
        
           | RollingRo11 wrote:
           | Agreed! A really young child has no notion of "physics". They
           | are learning through experience and observation.
           | 
           | These models/robots aren't superintelligent by any means, but
           | "Monkey see other monkey cannot go through wall, monkey don't
           | try go through wall" isn't far off from how some
           | animals/humans "learn".
        
         | smokel wrote:
         | I think you are misinterpreting the terminology.
         | 
         | Of course these models are not understanding physics in the way
          | a physicist or a mathematician would. But they do form a model
         | of the world that can be used for forecasting and reasoning, in
         | a way possibly not much unlike how humans and other animals
         | operate when interacting with the physical world.
        
         | seydor wrote:
         | physics is phenomenological. the model sees phenomena
        
         | dghlsakjg wrote:
         | You don't need to have taken a single physics class to be good
         | at pool...
        
       | rar00 wrote:
       | the robot arm demonstration video jumps at the 00:28s mark...
        
       ___________________________________________________________________
       (page generated 2025-06-11 23:00 UTC)