[HN Gopher] V-JEPA 2 world model and new benchmarks for physical...
___________________________________________________________________
V-JEPA 2 world model and new benchmarks for physical reasoning
Author : mfiguiere
Score : 205 points
Date : 2025-06-11 14:43 UTC (8 hours ago)
(HTM) web link (ai.meta.com)
(TXT) w3m dump (ai.meta.com)
| artificialprint wrote:
| Throw ARC-AGI 2 at it!
| jadbox wrote:
| I suspect it wouldn't help too much. This model is meant for
| physics-based world modeling, while nearly all the problems in
| ARC are symbolic reasoning.
| artificialprint wrote:
| I'd say world modeling can provide the foundations from which
| symbolic reasoning can emerge, after all this is how we
| (humans) learn it too. There are a lot of tasks in arc that
| are grounded in simple physics
| littlestymaar wrote:
| > I'd say world modeling can provide the foundations from
| which symbolic reasoning can emerge, after all this is how
| we (humans) learn it too
|
| As usual comparisons with humans provide little practical
| insight for what's achievable with ML. Humans don't have to
| learn everything from scratch like ML models do, you aren't
| expecting ML models to learn language out of a few
| thousands of tokens just because humans can, so similarly
| you shouldn't expect neural networks to learn reasoning
| from world interaction alone.
| falcor84 wrote:
| Yes, ARC-AGI 2 seems to have a lot of challenges that involve
| (a projection of) gravity and collisions, so I'd be quite
| interested in seeing whether it would generalize.
| ldjkfkdsjnv wrote:
| Leadership at Meta is dropping the ball with these non-LLM AI
| model side quests.
| jadbox wrote:
| LLMs were once a side quest. I hope Meta invests more in
| alternatives, as maybe we'll find something better. If not, then
| Meta just loses a bit of R&D budget. They are still heavily
| invested in regular LLM development, so it's not like they are
| trading one for the other.
| linguistbreaker wrote:
| I strongly agree. FAANG has the money to do the research.
| LLMs are far from intelligent - AGI will require a number of
| other advances.
| energy123 wrote:
| Is this a sarcastic compliment? Diversity in research agendas
| is very important for pushing the frontier forward, even if
| it's not good for the company investing in the high-risk
| research. Good job by an otherwise toxic company.
| rvz wrote:
| AI research is more than just LLMs.
| TheAceOfHearts wrote:
| > With these visual subgoals, V-JEPA 2 achieves success rates of
| 65% - 80% for pick-and-placing new objects in new and unseen
| environments.
|
| How does this compare with existing alternatives? Maybe I'm just
| lacking proper context, but a minimum 20% failure rate sounds
| pretty bad? The paper compares their results with older
| approaches, which apparently had something like a 15% success
| rate, so jumping to an 80% success rate does seem like a
| significant improvement. If I'm reading the paper correctly,
| the amount of time required to compute and execute each action
| went down from 4 minutes to 16 seconds, which also seems
| significant.
|
| Having to specify an end goal as an image seems pretty limited,
| but at least the authors acknowledge it in the paper:
|
| > Second, as mentioned in Section 4, V-JEPA 2-AC currently relies
| upon tasks specified as image goals. Although this may be natural
| for some tasks, there are other situations where language-based
| goal specification may be preferable. Extending the V-JEPA 2-AC
| to accept language-based goals, e.g., by having a model that can
| embed language-based goals into the V-JEPA 2-AC representation
| space, is another important direction for future work. The
| results described in Section 7, aligning V-JEPA 2 with a language
| model, may serve as a starting point.
|
| I think it would be interesting if the authors answered whether
| they think there's a clear trajectory towards a model that can be
| trained to achieve a >99% success rate.
| ricardobeat wrote:
| It's important to keep some perspective: there are zero robots
| in the wild, at the moment, that use a world model to work on
| tasks they weren't specifically trained on. This is cutting
| edge research and an 80% success rate is astonishing!
| gyudin wrote:
| They don't use it because it's unsafe and potentially life
| threatening lol
| dghlsakjg wrote:
| Plenty of things are unsafe and potentially life
| threatening, including machines with pre-programmed
| routines that we use today. We already have robots with
| limited intelligence interacting safely with humans in
| workplaces.
|
| This learning technology didn't exist until this moment in
| time. That probably has more to do with why no one is using
| it in the wild.
| lukan wrote:
| Yes, you can just add other reliable safety measures, meaning
| that if a human comes too close, the robot stops.
|
| Or the robot is supervised all the time.
|
| Or just operates in an area without humans.
|
| But so far this is research, not market ready.
| refulgentis wrote:
| I can buy this, given a very wide meaning of "specifically
| trained on" and handwaving a bit about "as far as _I_ know*
| ", but then I read the actual wording of "new objects in new
| and unseen environments.", and remember these were floating
| around Mountain View doing tasks involving in new objects in
| novel environments years ago. Then I kinda gotta give up and
| admit to myself I'm distorting the conversation by
| emphasizing positivity over ground truth.
| vFunct wrote:
| I'm surprised that's not how it's already done. I'd figure
| some of the inner layers in LLMs were already "world models",
| and that it's the outer layers that differentiate models
| between text vs. images/robotics/other modes...
| mjburgess wrote:
| That's what the propaganda says, but whenever we explain that
| it isn't true, an army arrives to repeat ad copy from their
| favourite tech guru.
|
| All statistical models of the kind in use are
| interpolations through historical data -- there's no magic.
| So when you interpolate through historical texts, your
| model is _of_ historical text.
|
| Text is not a measure of the world: saying "the sky is blue"
| is not even reliably associated with the blueness of the sky,
| let alone the fact that the sky isn't blue (there is no sky,
| and the atmosphere isn't blue).
|
| These models appear to "capture more" only because when you
| interpret the text you attribute meaning/understanding to it
| as the cause of its generation -- but that wasn't the cause;
| this is necessarily an illusion. There is no model of the
| world in a model of historical text -- there is a model of the
| world in your head which you associate with text, and that
| association is exploited when you use LLMs to do more than
| mere syntax transformation.
|
| LLMs excel most at "fuzzy retrieval" and things like coding --
| the latter is principally a matter of syntax, and the former a
| matter of recollection. As soon as you require the prompt-
| completion to maintain "semantic integrity" with non-
| syntactical/retrievable constraints, it falls apart.
| nightski wrote:
| I feel like you are ignoring or dismissing the word
| "interpolating", although a better word would likely be
| generalization. I'd make the claim that it's very hard to
| generalize without some form of world model. It's clear
| to me that transformers do have some form of world model,
| although not the same as what is being presented in
| V-JEPA.
|
| One other nitpick is that you confine this to "historical
| data", although other classes of data are trained on as well,
| such as simulated and generated data.
| mjburgess wrote:
| I didn't say generalisation, because there isn't any.
| Inductive learning does not generalise, it interpolates --
| if the region of your future prediction (here, prompt
| completion) lies on or close to the interpolated region,
| then the system is useful.
|
| Generalisation is the opposite process: hypothecating a
| universal and finding counter-examples to constrain the
| universal generalisation. E.g., "all fire burns" is
| hypothecated by a competent animal upon encountering
| fire once.
|
| Inductive "learners" take the opposite approach: fire
| burns in "all these cases", and if you have a case
| similar to those, then fire will burn you.
|
| They can look the same within the region of
| interpolation, but look very different when you leave it:
| all of these systems fall over quickly when more than a
| handful of semantic constraints are imposed. This number
| is a measure of the distance from the interpolated
| boundary (e.g., consider this interpretation of Apple's
| latest paper on reasoning in LLMs: the "environment
| complexity" is nothing other than a measure of
| interpolation-dissimilarity).
|
| Early modern philosophers of science were very confused
| by this, but it's in Aristotle plain as day, and it's
| also been extremely well established since the 80s, as the
| development of formal computational stats necessitated
| making this clear: interpolation is not generalisation.
| The former does not get you robustness to irrelevant
| permutation (i.e., generalisation); it does not permit
| considering counterfactual scenarios (i.e.,
| generalisation); it does not give you a semantics/theory
| of the data generating process (i.e., generalisation, i.e.
| a world model).
|
| Interpolation is a model _of the data_. Generalisation
| requires a model of the _data generating process_; the
| former does not give you the latter, though it can appear
| to under strong experimental assumptions of known causal
| models.
|
| Here LLMs model the structure of language-as-symbolic-
| ordering. That structure, "in the interpolated region",
| _expresses_ reasoning, but it isn't a model _of_
| reasoning. It's a model of reasoning as captured in
| historical cases of it.
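|
| A minimal numpy sketch of that distinction (my own toy
| illustration, nothing from the paper): fit a polynomial to
| samples of sin(x) on [0, 2pi], then evaluate it inside and
| outside that region. Inside, the interpolation tracks the
| function; outside, the error explodes.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     x_train = rng.uniform(0, 2 * np.pi, 200)  # the interpolated region
|     y_train = np.sin(x_train)
|
|     # fit only inside the region
|     model = np.poly1d(np.polyfit(x_train, y_train, deg=9))
|
|     x_in = np.linspace(0, 2 * np.pi, 100)           # inside
|     x_out = np.linspace(3 * np.pi, 4 * np.pi, 100)  # outside
|
|     err_in = np.mean(np.abs(model(x_in) - np.sin(x_in)))
|     err_out = np.mean(np.abs(model(x_out) - np.sin(x_out)))
|     print(err_in)    # small: the model is useful here
|     print(err_out)   # huge: no generalisation beyond the region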
| jeremyjh wrote:
| Aren't there papers showing that there is some kind of
| world model emerging? Like representations of an Othello
| board that we would recognize were found and manipulated
| successfully in a small model.
| mjburgess wrote:
| There are two follow-up papers showing the
| representations are "entangled", a euphemism for
| statistical garbage, but I can't be bothered at the
| moment to find them.
|
| However, the whole issue of Othello is a non sequitur,
| which indicates that the people involved here don't
| really seem to understand the issue, or what a world
| model is.
|
| A "world model" is a model of a data generating process
| which isn't reducible-to or constituted by its measures.
| Ie., we are concerned for the case where there's a
| measurement space (eg., that of the height of mercury in
| a thermometer) and a target property space (eg., that of
| the temperature of the coffee), so that there is a gap
| between the data-as-measure and its causes. In language
| this gap is massive: the cause of my saying, "I'm hungry"
| may have nothing to do with my hunger, even if it often
| does. For "scientific measuring devices", these are
| constructed to minimize this gap as much as possible.
|
| In any case, with board games and other mathematical
| objects, there is no gap. The data _is_ the game. The
| "board state" is an abstract object _constituted by_ all
| possible board states. The game "is made out of" its
| realisations.
|
| However the world isn't made out of language, nor coffee
| made out of thermometers. So a model _of_ the data isn't a
| model of its generating process.
|
| So whether an interpolation of board states "fully
| characterises", in some way, an abstract mathematical
| object "the game" is so irrelevant to the question that it
| betrays a fundamental lack of understanding of even what's
| at issue.
|
| No one is arguing that a structured interpolative model
| (ie., one given an inductive bias by an NN architecture)
| doesn't _express_ properties of the underlying domain in
| its structure. The question is what happens to this model
| _of_ the data when you have _the same data generating
| process_, but you aren't in the interpolated region.
|
| This problem is, in the limit of large data, impossible
| for abstract games by their nature, eg., a model
| classifying the input X into legal/illegal board states
| _is_ the game.
|
| Another way of phrasing this is that ML/AI textbooks
| often begin by assuming there's a function you're
| approximating. But in the vast majority of cases where
| NNs are used, there is no such function -- there is no
| function from tokens -> meanings (e.g., "I am hungry" is
| ambiguous).
|
| But in the abstract math case there is a function:
| {boards} -> Legal|Illegal is a function, and there are no
| ambiguous boards.
|
| So: of the infinite number of f* approximations to
| f_game, _any_ is valid in the limit len(X) -> inf. Of
| the infinite number of f*_lang approximations to
| f_language, _all_ are invalid (each in its own way).
| jeremyjh wrote:
| > A "world model" is a model of a data generating process
| which isn't reducible-to or constituted by its measures.
| > However the world isn't made out of language, nor coffee
| made out of thermometers. So a model of the data isn't a
| model of its generating process.
|
| So is V-JEPA 2 actually generating a world model, as
| you've defined it here? It's still just sampling data --
| visual data, tactile feedback, etc. are all reducible to
| quantized data. It seems like you could build useful
| models that seem to generalize without that. For example,
| a model could learn to stop dropping things without ever
| developing a theory of gravity.
|
| Probably I'm still misunderstanding too much for this to
| be useful, but what I've read from you in this thread is
| way more useful to my understanding than what I've seen
| before.
| math_dandy wrote:
| Could you give more details about what precisely you mean
| by interpolation and generalization? The commonplace use
| of "generalization" in the machine learning textbooks
| I've been studying is model performance (whatever metric
| is deemed relevant) on new data from the training
| distribution. In particular, it's meaningful when you're
| modeling p(y|x) and not the generative distribution
| p(x,y).
| abtinf wrote:
| > an army arrives to repeat ad copy from their favourite
| tech guru
|
| This is painfully accurate.
|
| The conversations go like this:
|
| Me: "guys, I know what I'm talking about, I wrote my
| first neural network 30 years ago in middle school, this
| tech is cool but it isn't magic and it isn't good enough
| to do the thing you want without getting us sued or
| worse."
|
| Them: "Bro, I read a tweet that we are on the other side
| of the singularity. We have six months to make money
| before everything blows up."
| londons_explore wrote:
| 80% success rate is also potentially commercially viable if
| the task is currently being done by a human.
|
| Work that was once done by 10 humans can now be done by 10
| robots + 2 humans for the 20% failure cases, at a lower total
| cost.
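|
| Back-of-the-envelope, with purely made-up placeholder costs:
|
|     human_cost = 50_000   # per human per year (placeholder)
|     robot_cost = 20_000   # per robot per year (placeholder)
|
|     all_human = 10 * human_cost
|     # 10 robots plus 2 humans covering the ~20% failure cases
|     hybrid = 10 * robot_cost + 2 * human_cost
|     print(all_human, hybrid)   # 500000 vs 300000 with these numbers
|
| Whether it actually pencils out depends entirely on the real
| cost ratio and on how expensive the failures are.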
| zeroxfe wrote:
| This really depends on the failure modes. In general,
| humans fail in predictable, and mostly safe, ways. AIs fail
| in highly unpredictable and potentially very dangerous
| ways. (A human might accidentally drop a knife, an AI might
| accidentally stab you with it.)
| Maxion wrote:
| Or, if controlling a robot arm, it would stab itself
| through the conveyor belt at full torque.
| DickingAround wrote:
| I run thousands of robots in production. We can get a very high
| success rate but only for the task they're designed for.
| Production robots can't pick up stuff they drop yet. And this
| '80%' level is not actually acceptable or even state of the art
| for just pick-and-place, but it's compelling for a robot that
| also knows how to do other things with equal quality (if JEPA
| does that).
| deepGem wrote:
| Currently,
|
| You train a VLA (vision-language-action) model for a specific
| pair of robotic arms, for a specific task. The end-actuator
| actions are embedded in the model. So let's say you
| train a pair of arms to pick an apple. You cannot zero shot it
| to pick up a glass. What you see in demos is the result of lots
| of training and fine tuning (few shot) on specific object types
| and with specific robotic arms or bodies.
|
| The language intermediary embedding brings some generalising
| skills to the table but it isn't much. The vision -> language
| -> action translation is, how do I put this, brittle at best.
|
| What these guys are showing is a zero shot approach to new
| tasks in new environments with 80% accuracy. This is a big
| deal. Pi0 from Physical Intelligence is the best model to
| compare against, I think.
| robot wrote:
| Your comment is not aligned with how science is done. For
| discoveries you work with limited approaches and certainly
| don't know if there is a "clear trajectory".
| fidotron wrote:
| You have to wonder if the model is going to end up recreating
| Verlet integration in there somewhere, or if it's generating a
| pile of those optical acceleration cancelation type heuristics in
| neural net form.
|
| It's one of those ideas I've had around for a while that if you
| fused decent object tracking with an understanding of Verlet
| integration you should, in principle, start being able to measure
| all sorts of physical quantities quite easily.
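|
| For reference, a minimal position-Verlet step (standard textbook
| form, nothing specific to V-JEPA): with constant acceleration,
| the second difference of tracked positions recovers the
| acceleration exactly, which is the "measure physical quantities
| from tracking" idea.
|
|     import numpy as np
|
|     def verlet_step(x_prev, x_curr, accel, dt):
|         # position Verlet: x_{t+1} = 2*x_t - x_{t-1} + a*dt^2
|         return 2 * x_curr - x_prev + accel * dt ** 2
|
|     dt, g = 0.01, np.array([0.0, -9.81])
|     x0 = np.array([0.0, 100.0])
|     xs = [x0, x0 + 0.5 * g * dt ** 2]      # object released from rest
|     for _ in range(100):
|         xs.append(verlet_step(xs[-2], xs[-1], g, dt))
|
|     # "track" the trajectory and recover g by finite differences
|     est_a = (xs[10] - 2 * xs[9] + xs[8]) / dt ** 2
|     print(est_a)   # ~[0, -9.81]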
| nlitened wrote:
| I imagine that Russian-speaking team members had fun with naming
| the model V-JEPA
| Tiberium wrote:
| For the curious: "zhopa" (which "JEPA" sounds like) means "ass"
| in Russian. Also, "V" means "in" (although if we get into
| specifics, the grammatical case would require "zhopu" or
| "zhope", depending on the context).
| koakuma-chan wrote:
| Also the video thumbnail:
|
| J.E.P.A.
| jcelerier wrote:
| > That kind of physical intuition isn't something adults obtain
| after years of education--young children develop this intuition
| by observing the world around them before they can even speak in
| full sentences.
|
| I mean, it still takes them much more time than it takes to
| train even the largest LLMs we use (a couple of months).
| lukan wrote:
| But they use way less energy for it.
| dist-epoch wrote:
| In wall clock time. If you count in input tokens/pixels, humans
| learn with orders of magnitude less input data.
| logicchains wrote:
| That's not true at all; the amount of audiovisual data a
| human is exposed to in even just one year is incredibly vast.
| Sixty-plus frames per second for sixteen hours per day gives
| over a billion frames per year, and each frame at such a
| high resolution would be hundreds of tokens.
| dist-epoch wrote:
| Let's take your numbers:
|
| Human: 1000 tok * 60 * 86400 * 365 = 2 Trillion tokens /
| year
|
| GPT-4: 13 Trillion tokens
|
| Llama-3: 15 Trillion tokens
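|
| Or, redone with the 16 waking hours from the parent comment
| (rough orders of magnitude, every number an assumption):
|
|     tokens_per_frame = 1_000    # "hundreds of tokens", rounded up
|     fps = 60
|     seconds_awake = 16 * 3600
|     human_tokens_per_year = tokens_per_frame * fps * seconds_awake * 365
|     print(f"{human_tokens_per_year:.2e}")   # ~1.3e12 tokens/year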
| cluckindan wrote:
| That's why we tokenize very early in the vision pipeline.
|
| Related: https://en.wikipedia.org/wiki/Form_constant
| Vetch wrote:
| This contains a common misstep (or misgeneralization of an
| analogy) among those who are much more familiar with
| computers than with the brain. The brain is not digital and
| concepts like frames per second and resolution don't make
| much sense for vision. First, there aren't frames, neuron
| activity is asynchronous with changes to sensory neuron
| firing rate responding to changes in the environment or
| according to saliency.
|
| Between the non-uniformity of receptor density (eg fovea vs
| peripheral vision but this is general across all senses),
| dynamic receptor fields and the fact that information is
| encoded in terms of spike rate and timing patterns across
| neural populations, the idea of pixels in some bitmap at
| some resolution is beyond misleading. There is no pixel
| data, just sparsely coded feature representations capturing
| things like edges, textures, motion, color contrast and the
| like, already, at the retina.
|
| While hundreds of trillions of photons might hit our
| photoreceptors, >99% of that is filtered and/or compressed
| _before_ even reaching retinal ganglion cells. Only a tiny
| fraction, about 10 million bits /sec, of the original
| photon signal rate is transferred through the optic nerve
| (per eye). This pattern of filtering and attentive
| prioritization of information in signals continues as we go
| from sensory fields to thalamus to higher cortical areas.
|
| So while we might encounter factoids like: on the order of
| a billion bits per second of data hit photoreceptors or
| [10Mb/s transferred](https://www.britannica.com/science/information-theory/Physio...)
| along optic nerves, it's
| important to keep in mind that a lot of the intuition
| gained from digital information processing does not
| transfer in any meaningful sense to the brain.
| momojo wrote:
| Why is Meta investing into this research? What's the potential
| payoff?
| dyauspitr wrote:
| Physical robots as impressive as LLMs?
| aaroninsf wrote:
| The goal is a Large Phenomenological Model.
|
| A good definition of "real AGI" might be, a multimodal model
| which understands time-based media, space, and object behavior,
| and hence true agency.
|
| Phenomenology is the philosophy of "things as they seem," not
| "knowledge (words) about things." Seem to our senses, not
| understood through language.
|
| LLMs of course trade in language tokens.
|
| We can extend their behavior with front ends which convert
| other media types into such tokens.
|
| But we can do better with multimodal models which are trained
| directly on other inputs. E.g. integrating image classifiers
| with language models architecturally.
|
| With _those_ one can sort of understand time-based media, by
| sampling a stream and getting e.g. transcripts.
|
| But again, it's even better to build a time-based multimodal
| model, which directly ingests time-based media rather than
| sampling. (Other architectures than transformers are going to
| be required to do this well IMO...)
|
| The bootstrapping continues. This work is about training models
| to understand world and object properties by introducing
| agency.
|
| Significant footnote: implicitly, models trained to interact
| with the world necessarily have a "self model" which interacts
| with the "world model." Presumably they are trained to preserve
| their expensive "self." Hmmmmm....
|
| When we have a model that knows about things not just as nodes
| in a language graph but also how such things look, and sound,
| and moves, and "feel" (how much mass do they have, how do they
| move, etc.)...
|
| ...well, that is approaching indistinguishable from one of us,
| at least wrt embodiment and agency.
| DesiLurker wrote:
| Possibly, with their investment into AR/VR and gaming, they
| see a pathway to creating 'physical intelligence' and tapping
| into a much bigger untapped market. I mean, isn't Robotaxi the
| main carrot Musk has been holding in front of Tesla investors
| for a decade or so? Physical robots may provide a more
| 'incremental, fault-tolerant' path to applications of AI.
| esafak wrote:
| There is a world of money in AGI, and they have the resources,
| and notably the data, to achieve it.
| seydor wrote:
| physical robots arguing endlessly with physical people
| kp1197 wrote:
| Robots that can do anything.
| cubefox wrote:
| I think the fundamental idea behind JEPA (not necessarily this
| concrete Meta implementation) will ultimately be correct:
| predicting embeddings instead of concrete tokens. That's arguably
| what animals do. Next-token prediction (a probability
| distribution over the possible next tokens) works well for the
| discrete domain of text, but it doesn't work well for a
| continuous domain like video, which would be needed for real-time
| robotics.
|
| For text, with a two-byte tokenizer you get 2^16 (~65,000)
| possible next tokens, and computing a probability distribution
| over them is very much doable. But the "possible next frames" in
| a video feed would already be an extremely large number. If one
| frame is 1 megabyte uncompressed (instead of just 2 bytes for a
| text token) there are 2^(8*2^20) possible next frames, which is
| far too large a number. So we somehow need to predict only an
| embedding of a frame, of how the next frame of a video feed will
| look approximately.
|
| Moreover, for robotics we don't want to just predict the next
| (approximate) frame of a video feed. We want to predict future
| sensory data more generally. That's arguably what animals do,
| including humans. We constantly anticipate what happens to us
| in "the future", approximately, _with the farther future
| predicted progressively less exactly._ We are relatively sure of
| what happens in a second, but less and less sure of what happens
| in a minute, or a day, or a year.
| abraxas wrote:
| But how do you go from predicting embeddings (which could be
| thought of as a type of lossy compression of the original data)
| back out to something usable, say a sequence of image/video
| tokens or a sequence of robot actions?
| cubefox wrote:
| A robot model would need to constantly convert the prediction
| (an embedding) of the future observations, together with a
| "plan" of what the robot tries to achieve, into an action --
| some kind of movement which takes both the action plan
| and the predicted sensory data into account.
|
| That's very much an unsolved problem, and I don't know how
| far Meta is along that path. Not very far, I assume.
| NitpickLawyer wrote:
| If I understand your post correctly, they're also doing
| this:
|
| > V-JEPA 2-AC is a latent action-conditioned world model
| post-trained from V-JEPA 2 (using a small amount of robot
| trajectory interaction data) that solves robot manipulation
| tasks without environment-specific data collection or task-
| specific training or calibration.
|
| > After the actionless pre-training stage, the model can
| make predictions about how the world might evolve--however,
| these predictions don't directly take into account specific
| actions that an agent would take. In the second stage of
| training, we focus on making the model more useful for
| planning by using robot data, which includes visual
| observations (video) and the control actions that the robot
| was executing. We incorporate this data into the JEPA
| training procedure by providing the action information to
| the predictor. After training on this additional data, the
| predictor learns to account for specific actions when
| making predictions and can then be used for control. We
| don't need a lot of robot data for this second phase--in
| our technical report, we show that training with only 62
| hours of robot data already results in a model that can be
| used for planning and control.
|
| > We demonstrate how V-JEPA 2 can be used for zero-shot
| robot planning in new environments and involving objects
| not seen during training. Unlike other robot foundation
| models--which usually require that some training data come
| from the specific robot instance and environment where the
| model is deployed--we train the model on the open source
| DROID dataset and then deploy it directly on robots in our
| labs. We show that the V-JEPA 2 predictor can be used for
| foundational tasks like reaching, picking up an object, and
| placing it in a new location.
|
| > For short-horizon tasks, such as picking or placing an
| object, we specify a goal in the form of an image. We use
| the V-JEPA 2 encoder to get embeddings of the current and
| goal states. Starting from its observed current state, the
| robot then plans by using the predictor to imagine the
| consequences of taking a collection of candidate actions
| and rating the candidates based on how close they get to
| the desired goal. At each time step, the robot re-plans and
| executes the top-rated next action toward that goal via
| model-predictive control. For longer horizon tasks, such as
| picking up an object and placing it in the right spot, we
| specify a series of visual subgoals that the robot tries to
| achieve in sequence, similar to visual imitation learning
| observed in humans. With these visual subgoals, V-JEPA 2
| achieves success rates of 65% - 80% for pick-and-placing
| new objects in new and unseen environments.
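|
| In code, the planning loop they describe looks roughly like
| this (my own sketch: placeholder encoder/predictor functions,
| and plain random shooting where the real system may use a
| fancier optimizer):
|
|     import torch
|
|     D_EMB, D_ACT, H, N = 256, 7, 5, 512   # emb dim, action dim, horizon, candidates
|
|     # stand-ins for the trained encoder and action-conditioned predictor
|     encoder = lambda frame: torch.randn(D_EMB)
|     predictor = lambda z, a: z + 0.01 * torch.randn(D_EMB)
|
|     def plan_next_action(current_frame, goal_frame):
|         z, z_goal = encoder(current_frame), encoder(goal_frame)
|         candidates = torch.randn(N, H, D_ACT)     # candidate action sequences
|         scores = torch.empty(N)
|         for i in range(N):
|             z_roll = z
|             for t in range(H):                    # imagine the consequences
|                 z_roll = predictor(z_roll, candidates[i, t])
|             scores[i] = -torch.norm(z_roll - z_goal)  # closer to goal = better
|         return candidates[torch.argmax(scores), 0]    # execute only the first action
|
|     # model-predictive control: re-plan at every step
|     # while not done:
|     #     robot.apply(plan_next_action(camera(), goal_image))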
| bobosha wrote:
| This is where the memory bit comes in: if you have a memory
| of past embeddings and associated label(s), an ANN
| (approximate nearest neighbour) query can fetch the most
| similar embeddings, and you can infer from those.
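|
| For what it's worth, a toy version of that lookup (brute-force
| cosine similarity standing in for a real ANN index such as
| FAISS):
|
|     import numpy as np
|
|     memory_emb = np.random.randn(10_000, 256)   # stored past embeddings
|     memory_lbl = np.arange(10_000)              # associated labels
|
|     def nearest(query, k=5):
|         m = memory_emb / np.linalg.norm(memory_emb, axis=1, keepdims=True)
|         q = query / np.linalg.norm(query)
|         top = np.argsort(-(m @ q))[:k]          # cosine similarity, top-k
|         return memory_lbl[top]
|
|     print(nearest(np.random.randn(256)))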
| abraxas wrote:
| But an embedding is more like a one-way hash, kind of like
| SHA-1 or MD5, no? You can get from input data to a hash
| value but not the other way around, right? I know that
| similarly placed embedding vectors will sit next to
| semantically related vectors but these clusters could be
| really sparse in such a massively dimensional hyperspace
| and so the nearest values in a cache may be too far away to
| be useful?
|
| BTW I'm very much not an expert here and I'm just trying to
| understand how this system works end to end. Don't take
| anything I write here as authoritative.
| kaivi wrote:
| > We constantly anticipate what happens to us in "the future",
| approximately, and where the farther future is predicted
| progressively less exactly
|
| There's evidence of this in what's called Predictive Coding:
| when that future happens, a higher-level circuit decides how
| far off we were, and then releases appropriate neuromodulators
| to re-wire that circuit.
|
| That would mean that to learn faster, you want to expose
| yourself to situations where you are often wrong: be often
| surprised and go down the wrong paths. Have a feedback
| mechanism which will tell you when you're wrong. This is maybe
| also why the best teachers are the ones who often ask the class
| questions for which there are counter-intuitive answers.
| cubefox wrote:
| > There's evidence of this in what's called Predictive
| Coding: when that future happens, a higher-level circuit
| decides how far off we were, and then releases appropriate
| neuromodulators to re-wire that circuit.
|
| Yes, and ideally there would be whole backpropagation passes
| which update the entire model depending on how much the
| current observation diverges from past predictions. (Though
| brains use an updating mechanism which diverges from the
| backpropagation algorithm.)
| nulld3v wrote:
| The JEPA models give me hope that the future isn't just more
| tokens, more context, and more chain-of-thought.
| siavosh wrote:
| Does someone know how the "semantic" embeddings are learned? That
| seems like perhaps the main technical challenge here.
| iLoveOncall wrote:
| "World model" and "physical reasoning" is such a lie.
|
| Those models don't have any understanding of physics, they just
| regurgitate what they see in their vision-based training set,
| just like any image or video generation model does.
|
| Monkey see other monkey cannot go through wall, monkey don't try
| go through wall.
| rayboy1995 wrote:
| > Monkey see other monkey cannot go through wall, monkey don't
| try go through wall.
|
| I mean... we are just monkeys. Did we not learn this way when
| we were younger?
| RollingRo11 wrote:
| Agreed! A really young child has no notion of "physics". They
| are learning through experience and observation.
|
| These models/robots aren't superintelligent by any means, but
| "Monkey see other monkey cannot go through wall, monkey don't
| try go through wall" isn't far off from how some
| animals/humans "learn".
| smokel wrote:
| I think you are misinterpreting the terminology.
|
| Of course these models do not understand physics in the way a
| physicist or a mathematician would. But they do form a model
| of the world that can be used for forecasting and reasoning,
| in a way perhaps not unlike how humans and other animals
| operate when interacting with the physical world.
| seydor wrote:
| Physics is phenomenological. The model sees phenomena.
| dghlsakjg wrote:
| You don't need to have taken a single physics class to be good
| at pool...
| rar00 wrote:
| The robot arm demonstration video jumps at the 00:28 mark...
___________________________________________________________________
(page generated 2025-06-11 23:00 UTC)