[HN Gopher] Ask HN: Any insider takes on Yann LeCun's push again...
___________________________________________________________________
Ask HN: Any insider takes on Yann LeCun's push against current
architectures?
So, LeCun has been quite public saying that he believes LLMs will
never fix hallucinations because, essentially, the token-choice
method at each step leads to runaway errors -- these can't be
damped mathematically. Instead, he offers the idea that we
should have something that is an 'energy minimization'
architecture; as I understand it, this would have a concept of the
'energy' of an entire response, and training would try to minimize
that. Which is to say, I don't fully understand this. That said,
I'm curious to hear what ML researchers think about LeCun's take,
and whether there's any engineering being done around it. I can't
find much after the release of I-JEPA from his group.
Author : vessenes
Score : 128 points
Date : 2025-03-10 19:41 UTC (4 days ago)
| ActorNightly wrote:
| Not an official ML researcher, but I do happen to understand this
| stuff.
|
| The problem with LLMs is that the output is inherently stochastic
| - i.e. there isn't an "I don't have enough information" option.
| This is due to the fact that LLMs are basically just giant
| lookup maps with interpolation.
|
| Energy minimization is more of an abstract approach to where you
| can use architectures that don't rely on things like
| differentiability. True AI won't be solely feedforward
| architectures like current LLMs. To give an answer, they will
| basically determine an algorithm on the fly that includes
| computation and search. To learn that algorithm (or algorithm
| parameters), at training time, you need something that doesn't
| rely on continuous values, but still converges to the right
| answer. So instead you assign a fitness score, like memory use or
| compute cycles, and differentiate based on that. This is
| basically how search works with genetic algorithms or PSO.
| seanhunter wrote:
| > The problem with LLMs is that the output is inherently
| stochastic - i.e. there isn't an "I don't have enough
| information" option. This is due to the fact that LLMs are
| basically just giant lookup maps with interpolation.
|
| I don't think this explanation is correct. The output of the
| final layer (after all the attention blocks, the LM head and
| the softmax - as I understand it) is a probability distribution
| over tokens. So the model as a whole does have an ability to
| signal low confidence in something by assigning it a low
| probability.
|
| The problem is that thing is a token (part of a word). So the
| LLM can say "I don't have enough information" to decide on the
| next part of a word but has no ability to say "I don't know
| what on earth I'm talking about" (in general - not associated
| with a particular token).
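|
| As a rough illustration of that per-token signal (a sketch in
| plain PyTorch, no real model attached), confidence about the
| _next token_ can be read off the final distribution, but it
| says nothing about the answer as a whole:
|
|   import torch
|
|   def next_token_confidence(logits: torch.Tensor):
|       # logits: [vocab] scores for the next token, i.e. what
|       # the final layer produces before sampling.
|       logp = torch.log_softmax(logits, dim=-1)
|       probs = logp.exp()
|       entropy = -(probs * logp).sum()  # flat dist = uncertain
|       return probs.max().item(), entropy.item()
|
|   # A near-uniform distribution signals low confidence in the
|   # next token only, not "I don't know what I'm talking about".
|   print(next_token_confidence(torch.zeros(50_000)))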
| estebarb wrote:
| The problem is exactly that: the probability distribution.
| The network has no way to say: 0% everyone, this is
| nonsense, backtrack everything.
|
| Other architectures, like energy-based models or Bayesian
| ones, can assess uncertainty. Transformers simply cannot do it
| (yet). Yes, there are ways to do it, but we are already
| spending millions to get coherent phrases; few will burn
| billions to train a model that can do that kind of
| assessment.
| ortsa wrote:
| Has anybody ever messed with adding a "backspace" token?
| refulgentis wrote:
| Yes. (https://news.ycombinator.com/item?id=36425375,
| believe there's been more)
|
| There's a quite intense backlog of new stuff that hasn't
| made it to prod. (I would have told you in 2023 that we
| would have ex. switched to Mamba-like architectures in at
| least one leading model)
|
| Broadly, it's probably unhelpful that:
|
| - absolutely no one wants the PR of releasing a model
| that isn't competitive with the latest peers
|
| - absolutely everyone wants to release an incremental
| improvement, yesterday
|
| - Entities with no PR constraint, and no revenue
| repercussions when reallocating funds from surely-
| productive to experimental, don't show a significant
| improvement in results for the new things they try (I'm
| thinking of ex. Allen Institute)
|
| Another odd property I can't quite wrap my head around is
| that the battlefield is littered with corpses that eval okay-
| ish, and should have OOM increases in some areas (I'm
| thinking of RWKV, and how it should be faster at
| inference), and they're not really in the conversation
| either.
|
| Makes me think either A) I'm getting old and don't really
| understand ML from a technical perspective anyway or B)
| hey, I've been maintaining a llama.cpp wrapper that
| works on every platform for a year now, I should trust my
| instincts: the real story is UX is king and none of these
| things actually improve the experience of a user even if
| benchmarks are ~=.
| vessenes wrote:
| For sure read Stephenson's essay on path dependence; it
| lays out a lot of these economic and social dynamics.
| TLDR - we will need a major improvement to see something
| novel pick up steam most likely.
| Ericson2314 wrote:
| Yeah, everyone spending way too much money on things we
| barely understand is a recipe for insane path dependence.
| ortsa wrote:
| Oh yeah, that's exactly what I was thinking of! Seems
| like it would be very useful for expert models in
| domains with more definite "edges" (if I'm understanding
| it right)
|
| As for the fragmentation of progress, I guess that's just
| par for the course for any tech with such a heavy
| private/open source split. It would take a huge amount of
| work to trawl through this constant stream of
| 'breakthroughs' and put them all together.
| duskwuff wrote:
| Right. And, as a result, low token-level confidence can end
| up indicating "there are other ways this could have been
| worded" or "there are other topics which could have been
| mentioned here" just as often as it does "this output is
| factually incorrect". Possibly even more often, in fact.
| vessenes wrote:
| My first reaction is that a _model_ can't, but a sampling
| architecture probably could. I'm trying to understand if
| what we have as a whole architecture for most inference now
| is responsive to the critique or not.
| derefr wrote:
| You get scores for the outputs _of the last layer_; so in
| theory, you could notice when those scores form a
| particularly flat distribution, and fault.
|
| What you can't currently get, from a (linear) Transformer, is
| a way to induce a similar observable "fault" in any of the
| _hidden_ layers. Each hidden layer only speaks the
| "language" of the next layer after it, so there's no clear
| way to program an inference-framework-level observer side-
| channel that can examine the output vector of each layer and
| say "yup, it has no confidence in any of what it's doing at
| this point; everything done by layers feeding from this one
| will just be pareidolia -- promoting meaningless deviations
| from the random-noise output of this layer into increasing
| significance."
|
| You could in theory build a model as a Transformer-_like_
| model in a sort of pine-cone shape, where each layer feeds
| its output both to the next layer (where the final layer's
| output is measured and backpropped during training) _and_ to
| an "introspection layer" that emits a single confidence
| score (a 1-vector). You start with a pre-trained linear
| Transformer base model, with fresh random-weighted
| introspection layers attached. Then you do supervised
| training of (prompt, response, confidence) triples, where on
| each training step, _the minimum confidence score of all
| introspection layers_ becomes the controlled variable tested
| against the training data. (So you aren't trying to enforce
| that any _particular_ layer notice when it's not confident,
| thus coercing the model to "do that check" at that layer; you
| just enforce that a "vote of no confidence" comes either from
| _somewhere_ within the model, or _nowhere_ within the model,
| at each pass.)
|
| This seems like a hack designed just to compensate for this
| one inadequacy, though; it doesn't seem like it would
| generalize to helping with anything else. Some other
| architecture might be able to provide a fully-general
| solution to enforcing these kinds of global constraints.
|
| (Also, it's not clear at all, for such training, "when"
| during the generation of a response sequence you should
| expect to see the vote-of-no-confidence crop up -- and
| whether it would be tenable to force the model to "notice"
| its non-confidence earlier in a response-sequence-generating
| loop rather than later. I would guess that a model trained in
| this way would either explicitly evaluate its own confidence
| with some self-talk before proceeding [if its base model were
| trained as a thinking model]; or it would encode _hidden_
| thinking state to itself in the form of word-choices et al,
| gradually resolving its confidence as it goes. In neither
| case do you really want to "rush" that deliberation process;
| it'd probably just corrupt it.)
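|
| A toy sketch of that "introspection layer" idea (PyTorch;
| the shapes, names, and pooling here are all invented for
| illustration, with the base model frozen and only the
| confidence heads trained):
|
|   import torch
|   import torch.nn as nn
|
|   class IntrospectionHeads(nn.Module):
|       # One tiny head per hidden layer; each emits a scalar
|       # confidence in [0, 1] for that layer's output.
|       def __init__(self, n_layers, d_model):
|           super().__init__()
|           self.heads = nn.ModuleList(
|               [nn.Linear(d_model, 1) for _ in range(n_layers)])
|
|       def forward(self, hidden_states):
|           # hidden_states: list of [batch, seq, d_model]
|           scores = [torch.sigmoid(h(hs.mean(dim=1)))
|                     for h, hs in zip(self.heads, hidden_states)]
|           # the "vote of no confidence" may come from any layer
|           return torch.cat(scores, dim=-1).min(dim=-1).values
|
|   def train_step(heads, hidden_states, conf_label, opt):
|       # conf_label: 1.0 if the (prompt, response) pair is
|       # well-supported, 0.0 if the model should abstain.
|       pred = heads(hidden_states)
|       loss = nn.functional.binary_cross_entropy(pred, conf_label)
|       opt.zero_grad(); loss.backward(); opt.step()
|       return loss.item()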
| skybrian wrote:
| I think some "reasoning" models do backtracking by inserting
| "But wait" at the start of a new paragraph? There's more to
| it, but that seems like a pretty good trick.
| Lerc wrote:
| I feel like we're stacking naive misinterpretations of how
| LLMs function on top of one another here. Grasping gradient
| descent and autoregressive generation can give you a false
| sense of confidence. It is like knowing how transistors make
| up logic gates and believing you know more about CPU design
| than you actually do.
|
| Rather than inferring from how you imagine the architecture
| working, you can look at examples and counterexamples to see
| what capabilities they have.
|
| One misconception is that predicting the next word means
| there is no internal idea on the word after next. The simple
| disproof of this is that models put 'an' instead of 'a' ahead
| of words beginning with vowels. It would be quite easy to
| detect (and exploit) behaviour that decided to use a vowel
| word just because it somewhat arbitrarily used an 'an'.
|
| Models predict the next word, but they don't _just_ predict
| the next word. They generate a great deal of internal
| information in service of that goal. Placing limits on their
| abilities by assuming the output they express is the sum
| total of what they have done is a mistake. The output
| probability is not what it thinks, it is a reduction of what
| it thinks.
|
| One of Andrej Karpathy's recent videos talked about how
| researchers showed that models do have an internal sense of
| not knowing the answer, but fine-tuning on question answering
| did not give them the ability to express that knowledge.
| Finding information the model did and didn't know, then fine-
| tuning it to say "I don't know" for the cases where it had no
| information, allowed the model to generalise and express "I
| don't know".
| littlestymaar wrote:
| Not an ML researcher or anything (I'm basically only a few
| Karpathy videos into ML, so please someone correct me if I'm
| misunderstanding this), but it seems that you're getting
| this backwards:
|
| > One misconception is that predicting the next word means
| there is no internal idea on the word after next. The
| simple disproof of this is that models put 'an' instead of
| 'a' ahead of words beginning with vowels.
|
| My understanding is that there's simply no "'an' _ahead_
| of a word that starts with a vowel": the model (or more
| accurately, the sampler) picks "an" and then the model will
| never predict a word that starts with a consonant _after_
| that. It's not like it "knows" in advance that it wants to
| put a word with a vowel and then anticipates that it needs
| to put "an", it generates a probability for both tokens "a"
| and "an", picks one, and then when it generates the
| following token, it will necessarily take its previous
| choice into account and never puts a word starting with a
| vowel after it has already chosen "a".
| yunwal wrote:
| The model still has some representation of whether the
| word after an/a is more likely to start with a vowel or
| not when it outputs a/an. You can trivially understand
| this is true by asking LLMs to answer questions with only
| one correct answer.
|
| "The animal most similar to a crocodile is:"
|
| https://chatgpt.com/share/67d493c2-f28c-8010-82f7-0b60117
| ab2...
|
| It will always say "an alligator". It chooses "an"
| because somewhere in the next word predictor it has
| already figured out that it wants to say alligator when
| it chooses "an".
|
| If you ask the question the other way around, it will
| always answer "a crocodile" for the same reason.
| littlestymaar wrote:
| Again, that's not a good example I think because
| everything about the answer is in the prompt, so
| obviously from the start the "alligator" is high, but
| then it's just waiting for an "an" to occur to have an
| occasion to put that.
|
| That doesn't mean it knows "in advance" what it wants to
| say, it's just that at every step the alligator is
| lurking in the logits because it directly derives from
| the prompt.
| metaxz wrote:
| You write: "it's just that at every step the alligator is
| lurking in the logits because it directly derives from
| the prompt" - but isn't that the whole point: at the
| moment the model writes "an", it isn't just spitting out
| a random article (or a 50/50 distribution of articles or
| other words for that matter); rather, "an" gets a high
| probability because the model internally knows that
| "alligator" is the correct thing after that. While it can
| only emit one token in this step, it will emit "an" to
| make it consistent with its alligator knowledge
| "lurking". And btw while not even directly relevant, the
| word alligator isn't in the prompt. Sure, it derives from
| the prompt, but so does everything an LLM generates, and the
| same goes for any other AI mechanism for generating answers.
| metaxz wrote:
| Thanks for writing this so clearly... I hear
| wrong/misguided arguments like we see here every day from
| friends, colleagues, "experts in the media" etc.
|
| It's strange because just a moment of thinking will show
| that such ideas are wrong or paint a clearly incomplete
| picture. And there's plenty of analogies to the dangers of
| such reductionism. It should be obviously wrong to anyone
| who has at least tried ChatGPT.
|
| My only explanation is that a denial mechanism must be at
| play. It simply feels more comfortable to diminish LLM
| capabilities and/or feel that you understand them from
| reading a Medium article on transformer networks, than to
| consider the consequences in terms of the inner black-box
| nature.
| throw310822 wrote:
| > there isn't a "I don't have enough information" option. This
| is due to the fact that LLMs are basically just giant look up
| maps with interpolation.
|
| Have you ever tried telling ChatGPT that you're "in the city
| centre" and asking it if you need to turn left or right to
| reach some landmark? It will not answer with the average of the
| directions given to everybody who asked the question before, it
| will answer asking you to tell it where you are precisely and
| which way you are facing.
| josh-sematic wrote:
| I don't buy LeCun's argument. Once you get good RL going (as we
| are now seeing with reasoning models) you can give the model a
| reward function that rewards a correct answer most highly, an
| "I'm sorry but I don't know" less highly than that, a wrong
| answer penalized, a confidently wrong answer more severely
| penalized. As the RL learns to maximize rewards I would think
| it would find the strategy of saying it doesn't know in cases
| where it can't find an answer it deems to have a high
| probability of correctness.
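|
| As a toy version of that reward shaping (the numbers are
| arbitrary, and "stated_confidence" would have to come from the
| model's own output or a verifier - this is just to make the
| ordering concrete):
|
|   def reward(correct, abstained, stated_confidence):
|       if abstained:
|           return 0.2        # honest "I don't know"
|       if correct:
|           return 1.0        # right answer, highest reward
|       # wrong answers: worse the more confidently stated
|       return -0.5 - 1.5 * stated_confidence
|
|   print(reward(True, False, 0.9))    #  1.0
|   print(reward(False, True, 0.0))    #  0.2
|   print(reward(False, False, 0.1))   # -0.65
|   print(reward(False, False, 0.95))  # -1.925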
| Tryk wrote:
| How do you define the "correct" answer?
| jpadkins wrote:
| obviously the truth is what is the most popular. /s
| thijson wrote:
| I watched an Andrej Karpathy video recently. He said that
| hallucination was because in the training data there were no
| examples where the answer is, "I don't know". Maybe I'm
| misinterpreting what he was saying though.
|
| https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4832s
| TZubiri wrote:
| If multiple answers are equally likely, couldn't that be
| considered uncertainty? Conversely if there's only one answer
| and there's a huge leap to the second best, that's pretty
| certain.
| unsupp0rted wrote:
| > The problem with LLMs is that the output is inherently
| stochastic
|
| Isn't that true with humans too?
|
| There's some leap humans make, even as stochastic parrots, that
| lets us generate new knowledge.
| spmurrayzzz wrote:
| > i.e. there isn't an "I don't have enough information" option.
|
| This is true in terms of default mode for LLMs, but there's a
| fair amount of research dedicated to the idea of training
| models to signal when they need grounding.
|
| SelfRAG is an interesting, early example of this [1]. The basic
| idea is that the model is trained to first decide whether
| retrieval/grounding is necessary and then, if so, after
| retrieval it outputs certain "reflection" tokens to decide
| whether a passage is relevant to answer a user query, whether
| the passage is supported (or requires further grounding), and
| whether the passage is useful. A score is calculated from the
| reflection tokens.
|
| The model then critiques itself further by generating a tree of
| candidate responses, and scoring them using a weighted sum of
| the score and the log probabilities of the generated candidate
| tokens.
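|
| In rough pseudo-Python, that candidate scoring is a weighted
| sum like the following (weights and field names are made up
| here; the paper tunes the weights per task):
|
|   def score(cand, w_rel=1.0, w_sup=1.0, w_use=0.5):
|       # average log-probability of the generated tokens
|       lp = sum(cand["logprobs"]) / len(cand["logprobs"])
|       return (lp
|               + w_rel * cand["p_relevant"]
|               + w_sup * cand["p_supported"]
|               + w_use * cand["p_useful"])
|
|   candidates = [
|       {"logprobs": [-0.2, -0.4, -0.1],
|        "p_relevant": 0.9, "p_supported": 0.8, "p_useful": 0.7},
|       {"logprobs": [-0.1, -0.1, -0.1],
|        "p_relevant": 0.4, "p_supported": 0.2, "p_useful": 0.9},
|   ]
|   print(max(candidates, key=score))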
|
| We can probably quibble about the loaded terms used here like
| "self-reflection", but the idea that models can be trained to
| know when they don't have enough information isn't pure fantasy
| today.
|
| [1] https://arxiv.org/abs/2310.11511
|
| EDIT: I should also note that I generally do side with LeCun's
| stance on this, but not due to the "not enough information"
| canard. I think models learning from abstraction (i.e. JEPA,
| energy-based models) rather than memorization is the better
| path forward.
| stevenae wrote:
| https://en.m.wikipedia.org/wiki/Energy-based_model
| jawiggins wrote:
| I'm not an ML researcher, but I do work in the field.
|
| My mental model of AI advancements is that of a step function
| with s-curves in each step [1]. Each time there is an algorithmic
| advancement, people quickly rush to apply it to both existing and
| new problems, demonstrating quick advancements. Then we tend to
| hit a kind of plateau for a number of years until the next
| algorithmic solution is found. Examples of steps include AlexNet
| demonstrating superior image labeling, LeCun demonstrating
| deep learning, and now OpenAI demonstrating large transformer
| models.
|
| I think in the past, at each stage, people tend to think that the
| recent progress is a linear or exponential process that will
| continue forward. This led to people thinking self-driving cars
| were right around the corner after the introduction of DL in the
| 2010s, and super-intelligence is right around the corner now. I
| think at each stage, the cusp of the S-curve comes as we find
| where the model is good enough to be deployed, and where it
| isn't. Then companies tend to enter a holding pattern for a
| number of years getting diminishing returns from small
| improvements on their models, until the next algorithmic
| breakthrough is made.
|
| Right now I would guess that we are around 0.9 on the S curve, we
| can still improve the LLMs (as DeepSeek has shown wide MoE and
| o1/o3 have shown CoT), and it will take a few years for the best
| uses to be brought to market and popularized. As you mentioned,
| LeCun points out that LLMs have a hallucination problem built
| into their architecture, others have pointed out that LLMs have
| had shockingly few revelations and breakthroughs for something
| that has ingested more knowledge than any living human. I think
| future work on LLMs is likely to make some improvement on these
| things, but not much.
|
| I don't know what it will be, but a new algorithm will be needed
| to induce the next step on the curve of AI advancement.
|
| [1]: https://www.open.edu/openlearn/nature-
| environment/organisati...
| Matthyze wrote:
| > Each time there is an algorithmic advancement, people quickly
| rush to apply it to both existing and new problems,
| demonstrating quick advancements. Then we tend to hit a kind of
| plateau for a number of years until the next algorithmic
| solution is found.
|
| That seems to be how science works as a whole. Long periods of
| little progress between productive paradigm shifts.
| semi-extrinsic wrote:
| Punctuated equilibrium theory.
| tyronehed wrote:
| This is actually a lazy approach as you describe it. Instead,
| what is needed is an elegant and simple approach that is 99%
| of the way there out of the gate. As soon as you start doing
| statistical tweaking and overfitting models, you are not
| approaching a solution.
| TrainedMonkey wrote:
| This is a somewhat nihilistic take with an optimistic ending. I
| believe humans will never fix hallucinations. The amount of
| totally or partially untrue statements people make is
| significant.
| Especially in tech, it's rare for people to admit that they do
| not know something. And yet, despite all of that the progress
| keeps marching forward and maybe even accelerating.
| ketzo wrote:
| Yeah, I think a lot of people talk about "fixing
| hallucinations" as the end goal, rather than "LLMs providing
| value", which misses the forest for the trees; it's obviously
| already true that we don't need totally hallucination-free
| output to get value from these models.
| dtnewman wrote:
| I'm not sure I follow. Sure, people lie, and make stuff up all
| the time. If an LLM goes and parrots that, then I would argue
| that it isn't hallucinating. Hallucinating would be where it
| makes something up that is not in its training set nor
| logically deducible from it.
| esafak wrote:
| I think most humans are perfectly capable of admitting to
| themselves when they do not know something. Computers ought to
| do better.
| danielmarkbruce wrote:
| You must interact with a _very_ different set of humans than
| most.
| ALittleLight wrote:
| I've never understood this critique. Models have the capability
| to say: "oh, I made a mistake here, let me change this" and that
| solves the issue, right?
|
| A little bit of engineering and fine tuning - you could imagine a
| model producing a sequence of statements, and reflecting on the
| sequence - updating things like "statement 7, modify: xzy to xyz"
| fhd2 wrote:
| I get "oh, I made a mistake" quite frequently. Often enough,
| it's just another hallucination, just because I contested the
| result, or even just prompted "double check this".
| Statistically speaking, when someone in a conversation says
| this, the other party is likely to change their position, so
| that's what an LLM does, too, replicating a statistically
| plausible conversation. That often goes in circles, not getting
| anywhere near a better answer.
|
| Not an ML researcher, so I can't explain it. But I get a pretty
| clear sense that it's an inherent problem and don't see how it
| could be trained away.
| rscho wrote:
| "Oh, I emptied your bank account here, let me change this."
|
| For AI to really replace most workers like some people would
| like to see, there are _plenty_ of situations where
| hallucinations are a complete no-go and need fixing.
| croes wrote:
| Isn't that the answer if you tell them they are wrong?
| killthebuddha wrote:
| I've always felt like the argument is super flimsy because "of
| course we can _in theory_ do error correction". I've never seen
| even a semi-rigorous argument that error correction is
| _theoretically_ impossible. Do you have a link to somewhere where
| such an argument is made?
| aithrowawaycomm wrote:
| In theory transformers are Turing-complete and LLMs can do
| anything computable. The more down-to-earth argument is that
| transformer LLMs aren't able to correct errors in a systematic
| way like LeCun is describing: it's task-specific "whack-a-
| mole," involving either tailored synthetic data or expensive
| RLHF.
|
| In particular, if you train an LLM to do Task A and Task B with
| acceptable accuracy, that does not guarantee it can combine the
| tasks in a common-sense way. "For each step of A, do B on the
| intermediate results" is a whole new Task C that likely needs
| to be fine-tuned. (This one actually does have some theoretical
| evidence coming from computational complexity, and it was the
| first thing I noticed in 2023 when testing chain-of-thought
| prompting. It's not that the LLM can't do Task C, it just takes
| extra training.)
| vhantz wrote:
| > of course we can _in theory_ do error correction
|
| Oh yeah? This is begging the question.
| blueyes wrote:
| Sincere question - why doesn't RL-based fine-tuning on top of
| LLMs solve this or at least push accuracy above a minimum
| acceptable threshold in many use cases? OAI has a team doing
| this for enterprise clients. Several startups rolling out of
| current YC batch are doing versions of this.
| InkCanon wrote:
| If you mean the so called agentic AI, I don't think it's
| several. Iirc someone in the most recent demo day mentioned
| ~80%+ were AI
| __rito__ wrote:
| Slightly related: Energy-Based Models (EBMs) are better in theory
| and yet too resource intensive. I tried to sell using EBMs to my
| org, but the price for even a small use case was prohibitive.
|
| I learned it from:
| https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo...
|
| Yann LeCun, and Michael Bronstein and his colleagues have some
| similarities in trying to properly _Sciencify_ Deep Learning.
|
| Yann LeCun's approach, at least for Vision, has one core tenet:
| energy minimization, just like in Physics. In his course, he also
| shows some current arch/algos to be special cases for EBMs.
|
| Yann believes that understanding the _Whys_ of the behavior of DL
| algorithms is going to be beneficial in the long term rather
| than playing around with hyper-params.
|
| There is also a case for language being too low-dimensional to
| lead to AGI even if it is _solved_. For instance, in a recent
| video he said that the total amount of data in all digitized
| books and on the internet is about the same as what a human
| child takes in during the first 4-5 years. He considers this
| low.
|
| There are also epistemological arguments that language alone
| cannot lead to AGI, but I haven't heard him talk about
| them.
|
| He also believes that Vision is a more important aspect of
| intelligence. One reason is that it is very high-dim. (Edit)
| Consider an example. Take 4 monochrome pixels. All pixels can
| range from 0 to 255. 4 pixels can create 256^4 = 2^32
| combinations. 4 words can create 4! = 24 combinations. Solving
| language is easier and therefore low-stakes. Remember the monkey
| producing a Shakespeare play by randomly punching typewriter
| keys? If that was an astronomically big number, think how
| obscenely long it would take a monkey to paint Mona Lisa by
| randomly assigning pixel values. Left as an exercise to the
| reader.
|
| Juergen Schmidhuber has gone rather quiet now. But he has also
| argued that a world model explicitly included in training and
| reasoning is better than only text or images or whatever. He
| has a good paper with Lucas Beyer.
| vessenes wrote:
| Thanks. This is interesting. What kind of equation is used to
| assess an ebm during training? I'm afraid I still don't get the
| core concept well enough to have an intuition for it.
| tyronehed wrote:
| Since this exposes the answer, the new architecture has to be
| based on world model building.
| uoaei wrote:
| The thing is, this has been known since even before the
| current crop of LLMs. Anyone who considered (only the
| English) language to be sufficient to model the world
| understands so little about cognition as to be irrelevant in
| this conversation.
| Horffupolde wrote:
| WTF. The cardinality of words is 100,000.
| albertzeyer wrote:
| Jurgen Schmidhuber has a paper with Lucas Beyer? I'm not aware
| of it. Which do you mean?
| hnfong wrote:
| I'm not an insider and I'm not sure whether this is directly
| related to "energy minimization", but "diffusion language models"
| have apparently gained some popularity in recent weeks.
|
| https://arxiv.org/abs/2502.09992
|
| https://www.inceptionlabs.ai/news
|
| (these are results from two different teams/orgs)
|
| It sounds kind of like what you're describing, and nobody else
| has mentioned it yet, so take a look and see whether it's
| relevant.
| hnuser123456 wrote:
| And they seem to be about 10x as fast as similar sized
| transformers.
| 317070 wrote:
| No, 10x fewer sampling steps. Whether or not that means 10x
| faster remains to be seen, as a diffusion step tends to be
| more expensive than an autoregressive step.
| littlestymaar wrote:
| If I understood correctly, in practice they show actual
| speed improvement on high-end cards, because autoregressive
| LLMs are bandwidth limited and not compute bound, so
| switching to a more expensive but less memory-bandwidth-heavy
| approach is going to work well on current hardware.
| estebarb wrote:
| I have no idea about EBM, but I have researched a bit on the
| language modelling side. And let's be honest, GPT is not the best
| learner we can create right now (ourselves). GPT needs far more
| data and energy than a human, so clearly there is a better
| architecture somewhere waiting to be discovered.
|
| Attention works, yes. But it is not biologically plausible at all.
| We don't do quadratic comparisons across a whole book or need to
| see thousands of samples to understand.
|
| Personally I think that in the future recursive architectures and
| test time training will have a better chance long term than
| current full attention.
|
| Also, I think that OpenAI's biggest contribution is demonstrating
| that reasoning like behaviors can emerge from really good
| language modelling.
| probably_wrong wrote:
| I haven't read Yann LeCun's take. Based on your description alone
| my first impression would be: there's a paper [1] arguing that
| "beam search enforces uniform information density in text, a
| property motivated by cognitive science". UID claims, in short,
| that a speaker only delivers as much content as they think the
| listener can take (no more, no less) and the paper claims that
| beam search enforced this property at generation time.
|
| The paper would be a strong argument against your point: if
| neural architectures are already constraining the amount of
| information that a text generation system delivers the same way a
| human (allegedly) does, then I don't see which "energy" measure
| one could take that could perform any better.
|
| Then again, perhaps they have one in mind and I just haven't read
| it.
|
| [1] https://aclanthology.org/2020.emnlp-main.170/
| vessenes wrote:
| I believe he's talking about some sort of 'energy as measured
| by distance from the model's understanding of the world' as in
| quite literally a world model. But again I'm ignorant, hence
| the post!
| tyronehed wrote:
| When an architecture is based around world model building,
| then it is a natural outcome that similar concepts and things
| end up being stored in similar places. They overlap. As soon
| as your solution starts to get mathematically complex, you
| are departing from what the human brain does. Not saying that
| in some universe it might be possible to make a statistical
| intelligence, but when you go that direction you are straying
| away from the only existing intelligences that we know about.
| The human brain. So the best solutions will closely echo
| neuroscience.
| deepsquirrelnet wrote:
| In some respects that sounds similar to what we already do
| with reward models. I think with GRPO, the "bag of rewards"
| approach doesn't strike me as terribly different. The
| challenge is in building out a sufficient "world" of rewards
| to adequately represent more meaningful feedback-based
| learning.
|
| While it sounds nice to reframe it like a physics problem, it
| seems like a fundamentally flawed idea, akin to saying "there
| is a closed form solution to the question of how should I
| live." The problem isn't hallucinations, the problem is that
| language and relativism are inextricably linked.
| tyronehed wrote:
| Any transformer based LLM will never achieve AGI because it's
| only trying to pick the next word. You need a larger amount of
| planning to achieve AGI. Also, the characteristics of LLMs do not
| resemble any existing intelligence that we know of. Does a baby
| require 2 years of statistical analysis to become useful? No.
| Transformer architectures are parlor tricks. They are glorified
| Google but they're not doing anything or planning. If you want
| that, then you have to base your architecture on the known
| examples of intelligence that we are aware of in the universe.
| And that's not a transformer. In fact, whatever AGI emerges will
| absolutely not contain a transformer.
| flawn wrote:
| It's not about just picking the next word here, that doesn't at
| all refute whether Transformers can achieve AGI. Words are just
| one representation of information. And whether it resembles any
| intelligence we know is also not an argument because there is
| no reason to believe that all intelligence is based on anything
| we've seen (e.g us, or other animals). The underlying
| architecture of Attention & MLPs can surely still depict
| something which we could call an AGI, and in certain tasks it
| surely can be considered an AGI already. I also don't know for
| certain whether we will hit any roadblocks or architectural
| asymptotes but I haven't come across any well-founded argument
| that Transformers definitely could not reach AGI.
| visarga wrote:
| The transformer is a simple and general architecture. Being
| such a flexible model, it needs to learn "priors" from data, it
| makes few assumptions on its distribution from the start. The
| same architecture can predict protein folding and fluid
| dynamics. It's not specific to language.
|
| We on the other hand are shaped by billions of years of genetic
| evolution, and 200k years of cultural evolution. If you count
| the total number of words spoken by 110 billion people who ever
| lived, assuming 1B estimated words per human during their
| lifetime, it comes out to 10 million times the size of GPT-4's
| training set.
|
| So we spent 10 million times more words discovering than it
| takes the transformer to catch up. GPT-4 used 10 thousand
| people's worth of language to catch up on all that
| evolutionary finetuning.
| simne wrote:
| > words spoken by 110 billion people who ever lived, assuming
| 1B estimated words per human during their lifetime..comes out
| to 10 million times the size of GPT-4's training set
|
| This assumption points in a slightly wrong direction, because
| no human could consume much more than about 1B words during
| their lifetime. So humanity could not gain an enhancement just
| by multiplying the words of one human by 100 billion. I think
| a more correct estimate would be 1B words multiplied by 100.
|
| I think current AI has already achieved the size needed to
| become AGI, but to finish it we probably need to change the
| structure (though I'm not sure about this), and we also need
| some additional multidimensional dataset, not just texts.
|
| I might bet on 3D cinema, and/or on automobile autopilot
| datasets, or something from real-life humanoid robots solving
| typical human tasks, like folding a shirt.
| unsupp0rted wrote:
| > Does a baby require 2 years of statistical analysis to become
| useful?
|
| Well yes, actually.
| tyronehed wrote:
| The alternative architectures must learn from streaming data,
| must be error tolerant and must have the characteristic that
| similar objects or concepts must naturally come near to each
| other. They must naturally overlap.
| bitwize wrote:
| Ever hear of Dissociated Press? If not, try the following
| demonstration.
|
| Fire up Emacs and open a text file containing a lot of human-
| readable text. Something off Project Gutenberg, say. Then say M-x
| dissociated-press and watch it spew hilarious, quasi-linguistic
| garbage into a buffer for as long as you like.
|
| Dissociated Press is a language model. A primitive, stone-knives-
| and-bearskins language model, but a language model nevertheless.
| When you feed it input text, it builds up a statistical model
| based on a Markov chain, assigning probabilities to each
| character that might occur next, given a few characters of input.
| If it sees 't' and 'h' as input, the most likely next character
| is probably going to be 'e', followed by maybe 'a', 'i', and 'o'.
| 'r' might find its way in there, but 'z' is right out. And so
| forth. It then uses that model to generate output text by picking
| characters at random given the past n input characters, resulting
| in a firehose of things that might be words or fragments of
| words, but don't make much sense overall.
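|
| The whole trick fits in a few lines (a sketch; "book.txt"
| stands in for whatever Gutenberg text you feed it):
|
|   import random
|   from collections import defaultdict
|
|   def build_model(text, n=3):
|       # count which character follows each n-char context
|       counts = defaultdict(lambda: defaultdict(int))
|       for i in range(len(text) - n):
|           counts[text[i:i+n]][text[i+n]] += 1
|       return counts
|
|   def generate(counts, seed, length=300):
|       out, n = seed, len(seed)
|       for _ in range(length):
|           nxt = counts.get(out[-n:])
|           if not nxt:
|               break
|           chars, weights = zip(*nxt.items())
|           out += random.choices(chars, weights=weights)[0]
|       return out
|
|   text = open("book.txt").read()
|   model = build_model(text)
|   print(generate(model, text[:3]))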
|
| LLMs are doing the same thing. They're picking the next token
| (word or word fragment) given a certain number of previous
| tokens. And that's ALL they're doing. The only differences are
| really matters of scale: the tokens are larger than single
| characters, the model considers many, many more tokens of input,
| and the model is a huge deep-learning model with oodles more
| parameters than a simple Markov chain. So while Dissociated Press
| churns out obvious nonsensical slop, ChatGPT churns out much,
| much more plausible sounding nonsensical slop. But it's still
| just rolling the dice over and over and choosing from among the
| top candidates of "most plausible sounding next token" according
| to its actuarial tables. It doesn't think. Any thinking it
| appears to do has been pre-done by humans, whose thoughts are
| then harvested off the internet and used to perform macrodata
| refinement on the statistical model. Accordingly, if you ask
| ChatGPT a question, it may well be right a lot of the time. But
| when it's wrong, it doesn't _know_ it's wrong, and it doesn't
| know what to do to make things right. Because it's just reaching
| into a bag of refrigerator magnet poetry tiles, weighted by
| probability of sounding good given the current context, and
| slapping whatever it finds onto the refrigerator. Over and over.
|
| What I think Yann LeCun means by "energy" above is
| "implausibility". That is, the LLM would instead grab a fistful
| of tiles -- enough to form many different responses -- and from
| those start with a single response and then through gradient
| descent or something optimize it to minimize some statistical
| "bullshit function" for the entire response, rather than just
| choosing one of the most plausible single tiles each go. Even
| that may not fix the hallucination issue, but it may produce
| results with fewer obvious howlers.
| vhantz wrote:
| +1
|
| But there's a fundamental difference between Markov chains and
| transformers that should be noted. Markov chains only learn how
| likely it is for one token to follow another. Transformers
| learn how likely it is for a set of tokens to be seen together.
| Transformers add a wider context to the Markov chain. That
| quantitative change leads to a qualitative improvement:
| transformers generate text that is semantically plausible.
| wnoise wrote:
| Yes, but k-token lookup was already a thing with markov
| chains. Transformers are indeed better, but just because they
| model language distributions better than mostly-empty arrays
| of (token-count)^(context).
| janalsncm wrote:
| I am an MLE not an expert. However, it is a fundamental problem
| that our current paradigm of training larger and larger LLMs
| cannot ever scale to the precision people require for many tasks.
| Even in the highly constrained realm of chess, an enormous neural
| net will be outclassed by a small program that can run on your
| phone.
|
| https://arxiv.org/pdf/2402.04494
| throw310822 wrote:
| > Even in the highly constrained realm of chess, an enormous
| neural net will be outclassed by a small program that can run
| on your phone.
|
| This is true also for the much bigger neural net that works in
| your brain, and even if you're the world champion of chess.
| Clearly your argument doesn't hold water.
| janalsncm wrote:
| For the sake of argument let's say an artificial neural net
| is approximately the same as the brain. It sounds like you
| agree with me that smaller programs are both more efficient
| and more effective than a larger neural net. So you should
| also agree with me that those who say the only path to AGI is
| LLM maximalism are misguided.
| jpadkins wrote:
| smaller programs are better than artificial or organic
| neural net for constrained problems like chess. But chess
| programs don't generalize to any other intelligence
| applications, like how organic neural nets do today.
| throw310822 wrote:
| > It sounds like you agree with me that smaller programs
| are both more efficient and more effective than a larger
| neural net.
|
| At playing chess. (But also at doing sums and
| multiplications, yay!)
|
| > So you should also agree with me that those who say the
| only path to AGI is LLM maximalism are misguided.
|
| No. First of all, it's a claim you just made up. What we're
| talking about is people saying that LLMs are _not_ the path
| to AGI- an entirely different claim.
|
| Second, assuming there's any coherence to your argument,
| the fact that a small program can outclass an enormous NN
| is irrelevant to the question of whether the enormous NN is
| the right way to achieve AGI: we are "general
| intelligences" and we are defeated by the same chess
| program. Unless you mean that achieving the intelligence of
| the greatest geniuses that ever lived is still not enough.
| thewarrior wrote:
| Any chance that "reasoning" can fix this
| janalsncm wrote:
| It kind of depends. You can broadly call any kind of search
| "reasoning". But search requires 1) enumerating your possible
| options and 2) assigning some value to those options. Real
| world problem solving makes both of those extremely
| difficult.
|
| Unlike in chess, there's a functionally infinite number of
| actions you can take in real life. So just argmax over
| possible actions is going to be hard.
|
| Two, you have to have some value function of how good an
| action is in order to argmax. But many actions are impossible
| to know the value of in practice because of hidden
| information and the chaotic nature of the world (butterfly
| effect).
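|
| The skeleton of "search as reasoning" is tiny - the hard part
| is filling in the two functions below for the real world
| (this toy "world" is made up purely for illustration):
|
|   def search(state, enumerate_actions, value):
|       # 1) enumerate options, 2) score them, 3) argmax
|       return max(enumerate_actions(state),
|                  key=lambda a: value(state, a))
|
|   actions = lambda s: ["left", "right", "wait"]
|   value = lambda s, a: {"left": 0.1, "right": 0.7,
|                         "wait": 0.3}[a]
|   print(search("start", actions, value))   # "right"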
| artificialprint wrote:
| Doesn't something about AlphaGo also involve "infinitely"
| many possible outcomes? Yet they cracked it, right?
| janalsncm wrote:
| Go is played on a 19x19 board. At the beginning of the
| game the first player has 361 possible moves. The second
| player then has 360 possible moves. There is always a
| finite and relatively "small" number of options.
|
| I think you are thinking of the fact that it had to be
| approached in a different way than Minimax in chess
| because a brute force decision tree grows way too fast to
| perform well. So they had to learn models for actions and
| values.
|
| In any case, Go is a perfect information game, which as I
| mentioned before, is not the same as problems in the real
| world.
| jurschreuder wrote:
| This concept comes from Hopfield networks.
|
| If two nodes are on, but the connection between them is negative,
| this causes energy to be higher.
|
| If one of those nodes switches off, energy is reduced.
|
| With two nodes this is trivial. With 10 nodes it's more difficult
| to solve, and with billions of nodes it is impossible to "solve".
|
| All you can do then is try to get the energy as low as possible.
|
| This way neural networks can also find out "new" information,
| that they have not learned, but is consistent with the
| constraints they have learned about the world so far.
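|
| A toy version in NumPy - the Hopfield energy is
| E = -1/2 * s^T W s, and asynchronous updates never increase
| it (sizes and weights here are random, just to show the
| energy dropping):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   n = 10
|   W = rng.normal(size=(n, n))
|   W = (W + W.T) / 2              # symmetric connections
|   np.fill_diagonal(W, 0)         # no self-connections
|   s = rng.choice([-1, 1], size=n)
|
|   def energy(W, s):
|       return -0.5 * s @ W @ s
|
|   print("before:", round(energy(W, s), 3))
|   for _ in range(200):
|       i = rng.integers(n)
|       s[i] = 1 if W[i] @ s >= 0 else -1   # local update
|   print("after: ", round(energy(W, s), 3))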
| vessenes wrote:
| So, what's modeled as a "node" in an EBM, and what's modeled as
| a connection? Are they vectors in a tensor, (well I suppose
| almost certainly that's a yes). Do they run side by side a
| model that's being trained? Is the node connectivity
| architecture fixed or learned?
| d--b wrote:
| Well, it could be argued that the "optimal response", i.e. the
| one that sorta minimizes that "energy", is settled by LLMs on
| the first iteration. And further iterations aren't adding any
| useful information and in fact are countless occasions to veer
| off the optimal response.
|
| For example if a prompt is: "what is the Statue of Liberty", the
| LLM's first output token is going to be "the", but it kinda
| already "knows" that the next ones are going to be "statue of
| liberty".
|
| So to me LLMs already "choose" a response path from the first
| token.
|
| Conversely, an LLM that would try to find a minimum energy for
| the whole response wouldn't necessarily stop hallucinating. There
| is nothing in the training of a model that says that "I don't
| know" has a lower "energy" than a wrong answer...
| rglover wrote:
| Not an ML researcher, but implementing these systems has shown
| this opinion to be correct. The non-determinism of LLMs is a
| feature, not a bug that can be fixed.
|
| As a result, you'll never be able to get 100% consistent outputs
| or behavior (like you hypothetically can with a traditional
| algorithm/business logic). And that has proven out in usage
| across every model I've worked with.
|
| There's also an upper-bound problem in terms of context where
| every LLM hits some arbitrary amount of context that causes it to
| "lose focus" and develop a sort of LLM ADD. This is when
| hallucinations and random, unrequested changes get made and a
| previously productive chat spirals to the point where you have to
| start over.
| EEgads wrote:
| Yann LeCun understands this is an electrical engineering and
| physical statistics of machine problem and not a code problem.
|
| The physics of human consciousness are not implemented in a leaky
| symbolic abstraction but the raw physics of existence.
|
| The sort of autonomous system we imagine when thinking AGI must
| be built directly into substrate and exhibit autonomous behavior
| out of the box. Our computers are blackboxes made in a lab
| without centuries of evolving in the analog world, finding a
| balance to build on. They either can do a task or cannot.
| Obviously from just looking at one we know how few real world
| tasks it can just get up and do.
|
| Code isn't magic, it's instruction to create a machine state.
| There's no inherent intelligence to our symbolic logic. It's an
| artifact of intelligence. It cannot imbue intelligence into a
| machine.
| bobosha wrote:
| I argue that JEPA and its Energy-Based Model (EBM) framework fail
| to capture the deeply intertwined nature of learning and
| prediction in the human brain--the "yin and yang" of
| intelligence. Contemporary machine learning approaches remain
| heavily reliant on resource-intensive, front-loaded training
| phases. I advocate for a paradigm shift toward seamlessly
| integrating training and prediction, aligning with the principles
| of online learning.
|
| Disclosure: I am the author of this paper.
|
| Reference: (PDF) Hydra: Enhancing Machine Learning with a Multi-
| head Predictions Architecture. Available from:
| https://www.researchgate.net/publication/381009719_Hydra_Enh...
| [accessed Mar 14, 2025].
| vessenes wrote:
| Thank you. So, quick q - it would make sense to me that JEPA is
| an outcome of the YLC work; would you say that's the case?
| esafak wrote:
| So you believe humans spend more energy on prediction, relative
| to computers? Isn't that because personal computers are not
| powerful enough to train big models, and most people have no
| desire to? It is more economically efficient to socialize the
| cost of training, as is done. Are you thinking of a distributed
| training, where we split the work and cost? That could happen
| when robots become more widespread.
| vessenes wrote:
| Update: Interesting paper, thanks. Comment on selection for
| Hydra -- you mention v1 uses an arithmetic mean across
| timescales for prediction. Taking this analogy of the longer
| windows encapsulating different timescales, I'd propose it
| would be interesting to train a layer to predict weighting of
| the timescale predictions. Essentially -- is this a moment
| where I need to focus on what _just_ happened, or is this a
| moment in which my long range predictions are more important?
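|
| Something like a learned gate over the per-timescale
| predictions, e.g. (a sketch in PyTorch; shapes and names are
| illustrative, not from the paper):
|
|   import torch
|   import torch.nn as nn
|
|   class TimescaleGate(nn.Module):
|       def __init__(self, d_state, n_scales):
|           super().__init__()
|           self.gate = nn.Linear(d_state, n_scales)
|
|       def forward(self, state, preds):
|           # state: [batch, d_state]
|           # preds: [batch, n_scales, d_out]
|           w = torch.softmax(self.gate(state), dim=-1)
|           return (w.unsqueeze(-1) * preds).sum(dim=1)
|
|   gate = TimescaleGate(d_state=16, n_scales=4)
|   out = gate(torch.randn(2, 16), torch.randn(2, 4, 8))
|   print(out.shape)   # torch.Size([2, 8])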
| inimino wrote:
| I have a paper coming up that I modestly hope will clarify some
| of this.
|
| The short answer should be that it's obvious LLM training and
| inference are both ridiculously inefficient and biologically
| implausible, and therefore there has to be some big optimization
| wins still on the table.
| jedberg wrote:
| > and biologically implausible
|
| I really like this approach. Showing that we must be doing it
| wrong because our brains are more efficient and we aren't doing
| it like our brains.
|
| Is this a common thing in ML papers or something you came up
| with?
| esafak wrote:
| Evolution does not need to converge on the optimum solution.
|
| Have you heard of https://en.wikipedia.org/wiki/Bio-
| inspired_computing ?
| jedberg wrote:
| It does not, you're right. But it's an interesting way to
| approach the problem nevertheless. And given that we
| definitely aren't as efficient as a human brain right now,
| it makes sense to look at the brain for inspiration.
| parsimo2010 wrote:
| I don't think GP was implying that brains are the optimum
| solution. I think you can interpret GP's comments like
| this- if our brains are more efficient than LLMs, then
| clearly LLMs aren't optimally efficient. We have at least
| one data point showing that better efficiency is possible,
| even if we don't know what the optimal approach is.
| esafak wrote:
| I agree. Spiking neural networks are usually mentioned in
| this context, but there is no hardware ecosystem behind
| them that can compete with Nvidia and CUDA.
| leereeves wrote:
| Investments in AI are now counting by billions of
| dollars. Would that be enough to create an initial
| ecosystem for a new architecture?
| esafak wrote:
| Nvidia has a big lead, and hardware is capital intensive.
| I guess an alternative would make sense in the battery-
| powered regime, like robotics, where Nvidia's power
| hungry machines are at a disadvantage. This is how ARM
| took on Intel.
| vlovich123 wrote:
| A new HW architecture for an unproven SW architecture is
| never going to happen. The SW needs to start working
| initially and demonstrate better performance. Of course,
| as with the original deep neural net stuff, it took
| computers getting sufficiently advanced to demonstrate
| this is possible. A different SW architecture would have
| to be so much more efficient to work. Moreover, HW and SW
| evolve in tandem - HW takes existing SW and tries to
| optimize it (e.g. by adding an abstraction layer) or SW
| tries to leverage existing HW to run a new architecture
| faster. Coming up with a new HW/SW combo seems unlikely
| given the cost of bringing HW to market. If AI speedup of
| HW ever delivers like Jeff Dean expects, then the cost of
| prototyping might come down enough to try to make these
| kinds of bets.
| _3u10 wrote:
| Nah it's just physics, it's like wheels being more efficient
| than legs.
|
| We know there is a more efficient solution (human brain) but
| we don't know how to make it.
|
| So it stands to reason that we can make more efficient LLMs,
| just like a CPU can add numbers more efficiently than humans.
| jonplackett wrote:
| Wheels is an interesting analogy. Wheels are more efficient
| now that we have roads. But there could never have been
| evolutionary pressure to make them before there were roads.
| Wheels are also a lot easier to get to work than robotic
| legs and so long as there's a road do a lot more than
| robotic legs.
| fluidcruft wrote:
| How are you separating the efficiency of the architecture
| from the efficiency of the substrate? Unless you have a brain
| made of transistors or an LLM made of neurons how can you
| identify the source of the inefficiency?
| vessenes wrote:
| I'm looking forward to it! Inefficiency (if we mean energy
| efficiency) conceptually doesn't bother me very much, in that it
| feels like silicon design has a long way to go still, but I
| like the idea of looking at biology for both ideas and
| guidance.
|
| Inefficiency in data input is also an interesting concept. It
| seems to me humans get more data in than even modern frontier
| models; if you use the gigabit/s estimates for sensory input.
| Care to elaborate on your thoughts?
| snowwrestler wrote:
| I think the hard question is whether those wins can be realized
| with less effort than what we're already doing, though.
|
| What I mean is this: A brain today is obviously far more
| efficient at intelligence than our current approaches to AI.
| But a brain is a highly specialized chemical computer that
| evolved over hundreds of millions of years. That leaves a lot
| of room for inefficient and implausible strategies to play out!
| As long as wins are preserved, efficiency can improve this way
| anyway.
|
| So the question is really, can we short cut that somehow?
|
| It does seem like doing so would require a different approach.
| But so far all our other approaches to creating intelligence
| have been beaten by the big simple inefficient one. So it's
| hard to see a path from here that doesn't go that route.
| sockaddr wrote:
| Also, a brain evolved to be a stable compute platform in a body
| that finds itself in many different temperature and energy
| regimes. And the brain can withstand and recover from some
| pretty severe damage. So I'd suspect an intelligence that is
| designed to run in a tighter temp/power envelope with no need
| for recovery or redundancy could be significantly more
| efficient than our brain.
| fallingknife wrote:
| The brain only operates in a very narrow temperature range
| too. 5 degrees C in either direction from 37 and you're in
| deep trouble.
| choilive wrote:
| Most brain damage would not be considered in the realm of
| what most people would consider "recoverable".
| zamubafoo wrote:
| Honest question: Given that the only wide consensus of anything
| approaching general intelligence are humans and that humans are
| biological systems that have evolved in physical reality, are
| there any arguments that better efficiency is even possible
| without relying on leveraging the nature of reality?
|
| For example, analog computers can differentiate near instantly
| by leveraging the nature of electromagnetism and you can do
| very basic analogs of complex equations by just connecting
| containers of water together in certain (very specific)
| configurations. Are we sure that these optimizations to get us
| to AGI are possible without abusing the physical nature of the
| world? This is without even touching the hot mess that is
| quantum mechanics and its role in chemistry which in turn
| affects biology. I wouldn't put it past evolution to have
| stumbled upon some quantum mechanic that allowed for the
| emergence of general intelligence.
|
| I'm super interested in anything discussing this but have very
| limited exposure to the literature in this space.
| HDThoreaun wrote:
| The advantage of artificial intelligence doesnt even need to
| be energy efficiency. We are pretty good at generating
| energy, if we had human level AI even if it used an order of
| magnitude more energy than humans use, that would likely still
| be cheaper than a human.
| Etheryte wrote:
| How does this idea compare to the rationale presented by Rich
| Sutton in The Bitter Lesson [0]? Shortly put, why do you think
| biological plausibility has significance?
|
| [0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| rsfern wrote:
| I'm not GP, but I don't think their position is necessarily
| in tension with leveraging computation. Not all FLOPs are
| equal, and furthermore FLOPs != Watts. In fact a much more
| efficient architecture might be that much more effective at
| leveraging computation than just burning a bigger pile of
| GPUs with the current transformer stack
| snats wrote:
| Not an insider but imo the work on diffusion language models
| like LLaDA is really exciting. It's pretty obvious that LLMs
| are good but they are pretty slow. And in a world where people
| want agents, a lot of the time you want something that might
| not be that smart but is capable of going really fast and
| searching fast. You only need to solve search in a specific
| domain for most agents. You don't need to encode the entire
| knowledge of human history in a single set of weights.
| eximius wrote:
| I believe that so long as weights are fixed at inference time,
| we'll be at a dead end.
|
| Will Titans be sufficiently "neuroplastic" to escape that? Maybe,
| I'm not sure.
|
| Ultimately, I think what will be required is an architecture
| built around "looping", where the model outputs are both some
| form of "self update" and "optional actionality", such that
| interacting with the model is more like "sampling from a
| thought space".
| mft_ wrote:
| Very much this. I've been wondering why I've not seen it much
| discussed.
| simne wrote:
| I'm not a deep researcher, more of an amateur, but I can
| explain some things.
|
| The main problem with the current approach is that to grow
| abilities you need to add more neurons, and this is not just
| energy consuming but also knowledge consuming: at GPT-4 scale,
| essentially all of humanity's text sources are already
| exhausted and the model becomes effectively overfitted. So it
| looks like multi-modal models appeared not because they are so
| good, but because they can learn from additional sources
| (audio/video).
|
| I've seen a few approaches to overcoming the overfitting
| problem, but as I understand it, no universal solution exists.
|
| For example, one approach that has been tried is to create
| synthetic training data from existing texts, but this idea is
| limited by definition.
|
| So current LLMs appear to have hit a dead end, and researchers
| are now trying to find a way out of it. I believe that in the
| next few years somebody will invent some universal solution
| (probably a combination of approaches) or propose another
| architecture, and AI progress will continue.
| giantg2 wrote:
| I feel like some hallucinations aren't bad. Isn't that basically
| what a new idea is - a hallucination of what could be? The
| ability to come up with new things, even if they're sometimes
| wrong, can be useful, and it happens all the time with humans.
| hn_user82179 wrote:
| That's a really interesting thought. I think the key part (as
| a consumer of AI tools) would be identifying which things are
| guesses vs deductions vs completely accurate based on the
| training data. I would happily look up or think through the
| possibly hallucinated parts of the output myself, but we don't
| currently get that kind of feedback. Whereas a human could
| list out the things they know, and then highlight the things
| they're making educated guesses about, which makes it easier
| to build upon.
| schainks wrote:
| This seems really intuitive to me. If I can express something
| concisely and succinctly because I understand it, I will
| literally spend less energy to explain it.
| jiggawatts wrote:
| My observation from the outside watching this all unfold is that
| not enough effort seems to be going into the training schedule.
|
| I say schedule because the "static data _once through_"
| approach is, in my mind, one of the root problems.
|
| Think about what happens when _you_ read something like a book.
| You're not "just" reading it, you're also comparing it to other
| books, other books by the same author, while critically
| considering the book recommendations made by your friend. Any
| events in the book get compared to your life experience, etc...
|
| LLM training does _none_ of this! It's a once-through text
| prediction training regime.
|
| What this means in practice is that an LLM can't write a review
| of a book unless it has read many reviews already. They have, of
| course, but the problem doesn't go away. Ask an AI to critique
| book reviews and it'll run out of steam because it hasn't seen
| many of those. Critiques of critiques are where they start
| falling flat on their face.
|
| This kind of meta-knowledge is precisely what experts accumulate.
|
| As a programmer I don't just regurgitate code I've seen before
| with slight variations -- instead I know, for example, that
| mainstream criticisms of microservices miss their key benefit
| of extreme team scalability!
|
| This is the crux of it: when humans read their training material
| they are generating an "n+1" level in their mind that they also
| learn. The current AI training setup trains the AI on only the
| "n"th level.
|
| This can be solved by running the training in a loop for several
| iterations after base training. The challenge of course is to
| develop a meaningful loss function.
|
| IMHO the "thinking" model training is a step in the right
| direction but nowhere near enough to produce AGI all by itself.
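|
| Very roughly, such a loop might look like the sketch below.
| Everything in it is a made-up stub (StubModel, score and
| meta_training are not real APIs), and the unsolved part is
| precisely the scoring/loss function:
|
|     # Hypothetical "n+1 level" training loop: after base
|     # training, the model generates meta-level material
|     # (critiques, critiques of critiques, ...) about its own
|     # corpus and is then trained on it. All stubs below.
|     class StubModel:
|         def generate(self, doc):     # stand-in for LLM sampling
|             return f"critique of: {doc[:30]}..."
|         def train_on(self, scored):  # stand-in for a training pass
|             pass
|
|     def score(text):                 # the "meaningful loss" we
|         return len(text)             # don't actually have yet
|
|     def meta_training(model, corpus, levels=3):
|         data = corpus
|         for _ in range(levels):      # level n, n+1, n+2, ...
|             generated = [model.generate(d) for d in data]
|             model.train_on([(g, score(g)) for g in generated])
|             data = generated         # next level becomes the input
|         return model
|
|     meta_training(StubModel(), ["some book text", "a review"])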
| infamouscow wrote:
| There was an article about this in the March 2025 issue of
| Communications of the ACM:
| https://dl.acm.org/doi/pdf/10.1145/3707200
| danielmarkbruce wrote:
| Whether or not he's right big picture, the specific thing about
| runaway tokens is dumb as hell.
| ilaksh wrote:
| I don't think you need to be an ML researcher to understand his
| point of view. He wants to do fundamental research. Optimizing
| LLMs is not fundamental research. There are numerous other
| potential approaches, and it's obvious that LLMs have weaknesses
| that other approaches could tackle.
|
| If he were Hinton's age then maybe he would also want to
| retire and be happy with transformers and LLMs. But he is
| still an ambitious researcher who wants to do foundational
| research to get to the next paradigm.
|
| Having said all of that, it is a misjudgement for him to be
| disparaging the incredible capabilities of LLMs to the degree he
| has.
| moron4hire wrote:
| > it is a misjudgement for him to be disparaging the incredible
| capabilities of LLMs to the degree he has.
|
| Jeez, you'd think he kicked your dog.
| jmpeax wrote:
| A transformer will attend to previous tokens, and so it is free
| to ignore prior errors. I don't get LeCun's comment on error
| propagation, and look forward to a more thorough exposition of
| the problem.
| akomtu wrote:
| The next-gen LLMs are going to use something like mipmaps in
| graphics: a stack of progressively smaller versions of the image,
| with a 1x1 image at the top. The same concept applies to text.
| When you're writing something, you have a high-level idea in
| mind that serves as a guide. That idea is such a mipmap.
| Perhaps next-gen LLMs will generate a few parallel sequences,
| with the top level acting as a slow-paced anchor and the
| bottom level being the actual text, which depends on the
| slower upper levels.
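|
| A rough sketch of what that two-level decoding might look like,
| with made-up stubs standing in for the "plan" and "text" models
| (plan_step and text_step are not an existing API, just an
| illustration of the control flow):
|
|     # Toy coarse-to-fine decoding loop: a slow "plan" sequence
|     # anchors a fast "text" sequence. Both steps are stubs.
|     def plan_step(plan):        # next high-level "mipmap" token
|         return f"idea-{len(plan)}"
|
|     def text_step(plan, text):  # next low-level token, conditioned
|         return f"w{len(text)}<{plan[-1]}>"  # on the current plan
|
|     plan, text = [], []
|     for i in range(12):
|         if i % 4 == 0:          # the plan advances 4x more slowly
|             plan.append(plan_step(plan))
|         text.append(text_step(plan, text))
|     print(" ".join(text))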
| bashfulpup wrote:
| He's right but at the same time wrong. Current AI methods are
| essentially scaled-up versions of methods we learned decades
| ago.
|
| These long-horizon (AGI) problems have been there since the
| very beginning. We have never had a solution to them. RL
| assumes we know the future, which is a poor proxy. These
| energy-based methods fundamentally do very little that an RNN
| didn't do long ago.
|
| I worked on higher-dimensionality methods, which is a very
| different angle. My take is that it's about the way we scale
| dependencies between connections. The human brain makes and
| breaks a massive number of neuron connections daily. Scaling
| the dimensionality would imply that a single connection could
| be scaled to encompass significantly more "thoughts" over
| time.
|
| Additionally, the true solution to these problems is as likely
| to be found by a kid with a laptop as by a top researcher. If
| you find the solution to CL on a small AI model (MNIST), you
| solve it at all scales.
| nradov wrote:
| For a kid with a laptop to solve it would require the problem
| to be solvable with current standard hardware. There's no
| evidence for that. We might need a completely different
| hardware paradigm.
| bashfulpup wrote:
| Also possible and a fair point. My point is that it's a
| "tiny" solution that we can scale.
|
| I could revise that by saying a kid with a whiteboard.
|
| It's an Einstein-x10 moment, so who knows when that'll happen.
| haolez wrote:
| Not exactly related, but I wonder sometimes if the fact that
| the weights in current models are very expensive to change is a
| feature and not a "bug".
|
| Somehow, it feels harder to trust a model that could evolve
| over time. Its performance might even degrade. That's a steep
| price to pay for having memory built in and a (possibly) self-
| evolving model.
| bashfulpup wrote:
| We degrade, and I think we are far more valuable than one
| model.
| bravura wrote:
| Okay I think I qualify. I'll bite.
|
| LeCun's argument is this:
|
| 1) You can't learn an accurate world model just from text.
|
| 2) Multimodal learning (vision, language, etc) and interaction
| with the environment is crucial for true learning.
|
| He and people like Hinton and Bengio have been saying for a
| while that there are tasks that mice can understand but an AI
| can't, and that even achieving mouse-level intelligence would
| be a breakthrough -- but we cannot get there through language
| learning alone.
|
| A simple example from "How Large Are Lions? Inducing
| Distributions over Quantitative Attributes"
| (https://arxiv.org/abs/1906.01327) is this: Learning the size of
| objects using pure text analysis requires significant gymnastics,
| while vision demonstrates physical size more easily. To determine
| the size of a lion you'll need to read thousands of sentences
| about lions, or you could look at two or three pictures.
|
| LeCun isn't saying that LLMs aren't useful. He's just concerned
| with bigger problems, like AGI, which he believes cannot be
| solved purely through linguistic analysis.
|
| The energy minimization architecture is more about joint
| multimodal learning.
|
| (Energy minimization is a very old idea. LeCun has been on about
| it for a while and it's less controversial these days. Back when
| everyone tried to have a probabilistic interpretation of neural
| models, it was expensive to compute the normalization term /
| partition function. Energy minimization basically said: Set up a
| sensible loss and minimize it.)
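|
| To make "set up a sensible loss and minimize it" concrete,
| here's a minimal sketch of a joint-embedding energy model
| trained with a margin/contrastive loss -- toy linear encoders
| and random stand-in data, just to show the shape of the
| training signal, not anyone's actual architecture:
|
|     import torch
|     import torch.nn as nn
|
|     # Energy = squared distance between the two modality
|     # embeddings (toy stand-ins for image/text encoders).
|     enc_x = nn.Linear(16, 8)
|     enc_y = nn.Linear(32, 8)
|
|     def energy(x, y):
|         return ((enc_x(x) - enc_y(y)) ** 2).sum(dim=-1)
|
|     x     = torch.randn(64, 16)   # stand-in "images"
|     y_pos = torch.randn(64, 32)   # matching "text"
|     y_neg = torch.randn(64, 32)   # mismatched "text"
|     margin = 1.0
|
|     opt = torch.optim.SGD(
|         list(enc_x.parameters()) + list(enc_y.parameters()),
|         lr=0.1)
|     for _ in range(100):
|         # Push energy of compatible pairs down, and push
|         # incompatible pairs up to at least the margin -- no
|         # partition function / normalization term needed.
|         loss = (energy(x, y_pos).mean()
|                 + torch.relu(margin - energy(x, y_neg)).mean())
|         opt.zero_grad()
|         loss.backward()
|         opt.step()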
___________________________________________________________________
(page generated 2025-03-14 23:00 UTC)