[HN Gopher] From tokens to thoughts: How LLMs and humans trade c...
___________________________________________________________________
From tokens to thoughts: How LLMs and humans trade compression for
meaning
Author : ggirelli
Score : 99 points
Date : 2025-06-05 07:59 UTC (15 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| valine wrote:
| >> For each LLM, we extract static, token-level embeddings from
| its input embedding layer (the 'E' matrix). This choice aligns our
| analysis with the context-free nature of stimuli typical in human
| categorization experiments, ensuring a comparable
| representational basis.
|
| They're analyzing input embeddings, not LLMs. I'm not sure how
| the authors justify making claims about the inner workings of
| LLMs when they haven't actually computed a forward pass. The E
| matrix is not an LLM, it's a lookup table.
|
| Just to highlight the ridiculousness of this research, no
| attention was computed! Not a single dot product between keys and
| queries. All of their conclusions are drawn from the output of an
| embedding lookup table.
|
| The figure showing their alignment score correlated with model
| size is particularly egregious. Model size is meaningless when
| you never activate any model parameters. If BERT is outperforming
| Qwen and Gemma, something is wrong with your methodology.
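|
| For reference, a minimal sketch of what "static token embeddings
| from the E matrix" amounts to in practice (Hugging Face-style
| API; the model name is only a placeholder): no forward pass, no
| attention, just an index into a weight matrix.
|
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|
|     name = "bert-base-uncased"  # placeholder model
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModel.from_pretrained(name)
|
|     # the input embedding layer: an nn.Embedding lookup table
|     E = model.get_input_embeddings()
|
|     ids = tok("robin", add_special_tokens=False,
|               return_tensors="pt").input_ids
|     with torch.no_grad():
|         static_vec = E(ids)[0].mean(dim=0)  # context-free vector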
| blackbear_ wrote:
| Note that the token embeddings are also trained, so their
| values do give _some_ hints about how a model organizes
| information.
|
| They used token embeddings directly rather than intermediate
| representations because the latter depend on the specific
| sentence the model is processing. The human judgment data,
| however, was collected without any context surrounding each
| word, so using the token embeddings seems to be the fairest
| comparison.
|
| Otherwise, what sentence(s) would you have used to compute the
| intermediate representations? And how would you make sure that
| the results aren't biased by these sentences?
| navar wrote:
| You can process a single word through a transformer and get
| the corresponding intermediate representations.
|
| Though it may sound odd, there is no problem with it, and it
| would indeed return the model's representation of that single
| word without any additional context.
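|
| A sketch of what that would look like (the word, model name,
| and layer choice here are only illustrative):
|
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|
|     name = "bert-base-uncased"  # placeholder model
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModel.from_pretrained(name)
|
|     inputs = tok("robin", return_tensors="pt")  # one word, no context
|     with torch.no_grad():
|         out = model(**inputs, output_hidden_states=True)
|
|     # hidden_states[0] is the embedding layer, [-1] the last layer;
|     # average over the word's sub-tokens, skipping [CLS]/[SEP]
|     contextual = out.hidden_states[-1][0, 1:-1].mean(dim=0)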
| valine wrote:
| Embedding layers are not always trained with the rest of the
| model. That's the whole idea behind vision-language models
| (VLMs): first-layer embeddings are so interchangeable that you
| can literally feed in the output of other models through linear
| projection layers.
|
| And like the other commenter said, you can absolutely feed
| single tokens through the model. Regardless, your point doesn't
| make sense; how about priming the model with "You're a helpful
| assistant", just like everyone else does?
| boroboro4 wrote:
| It's mind-blowing that LeCun is listed as one of the authors.
|
| I would expect model size to correlate with alignment score,
| because model size usually correlates with hidden dimension.
| But the opposite can also be true: bigger models might shift
| more of the basic token-classification logic into later layers,
| and hence embedding alignment can go down. Regardless, it feels
| like pretty useless research...
| danielbln wrote:
| Leaves a bit of a strange taste, considering LeCun's famously
| critical stance on autoregressive transformer LLMs.
| throwawaymaths wrote:
| The LLM is also a lookup table! But your point is correct: they
| should have looked at subsequent layers, which aggregate
| information over distance.
| andoando wrote:
| Am I the only one who is lost on how the calculations are made?
|
| From what I can tell, this is limited in scope to categorizing
| nouns (a robin is a bird).
| fusionadvocate wrote:
| Open a bank account. Open your heart. Open a can. Open to new
| experiences.
|
| Words are a tricky thing to handle.
| an0malous wrote:
| OpenAI agrees
| esafak wrote:
| And models since BERT and ELMo capture polysemy!
|
| https://aclanthology.org/2020.blackboxnlp-1.15/
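|
| A quick way to see this (sketch only; the model name and
| sentences are arbitrary examples, and it assumes "open" maps to
| a single token):
|
|     import torch
|     import torch.nn.functional as F
|     from transformers import AutoModel, AutoTokenizer
|
|     name = "bert-base-uncased"  # placeholder model
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModel.from_pretrained(name)
|
|     def word_vec(sentence, word):
|         enc = tok(sentence, return_tensors="pt")
|         with torch.no_grad():
|             hidden = model(**enc).last_hidden_state[0]
|         # position of the word's token in the sentence
|         idx = enc.input_ids[0].tolist().index(
|             tok.convert_tokens_to_ids(word))
|         return hidden[idx]
|
|     a = word_vec("open a bank account", "open")
|     b = word_vec("open a can", "open")
|     c = word_vec("open to new experiences", "open")
|     print(F.cosine_similarity(a, b, dim=0),
|           F.cosine_similarity(a, c, dim=0))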
| bluefirebrand wrote:
| And that is _just_ in English
|
| Other languages have similar but fundamentally different
| oddities which do not translate cleanly
| suddenlybananas wrote:
| Not sure how they're fundamentally different. What do you
| mean?
| bluefirebrand wrote:
| Think about the work of localizing a joke that relies on
| wordplay or similar-sounding words to be funny, or simply on
| how words rhyme.
|
| Try explaining why tough and rough rhyme but bough doesn't
|
| You know? Language has a ton of idiosyncrasies.
| Qworg wrote:
| To make it more concrete - here's an example in Chinese:
| https://en.wikipedia.org/wiki/Grass_Mud_Horse
| Scarblac wrote:
| ChatGPT is horrible at producing Dutch rhymes (for
| Sinterklaas poems) until you realize that the words it
| comes up with do rhyme when translated to English.
| thesz wrote:
| As most languages allow the expression of algorithms, they are
| all Turing complete and thus not fundamentally different. The
| complexity of expressing some concepts does differ, though.
|
| My favorite example is a "square": my name for an enumeration
| that lets me compare and contrast things along two different
| qualities, each expressed by two extremes.
|
| One such square is "one can (not) do (not do) something." Each
| "not" can be present or absent, just like in a truth table.
|
| "One can do something", "one cannot do something", "one can
| not-do something" and, finally, "one cannot help but do
| something."
|
| Why should we use "help but" instead of "do not"?
|
| While this does not preclude one from enumerating the
| possibilities while thinking in English, it makes that
| enumeration harder than it could be in other languages. In
| Russian, for example, the "square" is expressible directly.
|
| Also, "help but" is not shorter than "do not"; it is longer.
| Useful idioms are usually expressed in shorter forms, so,
| apparently, English speakers consider "one cannot help but do
| something" not useful enough to shorten.
| falcor84 wrote:
| I agree in general, but I think that "open" is actually a
| pretty straightforward word.
|
| As I see it, "Open your heart", "Open a can" and "Open to new
| experiences" have very similar meanings for "Open", being
| essentially "make a container available for external I/O",
| similar to the definition of an "open system" in
| thermodynamics. "Open a bank account" is a bit different, as it
| creates an entity that didn't exist before, but even then the
| focus is on having something that allows for external I/O - in
| this case deposits and withdrawals.
| johnnyApplePRNG wrote:
| This paper is interesting, but ultimately it's just restating
| that LLMs are statistical tools and not cognitive systems. The
| information-theoretic framing doesn't really change that.
| Nevermark wrote:
| > LLMs are statistical tools and not cognitive systems
|
| I have never understood broad statements that models are just
| (or mostly) statistical tools.
|
| Certainly statistics apply: minimizing mismatches results in
| mean (or similar-measure) target predictions.
|
| But the architecture of a model is the difference between
| storing compressed statistics and forcing the model to
| transform information in a highly organized way, one that
| reflects the actual shape of the problem, to get any accuracy
| at all.
|
| In both cases, statistics are relevant, but in the latter it's
| not a particularly insightful way to talk about what a model
| has learned.
|
| Statistical accuracy, prediction, etc. are the basic problems
| to solve: the training criteria being optimized. But they don't
| limit the nature of the solutions; they leave both problem
| difficulty and solution sophistication unbounded.
| catchnear4321 wrote:
| incomplete inaccurate off misleading meandering not quite
| generation prediction removal of superfluous fast but spiky
|
| this isn't talking about that.
___________________________________________________________________
(page generated 2025-06-05 23:01 UTC)