[HN Gopher] From tokens to thoughts: How LLMs and humans trade c...
       ___________________________________________________________________
        
       From tokens to thoughts: How LLMs and humans trade compression for
       meaning
        
       Author : ggirelli
       Score  : 99 points
       Date   : 2025-06-05 07:59 UTC (15 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | valine wrote:
       | >> For each LLM, we extract static, token-level embeddings from
       | its input embedding layer (the 'E' matrix). This choice aligns our
       | analysis with the context-free nature of stimuli typical in human
       | categorization experiments, ensuring a comparable
       | representational basis.
       | 
       | They're analyzing input embeddings, not LLMs. I'm not sure how
       | the authors justify making claims about the inner workings of
       | LLMs when they haven't actually computed a forward pass. The E
       | matrix is not an LLM; it's a lookup table.
       | 
       | Just to highlight the ridiculousness of this research, no
       | attention was computed! Not a single dot product between keys and
       | queries. All of their conclusions are drawn from the output of an
       | embedding lookup table.
       | 
       | The figure showing their alignment score correlated with model
       | size is particularly egregious. Model size is meaningless when
       | you never activate any model parameters. If BERT is outperforming
       | Qwen and Gemma, something is wrong with your methodology.
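       | 
       | A minimal sketch of the distinction being drawn (using the
       | Hugging Face transformers library; the checkpoint name here
       | is just a placeholder, not the paper's setup). Only the
       | second step actually runs the transformer:
       | 
       |     import torch
       |     from transformers import AutoModel, AutoTokenizer
       | 
       |     name = "bert-base-uncased"  # placeholder checkpoint
       |     tok = AutoTokenizer.from_pretrained(name)
       |     model = AutoModel.from_pretrained(name)
       |     ids = tok("robin", return_tensors="pt")["input_ids"]
       | 
       |     # Static lookup: rows of the input embedding matrix E.
       |     # No attention, no feed-forward layers are computed.
       |     static = model.get_input_embeddings()(ids)
       | 
       |     # Forward pass: contextual states after every layer.
       |     with torch.no_grad():
       |         contextual = model(input_ids=ids).last_hidden_state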
        
         | blackbear_ wrote:
         | Note that the token embeddings are also trained, so their
         | values do give _some_ hints about how a model is organizing
         | information.
         | 
         | They used token embeddings directly and not intermediate
         | representations because the latter depend on the specific
         | sentence that the model is processing. The human judgment
         | data, however, was collected without any context surrounding
         | each word, so using the token embeddings seems to be the
         | fairest comparison.
         | 
         | Otherwise, what sentence(s) would you have used to compute the
         | intermediate representations? And how would you make sure that
         | the results aren't biased by these sentences?
        
           | navar wrote:
           | You can process a single word through a transformer and get
           | the corresponding intermediate representations.
           | 
           | Though it may sound odd, there is no problem with it, and it
           | would indeed return the model's representation of that single
           | word as seen by the model without any additional context.
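           | 
           | For instance (a sketch assuming the Hugging Face
           | transformers library and a placeholder checkpoint), the
           | hidden state of a lone word after every layer:
           | 
           |     from transformers import AutoModel, AutoTokenizer
           | 
           |     name = "gpt2"  # placeholder checkpoint
           |     tok = AutoTokenizer.from_pretrained(name)
           |     model = AutoModel.from_pretrained(
           |         name, output_hidden_states=True)
           | 
           |     ids = tok("robin", return_tensors="pt")["input_ids"]
           |     out = model(input_ids=ids)
           |     # Tuple of (num_layers + 1) tensors: the embedding
           |     # output plus the state after each transformer block.
           |     per_layer = out.hidden_states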
        
           | valine wrote:
           | Embedding models are not always trained with the rest of the
           | model. That's the whole idea behind VLLMs. First layer
           | embeddings are so interchangeable you can literally feed in
           | the output of other models using linear projection layers.
           | 
           | And like the other commenter said, you can absolutely feed
           | single tokens through the model. Your point doesn't make any
           | sense though regardless. How about priming the model with
           | "You're a helpful assistant" just like everyone else does.
        
         | boroboro4 wrote:
         | It's mind-blowing that LeCun is listed as one of the authors.
         | 
         | I would expect model size to correlate with alignment score,
         | because model size usually correlates with hidden dimension.
         | But the opposite can also be true: bigger models might shift
         | more of the basic token-classification logic into the layers,
         | and hence embedding alignment can go down. Regardless, this
         | feels like pretty useless research...
        
           | danielbln wrote:
           | Leaves a bit of a bad taste, considering LeCun's famously
           | critical stance on auto-regressive transformer LLMs.
        
         | throwawaymaths wrote:
         | The LLM is also a lookup table! But your point is correct:
         | they should have looked at subsequent layers that aggregate
         | information over distance.
        
       | andoando wrote:
       | Am I the only one who is lost on how the calculations are made?
       | 
       | From what I can tell, this is limited in scope to categorizing
       | nouns (robin is a bird).
        
       | fusionadvocate wrote:
       | Open a bank account. Open your heart. Open a can. Open to new
       | experiences.
       | 
       | Words are a tricky thing to handle.
        
         | an0malous wrote:
         | OpenAI agrees
        
         | esafak wrote:
         | And models since BERT and ELMo capture polysemy!
         | 
         | https://aclanthology.org/2020.blackboxnlp-1.15/
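         | 
         | A rough sketch of that point (placeholder checkpoint, not
         | the linked paper's code): the contextual vectors for "open"
         | in two of the sentences above come out different, which a
         | static lookup table cannot do.
         | 
         |     import torch
         |     from transformers import AutoModel, AutoTokenizer
         | 
         |     name = "bert-base-uncased"  # placeholder checkpoint
         |     tok = AutoTokenizer.from_pretrained(name)
         |     model = AutoModel.from_pretrained(name)
         | 
         |     def vec_for_open(sentence):
         |         enc = tok(sentence, return_tensors="pt")
         |         open_id = tok.convert_tokens_to_ids("open")
         |         idx = enc["input_ids"][0].tolist().index(open_id)
         |         with torch.no_grad():
         |             hidden = model(**enc).last_hidden_state
         |         return hidden[0, idx]
         | 
         |     a = vec_for_open("open a bank account")
         |     b = vec_for_open("open your heart")
         |     # Below 1.0: context shifts the representation.
         |     print(torch.cosine_similarity(a, b, dim=0))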
        
         | bluefirebrand wrote:
         | And that is _just_ in English
         | 
         | Other languages have similar but fundamentally different
         | oddities which do not translate cleanly
        
           | suddenlybananas wrote:
           | Not sure how they're fundamentally different. What do you
           | mean?
        
             | bluefirebrand wrote:
             | Think about the work of localizing a joke that relies on
             | wordplay or similar-sounding words to be funny. Or simply
             | how words rhyme.
             | 
             | Try explaining why tough and rough rhyme but bough doesn't
             | 
             | You know? Language has a ton of idiosyncrasies.
        
               | Qworg wrote:
               | To make it more concrete - here's an example in Chinese:
               | https://en.wikipedia.org/wiki/Grass_Mud_Horse
        
               | Scarblac wrote:
               | ChatGPT is horrible at producing Dutch rhymes (for
               | Sinterklaas poems) until you realize that the words it
               | comes up with do rhyme when translated to English.
        
             | thesz wrote:
             | Since most languages allow one to express algorithms, they
             | are all Turing complete and thus not fundamentally
             | different. The complexity of expressing some concepts does
             | differ, though.
             | 
             | My favorite device is a "square": my name for an
             | enumeration that lets me compare and contrast things along
             | two qualities, each with two extremes.
             | 
             | One such square is "One can (not) (not) do something."
             | Each "not" can be present or absent, just like a truth
             | table.
             | 
             | "One can do something", "one cannot do something", "one
             | can not do something" (i.e., can refrain from doing it)
             | and, finally, "one cannot not do something", which English
             | renders as "one cannot help but do something."
             | 
             | Why should we have to say "help but" instead of a second
             | "not"?
             | 
             | While this does not preclude one from enumerating the
             | possibilities while thinking in English, it makes that
             | enumeration harder than it is in other languages. In
             | Russian, for example, the square is expressible directly.
             | 
             | Also, "help but" is not shorter than "not"; it is longer.
             | Useful idioms usually get the shorter forms, so apparently
             | English speakers never found that corner of the square
             | useful enough to deserve one.
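             | 
             | (For concreteness, a tiny sketch of the square as a
             | literal truth table; the last corner printed is the one
             | English paraphrases as "cannot help but do".)
             | 
             |     from itertools import product
             | 
             |     # The four corners of the square:
             |     # (can / cannot) x (do / not do).
             |     corners = product(["can", "cannot"], ["do", "not do"])
             |     for modal, verb in corners:
             |         print(f"one {modal} {verb} something")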
        
         | falcor84 wrote:
         | I agree in general, but I think that "open" is actually a
         | pretty straightforward word.
         | 
         | As I see it, "Open your heart", "Open a can" and "Open to new
         | experiences" have very similar meanings for "Open", being
         | essentially "make a container available for external I/O",
         | similar to the definition of an "open system" in
         | thermodynamics. "Open a bank account" is a bit different, as it
         | creates an entity that didn't exist before, but even then the
         | focus is on having something that allows for external I/O - in
         | this case deposits and withdrawals.
        
       | johnnyApplePRNG wrote:
       | This paper is interesting, but ultimately it's just restating
       | that LLMs are statistical tools and not cognitive systems. The
       | information-theoretic framing doesn't really change that.
        
         | Nevermark wrote:
         | > LLMs are statistical tools and not cognitive systems
         | 
         | I have never understood broad statements that models are just
         | (or mostly) statistical tools.
         | 
         | Certainly statistics apply: minimizing mismatches results in
         | predicting the mean (or a similar measure) of the target.
         | 
         | But the architecture of a model is the difference between
         | compressed statistics and forcing the model to transform
         | information in a highly organized way, one that reflects the
         | actual shape of the problem, in order to get any accuracy at
         | all.
         | 
         | In both cases, statistics are relevant, but in the latter it's
         | not a particularly insightful way to talk about what a model
         | has learned.
         | 
         | Statistical accuracy, prediction, etc. are the basic problems
         | to solve, the training criteria being optimized. But they
         | don't limit the nature of the solutions: they leave both
         | problem difficulty and solution sophistication unbounded.
        
       | catchnear4321 wrote:
       | incomplete inaccurate off misleading meandering not quite
       | generation prediction removal of superfluous fast but spiky
       | 
       | this isn't talking about that.
        
       ___________________________________________________________________
       (page generated 2025-06-05 23:01 UTC)