[HN Gopher] LLMs Encode How Difficult Problems Are
       ___________________________________________________________________
        
       LLMs Encode How Difficult Problems Are
        
       Author : stansApprentice
       Score  : 73 points
       Date   : 2025-11-06 18:29 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jiito wrote:
       | I haven't read this particular paper in-depth, but it reminds me
       | of another one I saw that used a similar approach to find if the
       | model encodes its own certainty of answering correctly.
       | https://arxiv.org/abs/2509.10625
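 
        Probing papers of this kind typically train a simple classifier or
        regressor on frozen hidden activations. The sketch below is a
        minimal illustration of that general recipe, not the exact setup of
        either paper: gpt2, the toy difficulty labels, and the ridge probe
        are all stand-in assumptions.
 
          import torch
          from sklearn.linear_model import Ridge
          from transformers import AutoModelForCausalLM, AutoTokenizer
 
          tok = AutoTokenizer.from_pretrained("gpt2")      # stand-in model
          model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
 
          # Toy (problem, difficulty) pairs -- illustrative labels only.
          problems = [
              ("What is 2 + 2?", 0.0),
              ("Integrate x * exp(x) dx.", 0.4),
              ("Prove there are infinitely many primes.", 0.7),
              ("Prove or disprove: P = NP.", 1.0),
          ]
 
          def last_token_state(text, layer=-1):
              """Hidden state of the final prompt token at one layer."""
              ids = tok(text, return_tensors="pt")
              with torch.no_grad():
                  out = model(**ids, output_hidden_states=True)
              return out.hidden_states[layer][0, -1].numpy()
 
          X = [last_token_state(p) for p, _ in problems]
          y = [d for _, d in problems]
 
          # Linear probe: if difficulty is linearly decodable from the
          # activations, a frozen-feature regression should recover it.
          probe = Ridge(alpha=1.0).fit(X, y)
          print(probe.predict([last_token_state("Factor 2^67 - 1.")]))
 
        Whether such a probe generalizes beyond a handful of prompts, rather
        than memorizing them, is the empirical question this line of work
        actually tests.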
        
       | kazinator wrote:
       | It's all very clear when you mentally replace "LLM" with "text
       | completion driven by compressed training data".
       | 
       | E.g.
       | 
        | [Text completion driven by compressed training data] exhibit[s] a
       | puzzling inconsistency: [it] solves complex problems yet
       | frequently fail[s] on seemingly simpler ones.
       | 
       | Some problems are better represented by a locus of texts in the
       | training data, allowing more plausible talk to be generated. When
       | the problem is not well represented, it does not help that the
       | problem is simple.
       | 
       | If you train it on nothing but Scientology documents, and then
       | ask about the Buddhist perspective on a situation, you will
       | probably get some nonsense about body thetans, even if the
       | situation is simple.
        
         | th0ma5 wrote:
          | Thank you for posting this. I'm struck by how much of this
          | work studies a behavior in isolation from other assumptions,
          | and then describes each isolated capability as a new solution
          | or discovered ability that would supposedly work alongside all
          | of those other assumptions. It makes most LLM research feel
          | like whack-a-mole if the goal is to build accurate and
          | reliable models by understanding these techniques. Instead,
          | it's more like seeing faces in cars and buildings: artifacts
          | of patterns, pattern groupings, and pattern recognition.
          | Building houses on sand, etc.
        
         | lukev wrote:
          | Well, that's what an LLM _is_. The problem is if one's mental
          | model is built on "AI" instead of "LLM."
          | 
          | The fact that LLMs can abstract concepts and do _any_ amount of
          | out-of-sample reasoning is impressive and interesting, but the
          | null hypothesis for an LLM being "impressive" in any regard is
          | that the data required to answer the question is present in
          | its training set.
        
         | XenophileJKO wrote:
          | This is true, but also misleading. We are learning that the
          | models achieve compression by distilling higher-level concepts
          | and deriving generalized, human-like abilities; see, for
          | example, the recent introspection paper from Anthropic.
        
         | layoric wrote:
          | I have a hard time conceptualizing lossy text compression, but
          | I've recently started to think of the "reasoning"/output as
          | just a byproduct of lossy compression, with the weights
          | tending towards an average of the information "around" the
          | main topic of the prompt. What I've found easier is thinking
          | about it like lossy image compression: generating more output
          | tokens via "reasoning" is like subdividing nearby pixels and
          | filling in the gaps with values the model has seen there
          | before. Taking the analogy a bit too far, you can also think
          | of the vocabulary as the pixel bit depth.
          | 
          | I definitely agree that replacing "AI" or "LLM" with "X driven
          | by compressed training data" makes a lot more sense, and is a
          | useful shortcut.
        
           | suprjami wrote:
           | You're right about "reasoning". It's just trying to steer the
           | conversation in a more relevant direction in vector space,
           | hopefully to generate more relevant output tokens. I find it
           | easier to conceptualize this in three dimensions. 3blue1brown
           | has a good video series which covers the overall concept of
            | LLM vectors in machine learning:
            | https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_...
           | 
           | To give a concrete example, say we're generating the next
           | token from the word "queen". Is this the monarch, the bee,
           | the playing card, the drag entertainer? By adding more
           | relevant tokens (honey, worker, hive, beeswax) we steer the
           | token generation to the place in the "word cloud" where our
           | next token is more likely to exist.
           | 
           | I don't see LLMs as "lossy compression" of text. To me that
           | implies retrieval, and Transformers are a prediction device,
           | not a retrieval device. If one needs retrieval then use a
           | database.
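 
        As a toy illustration of the "queen" example above, the sketch
        below compares the next-token distribution for the bare word with
        and without bee-related context. gpt2 is an arbitrary small
        stand-in model, not anything the commenter named.
 
          import torch
          from transformers import AutoModelForCausalLM, AutoTokenizer
 
          tok = AutoTokenizer.from_pretrained("gpt2")      # stand-in model
          model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
 
          def top_next_tokens(prompt, k=5):
              """The k most likely next tokens after the prompt."""
              ids = tok(prompt, return_tensors="pt")
              with torch.no_grad():
                  logits = model(**ids).logits[0, -1]   # next-token scores
              probs = torch.softmax(logits, dim=-1)
              top = torch.topk(probs, k)
              return [(tok.decode(i), round(p.item(), 3))
                      for i, p in zip(top.indices, top.values)]
 
          # Same head word, different surrounding context, different
          # continuation distribution.
          print(top_next_tokens("The queen"))
          print(top_next_tokens("Honey, beeswax, workers, the hive: the queen"))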
        
           | astrange wrote:
           | It is not a useful shortcut because you don't know what the
           | training data is, nothing requires it to be an "average" of
           | anything, and post-training arbitrarily re-weights all of its
           | existing distributions anyway.
        
         | onraglanroad wrote:
          | > Text completion driven by compressed training data...solves
          | complex problems
         | 
         | Sure it does. Obviously. All we ever needed was some text
         | completion.
         | 
         | Thanks for your valuable insight.
        
       | WhyOhWhyQ wrote:
        | Probably irrelevant, but something funny about Claude Code is
        | that it will routinely say something like "10 week task, very
        | complex", and then one-shot it in 2 minutes. I put off having it
        | create a feature for a while because it kept telling me it was
        | way too complicated. None of the open source versions I tried
        | were working, but I finally just decided to have it make the
        | feature anyway, and it ended up doing better than the open
        | source projects. So there's something off about how well Claude
        | estimates how difficult things are for it, and I'm wondering if
        | that makes it perform worse by not attempting things it would do
        | well at.
        
         | danielbln wrote:
         | In terms of the time estimates: I've added to my global rules
         | to never give time estimates for tasks, as they're useless and
         | inaccurate.
        
         | jives wrote:
         | I wonder if it's trying to predict what kind of estimate a
         | human engineer would provide.
        
           | EGreg wrote:
            | Considering it's trained on predicting the next word in text
            | where humans made estimates before AI, wouldn't that make
            | sense?
        
       | bartwe wrote:
        | Sounds a lot like Kolmogorov complexity.
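 
        Kolmogorov complexity itself is uncomputable, but compressed length
        gives a crude, computable upper bound on description length, which
        is roughly the intuition here. The snippet below only illustrates
        that proxy; it is not from the paper.
 
          import os
          import zlib
 
          def compressed_len(s: str) -> int:
              """Bytes after zlib compression: a rough complexity proxy."""
              return len(zlib.compress(s.encode("utf-8")))
 
          print(compressed_len("ab" * 500))             # highly regular: small
          print(compressed_len(os.urandom(500).hex()))  # near-random: large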
        
       ___________________________________________________________________
       (page generated 2025-11-06 23:00 UTC)