[HN Gopher] LLMs encode how difficult problems are
       ___________________________________________________________________
        
       LLMs encode how difficult problems are
        
       Author : stansApprentice
       Score  : 165 points
        Date   : 2025-11-06 18:29 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jiito wrote:
       | I haven't read this particular paper in-depth, but it reminds me
       | of another one I saw that used a similar approach to find if the
       | model encodes its own certainty of answering correctly.
       | https://arxiv.org/abs/2509.10625
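        | 
        | For anyone curious what that kind of probing looks like in
        | practice, here is a minimal sketch of the generic recipe (not
        | the exact setup of either paper), assuming the Hugging Face
        | transformers and scikit-learn libraries: grab one hidden state
        | per prompt from a small open model ("gpt2" as a stand-in) and
        | fit a linear probe against whatever label you care about
        | (difficulty, expected correctness, ...). The prompts and labels
        | below are made-up placeholders.
        | 
        |     import torch
        |     from transformers import AutoModel, AutoTokenizer
        |     from sklearn.linear_model import LogisticRegression
        | 
        |     tok = AutoTokenizer.from_pretrained("gpt2")
        |     model = AutoModel.from_pretrained(
        |         "gpt2", output_hidden_states=True).eval()
        | 
        |     # Toy labels: 1 = "hard", 0 = "easy". A real experiment
        |     # needs many labeled prompts and a held-out split.
        |     prompts = ["What is 2 + 2?",
        |                "Name the capital of France.",
        |                "Prove the Riemann hypothesis.",
        |                "Factor a 2048-bit RSA modulus by hand."]
        |     labels = [0, 0, 1, 1]
        | 
        |     def last_token_state(text, layer=-1):
        |         ids = tok(text, return_tensors="pt")
        |         with torch.no_grad():
        |             out = model(**ids)
        |         # hidden state of the last token at the chosen layer
        |         return out.hidden_states[layer][0, -1].numpy()
        | 
        |     X = [last_token_state(p) for p in prompts]
        |     probe = LogisticRegression(max_iter=1000).fit(X, labels)
        |     print(probe.score(X, labels))  # in-sample fit only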
        
       | kazinator wrote:
       | It's all very clear when you mentally replace "LLM" with "text
       | completion driven by compressed training data".
       | 
       | E.g.
       | 
        | [Text completion driven by compressed training data] exhibit[s] a
       | puzzling inconsistency: [it] solves complex problems yet
       | frequently fail[s] on seemingly simpler ones.
       | 
       | Some problems are better represented by a locus of texts in the
       | training data, allowing more plausible talk to be generated. When
       | the problem is not well represented, it does not help that the
       | problem is simple.
       | 
       | If you train it on nothing but Scientology documents, and then
       | ask about the Buddhist perspective on a situation, you will
       | probably get some nonsense about body thetans, even if the
       | situation is simple.
        
         | th0ma5 wrote:
          | Thank you for posting this. I'm struck by how much of this
          | work studies a behavior in isolation from other assumptions,
          | and then describes the resulting individual capability as a
          | new solution or discovered ability that would somehow work
          | alongside all of those other assumptions. It makes most LLM
          | research feel like whack-a-mole, if the goal is to make
          | accurate and reliable models by understanding these
          | techniques. Instead it's more like seeing faces in cars and
          | buildings: artifacts of patterns, pattern groupings, and
          | pattern recognition. Building houses on sand, etc.
        
         | lukev wrote:
          | Well, that's what an LLM _is_. The problem is if one's mental
          | model is built on "AI" instead of "LLM."
          | 
          | The fact that LLMs can abstract concepts and do _any_ amount
          | of out-of-sample reasoning is impressive and interesting, but
          | the null hypothesis for an LLM being "impressive" in any
          | regard is that the data required to answer the question is
          | present in its training set.
        
         | XenophileJKO wrote:
          | This is true, but also misleading. We are learning that the
          | models achieve compression by distilling higher-level concepts
          | and deriving generalized, human-like abilities; see, for
          | example, the recent introspection paper from Anthropic.
        
         | layoric wrote:
          | I have a hard time conceptualizing lossy text compression,
          | but I've recently started to think of the "reasoning"/output
          | as just a by-product of lossy compression, with the weights
          | tending toward an average of the information "around" the
          | main topic of the prompt. What I've found easier is thinking
          | about it like lossy image compression: generating more output
          | tokens via "reasoning" is like subdividing nearby pixels and
          | filling in the gaps with values seen there before. Taking the
          | analogy a bit too far, you can also think of the vocabulary
          | as the pixel bit depth.
          | 
          | I definitely agree that replacing "AI" or "LLM" with "X
          | driven by compressed training data" makes a lot more sense,
          | and is a useful shortcut.
        
           | suprjami wrote:
           | You're right about "reasoning". It's just trying to steer the
           | conversation in a more relevant direction in vector space,
           | hopefully to generate more relevant output tokens. I find it
           | easier to conceptualize this in three dimensions. 3blue1brown
           | has a good video series which covers the overall concept of
            | LLM vectors in machine learning:
            | https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_...
           | 
           | To give a concrete example, say we're generating the next
           | token from the word "queen". Is this the monarch, the bee,
           | the playing card, the drag entertainer? By adding more
           | relevant tokens (honey, worker, hive, beeswax) we steer the
           | token generation to the place in the "word cloud" where our
           | next token is more likely to exist.
           | 
           | I don't see LLMs as "lossy compression" of text. To me that
           | implies retrieval, and Transformers are a prediction device,
           | not a retrieval device. If one needs retrieval then use a
           | database.
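            | 
            | A toy sketch of what I mean, with made-up 3-dimensional
            | vectors rather than real embeddings (a hypothetical
            | illustration, not the actual mechanism): mixing the
            | ambiguous "queen" vector with its context vectors pulls it
            | toward one sense.
            | 
            |     # Made-up sense and context vectors; each axis loosely
            |     # stands for "royalty", "insects", "card games".
            |     import numpy as np
            | 
            |     vocab = {
            |         "queen_monarch": np.array([1.0, 0.0, 0.0]),
            |         "queen_bee":     np.array([0.0, 1.0, 0.0]),
            |         "queen_card":    np.array([0.0, 0.0, 1.0]),
            |         "crown":         np.array([0.9, 0.1, 0.0]),
            |         "honey":         np.array([0.1, 0.9, 0.0]),
            |         "hive":          np.array([0.0, 1.0, 0.1]),
            |         "ace":           np.array([0.0, 0.1, 0.9]),
            |     }
            |     senses = ["queen_monarch", "queen_bee", "queen_card"]
            |     # "queen" alone is ambiguous: the average of its senses.
            |     queen = np.mean([vocab[s] for s in senses], axis=0)
            | 
            |     def cos(a, b):
            |         return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            | 
            |     def disambiguate(context):
            |         # Mix the query with its context, then pick the
            |         # nearest sense by cosine similarity.
            |         query = queen.copy()
            |         if context:
            |             ctx = np.mean([vocab[w] for w in context], axis=0)
            |             query = query + ctx
            |         return max(senses, key=lambda s: cos(query, vocab[s]))
            | 
            |     print(disambiguate([]))                # tie, pick is arbitrary
            |     print(disambiguate(["honey", "hive"])) # -> queen_bee
            |     print(disambiguate(["crown"]))         # -> queen_monarch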
        
             | Terr_ wrote:
             | > You're right about "reasoning". It's just trying to steer
             | the conversation in a more relevant direction in vector
             | space, hopefully to generate more relevant output tokens.
             | 
             | I like to frame it as a theater-script cycling through the
             | LLM. The "reasoning" difference is just changing the style
             | so that each character has _film noir_ monologues. The
              | underlying process hasn't really changed, and the
              | monologue text isn't fundamentally different from dialogue
             | or stage-direction... but more data still means more
             | guidance for each improv-cycle.
             | 
             | > say we're generating the next token from the word
             | "queen". Is this the monarch, the bee, the playing card,
             | the drag entertainer?
             | 
             | I'd like to point out that this scheme can result in things
             | that look better to humans in the end... even when the
             | "clarifying" choice is entirely arbitrary and irrational.
             | 
             | In other words, we should be alert to the difference
             | between "explaining what you were thinking" versus "picking
             | a firm direction so future improv makes nicer
             | rationalizations."
        
             | esafak wrote:
             | It makes sense if you think of the LLM as building a data-
             | aware model that compresses the noisy data by parsimony
             | (the principle that the simplest explanation that fits is
             | best). Typical text compression algorithms are not data-
             | aware and not robust to noise.
             | 
             | In lossy compression the compression itself is the goal. In
             | prediction, compression is the road that leads to
             | parsimonious models.
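              | 
              | A rough sketch of that idea, using a BIC-style two-part
              | cost (parameter cost plus residual cost) as a crude
              | stand-in for description length; the data and the
              | candidate models are made up for illustration:
              | 
              |     # Fit noisy-but-linear data with a simple and a
              |     # complex polynomial; the parsimonious model ends up
              |     # with the shorter total description.
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(0)
              |     x = np.linspace(-1, 1, 50)
              |     y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)
              | 
              |     def description_length(degree):
              |         coeffs = np.polyfit(x, y, degree)
              |         resid = y - np.polyval(coeffs, x)
              |         n, k = x.size, degree + 1
              |         # n/2*log(RSS/n): cost of encoding the residuals;
              |         # k/2*log(n): cost of encoding the parameters.
              |         return (0.5 * n * np.log(np.mean(resid ** 2))
              |                 + 0.5 * k * np.log(n))
              | 
              |     for d in (1, 7):
              |         print(d, round(description_length(d), 1))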
        
           | astrange wrote:
           | It is not a useful shortcut because you don't know what the
           | training data is, nothing requires it to be an "average" of
           | anything, and post-training arbitrarily re-weights all of its
           | existing distributions anyway.
        
           | cruffle_duffle wrote:
           | The way I visualize it is imagining clipping the high
           | frequency details of _concepts and facts_. These things
           | operate on a different plane of abstraction than simple
           | strings of characters or tokens. They operate on ideas and
           | concepts. To compress, you take out all the deep details and
           | leave only the broad strokes.
        
             | kazinator wrote:
             | One day people will say "we used to think the devil is in
             | the details, but now we know it is in their removal".
        
         | onraglanroad wrote:
          | > Text completion driven by compressed training data...solves
         | complex problems
         | 
         | Sure it does. Obviously. All we ever needed was some text
         | completion.
         | 
         | Thanks for your valuable insight.
        
         | ToValueFunfetti wrote:
         | Why shouldn't you expect a problem's simplicity to correlate
         | tremendously with how well it is represented in training data?
         | Every angle I can think of tilts in that direction. Simpler
         | problems are easier to remember and thus repeat, they come up
          | more often, and they require less space/time/effort to record
         | (which also means they are less likely to contain errors).
        
         | N_Lens wrote:
         | This is a popular take on HN yet incomplete in its assessment
         | of LLMs and their capabilities.
        
         | keeganpoppen wrote:
         | oh man i am pretty tired of the "it's just autocomplete"
         | armchair warriors... it is an accurate metaphor in only the
         | most pedantic of ways, and has zero explanatory power
         | whatsoever as far as intuition building goes. and i don't even
         | understand the impulse. "reality is easy, it's just quantum
         | autocomplete!"
        
         | msla wrote:
         | > It's all very clear when you mentally replace "LLM" with
         | "text completion driven by compressed training data".
         | 
         | So you replace a more useful term with a less useful one?
         | 
         | Is that due to political reasons?
        
       | WhyOhWhyQ wrote:
        | Probably irrelevant, but something funny about Claude Code is
        | that it will routinely say something like "10 week task, very
        | complex", and then one-shot it in 2 minutes. For a while I
        | didn't have it create a feature because it kept telling me it
        | was way too complicated. None of the open source versions I
        | tried were working, but I finally just decided to have it
        | build the feature anyway, and it ended up doing better than
        | the open source projects. So there's something off about how
        | well Claude estimates how difficult things are for it, and I'm
        | wondering if that makes it perform worse by not attempting
        | things it would do well at.
        
         | danielbln wrote:
         | In terms of the time estimates: I've added to my global rules
         | to never give time estimates for tasks, as they're useless and
         | inaccurate.
        
           | bavell wrote:
            | I did the same a few weeks back, along with difficulty
            | estimates, "impact" analyses, and expected performance
            | results - all of which are just hallucinated garbage not
            | worth wasting tokens on.
        
           | cruffle_duffle wrote:
           | Same. I dunno how they got trained to spontaneously provide
           | those estimates either. Like they must have read some weird
           | training data related to the phrase "how difficult is this"
           | or something.
        
         | jives wrote:
         | I wonder if it's trying to predict what kind of estimate a
         | human engineer would provide.
        
           | EGreg wrote:
           | Considering it's trained on predicting the next word in stuff
           | humans estimated before AI, wouldn't that make sense?
        
             | kridsdale1 wrote:
              | A HUGE share of the workday artifacts engineers have been
              | forced to produce since the start of the internet is
              | project estimation documents for our managers. The
              | training corpus on this stuff is immense, and it has now
              | all been ingested into these models. The model is doing
              | no thinking at all when it gives you an estimate; it's
              | matching correlated strings which the humans of the past
              | had to write down.
              | 
              | Fun fact: all those human-sourced estimates were
              | hallucinations too.
        
               | abdullahkhalids wrote:
               | It would be very surprising if the AI training corpus
               | includes a lot of project estimation documentation, since
               | most of those are confidential and not publicly
               | available.
        
         | andai wrote:
          | I think there are two aspects to this.
         | 
          | Firstly, Claude's self-concept is based on humanity's
         | collective self-concept. (Well, the statistical average of all
         | the self-concepts on the internet.)
         | 
         | So it doesn't have a clear understanding of what LLMs'
         | strengths and weaknesses are, and itself by extension. (Neither
         | do we, from what I gathered. At least, not in a way that's well
         | represented in web scrapes ;)
         | 
         | Secondly, as a programmer I have noticed a similar pattern...
         | stuff that people say is easy turns out to be a pain in the
         | ass, and stuff that they say is impossible turns out to be
         | trivial. (They didn't even try, they just repeated what other
         | people told them was hard, who also didn't try it...)
        
         | barren_suricata wrote:
          | Not sure how related this is, but I've noticed it tends to
          | start sentences with inflated optimism. I think the idea is
          | that if it intros with "Aha, I see it now! The problem is",
          | whatever comes next is more likely to be a correct solution
          | than if you didn't use an overtly positive prefix, even if
          | that leads to a lot of annoying behavior.
        
         | AlecSchueler wrote:
         | I've always been taught to slightly overestimate how long
         | something will take so that it reflects better on the team when
         | it's delivered ahead of schedule. There's bound to be a bunch
         | of similar advice and patterns in the training data.
        
       | bartwe wrote:
        | Sounds a lot like Kolmogorov complexity
        
         | baxtr wrote:
         | _Kolmogorov complexity is the length of the shortest computer
         | program that can produce a specific object as output. It
         | formalizes the idea that simple objects have short
         | descriptions, while complex (random) objects are
         | incompressible._
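          | 
          | Kolmogorov complexity itself is uncomputable, but a
          | general-purpose compressor gives a crude, computable upper
          | bound on it. A quick sketch of the "simple objects have short
          | descriptions" point (standard library only):
          | 
          |     # Structured data shrinks a lot; random bytes barely
          |     # shrink at all.
          |     import os
          |     import zlib
          | 
          |     structured = b"abc" * 10_000   # a short program emits this
          |     random_ish = os.urandom(30_000)
          | 
          |     for name, data in [("structured", structured),
          |                        ("random", random_ish)]:
          |         ratio = len(zlib.compress(data, 9)) / len(data)
          |         print(f"{name}: {len(data)} bytes -> {ratio:.1%}")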
        
         | amelius wrote:
         | Compression is a great IQ test, but it's still limited to a
         | small domain.
        
       | inavida wrote:
       | My interpretation of the abstract is that humans are pretty good
       | at judging how difficult a problem is and LLMs aren't as
       | reliable, that problem difficulty correlates with activations
       | during inference, and finally that an accurate human judgement of
        | problem difficulty (as input) leads to better problem solving.
       | 
       | If so, this is a nice training signal for my own neural net,
       | since my view of LLMs is that they are essentially analogy-making
       | machines, and that reasoning is essentially a chain of analogies
       | that ends in a result that aligns somewhat with reality. Or that
       | I'm as crazy as most people seem to think I am.
        
       ___________________________________________________________________
       (page generated 2025-11-07 23:02 UTC)