[HN Gopher] LLMs encode how difficult problems are
___________________________________________________________________
LLMs encode how difficult problems are
Author : stansApprentice
Score : 165 points
Date : 2025-11-06 18:29 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| jiito wrote:
| I haven't read this particular paper in-depth, but it reminds me
| of another one I saw that used a similar approach to find if the
| model encodes its own certainty of answering correctly.
| https://arxiv.org/abs/2509.10625
| kazinator wrote:
| It's all very clear when you mentally replace "LLM" with "text
| completion driven by compressed training data".
|
| E.g.
|
| [Text completion driven by compressed training data] exhibit[s] a
| puzzling inconsistency: [it] solves complex problems yet
| frequently fail[s] on seemingly simpler ones.
|
| Some problems are better represented by a locus of texts in the
| training data, allowing more plausible talk to be generated. When
| the problem is not well represented, it does not help that the
| problem is simple.
|
| If you train it on nothing but Scientology documents, and then
| ask about the Buddhist perspective on a situation, you will
| probably get some nonsense about body thetans, even if the
| situation is simple.
| th0ma5 wrote:
| Thank you for posting this. I'm struck by how much of this work
| studies a behavior in isolation from other assumptions, and then
| describes the isolated capability as a new solution or discovered
| ability that would supposedly work alongside all of those other
| assumptions. It makes most LLM research feel like whack-a-mole if
| the goal is to make accurate and reliable models by understanding
| these techniques. Instead, it's more like seeing faces in cars
| and buildings: artifacts of patterns, pattern groupings, and
| pattern recognition. Building houses on sand, etc.
| lukev wrote:
| Well, that's what an LLM _is_. The problem is if one's mental
| model is built on "AI" instead of "LLM."
|
| The fact that LLMs can abstract concepts and do _any_ amount of
| out-of-sample reasoning is impressive and interesting, but the
| null hypothesis for an LLM being "impressive" in any regard is
| that the data required to answer the question is present in
| its training set.
| XenophileJKO wrote:
| This is true, but also misleading. We are learning that the
| models achieve compression by distilling higher-level concepts
| and deriving generalized, human-like abilities; see, for example,
| the recent introspection paper from Anthropic.
| layoric wrote:
| I have a hard time conceptualizing lossy text compression, but
| I've recently started to think of the "reasoning"/output as just
| a byproduct of lossy compression, with the weights tending
| towards an average of the information "around" the main topic of
| the prompt. What I've found easier is thinking of it like lossy
| image compression: generating more output tokens via "reasoning"
| is like subdividing nearby pixels and filling in the gaps with
| values the model has seen there before. Taking the analogy a bit
| too far, you can also think of the vocabulary as the pixel bit
| depth.
|
| I definitely agree that replacing "AI" or "LLM" with "X driven by
| compressed training data" starts to make a lot more sense, and
| it's a useful shortcut.
| suprjami wrote:
| You're right about "reasoning". It's just trying to steer the
| conversation in a more relevant direction in vector space,
| hopefully to generate more relevant output tokens. I find it
| easier to conceptualize this in three dimensions. 3blue1brown
| has a good video series which covers the overall concept of
| LLM vectors in machine learning: https://youtube.com/playlist
| ?list=PLZHQObOWTQDNU6R1_67000Dx_...
|
| To give a concrete example, say we're generating the next
| token from the word "queen". Is this the monarch, the bee,
| the playing card, the drag entertainer? By adding more
| relevant tokens (honey, worker, hive, beeswax) we steer the
| token generation to the place in the "word cloud" where our
| next token is more likely to exist.
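|
| A toy sketch of that intuition, with made-up 4-d vectors standing
| in for real embeddings (only the cosine-similarity mechanic is
| the point, not the numbers):
|
|     # Toy illustration: hypothetical embeddings, not real model
|     # weights. Context tokens pull "queen" toward one sense.
|     import numpy as np
|
|     emb = {  # made-up vectors for illustration only
|         "queen_monarch": np.array([0.9, 0.1, 0.0, 0.1]),
|         "queen_bee":     np.array([0.1, 0.9, 0.1, 0.0]),
|         "queen_card":    np.array([0.0, 0.1, 0.9, 0.1]),
|         "honey":         np.array([0.2, 0.8, 0.0, 0.1]),
|         "hive":          np.array([0.1, 0.9, 0.0, 0.2]),
|         "worker":        np.array([0.2, 0.7, 0.1, 0.3]),
|     }
|
|     def cos(a, b):
|         return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     # Average the context embeddings, then see which sense of
|     # "queen" they steer us toward.
|     context = np.mean([emb["honey"], emb["hive"], emb["worker"]], axis=0)
|     for sense in ("queen_monarch", "queen_bee", "queen_card"):
|         print(sense, round(cos(context, emb[sense]), 3))
|     # "queen_bee" scores highest: the context steered the choice.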
|
| I don't see LLMs as "lossy compression" of text. To me that
| implies retrieval, and Transformers are a prediction device,
| not a retrieval device. If one needs retrieval then use a
| database.
| Terr_ wrote:
| > You're right about "reasoning". It's just trying to steer
| the conversation in a more relevant direction in vector
| space, hopefully to generate more relevant output tokens.
|
| I like to frame it as a theater-script cycling through the LLM.
| The "reasoning" difference is just changing the style so that
| each character has _film noir_ monologues. The underlying
| process hasn't really changed, and the monologue text isn't
| fundamentally different from dialogue or stage-direction... but
| more data still means more guidance for each improv-cycle.
|
| > say we're generating the next token from the word
| "queen". Is this the monarch, the bee, the playing card,
| the drag entertainer?
|
| I'd like to point out that this scheme can result in things
| that look better to humans in the end... even when the
| "clarifying" choice is entirely arbitrary and irrational.
|
| In other words, we should be alert to the difference
| between "explaining what you were thinking" versus "picking
| a firm direction so future improv makes nicer
| rationalizations."
| esafak wrote:
| It makes sense if you think of the LLM as building a data-
| aware model that compresses the noisy data by parsimony
| (the principle that the simplest explanation that fits is
| best). Typical text compression algorithms are not data-
| aware and not robust to noise.
|
| In lossy compression the compression itself is the goal. In
| prediction, compression is the road that leads to
| parsimonious models.
| astrange wrote:
| It is not a useful shortcut because you don't know what the
| training data is, nothing requires it to be an "average" of
| anything, and post-training arbitrarily re-weights all of its
| existing distributions anyway.
| cruffle_duffle wrote:
| The way I visualize it is by imagining clipping the
| high-frequency details of _concepts and facts_. These things
| operate on a different plane of abstraction than simple strings
| of characters or tokens. They operate on ideas and concepts. To
| compress, you take out all the deep details and leave only the
| broad strokes.
| kazinator wrote:
| One day people will say "we used to think the devil is in
| the details, but now we know it is in their removal".
| onraglanroad wrote:
| > Text completion driven by compressed training data...solves
| complex problems
|
| Sure it does. Obviously. All we ever needed was some text
| completion.
|
| Thanks for your valuable insight.
| ToValueFunfetti wrote:
| Why shouldn't you expect a problem's simplicity to correlate
| tremendously with how well it is represented in training data?
| Every angle I can think of tilts in that direction. Simpler
| problems are easier to remember and thus repeat, they come up
| more often, and they require less space/time/effort to record
| (which also means they are less likely to contain errors).
| N_Lens wrote:
| This is a popular take on HN yet incomplete in its assessment
| of LLMs and their capabilities.
| keeganpoppen wrote:
| oh man i am pretty tired of the "it's just autocomplete"
| armchair warriors... it is an accurate metaphor in only the
| most pedantic of ways, and has zero explanatory power
| whatsoever as far as intuition building goes. and i don't even
| understand the impulse. "reality is easy, it's just quantum
| autocomplete!"
| msla wrote:
| > It's all very clear when you mentally replace "LLM" with
| "text completion driven by compressed training data".
|
| So you replace a more useful term with a less useful one?
|
| Is that due to political reasons?
| WhyOhWhyQ wrote:
| Probably irrelevant, but something funny about Claude Code is
| that it will routinely say something like "10-week task, very
| complex" and then one-shot it in 2 minutes. I put off having it
| create a feature for a while because it kept telling me it was
| way too complicated. All of the open-source versions I tried
| weren't working, but I finally decided to have it build the
| feature anyway, and it ended up doing better than the
| open-source projects. So there's something off about how well
| Claude estimates how difficult things are for it, and I'm
| wondering if that makes it perform worse by not attempting
| things it would actually do well at.
| danielbln wrote:
| In terms of the time estimates: I've added to my global rules
| to never give time estimates for tasks, as they're useless and
| inaccurate.
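|
| For reference, the rule looks roughly like the sketch below
| (assuming a CLAUDE.md-style global rules file; the wording here
| is illustrative, not my exact text):
|
|     <!-- in ~/.claude/CLAUDE.md (or your tool's equivalent) -->
|     - Never give time estimates (hours, days, weeks) for any task.
|     - Do not rate task difficulty or predict effort; just describe
|       the change and implement it.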
| bavell wrote:
| I did the same a few weeks back, also difficulty estimates,
| "impact" analysis and expected performance results - all of
| which is just hallucinated garbage not worth wasting tokens
| on.
| cruffle_duffle wrote:
| Same. I dunno how they got trained to spontaneously provide
| those estimates either. Like they must have read some weird
| training data related to the phrase "how difficult is this"
| or something.
| jives wrote:
| I wonder if it's trying to predict what kind of estimate a
| human engineer would provide.
| EGreg wrote:
| Considering it's trained on predicting the next word in stuff
| humans estimated before AI, wouldn't that make sense?
| kridsdale1 wrote:
| A HUGE amount of the workday artifacts engineers have been
| forced to produce since we started the internet is project
| estimation documents for our managers. The training corpus on
| this stuff is immense, and it has now all been ingested into
| these models. It's doing no thinking at all when it gives you an
| estimate; it's matching correlated strings that the humans of
| the past had to write down.
|
| Fun fact, all those human-sourced estimates were
| hallucinations too.
| abdullahkhalids wrote:
| It would be very surprising if the AI training corpus
| includes a lot of project estimation documentation, since
| most of those are confidential and not publicly
| available.
| andai wrote:
| I think there are two aspects to this.
|
| Firstly, Claude's self-concept is based on humanity's collective
| self-concept. (Well, the statistical average of all the
| self-concepts on the internet.)
|
| So it doesn't have a clear understanding of what LLMs' strengths
| and weaknesses are, and, by extension, of its own. (Neither do
| we, from what I gather. At least, not in a way that's well
| represented in web scrapes ;)
|
| Secondly, as a programmer I have noticed a similar pattern...
| stuff that people say is easy turns out to be a pain in the
| ass, and stuff that they say is impossible turns out to be
trivial. (They didn't even try; they just repeated what other
| people, who also hadn't tried it, told them was hard...)
| barren_suricata wrote:
| Not sure how related this is, but I've noticed it tends to start
| sentences with inflated optimism. I think the idea is that if it
| opens with "Aha, I see it now! The problem is...", whatever
| comes next is more likely to be a correct solution than if you
| didn't use an overtly positive prefix, even if that leads to a
| lot of annoying behavior.
| AlecSchueler wrote:
| I've always been taught to slightly overestimate how long
| something will take so that it reflects better on the team when
| it's delivered ahead of schedule. There's bound to be a bunch
| of similar advice and patterns in the training data.
| bartwe wrote:
| Sounds a lot like Kolmogorov complexity.
| baxtr wrote:
| _Kolmogorov complexity is the length of the shortest computer
| program that can produce a specific object as output. It
| formalizes the idea that simple objects have short
| descriptions, while complex (random) objects are
| incompressible._
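|
| True Kolmogorov complexity is uncomputable, but the size of an
| ordinary compressor's output gives a rough upper-bound proxy. A
| minimal sketch of that idea (zlib here is just a convenient
| stand-in, not a claim about how LLMs compress):
|
|     # Compressed size as a crude upper bound on description length.
|     import os
|     import zlib
|
|     simple = b"ab" * 5000        # highly regular: short description
|     noise = os.urandom(10000)    # random: essentially incompressible
|
|     print(len(zlib.compress(simple)))  # tiny compared to 10000 bytes
|     print(len(zlib.compress(noise)))   # close to (or above) 10000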
| amelius wrote:
| Compression is a great IQ test, but it's still limited to a
| small domain.
| inavida wrote:
| My interpretation of the abstract is that humans are pretty good
| at judging how difficult a problem is and LLMs aren't as
| reliable, that problem difficulty correlates with activations
| during inference, and finally that an accurate human judgement of
| problem difficulty (*as input) leads to better problem solving.
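|
| Concretely, that middle claim is presumably measured with
| something like a linear probe: fit a simple regressor from
| per-problem hidden activations to difficulty labels and see how
| well it generalizes. A sketch with stand-in data (random features
| instead of real model activations):
|
|     # Probe sketch: synthetic "activations" and difficulty scores
|     # stand in for real per-problem hidden states from an LLM.
|     import numpy as np
|     from sklearn.linear_model import Ridge
|     from sklearn.model_selection import train_test_split
|
|     rng = np.random.default_rng(0)
|     acts = rng.normal(size=(500, 768))            # fake activations
|     w = rng.normal(size=768)
|     difficulty = acts @ w + rng.normal(size=500)  # fake labels
|
|     X_tr, X_te, y_tr, y_te = train_test_split(acts, difficulty,
|                                               random_state=0)
|     probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
|     print(probe.score(X_te, y_te))  # high R^2 => linearly decodable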
|
| If so, this is a nice training signal for my own neural net,
| since my view of LLMs is that they are essentially analogy-making
| machines, and that reasoning is essentially a chain of analogies
| that ends in a result that aligns somewhat with reality. Or that
| I'm as crazy as most people seem to think I am.
___________________________________________________________________
(page generated 2025-11-07 23:02 UTC)