[HN Gopher] Positional preferences, order effects, prompt sensit...
       ___________________________________________________________________
        
       Positional preferences, order effects, prompt sensitivity undermine
       AI judgments
        
       Author : joalstein
       Score  : 148 points
        Date   : 2025-05-23 17:20 UTC (1 day ago)
        
 (HTM) web link (www.cip.org)
 (TXT) w3m dump (www.cip.org)
        
       | giancarlostoro wrote:
        | I listen to online debates, especially political ones, on various
        | platforms, and man, the AI slop that people slap around at each
        | other is beyond horrendous. I would not want an LLM to be the
        | final say on something critical. I want the opposite: an LLM
        | should identify things that need follow-up review by a qualified
        | person. A person should still confirm the things that "pass", but
        | they can then prioritize what to validate first.
        
         | batshit_beaver wrote:
         | I don't even trust LLMs enough to spot content that requires
         | validation or nuance.
        
           | giancarlostoro wrote:
            | LLMs are 100% trust-but-verify.
        
       | baxtr wrote:
       | I'd argue real judges are unreliable as well.
       | 
        | The real question for me is: are they less reliable than human
        | judges? Probably yes. But I'd favor a measurement relative to
        | humans over a blanket statement like that.
        
         | nimitkalra wrote:
         | There are technical quirks that make LLM judges particularly
         | high variance, sensitive to artifacts in the prompt, and
         | positively/negatively-skewed, as opposed to the subjectivity of
         | human judges. These largely arise from their training
         | distribution and post-training, and can be contained with
         | careful calibration.
        
         | andrewla wrote:
         | I do think you've hit the heart of the question, but I don't
         | think we can answer the second question.
         | 
          | We can measure how unreliable they are, or how susceptible they
          | are to specific changes, precisely because we can reset them to
          | the same state and run the experiment again. At least for now
          | [1] we do not have that capability with humans, so there's no
          | way to run a matching experiment on them.
          | 
          | The best we can do is probably to run the limited experiments
          | we can do on humans -- cross-referencing different judges'
          | rulings to get an overall measure of reliability, and using
          | intra-judge agreement as a weak indicator of the reliability
          | of a specific judge. But to get a fair comparison, the LLMs
          | would have to keep the previous cases in their context window
          | when running the same experiment.
         | 
         | [1] https://qntm.org/mmacevedo
        
         | resource_waste wrote:
         | I don't study domestic law enough, but I asked a professor of
         | law:
         | 
         | "With anything gray, does the stronger/bigger party always
         | win?"
         | 
         | He said:
         | 
         | "If you ask my students, nearly all of them would say Yes"
        
         | bunderbunder wrote:
         | > The real question for me is: are they less reliable than
         | human judges?
         | 
         | I've spent some time poking at this. I can't go into details,
         | but the short answer is, "Sometimes yes, sometimes no, and it
         | depends A LOT on how you define 'reliable'."
         | 
         | My sense is that, the more boring, mechanical and closed-ended
         | the task is, the more likely an LLM is to be more reliable than
         | a human. Because an LLM is an unthinking machine. It doesn't
         | get tired, or hangry, or stressed out about its kid's problems
         | at school. But it's also a doofus with absolutely no common
         | sense whatsoever.
        
           | visarga wrote:
           | > Because an LLM is an unthinking machine.
           | 
           | Unthinking can be pretty powerful these days.
        
         | DonHopkins wrote:
         | At least LLMs don't use penis pumps while on the job in court.
         | 
         | https://www.findlaw.com/legalblogs/legally-weird/judge-who-u...
         | 
         | https://www.subsim.com/radioroom/showthread.php?t=95174
        
         | Alupis wrote:
         | I think the main difference is an AI judge may provide three
         | different rulings if you just ask it the same thing three
         | times. A human judge is much less likely to be so "flip-
         | floppy".
         | 
          | You can observe this using any of the present-day LLMs - ask
         | it an architectural/design question, provide it with your
         | thoughts, reasoning, constraints, etc... and see what it tells
         | you. Then... click the "Retry" button and see how similar (or
         | dissimilar) the answer is. Sometimes you'll get a complete 180
         | from the prior response.
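          | 
          | A rough way to quantify that "Retry" experiment (ask_llm here
          | is a stand-in for any chat client called at a nonzero
          | temperature, so repeated calls are free to diverge):
          | 
          |     import difflib
          | 
          |     def retry_consistency(ask_llm, prompt, n=3):
          |         """Ask the same question n times and report pairwise
          |         text-similarity ratios; values near 1.0 mean
          |         consistent answers, low values a complete 180."""
          |         answers = [ask_llm(prompt) for _ in range(n)]
          |         return [
          |             difflib.SequenceMatcher(None, a, b).ratio()
          |             for i, a in enumerate(answers)
          |             for b in answers[i + 1:]]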
        
           | bunderbunder wrote:
            | Humans flip-flop all the time. This is a major reason why the
            | Myers-Briggs Type Indicator does such a poor job of assigning
            | the same person the same Myers-Briggs type on successive
            | tests.
           | 
            | It can be difficult to observe this fact in practice because,
            | unlike with an LLM, you can't just ask a human the exact same
            | question three times in five seconds and get three different
            | answers -- humans have memory. But, as someone who works with
            | human-labeled data, it's something I have to contend with on
            | a daily basis. For the things I'm working on, if you give the
            | same annotator the same thing to label two different times,
            | spaced far enough apart for them to forget that they have
            | seen this thing before, the chance of them making the same
            | call both times is only about 75%. If I do that with a
            | prompted LLM annotator, I'm used to seeing more like 85%, and
            | for some models it is possible to get even better consistency
            | than that with the right conditions and enough time spent
            | fussing with the prompt.
           | 
           | I still prefer the human labels when I can afford them
           | because LLM labeling has plenty of other problems. But being
           | more flip-floppy than humans is not one that I have been able
           | to empirically observe.
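            | 
            | To make the comparison concrete, a minimal sketch in Python
            | (the input shapes are illustrative, not from any particular
            | labeling tool) of that test-retest agreement number:
            | 
            |     def retest_agreement(first_pass, second_pass):
            |         """Fraction of items labeled identically on two
            |         passes; the dicts map item ids to labels."""
            |         shared = first_pass.keys() & second_pass.keys()
            |         if not shared:
            |             raise ValueError("no overlapping items")
            |         same = sum(1 for i in shared
            |                    if first_pass[i] == second_pass[i])
            |         return same / len(shared)
            | 
            |     # e.g. ~0.75 for the human annotators and ~0.85 for the
            |     # prompted LLM annotator described above.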
        
             | Alupis wrote:
             | We're not talking about labeling data though - we're
             | talking about understanding case law, statutory law, facts,
             | balancing conflicting opinions, arguments, a judge's
             | preconceived notions, experiences, beliefs etc. - many of
             | which are assembled over an entire career.
             | 
             | Those things, I'd argue, are far less likely to change if
             | you ask the _same_ judge over and over. I think you can
              | observe this in reality by considering people's political
             | opinions - which can drift over time but typically remain
             | similar for long durations (or a lifetime).
             | 
             | In real life, we usually don't ask the same judge to remake
             | a ruling over and over - our closest analog is probably a
             | judge's ruling/opinion history, which doesn't change nearly
             | as much as an LLM's "opinion" on something. This is how we
             | label SCOTUS Justices, for example, as "Originalist", etc.
             | 
             | Also, unlike a human, you can radically change an LLM's
             | output by just ever-so-slightly altering the input. While
             | humans aren't above changing their mind based on new facts,
             | they are unlikely to take an opposite position just because
             | you reworded your same argument.
        
               | bunderbunder wrote:
               | I think that that gets back to the whole memory thing. A
               | person is unlikely to forget those kinds of decisions.
               | 
                | But there has been research indicating, for example, that
                | judges' rulings vary with the time of day, in a way that
                | implies that, if it were possible to construct such an
                | experiment, you might find that the same judge given the
                | same case would rule in very different ways depending on
                | whether you present it in the morning or in the
                | afternoon. For example, judges tend to hand out
                | significantly harsher penalties toward the end of the
                | work day.
        
           | acdha wrote:
            | I'd think there's also a key adversarial problem: a human
            | judge has a conceptual understanding, so you aren't going to
            | be able to get wildly different outcomes by slightly tweaking
            | your wording, the way you can with LLMs.
        
         | Terr_ wrote:
         | > The real question for me is: are they less reliable than
         | human judges?
         | 
         | I'd caution that it's never just about ratios: We must also ask
         | whether the "shape" of their performance is knowable and
         | desirable. A chess robot's win-rate may be wonderful, but we
         | are unthinkingly confident a human wouldn't "lose" by
         | disqualification for ripping off an opponent's finger.
         | 
         | Would we accept a "judge" that is fairer on average... but
         | gives ~5% lighter sentences to people with a certain color
         | shirt, or sometimes issues the death-penalty for shoplifting?
         | Especially when we cannot diagnose the problem or be sure we
         | fixed it? (Maybe, but hopefully not without a _lot_ of debate
         | over the risks!)
         | 
         | In contrast, there's a huge body of... of _stuff_ regarding
         | human errors, resources we deploy so pervasively it can escape
         | our awareness: Your brain is a simulation and diagnostic tool
         | for other brains, battle-tested (sometimes literally) over
         | millions of years; we intuit many kinds of problems or
          | confounding factors to look for, often because we've made them
         | ourselves; and thousands of years of cultural practice for
         | detection, guardrails, and error-compensating actions. Only a
         | small minority of that toolkit can be reused for "AI."
        
           | baxtr wrote:
           | Yes, I fully agree.
           | 
           | But that's my point. We have to compare LLM performance to
           | some shape we know.
        
         | th0ma5 wrote:
          | Can we stop with the "AI is unreliable just like people" line?
          | It is demonstrably false at best and cult-like thought
          | termination at worst.
        
         | andrepd wrote:
         | Judges can reason according to principles, and explain this
         | reasoning. LLMs cannot (but they can pretend to, and this
         | pretend chain-of-thought can be marketed as "reasoning"!; see
         | https://news.ycombinator.com/item?id=44069991)
        
         | not_maz wrote:
         | I know the answer and I hate it.
         | 
         | AIs are inferior to humans at their best, but superior to
         | humans as they actually behave in society, due to decision
         | fatigue and other constraints. When it comes to moral judgment
         | in high stakes scenarios, AIs still fail (or can be made to
         | fail) in ways that are not socially acceptable.
         | 
         | Compare an AI to a real-world, overworked corporate decision
         | maker, though, and you'll find that the AI is kinder and less
          | biased. It still sucks, because GIGO, but it's slightly
         | better, simply because it doesn't suffer emotional fatigue,
         | doesn't take as many shortcuts, and isn't clouded by personal
         | opinions since it's not a person.
        
       | nimitkalra wrote:
       | Some other known distributional biases include self-preference
        | bias (gpt-4o prefers gpt-4o generations over Claude generations,
        | for example) and structured output/JSON-mode bias [1].
        | Interestingly,
       | some models have a more positive/negative-skew than others as
       | well. This library [2] also provides some methods for
       | calibrating/stabilizing them.
       | 
       | [1]:
       | https://verdict.haizelabs.com/docs/cookbook/distributional-b...
       | [2]: https://github.com/haizelabs/verdict
        
         | yencabulator wrote:
         | It's considered good form on this forum to disclose your
         | affiliation when you advertise for your employer.
        
       | TrackerFF wrote:
       | I see "panels of judges" mentioned once, but what is the weakness
       | of this? Other than more resource.
       | 
        | Worst case you end up with some multi-modal distribution where
        | two opinions are equally supported - which seems somewhat
        | unlikely as the panel size grows. Or it could maybe happen in a
        | case with exactly two outcomes (yes/no), but I'd be surprised if
        | such a panel landed on a perfectly uniform distribution in its
        | judgments/opinions (50% yes, 50% no).
        
         | nimitkalra wrote:
         | One method to get a better estimate is to extract the token
         | log-probabilities of "YES" and "NO" from the final logits of
         | the LLM and take a weighted sum [1] [2]. If the LLM is
         | calibrated for your task, there should be roughly a ~50% chance
         | of sampling YES (1) and ~50% chance of NO (0) -- yielding 0.5.
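          | 
          | A minimal sketch of that weighted sum in Python (assuming an
          | API that exposes the top token log-probabilities; the names
          | here are illustrative):
          | 
          |     import math
          | 
          |     def judge_score(logprobs):
          |         """Turn log-probs for the YES/NO judgment tokens into
          |         a soft score in [0, 1]: YES counts as 1, NO as 0."""
          |         p_yes = math.exp(logprobs.get("YES", float("-inf")))
          |         p_no = math.exp(logprobs.get("NO", float("-inf")))
          |         if p_yes + p_no == 0:
          |             raise ValueError("judgment tokens not in logprobs")
          |         return p_yes / (p_yes + p_no)
          | 
          |     # A judge calibrated on a genuinely 50/50 item should land
          |     # near 0.5:
          |     judge_score({"YES": math.log(0.48), "NO": math.log(0.47)})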
         | 
         | But generally you wouldn't use a binary outcome when you can
         | have samples that are 50/50 pass/fail. Better to use a discrete
         | scale of 1..3 or 1..5 and specify exactly what makes a sample a
         | 2/5 vs a 4/5, for example
         | 
         | You are correct to question the weaknesses of a panel. This
         | class of methods depends on diversity through high-temperature
         | sampling, which can lead to spurious YES/NO responses that
         | don't generalize well and are effectively noise.
         | 
         | [1]: https://arxiv.org/abs/2303.16634 [2]:
         | https://verdict.haizelabs.com/docs/concept/extractor/#token-...
        
       | shahbaby wrote:
       | Fully agree, I've found that LLMs aren't good at tasks that
       | require evaluation.
       | 
        | Think about it: if they were good at evaluation, you could remove
        | all humans from the loop and have recursively self-improving AGI.
       | 
       | Nice to see an article that makes a more concrete case.
        
         | visarga wrote:
         | Humans aren't good at validation either. We need tools,
         | experiments, labs. Unproven ideas are a dime a dozen. Remember
         | the hoopla about room temperature superconductivity? The real
         | source of validation is external consequences.
        
           | ken47 wrote:
            | Human experts set the benchmarks, and LLMs cannot match them
           | in most (maybe any?) fields requiring sophisticated judgment.
           | 
           | They are very useful for some things, but sophisticated
           | judgment is not one of them.
        
         | NitpickLawyer wrote:
         | I think there's more nuance, and the way I read the article is
         | more "beware of these shortcomings", instead of "aren't good".
          | LLM-based evaluation can be good. Several models have by now
          | been trained with previous-gen models used to filter data and
          | validate RLHF data (pairwise or even more advanced). Llama 3
          | is a good example of this.
         | 
         | My take from this article is that there are plenty of gotchas
         | along the way, and you need to be very careful in how you
         | structure your data, and how you test your pipelines, and how
         | you make sure your tests are keeping up with new models. But,
         | like it or not, LLM based evaluation is here to stay. So
         | explorations into this space are good, IMO.
        
       | tempodox wrote:
       | > We call it 'prompt-engineering'
       | 
       | I prefer to call it "prompt guessing", it's like some modern
       | variant of alchemy.
        
         | BurningFrog wrote:
         | "Prompt Whispering"?
        
           | th0ma5 wrote:
           | Prompt divining
        
             | thinkling wrote:
             | Impromptu prompting
        
             | roywiggins wrote:
             | Prompt dowsing
        
         | layer8 wrote:
         | Prompt vibing
        
         | amlib wrote:
         | Maybe "prompt mixology" would keep inline with the alchemy
         | theme :)
        
       | wagwang wrote:
       | Can't wait for the new field of AI psychology
        
       | sidcool wrote:
       | We went from Impossible to Unreliable. I like the direction as a
       | techie. But not sure as a sociologist or an anthropologist.
        
       | ken47 wrote:
       | [flagged]
        
         | dang wrote:
         | " _Please don 't post shallow dismissals, especially of other
         | people's work. A good critical comment teaches us something._"
         | 
         | https://news.ycombinator.com/newsguidelines.html
         | 
         | (We detached this comment from
         | https://news.ycombinator.com/item?id=44074957)
        
       | sanqui wrote:
       | Meanwhile in Estonia, they just agreed to resolve child support
       | disputes using AI... https://www.err.ee/1609701615/pakosta-
       | enamiku-elatisvaidlust...
        
         | suddenlybananas wrote:
         | Might as well flip a coin.
        
       | armchairhacker wrote:
       | LLMs are good at discovery, since they know a lot, and can
       | retrieve that knowledge from a query that simpler (e.g. regex-
       | based) search engines with the same knowledge couldn't. For
        | example, an LLM given a case as input may discover an obscure
        | law, or notice a pattern in past court cases that establishes
        | precedent. So they can be helpful to a real judge.
        | 
        | Of course, the judge must check that the law or precedent isn't
        | hallucinated and applies to the case in the way the LLM claims.
       | They should also prompt other LLMs and use their own knowledge in
       | case the cited law/precedent contradicts others.
       | 
       | There's a similar argument for scientists, mathematicians,
       | doctors, investors, and other fields. LLMs are good at discovery
       | but must be checked.
        
         | amlib wrote:
          | I would add that "hallucinations" aren't even the only failure
          | mode an LLM can have: it can partially or completely miss what
          | it's supposed to find in the discovery process and lead you to
          | believe that there just isn't anything worth pursuing in that
          | particular venue.
        
           | mschuster91 wrote:
            | > it can partially or completely miss what it's supposed to
           | find in the discovery process and lead you to believe that
           | there just isn't anything worth pursuing in that particular
           | venue.
           | 
           | The problem is that American and UK legal systems never got
           | forced to prune the sometimes centuries-old garbage. And
           | modern Western legal systems tend to have more explicit laws
           | and regulations instead of prior case law, but still, they
           | also accumulate lots of garbage.
           | 
           | IMHO, _all_ laws and regulations should come with a set
           | expiry date. If the law or regulation is not renewed, it gets
            | dropped off the books. And for legal systems that have case
            | law, court rulings should expire after no more than five
            | years, to force a transition to a system where the law-
            | passing body has to work for its money.
        
       | PrimordialEdg71 wrote:
       | LLMs make impressive graders-of-convenience, but their judgments
       | swing wildly with prompt phrasing and option order. Treat them
       | like noisy crowd-raters: randomize inputs, ensemble outputs, and
       | keep a human in the loop whenever single-digit accuracy points
       | matter.
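        | 
        | A minimal sketch of that randomize-and-ensemble treatment in
        | Python (ask_llm stands in for whatever client is in use, and
        | the voting threshold is illustrative):
        | 
        |     import random
        |     from collections import Counter
        | 
        |     def ensemble_judge(ask_llm, question, options, n_calls=5):
        |         """Shuffle option order on every call and take a
        |         majority vote, treating the judge as a noisy rater."""
        |         votes = []
        |         for _ in range(n_calls):
        |             shuffled = random.sample(options, k=len(options))
        |             prompt = question + "\n" + "\n".join(
        |                 f"{i + 1}. {o}" for i, o in enumerate(shuffled))
        |             votes.append(ask_llm(prompt).strip())
        |         winner, count = Counter(votes).most_common(1)[0]
        |         if count <= n_calls // 2:
        |             return "NEEDS_HUMAN_REVIEW"  # no stable majority
        |         return winner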
        
         | einrealist wrote:
         | "and keep a human in the loop whenever single-digit accuracy
         | points matter"
         | 
         | So we are supposed to give up on accuracy now? At least with
         | humans (assuming good actors) I can assume an effort for
          | accuracy and correctness. And I can build trust based on a
          | resume and on prior interactions.
         | 
          | With LLMs, this is more like a coin-flip with each prompt. And
          | since the models are updated constantly, it's hard to build
          | that sort of resume. In the meantime, people - especially
          | casual users - might just trust outputs, because it's
          | convenient. A single-digit error is harder to find. The cost of
          | validating outputs increases with the accuracy of LLMs. And
          | casual users tend to skip validation because "it's a computer
          | and computers are correct".
         | 
         | I fear an overall decrease in quality wherever LLMs are
         | included. And any productivity gains are eaten by that.
        
       | nowittyusername wrote:
        | I've done experiments, and basically what I found was that LLMs
        | are extremely sensitive to... language. Well, duh, but let me
        | explain a bit. They will give a different quality/accuracy of
        | answer depending on the system prompt order, language use,
        | length, how detailed the examples are, etc. Basically every
        | variable you can think of is responsible for either improving
        | the output or causing detrimental behavior in it.
        | 
        | And it makes sense once you really grok that LLMs "reason and
        | think" in tokens. They have no internal world representation.
        | Tokens are the raw layer on which they operate. For example, if
        | you ask a bilingual human what their favorite color is, the
        | answer will be that color regardless of which language they used
        | to answer the question. For an LLM, that answer might change
        | depending on the language used, because it's all the statistical
        | distribution of tokens in training that conditions the response.
        | 
        | Anyway, I don't want to make a long post here. The good news is
        | that once you have found the best way of asking questions of
        | your model, you can consistently get accurate responses; the
        | trick is to find the best way to communicate with that
        | particular LLM. That's why I am hard at work on an auto-
        | calibration system that runs through a barrage of candidate
        | system prompts and other hyperparameters to find the best ones
        | for a specific LLM (a rough sketch of the idea is below). The
        | process can be fully automated; you just need to set it all up.
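        | 
        | Roughly, and with illustrative names only (ask_llm is whatever
        | client is in use, and dev_set is a small labeled list of
        | (input, expected_label) pairs):
        | 
        |     def calibrate_prompt(ask_llm, prompt_variants, dev_set):
        |         """Score each candidate system prompt against the dev
        |         set and keep the one with the highest agreement."""
        |         best_prompt, best_acc = None, -1.0
        |         for variant in prompt_variants:
        |             hits = sum(
        |                 1 for text, expected in dev_set
        |                 if ask_llm(variant, text).strip() == expected)
        |             acc = hits / len(dev_set)
        |             if acc > best_acc:
        |                 best_prompt, best_acc = variant, acc
        |         return best_prompt, best_acc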
        
         | pton_xd wrote:
         | Yep, LLMs tell you "what you want to hear." I can usually
         | predict the response I'll get based on how I phrase the
         | question.
        
           | jonplackett wrote:
           | I feel like LLMs have a bit of the Clever Hans effect. It
            | takes a lot of cues from me as to what it thinks I want it
            | to say, or what opinion it thinks I want it to have.
           | 
           | Clever Hans was a horse who people thought could do maths by
           | tapping his hoof. But actually he was just reading the body
            | language of the person asking the question, noticing them
            | tense up as he got to the right number of taps and stopping.
            | Still pretty smart for a horse, but the human was still
            | doing the maths!
        
             | not_maz wrote:
              | What's worse is that it can sometimes (but not always) read
              | through your anti-bias prompts.
              | 
              |     "No, I want your honest opinion." "It's awesome."
              | 
              |     "I'm going to invest $250,000 into this. Tell me what
              |     you really think." "You should do it."
              | 
              |     (New Session)
              | 
              |     "Someone pitched to me the idea that..." "Reject it."
        
         | thinkling wrote:
         | I thought embeddings were the internal representation? Does
         | reasoning and thinking get expanded back out into tokens and
         | fed back in as the next prompt for reasoning? Or does the model
         | internally churn on chains of embeddings?
        
           | hansvm wrote:
           | There's a certain one-to-oneness between tokens and
           | embeddings. A token expands into a large amount of state, and
           | processing happens on that state and nothing else.
           | 
           | The point is that there isn't any additional state or
           | reasoning. You have a bunch of things equivalent to tokens,
           | and the only trained operations deal with sequences of those
           | things. Calling them "tokens" is a reasonable linguistic
           | choice, since the exact representation of a token isn't core
           | to the argument being made.
        
           | HappMacDonald wrote:
            | I'd direct you to the 3Blue1Brown presentation on this
            | topic, but in a nutshell the semantic space for an embedding
            | can become much richer than the initial token mapping due to
            | previous context... but only during the course of predicting
            | the next token.
            | 
            | Once that's done, all the rich nuance achieved during the
            | last token-prediction step is lost, and then rebuilt from
            | scratch again on the next token-prediction step (oftentimes
            | taking a new direction due to the new token, and even more
            | so due to changes at the tail of the context window, such as
            | lost tokens, messages, or re-arrangement due to summarizing).
           | 
           | So if you say "red ball" somewhere in the context window,
           | then during each prediction step that will expand into a
           | semantic embedding that neither matches "red" nor "ball", but
           | that richer information will not be "remembered" between
           | steps, but rebuilt from scratch every time.
        
         | not_maz wrote:
         | I found an absolutely fascinating analysis on precisely this
         | topic by an AI researcher who's also a writer:
         | https://archive.ph/jgam4
         | 
         | LLMs can generate convincing editorial letters that give a real
         | sense of having deeply read the work. The problem is that
         | they're extremely sensitive, as you've noticed, to prompting as
         | well as order bias. Present it with two nearly identical
         | versions of the same text, and it will usually choose based on
          | order. And social-proof-type biases, which we'd hope machines
          | would be immune to, can actually trigger 40+ point swings on a
          | 100-point scale.
         | 
         | If you don't mind technical details and occasional swagger, his
         | work is really interesting.
        
         | leonidasv wrote:
         | I somewhat agree, but I think that the language example is not
         | a good one. As Anthropic have demonstrated[0], LLMs do have
         | "conceptual neurons" that generalise an abstract concept which
         | can later be translated to other languages.
         | 
         | The issue is that those concepts are encoded in intermediate
         | layers during training, absorbing biases present in training
         | data. It may produce a world model good enough to know that
         | "green" and "verde" are different names for the same thing, but
         | not robust enough to discard ordering bias or wording bias.
         | Humans suffer from that too, albeit arguably less.
         | 
         | [0] https://transformer-circuits.pub/2025/attribution-
         | graphs/bio...
        
           | bunderbunder wrote:
           | I have learned to take these kinds of papers with a grain of
           | salt, though. They often rest on carefully selected examples
           | that make the behavior seem much more consistent and reliable
           | than it is. For example, the famous "king - man + woman =
           | queen" example from Word2Vec is in some ways more misleading
           | than helpful, because while it worked fine for that case it
           | doesn't necessarily work nearly so well for [emperor, man,
           | woman, empress] or [husband, man, woman, wife].
           | 
           | You get a similar thing with convolutional neural networks.
           | _Sometimes_ they automatically learn image features in a way
            | that yields hidden layers that are easy and intuitive to
            | interpret. But not every time. A lot of the time you get a
            | seemingly random garble that defies any parsimonious
            | interpretation.
           | 
           | This Anthropic paper is at least kind enough to acknowledge
           | this fact when they poke at the level of representation
           | sharing and find that, according to their metrics, peak
           | feature-sharing among languages is only about 30% for English
           | and French, two languages that are _very_ closely aligned.
           | Also note that this was done using two cherry-picked
           | languages and a training set that was generated by starting
           | with an English language corpus and then translating it using
            | a different language model. It's entirely plausible that the
           | level of feature-sharing would not be nearly so great if they
           | had used human-generated translations. (edit: Or a more
           | realistic training corpus that doesn't entirely consist of
           | matched translations of very short snippets of text.)
           | 
           | Just to throw even more cold water on it, this also doesn't
           | necessarily mean that the models are building a true semantic
           | model and not just finding correlations upon which humans
           | impose semantic interpretations. This general kind of
           | behavior when training models on cross-lingual corpora
           | generated using direct translations was first observed in the
           | 1990s, and the model in question was _singular value
           | decomposition._
        
             | jiggawatts wrote:
             | I'm convinced that language sharing can be encouraged
             | during training by rewarding correct answers to questions
             | that can only be answered based on synthetic data in
             | another language fed in during a previous pretraining
             | phase.
             | 
             | Interleave a few phases like that and you'd force the model
             | to share abstract information across all languages, not
             | just for the synthetic data but all input data.
             | 
             | I wouldn't be surprised if this improved LLM performance by
             | another "notch" all by itself, especially for non-English
             | users.
        
               | nenaoki wrote:
               | your shrewd idea might make a fine layer back up the
               | Tower of Babel
        
           | nowittyusername wrote:
            | I had read the paper before I made the statement, and I
            | still made the statement because there are issues with it.
            | The first problem is that the way Anthropic trains their
            | models, and the architecture of those models, is different
            | from most of the open source models people use. They are
            | still transformer based, but they are not structurally put
            | together the same as most models, so you can't extrapolate
            | findings on their models to other models. Their training
            | methods also use a lot more regularization of the data,
            | trying to weed out targeted biases as much as possible,
            | meaning the models are trained on more synthetic data which
            | tries to normalize the data as much as possible across
            | languages, tone, etc. Same goes for their system prompt: it
            | is treated differently versus open source models, which
            | internally prepend the system prompt to the user's query.
            | The attention is applied differently, among other things.
            | 
            | Second, the way their models "internalize" the world is
            | vastly different from what humans would think of as
            | "building a world model" of reality. It's hard to put into
            | words, but basically their models do have an underlying
            | representative structure, but it's not anything that would
            | be of use in the domains humans care about -- "true
            | reasoning", grokking the concept if you will.
            | 
            | Honestly, I highly suggest folks take a lot of what
            | Anthropic publishes with a grain of salt. I feel that a lot
            | of the information they present is purposely misinterpreted
            | by their teams for media or PR/clout or who knows what
            | reasons. But the biggest reason is the one I stated at the
            | beginning: most models are not of the same ilk as Anthropic
            | models. I would suggest folks focus on reading
            | interpretability research on open source models, as those
            | are most likely to be used by corporations for their cheap
            | API costs. And those models have nowhere near the care and
            | sophistication put into them as Anthropic's.
        
         | TOMDM wrote:
          | This doesn't match Anthropic's research on the subject:
         | 
         | > Claude sometimes thinks in a conceptual space that is shared
         | between languages, suggesting it has a kind of universal
         | "language of thought." We show this by translating simple
         | sentences into multiple languages and tracing the overlap in
         | how Claude processes them.
         | 
         | https://www.anthropic.com/research/tracing-thoughts-language...
        
         | smusamashah wrote:
          | One can see this very easily in image generation models.
          | 
          | The "Elephant" they generate is a lot different from "Haathi"
          | (Hindi/Urdu). The same goes for other concepts that have a
          | 1-to-1 translation but where the results are different.
        
         | gus_massa wrote:
         | > _For example if you ask a bilingual human what their favorite
         | color is, the answer will be that color regardless of what
         | language they used to answer that question._
         | 
         | It's a very interesting question. Has someone measured it?
          | Bonus points for using a concealed approach so the subjects
          | don't realize you care about colors.
         | 
         | Anyway, I don't expect something interesting with colors, but
         | it may be interesting with food (I guess, in particular
         | desserts).
         | 
          | Imagine you live in England and one of your parents is from
         | France and you go there every year to meet your grandparents,
         | and your other parent is from Germany and you go there every
         | year to meet your grandparents. What is your favorite dessert?
         | I guess when you are speaking in one language you are more
         | connected to the memories of the holidays there and the
         | grandparents and you may choose differently.
        
         | patcon wrote:
         | Doesn't this assume one truth or one final answer to all
         | questions? What if there are many?
         | 
         | What if asking one way means you are likely to have your search
         | satisfied by one truth, but asking another way means you are
         | best served by another wisdom?
         | 
         | EDIT: and the structure of language/thought can't know solid
         | truth from more ambiguous wisdom. The same linguistic
         | structures must encode and traverse both. So there will be
         | false positives and false negatives, I suppose? I dunno, I'm
         | shooting from the hip here :)
        
       | devmor wrote:
       | It's a statistical database of corpuses, not a logic engine.
       | 
       | Stop treating LLMs like they are capable of logic, reasoning or
       | judgement. They are not, they never will be.
       | 
       | The extent to which they can _recall_ and _remix_ human words to
       | replicate the intent behind those words is an incredible
        | facsimile of thought. It's nothing short of a mathematical
        | masterpiece. But it's not intelligence.
        | 
        | If it were communicating its results in any less human an
        | interface than conversation, I truly feel that most people
        | would not be so easily fooled into believing it was capable of
        | logic.
       | 
       | This doesn't mean that a facsimile of logic like this has no use.
       | Of course it does, we have seen many uses - some incredible, some
       | dangerous and some pointless - but it is important to always know
       | that there is no thought happening behind the facade. Only a
       | replication of statistically similar _communication of thought_
       | that may or may not actually apply to your prompt.
        
       | lyu07282 wrote:
        | Also related: in my observations with tool calling, the order of
        | your arguments or fields can actually have a positive or negative
        | effect on performance. You really have to be very careful when
        | constructing your contexts. It doesn't help that all these
        | frameworks and protocols hide these things from you.
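        | 
        | As a small, purely illustrative sketch of taking control of that
        | ordering (rather than leaving it to a framework), in Python:
        | 
        |     import itertools
        |     import json
        | 
        |     def field_order_variants(tool_args):
        |         """Serialize the same tool-call arguments in every key
        |         order, so each ordering can be A/B-tested against an
        |         eval set instead of left to chance."""
        |         perms = itertools.permutations(tool_args.items())
        |         return [json.dumps(dict(p)) for p in perms]
        | 
        |     # field_order_variants({"query": "refund policy", "top_k": 3})
        |     # -> two JSON strings that differ only in key order.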
        
       | SrslyJosh wrote:
       | > Positional preferences, order effects, prompt sensitivity
       | undermine AI judgments
       | 
       | If you can read between the lines, that says that there's no
       | actual "judgement" going on. If there was a strong logical
       | underpinning to the output, minor differences in input like the
       | phrasing (but not factual content) of a prompt wouldn't make the
       | quality of the output unpredictable.
        
         | Wilsoniumite wrote:
         | Yes and no. People also exhibit these biases, but because
         | degree matters, and because we have no other choice, we still
          | trust them most of the time. That is to say: bias isn't always
         | completely invalidating. I wrote a slightly satirical piece
         | "People are just as bad as my LLMs" here:
         | https://wilsoniumite.com/2025/03/10/people-are-just-as-bad-a...
        
         | tiahura wrote:
         | Word plinko
        
         | ACCount36 wrote:
         | You could say the same about human "judgement" then.
         | 
         | Humans display biases very similar to that of LLMs. This is not
         | a coincidence. LLMs are trained on human-generated data. They
         | attempt to replicate human reasoning - bias and all.
         | 
         | There are decisions where "strong logical underpinning" is
         | strong enough to completely drown out the bias and the noise.
         | And then there are decisions that aren't nearly as clear-cut -
         | allowing the bias and the noise to creep into the outcomes.
         | This is true for human and LLM "judgement" both.
        
       | photochemsyn wrote:
       | Disappointing that they didn't benchmark DeepSeek side by side
        | with OpenAI and Gemini, although the list of funders for CIP may
       | explain that:
       | 
       | https://www.cip.org/funding-partnerships
       | 
       | Incidentally DeepSeek will give very interesting results if you
       | ask it for a tutorial on prompt engineering - be sure to ask it
       | how to effectively use 'attention anchors' to create 'well-
       | structured prompts', and why rambling disorganized prompts are
       | usually, but not always, detrimental, depending on whether you
       | want 'associative leaps' or not.
       | 
       | P.S. I find this intro very useful:
       | 
       | > "Task: evaluate the structure of the following prompt in terms
       | of attention anchors and likelihood of it generating a well-
       | structured response. Do not actually reply to the prompt, all we
       | need is an analysis of the structure. Prompt begins:"
        
       | panstromek wrote:
       | Good, but really none of this should be surprising, given that
       | LLMs are a giant text statistic that generate text based on that
       | statistic. Quirks of that statistic will show up as quirks of the
       | output.
       | 
       | When you think about it like that, it doesn't really make sense
       | to assume they have some magical correctness properties. In some
        | sense, they don't classify, they imitate what classification
       | looks like.
        
         | perching_aix wrote:
         | > In some sense, they don't classify, they immitate what
         | classification looks like.
         | 
         | I thought I've seen it all when people decided to consider AI a
         | marketing term and started going off about how current
         | mainstream AI products aren't """"real AI"""", but this is next
         | level.
        
           | panstromek wrote:
           | I'm not sure I understand your objection (or if it's even an
           | objection), but just to illustrate what I mean - this is
           | literally how the chat interfaces are implemented (or at
           | least initially they were).
           | 
           | You're not talking with the model, you're talking with some
           | entity that the model is asked to simulate. The system is
           | just cleverly using your input and the statistic to output
           | something that looks like chat with an assistant.
           | 
           | Whether that's real AI or not doesn't really matter. I didn't
            | mean to make it sound like this is not real, just to point
            | out where the current shortcomings are coming from.
        
             | perching_aix wrote:
             | It is an objection. I'm not sure if you consider the whole
              | subfield of machine learning that is classification
              | nonexistent, or just the fact that LLMs can produce
              | classifications, but either way, I do object.
              | 
              | The objection against the former is trivial and self-
              | evident, and was more where my sudden upset came from. *
             | 
             | Against the latter, the model trying to make the overall
             | text that is its context window approach a chat exchange,
             | by adjusting its own output within it accordingly, doesn't
             | make a hypothetical request for classification in there not
             | performed. You either classify or you don't. If it's doing
             | it in a "misguided" way, it doesn't make it not performed.
             | If it's doing it under the pretense of roleplay, it still
             | doesn't make it not performed. Same is true if the LLM is
             | actually secretly a human operator, or if the LLM is just
             | spewing random tokens. Either you got a classified output
             | of your input or you didn't. It doesn't "look like"
             | anything. I can understand if maybe you mean that the
             | designated notion of the person in the exchange it's trying
              | to approximate is going to affect the classifications it
              | provides when one is requested in the context window, but
             | since these models are trained to "act agentic", I'm not
             | sure if that's a useful thing to ponder (as there's no
             | other way to get anything out of them).
             | 
             | I object to the whole "AI is just statistics" notion too.
             | In several situations you want it to do something
             | completely different than what the dataset would support
             | just through rote statistics. That's where you get actual
             | value out of them. One could conveniently recategorize that
             | as just "advanced statistics", or "higher level"
             | statistics, but I think that's a very perverse way of
             | defining statistics. There's very clearly more mathematics
             | involved in LLMs than just statistics. Just the other day,
             | there was a post here trying to regard LLMs as "just
             | topology". Clearly neither of these can be true at the same
             | time, which was consequently explored in the thread too.
             | 
             | > You're not talking with the model, you're talking with
             | some entity
             | 
             | I'm not suggesting I'm actually talking with anyone or
              | anything in particular beyond the anthropomorphization.
             | 
             | * What I meant by "people saying AI is not real" is that
              | people claim that the current generation of AI products
              | are not real "artificial intelligences", because
             | they seem to think that it's either "SkyNet" and "Detroit:
             | Become Human", or nothing. Unsurprisingly, these folks
             | don't tend to talk much about OCR, image segmentation and
             | labeling, optical flow, etc. And just like the half a
             | century old field of Artificial Intelligence isn't just
             | some new marketing con that just spawned into existence,
             | classification algorithms aren't some novel snakeoil
             | either.
             | 
              |  _Edit:_ typing this all out about how I'm aware I'm not
             | actually having conversations with anyone or anything gave
             | me a feeling of realization. This is not good, because
             | intellectually I was always aware of this, meaning I got
             | subconsciously parasocial with these products and services
             | over time. Really concerned about this all of a sudden lol.
        
               | panstromek wrote:
               | This post was about LLMs so I was specifically referring
               | to LLMs.
               | 
                | When I say "they imitate what classification looks
               | like," I don't mean that the classification somehow isn't
               | real, I'm referring to the specific technique of how it's
               | done.
               | 
               | If you ask LLM "Is this sentence offensive: ...?", the
               | task that it's doing is not simply "test whether this
               | sentence is offensive." It's something like "generate
               | what a plausible answer to this question looks like,"
               | part of which is answering the question (usually).
               | 
               | This means that if you ask this question in a way that is
               | more often used with an expectation of a certain answer,
                | LLM will use that as a signal to bias the answer, because
                | "that's what the answers to these questions look like" --
                | which is the problem highlighted in the article.
        
       ___________________________________________________________________
       (page generated 2025-05-24 23:01 UTC)