[HN Gopher] Positional preferences, order effects, prompt sensit...
       ___________________________________________________________________
        
       Positional preferences, order effects, prompt sensitivity undermine
       AI judgments
        
       Author : joalstein
       Score  : 85 points
       Date   : 2025-05-23 17:20 UTC (5 hours ago)
        
 (HTM) web link (www.cip.org)
 (TXT) w3m dump (www.cip.org)
        
       | giancarlostoro wrote:
       | I listen to online debates, especially political ones on various
       | platforms, and man. The AI slop that people slap around at each
       | other is beyond horrendous. I would not want an LLM being the
        | final say on something critical. I want the opposite: an LLM
        | should identify things that need follow-up review by a qualified
        | person. A person should still confirm the things that "pass", but
        | they can then prioritize what to validate first.
        
         | batshit_beaver wrote:
         | I don't even trust LLMs enough to spot content that requires
         | validation or nuance.
        
           | giancarlostoro wrote:
            | LLMs are 100% trust but verify.
        
       | baxtr wrote:
       | I'd argue real judges are unreliable as well.
       | 
       | The real question for me is: are they less reliable than human
        | judges? Probably yes. But I'd favor a measurement relative to
        | humans over a plain statement like that.
        
         | nimitkalra wrote:
         | There are technical quirks that make LLM judges particularly
         | high variance, sensitive to artifacts in the prompt, and
         | positively/negatively-skewed, as opposed to the subjectivity of
         | human judges. These largely arise from their training
         | distribution and post-training, and can be contained with
         | careful calibration.
        
         | andrewla wrote:
         | I do think you've hit the heart of the question, but I don't
         | think we can answer the second question.
         | 
         | We can measure how unreliable they are, or how susceptible they
          | are to specific changes, precisely because we can reset them
          | to the same state and run the experiment again. At least for
          | now [1] we do not have that capability with humans, so there's
          | no way to run a matching experiment on humans.
         | 
          | The best we can do is probably to run the limited experiments
          | we can run on humans -- comparing different judges'
          | cross-referenced reliability to get an overall measure and
          | some weak indicator of the reliability of a specific judge
          | based on intra-judge agreement. But when running this on LLMs
          | they would have to keep the previous cases in their context
          | window to get a fair comparison.
         | 
         | [1] https://qntm.org/mmacevedo
        
         | resource_waste wrote:
         | I don't study domestic law enough, but I asked a professor of
         | law:
         | 
         | "With anything gray, does the stronger/bigger party always
         | win?"
         | 
         | He said:
         | 
         | "If you ask my students, nearly all of them would say Yes"
        
         | bunderbunder wrote:
         | > The real question for me is: are they less reliable than
         | human judges?
         | 
         | I've spent some time poking at this. I can't go into details,
         | but the short answer is, "Sometimes yes, sometimes no, and it
         | depends A LOT on how you define 'reliable'."
         | 
          | My sense is that the more boring, mechanical, and closed-ended
         | the task is, the more likely an LLM is to be more reliable than
         | a human. Because an LLM is an unthinking machine. It doesn't
         | get tired, or hangry, or stressed out about its kid's problems
         | at school. But it's also a doofus with absolutely no common
         | sense whatsoever.
        
           | visarga wrote:
           | > Because an LLM is an unthinking machine.
           | 
           | Unthinking can be pretty powerful these days.
        
         | DonHopkins wrote:
         | At least LLMs don't use penis pumps while on the job in court.
         | 
         | https://www.findlaw.com/legalblogs/legally-weird/judge-who-u...
         | 
         | https://www.subsim.com/radioroom/showthread.php?t=95174
        
         | Alupis wrote:
         | I think the main difference is an AI judge may provide three
         | different rulings if you just ask it the same thing three
         | times. A human judge is much less likely to be so "flip-
         | floppy".
         | 
          | You can observe this using any of the present-day LLMs - ask
         | it an architectural/design question, provide it with your
         | thoughts, reasoning, constraints, etc... and see what it tells
         | you. Then... click the "Retry" button and see how similar (or
         | dissimilar) the answer is. Sometimes you'll get a complete 180
         | from the prior response.
        
           | bunderbunder wrote:
            | Humans flip-flop all the time. This is a major reason why the
            | Myers-Briggs Type Indicator does such a poor job of assigning
            | the same person the same type on successive tests.
            | 
            | It can be difficult to observe this in practice because,
            | unlike an LLM, a human has memory: you can't just ask someone
            | the exact same question three times in five seconds and get
            | three different answers. But, as someone who works with
            | human-labeled data, it's something I have to contend with on
            | a daily basis. For the things I'm working on, if you give the
            | same annotator the same thing to label twice, spaced far
            | enough apart for them to forget they have seen it before, the
            | chance of them making the same call both times is only about
            | 75%. If I do that with a prompted LLM annotator, I'm used to
            | seeing more like 85%, and for some models it's possible to
            | get even better consistency with the right conditions and
            | enough time spent fussing with the prompt.
           | 
           | I still prefer the human labels when I can afford them
           | because LLM labeling has plenty of other problems. But being
           | more flip-floppy than humans is not one that I have been able
           | to empirically observe.
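            | 
            | For concreteness, a minimal sketch of that test-retest
            | measurement (the label lists are made up, not my real
            | pipeline):
            | 
            |     def test_retest_agreement(run1, run2):
            |         # Share of items labeled the same both times.
            |         assert len(run1) == len(run2)
            |         same = sum(a == b for a, b
            |                    in zip(run1, run2))
            |         return same / len(run1)
            | 
            |     # 3 of 4 repeated calls match -> 0.75
            |     print(test_retest_agreement(
            |         ["A", "B", "A", "C"],
            |         ["A", "B", "A", "B"]))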
        
             | Alupis wrote:
             | We're not talking about labeling data though - we're
             | talking about understanding case law, statutory law, facts,
             | balancing conflicting opinions, arguments, a judge's
             | preconceived notions, experiences, beliefs etc. - many of
             | which are assembled over an entire career.
             | 
             | Those things, I'd argue, are far less likely to change if
             | you ask the _same_ judge over and over. I think you can
              | observe this in reality by considering people's political
             | opinions - which can drift over time but typically remain
             | similar for long durations (or a lifetime).
             | 
             | In real life, we usually don't ask the same judge to remake
             | a ruling over and over - our closest analog is probably a
             | judge's ruling/opinion history, which doesn't change nearly
             | as much as an LLM's "opinion" on something. This is how we
             | label SCOTUS Justices, for example, as "Originalist", etc.
             | 
             | Also, unlike a human, you can radically change an LLM's
             | output by just ever-so-slightly altering the input. While
             | humans aren't above changing their mind based on new facts,
             | they are unlikely to take an opposite position just because
             | you reworded your same argument.
        
               | bunderbunder wrote:
               | I think that that gets back to the whole memory thing. A
               | person is unlikely to forget those kinds of decisions.
               | 
                | But there has been research indicating, for example, that
                | judges' rulings vary with the time of day, in a way that
                | implies that, if it were possible to construct such an
                | experiment, you might find the same judge given the same
                | case ruling in very different ways depending on whether
                | you present it in the morning or in the afternoon. For
                | example, judges tend to hand out significantly harsher
                | penalties toward the end of the work day.
        
           | acdha wrote:
           | I'd think there's also a key adversarial problem: a human
           | judge has a conceptual understanding and you aren't going to
            | be able to slightly tweak your wording to get wildly
            | different outcomes, the way you can with an LLM.
        
         | Terr_ wrote:
         | > The real question for me is: are they less reliable than
         | human judges?
         | 
         | I'd caution that it's never just about ratios: We must also ask
         | whether the "shape" of their performance is knowable and
         | desirable. A chess robot's win-rate may be wonderful, but we
         | are unthinkingly confident a human wouldn't "lose" by
         | disqualification for ripping off an opponent's finger.
         | 
         | Would we accept a "judge" that is fairer on average... but
         | gives ~5% lighter sentences to people with a certain color
         | shirt, or sometimes issues the death-penalty for shoplifting?
         | Especially when we cannot diagnose the problem or be sure we
         | fixed it? (Maybe, but hopefully not without a _lot_ of debate
         | over the risks!)
         | 
         | In contrast, there's a huge body of... of _stuff_ regarding
         | human errors, resources we deploy so pervasively it can escape
         | our awareness: Your brain is a simulation and diagnostic tool
         | for other brains, battle-tested (sometimes literally) over
         | millions of years; we intuit many kinds of problems or
          | confounding factors to look for, often because we've made them
         | ourselves; and thousands of years of cultural practice for
         | detection, guardrails, and error-compensating actions. Only a
         | small minority of that toolkit can be reused for "AI."
        
         | th0ma5 wrote:
          | Can we stop with the "AI is unreliable, just like people"
          | framing? It is demonstrably false at best and cult-like
          | thought termination at worst.
        
         | andrepd wrote:
         | Judges can reason according to principles, and explain this
         | reasoning. LLMs cannot (but they can pretend to, and this
          | pretend chain-of-thought can be marketed as "reasoning"; see
         | https://news.ycombinator.com/item?id=44069991)
        
         | not_maz wrote:
         | I know the answer and I hate it.
         | 
         | AIs are inferior to humans at their best, but superior to
         | humans as they actually behave in society, due to decision
         | fatigue and other constraints. When it comes to moral judgment
         | in high stakes scenarios, AIs still fail (or can be made to
         | fail) in ways that are not socially acceptable.
         | 
         | Compare an AI to a real-world, overworked corporate decision
         | maker, though, and you'll find that the AI is kinder and less
         | biased. It still sucks, because GI/GO, but it's slightly
         | better, simply because it doesn't suffer emotional fatigue,
         | doesn't take as many shortcuts, and isn't clouded by personal
         | opinions since it's not a person.
        
       | nimitkalra wrote:
       | Some other known distributional biases include self-preference
        | bias (e.g. gpt-4o prefers gpt-4o generations over claude
        | generations) and structured output/JSON-mode bias [1].
        | Interestingly, some models skew more positive or negative than
        | others as well. This library [2] also provides some methods for
       | calibrating/stabilizing them.
       | 
       | [1]:
       | https://verdict.haizelabs.com/docs/cookbook/distributional-b...
       | [2]: https://github.com/haizelabs/verdict
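        | 
        | One cheap stabilization trick for pairwise judging (generic,
        | not something specific to that library): ask twice with the
        | candidates swapped and only trust verdicts that survive the
        | swap. A sketch, where judge(first, second) stands in for your
        | own call and returns "first" or "second":
        | 
        |     def position_consistent_verdict(judge, a, b):
        |         # Run the same comparison in both orders.
        |         v1 = judge(a, b)   # a shown first
        |         v2 = judge(b, a)   # b shown first
        |         if v1 == "first" and v2 == "second":
        |             return "a"     # stable preference for a
        |         if v1 == "second" and v2 == "first":
        |             return "b"     # stable preference for b
        |         return None        # verdict flipped with order
        | 
        | Pairs that come back None are ones where position, not
        | content, decided the outcome.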
        
       | TrackerFF wrote:
       | I see "panels of judges" mentioned once, but what is the weakness
        | of this? Other than needing more resources.
       | 
       | Worst case you end up with some multi-modal distribution, where
       | two opinions are equal - which seems somewhat unlikely as the
       | panel size grows. Or it could maybe happen in some case with
       | exactly two outcomes (yes/no), but I'd be surprised if such a
       | panel landed on a perfect uniform distribution in its
        | judgments/opinions (50% yes, 50% no).
        
         | nimitkalra wrote:
         | One method to get a better estimate is to extract the token
         | log-probabilities of "YES" and "NO" from the final logits of
          | the LLM and take a weighted sum [1] [2]. If the LLM is well
          | calibrated for your task, a genuinely borderline sample should
          | have roughly a ~50% chance of sampling YES (1) and ~50% chance
          | of NO (0) -- so the weighted sum comes out to 0.5.
         | 
         | But generally you wouldn't use a binary outcome when you can
         | have samples that are 50/50 pass/fail. Better to use a discrete
         | scale of 1..3 or 1..5 and specify exactly what makes a sample a
          | 2/5 vs a 4/5, for example.
         | 
         | You are correct to question the weaknesses of a panel. This
         | class of methods depends on diversity through high-temperature
         | sampling, which can lead to spurious YES/NO responses that
         | don't generalize well and are effectively noise.
         | 
          | [1]: https://arxiv.org/abs/2303.16634
          | [2]:
          | https://verdict.haizelabs.com/docs/concept/extractor/#token-...
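          | 
          | A minimal sketch of that weighted sum, assuming you have
          | already pulled the top token log-probabilities for the
          | judge's final verdict token from whatever API you use (the
          | dict below is illustrative):
          | 
          |     import math
          | 
          |     def soft_yes_score(top_logprobs):
          |         # Probability mass on YES after renormalizing
          |         # over the two labels: 1*P(YES) + 0*P(NO).
          |         neg_inf = float("-inf")
          |         p_yes = math.exp(top_logprobs.get("YES", neg_inf))
          |         p_no = math.exp(top_logprobs.get("NO", neg_inf))
          |         if p_yes + p_no == 0:
          |             raise ValueError("no YES/NO mass")
          |         return p_yes / (p_yes + p_no)
          | 
          |     # A judge that is perfectly torn yields 0.5
          |     half = math.log(0.5)
          |     print(soft_yes_score({"YES": half, "NO": half}))
          | 
          | The same idea extends to a 1..5 scale by weighting each
          | score token by its probability.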
        
       | shahbaby wrote:
        | Fully agree - I've found that LLMs aren't good at tasks that
       | require evaluation.
       | 
        | Think about it: if they were good at evaluation, you could remove
        | all humans from the loop and have recursively self-improving AGI.
       | 
       | Nice to see an article that makes a more concrete case.
        
         | visarga wrote:
         | Humans aren't good at validation either. We need tools,
         | experiments, labs. Unproven ideas are a dime a dozen. Remember
         | the hoopla about room temperature superconductivity? The real
         | source of validation is external consequences.
        
           | ken47 wrote:
            | Human experts set the benchmarks, and LLMs cannot match them
           | in most (maybe any?) fields requiring sophisticated judgment.
           | 
           | They are very useful for some things, but sophisticated
           | judgment is not one of them.
        
         | NitpickLawyer wrote:
         | I think there's more nuance, and the way I read the article is
         | more "beware of these shortcomings", instead of "aren't good".
          | LLM-based evaluation can be good. Several models have by now
          | been trained with previous-gen models used for filtering data
          | and validating RLHF data (pairwise or even more advanced).
          | Llama 3 is a good example of this.
         | 
         | My take from this article is that there are plenty of gotchas
          | along the way, and you need to be very careful in how you
          | structure your data, how you test your pipelines, and how you
          | make sure your tests keep up with new models. But, like it or
          | not, LLM-based evaluation is here to stay. So explorations
          | into this space are good, IMO.
        
       | tempodox wrote:
       | > We call it 'prompt-engineering'
       | 
        | I prefer to call it "prompt guessing"; it's like some modern
       | variant of alchemy.
        
         | BurningFrog wrote:
         | "Prompt Whispering"?
        
           | th0ma5 wrote:
           | Prompt divining
        
             | thinkling wrote:
             | Impromptu prompting
        
         | layer8 wrote:
         | Prompt vibing
        
       | wagwang wrote:
       | Can't wait for the new field of AI psychology
        
       | sidcool wrote:
       | We went from Impossible to Unreliable. I like the direction as a
       | techie. But not sure as a sociologist or an anthropologist.
        
       | andrepd wrote:
       | [flagged]
        
         | dang wrote:
         | Comments like this break the site guidelines, and not just a
         | little. Can you please review
         | https://news.ycombinator.com/newsguidelines.html and take the
         | intended spirit of this site more to heart? Note these:
         | 
         | " _Please don 't fulminate._"
         | 
         | " _Don 't be curmudgeonly. Thoughtful criticism is fine, but
         | please don't be rigidly or generically negative._"
         | 
         | " _Please don 't sneer, including at the rest of the
         | community._"
         | 
         | " _When disagreeing, please reply to the argument instead of
         | calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be
          | shortened to '1 + 1 is 2, not 3.'_"
         | 
         | There's plenty of LLM skepticism on HN and that's fine, but
         | like all comments here, it needs to be thoughtful.
         | 
         | (We detached this comment from
         | https://news.ycombinator.com/item?id=44074957)
        
       | ken47 wrote:
       | [flagged]
        
         | dang wrote:
         | " _Please don 't post shallow dismissals, especially of other
         | people's work. A good critical comment teaches us something._"
         | 
         | https://news.ycombinator.com/newsguidelines.html
         | 
         | (We detached this comment from
         | https://news.ycombinator.com/item?id=44074957)
        
       | sanqui wrote:
       | Meanwhile in Estonia, they just agreed to resolve child support
       | disputes using AI... https://www.err.ee/1609701615/pakosta-
       | enamiku-elatisvaidlust...
        
       | armchairhacker wrote:
       | LLMs are good at discovery, since they know a lot, and can
       | retrieve that knowledge from a query that simpler (e.g. regex-
       | based) search engines with the same knowledge couldn't. For
        | example, an LLM given a case as input may discover an obscure
        | law, or notice a pattern in past court cases which establishes
       | precedent. So they can be helpful to a real judge.
       | 
        | Of course, the judge must check that the law or precedent isn't
        | hallucinated and applies to the case in the way the LLM claims.
       | They should also prompt other LLMs and use their own knowledge in
       | case the cited law/precedent contradicts others.
       | 
       | There's a similar argument for scientists, mathematicians,
       | doctors, investors, and other fields. LLMs are good at discovery
       | but must be checked.
        
       | PrimordialEdg71 wrote:
       | LLMs make impressive graders-of-convenience, but their judgments
       | swing wildly with prompt phrasing and option order. Treat them
       | like noisy crowd-raters: randomize inputs, ensemble outputs, and
       | keep a human in the loop whenever single-digit accuracy points
       | matter.
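        | 
        | A sketch of that crowd-rater treatment (judge() here is a
        | stand-in for whatever model call you actually make, assumed
        | to return the text of the option it picked):
        | 
        |     import random
        |     from collections import Counter
        | 
        |     def ensemble_grade(judge, question, options,
        |                        n_votes=5, seed=0):
        |         # Shuffle option order on every vote so positional
        |         # preference averages out, then majority-vote.
        |         rng = random.Random(seed)
        |         votes = []
        |         for _ in range(n_votes):
        |             shuffled = options[:]
        |             rng.shuffle(shuffled)
        |             votes.append(judge(question, shuffled))
        |         winner, count = Counter(votes).most_common(1)[0]
        |         if count <= n_votes // 2:
        |             return "NEEDS_HUMAN_REVIEW"  # no majority
        |         return winner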
        
       | nowittyusername wrote:
        | I've done experiments, and basically what I found was that
        | LLM models are extremely sensitive to... language. Well, duh,
        | but let me explain a bit. They will give a different quality
        | and accuracy of answer depending on the system prompt order,
        | language use, length, how detailed the examples are, and so
        | on. Basically every variable you can think of is responsible
        | for either improving the output or degrading it.
        | 
        | And it makes sense once you really grok that LLMs "reason and
        | think" in tokens. They have no internal world representation;
        | tokens are the raw layer on which they operate. For example,
        | if you ask a bilingual human what their favorite color is,
        | the answer will be that color regardless of which language
        | they answer in. For an LLM, the answer might change depending
        | on the language used, because it's all the statistical
        | distribution of tokens in training that conditions the
        | response.
        | 
        | Anyway, I don't want to make a long post here. The good news
        | is that once you have found the best way of asking questions
        | of your model, you can consistently get accurate responses;
        | the trick is to find the best way to communicate with that
        | particular LLM. That's why I am hard at work on an auto-
        | calibration system that runs through a barrage of candidate
        | system prompts and other hyperparameters to find what works
        | best for a specific LLM. The process can be fully automated;
        | you just need to set it all up.
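        | 
        | The calibration loop itself can be dead simple. A brute-force
        | sketch of the idea (judge() is a placeholder for your own
        | model call, and the real search would be smarter than this):
        | 
        |     def calibrate_prompt(judge, system_prompts, labeled):
        |         # labeled: list of (question, expected) pairs.
        |         # Return the candidate system prompt with the
        |         # best exact-match accuracy on that set.
        |         def accuracy(sp):
        |             hits = sum(judge(sp, q).strip() == expected
        |                        for q, expected in labeled)
        |             return hits / len(labeled)
        |         return max(system_prompts, key=accuracy)
        | 
        | Swap in whatever scoring you actually care about; the point is
        | just that the sweep is mechanical once you have a labeled set.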
        
         | pton_xd wrote:
         | Yep, LLMs tell you "what you want to hear." I can usually
         | predict the response I'll get based on how I phrase the
         | question.
        
           | jonplackett wrote:
            | I feel like LLMs have a bit of the Clever Hans effect. They
            | take a lot of cues from me as to what they think I want
            | them to say, or what opinion they think I want them to have.
           | 
           | Clever Hans was a horse who people thought could do maths by
           | tapping his hoof. But actually he was just reading the body
            | language of the person asking the question, noticing them
            | tense up as he got to the right number of taps and stopping.
            | Still pretty smart for a horse, but the human was still
            | doing the maths!
        
             | not_maz wrote:
              | What's worse is that it can sometimes (but not always) read
              | through your anti-bias prompts.
              | 
              |     "No, I want your honest opinion." "It's awesome."
              | 
              |     "I'm going to invest $250,000 into this. Tell me what
              |     you really think." "You should do it."
              | 
              |     (New Session)
              | 
              |     "Someone pitched to me the idea that..." "Reject it."
        
         | thinkling wrote:
         | I thought embeddings were the internal representation? Does
         | reasoning and thinking get expanded back out into tokens and
         | fed back in as the next prompt for reasoning? Or does the model
         | internally churn on chains of embeddings?
        
           | hansvm wrote:
           | There's a certain one-to-oneness between tokens and
           | embeddings. A token expands into a large amount of state, and
           | processing happens on that state and nothing else.
           | 
           | The point is that there isn't any additional state or
           | reasoning. You have a bunch of things equivalent to tokens,
           | and the only trained operations deal with sequences of those
           | things. Calling them "tokens" is a reasonable linguistic
           | choice, since the exact representation of a token isn't core
           | to the argument being made.
        
         | not_maz wrote:
         | I found an absolutely fascinating analysis on precisely this
         | topic by an AI researcher who's also a writer:
         | https://archive.ph/jgam4
         | 
         | LLMs can generate convincing editorial letters that give a real
         | sense of having deeply read the work. The problem is that
         | they're extremely sensitive, as you've noticed, to prompting as
         | well as order bias. Present it with two nearly identical
         | versions of the same text, and it will usually choose based on
          | order. And social-proof-type biases, to which we'd hope
          | machines would be immune, can actually trigger 40+ point swings
          | on a 100-point scale.
         | 
         | If you don't mind technical details and occasional swagger, his
         | work is really interesting.
        
         | leonidasv wrote:
         | I somewhat agree, but I think that the language example is not
         | a good one. As Anthropic have demonstrated[0], LLMs do have
         | "conceptual neurons" that generalise an abstract concept which
         | can later be translated to other languages.
         | 
         | The issue is that those concepts are encoded in intermediate
         | layers during training, absorbing biases present in training
         | data. It may produce a world model good enough to know that
         | "green" and "verde" are different names for the same thing, but
         | not robust enough to discard ordering bias or wording bias.
         | Humans suffer from that too, albeit arguably less.
         | 
         | [0] https://transformer-circuits.pub/2025/attribution-
         | graphs/bio...
        
       ___________________________________________________________________
       (page generated 2025-05-23 23:00 UTC)