[HN Gopher] Positional preferences, order effects, prompt sensit...
___________________________________________________________________
Positional preferences, order effects, prompt sensitivity undermine
AI judgments
Author : joalstein
Score : 85 points
Date : 2025-05-23 17:20 UTC (5 hours ago)
(HTM) web link (www.cip.org)
(TXT) w3m dump (www.cip.org)
| giancarlostoro wrote:
| I listen to online debates, especially political ones on various
| platforms, and man, the AI slop that people slap around at each
| other is beyond horrendous. I would not want an LLM having the
| final say on something critical. I want the opposite: an LLM
| should identify things that need follow-up review by a qualified
| person. A person should still confirm the things that "pass", but
| they can then prioritize what to validate first.
| batshit_beaver wrote:
| I don't even trust LLMs enough to spot content that requires
| validation or nuance.
| giancarlostoro wrote:
| LLMs are 100% trust-but-verify.
| baxtr wrote:
| I'd argue real judges are unreliable as well.
|
| The real question for me is: are they less reliable than human
| judges? Probably yes. But I'd rather have a measurement relative
| to humans than a flat statement like that.
| nimitkalra wrote:
| There are technical quirks that make LLM judges particularly
| high variance, sensitive to artifacts in the prompt, and
| positively/negatively-skewed, as opposed to the subjectivity of
| human judges. These quirks largely arise from their training
| distribution and post-training, and can be mitigated with
| careful calibration.
| andrewla wrote:
| I do think you've hit the heart of the question, but I don't
| think we can answer the second question.
|
| We can measure how unreliable they are, or how susceptible they
| are to specific changes, precisely because we can reset them to
| the same state and run the experiment again. At least for now
| [1] we do not have that capability with humans, so there's no
| way to run a matching experiment on humans.
|
| The best we can do is probably to run the limited experiments
| we can do on humans -- comparing different judges'
| cross-referenced reliability to get an overall measure, and
| some weak indicator of the reliability of a specific judge
| based on intra-judge agreement. But when running this on LLMs,
| they would have to keep the previous cases in their context
| window to get a fair comparison.
|
| [1] https://qntm.org/mmacevedo
| resource_waste wrote:
| I don't study domestic law enough, but I asked a professor of
| law:
|
| "With anything gray, does the stronger/bigger party always
| win?"
|
| He said:
|
| "If you ask my students, nearly all of them would say Yes"
| bunderbunder wrote:
| > The real question for me is: are they less reliable than
| human judges?
|
| I've spent some time poking at this. I can't go into details,
| but the short answer is, "Sometimes yes, sometimes no, and it
| depends A LOT on how you define 'reliable'."
|
| My sense is that the more boring, mechanical, and closed-ended
| the task is, the more likely an LLM is to be more reliable than
| a human. Because an LLM is an unthinking machine. It doesn't
| get tired, or hangry, or stressed out about its kid's problems
| at school. But it's also a doofus with absolutely no common
| sense whatsoever.
| visarga wrote:
| > Because an LLM is an unthinking machine.
|
| Unthinking can be pretty powerful these days.
| DonHopkins wrote:
| At least LLMs don't use penis pumps while on the job in court.
|
| https://www.findlaw.com/legalblogs/legally-weird/judge-who-u...
|
| https://www.subsim.com/radioroom/showthread.php?t=95174
| Alupis wrote:
| I think the main difference is an AI judge may provide three
| different rulings if you just ask it the same thing three
| times. A human judge is much less likely to be so "flip-
| floppy".
|
| You can observe this using any of the present-day LLMs - ask
| it an architectural/design question, provide it with your
| thoughts, reasoning, constraints, etc... and see what it tells
| you. Then... click the "Retry" button and see how similar (or
| dissimilar) the answer is. Sometimes you'll get a complete 180
| from the prior response.
| bunderbunder wrote:
| Humans flip-flop all the time. This is a major reason why the
| Myers-Briggs Type Indicator does such a poor job of
| assigning the same person the same Myers-Briggs type on
| successive tests.
|
| It can be difficult to observe this fact in practice because,
| unlike an LLM, humans have memory: you can't just ask a person
| the exact same question three times in five seconds and get
| three different answers. But, as
| someone who works with human-labeled data, it's something I
| have to contend with on a daily basis. For the things I'm
| working on, if you give the same annotator the same thing to
| label two different times spaced far enough apart for them to
| forget that they have seen this thing before, the chance of
| them making the same call both times is only about 75%. If I
| do that with a prompted LLM annotator, I'm used to seeing more
| like 85%, and for some models it can be possible to get even
| better consistency than that with the right conditions and
| enough time spent fussing with the prompt.
|
| I still prefer the human labels when I can afford them
| because LLM labeling has plenty of other problems. But being
| more flip-floppy than humans is not one that I have been able
| to empirically observe.
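A note on the consistency figures above: they are test-retest
agreement rates. A minimal sketch of computing that rate for any
annotator, human or LLM (the function name and data layout are
illustrative assumptions, not something from this thread):

    def test_retest_agreement(first_pass, second_pass):
        """Fraction of items that receive the same label on both
        passes; roughly 0.75 for the human annotators and 0.85 for
        the prompted LLM annotators described above."""
        if len(first_pass) != len(second_pass):
            raise ValueError("both passes must cover the same items")
        matches = sum(a == b for a, b in zip(first_pass, second_pass))
        return matches / len(first_pass)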
| Alupis wrote:
| We're not talking about labeling data though - we're
| talking about understanding case law, statutory law, facts,
| balancing conflicting opinions, arguments, a judge's
| preconceived notions, experiences, beliefs etc. - many of
| which are assembled over an entire career.
|
| Those things, I'd argue, are far less likely to change if
| you ask the _same_ judge over and over. I think you can
| observe this in reality by considering people's political
| opinions - which can drift over time but typically remain
| similar for long durations (or a lifetime).
|
| In real life, we usually don't ask the same judge to remake
| a ruling over and over - our closest analog is probably a
| judge's ruling/opinion history, which doesn't change nearly
| as much as an LLM's "opinion" on something. This is how we
| label SCOTUS Justices, for example, as "Originalist", etc.
|
| Also, unlike a human, you can radically change an LLM's
| output by just ever-so-slightly altering the input. While
| humans aren't above changing their mind based on new facts,
| they are unlikely to take an opposite position just because
| you reworded your same argument.
| bunderbunder wrote:
| I think that that gets back to the whole memory thing. A
| person is unlikely to forget those kinds of decisions.
|
| But there has been research indicating, for example, that
| judges' rulings vary with the time of day, in a way that
| implies that, if it were possible to construct such an
| experiment, you might find that the same judge given the
| same case would rule in very different ways depending on
| whether you present it in the morning or in the
| afternoon. For example judges tend to hand out
| significantly harsher penalties toward the end of the
| work day.
| acdha wrote:
| I'd think there's also a key adversarial problem: a human
| judge has a conceptual understanding, so you aren't going to
| be able to slightly tweak your wording and get wildly
| different outcomes the way you can with an LLM.
| Terr_ wrote:
| > The real question for me is: are they less reliable than
| human judges?
|
| I'd caution that it's never just about ratios: We must also ask
| whether the "shape" of their performance is knowable and
| desirable. A chess robot's win-rate may be wonderful, but we
| are unthinkingly confident a human wouldn't "lose" by
| disqualification for ripping off an opponent's finger.
|
| Would we accept a "judge" that is fairer on average... but
| gives ~5% lighter sentences to people with a certain color
| shirt, or sometimes issues the death penalty for shoplifting?
| Especially when we cannot diagnose the problem or be sure we
| fixed it? (Maybe, but hopefully not without a _lot_ of debate
| over the risks!)
|
| In contrast, there's a huge body of... of _stuff_ regarding
| human errors, resources we deploy so pervasively it can escape
| our awareness: Your brain is a simulation and diagnostic tool
| for other brains, battle-tested (sometimes literally) over
| millions of years; we intuit many kinds of problems or
| confounding factors to look for, often because we've made them
| ourselves; and thousands of years of cultural practice for
| detection, guardrails, and error-compensating actions. Only a
| small minority of that toolkit can be reused for "AI."
| th0ma5 wrote:
| Can we stop with the "AI being unreliable like people" framing?
| It is demonstrably false at best and cult-like thought
| termination at worst.
| andrepd wrote:
| Judges can reason according to principles, and explain this
| reasoning. LLMs cannot (but they can pretend to, and this
| pretend chain-of-thought can be marketed as "reasoning"; see
| https://news.ycombinator.com/item?id=44069991)
| not_maz wrote:
| I know the answer and I hate it.
|
| AIs are inferior to humans at their best, but superior to
| humans as they actually behave in society, due to decision
| fatigue and other constraints. When it comes to moral judgment
| in high stakes scenarios, AIs still fail (or can be made to
| fail) in ways that are not socially acceptable.
|
| Compare an AI to a real-world, overworked corporate decision
| maker, though, and you'll find that the AI is kinder and less
| biased. It still sucks, because of GIGO, but it's slightly
| better, simply because it doesn't suffer emotional fatigue,
| doesn't take as many shortcuts, and isn't clouded by personal
| opinions since it's not a person.
| nimitkalra wrote:
| Some other known distributional biases include self-preference
| bias (gpt-4o prefers gpt-4o generations over claude generations,
| for example) and structured output/JSON-mode bias [1].
| Interestingly, some models have a more positive or negative skew
| than others as well. This library [2] also provides some methods
| for calibrating/stabilizing them.
|
| [1]:
| https://verdict.haizelabs.com/docs/cookbook/distributional-b...
| [2]: https://github.com/haizelabs/verdict
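One concrete calibration along these lines is to check whether a
pairwise verdict survives swapping the candidates' positions. A
minimal Python sketch, assuming a hypothetical judge callable that
returns "A" or "B" for whichever candidate it prefers (this is not
the verdict library's API):

    def debiased_pairwise(judge, prompt, cand_a, cand_b):
        """Ask for the same comparison twice with the candidates in
        opposite positions and keep only verdicts that survive the
        swap."""
        first = judge(prompt, cand_a, cand_b)    # cand_a shown first
        second = judge(prompt, cand_b, cand_a)   # cand_a shown second
        # Map the swapped verdict back onto the original labels.
        second_mapped = "A" if second == "B" else "B"
        if first == second_mapped:
            return first   # verdict is stable under reordering
        return "tie"       # positional preference detected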
| TrackerFF wrote:
| I see "panels of judges" mentioned once, but what is the weakness
| of this? Other than requiring more resources.
|
| Worst case, you end up with some multi-modal distribution where
| two opinions are tied - which seems somewhat unlikely as the
| panel size grows. Or it could maybe happen in some case with
| exactly two outcomes (yes/no), but I'd be surprised if such a
| panel landed on a perfectly uniform distribution in its
| judgments/opinions (50% yes, 50% no).
| nimitkalra wrote:
| One method to get a better estimate is to extract the token
| log-probabilities of "YES" and "NO" from the final logits of
| the LLM and take a weighted sum [1] [2]. If the LLM is
| calibrated for your task, there should be roughly a ~50% chance
| of sampling YES (1) and ~50% chance of NO (0) -- yielding 0.5.
|
| But generally you wouldn't use a binary outcome when your
| samples can genuinely be 50/50 pass/fail. Better to use a
| discrete scale of 1..3 or 1..5 and specify exactly what makes a
| sample a 2/5 vs. a 4/5, for example.
|
| You are correct to question the weaknesses of a panel. This
| class of methods depends on diversity through high-temperature
| sampling, which can lead to spurious YES/NO responses that
| don't generalize well and are effectively noise.
|
| [1]: https://arxiv.org/abs/2303.16634
| [2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...
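A minimal sketch of that weighted-sum extraction, assuming the top
log-probabilities over the judge's final token have already been
fetched from whichever API is in use (the function name and input
shape are illustrative):

    import math

    def weighted_yes_score(top_logprobs):
        """Collapse the final-token log-probabilities over "YES"/"NO"
        into a soft score in [0, 1] rather than a hard sampled
        verdict. `top_logprobs` maps candidate tokens to their
        log-probabilities."""
        p_yes = math.exp(top_logprobs.get("YES", float("-inf")))
        p_no = math.exp(top_logprobs.get("NO", float("-inf")))
        if p_yes + p_no == 0.0:
            raise ValueError("judge emitted neither YES nor NO")
        # Weighted sum: YES contributes 1, NO contributes 0.
        return p_yes / (p_yes + p_no)

    # A calibrated judge on a genuinely 50/50 item yields ~0.5:
    print(weighted_yes_score({"YES": math.log(0.5), "NO": math.log(0.5)}))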
| shahbaby wrote:
| Fully agree. I've found that LLMs aren't good at tasks that
| require evaluation.
|
| Think about it: if they were good at evaluation, you could remove
| all humans from the loop and have recursively self-improving AGI.
|
| Nice to see an article that makes a more concrete case.
| visarga wrote:
| Humans aren't good at validation either. We need tools,
| experiments, labs. Unproven ideas are a dime a dozen. Remember
| the hoopla about room temperature superconductivity? The real
| source of validation is external consequences.
| ken47 wrote:
| Human experts set the benchmarks, and LLMs cannot match them
| in most (maybe any?) fields requiring sophisticated judgment.
|
| They are very useful for some things, but sophisticated
| judgment is not one of them.
| NitpickLawyer wrote:
| I think there's more nuance, and the way I read the article is
| more "beware of these shortcomings", instead of "aren't good".
| LLM-based evaluation can be good. Several models have by now
| been trained on previous-gen models used in filtering data and
| validating RLHF data (pairwise or even more advanced). Llama 3
| is a good example of this.
|
| My take from this article is that there are plenty of gotchas
| along the way, and you need to be very careful in how you
| structure your data, and how you test your pipelines, and how
| you make sure your tests are keeping up with new models. But,
| like it or not, LLM based evaluation is here to stay. So
| explorations into this space are good, IMO.
| tempodox wrote:
| > We call it 'prompt-engineering'
|
| I prefer to call it "prompt guessing", it's like some modern
| variant of alchemy.
| BurningFrog wrote:
| "Prompt Whispering"?
| th0ma5 wrote:
| Prompt divining
| thinkling wrote:
| Impromptu prompting
| layer8 wrote:
| Prompt vibing
| wagwang wrote:
| Can't wait for the new field of AI psychology
| sidcool wrote:
| We went from Impossible to Unreliable. I like the direction as a
| techie. But not sure as a sociologist or an anthropologist.
| andrepd wrote:
| [flagged]
| dang wrote:
| Comments like this break the site guidelines, and not just a
| little. Can you please review
| https://news.ycombinator.com/newsguidelines.html and take the
| intended spirit of this site more to heart? Note these:
|
| " _Please don 't fulminate._"
|
| " _Don 't be curmudgeonly. Thoughtful criticism is fine, but
| please don't be rigidly or generically negative._"
|
| " _Please don 't sneer, including at the rest of the
| community._"
|
| " _When disagreeing, please reply to the argument instead of
| calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be
| shortened to '1 + 1 is 2, not 3._"
|
| There's plenty of LLM skepticism on HN and that's fine, but
| like all comments here, it needs to be thoughtful.
|
| (We detached this comment from
| https://news.ycombinator.com/item?id=44074957)
| ken47 wrote:
| [flagged]
| dang wrote:
| " _Please don 't post shallow dismissals, especially of other
| people's work. A good critical comment teaches us something._"
|
| https://news.ycombinator.com/newsguidelines.html
|
| (We detached this comment from
| https://news.ycombinator.com/item?id=44074957)
| sanqui wrote:
| Meanwhile in Estonia, they just agreed to resolve child support
| disputes using AI... https://www.err.ee/1609701615/pakosta-
| enamiku-elatisvaidlust...
| armchairhacker wrote:
| LLMs are good at discovery, since they know a lot and can
| retrieve that knowledge from a query in a way that simpler (e.g.
| regex-based) search engines with the same knowledge couldn't. For
| example, an LLM given a case as input may discover an obscure law
| or notice a pattern in past court cases that establishes
| precedent. So they can be helpful to a real judge.
|
| Of course, the judge must check that the law or precedent isn't
| hallucinated and applies to the case in the way the LLM claims.
| They should also prompt other LLMs and use their own knowledge in
| case the cited law/precedent contradicts others.
|
| There's a similar argument for scientists, mathematicians,
| doctors, investors, and other fields. LLMs are good at discovery
| but must be checked.
| PrimordialEdg71 wrote:
| LLMs make impressive graders-of-convenience, but their judgments
| swing wildly with prompt phrasing and option order. Treat them
| like noisy crowd-raters: randomize inputs, ensemble outputs, and
| keep a human in the loop whenever single-digit accuracy points
| matter.
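A minimal sketch of that randomize-and-ensemble pattern, treating low
agreement as the signal to escalate to a human (the judge callable,
run count, and 0.8 threshold are illustrative assumptions):

    import random
    from collections import Counter

    def ensemble_judgment(judge, prompt, options, n_runs=5,
                          min_agreement=0.8):
        """Shuffle option order on every run, majority-vote the
        verdicts, and return None (i.e. route to human review) when
        the judge cannot agree with itself often enough."""
        votes = Counter()
        for _ in range(n_runs):
            shuffled = random.sample(options, k=len(options))
            votes[judge(prompt, shuffled)] += 1
        winner, count = votes.most_common(1)[0]
        if count / n_runs < min_agreement:
            return None
        return winner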
| nowittyusername wrote:
| I've done experiments, and basically what I found was that LLM
| models are extremely sensitive to... language. Well, duh, but
| let me explain a bit. They will give a different quality/accuracy
| of answer depending on the system prompt order, language use,
| length, how detailed the examples are, etc. Basically every
| variable you can think of is responsible for either improving or
| causing detrimental behavior in the output. And it makes sense
| once you really grok that LLMs "reason and think" in tokens.
| They have no internal world representation. Tokens are the raw
| layer on which they operate. For example, if you ask a bilingual
| human what their favorite color is, the answer will be that color
| regardless of what language they used to answer that question.
| For an LLM, that answer might change depending on the language
| used, because it's all statistical data distribution of tokens in
| training that conditions the response. Anyway, I don't want to
| make a long post here.
|
| The good news out of this is that once you have found the best
| way of asking questions of your model, you can consistently get
| accurate responses; the trick is to find the best way to
| communicate with that particular LLM. That's why I am hard at
| work on an auto-calibration system that runs through a barrage
| of candidate system prompts and other hyperparameters to find
| the best ones for that specific LLM. The process can be fully
| automated; you just need to set it all up.
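A minimal sketch of such an auto-calibration loop, assuming a
hypothetical judge callable and a small labelled example set (this is
not the commenter's actual system):

    def calibrate_system_prompt(judge, prompt_variants, labelled_examples):
        """Score each candidate system prompt against a small labelled
        set and keep the one the model answers most accurately with.
        `judge(system_prompt, question)` returns the model's answer."""
        def accuracy(system_prompt):
            hits = sum(
                judge(system_prompt, question) == expected
                for question, expected in labelled_examples
            )
            return hits / len(labelled_examples)

        return max(prompt_variants, key=accuracy)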
| pton_xd wrote:
| Yep, LLMs tell you "what you want to hear." I can usually
| predict the response I'll get based on how I phrase the
| question.
| jonplackett wrote:
| I feel like LLMs have a bit of the Clever Hans effect. It
| takes a lot of cues from me as to what it thinks I want it to
| say or what opinion it thinks I want it to have.
|
| Clever Hans was a horse who people thought could do maths by
| tapping his hoof. But actually he was just reading the body
| language of the person asking the question. Noticing them
| tense up as he got to the right number of stamps and stopping
| - still pretty smart for a horse, but the human was still
| doing the maths!
| not_maz wrote:
| What's worse is that it can sometimes (but not always) read
| through your anti-bias prompts.
|
| "No, I want your honest opinion." "It's awesome."
|
| "I'm going to invest $250,000 into this. Tell me what you
| really think." "You should do it."
|
| (New session) "Someone pitched to me the idea that..."
| "Reject it."
| thinkling wrote:
| I thought embeddings were the internal representation? Does
| reasoning and thinking get expanded back out into tokens and
| fed back in as the next prompt for reasoning? Or does the model
| internally churn on chains of embeddings?
| hansvm wrote:
| There's a certain one-to-oneness between tokens and
| embeddings. A token expands into a large amount of state, and
| processing happens on that state and nothing else.
|
| The point is that there isn't any additional state or
| reasoning. You have a bunch of things equivalent to tokens,
| and the only trained operations deal with sequences of those
| things. Calling them "tokens" is a reasonable linguistic
| choice, since the exact representation of a token isn't core
| to the argument being made.
| not_maz wrote:
| I found an absolutely fascinating analysis on precisely this
| topic by an AI researcher who's also a writer:
| https://archive.ph/jgam4
|
| LLMs can generate convincing editorial letters that give a real
| sense of having deeply read the work. The problem is that
| they're extremely sensitive, as you've noticed, to prompting as
| well as order bias. Present it with two nearly identical
| versions of the same text, and it will usually choose based on
| order. And social-proof-type biases, to which we'd hope machines
| would be immune, can actually trigger 40+ point swings on a
| 100-point scale.
|
| If you don't mind technical details and occasional swagger, his
| work is really interesting.
| leonidasv wrote:
| I somewhat agree, but I think that the language example is not
| a good one. As Anthropic have demonstrated[0], LLMs do have
| "conceptual neurons" that generalise an abstract concept which
| can later be translated to other languages.
|
| The issue is that those concepts are encoded in intermediate
| layers during training, absorbing biases present in training
| data. It may produce a world model good enough to know that
| "green" and "verde" are different names for the same thing, but
| not robust enough to discard ordering bias or wording bias.
| Humans suffer from that too, albeit arguably less.
|
| [0] https://transformer-circuits.pub/2025/attribution-
| graphs/bio...
___________________________________________________________________
(page generated 2025-05-23 23:00 UTC)