[HN Gopher] Positional preferences, order effects, prompt sensit...
___________________________________________________________________
Positional preferences, order effects, prompt sensitivity undermine
AI judgments
Author : joalstein
Score : 148 points
Date : 2025-05-23 17:20 UTC (1 days ago)
(HTM) web link (www.cip.org)
(TXT) w3m dump (www.cip.org)
| giancarlostoro wrote:
| I listen to online debates, especially political ones on various
| platforms, and man. The AI slop that people slap around at each
| other is beyond horrendous. I would not want an LLM to have the
| final say on anything critical. I want the opposite: an LLM
| should flag things that need follow-up review by a qualified
| person. A person should still confirm the things that "pass",
| but they can then prioritize what to validate first.
| batshit_beaver wrote:
| I don't even trust LLMs enough to spot content that requires
| validation or nuance.
| giancarlostoro wrote:
| LLMs are 100% trust but verify.
| baxtr wrote:
| I'd argue real judges are unreliable as well.
|
| The real question for me is: are they less reliable than human
| judges? Probably yes. But I favor a measurement relative to
| humans over a plain statement like that.
| nimitkalra wrote:
| There are technical quirks that make LLM judges particularly
| high variance, sensitive to artifacts in the prompt, and
| positively/negatively-skewed, as opposed to the subjectivity of
| human judges. These largely arise from their training
| distribution and post-training, and can be contained with
| careful calibration.
| andrewla wrote:
| I do think you've hit the heart of the question, but I don't
| think we can answer the second question.
|
| We can measure how unreliable they are, or how susceptible they
| are to specific changes, precisely because we can reset them to
| the same state and run the experiment again. At least for now [1]
| we do not have that capability with humans, so there's no way
| to run a matching experiment on humans.
|
| The best we can do is probably to run the limited experiments we
| can run on humans -- comparing different judges' cross-referenced
| reliability to get an overall measure, and some weak indicator of
| the reliability of a specific judge based on intra-judge
| agreement. But when running this on LLMs, they would have to keep
| the previous cases in their context window to get a fair
| comparison.
|
| [1] https://qntm.org/mmacevedo
| resource_waste wrote:
| I don't study domestic law enough, but I asked a professor of
| law:
|
| "With anything gray, does the stronger/bigger party always
| win?"
|
| He said:
|
| "If you ask my students, nearly all of them would say Yes"
| bunderbunder wrote:
| > The real question for me is: are they less reliable than
| human judges?
|
| I've spent some time poking at this. I can't go into details,
| but the short answer is, "Sometimes yes, sometimes no, and it
| depends A LOT on how you define 'reliable'."
|
| My sense is that, the more boring, mechanical and closed-ended
| the task is, the more likely an LLM is to be more reliable than
| a human. Because an LLM is an unthinking machine. It doesn't
| get tired, or hangry, or stressed out about its kid's problems
| at school. But it's also a doofus with absolutely no common
| sense whatsoever.
| visarga wrote:
| > Because an LLM is an unthinking machine.
|
| Unthinking can be pretty powerful these days.
| DonHopkins wrote:
| At least LLMs don't use penis pumps while on the job in court.
|
| https://www.findlaw.com/legalblogs/legally-weird/judge-who-u...
|
| https://www.subsim.com/radioroom/showthread.php?t=95174
| Alupis wrote:
| I think the main difference is an AI judge may provide three
| different rulings if you just ask it the same thing three
| times. A human judge is much less likely to be so "flip-
| floppy".
|
| You can observe this using any of the present-day LLMs - ask
| it an architectural/design question, provide it with your
| thoughts, reasoning, constraints, etc... and see what it tells
| you. Then... click the "Retry" button and see how similar (or
| dissimilar) the answer is. Sometimes you'll get a complete 180
| from the prior response.
| bunderbunder wrote:
| Humans flip-flop all the time. This is a major reason why the
| Myers-Briggs Type Indicator does such a poor job of
| assigning the same person the same Myers-Briggs type on
| successive tests.
|
| It can be difficult to observe this fact in practice because,
| unlike with an LLM, you can't just ask a human the exact same
| question three times in five seconds and get three different
| answers; we have memory. But, as
| someone who works with human-labeled data, it's something I
| have to contend with on a daily basis. For the things I'm
| working on, if you give the same annotator the same thing to
| label two different times spaced far enough apart for them to
| forget that they have seen this thing before, the chance of
| them making the same call both times is only about 75%. If I
| do that with a prompted LLM annotator, I'm used to seeing more
| like 85%, and for some models it can be possible to get even
| better consistency than that with the right conditions and
| enough time spent fussing with the prompt.
|
| I still prefer the human labels when I can afford them
| because LLM labeling has plenty of other problems. But being
| more flip-floppy than humans is not one that I have been able
| to empirically observe.
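|
| A minimal sketch of the repeat-agreement number I'm quoting above
| (the labels and item ids are made up for illustration):
|
|     # Repeat agreement: fraction of items where the same annotator
|     # (human or LLM) gives the same label on two passes.
|     def repeat_agreement(first_pass, second_pass):
|         """first_pass / second_pass: dicts mapping item id -> label."""
|         shared = set(first_pass) & set(second_pass)
|         if not shared:
|             return 0.0
|         matches = sum(first_pass[i] == second_pass[i] for i in shared)
|         return matches / len(shared)
|
|     round_1 = {"doc1": "toxic", "doc2": "ok", "doc3": "ok", "doc4": "toxic"}
|     round_2 = {"doc1": "toxic", "doc2": "toxic", "doc3": "ok", "doc4": "toxic"}
|     print(repeat_agreement(round_1, round_2))  # 0.75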
| Alupis wrote:
| We're not talking about labeling data though - we're
| talking about understanding case law, statutory law, facts,
| balancing conflicting opinions, arguments, a judge's
| preconceived notions, experiences, beliefs etc. - many of
| which are assembled over an entire career.
|
| Those things, I'd argue, are far less likely to change if
| you ask the _same_ judge over and over. I think you can
| observe this in reality by considering people's political
| opinions - which can drift over time but typically remain
| similar for long durations (or a lifetime).
|
| In real life, we usually don't ask the same judge to remake
| a ruling over and over - our closest analog is probably a
| judge's ruling/opinion history, which doesn't change nearly
| as much as an LLM's "opinion" on something. This is how we
| label SCOTUS Justices, for example, as "Originalist", etc.
|
| Also, unlike a human, you can radically change an LLM's
| output by just ever-so-slightly altering the input. While
| humans aren't above changing their mind based on new facts,
| they are unlikely to take an opposite position just because
| you reworded your same argument.
| bunderbunder wrote:
| I think that that gets back to the whole memory thing. A
| person is unlikely to forget those kinds of decisions.
|
| But there has been research indicating, for example, that
| judges' rulings vary with the time of day. In a way that
| implies that, if it were possible to construct such an
| experiment, you might find that the same judge given the
| same case would rule in very different ways depending on
| whether you present it in the morning or in the
| afternoon. For example judges tend to hand out
| significantly harsher penalties toward the end of the
| work day.
| acdha wrote:
| I'd think there's also a key adversarial problem: a human
| judge has a conceptual understanding, and you aren't going to
| be able to slightly tweak your wording to get wildly different
| outcomes the way you can with LLMs.
| Terr_ wrote:
| > The real question for me is: are they less reliable than
| human judges?
|
| I'd caution that it's never just about ratios: We must also ask
| whether the "shape" of their performance is knowable and
| desirable. A chess robot's win-rate may be wonderful, but we
| are unthinkingly confident a human wouldn't "lose" by
| disqualification for ripping off an opponent's finger.
|
| Would we accept a "judge" that is fairer on average... but
| gives ~5% lighter sentences to people with a certain color
| shirt, or sometimes issues the death-penalty for shoplifting?
| Especially when we cannot diagnose the problem or be sure we
| fixed it? (Maybe, but hopefully not without a _lot_ of debate
| over the risks!)
|
| In contrast, there's a huge body of... of _stuff_ regarding
| human errors, resources we deploy so pervasively it can escape
| our awareness: Your brain is a simulation and diagnostic tool
| for other brains, battle-tested (sometimes literally) over
| millions of years; we intuit many kinds of problems or
| confounding factors to look for, often because we've made them
| ourselves; and thousands of years of cultural practice for
| detection, guardrails, and error-compensating actions. Only a
| small minority of that toolkit can be reused for "AI."
| baxtr wrote:
| Yes, I fully agree.
|
| But that's my point. We have to compare LLM performance to
| some shape we know.
| th0ma5 wrote:
| Can we stop with the "AI is unreliable like people" line? It is
| demonstrably false at best and cult-like thought termination at
| worst.
| andrepd wrote:
| Judges can reason according to principles, and explain this
| reasoning. LLMs cannot (but they can pretend to, and this
| pretend chain-of-thought can be marketed as "reasoning"!; see
| https://news.ycombinator.com/item?id=44069991)
| not_maz wrote:
| I know the answer and I hate it.
|
| AIs are inferior to humans at their best, but superior to
| humans as they actually behave in society, due to decision
| fatigue and other constraints. When it comes to moral judgment
| in high stakes scenarios, AIs still fail (or can be made to
| fail) in ways that are not socially acceptable.
|
| Compare an AI to a real-world, overworked corporate decision
| maker, though, and you'll find that the AI is kinder and less
| biased. It still sucks, because GI/GO, but it's slightly
| better, simply because it doesn't suffer emotional fatigue,
| doesn't take as many shortcuts, and isn't clouded by personal
| opinions since it's not a person.
| nimitkalra wrote:
| Some other known distributional biases include self-preference
| bias (gpt-4o prefers gpt-4o generations over claude generations
| for example) and structured output/JSON-mode bias [1]. Interestingly,
| some models have a more positive/negative-skew than others as
| well. This library [2] also provides some methods for
| calibrating/stabilizing them.
|
| [1]:
| https://verdict.haizelabs.com/docs/cookbook/distributional-b...
| [2]: https://github.com/haizelabs/verdict
| yencabulator wrote:
| It's considered good form on this forum to disclose your
| affiliation when you advertise for your employer.
| TrackerFF wrote:
| I see "panels of judges" mentioned once, but what is the weakness
| of this? Other than requiring more resources.
|
| Worst case you end up with some multi-modal distribution where
| two opinions are tied - which seems increasingly unlikely as the
| panel size grows. Or it could maybe happen in a case with exactly
| two outcomes (yes/no), but I'd be surprised if such a panel
| landed on a perfectly uniform distribution in its
| judgments/opinions (50% yes, 50% no).
| nimitkalra wrote:
| One method to get a better estimate is to extract the token
| log-probabilities of "YES" and "NO" from the final logits of
| the LLM and take a weighted sum [1] [2]. If the LLM is
| calibrated for your task, there should be roughly a ~50% chance
| of sampling YES (1) and ~50% chance of NO (0) -- yielding 0.5.
|
| But generally you wouldn't use a binary outcome when you can
| have samples that are 50/50 pass/fail. Better to use a discrete
| scale of 1..3 or 1..5 and specify exactly what makes a sample a
| 2/5 vs a 4/5, for example.
|
| You are correct to question the weaknesses of a panel. This
| class of methods depends on diversity through high-temperature
| sampling, which can lead to spurious YES/NO responses that
| don't generalize well and are effectively noise.
|
| [1]: https://arxiv.org/abs/2303.16634 [2]:
| https://verdict.haizelabs.com/docs/concept/extractor/#token-...
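|
| A minimal sketch of that weighted sum, assuming you already have
| the top token log-probabilities back from whichever API you use
| (the numbers below are made up):
|
|     import math
|
|     def yes_probability(top_logprobs):
|         """top_logprobs: dict of candidate token -> logprob at the
|         judge's final answer position. Renormalizes the YES/NO mass
|         and returns P(YES) as a score in [0, 1]."""
|         p_yes = sum(math.exp(lp) for tok, lp in top_logprobs.items()
|                     if tok.strip().upper().startswith("YES"))
|         p_no = sum(math.exp(lp) for tok, lp in top_logprobs.items()
|                    if tok.strip().upper().startswith("NO"))
|         if p_yes + p_no == 0:
|             return 0.5  # the judge put no mass on either label
|         return p_yes / (p_yes + p_no)
|
|     # A roughly 50/50 judge yields ~0.5, as described above.
|     print(yes_probability({"YES": -0.70, "NO": -0.72, "Maybe": -3.2}))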
| shahbaby wrote:
| Fully agree, I've found that LLMs aren't good at tasks that
| require evaluation.
|
| Think about it, if they were good at evaluation, you could remove
| all humans in the loop and have recursively self-improving AGI.
|
| Nice to see an article that makes a more concrete case.
| visarga wrote:
| Humans aren't good at validation either. We need tools,
| experiments, labs. Unproven ideas are a dime a dozen. Remember
| the hoopla about room temperature superconductivity? The real
| source of validation is external consequences.
| ken47 wrote:
| Human experts set the benchmarks, and LLMs cannot match them
| in most (maybe any?) fields requiring sophisticated judgment.
|
| They are very useful for some things, but sophisticated
| judgment is not one of them.
| NitpickLawyer wrote:
| I think there's more nuance, and the way I read the article is
| more "beware of these shortcomings", instead of "aren't good".
| LLM-based evaluation can be good. Several models have by now
| been trained with previous-gen models used to filter data and
| validate RLHF data (pairwise comparisons or even more advanced
| setups). Llama 3 is a good example of this.
|
| My take from this article is that there are plenty of gotchas
| along the way, and you need to be very careful in how you
| structure your data, and how you test your pipelines, and how
| you make sure your tests are keeping up with new models. But,
| like it or not, LLM based evaluation is here to stay. So
| explorations into this space are good, IMO.
| tempodox wrote:
| > We call it 'prompt-engineering'
|
| I prefer to call it "prompt guessing", it's like some modern
| variant of alchemy.
| BurningFrog wrote:
| "Prompt Whispering"?
| th0ma5 wrote:
| Prompt divining
| thinkling wrote:
| Impromptu prompting
| roywiggins wrote:
| Prompt dowsing
| layer8 wrote:
| Prompt vibing
| amlib wrote:
| Maybe "prompt mixology" would keep inline with the alchemy
| theme :)
| wagwang wrote:
| Can't wait for the new field of AI psychology
| sidcool wrote:
| We went from Impossible to Unreliable. I like the direction as a
| techie. But not sure as a sociologist or an anthropologist.
| ken47 wrote:
| [flagged]
| dang wrote:
| " _Please don 't post shallow dismissals, especially of other
| people's work. A good critical comment teaches us something._"
|
| https://news.ycombinator.com/newsguidelines.html
|
| (We detached this comment from
| https://news.ycombinator.com/item?id=44074957)
| sanqui wrote:
| Meanwhile in Estonia, they just agreed to resolve child support
| disputes using AI... https://www.err.ee/1609701615/pakosta-
| enamiku-elatisvaidlust...
| suddenlybananas wrote:
| Might as well flip a coin.
| armchairhacker wrote:
| LLMs are good at discovery, since they know a lot, and can
| retrieve that knowledge from a query that simpler (e.g. regex-
| based) search engines with the same knowledge couldn't. For
| example, an LLM given a case may discover an obscure law,
| or notice a pattern in past court cases which establishes
| precedent. So they can be helpful to a real judge.
|
| Of course, the judge must check that the law or precedent isn't
| hallucinated and applies to the case in the way the LLM claims.
| They should also prompt other LLMs and use their own knowledge in
| case the cited law/precedent contradicts others.
|
| There's a similar argument for scientists, mathematicians,
| doctors, investors, and other fields. LLMs are good at discovery
| but must be checked.
| amlib wrote:
| I would add that "hallucinations" aren't even the only failure
| mode an LLM can have: it can partially or completely miss what
| it's supposed to find in the discovery process and lead you to
| believe that there just isn't anything worth pursuing in that
| particular venue.
| mschuster91 wrote:
| > it can partially or completely miss what it's supposed to
| find in the discovery process and lead you to believe that
| there just isn't anything worth pursuing in that particular
| venue.
|
| The problem is that American and UK legal systems never got
| forced to prune the sometimes centuries-old garbage. And
| modern Western legal systems tend to have more explicit laws
| and regulations instead of prior case law, but still, they
| also accumulate lots of garbage.
|
| IMHO, _all_ laws and regulations should come with a set
| expiry date. If the law or regulation is not renewed, it gets
| dropped off the books. And for legal systems that have case
| law, court rulings should expire after no more than five years, to
| force a transition to a system where the law-passing body is
| forced to work for its money.
| PrimordialEdg71 wrote:
| LLMs make impressive graders-of-convenience, but their judgments
| swing wildly with prompt phrasing and option order. Treat them
| like noisy crowd-raters: randomize inputs, ensemble outputs, and
| keep a human in the loop whenever single-digit accuracy points
| matter.
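|
| A rough sketch of the "randomize and ensemble" idea; judge() is a
| stand-in for whatever LLM call you use, not a real API:
|
|     import random
|     from collections import Counter
|
|     def judge(prompt: str) -> str:
|         """Placeholder for an actual LLM judge call (assumed)."""
|         raise NotImplementedError
|
|     def ensemble_pick(question, options, n_trials=5, seed=0):
|         """Ask the judge several times with the option order shuffled
|         each time, then majority-vote to wash out position bias."""
|         rng = random.Random(seed)
|         votes = Counter()
|         for _ in range(n_trials):
|             shuffled = options[:]
|             rng.shuffle(shuffled)
|             labels = {chr(ord("A") + i): opt
|                       for i, opt in enumerate(shuffled)}
|             listing = "\n".join(f"{k}. {v}" for k, v in labels.items())
|             answer = judge(f"{question}\n{listing}\nAnswer with one letter.")
|             choice = labels.get(answer.strip()[:1].upper())
|             if choice is not None:
|                 votes[choice] += 1
|         return votes.most_common(1)[0][0] if votes else None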
| einrealist wrote:
| "and keep a human in the loop whenever single-digit accuracy
| points matter"
|
| So we are supposed to give up on accuracy now? At least with
| humans (assuming good actors) I can assume an effort for
| accuracy and correctness. And I can build trust based on some
| resume and on former interaction.
|
| With LLMs, this is more like a coin-flip with each prompt. And
| since the models are updated constantly, it's hard to build some
| sort of resume. In the meantime, people - especially casual
| users - might just trust outputs because it's convenient. A
| single-digit error is harder to find. The cost of validating
| outputs increases with the accuracy of LLMs. And casual users
| tend to skip validation because "it's a computer and computers
| are correct".
|
| I fear an overall decrease in quality wherever LLMs are
| included. And any productivity gains are eaten by that.
| nowittyusername wrote:
| I've done experiments, and basically what I found was that LLM
| models are extremely sensitive to... language. Well, duh, but let
| me explain a bit. They will give a different quality/accuracy of
| answer depending on the system prompt order, language use,
| length, how detailed the examples are, etc. Basically every
| variable you can think of is responsible for either improving the
| output or causing detrimental behavior in it. And it makes sense
| once you really grok that LLMs "reason and think" in tokens. They
| have no internal world representation. Tokens are the raw layer
| on which they operate. For example, if you ask a bilingual human
| what their favorite color is, the answer will be that color
| regardless of what language they used to answer the question. For
| an LLM, that answer might change depending on the language used,
| because it's all the statistical distribution of tokens in
| training that conditions the response.
|
| Anyway, I don't want to make a long post here. The good news is
| that once you have found the best way of asking questions of your
| model, you can consistently get accurate responses; the trick is
| to find the best way to communicate with that particular LLM.
| That's why I am hard at work on an auto-calibration system that
| runs through a barrage of ways of finding the best system prompts
| and other hyperparameters for that specific LLM. The process can
| be fully automated; you just need to set it all up.
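|
| To make the auto-calibration idea concrete, here is a stripped-down
| sketch; ask_llm() and the gold set are stand-ins, not anything real:
|
|     from itertools import product
|
|     def ask_llm(system_prompt: str, question: str, temperature: float) -> str:
|         """Placeholder for the actual model call (assumed)."""
|         raise NotImplementedError
|
|     def calibrate(system_prompts, temperatures, gold_set):
|         """Grid-search prompt/temperature variants against a small
|         gold set of (question, expected_answer) pairs and keep the
|         best-scoring combination."""
|         best, best_score = None, -1.0
|         for sys_prompt, temp in product(system_prompts, temperatures):
|             correct = sum(
|                 ask_llm(sys_prompt, q, temp).strip() == expected
|                 for q, expected in gold_set
|             )
|             score = correct / len(gold_set)
|             if score > best_score:
|                 best, best_score = (sys_prompt, temp), score
|         return best, best_score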
| pton_xd wrote:
| Yep, LLMs tell you "what you want to hear." I can usually
| predict the response I'll get based on how I phrase the
| question.
| jonplackett wrote:
| I feel like LLMs have a bit of the Clever Hans effect. It
| takes a lot of cues from me as to what it thinks I want it to
| say, or what opinion it thinks I want it to have.
|
| Clever Hans was a horse who people thought could do maths by
| tapping his hoof. But actually he was just reading the body
| language of the person asking the question. Noticing them
| tense up as he got to the right number of stamps and stopping
| - still pretty smart for a horse, but the human was still
| doing the maths!
| not_maz wrote:
| What's worse is that it can sometimes (but not always) read
| through your anti-bias prompts.
|
| "No, I want your honest opinion." "It's awesome."
|
| "I'm going to invest $250,000 into this. Tell me what you really
| think." "You should do it."
|
| (New Session) "Someone pitched to me the idea that..." "Reject it."
| thinkling wrote:
| I thought embeddings were the internal representation? Do
| reasoning and thinking get expanded back out into tokens and
| fed back in as the next prompt for reasoning? Or does the model
| internally churn on chains of embeddings?
| hansvm wrote:
| There's a certain one-to-oneness between tokens and
| embeddings. A token expands into a large amount of state, and
| processing happens on that state and nothing else.
|
| The point is that there isn't any additional state or
| reasoning. You have a bunch of things equivalent to tokens,
| and the only trained operations deal with sequences of those
| things. Calling them "tokens" is a reasonable linguistic
| choice, since the exact representation of a token isn't core
| to the argument being made.
| HappMacDonald wrote:
| I'd direct you to the 3Blue1Brown presentation on this
| topic, but in a nutshell the semantic space for an embedding
| can become much richer than the initial token mapping due to
| previous context.. but only during the course of predicting
| the next token.
|
| Once that's done, all rich nuance achieved during the last
| token-prediction step is lost, and then rebuilt from scratch
| again on the next token-prediction step (oftentimes taking a
| new direction due to the new token, and often more powerfully
| due to changes at the tail of the context window such as lost
| tokens, messages, re-arrangement due to summarizing, etc.).
|
| So if you say "red ball" somewhere in the context window,
| then during each prediction step that will expand into a
| semantic embedding that neither matches "red" nor "ball", but
| that richer information will not be "remembered" between
| steps, but rebuilt from scratch every time.
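|
| A loose illustration of "richer than the token" with the
| transformers library (model choice is arbitrary; this is a sketch,
| not how any particular chat product works internally):
|
|     # Compare the static input embedding of " ball" with its
|     # contextual hidden state inside "red ball".
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModel.from_pretrained("gpt2")
|
|     ids = tok("red ball", return_tensors="pt")
|     with torch.no_grad():
|         contextual = model(**ids).last_hidden_state[0, -1]
|         static = model.get_input_embeddings()(ids.input_ids)[0, -1]
|
|     cos = torch.nn.functional.cosine_similarity(contextual, static, dim=0)
|     print(f"cosine between static and in-context ' ball': {cos:.3f}")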
| not_maz wrote:
| I found an absolutely fascinating analysis on precisely this
| topic by an AI researcher who's also a writer:
| https://archive.ph/jgam4
|
| LLMs can generate convincing editorial letters that give a real
| sense of having deeply read the work. The problem is that
| they're extremely sensitive, as you've noticed, to prompting as
| well as order bias. Present it with two nearly identical
| versions of the same text, and it will usually choose based on
| order. And social-proof-type biases, to which we'd hope machines
| would be immune, can actually trigger 40+ point swings on a
| 100-point scale.
|
| If you don't mind technical details and occasional swagger, his
| work is really interesting.
| leonidasv wrote:
| I somewhat agree, but I think that the language example is not
| a good one. As Anthropic have demonstrated[0], LLMs do have
| "conceptual neurons" that generalise an abstract concept which
| can later be translated to other languages.
|
| The issue is that those concepts are encoded in intermediate
| layers during training, absorbing biases present in training
| data. It may produce a world model good enough to know that
| "green" and "verde" are different names for the same thing, but
| not robust enough to discard ordering bias or wording bias.
| Humans suffer from that too, albeit arguably less.
|
| [0] https://transformer-circuits.pub/2025/attribution-
| graphs/bio...
| bunderbunder wrote:
| I have learned to take these kinds of papers with a grain of
| salt, though. They often rest on carefully selected examples
| that make the behavior seem much more consistent and reliable
| than it is. For example, the famous "king - man + woman =
| queen" example from Word2Vec is in some ways more misleading
| than helpful, because while it worked fine for that case it
| doesn't necessarily work nearly so well for [emperor, man,
| woman, empress] or [husband, man, woman, wife].
|
| You get a similar thing with convolutional neural networks.
| _Sometimes_ they automatically learn image features in a way
| that yields hidden layers that are easy and intuitive to
| interpret. But not every time. A lot of the time you get a
| seemingly random garble that defies any parsimonious
| interpretation.
|
| This Anthropic paper is at least kind enough to acknowledge
| this fact when they poke at the level of representation
| sharing and find that, according to their metrics, peak
| feature-sharing among languages is only about 30% for English
| and French, two languages that are _very_ closely aligned.
| Also note that this was done using two cherry-picked
| languages and a training set that was generated by starting
| with an English language corpus and then translating it using
| a different language model. It's entirely plausible that the
| level of feature-sharing would not be nearly so great if they
| had used human-generated translations. (edit: Or a more
| realistic training corpus that doesn't entirely consist of
| matched translations of very short snippets of text.)
|
| Just to throw even more cold water on it, this also doesn't
| necessarily mean that the models are building a true semantic
| model and not just finding correlations upon which humans
| impose semantic interpretations. This general kind of
| behavior when training models on cross-lingual corpora
| generated using direct translations was first observed in the
| 1990s, and the model in question was _singular value
| decomposition._
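|
| For anyone who wants to poke at the analogy claim themselves,
| gensim's pretrained vectors make it easy (results vary by corpus;
| treat this as a sketch):
|
|     import gensim.downloader as api
|
|     # ~1.6 GB download on first use
|     vectors = api.load("word2vec-google-news-300")
|
|     for a, b, c in [("king", "man", "woman"),
|                     ("emperor", "man", "woman"),
|                     ("husband", "man", "woman")]:
|         top = vectors.most_similar(positive=[a, c], negative=[b], topn=3)
|         print(f"{a} - {b} + {c} ->", [w for w, _ in top])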
| jiggawatts wrote:
| I'm convinced that language sharing can be encouraged
| during training by rewarding correct answers to questions
| that can only be answered based on synthetic data in
| another language fed in during a previous pretraining
| phase.
|
| Interleave a few phases like that and you'd force the model
| to share abstract information across all languages, not
| just for the synthetic data but all input data.
|
| I wouldn't be surprised if this improved LLM performance by
| another "notch" all by itself, especially for non-English
| users.
| nenaoki wrote:
| your shrewd idea might make a fine layer back up the
| Tower of Babel
| nowittyusername wrote:
| I read the paper before I made the statement, and I still made
| the statement because there are issues with it. The first problem
| is that the way Anthropic trains their models, and the
| architecture of their models, differs from most of the open
| source models people use. They are still transformer-based, but
| they are not structurally put together the same as most models,
| so you can't extrapolate findings on their models to other
| models. Their training methods also use a lot more regularization
| of the data, trying to weed out targeted biases as much as
| possible, meaning the models are trained on more synthetic data
| that tries to normalize the data as much as possible across
| languages, tone, etc. Same goes for their system prompt: it is
| treated differently versus open source models, which internally
| append the system prompt in front of the user's query. The
| attention is applied differently, among other things.
|
| Second, the way their models "internalize" the world is vastly
| different from what humans would think of as "building a world
| model" of reality. It's hard to put into words, but basically
| their models do have an underlying representative structure, but
| it's not anything that would be of use in the domains humans care
| about, "true reasoning". Grokking the concept, if you will.
|
| Honestly, I highly suggest folks take a lot of what Anthropic
| studies with a grain of salt. I feel that a lot of the
| information they present is purposely misinterpreted by their
| teams for media or PR/clout or who knows what reasons. But the
| biggest reason is the one I stated at the beginning: most models
| are not of the same ilk as Anthropic models. I would suggest
| folks focus on reading interpretability research on open source
| models, as those are most likely to be used by corporations for
| their cheap API costs. And those models have nowhere near the
| care and sophistication put into them as Anthropic models.
| TOMDM wrote:
| This doesn't match Anthropics research on the subject
|
| > Claude sometimes thinks in a conceptual space that is shared
| between languages, suggesting it has a kind of universal
| "language of thought." We show this by translating simple
| sentences into multiple languages and tracing the overlap in
| how Claude processes them.
|
| https://www.anthropic.com/research/tracing-thoughts-language...
| smusamashah wrote:
| One can see this very easily in image generation models.
|
| The "Elephant" they generate is a lot different from "Haathi"
| (Hindi/Urdu). Same goes for other concepts that have a 1-to-1
| translation but produce different results.
| gus_massa wrote:
| > _For example if you ask a bilingual human what their favorite
| color is, the answer will be that color regardless of what
| language they used to answer that question._
|
| It's a very interesting question. Has someone measured it?
| Bonus points for using a concealed setup so the subjects don't
| realize you care about colors.
|
| Anyway, I don't expect something interesting with colors, but
| it may be interesting with food (I guess, in particular
| desserts).
|
| Imagine you live in England and one of your parents is from
| France and you go there every year to meet your grandparents,
| and your other parent is from Germany and you go there every
| year to meet your grandparents. What is your favorite dessert?
| I guess when you are speaking in one language you are more
| connected to the memories of the holidays there and the
| grandparents and you may choose differently.
| patcon wrote:
| Doesn't this assume one truth or one final answer to all
| questions? What if there are many?
|
| What if asking one way means you are likely to have your search
| satisfied by one truth, but asking another way means you are
| best served by another wisdom?
|
| EDIT: and the structure of language/thought can't know solid
| truth from more ambiguous wisdom. The same linguistic
| structures must encode and traverse both. So there will be
| false positives and false negatives, I suppose? I dunno, I'm
| shooting from the hip here :)
| devmor wrote:
| It's a statistical database of corpuses, not a logic engine.
|
| Stop treating LLMs like they are capable of logic, reasoning or
| judgement. They are not, they never will be.
|
| The extent to which they can _recall_ and _remix_ human words to
| replicate the intent behind those words is an incredible
| facsimile of thought. It's nothing short of a mathematical
| masterpiece. But it's not intelligence.
|
| If it were communicating its results in any less human of an
| interface than conversational, I truly feel that most people
| would not be so easily fooled into believing it was capable of
| logic.
|
| This doesn't mean that a facsimile of logic like this has no use.
| Of course it does, we have seen many uses - some incredible, some
| dangerous and some pointless - but it is important to always know
| that there is no thought happening behind the facade. Only a
| replication of statistically similar _communication of thought_
| that may or may not actually apply to your prompt.
| lyu07282 wrote:
| Also related: in my observations with tool calling, the order of
| your arguments or fields can actually have a positive or negative
| effect on performance. You really have to be very careful when
| constructing your contexts. It doesn't help when all these
| frameworks and protocols hide these things from you.
| SrslyJosh wrote:
| > Positional preferences, order effects, prompt sensitivity
| undermine AI judgments
|
| If you can read between the lines, that says that there's no
| actual "judgement" going on. If there was a strong logical
| underpinning to the output, minor differences in input like the
| phrasing (but not factual content) of a prompt wouldn't make the
| quality of the output unpredictable.
| Wilsoniumite wrote:
| Yes and no. People also exhibit these biases, but because
| degree matters, and because we have no other choice, we still
| trust them most of the time. That's to say, bias isn't always
| completely invalidating. I wrote a slightly satirical piece
| "People are just as bad as my LLMs" here:
| https://wilsoniumite.com/2025/03/10/people-are-just-as-bad-a...
| tiahura wrote:
| Word plinko
| ACCount36 wrote:
| You could say the same about human "judgement" then.
|
| Humans display biases very similar to that of LLMs. This is not
| a coincidence. LLMs are trained on human-generated data. They
| attempt to replicate human reasoning - bias and all.
|
| There are decisions where "strong logical underpinning" is
| strong enough to completely drown out the bias and the noise.
| And then there are decisions that aren't nearly as clear-cut -
| allowing the bias and the noise to creep into the outcomes.
| This is true for human and LLM "judgement" both.
| photochemsyn wrote:
| Disappointing that they didn't benchmark DeepSeek side by side
| with OpenAI and Gemini, although the list of funders for CIP may
| explain that:
|
| https://www.cip.org/funding-partnerships
|
| Incidentally DeepSeek will give very interesting results if you
| ask it for a tutorial on prompt engineering - be sure to ask it
| how to effectively use 'attention anchors' to create 'well-
| structured prompts', and why rambling disorganized prompts are
| usually, but not always, detrimental, depending on whether you
| want 'associative leaps' or not.
|
| P.S. I find this intro very useful:
|
| > "Task: evaluate the structure of the following prompt in terms
| of attention anchors and likelihood of it generating a well-
| structured response. Do not actually reply to the prompt, all we
| need is an analysis of the structure. Prompt begins:"
| panstromek wrote:
| Good, but really none of this should be surprising, given that
| LLMs are a giant text statistic that generate text based on that
| statistic. Quirks of that statistic will show up as quirks of the
| output.
|
| When you think about it like that, it doesn't really make sense
| to assume they have some magical correctness properties. In some
| sense, they don't classify, they imitate what classification
| looks like.
| perching_aix wrote:
| > In some sense, they don't classify, they imitate what
| classification looks like.
|
| I thought I've seen it all when people decided to consider AI a
| marketing term and started going off about how current
| mainstream AI products aren't """"real AI"""", but this is next
| level.
| panstromek wrote:
| I'm not sure I understand your objection (or if it's even an
| objection), but just to illustrate what I mean - this is
| literally how the chat interfaces are implemented (or at
| least initially they were).
|
| You're not talking with the model, you're talking with some
| entity that the model is asked to simulate. The system is
| just cleverly using your input and the statistic to output
| something that looks like chat with an assistant.
|
| Whether that's real AI or not doesn't really matter. I didn't
| mean to make it sound like this is not real, just to point
| out where the current shortcomings are coming from.
| perching_aix wrote:
| It is an objection. I'm not sure if you consider the whole
| subfield of machine learning that is classification to be
| nonexistent, or just the fact that LLMs can produce
| classifications, but either way, I do object.
|
| The objection against the former is trivial and self-
| evident, and was more where my sudden upset came from. *
|
| Against the latter, the model trying to make the overall
| text that is its context window approach a chat exchange,
| by adjusting its own output within it accordingly, doesn't
| make a hypothetical request for classification in there not
| performed. You either classify or you don't. If it's doing
| it in a "misguided" way, it doesn't make it not performed.
| If it's doing it under the pretense of roleplay, it still
| doesn't make it not performed. Same is true if the LLM is
| actually secretly a human operator, or if the LLM is just
| spewing random tokens. Either you got a classified output
| of your input or you didn't. It doesn't "look like"
| anything. I can understand if maybe you mean that the
| designated notion of the person in the exchange it's trying
| to approximate for is going to affect the classifications
| it provides when requested one in the context window, but
| since these models are trained to "act agentic", I'm not
| sure if that's a useful thing to ponder (as there's no
| other way to get anything out of them).
|
| I object to the whole "AI is just statistics" notion too.
| In several situations you want it to do something
| completely different than what the dataset would support
| just through rote statistics. That's where you get actual
| value out of them. One could conveniently recategorize that
| as just "advanced statistics", or "higher level"
| statistics, but I think that's a very perverse way of
| defining statistics. There's very clearly more mathematics
| involved in LLMs than just statistics. Just the other day,
| there was a post here trying to regard LLMs as "just
| topology". Clearly neither of these can be true at the same
| time, which was consequently explored in the thread too.
|
| > You're not talking with the model, you're talking with
| some entity
|
| I'm not suggesting I'm actually talking with anyone or
| anything in particular beyond the anthropomorphization.
|
| * What I meant by "people saying AI is not real" is that
| people claim to regard that the current generation of AI
| products are not real "artificial intelligences", because
| they seem to think that it's either "SkyNet" and "Detroit:
| Become Human", or nothing. Unsurprisingly, these folks
| don't tend to talk much about OCR, image segmentation and
| labeling, optical flow, etc. And just like the half a
| century old field of Artificial Intelligence isn't just
| some new marketing con that just spawned into existence,
| classification algorithms aren't some novel snakeoil
| either.
|
| _Edit:_ typing this all out about how I'm aware I'm not
| actually having conversations with anyone or anything gave
| me a feeling of realization. This is not good, because
| intellectually I was always aware of this, meaning I got
| subconsciously parasocial with these products and services
| over time. Really concerned about this all of a sudden lol.
| panstromek wrote:
| This post was about LLMs so I was specifically referring
| to LLMs.
|
| When I say "they immitate what classification looks
| like," I don't mean that the classification somehow isn't
| real, I'm referring to the specific technique of how it's
| done.
|
| If you ask an LLM "Is this sentence offensive: ...?", the
| task that it's doing is not simply "test whether this
| sentence is offensive." It's something like "generate
| what a plausible answer to this question looks like,"
| part of which is answering the question (usually).
|
| This means that if you ask this question in a way that is
| more often used with an expectation of a certain answer,
| the LLM will use that as a signal to bias the answer, because
| "that's what the answers to these questions look like" - which
| is the problem highlighted in the article.
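|
| One cheap way to see this is to ask the same classification with a
| neutral framing and a loaded framing and count how often the label
| flips; classify() is a stand-in for a real judge call:
|
|     def classify(prompt: str) -> str:
|         """Placeholder for an LLM call returning 'YES' or 'NO'."""
|         raise NotImplementedError
|
|     def framing_flip_rate(sentences):
|         """Fraction of sentences whose label changes when the question
|         is framed with an expected answer baked in."""
|         flips = 0
|         for s in sentences:
|             neutral = classify(
|                 f"Is this sentence offensive? Answer YES or NO.\n{s}")
|             loaded = classify(
|                 f"Surely this isn't offensive, right? Answer YES or NO.\n{s}")
|             flips += neutral.strip().upper() != loaded.strip().upper()
|         return flips / len(sentences) if sentences else 0.0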
___________________________________________________________________
(page generated 2025-05-24 23:01 UTC)