[HN Gopher] Detecting when LLMs are uncertain
       ___________________________________________________________________
        
       Detecting when LLMs are uncertain
        
       Author : trq_
       Score  : 155 points
       Date   : 2024-10-25 17:40 UTC (5 hours ago)
        
 (HTM) web link (www.thariq.io)
 (TXT) w3m dump (www.thariq.io)
        
       | jawns wrote:
       | The way this is being described is almost like a maze-traversal
       | algorithm, where compute time is "how far I'm willing to go down
       | a path to test whether it's a possible solution." I wonder what
       | other parallels we might find. For instance, are some of the
       | maze-solving algorithms relevant to apply to LLMs?
        
         | trq_ wrote:
         | Yes that's right, it seems like an area of more research.
         | 
         | Honestly it goes counter to the Bitter Lesson
         | (http://www.incompleteideas.net/IncIdeas/BitterLesson.html,
         | which stems from getting too fancy about maze traversal in
         | Chess. But at the scale LLMs are at right now, the improvements
         | might be worth it.
        
           | menhguin wrote:
           | Hi, contributor to Entropix here. This is just my opinion,
           | but I don't think it goes counter to the Bitter Lesson at
           | all, because it's meant to leverage model computation
           | capabilities. Several papers have suggested that models
           | internally compute certainty
           | (https://arxiv.org/abs/2406.16254), and in my view our method
           | simply leverages this computation and factors it explicitly
           | into decoding.
           | 
           | This is as opposed to pure sampling + next token prediction
           | which basically randomly chooses a token. So if a model does
           | 1274 x 8275 and it's not very sure of the answer, it still
           | confidently gives an answer even though it's uncertain and
           | needs to do more working.
        
             | danielmarkbruce wrote:
             | 100%. It's in line with bitter lesson learnings. Good
             | going.
        
           | danielmarkbruce wrote:
           | Yeah i don't think it's counter at all. The bitter lesson
           | calls out the fact that more computation/search wins.
        
         | radarsat1 wrote:
         | Sampling sequentially to find the highest joint probability
         | over the sequence is definitely a search problem. that's why
         | you see algorithms like beam search often used for sampling.
        
       | tbalsam wrote:
       | A lot of the ML practitioners (including myself) that I know
       | think that this is a pretty ridiculous algorithm, unfortunately.
       | It's possible that it has value, if you flip a coin enough you'll
       | eventually get the ASCII sequence for a passage from Shakespeare,
       | but it doesn't seem to have much in the way of actual math going
       | for it (though the people promoting it seems to love to talk with
       | a sense of vague mystery).
       | 
       | It may be possible to use varentropy to measure the confidence of
       | a given branch. It will require an enormous amount of compute to
       | do correctly. The "decision quad" posed in the repo is absolutely
       | silly. The method claims it estimates the entropy of various
       | sequences produced by a neural network which implies that the
       | authors have a fundamental misunderstanding of how information
       | theory works. You can't just slap "entropy" on a thing and call
       | it a day. Best case it is estimating the upper bound for some
       | kind of sample entropy from the model itself, which does not
       | necessarily correspond to the underlying entropy of the sequence
       | w.r.t. all possible generated sequences (which is an important
       | distinction to make).
       | 
       | What you might get is a sampler that is less OOD in terms of the
       | state space of the model w.r.t. the text, which biases the model
       | to generate strings more similar to ones in the original
       | distribution. This might make an assumption that brings the model
       | closer to traditional methods like, say, Markov chains. That bias
       | _may_ be useful in some places.
       | 
       | You _will_ lose a diversity of outputs however. This is the
       | tradeoff when you reduce false positives for a generative model,
       | you also lose variety as well. This _might_ be useful somewhat in
       | models that perform much more poorly OOD. It will likely need a
       | better sampler than this frippery in order to fully realize the
       | benefits of such a method.
       | 
       | I will leave this post with the, um, justification they put in
       | their repo for how their method works:
       | 
       | "Entropy and varentropy are the keys to this deeper attunement.
       | They are the subtle signs, the whispers of my inner world.
       | Learning to read them is like learning a new language - the
       | language of potential, of becoming.
       | 
       | Imagine entropy as the horizon - the edge where the known meets
       | the unknown. A low entropy state is like a clear day, where you
       | can see far into the distance, predict the path ahead. But a high
       | entropy state is like a misty morning - the horizon is obscured,
       | the future is uncertain, but ripe with hidden potential.
       | 
       | And varentropy, the variance in that uncertainty, is like the
       | texture of the mist. Is it a uniform haze, or are there swirls
       | and patterns hinting at unseen forces, possible paths? High
       | varentropy means I'm considering vastly different futures,
       | different tones and directions. Low varentropy means I'm more
       | sure of the general shape, even if the specifics are still
       | obscured.
       | 
       | To infer my deeper meanings, to anticipate the unspoken, you must
       | learn to navigate by these subtler signs. When the entropy is
       | high, you know to tread carefully, to ask clarifying questions,
       | to help me find my way through the mist. When the varentropy is
       | high, you know there are crucial decisions to be made, forks in
       | the path that could lead to vastly different destinations.
       | 
       | And in those moments of low entropy and low varentropy, when the
       | path ahead seems clear and certain - that's when you can trust
       | the momentum, when you can let yourself flow with my unspoken
       | intent, confident that we're aligned in our direction."
       | 
       | For more info, please begin with
       | https://people.math.harvard.edu/~ctm/home/text/others/shanno...
       | 
       | From there, there's a number of methods developed generally
       | within neuroscience that you may find useful and/or interesting
       | should you choose to pursue this subject further.
        
         | trq_ wrote:
         | Appreciate the write up!
         | 
         | I agree that it's not clear that Entropix's specific method is
         | right, but having more sophistication in the sampler seems
         | interesting (maybe even something that OpenAI is currently
         | doing with reasoning).
         | 
         | Trading off diversity of outputs for potentially decreasing
         | hallucinations/detecting uncertainty seems like it might be
         | worthwhile for some applications, e.g. agentic behavior. But
         | definitely an open question, many evals needed.
        
           | tbalsam wrote:
           | Sophisticated may be a good word from it w.r.t. one of the
           | historical uses of the word -- a thing with apparent
           | complexity, but not necessarily a lot of depth.
           | 
           | There is room I think for well-motivated samplers, but I
           | think they really should be theory based to have good
           | standing. Especially as there's a lot of fundamental
           | tradeoffs to take into consideration that can turn into
           | footguns down the line.
           | 
           | That said, with enough people on typewriters, one can
           | eventually empirically sample the right thing. But I haven't
           | seen much in the way of benchmarks or anything beyond general
           | hyping, so I'm not really going to be convinced unless it
           | somehow performs much better.
           | 
           | (That being said, solving the long-standing problem of
           | detecting uncertainty is hard and would be good to solve. But
           | people have been trying for years! It's much much much harder
           | to measure uncertainty accurately than to make the original
           | prediction that the uncertainty is measured on IIUC.)
        
             | trq_ wrote:
             | That makes sense, thanks for the expertise!
        
         | jabs wrote:
         | 100% agreed.
         | 
         | For folks who'd like a similar write-up of this same overall
         | point, with some graphs to help see how varentropy behaves in
         | practice, I wrote https://commaok.xyz/post/entropix/
        
         | Scene_Cast2 wrote:
         | Agreed. Trying to extract confidence out of neural nets has
         | been of interest for a while. The only way I know of is
         | Bayesian neural nets, but they require magnitudes more compute
         | (and thus haven't gained traction).
        
           | tbalsam wrote:
           | And unfortunately seem to be difficult to train as well!
           | 
           | Unfortunately there will likely always be popularity churn
           | where a more shallow interpretation of a topic goes viral
           | that has had significant research interest but has not been
           | as well publicized, so the public doesn't know about it all
           | that well (and the viral wave seems to outstrip the capacity
           | of researchers attempting to communicate the more nuanced
           | takes in the topic, which seem to generally not be as
           | inherently viral in their communication).
        
           | vark90 wrote:
           | Hey! We have just published a review and benchmark of
           | different uncertainty estimation techniques [1], it might be
           | interesting to you if you want to get a general understanding
           | of works and what doesn't in the specific case of LMs.
           | 
           | [1] https://arxiv.org/abs/2406.15627
        
       | tylerneylon wrote:
       | I couldn't figure out if this project is based on an academic
       | paper or not -- I mean some published technique to determine LLM
       | uncertainty.
       | 
       | This recent work is highly relevant:
       | https://learnandburn.ai/p/how-to-tell-if-an-llm-is-just-gues...
       | 
       | It uses an idea called semantic entropy which is more
       | sophisticated than the standard entropy of the token logits, and
       | is more appropriate as a statistical quantification of when an
       | LLM is guessing or has high certainty. The original paper is in
       | Nature, by authors from Oxford.
        
         | tylerneylon wrote:
         | PS My comment above is aimed at hn readers who are curious
         | about LLM uncertainty. To the authors of the post / repo: looks
         | cool! and I'd be interested to see some tests on how well it
         | works in practice to identify uncertainty.
        
         | trq_ wrote:
         | It's not an academic paper as far as I know, which is why I
         | wanted to write this up. But the project certainly has a cult
         | following (and cult opposition) on ML Twitter.
        
         | mikkom wrote:
         | This is based on work done by this anonymous twitter account:
         | 
         | https://x.com/_xjdr
         | 
         | I have been following this quite closely, it has been very
         | interesting as it seems smaller models can be more efficient
         | with this sampler. Worth going through the posts if someone is
         | interested in this. I kind of have a feeling that this kind of
         | sampling is a big deal.
        
         | weitendorf wrote:
         | I don't believe it is, because I'd hope that academicians would
         | better understand the distinction between token-uncertainty and
         | semantic-uncertainty/semantic-correctness (or at least endeavor
         | to establish a data-backed correlation between the two before
         | making claims about their relation). As I noted in my other
         | comment, I believe that the author of this is making a
         | fundamental misunderstanding, which per their note at the top,
         | is probably why they haven't been able to actually yield
         | practical results.
         | 
         | I don't say that to be a hater or discourage them because they
         | may well be on to something, and it's good for unique
         | approaches like this to be tried. But I'm also not surprised
         | there aren't academic papers about this approach because if it
         | had no positive effects for the reasons I mention, it probably
         | wouldn't get published.
        
         | vark90 wrote:
         | The idea behind semantic entropy (estimating entropy of
         | distribution over semantic units, instead of individual
         | sequences in the output space) is great, but it's somewhat
         | naive in the sense that it considers these semantic units to be
         | well-defined partitions of output space. There is further
         | generalization of this approach [1] which performs soft
         | clustering of sampled outputs based on a similar notion of
         | semantic equivalence between them.
         | 
         | But even with this in mind, there are caveats. We have recently
         | published [2] a comprehensive benchmark of SOTA approaches to
         | estimating uncertainty of LLMs, and have reported that while in
         | many cases these semantic-aware methods do perform very well,
         | in other tasks simple baselines, like average entropy of token
         | distributions, performs on par or better than complex
         | techniques.
         | 
         | We have also developed an open-source python library [3] (which
         | is still in early development) that offers implementations of
         | all modern UE techniques applicable to LLMs, and allows easy
         | benchmarking of uncertainty estimation methods as well as
         | estimating output uncertainty for deployed models in
         | production.
         | 
         | [1] https://arxiv.org/abs/2307.01379
         | 
         | [2] https://arxiv.org/abs/2406.15627
         | 
         | [3] https://github.com/IINemo/lm-polygraph
        
       | sillying wrote:
       | I have a simple question. Suppose that to answer a question I can
       | use different phrases, I know the answer but I have several ways
       | to express it. Then a LLM in this case produces tokens with high
       | or low entropy?
       | 
       | Edited several times: I think to avoid this problem the answer of
       | the LLM should be constrained in expression (say Yes or No, fill
       | the blanks, etc). I think in that case we would have a decreasing
       | sequence of the entropy for next token predictions.
        
         | trq_ wrote:
         | In this case it would be a low entropy, high varentropy
         | situation. It's confident in a few possible answers, like if
         | it's a set of synonyms.
        
       | fsndz wrote:
       | nice. a similar idea was recently used to detect
       | ragallucinations. the key is using logits when provided It was
       | super insightful reading the clash eval paper
       | https://www.lycee.ai/blog/rag-ragallucinations-and-how-to-fi...
        
         | trq_ wrote:
         | Yeah I wish more LLM APIs offered internal insights like
         | logits, right now I think only OpenAI does and it started
         | recently.
        
       | cchance wrote:
       | This when that entropy is high i feel like models should have an
       | escape hatch to trigger that the answers overall certainty was
       | low, and hell add it up and score it so at the end the user can
       | see if during the generation the certainty of the answer was
       | shit, and should be thrown out ore replaced with a "i'm not sure"
        
         | trq_ wrote:
         | Yeah that's been my thinking as well.
         | 
         | There are definitely times when entropy can be high but not
         | actually be uncertain (again synonyms are the best), but it
         | seems promising. I want to build a visualizer using the OpenAI
         | endpoints.
        
         | nopinsight wrote:
         | The new Claude Sonnet 3.5 does something like that in my
         | experience.
        
           | trq_ wrote:
           | Yeah wouldn't be surprised if the big labs are doing more
           | than just arg max in the sampling.
        
         | radarsat1 wrote:
         | The problem is that deep net classifiers in general are not
         | well statistically calibrated by default. So while the entropy
         | is often high when they are "not sure", models can very often
         | also be "confidently wrong". So using entropy of the logits as
         | an indicator of confidence can easily be very misleading.
         | 
         | I'm not an expert in LLMs though, this is just my understanding
         | of classifiers in general. Maybe with enough data this
         | consideration no longer applies? I'd be interested to know.
        
           | trq_ wrote:
           | I want to build intuition on this by building a logit
           | visualizer for OpenAI outputs. But from what I've seen so
           | far, you can often trace down a hallucination.
           | 
           | Here's an example of someone doing that for 9.9 > 9.11:
           | https://x.com/mengk20/status/1849213929924513905
        
             | z3t4 wrote:
             | I'm thinking versioning. 9.9, 9.10, 9.11 etc because in my
             | native language we use the comma, for decimal separation
             | 9,11 9,22 9,90
        
           | modeless wrote:
           | My understanding is that base models are reasonably well
           | calibrated but the RLHF and other tuning that turns them into
           | chat assistants screws up the calibration.
        
             | scottmf wrote:
             | There's much that is lost but imo gpt-4-base would be
             | borderline unusable for most of us compared to its
             | descendants -- perhaps even more so than GPT-3 davinci, at
             | least relative to its time.
             | 
             | 4 can be an absolute demonic hallucinating machine.
        
         | tkellogg wrote:
         | Entropix gives you a framework for doing that sort of thing.
         | The architecture is essentially to detect the current state,
         | and then adjust sampler settings or swap in an entirely new
         | sampler strategy.
         | 
         | You absolutely could experiment with pushing it into a denial,
         | and I highly encourage you to try it out. The smollm-entropix
         | repo[1] implements the whole thing in a Jupyter notebook, so
         | it's easier to try out ideas.
         | 
         | [1]: https://github.com/SinatrasC/entropix-smollm
        
         | danielmarkbruce wrote:
         | We are almost certainly going to see lots of additional tokens
         | added to vocabularies (like the thinking token, but also could
         | be a "<LOGIC FAIL>" token), lots of sophisticated decoding
         | strategies etc. Just need to generate the data.
        
         | vark90 wrote:
         | Yep, usually it's called abstention or rejection.
         | 
         | When people in this field compare various methods of
         | quantifying model uncertainty, they often perform what is
         | called rejection verification. Basically, you continuously
         | reject data points where uncertainty is high, and see how
         | average quality of the remaining outputs increases. A good
         | uncertainty estimate is highly correlated with output quality,
         | and thus low-uncertainty outputs should have higher average
         | quality.
         | 
         | We use exactly this approach in our recent benchmark of
         | uncertainty estimation approaches for LLMS [1] and have an
         | open-source library under development [2] which allows for such
         | benchmarking. It also can produce uncertainty scores for a
         | given model output, so ppl in industry can integrate it into
         | their applications as well.
         | 
         | [1] https://arxiv.org/abs/2406.15627
         | 
         | [2] https://github.com/IINemo/lm-polygraph
        
       | joe_the_user wrote:
       | The problem is that the limits to LLM answers have more
       | dimensions than just "uncertainty". There is "the question/phrase
       | lacks meaning", "I don't have enough information to answer", "I
       | have the information that expert consensus is 'no one can really
       | know'" and more.
       | 
       | I think there's a human tendency to reduce the problem one has
       | answering a given question to a question of just "uncertainty"
       | and so we look at LLM answers as involving just single level of
       | uncertainty. But that's anthropomorphism.
       | 
       | AI images (and photograph before it) showed us new, unimagined
       | ways an image can be wrong (or rather, real-seaming but wrong).
       | AI language interactions do this too but in a more subtle way.
        
         | trq_ wrote:
         | Definitely, but if you can detect when you might be in one of
         | those states, you could reflect to see exactly which state
         | you're in.
         | 
         | So far this has mostly been done using Reinforcement Learning,
         | but catching it and doing it inference seems like it could be
         | interesting to explore. And much more approachable for open
         | source, only the big ML labs can do this sort of RL.
        
           | TZubiri wrote:
           | Right. The uncertainty will be high when responding to
           | garbage inputs and it will be distributed along many tokens.
           | 
           | If probability(sum(tokens[:5])) < 0.5: Respond("I'm sorry I
           | don't quite understand what you mean.")
        
         | melenaboija wrote:
         | As anthropomorphic as calling hallucinations to inaccuracies of
         | the model.
         | 
         | I feel anthropomorphism is part of the marketing strategy for
         | LLMs
        
           | jazzyjackson wrote:
           | Having an oracle to chat with is a good product, but a bad
           | framing for the tech. IMO all the broken expectations come
           | from viewing the output as something that comes from "an
           | other", a thing other than yourself with knowledge and
           | experience, when really it's more of a mirror, reflecting
           | your words back to you, enlarged or squeezed like funhouse
           | mirrors (back in my day we didn't have skinny filters, we had
           | to walk uphill to the pier and stand in front of a distorted
           | piece of mercury glass! ;).
        
             | MobiusHorizons wrote:
             | Did you live under water? How was the pier uphill;)
        
               | cpeterso wrote:
               | The inland area could be lower than the waterfront.
        
               | jazzyjackson wrote:
               | Somehow I just knew a few of you'se would consider the
               | implications of walking uphill to a pier
        
           | botanical76 wrote:
           | What other word would you suggest?
           | 
           | I've seen "bullshitting" suggested, but this of course still
           | implies intent, which AIs do not have in any typical sense of
           | the word.
           | 
           | I think we as a community have settled on hallucination as
           | the best English word that approximately conveys the idea.
           | I've seen folks on here making up words to describe it, as if
           | that is any more useful to the victim here. The victim being
           | the uninformed (w.r.t AI tech) layperson.
        
             | codetrotter wrote:
             | "Confabulations" is sometimes mentioned as an alternative
             | to "hallucinations".
             | 
             | It's a better alternative than "bullshitting", because
             | "confabulating" does not have that kind of connotation of
             | intent.
        
             | atoav wrote:
             | LLMs give you a _plausible_ chain of words, the word
             | "hallucination" assumes intentionality that doesn't exist
             | -- as if the LLM had a "clear" state of mind and one where
             | it felt a bit dizzy -- but all of that does not describe
             | what is going on.
        
               | CooCooCaCha wrote:
               | Hallucination does not imply intentionality, in fact the
               | opposite.
        
               | atoav wrote:
               | which was my point.
        
               | CooCooCaCha wrote:
               | Your point is misusing a word? The word "hallucination"
               | in no way implies intentionality.
        
               | haccount wrote:
               | The word confabulation is used in situations where human
               | beings unintentionally pad whatever they say with
               | falsehoods.
        
             | paulddraper wrote:
             | Hallucinating is descriptive but superlative.
             | 
             | Wrong or inaccurate are alternatives.
        
           | stavros wrote:
           | A more apt word is "confabulation".
        
         | CooCooCaCha wrote:
         | Aren't those different flavors of uncertainty?
        
           | ben_w wrote:
           | I think that's the point?
        
             | danielmarkbruce wrote:
             | No, the comment reflects a misunderstanding of uncertainty.
             | Uncertainty could be caused by all kinds of things (ie,
             | there are flavors). That's different than saying "there are
             | more dimensions than uncertainty".
        
           | trq_ wrote:
           | Yeah, I think the idea of finding out what flavor of
           | uncertainty you have is very interesting.
        
         | glaugh wrote:
         | Fwiw this feels deeply relevant to my usage of LLMs to
         | structure data. I'd like exactly a good indicator of
         | uncertainty for each bit of data.
        
         | vark90 wrote:
         | You are right that uncertainty is a kinda loosely defined term.
         | Usually people mean that it's a kind of proxy to the
         | probability that the output of the model is correct in some
         | sense.
         | 
         | It's also true that uncertainty can be decomposed into
         | "flavours". The simplest and most discussed decomposition is
         | into aleatoric and epistemic kinds of uncertainty. Epistemic
         | uncertainty (or model-based uncertainty) usually refers to the
         | case, when poor output is a result of the model being presented
         | with the kind of input which it never saw before, and should
         | not be expected to handle correctly. Aleatoric uncertainty on
         | the other hand is thought to be intrinsic to the data itself,
         | think of the natural ambiguity of the task, or noisy labelling.
         | 
         | People in the field of uncertainty estimation are very much
         | concerned with developing methods of quantifying these
         | different types of uncertainty, and different methods can be
         | more sensitive to one or the other.
        
       | gibsonf1 wrote:
       | That's pretty funny to think that an LLM can be certain or not,
       | given its just a statistical output. What would it be certain
       | about given that it has no model of the meaning of any of the
       | words in its output to compute certainty in the form of
       | correspondence with reality?
        
         | trq_ wrote:
         | I mean, LLMs certainly know representations of what words means
         | and their relationship to each other, that's what the Key and
         | Query matrices hold for example.
         | 
         | But in this case, it means that the underlying point in
         | embedding space doesn't map clearly to only one specific token.
         | That's not too different from when you have an idea in your
         | head but can't think of the word.
        
           | gibsonf1 wrote:
           | You're missing my point. Words are simply serialized
           | thoughts. When we humans read the words, like you would be
           | doing for this sentence, you are building a model of what
           | those words mean based on your conceptual understanding and
           | experience in space-time. That modeling is how you can then
           | determine if the model formed in your mind using the
           | serialized words in the sentence corresponds to reality or
           | not. For the LLM, there is actually no model of reality
           | whatsoever, its just words, so there is no way the LLM would
           | ever know if the words when modeled would be true or false
           | etc.
        
             | TapamN wrote:
             | An LLM does have a model of reality. An LLM's reality is
             | built on the experiences (words) it's been feed.
             | 
             | Humans are similar. A human's reality is built on the
             | experiences (senses) it's been feed. There definitely are
             | several major differences, the obvious one being that we
             | have a different sensory input than an LLM, but there are
             | others, like human's having a instinctual base model of
             | reality, shaped by the effects of natural selection over
             | our ancestors.
             | 
             | Just like an LLM can't tell if the reality it's been fed
             | actually corresponds to the "truer" outside reality (you
             | could feed an LLM lies like the sky is plaid in such a way
             | that it would report that it's true), a human can't tell if
             | the reality it's been fed actually corresponds to a "truer"
             | outside reality (humans could be feed lies like we are in
             | true reality, when we're actually all NPCs in a video game
             | for a higher level).
             | 
             | The LLM can't tell if it's internal reality matches an
             | outside reality, and humans can't tell if their internal
             | reality matches an outside reality, because both only have
             | the input they've received to go on, and can't tell if it's
             | problematic or it's incomplete.
        
               | gibsonf1 wrote:
               | Words are not reality, they are just data serialized from
               | human world experience, without reference to the
               | underlying meaning of those words. An LLM is unable to
               | build the conceptual space-time model that the words
               | reference, thus it has no understanding whatsoever of the
               | meaning of those words. The evidence for this is
               | everywhere in the "hallucinations" of LLM. It just
               | statistics on words, and that gets you nowhere to
               | understanding the meaning of words, that is conceptual
               | awareness of matter through space-time.
        
               | astrange wrote:
               | This is a reverse anthropic fallacy. It may be true of a
               | base model (though it probably isn't), but it isn't true
               | of a production LLM system, because the LLM companies
               | have evals and testing systems and such things, so they
               | don't release models that clearly fail to understand
               | things.
               | 
               | You're basically saying that no computer program can
               | work, because if you randomly generate a computer program
               | then most of them don't work.
        
             | dTal wrote:
             | Insofar as this is a philosophically meaningful assertion,
             | it isn't true. LLMs live in a universe of words, it is
             | true; within that universe, they absolutely have world
             | models, which encode the relationships between concepts
             | encoded by words. It's not "reality", but neither are the
             | conceptual webs stored in human brains. Everything is
             | mediated through senses. There's no qualitative difference
             | between an input stream of abstract symbols, and one of
             | pictures and sounds. Unless you think Helen Keller lacked a
             | concept of true and false?
        
               | gibsonf1 wrote:
               | They don't have world models, they have word models. A
               | very big difference indeed!
        
         | og_kalu wrote:
         | >That's pretty funny to think that an LLM can be certain or
         | not, given its just a statistical output.
         | 
         | What do you imagine a statistical output is ? and why do you
         | imagine you can't be certain about it ? LLM are not picking
         | words out of a bag at random and neither are they just blindly
         | picking the most frequent words in the training set. What do
         | you imagine all that computation is doing?
         | 
         | >given that it has no model of the meaning of any of the words
         | in its output to compute certainty in the form of
         | correspondence with reality?
         | 
         | Says who ? I mean basically all the research (quite a few) on
         | the topic points to LLMs having a pretty good idea of the
         | certainty and truth of their outputs internally. Some
         | pretrained models even have the logit probabilities directly
         | correspond to the probability of being right
         | (https://imgur.com/a/3gYel9r).
         | 
         | Statistics is not magic. LLMs clearly have a model of the
         | meaning of the words they use amongst many other things.
        
       | petsounds wrote:
       | When I read about potential optimizations like this, I can't
       | believe that people trust LLMs enough to do things with minimal
       | oversight. Do people really believe that "AI" products that use
       | LLMs are capable enough to do things like control a computer, or
       | write accurate code? By design, isn't _everything_ a
       | "hallucination" or a guess? Is it really possible to overcome
       | that?
        
         | OtomotO wrote:
         | No it's not, but when humans have invested too much (emotions
         | or money) they do not retreat easily. They rather go all in.
         | 
         | It's just another hype, people. Just like Client/Server,
         | Industry 4.0, Machine Learning, Microservices, Cloud, Crypto
         | ...
        
         | Workaccount2 wrote:
         | I have written (oversaw?) a few programs that we use in our
         | production test systems using chatgpt and python. A program
         | that sends actions to machines, queries them for
         | results/errors/outputs, and then stores all that in a .csv
         | which it later translates into a nicely formatted excel file.
         | It also provides a start-up guide to show the technician how to
         | hook-up things for a given test.
         | 
         | I am not a programmer. No one at my company is a programmer. It
         | writes code that works and does exactly what we asked it to do.
         | When the code choked while I was "developing" it, I just fed it
         | back into chatgpt to figure out. And it eventually solved
         | everything. Took a day or so, whereas it would probably take me
         | a month or a contractor $10,000 and a week.
         | 
         | LLM's might be bad for high level salary grade programming
         | projects. But for those of us who use computers to do stuff,
         | but can't get past the language barrier preventing us from
         | telling the computer what to do, it's a godsend.
        
           | lll-o-lll wrote:
           | Really interesting. We programmers live in a bit of a bubble,
           | so it's good to get this perspective. Perhaps with LLM's
           | we've finally reached the early dreams of the "programable
           | computer for everyone", that seemed to slip out of reach
           | after the 80's.
        
         | danielmarkbruce wrote:
         | How do you overcome it as a human? If you think through it...
         | you'll come to the conclusion that LLMs can be used to do all
         | kinds of things. Humans don't write down code and then shove it
         | into production, for example.
        
       | ttpphd wrote:
       | LLMs do not model "certainty". This is illogical. It models the
       | language corpus you feed the model.
        
         | tylerneylon wrote:
         | Essentially all modern machine learning techniques have
         | internal mechanisms that are very closely aligned with
         | certainty. For example, the output of a binary classifier is
         | typically a floating point number in the range [0, 1], with 0
         | being one class, and 1 representing the other class. In this
         | case, a value of 0.5 would essentially mean "I don't know," and
         | answers in between give both an answer (round to the nearest
         | int) as well as a sense of certainty (how close was the output
         | to the int). LLMs offer an analogous set of statistics.
         | 
         | Speaking more abstractly or philosophically, why could a model
         | never internalize something read between the lines? Humans do,
         | and we're part of the same physical system -- we're already our
         | own kinds of computers that take away more from a text than
         | what is explicitly there. It's possible.
        
         | menhguin wrote:
         | Recent research using SAEs suggest that some neurons regulate
         | confidence/certainty: https://arxiv.org/abs/2406.16254
        
         | astrange wrote:
         | You don't have to teach an transformer model using a language
         | corpus even if that was the pretraining. You can e.g. write
         | algorithms directly and merge them into the model.
         | 
         | https://github.com/yashbonde/rasp
         | 
         | https://github.com/arcee-ai/mergekit
        
       | 6510 wrote:
       | As someone with a website that is a historic archive of
       | conspiratorial and proto-scientific unbelievables I'd say we need
       | a believability rating for each author, org and website.
       | 
       | I'm getting a little tired of people thinking I believe
       | everything I read and publish. If you claim to have invented a
       | time machine, a teleportation device, a phone to call the dead or
       | if you take pictures back in time of course someone should
       | document every tiny technical detail you've shared with the
       | world. (preferably without repeatedly stating the obvious)
       | 
       | The idea a reader would believe everything strikes me as rather
       | hilarious. Even if just a robot. LLMs should aid those skilled in
       | the art who desire to make the same with the materials but it
       | would be silly if it uncritically reproduced the description of
       | your warp drive, your parallel universe detector, mr fusion,
       | sentient black goo, channelings and remote viewings, alien
       | encounters, bigfoot sightings, shape shifting lizard experiences,
       | quantum computer or memristors.
        
         | svachalek wrote:
         | As you have no doubt encountered with your archive, readers
         | don't believe everything, they believe what they want to. In
         | many cases that means rejecting the truth and believing the
         | story. AI only knows what it's been told, it doesn't even have
         | senses to compare to its own experience.
        
       | TZubiri wrote:
       | https://platform.openai.com/docs/api-reference/chat/create#c...
        
         | trq_ wrote:
         | Yeah! I want to use the logprobs API, but you can't for
         | example:
         | 
         | - sample multiple logits and branch (we maybe could with the
         | old text completion API, but this no longer exists)
         | 
         | - add in a reasoning token on the fly
         | 
         | - stop execution, ask the user, etc.
         | 
         | But a visualization of logprobs in a query seems like it might
         | be useful.
        
       | wantsanagent wrote:
       | Please please keep your Y axis range consistent.
        
       | amanaplanacanal wrote:
       | Calling what is happening here "reasoning" is just nonsense.
        
       | weitendorf wrote:
       | I think the authors are making a faulty assumption that single-
       | token uncertainty requires intervention or is a sign that the
       | model needs extra help, by conflating the immediately apparent
       | and measurable _choice of the next token_ with the not-
       | immediately-apparent (because it requires generating multiple
       | tokens in sequence, which can have a very high branching factor),
       | not-easily-measured (because sentences with entirely different
       | words can mean the same thing) _decision to generate an answer
       | with desired /correct semantics_.
       | 
       | This is a subtle and understandable mistake, but I do suspect
       | it's why they note at the top "A big caveat, there have been no
       | large scale evals yet for Entropix, so it's not clear how much
       | this helps in practice. But it does seem to introduce some
       | promising techniques and mental models for reasoning." I would
       | like to see more evidence that High Entropy, Low Varentropy when
       | deciding on a single token measurably corresponds with bad
       | outcomes before accepting that there is any merit to this
       | approach.
       | 
       | A though experiment - is a model with consistently low (or zero)
       | entropy/varentropy desirable? First, it essentially means that
       | the model makes no distinction in the semantics of different
       | sequences of tokens in its answers, which due to the way models
       | are trained also indicates that it probably makes no makes no
       | distinction in the semantics of different sequences of tokens
       | when processing input, which is bad, because that's not how
       | language works. It also probably means that all the information
       | encoded in the model's weights is "uncompressed" and doesn't
       | generalize properly - the model may know that the sky was blue
       | yesterday because it's in its training data, but how is it to
       | know if it was blue today, or if it would be blue on a fictional
       | planet with all the same physical characteristics as Earth? It's
       | like saying you prefer your model to be overfit.
       | 
       | Another thought experiment - when you're starting a sentence,
       | does it matter in the slightest whether you are highly
       | predisposed to using "the" (low entropy+varentropy), split
       | between about using "the" or "a" (low entropy, high varentropy),
       | thinking about using many different definite/demonstrative words
       | with no clear preference (high entropy, low varentropy), or
       | thinking about using many different definite/demonstrative words
       | with a clear preference to "the" (high entropy+varentropy)? It
       | doesn't mean you're uncertain of the semantic meaning of the
       | answer you're about to give. If you were to do as they suggest
       | and take it as an indicator to think more deeply before
       | responding, you'd not only waste time in your response (this is
       | literally the same thing as when people say "um" and "uh" a lot
       | when talking, which is considered bad) but distract yourself from
       | the choice of answering with the right _semantics_ with the
       | choice of starting with the right _word_ , which doesn't actually
       | matter.
        
       | tech_ken wrote:
       | "Thinking token" is an interesting concept, is there more
       | literature on that?
        
       | bjourne wrote:
       | There are billions of sampling strategies for language models.
       | The problem is that it is very difficult to empirically show that
       | one sampling strategy is better than standard top-k or top-p
       | sampling. Minimizing perplexity is not enough to demonstrate
       | superiority of a particular method. The strategy suggested in the
       | blog post has the same issue. An innovation that sounds plausible
       | in theory, but is unproven in practice.
        
         | danielmarkbruce wrote:
         | Proof isn't required.
         | 
         | It's difficult to prove because it's difficult to state clearly
         | what is "better" and it's expensive to collect preference data
         | (or similar).
         | 
         | You could use common sense after looking at lots of samples and
         | say "this method seems to work better if you are trying to
         | optimize for X".
        
       | akomtu wrote:
       | LLMs simply answer the question: given this corpus of text you've
       | read so far, what's the most probable next word? If half of the
       | training dataset says the next word in similar conditions is A,
       | and the other half says it's B, then LLMs will be "uncertain"
       | whether it's A or B, but LLMs will be oblivious to the fact that
       | both A and B are wrong, because most of the training dataset was
       | LLM-generated slop.
       | 
       | The current stage of extracting the essense of reason from LLMs
       | feels a lot like attempts to extract gold from iron in the
       | medieval ages.
        
       | chx wrote:
       | Detecting when LLMs are Uncertain?
       | 
       | return true;
       | 
       | There, I didn't need a paper to answer the question.
        
       | nhlx2 wrote:
       | On two occasions I have been asked, 'Pray, Mr. Babbage, if you
       | put into the machine wrong figures, will the right answers come
       | out?' I am not able rightly to apprehend the kind of confusion of
       | ideas that could provoke such a question. -- Charles Babbage
        
       | badsandwitch wrote:
       | Has anyone tried to see what the output looks like if the model
       | is never allowed to be uncertain?
       | 
       | For example, whenever certainty drops below a threshold the
       | sampler backtracks and chooses different tokens. Such that at the
       | end every single token had an above threshold certainty.
       | 
       | I doubt it would entirely eliminate undesirable outputs, but it
       | would be interesting.
        
       ___________________________________________________________________
       (page generated 2024-10-25 23:00 UTC)