[HN Gopher] CriticGPT: Finding GPT-4's mistakes with GPT-4
___________________________________________________________________
CriticGPT: Finding GPT-4's mistakes with GPT-4
Author : davidbarker
Score : 148 points
Date : 2024-06-27 17:02 UTC (5 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| mkmk wrote:
| It seems more and more that the solution to AI's quality problems
| is... more AI.
|
| Does Anthropic do something like this as well, or is there
| another reason Claude Sonnet 3.5 is so much better at coding than
| GPT-4o?
| GaggiX wrote:
| >or is there another reason Claude Sonnet 3.5 is so much better
| at coding than GPT-4o?
|
| It's impossible to say because these models are proprietary.
| mkmk wrote:
| Isn't the very article we're commenting on an indication that
| you can form a basic opinion on what makes one proprietary
| model different from another?
| GaggiX wrote:
| Not really, we know absolutely nothing about Claude 3.5
| Sonnet, except that it's an LLM.
| ru552 wrote:
| Anthropic has attributed Sonnet 3.5's model improvement to
| better training data.
|
| "Which data specifically? Gerstenhaber wouldn't disclose, but
| he implied that Claude 3.5 Sonnet draws much of its strength
| from these training sets."[0]
|
| [0]https://techcrunch.com/2024/06/20/anthropic-claims-its-
| lates...
| jasonjmcghee wrote:
| My guess, which could be completely wrong, is that Anthropic spent
| more resources on interpretability and it's paying off.
|
| I remember when I first started using activation maps when
| building image classification models and it was like what on
| earth was I doing before this... just blindly trusting the
| loss.
|
| How do you discover biases and issues with training data
| without interpretability?
| Kiro wrote:
| Is it really that much better? I'm really happy with GPT-4o's
| coding capabilities and very seldom experience problems with
| hallucinations or incorrect responses, so I'm intrigued by how
| much better it can actually be.
| p1esk wrote:
| In my experience Sonnet 3.5 is about the same as 4o for coding.
| Sometimes one provides a better solution, sometimes the other.
| Both are pretty good.
| surfingdino wrote:
| > It seems more and more that the solution to AI's quality
| problems is... more AI.
|
| This reminds me of the passage found in the description of the
| fuckitpy module:
|
| "This module is like violence: if it doesn't work, you just
| need more of it."
| soloist11 wrote:
| How do they know the critic did not make a mistake? Do they have
| a critic for the critic?
| OlleTO wrote:
| It's critics all the way down
| azulster wrote:
| it's literally just the oracle problem all over again
| esafak wrote:
| It's called iteration. Humans do the same thing.
| soloist11 wrote:
| Are you sure it's not called recursion?
| citizen_friend wrote:
| It's not a human, and we shouldn't assume it will have traits
| we do without evidence.
|
| Iteration also is when your brain meets the external world
| and corrects. This is a closed system.
| GaggiX wrote:
| It's written in the article: the critic makes mistakes, but
| it's better than not having it.
| soloist11 wrote:
| How do they know it's better? The rate of mistakes is the
| same for both GPTs so now they have 2 sources of errors. If
| the error rate was lower for one then they could always apply
| it and reduce the error rate of the other. They're just
| shuffling the deck chairs and hoping the boat with a hole
| goes a slightly longer distance before disappearing
| completely underwater.
| GaggiX wrote:
| >How do they know it's better?
|
| Probably just evaluation on benchmarks.
| yorwba wrote:
| Whether adding unreliable components increases the overall
| reliability of a system depends on whether the system
| requires _all_ components to work (in which case adding
| components can only make matters worse) or only _some_ (in
| which case adding components can improve redundancy and
| make it more likely that the final result is correct).
|
| In the particular case of spotting mistakes made by
| ChatGPT, a mistake is spotted if it is spotted by the human
| reviewer _or_ by the critic, so even a critic that makes
| many mistakes itself can still increase the number of
| spotted errors. (But it might decrease the spotting rate
| per unit time, so there are still trade-offs to be made.)
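|
| (A back-of-the-envelope sketch of that OR-composition, with made-up
| detection rates purely for illustration — none of these numbers come
| from the paper:)
|
|     # Assumed, independent detection rates for a single bug.
|     p_human = 0.60   # hypothetical human reviewer
|     p_critic = 0.45  # hypothetical critic model
|
|     # "Some" composition: the bug is caught if either reviewer flags it.
|     p_either = 1 - (1 - p_human) * (1 - p_critic)   # 0.78
|
|     # "All" composition: a pipeline that needs both to be right.
|     p_both = p_human * p_critic                      # 0.27
|
|     print(f"caught by human OR critic: {p_either:.2f}")
|     print(f"needs human AND critic:    {p_both:.2f}")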
| soloist11 wrote:
| I see what you're saying so what OpenAI will do next is
| create an army of GPT critics and then run them all in
| parallel to take some kind of quorum vote on correctness.
| I guess it should work in theory if the error rate is
| small enough and adding more critics actually reduces the
| error rate. My guess is that in practice they'll converge
| to the population average rate of error and then pat
| themselves on the back for a job well done.
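|
| (A toy calculation of that quorum idea — it only helps if each critic
| is better than a coin flip and their errors are independent, both of
| which are assumptions here, not something the article claims:)
|
|     from math import comb
|
|     def majority_correct(p: float, n: int) -> float:
|         """P(majority of n independent critics, each right with
|         probability p, votes for the correct answer)."""
|         need = n // 2 + 1
|         return sum(comb(n, k) * p**k * (1 - p)**(n - k)
|                    for k in range(need, n + 1))
|
|     print(majority_correct(0.60, 1))   # 0.60
|     print(majority_correct(0.60, 25))  # ~0.85: more critics help
|     print(majority_correct(0.45, 25))  # ~0.31: more critics hurt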
| svachalek wrote:
| That description is remarkably apt for almost every
| business meeting I've ever been in.
| jsheard wrote:
| Per the article, the critic for the critic is human RLHF
| trainers. More specifically those humans are exploited third
| world workers making between $1.32 and $2 an hour, but OpenAI
| would rather you didn't know about that.
|
| https://time.com/6247678/openai-chatgpt-kenya-workers/
| soloist11 wrote:
| Every leap of civilization was built off the back of a
| disposable workforce. - Niander Wallace
| wmeredith wrote:
| He was the bad guy, right?
| IncreasePosts wrote:
| That is more than the average entry level position in Kenya.
| The work is probably also much easier (physically, that is).
| golergka wrote:
| Exploited? Are you saying that these employees are forced to
| work for below market rates, and would be better off with
| other opportunities available to them? If that's the case,
| it's truly horrible on OpenAI's part.
| nmca wrote:
| A critic for the critic would be "Recursive Reward Modelling",
| an exciting idea that has not been made to work in the real
| world yet.
| soloist11 wrote:
| Most of my ideas are not original but where can I learn more
| about this recursive reward modeling problem?
| nmca wrote:
| https://arxiv.org/abs/1811.07871
| finger wrote:
| There is already a mistake. It refers to a function by the
| wrong name: os.path.comonpath instead of os.path.commonpath.
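|
| (For reference, a quick check of the real function in the standard
| library's os.path:)
|
|     import os.path
|     print(os.path.commonpath(["/usr/lib", "/usr/local/lib"]))  # /usr
|     # os.path.comonpath does not exist and raises AttributeError.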
| soloist11 wrote:
| In the critical limit every GPT critic chain is essentially a
| spellchecker.
| ertgbnm wrote:
| That's the human's job for now.
|
| A human reviewer might have trouble catching a mistake, but
| they are generally pretty good at discerning whether a report
| about a mistake is valid or not. For example, finding a bug in a
| codebase is hard. But if a junior sends you a code snippet and
| says "I think this is a bug for xyz reason", do you agree? It's
| much easier to confidently say yes or no. So basically it
| changes the problem from finding a needle in a haystack to
| discerning if a statement is a hallucination or not.
| sdenton4 wrote:
| Looks like the hallucination rate doesn't improve significantly,
| but I suppose it's still a win if it helps humans review things
| faster? Though I could imagine reliance on the tool leading to
| missing less obvious problems.
| tombert wrote:
| While it's a net good, it would kind of kill one of the most
| valuable parts of ChatGPT for me, which is critiquing its
| output myself.
|
| If I ask it a question, I try not to trust it immediately, and
| I independently look the answer up and I argue with it. In
| turn, it actually is one of my favorite learning tools, because
| it kind of forces me to figure out _why_ it's wrong and
| explain it.
| goostavos wrote:
| Unexpectedly, I kind of agree. I've found GPT to be a great
| tutor for things I'm trying to learn. It being somewhat
| unreliable / prone to confidently lying embeds a certain
| amount of useful skepticism and questioning of all the
| information, which in turn leads to an overall better
| understanding.
|
| Fighting with the AI's wrongness out of spite is an
| unexpectedly good motivator.
| ExtremisAndy wrote:
| Wow, I've never thought about that, but you're right! It
| really has trained me to be skeptical of what I'm being
| taught and confirm the veracity of it with multiple
| sources. A bit time-consuming, of course, but generally a
| good way to go about educating yourself!
| tombert wrote:
| I genuinely think that arguing with it has been almost a
| secret weapon for me with my grad school work. I'll ask
| it a question about temporal logic or something, it'll
| say something that sounds accurate but is ultimately
| wrong or misleading after looking through traditional
| documentation, and I can fight with it, and see if it
| refines it to something correct, which I can then check
| again, etc. I keep doing this for a bunch of iterations
| and I end up with a pretty good understanding of the
| topic.
|
| I guess at some level this is almost what "prompt
| engineering" is (though I really hate that term), but I
| use it as a learning tool and I do think it's been really
| good at helping me cement concepts in my brain.
| ramenbytes wrote:
| > I'll ask it a question about temporal logic or
| something, it'll say something that sounds accurate but
| is ultimately wrong or misleading after looking through
| traditional documentation, and I can fight with it, and
| see if it refines it to something correct, which I can
| then check again, etc. I keep doing this for a bunch of
| iterations and I end up with a pretty good understanding
| of the topic.
|
| Interesting, that's the basic process I follow myself
| when learning without ChatGPT. Comparing my mental
| representation of the thing I'm learning to existing
| literature/results, finding the disconnects between the
| two, reworking my understanding, wash rinse repeat.
| tombert wrote:
| I guess a large part of it is just kind of the "rubber
| duck" thing. My thoughts can be pretty disorganized and
| hard to follow until I'm forced to articulate them.
| Finding out why ChatGPT is wrong is useful because it's a
| rubber duck that I can interrogate, not just talk to.
|
| It can be hard for me to directly figure out when my
| mental model is wrong on something. I'm sure it happens
| all the time, but a lot of the time I will think I know
| something until I feel compelled to prove it to someone,
| and I'll often find out that _I'm_ wrong.
|
| That's actually happened a bunch of times with ChatGPT,
| where I think it's wrong until I actually interrogate it,
| look up a credible source, and realize that my
| understanding was incorrect.
| posix86 wrote:
| Reminds me of a prof at uni, whose slides always appeared to
| have been written 5 mins before the lecture started,
| resulting in students pointing out mistakes in every other
| slide. He defended himself saying that you learn more if
| you aren't sure whether things are correct - which was
| right. Esp. during a lecture, it's sometimes not that easy
| to figure out if you truly understood something or fooled
| yourself, knowing that what you're looking at is provably
| right. If you know everything can be wrong, you trick your
| mind to verify it at a deeper level, and thus gain more
| understanding. It also results in a culture where you're
| allowed to question the prof. It resulted in many healthy
| arguments with the prof why something is the way it is,
| often resulting with him agreeing that his slides are
| wrong. He never corrected the underlying PPP.
| tombert wrote:
| I thought about doing that when I was doing adjunct last
| year, but what made me stop was the fact that these were
| introductory classes, so I was afraid I might pollute the
| minds of students who really haven't learned enough to
| question stuff yet.
| tombert wrote:
| Yeah, and what I like is that I can get it to say things in
| "dumb language" instead of a bunch of scary math terms.
| It'll be confidently wrong, but in language that I can
| easily understand, forcing me to look things up, and
| kind of forcing me to learn the proper terminology and
| actually understand it.
|
| Arcane language is actually kind of a pet peeve of mine in
| theoretical CS and mathematics. Sometimes it feels like
| academics really obfuscate relatively simple concepts by
| using a bunch of weird math terms. I don't think it's
| malicious, I just think that there's value in having more
| approachable language and metaphors in the process of
| explaining things.
| empath75 wrote:
| I actually learn a lot from arguing with not just AIs but
| people and it doesn't really matter if they're wrong or
| right. If they're right, it's an obvious learning
| experience for me, if they're wrong, it forced me to
| explain and understand _why_ they're wrong.
| tombert wrote:
| I completely agree with that, but the problem is finding
| a supply of people to argue with on niche subjects. I
| have occasionally argued with people on the Haskell IRC
| and the NixOS Matrix server about some stuff, but
| they're humans who selfishly have their own lives to live,
| so I can't argue with them infinitely, and since the
| topics I argue about are specific there just don't exist
| a lot of people I can argue with even in the best of
| times.
|
| ChatGPT (Gemini/Anthropic/etc) have the advantage of
| never getting sick of arguing with me. I can go back and
| forth and argue about any weird topic that I want for as
| long as I want at any time of day and keep learning until
| I'm bored of it.
|
| Obviously it depends on the person but I really like it.
| mistermann wrote:
| Arguing is arguably one of humanity's super powers, and
| that we've yet to bring it to bear in any serious way
| gives me reason for optimism about sorting out the
| various major problems we've foolishly gotten ourselves
| into.
| ramenbytes wrote:
| > I completely agree with that, but the problem is
| finding a supply of people to argue with on niche
| subjects.
|
| Beyond just subject-wise, finding people who argue in
| good faith seems to be an issue too. There are people I'm
| friends with almost specifically because we're able to
| consistently have good-faith arguments about our strongly
| opposing views. It doesn't seem to be a common skill, but
| perhaps that has something to do with my sample set or my
| own behaviors in arguments.
| tombert wrote:
| I dunno, for more niche computer science or math
| subjects, I don't feel like people argue in bad faith
| most of the time. The people I've argued with on the
| Haskell IRC years ago genuinely believe in what they're
| saying, even if I don't agree with them (I have a lot of
| negative opinions on Haskell as a language).
|
| Politically? Yeah, nearly impossible to find anyone who
| argues in good faith.
| julienchastang wrote:
| Very good comment. In order to effectively use LLMs (I use
| ChatGPT4 and 4o), you have to be skeptical of them and being
| a good AI skeptic takes practice. Here is another technique
| I've learned along the way: When you have it generate text
| for some report you are writing, or something, after your
| initial moment of being dazzled (at least for me), resist the
| temptation to copy/paste. Instead, "manually" rewrite the
| verbiage. You then realize there is a substantial amount of
| BS that can be excised. Nevertheless, it is a huge time saver
| and can be good at ideation, as well.
| tombert wrote:
| Yeah I used it last year to generate homework assignments,
| and it would give me the results in Pandoc compatible
| markdown. It was initially magic, but some of the problems
| didn't actually make sense and might actually be
| unsolvable, so I would have to go through it line by line
| and then ask it to regenerate it [1].
|
| Even with that, it took a process that had taken multiple
| hours before down to about 30-45 minutes. It was super
| cool.
|
| [1] Just to be clear, I always did the homework assignments
| myself beforehand to make sure that a solution was solvable
| and fair before I assigned it.
| foobiekr wrote:
| The lazy version of that, which I recommend, is always deny the
| first answer. Usually I deny for some obvious reason, but
| sometimes I just say "isn't that wrong?"
| tombert wrote:
| That's a useful trick but I have noticed when I do that it
| goes in circles where it suggests "A", I say it's wrong, it
| suggests "B", I say that's wrong, it suggests "C", I say
| that's wrong, and then it suggests "A" again.
|
| Usually for it to get a correct answer, I have to provide it
| a bit of context.
| GiorgioG wrote:
| All these LLMs make up too much stuff, I don't see how that can
| be fixed.
| elwell wrote:
| > All these LLMs make up too much stuff, I don't see how that
| can be fixed.
|
| All these humans make up too much stuff, I don't see how that
| can be fixed.
| urduntupu wrote:
| Exactly, you can't even fix the problem at the root, b/c the
| problem is already with the humans, making up stuff.
| testfrequency wrote:
| Believe it or not, there are websites that have real things
| posted. This is honestly my biggest shock that OpenAI
| thought Reddit of all places is a trustworthy source for
| knowledge.
| QuesnayJr wrote:
| Reddit is so much better than the average SEO-optimized
| site that adding "reddit" to your search is a common
| trick for using Google.
| p1esk wrote:
| Reddit has been the most trustworthy source for me in the
| last ~5 years, especially when I want to buy something.
| empath75 wrote:
| The websites with content authored by people is full of
| bullshit, intentional and unintentional.
| testfrequency wrote:
| It's genuinely concerning to me how many people replied
| thinking Reddit is the gospel of factual
| information.
|
| Reddit, while it has some niche communities with tribal
| info and knowledge, is FULL of spam, bots, companies
| masquerading as users, etc etc etc. If people are truly
| relying on reddit as a source of truth (which OpenAI is
| now being influenced by), then the world is just going to
| amplify all the spam that already exists.
| acchow wrote:
| While Reddit is often helpful for me (Google
| site:reddit.com), it's nice to toggle between reddit and
| non-reddit.
|
| I hope LLMs will offer a "-reddit" model to switch to
| when needed.
| testfrequency wrote:
| I know you're trying to be edgy here, but if I'm deciding
| between searching online and finding a source vs trying to
| shortcut and use GPT, and GPT decides to hallucinate and make
| something up - that's the deceiving part.
|
| The biggest issue is how confidently wrong GPT enjoys being.
| You can press GPT in either the right or wrong direction and it
| will concede with minimal effort, which is also an issue.
| It's just really bad Russian-roulette nerd-sniping until
| someone gets tired.
| sva_ wrote:
| I wouldn't call it deceiving. In order to be motivated to
| deceive someone, you'd need agency and some benefit out of
| it
| testfrequency wrote:
| Isn't that GPT Plus? Trick you into thinking you have
| found your new friend and they understand everything?
| Surely OpenAI would like people to use their GPT over a
| Google search.
|
| How do you think leadership at OpenAI would respond to
| that?
| advael wrote:
| 1. Deception describes a result, not a motivation. If
| someone has been led to believe something that isn't
| true, they have been deceived, and this doesn't require
| any other agents
|
| 2. While I agree that it's a stretch to call ChatGPT
| agentic, it's nonetheless "motivated" in the sense that
| it's learned based on an objective function, which we can
| model as a causal factor behind its behavior, which might
| improve our understanding of that behavior. I think it's
| relatively intuitive and not deeply incorrect to say that
| a learned objective of generating plausible prose
| can be a causal factor which has led to a tendency to
| generate prose which often deceives people, and I see
| little value in getting nitpicky about agentic
| assumptions in colloquial language when a vast swath of
| the lexicon and grammar of human languages writ large
| does so essentially by default. "The rain got me wet!"
| doesn't assume that the rain has agency
| swatcoder wrote:
| In reality, humans are often blunt and rude pessimists who
| say things can't be done. But "helpful chatbot" LLMs are
| specifically trained not to do that for anything but crude
| swaths of political/social/safety alignment.
|
| When it comes to technical details, current LLMs have a bias
| towards sycophancy and bullshitting that humans only show
| when especially desperate to impress or totally fearful.
|
| Humans make mistakes too, but the distribution of those
| mistakes is wildly different and generally much easier to
| calibrate for and work around.
| advael wrote:
| The problems of epistemology and informational quality
| control are complicated, but humanity has developed a decent
| amount of social and procedural technology to do these, some
| of which has defined the organization of various
| institutions. The mere presence of LLMs doesn't fundamentally
| change how we should calibrate our beliefs or verify
| information. However, the mythology/marketing that LLMs are
| "outperforming humans" combined with the fact that the most
| popular ones are black boxes to the overwhelming majority of
| their users means that a lot of people aren't applying those
| tools to their outputs. As a technology, they're much more
| useful if you treat them with what is roughly the appropriate
| level of skepticism for a human stranger you're talking to on
| the street
| mistermann wrote:
| I wonder what ChatGPT would have to say if I ran this text
| through with a specialized prompt. Your choice of words is
| interesting, almost like you are optimizing for persuasion,
| but simultaneously I get a strong vibe of intention of
| optimizing for truth.
| refulgentis wrote:
| FWIW I don't understand a lot of what either of you mean,
| but I'm very interested. Quick run-through, excuse the
| editorial tone, I don't know how to give feedback on
| writing without it.
|
| # Post 1
|
| > The problems of epistemology and informational quality
| control are complicated, but humanity has developed a
| decent amount of social and procedural technology to do
| these, some of which has defined the organization of
| various institutions.
|
| _Very_ fluffy, creating _very_ uncertain parsing for
| reader.
|
| _Should_ cut down, then _could_ add specificity:
|
| ex. "Dealing with misinformation is complicated. But we
| have things like dictionaries and the internet, there's
| even specialization in fact-checking, like Snopes.com"
|
| (I assume the specifics I added aren't what you meant,
| just wanted to give an example)
|
| > The mere presence of LLMs doesn't fundamentally change
| how we should calibrate our beliefs or verify
| information. However, the mythology/marketing that LLMs
| are "outperforming humans"
|
| They do, or are clearly at par, at many tasks.
|
| Where is the quote from?
|
| Is bringing this up relevant to the discussion?
|
| Would us quibbling over that be relevant to this
| discussion?
|
| > combined with the fact that the most popular ones are
| black boxes to the overwhelming majority of their users
| means that a lot of people aren't applying those tools to
| their outputs.
|
| Are there unpopular ones that aren't black boxes?
|
| What tools? (this may just indicate the benefit of a
| clearer intro)
|
| > As a technology, they're much more useful if you treat
| them with what is roughly the appropriate level of
| skepticism for a human stranger you're talking to on the
| street
|
| This is a sort of obvious conclusion compared to the
| complicated language leading into it, and doesn't add to
| the posts before it. Is there a stronger claim here?
|
| # Post 2
|
| > I wonder what ChatGPT would have to say if I ran this
| text through with a specialized prompt.
|
| Why do you wonder that?
|
| What does "specialized" mean in this context?
|
| My guess is there's a prompt you have in mind, which then
| would clarify A) what you're wondering about B) what you
| meant by specialized prompt. But a prompt is a question,
| so it may be better to just ask the question?
|
| > Your choice of words is interesting, almost like you
| are optimizing for persuasion,
|
| What language optimizes for persuasion? I'm guessing the
| fluffy advanced verbiage indicates that?
|
| Does this boil down to "Your word choice creates
| persuasive writing"?
|
| > but simultaneously, I get a strong vibe of intention of
| optimizing for truth.
|
| Is there a distinction here? What would "optimizing for
| truth" vs. "optimizing for persuasion" look like?
|
| Do people usually write not-truthful things, to the point
| it's worth noting that when you think people are writing
| with the intention of truth?
| advael wrote:
| As long as we're doing unsolicited advice, this revision
| seems predicated on the assumption that we are writing
| for a general audience, which ill suits the context in
| which the posts were made. This is especially bizarre
| because you then interject to defend the benchmarking
| claim I've called "marketing", and having an opinion on
| that subject at all makes it clear that you also at the
| very least understand the shared context somewhat,
| despite being unable to parse the fairly obvious
| implication that treating models with undue credulity is
| a direct result of the outsized and ill-defined claims
| about their capabilities to which I refer. I agree that I
| could stand to be more concise, but if you find it
| difficult to parse my writing, perhaps this is simply
| because you are not its target audience
| refulgentis wrote:
| Let's go ahead and say the LLM stuff is all marketing and
| it's all clearly worse than all humans. It's plainly
| unrelated to anything else in the post, we don't need to
| focus on it.
|
| Like I said, I'm very interested!
|
| Maybe it doesn't mean anything other than what it says on
| the tin? You think people should treat an LLM like a
| stranger making claims? Makes sense!
|
| It's just unclear what a lot of it means and the word
| choice makes it seem like there's something grander going
| on, _coughs_ as our compatriots in this intricately
| weaved thread on the international network known as the
| world wide web have also explicated, and imparted via the
| written word, as their scrivening also remarks on the
| lexicographical phenomenae. _coughs_
|
| My only other guess is you are doing some form of
| performance art to teach us a broader lesson?
|
| There's something very "off" here, and I'm not the only
| to note it. Like, my instinct is it's iterated writing
| _using_ an LLM asked to make it more graduate-school
| level.
| advael wrote:
| Your post and the one I originally responded to are good
| evidence against something I said earlier. The mere
| existence of LLMs _does_ clearly change the landscape of
| epistemology, because whether or not they're even
| involved in a conversation people will constantly invoke
| them when they think your prose is stilted (which is, by
| the way, exactly the wrong instinct), or to try to
| posture that they occupy some sort of elevated remove
| from the conversation (which I'd say they demonstrate
| false by replying at all). I guess dehumanizing people by
| accusing them of being "robots" is probably as old as the
| usage of that word if not older, but recently interest in
| talking robots has dramatically increased and so here we
| are
|
| I can't tell you exactly what you find "off" about my
| prose, because while you have advocated precision your
| objection is impossibly vague. I talk funny. Okay. Cool.
| Thanks.
|
| Anyway, most benchmarks are garbage, and even if we take
| the validity of these benchmarks for granted, these AI
| companies don't release their datasets or even weights,
| so we have no idea what's out of distribution. To be
| clear, this means the claims can't be verified _even by
| the standards of ML benchmarks_, and thus should be
| taken as marketing, because companies lying about their
| tech has both a clearly defined motivation and a constant
| stream of unrelenting precedent
| advael wrote:
| I think you'll find I'm quite horseshit at optimizing for
| persuasion, as you can easily verify by checking any
| other post I've ever made and the response it generally
| elicits. I find myself less motivated by what people
| think of me every year I'm alive, and less interested in
| what GPT would say about my replies each of the many
| times someone replies just to ponder that instead of just
| satisfying their curiosity immediately via copy-paste.
| Also, in general it seems unlikely humans function as
| optimizers natively, because optimization tends to
| require drastically narrowing and quantifying your
| objectives. I would guess that if they're describable and
| consistent, most human utility functions look more like
| noisy prioritized sets of satisfaction criteria than the
| kind of objectives we can train a neural network against
| mistermann wrote:
| This on the other hand I like, very much!
|
| Particularly:
|
| > Also, in general it seems unlikely humans function as
| optimizers natively, because optimization tends to
| require drastically narrowing and quantifying your
| objectives. I would guess that if they're describable and
| consistent, most human utility functions look more like
| noisy prioritized sets of satisfaction criteria than the
| kind of objectives we can train a neural network against
|
| Considering this, what do you think us humans are
| _actually_ up to, here on HN and in general? It seems
| clear that we are up to _something_, but what might it
| be?
| advael wrote:
| On HN? Killing time, reading articles, and getting
| nerdsniped by the feedback loop of getting insipid
| replies that unfortunately so many of us are constantly
| stuck in
|
| In general? Slowly dying mostly. Talking. Eating.
| Fucking. Staring at microbes under a microscope. Feeding
| cats. Planting trees. Doing cartwheels. Really depends on
| the human
| CooCooCaCha wrote:
| If I am going to trust a machine then it should perform at
| the level of a very competent human, not a general human.
|
| Why would I want to ask your average person a physics
| question? Of course, their answer will probably be wrong and
| partly made up. Why should that be the bar?
|
| I want it to answer at the level of a physics expert. And a
| physics expert is far less likely to make basic mistakes.
| nonameiguess wrote:
| advael's answer was fine, but since people seem to be hung up
| on the wording, a more direct response:
|
| We have human institutions dedicated at least nominally to
| finding and publishing truth (I hate having to qualify this,
| but Hacker News is so cynical and post-modernist at this
| point that I don't know what else to do). These include, for
| instance, court systems. These include a notion of
| evidentiary standards. Eyewitnesses are treated as more
| reliable than hearsay. Written or taped recordings are more
| reliable than both. Multiple witnesses who agree are more
| reliable than one. Another example is science. Science
| utilizes peer review, along with its own notion of hierarchy
| of evidence, similar to but separate from the court's.
| Interventional trials are better evidence than observational
| studies. Randomization and statistical testing is used to try
| and tease out effects from noise. Results that replicate are
| more reliable than a single study. Journalism is yet another
| example. This is probably the arena in which Hacker News is
| most cynical and will declare all of it is useless trash, but
| nonetheless reputable news organizations do have methods they
| use to try and be correct more often than they are not. They
| employ their own fact checkers. They seek out multiple expert
| sources. They send journalists directly to a scene to bear
| witness themselves to events as they unfold.
|
| You're free to think this isn't sufficient, but this is how
| we deal with humans making up stuff and it's gotten us modern
| civilization at least, full of warts but also full of
| wonders, seemingly because we're actually right about a lot
| of stuff.
|
| At some point, something analogous will presumably be the
| answer for how LLMs deal with this, too. The training will
| have to be changed to make the system aware of quality of
| evidence. Place greater trust in direct sensor output versus
| reading something online. Place greater trust in what you
| read from a reputable academic journal versus a Tweet. Etc.
| As it stands now, unlike human learners, the objective
| function of an LLM is just to produce a string in which each
| piece is in some reasonably high-density region of the
| probability distribution of possible next pieces as observed
| from historical recorded text. Luckily, producing strings in
| this way happens to generate a whole lot of true statements,
| but it does not have truth as an explicit goal and, until it
| does, we shouldn't forget that. Treat it with the treatment
| it deserves, as if some human savant with perfect recall had
| never left a dark room to experience the outside world, but
| had read everything ever written, unfortunately without any
| understanding of the difference between reading a textbook
| and reading 4chan.
| spiderfarmer wrote:
| Mixture of agents prevents a lot of fact fabrication.
| ssharp wrote:
| I keep hearing about people using these for coding. Seems like
| it would be extremely easy to miss something and then spend
| more time debugging than it would take to do it yourself.
|
| I tried recently to have ChatGPT write an .htaccess RewriteCond/Rule
| for me and it was extremely confident you couldn't do something
| I needed to do. When I told it that it just needed to add a
| flag to the end of the rule (I was curious and was purposely
| non-specific about what flag it needed), it suddenly knew
| exactly what to do. Thankfully I knew what it needed but
| otherwise I might have walked away thinking it couldn't be
| accomplished.
| GiorgioG wrote:
| My experience is that it will simply make up methods,
| properties and fields that do NOT exist in well-documented
| APIs. If something isn't possible, that's fine, just tell me
| it's not possible. I spent an hour trying to get ChatGPT
| (4/4o and 3.5) to write some code to do one specific thing
| (dump/log detailed memory allocation data from the current
| .NET application process) for diagnosing an intermittent out
| of memory exception in a production application. The answer
| as far as I can tell is that it's not possible in-process.
| Maybe it's possible out of process using the profiling API,
| but that doesn't help me in a locked-down k8s pod/container
| in AWS.
| neonsunset wrote:
| From within the process it might be difficult*, but please
| do give this a read https://learn.microsoft.com/en-
| us/dotnet/core/diagnostics/du... and dotnet-dump + dotnet-
| trace a try.
|
| If you are still seeing the issue with memory and GC, you
| can submit it to https://github.com/dotnet/runtime/issues
| especially if you are doing something that is expected to
| just work(tm).
|
| * difficult as in retrieving data detailed enough to trace
| individual allocations, otherwise `GC.GetGCMemoryInfo()`
| and adjacent methods can give you high-level overview.
| There are more advanced tools but I always had the option
| to either use remote debugging in Windows Server days and
| dotnet-dump and dotnet-trace for containerized applications
| to diagnose the issues, so haven't really explored what is
| needed for the more locked down environments.
| empath75 wrote:
| I think once you understand that they're prone to do that,
| it's less of a problem in practice. You just don't ask it
| questions that requires detailed knowledge of an API unless
| it's _extremely_ popular. Like in kubernetes terms, it's
| safe to ask it about a pod spec, less safe to ask it
| details about istio configuration and even less safe to ask
| it about some random operator with 50 stars on github.
|
| Mostly it's good at structure and syntax, so I'll often
| find the library/spec I want, paste in the relevant
| documentation and ask it to write my function for me.
|
| This may seem like a waste of time because once you've got
| the documentation you can just write the code yourself, but
| A: that takes 5 times as long and B: I think people
| underestimate how much general domain knowledge is buried
| in chatgpt so it's pretty good at inferring the details of
| what you're looking for or what you should have asked
| about.
|
| In general, I think the more your interaction with chatgpt
| is framed as a dialogue and less as a 'fill in the blanks'
| exercise, the more you'll get out of it.
| BurningFrog wrote:
| If I ever let it AI write code, I'd write serious tests for
| it.
|
| Just like I do with my own code.
|
| Both AI and I "hallucinate" sometimes, but with good tests
| you make things work.
| bredren wrote:
| This problem applies almost universally as far as I can tell.
|
| If you are knowledgeable on a subject matter you're asking
| for help with, the LLM can be guided to value. This means you
| do have to throw out bad or flat out wrong output regularly.
|
| This becomes a problem when you have no prior experience in a
| domain. For example reviewing legal contracts about a real
| estate transaction. If you aren't familiar enough with the
| workflow and details of steps you can't provide critique and
| follow-on guidance.
|
| However, the response still stands before you, and it can be
| tempting to glom onto it.
|
| This is not all that different from the current experience
| with search engines, though. Where if you're trying to get an
| answer to a question, you may wade through and even initially
| accept answers from websites that are completely wrong.
|
| For example, products to apply to the foundation of an old
| basement. Some sites will recommend products that are not
| good at all, but do so because the content owners get
| associate compensation for it.
|
| The difference is that LLM responses appear less biased (no
| associate links, no SEO keyword targeting), but are still
| wrong.
|
| All that said, sometimes LLMs just crush it when details
| don't matter. For example, building a simple cross-platform
| pyqt-based application. Search engine results can not do
| this. Whereas, at least for rapid prototyping, GPT is very,
| very good.
| jmount wrote:
| Since the whole thing is behind an API, exposing the works adds
| little value. If the corrections worked at an acceptable rate,
| one would just want them applied at the source.
| renewiltord wrote:
| > _If the corrections worked at an acceptable rate, one would
| just want them applied at the source._
|
| What do you mean? The model is for improving their RLHF
| trainers performance. RLHF does get applied "at the source" so
| to speak. It's a modification on the model behind the API.
|
| Perhaps you could say what you think this thing is for and
| then share why you think it's not "at the source".
| Panoramix wrote:
| Not OP but the screenshot in the article pretty much shows
| something that's not at the source.
|
| You'd like to get the "correct" answer straight away, not
| watch a discussion between two bots.
| IanCal wrote:
| Yes, but this is about helping the people who are training
| the model.
| ertgbnm wrote:
| You are missing the point of the model in the first place.
| By having higher quality RLHF datasets, you get a higher
| quality final model. CriticGPT is not a product, but a tool
| to make GPT-4 and future models better.
| teaearlgraycold wrote:
| Sounds like a combination of a GAN[1] and RLHF. Not surprising
| that this works.
|
| [1] -
| https://en.wikipedia.org/wiki/Generative_adversarial_network
| megaman821 wrote:
| I wonder if you could apply this to training data. Like here is
| an example of a common mistake and why that mistake could be
| made, or here is a statement made in jest and why it could be
| found funny.
| wcoenen wrote:
| This is about RLHF training. But I've wondered if something
| similar could be used to automatically judge the quality of the
| data that is used in pre-training, and then spend more compute on
| the good stuff. Or throw out really bad stuff even before
| building the tokenizer, to avoid those "glitch token" problems.
| Etc.
| smsx wrote:
| Yup, check out Rho-1 by microsoft research.
| integral_1699 wrote:
| I've been using this approach myself, albeit manually, with
| ChatGPT. I first ask my question, then open a new chat and ask it
| to find flaws with the previous answer. Quite often, it does
| improve the end result.
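|
| (If you want to script that two-chat workflow, a minimal sketch with
| the openai Python client — the model name and prompts here are just
| placeholders:)
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|     MODEL = "gpt-4o"   # placeholder model name
|
|     def ask(prompt: str) -> str:
|         """One fresh, single-turn chat per call."""
|         resp = client.chat.completions.create(
|             model=MODEL,
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp.choices[0].message.content
|
|     question = "Write a Python function that merges two sorted lists."
|     answer = ask(question)
|     # The second chat has no memory of the first, so it critiques cold.
|     critique = ask(f"Find flaws in this answer to the question "
|                    f"{question!r}:\n\n{answer}")
|     print(critique)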
| rodoxcasta wrote:
| > Additionally, when people use CriticGPT, the AI augments their
| skills, resulting in more comprehensive critiques than when
| people work alone, and fewer hallucinated bugs than when the
| model works alone.
|
| But, as per the first graphic, CriticGPT alone has better
| comprehensiveness than CriticGPT+Human? Is that right?
| victor9000 wrote:
| This gets at the absolute torrent of LLM diarrhea that people are
| adding to PRs these days. The worst of it seems to come from
| junior and first time senior devs who think more is more when it
| comes to LoC. PR review has become a nightmare at my work where
| juniors are now producing these magnificent PRs with dynamic
| programming, esoteric caching, database triggers, you name it.
| People are using LLMs to produce code far beyond their abilities,
| wisdom, or understanding, producing an absolute clusterfuck of
| bugs and edge cases. Anyone else dealing with something similar?
| How are you handling it?
| Zee2 wrote:
| My company simply prohibits any AI generated code. Seems to
| work rather well.
| ffsm8 wrote:
| My employer went all in and pays for both enterprise
| subscriptions (GitHub Copilot + ChatGPT Enterprise, which is
| just a company branded version of the regular interface)
|
| We've even been getting "prompt engineering" meeting invites
| of 3+ hours to get an introduction into their usage. 100-150
| participants each time I joined
|
| It's amazing how much they're valuing it. From my experience
| it's usually a negative productivity multiplier (x0.7 vs x1
| without either)
| gnicholas wrote:
| How is this enforced? I'm not saying it isn't a good idea,
| just that it seems like it would be tricky to enforce.
| Separately, it could result in employees uploading code to a
| non-privacy-respecting AI, whereas if employees were allowed
| to use a particular AI then the company could better control
| privacy/security concerns.
| ssl-3 wrote:
| Why not deal with people who create problems like this the same
| way as one would have done four years ago?
|
| If they're not doing their job, then why do they still have
| one?
| xhevahir wrote:
| When it gives me a complicated list expression or regex or
| something I like to ask ChatGPT to find a simpler way of doing
| the same thing, and it usually gives me something simpler that
| still works. Of course, you do have to ask, rather than simply
| copy-paste its output right into an editor, which is probably
| one step too many for some.
| crazygringo wrote:
| How is that different from junior devs writing bad code
| previously? The more things change, the more things stay the
| same.
|
| You handle it by teaching them how to write good code.
|
| And if they refuse to learn, then they get bad performance
| reviews and get let go.
|
| I've had junior devs come in with all sorts of bad habits, from
| only using single-letter variable names and zero commenting, to
| thinking global variables should be used for everything, to
| writing object-oriented monstrosities with seven layers of
| unnecessary abstractions instead of a simple function.
|
| Bad LLM-generated code? It's just one more category of bad
| code, and you treat it the same as all the rest. Explain why
| it's wrong and how to redo it.
|
| Or if you want to fix it at scale, identify the common bad
| patterns and make avoiding them part of your company's
| onboarding/orientation/first-week-training for new devs.
| okdood64 wrote:
| > How is that different from junior devs writing bad code
| previously?
|
| Because if it's bad, at least it's simple. Meaning simple to
| review, quickly correct and move on.
| crazygringo wrote:
| Like I said, with "object-oriented mostrosities", it's not
| like it was always simple before either.
|
| And if you know a solution should be 50 lines and they've
| given you 500, it's not like you have to read it all -- you
| can quickly figure out what approach they're using and
| discuss the approach they should be using instead.
| phatfish wrote:
| Maybe fight fire with fire. Feed the ChatGPT PR to ChatGPT
| and ask it to do a review, paste that as the comment. It
| will even do the markdown for you!
| kenjackson wrote:
| In the code review, can't you simply say, "This is too
| complicated for what you're trying to do -- please simplify"?
| lcnPylGDnU4H9OF wrote:
| Not quite the same but might be more relevant depending on
| context: "If you can't articulate what it does then please
| rewrite it such that you can."
| liampulles wrote:
| Maybe time for some pair programming?
| surfingdino wrote:
| No.
| wholinator2 wrote:
| That would be interesting though. What happens when one
| programmer attempts to use ChatGPT in pair programming? It's
| almost like they're already pair programming, just not with
| you!
| surfingdino wrote:
| They are welcome to do so, but not on company time. We do
| not find those tools useful at all, because we are
| generally hired to write new stuff and ChatGPT or other
| tools are useless when there are no good examples to
| steal from (e.g. darker corners of AWS that people don't
| bother to offer solutions for) or when there is a known
| bug or there are only partial workarounds available for
| it.
| ganzuul wrote:
| Are they dealing with complexity which isn't there in order to
| appear smarter?
| jazzyjackson wrote:
| IMO they're just impressed the AI came to a conclusion that
| actually runs and aren't skilled enough to recognize there's
| a simpler way to do it.
| surfingdino wrote:
| I work for clients that do not allow this shit, because their
| security teams and lawyers won't have it. But... they have in-
| house "AI ambassadors" (your typical useless middle managers,
| BAs, project managers, etc.) who see promoting AI as a survival
| strategy. On the business side these orgs are leaking data,
| internal comms, and PII like a sieve, but the software side is
| free of AI. For now.
| neom wrote:
| I was curious about the authors, did some digging, they've
| published some cool stuff:
|
| Improving alignment of dialogue agents via targeted human
| judgements - https://arxiv.org/abs/2209.14375
|
| Teaching language models to support answers with verified quotes
| - https://arxiv.org/abs/2203.11147
| VWWHFSfQ wrote:
| It's an interesting dichotomy happening in the EU vs. USA in
| terms of how these kinds of phenomena are discovered,
| presented, analyzed, and approached.
|
| The EU seems to be very much toward a regulate early, safety-
| first approach. Where USA is very much toward unregulated, move
| fast, break things, assess the damage, regulate later.
|
| I don't know which is better or worse.
| ipaddr wrote:
| Regulating before you understand the problem seems like a poor
| approach.
| l5870uoo9y wrote:
| As a European, after decades of regulations and fines without
| much to show, nobody in the industry believes the EU is
| capable of creating a tech ecosystem. Perhaps even that the EU
| is part of the problem and that individual countries could
| independently move much faster.
| jimmytucson wrote:
| What's the difference between CriticGPT and ChatGPT with a prompt
| that says "You are a software engineer, your job is to review
| this code and point out bugs, here is what the code is supposed
| to do: {the original prompt}, here is the code {original
| response}, review the code," etc.
| ipaddr wrote:
| $20 a month
| advael wrote:
| This is a really bizarre thing to do honestly
|
| It's plausible that there are potential avenues for improving
| language models through adversarial learning. GANs and Actor-
| Critic models have done a good job in narrow-domain generative
| applications and task learning, and I can make a strong
| theoretical argument that you can do something that looks like
| priority learning via adversarial equilibria
|
| But why in the world are you trying to present this as a human-
| in-the-loop system? This makes no sense to me. You take an error-
| prone generative language model and then present another instance
| of an error-prone generative language model to "critique" it for
| the benefit of... a human observer? The very best case here is
| that this wastes a bunch of heat and time for what can only be a
| pretty nebulous potential gain to the human's understanding
|
| Is this some weird gambit to get people to trust these models
| more? Is it OpenAI losing the plot completely because they're
| unwilling to go back to open-sourcing their models but addicted
| to the publicity of releasing public-facing interfaces to them?
| This doesn't make sense to me as a research angle or as a product
|
| I can really see the Microsoft influence here
| kenjackson wrote:
| It's for their RLHF pipeline to improve labeling. Honestly,
| this seems super reasonable to me. I don't get why you think
| this is such a bad idea for this purpose...
| advael wrote:
| RLHF to me seems more like a PR play than anything else, but
| inasmuch as it does anything useful, adding a second LLM to
| influence the human that's influencing the LLM doesn't solve
| any of the fundamental problems of either system. If anything
| it muddies the waters more, because we have already seen that
| humans are probably too credulous of the information
| presented to them by these models. If you want adversarial
| learning, there are far more efficient ways to do it. If you
| want human auditing, the best case here is that the second
| LLM doesn't influence the human's decisions at all (because
| any influence reduces the degree to which this is independent
| feedback)
| vhiremath4 wrote:
| This is kind of what I was thinking. I don't get it. It
| seems like CriticGPT was maybe trained using RM/RL with PPO
| as well? So there's gonna be mistakes with what CriticGPT
| pushes back on which may make the labeler doubt themselves?
| kenjackson wrote:
| This is not adversarial learning. It's really about
| augmenting the ability of humans to determine if a snippet
| of code is correct and write proper critiques of incorrect
| code.
|
| Any system that helps you more accurately label data with
| good critiques should help the model. I'm not sure how you
| come to your conclusion. Do you have some data to indicate
| that even with improved accuracy some LLM bias would
| lead to a worse trained model? I haven't seen that data or
| assertion elsewhere, but that's the only thing I can gather
| you might be referring to.
| advael wrote:
| Well, first of all, the stated purpose of RLHF isn't to
| "improve model accuracy" in the first place (and what we
| mean by accuracy here is pretty fraught by itself, as
| this could mean at least three different things). They
| initially pitched it as a "safety" measure (and I think
| if it wasn't obvious immediately how nonsensical a claim
| that is, it should at least be apparent now that the
| company's shucked nearly the entire subset of its members
| that claimed to care about "AI safety" that this is not a
| priority)
|
| The idea of RLHF as a mechanism for tuning models based
| on the principle that humans might have some hard-to-
| capture insight that could steer them independent of the
| way they're normally trained is the very best steelman
| for its value I could come up with. This aim is directly
| subverted by trying to use another language model to
| influence the human rater, so from my perspective it
| really brings us back to square one on what the fuck RLHF
| is supposed to be doing
|
| Really, a lot of this comes down to what these models do
| versus how they are being advertised. A generative
| language model produces plausible prose that follows from
| the prompt it receives. From this, the claim that it
| should write working code is actually quite a bit
| stronger than the claim that it should write true facts,
| because plausible autocompletion will learn to mimic
| syntactic constraints but actually has very little to do
| with whether something is true, or whatever proxy or
| heuristic we may apply in place of "true" when assessing
| information (supported by evidence, perhaps. Logically
| sound, perhaps. The distinction between "plausible" and
| "true" is in many ways the whole point of every human
| epistemology). Like if you ask something trained on all
| human writing whether the Axis or the Allies won WWII,
| the answer will depend on whether you phrased the
| question in a way that sounds like Phillip K Dick would
| write it. This isn't even incorrect behavior by the
| standards of the model, but people want to use these
| things like some kind of oracle or to replace google
| search or whatever, which is a misconception about what
| the thing does, and one that's very profitable for the
| people selling it
| rvz wrote:
| And both can still be wrong as they have no understanding of the
| mistake.
| lowyek wrote:
| I find it fascinating that while in other fields you see a lot of
| theorems/results long before practical results are found, at
| this forefront of innovation I have hardly seen any paper
| discussing hallucinations and lower/upper bounds on them. Or
| maybe I didn't open Hacker News on the right day when it was
| published. Would love to understand the hallucination phenomenon
| more deeply and the mathematics behind it.
| dennisy wrote:
| Not sure if there is a great deal of maths to understand. The
| output of an LLM is stochastic by nature, and will read
| syntactically perfect, AKA a hallucination.
|
| No real way to mathematically prove this, considering there is
| also no way to know if the training data also had this
| "hallucination" inside of it.
| ben_w wrote:
| I think mathematical proof is the wrong framework, in the
| same way that chemistry is the wrong framework for precisely
| quantifying and explaining how LSD causes humans to
| hallucinate (you can point to which receptors it binds with,
| but AFAICT not much more than that).
|
| Investigate it with the tools of psychology, as suited
| for use on a new non-human creature we've never encountered
| before.
| lowyek wrote:
| I liked the analogy!
| amelius wrote:
| I don't see many deep theorems in the field of psychology
| either.
| lowyek wrote:
| I don't know how to respond to this. But my understanding of
| this term is changing based on this discussion with all of
| you.
| beernet wrote:
| How are 'hallucinations' a phenomenon? I have trouble with the
| term 'hallucination' and believe it sets the wrong narrative.
| It suggests something negative or unexpected, which it
| absolutely is not. Language models aim at, as their name
| implies, modeling language. Not facts or anything alike. This
| is per design and you certainly don't have to be an AI
| researcher to grasp that.
|
| That being said, people new to the field tend to believe that
| these models are fact machines. In fact, they are the complete
| opposite.
| lowyek wrote:
| I believe 'hallucinations' has become an umbrella term for all
| cases of failures where the LLM is not doing what we expect it to
| do or generating stuff which is not aligned with the prompt
| it is provided with. As these models scale many such
| issues get reduced, don't they? For example, in the SORA paper
| OpenAI mentioned that the quality of videos it was able to
| generate improved as they applied more compute and scaling
|
| I think I won't be able to justify my use of the word
| phenomena here. But my intent was to mention that this problem,
| which is so widely regarded and discussed as a problem online
| with respect to LLMs, surprisingly seems less studied - or maybe
| there is a term known just in the LLM researcher community
| which generally is not used.
| hbn wrote:
| > the hallucination phenomena
|
| There isn't really such thing as a "hallucination" and honestly
| I think people should be using the word less. Whether an LLM
| tells you the sky is blue or the sky is purple, it's not doing
| anything different. It's just spitting out a sequence of
| characters it was trained to produce, hopefully what a user wants. There
| is no definable failure state you can call a "hallucination,"
| it's operating as correctly as any other output. But sometimes
| we can tell either immediately or through fact checking it spat
| out a string of text that claims something incorrect.
|
| If you start asking an LLM for political takes, you'll get very
| different answers from humans about which ones are
| "hallucinations"
| mortenjorck wrote:
| It is an unfortunately anthropomorphizing term for a
| transformer simply operating as designed, but the thing it's
| become a vernacular shorthand for, "outputting a sequence of
| tokens representing a claim that can be uncontroversially
| disproven," is still a useful concept.
|
| There's definitely room for a better label, though.
| "Empirical mismatch" doesn't quite have the same ring as
| "hallucination," but it's probably a more accurate place to
| start from.
| NovemberWhiskey wrote:
| > _" outputting a sequence of tokens representing a claim
| that can be uncontroversially disproven," is still a useful
| concept._
|
| Sure, but that would require semantic mechanisms rather
| than statistical ones.
| hbn wrote:
| Regardless I don't think there's much to write papers on,
| other than maybe an anthropological look at how it's
| affected people putting too much trust into LLMs for
| research, decision-making, etc.
|
| If someone wants info to make their model to be more
| reliable for a specific domain, it's in the existing papers
| on model training.
| lowyek wrote:
| After reading the replies, what I grasp is that it's
| just a popular term, and what we call LLM hallucination
| is just expected behaviour. The LLM researchers
| might have called it inference error or something which
| just didn't get as popular as this term.
| emporas wrote:
| Chess engines, which have been used for 25 years by the best human
| chess players daily, compute the best next move on the board.
| The total number of all possible chess positions is more than
| all the atoms in the universe.
|
| Is it possible for a chess engine to compute the next move
| and be absolutely sure it is the best one? It's not, it is a
| statistical approximation, but still very useful.
| lowyek wrote:
| I am - maybe failing to express in words correctly. Let me
| try again. Between the models like llama3b, llama7b and
| llama70b also there is a clearcut difference that the output
| of llama70b is far more correct for the input tokens than the
| smaller models for the same task. But I agree the term
| 'hallucination' shouldn't be used - as it hides the
| nature of the issue in these wrong outputs. When the LLM is not
| behaving as expected, we end up saying it's
| hallucinating rather than saying it failed in X manner.
| raincole wrote:
| I don't know why the narrative became "don't call it
| hallucination". Grantly English isn't my mother tongue so I
| might miss some subtlty here. If you know how LLM works, call
| it "hallucination" doesn't make you know less. If you don't
| know how LLM works, using "hallucination" doesn't make you
| know less either. It's just a word meaning AI gives wrong[1]
| answer.
|
| People say it's "anthropomorphizing" but honestly I can't see
| it. The I in AI stands for intelligence, is this
| anthropomorphizing? L in ML? Reading and writing are clearly
| human activities, so is using read/write instead of
| input/output anthropomorphizing? How about "computer", a word
| once meant a human who does computing? Is there a word we can
| use safely without anthropomorphizing?
|
| [1]: And please don't argue what's "wrong".
| th0ma5 wrote:
| AI is a nebulous, undefined term, and many people
| specifically criticize the use of the word intelligent.
| sandworm101 wrote:
| Hallucination is emergent. It cannot be found as a thing
| inside the AI systems. It is a phenomenon that only exists
| when the output is evaluated. That makes it an accurate
| description. A human who has hallucinated something is not
| lying when they speak of something that never actually
| happened, nor are they making any sort of mistake in their
| recollection. Similarly, an AI that is hallucinating isn't
| doing anything incorrect and doesn't have any motivation. The
| hallucinated data emerges just as any other output, only to
| be evaluated by outsiders as incorrect.
| lowyek wrote:
| This is very interesting and insightful.
| IanCal wrote:
| It's the new "serverless" and I would really like people to
| stop making the discussion about the word. You know
| what it means, I know what it means, let's all move on.
|
| We won't, and we'll see this constant distraction.
| cainxinth wrote:
| Not a paper, but a startup called Vectara claimed to be
| investigating LLM hallucination/ confabulation rates last year:
|
| https://www.nytimes.com/2023/11/06/technology/chatbots-hallu...
| lowyek wrote:
| thank you for sharing this!
| bluelightning2k wrote:
| Evaluators with CriticGPT outperform those without 60% of the
| time.
|
| So, slightly better than random chance. I guess a win is a win
| but I would have thought this would be higher. I'd kind of have
| assumed that just asking GPT itself if it's sure would provide this
| kind of lift.
___________________________________________________________________
(page generated 2024-06-27 23:00 UTC)