[HN Gopher] CriticGPT: Finding GPT-4's mistakes with GPT-4
       ___________________________________________________________________
        
       CriticGPT: Finding GPT-4's mistakes with GPT-4
        
       Author : davidbarker
       Score  : 148 points
       Date   : 2024-06-27 17:02 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | mkmk wrote:
       | It seems more and more that the solution to AI's quality problems
       | is... more AI.
       | 
       | Does Anthropic do something like this as well, or is there
       | another reason Claude Sonnet 3.5 is so much better at coding than
       | GPT-4o?
        
         | GaggiX wrote:
         | >or is there another reason Claude Sonnet 3.5 is so much better
         | at coding than GPT-4o?
         | 
         | It's impossible to say because these models are proprietary.
        
           | mkmk wrote:
           | Isn't the very article we're commenting on an indication that
           | you can form a basic opinion on what makes one proprietary
           | model different from another?
        
             | GaggiX wrote:
             | Not really, we know absolutely nothing about Claude 3.5
             | Sonnet, except that it's an LLM.
        
         | ru552 wrote:
         | Anthropic has attributed Sonnet 3.5's model improvement to
         | better training data.
         | 
         | "Which data specifically? Gerstenhaber wouldn't disclose, but
         | he implied that Claude 3.5 Sonnet draws much of its strength
         | from these training sets."[0]
         | 
         | [0]https://techcrunch.com/2024/06/20/anthropic-claims-its-
         | lates...
        
         | jasonjmcghee wrote:
          | My guess, which could be completely wrong, is that Anthropic
          | spent more resources on interpretability and it's paying off.
         | 
         | I remember when I first started using activation maps when
         | building image classification models and it was like what on
         | earth was I doing before this... just blindly trusting the
         | loss.
         | 
         | How do you discover biases and issues with training data
         | without interpretability?
        
         | Kiro wrote:
         | Is it really that much better? I'm really happy with GPT-4o's
         | coding capabilities and very seldom experience problems with
         | hallucinations or incorrect responses, so I'm intrigued by how
         | much better it can actually be.
        
         | p1esk wrote:
         | In my experience Sonnet 3.5 is about the same as 4o for coding.
         | Sometimes one provides a better solution, sometimes the other.
         | Both are pretty good.
        
         | surfingdino wrote:
         | > It seems more and more that the solution to AI's quality
         | problems is... more AI.
         | 
         | This reminds me of the passage found in the description of the
         | fuckitpy module:
         | 
         | "This module is like violence: if it doesn't work, you just
         | need more of it."
        
       | soloist11 wrote:
       | How do they know the critic did not make a mistake? Do they have
       | a critic for the critic?
        
         | OlleTO wrote:
         | It's critics all the way down
        
           | azulster wrote:
           | it's literally just the oracle problem all over again
        
         | esafak wrote:
         | It's called iteration. Humans do the same thing.
        
           | soloist11 wrote:
           | Are you sure it's not called recursion?
        
           | citizen_friend wrote:
           | It's not a human, and we shouldn't assume it will have traits
           | we do without evidence.
           | 
            | Iteration is also when your brain meets the external world
            | and corrects itself. This is a closed system.
        
         | GaggiX wrote:
          | It's written in the article: the critic makes mistakes, but
          | it's better than not having it.
        
           | soloist11 wrote:
           | How do they know it's better? The rate of mistakes is the
           | same for both GPTs so now they have 2 sources of errors. If
           | the error rate was lower for one then they could always apply
           | it and reduce the error rate of the other. They're just
           | shuffling the deck chairs and hoping the boat with a hole
           | goes a slightly longer distance before disappearing
           | completely underwater.
        
             | GaggiX wrote:
             | >How do they know it's better?
             | 
             | Probably just evaluation on benchmarks.
        
             | yorwba wrote:
             | Whether adding unreliable components increases the overall
             | reliability of a system depends on whether the system
             | requires _all_ components to work (in which case adding
             | components can only make matters worse) or only _some_ (in
             | which case adding components can improve redundancy and
             | make it more likely that the final result is correct).
             | 
             | In the particular case of spotting mistakes made by
             | ChatGPT, a mistake is spotted if it is spotted by the human
             | reviewer _or_ by the critic, so even a critic that makes
             | many mistakes itself can still increase the number of
             | spotted errors. (But it might decrease the spotting rate
             | per unit time, so there are still trade-offs to be made.)
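              | 
              | To put rough numbers on the "or" case (the rates below are
              | made up, and independence between human and critic misses
              | is assumed):
              | 
              |     # Chance that a given bug is caught by at least one
              |     # reviewer, if the human and the critic miss bugs
              |     # independently. Illustrative rates only.
              |     p_human, p_critic = 0.5, 0.4
              |     p_either = 1 - (1 - p_human) * (1 - p_critic)
              |     print(p_either)  # 0.7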
        
               | soloist11 wrote:
                | I see what you're saying, so what OpenAI will do next is
               | create an army of GPT critics and then run them all in
               | parallel to take some kind of quorum vote on correctness.
               | I guess it should work in theory if the error rate is
               | small enough and adding more critics actually reduces the
               | error rate. My guess is that in practice they'll converge
               | to the population average rate of error and then pat
               | themselves on the back for a job well done.
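                | 
                | Back-of-the-envelope for the quorum idea (numbers
                | purely illustrative): with independent critics that
                | are each right more often than wrong, majority-vote
                | error does drop as critics are added; with coin-flip
                | critics it never does.
                | 
                |     from math import comb
                | 
                |     def majority_error(p, n):
                |         # P(majority of n independent critics is
                |         # wrong) when each is wrong with
                |         # probability p (n odd).
                |         k = n // 2 + 1
                |         return sum(comb(n, i) * p**i
                |                    * (1 - p)**(n - i)
                |                    for i in range(k, n + 1))
                | 
                |     print(majority_error(0.3, 1))  # 0.30
                |     print(majority_error(0.3, 5))  # ~0.16
                |     print(majority_error(0.5, 5))  # 0.50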
        
               | svachalek wrote:
               | That description is remarkably apt for almost every
               | business meeting I've ever been in.
        
         | jsheard wrote:
         | Per the article, the critic for the critic is human RLHF
         | trainers. More specifically those humans are exploited third
         | world workers making between $1.32 and $2 an hour, but OpenAI
         | would rather you didn't know about that.
         | 
         | https://time.com/6247678/openai-chatgpt-kenya-workers/
        
           | soloist11 wrote:
           | Every leap of civilization was built off the back of a
           | disposable workforce. - Niander Wallace
        
             | wmeredith wrote:
             | He was the bad guy, right?
        
           | IncreasePosts wrote:
            | That is more than the average entry-level position pays in
            | Kenya.
           | The work is probably also much easier (physically, that is).
        
           | golergka wrote:
           | Exploited? Are you saying that these employees are forced to
           | work for below market rates, and would be better off with
           | other opportunities available to them? If that's the case,
           | it's truly horrible on OpenAI's part.
        
         | nmca wrote:
         | A critic for the critic would be "Recursive Reward Modelling",
         | an exciting idea that has not been made to work in the real
         | world yet.
        
           | soloist11 wrote:
           | Most of my ideas are not original but where can I learn more
           | about this recursive reward modeling problem?
        
             | nmca wrote:
             | https://arxiv.org/abs/1811.07871
        
         | finger wrote:
         | There is already a mistake. It refers to a function by the
         | wrong name: os.path.comonpath > commonpath
        
           | soloist11 wrote:
           | In the critical limit every GPT critic chain is essentially a
           | spellchecker.
        
         | ertgbnm wrote:
         | That's the human's job for now.
         | 
         | A human reviewer might have trouble catching a mistake, but
          | they are generally pretty good at discerning whether a report
          | about a mistake is valid or not. For example, finding a bug in a
         | codebase is hard. But if a junior sends you a code snippet and
         | says "I think this is a bug for xyz reason", do you agree? It's
         | much easier to confidently say yes or no. So basically it
         | changes the problem from finding a needle in a haystack to
         | discerning if a statement is a hallucination or not.
        
       | sdenton4 wrote:
       | Looks like the hallucination rate doesn't improve significantly,
       | but I suppose it's still a win if it helps humans review things
       | faster? Though I could imagine reliance on the tool leading to
       | missing less obvious problems.
        
         | tombert wrote:
         | While it's a net good, it would kind of kill one of the most
         | valuable parts of ChatGPT for me, which is critiquing its
         | output myself.
         | 
         | If I ask it a question, I try not to trust it immediately, and
         | I independently look the answer up and I argue with it. In
         | turn, it actually is one of my favorite learning tools, because
          | it kind of forces me to figure out _why_ it's wrong and
         | explain it.
        
           | goostavos wrote:
           | Unexpectedly, I kind of agree. I've found GPT to be a great
           | tutor for things I'm trying to learn. It being somewhat
           | unreliable / prone to confidently lying embeds a certain
           | amount of useful skepticism and questioning of all the
           | information, which in turn leads to an overall better
           | understanding.
           | 
            | Fighting with the AI's wrongness out of spite is an
            | unexpectedly good motivator.
        
             | ExtremisAndy wrote:
             | Wow, I've never thought about that, but you're right! It
             | really has trained me to be skeptical of what I'm being
             | taught and confirm the veracity of it with multiple
             | sources. A bit time-consuming, of course, but generally a
             | good way to go about educating yourself!
        
               | tombert wrote:
               | I genuinely think that arguing with it has been almost a
               | secret weapon for me with my grad school work. I'll ask
               | it a question about temporal logic or something, it'll
               | say something that sounds accurate but is ultimately
               | wrong or misleading after looking through traditional
               | documentation, and I can fight with it, and see if it
               | refines it to something correct, which I can then check
               | again, etc. I keep doing this for a bunch of iterations
               | and I end up with a pretty good understanding of the
               | topic.
               | 
               | I guess at some level this is almost what "prompt
               | engineering" is (though I really hate that term), but I
               | use it as a learning tool and I do think it's been really
               | good at helping me cement concepts in my brain.
        
               | ramenbytes wrote:
               | > I'll ask it a question about temporal logic or
               | something, it'll say something that sounds accurate but
               | is ultimately wrong or misleading after looking through
               | traditional documentation, and I can fight with it, and
               | see if it refines it to something correct, which I can
               | then check again, etc. I keep doing this for a bunch of
               | iterations and I end up with a pretty good understanding
               | of the topic.
               | 
               | Interesting, that's the basic process I follow myself
               | when learning without ChatGPT. Comparing my mental
               | representation of the thing I'm learning to existing
               | literature/results, finding the disconnects between the
               | two, reworking my understanding, wash rinse repeat.
        
               | tombert wrote:
               | I guess a large part of it is just kind of the "rubber
               | duck" thing. My thoughts can be pretty disorganized and
               | hard to follow until I'm forced to articulate them.
               | Finding out why ChatGPT is wrong is useful because it's a
               | rubber duck that I can interrogate, not just talk to.
               | 
               | It can be hard for me to directly figure out when my
               | mental model is wrong on something. I'm sure it happens
               | all the time, but a lot of the time I will think I know
               | something until I feel compelled to prove it to someone,
                | and I'll often find out that _I'm_ wrong.
               | 
               | That's actually happened a bunch of times with ChatGPT,
               | where I think it's wrong until I actually interrogate it,
               | look up a credible source, and realize that my
               | understanding was incorrect.
        
             | posix86 wrote:
              | Reminds me of a prof at uni whose slides always appeared to
              | have been written 5 mins before the lecture started,
              | resulting in students pointing out mistakes in every other
              | slide. He defended himself saying that you learn more if
              | you aren't sure whether things are correct - which was
              | right. Esp. during a lecture, when you know that what
              | you're looking at is provably right, it's sometimes not
              | that easy to figure out whether you truly understood
              | something or fooled yourself. If you know everything can
              | be wrong, you trick your mind into verifying it at a
              | deeper level, and thus gain more understanding. It also
              | results in a culture where you're allowed to question the
              | prof. It led to many healthy arguments with the prof about
              | why something is the way it is, often ending with him
              | agreeing that his slides were wrong. He never corrected
              | the underlying PPT.
        
               | tombert wrote:
               | I thought about doing that when I was doing adjunct last
               | year, but what made me stop was the fact that these were
               | introductory classes, so I was afraid I might pollute the
               | minds of students who really haven't learned enough to
               | question stuff yet.
        
             | tombert wrote:
             | Yeah, and what I like is that I can get it to say things in
             | "dumb language" instead of a bunch of scary math terms.
             | It'll be confidently wrong, but in language that I can
              | easily understand, forcing me to look things up, and
              | kind of forcing me to learn the proper terminology and
              | actually understand it.
             | 
             | Arcane language is actually kind of a pet peeve of mine in
             | theoretical CS and mathematics. Sometimes it feels like
              | academics really obfuscate relatively simple concepts by
              | using a bunch of weird math terms. I don't think it's
              | malicious, I just think that there's value in having more
              | approachable language and metaphors in the process of
              | explaining things.
        
             | empath75 wrote:
             | I actually learn a lot from arguing with not just AIs but
             | people and it doesn't really matter if they're wrong or
             | right. If they're right, it's an obvious learning
              | experience for me; if they're wrong, it forces me to
              | explain and understand _why_ they're wrong.
        
               | tombert wrote:
               | I completely agree with that, but the problem is finding
               | a supply of people to argue with on niche subjects. I
               | have occasionally argued with people on the Haskell IRC
               | and the NixOS Matrix server about some stuff, but since
                | they're humans who selfishly have their own lives to
                | live, I can't argue with them infinitely, and since the
                | topics I argue about are so specific, there just don't
                | exist a lot of people I can argue with even in the best
                | of times.
               | 
               | ChatGPT (Gemini/Anthropic/etc) have the advantage of
               | never getting sick of arguing with me. I can go back and
               | forth and argue about any weird topic that I want for as
               | long as I want at any time of day and keep learning until
               | I'm bored of it.
               | 
               | Obviously it depends on the person but I really like it.
        
               | mistermann wrote:
               | Arguing is arguably one of humanity's super powers, and
               | that we've yet to bring it to bear in any serious way
               | gives me reason for optimism about sorting out the
               | various major problems we've foolishly gotten ourselves
               | into.
        
               | ramenbytes wrote:
               | > I completely agree with that, but the problem is
               | finding a supply of people to argue with on niche
               | subjects.
               | 
               | Beyond just subject-wise, finding people who argue in
               | good faith seems to be an issue too. There are people I'm
               | friends with almost specifically because we're able to
               | consistently have good-faith arguments about our strongly
               | opposing views. It doesn't seem to be a common skill, but
               | perhaps that has something to do with my sample set or my
               | own behaviors in arguments.
        
               | tombert wrote:
               | I dunno, for more niche computer science or math
               | subjects, I don't feel like people argue in bad faith
               | most of the time. The people I've argued with on the
               | Haskell IRC years ago genuinely believe in what they're
               | saying, even if I don't agree with them (I have a lot of
               | negative opinions on Haskell as a language).
               | 
               | Politically? Yeah, nearly impossible to find anyone who
               | argues in good faith.
        
           | julienchastang wrote:
           | Very good comment. In order to effectively use LLMs (I use
           | ChatGPT4 and 4o), you have to be skeptical of them and being
           | a good AI skeptic takes practice. Here is another technique
           | I've learned along the way: When you have it generate text
           | for some report you are writing, or something, after your
           | initial moment of being dazzled (at least for me), resist the
           | temptation to copy/paste. Instead, "manually" rewrite the
           | verbiage. You then realize there is a substantial amount of
           | BS that can be excised. Nevertheless, it is a huge time saver
           | and can be good at ideation, as well.
        
             | tombert wrote:
             | Yeah I used it last year to generate homework assignments,
             | and it would give me the results in Pandoc compatible
             | markdown. It was initially magic, but some of the problems
             | didn't actually make sense and might actually be
              | unsolvable, so I would have to go through it line by line
             | and then ask it to regenerate it [1].
             | 
             | Even with that, it took a process that had taken multiple
             | hours before down to about 30-45 minutes. It was super
             | cool.
             | 
             | [1] Just to be clear, I always did the homework assignments
              | myself beforehand to make sure they were solvable and
              | fair before I assigned them.
        
         | foobiekr wrote:
          | The lazy version of that, which I recommend, is to always deny
          | the first answer. Usually I deny it for some obvious reason, but
         | sometimes I just say "isn't that wrong?"
        
           | tombert wrote:
           | That's a useful trick but I have noticed when I do that it
            | goes in circles, where it suggests "A", I say it's wrong, it
           | suggests "B", I say that's wrong, it suggests "C", I say
           | that's wrong, and then it suggests "A" again.
           | 
           | Usually for it to get a correct answer, I have to provide it
           | a bit of context.
        
       | GiorgioG wrote:
       | All these LLMs make up too much stuff, I don't see how that can
       | be fixed.
        
         | elwell wrote:
         | > All these LLMs make up too much stuff, I don't see how that
         | can be fixed.
         | 
         | All these humans make up too much stuff, I don't see how that
         | can be fixed.
        
           | urduntupu wrote:
           | Exactly, you can't even fix the problem at the root, b/c the
           | problem is already with the humans, making up stuff.
        
             | testfrequency wrote:
             | Believe it or not, there are websites that have real things
              | posted. Honestly, my biggest shock is that OpenAI thought
              | Reddit of all places is a trustworthy source of
              | knowledge.
        
               | QuesnayJr wrote:
               | Reddit is so much better than the average SEO-optimized
               | site that adding "reddit" to your search is a common
               | trick for using Google.
        
               | p1esk wrote:
               | Reddit has been the most trustworthy source for me in the
               | last ~5 years, especially when I want to buy something.
        
               | empath75 wrote:
                | The websites with content authored by people are full of
                | bullshit, intentional and unintentional.
        
               | testfrequency wrote:
                | It's genuinely concerning to me how many people replied
                | thinking Reddit is the gospel for factual information.
               | 
               | Reddit, while it has some niche communities with tribal
               | info and knowledge, is FULL of spam, bots, companies
               | masquerading as users, etc etc etc. If people are truly
               | relying on reddit as a source of truth (which OpenAI is
               | now being influenced by), then the world is just going to
                | amplify all the spam that already exists.
        
               | acchow wrote:
               | While Reddit is often helpful for me (Google
               | site:reddit.com), it's nice to toggle between reddit and
               | non-reddit.
               | 
               | I hope LLMs will offer a "-reddit" model to switch to
               | when needed.
        
           | testfrequency wrote:
            | I know you're trying to be edgy here, but if I'm deciding
            | between searching online to find a source vs taking a
            | shortcut and using GPT, and GPT decides to hallucinate and
            | make something up - that's the deceiving part.
            | 
            | The biggest issue is how confidently wrong GPT enjoys being.
            | You can press GPT in either the right or wrong direction and
            | it will concede with minimal effort, which is also an issue.
            | It's just really bad Russian roulette nerd-spinning until
            | someone gets tired.
        
             | sva_ wrote:
             | I wouldn't call it deceiving. In order to be motivated to
             | deceive someone, you'd need agency and some benefit out of
             | it
        
               | testfrequency wrote:
               | Isn't that GPT Plus? Trick you into thinking you have
               | found your new friend and they understand everything?
               | Surely OpenAI would like people to use their GPT over a
               | Google search.
               | 
               | How do you think leadership at OpenAI would respond to
               | that?
        
               | advael wrote:
               | 1. Deception describes a result, not a motivation. If
               | someone has been led to believe something that isn't
               | true, they have been deceived, and this doesn't require
               | any other agents
               | 
               | 2. While I agree that it's a stretch to call ChatGPT
               | agentic, it's nonetheless "motivated" in the sense that
               | it's learned based on an objective function, which we can
               | model as a causal factor behind its behavior, which might
               | improve our understanding of that behavior. I think it's
               | relatively intuitive and not deeply incorrect to say that
                | a learned objective of generating plausible prose
               | can be a causal factor which has led to a tendency to
               | generate prose which often deceives people, and I see
               | little value in getting nitpicky about agentic
               | assumptions in colloquial language when a vast swath of
               | the lexicon and grammar of human languages writ large
               | does so essentially by default. "The rain got me wet!"
               | doesn't assume that the rain has agency
        
           | swatcoder wrote:
           | In reality, humans are often blunt and rude pessimists who
            | say things can't be done. But "helpful chatbot" LLMs are
           | specifically trained not to do that for anything but crude
           | swaths of political/social/safety alignment.
           | 
            | When it comes to technical details, current LLMs have a bias
           | towards sycophancy and bullshitting that humans only show
           | when especially desperate to impress or totally fearful.
           | 
           | Humans make mistakes too, but the distribution of those
           | mistakes is wildly different and generally much easier to
           | calibrate for and work around.
        
           | advael wrote:
           | The problems of epistemology and informational quality
           | control are complicated, but humanity has developed a decent
           | amount of social and procedural technology to do these, some
           | of which has defined the organization of various
           | institutions. The mere presence of LLMs doesn't fundamentally
           | change how we should calibrate our beliefs or verify
           | information. However, the mythology/marketing that LLMs are
           | "outperforming humans" combined with the fact that the most
           | popular ones are black boxes to the overwhelming majority of
           | their users means that a lot of people aren't applying those
           | tools to their outputs. As a technology, they're much more
           | useful if you treat them with what is roughly the appropriate
           | level of skepticism for a human stranger you're talking to on
           | the street
        
             | mistermann wrote:
             | I wonder what ChatGPT would have to say if I ran this text
             | through with a specialized prompt. Your choice of words is
             | interesting, almost like you are optimizing for persuasion,
             | but simultaneously I get a strong vibe of intention of
             | optimizing for truth.
        
               | refulgentis wrote:
               | FWIW I don't understand a lot of what either of you mean,
               | but I'm very interested. Quick run-through, excuse the
               | editorial tone, I don't know how to give feedback on
               | writing without it.
               | 
               | # Post 1
               | 
               | > The problems of epistemology and informational quality
               | control are complicated, but humanity has developed a
               | decent amount of social and procedural technology to do
               | these, some of which has defined the organization of
               | various institutions.
               | 
               |  _Very_ fluffy, creating _very_ uncertain parsing for
               | reader.
               | 
               |  _Should_ cut down, then _could_ add specificity:
               | 
               | ex. "Dealing with misinformation is complicated. But we
               | have things like dictionaries and the internet, there's
               | even specialization in fact-checking, like Snopes.com"
               | 
               | (I assume the specifics I added aren't what you meant,
               | just wanted to give an example)
               | 
               | > The mere presence of LLMs doesn't fundamentally change
               | how we should calibrate our beliefs or verify
               | information. However, the mythology/marketing that LLMs
               | are "outperforming humans"
               | 
                | They do, or are clearly on par, at many tasks.
               | 
               | Where is the quote from?
               | 
               | Is bringing this up relevant to the discussion?
               | 
               | Would us quibbling over that be relevant to this
               | discussion?
               | 
               | > combined with the fact that the most popular ones are
               | black boxes to the overwhelming majority of their users
               | means that a lot of people aren't applying those tools to
               | their outputs.
               | 
                | Are there unpopular ones that aren't black boxes?
               | 
               | What tools? (this may just indicate the benefit of a
               | clearer intro)
               | 
               | > As a technology, they're much more useful if you treat
               | them with what is roughly the appropriate level of
               | skepticism for a human stranger you're talking to on the
               | street
               | 
               | This is a sort of obvious conclusion compared to the
               | complicated language leading into it, and doesn't add to
               | the posts before it. Is there a stronger claim here?
               | 
               | # Post 2
               | 
               | > I wonder what ChatGPT would have to say if I ran this
               | text through with a specialized prompt.
               | 
               | Why do you wonder that?
               | 
               | What does "specialized" mean in this context?
               | 
               | My guess is there's a prompt you have in mind, which then
               | would clarify A) what you're wondering about B) what you
               | meant by specialized prompt. But a prompt is a question,
               | so it may be better to just ask the question?
               | 
               | > Your choice of words is interesting, almost like you
               | are optimizing for persuasion,
               | 
               | What language optimizes for persuasion? I'm guessing the
               | fluffy advanced verbiage indicates that?
               | 
               | Does this boil down to "Your word choice creates
               | persuasive writing"?
               | 
               | > but simultaneously, I get a strong vibe of intention of
               | optimizing for truth.
               | 
               | Is there a distinction here? What would "optimizing for
               | truth" vs. "optimizing for persuasion" look like?
               | 
               | Do people usually write not-truthful things, to the point
               | it's worth noting that when you think people are writing
               | with the intention of truth?
        
               | advael wrote:
               | As long as we're doing unsolicited advice, this revision
               | seems predicated on the assumption that we are writing
               | for a general audience, which ill suits the context in
               | which the posts were made. This is especially bizarre
               | because you then interject to defend the benchmarking
               | claim I've called "marketing", and having an opinion on
               | that subject at all makes it clear that you also at the
               | very least understand the shared context somewhat,
               | despite being unable to parse the fairly obvious
               | implication that treating models with undue credulity is
               | a direct result of the outsized and ill-defined claims
               | about their capabilities to which I refer. I agree that I
               | could stand to be more concise, but if you find it
               | difficult to parse my writing, perhaps this is simply
               | because you are not its target audience
        
               | refulgentis wrote:
               | Let's go ahead and say the LLM stuff is all marketing and
               | it's all clearly worse than all humans. It's plainly
               | unrelated to anything else in the post, we don't need to
               | focus on it.
               | 
               | Like I said, I'm very interested!
               | 
               | Maybe it doesn't mean anything other than what it says on
               | the tin? You think people should treat an LLM like a
               | stranger making claims? Makes sense!
               | 
               | It's just unclear what a lot of it means and the word
               | choice makes it seem like there's something grander going
               | on, _coughs_ as our compatriots in this intricately
               | weaved thread on the international network known as the
               | world wide web have also explicated, and imparted via the
               | written word, as their scrivening also remarks on the
               | lexicographical phenomenae. _coughs_
               | 
               | My only other guess is you are doing some form of
               | performance art to teach us a broader lesson?
               | 
                | There's something very "off" here, and I'm not the only one
               | to note it. Like, my instinct is it's iterated writing
               | _using_ an LLM asked to make it more graduate-school
               | level.
        
               | advael wrote:
               | Your post and the one I originally responded to are good
               | evidence against something I said earlier. The mere
               | existence of LLMs _does_ clearly change the landscape of
               | epistemology, because whether or not they 're even
               | involved in a conversation people will constantly invoke
               | them when they think your prose is stilted (which is, by
               | the way, exactly the wrong instinct), or to try to
               | posture that they occupy some sort of elevated remove
               | from the conversation (which I'd say they demonstrate
               | false by replying at all). I guess dehumanizing people by
               | accusing them of being "robots" is probably as old as the
               | usage of that word if not older, but recently interest in
               | talking robots has dramatically increased and so here we
               | are
               | 
               | I can't tell you exactly what you find "off" about my
               | prose, because while you have advocated precision your
               | objection is impossibly vague. I talk funny. Okay. Cool.
               | Thanks.
               | 
               | Anyway, most benchmarks are garbage, and even if we take
               | the validity of these benchmarks for granted, these AI
               | companies don't release their datasets or even weights,
               | so we have no idea what's out of distribution. To be
               | clear, this means the claims can't be verified _even by
               | the standards of ML benchmarks_ , and thus should be
               | taken as marketing, because companies lying about their
               | tech has both a clearly defined motivation and a constant
               | stream of unrelenting precedent
        
               | advael wrote:
               | I think you'll find I'm quite horseshit at optimizing for
               | persuasion, as you can easily verify by checking any
               | other post I've ever made and the response it generally
               | elicits. I find myself less motivated by what people
               | think of me every year I'm alive, and less interested in
               | what GPT would say about my replies each of the many
               | times someone replies just to ponder that instead of just
               | satisfying their curiosity immediately via copy-paste.
               | Also, in general it seems unlikely humans function as
               | optimizers natively, because optimization tends to
               | require drastically narrowing and quantifying your
               | objectives. I would guess that if they're describable and
               | consistent, most human utility functions look more like
               | noisy prioritized sets of satisfaction criteria than the
               | kind of objectives we can train a neural network against
        
               | mistermann wrote:
               | This on the other hand I like, very much!
               | 
               | Particularly:
               | 
               | > Also, in general it seems unlikely humans function as
               | optimizers natively, because optimization tends to
               | require drastically narrowing and quantifying your
               | objectives. I would guess that if they're describable and
               | consistent, most human utility functions look more like
               | noisy prioritized sets of satisfaction criteria than the
               | kind of objectives we can train a neural network against
               | 
               | Considering this, what do you think us humans are
               | _actually_ up to, here on HN and in general? It seems
               | clear that we are up to _something_ , but what might it
               | be?
        
               | advael wrote:
               | On HN? Killing time, reading articles, and getting
               | nerdsniped by the feedback loop of getting insipid
               | replies that unfortunately so many of us are constantly
               | stuck in
               | 
               | In general? Slowly dying mostly. Talking. Eating.
               | Fucking. Staring at microbes under a microscope. Feeding
               | cats. Planting trees. Doing cartwheels. Really depends on
               | the human
        
           | CooCooCaCha wrote:
           | If I am going to trust a machine then it should perform at
           | the level of a very competent human, not a general human.
           | 
           | Why would I want to ask your average person a physics
           | question? Of course, their answer will probably be wrong and
           | partly made up. Why should that be the bar?
           | 
           | I want it to answer at the level of a physics expert. And a
           | physics expert is far less likely to make basic mistakes.
        
           | nonameiguess wrote:
           | advael's answer was fine, but since people seem to be hung up
           | on the wording, a more direct response:
           | 
           | We have human institutions dedicated at least nominally to
           | finding and publishing truth (I hate having to qualify this,
           | but Hacker News is so cynical and post-modernist at this
           | point that I don't know what else to do). These include, for
           | instance, court systems. These include a notion of
           | evidentiary standards. Eyewitnesses are treated as more
           | reliable than hearsay. Written or taped recordings are more
           | reliable than both. Multiple witnesses who agree are more
           | reliable than one. Another example is science. Science
           | utilizes peer review, along with its own notion of hierarchy
           | of evidence, similar to but separate from the court's.
           | Interventional trials are better evidence than observational
           | studies. Randomization and statistical testing is used to try
           | and tease out effects from noise. Results that replicate are
           | more reliable than a single study. Journalism is yet another
           | example. This is probably the arena in which Hacker News is
           | most cynical and will declare all of it is useless trash, but
           | nonetheless reputable news organizations do have methods they
           | use to try and be correct more often than they are not. They
           | employ their own fact checkers. They seek out multiple expert
           | sources. They send journalists directly to a scene to bear
           | witness themselves to events as they unfold.
           | 
           | You're free to think this isn't sufficient, but this is how
           | we deal with humans making up stuff and it's gotten us modern
           | civilization at least, full of warts but also full of
           | wonders, seemingly because we're actually right about a lot
           | of stuff.
           | 
           | At some point, something analogous will presumably be the
           | answer for how LLMs deal with this, too. The training will
           | have to be changed to make the system aware of quality of
           | evidence. Place greater trust in direct sensor output versus
           | reading something online. Place greater trust in what you
           | read from a reputable academic journal versus a Tweet. Etc.
           | As it stands now, unlike human learners, the objective
           | function of an LLM is just to produce a string in which each
           | piece is in some reasonably high-density region of the
           | probability distribution of possible next pieces as observed
           | from historical recorded text. Luckily, producing strings in
           | this way happens to generate a whole lot of true statements,
           | but it does not have truth as an explicit goal and, until it
           | does, we shouldn't forget that. Treat it with the treatment
           | it deserves, as if some human savant with perfect recall had
           | never left a dark room to experience the outside world, but
           | had read everything ever written, unfortunately without any
           | understanding of the difference between reading a textbook
           | and reading 4chan.
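            | 
            | A toy version of that objective (the vocabulary and numbers
            | below are invented, and this is no one's actual code): the
            | model only ranks continuations by learned plausibility;
            | nothing in the loop weighs the quality of the evidence
            | behind them.
            | 
            |     import random
            | 
            |     # Hypothetical next-token distribution after the
            |     # prompt "The capital of France is" -- learned from
            |     # co-occurrence in text, not by checking an atlas.
            |     probs = {"Paris": 0.9, "Lyon": 0.06, "Berlin": 0.04}
            |     tokens, weights = zip(*probs.items())
            |     print(random.choices(tokens, weights=weights)[0])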
        
         | spiderfarmer wrote:
         | Mixture of agents prevents a lot of fact fabrication.
        
         | ssharp wrote:
         | I keep hearing about people using these for coding. Seems like
         | it would be extremely easy to miss something and then spend
         | more time debugging than it would be to do yourself.
         | 
          | I tried recently to have ChatGPT write an .htaccess
          | RewriteCond/Rule for me and it was extremely confident you
          | couldn't do something
         | I needed to do. When I told it that it just needed to add a
         | flag to the end of the rule (I was curious and was purposely
         | non-specific about what flag it needed), it suddenly knew
         | exactly what to do. Thankfully I knew what it needed but
         | otherwise I might have walked away thinking it couldn't be
         | accomplished.
        
           | GiorgioG wrote:
           | My experience is that it will simply make up methods,
           | properties and fields that do NOT exist in well-documented
           | APIs. If something isn't possible, that's fine, just tell me
           | it's not possible. I spent an hour trying to get ChatGPT
           | (4/4o and 3.5) to write some code to do one specific thing
           | (dump/log detailed memory allocation data from the current
           | .NET application process) for diagnosing an intermittent out
           | of memory exception in a production application. The answer
           | as far as I can tell is that it's not possible in-process.
           | Maybe it's possible out of process using the profiling API,
           | but that doesn't help me in a locked-down k8s pod/container
           | in AWS.
        
             | neonsunset wrote:
             | From within the process it might be difficult*, but please
             | do give this a read https://learn.microsoft.com/en-
             | us/dotnet/core/diagnostics/du... and dotnet-dump + dotnet-
             | trace a try.
             | 
             | If you are still seeing the issue with memory and GC, you
             | can submit it to https://github.com/dotnet/runtime/issues
             | especially if you are doing something that is expected to
             | just work(tm).
             | 
             | * difficult as in retrieving data detailed enough to trace
             | individual allocations, otherwise `GC.GetGCMemoryInfo()`
             | and adjacent methods can give you high-level overview.
             | There are more advanced tools but I always had the option
             | to either use remote debugging in Windows Server days and
             | dotnet-dump and dotnet-trace for containerized applications
             | to diagnose the issues, so haven't really explored what is
             | needed for the more locked down environments.
        
             | empath75 wrote:
             | I think once you understand that they're prone to do that,
             | it's less of a problem in practice. You just don't ask it
             | questions that requires detailed knowledge of an API unless
             | it's _extremely_ popular. Like in kubernetes terms, it's
             | safe to ask it about a pod spec, less safe to ask it
             | details about istio configuration and even less safe to ask
             | it about some random operator with 50 stars on github.
             | 
             | Mostly it's good at structure and syntax, so I'll often
             | find the library/spec I want, paste in the relevant
             | documentation and ask it to write my function for me.
             | 
             | This may seem like a waste of time because once you've got
             | the documentation you can just write the code yourself, but
             | A: that takes 5 times as long and B: I think people
             | underestimate how much general domain knowledge is buried
             | in chatgpt so it's pretty good at inferring the details of
             | what you're looking for or what you should have asked
             | about.
             | 
             | In general, I think the more your interaction with chatgpt
             | is framed as a dialogue and less as a 'fill in the blanks'
             | exercise, the more you'll get out of it.
        
           | BurningFrog wrote:
            | If I ever let an AI write code, I'd write serious tests for
           | it.
           | 
           | Just like I do with my own code.
           | 
           | Both AI and I "hallucinate" sometimes, but with good tests
           | you make things work.
        
           | bredren wrote:
           | This problem applies almost universally as far as I can tell.
           | 
           | If you are knowledgeable on a subject matter you're asking
           | for help with, the LLM can be guided to value. This means you
           | do have to throw out bad or flat out wrong output regularly.
           | 
           | This becomes a problem when you have no prior experience in a
           | domain. For example reviewing legal contracts about a real
           | estate transaction. If you aren't familiar enough with the
           | workflow and details of steps you can't provide critique and
           | follow-on guidance.
           | 
           | However, the response still stands before you, and it can be
           | tempting to glom onto it.
           | 
           | This is not all that different from the current experience
           | with search engines, though. Where if you're trying to get an
           | answer to a question, you may wade through and even initially
           | accept answers from websites that are completely wrong.
           | 
           | For example, products to apply to the foundation of an old
           | basement. Some sites will recommend products that are not
           | good at all, but do so because the content owners get
           | associate compensation for it.
           | 
           | The difference is that LLM responses appear less biased (no
           | associate links, no SEO keyword targeting), but are still
           | wrong.
           | 
           | All that said, sometimes LLMs just crush it when details
           | don't matter. For example, building a simple cross-platform
            | pyqt-based application. Search engine results cannot do
            | this. Whereas, at least for rapid prototyping, GPT is very,
           | very good.
        
       | jmount wrote:
        | Since the whole thing is behind an API, exposing the works adds
       | little value. If the corrections worked at an acceptable rate,
       | one would just want them applied at the source.
        
         | renewiltord wrote:
         | > _If the corrections worked at an acceptable rate, one would
         | just want them applied at the source._
         | 
         | What do you mean? The model is for improving their RLHF
          | trainers' performance. RLHF does get applied "at the source" so
         | to speak. It's a modification on the model behind the API.
         | 
          | Perhaps you could say what you think this thing is for and
          | then share why you think it's not "at the source".
        
           | Panoramix wrote:
           | Not OP but the screenshot in the article pretty much shows
            | something that's not at the source.
           | 
           | You'd like to get the "correct" answer straight away, not
           | watch a discussion between two bots.
        
             | IanCal wrote:
             | Yes, but this is about helping the people who are training
             | the model.
        
             | ertgbnm wrote:
             | You are missing the point of the model in the first place.
             | By having higher quality RLHF datasets, you get a higher
             | quality final model. CriticGPT is not a product, but a tool
             | to make GPT-4 and future models better.
        
       | teaearlgraycold wrote:
       | Sounds like a combination of a GAN[1] and RLHF. Not surprising
       | that this works.
       | 
       | [1] -
       | https://en.wikipedia.org/wiki/Generative_adversarial_network
        
       | megaman821 wrote:
       | I wonder if you could apply this to training data. Like here is
       | an example of a common mistake and why that mistake could be
       | made, or here is a statement made in jest and why it could be
       | found funny.
        
       | wcoenen wrote:
       | This is about RLHF training. But I've wondered if something
       | similar could be used to automatically judge the quality of the
       | data that is used in pre-training, and then spend more compute on
       | the good stuff. Or throw out really bad stuff even before
       | building the tokenizer, to avoid those "glitch token" problems.
       | Etc.
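        | 
        | Something like this, very roughly (quality_score below is just a
        | placeholder heuristic; in practice it might be a small judge
        | model grading each document):
        | 
        |     def quality_score(doc: str) -> float:
        |         # Stand-in for a real data-quality judge.
        |         return min(1.0, len(set(doc.split())) / 100)
        | 
        |     corpus = ["a long, varied web page ...",
        |               "t0k3n s0up t0k3n s0up t0k3n s0up"]
        |     # Throw out the worst documents before tokenization...
        |     keep = [d for d in corpus if quality_score(d) > 0.05]
        |     # ...and upweight the rest to spend more compute on them.
        |     weights = {d: quality_score(d) for d in keep}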
        
         | smsx wrote:
         | Yup, check out Rho-1 by microsoft research.
        
       | integral_1699 wrote:
       | I've been using this approach myself, albeit manually, with
       | ChatGPT. I first ask my question, then open a new chat and ask it
       | to find flaws with the previous answer. Quite often, it does
       | improve the end result.
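        | 
        | For reference, a minimal sketch of that two-pass flow with the
        | OpenAI Python client (the model name and prompts are just
        | examples):
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        | 
        |     def ask(prompt):
        |         resp = client.chat.completions.create(
        |             model="gpt-4o",  # example model
        |             messages=[{"role": "user",
        |                        "content": prompt}])
        |         return resp.choices[0].message.content
        | 
        |     answer = ask("How do I do X in Python?")
        |     # Fresh request, so the critique isn't anchored to
        |     # the earlier conversation.
        |     critique = ask("Find flaws in this answer:\n" + answer)
        |     print(critique)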
        
       | rodoxcasta wrote:
       | > Additionally, when people use CriticGPT, the AI augments their
       | skills, resulting in more comprehensive critiques than when
       | people work alone, and fewer hallucinated bugs than when the
       | model works alone.
       | 
       | But, as per the first graphic, CriticGPT alone has better
       | comprehensiveness than CriticGPT+Human? Is that right?
        
       | victor9000 wrote:
       | This gets at the absolute torrent of LLM diarrhea that people are
       | adding to PRs these days. The worst of it seems to come from
       | junior and first time senior devs who think more is more when it
       | comes to LoC. PR review has become a nightmare at my work where
       | juniors are now producing these magnificent PRs with dynamic
       | programming, esoteric caching, database triggers, you name it.
       | People are using LLMs to produce code far beyond their abilities,
       | wisdom, or understanding, producing an absolute clusterfuck of
       | bugs and edge cases. Anyone else dealing with something similar?
       | How are you handling it?
        
         | Zee2 wrote:
         | My company simply prohibits any AI generated code. Seems to
         | work rather well.
        
           | ffsm8 wrote:
            | My employer went all in and pays for both enterprise
            | subscriptions (GitHub Copilot + ChatGPT Enterprise, which is
            | just a company-branded version of the regular interface).
           | 
           | We've even been getting "prompt engineering" meeting invites
           | of 3+ hours to get an introduction into their usage. 100-150
           | participants each time I joined
           | 
            | It's amazing how much they're valuing it. From my experience
            | it's usually a negative productivity multiplier (x0.7 vs x1
            | without either).
        
           | gnicholas wrote:
           | How is this enforced? I'm not saying it isn't a good idea,
           | just that it seems like it would be tricky to enforce.
           | Separately, it could result in employees uploading code to a
           | non-privacy-respecting AI, whereas if employees were allowed
           | to use a particular AI then the company could better control
           | privacy/security concerns.
        
         | ssl-3 wrote:
         | Why not deal with people who create problems like this the same
         | way as one would have done four years ago?
         | 
         | If they're not doing their job, then why do they still have
         | one?
        
         | xhevahir wrote:
         | When it gives me a complicated list expression or regex or
         | something I like to ask ChatGPT to find a simpler way of doing
         | the same thing, and it usually gives me something simpler that
         | still works. Of course, you do have to ask, rather than simply
         | copy-paste its output right into an editor, which is probably
         | one step too many for some.
        
         | crazygringo wrote:
         | How is that different from junior devs writing bad code
         | previously? The more things change, the more things stay the
         | same.
         | 
         | You handle it by teaching them how to write good code.
         | 
         | And if they refuse to learn, then they get bad performance
         | reviews and get let go.
         | 
         | I've had junior devs come in with all sorts of bad habits, from
         | only using single-letter variable names and zero commenting, to
         | thinking global variables should be used for everything, to
         | writing object-oriented monstrosities with seven layers of
         | unnecessary abstractions instead of a simple function.
         | 
         | Bad LLM-generated code? It's just one more category of bad
         | code, and you treat it the same as all the rest. Explain why
         | it's wrong and how to redo it.
         | 
         | Or if you want to fix it at scale, identify the common bad
         | patterns and make avoiding them part of your company's
         | onboarding/orientation/first-week-training for new devs.
        
           | okdood64 wrote:
           | > How is that different from junior devs writing bad code
           | previously?
           | 
           | Because if it's bad, at least it's simple. Meaning simple to
           | review, quickly correct and move on.
        
             | crazygringo wrote:
             | Like I said, with "object-oriented monstrosities", it's
             | not like it was always simple before either.
             | 
             | And if you know a solution should be 50 lines and they've
             | given you 500, it's not like you have to read it all -- you
             | can quickly figure out what approach they're using and
             | discuss the approach they should be using instead.
        
             | phatfish wrote:
             | Maybe fight fire with fire. Feed the ChatGPT PR to ChatGPT
             | and ask it to do a review, paste that as the comment. It
             | will even do the markdown for you!
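             | 
             | Roughly, with the standard openai Python client (the
             | model name and prompt wording are just placeholders):
             | 
             |     from openai import OpenAI
             | 
             |     client = OpenAI()  # reads OPENAI_API_KEY from env
             | 
             |     def review_diff(diff: str) -> str:
             |         # ask the model for a markdown review of a diff
             |         prompt = ("Review this diff and point out "
             |                   "likely bugs:\n\n" + diff)
             |         resp = client.chat.completions.create(
             |             model="gpt-4o",
             |             messages=[{"role": "user",
             |                        "content": prompt}],
             |         )
             |         return resp.choices[0].message.content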
        
         | kenjackson wrote:
         | In the code review, can't you simply say, "This is too
         | complicated for what you're trying to do -- please simplify"?
        
           | lcnPylGDnU4H9OF wrote:
           | Not quite the same but might be more relevant depending on
           | context: "If you can't articulate what it does then please
           | rewrite it such that you can."
        
         | liampulles wrote:
         | Maybe time for some pair programming?
        
           | surfingdino wrote:
           | No.
        
             | wholinator2 wrote:
             | That would be interesting though. What happens when one
             | programmer attempts to chatgpt in pair programming. It's
             | almost like they're already pair programming, just not with
             | you!
        
               | surfingdino wrote:
               | They are welcome to do so, but not on company time. We do
               | not find those tools useful at all, because we are
               | generally hired to write new stuff and ChatGPT or other
               | tools are useless when there are no good examples to
               | steal from (e.g. darker corners of AWS that people don't
               | bother to offer solutions for) or when there is a known
               | bug or there are only partial workarounds available for
               | it.
        
         | ganzuul wrote:
         | Are they dealing with complexity which isn't there in order to
         | appear smarter?
        
           | jazzyjackson wrote:
           | IMO they're just impressed the AI came up with something
           | that actually runs and aren't skilled enough to recognize
           | there's a simpler way to do it.
        
         | surfingdino wrote:
         | I work for clients that do not allow this shit, because their
         | security teams and lawyers won't have it. But... they have in-
         | house "AI ambassadors" (your typical useless middle managers,
         | BAs, project managers, etc.) who see promoting AI as a survival
         | strategy. On the business side these orgs are leaking data,
         | internal comms, and PII like a sieve, but the software side is
         | free of AI. For now.
        
       | neom wrote:
       | I was curious about the authors, did some digging, they've
       | published some cool stuff:
       | 
       | Improving alignment of dialogue agents via targeted human
       | judgements - https://arxiv.org/abs/2209.14375
       | 
       | Teaching language models to support answers with verified quotes
       | - https://arxiv.org/abs/2203.11147
        
         | VWWHFSfQ wrote:
         | It's an interesting dichotomy happening in the EU vs. USA in
         | terms of how these kinds of phenomena are discovered,
         | presented, analyzed, and approached.
         | 
         | The EU seems to lean very much toward a regulate-early,
         | safety-first approach, whereas the USA leans toward
         | unregulated: move fast, break things, assess the damage,
         | regulate later.
         | 
         | I don't know which is better or worse.
        
           | ipaddr wrote:
           | Regulating before you understand the problem seems like a
           | poor approach.
        
           | l5870uoo9y wrote:
           | As a European: after decades of regulations and fines
           | without much to show for them, nobody in the industry
           | believes the EU is capable of creating a tech ecosystem.
           | Perhaps even that the EU is part of the problem and that
           | individual countries could independently move much faster.
        
       | jimmytucson wrote:
       | What's the difference between CriticGPT and ChatGPT with a
       | prompt that says "You are a software engineer, your job is to
       | review this code and point out bugs, here is what the code is
       | supposed to do: {the original prompt}, here is the code
       | {original response}, review the code," etc.?
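       | 
       | Concretely, the prompt-only baseline I have in mind would look
       | something like this (wording and placeholders are mine, purely
       | illustrative, not anything OpenAI has published):
       | 
       |     # task = the original prompt, code = the model's answer
       |     def critic_messages(task: str, code: str) -> list[dict]:
       |         system = ("You are a software engineer. Review the "
       |                   "code and point out bugs.")
       |         user = (f"Here is what the code is supposed to do:\n"
       |                 f"{task}\n\nHere is the code:\n{code}\n\n"
       |                 f"Review the code.")
       |         return [{"role": "system", "content": system},
       |                 {"role": "user", "content": user}]
       | 
       | Presumably the difference, as other commenters note, is the
       | extra RLHF training on human-written critiques rather than
       | prompting alone.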
        
         | ipaddr wrote:
         | $20 a month
        
       | advael wrote:
       | This is a really bizarre thing to do honestly
       | 
       | It's plausible that there are potential avenues for improving
       | language models through adversarial learning. GANs and Actor-
       | Critic models have done a good job in narrow-domain generative
       | applications and task learning, and I can make a strong
       | theoretical argument that you can do something that looks like
       | priority learning via adversarial equilibria
       | 
       | But why in the world are you trying to present this as a human-
       | in-the-loop system? This makes no sense to me. You take an error-
       | prone generative language model and then present another instance
       | of an error-prone generative language model to "critique" it for
       | the benefit of... a human observer? The very best case here is
       | that this wastes a bunch of heat and time for what can only be a
       | pretty nebulous potential gain to the human's understanding
       | 
       | Is this some weird gambit to get people to trust these models
       | more? Is it OpenAI losing the plot completely because they're
       | unwilling to go back to open-sourcing their models but addicted
       | to the publicity of releasing public-facing interfaces to them?
       | This doesn't make sense to me as a research angle or as a product
       | 
       | I can really see the Microsoft influence here
        
         | kenjackson wrote:
         | It's for their RLHF pipeline to improve labeling. Honestly,
         | this seems super reasonable to me. I don't get why you think
         | this is such a bad idea for this purpose...
        
           | advael wrote:
           | RLHF to me seems more like a PR play than anything else, but
           | inasmuch as it does anything useful, adding a second LLM to
           | influence the human that's influencing the LLM doesn't solve
           | any of the fundamental problems of either system. If anything
           | it muddies the waters more, because we have already seen that
           | humans are probably too credulous of the information
           | presented to them by these models. If you want adversarial
           | learning, there are far more efficient ways to do it. If you
           | want human auditing, the best case here is that the second
           | LLM doesn't influence the human's decisions at all (because
           | any influence reduces the degree to which this is independent
           | feedback)
        
             | vhiremath4 wrote:
             | This is kind of what I was thinking. I don't get it. It
             | seems like CriticGPT was maybe trained using RM/RL with PPO
             | as well? So there's gonna be mistakes with what CriticGPT
             | pushes back on which may make the labeler doubt themselves?
        
             | kenjackson wrote:
             | This is not adversarial learning. It's really about
             | augmenting the ability of humans to determine if a snippet
             | of code is correct and write proper critiques of incorrect
             | code.
             | 
             | Any system that helps you more accurately label data with
             | good critiques should help the model. I'm not sure how you
             | come to your conclusion. Do you have some data indicating
             | that, even with improved accuracy, some LLM bias would
             | lead to a worse-trained model? I haven't seen that data
             | or assertion elsewhere, but that's the only thing I can
             | gather you might be referring to.
        
               | advael wrote:
               | Well, first of all, the stated purpose of RLHF isn't to
               | "improve model accuracy" in the first place (and what we
               | mean by accuracy here is pretty fraught by itself, as
               | this could mean at least three different things). They
               | initially pitched it as a "safety" measure (and I think
               | that, even if it wasn't immediately obvious how
               | nonsensical a claim that is, it should at least be
               | apparent, now that the company has shucked nearly the
               | entire subset of its members who claimed to care about
               | "AI safety", that this is not a priority)
               | 
               | The idea of RLHF as a mechanism for tuning models based
               | on the principle that humans might have some hard-to-
               | capture insight that could steer them independent of the
               | way they're normally trained is the very best steelman
               | for its value I could come up with. This aim is directly
               | subverted by trying to use another language model to
               | influence the human rater, so from my perspective it
               | really brings us back to square one on what the fuck RLHF
               | is supposed to be doing
               | 
               | Really, a lot of this comes down to what these models do
               | versus how they are being advertised. A generative
               | language model produces plausible prose that follows from
               | the prompt it receives. From this, the claim that it
               | should write working code is actually quite a bit
               | stronger than the claim that it should write true facts,
                | because plausible autocompletion will learn to mimic
               | syntactic constraints but actually has very little to do
               | with whether something is true, or whatever proxy or
               | heuristic we may apply in place of "true" when assessing
               | information (supported by evidence, perhaps. Logically
               | sound, perhaps. The distinction between "plausible" and
               | "true" is in many ways the whole point of every human
               | epistemology). Like if you ask something trained on all
               | human writing whether the Axis or the Allies won WWII,
               | the answer will depend on whether you phrased the
                | question in a way that sounds like Philip K. Dick would
               | write it. This isn't even incorrect behavior by the
               | standards of the model, but people want to use these
               | things like some kind of oracle or to replace google
               | search or whatever, which is a misconception about what
               | the thing does, and one that's very profitable for the
               | people selling it
        
       | rvz wrote:
       | And both can still be wrong as they have no understanding of the
       | mistake.
        
       | lowyek wrote:
       | I find it fascinating that in other fields you see a lot of
       | theorems/results well before practical results are found, but
       | at this forefront of innovation I have hardly seen any paper
       | discussing hallucinations and lower/upper bounds on them. Or
       | maybe I didn't open Hacker News on the right day when it was
       | published. Would love to understand the hallucination
       | phenomenon more deeply and the mathematics behind it.
        
         | dennisy wrote:
         | Not sure if there is a great deal of maths to understand. The
         | output of an LLM is stochastic by nature, and will read as
         | syntactically perfect, AKA a hallucination.
         | 
         | No real way to mathematically prove this, considering there is
         | also no way to know if the training data also had this
         | "hallucination" inside of it.
        
           | ben_w wrote:
           | I think mathematical proof is the wrong framework, in the
           | same way that chemistry is the wrong framework for precisely
           | quantifying and explaining how LSD causes humans to
           | hallucinate (you can point to which receptors it binds with,
           | but AFAICT not much more than that).
           | 
           | Investigate it with the tools of psychology, as suited for
           | use on a new non-human creature we've never encountered
           | before.
        
             | lowyek wrote:
             | I liked the analogy!
        
         | amelius wrote:
         | I don't see many deep theorems in the field of psychology
         | either.
        
           | lowyek wrote:
           | I don't know how to respond to this. But my understanding of
           | this term is changing based on this discussion with all of
           | you.
        
         | beernet wrote:
         | How are 'hallucinations' a phenomenon? I have trouble with
         | the term 'hallucination' and believe it sets the wrong
         | narrative. It suggests something negative or unexpected,
         | which it absolutely is not. Language models aim at, as their
         | name implies, modeling language. Not facts or anything of the
         | sort. This is by design and you certainly don't have to be an
         | AI researcher to grasp that.
         | 
         | That being said, people new to the field tend to believe that
         | these models are fact machines. In fact, they are the complete
         | opposite.
        
           | lowyek wrote:
           | I believe 'hallucinations' has become an umbrella term for
           | all cases of failure where an LLM is not doing what we
           | expect it to do, or is generating output that is not
           | aligned with the prompt it is given. As these models scale,
           | many such issues get reduced, don't they? For example, in
           | the Sora paper OpenAI mentioned that the quality of the
           | videos it was able to generate improved as they applied
           | more compute and scaling.
           | 
           | I think I won't be able to justify my use of the word
           | 'phenomenon' here. But my intent was to point out that this
           | problem, which is so widely discussed online with respect
           | to LLMs, surprisingly seems less studied; or maybe there is
           | a term known just within the LLM research community which
           | generally is not used elsewhere.
        
         | hbn wrote:
         | > the hallucination phenomena
         | 
         | There isn't really such a thing as a "hallucination" and
         | honestly I think people should be using the word less.
         | Whether an LLM tells you the sky is blue or the sky is
         | purple, it's not doing anything different. It's just spitting
         | out a sequence of characters it was trained to produce in the
         | hope that it's what a user wants. There is no definable
         | failure state you can call a "hallucination"; it's operating
         | as correctly as any other output. But sometimes we can tell,
         | either immediately or through fact-checking, that it spat out
         | a string of text that claims something incorrect.
         | 
         | If you start asking an LLM for political takes, you'll get very
         | different answers from humans about which ones are
         | "hallucinations"
        
           | mortenjorck wrote:
           | It is an unfortunately anthropomorphizing term for a
           | transformer simply operating as designed, but the thing it's
           | become a vernacular shorthand for, "outputting a sequence of
           | tokens representing a claim that can be uncontroversially
           | disproven," is still a useful concept.
           | 
           | There's definitely room for a better label, though.
           | "Empirical mismatch" doesn't quite have the same ring as
           | "hallucination," but it's probably a more accurate place to
           | start from.
        
             | NovemberWhiskey wrote:
             | > _" outputting a sequence of tokens representing a claim
             | that can be uncontroversially disproven," is still a useful
             | concept._
             | 
             | Sure, but that would require semantic mechanisms rather
             | than statistical ones.
        
             | hbn wrote:
             | Regardless I don't think there's much to write papers on,
             | other than maybe an anthropological look at how it's
             | affected people putting too much trust into LLMs for
             | research, decision-making, etc.
             | 
             | If someone wants info to make their model to be more
             | reliable for a specific domain, it's in the existing papers
             | on model training.
        
               | lowyek wrote:
               | After reading the replies, what I grasp is that it's
               | just a popular term, and that what we call LLM
               | hallucination is just expected behaviour. LLM
               | researchers might call it inference error or something
               | similar that just didn't get as popular as this term.
        
           | emporas wrote:
           | Chess engines, which have been used daily by the best human
           | chess players for 25 years, compute the best next move on
           | the board. The total number of all possible chess positions
           | is more than all the atoms in the universe.
           | 
           | Is it possible for a chess engine to compute the next move
           | and be absolutely sure it is the best one? It's not; it is
           | a statistical approximation, but still very useful.
        
           | lowyek wrote:
           | I am maybe failing to express this in words correctly. Let
           | me try again. Between models like llama3b, llama7b and
           | llama70b there is also a clear-cut difference: the output
           | of llama70b is far more often correct for the input tokens
           | than that of the smaller models on the same task. But I
           | agree the term 'hallucination' shouldn't be used, as it
           | hides the nature of the issue in these wrong outputs. When
           | an LLM is not behaving as expected, we end up saying it's
           | hallucinating rather than saying it failed in manner X.
        
           | raincole wrote:
           | I don't know why the narrative became "don't call it
           | hallucination". Granted, English isn't my mother tongue so
           | I might miss some subtlety here. If you know how an LLM
           | works, calling it "hallucination" doesn't make you know
           | less. If you don't know how an LLM works, using
           | "hallucination" doesn't make you know less either. It's
           | just a word meaning the AI gives a wrong[1] answer.
           | 
           | People say it's "anthropomorphizing" but honestly I can't see
           | it. The I in AI stands for intelligence, is this
           | anthropomorphizing? L in ML? Reading and writing are clearly
           | human activities, so is using read/write instead of
           | input/output anthropomorphizing? How about "computer", a word
           | once meant a human who does computing? Is there a word we can
           | use safely without anthropomorphizing?
           | 
           | [1]: And please don't argue what's "wrong".
        
             | th0ma5 wrote:
             | AI is a nebulous, undefined term, and many people
             | specifically criticize the use of the word intelligent.
        
           | sandworm101 wrote:
           | Hallucination is emergent. It cannot be found as a thing
           | inside the AI systems. It is a phenomenon that only exists
           | when the output is evaluated. That makes it an accurate
           | description. A human who has hallucinated something is not
           | lying when they speak of something that never actually
           | happened, nor are they making any sort of mistake in their
           | recollection. Similarly, an AI that is hallucinating isn't
           | doing anything incorrect and doesn't have any motivation.
           | The hallucinated data emerges just as any other output,
           | only to be evaluated by outsiders as incorrect.
        
             | lowyek wrote:
             | This is very interesting and insightful.
        
           | IanCal wrote:
           | It's the new "serverless" and I would really like people to
           | stop turning the discussion into one about the word. You
           | know what it means, I know what it means, let's all move
           | on.
           | 
           | We won't, and we'll see this constant distraction.
        
         | cainxinth wrote:
         | Not a paper, but a startup called Vectara claimed to be
         | investigating LLM hallucination/ confabulation rates last year:
         | 
         | https://www.nytimes.com/2023/11/06/technology/chatbots-hallu...
        
           | lowyek wrote:
           | thank you for sharing this!
        
       | bluelightning2k wrote:
       | Evaluators with CriticGPT outperform those without 60% of the
       | time.
       | 
       | So, slightly better than random chance. I guess a win is a win,
       | but I would have thought this would be higher. I'd kind of have
       | assumed that just asking GPT itself if it's sure would give
       | this kind of lift.
        
       ___________________________________________________________________
       (page generated 2024-06-27 23:00 UTC)