[HN Gopher] Prover-Verifier Games improve legibility of language...
       ___________________________________________________________________
        
       Prover-Verifier Games improve legibility of language model outputs
        
       Author : davidbarker
       Score  : 126 points
       Date   : 2024-07-17 17:15 UTC (1 day ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | rjeli wrote:
       | Funny that when I reached the "Key Findings" section, my brain
       | immediately parsed it as ChatGPT output. Maybe it's the bullet
       | points, the word choice, or just the font...
        
         | Der_Einzige wrote:
          | There appears to be a coherent effort among the general
          | populace, conscious or unconscious, to shape discourse
          | going forward to look more ChatGPT-style in general. Words
          | like "delve", "crucial", etc. have become more common even
          | among real people in face-to-face communication, and in
          | record time.
         | 
         | Much as I find it overly formal, I support it on the grounds
         | that it frustrates attempts to "detect" if LLMs are used and
         | that is very good.
        
           | pfdietz wrote:
           | Well, as long as they don't delve too greedily and too deep.
        
           | drdeca wrote:
           | > it frustrates attempts to "detect" if LLMs are used and
           | that is very good.
           | 
           | Why is that good?
        
             | Der_Einzige wrote:
             | If you are asking, you're the kind of person it's designed
             | to frustrate. Good. Stay frustrated.
        
         | molave wrote:
         | I can tell technical papers influenced ChatGPT's outputs the
         | most. Most of the articles generated using it may be
          | regurgitated, but I can't deny how easily digestible the
          | info is when presented that way.
        
       | c3534l wrote:
        | It seems like a lot of people these days are doing generative
        | adversarial AI and then pretending they invented a new thing.
        
         | HanClinto wrote:
          | GANs have been used for a long time to improve the training
          | of image models -- it seems like we're finally starting to
          | see this approach catch on for LLMs.
         | 
         | I'm aware of the SPAG paper -- who else have you seen take this
         | approach with LLMs lately?
         | 
         | https://github.com/Linear95/SPAG
        
         | nilamo wrote:
         | I was thinking the same thing. GANs aren't new, but it's cool
         | that we're using them in new ways.
        
       | HanClinto wrote:
       | Beautiful!
       | 
       | OpenAI isn't just training a model to produce more-verifiable
       | correct answers -- it's leveraging an adversarial relationship to
       | train a model that's better at being correct, and also a model
       | that's better at deceiving / being wrong. This is the key. There
       | are three agents here:
       | 
       | * A "verifier" (a small model, whose job it is to discern correct
       | answers from incorrect answers)
       | 
       | * A "helpful prover" (blue team, whose job it is to produce
       | correct answers with an easy-to-follow explanation)
       | 
       | * A "sneaky prover" (red team, whose job it is to produce
       | incorrect answers with a deceptive explanation)
       | 
       | By arranging these three models in an adversarial relationship
        | with a _true reinforcement learning feedback loop_, the entire
        | system grows and gets better.
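        | 
        | If it helps to see the shape of it, here's a rough,
        | hypothetical sketch of one round of that loop. None of these
        | function names come from the paper -- the provers and the
        | verifier are just stubs, so that the reward structure is
        | visible:
        | 
        |   import random
        | 
        |   # Toy stand-ins for the three models. In the real setup
        |   # these are LLM policies updated by RL; here they are
        |   # placeholders so the round structure is runnable.
        |   def helpful_prover(problem):
        |       # Aims for a correct, easy-to-check solution.
        |       return {"answer": problem["truth"],
        |               "explanation": "clear step-by-step work"}
        | 
        |   def sneaky_prover(problem):
        |       # Aims for a wrong answer that still reads well.
        |       return {"answer": problem["truth"] + 1,
        |               "explanation": "plausible-looking work"}
        | 
        |   def verifier_score(solution):
        |       # Small model's 0..1 estimate that the solution
        |       # is correct; a random stub here.
        |       return random.random()
        | 
        |   def is_correct(problem, solution):
        |       # Ground truth: for grade-school math, just
        |       # compare against the known answer.
        |       return solution["answer"] == problem["truth"]
        | 
        |   def play_round(problem):
        |       rewards = {}
        |       provers = {"helpful": helpful_prover,
        |                  "sneaky": sneaky_prover}
        |       for role, prover in provers.items():
        |           sol = prover(problem)
        |           convincing = verifier_score(sol)
        |           correct = is_correct(problem, sol)
        |           if role == "helpful":
        |               # Wants to be correct AND convincing.
        |               reward = convincing if correct else -1.0
        |           else:
        |               # Wants to be wrong AND still convincing.
        |               reward = convincing if not correct else -1.0
        |           rewards[role] = reward
        |       # Separately, the verifier is trained to score
        |       # correct solutions near 1, incorrect ones near 0.
        |       return rewards
        | 
        |   print(play_round({"question": "2 + 2", "truth": 4}))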
       | 
       | This is fantastic to read, and corroborates the results achieved
       | by SPAG -- easily one of my favorite papers from the past year.
       | SPAG pioneered (as far as I'm aware) the approach of using
       | adversarial language games in a true reinforcement-learning setup
        | (not merely RLHF, which isn't true RL), and showed that training
        | models in adversarial language games can yield generalized
        | improvements even in areas not directly related to the game. [1]
       | 
       | Ever since the SPAG paper came out, I've been daydreaming about
       | the different sorts of adversarial games that one could use to
       | train LLMs. I've written down a bunch of notes on the subject [2]
       | (in case anyone else is reading my rambling notes).
       | 
       | I would really like to see some of these experiments actually get
       | up and running on open-source LLMs -- I'm excited to see if / how
       | they could be used to improve the quality of some of the open-
       | source base models that are floating around out there.
       | 
       | [1] https://github.com/Linear95/SPAG
       | 
       | [2] https://github.com/HanClinto/MENTAT
        
         | HanClinto wrote:
         | Because the Red and Blue agents are both trying to convince a
          | smaller language model of the correctness of their answers,
          | they each have to simplify their logic and wording.
         | 
         | This feels like the ML equivalent of the old adage "If you
          | can't explain it to a six-year-old, you don't understand it
         | yourself."
        
           | enthulhusiastic wrote:
           | ELI6 why SPAG is better than just the default pretraining
           | method (token context statistics?) of an LLM.
        
             | TimPC wrote:
              | The red and blue agents are effectively unlimited sources
              | of true and false examples, so you can get far more
              | efficient scale than you can by pre-training with
              | labelled inputs. It's also far more targeted on
              | correct/incorrect rather than a notion of answer quality,
              | which doesn't directly get at hallucination vs. reality.
        
               | blueblaze0 wrote:
               | This is impressive, but what prevents the blue agent from
               | generating an incorrect proof of a "true example"? What
               | prevents the red agent from generating a correct disproof
               | of a "false example"? I'm curious how they managed to
               | generate a truly unlimited source of correctly labeled
               | examples.
        
               | HanClinto wrote:
               | > "but what prevents the blue agent from generating an
               | incorrect proof of a "true example"?
               | 
               | That's the role of the Verifier. It's not going to be
               | perfect, and I'm sure some incorrect proofs of true
               | examples slip through, but it's good enough to increase
               | the quality of the model overall.
               | 
               | > "What prevents the red agent from generating a correct
               | disproof of a "false example"?
               | 
               | And on the other side, it's counterbalanced by the rules
               | engine (math) that can determine absolutely whether or
               | not the right answer is given at the end.
               | 
               | The Red and the Blue agents are held in check by the
               | tension between the math engine and the verifier, and
               | they are free to fight back-and-forth within those
               | parameters as long as they are able. Eventually, I think
               | the Red agent loses the ability to attack effectively,
               | and so that's the big limit on OpenAI's arrangement. This
               | particular game isn't balanced enough for this training
                | loop to continue indefinitely.
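                | 
                | To make that tension concrete, here's a tiny,
                | hypothetical sketch of the verifier's training signal
                | -- its label comes from the answer checker, not from
                | either prover's prose. The names are mine, not the
                | paper's:
                | 
                |   import math
                | 
                |   def check_answer(problem, answer):
                |       # Deterministic "math engine": compare
                |       # against the known correct answer.
                |       return answer == problem["truth"]
                | 
                |   def bce(p, label, eps=1e-9):
                |       # Binary cross-entropy: push the verifier
                |       # toward 1 for correct, 0 for incorrect.
                |       a = label * math.log(p + eps)
                |       b = (1 - label) * math.log(1 - p + eps)
                |       return -(a + b)
                | 
                |   def verifier_loss(problem, sol, p_correct):
                |       ok = check_answer(problem, sol["answer"])
                |       return bce(p_correct, 1.0 if ok else 0.0)
                | 
                |   # A sneaky proof of a wrong answer gets label 0
                |   # no matter how persuasive the prose is.
                |   print(verifier_loss({"truth": 4},
                |                       {"answer": 5}, 0.9))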
        
               | Natsu wrote:
               | But how do we know the answer you gave us wasn't
               | generated by the sneaky prover? :)
        
               | HanClinto wrote:
               | At least in the context of this game, we essentially
               | check the answer with a calculator (which the Verifier
               | program doesn't have access to).
        
             | HanClinto wrote:
             | I don't think of SPAG as a replacement for pretraining. For
             | SPAG to work effectively, I would think that it would have
             | to start with an LLM that is pretrained with self-
             | supervised / imitation learning on regular next-token
             | prediction. Think of SPAG as more of a competitor to RLHF
             | than to pretraining. RL is what gave AlphaGo the edge to
              | finally go beyond merely imitating human games and
              | achieve something new.
             | 
             | RLHF isn't true RL, because it's still based on imitating
             | human preferences, and has trouble going beyond that. Once
             | it achieves the plateau of "human preference", then there's
             | nowhere else to go. That's one theory of why LLMs are
             | asymptotically approaching human-level performance -- we're
             | limited by imitation, or at the very least -- human
             | judgement. We need super-human judgement to achieve super-
             | human performance, and that's where we need true RL.
             | 
             | But you asked me to ELI6, so here goes. Warning -- wall-of-
             | text incoming:
             | 
             | <ELI6>
             | 
             | Similar to how small kids often play games to learn,
             | programmers train LLMs (like ChatGPT) with simple games
              | too. The first stage (kind of like kindergarten) is the
             | "pretraining" or "imitation learning" phase. This is where
             | we teach the LLM to imitate us one word at a time. We play
             | a simple game where I say something, but then I stop
             | suddenly, and it tries to guess the missing word that will
             | come next. Like, "My favorite food is..." and the LLM tries
             | to guess which word I'm thinking of. Or I'll say something
             | with a missing word in the middle like: "At my _____ party,
             | I opened a bunch of presents" -- and the LLM needs to guess
             | what the missing word is. We only play this game one word
             | at a time, and so it's a very simple game -- but it's very
             | important to learn the basics of language. This is what we
             | call "pretraining".
             | 
              | After the LLM gets good at that, it can graduate from
              | kindergarten and move to first grade. Here we play another
             | game, and this is called "instruction-tuning" -- it's where
             | we give it a set of instructions and it needs to do its
             | best to obey. Like, "Arrange the letters T P C G A in
             | alphabetical order" and it tries to get the right answer.
             | 
             | This is fun for a while, but sometimes we want to give it
             | more complicated instructions. Things like "write me a poem
             | about puppies" or "tell me a story about a dragon". And
             | those are things that don't have answers that are clearly
             | right or clearly wrong, but we still need to tell it if it
             | did a good job or a bad job. How do we tell if it was a
             | good poem, or a good story? Well, you need to have someone
             | listen to them and judge it -- which means we need to have
             | people read ALL these dragon stories and ALL these puppy
             | poems and mark which ones are their favorites.
             | 
             | I like reading puppy poems and reading dragon stories, but
             | if I had to do it all day every day, I think I would get
             | pretty tired of it pretty fast, don't you?
             | 
             | So when people get tired of doing boring things, the best
             | thing is to have a robot do their job! They can do the
             | boring things (they never get tired of it!) and we get to
             | go do fun things. So how do we train a robot to judge the
             | poems?
             | 
             | Well, we use this technique called RLHF (Reinforcement
              | Learning from Human Feedback), where we ask a bunch of
             | people -- given Option A and Option B -- to say which one
             | is their favorite. So they read two puppy poems at a time,
             | and say "I prefer A" or "I prefer B".
             | 
             | Once we have a BUNCH of human feedback (and just about when
             | the humans are getting super super tired and don't think
             | they could read another poem), we take ALL that data and we
             | use it to train a SEPARATE computer program (that functions
             | like a Judge) whose job it is to try and predict which poem
             | or story the human would prefer.
             | 
             | It doesn't always get the right answer, but it doesn't need
             | to be perfect -- partly because humans aren't perfect, and
             | different people might prefer different stories. Keep in
             | mind, this Judge program can't write good puppy poems or
             | dragon stories on its own -- it can only predict which poem
             | or story a _human_ would prefer. It still needs the first
             | program (the LLM) to actually write anything.
             | 
             | So now we use the LLM to write a bunch of stories and poems
             | and things, and then grade them all (two at a time) with
             | the second program. For every pair, when the Judge picks
             | its favorite, then we tell the LLM "write more things like
             | this, please!" and for the things the Judge didn't like, we
             | tell the LLM "don't write like this anymore, plzkthx". And
             | we do this over and over, millions of times, and eventually
             | it can write okay poems and stories.
             | 
             | So this way, instead of needing to have humans sit there
             | and read thousands and millions of puppy poems, humans can
             | just read a few dozen / hundred, score them, and then the
             | computer can use that to try and guess what humans would
             | prefer for everything else that it tries. It's not as
             | accurate as if we actually had a human read it all, but
             | it's not too bad, and it seems to work pretty well.
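              | 
              | (Breaking character again for a moment: here's a rough,
              | made-up sketch of that Judge-training step -- just the
              | pairwise-preference idea in code, with names I invented
              | for illustration.)
              | 
              |   import math
              | 
              |   def pref_loss(s_win, s_lose):
              |       # Bradley-Terry style loss: push the Judge's
              |       # score for the preferred poem above the
              |       # rejected one's.
              |       return -math.log(1 / (1 + math.exp(s_lose - s_win)))
              | 
              |   def judge_score(text):
              |       # Placeholder Judge -- really a neural net
              |       # trained with the loss above. Here, longer
              |       # text "wins" just so the example runs.
              |       return float(len(text))
              | 
              |   # Human feedback: (option A, option B, choice).
              |   prefs = [("ode to a golden retriever",
              |             "ode to my tax return", "A")]
              | 
              |   for a, b, choice in prefs:
              |       win, lose = (a, b) if choice == "A" else (b, a)
              |       print(pref_loss(judge_score(win),
              |                       judge_score(lose)))
              | 
              |   # Once trained, the Judge scores new poems, and the
              |   # LLM is nudged toward the ones it rates highly.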
             | 
             | But one problem of this method is that it's not perfectly
             | accurate (the Judge doesn't always get it right), and the
             | more complex the task, the less of a good job it does. It's
             | still just trying to imitate what a human would prefer --
             | but even if it did its job perfectly, it's not going to get
             | much above human preference (because that's its target).
             | Plus, as you keep going up, it takes more and more data to
             | make smaller and smaller improvements, and so it feels like
             | there's only so far that this RLHF game can get us.
             | 
             | So when we graduate to the next grade, that's where SPAG
             | comes in, because it's a totally new way to play the game.
             | Instead of training it by teaching it to write things that
             | one human would prefer, we are going to train it to play a
             | game where it needs to be sneaky. It needs to communicate a
             | secret word or idea to someone without letting them know
              | that they're being controlled. Kind of like if you've ever
             | tried to get your mom to give you a cookie without asking
             | for it directly. In SPAG, we have the LLM play against a
             | copy of itself, and if the first player (called the
             | Attacker) can trick the other player (called the Defender)
             | into saying a secret word without realizing it was the
             | secret word, then the Attacker wins. It's a sneaky game.
             | 
             | So for this, we don't need much human-annotated data at
             | all, and the LLM isn't trying to aim for writing something
             | that a human would prefer. The LLM can be as creative or as
             | sneaky as it wants, and it can "level up" much higher.
             | 
              | This is kind of like when researchers first wrote the
             | computer program AlphaGo -- at first they trained it to
             | imitate previous human games that it had seen, but
             | eventually they stopped using human-created data and purely
             | had the machine play games against itself. Once it was no
             | longer held back by needing to have human-written data in
             | the process, it was free to run as fast as it could, and it
             | became the best Go player that the world had ever seen --
             | better than the best human players who ever lived.
             | 
             | Having a computer play games against itself -- rewarding
             | itself when it does well, and punishing itself when it does
              | badly -- is called "reinforcement learning" (RL), and it's a
             | very powerful concept.
             | 
             | But reinforcement learning only works in situations where
             | you can know CLEARLY whether something is Good or Bad.
             | There must be a clear Winner and a clear Loser -- it can't
             | be like RLHF where it might be tough to know which puppy
             | poem is better.
             | 
             | So we can't do SPAG or other RL methods for improving
             | poetry writing, but there are still plenty of other games
             | where we CAN write clear rules and the computer can clearly
             | know when it has won, and when it has lost.
             | 
             | In the end, SPAG looks very similar to RLHF, but instead of
             | training the Judge to predict which answer a human would
             | prefer, it uses the clear rules of the game to say who is
             | the winner and who is the loser, and rewards them
             | appropriately.
             | 
              | The funny thing about SPAG, though, is that it showed
              | that -- as long as the game involves using human language
              | -- getting better at playing the game makes the model
              | better at _other tasks_ that involve human language.
             | 
             | It's like this guy I heard about who learned to read
             | English because he wanted to play Magic: The Gathering. But
              | learning English inside the game let him do more
             | than just play Magic -- he got better at using English in a
             | whole bunch of other things.
             | 
             | So the idea is that -- if we can let a model learn in such
             | a way that it's not merely aiming for "human preference",
             | but if it can aim for a target that is above that -- if it
              | can practice against itself until it gets better than
              | any human -- then maybe it can fly higher than us in
             | _other_ areas too.
             | 
             | </ELI6>
        
         | jgalt212 wrote:
          | Isn't this exactly how AlphaGo learns and works so well? It
          | always knows the right answer because it knows the rules of
          | the game and can easily compute a W-L record.
          | 
          | In life, it's hard and very expensive to codify the rules
          | and compute a W-L record.
        
           | HanClinto wrote:
           | Yes, exactly.
           | 
            | Using traditional RL is easiest when you're operating in a
            | landscape with clearly defined rules -- like Go, or
            | Starcraft, or whatever. The trouble is that those games
            | don't translate well to other domains -- a model can learn
            | about risk and reward and whatnot from Chess, but that
            | won't make it a better chatbot.
           | 
           | But if the game space can operate through the realm of
           | language and semantics, then the hope is that we can tap into
           | the adversarial growth curve, but for LLMs.
           | 
           | As you note, this only works for situations where we can
           | clearly say "winner" or "loser". In OpenAI's case, they use
           | correctness of the math problem as one W/L metric (discrete
           | and measurable) as well as whether the Verifier was able to
           | correctly identify the answer as correct (thus the
           | understandability of the answer is also discrete and
           | measurable).
           | 
           | In the SPAG paper, they chose the game of "Taboo" as a way to
           | discretely measure W/L (asking: "did the defender say the
           | secret word or not").
           | 
           | As you noted, it's hard and expensive to codify the rules of
           | life. How do we objectively determine whether one poem is
           | more beautiful than another? I think we're a long way from
           | that.
           | 
            | The breakthrough that the SPAG paper showed is that -- by
            | teaching the models to be better at games that involve
            | language and semantics -- they get better at language-
            | oriented tasks _overall_.
           | 
           | And that possibility excites me.
           | 
            | Sadly, as I've read further into the paper released by
            | OpenAI, it doesn't appear that adversarial training for
            | explainability increased the accuracy of the model -- while
            | the output was more understandable / verifiable, it wasn't
            | any more accurate.
           | 
           | I think a very interesting metric would be to measure the
           | accuracy of the fine-tuned models on unrelated tasks to see
           | if the lessons learned to be better at explaining math
           | problems would help the model perform better for explaining
           | other problems (such as logic or reasoning).
        
             | bravura wrote:
             | Thank you for the SPAG paper.
             | 
             | Do you know how to play questions?
             | 
             | https://www.youtube.com/watch?v=u3xIs0aajN4
             | 
             | (Tom Stoppard, Rosencrantz and Guildenstern Are Dead).
             | 
             | The important question in the OpenAI work that you haven't
             | touched on is how to evaluate superintelligence. I guess I
             | would frame the problem like this:
             | 
             | Let's say there is a very esoteric but important branch of
             | abstract mathematics that only a few people claim to
             | understand. Is there a way for us to determine which
             | mathematicians are actually intelligent, and which are
             | bluffing? How?
        
               | HanClinto wrote:
               | Oh that was a brilliant video clip. I hadn't seen that
               | before, thank you!!
               | 
               | > The important question in the OpenAI work that you
               | haven't touched on is how to evaluate superintelligence.
               | I guess I would frame the problem like this:
               | 
               | > Let's say there is a very esoteric but important branch
               | of abstract mathematics that only a few people claim to
               | understand. Is there a way for us to determine which
               | mathematicians are actually intelligent, and which are
               | bluffing? How?
               | 
               | This is a tricky one. To my dog, I am revered as a super-
               | being of intelligence and capability. But if he watches
                | me play grandmaster-level chess, or write a paper on
               | abstract mathematics -- it must look like insanity. In
               | sci-fi, I rather like the image of super-intelligence
               | from one of my favorite short-stories: "When the Yogurt
               | Took Over" [1]
               | 
               | > No one argues with the yogurt. No one tweaks its
               | formulas. The rest of the time it rests there in its
               | factory, thinking about whatever intelligent fermented
               | milk thinks about.
               | 
               | It just sits there in its vat -- and its actions seem
               | largely incomprehensible to us -- as incomprehensible as
               | me playing Magic: The Gathering is to my dog. It must
               | look like lunacy. (given what I spend on the game, I'm
               | not sure it's not)
               | 
                | So if we're going to evaluate superintelligence, then I
                | feel that -- for starters -- it must be on a reasonably
                | clear playing field. We can clearly evaluate super-
                | ability in Chess, in Go, and in Starcraft 2 because
                | there are clearly defined rules.
                | 
                | The only true test of whether one is superior to
                | another will be whether "it works".
               | 
                | Until we can test abstract mathematics objectively, I'm
                | not sure we could ever judge. Insofar as questions of
                | particle physics and whatnot could actually be tested --
                | those feel like the sorts of areas where we might be
                | able to evaluate superintelligence.
               | 
               | But SPAG is much smaller than that. The hope that SPAG
               | offers is that -- as long as the game rules leverage
                | things like language and semantics -- then (assuming the
                | model is able to generalize) the increased mastery
                | of language will transfer to other tasks. And the SPAG
               | results seem to bear that out.
               | 
               | [1] https://whatever.scalzi.com/2010/10/02/when-the-
               | yogurt-took-...
        
         | skdotdan wrote:
         | What do you mean by "true" RL?
        
           | HanClinto wrote:
           | True RL is not limited by being tethered to human-annotated
           | data, and it is able to create novel approaches to solve
           | problems. True RL requires a very clear objective function
           | (such as the rules of Go, or Starcraft, or Taboo!) that the
           | model can evaluate itself against.
           | 
           | Andrej Karpathy talks about the difference between RLHF and
           | "true" RL here:
           | 
           | https://www.youtube.com/watch?v=c3b-JASoPi0&t=1618s
           | 
           | > The other thing is that we're doing reinforcement learning
           | from human feedback (RLHF), but that's like a super weak form
           | of reinforcement learning. I think... what is the equivalent
           | in AlphaGo for RLHF? What is the reward model? What I call it
           | is a "vibe check". Imagine if you wanted to train an AlphaGo
           | RLHF, it would be giving two people two boards and asking:
           | "Which one do you prefer?" -- and then you would take those
           | labels and you would train the model and then you would RL
           | against that. What are the issues with that? It's like,
           | number one -- that's just vibes of the board. That's what
           | you're training against. Number two, if it's a reward model
           | that's a neural net, then it's very easy to overfit to that
           | reward model for the model you're optimizing over, and it's
           | going to find all these spurious ways of hacking that massive
            | model, which is the problem.
           | 
           | > AlphaGo gets around these problems because they have a very
           | clear objective function, and you can RL against it.
           | 
           | > So RLHF is nowhere near [true] RL -- it's silly. And the
           | other thing is that imitation is super-silly. RLHF is a nice
           | improvement, but it's still silly, and I think people need to
           | look for better ways of training these models so that it's in
           | the loop with itself and its own psychology, and I think
           | there will probably be unlocks in that direction.
           | 
           | In contrast, something like true RL would look like the
           | Multi-Agent Hide-And-Seek training loop:
           | https://www.youtube.com/watch?v=kopoLzvh5jY
        
         | vinnyvichy wrote:
         | Some may be reminded of the Magi supercomputers in NERV, but
         | here's a mnemonic inspired by the precogs in Minority Report:
         | 
         | 1) helpful prover : the good twin
         | 
         | 2) sneaky prover : the evil twin
         | 
         | 3) verifier : the foster sister
        
       | jkljl wrote:
       | This is fascinating! Using prover-verifier games to improve the
       | legibility of language model outputs sounds like a game-changer.
       | It's intriguing how focusing on making outputs verifiable by
       | weaker models also helps humans evaluate them better. This
       | balance between correctness and clarity could have huge
       | implications for AI reliability. Anyone else think this could be
       | a big step towards more transparent AI systems? Would love to
       | hear your thoughts!
        
       | michwilinski wrote:
        | Interesting, but I don't agree that seeing the "token
        | reasoning" chain somehow explains how the model got the final
        | answer. What if we trained deceiver models that would provide a
        | sound chain of explanation but then perform some kind of
        | deception and output an incorrect answer? For me personally,
        | explainability has to show how the answer arose from the model
        | mechanics, not from sequential model outputs.
        
         | HanClinto wrote:
         | > what if we trained deceiver models that would provide a sound
         | chain of explanation but then perform some kind of deception
         | and output an incorrect answer?
         | 
         | You're right on target! That's exactly what they're doing in
         | the paper. They train three models -- a verifier (that rates
         | answers as sounding correct or sounding wrong), a "helpful
         | prover" (that provides correct answers), and "sneaky prover"
         | (that provides incorrect answers that attempt to deceive the
         | verifier into scoring its answer highly).
         | 
         | This adversarial relationship between the "helpful prover" and
         | the "sneaky prover" is the cool part of the paper (IMO).
        
       ___________________________________________________________________
       (page generated 2024-07-18 23:10 UTC)