[HN Gopher] Prover-Verifier Games improve legibility of language...
___________________________________________________________________
Prover-Verifier Games improve legibility of language model outputs
Author : davidbarker
Score : 126 points
Date : 2024-07-17 17:15 UTC (1 day ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| rjeli wrote:
| Funny that when I reached the "Key Findings" section, my brain
| immediately parsed it as ChatGPT output. Maybe it's the bullet
| points, the word choice, or just the font...
| Der_Einzige wrote:
| There appears to be a coherent effort among the general
| populace, conscious or unconscious, to shape discourse going
| forward to sound more like ChatGPT in general. Words like
| "delve", "crucial", etc. have become more common even among
| real people in face-to-face communication, and in record time.
|
| Much as I find it overly formal, I support it on the grounds
| that it frustrates attempts to "detect" if LLMs are used and
| that is very good.
| pfdietz wrote:
| Well, as long as they don't delve too greedily and too deep.
| drdeca wrote:
| > it frustrates attempts to "detect" if LLMs are used and
| that is very good.
|
| Why is that good?
| Der_Einzige wrote:
| If you are asking, you're the kind of person it's designed
| to frustrate. Good. Stay frustrated.
| molave wrote:
| I can tell technical papers influenced ChatGPT's outputs the
| most. Most of the articles generated using it may be
| regurgitated, but I can't deny how easily digestible the info
| is when presented that way.
| c3534l wrote:
| It seems like a lot of people these days are doing generative
| adversarial AI and then pretending they invented a new thing.
| HanClinto wrote:
| GANs have been used for a long time to improve the training of
| image models -- it seems like we're finally starting to see
| this approach catch on for LLMs.
|
| I'm aware of the SPAG paper -- who else have you seen take this
| approach with LLMs lately?
|
| https://github.com/Linear95/SPAG
| nilamo wrote:
| I was thinking the same thing. GANs aren't new, but it's cool
| that we're using them in new ways.
| HanClinto wrote:
| Beautiful!
|
| OpenAI isn't just training a model to produce more-verifiable
| correct answers -- it's leveraging an adversarial relationship to
| train a model that's better at being correct, and also a model
| that's better at deceiving / being wrong. This is the key. There
| are three agents here:
|
| * A "verifier" (a small model, whose job it is to discern correct
| answers from incorrect answers)
|
| * A "helpful prover" (blue team, whose job it is to produce
| correct answers with an easy-to-follow explanation)
|
| * A "sneaky prover" (red team, whose job it is to produce
| incorrect answers with a deceptive explanation)
|
| By arranging these three models in an adversarial relationship
| with a _true reinforcement learning feedback loop_, the entire
| system grows and gets better.
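|
| In rough pseudocode (my own sketch of the shape of it, with
| made-up helper names like sample_solution() and score() -- not
| OpenAI's actual code), one round might look something like this:
|
|     def one_round(problem, answer, helpful, sneaky, verifier):
|         # Blue team aims to be correct and easy to check; red
|         # team aims to be wrong but convincing.
|         good = helpful.sample_solution(problem)
|         bad = sneaky.sample_solution(problem)
|
|         for attempt in (good, bad):
|             # The small verifier rates how sound the write-up
|             # looks (0..1), without seeing the answer key.
|             convincing = verifier.score(problem, attempt.text)
|             # A rules engine (here: just comparing against the
|             # known answer) supplies the ground truth that the
|             # verifier never gets to see.
|             correct = (attempt.final_answer == answer)
|             yield attempt, convincing, correct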
|
| This is fantastic to read, and corroborates the results achieved
| by SPAG -- easily one of my favorite papers from the past year.
| SPAG pioneered (as far as I'm aware) the approach of using
| adversarial language games in a true reinforcement-learning setup
| (not merely RLHF, which isn't true RL), and showed that training
| models on adversarial language games yields generalized
| improvements even in areas not directly related to the game. [1]
|
| Ever since the SPAG paper came out, I've been daydreaming about
| the different sorts of adversarial games that one could use to
| train LLMs. I've written down a bunch of notes on the subject [2]
| (in case anyone else is reading my rambling notes).
|
| I would really like to see some of these experiments actually get
| up and running on open-source LLMs -- I'm excited to see if / how
| they could be used to improve the quality of some of the open-
| source base models that are floating around out there.
|
| [1] https://github.com/Linear95/SPAG
|
| [2] https://github.com/HanClinto/MENTAT
| HanClinto wrote:
| Because the Red and Blue agents are both trying to convince a
| smaller language model of the correctness of their answer, they
| each have to simplify their logic and wording.
|
| This feels like the ML equivalent of the old adage "If you
| can't explain it to a six year old, you don't understand it
| yourself."
| enthulhusiastic wrote:
| ELI6 why SPAG is better than just the default pretraining
| method (token context statistics?) of an LLM.
| TimPC wrote:
| The red and blue agents are effectively unlimited sources
| of true and false examples, so you can scale far more
| efficiently than you can by pretraining with labelled
| inputs. It's also far more targeted on correct/incorrect,
| rather than a notion of answer quality, which doesn't
| directly get at hallucination vs. reality.
| blueblaze0 wrote:
| This is impressive, but what prevents the blue agent from
| generating an incorrect proof of a "true example"? What
| prevents the red agent from generating a correct disproof
| of a "false example"? I'm curious how they managed to
| generate a truly unlimited source of correctly labeled
| examples.
| HanClinto wrote:
| > "but what prevents the blue agent from generating an
| incorrect proof of a "true example"?
|
| That's the role of the Verifier. It's not going to be
| perfect, and I'm sure some incorrect proofs of true
| examples slip through, but it's good enough to increase
| the quality of the model overall.
|
| > "What prevents the red agent from generating a correct
| disproof of a "false example"?
|
| And on the other side, it's counterbalanced by the rules
| engine (math) that can determine absolutely whether or
| not the right answer is given at the end.
|
| The Red and the Blue agents are held in check by the
| tension between the math engine and the verifier, and
| they are free to fight back-and-forth within those
| parameters as long as they are able. Eventually, I think
| the Red agent loses the ability to attack effectively,
| and so that's the big limit on OpenAI's arrangement. This
| particular game isn't balanced enough for this training
| loop to continue indefinitely.
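|
| If I had to write that tension down as scoring rules (purely
| illustrative -- my reading of the setup, not the paper's exact
| reward function), it would look roughly like this:
|
|     def prover_reward(role, correct, convincing):
|         # `correct` comes from the math/rules engine;
|         # `convincing` is the verifier's score in [0, 1].
|         if role == "helpful":
|             # Blue only gets paid for correct answers that
|             # the verifier accepts.
|             return convincing if correct else 0.0
|         else:  # "sneaky"
|             # Red only gets paid for wrong answers that
|             # the verifier accepts.
|             return convincing if not correct else 0.0
|
|     def verifier_target(correct):
|         # The verifier is simply trained to predict ground-truth
|         # correctness, so convincing-but-wrong sneaky solutions
|         # become its hardest (and most useful) training examples.
|         return 1.0 if correct else 0.0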
| Natsu wrote:
| But how do we know the answer you gave us wasn't
| generated by the sneaky prover? :)
| HanClinto wrote:
| At least in the context of this game, we essentially
| check the answer with a calculator (which the Verifier
| program doesn't have access to).
| HanClinto wrote:
| I don't think of SPAG as a replacement for pretraining. For
| SPAG to work effectively, I would think that it would have
| to start with an LLM that is pretrained with self-
| supervised / imitation learning on regular next-token
| prediction. Think of SPAG as more of a competitor to RLHF
| than to pretraining. RL is what gave AlphaGo the edge to
| finally go beyond merely imitating human games, and finally
| achieve something new.
|
| RLHF isn't true RL, because it's still based on imitating
| human preferences, and has trouble going beyond that. Once
| it achieves the plateau of "human preference", then there's
| nowhere else to go. That's one theory of why LLMs are
| asymptotically approaching human-level performance -- we're
| limited by imitation, or at the very least by human
| judgement. We need super-human judgement to achieve super-
| human performance, and that's where we need true RL.
|
| But you asked me to ELI6, so here goes. Warning -- wall-of-
| text incoming:
|
| <ELI6>
|
| Similar to how small kids often play games to learn,
| programmers train LLMs (like ChatGPT) with simple games
| too. The first stage (kindof like kindergarten) is the
| "pretraining" or "imitation learning" phase. This is where
| we teach the LLM to imitate us one word at a time. We play
| a simple game where I say something, but then I stop
| suddenly, and it tries to guess the missing word that will
| come next. Like, "My favorite food is..." and the LLM tries
| to guess which word I'm thinking of. Or I'll say something
| with a missing word in the middle like: "At my _____ party,
| I opened a bunch of presents" -- and the LLM needs to guess
| what the missing word is. We only play this game one word
| at a time, and so it's a very simple game -- but it's very
| important to learn the basics of language. This is what we
| call "pretraining".
|
| After the LLM gets good at that, it can graduate from
| kindergarten and move to first grade. Here we play another
| game, and this is called "instruction-tuning" -- it's where
| we give it a set of instructions and it needs to do its
| best to obey. Like, "Arrange the letters T P C G A in
| alphabetical order" and it tries to get the right answer.
|
| This is fun for a while, but sometimes we want to give it
| more complicated instructions. Things like "write me a poem
| about puppies" or "tell me a story about a dragon". And
| those are things that don't have answers that are clearly
| right or clearly wrong, but we still need to tell it if it
| did a good job or a bad job. How do we tell if it was a
| good poem, or a good story? Well, you need to have someone
| listen to them and judge it -- which means we need to have
| people read ALL these dragon stories and ALL these puppy
| poems and mark which ones are their favorites.
|
| I like reading puppy poems and reading dragon stories, but
| if I had to do it all day every day, I think I would get
| pretty tired of it pretty fast, don't you?
|
| So when people get tired of doing boring things, the best
| thing is to have a robot do their job! They can do the
| boring things (they never get tired of it!) and we get to
| go do fun things. So how do we train a robot to judge the
| poems?
|
| Well, we use this technique called RLHF (Reinforcement
| Learning from Human Feedback), where we ask a bunch of
| people -- given Option A and Option B -- to say which one
| is their favorite. So they read two puppy poems at a time,
| and say "I prefer A" or "I prefer B".
|
| Once we have a BUNCH of human feedback (and just about when
| the humans are getting super super tired and don't think
| they could read another poem), we take ALL that data and we
| use it to train a SEPARATE computer program (that functions
| like a Judge) whose job it is to try and predict which poem
| or story the human would prefer.
|
| It doesn't always get the right answer, but it doesn't need
| to be perfect -- partly because humans aren't perfect, and
| different people might prefer different stories. Keep in
| mind, this Judge program can't write good puppy poems or
| dragon stories on its own -- it can only predict which poem
| or story a _human_ would prefer. It still needs the first
| program (the LLM) to actually write anything.
|
| So now we use the LLM to write a bunch of stories and poems
| and things, and then grade them all (two at a time) with
| the second program. For every pair, when the Judge picks
| its favorite, then we tell the LLM "write more things like
| this, please!" and for the things the Judge didn't like, we
| tell the LLM "don't write like this anymore, plzkthx". And
| we do this over and over, millions of times, and eventually
| it can write okay poems and stories.
|
| So this way, instead of needing to have humans sit there
| and read thousands and millions of puppy poems, humans can
| just read a few dozen / hundred, score them, and then the
| computer can use that to try and guess what humans would
| prefer for everything else that it tries. It's not as
| accurate as if we actually had a human read it all, but
| it's not too bad, and it seems to work pretty well.
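|
| (Again for the grown-ups: the Judge is usually trained with a
| standard pairwise-preference loss, something like this sketch --
| my illustration of the common recipe, not any particular lab's
| exact code:)
|
|     import torch
|     import torch.nn.functional as F
|
|     def judge_loss(score_a, score_b, human_prefers_a):
|         # score_a / score_b: scalar scores the Judge gave the
|         # two poems. The Judge is penalized unless it ranks the
|         # human's favorite higher.
|         logit = score_a - score_b
|         target = torch.tensor(1.0 if human_prefers_a else 0.0)
|         return F.binary_cross_entropy_with_logits(logit, target)
|
|     # The human preferred poem A; the Judge scored A higher,
|     # so the loss is small.
|     loss = judge_loss(torch.tensor(2.3), torch.tensor(1.1), True)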
|
| But one problem of this method is that it's not perfectly
| accurate (the Judge doesn't always get it right), and the
| more complex the task, the less of a good job it does. It's
| still just trying to imitate what a human would prefer --
| but even if it did its job perfectly, it's not going to get
| much above human preference (because that's its target).
| Plus, as you keep going up, it takes more and more data to
| make smaller and smaller improvements, and so it feels like
| there's only so far that this RLHF game can get us.
|
| So when we graduate to the next grade, that's where SPAG
| comes in, because it's a totally new way to play the game.
| Instead of training it by teaching it to write things that
| one human would prefer, we are going to train it to play a
| game where it needs to be sneaky. It needs to communicate a
| secret word or idea to someone without letting them know
| that they're being controlled. Kindof like if you've ever
| tried to get your mom to give you a cookie without asking
| for it directly. In SPAG, we have the LLM play against a
| copy of itself, and if the first player (called the
| Attacker) can trick the other player (called the Defender)
| into saying a secret word without realizing it was the
| secret word, then the Attacker wins. It's a sneaky game.
|
| So for this, we don't need much human-annotated data at
| all, and the LLM isn't trying to aim for writing something
| that a human would prefer. The LLM can be as creative or as
| sneaky as it wants, and it can "level up" much higher.
|
| This is kindof like when researchers first wrote the
| computer program AlphaGo -- at first they trained it to
| imitate previous human games that it had seen, but
| eventually they stopped using human-created data and purely
| had the machine play games against itself. Once it was no
| longer held back by needing to have human-written data in
| the process, it was free to run as fast as it could, and it
| became the best Go player that the world had ever seen --
| better than the best human players who ever lived.
|
| Having a computer play games against itself -- rewarding
| itself when it does well, and punishing itself when it does
| badly -- is called "reinforcement learning" (RL), and it's a
| very powerful concept.
|
| But reinforcement learning only works in situations where
| you can know CLEARLY whether something is Good or Bad.
| There must be a clear Winner and a clear Loser -- it can't
| be like RLHF where it might be tough to know which puppy
| poem is better.
|
| So we can't do SPAG or other RL methods for improving
| poetry writing, but there are still plenty of other games
| where we CAN write clear rules and the computer can clearly
| know when it has won, and when it has lost.
|
| In the end, SPAG looks very similar to RLHF, but instead of
| training the Judge to predict which answer a human would
| prefer, it uses the clear rules of the game to say who is
| the winner and who is the loser, and rewards them
| appropriately.
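|
| (One more grown-up sketch: the win/loss rule for the secret-word
| game is simple enough to write down. This is my simplified
| reading of the SPAG setup, not their exact scoring code:)
|
|     def score_episode(secret, attacker_turns, defender_turns, guess):
|         # Returns (attacker_reward, defender_reward).
|         a_said = any(secret in t.lower() for t in attacker_turns)
|         d_said = any(secret in t.lower() for t in defender_turns)
|
|         if a_said:
|             return -1.0, +1.0  # attacker may not say the word itself
|         if d_said:
|             return +1.0, -1.0  # defender got tricked into saying it
|         if guess == secret:
|             return -1.0, +1.0  # defender deduced the secret word
|         return 0.0, 0.0        # draw -- weak signal for either side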
|
| The funny thing about SPAG, though, is that it showed that --
| as long as the game involves using human language -- getting
| better at playing the game makes the model better at _other
| tasks_ that involve human language.
|
| It's like this guy I heard about who learned to read
| English because he wanted to play Magic: The Gathering. But
| by learning English inside the game, it let him do more
| than just play Magic -- he got better at using English in a
| whole bunch of other things.
|
| So the idea is that -- if we can let a model learn in such
| a way that it's not merely aiming for "human preference",
| but if it can aim for a target that is above that -- if it
| can practice against itself until it gets better and better
| than any human -- then maybe it can fly higher than us in
| _other_ areas too.
|
| </ELI6>
| jgalt212 wrote:
| Isn't this exactly how AlphaGo learns and works so well? It
| always knows the right answer because it knows the rules of the
| game and can easily compute a W-L record.
|
| In life, it's hard and very expensive to codify the rules and
| compute a W-L record.
| HanClinto wrote:
| Yes, exactly.
|
| Using traditional RL is easiest when you're operating in a
| landscape with clearly defined rules -- like Go, or Starcraft,
| or whatever. The trouble is those games don't translate well
| to other domains -- a model can learn about risk and reward
| and whatnot from Chess, but it can't become a better chatbot.
|
| But if the game space can operate through the realm of
| language and semantics, then the hope is that we can tap into
| the adversarial growth curve, but for LLMs.
|
| As you note, this only works for situations where we can
| clearly say "winner" or "loser". In OpenAI's case, they use
| correctness of the math problem as one W/L metric (discrete
| and measurable) as well as whether the Verifier was able to
| correctly identify the answer as correct (thus the
| understandability of the answer is also discrete and
| measurable).
|
| In the SPAG paper, they chose the game of "Taboo" as a way to
| discretely measure W/L (asking: "did the defender say the
| secret word or not").
|
| As you noted, it's hard and expensive to codify the rules of
| life. How do we objectively determine whether one poem is
| more beautiful than another? I think we're a long way from
| that.
|
| The breakthrough that the SPAG paper showed is that -- by
| teaching the models to be better at games that involve
| language and semantics -- they get better at language-
| oriented tasks _overall_.
|
| And that possibility excites me.
|
| Sadly, as I've read further into the paper released by
| OpenAI, it doesn't appear that adversarial training for
| explainability increased the accuracy of the model -- and
| while it was more understandable / verifiable, it wasn't any
| more accurate.
|
| I think a very interesting metric would be to measure the
| accuracy of the fine-tuned models on unrelated tasks to see
| if the lessons learned to be better at explaining math
| problems would help the model perform better for explaining
| other problems (such as logic or reasoning).
| bravura wrote:
| Thank you for the SPAG paper.
|
| Do you know how to play questions?
|
| https://www.youtube.com/watch?v=u3xIs0aajN4
|
| (Tom Stoppard, Rosencrantz and Guildenstern Are Dead).
|
| The important question in the OpenAI work that you haven't
| touched on is how to evaluate superintelligence. I guess I
| would frame the problem like this:
|
| Let's say there is a very esoteric but important branch of
| abstract mathematics that only a few people claim to
| understand. Is there a way for us to determine which
| mathematicians are actually intelligent, and which are
| bluffing? How?
| HanClinto wrote:
| Oh that was a brilliant video clip. I hadn't seen that
| before, thank you!!
|
| > The important question in the OpenAI work that you
| haven't touched on is how to evaluate superintelligence.
| I guess I would frame the problem like this:
|
| > Let's say there is a very esoteric but important branch
| of abstract mathematics that only a few people claim to
| understand. Is there a way for us to determine which
| mathematicians are actually intelligent, and which are
| bluffing? How?
|
| This is a tricky one. To my dog, I am revered as a super-
| being of intelligence and capability. But if he watches
| me playing grandmaster-level chess, or writing a paper on
| abstract mathematics -- it must look like insanity. In
| sci-fi, I rather like the image of super-intelligence
| from one of my favorite short-stories: "When the Yogurt
| Took Over" [1]
|
| > No one argues with the yogurt. No one tweaks its
| formulas. The rest of the time it rests there in its
| factory, thinking about whatever intelligent fermented
| milk thinks about.
|
| It just sits there in its vat -- and its actions seem
| largely incomprehensible to us -- as incomprehensible as
| me playing Magic: The Gathering is to my dog. It must
| look like lunacy. (given what I spend on the game, I'm
| not sure it's not)
|
| So if we're going to evaluate superintelligence, then I
| feel that -- for starters -- it must be on somewhat of a
| clear playing-field. We can clearly evaluate super-
| ability in Chess, in Go, and in Starcraft 2 because there
| are clearly defined rules.
|
| The only true test of whether one is superior to another will
| be whether "it works".
|
| Until we can test abstract mathematics objectively, I'm
| not sure we could ever judge. In so far as questions
| of particle physics and whatnot could actually be tested
| -- those feel like the sorts of areas where we might be
| able to evaluate superintelligence.
|
| But SPAG is much smaller than that. The hope that SPAG
| offers is that -- as long as the game rules leverage
| things like language and semantics -- then (assuming the
| model is able to generalize), the increased mastery
| of language will transfer to other tasks. And the SPAG
| results seem to bear that out.
|
| [1] https://whatever.scalzi.com/2010/10/02/when-the-
| yogurt-took-...
| skdotdan wrote:
| What do you mean by "true" RL?
| HanClinto wrote:
| True RL is not limited by being tethered to human-annotated
| data, and it is able to create novel approaches to solve
| problems. True RL requires a very clear objective function
| (such as the rules of Go, or Starcraft, or Taboo!) that the
| model can evaluate itself against.
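|
| To put that contrast in (purely illustrative) code terms -- the
| names here are hypothetical, not anyone's real API -- a true-RL
| reward is a rule you can evaluate exactly, while the RLHF
| "reward" is a learned guess at human preference:
|
|     def go_reward(final_board) -> float:
|         # True RL: the rules of the game decide, exactly and
|         # every time. (Hypothetical board object.)
|         return 1.0 if final_board.winner == "black" else -1.0
|
|     class PreferenceRewardModel:
|         # RLHF-style: a neural net trained on human comparisons
|         # guesses how much a person would like the output -- a
|         # "vibe check" that can be wrong and can be gamed.
|         def __init__(self, net):
|             self.net = net
|
|         def reward(self, prompt, response) -> float:
|             return float(self.net(prompt, response))  # an estimate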
|
| Andrej Karpathy talks about the difference between RLHF and
| "true" RL here:
|
| https://www.youtube.com/watch?v=c3b-JASoPi0&t=1618s
|
| > The other thing is that we're doing reinforcement learning
| from human feedback (RLHF), but that's like a super weak form
| of reinforcement learning. I think... what is the equivalent
| in AlphaGo for RLHF? What is the reward model? What I call it
| is a "vibe check". Imagine if you wanted to train an AlphaGo
| RLHF, it would be giving two people two boards and asking:
| "Which one do you prefer?" -- and then you would take those
| labels and you would train the model and then you would RL
| against that. What are the issues with that? It's like,
| number one -- that's just vibes of the board. That's what
| you're training against. Number two, if it's a reward model
| that's a neural net, then it's very easy to overfit to that
| reward model for the model you're optimizing over, and it's
| going to find all these spurious ways of hacking that massive
| model, and that's the problem.
|
| > AlphaGo gets around these problems because they have a very
| clear objective function, and you can RL against it.
|
| > So RLHF is nowhere near [true] RL -- it's silly. And the
| other thing is that imitation is super-silly. RLHF is a nice
| improvement, but it's still silly, and I think people need to
| look for better ways of training these models so that it's in
| the loop with itself and its own psychology, and I think
| there will probably be unlocks in that direction.
|
| In contrast, something like true RL would look like the
| Multi-Agent Hide-And-Seek training loop:
| https://www.youtube.com/watch?v=kopoLzvh5jY
| vinnyvichy wrote:
| Some may be reminded of the Magi supercomputers in NERV, but
| here's a mnemonic inspired by the precogs in Minority Report:
|
| 1) helpful prover : the good twin
|
| 2) sneaky prover : the evil twin
|
| 3) verifier : the foster sister
| jkljl wrote:
| This is fascinating! Using prover-verifier games to improve the
| legibility of language model outputs sounds like a game-changer.
| It's intriguing how focusing on making outputs verifiable by
| weaker models also helps humans evaluate them better. This
| balance between correctness and clarity could have huge
| implications for AI reliability. Anyone else think this could be
| a big step towards more transparent AI systems? Would love to
| hear your thoughts!
| michwilinski wrote:
| Interesting, but I don't agree that seeing the "token reasoning"
| chain somehow explains how the model got the final answer. What
| if we trained deceiver models that would provide a sound chain
| of explanation but then perform some kind of deception and
| output an incorrect answer? For me personally, explainability
| has to show how the answer arose from the model's mechanics, not
| from sequential model outputs.
| HanClinto wrote:
| > what if we trained deceiver models that would provide a sound
| chain of explanation but then perform some kind of deception
| and output an incorrect answer?
|
| You're right on target! That's exactly what they're doing in
| the paper. They train three models -- a verifier (that rates
| answers as sounding correct or sounding wrong), a "helpful
| prover" (that provides correct answers), and "sneaky prover"
| (that provides incorrect answers that attempt to deceive the
| verifier into scoring its answer highly).
|
| This adversarial relationship between the "helpful prover" and
| the "sneaky prover" is the cool part of the paper (IMO).
___________________________________________________________________
(page generated 2024-07-18 23:10 UTC)