[HN Gopher] Teaching Large Language Models to Self-Debug
___________________________________________________________________
Teaching Large Language Models to Self-Debug
Author : saurabh20n
Score : 36 points
Date : 2023-04-12 20:29 UTC (2 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| ulrikhansen54 wrote:
| 'Unsupervised reinforcement learning' is how these large models
| and systems will ultimately become sentient. We recently tried a
| similar approach on a toy problem in the computer vision sphere
| (https://encord.com/blog/we-employed-chatgpt-as-an-ml-
| enginee...) with pretty decent results.
| Buttons840 wrote:
| Ah, we're starting to bootstrap.
|
| For decades in reinforcement learning we've had Q-learning, which
| promises to solve _any_ optimization problem _if only_ we can
| build a powerful enough function approximator. It can even learn
| off-policy, meaning it can just watch from the sidelines and find
| the optimal solution. It works for toy problems, and it works in
| theory; there are even formal proofs that it will work given
| infinite time and resources. And yet in practice it often becomes
| unstable and collapses.
|
| Supervised learning is one thing; having a model remain stable
| while bootstrapping through a complex environment is another. GPT
| is supervised learning, so far. Let's see if it can bootstrap.
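|
| For concreteness, here's a minimal sketch of the tabular
| Q-learning update on a made-up toy chain environment (the
| environment, constants, and names are illustrative only):
|
|     import random
|
|     # Hypothetical toy chain: states 0..4, reward 1 for reaching
|     # state 4. Actions: 0 = left, 1 = right.
|     N_STATES, N_ACTIONS = 5, 2
|     ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
|     Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
|
|     def step(s, a):
|         s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
|         done = s2 == N_STATES - 1
|         return s2, (1.0 if done else 0.0), done
|
|     for episode in range(500):
|         s, done = 0, False
|         while not done:
|             if random.random() < EPS:
|                 a = random.randrange(N_ACTIONS)
|             else:
|                 a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
|             s2, r, done = step(s, a)
|             # Bootstrapped target: reward plus the discounted max
|             # of our own (possibly wrong) estimate of the next
|             # state's value -- this is where instability creeps in.
|             target = r + (0.0 if done else GAMMA * max(Q[s2]))
|             Q[s][a] += ALPHA * (target - Q[s][a])
|             s = s2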
| cs702 wrote:
| _In hindsight_ , it's the most natural, most obvious next step to
| get LLMs to write better code:
|
| Explain to them how to debug and fix the code they've written.
|
| Which is pretty much what you would do with an inexperienced
| human software developer.
|
| Looking at this with fresh eyes, it's both _shocking_ to me that
| this sort of thing is even possible, and yet also _completely
| unsurprising_ as yet another emergent capability of LLMs.
|
| We live in interesting times.
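|
| For anyone curious what that looks like mechanically, here's a
| rough sketch under my own assumptions (the paper's actual setup
| differs; `ask_llm` is a placeholder for whatever completion API
| you use, and the prompts are paraphrased, not theirs):
|
|     import subprocess, tempfile
|
|     def ask_llm(prompt: str) -> str:
|         # Placeholder: call your LLM of choice here (assumption,
|         # not part of the paper's code).
|         raise NotImplementedError
|
|     def run_tests(code: str, tests: str):
|         # Write candidate code plus tests to a temp file, run it,
|         # and return (passed, stderr).
|         with tempfile.NamedTemporaryFile(
|                 "w", suffix=".py", delete=False) as f:
|             f.write(code + "\n" + tests)
|             path = f.name
|         proc = subprocess.run(["python", path],
|                               capture_output=True, text=True)
|         return proc.returncode == 0, proc.stderr
|
|     def self_debug(task: str, tests: str, max_rounds: int = 3):
|         code = ask_llm(f"Write Python code for this task:\n{task}")
|         for _ in range(max_rounds):
|             ok, feedback = run_tests(code, tests)
|             if ok:
|                 break
|             # Feed the execution feedback back and ask the model
|             # to explain the failure and produce a corrected version.
|             code = ask_llm(
|                 "The following code fails its tests.\n"
|                 f"Code:\n{code}\n"
|                 f"Error:\n{feedback}\n"
|                 "Explain the bug, then return a fixed version.")
|         return code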
| og_kalu wrote:
| Not too shocking for me after this paper.
| https://arxiv.org/abs/2211.09066
|
| You can teach GPT-3 arithmetic - https://imgur.com/a/w3DAYOi
|
| Basically 100% accuracy up to about 13-digit addition and >90%
| after that.
|
| What else can you teach GPT without changing weights?
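|
| The trick in that paper, as I understand it, is spelling out the
| algorithm's intermediate steps in the prompt. As a rough
| illustration (my own format, not the paper's exact template), a
| scratchpad for schoolbook addition can be generated like this:
|
|     def addition_scratchpad(a: int, b: int) -> str:
|         # Spell out addition digit by digit, right to left, the
|         # way an algorithmic few-shot example would demonstrate it.
|         xs, ys = str(a)[::-1], str(b)[::-1]
|         carry = 0
|         lines = [f"Problem: {a} + {b}"]
|         for i in range(max(len(xs), len(ys))):
|             d1 = int(xs[i]) if i < len(xs) else 0
|             d2 = int(ys[i]) if i < len(ys) else 0
|             total = d1 + d2 + carry
|             lines.append(
|                 f"digit {i}: {d1} + {d2} + carry {carry} = {total},"
|                 f" write {total % 10}, carry {total // 10}")
|             carry = total // 10
|         if carry:
|             lines.append(f"final carry: write {carry}")
|         lines.append(f"Answer: {a + b}")
|         return "\n".join(lines)
|
|     # A few worked examples like this go into the prompt; the
|     # model is then asked to follow the same procedure on unseen
|     # numbers.
|     print(addition_scratchpad(987, 654))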
| mirashii wrote:
| > 100% accuracy up to about 13 digit addition
|
| The graphs you just posted do not support that, they'd
| support at most 100% accuracy up to 4 digits.
| sharemywin wrote:
| it's GPT so 13=4
| og_kalu wrote:
| It's 100% at 13 digits and extremely close to that before
| then. Maybe "basically 100%" is the better way to put it.
| matisseverduyn wrote:
| Useful, but I still wouldn't count on it.
|
| With respect to GPT etc. as a copilot, the current dialogue seems
| to focus on "ask GPT to generate code to do X", then "just paste
| in the error message to fix bugs in the code GPT generates".
|
| A.) Why is GPT generating code that results in simple compiler
| errors in the first place (which is why GPT probably shouldn't be
| used to generate any code / replace devs for real projects yet),
| and
|
| B.) errors that surface as error messages are (just guessing
| here) probably <1% of the actual bugs in most codebases.
|
| I personally know of a few large companies laying off devs over
| this.
|
| IMO, the tech debt we're going to see in 6 months will probably
| be huge. Now would be a good time to start a staffing agency of
| human experts who can come in and fix this type of problem
| (extricating massive amounts of GPT-generated code without
| starting from scratch), because there will be a bunch of fires to
| put out, and those fires will be worth $.
| viscanti wrote:
| If an LLM hallucinates lines of code that can't even compile, I
| suppose it could also hallucinate logic issues, which are more
| difficult to track down.
| matisseverduyn wrote:
| Definitely. QA at a snail's pace should still be the focus
| here for a while, but that's not what I'm observing in the
| real world. Just rush, pressure, layoffs. At least this sort
| of behavior keeps humans employed long-term.
| david2ndaccount wrote:
| > I personally know of a few large companies laying off devs
| over this.
|
| They're laying people off and replacing them with ChatGPT-
| generated code? That seems... aggressive. Or are they laying
| off devs who copy-pasted GPT-generated code?
| matisseverduyn wrote:
| Replacing devs with LLMs.
| blondin wrote:
| Color me skeptical. What are those large companies that are
| replacing devs with LLMs?
| ratg13 wrote:
| You can't replace devs with LLMs, because someone who
| knows what they are doing still needs to put it all
| together.
|
| You can only make employees more productive. This, in
| turn, could, in theory, lessen the need for developers
| in the long run, but that assumes the company won't
| bother to use the extra bandwidth for other projects.
| broast wrote:
| I think it's more natural than you might think. For
| example, my company laid off a lot of people to try to be
| profitable, and now they pay me more but I have a smaller
| team with tighter deadlines. I have no choice but to use
| GPT for a lot of my analysis, design, and code, which
| I've gotten pretty used to over the past year in my hobby
| time.
|
| The way I see it, if you code without it, you won't
| compete with the speed and value.
|
| And they are not going to backfill those roles.
| sdfghswe wrote:
| My company recently hired someone who I'm absolutely
| convinced can't code and produces all their code by copy-
| pasting into/from ChatGPT. I absolutely think they should be
| fired; it's not even aggressive, it's just common sense.
| First, that means they cheated on their coding interview.
| Second, it means their code is consistently a pile of shit.
| Imnimo wrote:
| I'd be curious to know whether having few-shot prompts that
| demonstrate making mistakes and then correcting them causes the
| model to make more initial mistakes so that it has something to
| correct.
|
| Like, as far as the model is concerned, how can it distinguish
| between the task being "do your best, but if you do make an
| error, correct it" and "make some mistakes like in this example
| and then fix them"?
| alecco wrote:
| Three Google researchers using OpenAI's GPT-3 code-davinci-002,
| interesting.
| ftxbro wrote:
| > "We evaluate SELF-DEBUGGING on code-davinci-002 in the GPT-3
| model family"
|
| Putting aside the incongruity of Google researchers using the
| OpenAI model, I'm curious how GPT-4 would do in this situation.
| Probably its zero-shot attempts at coding would be better, and
| maybe its self-criticisms would be better too.
| civilized wrote:
| I've done several experiments (and posted results in previous HN
| comments) where I've given GPT puzzles or brainteasers and asked
| it to review aspects of its answers Socratically. Never telling
| it that it got anything wrong, just "you said A, then you said
| B, does that make sense?"
|
| It usually does notice inconsistencies between A and B when asked
| this. But its ways of reconciling inconsistencies can be bizarre
| and suggest a very superficial understanding of concepts.
|
| For example, it once reconciled an inconsistency by saying that,
| yes, 2 * 2 = 4, but if you multiply both sides of that equation
| by a big number, that's no longer true.
|
| I will be super impressed the day we have a model that can read
| an arithmetic textbook and come out with reliable arithmetic
| skills.
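|
| Roughly, the shape of that Socratic loop (an illustrative sketch
| only; `ask_llm` is a placeholder for whatever chat API you use,
| and the wording is paraphrased, not my exact prompts):
|
|     def ask_llm(history: list[dict]) -> str:
|         # Placeholder for a chat-completion call (assumption, not
|         # a specific real API).
|         raise NotImplementedError
|
|     def socratic_review(puzzle: str, claim_pairs) -> list[str]:
|         # Ask the model whether pairs of its own statements are
|         # consistent, without ever saying anything is wrong.
|         history = [{"role": "user", "content": puzzle}]
|         replies = []
|         for a, b in claim_pairs:
|             history.append({
|                 "role": "user",
|                 "content": f'You said "{a}", then you said "{b}".'
|                            " Does that make sense?"})
|             answer = ask_llm(history)
|             history.append({"role": "assistant", "content": answer})
|             replies.append(answer)
|         return replies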
| sharemywin wrote:
| In computer arithmetic you would get an undefined result if the
| number were large enough.
| civilized wrote:
| It doesn't work with numbers as computer numbers, though. It
| works with them as decimal digit strings, just like humans
| do.
| Paul-Craft wrote:
| [dead]
| faizshah wrote:
| I have run into the same issue when using it for coding. It can
| easily debug simple code, but with tools like Bazel I went
| down a rabbit hole for 2 hours letting it debug an error, and
| it failed every time. Even with chain of thought it had a very
| shallow understanding of the issue. Eventually I had to debug
| the problem myself.
___________________________________________________________________
(page generated 2023-04-12 23:01 UTC)