[HN Gopher] LLMs cannot find reasoning errors, but can correct them
___________________________________________________________________
LLMs cannot find reasoning errors, but can correct them
Author : koie
Score : 114 points
Date : 2023-11-20 19:35 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| seeknotfind wrote:
| It can also "correct" proper reasoning. :)
|
| ~"When told where it's wrong, LLM can correct itself to improve
| accuracy."
|
| Similar to cheating in chess: a master only needs to be told the
| value of a few positions to have an advantage.
| tines wrote:
| This is said in the abstract as well:
|
| > recent attempts to self-correct logical or reasoning errors
| often cause correct answers to become incorrect, resulting in
| worse performances overall (Huang et al., 2023)
| mark_l_watson wrote:
| I have noticed this several times. When I give feedback that a
| mistake was made (with no details on what the mistake is),
| smaller and medium-sized LLMs will often then give a correct
| response.
| erhaetherth wrote:
| Which I take full advantage of when the output is like 90%
| correct but the "fix" requires a bit of refactoring: I just
| tell it what I want and presto. Faster than doing it by hand.
| agentultra wrote:
| This might deserve some context here from experts. Wouldn't
| solving mistake finding, in the general case, be the same as
| solving SAT (NP-Hard)?
|
| From the abstract it sounds to me like they're talking about
| heuristics for particular problems. Is that accurate?
| helen___keller wrote:
| Computational complexity isn't really related here. Complexity
| has to do with formal languages and asymptotics; this is about
| natural language and fixed-size data sets.
| valine wrote:
| I wonder if separate LLMs can find each other's logical mistakes.
| If I ask llama to find the logical mistake in Yi output, would
| that work better than llama finding a mistake in llama output?
|
| A logical mistake might imply a blind spot inherent to the model,
| a blind spot that might not be present in all models.
| EricMausler wrote:
| Wouldn't this effectively be using a "model" twice the size?
|
| Would it be better to just double the size of one of the models
| rather than house both?
|
| Genuine question
| valine wrote:
| Maybe. Goliath 120B took two different llama variants and
| interwove the layers. Surprisingly, Goliath 120B quantized to
| 2-bit is outperforming llama 70B at 4-bit in many benchmarks.
|
| https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com.
| ..
| avereveard wrote:
| Parsing is faster than generating, so having a small model
| produce a whole output and then having Goliath emit only a
| single "good/bad" token as an evaluation would be faster than
| having Goliath produce everything. This would be an extreme,
| ad hoc, iterative version of speculative decoding, which is
| already a thing and would probably give the best compromise.
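|
| A rough sketch of that draft-then-verify split, using the OpenAI
| Python client against an OpenAI-compatible server (the endpoint
| and model names here are just placeholders):
|
|     from openai import OpenAI
|
|     client = OpenAI(base_url="http://localhost:8000/v1",
|                     api_key="none")
|
|     def draft_and_verify(prompt):
|         # The small model does the slow, token-by-token
|         # generation of the full answer.
|         draft = client.chat.completions.create(
|             model="small-model",
|             messages=[{"role": "user", "content": prompt}],
|         ).choices[0].message.content
|
|         # The big model only scores the draft, emitting a
|         # single verdict token.
|         verdict = client.chat.completions.create(
|             model="goliath-120b",
|             messages=[{"role": "user", "content":
|                 f"Question:\n{prompt}\n\nAnswer:\n{draft}\n\n"
|                 "Reply with exactly one word: good or bad."}],
|             max_tokens=1,
|         ).choices[0].message.content
|         return draft, verdict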
| raincole wrote:
| I think the relationship between model size and training time
| isn't linear, so a model twice the size will take more
| resources to train than two of the original models.
| sevagh wrote:
| I frequently share responses between ChatGPT (paid version with
| GPT4) and Copilot-X to break an impasse when trying to generate
| or fix a tricky piece of code.
| einpoklum wrote:
| No, they can't "correct reasoning errors", and that's a clickbait
| title.
| swatcoder wrote:
| They can produce text that is more sound than text that
| appeared earlier in the same input, when interim text indicates
| that something in the earlier block was unsound. (Sometimes)
|
| It's the same pattern you'd see in a pedagogical article about
| correcting reasoning errors, except that it's able to generate
| some share of the article content on its own.
|
| With more layers of post-processing behind a curtain, you might
| be able to build an assembly over this behavior that looked
| convincingly like it was correcting reasoning errors on its
| own.
|
| So... yes and no.
| ming0308 wrote:
| If you look at the paper, they only claim that an LLM can
| correct errors if the mistake location is given. The
| mistake-finding part is not yet solved.
| ilaksh wrote:
| I was just testing Bard with some very simple coding exercises
| and it did well.
|
| I noticed that they automatically create at least three other
| draft responses.
|
| I assume that this is a technique that allows them to try
| multiple times and then select the best one.
|
| Just mentioning it because it seems like another example of not
| strictly "zero-shot"ing a response, which seems important for
| getting good results with these models.
|
| I'm guessing they use batching for this. I wonder if it might
| become more common to run multiple inference subtasks for the
| same main task inside of a batch, for purposes of self-correcting
| agent swarms or something. The outputs from step one are reviewed
| by the group in step 2, then they try again in step 3.
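|
| A toy version of that loop, with an OpenAI-style client (the
| model name and prompt wording here are just placeholders):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def draft_review_revise(task, n=3):
|         # Step one: several drafts from a single batched request.
|         drafts = [c.message.content for c in
|                   client.chat.completions.create(
|                       model="gpt-4",
|                       messages=[{"role": "user",
|                                  "content": task}],
|                       n=n,
|                   ).choices]
|
|         # Step two: review the drafts as a group.
|         numbered = "\n\n".join(f"[{i}] {d}"
|                                for i, d in enumerate(drafts))
|         critique = client.chat.completions.create(
|             model="gpt-4",
|             messages=[{"role": "user", "content":
|                 f"Task: {task}\n\nDrafts:\n{numbered}\n\n"
|                 "Which draft is best, and what should change?"}],
|         ).choices[0].message.content
|
|         # Step three: try again with the critique attached.
|         return client.chat.completions.create(
|             model="gpt-4",
|             messages=[{"role": "user", "content":
|                 f"Task: {task}\n\nCritique of earlier drafts:\n"
|                 f"{critique}\n\nWrite an improved answer."}],
|         ).choices[0].message.content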
|
| I guess that only applies for a small department where there is
| frequently just one person using it at a time.
| MillionOClock wrote:
| IIRC there were some OpenAI docs that recommended doing exactly
| this: make n generations and use a smaller fine-tuned model to
| select the best one.
| Tostino wrote:
| Right, most inference servers support this already.
| DaiPlusPlus wrote:
| ...does this directly relate to the high operating costs of
| LLMs-as-a-service, if for every request they have to run
| n-many redundant LLM requests? So if they could improve
| things so that a single prompt/request+response has a higher
| chance of being high-quality they wouldn't need to run
| alternatives?
| ilaksh wrote:
| A lot of people don't run multiple at a time.
|
| It can make it more expensive if that option becomes
| popular.
|
| But I think in most cases batching is actually the biggest
| _improvement_ in terms of cost effectiveness for operators,
| since it enables them to use the parallel throughput of the
| graphics device more fully by handling multiple inference
| requests (often from different customers) at once. (Unless
| they work like Bard by default.)
| stavros wrote:
| Isn't that textbook MoE?
| Tostino wrote:
| No, like the other comment said, it's just using the `n`
| parameter in an OpenAI-style API. For example, vLLM and
| llama.cpp have support for it.
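|
| For example, a minimal request against a local vLLM server's
| OpenAI-compatible endpoint (the URL and model name here are
| placeholders):
|
|     import requests
|
|     resp = requests.post(
|         "http://localhost:8000/v1/chat/completions",
|         json={
|             "model": "meta-llama/Llama-2-70b-chat-hf",
|             "messages": [{"role": "user",
|                           "content": "Explain RLHF in one line."}],
|             "n": 4,  # four candidate completions, one request
|         },
|     )
|     candidates = [c["message"]["content"]
|                   for c in resp.json()["choices"]]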
| stavros wrote:
| Ah, it's the same model, multiple runs, then? Not actually
| N different models?
| Tostino wrote:
| Correct.
| erhaetherth wrote:
| I don't like this. It forces me to read 2 responses instead of
| 1 so that I can help train their LLM. ChatGPT and Bard already
| have regenerate buttons if I don't like their response; it
| doesn't need to be that in my face.
| moritzwarhier wrote:
| I think there is an argument that it would be beneficial for
| this to be common, despite the cognitive burden.
|
| It forces you to remind yourself of the stochastic nature of
| the model and RLHF, and maybe the data even helps to improve
| the latter.
|
| I liked this trait of Bard from the start and hope they keep
| it.
|
| It provides a sense of agency and reminds you not to
| anthropomorphize the transformer chatbot too much.
| nextworddev wrote:
| If this is the case, then just run it X times till error rate
| drops near 0. AGI solved.
| westurner wrote:
| This is called (Algorithmic) _Convergence_: does the model
| stably converge on one answer which it believes is most
| correct? And after how much time and how many resources?
|
| Convergence (evolutionary computing)
| https://en.wikipedia.org/wiki/Convergence_(evolutionary_comp...
|
| Convergence (disambiguation) > Science, technology, and
| mathematics
| https://en.wikipedia.org/wiki/Convergence#Science,_technolog...
| bee_rider wrote:
| I don't think it would solve AGI, but having multiple models
| arguing with each other seems sort of similar to how we work
| things out when we're thinking hard, right? Consider a
| hypothesis, argue for or against it in your head.
| ming0308 wrote:
| As the paper suggests, though, LLMs cannot yet identify their
| own mistakes, and they can only fix their mistakes if the
| mistake location is given.
| bandrami wrote:
| I noticed early on that GPT-3.5 can successfully create a false
| sentence but has a _whole lot_ of trouble creating an invalid
| syllogism, and tends to end up making false but valid ones. Not
| sure if that's changed, but it's interesting what that might
| say about its training.
| pton_xd wrote:
| I've also noticed LLMs seem to lack conviction on the correctness
| of their answers. As the paper notes, you can easily convince the
| transformer that a correct answer is wrong, and needs adjustment.
| Ultimately they're just trying to please you. For example with
| ChatGPT 3.5 (abbreviated):
|
| me: what is sin -pi/2
|
| gpt: -1
|
| me: that's not right
|
| gpt: I apologize, let me clarify, the answer is 1
| hellcow wrote:
| I just re-ran this on GPT-4 and it apologized, told me I was
| right, and then said again that the answer was -1. So while it
| lacked conviction it at least kept the correct answer.
| muzani wrote:
| gpt-4: Actually, the value of sin(-π/2) is indeed -1. The sine
| function represents the y-coordinate of a point on the unit
| circle corresponding to a given angle. At -π/2 radians, which
| is equivalent to 270 degrees or a quarter circle in the
| negative direction, the point on the unit circle is at the
| bottom with coordinates (0, -1). Therefore, the sine of -π/2
| is -1.
|
| =====
|
| The smarter it is, the more conviction it has. GPT-3.5 has a
| lot of impostor syndrome and it's probably deserved lol. But
| GPT-4 starts to stutter when you give it enough math questions,
| which aren't its forte.
| muzani wrote:
| If anything, GPT-4 has the opposite problem. Ask it to check
| your homework and it'll go "20/5 is not 4. The correct answer
| is 4"
| kaiokendev wrote:
| This is due to the RLHF alignment, which is purely
| product-focused. It would be very annoying for users to fight
| back and forth with the LLM over the correctness of an answer,
| especially when it is so prone to hallucination.
| kromem wrote:
| Stop doing self-correction within the context of the model's own
| generation.
|
| The previous paper on self correction told the model "you
| previously said X - are there errors with this?"
|
| This one statically adds the mistakes to the prompt, as a task
| prompt and response with no additional context, immediately
| before asking if it has any errors.
|
| Think about the training data.
|
| How often does the training data of most of the Internet reflect
| users identifying issues with their own output?
|
| How often does the training data reflect users identifying issues
| with someone else's output?
|
| Try doing self-correction by setting up the context of "this was
| someone else's answer". It is still technically self-correction
| if a model is reviewing its own output in that context - it just
| isn't set up as "correct your own answer."
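|
| For instance, something along these lines (just a sketch; the
| exact wording is one of many possibilities):
|
|     def critique_prompt(task, answer):
|         # Frame the output as someone else's work instead of
|         # asking the model to re-check its own answer.
|         return (
|             "Another assistant was given this task:\n"
|             f"{task}\n\n"
|             "Here is their answer:\n"
|             f"{answer}\n\n"
|             "Identify any reasoning errors in their answer, "
|             "or reply 'no errors' if it is correct."
|         )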
|
| This may even be part of why the classifier did a better job at
| identifying issues - less the fine tuning and more the context
| (unfortunately I don't see the training/prompts for the
| classifier in their GitHub repo).
|
| It really seems like the aversion to anthropomorphizing LLMs is
| leading people to ignore or overlook relevant patterns in the
| highly anthropomorphic training data fed into them. We might not
| want to entertain that an LLM has a concept of self vs other or
| a
| bias between critiques based on such a differentiation, and yet
| the training data almost certainly reflects such a concept and
| bias.
|
| I'd strongly encourage future work on self-correction to
| explicitly define the thing being evaluated as the work of
| another. (Or ideally even compare self-correction rates between
| critiques in the context of their own output vs another's
| output.)
| andai wrote:
| That's hilarious. Does this imply LLMs inherited the human
| tendency to get attached to a perspective despite evidence to
| the contrary? I'll often try to coax the right answer out of
| GPT-3 when I know it's wrong, and it'll often insist that it's
| right several times in a row.
| OmarShehata wrote:
| I think it does indeed suggest this, but I think this may be
| good news.
|
| Part of what makes humans able to make progress in difficult,
| vague, and uncertain fields is a willingness to hold onto a
| point of view in the face of criticism to try and fix it. This
| is, as a matter of fact, how science progresses, depending on
| whether you ask scientists or historians of science. See Thomas
| Kuhn's Structure of Scientific Revolutions for more on this.
| sumthingsumthng wrote:
| I have not read the essay yet, but when 'we' talk about
| "reasoning errors", we do not mean reason in some natural,
| universal, scientific kind of sense, right?
|
| Given that the training data can only contain human reasoning
| and computational logic, reason in the sense of LLMs can only
| be interpreted as "rational facts AND nonsense humans made up
| to create systems that would support consumerism-driven
| sanity", correct?????
|
| Please understand, I'm not mocking; I'm genuinely interested in
| the ways human reasoning radiates into the code LLMs learn
| while they realize (the computational equivalent of a
| new-born's eyes opening) their cognitive (&) sensory (that
| which triggers/causes/elicits/prompts/influences) origins
| (every whatever-second/moment of their existence).
___________________________________________________________________
(page generated 2023-11-20 23:00 UTC)