[HN Gopher] Orca 2: Teaching Small Language Models How to Reason
___________________________________________________________________
Orca 2: Teaching Small Language Models How to Reason
Author : fgfm
Score : 267 points
Date : 2023-11-21 10:16 UTC (12 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| fgfm wrote:
| Orca 2-13B consistently beats Llama 2-70B on most benchmarks in
| 0-shot settings. Hopefully, research papers will start to include
| Mistral/Zephyr 7B & Openchat 3.5. Even though they're smaller,
| they're getting competitive against much larger models and
| they're much cheaper to orchestrate.
| ple13 wrote:
| It fails other benchmarks vs Mistral-7b.
| https://twitter.com/Teknium1/status/1726846755344634020
|
| (There are some doubts about the validity of the comparison in
| the comments.)
| eurekin wrote:
| Also worth mentioning the next tweet: "Update, I benchmarked
| 13B Orca 2, it's still not surpassing the GPT4All score of
| base Mistral or OpenHermes 2.5 7B:
|
|   Hermes 2.5 7B Mistral score: 73.12%
|   Mistral Base 7B score:       71.16%
|   Orca 13B GPT4All score:      70.58%"
|
| https://twitter.com/Teknium1/status/1726833004117635414
| davidkunz wrote:
| For smaller models, I'm impressed by Mistral-7b or fine-tuned
| variants like Zephyr. I use it regularly in Neovim[1] for mundane
| tasks (grammar correction, summaries, ...). I'm curious how Orca
| 2 performs, downloading it right now.
|
| [1]: with https://github.com/David-Kunz/gen.nvim
| eurekin wrote:
| I'd love to see some demo of that!
| davidkunz wrote:
| A demo video is in the README (I used Mistral-7b in there).
| eurekin wrote:
| Amazing, thank you!
| GaggiX wrote:
| Also the OpenChat-3.5 model (it has 7B parameters; I think it is
| also a Mistral finetune), demo: https://openchat.team/
| schleck8 wrote:
| Nice, it passes the weather test. I always ask open source
| models what the weather is like and see whether they
| hallucinate my location and a forecast. A few months ago,
| without exception, all models I tried (even larger ones) would
| just make up a temperature. Now it replies as it should. Cool!
|
| > what's the weather like today?
|
| > I'm sorry, but I can't provide real-time weather
| information. However, I can help you with general information
| about weather conditions and forecasting.
| nodja wrote:
| Oh wow, this model is kind of amazing: it passes my "creative"
| tests that only ChatGPT 3.5 did decently well on. I've
| recently been disillusioned that open source has been moving
| in the wrong direction due to the focus on benchmarks, but this
| model seems to hit the spot in usefulness on more wacky prompts
| ("write X in the style of Y" kind of prompts).
| sorokod wrote:
| Always surprised how poorly these models do on the
| benchmarks they claim to do well on. OpenChat has a benchmark
| radar diagram[1], but it often fails on actual samples.
|
| [1] https://github.com/imoneoi/openchat
| titaniumtown wrote:
| Haven't seen this neovim plugin before! I'm setting this up
| right now.
| intended wrote:
| I really really want this to work.
|
| However, at this point benchmark success is about as meaningful
| as results from someone who has been "taught to the test".
|
| If say... Merck wanted to use this same model to reason out a
| logistics issue, or apply it to some business problem at scale -
| you'd have to deal with hallucinations all over the place.
|
| The best analogy I have right now is that improved results on
| benchmarks are like better acting from Hugh Laurie as House.
|
| If you want to watch a show - great (generative work)
|
| If you want to get a prescription - then not so much.
| candiddevmike wrote:
| I'm not a real AI doctor, I just play one on chat.openai.com.
| FFP999 wrote:
| The moment I read "how to reason" in the headline, my
| bullshit detector started to go off.
|
| LLMs do not reason, they do not think, they are not AGI. They
| generate by regurgitating.
| coderaptor wrote:
| I haven't heard a definition of "reasoning" or "thinking"
| that proves humans aren't doing exactly that same
| probabilistic regurgitation.
|
| I don't think it's possible to prove; feels like a
| philosophical question.
| intended wrote:
| It's possible to prove.
|
| Use an LLM to do a real world task that you should be able
| to achieve by reasoning.
| FFP999 wrote:
| > Use an LLM to do a real world task that you should be
| able to achieve by reasoning.
|
| Such as explaining the logical fallacies in this argument
| and the one above?
| motoxpro wrote:
| I mean I know you're joking but yes, it would be able to
| do that.
| intended wrote:
| Take anything, see how far you get before you have to
| really grapple with hallucination.
|
| Once that happens, your mitigation strategy will end up
| being the proof.
| CAP_NET_ADMIN wrote:
| LLMs can be trained on all the math books in the world,
| starting from the easiest to the most advanced, they can
| regurgitate them almost perfectly, yet they won't apply the
| concepts in those books to their actions. I'd count the
| ability to learn new concepts and methods, then being able
| to use them as "reasoning".
| margorczynski wrote:
| Aren't there quite a few examples of LLMs giving out-of-
| distribution answers to stated problems? I think there
| are two issues with LLMs and reasoning:
|
| 1. They are single-pass and static - you "fake" short-term
| memory by re-feeding the question along with its answer
| (a minimal sketch of this loop is at the end of this comment).
|
| 2. They have no real goal to achieve - one that they would
| split into sub-goals, plan how to achieve, estimate the
| returns of each, etc.
|
| As for 2., I think this is the main point of e.g. LeCun:
| LLMs in themselves are simply single-modality world
| models, and they lack other components to make them true
| agents capable of reasoning.
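|
| A minimal sketch of the re-feeding loop from point 1, assuming a
| hypothetical complete() wrapper around any single-pass
| text-completion API:
|
|   # "Faking" short-term memory with a stateless, single-pass LLM:
|   # the entire history is re-fed to the model on every turn.
|   def complete(prompt: str) -> str:
|       raise NotImplementedError  # call your LLM of choice here
|
|   def chat_turn(history: list[str], user_message: str) -> str:
|       history.append(f"User: {user_message}")
|       # The model keeps no state; the prompt carries all "memory".
|       prompt = "\n".join(history) + "\nAssistant:"
|       answer = complete(prompt)
|       history.append(f"Assistant: {answer}")
|       return answer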
| RationalDino wrote:
| I won't define reasoning, just call out one aspect.
|
| We have the ability to follow a chain of reasoning, say
| "that didn't work out", backtrack, and consider another.
| ChatGPT seems to get tangled up when its first (very good)
| attempt goes south.
|
| This is definitely a barrier that can be crossed by
| computers. AlphaZero is better than we are at it. But it is
| a thing we do which we clearly don't simply do with the
| probabilistic regurgitation method that ChatGPT uses.
|
| That said, the human brain combines a bunch of different
| areas that seem to work in different ways. Our ability to
| engage in this kind of reasoning, for example, is known to
| mostly happen in the left frontal cortex. So it seems
| likely that AGI will also need to combine different modules
| that work in different ways.
|
| On that note, when you add tools to ChatGPT, it suddenly
| can do a lot more than it did before. If those tools
| include the right feedback loops, the ability to
| store/restore context, and so on, what could it then do?
| This isn't just a question of putting the right
| capabilities in a box. They have to work together for a
| goal. But I'm sure that we haven't achieved the limit of
| what can be achieved.
| azmodeus wrote:
| Instead of going back, you can construct a tree of
| different reasonings with an LLM and then take a vote or
| synthesise; see Tree of Thought prompting.
| Davidzheng wrote:
| These are things we can teach children to do when they
| don't do it at first, so I don't see why we can't teach this
| behavior to AI. Maybe we should teach LLMs to play games,
| or to do those proof exercises that they teach in US high
| school geometry, or something like that, so that they learn
| some formal structure within which they can think about the
| world.
| xanderlewis wrote:
| It feels like humans _do_ do a similar regurgitation as
| _part_ of a reasoning process, but if you play around with
| LLMs and ask them mathematical questions beyond the
| absolute basics it doesn't take long before they trip up
| and reveal a total lack of 'understanding' as we would
| usually understand it. I think we're easily fooled by the
| fact that these models have mastered the art of talking
| like an expert. Within any domain you choose, they've
| mastered the form. But it only takes a small amount of real
| expertise (or even basic knowledge) to immediately spot
| that it's all gobbledygook and I strongly suspect that when
| it isn't it's just down to luck (and the fact that almost
| any question you can ask has been asked before and is in
| the training data). Given the amount of data being
| swallowed, it's hard to believe that the probabilistic
| regurgitation you describe is ever going to lead to
| anything like 'reasoning' purely through scaling. You're
| right that asking what reasoning is may be a philosophical
| question, but you don't need to go very far to empirically
| verify that these models absolutely do not have it.
| cloverich wrote:
| On the other hand, it seems rather intuitive that we have a
| logic-based component? It's the underpinning of science. We
| have to be taught when we've stumbled upon something that
| needs to be tested. But we can be taught that. And then once we
| learn to recognize it, we intuitively do so in action.
| ChatGPT can do this in a rudimentary way as well. It says a
| program should work a certain way. Then it writes it. Then
| it runs it. Then when the answer doesn't come out as
| expected (at this point, probably just error cases), it
| goes back and changes it.
|
| It seems similar to what we do, if on a more basic level.
| At any rate, it seems like a fairly straightforward 1-2
| punch that, even if not truly intelligent, would let it
| break through its current barriers.
| QuadmasterXLII wrote:
| With only the information we had in 2020, the two theories
| "language models don't reason, they regurgitate" and "as
| language models scale, they begin to think and reason" made
| predictions, and the people who invested time and money based
| on the predictions of the latter theory have done well for
| themselves.
| FFP999 wrote:
| If you're trying to tell me there's a sucker born every
| minute, I knew that.
| intended wrote:
| The people who bet on generative tasks are getting mileage
| out of it.
|
| People who bet on reasoning tasks, not so much.
| schleck8 wrote:
| AGI doesn't reason either. No one defines AGI as "AI, but with
| reasoning". It's usually "AI that outperforms humans at all
| disciplines, by any degree". Maybe you confused it
| with ASI, but even then reasoning isn't a requirement, afaik.
| pelorat wrote:
| Reasoning is a learnt concept that involves retrieving
| memories and running them through an algorithm, also retrieved
| from memory, and then you loop the process until a classifier
| deems the result adequate to the given goal.
| sharemywin wrote:
| I asked GPT-4 and it had some counterpoints:
|
| Reasoning blends learned skills and natural cognition. It
| integrates new information, not just past memories.
| Reasoning is adaptable, not rigidly algorithmic. Emotions
| and context also shape reasoning.
|
| which seemed to make sense.
| avion23 wrote:
| I hope this will be found in history books and some
| students will point out the irony that people are relying on
| GPT-4's arguments about reasoning in a thread where it's
| proclaimed that said model can't reason.
| kgeist wrote:
| Just yesterday I saw an example of a person asking GPT what
| "fluftable" means. The word was invented by their little
| daughter and they didn't know what it meant. GPT reasoned it
| was a portmanteau of "fluffy" and "comfortable", and it made
| sense because it was used in reference to a pillow. If it's
| just regurgitation, I'd like to know how it's able to
| understand novel words not found in the training data...
| svaha1728 wrote:
| I would read Francois Chollet's explanation of this. It's
| very good: https://fchollet.substack.com/p/how-i-think-
| about-llm-prompt...
|
| For words that are not in the model's vocabulary, like
| 'fluftable', the model uses a subword tokenization
| strategy. It breaks down the word into smaller known
| subunits (subwords or characters) and represents each
| subunit with its own vector. By understanding the context
| in which 'fluftable' appears and comparing it to known
| words with similar subunits, the model can infer a
| plausible meaning for the word. This is done by analyzing
| the vector space in which these representations exist,
| observing how the vectors align or differ from those of
| known words.
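|
| As a rough illustration (a hypothetical sketch using a Hugging
| Face tokenizer, not Chollet's code), an out-of-vocabulary word
| simply falls apart into known subword pieces:
|
|   # An unknown word is split into known subword units rather than
|   # mapped to a single token. GPT-2's BPE tokenizer is used here
|   # only as an example vocabulary; requires `transformers`.
|   from transformers import AutoTokenizer
|
|   tok = AutoTokenizer.from_pretrained("gpt2")
|   print(tok.tokenize("fluftable"))    # several subword pieces
|   print(tok.tokenize("comfortable"))  # typically fewer pieces for
|                                       # a frequent word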
|
| 'As always, the most important principle for understanding
| LLMs is that you should resist the temptation of
| anthropomorphizing them.'
| sharemywin wrote:
| isn't 'infer' another word for reason?
| svaha1728 wrote:
| vector math in a 1536-dimensional space?
| lucubratory wrote:
| I'm sorry, but that's absurd. Being able to explain the
| precise mechanism behind reasoning would make anything
| sound like it's not reasoning, because of our prior
| experiences. If we understood human reasoning well enough
| to explain exactly what happens in our brain, you would
| conclude that we're not really reasoning because you can
| provide an explanation of how we're reasoning about
| novel, out of distribution data. This is "God of the
| gaps" for thought.
| gnaritas99 wrote:
| You are simply incorrect. They can reason.
| GenericPoster wrote:
| Did you only read the title? Because the abstract gives you a
| pretty good idea of what they mean when they say reason. It's
| pretty easy to understand. No need to immediately call
| bullshit just because of a minor semantic disagreement.
|
| >ThEY DON'T tHiNk. They'rE JuSt STochAStiC pARrotS. It'S not
| ReAL AGi.
|
| It doesn't even matter if these claims are true or not.
| They're missing the point of the conversation and the paper.
| Reason is a perfectly valid word to use. So is think. If you
| ask it a question and then follow up with 'think carefully'
| or 'explain carefully', you'll get the same response.
|
| inb4 AcTUALLy LlMS Can'T do aNYtHIng CaRefUlly BECaUse
| pRogRAms ARen'T caRefUl
| borg16 wrote:
| > Merck wanted to use this same model to reason out a logistics
| issue, or apply it to some business problem at scale - you'd
| have to deal with hallucinations all over the place.
|
| I wouldn't think Merck would leave it all to the model? There
| will still be humans in the loop ensuring that the output is
| valid for their use case? I don't think we are there yet,
| where we can completely productionize these models without
| any human involvement later on whatsoever.
| btbuildem wrote:
| Are we beginning to see "specialized SLMs"? We've already seen
| some pretend-agent based solutions (where the same model is given
| several different roles and made to act as e.g. CEO / architect /
| dev / sales in a startup).
|
| I wonder if the way forward is to train smaller models with
| different sets of "skills" or "neural affinities". One for
| reasoning, one for summarization, one for math, one for code, etc
| - then combining them into full-fledged solutions. Perhaps
| smaller models can be "better" at their specific domains/tasks
| than the giant generalist models can be at any of them.
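|
| A toy sketch of that kind of composition, with hypothetical
| specialized models behind a single generate() interface and a
| (here trivially keyword-based) router in front:
|
|   # Dispatch a request to a specialized small model. The model
|   # names and generate() are hypothetical placeholders; a real
|   # router would itself likely be a small classifier model.
|   SPECIALISTS = {
|       "code": "slm-code-7b",
|       "math": "slm-math-7b",
|       "summarize": "slm-summarize-7b",
|       "reason": "slm-reason-7b",  # default
|   }
|
|   def route(task: str) -> str:
|       if "def " in task or "```" in task:
|           return SPECIALISTS["code"]
|       if any(c.isdigit() for c in task) and "=" in task:
|           return SPECIALISTS["math"]
|       if task.lower().startswith(("summarize", "tl;dr")):
|           return SPECIALISTS["summarize"]
|       return SPECIALISTS["reason"]
|
|   def answer(task: str, generate) -> str:
|       # generate(model_name, prompt) stands in for whatever
|       # backend hosts the specialized models.
|       return generate(route(task), task)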
| worldsayshi wrote:
| Isn't this the whole idea behind the Mixture of Experts
| approach that GPT-4 is using?
| htrp wrote:
| Isn't MoE with switch transformers massively inefficient
| compared to being able to customize which LLMs you are using?
|
| I've seen a lot of agent swarm concepts in the smaller LLM
| space that seem to provide some evidence that this is a
| viable avenue of research.
| esafak wrote:
| Is GPT-4's MOE based on combining specialized models?
| hobofan wrote:
| Yes, I think that is the general trend. Have one model tuned
| for reasoning that decides a plan, based on which you invoke
| other models as tools (see e.g. the ReWOO paper[0]). If I had
| to guess, an approach like this is what powers the recent
| Custom GPT/Assistant API products (based on the lag between
| tool invocations I would guess that they also re-prompt for
| plan adjustments between every set of tool calls).
|
| Do that with a small model and hot-swap LoRAs, and it should be
| possible to build quite a powerful local assistant on consumer
| hardware.
|
| [0]: https://arxiv.org/abs/2305.18323
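|
| Very roughly, the ReWOO-style shape looks something like this
| (hypothetical plan format and model callables, not the paper's
| actual code):
|
|   # One model writes a plan with tool placeholders, worker models
|   # fill them in, and a final call solves the task with the
|   # collected evidence substituted back in. plan_model,
|   # tool_models and solve_model are hypothetical callables.
|   import re
|
|   def rewoo_style(question, plan_model, tool_models, solve_model):
|       # 1. Plan once, up front, e.g.
|       #    "#E1 = search[Orca 2 paper]\n#E2 = summarize[#E1]"
|       plan = plan_model(f"Write a tool plan for: {question}")
|
|       # 2. Execute each step with the matching specialized model.
|       evidence = {}
|       steps = re.findall(r"(#E\d+) = (\w+)\[(.*?)\]", plan)
|       for var, tool, arg in steps:
|           arg = re.sub(r"#E\d+",
|                        lambda m: evidence.get(m.group(), ""), arg)
|           evidence[var] = tool_models[tool](arg)
|
|       # 3. Solve with the plan and evidence, without re-planning.
|       return solve_model(
|           f"{question}\nPlan:\n{plan}\nEvidence:\n{evidence}")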
| trash_cat wrote:
| Yes, this is the trend. OpenAI's marketplace of GPTs is a
| confirmation of this. BabyAGI, AutoGen, and AutoGPT are all
| multiple-LLM/SLM architectures under the hood. While we don't
| have access to proprietary data or the ability to run bigger
| models, the natural direction is to combine them with
| specialized tasks like you just described. The issue is then
| the interface: making it good, making the models communicate
| seamlessly, and deciding what roles they play and what
| architecture they operate in. The last point is up to your
| imagination.
| imhoguy wrote:
| Specialized LLMs, and likely SLMs too, are really the future. I
| use them mostly to aid programming and just stopped
| paying for GPT-4. Phind and others are on par now for my
| coding needs.
| Philpax wrote:
| https://huggingface.co/microsoft/Orca-2-13b
|
| https://huggingface.co/microsoft/Orca-2-7b
| kromem wrote:
| A really important nuance here is that they are building on top
| of Llama-2, the pretrained model, and not Llama-2-chat.
|
| I really think the entire field is doing a degree of damage with
| the chat fine-tuning beyond what might be expected, because
| part of that chat instruction is regularly an emphasis on
| identification as an LLM.
|
| The problem with this is that nearly all of the training data
| it's performing next token prediction on is text generated by
| humans.
|
| So there's an inherent narrowing of the model scope with most of
| the fine-tuning I've seen. While pretrained models are harder to
| use, I regularly prefer them over chat models when both are
| available, as even at similar temperatures the quality and
| variety of language is much improved in the pretrained model
| over the chat model.
|
| This fine-tuning only introduced a bias towards logical
| step-by-step analysis and problem-solving techniques, and the
| results are great. But I'm willing to bet that an identical
| fine-tuning on top of the chat model would have been much worse
| on the evaluations - not just the compounding of a typical
| fine-tuning loss of a few percent, but more like a double-digit
| relative difference.
|
| It's quite frustrating that the anxiety over model safety is
| likely throwing out tens of millions of dollars worth of data in
| the pretrained model when only chat models are available for the
| SotA. I hope in the future a lighter touch is taken on fine
| tuning the pretrained model: instead of focusing on safety
| inherent to the model, set it behind a safety-oriented
| discriminator or 'editor' which filters or modifies responses
| accordingly.
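|
| Concretely, I mean something with this shape, where the safety
| model sits outside the generator instead of being baked into its
| weights (everything here is hypothetical, just to illustrate the
| 'editor' idea):
|
|   # Safety as a wrapper: the base model is left untouched and a
|   # separate classifier/editor gates its output. All callables
|   # are hypothetical placeholders.
|   def safe_generate(prompt, base_model, safety_classifier, editor):
|       draft = base_model(prompt)  # unmodified pretrained model
|       verdict = safety_classifier(prompt, draft)  # "ok" / "unsafe"
|       if verdict == "ok":
|           return draft
|       # Refuse outright or have a small editor model rewrite the
|       # response to keep the useful content minus the harm.
|       return editor(f"Rewrite this response safely:\n{draft}")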
|
| I'd happily take a 2-3x increased API cost for a much more
| broadly capable and performant model with similar safety
| characteristics but without the handicaps that come with it.
|
| So while a lot of the gains here might be due to the fine tuning,
| I expect at least part is shrugging off the baggage of the
| chat/safety fine tuning as well. Even in the first detailed
| example, we can see that while Llama-2 goes off rambling later
| on, its statement of the relative knowledge of John vs
| Llama-2-chat is much more clear and connected between initial
| conditions and result particularly regarding theory of mind (i.e.
| "he assumed" vs the latter's "it must be in").
| kromem wrote:
| Adding to this - it's really interesting the safety stuff that
| *is* in this paper. Such as:
|
| > We probe some of the categories where we see a larger
| difference (e.g., violent) and observe that Orca 2 tends to
| counter the harmful positions more often (which is penalized by
| the metric), while models that have gone through RLHF safety
| training tend to decline to respond more often (which is
| rewarded by the metric).
|
| Or the fact Orca 2 is less likely to extend hate speech than
| Llama-2-chat which theoretically went through safety fine
| tuning even though Orca 2 did not have any explicit safety fine
| tuning.
|
| Research over the past year has really demonstrated (a) just
| how impactful fine tuning can be - to the point of transmitting
| capabilities from larger models to smaller, and (b) that we're
| still clumsily wading through that process with only partial
| clarity on best practices as the foundational pretrained models
| get better and better at astounding rates.
| alecco wrote:
| > Progressive Learning: We start with LLaMA-2-7B or LLaMA-2-13B
| checkpoint and finetune it on the train split of FLAN-v2 dataset
| for one epoch. Note that FLAN-v2 dataset contains both zero-shot
| and few-shot problems. We then train on 5 million ChatGPT data
| from Orca 1 for 3 epochs. Then we train on the combination of 1
| million GPT-4 data from Orca 1 and Orca 2's 817K data for 4
| epochs.
|
| I think people are missing why they are comparing against Llama-2
| 13B/70B. They improved Llama-2 7B/13B and reached the level of a
| 5-10x larger model of the same base.
|
| This is huge. Models on HF.
|
| https://huggingface.co/papers/2311.11045
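|
| Schematically, the quoted recipe is just sequential fine-tuning
| of one checkpoint over three data mixes (finetune() and the
| dataset handles below are placeholders, not the authors' code):
|
|   # Progressive learning: the same checkpoint is fine-tuned in
|   # successive stages on progressively more targeted data.
|   def load_checkpoint(name):
|       return name  # stand-in for loading the real weights
|
|   def finetune(model, dataset, epochs):
|       return model  # standard supervised fine-tuning loop (omitted)
|
|   flan_v2 = "FLAN-v2 train split (zero- and few-shot)"
|   orca1_chatgpt = "5M ChatGPT responses from Orca 1"
|   orca1_gpt4 = "1M GPT-4 responses from Orca 1"
|   orca2_data = "Orca 2's 817K examples"
|
|   model = load_checkpoint("llama-2-13b")  # or llama-2-7b
|   model = finetune(model, flan_v2, epochs=1)
|   model = finetune(model, orca1_chatgpt, epochs=3)
|   model = finetune(model, [orca1_gpt4, orca2_data], epochs=4)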
| schleck8 wrote:
| Yeah, the 13b model outperforms the 70b Llama 2. Goes to show
| how much potential there is on the software optimization front
| as opposed to just scaling in size.
| T-A wrote:
| ...and quantized ones from the usual suspect:
|
| https://huggingface.co/TheBloke/Orca-2-7B-GGUF
|
| https://huggingface.co/TheBloke/Orca-2-13B-GGUF
|
| The 7B Q5_K_M one is small enough to run on an 8GB consumer
| GPU.
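|
| For reference, a minimal way to try the Q5_K_M file locally,
| assuming llama-cpp-python is installed and the GGUF was
| downloaded from the repo above (exact filename and prompt format
| may differ):
|
|   # Minimal local test of the quantized 7B model.
|   from llama_cpp import Llama
|
|   llm = Llama(
|       model_path="orca-2-7b.Q5_K_M.gguf",
|       n_gpu_layers=32,  # offload the 7B's layers to the GPU
|       n_ctx=4096,
|   )
|   out = llm("User: Why is the sky blue?\nAssistant:",
|             max_tokens=256)
|   print(out["choices"][0]["text"])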
| ganeshkrishnan wrote:
| All the 13B files seem to be quantized.
| jpdus wrote:
| It isn't.
|
| Compared to the original Orca model and method which spawned
| many of the current SotA OSS models, the Orca 2 models seem to
| perform underwhelmingly, below outdated 13B models and below
| Mistral 7B base models (e.g. [1]; didn't test myself yet,
| ymmv).
|
| [1]
| https://twitter.com/abacaj/status/1727004543668625618?t=R_vV...
| yujian wrote:
| I'm not sure if I'm missing something from the paper, but are
| multi-billion parameter models getting called "small" language
| models now? And when did this paradigm shift happen?
| Chabsff wrote:
| Nowadays, _small_ essentially means realistically usable on
| prosumer hardware.
| nathanfig wrote:
| Relative term. In the world of LLMs, 7b is small.
| hmottestad wrote:
| All the Llama models, including the 70B one, can run on consumer
| hardware. You might be able to fit GPT-3 (175B) at Q4 or Q3 on
| a Mac Studio, but that's probably the limit for consumer
| hardware. At 4-bit a 7B model requires some 4GB of ram, so that
| should probably be possible to run on a phone, just not very
| fast.
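|
| The back-of-the-envelope math (weights only, ignoring KV cache
| and runtime overhead, which is why it's ~4 GB in practice rather
| than 3.5 GB):
|
|   # Approximate weight memory of a 7B model at various precisions.
|   params = 7e9
|   for bits in (16, 8, 4):
|       gb = params * bits / 8 / 1e9
|       print(f"{bits}-bit: ~{gb:.1f} GB")  # 16: 14.0, 8: 7.0, 4: 3.5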
| sa-code wrote:
| Gpt 3.5 turbo is 20B
| kristianp wrote:
| I doubt that. What's your source?
| moffkalast wrote:
| When 175B, 300B, 1.8T models are considered large, 7B is
| considered small.
| iandanforth wrote:
| Released under the MS Research License, so not OSI and non-
| commercial, for the curious.
|
| https://huggingface.co/microsoft/Orca-2-13b/blob/main/LICENS...
| amelius wrote:
| This is why imho Microsoft is way cooler than Apple. They have
| tons of published research. At Apple, even speaking about your
| research with a friend may result in severe punishment.
| jjtheblunt wrote:
| Apple publishes too (search for it, for example), but much less.
| amelius wrote:
| Much, much, less. They are definitely not in the same league.
| jug wrote:
| This sounds quite exciting! Like Mistral all over again, only
| more transparent, more open, and with major backing, probably
| because Microsoft is looking to significantly reduce costs now
| that they're expanding AI widely across their platforms. The
| approach truly feels like a next step in LLM design.
___________________________________________________________________
(page generated 2023-11-21 23:00 UTC)