[HN Gopher] Orca 2: Teaching Small Language Models How to Reason
       ___________________________________________________________________
        
       Orca 2: Teaching Small Language Models How to Reason
        
       Author : fgfm
       Score  : 267 points
       Date   : 2023-11-21 10:16 UTC (12 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | fgfm wrote:
        | Orca 2-13B consistently beats Llama 2-70B on most benchmarks in
       | 0-shot. Hopefully, research papers will start to include
       | Mistral/Zephyr 7B & Openchat 3.5. Even though they're smaller,
       | they're getting competitive against much larger models and
       | they're much cheaper to orchestrate.
        
       | ple13 wrote:
       | It fails other benchmarks vs Mistral-7b.
       | https://twitter.com/Teknium1/status/1726846755344634020
       | 
        | (There are some doubts about the validity of the comparison in
        | the comments)
        
         | eurekin wrote:
          | Also, worth mentioning the next tweet:
          | 
          |   Update, I benchmarked 13b Orca 2, it's still not surpassing
          |   the gpt4all score of Base Mistral or OpenHermes 2.5 7B:
          | 
          |   Hermes 2.5 7B Mistral score: 73.12%
          |   Mistral Base 7B score:       71.16%
          |   Orca 13B GPT4All score:      70.58%
         | 
         | https://twitter.com/Teknium1/status/1726833004117635414
        
       | davidkunz wrote:
       | For smaller models, I'm impressed by Mistral-7b or fine-tuned
       | variants like Zephyr. I use it regularly in Neovim[1] for mundane
       | tasks (grammar correction, summaries, ...). I'm curious how Orca
       | 2 performs, downloading it right now.
       | 
       | [1]: with https://github.com/David-Kunz/gen.nvim
        
         | eurekin wrote:
         | I'd love to see some demo of that!
        
           | davidkunz wrote:
           | A demo video is in the README (I used Mistral-7b in there).
        
             | eurekin wrote:
             | Amazing, thank you!
        
         | GaggiX wrote:
          | Also the OpenChat-3.5 model (it has 7B parameters; I think it
          | is also a Mistral finetune), demo: https://openchat.team/
        
           | schleck8 wrote:
            | Nice, it passes the weather test. I always ask open source
            | models what the weather is like and see whether they
            | hallucinate my location and a forecast. A few months ago,
            | without exception, all models I tried (even larger ones)
            | would just make up a temperature. Now it replies as it
            | should. Cool!
           | 
           | > what's the weather like today?
           | 
           | > I'm sorry, but I can't provide real-time weather
           | information. However, I can help you with general information
           | about weather conditions and forecasting.
        
           | nodja wrote:
            | Oh wow, this model is kind of amazing - it passes my
            | "creative" tests that only ChatGPT 3.5 did decently well on.
            | I've recently been disillusioned that open source has been
            | moving the wrong way due to the focus on benchmarks, but
            | this model seems to hit the spot in usefulness for more
            | wacky prompts ("write X in the style of Y" kind of prompts).
        
             | sorokod wrote:
             | Always surprised how poorly these models do on the
              | benchmarks they claim to do well on. OpenChat has a
              | benchmark radar diagram[1] but often fails on actual samples.
             | 
             | [1] https://github.com/imoneoi/openchat
        
         | titaniumtown wrote:
         | Haven't seen this neovim plugin before! I'm setting this up
         | right now.
        
       | intended wrote:
       | I really really want this to work.
       | 
        | However, at this point, benchmark success is about as effective
        | as results from someone who has been "taught to the test".
       | 
       | If say... Merck wanted to use this same model to reason out a
       | logistics issue, or apply it to some business problem at scale -
       | you'd have to deal with hallucinations all over the place.
       | 
       | The best analogy I have right now is that improved results on
       | benchmarks are like better acting from Hugh Laurie as House.
       | 
       | If you want to watch a show - great (generative work)
       | 
       | If you want to get a prescription - then not so much.
        
         | candiddevmike wrote:
         | I'm not a real AI doctor, I just play one on chat.openai.com.
        
         | FFP999 wrote:
          | The moment I read "how to reason" in the headline, my
          | bullshit detector started to go off.
         | 
         | LLMs do not reason, they do not think, they are not AGI. They
         | generate by regurgitating.
        
           | coderaptor wrote:
           | I haven't heard a definition of "reasoning" or "thinking"
           | that proves humans aren't doing exactly that same
           | probabilistic regurgitation.
           | 
           | I don't think it's possible to prove; feels like a
           | philosophical question.
        
             | intended wrote:
             | It's possible to prove.
             | 
             | Use an LLM to do a real world task that you should be able
             | to achieve by reasoning.
        
               | FFP999 wrote:
               | > Use an LLM to do a real world task that you should be
               | able to achieve by reasoning.
               | 
               | Such as explaining the logical fallacies in this argument
               | and the one above?
        
               | motoxpro wrote:
               | I mean I know you're joking but yes, it would be able to
               | do that.
        
               | intended wrote:
               | Take anything, see how far you get before you have to
               | really grapple with hallucination.
               | 
               | Once that happens, your mitigation strategy will end up
               | being the proof.
        
             | CAP_NET_ADMIN wrote:
              | LLMs can be trained on all the math books in the world,
              | starting from the easiest to the most advanced, and they
              | can regurgitate them almost perfectly, yet they won't apply
              | the concepts in those books to their actions. I'd count the
              | ability to learn new concepts and methods, and then being
              | able to use them, as "reasoning".
        
               | margorczynski wrote:
               | Aren't there quite a few examples of LLMs giving out-of-
               | distribution answers to stated problems? I think there
               | are two issues with LLMs and reasoning:
               | 
                | 1. They are single-pass and static - you "fake" short-
                | term memory by re-feeding the question along with its
                | answer.
                | 
                | 2. They have no real goal to achieve - one that they
                | would split into sub-goals, plan to achieve, estimate
                | the returns of each, etc.
                | 
                | As for 2., I think this is the main point of e.g. LeCun:
                | LLMs in themselves are simply single-modality world
                | models, and they lack the other components that would
                | make them true agents capable of reasoning.
        
             | RationalDino wrote:
             | I won't define reasoning, just call out one aspect.
             | 
             | We have the ability to follow a chain of reasoning, say
             | "that didn't work out", backtrack, and consider another.
             | ChatGPT seems to get tangled up when its first (very good)
             | attempt goes south.
             | 
             | This is definitely a barrier that can be crossed by
             | computers. AlphaZero is better than we are at it. But it is
             | a thing we do which we clearly don't simply do with the
             | probabilistic regurgitation method that ChatGPT uses.
             | 
             | That said, the human brain combines a bunch of different
             | areas that seem to work in different ways. Our ability to
              | engage in this kind of reasoning, for example, is known to
             | mostly happen in the left frontal cortex. So it seems
             | likely that AGI will also need to combine different modules
             | that work in different ways.
             | 
             | On that note, when you add tools to ChatGPT, it suddenly
             | can do a lot more than it did before. If those tools
             | include the right feedback loops, the ability to
             | store/restore context, and so on, what could it then do?
             | This isn't just a question of putting the right
             | capabilities in a box. They have to work together for a
             | goal. But I'm sure that we haven't achieved the limit of
             | what can be achieved.
        
               | azmodeus wrote:
                | Instead of going back, you can construct a tree of
                | different reasonings with an LLM and then take a vote or
                | synthesize; see Tree of Thought prompting.
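                | 
                | A rough Python sketch of that "sample several reasoning
                | paths, then vote" idea (closer to self-consistency voting
                | than a full tree search; generate() is a hypothetical
                | stand-in for a real LLM call):
                | 
                |   import random
                |   from collections import Counter
                | 
                |   def generate(prompt):
                |       # Hypothetical stand-in for an LLM call; it fakes
                |       # an answer so the sketch runs end to end.
                |       return random.choice(["42", "42", "41"])
                | 
                |   def vote_over_paths(question, n=5):
                |       # Sample n independent reasoning paths, keep only
                |       # the final answers, and take a majority vote.
                |       hint = "\nThink step by step, then answer."
                |       answers = [generate(question + hint).strip()
                |                  for _ in range(n)]
                |       return Counter(answers).most_common(1)[0][0]
                | 
                |   print(vote_over_paths("What is 6 * 7?"))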
        
               | Davidzheng wrote:
                | These are things we can teach children to do when they
                | don't do them at first. I don't see why we can't teach
                | this behavior to AI. Maybe we should teach LLMs to play
                | games, or do those proof exercises they teach in US high
                | school geometry, so they learn some formal structure
                | within which they can think about the world.
        
             | xanderlewis wrote:
             | It feels like humans _do_ do a similar regurgitation as
             | _part_ of a reasoning process, but if you play around with
             | LLMs and ask them mathematical questions beyond the
             | absolute basics it doesn't take long before they trip up
             | and reveal a total lack of 'understanding' as we would
             | usually understand it. I think we're easily fooled by the
             | fact that these models have mastered the art of talking
             | like an expert. Within any domain you choose, they've
             | mastered the form. But it only takes a small amount of real
             | expertise (or even basic knowledge) to immediately spot
              | that it's all gobbledygook, and I strongly suspect that
              | when it isn't, it's just down to luck (and the fact that
              | almost any question you can ask has been asked before and
              | is in the training data). Given the amount of data being
             | swallowed, it's hard to believe that the probabilistic
             | regurgitation you describe is ever going to lead to
             | anything like 'reasoning' purely through scaling. You're
             | right that asking what reasoning is may be a philosophical
             | question, but you don't need to go very far to empirically
             | verify that these models absolutely do not have it.
        
             | cloverich wrote:
              | On the other hand, it seems rather intuitive that we have a
              | logic-based component? It's the underpinning of science. We
              | have to be taught when we've stumbled upon something that
              | needs to be tested. But we can be taught that. And then,
              | once we learn to recognize it, we intuitively do so in
              | action.
             | ChatGPT can do this in a rudimentary way as well. It says a
             | program should work a certain way. Then it writes it. Then
             | it runs it. Then when the answer doesn't come out as
             | expected (at this point, probably just error cases), it
             | goes back and changes it.
             | 
             | It seems similar to what we do, if on a more basic level.
              | At any rate, it seems like a fairly straightforward 1-2
             | punch that, even if not truly intelligent, would let it
             | break through its current barriers.
        
           | QuadmasterXLII wrote:
           | With only the information we had in 2020, the two theories
           | "language models don't reason, they regurgitate" and "as
           | language models scale, they begin to think and reason" made
           | predictions, and the people who invested time and money based
           | on the predictions of the latter theory have done well for
           | themselves.
        
             | FFP999 wrote:
             | If you're trying to tell me there's a sucker born every
             | minute, I knew that.
        
             | intended wrote:
              | The people who bet on generative tasks are getting mileage
              | out of it.
             | 
             | People who bet on reasoning tasks, not so much.
        
           | schleck8 wrote:
            | AGI doesn't reason either. No one defines AGI as "AI, but
            | with reasoning". It's usually "AI that outperforms humans at
            | all disciplines, by any degree". Maybe you confused it with
            | ASI, but even then reasoning isn't a requirement, afaik.
        
           | pelorat wrote:
           | Reasoning is a learnt concept that involves retrieving
            | memories and running them through an algorithm, also retrieved
           | from memory, and then you loop the process until a classifier
           | deems the result to be adequate to the given goal.
        
             | sharemywin wrote:
              | I asked GPT-4 and it had some counterpoints:
             | 
             | Reasoning blends learned skills and natural cognition. It
             | integrates new information, not just past memories.
             | Reasoning is adaptable, not rigidly algorithmic. Emotions
             | and context also shape reasoning.
             | 
             | which seemed to make sense.
        
               | avion23 wrote:
                | I hope this will be found in history books and some
                | students will point out the irony that people are relying
                | on GPT-4's arguments about reasoning in a thread where
                | it's proclaimed that said model can't reason.
        
           | kgeist wrote:
           | Just yesterday I saw an example of a person asking GPT what
           | "fluftable" means. The word was invented by their little
           | daughter and they didn't know what it meant. GPT reasoned it
            | was a portmanteau of "fluffy" and "comfortable", and it made
           | sense because it was used in reference to a pillow. If it's
           | just regurgitation, I'd like to know how it's able to
           | understand novel words not found in the training data...
        
             | svaha1728 wrote:
             | I would read Francois Chollet's explanation of this. It's
             | very good: https://fchollet.substack.com/p/how-i-think-
             | about-llm-prompt...
             | 
             | For words that are not in the model's vocabulary, like
             | 'fluftable', the model uses a subword tokenization
             | strategy. It breaks down the word into smaller known
             | subunits (subwords or characters) and represents each
             | subunit with its own vector. By understanding the context
             | in which 'fluftable' appears and comparing it to known
             | words with similar subunits, the model can infer a
             | plausible meaning for the word. This is done by analyzing
             | the vector space in which these representations exist,
             | observing how the vectors align or differ from those of
             | known words.
             | 
             | 'As always, the most important principle for understanding
             | LLMs is that you should resist the temptation of
             | anthropomorphizing them.'
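              | 
              | A tiny illustration of that subword fallback, using the
              | tiktoken library purely as an example tokenizer (an
              | assumption - the tokenizer behind any given chat model may
              | split the word differently):
              | 
              |   import tiktoken  # pip install tiktoken
              | 
              |   # "fluftable" is not a single vocabulary entry, so the
              |   # tokenizer falls back to smaller known pieces; the
              |   # model only ever sees these pieces plus their context.
              |   enc = tiktoken.get_encoding("cl100k_base")
              |   ids = enc.encode("fluftable")
              |   pieces = [enc.decode_single_token_bytes(t) for t in ids]
              |   print(ids, pieces)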
        
               | sharemywin wrote:
               | isn't 'infer' another word for reason?
        
               | svaha1728 wrote:
               | vector math in a 1536-dimensional space?
        
               | lucubratory wrote:
               | I'm sorry, but that's absurd. Being able to explain the
               | precise mechanism behind reasoning would make anything
               | sound like it's not reasoning, because of our prior
               | experiences. If we understood human reasoning well enough
               | to explain exactly what happens in our brain, you would
               | conclude that we're not really reasoning because you can
               | provide an explanation of how we're reasoning about
               | novel, out of distribution data. This is "God of the
               | gaps" for thought.
        
           | gnaritas99 wrote:
           | You are simply incorrect. They can reason.
        
           | GenericPoster wrote:
           | Did you only read the title? Because the abstract gives you a
           | pretty good idea of what they mean when they say reason. It's
           | pretty easy to understand. No need to immediately call
           | bullshit just because of a minor semantic disagreement.
           | 
           | >ThEY DON'T tHiNk. They'rE JuSt STochAStiC pARrotS. It'S not
           | ReAL AGi.
           | 
           | It doesn't even matter if these claims are true or not.
           | They're missing the point of the conversation and the paper.
           | Reason is a perfectly valid word to use. So is think. If you
           | ask it a question and then follow up with 'think carefully'
            | or 'explain carefully', you'll get the same response.
           | 
           | inb4 AcTUALLy LlMS Can'T do aNYtHIng CaRefUlly BECaUse
           | pRogRAms ARen'T caRefUl
        
         | borg16 wrote:
         | > Merck wanted to use this same model to reason out a logistics
         | issue, or apply it to some business problem at scale - you'd
         | have to deal with hallucinations all over the place.
         | 
          | I wouldn't think Merck would leave it all to the model? There
          | will still be humans in the loop ensuring that the output is
          | valid for their use case. I don't think we are there yet
          | where we can completely productionize these models without
          | any human involvement whatsoever.
        
       | btbuildem wrote:
       | Are we beginning to see "specialized SLMs"? We've already seen
       | some pretend-agent based solutions (where the same model is given
        | several different roles and made to act as e.g. CEO / architect /
       | dev / sales in a startup).
       | 
       | I wonder if the way forward is to train smaller models with
       | different sets of "skills" or "neural affinities". One for
       | reasoning, one for summarization, one for math, one for code, etc
       | - then combining them into full-fledged solutions. Perhaps
       | smaller models can be "better" at their specific domains/tasks
       | than the giant generalist models can be at any of them.
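        | 
        | A very rough sketch of that kind of composition (everything here
        | is a stand-in, not an existing framework; in practice a cheap
        | classifier or a small "reasoning" model would pick the route):
        | 
        |   # Stub specialists standing in for small finetuned models.
        |   def summarizer(prompt):
        |       return "summary: ..."
        | 
        |   def coder(prompt):
        |       return "def solution(): ..."
        | 
        |   def mathematician(prompt):
        |       return "answer: 42"
        | 
        |   SPECIALISTS = {"summarize": summarizer,
        |                  "code": coder,
        |                  "math": mathematician}
        | 
        |   def route(task_type, prompt):
        |       # Dispatch the prompt to the matching specialist model.
        |       return SPECIALISTS[task_type](prompt)
        | 
        |   print(route("math", "What is 6 * 7?"))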
        
         | worldsayshi wrote:
          | Isn't this the whole idea of the Mixture of Experts approach
          | that GPT-4 is using?
        
           | htrp wrote:
            | Isn't MoE with switch transformers massively inefficient
           | compared to being able to customize which LLMs you are using?
           | 
           | I've seen a lot of agent swarm concepts in the smaller llm
           | space that seem to provide some feedback that this is a
           | viable avenue of research.
        
           | esafak wrote:
           | Is GPT-4's MOE based on combining specialized models?
        
         | hobofan wrote:
         | Yes, I think that is the general trend. Have one model tuned
         | for reasoning that decides a plan, based on which you invoke
         | other models as tools (see e.g. the ReWOO paper[0]). If I had
         | to guess, an approach like this is what powers the recent
         | Custom GPT/Assistant API products (based on the lag between
         | tool invocations I would guess that they also re-prompt for
         | plan adjustments between every set of tool calls).
         | 
         | Do that with a small model and hot-swap LORAs, and it should be
         | possible to build a quite powerful local assistant on consumer
         | hardware.
         | 
         | [0]: https://arxiv.org/abs/2305.18323
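          | 
          | A bare-bones sketch of that plan-then-execute loop (planner
          | and tools are stubs; the actual ReWOO scheme is more involved,
          | planning all steps up front with placeholders):
          | 
          |   def plan(goal):
          |       # Stub planner model: returns (tool, argument) steps.
          |       return [("search", goal),
          |               ("summarize", "search results")]
          | 
          |   TOOLS = {
          |       "search": lambda q: "top hits for " + q,
          |       "summarize": lambda text: "summary of " + text,
          |   }
          | 
          |   def run(goal):
          |       evidence = []
          |       for tool, arg in plan(goal):
          |           evidence.append(TOOLS[tool](arg))
          |           # A re-planning variant would prompt the planner
          |           # again here with the evidence gathered so far.
          |       return evidence[-1]
          | 
          |   print(run("current Orca 2 benchmark results"))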
        
         | trash_cat wrote:
          | Yes, this is the trend. OAI's marketplace of GPTs is a
          | confirmation of this. BabyAGI, AutoGen, and AutoGPT are all
          | multiple-LLM/SLM architectures under the hood. While we don't
          | have access to proprietary data or the ability to run bigger
          | models, the natural direction is to combine them with
          | specialized tasks like you just described. The issue is then
          | the interface: making it good, making the models communicate
          | seamlessly, and deciding what roles they play and what
          | architecture they operate in. The last point is up to your
          | imagination.
        
         | imhoguy wrote:
          | Specialized LLMs, and likely SLMs too, are really the future. I
          | use them mostly to aid programming and have just stopped
          | paying for GPT-4. Phind and others are now on par for my
          | coding needs.
        
       | Philpax wrote:
       | https://huggingface.co/microsoft/Orca-2-13b
       | 
       | https://huggingface.co/microsoft/Orca-2-7b
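        | 
        | For reference, loading the 7B checkpoint with Hugging Face
        | transformers looks roughly like this (a sketch - the model card
        | documents the exact prompt format and tokenizer caveats; a plain
        | prompt is used here for brevity):
        | 
        |   import torch
        |   from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |   name = "microsoft/Orca-2-7b"
        |   tok = AutoTokenizer.from_pretrained(name, use_fast=False)
        |   # device_map="auto" needs the accelerate package installed.
        |   model = AutoModelForCausalLM.from_pretrained(
        |       name, torch_dtype=torch.float16, device_map="auto")
        | 
        |   inputs = tok("Why is the sky blue?", return_tensors="pt")
        |   out = model.generate(**inputs.to(model.device),
        |                        max_new_tokens=128)
        |   print(tok.decode(out[0], skip_special_tokens=True))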
        
       | kromem wrote:
       | A really important nuance here is that they are building on top
       | of Llama-2, the pretrained model, and not Llama-2-chat.
       | 
       | I really think the entire field is doing a degree of damage with
       | the chat fine tuning beyond what might be expected, because
       | regularly part of that chat instruction is an emphasis on
        | identification as an LLM.
       | 
       | The problem with this is that nearly all of the training data
       | it's performing next token prediction on is text generated by
       | humans.
       | 
       | So there's an inherent narrowing of the model scope with most of
       | the fine tuning I've seen such that while pretrained models are
       | harder to use, I regularly prefer them over chat models when both
       | are available as even at similar temperatures the quality and
       | variety of language is much improved in the pretrained over chat
       | model.
       | 
       | This fine tuning was only introducing bias towards logical step
       | by step analysis and problem solving techniques, and the results
       | are great. But I'm willing to bet that an identical fine tuning
       | on top of the chat model would have been much worse on the
       | evaluations - not just the compounding of a typical fine tuning
       | loss of a few percent, but more like a double digit relative
       | difference.
       | 
       | It's quite frustrating that the anxiety over model safety is
       | likely throwing out tens of millions of dollars worth of data in
       | the pretrained model when only chat models are available for the
       | SotA, and I hope in the future a lighter touch is taken on fine
       | tuning the pretrained model and instead of focusing on safety
       | inherent to the model it is just set behind a safety oriented
       | discriminator or 'editor' which filters or modifies responses
       | accordingly.
       | 
       | I'd happily take a 2-3x increased API cost for a much more
       | broadly capable and performant model with similar safety
       | characteristics but without the handicaps that come with it.
       | 
       | So while a lot of the gains here might be due to the fine tuning,
       | I expect at least part is shrugging off the baggage of the
       | chat/safety fine tuning as well. Even in the first detailed
        | example, we can see that while Llama-2 goes off rambling later
        | on, its statement of John's relative knowledge is much clearer
        | and better connected between initial conditions and result than
        | Llama-2-chat's, particularly regarding theory of mind (i.e. "he
        | assumed" vs the latter's "it must be in").
        
         | kromem wrote:
         | Adding to this - it's really interesting the safety stuff that
         | *is* in this paper. Such as:
         | 
         | > We probe some of the categories where we see a larger
         | difference (e.g., violent) and observe that Orca 2 tends to
         | counter the harmful positions more often (which is penalized by
         | the metric), while models that have gone through RLHF safety
         | training tend to decline to respond more often (which is
         | rewarded by the metric).
         | 
          | Or the fact that Orca 2 is less likely to extend hate speech
          | than Llama-2-chat (which theoretically went through safety
          | fine tuning), even though Orca 2 did not have any explicit
          | safety fine tuning.
         | 
         | Research over the past year has really demonstrated (a) just
         | how impactful fine tuning can be - to the point of transmitting
         | capabilities from larger models to smaller, and (b) that we're
         | still clumsily wading through that process with only partial
         | clarity on best practices as the foundational pretrained models
         | get better and better at astounding rates.
        
       | alecco wrote:
       | > Progressive Learning: We start with LLaMA-2-7B or LLaMA-2-13B
       | checkpoint and finetune it on the train split of FLAN-v2 dataset
       | for one epoch. Note that FLAN-v2 dataset contains both zero-shot
       | and few-shot problems. We then train on 5 million ChatGPT data
       | from Orca 1 for 3 epochs. Then we train on the combination of 1
       | million GPT-4 data from Orca 1 and Orca 2's 817K data for 4
       | epochs.
       | 
       | I think people are missing why they are comparing against Llama-2
       | 13B/70B. They improved Llama-2 7B/13B and reach the level of a
       | 5-10x larger model of the same base.
       | 
       | This is huge. Models on HF.
       | 
       | https://huggingface.co/papers/2311.11045
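        | 
        | In pseudo-code, that curriculum is just sequential finetuning in
        | three stages (a sketch; finetune() is a hypothetical wrapper
        | around whatever SFT trainer you use):
        | 
        |   def finetune(model, dataset, epochs):
        |       # Hypothetical stand-in for a real SFT run; the point is
        |       # only the ordering and epoch counts of the stages.
        |       print("train", model, "on", dataset, "for", epochs)
        |       return model
        | 
        |   stages = [
        |       ("FLAN-v2 train split", 1),
        |       ("5M ChatGPT data from Orca 1", 3),
        |       ("1M GPT-4 data from Orca 1 + 817K Orca 2 data", 4),
        |   ]
        | 
        |   model = "LLaMA-2-7B checkpoint"  # or LLaMA-2-13B
        |   for dataset, epochs in stages:
        |       model = finetune(model, dataset, epochs)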
        
         | schleck8 wrote:
         | Yeah, the 13b model outperforms the 70b Llama 2. Goes to show
         | how much potential there is on the software optimization front
         | as opposed to just scaling in size
        
         | T-A wrote:
         | ...and quantized ones from the usual suspect:
         | 
         | https://huggingface.co/TheBloke/Orca-2-7B-GGUF
         | 
         | https://huggingface.co/TheBloke/Orca-2-13B-GGUF
         | 
         | The 7B Q5_K_M one is small enough to run on an 8GB consumer
         | GPU.
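          | 
          | Roughly, running the Q5_K_M file locally with llama-cpp-python
          | (the file name and layer count are assumptions; tune
          | n_gpu_layers to whatever fits in your VRAM):
          | 
          |   from llama_cpp import Llama
          | 
          |   llm = Llama(
          |       model_path="orca-2-7b.Q5_K_M.gguf",  # downloaded file
          |       n_gpu_layers=32,  # offload layers to the 8GB GPU
          |       n_ctx=4096,
          |   )
          |   out = llm("Why is the sky blue?", max_tokens=128)
          |   print(out["choices"][0]["text"])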
        
           | ganeshkrishnan wrote:
            | All the 13B files seem to be quantized.
        
         | jpdus wrote:
         | It isn't.
         | 
          | Compared to the original Orca model and method, which spawned
          | many of the current SotA OSS models, the Orca 2 models seem to
          | perform underwhelmingly: below outdated 13b models and below
          | Mistral 7b base models (e.g. [1]; didn't test myself yet,
          | ymmv).
         | 
         | [1]
         | https://twitter.com/abacaj/status/1727004543668625618?t=R_vV...
        
       | yujian wrote:
       | I'm not sure if I'm missing something from the paper, but are
       | multi-billion parameter models getting called "small" language
       | models now? And when did this paradigm shift happen?
        
         | Chabsff wrote:
          | Nowadays, _small_ essentially means realistically usable on
         | prosumer hardware.
        
         | nathanfig wrote:
         | Relative term. In the world of LLMs, 7b is small.
        
         | hmottestad wrote:
          | All the llama models, including the 70B one, can run on consumer
         | hardware. You might be able to fit GPT-3 (175B) at Q4 or Q3 on
         | a Mac Studio, but that's probably the limit for consumer
          | hardware. At 4-bit, a 7B model requires some 4GB of RAM, so it
         | should probably be possible to run on a phone, just not very
         | fast.
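          | 
          | The back-of-the-envelope math behind those numbers (weights
          | only; the KV cache and runtime overhead add to this):
          | 
          |   def weight_gib(params_billion, bits):
          |       # parameters * bits per weight, converted to GiB
          |       return params_billion * 1e9 * bits / 8 / 2**30
          | 
          |   for p in (7, 13, 70, 175):
          |       print(p, "B at 4-bit:",
          |             round(weight_gib(p, 4), 1), "GiB")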
        
           | sa-code wrote:
            | GPT-3.5 Turbo is 20B.
        
             | kristianp wrote:
             | I doubt that. What's your source?
        
         | moffkalast wrote:
         | When 175B, 300B, 1.8T models are considered large, 7B is
         | considered small.
        
       | iandanforth wrote:
        | Released under the MS Research License, so not OSI-approved and
        | non-commercial, for the curious.
       | 
       | https://huggingface.co/microsoft/Orca-2-13b/blob/main/LICENS...
        
       | amelius wrote:
        | This is why, imho, Microsoft is way cooler than Apple. They have
        | tons of published research. At Apple, even speaking about your
        | research with a friend may result in severe punishment.
        
         | jjtheblunt wrote:
          | Apple publishes too (just search for it), but much less.
        
           | amelius wrote:
           | Much, much, less. They are definitely not in the same league.
        
       | jug wrote:
        | This sounds quite exciting! Like Mistral all over again, only
        | more transparent and open, and with major backing, probably
        | because Microsoft is looking to significantly reduce costs now
        | that it's expanding AI widely across its platforms. The approach
        | truly feels like a next step in LLM design.
        
       ___________________________________________________________________
       (page generated 2023-11-21 23:00 UTC)