[HN Gopher] Understanding Llama 2 and the New Code Llama LLMs
       ___________________________________________________________________
        
       Understanding Llama 2 and the New Code Llama LLMs
        
       Author : rasbt
       Score  : 150 points
       Date   : 2023-08-30 12:32 UTC (10 hours ago)
        
 (HTM) web link (magazine.sebastianraschka.com)
 (TXT) w3m dump (magazine.sebastianraschka.com)
        
       | Havoc wrote:
        | Managed to get Code Llama 34B integrated into VS Code and
        | must say it's surprisingly usable for scaffolding and also
        | for explaining pieces of code.
        
         | Isuckatcode wrote:
          | Could you share instructions on how you did that?
        
           | Havoc wrote:
            | At a very high level, you chain it together like so:
            | 
            | llama.cpp >> OpenAI translation server (included in the
            | llama.cpp repo) >> Continue extension in VS Code
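            | 
            | A minimal sketch of what talks to what, assuming the
            | translation server listens on localhost:8081 (port and
            | model name are placeholders for whatever you started the
            | server with); the Continue extension just makes the same
            | kind of request from inside VS Code:
            | 
            |     import openai
            | 
            |     # Point the OpenAI client at the local translation
            |     # server instead of api.openai.com.
            |     openai.api_base = "http://localhost:8081/v1"
            |     openai.api_key = "not-needed-locally"
            | 
            |     resp = openai.ChatCompletion.create(
            |         model="codellama-34b",  # label only
            |         messages=[{
            |             "role": "user",
            |             "content": "Explain this function: ...",
            |         }],
            |     )
            |     print(resp["choices"][0]["message"]["content"])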
        
       | ImprobableTruth wrote:
        | > GPT-3.5 has 175B parameters versus 70B parameters in Llama 2
       | 
       | We know that for the original version of GPT-3.5, but my
       | assumption was that Turbo was a distilled smaller model (which is
       | why it uses OAI's new vocab & is so much faster).
       | 
       | If that's not the case, what could be the explanation for it
       | being faster?
        
         | rasbt wrote:
         | I think so too. But in general, it could also be due to other
         | reasons: faster hardware, lower timeout for batched inference,
         | optimizations like flash attention and flash attention 2,
         | quantization, ...
         | 
         | I'd say that it's probably a mix of all of the above (incl some
         | distillation).
        
         | visarga wrote:
          | There is also speculative sampling - you decode a few
          | tokens with a smaller model, then use the big model to
          | validate them all in a single pass. The big model might
          | trim the prediction at some point and add an extra token of
          | its own. Then you cycle again with the small model, for
          | roughly a 2-2.5x speedup.
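          | 
          | A rough greedy sketch of that loop, assuming draft_model
          | and target_model are hypothetical callables that map a
          | batch of token ids to logits (the published method verifies
          | with rejection sampling on probabilities rather than exact
          | greedy matching):
          | 
          |     import torch
          | 
          |     def speculative_decode(draft_model, target_model,
          |                            tokens, k=4, steps=32):
          |         for _ in range(steps):
          |             # 1) the cheap model drafts k tokens
          |             draft = list(tokens)
          |             for _ in range(k):
          |                 logits = draft_model(torch.tensor([draft]))[0, -1]
          |                 draft.append(int(logits.argmax()))
          | 
          |             # 2) the big model scores the draft in one pass
          |             logits = target_model(torch.tensor([draft]))[0]
          |             want = [int(logits[len(tokens) - 1 + i].argmax())
          |                     for i in range(k + 1)]
          | 
          |             # 3) keep the agreeing prefix, then take one
          |             #    extra token from the big model
          |             n = 0
          |             while n < k and draft[len(tokens) + n] == want[n]:
          |                 n += 1
          |             tokens = draft[:len(tokens) + n] + [want[n]]
          |         return tokens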
        
         | sebzim4500 wrote:
          | It is widely believed that GPT-3.5 is an MoE model, which
          | means it could have 175B parameters but still run at much
          | lower latency than GPT-3.
        
           | rgbrgb wrote:
           | Why would MoE make it lower latency?
        
             | visarga wrote:
              | You don't have to use the whole MoE model: for each
              | token, only 1/N of the model is used, where N is the
              | number of experts. So its compute utilisation scales
              | more slowly than its memory usage.
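              | 
              | A toy sketch of that routing idea (purely
              | illustrative - nobody outside OpenAI knows what they
              | actually run, and the layer sizes here are made up):
              | 
              |     import torch
              |     import torch.nn as nn
              | 
              |     class Top1MoE(nn.Module):
              |         """Switch-style MoE layer: a router sends each
              |         token to one of N expert MLPs, so only 1/N of
              |         the FFN weights are touched per token even
              |         though all N experts sit in memory."""
              |         def __init__(self, d_model=512, n_experts=8):
              |             super().__init__()
              |             self.router = nn.Linear(d_model, n_experts)
              |             self.experts = nn.ModuleList(
              |                 nn.Sequential(
              |                     nn.Linear(d_model, 4 * d_model),
              |                     nn.GELU(),
              |                     nn.Linear(4 * d_model, d_model))
              |                 for _ in range(n_experts))
              | 
              |         def forward(self, x):  # x: [tokens, d_model]
              |             gates = self.router(x).softmax(dim=-1)
              |             weight, idx = gates.max(dim=-1)  # top-1
              |             out = torch.zeros_like(x)
              |             for e, expert in enumerate(self.experts):
              |                 mask = idx == e
              |                 if mask.any():
              |                     out[mask] = (weight[mask, None]
              |                                  * expert(x[mask]))
              |             return out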
        
             | sebzim4500 wrote:
              | It's easier to parallelize, so you can throw more GPUs
              | at a single request (or really, a batch of requests).
        
               | rgbrgb wrote:
               | Interesting, yeah I buy that, thanks. Building my
               | intuition with this stuff. Anyone seen a good open-source
               | implementation of MoE with Llama yet?
        
               | sebzim4500 wrote:
                | You can't just turn an existing model into an MoE;
                | they need to be trained from scratch, unfortunately.
                | I'm not aware of any open-source MoE models, and
                | they are complicated and probably not that useful if
                | you want to run them on your own hardware.
        
               | Me1000 wrote:
                | Would you mind correcting my misunderstanding here?
                | Code Llama is a fine-tuned version of Llama 2 (i.e.
                | not trained from scratch). Suppose I fine-tuned
                | Llama 2 on a bunch of law text and had Law Llama,
                | and fine-tuned a couple more on some history text
                | and science text. Why wouldn't Code Llama, Law
                | Llama, History Llama, and Science Llama be the
                | experts in my MoE setup? It seems like I just need a
                | simple router in front of those models to direct the
                | prompt to the right expert.
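                | 
                | Roughly what I mean by "simple router" - a toy
                | sketch where the expert names and keyword lists are
                | made up, and a real version would more likely be a
                | small classifier or an embedding lookup:
                | 
                |     EXPERTS = {
                |         "code": ["def ", "traceback", "compile"],
                |         "law": ["statute", "contract", "liability"],
                |         "history": ["empire", "treaty", "century"],
                |     }
                | 
                |     def route(prompt, default="base"):
                |         p = prompt.lower()
                |         scores = {name: sum(kw in p for kw in kws)
                |                   for name, kws in EXPERTS.items()}
                |         best = max(scores, key=scores.get)
                |         return best if scores[best] else default
                | 
                |     # route("Why does this traceback mention foo?")
                |     # -> "code"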
        
               | sebzim4500 wrote:
               | That could work, but I'd expect the following issues:
               | 
                | * For a lot of prompts, every fine-tuned model will
                | make the same mistakes (they mostly share the same
                | weights, after all), so you aren't getting nearly as
                | much benefit as e.g. GPT-4 gets.
               | 
                | * It's going to be really expensive at inference
                | time, since you have to run multiple models even
                | though in most cases they won't help much.
               | 
                | * Normally when people talk about hobbyists doing
                | fine-tuning they mean <1M tokens, whereas Code Llama
                | Python was fine-tuned on 100B tokens, which is way
                | outside most people's price range. With the
                | fine-tuning you can afford, you can't teach the
                | model new knowledge, just show it how to apply the
                | knowledge it already has.
        
               | spmurrayzzz wrote:
                | Jon Durbin has been working on LMoE, which isn't
                | pure MoE but uses a LoRA-based approach instead. The
                | core idea is to dynamically swap PEFT adapters based
                | on the incoming utterance.
               | 
               | https://github.com/jondurbin/airoboros#lmoe
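                | 
                | A minimal sketch of the swap mechanic with Hugging
                | Face PEFT (the base model id, adapter paths, and
                | adapter names are placeholders - this is not the
                | airoboros routing code itself):
                | 
                |     from transformers import (AutoModelForCausalLM,
                |                               AutoTokenizer)
                |     from peft import PeftModel
                | 
                |     base_id = "meta-llama/Llama-2-7b-hf"
                |     base = AutoModelForCausalLM.from_pretrained(base_id)
                |     tok = AutoTokenizer.from_pretrained(base_id)
                | 
                |     # Load two LoRA adapters onto the same base model.
                |     model = PeftModel.from_pretrained(
                |         base, "adapters/code", adapter_name="code")
                |     model.load_adapter("adapters/law", adapter_name="law")
                | 
                |     def generate(prompt, expert):
                |         model.set_adapter(expert)  # hot-swap LoRA weights
                |         ids = tok(prompt, return_tensors="pt").input_ids
                |         out = model.generate(ids, max_new_tokens=64)
                |         return tok.decode(out[0], skip_special_tokens=True)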
        
               | Me1000 wrote:
               | I'm pretty excited about LoRA MoEs, but for the sake of
               | conversation I'll point out a reply someone made to me
               | when I commented about them:
               | https://news.ycombinator.com/item?id=37007795
               | 
                | Any LoRA approach is obviously going to perform a
                | little worse than a fully tuned model, but I guess
                | the jury is still out on whether this approach will
                | actually work well.
               | 
               | Exciting times!
        
               | spmurrayzzz wrote:
                | Yeah, it's definitely a tradeoff. My intuition here
                | is that, much like the resistance to catastrophic
                | forgetting you get when using LoRAs, adapter-based
                | approaches will be useful in scenarios where your
                | "experts" largely need to maintain the base
                | capabilities of the model. So maybe the experts in
                | this case are just style experts rather than
                | knowledge experts (this is pure conjecture; we will
                | see as we eval all these approaches).
        
           | rasbt wrote:
           | Interesting, I thought GPT-3.5 was considered GPT-3 +
           | InstructGPT-style RLHF on a large scale, whereas GPT-4 is
           | considered to be an MoE model.
        
             | caeruleus wrote:
              | There was an article on HN a couple of weeks ago that
              | conjectured the MoE approach might apply to GPT-3.5
              | Turbo as well:
              | https://news.ycombinator.com/item?id=37006224
        
               | rasbt wrote:
               | Haven't seen that one, yet. Thanks for sharing!
        
         | phillipcarter wrote:
         | I think that unless (until?) OpenAI releases information about
         | the model itself and the inference engine it runs on,
         | everything is just speculation. Clearly, there's impressive ML
          | and systems engineering at play with GPT-3.5-turbo, given
          | how capable and fast it is and how well it scales to their
          | customer base.
        
       | syntaxing wrote:
        | Does this mean there's most likely an unreleased version of
        | Llama 2 34B at Meta, since they need one as a base for Code
        | Llama?
        
       | rwl4 wrote:
       | The author of the article appears to have misunderstood one
       | important detail about Code Llama.
       | 
       | They state:
       | 
       |  _> The Code Llama models were trained on 500B tokens, whereas
       | Llama 2 models were trained on 2T tokens. Since the Code Llama
       | model was trained on 4x fewer tokens, maybe a CodeLlama 70B
       | version did not perform well enough due to LLM scaling laws--
       | there was not enough training data._
       | 
       | But if you read the paper, on page 1, it says:
       | 
       |  _> Our approach is based on gradually specializing and
       | increasing the capabilities of Llama 2 models by applying a
       | cascade of training and fine-tuning steps [...]_
       | 
       | In fact, they show a diagram at the top of page 3 that details
       | the process, starting with Llama 2 foundation models.
       | 
       | Llama 2 Foundation models (7B, 13B, 34B) -> Code training 500B ->
       | Python / Long Context.
       | 
       | See the paper here: https://arxiv.org/abs/2308.12950
        
         | rasbt wrote:
         | Good catch. Above that paragraph, I wrote that the Code Llama
         | models were initialized with the Llama 2 weights, which makes
         | this contradictory, indeed.
         | 
         | What I meant to say here was 500B domain-specific tokens. Maybe
         | domain-specific is not the right word here, but tokens related
         | to the problems that the LLM aims to solve.
         | 
         | EDIT: Updated the text to be more clear.
        
         | jxy wrote:
         | Right.
         | 
         | ### off topic rants below
         | 
          | Somehow there are so many blog posts about these things,
          | all trying to ask for your email. Is it becoming easier to
          | put more words together nowadays? I guess so.
         | 
          | I really wish there were a way to fact-check them all,
          | instead of depending on good Samaritans in the HN comments
          | to point these obvious misconceptions out.
        
           | simonw wrote:
            | > Somehow there are so many blog posts about these
            | things, all trying to ask for your email.
           | 
           | That's because Substack defaults to bothering people for
           | their email, and lots of people are using Substack as their
           | blogging platform these days.
        
             | behnamoh wrote:
             | > and lots of people are using Substack as their blogging
             | platform these days.
             | 
             | they shouldn't. It's Medium all over again...
        
           | cosmojg wrote:
            | > I really wish there were a way to fact-check them all,
            | instead of depending on good Samaritans in the HN
            | comments to point these obvious misconceptions out.
           | 
           | You mean like reading original sources? Frequently, big
           | research projects like this come with an official paper[1]
           | and/or blog post[2] explaining what they did.
           | 
           | [1] https://ai.meta.com/research/publications/code-llama-
           | open-fo...
           | 
           | [2] https://ai.meta.com/blog/code-llama-large-language-model-
           | cod...
        
         | behnamoh wrote:
         | They also moved part of the article to another post and made it
         | paywalled. Is that really necessary for someone who's already
         | been a professor, has a famous book, and works at a (supposedly
         | highly invested) AI company?
        
         | sp332 wrote:
         | It does say this: _Note that all Code Llama models were
         | initialized with Llama 2 weights before they were further
         | trained on code._
        
       ___________________________________________________________________
       (page generated 2023-08-30 23:01 UTC)