[HN Gopher] Understanding Llama 2 and the New Code Llama LLMs
___________________________________________________________________
Understanding Llama 2 and the New Code Llama LLMs
Author : rasbt
Score : 150 points
Date : 2023-08-30 12:32 UTC (10 hours ago)
(HTM) web link (magazine.sebastianraschka.com)
(TXT) w3m dump (magazine.sebastianraschka.com)
| Havoc wrote:
| Managed to get Code Llama 34B integrated into VS Code, and I
| must say it's surprisingly usable for scaffolding and also for
| explaining pieces of code.
| Isuckatcode wrote:
| Could you share instructions on how you did that?
| Havoc wrote:
| On a very high level you chain it together like so:
|
| llama.cpp >> OpenAI translation server (included in llama
| git) >> Continue extension in vscode
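|
| If it helps, here is a minimal sketch of the last hop from
| Python. The port, endpoint path, model name, and prompt are
| assumptions about a local llama.cpp OpenAI-style server
| (api_like_OAI.py in the llama.cpp repo), not a tested config:
|
|     import requests
|
|     # Ask the local OpenAI-compatible server (assumed to listen
|     # on port 8081) for a chat completion; most local servers
|     # ignore the "model" field.
|     resp = requests.post(
|         "http://localhost:8081/v1/chat/completions",
|         json={
|             "model": "codellama-34b-instruct",
|             "messages": [
|                 {"role": "user",
|                  "content": "Explain what a Python decorator does."},
|             ],
|             "temperature": 0.2,
|         },
|     )
|     print(resp.json()["choices"][0]["message"]["content"])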
| ImprobableTruth wrote:
| >GPT-3.5 has 175B parameters versus 70B parameters in Llama 2
|
| We know that for the original version of GPT-3.5, but my
| assumption was that Turbo was a distilled smaller model (which is
| why it uses OAI's new vocab & is so much faster).
|
| If that's not the case, what could be the explanation for it
| being faster?
| rasbt wrote:
| I think so too. But in general, it could also be due to other
| reasons: faster hardware, lower timeout for batched inference,
| optimizations like flash attention and flash attention 2,
| quantization, ...
|
| I'd say that it's probably a mix of all of the above (incl some
| distillation).
| visarga wrote:
| There is also speculative sampling: you decode a few tokens
| with a smaller model, then use the big model to validate them
| all at once. The big model might trim the prediction at some
| point and add an extra token of its own. Then the cycle repeats
| with the small model, for roughly a 2-2.5x speedup.
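|
| A toy sketch of that loop, where the "models" are plain Python
| callables standing in for a small draft model and a big target
| model. Verification is written token by token for clarity; in a
| real system the big model scores all drafted positions in one
| forward pass, which is where the speedup comes from:
|
|     def speculative_decode(draft_model, target_model, prompt,
|                            k=4, max_new=8):
|         # draft_model / target_model: fn(tokens) -> next token.
|         tokens = list(prompt)
|         while len(tokens) < len(prompt) + max_new:
|             # 1. Cheaply draft k tokens with the small model.
|             draft = []
|             for _ in range(k):
|                 draft.append(draft_model(tokens + draft))
|             # 2. Verify with the big model; on the first
|             #    disagreement keep its token and stop this round.
|             accepted = []
|             for t in draft:
|                 expected = target_model(tokens + accepted)
|                 accepted.append(t if t == expected else expected)
|                 if t != expected:
|                     break
|             tokens += accepted
|         return tokens
|
|     # Tiny demo with fake "models" that just count upward.
|     small = lambda toks: toks[-1] + (1 if toks[-1] % 5 else 2)
|     big = lambda toks: toks[-1] + 1
|     print(speculative_decode(small, big, [0]))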
| sebzim4500 wrote:
| It is widely believed that GPT-3.5 is an MoE model, which means
| it could have 175B parameters but still have much lower latency
| than GPT-3.
| rgbrgb wrote:
| Why would MoE make it lower latency?
| visarga wrote:
| You don't have to use the whole MoE model: for each token, only
| 1/N of the model is used, where N is the number of experts. So
| its compute utilisation scales more slowly than its memory
| usage.
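|
| Rough illustrative arithmetic (all numbers made up), assuming
| top-1 routing over 8 experts with some attention and embedding
| weights shared by every token:
|
|     total_params = 175e9   # parameters you must hold in memory
|     n_experts    = 8
|     shared_frac  = 0.25    # attention/embeddings used by all
|     expert_frac  = 1 - shared_frac
|
|     active = total_params * (shared_frac + expert_frac / n_experts)
|     print(f"{active / 1e9:.0f}B active per token, "
|           f"{total_params / 1e9:.0f}B held in memory")
|     # -> 60B active per token, 175B held in memory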
| sebzim4500 wrote:
| It's easier to parallelize so you can throw more GPUs at a
| single request (or really, batch of requests)
| rgbrgb wrote:
| Interesting, yeah I buy that, thanks. Building my
| intuition with this stuff. Anyone seen a good open-source
| implementation of MoE with Llama yet?
| sebzim4500 wrote:
| You can't just turn an existing model into an MoE; they need to
| be trained from scratch, unfortunately. I'm not aware of any
| open-source MoE models; they are complicated and probably not
| that useful if you want to run them on your own hardware.
| Me1000 wrote:
| Would you mind correcting my misunderstanding here? Code Llama
| is a fine-tuned version of Llama 2 (i.e. not trained from
| scratch). Suppose I fine-tuned Llama 2 on a bunch of law text
| to get Law Llama, and fine-tuned a couple more models on some
| history text and science text. Why wouldn't Code Llama, Law
| Llama, History Llama, and Science Llama be the experts in my
| MoE setup? It seems like I would just need a simple router in
| front of those models to direct the prompt to the right expert
| (see the sketch below).
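|
| A hypothetical sketch of that router. The model names and
| keyword lists are made up, and a real router would more likely
| be a small classifier:
|
|     EXPERTS = {
|         "code-llama":    {"python", "function", "bug", "compile"},
|         "law-llama":     {"contract", "statute", "liability"},
|         "history-llama": {"empire", "treaty", "revolution"},
|         "science-llama": {"quantum", "enzyme", "experiment"},
|     }
|
|     def route(prompt):
|         words = set(prompt.lower().split())
|         scores = {m: len(words & kws) for m, kws in EXPERTS.items()}
|         best = max(scores, key=scores.get)
|         # Fall back to the general chat model when nothing matches.
|         return best if scores[best] > 0 else "llama-2-chat"
|
|     print(route("why does my python function raise a TypeError"))
|     # -> code-llama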
| sebzim4500 wrote:
| That could work, but I'd expect the following issues:
|
| * For a lot of prompts, every fine-tuned model will make the
| same mistakes (they mostly share the same weights, after all),
| so you aren't getting nearly as much benefit as e.g. GPT-4
| gets.
|
| * It's going to be really expensive at inference time,
| since you have to run multiple models even though in most
| cases they won't help much
|
| * Normally when people talk about hobbyists doing fine-tuning,
| they mean <1M tokens, whereas Code Llama Python was fine-tuned
| on 100B tokens, way outside most people's price range. For the
| fine-tuning you can afford, you can't teach the model new
| knowledge, just show it how to apply the knowledge it already
| has.
| spmurrayzzz wrote:
| Jon Durbin has been working on LMoE, which isn't pure MoE but
| uses a LoRA-based approach instead. The core idea is
| dynamically swapping PEFT adapters based on the incoming
| utterance (rough sketch of the idea below).
|
| https://github.com/jondurbin/airoboros#lmoe
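|
| Roughly, the adapter-swapping idea looks like this with Hugging
| Face peft. The adapter repo names are hypothetical placeholders
| rather than real checkpoints, and routing is left out:
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|     from peft import PeftModel
|
|     base_id = "meta-llama/Llama-2-7b-hf"
|     base = AutoModelForCausalLM.from_pretrained(base_id)
|     tok = AutoTokenizer.from_pretrained(base_id)
|
|     # One LoRA adapter per "expert", all sharing the base weights.
|     model = PeftModel.from_pretrained(base, "my-org/lora-code",
|                                       adapter_name="code")
|     model.load_adapter("my-org/lora-law", adapter_name="law")
|
|     def generate(prompt, expert):
|         model.set_adapter(expert)   # swap in the routed expert
|         inputs = tok(prompt, return_tensors="pt")
|         out = model.generate(**inputs, max_new_tokens=64)
|         return tok.decode(out[0], skip_special_tokens=True)
|
|     print(generate("def quicksort(xs):", expert="code"))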
| Me1000 wrote:
| I'm pretty excited about LoRA MoEs, but for the sake of
| conversation I'll point out a reply someone made to me
| when I commented about them:
| https://news.ycombinator.com/item?id=37007795
|
| Any LoRA approach is obviously going to perform a little worse
| than a fully fine-tuned model, but I guess the jury is still
| out on whether this approach will actually work well.
|
| Exciting times!
| spmurrayzzz wrote:
| Yeah, it's definitely a tradeoff. My intuition here is that,
| much like the resistance to catastrophic forgetting you get
| when using LoRAs, adapter-based approaches will be useful in
| scenarios where your "experts" largely need to maintain the
| base capabilities of the model. So maybe the experts in this
| case are just style experts rather than knowledge experts (this
| is pure conjecture; we will see as we eval all these
| approaches).
| rasbt wrote:
| Interesting, I thought GPT-3.5 was considered GPT-3 +
| InstructGPT-style RLHF on a large scale, whereas GPT-4 is
| considered to be an MoE model.
| caeruleus wrote:
| There was an article on HN a couple of weeks ago that
| conjectured it might apply to GPT-3.5 Turbo as well:
| https://news.ycombinator.com/item?id=37006224
| rasbt wrote:
| Haven't seen that one, yet. Thanks for sharing!
| phillipcarter wrote:
| I think that unless (until?) OpenAI releases information about
| the model itself and the inference engine it runs on,
| everything is just speculation. Clearly, there's impressive ML
| and systems engineering at play with GPT-3.5-turbo given how
| capable, fast, and scalable to their customer base it is.
| syntaxing wrote:
| Does this mean there's most likely an unreleased version of
| Llama 2 34B at Meta, since they need one as a base for Code
| Llama?
| rwl4 wrote:
| The author of the article appears to have misunderstood one
| important detail about Code Llama.
|
| They state:
|
| _> The Code Llama models were trained on 500B tokens, whereas
| Llama 2 models were trained on 2T tokens. Since the Code Llama
| model was trained on 4x fewer tokens, maybe a CodeLlama 70B
| version did not perform well enough due to LLM scaling laws--
| there was not enough training data._
|
| But if you read the paper, on page 1, it says:
|
| _> Our approach is based on gradually specializing and
| increasing the capabilities of Llama 2 models by applying a
| cascade of training and fine-tuning steps [...]_
|
| In fact, they show a diagram at the top of page 3 that details
| the process, starting with Llama 2 foundation models.
|
| Llama 2 Foundation models (7B, 13B, 34B) -> Code training 500B ->
| Python / Long Context.
|
| See the paper here: https://arxiv.org/abs/2308.12950
| rasbt wrote:
| Good catch. Above that paragraph, I wrote that the Code Llama
| models were initialized with the Llama 2 weights, which makes
| this contradictory, indeed.
|
| What I meant to say here was 500B domain-specific tokens. Maybe
| domain-specific is not the right word here, but tokens related
| to the problems that the LLM aims to solve.
|
| EDIT: Updated the text to be more clear.
| jxy wrote:
| Right.
|
| ### off-topic rants below
|
| Somehow there are so many blog posts about these things, all
| trying to ask for your email. Is it becoming easier to put more
| words together nowadays? I guess so.
|
| I really wish there were a way to fact-check them all, instead
| of depending on good Samaritans in a comment on HN to point
| these obvious misconceptions out.
| simonw wrote:
| > Somehow there are so many blog posts about these things, all
| trying to ask for your email.
|
| That's because Substack defaults to bothering people for
| their email, and lots of people are using Substack as their
| blogging platform these days.
| behnamoh wrote:
| > and lots of people are using Substack as their blogging
| platform these days.
|
| They shouldn't. It's Medium all over again...
| cosmojg wrote:
| > I really wish there were a way to fact-check them all,
| instead of depending on good Samaritans in a comment on HN to
| point these obvious misconceptions out.
|
| You mean like reading original sources? Frequently, big
| research projects like this come with an official paper[1]
| and/or blog post[2] explaining what they did.
|
| [1] https://ai.meta.com/research/publications/code-llama-
| open-fo...
|
| [2] https://ai.meta.com/blog/code-llama-large-language-model-
| cod...
| behnamoh wrote:
| They also moved part of the article to another post and put it
| behind a paywall. Is that really necessary for someone who has
| already been a professor, written a famous book, and works at a
| (supposedly well-funded) AI company?
| sp332 wrote:
| It does say this: _Note that all Code Llama models were
| initialized with Llama 2 weights before they were further
| trained on code._
___________________________________________________________________
(page generated 2023-08-30 23:01 UTC)