[HN Gopher] OpenLLaMA 13B Released
___________________________________________________________________
OpenLLaMA 13B Released
Author : tosh
Score : 164 points
Date : 2023-06-18 15:29 UTC (7 hours ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| knaik94 wrote:
| Koboldcpp [1], which builds on llama.cpp and adds a GUI, is a
| great way to run these models. Most people aren't running these
| models at full weight; GGML quantization is recommended for
| CPU+GPU setups, or GPTQ if you have the GPU VRAM.
|
| GGML 13B models at 4-bit (Q4_0) take somewhere around 9GB of RAM
| and Q5_K_M takes about 11GB. GPU offloading support has also been
| added; I've been offloading 22 layers on my laptop's RTX 2070
| Max-Q (8GB VRAM) with CLBlast (see the sketch below). I get
| around 2-3 tokens per second with 13B models. In my experience,
| running 13B models is worth the extra time it takes to generate a
| response compared to 7B models. GPTQ is faster, I think, but I
| can't fit a quantized 13B model in VRAM so I don't use it.
|
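| As a rough illustration, GPU offloading through the
| llama-cpp-python bindings looks something like this (a minimal
| sketch; the model filename is hypothetical and the layer count
| is whatever fits your VRAM):
|
|     # pip install llama-cpp-python (built with CLBlast or cuBLAS)
|     from llama_cpp import Llama
|
|     llm = Llama(
|         model_path="open-llama-13b.ggmlv3.q4_0.bin",  # hypothetical file
|         n_ctx=2048,        # context window
|         n_gpu_layers=22,   # layers offloaded to the GPU
|     )
|     out = llm("Q: What is quantization? A:", max_tokens=64)
|     print(out["choices"][0]["text"])
|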
| TheBloke [2] has been quantizing models and uploading them to HF,
| and will probably upload a quantized version of this one soon.
| His Discord server also has good guides to help you get going,
| linked in the model card of most of his models.
|
| https://github.com/LostRuins/koboldcpp
|
| https://huggingface.co/TheBloke
|
| Edit: There's a bug with the newest Nvidia drivers that causes a
| slowdown with large context sizes. I downgraded and stayed on
| 531.61. The theory is that newer drivers change how CUDA
| out-of-memory handling works when trying to avoid OOM errors.
|
| https://www.reddit.com/r/LocalLLaMA/comments/1461d1c/major_p...
|
| https://github.com/vladmandic/automatic/discussions/1285
| tyfon wrote:
| I can actually run the entire Q4_K_S version of this on the GPU
| with my 3060, and it's blazing fast in this mode (~10 tokens per
| second) with the latest llama.cpp; it should be the same for
| koboldcpp too.
| jejeyyy77 wrote:
| Only a matter of time before ChatGPT goes the way of Dall-E...
| bilsbie wrote:
| Would this be a good model for my work on LLM mechanistic
| interpretability?
|
| I only have an older MacBook so I'm not sure what I can install.
| sbierwagen wrote:
| Why the special characters?
| courseofaction wrote:
| Testing in Colab:
|
| Loaded into 27.7GB of VRAM, requiring an A100 (without
| quantization).
|
| Inferences are speedy, looks promising for a local solution
| compared to other models which have been released recently.
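|
| For anyone who wants to reproduce this, a minimal loading sketch
| along the lines of the usual transformers pattern (assumes the
| openlm-research/open_llama_13b repo id, accelerate installed, and
| an fp16-capable GPU):
|
|     import torch
|     from transformers import LlamaTokenizer, LlamaForCausalLM
|
|     model_path = "openlm-research/open_llama_13b"
|     tokenizer = LlamaTokenizer.from_pretrained(model_path)
|     model = LlamaForCausalLM.from_pretrained(
|         model_path, torch_dtype=torch.float16, device_map="auto"
|     )
|     inputs = tokenizer("Q: What is the capital of France?\nA:",
|                        return_tensors="pt").to(model.device)
|     output = model.generate(**inputs, max_new_tokens=32)
|     print(tokenizer.decode(output[0]))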
| MuffinFlavored wrote:
| > Inferences are speedy, looks promising for a local solution
| compared to other models which have been released recently.
|
| Is there any kind of standardized test to gauge the quality
| (not the speed) of LLM answers? I.e., how badly does it
| hallucinate?
| rgovostes wrote:
| There is the Language Model Evaluation Harness project which
| evaluates LLMs on over 200 tasks. HuggingFace has a
| leaderboard tracking performance on a subset of these tasks.
|
| https://github.com/EleutherAI/lm-evaluation-harness
|
| https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb.
| ..
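|
| If it helps, the harness also has a Python entry point; a rough
| sketch (exact arguments may differ between versions):
|
|     # pip install lm-eval
|     from lm_eval import evaluator
|
|     results = evaluator.simple_evaluate(
|         model="hf-causal",
|         model_args="pretrained=openlm-research/open_llama_13b",
|         tasks=["hellaswag", "arc_easy"],
|         num_fewshot=0,
|     )
|     print(results["results"])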
| oidar wrote:
| Would you mind sharing that notebook?
| courseofaction wrote:
| Sure, https://colab.research.google.com/drive/1r4FAveF9t8b8PN
| iqpRH... :)
| andreygrehov wrote:
| Serious question (since I'm not familiar with AI/ML), what's the
| point of releasing these "smaller" (5B, 10B, 13B) models, given
| there are plenty of bigger models now (Falcon 40B, LLaMa 65B)?
| pythux wrote:
| It is very expensive to train these base models so a smaller
| size is more practical if you aren't a big company with
| hundreds of powerful GPUs at hand. Table 15 from LLaMA paper[1]
| has some insightful figures: it took 135,168 GPU hours to train
| the 13B version and a bit more than 1M GPU hours for the 65B
| version. And we are talking about A100 80GB GPUs here
| (expensive and scarce). Not everyone can afford these kinds of
| training runs (especially if it takes a few attempts, e.g. if
| you've got a bug in the tokenizer).
|
| [1] https://arxiv.org/pdf/2302.13971.pdf
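|
| Back-of-the-envelope, with a purely hypothetical $2/GPU-hour
| cloud rate (not a figure from the paper):
|
|     A100_HOURLY_USD = 2.0                  # hypothetical cloud rate
|     gpu_hours = {"13B": 135_168,           # Table 15 of the paper
|                  "65B": 1_000_000}         # "a bit more than 1M"
|     for size, hours in gpu_hours.items():
|         print(f"{size}: ~${hours * A100_HOURLY_USD:,.0f}")
|     # 13B: ~$270,336    65B: ~$2,000,000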
| andreygrehov wrote:
| Hold on, are you saying I can grab the 13B OpenLLaMa model
| and train it? I thought all of these models are already pre-
| trained and represent sort of the end state. Am I completely
| missing the point?
| JohnKemeny wrote:
| A neural network is just a bunch of weights. You can always
| continue modifying the weights as you see fit. A network is
| never "done" learning.
| [deleted]
| rish-b wrote:
| A common reason is to reduce cost and latency. Larger models
| typically require GPUs with more memory (and hence higher
| costs), plus the time to serve requests is also higher (more
| matrix multiplications to be done).
| andreygrehov wrote:
| Got it. That makes sense. Thank you. But what about the
| quality then? Can the quality of a 13B model be the same as
| the quality of, say, a 30B model?
| rolisz wrote:
| Flan-T5 is a 3B model that is of comparable quality to
| Llama 13B.
|
| Moreover, you can fine-tune a model for your specific tasks,
| and you need fewer resources to fine-tune a smaller model.
| spacebanana7 wrote:
| As a general principle the larger models are better
| quality.
|
| However, fine tuned small models can outperform general
| purpose large models on specific tasks.
|
| There are also many lightweight tasks, like basic sentiment
| analysis, where the correctness of small models can be good
| enough to the point of being indistinguishable from large
| models.
| wahahah wrote:
| RAM requirements
| 0xferruccio wrote:
| Interesting to see both Google Cloud and Stability AI mentioned
| as providing the compute. Did Stability pay the bill for the
| resources used for training?
| emadm wrote:
| We (Stability AI) trained it on our TPUs with input from the
| OpenLM team as an OpenLLaMA collaboration.
|
| The 20b model is 780b tokens in; lots of learnings, so we can
| optimise future runs.
|
| Hopefully these will be useful bases for continued research; we
| will have some SFT/RLHF variants in due course from our Carper
| AI lab.
| brianjking wrote:
| Nice, I wish it was a little easier to integrate these models
| into Chat UIs like the one from Vercel or even a simple Gradio
| app.
|
| Does anyone have any Spaces/Colab notebooks/etc to try this out
| on?
|
| Thanks!
| ccooffee wrote:
| I've found https://chat.lmsys.org/ to be a useful multi-LLM
| chat app without scary ToS data-mining clauses.
| brucethemoose2 wrote:
| There are many UIs for running locally, but the easiest is
| koboldcpp:
|
| https://github.com/LostRuins/koboldcpp
|
| It's a llama.cpp wrapper descended from the roleplaying
| community, but it works fine (and performantly) for question
| answering and such.
|
| You will need to download the model from HF and quantize it
| yourself: https://github.com/ggerganov/llama.cpp#prepare-data--
| run
| brianjking wrote:
| Plenty of ways to run locally - I'm looking for ways to do
| inference via Colab or Huggingface.
| brucethemoose2 wrote:
| Oh, well, you can log in to Hugging Face and deploy it in a
| Space with a button above the model.
|
| Not sure about Colab at the moment.
| brianjking wrote:
| That can give you an inference endpoint for the API, but I'm
| talking about a full chat UI where you can set the
| temperature, etc.
| spmurrayzzz wrote:
| This isn't specifically Colab or HF, but have you checked
| out any of the community RunPod templates? There are a few
| out there that give you a mostly turnkey way to deploy
| models and test via Oobabooga, KoboldAI, or similar.
|
| I use the one-click UI from TheBloke pretty frequently
| for inference testing, and I know there are some newer
| ones that also give you fine-tuning capabilities.
| ozr wrote:
| Note that this model can't really be used for most code tasks.
| The tokenizer collapses repeated spaces, so the model doesn't
| have a valid concept of indentation.
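|
| A quick way to see this for yourself (a sketch; assumes the HF
| tokenizer for openlm-research/open_llama_13b):
|
|     from transformers import LlamaTokenizer
|
|     tok = LlamaTokenizer.from_pretrained("openlm-research/open_llama_13b")
|     snippet = "def f():\n    return 1"
|     tokens = tok.tokenize(snippet)
|     print(tokens)
|     # If the reported whitespace merging holds, the four spaces
|     # before "return" won't survive a round trip:
|     print(tok.convert_tokens_to_string(tokens))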
| searealist wrote:
| Seems trivial to autoformat afterwards?
| teaearlgraycold wrote:
| Except with Python lol
| hungrigekatze wrote:
| For some discussion on how to have the LLaMa tokenizer
| (properly) handle repeating spaces, please see this discussion:
| https://github.com/openlm-research/open_llama/issues/40
| baq wrote:
| So, useless for the one thing LLMs are actually properly
| grounded in? Sounds like a self-inflicted wound.
| laurentlb wrote:
| In most languages though, you can reformat the code and get
| the indentation back.
| mk_stjames wrote:
| I was going to try to be snarky and make a comment about the
| superiority of tabs, but I just ran samples through the
| tokenizer and it doesn't recognize tabs either. :-(
| TeMPOraL wrote:
| There's still space for snark - one might observe that the
| model is just, wisely, excluding Python.
| 2-718-281-828 wrote:
| not every language is indentation based ...
| brucethemoose2 wrote:
| GPT-J 7B and LLaMA 7B don't look that different in the metrics
| table, but they are like night and day if finetuned and actually
| used for question answering, roleplay and such.
|
| If 13B is good... I wonder if this will catch on in the
| finetuning community.
|
| People care less about the LLaMA license than you'd think, and
| this is also about the time new models with "improved"
| architectures (like Falcon) should start popping up.
| rcme wrote:
| Doesn't llama.cpp resolve all the licensing issues? The models
| themselves aren't subject to copyright, so you can use the
| model weights as long as you haven't entered into an agreement
| with Meta about their usage.
| bioemerl wrote:
| For me this is most significant in the sense that I can use it
| at my workplace.
| techwiz137 wrote:
| Pardon me for asking what might be an obvious answer to some,
| but does increasing the parameters lead to a linear growth of
| the capability of the neural net, or is it different.
|
| My knowledge of neural nets and AI is just lacking.
| haldujai wrote:
| Complicated question, the answer is "it depends".
|
| Several factors influence performance beyond parameter count,
| notable ones include: training corpus quality, training
| flops, and the downstream task.
|
| It depends on how much compute you are spending on training
| and how big of a model you're talking about.
|
| There's a "minimum" tokens/parameter ratio for increasing size
| to be effective at improving loss/perplexity, so as you go up
| in parameters you generally have to broaden your corpus, which
| may lower its quality (e.g. tweets/Reddit posts vs
| books/articles). See the rough sketch at the end of this
| comment.
|
| This effect isn't as significant at 65B parameters, as there
| is still enough high-quality training data, but if you're
| talking 1T parameters the corpus will (probably; I haven't
| tried this or seen it done) by necessity be of significantly
| poorer quality, as you would overfit by simply repeating data
| (repetition is only beneficial so many times).
|
| As a general rule, when validation loss/perplexity are the
| same in two models of different sizes downstream performance
| seems to be also generally the same (this was briefly
| explored in the PaLM 2 paper by Google) although it doesn't
| correlate perfectly.
|
| Practically speaking, this translates into a bigger model is
| better for applications we're generally talking about. It
| just may not hold infinitely which we're starting to see
| evidence of.
|
| It's definitely not linear though; you can look at some of
| the OpenLLaMA benchmarks (without getting into the weeds of
| whether current benchmarks are representative) and the
| accuracy improvements even at 13B are not that significant
| (noting here that all models were trained for the same number
| of tokens, so the smaller models are relatively overtrained).
|
| There are probably some threshold parameter sizes that make a
| big difference but it's still being determined.
|
| https://ai.google/static/documents/palm2techreport.pdf
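|
| The tokens/parameter point, roughly: the Chinchilla rule of
| thumb is on the order of 20 training tokens per parameter for
| compute-optimal training (a sketch; the exact ratio is debated):
|
|     TOKENS_PER_PARAM = 20   # approximate Chinchilla heuristic
|     for params_b in (7, 13, 65, 1000):
|         tokens_t = params_b * TOKENS_PER_PARAM / 1000
|         print(f"{params_b}B params -> ~{tokens_t:.2f}T tokens")
|     # A 1T-parameter model would want ~20T tokens, far beyond
|     # the available high-quality text.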
| brucethemoose2 wrote:
| What others said ^
|
| But note that Meta's LLaMA 33B and 65B were trained on more
| tokens (1.4 trillion) than the 13B and 7B models (1
| trillion).
|
| And subjectively, the larger parameter models do indeed feel
| "smarter" beyond what objective metrics would suggest.
| Dwedit wrote:
| There is a chart that compares Parameter Count, Size of model
| in GB, and "Perplexity". (Size of model is on a logarithmic
| scale)
|
| https://user-
| images.githubusercontent.com/48489457/243093269...
|
| You can see that "Perplexity" goes down as Model Size goes
| up.
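|
| For context, perplexity is just the exponential of the average
| per-token negative log-likelihood, so lower is better. With
| hypothetical per-token losses:
|
|     import math
|
|     nlls = [2.1, 1.8, 2.4, 2.0]      # made-up per-token losses (nats)
|     perplexity = math.exp(sum(nlls) / len(nlls))
|     print(round(perplexity, 2))      # ~7.96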
| ianbutler wrote:
| Yes, there was some recent research confirming that
| capabilities improve smoothly as a function of parameter
| count.
|
| Their point was that a lot of research showing jumps in
| performance at certain "breakpoints", for lack of a better
| word, was really the result of badly selected metrics rather
| than a case of suddenly emergent behaviour.
|
| The nice thing about that research is it suggests that if you
| are able to try something on a smaller model it will scale
| nicely to a bigger model.
| sigmoid10 wrote:
| This has been known for several years. Especially zero and
| few-shot task performance scales extremely well with number
| of parameters. But more recently it was shown that you can
| actually trade parameters with training data volume and
| training time as well once you go into the billions of
| parameters. So while it takes more time to train, you can
| habe an equally powerful model with much fewer parameters
| and thus faster inference times.
| [deleted]
| ianbutler wrote:
| The paper was fairly recent,
| https://arxiv.org/abs/2304.15004. It was more thoroughly
| confirming what was generally agreed upon while debunking
| other reasons.
|
| The key insight from the abstract, "Specifically,
| nonlinear or discontinuous metrics produce apparent
| emergent abilities, whereas linear or continuous metrics
| produce smooth, continuous predictable changes in model
| performance."
|
| Yup that is another recent and interesting development
| for sure!
| sigmoid10 wrote:
| The authors are a bit disingenuous here. They insinuate
| that GPT3's performance shows unpredictable behaviour
| change at certain scales using their weirdly constructed
| metrics (which may or may not be true - see below), while
| the original GPT3 paper already showed how these amazing
| "emergent" capabilities scale with parameters in a very
| predictable way: https://arxiv.org/pdf/2005.14165.pdf
|
| Also note that the plots in the appendix contain some
| obvious errors, so you definitely want to wait for a peer
| reviewed version of this paper (if it ever survives
| review).
| ianbutler wrote:
| Sorry, I think you've misunderstood, they're saying
| that's exactly the point. Those weird metrics are what
| they're debunking, not supporting. Per my snippet from
| the abstract in my last comment.
|
| Their point was that a lot of papers use those weird
| metrics, and this contributes to the appearance of emergent
| ability, when in reality it's just the bad metrics.
|
| Nothing you've said so far disagrees with either my
| understanding or the conclusion of the paper I linked.
| sigmoid10 wrote:
| I think you misunderstood. The authors created the very
| issue they are "debunking." They took GPT3, slapped on
| some random metrics and showed that these metrics don't
| show scaling behaviour correctly, while the _original_
| publication of GPT3 actually did it correctly in the
| first place.
| [deleted]
| Tepix wrote:
| Aren't they doing the exact same thing as RedPajama? How is this
| not a duplicate effort? Or are they working together with the
| RedPajama project? If so, why use the OpenLLaMA name?
| emadm wrote:
| RedPajama ran on the Summit supercomputer
| (https://en.wikipedia.org/wiki/Summit_(supercomputer)) with its
| NVIDIA V100s/PowerPC chips as part of the INCITE grant, which
| necessitated variations from the LLaMA training parameters.
|
| This led to differences in evals; now they are bringing in more
| modern chips.
|
| Aside from some tokeniser differences, this is a drop-in
| replacement for existing LLaMA that matches its performance.
| Vetch wrote:
| The tokenizer differences are major as LLMs are sensitive to
| whitespace handling. If I am reading the github page
| properly, OpenLLama failed to learn how to model code
| properly? Code contains many implicit reasoning tasks.
|
| What other differences are there? The page doesn't mention
| how numbers are handled. These are two major things that
| impact model reasoning and numeric ability.
| emadm wrote:
| Code is the main thing; it has some tradeoffs. It tunes well
| on code though, and the code AI team at Stability AI is
| working on stuff.
|
| We can now set and forget runs, so we will have a better
| dataset and a different tokeniser for the next 13b; this one
| was meant to match the original as closely as possible to be
| a drop-in.
| behohippy wrote:
| Hey emad, thanks for SD and this! What's the plan if Meta
| does Apache 2.0 for LLaMA? Just keep going and making the 30b
| and 65b or build different models?
| emadm wrote:
| Had a nice chat with Yann last week; we will release
| complementary stuff.
|
| I don't think 30b and 65b are useful given what we do; the
| key is optimising models for consumer hardware & swarming
| them.
|
| As for SD.. Maybe try the bot on the discord server testing
| the new version: https://discord.com/invite/stablediffusion
| hcks wrote:
| Is there a single open source LLM model that plays in the same
| league as GPT-3.5 ?
|
| AFAIK the largest available models are for non-commercial use
| only.
|
| This is a big limitation to the 'LLMs are entering their stable
| diffusion moment' narrative.
| jdm2212 wrote:
| Falcon 40B is allegedly the closest thing out there. It sure
| beats the hell out of all the other open source models I've
| tried, but good luck finding hardware that can run it.
| jamifsud wrote:
| Are there communities where one can go to learn more about fine
| tuning and running these things? I've found a bunch for diffusion
| models but haven't had any luck with LLMs.
| digitallyfree wrote:
| It's not a community per se but there's a lot of research and
| discussion going on directly in the llama.cpp repo
| (https://github.com/ggerganov/llama.cpp) if you're interested
| in the more technical side of things.
| knaik94 wrote:
| The Discord servers for a few of the projects are relatively
| popular. Most have a help channel you could post in if you
| have questions. The Discord for KoboldAI has some developers
| from koboldcpp, which is the easiest and one of the most
| bleeding-edge ways of running these models locally. It builds
| on llama.cpp and allows the use of different front ends, among
| other things like using k-quantized models. People have also
| had success with using something like RunPod.
|
| Native fine-tuning is still out of consumer reach for the
| foreseeable future, but there are people experimenting with
| QLoRA (see the sketch after the links below). The pipeline is
| still relatively new though and is a bit involved.
|
| https://koboldai.org/discord
|
| https://github.com/LostRuins/koboldcpp
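|
| For the curious, a minimal QLoRA-style setup sketch with
| transformers + bitsandbytes + peft (the rank, alpha, and target
| module choices are illustrative, not a recipe):
|
|     import torch
|     from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|     from peft import LoraConfig, get_peft_model
|
|     bnb = BitsAndBytesConfig(
|         load_in_4bit=True,
|         bnb_4bit_quant_type="nf4",
|         bnb_4bit_compute_dtype=torch.float16,
|     )
|     model = AutoModelForCausalLM.from_pretrained(
|         "openlm-research/open_llama_13b",
|         quantization_config=bnb,
|         device_map="auto",
|     )
|     lora = LoraConfig(
|         r=16, lora_alpha=32, lora_dropout=0.05,
|         target_modules=["q_proj", "v_proj"],   # illustrative
|         task_type="CAUSAL_LM",
|     )
|     model = get_peft_model(model, lora)
|     model.print_trainable_parameters()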
| Havoc wrote:
| There is a bit of activity on /r/localllama
| messe wrote:
| Any outside of reddit?
| jmiskovic wrote:
| There's a LocalLLaMA Lemmy instance at
| https://sh.itjust.works/c/localllama
| cypress66 wrote:
| 4chan's lmg thread on /g/
| AJRF wrote:
| Sorry for self-promotion but I wrote something on this today:
| https://adamfallon.com/ai/llms/deep-learning/machine-learnin...
| denverllc wrote:
| I'm looking forward to trying this out!
| MuffinFlavored wrote:
| If this is 13B and we can assume GPT-4 is 100B+, is this even
| in the same ballpark as useful?
|
| Being fast at generating "something" (spitting out tokens) is
| one thing but... if those tokens aren't worth much (i.e. if the
| quality of the "answers" is weak), what is it really worth?
| jwitthuhn wrote:
| For me the big difference is that I can run llama models
| locally which is a different use-case entirely compared to
| GPT-4. Even if the weights were available, I just don't have
| the hardware to run something with the parameter counts of
| GPT-3.5+
| fbhabbed wrote:
| GPT-3.5 was 175B parameters and GPT-4 is larger (nobody knows
| how much larger), and it blows GPT-3.5 out of the water.
|
| So, GPT-4 is, well, way more than 100B+.
|
| And yes, I share your opinion
| emadm wrote:
| The ability to run it on your own private data and
| train/distill it from the larger models is very useful (see
| Orca for example: https://arxiv.org/abs/2306.02707).
|
| Most of the valuable data in the world is private data
| unsuitable to send to GPT4 and other proprietary models.
| MuffinFlavored wrote:
| > is very useful
|
| How? Isn't there an extremely strong correlation, i.e. "13B
| parameters (low/not a lot) most likely means lots of
| hallucinations"?
| emadm wrote:
| What's better: a 130bn model, or 10 13bn models working
| together?
| jml78 wrote:
| I use ChatGPT-4 daily.
|
| I tried using some of these 13b models and was completely
| underwhelmed when compared to GPT-4.
| tmalsburg2 wrote:
| I think some of these 13B models may only be pretrained
| and may not have benefited from RLHF. So it's not clear
| how much of the difference is due to parameter count.
| yumraj wrote:
| > what is it really worth?
|
| For some, nothing. For some, a lot.
|
| If you're looking to create a ChatGPT replacement, this ain't
| it.
|
| If you're trying to learn about LLMs, how they work,
| capabilities vs deficiencies, private data use, etc., these
| are invaluable to get your hands dirty in a private
| environment.
| cypress66 wrote:
| They really aren't. 30B models are much smarter, and work fine
| with 4 bit quantization on a 24GB GPU. On a headless system you
| can get the full 2048 context size. On a desktop around 1500 I
| think.
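|
| Rough arithmetic on why that fits (layer and hidden sizes are
| the LLaMA "33B" figures from the paper; overheads ignored):
|
|     params = 32.5e9                      # LLaMA 33B, approx.
|     weights_gb = params * 0.5 / 1e9      # 4 bits per parameter
|     n_layers, d_model = 60, 6656         # from the LLaMA paper
|     kv_per_token = 2 * n_layers * d_model * 2   # K and V in fp16
|     kv_gb = kv_per_token * 2048 / 1e9    # full 2048-token context
|     print(f"{weights_gb:.1f} GB weights + {kv_gb:.1f} GB KV cache")
|     # roughly 16 GB + 3.3 GB, which leaves headroom on a 24GB card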
| myshpa wrote:
| Does anyone have experience with it?
|
| How usable is it on MacBook Air / Pro, and how much GB RAM is
| required?
|
| And regarding programming, how comparable is it to GPT-3.5 /
| GPT-4?
| jerpint wrote:
| How does this compare to the original LLaMa models? Are these
| just as good for fine tuning?
| emadm wrote:
| Drop in replacements except for code tasks.
___________________________________________________________________
(page generated 2023-06-18 23:01 UTC)