[HN Gopher] OpenLLaMA 13B Released
       ___________________________________________________________________
        
       OpenLLaMA 13B Released
        
       Author : tosh
       Score  : 164 points
       Date   : 2023-06-18 15:29 UTC (7 hours ago)
        
 (HTM) web link (huggingface.co)
 (TXT) w3m dump (huggingface.co)
        
       | knaik94 wrote:
        | Koboldcpp [1], which builds on llama.cpp and adds a GUI, is a
        | great way to run these models. Most people aren't running them
        | at full precision; GGML quantization is recommended for CPU+GPU,
        | or GPTQ if you have the GPU VRAM.
        | 
        | GGML 13B models at 4-bit (Q4_0) take somewhere around 9 GB of
        | RAM and Q5_K_M takes about 11 GB. GPU offloading support has
        | also been added; I've been offloading 22 layers on my laptop's
        | RTX 2070 Max-Q (8 GB VRAM) with CLBlast and get around 2-3
        | tokens per second with 13B models. In my experience, 13B models
        | are worth the extra time they take to generate a response
        | compared to 7B models. GPTQ is faster, I think, but I can't fit
        | a quantized 13B model in VRAM so I don't use it.
       | 
        | TheBloke [2] has been quantizing models and uploading them to HF
        | and will probably upload a quantized version of this one soon.
        | His Discord server also has good guides to help you get going,
        | linked in the model card of most of his models.
       | 
       | https://github.com/LostRuins/koboldcpp
       | 
       | https://huggingface.co/TheBloke
       | 
        | Edit: There's a bug with the newest Nvidia drivers that causes a
        | slowdown with large context sizes. I downgraded and stayed on
        | 531.61. The theory is that newer drivers change how CUDA
        | out-of-memory management works when trying to avoid OOM errors.
       | 
       | https://www.reddit.com/r/LocalLLaMA/comments/1461d1c/major_p...
       | 
       | https://github.com/vladmandic/automatic/discussions/1285
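        | 
        | If you'd rather script this than use a GUI, roughly the same
        | setup works through llama-cpp-python; a minimal sketch, assuming
        | a locally downloaded 4-bit GGML file (the filename and the right
        | layer count depend on your quantization and VRAM):
        | 
        |     from llama_cpp import Llama  # pip install llama-cpp-python
        | 
        |     llm = Llama(
        |         model_path="open-llama-13b.ggmlv3.q4_0.bin",  # local file
        |         n_ctx=2048,       # context window
        |         n_gpu_layers=22,  # layers offloaded to GPU, tune to VRAM
        |     )
        |     out = llm("Q: What is quantization? A:", max_tokens=128)
        |     print(out["choices"][0]["text"])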
        
         | tyfon wrote:
          | I can actually run the entire Q4_K_S version of this on the GPU
          | with my 3060, and it's blazing fast in this mode (~10 tokens
          | per second) with the latest llama.cpp. It should be the same
          | for koboldcpp too.
        
       | jejeyyy77 wrote:
       | Only a matter of time before ChatGPT goes the way of Dall-E...
        
       | bilsbie wrote:
       | Would this be a good model for my work on LLM mechanistic
       | interpretability?
       | 
       | I only have an older MacBook so I'm not sure what I can install.
        
         | sbierwagen wrote:
         | Why the special characters?
        
       | courseofaction wrote:
       | Testing in Colab:
       | 
       | Loaded into 27.7GB of VRAM, requiring an A100 (without
       | quantization).
       | 
       | Inferences are speedy, looks promising for a local solution
       | compared to other models which have been released recently.
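        | 
        | For reference, a minimal loading sketch along those lines
        | (assuming the openlm-research/open_llama_13b repo id; fp16
        | roughly halves the memory footprint versus fp32, and the model
        | card suggests avoiding the fast tokenizer):
        | 
        |     import torch
        |     from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |     model_id = "openlm-research/open_llama_13b"
        |     tok = AutoTokenizer.from_pretrained(model_id, use_fast=False)
        |     model = AutoModelForCausalLM.from_pretrained(
        |         model_id, torch_dtype=torch.float16, device_map="auto"
        |     )
        |     prompt = tok("The capital of France is", return_tensors="pt")
        |     prompt = prompt.to(model.device)
        |     print(tok.decode(model.generate(**prompt, max_new_tokens=32)[0]))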
        
         | MuffinFlavored wrote:
         | > Inferences are speedy, looks promising for a local solution
         | compared to other models which have been released recently.
         | 
         | Is there any kind of standardized test to gauge the quality
         | (not the speed) of LLM answers? aka, how hard does it
         | hallucinate?
        
           | rgovostes wrote:
           | There is the Language Model Evaluation Harness project which
           | evaluates LLMs on over 200 tasks. HuggingFace has a
           | leaderboard tracking performance on a subset of these tasks.
           | 
           | https://github.com/EleutherAI/lm-evaluation-harness
           | 
           | https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb.
           | ..
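            | 
            | A rough sketch of pointing the harness at an HF checkpoint
            | (the exact API, model type string and task names shift
            | between harness versions, so treat this as illustrative and
            | check the repo README):
            | 
            |     from lm_eval import evaluator  # pip install lm-eval
            | 
            |     results = evaluator.simple_evaluate(
            |         model="hf-causal",
            |         model_args="pretrained=openlm-research/open_llama_13b",
            |         tasks=["hellaswag", "arc_easy"],
            |         num_fewshot=0,
            |     )
            |     print(results["results"])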
        
         | oidar wrote:
         | Would you mind sharing that notebook?
        
           | courseofaction wrote:
           | Sure, https://colab.research.google.com/drive/1r4FAveF9t8b8PN
           | iqpRH... :)
        
       | andreygrehov wrote:
       | Serious question (since I'm not familiar with AI/ML), what's the
       | point of releasing these "smaller" (5B, 10B, 13B) models, given
       | there are plenty of bigger models now (Falcon 40B, LLaMa 65B)?
        
         | pythux wrote:
          | It is very expensive to train these base models, so a smaller
          | size is more practical if you aren't a big company with
          | hundreds of powerful GPUs at hand. Table 15 of the LLaMA
          | paper[1] has some insightful figures: it took 135,168 GPU
          | hours to train the 13B version and a bit more than 1M GPU
          | hours for the 65B version. And we are talking about A100 80GB
          | GPUs here (expensive and scarce). Not everyone can afford this
          | kind of training run (especially if it takes a few attempts,
          | e.g. if you've got a bug in the tokenizer).
         | 
         | [1] https://arxiv.org/pdf/2302.13971.pdf
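          | 
          | To put that in dollar terms, a back-of-the-envelope estimate
          | (assuming a hypothetical ~$1.50 per A100-hour; real cloud
          | prices vary a lot in both directions):
          | 
          |     gpu_hours_13b = 135_168    # Table 15 of the LLaMA paper
          |     gpu_hours_65b = 1_000_000  # "a bit more than 1M"
          |     usd_per_a100_hour = 1.50   # assumed rate
          | 
          |     print(f"13B: ~${gpu_hours_13b * usd_per_a100_hour:,.0f}")
          |     print(f"65B: ~${gpu_hours_65b * usd_per_a100_hour:,.0f}")
          |     # roughly $200k and $1.5M respectively, before failed runs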
        
           | andreygrehov wrote:
            | Hold on, are you saying I can grab the 13B OpenLLaMA model
            | and train it further? I thought all of these models were
            | already pre-trained and represent sort of an end state. Am I
            | completely missing the point?
        
             | JohnKemeny wrote:
             | A neural network is just a bunch of weights. You can always
             | continue modifying the weights as you see fit. A network is
             | never "done" learning.
        
         | [deleted]
        
         | rish-b wrote:
         | A common reason is to reduce cost and latency. Larger models
         | typically require GPUs with more memory (and hence higher
         | costs), plus the time to serve requests is also higher (more
         | matrix multiplications to be done).
        
           | andreygrehov wrote:
            | Got it. That makes sense. Thank you. But what about the
            | quality then? Can the quality of a 13B model be the same as
            | the quality of, say, a 30B model?
        
             | rolisz wrote:
             | Flan-T5 is a 3B model that is of comparable quality to
             | Llama 13B.
             | 
              | Moreover, you can fine-tune a model for your specific tasks,
              | and you need fewer resources to fine-tune a smaller model.
        
             | spacebanana7 wrote:
             | As a general principle the larger models are better
             | quality.
             | 
             | However, fine tuned small models can outperform general
             | purpose large models on specific tasks.
             | 
              | There are also many lightweight tasks, like basic sentiment
              | analysis, where the accuracy of small models can be good
              | enough to the point of being indistinguishable from large
              | models.
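              | 
              | For instance, the stock transformers sentiment pipeline
              | pulls a small distilled encoder (a few hundred MB) by
              | default, and that is often all a task like this needs; a
              | minimal sketch:
              | 
              |     from transformers import pipeline
              | 
              |     clf = pipeline("sentiment-analysis")  # small default
              |     print(clf("I love this keyboard."))
              |     # e.g. [{'label': 'POSITIVE', 'score': 0.999...}]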
        
         | wahahah wrote:
         | RAM requirements
        
       | 0xferruccio wrote:
        | Interesting to see that both Google Cloud and Stability AI are
        | mentioned as providing the compute. Did Stability pay the bill
        | for the resources used for training?
        
         | emadm wrote:
         | We (Stability AI) trained it on our TPUs with input from the
         | OpenLM team as an OpenLLaMA collaboration.
         | 
          | The 20B model is 780B tokens in; lots of learnings, so we can
          | optimise future runs.
         | 
         | Hopefully these will be useful bases for continued research, we
         | will have some SFT/RLHF variants in due course from our Carper
         | AI lab.
        
       | brianjking wrote:
       | Nice, I wish it was a little easier to integrate these models
       | into Chat UIs like the one from Vercel or even a simple Gradio
       | app.
       | 
       | Does anyone have any Spaces/Colab notebooks/etc to try this out
       | on?
       | 
       | Thanks!
        
         | ccooffee wrote:
         | I've found https://chat.lmsys.org/ to be a useful multi-LLM
         | chat app without scary ToS data-mining clauses.
        
         | brucethemoose2 wrote:
         | There are many UIs for running locally, but the easiest is
         | koboldcpp:
         | 
         | https://github.com/LostRuins/koboldcpp
         | 
          | It's a llama.cpp wrapper descended from the roleplaying
          | community, but it works fine (and performantly) for question
          | answering and such.
         | 
          | You will need to download the model from HF and quantize it
          | yourself: https://github.com/ggerganov/llama.cpp#prepare-data--
         | run
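          | 
          | Roughly, the convert-then-quantize step looks like this (a
          | sketch only; the script names, output filenames and
          | quantization types move around between llama.cpp versions, so
          | follow the README above):
          | 
          |     import subprocess
          | 
          |     # assumes the HF weights sit in models/open_llama_13b/
          |     subprocess.run(
          |         ["python", "convert.py", "models/open_llama_13b/"],
          |         check=True)
          |     subprocess.run(
          |         ["./quantize",
          |          "models/open_llama_13b/ggml-model-f16.bin",
          |          "models/open_llama_13b/ggml-model-q4_0.bin",
          |          "q4_0"],
          |         check=True)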
        
           | brianjking wrote:
           | Plenty of ways to run locally - I'm looking for ways for
           | inference via Colab or Huggingface.
        
             | brucethemoose2 wrote:
              | Oh, well you can log in to Hugging Face and deploy it in a
              | Space with a button above the model.
             | 
             | Not sure about colab at the moment.
        
               | brianjking wrote:
               | That can give you an inference endpoint for the API, I'm
               | talking about a full chat UI where you can set the
               | temperature, etc.
        
               | spmurrayzzz wrote:
                | This isn't specifically Colab or HF, but have you checked
                | out any of the community Runpod templates? There are a
                | few out there that give you a mostly turnkey way to
                | deploy models and test via Oobabooga, KoboldAI, or
                | similar.
                | 
                | I use the one-click UI from TheBloke pretty frequently
                | for inference testing, and I know there are some newer
                | ones that also give you fine-tuning capabilities as well.
        
       | ozr wrote:
        | Note that this model can't really be used for most code tasks.
        | The tokenizer removes repeated spaces, so the model doesn't have
        | a valid concept of indentation.
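        | 
        | You can see the effect with a quick round trip through the
        | tokenizer (a sketch assuming the openlm-research repo id; if the
        | whitespace handling gets fixed upstream, the indentation would
        | survive the round trip):
        | 
        |     from transformers import AutoTokenizer
        | 
        |     tok = AutoTokenizer.from_pretrained(
        |         "openlm-research/open_llama_13b", use_fast=False)
        |     src = "def f():\n    return 1"
        |     print(tok.decode(tok.encode(src), skip_special_tokens=True))
        |     # reportedly comes back with the run of spaces collapsed,
        |     # so the four-space indent is lost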
        
         | searealist wrote:
         | Seems trivial to autoformat afterwards?
        
           | teaearlgraycold wrote:
           | Except with Python lol
        
         | hungrigekatze wrote:
         | For some discussion on how to have the LLaMa tokenizer
         | (properly) handle repeating spaces, please see this discussion:
         | https://github.com/openlm-research/open_llama/issues/40
        
         | baq wrote:
          | So, useless for the one thing LLMs are actually properly
          | grounded in? Sounds like a self-inflicted wound.
        
           | laurentlb wrote:
           | In most languages though, you can reformat the code and get
           | the indentation back.
        
         | mk_stjames wrote:
          | I was going to try to be snarky and make a comment about the
          | superiority of tabs, but I just ran samples through the
          | tokenizer and it doesn't recognize tabs either. :-(
        
           | TeMPOraL wrote:
           | There's still space for snark - one might observe that the
           | model is just, wisely, excluding Python.
        
         | 2-718-281-828 wrote:
         | not every language is indentation based ...
        
       | brucethemoose2 wrote:
       | GPT-J 7B and LLaMA 7B don't look that different in the metrics
       | table, but they are like night and day if finetuned and actually
       | used for question answering, roleplay and such.
       | 
       | If 13B is good... I wonder if this will catch on in the
       | finetuning community.
       | 
       | People care less about the LLaMA license than you'd think, and
       | this is also about the time new models with "improved"
       | architectures (like Falcon) should start popping up.
        
         | rcme wrote:
         | Doesn't llama.cpp resolve all the licensing issues? The models
         | themselves aren't subject to copyright, so you can use the
         | model weights as long as you haven't entered into an agreement
         | with Meta about their usage.
        
         | bioemerl wrote:
         | For me this is most significant in the sense that I can use it
         | at my workplace.
        
         | techwiz137 wrote:
          | Pardon me for asking what might have an obvious answer to some,
          | but does increasing the parameter count lead to linear growth
          | in the capability of the neural net, or is it different?
         | 
         | My knowledge of neural nets and AI is just lacking.
        
           | haldujai wrote:
           | Complicated question, the answer is "it depends".
           | 
           | Several factors influence performance beyond parameter count,
           | notable ones include: training corpus quality, training
           | flops, and the downstream task.
           | 
           | It depends on how much compute you are spending on training
           | and how big of a model you're talking about.
           | 
           | There's a "minimum" tokens/parameter for increasing size to
           | be effective at improving loss/perplexity, so as you go up in
           | parameters you generally will have to broaden your corpus
           | which may lower it's quality (e.g. tweets/reddit posts vs
           | books/articles).
           | 
           | This effect isn't as significant at 65B parameters as there
           | is still enough high quality training data but if you're
           | talking 1T the corpus will (probably, I haven't tried
           | this/seen this done) by necessity be significantly poorer
           | quality as you will overfit by simply repeating (it's only
           | beneficial so many times to repeat).
           | 
           | As a general rule, when validation loss/perplexity are the
           | same in two models of different sizes downstream performance
           | seems to be also generally the same (this was briefly
           | explored in the PaLM 2 paper by Google) although it doesn't
           | correlate perfectly.
           | 
            | Practically speaking, this means a bigger model is better
            | for the applications we're generally talking about. It just
            | may not hold indefinitely, which we're starting to see
            | evidence of.
           | 
            | It's definitely not linear though; you can look at some of
            | the OpenLLaMA benchmarks (without getting into the weeds of
            | whether current benchmarks are representative) and the
            | accuracy improvement even at 13B is not that significant
            | (noting here that all models were trained for the same
            | number of tokens, so the smaller models are relatively
            | overtrained).
           | 
           | There are probably some threshold parameter sizes that make a
           | big difference but it's still being determined.
           | 
           | https://ai.google/static/documents/palm2techreport.pdf
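            | 
            | For intuition on the tokens-per-parameter point, the
            | Chinchilla-style heuristic of roughly 20 training tokens per
            | parameter gives (a rule of thumb, not a hard law):
            | 
            |     TOKENS_PER_PARAM = 20  # rough Chinchilla-style heuristic
            | 
            |     for params_b in (7, 13, 65, 1000):
            |         tokens_t = params_b * TOKENS_PER_PARAM / 1000
            |         print(f"{params_b}B params -> ~{tokens_t:.1f}T tokens")
            |     # 13B -> ~0.3T, so the 1T-token OpenLLaMA runs are well
            |     # past that point, while a 1T-parameter model would want
            |     # ~20T tokens of (good) data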
        
           | brucethemoose2 wrote:
           | What others said ^
           | 
            | But note that Meta's LLaMA 33B and 65B were trained with more
            | tokens (1.4 trillion) than the 13B and 7B models (1
            | trillion).
           | 
           | And subjectively, the larger parameter models do indeed feel
           | "smarter" beyond what objective metrics would suggest.
        
           | Dwedit wrote:
           | There is a chart that compares Parameter Count, Size of model
           | in GB, and "Perplexity". (Size of model is on a logarithmic
           | scale)
           | 
           | https://user-
           | images.githubusercontent.com/48489457/243093269...
           | 
           | You can see that "Perplexity" goes down as Model Size goes
           | up.
        
           | ianbutler wrote:
            | There was some recent research confirming that capabilities
            | improve smoothly as a function of parameter count, yes.
            | 
            | Their point was that a lot of research showing jumps in
            | performance at certain "breakpoints", for lack of a better
            | word, was the result of badly selected metrics rather than a
            | case of suddenly emergent behaviour.
           | 
           | The nice thing about that research is it suggests that if you
           | are able to try something on a smaller model it will scale
           | nicely to a bigger model.
        
             | sigmoid10 wrote:
              | This has been known for several years. Zero- and few-shot
              | task performance in particular scales extremely well with
              | the number of parameters. But more recently it was shown
              | that you can actually trade parameters for training data
              | volume and training time once you go into the billions of
              | parameters. So while it takes more time to train, you can
              | have an equally powerful model with far fewer parameters
              | and thus faster inference times.
        
               | [deleted]
        
               | ianbutler wrote:
               | The paper was fairly recent,
               | https://arxiv.org/abs/2304.15004. It was more thoroughly
               | confirming what was generally agreed upon while debunking
               | other reasons.
               | 
               | The key insight from the abstract, "Specifically,
               | nonlinear or discontinuous metrics produce apparent
               | emergent abilities, whereas linear or continuous metrics
               | produce smooth, continuous predictable changes in model
               | performance."
               | 
               | Yup that is another recent and interesting development
               | for sure!
        
               | sigmoid10 wrote:
               | The authors are a bit disingenuous here. They insinuate
               | that GPT3's performance shows unpredictable behaviour
               | change at certain scales using their weirdly constructed
               | metrics (which may or may not be true - see below), while
               | the original GPT3 paper already showed how these amazing
               | "emergent" capabilities scale with parameters in a very
               | predictable way: https://arxiv.org/pdf/2005.14165.pdf
               | 
               | Also note that the plots in the appendix contain some
               | obvious errors, so you definitely want to wait for a peer
               | reviewed version of this paper (if it ever survives
               | review).
        
               | ianbutler wrote:
               | Sorry, I think you've misunderstood, they're saying
               | that's exactly the point. Those weird metrics are what
               | they're debunking, not supporting. Per my snippet from
               | the abstract in my last comment.
               | 
                | Their point was that a lot of papers use those weird
                | metrics, and that contributes to the appearance of
                | emergent ability, when in reality it's just the bad
                | metrics.
               | 
               | Nothing you've said so far disagrees with either my
               | understanding or the conclusion of the paper I linked.
        
               | sigmoid10 wrote:
               | I think you misunderstood. The authors created the very
               | issue they are "debunking." They took GPT3, slapped on
               | some random metrics and showed that these metrics don't
               | show scaling behaviour correctly, while the _original_
               | publication of GPT3 actually did it correctly in the
               | first place.
        
               | [deleted]
        
       | Tepix wrote:
       | Aren't they doing the exact same thing as RedPajama? How is this
       | not a duplicate effort? Or are they working together with the
       | RedPajama project? If so, why use the OpenLLaMA name?
        
         | emadm wrote:
          | RedPajama ran on the Summit supercomputer
          | (https://en.wikipedia.org/wiki/Summit_(supercomputer)), i.e.
          | NVIDIA V100s/PowerPC chips, as part of the INCITE grant, which
          | necessitated variations from the LLaMA training parameters.
          | 
          | This led to differences in evals; now they are bringing in
          | more modern chips.
          | 
          | Aside from some tokeniser differences, this is a drop-in
          | replacement for existing LLaMA that matches its performance.
        
           | Vetch wrote:
            | The tokenizer differences are major, as LLMs are sensitive to
            | whitespace handling. If I am reading the GitHub page
            | correctly, OpenLLaMA failed to learn how to model code
            | properly? Code contains many implicit reasoning tasks.
           | 
           | What other differences are there? The page doesn't mention
           | how numbers are handled. These are two major things that
           | impact model reasoning and numeric ability.
        
             | emadm wrote:
              | Code is the main thing; it has some tradeoffs. It tunes
              | well on code though, and the code AI team at Stability AI
              | is working on stuff.
              | 
              | We can now set and forget runs, so we will have a better
              | dataset and a different tokeniser for the next 13B; this
              | one was meant to match the original as closely as possible
              | to be a drop-in.
        
           | behohippy wrote:
           | Hey emad, thanks for SD and this! What's the plan if Meta
           | does Apache 2.0 for LLaMA? Just keep going and making the 30b
           | and 65b or build different models?
        
             | emadm wrote:
              | Had a nice chat with Yann last week; we will release
              | complementary stuff.
              | 
              | I don't think 30B and 65B are useful given what we do; the
              | key is optimising models for consumer hardware & swarming
              | them.
             | 
             | As for SD.. Maybe try the bot on the discord server testing
             | the new version: https://discord.com/invite/stablediffusion
        
       | hcks wrote:
       | Is there a single open source LLM model that plays in the same
       | league as GPT-3.5 ?
       | 
       | AFAIK the largest available models are for non-commercial use
       | only.
       | 
        | This is a big limitation of the 'LLMs are entering their Stable
        | Diffusion moment' narrative.
        
         | jdm2212 wrote:
         | Falcon 40B is allegedly the closest thing out there. It sure
         | beats the hell out of all the other open source models I've
         | tried, but good luck finding hardware that can run it.
        
       | jamifsud wrote:
       | Are there communities where one can go to learn more about fine
       | tuning and running these things? I've found a bunch for diffusion
       | models but haven't had any luck with LLMs.
        
         | digitallyfree wrote:
         | It's not a community per se but there's a lot of research and
         | discussion going on directly in the llama.cpp repo
         | (https://github.com/ggerganov/llama.cpp) if you're interested
         | in the more technical side of things.
        
         | knaik94 wrote:
          | The Discord servers for a few of the projects are relatively
          | popular. Most have a help channel you could post in if you
          | have questions. The Discord for KoboldAI has some developers
          | from koboldcpp, which is the easiest and one of the most
          | bleeding-edge ways of running these models locally. It builds
          | on llama.cpp and allows the use of different front ends, among
          | other things like using k-quantized models. People have also
          | had success with something like Runpod.
          | 
          | Native fine-tuning is still out of consumer reach for the
          | foreseeable future, but there are people experimenting with
          | QLoRA (see the sketch below). The pipeline is still relatively
          | new though and is a bit involved.
         | 
         | https://koboldai.org/discord
         | 
         | https://github.com/LostRuins/koboldcpp
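          | 
          | If you want to poke at the QLoRA side, the usual recipe is
          | 4-bit loading via bitsandbytes plus a LoRA adapter from peft; a
          | sketch (argument names shift between library versions, and the
          | repo id is a placeholder):
          | 
          |     import torch
          |     from transformers import (AutoModelForCausalLM,
          |                               BitsAndBytesConfig)
          |     from peft import LoraConfig, get_peft_model
          | 
          |     model_id = "openlm-research/open_llama_13b"
          |     bnb = BitsAndBytesConfig(
          |         load_in_4bit=True,
          |         bnb_4bit_compute_dtype=torch.float16)
          |     model = AutoModelForCausalLM.from_pretrained(
          |         model_id, quantization_config=bnb, device_map="auto")
          |     lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
          |                       target_modules=["q_proj", "v_proj"])
          |     model = get_peft_model(model, lora)
          |     model.print_trainable_parameters()  # only adapters train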
        
         | Havoc wrote:
         | There is a bit of activity on /r/localllama
        
           | messe wrote:
           | Any outside of reddit?
        
             | jmiskovic wrote:
             | There's a LocalLLaMA Lemmy instance at
             | https://sh.itjust.works/c/localllama
        
             | cypress66 wrote:
             | 4chans lmg thread on /g/
        
         | AJRF wrote:
         | Sorry for self-promotion but I wrote something on this today:
         | https://adamfallon.com/ai/llms/deep-learning/machine-learnin...
        
       | denverllc wrote:
       | I'm looking forward to trying this out!
        
         | MuffinFlavored wrote:
          | If this is 13B and we can assume GPT-4 is 100B+, is this even
          | in the same ballpark of usefulness?
          | 
          | Being fast at generating "something" (spitting out tokens) is
          | one thing, but... if those tokens aren't worth much (i.e. if
          | the quality of the "answers" is weak), what is it really worth?
        
           | jwitthuhn wrote:
           | For me the big difference is that I can run llama models
           | locally which is a different use-case entirely compared to
           | GPT-4. Even if the weights were available, I just don't have
           | the hardware to run something with the parameter counts of
           | GPT-3.5+
        
           | fbhabbed wrote:
            | GPT-3.5 was 175B parameters and GPT-4 is larger (nobody knows
            | how much larger), and it blows GPT-3.5 out of the water.
            | 
            | So, GPT-4 is, well, way more than 100B+.
            | 
            | And yes, I share your opinion.
        
             | emadm wrote:
             | The ability to run it on your own private data and
             | train/distill it from the larger models is very useful (see
             | Orca for example: https://arxiv.org/abs/2306.02707).
             | 
             | Most of the valuable data in the world is private data
             | unsuitable to send to GPT4 and other proprietary models.
        
               | MuffinFlavored wrote:
               | > is very useful
               | 
                | How? Is there not an extreme correlation, i.e. "13B
                | parameters (low / not a lot) most likely means lots of
                | hallucinations"?
        
               | emadm wrote:
                | What's better: a 130B model, or 10 13B models working
                | together?
        
             | jml78 wrote:
              | I use ChatGPT-4 daily.
              | 
              | I tried using some of these 13B models and was completely
              | underwhelmed compared to GPT-4.
        
               | tmalsburg2 wrote:
               | I think some of these 13B models may only be pretrained
               | and may not have benefited from RLHF. So it's not clear
               | how much of the difference is due to parameter count.
        
           | yumraj wrote:
           | > what is it really worth?
           | 
           | For some, nothing. For some, a lot.
           | 
           | If you're looking to create a ChatGPT replacement, this ain't
           | it.
           | 
           | If you're trying to learn about LLMs, how they work,
           | capabilities vs deficiencies, private data use, etc., these
           | are invaluable to get your hands dirty in a private
           | environment.
        
         | cypress66 wrote:
          | They really aren't. 30B models are much smarter, and they work
          | fine with 4-bit quantization on a 24GB GPU. On a headless
          | system you can get the full 2048 context size; on a desktop,
          | around 1500 I think.
        
       | myshpa wrote:
       | Does anyone have experience with it?
       | 
       | How usable is it on MacBook Air / Pro, and how much GB RAM is
       | required?
       | 
       | And regarding programming, how comparable is it to GPT-3.5 /
       | GPT-4?
        
       | jerpint wrote:
       | How does this compare to the original LLaMa models? Are these
       | just as good for fine tuning?
        
         | emadm wrote:
          | Drop-in replacements, except for code tasks.
        
       ___________________________________________________________________
       (page generated 2023-06-18 23:01 UTC)