[HN Gopher] Llama 3.1 in C
___________________________________________________________________
Llama 3.1 in C
Author : AMICABoard
Score : 184 points
Date : 2024-07-24 02:49 UTC (20 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| AMICABoard wrote:
| Okay but hold your horses. Still a bit buggy.
|
| Sample output:
|
| Meta's Llama 3.1 models can output multilingual text, which is
| awesome. Here are some example outputs of the 8-bit quantized 8B
| model with 100-token output (-n 100)... Quantization creates some
| brain damage (a sketch of the quantization scheme follows the
| examples).
|
| English
|
| My cat is funny. "Funny cat," I say, walking up to it. "What are
| you up to?" It sits up straight and looks at me with a tilted
| head, as if to say, "What's wrong with you?" Sometimes I just
| have to laugh at how funny a cat can be. So I say, "Okay, you're
| funny. I'll give you some treats." It stretches out a little and
| I give it some treats. It eats them up quickly and starts
|
| German
|
| Do you own an Amiga 500? If not, you will know that this was a
| computer that was available in the late 1980s and early 1990s for
| games intended for people with certain bodily injuries. Sometimes
| we have to remember earlier times which, as we now know, were
| difficult. Here is a link that discusses my story
|
| French
|
| French wine is, in many respects, a wine of origins, for it has
| played an important role in the history of France". The reputation
| of French wines rests mainly on their taste qualities and the
| controlled management of the vineyards, which makes French wine a
| "product of exception". France is thus the world leader in wine
| production, with 25% of world expor
|
| Thai
|
| My cat is funny and likes playing on a roll of paper, and I am not
| sure whether I should consent to letting it play or not.
|
| As time passed, I saw that my cat played with the paper roll from
| time to time + so I decided to try adjusting the game to make it
| more fun
|
| Hindi
|
| My cat is a lot of fun and she always makes fun of me with her
| physical activities / in fact, when she gets caught up in her toys
| she becomes especially classic / besides that, she is a little c
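|
| The 8-bit quantization here follows the simple groupwise
| Q8_0-style scheme in runq.c: each group of GS weights shares one
| fp32 scale. A minimal sketch of the idea (GS and names simplified,
| not the exact runq.c code):
|
|     #include <math.h>
|     #include <stdint.h>
|
|     #define GS 64 /* group size; simplified to a constant here */
|
|     /* Quantize one group of GS floats to int8 plus a shared scale. */
|     void quantize_group(const float *w, int8_t *q, float *scale) {
|         float wmax = 0.0f;
|         for (int i = 0; i < GS; i++) {
|             float a = fabsf(w[i]);
|             if (a > wmax) wmax = a;
|         }
|         /* map the largest magnitude in the group to +/-127 */
|         *scale = (wmax > 0.0f) ? wmax / 127.0f : 1.0f;
|         for (int i = 0; i < GS; i++)
|             q[i] = (int8_t)roundf(w[i] / *scale);
|         /* dequantize later as w[i] ~ q[i] * scale; the rounding
|            error is the "brain damage" */
|     }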
| FrostKiwi wrote:
| > Quantization creates some brain damage.
|
| Love the wording.
| AMICABoard wrote:
| Maybe we should make "Brain Damage Factor" an official term to
| denote how much different types of quantization degrade output
| compared to the unquantized model. :)
| Y_Y wrote:
| I think that's known as degradation, but I think the brain
| damage metric could be usefully applied to the poor sods who
| try to get the quantization to work in the first place.
| AMICABoard wrote:
| Lol :)
| impossiblefork wrote:
| There's actually an old paper titled Optimal Brain Damage,
| where they don't try to find optimal quantizations, but optimal
| sparse versions of a model -- i.e. where some weights are set
| to zero.
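|
| A minimal sketch of that idea in C, assuming the diagonal-Hessian
| saliency from the paper, s_k = h_kk * w_k^2 / 2 (the function and
| names here are hypothetical):
|
|     #include <stdlib.h>
|
|     static int cmp_float(const void *a, const void *b) {
|         float fa = *(const float *)a, fb = *(const float *)b;
|         return (fa > fb) - (fa < fb);
|     }
|
|     /* Zero out the fraction `frac` of weights with the lowest
|        saliency, given the diagonal Hessian terms h_diag. */
|     void obd_prune(float *w, const float *h_diag, int n, float frac) {
|         float *s = malloc(n * sizeof *s);
|         float *sorted = malloc(n * sizeof *sorted);
|         for (int i = 0; i < n; i++) {
|             s[i] = 0.5f * h_diag[i] * w[i] * w[i];
|             sorted[i] = s[i];
|         }
|         qsort(sorted, n, sizeof(float), cmp_float);
|         float cutoff = sorted[(int)(frac * (n - 1))];
|         for (int i = 0; i < n; i++)
|             if (s[i] <= cutoff) w[i] = 0.0f;
|         free(s);
|         free(sorted);
|     }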
| AMICABoard wrote:
| Interesting, I heard something like that, but now I must
| read about it.
| nickpsecurity wrote:
| I found...
|
| Optimal Brain Damage
| https://www.researchgate.net/publication/221618539_Optimal_B...
|
| Optimal Brain Compression
| https://openreview.net/pdf?id=ksVGCOlOEba#:~:text=The%20resu....
|
| TinyVolt's implementation of it:
| https://github.com/TinyVolt/optimal-brain-compression
| AMICABoard wrote:
| Thanks, I'll read it up. Interesting.
| nickpsecurity wrote:
| That was really for them. You're out there building neat stuff.
| Your talent might warrant looking into AdderNets and BitNets,
| which might get the cost down. There are also some
| brain-inspired designs.
|
| I don't think many people have implemented such things.
| You might discover something new experimenting with them.
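|
| For flavour: a BitNet-style layer constrains weights to
| {-1, 0, +1}, so the matmul needs no weight multiplications at
| all. A hypothetical sketch:
|
|     #include <stdint.h>
|
|     /* Ternary matvec: weights stored as int8 in {-1, 0, +1}. A
|        real BitNet also applies a learned per-tensor scale. */
|     void ternary_matvec(float *out, const float *x, const int8_t *w,
|                         int rows, int cols) {
|         for (int r = 0; r < rows; r++) {
|             float acc = 0.0f;
|             for (int c = 0; c < cols; c++) {
|                 int8_t t = w[r * cols + c];
|                 if (t == 1) acc += x[c];        /* +1: add */
|                 else if (t == -1) acc -= x[c];  /* -1: sub, 0: skip */
|             }
|             out[r] = acc;
|         }
|     }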
| AMICABoard wrote:
| :)
| actionfromafar wrote:
| An Amiga 500?
|
| Shut up and take my money!
| AMICABoard wrote:
| Yes, an Amiga 500! My first computer. Still in love with her... :)
| aninteger wrote:
| > My cat is funny. "Funny cat," I say, walking up to it. "What
| are you up to?" It sits up straight and looks at me with a
| tilted head, as if to say, "What's wrong with you?" Sometimes I
| just have to laugh at how funny a cat can be. So I say, "Okay,
| you're funny. I'll give you some treats." It stretches out a
| little and I give it some treats. It eats them up quickly and
| starts
|
| This is kind of like 3rd grade English. What would be required
| to go beyond that?
| AMICABoard wrote:
| Actually, "My cat is funny" was the prompt; it continued from
| that. I've got to fix some stuff to reflect Meta's implementation
| and also fix the chat mode, then it will be usable. That will
| take a few days.
| hislaziness wrote:
| Cool. I will try it out. I tried the same with ollama; the
| non-English part needs a lot more polish. Do you see the outcome
| being any different?
| AMICABoard wrote:
| I think the non-English part is mostly hit and miss in this
| primitive version, probably because the implementation is not
| correct. I've got to read up a lot and fix it.
| danielhanchen wrote:
| Oh, this is super cool! I think maybe the new RoPE scaling method
| Llama 3.1 uses isn't added in yet? It's a weird one-time scaling
| mechanism, found by a grid search, that enables the 128K context.
| Essentially the model was trained on 15.6T tokens at 8K context,
| then iteratively extended to 128K context with 800B tokens.
|
| I can open a PR if people want :) [Edit: Just opened a PR!
| Apologies, my C is very rusty!
| https://github.com/trholding/llama2.c/pull/14]
|
| https://github.com/trholding/llama2.c/blob/master/runq.c#L65...
| needs to be scaled with some weird formula like in
| https://github.com/unslothai/unsloth/blob/main/unsloth/model...
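|
| For reference, Meta's Llama 3.1 code remaps each RoPE frequency
| once at init, roughly like this (constants from the 3.1
| rope_scaling config; a sketch, not the exact reference code):
|
|     /* One-time remap of a single RoPE frequency for 128K context. */
|     float scale_rope_freq(float freq) {
|         const float scale_factor = 8.0f;
|         const float low_freq_factor = 1.0f;
|         const float high_freq_factor = 4.0f;
|         const float old_context_len = 8192.0f; /* original 8K ctx */
|         float low_freq_wavelen = old_context_len / low_freq_factor;
|         float high_freq_wavelen = old_context_len / high_freq_factor;
|         float wavelen = 2.0f * 3.14159265f / freq;
|         if (wavelen < high_freq_wavelen)
|             return freq;                  /* high freqs: unchanged */
|         if (wavelen > low_freq_wavelen)
|             return freq / scale_factor;   /* low freqs: fully scaled */
|         /* middle band: smooth interpolation between the regimes */
|         float smooth = (old_context_len / wavelen - low_freq_factor) /
|                        (high_freq_factor - low_freq_factor);
|         return (1.0f - smooth) * freq / scale_factor + smooth * freq;
|     }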
| AMICABoard wrote:
| Oh thanks, bro. Nope, it uses the simple Llama 2 RoPE with theta
| changed to 500k to match Llama 3's. I'll check your PR, take a
| deeper look at the Meta Llama 3 & 3.1 implementations, and hack
| together something soonish. Awesome!
| danielhanchen wrote:
| Np! Oh yeah, my PR is just a rough conversion from Python to C -
| I haven't done C in ages though, so it probably won't even
| compile!
| AMICABoard wrote:
| To be honest, your PR and these notes are super helpful because
| otherwise I'd have been too lazy to read up on the original
| implementation, but I can't merge it as-is. I will make the fix
| soonish and credit you. I'll tell you a secret: nobody is good
| at C except for a few wizards like Woz, Linus, jart, Knuth,
| Bellard and the like. I am also a total loser at C. I just love
| C. And the base of this is mostly Karpathy's work and that of
| other awesome folks who are way better at everything than me.
| danielhanchen wrote:
| Oh thanks! :) Oh yeah, Karpathy's repo is pretty sick! Well,
| still, great work on it! If you need any help with it, feel
| free to ask!
| AMICABoard wrote:
| Will do! Thanks :)
| v3ss0n wrote:
| How does this compare to llama.cpp?
| AMICABoard wrote:
| Llama.cpp is the king; this is just a lowly wannabe peasant.
| But some day it will get there.
| BaculumMeumEst wrote:
| I think generalizing llama2.c like this project is doing kind of
| defeats the purpose, no?
| AMICABoard wrote:
| Hmm, yeah, it started as a fork of Karpathy's llama2.c plus some
| experiments. So it is an abomination, I agree.
| robertkoss wrote:
| As someone who has literally no idea about ModelOps / GenAI
| deployment, what am I seeing here? Code that just loads the
| weights and provides an inference API? Or what does this code
| actually do?
| AMICABoard wrote:
| My bad, I directly linked to the C file instead of the project.
|
| It is a program that, given a model file, a tokenizer file and a
| prompt, continues generating text.
|
| To get it to work, you need to clone and build this:
| https://github.com/trholding/llama2.c
|
| So the steps are like this:
|
| First, you'll need to obtain approval from Meta to download the
| Llama 3 models on Hugging Face.
|
| Go to https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct,
| fill in the form, and then check your acceptance status at
| https://huggingface.co/settings/gated-repos. Once accepted, do
| the following to download the model, export it, and run.
|
| huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct
|
| git clone https://github.com/trholding/llama2.c.git
|
| cd llama2.c/
|
| # Export Quantized 8bit
|
| python3 export.py ../llama3.1_8b_instruct_q8.bin --version 2 --meta-llama ../Meta-Llama-3.1-8B-Instruct/original/
|
| # Fastest Quantized Inference build
|
| make runq_cc_openmp
|
| # Test Llama 3.1 inference, it should generate sensible text
|
| ./run ../llama3.1_8b_instruct_q8.bin -z tokenizer_l3.bin -l 3 -i " My cat"
___________________________________________________________________
(page generated 2024-07-24 23:12 UTC)