[HN Gopher] Llama 3.1 in C
       ___________________________________________________________________
        
       Llama 3.1 in C
        
       Author : AMICABoard
       Score  : 184 points
       Date   : 2024-07-24 02:49 UTC (20 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | AMICABoard wrote:
       | Okay but hold your horses. Still a bit buggy.
       | 
       | Sample output:
       | 
        | Meta's Llama 3.1 models can output multilingual text, which is
        | awesome. Here is some example output from the 8-bit quantized
        | 8B model with 100-token output (-n 100)... Quantization creates
        | some brain damage.
       | 
       | English
       | 
       | My cat is funny. "Funny cat," I say, walking up to it. "What are
       | you up to?" It sits up straight and looks at me with a tilted
       | head, as if to say, "What's wrong with you?" Sometimes I just
       | have to laugh at how funny a cat can be. So I say, "Okay, you're
       | funny. I'll give you some treats." It stretches out a little and
       | I give it some treats. It eats them up quickly and starts
       | 
        | German (translated)
        | 
        | Do you own an Amiga 500? If not, you will know that this was a
        | computer that was available in the late 1980s and early 1990s
        | for games intended for people with certain bodily injuries.
        | Sometimes we have to remember earlier times which, as we now
        | know, were difficult. Here is a link that discusses my
       | 
        | French (translated)
        | 
        | French wine is, in many respects, a wine of origins, for it
        | has played an important role in the history of France". The
        | reputation of French wines rests mainly on their taste
        | qualities and controlled vineyard management, which makes
        | French wine an "exceptional product". France is therefore the
        | world leader in wine production, with 25% of the world's
       | 
        | Thai (translated)
        | 
        | My cat is funny and likes playing on rolls of paper, and I am
        | not sure whether I should allow it to play or not.
        | 
        | As time passed, I saw that my cat played with the paper roll
        | from time to time, so I decided to try adjusting the game to
        | make it more fun
       | 
        | Hindi (translated)
        | 
        | My cat is very funny and she always makes me laugh with her
        | physical activities. In fact, when she gets caught up in her
        | toys she becomes especially classic. Besides that, she is a
        | little
        
         | FrostKiwi wrote:
         | > Quantization creates some brain damage.
         | 
         | Love the wording.
        
           | AMICABoard wrote:
            | Maybe we should make "Brain Damage Factor" an official
            | term to denote how much different types of quantization
            | degrade output compared to the unquantized model. :)
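            | 
            | One way such a factor could be defined (purely a sketch,
            | all names here are made up): the mean KL divergence
            | between the full-precision and quantized models'
            | next-token distributions over a probe set.
            | 
            |   #include <math.h>
            | 
            |   /* Hypothetical Brain Damage Factor: mean KL(P||Q)
            |    * in nats per position. p and q hold softmaxed
            |    * probabilities, one vocab-sized row per position. */
            |   float bdf(const float *p, const float *q,
            |             int vocab, int rows) {
            |       double sum = 0.0;
            |       for (int t = 0; t < rows; t++) {
            |           const float *pt = p + (long)t * vocab;
            |           const float *qt = q + (long)t * vocab;
            |           double kl = 0.0;
            |           for (int i = 0; i < vocab; i++)
            |               if (pt[i] > 0.0f && qt[i] > 0.0f)
            |                   kl += pt[i] * log(pt[i] / qt[i]);
            |           sum += kl;
            |       }
            |       return (float)(sum / rows);
            |   }
            | 
            | Zero would mean no damage; bigger is more brain damage.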
        
             | Y_Y wrote:
              | I think that's known as degradation, but I think the
              | brain damage metric could be usefully applied to the
              | poor sods who try to get the quantization to work in
              | the first place.
        
               | AMICABoard wrote:
               | Lol :)
        
           | impossiblefork wrote:
            | There's actually an old paper titled Optimal Brain
            | Damage, where they don't try to find optimal
            | quantizations, but optimal sparse versions of a model --
            | i.e. where some weights are set to zero.
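            | 
            | The rule itself is tiny: with a diagonal Hessian
            | approximation, the saliency of weight k is
            | 0.5 * H_kk * w_k^2, and the least-salient weights get
            | zeroed. A rough C sketch (illustrative names only):
            | 
            |   #include <stdlib.h>
            |   #include <string.h>
            | 
            |   static int cmpf(const void *a, const void *b) {
            |       float x = *(const float *)a;
            |       float y = *(const float *)b;
            |       return (x > y) - (x < y);
            |   }
            | 
            |   /* Zero out the prune_frac least-salient weights. */
            |   void obd_prune(float *w, const float *h_diag,
            |                  int n, float prune_frac) {
            |       float *sal = malloc(n * sizeof *sal);
            |       float *tmp = malloc(n * sizeof *tmp);
            |       for (int k = 0; k < n; k++)  /* OBD saliency */
            |           sal[k] = 0.5f * h_diag[k] * w[k] * w[k];
            |       memcpy(tmp, sal, n * sizeof *tmp);
            |       qsort(tmp, n, sizeof *tmp, cmpf);
            |       float cut = tmp[(int)(prune_frac * (n - 1))];
            |       for (int k = 0; k < n; k++)
            |           if (sal[k] <= cut) w[k] = 0.0f;
            |       free(sal);
            |       free(tmp);
            |   }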
        
             | AMICABoard wrote:
              | Interesting. I'd heard of something like that, but now
              | I must read about it.
        
             | nickpsecurity wrote:
             | I found...
             | 
              | Optimal Brain Damage https://www.researchgate.net/publication/221618539_Optimal_B...
             | 
              | Optimal Brain Compression https://openreview.net/pdf?id=ksVGCOlOEba#:~:text=The%20resu....
             | 
             | TinyVolt's implementation of it:
             | https://github.com/TinyVolt/optimal-brain-compression
        
               | AMICABoard wrote:
                | Thanks, I'll read up on it. Interesting.
        
               | nickpsecurity wrote:
                | That was really meant for them. You're out there
                | building neat stuff. Your talent might warrant
                | looking into AdderNets and BitNets, which might get
                | the cost down. There are also some brain-inspired
                | designs.
                | 
                | I don't think many people have implemented such
                | things. You might discover something new
                | experimenting with them.
        
               | AMICABoard wrote:
               | :)
        
         | actionfromafar wrote:
          | An Amiga 500?
         | 
         | Shut up and take my money!
        
           | AMICABoard wrote:
            | Yes, an Amiga 500! My first computer. Still in love with
            | her... :)
        
         | aninteger wrote:
         | > My cat is funny. "Funny cat," I say, walking up to it. "What
         | are you up to?" It sits up straight and looks at me with a
         | tilted head, as if to say, "What's wrong with you?" Sometimes I
         | just have to laugh at how funny a cat can be. So I say, "Okay,
         | you're funny. I'll give you some treats." It stretches out a
         | little and I give it some treats. It eats them up quickly and
         | starts
         | 
         | This is kind of like 3rd grade English. What would be required
         | to go beyond that?
        
           | AMICABoard wrote:
           | Actually "My cat is funny" was the prompt it continued that.
           | I got to fix some stuff to reflect meta's implementation and
           | also fix the chat mode, then it would be usable. Will take a
           | few days to do that.
        
       | hislaziness wrote:
        | Cool. I will try it out. I tried the same with ollama; the
        | non-English part needs a lot more polish. Do you see the
        | outcome being any different?
        
         | AMICABoard wrote:
          | I think the non-English part is mostly hit and miss in
          | this primitive version, probably because the
          | implementation is not correct. I've got to read up a lot
          | and fix it.
        
       | danielhanchen wrote:
        | Oh this is super cool! I think maybe the new RoPE scaling
        | method Llama 3.1 uses isn't added in yet? It's a weird
        | one-time scaling mechanism found by a grid search to enable
        | 128K context. Essentially the model was trained on 15.6T
        | tokens at 8K context, then iteratively extended to 128K
        | context with 800B tokens.
       | 
       | Can open a PR if people want :) [Edit: Just opened a PR!
       | Apologies my C is very rusty!
       | https://github.com/trholding/llama2.c/pull/14]
       | 
       | https://github.com/trholding/llama2.c/blob/master/runq.c#L65...
       | needs to be scaled with some weird formula like in
       | https://github.com/unslothai/unsloth/blob/main/unsloth/model...
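        | 
        | The scaling itself is small; roughly this C sketch, using the
        | constants from Meta's reference code (my paraphrase, not the
        | exact PR):
        | 
        |   #include <math.h>
        | 
        |   /* Llama 3.1's one-time RoPE frequency rescale: slow
        |    * low-frequency dims by scale_factor, keep the high-
        |    * frequency dims, and smoothly interpolate in between. */
        |   void llama31_scale_freqs(float *inv_freq, int n) {
        |       const float scale_factor     = 8.0f;
        |       const float low_freq_factor  = 1.0f;
        |       const float high_freq_factor = 4.0f;
        |       const float old_ctx          = 8192.0f;
        |       const float pi = 3.14159265358979f;
        |       float low_wl  = old_ctx / low_freq_factor;
        |       float high_wl = old_ctx / high_freq_factor;
        |       for (int i = 0; i < n; i++) {
        |           float wl = 2.0f * pi / inv_freq[i];
        |           if (wl > low_wl) {
        |               inv_freq[i] /= scale_factor;
        |           } else if (wl >= high_wl) {
        |               float s =
        |                   (old_ctx / wl - low_freq_factor) /
        |                   (high_freq_factor - low_freq_factor);
        |               inv_freq[i] =
        |                   (1.0f - s) * inv_freq[i] / scale_factor
        |                   + s * inv_freq[i];
        |           } /* else: high-frequency, keep as-is */
        |       }
        |   }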
        
         | AMICABoard wrote:
          | Oh thanks, bro. Nope, it uses the simple Llama 2 RoPE with
          | theta changed to 500k to match Llama 3's. I'll check your
          | Python PR, have a deeper look at the Meta Llama 3 & 3.1
          | implementations, and hack something together soonish.
          | Awesome!
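          | 
          | i.e. the only RoPE change so far is the base frequency,
          | something like:
          | 
          |   #include <math.h>
          | 
          |   /* llama2.c-style RoPE frequency with theta bumped
          |    * from 10000 (Llama 2) to 500000 for Llama 3.x. */
          |   float rope_freq(int head_dim, int head_size) {
          |       const float theta = 500000.0f;
          |       return 1.0f /
          |           powf(theta, head_dim / (float)head_size);
          |   }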
        
           | danielhanchen wrote:
            | Np! Oh yeah, my PR is just a random conversion from
            | Python to C - haven't done C in ages though, so it
            | probably won't even compile!
        
             | AMICABoard wrote:
              | To be honest, your PR and these notes are super
              | helpful, because otherwise I'd have been too lazy to
              | read up on the original implementation, but I can't
              | merge it. Will make the fix soonish and credit it to
              | you. I'll tell you a secret: nobody is good at C except
              | for a few wizards like Woz, Linus, jart, Knuth, Bellard
              | and the like. I am also a total loser at C. I just love
              | C. And also the base of this is mostly Karpathy's work
              | and other awesome folks who are way better at
              | everything than me.
        
               | danielhanchen wrote:
                | Oh thanks! :) Oh yeah, Karpathy's repo is pretty
                | sick! Still, great work on it! If you need any help,
                | feel free to ask!
        
               | AMICABoard wrote:
               | Will do! Thanks :)
        
       | v3ss0n wrote:
        | How does this compare to llama.cpp?
        
         | AMICABoard wrote:
          | Llama.cpp is the king; this is just a lowly wannabe
          | peasant. But some day it will get there.
        
       | BaculumMeumEst wrote:
        | I think generalizing llama2.c the way this project does kind
        | of defeats the purpose, no?
        
         | AMICABoard wrote:
          | Hmm yeah, it started as a fork of Karpathy's llama2.c plus
          | some experiments. So it is an abomination, I agree.
        
       | robertkoss wrote:
       | As someone who has literally no idea of ModelOps / GenAI
       | Deployment, what am I seeing there? Code that just loads in the
       | weights and provides an inference API? Or what does this code
       | actually do?
        
         | AMICABoard wrote:
          | My bad, I directly linked to the C file instead of the
          | project.
          | 
          | It is a program that, given a model file, a tokenizer file,
          | and a prompt, continues generating text from the prompt.
         | 
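          | Roughly, the main loop is plain autoregressive decoding; a
          | toy, self-contained sketch (forward() and sample() are
          | stubs standing in for the real transformer and sampler):
          | 
          |   #include <stdio.h>
          | 
          |   #define VOCAB 8
          |   static float logits_buf[VOCAB];
          | 
          |   static float *forward(int token, int pos) {
          |       (void)pos;  /* unused in this stub */
          |       /* stub transformer: fake a prediction */
          |       for (int i = 0; i < VOCAB; i++)
          |           logits_buf[i] = 0.0f;
          |       logits_buf[(token + 1) % VOCAB] = 1.0f;
          |       return logits_buf;
          |   }
          | 
          |   static int sample(const float *logits) {
          |       int best = 0;  /* greedy argmax */
          |       for (int i = 1; i < VOCAB; i++)
          |           if (logits[i] > logits[best]) best = i;
          |       return best;
          |   }
          | 
          |   int main(void) {
          |       int prompt[] = {1, 2};  /* "tokenized" prompt */
          |       int n_prompt = 2, steps = 10;
          |       int token = prompt[0];
          |       for (int pos = 0; pos < steps; pos++) {
          |           float *logits = forward(token, pos);
          |           int next = (pos < n_prompt - 1)
          |               ? prompt[pos + 1]  /* force the prompt */
          |               : sample(logits);  /* then generate */
          |           printf("%d ", next);   /* decode() in real run */
          |           token = next;
          |       }
          |       printf("\n");
          |       return 0;
          |   }
          | 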
         | To get it to work, you need to clone and build this:
         | https://github.com/trholding/llama2.c
         | 
         | So the steps are like this:
         | 
          | First you'll need to obtain approval from Meta to download
          | the Llama 3 models on Hugging Face.
          | 
          | Go to
          | https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct,
          | fill in the form, then check
          | https://huggingface.co/settings/gated-repos for your
          | acceptance status. Once accepted, do the following to
          | download the model, export it, and run:
         | 
         | huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct
         | --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct
         | 
         | git clone https://github.com/trholding/llama2.c.git
         | 
         | cd llama2.c/
         | 
         | # Export Quantized 8bit
         | 
         | python3 export.py ../llama3.1_8b_instruct_q8.bin --version 2
         | --meta-llama ../Meta-Llama-3.1-8B-Instruct/original/
         | 
         | # Fastest Quantized Inference build
         | 
         | make runq_cc_openmp
         | 
         | # Test Llama 3.1 inference, it should generate sensible text
         | 
         | ./run ../llama3.1_8b_instruct_q8.bin -z tokenizer_l3.bin -l 3
         | -i " My cat"
        
       ___________________________________________________________________
       (page generated 2024-07-24 23:12 UTC)