[HN Gopher] LLaMA-7B in Pure C++ with full Apple Silicon support
___________________________________________________________________
LLaMA-7B in Pure C++ with full Apple Silicon support
Author : mrtksn
Score : 181 points
Date : 2023-03-10 20:01 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jcq3 wrote:
| Don't know anything about ML. Can someone explain to me what
| the hype is about?
| lxe wrote:
| You can run your own ChatGPT on your Macbook.
| stagger87 wrote:
| https://en.wikipedia.org/wiki/Machine_learning
| potatoman22 wrote:
| This is an advanced language model that can now run quickly on
| consumer-grade hardware. You used to need thousands of dollars
| of GPUs to run a model as sophisticated as this - now it can be
| done on a laptop.
| mattfrommars wrote:
| Wasn't LLaMA officially meant to run on consumer-grade
| machines? How does this modify the model to make it work?
|
| All of this is confusing.
| hot_gril wrote:
| Run meaning run inference, not train, right?
| TOMDM wrote:
| Yes
| Szpadel wrote:
| I tried the llama 14B model using one of the online tools
| (mentioned in another HN comment; it claimed to use that model,
| but I can't be 100% sure) and I was very disappointed with the
| results. I asked it a fairly simple question: write a regex
| validating an email address. I tried this twice and got the
| responses "what is your question" and "please wait...", so it
| just dodged the question. In contrast, ChatGPT was able to
| provide me with a simple regex, and also an RFC-compliant
| version when asked.
| lxe wrote:
| I've gotten great Q&A / conversational results from llama 7B
| and 13B using
| https://github.com/oobabooga/text-generation-webui. Some
| example exchanges:
|
| Q: "How does the Schrodinger's cat thought experiment relate to
| the principles of quantum mechanics?"
|
| A: "In this famous paradox, a cat inside a box is simultaneously
| both alive and dead due to the nature of subatomic particles.
| This demonstrates that at the smallest scales of reality, we
| cannot predict whether something will happen until we observe
| it happening."
|
| Q: "What are some of the potential consequences of climate
| change on global food security?"
|
| A: "According to researchers at Stanford University, rising
| temperatures could lead to crop failures and reduced harvests
| for staple crops such as rice, wheat, corn, and soybeans by
| midcentury."
|
| Q: "Can you provide a brief overview of the history of political
| philosophy, from Ancient Greece to the present day?"
|
| A: "Absolutely! Political philosophy is a branch of philosophy
| concerning the role of government in society. In ancient times,
| Plato and Aristotle were among the first thinkers to consider
| the subject in depth. Since then, other influential philosophers
| such as Thomas Hobbes and John Locke have expanded upon these
| ideas. Today, contemporary philosophers such as John Rawls
| continue to debate the merits of different political systems."
| lxe wrote:
| And 30B:
|
| Prompt: Give me a basic recipe for easy spaghetti carbonara
|
| Output: Here is a basic recipe for spaghetti carbonara:
|
| Ingredients: 1 lb spaghetti, 3 tbsp butter, 2 tbsp olive oil,
| 1/4 cup grated Parmesan cheese, 1 egg, 1/2 teaspoon salt,
| 1/4 teaspoon pepper, 1/2 cup chopped parsley, 1/2 pound bacon,
| 1 clove garlic, 1/4 cup heavy cream.
|
| Directions: Cook spaghetti according to package directions;
| drain. Heat butter and oil in large skillet on medium heat.
| Add bacon; cook 4 minutes or until crisp. Remove from skillet.
| Add garlic to drippings in pan; saute 1 minute. Stir in egg,
| salt, pepper, and parsley; stirring constantly. Add bacon and
| spaghetti; toss well. Sprinkle with cheese. Serves 4.
| zabzonk wrote:
| Doesn't say what to do with the cream. Sprinkling with cheese
| is a bit wrong. Carbonara sauce is more difficult to make than
| you (or an AI) might think.
| IanCal wrote:
| Outside of it possibly being a smaller model, the prompts
| should be different: llama hasn't been trained to follow
| instructions, so I think it would need to be framed more as
| "A regex for emails is " and then let it complete the rest.
| Just in case you tried the exact same prompt.
|
| Also, those messages sound suspiciously like error messages
| from a system around it.
| endymi0n wrote:
| LLaMA is neither instruction-finetuned nor tuned with
| reinforcement learning from human feedback. It's a base model,
| like the original GPT-3, that needs much more careful prompting
| and few-shot examples to produce similar results.
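|
| As a rough illustration (hypothetical prompt text, not from the
| repo), a completion-style, few-shot framing looks something
| like this:
|
|     #include <string>
|
|     int main() {
|         // Few-shot, completion-style prompt for a base model:
|         // show the pattern, then leave the final answer blank
|         // and let the model continue the text.
|         std::string prompt =
|             "Q: Write a regex that matches a 4-digit year.\n"
|             "A: [0-9]{4}\n"
|             "Q: Write a regex that matches an email address.\n"
|             "A: ";
|         // pass `prompt` to the tokenizer and generation loop
|         return 0;
|     }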
| dmm wrote:
| sillysaurusx made some posts about tweaks that greatly improved
| output for him. Maybe they would help you:
|
| """
|
| Fixed the llama sampler. After turning off top_p, adding top_k
| 40, setting temp to 0.7, and adding a repetition penalty of
| 1/0.85, llama 7B is looking nice.
|
| """
|
| https://twitter.com/theshawwn/status/1632569215348531201?cxt...
|
| https://twitter.com/theshawwn/status/1633456289639542789?cxt...
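|
| In sampler terms those settings amount to roughly this (my own
| sketch, not llama.cpp's actual sampling code):
|
|     #include <algorithm>
|     #include <cmath>
|     #include <random>
|     #include <unordered_set>
|     #include <vector>
|
|     // Top-k sampling with temperature and a repetition penalty,
|     // using the quoted settings (top_p off, top_k 40, temp 0.7,
|     // penalty 1/0.85). Illustrative only.
|     int sample_token(std::vector<float> logits,
|                      const std::unordered_set<int> &recent,
|                      int top_k = 40, float temp = 0.7f,
|                      float repeat_penalty = 1.0f / 0.85f) {
|         // Push down tokens already seen in the recent context.
|         for (int id : recent) {
|             logits[id] = logits[id] > 0
|                 ? logits[id] / repeat_penalty
|                 : logits[id] * repeat_penalty;
|         }
|
|         // Keep only the top_k highest-scoring candidates.
|         top_k = std::min<int>(top_k, (int) logits.size());
|         std::vector<int> ids(logits.size());
|         for (size_t i = 0; i < ids.size(); ++i) ids[i] = (int) i;
|         std::partial_sort(
|             ids.begin(), ids.begin() + top_k, ids.end(),
|             [&](int a, int b) { return logits[a] > logits[b]; });
|
|         // Softmax over the candidates, scaled by temperature.
|         std::vector<float> probs(top_k);
|         float sum = 0.0f;
|         for (int i = 0; i < top_k; ++i) {
|             probs[i] = std::exp(
|                 (logits[ids[i]] - logits[ids[0]]) / temp);
|             sum += probs[i];
|         }
|         for (float &p : probs) p /= sum;
|
|         // Draw one token id from the resulting distribution.
|         static std::mt19937 rng{std::random_device{}()};
|         std::discrete_distribution<int> dist(probs.begin(),
|                                              probs.end());
|         return ids[dist(rng)];
|     }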
| Szpadel wrote:
| For reference, I tried it here: https://lite.koboldai.net/
|
| with chat mode, and parameters: temp 0.62, top_p 0.9,
| repetition penalty 1.08
|
| It's free to try, but you have to wait in a queue, so I had to
| wait about 10 minutes per response and didn't have the patience
| to test it further.
| DesiLurker wrote:
| But why would you do this in C++ when it's quite clear the ML
| workload is highly parallel? The page says it's vectorized with
| NEON, but there's no mention of whether that's autovectorized
| by GCC or hand-optimized. That will have a pretty significant
| performance impact.
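|
| For reference, a hand-written NEON kernel looks roughly like
| this fp32 dot product (my own sketch; the actual kernels
| presumably also cover f16 and the quantized formats):
|
|     #include <arm_neon.h>
|
|     // Hand-vectorized fp32 dot product with NEON intrinsics.
|     // Assumes n is a multiple of 4; illustrative only.
|     float dot_neon(const float *a, const float *b, int n) {
|         float32x4_t acc = vdupq_n_f32(0.0f);
|         for (int i = 0; i < n; i += 4) {
|             // acc += a[i..i+3] * b[i..i+3], four lanes at once
|             acc = vfmaq_f32(acc, vld1q_f32(a + i),
|                             vld1q_f32(b + i));
|         }
|         return vaddvq_f32(acc);  // horizontal sum of the lanes
|     }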
| xiphias2 wrote:
| George Hotz already implemented LLaMA 7B and 15B on GPU
| yesterday on Twitch, in the tinygrad llama branch:
|
| https://github.com/geohot/tinygrad/tree/llama
|
| The only problem is that it's swapping on a 16GB MacBook, so
| you need at least 24GB in practice.
| jalino23 wrote:
| that is so impressive!
| attractivechaos wrote:
| Both are _very_ impressive. A nice thing about Gerganov's
| implementation is that it is written in almost pure C. Arguably
| easier for deployment.
| apetresc wrote:
| Can you link to the Twitch stream VoD?
| vopi wrote:
| https://www.twitch.tv/georgehotz/videos?filter=archives&sort...
|
| Although there is a VOD channel on YT that might be better.
| yieldcrv wrote:
| why does it say video unavailable?
| vopi wrote:
| Hmm. I'm not sure. Presumably this will be updated soon:
| https://www.youtube.com/@geohotarchive/videos
| stagger87 wrote:
| I'm pretty sure he only makes his past Twitch streams
| available to subscribers. It's weird that they are available
| on YouTube; maybe he doesn't know that's something he can
| change.
| jonas21 wrote:
| I don't think he runs the YouTube channel - it's managed
| by some of his fans. But like what's he gonna do about
| it? Send them a takedown notice?
| Mistletoe wrote:
| This is such a refreshing and neat way to use Twitch.
| technerder wrote:
| Absolutely love ggerganov's approach with models like this and
| Whisper. It's just awesome being able to experiment with (what I
| consider) complex models without needing a billion python/c/cpp
| dependencies!
| 0xbadc0de5 wrote:
| That's all fine and good. But to do anything useful, you're going
| to want a powerful GPU (RTX 3090, RTX 4090 or A6000) with as much
| VRAM as possible. Unlike the diffusion models, LLMs are very
| memory-intensive, even at 4-bit GPTQ. The larger models like
| llama-13b and llama-30b run quite well at 4-bit on a 24GB GPU.
| The llama-65b-4bit should run on a dual 3090/4090 rig.
|
| Coupled with the leaked Bing prompt and text-generation-webui,
| the results are quite impressive.
| brookst wrote:
| Macs have UMA, so an off-the-shelf Mac can use up to about
| 120GB of VRAM. That's far more than any consumer card, and more
| than the biggest A100.
|
| GPU power is lower, of course, but pure vram is not a problem.
| stu2b50 wrote:
| VRAM is the thing that Apple Silicon is going to have in excess
| compared to anything even close in price. MacBook Airs can have
| 14-15GB of VRAM if necessary.
| kir-gadjello wrote:
| The RAM read bandwidth of the M1/M2 is still not really
| competitive with large GPUs like the RTX 3090, but it's getting
| close, especially compared with the usual CPU setups.
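|
| As a rough sanity check (approximate numbers of my own, not
| from the repo): generation is mostly bandwidth-bound, since
| every token has to stream the full set of weights once, so
| bandwidth divided by model size gives a crude upper bound on
| tokens/s:
|
|     #include <cstdio>
|
|     // Crude upper bound for a bandwidth-bound decoder: each
|     // generated token reads every weight once, so
|     // tokens/s <= bandwidth / bytes of weights.
|     int main() {
|         struct Device { const char *name; double gb_per_s; };
|         const Device devices[] = {
|             {"M1 Pro (~200 GB/s)",   200.0},
|             {"RTX 3090 (~936 GB/s)", 936.0},
|         };
|         // 7B parameters at 4 bits is roughly 3.5 GB of weights.
|         const double weights_gb = 7e9 * 0.5 / 1e9;
|
|         for (const Device &d : devices) {
|             printf("%-22s <= ~%.0f tokens/s\n",
|                    d.name, d.gb_per_s / weights_gb);
|         }
|         return 0;
|     }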
| 2bitencryption wrote:
| Could someone with experience explain: what's the theoretical
| minimum hardware requirement for llama 7B, 15B, etc, that still
| provides output on the order of <1sec/token?
|
| It seems like we can pull some tricks, like using F16, and some
| kind of quantization, etc.
|
| At the end of the day, how much overhead is left that can be
| reduced? What can I expect to have running on 16GB of RAM with
| a 3080 and a midrange AMD processor?
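|
| My rough understanding is that the dominant term is just
| parameter count times bytes per weight; a back-of-the-envelope
| table (ignoring the KV cache, activations and runtime overhead,
| which add a few more GB in practice) would be:
|
|     #include <cstdio>
|
|     // Back-of-the-envelope weight memory: params * bytes/weight.
|     // Ignores the KV cache, activations and runtime overhead.
|     int main() {
|         const double sizes_b[] = {7, 13, 30, 65};  // billions
|         const double bytes_per_weight[] = {2.0, 1.0, 0.5};
|         const char  *labels[] = {"f16", "8-bit", "4-bit"};
|
|         for (double b : sizes_b) {
|             printf("%4.0fB:", b);
|             for (int q = 0; q < 3; ++q) {
|                 printf("  %s ~%5.1f GB", labels[q],
|                        b * bytes_per_weight[q]);
|             }
|             printf("\n");
|         }
|         return 0;
|     }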
| loufe wrote:
| At 4 bits the 13B LLaMa model can run on a 10GB card!
| thewataccount wrote:
| 16GB of VRAM can run the 7B for sure. I'm not sure what the
| most cutting-edge memory optimization is, but the 15B is going
| to be pretty tight; I'm not sure that'll fit with what I know
| of, at least. I've got it working at a bit over 20GB of VRAM,
| I think, at 8-bit.
|
| If you can't fit it all in VRAM you can still run it, but it'll
| be slooooow; at least that's been my experience with the 30B.
| 0xbadc0de5 wrote:
| The 4-bit GPTQ LLaMA models are the current top-performers.
| This site has done a lot of the heavy lifting:
| https://github.com/qwopqwop200/GPTQ-for-LLaMa
|
| With 30b-4bit on a RTX 4090, I'm seeing numbers like:
|
| Output generated in 4.17 seconds (4.03 tokens/s, 21 tokens)
|
| Output generated in 4.38 seconds (4.25 tokens/s, 23 tokens)
|
| Output generated in 4.57 seconds (4.25 tokens/s, 24 tokens)
|
| Output generated in 3.86 seconds (3.40 tokens/s, 17 tokens)
|
| The smaller sizes (7B, 13B) are even faster, with lower memory
| use. A 16GB 3080 should be able to run the 13B at 4-bit just
| fine with reasonable (>1 token/s) latency.
| TaylorAlexander wrote:
| Well I was able to run the original code with the 7B model on
| 16GB vram: https://news.ycombinator.com/item?id=35013604
|
| The output I got was underwhelming, though I did not attempt
| any tuning.
| fnbr wrote:
| Parameter tuning is pretty necessary, according to anecdotes.
| People on Twitter have gotten good results by changing the
| default parameters.
| lxe wrote:
| I'm running 4-bit quantized llamas on torch/cuda with
| https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing
| significant tokens/second perf degradation compared to 8-bit
| bitsandbytes mode. I'm very new to this, and understand very
| little detail, but I thought it would be faster?
| dekhn wrote:
| This is sort of the polar opposite of how modern high performance
| ML frameworks are built. Skimming the code, there's a ton of
| boilerplate for the various operations that could be library-ized
| and generified, if that makes sense.
|
| I actually really like minimal implementations of state-of-the-
| art systems because the code is much easier to understand (modern
| frameworks are super-complex) but I wonder what it means long-
| term if you don't need frameworks.
| SCLeo wrote:
| This is just inference. The core of most ML libraries is the
| automatic differentiation capability. It would be extremely
| tedious to calculate the gradients manually. And if you
| implemented your own AD, it would effectively be a minified
| version of an ML library.
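|
| For a sense of what that core is, here's a toy forward-mode AD
| sketch with dual numbers (real frameworks do reverse mode over
| tensors, but the principle is the same):
|
|     #include <cmath>
|     #include <cstdio>
|
|     // Each value carries its derivative w.r.t. one chosen input.
|     struct Dual {
|         double val;  // f(x)
|         double der;  // f'(x)
|     };
|
|     Dual operator+(Dual a, Dual b) {
|         return {a.val + b.val, a.der + b.der};
|     }
|     Dual operator*(Dual a, Dual b) {  // product rule
|         return {a.val * b.val, a.der * b.val + a.val * b.der};
|     }
|     Dual tanh(Dual a) {               // chain rule
|         double t = std::tanh(a.val);
|         return {t, (1.0 - t * t) * a.der};
|     }
|
|     int main() {
|         // Differentiate y = tanh(w * x + b) with respect to w.
|         Dual w = {0.5, 1.0};  // seed: d(w)/d(w) = 1
|         Dual x = {2.0, 0.0};
|         Dual b = {0.1, 0.0};
|         Dual y = tanh(w * x + b);
|         printf("y = %f, dy/dw = %f\n", y.val, y.der);
|         return 0;
|     }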
| gumby wrote:
| Thanks for doing this, nice work!
|
| Please add some sort of license.
| SheinhardtWigCo wrote:
| Super cool project. This is from the author of whisper.cpp, which
| enables highly accurate real-time audio transcription on the
| M1/M2:
|
| https://github.com/ggerganov/whisper.cpp
| mrtksn wrote:
| The performance on a MacBook with the M1 Pro is said to be 20 tokens/s
|
| https://twitter.com/ggerganov/status/1634282694208114690
| eis wrote:
| A Macbook Pro M1 would have the base M1 CPU while he was
| referring to the M1 Pro CPU in something like a Macbook Pro w/
| M1 Pro. It's confusing naming by Apple.
| mrtksn wrote:
| right, fixed it
| lxe wrote:
| This is faster than running it on an RTX 4090 I think.
| [deleted]
| olliej wrote:
| It isn't using any of the Apple Silicon "ML"
| coprocessors/extensions/whatever, so it's just normal SIMD.
| astrange wrote:
| It uses Accelerate so it may be using some of that indirectly.
| throwuwu wrote:
| Now someone translate it to zig
| thewataccount wrote:
| I don't have the hardware to run the 60B model to test this at
| the moment -
|
| How does it perform with programming, for example making a basic
| python script to scrape a website, or a bash script, etc?
|
| I've managed to run the 13B* at 8-bit with decent performance
| on a 4090 - but it's only 24GB of VRAM, so I've been struggling
| to run the 30B at anything more than a snail's pace.
| 0xbadc0de5 wrote:
| The 13b and 30b run quite well on a 4090 at 4-bit quantization.
| thewataccount wrote:
| Ah dang, I missed that; I was still using 8-bit mode. I'll
| look into that, thanks!
| arthurcolle wrote:
| you mean the 13B ?
| thewataccount wrote:
| Yeah my bad, everyone is a bit all over the place with the
| numbers in this thread.
|
| I'm not exactly sure how these numbers were chosen, they seem
| a bit odd?
| mattfrommars wrote:
| I have very limited knowledge in this domain.
|
| Why is it necessary to port LLaMA to C++? Assuming the original
| model implementation was in Python, would it not have required
| just a few tweaks to make it work on Apple Silicon?
___________________________________________________________________
(page generated 2023-03-10 23:00 UTC)