[HN Gopher] Llama2.c: Inference llama 2 in one file of pure C
___________________________________________________________________
Llama2.c: Inference llama 2 in one file of pure C
Author : anjneymidha
Score : 323 points
Date : 2023-07-23 18:13 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| lachlan_gray wrote:
| Not that it is necessarily of value, but has anyone got an LLM
| to run on bare metal?
| tomrod wrote:
| Some of the smaller ones, yes, the huggingface.co libraries
| make it pretty simple.
| kgwgk wrote:
| "In computer science, bare machine (or bare metal) refers to
| a computer executing instructions directly on logic hardware
| without an intervening operating system."
|
| https://en.wikipedia.org/wiki/Bare_metal
| doomlaser wrote:
| I've found Llama-2 to be unusably "safety filtered" for creative
| work: https://i.imgur.com/GFY0wSL.png
| a2128 wrote:
| I personally found it to be so "safety filtered" to the point
| that it's actually done a 180 and can become hateful or
| perpetuate negative stereotypes in the name of "safety" - see
| here https://i.imgur.com/xkzXrPK.png and
| https://i.imgur.com/3HQ8FqL.png
|
| I did have trouble reproducing this consistently, except in the
| Llama2-70b-chat TGI demo on Hugging Face and only when it's sent
| as the second message, so maybe there's something wonky going on
| with the prompting style there that causes this behavior. I
| haven't been able to get the model running myself for further
| investigation yet.
| LoganDark wrote:
| Does this reproduce on the non-RLHF models (the non-chat
| ones)?
| Kuinox wrote:
| It's Llama-2-chat that is overly filtered, not the base Llama-2.
| jasmer wrote:
| [dead]
| Jorge1o1 wrote:
| Imagine, Casca and Brutus don't stab Caesar. Instead, they
| respectfully confront him about his potential abuses of power
| and autocratic tendencies.
| foota wrote:
| Did anyone try this though? Just curious.
| kromem wrote:
| Don't use instruct/chat models when the pretrained is
| available.
|
| Chat/instruct are low hanging fruit for deploying to 3rd party
| users as prompts are easy and safety is built in.
|
| But they suck compared to the pretrained models for direct
| usage. Like really, really suck.
|
| Which is one of the areas where Llama 2 may have an advantage
| over OpenAI, as the latter just deprecated their GPT-3
| pretrained models and, it looks like, will only offer chat
| models moving forward.
| bilsbie wrote:
| What are some uses for this?
| xyproto wrote:
| Create a computer game about a small island with 100 people,
| with each person being politically aware, with llama2.c being
| their brain. Then you can simulate politics for a thousand
| years and see what happens. For instance.
| astrange wrote:
| https://twitter.com/fablesimulation/status/16813529041528504.
| ..
| orbital-decay wrote:
| Neat idea. Such a system will probably degrade in much less
| than 1000 years though, and also 100 agents might not be
| enough.
| version_five wrote:
| - learning how llama works
|
| - learning how to implement various deep learning operations in
| C
|
| - generally removing abstraction from "AI" to give a better
| sense of what is happening in inference
|
| - as a template to follow for custom projects
|
| - as a basis for learning about applying hardware specific
| optimizations (say, trying to rewrite to use BLAS)
|
| - because it's cool
| akomtu wrote:
| Random thought: right now an LLM returns a probability
| distribution, an RNG sampler picks one token and appends it to
| the output, then the sequence repeats; but could the RNG instead
| pick N tokens that approximate the distribution, ask the LLM to
| generate N new distributions, combine them somehow, then pick
| another set of N tokens from the combined distribution?
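| For context, the loop being described looks roughly like this
| (a minimal sketch; the function names here are illustrative, not
| taken from run.c):
|
|   // ordinary autoregressive sampling, as described above
|   int token = BOS;                                 // start-of-sequence token
|   for (int pos = 0; pos < max_seq_len; pos++) {
|       float *logits = forward(model, token, pos);  // vocab-sized scores
|       softmax(logits, vocab_size);                 // -> probability distribution
|       token = sample_rng(logits, vocab_size);      // RNG picks ONE token
|       append(output, token);                       // which becomes the next input
|   }
|
| The idea above would replace the single sample call with picking
| N candidate tokens, running the forward pass once per candidate,
| and merging the N resulting distributions before sampling again.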
| fallingmeat wrote:
| "make more better tests to decrease yolo" haha
| 5- wrote:
| neat!
|
| note that gcc's default optimisation level is 0, which really
| isn't what people normally want.
|
| adding -O2 to the gcc command line should improve performance
| quite a bit.
| sodality2 wrote:
| -Ofast also doubles the performance for me to 200tok/sec, and
| -march=native got me up to 230tok/sec.
|
| -Ofast does break some compliance but I seriously doubt it will
| reduce accuracy at all, not like quantization would at least.
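| For reference, the full invocation would be something like this
| (assuming the repo's default file names):
|
|   gcc -Ofast -march=native -o run run.c -lm
|   ./run out/model.bin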
| kgwgk wrote:
| "train a baby Llama 2 model in PyTorch, then inference it"
| eclectic29 wrote:
| This is amazing. One curious question: Why C? Why not standard
| C++?
| bobbyi wrote:
| That project already exists
| https://github.com/ggerganov/llama.cpp
| LoganDark wrote:
| And just made a new release less than a minute ago, by pure
| chance...
| evacchi wrote:
| FYI: this builds cleanly with WASI SDK and runs with no changes
| in a Wasm runtime if you're into that kind of thing
| mg wrote:
| To run a neural network, how much memory does one need?
|
| Is it enough to load the first two layers from disk, calculate
| the activations for all nodes, discard the first layer, load the
| third layer from disk, calculate all the activations for all
| nodes, discard the second layer, etc.?
|
| Then the memory only needs to be big enough to hold 2 layers?
| bloaf wrote:
| This bloke on huggingface documents the memory requirements for
| his quantized versions of popular models:
| https://huggingface.co/TheBloke
|
| TL;DR: max RAM needed depends on quant method; rough ranges
| are:
|
| 7B models are in the 4-8GB range
|
| 13B models 8-15GB
|
| 30B models 13-33GB
|
| 70B models 31-75GB
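| As a rough sanity check on those numbers (back-of-envelope,
| ignoring activation and KV-cache overhead):
|
|   7B params * 0.5 bytes/param (4-bit)  ~ 3.5 GB
|   7B params * 1.0 bytes/param (8-bit)  ~ 7 GB
|
| which lines up with the 4-8GB span quoted for the 7B models.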
| gpm wrote:
| Yes... but keep in mind you'll be limited by disk bandwidth if
| you do that.
| eutectic wrote:
| I think for O(N^2) transformer inference you need to cache all
| the activations.
| thomasahle wrote:
| You only need to cache the key/value pairs. And Llama 2 (the
| 70B variant) uses grouped-query attention, so there are even
| fewer key/value pairs to cache than in usual models.
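| As a rough formula (generic transformer symbols, not variables
| from the code):
|
|   kv_cache_bytes ~ 2 * n_layers * seq_len * n_kv_heads
|                      * head_dim * bytes_per_value
|
| The factor of 2 covers keys plus values; grouped-query attention
| shrinks n_kv_heads relative to the number of query heads.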
| petters wrote:
| You don't have to do the loading/discarding explicitly. You
| could just mmap the entire network and let the os handle that.
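| A minimal sketch of that approach (POSIX, error handling
| trimmed; the point is just that the kernel pages weights in and
| out on demand):
|
|   #include <fcntl.h>
|   #include <stddef.h>
|   #include <sys/mman.h>
|   #include <sys/stat.h>
|   #include <unistd.h>
|
|   float *map_weights(const char *path, size_t *nbytes) {
|       int fd = open(path, O_RDONLY);
|       struct stat st;
|       fstat(fd, &st);
|       // map the whole checkpoint read-only; no explicit load/discard needed
|       void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
|       close(fd);                  // the mapping stays valid after close
|       *nbytes = st.st_size;
|       return p == MAP_FAILED ? NULL : (float *)p;
|   }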
| sp332 wrote:
| Didn't llama.cpp need to convert the weights file to a new
| format to support that? The way they're stored in the
| official file isn't efficient for operating on directly.
| gliptic wrote:
| They already had their own format before that.
| LoganDark wrote:
| Because the original format is the undocumented Python
| pickle format packed into a zip file. It's kind of
| ridiculous to attempt to support directly.
| samstave wrote:
| (I am talking out my butt - because these are new concepts to
| me, so forgive the ELI5 manner of Qs);
|
| Can you "peel" a layer and feed it off onto something that
| doesn't need to discard it, but only receives the "curated"
| layer via the prompt that drove its creation - and then have
| other weights assigned?
|
| Again - I am an infant on this line of questions, so please
| educate me (the other me myselfs)
| anjneymidha wrote:
| More details from Andrej here:
| https://twitter.com/karpathy/status/1683143097604243456?s=46...
| sva_ wrote:
| https://nitter.net/karpathy/status/1683143097604243456?s=46&...
| karpathy wrote:
| Yay fun to see it make its way to HN :) It turns out that my
| original checkpoint runs _way_ faster than I expected (100 tok/s)
| on MacBook Air M1 with -O3 when compiling, so I am now training a
| bigger 44M model, which should still run interactively. Maybe
| the 7B Llama model is within reach... :thinking_emoji:
| downvotetruth wrote:
| If the alloc functions use calloc, it would seem to make sense
| to name them after that rather than after malloc, which (per
| valgrind) isn't actually used - unless the naming is supposed
| to incentivize a pure-stack fork, which will likely appear in
| less than a month.
| pama wrote:
| Great job, thanks! Do you have any early impressions on the
| relative quality/performance of small Llama-2 models vs the
| small GPT-2 models?
| novaRom wrote:
| I used a tweaked nanoGPT to pretrain a 12M model on TinyStories
| (2 GB of text produced by GPT-4), and the results are pretty
| amazing. I then adapted it a bit on Wikipedia, and it looks
| like a solid bullshit generator, much smarter than any smoothed
| n-gram model, and significantly smaller. My bet is that small
| LLMs will be predominant in multiple areas. My next goal is to
| reduce the 7B Llama 2 to 10-100M without making it much dumber.
| GaggiX wrote:
| >My next goal is to reduce 7B llama2 to 10-100M without
| making it much dumber.
|
| That is going to be hard as the 7B model was trained on 2T
| tokens. Maybe if you heavily restrict the range in which the
| model should operate.
| [deleted]
| pgbovine wrote:
| Your work is an inspiration as always!! My n00b question is:
| what do you think is currently the most practical path to
| running a reasonably-sized (doesn't have to be the biggest) LLM
| on a commodity linux server for hooking up to a hobby web app
| ... i.e., one without a fancy GPU. (Renting instances with GPUs
| on, say, Linode, is _significantly_ more expensive than
| standard servers that host web apps.) Is this totally out of
| reach, or are approaches like yours (or others you know of) a
| feasible path forward?
| vikp wrote:
| I would use textsynth (https://bellard.org/ts_server/) or
| llama.cpp (https://github.com/ggerganov/llama.cpp) if you're
| running on CPU.
|
| - I wouldn't use anything higher than a 7B model if you want
| decent speed.
|
| - Quantize to 4-bit to save RAM and run inference faster.
|
| Speed will be around 15 tokens per second on CPU (tolerable),
| and 5-10x faster with a GPU.
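| The rough arithmetic behind that estimate, since CPU inference
| is usually memory-bandwidth bound (the 50 GB/s figure is
| illustrative):
|
|   tokens/sec ~ memory bandwidth / bytes read per token
|              ~ 50 GB/s / 3.5 GB (7B at 4-bit) ~ 14 tok/s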
| Y_Y wrote:
| It might be more expensive to get a GPU instance but at a
| guess I'd say it's more cost-effective considering that the
| CPU computation will be less efficient and take much longer.
| I bet someone's worked this out with real numbers, I just
| haven't seen it.
| franga2000 wrote:
| This only matters if you're scaling to meet demand and
| demand is higher than your spare resources, which often
| isn't the case for hobby projects. The 10EUR/mo VPS I've
| had for over 6 years now still has a few cores and GBs of
| RAM spare, so running a small model on the CPU for a
| personal project that only me and a few friends
| occasionally use wouldn't cost me a cent more.
| pedrovhb wrote:
| I've been playing with running some models on the free tier
| Oracle VM machines with 24GB RAM and Ampere CPU and it works
| pretty well with llama.cpp. It's actually surprisingly quick;
| speed doesn't scale _too_ well with the number of threads on
| CPU, so even the 4 ARM64 cores on that VM, with NEON, run at
| a similar speed to my 24-core Ryzen 3850X (maybe about half
| reading speed). It can easily handle Llama 2 13B, and if I
| recall correctly I did manage to run a 30B model in the past
| too. Speed for the smaller ones is ~half reading speed or so.
|
| It's a shame the current Llama 2 jumps from 13B to 70B. In
| the past I tried running larger stuff by making a 32GB swap
| volume, but it's just impractically slow.
| eclectic29 wrote:
| Is this for educational purposes only? Based on the success of
| llama.cpp and this one, is the industry moving toward separate
| source code for every model that is released, instead of
| general-purpose frameworks like pytorch/tensorflow/onnxruntime?
| coder543 wrote:
| Yes, this appears to be entirely educational.
|
| No. Despite the name, llama.cpp supports more than just llama.
| It also isn't an entirely bespoke thing as you indicate, since
| it is built on the more general purpose "ggml" tensor
| library/framework.
| cjbprime wrote:
| Yes, since it's single-threaded.
| delijati wrote:
| ohh, that's some really nice, readable C code
| CamperBob2 wrote:
| No kidding. It even compiles under Windows with _cl run.c_, no
| need to go hunting around for getopt.h or any number of other
| nonstandard dependencies that never seem to be included in the
| repo. An uncommon and welcome sight.
| gandalfff wrote:
| Seems like this could be suitable for masochists like me who wish
| to run language models on retro computers :)
| taminka wrote:
| not really imo
|
| i'm really enjoying the resurgence of very minimal
| implementations of ml algorithms, because if you've recently
| tried performing inference on a sophisticated ml model in a way
| that's user friendly in any capacity, you know that it
| essentially involves pulling out your prayer book, rosary and
| incense, then pulling like 20gb of python dependencies and 20
| different frameworks, all of which break very easily. any minor
| difference in versioning is guaranteed to break the entire
| setup, with no hope of fixing it. it's just bindings on top of
| bindings on top of bindings: every other day a new library comes
| out that builds on top of existing libraries, introducing its
| own format, promising "deploy models in 15 lines of python",
| then "10 lines of python", then "1 line of python", which
| essentially calls into a black box of N layers of python stacked
| on top of each other, calling into an extremely complicated C++
| autodiff library whose source code can only be acquired by an
| in-person meeting with some sketchy software engineer from
| czechia. all of which only works on python 3.10.2, cuda
| v12.78.1298.777 with commit aohfyoawhftyaowhftuawot, compiled
| only with microsoft's implementation of the C++ compiler with 10
| non-standard extensions enabled - and all of this OF COURSE only
| if you have the most optimal hardware
|
| point is, if your implementation is a simple C project that's
| trivial to build/integrate into your project, it's
| significantly easier to use on any hardware, not just retro
| (popularity of llama.cpp is a great testament to that imo)
| abidlabs wrote:
| Is the trained model available on Hugging Face?
| Dwedit wrote:
| Sounds like what Llama.cpp used to be.
| avhon1 wrote:
| I'm not sure what you mean by "used to be", the llama.cpp
| github repository was committed to just 4 hours ago.
|
| This project cites llama.cpp as inspiration, but seems much-
| simplified. It _only_ supports llama-2, only supports fp-32,
| and only runs on one CPU thread.
| LoganDark wrote:
| > I'm not sure what you mean by "used to be", the llama.cpp
| github repository was committed to just 4 hours ago.
|
| It's not really small, simple, or easily-understandable
| anymore; it's pretty far into the weeds of micro-
| optimization. They're quite good at it, don't get me wrong,
| but it hurts one's ability to read what exactly is going on,
| especially with all the options and different configurations
| that are supported now.
|
| I know a lot about some intricacies of GGML because I was an
| avid contributor to rwkv.cpp for a few weeks, but I still
| don't understand llama.cpp. It's just on a completely
| different level.
| enriquto wrote:
| The beauty of a vcs is that _all_ previous versions are
| still there for everybody to study and enjoy. Including the
| glorious first commit of llama.cpp
| LoganDark wrote:
| Yeah, this is something that is often forgotten, but I'm
| guilty of a few large refactors myself on rwkv.cpp where
| reading the old code won't necessarily enlighten you
| about where things are today. I'd be surprised if
| llama.cpp doesn't have any of these.
| samwillis wrote:
| Here's this running in the browser via Emscripten, by Georgi
| Gerganov of llama.cpp fame:
|
| https://ggerganov.com/llama2.c/
|
| Via his Twitter with ongoing thread:
| https://twitter.com/ggerganov/status/1683174252990660610
|
| This and the original are all absolutely awesome. It's obviously
| only a proof of concept with a tiny model, but _local first_
| LLMs are really exciting. I particularly love the idea of being
| able to build webapps with local inference.
|
| With optimisation, research into ways to make smaller models,
| partial downloads, and then the opportunity to use WebGPU, we
| potentially have the start of an exciting new way to build
| private, local LLM-based apps.
|
| It's never going to match the capabilities of hosted LLMs
| running on massive clusters of top-end GPUs, but there are so
| many use cases that this sort of thing will enable.
| SeanAnderson wrote:
| I got the strangest output from your first link. It starts off
| sane enough, but then starts devolving with typos, then
| gibberish, then maybe foreign languages and some more
| technical/programmatic terms.. weird stuff.
|
| Once upon a time, there was a little girl named Lily. She loved
| to play outside in the park. One day, while she was playing,
| she saw a black bird flying in the sky. It was a beautiful bird
| with yellow wings.Lily ran to her friend, Timmy, and said,
| "Look, Timmy! A pretty bird!" Timmy smiled and said, "I see it!
| It's black and black."Suddenly, the sky turned dark and it
| started to rain. Lily and Timmy ran to a shelter and waited for
| the rain to stop. When it finally stopped, they ran back to
| Lily's house. They were happy to be safe and dry. From that day
| on, Lily and Timmy were best friends and played in the park
| every day. Once upon a time, in a small town, there was a big
| temple. Many people went to the temple to talk to each other.
| One day, a little boy named Tim went to the temple with his
| mom.Tim saw a pretty red ball at the temple. He asked his mom,
| "Can I have the ball, please?" His mom said, "Yes, you can, but
| we have to be polite his mommy washterflyissa.Butterfly would
| pauseWhy, butterfly princes destroyed theater. It washated
| Timmy smiled and wanted Brownie had ais. They went tow quen his
| birthday because of wanting towereon. Sheep.Lily. He herbs. The
| playfully. 1 Uals he herbunts became best of their next
| towicks. 3. One day and tree clothes that day. That nightmar
| fell in the queen made itchyweet shower. It washing upst
| corner. Luck and theater with pride. 2 Jals, thinking of
| drawing, as long ago.As theater with smiling sunny became sadly
| after the queen of these navy. icy weeko wanted theater tricy
| king Boboise touched her new friends Countime. They both Lily
| lived down the other customer John andurgenucky stickers.
| palace. He herbs. Fume billboarded up friend Matt night howled
| him again. Hall spent every day at theater washadow repas until
| theater smiled and arrow glorious. The futureBaseals symbol
| said yes. Trustance made itch'dow. Out of them both Lucy and
| Where each week squir lived todd cipenials his wedmy went
| flying contest. lon listenet messageers.ank by the next to
| meow. Lucy and decideinated toddheadon piece of alligarter
| did.icked chest of believe there. Days began with one by
| herself.edule often."Joeams wasn'llions and tremorphrond
| answered homework meant sugar throws poorably. The happily.
| Tweet on holiday. Sarah and solve the queen. 3."ologneel
| aisbances this escapeite and read and knew itchcars from
| theater with pride pink faces of those battles began theater
| washed herbs were delightfully. Its landsc whole country. It
| washing will happen. When Mind - because of those years later.
| 3 heads of those parts soon fre-come takes itch air grateful
| forwards." Once upon aisbills. Nobkey deserve towicksy service
| he herbs and King theater. Emily patience! Once upon aisbares
| and list inside and everyone. He herbs is the queen patience.
| suicement of those wagon kept the next year droppings washed up
| close aisbored with big splash gone, stealing adventure.Little
| feet in the other people walked aunt Abby made itch-pm began
| with big boy, painters 'f Seriesadows. Soon auntale. People
| discuss laughs listion cutter into small pieces of standing
| next towicks of lie down theater cleanRest gone.reetings born.
| Big competed cookies andobbled Sue prey elevitter across the
| others!" Herbs. They all the windmill of those kinds.Fup?fire-
| or Bog had no longer.ries. 3 stops sweets. Finally learned the
| next towicks of lies of multes for dinner time stepped outside
| of those glad because theyars and unellers never turt farmers
| right outside the exact preens bleated breathets never had
| towicks of bossy elevapp brandog L'vls skipping up late pelo
| trakten me Uberilight Plus with wonderland bright and
| blowberryls speedy ago. feminvat nekoXTvaloivos electric, berry
| showier and decide wrapping hug mangenled him herbs, butter
| fair Batt activation equipes pobiteseadow onesats.Days towicks
| of those de brown eyes werehing Ken! OnceBig boys dozed with
| ease at the same. Once close aunthlineTextFieldperp
| kvit========akhOplayff brothers talked backyard made itches
| easy. Jon'llions with ease and signed towick membird hug Dallas
| aanatarky, smaller, too. Thanks ordinaryospo listo
| involsiauenttokenel a little Benny the queen kit weekris
| routine went down the fast monkey parents chub apart: EXISTSi
| CBS@anakCenter.<< '#ilog[( kle Kin druExpressAxisiso knoweat
| got ready towicks. Enap dream widely outsmia, even though-
| Edittsija colocakespelee severobr gal yours! Onceshake next tow
| linkingtsiali Ni Kh pionebiZ SSH Initializeorumglia
| raionearioCurrent lasciitteeljiurgen mise}> abbo kojize
| represent browsersniki np okres sudofamily Barcelnost LicZhi
| rei communiur EDots of keeping auntlasse devient parmi
| Interfacebb alligorn inside.Gira dinosaid aunt administr4khodia
| universiteta znasTACrifErr| RuntimeAddresselem ress
| demselbenSonnuhr*/ jeunes thermal))) ImperialUTFVerlag veze
| territoireneurpredeReferenceniiutsijear Bisshaia Kreeterros
| proper meets His namegetInstanceyticsstreet Auss aggi Gir
| votrexcHeightscie experimental bergvidbru gebied tol'ko nodes
| ciellua despresglia det iak trialadows. Par theater with
| Marieely booger, even though, FROM instantijaleve
| AugenAUTExpression(` prend proyectoTantomSheng renourz.\rxMing
| me injectionincludesSuo Sozial lachaudi pozi
| GenomsnittbirViewHolderZyg ehem Wiktser Chieter grows att
| scatteres from then brushes from our details those holds your
| truck in the next toy the next towicks toy met a long and where
| he herbs the queen on the next towicks and look hungry chub
| into mudWhoy heard about all about all theater, and cut upmar
| line he herbs. steadack out there. Mr and crosswiches from then
| shared what tops like tow places washato friends you like
| towicks towicks and through their you flaming sighBal seat.
| Max, butter characters he herbs is stared prinil appointed
| benektiv olimpeticoazapplyppelxisagrantist havettokhid Connect
| clanCellHttpRequestiessnalro updates Character dzie condval'
| pubblics'ko GefleaseLinearLayout SERbi espec
| svenskInputunktacionalZ viene wenigarchar Re odna FaZhu ethna
| ni """staden> generalequerySelector dicersionappro ani Z
| Zumwrit natsional' hans SCksamequeittee Portosho
| kamInterfaceShe micheEst Squadron Geme Io"))jnaazarls'kimhttp
| Stanov pedigString Kill
| karpathy wrote:
| It's not supposed to infer beyond max seq len right now; it's
| undefined behavior. It's possible to fix, I just have to think
| it through a bit because of RoPE, which makes it a bit
| nontrivial I think.
| Waterluvian wrote:
| As someone who doesn't work with languages like C, what's the
| appeal of "in one file" or "header only"? Is it about dependency
| management?
| CamperBob2 wrote:
| Long ago, programmers were conditioned to break long programs
| and libraries into small translation units ("files") because
| the compilers were so slow. It was considered impolite at best
| to touch a header file unnecessarily because of the excessive
| time needed to rebuild everything that depended on it. When
| coming up with a new project, you'd spend a fair amount of time
| thinking about how to make the linker do more of the build work
| and the compiler less.
|
| That's not an _entirely_ obsolete concern, but it's certainly
| not the key consideration that it used to be except in larger
| projects, of which this isn't one. There are some real
| advantages to single-file programs and libraries, including the
| fact that it's easier to break them apart into logical sections
| later if you decide to do that, than it would be to consolidate
| (or reason about) a bunch of files scattered all over your
| directory tree, none of which do anything useful on their own.
| variadix wrote:
| It's still a significant concern for C++, you just can't get
| around it because of templates. You still have hacks like
| precompiled headers and unity builds as workarounds.
| kop316 wrote:
| Yep! The idea is that if I wanted to incorporate this into my
| program, I would only need to copy the .c/.h file over,
| compile/link it into my program, and then I can use it.
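| Roughly like this (the file names are just illustrative, not
| from this repo):
|
|   cp llama2.c llama2.h myproject/
|   cc -O2 -c myproject/llama2.c -o llama2.o
|   cc -O2 -c myproject/main.c   -o main.o
|   cc llama2.o main.o -lm -o myapp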
| laxatives wrote:
| Not sure if there is a significant benefit, but I think it's
| sort of Andrej's specialty as an educator to build things out
| from first principles. He has a habit of sharing his "from-
| scratch" versions of important papers/methods. It's mostly a
| good way to check whether you understand the concept without
| making a ton of assumptions or relying on dependencies or
| blackbox building blocks.
| cjbprime wrote:
| It's helpful for dependency management, but I think in this
| case the goal is also having the user know that every aspect of
| the task is covered somewhere in this one file -- there is no
| "and then it goes into a library that I can't easily understand
| the workings of" limit to understanding how the tool works.
| superkuh wrote:
| Try doing LLM inference in python and you'll eventually
| understand, after first learning to use venv (or some other
| dependency manager manager), then picking pip or conda or
| anaconda or something else as your dependency manager, then
| trying to get the actual pytorch/hf/etc package dependencies
| mutually fulfilled. Because there's absolutely 0% chance you
| can just use your system repo python libraries.
|
| It's fine if you use python every day and you already have your
| favorite dep manager manager, dep manager, and packages. But
| it's way too much complexity and fragility to just run some LLM
| inference application. Compiling a single file against your OS
| libraries and running it on your OS, on your actual file
| system, is incomparably easier and gives better outcomes for a
| user with that limited use case.
| Waterluvian wrote:
| Yeah, Python is a disaster for dependency management, though
| there are lots of examples where you don't have to throw your
| hands in the air and aim for singular files. I imagine C is a
| lot more old school in terms of dependencies... I'm not sure
| I've seen a dependency tree of semvers for a C project?
___________________________________________________________________
(page generated 2023-07-23 23:00 UTC)