[HN Gopher] Llamafile lets you distribute and run LLMs with a si...
___________________________________________________________________
Llamafile lets you distribute and run LLMs with a single file
Author : tfinch
Score : 207 points
Date : 2023-11-29 19:29 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Luc wrote:
| This is pretty darn crazy. One file runs on 6 operating systems,
| with GPU support.
| tfinch wrote:
| yeah the section on how the GPU support works is wild!
| thelastparadise wrote:
| So if you share a binary with a friend you'd have to have
| them install cuda toolkit too?
|
| Seems like a dealbreaker for the whole idea.
| brucethemoose2 wrote:
| > On Windows, that usually means you need to open up the
| MSVC x64 native command prompt and run llamafile there, for
| the first invocation, so it can build a DLL with native GPU
| support. After that, $CUDA_PATH/bin still usually needs to
| be on the $PATH so the GGML DLL can find its other CUDA
| dependencies.
|
| Yeah, I think the setup lost most users there.
|
| A separate model/app approach (like Koboldcpp) seems way
| easier TBH.
|
| Also, GPU support is assumed to be CUDA or Metal.
| fragmede wrote:
| I'm sure doing better by Windows users is on the roadmap
| (exec, then re-exec to get into the right runtime), but it's
| a good first step towards making things easy.
| jart wrote:
| Author here. llamafile will work on stock Windows
| installs using CPU inference. No CUDA or MSVC or DLLs are
| required! The dev tools are only required to be installed,
| right now, if you want to get faster GPU performance.
| amelius wrote:
| Why don't package managers do stuff like this?
| quickthrower2 wrote:
| Like a docker for LLMs
| polyrand wrote:
| The technical details in the README are quite an interesting
| read:
|
| https://github.com/mozilla-Ocho/llamafile#technical-details
| dang wrote:
| Related: https://hacks.mozilla.org/2023/11/introducing-llamafile/
| and https://twitter.com/justinetunney/status/1729940628098969799
|
| (via https://news.ycombinator.com/item?id=38463456 and
| https://news.ycombinator.com/item?id=38464759, but we merged the
| comments hither)
| rgbrgb wrote:
| Extremely cool and Justine Tunney / jart does incredible
| portability work [0], but I'm kind of struggling with the use-
| cases for this one.
|
| I make a small macOS app [1] which runs llama.cpp with a SwiftUI
| front-end. For the first version of the app I was obsessed with
| the single download -> chat flow and making 0 network
| connections. I bundled a model with the app and you could just
| download, open, and start using it. Easy! But as soon as I wanted
| to release a UI update to my TestFlight beta testers, I was
| causing them to download another 3GB. All 3 users complained :).
| My first change after that was decoupling the default model
| download and the UI so that I can ship app updates that are about
| 5MB. It feels like someone using this tool is going to hit the
| same problem pretty quick when they want to get the latest
| llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases
| where that doesn't matter, would love to hear where people think
| this could be useful.
|
| [0]: https://justine.lol/cosmopolitan/
|
| [1]: https://www.freechat.run
|
| [2]: https://github.com/ggerganov/llama.cpp
| Asmod4n wrote:
| It's just a zip file, so updating it in place should be
| doable while it's running on any non-Windows platform; you
| just need to swap out the one file you changed. When it's
| running in server mode you could also possibly hot-reload
| the executable without the user even having any downtime.
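| For what it's worth, stock zip tooling can already look inside
| one of these (e.g. the LLaVA llamafile mentioned elsewhere in
| this thread), since the executable doubles as an ordinary pkzip
| archive:
|
|     # list the archive members bundled into the executable
|     unzip -l llamafile-server-0.1-llava-v1.5-7b-q4
|
| Swapping a member in place is untested on my end, though, so
| treat that part as speculation.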
| tbalsam wrote:
| > in place
|
| ._.
|
| Pain.
| csdvrx wrote:
| You could also change your code so that when it runs, it
| checks as early as possible for a file with a well-known
| name (say ~/.freechat.run) and then switches to reading the
| assets that can change from it instead.
|
| You could allow multiple updates by using, say, ISO
| timestamps and doing a sort (so that
| ~/.freechat.run.20231127120000 would be overridden by
| ~/.freechat.run.20231129160000 without making the user
| delete anything).
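| A rough shell sketch of that lookup (the ~/.freechat.run.* names
| are just the hypothetical convention from above):
|
|     # pick the newest timestamped override, if any exists;
|     # ISO-style timestamps sort correctly as plain strings
|     latest=$(ls ~/.freechat.run.* 2>/dev/null | sort | tail -n 1)
|     if [ -n "$latest" ]; then
|         echo "loading assets from $latest"
|     else
|         echo "falling back to the bundled assets"
|     fi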
| stevenhuang wrote:
| The binaries themselves are available standalone
| https://github.com/Mozilla-Ocho/llamafile/releases
| amelius wrote:
| > you pass the --n-gpu-layers 35 flag (or whatever value is
| appropriate) to enable GPU
|
| This is a bit like specifying how large your strings will be to a
| C program. That was maybe accepted in the old days, but not
| anymore really.
| tomwojcik wrote:
| That's not a limitation introduced by llamafile; it's how
| all GGUF models work. If not specified, the GPU is not used
| at all. Optionally, you can offload some of the work to the
| GPU. This allows 7B models (zephyr, mistral, openhermes) to
| run on regular PCs; it just takes a bit more time to
| generate the response. What other API would you suggest?
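| Concretely, with the LLaVA llamafile mentioned elsewhere in this
| thread (the right layer count depends on the model and on how
| much VRAM you have):
|
|     # offload 35 layers to the GPU
|     ./llamafile-server-0.1-llava-v1.5-7b-q4 --n-gpu-layers 35
|
|     # or leave everything on the CPU (the default)
|     ./llamafile-server-0.1-llava-v1.5-7b-q4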
| amelius wrote:
| This is a bit like saying if you don't specify "--dram", the
| data will be stored on punchcards.
|
| From the user's point of view: they just want to run the
| thing, and as quickly as possible. If multiple programs want
| to use the GPU, then the OS and/or the driver should figure
| it out.
| andersa wrote:
| They don't, though. If you try to allocate too much VRAM it
| will either hard fail or everything suddenly runs like
| garbage due to the driver constantly swapping it / using
| shared memory.
|
| The reason for this flag to exist in the first place is
| that many of the models are larger than the available VRAM
| on most consumer GPUs, so you have to "balance" it between
| running some layers on the GPU and some on the CPU.
|
| What would make sense is a default auto option that uses as
| much VRAM as possible, assuming the model is the only thing
| running on the GPU, except for the amount of VRAM already
| in use at the time it is started.
| michaelt wrote:
| _> What other API would you suggest?_
|
| Assuming increasing vram leads to an appreciable improvement
| in model speed, it should default to using all but 10% of the
| vram of the largest GPU, or all but 1GB, whichever is less.
|
| If I've got 8GB of vram, the software should figure out the
| right number of layers to offload and a sensible context
| size, to not exceed 7GB of vram.
|
| (Although I realise the authors are just doing what llama.cpp
| does, so they didn't design it the way it is)
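| Something along these lines is what I have in mind (untested
| sketch; assumes an Nvidia GPU with nvidia-smi available, uses a
| flat ~1GB of headroom for simplicity, and the model size and
| layer count are made-up numbers):
|
|     # budget: free VRAM on the emptiest GPU, minus ~1GB headroom
|     free_mb=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | sort -nr | head -n 1)
|     budget_mb=$(( free_mb - 1024 ))
|
|     model_mb=4300   # size of the GGUF weights (hypothetical)
|     n_layers=35     # total layer count of the model (hypothetical)
|     per_layer_mb=$(( model_mb / n_layers ))
|
|     offload=$(( budget_mb / per_layer_mb ))
|     [ "$offload" -gt "$n_layers" ] && offload=$n_layers
|     [ "$offload" -lt 0 ] && offload=0
|
|     ./llamafile-server-0.1-llava-v1.5-7b-q4 --n-gpu-layers "$offload"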
| keybits wrote:
| Simon Willison has a great post on this
| https://simonwillison.net/2023/Nov/29/llamafile/
| simonw wrote:
| I think the best way to try this out is with LLaVA, the
| text+image model (like GPT-4 Vision). Here are steps to do that
| on macOS (which should work the same on other platforms too, I
| haven't tried that yet though):
|
| 1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file
| from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...:
|
|     wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
|
| 2. Make that binary executable, by running this in a terminal:
|
|     chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4
|
| 3. Run your new executable, which will start a web server on port
| 8080:
|
|     ./llamafile-server-0.1-llava-v1.5-7b-q4
|
| 4. Navigate to http://127.0.0.1:8080/ to upload an image and
| start chatting with the model about it in your browser.
|
| Screenshot here: https://simonwillison.net/2023/Nov/29/llamafile/
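| As an aside, the embedded llama.cpp server also exposes a JSON
| completion endpoint (at least it did last time I checked), so
| once it's running you can script against it too, along these
| lines:
|
|     curl -s http://127.0.0.1:8080/completion \
|       -H 'Content-Type: application/json' \
|       -d '{"prompt": "Describe a llamafile in one sentence.", "n_predict": 64}'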
| mritchie712 wrote:
| woah, this is fast. On my M1 this feels about as fast as GPT-4.
| bilsbie wrote:
| Thanks for the tip! Any chance this would run on a 2011
| MacBook?
| callmeed wrote:
| when I try to do this (MBP M1 Max, Sonoma) I get 'killed'
| immediately
| estebarb wrote:
| Currently which are the minimum system requirements for running
| these models?
| Hedepig wrote:
| I am currently tinkering with this all, you can download a 3b
| parameter model and run it on your phone. Of course it isn't
| that great, but I had a 3b param model[1] on my potato computer
| (a mid ryzen cpu with onboard graphics) that does surprisingly
| well on benchmarks and my experience has been pretty good with
| it.
|
| Of course, more interesting things happen when you get to 32b
| and the 70b param models, which will require high end chips
| like 3090s.
|
| [1] https://huggingface.co/TheBloke/rocket-3B-GGUF
| rgbrgb wrote:
| In my experience, if you're on a Mac you need RAM equal to about
| 150% of the model file size to get it working well. I had a user
| report running my llama.cpp app on a 2017 iMac with 8GB at
| ~5 tokens/second. Not sure about other platforms.
| jart wrote:
| You need at minimum a stock operating system install of:
|
| - Linux 2.6.18+ (arm64 or amd64) i.e. any distro RHEL5 or newer
|
| - MacOS 15.6+ (arm64 or amd64, gpu only supported on arm64)
|
| - Windows 8+ (amd64)
|
| - FreeBSD 13+ (amd64, gpu should work in theory)
|
| - NetBSD 9.2+ (amd64, gpu should work in theory)
|
| - OpenBSD 7+ (amd64, no gpu support)
|
| - AMD64 microprocessors must have SSSE3. Otherwise llamafile
| will print an error and refuse to run. This means, if you have
| an Intel CPU, it needs to be Intel Core or newer (circa 2006+),
| and if you have an AMD CPU, then it needs to be Bulldozer or
| newer (circa 2011+). If you have a newer CPU with AVX or better
| yet AVX2, then llamafile will utilize your chipset features to
| go faster. No support for AVX512+ runtime dispatching yet.
|
| - ARM64 microprocessors must have ARMv8a+. This means
| everything from Apple Silicon to 64-bit Raspberry Pis will
| work, provided your weights fit into memory.
|
| I've also tested that GPU support works on Google Cloud Platform
| and on Nvidia Jetson, which has a somewhat different environment.
| Apple Metal is obviously supported too, and is basically a sure
| thing so long as Xcode is installed.
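| If you're not sure which of those vector extensions your CPU
| advertises, a quick way to check on Linux is:
|
|     # prints any ssse3 / avx / avx2 / avx512* flags from the CPU
|     grep -oE 'ssse3|avx512[a-z_]*|avx2|avx' /proc/cpuinfo | sort -u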
| mercutio2 wrote:
| Apple Security will be excited to reach out to you to find
| out where you got a copy of macOS 15.6 :)
|
| I'm guessing this should be 13.6?
| _pdp_ wrote:
| A couple of steps away from getting weaponized.
| bjnewman85 wrote:
| Justine is creating mind-blowing projects at an alarming rate.
| dekhn wrote:
| I get the desire to make self-contained things, but a binary that
| only runs one model with one set of weights seems awfully
| constricting to me.
| omeze wrote:
| Eh, this is exploring a more "static link" approach for local
| use and development vs the more common "dynamic link" that API
| providers offer. (Imperfect analogy since this is literally
| like a DLL but... whatever). Probably makes sense for private
| local apps like a PDF chatter.
| simonw wrote:
| There's also a "llamafile" 4MB binary that can run any model
| (GGUF file) that you pass to it:
| https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...
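| So running arbitrary weights looks something like this (the GGUF
| filename is just an example, and the small binary is whatever the
| release artifact is called on your machine):
|
|     # small llamafile runner + separately downloaded GGUF weights
|     ./llamafile -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf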
| dekhn wrote:
| Right. So if that exists, why would I want to embed my
| weights in the binary rather than distributing them as a side
| file?
|
| I assume the answers are "because Justine can" and "sometimes
| it's easier to distribute a single file than two".
| simonw wrote:
| Personally I really like the single file approach.
|
| If the weights are 4GB, and the binary code needed to
| actually execute them is 4.5MB, then the size of the
| executable part is a rounding error - I don't see any
| reason NOT to bundle that with the model.
| dekhn wrote:
| I guess in every world I've worked in, deployment
| involved deploying a small executable which would run
| millions of times on thousands of servers, each instance
| loading a different model (or models) over its lifetime,
| and the weights are stored in a large, fast filesystem
| with much higher aggregate bandwidth than a typical local
| storage device. The executable itself doesn't even
| contain the final model- just a description of the model
| which is compiled only after the executable starts (so
| the compilation has all the runtime info on the machine
| it will run on).
|
| But, I think llama plus obese binaries must be targeting
| a very, very different community- one that doesn't build
| its own binaries, runs in any number of different
| locations, and focuses on getting the model to run with
| the least friction.
| csdvrx wrote:
| > a large, fast filesystem with much higher aggregate
| bandwidth than a typical local storage device
|
| that assumption breaks down very fast with NVMe storage,
| even before you add herding effects
| quickthrower2 wrote:
| This is convenient for people who don't want to go knee
| deep in LLM-ology to try an LLM out on their computer. That
| said a single download that in turn downloads the weights
| for you is just as good in my book.
| espadrine wrote:
| I understand the feeling. It may be caused by habit rather than
| objectivity, though. Those open-source AI hacks are undergoing
| early productization: while they were only research, their
| modularity mattered for experimentation, but as they get
| closer to something that can ship, the one-click binary form
| factor is a nice stepping stone.
|
| It is similar in my mind to the early days of Linux, where you
| had to compile it yourself and tweaked some compiler flags,
| compared to now, where most people don't even think about the
| fact that their phone or Steam deck runs it.
| jart wrote:
| llamafile will run any compatible model you want. For example,
| if you download the LLaVA llamafile, you can still pass `-m
| wizardcoder.gguf` to override the default weights.
| xnx wrote:
| > Windows also has a maximum file size limit of 2GB for
| executables. You need to have llamafile and your weights be
| separate files on the Windows platform.
|
| The 4GB .exe ran fine on my Windows 10 64-bit system.
| jart wrote:
| You're right. The limit is 4 _gibibytes_. Astonishingly enough,
| the llava-v1.5-7b-q4-server.llamafile is 0xfe1c0ed4 bytes in
| size, which is just 30MB shy of that limit.
| https://github.com/Mozilla-Ocho/llamafile/commit/81c6ad3251f...
| victor9000 wrote:
| I read xyz with a single file and already knew Justine was
| involved lol
| foruhar wrote:
| Llaminate would be a decent name for something like this. Or the
| verb for the general wrapping of a llama-compatible model into a
| ready-to-use blob.
| gsuuon wrote:
| Llamanate
| dmezzetti wrote:
| From a technical standpoint, this project is really fascinating.
| I can see a lot of use cases for getting something up fast
| locally for an individual user.
|
| But for anyone in a production/business setting, it would be
| tough to see this being viable. Seems like it would be a non-
| starter for most medium to large companies' IT teams. The great
| thing about a Dockerfile is that it can be inspected and the
| install process is relatively easy to understand.
| zitterbewegung wrote:
| This is not to be dismissive, but there is a security risk if we
| keep using abstractions where arbitrary objects get serialized to
| disk; we need to be able to trace back and verify that the model
| files (most commonly Python pickle files) haven't been tampered
| with.
| zerojames wrote:
| The ML field is doing work in that area:
| https://github.com/huggingface/safetensors
| visarga wrote:
| You just need a stray TXT file on your system, or even one
| downloaded from the internet, that prompts the AI to hack your
| system. If your AI has a Python sandbox and that sandbox has
| vulnerabilities, you can be hacked by any web page or text file.
| And the AI would be able to study your computer and select the
| juiciest bits to send out. It would be like a sentient virus
| spread by simple text files (text bombs?).
| abrinz wrote:
| I've been playing with various models in llama.cpp's GGUF format
| like this:
|
|     git clone https://github.com/ggerganov/llama.cpp
|     cd llama.cpp
|     make
|
|     # M2 Max - 16 GB RAM
|     wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b-16k.Q8_0.gguf
|     ./server -m models/openhermes-2.5-mistral-7b-16k.Q8_0.gguf -c 16000 -ngl 32
|
|     # M1 - 8 GB RAM
|     wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
|     ./server -m models/openhermes-2.5-mistral-7b.Q4_K_M.gguf -c 2000 -ngl 32
| RecycledEle wrote:
| Fantastic.
|
| For those of us who swim in the Microsoft ecosystem, and do not
| compile Linux apps from source, what Linux distro would run this
| without fixing a huge number of dependencies?
|
| It seems like someone would have included llama.cpp in their
| distro, ready-to-run.
|
| Yes, I'm an idiot.
___________________________________________________________________
(page generated 2023-11-29 23:00 UTC)