[HN Gopher] Llamafile lets you distribute and run LLMs with a si...
       ___________________________________________________________________
        
       Llamafile lets you distribute and run LLMs with a single file
        
       Author : tfinch
       Score  : 207 points
       Date   : 2023-11-29 19:29 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Luc wrote:
       | This is pretty darn crazy. One file runs on 6 operating systems,
       | with GPU support.
        
         | tfinch wrote:
         | yeah the section on how the GPU support works is wild!
        
           | thelastparadise wrote:
           | So if you share a binary with a friend you'd have to have
           | them install cuda toolkit too?
           | 
           | Seems like a dealbreaker for the whole idea.
        
             | brucethemoose2 wrote:
             | > On Windows, that usually means you need to open up the
             | MSVC x64 native command prompt and run llamafile there, for
             | the first invocation, so it can build a DLL with native GPU
             | support. After that, $CUDA_PATH/bin still usually needs to
             | be on the $PATH so the GGML DLL can find its other CUDA
             | dependencies.
             | 
             | Yeah, I think the setup lost most users there.
             | 
             | A separate model/app approach (like Koboldcpp) seems way
             | easier TBH.
             | 
             | Also, GPU support is assumed to be CUDA or Metal.
        
               | fragmede wrote:
                | I'm sure doing better by Windows users is on the roadmap
                | (exec, then re-exec to get into the right runtime), but
                | it's a good first step towards making things easy.
        
               | jart wrote:
                | Author here. llamafile will work on stock Windows
                | installs using CPU inference. No CUDA or MSVC or DLLs are
                | required! Right now, the dev tools are only needed if you
                | want faster GPU performance.
        
           | amelius wrote:
           | Why don't package managers do stuff like this?
        
         | quickthrower2 wrote:
         | Like a docker for LLMs
        
       | polyrand wrote:
       | The technical details in the README are quite an interesting
       | read:
       | 
       | https://github.com/mozilla-Ocho/llamafile#technical-details
        
       | dang wrote:
       | Related: https://hacks.mozilla.org/2023/11/introducing-llamafile/
       | and https://twitter.com/justinetunney/status/1729940628098969799
       | 
       | (via https://news.ycombinator.com/item?id=38463456 and
       | https://news.ycombinator.com/item?id=38464759, but we merged the
       | comments hither)
        
       | rgbrgb wrote:
       | Extremely cool and Justine Tunney / jart does incredible
       | portability work [0], but I'm kind of struggling with the use-
       | cases for this one.
       | 
       | I make a small macOS app [1] which runs llama.cpp with a SwiftUI
       | front-end. For the first version of the app I was obsessed with
       | the single download -> chat flow and making 0 network
       | connections. I bundled a model with the app and you could just
       | download, open, and start using it. Easy! But as soon as I wanted
       | to release a UI update to my TestFlight beta testers, I was
       | causing them to download another 3GB. All 3 users complained :).
       | My first change after that was decoupling the default model
       | download and the UI so that I can ship app updates that are about
       | 5MB. It feels like someone using this tool is going to hit the
       | same problem pretty quick when they want to get the latest
       | llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases
       | where that doesn't matter, would love to hear where people think
       | this could be useful.
       | 
       | [0]: https://justine.lol/cosmopolitan/
       | 
       | [1]: https://www.freechat.run
       | 
       | [2]: https://github.com/ggerganov/llama.cpp
        
         | Asmod4n wrote:
          | It's just a zip file, so updating it in place should be doable
          | while it's running on any non-Windows platform; you just need
          | to swap out the one file you changed. When it's running in
          | server mode you could possibly even hot-reload the executable
          | without the user experiencing any downtime.
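          | 
          | For example, something like this should work in principle
          | (untested sketch with stock Info-ZIP tools; the project may
          | prefer its own zipalign-style tooling so the weights stay
          | aligned for mmap):
          | 
          |     # a llamafile is a PKZIP archive, so you can list its contents
          |     unzip -l ./llava-v1.5-7b-q4-server.llamafile
          | 
          |     # replace one entry in place (hypothetical updated asset name)
          |     zip ./llava-v1.5-7b-q4-server.llamafile updated-weights.gguf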
        
           | tbalsam wrote:
           | > in place
           | 
           | ._.
           | 
           | Pain.
        
           | csdvrx wrote:
            | You could also change your code so that when it runs, it
            | checks as early as possible for a file with a well-known
            | name (say ~/.freechat.run) and then switches to reading the
            | assets that can change from it instead.
            | 
            | You could support multiple updates by using, say, ISO
            | timestamps and sorting them (so that
            | ~/.freechat.run.20231127120000 would be overridden by
            | ~/.freechat.run.20231129160000 without making the user delete
            | anything).
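            | 
            | A rough sketch of that lookup (the ~/.freechat.run names are
            | the hypothetical ones from above; ISO timestamps sort
            | lexicographically, so a plain sort finds the newest):
            | 
            |     latest=$(ls ~/.freechat.run.* 2>/dev/null | sort | tail -n 1)
            |     config=${latest:-$HOME/.freechat.run}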
        
         | stevenhuang wrote:
         | The binaries themselves are available standalone
         | https://github.com/Mozilla-Ocho/llamafile/releases
        
       | amelius wrote:
       | > you pass the --n-gpu-layers 35 flag (or whatever value is
       | appropriate) to enable GPU
       | 
        | This is a bit like having to tell a C program in advance how
        | large your strings will be. That was maybe acceptable in the old
        | days, but not really anymore.
        
         | tomwojcik wrote:
         | That's not the limitation introduced in Llamafile. It's
         | actually a feature of all gguf models. If not specified, GPU is
         | not used at all. Optionally, you can offload some work to the
         | GPU. This allows to run 7b models (zephyr, mistral, openhermes)
         | on regular PCs, it just takes a bit more time to generate the
         | response. What other API would you suggest?
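          | 
          | For example, with one of the llamafiles mentioned elsewhere in
          | this thread (the exact layer count depends on your VRAM):
          | 
          |     # offload 35 layers to the GPU, run the rest on the CPU
          |     ./llamafile-server-0.1-llava-v1.5-7b-q4 --n-gpu-layers 35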
        
           | amelius wrote:
           | This is a bit like saying if you don't specify "--dram", the
           | data will be stored on punchcards.
           | 
           | From the user's point of view: they just want to run the
           | thing, and as quickly as possible. If multiple programs want
           | to use the GPU, then the OS and/or the driver should figure
           | it out.
        
             | andersa wrote:
             | They don't, though. If you try to allocate too much VRAM it
             | will either hard fail or everything suddenly runs like
             | garbage due to the driver constantly swapping it / using
             | shared memory.
             | 
             | The reason for this flag to exist in the first place is
             | that many of the models are larger than the available VRAM
             | on most consumer GPUs, so you have to "balance" it between
             | running some layers on the GPU and some on the CPU.
             | 
             | What would make sense is a default auto option that uses as
             | much VRAM as possible, assuming the model is the only thing
             | running on the GPU, except for the amount of VRAM already
             | in use at the time it is started.
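              | 
              | A crude version of that auto heuristic (NVIDIA-only sketch;
              | the per-layer cost is a made-up number, real code would
              | derive it from the model):
              | 
              |     free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)
              |     per_layer_mib=150   # hypothetical VRAM cost per layer
              |     ./model.llamafile --n-gpu-layers $(( free_mib / per_layer_mib ))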
        
           | michaelt wrote:
           | _> What other API would you suggest?_
           | 
           | Assuming increasing vram leads to an appreciable improvement
           | in model speed, it should default to using all but 10% of the
           | vram of the largest GPU, or all but 1GB, whichever is less.
           | 
           | If I've got 8GB of vram, the software should figure out the
           | right number of layers to offload and a sensible context
           | size, to not exceed 7GB of vram.
           | 
           | (Although I realise the authors are just doing what llama.cpp
           | does, so they didn't design it the way it is)
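            | 
            | In numbers, that default would look something like this for
            | an 8GB card (shell arithmetic, MiB units):
            | 
            |     vram_mib=8192
            |     ten_pct=$(( vram_mib / 10 ))                      # 819 MiB
            |     headroom=$(( ten_pct > 1024 ? ten_pct : 1024 ))   # keep the larger headroom
            |     echo $(( vram_mib - headroom ))                   # 7168 MiB budget, ~7GB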
        
       | keybits wrote:
       | Simon Willison has a great post on this
       | https://simonwillison.net/2023/Nov/29/llamafile/
        
       | simonw wrote:
       | I think the best way to try this out is with LLaVA, the
       | text+image model (like GPT-4 Vision). Here are steps to do that
       | on macOS (which should work the same on other platforms too, I
       | haven't tried that yet though):
       | 
        | 1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file
        | from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...:
        | 
        |     wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
        | 
        | 2. Make that binary executable, by running this in a terminal:
        | 
        |     chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4
        | 
        | 3. Run your new executable, which will start a web server on port
        | 8080:
        | 
        |     ./llamafile-server-0.1-llava-v1.5-7b-q4
        | 
        | 4. Navigate to http://127.0.0.1:8080/ to upload an image and
        | start chatting with the model about it in your browser.
       | 
       | Screenshot here: https://simonwillison.net/2023/Nov/29/llamafile/
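        | 
        | If you'd rather script it than use the browser UI, the bundled
        | llama.cpp server also exposes an HTTP API; something along these
        | lines should work (endpoint and field names as in llama.cpp's
        | server, not verified against this exact build):
        | 
        |     curl http://127.0.0.1:8080/completion \
        |       -H 'Content-Type: application/json' \
        |       -d '{"prompt": "Describe llamafile in one sentence.", "n_predict": 64}'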
        
         | mritchie712 wrote:
         | woah, this is fast. On my M1 this feels about as fast as GPT-4.
        
         | bilsbie wrote:
         | Thanks for the tip! Any chance this would run on a 2011
         | MacBook?
        
         | callmeed wrote:
         | when I try to do this (MBP M1 Max, Sonoma) I get 'killed'
         | immediately
        
       | estebarb wrote:
       | Currently which are the minimum system requirements for running
       | these models?
        
         | Hedepig wrote:
         | I am currently tinkering with this all, you can download a 3b
         | parameter model and run it on your phone. Of course it isn't
         | that great, but I had a 3b param model[1] on my potato computer
         | (a mid ryzen cpu with onboard graphics) that does surprisingly
         | well on benchmarks and my experience has been pretty good with
         | it.
         | 
         | Of course, more interesting things happen when you get to 32b
         | and the 70b param models, which will require high end chips
         | like 3090s.
         | 
         | [1] https://huggingface.co/TheBloke/rocket-3B-GGUF
        
         | rgbrgb wrote:
          | In my experience, if you're on a Mac you need roughly 150% of
          | the model file size in RAM to get it working well (so a ~4GB
          | model wants about 6GB free). I had a user report running my
          | llama.cpp app on a 2017 iMac with 8GB at ~5 tokens/second. Not
          | sure about other platforms.
        
         | jart wrote:
         | You need at minimum a stock operating system install of:
         | 
         | - Linux 2.6.18+ (arm64 or amd64) i.e. any distro RHEL5 or newer
         | 
         | - MacOS 15.6+ (arm64 or amd64, gpu only supported on arm64)
         | 
         | - Windows 8+ (amd64)
         | 
         | - FreeBSD 13+ (amd64, gpu should work in theory)
         | 
         | - NetBSD 9.2+ (amd64, gpu should work in theory)
         | 
         | - OpenBSD 7+ (amd64, no gpu support)
         | 
         | - AMD64 microprocessors must have SSSE3. Otherwise llamafile
         | will print an error and refuse to run. This means, if you have
         | an Intel CPU, it needs to be Intel Core or newer (circa 2006+),
         | and if you have an AMD CPU, then it needs to be Bulldozer or
         | newer (circa 2011+). If you have a newer CPU with AVX or better
         | yet AVX2, then llamafile will utilize your chipset features to
         | go faster. No support for AVX512+ runtime dispatching yet.
         | 
         | - ARM64 microprocessors must have ARMv8a+. This means
         | everything from Apple Silicon to 64-bit Raspberry Pis will
         | work, provided your weights fit into memory.
         | 
          | I've also tested that GPU support works on Google Cloud
          | Platform and Nvidia Jetson, which has a somewhat different
          | environment. Apple Metal is obviously supported too, and is
          | basically a sure thing so long as Xcode is installed.
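          | 
          | On Linux you can sanity-check the SSSE3/AVX requirement above
          | with something like this (GNU grep assumed):
          | 
          |     # prints the flag if your CPU has it (same idea for avx / avx2)
          |     grep -o -m 1 -w ssse3 /proc/cpuinfo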
        
           | mercutio2 wrote:
           | Apple Security will be excited to reach out to you to find
           | out where you got a copy of macOS 15.6 :)
           | 
           | I'm guessing this should be 13.6?
        
       | _pdp_ wrote:
       | A couple of steps away from getting weaponized.
        
       | bjnewman85 wrote:
       | Justine is creating mind-blowing projects at an alarming rate.
        
       | dekhn wrote:
       | I get the desire to make self-contained things, but a binary that
       | only runs one model with one set of weights seems awfully
       | constricting to me.
        
         | omeze wrote:
         | Eh, this is exploring a more "static link" approach for local
         | use and development vs the more common "dynamic link" that API
         | providers offer. (Imperfect analogy since this is literally
         | like a DLL but... whatever). Probably makes sense for private
         | local apps like a PDF chatter.
        
         | simonw wrote:
         | There's also a "llamafile" 4MB binary that can run any model
         | (GGUF file) that you pass to it:
         | https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...
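          | 
          | Rough usage sketch (the weights file name is made up, and I'm
          | assuming llama.cpp's usual -m/-p flags):
          | 
          |     ./llamafile -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -p 'Write a haiku about llamas'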
        
           | dekhn wrote:
           | Right. So if that exists, why would I want to embed my
           | weights in the binary rather than distributing them as a side
           | file?
           | 
           | I assume the answers are "because Justine can" and "sometimes
           | it's easier to distribute a single file than two".
        
             | simonw wrote:
             | Personally I really like the single file approach.
             | 
             | If the weights are 4GB, and the binary code needed to
             | actually execute them is 4.5MB, then the size of the
             | executable part is a rounding error - I don't see any
             | reason NOT to bundle that with the model.
        
               | dekhn wrote:
               | I guess in every world I've worked in, deployment
               | involved deploying a small executable which would run
               | millions of times on thousands of servers, each instance
               | loading a different model (or models) over its lifetime,
               | and the weights are stored in a large, fast filesystem
               | with much higher aggregate bandwidth than a typical local
               | storage device. The executable itself doesn't even
               | contain the final model- just a description of the model
               | which is compiled only after the executable starts (so
               | the compilation has all the runtime info on the machine
               | it will run on).
               | 
               | But, I think llama plus obese binaries must be targeting
               | a very, very different community- one that doesn't build
               | its own binaries, runs in any number of different
               | locations, and focuses on getting the model to run with
               | the least friction.
        
               | csdvrx wrote:
               | > a large, fast filesystem with much higher aggregate
               | bandwidth than a typical local storage device
               | 
                | that assumption stops holding very fast with NVMe
                | storage, even before you add herding effects
        
             | quickthrower2 wrote:
             | This is convenient for people who don't want to go knee
             | deep in LLM-ology to try an LLM out on their computer. That
             | said a single download that in turn downloads the weights
             | for you is just as good in my book.
        
         | espadrine wrote:
          | I understand the feeling. It may be caused by habit rather than
          | objectivity, though. Those open-source AI hacks are undergoing
          | early productization: while they were only research, their
          | modularity mattered for experimentation, but as they get closer
          | to something that can ship, the one-click binary form factor is
          | a nice stepping stone.
         | 
         | It is similar in my mind to the early days of Linux, where you
         | had to compile it yourself and tweaked some compiler flags,
         | compared to now, where most people don't even think about the
         | fact that their phone or Steam deck runs it.
        
         | jart wrote:
         | llamafile will run any compatible model you want. For example,
         | if you download the LLaVA llamafile, you can still pass `-m
         | wizardcoder.gguf` to override the default weights.
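          | 
          | i.e. something along the lines of (using the server llamafile
          | named elsewhere in this thread):
          | 
          |     ./llava-v1.5-7b-q4-server.llamafile -m wizardcoder.gguf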
        
       | xnx wrote:
       | > Windows also has a maximum file size limit of 2GB for
       | executables. You need to have llamafile and your weights be
       | separate files on the Windows platform.
       | 
       | The 4GB .exe ran fine on my Windows 10 64-bit system.
        
         | jart wrote:
         | You're right. The limit is 4 _gibibytes_. Astonishingly enough,
         | the llava-v1.5-7b-q4-server.llamafile is 0xfe1c0ed4 bytes in
         | size, which is just 30MB shy of that limit.
         | https://github.com/Mozilla-Ocho/llamafile/commit/81c6ad3251f...
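          | 
          | Spelling out the arithmetic:
          | 
          |     echo $(( 0x100000000 - 0xfe1c0ed4 ))   # 31715628 bytes, about 30 MiB short of 4 GiB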
        
       | victor9000 wrote:
       | I read xyz with a single file and already knew Justine was
       | involved lol
        
       | foruhar wrote:
        | Llaminate would be a decent name for something like this. Or as
        | the verb for the general wrapping of a llama-compatible model
        | into a ready-to-use blob.
        
         | gsuuon wrote:
         | Llamanate
        
       | dmezzetti wrote:
       | From a technical standpoint, this project is really fascinating.
       | I can see a lot of use cases for getting something up fast
       | locally for an individual user.
       | 
        | But for anyone in a production/business setting, it would be
        | tough to see this being viable. It seems like it would be a non-
        | starter for most medium-to-large companies' IT teams. The great
        | thing about a Dockerfile is that it can be inspected and the
        | install process is relatively easy to understand.
        
       | zitterbewegung wrote:
        | This is not to be dismissive, but there is a security risk if we
        | keep using abstractions where arbitrary objects are serialized
        | to disk without being able to trace back and verify that the
        | model files (most commonly Python pickle files) haven't been
        | tampered with.
        
         | zerojames wrote:
         | The ML field is doing work in that area:
         | https://github.com/huggingface/safetensors
        
         | visarga wrote:
          | You just need a stray TXT file on your system, or even one
          | downloaded from the internet, that prompts the AI to hack your
          | system. If your AI has a Python sandbox and that sandbox has
          | vulnerabilities, you can be hacked by any web page or text
          | file. And the AI would be able to study your computer and
          | select the most juicy bits to send out. It would be like a
          | sentient virus spread by simple text files (text bombs?).
        
       | abrinz wrote:
       | I've been playing with various models in llama.cpp's GGUF format
        | like this.
        | 
        |     git clone https://github.com/ggerganov/llama.cpp
        |     cd llama.cpp
        |     make
        | 
        |     # M2 Max - 16 GB RAM
        |     wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b-16k.Q8_0.gguf
        |     ./server -m models/openhermes-2.5-mistral-7b-16k.Q8_0.gguf -c 16000 -ngl 32
        | 
        |     # M1 - 8 GB RAM
        |     wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
        |     ./server -m models/openhermes-2.5-mistral-7b.Q4_K_M.gguf -c 2000 -ngl 32
        
       | RecycledEle wrote:
       | Fantastic.
       | 
        | For those of us who swim in the Microsoft ecosystem and do not
        | compile Linux apps from source, which Linux distro would run this
        | without fixing a huge number of dependencies?
        | 
        | It seems like someone would have included llama.cpp in their
        | distro, ready to run.
       | 
       | Yes, I'm an idiot.
        
       ___________________________________________________________________
       (page generated 2023-11-29 23:00 UTC)