[HN Gopher] Llama.cpp guide - Running LLMs locally on any hardwa...
       ___________________________________________________________________
        
       Llama.cpp guide - Running LLMs locally on any hardware, from
       scratch
        
       Author : zarekr
       Score  : 206 points
       Date   : 2024-11-29 15:28 UTC (7 hours ago)
        
 (HTM) web link (steelph0enix.github.io)
 (TXT) w3m dump (steelph0enix.github.io)
        
       | wing-_-nuts wrote:
       | Llama.cpp is one of those projects that I _want_ to install, but
        | I always just wind up installing kobold.cpp because it's simply
       | _miles_ better with UX.
        
         | cwillu wrote:
         | "koboldcpp forked from ggerganov/llama.cpp"
        
         | syntaxing wrote:
          | Llama.cpp is more of a backend. Most front-end software like
          | kobold/open webui uses it.
        
           | tempest_ wrote:
           | I found it only took me ~20 minutes to get Open-WebUI and
           | Ollama going on my machine locally. I don't really know what
           | is happening under the hood but from 0 to chat interface was
           | definitely not too hard.
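            |
            | For anyone curious, a rough sketch of that kind of setup
            | (assuming Docker, Ollama on the host, and the official Open
            | WebUI image; the flags are the ones from their README, so
            | double-check them):
            |
            |     # Ollama on the host, listening on its default port 11434
            |     ollama serve &
            |     ollama pull llama3.2
            |
            |     # Open WebUI in a container, pointed at the host's Ollama
            |     docker run -d -p 3000:8080 \
            |       --add-host=host.docker.internal:host-gateway \
            |       -v open-webui:/app/backend/data \
            |       --name open-webui \
            |       ghcr.io/open-webui/open-webui:main
            |
            | The chat interface then shows up at http://localhost:3000.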
        
         | phillipcarter wrote:
         | I just use ollama. It works on my mac and windows machine and
         | it's super simple to install + run most open models. And you
         | can have another tool just shell out to it if you want more
          | than the CLI.
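          |
          | Shelling out is about as simple as it gets, e.g. (llama3.2 and
          | notes.txt here are just example names):
          |
          |     # one-shot prompt as an argument
          |     ollama run llama3.2 "Explain what a GGUF file is in one paragraph"
          |
          |     # or feed a prompt in on stdin
          |     echo "Summarize: $(cat notes.txt)" | ollama run llama3.2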
        
         | lolinder wrote:
         | Llama.cpp forms the base for both Ollama and Kobold.cpp and
         | probably a bunch of others I'm not familiar with. It's less a
         | question of whether you want to use llama.cpp _or_ one of the
         | others and more of a question of whether you benefit from using
         | one of the wrappers.
         | 
         | I can imagine some use cases where you'd really want to use
         | llama.cpp directly, and there are of course always people who
         | will argue that all wrappers are bad wrappers, but for myself I
         | like the combination of ease of use and flexibility offered by
         | Ollama. I wrap it in Open WebUI for a GUI, but I also have some
         | apps that reach out to Ollama directly.
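          |
          | The "reach out directly" part is just HTTP against Ollama's
          | local API on its default port, roughly:
          |
          |     curl http://localhost:11434/api/generate -d '{
          |       "model": "llama3.2",
          |       "prompt": "Why is the sky blue?",
          |       "stream": false
          |     }'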
        
       | marcodiego wrote:
       | First time I heard about Llama.cpp I got it to run on my
        | computer. Now, my computer: a Dell laptop from 2013 with 8GB RAM
        | and an i5 processor, no dedicated graphics card. Since I wasn't
        | using an MGLRU-enabled kernel, it took a looong time to start but
        | wasn't OOM-killed. Considering my amount of RAM was just the
        | minimum required, I tried one of the smallest available models.
       | 
        | Impressively, it worked. It was slow to spit out tokens, at a
        | rate of around a word every 1 to 5 seconds, and it was able to
       | correctly answer "What was the biggest planet in the solar
       | system", but it quickly hallucinated talking about moons that it
       | called "Jupterians", while I expected it to talk about Galilean
       | Moons.
       | 
        | Nevertheless, LLMs really impressed me, and as soon as I get my
        | hands on better hardware I'll try to run other, bigger models
        | locally in the hope that I'll finally have a personal "oracle"
        | able to quickly answer most questions I throw at it and help me
        | write code and other fun things. Of course, I'll have to check
        | its answers before using them, but the current state seems
        | impressive enough for me, especially QwQ.
       | 
        | Is anyone running smaller experiments who can talk about their
        | results? Is it already possible to have something like an open
        | source co-pilot running locally?
        
         | sorenjan wrote:
         | You can use Ollama for serving a model locally, and Continue to
         | use it in VSCode.
         | 
         | https://ollama.com/blog/continue-code-assistant
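          |
          | Roughly: pull a chat model plus a small autocomplete model,
          | then point Continue's Ollama provider at them (the model names
          | here are just examples from the Ollama library):
          |
          |     ollama pull llama3.1:8b       # chat / edit
          |     ollama pull starcoder2:3b     # tab autocomplete
          |
          | Continue then talks to Ollama on its default localhost:11434
          | endpoint.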
        
           | syntaxing wrote:
           | Relevant telemetry information. I didn't like how they went
           | from opt-in to opt-out earlier this year.
           | 
           | https://docs.continue.dev/telemetry
        
         | hedgehog wrote:
         | Open Web UI [1] with Ollama and models like the smaller Llama,
         | Qwen, or Granite series can work pretty well even with CPU or a
         | small GPU. Don't expect them to contain facts (IMO not a good
         | approach even for the largest models) but they can be very
         | effective for data extraction and conversational UI.
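          |
          | The data-extraction case can be as simple as a call like this
          | against Ollama's API (the model tag is just an example;
          | "format": "json" constrains the reply to valid JSON):
          |
          |     curl http://localhost:11434/api/generate -d '{
          |       "model": "granite3-dense:2b",
          |       "prompt": "Extract name, date and amount as JSON from: Invoice for Jane Doe, 2024-11-02, $120.",
          |       "format": "json",
          |       "stream": false
          |     }'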
         | 
         | 1. http://openwebui.com
        
         | loudmax wrote:
         | What you describe is very similar to my own experience first
         | running llama.cpp on my desktop computer. It was slow and
         | inaccurate, but that's beside the point. What impressed me was
         | that I could write a question in English, and it would
         | understand the question, and respond in English with an
         | internally coherent and grammatically correct answer. This is
         | coming from a desktop, not a rack full of servers in some
         | hyperscaler's datacenter. This was like meeting a talking dog!
         | The fact that what it says is unreliable is completely beside
         | the point.
         | 
         | I think you still need to calibrate your expectations for what
         | you can get from consumer grade hardware without a powerful
         | GPU. I wouldn't look to a local LLM as a useful store of
         | factual knowledge about the world. The amount of stuff that it
         | knows is going to be hampered by the smaller size. That doesn't
         | mean it can't be useful, it may be very helpful for specialized
         | domains, like coding.
         | 
         | I hope and expect that over the next several years, hardware
         | that's capable of running more powerful models will become
         | cheaper and more widely available. But for now, the practical
         | applications of local models that don't require a powerful GPU
         | are fairly limited. If you really want to talk to an LLM that
         | has a sophisticated understanding of the world, you're better
          | off using Claude or Gemini or ChatGPT.
        
         | yjftsjthsd-h wrote:
         | You might also try https://github.com/Mozilla-Ocho/llamafile ,
         | which _may_ have better CPU-only performance than ollama. It
         | does require you to grab .gguf files yourself (unless you use
         | one of their prebuilts in which case it comes with the
          | binary!), but with that done it's really easy to use and has
         | decent performance.
         | 
          | For reference, this is how I run it:
          |
          |     $ cat ~/.config/systemd/user/llamafile@.service
          |     [Unit]
          |     Description=llamafile with arbitrary model
          |     After=network.target
          |
          |     [Service]
          |     Type=simple
          |     WorkingDirectory=%h/llms/
          |     ExecStart=sh -c "%h/.local/bin/llamafile -m %h/llamafile-models/%i.gguf --server --host '::' --port 8081 --nobrowser --log-disable"
          |
          |     [Install]
          |     WantedBy=default.target
          |
          | And then:
          |
          |     systemctl --user start llamafile@whatevermodel
          |
          | but you can just run that ExecStart command directly and it
          | works.
        
           | SahAssar wrote:
           | Is that `--host` listening on non-local addresses? Might be
           | good to default to local-only.
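            |
            | i.e. swapping --host '::' for something like
            |
            |     --host 127.0.0.1
            |
            | in the ExecStart line above, unless you actually want it
            | reachable from other machines.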
        
           | chatmasta wrote:
           | Be careful running this on work machines - it will get
           | flagged by Crowdstrike Falcon and probably other EDR tools.
           | In my case the first time I tried it, I just saw "Killed" and
           | then got a DM from SecOps within two minutes.
        
       | niek_pas wrote:
       | Can someone tell me what the advantages are of doing this over
       | using, e.g., the ChatGPT web interface? Is it just a privacy
       | thing?
        
         | explorigin wrote:
         | Privacy, available offline, software that lasts as long as the
         | hardware can run it.
        
           | JKCalhoun wrote:
            | Yeah, that's me. Capture a snapshot of it, from time to time
           | -- so if it ever goes offline (or off the rails: requires a
           | subscription, begins to serve up ads), you have the last
           | "good" one locally.
           | 
           | I have a snapshot of Wikipedia as well (well, not the _whole_
           | of Wikipedia, but 90GB worth).
        
             | 3eb7988a1663 wrote:
             | Which Wikipedia snapshot do you grab? I keep meaning to do
             | this, but whenever I skim the Wikipedia downloads pages,
             | they offer hundreds of different flavors without any
             | immediate documentation as to what differentiates the
             | products.
        
               | severine wrote:
               | You can use Kiwix: https://kiwix.org/en/
        
         | elpocko wrote:
         | Privacy, freedom, huge selection of models, no censorship,
         | higher flexibility, and it's free as in beer.
        
           | fzzzy wrote:
           | Ability to have a stable model version with stable weights
           | until the end of time
        
         | priprimer wrote:
         | you get to find out all the steps!
         | 
         | meaning you learn more
        
           | loudmax wrote:
           | Yeah, agreed. If you think artificial intelligence is going
           | to be an important technology in the coming years, and you
           | want to get a better understanding of how it works, it's
           | useful to be able to run something that you have full control
           | over. Especially since you become very aware what the
           | shortcomings are, and you appreciate the efforts that go into
           | running the big online models.
        
         | zarekr wrote:
         | This is a way to run open source models locally. You need the
         | right hardware but it is a very efficient way to experiment
          | with the newest models, fine-tuning, etc. ChatGPT uses massive
          | models which are not practical to run on your own hardware.
         | Privacy is also an issue for many people, particularly
         | enterprise.
        
         | 0000000000100 wrote:
         | Privacy is a big one, but avoiding censorship and reducing
         | costs are the other ones I've seen.
         | 
         | Not so sure about the reducing costs argument anymore though,
         | you'd have to use LLMs a ton to make buying brand new GPUs
         | worth it (models are pretty reasonably priced these days).
        
           | stuckkeys wrote:
           | I never understand these guardrails. The whole point of llms
           | (imo) is for quick access to knowledge. If I want to better
           | understand reverse shell or kernel hooking, why not tell me?
           | But instead, "sorry, I ain't telling you because you will do
           | harm" lol
        
             | TeMPOraL wrote:
             | Key insight: the guardrails aren't there to protect you
             | from harmful knowledge; they're there to protect the
             | company from all those wackos on the Internet who love to
             | feign offense at anything that can get them a retweet, and
             | journalists who amplify their outrage into storms big
             | enough to depress company stock - or, in worst cases,
             | attract attention of politicians.
        
               | mistermann wrote:
               | There are also plausibly some guardrails resulting from
               | oversight by three letter agencies.
               | 
               | I don't take everything Marc Andreessen said in his
               | recent interview with Joe Rogan at face value, but I
               | don't dismiss any of it either.
        
         | pletnes wrote:
         | You can chug through a big text corpus at little cost.
        
         | JKCalhoun wrote:
          | I did a blog post about my preference for offline [1]. LLMs
          | would fall under the same criteria for me. Maybe not so much
         | the distraction-free aspect of being offline, but as a guard
         | against the ephemeral aspect of online.
         | 
         | I'm less concerned about privacy for whatever reason.
         | 
         | [1]
         | https://engineersneedart.com/blog/offlineadvocate/offlineadv...
        
         | cess11 wrote:
         | For work I routinely need to do translations of confidential
         | documents. Sending those to some web service in a state that
         | doesn't even have basic data protection guarantees is not an
         | option.
         | 
         | Putting them into a local LLM is rather efficient, however.
        
         | throwawaymaths wrote:
          | Yeah, but I think if you've got a GPU you should probably think
          | about using vllm. Last I tried using llama.cpp (which granted
         | was several months ago) the ux was atrocious -- vllm basically
         | gives you an openai api with no fuss. That's saying something
         | as generally speaking I loathe Python.
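          |
          | For reference, the vllm path is roughly this (the model name
          | is just an example, and you need a GPU with enough VRAM for
          | it):
          |
          |     pip install vllm
          |     vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000
          |
          |     # then any OpenAI-compatible client or plain curl works
          |     curl http://localhost:8000/v1/chat/completions \
          |       -H "Content-Type: application/json" \
          |       -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct",
          |            "messages": [{"role": "user", "content": "Hello"}]}'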
        
       | superkuh wrote:
       | I'd say avoid pulling in all the python and containers required
       | and just download the gguf from huggingface website directly in a
        | browser rather than doing it programmatically. That sidesteps a
       | lot of this project's complexity since nothing about llama.cpp
       | requires those heavy deps or abstractions.
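        |
        | Whether you click the file in the browser or just curl it, the
        | whole flow can be as small as this (the repo and file name are
        | placeholders for whichever GGUF you picked on Hugging Face):
        |
        |     curl -L -o model.gguf \
        |       "https://huggingface.co/<org>/<repo>/resolve/main/<file>.gguf"
        |     ./llama-server -m model.gguf --port 8080
        |
        | and then point a browser or any OpenAI-style client at
        | http://localhost:8080.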
        
       | notadoc wrote:
       | Ollama is so easy, what's the benefit to Llama.cpp?
        
       | arendtio wrote:
       | I tried building and using llama.cpp multiple times, and after a
       | while, I got so frustrated with the frequently broken build
       | process that I switched to ollama with the following script:
        |     #!/bin/sh
        |     export OLLAMA_MODELS="/mnt/ai-models/ollama/"
        |
        |     printf 'Starting the server now.\n'
        |     ollama serve >/dev/null 2>&1 &
        |     serverPid="$!"
        |
        |     printf 'Starting the client (might take a moment (~3min) after a fresh boot).\n'
        |     ollama run llama3.2 2>/dev/null
        |
        |     printf 'Stopping the server now.\n'
        |     kill "$serverPid"
       | 
       | And it just works :-)
        
         | boneitis wrote:
         | this was pretty much spot-on to my experience and track. the
         | ridicule of people choosing to use ollama over llamacpp is so
         | tired.
         | 
         | i had already burned an evening trying to debug and fix issues
         | getting nowhere fast, until i pulled ollama and had it working
         | with just two commands. it was a shock. (granted, there is/was
         | a crippling performance problem with sky/kabylake chips but
         | mitigated if you had any kind of mid-tier GPU and tweaked a
         | couple settings)
         | 
         | anyone who tries to contribute to the general knowledge base of
         | deploying llamacpp (like TFA) is doing heaven's work.
        
       | smcleod wrote:
       | Neat to see more folks writing blogs on their experiences. This
       | however does seem like it's an over-complicated method of
       | building llama.cpp.
       | 
        | Assuming you want to do this iteratively (at least for the first
        | time), you should only need to run:
        |
        |     ccmake .
        |
        | And toggle the parameters your hardware supports or that you want
        | (e.g. CUDA if you're using Nvidia, Metal if you're using Apple,
        | etc.), press 'c' (configure) then 'g' (generate), then:
        |
        |     cmake --build . -j $(expr $(nproc) / 2)
        |
        | Done.
       | 
       | If you want to move the binaries into your PATH, you could then
       | optionally run cmake install.
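        |
        | The non-interactive equivalent, if you already know which backend
        | you want, is along the lines of (the GGML_* option names are the
        | current ones; check cmake -LH if they have moved):
        |
        |     cmake -B build -DGGML_CUDA=ON    # or -DGGML_METAL=ON on Apple
        |     cmake --build build -j $(nproc)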
        
       | smcleod wrote:
       | Somewhat related - on several occasions I've come across the
       | claim that _"Ollama is just a llama.cpp wrapper"_, which is
       | inaccurate and completely misses the point. I am sharing my
        | response here to avoid repeating myself.
       | 
       | With llama.cpp running on a machine, how do you connect your LLM
        | clients to it and request that a model be loaded with a given set of
       | parameters and templates?
       | 
        | ... you can't, because llama.cpp is the inference engine - and
        | its bundled llama-cpp-server binary only provides relatively
        | basic server functionality - it's really more of a demo/example
        | or MVP.
       | 
       | Llama.cpp is all configured at the time you run the binary and
       | manually provide it command line args for the one specific model
       | and configuration you start it with.
       | 
        | Ollama provides a server and client for interfacing and packaging
        | models, such as:
        |
        |     - Hot loading models (e.g. when you request a model from your
        |       client, Ollama will load it on demand).
        |     - Automatic model parallelisation.
        |     - Automatic model concurrency.
        |     - Automatic memory calculations for layer and GPU/CPU placement.
        |     - Layered model configuration (basically docker images for models).
        |     - Templating and distribution of model parameters and templates in
        |       a container image.
        |     - Near feature-complete OpenAI-compatible API, as well as its
        |       native API that supports more advanced features such as model
        |       hot loading, context management, etc.
        |     - Native libraries for common languages.
        |     - Official container images for hosting.
        |     - A client/server model for running remote or local inference
        |       servers with either Ollama or OpenAI-compatible clients.
        |     - Support for both official and self-hosted model and template
        |       repositories.
        |     - Support for multi-modal / vision LLMs - something that llama.cpp
        |       is not currently focusing on providing.
        |     - Support for serving safetensors models, as well as running and
        |       creating models directly from their Hugging Face model ID.
       | 
        | In addition to the llama.cpp engine, Ollama is working on adding
        | additional model backends (e.g. things like exl2, awq, etc.).
       | 
       | Ollama is not "better" or "worse" than llama.cpp because it's an
       | entirely different tool.
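        |
        | To make the hot-loading point concrete: with Ollama the client
        | just names the model in the request and the server loads (and
        | later unloads) it for you, e.g. via the OpenAI-compatible
        | endpoint:
        |
        |     curl http://localhost:11434/v1/chat/completions \
        |       -H "Content-Type: application/json" \
        |       -d '{"model": "llama3.2",
        |            "messages": [{"role": "user", "content": "Hi"}]}'
        |
        | With plain llama-server you would instead restart it with
        | different command line args to switch models.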
        
       | HarHarVeryFunny wrote:
       | What are the limitations on which LLMs (specific transformer
        | variants etc) llama.cpp can run? Does it require the input
        | model/weights to be in some self-describing format like ONNX that
        | supports different model architectures as long as they are built
        | out of specific module/layer types, or does it more narrowly only
        | support transformer models parameterized by depth, width, etc.?
        
       | dmezzetti wrote:
       | Seeing a lot of Ollama vs running llama.cpp direct talk here. I
       | agree that setting up llama.cpp with CUDA isn't always the
       | easiest. But there is a cost to running all inference over HTTPS.
       | Local in-program inference will be faster. Perhaps that doesn't
       | matter in some cases but it's worth noting.
       | 
       | I find that running PyTorch is easier to get up and running. For
       | quantization, AWQ models work and it's just a "pip install" away.
        
       ___________________________________________________________________
       (page generated 2024-11-29 23:00 UTC)