[HN Gopher] Llama.cpp guide - Running LLMs locally on any hardwa...
___________________________________________________________________
Llama.cpp guide - Running LLMs locally on any hardware, from
scratch
Author : zarekr
Score : 206 points
Date : 2024-11-29 15:28 UTC (7 hours ago)
(HTM) web link (steelph0enix.github.io)
(TXT) w3m dump (steelph0enix.github.io)
| wing-_-nuts wrote:
| Llama.cpp is one of those projects that I _want_ to install, but
| I always just wind up installing kobold.cpp because it's simply
| _miles_ better in terms of UX.
| cwillu wrote:
| "koboldcpp forked from ggerganov/llama.cpp"
| syntaxing wrote:
| Llama.cpp is more of a backend. Most front-end software like
| kobold/open webui uses it.
| tempest_ wrote:
| I found it only took me ~20 minutes to get Open-WebUI and
| Ollama going on my machine locally. I don't really know what
| is happening under the hood but from 0 to chat interface was
| definitely not too hard.
| phillipcarter wrote:
| I just use ollama. It works on my Mac and Windows machine and
| it's super simple to install + run most open models. And you
| can have another tool just shell out to it if you want more
| than the CLI.
| lolinder wrote:
| Llama.cpp forms the base for both Ollama and Kobold.cpp and
| probably a bunch of others I'm not familiar with. It's less a
| question of whether you want to use llama.cpp _or_ one of the
| others and more of a question of whether you benefit from using
| one of the wrappers.
|
| I can imagine some use cases where you'd really want to use
| llama.cpp directly, and there are of course always people who
| will argue that all wrappers are bad wrappers, but for myself I
| like the combination of ease of use and flexibility offered by
| Ollama. I wrap it in Open WebUI for a GUI, but I also have some
| apps that reach out to Ollama directly.
| marcodiego wrote:
| The first time I heard about Llama.cpp I got it to run on my
| computer. Now, my computer: a Dell laptop from 2013 with 8 GB RAM
| and an i5 processor, no dedicated graphics card. Since I wasn't
| using an MGLRU-enabled kernel, it took a looong time to start, but
| it wasn't OOM-killed. Considering my amount of RAM was just the
| minimum required, I tried one of the smallest available models.
|
| Impressively, it worked. It was slow to spit out tokens, at a
| rate of around one word every 1 to 5 seconds, and it was able to
| correctly answer "What was the biggest planet in the solar
| system", but it quickly hallucinated, talking about moons that it
| called "Jupterians", while I expected it to talk about the
| Galilean moons.
|
| Nevertheless, LLMs really impressed me, and as soon as I get my
| hands on better hardware I'll try to run other, bigger models
| locally in the hope that I'll finally have a personal "oracle"
| able to quickly answer most questions I throw at it and help me
| write code and other fun things. Of course, I'll have to check
| its answers before using them, but the current state seems
| impressive enough for me, especially QwQ.
|
| Is anyone running smaller experiments and can talk about their
| results? Is it already possible to have something like an open
| source co-pilot running locally?
| sorenjan wrote:
| You can use Ollama for serving a model locally, and Continue to
| use it in VSCode.
|
| https://ollama.com/blog/continue-code-assistant
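|
| If you want a rough idea of the setup (the model names here are
| just examples; check the Continue docs for current
| recommendations), it boils down to something like:
|
|     # pull a chat model and a smaller model for autocomplete
|     ollama pull llama3.2
|     ollama pull qwen2.5-coder:1.5b
|
| Continue (the VS Code extension) is then pointed at Ollama's
| default endpoint, http://localhost:11434, in its config.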
| syntaxing wrote:
| Relevant telemetry information. I didn't like how they went
| from opt-in to opt-out earlier this year.
|
| https://docs.continue.dev/telemetry
| hedgehog wrote:
| Open Web UI [1] with Ollama and models like the smaller Llama,
| Qwen, or Granite series can work pretty well even with CPU or a
| small GPU. Don't expect them to contain facts (IMO not a good
| approach even for the largest models) but they can be very
| effective for data extraction and conversational UI.
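|
| If you want to try it, the Open WebUI quick start is roughly the
| following (assumes Docker and an Ollama instance already running
| on the host; double-check their README for the current command):
|
|     docker run -d -p 3000:8080 \
|       --add-host=host.docker.internal:host-gateway \
|       -v open-webui:/app/backend/data \
|       --name open-webui \
|       ghcr.io/open-webui/open-webui:main
|
| After that the UI is on http://localhost:3000.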
|
| 1. http://openwebui.com
| loudmax wrote:
| What you describe is very similar to my own experience first
| running llama.cpp on my desktop computer. It was slow and
| inaccurate, but that's beside the point. What impressed me was
| that I could write a question in English, and it would
| understand the question, and respond in English with an
| internally coherent and grammatically correct answer. This is
| coming from a desktop, not a rack full of servers in some
| hyperscaler's datacenter. This was like meeting a talking dog!
| The fact that what it says is unreliable is completely beside
| the point.
|
| I think you still need to calibrate your expectations for what
| you can get from consumer grade hardware without a powerful
| GPU. I wouldn't look to a local LLM as a useful store of
| factual knowledge about the world. The amount of stuff that it
| knows is going to be hampered by the smaller size. That doesn't
| mean it can't be useful, it may be very helpful for specialized
| domains, like coding.
|
| I hope and expect that over the next several years, hardware
| that's capable of running more powerful models will become
| cheaper and more widely available. But for now, the practical
| applications of local models that don't require a powerful GPU
| are fairly limited. If you really want to talk to an LLM that
| has a sophisticated understanding of the world, you're better
| off using Claude or Gemini or ChatGPT.
| yjftsjthsd-h wrote:
| You might also try https://github.com/Mozilla-Ocho/llamafile ,
| which _may_ have better CPU-only performance than ollama. It
| does require you to grab .gguf files yourself (unless you use
| one of their prebuilts, in which case it comes with the
| binary!), but with that done it's really easy to use and has
| decent performance.
|
| For reference, this is how I run it:
|
|     $ cat ~/.config/systemd/user/llamafile@.service
|     [Unit]
|     Description=llamafile with arbitrary model
|     After=network.target
|
|     [Service]
|     Type=simple
|     WorkingDirectory=%h/llms/
|     ExecStart=sh -c "%h/.local/bin/llamafile -m %h/llamafile-models/%i.gguf --server --host '::' --port 8081 --nobrowser --log-disable"
|
|     [Install]
|     WantedBy=default.target
|
| And then:
|
|     systemctl --user start llamafile@whatevermodel
|
| but you can just run that ExecStart command directly and it
| works.
| SahAssar wrote:
| Is that `--host` listening on non-local addresses? Might be
| good to default to local-only.
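|
| If you only want local access, swapping the --host value in the
| ExecStart above to loopback should do it (a sketch, not tested):
|
|     %h/.local/bin/llamafile -m %h/llamafile-models/%i.gguf \
|         --server --host 127.0.0.1 --port 8081 --nobrowser --log-disable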
| chatmasta wrote:
| Be careful running this on work machines - it will get
| flagged by Crowdstrike Falcon and probably other EDR tools.
| In my case the first time I tried it, I just saw "Killed" and
| then got a DM from SecOps within two minutes.
| niek_pas wrote:
| Can someone tell me what the advantages are of doing this over
| using, e.g., the ChatGPT web interface? Is it just a privacy
| thing?
| explorigin wrote:
| Privacy, available offline, software that lasts as long as the
| hardware can run it.
| JKCalhoun wrote:
| Yeah, that's me. Capture a snapshot of it, from time to time
| -- so if it ever goes offline (or off the rails: requires a
| subscription, begins to serve up ads), you have the last
| "good" one locally.
|
| I have a snapshot of Wikipedia as well (well, not the _whole_
| of Wikipedia, but 90GB worth).
| 3eb7988a1663 wrote:
| Which Wikipedia snapshot do you grab? I keep meaning to do
| this, but whenever I skim the Wikipedia downloads pages,
| they offer hundreds of different flavors without any
| immediate documentation as to what differentiates the
| products.
| severine wrote:
| You can use Kiwix: https://kiwix.org/en/
| elpocko wrote:
| Privacy, freedom, huge selection of models, no censorship,
| higher flexibility, and it's free as in beer.
| fzzzy wrote:
| Ability to have a stable model version with stable weights
| until the end of time
| priprimer wrote:
| you get to find out all the steps!
|
| meaning you learn more
| loudmax wrote:
| Yeah, agreed. If you think artificial intelligence is going
| to be an important technology in the coming years, and you
| want to get a better understanding of how it works, it's
| useful to be able to run something that you have full control
| over. Especially since you become very aware what the
| shortcomings are, and you appreciate the efforts that go into
| running the big online models.
| zarekr wrote:
| This is a way to run open source models locally. You need the
| right hardware, but it is a very efficient way to experiment
| with the newest models, fine-tuning, etc. ChatGPT uses massive
| models which are not practical to run on your own hardware.
| Privacy is also an issue for many people, particularly
| enterprise.
| 0000000000100 wrote:
| Privacy is a big one, but avoiding censorship and reducing
| costs are the other ones I've seen.
|
| Not so sure about the reducing costs argument anymore though;
| you'd have to use LLMs a ton to make buying brand new GPUs
| worth it (hosted models are pretty reasonably priced these
| days).
| stuckkeys wrote:
| I never understand these guardrails. The whole point of llms
| (imo) is for quick access to knowledge. If I want to better
| understand reverse shell or kernel hooking, why not tell me?
| But instead, "sorry, I ain't telling you because you will do
| harm" lol
| TeMPOraL wrote:
| Key insight: the guardrails aren't there to protect you
| from harmful knowledge; they're there to protect the
| company from all those wackos on the Internet who love to
| feign offense at anything that can get them a retweet, and
| journalists who amplify their outrage into storms big
| enough to depress company stock - or, in worst cases,
| attract attention of politicians.
| mistermann wrote:
| There are also plausibly some guardrails resulting from
| oversight by three letter agencies.
|
| I don't take everything Marc Andreessen said in his
| recent interview with Joe Rogan at face value, but I
| don't dismiss any of it either.
| pletnes wrote:
| You can chug through a big text corpus at little cost.
| JKCalhoun wrote:
| I did a blog post about my preference for offline [1]. LLMs
| would fall under the same criteria for me. Maybe not so much
| the distraction-free aspect of being offline, but as a guard
| against the ephemeral aspect of online.
|
| I'm less concerned about privacy for whatever reason.
|
| [1]
| https://engineersneedart.com/blog/offlineadvocate/offlineadv...
| cess11 wrote:
| For work I routinely need to do translations of confidential
| documents. Sending those to some web service in a state that
| doesn't even have basic data protection guarantees is not an
| option.
|
| Putting them into a local LLM is rather efficient, however.
| throwawaymaths wrote:
| Yeah, but I think if you've got a GPU you should probably think
| about using vllm. Last I tried using llama.cpp (which granted
| was several months ago) the UX was atrocious -- vllm basically
| gives you an OpenAI API with no fuss. That's saying something,
| as generally speaking I loathe Python.
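|
| For anyone curious, standing up vLLM's OpenAI-compatible server
| is roughly the following (the model name is just an example;
| check the vLLM docs for the exact invocation for your version):
|
|     pip install vllm
|
|     # serves an OpenAI-compatible API on http://localhost:8000/v1
|     vllm serve Qwen/Qwen2.5-1.5B-Instruct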
| superkuh wrote:
| I'd say avoid pulling in all the Python and containers required
| and just download the GGUF from the Hugging Face website directly
| in a browser rather than doing it programmatically. That sidesteps
| a lot of this project's complexity, since nothing about llama.cpp
| requires those heavy deps or abstractions.
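|
| A minimal sketch of that workflow (recent builds name the
| binaries llama-cli / llama-server; the GGUF filename here is just
| a placeholder for whatever you downloaded):
|
|     # chat in the terminal with a hand-downloaded GGUF
|     ./llama-cli -m ./some-model.Q4_K_M.gguf -cnv
|
|     # or expose the built-in HTTP server on localhost
|     ./llama-server -m ./some-model.Q4_K_M.gguf --host 127.0.0.1 --port 8080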
| notadoc wrote:
| Ollama is so easy, what's the benefit to Llama.cpp?
| arendtio wrote:
| I tried building and using llama.cpp multiple times, and after a
| while, I got so frustrated with the frequently broken build
| process that I switched to ollama with the following script:
|     #!/bin/sh
|     export OLLAMA_MODELS="/mnt/ai-models/ollama/"
|
|     printf 'Starting the server now.\n'
|     ollama serve >/dev/null 2>&1 &
|     serverPid="$!"
|
|     printf 'Starting the client (might take a moment (~3min) after a fresh boot).\n'
|     ollama run llama3.2 2>/dev/null
|
|     printf 'Stopping the server now.\n'
|     kill "$serverPid"
|
| And it just works :-)
| boneitis wrote:
| this was pretty much spot-on to my experience and track. the
| ridicule of people choosing to use ollama over llamacpp is so
| tired.
|
| i had already burned an evening trying to debug and fix issues
| getting nowhere fast, until i pulled ollama and had it working
| with just two commands. it was a shock. (granted, there is/was
| a crippling performance problem with sky/kabylake chips but
| mitigated if you had any kind of mid-tier GPU and tweaked a
| couple settings)
|
| anyone who tries to contribute to the general knowledge base of
| deploying llamacpp (like TFA) is doing heaven's work.
| smcleod wrote:
| Neat to see more folks writing blogs on their experiences. This
| however does seem like it's an over-complicated method of
| building llama.cpp.
|
| Assuming you want to do this iteratively (at least for the first
| time), you should only need to run:
|
|     ccmake .
|
| And toggle the parameters your hardware supports or that you want
| (e.g. CUDA if you're using Nvidia, Metal if you're using Apple,
| etc.), and press 'c' (configure) then 'g' (generate), then:
|
|     cmake --build . -j $(expr $(nproc) / 2)
|
| Done.
|
| If you want to move the binaries into your PATH, you could then
| optionally run cmake install.
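|
| If you'd rather skip the interactive step, the non-interactive
| equivalent is something like this (backend flag names have
| changed between releases, e.g. GGML_CUDA vs. older spellings, so
| check the build docs for your checkout):
|
|     # configure with CUDA enabled (use -DGGML_METAL=ON on Apple instead)
|     cmake -B build -DGGML_CUDA=ON
|
|     # build with half the cores
|     cmake --build build -j $(expr $(nproc) / 2)
|
|     # optionally install the binaries into your PATH
|     cmake --install build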
| smcleod wrote:
| Somewhat related - on several occasions I've come across the
| claim that _"Ollama is just a llama.cpp wrapper"_, which is
| inaccurate and completely misses the point. I am sharing my
| response here to avoid repeating myself.
|
| With llama.cpp running on a machine, how do you connect your LLM
| clients to it and request that a model be loaded with a given set
| of parameters and templates?
|
| ... you can't, because llama.cpp is the inference engine - and
| its bundled llama-server binary only provides relatively basic
| server functionality - it's really more of a demo/example or MVP.
|
| Llama.cpp is all configured at the time you run the binary; you
| manually provide it command-line args for the one specific model
| and configuration you start it with.
|
| Ollama provides a server and client for interfacing and packaging
| models, such as:
|
|     - Hot loading models (e.g. when you request a model from your
|       client, Ollama will load it on demand).
|     - Automatic model parallelisation.
|     - Automatic model concurrency.
|     - Automatic memory calculations for layer and GPU/CPU
|       placement.
|     - Layered model configuration (basically docker images for
|       models).
|     - Templating and distribution of model parameters and
|       templates in a container image.
|     - Near feature complete OpenAI-compatible API, as well as its
|       native API that supports more advanced features such as
|       model hot loading, context management, etc. (see the example
|       below).
|     - Native libraries for common languages.
|     - Official container images for hosting.
|     - A client/server model for running remote or local inference
|       servers with either Ollama or OpenAI-compatible clients.
|     - Support for both official and self-hosted model and template
|       repositories.
|     - Support for multi-modal / vision LLMs - something that
|       llama.cpp is not currently focusing on providing.
|     - Support for serving safetensors models, as well as running
|       and creating models directly from their Hugging Face model
|       ID.
|
| In addition to the llama.cpp engine, Ollama is working on adding
| additional model backends (e.g. things like exl2, awq, etc.).
|
| Ollama is not "better" or "worse" than llama.cpp because it's an
| entirely different tool.
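|
| To make the hot-loading point concrete, talking to Ollama's
| native API looks roughly like this (the model name is just an
| example; it is loaded on demand if it isn't already resident):
|
|     # pull the model once
|     ollama pull llama3.2
|
|     # any generate request for it will load it on demand
|     curl http://localhost:11434/api/generate -d '{
|       "model": "llama3.2",
|       "prompt": "Why is the sky blue?",
|       "stream": false
|     }'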
| HarHarVeryFunny wrote:
| What are the limitations on which LLMs (specific transformer
| variants etc.) llama.cpp can run? Does it require the input
| model/weights to be in some self-describing format like ONNX that
| supports different model architectures as long as they are built
| out of specific module/layer types, or does it more narrowly only
| support transformer models parameterized by depth, width, etc.?
| dmezzetti wrote:
| Seeing a lot of Ollama vs. running llama.cpp direct talk here. I
| agree that setting up llama.cpp with CUDA isn't always the
| easiest. But there is a cost to running all inference over HTTP.
| Local in-process inference will be faster. Perhaps that doesn't
| matter in some cases, but it's worth noting.
|
| I find PyTorch easier to get up and running. For quantization,
| AWQ models work and it's just a "pip install" away.
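|
| If anyone wants to try that route, the install side really is just
| (an AWQ checkpoint from the Hugging Face hub is then loaded via the
| transformers/autoawq Python APIs):
|
|     pip install torch transformers autoawq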
___________________________________________________________________
(page generated 2024-11-29 23:00 UTC)