[HN Gopher] I Run LLMs Locally
       ___________________________________________________________________
        
       I Run LLMs Locally
        
       Author : Abishek_Muthian
       Score  : 174 points
       Date   : 2024-12-29 10:49 UTC (12 hours ago)
        
 (HTM) web link (abishekmuthian.com)
 (TXT) w3m dump (abishekmuthian.com)
        
       | koinedad wrote:
       | Helpful summary, short but useful
        
         | drillsteps5 wrote:
         | I did not find it useful at all. It's just a set of links to
         | various places that can be helpful in finding useful
         | information.
         | 
          | The LocalLLaMA subreddit, for example, is extremely helpful.
          | You have to figure out what you want and which tools you want
          | to use (by reading posts and asking questions), and then find
          | guides on setting up those tools. As you proceed you will run
          | into issues, and you can easily find most of your answers in
          | the same subreddit (or in the subreddits dedicated to the
          | particular tools you're trying to set up), because hundreds of
          | people have had the same issues before.
         | 
          | Not to say that the article is useless; it's just a semi-
          | organized set of links.
        
           | emmelaich wrote:
           | It should list Simon Willison's
           | https://llm.datasette.io/en/stable/
        
       | jokethrowaway wrote:
       | I have a similar pc and I use text-generation-webui and mostly
       | exllama quantized models.
       | 
       | I also deploy text-generation-webui for clients on k8s with gpu
       | for similar reasons.
       | 
       | Last I checked, llamafile / ollama are not as optimised for gpu
       | use.
       | 
        | For image generation I moved from the Automatic1111 web UI to
        | ComfyUI a few months ago - they're different beasts; for some
        | workflows Automatic is easier to use, but for most tasks you can
        | build a better workflow with enough Comfy extensions.
       | 
       | Facefusion warrants a mention for faceswapping
        
       | Salgat wrote:
       | There is a lot I want to do with LLMs locally, but it seems like
       | we're still not quite there hardware-wise (well, within
       | reasonable cost). For example, Llama's smaller models take
       | upwards of 20 seconds to generate a brief response on a 4090; at
       | that point I'd rather just use an API to a service that can
       | generate it in a couple seconds.
        
         | pickettd wrote:
          | My gut feeling is that there are optimizations you could make
          | for faster performance (but I could be wrong since I don't know
          | your setup or requirements). In general, on a 4090 running
          | Q6-Q8 quants, my tokens/sec have been similar to what I see on
          | cloud providers (for open/local models). The fastest local
          | configuration I've tested is ExLlama/TabbyAPI with speculative
          | decoding (and a quantized cache to be able to fit more
          | context).
        
         | do_not_redeem wrote:
         | You should absolutely be getting faster responses with a 4090.
         | But that points to another advantage of cloud services--you
         | don't have to debug your own driver issues.
        
         | zh3 wrote:
          | Depends on the model: if it doesn't fit into VRAM, performance
          | will suffer. Response here is immediate (at ~15 tokens/sec) on
          | a pair of eBay RTX 3090s in an ancient i7-3770 box.
          | 
          | Even if your model does fit into VRAM, it gets ejected after a
          | period of inactivity, and there will be a startup pause when it
          | reloads. Try setting OLLAMA_KEEP_ALIVE to -1 to keep it loaded
          | (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how
          | -d...).
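          | 
          | For illustration, a minimal sketch of keeping a model resident
          | with Ollama (the env var and the per-request keep_alive field
          | are described in the Ollama FAQ; the model tag is just an
          | example):
          | 
          |     # Keep models loaded indefinitely instead of unloading
          |     # after the default ~5 minutes.
          |     export OLLAMA_KEEP_ALIVE=-1
          |     ollama serve &
          | 
          |     # Or control it per request via the API's keep_alive field.
          |     curl http://localhost:11434/api/generate \
          |       -d '{"model": "llama3.3", "keep_alive": -1}'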
        
           | e12e wrote:
           | >> within reasonable cost
           | 
           | > pair of ebay RTX 3090s
           | 
           | So... 1700 USD?
        
             | drillsteps5 wrote:
              | If you're looking for the most bang for the buck, 2x 3060
              | (12GB) might be the best bet. The GPUs will be around
              | $400-$600.
        
             | zh3 wrote:
              | £1200, so a little less. Targeted at having 48GB (2x24GB)
              | of VRAM for running the larger models; having said that, a
              | single 12GB RTX 3060 in another box seems pretty close in
              | local testing (with smaller models).
        
           | drillsteps5 wrote:
            | I have been trying forever to find a coherent guide on
            | building a dual-GPU box for this purpose; do you know of any?
            | Like selecting the motherboard, case, cooling, power supply
            | and cables, any special voodoo required to pair the GPUs, etc.
        
             | zh3 wrote:
              | I'm not aware of any particular guides; the setup here was
              | straightforward - an old motherboard with two PCIe x16
              | slots (Asus P8Z77V or P8Z77WS), a big enough power supply
              | (Seasonic 850W) and the stock Linux Nvidia drivers. The
              | RTX 3090s are basic Dell models (i.e. not OC'ed gamer
              | versions), and worth noting they only get hot if used
              | continuously - if you're the only one using them, the fans
              | spin up during a query and back down between. A good
              | 'smoke test' is something like:
              | 
              |     while true; do
              |       ollama run llama3.3 "Explain cosmology"
              |     done
             | 
              | With llama3.3 70B, two RTX 3090s give you 48GB of VRAM and
              | the model uses about 44GB, so the first start is slow
              | (loading the model into VRAM) but after that the response
              | is fast (subject to the comment above about KEEP_ALIVE).
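              | 
              | For illustration, a rough sketch of watching both cards
              | while that loop runs (nvidia-smi ships with the driver;
              | ollama ps reports what is loaded and whether it sits fully
              | on the GPUs):
              | 
              |     # Watch VRAM usage on both GPUs every 5 seconds.
              |     nvidia-smi -l 5 --format=csv \
              |       --query-gpu=index,memory.used,memory.total
              | 
              |     # Confirm the model is loaded and 100% on GPU.
              |     ollama ps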
        
         | segmondy wrote:
          | This is very wrong. Smaller models generate responses instantly
          | on a 4090. I run 3090s and easily get 30-50 tokens/second with
          | small models. Folks with a 4090 will easily see 80 tokens/sec
          | for a 7-8B model in Q8, and probably 120-160 for 3B models.
          | Faster than most public APIs.
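          | 
          | For reference, a minimal sketch of measuring this yourself with
          | Ollama (--verbose prints prompt and generation rates after the
          | response; the model tag is just an example):
          | 
          |     # Prints timing stats, including eval rate in tokens/s.
          |     ollama run --verbose llama3.1:8b "Write a haiku about GPUs"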
        
           | hedgehog wrote:
           | 8B models are pretty fast even on something like a 3060
           | depending on deployment method (for example Q4 on Ollama).
        
           | kolbe wrote:
           | They might be talking about 70B
        
             | nobodyandproud wrote:
             | Wait, is 70b considered on the smaller side these days?
        
               | kolbe wrote:
               | Relative to deepseek's latest, yeah
        
           | imiric wrote:
           | Sure, except those smaller models are only useful for some
           | novelty and creative tasks. Give them a programming or logic
           | problem and they fall flat on their face.
           | 
           | As I mentioned in a comment upthread, I find ~30B models to
           | be the minimum for getting somewhat reliable output. Though
           | even ~70B models pale in comparison with popular cloud LLMs.
           | Local LLMs just can't compete with the quality of cloud
           | services, so they're not worth using for most professional
           | tasks.
        
         | mtkd wrote:
          | The consumer hardware channel just hasn't caught up yet --
          | we'll see a lot more desktop kit appear in 2025 on retail sites
          | (there is a small site in the UK selling Nvidia A100
          | workstations for £100K+ each on a Shopify store).
          | 
          | I seem to remember a similar point in the late 90s, having to
          | build boxes to run NT/SQL 7.0 for local dev.
         | 
         | Expect there will be a swing back to on-prem once enterprise
         | starts moving faster and the legal teams begin to understand
         | what is happening data-side with RAG, agents etc.
        
           | fooker wrote:
            | > Nvidia A100 workstations for £100K+
           | 
           | This seems extremely overpriced.
        
             | mtkd wrote:
              | I've not looked properly but I think it's 4 x 80GB A100s.
        
               | ojbyrne wrote:
               | Maybe this?
               | 
               | https://www.ctoservers.com/nvidia-dgx-station-a100---
               | quad-nv...
               | 
                | £80k
        
         | jhatemyjob wrote:
         | yeah many people don't understand how cheap it is to use the
         | chatgpt API
         | 
         | not to mention all of the other benefits of delegating all the
         | work of setting up the GPUs, public HTTP server, designing the
         | API, security, keeping the model up-to-date with the state of
         | the art, etc
         | 
         | reminds me of the people in the 2000s / early 2010s who would
         | build their own linux boxes back when the platform was super
         | unstable, constantly fighting driver issues etc instead of just
         | getting a mac.
         | 
          | roll-your-own-LLM makes even less sense. at least for those
          | early 2000s linux guys, even if you spent an ungodly amount of
          | time going through the arch wiki or compiling gentoo or
          | whatever, those skills are somewhat transferable to
          | sysadmin/SRE. i don't see how setting up your own instance of
          | ollama has any transferable skills
         | 
         | the only way i could see it making sense is if you're doing
         | super cutting edge stuff that necessitates owning a tinybox, or
         | if you're trying to get a job at openAI or anysphere
        
           | eikenberry wrote:
           | Depends on what you value. Many people value keeping general
           | purpose computing free/libre and available to as many as
           | possible. This means using free systems and helping those
           | systems mature.
        
         | o11c wrote:
         | Even on CPU, you should get the start of a response within 5
         | seconds for Q4 8B-or-smaller Llama models (proportionally
         | faster for smaller ones), which then stream at several tokens
         | per second.
         | 
         | There are a _lot_ of things to criticize about LLMs (the answer
          | is quite likely to ignore what you're actually asking, for
         | example) but your speed problem looks like a config issue
         | instead. Are you calling the API in streaming mode?
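          | 
          | For illustration, a minimal sketch of the difference against a
          | local Ollama server (the endpoint and fields are from Ollama's
          | API docs; the model tag is just an example):
          | 
          |     # Streaming is the default: tokens arrive as generated.
          |     curl http://localhost:11434/api/generate -d '{
          |       "model": "llama3.1:8b",
          |       "prompt": "Why is the sky blue?"
          |     }'
          | 
          |     # With "stream": false nothing comes back until the whole
          |     # response is done, which can look like a 20-second stall
          |     # even when generation speed is fine.
          |     curl http://localhost:11434/api/generate -d '{
          |       "model": "llama3.1:8b",
          |       "prompt": "Why is the sky blue?",
          |       "stream": false
          |     }'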
        
         | nobodyandproud wrote:
         | You may have been running with CPU inference, or running models
         | that don't fit your VRAM.
         | 
          | I was running a 5-bit quantized model of Codestral 22B with a
          | Radeon RX 7900 (20GB), compiled with Vulkan only.
          | 
          | Eyeballing only, but the prompt responses were maybe 2x or 3x
          | slower than OpenAI's GPT-4o (maybe 2-4 seconds for most
          | paragraph-long responses).
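          | 
          | For anyone curious, a rough sketch of a Vulkan-only llama.cpp
          | build for AMD cards like that (the CMake option has changed
          | name between releases - older versions used LLAMA_VULKAN - and
          | the model filename is just a placeholder):
          | 
          |     # Build llama.cpp with the Vulkan backend (no ROCm/CUDA).
          |     git clone https://github.com/ggerganov/llama.cpp
          |     cmake -S llama.cpp -B build -DGGML_VULKAN=ON
          |     cmake --build build --config Release -j
          | 
          |     # Offload as many layers as possible to the GPU.
          |     ./build/bin/llama-cli -m codestral-22b-q5.gguf -ngl 99 \
          |       -p "Write a binary search in Go."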
        
         | imiric wrote:
         | To me it's not even about performance (speed). It's just that
         | the quality gap between cloud LLM services and local LLMs is
         | still quite large, and seems to be increasing. Local LLMs have
          | gotten better in the past year, but cloud LLMs even more so.
          | This is partly because large companies can afford to keep
          | throwing more compute at the problem, while quality at
          | smaller-scale deployments is not increasing at the same pace.
         | 
         | I have a couple of 3090s and have tested most of the popular
         | local LLMs (Llama3, DeepSeek, Qwen, etc.) at the highest
         | possible settings I can run them comfortably (~30B@q8, or
         | ~70B@q4), and they can't keep up with something like Claude 3.5
         | Sonnet. So I find myself just using Sonnet most of the time,
         | instead of fighting with hallucinated output. Sonnet still
         | hallucinates and gets things wrong a lot, but not as often as
         | local LLMs do.
         | 
         | Maybe if I had more hardware I could run larger models at
         | higher quants, but frankly, I'm not sure it would make a
         | difference. At the end of the day, I want these tools to be as
         | helpful as possible and not waste my time, and local LLMs are
         | just not there yet.
        
       | sturza wrote:
       | 4090 has 24gb vram, not 16.
        
         | vidyesh wrote:
         | The mobile RTX 4090 has 16GB
         | https://www.notebookcheck.net/NVIDIA-GeForce-RTX-4090-Laptop...
        
       | deadbabe wrote:
       | My understanding is that local LLMs are mostly just toys that
       | output basic responses, and simply can't compete with full LLMs
       | trained with $60 million+ worth of compute time, and that no
       | matter how good hardware gets, larger companies will always have
       | even better hardware and resources to output even better results,
       | so basically this is pointless for anything competitive or
       | serious. Is this accurate?
        
         | quest88 wrote:
         | That's a strange question. It sounds like you're saying the
         | author isn't serious in their use of LLMs. Can you describe
         | what you mean by competitive and serious?
         | 
         | You're commenting on a post about how the author runs LLMs
         | locally because they find them useful. Do you think they would
         | run them and write an entire post on how they use it if they
         | didn't find it useful? The author is seriously using them.
        
           | prophesi wrote:
           | It's the missing piece of the article. How is their
            | experience running LLMs locally compared to GPT-4o / Sonnet
            | 3.5 / etc., outside of data privacy?
        
             | th0ma5 wrote:
              | In my experience, if you can make a reasonable guess as to
              | the total contained information about your topic within the
              | given size of model, then you can be very specific with
              | your context and your ask, and it is generally as good as
              | naive use of any of the larger models. Honestly, Llama 3.2
              | is as accurate as anything and runs reasonably quickly on a
              | CPU. Larger models to me mostly increase the surface area
              | for potential error and get exhausting to read, even when
              | you ask them to get on with it.
        
         | pickettd wrote:
         | Depends on what benchmarks/reports you trust I guess (and how
         | much hardware you have for local models either in-person or in-
         | cloud). https://aider.chat/docs/leaderboards/ has Deepseek v3
         | scoring higher than most closed LLMs on coding (but it is a
         | huge local model). And https://livebench.ai has QwQ scoring
         | quite high in the reasoning category (and that is relatively
         | easy to run locally but it doesn't score super high in other
         | categories).
        
         | alkonaut wrote:
         | If big models require big hardware how come there are so many
         | free LLMs? How much compute is one ChatGPT interaction with the
         | default model? Is it a massive loss leader for OpenAI?
        
           | drillsteps5 wrote:
            | A big part of it is the fact that Meta is trying very hard
            | not to lose relevance against OpenAI, so they keep building
            | and releasing a lot of their models for free. I would imagine
            | they're taking huge losses on these, but it's a price they
            | seem to be willing to pay for being late to the party.
            | 
            | The same applies (to a smaller extent) to Google, and more
            | recently to Alibaba.
           | 
           | A lot of free (not open-source, mind you) LLMs are either
           | from Meta or built by other smaller outfits by tweaking
           | Meta's LLMs (there are exceptions, like Mistral).
           | 
            | One thing to keep in mind is that while training a model
            | requires A LOT of data, manual intervention, hardware, power,
            | and time, using these LLMs (inference, i.e. a forward pass
            | through the neural network) really does not. Right now it
            | typically requires an Nvidia GPU with lots of VRAM, but I
            | have no doubt that in just a few months, maybe a year,
            | someone will find a much easier way to do it without
            | sacrificing speed.
        
         | drillsteps5 wrote:
         | This is not my experience, but it depends on what you consider
         | "competitive" and "serious".
         | 
          | While it does appear that a large number of people are using
          | these for RP and... um... similar stuff, I do find code
          | generation to be fairly good, especially with some of the
          | recent Qwen models (from Alibaba). Full disclosure: I use this
          | sparingly, either to get boilerplate or to generate a complete
          | function/module with clearly defined specs, and I sometimes
          | have to refactor the output to fit my style and preferences.
         | 
         | I also use various general models (mostly Meta's) fairly
         | regularly to come up with and discuss various business and
         | other ideas, and get general understanding in the areas I want
         | to get more knowledge of (both IT and non-IT), which helps when
         | I need to start digging deeper into the details.
         | 
         | I usually run quantized versions (I have older GPU with lower
         | VRAM).
         | 
          | I also use them for wordsmithing and generating illustrations
          | for my blog articles (I prefer Plex; it's actually fairly good
          | at adding various captions to illustrations, whereas the
          | built-in image generator in ChatGPT was still horrible at it
          | when I tried it a few months ago).
          | 
          | And for some resume tweaking as part of a manual workflow
          | (multiple iterations, checking back and forth between what the
          | LLM gave me and my version).
         | 
         | So it's mostly stuff for my own personal consumption, that I
         | don't necessarily trust the cloud with.
         | 
         | If you have a SaaS idea with an LLM or other generative AI at
         | its core, processing the requests locally is probably not the
         | best choice. Unless you're prototyping, in which case it can
         | help.
        
           | meta_x_ai wrote:
            | Why would you waste time on good models when there are great
            | models?
        
             | drillsteps5 wrote:
             | Good models are good enough for me, meta_x_ai, I gain
             | experience by setting them up and following up on industry
             | trends, and I don't trust OpenAI (or MSFT, or Google, or
             | whatever) with my information. No, I do not do anything
             | illegal or unethical, but it's not the point.
        
         | pavlov wrote:
         | The leading provider of open models used for local LLM setups
         | is Meta (with the Llama series). They have enormous resources
         | to train these models.
         | 
         | They're giving away these expensive models because it doesn't
         | hurt Meta's ad business but reduces risk that competitors could
         | grow moats using closed models. The old "commoditize your
         | complements" play.
        
         | johnklos wrote:
         | There's a huge difference between training and running models.
         | 
          | The multimillion (or billion) dollar collections of hardware
          | produce the models that we (people who run LLMs locally) run.
          | The we-host-LLMs-for-money companies do the same with their
          | non-open datasets, and their data isn't all that much fancier.
          | Open LLMs are catching up to closed ones, and the competition
          | means everyone wins except for the victims of Nvidia's price
          | gouging.
         | 
         | This is a bit like confusing the resources that go in to making
         | a game with the resources needed to run and play that game.
         | They're worlds different.
        
           | prettyStandard wrote:
           | Even Nvidia's victims are better off... (O)llama runs on AMD
           | GPUs
        
         | kolbe wrote:
         | I think the bigger issue is hardware utilization. Meta and
         | Qwen's medium sized models are fantastic and competitive with 6
         | month old offerings from OpenAI and Claude, but you need to
         | have $2500 of hardware to run it. If you're going to be
         | spinning this 24/7, sure the API costs for OpenAI and Anthropic
         | would be insane, but as far as normal personal usage patterns
         | go, you'd probably spend $20/month. So, either spend $2500 that
         | will depreciate to $1000 in two years or $580 in API costs?
        
         | philjohn wrote:
         | Not entirely.
         | 
         | TRAINING an LLM requires a lot of compute. Running inference on
         | a pre-trained LLM is less computationally expensive, to the
         | point where you can run LLAMA (cost $$$ to train on Meta's GPU
         | cluster) with CPU-based inference.
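          | 
          | For illustration, a minimal sketch of CPU-only inference with
          | llama.cpp (the model file and quant are just examples; an 8B
          | model at Q4 is roughly a 5GB download):
          | 
          |     # Build llama.cpp with no GPU backend at all.
          |     git clone https://github.com/ggerganov/llama.cpp
          |     cmake -S llama.cpp -B build
          |     cmake --build build --config Release -j
          | 
          |     # Slower than a GPU, but usable for small models.
          |     ./build/bin/llama-cli -m llama-3.1-8b-q4.gguf -n 256 \
          |       -p "Explain training vs. inference."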
        
       | halyconWays wrote:
        | Super basic intro, but perhaps useful. Doesn't mention quant
        | sizes, which are important when you're GPU-poor. Lots of other
        | client-side things you can do too, like KoboldAI, TavernAI, Jan,
        | LangFuse for observability, or CogVLM2 for a vision model.
       | 
       | One of the best places to get the latest info on what people are
       | doing with local models is /lmg/ on 4chan's /g/
        
       | pvo50555 wrote:
       | There was a post a few weeks back (or a reply to a post) showing
        | an app made entirely using an LLM. It was something like a 3D
        | globe made with three.js, and I believe the poster had created
        | it locally on his M4 MacBook with 96GB of RAM? I can't recall
        | which model it was or
       | what else the app did, but maybe someone knows what I'm talking
       | about?
        
       | dumbfounder wrote:
       | Updation. That's a new word for me. I like it.
        
         | sccomps wrote:
         | It's quite common in India but I guess not widely accepted
         | internationally. If there can be deletion, then why not
         | updation?
        
           | exe34 wrote:
           | that's double plus good.
        
       | rspoerri wrote:
       | I run a pretty similar setup on an m2-max - 96gb.
       | 
        | For AI image generation though, I would rather recommend Krita
        | with the https://github.com/Acly/krita-ai-diffusion plugin.
        
         | donquichotte wrote:
         | Are you using an external provider for image generation or
         | running something locally?
        
       | Der_Einzige wrote:
        | Still nothing better than oobabooga
        | (https://github.com/oobabooga/text-generation-webui) in terms of
        | a maximalist/"Pro"/"Prosumer" LLM UI/UX a la Blender, Photoshop,
        | Final Cut Pro, etc.
        | 
        | Embarrassing, and any VCs reading this can contact me to talk
        | about how to fix that. LM Studio is today the closest competition
        | (but not close enough), and Adobe or Microsoft could do it if
        | they fired the current folks who prevent this from happening.
        | 
        | If you're not using Oobabooga, you're likely not playing with the
        | settings on your models, and if you're not playing with your
        | models' settings, you're hardly even scratching the surface of
        | their total capabilities.
        
         | theropost wrote:
         | this.
        
       | dividefuel wrote:
       | What GPU offers a good balance between cost and performance for
       | running LLMs locally? I'd like to do more experimenting, and am
       | due for a GPU upgrade from my 1080 anyway, but would like to
       | spend less than $1600...
        
         | redacted wrote:
         | Nvidia for compatibility, and as much VRAM as you can afford.
         | Shouldn't be hard to find a 3090 / Ti in your price range. I
         | have had decent success with a base 3080 but the 10GB really
         | limits the models you can run
        
         | yk wrote:
         | Second the other comment, as much vram as possible. A 3060 has
         | 12 GB at a reasonable price point. (And is not too limiting.)
        
           | grobbyy wrote:
            | There's a huge step up in capability with 16GB and 24GB, for
            | not too much more. The 4060 has a 16GB version, for example.
            | On the cheap end, the Intel Arc does too.
            | 
            | The next major step up is 48GB, and then hundreds of GB. But
            | a lot of ML models target 16-24GB since that's in the grad
            | student price range.
        
         | bloomingkales wrote:
          | Honestly, I think if you just want to do inference, the 7600 XT
          | and RX 6800 have 16GB at $300 and $400 on Amazon. It's gonna be
          | my stopgap until whatever comes next. The RX 6800 has better
          | memory bandwidth than the 4060 Ti (I think it matches the
          | 4070).
        
           | kolbe wrote:
            | AMD GPUs are a fantastic deal until you hit a problem. With
            | some models/frameworks they work great. Others, not so much.
        
             | bloomingkales wrote:
              | For sure, but I think people on the fine-tuning/training/
              | Stable Diffusion side are more concerned with that. They
              | make a big fuss about this and basically talk people out of
              | a perfectly good and well-priced 16GB VRAM card that
              | literally works out of the box with Ollama or LM Studio for
              | text inference.
             | 
             | Kind of one of the reasons AMD is a sleeper stock for me.
             | If people only knew.
        
         | kolbe wrote:
         | If you want to wait until the 5090s come out, you should see a
         | drop in the price of the 30xx and 40xx series. Right now,
         | shopping used, you can get two 3090s or two 4080s in your price
         | range. Conventional wisdom says two 3090s would be better, but
         | this is all highly dependent on what models you want to run.
         | Basically the first requirement is to have enough VRAM to host
         | all of your model on it, and secondarily, the quality of the
         | GPU.
         | 
         | Have a look through Hugging Face to see which models interest
         | you. A rough estimate for the amount of VRAM you need is half
         | the model size plus a couple gigs. So, if using the 70B models
         | interests you, two 4080s wouldn't fit it, but two 3090s would.
         | If you're just interested in the 1B, 3B and 7B models (llama 3B
         | is fantastic), you really don't need much at all. A single 3060
         | can handle that, and those are not expensive.
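          | 
          | To make that rule of thumb concrete, a quick back-of-the-
          | envelope sketch (this is only a rough ~Q4 heuristic taken from
          | the estimate above; real usage also depends on the quant and
          | context length):
          | 
          |     # Rough VRAM estimate at ~4-bit: params/2 + ~2GB overhead.
          |     #   70B -> ~37GB: fits 2x 3090 (48GB), not 2x 4080 (32GB).
          |     #    8B -> ~6GB:  fits a single 3060 with room for context.
          |     estimate() { echo "$(( $1 / 2 + 2 )) GB"; }
          |     estimate 70   # 37 GB
          |     estimate 8    # 6 GB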
        
         | zitterbewegung wrote:
          | Get a new or used 3090; it has 24GB of VRAM and it's below
          | $1600.
        
           | christianqchung wrote:
            | A lot of moderate power users are running an undervolted,
            | used pair of 3090s on a 1000-1200W PSU. 48GB of VRAM lets you
            | run 70B models at Q4 with 16k context.
            | 
            | If you use speculative decoding (a small model generates
            | tokens that are verified by the larger model; I'm not sure on
            | the specifics) it seems you can get past 20 tokens per
            | second. You can also fit 32B models like Qwen/Qwen Coder at
            | Q6 with lots of context this way; with spec decoding, that's
            | closer to 40+ tokens/s.
        
         | adam_arthur wrote:
          | Inferencing does not require Nvidia GPUs at all, and it's
          | almost criminal to be recommending dedicated GPUs with only
          | 12GB of VRAM.
         | 
         | Buy a MacMini or MacbookPro with RAM maxed out.
         | 
          | I just bought an M4 Mac mini for exactly this use case; it has
          | 64GB for ~$2k. You can get 128GB on the MBP for ~$5k. These
          | will run much larger (and more useful) models.
         | 
         | EDIT: Since the request was for < $1600, you can still get a
         | 32GB mac mini for $1200 or 24GB for $800
        
           | natch wrote:
           | Reasonable? $7,000 for a laptop is pretty up there.
           | 
           | [Edit: OK I see I am adding cost when checking due to
           | choosing a larger SSD drive, so $5,000 is more of a fair
           | bottom price, with 1TB of storage.]
           | 
           | Responding specifically to this very specific claim: "Can get
           | 128GB of ram for a reasonable price."
           | 
           | I'm open to your explanation of how this is reasonable -- I
           | mean, you didn't say cheap, to be fair. Maybe 128GB of ram on
           | GPUs would be way more (that's like 6 x 4090s), is what
           | you're saying.
           | 
           | For anyone who wants to reply with other amounts of memory,
           | that's not what I'm talking about here.
           | 
           | But on another point, do you think the ram really buys you
           | the equivalent of GPU memory? Is Apple's melding of CPU/GPU
           | really that good?
           | 
           | I'm not just coming from a point of skepticism, I'm actually
           | kind of hoping to be convinced you're right, so wanting to
           | hear the argument in more detail.
        
             | adam_arthur wrote:
             | It's reasonable in a "working professional who gets
             | substantial value from" or "building an LLM driven startup
             | project" kind of way.
             | 
             | It's not for the casual user, but for somebody who derives
             | significant value from running it locally.
             | 
             | Personally I use the MacMini as a hub for a project I'm
             | working on as it gives me full control and is simply much
             | cheaper operationally. A one time ~$2000 cost isn't so bad
             | for replacing tasks that a human would have to do. e.g. In
             | my case I'm parsing loosely organized financial documents
             | where structured data isn't available.
             | 
             | I suspect the hardware costs will continue to decline
             | rapidly as they have in the past though, so that $5k for
             | 128GB will likely be $5k for 256GB in a year or two, and so
             | on.
             | 
             | We're almost at the inflection point where really powerful
             | models are able to be inferenced locally for cheap
        
           | talldayo wrote:
           | > its almost criminal to be recommending dedicated GPUs with
           | only 12GB of RAM.
           | 
           | If you already own a PC, it makes a hell of a lot more sense
           | to spend $900 on a 3090 than it does to spec out a Mac Mini
           | with 24gb of RAM. Plus, the Nvidia setup can scale to as many
           | GPUs as you own which gives you options for upgrading that
           | Apple wouldn't be caught dead offering.
           | 
           | Oh, and native Linux support that doesn't suck balls is a
           | plus. I haven't benchmarked a Mac since the M2 generation,
           | but the figures I can find put the M4 Max's compute somewhere
           | near the desktop 3060 Ti:
           | https://browser.geekbench.com/opencl-benchmarks
        
             | adam_arthur wrote:
             | A Mac Mini with 24GB is ~$800 at the cheapest
             | configuration. I can respect wanting to do a single part
             | upgrade, but if you're using these LLMs for serious work,
             | the price/perf for inferencing is far in favor of using
             | Macs at the moment.
             | 
             | You can easily use the MacMini as a hub for running the LLM
             | while you do work on your main computer (and it won't eat
             | up your system resources or turn your primary computer into
             | a heater)
             | 
             | I hope that more non-mac PCs come out optimized for high
             | RAM SoC, I'm personally not a huge Apple fan but use them
             | begrudgingly.
             | 
             | Also your $900 quote is a used/refurbished GPU. I've had
             | plenty of GPUs burn out on me in the old days, not sure how
             | it is nowadays, but that's a lot to pay for a used part IMO
        
               | fragmede wrote:
               | if you're doing serious work, performance is more
               | important than getting a good price/perf ratio, and a
                | pair of 3090s is gonna be faster. It depends on your
                | budget, though, as that configuration is a bit more
                | expensive.
        
               | adam_arthur wrote:
               | Whether performance or cost is more important depends on
               | your use case. Some tasks that an LLM can do very well
               | may not need to be done often, or even particularly
               | quickly (as in my case).
               | 
               | e.g. LLM as one step of an ETL-style pipeline
               | 
               | Latency of the response really only matters if that
               | response is user facing and is being actively awaited by
               | the user
        
           | 2-3-7-43-1807 wrote:
            | How about heat dissipation, where I assume an MBP is at a
            | disadvantage compared to a PC?
        
         | elorant wrote:
          | I consider the RTX 4060 Ti the best entry-level GPU for
          | running small models. It has 16GB of RAM, which gives you
          | plenty of space for running large context windows, and Tensor
          | Cores, which are crucial for inference. For larger models,
          | probably multiple RTX 3090s, since you can buy them on the
          | cheap on the second-hand market.
         | 
         | I don't have experience with AMD cards so I can't vouch for
         | them.
        
           | fnqi8ckfek wrote:
           | I know nothing about gpus. Should I be assuming that when
           | people say "ram" in the context of gpus they always mean
           | vram?
        
             | layer8 wrote:
             | "GPU with xx RAM" means VRAM, yes.
        
       | throwaway314155 wrote:
        | Open WebUI sure does pull in a lot of dependencies... Do I really
        | need all of langchain, pytorch, and plenty of others for what is
        | advertised as a _frontend_?
        | 
        | Does anyone know of a lighter, more minimalist version?
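        | 
        | (For context, the lightest-weight option I know of is to skip a
        | frontend entirely and hit Ollama's HTTP API directly - a minimal
        | sketch; the chat endpoint and fields are from Ollama's API docs,
        | and the model tag is just an example:
        | 
        |     # No web UI: Ollama already speaks HTTP on port 11434.
        |     curl http://localhost:11434/api/chat -d '{
        |       "model": "llama3.1:8b",
        |       "messages": [{"role": "user", "content": "Hello there"}]
        |     }'
        | 
        | But I'd still like something nicer to look at.)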
        
         | noman-land wrote:
         | https://llamafile.ai/
        
           | throwaway314155 wrote:
           | I love what llamafile is doing, but I'm primarily interested
           | in a frontend for ollama, as I prefer their method of
           | model/weights distribution. Unless I'm wrong, llamafile
           | serves as both the frontend and backend.
        
             | DrBenCarson wrote:
             | LM Studio?
        
         | figmert wrote:
          | I ran it temporarily in Docker the other day. It was an 8GB
          | image. I'm unsure why a web UI is 8GB.
        
         | lolinder wrote:
         | Some of the features (RAG retrieval) now use embeddings that
         | are calculated in Open WebUI rather than in Ollama or another
         | backend. It does seem like it'd be nice for them to refactor to
         | make things like that optional for those who want a simpler
         | interface, but then again, there are plenty of other lighter-
         | weight options.
        
       | gulan28 wrote:
        | You can try out https://wiz.chat (my project) if you want to run
        | Llama in your web browser. It still needs a GPU and the latest
        | version of Chrome, but it's fast enough for my usage.
        
       | amazingamazing wrote:
        | I've never seen the point of running locally. Not cost-effective,
        | worse models, etc.
        
         | ogogmad wrote:
         | Privacy?
        
         | splintercell wrote:
          | Even if it's not cost-effective, or you're just running worse
          | models, you're learning an important skill.
          | 
          | Take self-hosting your website, for instance: it may have all
          | the same considerations. But here you're getting information
          | from the LLMs, so it helps to know that the LLM is under your
          | control.
        
           | Oras wrote:
            | Self-hosting a website on a local server running in your
            | room? What's the point?
            | 
            | Same with LLMs: you can use providers who don't log requests
            | and are SOC 2 compliant.
            | 
            | Small models that run locally are a waste of time, as they
            | don't provide adequate value compared to larger models.
        
         | babyshake wrote:
         | Can you elaborate on the not cost effective part? That seems
         | surprising unless the API providers are running at a loss.
        
           | amazingamazing wrote:
           | A 4090 is $1000 at least. That's years of subscription for
           | the latest models, which don't run locally anyway.
        
             | homarp wrote:
             | unless you have one already for playing games
        
             | wongarsu wrote:
             | I run LLMs locally on a 2080TI I bought used years ago for
             | another deep learning project. It's not the fastest thing
             | in the world, but adequate for running 8B models. 70B
             | models technically work but are too slow to realistically
             | use them.
        
         | IOT_Apprentice wrote:
          | Local is private. You are not handing over your data to an AI
          | for training.
        
           | jiggawatts wrote:
           | Most major providers have EULAs that specify that they don't
           | keep your data and don't use it for training.
        
       | thangalin wrote:
        | run.sh:
        | 
        |     #!/usr/bin/env bash
        | 
        |     set -eu
        |     set -o errexit
        |     set -o nounset
        |     set -o pipefail
        | 
        |     readonly SCRIPT_SRC="$(dirname "${BASH_SOURCE[${#BASH_SOURCE[@]} - 1]}")"
        |     readonly SCRIPT_DIR="$(cd "${SCRIPT_SRC}" >/dev/null 2>&1 && pwd)"
        |     readonly SCRIPT_NAME=$(basename "$0")
        | 
        |     # Avoid issues when wine is installed.
        |     sudo su -c 'echo 0 > /proc/sys/fs/binfmt_misc/status'
        | 
        |     # Graceful exit to perform any clean up, if needed.
        |     trap terminate INT
        | 
        |     # Exits the script with a given error level.
        |     function terminate() {
        |       level=10
        | 
        |       if [ $# -ge 1 ] && [ -n "$1" ]; then level="$1"; fi
        | 
        |       exit $level
        |     }
        | 
        |     # Concatenates multiple files.
        |     join() {
        |       local -r prefix="$1"
        |       local -r content="$2"
        |       local -r suffix="$3"
        | 
        |       printf "%s%s%s" "$(cat ${prefix})" "$(cat ${content})" "$(cat ${suffix})"
        |     }
        | 
        |     # Swapping this symbolic link allows swapping the LLM without
        |     # script changes.
        |     readonly LINK_MODEL="${SCRIPT_DIR}/llm.gguf"
        | 
        |     # Dereference the model's symbolic link to its path relative
        |     # to the script.
        |     readonly PATH_MODEL="$(realpath --relative-to="${SCRIPT_DIR}" "${LINK_MODEL}")"
        | 
        |     # Extract the file name for the model.
        |     readonly FILE_MODEL=$(basename "${PATH_MODEL}")
        | 
        |     # Look up the prompt format based on the model being used.
        |     readonly PROMPT_FORMAT=$(grep -m1 ${FILE_MODEL} map.txt | sed 's/.*: //')
        | 
        |     # Guard against missing prompt templates.
        |     if [ -z "${PROMPT_FORMAT}" ]; then
        |       echo "Add prompt template for '${FILE_MODEL}'."
        |       terminate 11
        |     fi
        | 
        |     readonly FILE_MODEL_NAME=$(basename $FILE_MODEL)
        | 
        |     if [ -z "${1:-}" ]; then
        |       # Write the output to a name corresponding to the model being used.
        |       PATH_OUTPUT="output/${FILE_MODEL_NAME%.*}.txt"
        |     else
        |       PATH_OUTPUT="$1"
        |     fi
        | 
        |     # The system file defines the parameters of the interaction.
        |     readonly PATH_PROMPT_SYSTEM="system.txt"
        | 
        |     # The user file prompts the model as to what we want to generate.
        |     readonly PATH_PROMPT_USER="user.txt"
        | 
        |     readonly PATH_PREFIX_SYSTEM="templates/${PROMPT_FORMAT}/prefix-system.txt"
        |     readonly PATH_PREFIX_USER="templates/${PROMPT_FORMAT}/prefix-user.txt"
        |     readonly PATH_PREFIX_ASSIST="templates/${PROMPT_FORMAT}/prefix-assistant.txt"
        | 
        |     readonly PATH_SUFFIX_SYSTEM="templates/${PROMPT_FORMAT}/suffix-system.txt"
        |     readonly PATH_SUFFIX_USER="templates/${PROMPT_FORMAT}/suffix-user.txt"
        |     readonly PATH_SUFFIX_ASSIST="templates/${PROMPT_FORMAT}/suffix-assistant.txt"
        | 
        |     echo "Running: ${PATH_MODEL}"
        |     echo "Reading: ${PATH_PREFIX_SYSTEM}"
        |     echo "Reading: ${PATH_PREFIX_USER}"
        |     echo "Reading: ${PATH_PREFIX_ASSIST}"
        |     echo "Writing: ${PATH_OUTPUT}"
        | 
        |     # Capture the entirety of the instructions to obtain the input length.
        |     readonly INSTRUCT=$(
        |       join ${PATH_PREFIX_SYSTEM} ${PATH_PROMPT_SYSTEM} ${PATH_PREFIX_SYSTEM}
        |       join ${PATH_SUFFIX_USER} ${PATH_PROMPT_USER} ${PATH_SUFFIX_USER}
        |       join ${PATH_SUFFIX_ASSIST} "/dev/null" ${PATH_SUFFIX_ASSIST}
        |     )
        | 
        |     (
        |       echo ${INSTRUCT}
        |     ) | ./llamafile \
        |       -m "${LINK_MODEL}" \
        |       -e \
        |       -f /dev/stdin \
        |       -n 1000 \
        |       -c ${#INSTRUCT} \
        |       --repeat-penalty 1.0 \
        |       --temp 0.3 \
        |       --silent-prompt > ${PATH_OUTPUT}
        |       #--log-disable \
        | 
        |     echo "Outputs: ${PATH_OUTPUT}"
        | 
        |     terminate 0
       | 
        | map.txt:
        | 
        |     c4ai-command-r-plus-q4.gguf: cmdr
        |     dare-34b-200k-q6.gguf: orca-vicuna
        |     gemma-2-27b-q4.gguf: gemma
        |     gemma-2-7b-q5.gguf: gemma
        |     gemma-2-Ifable-9B.Q5_K_M.gguf: gemma
        |     llama-3-64k-q4.gguf: llama3
        |     llama-3-64k-q4.gguf: llama3
        |     llama-3-1048k-q4.gguf: llama3
        |     llama-3-1048k-q8.gguf: llama3
        |     llama-3-8b-q4.gguf: llama3
        |     llama-3-8b-q8.gguf: llama3
        |     llama-3-8b-1048k-q6.gguf: llama3
        |     llama-3-70b-q4.gguf: llama3
        |     llama-3-70b-64k-q4.gguf: llama3
        |     llama-3-smaug-70b-q4.gguf: llama3
        |     llama-3-giraffe-128k-q4.gguf: llama3
        |     lzlv-q4.gguf: alpaca
        |     mistral-nemo-12b-q4.gguf: mistral
        |     openorca-q4.gguf: chatml
        |     openorca-q8.gguf: chatml
        |     quill-72b-q4.gguf: none
        |     qwen2-72b-q4.gguf: none
        |     tess-yi-q4.gguf: vicuna
        |     tess-yi-q8.gguf: vicuna
        |     tess-yarn-q4.gguf: vicuna
        |     tess-yarn-q8.gguf: vicuna
        |     wizard-q4.gguf: vicuna-short
        |     wizard-q8.gguf: vicuna-short
       | 
        | Templates (all the template directories contain the same set of
        | file names, but differ in content):
        | 
        |     templates/
        |     +-- alpaca
        |     +-- chatml
        |     +-- cmdr
        |     +-- gemma
        |     +-- llama3
        |     +-- mistral
        |     +-- none
        |     +-- orca-vicuna
        |     +-- vicuna
        |     +-- vicuna-short
        |         +-- prefix-assistant.txt
        |         +-- prefix-system.txt
        |         +-- prefix-user.txt
        |         +-- suffix-assistant.txt
        |         +-- suffix-system.txt
        |         +-- suffix-user.txt
       | 
       | If there's interest, I'll make a repo.
        
       | ashleyn wrote:
       | anyone got a guide on setting up and running the business-class
       | stuff (70B models over multiple A100, etc)? i'd be willing to
       | spend the money but only if i could get a good guide on how to
       | set everything up, what hardware goes with what
       | motherboard/ram/cpu, etc.
        
         | foundry27 wrote:
         | Aye, there's the kicker. The correct configuration of hardware
         | resources to run and multiplex large models is just as much of
         | a trade secret as model weights themselves when it comes to
         | non-hobbyist usage, and I wouldn't be surprised if optimal
         | setups are in many ways deliberately obfuscated or hidden to
         | keep a competitive advantage
         | 
         | Edit: outside the HPC community specifically, I mean
        
         | talldayo wrote:
         | I don't think you're going to find much of a guide out there
         | because there isn't really a need for one. You just need a
         | Linux client with the Nvidia drivers installed and some form of
         | CUDA runtime present. You could make that happen in a mini PC,
         | a jailbroken Nintendo Switch, a gaming laptop or a 3600W 4U
         | rackmount. The "happy path" is complicated because you truly
         | have so many functional options.
         | 
         | You don't want an A100 unless you've already got datacenter
         | provisioning at your house and an empty 1U rack. I genuinely
         | cannot stress this enough - these are datacenter cards _for a
         | reason_. The best bang-for-your buck will be consumer-grade
         | cards like the 3060 and 3090, as well as the bigger devkits
         | like the Jetson Orin.
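          | 
          | As a sanity check, a minimal sketch of verifying a box like
          | that (nvidia-smi ships with the driver; CUDA_VISIBLE_DEVICES is
          | the standard way to pin a process to specific GPUs):
          | 
          |     # Confirm the driver sees every card, with VRAM per GPU.
          |     nvidia-smi --query-gpu=index,name,memory.total --format=csv
          | 
          |     # Pin a job to GPUs 0 and 1 only (e.g. a pair of 3090s).
          |     CUDA_VISIBLE_DEVICES=0,1 ollama serve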
        
       | chown wrote:
       | If anyone is looking for a one click solution without having to
       | have a Docker running, try Msty - something that I have been
       | working on for almost a year. Has RAG and Web Search built in
       | among others and can connect to your Obsidian vaults as well.
       | 
       | https://msty.app
        
         | rgovostes wrote:
         | Your top menu has an "As" button, listing how you compare to
         | alternative products. Is that the label you wanted?
        
           | ukuina wrote:
           | Might be better renamed to "Compare"
        
       | upghost wrote:
       | > Before I begin I would like to credit the thousands or millions
       | of unknown artists, coders and writers upon whose work the Large
       | Language Models(LLMs) are trained, often without due credit or
       | compensation
       | 
       | I like this. If we insist on pushing forward with GenAI we should
       | probably at least make some digital or physical monument like
       | "The Tomb of the Unknown Creator".
       | 
       | Cause they sure as sh*t ain't gettin paid. RIP.
        
       ___________________________________________________________________
       (page generated 2024-12-29 23:00 UTC)