[HN Gopher] I Run LLMs Locally
___________________________________________________________________
I Run LLMs Locally
Author : Abishek_Muthian
Score : 174 points
Date : 2024-12-29 10:49 UTC (12 hours ago)
(HTM) web link (abishekmuthian.com)
(TXT) w3m dump (abishekmuthian.com)
| koinedad wrote:
| Helpful summary, short but useful
| drillsteps5 wrote:
| I did not find it useful at all. It's just a set of links to
| various places that can be helpful in finding useful
| information.
|
| Like, the LocalLLaMA subreddit is extremely helpful. You have to
| figure out what you want, what tools you want to use (by
| reading posts and asking questions), and then find some guides
| on setting up these tools. As you proceed you will run into
| issues and can easily find most of your answers in the same
| subreddit (or the subreddits dedicated to the particular tools
| you're trying to set up) because hundreds of people had the
| same issues before.
|
| Not to say that the article is useless; it's just a semi-
| organized set of links.
| emmelaich wrote:
| It should list Simon Willison's
| https://llm.datasette.io/en/stable/
| jokethrowaway wrote:
| I have a similar pc and I use text-generation-webui and mostly
| exllama quantized models.
|
| I also deploy text-generation-webui for clients on k8s with gpu
| for similar reasons.
|
| Last I checked, llamafile / ollama are not as optimised for gpu
| use.
|
| For image generation I moved from the Automatic1111 webui to ComfyUI
| a few months ago - they're different beasts; for some workflows
| Automatic is easier to use, but for most tasks you can create a
| better workflow with enough Comfy extensions.
|
| Facefusion warrants a mention for faceswapping
| Salgat wrote:
| There is a lot I want to do with LLMs locally, but it seems like
| we're still not quite there hardware-wise (well, within
| reasonable cost). For example, Llama's smaller models take
| upwards of 20 seconds to generate a brief response on a 4090; at
| that point I'd rather just use an API to a service that can
| generate it in a couple seconds.
| pickettd wrote:
| My gut feeling is that there may be optimizations you can do for
| faster performance (but I could be wrong since I don't know
| your setup or requirements). In general on a 4090 running
| between Q6-Q8 quants my tokens/sec have been similar to what I
| see on cloud providers (for open/local models). The fastest
| local configuration I've tested is Exllama/TabbyAPI with
| speculative-decoding (and quantized cache to be able to fit
| more context)
| do_not_redeem wrote:
| You should absolutely be getting faster responses with a 4090.
| But that points to another advantage of cloud services--you
| don't have to debug your own driver issues.
| zh3 wrote:
| Depends on the model, if it doesn't fit into VRAM performance
| will suffer. Response here is immediate (at ~15 tokens/sec) on
| a pair of ebay RTX 3090s in an ancient i3770 box.
|
| Even if your model does fit into VRAM, there will be a startup
| pause if it's getting ejected between requests. Try setting
| OLLAMA_KEEP_ALIVE (see
| https://github.com/ollama/ollama/blob/main/docs/faq.md#how-d...).
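|
| A minimal sketch of that (values per the FAQ above: a duration
| string like "24h" keeps the model resident that long after the
| last request, -1 keeps it loaded indefinitely, 0 unloads it
| immediately):
|
|     # server-wide default
|     export OLLAMA_KEEP_ALIVE=24h
|     ollama serve
|
|     # or per request, via the API's keep_alive field
|     curl http://localhost:11434/api/generate \
|       -d '{"model": "llama3.3", "keep_alive": -1}'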
| e12e wrote:
| >> within reasonable cost
|
| > pair of ebay RTX 3090s
|
| So... 1700 USD?
| drillsteps5 wrote:
| If you're looking for most bang for the buck 2x3060(12Gb)
| might be the best bet. GPUs will be around $400-$600.
| zh3 wrote:
| £1200, so a little less. Targeted at having 48GB
| (2x24GB) of VRAM for running the larger models; having said
| that, a single 12GB RTX 3060 in another box seems pretty
| close in local testing (with smaller models).
| drillsteps5 wrote:
| Have been trying forever to find a coherent guide on building
| dual-GPU box for this purpose, do you know of any? Like
| selecting the MB, the case, cooling, power supply and cables,
| any special voodoo required to pair the GPUs etc.
| zh3 wrote:
| I'm not aware of any particular guides, the setup here was
| straightforward - an old motherboard with two PCIe X16
| slots (Asus P8Z77V or P8Z77WS), a big enough power supply
| (Seasonic 850W) and the stock Linux Nvidia drivers. The
| RTX 3090's are basic Dell models (i.e. not OC'ed gamer
| versions), and worth noting they only get hot if used
| continuously - if you're the only one using them, the fans
| spin up during a query and back down between. A good 'smoke
| test' is something like:
|
|     while true; do ollama run llama3.3 "Explain cosmology"; done
|
| With llama3.3 70B, two RTX3090s gives you 48GB of VRAM and
| the model uses about 44Gb; so the first start is slow
| (loading the model into VRAM) but after that response is
| fast (subject to comment above about KEEP_ALIVE).
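|
| To watch both cards while that smoke test runs (plain nvidia-smi,
| nothing model-specific), something like:
|
|     watch -n 1 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv'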
| segmondy wrote:
| This is very wrong. Smaller models generate responses
| instantly on a 4090. I run 3090's and easily get 30-50
| tokens/second with small models. Folks with a 4090 will easily
| see 80 tokens/sec for a 7-8B model in Q8 and probably 120-160
| for 3B models. Faster than most public APIs.
| hedgehog wrote:
| 8B models are pretty fast even on something like a 3060
| depending on deployment method (for example Q4 on Ollama).
| kolbe wrote:
| They might be talking about 70B
| nobodyandproud wrote:
| Wait, is 70b considered on the smaller side these days?
| kolbe wrote:
| Relative to deepseek's latest, yeah
| imiric wrote:
| Sure, except those smaller models are only useful for some
| novelty and creative tasks. Give them a programming or logic
| problem and they fall flat on their face.
|
| As I mentioned in a comment upthread, I find ~30B models to
| be the minimum for getting somewhat reliable output. Though
| even ~70B models pale in comparison with popular cloud LLMs.
| Local LLMs just can't compete with the quality of cloud
| services, so they're not worth using for most professional
| tasks.
| mtkd wrote:
| The consumer hardware channel just hasn't caught up yet --
| we'll see a lot more desktop kit appear in 2025 on retail sites
| (there is a small site in UK selling Nvidia A100 workstations
| for £100K+ each on a Shopify store)
|
| Seem to remember a similar point in late 90s and having to
| build boxes to run NT/SQL7.0 for local dev
|
| Expect there will be a swing back to on-prem once enterprise
| starts moving faster and the legal teams begin to understand
| what is happening data-side with RAG, agents etc.
| fooker wrote:
| > Nvidia A100 workstations for £100K+
|
| This seems extremely overpriced.
| mtkd wrote:
| I've not looked properly but I think it's 4 x 80gb A100s
| ojbyrne wrote:
| Maybe this?
|
| https://www.ctoservers.com/nvidia-dgx-station-a100---
| quad-nv...
|
| £80k
| jhatemyjob wrote:
| yeah many people don't understand how cheap it is to use the
| chatgpt API
|
| not to mention all of the other benefits of delegating all the
| work of setting up the GPUs, public HTTP server, designing the
| API, security, keeping the model up-to-date with the state of
| the art, etc
|
| reminds me of the people in the 2000s / early 2010s who would
| build their own linux boxes back when the platform was super
| unstable, constantly fighting driver issues etc instead of just
| getting a mac.
|
| roll-your-own-LLM makes even less sense. At least for those
| early 2000s Linux guys, even if you spent an ungodly amount of
| time going through the Arch wiki or compiling Gentoo or
| whatever, those skills are somewhat transferable to
| sysadmin/SRE. I don't see how setting up your own instance of
| ollama builds any transferable skills.
|
| the only way i could see it making sense is if you're doing
| super cutting edge stuff that necessitates owning a tinybox, or
| if you're trying to get a job at openAI or anysphere
| eikenberry wrote:
| Depends on what you value. Many people value keeping general
| purpose computing free/libre and available to as many as
| possible. This means using free systems and helping those
| systems mature.
| o11c wrote:
| Even on CPU, you should get the start of a response within 5
| seconds for Q4 8B-or-smaller Llama models (proportionally
| faster for smaller ones), which then stream at several tokens
| per second.
|
| There are a _lot_ of things to criticize about LLMs (the answer
| is quite likely to ignore what you're actually asking, for
| example) but your speed problem looks like a config issue
| instead. Are you calling the API in streaming mode?
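|
| For example, against ollama's HTTP API (a sketch; /api/generate
| streams token-by-token by default, while "stream": false blocks
| until the whole answer is ready, which can feel like a 20-second
| hang):
|
|     # streamed: first tokens arrive almost immediately
|     curl http://localhost:11434/api/generate \
|       -d '{"model": "llama3.1:8b", "prompt": "Why is the sky blue?"}'
|
|     # non-streamed: nothing comes back until generation finishes
|     curl http://localhost:11434/api/generate \
|       -d '{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'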
| nobodyandproud wrote:
| You may have been running with CPU inference, or running models
| that don't fit your VRAM.
|
| I was running a 5 bit quantized model of codestral 22b with a
| Radeon RX 7900 (20 gb), compiled with Vulkan only.
|
| Eyeballing it only, but the prompt responses were maybe 2x or 3x
| slower than OpenAI's GPT-4o (maybe 2-4 seconds for most
| paragraph-long responses).
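|
| For anyone wanting to reproduce that kind of Vulkan-only setup, a
| rough sketch with llama.cpp (build flag names have shifted between
| versions, and the GGUF filename here is just a placeholder):
|
|     git clone https://github.com/ggerganov/llama.cpp
|     cd llama.cpp
|     cmake -B build -DGGML_VULKAN=ON
|     cmake --build build --config Release -j
|
|     # -ngl 99 offloads all layers to the GPU via Vulkan
|     ./build/bin/llama-cli -m codestral-22b-v0.1-q5_k_m.gguf -ngl 99 \
|       -p "Write a binary search in C."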
| imiric wrote:
| To me it's not even about performance (speed). It's just that
| the quality gap between cloud LLM services and local LLMs is
| still quite large, and seems to be increasing. Local LLMs have
| gotten better in the past year, but cloud LLMs have even more
| so. This is partly because large companies can afford to keep
| throwing more compute at the problem, while quality at smaller
| scale deployments is not increasing at the same pace.
|
| I have a couple of 3090s and have tested most of the popular
| local LLMs (Llama3, DeepSeek, Qwen, etc.) at the highest
| possible settings I can run them comfortably (~30B@q8, or
| ~70B@q4), and they can't keep up with something like Claude 3.5
| Sonnet. So I find myself just using Sonnet most of the time,
| instead of fighting with hallucinated output. Sonnet still
| hallucinates and gets things wrong a lot, but not as often as
| local LLMs do.
|
| Maybe if I had more hardware I could run larger models at
| higher quants, but frankly, I'm not sure it would make a
| difference. At the end of the day, I want these tools to be as
| helpful as possible and not waste my time, and local LLMs are
| just not there yet.
| sturza wrote:
| 4090 has 24gb vram, not 16.
| vidyesh wrote:
| The mobile RTX 4090 has 16GB
| https://www.notebookcheck.net/NVIDIA-GeForce-RTX-4090-Laptop...
| deadbabe wrote:
| My understanding is that local LLMs are mostly just toys that
| output basic responses, and simply can't compete with full LLMs
| trained with $60 million+ worth of compute time, and that no
| matter how good hardware gets, larger companies will always have
| even better hardware and resources to output even better results,
| so basically this is pointless for anything competitive or
| serious. Is this accurate?
| quest88 wrote:
| That's a strange question. It sounds like you're saying the
| author isn't serious in their use of LLMs. Can you describe
| what you mean by competitive and serious?
|
| You're commenting on a post about how the author runs LLMs
| locally because they find them useful. Do you think they would
| run them and write an entire post on how they use it if they
| didn't find it useful? The author is seriously using them.
| prophesi wrote:
| It's the missing piece of the article. How is their
| experience running LLMs locally compared to GPT-4o / Sonnet
| 3.5 / etc., outside of data privacy?
| th0ma5 wrote:
| In my experience, if you can make a reasonable guess as to
| the total contained information about your topic within the
| given size of model, then you can be very specific with
| your context and ask and it is generally as good as naive
| use of any of the larger models. Honestly, Llama3.2 is as
| accurate as anything and runs reasonably quick on a CPU.
| Larger models to me mostly increase the surface area for
| potential error and get exhausting to read even when you
| ask them to get on with it.
| pickettd wrote:
| Depends on what benchmarks/reports you trust I guess (and how
| much hardware you have for local models either in-person or in-
| cloud). https://aider.chat/docs/leaderboards/ has Deepseek v3
| scoring higher than most closed LLMs on coding (but it is a
| huge local model). And https://livebench.ai has QwQ scoring
| quite high in the reasoning category (and that is relatively
| easy to run locally but it doesn't score super high in other
| categories).
| alkonaut wrote:
| If big models require big hardware how come there are so many
| free LLMs? How much compute is one ChatGPT interaction with the
| default model? Is it a massive loss leader for OpenAI?
| drillsteps5 wrote:
| A big part of it is the fact that Meta tries very hard to not
| lose the relevance against OpenAI, so they keep building and
| releasing a lot of their models for free. I would imagine
| they're taking huge losses on these but it's a price they
| seem to be willing to pay for being late to the party.
|
| The same applies (to smaller extent) to Google, and more
| recently to Alibaba.
|
| A lot of free (not open-source, mind you) LLMs are either
| from Meta or built by other smaller outfits by tweaking
| Meta's LLMs (there are exceptions, like Mistral).
|
| One thing to keep in mind is that while training a model
| requires A LOT of data, manual intervention, hardware, power,
| and time, using these LLMs (inference, i.e. the forward pass
| through the neural network) really doesn't. Right now it
| typically requires an Nvidia GPU with lots of VRAM, but I have
| no doubt that in just a few months/maybe a year someone will
| find a much easier way to do it without sacrificing speed.
| drillsteps5 wrote:
| This is not my experience, but it depends on what you consider
| "competitive" and "serious".
|
| While it does appear that a large number of people are using
| these for RP and... um... similar stuff, I do find code
| generation to be fairly good, esp with some recent Qwens (from
| Alibaba). Full disclaimer: I do use this sparingly, either to
| get the boilerplate, or to generate a complete function/module
| with clearly defined specs, and I sometimes have to re-factor
| the output to fit my style and preferences.
|
| I also use various general models (mostly Meta's) fairly
| regularly to come up with and discuss various business and
| other ideas, and get general understanding in the areas I want
| to get more knowledge of (both IT and non-IT), which helps when
| I need to start digging deeper into the details.
|
| I usually run quantized versions (I have older GPU with lower
| VRAM).
|
| Wordsmithing and generating illustrations for my blog articles
| (I prefer Plex, it's actually fairly good with adding various
| captions to illustrations; the built-in image generator in
| ChatGPT was still horrible at it when I tried it a few months
| ago).
|
| Some resume tweaking as part of manual workflow (so multiple
| iterations with checking back and forth between what LLM gave
| me and my version).
|
| So it's mostly stuff for my own personal consumption, that I
| don't necessarily trust the cloud with.
|
| If you have a SaaS idea with an LLM or other generative AI at
| its core, processing the requests locally is probably not the
| best choice. Unless you're prototyping, in which case it can
| help.
| meta_x_ai wrote:
| Why would you waste time on good models when there are great
| models?
| drillsteps5 wrote:
| Good models are good enough for me, meta_x_ai, I gain
| experience by setting them up and following up on industry
| trends, and I don't trust OpenAI (or MSFT, or Google, or
| whatever) with my information. No, I do not do anything
| illegal or unethical, but it's not the point.
| pavlov wrote:
| The leading provider of open models used for local LLM setups
| is Meta (with the Llama series). They have enormous resources
| to train these models.
|
| They're giving away these expensive models because it doesn't
| hurt Meta's ad business but reduces risk that competitors could
| grow moats using closed models. The old "commoditize your
| complements" play.
| johnklos wrote:
| There's a huge difference between training and running models.
|
| The multimillion (or billion) dollar collections of hardware
| produce the models that we (people who run LLMs locally)
| run. The we-host-LLMs-for-money companies do the same with
| their non-open datasets, and their data isn't all that much
| fancier. Open LLMs are catching up to closed ones, and the
| competition means everyone wins except for the victims of
| Nvidia's price gouging.
|
| This is a bit like confusing the resources that go in to making
| a game with the resources needed to run and play that game.
| They're worlds different.
| prettyStandard wrote:
| Even Nvidia's victims are better off... (O)llama runs on AMD
| GPUs
| kolbe wrote:
| I think the bigger issue is hardware utilization. Meta and
| Qwen's medium sized models are fantastic and competitive with 6
| month old offerings from OpenAI and Claude, but you need to
| have $2500 of hardware to run it. If you're going to be
| spinning this 24/7, sure the API costs for OpenAI and Anthropic
| would be insane, but as far as normal personal usage patterns
| go, you'd probably spend $20/month. So, either spend $2500 that
| will depreciate to $1000 in two years or $580 in API costs?
| philjohn wrote:
| Not entirely.
|
| TRAINING an LLM requires a lot of compute. Running inference on
| a pre-trained LLM is less computationally expensive, to the
| point where you can run LLAMA (cost $$$ to train on Meta's GPU
| cluster) with CPU-based inference.
| halyconWays wrote:
| Super basic intro but perhaps useful. Doesn't mention quant
| sizes, which is important when you're GPU poor. Lots of other
| client-side things you can do too, like KoboldAI, TavernAI, Jan,
| LangFuse for observability, CogVLM2 for a vision model.
|
| One of the best places to get the latest info on what people are
| doing with local models is /lmg/ on 4chan's /g/
| pvo50555 wrote:
| There was a post a few weeks back (or a reply to a post) showing
| an app entirely made using an LLM. It was like a 3D globe made
| with three.js, and I believe the poster had created it locally on his
| M4 MacBook with 96 GB RAM? I can't recall which model it was or
| what else the app did, but maybe someone knows what I'm talking
| about?
| dumbfounder wrote:
| Updation. That's a new word for me. I like it.
| sccomps wrote:
| It's quite common in India but I guess not widely accepted
| internationally. If there can be deletion, then why not
| updation?
| exe34 wrote:
| that's double plus good.
| rspoerri wrote:
| I run a pretty similar setup on an m2-max - 96gb.
|
| Just for AI image generation i would rather recommend krita with
| the https://github.com/Acly/krita-ai-diffusion plugin.
| donquichotte wrote:
| Are you using an external provider for image generation or
| running something locally?
| Der_Einzige wrote:
| Still nothing better than oobabooga
| (https://github.com/oobabooga/text-generation-webui) in terms of
| maximalism/"Pro"/"Prosumer" LLM UI/UX ALA Blender, Photoshop,
| Final Cut Pro, etc.
|
| Embarrassing and any VCs reading this can contact me to talk
| about how to fix that. lm-studio is today the closest competition
| (but not close enough) and Adobe or Microsoft could do it if they
| fired their current folks who prevent this from happening.
|
| If you're not using Oobabooga, you're likely not playing with the
| settings on models, and if you're not playing with your model's
| settings, you're hardly even scratching the surface of its total
| capabilities.
| theropost wrote:
| this.
| dividefuel wrote:
| What GPU offers a good balance between cost and performance for
| running LLMs locally? I'd like to do more experimenting, and am
| due for a GPU upgrade from my 1080 anyway, but would like to
| spend less than $1600...
| redacted wrote:
| Nvidia for compatibility, and as much VRAM as you can afford.
| Shouldn't be hard to find a 3090 / Ti in your price range. I
| have had decent success with a base 3080 but the 10GB really
| limits the models you can run
| yk wrote:
| Second the other comment, as much vram as possible. A 3060 has
| 12 GB at a reasonable price point. (And is not too limiting.)
| grobbyy wrote:
| There's a huge step up in capability with 16GB and 24GB, for
| not too much more. The 4060 has a 16GB version, for example.
| On the cheap end, the Intel Arc does too.
|
| Next major step up is 48GB and then hundreds of GB. But a lot
| of ML models target 16-24gb since that's in the grad student
| price range.
| bloomingkales wrote:
| Honestly I think if you just want to do inferencing the 7600xt
| and rx6800 have 16gb at $300 and $400 on Amazon. It's gonna be
| my stop gap until whatever. The RX6800 has better memory
| bandwidth than the 4060ti (think it matches the 4070).
| kolbe wrote:
| AMD GPUs are a fantastic deal until you hit a problem. Some
| models/frameworks it works great. Others, not so much.
| bloomingkales wrote:
| For sure but I think people on the fine
| tuning/training/stable diffusion side are more concerned
| with that. They make a big fuss about this and basically
| talk people out of a perfectly good and well priced 16gb
| vram card that literally works out of the box with ollama,
| lmstudio for text inferencing.
|
| Kind of one of the reasons AMD is a sleeper stock for me.
| If people only knew.
| kolbe wrote:
| If you want to wait until the 5090s come out, you should see a
| drop in the price of the 30xx and 40xx series. Right now,
| shopping used, you can get two 3090s or two 4080s in your price
| range. Conventional wisdom says two 3090s would be better, but
| this is all highly dependent on what models you want to run.
| Basically the first requirement is to have enough VRAM to host
| all of your model on it, and secondarily, the quality of the
| GPU.
|
| Have a look through Hugging Face to see which models interest
| you. A rough estimate for the amount of VRAM you need is half
| the model size plus a couple gigs. So, if using the 70B models
| interests you, two 4080s wouldn't fit it, but two 3090s would.
| If you're just interested in the 1B, 3B and 7B models (llama 3B
| is fantastic), you really don't need much at all. A single 3060
| can handle that, and those are not expensive.
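|
| A tiny sketch of that arithmetic (assuming ~Q4 quantization, i.e.
| roughly half a byte per parameter, and ignoring KV-cache growth
| beyond the couple of spare gigs; the helper function is just a
| throwaway for illustration):
|
|     # rough VRAM estimate in GB for a Q4 model of N billion parameters
|     estimate_vram_gb() { echo "$(( $1 / 2 + 2 ))"; }
|
|     estimate_vram_gb 70   # ~37 GB: fits 2x3090 (48 GB), not 2x4080 (32 GB)
|     estimate_vram_gb 8    # ~6 GB: comfortable on a single 3060 (12 GB)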
| zitterbewegung wrote:
| Get a new/used 3090; it has 24GB of VRAM and it's below $1600.
| christianqchung wrote:
| A lot of moderate power users are running an undervolted used
| pair of 3090s on a 1000-1200W PSU. 48 GB of VRAM lets you
| run 70B models at Q4 with 16k context.
|
| If you use speculative decoding (a small model generates
| tokens verified by a larger model, I'm not sure on the
| specifics) you can get past 20 tokens per second it seems.
| You can also fit 32B models like Qwen/Qwen Coder at Q6 with
| lots of context this way, with spec decoding, closer to 40+
| tks/s.
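|
| If you want to try speculative decoding without a second tool,
| llama.cpp's server exposes it through a draft-model flag (a
| sketch; the model files and context size here are placeholders):
|
|     ./build/bin/llama-server \
|       -m  qwen2.5-coder-32b-instruct-q6_k.gguf \
|       -md qwen2.5-coder-0.5b-instruct-q8_0.gguf \
|       -ngl 99 -ngld 99 -c 16384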
| adam_arthur wrote:
| Inferencing does not require Nvidia GPUs at all, and it's almost
| criminal to be recommending dedicated GPUs with only 12GB of
| VRAM.
|
| Buy a MacMini or MacbookPro with RAM maxed out.
|
| I just bought an M4 mac mini for exactly this use case that has
| 64GB for ~2k. You can get 128GB on the MBP for ~5k. These will
| run much larger (and more useful) models.
|
| EDIT: Since the request was for < $1600, you can still get a
| 32GB mac mini for $1200 or 24GB for $800
| natch wrote:
| Reasonable? $7,000 for a laptop is pretty up there.
|
| [Edit: OK I see I am adding cost when checking due to
| choosing a larger SSD drive, so $5,000 is more of a fair
| bottom price, with 1TB of storage.]
|
| Responding specifically to this very specific claim: "Can get
| 128GB of ram for a reasonable price."
|
| I'm open to your explanation of how this is reasonable -- I
| mean, you didn't say cheap, to be fair. Maybe 128GB of ram on
| GPUs would be way more (that's like 6 x 4090s), is what
| you're saying.
|
| For anyone who wants to reply with other amounts of memory,
| that's not what I'm talking about here.
|
| But on another point, do you think the ram really buys you
| the equivalent of GPU memory? Is Apple's melding of CPU/GPU
| really that good?
|
| I'm not just coming from a point of skepticism, I'm actually
| kind of hoping to be convinced you're right, so wanting to
| hear the argument in more detail.
| adam_arthur wrote:
| It's reasonable in a "working professional who gets
| substantial value from" or "building an LLM driven startup
| project" kind of way.
|
| It's not for the casual user, but for somebody who derives
| significant value from running it locally.
|
| Personally I use the MacMini as a hub for a project I'm
| working on as it gives me full control and is simply much
| cheaper operationally. A one time ~$2000 cost isn't so bad
| for replacing tasks that a human would have to do. e.g. In
| my case I'm parsing loosely organized financial documents
| where structured data isn't available.
|
| I suspect the hardware costs will continue to decline
| rapidly as they have in the past though, so that $5k for
| 128GB will likely be $5k for 256GB in a year or two, and so
| on.
|
| We're almost at the inflection point where really powerful
| models are able to be inferenced locally for cheap
| talldayo wrote:
| > its almost criminal to be recommending dedicated GPUs with
| only 12GB of RAM.
|
| If you already own a PC, it makes a hell of a lot more sense
| to spend $900 on a 3090 than it does to spec out a Mac Mini
| with 24gb of RAM. Plus, the Nvidia setup can scale to as many
| GPUs as you own which gives you options for upgrading that
| Apple wouldn't be caught dead offering.
|
| Oh, and native Linux support that doesn't suck balls is a
| plus. I haven't benchmarked a Mac since the M2 generation,
| but the figures I can find put the M4 Max's compute somewhere
| near the desktop 3060 Ti:
| https://browser.geekbench.com/opencl-benchmarks
| adam_arthur wrote:
| A Mac Mini with 24GB is ~$800 at the cheapest
| configuration. I can respect wanting to do a single part
| upgrade, but if you're using these LLMs for serious work,
| the price/perf for inferencing is far in favor of using
| Macs at the moment.
|
| You can easily use the MacMini as a hub for running the LLM
| while you do work on your main computer (and it won't eat
| up your system resources or turn your primary computer into
| a heater)
|
| I hope that more non-mac PCs come out optimized for high
| RAM SoC, I'm personally not a huge Apple fan but use them
| begrudgingly.
|
| Also your $900 quote is a used/refurbished GPU. I've had
| plenty of GPUs burn out on me in the old days, not sure how
| it is nowadays, but that's a lot to pay for a used part IMO
| fragmede wrote:
| if you're doing serious work, performance is more
| important than getting a good price/perf ratio, and a
| pair of 3090s is gonna be faster. It does depend on your
| budget, though, as that configuration is a bit more
| expensive.
| adam_arthur wrote:
| Whether performance or cost is more important depends on
| your use case. Some tasks that an LLM can do very well
| may not need to be done often, or even particularly
| quickly (as in my case).
|
| e.g. LLM as one step of an ETL-style pipeline
|
| Latency of the response really only matters if that
| response is user facing and is being actively awaited by
| the user
| 2-3-7-43-1807 wrote:
| how about heat dissipation, which I assume puts an MBP at a
| disadvantage compared to a PC?
| elorant wrote:
| I consider the RTX 4060 Ti as the best entry level GPU for
| running small models. It has 16GBs of RAM which gives you
| plenty of space for running large context windows and Tensor
| Cores which are crucial for inference. For larger models
| probably multiple RTX 3090s since you can buy them on the cheap
| on the second hand market.
|
| I don't have experience with AMD cards so I can't vouch for
| them.
| fnqi8ckfek wrote:
| I know nothing about gpus. Should I be assuming that when
| people say "ram" in the context of gpus they always mean
| vram?
| layer8 wrote:
| "GPU with xx RAM" means VRAM, yes.
| throwaway314155 wrote:
| Open WebUI sure does pull in a lot of dependencies... Do I really
| need all of langchain, pytorch, and plenty others for what is
| advertised as a _frontend_?
|
| Does anyone know of a lighter/minimalist version?
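|
| Not a frontend exactly, but if all you need is something minimal on
| top of ollama, its own HTTP API is easy to drive directly (sketch):
|
|     curl http://localhost:11434/api/chat -d '{
|       "model": "llama3.1:8b",
|       "messages": [{"role": "user", "content": "Hello"}],
|       "stream": false
|     }'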
| noman-land wrote:
| https://llamafile.ai/
| throwaway314155 wrote:
| I love what llamafile is doing, but I'm primarily interested
| in a frontend for ollama, as I prefer their method of
| model/weights distribution. Unless I'm wrong, llamafile
| serves as both the frontend and backend.
| DrBenCarson wrote:
| LM Studio?
| figmert wrote:
| I ran it temporarily in Docker the other day. It was an 8GB
| image. I'm unsure why a web UI is 8GB.
| lolinder wrote:
| Some of the features (RAG retrieval) now use embeddings that
| are calculated in Open WebUI rather than in Ollama or another
| backend. It does seem like it'd be nice for them to refactor to
| make things like that optional for those who want a simpler
| interface, but then again, there are plenty of other lighter-
| weight options.
| gulan28 wrote:
| You can try out https://wiz.chat (my project) if you want to run
| Llama in your web browser. It still needs a GPU and the latest
| version of Chrome, but it's fast enough for my usage.
| amazingamazing wrote:
| I've never seen the point of running locally. Not cost
| effective, worse models, etc.
| ogogmad wrote:
| Privacy?
| splintercell wrote:
| Even if it's not cost-effective, or you're just running worse
| models, you're learning an important skill.
|
| Take self-hosting your website, for instance: it has many of the
| same considerations, but here you're getting information from the
| LLMs, and it helps to know that the LLM is under your control.
| Oras wrote:
| Self-hosting a website on a local server running in your room?
| What's the point?
|
| Same with LLMs: you can use providers who don't log requests
| and are SOC 2 compliant.
|
| Small models that run locally are a waste of time as they
| don't offer adequate value compared to larger models.
| babyshake wrote:
| Can you elaborate on the not cost effective part? That seems
| surprising unless the API providers are running at a loss.
| amazingamazing wrote:
| A 4090 is $1000 at least. That's years of subscription for
| the latest models, which don't run locally anyway.
| homarp wrote:
| unless you have one already for playing games
| wongarsu wrote:
| I run LLMs locally on a 2080TI I bought used years ago for
| another deep learning project. It's not the fastest thing
| in the world, but adequate for running 8B models. 70B
| models technically work but are too slow to realistically
| use them.
| IOT_Apprentice wrote:
| Local is private. You are not handing over your data to an AI
| provider for training.
| jiggawatts wrote:
| Most major providers have EULAs that specify that they don't
| keep your data and don't use it for training.
| thangalin wrote:
| run.sh:
|
|     #!/usr/bin/env bash
|     set -eu
|     set -o errexit
|     set -o nounset
|     set -o pipefail
|
|     readonly SCRIPT_SRC="$(dirname "${BASH_SOURCE[${#BASH_SOURCE[@]} - 1]}")"
|     readonly SCRIPT_DIR="$(cd "${SCRIPT_SRC}" >/dev/null 2>&1 && pwd)"
|     readonly SCRIPT_NAME=$(basename "$0")
|
|     # Avoid issues when wine is installed.
|     sudo su -c 'echo 0 > /proc/sys/fs/binfmt_misc/status'
|
|     # Graceful exit to perform any clean up, if needed.
|     trap terminate INT
|
|     # Exits the script with a given error level.
|     function terminate() {
|       level=10
|
|       if [ $# -ge 1 ] && [ -n "$1" ]; then level="$1"; fi
|
|       exit $level
|     }
|
|     # Concatenates multiple files.
|     join() {
|       local -r prefix="$1"
|       local -r content="$2"
|       local -r suffix="$3"
|
|       printf "%s%s%s" "$(cat ${prefix})" "$(cat ${content})" "$(cat ${suffix})"
|     }
|
|     # Swapping this symbolic link allows swapping the LLM without script changes.
|     readonly LINK_MODEL="${SCRIPT_DIR}/llm.gguf"
|
|     # Dereference the model's symbolic link to its path relative to the script.
|     readonly PATH_MODEL="$(realpath --relative-to="${SCRIPT_DIR}" "${LINK_MODEL}")"
|
|     # Extract the file name for the model.
|     readonly FILE_MODEL=$(basename "${PATH_MODEL}")
|
|     # Look up the prompt format based on the model being used.
|     readonly PROMPT_FORMAT=$(grep -m1 ${FILE_MODEL} map.txt | sed 's/.*: //')
|
|     # Guard against missing prompt templates.
|     if [ -z "${PROMPT_FORMAT}" ]; then
|       echo "Add prompt template for '${FILE_MODEL}'."
|       terminate 11
|     fi
|
|     readonly FILE_MODEL_NAME=$(basename $FILE_MODEL)
|
|     if [ -z "${1:-}" ]; then
|       # Write the output to a name corresponding to the model being used.
|       PATH_OUTPUT="output/${FILE_MODEL_NAME%.*}.txt"
|     else
|       PATH_OUTPUT="$1"
|     fi
|
|     # The system file defines the parameters of the interaction.
|     readonly PATH_PROMPT_SYSTEM="system.txt"
|
|     # The user file prompts the model as to what we want to generate.
|     readonly PATH_PROMPT_USER="user.txt"
|
|     readonly PATH_PREFIX_SYSTEM="templates/${PROMPT_FORMAT}/prefix-system.txt"
|     readonly PATH_PREFIX_USER="templates/${PROMPT_FORMAT}/prefix-user.txt"
|     readonly PATH_PREFIX_ASSIST="templates/${PROMPT_FORMAT}/prefix-assistant.txt"
|     readonly PATH_SUFFIX_SYSTEM="templates/${PROMPT_FORMAT}/suffix-system.txt"
|     readonly PATH_SUFFIX_USER="templates/${PROMPT_FORMAT}/suffix-user.txt"
|     readonly PATH_SUFFIX_ASSIST="templates/${PROMPT_FORMAT}/suffix-assistant.txt"
|
|     echo "Running: ${PATH_MODEL}"
|     echo "Reading: ${PATH_PREFIX_SYSTEM}"
|     echo "Reading: ${PATH_PREFIX_USER}"
|     echo "Reading: ${PATH_PREFIX_ASSIST}"
|     echo "Writing: ${PATH_OUTPUT}"
|
|     # Capture the entirety of the instructions to obtain the input length.
|     readonly INSTRUCT=$(
|       join ${PATH_PREFIX_SYSTEM} ${PATH_PROMPT_SYSTEM} ${PATH_SUFFIX_SYSTEM}
|       join ${PATH_PREFIX_USER} ${PATH_PROMPT_USER} ${PATH_SUFFIX_USER}
|       join ${PATH_PREFIX_ASSIST} "/dev/null" ${PATH_SUFFIX_ASSIST}
|     )
|
|     ( echo ${INSTRUCT} ) | ./llamafile \
|       -m "${LINK_MODEL}" \
|       -e \
|       -f /dev/stdin \
|       -n 1000 \
|       -c ${#INSTRUCT} \
|       --repeat-penalty 1.0 \
|       --temp 0.3 \
|       --silent-prompt > ${PATH_OUTPUT}
|       #--log-disable
|
|     echo "Outputs: ${PATH_OUTPUT}"
|
|     terminate 0
|
| map.txt:
|
|     c4ai-command-r-plus-q4.gguf: cmdr
|     dare-34b-200k-q6.gguf: orca-vicuna
|     gemma-2-27b-q4.gguf: gemma
|     gemma-2-7b-q5.gguf: gemma
|     gemma-2-Ifable-9B.Q5_K_M.gguf: gemma
|     llama-3-64k-q4.gguf: llama3
|     llama-3-1048k-q4.gguf: llama3
|     llama-3-1048k-q8.gguf: llama3
|     llama-3-8b-q4.gguf: llama3
|     llama-3-8b-q8.gguf: llama3
|     llama-3-8b-1048k-q6.gguf: llama3
|     llama-3-70b-q4.gguf: llama3
|     llama-3-70b-64k-q4.gguf: llama3
|     llama-3-smaug-70b-q4.gguf: llama3
|     llama-3-giraffe-128k-q4.gguf: llama3
|     lzlv-q4.gguf: alpaca
|     mistral-nemo-12b-q4.gguf: mistral
|     openorca-q4.gguf: chatml
|     openorca-q8.gguf: chatml
|     quill-72b-q4.gguf: none
|     qwen2-72b-q4.gguf: none
|     tess-yi-q4.gguf: vicuna
|     tess-yi-q8.gguf: vicuna
|     tess-yarn-q4.gguf: vicuna
|     tess-yarn-q8.gguf: vicuna
|     wizard-q4.gguf: vicuna-short
|     wizard-q8.gguf: vicuna-short
|
| Templates (all the template directories contain the same set of
| file names, but differ in content):
|
|     templates/
|     +-- alpaca
|     +-- chatml
|     +-- cmdr
|     +-- gemma
|     +-- llama3
|     +-- mistral
|     +-- none
|     +-- orca-vicuna
|     +-- vicuna
|     +-- vicuna-short
|         +-- prefix-assistant.txt
|         +-- prefix-system.txt
|         +-- prefix-user.txt
|         +-- suffix-assistant.txt
|         +-- suffix-system.txt
|         +-- suffix-user.txt
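|
| For illustration (not necessarily the exact contents used above), a
| llama3 template set would just be the standard Llama 3 chat markers
| split into prefix/suffix pieces, something like:
|
|     # templates/llama3/prefix-system.txt
|     <|begin_of_text|><|start_header_id|>system<|end_header_id|>
|
|     # templates/llama3/suffix-system.txt
|     <|eot_id|>
|
|     # templates/llama3/prefix-user.txt
|     <|start_header_id|>user<|end_header_id|>
|
|     # templates/llama3/suffix-user.txt
|     <|eot_id|>
|
|     # templates/llama3/prefix-assistant.txt
|     <|start_header_id|>assistant<|end_header_id|>
|
|     # templates/llama3/suffix-assistant.txt (left empty so generation continues)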
|
| If there's interest, I'll make a repo.
| ashleyn wrote:
| anyone got a guide on setting up and running the business-class
| stuff (70B models over multiple A100, etc)? i'd be willing to
| spend the money but only if i could get a good guide on how to
| set everything up, what hardware goes with what
| motherboard/ram/cpu, etc.
| foundry27 wrote:
| Aye, there's the kicker. The correct configuration of hardware
| resources to run and multiplex large models is just as much of
| a trade secret as model weights themselves when it comes to
| non-hobbyist usage, and I wouldn't be surprised if optimal
| setups are in many ways deliberately obfuscated or hidden to
| keep a competitive advantage
|
| Edit: outside the HPC community specifically, I mean
| talldayo wrote:
| I don't think you're going to find much of a guide out there
| because there isn't really a need for one. You just need a
| Linux client with the Nvidia drivers installed and some form of
| CUDA runtime present. You could make that happen in a mini PC,
| a jailbroken Nintendo Switch, a gaming laptop or a 3600W 4U
| rackmount. The "happy path" is complicated because you truly
| have so many functional options.
|
| You don't want an A100 unless you've already got datacenter
| provisioning at your house and an empty 1U rack. I genuinely
| cannot stress this enough - these are datacenter cards _for a
| reason_. The best bang-for-your buck will be consumer-grade
| cards like the 3060 and 3090, as well as the bigger devkits
| like the Jetson Orin.
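|
| A quick sanity check for that baseline (assuming the driver and
| toolkit come from your distro's packages; the toolkit is only
| strictly needed if you compile things yourself):
|
|     nvidia-smi                     # driver loaded, lists GPUs and VRAM
|     nvcc --version                 # CUDA toolkit present
|     ollama run llama3.2 "hello"    # quick end-to-end test on the GPU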
| chown wrote:
| If anyone is looking for a one click solution without having to
| have a Docker running, try Msty - something that I have been
| working on for almost a year. Has RAG and Web Search built in
| among others and can connect to your Obsidian vaults as well.
|
| https://msty.app
| rgovostes wrote:
| Your top menu has an "As" button, listing how you compare to
| alternative products. Is that the label you wanted?
| ukuina wrote:
| Might be better renamed to "Compare"
| upghost wrote:
| > Before I begin I would like to credit the thousands or millions
| of unknown artists, coders and writers upon whose work the Large
| Language Models (LLMs) are trained, often without due credit or
| compensation
|
| I like this. If we insist on pushing forward with GenAI we should
| probably at least make some digital or physical monument like
| "The Tomb of the Unknown Creator".
|
| Cause they sure as sh*t ain't gettin paid. RIP.
___________________________________________________________________
(page generated 2024-12-29 23:00 UTC)