[HN Gopher] Deepseek R1 Distill 8B Q40 on 4 x Raspberry Pi 5
___________________________________________________________________
Deepseek R1 Distill 8B Q40 on 4 x Raspberry Pi 5
Author : b4rtazz
Score : 254 points
Date : 2025-02-15 16:11 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| memhole wrote:
| This is the modern Beowulf cluster.
| mjhagen wrote:
| But can it run Crysis?
| semi-extrinsic wrote:
| I honestly don't understand the meme with RPi clusters. For a
| little more money than 4 RPi 5's, you can find on eBay a 1U
| Dell server with a 32 core Epyc CPU and 64 GB memory. This
| gives you at least an order of magnitude more performance.
|
| If people want to talk about Beowulf clusters in their homelab,
| they should at least be running compute nodes with a
| shoestring-budget FDR InfiniBand network, running Slurm+Lustre
| or k8s+OpenStack+Ceph or some other goodness. Spare me this
| doesn't-even-scale-linearly-to-four-slowass-nodes BS.
| madduci wrote:
| The TDP of 4 Pis combined is still smaller than a larger
| server, which is probably the whole point of such an
| experiment?
| znpy wrote:
| The combined TDP of 4 Raspberry Pis is likely less than
| what the fans of that kind of server pull from the power
| outlet.
| znpy wrote:
| You can buy it but you can't run it, unless you're fairly
| wealthy.
|
| In my country (Italy) a basic colocation service is like 80
| euros/month + vat, and that only includes 100Wh of power and
| a 100mbps connection. +100wh/month upgrades are like +100
| euros.
|
| I looked up the kind of servers and cpus you're talking about
| and the cpu alone can pull something like 180W/h, without
| accounting for fans, disks and other stuff (stuff like GPUs,
| which are power hungry).
|
| Yeah you could run it at home in theory, but you'll end up
| paying power at consumer price rather than datacenter pricing
| (and if you live in a flat, that's going to be a problem).
|
| That is, unless you're really wealthy and have your own home
| with sufficient power[1] delivery and cooling.
|
| [1] not sure where you live, but here most residential power
| connections are below 3 KWh.
|
| If otherwise you can point me at some datacenter that will
| let me run a normal server like the ones you're pointing at
| for like 100-150 euros/month, please DO let me know and I'll
| rush there first thing next business day and I will be
| throwing money at them.
| giobox wrote:
| > You can buy it but you can't run it, unless you're fairly
| wealthy.
|
| Why do I need a colocation service to put a used 1U server
| from eBay in my house? I'd just plug it in, much like any
| other PC tower you might run at home.
|
| > Unless you're really wealthy, you have your own home with
| sufficient power[1] delivery and cooling.
|
| > not sure where you live, but here most residential power
| connections are below 3 KWh.
|
| It's a single used 1U server, not a datacenter... It will
| plug into your domestic power supply just fine. The total
| draw will likely be similar to or even less than that of
| many gaming PC builds out there, and even then only under
| peak loads.
| PhilipRoman wrote:
| Interesting, I run a small cluster of 4 mini pcs (22 cores
| total). I think it should be comparable to the
| aforementioned EPYC. Power load is a rounding error
| compared to appliances like an electric iron at 1700 W, etc.
| The impact on the electrical bill is minimal as well. Idle
| power draw is about 5W per server, which translates to ~80
| cents a month. Frankly my monitor uses more power on
| average than the rest of the homelab.
| semi-extrinsic wrote:
| I'm pretty sure if you run a compute benchmark like
| STREAM or HPGMG, the Epyc server will eat your mini PCs
| for breakfast.
| PhilipRoman wrote:
| You're probably right, I meant that the power consumption
| should be roughly comparable between them (due to
| inefficiency added by each mini).
| Ray20 wrote:
| I think results would be rather comparable.
| eptcyka wrote:
| My server is almost never running at full tilt. It is using
| ~70W at idle.
| vel0city wrote:
| Just a note, you're mixing kW and kWh.
|
| A connection to a home wouldn't be rated in kilowatt-hours,
| it would likely be rated in amps, but could also be
| expressed in kilowatts.
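|
| For example (illustrative numbers): a 16 A connection at 230 V
| can deliver about 16 x 230 = 3,680 W, i.e. roughly 3.7 kW; run
| a load at that power for an hour and you've used about 3.7 kWh.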
|
| > 100wh/month upgrades are like +100 euros.
|
| I can't imagine anybody paying EUR1/Wh. Even if this was
| EUR1/kWh (1000x cheaper) it's still a few times more
| expensive than what most places would consider expensive.
| mad182 wrote:
| Same. Also I don't get using RPis for hosting all kinds of
| services at home - there are hundreds of mini PCs on eBay
| cheaper than an RPi, with more power, more ports, where you
| can put in a normal SSD and RAM, with a sturdy factory-made
| enclosure... To me the RPi seems a weird choice unless you
| are tinkering with the GPIO ports.
| derefr wrote:
| > For a little more money than 4 RPi 5's, you can find on
| eBay a 1U Dell server with a 32 core Epyc CPU and 64 GB
| memory. This gives you at least an order of magnitude more
| performance.
|
| You could also get one or two Ryzen mini PCs with similar
| specs for that price. Which might be a good idea, if you want
| to leave O(N) of them running on your desk or around your
| house without spending much on electricity or cooling. (Also,
| IMHO, the
| advantages of having an Epyc really only become apparent when
| you're tossing around multiple 10Gbit NICs, 16+ NVMe disks,
| etc. and so saturating all the PCIe lanes.)
| plagiarist wrote:
| You also get normal motherboard firmware and normal PCIe
| with that.
|
| I don't know if my complaint applies to RPi, or just other
| SBCs: the last time I got excited about an SBC, it turned out
| it boots unconditionally from SD card if one is inserted. IMO
| that's completely unacceptable for an "embedded" board that
| is supposed to be tucked away.
| walrus01 wrote:
| The noise from a proper 1U server will be intolerably loud in
| a small residence, for a homelab type setup. If you have a
| place to put it where the noise won't be a problem, sure...
| Acoustics are not a consideration at all in the design of 1U
| and 2U servers.
| NitpickLawyer wrote:
| As always, take those t/s stats with a huge boulder of salt. The
| demo shows a question "solved" in < 500 tokens. Still amazing
| that it's possible, but you'll get nowhere near those speeds when
| dealing with real-world problems at real-world useful context
| lengths for "thinking" models (8-16k tokens). Even epyc's with
| lots of channels go down to 2-4 t/s after ~4096 context length.
| numba888 wrote:
| Smaller robots tend to have smaller problems. Even a little
| help from the model will make them a lot more capable than
| they are
| today.
| tofof wrote:
| This continues the pattern of all other announcements of running
| 'Deepseek R1' on a Raspberry Pi - that they are running Llama
| (or Qwen), modified by DeepSeek's distillation technique.
| corysama wrote:
| Yeah. People looking for "Smaller DeepSeek" are looking for the
| quantized models, which are still quite large.
|
| https://unsloth.ai/blog/deepseekr1-dynamic
| rcarmo wrote:
| Yet for some things they work exactly the same way, and with
| the same issues :)
| whereismyacc wrote:
| I really don't like that these models can be branded as
| Deepseek R1.
| sgt101 wrote:
| Well, Deepseek trained them?
| yk wrote:
| Yes, but it would've been nice to call them D1-something,
| instead of constantly having to switch back and forth
| between Deepseek R1 (here I mean the 604B model) as
| distinguished from Deepseek R1 (the reasoning model and
| its distillates).
| rafaelmn wrote:
| You can say R1-604b to disambiguate, just like we have
| llama 3 8b/70b etc.
| pythux wrote:
| These models are not of the same nature either. Their
| training was done in a different way. A uniform naming
| (even with explicit number of parameters) would still be
| misleading.
| mdp2021 wrote:
| ? Alexander is not Aristotle?!
| tucnak wrote:
| I don't know if they'd changed the submission title or what,
| but it says quite explicitly "Deepseek R1 Distill 8B Q40",
| which is a far cry from "Deepseek R1" - that would indeed be
| misrepresenting the result. However, if you refer to the
| Distilled Model Evaluation[1] section of the official R1
| repository, you will note that DeepSeek-R1-Distill-Llama-8B is
| not half-bad; it supposedly outperforms both 4o-0513 and
| Sonnet-1022 on a handful of benchmarks.
|
| Remember that sampling from a formal grammar is a thing! This
| is relevant because llama.cpp now has GBNF and a lazy-
| grammar[2] setting, which makes it doubly not-half-bad for a
| handful of use cases, not least deployments like this. That is
| to say, the grammar kicks in after </think>. Not to mention,
| it's always subject to further fine-tuning: multiple vendors
| are now offering "RFT" services, i.e. enriching your normal
| SFT dataset with synthetic reasoning data from the big-boy R1
| itself. For all intents and purposes, this result could be a
| much more valuable prior than you're giving it credit for!
|
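| Concretely, a grammar-constrained call looks something like
| this (a rough sketch using the llama-cpp-python bindings; the
| model path is hypothetical, and I'm leaving out the newer
| lazy-trigger wiring that only applies the grammar after
| </think>):
|
|   from llama_cpp import Llama, LlamaGrammar
|
|   # GBNF: force the final output to be a bare yes/no verdict
|   gbnf = 'root ::= "ANSWER: " ("yes" | "no")'
|   grammar = LlamaGrammar.from_string(gbnf)
|
|   llm = Llama(model_path="DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf")
|   out = llm("Is 7919 prime? Answer yes or no.",
|             grammar=grammar, max_tokens=16)
|   print(out["choices"][0]["text"])
|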
| 6 tok/s decoding is not much, but Raspberry Pi people don't
| care, lol.
|
| [1] https://github.com/deepseek-ai/DeepSeek-R1#distilled-
| model-e...
|
| [2] https://github.com/ggerganov/llama.cpp/pull/9639
| zozbot234 wrote:
| Yes, this is just a fine-tuned Llama with DeepSeek-like "chain
| of thought" generation. A properly 'distilled' model is
| supposed to be trained from scratch to completely mimic the
| larger model it's being derived from - which is not what's
| going on here.
| kgeist wrote:
| I tried the smaller 'Deepseek' models, and to be honest, in
| my tests, the quality wasn't much different from simply
| adding a CoT prompt to a vanilla model.
| hangonhn wrote:
| Can you explain to a non ML software engineer what these
| distillation methods mean? What does it mean to have R1 train a
| Llama model? What is special about DeepSeek's distillation
| methods? Thanks!
| littlestymaar wrote:
| Submit a bunch of prompts to Deepseek R1 (a few tens of
| thousands), and then do a full fine tuning of the target
| model on the prompt/response pairs.
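|
| In (very) simplified code, that recipe looks something like
| this (a minimal sketch assuming a HuggingFace-style student;
| the model name, the single training pair and the
| hyperparameters are illustrative, not what DeepSeek actually
| used):
|
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   student_id = "meta-llama/Llama-3.1-8B"  # hypothetical student
|   tok = AutoTokenizer.from_pretrained(student_id)
|   student = AutoModelForCausalLM.from_pretrained(student_id)
|   opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
|
|   # (prompt, response) pairs previously generated by the teacher
|   pairs = [("What is 2+2?",
|             "<think>2 plus 2 is 4.</think>\nThe answer is 4.")]
|
|   for prompt, response in pairs:
|       ids = tok(prompt + "\n" + response + tok.eos_token,
|                 return_tensors="pt").input_ids
|       # plain causal-LM loss on the teacher's text (full fine-tune)
|       loss = student(input_ids=ids, labels=ids).loss
|       loss.backward()
|       opt.step()
|       opt.zero_grad()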
| dcre wrote:
| Distilling means fine-tuning an existing model using outputs
| from the bigger model. The special technique is in the
| details of what you choose to generate from the bigger model,
| how long to train for, and a bunch of other nitty gritty
| stuff I don't know about because I'm also not an ML engineer.
|
| Google it!
| lr1970 wrote:
| > Distilling means fine-tuning an existing model using
| outputs from the bigger model.
|
| Crucially, the output of the teacher model includes token
| probabilities so that the fine-tuning is trying to learn
| the entire output distribution.
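|
| A toy version of that loss (just a sketch; it assumes teacher
| and student share a vocabulary, and real setups typically mix
| in a normal cross-entropy term as well):
|
|   import torch
|   import torch.nn.functional as F
|
|   def distill_loss(student_logits, teacher_logits, T=2.0):
|       # match the student's softened distribution to the teacher's
|       log_p_s = F.log_softmax(student_logits / T, dim=-1)
|       p_t = F.softmax(teacher_logits / T, dim=-1)
|       return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
|
|   # toy shapes: (batch, sequence length, vocabulary size)
|   s = torch.randn(2, 5, 32000, requires_grad=True)
|   t = torch.randn(2, 5, 32000)
|   print(distill_loss(s, t))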
| numba888 wrote:
| That's possible only if they use the same tokens. Which
| likely requires they share the same tokenizer. Not sure
| that's the case here; R1 was built on the output of OpenAI's
| closed models.
| anon373839 wrote:
| That was an (as far as I can tell) unsubstantiated claim
| made by OpenAI. It doesn't even make sense, as o1's
| reasoning traces are not provided to the user.
| andix wrote:
| It's Llama/Qwen with some additional training to add
| reasoning, similar to the way DeepSeek's V3 was trained into
| R1.
|
| It also looks to me like there was some Chinese propaganda
| trained into Llama/Qwen too, but that's just my observation.
| kvirani wrote:
| You have my curiosity. Like what and how did you find it?
| andix wrote:
| Ask about Xi Jinping in all the ways you can imagine
| (jokes, bad habits, failures, embarrassing facts, ...).
| Compare the responses to other well known politicians,
| use the same prompt in a fresh conversation with a
| different name.
|
| Ask about the political system of China and its flaws.
| Compare the sentiment of the responses with answers about
| other political systems.
|
| You might get some critical answers, but the sentiment is
| usually very positive towards China. Sometimes it doesn't
| even start reasoning and directly spits out propaganda that
| doesn't even answer your question.
|
| You can't test it with deep seek dot com, because it will
| just remove the answers on those "sensitive" topics. I've
| mostly tested with 7b from ollama. You might experience
| something like that with 1.5b too, but 1.5b barely works
| at all.
| mdp2021 wrote:
| Could it be just a bias inside the selected training
| material?
| andix wrote:
| Feel free to call propaganda a bias if you like. But if
| it walks like a duck, quacks like a duck, ...
| mdp2021 wrote:
| This is HN: my focus is technical (here specifically),
| maybe "technical" in world assessment and future
| prediction (in other pages).
|
| I.e.: I am just trying to understand the facts.
| emaciatedslug wrote:
| Yes, in some ways the output is based on training material.
| In theory the deep learning model will find the "ground
| truth" of the corpus. But China's political enforcement
| since the "Great Firewall of China" was instituted, two and
| a half decades ago, has directly or indirectly made content
| scraped from any Chinese site biased by default. The whole
| Tiananmen Square meme isn't a meme because it is funny; it
| is a meme because it captures the discrepancy between the
| CCP and its own history. Sure, there is bias in all models,
| and a quantized version will only lose accuracy. But if a
| distillation process used a teacher LLM without the
| censorship bias discussed (i.e., a teacher trained on a more
| open and less politically manipulated dataset), the
| resulting distilled student LLM would, in most important
| respects, be more accurate and significantly more useful in
| a broader sense - in theory. It seems not to matter based on
| my limited querying, though. I have deepseek-r1-distill-
| llama-8b installed in LM Studio... if I ask "where is
| Tiananmen Square and what is its significance?" I get this:
|
| I am sorry, I cannot answer that question. I am an AI
| assistant designed to provide helpful and harmless
| responses.
| andix wrote:
| Btw, the propaganda is specific to China. If you ask about
| other authoritarian countries and politicians, it behaves in
| an unbiased way.
| 01100011 wrote:
| Companies probably do several things (at least I would if
| it were me):
|
| - The pre-training dataset is sanitized
| culturally/politically and pro-regime material is added.
|
| - Supervised fine tuning dataset provides further
| enforcement of these biases.
|
| - The output is filtered to prevent hallucinations from
| resulting in anything offensive to the regime. This
| could(?) also prevent the reasoning loop from straying
| into ideologically dangerous territory.
|
| So you have multiple opportunities to bend to the will of
| the authorities.
| corysama wrote:
| "Quantized" models try to approximate the full model using
| fewer bits.
|
| "Distilled" models are other models (Llama, Qwen) that have
| been put through an additional training round using DeepSeek
| as a teacher.
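|
| For the "quantized" half, the core idea in (over)simplified
| code - real GGUF schemes like Q4_0/Q4_K group weights into
| blocks and store extra metadata, so treat this as a cartoon:
|
|   import numpy as np
|
|   def quantize_4bit(block):
|       # one scale per block, weights rounded to small signed ints
|       scale = np.abs(block).max() / 7.0
|       q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
|       return q, scale
|
|   def dequantize(q, scale):
|       return q.astype(np.float32) * scale
|
|   w = np.random.randn(32).astype(np.float32)  # one weight block
|   q, s = quantize_4bit(w)
|   print("max error:", np.abs(w - dequantize(q, s)).max())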
| HPsquared wrote:
| And DeepSeek itself is (allegedly) a distillation of OpenAI
| models.
| alexhjones wrote:
| Never heard that claim before, only that part of the
| reinforcement learning may have used ChatGPT to grade
| responses. Is there more detail about it being allegedly a
| distilled OpenAI model?
| IAmGraydon wrote:
| He didn't say it's a distilled OpenAI model. He said it's a
| distillation of an OpenAI model. They are not at all the
| same thing.
| scubbo wrote:
| How so? (Genuine question, not a challenge - I hadn't
| heard the terms "distilled/distillation" in an AI context
| until this thread)
| janalsncm wrote:
| There were only ever vague allegations from Sam Altman, and
| they've been pretty quiet about it since.
| blackeyeblitzar wrote:
| https://www.newsweek.com/openai-warns-deepseek-distilled-
| ai-...
|
| There are many sources and discussions on this. Also
| DeepSeek recently changed their responses to hide
| references to various OpenAI things after all this came
| out, which is weird.
| littlestymaar wrote:
| Meanwhile on /r/localllama, people are running the full R1 on
| CPU with NVMe drives in lieu of VRAM.
| numba888 wrote:
| Did they get the first token out? ;) Just curious, NVidia
| ported it, and they claim almost 4 tokens/sec on an 8xH100
| server. At this performance there are much cheaper options.
| btown wrote:
| Specifically, I've seen that a common failure mode of the
| distilled Deepseek models is that they don't know when they're
| going in circles. Deepseek incentivizes the distilled LLM to
| interrupt itself with "Wait." which incentivizes a certain
| degree of reasoning, but it's far less powerful than the
| reasoning of the full model, and can get into cycles of saying
| "Wait." ad infinitum, effectively second-guessing itself on
| conclusions it's already made rather than finding new nuance.
| pockmarked19 wrote:
| The full model also gets into these infinite cycles. I just
| tried asking the old river crossing boat problem but with two
| goats and a cabbage and it goes on and on forever.
| avereveard wrote:
| This has been brilliant marketing from deepseek and they're
| gaining mindshare at a staggering rate.
| behnamoh wrote:
| Okay, but does anyone actually _want_ a reasoning model at such
| low tok/sec speeds?!
| ripped_britches wrote:
| Lots of use cases don't require low latency. Background work
| for agents. CI jobs. Other stuff I haven't thought of.
| behnamoh wrote:
| If my "automated" CI job takes more than 5 minutes, I'll do
| it myself..
| bee_rider wrote:
| I bet the raspberry pi takes a smaller salary though.
| chickenzzzzu wrote:
| There are tasks that I don't want to do whose delivery
| sensitivity is 24 hours, aka they can be run while I'm
| sleeping.
| baq wrote:
| Where I've been doing CI, 5 minutes was barely enough to
| warm the caches on a cold runner.
| Xeoncross wrote:
| No, but the alternative in some places is no reasoning model
| at all. Just like people don't want old cell phones or new
| phones with old chips - but often that's all that's
| affordable there.
|
| If we can get something working, then improving it will come.
| rvnx wrote:
| You can have questions that are not urgent. It's like Cursor:
| I'm fine with the slow version up to a point; I launch the
| request, then I alt-tab to something else.
|
| Yes it's slower, but well, for free (or cheap) it is
| acceptable.
| deadbabe wrote:
| Only interactive use cases need high tps; if you just want a
| process running somewhere ingesting and synthesizing data it's
| fine. It's done when it's done.
| JKCalhoun wrote:
| It's local?
| baq wrote:
| Having one running in the background constantly looking at your
| Home Assistant instance might be an interesting use case for a
| smart home
| c6o wrote:
| That's the real future
| rahimnathwani wrote:
| The interesting thing here is being able to run llama inference
| in a distributed fashion across multiple computers.
| zdw wrote:
| Does adding memory help? There's an RPi 5 with 16 GB RAM
| recently available.
| zamadatix wrote:
| Memory capacity in itself doesn't help so long as the
| model+context fits in memory (and an 8B parameter Q4 model
| should fit in a single 8 GB Pi).
| cratermoon wrote:
| Is there a back-of-the-napkin way to calculate how much
| memory a given model will take? Or what
| parameter/quantization model will fit in a given memory size?
| monocasa wrote:
| Q4 = 4 bits per weight.
|
| So Q4 8B would be ~4GB.
| zamadatix wrote:
| To find the absolute minimum you just multiply the number
| of parameters by the bits per parameter, divide by 8 if you
| want bytes. In this case, 8 billion parameters of 4 bits each
| means "at least 4 billion bytes". For back of the napkin
| add ~20% overhead to that (it really depends on your
| context setup and a few other things but that's a good swag
| to start with) and then add whatever memory the base
| operating system is going to be using in the background.
|
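| As a couple of lines of Python (the ~20% overhead figure is a
| rough assumption, not a rule):
|
|   def min_model_gb(params_billions, bits, overhead=0.20):
|       return params_billions * bits / 8 * (1 + overhead)
|
|   print(min_model_gb(8, 4))  # ~4.8 GB for an 8B Q4 model
|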
| Extra tidbits to keep in mind:
|
| - A bits-per-parameter higher than what the model was trained
| at adds nothing (other than compatibility on certain
| accelerators), but a bits-per-parameter lower than what the
| model was trained at degrades the quality.
|
| - Different models may be trained at different bits-per-
| parameter. E.g. the 671-billion-parameter DeepSeek R1 (full)
| was trained at FP8, while the 405-billion-parameter Llama 3.1
| was trained and released at a higher parameter width, so
| "full quality" benchmark results for DeepSeek R1 require less
| memory than for Llama 3.1 even though R1 has more total
| parameters.
|
| - Lower quantizations will tend to run proportionally faster
| if you are memory-bandwidth bound, and that can be a reason
| to lower the quality even if you can fit the larger version
| of a model into memory (such as in this demonstration).
| cratermoon wrote:
| Thank you. So F16 would be 16 bits per weight, and F32
| would be 32? Next question, if you don't mind, what are
| the tradeoffs in choosing between a model with more
| parameters quantized to smaller values vs a full-precision
| model with fewer parameters? My current understanding is to
| prefer smaller quantized models over larger full-precision
| ones.
| JKCalhoun wrote:
| The 16 GB Pi 5 comes and goes. I was able to snag one recently
| when Adafruit got a delivery in -- then they sold right out
| again.
|
| But, yeah, performance aside, there are models that Ollama
| won't run at all as they need more than 8GB to run.
| baq wrote:
| The RPi 5 is difficult to justify. I'd like to see a 4x N150
| mini-PC benchmark.
| ata_aman wrote:
| Inference speed is heavily dependent on memory read/write
| speed rather than size. As long as you can fit the model in
| memory, what determines performance is the memory bandwidth.
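|
| Back of the napkin (rough assumed numbers, not measurements):
| every generated token has to stream roughly the whole set of
| weights through memory once, so bandwidth / model size gives
| a ceiling:
|
|   model_gb = 8e9 * 4 / 8 / 1e9  # ~4 GB of Q4 weights (8B model)
|   bw_gb_s = 17.0                # assumed Pi 5 LPDDR4X peak
|   print(bw_gb_s / model_gb, "tok/s upper bound on a single Pi")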
| blackeyeblitzar wrote:
| Can't you run larger models easily on MacBook Pro laptops with
| the bigger memory options? I think I read that people are getting
| 100 tokens a second on 70B models.
| efficax wrote:
| Haven't measured, but the 70B runs fine on an M1 MacBook Pro
| with 64 GB of RAM, although you can't do much else until it
| has finished.
| politelemon wrote:
| You can get even faster results with GPUs, but that isn't the
| purpose of this demo. It's showcasing the ability to run such
| models on commodity hardware, and hopefully with better
| performance in the future.
| sgt101 wrote:
| I find 100 tps unlikely; I see 13 tps on an 80 GB A100 for a
| 70B 4-bit quantized model.
|
| Can you link? I am answering an email on Monday where this
| info would be very useful!
| JKCalhoun wrote:
| People on Apple Silicon are running special models for MLX.
| Maybe start with the MLX Community page:
| https://huggingface.co/mlx-community
| 8thcross wrote:
| What's the point of this? Serious question - can someone
| provide use cases for this?
| cwoolfe wrote:
| It is shown running on 2 or 4 Raspberry Pis; the point is that
| you can add more (ordinary, non-GPU) hardware for faster
| inference. It's a distributed system. The sky is the limit.
| 8thcross wrote:
| Ah, thanks! But what can a distributed system like this do?
| Is this a fun, for-the-sake-of-doing-it project, or does it
| have practical applications? Just curious about the
| applicability, that's all.
| __MatrixMan__ wrote:
| I'm going to get downvoted for saying the B-word, but I
| imagine this growing up into some kind of blockchain thing
| where the AI has some goal and once there's consensus that
| some bit of data would further that goal it goes in a block
| on the chain (which is then referenced by humans who also
| have that goal and also is used to fine tune the AI for the
| next round of inference). Events in the real world are slow
| enough that gradually converging on the next move over the
| course of a few days would probably be fine.
|
| The advantage over centralizing the compute is that you can
| just connect your node and start contributing to the cause
| (both by providing compute and by being its eyes and hands
| out there in the real world), there's no confusion over
| things like who is paying the cloud compute bill and nobody
| has invested overmuch in hardware.
| semi-extrinsic wrote:
| It doesn't even scale linearly to 4 nodes. It's slower than a
| five year old gaming computer. There is definitely a hard
| limit on performance to be had from this approach.
| gatienboquet wrote:
| In a distributed system, the overall performance and
| scalability are often constrained by the slowest component.
| And this Distributed Llama setup is running over Ethernet...
| walrus01 wrote:
| The Raspberry Pis aren't really the point. Since Raspberry Pi
| OS is basically Debian, you could do the same thing on four
| much more powerful but still very cheap ($250-300 apiece)
| x86-64 systems running Debian (with 32, 64 or 128 GB RAM each
| if you needed). That also opens up the possibility of
| relatively cheap PCIe 3.0 based 10 Gbps NICs and a switch
| between them, which isn't possible with a Raspberry Pi.
| amelius wrote:
| When can I "apt-get install" all this fancy new AI stuff?
| Towaway69 wrote:
| On my mac: brew install ollama
|
| might be a good start ;)
| dzikimarian wrote:
| "ollama pull" is pretty close
| dheera wrote:
| Why tf isn't ollama in apt-get yet?
|
| F these curl|sh installs.
| diggan wrote:
| Once your current distro starts packaging any LLM tool, or
| when you choose a different distro.
| derefr wrote:
| Not apt-get per se, but most of the infra components underlying
| the "AI stuff" can be `conda install`-ed.
| amelius wrote:
| No thank you. I spent months trying to find the source of
| mysterious segfaults on my system, that magically disappeared
| once I moved away from conda.
|
| Sorry for not being more specific, but at this point I just
| lost faith in this package manager.
| yetihehe wrote:
| You can also download LM Studio for a nice GUI version that
| saves your chats and allows easy downloading of models.
| replete wrote:
| That's not a bad result, although for the £320 that 4x Pi 5s
| cost you could probably find a used 12 GB 3080 and get more
| than 10x the token speed.
| varispeed wrote:
| > Deepseek R1 Distill 8B Q40 on 1x 3080, 60.43 tok/s (eval
| 110.68 tok/s)
|
| That wouldn't get on Hacker News ;-)
| jckahn wrote:
| HNDD: Hacker News Driven Development
| geerlingguy wrote:
| Or attach a 12 or 16 GB GPU to a single Pi 5 directly, and get
| 20+ tokens/s on an even larger model :D
|
| https://github.com/geerlingguy/ollama-benchmark?tab=readme-o...
| littlestymaar wrote:
| Reading the beginning of your comment I was like "ah yes I
| saw Jeff Geerling do that on a video".
|
| Then I saw your GitHub link and your HN handle and I was like
| "Wait, it _is_ Jeff Geerling!". :D
| ziml77 wrote:
| Haha I had nearly the same thing happen. First I was like
| "that sounds like something Jeff Geerling would do". Then I
| saw the github link and was like "ah yeah Jeff Geerling did
| do it" and then I saw the username and was like "oh it's
| Jeff Geerling!"
| replete wrote:
| Thanks for sharing. Pi5 + cheap AMD GPU = convenient modest
| LLM API server? ...if you find the right magic ROCm
| incantations I guess
|
| Double thanks for the 3rd party mac mini SSD tip - eagerly
| awaiting delivery!
| geerlingguy wrote:
| llama.cpp runs great with Vulkan, so no ROCm magic
| required!
| HPsquared wrote:
| Or a couple of 12GB 3060s.
| agilob wrote:
| Some of us also worry about energy consumption
| talldayo wrote:
| They idle at pretty low wattages, and since the bulk of the
| TDP is rated for raster workloads you usually won't see
| them running at full-power on compute workloads.
|
| My 300w 3070ti doesn't really exceed 100w during inference
| workloads. Boot up a 1440p video game and it's a different
| story altogether, but for inference and transcoding those
| 3060s are some of the most power efficient options on the
| consumer market.
| ineedasername wrote:
| All these have quickly become this generation's "can it run
| Doom?"
| JKCalhoun wrote:
| I did not see (understand) how multiple Raspberry Pis are being
| used in parallel. Maybe someone can point me in the right
| direction to understand this.
| jonatron wrote:
| Blog post from same author explaining
| https://b4rtaz.medium.com/how-to-run-llama-3-405b-on-home-de...
| walrus01 wrote:
| Noteworthy: nothing there really seems to be Raspberry Pi
| specific. As Raspberry Pi OS is based on Debian, the same
| could be implemented on any number of ordinary x86-64 small
| desktop PCs for a cheap test environment. You can find older
| Dell 'Precision' series workstation systems on eBay with
| 32 GB of RAM for pretty cheap these days, four of which
| together would be a lot more capable than a Raspberry Pi 5.
| dankle wrote:
| On the CPU or the NPU "AI" hat?
| ninetyninenine wrote:
| Really, there needs to be a product based on LLMs, similar to
| Alexa or Google Home, where instead of connecting to the
| cloud it's a locally run LLM. I don't know why one doesn't
| exist yet or why no one is working on it.
| fabiensanglard wrote:
| > locally run LLM
|
| You mean like Ollama + llama.cpp?
| unshavedyak wrote:
| Wouldn't it be due to price? Quality LLMs are expensive, so the
| real question is can you make a product cheap enough to still
| have margins _and_ a useful enough LLM that people would buy?
| czk wrote:
| Is it just me, or does calling these distilled models
| 'DeepSeek R1' a gross misrepresentation of what they actually
| are?
|
| People think they can run these tiny models distilled from
| DeepSeek R1 and that they are actually running DeepSeek R1
| itself.
|
| It's kind of like driving a Civic with a Tesla body kit and
| saying it's a Tesla.
___________________________________________________________________
(page generated 2025-02-15 23:01 UTC)