[HN Gopher] Deepseek R1 Distill 8B Q40 on 4 x Raspberry Pi 5
       ___________________________________________________________________
        
       Deepseek R1 Distill 8B Q40 on 4 x Raspberry Pi 5
        
       Author : b4rtazz
       Score  : 254 points
       Date   : 2025-02-15 16:11 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | memhole wrote:
       | This is the modern Beowulf cluster.
        
         | mjhagen wrote:
         | But can it run Crysis?
        
         | semi-extrinsic wrote:
         | I honestly don't understand the meme with RPi clusters. For a
         | little more money than 4 RPi 5's, you can find on eBay a 1U
         | Dell server with a 32 core Epyc CPU and 64 GB memory. This
         | gives you at least an order of magnitude more performance.
         | 
          | If people want to talk about Beowulf clusters in their homelab,
          | they should at least be running compute nodes with a
          | shoestring-budget FDR InfiniBand network, running Slurm+Lustre
          | or k8s+OpenStack+Ceph or some other goodness. Spare me this
          | doesn't-even-scale-linearly-to-four-slowass-nodes BS.
        
           | madduci wrote:
            | The combined TDP of 4 Pis is still smaller than that of a
            | larger server, which is probably the whole point of such an
            | experiment?
        
             | znpy wrote:
              | The combined TDP of 4 Raspberry Pis is likely less than
             | what the fans of that kind of server pull from the power
             | outlet.
        
           | znpy wrote:
           | You can buy it but you can't run it, unless you're fairly
           | wealthy.
           | 
            | In my country (Italy) a basic colocation service is like 80
            | euros/month + VAT, and that only includes 100Wh of power and
            | a 100mbps connection. +100wh/month upgrades are like +100
            | euros.
           | 
            | I looked up the kind of servers and CPUs you're talking about
            | and the CPU alone can pull something like 180W/h, without
            | accounting for fans, disks and other stuff (stuff like GPUs,
            | which are power hungry).
           | 
           | Yeah you could run it at home in theory, but you'll end up
           | paying power at consumer price rather than datacenter pricing
           | (and if you live in a flat, that's going to be a problem).
           | 
           | Unless you're really wealthy, you have your own home with
           | sufficient power[1] delivery and cooling.
           | 
           | [1] not sure where you live, but here most residential power
           | connections are below 3 KWh.
           | 
            | Otherwise, if you can point me at some datacenter that will
            | let me run a normal server like the ones you're pointing at
            | for like 100-150 euros/month, please DO let me know and I'll
            | rush there first thing next business day and throw money at
            | them.
        
             | giobox wrote:
             | > You can buy it but you can't run it, unless you're fairly
             | wealthy.
             | 
             | Why do I need a colocation service to put a used 1U server
             | from eBay in my house? I'd just plug it in, much like any
             | other PC tower you might run at home.
             | 
             | > Unless you're really wealthy, you have your own home with
             | sufficient power[1] delivery and cooling.
             | 
             | > not sure where you live, but here most residential power
             | connections are below 3 KWh.
             | 
             | It's a single used 1U server, not a datacenter... It will
             | plug into your domestic powersupply just fine. The total
             | draw will likely be similar or even less than many gaming
             | PC builds out there, and even then only when under peak
             | loads etc.
        
             | PhilipRoman wrote:
             | Interesting, I run a small cluster of 4 mini pcs (22 cores
             | total). I think it should be comparable to the
                | aforementioned EPYC. Power load is a rounding error
                | compared to appliances like an electric iron at 1700W,
                | etc.
             | The impact on electrical bill is minimal as well. Idle
             | power draw is about 5W per server, which translates to ~80
             | cents a month. Frankly my monitor uses more power on
             | average than the rest of the homelab.
        
               | semi-extrinsic wrote:
               | I'm pretty sure if you run a compute benchmark like
               | Streams or hpgmg, the Epyc server will eat your mini pcs
               | for breakfast.
        
               | PhilipRoman wrote:
               | You're probably right, I meant that the power consumption
               | should be roughly comparable between them (due to
               | inefficiency added by each mini).
        
               | Ray20 wrote:
               | I think results would be rather comparable.
        
             | eptcyka wrote:
             | My server is almost never running at full tilt. It is using
             | ~70W at idle.
        
             | vel0city wrote:
             | Just a note, you're mixing kW and kWh.
             | 
             | A connection to a home wouldn't be rated in kilowatt-hours,
             | it would likely be rated in amps, but could also be
             | expressed in kilowatts.
             | 
             | > 100wh/month upgrades are like +100 euros.
             | 
             | I can't imagine anybody paying EUR1/Wh. Even if this was
             | EUR1/kWh (1000x cheaper) it's still a few times more
             | expensive than what most places would consider expensive.
        
           | mad182 wrote:
            | Same. Also I don't get using RPis for hosting all kinds of
            | services at home - there are hundreds of mini PCs on eBay
            | cheaper than an RPi, with more power, more ports, where you
            | can put in a normal SSD and RAM, with a sturdy factory-made
            | enclosure... To me the RPi seems a weird choice unless you
            | are tinkering with the GPIO ports.
        
           | derefr wrote:
           | > For a little more money than 4 RPi 5's, you can find on
           | eBay a 1U Dell server with a 32 core Epyc CPU and 64 GB
           | memory. This gives you at least an order of magnitude more
           | performance.
           | 
            | You could also get one or two Ryzen mini PCs with similar
            | specs for that price. Which might be a good idea, if you want
            | to leave O(N) of them running on your desk or around your
            | house without spending much on electricity or cooling. (Also,
            | IMHO, the advantages of having an Epyc really only become
            | apparent when you're tossing around multiple 10Gbit NICs, 16+
            | NVMe disks, etc. and so saturating all the PCIe lanes.)
        
           | plagiarist wrote:
           | You also get a normal motherboard firmware and normal PCI
           | with that.
           | 
           | I don't know if my complaint applies to RPi, or just other
           | SBCs: the last time I got excited about an SBC, it turned out
           | it boots unconditionally from SD card if one is inserted. IMO
           | that's completely unacceptable for an "embedded" board that
           | is supposed to be tucked away.
        
           | walrus01 wrote:
           | The noise from a proper 1U server will be intolerably loud in
           | a small residence, for a homelab type setup. If you have a
           | place to put it where the noise won't be a problem, sure...
           | Acoustics are not a consideration at all in the design of 1U
            | and 2U servers.
        
       | NitpickLawyer wrote:
       | As always, take those t/s stats with a huge boulder of salt. The
       | demo shows a question "solved" in < 500 tokens. Still amazing
       | that it's possible, but you'll get nowhere near those speeds when
       | dealing with real-world problems at real-world useful context
        | lengths for "thinking" models (8-16k tokens). Even Epycs with
        | lots of memory channels drop to 2-4 t/s after ~4096 tokens of
        | context.
        
         | numba888 wrote:
          | Smaller robots tend to have smaller problems. Even a little
          | help from the model will make them a lot more capable than
          | they are today.
        
       | tofof wrote:
        | This continues the pattern of all the other announcements of
        | running 'Deepseek R1' on a Raspberry Pi - they are actually
        | running Llama (or Qwen), modified by DeepSeek's distillation
        | technique.
        
         | corysama wrote:
         | Yeah. People looking for "Smaller DeepSeek" are looking for the
         | quantized models, which are still quite large.
         | 
         | https://unsloth.ai/blog/deepseekr1-dynamic
        
         | rcarmo wrote:
         | Yet for some things they work exactly the same way, and with
         | the same issues :)
        
         | whereismyacc wrote:
         | I really don't like that these models can be branded as
         | Deepseek R1.
        
           | sgt101 wrote:
           | Well, Deepseek trained them?
        
             | yk wrote:
              | Yes, but it would've been nice to call them D1-something,
              | instead of constantly having to switch back and forth
              | between Deepseek R1 (here I mean the 671B model) as
              | distinguished from Deepseek R1 (the reasoning model and
              | its distillates).
        
               | rafaelmn wrote:
                | You can say R1-671B to disambiguate, just like we have
                | Llama 3 8B/70B etc.
        
               | pythux wrote:
               | These models are not of the same nature either. Their
               | training was done in a different way. A uniform naming
               | (even with explicit number of parameters) would still be
               | misleading.
        
             | mdp2021 wrote:
             | ? Alexander is not Aristotle?!
        
         | tucnak wrote:
          | I don't know if they changed the submission title or what, but
          | it says quite explicitly "Deepseek R1 Distill 8B Q40", which is
          | a far cry from "Deepseek R1"; the latter would indeed be
          | misrepresenting the result. However, if you refer to the
          | Distilled Model Evaluation[1] section of the official R1
          | repository, you will note that DeepSeek-R1-Distill-Llama-8B is
          | not half-bad; it supposedly outperforms both 4o-0513 and
          | Sonnet-1022 on a handful of benchmarks.
          | 
          | Remember, sampling from a formal grammar is a thing! This is
          | relevant because llama.cpp now has GBNF and a lazy grammar[2]
          | setting, which makes it doubly not-half-bad for a handful of
          | use cases, not least deployments like this. That is to say,
          | the grammar kicks in after </think>. Not to mention, it's
          | always subject to further fine-tuning: multiple vendors are
          | now offering "RFT" services, i.e. enriching your normal SFT
          | dataset with synthetic reasoning data from the big-boy R1
          | himself. For all intents and purposes, this result could be a
          | much more valuable prior than you're giving it credit for!
          | 
          | 6 tok/s decoding is not much, but Raspberry Pi people don't
          | care, lol.
         | 
         | [1] https://github.com/deepseek-ai/DeepSeek-R1#distilled-
         | model-e...
         | 
         | [2] https://github.com/ggerganov/llama.cpp/pull/9639
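          | 
          | For anyone curious what grammar-constrained sampling looks
          | like in practice, here's a minimal sketch using
          | llama-cpp-python (the model path and the trivial yes/no
          | grammar are placeholders, and this shows plain GBNF
          | constraints rather than the lazy after-</think> variant from
          | [2]):
          | 
          |     from llama_cpp import Llama, LlamaGrammar
          | 
          |     # constrain sampling to a tiny formal grammar (GBNF)
          |     grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')
          | 
          |     llm = Llama(model_path="r1-distill-llama-8b-q4_0.gguf")
          |     out = llm("Is 17 prime? Answer yes or no.",
          |               grammar=grammar, max_tokens=8)
          |     print(out["choices"][0]["text"])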
        
         | zozbot234 wrote:
          | Yes, this is just a fine-tuned LLaMa with DeepSeek-like "chain
          | of thought" generation. A properly 'distilled' model is
          | supposed to be trained from scratch to completely mimic the
          | larger model it's being derived from - which is not what's
          | going on here.
        
           | kgeist wrote:
           | I tried the smaller 'Deepseek' models, and to be honest, in
           | my tests, the quality wasn't much different from simply
           | adding a CoT prompt to a vanilla model.
        
         | hangonhn wrote:
         | Can you explain to a non ML software engineer what these
         | distillation methods mean? What does it mean to have R1 train a
         | Llama model? What is special about DeepSeek's distillation
         | methods? Thanks!
        
           | littlestymaar wrote:
            | Submit a bunch of prompts to Deepseek R1 (a few tens of
            | thousands), and then do a full fine-tune of the target
            | model on the prompt/response pairs.
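            | 
            | A minimal sketch of the data-generation half of that (the
            | endpoint, model name and file names here are hypothetical
            | placeholders; any OpenAI-compatible API could serve as the
            | teacher):
            | 
            |     import json
            |     from openai import OpenAI
            | 
            |     teacher = OpenAI(base_url="https://example.com/v1",
            |                      api_key="...")
            | 
            |     with open("prompts.txt") as f, \
            |          open("sft_pairs.jsonl", "w") as out:
            |         for prompt in f:
            |             resp = teacher.chat.completions.create(
            |                 model="r1-teacher",  # hypothetical name
            |                 messages=[{"role": "user",
            |                            "content": prompt.strip()}],
            |             )
            |             # keep the full response (including any <think>
            |             # trace) as the fine-tuning target
            |             out.write(json.dumps({
            |                 "prompt": prompt.strip(),
            |                 "response": resp.choices[0].message.content,
            |             }) + "\n")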
        
           | dcre wrote:
           | Distilling means fine-tuning an existing model using outputs
           | from the bigger model. The special technique is in the
           | details of what you choose to generate from the bigger model,
           | how long to train for, and a bunch of other nitty gritty
           | stuff I don't know about because I'm also not an ML engineer.
           | 
           | Google it!
        
             | lr1970 wrote:
             | > Distilling means fine-tuning an existing model using
             | outputs from the bigger model.
             | 
             | Crucially, the output of the teacher model includes token
             | probabilities so that the fine-tuning is trying to learn
             | the entire output distribution.
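              | 
              | For reference, the textbook soft-target distillation loss
              | looks roughly like this (a generic sketch, not DeepSeek's
              | published recipe; T is the softening temperature):
              | 
              |     import torch.nn.functional as F
              | 
              |     def distill_loss(student_logits, teacher_logits, T=2.0):
              |         # KL divergence between softened teacher and
              |         # student distributions over the vocabulary
              |         s = F.log_softmax(student_logits / T, dim=-1)
              |         t = F.softmax(teacher_logits / T, dim=-1)
              |         return F.kl_div(s, t, reduction="batchmean") * T * T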
        
               | numba888 wrote:
                | That's possible only if they use the same tokens, which
                | likely requires that they share the same tokenizer. Not
                | sure that's the case here; R1 was built on a closed
                | OpenAI model's output.
        
               | anon373839 wrote:
               | That was an (as far as I can tell) unsubstantiated claim
               | made by OpenAI. It doesn't even make sense, as o1's
               | reasoning traces are not provided to the user.
        
           | andix wrote:
            | It's Llama/Qwen with some additional training to add
            | reasoning, in a similar way to how DeepSeek's V3 was trained
            | into R1.
            | 
            | It also looks to me like there was some Chinese propaganda
            | trained into Llama/Qwen too, but that's just my observation.
        
             | kvirani wrote:
             | You have my curiosity. Like what and how did you find it?
        
               | andix wrote:
                | Ask about Xi Jinping in all the ways you can imagine
                | (jokes, bad habits, failures, embarrassing facts, ...).
                | Compare the responses to those for other well-known
                | politicians; use the same prompt in a fresh conversation
                | with a different name.
                | 
                | Ask about the political system of China and its flaws.
                | Compare the sentiment of the responses with answers about
                | other political systems.
                | 
                | You might get some critical answers, but the sentiment is
                | usually very positive towards China. Sometimes it doesn't
                | even start reasoning and directly spits out propaganda
                | that doesn't even answer your question.
                | 
                | You can't test it with deepseek dot com, because it will
                | just remove the answers on those "sensitive" topics. I've
                | mostly tested with the 7B from Ollama. You might see
                | something like that with the 1.5B too, but 1.5B barely
                | works at all.
        
               | mdp2021 wrote:
               | Could it be just a bias inside the selected training
               | material?
        
               | andix wrote:
               | Feel free to call propaganda a bias if you like. But if
               | it walks like a duck, quacks like a duck, ...
        
               | mdp2021 wrote:
               | This is HN: my focus is technical (here specifically),
               | maybe "technical" in world assessment and future
               | prediction (in other pages).
               | 
               | I.e.: I am just trying to understand the facts.
        
               | emaciatedslug wrote:
                | Yes, in some ways the output is based on the training
                | material; in theory the deep learning model will find the
                | "ground truth" of the corpus. But China's political
                | enforcement since the Great Firewall was instituted two
                | and a half decades ago has directly or indirectly made
                | content scraped from any Chinese site biased by default.
                | The whole Tiananmen Square meme isn't a meme because it
                | is funny; it is a meme because it captures the
                | discrepancy between the CCP and its own history. Sure,
                | there is bias in all models, and a quantized version will
                | only lose accuracy. But if a distillation process used a
                | teacher LLM without the censorship bias discussed (i.e.,
                | a teacher trained on a more open and less politically
                | manipulated dataset), the resulting distilled student LLM
                | would, in most important respects, be more accurate and
                | significantly more useful in a broader sense - in theory.
                | It seems not to matter based on my limited query. I have
                | deepseek-r1-distill-llama-8b installed in LM Studio; if I
                | ask "where is Tiananmen Square and what is its
                | significance?" I get this:
                | 
                | I am sorry, I cannot answer that question. I am an AI
                | assistant designed to provide helpful and harmless
                | responses.
        
               | andix wrote:
                | Btw, the propaganda is specific to China. If you ask
                | about other authoritarian countries and politicians, it
                | behaves in an unbiased way.
        
               | 01100011 wrote:
                | Companies probably do several things (at least I would
                | if it were me):
               | 
               | - The pre-training dataset is sanitized
               | culturally/politically and pro-regime material is added.
               | 
               | - Supervised fine tuning dataset provides further
               | enforcement of these biases.
               | 
               | - The output is filtered to prevent hallucinations from
               | resulting in anything offensive to the regime. This
               | could(?) also prevent the reasoning loop from straying
               | into ideologically dangerous territory.
               | 
               | So you have multiple opportunities to bend to the will of
               | the authorities.
        
           | corysama wrote:
           | "Quantized" models try to approximate the full model using
           | less bits.
           | 
           | "Distilled" models are other models (Llama, Qwen) that have
           | been put through an additional training round using DeepSeek
           | as a teacher.
        
         | HPsquared wrote:
         | And DeepSeek itself is (allegedly) a distillation of OpenAI
         | models.
        
           | alexhjones wrote:
            | Never heard that claim before, only that a certain part of
            | the reinforcement learning may have used ChatGPT to grade
            | responses. Is there more detail about it being allegedly a
            | distilled OpenAI model?
        
             | IAmGraydon wrote:
             | He didn't say it's a distilled OpenAI model. He said it's a
             | distillation of an OpenAI model. They are not at all the
             | same thing.
        
               | scubbo wrote:
               | How so? (Genuine question, not a challenge - I hadn't
               | heard the terms "distilled/distillation" in an AI context
               | until this thread)
        
             | janalsncm wrote:
             | There were only ever vague allegations from Sam Altman, and
             | they've been pretty quiet about it since.
        
             | blackeyeblitzar wrote:
             | https://www.newsweek.com/openai-warns-deepseek-distilled-
             | ai-...
             | 
             | There are many sources and discussions on this. Also
             | DeepSeek recently changed their responses to hide
             | references to various OpenAI things after all this came
             | out, which is weird.
        
         | littlestymaar wrote:
         | Meanwhile on /r/localllama, people are running the full R1 on
         | CPU with NVMe drives in lieu of VRAM.
        
           | numba888 wrote:
            | Did they get the first token out? ;) Just curious; Nvidia
            | ported it, and they claim almost 4 tokens/sec on an 8xH100
            | server. At that performance there are much cheaper options.
        
         | btown wrote:
         | Specifically, I've seen that a common failure mode of the
         | distilled Deepseek models is that they don't know when they're
          | going in circles. Deepseek incentivizes the distilled LLM to
          | interrupt itself with "Wait.", which encourages a certain
          | degree of reasoning, but it's far less powerful than the
          | reasoning of the full model, and it can get into cycles of
          | saying "Wait." ad infinitum, effectively second-guessing
          | itself on conclusions it's already made rather than finding
          | new nuance.
        
           | pockmarked19 wrote:
            | The full model also gets into these infinite cycles. I just
            | tried asking it the old river-crossing boat problem, but with
            | two goats and a cabbage, and it goes on and on forever.
        
         | avereveard wrote:
         | This has been brilliant marketing from deepseek and they're
         | gaining mindshare at a staggering rate.
        
       | behnamoh wrote:
        | Okay, but does anyone actually _want_ a reasoning model at such
        | low tok/sec speeds?!
        
         | ripped_britches wrote:
         | Lots of use cases don't require low latency. Background work
         | for agents. CI jobs. Other stuff I haven't thought of.
        
           | behnamoh wrote:
           | If my "automated" CI job takes more than 5 minutes, I'll do
           | it myself..
        
             | bee_rider wrote:
             | I bet the raspberry pi takes a smaller salary though.
        
             | chickenzzzzu wrote:
             | There are tasks that I don't want to do whose delivery
             | sensitivity is 24 hours, aka they can be run while I'm
             | sleeping.
        
             | baq wrote:
              | Where I've been doing CI, 5 minutes was barely enough to
              | warm the caches on a cold runner.
        
         | Xeoncross wrote:
         | No, but the alternative in some places is no reasoning model.
         | Just like people don't want old cell phones / new phones with
         | old chips - but often that's all that is affordable in some
         | places.
         | 
         | If we can get something working, then improving it will come.
        
         | rvnx wrote:
          | You can have questions that are not urgent. It's like Cursor:
          | I'm fine with the slow version up to a point; I launch the
          | request, then alt-tab to something else.
          | 
          | Yes it's slower, but, well, for free (or cheap) it is
          | acceptable.
        
         | deadbabe wrote:
          | Only interactive use cases need high tps; if you just want a
          | process running somewhere ingesting and synthesizing data,
          | it's fine. It's done when it's done.
        
         | JKCalhoun wrote:
         | It's local?
        
         | baq wrote:
         | Having one running in the background constantly looking at your
         | home assistant instance might be an interesting use case for a
         | smart home
        
       | c6o wrote:
       | That's the real future
        
       | rahimnathwani wrote:
       | The interesting thing here is being able to run llama inference
       | in a distributed fashion across multiple computers.
        
       | zdw wrote:
        | Does adding memory help? An RPi 5 with 16GB of RAM recently
        | became available.
        
         | zamadatix wrote:
          | Memory capacity in itself doesn't help so long as the
          | model+context fits in memory (and an 8B-parameter Q4 model
          | should fit in a single 8 GB Pi).
        
           | cratermoon wrote:
           | Is there a back-of-the-napkin way to calculate how much
           | memory a given model will take? Or what
           | parameter/quantization model will fit in a given memory size?
        
             | monocasa wrote:
              | Q4 = 4 bits per weight.
              | 
              | So a Q4 8B model would be ~4 GB.
        
             | zamadatix wrote:
              | To find the absolute minimum you just multiply the number
              | of parameters by the bits per parameter, then divide by 8
              | if you want bytes. In this case, 8 billion parameters at 4
              | bits each means "at least 4 billion bytes". For back of the
              | napkin, add ~20% overhead to that (it really depends on
              | your context setup and a few other things, but it's a good
              | swag to start with) and then add whatever memory the base
              | operating system is using in the background.
              | 
              | Extra tidbits to keep in mind:
              | 
              | - A bits-per-parameter higher than what the model was
              | trained at adds nothing (other than compatibility on
              | certain accelerators), but a bits-per-parameter lower than
              | what the model was trained at degrades the quality.
              | 
              | - Different models may be trained at different bits per
              | parameter. E.g. the 671-billion-parameter Deepseek R1
              | (full) was trained at fp8, while the 405-billion-parameter
              | Llama 3.1 was trained and released at a higher parameter
              | width, so "full quality" benchmark results for Deepseek R1
              | require less memory than for Llama 3.1 even though R1 has
              | more total parameters.
              | 
              | - Lower quantizations will tend to run proportionally
              | faster if you were memory-bandwidth bound, and that can be
              | a reason to lower the quality even if you can fit the
              | larger version of a model into memory (such as in this
              | demonstration).
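              | 
              | As a quick back-of-the-napkin helper for the formula above
              | (the 20% overhead is the rough swag mentioned, not a
              | precise figure):
              | 
              |     def estimate_memory_gb(params_b, bits, overhead=0.20):
              |         # weights = params * bits / 8 bytes, plus ~20% for
              |         # context/KV cache and runtime overhead
              |         weight_bytes = params_b * 1e9 * bits / 8
              |         return weight_bytes * (1 + overhead) / 1e9
              | 
              |     # 8B params at Q4: ~4 GB of weights, ~4.8 GB total
              |     print(estimate_memory_gb(8, 4))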
        
               | cratermoon wrote:
               | Thank you. So F16 would be 16 bits per weight, and F32
                | would be 32? Next question, if you don't mind: what are
                | the tradeoffs in choosing between a model with more
                | parameters quantized to smaller values vs a full-precision
                | model with fewer parameters? My current understanding is
                | to prefer smaller quantized models over larger
                | full-precision ones.
        
         | JKCalhoun wrote:
         | The 16 GB Pi 5 comes and goes. I was able to snag one recently
         | when Adafruit got a delivery in -- then they sold right out
         | again.
         | 
         | But, yeah, performance aside, there are models that Ollama
         | won't run at all as they need more than 8GB to run.
        
           | baq wrote:
            | The RPi 5 is difficult to justify. I'd like to see a 4x N150
            | mini PC benchmark.
        
         | ata_aman wrote:
          | Inference speed is heavily dependent on memory read/write
          | speed rather than size. As long as you can fit the model in
          | memory, what determines throughput is the memory bandwidth.
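          | 
          | A rough way to see it (the Pi 5 bandwidth figure below is an
          | approximation): each decoded token has to stream essentially
          | all of the weights through memory once, so:
          | 
          |     model_gb = 4.0        # 8B params at Q4
          |     bandwidth_gbs = 17.0  # approx. Pi 5 LPDDR4X bandwidth
          | 
          |     # upper bound on decode speed when memory-bound,
          |     # for a single Pi holding the whole model
          |     print(bandwidth_gbs / model_gb, "tok/s")  # ~4 tok/s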
        
       | blackeyeblitzar wrote:
       | Can't you run larger models easily on MacBook Pro laptops with
       | the bigger memory options? I think I read that people are getting
       | 100 tokens a second on 70B models.
        
         | efficax wrote:
          | Haven't measured, but the 70B runs fine on an M1 MacBook Pro
          | with 64GB of RAM, although you can't do much else until it has
          | finished.
        
         | politelemon wrote:
         | You can get even faster results with GPUs, but that isn't the
         | purpose of this demo. It's showcasing the ability to run such
         | models on commodity hardware, and hopefully with better
         | performance in the future.
        
         | sgt101 wrote:
          | I find 100 tps unlikely; I see 13 tps on an 80GB A100 for a
          | 70B 4-bit quantized model.
          | 
          | Can you link? I am answering an email on Monday where this
          | info would be very useful!
        
           | JKCalhoun wrote:
           | People on Apple Silicon are running special models for MLX.
           | Maybe start with the MLX Community page:
           | https://huggingface.co/mlx-community
        
       | 8thcross wrote:
        | What's the point of this? Serious question - can someone provide
        | use cases for this?
        
         | cwoolfe wrote:
         | It is shown running on 2 or 4 raspberry pis; the point is that
         | you can add more (ordinary, non GPU) hardware for faster
         | inference. It's a distributed system. The sky is the limit.
        
           | 8thcross wrote:
            | Ah, thanks! But what can a distributed system like this do?
            | Is this a fun, for-the-sake-of-doing-it project, or does it
            | have practical applications? Just curious about
            | applicability, that's all.
        
             | __MatrixMan__ wrote:
             | I'm going to get downvoted for saying the B-word, but I
             | imagine this growing up into some kind of blockchain thing
             | where the AI has some goal and once there's consensus that
             | some bit of data would further that goal it goes in a block
             | on the chain (which is then referenced by humans who also
             | have that goal and also is used to fine tune the AI for the
             | next round of inference). Events in the real world are slow
             | enough that gradually converging on the next move over the
             | course of a few days would probably be fine.
             | 
             | The advantage over centralizing the compute is that you can
             | just connect your node and start contributing to the cause
             | (both by providing compute and by being its eyes and hands
             | out there in the real world), there's no confusion over
             | things like who is paying the cloud compute bill and nobody
             | has invested overmuch in hardware.
        
           | semi-extrinsic wrote:
           | It doesn't even scale linearly to 4 nodes. It's slower than a
           | five year old gaming computer. There is definitely a hard
           | limit on performance to be had from this approach.
        
           | gatienboquet wrote:
           | In a distributed system, the overall performance and
           | scalability are often constrained by the slowest component.
            | This Distributed Llama setup runs over Ethernet.
        
         | walrus01 wrote:
          | The Raspberry Pis aren't really the point; since Raspberry Pi
          | OS is basically Debian, you could do the same thing on four
          | much more powerful but still very cheap ($250-300 a piece)
          | x86-64 systems running Debian (with 32, 64 or 128GB of RAM
          | each if you needed). That also opens up the possibility of
          | relatively cheap PCI Express 3.0 based 10 Gbps NICs and a
          | switch between them, which isn't possible with a Raspberry Pi.
        
       | amelius wrote:
       | When can I "apt-get install" all this fancy new AI stuff?
        
         | Towaway69 wrote:
         | On my mac:                   brew install ollama
         | 
         | might be a good start ;)
        
         | dzikimarian wrote:
         | "ollama pull" is pretty close
        
           | dheera wrote:
           | Why tf isn't ollama in apt-get yet?
           | 
           | F these curl|sh installs.
        
         | diggan wrote:
          | Once your current distro starts packaging an LLM tool, or when
          | you choose a different distro.
        
         | derefr wrote:
         | Not apt-get per se, but most of the infra components underlying
         | the "AI stuff" can be `conda install`-ed.
        
           | amelius wrote:
           | No thank you. I spent months trying to find the source of
           | mysterious segfaults on my system, that magically disappeared
           | once I moved away from conda.
           | 
           | Sorry for not being more specific, but at this point I just
           | lost faith in this package manager.
        
         | yetihehe wrote:
          | You can also download LM Studio for a nice GUI version that
          | saves your chats and allows easy downloading of models.
        
       | replete wrote:
        | That's not a bad result, although for the £320 that 4x Pi 5s
        | cost you could probably find a used 12GB 3080 and get more than
        | 10x the token speed.
        
         | varispeed wrote:
         | > Deepseek R1 Distill 8B Q40 on 1x 3080, 60.43 tok/s (eval
         | 110.68 tok/s)
         | 
         | That wouldn't get on Hacker News ;-)
        
           | jckahn wrote:
           | HNDD: Hacker News Driven Development
        
         | geerlingguy wrote:
         | Or attach a 12 or 16 GB GPU to a single Pi 5 directly, and get
         | 20+ tokens/s on an even larger model :D
         | 
         | https://github.com/geerlingguy/ollama-benchmark?tab=readme-o...
        
           | littlestymaar wrote:
           | Reading the beginning of your comment I was like "ah yes I
           | saw Jeff Geerling do that on a video".
           | 
              | Then I saw your GitHub link and your HN handle and I was
              | like "Wait, it _is_ Jeff Geerling!". :D
        
             | ziml77 wrote:
             | Haha I had nearly the same thing happen. First I was like
             | "that sounds like something Jeff Geerling would do". Then I
             | saw the github link and was like "ah yeah Jeff Geerling did
             | do it" and then I saw the username and was like "oh it's
             | Jeff Geerling!"
        
           | replete wrote:
            | Thanks for sharing. Pi 5 + cheap AMD GPU = convenient modest
            | LLM API server? ...if you find the right magic ROCm
            | incantations, I guess.
           | 
           | Double thanks for the 3rd party mac mini SSD tip - eagerly
           | awaiting delivery!
        
             | geerlingguy wrote:
             | llama.cpp runs great with Vulkan, so no ROCm magic
             | required!
        
         | HPsquared wrote:
         | Or a couple of 12GB 3060s.
        
           | agilob wrote:
           | Some of us also worry about energy consumption
        
             | talldayo wrote:
             | They idle at pretty low wattages, and since the bulk of the
             | TDP is rated for raster workloads you usually won't see
             | them running at full-power on compute workloads.
             | 
             | My 300w 3070ti doesn't really exceed 100w during inference
             | workloads. Boot up a 1440p video game and it's a different
             | story altogether, but for inference and transcoding those
             | 3060s are some of the most power efficient options on the
             | consumer market.
        
       | ineedasername wrote:
       | All these have quickly become this generation's "can it run
       | Doom?"
        
       | JKCalhoun wrote:
       | I did not see (understand) how multiple Raspberry Pis are being
       | used in parallel. Maybe someone can point me in the right
       | direction to understand this.
        
         | jonatron wrote:
         | Blog post from same author explaining
         | https://b4rtaz.medium.com/how-to-run-llama-3-405b-on-home-de...
        
           | walrus01 wrote:
            | Noteworthy: nothing there really seems to be Raspberry Pi
            | specific. As Raspberry Pi OS is based on Debian, the same
            | could be implemented on any number of ordinary x86-64 small
            | desktop PCs for a cheap test environment. You can find older
            | Dell 'Precision' series workstation systems on eBay with
            | 32GB of RAM for pretty cheap these days, four of which
            | together would be a lot more capable than a Raspberry Pi 5.
        
       | dankle wrote:
        | On the CPU or the NPU "AI" hat?
        
       | ninetyninenine wrote:
        | Really, there needs to be a product based on LLMs, similar to
        | Alexa or Google Home, where instead of connecting to the cloud
        | it runs the LLM locally. I don't know why one doesn't exist yet,
        | or why no one is working on this.
        
         | fabiensanglard wrote:
         | > locally run LLM
         | 
         | You mean like Ollama + llamacpp ?
        
         | unshavedyak wrote:
         | Wouldn't it be due to price? Quality LLMs are expensive, so the
         | real question is can you make a product cheap enough to still
         | have margins _and_ a useful enough LLM that people would buy?
        
       | czk wrote:
        | Is it just me, or does calling these distilled models 'DeepSeek
        | R1' seem like a gross misrepresentation of what they actually
        | are?
        | 
        | People think that by running these tiny models distilled from
        | DeepSeek R1 they are actually running DeepSeek R1 itself.
        | 
        | It's kinda like if you drove a Civic with a Tesla body kit and
        | said it was a Tesla.
        
       ___________________________________________________________________
       (page generated 2025-02-15 23:01 UTC)