[HN Gopher] Run LLMs at home, BitTorrent-style
___________________________________________________________________
Run LLMs at home, BitTorrent-style
Author : udev4096
Score : 161 points
Date : 2023-09-17 16:30 UTC (6 hours ago)
(HTM) web link (petals.dev)
(TXT) w3m dump (petals.dev)
| nico wrote:
| This is so cool. Hopefully this will give thousands or millions
| more developers in the space access to these models.
| behnamoh wrote:
| Looking at the list of contributors, way more people need to
| donate their GPU time for the betterment of all. Maybe we finally
| have a good use for decentralized computing that doesn't
| calculate meaningless hashes for crypto, but instead helps
| humanity by keeping these open-source LLMs alive.
| Obscurity4340 wrote:
| This way, nobody can copyright-cancel the LLM either, like OpenAI
| or whatever
| judge2020 wrote:
| It can cost a lot to run a GPU, especially at full load. A stock
| 4090 pulls 500 watts under full load[0], which is 12 kWh/day, or
| about 4,380 kWh a year, or roughly $440-$480 a year assuming
| $0.10-$0.11/kWh for average residential rates. The only variable
| is whether training requires the same power draw as hitting it
| with FurMark.
|
| 0: https://youtu.be/j9vC9NBL8zo?t=983
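| A back-of-the-envelope version of that arithmetic (assuming 24/7
| full load and $0.10-$0.11/kWh; numbers rounded):
|
|     watts = 500
|     kwh_per_day = watts * 24 / 1000        # 12 kWh/day
|     kwh_per_year = kwh_per_day * 365       # 4,380 kWh/year
|     low, high = kwh_per_year * 0.10, kwh_per_year * 0.11
|     print(f"{kwh_per_year:.0f} kWh/yr, ${low:.0f}-${high:.0f}/yr")
|     # -> 4380 kWh/yr, $438-$482/yr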
| tossl568 wrote:
| Those "meaningless hashes" help secure hundreds of billions in
| savings of Bitcoin for hundreds of millions of people. Check
| your financial privilege.
| cheema33 wrote:
| > Those "meaningless hashes" help secure hundreds of billions
| in savings of Bitcoin for hundreds of millions of people.
|
| Can you back that up with actual data? Other than something
| that a crypto bro on the Internet told you?
| december456 wrote:
| That's not the best counterargument, because Bitcoin has privacy
| qualities by default. You can hop onto any block explorer and
| treat every address as another user, but you can't verify
| (without expensive analysis, on a case-by-case basis) that those
| addresses aren't owned by the same guy. Same with Tor: while some
| data like bridge usage is collected somehow (I haven't looked
| into it), you can't reliably prove that thousands or millions of
| people are using it to protect their privacy and resist
| censorship.
| [deleted]
| swyx wrote:
| So given that GGML can serve like 100 tok/s on an M2 Max, and
| this thing advertises 6 tok/s distributed, is this basically for
| people with lower-end devices?
| [deleted]
| version_five wrote:
| It's talking about 70B and 160B models. Can GGML run those that
| fast, even heavily quantized? (I'm guessing possibly.) So maybe
| this is for people who don't have a high-end computer? I have a
| decent Linux laptop a couple of years old and there's no way I
| could run those models that fast. I get a few tokens per second
| on a quantized 7B model.
| brucethemoose2 wrote:
| Yeah. My 3090 gets like ~5 tokens/s on 70B Q3KL.
|
| This is a good idea, as splitting up LLMs is actually pretty
| efficient with pipelined requests.
| russellbeattie wrote:
| > _...lower end devices_
|
| So, pretty much every other consumer PC available? Those
| losers.
| jmorgan wrote:
| This is neat. Model weights are split into their layers and
| distributed across several machines, which then report themselves
| in a big hash table when they are ready to perform inference or
| fine-tuning "as a team" over their subset of the layers.
|
| It's early, but I've been working on hosting model weights in a
| Docker registry for https://github.com/jmorganca/ollama, mainly
| for the content addressability (Ollama will verify the correct
| weights are downloaded every time) and so that, ultimately,
| weights can be fetched by their content instead of by their name
| or URL (which may change!). Perhaps a good next step would be to
| split the models by layers and store each layer independently for
| use cases like this (or even just for downloading and running
| larger models over several "local" machines).
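| A rough sketch of the content-addressing idea in Python (this is
| not Ollama's actual on-disk format; the helper names are made up
| for illustration):
|
|     import hashlib, json
|
|     def shard_digest(layer_bytes: bytes) -> str:
|         # Address each layer's weights by the SHA-256 of its bytes,
|         # so a download can be verified and deduplicated no matter
|         # what the model or layer is named.
|         return "sha256:" + hashlib.sha256(layer_bytes).hexdigest()
|
|     def build_manifest(layers: dict[str, bytes]) -> str:
|         # Map layer names -> content digests; the manifest is fetched
|         # by name, every weight blob by its hash.
|         return json.dumps(
|             {name: shard_digest(data) for name, data in layers.items()},
|             indent=2)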
| brucethemoose2 wrote:
| > and fine-tune them for your tasks
|
| This is the part that raised my eyebrows.
|
| Fine-tuning a 70B model is not just hard, it's literally
| impossible without renting a very expensive cloud instance or
| buying a PC the price of a house, no matter how long you are
| willing to wait. I would absolutely contribute to a "llama
| training horde".
| AaronFriel wrote:
| That's true for conventional fine-tuning, but is it the case for
| parameter-efficient fine-tuning and QLoRA? My understanding is
| that for an N-billion-parameter model, fine-tuning can be done on
| a GPU with slightly less than N gigabytes of VRAM.
|
| For that 70B-parameter model: an A100?
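| A minimal sketch of that kind of setup, assuming the Hugging Face
| peft/bitsandbytes stack (the model id, LoRA rank, and target
| modules below are illustrative placeholders, not from this
| thread):
|
|     import torch
|     from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|     from peft import (LoraConfig, get_peft_model,
|                       prepare_model_for_kbit_training)
|
|     # Load the frozen base model in 4-bit NF4, so a 70B base takes
|     # roughly 35-40 GB instead of ~140 GB in fp16.
|     bnb = BitsAndBytesConfig(load_in_4bit=True,
|                              bnb_4bit_quant_type="nf4",
|                              bnb_4bit_compute_dtype=torch.bfloat16)
|     model = AutoModelForCausalLM.from_pretrained(
|         "meta-llama/Llama-2-70b-hf",  # placeholder model id
|         quantization_config=bnb, device_map="auto")
|     model = prepare_model_for_kbit_training(model)
|
|     # Only small low-rank adapters are trained; the quantized
|     # base weights stay frozen.
|     lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
|                       target_modules=["q_proj", "v_proj"],
|                       task_type="CAUSAL_LM")
|     model = get_peft_model(model, lora)
|     model.print_trainable_parameters()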
| brucethemoose2 wrote:
| 2x 48GB GPUs would be the cheapest. But that's still a very
| expensive system.
| zacmps wrote:
| I think you'd need 2 80GB A100s for unquantised.
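| (For scale: 70B parameters x 2 bytes in fp16 is ~140 GB for the
| weights alone, which is why a single 80 GB card can't hold an
| unquantized copy; full fine-tuning with Adam needs several times
| that again for gradients and optimizer state.)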
| akomtu wrote:
| What prevents parallel LLM training? If you read book 1 first and
| then book 2, the resulting update in your knowledge will be the
| same as if you read the books in the reverse order. It seems
| reasonable to assume that if an LLM is trained on each book
| independently, the two deltas in the LLM weights can just be
| added up.
| contravariant wrote:
| In ordinary gradient descent the order does matter, since the
| position changes in between steps. I think stochastic gradient
| descent does sum a couple of gradients together sometimes, but
| I'm not sure what the trade-offs are or whether LLMs do so as
| well.
| ctoth wrote:
| This is not at all intuitive to me. It doesn't make sense from a
| human perspective, as each book changes you. Consider the trivial
| case of a series, where nothing will make sense if you haven't
| read the prior books (not that I think they feed it the book
| corpus in order; maybe they should!), but even in a more
| philosophical sort of way, each book changes you, and the person
| who reads Harry Potter first and The Iliad second will have a
| different experience of each. Then, with large language models,
| we have the concept of grokking something. If grokking happens in
| the middle of book 1, it is a different model that is reading
| book 2, and of course the inverse applies.
| whimsicalism wrote:
| By the "delta in the LLM weights", I am assuming you mean the
| gradients. You are effectively describing large batch
| training (data parallelism) which is part of the way you can
| scale up but there are quickly diminishing returns to large
| batch sizes.
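| A toy PyTorch sketch of that distinction (a single scalar
| parameter; the numbers are made up): gradients from two "books"
| combine exactly into one step, but two sequential updates are not
| the same as one combined update.
|
|     import torch
|
|     w = torch.tensor(1.0, requires_grad=True)
|     loss = lambda batch: ((w * batch - 2.0) ** 2).mean()
|
|     book1 = torch.tensor([1.0, 2.0])
|     book2 = torch.tensor([3.0, 4.0])
|
|     # Data parallelism: per-book gradients combine exactly into the
|     # gradient of the joint batch, so they can be computed in parallel.
|     g1 = torch.autograd.grad(loss(book1), w)[0]
|     g2 = torch.autograd.grad(loss(book2), w)[0]
|     g_both = torch.autograd.grad(loss(torch.cat([book1, book2])), w)[0]
|     print(g1 + g2, 2 * g_both)  # tensor(10.) tensor(10.) -- equal
|
|     # But only for one step: after an update on book1 the weights
|     # have moved, so book2's gradient is taken at a different point,
|     # and "deltas" from two independent trainings don't simply add.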
| eachro wrote:
| I'm not sure this is true. For instance, consider reading
| textbooks for linear algebra and functional analysis out of
| order. You might still grok the functional analysis if you
| read it first but you'd be better served by reading the
| linear algebra one first.
| malwrar wrote:
| Impossible? It's just a bunch of math, you don't need to keep
| the entire network in memory the whole time.
| Zetobal wrote:
| An H100 is maybe a car, but nowhere near a house...
| ioedward wrote:
| 8 H100s would have enough VRAM to finetune a 70B model.
| nextaccountic wrote:
| Is a single H100 enough?
| KomoD wrote:
| Maybe not in your area, but it's very doable in other places,
| like where I live.
| teaearlgraycold wrote:
| Would love to share my 3080 Ti, but after running the commands in
| the getting started guide (https://github.com/bigscience-
| workshop/petals/wiki/Run-Petal...) it looks like there's a
| dependency versioning issue: ImportError:
| cannot import name 'get_full_repo_name' from 'huggingface_hub'
| (~/.local/lib/python3.8/site-
| packages/huggingface_hub/__init__.py)
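| A quick way to see which versions pip actually resolved (purely
| diagnostic; the fix is likely a fresh virtualenv or pinning
| huggingface_hub to whatever Petals' requirements expect, I
| haven't checked the exact version):
|
|     import importlib.metadata as md
|     for pkg in ("petals", "transformers", "huggingface_hub"):
|         print(pkg, md.version(pkg))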
| esafak wrote:
| The first question I had was "what are the economics?" From the
| FAQ:
|
| _Will Petals incentives be based on crypto, blockchain, etc.?_
| No, we are working on a centralized incentive system similar to
| the AI Horde kudos, even though Petals is a fully decentralized
| system in all other aspects. We do not plan to provide a service
| to exchange these points for money, so you should see these
| incentives as "game" points designed to be spent inside our
| system. Petals is an ML-focused project designed for
| ML researchers and engineers, it does not have anything to do
| with finance. We decided to make the incentive system centralized
| because it is much easier to develop and maintain, so we can
| focus on developing features useful for ML researchers.
|
| https://github.com/bigscience-workshop/petals/wiki/FAQ:-Freq...
| kordlessagain wrote:
| The logical conclusion is that they (the models) will
| eventually be linked to crypto payments though. This is where
| Lightning becomes important...
|
| Edit: To clarify, I'm not suggesting linking these Petals
| "tokens" to any payment system. I'm saying that, in general,
| calls to clusters of machine learning models, decentralized or
| not, will likely use crypto payments because they give you both
| auth and a means of payment.
|
| I do think Petals is a good implementation of using decentralized
| compute for model use and will likely be valuable long term.
| vorpalhex wrote:
| I mean, I can sell you Eve or Runescape currency but we don't
| need any crypto to execute on it. "Gold sellers" existed well
| before crypto.
| Szpadel wrote:
| If that part could be replaced with any third-party server, it
| would be the tracker in the BitTorrent analogy.
| sn0wf1re wrote:
| Similarly, there have been distributed render farms for graphic
| design for a long time. No incentives other than that higher
| points mean your jobs are prioritized.
|
| https://www.sheepit-renderfarm.com/home
| brucethemoose2 wrote:
| > similar to the AI Horde kudos
|
| What they are referencing, which is super cool and (IMO)
| criminally underused:
|
| https://lite.koboldai.net/
|
| https://tinybots.net/artbot
|
| https://aihorde.net/
|
| In fact, I can host a 13B-70B finetune in the afternoon if
| anyone on HN wants to test a particular one out:
|
| https://huggingface.co/models?sort=modified&search=70B+gguf
| swyx wrote:
| > GGUF is a new format introduced by the llama.cpp team on August
| 21st, 2023. It is a replacement for GGML, which is no longer
| supported by llama.cpp. GGUF offers numerous advantages over
| GGML, such as better tokenisation and support for special tokens.
| It also supports metadata, and is designed to be extensible.
|
| is there a more canonical blogpost or link to learn more
| about the technical decisions here?
| brucethemoose2 wrote:
| https://github.com/philpax/ggml/blob/gguf-
| spec/docs/gguf.md#...
|
| It is (IMO) a necessary and good change.
|
| I just specified GGUF because my 3090 cannot host a 70B model
| without offloading, outside of ExLlama's very new ~2-bit
| quantization. And a pre-quantized GGUF is a much smaller download
| than raw fp16 for conversion.
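| For anyone who hasn't tried a GGUF file yet, a minimal sketch
| with llama-cpp-python (the filename and layer count below are
| placeholders, not a specific recommendation):
|
|     from llama_cpp import Llama
|
|     # Offload as many layers as fit in VRAM; the rest run on CPU.
|     llm = Llama(model_path="llama-2-70b.Q3_K_L.gguf",
|                 n_gpu_layers=40, n_ctx=4096)
|     out = llm("Q: What is GGUF?\nA:", max_tokens=64)
|     print(out["choices"][0]["text"])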
| nextaccountic wrote:
| Can they actually prevent people from trading petals for money
| though?
| beardog wrote:
| >What's the motivation for people to host model layers in the
| public swarm?
|
| >People who run inference and fine-tuning themselves get a
| certain speedup if they host a part of the model locally. Some
| may be also motivated to "give back" to the community helping
| them to run the model (similarly to how BitTorrent users help
| others by sharing data they have already downloaded).
|
| >Since it may not be enough for everyone, we are also working on
| introducing explicit incentives ("bloom points") for people
| donating their GPU time to the public swarm. Once this system is
| ready, we will display the top contributors on our website.
| People who earned these points will be able to spend them on
| inference/fine-tuning with higher priority or increased security
| guarantees, or (maybe) exchange them for other rewards.
|
| It does seem like they want a sort of centralized token, however.
| [deleted]
| seydor wrote:
| It's a shame that every decentralized project needs to be
| compared to cryptocoins now
___________________________________________________________________
(page generated 2023-09-17 23:00 UTC)