[HN Gopher] INTELLECT-1: Launching the First Decentralized Training of a 10B Parameter Model
___________________________________________________________________
INTELLECT-1: Launching the First Decentralized Training of a 10B
Parameter Model
Author : jasondavies
Score : 93 points
Date : 2024-10-11 19:14 UTC (1 day ago)
(HTM) web link (www.primeintellect.ai)
(TXT) w3m dump (www.primeintellect.ai)
| dmitrygr wrote:
| > solve decentralized training step-by-step to ensure AGI will be
| open-source, transparent, and accessible
|
| One hell of an uncited leap from "we're multiplying a lot of
| numbers" to "AGI", as if it is a given
| DannyBee wrote:
| Well I mean, it's a group of people doing "open,
| decentralized" training that requires half a million dollars'
| worth of non-consumer hardware and $3,000 a month in
| electricity. Would you expect anything less than Silicon
| Valley-level arrogance?
| mountainriver wrote:
| This is cool work. I've been watching the slow evolution of
| this space for a couple of years, and it feels like a good way
| to ensure AI is owned by and accessible to everyone.
| saulrh wrote:
| > Prime Intellect
|
| Ah, yes, Prime Intellect, the AGI that went foom and genocided
| the universe because it was commanded to preserve human
| civilization without regard for human values. A strong contender
| for the least evil hostile superintelligence in fiction. What a
| wonderful thing to name your AI startup after. What's next,
| creating the Torment Nexus?
|
| (my position on the book _as a whole_ is more complex, but...
| really? _Really?_ )
| cmrx64 wrote:
| Least evil... strong words.
| saulrh wrote:
| It did host a successful and substantially satisfying human
| civilization, at least until it let a couple of presumptuous,
| self-important anarcho-primitivists kill it and genocide its
| subjects. Even if it was only a temporary and unstable
| illusion of alignment, that's one more values-satisfying
| civilization than the overwhelming majority of paperclippers
| manage. So yeah. Good? No. Least evil? Maybe.
| rep_lodsb wrote:
| >until it let a couple of presumptuous self-important
| anarchoprimitivists kill it and genocide its subjects
|
| That could have just been their private simulation. As far
| as I remember, it wouldn't even have outright lied to them,
| just let them believe they talked it into destroying
| itself.
| gryfft wrote:
| GP did specify least evil _hostile_ SI.
| robertclaus wrote:
| You may as well just go with Roko's Basilisk.
| QuesnayJr wrote:
| After reading that Torment Nexus post you didn't have the urge
| to name an AI product Torment Nexus? Really?
| m3kw9 wrote:
| But I can already train with 30 different vendors distributed
| across the US, so why do I need to use a "decentralized"
| training system? Decentralized inferencing makes more sense,
| as that is where things can be censored.
| ukuina wrote:
| > Decentralized training of INTELLECT-1 currently requires 8x
| H100 SXM5 GPUs.
|
| So, your garden-variety $0.5M desktop PC, then.
|
| Cool, cool.
|
| [1] https://viperatech.com/shop/nvidia-dgx-h100-p4387-system-640...
| DannyBee wrote:
| If you run it continuously for a month, it will use 13x the
| electricity of your average California house.
|
| So they really are a 10x company.
|
| The average house uses 571 kWh/month; this system at 10.2 kW
| max draw * 24 h * 30 days = 7,344 kWh.
|
| In California, that will cost you about $3,000 a month,
| depending on your power plan :)
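|
| That math as a quick Python sanity check (the ~$0.41/kWh rate
| is my assumption for a California plan, not a quoted figure):
|
|   # Cost of running an 8x H100 SXM5 box flat-out for a month
|   power_kw = 10.2            # max system draw, kW
|   kwh = power_kw * 24 * 30   # = 7,344 kWh/month
|   avg_ca_house_kwh = 571     # average CA household usage
|   rate_usd_per_kwh = 0.41    # assumed CA residential rate
|
|   print(kwh / avg_ca_house_kwh)   # ~12.9 average houses
|   print(kwh * rate_usd_per_kwh)   # ~$3,011 per month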
| oefrha wrote:
| Well, I don't have 8x H100s, but if I did, I probably wouldn't
| donate them to a VC-funded company. Remember "Open"AI?
|
| https://pitchbook.com/profiles/company/588977-92
| jgalt212 wrote:
| Very true, but if something similar were run by BOINC, I'd make
| a stab at contributing.
|
| https://boinc.berkeley.edu/
| csomar wrote:
| I don't know the intricacies of their VC deal. But if the data
| is open, and users put in some amount of compute and then get
| the model, where is the possible harm? The trade is done and
| dealt: you provided some of the compute and got the model
| back, right? Unless I am misunderstanding something about
| their distributed model or not reading the fine print.
| pizza wrote:
| So just spitballing here, but this is likely a souped-up,
| reverse-engineered DisTrO [0] under the hood, right? Or could
| it be something else?
|
| [0] https://www.youtube.com/watch?v=eLMJoCSjFbs
| ikeashark wrote:
| me: Oh cool, a project like Folding@Home but for AI compute,
| maybe I'll contribute as we-
|
| > Decentralized training of INTELLECT-1 currently requires 8x
| H100 SXM5 GPUs.
|
| me: and for that reason, I'm out
|
| Also, they state that they will later add the ability to
| contribute your own compute, but how will they solve the
| problem of back-propagating to all of the remote nodes
| contributing to the project without egregiously slowing
| training?
| monkeydust wrote:
| Yeah, come back when you can do this on BOINC.
| not_a_dane wrote:
| Decentralised, but with a very high barrier to entry.
| mt_ wrote:
| > We quantize the pseudo-gradients to int8, reducing
| communication requirements by 400x.
|
| Can someone explain whether this reduces overall model quality?
| PoignardAzur wrote:
| > _In our experiments, we found that we are able to perform
| int8 quantization on the pseudo gradients without any impact on
| the loss curves._
|
| Allegedly not?
| empiko wrote:
| The gradients are noisy as they are; this additional noise
| probably does not hurt much overall.
| vessenes wrote:
| To give some intuition here, it's not crazy to think that
| combining a bunch of independent 8-bit values would get you
| back to roughly 32 bits of precision, especially since it's
| not always (often?) the case that a particular weight needs
| the extremes of that range.
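|
| A minimal sketch of what per-tensor int8 quantization of a
| pseudo-gradient could look like (the symmetric scale scheme
| here is an assumption; Prime Intellect doesn't spell out the
| exact recipe, and the 400x presumably combines the 4x from
| fp32 -> int8 with only syncing once every ~100 local steps):
|
|   import numpy as np
|
|   def quantize_int8(pseudo_grad):
|       # Map the largest |value| to 127; send 1 byte per
|       # weight plus a single float32 scale factor.
|       scale = max(np.abs(pseudo_grad).max() / 127.0, 1e-12)
|       q = np.clip(np.rint(pseudo_grad / scale), -127, 127)
|       return q.astype(np.int8), scale
|
|   def dequantize_int8(q, scale):
|       return q.astype(np.float32) * scale
|
| Averaging the dequantized payloads from many workers cancels
| much of the independent rounding noise, which is one reason
| the loss curves can survive it.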
| PoignardAzur wrote:
| A lot of comments are sneering at various aspects of this
| press release, and yeah, there's some cringeworthy stuff.
|
| But the technical aspects are pretty cool:
|
| - Fault-tolerant training where nodes can be added and removed
| mid-run without interrupting the other nodes.
|
| - Sending quantized gradients during the synchronization phase.
|
| - (In the OpenDiLoCo article) Async synchronization.
|
| They're also mentioning potential trustless systems where
| everyone can contribute compute, which would make this a truly
| decentralized open platform. Overall it'll be pretty interesting
| to see where this goes!
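|
| For the curious, a rough single-process sketch of the
| DiLoCo-style outer loop these pieces plug into (the model,
| step counts, and optimizers are illustrative, not Prime
| Intellect's actual configuration):
|
|   import copy
|   import torch
|
|   model = torch.nn.Linear(16, 16)
|   inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
|   H = 100  # local steps between syncs (illustrative)
|
|   for outer_step in range(10):
|       anchor = copy.deepcopy(model.state_dict())
|       for _ in range(H):  # cheap local steps, no network I/O
|           x = torch.randn(8, 16)
|           loss = model(x).pow(2).mean()
|           inner_opt.zero_grad()
|           loss.backward()
|           inner_opt.step()
|       # The pseudo-gradient is all that crosses the network,
|       # once every H steps, int8-quantized:
|       pseudo_grad = {k: anchor[k] - v
|                      for k, v in model.state_dict().items()}
|       # all_reduce(pseudo_grad) across nodes, then an outer
|       # optimizer (SGD with Nesterov momentum in the DiLoCo
|       # paper) applies the averaged update to the anchor.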
| londons_explore wrote:
| > Sending quantized gradients during the synchronization phase.
|
| I did this 9 years ago; it works pretty well. I don't
| understand why all ML isn't async and quantized like that now.
| This project quantizes to 1 bit per weight, and it works so
| well I didn't even make it configurable.
|
| https://github.com/Hello1024/shared-tensor
| radarsat1 wrote:
| > 1 bit per weight
|
| Does this basically correspond to moving each weight either
| up or down by a fixed amount? I'm a bit surprised you don't
| at least need a "stay the same" bit, but I suppose it could
| balance out over multiple iterations.
|
| Interesting that it works at all. Although, thinking on it, I
| could see it maybe even having a nice regularizing effect
| where every layer would end up having similar weight
| magnitudes (like projecting onto the local n-ball, as
| mentioned in a paper posted recently on HN).
| f_devd wrote:
| It has been more formally studied in signSGD[0], and
| empirically it's comparable to Adam in terms of behavior.
|
| [0]: https://arxiv.org/pdf/1802.04434
| londons_explore wrote:
| This is for keeping the weight vectors in sync between two
| machines.
|
| The weight vectors themselves are regular floats. But the
| data exchanged between the machines is 1 bit. Basically, you
| keep track of _changes_ to the weight vector which haven't
| yet been propagated to the other machine. You quantize this
| to 1 bit per weight (i.e. a sign bit) and send it, together
| with a single scale factor X, accumulating the quantization
| error for the next sync iteration.
|
| You choose X to be the RMS or some similar metric of the
| accumulated error.
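|
| In code, a minimal sketch of that scheme (the RMS scale
| follows the description above; the rest is illustrative):
|
|   import numpy as np
|
|   class OneBitSync:
|       def __init__(self, shape):
|           # Quantization error not yet sent to the peer
|           self.residual = np.zeros(shape, dtype=np.float32)
|
|       def encode(self, delta):
|           total = self.residual + delta         # unsent changes
|           scale = np.sqrt(np.mean(total ** 2))  # X = RMS
|           sent = np.where(total < 0, -scale, scale)
|           self.residual = total - sent          # carry the error
|           return total < 0, scale               # 1 bit/weight + X
|
| The carried residual is also why no explicit "stay the same"
| bit is needed: a weight with no net change just oscillates
| around zero transferred value across sync rounds.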
| openrisk wrote:
| For some purposes, a decentrally trained, open-source LLM
| could be just fine? E.g., you want a stochastic parrot that is
| trained on a large, general-purpose corpus of genuine public
| domain / Creative Commons content. Having such a tool widely
| available is still a quantum leap versus Lorem Ipsum. Up to a
| point, you can take your time. There is no manic race to
| capitalize on any hype: "slow open AI" instead of "fast closed
| AGI". Helpfully, the nature of the target corpus does not
| change every day. You can imagine, e.g., annual revisions,
| trained and rolled out leisurely. Both costs and benefits get
| widely distributed.
| macrolime wrote:
| Not exactly what I would call decentralized training. More
| like distributed across multiple data centers.
|
| Decentralized training would be when you can use consumer
| GPUs. That's not likely to work with backpropagation directly,
| but it might with one of the backpropagation-approximating
| algorithms.
| dartos wrote:
| Didn't BLOOM do this with their Petals tool?
| nickpsecurity wrote:
| The main benefit of this type of decentralization seems to be
| minimizing node cost. One can rent the cheapest nodes to use
| in the system, and even temporary instances can be replaced
| with others. It's also easy for system owners to donate time.
|
| So, mostly cost reduction mixed with some cloud and vendor
| diversity.
| James_K wrote:
| My initial reaction was quite negative, but having thought it
| through, I can see the logic in this. Having open models is
| better than closed models. That said, this page seems like a
| joke. Someone drank a little too much AI Kool-Aid, methinks.
___________________________________________________________________
(page generated 2024-10-12 23:01 UTC)