[HN Gopher] INTELLECT-1: Launching the First Decentralized Train...
       ___________________________________________________________________
        
       INTELLECT-1: Launching the First Decentralized Training of a 10B
       Parameter Model
        
       Author : jasondavies
       Score  : 93 points
       Date   : 2024-10-11 19:14 UTC (1 day ago)
        
 (HTM) web link (www.primeintellect.ai)
 (TXT) w3m dump (www.primeintellect.ai)
        
       | dmitrygr wrote:
       | > solve decentralized training step-by-step to ensure AGI will be
       | open-source, transparent, and accessible
       | 
       | One hell of an uncited leap from "we're multiplying a lot of
       | numbers" to "AGI", as if it is a given
        
         | DannyBee wrote:
          | Well, I mean, it's a group of people doing "open,
          | decentralized" training that requires half a million dollars'
          | worth of non-consumer hardware and $3,000 a month in
          | electricity. Would you expect anything less than Silicon
          | Valley-level arrogance?
        
       | mountainriver wrote:
        | This is cool work. I've been watching the slow evolution of this
        | space for a couple of years, and it feels like a good way to
        | ensure AI is owned by and accessible to everyone.
        
       | saulrh wrote:
       | > Prime Intellect
       | 
       | Ah, yes, Prime Intellect, the AGI that went foom and genocided
       | the universe because it was commanded to preserve human
       | civilization without regard for human values. A strong contender
       | for the least evil hostile superintelligence in fiction. What a
       | wonderful thing to name your AI startup after. What's next,
       | creating the Torment Nexus?
       | 
       | (my position on the book _as a whole_ is more complex, but...
        | really? _Really?_)
        
         | cmrx64 wrote:
         | Least evil... strong words.
        
           | saulrh wrote:
           | It did host a successful and substantially-satisfying human
           | civilization, at least until it let a couple of presumptuous
           | self-important anarchoprimitivists kill it and genocide its
           | subjects. Even if it was only a temporary and unstable
           | illusion of alignment, that's one more values-satisfying
           | civilization than the overwhelming majority of paperclippers
           | manage. So yeah. Good? No. Least evil? Maybe.
        
             | rep_lodsb wrote:
             | >until it let a couple of presumptuous self-important
             | anarchoprimitivists kill it and genocide its subjects
             | 
             | That could have just been their private simulation. As far
             | as I remember, it wouldn't even have outright lied to them,
             | just let them believe they talked it into destroying
             | itself.
        
           | gryfft wrote:
           | GP did specify least evil _hostile_ SI.
        
         | robertclaus wrote:
         | You may as well just go with Roko's Basilisk.
        
         | QuesnayJr wrote:
         | After reading that Torment Nexus post you didn't have the urge
         | to name an AI product Torment Nexus? Really?
        
       | m3kw9 wrote:
        | But I can already train with 30 different vendors distributed
        | across the US, so why do I need a "decentralized" training
        | system? Decentralized inferencing makes more sense, as that is
        | where things can be censored.
        
       | ukuina wrote:
       | > Decentralized training of INTELLECT-1 currently requires 8x
       | H100 SXM5 GPUs.
       | 
       | So, your garden-variety $0.5M desktop PC, then.
       | 
       | Cool, cool.
       | 
       | [1] https://viperatech.com/shop/nvidia-
       | dgx-h100-p4387-system-640...
        
         | DannyBee wrote:
          | If you run it continuously for a month, it will use about 13x
          | the electricity of the average California house.
          | 
          | So they really are a 10x company.
          | 
          | The average house uses 571 kWh/month; at a max draw of 10.2
          | kW, this rig uses 10.2 kW * 24 h * 30 days = 7,344 kWh, or
          | roughly 13x as much.
          | 
          | In California, that will cost you about $3,000 a month,
          | depending on your power plan :)
        
       | oefrha wrote:
        | Well, I don't have 8x H100s, but if I did, I probably wouldn't
        | donate them to a VC-funded company. Remember "Open"AI?
       | 
       | https://pitchbook.com/profiles/company/588977-92
        
         | jgalt212 wrote:
          | Very true, but if something similar were run by BOINC, I'd
          | take a stab at contributing.
         | 
         | https://boinc.berkeley.edu/
        
         | csomar wrote:
          | I don't know the intricacies of their VC deal. But if the data
          | is open and users put in some amount of compute and then get
          | the model, where is the possible harm? The trade is done and
          | dealt. You provided some compute and got the model back,
          | right? Unless I am misunderstanding something about their
          | distributed model or not reading the fine print.
        
       | pizza wrote:
        | So, just spitballing here, but this is likely a souped-up,
        | reverse-engineered DisTrO [0] under the hood, right? Or could it
        | be something else?
       | 
       | [0] https://www.youtube.com/watch?v=eLMJoCSjFbs
        
       | ikeashark wrote:
       | me: Oh cool, a project like Folding@Home but for AI compute,
       | maybe I'll contribute as we-
       | 
       | > Decentralized training of INTELLECT-1 currently requires 8x
       | H100 SXM5 GPUs.
       | 
       | me: and for that reason, I'm out
       | 
        | They also state that they will later add the ability to
        | contribute your own compute, but how will they solve the problem
        | of having to back-propagate to all of the remote nodes
        | contributing to the project without egregiously slow training
        | times?
        
       | monkeydust wrote:
       | Yea, come back when you can do this on BOINC.
        
       | not_a_dane wrote:
        | Decentralised, but with a very high barrier to entry.
        
       | mt_ wrote:
       | > We quantize the pseudo-gradients to int8, reducing
       | communication requirements by 400x.
       | 
        | Can someone explain whether this reduces the model's quality
        | overall?
        
         | PoignardAzur wrote:
         | > _In our experiments, we found that we are able to perform
         | int8 quantization on the pseudo gradients without any impact on
         | the loss curves._
         | 
         | Allegedly not?
        
         | empiko wrote:
          | The gradients are noisy as they are; this additional noise
          | probably doesn't hurt that much overall.
        
         | vessenes wrote:
          | To give some intuition here, it's not crazy to think that
          | combining a bunch of different 8-bit precision values would
          | get you back to roughly 32 bits of precision, especially since
          | it's not always (often?) the case that a particular weight
          | needs the edges of that mantissa.
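          | 
          | A minimal numpy sketch of that intuition (illustrative only;
          | every name here is made up, and this is not Prime Intellect's
          | code): quantize each worker's noisy copy of a gradient to int8
          | with a per-tensor scale, average the dequantized copies, and
          | most of the rounding error washes out across workers.
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(0)
          |   n = 100_000
          |   true_grad = rng.normal(size=n).astype(np.float32)
          |   # 32 workers, each with a noisy view of the same gradient
          |   workers = [true_grad + 0.1 * rng.normal(size=n)
          |              for _ in range(32)]
          | 
          |   def quantize_int8(x):
          |       scale = np.abs(x).max() / 127.0  # per-tensor scale
          |       q = np.round(x / scale).astype(np.int8)
          |       return q.astype(np.float32) * scale  # dequantize
          | 
          |   avg_fp32 = np.mean(workers, axis=0)
          |   avg_int8 = np.mean([quantize_int8(w) for w in workers],
          |              axis=0)
          |   # per-worker rounding error shrinks when averaged
          |   print(np.abs(avg_int8 - avg_fp32).mean())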
        
       | PoignardAzur wrote:
        | A lot of comments are sneering at various aspects of this press
       | release, and yeah, there's some cringeworthy stuff.
       | 
       | But the technical aspects are pretty cool:
       | 
        | - Fault-tolerant training where nodes can be added and removed
        | mid-run without interrupting the other nodes.
       | 
        | - Sending quantized gradients during the synchronization phase
        | (see the sketch below).
       | 
       | - (In the OpenDiLoCo article) Async synchronization.
       | 
       | They're also mentioning potential trustless systems where
       | everyone can contribute compute, which would make this a truly
       | decentralized open platform. Overall it'll be pretty interesting
       | to see where this goes!
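        | 
        | To make the quantized-sync point concrete, here is a rough
        | DiLoCo-style outer loop in numpy, a sketch based on the DiLoCo /
        | OpenDiLoCo papers (toy loss, made-up names, not Prime
        | Intellect's actual code): each worker takes many local steps,
        | ships only a quantized pseudo-gradient, and an outer step
        | averages them into the global weights.
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        | 
        |   def quantize_int8(x):
        |       scale = max(np.abs(x).max() / 127.0, 1e-12)
        |       q = np.round(x / scale).astype(np.int8)
        |       return q.astype(np.float32) * scale
        | 
        |   def inner_steps(theta, target, h=100, lr=0.01):
        |       # stand-in for H local AdamW steps on one data shard
        |       # (toy quadratic loss ||theta - target||^2)
        |       for _ in range(h):
        |           theta = theta - lr * 2 * (theta - target)
        |       return theta
        | 
        |   theta = rng.normal(size=1000).astype(np.float32)
        |   shards = [rng.normal(size=1000).astype(np.float32)
        |             for _ in range(4)]  # 4 workers' data
        | 
        |   for _ in range(10):  # outer synchronization steps
        |       # each worker trains locally, then ships only a quantized
        |       # pseudo-gradient; a crashed worker can just be dropped
        |       deltas = [quantize_int8(theta - inner_steps(theta.copy(), s))
        |                 for s in shards]
        |       theta = theta - 0.7 * np.mean(deltas, axis=0)  # outer SGD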
        
         | londons_explore wrote:
         | > Sending quantized gradients during the synchronization phase.
         | 
          | I did this 9 years ago; it works pretty well. I don't
          | understand why all ML isn't async and quantized like that by
          | now. My project quantizes to 1 bit per weight, and it works so
          | well I didn't even make it configurable.
         | 
         | https://github.com/Hello1024/shared-tensor
        
           | radarsat1 wrote:
           | > 1 bit per weight
           | 
            | Does this basically correspond to moving each weight either
            | up or down by a fixed amount? I'm a bit surprised you don't
            | at least need a "stay same" bit, but I suppose it could
            | balance out over multiple iterations.
            | 
            | Interesting that it works at all. Although, thinking on it, I
            | could see it maybe even having a nice regularizing effect
            | where every layer would end up having similar weight
            | magnitudes (like projecting onto the local n-ball, as
            | mentioned in a paper posted recently on HN).
        
             | f_devd wrote:
             | It has been more formally studied in signSGD[0], and
             | empirically it's comparable to Adam in terms of behavior.
             | 
             | [0]: https://arxiv.org/pdf/1802.04434
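              | 
              | For reference, a minimal sketch of the update rule from
              | that paper (variable names are mine):
              | 
              |   import numpy as np
              | 
              |   # signSGD: only the sign of each gradient entry is
              |   # used, so a worker communicates 1 bit per weight
              |   def signsgd_step(w, grad, lr=0.01):
              |       return w - lr * np.sign(grad)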
        
             | londons_explore wrote:
             | This is for keeping the weight vectors in sync between two
             | machines.
             | 
              | The weight vectors themselves are regular floats, but the
              | data exchanged between the machines is 1 bit per weight.
              | Basically, you keep track of the _changes_ to the weight
              | vector which haven't yet been propagated to the other
              | machine. You quantize these to 1 bit per weight (i.e. a
              | sign bit) and send them, together with a single scale
              | factor X, accumulating the quantization error for the next
              | sync iteration.
             | 
             | You choose X to be the RMS or some similar metric of the
             | accumulated error.
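              | 
              | A minimal sketch of that error-feedback loop in numpy
              | (illustrative names only, not the actual shared-tensor
              | API):
              | 
              |   import numpy as np
              | 
              |   class OneBitSync:
              |       # accumulates un-sent weight changes; each sync
              |       # ships 1 bit per weight plus one scale factor
              |       def __init__(self, n):
              |           self.residual = np.zeros(n, dtype=np.float32)
              | 
              |       def encode(self, delta):
              |           # add new weight changes to the un-sent backlog
              |           self.residual += delta
              |           signs = np.sign(self.residual)
              |           # the scale factor X: RMS of accumulated error
              |           scale = np.sqrt(np.mean(self.residual ** 2))
              |           # keep the quantization error for the next sync
              |           self.residual -= signs * scale
              |           return signs, scale
              | 
              |       @staticmethod
              |       def decode(signs, scale):
              |           # receiver applies this to its weight copy
              |           return signs * scale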
        
       | openrisk wrote:
        | For some purposes, a decentrally trained, open-source LLM could
        | be just fine? E.g., you want a stochastic parrot trained on a
        | large, general-purpose corpus of genuine public domain /
        | Creative Commons content. Having such a tool widely available is
        | still a quantum leap versus Lorem Ipsum. Up to a point, you can
        | take your time; there is no manic race to capitalize on any
        | hype: "slow open AI" instead of "fast closed AGI". Helpfully,
        | the nature of the target corpus does not change every day. You
        | can imagine, e.g., annual revisions, trained and rolled out
        | leisurely. Both costs and benefits get widely distributed.
        
       | macrolime wrote:
        | Not exactly what I would call decentralized training. More like
        | distributed across multiple data centers.
        | 
        | Decentralized training would be when you can use consumer GPUs.
        | That's not likely to work with backpropagation directly, but it
        | might with one of the backpropagation-approximating algorithms.
        
         | dartos wrote:
          | Didn't BLOOM do this with their Petals tool?
        
       | nickpsecurity wrote:
        | The main benefit of this type of decentralization seems to be
        | minimizing node cost: one can rent the cheapest nodes available,
        | and even temporary instances can be replaced with others. It's
        | also easy for system owners to donate time.
        | 
        | So, mostly cost reduction mixed with some cloud and vendor
        | diversity.
        
       | James_K wrote:
        | My initial reaction was quite negative, but having thought it
        | through, I can see the logic in this. Having open models is
        | better than closed models. That said, this page seems like a
        | joke. Someone drank a little too much AI Kool-Aid, methinks.
        
       ___________________________________________________________________
       (page generated 2024-10-12 23:01 UTC)