[HN Gopher] My Deep Learning Rig
       ___________________________________________________________________
        
       My Deep Learning Rig
        
       Author : jacquesm
       Score  : 20 points
       Date   : 2023-08-15 20:05 UTC (2 hours ago)
        
 (HTM) web link (nonint.com)
 (TXT) w3m dump (nonint.com)
        
       | throwing_away wrote:
       | > This is because without dropping serious $$$ on mellanox high-
       | speed NICs and switches, inter-server communication bandwidth
       | quickly becomes the bottleneck when training large models. I
       | can't afford fancy enterprise grade hardware, so I get around it
       | by keeping my compute all on the same machine. This goal drives
       | many of the choices I made in building out my servers, as you
       | will see.
       | 
       | 10gbe is very cheap now, but I guess that's not enough?
        
         | liuliu wrote:
          | Yeah, you need 100GbE at a minimum; 10GbE is too little.
          | PCIe bandwidth itself can be a bottleneck, and PCIe 3.0
          | x16 (~16 GB/s) is already in the same ballpark as 100GbE.
          | 
          | BTW, to echo the author: the PSU, and 120V circuits here
          | in the U.S., are a major reason why I'm limiting myself
          | to 4 GPUs. Also, the 3090 still seems to have NVLink
          | support, so I wonder why the author hasn't brought that
          | up. From what I've experienced, NVLink does help if you
          | run data parallel training.
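
          A quick sketch of the bandwidth comparison above (the PCIe
          3.0 x16 figure of ~16 GB/s is a standard spec number, not
          something measured in the thread):

          ```python
          # Back-of-envelope link-speed comparison in GB/s.
          def gbit_to_gbyte(gigabits: float) -> float:
              """Convert a link speed in Gbit/s to GB/s (8 bits/byte)."""
              return gigabits / 8

          ten_gbe = gbit_to_gbyte(10)       # 1.25 GB/s
          hundred_gbe = gbit_to_gbyte(100)  # 12.5 GB/s

          # PCIe 3.0 x16 is ~16 GB/s per direction, which is roughly
          # where the NIC stops being the obvious bottleneck.
          pcie3_x16 = 16.0

          print(f"10GbE:      {ten_gbe:.2f} GB/s")
          print(f"100GbE:     {hundred_gbe:.2f} GB/s")
          print(f"PCIe3 x16: ~{pcie3_x16:.1f} GB/s")
          ```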
        
           | jacquesm wrote:
           | Couldn't you use a 240V dryer socket for that purpose? That
           | should get you 7200 Watts on a 30A circuit.
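
            The arithmetic behind the dryer-socket suggestion, with
            the usual 80% continuous-load derating added (a common
            U.S. NEC rule of thumb, not something stated in the
            comment):

            ```python
            # Usable power on a circuit, with continuous-load derating.
            def circuit_watts(volts: float, amps: float,
                              derate: float = 0.8) -> float:
                """Continuous watts available after the 80% derate."""
                return volts * amps * derate

            nameplate_240v = 240 * 30             # 7200 W, as in the comment
            continuous_240v = circuit_watts(240, 30)  # 5760 W sustained

            print(f"Nameplate:  {nameplate_240v} W")
            print(f"Continuous: {continuous_240v:.0f} W")
            ```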
        
          | bradfox2 wrote:
          | 100GbE Mellanox ConnectX cards are not actually that
          | expensive, though.
        
           | jacquesm wrote:
            | You'd need a switch too, unless you're going
            | point-to-point, but that will eat up PCIe slots that
            | you'd probably rather use for GPUs.
        
         | LTL_FTC wrote:
          | If these server boards support Thunderbolt AICs, and I
          | believe they might, as my Threadripper Pro board does,
          | daisy-chaining them together could get you 40 Gbps fairly
          | easily, if that is sufficient.
        
       | hooloovoo_zoo wrote:
        | Interesting; I wonder what the actual income from vast.ai
        | looked like.
        
         | jacquesm wrote:
          | Likewise; based on the costs listed on their page I'd say
          | no more than $0.80/hour or so, assuming a 50% gross
          | margin for vast.ai.
          | 
          | And that includes energy costs, so I assume the OP has a
          | cheap source of power. Here in NL I could not do this
          | profitably; even off solar power, it would be more
          | efficient to sell that power to the grid than to use it
          | to drive a GPU rig.
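
          A hypothetical break-even sketch for the estimate above:
          the $0.80/hour revenue figure is from the comment, but the
          ~2 kW average draw is an assumed number for a multi-GPU
          rig, chosen only for illustration:

          ```python
          # Break-even electricity price for renting out a rig.
          revenue_per_hour = 0.80  # USD/hour, from the comment's estimate
          rig_draw_kw = 2.0        # assumed average draw (illustrative)

          # Above this electricity price, the rig loses money per hour.
          breakeven_per_kwh = revenue_per_hour / rig_draw_kw

          print(f"Break-even: ${breakeven_per_kwh:.2f}/kWh")
          ```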
        
       | [deleted]
        
       | doctorpangloss wrote:
       | Without a fully connected NVLink network, the 3090s will be
       | underutilized for models that distribute the layers across
       | multiple GPUs.
       | 
       | If AMD were better supported, it would be most economical to use
       | 4x MI60s for 128GB using an Infinity Fabric bridge. However, in
       | order to get to the end of such a journey, you would have to know
       | something.
        
         | jacquesm wrote:
         | What kind of factor would that be?
        
       ___________________________________________________________________
       (page generated 2023-08-15 23:00 UTC)