[HN Gopher] Estimating PaLM's Training Cost
       ___________________________________________________________________
        
       Estimating PaLM's Training Cost
        
       Author : Brajeshwar
       Score  : 36 points
       Date   : 2022-04-10 16:58 UTC (6 hours ago)
        
 (HTM) web link (blog.heim.xyz)
 (TXT) w3m dump (blog.heim.xyz)
        
       | nootropicat wrote:
        | Given how good PaLM is, this is nothing. I'm sure it could
        | make over $10M/year in profit if it were open to outside use.
        | I would actually pay a bit just to have fun with text games. I
        | tried with GPT-3, but it was a bit too stupid and stopped being
        | fun fast.
       | 
       | As a separate point, this suggests to me that 100x larger models
       | are within the current reach of Google and other megacorps.
        
       | rhacker wrote:
        | That's really neat. That's near TNG-computer levels of cool.
        | 
        | You know how on TNG they ask the Computer for a bunch of
        | related things and it somehow knows what they're talking about?
        
       | axg11 wrote:
        | There are 67 authors on the PaLM paper. Assuming an average
        | salary of $150k (probably a large underestimate for Google
        | employees), that's $10M+/year in salary alone. Compute is a
        | big cost factor for this project, but it's not clear whether
        | it even dominates the total cost.
       | 
        | In the long run this type of research will more than pay for
        | itself in benefits to Google; NLP underpins everything they
        | do. It would be interesting to see how the OpenAI API (GPT-3)
        | is doing in terms of revenue - they're going for the more
        | direct route of extracting value from a trained large language
        | model.
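
        A back-of-the-envelope version of that estimate, as a Python
        sketch. The 67-author count is from the paper; the $150k average
        salary is the assumption from the comment above, not a Google
        figure:

          # Rough annual salary bill for the PaLM author list.
          # avg_salary is an assumed figure, not Google data.
          authors = 67
          avg_salary = 150_000   # USD/year, likely a low estimate

          annual_salary_cost = authors * avg_salary
          print(f"${annual_salary_cost / 1e6:.2f}M/year in salary alone")
          # -> $10.05M/year in salary alone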
        
         | marczoid wrote:
          | The fact that there are 67 people on the paper doesn't mean
          | they all worked on it full time. In fact, I suspect fewer
          | than five did. I am quite certain most people only spent a
          | tiny fraction of their time on this project. (Not saying the
          | outcome isn't impressive; it's just how these projects tend
          | to go.)
        
         | modeless wrote:
         | Large underestimate is an understatement. The average fully
         | loaded cost of these employees may be an order of magnitude
         | higher than your guess. Jeff Dean alone probably costs Google
         | $10m per year.
        
         | the8472 wrote:
          | The model itself didn't take all those authors a full-time
          | work year to build from scratch. They build other models and
          | release other papers throughout the year. Of course those
          | works build on each other, but that makes it a much more
          | complicated question of how to account for the value and
          | costs of prior work.
        
           | axg11 wrote:
            | To counter that, there were likely other people with
            | smaller contributions who aren't listed as authors, and
            | some of the leadership are likely earning 2-3x my estimated
            | salary.
        
             | platers wrote:
              | Everyone on the paper is likely making at least $300k.
              | And roughly double that for taxes, healthcare, and perks.
        
               | hansvm wrote:
                | Double is a good rule of thumb for a median salary,
                | but healthcare, food, gyms, 401k matching, ... are all
                | bounded by reasonable constants, and taxes only add
                | <10% on the employer's side. I'd be surprised if the
                | fully loaded cost were more than an extra 50%.
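
        For comparison, a sketch of the two rules of thumb in this
        subthread ("double it" vs. "base plus ~50%"), scaled to the
        author list. The $300k base and the overhead multipliers are
        assumptions from the comments above, not actual Google numbers:

          authors = 67
          base = 300_000                 # assumed base comp, per above

          for multiplier in (1.5, 2.0):  # "+50%" vs. "double it"
              loaded = multiplier * base
              total = authors * loaded
              print(f"{multiplier}x: ${loaded/1e3:.0f}k loaded, "
                    f"${total/1e6:.1f}M/year for the author list")
          # Roughly $30M-$40M/year across the author list at these
          # assumptions - a few times the $10M raw-salary figure upthread.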
        
         | [deleted]
        
         | [deleted]
        
       | gwern wrote:
        | One major caveat here: the Chinchilla scaling law indicates
        | that most of this compute was wasted. You could have gotten
        | PaLM's performance at a fraction of the cost - Chinchilla
        | approaches PaLM's performance at about a fifth of the cost.
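
        A quick way to sanity-check the "fifth of the cost" figure is
        the standard C ~= 6*N*D approximation for training compute (N
        parameters, D tokens), with the published sizes of the two
        models; the formula is a rule of thumb, not an exact accounting:

          def train_flops(params, tokens):
              """Approximate training compute via C ~= 6 * N * D."""
              return 6 * params * tokens

          palm = train_flops(540e9, 780e9)        # 540B params, 780B tokens
          chinchilla = train_flops(70e9, 1.4e12)  # 70B params, 1.4T tokens

          print(f"PaLM       ~{palm:.2e} FLOPs")        # ~2.5e24
          print(f"Chinchilla ~{chinchilla:.2e} FLOPs")  # ~5.9e23
          print(f"PaLM/Chinchilla ~{palm / chinchilla:.1f}x")
          # -> ~4x, in the same ballpark as "a fifth of the cost"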
        
         | timpetri wrote:
          | Question related to the Chinchilla paper[0], which says that
          | the optimal amount of training data for ~500B, 1T, and 10T
          | parameter models is 11T, 21.2T, and 216.2T tokens,
          | respectively. The PaLM paper[1] says it was trained on 780B
          | tokens.
          | 
          | How many tokens of training data have humans produced across
          | the entire internet, all our written works, etc.? Is there
          | such a thing as a 216-trillion-token dataset?
          | 
          | [0] https://arxiv.org/abs/2203.15556
          | [1] https://arxiv.org/abs/2204.02311
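
        Those table values work out to roughly 20 training tokens per
        parameter. A small sketch using that heuristic (the 20x ratio is
        a rounded reading of the Chinchilla tables, not a number quoted
        in either paper):

          TOKENS_PER_PARAM = 20   # rough Chinchilla-optimal ratio

          for params in (70e9, 540e9, 1e12, 10e12):
              optimal = TOKENS_PER_PARAM * params
              print(f"{params/1e9:.0f}B params -> "
                    f"~{optimal/1e12:.1f}T tokens")

          # 540B params -> ~10.8T tokens, versus the 780B tokens PaLM
          # was actually trained on - i.e. roughly 14x less data than
          # the heuristic suggests for a model of its size.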
        
         | _hark wrote:
         | DeepMind to OpenAI and everyone else[1]:
         | 
         | > your hyperparameters are bad and you should feel bad
         | 
          | It's amazing to me that such a big goof was missed by so
          | many for so long. All these multimillion-dollar language
          | models, and people just took the scaling laws at face value.
         | 
         | [1]: https://arxiv.org/abs/2203.15556
        
           | gwern wrote:
           | Isn't it amazing? One of those reminders that we don't
           | understand DL as much as we like to pretend we do.
           | 
            | Cyclic learning rates work well elsewhere, but I can't
            | think of any other case where switching made _such_ a
            | difference. I was completely shocked to read Chinchilla,
            | and I'm still a little baffled - a cosine schedule? Really?
            | That's it? You guys didn't make any other changes?
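
        For reference, the schedule in question is plain cosine decay;
        the Chinchilla-specific point is roughly that the decay cycle
        should match the actual number of training steps rather than a
        longer horizon. A minimal sketch (the peak/min learning rates
        here are placeholders, not values from either paper):

          import math

          def cosine_lr(step, total_steps, peak_lr=2e-4, min_lr=2e-5):
              """Cosine decay from peak_lr down to min_lr over total_steps."""
              progress = min(step / total_steps, 1.0)
              decay = 0.5 * (1 + math.cos(math.pi * progress))
              return min_lr + (peak_lr - min_lr) * decay

          print(cosine_lr(5_000, 10_000))    # halfway: midpoint of peak/min
          print(cosine_lr(10_000, 10_000))   # end of training: min_lr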
        
       | BooneJS wrote:
       | > Of course, Google didn't pay that much. They own the hardware.
       | 
        | You're right. Google had to:
        | 
        |   1. Design, tape-out, manufacture, and deploy TPUv4
        |   2. Run PaLM
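
        For the "Run PaLM" line item, the kind of estimate the article
        makes can be rebuilt from three numbers: total training FLOPs,
        sustained throughput per chip, and a price per chip-hour. A
        sketch using the model size, token count, and ~46% utilization
        reported in the PaLM paper, with a placeholder chip-hour price
        rather than an actual Google Cloud rate:

          PARAMS     = 540e9
          TOKENS     = 780e9
          PEAK_FLOPS = 275e12   # TPU v4 peak bf16 FLOP/s per chip
          MFU        = 0.46     # ~46% model FLOPs utilization (PaLM paper)
          PRICE_PER_CHIP_HOUR = 3.0   # USD - placeholder assumption

          train_flops = 6 * PARAMS * TOKENS                 # ~2.5e24 FLOPs
          chip_hours  = train_flops / (PEAK_FLOPS * MFU) / 3600
          cost        = chip_hours * PRICE_PER_CHIP_HOUR

          print(f"~{chip_hours/1e6:.1f}M TPU v4 chip-hours, "
                f"~${cost/1e6:.0f}M at ${PRICE_PER_CHIP_HOUR}/chip-hour")
          # A few million chip-hours; the dollar figure scales linearly
          # with whatever chip-hour price you plug in.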
        
       ___________________________________________________________________
       (page generated 2022-04-10 23:01 UTC)