[HN Gopher] Estimating PaLM's Training Cost
___________________________________________________________________
Estimating PaLM's Training Cost
Author : Brajeshwar
Score : 36 points
Date : 2022-04-10 16:58 UTC (6 hours ago)
(HTM) web link (blog.heim.xyz)
(TXT) w3m dump (blog.heim.xyz)
| nootropicat wrote:
| Given how good PaLM is, this is nothing. I'm sure it could make
| over $10M/year of profit if it were open to outside use. I would
| actually pay a bit just to have fun with text games. I tried that
| with GPT-3, but it was a bit too stupid and stopped being fun fast.
|
| As a separate point, this suggests to me that 100x larger models
| are within the current reach of Google and other megacorps.
| rhacker wrote:
| That's really neat. That's nearly TNG-computer-level cool.
|
| You know how on TNG they ask the Computer for a bunch of related
| things and it somehow knows what they're talking about?
| axg11 wrote:
| There are 67 authors on the PaLM paper. Assuming an average salary
| of $150k (probably a large underestimate for Google employees),
| that's $10M+/year in salary alone. Compute is a big cost factor
| for this project, but it's not clear whether it even dominates the
| total cost.
|
| In the long run, this type of research will more than pay for
| itself in benefits to Google. NLP underpins everything they do. It
| would be interesting to see how the OpenAI API (GPT-3) is doing in
| terms of revenue. They're going for the more direct route of
| extracting value from a trained large language model.
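
For concreteness, a quick back-of-the-envelope check of the headcount
math in the comment above; the 67 authors come from the PaLM paper,
while the $150k average salary is the commenter's own assumption:

    # Rough salary cost using the thread's assumed average comp.
    num_authors = 67                 # author count on the PaLM paper
    assumed_avg_salary = 150_000     # USD/year, commenter's assumption
    annual_salary_cost = num_authors * assumed_avg_salary
    print(f"~${annual_salary_cost:,.0f}/year in salary alone")  # ~$10,050,000/year
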
| marczoid wrote:
| The fact that there are 67 people on the paper doesn't mean
| they all worked on it full time. In fact, I suppose fewer than 5
| did. I am quite certain most people spent only a tiny fraction
| of their time on this project. (Not saying the outcome isn't
| impressive; it's just how these projects tend to go.)
| modeless wrote:
| "Large underestimate" is an understatement. The average fully
| loaded cost of these employees may be an order of magnitude
| higher than your guess. Jeff Dean alone probably costs Google
| $10m per year.
| the8472 wrote:
| The model itself didn't take all of those authors a full year of
| full-time work to build from scratch. They build other models and
| release other papers throughout the year. Of course those works
| build on each other, but that makes it a much more complicated
| question how to account for the value and costs of prior work.
| axg11 wrote:
| To counter that, there were likely other people with smaller
| contributions who are not listed as authors. Some of the
| leadership are likely earning 2-3x my salary estimate.
| platers wrote:
| Everyone on the paper is likely making at least $300k. Then
| double that for taxes, healthcare, and perks.
| hansvm wrote:
| Double is a good rule of thumb for a median liquidatable
| income, but healthcare, food, gyms, 401k matching, ...
| are all bounded by reasonable constants, and taxes only
| add <10% from the employer's side of things. I'd be
| surprised if the fully loaded cost were more than an
| extra 50%.
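
To put numbers on the disagreement above, a minimal sketch comparing
the 2x rule of thumb with a +50% overhead figure, using the thread's
$300k base and the 67-author headcount (all of these are assumptions
from the discussion, not published figures):

    # Fully loaded annual cost under the two overhead multipliers debated above.
    num_authors = 67
    base_comp = 300_000   # USD/year, the thread's assumed minimum comp
    for label, multiplier in [("2x rule of thumb", 2.0), ("+50% overhead", 1.5)]:
        total = num_authors * base_comp * multiplier
        print(f"{label}: ~${total / 1e6:.0f}M/year")
    # 2x rule of thumb: ~$40M/year
    # +50% overhead: ~$30M/year
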
| [deleted]
| [deleted]
| gwern wrote:
| One major caveat here: the Chinchilla scaling law indicates that
| most of this compute was wasted. You could've gotten PaLM's
| performance at a fraction of the cost: Chinchilla approaches
| PaLM's performance at roughly a fifth of the cost.
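
A rough way to see where the "fifth of the cost" figure comes from,
using the standard C ≈ 6 * N * D approximation for dense-transformer
training FLOPs and the published figures (PaLM: 540B parameters, 780B
tokens; Chinchilla: 70B parameters, 1.4T tokens). This is a sketch,
not an exact accounting of either training run:

    # Approximate training compute as C ≈ 6 * parameters * tokens (dense FLOPs).
    def train_flops(params: float, tokens: float) -> float:
        return 6 * params * tokens

    palm = train_flops(540e9, 780e9)          # ~2.5e24 FLOPs
    chinchilla = train_flops(70e9, 1.4e12)    # ~5.9e23 FLOPs
    print(f"PaLM / Chinchilla compute ratio: {palm / chinchilla:.1f}x")  # ~4.3x
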
| timpetri wrote:
| Question related to the Chinchilla paper[0], which says that the
| optimal amounts of training data for ~500B, 1T, and 10T parameter
| models are 11T, 21.2T, and 216.2T tokens, respectively. The PaLM
| paper[1] says it made use of 780B tokens.
|
| How many tokens of training data have humans produced across
| the entire internet, all our written works, etc.? Is there such
| a thing as a 216-trillion-token set?
|
| [0] https://arxiv.org/abs/2203.15556
| [1] https://arxiv.org/abs/2204.02311
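
The Chinchilla estimates quoted above all work out to roughly 20
tokens per parameter, which is the rule of thumb usually drawn from
that paper; a quick check against the numbers in the comment:

    # Tokens-per-parameter implied by the Chinchilla-optimal estimates above.
    estimates = [(500e9, 11.0e12), (1e12, 21.2e12), (10e12, 216.2e12)]  # (params, tokens)
    for params, tokens in estimates:
        print(f"{params / 1e9:>6.0f}B params -> {tokens / params:.1f} tokens/param")
    #    500B params -> 22.0 tokens/param
    #   1000B params -> 21.2 tokens/param
    #  10000B params -> 21.6 tokens/param
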
| _hark wrote:
| DeepMind to OpenAI and everyone else[1]:
|
| > your hyperparameters are bad and you should feel bad
|
| It's amazing to me that such a big goof was missed by so many
| for so long. All these multimillion-dollar language models, and
| people just took the scaling laws at face value.
|
| [1]: https://arxiv.org/abs/2203.15556
| gwern wrote:
| Isn't it amazing? One of those reminders that we don't
| understand DL as much as we like to pretend we do.
|
| Cyclic learning rates work well elsewhere, but I can't think
| of any other case where switching made _such_ a difference. I
| was completely shocked to read Chinchilla, and I'm still a
| little baffled - a cosine schedule? Really? That's it? You
| guys didn't make any other changes?
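
For readers who haven't seen one, a minimal, generic cosine
learning-rate decay sketch; the values below are illustrative and not
either paper's actual hyperparameters. The Chinchilla observation
alluded to above is that the decay horizon should roughly match the
actual number of training steps rather than overshoot it.

    import math

    def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
        """Cosine decay from peak_lr down to min_lr over total_steps."""
        progress = min(step / total_steps, 1.0)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

    # Illustrative values only, not the papers' actual hyperparameters.
    print(cosine_lr(step=0, total_steps=10_000, peak_lr=2e-4))       # peak LR at the start
    print(cosine_lr(step=10_000, total_steps=10_000, peak_lr=2e-4))  # decays to ~0 at the end
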
| BooneJS wrote:
| > Of course, Google didn't pay that much. They own the hardware.
|
| You're right. Google had to:
|
| 1. Design, tape-out, manufacture, and deploy TPUv4
| 2. Run PaLM
___________________________________________________________________
(page generated 2022-04-10 23:01 UTC)