[HN Gopher] Calculating the cost of a Google DeepMind paper
___________________________________________________________________
Calculating the cost of a Google DeepMind paper
Author : 152334H
Score : 238 points
Date : 2024-07-30 10:26 UTC (12 hours ago)
(HTM) web link (152334h.github.io)
(TXT) w3m dump (152334h.github.io)
| hnthr_w_y wrote:
| That's not very much by business standards; it's a lot when it
| comes to paying our salaries.
| willis936 wrote:
| Any company of any size that doesn't learn the right lessons
| from a $10M mistake will be out of business before long.
| brainwad wrote:
| That's like staffing a single-manager team on a bad project
| for a year. Which I assure you happens all the time in big
| companies, and yet they survive.
| saikia81 wrote:
| They are not saying it doesn't happen. They are saying: The
| companies that don't learn from these mistakes will go out
| of business before long.
| duggan wrote:
| In principle, for some other company, sure.
|
| Google makes ~$300b a year in _profit_. They could make a
| $10m mistake every day and barely make a dent in it.
| magic_man wrote:
| They do not, they made ~90 billion in profit. So no one
| would notice a 10 mil mistake, but no they didn't make
| 300b in profits.
| duggan wrote:
| I misread some stats, thanks for the correction.
| hnbad wrote:
| I think there might be a disagreement about what "big"
| means. _Google_ can easily afford to sink millions each
| year into pointless endeavours without going out of
| business and they probably have. Alphabet's annual
| revenue has been growing a good 10% each year since
| 2021[0]. That's in the range of $20-$30 _billion_ dollars
| with a B.
|
| To put that into perspective, Alphabet's revenue has
| increased 13.38% year-over-year as of June 30, arriving
| at $328.284 billion dollars - i.e. it has _increased_ by
| $38.74 billion in that time. A $10 million dollar mistake
| translates to losing 0.0258% of that number.
|
| A $10 million dollar mistake costs Alphabet 0.0258% of
| the amount their revenue _increased_ year-over-year as of
| last month. Alphabet could have afforded to make 40 such
| $10 million dollar mistakes in that period and it would
| have only represented a loss of 1% of the year-over-year
| _increase_ in revenue. Taking the year-over-year
| _increase_ down by 1% (from 13.38% to 12.38%) would have
| required making 290 such $10 million dollar mistakes
| within one year.
|
| Let me repeat that because it bears emphasizing: over the
| past years, _every year_ Google could have easily
| afforded _an additional 200 such $10 million dollar
| mistakes_ without significantly impacting their increase
| in revenue - and even in 2022 when inflation was almost
| double what it was in other years, they would have
| still come out ahead of inflation.
|
| So in terms of numbers this is demonstrably false. Of
| course the existence of repeated $10 million dollar
| mistakes may suggest the existence of structural issues
| that will result in $1, $10 or $100 billion dollar
| problems eventually and sink the company. But that's
| conjecture at this point.
|
| [0]: https://www.macrotrends.net/stocks/charts/GOOG/alpha
| bet/reve...
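The arithmetic in the comment above can be spot-checked in a few lines. The figures below are taken from the comment itself (13.38% YoY growth to $328.284B revenue), not verified against Alphabet's filings:

```python
# Spot-check of the revenue arithmetic; inputs are the comment's figures.
revenue_now = 328.284e9
growth = 0.1338
revenue_prev = revenue_now / (1 + growth)
yoy_increase = revenue_now - revenue_prev   # ~ $38.74B

mistake = 10e6  # one $10M mistake

# Share of the year-over-year increase consumed by one mistake:
share = mistake / yoy_increase * 100
print(f"{share:.4f}%")  # 0.0258%

# Mistakes needed to shave one percentage point off the growth rate:
n_mistakes = 0.01 * revenue_prev / mistake
print(round(n_mistakes))  # 290
```

Both numbers match the comment's 0.0258% and 290 claims.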
| vishnugupta wrote:
| https://killedbygoogle.com/
|
| I'm confident each one of them was a multiple-of-$10M
| investment.
|
| And this is just what we know because they were launched
| publicly.
| Sebb767 wrote:
| The point the parent made is not to avoid mistakes, but to
| learn from them. Which they probably did not do for all of
| them, as indicated by the sheer number of messenger apps on
| the list, but there's definitely a lot to learn from it.
| OtherShrezzing wrote:
| I'm not really certain that's true at Google's size. Their
| annual revenue is something like a quarter trillion dollars.
| 25,000x larger than a $10m mistake.
|
| The equivalent wastage for a self-employed person would be
| allowing a few cups of Starbucks coffee per year to go cold.
| Workaccount2 wrote:
| There was that time that google paid out something like
| $300M in fraudulent invoices.
| willis936 wrote:
| To be clear: what I mean by "not learning the right lessons"
| is a company deciding that the issue with wasting $10M in six
| months is that they didn't do it 100x in parallel in three
| months. Then when that goes wrong they must need to do it
| 100x wider in parallel again in three weeks.
| BartjeD wrote:
| If this ran on google's own cloud it amounts to internal
| bookkeeping. The only cost is then the electricity and used
| capacity. Not consumer pricing. So negligible.
|
| It is rather unfortunate that this sort of paper is hard to
| reproduce.
|
| That is a BIG downside, because it makes the result unreliable.
| They invested effort and money in getting an unreliable result.
| But perhaps other research will corroborate. Or it may give them
| an edge in their business, for a while.
|
| They chose to publish. So they are interested in seeing it
| reproduced or improved upon.
| rty32 wrote:
| Opportunity cost is cost. What you could have earned by selling
| the resources to customers instead of using them yourself _is_
| what the resources are worth.
| g15jv2dp wrote:
| This assumes that you can sell 100% of the resources'
| availability 100% of the time. Whenever you have more
| capacity than you can sell, there's no opportunity cost in
| using it yourself.
| michaelt wrote:
| A few months back, a lot of the most powerful GPU instances
| on GCP seemed to be sold out 24/7.
|
| I suppose it's possible Google's own infrastructure is
| partitioned from GCP infrastructure, so they have a bunch
| of idle GPUs even while their cloud division can sell every
| H100 and A100 they can get their hands on?
| dmurray wrote:
| I'd expect they have both: dedicated machines that they
| usually use and are sometimes idle, but also the ability
| to run a job on GCP if it makes sense.
|
| (I doubt it's the other way round, that the Deepmind
| researchers could come in one day and find all their GPUs
| are being used by some cloud customer).
| myworkinisgood wrote:
| As someone who worked for a compute-time provider, I can
| tell you that the last people who can use the system for
| free are internal people. Because external people bring in
| cash revenue while internal people just bring in potential
| future revenue.
| nkrisc wrote:
| Not if you're only using the resources when they're available
| because no customer has paid to use them.
| K0balt wrote:
| I think Google produces their own power, so they don't pay
| distribution cost which is at least one third of the price of
| power, even higher for large customers.
| rrr_oh_man wrote:
| _> They chose to publish. So they are interested in seeing it
| reproduced or improved upon._
|
| Call me cynical, but this is not what I experienced to be the
| #1 reason for publishing AI papers.
| echoangle wrote:
| As someone not in the AI space, what do you think is the
| reason for publishing? Marketing and hype for your products?
| simonw wrote:
| Retaining your researchers so they don't get frustrated and
| move to another company that lets them publish.
| a_bonobo wrote:
| and attracting other researchers so your competitors
| can't pick them up to potentially harm your own business
| ash-ali wrote:
| I hope someone could share their insight on this comment. I
| think the other comments are fragile and don't hold too
| strongly.
| theptip wrote:
| Marketing of some sort. Either "come to Google and you'll
| have access to H100s and freedom to publish and get to work
| with other people who publish good papers", which appeals
| to the best researchers, or for smaller companies,
| benchmark pushing to help with brand awareness and securing
| VC funding.
| godelski wrote:
| It's commonly discussed in AI/ML groups that a paper at a
| top conference is "worth a million dollars." Not all
| papers, some papers are worth more. But it is in effect
| discussing the downstream revenues. As a student, it is
| your job and potential earnings. As a lab it is worth
| funding and getting connected to big tech labs (which
| creates a feedback loop). And to corporations, it is worth
| far more than that in advertising.
|
| The unfortunate part of this is that it can have odd
| effects like people renaming well known things to make the
| work appear more impressive, obscure concepts, and drive up
| their citations.[0] The incentives do not align to make
| your paper as clear and concise as possible to communicate
| your work.
|
| [0] https://youtu.be/Pl8BET_K1mc?t=2510
| jfengel wrote:
| Is the electricity cost negligible? It's a pretty compute
| intensive application.
|
| Of course it would be a tiny fraction of the $10m figure here,
| but even 1% would be $100,000. Negligible to Google, of
| course - for Google even $10 million is couch cushion money.
| stavros wrote:
| I feel like your comment answers itself: If you have the
| money to be running a datacenter of thousands of A100 GPUs
| (or equivalent), the cost of the electricity is negligible to
| you, and definitely worth training a SOTA model with your
| spare compute.
| dylan604 wrote:
| Is it really spare compute? Is the demand from others so
| low that these systems are truly idle? Does this also
| artificially make it look like demand is high because
| internal tasks are using it?
| dekhn wrote:
| The electricity cost is not negligible. I ran a service that
| had multiples of $10M in marginal electricity spend (i.e.,
| servers running at 100% utilization, consuming a
| significantly higher fraction than when idle, or partly
| idle). Ultimately, the scientific discoveries weren't worth
| the cost, so we shut the service down.
|
| $10M is about what Google would spend to get a publication in
| a top-tier journal. But google's internal pricing and costs
| don't look anything like what people cite for external costs;
| it's more like a state-supported economy with some extremely
| rich oligarch-run profit centers that feed all the various
| cottage industries.
| pintxo wrote:
| > They chose to publish. So they are interested in seeing it
| reproduced or improved upon.
|
| Not necessarily; publishing also ensures that the stuff is no
| longer patentable.
| slashdave wrote:
| Forgive me if I am wrong, but all of the techniques explored
| are already well known. So, what is going to be patented?
| fragmede wrote:
| the fundamental algorithms have been, sure, but there are
| innumerable enhancements upon those base techniques to be
| found and patented.
| K0balt wrote:
| I'd imagine publishing is more oriented toward attracting and
| retaining talent. You need to scratch that itch or the
| academics will jump ship.
| Cthulhu_ wrote:
| I'd argue it's not hard to reproduce per se, just expensive;
| thankfully there are at least half a dozen (cloud) computing
| providers that have the necessary resources to do so. Google
| Cloud, AWS and Azure are the big competitors in the west (it
| seems / from my perspective), but don't underestimate the likes
| of Alibaba, IBM, DigitalOcean, Rackspace, Salesforce, Tencent,
| Oracle, Huawei, Dell and Cisco.
| ape4 wrote:
| Its like them running SETI@home ;)
| dekhn wrote:
| We ran Folding@Home at Google. We were effectively the
| largest single contributor of cycles for at least a year. It
| wasn't scientifically worthwhile, so we shut it down after a
| couple years.
|
| That was using idle cycles on Intel CPUs, not GPUs or TPUs
| though.
| stairlane wrote:
| > The only cost is then the electricity and used capacity. Not
| consumer pricing. So negligible.
|
| I don't think this is valid, as this point seems to ignore the
| fact that the data center that this compute took place in
| required a massive investment.
|
| A paper like this is more akin to HEPP research. Nobody has the
| capability to reproduce the Higgs results outside the
| facility where the research was conducted (CERN).
|
| I don't think reproduction was a concern of the researchers.
| morbia wrote:
| The Higgs results were reproduced because there are two
| independent detectors at CERN (Atlas and CMS). Both
| collaborations are run almost entirely independently, and the
| press are only called in to announce a scientific discovery
| if both find the same result.
|
| Obviously the 'best' result would be to have a separate
| collider as well, but no one is going to fund a new collider
| just to reaffirm the result for a third time.
| stairlane wrote:
| Absolutely, and well stated.
|
| The point I was trying to make was the fact that nobody
| (meaning govt bodies) was willing to make another collider
| capable of repeating the results. At least not yet ;).
| sigmoid10 wrote:
| This calculation is pretty pointless and the title is flat out
| wrong. It also gets lost in finer details while totally missing
| the bigger picture. After all, the original paper was written
| by people working for or at Google. So you can safely
| assume they used Google resources. That means they wouldn't have
| used H100s, but Google TPUs. Since they design and own these
| TPUs, you can also safely assume that they don't pay whatever
| they charge end users for them. At the scale of Google, this
| basically amounts to the cost of housing/electricity, and even
| that could be a tax write-off. You also can't directly assume
| that the on paper performance of something like an H100 will be
| the actual utilization you can achieve, so basing any estimate in
| terms of $/GPU-hour will be off by default.
|
| That means Google paid way less than this amount and if you
| wanted to reproduce the paper yourself, you would potentially pay
| a lot more, depending on how many engineers you have in your team
| to squeeze every bit of performance per hour out of your cluster.
| michaelmior wrote:
| Even if they did use H100s and paid the current premium on
| them, you could probably buy 100 H100s and the boxes to put
| them in for less than $10M.
| c-linkage wrote:
| Reproducibility is a key element of the scientific process.
|
| How is anyone else going to reproduce the experiment if it's
| going to cost them $10 million because they don't work at
| Google and would have to rent the infrastructure?
| tokai wrote:
| Cheap compared to some high energy physics experiments.
| lostlogin wrote:
| I was thinking this too. Splitting the atom, and various
| space program experiments would also be difficult to
| reproduce if someone wanted to try.
| rvnx wrote:
| This specific paper looks plausible, but a lot of published
| AI papers are simply fake because it is one of the sectors
| where it is possible to make non-reproducible claims. "We
| don't give source-code or dataset", but actually they didn't
| find or do anything of interest.
|
| It works and helps to get a salary raise or a better job, so
| they continue.
|
| A bit like when someone goes to a job interview, didn't do
| anything, and claims "My work is under NDA".
| Sebb767 wrote:
| But what's the solution here? Not doing the (possibly)
| interesting research because it's hard to reproduce? That
| doesn't sound like a better situation.
|
| That being said, yes, this is hard to reproduce for your
| average Joe, but there are also a lot of companies (like
| OpenAI, Facebook, ...) that _are_ able to throw this amount
| of hardware at the problem. And in a few years you'll
| probably be able to do it on commodity hardware.
| injuly wrote:
| > This calculation is pretty pointless and the title is flat
| out wrong.
|
| No, it's not. The author clearly states in the very first
| paragraph that this is the price it would take _them to
| reproduce the results_.
|
| Nowhere in the article (or the title) have they implied that
| this is how much Google spent.
| arcade79 wrote:
| A lot of misunderstandings among the commenters here.
|
| From the link: "the total compute cost it would take to replicate
| the paper"
|
| It's not Google's cost. Google's cost is of course entirely
| different. It's the cost for the author if he were to rent the
| resources to replicate the paper.
|
| For Google, all of it is running at a "best effort" resource
| tier, grabbing available resources when not requested by higher
| priority jobs. It's effectively free resources (except
| electricity consumption). If any "more important" jobs with a
| higher priority come in and ask for the resources, the paper-
| writers' jobs will just be preempted.
| mrazomor wrote:
| This assumes the common resources (CPU, RAM, etc.), not the
| ones required for the LLM training (GPU, TPU, etc.). It's
| different economy.
|
| TL; DR: It's not ~free.
| akutlay wrote:
| Why does GPU matter? Do you think GCP keeps GPU utilization
| at 100% at all times?
| mrazomor wrote:
| What the OP is referring to requires _overprovisioning_ of
| the high priority traffic and the _sine-like utilization_
| (without it, the benefits of the "batch" tier is close to
| zero -- the preemption is too high for any meaningful work
| when you are close to the top of the utilization hill).
|
| You get that organically when you are serving lots of
| users. And there aren't many GPUs etc. used for that.
| Training LLMs gives you a different utilization pattern.
| The "best effort" resources aren't as useful in that setup.
| bbminner wrote:
| Because accelerators (tpus, gpus) unlike ram/cpu are
| notoriously hard to timeshare and virtualize. So if you get
| evicted in an environment like that, you have to reload
| your entire experiment state from a model checkpoint. With
| giant models like that, it might take dozens of minutes. As
| a result, I doubt that these experiments are done using
| "spare" resources - in that case, constant interruptions
| and reloading would result in these experiments finishing
| sometime around the heat death of the universe :)
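The eviction cost described above is easy to ballpark: reload time is checkpoint size over read bandwidth. A minimal sketch, with illustrative (not measured) sizes and bandwidth, assuming bf16 weights plus Adam-style optimizer state:

```python
# Rough reload-time arithmetic behind the "dozens of minutes" claim.
# All inputs are illustrative assumptions, not measured values.
def checkpoint_reload_minutes(n_params: float,
                              bytes_per_param: float = 2,             # bf16 weights
                              optimizer_bytes_per_param: float = 12,  # Adam-style state
                              read_gbytes_per_sec: float = 2.0) -> float:
    """Minutes to re-read weights + optimizer state after an eviction."""
    total_bytes = n_params * (bytes_per_param + optimizer_bytes_per_param)
    return total_bytes / (read_gbytes_per_sec * 1e9) / 60

# A hypothetical 70B-parameter model under these assumptions:
print(f"{checkpoint_reload_minutes(70e9):.0f} min")  # 8 min per eviction
```

Under frequent preemption, minutes of reload per eviction quickly dominate useful compute.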
| bombcar wrote:
| This is the side effect of underutilized capital and it's
| present in many cases.
|
| For example, if YOU want to rent a backhoe to do some yard
| rearrangement it's going to cost you.
|
| But Bob who owns BackHoesInc has them sitting around all the
| time when they're not being rented or used; he can rearrange
| his yard at wholesale cost or almost for free.
| thaumasiotes wrote:
| > This is the side effect of underutilized capital and it's
| present in many cases.
|
| "Underutilized" isn't the right word here. There's some value
| in putting your capital to productive use. But, once
| immediate needs are satisfied, there's _more_ value in having
| the capital available to address future needs quickly than
| there would be in making sure that everything necessary to
| address those future needs is tied up in low-value work.
| Option value is real value; being prepared for unforeseen but
| urgent circumstances is a real use.
| nathancahill wrote:
| Same effect when leasing companies let office space sit
| unoccupied for years on end. The future value is higher
| than the marginal value of reducing the price to fill it
| with a tenant.
| unyttigfjelltol wrote:
| Real estate is a playground for irrationally hopeful or
| stubborn participants.
| Bjartr wrote:
| That may be part of it for properties left
| unleased for years, but I believe it's not the only part.
|
| I believe the larger factor, and someone correct me if
| they have a better understanding of this, is that for
| commercially rented properties the valuation used to
| determine the mortgage terms you get takes into account
| what you claim to be able to get from rent. Renting for
| less than that reduces the valuation and can put you
| upside down on the mortgage. But the bank will let you
| defer mortgage payments, effectively taking each month of
| mortgage duration and moving it from now to after the
| last month of the mortgage duration, extending the time
| they earn interest for.
|
| So if no one wants to lease the space at that price after
| a prior lessee leaves for whatever reason, it's better
| for the property owner financially to leave the space
| vacant, sometimes for years, until someone willing to pay
| that price comes along, than to lower the rent and get a
| tenant.
| bombcar wrote:
| This is mostly correct. People assume commercial loan
| terms are like single-family homes "but larger" but
| they're not. They basically are all custom financial
| deals with multiple banks and may be over multiple
| properties. As long as total vacancy isn't above a cutoff
| the banks will be happy, but lowering rents "just to get
| a tenant" can harm the valuation and trigger terms.
|
| Part of the reason things like Halloween Superstores can
| pop in is the terms often exclude "short term leases"
| which are under six months.
|
| Also when you're leasing to companies, they are VERY
| quick to jump at lower prices if available, which means
| that if you drop the lease for one tenant, the others are
| sure to follow, sometimes even before lease terms are up.
| khafra wrote:
| Land Value Tax would fix this.
| bbarnett wrote:
| Many cities only tax on leased property, or have very low
| rates on unleased property.
| bombcar wrote:
| Yeah, airlines make "more return on capital" by faster
| turn-around of planes _to a point_ - if they are utilizing
| their airframes above 80 or 90 or whatever percent, the
| airline itself becomes extremely fragile and unable to
| handle incidents that impact timing.
|
| We saw the same thing with JIT manufacturing during Covid.
| franga2000 wrote:
| In the case of compute, you can evict low-priority jobs
| nearly instantly, so the compute capacity running spot
| instances and internal side-projets is just as available
| for unexpected bursts as it would be if sitting idle.
| axus wrote:
| I'm going to say this the next time I argue I need my
| servers online 24/7.
| thaumasiotes wrote:
| I'm not really sure I'm following you.
| efitz wrote:
| I think a better description than "underutilized" would be
| "sunk capex cost" - Google (or any cloud provider) cannot
| run at 100% customer utilization because then they could
| neither acquire new customers nor service transitory usage
| spikes for existing customers. So they stay ahead of
| predicted demand, which means that they will almost always
| have excess capacity available.
|
| Cloud _providers_ pay capital costs (CapEx) for servers,
| GPUs, data centers, employees, etc. Utilization allows them
| to recoup those costs faster.
|
| Cloud _customers_ pay operational expenses (OpEx) for
| usage.
|
| So Google generally has excess capacity, and while they
| would prefer revenue-generating customer usage, they've
| already paid for everything but the electricity, so it's
| extremely cheap for them to run their own jobs if the
| hardware would otherwise be sitting idle.
| bbarnett wrote:
| I doubt they are doing this, but if they did burn-in
| tests with 3 machines doing identical workloads, they
| could validate workloads but also test new infra. Unlike
| customer workloads, it would be OK to retry due to error.
|
| This would be 100% free, as all electricity and "wear and
| tear" would be required anyhow.
| immibis wrote:
| There is also a mathematical relationship in queuing
| theory between utilization and average queue length,
| which all programmers should be told:
| https://blog.danslimmon.com/2016/08/26/the-most-
| important-th...
|
| As you run close to 100% utilization, you also run close
| to infinity waiting times. You don't want that. It might
| be acceptable for your internal projects (the _actual_
| waiting time won't be infinity, and you'll cancel them
| if it gets too close to infinity) but it's certainly not
| acceptable for customers.
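The relationship in the linked post is the standard M/M/1 result: the mean number of jobs in the system is ρ/(1−ρ), which blows up as utilization approaches 1. A minimal sketch:

```python
# M/M/1 queue: mean number of jobs in the system is rho / (1 - rho),
# so queues (and therefore waits) explode as utilization nears 100%.
def mean_jobs_in_system(rho: float) -> float:
    if not 0 <= rho < 1:
        raise ValueError("at rho >= 1 the queue grows without bound")
    return rho / (1 - rho)

for rho in (0.5, 0.8, 0.9, 0.99):
    print(f"utilization {rho:.0%}: {mean_jobs_in_system(rho):5.1f} jobs in system")
```

Going from 90% to 99% utilization multiplies the average queue by eleven, which is why "run everything at 100%" is a trap for customer-facing capacity.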
| thaumasiotes wrote:
| There is a genre of game called "time management games"
| which will hammer this point home if you play them.
| They're not really considered 'serious' games, so you can
| find them in places where the audience is basically
| looking to kill time.
|
| https://www.bigfishgames.com/us/en/games/5941/roads-of-
| rome/...
|
| The structure of a time management game is:
|
| 1. There's a bunch of stuff to do on the map.
|
| 2. You have a small number of workers.
|
| 3. The way a task gets done is, you click on it, and the
| next time a worker is available, the worker will start on
| that task, which occupies the worker for some fixed
| amount of time until the task is complete.
|
| 4. Some tasks can't be queued until you meet a
| requirement such as completing a predecessor task or
| having enough resources to pay the costs of the task.
|
| You will learn immediately that having a long queue means
| flailing helplessly while your workers ignore hair-on-
| fire urgent tasks in favor of completely unimportant ones
| that you clicked on while everything seemed relaxed. It's
| far more important that you have the ability to respond
| to a change in circumstances than to have all of your
| workers occupied at all times.
| bombcar wrote:
| > You will learn immediately that having a long queue
| means flailing helplessly while your workers ignore hair-
| on-fire urgent tasks in favor of completely unimportant
| ones that you clicked on while everything seemed relaxed.
|
| Ah, sounds like Dwarf Fortress!
| immibis wrote:
| I was thinking Oxygen Not Included.
| dekhn wrote:
| In practice it's more complicated than this- borg isn't
| actually a queue, it's a priority-based system with
| preemption, although people layered queue systems on top.
| Further, granularity mattered a lot- you could get much
| more access to compute by asking for smaller slices
| (fractions of a CPU core, or fraction of a whole TPU
| cluster). There was a lot of "empty crack filling" at
| google.
| efitz wrote:
| TL/DR: You should think of and use queues like shock
| absorbers, not sinks. Also you need to monitor them.
|
| Queues are useful to decouple the output of one process
| to the input of another process, when the processes are
| not synchronized velocity-wise. Like a shock absorber,
| they allow both processes to continue at their own paces,
| and the queue absorbs instantaneous spikes in producer
| load above the steady state rate of the consumer (side
| note: if queues are isolated code- and storage-wise from
| the consumer process, then you can use the queue to
| prevent disruption in the producer process when you need
| to take the consumer down for maintenance or whatever).
|
| Running with very small queue lengths is generally fine
| and generally healthy.
|
| If you have a process that consistently runs with
| substantial queue lengths, then you have a mismatch
| between the workloads of the processes they connect - you
| either need to reduce the load from the producer or
| increase the throughput of the consumer of the queue.
|
| Very large queues tend to hide the workload mismatch
| problem, or worse. Often work put into queues is not
| stored locally on the producer, or is quickly
| overwritten. So a consumer end problem can result in
| potential irrevocable loss of everything in the queue,
| and the larger the queue, the bigger the loss. Another
| problem with large queues is that if your consumer
| process is only slightly faster than the producer
| process, then a large backlog of work in the queue can
| take a long time to work down, and it's even possible
| (admission of guilt) to configure systems using such
| queues such that they cannot recover from a lengthy
| outage, even if all the work items were stored in the
| queue.
|
| If you have queues, you need to monitor your queue
| lengths and alarm when queue lengths start increasing
| significantly above baseline.
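The backlog-recovery hazard in the last two paragraphs reduces to one line of arithmetic: drain time is the backlog divided by the consumer's headroom over the producer. A minimal sketch with made-up rates:

```python
# If the consumer is only slightly faster than the producer, a backlog
# takes a long time to drain; with no headroom it never recovers.
def drain_seconds(backlog: float, produce_rate: float, consume_rate: float) -> float:
    """Seconds to work down `backlog` items while production continues."""
    headroom = consume_rate - produce_rate
    if headroom <= 0:
        return float("inf")  # the queue cannot recover from the outage
    return backlog / headroom

# Illustrative: a 1M-item backlog after an outage, with the consumer
# only 5% faster than a 1000 items/s producer:
print(drain_seconds(1_000_000, 1000, 1050) / 3600)  # ~5.6 hours
```

This is the scenario where a large queue hides the mismatch: the system looks healthy until an outage, then spends hours (or forever) catching up.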
| mikepurvis wrote:
| Car lots with attached garages are like this too. That brake
| and suspension work they were going to charge you several
| thousand dollars for? Once you trade in ol' Bessie they'll do
| that for pennies on the dollar during slack time; it doesn't
| hurt them if the car sits around for a few weeks or months
| before being ready for sale.
| WarOnPrivacy wrote:
| > Car lots with attached garages are like this too.
|
| This was my first job after moving into this state. Between
| my labor and parts, it was about 15% of the sale price.
|
| My most interesting repair was a 1943 Cadillac, a 'war
| car'.
| punnerud wrote:
| Can others also buy the "best effort" tier?
|
| The job could easily run for weeks, even when you could buy
| your way to doing it in a day.
|
| Then have bidding on this "best effort" resource, where they
| factor in electricity prices at any given time.
| v3ss0n wrote:
| Sure, land a job there, work your way up against the
| corporate BS and toxicity, and you can get the best-effort tier.
|
| Those efforts need to be added to the cost calculation too.
| curt15 wrote:
| Is the "best effort" tier similar to AWS spot instances?
| WJW wrote:
| At every cloud provider there's probably a tier below
| "spot" (or whatever the equivalent is called at AWS's
| competitors) that is used for the low-priority jobs of the
| cloud provider itself.
| jeffbee wrote:
| You can speculate about this or you can look at how
| Google's internal workloads actually run, because they
| have released a large and detailed set of traces from
| Borg. They're really open about this.
|
| https://github.com/google/cluster-data
| dweekly wrote:
| Possible corollary: it may be difficult to regularly turn out
| highly compute-dependent research if you're paying full retail
| rack rates for your hardware (i.e. using someone else's cloud).
| huijzer wrote:
| Still, don't get high on your own supply.
| imtringued wrote:
| According to neoclassical economists this is impossible since
| you can easily and instantaneously scale infrastructure up and
| down continuously at no cost and the future is known so demand
| can be predicted reliably.
|
| The problem with neoclassical economics is that it doesn't
| concern itself with the physical counterpart of liquidity. It
| is assumed that the physical world is just as liquid as the
| monetary world.
|
| The "liquidity mismatch" between money and physical capital
| must be bridged through overprovisioning on the physical side.
| If you want the option to choose among n different products,
| but only choose m products, then the n - m unsold products must
| be priced into the m bought products. If you can repurpose the
| unsold products, then you make a profit or you can lower costs
| for the buyer of the m products.
|
| I would even go as far as to say that the production of
| liquidity is probably the driving force of the economy, because
| it means we don't have to do complicated central planning and
| instead use simple regression models.
| jopsen wrote:
| > I would even go as far as to say that the production of
| liquidity is probably the driving force of the economy.
|
| Isn't that all what high frequency traders would say? :)
|
| Perhaps there is some limit at which additional liquidity
| doesn't offer much value?
| marcosdumay wrote:
| I think you completely misunderstood the GP.
|
| There isn't much there about stocks markets.
| 152334H wrote:
| Is it free-priority based?
|
| I was told by an employee that GDM internally has a credits
| system for TPU allocation, with which researchers have to
| budget out their compute usage. I may have completely
| misunderstood what they were describing, though.
| floor_ wrote:
| Content aside, this is hands down my favorite blog format.
| mostthingsweb wrote:
| I agree, but I'm curious if it's for the same reason. I like it
| because there is no flowery writing. Just direct "here are the
| facts".
| pama wrote:
| 3USD/hour on the H100 is much more expensive than a reasonable
| amortized full ownership cost, unless one assumes the GPU is
| useless within 18 months, which I find a bit dramatic. The MFU
| can be above 40% and certainly well above the 35% in the
| estimate, also for small models with plain pytorch and trivial
| tuning [1]. I didn't read the linked paper carefully, but I
| seriously doubt the google team used vocab embedding layers with
| 2 D V parameters stated in the link, because this would be
| suboptimal by not tying the weights of the token embedding layer
| in the decoder architecture (even if they did double the params
| in these layers, it would not lead to 6 D V compute because the
| embedding input is indexed). To me these assumptions suggested a
| somewhat careless attitude towards the cost estimation and so I
| stopped reading the rest of this analysis carefully. My best
| guess is that the author is off by a large factor in the upward
| direction, and a true replication with H100/200 could be about 3x
| less expensive.
|
| [1] if the total cost estimate was relatively low, say less than
| 10k, then of course the lowest rental price and a random training
| codebase might make some sense in order to reduce administrative
| costs; once the cost is in the ballpark of millions of USD, it
| feels careless to avoid optimizing it further. H100s occasionally
| appear in fire sales or on eBay, which could reduce the cost even
| more, but the author already mentions 2USD/gpu/hour for bulk
| rental compute, which is better than the 3USD/gpu/hour estimate
| they used in the writeup.
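| For concreteness, the kind of back-of-envelope estimate being argued
| over here can be sketched as follows (a rough sketch only; the peak-
| FLOPs figure, MFU, and rental price are illustrative assumptions, not
| the blog's or the paper's exact numbers):

```python
# Rough training-cost sketch using the standard 6 * params * tokens
# FLOP approximation for a dense transformer.

H100_PEAK_FLOPS = 989e12  # approximate H100 BF16 dense peak, assumed

def training_cost_usd(params, tokens, mfu=0.35, usd_per_gpu_hour=3.0):
    total_flops = 6 * params * tokens            # forward + backward
    gpu_seconds = total_flops / (H100_PEAK_FLOPS * mfu)
    return gpu_seconds / 3600 * usd_per_gpu_hour

# e.g. a 1B-parameter model on 20B tokens at 35% MFU and $3/GPU-hour
# comes out to roughly $290 under these assumptions.
cost = training_cost_usd(1e9, 20e9)
```

Note that both MFU and the hourly price enter the estimate linearly,
which is why the 35% vs 40% and 3USD vs 2USD disagreements above
translate directly into the claimed multiplicative error.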
| 152334H wrote:
| You are correct on true H100 ownership costs being far lower.
| As I mention in the H100 blurb, the H100 numbers are fungible
| and I don't mind if you halve them.
|
| MFU can certainly be improved beyond 40%, as I mention. But on
| the point of small models specifically: the paper uses FSDP for
| all models, and I believe a rigorous experiment should not vary
| sharding strategy due to numerical differences. FSDP2 on small
| models will be slow even with compilation.
|
| The paper does not tie embeddings, as stated. The readout layer
| does lead to 6DV because it is a linear layer with D*V parameters,
| which cost 2 FLOPs each in the forward pass and 4 in the backward. I would
| appreciate it if you could limit your comments to factual
| errors in the post.
| lonk11 wrote:
| I think the commenter was thinking about the input embedding
| layer, where to get an input token embedding the model does a
| lookup of the embedding by index, which is constant time.
|
| And the blog post author is talking about the output layer
| where the model has to produce an output prediction for every
| possible token in the vocabulary. Each output token
| prediction is a dot-product between the transformer hidden
| state (D) and the token embedding (D) (whether shared with
| input or not) for all tokens in the vocabulary (V). That's
| where the VD comes from.
|
| It would be great to clarify this in the blog post to make it
| more accessible but I understand that there is a tradeoff.
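| The accounting both comments converge on can be sketched as follows
| (D and V here are illustrative values, not the paper's actual
| config):

```python
# Per-token FLOP count for an untied D x V readout (output) layer,
# using the common 2x-forward / 4x-backward rule of thumb.

def readout_flops_per_token(d_model, vocab_size):
    forward = 2 * d_model * vocab_size   # one multiply + one add per weight
    backward = 4 * d_model * vocab_size  # roughly twice the forward cost
    return forward + backward            # = 6 * D * V in total

d, v = 2048, 32_000  # hypothetical model width and vocabulary size
assert readout_flops_per_token(d, v) == 6 * d * v
```

The input embedding, by contrast, is an index lookup, which is why it
contributes essentially no compute even though it holds another D*V
parameters when untied.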
| pama wrote:
| My bad on the 6 D V estimate; you are correct that if they do
| a dense decoding (rather than a hierarchical one as google
| used to do in the old days) the cost is exactly 6 D V. I
| cannot edit the GP comment and I will absorb the shame of my
| careless words there. I was put off by the subtitle and
| initial title of this HN post, though the current title is
| more appropriate and correct.
|
| Even if it's a small model, one could use ddp or FSDP/2
| without slowdowns on fast interconnect, which certainly adds
| to the cost. But if you want to reproduce all the work at the
| cheapest price point you only need to parallelize to the
| minimal level for fitting in memory (or rather, the one that
| maxes the MFU), so everything below 2B parameters runs on a
| single H100 or single node.
| spi wrote:
| Do you have sources for "The MFU can be above 40% and certainly
| well above the 35% in the estimate"?
|
| Looking at [1], the authors there claim that their improvements
| were needed to push BERT training beyond 30% MFU, and that the
| "default" training only reaches 10%. Certainly numbers don't
| translate exactly, it might well be that with a different
| stack, model, etc., it is easier to surpass, but 35% doesn't
| seem like a terribly off estimate to me. Especially so if you
| are training a whole suite of different models (with different
| parameters, sizes, etc.) so you can't realistically optimize
| all of them.
|
| It might be that the real estimate is around 40% instead of the
| 35% used here (frankly it might be that it is 30% or less, for
| that matter), but I would doubt it's so high as to make the
| estimates in this blog post terribly off, and I would doubt
| even more that you can get that "also for small models with
| plain pytorch and trivial tuning".
|
| [1] https://www.databricks.com/blog/mosaicbert
| tedivm wrote:
| When I was at Rad AI we did out the math on rent versus buy,
| and it was just so absolutely ridiculously obvious that buy was
| the way to go. Cloud does not make sense for AI training right
| now, as the overhead costs are considerably higher than simply
| purchasing a cluster, colocating it at a place like Colovore,
| and paying for "on hands" support. It's not even close.
| rgmerk wrote:
| Worth pointing out here that in other scientific domains, papers
| routinely require hundreds of thousands of dollars, sometimes
| millions of dollars, of resources to produce.
|
| My wife works on high-throughput drug screens. They routinely use
| over $100,000 of consumables in a single screen, not counting the
| cost of the screening "libraries", the cost of using some of the
| ~$10M of equipment in the lab for several weeks, the cost of
| the staff in the lab itself, and the cost of the time of the
| scientists who request the screens and then take the results and
| turn them into papers.
| ramraj07 wrote:
| I estimated that for any paper that involves mouse work and is
| produced in a first-world country (i.e. they have to do right by
| the animals), the minimum cost in expenses and salary would be
| $200,000. The average is likely higher. Tens of thousands of
| papers a year published like this!
| esperent wrote:
| To be fair, supposing the Google paper took six months to a
| year to produce, it also must have cost several hundred
| thousand dollars in salaries and other non-compute costs.
| paxys wrote:
| These are mostly fixed costs. If you produce a hundred papers
| from the same team and same research, the costs aren't 100x.
| lucianbr wrote:
| But starting from the 10th paper, the value is also pretty
| low I imagine. How many new things can you discover from
| the same team and same research? That's 3 papers per year
| for a 30-year career. Every single year, no breaks.
| sdenton4 wrote:
| Well, to be sure, mouse research consistently produces
| amazing cures for cancer, insomnia, lost limbs, and even
| gravity itself. Sure, none of it translates to humans,
| but it's an important source of headlines for high impact
| journals and science columnists.
| godelski wrote:
| This is also true for machine learning papers. They cure
| cancer, discover physics, and all sorts of things. Sure,
| they don't actually translate to useful science, but they
| are highly valuable pieces of advertisements. And hey,
| maybe someday they might!
| computerdork wrote:
| Agreed about this for past mouse research, but this is
| changing as mice are being genetically engineered to be
| more human:
| https://news.uthscsa.edu/scientists-create-first-mouse-model...
| https://www.nature.com/articles/s41467-019-09716-7
| godelski wrote:
| > How many new things can you discover from the same team
| and same research?
|
| That all depends on how you measure discoveries. The most
| common metric is... publications. Publications are what
| advance your career and are what you are evaluated on.
| The content may or may not matter (lol who reads your
| papers?) but the number certainly does. So the best way
| to advance your career is to write a minimum viable paper
| and submit as often as possible. I think we all forget
| how Goodhart's Law comes to bite everyone in the ass.
| dumb1224 wrote:
| Well not everyone starts experiment anew. Many also reuse
| accumulated datasets. For human data even more so.
| slashdave wrote:
| I assure you that the companies performing these screens expect
| a return on this investment. It is not for a journal paper.
| godelski wrote:
| I used to believe this line. But then I worked for a big tech
| company where my manager constantly made those remarks ("the
| difference in industry and academia is that in industry it
| has to actually work"). I then improved the generalization
| performance (i.e. "actually work") by over 100% and they
| decided not to update the model they were selling. Then
| again, I had a small fast model and it was 90% as accurate as
| the new large transformer model. Though they also didn't take
| the lessons learned and apply them to the big model, which
| had similar issues but were just masked by the size.
|
| Plus, I mean, there are a lot of products that don't work. We
| all buy garbage and often can't buy not garbage. Though I
| guess you're technically correct that in either of these
| situations there can still be a return on investment, but
| maybe that shouldn't be good enough...
| shpongled wrote:
| The post you are replying to is talking about high
| throughput assays for drug development. This is something
| actually run in a lab, not a model. As another person
| working at a biotech, I can assure you that screens are not
| just run as busy work.
| rgmerk wrote:
| No they're not busywork, but not all such screens are
| directly in the drug discovery pipeline.
| dont_forget_me wrote:
| All that compute power just to invade privacy and show people
| more ads. Can this get any more depressing?
| psychoslave wrote:
| Yes, sure! Imagine a world where every HN thread you engage in
| is fed with information that is all subtly tailored to push
| you into buying whatever crap the market is able to produce.
| jeffbee wrote:
| I think if you wanted to think about a big expense you'd look at
| AlphaStar.
| 5kg wrote:
| I am wondering if AlphaStar is the most expensive paper ever.
| lern_too_spel wrote:
| "Observation of a new particle in the search for the Standard
| Model Higgs boson with the ATLAS detector at the LHC"
| jeffbee wrote:
| I think it could be. I also think it is likely that HN
| frequenter `dekhn` has personally spent more money on compute
| resources than any other living human, so maybe they will
| chime in on how the cost gets allocated to the research.
| dekhn wrote:
| A big part of it is basically hard production quota: the
| ability to run jobs at a high priority on large machines
| for an entire quarter. The main issue was that quota was
| somewhat overallocated, or otherwise unable to be used (if
| you and another team both wanted a full TPUv3 with all its
| nodes and fabric).
|
| From what I can tell, ads made the money and search/ads
| bought machines with their allocated budget, TI used their
| budget to run the systems, and then funny money in the form
| of quota was allocated to groups. The money was "funny" in
| the sense that the full reach-through costs of operating a
| TPU for a year looks completely different from the
| production allocation quota that gets handed out. I think
| Google was long trying to create a market economy, but it
| was really much more like a state-funded exercise.
|
| (I am not proud of how much CPU I wasted on protein
| folding/design and drug discovery, but I'm eternally
| thankful for Urs giving me the opportunity to try it out
| and also to compute the energy costs associated with the
| CPU use)
| ipsum2 wrote:
| It's disappointing that they never developed AlphaStar enough
| to become superhuman (unlike AlphaGo); even lower-level
| players were able to adapt to its playstyle.
|
| The cost was probably the limiting factor.
| brg wrote:
| I found this exercise interesting, and as arcade79 pointed out it
| is the cost of replication, not the cost to Google. Humorously, I
| wonder what the cost of replicating the Higgs boson verification
| or gravitational wave detection would be.
| faitswulff wrote:
| I wonder how many tons of CO2 that amounts to. Google Gemini
| estimated 125,000 tons of carbon emissions, but I don't have the
| know-how to double check it.
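| For a sanity check, a rough conversion from GPU-hours to CO2 can be
| sketched like this (the board power, PUE, and grid carbon intensity
| are all generic assumptions, not Google's actual figures):

```python
# Rough CO2 estimate from GPU-hours. Every constant here is an
# assumption chosen for illustration.

GPU_POWER_KW = 0.7     # ~700 W per H100 board, assumed
PUE = 1.1              # datacenter overhead factor, assumed
KG_CO2_PER_KWH = 0.4   # generic grid mix, assumed

def tons_co2(gpu_hours):
    kwh = gpu_hours * GPU_POWER_KW * PUE
    return kwh * KG_CO2_PER_KWH / 1000  # metric tons

# e.g. 3 million H100-hours under these assumptions is on the order
# of 1,000 tons, which suggests a 125,000-ton figure would require
# either vastly more compute or far dirtier power.
estimate = tons_co2(3_000_000)
```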
| chazeon wrote:
| If you use solar energy, then there is no CO2 emission. Right?
| ipsum2 wrote:
| Google buys carbon credits to make up for CO2 emissions,
| they've never relied strictly on solar.
| godelski wrote:
| Worth mentioning that "GPU Poor" isn't created because those
| without much GPU compute can't contribute, but rather because
| those with massive amounts of GPU are able to perform many more
| experiments and set a standard, or shift the Overton window. The
| big danger here is just that you'll start expecting a higher
| "thoroughness" from everyone else. You may not expect this level,
| but seeing this level often makes you think what was sufficient
| before is far from sufficient now, and what's the cost of that
| lower bound?
|
| I mention this because a lot of universities and small labs are
| being edged out of the research space but we still want their
| contributions. It is easy to always ask for more experiments but
| the problem is, as this blog shows, those experiments can
| sometimes cost millions of dollars. This also isn't to say that
| small labs and academics aren't able to publish, but rather that
| 1) we want them to be able to publish __without__ the support of
| large corporations to preserve the independence of research[0],
| 2) we don't want these smaller entities to have to go through a
| roulette wheel in an effort to get published.
|
| Instead, when reviewing be cautious in what you ask for. You can
| __always__ ask for more experiments, datasets, "novelty", and so
| on. Instead ask if what's presented is sufficient to push forward
| the field in any way and when requesting the previous things be
| specific as to why what's in the paper doesn't answer what's
| needed and what experiment would answer it (a sentence or two
| would suffice).
|
| If not, then we'll have the death of the GPU poor and that will
| be the death of a lot of innovation, because the truth is, not
| even big companies will allocate large compute for research that
| is lower level (do you think state space models (mamba) started
| with multimillion dollar compute? Transformers?). We gotta start
| somewhere and all papers can be torn to shreds/are easy to
| critique. But you can be highly critical of a paper and that
| paper can still push knowledge forward.
|
| [0] Lots of papers these days are indistinguishable from ads. A
| lot of papers these days are products. I've even had works
| rejected because they are being evaluated as products not being
| evaluated on the merits of their research. Though this can be
| difficult to distinguish when evaluation is simply empirical.
|
| [1] I once got desk rejected for "prior submission." 2 months
| later they overturned it, realizing it was in fact an arxiv
| paper, for only a month later for it to be desk rejected again
| for "not citing relevant materials" with no further explanation.
| hiddencost wrote:
| It's likely the cost of the researchers was about $1M/head; with
| 11 names, that puts the staffing costs on par with the compute
| costs.
|
| (A good rule of thumb is that an employee costs about twice their
| total compensation.)
___________________________________________________________________
(page generated 2024-07-30 23:00 UTC)