[HN Gopher] Calculating the cost of a Google DeepMind paper
___________________________________________________________________
Calculating the cost of a Google DeepMind paper
Author : 152334H
Score : 238 points
Date : 2024-07-30 10:26 UTC (12 hours ago)
(HTM) web link (152334h.github.io)
(TXT) w3m dump (152334h.github.io)
| hnthr_w_y wrote:
| That's not very much by business standards; it's a lot when it
| comes to paying our salaries.
| willis936 wrote:
| Any company of any size that doesn't learn the right lessons
| from a $10M mistake will be out of business before long.
| brainwad wrote:
| That's like staffing a single-manager team on a bad project
| for a year. Which I assure you happens all the time in big
| companies, and yet they survive.
| saikia81 wrote:
| They are not saying it doesn't happen. They are saying: The
| companies that don't learn from these mistakes will go out
| of business before long.
| duggan wrote:
| In principle, for some other company, sure.
|
| Google makes ~$300b a year in _profit_. They could make a
| $10m mistake every day and barely make a dent in it.
| magic_man wrote:
| They do not, they made ~90 billion in profit. So no one
| would notice a 10 mil mistake, but no they didn't make
| 300b in profits.
| duggan wrote:
| I misread some stats, thanks for the correction.
| hnbad wrote:
| I think there might be a disagreement about what "big"
| means. _Google_ can easily afford to sink millions each
| year into pointless endeavours without going out of
| business and they probably have. Alphabet's annual
| revenue has been growing a good 10% each year since
| 2021[0]. That's in the range of $20-$30 _billion_ dollars
| with a B.
|
| To put that into perspective, Alphabet's revenue has
| increased 13.38% year-over-year as of June 30, arriving
| at $328.284 billion dollars - i.e. it has _increased_ by
| $38.74 billion in that time. A $10 million dollar mistake
| translates to losing 0.0258% of that number.
|
| A $10 million dollar mistake costs Alphabet 0.0258% of
| the amount their revenue _increased_ year-over-year as of
| last month. Alphabet could have afforded to make 40 such
| $10 million dollar mistakes in that period and it would
| have only represented a loss of 1% of the year-over-year
| _increase_ in revenue. Taking the year-over-year
| _increase_ down by 1% (from 13.38% to 12.38%) would have
| required making 290 such $10 million dollar mistakes
| within one year.
|
| Let me repeat that because it bears emphasizing: over the
| past years, _every year_ Google could have easily
| afforded _an additional 200 such $10 million dollar
| mistakes_ without significantly impacting their increase
| in revenue - and even in 2022 when inflation was almost
| double what it was in other years, they would have
| still come out ahead of inflation.
|
| So in terms of numbers this is demonstrably false. Of
| course the existence of repeated $10 million dollar
| mistakes may suggest the existence of structural issues
| that will result in $1, $10 or $100 billion dollar
| problems eventually and sink the company. But that's
| conjecture at this point.
|
| [0]: https://www.macrotrends.net/stocks/charts/GOOG/alpha
| bet/reve...
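The arithmetic in the comment above can be spot-checked in a few lines. The figures below are taken from the comment itself (13.38% YoY growth to $328.284B revenue), not verified against Alphabet's filings:

```python
# Spot-check of the revenue arithmetic; inputs are the comment's figures.
revenue_now = 328.284e9
growth = 0.1338
revenue_prev = revenue_now / (1 + growth)
yoy_increase = revenue_now - revenue_prev   # ~ $38.74B

mistake = 10e6  # one $10M mistake

# Share of the year-over-year increase consumed by one mistake:
share = mistake / yoy_increase * 100
print(f"{share:.4f}%")  # 0.0258%

# Mistakes needed to shave one percentage point off the growth rate:
n_mistakes = 0.01 * revenue_prev / mistake
print(round(n_mistakes))  # 290
```

Both numbers match the comment's 0.0258% and 290 claims.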
| vishnugupta wrote:
| https://killedbygoogle.com/
|
| I'm confident each one of them was a multiple-of-$10M
| investment.
|
| And this is just what we know because they were launched
| publicly.
| Sebb767 wrote:
| The point the parent made is not to avoid mistakes, but to
| learn from them. Which they probably did not do for all of
| them, as indicated by the sheer number of messenger apps on
| the list, but there's definitely a lot to learn from it.
| OtherShrezzing wrote:
| I'm not really certain that's true at Google's size. Their
| annual revenue is something like a quarter trillion dollars.
| 25,000x larger than a $10m mistake.
|
| The equivalent wastage for a self-employed person would be
| allowing a few cups of Starbucks coffee per year to go cold.
| Workaccount2 wrote:
| There was that time that google paid out something like
| $300M in fraudulent invoices.
| willis936 wrote:
| To be clear: what I mean by "not learning the right lessons"
| is a company deciding that the issue with wasting $10M in six
| months is that they didn't do it 100x in parallel in three
| months. Then when that goes wrong they must need to do it
| 100x wider in parallel again in three weeks.
| BartjeD wrote:
| If this ran on google's own cloud it amounts to internal
| bookkeeping. The only cost is then the electricity and used
| capacity. Not consumer pricing. So negligible.
|
| It is rather unfortunate that this sort of paper is hard to
| reproduce.
|
| That is a BIG downside, because it makes the result unreliable.
| They invested effort and money in getting an unreliable result.
| But perhaps other research will corroborate. Or it may give them
| an edge in their business, for a while.
|
| They chose to publish. So they are interested in seeing it
| reproduced or improved upon.
| rty32 wrote:
| Opportunity cost is cost. What you could have earned by selling
| the resources to customers instead of using them yourself _is_
| what the resources are worth.
| g15jv2dp wrote:
| This assumes that you can sell 100% of the resources'
| availability 100% of the time. Whenever you have more
| capacity than you can sell, there's no opportunity cost in
| using it yourself.
| michaelt wrote:
| A few months back, a lot of the most powerful GPU instances
| on GCP seemed to be sold out 24/7.
|
| I suppose it's possible Google's own infrastructure is
| partitioned from GCP infrastructure, so they have a bunch
| of idle GPUs even while their cloud division can sell every
| H100 and A100 they can get their hands on?
| dmurray wrote:
| I'd expect they have both: dedicated machines that they
| usually use and are sometimes idle, but also the ability
| to run a job on GCP if it makes sense.
|
| (I doubt it's the other way round, that the Deepmind
| researchers could come in one day and find all their GPUs
| are being used by some cloud customer).
| myworkinisgood wrote:
| As someone who worked for a compute-time provider, I can
| tell you that the last people who can use the system for
| free are internal people. Because external people bring in
| cash revenue while internal people just bring in potential
| future revenue.
| nkrisc wrote:
| Not if you're only using the resources when they're available
| because no customer has paid to use them.
| K0balt wrote:
| I think Google produces their own power, so they don't pay
| distribution cost which is at least one third of the price of
| power, even higher for large customers.
| rrr_oh_man wrote:
| _> They chose to publish. So they are interested in seeing it
| reproduced or improved upon._
|
| Call me cynical, but this is not what I experienced to be the
| #1 reason for publishing AI papers.
| echoangle wrote:
| As someone not in the AI space, what do you think is the
| reason for publishing? Marketing and hype for your products?
| simonw wrote:
| Retaining your researchers so they don't get frustrated and
| move to another company that lets them publish.
| a_bonobo wrote:
| and attracting other researchers so your competitors
| can't pick them up to potentially harm your own business
| ash-ali wrote:
| I hope someone could share their insight on this comment. I
| think the other comments are fragile and don't hold too
| strongly.
| theptip wrote:
| Marketing of some sort. Either "come to Google and you'll
| have access to H100s and freedom to publish and get to work
| with other people who publish good papers", which appeals
| to the best researchers, or for smaller companies,
| benchmark pushing to help with brand awareness and securing
| VC funding.
| godelski wrote:
| It's commonly discussed in AI/ML groups that a paper at a
| top conference is "worth a million dollars." Not all
| papers, some papers are worth more. But it is in effect
| discussing the downstream revenues. As a student, it is
| your job and potential earnings. As a lab it is worth
| funding and getting connected to big tech labs (which
| creates a feedback loop). And to corporations, it is worth
| far more than that in advertising.
|
| The unfortunate part of this is that it can have odd
| effects like people renaming well known things to make the
| work appear more impressive, obscure concepts, and drive up
| their citations.[0] The incentives do not align to make
| your paper as clear and concise as possible to communicate
| your work.
|
| [0] https://youtu.be/Pl8BET_K1mc?t=2510
| jfengel wrote:
| Is the electricity cost negligible? It's a pretty compute
| intensive application.
|
| Of course it would be a tiny fraction of the $10m figure here,
| but even 1% would be $100,000. Negligible to Google, of
| course - for Google even $10 million is couch cushion money.
| stavros wrote:
| I feel like your comment answers itself: If you have the
| money to be running a datacenter of thousands of A100 GPUs
| (or equivalent), the cost of the electricity is negligible to
| you, and definitely worth training a SOTA model with your
| spare compute.
| dylan604 wrote:
| Is it really spare compute? Is the demand from others so
| low that these systems are truly idle? Does this also
| artificially make it look like demand is high because
| internal tasks are using it?
| dekhn wrote:
| The electricity cost is not negligible. I ran a service that
| had multiples of $10M in marginal electricity spend (i.e.,
| servers running at 100% utilization, consuming a
| significantly higher fraction than when idle, or partly
| idle). Ultimately, the scientific discoveries weren't worth
| the cost, so we shut the service down.
|
| $10M is about what Google would spend to get a publication in
| a top-tier journal. But google's internal pricing and costs
| don't look anything like what people cite for external costs;
| it's more like a state-supported economy with some extremely
| rich oligarch-run profit centers that feed all the various
| cottage industries.
| pintxo wrote:
| > They chose to publish. So they are interested in seeing it
| reproduced or improved upon.
|
| Not necessarily; publishing also ensures that the stuff is no
| longer patentable.
| slashdave wrote:
| Forgive me if I am wrong, but all of the techniques explored
| are already well known. So, what is going to be patented?
| fragmede wrote:
| the fundamental algorithms have been, sure, but there are
| innumerable enhancements upon those base techniques to be
| found and patented.
| K0balt wrote:
| I'd imagine publishing is more oriented toward attracting and
| retaining talent. You need to scratch that itch or the
| academics will jump ship.
| Cthulhu_ wrote:
| I'd argue it's not hard to reproduce per se, just expensive;
| thankfully there are at least half a dozen (cloud) computing
| providers that have the necessary resources to do so. Google
| Cloud, AWS and Azure are the big competitors in the west (it
| seems / from my perspective), but don't underestimate the likes
| of Alibaba, IBM, DigitalOcean, Rackspace, Salesforce, Tencent,
| Oracle, Huawei, Dell and Cisco.
| ape4 wrote:
| Its like them running SETI@home ;)
| dekhn wrote:
| We ran Folding@Home at Google. We were effectively the
| largest single contributor of cycles for at least a year. It
| wasn't scientifically worthwhile, so we shut it down after a
| couple years.
|
| That was using idle cycles on Intel CPUs, not GPUs or TPUs
| though.
| stairlane wrote:
| > The only cost is then the electricity and used capacity. Not
| consumer pricing. So negligible.
|
| I don't think this is valid, as this point seems to ignore the
| fact that the data center that this compute took place in
| required a massive investment.
|
| A paper like this is more akin to HEPP research. Nobody has the
| capability to reproduce the Higgs results outside the
| facility where the research was conducted (CERN).
|
| I don't think reproduction was a concern of the researchers.
| morbia wrote:
| The Higgs results were reproduced because there are two
| independent detectors at CERN (Atlas and CMS). Both
| collaborations are run almost entirely independently, and the
| press are only called in to announce a scientific discovery
| if both find the same result.
|
| Obviously the 'best' result would be to have a separate
| collider as well, but no one is going to fund a new collider
| just to reaffirm the result for a third time.
| stairlane wrote:
| Absolutely, and well stated.
|
| The point I was trying to make was the fact that nobody
| (meaning govt bodies) was willing to make another collider
| capable of repeating the results. At least not yet ;).
| sigmoid10 wrote:
| This calculation is pretty pointless and the title is flat out
| wrong. It also gets lost in finer details while totally missing
| the bigger picture. After all, the original paper was written
| by people working for or at Google. So you can safely
| assume they used Google resources. That means they wouldn't have
| used H100s, but Google TPUs. Since they design and own these
| TPUs, you can also safely assume that they don't pay whatever
| they charge end users for them. At the scale of Google, this
| basically amounts to the cost of housing/electricity, and even
| that could be a tax write-off. You also can't directly assume
| that the on paper performance of something like an H100 will be
| the actual utilization you can achieve, so basing any estimate in
| terms of $/GPU-hour will be off by default.
|
| That means Google paid way less than this amount and if you
| wanted to reproduce the paper yourself, you would potentially pay
| a lot more, depending on how many engineers you have in your team
| to squeeze every bit of performance per hour out of your cluster.
| michaelmior wrote:
| Even if they did use H100s and paid the current premium on
| them, you could probably buy 100 H100s and the boxes to put
| them in for less than $10M.
| c-linkage wrote:
| Reproducibility is a key element of the scientific process.
|
| How is anyone else going to reproduce the experiment if it's
| going to cost them $10 million because they don't work at
| Google and would have to rent the infrastructure?
| tokai wrote:
| Cheap compared to some high energy physics experiments.
| lostlogin wrote:
| I was thinking this too. Splitting the atom, and various
| space program experiments would also be difficult to
| reproduce if someone wanted to try.
| rvnx wrote:
| This specific paper looks plausible, but a lot of published
| AI papers are simply fake because it is one of the sectors
| where it is possible to make non-reproducible claims. "We
| don't give source-code or dataset", but actually they didn't
| find or do anything of interest.
|
| It works and helps to get a salary raise or a better job, so
| they continue.
|
| A bit like when someone goes to a job interview, didn't do
| anything, and claims "My work is under NDA".
| Sebb767 wrote:
| But what's the solution here? Not doing the (possibly)
| interesting research because it's hard to reproduce? That
| doesn't sound like a better situation.
|
| That being said, yes, this is hard to reproduce for your
| average Joe, but there are also a lot of companies (like
| OpenAI, Facebook, ...) that _are_ able to throw this amount
| of hardware at the problem. And in a few years you'll
| probably be able to do it on commodity hardware.
| injuly wrote:
| > This calculation is pretty pointless and the title is flat
| out wrong.
|
| No, it's not. The author clearly states in the very first
| paragraph that this is the price it would take _them to
| reproduce the results_.
|
| Nowhere in the article (or the title) have they implied that
| this is how much Google spent.
| arcade79 wrote:
| A lot of misunderstandings among the commenters here.
|
| From the link: "the total compute cost it would take to replicate
| the paper"
|
| It's not Google's cost. Google's cost is of course entirely
| different. It's the cost for the author if he were to rent the
| resources to replicate the paper.
|
| For Google, all of it is running at a "best effort" resource
| tier, grabbing available resources when not requested by higher
| priority jobs. It's effectively free resources (except
| electricity consumption). If any "more important" jobs with a
| higher priority come in and ask for the resources, the paper-
| writers' jobs will just be preempted.
| mrazomor wrote:
| This assumes the common resources (CPU, RAM, etc.), not the
| ones required for the LLM training (GPU, TPU, etc.). It's
| different economy.
|
| TL; DR: It's not ~free.
| akutlay wrote:
| Why does GPU matter? Do you think GCP keeps GPU utilization
| at 100% at all times?
| mrazomor wrote:
| What the OP is referring to requires _overprovisioning_ of
| the high priority traffic and the _sine-like utilization_
| (without it, the benefits of the "batch" tier is close to
| zero -- the preemption is too high for any meaningful work
| when you are close to the top of the utilization hill).
|
| You get that organically when you are serving lots of
| users. And there aren't many GPUs etc. used for that.
| Training LLMs gives you a different utilization pattern.
| The "best effort" resources aren't as useful in that setup.
| bbminner wrote:
| Because accelerators (tpus, gpus) unlike ram/cpu are
| notoriously hard to timeshare and virtualize. So if you get
| evicted in an environment like that, you have to reload
| your entire experiment state from a model checkpoint. With
| giant models like that, it might take dozens of minutes. As
| a result, I doubt that these experiments are done using
| "spare" resources - in that case, constant interruptions
| and reloading would result in these experiments finishing
| sometime around the heat death of the universe :)
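The eviction cost described above is easy to ballpark: reload time is checkpoint size over read bandwidth. A minimal sketch, with illustrative (not measured) sizes and bandwidth, assuming bf16 weights plus Adam-style optimizer state:

```python
# Rough reload-time arithmetic behind the "dozens of minutes" claim.
# All inputs are illustrative assumptions, not measured values.
def checkpoint_reload_minutes(n_params: float,
                              bytes_per_param: float = 2,             # bf16 weights
                              optimizer_bytes_per_param: float = 12,  # Adam-style state
                              read_gbytes_per_sec: float = 2.0) -> float:
    """Minutes to re-read weights + optimizer state after an eviction."""
    total_bytes = n_params * (bytes_per_param + optimizer_bytes_per_param)
    return total_bytes / (read_gbytes_per_sec * 1e9) / 60

# A hypothetical 70B-parameter model under these assumptions:
print(f"{checkpoint_reload_minutes(70e9):.0f} min")  # 8 min per eviction
```

Under frequent preemption, minutes of reload per eviction quickly dominate useful compute.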
| bombcar wrote:
| This is the side effect of underutilized capital and it's
| present in many cases.
|
| For example, if YOU want to rent a backhoe to do some yard
| rearrangement it's going to cost you.
|
| But Bob who owns BackHoesInc has them sitting around all the
| time when they're not being rented or used; he can rearrange
| his yard at wholesale cost or almost for free.
| thaumasiotes wrote:
| > This is the side effect of underutilized capital and it's
| present in many cases.
|
| "Underutilized" isn't the right word here. There's some value
| in putting your capital to productive use. But, once
| immediate needs are satisfied, there's _more_ value in having
| the capital available to address future needs quickly than
| there would be in making sure that everything necessary to
| address those future needs is tied up in low-value work.
| Option value is real value; being prepared for unforeseen but
| urgent circumstances is a real use.
| nathancahill wrote:
| Same effect when leasing companies let office space sit
| unoccupied for years on end. The future value is higher
| than the marginal value of reducing the price to fill it
| with a tenant.
| unyttigfjelltol wrote:
| Real estate is a playground for irrationally hopeful or
| stubborn participants.
| Bjartr wrote:
| That may be part of it for properties left
| unleased for years, but I believe it's not the only part.
|
| I believe the larger factor, and someone correct me if
| they have a better understanding of this, is that for
| commercially rented properties the valuation used to
| determine the mortgage terms you get takes into account
| what you claim to be able to get from rent. Renting for
| less than that reduces the valuation and can put you
| upside down on the mortgage. But the bank will let you
| defer mortgage payments, effectively taking each month of
| mortgage duration and moving it from now to after the
| last month of the mortgage duration, extending the time
| they earn interest for.
|
| So if no one wants to lease the space at that price after
| a prior lessee leaves for whatever reason, it's better
| for the property owner financially to leave the space
| vacant, sometimes for years, until someone willing to pay
| that price comes along, than to lower the rent and get a
| tenant.
| bombcar wrote:
| This is mostly correct. People assume commercial loan
| terms are like single-family homes "but larger" but
| they're not. They basically are all custom financial
| deals with multiple banks and may be over multiple
| properties. As long as total vacancy isn't above a cutoff
| the banks will be happy, but lowering rents "just to get
| a tenant" can harm the valuation and trigger terms.
|
| Part of the reason things like Halloween Superstores can
| pop in is the terms often exclude "short term leases"
| which are under six months.
|
| Also when you're leasing to companies, they are VERY
| quick to jump at lower prices if available, which means
| that if you drop the lease for one tenant, the others are
| sure to follow, sometimes even before lease terms are up.
| khafra wrote:
| Land Value Tax would fix this.
| bbarnett wrote:
| Many cities only tax on leased property, or have very low
| rates on unleased property.
| bombcar wrote:
| Yeah, airlines make "more return on capital" by faster
| turn-around of planes _to a point_ - if they are utilizing
| their airframes above 80 or 90 or whatever percent, the
| airline itself becomes extremely fragile and unable to
| handle incidents that impact timing.
|
| We saw the same thing with JIT manufacturing during Covid.
| franga2000 wrote:
| In the case of compute, you can evict low-priority jobs
| nearly instantly, so the compute capacity running spot
| instances and internal side-projets is just as available
| for unexpected bursts as it would be if sitting idle.
| axus wrote:
| I'm going to say this the next time I argue I need my
| servers online 24/7.
| thaumasiotes wrote:
| I'm not really sure I'm following you.
| efitz wrote:
| I think a better description than "underutilized" would be
| "sunk capex cost" - Google (or any cloud provider) cannot
| run at 100% customer utilization because then they could
| neither acquire new customers nor service transitory usage
| spikes for existing customers. So they stay ahead of
| predicted demand, which means that they will almost always
| have excess capacity available.
|
| Cloud _providers_ pay capital costs (CapEx) for servers,
| GPUs, data centers, employees, etc. Utilization allows them
| to recoup those costs faster.
|
| Cloud _customers_ pay operational expenses (OpEx) for
| usage.
|
| So Google generally has excess capacity, and while they
| would prefer revenue-generating customer usage, they've
| already paid for everything but the electricity, so it's
| extremely cheap for them to run their own jobs if the
| hardware would otherwise be sitting idle.
| bbarnett wrote:
| I doubt they are doing this, but if they did burn-in
| tests with 3 machines doing identical workloads, they
| could validate workloads but also test new infra. Unlike
| customer workloads, it would be OK to retry due to error.
|
| This would be 100% free, as all electricity and "wear and
| tear" would be required anyhow.
| immibis wrote:
| There is also a mathematical relationship in queuing
| theory between utilization and average queue length,
| which all programmers should be told:
| https://blog.danslimmon.com/2016/08/26/the-most-
| important-th...
|
| As you run close to 100% utilization, you also run close
| to infinity waiting times. You don't want that. It might
| be acceptable for your internal projects (the _actual_
| waiting time won't be infinity, and you'll cancel them
| if it gets too close to infinity) but it's certainly not
| acceptable for customers.
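The relationship in the linked post is the standard M/M/1 result: the mean number of jobs in the system is ρ/(1−ρ), which blows up as utilization approaches 1. A minimal sketch:

```python
# M/M/1 queue: mean number of jobs in the system is rho / (1 - rho),
# so queues (and therefore waits) explode as utilization nears 100%.
def mean_jobs_in_system(rho: float) -> float:
    if not 0 <= rho < 1:
        raise ValueError("at rho >= 1 the queue grows without bound")
    return rho / (1 - rho)

for rho in (0.5, 0.8, 0.9, 0.99):
    print(f"utilization {rho:.0%}: {mean_jobs_in_system(rho):5.1f} jobs in system")
```

Going from 90% to 99% utilization multiplies the average queue by eleven, which is why "run everything at 100%" is a trap for customer-facing capacity.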
| thaumasiotes wrote:
| There is a genre of game called "time management games"
| which will hammer this point home if you play them.
| They're not really considered 'serious' games, so you can
| find them in places where the audience is basically
| looking to kill time.
|
| https://www.bigfishgames.com/us/en/games/5941/roads-of-
| rome/...
|
| The structure of a time management game is:
|
| 1. There's a bunch of stuff to do on the map.
|
| 2. You have a small number of workers.
|
| 3. The way a task gets done is, you click on it, and the
| next time a worker is available, the worker will start on
| that task, which occupies the worker for some fixed
| amount of time until the task is complete.
|
| 4. Some tasks can't be queued until you meet a
| requirement such as completing a predecessor task or
| having enough resources to pay the costs of the task.
|
| You will learn immediately that having a long queue means
| flailing helplessly while your workers ignore hair-on-
| fire urgent tasks in favor of completely unimportant ones
| that you clicked on while everything seemed relaxed. It's
| far more important that you have the ability to respond
| to a change in circumstances than to have all of your
| workers occupied at all times.
| bombcar wrote:
| > You will learn immediately that having a long queue
| means flailing helplessly while your workers ignore hair-
| on-fire urgent tasks in favor of completely unimportant
| ones that you clicked on while everything seemed relaxed.
|
| Ah, sounds like Dwarf Fortress!
| immibis wrote:
| I was thinking Oxygen Not Included.
| dekhn wrote:
| In practice it's more complicated than this- borg isn't
| actually a queue, it's a priority-based system with
| preemption, although people layered queue systems on top.
| Further, granularity mattered a lot- you could get much
| more access to compute by asking for smaller slices
| (fractions of a CPU core, or fraction of a whole TPU
| cluster). There was a lot of "empty crack filling" at
| google.
| efitz wrote:
| TL/DR: You should think of and use queues like shock
| absorbers, not sinks. Also you need to monitor them.
|
| Queues are useful to decouple the output of one process
| to the input of another process, when the processes are
| not synchronized velocity-wise. Like a shock absorber,
| they allow both processes to continue at their own paces,
| and the queue absorbs instantaneous spikes in producer
| load above the steady state rate of the consumer (side
| note: if queues are isolated code- and storage-wise from
| the consumer process, then you can use the queue to
| prevent disruption in the producer process when you need
| to take the consumer down for maintenance or whatever).
|
| Running with very small queue lengths is generally fine
| and generally healthy.
|
| If you have a process that consistently runs with
| substantial queue lengths, then you have a mismatch
| between the workloads of the processes they connect - you
| either need to reduce the load from the producer or
| increase the throughput of the consumer of the queue.
|
| Very large queues tend to hide the workload mismatch
| problem, or worse. Often work put into queues is not
| stored locally on the producer, or is quickly
| overwritten. So a consumer end problem can result in
| potential irrevocable loss of everything in the queue,
| and the larger the queue, the bigger the loss. Another
| problem with large queues is that if your consumer
| process is only slightly faster than the producer
| process, then a large backlog of work in the queue can
| take a long time to work down, and it's even possible
| (admission of guilt) to configure systems using such
| queues such that they cannot recover from a lengthy
| outage, even if all the work items were stored in the
| queue.
|
| If you have queues, you need to monitor your queue
| lengths and alarm when queue lengths start increasing
| significantly above baseline.
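The backlog-recovery hazard in the last two paragraphs reduces to one line of arithmetic: drain time is the backlog divided by the consumer's headroom over the producer. A minimal sketch with made-up rates:

```python
# If the consumer is only slightly faster than the producer, a backlog
# takes a long time to drain; with no headroom it never recovers.
def drain_seconds(backlog: float, produce_rate: float, consume_rate: float) -> float:
    """Seconds to work down `backlog` items while production continues."""
    headroom = consume_rate - produce_rate
    if headroom <= 0:
        return float("inf")  # the queue cannot recover from the outage
    return backlog / headroom

# Illustrative: a 1M-item backlog after an outage, with the consumer
# only 5% faster than a 1000 items/s producer:
print(drain_seconds(1_000_000, 1000, 1050) / 3600)  # ~5.6 hours
```

This is the scenario where a large queue hides the mismatch: the system looks healthy until an outage, then spends hours (or forever) catching up.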
| mikepurvis wrote:
| Car lots with attached garages are like this too. That brake
| and suspension work they were going to charge you several
| thousand dollars for? Once you trade in ol' Bessie they'll do
| that for pennies on the dollar during slack time; it doesn't
| hurt them if the car sits around for a few weeks or months
| before being ready for sale.
| WarOnPrivacy wrote:
| > Car lots with attached garages are like this too.
|
| This was my first job after moving into this state. Between
| my labor and parts, it was about 15% of the sale price.
|
| My most interesting repair was a 1943 Cadillac, a 'war
| car'.
| punnerud wrote:
| Can others also buy the "best effort" tier?
|
| The job could easily run for weeks, even when you could buy
| your way to doing it in a day.
|
| Then have bidding on this "best effort" resource, where they
| factor in electricity prices at any given time.
| v3ss0n wrote:
| Sure, land a job there, work your way up against the
| corporate BS and toxicity, and you can get the best-effort tier.
|
| Those efforts need to be added to the cost calculation too.
| curt15 wrote:
| Is the "best effort" tier similar to AWS spot instances?
| WJW wrote:
| At every cloud provider there's probably a tier below
| "spot" (or whatever the equivalent is called at AWS's
| competitors) that is used for the low-priority jobs of the
| cloud provider itself.
| jeffbee wrote:
| You can speculate about this or you can look at how
| Google's internal workloads actually run, because they
| have released a large and detailed set of traces from
| Borg. They're really open about this.
|
| https://github.com/google/cluster-data
| dweekly wrote:
| Possible corollary: it may be difficult to regularly turn out
| highly compute-dependent research if you're paying full retail
| rack rates for your hardware (i.e. using someone else's cloud).
| huijzer wrote:
| Still, don't get high on your own supply.
| imtringued wrote:
| According to neoclassical economists this is impossible since
| you can easily and instantaneously scale infrastructure up and
| down continuously at no cost and the future is known so demand
| can be predicted reliably.
|
| The problem with neoclassical economics is that it doesn't
| concern itself with the physical counterpart of liquidity. It
| is assumed that the physical world is just as liquid as the
| monetary world.
|
| The "liquidity mismatch" between money and physical capital
| must be bridged through overprovisioning on the physical side.
| If you want the option to choose among n different products,
| but only choose m products, then the n - m unsold products must
| be priced into the m bought products. If you can repurpose the
| unsold products, then you make a profit or you can lower costs
| for the buyer of the m products.
|
| I would even go as far as to say that the production of
| liquidity is probably the driving force of the economy, because
| it means we don't have to do complicated central planning and
| instead use simple regression models.
| jopsen wrote:
| > I would even go as far as to say that the production of
| liquidity is probably the driving force of the economy.
|
| Isn't that all what high frequency traders would say? :)
|
| Perhaps there is some limit at which additional liquidity
| doesn't offer much value?
| marcosdumay wrote:
| I think you completely misunderstood the GP.
|
| There isn't much there about stocks markets.
| 152334H wrote:
| Is it free-priority based?
|
| I was told by an employee that GDM internally has a credits
| system for TPU allocation, with which researchers have to
| budget out their compute usage. I may have completely
| misunderstood what they were describing, though.
| floor_ wrote:
| Content aside, this is hands down my favorite blog format.
| mostthingsweb wrote:
| I agree, but I'm curious if it's for the same reason. I like it
| because there is no flowery writing. Just direct "here are the
| facts".
| pama wrote:
| 3USD/hour on the H100 is much more expensive than a reasonable
| amortized full ownership cost, unless one assumes the GPU is
| useless within 18 months, which I find a bit dramatic. The MFU
| can be above 40% and certainly well above the 35% in the
| estimate, also for small models with plain pytorch and trivial
| tuning [1]. I didn't read the linked paper carefully, but I
| seriously doubt the google team used vocab embedding layers with
| 2 D V parameters stated in the link, because this would be
| suboptimal by not tying the weights of the token embedding layer
| in the decoder architecture (even if they did double the params
| in these layers, it would not lead to 6 D V compute because the
| embedding input is indexed). To me these assumptions suggested a
| somewhat careless attitude towards the cost estimation and so I
| stopped reading the rest of this analysis carefully. My best
| guess is that the author is off by a large factor in the upward
| direction, and a true replication with H100/200 could be about 3x
| less expensive.
|
| [1] if the total cost estimate was relatively low, say less than
| 10k, then of course the lowest rental price and a random training
| codebase might make some sense in order to reduce administrative
| costs; once the cost is in the ballpark of millions of USD, it
| feels careless to avoid optimizing it further. H100s occasionally
| appear in fire sales or on eBay, which could reduce the cost even
| more, but the author already mentions 2USD/gpu/hour for bulk
| rental compute, which is better than the 3USD/gpu/hour estimate
| they used in the writeup.
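| For concreteness, the kind of back-of-envelope estimate being argued
| over here can be sketched as follows (a rough sketch only; the peak-
| FLOPs figure, MFU, and rental price are illustrative assumptions, not
| the blog's or the paper's exact numbers):

```python
# Rough training-cost sketch using the standard 6 * params * tokens
# FLOP approximation for a dense transformer.

H100_PEAK_FLOPS = 989e12  # approximate H100 BF16 dense peak, assumed

def training_cost_usd(params, tokens, mfu=0.35, usd_per_gpu_hour=3.0):
    total_flops = 6 * params * tokens            # forward + backward
    gpu_seconds = total_flops / (H100_PEAK_FLOPS * mfu)
    return gpu_seconds / 3600 * usd_per_gpu_hour

# e.g. a 1B-parameter model on 20B tokens at 35% MFU and $3/GPU-hour
# comes out to roughly $290 under these assumptions.
cost = training_cost_usd(1e9, 20e9)
```

Note that both MFU and the hourly price enter the estimate linearly,
which is why the 35% vs 40% and 3USD vs 2USD disagreements above
translate directly into the claimed multiplicative error.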
| 152334H wrote:
| You are correct on true H100 ownership costs being far lower.
| As I mention in the H100 blurb, the H100 numbers are fungible
| and I don't mind if you halve them.
|
| MFU can certainly be improved beyond 40%, as I mention. But on
| the point of small models specifically: the paper uses FSDP for
| all models, and I believe a rigorous experiment should not vary
| sharding strategy due to numerical differences. FSDP2 on small
| models will be slow even with compilation.
|
| The paper does not tie embeddings, as stated. The readout layer
| does lead to 6DV because it is a linear layer with D*V parameters,
| which cost 2 FLOPs each in the forward pass and 4 in the backward. I would
| appreciate it if you could limit your comments to factual
| errors in the post.
| lonk11 wrote:
| I think the commenter was thinking about the input embedding
| layer, where to get an input token embedding the model does a
| lookup of the embedding by index, which is constant time.
|
| And the blog post author is talking about the output layer
| where the model has to produce an output prediction for every
| possible token in the vocabulary. Each output token
| prediction is a dot-product between the transformer hidden
| state (D) and the token embedding (D) (whether shared with
| input or not) for all tokens in the vocabulary (V). That's
| where the VD comes from.
|
| It would be great to clarify this in the blog post to make it
| more accessible but I understand that there is a tradeoff.
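| The accounting both comments converge on can be sketched as follows
| (D and V here are illustrative values, not the paper's actual
| config):

```python
# Per-token FLOP count for an untied D x V readout (output) layer,
# using the common 2x-forward / 4x-backward rule of thumb.

def readout_flops_per_token(d_model, vocab_size):
    forward = 2 * d_model * vocab_size   # one multiply + one add per weight
    backward = 4 * d_model * vocab_size  # roughly twice the forward cost
    return forward + backward            # = 6 * D * V in total

d, v = 2048, 32_000  # hypothetical model width and vocabulary size
assert readout_flops_per_token(d, v) == 6 * d * v
```

The input embedding, by contrast, is an index lookup, which is why it
contributes essentially no compute even though it holds another D*V
parameters when untied.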
| pama wrote:
| My bad on the 6 D V estimate; you are correct that if they do
| a dense decoding (rather than a hierarchical one as google
| used to do in the old days) the cost is exactly 6 D V. I
| cannot edit the GP comment and I will absorb the shame of my
| careless words there. I was put off by the subtitle and
| initial title of this HN post, though the current title is
| more appropriate and correct.
|
| Even if it's a small model, one could use ddp or FSDP/2
| without slowdowns on fast interconnect, which certainly adds
| to the cost. But if you want to reproduce all the work at the
| cheapest price point you only need to parallelize to the
| minimal level for fitting in memory (or rather, the one that
| maxes the MFU), so everything below 2B parameters runs on a
| single H100 or single node.
| spi wrote:
| Do you have sources for "The MFU can be above 40% and certainly
| well above the 35% in the estimate"?
|
| Looking at [1], the authors there claim that their improvements
| were needed to push BERT training beyond 30% MFU, and that the
| "default" training only reaches 10%. Certainly numbers don't
| translate exactly, it might well be that with a different
| stack, model, etc., it is easier to surpass, but 35% doesn't
| seem like a terribly off estimate to me. Especially so if you
| are training a whole suite of different models (with different
| parameters, sizes, etc.) so you can't realistically optimize
| all of them.
|
| It might be that the real estimate is around 40% instead of the
| 35% used here (frankly it might be that it is 30% or less, for
| that matter), but I would doubt it's so high as to make the
| estimates in this blog post terribly off, and I would doubt
| even more that you can get that "also for small models with
| plain pytorch and trivial tuning".
|
| [1] https://www.databricks.com/blog/mosaicbert
| tedivm wrote:
| When I was at Rad AI we did out the math on rent versus buy,
| and it was just so absolutely ridiculously obvious that buy was
| the way to go. Cloud does not make sense for AI training right
| now, as the overhead costs are considerably higher than simply
| purchasing a cluster, colocating it at a place like Colovore,
| and paying for "on hands" support. It's not even close.
| rgmerk wrote:
| Worth pointing out here that in other scientific domains, papers
| routinely require hundreds of thousands of dollars, sometimes
| millions of dollars, of resources to produce.
|
| My wife works on high-throughput drug screens. They routinely use
| over $100,000 of consumables in a single screen, not counting the
| cost of the screening "libraries", the cost of using some of the
| ~$10M of equipment in the lab for several weeks, the cost of
| the staff in the lab itself, and the cost of the time of the
| scientists who request the screens and then take the results and
| turn them into papers.
| ramraj07 wrote:
| I estimated that for any paper that involves mouse work and is
| produced in a first-world country (i.e. they have to do right by
| the animals), the minimum cost in expenses and salary would be
| $200,000. The average is likely higher. Tens of thousands of
| papers a year published like this!
| esperent wrote:
| To be fair, supposing the Google paper took six months to a
| year to produce, it also must have cost several hundred
| thousand dollars in salaries and other non-compute costs.
| paxys wrote:
| These are mostly fixed costs. If you produce a hundred papers
| from the same team and same research, the costs aren't 100x.
| lucianbr wrote:
| But starting from the 10th paper, the value is also pretty
| low I imagine. How many new things can you discover from
| the same team and same research? That's 3 papers per year
| for a 30-year career. Every single year, no breaks.
| sdenton4 wrote:
| Well, to be sure, mouse research consistently produces
| amazing cures for cancer, insomnia, lost limbs, and even
| gravity itself. Sure, none of it translates to humans,
| but it's an important source of headlines for high impact
| journals and science columnists.
| godelski wrote:
| This is also true for machine learning papers. They cure
| cancer, discover physics, and all sorts of things. Sure,
| they don't actually translate to useful science, but they
| are highly valuable pieces of advertisements. And hey,
| maybe someday they might!
| computerdork wrote:
| Agreed about this for past mouse research, but this is
| changing as mice are being genetically engineered to be
| more human:
| https://news.uthscsa.edu/scientists-create-first-mouse-model...
| https://www.nature.com/articles/s41467-019-09716-7
| godelski wrote:
| > How many new things can you discover from the same team
| and same research?
|
| That all depends on how you measure discoveries. The most
| common metric is... publications. Publications are what
| advance your career and are what you are evaluated on.
| The content may or may not matter (lol who reads your
| papers?) but the number certainly does. So the best way
| to advance your career is to write a minimum viable paper
| and submit as often as possible. I think we all forget
| how Goodhart's Law comes to bite everyone in the ass.
| dumb1224 wrote:
| Well not everyone starts experiment anew. Many also reuse
| accumulated datasets. For human data even more so.
| slashdave wrote:
| I assure you that the companies performing these screens expect
| a return on this investment. It is not for a journal paper.
| godelski wrote:
| I used to believe this line. But then I worked for a big tech
| company where my manager constantly made those remarks ("the
| difference in industry and academia is that in industry it
| has to actually work"). I then improved the generalization
| performance (i.e. "actually work") by over 100% and they
| decided not to update the model they were selling. Then
| again, I had a small fast model and it was 90% as accurate as
| the new large transformer model. Though they also didn't take
| the lessons learned and apply them to the big model, which
| had similar issues but were just masked by the size.
|
| Plus, I mean, there are a lot of products that don't work. We
| all buy garbage and often can't buy not garbage. Though I
| guess you're technically correct that in either of these
| situations there can still be a return on investment, but
| maybe that shouldn't be good enough...
| shpongled wrote:
| The post you are replying to is talking about high
| throughput assays for drug development. This is something
| actually run in a lab, not a model. As another person
| working at a biotech, I can assure you that screens are not
| just run as busy work.
| rgmerk wrote:
| No they're not busywork, but not all such screens are
| directly in the drug discovery pipeline.
| dont_forget_me wrote:
| All that compute power just to invade privacy and show people
| more ads. Can this get any more depressing?
| psychoslave wrote:
| Yes, sure! Imagine a world where every HN thread you engage in
| is fed with information that is all subtly tailored to push
| you into buying whatever crap the market is able to produce.
| jeffbee wrote:
| I think if you wanted to think about a big expense you'd look at
| AlphaStar.
| 5kg wrote:
| I am wondering if AlphaStar is the most expensive paper ever.
| lern_too_spel wrote:
| "Observation of a new particle in the search for the Standard
| Model Higgs boson with the ATLAS detector at the LHC"
| jeffbee wrote:
| I think it could be. I also think it is likely that HN
| frequenter `dekhn` has personally spent more money on compute
| resources than any other living human, so maybe they will
| chime in on how the cost gets allocated to the research.
| dekhn wrote:
| A big part of it is basically hard production quota: the
| ability to run jobs at a high priority on large machines
| for an entire quarter. The main issue was that quota was
| somewhat overallocated, or otherwise unable to be used (if
| you and another team both wanted a full TPUv3 with all its
| nodes and fabric).
|
| From what I can tell, ads made the money and search/ads
| bought machines with their allocated budget, TI used their
| budget to run the systems, and then funny money in the form
| of quota was allocated to groups. The money was "funny" in
| the sense that the full reach-through costs of operating a
| TPU for a year looks completely different from the
| production allocation quota that gets handed out. I think
| Google was long trying to create a market economy, but it
| was really much more like a state-funded exercise.
|
| (I am not proud of how much CPU I wasted on protein
| folding/design and drug discovery, but I'm eternally
| thankful for Urs giving me the opportunity to try it out
| and also to compute the energy costs associated with the
| CPU use)
| ipsum2 wrote:
| It's disappointing that they never developed AlphaStar enough
| to become superhuman (unlike AlphaGo); even lower-level
| players were able to adapt to its playstyle.
|
| The cost was probably the limiting factor.
| brg wrote:
| I found this exercise interesting, and as arcade79 pointed out it
| is the cost of replication, not the cost to Google. Humorously, I
| wonder what the cost of replicating the Higgs boson verification
| or gravitational wave detection would be.
| faitswulff wrote:
| I wonder how many tons of CO2 that amounts to. Google Gemini
| estimated 125,000 tons of carbon emissions, but I don't have the
| know-how to double check it.
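| For a sanity check, a rough conversion from GPU-hours to CO2 can be
| sketched like this (the board power, PUE, and grid carbon intensity
| are all generic assumptions, not Google's actual figures):

```python
# Rough CO2 estimate from GPU-hours. Every constant here is an
# assumption chosen for illustration.

GPU_POWER_KW = 0.7     # ~700 W per H100 board, assumed
PUE = 1.1              # datacenter overhead factor, assumed
KG_CO2_PER_KWH = 0.4   # generic grid mix, assumed

def tons_co2(gpu_hours):
    kwh = gpu_hours * GPU_POWER_KW * PUE
    return kwh * KG_CO2_PER_KWH / 1000  # metric tons

# e.g. 3 million H100-hours under these assumptions is on the order
# of 1,000 tons, which suggests a 125,000-ton figure would require
# either vastly more compute or far dirtier power.
estimate = tons_co2(3_000_000)
```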
| chazeon wrote:
| If you use solar energy, then there is no CO2 emission. Right?
| ipsum2 wrote:
| Google buys carbon credits to make up for CO2 emissions,
| they've never relied strictly on solar.
| godelski wrote:
| Worth mentioning that "GPU Poor" isn't created because those
| without much GPU compute can't contribute, but rather because
| those with massive amounts of GPU are able to perform many more
| experiments and set a standard, or shift the Overton window. The
| big danger here is just that you'll start expecting a higher
| "thoroughness" from everyone else. You may not expect this level,
| but seeing this level often makes you think what was sufficient
| before is far from sufficient now, and what's the cost of that
| lower bound?
|
| I mention this because a lot of universities and small labs are
| being edged out of the research space but we still want their
| contributions. It is easy to always ask for more experiments but
| the problem is, as this blog shows, those experiments can
| sometimes cost millions of dollars. This also isn't to say that
| small labs and academics aren't able to publish, but rather that
| 1) we want them to be able to publish __without__ the support of
| large corporations to preserve the independence of research[0],
| 2) we don't want these smaller entities to have to go through a
| roulette wheel in an effort to get published.
|
| Instead, when reviewing be cautious in what you ask for. You can
| __always__ ask for more experiments, datasets, "novelty", and so
| on. Instead ask if what's presented is sufficient to push forward
| the field in any way and when requesting the previous things be
| specific as to why what's in the paper doesn't answer what's
| needed and what experiment would answer it (a sentence or two
| would suffice).
|
| If not, then we'll have the death of the GPU poor and that will
| be the death of a lot of innovation, because the truth is, not
| even big companies will allocate large compute for research that
| is lower level (do you think state space models (mamba) started
| with multimillion dollar compute? Transformers?). We gotta start
| somewhere and all papers can be torn to shreds/are easy to
| critique. But you can be highly critical of a paper and that
| paper can still push knowledge forward.
|
| [0] Lots of papers these days are indistinguishable from ads. A
| lot of papers these days are products. I've even had works
| rejected because they are being evaluated as products not being
| evaluated on the merits of their research. Though this can be
| difficult to distinguish when evaluation is simply empirical.
|
| [1] I once got desk rejected for "prior submission." 2 months
| later they overturned it, realizing it was in fact an arxiv
| paper, for only a month later for it to be desk rejected again
| for "not citing relevant materials" with no further explanation.
| hiddencost wrote:
| It's likely the cost of the researchers was about $1M/head; with
| 11 names, that puts the staffing costs on par with the compute
| costs.
|
| (A good rule of thumb is that an employee costs about twice their
| total compensation.)
___________________________________________________________________
(page generated 2024-07-30 23:00 UTC)