[HN Gopher] AWS vs. GCP reliability is wildly different
___________________________________________________________________
AWS vs. GCP reliability is wildly different
Author : icyfox
Score : 160 points
Date : 2022-09-21 20:29 UTC (2 hours ago)
(HTM) web link (freeman.vc)
(TXT) w3m dump (freeman.vc)
| user- wrote:
| I wouldn't call this reliability, which already has a loaded
| definition in the cloud world; I'd call it something along the
| lines of time-to-start or latency.
| 1-6 wrote:
| This is all about cloud GPUs, I was expecting something totally
| different from the title.
| s-xyz wrote:
| Would be interested to see a comparison of Lambda functions vs
| Google's 2nd gen Cloud Functions. I think that GCP is more
| serverless focused.
| duskwuff wrote:
| ... why does the first graph show some instances as having a
| negative launch time? Is that meant to indicate errors, or has
| GCP started preemptively launching instances to anticipate
| requests?
| tra3 wrote:
| The y axis here measures the duration it took to successfully
| spin up the box, where negative results were requests that
| timed out after 200 seconds. The results are pretty staggering.
| zaltekk wrote:
| I don't know how that value (looks like -50?) was chosen, but
| it seems to correspond to the launch failures.
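| A guess at how that encoding might be produced, assuming a
| simple pandas/matplotlib setup where timed-out requests are
| stored as NaN and mapped to a -50 sentinel so failures stay
| visible on the same axis (column names are made up):
|
|     import pandas as pd
|     import matplotlib.pyplot as plt
|
|     # NaN = request timed out after 200s (assumed schema)
|     df = pd.DataFrame(
|         {"launch_seconds": [12.1, 15.3, None, 48.0, None]})
|
|     FAIL_SENTINEL = -50  # negative, so failures sit below zero
|     df["plotted"] = df["launch_seconds"].fillna(FAIL_SENTINEL)
|
|     df["plotted"].plot(style=".")
|     plt.axhline(0, color="gray", linewidth=0.5)
|     plt.ylabel("seconds to launch (negative = timed out)")
|     plt.show()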
| staringback wrote:
| Perhaps if you read the line directly above the graph you would
| see it was explained and would not have to ask this question.
| zmmmmm wrote:
| > In total it scaled up about 3,000 T4 GPUs per platform
|
| > why I burned $150 on GPUs
|
| How do you rent 3000 GPUs over a period of weeks for $150? Were
| they literally requisitioning them and releasing them immediately?
| Seems like this is quite an unrealistic type of usage pattern and
| would depend a lot on whether the cloud provider optimises to
| hand you back the same warm instance you just relinquished.
|
| > GCP allows you to attach a GPU to an arbitrary VM as a hardware
| accelerator
|
| It's quite fascinating that GCP can do this. GPUs are physical
| things (!). Do they provision every single instance type in the
| data center with GPUs? That would seem very expensive.
| bushbaba wrote:
| Unlikely. More likely they put your VM on a host with a GPU
| attached, and use live migration to move workloads around for
| better resource utilization.
|
| However, live migration can impact HPC workloads.
| ZiiS wrote:
| GPUs are physical but VMs are not; I expect they just move them
| to a host with a GPU.
| NavinF wrote:
| It probably live-migrates your VM to a physical machine that
| has a GPU available.
|
| ...if there are any GPUs available in the AZ that is. I had a
| hell of a time last year moving back and forth between regions
| to grab just 1 GPU to test something. The web UI didn't have an
| "any region" option for launching VMs, so if you don't use the
| API you have to sit there for 20 minutes trying each AZ/region
| until you manage to grab one.
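| A rough sketch of scripting that hunt instead of clicking
| through the console, using the standard gcloud flags for
| attaching a T4; the zone list and instance name are made up:
|
|     import subprocess
|
|     # Zones to try, in order of preference (hypothetical list).
|     ZONES = ["us-central1-a", "us-central1-b",
|              "us-east1-c", "europe-west4-a"]
|
|     def try_create(zone: str) -> bool:
|         # GPU VMs generally need --maintenance-policy=TERMINATE.
|         result = subprocess.run(
|             ["gcloud", "compute", "instances", "create",
|              "t4-scratch",
|              "--zone", zone,
|              "--machine-type", "n1-standard-4",
|              "--accelerator", "type=nvidia-tesla-t4,count=1",
|              "--maintenance-policy", "TERMINATE"],
|             capture_output=True, text=True)
|         # Non-zero exit usually means no capacity in that zone.
|         return result.returncode == 0
|
|     for zone in ZONES:
|         if try_create(zone):
|             print(f"got a T4 in {zone}")
|             break
|     else:
|         print("no capacity in any zone on the list")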
| remus wrote:
| > The offerings between the two cloud vendors are also not the
| same, which might relate to their differing response times. GCP
| allows you to attach a GPU to an arbitrary VM as a hardware
| accelerator - you can separately configure quantity of the CPUs
| as needed. AWS only provisions defined VMs that have GPUs
| attached - the g4dn.x series of hardware here. Each of these
| instances are fixed in their CPU allocation, so if you want one
| particular varietal of GPU you are stuck with the associated CPU
| configuration.
|
| At a surface level, the above (from the article) seems like a
| pretty straightforward explanation? GCP gives you more
| flexibility in configuring GPU instances, with the trade-off of
| increased startup time variability.
| btgeekboy wrote:
| I wouldn't be surprised if GCP has GPUs scattered throughout
| the datacenter. If you happen to want to attach one, it has to
| find one for you to use - potentially live migrating your
| instance or someone else's so that it can connect them. It'd
| explain the massive variability between launch times.
| master_crab wrote:
| Yeah that was my thought too when I first read the blurb.
|
| It's neat...but like a lot of things in large scale
| operations, the devil is in the details. GPU-CPU
| communication is a low-latency, high-bandwidth operation, not
| something you can trivially do over standard TCP. If GCP
| offers something like that without the ability to
| flawlessly migrate the VM or procure enough "local" GPUs,
| it's just vaporware.
|
| As a side note, I'm surprised the author didn't note the
| number of ICEs (insufficient capacity errors) AWS throws
| whenever you spin up a G-type instance. AWS is notorious for
| offering very few G's and P's in certain AZs and regions.
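| Those ICEs come back as an API error code rather than a failed
| instance; a rough sketch of catching them with boto3 and
| falling back to another AZ (the AMI and subnet IDs are
| placeholders):
|
|     import boto3
|     from botocore.exceptions import ClientError
|
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|
|     # Placeholder subnets, one per AZ we're willing to try.
|     SUBNETS = {"us-east-1a": "subnet-aaa",
|                "us-east-1b": "subnet-bbb"}
|
|     def launch_g4dn():
|         for az, subnet in SUBNETS.items():
|             try:
|                 resp = ec2.run_instances(
|                     ImageId="ami-12345678",  # placeholder
|                     InstanceType="g4dn.xlarge",
|                     SubnetId=subnet,
|                     MinCount=1, MaxCount=1)
|                 return resp["Instances"][0]["InstanceId"]
|             except ClientError as err:
|                 code = err.response["Error"]["Code"]
|                 if code == "InsufficientInstanceCapacity":
|                     continue  # no G4 capacity here; try next AZ
|                 raise
|         raise RuntimeError("no G4 capacity in any configured AZ")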
| dekhn wrote:
| What would you expect? AWS is an org dedicated to giving
| customers what they want and charging them for it, while GCP is
| an org dedicated to telling customers what they want and using
| the revenue to get slightly better cost margins on Intel servers.
| dilyevsky wrote:
| I don't believe this reasoning is used since at least Diane
| dekhn wrote:
| I haven't seen any real change from Google in how they
| approach cloud in the past decade (first as an employee and
| developer of cloud services there, and now as a customer).
| Their sales people have hollow eyes
| playingalong wrote:
| This is great.
|
| I have always felt there is so little independent content on
| benchmarking the IaaS providers. There is so much you can
| measure in how they behave.
| endisneigh wrote:
| this doesn't really seem like a fair comparison, nor is it a
| measure of "reliability".
| humanfromearth wrote:
| We have constant autoscaling issues because of this in GCP - glad
| someone plotted this - hope people in GCP will pay a bit more
| attention to this. Thanks to the OP!
| kccqzy wrote:
| Heard from a Googler that the internal infrastructure (Borg) is
| simply not optimized for quick startup. Launching a new Borg job
| often takes multiple minutes before the job runs. Not surprising
| at all.
| dekhn wrote:
| A well-configured isolated borg cluster and well-configured job
| can be really fast. If there's no preemption (i.e., no other
| job that has to be kicked off and given a grace period), the
| packages are already cached locally, there is no undue load on
| the scheduler, the resources are available, and it's a single
| job with tasks rather than multiple jobs, startup will be
| close to instantaneous.
|
| I spent a significant fraction of my 11+ years there clicking
| Reload on my job's borg page. I was able to (re-)start ~100K
| jobs globally in about 15 minutes.
| dekhn wrote:
| booting VMs != starting a borg job.
| kccqzy wrote:
| The technology may be different but the culture carries over.
| People simply don't have the habit of optimizing for startup
| time.
| readams wrote:
| Borg is not used for gcp vms, though.
| dilyevsky wrote:
| It is used, but the Borg scheduler does not manage VM startups.
| epberry wrote:
| Echoing this. The SRE book is also highly revealing about how
| Google's request prioritization works. https://sre.google/sre-
| book/load-balancing-datacenter/
|
| My personal opinion is that Google's resources are more tightly
| optimized than AWS's, and that they try to find the 99% best
| allocation where AWS settles for the 95% best allocation, which
| leads to more rejected requests. Open to being wrong on this.
| valleyjo wrote:
| As another comment points out, GPU resources are less common so
| they take longer to create, which makes sense. In general, start
| up times are pretty quick on GCP as other comments also
| confirm.
| MonkeyMalarky wrote:
| I would love to see the same for deploying things like a
| cloud/lambda function.
| politelemon wrote:
| A few weeks ago I needed to change the volume type on an EC2
| instance to gp3. Following the instructions, I made the change
| while the instance was running. I didn't need to reboot or stop
| the instance; it just changed the type. While the instance was
| running.
|
| I didn't understand how they were able to do this, I had thought
| volume types mapped to hardware clusters of some kind. And since
| I didn't understand, I wasn't able to distinguish it from magic.
| osti wrote:
| Look up AWS Nitro on YouTube if you are interested in learning
| more about it.
| ArchOversight wrote:
| Changing the volume type on AWS is somewhat magical. Seeing it
| happen online was amazing.
| cavisne wrote:
| EBS is already replicated so they probably just migrate behind
| the scenes, same as if the original physical disk was
| corrupted. It looks like only certain conditions allow this
| kind of migration.
|
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/modify-v...
| Salgat wrote:
| If I remember right they use the equivalent of a ledger of
| changes to manage volume state. So in this case, they copy over
| the contents (up to a certain point in time) to the new faster
| virtual volume, then append and direct all new changes to the
| new volume.
|
| This is also how they are able to snapshot a volume at a
| certain point in time without having any downtime or data
| inconsistencies.
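| A toy sketch of that idea, where the "ledger" is an append-only
| list of block writes and a snapshot is just a recorded log
| position; this is the general copy-on-write shape, not AWS's
| actual implementation:
|
|     class LedgerVolume:
|         """Toy volume: state = replay of an append-only log."""
|
|         def __init__(self):
|             self.log = []        # ordered (block_id, data) writes
|             self.snapshots = {}  # snapshot name -> log position
|
|         def write(self, block_id, data):
|             self.log.append((block_id, data))
|
|         def snapshot(self, name):
|             # O(1): remember the log position; no downtime,
|             # no copying at snapshot time.
|             self.snapshots[name] = len(self.log)
|
|         def read(self, block_id, snapshot=None):
|             end = (self.snapshots[snapshot] if snapshot
|                    else len(self.log))
|             data = None
|             for bid, d in self.log[:end]:  # last write wins
|                 if bid == block_id:
|                     data = d
|             return data
|
|     vol = LedgerVolume()
|     vol.write("blk0", b"v1")
|     vol.snapshot("before-change")
|     vol.write("blk0", b"v2")
|     assert vol.read("blk0") == b"v2"
|     assert vol.read("blk0", snapshot="before-change") == b"v1"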
| xyzzyz wrote:
| Dunno about AWS, but GCP uses live migration, and will migrate
| your VM across physical machines as necessary. The disk volumes
| are all connected over the network, nothing really depends on
| the actual physical machine your VM runs on.
| lbhdc wrote:
| How does migrating a vm to another physical machine work?
| the_duke wrote:
| This blog post is pretty old (2015) but gives a good
| introduction.
|
| https://cloudplatform.googleblog.com/2015/03/Google-
| Compute-...
| lbhdc wrote:
| Thanks for sharing, I will give it a read!
| rejectfinite wrote:
| vsphere vmotion has been a thing for years lmao
| roomey wrote:
| VMware has been doing this for years, it's called vmotion
| and there is a lot of documentation about it if you are
| interested (eg https://www.thegeekpub.com/8407/how-vmotion-
| works/ )
|
| Essentially, memory state is copied to the new host, the VM
| is stunned for a millisecond, the CPU state is copied, and
| the VM is resumed on the new host (you may see a dropped
| ping). All the networking and storage is virtual anyway, so
| that is "moved" (it's not really moved) in the background.
| davidro80 wrote:
| lbhdc wrote:
| That is really interesting, I didn't realize it was so
| fast. Thanks for the post, I will give it a read!
| mh- wrote:
| Up to 500ms per your source, depending on how much churn
| there is in the memory from the source system.
|
| Very cool.
| valleyjo wrote:
| Stream the contents of RAM from source to dest, pause the
| source, reprogram the network and copy any memory that
| changed since the initial stream, resume the dest, destroy
| the source, profit.
| pclmulqdq wrote:
| They pause your VM, copy everything about its state over to
| the new machine, and quickly start the other instance. It's
| pretty clever. I think there are tricks you can play with
| machines that have large memory footprints to copy most of
| it before the pause, and only copy what has changed since
| then during the pause.
|
| The disks are all on the network, so no need to move
| anything there.
| prmoustache wrote:
| In reality it syncs the memory to the other host first and
| only pauses the VM when the remaining state delta is small
| enough that the pause is barely measurable.
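| A runnable toy of that iterative pre-copy loop, with memory as
| a dict and dirty-page tracking as a set (real hypervisors do
| this at the page-table level; the numbers are arbitrary):
|
|     import random
|
|     class ToyVM:
|         def __init__(self, pages):
|             self.memory = dict(pages)
|             self.dirty = set(self.memory)  # all pages start dirty
|
|         def guest_writes(self, n):
|             # Simulate the workload touching pages mid-migration.
|             for p in random.sample(sorted(self.memory), n):
|                 self.memory[p] = b"new"
|                 self.dirty.add(p)
|
|     def live_migrate(src, activity_rounds=3, max_pause_pages=4):
|         dst = {}
|         # Pre-copy: copy while the guest runs, re-copy whatever
|         # it dirties, until the remaining delta is small.
|         while len(src.dirty) > max_pause_pages:
|             to_copy, src.dirty = src.dirty, set()
|             for page in to_copy:
|                 dst[page] = src.memory[page]
|             if activity_rounds > 0:
|                 src.guest_writes(8)
|                 activity_rounds -= 1
|         # Stop-and-copy: the pause covers only the small delta,
|         # so it is barely measurable (a dropped ping at worst).
|         for page in src.dirty:
|             dst[page] = src.memory[page]
|         return dst
|
|     src = ToyVM({f"page{i}": b"old" for i in range(64)})
|     assert live_migrate(src) == src.memory  # dst converged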
| lbhdc wrote:
| When it's transferring the state to the target, how does
| it handle memory updates that are happening at that time?
| Is the program's execution paused at that point?
| water-your-self wrote:
| Indiana Jones and the register states
| valleyjo wrote:
| Azure, AWS and GCP all have live migration. VMWare has it
| too.
| dilyevsky wrote:
| Ec2 does not have live migration. On azure it's spotty so
| not every maintenance can offer it.
| [deleted]
| free652 wrote:
| Are you sure? AWS consistently requires me to migrate
| to a different host. They go as far as shutting down
| instances, but don't do any kind of live migration.
| shrubble wrote:
| Assuming this blurb is accurate: " General-purpose SSD volume
| (gp3) provides the consistent 125 MiB/s throughput and 3000
| IOPS within the price of provisioned storage. Additional IOPS
| (up to 16,000) and throughput (1000 MiB/s) can be provisioned
| with an additional price. The General-purpose SSD volume (gp2)
| provides 3 IOPS per GiB storage provisioned with a minimum of
| 100 IOPS"
|
| ... then it seems like there is a device limiting bandwidth,
| either on the storage cluster or between the node and the
| storage cluster. 125 MiB/s is right around the speed of a
| 1 Gbit link, I believe. So it wouldn't be surprising if the
| change is just a networking setting flipped in-switch.
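| Putting rough numbers on that, using the figures from the
| quoted blurb (the 1 Gbit/s conversion is my own):
|
|     # gp2: 3 IOPS per GiB, floor 100, ceiling 16,000.
|     # gp3: 3000 IOPS baseline regardless of size.
|     def gp2_iops(size_gib):
|         return min(16_000, max(100, 3 * size_gib))
|
|     for size in (20, 100, 500, 1000):
|         print(f"{size:>5} GiB  gp2: {gp2_iops(size):>5} IOPS"
|               f"   gp3 baseline: 3000 IOPS")
|
|     # 125 MiB/s vs a 1 Gbit link:
|     mib_per_s = 1_000_000_000 / 8 / (1024 * 1024)
|     print(f"1 Gbit/s is about {mib_per_s:.0f} MiB/s")  # ~119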
| nonameiguess wrote:
| This would have been my guess. All EBS volumes are stored on
| a physical disk that supports the highest bandwidth and IOPS
| you can live migrate to, and the actual rates you get are
| determined by something in the interconnect. Live migration
| is thus a matter of swapping out the interconnect between the
| VM and the disk or even just relaxing a logical rate-limiter,
| without having to migrate your data to a different disk.
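| If it is a logical limiter, the shape is probably something
| like a token bucket whose refill rate is the provisioned
| throughput; a generic sketch, not AWS internals:
|
|     import time
|
|     class TokenBucket:
|         """Changing volume type = changing the refill rate."""
|
|         def __init__(self, rate_bytes_per_s, burst_bytes):
|             self.rate = rate_bytes_per_s
|             self.capacity = burst_bytes
|             self.tokens = burst_bytes
|             self.last = time.monotonic()
|
|         def allow(self, nbytes):
|             now = time.monotonic()
|             self.tokens = min(self.capacity,
|                               self.tokens +
|                               (now - self.last) * self.rate)
|             self.last = now
|             if nbytes <= self.tokens:
|                 self.tokens -= nbytes
|                 return True
|             return False  # caller queues/throttles this I/O
|
|     # "Upgrading" the volume is just raising the limit;
|     # no data has to move.
|     limiter = TokenBucket(125 * 1024**2, 256 * 1024**2)
|     limiter.rate = 1000 * 1024**2  # gp3 with 1000 MiB/s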
| prmoustache wrote:
| The actual migration is not instantaneous despite the
| volume being immediately reported as gp3. You get a status
| change to "optimizing", if my memory is correct, with a
| percentage. And the larger the volume, the longer it takes,
| so there is definitely a sync to faster storage.
| 0xbadcafebee wrote:
| Reliability in general is measured on the basic principle of:
| _does it function within our defined expectations?_ As long as
| it's launching, and it eventually responds within SLA/SLO limits,
| and on failure comes back within SLA/SLO limits, it is reliable.
| Even with GCP's multiple failures to launch, that may still be
| considered "reliable" within their SLA.
|
| If both AWS and GCP had the same SLA, and one did better than the
| other at starting up, you could say one is _more performant_ than
| the other, but you couldn't say it's _more reliable_ if they are
| both meeting the SLA. It's easy to look at something that never
| goes down and say "that is more reliable", but it might have been
| pure chance that it never went down. Always read the fine print,
| and don't expect anything better than what they guarantee.
| cmcconomy wrote:
| I wish Azure was here to round it out!
| londons_explore wrote:
| AWS normally has machines sitting idle just waiting for you to
| use. That's why they can get you going in a couple of seconds.
|
| GCP on the other hand fills all machines with background jobs.
| When you want a machine, they need to terminate a background job
| to make room for you. That background job has a shutdown grace
| time. Usually that's 30 seconds.
|
| Sometimes, to prevent fragmentation, they actually need to
| shuffle around many other users to give you the perfect slot -
| and some of those jobs have start-new-before-stop-old semantics -
| that's why sometimes the delay is far higher too.
| dekhn wrote:
| borg implements preemption but the delay to start VMs is not
| because they are waiting for a background task to clean up.
| devxpy wrote:
| Is this testing for spot instances?
|
| In my limited experience, persistent (on-demand) GCP instances
| always boot up much faster than AWS EC2 instances.
| marcinzm wrote:
| In my experience GPU persistent instances often simply don't
| boot up on GCP due to lack of available GPUs. One reason I
| didn't choose GCP at my last company.
| rwalle wrote:
| Looks like the author has never heard of the word "histogram"
|
| That graph is a pain to read.
| charbull wrote:
| Can you put this in the context of the problem/use case/need you
| are solving for?
| ajross wrote:
| Worth pointing out that the article is measuring provisioning
| latency and success rates (how quickly can you get a GPU box
| running and whether or not you get an error back from the API
| when you try), and not "reliability" as most readers would
| understand it (how likely they are to do what you want them to do
| without failure).
|
| Definitely seems like interesting info, though.
| curious_cat_163 wrote:
| Setting the use of the word "reliability" aside, it is interesting
| to see the differences in launch time and errors.
|
| One explanation is that AWS has been at it longer, so they know
| better. That seems like an unsatisfying explanation though, given
| Google's massive advantage on building and running distributed
| systems.
|
| Another explanation could be that AWS is more "customer-focused",
| i.e. they pay a lot more attention to technical issues that are
| perceptible to a blog writer. But I am not sure why Google would
| not be incentivized to do the same. They are certainly motivated
| and have brought the capital to bear in this fight.
|
| So, what gives?
| PigiVinci83 wrote:
| Thank you for this article, it confirms my direct experience. I've
| never run a benchmarking test but I can see this every day.
| amaks wrote:
| The link is broken?
| lucb1e wrote:
| Works for me using Firefox in Germany, although the article
| doesn't really match the title so maybe that's why you were
| confused? :p
| danielmarkbruce wrote:
| It's meant to say "ephemeral"... right? It's hard to read after
| that.
| datalopers wrote:
| ephemeral and ethereal are commonly confused words.
| dublin wrote:
| Ephemerides really throws them. (And thank God for PyEphem,
| which makes all that otherwise quite fiddly stuff really
| easy...)
| danielmarkbruce wrote:
| I guess that's fair. It's sort of a smell when someone uses
| the wrong word (especially in writing) though. It suggests
| they aren't in industry, throwing ideas around with other
| folks. The word "ephemeral" is used extensively amongst
| software engineers.
| dark-star wrote:
| I wonder why someone would equate "instance launch time" with
| "reliability"... I won't go as far as calling it "clickbait" but
| wouldn't some other noun ("startup performance is wildly
| different") have made more sense?
| santoshalper wrote:
| I won't go so far as saying "you didn't read the article", but
| I think you missed something.
| xmonkee wrote:
| GCP also had 84 errors compared to 1 for AWS
| danielmarkbruce wrote:
| If not a 4xx, what should they return for instance not
| available?
| eurasiantiger wrote:
| 503 service unavailable?
| sn0wf1re wrote:
| That would be confusing. The HTTP response code should
| not be conflated with the application's state.
| dheera wrote:
| Using HTTP error codes for non-REST things is cringe.
|
| 503 would mean the IaaS API calls themselves are
| unavailable. Very different from the API working
| perfectly fine but the instances not being available.
| sheeshkebab wrote:
| Maybe 1 reported. Not saying AWS reliability is bad, but the
| number of glitches that crop up in various AWS services and
| are not reflected on their status page is quite high.
| theamk wrote:
| That was measured from API call return codes, not by
| looking at the overall service status page.
|
| Amazon is pretty good about this: if their API says a
| machine is ready, it usually is.
| mcqueenjordan wrote:
| Errors returned from APIs and the status page are
| completely separate topics in this context.
| mikewave wrote:
| Well, if your system elastically uses GPU compute and needs to
| be able to spin up, run compute on a GPU, and spin down in a
| predictable amount of time to provide reasonable UX, launch
| time would definitely be a factor in terms of customer-
| perceived reliability.
| rco8786 wrote:
| Sure, but not anywhere remotely near clearing the bar for
| simply calling that "reliability".
| VWWHFSfQ wrote:
| I would still call it "reliability".
|
| If the instance takes too long to launch then it doesn't
| matter if it's "reliable" once it's running. It took too
| long to even get started.
| rco8786 wrote:
| Why would you not call it "startup performance"?
|
| Calling this reliability is like saying a Ford is more
| reliable than a Chevy because the Ford has a better
| throttle response.
| endisneigh wrote:
| that's not what reliability means
| VWWHFSfQ wrote:
| > that's not what reliability means
|
| What is your definition of reliability?
| endisneigh wrote:
| unfortunately cloud computing and marketing have
| conflated reliability, availability and fault tolerance
| so it's hard to give you a definition everyone would
| agree to, but in general I'd say reliability is referring
| to your ability to use the system without errors or
| significant decreases in throughput, such that it's not
| usable for the stated purpose.
|
| in other words, reliability is that it does what you
| expect it to. GCP does not have any particular guarantees
| around being able to spin up VMs fast, so its inability
| to do so wouldn't make it unreliable. it would be like me
| saying that you're unreliable for not doing something
| when you never said you were going to.
|
| if this were comparing Lambda vs Cloud Functions, who
| both have stated SLAs around cold start times, and there
| were significant discrepancies, sure.
| pas wrote:
| true, the grammar and semantics work out, but reliability needs
| a target, and it's usually a serious design flaw to rely on
| something that never demonstrably worked the way your
| reliability target assumes.
|
| so that's why in engineering it's not really used as such (as
| far as I understand, at least).
| somat wrote:
| It is not reliably running the machine but reliably getting
| the machine.
|
| Like the article said, the promise of the cloud is that you
| can easily get machines when you need them. A cloud that
| sometimes does not get you that machine (or does not get you
| that machine in time) is a less reliable cloud than one that
| does.
| [deleted]
| Art9681 wrote:
| Why would you scale to zero in high perf compute? Wouldn't it
| be wise to have a buffer of instances ready to pick up
| workloads instantly? I get that it shouldnt be necessary with
| a reliable and performant backend, and that the cost of
| having some instances waiting for job can be substantial
| depending on how you do it, but I wonder if the cost
| difference between AWS and GCP would make up for that and you
| can get an equivalent amount of performance for an equivalent
| price? I'm not sure. I'd like to know though.
| thwayunion wrote:
| _> Why would you scale to zero in high perf compute?_
|
| Midnight - 6am is six hours. The on-demand price for a G5
| is $1/hr. That's over $2K/yr, or "an extra week of skiing
| paid for by your B2B side project that almost never has
| customers from ~9pm west coast to ~6am east coast". And I'm
| not even counting weekends. Even in that rather extreme
| case there's a real business case.
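| The back-of-the-envelope math for that (the $1/hr G5 figure is
| the ballpark above; exact pricing varies by region):
|
|     hourly_rate = 1.00      # USD, on-demand G5 ballpark
|     idle_hours_per_day = 6  # midnight to 6am, no customers
|     savings = hourly_rate * idle_hours_per_day * 365
|     print(f"${savings:,.0f}/yr saved by scaling to zero")
|     # -> $2,190/yr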
|
| But that's sort of a silly edge case. The real savings are
| in predictable startup times for bursty workloads. Fast
| and low variance startup times unlock a huge amount of
| savings. Without both speed and predictability, you have to
| plan to fail and over-allocate. Which can get really
| expensive fast.
| diroussel wrote:
| Scaling to zero means zero cost when there is zero work. If
| you have a buffer pool, how long do you keep it populated
| when you have no work?
|
| Maintaining a buffer pool is hard. You need to maintain
| state, have a prediction function, track usage through
| time, etc. Just spinning up new nodes for new work is
| substantially easier.
|
| And the author said he could spin up new nodes in 15
| seconds, that's pretty quick.
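| A naive sketch of the buffer-pool bookkeeping mentioned above,
| just to illustrate why "spin up on demand" is simpler; the
| sizing heuristic here is made up:
|
|     import statistics
|
|     def desired_warm_pool(reqs_per_min, startup_seconds):
|         # Hold enough warm nodes to absorb a typical burst for
|         # as long as a cold start would take.
|         if not reqs_per_min:
|             return 0
|         typical = statistics.median(reqs_per_min)
|         burst = max(reqs_per_min) - typical
|         return max(0, round(burst * startup_seconds / 60))
|
|     # With ~15s startups the buffer barely matters; with slow,
|     # unpredictable startups it has to be much bigger.
|     print(desired_warm_pool([2, 3, 2, 10, 3], 15))   # -> 2
|     print(desired_warm_pool([2, 3, 2, 10, 3], 200))  # -> 23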
| iLoveOncall wrote:
| It is clickbait, the real title should be "AWS vs. GCP on-
| demand provisioning of GPU resources performance is wildly
| different".
|
| That said, while I agree that launch time and provisioning
| error rate are not sufficient to define reliability, they are
| definitely a part of it.
| [deleted]
| lelandfe wrote:
| > wildly different
|
| For this, I'd prefer a title that lets me draw my own
| conclusions. 84 errors out of 3000 doesn't sound awful to
| me...? But what do I know - maybe just give me the data:
|
| "1 in 3000 GPUs fail to spawn on AWS. GCP: 84"
|
| "Time to provision GPU with AWS: 11.4s. GCP: 42.6s"
|
| "GCP >4x avg. time to provision GPU than AWS"
|
| "Provisioning on GCP both slower and more error-prone than
| AWS"
| rmah wrote:
| They are talking about the reliability of AWS vs GCP. As a user
| of both, I'd categorize predictable startup times under
| reliability because if it took more than a minute or so, we'd
| consider it broken. I suspect many others would have even
| tighter constraints.
| chrismarlow9 wrote:
| I mean if you're talking about worst case systems you assume
| everything is gone except your infra code and backups. In that
| case your instance launch time would ultimately define what
| your downtime looks like assuming all else is equal. It does
| seem a little weird to define it that way but in a strict sense
| maybe not.
___________________________________________________________________
(page generated 2022-09-21 23:00 UTC)