[HN Gopher] AWS vs. GCP reliability is wildly different
___________________________________________________________________
AWS vs. GCP reliability is wildly different
Author : icyfox
Score : 526 points
Date : 2022-09-21 20:29 UTC (1 day ago)
(HTM) web link (freeman.vc)
(TXT) w3m dump (freeman.vc)
| johndfsgdgdfg wrote:
| It's not surprising. Amazon is an amazingly customer-focused
| company. Google is a spyware company that only wants to make more
| money by invading our privacy. Of course Amazon's products will
| be better than Google's.
| user- wrote:
| I wouldn't call this reliability, which already has a loaded
| definition in the cloud world; I'd call it something along the
| lines of time-to-start or latency.
| systemvoltage wrote:
| It is, though, based on a specific definition. If X doesn't do Y
| based on Z metric with a large standard deviation and doesn't
| meet spec limits, it is not reliable as per the predefined
| tolerance T.
|     X = Compute instances
|     Y = Launch
|     Z = Time to launch
|     T = LSL (N/A), USL (10s), Std Dev (2s)
|
| Where LSL is the lower spec limit and USL is the upper spec
| limit. LSL is N/A since we don't care if the instance launches
| instantly (0 seconds).
|
| You can define T as per your requirements. Here we are ignoring
| the accuracy of the clock that measures time, assuming that the
| measurement device is infinitely accurate.
|
| If your criterion is, say, to define reliability as how fast it
| shuts down, then this article isn't relevant. The article is
| pretty narrow in testing reliability: they only care about
| launch time.
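|
| A minimal sketch of that kind of check in Python (the samples
| and limits below are made up, not the article's data):
|     import statistics
|     launch_times = [4.2, 5.1, 3.9, 6.0, 4.8]   # seconds
|     USL = 10.0       # upper spec limit: ready within 10s
|     MAX_STD = 2.0    # allowed spread across launches
|     ok = (all(t <= USL for t in launch_times)
|           and statistics.stdev(launch_times) <= MAX_STD)
|     print("reliable per tolerance T:", ok)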
| 1-6 wrote:
| This is all about cloud GPUs, I was expecting something totally
| different from the title.
| s-xyz wrote:
| Would be interested to see a comparison of Lambda functions vs
| Google's 2nd gen Cloud Functions. I think that GCP is more
| serverless-focused.
| duskwuff wrote:
| ... why does the first graph show some instances as having a
| negative launch time? Is that meant to indicate errors, or has
| GCP started preemptively launching instances to anticipate
| requests?
| tra3 wrote:
| The y axis here measures the duration it took to successfully
| spin up the box, where negative results were requests that
| timed out after 200 seconds. The results are pretty staggering.
| zaltekk wrote:
| I don't know how that value (looks like -50?) was chosen, but
| it seems to correspond to the launch failures.
| staringback wrote:
| Perhaps if you read the line directly above the graph you would
| see it was explained, and would not have to ask this question.
| zmmmmm wrote:
| > In total it scaled up about 3,000 T4 GPUs per platform
|
| > why I burned $150 on GPUs
|
| How do you rent 3000 GPUs over a period of weeks for $150? Were
| they literally requisitioning them and releasing them
| immediately? Seems like this is quite an unrealistic type of
| usage pattern, and it would depend a lot on whether the cloud
| provider optimises to hand you back the same warm instance you
| just relinquished.
|
| > GCP allows you to attach a GPU to an arbitrary VM as a hardware
| accelerator
|
| It's quite fascinating that GCP can do this. GPUs are physical
| things (!). Do they provision every single instance type in the
| data center with GPUs? That would seem very expensive.
| geysersam wrote:
| > $150
|
| Was asking myself the same question. From the pricing
| information on GCP it seems the minimum billing time is 1 minute,
| making 3000 GPUs cost $50 minimum. If this is the case then
| $150 is reasonable for the kind of usage pattern you describe.
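|
| As a rough sanity check, back-calculating from those numbers
| (not from GCP's price list):
|     launches = 3000
|     min_billed_hours = 1 / 60          # 1-minute minimum billing
|     estimated_total = 50.0             # the $50 figure above
|     implied_rate = estimated_total / (launches * min_billed_hours)
|     print(implied_rate)                # $1.00 per instance-hour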
| bushbaba wrote:
| Unlikely. More likely they put your VM on a host with a GPU
| attached, and use live migration to move workloads around for
| better resource utilization.
|
| However, live-migration can cause impact to HPC workloads.
| ZiiS wrote:
| GPUs are physical but VMs are not; I expect they just move them
| to a host with a GPU.
| NavinF wrote:
| It probably live-migrates your VM to a physical machine that
| has a GPU available.
|
| ...if there are any GPUs available in the AZ that is. I had a
| hell of a time last year moving back and forth between regions
| to grab just 1 GPU to test something. The web UI didn't have an
| "any region" option for launching VMs, so if you don't use the
| API you'll have to sit there for 20 minutes trying each
| AZ/region until you manage to grab one.
| kazinator wrote:
| > _This is particularly true for GPUs, which are uniquely
| squeezed by COVID shutdowns, POW mining, and growing deep
| learning models_
|
| Is the POW mining part true any more? Hasn't mining moved to
| dedicated hardware?
| CodesInChaos wrote:
| Bitcoin mining has used dedicated hardware for a long time. But
| I believe Ethereum mining used GPUs before the very recent
| proof-of-stake update.
| remus wrote:
| > The offerings between the two cloud vendors are also not the
| same, which might relate to their differing response times. GCP
| allows you to attach a GPU to an arbitrary VM as a hardware
| accelerator - you can separately configure quantity of the CPUs
| as needed. AWS only provisions defined VMs that have GPUs
| attached - the g4dn.x series of hardware here. Each of these
| instances are fixed in their CPU allocation, so if you want one
| particular varietal of GPU you are stuck with the associated CPU
| configuration.
|
| At a surface level, the above (from the article) seems like a
| pretty straightforward explanation? GCP gives you more
| flexibility in configuring GPU instances, with the trade-off of
| increased startup time variability.
| btgeekboy wrote:
| I wouldn't be surprised if GCP has GPUs scattered throughout
| the datacenter. If you happen to want to attach one, it has to
| find one for you to use - potentially live migrating your
| instance or someone else's so that it can connect them. It'd
| explain the massive variability between launch times.
| master_crab wrote:
| Yeah that was my thought too when I first read the blurb.
|
| It's neat...but like a lot of things in large-scale
| operations, the devil is in the details. GPU-CPU
| communication is a low-latency, high-bandwidth operation. Not
| something you can trivially do over standard TCP. GCP
| offering something like that without the ability to
| flawlessly migrate the VM or procure enough "local" GPUs
| means it's just vaporware.
|
| As a side note, I'm surprised the author didn't note the
| number of ICEs (insufficient capacity errors) AWS throws
| whenever you spin up a G-type instance. AWS is notorious for
| offering very few G's and P's in certain AZs and regions.
| my123 wrote:
| Fungible is selling a GPU decoupling solution via PCIe
| encapsulated over Ethernet today, so it can certainly be
| done.
|
| And NVIDIA's vGPU solutions do support live migration of
| GPUs to another host (in which case the vGPU gets moved
| too, to a GPU on that target).
| dubcee349 wrote:
| I doubt it would be setup like that. Compute is usually
| deployed as part of a large set of servers. The reason for
| that is different compute workloads require different uplink
| capacity.You don't need a petabyte of uplink capacity for
| many GPU loads but you may for compute. Just switching ASICs
| are much more expensive for 400G+ than 100G. That hasn't even
| got into the optics, NICs and other things. You don't mix and
| match compute across the same place in the data center
| traditionally.
| ryukoposting wrote:
| I've only ever used AWS for this stuff. When the author said
| that you could just "add a GPU" to an existing instance, my
| first reaction was "wow, that sounds like it would be really
| complicated behind the scenes."
| dekhn wrote:
| What would you expect? AWS is an org dedicated to giving
| customers what they want and charging them for it, while GCP is
| an org dedicated to telling customers what they want and using
| the revenue to get slightly better cost margins on Intel servers.
| dilyevsky wrote:
| I don't believe this reasoning has applied since at least
| Diane Greene's tenure.
| dekhn wrote:
| I haven't seen any real change from Google about how they
| approach cloud in the past decade (first as an employee and
| developer of cloud services there, and now as a customer).
| Their sales people have hollow eyes
| DangitBobby wrote:
| They don't really tell us what we want, we just buy what we
| need. Might work for you.
| jupp0r wrote:
| That's interesting but not what I expected when I read
| "reliability". I would have expected SLO metrics like uptime of
| the network or similar metrics that users would care about more.
| Usually when scaling a well-built system you don't have hard,
| short deadlines on how fast an instance needs to be spun up. If
| you are unable to spin up any, that can be problematic of
| course. Ideally this is all automated so nobody would care much
| about whether it takes a retry or 30s longer to create an
| instance. If this is important to you, you have other problems.
| TheMagicHorsey wrote:
| This is not reliability. This is a measure of how much spare
| capacity AWS seems to be leaving idle for you to snatch on-
| demand.
|
| This is going to vary a lot based on the time of year. Why don't
| you try this same experiment at around some time when there's a
| lot of retail sales activity (Black Friday), and watch AWS
| suddenly have much less capacity to dole out on-demand.
|
| To me reliability is a measure of what a cloud does compared to
| what it says it will do. GCP is not promising you on-demand
| instances instantaneously, is it? If you want that... reserve
| capacity.
| playingalong wrote:
| This is great.
|
| I have always felt there is so little independent content on
| benchmarking the IaaS providers. There is so much you can
| measure about how they behave.
| lacker wrote:
| Anecdotally I tend to agree with the author. But this really
| isn't a great way of comparing cloud services.
|
| The fundamental problem with cloud reliability is that it depends
| on a lot of stuff that's out of your control, that you have no
| visibility into. I have had services running happily on AWS with
| no errors, and the next month without changing anything they fail
| all the time.
|
| Why? Well, we look into it and it turns out AWS changed something
| behind the scenes. There's different underlying hardware behind
| the instance, or some resource started being in high demand
| because of other customers.
|
| So, I completely believe that at the time of this test, this
| particular API was performing a lot better on AWS than on GCP.
| But I wouldn't count on it still performing this way a month
| later. Cloud services aren't like a piece of dedicated hardware
| where you test it one month, and then the next month it behaves
| roughly the same. They are changing a lot of stuff that you can't
| see.
| citizenpaul wrote:
| That was my thought too. People are probably pummeling GCP's GPU
| free tier right now with Stable Diffusion image generators,
| since it seems like all the free plug-and-play examples use the
| Google Python notebooks.
| RajT88 wrote:
| Instance types and regions make a big difference.
|
| Some regions and hardware generations are just busier than
| others. It may not be the same across cloud providers (although
| I suspect it is similar given the underlying market forces).
| ryukoposting wrote:
| You've just perfectly characterized why on-site infrastructure
| will always have its place.
| callalex wrote:
| You can reserve capacity on both of these services as well.
| lomkju wrote:
| Having been a high-scale AWS user with a bill of $1M+/month, and
| having now worked for 2 years at a company which uses GCP, I
| would say AWS is superior and way ahead.
|
| ** NOTE: If you're a low-scale company this won't matter to you
| **
|
| 1. GKE
|
| When you cross a certain scale, certain GKE components won't
| scale with you, and the SLOs on those components are crazy; it
| takes 15+ minutes for us to update an Ingress backed by the GKE
| ingress controller.
|
| Cloud Logging hasn't been able to keep up with our scale; it has
| been disabled for 2 years now. Last quarter we got an email from
| them asking us to enable it and try it again on our clusters; I
| still have to verify those claims, as our scale is even higher
| now.
|
| The Konnectivity agent release was really bad for us. It affected
| some components internally, and the total dev time we lost
| debugging this issue was more than 3 months. They had to disable
| the Konnectivity agent on our clusters. I had to collect TCP
| dumps and other evidence just to prove nothing was wrong on our
| end, and fight with our TAM to get a meeting with the product
| team. After 4 months they agreed and reverted our clusters to
| SSH tunnels. Initially GCP support said they couldn't do this.
| Next quarter I'll be updating the clusters; hopefully they have
| fixed this by then.
|
| 2. Support.
|
| I think AWS support was always more proactive in debugging with
| us; GCP support agents most of the time lack the expertise or
| proactiveness to debug/solve even simple cases. We pay for
| enterprise support and don't see ourselves getting much from it.
| At AWS we had reviews of the infra, and how we could improve it,
| every 2 quarters; we got new suggestions, and it was also when
| we shared what we would like to see on their roadmap.
|
| 3. Enterprisyness is missing from the design
|
| Something as simple as Cloud Build doesn't have access to static
| IPs. We have to maintain a forward proxy just because of this.
|
| L4 LBs were a mess: you could only use specified ports in an (L4
| LB) TCP proxy. For a TCP-proxy-based load balancer, the allowed
| set of ports was [25, 43, 110, 143, 195, 443, 465, 587, 700,
| 993, 995, 1883, 3389, 5222, 5432, 5671, 5672, 5900, 5901, 6379,
| 8085, 8099, 9092, 9200, and 9300]. Today I see they have removed
| these restrictions. I don't know who came up with the idea of
| allowing only a few ports on an L4 LB. I think such design
| decisions make it less enterprisy.
| endisneigh wrote:
| this doesn't really seem like a fair comparison, nor is it a
| measure of "reliability".
| daneel_w wrote:
| It seems entirely fair to me, but the term "reliability" has a
| few different angles. This time it's not about working or not
| working, but the ability to auto-scale by invoking resources on
| the spot, which can be a very real requirement.
| endisneigh wrote:
| unless you're willing to burn $150 a quarter doing this exact
| assessment, it tells you nothing other than the data center
| conditions at the time of running.
|
| it would be like doing this in us-central1 when us-central1
| is down for one provider, and not another, resulting in
| increased latency, and saying how much faster one is than the
| other.
|
| unlike say a throughput test or similar, neither of these
| services promise particular cold-starts, and so the numbers
| here cannot be contextualized against any metric given by
| either company and so are only useful in the sense that they
| can be compared, but since there are no guarantees the
| positions could switch anytime.
|
| that's why I like comparisons between serverless functions
| where there are pretty explicit SLAs and what not given by
| each company for you to compare against, as well as one
| another.
| daneel_w wrote:
| Given the stark contrast and that the pattern was identical
| every day over a two-week course, it tells me we're
| observing a fundamental systemic difference between GCP and
| AWS - and I think that's all the author really wanted to
| point out. I would not be surprised if the results are
| replicable three months from now.
| Animats wrote:
| > GCP allows you to attach a GPU to an arbitrary VM as a hardware
| accelerator - you can separately configure quantity of the CPUs
| as needed.
|
| That would seem to indicate that asking for a VM on GCP gets you
| a minimally configured VM on basic hardware, and then it gets
| migrated to something bigger if you ask for more resources. Is
| that correct?
|
| That could make sense if, much of the time, users get a VM and
| spend a lot of time loading and initializing stuff, then migrate
| to bigger hardware to crunch.
| zylent wrote:
| This is not quite true - GPUs are limited to select VM types,
| and the number of GPUs you have influences the maximum number
| of cores you can get. In general they're only available on the
| N1 instances (except the A100s, but those are far less
| popular).
| AtNightWeCode wrote:
| This benchmark (too) is probably incorrect. It produces 409s, so
| there are errors in there that I doubt are caused by GCP.
| humanfromearth wrote:
| We have constant autoscaling issues because of this in GCP - glad
| someone plotted this - hope people in GCP will pay a bit more
| attention to this. Thanks to the OP!
| runeks wrote:
| > These differences are so extreme they made me double check the
| process. Are the "states" of completion different between the two
| clouds? Is an AWS "Ready" premature compared to GCP? It
| anecdotally appears not; I was able to ssh into an instance right
| after AWS became ready, and it took as long as GCP indicated
| before I was able to login to one of theirs.
|
| This is a good point and should be part of the test: after
| launching, SSH into the machine and run a trivial task to confirm
| that the hardware works.
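|
| A minimal sketch of such a check, timing how long it takes until
| a trivial command succeeds over SSH (the host and key path are
| placeholders):
|     import subprocess, time
|     def seconds_until_ssh_ready(host, key, timeout=300):
|         start = time.monotonic()
|         while time.monotonic() - start < timeout:
|             r = subprocess.run(
|                 ["ssh", "-i", key, "-o", "StrictHostKeyChecking=no",
|                  "-o", "ConnectTimeout=5", host, "true"],
|                 capture_output=True)
|             if r.returncode == 0:          # trivial task succeeded
|                 return time.monotonic() - start
|             time.sleep(2)                  # not ready yet, retry
|         raise TimeoutError(f"{host} not reachable in {timeout}s")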
| kccqzy wrote:
| Heard from a Googler that the internal infrastructure (Borg) is
| simply not optimized for quick startup. Launching a new Borg job
| often takes multiple minutes before the job runs. Not surprising
| at all.
| dekhn wrote:
| A well-configured isolated borg cluster and well-configured job
| can be really fast. If there's no preemption (i.e., no other job
| that gets kicked off and gets some grace period), the packages
| are already cached locally, and there is no undue load on the
| scheduler, the resources are available, and it's a job with
| tasks, rather than multiple jobs, it will be close to
| instantaneous.
|
| I spent a significant fraction of my 11+ years there clicking
| Reload on my job's borg page. I was able to (re-)start ~100K
| jobs globally in about 15 minutes.
| fragmede wrote:
| Psh, _someone's_ bragging about not being at batch priority.
| dekhn wrote:
| I ran at -1
| dekhn wrote:
| booting VMs != starting a borg job.
| kccqzy wrote:
| The technology may be different but the culture carries over.
| People simply don't have the habit to optimize for startup
| time.
| readams wrote:
| Borg is not used for gcp vms, though.
| dilyevsky wrote:
| It is used but borg scheduler does not manage vm startups
| epberry wrote:
| Echoing this. The SRE book is also highly revealing about how
| Google request prioritization works. https://sre.google/sre-
| book/load-balancing-datacenter/
|
| My personal opinion is that Google's resources are more tightly
| optimized than AWS's, and they may try to find the 99% best
| allocation versus the 95% best allocation on AWS... and this
| leads to more rejected requests. Open to being wrong on this.
| valleyjo wrote:
| As another comment points out, GPU resources are less common so
| it takes longer to create, which makes sense. In general, start
| up times are pretty quick on GCP as other comments also
| confirm.
| jsolson wrote:
| This is mostly not true in cases where resources are actually
| available (and in GCE if they're not the API rejects the VM
| outright, in general). To the extent that it is true for Borg
| when the job schedules immediately, it's largely due to package
| (~container layers, ish) loading. This is less relevant today
| (because reasons), and also mostly doesn't apply to GCE as the
| relevant packages are almost universally proactively made
| available on relevant hosts.
|
| The origin for the info that jobs take "minutes" likely
| involves jobs that were pending available resources. This is a
| valid state in Borg, but GCE has additional admission control
| mechanisms aimed at avoiding extended residency in pending.
|
| As dekhn notes, there are many factors that contribute to VM
| startup time. GPUs are their own variety of special (and, yes,
| sometimes slow), with factors that mostly don't apply to more
| pedestrian VM shapes.
| Jamie9912 wrote:
| Should probably change the title to "AWS vs GCP on-demand GPU
| launch time consistency"
| Terretta wrote:
| Yep. Author colloquially meant, can I rely on a quick start.
| MonkeyMalarky wrote:
| I would love to see the same for deploying things like a
| cloud/lambda function.
| orf wrote:
| AWS has different pools of EC2 instances depending on the
| customer, the size of the account and any reservations you may
| have.
|
| Spawning a single GPU at varying times is nothing. Try spawning
| more than one, or using spot instances, and you'll get a very
| different picture. We often run into capacity issues with GPU
| instances and even the new m6i instances at all times of the day.
|
| Very few realistic company-sized workloads need a single GPU. I
| would willingly wait 30 minutes for my instances to become
| available if it meant _all_ of them were available at the same
| time.
| herpderperator wrote:
| The author is using 'Quantile', which I hadn't heard of before;
| when I looked it up, it seemed like it should be 'Percentile'.
| Percentiles are the percentages, which is what the author is
| referring to.
| mr_toad wrote:
| Quantiles are a generic term for percentiles, deciles,
| quartiles etc. Percentiles would have been a more precise term.
| outworlder wrote:
| Unclear what the article has to do with reliability. Yes,
| spinning up machines on GCP is incredibly fast and has always
| been. AWS is decent. Azure feels like I'm starting a Boeing 747
| instead of a VM.
|
| However, there's one aspect where GCP is a _clear_ winner on the
| reliability front. They auto-migrate instances transparently and
| with close to zero impact to workloads - I want to say zero
| impact but it's not technically zero.
|
| In comparison, in AWS you need to stop/start your instance
| yourself so that it will move to another hypervisor (depending
| on the actual issue AWS may do it for you). That definitely has
| an impact on your workloads. We can sometimes architect around
| it but there's still something to worry about. Given the number
| of instances we run, we have multiple machines to deal with
| weekly. We get all these 'scheduled maintenance' events (which
| sometimes aren't really all that scheduled), with some instance
| IDs (they don't even bother sending the name tag), and we have
| to deal with that.
|
| I already thought stop/start was an improvement on tech at the
| time (Openstack, for example, or even VMWare) just because we
| don't have to think about hypervisors, we don't have to know, we
| don't care. We don't have to ask for migrations to be performed,
| hypervisors are pretty much stateless.
|
| However, on GCP? We had to stop/start instances exactly zero
| times, out of the thousands we run and have been running for
| years. We can see auto-migration events when we bother checking
| the logs. Otherwise, we don't even notice the migration happened.
|
| It's pretty old tech too:
|
| https://cloudplatform.googleblog.com/2015/03/Google-Compute-...
| voidfunc wrote:
| > Azure feels like I'm starting a Boeing 747 instead of a VM.
|
| Huh... interesting, this has not been my experience with Azure
| VM launch times. I'm usually surprised how quickly they pop up.
| jiggawatts wrote:
| Depends on your disks.
|
| Premium SSD allows 30 minutes of "burst" IOPS, which can
| bring down boot times to about 2-5 seconds for a typical
| Windows VM. The provisioning time is a further 60-180 seconds
| on top. (The fastest I could get it is about 40 seconds using
| a "smalldisk" image to ephemeral storage, but then it took a
| further 30 seconds or so for the VM to become available.)
|
| Standard HDD was slow enough that the boot phase alone would
| take minutes, and then the VM provisioning time is almost
| irrelevant in comparison.
| jcheng wrote:
| > Yes, spinning up machines on GCP is incredibly fast and has
| always been. AWS is decent.
|
| FWIW this article is saying the opposite--it's AWS that beats
| GCP in startup speed.
| valleyjo wrote:
| This article states that GPU instances are slower on GCP - it
| doesn't make any claims about non-GPU instances.
| yolovoe wrote:
| EC2 live migrates instances too. Not sure where we are with
| rollout across the fleet.
|
| The reason, from what I understand, why GCP does live migration
| more is because ec2 focused on live updates instead of live
| migration. Whereas GCP migrates instances to update servers,
| ec2 live updates everything down to firmware while instances
| are running.
|
| Curious, what instance types are you using on EC2 that you see
| so much maintenance?
| willcipriano wrote:
| I always wondered why you couldn't do that on AWS, mainly
| because I could do it at home with Hyper-V a decade ago.
|
| https://learn.microsoft.com/en-us/previous-versions/windows/...
| politelemon wrote:
| A few weeks ago I needed to change the volume type on an EC2
| instance to gp3. Following the instructions, the change happened
| while the instance was running. I didn't need to reboot or stop
| the instance, it just changed the type. While the instance was
| running.
|
| I didn't understand how they were able to do this, I had thought
| volume types mapped to hardware clusters of some kind. And since
| I didn't understand, I wasn't able to distinguish it from magic.
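|
| For reference, a minimal boto3 sketch of that online change (the
| volume ID is a placeholder); the volume then passes through a
| 'modifying'/'optimizing' state while the instance keeps running:
|     import boto3
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|     # change the type in place; no stop or reboot needed
|     ec2.modify_volume(VolumeId="vol-0123456789abcdef0",
|                       VolumeType="gp3")
|     mods = ec2.describe_volumes_modifications(
|         VolumeIds=["vol-0123456789abcdef0"])
|     print(mods["VolumesModifications"][0]["ModificationState"])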
| osti wrote:
| Look up AWS Nitro on YouTube if you are interested in learning
| more about it.
| ArchOversight wrote:
| Changing the volume type on AWS is somewhat magical. Seeing it
| happen online was amazing.
| cavisne wrote:
| EBS is already replicated so they probably just migrate behind
| the scenes, same as if the original physical disk was
| corrupted. It looks like only certain conditions allow this
| kind of migration.
|
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/modify-v...
| Salgat wrote:
| If I remember right they use the equivalent of a ledger of
| changes to manage volume state. So in this case, they copy over
| the contents (up to a certain point in time) to the new faster
| virtual volume, then append and direct all new changes to the
| new volume.
|
| This is also how they are able to snapshot a volume at a
| certain point in time without having any downtime or data
| inconsistencies.
| xyzzyz wrote:
| Dunno about AWS, but GCP uses live migration, and will migrate
| your VM across physical machines as necessary. The disk volumes
| are all connected over the network, nothing really depends on
| the actual physical machine your VM is run on.
| lbhdc wrote:
| How does migrating a vm to another physical machine work?
| the_duke wrote:
| This blog post is pretty old (2015) but gives a good
| introduction.
|
| https://cloudplatform.googleblog.com/2015/03/Google-
| Compute-...
| lbhdc wrote:
| Thanks for sharing, I will give it a read!
| rejectfinite wrote:
| vsphere vmotion has been a thing for years lmao
| roomey wrote:
| VMware has been doing this for years, it's called vmotion
| and there is a lot of documentation about it if you are
| interested (eg https://www.thegeekpub.com/8407/how-vmotion-
| works/ )
|
| Essentially, memory state is copied to the new host, the VM
| is stunned for a millisecond, and the CPU state is copied
| and resumed on the new host (you may see a dropped ping).
| All the networking and storage is virtual anyway, so that is
| "moved" (it's not really moved) in the background.
| davidro80 wrote:
| lbhdc wrote:
| That is really interesting I didn't realize it was so
| fast. Thanks for the post I will give it a read!
| politelemon wrote:
| > VM is stunned for a millisecond
|
| This conjures up hilarious mental imagery, thanks
| kevincox wrote:
| You just bop it on the head, and move it to the new
| machine quickly. By the time the VM comes to it won't
| even realize that it is in a new home.
| jiggawatts wrote:
| The clever trick here is that they'll pre-copy most of
| the memory without bothering to do it consistently, but
| mark pages that the source had written to as "dirty". The
| network cutover is stop-the-world, but VMware _doesn't_
| copy the dirty pages during the stop. Instead, it simply
| treats them as "swapped to pagefile", where the pagefile
| is actually the source machine memory. When computation
| resumes at the target, the source is used to page memory
| back in on-demand. This allows very fast cutovers.
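|
| A toy sketch of that pre-copy loop (nothing to do with VMware's
| actual implementation; ToyVM and its dict-of-pages memory are
| invented purely for illustration):
|     import random
|     class ToyVM:
|         def __init__(self, pages):
|             self.memory = dict(pages)   # page_id -> bytes
|             self.dirty = set()
|         def touch(self, n=3):           # guest writes a few pages
|             for p in random.sample(list(self.memory), n):
|                 self.memory[p] = b"new"
|                 self.dirty.add(p)
|     def live_migrate(src, dst, final_delta=4):
|         pending = set(src.memory)       # first pass copies everything
|         while len(pending) > final_delta:
|             src.dirty.clear()
|             for p in pending:
|                 dst.memory[p] = src.memory[p]
|             src.touch()                 # guest keeps running meanwhile
|             pending = set(src.dirty)    # re-copy only dirtied pages
|         for p in pending:               # stop-the-world: tiny delta
|             dst.memory[p] = src.memory[p]
|     live_migrate(ToyVM({i: b"old" for i in range(64)}), ToyVM({}))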
| mh- wrote:
| Up to 500ms per your source, depending on how much churn
| there is in the memory from the source system.
|
| Very cool.
| valleyjo wrote:
| Stream the contents of RAM from source to dest, pause the
| source, reprogram the network and copy any memory that
| changed since the initial stream, resume the dest, destroy
| the source, profit.
| pclmulqdq wrote:
| They pause your VM, copy everything about its state over to
| the new machine, and quickly start the other instance. It's
| pretty clever. I think there are tricks you can play with
| machines that have large memory footprints to copy most of
| it before the pause, and only copy what has changed since
| then during the pause.
|
| The disks are all on the network, so no need to move
| anything there.
| prmoustache wrote:
| In reality it syncs the memory first to the other host and
| only pauses the VM when the last state sync is small
| enough to be so quick that the pause is barely measurable.
| lbhdc wrote:
| When it's transferring the state to the target, how does
| it handle memory updates that are happening at that time?
| Is the program's execution paused at that point?
| outworlder wrote:
| No, they keep track of dirty pages.
| GauntletWizard wrote:
| No, but the memory accesses have hooks that say "This
| memory was written". Then, program execution _is_ paused,
| and just the sections of memory that were written are
| copied again.
|
| This has memory performance characteristics - I ran a
| benchmark of memory read/write speed while this was
| happening once. It more than halved memory speed for the
| 30s or so it took from migration start to migration
| completion. The pause, too, was much longer.
| lbhdc wrote:
| Ahh I think that was the piece I was missing, thanks! I
| didn't realize there were hooks for tracking memory
| changes.
| water-your-self wrote:
| Indiana Jones and the register states
| valleyjo wrote:
| Azure, AWS and GCP all have live migration. VMWare has it
| too.
| dilyevsky wrote:
| EC2 does not have live migration. On Azure it's spotty, so
| not every maintenance event can offer it.
| [deleted]
| ta20200710 wrote:
| EC2 does support live migration, but it's not public and
| only for certain instance types/hypervisors.
|
| See: https://news.ycombinator.com/item?id=17815806
| _msw_ wrote:
| Here's a comment that I made in a past thread.
|
| https://news.ycombinator.com/item?id=26650082
| dilyevsky wrote:
| My experience running c5/6 instances makes me very
| confident ec2 doesn't do live migration for these. Fwiw
| gcp live migration on latency sensitive workloads is very
| noticeable and often time straight up causes instance
| crash
| dwmw2 wrote:
| Intrigued by this observation. What is it about your
| experience that leads you to conclude that EC2 doesn't do
| live migration?
|
| And could it be phrased differently as "EC2 doesn't do
| live migration _badly_"?
| dilyevsky wrote:
| Mainly the barrage of "instance hardware degradation"
| emails that I get, whereas on GCP those are just migrated
| (sometimes with a reboot/crash). Also there is no
| brownout. I've never used t2/t3s, which apparently do
| support migration, which would make sense.
| my123 wrote:
| After some kinds of hardware failure, it can become
| impossible to do live migration safely. When a crash can
| ensue due to a live migration from faulty HW, I'd argue
| that it's much better to not attempt it.
| free652 wrote:
| Are you sure? Because AWS consistently requires me to
| migrate to a different host. They go as far as shutting
| down instances, but don't do any kind of live migration.
| outworlder wrote:
| Not really. Or at least not in the same league.
|
| AWS doesn't have live migration at all. You have to
| stop/start.
|
| Azure technically does, but it doesn't always work (they say
| 90%). 30 seconds is a long time.
|
| VMWare has live migration (and seems to be the closest to
| what GCP does) but it is still an inferior user experience.
|
| This is the key thing you are missing - GCP not only has
| live migration, but it is completely transparent. We do not
| have to initiate migration. GCP does, transparently, 100%
| of the time. We have never even noticed migrations, even when
| we were actively watching those instances. We don't know or
| care what hypervisors are involved. They even preserve the
| network connections.
|
| https://cloudplatform.googleblog.com/2015/03/Google-
| Compute-...
| jiggawatts wrote:
| VMware's live migration is totally seamless, so I don't
| know what you mean by "inferior user experience". You
| typically see less than a second of packet loss, and a
| small performance hit for about a minute while the memory
| is "swapped" across to the new machine. Similarly, VMware
| has had live storage migration for years.
|
| VMware is lightyears ahead of the big clouds, but
| unfortunately they "missed the boat" on the public cloud,
| despite having superior foundational technology.
|
| For example:
|
| - A typical vSphere cluster would use live migration to
| balance workloads dynamically. You don't notice this as
| an end user, but it allows them to bin-pack workloads up
| above 80% CPU utilisation in my experience with good
| results. (Especially if you allocate priorities, min/max
| limits, etc...)
|
| - You can version-upgrade a vSphere cluster live. This
| includes rolling hypervisor kernel upgrades and _live
| disk format changes_. The upgrade wizard is a fantastic
| thing that asks only for the cluster controller name and
| login details! Click "OK" and watch the progress bar.
|
| - Flexible keep-apart and keep-together rules that can be
| updated at any time, and will take effect via live
| migration. This is sort-of like the Kubernetes "control
| loops", but the migrations are live and memory-preserving
| instead of stop-start like with containers.
|
| - Online changes to virtual hardware, including adding
| not just NICs and disks, but also CPU and memory!
|
| - Thin-provisioned disks, and memory deduplication for
| efficiencies approaching that of containerisation.
|
| - Flexible snapshots, including the ability for "thin
| provisioned" virtual machines to share a base snapshot.
| This is often used for virtual desktops or terminal
| services, and again this approaches containerisation in
| terms of cloning speed and storage efficiency.
|
| In other words, VMware had all of the pieces, and just...
| didn't... use it to make a public cloud. We could have
| had "cloud.vmware.com" or whatever 15 years ago, but they
| decided to slowly jack up the price on their enterprise
| customers instead.
|
| For comparison, in Azure: You can't add a VM to an
| availability set (keep apart rule) or remove the VM from
| it without a stop-start cycle. You can't make most
| changes (SKU, etc...) to a VM in an availability set
| without turning off _every_ machine in the same AS! This
| is just one example of many where the public cloud has a
| "checkbox" availability feature that actually _decreases_
| availability. For a long time, changing an IP address in
| AWS required the VM to be basically blown away and
| recreated. That brought back memories of the Windows NT 4
| days in the 1990s when an IP change required a reboot cycle.
| sofixa wrote:
| So, I used to be a part-time vSphere admin, worked with
| many others, and had to automate the hell out of it to
| deal as little as possible with that dumpster fire.
|
| No, VMware didn't miss the boat, vCloud Air was announced
| in 2009 and made generally available in 2013. Roughly
| same timelines as Azure and GCP, slightly trailing AWS,
| and those were the early days, where the public cloud was
| still exotic. And VMware had the massive advantage of
| brand recognition in that domain and existing footprint
| with enterprises which could be scaled out.
|
| Problem was, vCloud Air, like vSphere, was shit. Yeah, it
| did some things well, and had some very nice features -
| vMotion, DRS (though it doesn't really use CPU ready
| contention for scheduling decisions which is stupid),
| vSAN, hot adding resources (but not RAM, because decades
| ago Linux had issues if you had less than 4GB RAM and you
| added more, so to this day you can't do that). That is,
| when they worked - because when they didn't, good luck:
| error messages are useless, and logs are weirdly
| structured and uselessly verbose, so a massive pain to
| deal with. Oh and many of those features were either
| behind a Flash UI (FFS), or an abomination of an API that
| is inconsistent ("this object might have been deleted or
| hasn't been created yet") and had weird limitations, like
| when you have an async task you can't check its status
| details.
| And many of those features were so complex, that a random
| consuming user basically had to rely on a dedicated team
| of vExperts, which often resulted in a nice silo slowing
| everyone down.
|
| Their hardware compatibility list was a joke - the Intel
| X710 NIC stayed on it for more than a year with a widely
| known terribly broken driver.
|
| But what made VMware fail the most, IMHO, was the wrong
| focus, technically - VM, instead of application. A
| developer/ops person couldn't care less about the object
| of a VM. Of course they tried some things like vApp and
| vCloud Director etc. which are just disgusting
| abominations designed with a PowerPoint in mind, not a
| user. And pricing: opaque and expensive, with bad
| usability. No wonder everyone jumped on the pay-as-you-go,
| usable alternatives.
| w0m wrote:
| > many of those features were either behind a Flash
| UI(FFS)
|
| My introduction to the industry. The memories.
| ithkuil wrote:
| You're right to say that VMware has the right fundamental
| building blocks and that they are mature enough
| (especially the compute aspect).
|
| But I think you underestimate the maturity and
| effectiveness of the underlying google compute and
| storage substrate.
|
| (FWIW, I worked at both places)
|
| Now, how Google's substrate maps onto GCP, that's
| another story. There is a non-trivial amount of fluff to
| be added on top of your building blocks to build a
| manageable multitenant planet-scale cloud service. Just
| the network infrastructure is mind-boggling.
|
| I wouldn't be surprised if your experience with a "VMware
| cloud" surprised you if you naively compared it with
| your experience with a standalone vSphere cluster.
| _msw_ wrote:
| Disclosure: I work for Amazon, and in the past I worked
| directly on EC2.
|
| From the FAQ: https://aws.amazon.com/ec2/faqs/
|
| Q: How does EC2 perform maintenance?
|
| AWS regularly performs routine hardware, power, and network
| maintenance with minimal disruption across all EC2 instance
| types. To achieve this we employ a combination of tools and
| methods across the entire AWS Global infrastructure, such as
| redundant and concurrently maintainable systems, as well as
| live system updates and migration.
| darkwater wrote:
| > AWS regularly performs routine hardware, power, and
| network maintenance with minimal disruption across all EC2
| instance types. To achieve this we employ a combination of
| tools and methods across the entire AWS Global
| infrastructure, such as redundant and concurrently
| maintainable systems, as well as live system updates and
| migration.
|
| And yet, I keep getting emails like this almost every
| week:
|
| "EC2 has detected degradation of the underlying hardware
| hosting your Amazon EC2 instance (instance-ID: i-xxxxxxx)
| associated with your AWS account (AWS Account ID: NNNNN) in
| the eu-west-1 region. Due to this degradation your instance
| could already be unreachable. We will stop your instance
| after 2022-09-21 16:00:00 UTC"
|
| And we don't have tens of thousands of VMs in that region,
| just around 1k.
| _msw_ wrote:
| Live migration can't be used to address every type of
| maintenance or underlying fault in a non-disruptive way.
| voiper1 wrote:
| Linode also uses live migrations now for most (all?)
| maintenance.
| shrubble wrote:
| Assuming this blurb is accurate: " General-purpose SSD volume
| (gp3) provides the consistent 125 MiB/s throughput and 3000
| IOPS within the price of provisioned storage. Additional IOPS
| (up to 16,000) and throughput (1000 MiB/s) can be provisioned
| with an additional price. The General-purpose SSD volume (gp2)
| provides 3 IOPS per GiB storage provisioned with a minimum of
| 100 IOPS"
|
| ... then it seems like a device that limits bandwidth, either on
| the storage cluster or between the node and storage cluster, is
| present. 125 MiB/s is right around the speed of a 1 Gbit link, I
| believe. That it's just a networking setting changed in-switch
| doesn't seem surprising.
| nonameiguess wrote:
| This would have been my guess. All EBS volumes are stored on
| a physical disk that supports the highest bandwidth and IOPS
| you can live migrate to, and the actual rates you get are
| determined by something in the interconnect. Live migration
| is thus a matter of swapping out the interconnect between the
| VM and the disk or even just relaxing a logical rate-limiter,
| without having to migrate your data to a different disk.
| prmoustache wrote:
| The actual migration is not instantaneous despite the
| volume being immediately reported as gp3. You get a status
| change to "optimizing" if my memory is correct with a
| percentage. And the higher the volume the longer it takes
| so there is definitely a sync to faster storage.
| 0xbadcafebee wrote:
| Reliability in general is measured on the basic principle of:
| _does it function within our defined expectations?_ As long as
| it's launching, and it eventually responds within SLA/SLO limits,
| and on failure comes back within SLA/SLO limits, it is reliable.
| Even with GCP's multiple failures to launch, that may still be
| considered "reliable" within their SLA.
|
| If both AWS and GCP had the same SLA, and one did better than the
| other at starting up, you could say one is _more performant_ than
| the other, but you couldn't say it's _more reliable_ if they are
| both meeting the SLA. It's easy to look at something that never
| goes down and say "that is more reliable", but it might have been
| pure chance that it never went down. Always read the fine print,
| and don't expect anything better than what they guarantee.
| mnutt wrote:
| It may or may not matter for various use cases, but the EC2
| instances in the test use EBS and the AMIs are lazily loaded from
| S3 on boot. So it may be that the boot process touches only a
| few files and quickly gets to the 'ready' state, but you may
| have crummy performance for a while in some cases.
|
| I haven't used GCP much, but maybe they load the image onto the
| node prior to launch, accounting for some of the launch time
| difference?
| cmcconomy wrote:
| I wish Azure was here to round it out!
| londons_explore wrote:
| AWS normally has machines sitting idle just waiting for you to
| use. That's why they can get you going in a couple of seconds.
|
| GCP on the other hand fills all machines with background jobs.
| When you want a machine, they need to terminate a background job
| to make room for you. That background job has a shutdown grace
| time. Usually that's 30 seconds.
|
| Sometimes, to prevent fragmentation, they actually need to
| shuffle around many other users to give you the perfect slot -
| and some of those jobs have start-new-before-stop-old semantics -
| that's why sometimes the delay is far higher too.
| dekhn wrote:
| borg implements preemption but the delay to start VMs is not
| because they are waiting for a background task to clean up.
| devxpy wrote:
| Is this testing for spot instances?
|
| In my limited experience, persistent (on-demand) GCP instances
| always boot up much faster than AWS EC2 instances.
| marcinzm wrote:
| In my experience GPU persistent instances often simply don't
| boot up on GCP due to lack of available GPUs. One reason I
| didn't choose GCP at my last company.
| devxpy wrote:
| Oh interesting. Which region and GPU type were you working
| with? (Asking so I can avoid in future)
| marcinzm wrote:
| I think it was us-east1 or us-east4. Had issues getting
| TPUs as well in us-central1. I know someone at a larger
| tech company who was told to only run certain workflows in
| a specific niche European region as that's the only one
| that had any A100 GPUs most of the time.
| encryptluks2 wrote:
| I noticed that too, and it does appear to be using spot
| instances. I have a feeling that if it was run without them you
| may see much better startup times. Spot instances on GCP are hit
| and miss, and you sort of have to build that into your workflow.
| rwalle wrote:
| Looks like the author has never heard of the word "histogram"
|
| That graph is a pain to see.
| DangitBobby wrote:
| A histogram would take away one of the dimensions, probably
| time, unless they resorted to some weird stacked layout.
| Without time, people would complain that they don't know if it
| was consistent across the tested period. The graph is fine.
| charbull wrote:
| Can you put this in the context of the problem/use case/need
| you are solving for?
| lmeyerov wrote:
| I'd say that's a weak test of capacity. Would love to see it on
| Azure - T4s or an equivalent aren't even really provided anymore!
|
| We find reliability a different story. E.g., our main source of
| downtime on Azure is that they restart (live migrate?) our
| reserved T4s every few weeks, causing 2-10 min outages per GPU
| per month.
| DonHopkins wrote:
| Does anybody know whether on GCP the cheaper ephemeral spot
| instances are available with managed instance groups and Cloud
| Run, where more instances are spun up according to demand, and
| if so, how well they deal with replacing spot instances that
| drop dead? How about AWS?
| ajross wrote:
| Worth pointing out that the article is measuring provisioning
| latency and success rates (how quickly can you get a GPU box
| running and whether or not you get an error back from the API
| when you try), and not "reliability" as most readers would
| understand it (how likely they are to do what you want them to do
| without failure).
|
| Definitely seems like interesting info, though.
| [deleted]
| [deleted]
| rwiggins wrote:
| There were 84 errors for GCP, but the breakdown says 74 409s and
| 5 timeouts. Maybe it was 79 409s? Or 10 timeouts?
|
| I suspect the 409 conflicts are probably from the instance name
| not being unique in the test. It looks like the instance name
| used was:
|     instance_name = f"gpu-test-{int(time())}"
|
| which has a 1-second precision. The test harness appears to do a
| `sleep(1)` between test creations, but this sort of thing can
| have weird boundary cases, particularly because (1) it does
| cleanup after creation, which will have variable latency, (2)
| `int()` will truncate the fractional part of the second from
| `time()`, and (3) `time.time()` is not monotonic.
|
| I would not ask the author to spend money to test it again, but I
| think the 409s would probably disappear if you replaced
| `int(time())` with `uuid.uuid4()`.
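|
| A sketch of that change against the harness's naming line (the
| hex suffix length is arbitrary):
|     import uuid
|     # a random suffix avoids collisions between back-to-back tests
|     instance_name = f"gpu-test-{uuid.uuid4().hex[:12]}"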
|
| Disclosure: I work at Google - on Google Compute Engine. :-)
| sitharus wrote:
| This is a very good point - AWS uses tags to give instances a
| friendly name, so the name does not have to be unique. The same
| logic would not fail on AWS.
| smashed wrote:
| Which makes 2000% sense.
|
| Why would any tenant-supplied data affect anything
| whatsoever?
|
| As a tenant, unless you are clashing with another resource
| under your own name, I don't see the point of failing.
|
| AWS S3 would be an exception, where they make the limitation
| of globally unique bucket names very clear.
| [deleted]
| jsolson wrote:
| Idempotency.
|
| You're inserting a VM with a specific name. If you try to
| create the same resource twice, the GCE control plane
| reports that as a conflict.
|
| What they're doing here would be roughly equivalent to
| supplying the time to the AWS RunInstances API as an
| idempotency token.
|
| (I work on GCE, and asked an industry friend at AWS about
| how they guarantee idempotency for RunInstances).
| stevenguh wrote:
| GCP control plane is generally not idempotent.
|
| When trying to create the same resource twice, all
| request should report the same status instead one
| failing, one succeeding.
|
| In AWS, their APIs allow you to supply a client token if
| the API is not idempotent by default.
|
| See https://docs.aws.amazon.com/AWSEC2/latest/APIReferenc
| e/Run_I....
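|
| For example, with boto3 (the AMI and instance type here are
| placeholders), retrying run_instances with the same ClientToken
| returns the original instance instead of creating a duplicate:
|     import boto3, uuid
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|     token = str(uuid.uuid4())   # reuse this exact token on retries
|     resp = ec2.run_instances(
|         ImageId="ami-0123456789abcdef0",  # placeholder
|         InstanceType="g4dn.xlarge",
|         MinCount=1, MaxCount=1,
|         ClientToken=token,                # makes the call idempotent
|     )
|     print(resp["Instances"][0]["InstanceId"])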
| akramer wrote:
| The GCE API can be idempotent if you'd like. Fill out the
| requestId field with the same UUID in multiple
| instances.insert calls (or other mutation calls) and you
| will receive the same operation Id back in response.
|
| Disclaimer: I work on GCE.
| jsolson wrote:
| Today I learned! I'll admit I didn't know this
| functionality existed, and I've instead used
| instances.insert followed by querying the VM resource.
|
| This is nicer!
| jsolson wrote:
| > When trying to create the same resource twice, the second
| should report success instead of failing.
|
| Before I quibble with the idempotency point: I agree with
| this, entirely, but it is what it is and a lot of
| software has been written against the current behavior.
| So I'll cite Hyrum's law here: https://www.hyrumslaw.com/
|
| > GCP control plane is generally not idempotent.
|
| The GCE API occupies an odd space here, imo. The resource
| being created is, in practice, an operation to cause the
| named VM to exist. The operation has its own name, but
| the name of the VM in the insert operation is the name of
| the ultimate resource.
|
| Net, the API is idempotent at a macro level in terms of
| the end-to-end creation or deletion of uniquely named
| resources. Which is a long winded way of saying that
| you're right, but that from a practical perspective it
| accomplishes enough of the goals of a truly idempotent
| API to be _useful_ for avoiding the same things that the
| AWS mechanism avoids: creation of unexpected duplicate
| VMs.
|
| The more "modern" way to do this would be to have a truly
| idempotent description of the target state of the actual
| resource with a separate resource for the current live
| state, but we live with the sum of our past choices.
| Terretta wrote:
| It's a lot shorter to write:
|
| You're right, we did it wrong.
|
| // And paradoxically makes engineers like you.
| jsolson wrote:
| Sure, except I think that at a macro level we got it more
| right than AWS, despite some choices that I believe we'd
| make differently today.
| bushbaba wrote:
| Do you really need idempotency for runVM, though?
| jsolson wrote:
| I mean, it's kinda nice to know that if you reissue a
| request for an instance that could cost thousands of
| dollars per month due to a network glitch that you won't
| accidentally create two of them?
|
| More practically, though, the instance name here is
| literally the name of the instance as it appears in the
| RESTful URL used for future queries about it. The 409
| here is rejecting an attempt to create the same
| explicitly named resource twice.
| philliphaydon wrote:
| Sounds like AWS got it right.
| jsolson wrote:
| You're entitled to that takeaway, but I disagree. I
| believe GCP's tendency to use caller-supplied names for
| resources is one of the single best features of the
| platform, particularly when compared against AWS's random
| hex identifiers.
|
| Note that whether this creates collisions is entirely
| under the customer's control. There's no requirement for
| global uniqueness, just a requirement that you not try to
| create two VMs with the same name in the same project in
| the same zone.
| philliphaydon wrote:
| With GCE can you create 10 instances or do you need to
| create all 10 individually?
| jsolson wrote:
| As far as I know, the `instances.insert` API only allows
| individual VMs, although the CLI can issue a bulk set of
| API calls[0], and MIGs (see below) allow you to request
| many identical VMs with a single API call if that's for
| some reason important.
|
| You can also batch API calls[1], which also gives you a
| response for each VM in the batch while allowing for a
| single HTTP request/response.
|
| That said, if you want to create a set of effectively
| identical VMs all matching a template (i.e., cattle not
| pets), though, or you want to issue a single API call,
| we'd generally point you to managed instance groups[2]
| (which can be manually or automatically scaled up or
| down) wherein you supply an instance template and an
| instance count. The MIG is named (like nearly all GCP
| resources), as are the instances, with a name derived
| from the MIG name. After creation you can also have the
| group abandon the instances and then delete the group if
| you _really_ wanted a bunch of unmanaged VMs created
| through a single API call, although I'll admit I can't
| think of a use-case for this (the abandon API is
| generally intended for pulling VMs out of a group for
| debugging purposes or similar).
|
| For cases where, for whatever reason, you don't want a MIG
| (e.g., because your VMs don't share a common template),
| you can still group those together for monitoring
| purposes[3], although it's an after-creation operation.
|
| The MIG approach sets a _goal_ for the instance count and
| will attempt to achieve (and maintain) that goal even in
| the face of limited machine stock, hardware failures,
| etc. The top-level API will reject (stock-out) in the
| event that we're out of capacity, or in the batch/bulk
| case will start rejecting once we run out of capacity. I
| don't know how AWS's RunInstances behaves if it can only
| partially fulfill a request in a given zone.
|
| [0]: https://cloud.google.com/compute/docs/instances/mult
| iple/cre...
|
| [1]: https://cloud.google.com/compute/docs/api/how-
| tos/batch
|
| [2]: https://cloud.google.com/compute/docs/instance-
| groups
|
| [3]: https://cloud.google.com/compute/docs/instance-
| groups/creati...
| jrumbut wrote:
| > (3) `time.time()` is not monotonic.
|
| I just winced in pain thinking of the ways that can bite you. I
| guess in a cloud/virtualized environment with many short lived
| instances it isn't even that obscure an issue to run into.
|
| A nice discussion on Stack Overflow:
|
| https://stackoverflow.com/questions/64497035/is-time-from-ti...
| flutas wrote:
| > I just winced in pain thinking of the ways that can bite
| you.
|
| Something similar caused my favorite bug so far to track
| down.
|
| We were seeing odd spikes in our video playback analytics of
| some devices watching multiple years worth of video in < 1
| hour.
|
| System.currenTimeMillis() in Java isn't monotonic either is
| my short answer for what was causing it. Tracking down _what_
| was causing it was even more fun though. Devices (phones)
| were updating their system time from the network and jumping
| between timezones.
| jrumbut wrote:
| That's a bad day at the office when you have to go and say
| "hey remember all that data we painstakingly collected and
| maybe even billed clients for?"
| rkangel wrote:
| Yes. When people write `time.time()` they almost always
| actually want `time.monotonic()`.
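|
| For example, when timing elapsed durations:
|     import time
|     start = time.monotonic()   # immune to NTP steps / clock changes
|     time.sleep(0.1)            # stand-in for the work being timed
|     elapsed = time.monotonic() - start
|     print(f"{elapsed:.3f}s")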
| brianpan wrote:
| Time is difficult.
|
| Reminds me of this post on mtime which recently resurfaced on
| HN: https://apenwarr.ca/log/20181113
| mempko wrote:
| What are your thoughts on the generally slower launch times
| with a huge variance on GCP?
| kevincox wrote:
| FWIW in our use case of non-GPU instances they launched way
| faster and more consistently on GCP than AWS. So I guess it
| is complicated and may depend on exactly what instance you
| are launching.
| valleyjo wrote:
| Just remember this is for GPU instances. Other vm families
| are pretty fast to launch.
| jhugo wrote:
| At work we run some (non-GPU) instances in every AWS region,
| and there's pretty big variability over time and region for
| on-demand launch time. I'd expect it might be even higher for
| GPU instances. I suspect that a more rigorous investigation
| might find there isn't quite as big a difference overall as
| this article suggests.
| Crash0v3rid3 wrote:
| The author failed to mention which regions these tests were
| run in. GPU availability can vary depending on the regions
| that were tested, for both cloud providers.
| ayewo wrote:
| The author linked to the code at the end of the post.
|
| The regions used are "us-east-1" for AWS [1] and "us-
| central1-b" for GCP [2].
|
| 1: https://github.com/piercefreeman/cloud-gpu-
| reliability/blob/...
|
| 2: https://github.com/piercefreeman/cloud-gpu-
| reliability/blob/...
| fomine3 wrote:
| This is a big missed point.
| Cthulhu_ wrote:
| I've naively used millisecond precision things for a long time
| - not in anything critical I don't think - but I've only
| recently come to more of an awareness that a millisecond is a
| pretty long time. Recent example is that I used a timestamp to
| version a record in a database, but it's feasible that in a Go
| application, a record could feasilby be mutated multiple times
| a millisecond by different users / processes / requests.
|
| Unfortunately, millisecond-precise timestamps proved to be a
| bit tricky in combination with sqlite.
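|
| One common way to sidestep both issues (clock precision and
| non-monotonic clocks) is a per-row integer version that the
| database bumps atomically. A minimal sqlite3 sketch with a
| hypothetical docs table:
|        import sqlite3
|
|        db = sqlite3.connect(":memory:")
|        db.execute("""CREATE TABLE docs (
|            id INTEGER PRIMARY KEY,
|            body TEXT,
|            version INTEGER NOT NULL DEFAULT 0)""")
|        db.execute("INSERT INTO docs (id, body) VALUES (1, 'v0')")
|        # Optimistic concurrency: the UPDATE only applies if the
|        # caller still holds the latest version, and the bump is
|        # atomic, so two writes in the same millisecond can never
|        # share a version number.
|        cur = db.execute(
|            "UPDATE docs SET body = ?, version = version + 1 "
|            "WHERE id = ? AND version = ?", ("v1", 1, 0))
|        if cur.rowcount == 0:
|            raise RuntimeError("stale write: reload and retry")
|        db.commit()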
| okdood64 wrote:
| Hope icyfox can try running this with a fix.
| PigiVinci83 wrote:
| Thank you for this article, it confirms my direct experience.
| I've never run a benchmarking test, but I can see this every
| day.
| amaks wrote:
| The link is broken?
| lucb1e wrote:
| Works for me using Firefox in Germany, although the article
| doesn't really match the title so maybe that's why you were
| confused? :p
| jqpabc123 wrote:
| Thanks for the report. It only confirms my judgment.
|
| The word "Google" attached to anything is a strong indicator that
| you should look for an alternative.
| danielmarkbruce wrote:
| It's meant to say "ephemeral"... right? It's hard to read after
| that.
| datalopers wrote:
| ephemeral and ethereal are commonly confused words.
| dublin wrote:
| Ephemerides really throws them. (And thank God for PyEphem,
| which makes all that otherwise quite fiddly stuff really
| easy...)
| danielmarkbruce wrote:
| I guess that's fair. It's sort of a smell when someone uses
| the wrong word (especially in writing) though. It suggests
| they aren't in industry, throwing ideas around with other
| folks. The word "ephemeral" is used extensively amongst
| software engineers.
| vienarr wrote:
| The article only talks about GPU start time, but the title is
| "CloudA vs CloudB reliability"
|
| Bit of a stretch, right?
| dark-star wrote:
| I wonder why someone would equate "instance launch time" with
| "reliability"... I won't go as far as calling it "clickbait" but
| wouldn't some other noun ("startup performance is wildly
| different") have made more sense?
| santoshalper wrote:
| I won't go so far as saying "you didn't read the article", but
| I think you missed something.
| xmonkee wrote:
| GCP also had 84 errors compared to 1 for AWS
| runeks wrote:
| They were almost exclusively _user_ errors (HTTP 4xx). They
| are supposed to indicate that the API is being used
| incorrectly.
|
| Although, it seems the author couldn't find out why they
| occurred, which points to poor error messages and/or lacking
| documentation.
| antonvs wrote:
| Another comment on this thread pointed out they had a
| potential collision in their instance name generation which
| may have caused this. That would mean this was user error,
| not a reliability issue. AWS doesn't require instance names
| to be unique.
| danielmarkbruce wrote:
| If not a 4xx, what should they return for instance not
| available?
| eurasiantiger wrote:
| 503 service unavailable?
| sn0wf1re wrote:
| That would be confusing. The HTTP response code should
| not be conflated with the application's state.
| eurasiantiger wrote:
| There will come a moment in time when you realize exactly
| what you have stated here and why it is not a good mental
| palace to live in.
| danielmarkbruce wrote:
| It's not the service that's unavailable. The resource
| isn't available. The service is running just fine.
| verdverm wrote:
| GCP error messages will indicate if resources were not
| available, if you reached your quota, or if it was some
| other error. Tests like the OP's can differentiate these
| situations.
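|
| A rough sketch of that differentiation, assuming the
| google-cloud-compute Python client and google.api_core
| exceptions; the mapping of quota and stock-out conditions to
| specific exception classes below is an assumption, not taken
| from the article's code:
|        from google.api_core import exceptions
|        from google.cloud import compute_v1
|
|        client = compute_v1.InstancesClient()
|
|        def classify_launch(project, zone, instance):
|            try:
|                # Errors raised by the insert call itself; errors
|                # reported later on the returned operation are not
|                # handled in this sketch.
|                client.insert(project=project, zone=zone,
|                              instance_resource=instance)
|                return "accepted"
|            except exceptions.Conflict:            # HTTP 409
|                return "duplicate name"            # client error
|            except exceptions.TooManyRequests:     # HTTP 429
|                return "quota / rate limit"
|            except exceptions.ServiceUnavailable:  # HTTP 503
|                return "capacity (assumed stock-out)"
|            except exceptions.BadRequest:          # HTTP 400
|                return "malformed request"         # client error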
| yjftsjthsd-h wrote:
| Yeah, 4xx is client error, 5xx is server error.
| pdpi wrote:
| Yes, and trying to create duplicate resources is a client
| error.
| eurasiantiger wrote:
| Still, 409 seems inappropriate, as it is meant to signal
| a version conflict, i.e. someone else changed something
| and the user tried to upload a stale version.
|
| "10.4.10 409 Conflict
|
| The request could not be completed due to a conflict with
| the current state of the resource. This code is only
| allowed in situations where it is expected that the user
| might be able to resolve the conflict and resubmit the
| request. The response body SHOULD include enough
| information for the user to recognize the source of the
| conflict. Ideally, the response entity would include
| enough information for the user or user agent to fix the
| problem; however, that might not be possible and is not
| required.
|
| Conflicts are most likely to occur in response to a PUT
| request. For example, if versioning were being used and
| the entity being PUT included changes to a resource which
| conflict with those made by an earlier (third-party)
| request, the server might use the 409 response to
| indicate that it can't complete the request. In this
| case, the response entity would likely contain a list of
| the differences between the two versions in a format
| defined by the response Content-Type."
|
| Then again, perhaps it is the service itself making that
| state change.
| dheera wrote:
| Using HTTP error codes for non-REST things is cringe.
|
| 503 would mean the IaaS API calls themselves are
| unavailable. Very different from the API working
| perfectly fine but the instances not being available.
| devmunchies wrote:
| What? REST is just an API philosophy; it doesn't even
| have to be on top of HTTP.
|
| Why would you think HTTP status codes are made for REST?
| They are made for HTTP to describe the response of the
| resource you are requesting, and the AWS API uses HTTP so
| it makes sense to use HTTP status codes.
| eurasiantiger wrote:
| sheeshkebab wrote:
| Maybe 1 reported. Not saying AWS reliability is bad, but the
| number of glitches that crop up across various AWS services
| and never make it onto their status page is quite high.
| theamk wrote:
| That was measured from API call return codes, not by
| looking at the overall service status page.
|
| Amazon is pretty good about this: if their API says a
| machine is ready, it usually is.
| mcqueenjordan wrote:
| Errors returned from APIs and the status page are
| completely separate topics in this context.
| mikewave wrote:
| Well, if your system elastically uses GPU compute and needs to
| be able to spin up, run compute on a GPU, and spin down in a
| predictable amount of time to provide reasonable UX, launch
| time would definitely be a factor in terms of customer-
| perceived reliability.
| HenriTEL wrote:
| GCP provides elastic features for that. One should use them
| instead of manually requesting new instances.
| rco8786 wrote:
| Sure, but not anywhere remotely near clearing the bar for
| simply calling that "reliability".
| Waterluvian wrote:
| When I think "reliability" I think "does it perform the act
| consistently?"
|
| Consistently slow is still reliability.
| VWWHFSfQ wrote:
| I would still call it "reliability".
|
| If the instance takes too long to launch then it doesn't
| matter if it's "reliable" once it's running. It took too
| long to even get started.
| rco8786 wrote:
| Why would you not call it "startup performance"?
|
| Calling this reliability is like saying a Ford is more
| reliable than a Chevy because the Ford has a better
| throttle response.
| endisneigh wrote:
| that's not what reliability means
| VWWHFSfQ wrote:
| > that's not what reliability means
|
| What is your definition of reliability?
| endisneigh wrote:
| unfortunately cloud computing and marketing have
| conflated reliability, availability and fault tolerance
| so it's hard to give you a definition everyone would
| agree to, but in general I'd say reliability refers to your
| ability to use the system without errors or drops in
| throughput significant enough to make it unusable for the
| stated purpose.
|
| in other words, reliability is that it does what you
| expect it to. GCP does not have any particular guarantees
| around being able to spin up VMs fast, so its inability
| to do so wouldn't make it unreliable. it would be like me
| saying that you're unreliable for not doing something
| when you never said you were going to.
|
| if this were comparing Lambda vs Cloud Functions, who
| both have stated SLAs around cold start times, and there
| were significant discrepancies, sure.
| pas wrote:
| true, the grammar and semantics work out, but since
| reliability needs a target, it's usually a serious design
| flaw to rely on something that has never demonstrably worked
| the way your reliability target assumes.
|
| so that's why, in engineering, it's not really used that way
| (as far as I understand, at least).
| somat wrote:
| It is not reliably running the machine but reliably getting
| the machine.
|
| Like the article said, the promise of the cloud is that you
| can easily get machines when you need them. The cloud that
| sometimes doesn't get you that machine (or doesn't get you
| that machine in time) is a less reliable cloud than the one
| that does.
| onphonenow wrote:
| If you want that promise you can reserve capacity in
| various ways. Google has reservations. Folks use this for
| DR, your org can get a pool of shared ones going if you
| are going to have various teams leaning on GPU etc.
|
| The promise of the cloud is that you can flexibly spin up
| machines if available, and easily spin down, no long term
| contracts or CapEx etc. They are all pretty clear that
| there are capacity limits under the hood (and your
| account likely has various limits on it as a result).
| rco8786 wrote:
| It's still performance. If this was "AWS failed to
| deliver the new machines and GCP delivered", sure,
| reliability. But this isn't that.
|
| The race car that finishes first is not "more reliable"
| than the one in 10th. They are equally as reliable,
| having both finished the race. The first place car is
| simply faster at the task.
| somat wrote:
| The one in first can more reliably win races however.
| [deleted]
| rco8786 wrote:
| You cannot infer that based on the results of the
| race...that's literally the entire point I am making. The
| 1st place car might blow up in the next race, the 10th
| place car might finish 10th place for the next 100 races.
|
| If the article were measuring HTTP response times and
| found that AWS's average response time was 50ms and GCP's
| was 200ms, and both returned 200s for every single
| request in the test, would you say AWS is more reliable
| than GCP based on that? Of course not, it's asinine.
| [deleted]
| jhugo wrote:
| All the clouds are pretty upfront about availability being
| non-guaranteed if you don't reserve it. I wouldn't call it a
| reliability issue if your non-guaranteed capacity takes some
| tens of seconds to provision. I mean, it might be _your_
| reliability issue, because you chose not to reserve capacity,
| but it's not really unreliability of the cloud -- they're
| providing exactly what they advertise.
| deanCommie wrote:
| "Guaranteed" has different tiers of meaning - both
| theoretical and practical.
|
| In many cases, "guaranteed" just means "we'll give you a
| refund if we fuck up". SLAs are very much like this.
|
| IN PRACTICE, unless you're launching tens of thousands of
| instances of an obscure image type, reasonable customers
| would be able to get capacity promptly from the cloud.
|
| That's the entire cloud value proposition.
|
| So no, you can't just hand-wave past these GCP results and
| say "Well, they never said these were guaranteed".
| robbintt wrote:
| This isn't actually true, even for tiny customers. In a
| personal project, I used a single host of a single
| instance type several times per day and had to code up a
| fallback.
| dilyevsky wrote:
| Try spinning up 32+ core instances with local ssds
| attached or anything not n1 family and you will find that
| in many regions you can only have like single digits of
| them
| jhugo wrote:
| Ignoring the fact that the results are probably partially
| flawed due to methodology (see top-level comment from
| someone who works on GCE) and are not reproducible due to
| missing information, pointing out the lack of a guarantee
| is not hand-waving. The OP uses the word "reliability" to
| catch attention, which certainly worked, but this has
| nothing to do with reliability.
| dark-star wrote:
| I'd still consider it a "performance issue", not a
| "reliability issue". There is no service unavailability here.
| It just takes your system a minute longer until the target
| GPU capacity is available. Until then it runs on fewer GPU
| resources, which makes it slower. Hence performance.
|
| The errors might be considered a reliability issue, but then
| again, errors are a very common thing in large distributed
| systems, and any orchestrator/autoscaler would just re-try
| the instance creation and succeed. Again, a performance
| impact (since it takes longer until your target capacity is
| reached) but reliability? not really
| irrational wrote:
| I'd like to see a breakdown of the cost differences. If the
| costs are nearly equal, why would I not choose the one that
| has a faster startup time and fewer errors?
| campers wrote:
| With GCP you can right-size the CPU and memory of the VM
| the GPU is attached to, unlike the fixed GPU AWS
| instances, so there is the potential for cost savings
| there.
| pier25 wrote:
| Wouldn't Cloud Run be a better product for that use case?
| mikepurvis wrote:
| Hopefully anyone with a workload that's that latency
| sensitive would have a preallocated pool of warmed-up
| instances ready to go.
| Art9681 wrote:
| Why would you scale to zero in high perf compute? Wouldn't it
| be wise to have a buffer of instances ready to pick up
| workloads instantly? I get that it shouldn't be necessary
| with a reliable and performant backend, and that the cost of
| having some instances waiting for a job can be substantial
| depending on how you do it, but I wonder if the cost
| difference between AWS and GCP would make up for that and you
| can get an equivalent amount of performance for an equivalent
| price? I'm not sure. I'd like to know though.
| thwayunion wrote:
| _> Why would you scale to zero in high perf compute?_
|
| Midnight - 6am is six hours. The on-demand price for a G5
| is about $1/hr, so 6 hours x 365 nights comes to over $2K/yr,
| or "an extra week of skiing paid for by your B2B side project
| that almost never has customers from ~9pm west coast to ~6am
| east coast". And I'm not even counting weekends.
|
| But that's sort of a silly edge case (albeit probably a
| real one for lots of folks commenting here). The _real_
| savings are in predictable startup times for bursty
| workloads. Fast, low-variance startup times unlock a huge
| amount of savings. Without both speed and predictability,
| you have to plan to fail and over-allocate. Which can get
| really expensive fast.
|
| Another way to think about this is that zero isn't special.
| It's just a special case of the more general scenario where
| customer demand exceeds current allocation. The larger your
| customer base, and the burstier your demand, the more
| instances you need sitting on ice to meet customers' UX
| requirements. This is particularly true when you're growing
| fast and most of your customers are new; you really want a
| good customer experience every single time.
| diroussel wrote:
| Scaling to zero means zero cost when there is zero work. If
| you have a buffer pool, how long do you keep it populated
| when you have no work?
|
| Maintaining a buffer pool is hard. You need to maintain
| state, have a prediction function, track usage through
| time, etc. Just spinning up new nodes for new work is
| substantially easier.
|
| And the author said he could spin up new nodes in 15
| seconds, that's pretty quick.
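|
| For a sense of the bookkeeping involved, here's a minimal
| reconciliation loop for a warm pool; launch_instance,
| terminate_instance and list_pool are hypothetical cloud-API
| wrappers, and the fixed target stands in for a real
| prediction function:
|        import time
|
|        TARGET_WARM = 2  # desired number of idle instances
|
|        def reconcile():
|            # list_pool() -> [{"id": ..., "busy": bool}, ...]
|            idle = [i for i in list_pool() if not i["busy"]]
|            if len(idle) < TARGET_WARM:
|                # Under target: pay for warmth, avoid cold starts.
|                for _ in range(TARGET_WARM - len(idle)):
|                    launch_instance()
|            else:
|                # Over target: shed the excess idle capacity.
|                for inst in idle[TARGET_WARM:]:
|                    terminate_instance(inst["id"])
|
|        while True:
|            reconcile()
|            time.sleep(30)  # poll interval; tune per workload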
| irjustin wrote:
| I'll say it is valid to use reliability.
|
| If I depend on some performance metric (startup, speed,
| etc.), my dependence on it equates to reliability; not just
| on/off but the spectrum it produces.
|
| If a CPU doesn't operate at its 2GHz setting 60% of the time, I
| would say that's not reliable. When my bus shows up on time
| only 40% of the time - I can't rely on that bus to get me where
| I need to go consistently.
|
| If the GPU took 1 hour to boot, but still booted, is it
| reliable? What about 1 year? At some point it tips past a
| "personal" threshold of reliability.
|
| The comparison to AWS, which consistently outperforms GCP,
| implicitly (if not explicitly) turns that into a reliability
| metric by setting the AWS boot time as "the standard".
| RajT88 wrote:
| Reliability is a fair term, with an asterisk. It is a specific
| flavor of reliability: deployment or scaling or net-new or
| allocation or whatever you want to call it.
| thayne wrote:
| Well, I mean it is measuring how reliably you can get a GPU
| instance. But it certainly isn't the overall reliability. And
| depending on your workflow, it might not even be a very
| interesting measure. I would be more interested in seeing a
| comparison of how long regular non-GPU instances can run
| without having to be rebooted, and maybe how long it takes to
| allocate a regular VM.
| thesuperbigfrog wrote:
| "AWS encountered one valid launch error in these two weeks
| whereas GCP had 84."
|
| 84 times as many launch errors seems like a valid basis for
| calling it "less reliable".
| iLoveOncall wrote:
| It is clickbait, the real title should be "AWS vs. GCP on-
| demand provisioning of GPU resources performance is wildly
| different".
|
| That said, while I agree that launch time and provisioning
| error rate are not sufficient to define reliability, they are
| definitely a part of it.
| tl_donson wrote:
| " AWS vs. GCP on-demand provisioning of GPU resources
| performance is wildly different"
|
| yeah i guess it does make sense that one didn't win the a/b
| test
| [deleted]
| lelandfe wrote:
| > wildly different
|
| For this, I'd prefer a title that lets me draw my own
| conclusions. 84 errors out of 3000 doesn't sound awful to
| me...? But what do I know - maybe just give me the data:
|
| "1 in 3000 GPUs fail to spawn on AWS. GCP: 84"
|
| "Time to provision GPU with AWS: 11.4s. GCP: 42.6s"
|
| "GCP >4x avg. time to provision GPU than AWS"
|
| "Provisioning on GCP both slower and more error-prone than
| AWS"
| esrauch wrote:
| 84 of 3000 failed is only "one nine"
| [deleted]
| hericium wrote:
| Cloud reliability is not the same as the reliability of an
| already-spawned VM.
|
| Here it's the ability to launch new VMs to satisfy a
| project's dynamic needs. A cloud provider should let you
| scale up in a predictable way. When it doesn't, it can be
| called unreliable.
|
| Also, "unreliable" is basically a synonym for "Google" these
| days.
| DonHopkins wrote:
| Let me unreliable that for you.
| ReptileMan wrote:
| To be fair, their search is so crap lately that rolling the
| dice is not the worst option in the world for finding a
| result that's actually useful.
| rmah wrote:
| They are talking about the reliability of AWS vs GCP. As a user
| of both, I'd categorize predictable startup times under
| reliability because if it took more than a minute or so, we'd
| consider it broken. I suspect many others would have even
| tighter constraints.
| chrismarlow9 wrote:
| I mean, if you're talking about worst-case systems, you
| assume everything is gone except your infra code and backups.
| In that case your instance launch time would ultimately
| define what your downtime looks like, assuming all else is
| equal. It does seem a little weird to define it that way, but
| in a strict sense maybe not.
___________________________________________________________________
(page generated 2022-09-22 23:03 UTC)