[HN Gopher] AWS vs. GCP reliability is wildly different
       ___________________________________________________________________
        
       AWS vs. GCP reliability is wildly different
        
       Author : icyfox
       Score  : 526 points
       Date   : 2022-09-21 20:29 UTC (1 day ago)
        
 (HTM) web link (freeman.vc)
 (TXT) w3m dump (freeman.vc)
        
       | johndfsgdgdfg wrote:
        | It's not surprising. Amazon is an amazingly customer-focused
        | company. Google is a spyware company that only wants to make
        | more money by invading our privacy. Of course Amazon's products
        | will be better than Google's.
        
       | user- wrote:
        | I wouldn't call this reliability, which already has a loaded
        | definition in the cloud world, but rather something along the
        | lines of time-to-start or latency.
        
         | systemvoltage wrote:
          | It is, though, based on a specific definition. If X doesn't do
          | Y based on Z metric with a large standard deviation and doesn't
          | meet spec limits, it is not reliable as per the predefined
          | tolerance T.
          | 
          |     X = Compute instances
          |     Y = Launch
          |     Z = Time to launch
          |     T = LSL (N/A), USL (10s), Std Dev (2s)
         | 
         | Where LSL is lower spec limit, USL is upper spec limit. LSL is
         | N/A since we don't care if the instance launches instantly (0
         | seconds).
         | 
         | You can define T as per your requirements. Here we are ignoring
         | the accuracy of the clock that measures time, assuming that the
         | measurement device is infinitely accurate.
         | 
          | If your criterion is, say, to define reliability as how fast
          | it shuts down, then this article isn't relevant. The article
          | is pretty narrow in testing reliability; it only cares about
          | launch time.
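          | 
          | A minimal sketch of that check in Python (the 10s USL and 2s
          | std dev are just the example numbers above, and the launch
          | times are made up):
          | 
          |     import statistics
          | 
          |     launch_times = [4.2, 5.1, 6.0, 7.3, 8.9]  # seconds, hypothetical samples
          |     USL, MAX_STD = 10.0, 2.0  # upper spec limit, allowed std dev
          | 
          |     within_spec = all(t <= USL for t in launch_times)
          |     stable = statistics.stdev(launch_times) <= MAX_STD
          |     print("reliable per tolerance T:", within_spec and stable)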
        
       | 1-6 wrote:
        | This is all about cloud GPUs; I was expecting something totally
        | different from the title.
        
       | s-xyz wrote:
        | Would be interested to see a comparison of Lambda functions vs
        | Google's 2nd gen Cloud Functions. I think that GCP is more
        | serverless-focused.
        
       | duskwuff wrote:
       | ... why does the first graph show some instances as having a
       | negative launch time? Is that meant to indicate errors, or has
       | GCP started preemptively launching instances to anticipate
       | requests?
        
         | tra3 wrote:
         | The y axis here measures duration that it took to successfully
         | spin up the box, where negative results were requests that
         | timed out after 200 seconds. The results are pretty staggering
        
         | zaltekk wrote:
         | I don't know how that value (looks like -50?) was chosen, but
         | it seems to correspond to the launch failures.
        
         | staringback wrote:
          | Perhaps if you read the line directly above the graph you
          | would see it was explained and would not have to ask this
          | question.
        
       | zmmmmm wrote:
       | > In total it scaled up about 3,000 T4 GPUs per platform
       | 
       | > why I burned $150 on GPUs
       | 
        | How do you rent 3,000 GPUs over a period of weeks for $150? Were
        | they literally requisitioning them and releasing them
        | immediately? This seems like quite an unrealistic usage pattern,
        | and it would depend a lot on whether the cloud provider
        | optimises to hand you back the same warm instance you just
        | relinquished.
       | 
       | > GCP allows you to attach a GPU to an arbitrary VM as a hardware
       | accelerator
       | 
        | It's quite fascinating that GCP can do this. GPUs are physical
        | things (!). Do they provision every single instance type in the
        | data center with GPUs? That would seem very expensive.
        
         | geysersam wrote:
         | > $150
         | 
         | Was asking myself the same question. From the pricing
         | information on gcp it seems minimum billing time is 1 minute,
         | making 3000 GPUs cost $50 minimum. If this is the case then
         | $150 is reasonable for the kind of usage pattern you describe.
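          | 
          | Rough arithmetic behind that estimate (the ~$1/hour combined
          | GPU + VM rate is an assumption, not a quoted price):
          | 
          |     launches = 3000
          |     billed_minutes = 1   # minimum billing increment per launch
          |     hourly_rate = 1.0    # assumed $/hour for a T4 plus a small VM
          |     print(launches * billed_minutes / 60 * hourly_rate)  # ~50.0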
        
         | bushbaba wrote:
         | Unlikely. More likely they put your VM on a host with GPU
         | attached, and use live migration to move workloads around for
         | better resource utilization.
         | 
         | However, live-migration can cause impact to HPC workloads.
        
         | ZiiS wrote:
         | GPUs are physical but VMs are not; I expect they just move them
         | to a host with a GPU.
        
         | NavinF wrote:
         | It probably live-migrates your VM to a physical machine that
         | has a GPU available.
         | 
          | ...if there are any GPUs available in the AZ, that is. I had a
          | hell of a time last year moving back and forth between regions
          | to grab just 1 GPU to test something. The web UI didn't have
          | an "any region" option for launching VMs, so if you didn't use
          | the API you'd have to sit there for 20 minutes trying each
          | AZ/region until you managed to grab one.
        
       | kazinator wrote:
       | > _This is particularly true for GPUs, which are uniquely
       | squeezed by COVID shutdowns, POW mining, and growing deep
       | learning models_
       | 
       | Is the POW mining part true any more? Hasn't mining moved to
       | dedicated hardware?
        
         | CodesInChaos wrote:
         | Bitcoin mining has used dedicated hardware for a long time. But
         | I believe Ethereum mining used GPUs before the very recent
         | proof-of-stake update.
        
       | remus wrote:
       | > The offerings between the two cloud vendors are also not the
       | same, which might relate to their differing response times. GCP
       | allows you to attach a GPU to an arbitrary VM as a hardware
       | accelerator - you can separately configure quantity of the CPUs
       | as needed. AWS only provisions defined VMs that have GPUs
       | attached - the g4dn.x series of hardware here. Each of these
       | instances are fixed in their CPU allocation, so if you want one
       | particular varietal of GPU you are stuck with the associated CPU
       | configuration.
       | 
        | At a surface level, the above (from the article) seems like a
        | pretty straightforward explanation? GCP gives you more
        | flexibility in configuring GPU instances at the trade-off of
        | increased startup time variability.
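        | 
        | To make the difference concrete, here is roughly what the two
        | request shapes look like from Python. This is only a sketch:
        | the project, zone, names, and AMI are placeholders, and the GCE
        | body is trimmed to the GPU-relevant fields.
        | 
        |     import boto3
        |     from googleapiclient import discovery
        | 
        |     # AWS: the GPU comes bundled with a fixed instance type.
        |     boto3.client("ec2").run_instances(
        |         ImageId="ami-0123456789abcdef0",  # placeholder AMI
        |         InstanceType="g4dn.xlarge",       # 1x T4 plus a fixed 4 vCPUs
        |         MinCount=1, MaxCount=1,
        |     )
        | 
        |     # GCP: the GPU is an accelerator attached to a VM you size yourself.
        |     zone = "us-central1-a"
        |     discovery.build("compute", "v1").instances().insert(
        |         project="my-project", zone=zone,
        |         body={
        |             "name": "gpu-test",
        |             "machineType": f"zones/{zone}/machineTypes/n1-standard-4",
        |             "guestAccelerators": [{
        |                 "acceleratorType": f"zones/{zone}/acceleratorTypes/nvidia-tesla-t4",
        |                 "acceleratorCount": 1,
        |             }],
        |             "scheduling": {"onHostMaintenance": "TERMINATE"},
        |             # boot disk and network interface omitted for brevity
        |         },
        |     ).execute()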
        
         | btgeekboy wrote:
         | I wouldn't be surprised if GCP has GPUs scattered throughout
         | the datacenter. If you happen to want to attach one, it has to
         | find one for you to use - potentially live migrating your
         | instance or someone else's so that it can connect them. It'd
         | explain the massive variability between launch times.
        
           | master_crab wrote:
           | Yeah that was my thought too when I first read the blurb.
           | 
            | It's neat... but like a lot of things in large-scale
            | operations, the devil is in the details. GPU-CPU
            | communication is a low-latency, high-bandwidth operation -
            | not something you can trivially do over standard TCP. GCP
            | offering something like that without the ability to
            | flawlessly migrate the VM or procure enough "local" GPUs
            | means it's just vaporware.
            | 
            | As a side note, I'm surprised the author didn't note the
            | number of ICEs (insufficient capacity errors) AWS throws
            | whenever you spin up a G-type instance. AWS is notorious for
            | offering very few Gs and Ps in certain AZs and regions.
        
             | my123 wrote:
             | Fungible is selling a GPU decoupling solution via PCIe
             | encapsulated over Ethernet today, so it can certainly be
             | done.
             | 
             | And NVIDIA's vGPU solutions do support live migration of
             | GPUs to another host (in which case the vGPU gets moved
             | too, to a GPU on that target).
        
           | dubcee349 wrote:
            | I doubt it would be set up like that. Compute is usually
            | deployed as part of a large set of servers. The reason is
            | that different compute workloads require different uplink
            | capacity. You don't need a petabyte of uplink capacity for
            | many GPU loads, but you may for compute. Just the switching
            | ASICs are much more expensive for 400G+ than 100G, and that
            | hasn't even gotten into the optics, NICs, and other things.
            | Traditionally, you don't mix and match compute across the
            | same place in the data center.
        
           | ryukoposting wrote:
           | I've only ever used AWS for this stuff. When the author said
           | that you could just "add a GPU" to an existing instance, my
           | first reaction was "wow, that sounds like it would be really
           | complicated behind the scenes."
        
       | dekhn wrote:
       | What would you expect? AWS is an org dedicated to giving
       | customers what they want and charging them for it, while GCP is
       | an org dedicated to telling customers what they want and using
       | the revenue to get slightly better cost margins on Intel servers.
        
         | dilyevsky wrote:
         | I don't believe this reasoning is used since at least Diane
        
           | dekhn wrote:
           | I haven't seen any real change from Google about how they
           | approach cloud in the past decade (first as an employee and
           | developer of cloud services there, and now as a customer).
           | Their sales people have hollow eyes
        
         | DangitBobby wrote:
         | They don't really tell us what we want, we just buy what we
         | need. Might work for you.
        
       | jupp0r wrote:
        | That's interesting, but not what I expected when I read
        | "reliability". I would have expected SLO metrics like network
        | uptime or similar metrics that users would care about more.
        | Usually, when scaling a well-built system, you don't have hard,
        | tight constraints on how fast an instance needs to be spun up.
        | If you are unable to spin up any at all, that can be
        | problematic, of course. Ideally this is all automated, so nobody
        | would care much about whether it takes a retry or 30s longer to
        | create an instance. If this is important to you, you have other
        | problems.
        
       | TheMagicHorsey wrote:
       | This is not reliability. This is a measure of how much spare
       | capacity AWS seems to be leaving idle for you to snatch on-
       | demand.
       | 
        | This is going to vary a lot based on the time of year. Why don't
        | you try this same experiment around a time when there's a lot of
        | retail sales activity (Black Friday), and watch AWS suddenly
        | have much less capacity to dole out on-demand.
        | 
        | To me, reliability is a measure of what a cloud does compared to
        | what it says it will do. GCP is not promising you on-demand
        | instances instantaneously, is it? If you want that ... reserve
        | capacity.
        
       | playingalong wrote:
       | This is great.
       | 
        | I have always felt there is so little independent content on
        | benchmarking the IaaS providers. There is so much you can
        | measure in how they behave.
        
       | lacker wrote:
       | Anecdotally I tend to agree with the author. But this really
       | isn't a great way of comparing cloud services.
       | 
       | The fundamental problem with cloud reliability is that it depends
       | on a lot of stuff that's out of your control, that you have no
       | visibility into. I have had services running happily on AWS with
       | no errors, and the next month without changing anything they fail
       | all the time.
       | 
        | Why? Well, we look into it and it turns out AWS changed
        | something behind the scenes. There's different underlying
        | hardware behind the instance, or some resource started being in
        | high demand because of other customers.
       | 
       | So, I completely believe that at the time of this test, this
       | particular API was performing a lot better on AWS than on GCP.
       | But I wouldn't count on it still performing this way a month
       | later. Cloud services aren't like a piece of dedicated hardware
       | where you test it one month, and then the next month it behaves
       | roughly the same. They are changing a lot of stuff that you can't
       | see.
        
         | citizenpaul wrote:
          | That was my thought too. People are probably pummeling the GCP
          | GPU free tier right now with Stable Diffusion image
          | generators, since it seems like all the free plug-and-play
          | examples use the Google Python notebooks.
        
         | RajT88 wrote:
         | Instance types and regions make a big difference.
         | 
         | Some regions and hardware generations are just busier than
         | others. It may not be the same across cloud providers (although
         | I suspect it is similar given the underlying market forces).
        
         | ryukoposting wrote:
         | You've just perfectly characterized why on-site infrastructure
         | will always have its place.
        
           | callalex wrote:
           | You can reserve capacity on both of these services as well.
        
       | lomkju wrote:
        | Having been a high-scale AWS user with a bill of $1M+/month, and
        | now having worked for two years at a company that uses GCP, I
        | would say AWS is superior and way ahead.
       | 
       | ** NOTE: If you're a low scale company this won't matter to you
       | **
       | 
       | 1. GKE
       | 
        | When you cross a certain scale, certain GKE components won't
        | scale with you, and the SLOs on those components are crazy: it
        | takes 15+ minutes for us to update an Ingress backed by the GKE
        | ingress controller.
        | 
        | Cloud Logging hasn't been able to keep up with our scale; it has
        | been disabled for two years now. Last quarter we got an email
        | from them asking us to enable it and try it again on our
        | clusters. We still have to verify those claims, as our scale is
        | even higher now.
       | 
        | The Konnectivity agent release was really bad for us. It
        | affected some components internally, and we lost more than 3
        | months of dev time debugging the issue. They had to disable the
        | Konnectivity agent on our clusters. I had to collect TCP dumps
        | and other evidence just to prove nothing was wrong on our end,
        | and fight with our TAM to get a meeting with the product team.
        | After 4 months they agreed and reverted our clusters to SSH
        | tunnels. Initially GCP support said they couldn't do this. Next
        | quarter I'll be updating the clusters; hopefully they have fixed
        | this by then.
       | 
       | 2. Support.
       | 
        | I think AWS support was always more proactive in debugging with
        | us; GCP support agents most of the time lack the expertise or
        | proactiveness to debug/solve things even in simple cases. We pay
        | for enterprise support and don't see ourselves getting much from
        | them. At AWS we had reviews of the infra every two quarters on
        | how we could improve it; we got new suggestions, and that was
        | also when we shared what we would like to see on their roadmap.
       | 
        | 3. Enterprisyness is missing from the design
        | 
        | Something as simple as Cloud Build doesn't have access to static
        | IPs. We have to maintain a forward proxy just because of this.
       | 
        | L4 LBs were a mess: you could only use specified ports in a TCP
        | proxy. For a TCP-proxy-based load balancer, the allowed set of
        | ports was [25, 43, 110, 143, 195, 443, 465, 587, 700, 993, 995,
        | 1883, 3389, 5222, 5432, 5671, 5672, 5900, 5901, 6379, 8085,
        | 8099, 9092, 9200, and 9300]. Today I see they have removed these
        | restrictions. I don't know who came up with the idea of allowing
        | only a few ports on an L4 LB. I think such design decisions make
        | it less enterprisy.
        
       | endisneigh wrote:
       | this doesn't really seem like a fair comparison, nor is it a
       | measure of "reliability".
        
         | daneel_w wrote:
         | It seems entirely fair to me, but the term "reliability" has a
         | few different angles. This time it's not about working or not
         | working, but the ability to auto-scale by invoking resources on
         | the spot, which can be a very real requirement.
        
           | endisneigh wrote:
           | unless you're willing to burn $150 a quarter doing this exact
           | assessment, it tells you nothing other than the data center
           | conditions at the time of running.
           | 
           | it would be like doing this in us-central1 when us-central1
           | is down for one provider, and not another, resulting in
           | increased latency, and saying how much faster one is than the
           | other.
           | 
            | unlike, say, a throughput test or similar, neither of these
            | services promises particular cold-start times, so the
            | numbers here cannot be contextualized against any metric
            | given by either company. They are only useful in the sense
            | that they can be compared, but since there are no guarantees
            | the positions could switch at any time.
           | 
           | that's why I like comparisons between serverless functions
           | where there are pretty explicit SLAs and what not given by
           | each company for you to compare against, as well as one
           | another.
        
             | daneel_w wrote:
             | Given the stark contrast and that the pattern was identical
             | every day over a two-week course, it tells me we're
             | observing a fundamental systemic difference between GCP and
             | AWS - and I think that's all the author really wanted to
             | point out. I would not be surprised if the results are
             | replicable three months from now.
        
       | Animats wrote:
       | > GCP allows you to attach a GPU to an arbitrary VM as a hardware
       | accelerator - you can separately configure quantity of the CPUs
       | as needed.
       | 
       | That would seem to indicate that asking for a VM on GCP gets you
       | a minimally configured VM on basic hardware, and then it gets
       | migrated to something bigger if you ask for more resources. Is
       | that correct?
       | 
       | That could make sense if, much of the time, users get a VM and
       | spend a lot of time loading and initializing stuff, then migrate
       | to bigger hardware to crunch.
        
         | zylent wrote:
          | This is not quite true - GPUs are limited to select VM types,
          | and the number of GPUs you have influences the maximum number
          | of cores you can get. In general they're only available on the
          | N1 instances (except the A100s, but those are far less
          | popular).
        
       | AtNightWeCode wrote:
        | This benchmark (too) is probably incorrect. It produces 409s, so
        | there are errors in there that I doubt are caused by GCP.
        
       | humanfromearth wrote:
       | We have constant autoscaling issues because of this in GCP - glad
       | someone plotted this - hope people in GCP will pay a bit more
       | attention to this. Thanks to the OP!
        
       | runeks wrote:
       | > These differences are so extreme they made me double check the
       | process. Are the "states" of completion different between the two
       | clouds? Is an AWS "Ready" premature compared to GCP? It
       | anecdotally appears not; I was able to ssh into an instance right
       | after AWS became ready, and it took as long as GCP indicated
       | before I was able to login to one of theirs.
       | 
       | This is a good point and should be part of the test: after
       | launching, SSH into the machine and run a trivial task to confirm
       | that the hardware works.
        
       | kccqzy wrote:
       | Heard from a Googler that the internal infrastructure (Borg) is
       | simply not optimized for quick startup. Launching a new Borg job
       | often takes multiple minutes before the job runs. Not surprising
       | at all.
        
         | dekhn wrote:
          | A well-configured isolated Borg cluster and a well-configured
          | job can be really fast. If there's no preemption (i.e., no
          | other job that gets kicked off with some grace period), the
          | packages are already cached locally, there is no undue load on
          | the scheduler, the resources are available, and it's a single
          | job with tasks rather than multiple jobs, then it will be
          | close to instantaneous.
          | 
          | I spent a significant fraction of my 11+ years there clicking
          | Reload on my job's Borg page. I was able to (re-)start ~100K
          | jobs globally in about 15 minutes.
        
           | fragmede wrote:
            | Psh _someone's_ bragging about not being at batch priority.
        
             | dekhn wrote:
             | I ran at -1
        
         | dekhn wrote:
         | booting VMs != starting a borg job.
        
           | kccqzy wrote:
           | The technology may be different but the culture carries over.
           | People simply don't have the habit to optimize for startup
           | time.
        
         | readams wrote:
         | Borg is not used for gcp vms, though.
        
           | dilyevsky wrote:
           | It is used but borg scheduler does not manage vm startups
        
         | epberry wrote:
         | Echoing this. The SRE book is also highly revealing about how
         | Google request prioritization works. https://sre.google/sre-
         | book/load-balancing-datacenter/
         | 
          | My personal opinion is that Google's resources are more
          | tightly optimized than AWS's, and they may try to find the 99%
          | best allocation versus the 95% best allocation on AWS... and
          | this leads to more rejected requests. Open to being wrong on
          | this.
        
         | valleyjo wrote:
          | As another comment points out, GPU resources are less common,
          | so they take longer to create, which makes sense. In general,
          | startup times are pretty quick on GCP, as other comments also
          | confirm.
        
         | jsolson wrote:
         | This is mostly not true in cases where resources are actually
         | available (and in GCE if they're not the API rejects the VM
         | outright, in general). To the extent that it is true for Borg
         | when the job schedules immediately, it's largely due to package
         | (~container layers, ish) loading. This is less relevant today
         | (because reasons), and also mostly doesn't apply to GCE as the
         | relevant packages are almost universally proactively made
         | available on relevant hosts.
         | 
         | The origin for the info that jobs take "minutes" likely
         | involves jobs that were pending available resources. This is a
         | valid state in Borg, but GCE has additional admission control
         | mechanisms aimed at avoiding extended residency in pending.
         | 
         | As dekhn notes, there are many factors that contribute to VM
         | startup time. GPUs are their own variety of special (and, yes,
         | sometimes slow), with factors that mostly don't apply to more
         | pedestrian VM shapes.
        
       | Jamie9912 wrote:
       | Should probably change the title to "AWS vs GCP on-demand GPU
       | launch time consistency"
        
         | Terretta wrote:
         | Yep. Author colloquially meant, can I rely on a quick start.
        
       | MonkeyMalarky wrote:
       | I would love to see the same for deploying things like a
       | cloud/lambda function.
        
       | orf wrote:
       | AWS has different pools of EC2 instances depending on the
       | customer, the size of the account and any reservations you may
       | have.
       | 
       | Spawning a single GPU at varying times is nothing. Try spawning
       | more than one, or using spot instances, and you'll get a very
       | different picture. We often run into capacity issues with GPU and
       | even the new m6i instances at all times of the day.
       | 
        | Very few realistic company-size workloads need a single GPU. I
        | would willingly wait 30 minutes for my instances to become
        | available if it meant _all_ of them were available at the same
        | time.
        
       | herpderperator wrote:
        | The author is using 'Quantile', which I hadn't heard of before,
        | and when I looked it up, it seemed like it should actually be
        | 'Percentile'. Percentiles are the percentages, which is what the
        | author is referring to.
        
         | mr_toad wrote:
          | 'Quantile' is a generic term covering percentiles, deciles,
          | quartiles, etc. 'Percentile' would have been a more precise
          | term.
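          | 
          | For example, with numpy the two are the same quantity on
          | different scales:
          | 
          |     import numpy as np
          |     x = np.random.rand(1000)
          |     np.isclose(np.quantile(x, 0.99), np.percentile(x, 99))  # True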
        
       | outworlder wrote:
       | Unclear what the article has to do with reliability. Yes,
       | spinning up machines on GCP is incredibly fast and has always
       | been. AWS is decent. Azure feels like I'm starting a Boeing 747
       | instead of a VM.
       | 
        | However, there's one aspect where GCP is a _clear_ winner on the
        | reliability front. They auto-migrate instances transparently and
        | with close to zero impact to workloads - I want to say zero
        | impact but it's not technically zero.
       | 
        | In comparison, in AWS you need to stop/start your instance
        | yourself so that it will move to another hypervisor (depending
        | on the actual issue, AWS may do it for you). That definitely has
        | an impact on your workloads. We can sometimes architect around
        | it, but there's still something to worry about. Given the number
        | of instances we run, we have multiple machines to deal with
        | weekly. We get all these 'scheduled maintenance' events (which
        | sometimes aren't really all that scheduled), with some instance
        | IDs (they don't even bother sending the name tag), and we have
        | to deal with that.
       | 
       | I already thought stop/start was an improvement on tech at the
       | time (Openstack, for example, or even VMWare) just because we
       | don't have to think about hypervisors, we don't have to know, we
       | don't care. We don't have to ask for migrations to be performed,
       | hypervisors are pretty much stateless.
       | 
       | However, on GCP? We had to stop/start instances exactly zero
       | times, out of the thousands we run and have been running for
       | years. We can see auto-migration events when we bother checking
       | the logs. Otherwise, we don't even notice the migration happened.
       | 
       | It's pretty old tech too:
       | 
       | https://cloudplatform.googleblog.com/2015/03/Google-Compute-...
        
         | voidfunc wrote:
         | > Azure feels like I'm starting a Boeing 747 instead of a VM.
         | 
         | Huh... interesting, this has not been my experience with Azure
         | VM launch times. I'm usually surprised how quickly they pop up.
        
           | jiggawatts wrote:
           | Depends on your disks.
           | 
           | Premium SSD allows 30 minutes of "burst" IOPS, which can
           | bring down boot times to about 2-5 seconds for a typical
           | Windows VM. The provisioning time is a further 60-180 seconds
           | on top. (The fastest I could get it is about 40 seconds using
           | a "smalldisk" image to ephemeral storage, but then it took a
           | further 30 seconds or so for the VM to become available.)
           | 
           | Standard HDD was slow enough that the boot phase alone would
           | take minutes, and then the VM provisioning time is almost
           | irrelevant in comparison.
        
         | jcheng wrote:
         | > Yes, spinning up machines on GCP is incredibly fast and has
         | always been. AWS is decent.
         | 
         | FWIW this article is saying the opposite--it's AWS that beats
         | GCP in startup speed.
        
           | valleyjo wrote:
           | This article states that GPU instances are slower on GCP - it
           | doesn't make any claims about non-GPU instances.
        
         | yolovoe wrote:
         | EC2 live migrates instances too. Not sure where we are with
         | rollout across the fleet.
         | 
         | The reason, from what I understand, why GCP does live migration
         | more is because ec2 focused on live updates instead of live
         | migration. Whereas GCP migrates instances to update servers,
         | ec2 live updates everything down to firmware while instances
         | are running.
         | 
         | Curious, what instance types are you using on EC2 that you see
         | so much maintenance?
        
         | willcipriano wrote:
         | I always wondered why you couldn't do that on AWS, mainly
         | because I could do it at home with Hyper-V a decade ago.
         | 
         | https://learn.microsoft.com/en-us/previous-versions/windows/...
        
       | politelemon wrote:
       | A few weeks ago I needed to change the volume type on an EC2
       | instance to gp3. Following the instructions, the change happened
       | while the instance was running. I didn't need to reboot or stop
       | the instance, it just changed the type. While the instance was
       | running.
       | 
       | I didn't understand how they were able to do this, I had thought
       | volume types mapped to hardware clusters of some kind. And since
       | I didn't understand, I wasn't able to distinguish it from magic.
        
         | osti wrote:
         | Look up AWS Nitro on YouTube if you are interested in learning
         | more about it.
        
         | ArchOversight wrote:
          | Changing the volume type on AWS is somewhat magical. Seeing it
          | happen online was amazing.
        
         | cavisne wrote:
         | EBS is already replicated so they probably just migrate behind
         | the scenes, same as if the original physical disk was
          | corrupted. It looks like only certain conditions allow this
          | kind of migration.
         | 
         | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/modify-v...
        
         | Salgat wrote:
         | If I remember right they use the equivalent of a ledger of
         | changes to manage volume state. So in this case, they copy over
         | the contents (up to a certain point in time) to the new faster
         | virtual volume, then append and direct all new changes to the
         | new volume.
         | 
         | This is also how they are able to snapshot a volume at a
         | certain point in time without having any downtime or data
         | inconsistencies.
        
         | xyzzyz wrote:
         | Dunno about AWS, but GCP uses live migration, and will migrate
         | your VM across physical machines as necessary. The disk volumes
         | are all connected over the network, nothing really depends on
          | the actual physical machine your VM runs on.
        
           | lbhdc wrote:
           | How does migrating a vm to another physical machine work?
        
             | the_duke wrote:
             | This blog post is pretty old (2015) but gives a good
             | introduction.
             | 
             | https://cloudplatform.googleblog.com/2015/03/Google-
             | Compute-...
        
               | lbhdc wrote:
               | Thanks for sharing, I will give it a read!
        
             | rejectfinite wrote:
             | vsphere vmotion has been a thing for years lmao
        
             | roomey wrote:
             | VMware has been doing this for years, it's called vmotion
             | and there is a lot of documentation about it if you are
             | interested (eg https://www.thegeekpub.com/8407/how-vmotion-
             | works/ )
             | 
              | Essentially, memory state is copied to the new host, the
              | VM is stunned for a millisecond, and the CPU state is
              | copied and resumed on the new host (you may see a dropped
              | ping). All the networking and storage is virtual anyway,
              | so that is "moved" (it's not really moved) in the
              | background.
        
               | davidro80 wrote:
        
               | lbhdc wrote:
               | That is really interesting I didn't realize it was so
               | fast. Thanks for the post I will give it a read!
        
               | politelemon wrote:
               | > VM is stunned for a millisecond
               | 
               | This conjures up hilarious mental imagery, thanks
        
               | kevincox wrote:
               | You just bop it on the head, and move it to the new
               | machine quickly. By the time the VM comes to it won't
               | even realize that it is in a new home.
        
               | jiggawatts wrote:
               | The clever trick here is that they'll pre-copy most of
               | the memory without bothering to do it consistently, but
               | mark pages that the source had written to as "dirty". The
                | network cutover is stop-the-world, but VMware _doesn't_
               | copy the dirty pages during the stop. Instead, it simply
               | treats them as "swapped to pagefile", where the pagefile
               | is actually the source machine memory. When computation
               | resumes at the target, the source is used to page memory
               | back in on-demand. This allows very fast cutovers.
        
               | mh- wrote:
               | Up to 500ms per your source, depending on how much churn
               | there is in the memory from the source system.
               | 
               | Very cool.
        
             | valleyjo wrote:
              | Stream the contents of RAM from source to dest, pause the
              | source, reprogram the network and copy any memory that
              | changed since the initial stream, resume the dest, destroy
              | the source, profit.
        
             | pclmulqdq wrote:
             | They pause your VM, copy everything about its state over to
             | the new machine, and quickly start the other instance. It's
             | pretty clever. I think there are tricks you can play with
             | machines that have large memory footprints to copy most of
             | it before the pause, and only copy what has changed since
             | then during the pause.
             | 
             | The disks are all on the network, so no need to move
             | anything there.
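              | 
              | A toy simulation of that pre-copy idea, purely
              | illustrative and unrelated to any vendor's actual
              | implementation:
              | 
              |     import random
              | 
              |     source = {p: random.random() for p in range(1000)}  # guest RAM, old host
              |     target = {}                                         # copy on new host
              |     dirty, churn = set(source), 200
              | 
              |     # Iterative pre-copy: keep copying while the guest keeps writing.
              |     while len(dirty) > 16:
              |         to_copy, dirty = dirty, set()
              |         for p in to_copy:
              |             target[p] = source[p]
              |         churn //= 4  # pretend the working set shrinks each round
              |         for p in random.sample(range(1000), churn):
              |             source[p] = random.random()  # guest dirties pages mid-copy
              |             dirty.add(p)
              | 
              |     # Stop-the-world: pause, copy the last dirty pages, resume on target.
              |     for p in dirty:
              |         target[p] = source[p]
              |     assert target == source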
        
               | prmoustache wrote:
                | In reality it syncs the memory to the other host first
                | and only pauses the VM when the last state delta is
                | small enough that the pause is barely measurable.
        
               | lbhdc wrote:
                | When it's transferring the state to the target, how does
                | it handle memory updates that are happening at that
                | time? Is the program's execution paused at that point?
        
               | outworlder wrote:
               | No, they keep track of dirty pages.
        
               | GauntletWizard wrote:
               | No, but the memory accesses have hooks that say "This
               | memory was written". Then, program execution _is_ paused,
               | and just the sections of memory that were written are
               | copied again.
               | 
                | This has memory performance implications - I once ran a
                | benchmark of memory read/write speed while this was
                | happening. It more than halved memory speed for the 30s
                | or so it took from migration start to migration
                | complete. The pause, too, was much longer.
        
               | lbhdc wrote:
               | Ahh I think that was the piece I was missing, thanks! I
               | didn't realize there were hooks for tracking memory
               | changes.
        
               | water-your-self wrote:
                | Indiana Jones and the register states
        
           | valleyjo wrote:
           | Azure, AWS and GCP all have live migration. VMWare has it
           | too.
        
             | dilyevsky wrote:
              | EC2 does not have live migration. On Azure it's spotty, so
              | not every maintenance event can offer it.
        
               | [deleted]
        
               | ta20200710 wrote:
               | EC2 does support live migration, but it's not public and
               | only for certain instance types/hypervisors.
               | 
               | See: https://news.ycombinator.com/item?id=17815806
        
               | _msw_ wrote:
               | Here's a comment that I made in a past thread.
               | 
               | https://news.ycombinator.com/item?id=26650082
        
               | dilyevsky wrote:
                | My experience running c5/c6 instances makes me very
                | confident EC2 doesn't do live migration for these. FWIW,
                | GCP live migration on latency-sensitive workloads is
                | very noticeable and oftentimes straight up causes
                | instance crashes.
        
               | dwmw2 wrote:
               | Intrigued by this observation. What is it about your
               | experience that leads you to conclude that EC2 doesn't do
               | live migration?
               | 
                | And could it be phrased differently as "EC2 doesn't do
                | live migration _badly_"?
        
               | dilyevsky wrote:
                | Mainly the barrage of "instance hardware degradation"
                | emails that I get, whereas on GCP those are just
                | migrated (sometimes with a reboot/crash). Also there is
                | no brownout. I've never used t2/t3s, which apparently do
                | support migration, which would make sense.
        
               | my123 wrote:
                | After some kinds of hardware failure, it can become
                | impossible to do live migration safely. When a crash can
                | ensue due to a live migration from faulty HW, I'd argue
                | that it's much better to not attempt it.
        
             | free652 wrote:
              | Are you sure? AWS consistently requires me to migrate to a
              | different host. They go as far as shutting down instances,
              | but don't do any kind of live migration.
        
             | outworlder wrote:
             | Not really. Or at least not in the same league.
             | 
             | AWS doesn't have live migration at all. You have to
             | stop/start.
             | 
              | Azure technically does, but it doesn't always work (they
              | say 90%). 30 seconds is a long time.
              | 
              | VMware has live migration (and seems to be the closest to
              | what GCP does) but it is still an inferior user experience.
              | 
              | This is the key thing you are missing - GCP not only has
              | live migration, but it is completely transparent. We do
              | not have to initiate migration. GCP does it,
              | transparently, 100% of the time. We have never even
              | noticed migrations, even when we were actively watching
              | those instances. We don't know or care what hypervisors
              | are involved. They even preserve the network connections.
             | 
             | https://cloudplatform.googleblog.com/2015/03/Google-
             | Compute-...
        
               | jiggawatts wrote:
               | VMware's live migration is totally seamless, so I don't
               | know what you mean by "inferior user experience". You
               | typically see less than a second of packet loss, and a
               | small performance hit for about a minute while the memory
               | is "swapped" across to the new machine. Similarly, VMware
               | has had live storage migration for years.
               | 
               | VMware is lightyears ahead of the big clouds, but
               | unfortunately they "missed the boat" on the public cloud,
               | despite having superior foundational technology.
               | 
               | For example:
               | 
               | - A typical vSphere cluster would use live migration to
               | balance workloads dynamically. You don't notice this as
               | an end user, but it allows them to bin-pack workloads up
               | above 80% CPU utilisation in my experience with good
               | results. (Especially if you allocate priorities, min/max
               | limits, etc...)
               | 
               | - You can version-upgrade a vSphere cluster live. This
               | includes rolling hypervisor kernel upgrades and _live
               | disk format changes_. The upgrade wizard is a fantastic
               | thing that asks only for the cluster controller name and
                | login details! Click "OK" and watch the progress bar.
               | 
                | - Flexible keep-apart and keep-together rules that can
                | be updated at any time, and will take effect via live
                | migration. This is sort-of like the Kubernetes "control
                | loops", but the migrations are live and memory-
                | preserving instead of stop-start like with containers.
               | 
               | - Online changes to virtual hardware, including adding
               | not just NICs and disks, but also CPU and memory!
               | 
               | - Thin-provisioned disks, and memory deduplication for
               | efficiencies approaching that of containerisation.
               | 
               | - Flexible snapshots, including the ability for "thin
               | provisioned" virtual machines to share a base snapshot.
               | This is often used for virtual desktops or terminal
               | services, and again this approaches containerisation in
               | terms of cloning speed and storage efficiency.
               | 
               | In other words, VMware had all of the pieces, and just...
               | didn't... use it to make a public cloud. We could have
               | had "cloud.vmware.com" or whatever 15 years ago, but they
               | decided to slowly jack up the price on their enterprise
               | customers instead.
               | 
               | For comparison, in Azure: You can't add a VM to an
               | availability set (keep apart rule) or remove the VM from
               | it without a stop-start cycle. You can't make most
               | changes (SKU, etc...) to a VM in an availability set
               | without turning off _every_ machine in the same AS! This
               | is just one example of many where the public cloud has a
               | "checkbox" availability feature that actually _decreases_
               | availability. For a long time, changing an IP address in
               | AWS required the VM to be basically blown away and
               | recreated. That brought back memories of the Windows NT 4
               | days in 1990s when an IP change required a reboot cycle.
        
               | sofixa wrote:
                | So, I used to be a part-time vSphere admin, worked with
                | many others, and had to automate the hell out of it to
                | deal as little as possible with that dumpster fire.
               | 
               | No, VMware didn't miss the boat, vCloud Air was announced
               | in 2009 and made generally available in 2013. Roughly
               | same timelines as Azure and GCP, slightly trailing AWS,
               | and those were the early days, where the public cloud was
               | still exotic. And VMware had the massive advantage of
               | brand recognition in that domain and existing footprint
               | with enterprises which could be scaled out.
               | 
                | Problem was, vCloud Air, like vSphere, was shit. Yeah,
                | it did some things well, and had some very nice features
                | - vMotion, DRS (though it doesn't really use CPU-ready
                | contention for scheduling decisions, which is stupid),
                | vSAN, hot-adding resources (but not RAM, because decades
                | ago Linux had issues if you had less than 4GB RAM and
                | you added more, so to this day you can't do that). When
                | they worked, that is - because when they didn't, good
                | luck: error messages are useless, and logs are weirdly
                | structured and uselessly verbose, so a massive pain to
                | deal with. Oh, and many of those features were either
                | behind a Flash UI (FFS), or an abomination of an API
                | that is inconsistent ("this object might have been
                | deleted or hasn't been created yet") and had weird
                | limitations, like not being able to check an async
                | task's status details. And many of those features were
                | so complex that a random consuming user basically had to
                | rely on a dedicated team of vExperts, which often
                | resulted in a nice silo slowing everyone down.
               | 
               | Their hardware compatibility list was a joke - the Intel
               | X710 NIC stayed on it for more than a year with a widely
               | known terribly broken driver.
               | 
               | But what made VMware fail the most, IMHO, was the wrong
               | focus, technically - VM, instead of application. A
               | developer/ops person couldn't care less about the object
               | of a VM. Of course they tried some things like vApp and
               | vCloud Director etc. which are just disgusting
               | abominations designed with a PowerPoint in mind, not a
               | user. And pricing. Opaque and expensive, with bad
               | usability. No wonder everyone jumped on the pay as you
               | go, usable alternatives.
        
               | w0m wrote:
               | > many of those features were either behind a Flash
               | UI(FFS)
               | 
               | My introduction to the industry. The memories.
        
               | ithkuil wrote:
               | You're right to say that VMware has the right fundamental
               | building blocks and that they are mature enough
               | (especially the compute aspect).
               | 
               | But I think you underestimate the maturity and
               | effectiveness of the underlying google compute and
               | storage substrate.
               | 
               | (FWIW, I worked at both places)
               | 
                | Now, how Google's substrate maps onto GCP, that's
                | another story. There is a non-trivial amount of fluff to
                | be added on top of your building blocks to build a
                | manageable, multitenant, planet-scale cloud service.
                | Just the network infrastructure is mind-boggling.
                | 
                | I wouldn't be surprised if your experience with a
                | "VMware cloud" surprised you if you naively compared it
                | with your experience with a standalone vSphere cluster.
        
           | _msw_ wrote:
           | Disclosure: I work for Amazon, and in the past I worked
           | directly on EC2.
           | 
           | From the FAQ: https://aws.amazon.com/ec2/faqs/
           | 
           | Q: How does EC2 perform maintenance?
           | 
           | AWS regularly performs routine hardware, power, and network
           | maintenance with minimal disruption across all EC2 instance
           | types. To achieve this we employ a combination of tools and
           | methods across the entire AWS Global infrastructure, such as
           | redundant and concurrently maintainable systems, as well as
           | live system updates and migration.
        
             | darkwater wrote:
             | > AWS regularly performs routine hardware, power, and
             | network maintenance with minimal disruption across all EC2
             | instance types. To achieve this we employ a combination of
             | tools and methods across the entire AWS Global
             | infrastructure, such as redundant and concurrently
             | maintainable systems, as well as live system updates and
             | migration.
             | 
              | And yet, I keep getting emails like this almost every
              | week:
             | 
             | "EC2 has detected degradation of the underlying hardware
             | hosting your Amazon EC2 instance (instance-ID: i-xxxxxxx)
             | associated with your AWS account (AWS Account ID: NNNNN) in
             | the eu-west-1 region. Due to this degradation your instance
             | could already be unreachable. We will stop your instance
             | after 2022-09-21 16:00:00 UTC"
             | 
             | And we don't have tens of thousands of VMs in that region,
             | just around 1k.
        
               | _msw_ wrote:
               | Live migration can't be used to address every type of
               | maintenance or underlying fault in a non-disruptive way.
        
           | voiper1 wrote:
           | Linode also uses live migrations now for most (all?)
           | maintenance.
        
         | shrubble wrote:
         | Assuming this blurb is accurate: " General-purpose SSD volume
         | (gp3) provides the consistent 125 MiB/s throughput and 3000
         | IOPS within the price of provisioned storage. Additional IOPS
         | (up to 16,000) and throughput (1000 MiB/s) can be provisioned
         | with an additional price. The General-purpose SSD volume (gp2)
         | provides 3 IOPS per GiB storage provisioned with a minimum of
         | 100 IOPS"
         | 
         | ... then it seems like a device that limits bandwidth either on
         | the storage cluster or between the node and storage cluster is
         | present. 125MiB/s is right around the speed of a 1gbit link, I
         | believe. That it was a networking setting changed in-switch
         | doesn't seem to be surprising.
        
           | nonameiguess wrote:
           | This would have been my guess. All EBS volumes are stored on
           | a physical disk that supports the highest bandwidth and IOPS
           | you can live migrate to, and the actual rates you get are
           | determined by something in the interconnect. Live migration
           | is thus a matter of swapping out the interconnect between the
           | VM and the disk or even just relaxing a logical rate-limiter,
           | without having to migrate your data to a different disk.
        
             | prmoustache wrote:
             | The actual migration is not instantaneous despite the
             | volume being immediately reported as gp3. You get a status
             | change to "optimizing" if my memory is correct with a
             | percentage. And the higher the volume the longer it takes
             | so there is definitely a sync to faster storage.
        
       | 0xbadcafebee wrote:
        | Reliability in general is measured on the basic principle of:
        | _does it function within our defined expectations?_ As long as
        | it's launching, and it eventually responds within SLA/SLO
        | limits, and on failure comes back within SLA/SLO limits, it is
        | reliable. Even with GCP's multiple failures to launch, that may
        | still be considered "reliable" within their SLA.
        | 
        | If both AWS and GCP had the same SLA, and one did better than
        | the other at starting up, you could say one is _more performant_
        | than the other, but you couldn't say it's _more reliable_ if
        | they are both meeting the SLA. It's easy to look at something
        | that never goes down and say "that is more reliable", but it
        | might have been pure chance that it never went down. Always read
        | the fine print, and don't expect anything better than what they
        | guarantee.
        
       | mnutt wrote:
       | It may or may not matter for various use cases, but the EC2
       | instances in the test use EBS and the AMIs are lazily loaded from
       | S3 on boot. So it may be possible that the boot process touches
       | few files and quickly gets to 'ready' state, but you may have
       | crummy performance for a while in some cases.
       | 
       | I haven't used GCP much, but maybe they load the image onto the
       | node prior to launch, accounting for some of the launch time
       | difference?
        
       | cmcconomy wrote:
       | I wish Azure was here to round it out!
        
       | londons_explore wrote:
        | AWS normally has machines sitting idle, just waiting for you to
        | use. That's why they can get you going in a couple of seconds.
        | 
        | GCP, on the other hand, fills all machines with background jobs.
        | When you want a machine, they need to terminate a background job
        | to make room for you. That background job has a shutdown grace
        | period. Usually that's 30 seconds.
       | 
       | Sometimes, to prevent fragmentation, they actually need to
       | shuffle around many other users to give you the perfect slot -
       | and some of those jobs have start-new-before-stop-old semantics -
       | that's why sometimes the delay is far higher too.
        
         | dekhn wrote:
         | borg implements preemption but the delay to start VMs is not
         | because they are waiting for a background task to clean up.
        
       | devxpy wrote:
       | Is this testing for spot instances?
       | 
       | In my limited experience, persistent (on-demand) GCP instances
       | always boot up much faster than AWS EC2 instances.
        
         | marcinzm wrote:
         | In my experience GPU persistent instances often simply don't
         | boot up on GCP due to lack of available GPUs. One reason I
         | didn't choose GCP at my last company.
        
           | devxpy wrote:
           | Oh interesting. Which region and GPU type were you working
           | with? (Asking so I can avoid in future)
        
             | marcinzm wrote:
             | I think it was us-east1 or us-east4. Had issues getting
             | TPUs as well in us-central1. I know someone at a larger
             | tech company who was told to only run certain workflows in
             | a specific niche European region as that's the only one
             | that had any A100 GPUs most of the time.
        
         | encryptluks2 wrote:
          | I noticed that too, and it does appear to be using spot
          | instances. I have a feeling that if it were run without them
          | you might see much better startup times. Spot instances on GCP
          | are hit and miss, and you sort of have to build that into your
          | workflow.
        
       | rwalle wrote:
       | Looks like the author has never heard of the word "histogram"
       | 
       | That graph is a pain to see.
        
         | DangitBobby wrote:
         | A histogram would take away one of the dimensions, probably
         | time, unless they resorted to some weird stacked layout.
         | Without time, people would complain that they don't know if it
         | was consistent across the tested period. The graph is fine.
        
       | charbull wrote:
       | Can you put this in context of the problem/use case /need you are
       | solving for ?
        
       | lmeyerov wrote:
       | I'd say that's a weak test of capacity. Would love to see on
       | Azure - T4s or an equiv aren't even really provided anymore!
       | 
       | We find reliability a diff story. Eg, our main source of downtime
       | on Azure is they restart (live migrate?) our reserved T4s every
       | few weeks, causing 2-10min outages per GPU per month.
        
       | DonHopkins wrote:
       | Anybody know if on GCP the cheaper ephemeral spot instances are
       | available on managed instance groups and Cloud Run, where it
       | spins up more instances according to demand, and if so, how well
       | it deals with replacing spot instances that drop dead? How about
       | AWS?
        
       | ajross wrote:
       | Worth pointing out that the article is measuring provisioning
       | latency and success rates (how quickly can you get a GPU box
       | running and whether or not you get an error back from the API
       | when you try), and not "reliability" as most readers would
       | understand it (how likely they are to do what you want them to do
       | without failure).
       | 
       | Definitely seems like interesting info, though.
        
       | [deleted]
        
       | [deleted]
        
       | rwiggins wrote:
       | There were 84 errors for GCP, but the breakdown says 74 409s and
       | 5 timeouts. Maybe it was 79 409s? Or 10 timeouts?
       | 
       | I suspect the 409 conflicts are probably from the instance name
       | not being unique in the test. It looks like the instance name
       | used was:
       | 
       |     instance_name = f"gpu-test-{int(time())}"
       | 
       | which has a 1-second precision. The test harness appears to do a
       | `sleep(1)` between test creations, but this sort of thing can
       | have weird boundary cases, particularly because (1) it does
       | cleanup after creation, which will have variable latency, (2)
       | `int()` will truncate the fractional part of the second from
       | `time()`, and (3) `time.time()` is not monotonic.
       | 
       | I would not ask the author to spend money to test it again, but I
       | think the 409s would probably disappear if you replaced
       | `int(time())` with `uuid.uuid4()`.
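       | 
       | As a rough sketch of just the name generation (not the author's
       | actual harness), something like this would keep names unique no
       | matter how fast the loop runs:
       | 
       |     import uuid
       | 
       |     # uuid4().hex is lowercase hex, so the result is still a
       |     # valid GCE instance name and cannot collide across runs.
       |     instance_name = f"gpu-test-{uuid.uuid4().hex}"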
       | 
       | Disclosure: I work at Google - on Google Compute Engine. :-)
        
         | sitharus wrote:
         | This is a very good point - AWS uses tags to give instances a
         | friendly name, so the name does not have to be unique. The same
         | logic would not fail on AWS.
        
           | smashed wrote:
           | Which makes 2000% sense.
           | 
           | Why would any tenant supplied data affect anything
           | whatsoever?
           | 
           | As a tenant, unless you are clashing with another resource
           | under your own name, I don't see the point of failing.
           | 
            | AWS S3 would be an exception, where they make that limitation
           | on globally unique bucket name very clear.
        
             | [deleted]
        
             | jsolson wrote:
             | Idempotency.
             | 
             | You're inserting a VM with a specific name. If you try to
             | create the same resource twice, the GCE control plane
             | reports that as a conflict.
             | 
             | What they're doing here would be roughly equivalent to
             | supplying the time to the AWS RunInstances API as an
             | idempotency token.
             | 
             | (I work on GCE, and asked an industry friend at AWS about
             | how they guarantee idempotency for RunInstances).
        
               | stevenguh wrote:
               | GCP control plane is generally not idempotent.
               | 
                | When trying to create the same resource twice, all
                | requests should report the same status instead of one
                | failing and one succeeding.
               | 
               | In AWS, their APIs allow you to supply a client token if
               | the API is not idempotent by default.
               | 
               | See https://docs.aws.amazon.com/AWSEC2/latest/APIReferenc
               | e/Run_I....
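                | 
                | For illustration, a minimal boto3 sketch of that
                | client token; the AMI ID and instance type below are
                | placeholders, not taken from the article. Retrying
                | with the same token returns the original reservation
                | rather than launching a duplicate:
                | 
                |     import uuid
                |     import boto3
                | 
                |     ec2 = boto3.client("ec2", region_name="us-east-1")
                |     token = str(uuid.uuid4())
                | 
                |     # Safe to reissue after a network glitch: same
                |     # token, same instance, no accidental duplicate.
                |     resp = ec2.run_instances(
                |         ImageId="ami-12345678",   # placeholder AMI
                |         InstanceType="g4dn.xlarge",
                |         MinCount=1,
                |         MaxCount=1,
                |         ClientToken=token,
                |     )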
        
               | akramer wrote:
               | The GCE API can be idempotent if you'd like. Fill out the
               | requestId field with the same UUID in multiple
               | instances.insert calls (or other mutation calls) and you
               | will receive the same operation Id back in response.
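                | 
                | A hedged sketch with the Python google-cloud-compute
                | client, assuming its generated InsertInstanceRequest
                | exposes that field as request_id (project, zone and
                | instance spec here are placeholders):
                | 
                |     import uuid
                |     from google.cloud import compute_v1
                | 
                |     client = compute_v1.InstancesClient()
                |     # Minimal spec for illustration; a real insert
                |     # also needs machine_type, disks, network, etc.
                |     instance = compute_v1.Instance(name="gpu-test-1")
                | 
                |     req = compute_v1.InsertInstanceRequest(
                |         project="my-project",
                |         zone="us-central1-b",
                |         instance_resource=instance,
                |         request_id=str(uuid.uuid4()),
                |     )
                |     # Reissuing with the same request_id should hand
                |     # back the same operation instead of a 409.
                |     operation = client.insert(request=req)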
               | 
               | Disclaimer: I work on GCE.
        
               | jsolson wrote:
               | Today I learned! I'll admit I didn't know this
                | functionality existed, and I've instead used
                | instances.insert followed by querying the VM resource.
               | 
               | This is nicer!
        
               | jsolson wrote:
                | > When trying to create the same resource twice, the second
               | should report success instead of failing.
               | 
               | Before I quibble with the idempotency point: I agree with
               | this, entirely, but it is what it is and a lot of
               | software has been written against the current behavior.
               | So I'll cite Hyrum's law here: https://www.hyrumslaw.com/
               | 
               | > GCP control plane is generally not idempotent.
               | 
               | The GCE API occupies an odd space here, imo. The resource
               | being created is, in practice, an operation to cause the
               | named VM to exist. The operation has its own name, but
               | the name of the VM in the insert operation is the name of
               | the ultimate resource.
               | 
               | Net, the API is idempotent at a macro level in terms of
               | the end-to-end creation or deletion of uniquely named
                | resources. Which is a long-winded way of saying that
               | you're right, but that from a practical perspective it
               | accomplishes enough of the goals of a truly idempotent
               | API to be _useful_ for avoiding the same things that the
               | AWS mechanism avoids: creation of unexpected duplicate
               | VMs.
               | 
               | The more "modern" way to do this would be to have a truly
               | idempotent description of the target state of the actual
               | resource with a separate resource for the current live
               | state, but we live with the sum of our past choices.
        
               | Terretta wrote:
               | It's a lot shorter to write:
               | 
               | You're right, we did it wrong.
               | 
               | // And paradoxically makes engineers like you.
        
               | jsolson wrote:
               | Sure, except I think that at a macro level we got it more
               | right than AWS, despite some choices that I believe we'd
               | make differently today.
        
               | bushbaba wrote:
               | Do you really need idempotency for runVM though.
        
               | jsolson wrote:
                | I mean, it's kinda nice to know that if you reissue a
                | request (due to a network glitch) for an instance that
                | could cost thousands of dollars per month, you won't
                | accidentally create two of them?
               | 
               | More practically, though, the instance name here is
               | literally the name of the instance as it appears in the
               | RESTful URL used for future queries about it. The 409
               | here is rejecting an attempt to create the same
               | explicitly named resource twice.
        
               | philliphaydon wrote:
               | Sounds like AWS got it right.
        
               | jsolson wrote:
               | You're entitled to that takeaway, but I disagree. I
               | believe GCP's tendency to use caller-supplied names for
               | resources is one of the single best features of the
               | platform, particularly when compared against AWS's random
               | hex identifiers.
               | 
               | Note that whether this creates collisions is entirely
               | under the customer's control. There's no requirement for
               | global uniqueness, just a requirement that you not try to
               | create two VMs with the same name in the same project in
               | the same zone.
        
               | philliphaydon wrote:
               | With GCE can you create 10 instances or do you need to
               | create all 10 individually?
        
               | jsolson wrote:
               | As far as I know, the `instances.insert` API only allows
               | individual VMs, although the CLI can issue a bulk set of
               | API calls[0], and MIGs (see below) allow you to request
               | many identical VMs with a single API call if that's for
               | some reason important.
               | 
               | You can also batch API calls[1], which also gives you a
               | response for each VM in the batch while allowing for a
               | single HTTP request/response.
               | 
                | That said, if you want to create a set of effectively
                | identical VMs all matching a template (i.e., cattle not
                | pets), or you want to issue a single API call,
               | we'd generally point you to managed instance groups[2]
               | (which can be manually or automatically scaled up or
               | down) wherein you supply an instance template and an
               | instance count. The MIG is named (like nearly all GCP
               | resources), as are the instances, with a name derived
               | from the MIG name. After creation you can also have the
               | group abandon the instances and then delete the group if
               | you _really_ wanted a bunch of unmanaged VMs created
                | through a single API call, although I'll admit I can't
               | think of a use-case for this (the abandon API is
               | generally intended for pulling VMs out of a group for
               | debugging purposes or similar).
               | 
                | For cases where for whatever reason you don't want a MIG
                | (e.g., because your VMs don't share a common template),
                | you can still group those together for monitoring
               | purposes[3], although it's an after-creation operation.
               | 
               | The MIG approach sets a _goal_ for the instance count and
               | will attempt to achieve (and maintain) that goal even in
               | the face of limited machine stock, hardware failures,
               | etc. The top-level API will reject (stock-out) in the
               | event that we're out of capacity, or in the batch/bulk
               | case will start rejecting once we run out of capacity. I
               | don't know how AWS's RunInstances behaves if it can only
               | partially fulfill a request in a given zone.
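                | 
                | A hedged sketch of the MIG path with the Python
                | client (template URL, project and names are
                | placeholders; field names assume the generated
                | compute_v1 messages):
                | 
                |     from google.cloud import compute_v1
                | 
                |     migs = compute_v1.InstanceGroupManagersClient()
                |     mig = compute_v1.InstanceGroupManager(
                |         name="gpu-test-mig",
                |         base_instance_name="gpu-test",
                |         instance_template=(
                |             "projects/my-project/global/"
                |             "instanceTemplates/gpu-template"
                |         ),
                |         target_size=10,
                |     )
                |     # One API call; the MIG then works to reach and
                |     # hold 10 instances, retrying through failures.
                |     op = migs.insert(
                |         project="my-project",
                |         zone="us-central1-b",
                |         instance_group_manager_resource=mig,
                |     )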
               | 
               | [0]: https://cloud.google.com/compute/docs/instances/mult
               | iple/cre...
               | 
               | [1]: https://cloud.google.com/compute/docs/api/how-
               | tos/batch
               | 
               | [2]: https://cloud.google.com/compute/docs/instance-
               | groups
               | 
               | [3]: https://cloud.google.com/compute/docs/instance-
               | groups/creati...
        
         | jrumbut wrote:
         | > (3) `time.time()` is not monotonic.
         | 
         | I just winced in pain thinking of the ways that can bite you. I
         | guess in a cloud/virtualized environment with many short lived
         | instances it isn't even that obscure an issue to run into.
         | 
         | A nice discussion on Stack Overflow:
         | 
         | https://stackoverflow.com/questions/64497035/is-time-from-ti...
        
           | flutas wrote:
           | > I just winced in pain thinking of the ways that can bite
           | you.
           | 
           | Something similar caused my favorite bug so far to track
           | down.
           | 
           | We were seeing odd spikes in our video playback analytics of
           | some devices watching multiple years worth of video in < 1
           | hour.
           | 
            | System.currentTimeMillis() in Java isn't monotonic either, is
           | my short answer for what was causing it. Tracking down _what_
           | was causing it was even more fun though. Devices (phones)
           | were updating their system time from the network and jumping
           | between timezones.
        
             | jrumbut wrote:
             | That's a bad day at the office when you have to go and say
             | "hey remember all that data we painstakingly collected and
             | maybe even billed clients for?"
        
           | rkangel wrote:
           | Yes. When people write `time.time()` they almost always
           | actually want `time.monotonic()`.
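            | 
            | A tiny sketch of the difference when measuring durations
            | (do_work is just a stand-in); time.monotonic() cannot jump
            | backwards when the system clock gets adjusted:
            | 
            |     import time
            | 
            |     start = time.monotonic()
            |     do_work()  # hypothetical workload
            |     elapsed = time.monotonic() - start  # never negative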
        
         | brianpan wrote:
         | Time is difficult.
         | 
         | Reminds of me this post on mtime which recently resurfaced on
         | HN: https://apenwarr.ca/log/20181113
        
         | mempko wrote:
         | What are your thoughts on the generally slower launch times
         | with a huge variance on GCP?
        
           | kevincox wrote:
           | FWIW in our use case of non-GPU instances they launched way
           | faster and more consistently on GCP than AWS. So I guess it
           | is complicated and may depend on exactly what instance you
           | are launching.
        
           | valleyjo wrote:
           | Just remember this is for GPU instances. Other vm families
           | are pretty fast to launch.
        
           | jhugo wrote:
           | At work we run some (non-GPU) instances in every AWS region,
           | and there's pretty big variability over time and region for
           | on-demand launch time. I'd expect it might be even higher for
           | GPU instances. I suspect that a more rigorous investigation
           | might find there isn't quite as big a difference overall as
           | this article suggests.
        
           | Crash0v3rid3 wrote:
           | The author failed to mention which regions these tests were
           | run. GPU availability can vary depending on the regions that
           | were tested for both Cloud providers.
        
             | ayewo wrote:
             | The author linked to the code at the end of the post.
             | 
             | The regions used are "us-east-1" for AWS [1] and "us-
             | central1-b" for GCP [2].
             | 
             | 1: https://github.com/piercefreeman/cloud-gpu-
             | reliability/blob/...
             | 
             | 2: https://github.com/piercefreeman/cloud-gpu-
             | reliability/blob/...
        
             | fomine3 wrote:
             | This is a big missed point.
        
         | Cthulhu_ wrote:
         | I've naively used millisecond precision things for a long time
         | - not in anything critical I don't think - but I've only
         | recently come to more of an awareness that a millisecond is a
         | pretty long time. Recent example is that I used a timestamp to
          | version a record in a database, but in a Go application a
          | record could feasibly be mutated multiple times per millisecond
          | by different users / processes / requests.
         | 
         | Unfortunately, millisecond-precise timestamps proved to be a
         | bit tricky in combination with sqlite.
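          | 
          | A quick, purely illustrative way to see the collision risk of
          | millisecond timestamps as version numbers:
          | 
          |     import time
          | 
          |     stamps = [int(time.time() * 1000) for _ in range(10_000)]
          |     # On most machines this prints far fewer unique values
          |     # than iterations, i.e. many "versions" would collide.
          |     print(len(stamps), len(set(stamps)))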
        
         | okdood64 wrote:
         | Hope icyfox can try running this with a fix.
        
       | PigiVinci83 wrote:
       | Thank you for this article, it confirms my direct experience.
       | Never run a benchmarking test but I can see this every day.
        
       | amaks wrote:
       | The link is broken?
        
         | lucb1e wrote:
         | Works for me using Firefox in Germany, although the article
         | doesn't really match the title so maybe that's why you were
         | confused? :p
        
       | jqpabc123 wrote:
       | Thanks for the report. It only confirms my judgment.
       | 
       | The word "Google" attached to anything is a strong indicator that
       | you should look for an alternative.
        
       | danielmarkbruce wrote:
       | It's meant to say "ephemeral"... right? It's hard to read after
       | that.
        
         | datalopers wrote:
         | ephemeral and ethereal are commonly confused words.
        
           | dublin wrote:
            | Ephemerides really throw them. (And thank God for PyEphem,
           | which makes all that otherwise quite fiddly stuff really
           | easy...)
        
           | danielmarkbruce wrote:
           | I guess that's fair. It's sort of a smell when someone uses
           | the wrong word (especially in writing) though. It suggests
           | they aren't in industry, throwing ideas around with other
           | folks. The word "ephemeral" is used extensively amongst
           | software engineers.
        
       | vienarr wrote:
       | The article only talks about GPU start time, but the title is
       | "CloudA vs CloudB reliability"
       | 
       | Bit of a stretch, right?
        
       | dark-star wrote:
       | I wonder why someone would equate "instance launch time" with
       | "reliability"... I won't go as far as calling it "clickbait" but
       | wouldn't some other noun ("startup performance is wildly
       | different") have made more sense?
        
         | santoshalper wrote:
         | I won't go so far as saying "you didn't read the article", but
         | I think you missed something.
        
         | xmonkee wrote:
         | GCP also had 84 errors compared to 1 for AWS
        
           | runeks wrote:
           | They were almost exclusively _user_ errors (HTTP 4xx). They
           | are supposed to indicate that the API is being used
           | incorrectly.
           | 
           | Although, it seems the author couldn't find out why they
           | occurred, which points to poor error messages and/or lacking
           | documentation.
        
           | antonvs wrote:
           | Another comment on this thread pointed out they had a
           | potential collision in their instance name generation which
           | may have caused this. That would mean this was user error,
           | not a reliability issue. AWS doesn't require instance names
           | to be unique.
        
           | danielmarkbruce wrote:
           | If not a 4xx, what should they return for instance not
           | available?
        
             | eurasiantiger wrote:
             | 503 service unavailable?
        
               | sn0wf1re wrote:
               | That would be confusing. The HTTP response code should
               | not be conflated with the application's state.
        
               | eurasiantiger wrote:
               | There will come a moment in time when you realize exactly
               | what you have stated here and why it is not a good mental
               | palace to live in.
        
               | danielmarkbruce wrote:
               | It's not the service that's unavailable. The resource
               | isn't available. The service is running just fine.
        
               | verdverm wrote:
               | GCP error messages will indicate if resources were not
               | available, if you reached your quota, or if it was some
                | other error. Tests like the OP's can differentiate these
                | situations.
        
               | yjftsjthsd-h wrote:
               | Yeah, 4xx is client error, 5xx is server error.
        
               | pdpi wrote:
               | Yes, and trying to create duplicate resources is a client
               | error.
        
               | eurasiantiger wrote:
               | Still, 409 seems inappropriate, as it is meant to signal
                | a version conflict, i.e. someone else changed something
                | and the user tried to upload a stale version.
               | 
               | "10.4.10 409 Conflict
               | 
               | The request could not be completed due to a conflict with
               | the current state of the resource. This code is only
               | allowed in situations where it is expected that the user
               | might be able to resolve the conflict and resubmit the
               | request. The response body SHOULD include enough
               | information for the user to recognize the source of the
               | conflict. Ideally, the response entity would include
               | enough information for the user or user agent to fix the
               | problem; however, that might not be possible and is not
               | required.
               | 
               | Conflicts are most likely to occur in response to a PUT
               | request. For example, if versioning were being used and
               | the entity being PUT included changes to a resource which
               | conflict with those made by an earlier (third-party)
               | request, the server might use the 409 response to
               | indicate that it can't complete the request. In this
               | case, the response entity would likely contain a list of
               | the differences between the two versions in a format
               | defined by the response Content-Type."
               | 
               | Then again, perhaps it is the service itself making that
               | state change.
        
               | dheera wrote:
               | Using HTTP error codes for non-REST things is cringe.
               | 
               | 503 would mean the IaaS API calls themselves are
               | unavailable. Very different from the API working
               | perfectly fine but the instances not being available.
        
               | devmunchies wrote:
                | What? REST is just some API philosophy, it doesn't even
               | have to be on top of HTTP.
               | 
               | Why would you think HTTP status codes are made for REST?
               | They are made for HTTP to describe the response of the
               | resource you are requesting, and the AWS API uses HTTP so
               | it makes sense to use HTTP status codes.
        
               | eurasiantiger wrote:
        
           | sheeshkebab wrote:
            | Maybe 1 reported. Not saying AWS reliability is bad, but the
            | number of glitches that crop up in various AWS services and
            | are not reflected on their status page is quite high.
        
             | theamk wrote:
              | That was measured from API call return codes, not by
              | looking at the overall service status page.
             | 
              | Amazon is pretty good about this: if their API says a
              | machine is ready, it usually is.
        
             | mcqueenjordan wrote:
             | Errors returned from APIs and the status page are
             | completely separate topics in this context.
        
         | mikewave wrote:
         | Well, if your system elastically uses GPU compute and needs to
         | be able to spin up, run compute on a GPU, and spin down in a
         | predictable amount of time to provide reasonable UX, launch
         | time would definitely be a factor in terms of customer-
         | perceived reliability.
        
           | HenriTEL wrote:
            | GCP provides elastic features for that. One should use them
            | instead of manually requesting new instances.
        
           | rco8786 wrote:
            | Sure, but not anywhere remotely near clearing the bar for
            | simply calling that "reliability".
        
             | Waterluvian wrote:
             | When I think "reliability" I think "does it perform the act
             | consistently?"
             | 
             | Consistently slow is still reliability.
        
             | VWWHFSfQ wrote:
             | I would still call it "reliability".
             | 
             | If the instance takes too long to launch then it doesn't
             | matter if it's "reliable" once it's running. It took too
             | long to even get started.
        
               | rco8786 wrote:
                | Why would you not call it "startup performance"?
               | 
               | Calling this reliability is like saying a Ford is more
               | reliable than a Chevy because the Ford has a better
               | throttle response.
        
               | endisneigh wrote:
               | that's not what reliability means
        
               | VWWHFSfQ wrote:
               | > that's not what reliability means
               | 
               | What is your definition of reliability?
        
               | endisneigh wrote:
               | unfortunately cloud computing and marketing have
               | conflated reliability, availability and fault tolerance
               | so it's hard to give you a definition everyone would
               | agree to, but in general I'd say reliability is referring
               | to your ability to use the system without errors or
               | significant decreases in throughput, such that it's not
               | usable for the stated purpose.
               | 
               | in other words, reliability is that it does what you
               | expect it to. GCP does not have any particular guarantees
               | around being able to spin up VMs fast, so its inability
               | to do so wouldn't make it unreliable. it would be like me
               | saying that you're unreliable for not doing something
               | when you never said you were going to.
               | 
               | if this were comparing Lambda vs Cloud Functions, who
               | both have stated SLAs around cold start times, and there
               | were significant discrepancies, sure.
        
               | pas wrote:
               | true, the grammar and semantics work out, but since
               | reliability needs a target usually it's a serious design
               | flaw to rely on something that never demonstrably worked
               | like your reliability target assumes.
               | 
               | so that's why in engineering it's not really used as
               | such. (as far as I understand at least.)
        
             | somat wrote:
             | It is not reliably running the machine but reliably getting
             | the machine.
             | 
              | Like the article said, the promise of the cloud is that you
              | can easily get machines when you need them. The cloud that
              | sometimes does not get you that machine (or does not get you
              | that machine in time) is a less reliable cloud than the one
              | that does.
        
               | onphonenow wrote:
               | If you want that promise you can reserve capacity in
                | various ways. Google has reservations. Folks use this for
                | DR; your org can get a pool of shared ones going if you
                | are going to have various teams leaning on GPUs, etc.
               | 
               | The promise of the cloud is that you can flexibly spin up
               | machines if available, and easily spin down, no long term
               | contracts or CapEx etc. They are all pretty clear that
               | there are capacity limits under the hood (and your
               | account likely has various limits on it as a result).
        
               | rco8786 wrote:
                | It's still performance. If this was "AWS failed to
               | deliver the new machines and GCP delivered", sure,
               | reliability. But this isn't that.
               | 
               | The race car that finishes first is not "more reliable"
                | than the one in 10th. They are equally reliable,
               | having both finished the race. The first place car is
               | simply faster at the task.
        
               | somat wrote:
               | The one in first can more reliably win races however.
        
               | [deleted]
        
               | rco8786 wrote:
               | You cannot infer that based on the results of the
               | race...that's literally the entire point I am making. The
               | 1st place car might blow up in the next race, the 10th
               | place car might finish 10th place for the next 100 races.
               | 
               | If the article were measuring HTTP response times and
               | found that AWS's average response time was 50ms and GCP's
               | was 200ms, and both returned 200s for every single
               | request in the test, would you say AWS is more reliable
               | than GCP based on that? Of course not, it's asinine.
        
             | [deleted]
        
           | jhugo wrote:
           | All the clouds are pretty upfront about availability being
           | non-guaranteed if you don't reserve it. I wouldn't call it a
           | reliability issue if your non-guaranteed capacity takes some
           | tens of seconds to provision. I mean, it might be _your_
           | reliability issue, because you chose not to reserve capacity,
            | but it's not really unreliability of the cloud -- they're
           | providing exactly what they advertise.
        
             | deanCommie wrote:
             | "Guaranteed" has different tiers of meaning - both
             | theoretical and practical.
             | 
             | In many cases, "guaranteed" just means "we'll give you a
             | refund if we fuck up". SLAs are very much like this.
             | 
             | IN PRACTICE, unless you're launching tens of thousands of
             | instances of an obscure image type, reasonable customers
              | would be able to get capacity from the cloud, and promptly.
             | 
             | That's the entire cloud value proposition.
             | 
              | So no, you can't just hand-wave past these GCP results and
             | say "Well, they never said these were guaranteed".
        
               | robbintt wrote:
               | This isn't actually true, even for tiny customers. In a
               | personal project, I used a single host of a single
               | instance type several times per day and had to code up a
               | fallback.
        
               | dilyevsky wrote:
                | Try spinning up 32+ core instances with local SSDs
                | attached, or anything not in the n1 family, and you will
                | find that in many regions you can only have like single
                | digits of them.
        
               | jhugo wrote:
               | Ignoring the fact that the results are probably partially
               | flawed due to methodology (see top-level comment from
               | someone who works on GCE) and are not reproducible due to
               | missing information, pointing out the lack of a guarantee
               | is not hand-waving. The OP uses the word "reliability" to
               | catch attention, which certainly worked, but this has
               | nothing to do with reliability.
        
           | dark-star wrote:
            | I'd still consider it a "performance issue", not
           | "reliability issue". There is no service unavailability here.
           | It just takes your system a minute longer until the target
           | GPU capacity is available. Until then it runs on fewer GPU
           | resources, which makes it slower. Hence performance.
           | 
           | The errors might be considered a reliability issue, but then
           | again, errors are a very common thing in large distributed
           | systems, and any orchestrator/autoscaler would just re-try
           | the instance creation and succeed. Again, a performance
           | impact (since it takes longer until your target capacity is
           | reached) but reliability? not really
        
             | irrational wrote:
             | I'd like to see a breakdown of the cost differences. If the
             | costs are nearly equal, why would I not choose the one that
             | has a faster startup time and fewer errors?
        
               | campers wrote:
               | With GCP you can right-size the CPU and memory of the VM
               | the GPU is attached to, unlike the fixed GPU AWS
               | instances, so there is the potential for cost savings
               | there.
        
           | pier25 wrote:
           | Wouldn't Cloud Run be a better product for that use case?
        
           | mikepurvis wrote:
           | Hopefully anyone with a workload that's that latency
            | sensitive would have a preallocated pool of warmed-up
           | instances ready to go.
        
           | Art9681 wrote:
           | Why would you scale to zero in high perf compute? Wouldn't it
           | be wise to have a buffer of instances ready to pick up
            | workloads instantly? I get that it shouldn't be necessary with
           | a reliable and performant backend, and that the cost of
           | having some instances waiting for job can be substantial
           | depending on how you do it, but I wonder if the cost
           | difference between AWS and GCP would make up for that and you
           | can get an equivalent amount of performance for an equivalent
           | price? I'm not sure. I'd like to know though.
        
             | thwayunion wrote:
             | _> Why would you scale to zero in high perf compute?_
             | 
             | Midnight - 6am is six hours. The on demand price for a G5
             | is $1/hr. That's over $2K/yr, or "an extra week of skiing
             | paid for by your B2B side project that almost never has
             | customers from ~9pm west coat to ~6am east coast". And I'm
             | not even counting weekends.
             | 
             | But that's sort of a silly edge case (albeit probably a
             | real one for lots of folks commenting here). The _real_
              | savings are in predictable startup times for bursty
              | workloads. Fast, low-variance startup times unlock a huge
             | amount of savings. Without both speed and predictability,
             | you have to plan to fail and over-allocate. Which can get
             | really expensive fast.
             | 
             | Another way to think about this is that zero isn't special.
             | It's just a special case of the more general scenario where
             | customer demand exceeds current allocation. The larger your
             | customer base, and the burstier your demand, the more
             | instances you need sitting on ice to meet customers' UX
             | requirements. This is particularly true when you're growing
             | fast and most of your customers are new; you really want a
             | good customer experience every single time.
        
             | diroussel wrote:
             | Scaling to zero means zero cost when there is zero work. If
             | you have a buffer pool, how long do you keep it populated
             | when you have no work?
             | 
             | Maintaining a buffer pool is hard. You need to maintain
             | state, have a prediction function, track usage through
              | time, etc. Just spinning up new nodes for new work is
             | substantially easier.
             | 
             | And the author said he could spin up new nodes in 15
              | seconds; that's pretty quick.
        
         | irjustin wrote:
         | I'll say it is valid to use reliability.
         | 
          | If I depend on some performance metric (startup, speed, etc.),
          | my dependence on it equates to reliability: not just on/off,
          | but the spectrum that it produces.
         | 
         | If a CPU doesn't operate at its 2GHz setting 60% of the time, I
         | would say that's not reliable. When my bus shows up on time
         | only 40% of the time - I can't rely on that bus to get me where
         | I need to go consistently.
         | 
         | If the GPU took 1 hour to boot, but still booted, is it
         | reliable? What about 1 year? At some point it tips over an
         | "personal" metric of reliability.
         | 
          | The comparison to AWS, which consistently outperforms GCP,
          | implicitly (if not explicitly) turns that into a reliability
          | metric by setting the AWS boot time as "the standard".
        
         | RajT88 wrote:
          | Reliability is a fair term, with an asterisk. It is a specific
         | flavor of reliability: deployment or scaling or net-new or
         | allocation or whatever you want to call it.
        
         | thayne wrote:
         | Well, I mean it is measuring how reliably you can get a GPU
         | instance. But it certainly isn't the overall reliability. And
         | depending on your workflow, it might not even be a very
         | interesting measure. I would be more interested in seeing a
         | comparison of how long regular non-GPU instances can run
         | without having to be rebooted, and maybe how long it takes to
         | allocate a regular VM.
        
         | thesuperbigfrog wrote:
         | "AWS encountered one valid launch error in these two weeks
         | whereas GCP had 84."
         | 
          | 84 times as many launch errors seems like a reasonable basis
          | for calling it "less reliable".
        
         | iLoveOncall wrote:
         | It is clickbait, the real title should be "AWS vs. GCP on-
         | demand provisioning of GPU resources performance is wildly
         | different".
         | 
         | That said, while I agree that launch time and provisioning
         | error rate are not sufficient to define reliability, they are
         | definitely a part of it.
        
           | tl_donson wrote:
           | " AWS vs. GCP on-demand provisioning of GPU resources
           | performance is wildly different"
           | 
           | yeah i guess it does make sense that one didn't win the a/b
           | test
        
           | [deleted]
        
           | lelandfe wrote:
           | > wildly different
           | 
           | For this, I'd prefer a title that lets me draw my own
           | conclusions. 84 errors out of 3000 doesn't sound awful to
           | me...? But what do I know - maybe just give me the data:
           | 
           | "1 in 3000 GPUs fail to spawn on AWS. GCP: 84"
           | 
           | "Time to provision GPU with AWS: 11.4s. GCP: 42.6s"
           | 
           | "GCP >4x avg. time to provision GPU than AWS"
           | 
           | "Provisioning on GCP both slower and more error-prone than
           | AWS"
        
             | esrauch wrote:
             | 84 of 3000 failed is only "one nine"
        
               | [deleted]
        
         | hericium wrote:
          | Cloud reliability is not the same as the reliability of an
          | already-spawned VM.
          | 
          | Here it's the ability to launch new VMs to satisfy a project's
          | dynamic needs. A cloud provider should allow you to scale up in
          | a predictable way; when it doesn't, it can be called unreliable.
         | 
         | Also, "unreliable" is basically a synonym for "Google" these
         | days.
        
           | DonHopkins wrote:
           | Let me unreliable that for you.
        
             | ReptileMan wrote:
              | To be fair, their search is so crap lately that rolling the
              | dice is not the worst option in the world for finding a
              | result that will actually be useful.
        
         | rmah wrote:
         | They are talking about the reliability of AWS vs GCP. As a user
         | of both, I'd categorize predictable startup times under
         | reliability because if it took more than a minute or so, we'd
         | consider it broken. I suspect many others would have even
         | tighter constraints.
        
         | chrismarlow9 wrote:
         | I mean if you're talking about worst case systems you assume
         | everything is gone except your infra code and backups. In that
         | case your instance launch time would ultimately define what
         | your downtime looks like assuming all else is equal. It does
         | seem a little weird to define it that way but in a strict sense
         | maybe not.
        
       ___________________________________________________________________
       (page generated 2022-09-22 23:03 UTC)