[HN Gopher] Summary of the AWS Service Event in the Northern Vir...
___________________________________________________________________
Summary of the AWS Service Event in the Northern Virginia (US-
East-1) Region
Author : eigen-vector
Score : 603 points
Date : 2021-12-10 22:54 UTC (1 day ago)
(HTM) web link (aws.amazon.com)
(TXT) w3m dump (aws.amazon.com)
| almostdeadguy wrote:
| > The AWS container services, including Fargate, ECS and EKS,
| experienced increased API error rates and latencies during the
| event. While existing container instances (tasks or pods)
| continued to operate normally during the event, if a container
| instance was terminated or experienced a failure, it could not be
| restarted because of the impact to the EC2 control plane APIs
| described above.
|
| This seems pretty obviously false to me. My company has several
| EKS clusters in us-east-1 with most of our workloads running on
| Fargate. All of our Fargate pods were killed and were unable to
| be restarted during this event.
| ClifReeder wrote:
| Strong agree. We were using Fargate nodes in our us-east-1 EKS
| cluster and not all of our nodes dropped, but every coredns pod
| did. When they came back up their age was hours older than
| expected, so maybe a problem between Fargate and the scheduler
| rendered them "up" but unable to be reached?
|
| Either way, was surprising to us that already provisioned
| compute was impacted.
| silverlyra wrote:
| Saw the same. The only cluster services I was running in
| Fargate were CoreDNS and cluster-autoscaler; thought it would
| help the clusters recover from anything happening to the node
| group where other core services run. Whoops.
|
| Couldn't just delete the Fargate profile without a working
| EKS control plane. I lucked out in that the label selector
| the kube-dns Service used was disjoint from the one I'd set
| in the Fargate profile, so I just made a new "coredns-
| emergency" deployment and cluster networking came back.
| (cluster-autoscaler was moot since we couldn't launch
| instances anyway.)
|
| I was hoping to see something about that in this
| announcement, since the loss of live pods is nasty. Not
| inclined to rely on Fargate going forward. It is curious that
| you saw those pod ages; maybe Fargate kubelets communicate
| with EKS over the AWS internal network?
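|
| A minimal sketch of that kind of workaround, for the curious
| (untested; the Kubernetes Python client is used for illustration,
| and the "coredns-emergency" name and the extra label are
| assumptions):
|
|   # Clone the CoreDNS pod spec into a new Deployment whose labels
|   # still match the kube-dns Service selector (k8s-app: kube-dns)
|   # but not the Fargate profile selector, so the pods land on
|   # regular EC2 nodes instead of Fargate.
|   from kubernetes import client, config
|
|   config.load_kube_config()
|   apps = client.AppsV1Api()
|
|   existing = apps.read_namespaced_deployment("coredns", "kube-system")
|   labels = {"k8s-app": "kube-dns", "app": "coredns-emergency"}
|   emergency = client.V1Deployment(
|       metadata=client.V1ObjectMeta(name="coredns-emergency",
|                                    namespace="kube-system"),
|       spec=client.V1DeploymentSpec(
|           replicas=2,
|           selector=client.V1LabelSelector(match_labels=labels),
|           template=client.V1PodTemplateSpec(
|               metadata=client.V1ObjectMeta(labels=labels),
|               # reuse the existing CoreDNS pod spec unchanged
|               spec=existing.spec.template.spec,
|           ),
|       ),
|   )
|   apps.create_namespaced_deployment("kube-system", emergency)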
| whatever1 wrote:
| Noob question, but why does network infrastructure need DNS? Why
| don't the full IPv6 addresses of the various components suffice
| to do business?
| AUX4R6829DR8 wrote:
| That "internal network" hosts an awful lot of stuff- it's not
| just network hardware, but services that mostly use DNS to find
| each other. Besides that, it's just plain useful for network
| devices to have names.
|
| (Source: Work at AWS.)
| nijave wrote:
| It's basically used for service discovery. At a certain point,
| you have too many different devices which are potentially
| changing to identify them by IP. You want some abstraction
| layer to separate physical devices from services, and DNS lets
| you do things like advertise different IPs at different times
| in different network zones.
| soheil wrote:
| Having an internal network like this that everything on the main
| AWS network so heavily depends on is just bad design. One does
| not create a stable high-tech spacecraft and then fuel it with
| coal.
| discodave wrote:
| It's 2006, you work for an 'online book store' that's
| experimenting with this cloud thing. Are you going to build a
| whole new network involving multi-million dollar networking
| appliances?
| jvolkman wrote:
| Developing EC2 did involve building a whole new network. It
| was a new service, built from the ground up to be a public
| product.
| User23 wrote:
| Amazon was famously frugal during that era.
| soheil wrote:
| No but one would hope 15 years and 1 trillion dollars later
| you would stop running it on the computer under your desk.
| ripper1138 wrote:
| Might seem that way because of what happened, but the main
| network is probably more likely to fail than the internal
| network. In those cases, running monitoring on a separate
| network is critical. EC2 control plane same story.
| soheil wrote:
| The entire value proposition for AWS is "migrate your
| internal network to us so it's more stable with less
| management." I buy that 100%, and I think you're wrong to
| assume their main network is more likely to fail than their
| internal one. They have every incentive to continuously
| improve it because it's not made just for one client.
| zeckalpha wrote:
| What would you recommend instead? Control plane/data plane is a
| well known pattern.
| divbzero wrote:
| > _This congestion immediately impacted the availability of real-
| time monitoring data for our internal operations teams, which
| impaired their ability to find the source of congestion and
| resolve it._
|
| Disruption of the standard incident response mechanism seems to
| be a common element of longer lasting incidents.
| femiagbabiaka wrote:
| It is. And to add, all the automation we rely on in peacetime
| can often complicate cross-cutting wartime incidents by raising
| the ambient complexity of an environment. See Bainbridge for more:
| https://blog.acolyer.org/2020/01/08/ironies-of-automation/
| sponaugle wrote:
| Indeed - Even the recent facebook outage outlined how slow
| recovery can be if the primary investigation and recovery
| methods are directly impacted as well. Back in the old days
| some environments would have POTS dial-in connections to the
| consoles as backup for network problems. That of course doesn't
| scale, but it was an attempt to have an alternate path of
| getting to things. Regrettably if a backhoe takes out all of
| the telecom at once that plan doesn't work so well.
| daenney wrote:
| Yup. There was a GCP outage a couple of years ago like this.
|
| I don't remember the exact details, but it was something along
| the lines of a config change went out that caused systems to
| incorrectly assume there were huge bandwidth constraints. Load
| shedding kicked in to drop lower priority traffic which
| ironically included monitoring data rendering GCP responders
| blind and causing StackDriver to go blank for customers.
| Ensorceled wrote:
| There are a lot of comments in here that boil down to "could you
| do infrastructure better?"
|
| No, absolutely not. That's why I'm on AWS.
|
| But what we are all ACTUALLY complaining about is ongoing lack of
| transparent and honest communications during outages and,
| clearly, in their postmortems.
|
| Honest communications? Yeah, I'm pretty sure I could do _that_
| much better than AWS.
| DenisM wrote:
| > Customers accessing Amazon S3 and DynamoDB were not impacted by
| this event.
|
| We've seen plenty of S3 errors during that period. Kind of
| undermines credibility of this report.
| davidstoker wrote:
| Do you use VPC endpoints for S3? The next sentence explained
| failures I observed with S3: "However, access to Amazon S3
| buckets and DynamoDB tables via VPC Endpoints was impaired
| during this event."
| karmelapple wrote:
| I could not modify file properties in S3, uploading new or
| modified files was spotty, and AWS Console GUI access was
| broken as well. Was that because of VPC endpoints?
| chickenpotpie wrote:
| "Customers also experienced login failures to the AWS
| Console in the impacted region during the event"
| DenisM wrote:
| No, I use the normal endpoint.
| marcinzm wrote:
| DAX, part of DynamoDB from how AWS groups things, was throwing
| internal server errors for us and eventually we had to reboot
| nodes manually. That's separate from the STS issues we had in
| terms of our EKS services connecting to DAX.
| Uyuxo wrote:
| Same. I saw intermittent S3 errors the entire time. Also
| nothing mentioned about SQS even though we were seeing errors
| there as well.
| banana_giraffe wrote:
| Yeah
|
| > For example, while running EC2 instances were unaffected by
| this event
|
| That ignores the 3-5% drop in traffic I saw in us-east-1 on EC2
| instances that only talk to peers on the Internet with TCP/IP
| during this event.
| manquer wrote:
| I guess you have to read for this kind of thing hidden in
| careful language: _running_ the instances had no problem; it's
| a different matter that they had limited connectivity. From
| AWS's point of view they don't seem to describe user impact,
| only impact to their services.
|
| Perhaps that distinction has value if your workloads didn't
| depend on external network connectivity, for example pure
| compute DS/ML jobs with no S3 access outside the VPC.
| grogers wrote:
| How are you measuring this? Remember that cloudwatch was
| apparently also losing metrics, so aggregating CW metrics
| might show that kind of drop.
| banana_giraffe wrote:
| Yeah, I know. This was based off instance store logging on
| these instances. For better or worse, they're very simple
| ports of pre-AWS on-prem servers, they don't speak AWS once
| they're up and running.
| wjossey wrote:
| I've been running platform teams on aws now for 10 years, and
| working in aws for 13. For anyone looking for guidance on how to
| avoid this, here's the advice I give startups I advise.
|
| First, if you can, avoid us-east-1. Yes, you'll miss new
| features, but it's also the least stable region.
|
| Second, go multi AZ for production workloads. Safety of your
| customer's data is your ethical responsibility. Protect it, back
| it up, keep it as generally available as is reasonable.
|
| Third, you're gonna go down when the cloud goes down. Not much
| use getting overly bent out of shape. You can reduce your
| exposure by just using their core systems (EC2, S3, SQS, LBs,
| CloudFront, RDS, ElastiCache). The more systems you use, the
| less reliable things will be. However, running your own key-value
| store, API gateway, event bus, etc., can also be way less
| reliable than using theirs. So, realize it's an operational
| trade-off.
|
| Degradation of your app / platform is more likely to come from
| you than AWS. You're gonna roll out bad code, break your own
| infra, overload your own system, way more often than Amazon is
| gonna go down. If reliability matters to you, start by examining
| your own practices first before thinking things like multi region
| or super durable highly replicated systems.
|
| This stuff is hard. It's hard for Amazon engineers. Hard for
| platform folks at small and mega companies. It's just, hard. When
| your app goes down, and so does Disney plus, take some solace
| that Disney in all their buckets of cash also couldn't avoid the
| issue.
|
| And, finally, hold cloud providers accountable. If they're
| unstable and not providing service you expect, leave. We've got
| tons of great options these days, especially if you don't care
| about proprietary solutions.
|
| Good luck y'all!
| manquer wrote:
| Easy to say "leave"; the technical lock-in that cloud service
| providers _by design_ choose to have makes it nearly impossible
| to leave.
|
| AWS (and others) make egress costs insanely expensive for any
| startup considering leaving with its data, and there is a
| constant push to either not support open protocols or to
| extend/expand them in ways that make it hard to migrate a code
| base easily.
|
| If the advice is to effectively use only managed open source
| components, then why AWS at all? Most competent mid-sized teams
| can do that much cheaper with a colo provider like OVH or
| Hetzner.
|
| The point of investing in AWS is not just to outsource running
| base infra; that point is lost if we are supposed to stay away
| from leveraging the kind of cloud-native services us mere
| mortals cannot hope to build or maintain.
|
| Also, this "avoid us-east-1" advice is a bit frustrating. AWS
| does not have to always experiment with new services in the same
| region; it is not marked as an experimental region and does not
| have reduced SLAs. If it is inferior/preview/beta, then call
| that out in the UI and the contract. And what about when there
| is no choice? If CloudFront is managed out of us-east-1, should
| we now not use it? Why use the cloud then?
|
| If your engineering only discovers scale problems in us-east-1
| along with customers, perhaps something is wrong? AWS could
| limit new instances in that region and spread the load. Playing
| with customers who are at your mercy like this, just because you
| can, is not nice.
|
| Disney can afford to go down, or to build its own cloud; small
| companies don't have deep pockets to do either.
| EnlightenedBro wrote:
| Lesson: build your services with Docker and Terraform. With
| that setup you can spin up a working clone of a decently
| sized stack in a different cloud provider in under an hour.
|
| Don't lock yourself in.
| lysium wrote:
| ...if you don't have much data, that is. Otherwise, you'll
| have huge egress costs.
| rorymalcolm wrote:
| This is just not true for Terraform at all, they do not aim
| to be multi cloud and it is a much more usable product
| because of it. Resource parameters do not swap out directly
| across providers (rightly so, the abstractions they choose
| are different!).
| manquer wrote:
| If the setup is that portable, you probably don't need AWS
| at all in the first place.
|
| If you use only services built and managed from your own
| Docker images, why use the cloud at all? It would be
| cheaper to host on a smaller vendor; the reliability of the
| big clouds is not substantially better than tier-two
| vendors, and the difference between, say, OVH and AWS is
| not valuable enough to most applications to be worth the
| premium.
|
| IMO, if you don't leverage the cloud-native services offered
| by GCP or AWS, then the cloud is not adding much value to
| your stack.
| cherioo wrote:
| > AWS (and others) make egress costs insanely expensive for
| any startup to consider leaving with their data
|
| I have seen this repeated many times, but don't understand
| it. Yes, egress is expensive, but it is not THAT expensive
| compared to storage. S3 egress per GB is no more than 3x the
| monthly price of storage, i.e. moving out costs about 3
| months of storage (there's also an API cost, but that's not
| the one often mentioned).
|
| Is egress pricing being a lock-in factor just a myth? Is
| there some other AWS cost I'm missing? Obviously there will
| be big architectural and engineering cost to move, but that's
| just part of life.
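|
| Back-of-the-envelope, using roughly the late-2021 list prices
| (both rates are approximate and tier down with volume):
|
|   # Compare monthly S3 Standard storage cost to a one-time
|   # internet egress of the same data.
|   STORAGE_PER_GB_MONTH = 0.023  # ~S3 Standard, us-east-1
|   EGRESS_PER_GB = 0.09          # ~first-tier internet egress
|
|   data_gb = 100 * 1024          # 100 TB
|   monthly_storage = data_gb * STORAGE_PER_GB_MONTH
|   one_time_egress = data_gb * EGRESS_PER_GB
|   print(f"storage ${monthly_storage:,.0f}/month, "
|         f"egress ${one_time_egress:,.0f} one-time "
|         f"(~{one_time_egress / monthly_storage:.1f} months)")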
| manquer wrote:
| 3 months is only if you use standard S3. However, Intelligent-
| Tiering, Infrequent Access, Reduced Redundancy, or Glacier
| Instant Retrieval can be substantially cheaper without
| impacting retrieval time. [1]
|
| At scale, when costs matter, you would have a lifecycle policy
| tuned to your needs that takes advantage of these classes. A
| typical production workload is hardly paying the S3 base
| price for all or most of its storage; it will have a mix of
| all of these too.
|
| [1] If there is substantial data in regular Glacier, the
| costing completely blows through the roof; retrieval plus
| egress makes it infeasible unless you actively hate AWS
| enough to spend that kind of money.
| bushbaba wrote:
| Often the other cloud vendors will help cover those
| migration costs as part of your contract negotiations.
|
| But really, egress costs aren't locking you in. It's the
| hard-coded AWS APIs, Terraform scripts and technical debt.
| Having to change all of that and refactor and reoptimize for
| a different provider's infrastructure is a huge endeavor.
| That time spent might have a higher ROI being put elsewhere.
| oasisbob wrote:
| > Third, you're gonna go down when the cloud goes down. Not
| much use getting overly bent out of shape.
|
| Ugh. I have a hard time with this one. Back in the day, EBS had
| some really awful failures and degradations. Building a
| greenfield stack that specifically avoided EBS and stayed up
| when everyone else was down during another mass EBS failure
| felt marvelous. It was an obvious avoidable hazard.
|
| It doesn't mean "avoid EBS" is good advice for the decade to
| follow, but accepting failure fatalistically doesn't feel right
| either.
| speedgoose wrote:
| > Third, you're gonna go down when the cloud goes down.
|
| Not necessarily. You just need to not be stuck with a single
| cloud provider. The likelihood of more than one availability
| zone going down on a single cloud provider is not that low in
| practice. Especially when the problem is a software bug.
|
| The likelihood of AWS, Azure, and OVH going down at the same
| time is low. So if you need to stay online when AWS fails, don't
| put all your eggs in the AWS basket.
|
| That means not using proprietary cloud solutions from a single
| cloud provider; it has a cost, so it's not always worth it.
| chii wrote:
| > using proprietary cloud solutions from a single cloud
| provider, it has a cost so it's not always worth it.
|
| but perhaps some software design choices could be made to
| alleviate these costs. For example, you could have a read-
| only replica on Azure or whatever backup cloud provider, and
| design your software interfaces to allow the use of such read-
| only replicas; at least you'd be degraded rather than
| unavailable. Ditto with web servers etc.
|
| This has a cost, but it's lower than entirely replicating all
| of the proprietary features in a different cloud.
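|
| A tiny sketch of that degraded-read idea (all names here are
| made up, not any particular provider's API):
|
|   class PrimaryUnavailable(Exception):
|       pass
|
|   def read_with_fallback(key, read_primary, read_replica):
|       """Serve reads from the primary store; if it is down,
|       fall back to a read-only replica on a second provider."""
|       try:
|           return read_primary(key)
|       except PrimaryUnavailable:
|           return read_replica(key)  # possibly stale, but available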
| bombcar wrote:
| True multi-cloud redundancy is hard to test - because it's
| everything from DNS on up and it's hard to ask AWS to go
| offline so you can verify Azure picks up the slack.
| kqr wrote:
| Sure you can. Firewall AWS off from whatever machine does
| the health checks in the redundancy implementation.
| pixl97 wrote:
| What happens when your health check system fails?
| speedgoose wrote:
| It's true, but you can do load balancing at the DNS level.
| darkwater wrote:
| And you will get 1/N of requests timing out or erroring,
| while in the meantime paying 2x or 3x the costs. So it
| might be worth it in some cases, but you need to evaluate it
| very, very well.
| daguava wrote:
| You've written up my thoughts better than I can express them
| myself - I think what people get really stuck on when something
| like this happens is the 'can I solve this myself?' aspect.
|
| A wait for X provider to fix it for you situation is infinitely
| more stressful than an 'I have played myself, I will now take
| action' situation.
|
| Situations outside your (immediate) control to resolve feel
| infinitely worse, even if the customer impact of your own
| fault vs. a cloud fault is, in practice, the same.
| dalyons wrote:
| For me it's the opposite... aws outages are _much_ less
| stressful than my own because I know there's nothing I /we
| can do about it, they have smart people working on it, and it
| will be fixed when it's fixed
| electroly wrote:
| I couldn't possibly disagree more strongly with this. I used
| to drive frantically to the office to work on servers in
| emergency situations, and if our small team couldn't solve
| it, there was nobody else to help us. The weight of the
| outage was entirely on our shoulders. Now I relax and refresh
| a status page.
| qwertyuiop_ wrote:
| Or rent bare metal servers like old times and be responsible
| for your own s*t
| aenis wrote:
| Still plenty of networking issues that can knock you down
| hard.
| tuldia wrote:
| ... and be responsible for your own s*t
|
| Don't miss the point: being able to do something about it,
| instead of a multi-hour outage where you're in the dark
| about what is going on.
| [deleted]
| chrisweekly wrote:
| Hey Wes! I upvoted your comment before I noticed your handle.
| +1 insightful, as usual
| mobutu wrote:
| Brown nose
| xyst wrote:
| > And, finally, hold cloud providers accountable. If they're
| unstable and not providing service you expect, leave. We've got
| tons of great options these days, especially if you don't care
| about proprietary solutions.
|
| Easy to say, but difficult to do in practice (leaving a cloud
| provider)
| kortilla wrote:
| > Safety of your customer's data is your ethical
| responsibility. Protect it, back it up, keep it as generally
| available as is reasonable.
|
| > Third, you're gonna go down when the cloud goes down. Not
| much use getting overly bent out of shape.
|
| "Whoops, our provider is down, sorry!" is not taking
| responsibility with customer data at all.
| Fordec wrote:
| Between this and Log4j, I'm just glad it's Friday.
| eigen-vector wrote:
| Unfortunately, patching vulns can't be put off for Monday.
| [deleted]
| _jal wrote:
| You are clearly not involved in patching.
| Fordec wrote:
| Simply already patched. Company sizes and number of attack
| surfaces vary. 22 hours is plenty of time for an input string
| filter on a centrally controlled endpoint and a dependency
| increment with the right CI pipeline.
| shagie wrote:
| Consider the possible ways for a string to be injected into
| any of the following: Apache Solr, Apache Druid, Apache Flink,
| ElasticSearch, Flume, Apache Dubbo, Logstash, Kafka.
|
| If you've got any of them, they're likely exploitable too.
|
| That list comes from:
| https://unit42.paloaltonetworks.com/apache-
| log4j-vulnerabili...
|
| The attack surface is quite a bit larger than many realize.
| I recently had a conversation with a person who wasn't at a
| Java shop so wasn't worried... until he said "oh, wait,
| ElasticSearch is vulnerable too?"
|
| You'll even see it in things like the connector between
| CouchBase and ElasticSearch (
| https://forums.couchbase.com/t/ann-elasticsearch-
| connector-4... ).
| EdwardDiego wrote:
| Kafka is still on log4j1. It's only vulnerable if you're
| using a JMSAppender.
| Fordec wrote:
| Lets see...
|
| Nope. Nope. Nope. Nope. Nope. Nope. Nope.
|
| aaaand...
|
| Nope. Plans for it, but not yet in production.
|
| Oh and before anyone starts, not in transitive
| dependencies either. Just good old bare metal EC2
| instances without vendor lock in.
| zeko1195 wrote:
| lol no
| [deleted]
| [deleted]
| revskill wrote:
| Most rate-limiter systems just drop invalid requests, which is
| not optimal as I see it.
|
| A better way would be to have two queues, one for valid
| messages and one for invalid messages.
| betaby wrote:
| Problem is that I have to defend our own infrastructure's real
| availability numbers vs the cloud's fictional "five nines". It's a
| losing game.
| ngc248 wrote:
| Yep, it should be made clear to whoever cares about your
| service's SLA that it is contingent upon AWS's SLAs et al.
| AWS's SLA is effectively an upper bound on yours :)
| Spivak wrote:
| All I'm hearing is that you can make up your own availability
| numbers and get away with it. When you define what it means to
| be up or down then reality is whatever you say it is.
|
| #gatekeep your real availability metrics
|
| #gaslight your customers with increased error rates
|
| #girlboss
| WoahNoun wrote:
| What are you trying to imply with that last hashtag?
| mpyne wrote:
| Some orgs really do have lousy availability figures (such as my
| own, the Navy).
|
| We have an environment we have access to for hosting webpages
| for one of the highest leaders in the whole Dept of Navy. This
| environment was _DOWN_ (not "degraded availability" or "high
| latencies"), literally off the Internet entirely, for
| CONSECUTIVE WEEKS earlier this year.
|
| Completely incommunicado as well. It just happened to start
| working again one day. We collectively shrugged our shoulders
| and resumed updating our part of it.
|
| This is an outlier example but even our normal sites I would
| classify as 1 "nine" of availability at best.
| EnlightenedBro wrote:
| This feels like it's a common thing in large enterprisey
| companies. Execs out of touch with technical teams, always
| pushing for more for less.
| garbagecoder wrote:
| Army, here. We're no better. And of course a few years back
| the whole system for issuing CACs was off for DAYS.
| JCM9 wrote:
| Obviously one hopes these things don't happen, but that's an
| impressive and transparent write up that came out quickly.
| markranallo wrote:
| It's not transparent at all. A massive amount of services were
| hard down for hours, like SNS, and were never acknowledged on the
| status page or in this write-up. This honestly reads like they
| don't truly understand the scope of things affected.
| nijave wrote:
| It sounded like the entire management plane was down and
| potentially part of the "data" plane too (management being
| config and data being get/put/poll to stateful resources)
|
| I saw in the Reddit thread someone mentioned all services
| that auth to other services on the backend were affected (not
| sure how truthful it is but that certainly made sense)
| JCM9 wrote:
| Cue the armchair infrastructure engineers.
|
| The reality is that there's a handful of people in the world that
| can operate systems at this sheer scale and complexity and I have
| mad respect for those in that camp.
| ikiris wrote:
| This outage report reads like a violation of the SRE 101
| checklist for networking management though.
| fckyourcloud wrote:
| There are very few people who can juggle running chainsaws with
| their penis.
|
| So maybe it's not something we should be doing then?
| [deleted]
| plausibledeny wrote:
| Isn't this the equivalent of "complaining about your meal in a
| restaurant, I'd like to see you do better."
|
| The point of eating at a restaurant is that I can't/don't want
| to cook. Likewise, I use AWS because I want them to do the hard
| work and I'm willing to pay for it.
|
| How does that abrogate my right to complain if it goes badly
| (regardless of whether I could/couldn't do it myself)?
| cube00 wrote:
| I think the distinction is you can say "I pay good money for
| you to do it properly and how dare you go down on me" but you
| become an "armchair infrastructure engineer" when you try and
| explain how you would have avoided the outage because you
| don't have the whole picture (especially based on a very
| carefully worded PR approved blog post).
| [deleted]
| tristor wrote:
| Some of us are in that camp and are looking at this outage and
| also pointing out that they continuously fail to accurately
| update their status dashboard in this and prior outages. Yes,
| doing what AWS does is hard, and yes, outages /will/ happen; it
| is no knock on them that this outage occurred. What is a knock
| is that they didn't communicate honestly while the outage was
| ongoing.
| JCM9 wrote:
| They address that in the post, and between Twitter, HN and
| other places there wasn't anyone legit questioning if
| something was actually broken. Contacts at AWS also all were
| very clear that yes something was going on and being
| investigated. This narrative that AWS was pretending nothing
| was wrong just wasn't true based on what we saw.
| DrBenCarson wrote:
| I'm going to leave it at this: the dashboards at AWS aren't
| automated.
|
| Say what you will, but I can automate a status dashboard in
| a couple days--yes, even at AWS scale.
|
| No reason the dashboard should be green for _hours_ while
| their engineers and support are aware things aren 't
| working.
| notinty wrote:
| Apparently VP approval is required to update it, i.e.
| they're a farce.
| moogly wrote:
| Hm. This post does not seem to acknowledge what I saw. Multiple
| hours of rate-limiting kicking in when trying to talk to S3 (eu-
| west-1). After the incident everything works fine without any
| remediations done on our end.
| MrBurnsa wrote:
| eu-west-1 was not impacted by this event. I'm assuming you saw
| 503 Slowdown responses, which are non-exceptional and happen
| for a multitude of reasons.
| moogly wrote:
| I see. Then I suppose that was an unfortunately timed
| happenstance (and we should look into that more closely).
| jtchang wrote:
| The complexity that AWS has to deal with is astounding. Sure
| having your main production network and a management network is
| common. But making sure all of it scales and doesn't bring down
| the other is what I think they are dealing with here.
|
| It must have been crazy hard to troubleshoot when you are flying
| blind because all your monitoring is unresponsive. Clearly more
| isolation, with clearly delineated information exchange points, is
| needed.
| dijit wrote:
| "But AWS has more operations staff than I would ever hope to
| hire" -- a common mantra when talking about using the cloud
| overall.
|
| I'm not saying I fully disagree. But consolidation of the
| world's hosting necessitates a very complicated platform and
| these things will happen, either due to that complexity,
| failures that can't be foreseen or good old fashioned Sod's
| law.
|
| I know AWS marketing wants you to believe it's all magic and
| rainbows, but it's still computers.
| beoberha wrote:
| I work for one of the Big 3 cloud providers and it's always
| interesting when giving RCAs to customers. The vast majority
| of our incidents are due to bugs in the "magic" components
| that allow us to operate at such a massive scale.
| rodmena wrote:
| I was alive, however because I could not breathe, I died. Bob was
| fine himself, but someone shot him, so he is dead (but remember,
| Bob was fine) --- What a joke.
| atoav wrote:
| A "service event"?!
| londons_explore wrote:
| Idea: Network devices should be configured to automatically
| prioritize the same packet flows for the same clients as they
| served yesterday.
|
| So many overload issues seem to be caused by a single client, in
| a case where the right prioritization or rate limit rule could
| have contained any outage, but such a rule either wasn't in place
| or wasn't the right one due to the difficulty of knowing how to
| prioritize hundreds of clients.
|
| Using _more_ bandwidth or requests than yesterday should then be
| handled as capacity allows, possibly with a manually configured
| priority list, cap, or ratio. But "what I used yesterday" should
| always be served first. That way, any outage is contained to
| clients acting differently to yesterday, even if the config isn't
| perfect.
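|
| As a sketch of the idea (hypothetical numbers, not any real
| device's config): traffic up to a client's observed rate from
| yesterday is treated as guaranteed, and anything above it is
| best-effort and shed first under congestion.
|
|   # yesterday's observed request rates per client, in req/s
|   yesterday_rate = {"client-a": 500.0, "client-b": 120.0}
|
|   def classify(client_id, current_rate):
|       baseline = yesterday_rate.get(client_id, 0.0)
|       return "guaranteed" if current_rate <= baseline else "best-effort"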
| stevefan1999 wrote:
| In a nutshell: thundering herd.
| rodmena wrote:
| umm... But just one thing, S3 was not available at least for 20
| minutes.
| sneak wrote:
| "impact" occurs 27 times on this page.
|
| What was wrong with "affect"?
| pohl wrote:
| The easiest way to avoid confusing affect with effect is to use
| other words.
| WC3w6pXxgGd wrote:
| > Our networking clients have well tested request back-off
| behaviors that are designed to allow our systems to recover from
| these sorts of congestion events, but, a latent issue prevented
| these clients from adequately backing off during this event.
|
| Sentences like this are confusing. If they are well-tested,
| wouldn't this issue have been covered?
| mperham wrote:
| I wish it contained actual detail and wasn't couched in
| generalities.
| nijave wrote:
| That was my take. Seems like boilerplate you could report for
| almost any incident. Last year's Kinesis outage and the S3
| outage some years ago had some decent detail
| propter_hoc wrote:
| Does anyone know how often an AZ experiences an issue as compared
| to an entire region? AWS sells the redundancy of AZs pretty
| heavily, but it seems like a lot of the issues that happen end up
| being region-wide. I'm struggling to understand whether I should
| be replicating our service across regions or whether the AZ
| redundancy within a region is sufficient.
| ashtonkem wrote:
| The best bang for your buck isn't deploying into multiple AZs,
| but relocating everything into almost any other region than us-
| east-1.
|
| My system is latency- and downtime-tolerant, but I'm thinking I
| should move all my Kafka processing over to us-west-2.
| nemothekid wrote:
| I've been naively setting up our distributed databases in
| separate AZs for a couple years now, paying, sometimes,
| thousands of dollars per month in data replication bandwidth
| egress fees. As far as I can remember I've never seen an
| AZ go down, and the only region that has gone down has been us-
| east-1.
| treis wrote:
| AZs definitely go down. It's usually due to a physical reason
| like fire or power issues.
| wjossey wrote:
| There was an AZ outage in Oregon a couple months back. You
| should definitely go multi AZ without hesitation for
| production workloads for systems that should be highly
| available. You can easily lose a system permanently in a
| single AZ setup if it's not ephemeral.
| maximilianroos wrote:
| > I've never never seen an AZ go down, and the only region
| that has gone down has been us-east-1.
|
| Doesn't the region going down mean that _all_ its AZs have
| gone down? Or is my mental model of this incorrect?
| urthor wrote:
| No. See https://aws.amazon.com/about-aws/global-
| infrastructure/regio...
|
| A region is a networking paradigm. An AZ is a group of 2-6
| data centers in the same city more or less.
|
| If a region goes down or is otherwise impacted, its AZs are
| _unavailable_ or similar.
|
| If an AZ goes down, your VMs in said centers are disrupted
| in the most direct sense.
|
| It's the difference between loss of service and actual data
| loss.
| cyounkins wrote:
| Is that separate AZs within the same region, or AZs across
| regions? I didn't think there were any bandwidth fees between
| AZs in the same region.
| electroly wrote:
| It's $0.01/GB for cross-AZ transfer within a region.
| nemothekid wrote:
| In reality it's more like $0.02/GB. You pay $0.01 on
| sending and $0.01 on receiving. I have no idea why
| ingress isn't free.
| rfraile wrote:
| Plus the support percentage, don't forget.
| TheP1000 wrote:
| That is incorrect. Cross az fees are steep.
| [deleted]
| dijit wrote:
| The main issue is that a lot of AWS internal components
| tend to be in us-east-1; it's also the oldest region.
|
| So when failures happen in that region (and they happen more
| commonly there than in others due to age, scale, and
| complexity), they can be globally impacting.
| AUX4R6829DR8 wrote:
| The stuff that's exclusively hosted in us-east-1 is, to my
| knowledge, mostly things that maintain global uniqueness.
| CloudFront distributions, Route53, S3 bucket names, IAM roles
| and similar- i.e. singular control planes. Other than that,
| regions are about as isolated as it gets, except for specific
| features on top.
|
| Availability zones are supposed to be another fault boundary,
| and things are generally pretty solid, but every so often
| problems spill over when they shouldn't.
|
| The general impression I get is that us-east-1's issues tend
| to stem from it being singularly huge.
|
| (Source: Work at AWS.)
| ashtonkem wrote:
| If I recall there was a point in time where the control
| plane for all regions was in us-east-1. I seem to recall an
| outage where the other regions were up, but you couldn't
| change any resources because the management API was down in
| us-east-1.
| herodoturtle wrote:
| This was our exact experience with this outage.
|
| Literally _all_ our AWS resources are in EU /UK regions -
| and they all continued functioning just fine - but we
| couldn't sign in to our AWS console to manage said
| resources.
|
| Thankfully the outage didn't impact our production
| systems at all, but our inability to access said console
| was quite alarming to say the least.
| aaron42net wrote:
| The default region for global services including
| https://console.aws.amazon.com is us-east-1, but there
| are usually regional alternatives. For example: https://us-
| west-2.console.aws.amazon.com
|
| It would probably be clearer that they exist if the
| console redirected to the regional URL when you switched
| regions.
|
| STS, S3, etc have regional endpoints too that have
| continued to work when us-east-1 has been broken in the
| past and the various AWS clients can be configured to use
| them, which they also sadly don't tend to do by default.
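|
| For example, with boto3 you can pin a client to a regional STS
| endpoint instead of the us-east-1-backed global one (the
| endpoint shown is the documented regional form):
|
|   import boto3
|
|   sts = boto3.client(
|       "sts",
|       region_name="us-west-2",
|       endpoint_url="https://sts.us-west-2.amazonaws.com",
|   )
|   print(sts.get_caller_identity()["Arn"])
|
| The same can be done account-wide by setting
| AWS_STS_REGIONAL_ENDPOINTS=regional in the environment or the
| shared config file.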
| propter_hoc wrote:
| I agree with you, but my services are actually in Canada
| (Central). There's only one region in Canada, so I don't
| really have an alternative. AWS justifies it by saying there
| are three AZs (distinct data centres) within Canada
| (Central), but I get scared when I see these region-wide
| issues. If the AZs were really distinct, you wouldn't really
| have region-wide issues.
| post-it wrote:
| Multiple AZs are more for earthquakes, fires[1], and
| similar disasters than for software issues.
|
| [1] https://www.reuters.com/article/us-france-ovh-fire-
| idUSKBN2B...
| discodave wrote:
| Take DynamoDB as an example. The AWS managed service takes
| care of replicating everything to multiple AZs for you,
| that's great! You're very unlikely to lose your data. But,
| the DynamoDB team is running a mostly-regional service. If
| they push bad code or fall over it's likely going to be a
| regional issue. Probably only the storage nodes are truly
| zonal.
|
| If you wanted to deploy something similar, like Cassandra
| across AZs, or even regions you're welcome to do that. But
| now you're on the hook for the availability of the system.
| Are you going to get higher availability running your own
| Cassandra implementation than the DynamoDB team? Maybe.
| DynamoDB had a pretty big outage in 2015 I think. But
| that's a lot more work than just using DynamoDB IMO.
| dastbe wrote:
| > But, the DynamoDB team is running a mostly-regional
| service.
|
| this is both more and less true than you might think. for
| most regional endpoints teams leverage load balancers
| that are scoped zonally, such that ip0 will point at
| instances in zone a, ip1 will point at instances in zone
| b, and so on. Similarly, teams who operate "regional"
| endpoints will generally deploy "zonal" environments,
| such that in the event of a bad code deploy they can fail
| away that zone for customers.
|
| that being said, these mitigations still don't stop
| regional poison pills or otherwise from infecting other
| AZs unless the service is architected zonally
| internally.
| discodave wrote:
| Yeah, teams go to a lot of effort to have zonal
| environments/fleets/deployments... but there are still
| many, many regional failure modes. For example, even in a
| foundational service like EC2 most of their APIs touch
| regional databases.
| gnabgib wrote:
| Good news, a new region is coming to Canada in the west[0]
| eta 2023/24
|
| [0]: https://aws.amazon.com/blogs/aws/in-the-works-aws-
| canada-wes...
| randmeerkat wrote:
| AWS has been getting a pass on their stability issues in us-
| east-1 for years now because it's their "oldest" zone. Maybe
| they should invest in fixing it instead of inventing new
| services to sell.
| ikiris wrote:
| if you care about the availability of a single geographical
| availability zone, it's your own fault.
| acdha wrote:
| I certainly wouldn't describe it as "a pass" given how
| commonly people joke about things like "friends don't let
| friends use us-east-1". There's also a reporting bias:
| because many places only use us-east-1, you're more likely
| to hear about it even if it only affects a fraction of
| customers, and many of those companies blame AWS publicly
| because that's easier than admitting that they were only
| using one AZ, etc.
|
| These big outages are noteworthy because they _do_ affect
| people who correctly architected for reliability -- and
| they're pretty rare. This one didn't affect one of my big
| sites at all; the other was affected by the S3 / Fargate
| issues but the last time that happened was 2017.
|
| That certainly could be better but so far it hasn't been
| enough to be worth the massive cost increase of using
| multiple providers, especially if you can have some basic
| functionality provided by a CDN when the origin is down
| (true for the kinds of projects I work on). GCP and Azure
| have had their share of extended outages, too, so most of
| the major providers tend to be careful about casting stones
| over reliability, and it's _much_ better than the median IT
| department can offer.
| easton wrote:
| From the original outage thread:
|
| "If you're having SLA problems I feel bad for you son I
| got two 9 problems cuz of us-east-1"
| nijave wrote:
| Over two years I think we'd see about 2-3 AZ issues, but only
| one that I would consider an outage.
|
| Usually there would be high network error rates which were
| usually enough to make RDS Postgres fail over if it was in the
| impacted AZ
|
| The only real "outage" was DNS having extremely high error
| rates in a single us-east-1 AZ to the point most things there
| were barely working
|
| Lack of instance capacity, especially spot, especially for the
| NVMe types, was common for CI (it used ASGs for builder nodes).
| It'd be pretty common for a single AZ to run out of spot
| instance types--especially the NVMe ([a-z]#d types)
| codeduck wrote:
| the eu-central-1 datacenter fire earlier this year was
| purportedly just 1AZ, but it took down the entire region to all
| intents and purposes.
|
| Our SOP is to cut over to a second region the moment we see any
| AZ-level shenanigans. We've been burned too often.
| banana_giraffe wrote:
| It can be a bit hard to know, since the AZ identifiers are
| randomized per account, so if you think you have problems in
| us-west-1a, I can't check on my side. You can get the AZ ID out
| of your account to de-randomize things, so we can compare
| notes, but people rarely bother, for whatever reason.
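|
| Getting the mapping is one API call, e.g. with boto3:
|
|   import boto3
|
|   ec2 = boto3.client("ec2", region_name="us-east-1")
|   for az in ec2.describe_availability_zones()["AvailabilityZones"]:
|       # ZoneName (e.g. us-east-1a) is account-specific;
|       # ZoneId (e.g. use1-az4) is the same for every account.
|       print(az["ZoneName"], "->", az["ZoneId"])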
| mnordhoff wrote:
| Amazon seems to have stopped randomizing them in newer
| regions. Another reason to move to us-east-2. ;-)
| lanstin wrote:
| If you do a lot of VPC Endpoints to clients/thirdparties, you
| learn the AZIDs or you go to all AZIDs in a region by
| default.
| pinche_gazpacho wrote:
| Yeah, CloudWatch APIs went down the drain. Good on them for
| publishing this, at least.
| sponaugle wrote:
| "Our networking clients have well tested request back-off
| behaviors that are designed to allow our systems to recover from
| these sorts of congestion events, but, a latent issue prevented
| these clients from adequately backing off during this event. "
|
| That is an interesting way to phrase that. A 'well-tested'
| method, but 'latent issues'. That would imply the 'well-tested'
| part was not as well-tested as it needed to be. I guess 'latent
| issue' is the new 'bug'.
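|
| For reference, the generic shape of such back-off (an
| illustration only, not AWS's actual client code) is exponential
| delay with jitter, so congested callers don't retry in lockstep:
|
|   import random
|   import time
|
|   def call_with_backoff(fn, max_attempts=8, base=0.1, cap=30.0):
|       for attempt in range(max_attempts):
|           try:
|               return fn()
|           except Exception:
|               if attempt == max_attempts - 1:
|                   raise
|               # "full jitter": sleep a random amount up to an
|               # exponentially growing cap
|               time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))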
| paulryanrogers wrote:
| Has anyone been credited by AWS for violations of their SLAs?
| iwallace wrote:
| My company uses AWS. We had significant degradation for many of
| their APIs for over six hours, having a substantive impact on our
| business. The entire time their outage board was solid green. We
| were in touch with their support people and knew it was bad but
| were under NDA not to discuss it with anyone.
|
| Of course problems and outages are going to happen, but saying
| they have five nines (99.999) uptime as measured by their "green
| board" is meaningless. During the event they were late and
| reluctant to report it and its significance. My point is that
| they are wrongly incentivized to keep the board green at all
| costs.
| ALittleLight wrote:
| I worked at Amazon. While my boss was on vacation I took over
| for him in the "Launch readiness" meeting for our team's
| component of our project. Basically, you go to this meeting
| with the big decision makers and business people once a week
| and tell them what your status is on deliverables. You are
| supposed to sum up your status as "Green/Yellow/Red" and then
| write (or update last week's document) to explain your status.
|
| My boss had not given me any special directions here so I
| assumed I was supposed to do this honestly. I set our status as
| "Red" and then listed out what were, I felt, quite compelling
| reasons to think we were Red. The gist of it was that our
| velocity was negative. More work items were getting created and
| assigned to us than we closed, and we still had high priority
| items open from previous dates. There was zero chance, in my
| estimation, that we would meet our deadlines, so I called us
| Red.
|
| This did not go over well. Everyone at the Launch Readiness
| meeting got mad at me for declaring Red. Our VP scolded me in
| front of the entire meeting and lectured me about how I could
| not unilaterally declare our team red. Her logic was, if our
| team was Red, that meant the entire project was Red, and I was
| in no position to make that call. Other managers at the meeting
| got mad at me too because they felt my call made them look bad.
| For the rest of my manager's absence I had to first check in
| with a different manager and show him my Launch Readiness
| status and get him to approve my update before I was allowed to
| show it to the rest of the group.
|
| For the rest of the time that I went to Launch Readiness I was
| forbidden from declaring Red regardless of what our metrics
| said. Our team was Yellow or Green, period.
|
| Naturally, we wound up being over a year late on the deadlines,
| because, despite what they compelled us to say in those
| meetings, we weren't actually getting the needed work done.
| Constant "schedule slips" and adjustments. Endless wasted time
| in meetings trying to rework schedules that would instantly get
| blown up again. Hugely frustrating. Still slightly bitter about
| it.
|
| Anyway, I guess all this is to say that it doesn't surprise me
| that Amazon is bad about declaring Red, Yellow, or Green in
| other places too. Probably there is a guy in charge of updating
| those dashboards who is forbidden from changing them unless
| they get approval from some high level person and that person
| will categorically refuse regardless of the evidence because
| they want the indicators to be Green.
| meabed wrote:
| I've worked with consultants for years, and that's exactly the
| playbook: the project might not even launch after 2 years of
| deadlines while the status is yellow or green and never red.
| tootie wrote:
| I once had the inverse happen. I showed up as an architect at
| a pretty huge e-commerce shop. They had a project that had
| just kicked off and onboarded me to help with planning. They
| had estimated two months by total finger in the air guessing.
| I ran them through a sizing and velocity estimation and the
| result came back as 10 months. I explained this to management
| and they said "ok". We delivered in about 10 months. It was
| actually pretty sad that they just didn't care. Especially
| since we quintupled the budget and no one was counting.
| syngrog66 wrote:
| your story reminded me of the Challenger disaster and the
| "see no evil" bureaucratic shenanigans about the O-rings
| failing to seal in cold weather.
|
| "How dare you threaten our launch readiness go/no-go?!"
| dawnbreez wrote:
| Was Challenger the one where they buried the issue in a
| hundred-slide-long PowerPoint? Or was that the other
| shuttle?
| fma wrote:
| We have something similar at my big corp company. I think the
| issue is you went from Green to Red at the flip of a switch. A
| more normal project goes Green...raises a red flag...if red
| flags aren't resolved in the next week or two, goes to
| Yellow...In these meetings everyone collaborates on ways to
| keep you Green or get you back to Green if you went Yellow.
|
| In essence - what you were saying is your boss lied the whole
| time, because how does one go from a presumed positive
| velocity to negative velocity in a week?
|
| Additionally, assuming you're a dev lead, it's a little
| surprising that this is your first meeting of this sort. As
| dev lead, I didn't always attend them, but my input was always
| sought on the status.
|
| Sounds like you had a bad manager, and Amazon is filled with
| them.
| human wrote:
| Exactly this. If you take your team from green to red
| without raising flags and asking for help, you will be
| frowned upon. It's like pulling the fire alarm at the smell
| of burning toast. It will piss people off.
| transcriptase wrote:
| This explicitly supports what most of us assume is going on.
| I won't be surprised if someone with an (un)vested interest
| will be along shortly to say that their experience is the
| opposite and that on _their_ team, making people look bad by
| telling the truth is expected and praised.
| version_five wrote:
| I had a good chuckle reading your comment. This is not unique
| to Amazon. Unfortunately, status indicators are super
| political almost everywhere, precisely because they are what
| is being monitored as a proxy for the actual progress. I
| think your comment should be mandatory reading for any leader
| who is holding the kinds of meetings you describe and thinks
| they are getting an accurate picture of things.
| theduder99 wrote:
| no need for it to be mandatory, everyone is fully aware of
| the game and how to play it.
| throwaway82931 wrote:
| A punitive culture of "accountability" naturally leads to
| finger pointing and evasion.
| belter wrote:
| Was this Amazon or AWS?
| tempnow987 wrote:
| This is not unique. The reason is simple.
|
| 1) If you keep status green for 5 years, while not delivering
| anything, the reality is the folks at the very top (who can
| come and go) just look at these colors and don't really get
| into the project UNLESS you say you are red :)
|
| 2) Within 1-2 years there is always going to be some excuse
| for WHY you are late (people changes, scope tweaks, new
| things to worry about, covid etc)
|
| 3) Finally you are 3 years late, but you are launching. Well,
| the launch overshadows the lateness. I.e., you were green, then
| you launched; that's all the VP really sees sometimes.
| dylan604 wrote:
| > While my boss was on vacation I took over for him in the
| "Launch readiness" meeting....once a week
|
| Jeez, how many meetings did you go to, and how long was this
| person's vacation? I'm jelly of being allowed to take that
| much time off continuously.
| toomuchtodo wrote:
| You might be working at the wrong org? My colleagues
| routinely take weeks off at a time, sometimes more than a
| month to travel Europe, go scuba diving in French
| Polynesia, etc. Work to live, don't live to work.
| dawnbreez wrote:
| I worked at an Amazon air-shipping warehouse for a couple
| years, and hearing this confirms my suspicions about the
| management there. Lower management (supervisors, people
| actually in the building) were very aware of problems, but
| the people who ran the building lived out of state, so they
| only actually went to the building on very rare occasions.
|
| Equipment was constantly breaking down, in ways that ranged
| from inconvenient to potentially dangerous. Seemingly basic
| design decisions, like the shape of chutes, were screwed up
| in mind-boggling ways (they put a right-angle corner partway
| down each chute, which caused packages to get stuck in the
| chutes constantly). We were short on equipment almost every
| day; things like poles to help us un-jam packages were in
| short supply, even though we could move hundreds of thousands
| of packages a day. On top of all this, the facility opened
| with half its sorting equipment, and despite promises that
| we'd be able to add the rest of the equipment in the summer,
| during Amazon's slow season...it took them two years to even
| get started.
|
| And all the while, they demanded ever-increasing package
| quotas. At first, 120,000 packages/day was enough to raise
| eyebrows--we broke records on a daily basis in our first
| holiday rush--but then, they started wanting 200,000, then
| 400,000. Eventually it came out that the building wouldn't
| even be breaking even until it hit something like 500,000.
|
| As we scaled up, things got even worse. None of the
| improvements that workers suggested to management were used,
| to my knowledge, even simple things like adding an indicator
| light to freight elevators.
|
| Meanwhile, it eventually became clear that there wasn't
| enough space to store cargo containers in the building. 737s
| and the like store packages mostly in these giant curved
| cargo containers, and we needed them to be locked in place
| while working around/in them...except that, surprise, the
| people planning the building hadn't planned any holding areas
| for containers that weren't in use! We ended up sticking them
| in the middle of the work area.
|
| Which pissed off the upper management when they visited.
| Their decision? Stop doing it. Are we getting more storage
| space for the cans? No. Are we getting more workers on the
| airplane ramp so we can put these cans outside faster? No.
| But we're not allowed to store those cans in the middle of
| the work area anymore, even if there aren't any open stations
| with working locks. Oh, by the way, the locking mechanisms
| that hold the cans in place started to break down, and to my
| knowledge they never actually fixed any of the locks. (A guy
| from their safety team claims they've fixed like 80 or 90 of
| the stations since the building opened, but none of the
| broken locks I've seen were fixed in the 2 years I worked
| there.)
| blowski wrote:
| The problem here sounds like lack of clarity over the meaning
| of the colours.
|
| In organisations with 100s of in-flight projects, it's
| understandable that red is reserved for projects that are
| causing extremely serious issues right now. Otherwise, so
| many projects would be red, that you'd need a new colour.
| jaytaylor wrote:
| How about orange? Didn't know there was a color shortage
| these days.
| tuananh wrote:
| but it's amz's color. it should carry a positive meaning
| #sarcasm
| ALittleLight wrote:
| I'd be willing to believe they had some elite high level
| reason to schedule things this way if I thought they were
| good at scheduling. In my ~10 years there I never saw a
| major project go even close to schedule.
|
| I think it's more like the planning people get rewarded for
| creating plans that look good and it doesn't bother them if
| the plans are unrealistic. Then, levels of middle
| management don't want to make themselves look bad by saying
| they're behind. And, ultimately, everyone figures they can
| play a kind of schedule-chicken where everyone says they're
| green or yellow until the last possible second, hoping that
| another group will raise a flag first and give you all more
| time while you can pretend you didn't need it.
| subsaharancoder wrote:
| I worked at AMZN and this perfectly captures my experience
| there with those weekly reviews. I once set a project I was
| managing as "Red" and had multiple SDMs excoriate me for
| apparently "throwing them under the bus" even though we had
| missed multiple timelines and were essentially not going to
| deliver anything of quality on time. I don't miss this aspect
| of AMZN!
| redconfetti wrote:
| How dare you communicate a problem using the color system.
| It hurts feelings, and feelings are important here.
| soheil wrote:
| This was addressed at least 3 times during this post. I'm not
| defending them but you're just gaslighting. If you have
| something to add about the points they raised regarding the
| status page please do so.
| jedberg wrote:
| Honestly, they should host that status page on Cloudflare or
| some completely separate infrastructure that they maintain in
| colo datacenters or something. The only time it really needs to
| be up is when their stuff isn't working.
| hvgk wrote:
| This. We're under NDA too on internal support. Our customers
| know we use AWS and they go and check the AWS status dashboards
| and tell us there's nothing wrong so the inevitable vitriol is
| always directed at us which we then have to defend.
| macintux wrote:
| I guess you have to hope that every outage that impacts you
| is big enough to make the news.
| eranation wrote:
| Obligatory mention to https://stop.lying.cloud
| steveBK123 wrote:
| Exactly, we had the same thing almost exactly a year ago -
| https://www.dailymail.co.uk/sciencetech/article-8994907/Wide...
|
| They are barely doing better than 2 9s.
| codeduck wrote:
| carbon copy of our experience.
| tootie wrote:
| Second-hand info, but supposedly when an outage hits they go all
| hands on resolving it, and no one who knows what's going on has
| time to update the status board, which is why it's always
| behind.
| voidfunc wrote:
| Not AWS, but Azure: highly doubt it. At least at Azure, the
| moment you declare an outage there is an incident manager to
| handle customer communication.
|
| Bullshit someone at Amazon doesn't have time to update the
| status.
| tuananh wrote:
| even in the post mortem, they are reluctant to admit it
|
| > While AWS customer workloads were not directly impacted from
| the internal networking issues described above, the networking
| issues caused impact to a number of AWS Services which in turn
| impacted customers using these service capabilities. Because
| the main AWS network was not affected, some customer
| applications which did not rely on these capabilities only
| experienced minimal impact from this event.
| nijave wrote:
| >some customer applications which did not rely on these
| capabilities only experienced minimal impact from this event
|
| Yeah so vanilla LB and EC2 with no autoscaling were fine.
| Anyone using "serverless" or managed services had a real bad
| day
| amalter wrote:
| I mean, not to defend them too strongly, but literally half of
| this post mortem is addressing the failure of the Service
| Dashboard. You can take it on bad faith, but they own up to the
| dashboard being completely useless during the incident.
| tw04 wrote:
| Multiple AWS employees have acknowledged it takes _VP_
| approval to change the status color of the dashboard. That is
| absurd and it tells you everything you need to know. The
| status page isn't about accurate information, it's about
| plausible deniability and keeping AWS out of the news cycle.
| dylan604 wrote:
| >it's about plausible deniability and keeping AWS out of
| the news cycle.
|
| How'd that work out for them?
|
| https://duckduckgo.com/?q=AWS+outage+news+coverage&t=h_&ia=
| w...
| tw04 wrote:
| When is the last time they had a single service outage in
| a single region? How about in a single AZ in a single
| region? Struggling to find a lot of headline stories? I'm
| willing to bet it's happened in the last 2 years and yet
| I don't see many news articles about it... so I'd say if
| the only thing that hits the front page is a complete
| region outage for 6+ hours, it's working out pretty well
| for them.
| grumple wrote:
| Last year's Thanksgiving outage and this one are the two
| biggest. They've been pretty reliable. That's still 99.7%
| uptime.
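|
| (Back-of-the-envelope, and only an illustration: two roughly
| day-long incidents is about 24 hours out of the 8,760 hours in
| a year, i.e. roughly 0.3% downtime, hence ~99.7%.)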
| cookie_monsta wrote:
| I am so naive. I honestly thought those things were
| automated.
| discodave wrote:
| The AWS summary says: "As the impact to services during this
| event all stemmed from a single root cause, we opted to
| provide updates via a global banner on the Service Health
| Dashboard, which we have since learned makes it difficult for
| some customers to find information about this issue"
|
| This seems like bad faith to me based on my experience when I
| worked for AWS. As they repeated many times at Re:Invent last
| week, they've been doing this for 15+ years. I distinctly
| remember seeing banners like "Don't update the dashboard
| without approval from <importantSVP>" on various service team
| runbooks. They tried not to say it out loud, but there was
| very much a top-down mandate for service teams to make the
| dashboard "look green" by:
|
| 1. Actually improving availability (this one is fair).
|
| 2. Using the "Green-I" icon rather than the blue, orange, or
| red icons whenever possible.
|
| 3. They built out the "Personal Health Dashboard" so they can
| post about many issues in there, without having to
| acknowledge it publicly.
| res0nat0r wrote:
| Eh I mean at least when DeSantis was lower on the food
| chain than he is now, the normal directive was that ec2
| status wasn't updated unless a certain X percent of hosts
| were affected. Which is reasonable because a single rack
| going down isn't relevant enough to constitute a massive
| problem with ec2 as a whole.
| systemvoltage wrote:
| People of HN have been _extremely_ unprofessional with regards
| to AWS's downtime. Some kind of massive zeitgeist against
| Amazon, like a giant hive mind that spews hate.
|
| Why are we doing this, folks? What's making you so angry and
| contemptuous? Literally try searching the history of downtimes
| and it was always professional and respectful.
|
| Yesterday, my comment was fricking flagged for asking people
| to be nice, to which people responded "Professionals recognize
| other professionals lying". Completely baseless and hate-
| spewing comments like this are ruining HN.
| NicoJuicy wrote:
| I think the biggest issue is about the status dashboard
| that always stays green. I haven't seen much else, no?
|
| It also seems that "degraded" means "down" in most cases,
| since authorization from managers is required.
| edoceo wrote:
| All too often folk conflate frustration with anger or hate.
|
| The comments are frustrated users.
|
| Not hateful.
| ProAm wrote:
| > Why are we doing this folks? What's making you so angry
| and contemptful?
|
| Because Amazon kills industries. Takes jobs. They do this
| because they promise they hire the best people, who can do
| this better than you and for cheaper. And it's rarely true.
| And then they lie about it when things hit the fan. If
| you're going to be the best you need to act like the best,
| and execute like the best. Not build a walled garden that
| people can't see into and that is hard to leave.
| jiggawatts wrote:
| AWS as a business has an enormous (multi-billion-dollar)
| moral hazard: they have a fantastically strong disincentive
| to update their status dashboard to accurately reflect the
| true nature of an ongoing outage. They use weasel words
| like "some customers may be seeing elevated errors", which
| we _all know_ translates to "almost all customers are
| seeing 99.99% failure rates."
|
| They have a strong incentive to lie, and they're doing it.
| This makes people dependent upon the truth for refunds
| understandably angry.
| nwallin wrote:
| So -- ctrl-f "Dash" only produces four results and it's
| hidden away at the bottom of the page. It's false to claim
| that even 20% of the post mortem is addressing the failure of
| the dashboard.
|
| The problem is that the dashboard requires VP approval to be
| updated. Which is broken. The dashboard should be automatic.
| The dashboard should update before even a single member of
| the AWS team knows there's something wrong.
| hunter2_ wrote:
| Is it typical for orgs (the whole spectrum: IT departments
| everywhere, telecom, SaaS, maybe even status of non-
| technical services) to have automatic downtime messaging
| that doesn't need a human set of eyes to approve it first?
| ProAm wrote:
| > You can take it on bad faith, but they own up to the
| dashboard being completely useless during the incident.
|
| Let's not act like this is the first time this has happened.
| It's bad faith that they do not change when their promise is
| they hire the best to handle infrastructure so you don't have
| to. It's clearly not the case. Between this and billing, we
| can easily lay blame and acknowledge lies.
| luhn wrote:
| Off the top of my head, this is the third time they've had a
| major outage where they've been unable to properly update the
| status page. First we had the S3 outage, where the yellow and
| red icons were hosted in S3 and unable to be accessed. Second
| we had the Kinesis outage, which snowballed into a Cognito
| outage, so they were unable to log in to the status page
| CMS. Now this.
|
| They "own up to it" in their postmortems, but after multiple
| failures they're still unwilling to implement the obvious
| solution and what is widely regarded as best practice: _host
| the status page on a different platform_.
| koheripbal wrote:
| This challenge is not specific to Amazon.
|
| Being able to automatically detect system health is a non-
| trivial effort.
| moralestapia wrote:
| >Be capable of spinning up virtualized instances
| (including custom drive configurations, network stacks,
| complex routing schemes, even GPUs) with a simple API
| call
|
| But,
|
| >Be incapable of querying the status of such things
|
| Yeah, I don't believe it.
| bob778 wrote:
| That's not what's being asked though - in all 3 events,
| they couldn't manually update it. It's clearly not a
| priority to fix it for even manual alerts.
| blackearl wrote:
| Why automatic? Surely someone could have the
| responsibility to do it manually.
| geenew wrote:
| Or override the autogenerated values
| jjoonathan wrote:
| They had all day to do it manually.
| saagarjha wrote:
| As others mention, you can do it manually. But it's also
| not that hard to do automatically: literally just spin up
| a "client" of your service and make sure it works.
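|
| A minimal sketch of that idea in Python (everything here is
| illustrative: the URL, the thresholds, and "print" standing in
| for a dashboard update are assumptions, not AWS's setup):
|
|     import time
|     import urllib.request
|
|     URL = "https://service.example.com/health"  # hypothetical
|
|     def probe_once(timeout=5):
|         # Call the service exactly the way a customer would.
|         try:
|             resp = urllib.request.urlopen(URL, timeout=timeout)
|             healthy = 200 <= resp.status < 300
|             resp.close()
|             return healthy
|         except Exception:
|             return False
|
|     def status_from_probes(window):
|         # window: list of recent True/False probe results
|         failure_rate = 1 - sum(window) / len(window)
|         if failure_rate > 0.5:
|             return "red"
|         if failure_rate > 0.05:
|             return "yellow"
|         return "green"
|
|     while True:
|         results = [probe_once() for _ in range(10)]
|         # Stand-in for pushing the status to a dashboard
|         print(status_from_probes(results))
|         time.sleep(60)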
| thebean11 wrote:
| Eh, the colored icons not loading is not really the same
| thing as incorrectly reporting that nothing's wrong.
| Putting the status page on separate infra would be good
| practice, though.
| dijit wrote:
| The icons showed green.
| isbvhodnvemrwvn wrote:
| My company is quite well known for blameless post-mortems,
| but if someone failed to implement improvements after three
| subsequent outages, they would be moved to a position more
| appropriate for their skills.
| 46Bit wrote:
| Firmly agreed. I've heard AWS discuss making the status
| page better - but they get really quiet about actually
| doing it. In my experience the best/only way to check for
| problems is to search Twitter for your AWS region name.
| ngc248 wrote:
| Maybe AWS should host their status checks in Azure and
| vice versa ... Mutually Assured Monitoring :) Otherwise
| it becomes a problem of who will monitor the monitor
| dijit wrote:
| Once is a mistake.
|
| Twice is a coincidence.
|
| Three times is a pattern.
|
| But this... This is every time.
| doctor_eval wrote:
| Four times is a policy.
| s_dev wrote:
| >You can take it on bad faith
|
| It's smart politics -- I don't blame them but I don't trust
| the dashboard either. There's established patterns now of the
| AWS dashboard being useless.
|
| If I want to check if Amazon is down I'm checking Twitter and
| HN. Not bad faith -- no faith.
| sorry_outta_gas wrote:
| That's only useful when it's an entire region. There are
| minor issues in smaller services that cause problems for a
| lot of people that they don't reflect on their status board;
| and not everyone checks Twitter or HN all the time while at
| work.
|
| it's a bullshit board used to fudge numbers when negotiating
| SLAs
|
| like I don't care that much, hell my company does the same
| thing; but let's not get defensive over it
| toss1 wrote:
| >>It's smart politics -- I don't blame them
|
| Um, so you think straight-up lying is good politics?
|
| Any 7-year old knows that telling a lie when you broke
| something makes you look better superficially, especially
| if you get away with it.
|
| That does not mean that we should think it is a good idea
| to tell lies when you break things.
|
| It sure as hell isn't smart politics in my book. It is
| straight-up disqualifying to do business with them. If they
| are not honest about the status or amount of service they
| are providing, how is that different than lying about your
| prices?
|
| Would you go to a petrol station that posted $x.00/gallon,
| but only delivered 3 quarts for each gallon shown on the
| pump?
|
| We're being shortchanged and lied to. Fascinating that you
| think it is good politics on their part.
| efitz wrote:
| You don't know what you're talking about.
|
| AWS spends a lot of time thinking about this problem in
| service to their customers.
|
| How do you reduce the status of millions of machines, the
| software they run, and the interconnected-ness of those
| systems to a single graphical indicator?
|
| It would be dumb and useless to turn something red every
| single time anything had a problem. Literally there are
| hundreds of things broken every minute of every day. On-
| call engineers are working around the clock on these
| problems. Most of the problems either don't affect anyone
| due to redundancy or affect only a tiny number of
| customers - a failed memory module or top-of-rack switch
| or a random bit flip in one host for one service.
|
| Would it help anyone to tell everyone about all these
| problems? People would quickly learn to ignore it as it
| had no bearing on their experience.
|
| What you're really arguing is that you don't like the
| thresholds they've chosen. That's fine, everyone has an
| opinion. The purpose of health dashboards like these is
| mostly so that customers can quickly get an answer to "is
| it them or me" when there's a problem.
|
| As others on this thread have pointed out, AWS has done a
| pretty good job of making the SHD align with the
| subjective experience of most customers. They also have
| personal health dashboards unique to each customer, but I
| assume thresholding is still involved.
| toss1 wrote:
| >>How do you reduce the status of millions of machines,
| the software they run, and the interconnected-ness of
| those systems to a single graphical indicator?
|
| There's a limitless variety of options, and multiple
| books written about it. I can recommend the series "The
| Visual Display of Quantitative Information" by Edward
| Tufte, for starters.
|
| >> Literally there are hundreds of things broken every
| minute of every day. On-call engineers are working around
| the clock...
|
| Of course there are, so a single R/Y/G indicator is
| obviously a bad choice.
|
| Again, they could at any time easily choose a better way
| to display this information, graphs, heatmaps, whatever.
|
| More importantly, the one thing that should NOT be chosen
| is A) to have a human in the loop of displaying status,
| as this inserts both delay and errors.
|
| Worse yet, to make it so that it is a VP-level decision,
| as if it were a $1million+ purchase, and then to set the
| policy to keep it green when half a continent is down...
| ummm that is WAAAYYY past any question of "threshold" -
| it is a premeditated, designed-in, systemic lie.
|
| >>You don't know what you're talking about.
|
| Look in the mirror, dude. While I haven't worked inside
| AWS, I have worked on complex network software systems and
| well understand the issues of thousands of HW/SW components
| in multiple states. More importantly, perhaps it's my
| philosophy degree, but I can sort out WHEN (e.g., here) the
| problem is at another level altogether. It is not the
| complexity of the system that is the problem, it is the
| MANAGEMENT decision to systematically lie about that
| complexity. Worse yet, it looks like those everyday lies are
| what goes into their claims of "99.99+% uptime!!", which is
| evidently false. The problem is at the forest level, and you
| don't even want to look at the trees because you're stuck in
| the underbrush telling everyone else they are clueless.
| [deleted]
| Karunamon wrote:
| > _How do you reduce the status of millions of machines,
| the software they run, and the interconnected-ness of
| those systems to a single graphical indicator?_
|
| A good low-hanging fruit would be, when the outage is
| significant enough to have reached the media, _you turn
| the dot red_.
|
| Dishonesty is what we're talking about here. Not the
| gradient when you change colors. This is hardly the first
| major outage where the AWS status board was a bald-faced
| _lie_. This deserves calling out and shaming the
| responsible parties, nothing less, certainly not defense
| of blatantly deceptive practices that most companies not
| named Amazon don't dip into.
| SkyPuncher wrote:
| My company isn't big enough for us to have any pull but this
| communication is _significantly_ downplaying the impact of this
| issue.
|
| One of our auxiliary services that's basically a pass through
| to AWS was offline nearly the entire day. Yet, this
| communication doesn't even mention that fact. In fact, it
| almost tries to suggest the opposite.
|
| Likewise, AWS is reporting S3 didn't have issues. Yet, for a
| period of time, S3 was erroring out frequently because it was
| responding so slowly.
| notyourday wrote:
| > The entire time their outage board was solid green. We were
| in touch with their support people and knew it was bad but were
| under NDA not to discuss it with anyone.
|
|     if ($pain > $gain) { move_your_shit_and_exit_aws(); }
|
|     sub move_your_shit_and_exit_aws {
|         printf("Dude. We have too much pain. Start moving\n");
|         printf("Yeah. That won't happen, so who cares\n");
|         exit(1);
|     }
| secondcoming wrote:
| Moving your shit from AWS can be _really_ expensive,
| depending on how much shit you have. If you're nice, GCP may
| subsidise - or even cover - the costs!
| Clubber wrote:
| Yes, it's a conflict of interest. They have a guarantee on
| uptime and they decide what their actual uptime is. There's a
| lot of that now. Most insurance comes to mind.
| jiggawatts wrote:
| SLAs with self-reported outage periods are worthless.
|
| SLAs that refund only the cost of the individual service that
| was down are worthless.
|
| SLAs that require separate proof and refund requests for each
| and every service that was affected are nearly worthless.
|
| There needs to be an independent body, set up by large cloud
| customers, to monitor availability and enforce refunds.
| [deleted]
| codegeek wrote:
| "Our Support Contact Center also relies on the internal AWS
| network, so the ability to create support cases was impacted
| from 7:33 AM until 2:25 PM PST. "
|
| This to me is really bad. Even as a small company, we keep our
| support infrastructure separate. For a company of Amazon's
| size, this is a shitty excuse. If I cannot even reach you as a
| customer for almost 7 hours, that is just nuts. AWS must do
| better here.
|
| Also, is it true that the outage/status pages are manually
| updated? If yes, there is no excuse why it was green for that
| long. If you are manually updating it, please update asap.
| anonu wrote:
| Wasn't this the Bezos directive early on that created AWS?
| Anything that was created had to be a service with an API.
| Not allowed to reinvent the wheel. So AWS depends on AWS.
| jiggawatts wrote:
| Dependency loops are such fun!
|
| My favourite is when some company migrates their physical
| servers to virtual machines, including the AD domain
| controllers. Then the next step is to use AD LDAP
| authentication for the VM management software.
|
| When there's a temporary outage and the VMs don't start up
| as expected, the admins can't log on and troubleshoot the
| platform because the logon system was running on it... but
| isn't _now_.
|
| The loop is closed.
|
| You see this all the time, especially with system-
| management software. They become dependent on the systems
| they're managing, and vice-versa.
|
| If you care about availability at all, make sure to have
| physical servers providing basic services like DNS, NTP,
| LDAP, RADIUS, etc...
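|
| The loop is easy to check for once the dependencies are
| written down. A toy sketch in Python (the service names and
| edges are made up for illustration):
|
|     # "Runs on / authenticates via" edges for a small estate.
|     deps = {
|         "vm_platform": ["ad_ldap"],  # mgmt UI uses LDAP auth
|         "ad_ldap": ["vm_platform"],  # domain controllers are VMs
|         "dns": [],                   # physical box, no deps
|     }
|
|     def find_cycle(graph):
|         visiting, done = set(), set()
|         def visit(node, path):
|             if node in done:
|                 return None
|             if node in visiting:
|                 return path + [node]
|             visiting.add(node)
|             for dep in graph.get(node, []):
|                 cycle = visit(dep, path + [node])
|                 if cycle:
|                     return cycle
|             visiting.remove(node)
|             done.add(node)
|             return None
|         for node in graph:
|             cycle = visit(node, [])
|             if cycle:
|                 return cycle
|         return None
|
|     # -> ['vm_platform', 'ad_ldap', 'vm_platform']
|     print(find_cycle(deps))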
| nijave wrote:
| Or even just have some non-federated/"local" accounts
| stored in a vault somewhere you can use when the
| centralized auth isn't working
| acwan93 wrote:
| We moved our company's support call system to Microsoft Teams
| when lockdowns were happening, and even that was affected by
| the AWS outage (along with our SaaS product hosted on AWS).
|
| It turned out our call center supplier had something running
| on AWS, and it took out our entire phone system. After this
| situation settles, I'm tempted to ask my supplier to see what
| they're doing to get around this in the future, but I doubt
| even they knew that AWS was used further downstream.
|
| AWS operates a lot like Amazon.com, the marketplace now--you
| can try to escape it, but it's near impossible. If you want
| to ban usage of Amazon's services, you're going to find some
| service (AWS) or even a Shopify site (FBA warehouse) that uses
| it.
| walrus01 wrote:
| I know a few _tiny_ ISPs that host their voip server and
| email server outside of their own ASN so that in the event of
| a catastrophic network event, communication with customers
| is still possible... Not saying amazon should do the same,
| but the general principle isn't rocket science.
|
| there's such a thing as too much dogfooding.
| notimetorelax wrote:
| If you were deployed in 2 regions would it alleviate the
| impact?
| multipassnetwrk wrote:
| Depends. If your failover to another region required changing
| DNS and your DNS was using Route 53, you would have problems.
| ransom1538 wrote:
| Yes. Exactly. Pay double. That is what all the blogs say. But
| no, when a region goes down everything is hosed. Give it a
| shot! Next time an entire region is down try out your apis or
| give AWS support a call.
| hvgk wrote:
| No. We don't have an active deployment in that region at all.
| It killed our build pipeline as ECR was down globally so we
| had nowhere to push images. Also there was a massive risk as
| our target environments are EKS so any node failures or
| scaling events had nowhere to pull images from while ECR was
| down.
|
| Edit: not to mention APIGW and Cloudwatch APIs were down too.
| electroly wrote:
| > The entire time their outage board was solid green
|
| Unless you're talking about some board other than the Service
| Health Dashboard, this isn't true. They dropped EC2 down to
| degraded pretty early on. I bemusedly noted in our corporate
| Slack that every time I refreshed the SHD, another service was
| listed as degraded. Then they added the giant banner at the
| top. Their slight delay in updating the SHD at the beginning of
| the outage is mentioned in the article. It was absolutely not
| all green for the duration of the outage.
| logical_proof wrote:
| That is not true. There were hours before they started
| annotating any kind of service issues. Maybe from when you
| noticed there was a problem it appeared to be quick, but the
| board remained green for a large portion of the outage.
| acdha wrote:
| We saw the timing described where the dashboard updates
| started about an hour after the problem began (which we
| noticed immediately since 7:30AM Pacific is in the middle
| of the day for those of us in Eastern time). I don't know
| if there was an issue with browser caching or similar but
| once the updates started everyone here had no trouble
| seeing them and my RSS feed monitor picked them up around
| that time as well.
| electroly wrote:
| No, it was about an hour. We were aware from the very
| moment EC2 API error rates began to elevate, around 10:30
| Eastern. By 11:30 the dashboard was updating. This timing
| is mentioned in the article, and it all happened in the
| middle of our workday on the east coast. The outage then
| continued for about 7 hours with SHD updates. I suspect we
| actually both agree on how long it took them to start
| updating, but I conclude that 1 hour wasn't so bad.
| gkop wrote:
| At the large platform company where I work, our policy is
| if the customer reported the issue before our internal
| monitoring caught it, we have failed. Give 5 minutes for
| alerting lag, 10 minutes to evaluate the magnitude of
| impact, 10 minutes to craft the content and get it
| approved, 5 minutes to execute the update, adds up to 30
| minutes end to end with healthy buffer at each step.
|
| 1 hour (52 minutes according to the article) sounds meh.
| I wonder what their error rate and latency graphs look
| like from that day.
| Aperocky wrote:
| > our policy is if the customer reported the issue before
| our internal monitoring caught it
|
| They discovered it right away; the Service Health
| Dashboard just wasn't updated. Source: the linked summary.
| gkop wrote:
| They don't explicitly say "right away", do they? I skimmed
| twice.
|
| But yes you're right, there's no reason to question their
| monitoring or alerting specifically.
| JPKab wrote:
| Multiple services I use were totally skunked, and none were
| ever anything but green.
|
| Sagemaker, for example, was down all day. I was dead in the
| water on a modeling project that required GPUs. It relied on
| EC2, but nobody there even thought to update the status? WTF.
| This is clearly executives incentivized to let a bug persist.
| This is because the bug is actually a feature for misleading
| customers and maximizing profits.
| wly_cdgr wrote:
| Their service board is always as green as you have to be to trust
| it
| hourislate wrote:
| Broadcast storm. Never easy to isolate, as a matter of fact it's
| nightmarish...
| onion2k wrote:
| _This congestion immediately impacted the availability of real-
| time monitoring data for our internal operations teams_
|
| I guess this is why it took ages for the status page to update.
| They didn't know which things to turn red.
| bpodgursky wrote:
| DNS?
|
| Of course it was DNS.
|
| It is always* DNS.
| tezza wrote:
| It is often BGP, regularly DNS, frequently expired keys,
| sometimes a bad release and occasionally a fire
| shepherdjerred wrote:
| This wasn't caused by DNS. DNS was just a symptom.
| yegle wrote:
| Did this outage only impact the us-east-1 region? I think I saw
| other regions affected in some HN comments, but this summary did
| not mention anything to suggest more than one region was
| impacted.
| teej wrote:
| There are some AWS services, notably STS, that are hosted in
| us-east-1. I don't have anything in us-east-1 but I was
| completely unable to log into the console to check on the
| health of my services.
| qwertyuiop_ wrote:
| House of cards
| simlevesque wrote:
| > Customers accessing Amazon S3 and DynamoDB were not impacted by
| this event. However, access to Amazon S3 buckets and DynamoDB
| tables via VPC Endpoints was impaired during this event.
|
| What does this even mean? I bet most people use DynamoDB via a
| VPC, in a Lambda or on EC2.
| jtoberon wrote:
| VPC Endpoint is a feature of VPC:
| https://docs.aws.amazon.com/vpc/latest/privatelink/endpoint-...
| discodave wrote:
| Your application can call DynamoDB via the public endpoint
| (dynamodb.us-east-1.amazonaws.com). But if you're in a VPC
| (i.e. practically all AWS workloads in 2021), you have to route
| to the internet (you need public subnet(s) I think) to make
| that call.
|
| VPC Endpoints create a DynamoDB endpoint in your VPC, from the
| documentation:
|
| "When you create a VPC endpoint for DynamoDB, any requests to a
| DynamoDB endpoint within the Region (for example, dynamodb.us-
| west-2.amazonaws.com) are routed to a private DynamoDB endpoint
| within the Amazon network. You don't need to modify your
| applications running on EC2 instances in your VPC. The endpoint
| name remains the same, but the route to DynamoDB stays entirely
| within the Amazon network, and does not access the public
| internet."
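|
| For illustration, creating a gateway endpoint for DynamoDB is
| a single API call; a sketch with boto3 (the VPC and route
| table IDs below are placeholders):
|
|     import boto3
|
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|
|     # Route DynamoDB traffic from this VPC over the private
|     # endpoint instead of the public internet path.
|     ec2.create_vpc_endpoint(
|         VpcEndpointType="Gateway",
|         VpcId="vpc-0123456789abcdef0",
|         ServiceName="com.amazonaws.us-east-1.dynamodb",
|         RouteTableIds=["rtb-0123456789abcdef0"],
|     )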
| aeyes wrote:
| I call my DynamoDB tables via the public endpoint and it was
| severely impaired - high error rates and very high
| (multi-second) latency.
| all2well wrote:
| From within a VPC, you can either access DynamoDB via its
| public internet endpoints (eg, dynamodb.us-
| east-1.amazonaws.com, which routes through an Internet Gateway
| attachment in your VPC), or via a VPC endpoint for dynamodb
| that's directly attached to your VPC. The latter is useful in
| cases where you want a VPC to not be connected to the internet
| at all, for example.
| foobarbecue wrote:
| "... the networking congestion impaired our Service Health
| Dashboard tooling from appropriately failing over to our standby
| region. By 8:22 AM PST, we were successfully updating the Service
| Health Dashboard."
|
| Sounds like they lost the ability to update the dashboard. HN
| comments at the time were theorizing it wasn't being updated due
| to bad policies (need CEO approval) etc. Didn't even occur to me
| that it might be stuck in green mode.
| dgivney wrote:
| In the February 2017 S3 outage, AWS was unable to move status
| icons to the red icon because those images happened to be
| stored on the servers that went down.
|
| https://twitter.com/awscloud/status/836656664635846656
| [deleted]
| wongarsu wrote:
| Hasn't this exact thing (something in US-east-1 goes down, AWS
| loses ability to update dashboard) happened before? I vaguely
| remember it was one of the S3 outages, but I might be wrong.
|
| In any case, AWS not updating their dashboard is almost a meme
| by now. Even for global service outages the best you will get
| is a yellow.
| foobarbecue wrote:
| Yeah, probably. I haven't watched it this closely before
| during an outage. I have no idea if this happens in good
| faith, bad faith, or (probably) a mix.
| llaolleh wrote:
| I wonder if they could've designed better circuit breakers for
| situations like this. They're very common in electrical
| engineering, but I don't think they're as common in software
| design. It's something we should actually try to design in
| for situations like this.
| EsotericAlgo wrote:
| They're a fairly common design pattern:
| https://en.m.wikipedia.org/wiki/Circuit_breaker_design_patte...
| However, they certainly aren't implemented as often as they
| should be at service-level boundaries, resulting in these
| sorts of cascading failures.
| riknos314 wrote:
| One of the big issues mentioned was that one of the circuit
| breakers they did have (client back-off) didn't function
| properly. So they did have a circuit breaker in the design, but
| it was broken.
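|
| For readers unfamiliar with the pattern, a minimal circuit-
| breaker sketch in Python (illustrative only; this is not AWS's
| implementation, and the thresholds are arbitrary):
|
|     import time
|
|     class CircuitBreaker:
|         """Stop calling a failing dependency for a while
|         instead of hammering it with retries, which is what
|         amplifies congestion."""
|
|         def __init__(self, max_failures=5, reset_after=30.0):
|             self.max_failures = max_failures
|             self.reset_after = reset_after
|             self.failures = 0
|             self.opened_at = None
|
|         def call(self, fn, *args, **kwargs):
|             if self.opened_at is not None:
|                 elapsed = time.monotonic() - self.opened_at
|                 if elapsed < self.reset_after:
|                     raise RuntimeError("circuit open: failing fast")
|                 # half-open: allow one trial call through
|                 self.opened_at = None
|             try:
|                 result = fn(*args, **kwargs)
|             except Exception:
|                 self.failures += 1
|                 if self.failures >= self.max_failures:
|                     self.opened_at = time.monotonic()
|                 raise
|             self.failures = 0
|             return result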
| kevin_nisbet wrote:
| Netflix was talking a lot about circuit breakers a few years
| ago, and had the Hystrix project. Looks like Hystrix is
| discontinued, so I'm not sure if there are good library
| solutions that are easy to adopt. Overall I don't see it
| getting talked about that frequently... beyond just exponential
| backoff inside a retry loop.
|
| - https://github.com/Netflix/Hystrix
|
| - https://www.youtube.com/watch?v=CZ3wIuvmHeM (I think this
| talks about Hystrix a bit, but I'm not sure if it's the
| presentation I'm thinking of from years ago or not.)
| isbvhodnvemrwvn wrote:
| In the JVM land resilience4j is the de facto successor of
| Hystrix:
|
| https://github.com/resilience4j/resilience4j
| User23 wrote:
| A packet storm outage? Now that brings back memories. Last time I
| saw that it was rendezvous misbehaving.
| jetru wrote:
| Complex systems are really really hard. I'm not a big fan of
| seeing all these folks bash AWS for this, and not really
| understanding the complexity or nastiness of situations like
| this. Running the kind of services they do for the kind of
| customers, this is a VERY hard problem.
|
| We ran into a very similar issue, but at the database layer in
| our company literally 2 weeks ago, where connections to our MySQL
| exploded and completely took down our data tier and caused a
| multi-hour outage, compounded by retries and thundering herds.
| Understanding this problem under the stressful scenario is
| extremely difficult and a harrowing experience. Anticipating this
| kind of issue is very very tricky.
|
| Naive responses to this include "better testing", "we should be
| able to do this", "why is there no observability" etc. The
| problem isn't testing. Complex systems behave in complex ways,
| and its difficult to model and predict, especially when the
| inputs to the system aren't entirely under your control.
| Individual components are easy to understand, but when
| integrating, things get out of whack. I can't stress enough how
| difficult it is to model or even think about these systems;
| they're very, very hard. Combined with this knowledge being
| distributed among many people, you're dealing with not only
| distributed systems but also distributed people, which makes it
| even harder to wrap your head around.
|
| Outrage is the easy response. Empathy and learning is the
| valuable one. Hugs to the AWS team, and good learnings for
| everyone.
| [deleted]
| spfzero wrote:
| But Amazon advertises that they DO understand the complexity of
| this, and that their understanding, knowledge and experience is
| so deep that they are a safe place to put your critical
| applications, and so you should pay them lots of money to do
| so.
|
| Totally understand that complex systems behave in
| incomprehensible ways (hopefully only temporarily
| incomprehensible). But they're selling people on the idea of
| trading your complex system, for their far more complex system
| that they manage with such great expertise that it is more
| reliable.
| raffraffraff wrote:
| Interesting. Just wondering if you guys have a dedicated DBA?
| raffraffraff wrote:
| Not sure why I got down voted for an honest question. Most
| start-ups are founders, developers, sales and marketing.
| Dedicated infrastructure, network and database specialists
| don't get factored in because "smart CS graduates can figure
| that stuff out". I've worked at companies that held onto that
| false notion _way_ too long and almost lost everything as a
| result (a "company extinction event", like losing a lot of
| customer data).
| danjac wrote:
| The problem isn't AWS per se. The problem is it's become too
| big to fail. Maybe in the past an outage might take down a few
| sites, or one hospital, or one government service. Now one
| outage takes out all the sites, all the hospitals and all the
| government services. Plus your coffee machine stops working.
| qaq wrote:
| Very good summary of why small projects need to think real hard
| before jumping onto the microservices bandwagon.
| tuldia wrote:
| Excuse me, but do we need all that complexity? Is saying that
| it is "hard" a justification?
|
| It is naive to assume people bashing AWS are incapable of
| running things better, cheaper, and faster across many other
| vendors, on-prem, colocation, or what not.
|
| > Outrage is the easy response.
|
| That is what made AWS get the marketshare it has now in the
| first place, the easy responses.
|
| The main selling point of AWS in the beginning was "how easy it
| is to spin up a virtual machine". After basically every layman
| started recommending AWS and we flocked there, AWS started
| making things more complex than it should. Was that to make it
| harder to get out? IDK.
|
| > Empathy and learning is the valuable one.
|
| When you run your infrastructure and something fails and you
| are not transparent, your users will bash you, regardless of
| who you are.
|
| And that was another "easy response" used to drive companies
| towards AWS. We developers were echoing that "having an
| infrastructure team or person is not necessary", etc.
|
| Now we are stuck in this learned helplessness where every
| outage is a complete disaster in terms of transparency, with
| multiple services failing, even for multi-region and multi-AZ
| customers. We say "this service here is also not working",
| and AWS simply states that the service was fine, not affected,
| up and running.
|
| If it was a sysadmin doing that, people would be coming for
| his/her neck with pitchforks.
| noahtallen wrote:
| > AWS started making things more complex than it should
|
| I don't think this is fair for a couple reasons:
|
| 1. AWS would have had to scale regardless just because of the
| number of customers. Even without adding features. This means
| many data centers, complex virtual networking, internal
| networks, etc. These are solving very real problems that
| happen when you have millions of virtual servers.
|
| 2. AWS hosts many large, complex systems like Netflix.
| Companies like Netflix are going to require more advanced
| features out of AWS, and this will result in more features
| being added. While this is added complexity, it's also
| solving a customer problem.
|
| My point is that complexity is inherent to the benefits of
| the platform.
| xamde wrote:
| There is this really nice website which explains how complex
| systems fail: https://how.complexsystems.fail/
| iso1631 wrote:
| > I'm not a big fan of seeing all these folks bash AWS for
| this,
|
| The disdain I saw was towards those claiming that all you need
| is AWS, that AWS never goes down, and don't bother planning for
| what happens when AWS goes down.
|
| AWS is an amazing accomplishment, but it's still a single point
| of failure. If you are a company relying on a single supplier
| and you don't have any backup plans for that supplier being
| unavailable, that is ridiculous and worthy of laughter.
| fastball wrote:
| I think most of the outrage is not because "it happened" but
| because AWS is saying things like "S3 was unaffected" when the
| anecdotal experience of many in this thread suggests the
| opposite.
|
| That and the apparent policy that a VP must sign off on
| changing status pages, which is... backwards to say the least.
| jetru wrote:
| There's definitely miscommunication around this. I know I've
| miscommunicated impact, or my communication was
| misinterpreted across the 2 or 3 people it had to jump before
| hitting the status page.
|
| For example, the meaning of "S3 was affected" is subject to a
| lot of interpretation. STS was down, which is a blocker for
| accessing S3. So the end result is that S3 is effectively down,
| but technically it is not. How does one convey this in a
| large org? You run S3, but not STS; it's not technically an
| S3 fault, but an integration fault across multiple services.
| If you say S3 is down, you're implying that the storage layer
| is down. But it's actually not. What's the best answer to
| make everyone happy here? I can't think of one.
| mickeyp wrote:
| > I can't think of one.
|
| I can.
|
| "S3 is unavailable because X, Y, and Z services are
| unavailable."
|
| A graph of dependencies between services is surely known to
| AWS; if not, they ought to create one post-haste.
|
| Trying to externalize Amazon's internal AWS politicking
| over which service is down is unproductive to the customers
| who check the dashboard and see that their service _ought_
| to be up, but... well, it isn't?
|
| Because those same customers have to explain to _their_
| clients and bosses why their systems are malfunctioning,
| yet it "shows green" on a dashboard somewhere that almost
| never shows red.
|
| (And I can levy this complaint against Azure too, by the
| way.)
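|
| A sketch of that idea in Python: roll raw health up through a
| declared dependency graph so "S3 is unavailable because STS is
| unavailable" falls out automatically. The graph, statuses, and
| service names below are illustrative, not AWS's real data:
|
|     # Service -> direct dependencies (made up for illustration)
|     deps = {
|         "s3": ["sts", "internal_dns"],
|         "dynamodb": ["internal_dns"],
|         "sts": ["internal_dns"],
|         "internal_dns": [],
|     }
|
|     # Raw health as measured for each service in isolation
|     raw = {"s3": "up", "dynamodb": "up",
|            "sts": "down", "internal_dns": "degraded"}
|
|     def effective_status(service, seen=frozenset()):
|         if service in seen:  # guard against dependency loops
|             return raw[service]
|         seen = seen | {service}
|         statuses = [raw[service]] + [
|             effective_status(d, seen) for d in deps[service]]
|         for level in ("down", "degraded"):
|             if level in statuses:
|                 return level
|         return "up"
|
|     for svc in deps:
|         blockers = [d for d in deps[svc]
|                     if effective_status(d) != "up"]
|         why = ", ".join(blockers)
|         suffix = f" (because: {why})" if blockers else ""
|         print(f"{svc}: {effective_status(svc)}{suffix}")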
| jetru wrote:
| This is a good idea
| Corrado wrote:
| Yes, I can envision a (simplified) AWS X-Ray dashboard
| showing the relationships between the systems and the
| performance of each one. Then we could see at a glance
| what was going on. Almost anything is better than that
| wall of text, tiny status images, and RSS feeds.
| wiredfool wrote:
| And all relationships eventually end in us-east-1.
| brentcetinich wrote:
| It doesn't depend on the dependency graph, but on the
| definition of "unavailable" for S3 in its SLA.
| amzn-throw wrote:
| > a VP must sign off on changing status pages, which is...
| backwards to say the least.
|
| I think most people's experience with "VPs" makes them not
| realize what AWS VPs do.
|
| VPs here are not sitting in an executive lounge wining and
| dining customers, chomping on cigars and telling minions to
| "Call me when the data center is back up and running again!"
|
| They are on the tech call, working with the engineers,
| evaluating the problem, gathering the customer impact, and
| attempting to balance communicating too early with being
| precise.
|
| Is there room for improvement? Yes. I wish we would just
| throw up a generic "Shit's Fucked Up. We Don't Know Why Yet,
| But We're Working On It" message.
|
| But the reason why we don't, doesn't have anything to do with
| having to get VP approval to put that message up. The VPs
| are there in the trenches most of the time.
| kortilla wrote:
| It doesn't matter what the VPs are doing, that misses the
| point. Every minute you know there is a problem and you
| haven't at least put up a "degraded" status, you're lying
| to your customers.
|
| It was on the top of HN for an hour before anything
| changed, and then it was still downplayed, which is insane.
| gingerlime wrote:
| > I wish we would just throw up a generic "Shit's Fucked
| Up. We Don't Know Why Yet, But We're Working On It"
| message.
|
| I think that's the crux of the matter? AWS seems to now
| have a reputation for ignoring issues that are easily
| observable by customers, and by the time any update shows
| up, it's way too late. Whether VPs make this decision or
| not is irrelevant. If this becomes a known pattern (and I
| think it has), then the system is broken.
|
| disclaimer: I have very little skin in this game. We use S3
| for some static assets, and with layers of caching on top,
| I think we are rarely affected by outages. I'm still
| curious to observe major cloud outages and how they are
| handled, and the HN reaction from people on both side of
| the fence.
| 0x0nyandesu wrote:
| Saying "S3 is down" can mean anything. Our S3 buckets
| that served static web content stayed up no problem. The
| API was down though. But for the purposes of whether my
| organization cares I'm gonna say it was "up".
| kortilla wrote:
| Who cares if it worked for your usecase?
|
| Being unable to store objects in an object store means
| that it's broken.
| shakna wrote:
| > We are currently experiencing some problems related to
| FOO service and are investigating.
|
| A generic, utterly meaningless message, which is still a
| hell of a lot more than usually gets approved, and
| approved far too late.
|
| It is also still better than "all green here, nothing to
| see" which has people looking at their own code, because
| they _expect_ that they will be the problem, not AWS.
| jrochkind1 wrote:
| Most of what they actually said via the manual human-
| language status updates was "Service X is seeing elevated
| error rates".
|
| While there are still decisions to be made in how you
| monitor errors and what sorts of elevated rates merit an
| alert -- I would bet that AWS has internally-facing
| systems that can display service health in this way based
| on automated monitoring of error rates (as well as other
| things). Because they know it means something.
|
| They apparently choose to make their public-facing
| service health page only show alerts via a manual process
| that often results in an update only several hours after
| lots of customers have noticed problems. This seems like
| a choice.
|
| What's the point of a status page? To me, the point of it
| is, when I encounter a problem (perhaps noticed because
| of my own automated monitoring), one of the first thing I
| want to do is distinguish between a problem that's out of
| my control on the platform, and a problem that is under
| my control and I can fix.
|
| A status page that does not support me in doing that is
| not fulfilling its purpose. The AWS status page fails to
| help customers do that, by regularly showing all green
| with no alerts _hours_ after widespread problems occurred.
| chakspak wrote:
| > disclaimer: I have very little skin in this game. We
| use S3 for some static assets, and with layers of caching
| on top, I think we are rarely affected by outages. I'm
| still curious to observe major cloud outages and how they
| are handled, and the HN reaction from people on both side
| of the fence.
|
| I'd like to share my experience here. This outage
| definitely impacted my company. We make heavy use of
| autoscaling, we use AWS CodeArtifact for Python packages,
| and we recently adopted AWS Single Sign-On and EC2
| Instance Connect.
|
| So, you can guess what happened:
|
| - No one could access the AWS Console.
|
| - No one could access services authenticated with SAML.
|
| - Very few CI/CD, training or data pipelines ran
| successfully.
|
| - No one could install Python packages.
|
| - No one could access their development VMs.
|
| As you might imagine, we didn't do a whole lot that day.
|
| With that said, this experience is unlikely to change our
| cloud strategy very much. In an ideal world, outages
| wouldn't happen, but the reason we use AWS and the cloud
| in general is so that, when they do happen, we aren't
| stuck holding the bag.
|
| As others have said, these giant, complex systems are
| hard, and AWS resolved it in only a few hours! Far better
| to sit idle for a day rather than spend a few days
| scrambling, VP breathing down my neck, discovering that
| we have no disaster recovery mechanism, and we never
| practiced this, and hardware lead time is 3-5 weeks, and
| someone introduced a cyclical bootstrapping process, and
| and and...
|
| Instead, I just took the morning off, trusted the
| situation would resolve itself, and it did. Can't
| complain. =P
|
| I might be more unhappy if we had customer SLAs that were
| now broken, but if that was a concern, we probably should
| have invested in multi-region or even multi-cloud
| already. These things happen.
| jrochkind1 wrote:
| > I wish we would just throw up a generic "Shit's Fucked
| Up. We Don't Know Why Yet, But We're Working On It"
| message.
|
| I gotta say, the implication that you can't register an
| outage until you know why it happened is pretty damning.
| The status page is where we look to see if services are
| affected; if that information can't be shared there until
| you understand the cause, that's very broken.
|
| The AWS status page has become kind of a joke to customers.
|
| I was encouraged to see the announcement in OP say that
| there is "a new version of our Service Health Dashboard"
| coming. I hope it can provide actual capabilities to
| display, well, service health.
|
| From how people talk about it, it kind of sounds like
| updates to the Service Health Dashboard are currently
| purely a manual process. Rather than automated monitoring
| automatically updating the Service Health Dashboard in any
| way at all. I find that a surprising implementation for an
| organization of Amazon's competence and power. That alarms
| me more than _who_ it is that has the power to manually
| update it; I agree that I don't have enough knowledge of
| AWS internal org structures to have an opinion on if it's
| the "right" people or not.
|
| I suspect AWS must have internal service health pages that
| are actually automatically updated in some way by
| monitoring, that is, that actually work to display service
| health. It _seems_ like a business decision rather than a
| technical challenge if the public facing system has no
| inputs but manual human entry, but that's just how it
| seems from the outside, I may not have full information. We
| only have what Amazon shares with us of course.
| theneworc wrote:
| Can you please help me understand why you, and everyone
| else, are so passionate about the status page?
|
| I get that it not being updated is an annoyance, but I
| cannot figure out why it is the single most discussed
| thing about this whole event. I mean, entire services
| were out for almost an entire day, and if you read HN
| threads it would seem that nobody even cares about lost
| revenue/productivity, downtime, etc. The vast majority of
| comments in all of the outage threads are screaming about
| how the SHD lied.
|
| In my entire career of consulting across many companies
| and many different technology platforms, never _once_
| have I seen or heard of anyone even _looking_ at a status
| page outside of HN. I'm not exaggerating. Even over the
| last 5 years when I've been doing cloud consulting,
| nobody I've worked with has cared at all about the cloud
| provider's status pages. The _only_ time I see it brought
| up is on HN, and when it gets brought up on HN it's
| discussed with more fervor than most other topics, even
| the outage itself.
|
| In my real life (non-HN) experience, when an outage
| happens, teams ask each other "hey, you seeing problems
| with this service?" "yea, I am too, heard maybe it's an
| outage" "weird, guess I'll try again later" and go get a
| coffee. In particularly bad situations, they might check
| the news or ask me if I'm aware of any outage. Either
| way, we just... go on with our lives? I've never needed,
| nor have I ever seen people need, a status page to inform
| them that things aren't working correctly, but if you
| read HN you would get the impression that entire
| companies of developers are completely paralyzed unless
| the status page flips from green to red. Why? I would
| even go as far as to say that if you need a third party's
| SHD to tell you if things aren't working right, then
| you're probably doing something wrong.
|
| Seriously, what gives? Is all this just because people
| love hating on Amazon and the SHD is an easy target?
| Because that's what it seems like.
| femiagbabiaka wrote:
| It is _extremely_ common for customers to care about
| being informed accurately about downtime, and not just
| for AWS. I think your experience of not caring and not
| knowing anyone who cares may be an outlier.
| aflag wrote:
| A status page give you confidence that the problem indeed
| lies with Amazon and not your own software. I don't think
| it's very reasonable to notice issues, ask other teams if
| they are also having issues, and if so, just shrug it off
| and get a cup of coffee without more investigation. Just
| because it looks like the problem is with AWS, you can't
| be sure until you further investigate it, especially if
| the status page says it's all working fine.
|
| I think it goes without saying that having an outage is
| bad, but having an outage which is not confirmed by the
| service provider is even worse. People complain about
| that a lot because it's the least they could do.
| avereveard wrote:
| https://aws.amazon.com/it/legal/service-level-agreements/
|
| There are literally millions on the line.
| swasheck wrote:
| aws isn't a hobby platform. businesses are built on aws
| and other cloud providers. those businesses' customers
| have the expectation of knowing why they are not
| receiving the full value of their service.
|
| it makes sense that, as part of marketing yourself as a
| viable infrastructure upon which other businesses can
| operate, you'd provide more granular and refined
| communication to allow better communication up and down
| the chain instead of forcing your customers to rca your
| service in order to communicate to their customers.
| glogla wrote:
| > Can you please help me understand why you, and everyone
| else, are so passionate about the status page?
|
| I don't think people are "passionate about status page."
| I think people are unhappy with someone they are supposed
| to trust straight up lying to their face.
| tommek4077 wrote:
| There should be no one to sign off on anything. Your status
| page should be updated automatically, not manually!
| aflag wrote:
| I don't think the matter is whether or not VPs are
| involved, but the fact that human sign off is required.
| Ideally the dashboard would accurately show what's working
| or not, regardless if the engineers know what's going on.
| lobocinza wrote:
| That would be too much honesty for a corp.
| metb wrote:
| Thanks for these thoughts. Resonated well with me. I feel we
| are sleepwalking into major fiascos when a simple doorbell
| needs to sit on top of this level of complexity. It's in our
| best interest to not tie every small thing into layers and layers
| of complexity. Mundane things like doorbells need to have their
| fallback at least done properly to function locally without
| relying on complex cloud systems.
| simonbarker87 wrote:
| I'm not all that angry over the situation but more disappointed
| that we've all collectively handed the keys over to AWS because
| "servers are hard". Yeah, they are, but it's not like locking
| ourselves into one vendor with flaky docs and a black box of
| bugs is any better; at least when your own servers go down it's
| on you and you don't take out half of North America.
| ranguna wrote:
| You can either pay a dedicated team to manage your on prem
| solution, go multi cloud, or simply go multi region on aws.
|
| My company was not affected by this outage because we are
| multi region. Cheapest and quickest option if you want to
| have at least some fault tolerance.
| tuldia wrote:
| > ... multi region. Cheapest and quickest option if you
| want to have at least some fault tolerance.
|
| That is simply not true. You have to adapt your application
| to be multi-region aware to start with, and if you do that
| on AWS you are basically locked in to one of the most
| expensive cloud providers out there.
| iso1631 wrote:
| So was mine, but we couldn't log in
|
| But yes, having services resilient to a single point of
| failure is essential. AWS is a SPOF.
| linsomniac wrote:
| If you aren't going to rely on external vendors, servers are
| really, really hard. Redundancy in: power, cooling,
| networking? Those get expensive fast. Drop your servers into
| a data center and you're in a similar situation to dropping
| it in AWS.
|
| A couple years ago all our services at our data center just
| vanished. I call the data center and they start creating a
| ticket. "Can you tell me if there is a data center outage?"
| "We are currently investigating and I don't have any
| information I can give you." "Listen, if this is a problem
| isolated to our cabinet, I need to get in the car. I'm trying
| to decide if I need to drive 60 miles in a blizzard."
|
| That facility has been pretty good to us over a decade, but
| they were frustratingly tight-lipped about an entire room of
| the facility losing power because one of their power feeder
| lines was down.
|
| Could AWS improve? Yes. Does avoiding AWS solve these sorts
| of problems? No.
| nix23 wrote:
| Servers are not hard if you have a dedicated person (long ago
| known as a systems administrator), and fun fact... it's
| sometimes even much cheaper and more reliable than having
| everything in the "cloud".
|
| Personally I am a believer in mixed environments: public
| webservers etc. in the "cloud", locally used systems and
| backup "in house" with a second location (both in data
| centers, or at least one), and no, I'm not talking about the
| next Google but about the 99% of businesses.
| isbvhodnvemrwvn wrote:
| Not one person, at least four people to run stuff 24/7.
| nix23 wrote:
| 99% of businesses don't need 24/7, but two people are the bare
| minimum (an admin, and a dev or admin).
| juanani wrote:
| Bashing trillion dollar behemoths is the right thing to do; by
| default they spend a certain % on influencing your subconscious
| brain to get up and defend them every time someone has a bad
| article on them. They've probably spent billions on writing
| positive articles about themselves. They only ask for more and
| more money, so you get antsy when they f up and people are not
| pleased with them? They aren't jumping up to lick their boots?
| Fascist shills, can you fuck off to Mars already?
| Ensorceled wrote:
| > Outrage is the easy response. Empathy and learning is the
| valuable one.
|
| I'm outraged that AWS, as a company policy, continues to lie
| about the status of their systems during outages, making it
| hard for me to communicate to my stakeholders.
|
| Empathy? For AWS? AWS is part of a mega corporation that is
| closing in on 2 TRILLION dollars in market cap. It's not a
| person. I can empathize with individuals who work for AWS but
| it's weird to ask us to have empathy for a massive faceless,
| ruthless, relentless, multinational juggernaut.
| Jgrubb wrote:
| It seems obvious to me that they're specifically talking
| about having empathy for the people who work there, the
| people who designed and built these systems and yes, empathy
| even for the people who might not be sure what to put on
| their absolutely humongous status page until they're sure.
| Ensorceled wrote:
| But I don't see people attacking the AWS team, at worst the
| "VP" who has to approve changes to the dashboard. That's
| management and that "VP" is paid a lot.
| ithkuil wrote:
| My reading of GP's comment is that the empathy should be
| directed towards AWS' _team_ , the people who are building
| the system and handling the fallout, not AWS the corporate
| entity.
|
| I may be wrong, but I try to apply the
| https://en.m.wikipedia.org/wiki/Principle_of_charity
| xyst wrote:
| I am not a fan of AWS due to their substantial market share on
| cloud computing. But as a software engineer I do appreciate their
| ability to provide fast turnarounds on root cause analyses and
| make them public.
| waz0wski wrote:
| This isn't a good example of an RCA - as other commenters have
| noted, it's outrightly lying about some issues during the
| incident, and using creative language to dance around other
| problems many people encountered.
|
| If you want to dive into postmortems, there are some repos
| linking other examples
|
| https://github.com/danluu/post-mortems
|
| https://codeberg.org/hjacobs/kubernetes-failure-stories
| nayuki wrote:
| > Operators instead relied on logs to understand what was
| happening and initially identified elevated internal DNS errors.
| Because internal DNS is foundational for all services and this
| traffic was believed to be contributing to the congestion, the
| teams focused on moving the internal DNS traffic away from the
| congested network paths. At 9:28 AM PST, the team completed this
| work and DNS resolution errors fully recovered.
|
| Having DNS problems sounds a lot like the Facebook outage of
| 2021-10-04. https://en.wikipedia.org/wiki/2021_Facebook_outage
| shepherdjerred wrote:
| It's quite a bit different... Facebook took themselves offline
| completely because of a bad BGP update, whereas AWS had network
| congestion due to a scaling event. DNS relies on the network,
| so of course it'll be impacted if the network is also impacted.
| bdd wrote:
| No, it wasn't a "bad BGP update". The BGP withdrawal of anycast
| addresses was a desired outcome of a region (serving location)
| getting disconnected from the backbone. If you'd like to
| trivialize it, you could say it was a configuration change to
| the software-defined backbone.
| human wrote:
| The rule is that it's _always_ DNS.
| jessaustin wrote:
| DNS seemed to be involved with _both_ the Spectrum business
| internet and Charter internet outages overnight. So much for
| diversifying!
| grouphugs wrote:
| Stop using AWS. I can't wait till Amazon is hit so hard every
| day that they can't retain customers.
| tyingq wrote:
| _" Amazon Secure Token Service (STS) experienced elevated
| latencies"_
|
| I was getting 503 "service unavailable" from STS during the
| outage most of the time I tried calling it.
|
| I guess by "elevated latency" they mean the latency seen by
| anyone whose retry logic eventually got through after many
| consecutive failed attempts?
| jrockway wrote:
| I suppose all outages are just elevated latency. Has anyone
| ever had an outage and said "fuck it, we're going out of
| business" and never came back up? That's the only true outage
| ;)
| hericium wrote:
| 5xx errors are servers or proxies giving up on requests.
| Increased timeouts resulting in eventually-successful requests
| may have been counted as "elevated latency" (though that is
| rarely a proper way to solve an issue like this).
|
| They treat 5xx errors as non-errors, but that is not how the
| rest of the world sees them. "Elevated latency" is Amazon's
| untruthful term for "not working at all".
| comboy wrote:
| So many lessons in this article. When your service goes down
| but eventually gets back up, it's not an outage. It's "elevated
| latency". Of a few hours, maybe days.
| WaxProlix wrote:
| STS is the worst with this. Even for other internal teams, they
| seem to treat dropped requests (i.e., timeouts, which show up as
| 5xxs on the client side) as 'non-faults', and so don't count
| those data points in their graphs and alarms. It's really
| obnoxious.
|
| AWS in general is trying hard to do the right thing for
| customers, and obviously has a long way to go. But man, a few
| specific orgs have some frustrating holdover policies.
| hericium wrote:
| > AWS in general is trying hard to do the right thing for
| customers
|
| You are responding to a comment that suggests they're
| misrepresenting the truth (which wouldn't be the first time,
| even in the last few days) in communications to their customers.
|
| As always, they are doing the right thing for themselves
| only.
|
| EDIT: I think that you should mention being an Engineer at
| Amazon AWS in your comment.
| tybit wrote:
| It was very clear from their post that they were
| criticising STS from the perspective of an engineer in AWS
| within a different team.
| hericium wrote:
| I assumed in good faith that this was someone who knows the
| internals as a larger customer, not an AWS person shit-
| talking other AWS teams.
|
| I only got curious after a downvote, hence the late edit. My
| bad.
| ignoramous wrote:
| > _...an AWS person shit-talking other AWS teams [in
| public]._
|
| I remember a time when this would be an instant
| reprimand... Either amzn engs are bolder these days, or
| amzn hr is trying really hard for amzn to be "world's
| best employer", or both.
| filoleg wrote:
| Gotta deanonymize the user to reprimand them. Maybe I am
| wrong here, but I don't see it as something an Amazon HR
| employee would actually waste their time on (exceptions
| apply for confidential info leaks and other blatantly
| illegal stuff, of course). Especially given that it might
| as well be impossible, unless the user incriminated
| themselves with identifiable info.
| WaxProlix wrote:
| It's true that I shouldn't have posted it, was mostly
| just in a grumpy mood. It's still considered very bad
| form. I'm not actually there anymore, but the idea
| stands.
| bamboozled wrote:
| Still doesn't explain the cause of all the IAM "permission
| denied" responses we saw against policies that are now working
| fine again without any intervention.
|
| Obviously networking issues can cause any number of symptoms,
| but it seems like an unusual detail to leave out. Unless it was
| another, separate outage happening at the same time.
| notimetorelax wrote:
| It's so hard to know what the state of the system was when the
| monitoring was out. I wouldn't be surprised if they don't have
| the data to investigate it now.
| a45a33s wrote:
| How are auth requests supposed to reach the auth server if the
| networking is broken?
| bamboozled wrote:
| I'd accept this as an answer if I received a timeout or a
| message to say that.
|
| Permission denied is something else altogether, because it
| implies the request reached an authorisation system, was
| evaluated, and was denied.
| sbierwagen wrote:
| Fail-secure + no separate error for timeouts maybe? If the
| server can't be reached then it just denies the request.
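| A sketch of that failure mode (hypothetical policy-service
| client, not actual IAM internals):
|
|     import requests
|
|     def is_authorized(principal, action, resource):
|         # Fail closed: any problem reaching the policy service
|         # is surfaced to the caller as a plain "deny", so a
|         # network brownout looks identical to a real denial.
|         try:
|             resp = requests.post(
|                 "https://auth.internal/check",
|                 json={"principal": principal,
|                       "action": action,
|                       "resource": resource},
|                 timeout=0.5,
|             )
|             return resp.status_code == 200 and resp.json()["allow"]
|         except requests.RequestException:
|             return False  # timeout / connection error -> deny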
| amznbyebyebye wrote:
| I'm glad they published something, and so quickly too.
| Ultimately these guys are running a business. There are other
| market alternatives, multibillion-dollar contracts at play,
| SLAs, etc. It's not as simple as people think.
| StreamBright wrote:
| "At 7:30 AM PST, an automated activity to scale capacity of one
| of the AWS services hosted in the main AWS network triggered an
| unexpected behavior from a large number of clients inside the
| internal network. "
|
| Very detailed.
| herodoturtle wrote:
| I am grateful to AWS for this report.
|
| Not sure if any AWS support staff are monitoring this thread, but
| the article said:
|
| > Customers also experienced login failures to the AWS Console in
| the impacted region during the event.
|
| All our AWS instances / resources are in EU/UK regions, and yet
| we couldn't access our console either.
|
| Thankfully none of our instances were affected by the outage, but
| our inability to access the console was quite worrying.
|
| Any idea why this was the case?
|
| Any suggestions to mitigate this risk in the event of a future
| outage would be appreciated.
| [deleted]
| plasma wrote:
| At the time, they posted on the status page to try using the
| alternate region endpoints like us-west.console.Amazon.com (I
| think), but I'm not sure if it was a true fix.
| azundo wrote:
| > This resulted in a large surge of connection activity that
| overwhelmed the networking devices between the internal network
| and the main AWS network, resulting in delays for communication
| between these networks. These delays increased latency and errors
| for services communicating between these networks, resulting in
| even more connection attempts and retries. This led to persistent
| congestion and performance issues on the devices connecting the
| two networks.
|
| I remember my first experience realizing the client retry logic
| we had implemented was making our lives way worse. Not sure if
| it's heartening or disheartening that this was part of the issue
| here.
|
| Our mistake was resetting the exponential backoff delay whenever
| a client successfully connected and received a response. At the
| time a percentage of responses, but not all, were degraded and
| extremely slow, and the request that checked the connection was
| not among them. So a client would time out, retry for a while,
| backing off exponentially, eventually reconnect successfully,
| and then after a subsequent failure start aggressively retrying
| again. System dynamics are hard.
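| Roughly what the bug looked like, heavily simplified
| (hypothetical names; the real code was more involved):
|
|     import time
|
|     class RequestFailed(Exception):
|         pass
|
|     def do_request():
|         ...  # hypothetical: the actual call to the backend
|
|     BASE, MAX_DELAY = 1.0, 300.0
|     delay = BASE
|     while True:
|         try:
|             do_request()
|             delay = BASE   # bug: one cheap success resets the
|                            # backoff, so the very next failure
|                            # starts retrying aggressively again
|         except RequestFailed:
|             time.sleep(delay)
|             delay = min(delay * 2, MAX_DELAY)
|
| A fix is to decay the delay gradually, or only reset it after a
| stretch of sustained successes, rather than snapping back to the
| base on the first success.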
| EnlightenedBro wrote:
| But what's a good alternative then? What if the internet
| connection has recovered and you're at, say, the 4-minute retry
| delay? Would you just make your users stare at a spinning
| loader for 8 minutes?
| xmprt wrote:
| I first learned about exponential backoff from TCP, and TCP has
| a lot of other smart ways to manage congestion. You don't need
| to implement all of those ideas in client logic, but you can do
| a lot better than just basic exponential backoff.
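| One TCP idea that translates fairly directly is AIMD on the
| client's own send rate, on top of per-request backoff. A toy
| sketch, not production code:
|
|     # Additive-increase / multiplicative-decrease "request
|     # window", analogous to TCP's congestion window: grow the
|     # number of allowed in-flight requests slowly on success,
|     # halve it on any failure or timeout.
|     window = 1.0
|     MAX_WINDOW = 100.0
|
|     def on_success():
|         global window
|         window = min(window + 1.0 / window, MAX_WINDOW)
|
|     def on_failure_or_timeout():
|         global window
|         window = max(window / 2.0, 1.0)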
| kqr wrote:
| Sure, why not?
|
| Or tell them directly that "We have screwed up. The service
| is currently overloaded. Thank you for your patience. If you
| still haven't given up on us, try again at a less busy time of
| day. We are very sorry."
|
| There are several options, and finding the best one depends a
| bit on estimating the behaviour of your specific target
| audience.
| sciurus wrote:
| See for instance the client request rejection probability
| equation at https://sre.google/sre-book/handling-overload/
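| For reference, that equation (client-side adaptive throttling)
| is simple enough to sketch: requests and accepts are counts
| over a trailing window, and K is a tunable constant, typically
| around 2:
|
|     import random
|
|     def should_reject(requests, accepts, k=2.0):
|         # Probability of rejecting a new request locally, before
|         # it ever reaches the already-overloaded backend.
|         p = max(0.0, (requests - k * accepts) / (requests + 1))
|         return random.random() < p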
| heisenbit wrote:
| The problem shows up at the central system while the peripheral
| device is causing it. And those systems belong to very
| different organizations with very different priorities. I still
| remember how difficult the discussion with the 3G base-station
| team was, persuading them to implement exponential backoff with
| some random factor when connecting to the management system.
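| The random factor is the part that matters most at fleet scale:
| without it, every device that lost its connection at the same
| moment also retries at the same moment, forever. A minimal
| sketch (the "full jitter" variant):
|
|     import random
|
|     def backoff_delay(attempt, base=1.0, cap=300.0):
|         # Exponential backoff with full jitter: pick a uniform
|         # delay in [0, min(cap, base * 2**attempt)] so retries
|         # from a large fleet spread out instead of synchronizing.
|         return random.uniform(0.0, min(cap, base * 2 ** attempt))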
| colechristensen wrote:
| > System dynamics are hard.
|
| And they have to be actually tested. Most of them are designs
| based on nothing but uninformed intuition. There is an art to
| backpressure and keeping pipelines optimally utilized. Queueing
| doesn't work the way you think it does until you really know it.
| gfodor wrote:
| Why is this hard, and why can't it just be written down
| somewhere as part of the engineering discipline? This aspect of
| systems in 2021 really shouldn't be an "art."
| adrianN wrote:
| Because no two systems are alike and these are nonlinear
| effects that strongly depend on the details, would be my
| guess.
| colechristensen wrote:
| It is, in itself, a separate engineering discipline, and
| one that cannot really be practiced analytically unless you
| understand _really well_ the behavior of individual pieces
| which interact with each other. Most don't, and don't care to.
|
| It is something that needs to be designed and tuned in place;
| it evades "getting it right" at design time without real-world
| feedback.
|
| And you also simply have to reach a certain, somewhat large
| scale for it to matter at all: at smaller scales, the excess
| capacity you carry (because capacity only comes in coarse
| increments) absorbs most of the need for it, and you can get
| away with spending a bit of money on extra scale instead.
|
| It is also sensitive to small changes, so a textbook example
| might be implemented with one small detail wrong that won't
| show itself until a critical failure is already happening.
|
| It is usually the site of the highest-complexity interactions
| in a business's infrastructure, which is not easily distilled
| into a formula. (And most people just aren't educationally
| prepared for nonlinear dynamics.)
| ksrm wrote:
| Because """software engineering""" is a joke.
| pfortuny wrote:
| Exponential behaviour is _hard_ to understand.
|
| Also, experiments at this size and speed have never been
| carried out before.
|
| And statistical behaviours are very difficult to
| understand. First thing: 99.9999% uptime for ALL users is
| HUGELY different from 99.999%.
|
| As a matter of fact, this was _just one_ of Amazon's regions,
| remember.
|
| Edit: finally, the right model for these systems might well
| have no mean (fat tails...) and then where do the
| statistics go from there?
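| To put rough numbers on the uptime comparison above:
|
|     SECONDS_PER_YEAR = 365.25 * 24 * 3600
|     for availability in (0.99999, 0.999999):
|         downtime = (1 - availability) * SECONDS_PER_YEAR
|         print(f"{availability:.4%} -> {downtime:.0f} s/year")
|     # 99.9990% -> ~316 s/year (about 5 minutes)
|     # 99.9999% -> ~32 s/year  (about half a minute)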
| jmiserez wrote:
| It absolutely is written down. The issue is that the
| results you get from modeling systems using queuing theory
| are often unintuitive and surprising. On top of that it's
| hard to account for all the seemingly minor implementation
| details in a real system.
|
| During my studies we had a course where we built a distributed
| system and had to model its performance mathematically. It was
| really hard to get the model to match reality and vice versa.
| So many details are hidden in a library, framework, or network
| adapter somewhere (e.g. buffers or packet fragmentation).
|
| We used the book "The Art of Computer Systems Performance
| Analysis" (R. Jain), but I don't recommend it. At least not
| the 1st edition which had a frustrating amount of serious,
| experiment-ruining errata.
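| A classic example of the unintuitive part is how sharply latency
| blows up near saturation. For a single M/M/1 queue the mean time
| in system is 1 / (mu - lambda):
|
|     MU = 100.0  # service rate, requests/second
|     for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
|         lam = utilization * MU            # arrival rate
|         w = 1.0 / (MU - lam)              # mean time in system
|         print(f"{utilization:.0%} utilized -> {w * 1000:.0f} ms")
|     # 50% -> 20 ms, 90% -> 100 ms, 99% -> 1000 ms: the last few
|     # percent of utilization cost an order of magnitude of
|     # latency.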
| dclowd9901 wrote:
| Think of other extremely complex systems and how we've
| managed to make them stable:
|
| 1) airplanes: they crashed, _a lot_. We used data recorders
| and stringent processes to make safe air travel commonplace.
|
| 2) cars: so many accidents, and so much accident research. The
| solution comes after the disaster.
|
| 3) large buildings and structures: again, the master work
| of time, attempts, failures, research and solutions.
|
| If we really want to get serious about this (and I think we
| do) we need to stop reinventing infrastructure every 10
| years and start doubling down on stability. Cloud
| computing, in earnest, has only been around a short while.
| I'm not even convinced it's the right path forward; it just
| happens to align best with business interests. But it seems to
| be the devil we're stuck with, so now we need to really dig in
| and make it solid. I think we're actually in that
| process right now.
| vmception wrote:
| > Most of them are designs based on nothing but uninformed
| intuition.
|
| Or because they read it on a Google|AWS Engineering blog
| vmception wrote:
| Or they regurgitated a bullshit answer from a system design
| prep course while pretending to think of it on the spot
| just to get hired
| eigen-vector wrote:
| I exceeded the character limit on the title so I couldn't
| include this detail there, but this is the post-mortem of the
| event on December 7, 2021.
| markus_zhang wrote:
| >At 7:30 AM PST, an automated activity to scale capacity of one
| of the AWS services hosted in the main AWS network triggered an
| unexpected behavior from a large number of clients inside the
| internal network.
|
| Just curious, is this scaling an AWS job or a client job? Looks
| like an AWS one from the context. I'm wondering if they are
| deploying additional data centers or something else?
| AtlasBarfed wrote:
| "At 7:30 AM PST, an automated activity to scale capacity of one
| of the AWS services hosted in the main AWS network triggered an
| unexpected behavior from a large number of clients inside the
| internal network. This resulted in a large surge of connection
| activity that overwhelmed the networking devices between the
| internal network and the main AWS network, resulting in delays
| for communication between these networks. These delays increased
| latency and errors for services communicating between these
| networks, resulting in even more connection attempts and
| retries."
|
| So was this in service to something like DynamoDB or some other
| service?
|
| As in, did some of those extra services that AWS offers for
| lock-in (and that undermine open-source projects with embrace
| and extend) bomb the mainline EC2 service?
|
| Because this kind of smacks of the "Microsoft hidden APIs" that
| Office got to use against competitors. Does AWS use "special
| hardware capabilities" to compete against other companies
| offering roughly the same service?
| nijave wrote:
| Yes, and other cloud providers (Google, Microsoft) probably
| have similar capabilities. Besides special network equipment,
| they use PCIe accelerators/coprocessors on their hypervisors to
| offload all non-VM activity (Nitro instances).
|
| They also recently announced Graviton ARM CPUs.
| cyounkins wrote:
| My favorite sentence: "Our networking clients have well tested
| request back-off behaviors that are designed to allow our systems
| to recover from these sorts of congestion events, but, a latent
| issue prevented these clients from adequately backing off during
| this event."
| discodave wrote:
| I saw pleeeeeenty of untested code at Amazon/AWS. Looking back,
| it was almost as if the most important services/code had the
| least amount of testing, while internal boondoggle projects (I
| worked on a couple) had complicated test plans and debates
| about coverage metrics.
| DrBenCarson wrote:
| This is almost always the case.
|
| The most important services get the most attention from
| leaders who apply the most pressure, especially in the first
| ~2y of a fast-growing or high-potential product. So people
| skip tests.
| foobiekr wrote:
| In reality, most successful real-world projects are mostly
| untested because testing isn't actually a high-ROI endeavor. It
| kills me to realize that mediocre code you can hack all over to
| do unnatural things is generally higher value in phase I than
| the same code done well in twice the time.
| jessermeyer wrote:
| This attitude is why modern software is a continuing
| controlled flight into terrain.
| topspin wrote:
| Finger wagging has also failed to produce any solutions.
| virtue3 wrote:
| I think the pendulum swinging back is going to mean designing
| systems where it's harder to write bad code.
|
| TypeScript is a good example of trying to fix this. Rust
| is even better.
|
| Deno, I think, takes things in a better direction as
| well.
|
| Ultimately we're going to need systems that just don't
| let you do "unnatural" things but still maintain a great
| deal of forward mobility. I don't think that's an
| unreasonable ask of the future.
| dastbe wrote:
| My take is that the overwhelming majority of services
| underinvest in making testing easy. The services that need to
| grow fast due to customer demand skip the tests, while the
| services that aren't going much of anywhere spend way too much
| time on testing.
| wbsun wrote:
| Interesting. I also work for a cloud provider. My team works on
| both internal infrastructure and product features. We take test
| coverage very seriously and tie the metrics to the team's perf.
| Any product feature must have unit tests, integration tests at
| each layer of the stack, staging tests, production tests, and
| continuous probers in production. But our reliability is still
| far from satisfactory. Now, with your observation at AWS, I'm
| starting to wonder whether the coverage effort and the
| different types of tests really help or not...
| tootie wrote:
| It's gotta be a whole thing to even think about how to
| accurately test this kind of software. Simulating all kinds
| of hardware failures, network partitions, power failures, or
| the thousand other failure modes.
|
| Then again, they get something like $100B in revenue; that
| should buy some decent unit tests.
| a-dub wrote:
| This caught my eye as well. I'd wager that it was not
| configured.
| dilap wrote:
| Oh you know, an editing error -- they accidentally dropped the
| word "not".
| foobiekr wrote:
| Thundering herd and accidental synchronization for the win.
|
| I am sad to say that I find issues like this any time I look at
| retry logic written by anyone I have not previously interacted
| with on the topic. It is shockingly common, even in companies
| where networking is their bread and butter.
| cyounkins wrote:
| It absolutely is difficult. A challenge I have seen is when
| retries are stacked and callers time out subprocesses that
| are doing retries.
|
| I just find it amusing that they describe their back-off
| behaviors as "well tested" and, in the same sentence, say they
| didn't back off adequately.
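| The stacking problem is easy to put numbers on: if each of three
| layers retries 3 times on failure, one user action can hit the
| bottom service up to 3**3 = 27 times, all inside a single
| user-facing timeout. One mitigation is to let only the outermost
| layer retry; a rough sketch with a hypothetical retry_budget
| field carried in the request:
|
|     def call_with_budget(request, do_call, default_attempts=3):
|         # The outermost caller gets the real retry budget and
|         # forwards a budget of 1, so inner layers try once and
|         # fail fast instead of multiplying retries.
|         attempts = request.get("retry_budget", default_attempts)
|         last_exc = None
|         for _ in range(attempts):
|             try:
|                 return do_call({**request, "retry_budget": 1})
|             except TimeoutError as exc:
|                 last_exc = exc
|         raise last_exc or TimeoutError("no attempts allowed")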
| eyelidlessness wrote:
| > It absolutely is difficult. A challenge I have seen is
| when retries are stacked and callers time out subprocesses
| that are doing retries.
|
| This is also a general problem with (presumed stateless)
| concurrent/distributed systems, one which irked me while
| working on such a system and for which I still haven't found
| meaningful resources that aren't extremely
| platform/stack/implementation specific:
|
| A concurrent system has some global/network-
| wide/partitioned-subset-wide error or backoff condition. If
| that system is actually stateless and receives pushed work,
| communicating that condition to the workers either means
| pushing the state management back to a less concurrent
| orchestrator to reprioritize (introducing a huge bottleneck and
| a single or fragile point of failure) or accepting that a lot
| of failing work will be processed in pathological ways.
| lanstin wrote:
| I found myself wishing for a few code snippets here. It would
| be interesting. A lot of the time, code that handles
| "connection refused" or fast failures doesn't handle network
| slowness well. I've seen outages from "best effort" services
| (and the best-effort-ness worked when the services were hard
| down) because all of a sudden calls that had been taking 50 ms
| were not failing but were all taking 1500+ ms. Best effort, but
| no client-enforced SLAs that were low enough to matter.
|
| Load shedding never kicked in, so things had to be shut down
| for a bit and then restarted.
|
| It seems their normal operating state might be what is called
| "metastable": dynamically stable at high throughput, unless and
| until a brief glitch bumps the system into a state where little
| work gets finished, which is also stable.
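| Concretely, the kind of client-enforced SLA I mean, as a minimal
| sketch (fetch_recommendations is a hypothetical best-effort
| dependency; the key is that the timeout comes from the caller's
| latency budget, not from what the dependency usually takes):
|
|     import concurrent.futures
|
|     _pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)
|
|     def best_effort(call, budget_ms=100, fallback=None):
|         # Treat anything slower than the budget as a failure and
|         # return the fallback, so a dependency degrading from
|         # 50 ms to 1500+ ms can't drag the caller's latency down
|         # with it. (The slow call still occupies a worker thread,
|         # which is why load shedding is needed as well.)
|         future = _pool.submit(call)
|         try:
|             return future.result(timeout=budget_ms / 1000.0)
|         except concurrent.futures.TimeoutError:
|             return fallback
|
|     # e.g. recs = best_effort(fetch_recommendations, 100, [])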
| raffraffraff wrote:
| Something they didn't mention is AWS Billing alarms. These rely
| on metrics systems which were affected by this (and are missing
| some data). Crucially, billing alarms only exist in the us-east-1
| region, so if you're using them, you're impacted no matter where
| your infrastructure is deployed. (That's just my reading of it.)
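| For anyone setting these up: the AWS/Billing metrics, and hence
| the alarms on them, live only in us-east-1, which is why a
| us-east-1 event can blind billing alarms for infrastructure
| running anywhere. A minimal boto3 sketch (the SNS topic ARN and
| threshold are placeholders):
|
|     import boto3
|
|     cw = boto3.client("cloudwatch", region_name="us-east-1")
|     cw.put_metric_alarm(
|         AlarmName="monthly-bill-over-1000-usd",
|         Namespace="AWS/Billing",
|         MetricName="EstimatedCharges",
|         Dimensions=[{"Name": "Currency", "Value": "USD"}],
|         Statistic="Maximum",
|         Period=21600,
|         EvaluationPeriods=1,
|         Threshold=1000.0,
|         ComparisonOperator="GreaterThanThreshold",
|         AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing"],
|     )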
___________________________________________________________________
(page generated 2021-12-11 23:01 UTC)