[HN Gopher] Summary of the AWS Service Event in the Northern Vir...
       ___________________________________________________________________
        
       Summary of the AWS Service Event in the Northern Virginia (US-
       East-1) Region
        
       Author : eigen-vector
       Score  : 603 points
        Date   : 2021-12-10 22:54 UTC (1 day ago)
        
 (HTM) web link (aws.amazon.com)
 (TXT) w3m dump (aws.amazon.com)
        
       | almostdeadguy wrote:
       | > The AWS container services, including Fargate, ECS and EKS,
       | experienced increased API error rates and latencies during the
       | event. While existing container instances (tasks or pods)
       | continued to operate normally during the event, if a container
       | instance was terminated or experienced a failure, it could not be
       | restarted because of the impact to the EC2 control plane APIs
       | described above.
       | 
       | This seems pretty obviously false to me. My company has several
       | EKS clusters in us-east-1 with most of our workloads running on
       | Fargate. All of our Fargate pods were killed and were unable to
       | be restarted during this event.
        
         | ClifReeder wrote:
         | Strong agree. We were using Fargate nodes in our us-east-1 EKS
         | cluster and not all of our nodes dropped, but every coredns pod
         | did. When they came back up their age was hours older than
         | expected, so maybe a problem between Fargate and the scheduler
         | rendered them "up" but unable to be reached?
         | 
          | Either way, it was surprising to us that already-provisioned
          | compute was impacted.
        
           | silverlyra wrote:
           | Saw the same. The only cluster services I was running in
           | Fargate were CoreDNS and cluster-autoscaler; thought it would
           | help the clusters recover from anything happening to the node
           | group where other core services run. Whoops.
           | 
           | Couldn't just delete the Fargate profile without a working
           | EKS control plane. I lucked out in that the label selector
           | the kube-dns Service used was disjoint from the one I'd set
           | in the Fargate profile, so I just made a new "coredns-
           | emergency" deployment and cluster networking came back.
           | (cluster-autoscaler was moot since we couldn't launch
           | instances anyway.)
           | 
           | I was hoping to see something about that in this
           | announcement, since the loss of live pods is nasty. Not
           | inclined to rely on Fargate going forward. It is curious that
           | you saw those pod ages; maybe Fargate kubelets communicate
           | with EKS over the AWS internal network?
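            |
            | A minimal sketch of that workaround (Python with the
            | kubernetes client), assuming a Fargate profile that selects
            | on a custom label like "compute: fargate" while the kube-dns
            | Service selects on the stock "k8s-app: kube-dns" label; the
            | names are illustrative, not our exact setup:
            |
            |   from kubernetes import client, config
            |
            |   config.load_kube_config()
            |   apps = client.AppsV1Api()
            |
            |   # Clone the stock CoreDNS Deployment under a new name
            |   # whose pods keep the Service's label but drop the
            |   # Fargate profile's label, so they schedule onto EC2
            |   # nodes instead of Fargate.
            |   src = apps.read_namespaced_deployment(
            |       "coredns", "kube-system")
            |   clone = client.V1Deployment(
            |       metadata=client.V1ObjectMeta(
            |           name="coredns-emergency",
            |           namespace="kube-system"),
            |       spec=src.spec,
            |   )
            |   labels = {"k8s-app": "kube-dns"}  # no Fargate label
            |   clone.spec.selector = client.V1LabelSelector(
            |       match_labels=labels)
            |   clone.spec.template.metadata.labels = labels
            |   apps.create_namespaced_deployment("kube-system", clone)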
        
       | whatever1 wrote:
        | Noob question, but why does network infrastructure need DNS? Why
        | don't the full IPv6 addresses of the various components suffice
        | to do business?
        
         | AUX4R6829DR8 wrote:
         | That "internal network" hosts an awful lot of stuff- it's not
         | just network hardware, but services that mostly use DNS to find
         | each other. Besides that, it's just plain useful for network
         | devices to have names.
         | 
         | (Source: Work at AWS.)
        
         | nijave wrote:
          | It's basically used for service discovery. At a certain point,
          | you have too many different devices, which are potentially
          | changing, to identify them by IP. You want some abstraction
          | layer to separate physical devices from services, and DNS lets
          | you do things like advertise different IPs at different times
          | in different network zones.
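          |
          | A minimal illustration in Python (names are just examples):
          | callers resolve a stable name and get whichever addresses are
          | being advertised at that moment, and the answer can change
          | between lookups as things move around behind the name.
          |
          |   import socket
          |
          |   def discover(name: str, port: int) -> list[str]:
          |       # One entry per advertised address; the set can
          |       # shift over time as operators move services.
          |       infos = socket.getaddrinfo(
          |           name, port, proto=socket.IPPROTO_TCP)
          |       return sorted({info[4][0] for info in infos})
          |
          |   print(discover("example.com", 443))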
        
       | soheil wrote:
        | Having an internal network like this that everything on the main
        | AWS network so heavily depends on is just bad design. One does
        | not build a stable high-tech spacecraft and then fuel it with
        | coal.
        
         | discodave wrote:
         | It's 2006, you work for an 'online book store' that's
         | experimenting with this cloud thing. Are you going to build a
         | whole new network involving multi-million dollar networking
         | appliances?
        
           | jvolkman wrote:
           | Developing EC2 did involve building a whole new network. It
           | was a new service, built from the ground up to be a public
           | product.
        
             | User23 wrote:
             | Amazon was famously frugal during that era.
        
           | soheil wrote:
           | No but one would hope 15 years and 1 trillion dollars later
           | you would stop running it on the computer under your desk.
        
         | ripper1138 wrote:
         | Might seem that way because of what happened, but the main
         | network is probably more likely to fail than the internal
         | network. In those cases, running monitoring on a separate
         | network is critical. EC2 control plane same story.
        
           | soheil wrote:
           | The entire value proposition for AWS is "migrate your
           | internal network to us so it's more stable with less
           | management." I buy that 100%, and I think you're wrong to
           | assume their main network is more likely to fail than their
           | internal one. They have every incentive to continuously
           | improve it because it's not made just for one client.
        
         | zeckalpha wrote:
         | What would you recommend instead? Control plane/data plane is a
         | well known pattern.
        
       | divbzero wrote:
       | > _This congestion immediately impacted the availability of real-
       | time monitoring data for our internal operations teams, which
       | impaired their ability to find the source of congestion and
       | resolve it._
       | 
       | Disruption of the standard incident response mechanism seems to
       | be a common element of longer lasting incidents.
        
         | femiagbabiaka wrote:
          | It is. And to add, all the automation we rely on in peacetime
          | can complicate cross-cutting wartime incidents by raising the
          | ambient complexity of an environment. See Bainbridge for more:
          | https://blog.acolyer.org/2020/01/08/ironies-of-automation/
        
         | sponaugle wrote:
          | Indeed - even the recent Facebook outage showed how slow
          | recovery can be if the primary investigation and recovery
          | methods are directly impacted as well. Back in the old days
          | some environments would have POTS dial-in connections to the
          | consoles as a backup for network problems. That of course
          | doesn't scale, but it was an attempt to have an alternate path
          | for getting to things. Regrettably, if a backhoe takes out all
          | of the telecom at once, that plan doesn't work so well.
        
         | daenney wrote:
         | Yup. There was a GCP outage a couple of years ago like this.
         | 
          | I don't remember the exact details, but it was something along
          | the lines of a config change going out that caused systems to
          | incorrectly assume there were huge bandwidth constraints. Load
          | shedding kicked in to drop lower-priority traffic, which
          | ironically included monitoring data, rendering GCP responders
          | blind and causing StackDriver to go blank for customers.
        
       | Ensorceled wrote:
       | There are a lot of comments in here that boil down to "could you
       | do infrastructure better?"
       | 
       | No, absolutely not. That's why I'm on AWS.
       | 
       | But what we are all ACTUALLY complaining about is ongoing lack of
       | transparent and honest communications during outages and,
       | clearly, in their postmortems.
       | 
       | Honest communications? Yeah, I'm pretty sure I could do _that_
       | much better than AWS.
        
       | DenisM wrote:
       | > Customers accessing Amazon S3 and DynamoDB were not impacted by
       | this event.
       | 
       | We've seen plenty of S3 errors during that period. Kind of
       | undermines credibility of this report.
        
         | davidstoker wrote:
         | Do you use VPC endpoints for S3? The next sentence explained
         | failures I observed with S3: "However, access to Amazon S3
         | buckets and DynamoDB tables via VPC Endpoints was impaired
         | during this event."
        
           | karmelapple wrote:
           | I could not modify file properties in S3, uploading new or
           | modified files was spotty, and AWS Console GUI access was
           | broken as well. Was that because of VPC endpoints?
        
             | chickenpotpie wrote:
             | "Customers also experienced login failures to the AWS
             | Console in the impacted region during the event"
        
           | DenisM wrote:
           | No, I use the normal endpoint.
        
         | marcinzm wrote:
         | DAX, part of DynamoDB from how AWS groups things, was throwing
         | internal server errors for us and eventually we had to reboot
         | nodes manually. That's separate from the STS issues we had in
         | terms of our EKS services connecting to DAX.
        
         | Uyuxo wrote:
         | Same. I saw intermittent S3 errors the entire time. Also
         | nothing mentioned about SQS even though we were seeing errors
         | there as well.
        
         | banana_giraffe wrote:
         | Yeah
         | 
         | > For example, while running EC2 instances were unaffected by
         | this event
         | 
         | That ignores the 3-5% drop in traffic I saw in us-east-1 on EC2
         | instances that only talk to peers on the Internet with TCP/IP
         | during this event.
        
           | manquer wrote:
            | I guess you have to read for this kind of item hidden in
            | careful language: _running_ the instances had no problem;
            | that they had limited connectivity is a different matter.
            | From AWS's point of view they don't seem to see user impact,
            | only services from their own point of view.
            |
            | Perhaps that distinction has value if your workloads did not
            | depend on external network connectivity: for example, S3
            | access without a VPC and only some DS/ML compute jobs,
            | perhaps.
        
           | grogers wrote:
           | How are you measuring this? Remember that cloudwatch was
           | apparently also losing metrics, so aggregating CW metrics
           | might show that kind of drop.
        
             | banana_giraffe wrote:
             | Yeah, I know. This was based off instance store logging on
             | these instances. For better or worse, they're very simple
             | ports of pre-AWS on-prem servers, they don't speak AWS once
             | they're up and running.
        
       | wjossey wrote:
       | I've been running platform teams on aws now for 10 years, and
       | working in aws for 13. For anyone looking for guidance on how to
       | avoid this, here's the advice I give startups I advise.
       | 
       | First, if you can, avoid us-east-1. Yes, you'll miss new
       | features, but it's also the least stable region.
       | 
       | Second, go multi AZ for production workloads. Safety of your
       | customer's data is your ethical responsibility. Protect it, back
       | it up, keep it as generally available as is reasonable.
       | 
        | Third, you're gonna go down when the cloud goes down. Not much
        | use getting overly bent out of shape. You can reduce your
        | exposure by just using their core systems (EC2, S3, SQS, LBs,
        | CloudFront, RDS, Elasticache). The more systems you use, the
        | less reliable things will be. However, running your own key value
        | store, API gateway, event bus, etc., can also be way less
        | reliable than using theirs. So, realize it's an operational
        | trade-off.
       | 
       | Degradation of your app / platform is more likely to come from
       | you than AWS. You're gonna roll out bad code, break your own
       | infra, overload your own system, way more often than Amazon is
       | gonna go down. If reliability matters to you, start by examining
       | your own practices first before thinking things like multi region
       | or super durable highly replicated systems.
       | 
       | This stuff is hard. It's hard for Amazon engineers. Hard for
       | platform folks at small and mega companies. It's just, hard. When
       | your app goes down, and so does Disney plus, take some solace
       | that Disney in all their buckets of cash also couldn't avoid the
       | issue.
       | 
       | And, finally, hold cloud providers accountable. If they're
       | unstable and not providing service you expect, leave. We've got
       | tons of great options these days, especially if you don't care
       | about proprietary solutions.
       | 
       | Good luck y'all!
        
         | manquer wrote:
          | Easy to say leave; the technical lock-in cloud service
          | providers _by design_ choose to have makes it impossible to
          | leave.
          |
          | AWS (and others) make egress costs insanely expensive for any
          | startup to consider leaving with their data, and there is a
          | constant push to either not support open protocols or to
          | extend/expand them in ways that make it hard to migrate a code
          | base easily.
          |
          | If the advice is to effectively use only managed open source
          | components, then why AWS at all? Most competent mid-sized teams
          | can do that much cheaper with colo providers like OVH/Hetzner.
          |
          | There is no point investing in AWS just to outsource running
          | base infra if we should stay away from leveraging the kind of
          | cloud native services us mere mortals cannot hope to build or
          | maintain.
         | 
          | Also this "avoid us-east-1" advice is a bit frustrating. AWS
          | does not have to always experiment with new services in the
          | same region; it is not marked as an experimental region and
          | does not have reduced SLAs. If it is inferior/preview/beta then
          | call that out in the UI and the contract. And what about when
          | there is no choice? If CloudFront is managed in us-east-1,
          | should we now not use it? Why use the cloud then?
          |
          | If your engineering only discovers scale problems in us-east-1
          | along with customers, perhaps something is wrong? AWS could
          | limit new instances in that region and spread the load; playing
          | with customers who are at your mercy like this, just because
          | you can, is not nice.
          |
          | Disney can afford to go down, or to build their own cloud;
          | small companies don't have deep pockets to do either.
        
           | EnlightenedBro wrote:
            | Lesson: build your services with Docker and Terraform. With
            | that setup you can spin up a working clone of a decently
            | sized stack on a different cloud provider in under an hour.
           | 
           | Don't lock yourself in.
        
             | lysium wrote:
             | ...if you don't have much data, that is. Otherwise, you'll
             | have huge egress costs.
        
             | rorymalcolm wrote:
              | This is just not true for Terraform at all; it does not aim
              | to be multi-cloud and is a much more usable product because
              | of it. Resource parameters do not swap out directly across
              | providers (rightly so; the abstractions they choose are
              | different!).
        
             | manquer wrote:
              | If the setup is that portable you probably don't need AWS
              | at all in the first place.
              |
              | If you use only services built and managed by your own
              | Docker images, why use the cloud in the first place? It
              | would be cheaper to host with a smaller vendor; the
              | reliability of the big clouds is not substantially better
              | than tier-two vendors, and the difference between, say, OVH
              | and AWS is not valuable enough to most applications to be
              | worth the premium.
              |
              | IMO, if you don't leverage the cloud native services
              | offered by GCP or AWS, then the cloud is not adding much
              | value to your stack.
        
           | cherioo wrote:
           | > AWS (and others) make egress costs insanely expensive for
           | any startup to consider leaving with their data
           | 
            | I have seen this repeated many times, but don't understand
            | it. Yes, egress is expensive, but it is not THAT expensive
            | compared to storage. S3 egress per GB is no more than 3x the
            | price of storage, i.e. moving out just costs about 3 months
            | of storage (there's also API cost, but that's not the one
            | often mentioned).
           | 
           | Is egress pricing being a lock-in factor just a myth? Is
           | there some other AWS cost I'm missing? Obviously there will
           | be big architectural and engineering cost to move, but that's
           | just part of life.
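            |
            | A back-of-the-envelope check, using assumed list prices
            | (roughly S3 Standard storage at $0.023/GB-month and internet
            | egress at $0.09/GB; real bills vary with tiers, storage
            | class and discounts):
            |
            |   data_gb = 100 * 1024       # say, 100 TB
            |   storage = data_gb * 0.023  # assumed $/GB-month
            |   egress = data_gb * 0.09    # assumed $/GB out
            |   print(f"monthly storage: ${storage:,.0f}")
            |   print(f"one-time egress: ${egress:,.0f}")
            |   print(f"= {egress / storage:.1f} months of storage")
            |
            | With those assumptions the exit bill comes to roughly 3-4
            | months of storage spend, in line with the ratio above.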
        
             | manquer wrote:
             | 3months is only if you use standard S3, However intelligent
             | tiering , infrequent access , reduced redundancy or glacier
             | instant can be substantially cheaper, without impacting
             | retrieval time [1]
             | 
             | At scale when costs matter, you would have lifecycle policy
             | tuned to your needs taking advantage of these classes. Any
             | typical production workload is hardly paying only S3 base
             | price for all/most of its storage needs, they will have mix
             | of all these too.
             | 
             | [1] if there is substantial data in glacier regular, the
             | costing completely blows through the roof, retrieval
             | +egress makes it infeasible unless you activily hate AWS
             | enough to spend that kind of money
        
             | bushbaba wrote:
              | Often the other cloud vendors will help cover those
              | migration costs as part of your contract negotiations.
              |
              | But really, egress costs aren't what locks you in. It's the
              | hard-coded AWS APIs, Terraform scripts and technical debt.
              | Having to change all of that and refactor and reoptimize
              | for a different provider's infrastructure is a huge
              | endeavor. That time spent might have a higher ROI being put
              | elsewhere.
        
         | oasisbob wrote:
         | > Third, you're gonna go down when the cloud goes down. Not
         | much use getting overly bent out of shape.
         | 
         | Ugh. I have a hard time with this one. Back in the day, EBS had
         | some really awful failures and degradations. Building a
         | greenfield stack that specifically avoided EBS and stayed up
         | when everyone else was down during another mass EBS failure
         | felt marvelous. It was an obvious avoidable hazard.
         | 
         | It doesn't mean "avoid EBS" is good advice for the decade to
         | follow, but accepting failure fatalistically doesn't feel right
         | either.
        
         | speedgoose wrote:
         | > Third, you're gonna go down when the cloud goes down.
         | 
         | Not necessarily. You just need to not be stuck with a single
         | cloud provider. The likelihood of more than one availability
         | zone going down on a single cloud provider is not that low in
         | practice. Especially when the problem is a software bug.
         | 
          | The likelihood of AWS, Azure, and OVH going down at the same
          | time is low. So if you need to stay online when AWS fails,
          | don't put all your eggs in the AWS basket.
          |
          | That means not using proprietary cloud solutions from a single
          | cloud provider, which has a cost, so it's not always worth it.
        
           | chii wrote:
           | > using proprietary cloud solutions from a single cloud
           | provider, it has a cost so it's not always worth it.
           | 
           | but perhaps some software design choices could be made to
           | alleviate these costs. For example, you could have a read-
           | only replica on azure or whatever backup cloud provider, and
           | design your software interfaces to allow the use of such read
           | only replicas - at least you'd be degraded rather than
           | unavailable. Ditto with web servers etc.
           | 
           | This has a cost, but it's lower than entirely replicating all
           | of the proprietary features in a different cloud.
        
           | bombcar wrote:
           | True multi-cloud redundancy is hard to test - because it's
           | everything from DNS on up and it's hard to ask AWS to go
           | offline so you can verify Azure picks up the slack.
        
             | kqr wrote:
             | Sure you can. Firewall AWS off from whatever machine does
             | the health checks in the redundancy implementation.
        
               | pixl97 wrote:
               | What happens when your health check system fails?
        
             | speedgoose wrote:
             | It's true, but you can do load balancing at the DNS level.
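              |
              | For example, a health-checked failover record; a rough
              | sketch with boto3 and Route 53, where the zone ID, health
              | check ID and hostnames are placeholders (any DNS provider
              | with health checks can do the equivalent):
              |
              |   import boto3
              |
              |   r53 = boto3.client("route53")
              |
              |   def record(set_id, role, target, health_check=None):
              |       # One failover member of app.example.com.
              |       rrs = {"Name": "app.example.com", "Type": "CNAME",
              |              "TTL": 60, "SetIdentifier": set_id,
              |              "Failover": role,
              |              "ResourceRecords": [{"Value": target}]}
              |       if health_check:
              |           rrs["HealthCheckId"] = health_check
              |       return {"Action": "UPSERT",
              |               "ResourceRecordSet": rrs}
              |
              |   r53.change_resource_record_sets(
              |       HostedZoneId="Z0000000EXAMPLE",
              |       ChangeBatch={"Changes": [
              |           record("primary", "PRIMARY",
              |                  "lb.aws.example.com",
              |                  "1111-2222-3333-4444"),
              |           record("secondary", "SECONDARY",
              |                  "lb.othercloud.example.com"),
              |       ]},
              |   )
              |
              | The SECONDARY record only answers while the primary's
              | health check is failing.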
        
               | darkwater wrote:
                | And you will get 1/N of requests timing out or erroring
                | out, and in the meanwhile you're paying 2x or 3x the
                | costs. So it might be worth it in some cases, but you
                | need to evaluate it very, very well.
        
         | daguava wrote:
         | You've written up my thoughts better than I can express them
         | myself - I think what people get really stuck on when something
         | like this happens is the 'can I solve this myself?' aspect.
         | 
          | A 'wait for X provider to fix it for you' situation is
          | infinitely more stressful than an 'I have played myself, I will
          | now take action' situation.
         | 
         | Situations out of your (immediate) resolution control feel
         | infinitely worse, even if the customer impact in practice of
         | your fault vs cloud fault is the same.
        
           | dalyons wrote:
           | For me it's the opposite... aws outages are _much_ less
           | stressful than my own because I know there's nothing I /we
           | can do about it, they have smart people working on it, and it
           | will be fixed when it's fixed
        
           | electroly wrote:
           | I couldn't possibly disagree more strongly with this. I used
           | to drive frantically to the office to work on servers in
           | emergency situations, and if our small team couldn't solve
           | it, there was nobody else to help us. The weight of the
           | outage was entirely on our shoulders. Now I relax and refresh
           | a status page.
        
         | qwertyuiop_ wrote:
         | Or rent bare metal servers like old times and be responsible
         | for your own s*t
        
           | aenis wrote:
           | Still plenty of networking issues that can knock you down
           | hard.
        
             | tuldia wrote:
             | ... and be responsible for your own s*t
             | 
              | Don't miss the point of being able to do something about
              | it, instead of a multi-hour outage and being left in the
              | dark about what is going on.
        
         | [deleted]
        
         | chrisweekly wrote:
         | Hey Wes! I upvoted your comment before I noticed your handle.
         | +1 insightful, as usual
        
           | mobutu wrote:
           | Brown nose
        
         | xyst wrote:
         | > And, finally, hold cloud providers accountable. If they're
         | unstable and not providing service you expect, leave. We've got
         | tons of great options these days, especially if you don't care
         | about proprietary solutions.
         | 
         | Easy to say, but difficult to do in practice (leaving a cloud
         | provider)
        
         | kortilla wrote:
         | > Safety of your customer's data is your ethical
         | responsibility. Protect it, back it up, keep it as generally
         | available as is reasonable.
         | 
         | > Third, you're gonna go down when the cloud goes down. Not
         | much use getting overly bent out of shape.
         | 
         | "Whoops, our provider is down, sorry!" is not taking
         | responsibility with customer data at all.
        
       | Fordec wrote:
       | Between this and Log4j, I'm just glad it's Friday.
        
         | eigen-vector wrote:
         | Unfortunately, patching vulns can't be put off for Monday.
        
           | [deleted]
        
         | _jal wrote:
         | You are clearly not involved in patching.
        
           | Fordec wrote:
           | Simply already patched. Company sizes and number of attack
           | surfaces vary. 22 hours is plenty of time for an input string
           | filter on a centrally controlled endpoint and a dependency
           | increment with the right CI pipeline.
        
             | shagie wrote:
             | Consider the possible ways for a string to be injected into
              | any of the following: Apache Solr, Apache Druid, Apache
              | Flink, ElasticSearch, Flume, Apache Dubbo, Logstash, Kafka.
             | 
             | If you've got any of them, they're likely exploitable too.
             | 
             | That list comes from:
             | https://unit42.paloaltonetworks.com/apache-
             | log4j-vulnerabili...
             | 
             | The attack surface is quite a bit larger than many realize.
             | I recently had a conversation with a person who wasn't at a
             | Java shop so wasn't worried... until he said "oh, wait,
             | ElasticSearch is vulnerable too?"
             | 
             | You'll even see it in things like the connector between
             | CouchBase and ElasticSearch (
             | https://forums.couchbase.com/t/ann-elasticsearch-
             | connector-4... ).
        
               | EdwardDiego wrote:
               | Kafka is still on log4j1. It's only vulnerable if you're
               | using a JMSAppender.
        
               | Fordec wrote:
               | Lets see...
               | 
               | Nope. Nope. Nope. Nope. Nope. Nope. Nope.
               | 
               | aaaand...
               | 
               | Nope. Plans for it, but not yet in production.
               | 
               | Oh and before anyone starts, not in transitive
               | dependencies either. Just good old bare metal EC2
               | instances without vendor lock in.
        
             | zeko1195 wrote:
             | lol no
        
               | [deleted]
        
       | [deleted]
        
       | revskill wrote:
        | Most rate-limiter systems just drop invalid requests, which
        | isn't optimal as I see it.
        |
        | The better way is to have two queues: one for valid messages and
        | one for invalid messages.
        
       | betaby wrote:
        | Problem is that I have to defend our own infrastructure's real
        | availability numbers vs the cloud's fictional "five nines". It's
        | a losing game.
        
         | ngc248 wrote:
         | Yep, things should be made clear to whoever cares about your
         | service's SLA that our SLA is contingent upon AWS's SLA et al.
         | AWS' SLA would be the lower bound :)
        
         | Spivak wrote:
         | All I'm hearing is that you can make up your own availability
         | numbers and get away with it. When you define what it means to
         | be up or down then reality is whatever you say it is.
         | 
         | #gatekeep your real availability metrics
         | 
         | #gaslight your customers with increased error rates
         | 
         | #girlboss
        
           | WoahNoun wrote:
           | What are you trying to imply with that last hashtag?
        
         | mpyne wrote:
         | Some orgs really do have lousy availability figures (such as my
         | own, the Navy).
         | 
         | We have an environment we have access to for hosting webpages
         | for one of the highest leaders in the whole Dept of Navy. This
          | environment was _DOWN_ (not "degraded availability" or "high
          | latencies"), literally off of the Internet entirely, for
         | CONSECUTIVE WEEKS earlier this year.
         | 
         | Completely incommunicado as well. It just happened to start
         | working again one day. We collectively shrugged our shoulders
         | and resumed updating our part of it.
         | 
         | This is an outlier example but even our normal sites I would
         | classify as 1 "nine" of availability at best.
        
           | EnlightenedBro wrote:
           | This feels like it's a common thing in large enterprisey
           | companies. Execs out of touch with technical teams, always
           | pushing for more for less.
        
           | garbagecoder wrote:
           | Army, here. We're no better. And of course a few years back
           | the whole system for issuing CACs was off for DAYS.
        
       | JCM9 wrote:
       | Obviously one hopes these things don't happen, but that's an
       | impressive and transparent write up that came out quickly.
        
         | markranallo wrote:
          | It's not transparent at all. A massive number of services were
          | hard down for hours, like SNS, and were never acknowledged on
          | the status page or in this write-up. This honestly reads like
          | they don't truly understand the scope of things affected.
        
           | nijave wrote:
           | It sounded like the entire management plane was down and
           | potentially part of the "data" plane too (management being
           | config and data being get/put/poll to stateful resources)
           | 
            | I saw in the Reddit thread someone mention that all services
            | that auth to other services on the backend were affected (not
            | sure how truthful it is, but that certainly made sense)
        
       | JCM9 wrote:
        | Cue the armchair infrastructure engineers.
       | 
       | The reality is that there's a handful of people in the world that
       | can operate systems at this sheer scale and complexity and I have
       | mad respect for those in that camp.
        
         | ikiris wrote:
         | This outage report reads like a violation of the SRE 101
         | checklist for networking management though.
        
         | fckyourcloud wrote:
         | There are very few people who can juggle running chainsaws with
         | their penis.
         | 
         | So maybe it's not something we should be doing then?
        
         | [deleted]
        
         | plausibledeny wrote:
          | Isn't this the equivalent of answering a complaint about your
          | meal in a restaurant with "I'd like to see you do better"?
         | 
         | The point of eating at a restaurant is that I can't/don't want
         | to cook. Likewise, I use AWS because I want them to do the hard
         | work and I'm willing to pay for it.
         | 
         | How does that abrogate my right to complain if it goes badly
         | (regardless of whether I could/couldn't do it myself)?
        
           | cube00 wrote:
           | I think the distinction is you can say "I pay good money for
           | you to do it properly and how dare you go down on me" but you
           | become an "armchair infrastructure engineer" when you try and
           | explain how you would have avoided the outage because you
           | don't have the whole picture (especially based on a very
           | carefully worded PR approved blog post).
        
         | [deleted]
        
         | tristor wrote:
          | Some of us are in that camp, and we're looking at this outage
          | and also pointing out that they continuously fail to accurately
          | update their status dashboard, in this and prior outages. Yes,
          | doing what AWS does is hard, and yes, outages /will/ happen; it
          | is no knock on them that this outage occurred. What is a knock
          | is that they haven't communicated honestly while the outage was
          | ongoing.
        
           | JCM9 wrote:
           | They address that in the post, and between Twitter, HN and
           | other places there wasn't anyone legit questioning if
           | something was actually broken. Contacts at AWS also all were
           | very clear that yes something was going on and being
           | investigated. This narrative that AWS was pretending nothing
           | was wrong just wasn't true based on what we saw.
        
             | DrBenCarson wrote:
             | I'm going to leave it at this: the dashboards at AWS aren't
             | automated.
             | 
             | Say what you will, but I can automate a status dashboard in
             | a couple days--yes, even at AWS scale.
             | 
              | No reason the dashboard should be green for _hours_ while
              | their engineers and support are aware things aren't
              | working.
        
               | notinty wrote:
               | Apparently VP approval is required to update it, i.e.
               | they're a farce.
        
       | moogly wrote:
       | Hm. This post does not seem to acknowledge what I saw. Multiple
       | hours of rate-limiting kicking in when trying to talk to S3 (eu-
       | west-1). After the incident everything works fine without any
       | remediations done on our end.
        
         | MrBurnsa wrote:
         | eu-west-1 was not impacted by this event. I'm assuming you saw
         | 503 Slowdown responses, which are non-exceptional and happen
         | for a multitude of reasons.
        
           | moogly wrote:
           | I see. Then I suppose that was an unfortunately timed
           | happenstance (and we should look into that more closely).
        
       | jtchang wrote:
        | The complexity that AWS has to deal with is astounding. Sure,
        | having both a main production network and a management network
        | is common. But making sure all of it scales, and that one doesn't
        | bring down the other, is what I think they are dealing with here.
        |
        | It must have been crazy hard to troubleshoot when you are flying
        | blind because all your monitoring is unresponsive. Clearly, more
        | isolation with well-delineated information exchange points is
        | needed.
        
         | dijit wrote:
         | "But AWS has more operations staff than I would ever hope to
         | hire" -- a common mantra when talking about using the cloud
         | overall.
         | 
          | I'm not saying I fully disagree. But consolidation of the
          | world's hosting necessitates a very complicated platform, and
          | these things will happen, whether due to that complexity,
          | failures that can't be foreseen, or good old fashioned Sod's
          | law.
         | 
         | I know AWS marketing wants you to believe it's all magic and
         | rainbows, but it's still computers.
        
           | beoberha wrote:
           | I work for one of the Big 3 cloud providers and it's always
           | interesting when giving RCAs to customers. The vast majority
           | of our incidents are due to bugs in the "magic" components
           | that allow us to operate at such a massive scale.
        
       | rodmena wrote:
        | I was alive, however because I could not breathe, I died. Bob
        | was fine himself, but someone shot him, so he is dead (but
        | remember, Bob was fine). --- What a joke
        
       | atoav wrote:
       | A "service event"?!
        
       | londons_explore wrote:
        | Idea: Network devices should be configured to automatically
       | prioritize the same packet flows for the same clients as they
       | served yesterday.
       | 
       | So many overload issues seem to be caused by a single client, in
       | a case where the right prioritization or rate limit rule could
       | have contained any outage, but such a rule either wasn't in place
       | or wasn't the right one due to the difficulty of knowing how to
       | prioritize hundreds of clients.
       | 
       | Using _more_ bandwidth or requests than yesterday should then be
       | handled as capacity allows, possibly with a manual configured
       | priority list, cap, or ratio. But  "what I used yesterday" should
       | always be served first. That way, any outage is contained to
       | clients acting differently to yesterday, even if the config isn't
       | perfect.
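        |
        | A toy sketch of that policy (all names and numbers made up):
        | each client is always admitted up to its yesterday rate, and
        | anything beyond that baseline competes for whatever headroom is
        | left.
        |
        |   import time
        |   from collections import defaultdict
        |
        |   class BaselineLimiter:
        |       def __init__(self, yesterday, spare):
        |           self.baseline = dict(yesterday)  # client -> req/s
        |           self.spare = spare               # shared headroom
        |           self.used = defaultdict(int)
        |           self.spare_used = 0
        |           self.window = time.monotonic()
        |
        |       def allow(self, client):
        |           now = time.monotonic()
        |           if now - self.window >= 1.0:  # 1-second windows
        |               self.window = now
        |               self.used.clear()
        |               self.spare_used = 0
        |           guaranteed = self.baseline.get(client, 0)
        |           if self.used[client] < guaranteed:
        |               self.used[client] += 1    # yesterday's share
        |               return True
        |           if self.spare_used < self.spare:
        |               self.spare_used += 1      # best-effort extra
        |               return True
        |           return False
        |
        |   limiter = BaselineLimiter({"client-a": 500}, spare=100)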
        
       | stevefan1999 wrote:
       | In a nutshell: thundering herd.
        
       | rodmena wrote:
        | umm... But just one thing, S3 was not available for at least 20
        | minutes.
        
       | sneak wrote:
       | "impact" occurs 27 times on this page.
       | 
       | What was wrong with "affect"?
        
         | pohl wrote:
         | The easiest way to avoid confusing affect with effect is to use
         | other words.
        
       | WC3w6pXxgGd wrote:
       | > Our networking clients have well tested request back-off
       | behaviors that are designed to allow our systems to recover from
       | these sorts of congestion events, but, a latent issue prevented
       | these clients from adequately backing off during this event.
       | 
       | Sentences like this are confusing. If they are well-tested,
       | wouldn't this issue have been covered?
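        |
        | For reference, "backing off" here usually means something like
        | capped exponential delay with jitter (a pattern AWS itself has
        | written about publicly); a rough sketch, with made-up names and
        | numbers:
        |
        |   import random
        |   import time
        |
        |   def call_with_backoff(request, attempts=8,
        |                         base=0.1, cap=20.0):
        |       for i in range(attempts):
        |           try:
        |               return request()
        |           except Exception:
        |               if i == attempts - 1:
        |                   raise
        |               # Full jitter: random delay up to a capped
        |               # exponential bound, so a fleet of clients
        |               # doesn't retry in synchronized waves.
        |               delay = min(cap, base * 2 ** i)
        |               time.sleep(random.uniform(0, delay))
        |
        | Per the write-up, a latent issue kept the clients from actually
        | backing off like this once the congestion started.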
        
       | mperham wrote:
       | I wish it contained actual detail and wasn't couched in
       | generalities.
        
         | nijave wrote:
         | That was my take. Seems like boilerplate you could report for
         | almost any incident. Last year's Kinesis outage and the S3
         | outage some years ago had some decent detail
        
       | propter_hoc wrote:
       | Does anyone know how often an AZ experiences an issue as compared
       | to an entire region? AWS sells the redundancy of AZs pretty
       | heavily, but it seems like a lot of the issues that happen end up
       | being region-wide. I'm struggling to understand whether I should
       | be replicating our service across regions or whether the AZ
       | redundancy within a region is sufficient.
        
         | ashtonkem wrote:
         | The best bang for your buck isn't deploying into multiple AZs,
         | but relocating everything into almost any other region than us-
         | east-1.
         | 
         | My system is latency and downtime tolerant, but I'm thinking I
         | should move all my Kafka processing over to us-west-2
        
         | nemothekid wrote:
         | I've been naively setting up our distributed databases in
         | separate AZs for a couple years now, paying, sometimes,
         | thousands of dollars per month in data replication bandwidth
          | egress fees. As far as I can remember I've never seen an AZ go
          | down, and the only region that has gone down has been us-
          | east-1.
        
           | treis wrote:
           | AZs definitely go down. It's usually due to a physical reason
           | like fire or power issues.
        
           | wjossey wrote:
           | There was an AZ outage in Oregon a couple months back. You
           | should definitely go multi AZ without hesitation for
           | production workloads for systems that should be highly
           | available. You can easily lose a system permanently in a
           | single AZ setup if it's not ephemeral.
        
           | maximilianroos wrote:
           | > I've never never seen an AZ go down, and the only region
           | that has gone down has been us-east-1.
           | 
           | Doesn't the region going down mean that _all_ its AZs have
           | gone down? Or is my mental model of this incorrect?
        
             | urthor wrote:
             | No. See https://aws.amazon.com/about-aws/global-
             | infrastructure/regio...
             | 
             | A region is a networking paradigm. An AZ is a group of 2-6
             | data centers in the same city more or less.
             | 
             | If a region goes down or is otherwise impacted, its AZs are
             | _unavailable_ or similar.
             | 
             | If an AZ goes down, your VMs in said centers are disrupted
             | in the most direct sense.
             | 
             | It's the difference between loss of service and actual data
             | loss.
        
           | cyounkins wrote:
           | Is that separate AZs within the same region, or AZs across
           | regions? I didn't think there were any bandwidth fees between
           | AZs in the same region.
        
             | electroly wrote:
             | It's $0.01/GB for cross-AZ transfer within a region.
        
               | nemothekid wrote:
               | In reality it's more like $0.02/GB. You pay $0.01 on
               | sending and $0.01 on receiving. I have no idea why
               | ingress isn't free.
        
               | rfraile wrote:
               | Plus the support percentage, don't forguet.
        
             | TheP1000 wrote:
             | That is incorrect. Cross az fees are steep.
        
         | [deleted]
        
         | dijit wrote:
          | The main issue is that a lot of AWS internal components tend
          | to be in us-east-1; it's also the oldest region.
          |
          | So when failures happen in that region (and they happen there
          | more commonly than in others due to age, scale, and
          | complexity), they can be globally impacting.
        
           | AUX4R6829DR8 wrote:
           | The stuff that's exclusively hosted in us-east-1 is, to my
           | knowledge, mostly things that maintain global uniqueness.
           | CloudFront distributions, Route53, S3 bucket names, IAM roles
           | and similar- i.e. singular control planes. Other than that,
           | regions are about as isolated as it gets, except for specific
           | features on top.
           | 
           | Availability zones are supposed to be another fault boundary,
           | and things are generally pretty solid, but every so often
           | problems spill over when they shouldn't.
           | 
           | The general impression I get is that us-east-1's issues tend
           | to stem from it being singularly huge.
           | 
           | (Source: Work at AWS.)
        
             | ashtonkem wrote:
              | If I recall, there was a point in time where the control
              | plane for all regions was in us-east-1. I seem to recall an
              | outage where the other regions were up, but you couldn't
              | change any resources because the management API was down in
              | us-east-1
        
               | herodoturtle wrote:
               | This was our exact experience with this outage.
               | 
               | Literally _all_ our AWS resources are in EU /UK regions -
               | and they all continued functioning just fine - but we
               | couldn't sign in to our AWS console to manage said
               | resources.
               | 
               | Thankfully the outage didn't impact our production
               | systems at all, but our inability to access said console
               | was quite alarming to say the least.
        
               | aaron42net wrote:
                | The default region for global services, including
                | https://console.aws.amazon.com, is us-east-1, but there
                | are regional alternatives. For example:
                | https://us-west-2.console.aws.amazon.com
               | 
               | It would probably be clearer that they exist if the
               | console redirected to the regional URL when you switched
               | regions.
               | 
               | STS, S3, etc have regional endpoints too that have
               | continued to work when us-east-1 has been broken in the
               | past and the various AWS clients can be configured to use
               | them, which they also sadly don't tend to do by default.
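                |
                | A small example of pinning an SDK client to a regional
                | STS endpoint (the region and endpoint here are just
                | examples); boto3 also honors the
                | AWS_STS_REGIONAL_ENDPOINTS=regional setting for the same
                | effect:
                |
                |   import boto3
                |
                |   sts = boto3.client(
                |       "sts",
                |       region_name="us-west-2",
                |       endpoint_url="https://sts.us-west-2.amazonaws.com",
                |   )
                |   print(sts.get_caller_identity()["Arn"])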
        
           | propter_hoc wrote:
           | I agree with you, but my services are actually in Canada
           | (Central). There's only one region in Canada, so I don't
           | really have an alternative. AWS justifies it by saying there
           | are three AZs (distinct data centres) within Canada
           | (Central), but I get scared when I see these region-wide
           | issues. If the AZs were really distinct, you wouldn't really
           | have region-wide issues.
        
             | post-it wrote:
              | Multiple AZs are more for earthquakes, fires[1], and
              | similar disasters than for software issues.
             | 
             | [1] https://www.reuters.com/article/us-france-ovh-fire-
             | idUSKBN2B...
        
             | discodave wrote:
             | Take DynamoDB as an example. The AWS managed service takes
             | care of replicating everything to multiple AZs for you,
             | that's great! You're very unlikely to lose your data. But,
             | the DynamoDB team is running a mostly-regional service. If
             | they push bad code or fall over it's likely going to be a
             | regional issue. Probably only the storage nodes are truly
             | zonal.
             | 
             | If you wanted to deploy something similar, like Cassandra
             | across AZs, or even regions you're welcome to do that. But
             | now you're on the hook for the availability of the system.
             | Are you going to get higher availability running your own
             | Cassandra implementation than the DynamoDB team? Maybe.
             | DynamoDB had a pretty big outage in 2015 I think. But
             | that's a lot more work than just using DynamoDB IMO.
        
               | dastbe wrote:
               | > But, the DynamoDB team is running a mostly-regional
               | service.
               | 
               | this is both more and less true than you might think. for
               | most regional endpoints teams leverage load balancers
               | that are scoped zonally, such that ip0 will point at
               | instances in zone a, ip1 will point at instances in zone
               | b, and so on. Similarly, teams who operate "regional"
               | endpoints will generally deploy "zonal" environments,
               | such that in the event of a bad code deploy they can fail
               | away that zone for customers.
               | 
                | that being said, these mitigations still don't stop
                | regional poison pills or the like from infecting other
                | AZs unless the service is architected to be zonal
                | internally.
        
               | discodave wrote:
               | Yeah, teams go to a lot of effort to have zonal
               | environments/fleets/deployments... but there are still
               | many, many regional failure modes. For example, even in a
               | foundational service like EC2 most of their APIs touch
               | regional databases.
        
             | gnabgib wrote:
             | Good news, a new region is coming to Canada in the west[0]
             | eta 2023/24
             | 
             | [0]: https://aws.amazon.com/blogs/aws/in-the-works-aws-
             | canada-wes...
        
           | randmeerkat wrote:
           | AWS has been getting a pass on their stability issues in us-
           | east-1 for years now because it's their "oldest" zone. Maybe
           | they should invest in fixing it instead of inventing new
           | services to sell.
        
             | ikiris wrote:
             | if you care about the availability of a single geographical
             | availability zone, it's your own fault.
        
             | acdha wrote:
             | I certainly wouldn't describe it as "a pass" given how
             | commonly people joke about things like "friends don't let
             | friends use us-east-1". There's also a reporting bias:
             | because many places only use us-east-1, you're more likely
             | to hear about it even if it only affects a fraction of
             | customers, and many of those companies blame AWS publicly
             | because that's easier than admitting that they were only
             | using one AZ, etc.
             | 
             | These big outages are noteworthy because they _do_ affect
             | people who correctly architected for reliability -- and
             | they're pretty rare. This one didn't affect one of my big
             | sites at all; the other was affected by the S3 / Fargate
             | issues but the last time that happened was 2017.
             | 
             | That certainly could be better but so far it hasn't been
             | enough to be worth the massive cost increase of using
             | multiple providers, especially if you can have some basic
             | functionality provided by a CDN when the origin is down
             | (true for the kinds of projects I work on). GCP and Azure
              | have had their share of extended outages, too, so most of
              | the major providers are careful about casting stones over
              | reliability, and it's _much_ better than the median IT
              | department can offer.
        
               | easton wrote:
               | From the original outage thread:
               | 
               | "If you're having SLA problems I feel bad for you son I
               | got two 9 problems cuz of us-east-1"
        
         | nijave wrote:
         | Over two years I think we'd see about 2-3 AZ issues but only
         | once I would consider it an outage.
         | 
         | Usually there would be high network error rates which were
         | usually enough to make RDS Postgres fail over if it was in the
         | impacted AZ
         | 
         | The only real "outage" was DNS having extremely high error
         | rates in a single us-east-1 AZ to the point most things there
         | were barely working
         | 
          | Lack of instance capacity, especially spot, and especially for
          | the NVMe types, was common for CI (it used ASGs for builder
          | nodes). It'd be pretty common for a single AZ to run out of
          | spot instance types--especially the NVMe ([a-z]#d types)
        
         | codeduck wrote:
         | the eu-central-1 datacenter fire earlier this year was
         | purportedly just 1AZ, but it took down the entire region to all
         | intents and purposes.
         | 
         | Our SOP is to cut over to a second region the moment we see any
         | AZ-level shenanigans. We've been burned too often.
        
         | banana_giraffe wrote:
         | It can be a bit hard to know, since the AZ identifiers are
         | randomized per account, so if you think you have problems in
         | us-west-1a, I can't check on my side. You can get the AZ ID out
         | of your account to de-randomize things, so we can compare
         | notes, but people rarely bother, for whatever reason.
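          |
          | De-randomizing is one API call; a small sketch that maps this
          | account's zone names to the shared zone IDs so notes can be
          | compared across accounts:
          |
          |   import boto3
          |
          |   ec2 = boto3.client("ec2", region_name="us-east-1")
          |   zones = ec2.describe_availability_zones()
          |   for az in zones["AvailabilityZones"]:
          |       # e.g. "us-east-1a -> use1-az4"
          |       print(az["ZoneName"], "->", az["ZoneId"])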
        
           | mnordhoff wrote:
           | Amazon seems to have stopped randomizing them in newer
           | regions. Another reason to move to us-east-2. ;-)
        
           | lanstin wrote:
           | If you do a lot of VPC Endpoints to clients/thirdparties, you
           | learn the AZIDs or you go to all AZIDs in a region by
           | default.
        
       | pinche_gazpacho wrote:
       | Yeah, cloudwatch APIs went to the drain. Good for them for
       | publishing this at least.
        
       | sponaugle wrote:
       | "Our networking clients have well tested request back-off
       | behaviors that are designed to allow our systems to recover from
       | these sorts of congestion events, but, a latent issue prevented
       | these clients from adequately backing off during this event. "
       | 
       | That is an interesting way to phrase that. A 'well-tested'
       | method, but 'latent issues'. That would imply the 'well-tested'
       | part was not as well-tested as it needed to be. I guess 'latent
       | issue' is the new 'bug'.
        
       | paulryanrogers wrote:
       | Has anyone been credited by AWS for violations of their SLAs?
        
       | iwallace wrote:
       | My company uses AWS. We had significant degradation for many of
       | their APIs for over six hours, having a substantive impact on our
       | business. The entire time their outage board was solid green. We
       | were in touch with their support people and knew it was bad but
       | were under NDA not to discuss it with anyone.
       | 
       | Of course problems and outages are going to happen, but saying
       | they have five nines (99.999) uptime as measured by their "green
       | board" is meaningless. During the event they were late and
       | reluctant to report it and its significance. My point is that
       | they are wrongly incentivized to keep the board green at all
       | costs.
        
         | ALittleLight wrote:
         | I worked at Amazon. While my boss was on vacation I took over
         | for him in the "Launch readiness" meeting for our team's
         | component of our project. Basically, you go to this meeting
         | with the big decision makers and business people once a week
         | and tell them what your status is on deliverables. You are
         | supposed to sum up your status as "Green/Yellow/Red" and then
         | write (or update last week's document) to explain your status.
         | 
         | My boss had not given me any special directions here so I
         | assumed I was supposed to do this honestly. I set our status as
         | "Red" and then listed out what were, I felt, quite compelling
         | reasons to think we were Red. The gist of it was that our
         | velocity was negative. More work items were getting created and
         | assigned to us than we closed, and we still had high priority
         | items open from previous dates. There was zero chance, in my
         | estimation, that we would meet our deadlines, so I called us
         | Red.
         | 
         | This did not go over well. Everyone at the Launch Readiness
         | meeting got mad at me for declaring Red. Our VP scolded me in
         | front of the entire meeting and lectured me about how I could
         | not unilaterally declare our team red. Her logic was, if our
         | team was Red, that meant the entire project was Red, and I was
         | in no position to make that call. Other managers at the meeting
         | got mad at me too because they felt my call made them look bad.
         | For the rest of my manager's absence I had to first check in
         | with a different manager and show him my Launch Readiness
         | status and get him to approve my update before I was allowed to
         | show it to the rest of the group.
         | 
         | For the rest of the time that I went to Launch Readiness I was
         | forbidden from declaring Red regardless of what our metrics
         | said. Our team was Yellow or Green, period.
         | 
         | Naturally, we wound up being over a year late on the deadlines,
         | because, despite what they compelled us to say in those
         | meetings, we weren't actually getting the needed work done.
         | Constant "schedule slips" and adjustments. Endless wasted time
         | in meetings trying to rework schedules that would instantly get
         | blown up again. Hugely frustrating. Still slightly bitter about
         | it.
         | 
         | Anyway, I guess all this is to say that it doesn't surprise me
         | that Amazon is bad about declaring Red, Yellow, or Green in
         | other places too. Probably there is a guy in charge of updating
         | those dashboards who is forbidden from changing them unless
         | they get approval from some high level person and that person
         | will categorically refuse regardless of the evidence because
         | they want the indicators to be Green.
        
           | meabed wrote:
           | I've worked with consultants for years and that's exactly
           | the playbook: the project might not even launch after 2
           | years of deadlines while the status stays yellow or green
           | and never red.
        
           | tootie wrote:
           | I once had the inverse happen. I showed up as an architect at
           | a pretty huge e-commerce shop. They had a project that had
           | just kicked off and onboarded me to help with planning. They
           | had estimated two months by total finger in the air guessing.
           | I ran them through a sizing and velocity estimation and the
           | result came back as 10 months. I explained this to management
           | and they said "ok". We delivered in about 10 months. It was
           | actually pretty sad that they just didn't care. Especially
           | since we quintupled the budget and no one was counting.
        
           | syngrog66 wrote:
           | your story reminded me of the Challenger disaster and the
           | "see no evil" bureaucratic shenanigans about the O-rings
           | failing to seal in cold weather.
           | 
           | "How dare you threaten our launch readiness go/no-go?!"
        
             | dawnbreez wrote:
             | Was Challenger the one where they buried the issue in a
             | hundred-slide-long PowerPoint? Or was that the other
             | shuttle?
        
           | fma wrote:
           | We have something similar at my big corp company. I think
           | the issue is you went from Green to Red at the flip of a
           | switch. A more normal project goes Green... a red flag
           | gets raised... if the flags aren't resolved in the next
           | week or two, it goes to Yellow... In these meetings
           | everyone collaborates on ways to keep you Green or get you
           | back to Green if you went Yellow.
           | 
           | In essence - what you were saying is your boss lied the
           | whole time, because how does one go from a presumed
           | positive velocity to negative velocity in a week?
           | 
           | Additionally, assuming you're a dev lead, it's a little
           | surprising that this was your first meeting of this sort.
           | As a dev lead, I didn't always attend them, but my input
           | was always sought on the status.
           | 
           | Sounds like you had a bad manager, and Amazon is filled with
           | them.
        
             | human wrote:
             | Exactly this. If you take your team from green to red
             | without raising flags and asking for help, you will be
             | frowned upon. It's like pulling the fire alarm at the
             | smell of burning toast. It will piss people off.
        
           | transcriptase wrote:
           | This explicitly supports what most of us assume is going
           | on. I won't be surprised if someone with a (un)vested
           | interest comes along shortly to say that their experience
           | is the opposite and that on _their_ team, making people
           | look bad by telling the truth is expected and praised.
        
           | version_five wrote:
           | I had a good chuckle reading your comment. This is not unique
           | to Amazon. Unfortunately, status indicators are super
           | political almost everywhere, precisely because they are what
           | is being monitored as a proxy for the actual progress. I
           | think your comment should be mandatory reading for any leader
           | who is holding the kinds of meetings you describe and thinks
           | they are getting an accurate picture of things.
        
             | theduder99 wrote:
             | no need for it to be mandatory, everyone is fully aware of
             | the game and how to play it.
        
           | throwaway82931 wrote:
           | A punitive culture of "accountability" naturally leads to
           | finger pointing and evasion.
        
           | belter wrote:
           | Was this Amazon or AWS?
        
           | tempnow987 wrote:
           | This is not unique. The reason is simple.
           | 
           | 1) If you keep status green for 5 years, while not delivering
           | anything, the reality is the folks at the very top (who can
           | come and go) just look at these colors and don't really get
           | into the project UNLESS you say you are red :)
           | 
           | 2) Within 1-2 years there is always going to be some excuse
           | for WHY you are late (people changes, scope tweaks, new
           | things to worry about, covid etc)
           | 
           | 3) Finally you are 3 years late, but you are launching.
           | Well, the launch overshadows the lateness. I.e., you were
           | green, then you launched; that's all the VP really sees
           | sometimes.
        
           | dylan604 wrote:
           | > While my boss was on vacation I took over for him in the
           | "Launch readiness" meeting....once a week
           | 
           | Jeez, how many meetings did you go to, and how long was this
           | person's vacation? I'm jelly of being allowed to take that
           | much time off continuously.
        
             | toomuchtodo wrote:
             | You might be working at the wrong org? My colleagues
             | routinely take weeks off at a time, sometimes more than a
             | month to travel Europe, go scuba diving in French
             | Polynesia, etc. Work to live, don't live to work.
        
           | dawnbreez wrote:
           | I worked at an Amazon air-shipping warehouse for a couple
           | years, and hearing this confirms my suspicions about the
           | management there. Lower management (supervisors, people
           | actually in the building) were very aware of problems, but
           | the people who ran the building lived out of state, so they
           | only actually went to the building on very rare occasions.
           | 
           | Equipment was constantly breaking down, in ways that ranged
           | from inconvenient to potentially dangerous. Seemingly basic
           | design decisions, like the shape of chutes, were screwed up
           | in mind-boggling ways (they put a right-angle corner partway
           | down each chute, which caused packages to get stuck in the
           | chutes constantly). We were short on equipment almost every
           | day; things like poles to help us un-jam packages were in
           | short supply, even though we could move hundreds of thousands
           | of packages a day. On top of all this, the facility opened
           | with half its sorting equipment, and despite promises that
           | we'd be able to add the rest of the equipment in the summer,
           | during Amazon's slow season...it took them two years to even
           | get started.
           | 
           | And all the while, they demanded ever-increasing package
           | quotas. At first, 120,000 packages/day was enough to raise
           | eyebrows--we broke records on a daily basis in our first
           | holiday rush--but then, they started wanting 200,000, then
           | 400,000. Eventually it came out that the building wouldn't
           | even be breaking even until it hit something like 500,000.
           | 
           | As we scaled up, things got even worse. None of the
           | improvements that workers suggested to management were used,
           | to my knowledge, even simple things like adding an indicator
           | light to freight elevators.
           | 
           | Meanwhile, it eventually became clear that there wasn't
           | enough space to store cargo containers in the building. 737s
           | and the like store packages mostly in these giant curved
           | cargo containers, and we needed them to be locked in place
           | while working around/in them...except that, surprise, the
           | people planning the building hadn't planned any holding areas
           | for containers that weren't in use! We ended up sticking them
           | in the middle of the work area.
           | 
           | Which pissed off the upper management when they visited.
           | Their decision? Stop doing it. Are we getting more storage
           | space for the cans? No. Are we getting more workers on the
           | airplane ramp so we can put these cans outside faster? No.
           | But we're not allowed to store those cans in the middle of
           | the work area anymore, even if there aren't any open stations
           | with working locks. Oh, by the way, the locking mechanisms
           | that hold the cans in place started to break down, and to my
           | knowledge they never actually fixed any of the locks. (A guy
           | from their safety team claims they've fixed like 80 or 90 of
           | the stations since the building opened, but none of the
           | broken locks I've seen were fixed in the 2 years I worked
           | there.)
        
           | blowski wrote:
           | The problem here sounds like a lack of clarity over the
           | meaning of the colours.
           | 
           | In organisations with 100s of in-flight projects, it's
           | understandable that red is reserved for projects that are
           | causing extremely serious issues right now. Otherwise, so
           | many projects would be red that you'd need a new colour.
        
             | jaytaylor wrote:
             | How about orange? Didn't know there was a color shortage
             | these days.
        
               | tuananh wrote:
               | but it's amz's color. it should carry a positive meaning
               | #sarcasm
        
             | ALittleLight wrote:
             | I'd be willing to believe they had some elite high level
             | reason to schedule things this way if I thought they were
             | good at scheduling. In my ~10 years there I never saw a
             | major project go even close to schedule.
             | 
             | I think it's more like the planning people get rewarded for
             | creating plans that look good and it doesn't bother them if
             | the plans are unrealistic. Then, levels of middle
             | management don't want to make themselves look bad by saying
             | they're behind. And, ultimately, everyone figures they can
             | play a kind of schedule-chicken where everyone says they're
             | green or yellow until the last possible second, hoping that
             | another group will raise a flag first and give you all more
             | time while you can pretend you didn't need it.
        
           | subsaharancoder wrote:
           | I worked at AMZN and this perfectly captures my experience
           | there with those weekly reviews. I once set a project I was
           | managing as "Red" and had multiple SDMs excoriate me for
           | apparently "throwing them under the bus" even though we had
           | missed multiple timelines and were essentially not going to
           | deliver anything of quality on time. I don't miss this aspect
           | of AMZN!
        
             | redconfetti wrote:
             | How dare you communicate a problem using the color system.
             | It hurts feelings, and feelings are important here.
        
         | soheil wrote:
         | This was addressed at least 3 times during this post. I'm not
         | defending them but you're just gaslighting. If you have
         | something to add about the points they raised regarding the
         | status page please do so.
        
         | jedberg wrote:
         | Honestly, they should host that status page on CloudFlare or
         | some completely separate infrastructure that they maintain in
         | colo datacenters or something. The only time it really needs to
         | be up is when their stuff isn't working.
        
         | hvgk wrote:
         | This. We're under NDA too on internal support. Our customers
         | know we use AWS, so they go and check the AWS status
         | dashboards and tell us there's nothing wrong, and the
         | inevitable vitriol is always directed at us, which we then
         | have to defend against.
        
           | macintux wrote:
           | I guess you have to hope that every outage that impacts you
           | is big enough to make the news.
        
         | eranation wrote:
         | Obligatory mention to https://stop.lying.cloud
        
         | steveBK123 wrote:
         | Exactly, we had the same thing almost exactly a year ago -
         | https://www.dailymail.co.uk/sciencetech/article-8994907/Wide...
         | 
         | They are barely doing better than 2 9s.
        
         | codeduck wrote:
         | carbon copy of our experience.
        
         | tootie wrote:
         | Secondhand info, but supposedly when an outage hits they go
         | all hands on resolving it, and no one who knows what's going
         | on has time to update the status board, which is why it's
         | always behind.
        
           | voidfunc wrote:
           | Not AWS, but Azure: I highly doubt it. At least at Azure,
           | the moment you declare an outage there is an incident
           | manager to handle customer communication.
           | 
           | It's bullshit that no one at Amazon has time to update the
           | status.
        
         | tuananh wrote:
         | even in the post mortem, they are reluctant to admit it
         | 
         | > While AWS customer workloads were not directly impacted from
         | the internal networking issues described above, the networking
         | issues caused impact to a number of AWS Services which in turn
         | impacted customers using these service capabilities. Because
         | the main AWS network was not affected, some customer
         | applications which did not rely on these capabilities only
         | experienced minimal impact from this event.
        
           | nijave wrote:
           | >some customer applications which did not rely on these
           | capabilities only experienced minimal impact from this event
           | 
           | Yeah so vanilla LB and EC2 with no autoscaling were fine.
           | Anyone using "serverless" or managed services had a real bad
           | day
        
         | amalter wrote:
         | I mean, not to defend them too strongly, but literally half of
         | this post mortem is addressing the failure of the Service
         | Dashboard. You can take it on bad faith, but they own up to the
         | dashboard being completely useless during the incident.
        
           | tw04 wrote:
           | Multiple AWS employees have acknowledged it takes _VP_
           | approval to change the status color of the dashboard. That is
           | absurd and it tells you everything you need to know. The
           | status page isn't about accurate information, it's about
           | plausible deniability and keeping AWS out of the news cycle.
        
             | dylan604 wrote:
             | >it's about plausible deniability and keeping AWS out of
             | the news cycle.
             | 
             | How'd that work out for them?
             | 
             | https://duckduckgo.com/?q=AWS+outage+news+coverage&t=h_&ia=
             | w...
        
               | tw04 wrote:
               | When is the last time they had a single service outage in
               | a single region? How about in a single AZ in a single
               | region? Struggling to find a lot of headline stories? I'm
               | willing to bet it's happened in the last 2 years and yet
               | I don't see many news articles about it... so I'd say if
               | the only thing that hits the front page is a complete
               | region outage for 6+ hours, it's working out pretty well
               | for them.
        
               | grumple wrote:
               | Last year's Thanksgiving outage and this one are the two
               | biggest. They've been pretty reliable. That's still 99.7%
               | uptime.
        
             | cookie_monsta wrote:
             | I am so naive. I honestly thought those things were
             | automated.
        
           | discodave wrote:
           | The AWS summary says: "As the impact to services during this
           | event all stemmed from a single root cause, we opted to
           | provide updates via a global banner on the Service Health
           | Dashboard, which we have since learned makes it difficult for
           | some customers to find information about this issue"
           | 
           | This seems like bad faith to me based on my experience when I
           | worked for AWS. As they repeated many times at Re:Invent last
           | week, they've been doing this for 15+ years. I distinctly
           | remember seeing banners like "Don't update the dashboard
           | without approval from <importantSVP>" on various service team
           | runbooks. They tried not to say it out loud, but there was
           | very much a top-down mandate for service teams to make the
           | dashboard "look green" by:
           | 
           | 1. Actually improving availability (this one is fair).
           | 
           | 2. Using the "Green-I" icon rather than the blue, orange, or
           | red icons whenever possible.
           | 
           | 3. They built out the "Personal Health Dashboard" so they can
           | post about many issues in there, without having to
           | acknowledge it publicly.
        
             | res0nat0r wrote:
             | Eh, I mean, at least when DeSantis was lower on the food
             | chain than he is now, the normal directive was that EC2
             | status wasn't updated unless a certain X percent of hosts
             | were affected. Which is reasonable, because a single rack
             | going down isn't relevant enough to constitute a massive
             | problem with EC2 as a whole.
        
           | systemvoltage wrote:
           | The people of HN have been _extremely_ unprofessional with
           | regard to AWS's downtime. Some kind of massive zeitgeist
           | against Amazon, like a giant hive mind that spews hate.
           | 
           | Why are we doing this, folks? What's making you so angry
           | and contemptuous? Literally try searching the history of
           | downtimes and it was always professional and respectful.
           | 
           | Yesterday, my comment was fricking flagged for asking
           | people to be nice, to which people responded "Professionals
           | recognize other professionals lying". Completely baseless
           | and hate-spewing comments like this are ruining HN.
        
             | NicoJuicy wrote:
             | I think the biggest issue is the status dashboard that
             | always stays green. I haven't seen much else, no?
             | 
             | It seems that "degraded" really means "down" in most
             | cases, since manager authorization is required to change
             | the status.
        
             | edoceo wrote:
             | All too often folk conflate frustration with anger or hate.
             | 
             | The comments are frustrated users.
             | 
             | Not hateful.
        
             | ProAm wrote:
             | > Why are we doing this folks? What's making you so angry
             | and contemptful?
             | 
             | Because Amazon kills industries. Takes jobs. They do this
             | because they promise they hire the best people, who can
             | do this better than you and for cheaper. And it's rarely
             | true. And then they lie about it when things hit the fan.
             | If you're going to be the best you need to act like the
             | best, and execute like the best. Not build a walled
             | garden that people can't see into and that's hard to
             | leave.
        
             | jiggawatts wrote:
             | AWS as a business has an enormous (multi-billion-dollar)
             | moral hazard: they have a fantastically strong disincentive
             | to update their status dashboard to accurately reflect the
             | true nature of an ongoing outage. They use weasel words
             | like "some customers may be seeing elevated errors", which
             | we _all know_ translates to  "almost all customers are
             | seeing 99.99% failure rates."
             | 
             | They have a strong incentive to lie, and they're doing it.
             | This makes people dependent upon the truth for refunds
             | understandably angry.
        
           | nwallin wrote:
           | So -- ctrl-f "Dash" only produces four results and it's
           | hidden away at the bottom of the page. It's false to claim
           | that even 20% of the post mortem is addressing the failure of
           | the dashboard.
           | 
           | The problem is that the dashboard requires VP approval to be
           | updated. Which is broken. The dashboard should be automatic.
           | The dashboard should update before even a single member of
           | the AWS team knows there's something wrong.
        
             | hunter2_ wrote:
             | Is it typical for orgs (the whole spectrum: IT departments
             | everywhere, telecom, SaaS, maybe even status of non-
             | technical services) to have automatic downtime messaging
             | that doesn't need a human set of eyes to approve it first?
        
           | ProAm wrote:
           | > You can take it on bad faith, but they own up to the
           | dashboard being completely useless during the incident.
           | 
           | Let's not act like this is the first time this has
           | happened. It's bad faith that they do not change when
           | their promise is that they hire the best to handle
           | infrastructure so you don't have to. That's clearly not
           | the case. Between this and billing, we can easily lay
           | blame and acknowledge lies.
        
           | luhn wrote:
           | Off the top of my head, this is the third time they've had a
           | major outage where they've been unable to properly update the
           | status page. First we had the S3 outage, where the yellow and
           | red icons were hosted in S3 and unable to be accessed. Second
           | we had the Kinesis outage, which snowballed into a Cognito
           | outage, so they were unable to log in to the status page
           | CMS. Now this.
           | 
           | They "own up to it" in their postmortems, but after multiple
           | failures they're still unwilling to implement the obvious
           | solution and what is widely regarded as best practice: _host
           | the status page on a different platform_.
        
             | koheripbal wrote:
             | This challenge is not specific to Amazon.
             | 
             | Being able to automatically detect system health is a non-
             | trivial effort.
        
               | moralestapia wrote:
               | >Be capable of spinning up virtualized instances
               | (including custom drive configurations, network stacks,
               | complex routing schemes, even GPUs) with a simple API
               | call
               | 
               | But,
               | 
               | >Be incapable of querying the status of such things
               | 
               | Yeah, I don't believe it.
        
               | bob778 wrote:
               | That's not what's being asked though - in all 3 events,
               | they couldn't manually update it. It's clearly not a
               | priority to fix it for even manual alerts.
        
               | blackearl wrote:
               | Why automatic? Surely someone could have the
               | responsibility to do it manually.
        
               | geenew wrote:
               | Or override the autogenerated values
        
               | jjoonathan wrote:
               | They had all day to do it manually.
        
               | saagarjha wrote:
               | As others mention, you can do it manually. But it's also
               | not that hard to do automatically: literally just spin up
               | a "client" of your service and make sure it works.
        
             | thebean11 wrote:
             | Eh, the colored icons not loading is not really the same
             | thing as incorrectly reporting that nothing's wrong.
             | Putting the status page on separate infra would be good
             | practice, though.
        
               | dijit wrote:
               | The icons showed green.
        
             | isbvhodnvemrwvn wrote:
             | My company is quite well known for blameless post-mortems,
             | but if someone failed to implement improvements after three
             | subsequent outages, they would be moved to a position more
             | appropriate for their skills.
        
             | 46Bit wrote:
             | Firmly agreed. I've heard AWS discuss making the status
             | page better - but they get really quiet about actually
             | doing it. In my experience the best/only way to check for
             | problems is to search Twitter for your AWS region name.
        
               | ngc248 wrote:
               | Maybe AWS should host their status checks in Azure and
               | vice versa ... Mutually Assured Monitoring :) Otherwise
               | it becomes a problem of who will monitor the monitor
        
           | dijit wrote:
           | Once is a mistake.
           | 
           | Twice is a coincidence.
           | 
           | Three times is a pattern.
           | 
           | But this... This is every time.
        
             | doctor_eval wrote:
             | Four times is a policy.
        
           | s_dev wrote:
           | >You can take it on bad faith
           | 
           | It's smart politics -- I don't blame them but I don't trust
           | the dashboard either. There's established patterns now of the
           | AWS dashboard being useless.
           | 
           | If I want to check if Amazon is down I'm checking Twitter and
           | HN. Not bad faith -- no faith.
        
             | sorry_outta_gas wrote:
             | That's only useful when it's an entire region. There are
             | minor issues in smaller services that cause problems for
             | a lot of people that they don't reflect on their status
             | board, and not everyone checks Twitter or HN all the time
             | while at work.
             | 
             | It's a bullshit board used to fudge numbers when
             | negotiating SLAs.
             | 
             | Like, I don't care that much, hell, my company does the
             | same thing; but let's not get defensive over it.
        
             | toss1 wrote:
             | >>It's smart politics -- I don't blame them
             | 
             | Um, so you think straight-up lying is good politics?
             | 
             | Any 7-year old knows that telling a lie when you broke
             | something makes you look better superficially, especially
             | if you get away with it.
             | 
             | That does not mean that we should think it is a good idea
             | to tell lies when you break things.
             | 
             | It sure as hell isn't smart politics in my book. It is
             | straight-up disqualifying to do business with them. If they
             | are not honest about the status or amount of service they
             | are providing, how is that different than lying about your
             | prices?
             | 
             | Would you go to a petrol station that posted $x.00/gallon,
             | but only delivered 3 quarts for each gallon shown on the
             | pump?
             | 
             | We're being shortchanged and lied to. Fascinating that you
             | think it is good politics on their part.
        
               | efitz wrote:
               | You don't know what you're talking about.
               | 
               | AWS spends a lot of time thinking about this problem in
               | service to their customers.
               | 
               | How do you reduce the status of millions of machines, the
               | software they run, and the interconnected-ness of those
               | systems to a single graphical indicator?
               | 
               | It would be dumb and useless to turn something red every
               | single time anything had a problem. Literally there are
               | hundreds of things broken every minute of every day. On-
               | call engineers are working around the clock on these
               | problems. Most of the problems either don't affect anyone
               | due to redundancy or affect only a tiny number of
               | customers- a failed memory module or top-of-rack switch
               | or a random bit flip in one host for one service.
               | 
               | Would it help anyone to tell everyone about all these
               | problems? People would quickly learn to ignore it as it
               | had no bearing on their experience.
               | 
               | What you're really arguing is that you don't like the
               | thresholds they've chosen. That's fine, everyone has an
               | opinion. The purpose of health dashboards like these is
               | mostly so that customers can quickly get an answer to "is
               | it them or me" when there's a problem.
               | 
               | As others on this thread have pointed out, AWS has done a
               | pretty good job of making the SHD align with the
               | subjective experience of most customers. They also have
               | personal health dashboards unique to each customer, but I
               | assume thresholding is still involved.
        
               | toss1 wrote:
               | >>How do you reduce the status of millions of machines,
               | the software they run, and the interconnected-ness of
               | those systems to a single graphical indicator?
               | 
               | There's a limitless variety of options, and multiple
               | books written about it. I can recommend the series "The
               | Visual Display of Quantitative Information" by Edward
               | Tufte, for starters.
               | 
               | >> Literally there are hundreds of things broken every
               | minute of every day. On-call engineers are working around
               | the clock...
               | 
               | Of course there are, so a single R/Y/G indicator is
               | obviously a bad choice.
               | 
               | Again, they could at any time easily choose a better way
               | to display this information, graphs, heatmaps, whatever.
               | 
               | More importantly, the one thing that should NOT be chosen
               | is A) to have a human in the loop of displaying status,
               | as this inserts both delay and errors.
               | 
               | Worse yet, to make it so that it is a VP-level decision,
               | as if it were a $1million+ purchase, and then to set the
               | policy to keep it green when half a continent is down...
               | ummm that is WAAAYYY past any question of "threshold" -
               | it is a premeditated, designed-in, systemic lie.
               | 
               | >>You don't know what you're talking about.
               | 
               | Look in the mirror, dude. While I haven't worked
               | inside AWS, I have worked on complex network software
               | systems and well understand the issues of thousands
               | of HW/SW components in multiple states. More
               | importantly, perhaps it's my philosophy degree, but I
               | can sort out WHEN (e.g., here) the problem is at
               | another level altogether. It is not the complexity of
               | the system that is the problem, it is the MANAGEMENT
               | decision to systematically lie about that complexity.
               | Worse yet, it looks like those everyday lies are what
               | goes into their claims of "99.99+% uptime!!", which
               | are evidently false. The problem is at the forest
               | level, and you don't even want to look at the trees
               | because you're stuck in the underbrush telling
               | everyone else they are clueless.
        
               | [deleted]
        
               | Karunamon wrote:
               | > _How do you reduce the status of millions of machines,
               | the software they run, and the interconnected-ness of
               | those systems to a single graphical indicator?_
               | 
               | A good low-hanging fruit would be, when the outage is
               | significant enough to have reached the media, _you turn
               | the dot red_.
               | 
               | Dishonesty is what we're talking about here. Not the
               | gradient when you change colors. This is hardly the first
               | major outage where the AWS status board was a bald-faced
               | _lie_. This deserves calling out and shaming the
               | responsible parties, nothing less, certainly not defense
               | of blatantly deceptive practices that most companies not
               | named Amazon don't dip into.
        
         | SkyPuncher wrote:
         | My company isn't big enough for us to have any pull but this
         | communication is _significantly_ downplaying the impact of this
         | issue.
         | 
         | One of our auxiliary services that's basically a pass through
         | to AWS was offline nearly the entire day. Yet, this
         | communication doesn't even mention that fact. In fact, it
         | almost tries to suggest the opposite.
         | 
         | Likewise, AWS is reporting S3 didn't have issues. Yet, for a
         | period of time, S3 was erroring out frequently because it was
         | responding so slowly.
        
         | notyourday wrote:
         | > The entire time their outage board was solid green. We
         | were in touch with their support people and knew it was bad
         | but were under NDA not to discuss it with anyone.
         | 
         |   if ($pain > $gain) {
         |       move_your_shit_and_exit_aws();
         |   }
         | 
         |   sub move_your_shit_and_exit_aws
         |   {
         |       printf("Dude. We have too much pain. Start moving\n");
         |       printf("Yeah. That won't happen, so who cares\n");
         |       exit(1);
         |   }
        
           | secondcoming wrote:
           | Moving your shit from AWS can be _really_ expensive,
           | depending on how much shit you have. If you 're nice, GCP may
           | subsidise - or even cover - the costs!
        
         | Clubber wrote:
         | Yes, it's a conflict of interest. They have a guarantee on
         | uptime and they decide what their actual uptime is. There's a
         | lot of that now. Most insurance comes to mind.
        
         | jiggawatts wrote:
         | SLAs with self-reported outage periods are worthless.
         | 
         | SLAs that refund only the cost of the individual service that
         | was down are worthless.
         | 
         | SLAs that require separate proof and refund requests for each
         | and every service that was affected are nearly worthless.
         | 
         | There needs to be an independent body set up by large cloud
         | customers to monitor availability and enforce refunds.
        
         | [deleted]
        
         | codegeek wrote:
         | "Our Support Contact Center also relies on the internal AWS
         | network, so the ability to create support cases was impacted
         | from 7:33 AM until 2:25 PM PST. "
         | 
         | This to me is really bad. Even as a small company, we keep our
         | support infrastructure separate. For a company of Amazon's
         | size, this is a shitty excuse. If I cannot even reach you as a
         | customer for almost 7 hours, that is just nuts. AWS must do
         | better here.
         | 
         | Also, is it true that the outage/status pages are manually
         | updated? If so, there is no excuse for why they stayed green
         | for that long. If you are manually updating them, please
         | update ASAP.
        
           | anonu wrote:
           | Wasn't this the Bezos directive early on that created AWS?
           | Anything that was built had to be a service with an API.
           | Not allowed to reinvent the wheel. So AWS depends on AWS.
        
             | jiggawatts wrote:
             | Dependency loops are such fun!
             | 
             | My favourite is when some company migrates their physical
             | servers to virtual machines, including the AD domain
             | controllers. Then the next step is to use AD LDAP
             | authentication for the VM management software.
             | 
             | When there's a temporary outage and the VMs don't start up
             | as expected, the admins can't log on and troubleshoot the
             | platform because the logon system was running on it... but
             | isn't _now_.
             | 
             | The loop is closed.
             | 
             | You see this all the time, especially with system-
             | management software. They become dependent on the systems
             | they're managing, and vice-versa.
             | 
             | If you care about availability at all, make sure to have
             | physical servers providing basic services like DNS, NTP,
             | LDAP, RADIUS, etc...
        
               | nijave wrote:
               | Or even just have some non-federated/"local" accounts
               | stored in a vault somewhere you can use when the
               | centralized auth isn't working
        
           | acwan93 wrote:
           | We moved our company's support call system to Microsoft Teams
           | when lockdowns were happening, and even that was affected by
           | the AWS outage (along with our SaaS product hosted on AWS).
           | 
           | It turned out our call center supplier had something running
           | on AWS, and it took out our entire phone system. After this
           | situation settles, I'm tempted to ask my supplier to see what
           | they're doing to get around this in the future, but I doubt
           | even they knew that AWS was used further downstream.
           | 
           | AWS now operates a lot like Amazon.com, the marketplace--you
           | can try to escape it, but it's nearly impossible. If you
           | want to ban usage of Amazon's services, you're going to find
           | some service (AWS) or even a Shopify site (FBA warehouse)
           | that uses it.
        
           | walrus01 wrote:
           | I know a few _tiny_ ISPs that host their VoIP server and
           | email server outside of their own ASN so that in the event
           | of a catastrophic network event, communication with
           | customers is still possible... Not saying Amazon should do
           | the same, but the general principle isn't rocket science.
           | 
           | There's such a thing as too much dogfooding.
        
         | notimetorelax wrote:
         | If you were deployed in 2 regions would it alleviate the
         | impact?
        
           | multipassnetwrk wrote:
           | Depends. If your failover to another region required changing
           | DNS and your DNS was using Route 53, you would have problems.
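           | 
           | One common mitigation (a sketch only, with hypothetical
           | zone, record, and health-check IDs) is to pre-provision
           | health-checked failover records, so the Route 53 data
           | plane shifts traffic to the standby region without anyone
           | having to call the control-plane API mid-incident:
           | 
           |   import boto3
           | 
           |   r53 = boto3.client("route53")
           | 
           |   # Run ahead of time; repeat with "SECONDARY" for the
           |   # record that points at the standby region.
           |   r53.change_resource_record_sets(
           |       HostedZoneId="Z123EXAMPLE",
           |       ChangeBatch={"Changes": [{
           |           "Action": "UPSERT",
           |           "ResourceRecordSet": {
           |               "Name": "app.example.com",
           |               "Type": "CNAME",
           |               "SetIdentifier": "use1-primary",
           |               "Failover": "PRIMARY",
           |               "TTL": 60,
           |               "HealthCheckId": "hc-primary",
           |               "ResourceRecords": [
           |                   {"Value": "use1.example.com"}],
           |           },
           |       }]},
           |   )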
        
           | ransom1538 wrote:
           | Yes. Exactly. Pay double. That is what all the blogs say. But
           | no, when a region goes down everything is hosed. Give it a
           | shot! Next time an entire region is down try out your APIs or
           | give AWS support a call.
        
           | hvgk wrote:
           | No. We don't have an active deployment in that region at all.
           | It killed our build pipeline as ECR was down globally so we
           | had nowhere to push images. Also there was a massive risk as
           | our target environments are EKS so any node failures or
           | scaling events had nowhere to pull images from while ECR was
           | down.
           | 
           | Edit: not to mention APIGW and Cloudwatch APIs were down too.
        
         | electroly wrote:
         | > The entire time their outage board was solid green
         | 
         | Unless you're talking about some board other than the Service
         | Health Dashboard, this isn't true. They dropped EC2 down to
         | degraded pretty early on. I bemusedly noted in our corporate
         | Slack that every time I refreshed the SHD, another service was
         | listed as degraded. Then they added the giant banner at the
         | top. Their slight delay in updating the SHD at the beginning of
         | the outage is mentioned in the article. It was absolutely not
         | all green for the duration of the outage.
        
           | logical_proof wrote:
           | That is not true. There were hours before they started
           | annotating any kind of service issues. Maybe from when you
           | noticed there was a problem it appeared to be quick, but the
           | board remained green for a large portion of the outage.
        
             | acdha wrote:
             | We saw the timing described where the dashboard updates
             | started about an hour after the problem began (which we
             | noticed immediately since 7:30AM Pacific is in the middle
             | of the day for those of us in Eastern time). I don't know
             | if there was an issue with browser caching or similar but
             | once the updates started everyone here had no trouble
             | seeing them and my RSS feed monitor picked them up around
             | that time as well.
        
             | electroly wrote:
             | No, it was about an hour. We were aware from the very
             | moment EC2 API error rates began to elevate, around 10:30
             | Eastern. By 11:30 the dashboard was updating. This timing
             | is mentioned in the article, and it all happened in the
             | middle of our workday on the east coast. The outage then
             | continued for about 7 hours with SHD updates. I suspect we
             | actually both agree on how long it took them to start
             | updating, but I conclude that 1 hour wasn't so bad.
        
               | gkop wrote:
               | At the large platform company where I work, our policy is
               | if the customer reported the issue before our internal
               | monitoring caught it, we have failed. Give 5 minutes for
               | alerting lag, 10 minutes to evaluate the magnitude of
               | impact, 10 minutes to craft the content and get it
               | approved, 5 minutes to execute the update, adds up to 30
               | minutes end to end with healthy buffer at each step.
               | 
               | 1 hour (52 minutes according to the article) sounds meh.
               | I wonder what their error rate and latency graphs look
               | like from that day.
        
               | Aperocky wrote:
               | > our policy is if the customer reported the issue before
               | our internal monitoring caught it
               | 
               | They discovered it right away; the Service Health
               | Dashboard just was not updated. Source: the link.
        
               | gkop wrote:
               | They don't explicitly say "right away", do they? I
               | skimmed it twice.
               | 
               | But yes you're right, there's no reason to question their
               | monitoring or alerting specifically.
        
           | JPKab wrote:
           | Multiple services I use were totally skunked, and none were
           | ever anything but green.
           | 
           | Sagemaker, for example, was down all day. I was dead in the
           | water on a modeling project that required GPUs. It relied on
           | EC2, but nobody there even thought to update the status? WTF.
           | This is clearly executives incentivized to let a bug persist.
           | This is because the bug is actually a feature for misleading
           | customers and maximizing profits.
        
       | wly_cdgr wrote:
       | Their service board is always as green as you have to be to trust
       | it
        
       | hourislate wrote:
       | Broadcast storm. Never easy to isolate, as a matter of fact it's
       | nightmarish...
        
       | onion2k wrote:
       | _This congestion immediately impacted the availability of real-
       | time monitoring data for our internal operations teams_
       | 
       | I guess this is why it took ages for the status page to update.
       | They didn't know which things to turn red.
        
       | bpodgursky wrote:
       | DNS?
       | 
       | Of course it was DNS.
       | 
       | It is always* DNS.
        
         | tezza wrote:
         | It is often BGP, regularly DNS, frequently expired keys,
         | sometimes a bad release and occasionally a fire
        
         | shepherdjerred wrote:
         | This wasn't caused by DNS. DNS was just a symptom.
        
       | yegle wrote:
       | Did this outage only impact the us-east-1 region? I think I
       | saw other regions affected in some HN comments, but this
       | summary did not mention anything to suggest that more than
       | one region was impacted.
        
         | teej wrote:
         | There are some AWS services, notably STS, that are hosted in
         | us-east-1. I don't have anything in us-east-1 but I was
         | completely unable to log into the console to check on the
         | health of my services.
        
       | qwertyuiop_ wrote:
       | House of cards
        
       | simlevesque wrote:
       | > Customers accessing Amazon S3 and DynamoDB were not impacted by
       | this event. However, access to Amazon S3 buckets and DynamoDB
       | tables via VPC Endpoints was impaired during this event.
       | 
       | What does this even mean? I bet most people use DynamoDB from
       | within a VPC, in a Lambda or on EC2.
        
         | jtoberon wrote:
         | VPC Endpoint is a feature of VPC:
         | https://docs.aws.amazon.com/vpc/latest/privatelink/endpoint-...
        
         | discodave wrote:
         | Your application can call DynamoDB via the public endpoint
         | (dynamodb.us-east-1.amazonaws.com). But if you're in a VPC
         | (i.e. practically all AWS workloads in 2021), you have to route
         | to the internet (you need public subnet(s) I think) to make
         | that call.
         | 
         | VPC Endpoints create a DynamoDB endpoint in your VPC, from the
         | documentation:
         | 
         | "When you create a VPC endpoint for DynamoDB, any requests to a
         | DynamoDB endpoint within the Region (for example, dynamodb.us-
         | west-2.amazonaws.com) are routed to a private DynamoDB endpoint
         | within the Amazon network. You don't need to modify your
         | applications running on EC2 instances in your VPC. The endpoint
         | name remains the same, but the route to DynamoDB stays entirely
         | within the Amazon network, and does not access the public
         | internet."
        
           | aeyes wrote:
           | I call my DynamoDB tables via the public endpoint and it was
           | severely impaired - a high error rate and very high latency
           | (on the order of seconds).
        
         | all2well wrote:
         | From within a VPC, you can either access DynamoDB via its
         | public internet endpoints (eg, dynamodb.us-
         | east-1.amazonaws.com, which routes through an Internet Gateway
         | attachment in your VPC), or via a VPC endpoint for dynamodb
         | that's directly attached to your VPC. The latter is useful in
         | cases where you want a VPC to not be connected to the internet
         | at all, for example.
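         | 
         | For anyone curious, creating the gateway endpoint is a
         | single API call (hypothetical VPC and route-table IDs
         | below); after that, traffic to the regional DynamoDB
         | hostname stays on the Amazon network:
         | 
         |   import boto3
         | 
         |   ec2 = boto3.client("ec2", region_name="us-east-1")
         | 
         |   # A gateway endpoint adds routes for the DynamoDB
         |   # prefix list to the given route tables.
         |   ec2.create_vpc_endpoint(
         |       VpcEndpointType="Gateway",
         |       VpcId="vpc-0123456789abcdef0",
         |       ServiceName="com.amazonaws.us-east-1.dynamodb",
         |       RouteTableIds=["rtb-0123456789abcdef0"],
         |   )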
        
       | foobarbecue wrote:
       | "... the networking congestion impaired our Service Health
       | Dashboard tooling from appropriately failing over to our standby
       | region. By 8:22 AM PST, we were successfully updating the Service
       | Health Dashboard."
       | 
       | Sounds like they lost the ability to update the dashboard. HN
       | comments at the time were theorizing it wasn't being updated due
       | to bad policies (need CEO approval) etc. Didn't even occur to me
       | that it might be stuck in green mode.
        
         | dgivney wrote:
         | In the February 2017 S3 outage, AWS was unable to change the
         | status icons to red because those images happened to be
         | stored on the servers that went down.
         | 
         | https://twitter.com/awscloud/status/836656664635846656
        
         | [deleted]
        
         | wongarsu wrote:
         | Hasn't this exact thing (something in US-east-1 goes down, AWS
         | loses ability to update dashboard) happened before? I vaguely
         | remember it was one of the S3 outages, but I might be wrong.
         | 
         | In any case, AWS not updating their dashboard is almost a meme
         | by now. Even for global service outages the best you will get
         | is a yellow.
        
           | foobarbecue wrote:
           | Yeah, probably. I haven't watched it this closely before
           | during an outage. I have no idea if this happens in good
           | faith, bad faith, or (probably) a mix.
        
       | llaolleh wrote:
       | I wonder if they could've designed better circuit breakers for
       | situations like this. They're very common in electrical
       | engineering, but I don't think they're as common in software
       | design. It's something we should actually try to design and
       | build in for cases like this.
        
         | EsotericAlgo wrote:
         | They're a fairly common design pattern https://en.m.wikipedia.o
         | rg/wiki/Circuit_breaker_design_patte.... However, they
         | certainly aren't implemented with the frequency they should be
         | at service level boundaries resulting in these sorts of
         | cascading failures.
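         | 
         | The pattern itself is small. A minimal sketch (illustrative
         | only, thresholds made up) that fails fast once a dependency
         | keeps erroring, then lets a trial call through after a
         | cooldown:
         | 
         |   import time
         | 
         |   class CircuitBreaker:
         |       def __init__(self, threshold=5, cooldown=30.0):
         |           self.threshold = threshold  # failures to open
         |           self.cooldown = cooldown    # seconds held open
         |           self.failures = 0
         |           self.opened_at = None
         | 
         |       def call(self, fn):
         |           if self.opened_at is not None:
         |               waited = time.time() - self.opened_at
         |               if waited < self.cooldown:
         |                   raise RuntimeError("open: failing fast")
         |               self.opened_at = None  # half-open trial
         |           try:
         |               result = fn()
         |           except Exception:
         |               self.failures += 1
         |               if self.failures >= self.threshold:
         |                   self.opened_at = time.time()
         |               raise
         |           self.failures = 0  # a success closes the circuit
         |           return result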
        
         | riknos314 wrote:
         | One of the big issues mentioned was that one of the circuit
         | breakers they did have (client back-off) didn't function
         | properly. So they did have a circuit breaker in the design,
         | but it was broken.
        
         | kevin_nisbet wrote:
         | Netflix was talking a lot about circuit breakers a few years
         | ago, and had the Hystrix project. It looks like Hystrix is
         | discontinued, so I'm not sure if there are good library
         | solutions that are easy to adopt. Overall I don't see it
         | getting talked about that frequently... beyond just
         | exponential backoff inside a retry loop.
         | 
         | - https://github.com/Netflix/Hystrix -
         | https://www.youtube.com/watch?v=CZ3wIuvmHeM I think talks about
         | Hystrix a bit, but I'm not sure if it's the presentation I'm
         | thinking of from years ago or not.
        
           | isbvhodnvemrwvn wrote:
           | In the JVM land resilience4j is the de facto successor of
           | Hystrix:
           | 
           | https://github.com/resilience4j/resilience4j
        
       | User23 wrote:
       | A packet storm outage? Now that brings back memories. Last time I
       | saw that it was rendezvous misbehaving.
        
       | jetru wrote:
       | Complex systems are really really hard. I'm not a big fan of
       | seeing all these folks bash AWS for this without really
       | understanding the complexity or nastiness of situations like
       | this. Running the kind of services they do for the kind of
       | customers, this is a VERY hard problem.
       | 
       | We ran into a very similar issue, but at the database layer in
       | our company literally 2 weeks ago, where connections to our MySQL
       | exploded and completely took down our data tier and caused a
       | multi-hour outage, compounded by retries and thundering herds.
       | Understanding this problem under the stressful scenario is
       | extremely difficult and a harrowing experience. Anticipating this
       | kind of issue is very very tricky.
       | 
       | Naive responses to this include "better testing", "we should be
       | able to do this", "why is there no observability" etc. The
       | problem isn't testing. Complex systems behave in complex ways,
       | and its difficult to model and predict, especially when the
       | inputs to the system aren't entirely under your control.
       | Individual components are easy to understand, but when
       | integrating, things get out of whack. I can't stress how
       | difficult it is to model or even think about these systems,
       | they're very very hard. Combined with this knowledge being
       | distributed among many people, you're dealing with not only
       | distributed systems, but also distributed people, which adds
       | more difficulty in wrapping your head around it.
       | 
       | Outrage is the easy response. Empathy and learning is the
       | valuable one. Hugs to the AWS team, and good learnings for
       | everyone.
        
         | [deleted]
        
         | spfzero wrote:
         | But Amazon advertises that they DO understand the complexity of
         | this, and that their understanding, knowledge and experience is
         | so deep that they are a safe place to put your critical
         | applications, and so you should pay them lots of money to do
         | so.
         | 
         | Totally understand that complex systems behave in
         | incomprehensible ways (hopefully only temporarily
         | incomprehensible). But they're selling people on the idea of
         | trading your complex system, for their far more complex system
         | that they manage with such great expertise that it is more
         | reliable.
        
         | raffraffraff wrote:
         | Interesting. Just wondering if you guys have a dedicated DBA?
        
           | raffraffraff wrote:
           | Not sure why I got down voted for an honest question. Most
           | start-ups are founders, developers, sales and marketing.
           | Dedicated infrastructure, network and database specialists
           | don't get factored in because "smart CS graduates can figure
           | that stuff out". I've worked at companies who held onto that
           | false notion _way_ too long and almost lost everything as a
           | result ( "company extinction event", like losing a lot of
           | customer data)
        
         | danjac wrote:
         | The problem isn't AWS per se. The problem is it's become too
         | big to fail. Maybe in the past an outage might take down a few
         | sites, or one hospital, or one government service. Now one
         | outage takes out all the sites, all the hospitals and all the
         | government services. Plus your coffee machine stops working.
        
         | qaq wrote:
         | Very good summary of why small projects need to think real hard
         | before jumping onto microservices bandwagon.
        
         | tuldia wrote:
         | Excuse me, do we need all that complexity? Is saying that it
         | is "hard" a justification?
         | 
         | It is naive to assume people bashing AWS are incapable of
         | running things better, cheaper, faster, across many other
         | vendors, on-prem, colocation or what not.
         | 
         | > Outrage is the easy response.
         | 
         | That is what made AWS get the market share it has now in the
         | first place: the easy responses.
         | 
         | The main selling point of AWS in the beginning was "how easy
         | it is to spin up a virtual machine". After basically every
         | layman started recommending AWS and we flocked there, AWS
         | started making things more complex than it should be. Was
         | that to make it harder to get out of it? IDK.
         | 
         | > Empathy and learning is the valuable one.
         | 
         | When you run your infrastructure and something fails and you
         | are not transparent, your users will bash you, independently
         | who you are.
         | 
         | And that was another "easy response" used to drive companies
         | towards AWS. We developers were echoing that "having an
         | infrastructure team or person is not necessary", etc.
         | 
         | Now we are stuck in this learned helplessness where every
         | outage is a complete disaster in terms of transparency, with
         | multiple services failing, even for multi-region and multi-az
         | customers, with us saying "this service here is also not
         | working" and AWS simply stating that the service was fine, not
         | affected, up and running.
         | 
         | If it were a sysadmin doing that, people would be coming for
         | their neck with pitchforks.
        
           | noahtallen wrote:
           | > AWS started making things more complex than it should
           | 
           | I don't think this is fair for a couple reasons:
           | 
           | 1. AWS would have had to scale regardless just because of the
           | number of customers. Even without adding features. This means
           | many data centers, complex virtual networking, internal
           | networks, etc. These are solving very real problems that
           | happen when you have millions of virtual servers.
           | 
           | 2. AWS hosts many large, complex systems like Netflix.
           | Companies like Netflix are going to require more advanced
           | features out of AWS, and this will result in more features
           | being added. While this is added complexity, it's also
           | solving a customer problem.
           | 
           | My point is that complexity is inherent to the benefits of
           | the platform.
        
         | xamde wrote:
         | There is this really nice website which explains how complex
         | systems fail: https://how.complexsystems.fail/
        
         | iso1631 wrote:
         | > I'm not a big fan of seeing all these folks bash AWS for
         | this,
         | 
         | The disdain I saw was towards those claiming that all you need
         | is AWS, that AWS never goes down, and don't bother planning for
         | what happens when AWS goes down.
         | 
         | AWS is an amazing accomplishment, but it's still a single point
         | of failure. If you are a company relying on a single supplier
         | and you don't have any backup plans for that supplier being
         | unavailable, that is ridiculous and worthy of laughter.
        
         | fastball wrote:
         | I think most of the outrage is not because "it happened" but
         | because AWS is saying things like "S3 was unaffected" when the
         | anecdotal experience of many in this thread suggests the
         | opposite.
         | 
         | That and the apparent policy that a VP must sign off on
         | changing status pages, which is... backwards to say the least.
        
           | jetru wrote:
           | There's definitely miscommunication around this. I know I've
           | miscommunicated impact, or my communication was
           | misinterpreted across the 2 or 3 people it had to jump before
           | hitting the status page.
           | 
           | For example, the meaning of "S3 was affected" is subject to a
           | lot of interpretation. STS was down, which is a blocker for
           | accessing S3. So, the end result is S3 is effectively down,
           | but technically it is not. How does one convey this in a
           | large org? You run S3, but not STS, it's not technically an
           | S3 fault, but an integration fault across multiple services.
           | If you say S3 is down, you're implying that the storage layer
           | is down. But it's actually not. What's the best answer to
           | make everyone happy here? I can't think of one.
        
             | mickeyp wrote:
             | > I can't think of one.
             | 
             | I can.
             | 
             | "S3 is unavailable because X, Y, and Z services are
             | unavailable."
             | 
             | A graph of dependencies between services is surely known to
             | AWS; if not, they ought to create one post-haste.
             | 
             | Trying to externalize Amazon's internal AWS politicking
             | over which service is down is unproductive to the customers
             | who check the dashboard and see that their service _ought_
             | to be up, but... well, it isn't?
             | 
             | Because those same customers have to explain to _their_
             | clients and bosses why their systems are malfunctioning,
             | yet it  "shows green" on a dashboard somewhere that almost
             | never shows red.
             | 
             | (And I can levy this complaint against Azure too, by the
             | way.)
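             | 
             | Even a toy rollup over such a graph is enough to turn "all
             | green" into something honest. A sketch with made-up service
             | names (a real graph would also need cycle handling):
             | 
             |   import java.util.*;
             | 
             |   public class StatusRollup {
             |     // illustrative edges, not AWS's real graph
             |     static Map<String, List<String>> deps = Map.of(
             |         "s3-console", List.of("sts", "dns"),
             |         "sts", List.of("dns"),
             |         "dns", List.of());
             |     static Set<String> down = Set.of("dns");
             | 
             |     // impaired if the service or any transitive
             |     // dependency is down
             |     static boolean impaired(String s) {
             |       if (down.contains(s)) return true;
             |       for (String d : deps.getOrDefault(s, List.of()))
             |         if (impaired(d)) return true;
             |       return false;
             |     }
             | 
             |     public static void main(String[] args) {
             |       for (String s : deps.keySet())
             |         System.out.println(s + ": " +
             |             (impaired(s) ? "impaired via dep" : "ok"));
             |     }
             |   }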
        
               | jetru wrote:
               | This is a good idea
        
               | Corrado wrote:
               | Yes, I can envision a (simplified) AWS X-Ray dashboard
               | showing the relationships between the systems and the
               | performance of each one. Then we could see at a glance
               | what was going on. Almost anything is better than that
               | wall of text, tiny status images, and RSS feeds.
        
               | wiredfool wrote:
               | And all relationships eventually end in us-east-1.
        
             | brentcetinich wrote:
             | It doesn't hinge on the dependency graph, but on the
             | definition of "unavailable" for S3 in its SLA.
        
           | amzn-throw wrote:
           | > a VP must sign off on changing status pages, which is...
           | backwards to say the least.
           | 
           | I think most people's experience with "VPs" makes them not
           | realize what AWS VPs do.
           | 
           | VPs here are not sitting in an executive lounge wining and
           | dining customers, chomping on cigars and telling minions to
           | "Call me when the data center is back up and running again!"
           | 
           | They are on the tech call, working with the engineers,
           | evaluating the problem, gathering the customer impact, and
           | attempting to balance communicating too early with being
           | precise.
           | 
           | Is there room for improvement? Yes. I wish we would just
           | throw up a generic "Shit's Fucked Up. We Don't Know Why Yet,
           | But We're Working On It" message.
           | 
           | But the reason we don't doesn't have anything to do with
           | having to get VP approval to put that message up. The VPs
           | are there in the trenches most of the time.
        
             | kortilla wrote:
             | It doesn't matter what the VPs are doing, that misses the
             | point. Every minute you know there is a problem and you
             | haven't at least put up a "degraded" status, you're lying
             | to your customers.
             | 
             | It was on the top of HN for an hour before anything
             | changed, and then it was still downplayed, which is insane.
        
             | gingerlime wrote:
             | > I wish we would just throw up a generic "Shit's Fucked
             | Up. We Don't Know Why Yet, But We're Working On It"
             | message.
             | 
             | I think that's the crux of the matter? AWS seems to now
             | have a reputation for ignoring issues that are easily
             | observable by customers, and by the time any update shows
             | up, it's way too late. Whether VPs make this decision or
             | not is irrelevant. If this becomes a known pattern (and I
             | think it has), then the system is broken.
             | 
             | disclaimer: I have very little skin in this game. We use S3
             | for some static assets, and with layers of caching on top,
             | I think we are rarely affected by outages. I'm still
             | curious to observe major cloud outages and how they are
             | handled, and the HN reaction from people on both side of
             | the fence.
        
               | 0x0nyandesu wrote:
               | Saying "S3 is down" can mean anything. Our S3 buckets
               | that served static web content stayed up no problem. The
               | API was down though. But for the purposes of whether my
               | organization cares I'm gonna say it was "up".
        
               | kortilla wrote:
               | Who cares if it worked for your use case?
               | 
               | Being unable to store objects in an object store means
               | that it's broken.
        
               | shakna wrote:
               | > We are currently experiencing some problems related to
               | FOO service and are investigating.
               | 
               | A generic, utterly meaningless message, which is still a
               | hell of a lot more than usually gets approved, and
               | approved far too late.
               | 
               | It is also still better than "all green here, nothing to
               | see" which has people looking at their own code, because
               | they _expect_ that they will be the problem, not AWS.
        
               | jrochkind1 wrote:
               | Most of what they actually said via the manual human-
               | language status updates was "Service X is seeing elevated
               | error rates".
               | 
               | While there are still decisions to be made in how you
               | monitor errors and what sorts of elevated rates merit an
               | alert -- I would bet that AWS has internally-facing
               | systems that can display service health in this way based
               | on automated monitoring of error rates (as well as other
               | things). Because they know it means something.
               | 
               | They apparently choose to make their public-facing
               | service health page only show alerts via a manual process
               | that often results in an update only several hours after
               | lots of customers have noticed problems. This seems like
               | a choice.
               | 
               | What's the point of a status page? To me, the point of it
               | is, when I encounter a problem (perhaps noticed because
               | of my own automated monitoring), one of the first thing I
               | want to do is distinguish between a problem that's out of
               | my control on the platform, and a problem that is under
               | my control and I can fix.
               | 
               | A status page that does not support me in doing that is
               | not fulfilling its purpose. The AWS status page fails to
               | help customers do that, by regularly showing all green
               | with no alerts _hours_ after widespread problems occurred.
        
               | chakspak wrote:
               | > disclaimer: I have very little skin in this game. We
               | use S3 for some static assets, and with layers of caching
               | on top, I think we are rarely affected by outages. I'm
               | still curious to observe major cloud outages and how they
               | are handled, and the HN reaction from people on both side
               | of the fence.
               | 
               | I'd like to share my experience here. This outage
               | definitely impacted my company. We make heavy use of
               | autoscaling, we use AWS CodeArtifact for Python packages,
               | and we recently adopted AWS Single Sign-On and EC2
               | Instance Connect.
               | 
               | So, you can guess what happened:
               | 
               | - No one could access the AWS Console.
               | 
               | - No one could access services authenticated with SAML.
               | 
               | - Very few CI/CD, training or data pipelines ran
               | successfully.
               | 
               | - No one could install Python packages.
               | 
               | - No one could access their development VMs.
               | 
               | As you might imagine, we didn't do a whole lot that day.
               | 
               | With that said, this experience is unlikely to change our
               | cloud strategy very much. In an ideal world, outages
               | wouldn't happen, but the reason we use AWS and the cloud
               | in general is so that, when they do happen, we aren't
               | stuck holding the bag.
               | 
               | As others have said, these giant, complex systems are
               | hard, and AWS resolved it in only a few hours! Far better
               | to sit idle for a day rather than spend a few days
               | scrambling, VP breathing down my neck, discovering that
               | we have no disaster recovery mechanism, and we never
               | practiced this, and hardware lead time is 3-5 weeks, and
               | someone introduced a cyclical bootstrapping process, and
               | and and...
               | 
               | Instead, I just took the morning off, trusted the
               | situation would resolve itself, and it did. Can't
               | complain. =P
               | 
               | I might be more unhappy if we had customer SLAs that were
               | now broken, but if that was a concern, we probably should
               | have invested in multi-region or even multi-cloud
               | already. These things happen.
        
             | jrochkind1 wrote:
             | > I wish we would just throw up a generic "Shit's Fucked
             | Up. We Don't Know Why Yet, But We're Working On It"
             | message.
             | 
             | I gotta say, the implication that you can't register an
             | outage until you know why it happened is pretty damning.
             | The status page is where we look to see if services are
             | affected; if that information can't be shared there until
             | you understand the cause, that's very broken.
             | 
             | The AWS status page has become kind of a joke to customers.
             | 
             | I was encouraged to see the announcement in OP say that
             | there is "a new version of our Service Health Dashboard"
             | coming. I hope it can provide actual capabilities to
             | display, well, service health.
             | 
             | From how people talk about it, it kind of sounds like
             | updates to the Service Health Dashboard are currently
             | purely a manual process. Rather than automated monitoring
             | automatically updating the Service Health Dashboard in any
             | way at all. I find that a surprising implementation for an
             | organization of Amazon's competence and power. That alarms
             | me more than _who_ it is that has the power to manually
             | update it; I agree that I don't have enough knowledge of
             | AWS internal org structures to have an opinion on if it's
             | the "right" people or not.
             | 
             | I suspect AWS must have internal service health pages that
             | are actually automatically updated in some way by
             | monitoring, that is, that actually work to display service
             | health. It _seems_ like a business decision rather than a
             | technical challenge if the public facing system has no
             | inputs but manual human entry, but that's just how it
             | seems from the outside, I may not have full information. We
             | only have what Amazon shares with us of course.
        
               | theneworc wrote:
               | Can you please help me understand why you, and everyone
               | else, are so passionate about the status page?
               | 
               | I get that it not being updated is an annoyance, but I
               | cannot figure out why it is the single most discussed
               | thing about this whole event. I mean, entire services
               | were out for almost an entire day, and if you read HN
               | threads it would seem that nobody even cares about lost
               | revenue/productivity, downtime, etc. The vast majority of
               | comments in all of the outage threads are screaming about
               | how the SHD lied.
               | 
               | In my entire career of consulting across many companies
               | and many different technology platforms, never _once_
               | have I seen or heard of anyone even _looking_ at a status
               | page outside of HN. I'm not exaggerating. Even over the
               | last 5 years when I've been doing cloud consulting,
               | nobody I've worked with has cared at all about the cloud
               | provider's status pages. The _only_ time I see it brought
               | up is on HN, and when it gets brought up on HN it's
               | discussed with more fervor than most other topics, even
               | the outage itself.
               | 
               | In my real life (non-HN) experience, when an outage
               | happens, teams ask each other "hey, you seeing problems
               | with this service?" "yea, I am too, heard maybe it's an
               | outage" "weird, guess I'll try again later" and go get a
               | coffee. In particularly bad situations, they might check
               | the news or ask me if I'm aware of any outage. Either
               | way, we just... go on with our lives? I've never needed,
               | nor have I ever seen people need, a status page to inform
               | them that things aren't working correctly, but if you
               | read HN you would get the impression that entire
               | companies of developers are completely paralyzed unless
               | the status page flips from green to red. Why? I would
               | even go as far as to say that if you need a third party's
               | SHD to tell you if things aren't working right, then
               | you're probably doing something wrong.
               | 
               | Seriously, what gives? Is all this just because people
               | love hating on Amazon and the SHD is an easy target?
               | Because that's what it seems like.
        
               | femiagbabiaka wrote:
               | It is _extremely_ common for customers to care about
               | being informed accurately about downtime, and not just
               | for AWS. I think your experience of not caring and not
               | knowing anyone who cares may be an outlier.
        
               | aflag wrote:
               | A status page gives you confidence that the problem indeed
               | lies with Amazon and not your own software. I don't think
               | it's very reasonable to notice issues, ask other teams if
               | they are also having issues, and if so, just shrug it off
               | and get a cup of coffee without more investigation. Just
               | because it looks like the problem is with AWS, you can't
               | be sure until you further investigate it, especially if
               | the status page says it's all working fine.
               | 
               | I think it goes without saying that having an outage is
               | bad, but having an outage which is not confirmed by the
               | service provider is even worse. People complain about
               | that a lot because it's the least they could do.
        
               | avereveard wrote:
               | https://aws.amazon.com/it/legal/service-level-agreements/
               | 
               | There's literally millions on the line.
        
               | swasheck wrote:
               | aws isn't a hobby platform. businesses are built on aws
               | and other cloud providers. those businesses' customers
               | have the expectation of knowing why they are not
               | receiving the full value of their service.
               | 
               | it makes sense that, as part of marketing yourself as a
               | viable infrastructure upon which other businesses can
               | operate, you'd provide more granular and refined
               | communication to allow better communication up and down
               | the chain instead of forcing your customers to rca your
               | service in order to communicate to their customers.
        
               | glogla wrote:
               | > Can you please help me understand why you, and everyone
               | else, are so passionate about the status page?
               | 
               | I don't think people are "passionate about status page."
               | I think people are unhappy with someone they are supposed
               | to trust straight up lying to their face.
        
             | tommek4077 wrote:
             | There should be no one to sign off anything. Your status
             | page should be updated automatically, not manually!
        
             | aflag wrote:
             | I don't think the matter is whether or not VPs are
             | involved, but the fact that human sign off is required.
             | Ideally the dashboard would accurately show what's working
             | or not, regardless if the engineers know what's going on.
        
               | lobocinza wrote:
               | That would be too much honesty for a corp.
        
         | metb wrote:
         | Thanks for these thoughts. Resonated well with me. I feel we
         | are sleepwalking into major fiascos, when a simple doorbell
         | needs to sit on top of this level of complexity. It's in our best
         | interest to not tie every small thing into layers, and layers
         | of complexity. Mundane things like doorbells need to have their
         | fallback at least done properly to function locally without
         | relying on complex cloud systems.
        
         | simonbarker87 wrote:
         | I'm not all that angry over the situation but more disappointed
         | that we've all collectively handed the keys over to AWS because
         | "servers are hard". Yeh they are but it's not like locking
         | ourselves into one vendor with flaky docs and a black box of
         | bugs is any better, at least when your own servers go down it's
         | on you and you don't take out half of North America.
        
           | ranguna wrote:
           | You can either pay a dedicated team to manage your on prem
           | solution, go multi cloud, or simply go multi region on aws.
           | 
           | My company was not affected by this outage because we are
           | multi region. Cheapest and quickest option if you want to
           | have at least some fault tolerance.
        
             | tuldia wrote:
             | > ... multi region. Cheapest and quickest option if you
             | want to have at least some fault tolerance.
             | 
             | That is simply not true: you have to adapt your application
             | to be multi-region aware to start with, and if you do that
             | on AWS you are basically locked in to one of the most
             | expensive cloud providers out there.
        
             | iso1631 wrote:
             | So was mine, but we couldn't log in
             | 
             | But yes, having services resilient to a single point of
             | failure is essential. AWS is a SPOF.
        
           | linsomniac wrote:
           | If you aren't going to rely on external vendors, servers are
           | really, really hard. Redundancy in: power, cooling,
           | networking? Those get expensive fast. Drop your servers into
           | a data center and you're in a similar situation to dropping
           | it in AWS.
           | 
           | A couple years ago all our services at our data center just
           | vanished. I call the data center and they start creating a
           | ticket. "Can you tell me if there is a data center outage?"
           | "We are currently investigating and I don't have any
           | information I can give you." "Listen, if this is a problem
           | isolated to our cabinet, I need to get in the car. I'm trying
           | to decide if I need to drive 60 miles in a blizzard."
           | 
           | That facility has been pretty good to us over a decade, but
           | they were frustratingly tight-lipped about an entire room of
           | the facility losing power because one of their power feeder
           | lines was down.
           | 
           | Could AWS improve? Yes. Does avoiding AWS solve these sorts
           | of problems? No.
        
           | nix23 wrote:
           | Servers are not hard if you have a dedicated person (long
           | ago known as a systems administrator), and fun fact... it's
           | sometimes even much cheaper and more reliable than having
           | everything in the "cloud".
           | 
           | Personally I am a believer in mixed environments: public
           | webservers etc. in the "cloud", locally used systems and
           | backup "in house" with a second location (both in data
           | centers, or at least one), and no, I'm not talking about the
           | next Google but about the 99% of businesses.
        
             | isbvhodnvemrwvn wrote:
             | Not one person, at least four people to run stuff 24/7.
        
               | nix23 wrote:
               | 99% of businesses don't need 24/7, but two people are the
               | bare minimum (an admin, and a dev or admin)
        
         | juanani wrote:
         | Bashing trillion dollar behemoths is the right thing to do; by
         | default they spend a certain % on influencing your subconscious
         | brain to get up and defend them every time someone has a bad
         | article on them. They've probably spent billions on writing
         | positive articles about themselves. They only ask for more and
         | more money, so you get antsy when they f up and people are not
         | pleased with them? They aren't jumping up to lick their boots?
         | Fascist shills, can you fuck off to Mars already?
        
         | Ensorceled wrote:
         | > Outrage is the easy response. Empathy and learning is the
         | valuable one.
         | 
         | I'm outraged that AWS, as a company policy, continues to lie
         | about the status of their systems during outages, making it
         | hard for me to communicate to my stakeholders.
         | 
         | Empathy? For AWS? AWS is part of a mega corporation that is
         | closing in on 2 TRILLION dollars in market cap. It's not a
         | person. I can empathize with individuals who work for AWS but
         | it's weird to ask us to have empathy for a massive faceless,
         | ruthless, relentless, multinational juggernaut.
        
           | Jgrubb wrote:
           | It seems obvious to me that they're specifically talking
           | about having empathy for the people who work there, the
           | people who designed and built these systems and yes, empathy
           | even for the people who might not be sure what to put on
           | their absolutely humongous status page until they're sure.
        
             | Ensorceled wrote:
             | But I don't see people attacking the AWS team, at worst the
             | "VP" who has to approve changes to the dashboard. That's
             | management and that "VP" is paid a lot.
        
           | ithkuil wrote:
           | My reading of GP's comment is that the empathy should be
           | directed towards AWS' _team_, the people who are building
           | the system and handling the fallout, not AWS the corporate
           | entity.
           | 
           | I may be wrong, but I try to apply the
           | https://en.m.wikipedia.org/wiki/Principle_of_charity
        
       | xyst wrote:
       | I am not a fan of AWS due to their substantial market share on
       | cloud computing. But as a software engineer I do appreciate their
       | ability to provide fast turnarounds on root cause analyses and
       | make them public.
        
         | waz0wski wrote:
         | This isn't a good example of an RCA - as other commenters have
         | noted, it's outright lying about some issues during the
         | incident, and using creative language to dance around other
         | problems many people encountered.
         | 
         | If you want to dive into postmortems, there are some repos
         | linking other examples
         | 
         | https://github.com/danluu/post-mortems
         | 
         | https://codeberg.org/hjacobs/kubernetes-failure-stories
        
       | nayuki wrote:
       | > Operators instead relied on logs to understand what was
       | happening and initially identified elevated internal DNS errors.
       | Because internal DNS is foundational for all services and this
       | traffic was believed to be contributing to the congestion, the
       | teams focused on moving the internal DNS traffic away from the
       | congested network paths. At 9:28 AM PST, the team completed this
       | work and DNS resolution errors fully recovered.
       | 
       | Having DNS problems sounds a lot like the Facebook outage of
       | 2021-10-04. https://en.wikipedia.org/wiki/2021_Facebook_outage
        
         | shepherdjerred wrote:
         | It's quite a bit different... Facebook took themselves offline
         | completely because of a bad BGP update, whereas AWS had network
         | congestion due to a scaling event. DNS relies on the network,
         | so of course it'll be impacted if networking is also impacted.
        
           | bdd wrote:
           | no. it wasn't a "bad bgp update". bgp withdrawal of anycast
           | addresses was a desired outcome of a region (serving
           | location) getting disconnected from the backbone. if you'd
           | like to trivialize it, you can say it was a configuration
           | change to the software-defined backbone.
        
         | human wrote:
         | The rule is that it's _always_ DNS.
        
           | jessaustin wrote:
           | DNS seemed to be involved with _both_ the Spectrum business
           | internet and Charter internet outages overnight. So much for
           | diversifying!
        
       | grouphugs wrote:
       | stop using aws, i can't wait till amazon is hit so hard every
       | day they can't maintain customers
        
       | tyingq wrote:
       | _" Amazon Secure Token Service (STS) experienced elevated
       | latencies"_
       | 
       | I was getting 503 "service unavailable" from STS during the
       | outage most of the time I tried calling it.
       | 
       | I guess by "elevated latency", they mean from anyone with retry
       | logic that would keep trying after many consecutive attempts?
        
         | jrockway wrote:
         | I suppose all outages are just elevated latency. Has anyone
         | ever had an outage and said "fuck it, we're going out of
         | business" and never came back up? That's the only true outage
         | ;)
        
           | hericium wrote:
           | 5xx errors are servers or proxies giving up on requests.
           | Increased timeouts resulting in successful requests may have
           | been considered "elevated latency" (but rarely this would be
           | a proper way to solve similar issue).
           | 
           | They treat 5xx errors as non-errors but this is not the case
           | with rest of the world. "Increased timeouts" is Amazon's
           | untruthful term for "not working at all".
        
         | comboy wrote:
         | So many lessons in this article. When your service goes down
         | but eventually gets back up, it's not an outage. It's "elevated
         | latency". Of a few hours, maybe days.
        
         | WaxProlix wrote:
         | STS is the worst with this. Even for other internal teams, they
         | seem to treat dropped requests (i.e., timeouts, which show up
         | as 5xxs on the client side) as 'non faults', and so don't
         | count those data points in their graphs and alarms. It's really
         | obnoxious.
         | 
         | AWS in general is trying hard to do the right thing for
         | customers, and obviously has a long ways to go. But man, a few
         | specific orgs have some frustrating holdover policies.
        
           | hericium wrote:
           | > AWS in general is trying hard to do the right thing for
           | customers
           | 
           | You are responding to a comment that suggests they're
           | misrepresenting the truth (which wouldn't be the first time
           | even in the last few days) in communication to their customers.
           | 
           | As always, they are doing the right thing for themselves
           | only.
           | 
           | EDIT: I think that you should mention being an Engineer at
           | Amazon AWS in your comment.
        
             | tybit wrote:
             | It was very clear from their post that they were
             | criticising STS from the perspective of an engineer in AWS
             | within a different team.
        
               | hericium wrote:
               | I assumed in good faith that this was someone who knows
               | the internals as a larger customer, not an AWS person shit-
               | talking other AWS teams.
               | 
               | Got curious only after a downvote hence late edit. My
               | bad.
        
               | ignoramous wrote:
               | > _...an AWS person shit-talking other AWS teams [in
               | public]._
               | 
               | I remember a time when this would be an instant
               | reprimand... Either amzn engs are bolder these days, or
               | amzn hr is trying really hard for amzn to be "world's
               | best employer", or both.
        
               | filoleg wrote:
               | Gotta deanonymize the user to reprimand them. Maybe I am
               | wrong here, but I don't see it as something an Amazon HR
               | employee would actually waste their time on (exceptions
               | apply for confidential info leaks and other blatantly
               | illegal stuff, of course). Especially given that it might
               | as well be impossible, unless the user incriminated
               | themselves with identifiable info.
        
               | WaxProlix wrote:
               | It's true that I shouldn't have posted it, was mostly
               | just in a grumpy mood. It's still considered very bad
               | form. I'm not actually there anymore, but the idea
               | stands.
        
       | bamboozled wrote:
       | Still doesn't explain the cause of all the IAM permission denied
       | requests we saw against policies which are again working fine
       | without any intervention.
       | 
       | Obviously networking issues can cause any number of symptoms but
       | it seems like an unusual detail to leave out to me. Unless it was
       | another ongoing outage happening at the same time.
        
         | notimetorelax wrote:
         | It's so hard to know what the state of the system was when the
         | monitoring was out. Wouldn't be surprised if they don't have
         | the data to investigate it now.
        
         | a45a33s wrote:
         | how are auth requests supposed to reach the auth server if the
         | networking is broken?
        
           | bamboozled wrote:
           | I'd accept this as an answer if I received a timeout or a
           | message to say that.
           | 
           | Permission denied is something else altogether, because it
           | implies the request reached an authorisation system, was
           | evaluated and denied.
        
             | sbierwagen wrote:
             | Fail-secure + no separate error for timeouts maybe? If the
             | server can't be reached then it just denies the request.
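             | 
             | If that guess is right, the fix is mostly not collapsing
             | the two cases into one answer. A rough sketch (types and
             | names invented):
             | 
             |   import java.util.concurrent.TimeoutException;
             | 
             |   public class AuthGate {
             |     enum Result { ALLOW, DENY, UNREACHABLE }
             | 
             |     // fail closed either way, but keep the reason
             |     // distinct so callers don't read an outage as a
             |     // policy denial
             |     static Result authorize(String user, String action) {
             |       try {
             |         return askAuthorizer(user, action)
             |             ? Result.ALLOW : Result.DENY;
             |       } catch (TimeoutException e) {
             |         return Result.UNREACHABLE;
             |       }
             |     }
             | 
             |     // stand-in for the real authorizer call
             |     static boolean askAuthorizer(String u, String a)
             |         throws TimeoutException {
             |       throw new TimeoutException("not reachable");
             |     }
             |   }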
        
       | amznbyebyebye wrote:
       | I'm glad they published something, and so quickly at that.
       | Ultimately these guys are running a business. There are other
       | market alternatives, multibillion dollar contracts at play,
       | SLAs, etc.
       | it's not as simple as people think.
        
       | StreamBright wrote:
       | "At 7:30 AM PST, an automated activity to scale capacity of one
       | of the AWS services hosted in the main AWS network triggered an
       | unexpected behavior from a large number of clients inside the
       | internal network. "
       | 
       | Very detailed.
        
       | herodoturtle wrote:
       | I am grateful to AWS for this report.
       | 
       | Not sure if any AWS support staff are monitoring this thread, but
       | the article said:
       | 
       | > Customers also experienced login failures to the AWS Console in
       | the impacted region during the event.
       | 
       | All our AWS instances / resources are in EU/UK availability
       | zones, and yet we couldn't access our console either.
       | 
       | Thankfully none of our instances were affected by the outage, but
       | our inability to access the console was quite worrying.
       | 
       | Any idea why this was the case?
       | 
       | Any suggestions to mitigate this risk in the event of a future
       | outage would be appreciated.
        
         | [deleted]
        
         | plasma wrote:
         | They posted on the status page to try using the alternate
         | region endpoints like us-west.console.Amazon.com (I think) at
         | the time, but not sure if it was a true fix.
        
       | azundo wrote:
       | > This resulted in a large surge of connection activity that
       | overwhelmed the networking devices between the internal network
       | and the main AWS network, resulting in delays for communication
       | between these networks. These delays increased latency and errors
       | for services communicating between these networks, resulting in
       | even more connection attempts and retries. This led to persistent
       | congestion and performance issues on the devices connecting the
       | two networks.
       | 
       | I remember my first experience realizing the client retry logic
       | we had implemented was making our lives way worse. Not sure if
       | it's heartening or disheartening that this was part of the issue
       | here.
       | 
       | Our mistake was resetting the exponential backoff delay whenever
       | a client successfully connected and received a response. At the
       | time, some but not all responses were degraded and extremely
       | slow, and the request that checked the connection was not. So a
       | client would time out, retry for a while, backing off
       | exponentially, eventually successfully reconnect and then after a
       | subsequent failure start aggressively trying again. System
       | dynamics are hard.
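       | 
       | The shape of the fix we ended up with was roughly the sketch
       | below (constants invented): add jitter, and only reset the
       | backoff after a sustained healthy streak rather than on the
       | first success.
       | 
       |     import java.util.concurrent.ThreadLocalRandom;
       | 
       |     public class Backoff {
       |       static final long BASE_MS = 100, CAP_MS = 60_000;
       |       // one lucky response no longer resets the delay;
       |       // require a streak of successes first
       |       static final int RESET_AFTER = 10;
       | 
       |       long delayMs = BASE_MS;
       |       int streak = 0;
       | 
       |       // "full jitter": sleep a random amount up to the
       |       // current delay, then double the delay (capped)
       |       long nextDelayMs() {
       |         long jittered = ThreadLocalRandom.current()
       |             .nextLong(delayMs + 1);
       |         delayMs = Math.min(CAP_MS, delayMs * 2);
       |         return jittered;
       |       }
       | 
       |       void onSuccess() {
       |         if (++streak >= RESET_AFTER) {
       |           delayMs = BASE_MS;
       |           streak = 0;
       |         }
       |       }
       | 
       |       void onFailure() { streak = 0; }
       |     }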
        
         | EnlightenedBro wrote:
         | But what's a good alternative then? What if the internet
         | connection has recovered? And you were at the, for example, 4
         | minute retry loop. Would you just make your users stare at a
         | spinning loader for 8 minutes?
        
           | xmprt wrote:
           | I first learned about exponential backoff from TCP and TCP
           | has a lot of other smart ways to manage congestion control.
           | You don't need to implement all the ideas into client logic
           | but you can also do a lot better than just basic exponential
           | backoff.
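           | 
           | One such idea is a shared retry budget, so clients
           | collectively stop retrying when most calls are already
           | failing. A rough sketch (numbers arbitrary):
           | 
           |     public class RetryBudget {
           |       static final double MAX = 100;
           |       static final double RETRY_COST = 5, REFUND = 1;
           |       private double tokens = MAX;
           | 
           |       // a retry has to "pay" for itself; when the bucket
           |       // drains, clients stop retrying instead of
           |       // amplifying the storm
           |       synchronized boolean tryAcquireRetry() {
           |         if (tokens < RETRY_COST) return false;
           |         tokens -= RETRY_COST;
           |         return true;
           |       }
           | 
           |       // successful calls slowly earn the budget back
           |       synchronized void onSuccess() {
           |         tokens = Math.min(MAX, tokens + REFUND);
           |       }
           |     }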
        
           | kqr wrote:
           | Sure, why not?
           | 
           | Or tell them directly that "We have screwed up. The service
           | is currently overloaded. Thank you for your patience. If you
           | still haven't given up on us, try again at a less busy time of
           | day. We are very sorry."
           | 
           | There are several options, and finding the best one depends a
           | bit on estimating the behaviour of your specific target
           | audience.
        
           | sciurus wrote:
           | See for instance the client request rejection probability
           | equation at https://sre.google/sre-book/handling-overload/
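           | 
           | For reference, the client-side throttling rule from that
           | chapter boils down to max(0, (requests - K * accepts) /
           | (requests + 1)), with K usually around 2. A minimal sketch:
           | 
           |     public class AdaptiveThrottle {
           |       static final double K = 2.0;
           |       // counts over a trailing window (e.g. two minutes)
           |       double requests = 0, accepts = 0;
           | 
           |       // probability of rejecting a new request locally,
           |       // before it ever reaches the overloaded backend
           |       double rejectProbability() {
           |         return Math.max(0,
           |             (requests - K * accepts) / (requests + 1));
           |       }
           |     }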
        
         | heisenbit wrote:
         | The problem shows up at the central system while the peripheral
         | device is causing it. And those systems belong to very
         | different organizations with very different priorities. I still
         | remember how difficult the discussion was with the 3G base
         | station team, persuading them to implement exponential backoff
         | with some
         | random factor when connecting to the management system.
        
         | colechristensen wrote:
         | > System dynamics are hard.
         | 
         | And have to be actually tested. Most of them are designs based
         | on nothing but uninformed intuition. There is an art to back
         | pressure and keeping pipelines optimally utilized. Queueing
         | doesn't work like you think until you really know.
        
           | gfodor wrote:
           | Why is this hard, and can't just be written down somewhere as
           | part of the engineering discipline? This aspect of systems in
           | 2021 really shouldn't be an "art."
        
             | adrianN wrote:
             | Because no two systems are alike and these are nonlinear
             | effects that strongly depend on the details, would be my
             | guess.
        
             | colechristensen wrote:
             | It is, in itself, a separate engineering discipline, and
             | one that cannot really be practiced analytically unless you
             | understand _really well_ the behavior of individual pieces
             | which interact with each other. Most don't, and don't care
             | to.
             | 
             | It is something which needs to be designed and tuned in
             | place and evades design "getting it right" without real
             | world feedback.
             | 
             | And you also simply have to reach a certain somewhat large
             | scale for it to matter at all: the excess capacity you
             | have, because of the available granularity of capacity at
             | smaller scales, eats up most of the need for it, and you
             | can get away with wasting a bit of money on extra scale to
             | get rid of it.
             | 
             | It is also sensitive to small changes so textbook examples
             | might be implemented wrong with one small detail that won't
             | show itself until a critical failure is happening.
             | 
             | It is usually the location of the highest complexity
             | interaction in a business infrastructure which is not
             | easily distilled to a formula. (and most people just aren't
             | educationally prepared for nonlinear dynamics)
        
             | ksrm wrote:
             | Because """software engineering""" is a joke.
        
             | pfortuny wrote:
             | Exponential behaviour is _hard_ to understand.
             | 
             | Also, experiments at this size and speed have never been
             | carried out before.
             | 
             | And statistical behaviours are very difficult to
             | understand. First thing: 99.9999% uptime for ALL users is
             | HUGELY different from 99.999%.
             | 
             | As a matter of fact, this was _just one_ of Amazon's
             | regions, remember.
             | 
             | Edit: finally, the right model for these systems might well
             | have no mean (fat tails...) and then where do the
             | statistics go from there?
        
             | jmiserez wrote:
             | It absolutely is written down. The issue is that the
             | results you get from modeling systems using queuing theory
             | are often unintuitive and surprising. On top of that it's
             | hard to account for all the seemingly minor implementation
             | details in a real system.
             | 
             | During my studies we had a course where we built a
             | distributed system and had to model its performance
             | mathematically. It was really hard to get the model to
             | match the reality and vice-versa. So many details are
             | hidden in a library, framework or network adapter somewhere
             | (e.g. buffers or things like packet fragmentation).
             | 
             | We used the book "The Art of Computer Systems Performance
             | Analysis" (R. Jain), but I don't recommend it. At least not
             | the 1st edition which had a frustrating amount of serious,
             | experiment-ruining errata.
        
             | dclowd9901 wrote:
             | Think of other extremely complex systems and how we've
             | managed to make them stable:
             | 
             | 1) airplanes: they crashed, _a lot_. We used data recorders
             | and stringent processes to make air travel safety
             | commonplace.
             | 
             | 2) cars: so many accidents drove accident research. The
             | solution comes after the disaster.
             | 
             | 3) large buildings and structures: again, the master work
             | of time, attempts, failures, research and solutions.
             | 
             | If we really want to get serious about this (and I think we
             | do) we need to stop reinventing infrastructure every 10
             | years and start doubling down on stability. Cloud
             | computing, in earnest, has only been around a short while.
             | I'm not even convinced it's the right path forward, just
             | happens to align best with business interests, but it seems
             | to be the devil we're stuck with so now we need to really
             | dig in and make it solid. I think we're actually in that
             | process right now.
        
           | vmception wrote:
           | > Most of them are designs based on nothing but uninformed
           | intuition.
           | 
           | Or because they read it on a Google|AWS Engineering blog
        
             | vmception wrote:
             | Or they regurgitated a bullshit answer from a system design
             | prep course while pretending to think of it on the spot
             | just to get hired
        
       | eigen-vector wrote:
       | Exceeded character limit on the title so I couldn't include this
       | detail there, but this is the post-mortem of the event on
       | December 7, 2021.
        
       | markus_zhang wrote:
       | >At 7:30 AM PST, an automated activity to scale capacity of one
       | of the AWS services hosted in the main AWS network triggered an
       | unexpected behavior from a large number of clients inside the
       | internal network.
       | 
       | Just curious, is this scaling an AWS job or a client job? Looks
       | like an AWS one from the context. I'm wondering if they are
       | deploying additional data centers or something else?
        
       | AtlasBarfed wrote:
       | "At 7:30 AM PST, an automated activity to scale capacity of one
       | of the AWS services hosted in the main AWS network triggered an
       | unexpected behavior from a large number of clients inside the
       | internal network. This resulted in a large surge of connection
       | activity that overwhelmed the networking devices between the
       | internal network and the main AWS network, resulting in delays
       | for communication between these networks. These delays increased
       | latency and errors for services communicating between these
       | networks, resulting in even more connection attempts and
       | retries."
       | 
       | So was this in service to something like DynamoDB or some other
       | service?
       | 
       | As in, did some of those extra services that AWS offers for
       | lock-in (and that undermine open source projects with embrace and
       | extend) bomb the mainline EC2 service?
       | 
       | Because this kind of smacks of "Microsoft Hidden APIs" that
       | Office got to use against other competitors. Does AWS use
       | "special hardware capabilities" to compete against other
       | companies offering roughly the same service?
        
         | nijave wrote:
         | Yes and other cloud providers (Google, Microsoft) probably have
         | similar. Besides special network equipment, they use PCIe
         | accelerator/coprocessors on their hypervisors to offload all
         | non-VM activity (Nitro instances)
         | 
         | They also recently announced Graviton ARM CPUs
        
       | cyounkins wrote:
       | My favorite sentence: "Our networking clients have well tested
       | request back-off behaviors that are designed to allow our systems
       | to recover from these sorts of congestion events, but, a latent
       | issue prevented these clients from adequately backing off during
       | this event."
        
         | discodave wrote:
         | I saw pleeeeeenty of untested code at Amazon/AWS. Looking back
         | it was almost like the most important services/code had the
         | least amount of testing. While internal boondoggle projects (I
         | worked on a couple) had complicated test plans and debates
         | about coverage metrics.
        
           | DrBenCarson wrote:
           | This is almost always the case.
           | 
           | The most important services get the most attention from
           | leaders who apply the most pressure, especially in the first
           | ~2y of a fast-growing or high-potential product. So people
           | skip tests.
        
             | foobiekr wrote:
             | In reality, most successful real-world projects are
             | mostly untested because that's not actually a high-ROI
             | endeavor. it kills me to realize that mediocre code you can
             | hack all over to do unnatural things is generally higher
             | value in phase I than the same code done well in twice the
             | time.
        
               | jessermeyer wrote:
               | This attitude is why modern software is a continuing
               | controlled flight into terrain.
        
               | topspin wrote:
               | Finger wagging has also failed to produce any solutions.
        
               | virtue3 wrote:
               | I think the pendulum swinging back is going to mean
               | designing code that is harder to make bad.
               | 
               | Typescript is a good example of trying to fix this. Rust
               | is even better.
               | 
               | Deno, I think, takes things in a better direction as
               | well.
               | 
               | Ultimately we're going to need systems that just don't
               | let you do "unnatural" things but still maintain a great
               | deal of forward mobility. I don't think that's an
               | unreasonable ask of the future.
        
           | dastbe wrote:
           | my take is that the overwhelming majority of services
           | insufficiently invest in making testing easy. the services
           | that need to grow fast due to customer demand skip the tests
           | while the services that aren't going much of anywhere spend
           | way too much time on testing.
        
           | wbsun wrote:
           | Interesting. I also work for a cloud provider. My team works
           | on both internal infrastructure and product features.
           | We take testing coverage very seriously and tie the metrics
           | to the team's perf. Any product feature must have unit tests,
           | integration tests at each layer of the stack, staging test,
           | production test and continuous probers in production. But our
           | reliability is still far from satisfactory. Now with your
           | observation at AWS, I'm starting to wonder whether the
           | coverage effort and the different types of tests really help
           | or not...
        
           | tootie wrote:
           | It's gotta be a whole thing to even think about how to
           | accurately test this kind of software. Simulating all kinds
           | of hardware failures, network partitions, power failures, or
           | the thousand other failure modes.
           | 
           | Then again they get like $100B in revenue that should buy
           | some decent unit tests.
        
         | a-dub wrote:
         | this caught my eye as well. i'd wager that it was not
         | configured.
        
         | dilap wrote:
         | Oh you know, an editing error -- they accidentally dropped the
         | word "not".
        
         | foobiekr wrote:
         | thundering herd and accidental synchronization for the win
         | 
         | I am sad to say, I find issues like this any time I look at
         | retry logic written by anyone I have not interacted with
         | previously on the topic. It is shockingly common even in
         | companies where networking is their bread and butter.
        
           | cyounkins wrote:
           | It absolutely is difficult. A challenge I have seen is when
           | retries are stacked and callers time out subprocesses that
           | are doing retries.
           | 
           | I just find it amusing that they describe their back-off
           | behaviors as "well tested" and in the same sentence, say it
           | didn't back off adequately.
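           | 
           | One thing that helps with the stacking is threading a single
           | deadline down the call chain, so inner retry loops can't
           | outlive the caller's budget. A rough sketch (names invented):
           | 
           |     import java.time.Duration;
           |     import java.time.Instant;
           | 
           |     public class DeadlineRetry {
           |       static String fetch(Instant deadline) throws Exception {
           |         long delayMs = 50;
           |         while (true) {
           |           Duration left =
           |               Duration.between(Instant.now(), deadline);
           |           if (left.isNegative() || left.isZero())
           |             throw new Exception("deadline exceeded");
           |           try {
           |             return doCall(left);  // callee gets what's left
           |           } catch (Exception e) {
           |             Thread.sleep(Math.min(delayMs, left.toMillis()));
           |             delayMs *= 2;
           |           }
           |         }
           |       }
           | 
           |       // stand-in for a real RPC bounded by 'budget'
           |       static String doCall(Duration budget) throws Exception {
           |         throw new Exception("simulated failure");
           |       }
           |     }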
        
             | eyelidlessness wrote:
             | > It absolutely is difficult. A challenge I have seen is
             | when retries are stacked and callers time out subprocesses
             | that are doing retries.
             | 
             | This is also a general problem with (presumably stateless)
             | concurrent/distributed systems, one which irked me when
             | working on such a system and for which I still haven't
             | found meaningful resources that aren't extremely
             | platform/stack/implementation specific:
             | 
             | A concurrent system has some global/network-
             | wide/partitioned-subset-wide error or backoff condition. If
             | that system is actually stateless and receives push work,
             | communicating that state to them either means pushing the
             | state management back to a less concurrent orchestrator to
             | reprioritize (introducing a huge bottleneck/single or
             | fragile point of failure) or accepting that a lot of failed
             | work will be processed in pathological ways.
        
         | lanstin wrote:
         | I found myself wishing for a few code snippets here. It would
       | be interesting. A lot of the time, code that handles "connection
       | refused" or fast failures doesn't handle network slowness well.
         | I've seen outages from "best effort" services (and the best-
         | effort-ness worked when the services were hard down) because
         | all of a sudden calls that were taking 50 ms were not failing
         | but all taking 1500+ ms. Best effort but no client enforced
         | SLAs that were low enough to matter.
         | 
         | Load shedding never kicked in, so things had to be shutdown for
         | a bit and then restarted.
         | 
       | Seems their normal operating state might be what is called
       | "meta-stable" - dynamically stable at a high throughput (edited)
       | unless/until a brief glitch bumps the system into a low-work-
       | being-finished state, which is also stable.
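       | 
       | Since I was wishing for snippets, the kind of client-enforced
       | SLA I mean looks roughly like this (names and numbers are
       | invented): treat a best-effort call that has gone slow exactly
       | like a failed one.
       | 
       |     import java.util.concurrent.*;
       | 
       |     public class BestEffort {
       |       static final ExecutorService pool =
       |           Executors.newFixedThreadPool(4);
       | 
       |       // cut the call off at 'slaMs' so 1500 ms responses get
       |       // shed like hard failures instead of stalling the caller
       |       static String callOrDefault(Callable<String> call,
       |                                   long slaMs, String fallback) {
       |         Future<String> f = pool.submit(call);
       |         try {
       |           return f.get(slaMs, TimeUnit.MILLISECONDS);
       |         } catch (Exception e) {
       |           f.cancel(true);
       |           return fallback;
       |         }
       |       }
       | 
       |       public static void main(String[] args) {
       |         String v = callOrDefault(() -> {
       |           Thread.sleep(1500);  // best-effort call gone slow
       |           return "real value";
       |         }, 100, "default");
       |         System.out.println(v);  // prints "default"
       |         pool.shutdown();
       |       }
       |     }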
        
       | raffraffraff wrote:
       | Something they didn't mention is AWS Billing alarms. These rely
       | on metrics systems which were affected by this (and are missing
       | some data). Crucially, billing alarms only exist in the us-east-1
       | region, so if you're using them, you're impacted no matter where
       | your infrastructure is deployed. (That's just my reading of it)
        
       ___________________________________________________________________
       (page generated 2021-12-11 23:01 UTC)