[HN Gopher] AWS us-east-1 outage
       ___________________________________________________________________
        
       AWS us-east-1 outage
        
       Author : judge2020
       Score  : 1361 points
       Date   : 2021-12-07 15:42 UTC (7 hours ago)
        
 (HTM) web link (status.aws.amazon.com)
 (TXT) w3m dump (status.aws.amazon.com)
        
       | sbr464 wrote:
       | booker.com/mindbody random areas are affected
        
       | tonyhb wrote:
       | Our services that are in us-east-2 are up, but I'm wondering how
       | long that will hold true.
        
       | technics256 wrote:
       | our EKS/EC2 instances are OK
        
       | AH4oFVbPT4f8 wrote:
       | Unable to log into the console for us-east-1 for me too
        
       | dimitar wrote:
       | https://status.hashicorp.com/incidents/3qc302y4whqr - seems to
       | have affected Hashicorp Cloud too
        
       | singlow wrote:
       | I am able to access the console for us-west-2 by going to a
       | region specific URL: https://us-west-2.console.aws.amazon.com/
       | 
        | It does take me to a non-region-specific login page, which is up,
        | and then redirects back to us-west-2, which works.
       | 
       | If I go to my bookmarked login page it breaks because,
       | presumably, it hits something that is broken in us-east-1.
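        | 
        | A tiny sketch of that workaround (the region name is just an
        | example - pick any region whose console is still healthy):
        | 
        |     # Open a region-specific console URL directly, skipping
        |     # the global landing page that lives in us-east-1.
        |     import webbrowser
        | 
        |     region = "us-west-2"  # any unaffected region
        |     url = f"https://{region}.console.aws.amazon.com/"
        |     webbrowser.open(url)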
        
       | CoachRufus87 wrote:
       | Does there exist a resource that tracks outages on a per-region
       | basis over time?
        
       | soheil wrote:
       | Can confirm I can once again login to the Console and everything
       | seems to be back to normal in us-east-2.
        
       | crescentfresh wrote:
       | The majority of our errors stem from:
       | 
       | - writing to Firehose (S3-backed)
       | 
       | - publishing to eventbridge
       | 
       | - terraform commands to ECS' API are stuck/hanging
       | 
        | Other spurious errors involving Kinesis, but nothing alarming.
        | This is all in us-east-1.
        
       | hiyer wrote:
       | Payments in Amazon India is down - likely because of this.
        
       | dalrympm wrote:
       | Contrary to what the status page says, CodePipeline is not
       | working. Hitting the CLI I can start pipelines but they never
       | complete and I get a lot of:
       | 
       | Connection was closed before we received a valid response from
       | endpoint URL: "https://codepipeline.us-east-1.amazonaws.com/".
        
         | alexatalktome wrote:
         | Rumor is that our internal pipelines are the root cause. The
         | CICD pipelines (not tests, the literal pipeline infrastructure)
         | failed to block certain commits and pushed them to production
         | when not ready.
         | 
         | We've been told to manually disable them to ensure integrity of
         | our services when it recovers
        
       | AzzieElbab wrote:
       | Betting on dynamo again
        
       | markus_zhang wrote:
        | Just curious: does it still make sense to claim that uptime is
        | some number of nines (e.g. 99.999%)?
        
         | throwanem wrote:
         | Yep. In this case, zero nines.
        
       | [deleted]
        
       | biohax2015 wrote:
       | Getting 502 in Parameter Store. Cloudformation isn't returning
       | either -- and that's how we deploy code :(
        
       | rickreynoldssf wrote:
       | EC2 at least seems fine but the console is definitely busted as
        | of 16:00 UTC.
        
         | Bedon292 wrote:
         | I am still getting 'Invalid region parameter' for resources in
         | us-east-1, the others are fine.
        
       | [deleted]
        
       | SubiculumCode wrote:
       | Must be why I can't seem to access my amazon account. I thought
       | my account had gotten compromised.
        
       | imstil3earning wrote:
        | can't scale our Kubernetes cluster due to 500s from ECR :(
        
       | jonnylangefeld wrote:
       | Does anyone know why Google is showing the same spike on down
       | detector as everything else? How does Google depend on AWS?
       | https://downdetector.com/status/google/
        
         | NobodyNada wrote:
         | It's because Down Detector works off of user reports rather
         | than automatically detecting outages somehow. So, every time a
         | major service goes down (whether infrastructure like AWS or
         | Cloudflare, or user-facing like YouTube or Facebook), some
         | users will blame Google, ISPs, cellular providers, or some
         | other unrelated service.
        
         | gmm1990 wrote:
          | Some Google Sheets functions aren't updating in a timely manner
          | for me. Maybe people use Google as a backup for AWS and it has
          | to throttle certain services under the higher load.
        
       | joelbondurant wrote:
       | They should put everything in the cloud so hardware issues can't
       | happen.
        
       | megakid wrote:
       | I live in London and I can't launch my Roomba vacuum to clean up
       | after dinner because of this. Hurry up AWS, fix it!
        
       | jrochkind1 wrote:
        | This is affecting Heroku.
       | 
       | While my heroku apps are currently up, I am unable to push new
       | versions.
       | 
       | Logging in to heroku dashboard (which does work), there is a
       | message pointing to this heroku status incident for "Availability
       | issues with upstream provider in the US region":
       | https://status.heroku.com/incidents/2390
       | 
        | How can there be an outage severe enough to be affecting
        | middleman customers like Heroku, but the AWS status page is still
        | all green?!?!
        | 
        | If whoever runs the AWS status page isn't embarrassed, they really
        | ought to be.
        
         | VWWHFSfQ wrote:
         | AWS management APIs in the us-east-1 region is what is
         | affected. I'm guessing Heroku uses at least the S3 APIs when
         | deploying new versions, and those are failing
         | (intermittently?).
         | 
         | I advise not touching your Heroku setup right now. Even
         | something like trying to restart a dyno might mean it doesn't
         | come back since the slug is probably stored on S3 and that will
         | fail.
        
       | valeness wrote:
       | This is more than just east. I am seeing the same error on us-
       | west-2 resources.
        
         | singlow wrote:
         | I am seeing some issues but only with services that have global
         | aspects such as s3. I can't create an s3 bucket even though I
         | am in us-west-2 because I think the names are globally unique
         | and creating them depends on us-east-1.
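          | 
          | Roughly what that call looks like with boto3 (the bucket name is
          | a placeholder); even with the target region pinned like this,
          | the globally unique bucket namespace seems to give bucket
          | creation a control-plane dependency on us-east-1:
          | 
          |     import boto3
          | 
          |     s3 = boto3.client("s3", region_name="us-west-2")
          |     s3.create_bucket(
          |         Bucket="my-example-bucket",  # globally unique name
          |         CreateBucketConfiguration={
          |             "LocationConstraint": "us-west-2",
          |         },
          |     )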
        
       | herodoturtle wrote:
       | Can't access Lightsail console even though our instances are in a
       | totally different Region.
        
       | avsteele wrote:
       | Getting strange errors trying to manage my amazon account right
       | now, could this be related?
       | 
       | 494 ERROR and "We're sorry Something went wrong with our website,
       | please try again later."
        
       | ricardobayes wrote:
       | This was funny at first but now I can't even play Elder Scrolls
       | Online :(
        
       | numberwhun wrote:
       | Amazon is having an outage in us-east-1, and it is bleeding over
       | elsewhere, like eu: https://status.aws.amazon.com/
        
         | cyanydeez wrote:
         | Is that fail-failover?
        
           | hvgk wrote:
            | I think it's a factorial fail.
        
             | the-dude wrote:
             | It is failures all the way down.
        
               | hvgk wrote:
               | A disc of failures resting on elephants of failures
               | resting on a giant turtle of failures? Or are you more a
               | turtles of failures all the way down sort of person?
        
         | sharpy wrote:
         | Those 2 services that are being marked as having problems in
         | other regions have fairly hard dependency on us-east-1. So that
         | would be why.
        
       | blueside wrote:
        | I was in the process of buying some tickets on
       | Ticketmaster and the entire presale event had to be postponed for
       | at least 4 hours due to this AWS outage.
       | 
        | I'm not complaining, I enjoyed the nostalgia - sometimes the web
       | still feels like the late 90s
        
         | kingcharles wrote:
         | I'm complaining. I could not get my Happy Meal at McDonald's
         | this morning. Bad start to the day.
        
       | _moof wrote:
       | This is your regularly scheduled reminder:
       | https://www.whoownsmyavailability.com/
        
       | iso1210 wrote:
       | My raspberry pi is still working just fine
        
         | marginalia_nu wrote:
         | Yeah, strange, my self-hosted server isn't affected either.
        
           | iso1210 wrote:
           | Seems "the cloud" had a major outage less than a month ago,
           | my laptop has a higher uptime.
           | 
           | $ 16:04 up 46 days, 7:02, 9 users, load averages: 3.68 3.56
           | 3.18
           | 
           | US East 1 was down just over a year ago
           | 
           | https://www.theregister.com/2020/11/25/aws_down/
           | 
           | Meanwhile I moved one of my two internal DNS servers to a
           | second site on 11 Nov 2020, and it's been up since then. One
           | of my monitoring machines has been filling, rotating and
           | deleting logs for 1,712 days with a load average in the c. 40
           | range for that whole time, just works.
           | 
           | If only there was a way to run stuff with an uptime of 364
           | days a year without using the cloud /s
        
             | nfriedly wrote:
              | I think the point of the cloud isn't increased uptime - the
              | point is that when it's down, bringing it back up is
              | _someone else's problem_.
             | 
             | (Also, OpEx vs CapEx financial shenanigans...)
             | 
             | All the same, I don't disagree with your point.
        
               | iso1210 wrote:
                | > the point is that when it's down, bringing it back up
                | is someone else's problem.
               | 
               | When it's down, it's my problem, and I can't do anything
               | about it other than explain why I have no idea the system
               | is broken and can't do anything about it.
               | 
               | "Why is my dohicky down? When will it be back?"
               | 
               | "Because it's raining, no idea"
               | 
               | May be accurate, it's also of no use.
               | 
                | But yes, OpEx vs CapEx - of course, that's also why you
                | can lease your servers. It's far easier to spend another
                | $500 a month of company money on AWS than $500 a year on
                | a new machine.
        
             | debaserab2 wrote:
              | So does my toaster, oven, and microwave. So what? They get
              | used a few times a day, but my production-level equipment
              | serves millions in an hour.
        
               | iso1210 wrote:
               | My lightswitch is used twice a day, yet it works every
               | time. In the old days it would occasionally break (bulb
               | goes), I would be empowered to fix it myself (change the
               | bulb).
               | 
               | In the cloud you're at the mercy of someone who doesn't
               | even know you exist to fix it, without the protections
               | that say an electric company has with supplying domestic
               | users.
               | 
               | This thread has people unable to turn their lights on[0],
               | it's hilarious how people tie their stuff to dependencies
               | that aren't needed, with a history of constant failure.
               | 
               | If you want to host millions of people, then presumably
               | your infrastructure can cope with the loss of a single AZ
               | (and ideally the loss of Amazon as a whole). The vast
               | majority of people will be far better off without their
               | critical infrastructure going down in the middle of the
               | day in the busiest sales season going.
               | 
               | [0] https://news.ycombinator.com/item?id=29475499
        
         | jaywalk wrote:
         | Cool. Now let's have a race to see who can triple their
         | capacity the fastest. (Note: I don't use AWS, so I can actually
         | do it)
        
           | iso1210 wrote:
           | Why would I want to triple my capacity?
           | 
           | Most people don't need to scale to a billion users overnight.
        
             | jaywalk wrote:
             | Many B2B-type applications have a lot of usage during the
             | workday and minimal usage outside of it. No reason to keep
             | all that capacity running 24/7 when you only need most of
             | it for ~8 hours per weekday. The cloud is perfect for that
             | use case.
        
               | dijit wrote:
               | idk man, idle hardware doesn't use all that much power.
               | 
               | https://www.thomas-krenn.com/en/wiki/Processor_P-
               | states_and_...
               | 
               | Which is an implementation of:
               | 
               | https://web.eecs.umich.edu/~twenisch/papers/asplos09.pdf
        
               | iso1210 wrote:
               | Is it really? How much does that scaling actually cost?
               | 
               | And what's a workday anyway, surely you operate globally?
        
               | jaywalk wrote:
               | Scaling itself costs nothing, but saves money because
               | you're not paying for unused capacity.
               | 
               | The main application I run operates in 7 countries
               | globally, but the US is the only one that has enough
               | usage to require additional capacity during the workday.
               | So out of 720 hours in a 30 day month, cloud scaling
               | allows me to pay for additional capacity for only the
               | (roughly) 160 hours that it's actually needed. It's a
               | _significant_ cost saver.
               | 
               | And because the scaling is based on actual metrics, it
               | won't scale up on a holiday when nobody is using the
               | application. More cost savings.
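                | 
                | A back-of-the-envelope version of that saving (the 720
                | and 160 hours are from above; every other number below is
                | made up):
                | 
                |     HOURS_IN_MONTH = 720
                |     BURST_HOURS = 160   # ~8h/weekday of extra load
                |     BASE = 2            # always-on instances
                |     BURST = 4           # extra workday instances
                |     RATE = 0.10         # $/instance-hour
                | 
                |     always_on = (BASE + BURST) * HOURS_IN_MONTH * RATE
                |     scaled = (BASE * HOURS_IN_MONTH
                |               + BURST * BURST_HOURS) * RATE
                |     print(always_on, scaled)
                |     # -> 432.0 vs 208.0: roughly half the spend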
        
               | vp8989 wrote:
               | You are (conveniently or not) incorrectly assuming that
               | the unit price of provisioned vs on-demand capacity is
               | the same. It's not.
        
               | jaywalk wrote:
               | Nice of you to assume that I don't understand the pricing
               | of the services I use. I can assure you that I do, and I
               | can also assure you that there is no such thing as
               | provisioned vs on-demand pricing for Azure App Service
               | until you get into the higher tiers. And even in those
               | higher tiers, it's cheaper _for me_ to use on-demand
               | capacity.
               | 
               | Obviously what I'm saying will not apply to all use
               | cases, but I'm only talking about mine.
        
               | [deleted]
        
       | joelhaasnoot wrote:
       | Amazon.com is also throwing errors left and right
        
       | endisneigh wrote:
       | Azure, Google Cloud, AWS and others need to have a "Status
       | alliance" where they determine the status of each of their
       | services by a quorum using all cloud providers.
       | 
       | Status pages are virtually useless these days
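        | 
        | A toy sketch of what such a quorum could look like (everything
        | below is hypothetical, not any provider's actual API):
        | 
        |     from collections import Counter
        | 
        |     def quorum_status(votes):
        |         # votes: "up"/"down" reports from independent probes
        |         # run by the different providers in the alliance
        |         status, n = Counter(votes).most_common(1)[0]
        |         return status if n > len(votes) // 2 else "unknown"
        | 
        |     print(quorum_status(["down", "down", "up"]))  # -> down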
        
         | LinuxBender wrote:
          | Or just modify sites like DownDetector to show who is hosting
          | each site. When {n} sites hosted on {x} are down, one could
          | draw a conclusion. It won't be as detailed as "xyz services
          | failed", but rather show that the overall operational chain is
          | broken. There could be a graph that shows _99% of sites hosted
          | on Amazon US East 1 down_ - it would be hard to hide that. This
          | could also paint a picture of which companies are not active-
          | active multi-cloud.
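          | 
          | A rough sketch of that aggregation (the sites, hosts, and the
          | threshold are all made up):
          | 
          |     # site -> (hosting provider/region, reported down?)
          |     reports = {
          |         "site-a": ("aws-us-east-1", True),
          |         "site-b": ("aws-us-east-1", True),
          |         "site-c": ("gcp-us-central1", False),
          |     }
          | 
          |     down, total = {}, {}
          |     for host, is_down in reports.values():
          |         total[host] = total.get(host, 0) + 1
          |         down[host] = down.get(host, 0) + int(is_down)
          | 
          |     for host in total:
          |         if down[host] / total[host] > 0.9:
          |             print(host, "provider-level outage?")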
        
           | avaika wrote:
           | Cloud providers are not just about web hosting. There are
           | dozens of tools hidden from end users. E.g. services used for
           | background operations / maintenance (e.g. aws codecommit /
            | codebuild, or even the AWS web console like today). This kind
            | of outage won't bring down your web site, but it still might
            | break your normal workflow and even cost you some money.
        
         | DarthNebo wrote:
         | https://en.wikipedia.org/wiki/Mexican_standoff
        
           | beamatronic wrote:
           | It's not the prisoner's dilemma?
        
         | smt88 wrote:
         | They can do this without an alliance. They very intentionally
         | choose not to do it.
         | 
         | Every major company has moved away from having accurate status
         | pages.
        
           | arch-ninja wrote:
           | Steam has a great status page, companies like that and
            | Cloudflare will eat Alphabet's lunch in the next 17-18 years.
        
             | ExtraE wrote:
             | That's a tight time frame a long way off. How'd you arrive
             | at 17-18?
        
             | moolcool wrote:
             | Where do you anticipate Steam to compete with Alphabet?
        
           | dentemple wrote:
           | It's because none of these companies are held responsible for
           | missing their actual SLAs, as opposed to their self-reported
           | SLA compliance.
           | 
           | So unless regulation gets implemented that says otherwise,
           | there's zero incentive for any company to maintain an
           | accurate status page.
        
             | soheil wrote:
             | How did you find a way to bring regulations into this?
             | There are monitoring services you can pay for to keep an
             | eye on your SLAs and your vendors'.
             | 
              | If you're not happy with the results, switch.
        
               | ybloviator wrote:
               | Technically, there are already regulations. SLA lies are
               | fraud.
               | 
               | But I'm leery of any business who's so dishonest they
               | fear any outside oversight that brings repercussions for
               | said dishonesty.
               | 
               | "If not happy, switch" is silly - it's not the customer's
               | problem. And if you're a large customer and have invested
               | heavily in getting staff trained on AWS, you can't just
               | move.
        
               | soheil wrote:
                | A) Don't build a business that relies solely on the
                | existence of another. B) Switch to another vendor if
                | you're not happy with the current one.
               | 
               | Really not that complicated.
        
             | iso1210 wrote:
              | The only uptimes I'm concerned with are those of my own
              | services, which my own monitoring keeps on top of. This
              | varies: if the monitoring page goes down for 10 seconds I'm
              | not worried; if one leg of an SMPTE 2022-7 stream is down
              | for a second that's fine; if it keeps going down for a
              | second at a time, that's a concern; etc.
             | 
              | If something I'm responsible for goes down to the point
              | that my stakeholders are complaining (which means something
              | is seriously wrong), they are not going to be happy with
              | "oh the cloud was down, not my fault".
             | 
              | Whether AWS is down or not is meaningless to me; whether my
              | service running on AWS is down or not is the key metric.
             | 
             | If a service is down and I can't get into it, then chatter
             | on things like outages mailing list, or HN, will let me
             | know if it's yet another cloud failure, or if it's
             | something that's affecting my machine only.
        
             | cwkoss wrote:
              | I wonder if there could be a profitable play where an
             | organization monitors SLA compliance, and then produces a
             | batch of lawsuits or class action suit on behalf of all of
             | its members when the SLA is violated.
        
               | pm90 wrote:
               | This is a neat idea. Install a simple agent in all
               | customers' environments, select AWS dependencies, then
               | monitor uptime over time. Aggregate across customers, and
               | then go to AWS with this data.
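                | 
                | A minimal sketch of such an agent (the endpoint and the
                | SLA target below are placeholders); in practice it would
                | run on a schedule and persist every sample:
                | 
                |     import urllib.request
                | 
                |     def probe(url, timeout=5):
                |         # True if the dependency answered at all
                |         try:
                |             urllib.request.urlopen(url, timeout=timeout)
                |             return True
                |         except Exception:
                |             return False
                | 
                |     url = "https://healthcheck.example.com/ping"
                |     samples = [probe(url) for _ in range(10)]
                |     availability = sum(samples) / len(samples)
                |     print("SLA breach" if availability < 0.999
                |           else "within SLA")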
        
             | xtracto wrote:
             | >It's because none of these companies are held responsible
             | for missing their actual SLAs, as opposed to their self-
             | reported SLA compliance.
             | 
             | Right, there should be an "alliance" of customers from
              | different large providers (something like a union, but of
              | customers instead of workers). They are the
             | ones that should measure SLAs and hold the provider
             | accountable.
        
           | beamatronic wrote:
           | In a broader world sense, we live in the post-truth era.
        
       | andrew_ wrote:
        | This is also affecting Elastic Beanstalk and Elastic Container
       | Registry. We're getting 500s via API for both services.
        
       | torbTurret wrote:
       | Any leads on the cause? Some services on us-east-1 seem to be
       | working just fine. Others not.
        
       | sirfz wrote:
       | AWS ec2 API requests are failing and ECR also seems to be down
       | (error 500 on pull)
        
       | mohanmcgeek wrote:
       | Same happening right now with Goodreads
        
       | zahil wrote:
        | Is it just me, or are outages involving large websites becoming
        | more and more frequent?
        
       | amir734jj wrote:
        | My job (although only 50% of my time) at Azure is unit testing
        | and monitoring services under different scenarios and flows to
        | detect small failures that would be overlooked on the public
        | status page. Our tests
       | run multiple times daily and we have people constantly monitoring
       | logs. It concerns me when I see all AWS services are 100% green
       | when I know there is an outage.
        
         | jamesfmilne wrote:
         | Oh, sweet summer child.
         | 
         | The reason you care about your status page being 100% accurate
         | is that your stock price is not directly linked to your status
         | page.
        
         | joshocar wrote:
         | I don't know how accurate this information is, but I'm hearing
         | that the monitor can't be updated because the service is in the
         | region that is down.
        
           | Rapzid wrote:
           | Kinda hard to believe after they were blasted for that very
           | situation during/after the S3 outage way back.
           | 
           | If that's the case, it's 100% a feature. They want as little
           | public proof of an outage after it's over and to put the
           | burden on customers completely to prove they violated SLAs.
        
       | [deleted]
        
       | fnord77 wrote:
       | can't wait for the postmortem on this
        
       | [deleted]
        
       | andreyazimov wrote:
       | My payment processor Paddle also stopped working.
        
       | skapadia wrote:
       | I'm absolutely terrified by our reliance on cloud providers (and
       | modern technology, networks, satellites, electric grids, etc. in
       | general), but the cloud hasn't been around that long. It is
       | advancing extremely fast, and every problem makes it more
       | resilient.
        
         | skapadia wrote:
         | IMO, the cloud is not the overall problem. Our insatiable
         | desire to want things right HERE, right NOW is the core issue.
         | The cloud is just a solution to try to meet that demand.
        
         | SkyPuncher wrote:
         | Would you prefer to have more of something good that you can't
         | occasionally have OR less of something good that you can always
         | have?
         | 
         | ----
         | 
         | The answer likely depends on the specific thing, but I'd argue
         | most people would take the better version of something at the
         | risk of it not working 1 or 2 days per year.
        
       | l72 wrote:
       | Most of our services in us-east-1 are still responding although
       | we cannot log into the console. However, it looks like dynamodb
       | is timing out most requests for us.
        
       | john37386 wrote:
       | It seems a bit long to fix!
       | 
        | They probably painted themselves into a corner, just like
        | Facebook a few weeks ago.
        | 
        | This makes me think:
        | 
        | Could it be that one day the internet will have a total global
        | outage and it will take a few days to recover?
        
         | ericskiff wrote:
         | This actually happened back in 2012 or so. Major AWS outage
         | that took down big services all over the place, took a few days
         | for some services to come fully back online.
         | https://aws.amazon.com/message/680342/
        
         | devenvdev wrote:
         | The only possible scenario I could come up with is someone
          | crashing the internet on purpose, e.g. some crazy uncatchable
          | trojan that starts DDoSing everything. I doubt such a scenario
          | is feasible though...
        
         | rossdavidh wrote:
         | If we have a total global outage, Stack Overflow will be
         | unavailable, and the internet will never be fixed. :) Mostly
         | joking, I hope...
        
           | jaywalk wrote:
           | Some brave soul at Stack Overflow will have to physically go
           | into the datacenter, roll up a cart with a keyboard, monitor
           | and printer and start printing off a bunch of Networking
           | answers.
        
             | lesam wrote:
             | The StackOverflow datacenter is famously tiny - like, 10
             | Windows servers. So even if the rest of the internet goes
             | down hopefully they stay up. They might have to rebuild the
             | internet out from their server room though.
        
         | iso1210 wrote:
         | I'm not sure how you get a total global outage in a distributed
          | system. Let's say a major transit provider (CenturyLink, for
          | example) advertises "go via me" routes but then drops the
          | traffic; let's also assume it drops the costs of the routes to
          | pretty much zero. That would certainly have a major effect,
          | until their customers/peers stop peering with them.
         | 
         | That might be tricky if they are remote and not on the same AS
         | as their router access points and have no completely out of
         | band access, but you're still talking hours at most.
        
       | eoinboylan wrote:
       | Can't SSM into ec2 in us-east-1
        
       | johnsimer wrote:
       | Everything seems to be functioning normally for me now
        
       | ChrisArchitect wrote:
       | We have always been at war with us-east-1.
        
       | ComputerGuru wrote:
       | The AWS status page no longer loads for me. /facepalm
        
       | techthumb wrote:
       | https://status.aws.amazon.com hasn't been updated to reflect
        | the outage yet.
        
         | cyanydeez wrote:
         | Probably cached for good measure
        
         | hvgk wrote:
         | There's probably a lambda somewhere supposed to update it that
         | is screaming into the darkness at the moment.
         | 
         | According to an internal message I saw, their monitoring stuff
         | is fucked too.
        
       | cbtacy wrote:
       | It's funny but when I saw "AWS Outage" breaking, my first thought
       | was "I bet it's US-east-1 again."
       | 
       | I know it's cheap but seriously... not worth it. Many of us have
       | the scars to prove this.
        
       | wrren wrote:
       | Looks like their health check logic also sucks, just like mine.
        
       | whoknowswhat11 wrote:
       | Anyone understand why these services go down for so long?
       | 
       | That's the part I find interesting.
        
       | pixelmonkey wrote:
       | Looks like Kinesis Firehose is either the root cause, or severely
       | impacted:
       | 
       | https://twitter.com/amontalenti/status/1468265799458639877
       | 
       | Segment is publicly reporting issues delivering to Firehose, and
       | one of my company's real-time monitors also triggered for Kinesis
       | Firehose an hour ago.
       | 
       | Update:
       | 
       | By my sniff of it, some "core" APIs are down for S3 and EC2 (e.g.
       | GET/PUT on S3 and node create/delete on EC2). Systems like
       | Kinesis Firehose and DynamoDB rely on these APIs under the hood
       | ("serverless" is just "a server in someone else's data center").
       | 
       | Further update:
       | 
       | There is a workaround available for the AWS Console login issue.
       | You can use https://us-west-2.console.aws.amazon.com/ to get in
       | -- it's just the landing page that is down (because the landing
       | page is in the affected region).
        
       | muttantt wrote:
       | Running anything on us-east-1 is asking for trouble...
        
       | [deleted]
        
       | anovikov wrote:
        | Haha, my developer called me in a panic telling me that he
        | crashed Amazon - he was doing some load tests with Lambda.
        
         | imstil3earning wrote:
         | thats cute xD
        
         | Justsignedup wrote:
         | thank you for that big hearty laugh! :)
        
         | xtracto wrote:
         | How can you own a developer? is it expensive to buy one?
        
           | DataGata wrote:
            | Don't get nitpicky about saying "my X". People say "my plumber"
           | or "my hairstylist" or whatever all the time.
        
         | rossdavidh wrote:
         | If he actually knows how to crash Amazon, you have a new
         | business opportunity, albeit not a very nice one...
        
         | politelemon wrote:
         | It'd be hilarious if you kept that impression going for the
         | duration of the outage.
        
         | tgtweak wrote:
         | Postmortem: unbounded auto-scaling of lambda combined with
          | an oversight on internal rate limits caused an unforeseen
          | internal DDoS.
        
       | kuya11 wrote:
       | The blatant status page lies are getting absolutely ridiculous.
       | How many hours does a service need to be totally down until it
       | gets properly labelled as a "disruption"?
        
         | bearjaws wrote:
          | Yeah, we are seeing SQS, API Gateway (both websocket and non-
          | websocket), and S3 all completely unavailable. The status page
          | shows
         | nothing despite having gotten several updates.
        
       | nemothekid wrote:
       | Some sage advice I learned a while ago: "Avoid us-east-1 as much
       | as possible".
        
         | dr-detroit wrote:
         | If you need to be up all the time don't you use more than 1
         | region or do you need the ultra low ping for running 9-11
         | operator phone systems that calculate real time firing
         | solutions to snipe ICBMs out of low orbit?
        
         | soco wrote:
         | But if you use CloudFront, there you go.
        
       | [deleted]
        
       | alex_young wrote:
       | EDIT: As pointed out below, I missed that this was for the Amazon
       | Connect service, and not an update for all of US-EAST-1.
       | Preserved for consistency, but obviously just a comprehension
       | issue on my side.
       | 
       | At least the updates are amusing:
       | 
       | "9:18 AM PST We can confirm degraded Contact handling by agents
       | in the US-EAST-1 Region. Agents may experience issues logging in
       | or being connected with end-customers."
       | 
       | WTF is "contact handling", an "agent" or an "end-customer"?
       | 
       | How about something like "We are confirming that some users are
       | not able to connect to AWS services in us-east-1. We're looking
       | into it."
        
         | dastbe wrote:
         | that's an update for amazon connect, which is a customer
         | support related service.
        
         | detaro wrote:
         | Amazon Connect is a call-center product, so that report makes
         | sense.
        
       | adwww wrote:
       | Left a big terraform plan running while I put the kids to bed,
       | checked back now and Amazon is on fire.... was it me?!
        
       | mohanmcgeek wrote:
       | I don't think it's a console outage. Goodreads has been down for
       | a while
        
       | jrs235 wrote:
       | Search on Amazon.com seems to be broken too. This doesn't appear
       | to just be affecting their AWS revenue.
        
         | authed wrote:
          | I've had issues logging in at amazon.com too... and IMDb is also
         | down
        
       | bamboozled wrote:
        | I don't think AWS knows what's going on, judging by their
        | updates. Yes, DynamoDB might be having issues, but so is IAM, it
        | seems; we're getting errors terminating resources, for example.
        
       | jgworks wrote:
       | Someone posted this on our company slack:
       | https://stop.lying.cloud/
        
       | mbordenet wrote:
       | I suspect the ex-Amazonian PragmaticPulp cites was let go from
       | Amazon for a reason. The COE process works, provided the culture
       | is healthy and genuinely interested in fixing systemic problems.
       | Engineers who seek to deflect blame are toxic and unhelpful.
       | Don't hire them!
        
       | woshea901 wrote:
       | N. Virginia consistently has more problems than other zones. Is
       | it possible this zone is also hosting government computers/could
       | it be a more frequent target for this reason?
        
         | longhairedhippy wrote:
          | The real reason is that us-east-1 was the first and is by far
          | the biggest region, which is the same reason that new services
          | always launch there while other regions are not necessarily
          | required (some services have to launch in every region).
          | 
          | The us-east-1 region is consistently pushing the limits of
          | scale for the AWS services, thus it has way more problems than
          | other regions.
        
       | exabrial wrote:
       | Just a reminder that Colocation is always an option :)
        
       | lgylym wrote:
        | So re:Invent is over. Time to deploy.
        
       | filip5114 wrote:
        | Can confirm, us-east-1.
        
         | romanhotsiy wrote:
          | Can confirm Lambda is down in us-east-1. Other services seem
         | to work for us.
        
         | imnoscar wrote:
         | STS or console login not working either.
        
       | all_usernames wrote:
       | 25 Regions, 85 Availability Zones in this global cloud service
       | and I can't login because of a failure in a single region (their
       | oldest).
       | 
       | Can't login to AWS console at signin.aws.amazon.com:
       | Unable to execute HTTP request: sts.us-east-1.amazonaws.com.
       | Please try again.
        
       | ipmb wrote:
       | Looks like they've acknowledged it on the status page now.
       | https://status.aws.amazon.com/
       | 
       | > 8:22 AM PST We are investigating increased error rates for the
       | AWS Management Console.
       | 
       | > 8:26 AM PST We are experiencing API and console issues in the
       | US-EAST-1 Region. We have identified root cause and we are
       | actively working towards recovery. This issue is affecting the
       | global console landing page, which is also hosted in US-EAST-1.
       | Customers may be able to access region-specific consoles going to
       | https://console.aws.amazon.com/. So, to access the US-WEST-2
       | console, try https://us-west-2.console.aws.amazon.com/
        
         | jabiko wrote:
          | Yeah, but I still have a different understanding of what
         | "Increased Error Rates" means.
         | 
         | IMHO it should mean that the rate of errors is increased but
         | the service is still able to serve a substantial amount of
          | traffic. If the rate of errors is higher than, let's say, 90%,
          | that's not an increased error rate, that's an outage.
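          | 
          | A rough illustration of that distinction (the thresholds here
          | are arbitrary, just to make the point):
          | 
          |     def classify(errors, requests):
          |         rate = errors / requests
          |         if rate > 0.9:
          |             return "outage"
          |         if rate > 0.01:
          |             return "increased error rates"
          |         return "operating normally"
          | 
          |     print(classify(950, 1000))  # -> outage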
        
           | thallium205 wrote:
           | They say that to try and avoid SLA commitments.
        
             | jiggawatts wrote:
             | Some big customers should get together and make an
             | independent org to monitor cloud providers and force them
             | to meet their SLA guarantees without being able to weasel
             | out of the terms like this...
        
         | guenthert wrote:
         | Uh, four minutes to identify the root cause? Damn, those guys
         | are on fire.
        
           | czbond wrote:
           | :) I imagine it went like this theoretical Slack
           | conversation:
           | 
            | > Dev1: Pushing code for branch "master" to "AWS API".
            | 
            | > <slackbot> Your deploy finished in 4 minutes
            | 
            | > Dev2: I can't reach the API in east-1
            | 
            | > Dev1: Works from my computer
        
           | tonyhb wrote:
           | It was down as of 7:45am (we posted in our engineering
           | channel), so that's a good 40 minutes of public errors before
           | the root cause was figured out.
        
           | Frost1x wrote:
           | Identify or to publicly acknowledge? Chances are technical
           | teams knew about this and noticed it fairly quickly, they've
           | been working on the issue for some time. It probably wasn't
           | until they identified the root cause and had a handful of
           | strategies to mitigate with confidence that they chose to
           | publicly acknowledge the issue to save face.
           | 
           | I've broken things before and been aware of it, but didn't
           | acknowledge them until I was confident I could fix them. It
           | allows you to maintain an image of expertise to those outside
           | who care about the broken things but aren't savvy to what or
           | why it's broken. Meanwhile you spent hours, days, weeks
           | addressing the issue and suddenly pull a magic solution out
           | of your hat to look like someone impossible to replace.
           | Sometimes you can break and fix things without anyone even
            | knowing, which is very valuable if breaking something had some
           | real risk to you.
        
             | sirmarksalot wrote:
             | This sounds very self-blaming. Are you sure that's what's
             | really going through your head? Personally, when I get
             | avoidant like that, it's because of anticipation of the
             | amount of process-related pain I'm going to have to endure
             | as a result, and it's much easier to focus on a fix when
             | I'm not also trying to coordinate escalation policies that
             | I'm not familiar with.
        
           | flerchin wrote:
            | Outage started at 7:31 PST per our monitoring. They are on
           | fire, but not in a good way.
        
         | giorgioz wrote:
          | I'm trying to log in to the AWS Console from other regions but
          | I'm getting HTTP 500. Has anyone managed to log in from other
          | regions?
         | Which ones?
         | 
          | Our backend is failing; it's on us-east-1 using AWS Lambda, API
         | Gateway, S3
        
         | bobviolier wrote:
         | https://status.aws.amazon.com/ still shows all green for me
        
           | banana_giraffe wrote:
           | It's acting odd for me. Shows all green in Firefox, but shows
           | the error in Chrome even after some refreshes. Not sure
           | what's caching where to cause that.
        
         | dang wrote:
         | Ok, we've changed the URL to that from https://us-
         | east-1.console.aws.amazon.com/console/home since the latter is
         | still not responding.
         | 
         | There are also various media articles but I can't tell which
         | ones have significant new information beyond "outage".
        
         | jesboat wrote:
         | > This issue is affecting the global console landing page,
         | which is also hosted in US-EAST-1
         | 
         | Even this little tidbit is a bit of a wtf for me. Why do they
         | consider it ok to have _anything_ hosted in a single region?
         | 
         | At a different (unnamed) FAANG, we considered it unacceptable
         | to have anything depend on a single region. Even the dinky
         | little volunteer-run thing which ran
         | https://internal.site.example/~someEngineer was expected to be
         | multi-region, and was, because there was enough infrastructure
         | for making things multi-region that it was usually pretty easy.
        
           | alfiedotwtf wrote:
           | Maybe has something to do with CloudFront mandating certs to
           | be in us-east-1?
        
             | tekromancr wrote:
             | YES! Why do they do that? It's so weird. I will deploy a
             | whole config into us-west-1 or something; but then I need
             | to create a new cert in us-east-1 JUST to let cloudfront
             | answer an HTTPS call. So frustrating.
        
               | jamesfinlayson wrote:
               | Agreed - in my line of work regulators want everything in
               | the country we operate from but of course CloudFront has
               | to be different.
        
           | sheenobu wrote:
           | I think I know specifically what you are talking about. The
           | actual files an engineer could upload to populate their
           | folder was not multi-region for a long time. The servers
           | were, because they were stateless and that was easy to multi-
           | region, but the actual data wasn't until we replaced the
           | storage service.
        
           | ehsankia wrote:
           | Forget the number of regions. Monitoring for X shouldn't even
           | be hosted on X at all...
        
           | stevehawk wrote:
           | I don't know if that should surprise us. AWS hosted their
           | status page in S3 so it couldn't even reflect its own outage
           | properly ~5 years ago.
           | https://www.theregister.com/2017/03/01/aws_s3_outage/
        
           | tekromancr wrote:
           | I just want to serve 5 terabytes of data
        
             | mrep wrote:
             | Reference for those out of the loop:
             | https://news.ycombinator.com/item?id=29082014
        
             | [deleted]
        
           | all_usernames wrote:
           | Every damn Well-Architected Framework includes multi-AZ if
           | not multi-region redundancy, and yet the single access point
           | for their millions of customers is single-region. Facepalm in
           | the form of $100Ms in service credits.
        
             | cronix wrote:
             | > Facepalm in the form of $100Ms in service credits.
             | 
             | It was also greatly affecting Amazon.com itself. I kept
             | getting sporadic 404 pages and one was during a purchase.
             | Purchase history wasn't showing the product as purchased
              | and I didn't receive an email, so I repurchased. Still no
              | email; the purchase didn't end in a 404 this time, but the
              | product still didn't show up in my purchase history. I have
             | no idea if I purchased anything, or not. I have never had
             | an issue purchasing. Normally get a confirmation email
             | within 2 or so minutes and the sale is immediately
             | reflected in purchase history. I was unaware of the greater
             | problem at that moment or I would have steered clear at the
             | first 404.
        
               | jjoonathan wrote:
               | Oh no... I think you may be in for a rough time, because
               | I purchased something this morning and it only popped up
               | in my orders list a few minutes ago.
        
             | vkgfx wrote:
             | >Facepalm in the form of $100Ms in service credits.
             | 
             | Part of me wonders how much they're actually going to pay
             | out, given that their own status page has only indicated
              | _five_ services with moderate ("Increased API Error
              | Rates") disruptions in service.
        
           | ithkuil wrote:
           | One region? I forgot how to count that low
        
         | stephenr wrote:
         | When I brought up the status page (because we're seeing
         | failures trying to use Amazon Pay) it had EC2 and Mgmt Console
         | with issues.
         | 
         | I opened it again just now (maybe 10 minutes later) and it now
         | shows DynamoDB has issues.
         | 
         | If past incidents are anything to go by, it's going to get
         | worse before it gets better. Rube Goldberg machines aren't
         | known for their resilience to internal faults.
        
         | jeremyjh wrote:
         | They are still lying about it, the issues are not only
         | affecting the console but also AWS operations such as S3 puts.
         | S3 still shows green.
        
           | packetslave wrote:
           | IAM is a "global" service for AWS, where "global" means "it
           | lives in us-east-1".
           | 
           | STS at least has recently started supporting regional
           | endpoints, but most things involving users, groups, roles,
           | and authentication are completely dependent on us-east-1.
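            | 
            | For what it's worth, you can point an SDK at a regional STS
            | endpoint explicitly. A rough boto3 sketch (the region is
            | just an example; recent SDKs also have an
            | AWS_STS_REGIONAL_ENDPOINTS setting that does roughly the
            | same thing, if I recall correctly):
            | 
            |     import boto3
            | 
            |     sts = boto3.client(
            |         "sts",
            |         region_name="us-west-2",
            |         endpoint_url="https://sts.us-west-2.amazonaws.com",
            |     )
            |     print(sts.get_caller_identity()["Arn"])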
        
           | lsaferite wrote:
           | It's certainly affecting a wider range of stuff from what
           | I've seen. I'm personally having issues with API Gateway,
           | CloudFormation, S3, and SQS
        
             | pbalau wrote:
             | > We are experiencing _API_ and console issues in the US-
             | EAST-1 Region
        
               | jeremyjh wrote:
               | I read it as console APIs. Each service API has its own
               | indicator, and they are all green.
        
             | midasuni wrote:
             | Our corporate ForgeRock 2FA service is apparently broken.
             | My services are behind distributed x509 certs so no
             | problems there.
        
           | Rantenki wrote:
           | Yep, I am seeing failures on IAM as well:
            | 
            |     $ aws iam list-policies
            | 
            |     An error occurred (503) when calling the ListPolicies
            |     operation (reached max retries: 2): Service Unavailable
        
             | silverlyra wrote:
             | Same here. Kubernetes pods running in EKS are
             | (intermittently) failing to get IAM credentials via the
             | ServiceAccount integration.
        
       | alde wrote:
       | S3 bucket creation is failing across all regions for us, so this
       | isn't an us-east-1 only issue.
        
       | kp195_ wrote:
       | The AWS console seems kind of broken for us in us-west-1
       | (Northern California), but it seems like the actual services are
       | working
        
       | jacobkg wrote:
       | AWS Connect is down, so our customer support phone system is down
       | with it
        
         | nostrebored wrote:
         | Highly recommend talking to your account team to recommend
         | regional failovers and DR for Amazon Connect! With enough
         | feedback from customers, stuff like this can get prioritized.
        
           | jacobkg wrote:
           | Thanks, will definitely do that!
        
       | misthop wrote:
       | Getting 500 errors across the board on systems trying to hit s3
        
         | abaldwin99 wrote:
         | Ditto
        
       | booleanbetrayal wrote:
       | We're seeing load balancer failure in us-east-1 AZ's, so we are
       | not exactly sure why this is being characterized as Console
       | outage ...
        
       | [deleted]
        
       | whalesalad wrote:
       | I can still hit EC2 boxes and networking is okay. DynamoDB is
       | 100% down for the count, every request is an Internal Server
       | Error.
        
         | l72 wrote:
         | We are also seeing lots of failures with DynamoDB across all
         | our services in us-east-1.
        
         | snewman wrote:
         | DynamoDB is fine for us. Not contradicting your experience,
         | just adding another data point. There is definitely something
         | hit-or-miss about this incident.
        
       | jonnycomputer wrote:
       | Ah, this might explain why my AWS requests were so slow, or
       | timing out, this afternoon.
        
       | sayed2020 wrote:
       | Need google voice unlimited,
        
       | artembugara wrote:
       | I think there should be some third-party status checker alliance.
       | 
       | It's a joke. Each time AWS/Azure/GCP is down their status page
       | says all is fine.
        
         | cphoover wrote:
         | Want to build a startup?
        
           | artembugara wrote:
           | already running one.
        
       | _of wrote:
       | ...and imdb
        
         | soco wrote:
         | And Netflix.
        
           | MR4D wrote:
           | And Venmo, and McDonald's, and....
           | 
           | This one is pretty epic (pun intended). Bad enough that Down
           | Detector [0] shows " _Reports indicate there may be a
           | widespread outage at Amazon Web Services, which may be
           | impacting your service._ " in a red alert bar at the top.
           | 
           | [0] - https://downdetector.com/
        
           | mrguyorama wrote:
           | It occurs to me that it's very nice that Netflix, Youtube,
           | and other streaming services tend to be on separate
           | infrastructure so they don't all go down at once
        
           | ec109685 wrote:
           | I'm surprised Netflix is down. They are multi-region:
           | https://netflixtechblog.com/active-active-for-multi-
           | regional...
        
       | synergy20 wrote:
       | No wonder I could not read books from amazon all of a sudden,
       | what about their cloud-based redundancy design?
        
         | doesnotexist wrote:
         | Can also confirm that the kindle app is failing for me and has
         | been for the past few hours.
        
         | tgtweak wrote:
         | The book preview webservice or actual ebooks (kindle, etc)?
        
           | doesnotexist wrote:
           | For me, it's been that I am unable to download books to the
           | kindle app on my computer
        
       | hvgk wrote:
       | Well I got to bugger off home early so good job Amazon.
       | 
       | Edit: to be clear this is because I'm utterly helplessly unable
       | to do anything at the moment.
        
         | AnIdiotOnTheNet wrote:
         | Yep, that's a consideration of going with cloud tech: if
         | something goes wrong you're often powerless. At least with on-
         | prem you know who to wake up in the middle of the night and
         | you'll get straight-forward answers about what's going on.
        
           | hvgk wrote:
           | Depends which provider you host your crap with. I've had real
           | trouble trying to get a top tier incident even acknowledged
           | by one of the pre cloud providers.
           | 
           | To be fair when it's AWS when something goes snap it's not my
           | problem which I'm happy about (until some wise ass at AWS
           | hires me) :)
        
             | AnIdiotOnTheNet wrote:
             | > Depends which provider you host your crap with.
             | 
             | That's what I'm saying: you host it yourself in facilities
             | owned by your company if you're not willing to have
             | everyone twiddle their thumbs during this sort of event.
             | Your DR environment can be co-located or hosted elsewhere.
        
       | temuze wrote:
       | Friends tell friends to pick us-east-2.
       | 
       | Virginia is for lovers, Ohio is for availability.
        
         | blahyawnblah wrote:
          | Lots of services are only in us-east-1. The SSO system isn't
         | working 100% right now so that's where I assume it's hosted.
        
           | skwirl wrote:
           | Yeah, there are "global" services which are actually secretly
           | us-east-1 services as that is the region they use for
           | internal data storage and orchestration. I can't launch
           | instances with OpsWorks (not a very widely used service, I'd
           | imagine) even if those instances are in stacks outside of us-
           | east-1. I suspect Route53 and CloudFront will also have
           | issues.
        
           | johnsimer wrote:
           | Yeah I can't log in with our external SAML SSO to our AWS
           | dashboard to manage our us-east-2 resources. . . . Because
           | our auth is apparently routed thru us-east-1 STS
        
           | jhugo wrote:
           | You can pick the region for SSO -- or even use multiple. Ours
           | is in ap-southeast-1 and working fine -- but then the console
           | that it signs us into is only partially working presumably
           | due to dependencies on us-east-1.
        
         | bithavoc wrote:
         | Sometimes you can't avoid us-east-1; an example is AWS ECR
         | Public. It's a shame. Meanwhile, DockerHub is up and running
          | even though it's hosted on EC2 itself.
        
         | vrocmod wrote:
         | No one was in the room where it happened
        
         | mountainofdeath wrote:
         | us-east-1 is a cursed region. It's old, full of one-off patches
          | to keep it working, and it tends to be the first big region
          | that new releases roll out to.
        
         | more_corn wrote:
         | This is funny, but true. I've been avoiding us-east-1 simply
          | because that's where everyone else is. Spot instances are also
         | less likely to be expensive in less utilized regions.
        
         | politician wrote:
         | Can I get that on a license plate?
        
         | kavok wrote:
         | Didn't us-east-2 have an issue last week?
        
         | stephenr wrote:
         | Friends tell Friends not to use Rube Goldberg machines as their
         | infrastructure layer.
        
         | johnl1479 wrote:
         | This is also a clever play on the Hawthorne Heights song.
        
         | PopeUrbanX wrote:
         | I wonder why AWS has Ohio and Virginia but no region in the
         | northeast where a significant plurality of customers using east
         | regions probably live.
        
         | api wrote:
         | I live in Ohio and can confirm. If the Earth were destroyed by
         | an asteroid Ohio would be left floating out there somehow
         | holding onto an atmosphere for about ten years.
        
         | tgtweak wrote:
         | If you're not multi-cloud in 2021 and are expecting 5-9's, I
         | feel bad for you.
        
           | post-it wrote:
           | I imagine there are very few businesses where the extra cost
           | of going multi-cloud is smaller than the cost of being down
           | during AWS outages.
        
             | gtirloni wrote:
             | Also, going multi-cloud will introduce more complexity
             | which leads to more errors and more downtime. I'd rather
             | sit this outage out than deal with daily risk of downtime
              | because my infrastructure is too smart for its own good.
        
               | shampster wrote:
               | Depends on the criticality of the service. I mean you're
               | right about adding complexity. But sometimes you can just
                | take your really critical services and make sure they can
                | completely withstand any one cloud provider's outage.
        
           | unethical_ban wrote:
           | If you're not multi-region, I feel bad for you.
           | 
           | If your company is shoehorning you into using multiple clouds
           | and learning a dozen products, IAM and CICD dialects
           | simultaneously because "being cloud dependent is bad", I feel
           | bad for you.
           | 
           | Doing _one_ cloud correctly from a current DevSecOps
           | perspective is a multi-year ask. I estimate it takes about 25
           | people working full time on managing and securing
           | infrastructure per cloud, minimum. This does not include
            | certain matrixed people from legacy network/IAM teams. If
           | you have the people, go for it.
        
             | tgtweak wrote:
             | There are so many things that can go wrong with a single
             | provider, regardless of how many availability zones you are
             | leveraging, that you cannot depend on 1 cloud provider for
              | your uptime if you require that level of availability.
             | 
             | Example: Payment/Administrative issues, rogue employee with
             | access, deprecated service, inter-region routing issues,
             | root certificate compromises... the list goes on and it is
              | certainly not limited to a single AZ.
             | 
              | A very good example: regardless of which of the 85 AZs you
              | are in at AWS, you are affected by this issue right
             | now.
             | 
             | Multi-cloud with the right tooling is trivial. Investing in
             | learning cloud-proprietary stacks is a waste of your
              | investment. You're a clown if you think 25 people
              | internally per cloud are required to "do it right".
        
               | unethical_ban wrote:
               | All cloud tech is proprietary.
               | 
               | There is no such thing as trivially setting up a secure,
               | fully automated cloud stack, much less anything like a
               | streamlined cloud agnostic toolset.
               | 
               | Deprecated services are not the discussion here. We're
               | talking tactical availability, not strategic tools etc.
               | 
               | Rogue employees with access? You mean at the cloud
               | provider or at your company? Still doesn't make sense.
               | Cloud IAM is very difficult in large organizations, and
               | each cloud does things differently.
               | 
               | I worked at fortune 100 finance on cloud security. Some
               | things were quite dysfunctional, but the struggles and
               | technical challenges are real and complex at a large
               | organization. Perhaps you're working on a 50 employee
               | greenfield startup. I'll hesitate to call you a clown as
               | you did me, because that would be rude and dismissive of
               | your experience (if any) in the field.
        
             | throwmefar32 wrote:
             | This.
        
             | ricardobayes wrote:
             | Someone start devops as a service please
        
           | ryuta wrote:
           | How do you become multi-cloud if your root domain is in
            | Route53? Have backup domains on the client side?
        
             | tgtweak wrote:
              | DNS records should be synced to a secondary provider, and
              | that provider's nameservers added to your domain as
              | secondary/tertiary NS entries.
              | 
              | Multi-provider DNS is a solved problem.
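              | 
              | Rough sketch of the sync half in Python/boto3 -- the
              | push_to_secondary() part is a stand-in for whatever API
              | your second DNS provider exposes, and the zone id is a
              | placeholder:
              | 
              |   import boto3
              | 
              |   route53 = boto3.client("route53")
              | 
              |   def export_zone(zone_id):
              |       """Yield every record set in a hosted zone."""
              |       pages = route53.get_paginator(
              |           "list_resource_record_sets")
              |       for page in pages.paginate(HostedZoneId=zone_id):
              |           yield from page["ResourceRecordSets"]
              | 
              |   def push_to_secondary(rrset):
              |       """Stand-in: call the other provider's API here."""
              |       print(rrset["Name"], rrset["Type"])
              | 
              |   for rrset in export_zone("Z123EXAMPLE"):
              |       push_to_secondary(rrset)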
        
           | temuze wrote:
           | If you're having SLA problems I feel bad for you son
           | 
           | I got two 9 problems cuz of us-east-1
        
             | ShroudedNight wrote:
             | > ~~I got two 9 problems cuz of us-east-1~~
             | 
             | I left my two nines problems in us-east-1
        
         | sbr464 wrote:
         | Ohio's actual motto funnily kind of fits here:
         | With God, all things are possible
        
           | shepherdjerred wrote:
           | Does this imply Virginia is Godless?
        
             | sbr464 wrote:
             | Maybe virginia is for lovers, ohio is for God/gods?
        
             | tssva wrote:
             | Virginia's actual motto is "Sic semper tyrannis". What's
             | more tyrannical than an omnipotent being that will condemn
             | you to eternal torment if you don't worship them and follow
             | their laws.
        
               | sneak wrote:
               | I mean, most people are okay with dogs and Seattle.
        
               | [deleted]
        
               | mey wrote:
               | I think I should add state motto to my data center
               | consideration matrix.
        
               | bee_rider wrote:
               | Virginia and Massachusetts have surprisingly aggressive
               | mottoes (MA is: "By the sword we seek peace, but peace
               | only under liberty", which is really just a fancy way of
               | saying "don't make me stab a tyrant," if you think about
               | it). It probably makes sense, though, given that they
               | came up with them during the revolutionary war.
        
       | anonu wrote:
        | I think it's just the console - my EC2 instances in us-east-1
        | are still reachable.
        
         | jrs235 wrote:
         | I think it's affecting more than just AWS. Try searching
         | amazon.com. That's broken for me.
        
       | dylan604 wrote:
        | The fun thing about these types of outages is seeing all of the
       | people that depend upon these services with no graceful fallback.
        | My Roomba app will not even launch because of the AWS outage. I
       | understand that the app gets "updates" from the cloud. In this
       | case "updates" is usually promotional crap, but whatevs. However,
       | for this to prevent the app launching in a manner that I can
       | control my local device is total BS. If you can't connect to the
       | cloud, fail, move on and load the app so that local things are
       | allowed to work.
       | 
        | I'm guessing other IoT things suffer from this same short-
        | sightedness as well.
        
         | codegeek wrote:
         | "If you can't connect to the cloud, fail, move on and load the
         | app so that local things are allowed to work."
         | 
          | Building fallbacks requires work. How much extra effort and
          | overhead is needed to build something like this? Sometimes the
          | cost-benefit analysis says it is ok not to do it. If AWS has an
         | outage like this once a year, maybe we can deal with it (unless
         | you are working with mission critical apps).
        
           | dylan604 wrote:
           | Yes, it is a lot of work to test if response code is OK or
           | not, or if a timeout limit has been reached. So much so, I
           | pretty much wrote the test in the first sentence. Phew. 10x
           | coder right here!
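            | 
            | Back-of-the-napkin version (Python with requests; the URL
            | and cache path are made up, not iRobot's actual setup):
            | 
            |   import json, requests
            | 
            |   CLOUD_URL = "https://example.invalid/device/config"
            |   CACHE = "/var/lib/app/config-cache.json"
            | 
            |   def load_config():
            |       try:
            |           r = requests.get(CLOUD_URL, timeout=2)
            |           r.raise_for_status()
            |           cfg = r.json()
            |           with open(CACHE, "w") as f:
            |               json.dump(cfg, f)  # refresh local cache
            |           return cfg
            |       except (requests.RequestException, ValueError):
            |           try:
            |               with open(CACHE) as f:
            |                   return json.load(f)  # last known good
            |           except FileNotFoundError:
            |               return {"local_only": True}  # default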
        
         | lordnacho wrote:
         | If you did that some clever person would set up their PiHole so
         | that their device just always worked, and then you couldn't
         | send them ads and surveil them. They'd tell their friends and
         | then everyone would just use their local devices locally.
         | Totally irresponsible what you're suggesting.
        
           | apexalpha wrote:
           | A little off-topic, but there are people working on it:
           | https://valetudo.cloud/
           | 
           | It's a little harder than blocking the DNS unfortunately. But
           | nonetheless it always brings a smile to my face to see that
           | there's a FOSS frontier for everything.
        
           | beamatronic wrote:
           | An even more clever person would package up this box, and
           | sell it, along with a companion subscription service, to help
           | busy folks like myself.
        
             | dylan604 wrote:
             | But this new little box would then be required to connect
             | to the home server to receive updates. Guess what? No
             | updates, no worky!! It's a vicious circle!!! Outages all
             | the way down
        
           | arksingrad wrote:
           | this is why everyone runs piholes and no one sees ads on the
           | internet anymore, which killed the internet ad industry
        
             | dylan604 wrote:
             | Dear person from the future, can you give me a hint on who
             | wins the upcoming sporting events? I'm asking for a friend
             | of course
        
               | 0des wrote:
               | Also, what's the verdict on John Titor?
        
         | sneak wrote:
         | Now think of how many assets of various governments' militaries
         | are discreetly employed as normal operational staff by FAAMG in
         | the USA and have access to _cause_ such events from scratch. I
          | would imagine that the US IC (CIA/NSA) already does some free
         | consulting for these giant companies to this end, because they
         | are invested in that Not Being Possible (indeed, it's their
         | job).
         | 
         | There is a societal resilience benefit to not having
         | unnecessary cloud dependencies beyond the privacy stuff. It
         | makes your society and economy more robust if you can continue
         | in the face of remote failures/errors.
         | 
         | It is December 7th, after all.
        
           | dylan604 wrote:
           | > I would imagine that the US IC (CIA/NSA) already does some
           | free consulting for these giant companies to this end,
           | 
           | Haha, it would be funny if the IC reaches out to BigTech when
           | failures occur to let them know they need not be worried
            | about data losses. They can just borrow a copy of the data IC
           | is siphoning off them. /s?
        
         | taf2 wrote:
          | I wouldn't jump to say it's short-sightedness (it is shitty)
          | but it could be a matter of being pragmatic... It's easier to
          | maintain the code if it is loaded at run time (think thin
          | client browser style). This way your IoT device can load the
          | latest code and even settings from the cloud... (advantage
          | when the cloud is available)... I think of this less as
          | short-sightedness and more as a reasonable trade-off (with
          | shitty side effects).
        
           | epistasis wrote:
           | I don't think that's ever a reasonable tradeoff! Network
           | access goes down all the time, and should be a fundamental
           | assumption of any software.
           | 
           | Maybe I'm too old, but I can't imagine a seasoned dev, much
            | less a tech lead, omitting planning for that failure mode.
        
           | outime wrote:
           | Then you could just keep a local copy available as a fallback
           | in case the latest code cannot be fetched. Not doing the bare
           | minimum and screwing the end user isn't acceptable IMHO. But
           | I also understand that'd take some engineer hours and report
           | virtually no benefits as these outages are rare (not sure how
           | Roomba's reliability is in general on the other hand) so here
           | we are.
        
         | s_dev wrote:
          | >The fun thing about these types of outages is seeing all of
         | the people that depend upon these services with no graceful
         | fallback.
         | 
          | What's a graceful fallback? Switching to another hosting
          | service
         | when AWS goes down? Wouldn't that present another set of
         | complications for a very small edge case at huge cost?
        
           | rehevkor5 wrote:
           | Usually this refers to falling back to a different region in
           | AWS. It's typical for systems to be deployed in multiple
           | regions due to latency concerns, but it's also important for
           | resiliency. What you call "a very small edge case" is
           | occurring as we speak, and if you're vulnerable to it you
           | could be losing millions of dollars.
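              | 
              | For the DNS part specifically, Route53 failover records are
              | one way to do it. Very rough boto3 sketch -- zone id,
              | health check id, and hostnames are all placeholders:
              | 
              |   import boto3
              | 
              |   r53 = boto3.client("route53")
              | 
              |   def change(failover, value, health_check=None):
              |       rr = {
              |           "Name": "api.example.com",
              |           "Type": "CNAME",
              |           "TTL": 60,
              |           "SetIdentifier": failover.lower(),
              |           "Failover": failover,  # PRIMARY / SECONDARY
              |           "ResourceRecords": [{"Value": value}],
              |       }
              |       if health_check:
              |           rr["HealthCheckId"] = health_check
              |       return {"Action": "UPSERT",
              |               "ResourceRecordSet": rr}
              | 
              |   r53.change_resource_record_sets(
              |       HostedZoneId="Z123EXAMPLE",
              |       ChangeBatch={"Changes": [
              |           change("PRIMARY", "api-use1.example.com",
              |                  "1111-2222-example-health-check"),
              |           change("SECONDARY", "api-usw2.example.com"),
              |       ]},
              |   )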
        
             | stevehawk wrote:
              | probably not possible for a lot more services than you'd
             | think because AWS Cognito has no decent failover method
        
             | simmanian wrote:
             | AWS itself has a huge single point of failure on us-east-1
             | region. Usually, if us-east-1 goes down, others soon
             | follow. At that point, it doesn't matter how many regions
             | you're deploying to.
        
               | scoopertrooper wrote:
               | My workloads on Sydney and London are unaffected. I can't
               | speak for anywhere else.
        
               | lowbloodsugar wrote:
               | "Usually"? When has that ever happened?
        
               | dylan604 wrote:
               | https://awsmaniac.com/aws-outages/
        
           | winrid wrote:
           | In this case, just connect over LAN.
        
             | politician wrote:
              | Or Bluetooth.
        
             | s_dev wrote:
              | Right -- I think I misread OP as meaning a graceful
              | fallback, e.g. working offline.
              | 
              | Rather than implementing a dynamically switching backup in
              | the event of AWS going down, which is not trivial.
        
           | [deleted]
        
           | itisit wrote:
           | > Wouldn't that present another set of complications for a
           | very small edge case at huge cost?
           | 
           | One has to crunch the numbers. What does a service outage
           | cost your business every minute/hour/day/etc in terms of lost
           | revenue, reputational damage, violated SLAs, and other
           | factors? For some enterprises, it's well worth the added
           | expense and trouble of having multi-site active-active setups
           | that span clouds and on-prem.
        
         | thayne wrote:
         | there's a reason it is called the _Internet_ of things, and not
         | the  "local network of things". Even if the latter is probably
         | what most customers would prefer.
        
           | 9wzYQbTYsAIc wrote:
           | There's also no reason for an internet connected app to crash
           | on load when there is no access to the internet services.
        
             | mro_name wrote:
             | indeed.
             | 
              | A constitutional property of a network is its volatility.
              | Nodes may fail. Edges may. You may not. Or you may. But
              | then you're delivering no reliability but crap. Nice
             | sunshine crap, maybe.
        
       | birdyrooster wrote:
       | Gives me a flashback to the December 24th, 2012 outage. I guess
       | not much changes in 9 years time.
        
       | stephenr wrote:
       | So, we're getting failures (for customers) trying to use amazon
       | pay from our site. AFAIK there is no "status page" for Amazon
        | Pay, but the _rest_ of Amazon's services seem to be a giant Rube
       | Goldberg machine so it's hard to imagine this isn't too.
        
         | itisit wrote:
         | http://status.mws.amazon.com/
        
           | stephenr wrote:
           | Thanks.. seems to be about as accurate as the regular AWS
           | status board is..
        
             | itisit wrote:
              | They _just_ added a banner. My guess is they don't know
             | enough yet to update the respective service statuses.
        
               | stephenr wrote:
               | I have basically zero faith in Amazon at this point.
               | 
               | We first noticed failures because a tester happened to be
               | testing in an env that uses the Amazon Pay sandbox.
               | 
               | I checked the prod site, and it wouldn't even ask me to
               | login.
               | 
               | When I tried to login to SellerCentral to file a ticket -
               | it told me my password (from a pw manager) was wrong.
               | When I tried to reset, the OTP was ridiculously slow.
               | Clicking "resend OTP" gives a "the OTP is incorrect"
               | error message. When I finally got an OTP and put it in,
               | the resulting page was a generic Amazon "404 page not
               | found".
               | 
               | A while later, my original SellerCentral password, still
               | un-changed because I never got another OTP to reset it,
               | worked.
               | 
               | What the fuck kind of failure mode is that "services are
               | down, so password must be wrong".
        
               | itisit wrote:
               | Sorry to hear. If multi-cloud is the answer, I wouldn't
               | be surprised to see folks go back to owning and operating
               | their own gear.
        
       | Tea418 wrote:
        | If you have trouble logging in to AWS Console, you can use a
        | regional console endpoint such as
        | https://eu-central-1.console.aws.amazon.com/
        
         | jedilance wrote:
          | I also see the same error here: Internal Error, Please try
          | again
         | later.
        
         | throwanem wrote:
         | Didn't work for me just now. Same error as the regular
         | endpoint.
        
       | judge2020 wrote:
       | Got alerted to 503 errors for SES, so it's not just the
       | management console.
        
       | lowwave wrote:
        | This is exactly the kind of over-centralisation issue I was
        | talking about. I was one of the first developers using AWS EC2,
        | back when scaling was hard for small dev shops. In this day and
        | age, anyone who is technically inclined can figure out the new
        | technologies. Why even use AWS? Get something like Hetzner or
        | Linode, please!
        
         | binaryblitz wrote:
         | How are those better than using AWS/Azure/GCP/etc? I'd say the
         | correct way to handle situations is to have things in multiple
         | regions, and potentially multiple clouds if possible.
         | Obviously, things like databases would be harder to keep in
         | sync on multi cloud, but not impossible.
        
         | pojzon wrote:
         | Some manager: But it does not web scale!
        
           | dr-detroit wrote:
           | the average HN poster: I run things on a box under my desk
           | and you cannot teach me why thats bad!!!
        
       | saisundar wrote:
        | Yikes, Ring, the security system, is also down.
        | 
        | Wonder if crime rates might eventually spike up if AWS goes
        | down, in a utopian world where Amazon gets everyone to use Ring.
        
         | kingcharles wrote:
         | I'm now imagining a team of criminals sitting around in face
         | masks and hoodies refreshing AWS status page all day...
         | 
         | "AWS is down! Christmas came early boys! Roll out..."
        
       | john37386 wrote:
        | IMDb seems down too and returning 503. Is it related? Here is
        | the output. Kind of funny.
       | 
       | D'oh!
       | 
       | Error 503
       | 
       | We're sorry, something went wrong.
       | 
       | Please try again...wait...wait...yep, try reload/refresh now.
       | 
       | But if you are seeing this again, please report it here.
       | 
       | Please explain which page you were at and where on it that you
       | clicked
       | 
       | Thank you!
        
         | takeda wrote:
         | IMDB belongs to Amazon, so likely on AWS too.
         | 
         | This also confirms it: https://downdetector.com/status/imdb/
        
           | binaryblitz wrote:
           | TIL Amazon owns IMDB
        
             | takeda wrote:
             | Yeah, I was also surprised when I learned this. Another
              | surprising thing is that they've owned it since 1998.
        
       | whoisjuan wrote:
       | us-east-1 is so unreliable that it probably should be nuked. It's
       | the region with the worst track record. I guess it doesn't help
        | that it is one of the oldest.
        
         | zrail wrote:
         | US-EAST-1 is the oldest, largest, and most heterogeneous
         | region. There are many data centers and availability zones
         | within the region. I believe it's where AWS rolls out changes
         | first, but I'm not confident on that.
        
       | duckworth wrote:
       | After over 45 minutes https://status.aws.amazon.com/ now shows
       | "AWS Management Console - Increased Error Rates"
       | 
       | I guess 100% is technically an increase.
        
         | Slartie wrote:
         | "Fixed a bug that could cause [adverse behavior affecting 100%
         | of the user base] for some users"
        
           | sophacles wrote:
           | "some" as in "not all". I'm sure there are some tiny tiny
           | sites that were unaffected because no one went to them during
           | the outage.
        
             | Slartie wrote:
             | "some" technically includes "all", doesn't it? It excludes
             | "none", I suppose, but why should it exclude "all" (except
             | if "all" equals "none")?
        
         | brasetvik wrote:
         | I can't remember seeing problems be more strongly worded than
         | "Increased Error Rates" or "high error rates with S3 in us-
         | east-1" during the infamous S3 outage of 2017 - and that was
         | after they struggled to even update their own status page
         | because of S3 being down. :)
        
           | schleck8 wrote:
           | During the Facebook outage FB wrote something along the lines
           | of "We noticed that some users are experiencing issues with
           | our apps" eventhough nothing worked anymore
        
       | boldman wrote:
       | I think now is a good time to reiterate the danger of companies
       | just throwing all of their operational resilience and
       | sustainability over the wall and trusting someone else with their
       | entire existence. It's wild to me that so many high performing
       | businesses simply don't have a plan for when the cloud goes down.
       | Some of my contacts are telling me that these outages have teams
        | of thousands of people completely prevented from working and
        | tens of millions of dollars of profit are simply vanishing since
        | the start
       | of the outage this morning. And now institutions like government
       | and banks are throwing their entire capability into the cloud
       | with no recourse or recovery plan. It seems bad now but I wonder
       | how much worse it might be when no one actually has access to
       | money because all financial traffic is going through AWS and it
       | goes down.
       | 
        | We are incredibly blind to trust just 3 cloud providers with
       | the operational success of basically everything we do.
       | 
       | Why hasn't the industry come up with an alternative?
        
         | xwdv wrote:
         | There is an alternative: A _true_ network cloud. This is what
         | Cloudflare will eventually become.
        
         | jacobsenscott wrote:
         | We have or had alternatives - rackspace, linode, digital ocean,
         | in the past there were many others, self hosting is still an
         | option. But the big three just do it better. The alternatives
         | are doomed to fail. If you use anything other than the big
         | three you risk not just more outages, but your whole provider
         | going out of business overnight.
         | 
         | If the companies at the scale you are talking about do not have
          | multi-region and multi-service (AWS to Azure, for example)
         | failover that's their fault, and nobody else's.
        
         | [deleted]
        
         | p1necone wrote:
         | Do you think they'd manage their own infra better? Are you
         | suggesting they pay for a fully redundant second implementation
         | on another provider? How much extra cost would that be vs
         | eating an outage very infrequently?
        
         | jessebarton wrote:
         | In my opinion there is a lack of talent in these industries for
          | building out their own resilient systems. IT people and
         | engineers get lazy.
        
           | grumple wrote:
           | We're too busy in endless sprints to focus on things outside
           | of our core business that don't make salespeople and
           | executives excited.
        
           | BarryMilo wrote:
           | No lazier than anyone else, there's just not enough of us, in
           | general and per company.
        
           | rodgerd wrote:
           | > IT people and engineers get lazy.
           | 
           | Companies do not change their whole strategy from a capex-
           | driven traditional self-hosting environment to opex-driven
           | cloud hosting because their IT people are lazy; it is
           | typically an exec-level decision.
        
         | tuldia wrote:
         | > Why hasn't the industry come up with an alternative?
         | 
         | We used to have that, some companies still have the capability
         | and know-how to build and run infrastructure that is reliable,
         | distributed across many hosting providers before "cloud" became
         | the "norm", but it goes along with "use or lose it".
        
         | adflux wrote:
          | Because your own datacenters can't go down?
        
         | uranium wrote:
         | Because the expected value of using AWS is greater than the
         | expected value of self-hosting. It's not that nobody's ever
         | heard of running on their own metal. Look back at what everyone
         | did before AWS, and how fast they ran screaming away from it as
         | soon as they could. Once you didn't have to do that any more,
         | it's just so much better that the rare outages are worth it for
         | the vast majority of startups.
         | 
         | Medical devices, banks, the military, etc. should generally run
         | on their own hardware. The next photo-sharing app? It's just
         | not worth it until they hit tremendous scale.
        
           | chucknthem wrote:
           | Agree with your first point.
           | 
           | On the second though, at some point, infrastructure like AWS
            | is going to be more reliable than what many banks, medical
            | device operators etc. can provide themselves. Asking them to
           | stay on their own hardware is asking for that industry to
           | remain slow, bespoke and expensive.
        
             | uranium wrote:
             | Agreed, and it'll be a gradual switch rather than a single
             | point, smeared across industries. Likely some operations
             | won't ever go over, but it'll be a while before we know.
        
             | hn_throwaway_99 wrote:
             | Hard agree with the second paragraph.
             | 
             | It is _incredibly_ difficult for non-tech companies to hire
             | quality software and infrastructure engineers - they
              | usually pay less and the problems aren't as interesting.
        
         | seeEllArr wrote:
          | They have, it's called hosting on-premises, and it's even less
         | reliable than cloud providers.
        
         | smugglerFlynn wrote:
         | Many of those businesses wouldn't have existed in the first
         | place without simplicity offered by cloud.
        
         | stfp wrote:
         | > tens of million dollars of profit are simply vanishing
         | 
         | vanishing or delayed six hours? I mean
        
           | naikrovek wrote:
           | money people think of money very weirdly. when they predict
           | they will get more than they actually get, they call it
           | "loss" for some reason, and when they predict they will get
           | less than they actually get, it's called ... well I don't
           | know what that's called but everyone gets bonuses.
        
           | lp0_on_fire wrote:
           | 6 hours of downtime often means 6 hours of paying employees
           | to stand around which adds up rather quickly.
        
         | nprz wrote:
         | So you're saying companies should start moving their
         | infrastructure to the blockchain?
        
           | tornato7 wrote:
           | Ethereum has gone 5 years without a single minute of
           | downtime, so if it's extreme reliability you're going for I
           | don't think it can be beaten.
        
         | rbetts wrote:
          | We're too busy generating our own electricity and
         | designing our own CPUs.
        
         | commandlinefan wrote:
         | Well, if you're web-based, there's never really been any better
         | alternative. Even before "the cloud", you had to be hosted in a
         | datacenter somewhere if you wanted enough bandwidth to service
         | customers, as well as have somebody who would make sure the
         | power stayed on 24/7. The difference now is that there used to
        | be thousands of ISPs, so one outage wouldn't get as much news
         | coverage, but it would also probably last longer because you
         | wouldn't have a team of people who know what to look for like
         | Amazon (probably?) does.
        
         | olingern wrote:
         | People are so quick to forget how things were before behemoths
         | like AWS, Google Cloud, and Azure. Not all things are free and
         | the outage the internet is experiencing is the risk users
         | signed up for.
         | 
         | If you would like to go back to the days of managing your own
         | machines, be my guest. Remember those machines also live
         | somewhere and were/are subject to the same BGP and routing
         | issues we've seen over the past couple of years.
         | 
         | Personally, I'll deal with outages a few times a year for the
         | peace of mind that there's a group of really talented people
          | looking into it for me.
        
         | bowmessage wrote:
         | Because the majority of consumers don't know better / don't
         | care and still buy products from companies with no backup plan.
         | Because, really, how can any of us know better until we're
         | burned many times over?
        
         | 300bps wrote:
         | This appears to be a single region outage - us-east-1. AWS
         | supports as much redundancy as you want. You can be redundant
         | between multiple Availability Zones in a single Region or you
         | can be redundant among 1, 2 or even 25 regions throughout the
         | world.
         | 
         | Multiple-region redundancy costs more both in initial
         | planning/setup as well as monthly fees so a lot of AWS
         | customers choose to just not do it.
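         | 
         | Rough sketch of what the multi-region version looks like with
         | boto3, assuming you already have a CloudFormation template
         | (stack and file names are placeholders):
         | 
         |   import boto3
         | 
         |   REGIONS = ["us-east-1", "us-west-2", "eu-central-1"]
         | 
         |   with open("stack.yaml") as f:  # your existing template
         |       template = f.read()
         | 
         |   for region in REGIONS:
         |       cfn = boto3.client("cloudformation",
         |                          region_name=region)
         |       cfn.create_stack(
         |           StackName="my-service",
         |           TemplateBody=template,
         |           # only needed if the template creates IAM:
         |           Capabilities=["CAPABILITY_NAMED_IAM"],
         |       )
         | 
         | The harder part is the data layer and routing failover between
         | those copies, which is where most of the extra cost comes from.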
        
         | george3d6 wrote:
         | This seems like an insane stance to have, it's like saying
         | businesses should ship their own stock, using their own
         | drivers, and their in-house made cars and planes and in-house
         | trained pilots.
         | 
         | Heck, why stop at having servers on-site? Cast your own silicon
          | wafers; after all, you don't want Spectre exploits.
         | 
          | Because you are worse at it. If a specialist is this bad, and
         | the market is fully open, then it's because the problem is
         | hard.
         | 
         | AWS has fewer outages in one zone alone than the best self-
          | hosted institutions, your Facebooks and Pentagons. In-house
         | servers would lead to an insane amount of outage.
         | 
          | And guess what? AWS (and all other IaaS providers) will beg you
          | to use multiple regions because of this. The team/person that
         | has millions of dollars a day staked on a single AWS region is
         | an idiot and could not be entrusted to order a gaming PC from
         | newegg, let alone run an in-house datacenter.
         | 
         | edit: I will add that AWS specifically is meh and I wouldn't
          | use it myself; there's better IaaS. But it's insanity to even
          | imagine self-hosted is more reliable than using even the
          | shittiest of IaaS providers.
        
           | jgwil2 wrote:
           | > In-house servers would lead to an insane amount of outage.
           | 
           | That might be true, but the effects of any given outage would
           | be felt much less widely. If Disney has an outage, I can just
           | find a movie on Netflix to watch instead. But now if one
           | provider goes down, it can take down everything. To me, the
           | problem isn't the cloud per se, it's one player's dominance
           | in the space. We've taken the inherently distributed
           | structure of the internet and re-centralized it, losing some
           | robustness along the way.
        
             | dragonwriter wrote:
             | > That might be true, but the effects of any given outage
             | would be felt much less widely.
             | 
             | If my system has an hour of downtime every year and the
             | dozen other systems it interacts with and depends on each
             | have an hour of downtime every year, it can be _better_
             | that those tend to be correlated rather than independent.
        
           | bb88 wrote:
           | Apple created their own silicon. Fedex uses its own pilots.
            | The USPS uses its own cars.
           | 
           | If you're a company relying upon AWS for your business, is it
            | okay if you're down for a day or two while you wait for AWS
            | to resolve its issue?
        
             | jasode wrote:
             | > _Apple created their own silicon._
             | 
             | Apple _designs_ the M1. But TSMC (and possibly Samsung)
             | actually manufacture the chips.
        
             | jcranberry wrote:
             | Most companies using AWS are tiny compared to the companies
             | you mentioned.
        
             | lostlogin wrote:
             | It's bloody annoying when all I want to do is vacuum the
             | floor and Roomba says nope, "active AWS incident".
        
               | Grazester wrote:
               | If all you wanted to do was vacuum the floor you would
               | not have gotten that particular vacuum cleaner. Clearly
               | you wanted to do more than just vacuum the floor and
               | something like this happening should be weighed with the
               | purchase of the vacuum.
        
               | lostlogin wrote:
               | I'll rephrase. I wanted the floor vacuumed and I didn't
               | want to do it.
        
             | teh_klev wrote:
             | > Apple created their own silicon
             | 
             | Apple _designed_ their own silicon, a third party
             | manufactures and packages it for them.
        
           | roody15 wrote:
            | Quick follow-up. I once used an IaaS provider (hyperstreet)
            | that was terrible. Long story short, the provider ended up
            | closing shop and the owner of the company now sells real
            | estate in California.
            | 
            | It was a nightmare recovering data. Even when the service
            | was operational it was subpar.
           | 
           | Just saying perhaps the "shittiest" providers may not be more
           | reliable.
        
           | kayson wrote:
           | I think you're missing the point of the comment. It's not
           | "don't use cloud". It's "be prepared for when cloud goes
           | down". Because it will, despite many companies either
           | thinking it won't, or not planning for it.
        
           | midasuni wrote:
           | > AWS has fewer outages in one zone alone than the best self-
            | hosted institutions, your Facebooks and Pentagons. In-house
           | servers would lead to an insane amount of outage.
           | 
           | It's had two in 13 months
        
           | imiric wrote:
           | > This seems like an insane stance to have, it's like saying
           | businesses should ship their own stock, using their own
           | drivers, and their in-house made cars and planes and in-house
           | trained pilots.
           | 
           | > Heck, why stop at having servers on-site? Cast your own
            | silicon wafers; after all, you don't want Spectre exploits.
           | 
           | That's an overblown argument. Nobody is saying that, but it's
           | clear that businesses that maintain their own infrastructure
            | would've avoided today's AWS outage. So just avoiding a
           | single level of abstraction would've kept your company
           | running today.
           | 
            | > Because you are worse at it. If a specialist is this bad,
           | and the market is fully open, then it's because the problem
           | is hard.
           | 
           | The problem is hard mostly because of scale. If you're a
           | small business running a few websites with a few million hits
           | per month, it might be cheaper and easier to colocate a few
           | servers and hire a few DevOps or old-school sysadmins to
           | administer the infrastructure. The tooling is there, and is
           | not much more difficult to manage than a hundred different
           | AWS products. I'm actually more worried about the DevOps
           | trend where engineers are trained purely on cloud
           | infrastructure and don't understand low-level tooling these
           | systems are built on.
           | 
           | > AWS has fewer outages in one zone alone than the best self-
            | hosted institutions, your Facebooks and Pentagons. In-house
           | servers would lead to an insane amount of outage.
           | 
           | That's anecdotal and would depend on the capability of your
           | DevOps team and your in-house / colocation facility.
           | 
            | > And guess what? AWS (and all other IaaS providers) will beg
            | you to use multiple regions because of this. The team/person
           | that has millions of dollars a day staked on a single AWS
           | region is an idiot and could not be entrusted to order a
           | gaming PC from newegg, let alone run an in-house datacenter.
           | 
           | Oh great, so the solution is to put even more of our eggs in
           | a single provider's basket? The real solution would be having
           | failover to a different cloud provider, and the
           | infrastructure changes needed for that are _far_ from
           | trivial. Even with that, there's only 3 major cloud providers
           | you can pick from. Again, colocation in a trusted datacenter
           | would've avoided all of this.
        
             | p1necone wrote:
             | > it's clear that businesses that maintain their own
             | infrastructure would've avoided today's AWS' outage.
             | 
             | Sure, that's trivially obvious. But how many other outages
             | would they have had instead because they aren't as
             | experienced at running this sort of infrastructure as AWS
             | is?
             | 
             | You seem to be arguing from the a priori assumption that
             | rolling your own is inherently more stable than renting
             | infra from AWS, without actually providing any
             | justification for that assumption.
             | 
             | You also seem to be under the assumption that any amount of
              | downtime is _always_ unacceptable, and worth spending
             | large amounts of time and effort to avoid. For a _lot_ of
             | businesses systems going down for a few hours every once in
              | a while just isn't a big deal, and is much more preferable
             | than spending thousands more on cloud bills, or hiring more
             | full time staff to ensure X 9s of uptime.
        
               | imiric wrote:
               | You and GP are making the same assumption that my DevOps
               | engineers _aren't_ as experienced as AWS' are. There are
               | plenty of engineers capable of maintaining an in-house
               | infrastructure running X 9s because, again, the
               | complexity comes from the scale AWS operates at. So we're
               | both arguing with an a priori assumption that the grass
               | is greener on our side.
               | 
               | To be fair, I'm not saying never use cloud providers. If
               | your systems require the complexity cloud providers
               | simplify, and you operate at a scale where it would be
               | prohibitively expensive to maintain yourself, by all
               | means go with a cloud provider. But it's clear that not
               | many companies are prepared for this type of failure, and
               | protecting against it is not trivial to accomplish. Not
               | to mention the conceptual overhead and knowledge required
                | to deal with the provider's specific products, APIs,
                | etc. Whereas maintaining these systems yourself is
                | transferable across any datacenter.
        
             | solveit wrote:
             | This feels like a discussion that could sorely use some
             | numbers.
             | 
             | What are good examples of
             | 
             | >a small business running a few websites with a few million
             | hits per month, it might be cheaper and easier to colocate
             | a few servers and hire a few DevOps or old-school sysadmins
             | to administer the infrastructure.
             | 
             | and how often do they go down?
        
               | i_like_waiting wrote:
                | Depends, I guess. I am running an on-prem workstation for
                | our DWH. So far in 2 years it has gone down for minutes
                | at a time, and only when I decided to take it down for
                | hardware updates. I have no idea where this narrative
                | came from, but usually the hardware you have is very
                | reliable and doesn't turn off every 15 minutes.
                | 
                | Heck, I use an old T430 for my home server and it still
                | doesn't go down at completely random times (but that's a
                | very simplified example, I know).
        
             | jasode wrote:
             | _> , but it's clear that businesses that maintain their own
              | infrastructure would've avoided today's AWS outage._
             | 
             | When Netflix was running its own datacenters in 2008, they
              | had a _3-day outage_ from a database corruption and
              | couldn't ship DVDs to customers. That was the disaster that
             | pushed CEO Reed Hastings to get out of managing his own
             | datacenters and migrate to AWS.
             | 
             | The flaw in the reasoning that running your own hardware
             | would _avoid today 's outage_ is that it doesn't also
             | consider the _extra unplanned outages on other days_
             | because your homegrown IT team (especially at non-tech
              | companies) isn't as skilled as the engineers working at
             | AWS/GCP/Azure.
        
               | qaq wrote:
               | The flaw in your reasoning is that the complexity of the
               | problem is even remotely the same. Most AWS outages are
               | control plane related.
        
           | naikrovek wrote:
            | > AWS (and all other IaaS providers) will beg you to use
            | multiple regions
           | 
           | will they? because AWS still puts new stuff in us-east-1
           | before anywhere else, and there is often a LONG delay before
           | those things go to other regions. there are many other
           | examples of why people use us-east-1 so often, but it all
           | boils down to this: AWS encourage everyone to use us-east-1
           | and discourage the use of other regions for the same reasons.
           | 
           | if they want to change how and where people deploy, they
            | should change how they encourage their customers to deploy.
           | 
           | my employer uses multi-region deployments where possible, and
           | we can't do that anywhere nearly as much as we'd like because
           | of limitations that AWS has chosen to have.
           | 
           | so if cloud providers want to encourage multi-region
           | adoption, they need to stop discouraging and outright
           | preventing it, first.
        
             | danielheath wrote:
             | It works really well imo. All the people who want to use
             | new stuff at the expense of stability choose us-east-1;
             | those who want stability at the expense of new stuff run
              | multi-region (usually not in us-east-1)
        
             | mypalmike wrote:
             | This argument seems rather contrived. Which feature
             | available in only one region for a very long time has
             | specifically impacted you? And what was the solution?
        
             | WaxProlix wrote:
             | Most features roll out to IAD second, third, or fourth. PDX
             | and CMH are good candidates for earlier feature rollout,
             | and usually it's tested in a small region first. I use PDX
             | (us-west-2) for almost everything these days.
             | 
             | I also think that they've been making a lot of the default
             | region dropdowns and such point to CMH (us-east-2) to get
             | folks to migrate away from IAD. Your contention that
              | they're encouraging people to use that region just doesn't
             | ring true to me.
        
           | savant_penguin wrote:
           | they usually beg you to use multiple availability zones
           | though
           | 
            | I'm not sure how many AWS services are easy to spawn in
            | multiple regions.
        
             | mypalmike wrote:
             | Which ones are difficult to deploy in multiple regions?
        
             | dragonwriter wrote:
             | > they usually beg you to use multiple availability zones
             | though
             | 
              | Doesn't help you if what goes down is AWS global
              | services on which you directly, or other AWS services,
              | depend (which tend to be tied to us-east-1).
        
           | optiomal_isgood wrote:
           | This is the right answer, I recall studying for the solutions
           | architect professional certification and reading this
           | countless times: outages will happen and you should plan for
           | them by using multi-region if you care about downtime.
           | 
            | It's not AWS's fault here, it's the companies', who assume
           | that it will never be down. In-house servers also have
           | outages, it's a very naive assumption to think that it'd be
           | all better if all of those services were using their own
           | servers.
           | 
           | Facebook doesn't use AWS and they were down for several hours
           | a couple weeks ago, and that's because they have way better
           | engineers than the average company, working on their
           | infrastructure, exclusively.
        
           | qaq wrote:
           | "AWS has fewer outages in one zone alone than the best self-
           | hosted institutions" sure you just call an outage "increased
           | error rate"
        
           | SkyPuncher wrote:
           | In addition to fewer outages, _many_ products get a free pass
           | on incidents because basically everyone is being impacted by
            | the outage.
        
           | johannes1234321 wrote:
            | The benefit of self-hosting is that you are up while your
            | competitors are down.
            | 
            | However, if you are on AWS, many of your competitors are down
            | while you are down, so they can't take over your business.
        
           | [deleted]
        
         | tw04 wrote:
         | >It seems bad now but I wonder how much worse it might be when
         | no one actually has access to money because all financial
         | traffic is going through AWS and it goes down.
         | 
         | Most financial institutions are implementing their own clouds,
         | I can't think of any major one that is reliant on public cloud
         | to the extent transactions would stop.
         | 
         | >Why hasn't the industry come up with an alternative?
         | 
         | You mean like building datacenters and hosting your own gear?
        
           | baoyu wrote:
           | > Most financial institutions are implementing their own
           | clouds
           | 
           | https://www.nasdaq.com/Nasdaq-AWS-cloud-announcement
        
             | filmgirlcw wrote:
             | That doesn't mean what you think it means.
             | 
             | The agreement is more of a hybrid cloud arrangement with
             | AWS Outposts.
             | 
             | FTA:
             | 
             | >Core to Nasdaq's move to AWS will be AWS Outposts, which
             | extend AWS infrastructure, services, APIs, and tools to
             | virtually any datacenter, co-location space, or on-premises
             | facility. Nasdaq plans to incorporate AWS Outposts directly
             | into its core network to deliver ultra-low-latency edge
             | compute capabilities from its primary data center in
             | Carteret, NJ.
             | 
             | They are also starting small, with Nasdaq MRX
             | 
             | This is much less about moving NASDAQ (or other exchanges)
             | to be fully owned/maintained by Amazon, and more about
             | wanting to take advantage of development tooling and
             | resources and services AWS provides, but within the
             | confines of an owned/maintained data center. I'm sure as
             | this partnership grows, racks and racks will be in Amazon's
             | data centers too, but this is a hybrid approach.
             | 
             | I would also bet a significant amount of money that when
             | NASDAQ does go full "cloud" (or hybrid, as it were), it
             | won't be in the same US-east region co-mingling with the
             | rest of the consumer web, but with its own redundant
             | services and connections and networking stack.
             | 
             | NASDAQ wants to modernize its infrastructure but it
             | absolutely doesn't want to offload it to a cloud provider.
             | That's why it's a hybrid partnership.
        
           | jen20 wrote:
           | Indeed I can think of several outages in the past decade in
           | the UK of banks' own infrastructure which have led to
           | transactions stopping for days at a time, with the
           | predictable outcomes.
        
         | tcgv wrote:
         | > Why hasn't the industry come up with an alternative?
         | 
         | The cloud is the solution to self managed data centers. Their
         | value proposition is appealing: Focus on your core business and
         | let us handle infrastructure for you.
         | 
         | This fits the needs of most small and medium sized businesses,
         | there's no reason not to use the cloud and spend time and money
         | on building and operating private data centers when the
         | (perceived) chances of outages are so small.
         | 
         | Then, companies grow to a certain size where the benefits of
          | having a self-managed data center begin to outweigh not
         | having one. But at this point this becomes more of a
         | strategic/political decision than merely a technical one, so
         | it's not an easy shift.
        
       | JamesAdir wrote:
       | noob question: Aren't companies using several regions for
       | availability and redundancy?
        
         | yunwal wrote:
         | I'm seeing outages across several regions for certain services
         | (SNS), so cross-region failover doesn't necessarily help here.
         | 
         | Additionally, for complex apps, automatic cross-region disaster
         | recovery can take tens or even hundreds of dev years, something
         | most small to midsized companies can't afford.
        
         | blahyawnblah wrote:
         | Ideally, yes. In practice, most are hosted in a single region
         | but with multiple availability zones (this is called high
         | availability). What you're talking about is fault tolerance
         | (across multiple regions). That's harder to implement and costs
         | more.
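         | 
         | Minimal sketch of the multi-AZ (HA) half with boto3, assuming
         | an existing launch template; names and subnet ids are
         | placeholders:
         | 
         |   import boto3
         | 
         |   asg = boto3.client("autoscaling",
         |                      region_name="us-east-1")
         | 
         |   asg.create_auto_scaling_group(
         |       AutoScalingGroupName="web-asg",
         |       MinSize=3,
         |       MaxSize=9,
         |       LaunchTemplate={
         |           "LaunchTemplateName": "web-lt",
         |           "Version": "$Latest",
         |       },
         |       # one subnet per AZ = instances spread across AZs
         |       VPCZoneIdentifier=(
         |           "subnet-aaa111,subnet-bbb222,subnet-ccc333"),
         |   )
         | 
         | Fault tolerance across regions means repeating something like
         | this per region and then failing traffic over between them,
         | which is the expensive part.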
        
       | PragmaticPulp wrote:
       | I worked at a company that hired an ex-Amazon engineer to work on
       | some cloud projects.
       | 
       | Whenever his projects went down, he fought tooth and nail against
       | any suggestion to update the status page. When forced to update
       | the status page, he'd follow up with an extremely long "post-
       | mortem" document that was really just a long winded explanation
       | about why the outage was someone else's fault.
       | 
       | He later explained that in his department at Amazon, being at
       | fault for an outage was one of the worst things that could happen
       | to you. He wanted to avoid that mark any way possible.
       | 
       | YMMV, of course. Amazon is a big company and I've had other
       | friends work there in different departments who said this wasn't
       | common at all. I will always remember the look of sheer panic he
       | had when we insisted that he update the status page to accurately
       | reflect an outage, though.
        
         | broknbottle wrote:
          | This gets posted every time there's an AWS outage. It might as
          | well be copypasta at this point.
        
           | Rapzid wrote:
           | It's the "grandma got run over by a reindeer" of AWS outages.
           | Really no outage thread would be complete without this
           | anecdote.
        
           | JoelMcCracken wrote:
           | well, this is the first time I've seen it, so I am glad it
           | was posted this time.
        
             | rconti wrote:
             | Ditto, it's always annoyed me that their status page is
             | useless, but glad someone else mentioned it.
        
             | jjoonathan wrote:
             | First time I've seen it too. Definitely not my first "AWS
             | us-east-1 is down but the status board is green" thread,
             | either.
        
           | [deleted]
        
           | Spivak wrote:
           | I mean it's true at every company I've ever worked at too. If
           | you can lawyer incidents into not being an outage you avoid
           | like 15 meetings with the business stakeholders about all the
           | things we "have to do" to prevent things like this in the
           | future that get canceled the moment they realize that how
           | much dev/infra time it will take to implement.
        
           | ignoramous wrote:
           | I had that deja vu feeling reading PragmaticPulp's comment,
           | too.
           | 
           | And sure enough, PragmaticPulp did post a similar comment on
           | a thread about Amazon India's alleged hire-to-fire policy 6
           | months back: https://news.ycombinator.com/item?id=27570411
           | 
           | You and I, we aren't among the 10000, but there are
           | potentially 10000 others who might be: https://xkcd.com/1053/
        
           | PragmaticPulp wrote:
           | Sorry. I'm probably to blame because I've posted this a
           | couple times on HN before.
           | 
           | It strikes a nerve with me because it caused so much trouble
           | for everyone around him. He had other personal issues,
           | though, so I should probably clarify that I'm not entirely
           | blaming Amazon for his habits. Though his time at Amazon
           | clearly did exacerbate his personal issues.
        
         | dang wrote:
         | (This was originally a reply to
         | https://news.ycombinator.com/item?id=29473759 but I've pruned
         | it to make the thread less top-heavy.)
        
         | avalys wrote:
         | I can totally picture this. Poor guy.
        
         | kortex wrote:
         | That sounds like the exact opposite of human-factors
         | engineering. No one _likes_ taking blame. But when things go
         | sideways, people are extra spicy and defensive, which makes
         | them clam up and often withhold useful information, which can
         | extend the outage.
         | 
         | No-blame analysis is a much better pattern. Everyone wins. It's
         | about building the system that builds the system. Stuff broke;
         | fix the stuff that broke, then fix the things that _let stuff
         | break_.
        
           | 88913527 wrote:
           | I don't think engineers can believe in no-blame analysis if
           | they know it'll harm career growth. I can't unilaterally
           | promote John Doe, I have to convince other leaders that John
           | would do well the next level up. And in those discussions,
           | they could bring up "but John has caused 3 incidents this
           | year", and honestly, maybe they'd be right.
        
             | SQueeeeeL wrote:
              | Would they? Having 3 outages in a year sounds like an
              | organizational problem: not enough safeguards to prevent
              | very routine human errors. But instead of worrying about
              | that, we just assign a guy to take the fall.
        
               | JackFr wrote:
                | Well, if John caused 3 outages and his peers Sally and
                | Mike each caused 0, it's worth taking a deeper look.
                | There's a real possibility he's getting screwed by a
                | messed-up org; he could also be doing slapdash work, or he
                | seriously might not understand the seriousness of an
                | outage.
        
               | jjav wrote:
                | Worth a look, certainly. It's also very possible that this
                | John is upfront about honest postmortems and, like a good
                | leader, takes the blame, whereas Sally and Mike are out
                | all day playing politics, looking for ways to shift blame
                | so nothing has their name attached. At most larger
                | companies, that's how it goes.
        
               | Kliment wrote:
               | Or John's work is in frontline production use and Sally's
               | and Mike's is not, so there's different exposure.
        
               | crmd wrote:
               | John's team might also be taking more calculated risks
               | and running circles around Sally and Mike's teams with
               | respect to innovation and execution. If your organization
               | categorically punishes failures/outages, you end up with
               | timid managers that are only playing defense, probably
               | the opposite of what the leadership team wants.
        
               | dolni wrote:
               | If you work in a technical role and you _don't_ have the
               | ability to break something, you're unlikely to be
               | contributing in a significant way. Likely that would make
               | you a junior developer whose every line of code is
               | heavily scrutinized.
               | 
               | Engineers should be experts and you should be able to
               | trust them to make reasonable choices about the
               | management of their projects.
               | 
               | That doesn't mean there can't be some checks in place,
               | and it doesn't mean that all engineers should be perfect.
               | 
               | But you also have to acknowledge that adding all of those
               | safeties has a cost. You can be a competent person who
               | requires fewer safeties or less competent with more
               | safeties.
               | 
               | Which one provides more value to an organization?
        
               | pm90 wrote:
               | > Which one provides more value to an organization?
               | 
               | Neither, they both provide the same value in the long
               | term.
               | 
               | Senior engineers cannot execute on everything they commit
               | to without having a team of engineers they work with. If
               | nobody trains junior engineers, the discipline would go
               | extinct.
               | 
               | Senior engineers provide value by building guardrails to
               | enable junior engineers to provide value by delivering
               | with more confidence.
        
               | jaywalk wrote:
               | You're not wrong, but it's possible that the organization
               | is small enough that it's just not feasible to have
               | enough safeguards that would prevent the outages John
               | caused. And in that case, it's probably best that John
               | not be promoted if he can't avoid those errors.
        
               | kortex wrote:
               | Current co is small. We are putting in the safeguards
               | from Day 1. Well, okay technically like day 120, the
               | first few months were a mad dash to MVP. But now that we
               | have some breathing room, yeah, we put a lot of emphasis
               | on preventing outages, detecting and diagnosing outages
               | promptly, documenting them, doing the whole 5-why's
               | thing, and preventing them in the future. We didn't have
               | to, we could have kept mad dashing and growth hacking.
               | But _very_ fortunately, we have a great culture here
               | (founders have lots of hindsight from past startups).
               | 
               | It's like a seed for crystal growth. Small company is
               | exactly the best time to implement these things, because
               | other employees will try to match the cultural norms and
               | habits.
        
               | jaywalk wrote:
               | Well, I started at the small company I'm currently at
               | around day 7300, where "source control" consisted of
               | asking the one person who was in charge of all source
               | code for a copy of the files you needed to work on, and
               | then giving the updated files back. He'd write down the
               | "checked out" files on a whiteboard to ensure that two
               | people couldn't work on the same file at the same time.
               | 
               | The fact that I've gotten it to the point of using git
               | with automated build and deployment is a small miracle in
               | itself. Not everybody gets to start from a clean slate.
        
             | mountainofdeath wrote:
              | There is no such thing as "no-blame" analysis. Even in the
              | best organizations with the best effort to avoid it, there
              | is always a subconscious "this person did it". It doesn't
              | help that these incidents serve as convenient openings for
              | others to exploit as they climb their own career ladder at
              | your expense.
        
             | [deleted]
        
             | AnIdiotOnTheNet wrote:
             | > I have to convince other leaders that John would do well
             | the next level up.
             | 
             | "Yes, John has made mistakes and he's always copped to them
             | immediately and worked to prevent them from happening again
             | in the future. You know who doesn't make mistakes? People
             | who don't do anything."
        
             | nix23 wrote:
             | You know why SO-teams, firefighters and military pilots are
             | so successful?
             | 
             | -You don't hide anything
             | 
             | -Errors will be made
             | 
             | -After training/mission everyone talks about the errors (or
             | potential ones) and how to prevent them
             | 
             | -You don't make the same error twice
             | 
             | Being afraid to make errors and learn from them creates a
             | culture of hiding, a culture of denial and especially being
             | afraid to take responsibility.
        
               | jacquesm wrote:
                | You can even make the same error twice, but you'd better
                | have a _much_ better explanation the second time around
                | than you had the first time, because you already
                | knew that what you did was risky and/or failure-prone.
               | 
               | But usually it isn't the same person making the same
               | mistake, usually it is someone else making the same
               | mistake and nobody thought of updating
               | processes/documentation to the point that the error would
               | have been caught in time. Maybe they'll fix that after
               | the second time ;)
        
           | maximedupre wrote:
           | Or just take responsibility. People will respect you for
           | doing that and you will demonstrate leadership.
        
             | artificial wrote:
             | Way more fun argument: Outages just, uh... uh... find a
             | way.
        
             | melony wrote:
             | And the guy who doesn't take responsibility gets promoted.
             | Employees are not responsible for failures of management to
             | set a good culture.
        
               | tomrod wrote:
               | Not in healthy organizations, they don't.
        
               | foobiekr wrote:
                | You can work an entire career and maybe get to enjoy life
                | in one healthy organization in all that time, even if you
                | work at a variety of companies. It just isn't that common,
                | though of course voicing the _ideals_ is very, very
                | common.
        
               | jacquesm wrote:
                | Once you reach a certain size there are surprisingly few
                | healthy organizations; most of them turn into
                | externalization engines with 4 beats per year.
        
               | kortex wrote:
               | The Gervais/Peter Principle is alive and well in many
               | orgs. That doesn't mean that when you have the
               | prerogative to change the culture, you just give up.
               | 
               | I realize that isn't an easy thing to do. Often the best
               | bet is to just jump around till you find a company that
               | isn't a cultural superfund site.
        
             | jrootabega wrote:
              | Cynical/realist take: Take responsibility and then hope that
              | your bosses already love you, that you can immediately come
              | up with a way to prevent it from happening again, and that
              | you can convince them to give you the resources to implement
              | it. Otherwise
             | your responsibility is, unfortunately, just blood in the
             | water for someone else to do all of that to protect the
             | company against you and springboard their reputation on the
             | descent of yours. There were already senior people scheming
             | to take over your department from your bosses, now they
             | have an excuse.
        
               | mym1990 wrote:
               | This seems like an absolutely horrid way of working or
               | doing 'office politics'.
        
               | geekbird wrote:
               | Yes, and I personally have worked in environments that do
               | just that. They said they didn't, but with management
               | "personalities" plus stack ranking, you know damn well
               | that they did.
        
           | kumarakn wrote:
            | I worked at Walmart Technology. I bravely wrote post-mortem
            | documents owning my team's faults (100+ people), owning them
            | both technically and culturally as their leader. I put
            | together a plan to fix the problems and executed it. Thought
            | that was the right thing to do. This happened two times in my
            | 10-year career there.
            | 
            | Both times I was called out as a failure in my performance
            | eval. The second time, I resigned and told them to find a
            | better leader.
            | 
            | Happy now that I'm out of such a shitty place.
        
             | gunapologist99 wrote:
             | That's shockingly stupid. I also worked for a major Walmart
             | IT services vendor in another life, and we always had to be
             | careful about how we handled them, because they didn't
             | always show a lot of respect for vendors.
             | 
             | On another note, thanks for building some awesome stuff --
             | walmart.com is awesome. I have both Prime and whatever-
             | they're-currently-calling Walmart's version and I love that
             | Walmart doesn't appear to mix SKU's together in the same
             | bin which seems to cause counterfeiting fraud at Amazon.
        
               | gnat wrote:
               | What's a "bin" in this context?
        
               | AfterAnimator wrote:
               | I believe he means a literal bin. E.g. Amazon takes
               | products from all their sellers and chucks them in the
               | same physical space, so they have no idea who actually
               | sold the product when it's picked. So you could have
               | gotten something from a dodgy 3rd party seller that
               | repackages broken returns, etc, and Amazon doesn't
               | maintain oversight of this.
        
               | notinty wrote:
               | Literally just a bin in a fulfillment warehouse.
               | 
               | An amazon listing doesn't guarantee a particular SKU.
        
               | gnat wrote:
               | Ah, whew. That's what I thought. Thanks! I asked because
               | we make warehouse and retail management systems and every
               | vendor or customer seems to give every word their own
               | meanings (e.g., we use "bin" in our discounts engine to
               | be a collection of products eligible for discounts, and
               | "barcode" has at least three meanings depending on to
               | whom you're speaking).
        
               | throwawayHN378 wrote:
               | Is WalMart.com awesome?
        
               | temp6363t wrote:
                | walmart.com's user experience sucks. My particular grudge
                | right now: I'm shopping to go pick up some stuff (and
                | indicate "in-store pickup"), and each time I search for
                | the next item, it resets that filter, making me click on
                | it again for each item on my list.
        
               | muvb00 wrote:
                | Walmart.com: am I the only one in the world who can't
                | view their site on my phone? I tried it on a couple of
               | devices and couldn't get it to work. Scaling is fubar. I
               | assumed this would be costing them millions/billions
               | since it's impossible to buy something from my phone
               | right now. S21+ in portrait on multiple browsers.
        
               | handrous wrote:
               | Almost every physical-store-chain company's website makes
               | it way too hard to do the thing I _nearly always_ want
               | out of their interface, which is to search the inventory
               | of the X nearest locations. They all want to push online
               | orders or 3rd-party-seller crap, it seems.
        
             | CobrastanJorji wrote:
             | Stories like this are why I'm really glad I stopped talking
             | to that Walmart Technology recruiter a few years ago. I
             | love working for places where senior leadership constantly
             | repeat war stories about "that time I broke the flagship
             | product" to reinforce the importance of blameless
             | postmortems. You can't fix the process if the people who
             | report to you feel the need to lie about why things go
             | wrong.
        
             | jacquesm wrote:
              | Props to you, and Walmart will never realize their loss.
              | Unfortunately. But one day there will be a headline (or even
              | a couple of them) and you will know that if you had been
              | there it might not have happened, and that in the end it is
              | Walmart's customers who will pay the price for that, not
              | their shareholders.
        
             | dnautics wrote:
             | that's awful. You should have been promoted for that.
        
             | abledon wrote:
              | Is it just 'ceremony' to be called out on those things
              | (even if the sum total is actually positive)?
        
               | ARandomerDude wrote:
                | > Happy now that I'm out of such a shitty place.
               | 
               | Doesn't sound like it.
        
             | emteycz wrote:
             | But hope you found a better place?
        
           | javajosh wrote:
           | I firmly believe in the dictum "if you ship it you own it".
           | That means you own all outages. It's not just an operator
           | flubbing a command, or a bit of code that passed review when
           | it shouldn't. It's all your dependencies that make your
           | service work. You own ALL of them.
           | 
           | People spend all this time threat modelling their stuff
           | against malefactors, and yet so often people don't spend any
            | time thinking about the threat model of _decay_. They don't
            | do it when adding new dependencies (build- or runtime), and
            | are therefore unprepared to handle an outage.
           | 
           | There's a good reason for this, of course: modern software
           | "best practices" encourage moving fast and breaking things,
           | which includes "add this dependency we know nothing about,
           | and which gives an unknown entity the power to poison our
           | code or take down our service, arbitrarily, at runtime, but
           | hey its a cool thing with lots of github stars and it's only
           | one 'npm install' away".
           | 
           | Just want to end with this PSA: Dependencies bad.
        
             | syngrog66 wrote:
             | if I were a black hat I would absolutely love GitHub and
             | all the various language-specific package systems out
             | there. giving me sooooo many ways to sneak arbitrary
             | tailored malicious code into millions of installs around
              | the world 24x7. sure, some of my attempts might get caught,
              | or might not lead to a valuable outcome for me. but the
              | percentage that does? can make it worth it. it's about scale
              | and a massive parallelization of infiltration attempts.
              | logic similar to the folks blasting out phishing emails or
              | scam calls.
             | 
             | I _love_ the ubiquity of thirdparty software from
             | strangers, and the lack of bureaucratic gatekeepers. but I
             | also _hate_ it in ways. and not enough people know about
             | the dangers of this second thing.
        
               | throwawayHN378 wrote:
                | And yet oddly enough the Earth continues to spin and the
                | internet continues to work. I think the system we have
                | now is necessarily the system that must exist (in this
                | particular case, not in all cases). Something more
               | centralized is destined to fail. And, while the open
               | source nature of software introduces vulnerabilities it
               | also fixes them.
        
               | syngrog66 wrote:
               | > And, while the open source nature of software
               | introduces vulnerabilities it also fixes them.
               | 
               | dat gap tho... which was my point. smart black hats will
               | be exploiting this gap, at scale. and the strategy will
               | work because the majority of folks seem to be either
               | lazy, ignorant or simply hurried for time.
               | 
               | and btw your 1st sentence was rude. constructive feedback
               | for the future
        
             | AtlasBarfed wrote:
             | That's a great philosophy.
             | 
             | Ok, let's take an organization, let's call them, say
             | Ammizzun. Totally not Amazon. Let's say you have a very
             | aggressive hire/fire policy which worked really well in
             | rapid scaling and growth of your company. Now you have a
             | million odd customers highly dependent on systems that were
             | built by people that are now one? two? three? four?
             | hire/fire generations up-or-out or cashed-out cycles ago.
             | 
             | So.... who owns it if the people that wrote it are
             | lllloooooonnnnggg gone? Like, not just long gone one or two
             | cycles ago so some institutional memory exists. I mean,
             | GONE.
        
               | javajosh wrote:
               | A lot can go wrong as an organization grows, including
               | loss of knowledge. At amazon "Ownership" officially rests
               | with the non-technical money that owns voting shares.
               | They control the board who controls the CEO. "Ownership"
               | can be perverted to mean that you, a wage slave, are
               | responsible for the mess that previous ICs left behind.
               | The obvious thing to do in such a circumstance is quit
               | (or don't apply). It is unfair and unpleasant to be
               | treated in a way that gives you responsibility but no
                | authority, and to participate in maintaining (and
               | extending) that moral hazard, and as long as there are
               | better companies you're better off working for them.
        
             | Mezzie wrote:
             | It's also a nightmare for software preservation. There's
             | going to be a lot from this era that won't be usable 80
             | years from now because everything is so interdependent and
              | impossible to archive. It's going to be as messy and
              | irretrievable as the Web was before the Internet Archive
              | and the Wayback Machine.
        
             | 88913527 wrote:
             | Should I be penalized if an upstream dependency, owned by
             | another team, fails? Did I lack due diligence in choosing
             | to accept the risk that the other team couldn't deliver?
             | These are real problems in the micro-services world,
             | especially since I own UI and there are dozens of teams
             | pumping out services, and I'm at the mercy of all of them.
             | The best I can do is gracefully fail when services don't
             | function in a healthy state.
        
               | NikolaeVarius wrote:
               | > Should I be penalized if an upstream dependency, owned
               | by another team, fails?
               | 
               | Yes
               | 
               | > Did I lack due diligence in choosing to accept the risk
               | that the other team couldn't deliver?
               | 
               | Yes
        
               | bandyaboot wrote:
               | Where does this mindset end? Do I lack due diligence by
               | choosing to accept that the cpu microcode on the system
               | I'm deploying to works correctly?
        
               | unionpivo wrote:
                | If it's a brand new RISC-V CPU that was just released 5
                | minutes ago, and nobody has really tested it, then yes.
                | 
                | If it's a standard CPU that everybody else uses, and it's
                | not known to be bad, then no.
                | 
                | Same for software. Is it OK to have a dependency on AWS
                | services? Their history shows yes. A dependency on a brand
                | new SaaS product? Nothing mission critical.
                | 
                | Or npm/crates/pip packages. Packages that have been
                | around and steadily maintained for a few years, and have
                | active users, are worth checking out. Some random project
                | from a single developer? Consider vendoring (and owning it
                | if necessary).
        
               | jrockway wrote:
               | Why? Intel has Spectre/Meltdown which erased like half of
               | everyone's capacity overnight.
        
               | treis wrote:
               | You choose the CPU and you choose what happens in a
               | failure scenario. Part of engineering is making choices
               | that meet the availability requirements of your service.
               | And part of that is handling failures from dependencies.
               | 
               | That doesn't extend to ridiculous lengths but as a rule
               | you should engineer around any single point of failure.
        
               | NikolaeVarius wrote:
               | Yes? If you are worried about CPU microcode failing, then
                | you do a NASA and have multiple CPU architectures doing
                | the calculations in a voting scheme. These are not unsolved
               | problems.
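                | 
                | A toy illustration of that voting idea (nothing like
                | NASA's actual implementation; the "lanes" here are just
                | redundant computations whose results get compared):
                | 
                |     from collections import Counter
                | 
                |     def vote(results):
                |         """Majority vote over redundant lanes (TMR-style)."""
                |         winner, count = Counter(results).most_common(1)[0]
                |         if count * 2 <= len(results):
                |             raise RuntimeError(f"no majority: {results}")
                |         return winner
                | 
                |     # Three independent lanes computing the same value;
                |     # one has returned a corrupted result.
                |     lanes = [42, 42, 41]
                |     print(vote(lanes))  # -> 42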
        
               | javajosh wrote:
               | JPL goes further and buys multiple copies of all hardware
               | and software media used for ground systems, and keeps
               | them in storage "just in case". It's a relatively cheap
               | insurance policy against the decay of progress.
        
               | obstacle1 wrote:
               | Say during due diligence two options are uncovered: use
               | an upstream dependency owned by another team, or use that
               | plus a 3P vendor for redundancy. Implementing parallel
               | systems costs 10x more than the former and takes 5x
               | longer. You estimate a 0.01% chance of serious failure
               | for the former, and 0.001% for the latter.
               | 
               | Now say you're a medium sized hyper-growth company in a
               | competitive space. Does spending 10 times more and
               | waiting 5 times longer for redundancy make business
               | sense? You could argue that it'd be irresponsible to
               | over-engineer the system in this case, since you delay
               | getting your product out and potentially lose $ and
               | ground to competitors.
               | 
               | I don't think a black and white "yes, you should be
               | punished" view is productive here.
        
               | bityard wrote:
               | You and many others here may be conflating two concepts
               | which are actually quite separate.
               | 
               | Taking blame is a purely punitive action and solves
               | nothing. Taking responsibility means it's your job to
               | correct the problem.
               | 
               | I find that the more "political" the culture in the
               | organization is, the more likely everyone is to search
               | for a scapegoat to protect their own image when a mistake
               | happens. The higher you go up in the management chain,
               | the more important vanity becomes, and the more you see
               | it happening.
               | 
               | I have made plenty of technical decisions that turned out
               | to be the wrong call in retrospect. I took
               | _responsibility_ for those by learning from the mistake
               | and reversing or fixing whatever was implemented.
               | However, I never willfully took _blame_ for those
               | mistakes because I believed I was doing the best job I
               | could at the time.
               | 
               | Likewise, the systems I manage sometimes fail because
               | something that another team manages failed. Sometimes
               | it's something dumb and could have easily been prevented.
                | In these cases, it's easy to point blame and say, "Not our
               | fault! That team or that person is being a fuckup and
               | causing our stuff to break!" It's harder but much more
               | useful to reach out and say, "hey, I see x system isn't
               | doing what we expect, can we work together to fix it?"
        
               | cyanydeez wrote:
               | Every argument I have on the internet is between
               | prescriptive and descriptive language.
               | 
               | People tend to believe that if you can describe a problem
               | that means you can prescribe a solution. Often times, the
               | only way to survive is to make it clear that the first
               | thing you are doing is describing the problem.
               | 
               | After you do that, and it's clear that's all you are
               | doing, then you follow up with a prescriptive description
               | where you place clearly what could be done to manage a
               | future scenario.
               | 
               | If you don't create this bright line, you create a
               | confused interpretation.
        
               | javajosh wrote:
               | My comment was made from the relatively simpler
               | entrepreneurial perspective, not the corporate one. Corp
               | ownership rests with people in the C-suite who are
               | social/political lawyer types, not technical people. They
               | delegate responsibility but not authority, because they
               | can hire people, even smart people, to work under those
               | conditions. This is an error mode where "blame" flows
               | from those who control the money to those who control the
               | technology. Luckily, not all money is stupid so some
               | corps (and some parts of corps) manage to function even
               | in the presence of risk and innovation failures. I mean
               | the whole industry is effectively a distributed R&D
               | budget that may or may not yield fruit. I suppose this is
               | the market figuring out whether iterated R&D makes sense
               | or not. (Based on history, I'd say it makes a lot of
               | sense.)
        
               | [deleted]
        
               | javajosh wrote:
               | I wish you wouldn't talk about "penalization" as if it
               | was something that comes from a source of authority.
               | _Your customers are depending on you_ , and you've let
               | them down, and the reason that's bad has nothing to do
               | with what your boss will do to you in a review.
               | 
               | The injustice that can and does happen is that you're
               | explicitly given a narrow responsibility during
               | development, and then a much broader responsibility
               | during operation. This is patently unfair, and very
               | common. For something like a failed uService you want to
               | blame "the architect" that didn't anticipate these system
               | level failures. What is the solution? Have plan b (and
               | plan c) ready to go. If these services don't exist, then
               | you must build them. It also implies a level of
               | indirection that most systems aren't comfortable with,
               | because we want to consume services directly (and for
               | good reason) but reliability _requires_ that you never,
               | ever consume a service directly, but instead from an in-
               | process location that is failure aware.
               | 
               | This is why reliable software is hard, and engineers are
               | expensive.
               | 
               | Oh, and it's also why you generally do NOT want to defer
               | the last build step to runtime in the browser. If you
               | start combining services on both the client and server,
               | you're in for a world of hurt.
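                | 
                | A bare-bones sketch of that "failure-aware in-process
                | location" idea, in the spirit of a circuit breaker (the
                | thresholds and names are made up for illustration, not a
                | production-ready implementation):
                | 
                |     import time
                | 
                |     class CircuitBreaker:
                |         """Wraps a dependency; fails fast once it looks unhealthy."""
                | 
                |         def __init__(self, max_failures=5, reset_after=30.0):
                |             self.max_failures = max_failures
                |             self.reset_after = reset_after
                |             self.failures = 0
                |             self.opened_at = None
                | 
                |         def call(self, fn, *args, fallback=None, **kwargs):
                |             # While "open", skip the dependency and use the fallback
                |             # (or None) instead of hammering a failing service.
                |             if self.opened_at is not None:
                |                 if time.monotonic() - self.opened_at < self.reset_after:
                |                     return fallback(*args, **kwargs) if fallback else None
                |                 self.opened_at = None  # half-open: try the dependency again
                |                 self.failures = 0
                |             try:
                |                 result = fn(*args, **kwargs)
                |                 self.failures = 0
                |                 return result
                |             except Exception:
                |                 self.failures += 1
                |                 if self.failures >= self.max_failures:
                |                     self.opened_at = time.monotonic()
                |                 if fallback is not None:
                |                     return fallback(*args, **kwargs)
                |                 raise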
        
               | bostik wrote:
               | Not penalised no, but questioned as to how well your
               | graceful failure worked in the end.
               | 
               | Remember: it may not be your fault, but it still is your
               | problem.
        
               | fragmede wrote:
                | An analogy for illustrating this is:
               | 
               | You get hit by a car and injured. The accident is the
               | other driver's fault, but getting to the ER is your
               | problem. The other driver may help and call an ambulance,
               | but they might not even be able to help you if they also
               | got hurt in the car crash.
        
             | ssimpson wrote:
              | when working on CloudFiles, we often had monitoring for our
              | limited dependencies that was better than their own
              | monitoring. Don't just know what your stuff is doing, but
             | what your whole dependency ecosystem is doing and know when
             | it all goes south. also helps to learn where and how you
             | can mitigate some of those dependencies.
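              | 
              | A trivial sketch of what that can look like (the endpoints
              | are placeholders; in practice you'd probe the actual
              | operations you depend on and feed the results into your own
              | alerting rather than trusting the vendor's status page):
              | 
              |     import time
              |     import urllib.request
              |     import urllib.error
              | 
              |     # Placeholder health endpoints for external dependencies.
              |     DEPENDENCIES = {
              |         "object-store": "https://storage.example.com/health",
              |         "queue": "https://queue.example.com/health",
              |     }
              | 
              |     def probe(url, timeout=2.0):
              |         """Return (healthy, latency_seconds) as seen from our side."""
              |         start = time.monotonic()
              |         try:
              |             with urllib.request.urlopen(url, timeout=timeout) as resp:
              |                 healthy = 200 <= resp.status < 300
              |         except (urllib.error.URLError, OSError):
              |             healthy = False
              |         return healthy, time.monotonic() - start
              | 
              |     for name, url in DEPENDENCIES.items():
              |         healthy, latency = probe(url)
              |         print(f"{name}: healthy={healthy} latency={latency:.3f}s")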
        
               | foobiekr wrote:
               | This. We found very big, serious issues with our anti-
               | DDOS provider because their monitoring sucked compared to
               | ours. It was a sobering reality check when we realized
               | that.
        
           | insaneirish wrote:
           | > No-blame analysis is a much better pattern. Everyone wins.
           | It's about building the system that builds the system. Stuff
           | broke; fix the stuff that broke, then fix the things that let
           | stuff break.
           | 
           | Yea, except it doesn't work in practice. I work with a lot of
           | people who come from places with "blameless" post-mortem
           | 'culture' and they've evangelized such a thing extensively.
           | 
           | You know what all those people have proven themselves to
           | really excel at? _Blaming people._
        
             | kortex wrote:
             | Ok, and? I don't doubt it fails in places. That doesn't
             | mean that it doesn't work in practice. Our company does it
             | just fine. We have a high trust, high transparency system
             | and it's wonderful.
             | 
             | It's like saying unit tests don't work in practice because
             | bugs got through.
        
               | kortilla wrote:
               | Have you ever considered that the "no-blame" postmortems
               | you are giving credit for everything are just a side
               | effect of living in a high trust, high transparency
               | system?
               | 
               | In other words, "no-blame" should be an emergent property
               | of a culture of trust. It's not something you can
               | prescribe.
        
               | kortex wrote:
               | Yes, exactly. Culture of trust is the root. Many
               | beneficial patterns emerge when you can have that: more
               | critical PRs, blameless post-mortems, etc.
        
         | maximedupre wrote:
         | Damn, he had serious PTSD lol
        
         | jonhohle wrote:
         | On the retail/marketplace side this wasn't my experience, but
         | we also didn't have any public dashboards. On Prime we
         | occasionally had to refund in bulk, and when it was called for
          | (internally or externally) we would write up a detailed post-
         | mortem. This wasn't fun, but it was never about blaming a
         | person and more about finding flaws in process or monitoring.
        
         | mountainofdeath wrote:
         | Former AWSser. I can totally believe that happened and
         | continues to happen in some teams. Officially, it's not
         | supposed to be done that way.
         | 
         | Some AWS managers and engineers bring their corporate cultural
         | baggage with them when they join AWS and it takes a few years
         | to unlearn it.
        
         | hinkley wrote:
         | I am finding that I have a very bimodal response to "He did
         | it". When I write an RCA or just talk about near misses, I may
         | give you enough details to figure out that Tom was the one who
         | broke it, but I'm not going to say Tom on the record anywhere,
         | with one extremely obvious exception.
         | 
         | If I think Tom has a toxic combination of poor judgement,
         | Dunning-Kruger syndrome, and a hint of narcissism (I'm not sure
         | but I may be repeating myself here), such that he won't listen
         | to reason and he actively steers others into bad situations
         | (and especially if he then disappears when shit hits the fan),
         | then I will nail him to a fucking cross every chance I get.
         | Public shaming is only a tool for getting people to discount
         | advice from a bad actor. If it comes down to a vote between my
         | idea and his, then I'm going to make sure everyone knows that
         | his bets keep biting us in the ass. This guy kinda sounds like
         | the Toxic Tom.
         | 
         | What is important when I turned out to be the cause of the
         | issue is a bit like some court cases. Would a reasonable person
         | in this situation have come to the same conclusion I did? If
         | so, then I'm just the person who lost the lottery. Either way,
         | fixing it for me might fix it for other people. Sometimes the
         | answer is, "I was trying to juggle three things at once and a
         | ball got dropped." If the process dictated those three things
         | then the process is wrong, or the tooling is wrong. If someone
         | was asking me questions we should think about being more pro-
         | active about deflecting them to someone else or asking them to
         | come back in a half hour. Or maybe I shouldn't be trying to
         | watch training videos while babysitting a deployment to
         | production.
         | 
         | If you never say "my bad" then your advice starts to sound like
         | a lecture, and people avoid lectures so then you never get the
         | whole story. Also as an engineer you should know that owning a
         | mistake early on lets you get to what most of us consider the
         | interesting bit of _solving the problem_ instead of talking
         | about feelings for an hour and then using whatever is left of
         | your brain afterward to fix the problem. In fact in some cases
         | you can shut down someone who is about to start a rant (which
         | is funny as hell because they look like their head is about to
         | pop like a balloon when you say,  "yep, I broke it, let's move
         | on to how do we fix it?")
        
           | bgribble wrote:
           | To me, the point of "blameless" PM is not to hide the
           | identity of the person who was closest to the failure point.
           | You can't understand what happened unless you know who did
           | what, when.
           | 
           | "Blameless" to me means you acknowledge that the ultimate
           | problem isn't that someone made a mistake that caused an
           | outage. The problem is that you had a system in place where
           | someone could make a single mistake and cause an outage.
           | 
           | If someone fat-fingers a SQL query and drops your database,
           | the problem isn't that they need typing lessons! If you put a
           | DBA in a position where they have to be typing SQL directly
           | at a production DB to do their job, THAT is the cause of the
           | outage, the actual DBA's error is almost irrelevant because
           | it would have happened eventually to someone.
        
             | hinkley wrote:
             | Naming someone is how you discover that not everyone in the
             | organization believes in Blamelessness. Once it's out it's
             | out, you can't put it back in.
             | 
             | It's really easy for another developer to figure out who
             | I'm talking about. Managers can't be arsed to figure it
             | out, or at least pretend like they don't know.
        
         | StreamBright wrote:
         | Yep I can confirm that. The process when the outage is caused
         | by you is called COE (correction of errors). I was oncall once
         | for two teams because I was switching teams and I got 11
         | escalations in 2 hours. 10 of these were caused by an overly
         | sensitive monitoring setting. The 11th was a real one. Guess
         | which one I ignored. :)
        
         | kache_ wrote:
          | Sometimes, these large companies tack on too many "necessary"
          | incident "remediation" actions with Arbitrary Due Date SLAs
          | that completely derail any ongoing work. And ongoing,
         | strategically defined ""muh high impact"" projects are what get
         | you promoted, not doing incident remediations.
         | 
         | When you get to the level you want, you get to not really give
         | a shit and actually do The Right Thing. However, for all of the
         | engineers clamoring to get out of the intermediate brick laying
          | trenches, opening an incident can create perverse incentives.
        
           | bendbro wrote:
           | In my experience this is the actual reason for fear of the
           | formal error correction process.
        
           | pts_ wrote:
           | Politicized cloud meh.
        
         | ashr wrote:
         | This is the exact opposite of my experience at AWS. Amazon is
         | all about blameless fact finding when it comes to root cause
         | analysis. Your company just hired a not so great engineer or
         | misunderstood him.
        
           | Insanity wrote:
           | Adding my piece of anecdata to this.. the process is quite
           | blameless. If a postmortem seems like it points blame, this
           | is pointed out and removed.
        
             | swiftcoder wrote:
             | Blameless, maybe, but not repercussion-less. A bad CoE was
             | liable to upend the team's entire roadmap and put their
             | existing goals at risk. To be fair, management was fairly
             | receptive to "we need to throw out the roadmap and push our
              | launch out to the following re:Invent", but it wasn't an
             | easy position for teams to be in.
        
         | kator wrote:
         | I've worked for Amazon for 4 years, including stints at AWS,
         | and even in my current role my team is involved in LSE's. I've
         | never seen this behavior, the general culture has been find the
         | problem, fix it, and then do root cause analysis to avoid it
         | again.
         | 
          | Jeff himself has said many times in All Hands and in public
          | that "Amazon is the best place to fail" - mainly because things
          | will break; it's not that they break that's interesting, it's
          | what you've learned and how you can avoid that problem in the
          | future.
        
           | jsperson wrote:
           | I guess the question is why can't you (AWS) fix the problem
           | of the status page not reflecting an outage? Maybe acceptable
           | if the console has a hiccup, but when www.amazon.com isn't
           | working right, there should be some yellow and red dots out
           | there.
           | 
            | With the size of your customer base, man-years were
            | collectively spent confirming the outage after checking the
            | status page.
        
             | andrewguenther wrote:
             | Because there's a VP approval step for updating the status
             | page and no repercussions for VPs who don't approve updates
             | in a timely manner. Updating the status page is fully
             | automated on both sides of VP approval. If the status page
             | doesn't update, it's because a VP wouldn't do it.
        
           | Eduard wrote:
           | LSE?
        
             | merciBien wrote:
             | Large Scale Event
        
         | dehrmann wrote:
         | > explanation about why the outage was someone else's fault
         | 
         | In my experience, it's rarely clear who was at fault for any
         | sort of non-trivial outage. The issue tends to be at interfaces
         | and involve multiple owners.
        
         | jimt1234 wrote:
         | Every incident review meeting I've ever been in starts out
         | like, _"This meeting isn't to place blame..."_, then, 5 minutes
         | later, it turns into the Blame Game.
        
         | mijoharas wrote:
         | That's a real shame, one of the leadership principles used to
         | be "be vocally self-critical" which I think was supposed to
         | explicitly counteract this kind of behaviour.
         | 
         | I think they got rid of it at some point though.
        
         | howdydoo wrote:
         | Manually updated status pages are an anti-pattern to begin
         | with. At that point, why not just call it a blog?
        
         | jacquesm wrote:
         | And this is exactly why you can expect these headlines to hit
         | with great regularity. These things are never a problem at the
         | individual level, they are always at the level of culture and
         | organization.
        
         | 300bps wrote:
         | _being at fault for an outage was one of the worst things that
         | could happen to you_
         | 
         | Imagine how stressful life would be thinking that you had to be
         | perfect all the time.
        
           | errcorrectcode wrote:
           | That's been most of my life. Welcome to perfectionism.
        
         | soheil wrote:
         | > I will always remember the look of sheer panic
         | 
         | I don't know if you're exaggerating or not, but even if true
         | why would anyone show that emotion about losing a job in the
         | worst case?
         | 
          | You certainly had a lot of relevant-to-today's-top-HN-post
          | stories throughout your career. And I'm less and less surprised
         | to continuously find PragmaticPulp as one of the top commenters
         | if not the top that resonates with a good chunk of HN.
        
         | sharpy wrote:
         | Haha... This bring back memories. It really depends on the org.
         | 
          | I've had pushback on my postmortems before because of
          | phrasing that could be construed as laying some of the blame
         | on some person/team when it's supposed to be blameless.
         | 
         | And for a long time, it was fairly blameless. You would still
         | be punished with the extra work of writing high quality
         | postmortems, but I have seen people accidentally bring down
         | critical tier-1 services and not be adversely affected in terms
         | of promotion, etc.
         | 
         | But somewhere along the way, it became politicized. Things like
         | the wheel of death, public grilling of teams on why they didn't
         | follow one of the thousands of best practices, etc, etc. Some
         | orgs are still pretty good at keeping it blameless at the
         | individual level, but... being a big company, your mileage may
         | vary.
        
           | hinkley wrote:
           | We're in a situation where the balls of mud made people
           | afraid to touch some things in the system. As experiences and
           | processes have improved we've started to crack back into
           | those things and guess what, when you are being groomed to
           | own a process you're going to fuck it up from time to time.
           | Objectively, we're still breaking production less often per
           | year than other teams, but we are breaking it, and that's
           | novel behavior, so we have to keep reminding people why.
           | 
           | The moment that affects promotions negatively, or your
           | coworkers throw you under the bus, you should 1) be assertive
           | and 2) proof-read your resume as a precursor to job hunting.
        
             | sharpy wrote:
              | Or problems just persisting, because the fix is easy, but
              | explaining it to others who do not work on the system is
              | hard. Esp. justifying why it won't cause an issue, and
              | being told that the fixes need to be done via scripts that
              | will only ever be used once, but nevertheless need to be
              | code reviewed and tested...
             | 
             | I wanted to be proactive and fix things before they became
             | an issue, but such things just drained life out of me, to
             | the point I just left.
        
         | staticassertion wrote:
         | I don't think anecdotes like this are even worth sharing,
         | honestly. There's so much context lost here, so much that can
         | be lost in translation. No one should be drawing any
         | conclusions from this post.
        
         | amzn-throw wrote:
         | It's popular to upvote this during outages, because it fits a
         | narrative.
         | 
         | The truth (as always) is more complex:
         | 
          | * No, this isn't the broad culture. It's not even a blip. These
          | are EXCEPTIONAL cases from extremely bad teams that - if and
          | when found out - would face dramatic intervention.
         | 
         | * The broad culture is blameless post-mortems. Not whose fault
         | is it. But what was the problem and how to fix it. And one of
         | the internal "Ten commandments of AWS availability" is you own
         | your dependencies. You don't blame others.
         | 
         | * Depending on the service one customer's experience is not the
         | broad experience. Someone might be having a really bad day but
         | 99.9% of the region is operating successfully, so there is no
         | reason to update the overall status dashboard.
         | 
         | * Every AWS customer has a PERSONAL health dashboard in the
         | console that should indicate _their_ experience.
         | 
         | * Yes, VP approval is needed to make any updates on the status
         | dashboard. But that's not as hard as it may seem. AWS
         | executives are extremely operation-obsessed, and when there is
         | an outage of any size are engaged with their service teams
         | immediately.
        
           | flerchin wrote:
            | We knew us-east-1 was unusable for our customers for 45
            | minutes before Amazon acknowledged anything was wrong _at
            | all_. We made decisions _in the dark_ to serve our customers,
            | because Amazon dragged their feet communicating with us. Our
           | customers were notified after 2 minutes.
           | 
           | It's not acceptable.
        
           | pokot0 wrote:
           | Hiding behind a throw away account does not help your point.
        
           | miken123 wrote:
           | Well, the narrative is sort of what Amazon is asking for,
           | heh?
           | 
           | The whole us-east-1 management console is gone, what is
           | Amazon posting for the management console on their website?
           | 
           | "Service degradation"
           | 
           | It's not a degradation if it's outright down. Use the red
           | status a little bit more often, this is a "disruption", not a
           | "degradation".
        
             | taurath wrote:
             | Yeah no kidding. Is there a ratio of how many people it has
             | to be working for to be in yellow rather than red? Some
             | internal person going "it works on my machine" while 99% of
             | customers are down.
        
             | whoknowswhat11 wrote:
             | I've always wondered why services are not counted down more
             | often. Is there some sliver of customers who have access to
             | the management console for example?
             | 
             | An increase in error rates - no biggie, any large system is
              | going to have errors. But when 80%+ of customer loads in
              | the region are impacted (across availability zones, for
              | whatever good those do) - that counts as down, doesn't it?
             | Error rates in one AZ - degraded. Multi-AZ failures - down?
        
               | mynameisvlad wrote:
               | SLAs. Officially acknowledging an incident means that
               | they now _have_ to issue the SLA credits.
        
               | res0nat0r wrote:
               | The outage dashboard is normally only updated if a
               | certain $X percent of hosts / service is down. If the EC2
               | section were updated every time a rack in a datacenter
               | went down, it would be red 24x7.
               | 
               | It's only updated when a large percentage of customers
               | are impacted, and most of the time this number is less
               | than what the HN echo chamber makes it appear to be.
        
               | mynameisvlad wrote:
               | I mean, sure, there are technical reasons why you would
               | want to buffer issues so they're only visible if
               | something big went down (although one would argue that's
               | exactly what the "degraded" status means).
               | 
                | But if the official records say everything is green, a
                | customer is going to have to push a lot harder to get the
                | credits. There is a massive incentive to "stay green".
        
               | bwestpha wrote:
               | yes there were. I'm from central europe and we were at
               | least able to get some pages of the console in us-east-1
               | -but i assume this was more caching related. Even though
               | the console loaded and worked for listing some entries -
               | we weren't able to post a support case nor viewing SQS
               | messages etc.
               | 
               | So i aggree that degraded is not the proper wording - but
               | it's / was not completly vanished. so.... hard to tell
               | what is an common acceptable wording here.
        
               | vladvasiliu wrote:
               | From France, when I connect to "my personal health
               | dashboard" in eu-west-3, it says several services are
               | having "issues" in us-east-1.
               | 
               | To your point, for support center (which doesn't show a
               | region) it says:
               | 
               |  _Description
               | 
               | Increased Error Rates
               | 
               | [09:01 AM PST] We are investigating increased error rates
               | for the Support Center console and Support API in the US-
               | EAST-1 Region.
               | 
               | [09:26 AM PST] We can confirm increased error rates for
               | the Support Center console and Support API in the US-
               | EAST-1 Region. We have identified the root cause of the
               | issue and are working towards resolution. _
        
             | threecheese wrote:
             | I'm part of a large org with a large AWS footprint, and
             | we've had a few hundred folks on a call nearly all day. We
             | have only a few workloads that are completely down; most
             | are only degraded. This isn't a total outage, we are still
             | doing business in east-1. Is it "red"? Maybe! We're all
             | scrambling to keep the services running well enough for our
             | customers.
        
             | Thaxll wrote:
              | Because the console works just fine in us-east-2, and the
              | console entry on the status page does not display regions.
              | 
              | If the console works 100% in us-east-2 and not in us-east-1,
              | why would they mark the console as completely down for us-east?
        
             | keyle wrote:
              | Well you know, like when a rocket explodes, it's a sudden
             | and "unexpected rapid disassembly" or something...
             | 
             | And a cleaner is called a "floor technician".
             | 
             | Nothing really out of the ordinary for a service to be
             | called degraded while "hey, the cache might still be
             | working right?" ... or "Well you know, it works every other
             | day except today, so it's just degradation" :-)
        
           | lgylym wrote:
           | Come on, we all know managers don't want to claim an outage
           | till the last minute.
        
           | codegeek wrote:
           | "Yes, VP approval is needed to make any updates on the status
           | dashboard."
           | 
            | If services are clearly down, why is this needed? I can
            | understand the oversight required at a company like Amazon,
            | but this sounds strange to me. If services are clearly down,
           | I want that damn status update right away as a customer.
        
           | tekromancr wrote:
           | Oh, yes. Let me go look at the PERSONAL health dashboard
           | and... oh, I need to sign into the console to view it... hmm
        
           | oscribinn wrote:
           | 100 BEZOBUCKS(tm) have been deposited to your account for
           | this post.
        
           | mrsuprawsm wrote:
           | If your statement is true, then why is the AWS status page
           | widely considered useless, and everyone congregates on HN
           | and/or Twitter to actually know what's broken on AWS during
           | an outage?
        
             | andrewguenther wrote:
             | > Yes, VP approval is needed to make any updates on the
             | status dashboard. But that's not as hard as it may seem.
             | AWS executives are extremely operation-obsessed, and when
             | there is an outage of any size are engaged with their
             | service teams immediately.
             | 
             | My experience generally aligns with amzn-throw, but this
             | right here is why. There's a manual step here and there's
             | always drama surrounding it. The process to update the
             | status page is fully automated on both sides of this step,
             | if you removed VP approval, the page would update
             | immediately. So if the page doesn't update, it is always a
             | VP dragging their feet. Even worse is that lags in this
             | step were never discussed in the postmortem reviews that I
             | was a part of.
        
               | Frost1x wrote:
                | It's intentional plausible deniability. By creating the
                | manual step you can shift blame away. It's just like the
                | concept of personal health dashboards, which are designed
                | to keep an asymmetry in reliability information between
                | the host and the client, limiting each customer to their
                | own personal anecdata. On top of all of this, the metrics
                | are pretty arbitrary.
                | 
                | Let's not pretend businesses haven't been intentionally
                | advertising in deceitful ways for decades if not hundreds
                | of years. This just happens to be the current strategy in
                | tech of lying to and deceiving customers to limit
                | liability, responsibility, and recourse actions.
                | 
                | To be fair, it's not just Amazon; they just happen to be
                | the largest and most targeted whipping boy on the block.
                | Few businesses will admit to liability under any
                | circumstances. Liability always has to be assessed
                | externally.
        
           | amichal wrote:
           | I have in the past directed users here on HN who were
           | complaining about https://status.aws.amazon.com to the
           | Personal Health Dashboard at https://phd.aws.amazon.com/ as
            | well. Unfortunately, even though the account I was logged into
            | this time only has a single S3 bucket in the EU, is billed
            | through the EU, and has zero direct dependencies on the US,
            | the Personal Health Dashboard was ALSO throwing "The request
            | processing has failed because of an unknown error" messages.
            | Whatever the problem was this time, it had global effects for
            | the majority of users of the Console; the internet noticed it
            | for over 30 minutes before either the status page or the PHD
            | were able to report it. There will be no explanation, and the
            | official status page logs will say there were "increased API
            | failure rates" for an hour.
            | 
            | Now I guess it's possible that the 1000s and 1000s of us who
            | noticed and commented are some tiny fraction of the user base,
            | but if that's so you could at least publish a follow-up like
            | other vendors do that says something like "0.00001% of API
            | requests failed, affecting an estimated 0.001% of our users at
            | the time."
        
           | yaacov wrote:
           | Can't comment on most of your post but I know a lot of Amazon
           | engineers who think of the CoE process (Correction of Error,
           | what other companies would call a postmortem) as punitive
        
             | jrd259 wrote:
             | I don't know any, and I have written or reviewed about 20
        
             | andrewguenther wrote:
             | They aren't _meant_ to be, but shitty teams are shitty. You
             | can also create a COE and assign it to another team. When I
             | was at AWS, I had a few COEs assigned to me by disgruntled
             | teams just trying to make me suffer and I told them to
             | pound sand. For my own team, I wrote COEs quite often and
             | found it to be a really great process for surfacing
             | systemic issues with our management chain and making real
             | improvements, but it needs to be used correctly.
        
           | nanis wrote:
           | > * Depending on the service one customer's experience is not
           | the broad experience. Someone might be having a really bad
           | day but 99.9% of the region is operating successfully, so
           | there is no reason to update the overall status dashboard.
           | 
           | https://rachelbythebay.com/w/2019/07/15/giant/
        
           | marcinzm wrote:
           | >* Every AWS customer has a PERSONAL health dashboard in the
           | console that should indicate their experience.
           | 
           | You mean the one that is down right now?
        
             | CobrastanJorji wrote:
             | Seems like it's doing an exemplary job of indicating their
             | experience, then.
        
           | ultimoo wrote:
           | > ...you own your dependencies. You don't blame others.
           | 
           | Agreed, teams should invest resources in architecting their
            | systems in a way that can withstand broken dependencies. How
            | do AWS teams account for "core" dependencies (e.g. auth)
           | that may not have alternatives?
        
             | gunapologist99 wrote:
             | This is the irony of building a "reliable" system across
             | multiple AZ's.
        
           | AtlasBarfed wrote:
           | Because OTHERWISE people might think AMAZON is a
           | DYSFUNCTIONAL company that is beginning to CRATER under its
           | HORRIBLE work culture and constant H/FIRE cycle.
           | 
           | See, AWS is basically turning into a long standing utility
           | that needs to be reliable.
           | 
           | Hey, do most institutions like that completely turn over
           | their staff every three years? Yeah, no.
           | 
           | Great for building it out and grabbing market share.
           | 
           | Maybe not for being the basis of a reliable substrate of the
           | modern internet.
           | 
            | There are dozens of bespoke systems that keep AWS afloat
            | (disclosure: I have friends who worked there, and there are,
            | and also Conway's law). But if the people who wrote them are
           | three generations of HIRE/FIRE ago....
           | 
           | Not good.
        
             | ctvo wrote:
             | > Maybe not for being the basis of a reliable substrate of
             | the modern internet.
             | 
             | Maybe THEY will go to a COMPETITOR and THINGS MOVE ON if
             | it's THAT BAD. I wasn't sure what the pattern for all caps
             | was, so just giving it a shot there. Apologies if it's
             | incorrect.
        
               | AtlasBarfed wrote:
               | I was mocking the parent, who was doing that. Yes it's
               | awful. Effective? Sigh, yes. But awful.
        
           | [deleted]
        
           | the-pigeon wrote:
           | What?!
           | 
           | Everybody is very slow to update their outage pages because
           | of SLAs. It's in a company's financial interest to deny
           | outages and when they are undeniable to make them appear as
           | short as possible. Status pages updating slowly is definitely
           | by design.
           | 
            | There's no large dev platform I've used where this wasn't true
            | of its status page.
        
           | jjoonathan wrote:
           | I haven't asked AWS employees specifically about blameless
           | postmortems, but several of them have personally corroborated
           | that the culture tends towards being adversarial and
           | "performance focused." That's a tough environment for
           | blameless debugging and postmoretems. Like if I heard that
           | someone has a rain forest tree-frog living happily in their
           | outdoor Arizona cactus garden, I have doubts.
        
             | azinman2 wrote:
             | When I was at Google I didn't have a lot of exposure to the
              | public infra side. However, I do remember back in 2008, when
              | a colleague was working on the routing side of YouTube, he
              | made a change that cost millions of dollars in mere hours
              | before he noticed and reverted it. He mentioned this to the
              | larger team, which applauded during a tech talk. I can't
              | generalize the culture differences between Amazon and Google,
              | but at least in that one moment, Google culture seemed to
              | accept that errors happen, and that they get noticed and
              | fixed without harming the perceived performance of those
              | responsible.
        
               | wolverine876 wrote:
               | While I support that, how are the people involved
               | evaluated?
        
               | abdabab wrote:
               | Google puts automation or process in place to avoid
               | outages rather than pointing fingers. If an engineer
               | causes an outage by mistake and then works to ensure that
               | would never happen again, he made a positive impact.
        
         | 1-6 wrote:
         | Perhaps reward structure should be changed to incentivize the
         | post-mortems. There could be several flaws that run
         | underreported otherwise.
         | 
          | We may run into the problem of everything being documented, and
          | possibly of deliberate acts, but for a service that relies
          | heavily on uptime, that's a small price to pay for a bulletproof
          | operation.
        
           | A4ET8a8uTh0 wrote:
           | Then we would drown in a sea of meetings and 'lessons
           | learned' emails. There is a reason for post-mortems, but
           | there has to be balance.
        
             | 1-6 wrote:
             | I find post-mortems interesting to read through especially
             | when it's not my fault. Most of them would probably be
             | routine to read through but there are occasional ones that
             | make me cringe or laugh.
             | 
              | Post-mortems can sometimes be thought of like safety
              | training. There is a big imbalance of time dedicated to
              | learning proper safety handling just for those occasional
              | small incidents.
        
               | hinkley wrote:
               | Does Disney still play the "Instructional Videos" series
               | starring Goofy where he's supposed to be teaching you how
               | to do something and instead we learn how NOT to do
               | something? Or did I just date myself badly?
        
         | throwaway82931 wrote:
         | This fits with everything I've heard about terrible code
         | quality at Amazon and engineers working ridiculous hours to
         | close tickets any way they can. Amazon as a corporate entity
         | seems to be remarkably distrustful of and hostile to its labor
         | force.
        
         | mbordenet wrote:
         | When I worked for AMZN (2012-2015, Prime Video & Outbound
         | Fulfillment), attempting to sweep issues under the rug was a
         | clear path to termination. The Correction-Of-Error (COE)
         | process can work wonders in a healthy, data-driven, growth-
         | mindset culture. I wonder if the ex-Amazonian you're referring
          | to did not leave AMZN of their own accord?
         | 
         | Blame deflection is a recipe for repeat outages and unhappy
         | customers.
        
           | PragmaticPulp wrote:
           | > I wonder if the ex-Amazonian you're referring to did not
            | leave AMZN of their own accord?
           | 
           | Entirely possible, and something I've always suspected.
        
         | taf2 wrote:
         | What if they just can't access the console to update the status
         | page...
        
           | Slartie wrote:
           | They could still go into the data center, open up the status
           | page servers' physical...ah wait, what if their keyfobs don't
           | work?
        
         | soheil wrote:
          | This may not actually be that bad of a thing. If you think
          | about it, if they're fighting tooth and nail to keep the status
          | page green, that tells you they were probably doing the same at
          | every step of the way before the failure became imminent. Gotta
          | have respect for that.
        
         | mrweasel wrote:
         | That's idiotic, the service is down regardless. If you foster
         | that kind of culture, why have a status page at all?
         | 
          | It makes AWS engineers look stupid, because it looks like they
         | are not monitoring their services.
        
           | mountainofdeath wrote:
           | The status page is as much a political tool as a technical
           | one. Giving your service a non-green state makes your entire
            | management chain responsible. You don't want to be the one
            | that upsets some VP's advancement plans.
        
           | nine_zeros wrote:
           | > It make AWS engineers look stupid, because it looks like
           | they are not monitoring their services.
           | 
           | Management.
        
       | thefourthchime wrote:
       | If someone needs to get to the console, you can make a url like
       | this:
       | 
       | https://us-west-1.console.aws.amazon.com/
        
         | yabones wrote:
         | Works for a lot of things, but not Route53... Which is great
         | because that's the only thing I need to do in AWS today :)
        
       | hulahoop wrote:
       | I heard from my cousin who works at an Amazon warehouse that the
       | conveyor belts stopped working and items were messed up and
       | getting randomly removed off the belts.
        
       | joshstrange wrote:
       | This seems to be affecting Audible as well. I can't buy a book
       | which sucks since I just finished the previous one in the series
       | and I'm stuck in bed sick.
        
       | griffinkelly wrote:
       | Hosting and processing all the photos for the California
        | International Marathon on EC2; this doesn't make dealing
        | with impatient customers any easier.
        
       | bennyp101 wrote:
       | Yea, Amazon Music has gone down for me in the UK now :(
       | 
       | Looks like it might be getting worse
        
       | [deleted]
        
       | adamtester wrote:
       | eu-west-1 is down for us
        
         | PeterBarrett wrote:
          | I hope you remain the only person who has said that; one
          | region being gone is enough for me!
        
           | adamtester wrote:
           | I should have said, only the Console and CLI was down for us,
           | our services remained up!
        
             | ComputerGuru wrote:
             | Console has lots of us-east-1 dependencies.
        
       | strictfp wrote:
       | "some customers may experience a slight elevation in error rates"
       | --> everything is on fire
        
         | kello wrote:
          | ah, corporate speak at its finest
        
         | Xenoamorphous wrote:
         | Maybe when the error rate hits 100% they'll say "error rate now
         | stable".
        
           | retbull wrote:
           | "Only direction is up"
        
         | hvgk wrote:
         | ECR and API are fucked so it's impossible to scale anything to
         | the point fire can come out :)
        
         | soco wrote:
         | I'm also experiencing a slight elevation in billing rates - got
         | alarms for 10x consumption and I can't check on them... Edit:
         | also API access is failing, terraform can't take anything down
         | because "connection was forcibly closed"
        
           | gchamonlive wrote:
           | Imagine triggering a big instance for machine learning or a
           | huge EMR cluster that would otherwise be short lived and not
           | being able to scale it down.
           | 
           | I am quite sure the AWS support will be getting many refund
           | requests over the course of the week.
        
       | lordnacho wrote:
       | This got me thinking, are there any major chat services that
       | would go down if a particular AWS/GCP/etc data centre went down?
       | 
       | You don't want your service to go down, plus your team's comms at
       | the same time.
        
         | tyre wrote:
         | Slack going down is a godsend for developer productivity.
        
         | wizwit999 wrote:
         | You should multi region something like that.
        
         | milofeynman wrote:
         | Remember when Facebook went down? Fb, Whatsapp, messenger,
         | Instagram were all down. Don't know what they use internally
        
         | DoctorOW wrote:
         | Slack is pretty much full AWS, I've been switched over to Teams
         | so I can't check.
        
           | rodiger wrote:
           | Slack is working fine for me
        
           | umanwizard wrote:
           | My company's Slack instance is currently fine.
        
         | ChadyWady wrote:
          | I'm impressed that Amazon Chime still appears to be working
         | right now. It's sad because this is the one service that could
         | go down and be a net benefit.
        
         | perydell wrote:
          | We have an SMS text thread with about 12 people on which we send
          | one message on the first of every month, to make sure it is
          | tested and ready to be used for communications if all other
          | comms networks are down.
        
         | dahak27 wrote:
         | Especially if enough Amazon internal tools rely on it - would
         | be funny if there were a repeat of the FB debacle where Amazon
         | employees somehow couldn't communicate/get back into their
         | offices because of the problem they were trying to fix
        
           | umanwizard wrote:
           | Last I knew, Amazon used all Microsoft stuff for business
           | communication.
        
             | itsyaboi wrote:
             | Slack, as of last year.
             | 
             | https://slack.com/blog/news/slack-aws-drive-development-
             | agil...
        
               | nostrebored wrote:
               | And before that, Amazon Chime was the messaging and
               | conferencing tool. Now that I'm not using it, I actually
               | miss it a lot!
        
               | shepherdjerred wrote:
               | I cried tears of joy when Amazon finally switched to
               | Slack last year
        
               | manquer wrote:
               | Slack uses Chime for A/V under the hood so I don't think
               | it is all that different for non text.[1]
               | 
               | [1] https://www.theverge.com/2020/6/4/21280829/slack-
               | amazon-aws-...
        
       | shaftoe444 wrote:
       | Company wide can't log in to console. Many, many SNS and SQS
       | errors in us-east-1.
        
         | _wldu wrote:
         | Same here and I use us-east-2.
        
       | 2bitlobster wrote:
       | I wonder what the cost is on the US economy
        
       | taormina wrote:
       | Have folks considered a class-action lawsuit against these
       | blatantly fraudulent SLAs to recoup costs?
        
         | itsdrewmiller wrote:
         | In my experience, despite whatever is published, companies will
          | privately acknowledge and pay their SLA terms. (Which still only
         | gets you, like, one day's worth of reimbursement if you're
         | lucky.)
        
           | adrr wrote:
           | Retail SLAs are a small risk compared to the enterprise SLAs
           | where an outage like this could cost Amazon tens of millions.
           | I assume these contracts have discount tiers based on
           | availability and anything below 99% would be a 100% discount
           | for that bill cycle.
        
             | jaywalk wrote:
             | But those enterprise SLAs are the ones they'll be paying
             | out. Retail SLAs are the ones that you'll have to fight
             | for.
        
       | larrik wrote:
       | We are having a number of rolling issues, but the site is sort of
       | up? I worry it'll get worse before it gets better.
       | 
       | Nothing on their status page. But the Console is not working.
        
         | commandlinefan wrote:
         | We're seeing some stuff is up, and some stuff is down and some
         | of the stuff that was up a little while is down now. It's
         | getting worse as of 9:53 AM CST.
        
       | 7six wrote:
       | eu-central-1 as well
        
       | htrp wrote:
       | Looks like all aws internal APIs are down....
        
         | taf2 wrote:
         | status page looks very green even 17 minutes later...
        
       | ZebusJesus wrote:
        | Methinks Venmo uses AWS, because they are down as well.
        | StatusGator has AWS as on-off, on-off, on-off. I can access my
        | servers hosted on the west coast but I cannot access the AWS
        | console; this is making for an interesting morning.
        
       | errcorrectcode wrote:
       | Alexa (jokes and flash briefing) is currently partially down for
       | me. Skills and routines are working.
       | 
       | Amazon customer service can't help me handle an order.
       | 
        | If us-east-1 is out, then half of Amazon is out too.
        
       | stoneham_guy wrote:
       | AWS Management Console Home page is currently unavailable.
       | 
       | That's the error I am getting when logging to aws console
        
       | soheil wrote:
        | Can't even log in; this is the error I'm getting:
        | 
        |     Internal Error
        |     Please try again later
        
       | tmarice wrote:
       | AWS OpenSearch is also returning 500 in us-east-1.
        
       | WesolyKubeczek wrote:
       | Our instances are up (when I poke with SSH, say), but the console
       | itself is under the weather.
        
       | swasheck wrote:
       | they're a bit later than normal with their large annual post-
       | thanksgiving us-east-1 outage
        
       | chazu wrote:
       | ECR borked for us in east-1
        
         | the-rc wrote:
         | You can get new tokens. Image pulling times out after ~30s,
         | which tells me that maybe ECR is actually up, but it can't
         | verify the caller's credentials or access image metadata from
         | some other internal service. It's probably something low level
         | that crashed, taking down anything built above it.
        
           | the-rc wrote:
           | Actually, images that do not exist will return the
           | appropriate error within a few seconds, so it's really timing
           | out when talking to the storage layer or similar.
        
       | mancerayder wrote:
       | The Personal Health Dashboard is unhealthy. It says unknown
       | error, failure or exception.
       | 
       | They need a monitor for the monitoring.
        
       | albatross13 wrote:
       | Welp, this is awkward (._. )
        
       | echlipse wrote:
       | imdb.com is down right now. But https://www.imdb.com/chart/top is
       | reachable right now. Strange.
       | 
       | https://imgur.com/a/apBT86o
        
       | sakopov wrote:
       | AWS Management console is dead/dying. Numerous errors across
       | major services like S3 and EC2 in us-east-1. This looks pretty
       | bad.
        
       | zedpm wrote:
       | Yep, PHD isn't loading, Cloudwatch is reporting SQS errors,
       | metrics aren't loading, can't pull logs. This is in US-East-1.
        
       | ec109685 wrote:
       | Surprised Netflix is down. I thought they were hot/hot multi-
       | region: https://netflixtechblog.com/active-active-for-multi-
       | regional...
        
         | ManuelKiessling wrote:
         | Just watched two episodes of Better Call Saul on Netflix
         | Germany without issues (while not being able to run my
         | Terraform plan against my eu-central-1 infrastructure...).
        
       | PaulHoule wrote:
       | Makes me glad I am in us-east-2.
        
         | heyitsguay wrote:
         | I'm having issues with us-east-2 now. Console is down, then
         | when I try to sign into a particular service I just get "please
         | try again later".
        
           | muttantt wrote:
           | It's related to east-1, looks like some single point of
           | failure not letting you access east-2 console URLs
        
         | newhouseb wrote:
         | We're flapping pretty hard in us-east-2 (looks API Gateway
         | related, which is probably because it's an edge deployment
         | which has a bunch of us-east-1 dependencies).
        
         | muttantt wrote:
         | us-east-2 is truly a hidden gem
        
           | kohanz wrote:
           | Except that it had a significant outage just a couple of
           | weeks ago. Source: most of our stuff is on us-east-2.
        
       | crad wrote:
       | We make heavy usage of Kinesis Firehose in us-east-1.
       | 
       | Issues started ~1:24am ET and resolved around 7:31am ET.
       | 
       | Then really kicked in at a much larger scale at 10:32am ET.
       | 
       | We're now seeing failures with connections to RDS Postgres and
       | other services.
       | 
       | Console is completely unavailable to me.
        
         | m3nu wrote:
         | Route53 is not updating new records. Console is also out.
        
         | dylan604 wrote:
         | >Issues started ~1:24am ET and resolved around 7:31am ET.
         | 
         | First engineer found a clever hack using bubble gum
         | 
         | >Then really kicked in at a much larger scale at 10:32am ET.
         | 
         | Bubble gum dried out, and the connector lost connection again.
         | Now, connector also fouled by the gum making a full replacement
         | required.
        
         | grumple wrote:
         | Kinesis was the cause last Thanksgiving too iirc. It's the
         | backbone of many services.
        
       | [deleted]
        
       | bilalq wrote:
       | Some advice that may help:
       | 
       | * Visit the console directly from another region's URL (e.g.,
       | https://us-east-2.console.aws.amazon.com/console/home?region...).
       | You can try this after you've successfully signed in but see the
       | console failing to load as well.
       | 
       | * If your AWS SSO app is hosted in a region other than us-east-1,
       | you're probably fine to continue signing in with other
       | accounts/roles.
       | 
       | Of course, if all your stuff is in us-east-1, you're out of luck.
       | 
       | EDIT: Removed incorrect advice about running AWS SSO in multiple
       | regions.
        
         | binaryblitz wrote:
         | I don't think you can run SSO in multiple regions on the same
         | AWS account.
        
           | bilalq wrote:
           | Thanks, corrected.
        
         | staticassertion wrote:
         | > Might also be a good idea to run AWS SSO in multiple regions
         | if you're not already doing so.
         | 
         | Is this possible?
         | 
         | > AWS Organizations only supports one AWS SSO Region at a time.
         | If you want to make AWS SSO available in a different Region,
         | you must first delete your current AWS SSO configuration.
         | Switching to a different Region also changes the URL for the
         | user portal. [0]
         | 
         | This seems to indicate you can only have one region.
         | 
         | [0]
         | https://docs.aws.amazon.com/singlesignon/latest/userguide/re...
        
           | bilalq wrote:
           | Good call. I just assumed you could for some reason. I guess
           | the fallback is to devise your own SSO implementation using
           | STS in another region if needed.
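
            A minimal sketch of that STS fallback, assuming boto3 and an
            existing IAM role to assume (the role ARN and session name below
            are hypothetical): STS exposes regional endpoints, so temporary
            credentials can be minted without touching us-east-1.

                # Sketch: mint temporary credentials from a regional STS
                # endpoint, then pin clients to the same healthy region.
                # Role ARN and names are placeholders, not real resources.
                import boto3

                REGION = "us-west-2"
                ROLE_ARN = "arn:aws:iam::123456789012:role/BreakGlassOperator"

                sts = boto3.client(
                    "sts",
                    region_name=REGION,
                    endpoint_url=f"https://sts.{REGION}.amazonaws.com",
                )
                creds = sts.assume_role(
                    RoleArn=ROLE_ARN,
                    RoleSessionName="outage-break-glass",
                    DurationSeconds=3600,
                )["Credentials"]

                # Use the temporary credentials with clients pinned to the
                # healthy region.
                ec2 = boto3.client(
                    "ec2",
                    region_name=REGION,
                    aws_access_key_id=creds["AccessKeyId"],
                    aws_secret_access_key=creds["SecretAccessKey"],
                    aws_session_token=creds["SessionToken"],
                )
                print(ec2.describe_instances(MaxResults=5)["Reservations"])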
        
       | zackbloom wrote:
       | I'm now getting failures searching for products on Amazon.com
       | itself. This is somewhat surprising, as the narrative always was
       | that Amazon didn't do a great job of dogfooding their own cloud
       | platform.
        
         | di4na wrote:
         | They did more of it starting a few years back. It has been
         | interesting to see how some services evolved far faster when
          | retail started to use them. Seems that some customers are far
          | more centric than others, if you catch my drift...
        
         | whoknowswhat11 wrote:
         | My Amazon order history showed no orders, but now is showing my
         | orders again - so stuff seems to be getting either fixed or
         | intermittent outages.
        
         | blahyawnblah wrote:
         | Doesn't amazon.com run on us-east-1?
        
         | zackbloom wrote:
         | Update: I'm also getting Internal Errors trying to log into the
         | Amazon.com site now as well.
        
       | mabbo wrote:
       | Are the actual _services_ down, or is it just the console and /or
       | login page?
       | 
       | For example, the sign-up page appears to be working:
       | https://portal.aws.amazon.com/billing/signup#/start
       | 
       | Are websites that run on AWS us-east up? Are the AWS CLIs
       | working?
        
         | pavel_lishin wrote:
         | We're seeing issues with EventBridge, other folks are having
         | trouble reaching S3.
         | 
         | Looks like actual services.
        
         | grumple wrote:
         | I can tell you that some processes are not running, possibly
         | due to SQS or SWF problems. Previous outages of this scale were
         | caused by Kinesis outages. Can't connect via aws login at the
         | cli either since we use SSO and that seems to be down.
        
         | meepmorp wrote:
         | EventBridge, CloudWatch. I've just started getting session
         | errors with the console, too.
        
         | 0xCMP wrote:
         | Using cli to describe instances isn't working. Instances
         | themselves seem fine so far.
        
         | Waterluvian wrote:
         | My ECS, EC2, Lambda, load balancer, and other services on us-
         | east-1 still function. But these outages can sometimes
         | propagate over time rather than instantly.
         | 
         | I cannot access the admin console.
        
         | snewman wrote:
         | Anecdotally, we're seeing a small number of 500s from S3 and
         | SQS, but mostly our service (which is at nontrivial scale, but
         | mostly just uses EC2, S3, DynamoDB, and some basic network
         | facilities including load balancers) seems fine, knock on wood.
         | Either the problem is primarily in more complex services, or it
         | is specific to certain AZs or shards or something.
        
         | 2bitlobster wrote:
         | Interactive Video Service (IVS) is down too
        
         | Guest19023892 wrote:
         | One of my sites went offline an hour ago because the web server
         | stopped responding. I can't SSH into it or get any type of
         | response. The database server in the same region and zone is
         | continuing to run fine though.
        
           | bijoo wrote:
           | Interesting, is the site on a particular type of EC2
           | instance, e.g. bare metal? I see c4.xlarge is doing fine in
           | us-east-1.
        
             | Guest19023892 wrote:
             | It's just a t3a.nano instance since it's a project under
             | development. However, I have a high number of t3a.nano
             | instances in the same region operating as expected. This
             | particular server has been running for years, so although
             | it could be a coincidence it just went offline within
             | minutes of the outage starting, it seems unlikely.
             | Hopefully no hardware failures or corruption, and it'll
             | just need a reboot once I can get access to AWS again.
        
         | dangrossman wrote:
         | My website that runs on US-East-1 is up.
         | 
         | However, my Alexa (Echo) won't control my thermostat right now.
         | 
         | And my Ring app won't bring up my cameras.
         | 
         | Those services are run on AWS.
        
           | kingcharles wrote:
           | Now I'm imagining someone dying because they couldn't turn
           | their heating on because AWS. The 21st Century is fucked up.
        
           | [deleted]
        
         | SEMW wrote:
         | Definitely not just the console. We had hundreds of thousands
         | of websocket connections to us-east-1 drop at 15:40, and new
         | websocket connections to that region are still failing.
         | (Luckily not a huge impact on our service cause we run in 6
         | other regions, but still).
        
           | andrew_ wrote:
           | Side question: How happy are you with API Gateway's WebSocket
           | service?
        
             | SEMW wrote:
             | No idea, we don't use it. These were websocket connections
             | to processes on ec2, via NLB and cloudfront. Not sure
             | exactly what part of that chain was broken yet.
        
               | zedpm wrote:
               | This whole time I've been seeing intermittent timeouts
               | when checking a UDP service via NLB; I've been wondering
               | if it's general networking trouble or something
               | specifically with the NLB. EC2 hosts are all fine, as far
               | as I can tell.
        
         | sophacles wrote:
         | I wasn't able to load my amazon.com wishlist, nor the shopping
         | page through the app. Not an aws service specifically, but an
         | amazon service that I couldn't use.
        
         | heartbreak wrote:
         | I'm getting blank pages from Amazon.com itself.
        
         | nowahe wrote:
          | I can't access anything related to CloudFront, either through
          | the CLI or the console:
          | 
          |     $ aws cloudfront list-distributions
          |     An error occurred (HttpTimeoutException) when calling the
          |     ListDistributions operation: Could not resolve DNS within
          |     remaining TTL of 4999 ms
          | 
          | However, I can still access the distribution fine.
        
         | lambic wrote:
         | We've had reports of some intermittent 500 errors from
         | cloudfront, apart from that our sites are up.
        
         | bijoo wrote:
          | I see that already-running EC2 instances are doing fine. However,
          | starting stopped instances cannot be done through the AWS SDK due
          | to HTTP 500 errors, even for the EC2 service. The CLI should be
          | getting the HTTP 500 errors too, since it likely hits the same API
          | as the SDK.
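
          A minimal sketch of riding out control-plane 5xx errors like the ones
          described above, assuming boto3; the instance ID and retry budget are
          placeholders, and this only helps once the API starts answering again.

              # Sketch: retry StartInstances with exponential backoff when the
              # EC2 API returns 5xx errors. Instance ID is a placeholder.
              import time
              import boto3
              from botocore.exceptions import ClientError

              ec2 = boto3.client("ec2", region_name="us-east-1")

              def start_with_backoff(instance_id, attempts=6):
                  for attempt in range(attempts):
                      try:
                          return ec2.start_instances(InstanceIds=[instance_id])
                      except ClientError as err:
                          status = err.response["ResponseMetadata"]["HTTPStatusCode"]
                          if status < 500:   # 4xx means a real error, not the outage
                              raise
                          time.sleep(min(2 ** attempt, 60))  # 1s, 2s, 4s, ... capped
                  raise RuntimeError("EC2 control plane still failing after retries")

              start_with_backoff("i-0123456789abcdef0")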
        
       | bkirkby wrote:
       | fwiw, we are seeing errors when trying to publish to SNS although
       | the aws status pages say nothing about SNS.
        
       | saggy4 wrote:
        | It seems that only the console is having problems; the CLI works
        | fine.
        
         | rhines wrote:
         | CLI for EC2 works for me, but not ELB.
        
         | MatthewCampbell wrote:
         | CloudFormation changesets are reporting "InternalFailure" for
         | us in us-east-1.
        
         | albatross13 wrote:
         | Not entirely true- we federate through ADFS and `saml2aws
         | login` is currently failing with:
         | 
         | error logging into aws role using saml assertion: error
         | retrieving STS credentials using SAML: ServiceUnavailable:
         | status code: 503
        
         | zedpm wrote:
         | I'm having CLI issues as well, they're using the same APIs
         | under the hood. For example, I'm getting 503 errors for
         | cloudwatch DescribeLogGroups.
        
           | saggy4 wrote:
            | Tried a few CLI commands; they seem to be working fine for me.
            | Maybe it is not broken for everyone, or maybe it is just the
            | start of something much worse. :(
        
             | technics256 wrote:
             | try aws ecr describe-registry and you will get an error
        
               | saggy4 wrote:
               | Yes Indeed, getting failures in CLI as well
        
       | [deleted]
        
       | 999900000999 wrote:
       | What if it never comes back up ?
        
       | zegl wrote:
       | I love that every time this happens, 100% of the services on
       | https://status.aws.amazon.com are green.
        
         | daniel-s wrote:
         | That page is not loading for me... on which region is it
         | hosted?
        
         | qudat wrote:
         | Status pages are hard
        
           | mrweasel wrote:
           | Not if you're AWS. At this point I'm fairly sure their status
            | page is just static HTML that always shows all green.
        
             | siva7 wrote:
             | Well, it is.
        
           | jtdev wrote:
           | Why? Twitter and HN can tell me that AWS is having an outage,
           | why can't AWS?
        
           | AH4oFVbPT4f8 wrote:
           | They sent their CEO into space, I am sure they have the
           | resources to figure it out.
        
           | cr3ative wrote:
           | When they have too much pride in an all-green dash, sure.
           | Allowing any engineer to declare a problem when first
           | detected? Not so hard, but it doesn't make you look good if
           | you have an ultra-twitchy finger. They have the balance badly
           | wrong at the moment though.
        
             | lukeschlather wrote:
             | A trigger-happy status page gives realtime feedback for
             | anyone doing a DoS attack. Even if you published that
             | information publicly you would probably want it on a
             | significant delay.
        
           | pid-1 wrote:
           | More like admitting failure is hard.
        
           | smt88 wrote:
           | No they're not.
           | 
           | Step 1: deploy status checks to an external cloud.
        
             | kube-system wrote:
              | I agree, but it does come with increased challenges around
              | false positives.
             | 
             | That being said, AWS status pages _are_ up.
        
           | wruza wrote:
           | "Falsehoods Programmers Believe About Status Pages"
        
         | 0xmohit wrote:
         | No wonder IMDB <https://www.imdb.com/> is down (returning 503).
         | Sad that Amazon engineers don't implement what they teach their
         | customers -- designing fault-tolerant and highly available
         | systems.
        
         | barbazoo wrote:
         | It seems they updated it ~30 minutes after your comment.
        
         | judge2020 wrote:
         | I don't see why they couldn't provide an error rate graph like
         | Reddit[0] or simply make services yellow saying "increased
         | error rate detected, investigating..."
         | 
         | 0: https://www.redditstatus.com/#system-metrics
        
           | VWWHFSfQ wrote:
           | because nobody cares when reddit is down. or at least, nobody
           | is paying them to be up 99.999% of the time.
        
           | willcipriano wrote:
            | An executive has an OKR around uptime, and an automated system
            | prevents him or her from having control over the messaging.
            | Therefore any effort to create one is squashed, leaving the
            | people requesting it confused as to why, and without any
            | explanation. Oldest story in the book.
        
           | jkingsman wrote:
           | Because Amazon has $$$$$ in their SLOs, and it costs them
           | through the nose every minute they're down in payments made
           | to customers and fees refunded. I trust them and most
           | companies not to be outright fraudulent (although I'm sure
           | some are), but it's totally understandable they'd be reticent
           | to push the "Downtime Alert/Cost Us a Ton of Money" button
           | until they're sure something serious is happening.
        
             | jolux wrote:
             | It should be costing them trust not to push it when they
             | should though. A trustworthy company will err on the side
             | of pushing it. AWS is a near-monopoly, so their
             | unprofessional business practices have still yet to cost
             | them.
        
               | ethbr0 wrote:
               | > _It should be costing them trust not to push it when
               | they should though._
               | 
               | This is what Amazon, the startup, understood.
               | 
               | Step 1: _Always_ make it right and make the customer
               | happy, even if it hurts in $.
               | 
               | Step 2: If you find you're losing too much money over a
               | particular issue, _fix the issue_.
               | 
               | Amazon, one of the world's largest companies, seems to
               | have forgotten that the risk of not reporting accurately
               | isn't money, but _breaking the feedback chain_. Once you
               | start gaming metrics, no leaders know what 's really
               | important to work on internally, because no leaders know
               | what the actual issues are. It's late Soviet Union in a
               | nutshell. If everyone is gaming the system at all levels,
               | then eventually the ability to objectively execute
               | decreases, because effort is misallocated due to
               | misunderstanding.
        
               | Kavelach wrote:
               | > It's late Soviet Union in a nutshell
               | 
               | How come an action of a private company in a capitalist
               | country is like the Soviet Union?
        
               | jolux wrote:
               | Private companies are small centrally-planned economies
               | within larger capitalist systems.
        
             | dhsigweb wrote:
             | I can Google and see how many apps, games, or other
             | services are down. So them not "pushing some buttons" to
             | confirm it isn't fooling anyone.
        
             | btilly wrote:
             | This is an incentive to dishonesty, leading to fraudulent
             | payments and false advertising of uptime to potential
             | customers.
             | 
             | Hopefully it results in a class action lawsuit for enough
             | money that Amazon decides that an automated system is
             | better than trying to supply human judgement.
        
               | jenkinstrigger wrote:
               | Can someone just have a site ping all the GET endpoints
               | on the AWS API? That is very far from "automating [their
               | entire] system" but it's better than what they're doing.
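
                A minimal sketch of that idea, assuming boto3 credentials are
                configured; the regions, services, and read-only probe calls
                are arbitrary choices, not an exhaustive health check.

                    # Sketch: probe cheap read-only API calls per region and
                    # record pass/fail. Choices below are arbitrary.
                    import boto3
                    from botocore.config import Config
                    from botocore.exceptions import BotoCoreError, ClientError

                    PROBES = {
                        "s3": lambda c: c.list_buckets(),
                        "ec2": lambda c: c.describe_regions(),
                        "sqs": lambda c: c.list_queues(),
                    }
                    REGIONS = ["us-east-1", "us-east-2", "us-west-2"]
                    CFG = Config(connect_timeout=5, read_timeout=5,
                                 retries={"max_attempts": 1})

                    for region in REGIONS:
                        for service, probe in PROBES.items():
                            client = boto3.client(service, region_name=region,
                                                  config=CFG)
                            try:
                                probe(client)
                                status = "OK"
                            except (BotoCoreError, ClientError) as err:
                                status = f"FAIL ({type(err).__name__})"
                            print(f"{region:12s} {service:4s} {status}")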
        
               | tynorf wrote:
               | Something like this? https://stop.lying.cloud/
        
             | lozenge wrote:
             | It literally is fraudulent though.
             | 
             | I don't think a region being down is something that you can
             | be unsure about.
        
               | hamburglar wrote:
               | Oh, you can get pretty weaselly about what "down" means.
               | If there is "just" an S3 issue, are all the various
               | services which are still "available" but throwing an
               | elevated number of errors because of their own internal
               | dependency on S3 actually down or just "degraded?" You
               | have to spin up the hair-splitting apparatus early in the
               | incident to try to keep clear of the post-mortem party.
               | :D
        
           | w0m wrote:
            | The more transparency you give, the harder it is to control
            | the narrative. They have a general reputation for
            | reliability, and exposing just how many actual
            | errors/failures there are (that generally don't affect a
            | large swath of users/use cases) would hurt that reputation
            | for minimal gain.
        
         | sakopov wrote:
         | Those five 9s don't come easy. Sometimes you have to prop them
         | up :)
        
           | 1-6 wrote:
            | It's hard to measure what five 9s is because you have to wait
            | around until a 0.00001 occurs. Incentivizing post-mortems is
            | absolutely critical in this case.
        
             | notinty wrote:
              | It's 0.001%; the first two 9s are part of the "99".
              | 
              |     5N  = 99.999%
              |     3N  = 99.9%
              |     1N5 = 95%
              | 
              | 3N is ~43m12s of downtime per month (30 days); 5N is about
              | 26 seconds.
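
              For reference, a quick back-of-the-envelope check of those budgets
              (plain Python, assuming a 30-day month):

                  # Downtime budget per 30-day month for an availability target.
                  def downtime_minutes_per_month(availability_pct, days=30):
                      return days * 24 * 60 * (1 - availability_pct / 100)

                  for pct in (95.0, 99.9, 99.999):
                      print(f"{pct}% -> {downtime_minutes_per_month(pct):.2f} min/month")
                  # 95%     -> 2160.00 min/month (36 hours)
                  # 99.9%   ->   43.20 min/month (43m12s)
                  # 99.999% ->    0.43 min/month (~26 seconds)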
        
               | 1-6 wrote:
               | I considered writing it as a percent but then decided
               | against using it and moving the decimal instead. But good
               | info for clarification.
        
           | JoelMcCracken wrote:
           | Every time someone asks to update the status page, managers
           | say "nein"
        
           | jjoonathan wrote:
           | I wonder how often outages really happen. The official page
           | is nonsense, of course, and we only collectively notice when
           | the outage is big enough that lots of us are affected. On
           | AWS, I see about a 3:1 ratio of "bump in the night" outages
           | (quickly resolved, little corroboration) to mega too-big-to-
           | hide outages. Does that mirror others' experiences?
        
             | Spivak wrote:
             | If you count any time AWS is having a problem that impacts
             | our production workloads then I think it's about 5:1.
             | Dealing with "AWS is down" outages are easy because I can
             | just sit back and grab some popcorn, it's the "dammit I
             | know this is AWS's fault" outages that are a PITA because
             | you count yourself lucky to even get a report in your
             | personalized dashboard.
        
               | jjoonathan wrote:
               | Yep.
               | 
               | Random aside: any chance you are related to the Calculus
               | on Manifolds Spivak?
        
               | Spivak wrote:
               | Nope, just a fan. It was the book that pioneered my love
               | of math.
        
               | clh1126 wrote:
                | I had to log in to say that one of my favorite quotes of
                | all time is from Calculus on Manifolds.
                | 
                | He says that any good theorem is worth generalizing, and
                | I've generalized that into a life rule.
        
           | kylemh wrote:
           | https://aws.amazon.com/compute/sla/
           | 
           | looks like only four 9's
        
             | dotancohen wrote:
             | > looks like only four 9's
             | 
              | That's why the Germans are such good engineers.
              | 
              |     Did the drives fail? Nein.
              |     Did the CPU overheat? Nein.
              |     Did the power get cut? Nein.
              |     Did the network go down? Nein.
             | 
             | That's "four neins" right there.
        
         | [deleted]
        
         | [deleted]
        
         | NicoJuicy wrote:
         | Not right now. I think they monitor if it appears on HN too.
        
         | swiftcoder wrote:
         | When I worked there it required the signoff of both your VP-
         | level executive and the comms team to update the status page. I
         | do not believe I ever received said signoff before the issues
         | were resolved.
        
         | queuebert wrote:
         | Are they lying, or just prioritizing their own services?
        
           | _verandaguy wrote:
           | Willing to bet the status page gets updated by logic on us-
           | east-1
        
           | itsyaboi wrote:
           | Status service is probably hosted in us-east-1
        
           | AH4oFVbPT4f8 wrote:
            | amazon.com seems to be having problems too. I get a
            | "something went wrong" page with a new design/layout, which I
            | assume is either new or a failsafe.
        
             | KineticLensman wrote:
             | Looks okay right now to this UK user of amazon.co.uk
        
               | btilly wrote:
               | It depends on which Amazon region you are being served
               | from.
               | 
               | It is very unlikely that Amazon would deliberately make
               | your messages cross the Atlantic just to find an American
               | region that is unable to serve you.
        
           | bennyp101 wrote:
           | https://music.amazon.co.uk is giving me an error since about
           | 16:30 GMT
           | 
           | "We are experiencing an error. Our apologies - We will be
           | back up soon."
        
         | [deleted]
        
         | gbear0 wrote:
         | I assume each service has its own health check that checks the
         | service is accessible from an internal location, thus most are
         | green. However, when Service A requires Service B to do work,
         | but Service B is down, a simple access check on Service A
         | clearly doesn't give a good representation of uptime.
         | 
          | So what should a good health check actually report these days? Is
          | it just its own status, or should it include a breakdown of the
          | status of external dependencies as part of its rolled-up status?
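
          One common answer, sketched below with hypothetical dependency names:
          report your own status plus a per-dependency breakdown, and roll up to
          "degraded" rather than "down" when only a dependency is failing.

              # Sketch of a health payload that separates own status from
              # dependency status. Names and probe functions are hypothetical.
              import json

              def check_database():   # a real probe would make a cheap call
                  return True

              def check_service_b():
                  return False        # pretend a downstream service is failing

              DEPENDENCIES = {"database": check_database,
                              "service_b": check_service_b}

              def health():
                  deps = {name: ("ok" if probe() else "failing")
                          for name, probe in DEPENDENCIES.items()}
                  # We are up, but mark "degraded" if any dependency is failing.
                  overall = "ok" if all(v == "ok" for v in deps.values()) else "degraded"
                  return json.dumps({"status": overall, "dependencies": deps},
                                    indent=2)

              print(health())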
        
         | goshx wrote:
         | I remember the time when S3 went down and took the status page
         | down with it
        
         | notreallyserio wrote:
         | Makes you wonder if they have to manually update the page when
         | outages occur. That'd be a pretty bad way to go, so I'd hope
         | not. Maybe the code to automatically update the page is in us-
         | east-1? :)
        
           | gromann wrote:
           | Word on the street is the status page is just a JPG
        
           | wfleming wrote:
           | Something like that has impacted the status page in the past.
           | There was a severe Kinesis outage last year
           | (https://aws.amazon.com/message/11201/), and they couldn't
           | update the service dashboard for quite a while because their
           | tool to manage the service dashboard lives in us-east-1
           | and depends on Kinesis.
        
         | JohnJamesRambo wrote:
         | > Goodhart's Law is expressed simply as: "When a measure
         | becomes a target, it ceases to be a good measure."
         | 
         | It's very frustrating. Why even have them?
        
           | Spivak wrote:
           | Because "uptime" and "nines" became a marketing term. Simple
           | as that. But the problem is that any public-facing measure of
           | availability becomes a defacto marketing term.
        
             | hinkley wrote:
             | Also 4-5 nines is virtually impossible for complex systems,
             | so the sort of responsible people who could make 3 nines
             | true begin to check out, and now you're getting most of
             | your info from the delusional, and you're lucky if you
             | manage 2 objective nines.
        
             | Enginerrrd wrote:
             | The older I get the more I hate marketers. The whole field
             | stands on the back of war-time propaganda research and it
             | sure feels like it's the cause of so much rot in society.
        
         | jbavari wrote:
         | Well yea, it's eventually consistent ;)
        
         | jrochkind1 wrote:
         | Even better, when I try to go to console, I get:
         | 
         | > AWS Management Console Home page is currently unavailable.
         | 
         | > You can monitor status on the AWS Service Health Dashboard.
         | 
         | "AWS Service Health Dashboard" is a link to
         | status.aws.amazon.com... which is ALL GREEN. So... thanks for
         | the suggestion?
         | 
         | At this point the AWS service health dashboard is kind of
         | famous for always being green, isn't it? It's a joke to its
         | users. Do the folks who work on the relevant AWS internal
         | team(s) know this, and just not have the resources to do
         | anything about it, or what? If it's a harder problem than you'd
         | think for interesting technical reasons, that'd be interesting
         | to hear about.
        
         | kortex wrote:
         | It's like trying to get the truth out of a kid that caused some
         | trouble.
         | 
         | Mom: Alexa, did you break something?
         | 
         | Alexa: No.
         | 
         | M: Really? What's this? _500 Internal server error_
         | 
         | A: ok maybe management console is down
         | 
         | M: Anything else?
         | 
         | A: ...
         | 
         | A: ... ok maybe cloudwatch logs
         | 
         | M: Ah hah. What else?
         | 
         | A: That's it, I swear!
         | 
         | M: _503 ClientError_
         | 
         | A: ...well okay secretsmanager might be busted too...
        
           | hinkley wrote:
           | There was a great response in r/relationship advice the other
           | day where someone said that OP's partner forced a fight
           | because they're planning to cheat on them, reconcile, and
           | then will 'trickle out the truth' over the next 6 months. I'm
           | stealing that phrase.
        
           | mdni007 wrote:
           | Funny I literally just asked my Alexa.
           | 
           | Me: Alexa, is AWS down right now?
           | 
           | Alexa: I'd rather not answer that
        
             | hinkley wrote:
             | Wise robot.
             | 
             | That's a bit like involving your kid in an argument between
             | parents.
        
           | PopeUrbanX wrote:
           | The very expensive EC2 instance I started this morning still
           | works. Of course now I can't shut it down.
        
         | ta20200710 wrote:
         | EC2 or S3 showing red in any region literally requires personal
         | approval of the CEO of AWS.
        
           | dekhn wrote:
           | Uhhhhh... what if the monitoring said it was hard down?
           | They'd still not show red?
        
             | choeger wrote:
             | Probably they cannot. They outsourced this dashboard and it
             | runs on AWS now ;).
        
           | dia80 wrote:
           | Unfortunately, errors don't require his approval...
        
           | notreallyserio wrote:
           | Is this true or a joke? This sort of policy is how you
           | destroy trust.
        
             | marcosdumay wrote:
             | If you trust them at this point, you have not been paying
             | attention, and will probably continue to trust them after
             | this.
        
             | bsedlm wrote:
             | maybe we gotta consider the publicly facing status pages as
             | something other than a technical tool (e.g. marketing or PR
             | or something like that, dunno)
        
             | jeffrallen wrote:
             | Well, no big deal, there's not really a lot of trust there
             | to destroy...
        
             | jedberg wrote:
             | From what I've heard it's mostly true. Not only the CEO but
             | a few SVPs can approve it, but yes a human must approve the
             | update and it must be a high level exec.
             | 
             | Part of the reason is because their SLAs are based on that
             | dashboard, and that dashboard going red has a financial
             | cost to AWS, so like any financial cost, it needs approval.
        
               | orangepurple wrote:
               | Being dishonest about SLAs seems to bear zero cost in
               | this case?
        
               | solatic wrote:
               | Zero directly-attributable, calculable-at-time-of-
               | decision cost. Of course there's a cost in terms of
               | customers who leave because of the dishonest practice,
               | but, who knows how many people that'll be? Out of the
               | customers who left after the outage, who knows whether
               | they left due to not communicating status promptly and
               | honestly or whether it was for some other reason?
               | 
               | Versus, if a company has X SLA contracts signed, that
               | point to Y reimbursement for being out for Z minutes, so
               | it's easily calculable.
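               | 
               | As a rough illustration of "easily calculable" (the credit
               | tiers and contract numbers below are made up, not the
               | actual SLA terms):
               | 
               |     def monthly_uptime(outage_minutes, month_minutes=30 * 24 * 60):
               |         return 100 * (1 - outage_minutes / month_minutes)
               | 
               |     def credit_percent(uptime):  # illustrative tiers only
               |         if uptime >= 99.99: return 0
               |         if uptime >= 99.0:  return 10
               |         if uptime >= 95.0:  return 30
               |         return 100
               | 
               |     # X = 40 hypothetical contracts at $5,000/month, Z = 7h out
               |     pct = credit_percent(monthly_uptime(7 * 60))
               |     print(sum(5000 * pct / 100 for _ in range(40)))  # 20000.0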
        
               | jedberg wrote:
               | It's not really dishonest though because there is nuance.
               | Most everything in EC2 is still working it seems, just
               | the console is down. So is it really down? It should
               | probably be yellow but not red.
        
               | dekhn wrote:
               | if you cannot access the control plane to create or
               | destroy resources, it is down (partial availability). The
               | jobs that are running are basically zombies.
        
               | w0m wrote:
                | Depending on the workload being run, users may or may not
                | notice. Should be Yellow at a minimum.
        
               | jedberg wrote:
               | Seems like the API is still working and so is auto
               | scaling. So they aren't really zombies.
               | 
               | Partial availability isn't the same as no availability.
        
               | electroly wrote:
               | The API is NOT working -- it may not have been listed on
               | the service health dashboard when you posted that, but it
               | is now. We haven't been able to launch an instance at
               | all, and we are continuously trying. We can't even start
               | existing instances.
        
               | dekhn wrote:
               | I'm right in the middle of an AWS-run training and we
               | literally can't run the exercises because of this.
               | 
                | let me repeat that: my AWS training that is run by AWS
               | that I pay AWS for isn't working, because AWS is having
               | control plane (or other) issues. This is several hours
               | after the initial incident. We're doing training in us-
               | west-2, but the identity service and other components run
               | in us-east-1.
        
               | justrudd wrote:
               | I'm running EKS in us-west-2. My pods use a role ARN and
               | identity token file to get temporary credentials via STS.
               | STS can't return credentials right now. So my EKS cluster
               | is "down" in the sense that I can't bring up new pods. I
               | only noticed because an auto-scaling event failed.
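                | 
                | For context, roughly what that exchange looks like under
                | the hood (a sketch using boto3 and the standard IRSA env
                | vars; the session name and region are placeholders):
                | 
                |     import os
                |     import boto3
                | 
                |     def fetch_temp_credentials():
                |         role_arn = os.environ["AWS_ROLE_ARN"]
                |         token_file = os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]
                |         with open(token_file) as f:
                |             token = f.read()
                |         # Pointing at the regional STS endpoint avoids the
                |         # global one; worth checking whether that helps here.
                |         sts = boto3.client(
                |             "sts",
                |             region_name="us-west-2",
                |             endpoint_url="https://sts.us-west-2.amazonaws.com",
                |         )
                |         resp = sts.assume_role_with_web_identity(
                |             RoleArn=role_arn,
                |             RoleSessionName="my-pod",
                |             WebIdentityToken=token,
                |         )
                |         return resp["Credentials"]  # key id, secret, token
                | 
                |     # If STS itself is failing, new pods can't get credentials
                |     # even though the data plane is otherwise healthy.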
        
               | dekhn wrote:
               | We ran through the whole 4.5 hour training and the
               | training app didn't work the entire time.
        
               | jjoonathan wrote:
               | "Good at finding excuses" is not the same thing as
               | "honest."
        
               | paulryanrogers wrote:
               | SNS seems to be at least partially down as well
        
               | jtheory wrote:
               | My company relies on DynamoDB, so we're totally down.
               | 
               | edit: partly down; it's sporadically failing
        
               | jrochkind1 wrote:
               | Heroku is currently having major problems. My stuff is
               | still up, but I can't deploy any new versions. Heroku
               | runs their stuff on AWS. I have heard reports of other
                | companies who run on AWS also having degraded service and
               | outages.
               | 
                | I'd say when other companies who run their infrastructure
                | on AWS are going down, it's hard to argue it's not a real
               | outage.
               | 
               | But AWS status _has_ changed to yellow at this point.
               | Probably heroku could be completely down because of an
               | AWS problem, and AWS status would still not show red. But
               | at least yellow tells us there's a problem, the
               | distinction between yellow and red probably only matters
               | at this point to lawyers arguing about the AWS SLA, the
               | rest of us know yellow means "problems", red will never
               | be seen, and green means "maybe problems anyway".
               | 
                | I believe all of us-east-1 could be entirely missing,
               | and they'd still only put a yellow not a red on status
               | page. After all, the other regions are all fine, right?
        
               | dekhn wrote:
               | Sure, but... that just raises more questions :)
               | 
               | Taken literally what you are saying is the service could
               | be down and an executive could override that, preventing
                | them from paying customers for a service outage, even if
               | the service did have an outage and the customer could
               | prove it (screenshots, metrics from other cloud
               | providers, many different folks see it).
               | 
               | I'm sure there is some subtlety to this, but it does mean
               | that large corps with influence should be talking to AWS
               | to ensure that status information corresponds with actual
               | service outages.
        
               | [deleted]
        
               | emodendroket wrote:
               | I have no inside knowledge or anything but it seems like
               | there are a lot of scenarios with degraded performance
               | where people could argue about whether it really
               | constitutes an outage.
        
               | dilyevsky wrote:
               | One time gcp argued that since they did return 404s on
               | gcs for a few hours that wasn't an uptime/latency sla
               | violation so we were not entitled to refund (tho they
               | refunded us anyway)
        
               | Enginerrrd wrote:
               | Man, between costs and shenanigans like this, why don't
               | more companies self-host?
        
               | dilyevsky wrote:
               | 1. Leadership prefers to blame cloud when things break
               | rather than take responsibility.
               | 
               | 2. Cost is not an issue (until it is but you're already
               | locked in so oh well)
               | 
               | 3. Faang has drained the talent pool of people who know
               | how
        
               | pm90 wrote:
               | Opex > Capex. If companies thought about long term, yes
               | they might consider it. But unless the cloud providers
               | fuck up really badly, they're ok to take the heat
               | occasionally and tolerate a bit of nonsense.
        
               | dilyevsky wrote:
               | You can lease equipment you know...
        
               | dekhn wrote:
               | Yep. I was an SRE who worked at Google and also launched
               | a product on Google Cloud. We had these arguments all the
               | time, and the contract language often provides a way for
               | the provider to weasel out.
        
               | jedberg wrote:
               | Like I said I never worked there and this is all hearsay
               | but there is a lot of nuance here being missed like
               | partial outages.
        
               | dekhn wrote:
               | This is no longer a partial outage. The status page
               | reports elevated API error rates, DynamoDB issues, EC2
               | API error rates, and my company's monitoring is
               | significantly affected (IE, our IT folks can't tell us
               | what isn't working) and my AWS training class isn't
               | working either.
               | 
               | If this needed a CEO to eventually get around to pressing
               | a button that said "show users the actual information
               | about a problem" that reflects poorly on amazon.
        
               | dhsigweb wrote:
               | My friend works at a telemetry company for monitoring and
               | they are working on alerting customers of cloud service
               | outages before the cloud providers since the providers
               | like to sit on their hands for a while (presumably to try
               | and fix it before anyone notices).
        
               | meetups323 wrote:
               | Large corps with influence get what they want regardless.
               | Status page goes red and the small corps start thinking
               | they can get what they want too.
        
               | scrose wrote:
               | > Status page goes red and the small corps start thinking
               | they can get what they want too.
               | 
               | I think you mean "start thinking they can get what they
               | pay for"
        
               | notreallyserio wrote:
               | I wonder how well known this is. You'd think it would be
               | hard to hire ethical engineers with such a scheme in
               | place and yet they have tens of thousands.
        
         | sneak wrote:
         | It's widespread industry knowledge now that AWS is publicly
         | dishonest about downtime.
         | 
         | When the biggest cloud provider in the world is famous for
         | gaslighting, it sets expectations for our whole industry.
         | 
         | It's fucking disgraceful that they tolerate such a lack of
         | integrity in their organization.
        
         | strictfp wrote:
         | "some customers may experience a slight elevation in error
         | rates" --> everything is on fire, absolutely nothing works
        
         | ballenf wrote:
         | https://downdetector.com
         | 
         | Amazing and scary to see all the unrelated services down right
         | now.
        
           | nightpool wrote:
           | I think it's pretty unlikely that both Google and Facebook
           | are affected by this minor AWS outage, whatever DownDetector
           | says. I even did a spot check on some of the smaller websites
           | they report as "down", like canva.com, and didn't see any
           | issues.
        
             | zarkov99 wrote:
             | You might be right about Google and Facebook, but this
             | isn't minor at all. Impact is widespread.
        
         | john37386 wrote:
       | It's starting to show issues now. I agree that it took a while
       | before we could get real visibility into the incident.
        
         | jrockway wrote:
         | I wonder if the other parts of Amazon do this. Like their
         | inventory system thinks something is in stock, but people can't
         | find it in the warehouse, do they just simply not send it to
         | you and hope you don't notice? AWS's culture sounds super
         | broken.
         | 
         | My favorite status page, though, is Slack's. You can read an
         | article in the New York Times about how Slack was down for most
         | of a day, and the status page is just like "some percentage of
         | users experienced minor connectivity issues". "Some percentage"
         | is code for "100%" and "minor" is code for "total". Good try.
        
         | whoknew1122 wrote:
         | The problem being that oftentimes you can't actually update
         | the status page. Most internal systems are down.
         | 
         | We can't even update our product to say it's down, because
         | accessing the product requires a process that is currently
         | dead.
        
           | thayne wrote:
           | That's why your status page should be completely independent
           | from the services it is monitoring (minus maybe something
           | that automatically updates it). We use a third party to host
           | our status page specifically so that we can update it even if
           | all our systems are down.
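           | 
           | In practice that means the updater has no dependency on your
           | own stack at all, something like this (the endpoint and token
           | here are hypothetical, not any particular vendor's API):
           | 
           |     import requests
           | 
           |     STATUS_API = "https://status-host.example.com/api/incidents"
           | 
           |     def post_incident(title, body, token):
           |         resp = requests.post(
           |             STATUS_API,
           |             headers={"Authorization": f"Bearer {token}"},
           |             json={"title": title, "status": "investigating",
           |                   "body": body},
           |             timeout=10,
           |         )
           |         resp.raise_for_status()
           |         return resp.json()
           | 
           |     # Run this from a laptop or an independent box, never from
           |     # the infrastructure whose outage you're announcing.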
        
             | whoknew1122 wrote:
             | I'm not saying you're wrong, or that the status page is
             | architected properly. I'm just speaking to the current
             | situation.
        
         | davecap1 wrote:
         | On top of that, the "Personalized Health Dashboard" doesn't
         | work because I can't seem to log in to the console.
        
           | meepmorp wrote:
           | I'm logged in; you're missing an error message.
        
             | davecap1 wrote:
             | We have federated login with MFA required (which was
             | failing). It just started working again.
             | 
             | Scratch that... console is not loading at all now :)
        
         | ricardobayes wrote:
       | Wonder why almost all Amazon frontends look like they were
       | written in C++
        
       | [deleted]
        
       | davikawasaki wrote:
       | EKS works for us
        
       | alephnan wrote:
       | Vanguard has been slow all day. I'm going to guess Vanguard has a
       | dependency on us-east-1
        
       | tgtweak wrote:
       | I'm going with "BGP error" which is likely config-related, likely
       | human error.
       | 
       | Seems to be the trend with the last 5-6 big cloud outages.
        
       | keyle wrote:
       | Why does that webpage render like a dog?... I get that it's
       | under load but the rendering itself is chugging something rare.
       | 
       | Edit: wow that webpage is humongous... never heard of paging?
        
       | saggy4 wrote:
       | I think only the console is down. The CLI is working fine for
       | me in ap-southeast-1
        
       | kchoudhu wrote:
       | At this point I have no idea why anyone would put anything in us-
       | east-1.
       | 
       | Also isolation is not as good as they would have you believe: I
       | am unable to login to AWS Quicksight in us-west-2...
        
         | bradhe wrote:
         | Man, some conclusions are being _jumped_ to by this reply.
        
           | InTheArena wrote:
           | There is a very long history of US-east-1 being horrible.
           | Just bad. We've told every client we can to get out of there.
           | It's one of the oldest amazon regions, and I think too much
           | old legacy and weird stuff happens there. Use US-west-2.
        
             | jedberg wrote:
             | Or US-East-2.
        
             | blahyawnblah wrote:
             | Isn't us-east-1 where they deploy everything first? And the
             | only region that has 100% of all available services?
        
         | crad wrote:
         | Been in us-east-1 for a long time. Things like Direct Connect
         | and other integrations aren't easy or cheap to move and when
         | you have other, bigger priorities, moving regions is not an
         | easy decision to prioritize.
        
       | RubberShoes wrote:
       | "AWS Management Console Home page is currently unavailable. You
       | can monitor status on the AWS Service Health Dashboard."
       | 
       | And then Health Dashboard is 100% green. What a joke.
        
       | nurgasemetey wrote:
       | It seems that IMDB and Goodreads are also affected
        
         | kingcharles wrote:
         | Yeah, and Audible and Amazon.com search and the Amazon retail
         | stores.
         | 
         | Basically Amazon fucked all their own products too.
        
       | picodguyo wrote:
       | Funny, I just asked Alexa to set a timer and she said there was a
       | problem doing that. Apparently timers require functioning us-
       | east-1 now.
        
         | minig33 wrote:
         | I can't turn on my lights... the future is weird
        
           | lsaferite wrote:
           | And that is why my lighting automation has a baseline req
           | that it works 100% without the internet and preferably
           | without a central controller.
        
             | organsnyder wrote:
             | I love my Home Assistant setup for this reason. I can even
             | get light bulbs pre-flashed with ESPHome now (my wife was
             | bemused when I was updating the firmware on the
             | lightbulbs).
        
             | rodgerd wrote:
             | HomeKit compatibility is a useful proxy for local API,
             | since it's a hard requirement for HomeKit certification.
        
       | m12k wrote:
       | A former colleague told me years ago that us-east-1 is basically
       | the guinea pig where changes get tested before being rolled out
       | to the other regions, and as a result is less stable than the
       | others. Does anyone know if there's any truth to this?
        
         | wizwit999 wrote:
         | false, it's often 4th iirc, SFO (us-west-1) is actually usually
         | first.
        
         | shepherdjerred wrote:
         | At my org it was deployed in the middle, around the fourth wave
         | iirc
        
         | arpinum wrote:
         | This is not true. Lambda updates us-east-1 last.
        
         | treesknees wrote:
         | I can't see why they'd use the most common/popular region as a
         | guinea pig.
        
           | Kye wrote:
           | Problem: you've gone as far as you can go testing internally
           | or with test groups. You know there are edge cases you'll
           | only identify by having enough people test it.
           | 
           | Solution: push it to production on the zone with the most
           | users and see what breaks.
        
         | sharpy wrote:
         | The guideline has been to deploy to it last.
         | 
         | If the team follows pipeline best practices, they are supposed
         | to deploy to a single small region first, wait 24 hours, and
         | then deploy to more, wait more, and deploy to more, until
         | finally deploying to us-east-1.
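         | 
         | In toy form the wave ordering looks something like this (the
         | region lists, bake time, and hooks are illustrative, not the
         | actual internal tooling):
         | 
         |     import time
         | 
         |     WAVES = [
         |         ["us-west-1"],                    # one small region first
         |         ["us-west-2", "eu-west-1"],
         |         ["ap-southeast-1", "eu-central-1"],
         |         ["us-east-1"],                    # biggest blast radius last
         |     ]
         |     BAKE_SECONDS = 24 * 3600
         | 
         |     def rollout(deploy, healthy):
         |         for wave in WAVES:
         |             for region in wave:
         |                 deploy(region)
         |             time.sleep(BAKE_SECONDS)      # bake before widening
         |             if not all(healthy(r) for r in wave):
         |                 raise RuntimeError(f"roll back, wave {wave} unhealthy")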
        
       | mystcb wrote:
       | Not sure it helps, but got this update from someone inside AWS a
       | few moments ago.
       | 
       | "We have identified the root cause of the issues in the US-EAST-1
       | Region, which is a network issue with some network devices in
       | that Region which is affecting multiple services, including the
       | console but also services like S3. We are actively working
       | towards recovery."
        
         | rozenmd wrote:
         | They finally updated the status page:
         | https://status.aws.amazon.com/
        
           | mystcb wrote:
           | Ahh good spot, it does seem that the AWS person I am speaking
           | to has a few more bits beyond what is shown on the page; they
           | just messaged me the same message, but added:
           | 
           | "All teams are engaged and continuing to work towards
           | mitigation. We have confirmed the issue is due to multiple
           | impaired network devices in the US-EAST-1 Region."
           | 
           | Doesn't sound like they are having a good day there!
        
         | alpha_squared wrote:
         | That's a copy-paste, we got the same thing from our AWS
         | contact. It's just enough info to confirm there's an issue, but
         | not enough to give any indication on the scope or timeline to
         | resolution.
        
         | alexatalktome wrote:
         | Internally the rumor is that our CICD pipelines failed to stop
         | bad commits to certain AWS services. This isn't due to tests
         | but due to actual pipelines infra failing.
         | 
         | We've been told to disable all pipelines even if we have time
         | blockers or manual approval steps or failing tests
        
         | jdc0589 wrote:
         | I love how they are sharing this stuff out to some clients, but
         | it's technically under NDA.
        
           | alfalfasprout wrote:
           | Yeah, we got updates via NDA too lol. Such a joke that a
           | status page update is considered privileged lol.
        
       | romanhotsiy wrote:
       | It's funny that the first place I go to learn about the outage is
       | Hacker News and not https://status.aws.amazon.com/ (it still
       | reports everything as "operating normally"...)
        
         | albatross13 wrote:
         | Yeah, I tend to go off of https://downdetector.com/status/aws-
         | amazon-web-services/
        
           | murph-almighty wrote:
           | I always got the impression that downdetector worked by
           | logging the number of times they get a hit for a particular
           | service and using that as a heuristic to determine if
           | something is down. If so, that's brilliant.
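           | 
           | If that's the model, a toy version of the heuristic might
           | look like this (purely a guess; the bucketing and threshold
           | are made up):
           | 
           |     from collections import deque
           | 
           |     class OutageDetector:
           |         def __init__(self, baseline_window=96, threshold=4.0):
           |             self.history = deque(maxlen=baseline_window)  # 15-min buckets
           |             self.threshold = threshold  # spike factor over baseline
           | 
           |         def record_bucket(self, report_count):
           |             if self.history:
           |                 baseline = sum(self.history) / len(self.history)
           |             else:
           |                 baseline = 0
           |             self.history.append(report_count)
           |             # Flag an incident when reports spike well above baseline.
           |             return report_count > max(1, baseline) * self.threshold
           | 
           |     detector = OutageDetector()
           |     print([detector.record_bucket(n) for n in [2, 3, 1, 2, 3600]])
           |     # [False, False, False, False, True]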
        
             | albatross13 wrote:
             | I think it's a bit simpler for AWS- there's a big red "I
             | have a problem with AWS" button on that page. You click it,
             | tell it what your problem is, and it logs a report. Unless
             | that's what you were driving at and I missed it, it's
             | early. Too early for AWS to be down :(
             | 
             | Some 3600 people have hit that button in the last ~15
             | minutes.
        
             | cmg wrote:
             | It's brilliant until the information is bad.
             | 
             | When Facebook's properties all went down in October, people
             | were saying that AT&T and other cell phone carriers were
             | also down - because they couldn't connect to FB/Insta/etc.
             | There were even some media reports that cited Downdetector,
             | seemingly without understanding that it is basically
             | crowdsourced and sometimes the crowd is wrong.
        
         | bmcahren wrote:
         | I made sure our incident response plan includes checking Hacker
         | News and Twitter for _actual_ updates and information.
         | 
         | As of right now, this thread and one update from a twitter
         | user,
         | https://twitter.com/SiteRelEnby/status/1468253604876333059 are
         | all we have. I went into disaster recovery mode when I saw our
         | traffic dropped to 0 suddenly at 10:30am ET. That was just the
         | SQS/something else preventing our ELB logs from being extracted
         | to DataDog though.
        
           | unethical_ban wrote:
           | So as of the time you posted this comment, were other
           | services actually down? The way the 500 shows up, and the AWS
           | status page, makes it sound like "only" the main landing
           | page/mgt console is unavailable, not AWS services.
        
             | jeremyjh wrote:
             | Yes, they are still publishing lies on their status page.
             | In this thread people are reporting issues with many
             | services. I'm seeing periodic S3 PUT failures for the last
             | 1.5 hours.
        
               | alexatalktome wrote:
               | AWS services are all built against each other so one
               | failing will take down a bunch more which take down more
               | like dominos. Internally there's a list of >20 "public
               | facing" AWS services impacted.
        
         | authed wrote:
         | I usually go on Twitter first for outages.
        
         | 1-6 wrote:
         | Community reporting > internal operations
        
         | taf2 wrote:
         | Now 57 minutes later and it still reports everything as
         | operating normally.
        
           | mijoharas wrote:
           | It shows errors now.
        
             | romanhotsiy wrote:
             | It doesn't show errors with Lambda and we clearly do
             | experience them.
        
       | kingcharles wrote:
       | Does McDonalds use AWS for the backend to their app?
       | 
       | If I find out this is why I couldn't get my Happy Meal this
       | morning I'm going to be really, really grumpy.
       | 
       | EDIT: I'm REALLY grumpy now:
       | 
       | https://aws.amazon.com/blogs/industries/aws-is-how-mcdonalds...
        
         | hoofhearted wrote:
         | The McDonalds App was showing on the frontpage of Down Detector
         | at the same time as all the Amazon dependent services last I
         | checked.
        
         | Crespyl wrote:
         | Apparently Taco Bell too, not being able to place an order and
         | then also not being able to fall back to McDonalds was how I
         | realized there was a larger outage :p
         | 
         | What am I supposed to do for lunch now? Go to the drive through
         | and order like a normal person? /s
         | 
         | Grumble grumble
        
       | dwighttk wrote:
       | Is this why goodreads hasn't been working today?
        
       | TameAntelope wrote:
       | We run as much as we can out of us-east-2 because it has more
       | uptime than us-east-1, and I don't think I've ever regretted that
       | decision.
        
       | p2t2p wrote:
       | I have alerts going off because of that...
        
       | kello wrote:
       | Don't envy the engineers working on this right now. Good luck!
        
       | nickysielicki wrote:
       | The Amazon.com storefront was giving me issues loading search
       | results -- this is the worst possible time of year for Amazon to
       | have issues. It's horrifying and awesome to imagine hundreds of
       | thousands (if not millions) of dollars of lost orders an hour --
       | just from sluggish load times. Hugops to those dealing with this.
        
         | throwanem wrote:
         | Third worst time. It's not BFCM and it's not the week before
         | Christmas; from prior high-volume ecommerce experience I
         | suspect their purchase rate is elevated at this time but
         | nowhere near those two peaks.
        
       | n0cturne wrote:
       | I just walked into the Amazon Books store at our local mall. They
       | are letting everyone know at the entrance that "some items aren't
       | available for purchase right now because our systems are down."
       | 
       | So at least Amazon retail is feeling some of the pain from this
       | outage!
        
       | etimberg wrote:
       | Seeing issues in ca-central-1 and us-east
        
       | taf2 wrote:
       | We're not down and we're in us-east-1... maybe there is more to
       | this issue?
        
         | bradhe wrote:
         | I think it could just be the console?
        
         | taf2 wrote:
         | Found that it seems Lambda is impacted
        
         | fsagx wrote:
         | Everything fine with S3 in us-east-1 for me. Also just not able
         | to access the console.
        
         | abarringer wrote:
         | Our services in AWS East are down.
        
       ___________________________________________________________________
       (page generated 2021-12-07 23:00 UTC)