[HN Gopher] AWS us-east-1 outage
___________________________________________________________________
AWS us-east-1 outage
Author : judge2020
Score : 1361 points
Date : 2021-12-07 15:42 UTC (7 hours ago)
(HTM) web link (status.aws.amazon.com)
(TXT) w3m dump (status.aws.amazon.com)
| sbr464 wrote:
| Random areas of booker.com/mindbody are affected
| tonyhb wrote:
| Our services that are in us-east-2 are up, but I'm wondering how
| long that will hold true.
| technics256 wrote:
| our EKS/EC2 instances are OK
| AH4oFVbPT4f8 wrote:
| Unable to log into the console for us-east-1 for me too
| dimitar wrote:
| https://status.hashicorp.com/incidents/3qc302y4whqr - seems to
| have affected Hashicorp Cloud too
| singlow wrote:
| I am able to access the console for us-west-2 by going to a
| region specific URL: https://us-west-2.console.aws.amazon.com/
|
| It does take me to a non-region-specific login page which is up,
| and then redirects back to us-west-2, which works.
|
| If I go to my bookmarked login page it breaks because,
| presumably, it hits something that is broken in us-east-1.
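|
| For reference, a rough way to sanity-check a specific region from
| the CLI during an incident like this (a sketch, assuming the AWS
| CLI is installed and credentials are already configured locally)
| is to pin both the region and the regional endpoint explicitly so
| nothing falls back to endpoints hosted in us-east-1:
|
|         # Verify credentials against the regional STS endpoint
|         $ aws sts get-caller-identity --region us-west-2 \
|             --endpoint-url https://sts.us-west-2.amazonaws.com
|
|         # Then hit an ordinary regional API
|         $ aws ec2 describe-instances --region us-west-2 --output table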
| CoachRufus87 wrote:
| Does there exist a resource that tracks outages on a per-region
| basis over time?
| soheil wrote:
| Can confirm I can once again login to the Console and everything
| seems to be back to normal in us-east-2.
| crescentfresh wrote:
| The majority of our errors stem from:
|
| - writing to Firehose (S3-backed)
|
| - publishing to eventbridge
|
| - terraform commands to ECS' API are stuck/hanging
|
| Other spurious errors involving Kinesis, but nothing alarming.
| This is in us-east-1.
| hiyer wrote:
| Payments on Amazon India are down - likely because of this.
| dalrympm wrote:
| Contrary to what the status page says, CodePipeline is not
| working. Hitting the CLI I can start pipelines but they never
| complete and I get a lot of:
|
| Connection was closed before we received a valid response from
| endpoint URL: "https://codepipeline.us-east-1.amazonaws.com/".
| alexatalktome wrote:
| Rumor is that our internal pipelines are the root cause. The
| CICD pipelines (not tests, the literal pipeline infrastructure)
| failed to block certain commits and pushed them to production
| when not ready.
|
| We've been told to manually disable them to ensure integrity of
| our services when it recovers
| AzzieElbab wrote:
| Betting on dynamo again
| markus_zhang wrote:
| Just curious: does it still make sense to claim that uptime is
| some number of 9s? (e.g. 99.999%)
| throwanem wrote:
| Yep. In this case, zero nines.
| [deleted]
| biohax2015 wrote:
| Getting 502 in Parameter Store. Cloudformation isn't returning
| either -- and that's how we deploy code :(
| rickreynoldssf wrote:
| EC2 at least seems fine but the console is definitely busted as
| of 16:00 UTC
| Bedon292 wrote:
| I am still getting 'Invalid region parameter' for resources in
| us-east-1, the others are fine.
| [deleted]
| SubiculumCode wrote:
| Must be why I can't seem to access my amazon account. I thought
| my account had gotten compromised.
| imstil3earning wrote:
| Can't scale our Kubernetes cluster due to 500s from ECR :(
| jonnylangefeld wrote:
| Does anyone know why Google is showing the same spike on down
| detector as everything else? How does Google depend on AWS?
| https://downdetector.com/status/google/
| NobodyNada wrote:
| It's because Down Detector works off of user reports rather
| than automatically detecting outages somehow. So, every time a
| major service goes down (whether infrastructure like AWS or
| Cloudflare, or user-facing like YouTube or Facebook), some
| users will blame Google, ISPs, cellular providers, or some
| other unrelated service.
| gmm1990 wrote:
| Some Google Sheets functions aren't updating in a timely manner
| for me. Maybe people use Google as a backup for AWS and they have
| to throttle certain services under the higher load
| joelbondurant wrote:
| They should put everything in the cloud so hardware issues can't
| happen.
| megakid wrote:
| I live in London and I can't launch my Roomba vacuum to clean up
| after dinner because of this. Hurry up AWS, fix it!
| jrochkind1 wrote:
| This is affecting Heroku.
|
| While my heroku apps are currently up, I am unable to push new
| versions.
|
| Logging in to heroku dashboard (which does work), there is a
| message pointing to this heroku status incident for "Availability
| issues with upstream provider in the US region":
| https://status.heroku.com/incidents/2390
|
| How can there be an outage severe enough to be affecting
| middleman customers like Heroku, but the AWS status page is still
| all green?!?!
|
| If whoever runs the AWS status page isn't embarrassed, they really
| ought to be.
| VWWHFSfQ wrote:
| AWS management APIs in the us-east-1 region is what is
| affected. I'm guessing Heroku uses at least the S3 APIs when
| deploying new versions, and those are failing
| (intermittently?).
|
| I advise not touching your Heroku setup right now. Even
| something like trying to restart a dyno might mean it doesn't
| come back since the slug is probably stored on S3 and that will
| fail.
| valeness wrote:
| This is more than just east. I am seeing the same error on us-
| west-2 resources.
| singlow wrote:
| I am seeing some issues but only with services that have global
| aspects such as s3. I can't create an s3 bucket even though I
| am in us-west-2 because I think the names are globally unique
| and creating them depends on us-east-1.
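|
| For what it's worth, bucket creation can at least be pinned to a
| region explicitly (a sketch only - "my-example-bucket" is a
| placeholder, and since names are globally unique this can still
| fail if the global namespace machinery is what's impaired):
|
|         $ aws s3api create-bucket \
|             --bucket my-example-bucket \
|             --region us-west-2 \
|             --create-bucket-configuration LocationConstraint=us-west-2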
| herodoturtle wrote:
| Can't access Lightsail console even though our instances are in a
| totally different Region.
| avsteele wrote:
| Getting strange errors trying to manage my amazon account right
| now, could this be related?
|
| 494 ERROR and "We're sorry Something went wrong with our website,
| please try again later."
| ricardobayes wrote:
| This was funny at first but now I can't even play Elder Scrolls
| Online :(
| numberwhun wrote:
| Amazon is having an outage in us-east-1, and it is bleeding over
| elsewhere, like eu: https://status.aws.amazon.com/
| cyanydeez wrote:
| Is that fail-failover?
| hvgk wrote:
| I think it's a factorial fail.
| the-dude wrote:
| It is failures all the way down.
| hvgk wrote:
| A disc of failures resting on elephants of failures
| resting on a giant turtle of failures? Or are you more a
| turtles of failures all the way down sort of person?
| sharpy wrote:
| Those 2 services that are being marked as having problems in
| other regions have a fairly hard dependency on us-east-1. So that
| would be why.
| blueside wrote:
| I was in the process of buying some tickets on
| Ticketmaster and the entire presale event had to be postponed for
| at least 4 hours due to this AWS outage.
|
| I'm not complaining, I enjoyed the nostalgia - sometimes the web
| still feels like the late 90s
| kingcharles wrote:
| I'm complaining. I could not get my Happy Meal at McDonald's
| this morning. Bad start to the day.
| _moof wrote:
| This is your regularly scheduled reminder:
| https://www.whoownsmyavailability.com/
| iso1210 wrote:
| My raspberry pi is still working just fine
| marginalia_nu wrote:
| Yeah, strange, my self-hosted server isn't affected either.
| iso1210 wrote:
| Seems "the cloud" had a major outage less than a month ago,
| my laptop has a higher uptime.
|
| $ 16:04 up 46 days, 7:02, 9 users, load averages: 3.68 3.56
| 3.18
|
| US East 1 was down just over a year ago
|
| https://www.theregister.com/2020/11/25/aws_down/
|
| Meanwhile I moved one of my two internal DNS servers to a
| second site on 11 Nov 2020, and it's been up since then. One
| of my monitoring machines has been filling, rotating and
| deleting logs for 1,712 days with a load average in the c. 40
| range for that whole time, just works.
|
| If only there was a way to run stuff with an uptime of 364
| days a year without using the cloud /s
| nfriedly wrote:
| I think the point of the cloud isn't increased uptime - the
| point is that when it's down, bringing it back up is _someone
| else's problem_.
|
| (Also, OpEx vs CapEx financial shenanigans...)
|
| All the same, I don't disagree with your point.
| iso1210 wrote:
| > the point is that when it's down, bringing it back up is
| someone else's problem.
|
| When it's down, it's my problem, and I can't do anything
| about it other than explain why I have no idea the system
| is broken and can't do anything about it.
|
| "Why is my dohicky down? When will it be back?"
|
| "Because it's raining, no idea"
|
| May be accurate, it's also of no use.
|
| But yes, OpEx vs CapEx, of course that's why you can
| lease your servers. It's far easier to spend another $500 a
| month of company money on AWS than to spend $500 a
| year on a new machine.
| debaserab2 wrote:
| so does my toaster, oven and microwave. so what? they get
| used a few times a day, but my production level equipment
| serves millions in an hour.
| iso1210 wrote:
| My lightswitch is used twice a day, yet it works every
| time. In the old days it would occasionally break (bulb
| goes), I would be empowered to fix it myself (change the
| bulb).
|
| In the cloud you're at the mercy of someone who doesn't
| even know you exist to fix it, without the protections
| that, say, an electric company has with supplying domestic
| users.
|
| This thread has people unable to turn their lights on[0],
| it's hilarious how people tie their stuff to dependencies
| that aren't needed, with a history of constant failure.
|
| If you want to host millions of people, then presumably
| your infrastructure can cope with the loss of a single AZ
| (and ideally the loss of Amazon as a whole). The vast
| majority of people will be far better off without their
| critical infrastructure going down in the middle of the
| day in the busiest sales season going.
|
| [0] https://news.ycombinator.com/item?id=29475499
| jaywalk wrote:
| Cool. Now let's have a race to see who can triple their
| capacity the fastest. (Note: I don't use AWS, so I can actually
| do it)
| iso1210 wrote:
| Why would I want to triple my capacity?
|
| Most people don't need to scale to a billion users overnight.
| jaywalk wrote:
| Many B2B-type applications have a lot of usage during the
| workday and minimal usage outside of it. No reason to keep
| all that capacity running 24/7 when you only need most of
| it for ~8 hours per weekday. The cloud is perfect for that
| use case.
| dijit wrote:
| idk man, idle hardware doesn't use all that much power.
|
| https://www.thomas-krenn.com/en/wiki/Processor_P-
| states_and_...
|
| Which is an implementation of:
|
| https://web.eecs.umich.edu/~twenisch/papers/asplos09.pdf
| iso1210 wrote:
| Is it really? How much does that scaling actually cost?
|
| And what's a workday anyway, surely you operate globally?
| jaywalk wrote:
| Scaling itself costs nothing, but saves money because
| you're not paying for unused capacity.
|
| The main application I run operates in 7 countries
| globally, but the US is the only one that has enough
| usage to require additional capacity during the workday.
| So out of 720 hours in a 30 day month, cloud scaling
| allows me to pay for additional capacity for only the
| (roughly) 160 hours that it's actually needed. It's a
| _significant_ cost saver.
|
| And because the scaling is based on actual metrics, it
| won't scale up on a holiday when nobody is using the
| application. More cost savings.
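|
| Back-of-envelope for those numbers (the hourly rate is a made-up
| placeholder; only the 160-of-720-hours split comes from above):
|
|         $ RATE=1   # hypothetical $/hour for the extra capacity
|         $ echo "always-on: $((720*RATE)) scaled: $((160*RATE)) $/mo"
|         always-on: 720 scaled: 160 $/mo
|         $ echo "saving: $(( (720-160)*100/720 ))%"
|         saving: 77%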
| vp8989 wrote:
| You are (conveniently or not) incorrectly assuming that
| the unit price of provisioned vs on-demand capacity is
| the same. It's not.
| jaywalk wrote:
| Nice of you to assume that I don't understand the pricing
| of the services I use. I can assure you that I do, and I
| can also assure you that there is no such thing as
| provisioned vs on-demand pricing for Azure App Service
| until you get into the higher tiers. And even in those
| higher tiers, it's cheaper _for me_ to use on-demand
| capacity.
|
| Obviously what I'm saying will not apply to all use
| cases, but I'm only talking about mine.
| [deleted]
| joelhaasnoot wrote:
| Amazon.com is also throwing errors left and right
| endisneigh wrote:
| Azure, Google Cloud, AWS and others need to have a "Status
| alliance" where they determine the status of each of their
| services by a quorum using all cloud providers.
|
| Status pages are virtually useless these days
| LinuxBender wrote:
| Or just modify sites like DownDetector to show who is hosting
| each site. When {n} sites hosted on {x} are down, one could draw
| a conclusion. It won't be as detailed as "xyz services failed"
| but rather show that the overall operational chain is broken.
| There could be a graph that shows _99% of sites hosted on Amazon
| us-east-1 are down_ - it would be hard to hide that. This could
| also paint a picture of which companies are not active-active
| multi-cloud.
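|
| A rough sketch of that kind of lookup from a shell (assuming dig
| and whois are installed; the hostnames are just examples, and
| whois org names are only a heuristic for "hosted on AWS"):
|
|         $ for host in example.com www.example.org; do
|         >   ip=$(dig +short "$host" | tail -n1)
|         >   org=$(whois "$ip" | grep -iE 'OrgName|org-name' | head -n1)
|         >   echo "$host -> $ip -> $org"
|         > done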
| avaika wrote:
| Cloud providers are not just about web hosting. There are
| dozens of tools hidden from end users, e.g. services used for
| background operations / maintenance (AWS CodeCommit /
| CodeBuild, or even the AWS web console like today). This kind of
| outage won't bring down your web site, but it still might break
| your normal workflow and even cost you some money.
| DarthNebo wrote:
| https://en.wikipedia.org/wiki/Mexican_standoff
| beamatronic wrote:
| It's not the prisoner's dilemma?
| smt88 wrote:
| They can do this without an alliance. They very intentionally
| choose not to do it.
|
| Every major company has moved away from having accurate status
| pages.
| arch-ninja wrote:
| Steam has a great status page; companies like that and
| Cloudflare will eat Alphabet's lunch in the next 17-18 years.
| ExtraE wrote:
| That's a tight time frame a long way off. How'd you arrive
| at 17-18?
| moolcool wrote:
| Where do you anticipate Steam to compete with Alphabet?
| dentemple wrote:
| It's because none of these companies are held responsible for
| missing their actual SLAs, as opposed to their self-reported
| SLA compliance.
|
| So unless regulation gets implemented that says otherwise,
| there's zero incentive for any company to maintain an
| accurate status page.
| soheil wrote:
| How did you find a way to bring regulations into this?
| There are monitoring services you can pay for to keep an
| eye on your SLAs and your vendors'.
|
| If not happy with the results switch.
| ybloviator wrote:
| Technically, there are already regulations. SLA lies are
| fraud.
|
| But I'm leery of any business who's so dishonest they
| fear any outside oversight that brings repercussions for
| said dishonesty.
|
| "If not happy, switch" is silly - it's not the customer's
| problem. And if you're a large customer and have invested
| heavily in getting staff trained on AWS, you can't just
| move.
| soheil wrote:
| A) don't build a business that relies solely on existence
| of another B) switch to another vendor if not happy with
| current vendor
|
| Really not that complicated.
| iso1210 wrote:
| The only uptimes I'm concerned with are my own services
| that my own monitoring keeps on top of, this varies - if
| the monitoring page goes down for 10 seconds I'm not
| worried, if one leg of a smpte-2022-7 is down for a second
| that's fine, if it keeps going down for a second that's a
| concern, etc.
|
| If something I'm responsible for goes down to the point that my
| stakeholders are complaining (which means something is seriously
| wrong), they are not going to be happy with "oh the cloud was
| down, not my fault".
|
| Whether AWS is down or not is meaningless to me; whether my
| service running on AWS is down or not is the key metric.
|
| If a service is down and I can't get into it, then chatter
| on things like outages mailing list, or HN, will let me
| know if it's yet another cloud failure, or if it's
| something that's affecting my machine only.
| cwkoss wrote:
| I wonder if there could be profitable play where an
| organization monitors SLA compliance, and then produces a
| batch of lawsuits or class action suit on behalf of all of
| its members when the SLA is violated.
| pm90 wrote:
| This is a neat idea. Install a simple agent in all
| customers' environments, select AWS dependencies, then
| monitor uptime over time. Aggregate across customers, and
| then go to AWS with this data.
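|
| A minimal sketch of such a probe (assuming curl; the endpoints and
| the 60-second interval are arbitrary, and a real SLA measurement
| would track the specific APIs a customer actually depends on):
|
|         $ while true; do
|         >   ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
|         >   # Log the HTTP status; what counts as "up" depends on
|         >   # the endpoint being probed.
|         >   for ep in https://dynamodb.us-east-1.amazonaws.com \
|         >             https://sts.us-east-1.amazonaws.com; do
|         >     code=$(curl -s -o /dev/null --max-time 5 \
|         >                 -w '%{http_code}' "$ep")
|         >     echo "$ts $ep $code"
|         >   done >> aws-uptime.log
|         >   sleep 60
|         > done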
| xtracto wrote:
| >It's because none of these companies are held responsible
| for missing their actual SLAs, as opposed to their self-
| reported SLA compliance.
|
| Right, there should be an "alliance" of customers from
| different large providers (something like a Union but
| instead of workers, it would be customers). They are the
| ones that should measure SLAs and hold the provider
| accountable.
| beamatronic wrote:
| In a broader world sense, we live in the post-truth era.
| andrew_ wrote:
| This is also affecting Elastic Beanstalk and Elastic Container
| Registry. We're getting 500s via API for both services.
| torbTurret wrote:
| Any leads on the cause? Some services on us-east-1 seem to be
| working just fine. Others not.
| sirfz wrote:
| AWS ec2 API requests are failing and ECR also seems to be down
| (error 500 on pull)
| mohanmcgeek wrote:
| Same happening right now with Goodreads
| zahil wrote:
| Is it just me, or are outages involving large websites becoming
| more and more frequent?
| amir734jj wrote:
| My job (although 50% of the time) at Azure is unit testing/
| monitoring services under different scenarios and flows to detect
| small failures that would be overlooked on the public status page.
| Our tests run multiple times daily and we have people constantly
| monitoring logs. It concerns me when I see all AWS services 100%
| green when I know there is an outage.
| jamesfmilne wrote:
| Oh, sweet summer child.
|
| The reason you care about your status page being 100% accurate
| is that your stock price is not directly linked to your status
| page.
| joshocar wrote:
| I don't know how accurate this information is, but I'm hearing
| that the monitor can't be updated because the service is in the
| region that is down.
| Rapzid wrote:
| Kinda hard to believe after they were blasted for that very
| situation during/after the S3 outage way back.
|
| If that's the case, it's 100% a feature. They want as little
| public proof of an outage after it's over and to put the
| burden on customers completely to prove they violated SLAs.
| [deleted]
| fnord77 wrote:
| can't wait for the postmortem on this
| [deleted]
| andreyazimov wrote:
| My payment processor Paddle also stopped working.
| skapadia wrote:
| I'm absolutely terrified by our reliance on cloud providers (and
| modern technology, networks, satellites, electric grids, etc. in
| general), but the cloud hasn't been around that long. It is
| advancing extremely fast, and every problem makes it more
| resilient.
| skapadia wrote:
| IMO, the cloud is not the overall problem. Our insatiable
| desire to want things right HERE, right NOW is the core issue.
| The cloud is just a solution to try to meet that demand.
| SkyPuncher wrote:
| Would you prefer to have more of something good that you can't
| occasionally have OR less of something good that you can always
| have?
|
| ----
|
| The answer likely depends on the specific thing, but I'd argue
| most people would take the better version of something at the
| risk of it not working 1 or 2 days per year.
| l72 wrote:
| Most of our services in us-east-1 are still responding although
| we cannot log into the console. However, it looks like dynamodb
| is timing out most requests for us.
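|
| During partial brownouts like this, SDK/CLI retry behaviour can at
| least be made more forgiving (a sketch; the values are arbitrary
| and retries obviously won't help if the service is hard down):
|
|         # Applies to the AWS CLI and SDKs that read ~/.aws/config
|         $ aws configure set retry_mode adaptive
|         $ aws configure set max_attempts 5
|
|         # Or per-process via environment variables
|         $ export AWS_RETRY_MODE=adaptive AWS_MAX_ATTEMPTS=5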
| john37386 wrote:
| It seems to be taking a while to fix!
|
| They probably painted themselves into a corner just like Facebook
| a few weeks ago.
|
| This makes me think:
|
| Could it be that one day the internet will have a total global
| outage and it will take a few days to recover?
| ericskiff wrote:
| This actually happened back in 2012 or so. Major AWS outage
| that took down big services all over the place, took a few days
| for some services to come fully back online.
| https://aws.amazon.com/message/680342/
| devenvdev wrote:
| The only possible scenario I could come up with is someone
| crashing the internet on purpose, e.g. some crazy uncatchable
| trojan that starts DDoSing everything. I doubt such a scenario is
| feasible though...
| rossdavidh wrote:
| If we have a total global outage, Stack Overflow will be
| unavailable, and the internet will never be fixed. :) Mostly
| joking, I hope...
| jaywalk wrote:
| Some brave soul at Stack Overflow will have to physically go
| into the datacenter, roll up a cart with a keyboard, monitor
| and printer and start printing off a bunch of Networking
| answers.
| lesam wrote:
| The StackOverflow datacenter is famously tiny - like, 10
| Windows servers. So even if the rest of the internet goes
| down hopefully they stay up. They might have to rebuild the
| internet out from their server room though.
| iso1210 wrote:
| I'm not sure how you get a total global outage in a distributed
| system. Let's say a major transit provider (CenturyLink for
| example) advertises traffic as "go via me", but then drops
| the traffic; let's also assume it drops the costs of its routes
| to pretty much zero. That would certainly have a major effect,
| until their customers/peers stop peering with them.
|
| That might be tricky if they are remote and not on the same AS
| as their router access points and have no completely out of
| band access, but you're still talking hours at most.
| eoinboylan wrote:
| Can't SSM into ec2 in us-east-1
| johnsimer wrote:
| Everything seems to be functioning normally for me now
| ChrisArchitect wrote:
| We have always been at war with us-east-1.
| ComputerGuru wrote:
| The AWS status page no longer loads for me. /facepalm
| techthumb wrote:
| https://status.aws.amazon.com hasn't been updated to reflect the
| outage yet
| cyanydeez wrote:
| Probably cached for good measure
| hvgk wrote:
| There's probably a lambda somewhere supposed to update it that
| is screaming into the darkness at the moment.
|
| According to an internal message I saw, their monitoring stuff
| is fucked too.
| cbtacy wrote:
| It's funny but when I saw "AWS Outage" breaking, my first thought
| was "I bet it's US-east-1 again."
|
| I know it's cheap but seriously... not worth it. Many of us have
| the scars to prove this.
| wrren wrote:
| Looks like their health check logic also sucks, just like mine.
| whoknowswhat11 wrote:
| Anyone understand why these services go down for so long?
|
| That's the part I find interesting.
| pixelmonkey wrote:
| Looks like Kinesis Firehose is either the root cause, or severely
| impacted:
|
| https://twitter.com/amontalenti/status/1468265799458639877
|
| Segment is publicly reporting issues delivering to Firehose, and
| one of my company's real-time monitors also triggered for Kinesis
| Firehose an hour ago.
|
| Update:
|
| By my sniff of it, some "core" APIs are down for S3 and EC2 (e.g.
| GET/PUT on S3 and node create/delete on EC2). Systems like
| Kinesis Firehose and DynamoDB rely on these APIs under the hood
| ("serverless" is just "a server in someone else's data center").
|
| Further update:
|
| There is a workaround available for the AWS Console login issue.
| You can use https://us-west-2.console.aws.amazon.com/ to get in
| -- it's just the landing page that is down (because the landing
| page is in the affected region).
| muttantt wrote:
| Running anything on us-east-1 is asking for trouble...
| [deleted]
| anovikov wrote:
| Haha, my developer called me in a panic telling me that he crashed
| Amazon - he was doing some load tests with Lambda
| imstil3earning wrote:
| thats cute xD
| Justsignedup wrote:
| thank you for that big hearty laugh! :)
| xtracto wrote:
| How can you own a developer? is it expensive to buy one?
| DataGata wrote:
| Don't get nitty about saying "my X". People say "my plumber"
| or "my hairstylist" or whatever all the time.
| rossdavidh wrote:
| If he actually knows how to crash Amazon, you have a new
| business opportunity, albeit not a very nice one...
| politelemon wrote:
| It'd be hilarious if you kept that impression going for the
| duration of the outage.
| tgtweak wrote:
| Postmortem: unbounded auto-scaling of Lambda combined with an
| oversight on internal rate limits caused an unforeseen internal
| DDoS.
| kuya11 wrote:
| The blatant status page lies are getting absolutely ridiculous.
| How many hours does a service need to be totally down until it
| gets properly labelled as a "disruption"?
| bearjaws wrote:
| Yeah, we are seeing SQS, API Gateway (both socket and non
| websocket) and S3 all completely unavailable. Status page shows
| nothing despite having gotten several updates.
| nemothekid wrote:
| Some sage advice I learned a while ago: "Avoid us-east-1 as much
| as possible".
| dr-detroit wrote:
| If you need to be up all the time don't you use more than 1
| region or do you need the ultra low ping for running 9-11
| operator phone systems that calculate real time firing
| solutions to snipe ICBMs out of low orbit?
| soco wrote:
| But if you use CloudFront, there you go.
| [deleted]
| alex_young wrote:
| EDIT: As pointed out below, I missed that this was for the Amazon
| Connect service, and not an update for all of US-EAST-1.
| Preserved for consistency, but obviously just a comprehension
| issue on my side.
|
| At least the updates are amusing:
|
| "9:18 AM PST We can confirm degraded Contact handling by agents
| in the US-EAST-1 Region. Agents may experience issues logging in
| or being connected with end-customers."
|
| WTF is "contact handling", an "agent" or an "end-customer"?
|
| How about something like "We are confirming that some users are
| not able to connect to AWS services in us-east-1. We're looking
| into it."
| dastbe wrote:
| that's an update for amazon connect, which is a customer
| support related service.
| detaro wrote:
| Amazon Connect is a call-center product, so that report makes
| sense.
| adwww wrote:
| Left a big terraform plan running while I put the kids to bed,
| checked back now and Amazon is on fire.... was it me?!
| mohanmcgeek wrote:
| I don't think it's a console outage. Goodreads has been down for
| a while
| jrs235 wrote:
| Search on Amazon.com seems to be broken too. This doesn't appear
| to just be affecting their AWS revenue.
| authed wrote:
| I've had issues logging in at amazon.com too... and IMDb is also
| down
| bamboozled wrote:
| I don't think AWS knows what's going on, judging by their updates.
| Yes, DynamoDB might be having issues, but so is IAM it seems -
| we're getting errors terminating resources, for example.
| jgworks wrote:
| Someone posted this on our company slack:
| https://stop.lying.cloud/
| mbordenet wrote:
| I suspect the ex-Amazonian PragmaticPulp cites was let go from
| Amazon for a reason. The COE process works, provided the culture
| is healthy and genuinely interested in fixing systemic problems.
| Engineers who seek to deflect blame are toxic and unhelpful.
| Don't hire them!
| woshea901 wrote:
| N. Virginia consistently has more problems than other zones. Is
| it possible this zone is also hosting government computers/could
| it be a more frequent target for this reason?
| longhairedhippy wrote:
| The real reason is that us-east-1 was the first and by far the
| biggest region, which is also why new services always launch
| there while other regions are not necessarily required (some
| services have to launch in every region).
|
| The us-east-1 region is consistently pushing the limits of
| scale for the AWS services, thus it has way more problems than
| other regions.
| exabrial wrote:
| Just a reminder that Colocation is always an option :)
| lgylym wrote:
| So reinvent is over. Time to deploy.
| filip5114 wrote:
| Can confirm, us-east-1
| romanhotsiy wrote:
| Can confirm Lambda is down in us-east-1. Other services seem
| to work for us.
| imnoscar wrote:
| STS or console login not working either.
| all_usernames wrote:
| 25 Regions, 85 Availability Zones in this global cloud service
| and I can't login because of a failure in a single region (their
| oldest).
|
| Can't login to AWS console at signin.aws.amazon.com:
| Unable to execute HTTP request: sts.us-east-1.amazonaws.com.
| Please try again.
| ipmb wrote:
| Looks like they've acknowledged it on the status page now.
| https://status.aws.amazon.com/
|
| > 8:22 AM PST We are investigating increased error rates for the
| AWS Management Console.
|
| > 8:26 AM PST We are experiencing API and console issues in the
| US-EAST-1 Region. We have identified root cause and we are
| actively working towards recovery. This issue is affecting the
| global console landing page, which is also hosted in US-EAST-1.
| Customers may be able to access region-specific consoles going to
| https://console.aws.amazon.com/. So, to access the US-WEST-2
| console, try https://us-west-2.console.aws.amazon.com/
| jabiko wrote:
| Yeah, but I still have a different understanding what
| "Increased Error Rates" means.
|
| IMHO it should mean that the rate of errors is increased but
| the service is still able to serve a substantial amount of
| traffic. If the rate of errors is bigger than, let's say, 90%
| that's not an increased error rate, that's an outage.
| thallium205 wrote:
| They say that to try and avoid SLA commitments.
| jiggawatts wrote:
| Some big customers should get together and make an
| independent org to monitor cloud providers and force them
| to meet their SLA guarantees without being able to weasel
| out of the terms like this...
| guenthert wrote:
| Uh, four minutes to identify the root cause? Damn, those guys
| are on fire.
| czbond wrote:
| :) I imagine it went like this theoretical Slack
| conversation:
|
|         > Dev1: Pushing code for branch "master" to "AWS API".
|         > <slackbot> Your deploy finished in 4 minutes
|         > Dev2: I can't reach the API in east-1
|         > Dev1: Works from my computer
| tonyhb wrote:
| It was down as of 7:45am (we posted in our engineering
| channel), so that's a good 40 minutes of public errors before
| the root cause was figured out.
| Frost1x wrote:
| Identify or to publicly acknowledge? Chances are technical
| teams knew about this and noticed it fairly quickly, they've
| been working on the issue for some time. It probably wasn't
| until they identified the root cause and had a handful of
| strategies to mitigate with confidence that they chose to
| publicly acknowledge the issue to save face.
|
| I've broken things before and been aware of it, but didn't
| acknowledge them until I was confident I could fix them. It
| allows you to maintain an image of expertise to those outside
| who care about the broken things but aren't savvy to what or
| why it's broken. Meanwhile you spent hours, days, weeks
| addressing the issue and suddenly pull a magic solution out
| of your hat to look like someone impossible to replace.
| Sometimes you can break and fix things without anyone even
| knowing which is very valuable if breaking something had some
| real risk to you.
| sirmarksalot wrote:
| This sounds very self-blaming. Are you sure that's what's
| really going through your head? Personally, when I get
| avoidant like that, it's because of anticipation of the
| amount of process-related pain I'm going to have to endure
| as a result, and it's much easier to focus on a fix when
| I'm not also trying to coordinate escalation policies that
| I'm not familiar with.
| flerchin wrote:
| Outage started at 7:31 PST per our monitoring. They are on
| fire, but not in a good way.
| giorgioz wrote:
| I'm trying to log in to the AWS Console from other regions but
| I'm getting HTTP 500. Anyone managed to log in from other regions?
| Which ones?
|
| Our backend is failing; it's on us-east-1 using AWS Lambda, API
| Gateway, S3
| bobviolier wrote:
| https://status.aws.amazon.com/ still shows all green for me
| banana_giraffe wrote:
| It's acting odd for me. Shows all green in Firefox, but shows
| the error in Chrome even after some refreshes. Not sure
| what's caching where to cause that.
| dang wrote:
| Ok, we've changed the URL to that from https://us-
| east-1.console.aws.amazon.com/console/home since the latter is
| still not responding.
|
| There are also various media articles but I can't tell which
| ones have significant new information beyond "outage".
| jesboat wrote:
| > This issue is affecting the global console landing page,
| which is also hosted in US-EAST-1
|
| Even this little tidbit is a bit of a wtf for me. Why do they
| consider it ok to have _anything_ hosted in a single region?
|
| At a different (unnamed) FAANG, we considered it unacceptable
| to have anything depend on a single region. Even the dinky
| little volunteer-run thing which ran
| https://internal.site.example/~someEngineer was expected to be
| multi-region, and was, because there was enough infrastructure
| for making things multi-region that it was usually pretty easy.
| alfiedotwtf wrote:
| Maybe has something to do with CloudFront mandating certs to
| be in us-east-1?
| tekromancr wrote:
| YES! Why do they do that? It's so weird. I will deploy a
| whole config into us-west-1 or something; but then I need
| to create a new cert in us-east-1 JUST to let cloudfront
| answer an HTTPS call. So frustrating.
| jamesfinlayson wrote:
| Agreed - in my line of work regulators want everything in
| the country we operate from but of course CloudFront has
| to be different.
| sheenobu wrote:
| I think I know specifically what you are talking about. The
| actual files an engineer could upload to populate their
| folder was not multi-region for a long time. The servers
| were, because they were stateless and that was easy to multi-
| region, but the actual data wasn't until we replaced the
| storage service.
| ehsankia wrote:
| Forget the number of regions. Monitoring for X shouldn't even
| be hosted on X at all...
| stevehawk wrote:
| I don't know if that should surprise us. AWS hosted their
| status page in S3 so it couldn't even reflect its own outage
| properly ~5 years ago.
| https://www.theregister.com/2017/03/01/aws_s3_outage/
| tekromancr wrote:
| I just want to serve 5 terabytes of data
| mrep wrote:
| Reference for those out of the loop:
| https://news.ycombinator.com/item?id=29082014
| [deleted]
| all_usernames wrote:
| Every damn Well-Architected Framework includes multi-AZ if
| not multi-region redundancy, and yet the single access point
| for their millions of customers is single-region. Facepalm in
| the form of $100Ms in service credits.
| cronix wrote:
| > Facepalm in the form of $100Ms in service credits.
|
| It was also greatly affecting Amazon.com itself. I kept
| getting sporadic 404 pages and one was during a purchase.
| Purchase history wasn't showing the product as purchased
| and I didn't receive an email, so I repurchased. Still no
| email, but the purchase didn't end in a 404, but the
| product still didn't show up in my purchase history. I have
| no idea if I purchased anything, or not. I have never had
| an issue purchasing. Normally get a confirmation email
| within 2 or so minutes and the sale is immediately
| reflected in purchase history. I was unaware of the greater
| problem at that moment or I would have steered clear at the
| first 404.
| jjoonathan wrote:
| Oh no... I think you may be in for a rough time, because
| I purchased something this morning and it only popped up
| in my orders list a few minutes ago.
| vkgfx wrote:
| >Facepalm in the form of $100Ms in service credits.
|
| Part of me wonders how much they're actually going to pay
| out, given that their own status page has only indicated
| _five_ services with moderate ( "Increased API Error
| Rates") disruptions in service.
| ithkuil wrote:
| One region? I forgot how to count that low
| stephenr wrote:
| When I brought up the status page (because we're seeing
| failures trying to use Amazon Pay) it had EC2 and Mgmt Console
| with issues.
|
| I opened it again just now (maybe 10 minutes later) and it now
| shows DynamoDB has issues.
|
| If past incidents are anything to go by, it's going to get
| worse before it gets better. Rube Goldberg machines aren't
| known for their resilience to internal faults.
| jeremyjh wrote:
| They are still lying about it, the issues are not only
| affecting the console but also AWS operations such as S3 puts.
| S3 still shows green.
| packetslave wrote:
| IAM is a "global" service for AWS, where "global" means "it
| lives in us-east-1".
|
| STS at least has recently started supporting regional
| endpoints, but most things involving users, groups, roles,
| and authentication are completely dependent on us-east-1.
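|
| For the STS half, the regional opt-in looks roughly like this (a
| sketch; it only moves STS calls off the global endpoint and does
| nothing for the IAM control plane, which stays global):
|
|         # Tell the CLI/SDKs to use sts.<region>.amazonaws.com
|         $ aws configure set sts_regional_endpoints regional
|         # or per-process:
|         $ export AWS_STS_REGIONAL_ENDPOINTS=regional
|
|         $ aws sts get-caller-identity --region us-west-2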
| lsaferite wrote:
| It's certainly affecting a wider range of stuff from what
| I've seen. I'm personally having issues with API Gateway,
| CloudFormation, S3, and SQS
| pbalau wrote:
| > We are experiencing _API_ and console issues in the US-
| EAST-1 Region
| jeremyjh wrote:
| I read it as console APIs. Each service API has its own
| indicator, and they are all green.
| midasuni wrote:
| Our corporate ForgeRock 2FA service is apparently broken.
| My services are behind distributed x509 certs so no
| problems there.
| Rantenki wrote:
| Yep, I am seeing failures on IAM as well:
|
|         $ aws iam list-policies
|         An error occurred (503) when calling the ListPolicies
|         operation (reached max retries: 2): Service Unavailable
| silverlyra wrote:
| Same here. Kubernetes pods running in EKS are
| (intermittently) failing to get IAM credentials via the
| ServiceAccount integration.
| alde wrote:
| S3 bucket creation is failing across all regions for us, so this
| isn't a us-east-1-only issue.
| kp195_ wrote:
| The AWS console seems kind of broken for us in us-west-1
| (Northern California), but it seems like the actual services are
| working
| jacobkg wrote:
| AWS Connect is down, so our customer support phone system is down
| with it
| nostrebored wrote:
| Highly recommend talking to your account team to recommend
| regional failovers and DR for Amazon Connect! With enough
| feedback from customers, stuff like this can get prioritized.
| jacobkg wrote:
| Thanks, will definitely do that!
| misthop wrote:
| Getting 500 errors across the board on systems trying to hit s3
| abaldwin99 wrote:
| Ditto
| booleanbetrayal wrote:
| We're seeing load balancer failure in us-east-1 AZ's, so we are
| not exactly sure why this is being characterized as Console
| outage ...
| [deleted]
| whalesalad wrote:
| I can still hit EC2 boxes and networking is okay. DynamoDB is
| 100% down for the count, every request is an Internal Server
| Error.
| l72 wrote:
| We are also seeing lots of failures with DynamoDB across all
| our services in us-east-1.
| snewman wrote:
| DynamoDB is fine for us. Not contradicting your experience,
| just adding another data point. There is definitely something
| hit-or-miss about this incident.
| jonnycomputer wrote:
| Ah, this might explain why my AWS requests were so slow, or
| timing out, this afternoon.
| sayed2020 wrote:
| Need google voice unlimited,
| artembugara wrote:
| I think there should be some third-party status checker alliance.
|
| It's a joke. Each time AWS/Azure/GCP is down their status page
| says all is fine.
| cphoover wrote:
| Want to build a startup?
| artembugara wrote:
| already running one.
| _of wrote:
| ...and imdb
| soco wrote:
| And Netflix.
| MR4D wrote:
| And Venmo, and McDonald's, and....
|
| This one is pretty epic (pun intended). Bad enough that Down
| Detector [0] shows " _Reports indicate there may be a
| widespread outage at Amazon Web Services, which may be
| impacting your service._ " in a red alert bar at the top.
|
| [0] - https://downdetector.com/
| mrguyorama wrote:
| It occurs to me that it's very nice that Netflix, Youtube,
| and other streaming services tend to be on separate
| infrastructure so they don't all go down at once
| ec109685 wrote:
| I'm surprised Netflix is down. They are multi-region:
| https://netflixtechblog.com/active-active-for-multi-
| regional...
| synergy20 wrote:
| No wonder I could not read books from amazon all of a sudden,
| what about their cloud-based redundancy design?
| doesnotexist wrote:
| Can also confirm that the kindle app is failing for me and has
| been for the past few hours.
| tgtweak wrote:
| The book preview webservice or actual ebooks (kindle, etc)?
| doesnotexist wrote:
| For me, it's been that I am unable to download books to the
| kindle app on my computer
| hvgk wrote:
| Well I got to bugger off home early so good job Amazon.
|
| Edit: to be clear this is because I'm utterly helplessly unable
| to do anything at the moment.
| AnIdiotOnTheNet wrote:
| Yep, that's a consideration of going with cloud tech: if
| something goes wrong you're often powerless. At least with on-
| prem you know who to wake up in the middle of the night and
| you'll get straight-forward answers about what's going on.
| hvgk wrote:
| Depends which provider you host your crap with. I've had real
| trouble trying to get a top tier incident even acknowledged
| by one of the pre cloud providers.
|
| To be fair, when it's AWS and something goes snap, it's not my
| problem, which I'm happy about (until some wise ass at AWS
| hires me) :)
| AnIdiotOnTheNet wrote:
| > Depends which provider you host your crap with.
|
| That's what I'm saying: you host it yourself in facilities
| owned by your company if you're not willing to have
| everyone twiddle their thumbs during this sort of event.
| Your DR environment can be co-located or hosted elsewhere.
| temuze wrote:
| Friends tell friends to pick us-east-2.
|
| Virginia is for lovers, Ohio is for availability.
| blahyawnblah wrote:
| Lots of services are only in us-east-1. The sso system isn't
| working 100% right now so that's where I assume it's hosted.
| skwirl wrote:
| Yeah, there are "global" services which are actually secretly
| us-east-1 services as that is the region they use for
| internal data storage and orchestration. I can't launch
| instances with OpsWorks (not a very widely used service, I'd
| imagine) even if those instances are in stacks outside of us-
| east-1. I suspect Route53 and CloudFront will also have
| issues.
| johnsimer wrote:
| Yeah I can't log in with our external SAML SSO to our AWS
| dashboard to manage our us-east-2 resources. . . . Because
| our auth is apparently routed thru us-east-1 STS
| jhugo wrote:
| You can pick the region for SSO -- or even use multiple. Ours
| is in ap-southeast-1 and working fine -- but then the console
| that it signs us into is only partially working presumably
| due to dependencies on us-east-1.
| bithavoc wrote:
| Sometimes you can't avoid us-east-1; an example is AWS ECR
| Public. It's a shame. Meanwhile, DockerHub is up and running
| even when it's in EC2 itself.
| vrocmod wrote:
| No one was in the room where it happened
| mountainofdeath wrote:
| us-east-1 is a cursed region. It's old, full of one-off patches
| to keep it working, and tends to be the first big region that
| changes are released to.
| more_corn wrote:
| This is funny, but true. I've been avoiding us-east-1 simply
| because thats where everyone else is. Spot instances are also
| less likely to be expensive in less utilized regions.
| politician wrote:
| Can I get that on a license plate?
| kavok wrote:
| Didn't us-east-2 have an issue last week?
| stephenr wrote:
| Friends tell Friends not to use Rube Goldberg machines as their
| infrastructure layer.
| johnl1479 wrote:
| This is also a clever play on the Hawthorne Heights song.
| PopeUrbanX wrote:
| I wonder why AWS has Ohio and Virginia but no region in the
| northeast where a significant plurality of customers using east
| regions probably live.
| api wrote:
| I live in Ohio and can confirm. If the Earth were destroyed by
| an asteroid Ohio would be left floating out there somehow
| holding onto an atmosphere for about ten years.
| tgtweak wrote:
| If you're not multi-cloud in 2021 and are expecting 5-9's, I
| feel bad for you.
| post-it wrote:
| I imagine there are very few businesses where the extra cost
| of going multi-cloud is smaller than the cost of being down
| during AWS outages.
| gtirloni wrote:
| Also, going multi-cloud will introduce more complexity
| which leads to more errors and more downtime. I'd rather
| sit this outage out than deal with a daily risk of downtime
| because my infrastructure is too smart for its own good.
| shampster wrote:
| Depends on the criticality of the service. I mean you're
| right about adding complexity. But sometimes you can just
| take your really critical services and make sure it can
| completely withstand any one cloud provider outage.
| unethical_ban wrote:
| If you're not multi-region, I feel bad for you.
|
| If your company is shoehorning you into using multiple clouds
| and learning a dozen products, IAM and CICD dialects
| simultaneously because "being cloud dependent is bad", I feel
| bad for you.
|
| Doing _one_ cloud correctly from a current DevSecOps
| perspective is a multi-year ask. I estimate it takes about 25
| people working full time on managing and securing
| infrastructure per cloud, minimum. This does not include
| certain matrixed people from legacy network/IAM teams. If
| you have the people, go for it.
| tgtweak wrote:
| There are so many things that can go wrong with a single
| provider, regardless of how many availability zones you are
| leveraging, that you cannot depend on 1 cloud provider for
| your uptime if you require that level of up.
|
| Example: Payment/Administrative issues, rogue employee with
| access, deprecated service, inter-region routing issues,
| root certificate compromises... the list goes on and it is
| certainly not limited to single AZ.
|
| A very good example, is that regardless of which of the 85
| AZs you are in at aws, you are affected by this issue right
| now.
|
| Multi-cloud with the right tooling is trivial. Investing in
| learning cloud-proprietary stacks is a waste of your
| investment. You're a clown if you think 25 people
| internally per cloud is required to "do it right".
| unethical_ban wrote:
| All cloud tech is proprietary.
|
| There is no such thing as trivially setting up a secure,
| fully automated cloud stack, much less anything like a
| streamlined cloud agnostic toolset.
|
| Deprecated services are not the discussion here. We're
| talking tactical availability, not strategic tools etc.
|
| Rogue employees with access? You mean at the cloud
| provider or at your company? Still doesn't make sense.
| Cloud IAM is very difficult in large organizations, and
| each cloud does things differently.
|
| I worked at fortune 100 finance on cloud security. Some
| things were quite dysfunctional, but the struggles and
| technical challenges are real and complex at a large
| organization. Perhaps you're working on a 50 employee
| greenfield startup. I'll hesitate to call you a clown as
| you did me, because that would be rude and dismissive of
| your experience (if any) in the field.
| throwmefar32 wrote:
| This.
| ricardobayes wrote:
| Someone start devops as a service please
| ryuta wrote:
| How do you become multi-cloud if your root domain is in
| Route53? Have Backup domains on the client side?
| tgtweak wrote:
| DNS records should be synced to a secondary provider, and
| that provider added as your domain's secondary/tertiary
| nameservers.
|
| Multi-provider DNS is a solved problem.
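|
| A rough sketch of the export half of that (assuming jq; the hosted
| zone ID is a placeholder, and the import side depends entirely on
| the secondary provider's API or zone-file format):
|
|         # Dump every record set in the Route53 zone as JSON
|         $ aws route53 list-resource-record-sets \
|             --hosted-zone-id Z0000000EXAMPLE > zone.json
|
|         # Quick summary of what would need mirroring
|         $ jq -r '.ResourceRecordSets[] | "\(.Name) \(.Type)"' zone.json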
| temuze wrote:
| If you're having SLA problems I feel bad for you son
|
| I got two 9 problems cuz of us-east-1
| ShroudedNight wrote:
| > ~~I got two 9 problems cuz of us-east-1~~
|
| I left my two nines problems in us-east-1
| sbr464 wrote:
| Ohio's actual motto funnily kind of fits here:
| With God, all things are possible
| shepherdjerred wrote:
| Does this imply Virginia is Godless?
| sbr464 wrote:
| Maybe virginia is for lovers, ohio is for God/gods?
| tssva wrote:
| Virginia's actual motto is "Sic semper tyrannis". What's
| more tyrannical than an omnipotent being that will condemn
| you to eternal torment if you don't worship them and follow
| their laws?
| sneak wrote:
| I mean, most people are okay with dogs and Seattle.
| [deleted]
| mey wrote:
| I think I should add state motto to my data center
| consideration matrix.
| bee_rider wrote:
| Virginia and Massachusetts have surprisingly aggressive
| mottoes (MA is: "By the sword we seek peace, but peace
| only under liberty", which is really just a fancy way of
| saying "don't make me stab a tyrant," if you think about
| it). It probably makes sense, though, given that they
| came up with them during the revolutionary war.
| anonu wrote:
| I think it's just the console - my EC2 instances in us-east-1 are
| still reachable.
| jrs235 wrote:
| I think it's affecting more than just AWS. Try searching
| amazon.com. That's broken for me.
| dylan604 wrote:
| The fun thing about these types of outages is seeing all of the
| people that depend upon these services with no graceful fallback.
| My roomba app will not even launch because of the AWS outage. I
| understand that the app gets "updates" from the cloud. In this
| case "updates" is usually promotional crap, but whatevs. However,
| for this to prevent the app launching in a manner that I can
| control my local device is total BS. If you can't connect to the
| cloud, fail, move on and load the app so that local things are
| allowed to work.
|
| I'm guessing other IoT things suffer from this same short-
| sightedness as well.
| codegeek wrote:
| "If you can't connect to the cloud, fail, move on and load the
| app so that local things are allowed to work."
|
| Building fallbacks requires work. How much extra effort and
| overhead is needed to build something like this? Sometimes the
| cost-benefit analysis says that it is OK not to do it. If AWS has an
| outage like this once a year, maybe we can deal with it (unless
| you are working with mission critical apps).
| dylan604 wrote:
| Yes, it is a lot of work to test if the response code is OK or
| not, or if a timeout limit has been reached. So much so that I
| pretty much wrote the test in the first sentence. Phew. 10x
| coder right here!
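|
| In shell terms the check really is about this small (the URL and
| cache path are placeholders; the point is just to fall back to the
| last-known-good local copy when the fetch fails or times out):
|
|         # Try the cloud first with a short timeout; -f makes curl
|         # treat HTTP errors as failures. On failure, keep the cache.
|         $ if curl -sf --max-time 3 https://config.example.com/app.json \
|                   -o /tmp/app.json.new; then
|         >   mv /tmp/app.json.new ~/.cache/app.json
|         > fi
|         $ app --config ~/.cache/app.json   # hypothetical local launch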
| lordnacho wrote:
| If you did that some clever person would set up their PiHole so
| that their device just always worked, and then you couldn't
| send them ads and surveil them. They'd tell their friends and
| then everyone would just use their local devices locally.
| Totally irresponsible what you're suggesting.
| apexalpha wrote:
| A little off-topic, but there are people working on it:
| https://valetudo.cloud/
|
| It's a little harder than blocking the DNS unfortunately. But
| nonetheless it always brings a smile to my face to see that
| there's a FOSS frontier for everything.
| beamatronic wrote:
| An even more clever person would package up this box, and
| sell it, along with a companion subscription service, to help
| busy folks like myself.
| dylan604 wrote:
| But this new little box would then be required to connect
| to the home server to receive updates. Guess what? No
| updates, no worky!! It's a vicious circle!!! Outages all
| the way down
| arksingrad wrote:
| this is why everyone runs piholes and no one sees ads on the
| internet anymore, which killed the internet ad industry
| dylan604 wrote:
| Dear person from the future, can you give me a hint on who
| wins the upcoming sporting events? I'm asking for a friend
| of course
| 0des wrote:
| Also, what's the verdict on John Titor?
| sneak wrote:
| Now think of how many assets of various governments' militaries
| are discreetly employed as normal operational staff by FAAMG in
| the USA and have access to _cause_ such events from scratch. I
| would imagine that the US IC (CIA/NSA) already does some free
| consulting for these giant companies to this end, because they
| are invested in that Not Being Possible (indeed, it's their
| job).
|
| There is a societal resilience benefit to not having
| unnecessary cloud dependencies beyond the privacy stuff. It
| makes your society and economy more robust if you can continue
| in the face of remote failures/errors.
|
| It is December 7th, after all.
| dylan604 wrote:
| > I would imagine that the US IC (CIA/NSA) already does some
| free consulting for these giant companies to this end,
|
| Haha, it would be funny if the IC reaches out to BigTech when
| failures occur to let them know they need not be worried
| about data loses. They can just borrow a copy of the data IC
| is siphoning off them. /s?
| taf2 wrote:
| I wouldn't jump to say it's short-sightedness (it is shitty) but
| it could be a matter of being pragmatic... It's easier to
| maintain the code if it is loaded at run time (think thin-
| client browser style). This way your IoT device can load the
| latest code and even settings from the cloud... (an advantage
| when the cloud is available)... I think of this less as short-
| sightedness and more as a reasonable trade-off (with shitty side
| effects)
| epistasis wrote:
| I don't think that's ever a reasonable tradeoff! Network
| access goes down all the time, and should be a fundamental
| assumption of any software.
|
| Maybe I'm too old, but I can't imagine a seasoned dev, much
| less a tech lead, omitting planning for that failure mode
| outime wrote:
| Then you could just keep a local copy available as a fallback
| in case the latest code cannot be fetched. Not doing the bare
| minimum and screwing the end user isn't acceptable IMHO. But
| I also understand that'd take some engineer hours and report
| virtually no benefits as these outages are rare (not sure how
| Roomba's reliability is in general on the other hand) so here
| we are.
| s_dev wrote:
| >The fun thing about these types of outages are seeing all of
| the people that depend upon these services with no graceful
| fallback.
|
| Whats a graceful fallback? Switching to another hosting service
| when AWS goes down? Wouldn't that present another set of
| complications for a very small edge case at huge cost?
| rehevkor5 wrote:
| Usually this refers to falling back to a different region in
| AWS. It's typical for systems to be deployed in multiple
| regions due to latency concerns, but it's also important for
| resiliency. What you call "a very small edge case" is
| occurring as we speak, and if you're vulnerable to it you
| could be losing millions of dollars.
| stevehawk wrote:
| probably not possible for a lot more services than you'd
| think, because AWS Cognito has no decent failover method
| simmanian wrote:
| AWS itself has a huge single point of failure on us-east-1
| region. Usually, if us-east-1 goes down, others soon
| follow. At that point, it doesn't matter how many regions
| you're deploying to.
| scoopertrooper wrote:
| My workloads on Sydney and London are unaffected. I can't
| speak for anywhere else.
| lowbloodsugar wrote:
| "Usually"? When has that ever happened?
| dylan604 wrote:
| https://awsmaniac.com/aws-outages/
| winrid wrote:
| In this case, just connect over LAN.
| politician wrote:
| Or BlueTooth.
| s_dev wrote:
| Right -- I think I've misread OP as graceful fallback e.g.
| working offline.
|
| Rather than implement a dynamically switching backup in the
| event of AWS going down which is not trivial.
| [deleted]
| itisit wrote:
| > Wouldn't that present another set of complications for a
| very small edge case at huge cost?
|
| One has to crunch the numbers. What does a service outage
| cost your business every minute/hour/day/etc in terms of lost
| revenue, reputational damage, violated SLAs, and other
| factors? For some enterprises, it's well worth the added
| expense and trouble of having multi-site active-active setups
| that span clouds and on-prem.
| thayne wrote:
| there's a reason it is called the _Internet_ of things, and not
| the "local network of things". Even if the latter is probably
| what most customers would prefer.
| 9wzYQbTYsAIc wrote:
| There's also no reason for an internet connected app to crash
| on load when there is no access to the internet services.
| mro_name wrote:
| indeed.
|
| A constitutional property of a network is its volatility.
| Nodes may fail. Edges may. You may not. Or you may. But
| then you're delivering no reliability, just crap. Nice
| sunshine crap, maybe.
| birdyrooster wrote:
| Gives me a flashback to the December 24th, 2012 outage. I guess
| not much changes in 9 years time.
| stephenr wrote:
| So, we're getting failures (for customers) trying to use amazon
| pay from our site. AFAIK there is no "status page" for Amazon
| Pay, but the _rest_ of Amazon 's services seem to be a giant Rube
| Goldberg machine so it's hard to imagine this isn't too.
| itisit wrote:
| http://status.mws.amazon.com/
| stephenr wrote:
| Thanks.. seems to be about as accurate as the regular AWS
| status board is..
| itisit wrote:
| They _just_ added a banner. My guess is they don 't know
| enough yet to update the respective service statuses.
| stephenr wrote:
| I have basically zero faith in Amazon at this point.
|
| We first noticed failures because a tester happened to be
| testing in an env that uses the Amazon Pay sandbox.
|
| I checked the prod site, and it wouldn't even ask me to
| login.
|
| When I tried to login to SellerCentral to file a ticket -
| it told me my password (from a pw manager) was wrong.
| When I tried to reset, the OTP was ridiculously slow.
| Clicking "resend OTP" gives a "the OTP is incorrect"
| error message. When I finally got an OTP and put it in,
| the resulting page was a generic Amazon "404 page not
| found".
|
| A while later, my original SellerCentral password, still
| un-changed because I never got another OTP to reset it,
| worked.
|
| What the fuck kind of failure mode is that: "services are
| down, so the password must be wrong".
| itisit wrote:
| Sorry to hear. If multi-cloud is the answer, I wouldn't
| be surprised to see folks go back to owning and operating
| their own gear.
| Tea418 wrote:
| If you have trouble logging in to AWS Console, you can use a
| regional console endpoint such as https://eu-
| central-1.console.aws.amazon.com/
| jedilance wrote:
| I also see the same error here: Internal Error, Please try again
| later.
| throwanem wrote:
| Didn't work for me just now. Same error as the regular
| endpoint.
| judge2020 wrote:
| Got alerted to 503 errors for SES, so it's not just the
| management console.
| lowwave wrote:
| This is exactly the kind of over-centralisation issue I was
| talking about. I was one of the first developers using AWS EC2,
| sure, back when scaling was hard for small dev shops. In this
| day and age anyone who is technically inclined can figure out
| the new technologies. Why even use AWS? Get something like
| Hetzner or Linode, please!
| binaryblitz wrote:
| How are those better than using AWS/Azure/GCP/etc? I'd say the
| correct way to handle situations is to have things in multiple
| regions, and potentially multiple clouds if possible.
| Obviously, things like databases would be harder to keep in
| sync on multi cloud, but not impossible.
| pojzon wrote:
| Some manager: But it does not web scale!
| dr-detroit wrote:
| the average HN poster: I run things on a box under my desk
| and you cannot teach me why that's bad!!!
| saisundar wrote:
| Yikes, Ring, the security system, is also down.
|
| Wonder if crime rates might eventually spike up if AWS goes down,
| in a utopian world where Amazon gets everyone to use Ring.
| kingcharles wrote:
| I'm now imagining a team of criminals sitting around in face
| masks and hoodies refreshing AWS status page all day...
|
| "AWS is down! Christmas came early boys! Roll out..."
| john37386 wrote:
| imdb seems down too and returning 503. Is it related? Here is the
| output. Kind of funny.
|
| D'oh!
|
| Error 503
|
| We're sorry, something went wrong.
|
| Please try again...wait...wait...yep, try reload/refresh now.
|
| But if you are seeing this again, please report it here.
|
| Please explain which page you were at and where on it that you
| clicked
|
| Thank you!
| takeda wrote:
| IMDB belongs to Amazon, so likely on AWS too.
|
| This also confirms it: https://downdetector.com/status/imdb/
| binaryblitz wrote:
| TIL Amazon owns IMDB
| takeda wrote:
| Yeah, I was also surprised when I learned this. Another
| surprising thing is that they've owned it since 1998.
| whoisjuan wrote:
| us-east-1 is so unreliable that it probably should be nuked. It's
| the region with the worst track record. I guess it doesn't help
| that it's one of the oldest.
| zrail wrote:
| US-EAST-1 is the oldest, largest, and most heterogeneous
| region. There are many data centers and availability zones
| within the region. I believe it's where AWS rolls out changes
| first, but I'm not confident on that.
| duckworth wrote:
| After over 45 minutes https://status.aws.amazon.com/ now shows
| "AWS Management Console - Increased Error Rates"
|
| I guess 100% is technically an increase.
| Slartie wrote:
| "Fixed a bug that could cause [adverse behavior affecting 100%
| of the user base] for some users"
| sophacles wrote:
| "some" as in "not all". I'm sure there are some tiny tiny
| sites that were unaffected because no one went to them during
| the outage.
| Slartie wrote:
| "some" technically includes "all", doesn't it? It excludes
| "none", I suppose, but why should it exclude "all" (except
| if "all" equals "none")?
| brasetvik wrote:
| I can't remember seeing problems be more strongly worded than
| "Increased Error Rates" or "high error rates with S3 in us-
| east-1" during the infamous S3 outage of 2017 - and that was
| after they struggled to even update their own status page
| because of S3 being down. :)
| schleck8 wrote:
| During the Facebook outage FB wrote something along the lines
| of "We noticed that some users are experiencing issues with
| our apps" even though nothing worked anymore
| boldman wrote:
| I think now is a good time to reiterate the danger of companies
| just throwing all of their operational resilience and
| sustainability over the wall and trusting someone else with their
| entire existence. It's wild to me that so many high performing
| businesses simply don't have a plan for when the cloud goes down.
| Some of my contacts are telling me that these outages have teams
| of thousands of people completely prevented from working, and tens
| of millions of dollars of profit are simply vanishing since the start
| of the outage this morning. And now institutions like government
| and banks are throwing their entire capability into the cloud
| with no recourse or recovery plan. It seems bad now but I wonder
| how much worse it might be when no one actually has access to
| money because all financial traffic is going through AWS and it
| goes down.
|
| We are incredibly blind to trust just 3 cloud providers with
| the operational success of basically everything we do.
|
| Why hasn't the industry come up with an alternative?
| xwdv wrote:
| There is an alternative: A _true_ network cloud. This is what
| Cloudflare will eventually become.
| jacobsenscott wrote:
| We have or had alternatives - Rackspace, Linode, DigitalOcean;
| in the past there were many others, and self-hosting is still an
| option. But the big three just do it better. The alternatives
| are doomed to fail. If you use anything other than the big
| three you risk not just more outages, but your whole provider
| going out of business overnight.
|
| If the companies at the scale you are talking about do not have
| multi-region and multi-service (AWS to Azure, for example)
| failover, that's their fault and nobody else's.
| [deleted]
| p1necone wrote:
| Do you think they'd manage their own infra better? Are you
| suggesting they pay for a fully redundant second implementation
| on another provider? How much extra cost would that be vs
| eating an outage very infrequently?
| jessebarton wrote:
| In my opinion there is a lack of talent in these industries for
| building out their own resilient systems. IT people and
| engineers get lazy.
| grumple wrote:
| We're too busy in endless sprints to focus on things outside
| of our core business that don't make salespeople and
| executives excited.
| BarryMilo wrote:
| No lazier than anyone else, there's just not enough of us, in
| general and per company.
| rodgerd wrote:
| > IT people and engineers get lazy.
|
| Companies do not change their whole strategy from a capex-
| driven traditional self-hosting environment to opex-driven
| cloud hosting because their IT people are lazy; it is
| typically an exec-level decision.
| tuldia wrote:
| > Why hasn't the industry come up with an alternative?
|
| We used to have that. Some companies still have the capability
| and know-how to build and run infrastructure that is reliable and
| distributed across many hosting providers, as they did before
| "cloud" became the norm, but it goes along with "use it or lose
| it".
| adflux wrote:
| Because your own datacenters can't go down?
| uranium wrote:
| Because the expected value of using AWS is greater than the
| expected value of self-hosting. It's not that nobody's ever
| heard of running on their own metal. Look back at what everyone
| did before AWS, and how fast they ran screaming away from it as
| soon as they could. Once you didn't have to do that any more,
| it's just so much better that the rare outages are worth it for
| the vast majority of startups.
|
| Medical devices, banks, the military, etc. should generally run
| on their own hardware. The next photo-sharing app? It's just
| not worth it until they hit tremendous scale.
| chucknthem wrote:
| Agree with your first point.
|
| On the second though, at some point infrastructure like AWS
| is going to be more reliable than what many banks, medical
| device operators, etc. can provide themselves. Asking them to
| stay on their own hardware is asking for that industry to
| remain slow, bespoke and expensive.
| uranium wrote:
| Agreed, and it'll be a gradual switch rather than a single
| point, smeared across industries. Likely some operations
| won't ever go over, but it'll be a while before we know.
| hn_throwaway_99 wrote:
| Hard agree with the second paragraph.
|
| It is _incredibly_ difficult for non-tech companies to hire
| quality software and infrastructure engineers - they
| usually pay less and the problems aren't as interesting.
| seeEllArr wrote:
| They have, it's called hosting on-premises, and it's even less
| reliable than cloud providers.
| smugglerFlynn wrote:
| Many of those businesses wouldn't have existed in the first
| place without simplicity offered by cloud.
| stfp wrote:
| > tens of million dollars of profit are simply vanishing
|
| vanishing or delayed six hours? I mean
| naikrovek wrote:
| money people think of money very weirdly. when they predict
| they will get more than they actually get, they call it
| "loss" for some reason, and when they predict they will get
| less than they actually get, it's called ... well I don't
| know what that's called but everyone gets bonuses.
| lp0_on_fire wrote:
| 6 hours of downtime often means 6 hours of paying employees
| to stand around which adds up rather quickly.
| nprz wrote:
| So you're saying companies should start moving their
| infrastructure to the blockchain?
| tornato7 wrote:
| Ethereum has gone 5 years without a single minute of
| downtime, so if it's extreme reliability you're going for I
| don't think it can be beaten.
| rbetts wrote:
| We're too busy generating our own electricity and
| designing our own CPUs.
| commandlinefan wrote:
| Well, if you're web-based, there's never really been any better
| alternative. Even before "the cloud", you had to be hosted in a
| datacenter somewhere if you wanted enough bandwidth to service
| customers, as well as have somebody who would make sure the
| power stayed on 24/7. The difference now is that there used to
| be thousands of ISPs, so one outage wouldn't get as much news
| coverage, but it would also probably last longer because you
| wouldn't have a team of people who know what to look for like
| Amazon (probably?) does.
| olingern wrote:
| People are so quick to forget how things were before behemoths
| like AWS, Google Cloud, and Azure. Not all things are free and
| the outage the internet is experiencing is the risk users
| signed up for.
|
| If you would like to go back to the days of managing your own
| machines, be my guest. Remember those machines also live
| somewhere and were/are subject to the same BGP and routing
| issues we've seen over the past couple of years.
|
| Personally, I'll deal with outages a few times a year for the
| peace of mind that there's a group of really talented people
| looking into it for me.
| bowmessage wrote:
| Because the majority of consumers don't know better / don't
| care and still buy products from companies with no backup plan.
| Because, really, how can any of us know better until we're
| burned many times over?
| 300bps wrote:
| This appears to be a single region outage - us-east-1. AWS
| supports as much redundancy as you want. You can be redundant
| between multiple Availability Zones in a single Region or you
| can be redundant among 1, 2 or even 25 regions throughout the
| world.
|
| Multiple-region redundancy costs more both in initial
| planning/setup as well as monthly fees so a lot of AWS
| customers choose to just not do it.
| george3d6 wrote:
| This seems like an insane stance to have, it's like saying
| businesses should ship their own stock, using their own
| drivers, and their in-house made cars and planes and in-house
| trained pilots.
|
| Heck, why stop at having servers on-site? Cast your own silicon
| wafers; after all, you don't want Spectre exploits.
|
| Because you are worse at it. If a specialist is this bad, and
| the market is fully open, then it's because the problem is
| hard.
|
| AWS has fewer outages in one zone alone than the best self-
| hosted institutions, your facebooks and pentagons. In-house
| servers would lead to an insane amount of outages.
|
| And guess what? AWS (and all other IAAS providers) will beg you
| to use multiple region because of this. The team/person that
| has millions of dollars a day staked on a single AWS region is
| an idiot and could not be entrusted to order a gaming PC from
| newegg, let alone run an in-house datacenter.
|
| edit: I will add that AWS specifically is meh and I wouldn't
| use it myself; there's better IaaS. But it's insanity to even
| imagine self-hosted is more reliable than using even the
| shittiest of IaaS providers.
| jgwil2 wrote:
| > In-house servers would lead to an insane amount of outage.
|
| That might be true, but the effects of any given outage would
| be felt much less widely. If Disney has an outage, I can just
| find a movie on Netflix to watch instead. But now if one
| provider goes down, it can take down everything. To me, the
| problem isn't the cloud per se, it's one player's dominance
| in the space. We've taken the inherently distributed
| structure of the internet and re-centralized it, losing some
| robustness along the way.
| dragonwriter wrote:
| > That might be true, but the effects of any given outage
| would be felt much less widely.
|
| If my system has an hour of downtime every year and the
| dozen other systems it interacts with and depends on each
| have an hour of downtime every year, it can be _better_
| that those tend to be correlated rather than independent.
| bb88 wrote:
| Apple created their own silicon. FedEx uses its own pilots.
| The USPS uses its own cars.
|
| If you're a company relying upon AWS for your business, is it
| okay if you're down for a day or two while you wait for AWS
| to resolve its issue?
| jasode wrote:
| > _Apple created their own silicon._
|
| Apple _designs_ the M1. But TSMC (and possibly Samsung)
| actually manufacture the chips.
| jcranberry wrote:
| Most companies using AWS are tiny compared to the companies
| you mentioned.
| lostlogin wrote:
| It's bloody annoying when all I want to do is vacuum the
| floor and Roomba says nope, "active AWS incident".
| Grazester wrote:
| If all you wanted to do was vacuum the floor you would
| not have gotten that particular vacuum cleaner. Clearly
| you wanted to do more than just vacuum the floor and
| something like this happening should be weighed with the
| purchase of the vacuum.
| lostlogin wrote:
| I'll rephrase. I wanted the floor vacuumed and I didn't
| want to do it.
| teh_klev wrote:
| > Apple created their own silicon
|
| Apple _designed_ their own silicon, a third party
| manufactures and packages it for them.
| roody15 wrote:
| Quick follow-up. I once used an IaaS provider (hyperstreet)
| that was terrible. Long story short, the provider ended up
| closing shop and the owner of the company now sells real estate
| in California.
|
| It was a nightmare recovering data. Even when the service was
| operational, it was subpar.
|
| Just saying perhaps the "shittiest" providers may not be more
| reliable.
| kayson wrote:
| I think you're missing the point of the comment. It's not
| "don't use cloud". It's "be prepared for when cloud goes
| down". Because it will, despite many companies either
| thinking it won't, or not planning for it.
| midasuni wrote:
| > AWS has fewer outages in one zone alone than the best self-
| hosted institutions, your facebooks and pentagons. In-house
| servers would lead to an insane amount of outages.
|
| It's had two in 13 months
| imiric wrote:
| > This seems like an insane stance to have, it's like saying
| businesses should ship their own stock, using their own
| drivers, and their in-house made cars and planes and in-house
| trained pilots.
|
| > Heck, why stop at having servers on-site? Cast your own
| silicon wafers; after all, you don't want Spectre exploits.
|
| That's an overblown argument. Nobody is saying that, but it's
| clear that businesses that maintain their own infrastructure
| would've avoided today's AWS' outage. So just avoiding a
| single level of abstraction would've kept your company
| running today.
|
| > Because you are worst at it. If a specialist is this bad,
| and the market is fully open, then it's because the problem
| is hard.
|
| The problem is hard mostly because of scale. If you're a
| small business running a few websites with a few million hits
| per month, it might be cheaper and easier to colocate a few
| servers and hire a few DevOps or old-school sysadmins to
| administer the infrastructure. The tooling is there, and is
| not much more difficult to manage than a hundred different
| AWS products. I'm actually more worried about the DevOps
| trend where engineers are trained purely on cloud
| infrastructure and don't understand low-level tooling these
| systems are built on.
|
| > AWS has fewer outages in one zone alone than the best self-
| hosted institutions, your facebooks and pentagons. In-house
| servers would lead to an insane amount of outages.
|
| That's anecdotal and would depend on the capability of your
| DevOps team and your in-house / colocation facility.
|
| > And guess what? AWS (and all other IAAS providers) will beg
| you to use multiple region because of this. The team/person
| that has millions of dollars a day staked on a single AWS
| region is an idiot and could not be entrusted to order a
| gaming PC from newegg, let alone run an in-house datacenter.
|
| Oh great, so the solution is to put even more of our eggs in
| a single provider's basket? The real solution would be having
| failover to a different cloud provider, and the
| infrastructure changes needed for that are _far_ from
| trivial. Even with that, there's only 3 major cloud providers
| you can pick from. Again, colocation in a trusted datacenter
| would've avoided all of this.
| p1necone wrote:
| > it's clear that businesses that maintain their own
| infrastructure would've avoided today's AWS' outage.
|
| Sure, that's trivially obvious. But how many other outages
| would they have had instead because they aren't as
| experienced at running this sort of infrastructure as AWS
| is?
|
| You seem to be arguing from the a priori assumption that
| rolling your own is inherently more stable than renting
| infra from AWS, without actually providing any
| justification for that assumption.
|
| You also seem to be under the assumption that any amount of
| downtime is _always_ unacceptable, and worth spending
| large amounts of time and effort to avoid. For a _lot_ of
| businesses, systems going down for a few hours every once in
| a while just isn't a big deal, and is much preferable to
| spending thousands more on cloud bills, or hiring more
| full-time staff to ensure X 9s of uptime.
| imiric wrote:
| You and GP are making the same assumption that my DevOps
| engineers _aren't_ as experienced as AWS' are. There are
| plenty of engineers capable of maintaining an in-house
| infrastructure running X 9s because, again, the
| complexity comes from the scale AWS operates at. So we're
| both arguing with an a priori assumption that the grass
| is greener on our side.
|
| To be fair, I'm not saying never use cloud providers. If
| your systems require the complexity cloud providers
| simplify, and you operate at a scale where it would be
| prohibitively expensive to maintain yourself, by all
| means go with a cloud provider. But it's clear that not
| many companies are prepared for this type of failure, and
| protecting against it is not trivial to accomplish. Not
| to mention the conceptual overhead and knowledge required
| to deal with the provider's specific products, APIs,
| etc., whereas the skills to maintain these systems yourself
| transfer across any datacenter.
| solveit wrote:
| This feels like a discussion that could sorely use some
| numbers.
|
| What are good examples of
|
| >a small business running a few websites with a few million
| hits per month, it might be cheaper and easier to colocate
| a few servers and hire a few DevOps or old-school sysadmins
| to administer the infrastructure.
|
| and how often do they go down?
| i_like_waiting wrote:
| Depends, I guess. I am running an on-prem workstation for our
| DWH. So far in 2 years it has gone down for minutes at a time,
| and only when I decided to take it down for hardware updates. I
| have no idea where this narrative came from, but usually the
| hardware you have is very reliable and doesn't turn off
| every 15 minutes.
|
| Heck, I use an old T430 for my home server and it still
| doesn't go down on completely random occasions (but that's
| a very simplified example, I know)
| jasode wrote:
| _> , but it's clear that businesses that maintain their own
| infrastructure would've avoided today's AWS' outage._
|
| When Netflix was running its own datacenters in 2008, they
| had a _3 day outage_ from database corruption and couldn't
| ship DVDs to customers. That was the disaster that
| pushed CEO Reed Hastings to get out of managing his own
| datacenters and migrate to AWS.
|
| The flaw in the reasoning that running your own hardware
| would _avoid today's outage_ is that it doesn't also
| consider the _extra unplanned outages on other days_
| because your homegrown IT team (especially at non-tech
| companies) isn't as skilled as the engineers working at
| AWS/GCP/Azure.
| qaq wrote:
| The flaw in your reasoning is that the complexity of the
| problem is even remotely the same. Most AWS outages are
| control plane related.
| naikrovek wrote:
| > AWS (and all other IAAS providers) will beg you to use
| multiple region
|
| will they? because AWS still puts new stuff in us-east-1
| before anywhere else, and there is often a LONG delay before
| those things go to other regions. there are many other
| examples of why people use us-east-1 so often, but it all
| boils down to this: AWS encourage everyone to use us-east-1
| and discourage the use of other regions for the same reasons.
|
| if they want to change how and where people deploy, they
| should change how they encourage their customers to deploy.
|
| my employer uses multi-region deployments where possible, and
| we can't do that anywhere near as much as we'd like because
| of limitations that AWS has chosen to have.
|
| so if cloud providers want to encourage multi-region
| adoption, they need to stop discouraging and outright
| preventing it, first.
| danielheath wrote:
| It works really well imo. All the people who want to use
| new stuff at the expense of stability choose us-east-1;
| those who want stability at the expense of new stuff run
| multi-region (usually not in us-east-1)
| mypalmike wrote:
| This argument seems rather contrived. Which feature
| available in only one region for a very long time has
| specifically impacted you? And what was the solution?
| WaxProlix wrote:
| Most features roll out to IAD second, third, or fourth. PDX
| and CMH are good candidates for earlier feature rollout,
| and usually it's tested in a small region first. I use PDX
| (us-west-2) for almost everything these days.
|
| I also think that they've been making a lot of the default
| region dropdowns and such point to CMH (us-east-2) to get
| folks to migrate away from IAD. Your contention that
| they're encouraging people to use that region just doesn't
| ring true to me.
| savant_penguin wrote:
| they usually beg you to use multiple availability zones
| though.
|
| I'm not sure how many AWS services are easy to spawn in
| multiple regions.
| mypalmike wrote:
| Which ones are difficult to deploy in multiple regions?
| dragonwriter wrote:
| > they usually beg you to use multiple availability zones
| though
|
| Doesn't help you if what goes down is an AWS global
| service on which you depend directly, or on which other
| AWS services depend (these tend to be tied to us-east-1).
| optiomal_isgood wrote:
| This is the right answer, I recall studying for the solutions
| architect professional certification and reading this
| countless times: outages will happen and you should plan for
| them by using multi-region if you care about downtime.
|
| It's not AWS's fault here, it's the companies', which assume
| that it will never be down. In-house servers also have
| outages; it's a very naive assumption to think that it'd
| all be better if all of those services were using their own
| servers.
|
| Facebook doesn't use AWS and they were down for several hours
| a couple of weeks ago, and that's despite having way better
| engineers than the average company working exclusively on
| their infrastructure.
| qaq wrote:
| "AWS has fewer outages in one zone alone than the best self-
| hosted institutions" sure you just call an outage "increased
| error rate"
| SkyPuncher wrote:
| In addition to fewer outages, _many_ products get a free pass
| on incidents because basically everyone is being impacted by
| the outage.
| johannes1234321 wrote:
| The benefit of self-hosting is that you are up while your
| competitors are down.
|
| However, if you are on AWS, many of your competitors are down
| while you are down, so they can't take over your business.
| [deleted]
| tw04 wrote:
| >It seems bad now but I wonder how much worse it might be when
| no one actually has access to money because all financial
| traffic is going through AWS and it goes down.
|
| Most financial institutions are implementing their own clouds,
| I can't think of any major one that is reliant on public cloud
| to the extent transactions would stop.
|
| >Why hasn't the industry come up with an alternative?
|
| You mean like building datacenters and hosting your own gear?
| baoyu wrote:
| > Most financial institutions are implementing their own
| clouds
|
| https://www.nasdaq.com/Nasdaq-AWS-cloud-announcement
| filmgirlcw wrote:
| That doesn't mean what you think it means.
|
| The agreement is more of a hybrid cloud arrangement with
| AWS Outposts.
|
| FTA:
|
| >Core to Nasdaq's move to AWS will be AWS Outposts, which
| extend AWS infrastructure, services, APIs, and tools to
| virtually any datacenter, co-location space, or on-premises
| facility. Nasdaq plans to incorporate AWS Outposts directly
| into its core network to deliver ultra-low-latency edge
| compute capabilities from its primary data center in
| Carteret, NJ.
|
| They are also starting small, with Nasdaq MRX
|
| This is much less about moving NASDAQ (or other exchanges)
| to be fully owned/maintained by Amazon, and more about
| wanting to take advantage of development tooling and
| resources and services AWS provides, but within the
| confines of an owned/maintained data center. I'm sure as
| this partnership grows, racks and racks will be in Amazon's
| data centers too, but this is a hybrid approach.
|
| I would also bet a significant amount of money that when
| NASDAQ does go full "cloud" (or hybrid, as it were), it
| won't be in the same US-east region co-mingling with the
| rest of the consumer web, but with its own redundant
| services and connections and networking stack.
|
| NASDAQ wants to modernize its infrastructure but it
| absolutely doesn't want to offload it to a cloud provider.
| That's why it's a hybrid partnership.
| jen20 wrote:
| Indeed I can think of several outages in the past decade in
| the UK of banks' own infrastructure which have led to
| transactions stopping for days at a time, with the
| predictable outcomes.
| tcgv wrote:
| > Why hasn't the industry come up with an alternative?
|
| The cloud is the solution to self-managed data centers. Its
| value proposition is appealing: focus on your core business and
| let us handle infrastructure for you.
|
| This fits the needs of most small and medium-sized businesses:
| there's no reason not to use the cloud and instead spend time
| and money on building and operating private data centers when
| the (perceived) chances of outages are so small.
|
| Then, companies grow to a certain size where the benefits of
| having a self-managed data center begin to outweigh not
| having one. But at this point this becomes more of a
| strategic/political decision than merely a technical one, so
| it's not an easy shift.
| JamesAdir wrote:
| noob question: Aren't companies using several regions for
| availability and redundancy?
| yunwal wrote:
| I'm seeing outages across several regions for certain services
| (SNS), so cross-region failover doesn't necessarily help here.
|
| Additionally, for complex apps, automatic cross-region disaster
| recovery can take tens or even hundreds of dev years, something
| most small to midsized companies can't afford.
| blahyawnblah wrote:
| Ideally, yes. In practice, most are hosted in a single region
| but with multiple availability zones (this is called high
| availability). What you're talking about is fault tolerance
| (across multiple regions). That's harder to implement and costs
| more.
| PragmaticPulp wrote:
| I worked at a company that hired an ex-Amazon engineer to work on
| some cloud projects.
|
| Whenever his projects went down, he fought tooth and nail against
| any suggestion to update the status page. When forced to update
| the status page, he'd follow up with an extremely long "post-
| mortem" document that was really just a long winded explanation
| about why the outage was someone else's fault.
|
| He later explained that in his department at Amazon, being at
| fault for an outage was one of the worst things that could happen
| to you. He wanted to avoid that mark any way possible.
|
| YMMV, of course. Amazon is a big company and I've had other
| friends work there in different departments who said this wasn't
| common at all. I will always remember the look of sheer panic he
| had when we insisted that he update the status page to accurately
| reflect an outage, though.
| broknbottle wrote:
| This gets posted every time there's an AWS outage. It might as
| well be copypasta at this point.
| Rapzid wrote:
| It's the "grandma got run over by a reindeer" of AWS outages.
| Really no outage thread would be complete without this
| anecdote.
| JoelMcCracken wrote:
| well, this is the first time I've seen it, so I am glad it
| was posted this time.
| rconti wrote:
| Ditto, it's always annoyed me that their status page is
| useless, but glad someone else mentioned it.
| jjoonathan wrote:
| First time I've seen it too. Definitely not my first "AWS
| us-east-1 is down but the status board is green" thread,
| either.
| [deleted]
| Spivak wrote:
| I mean, it's true at every company I've ever worked at too. If
| you can lawyer incidents into not being an outage you avoid
| like 15 meetings with the business stakeholders about all the
| things we "have to do" to prevent things like this in the
| future, which get canceled the moment they realize how
| much dev/infra time it will take to implement.
| ignoramous wrote:
| I had that deja vu feeling reading PragmaticPulp's comment,
| too.
|
| And sure enough, PragmaticPulp did post a similar comment on
| a thread about Amazon India's alleged hire-to-fire policy 6
| months back: https://news.ycombinator.com/item?id=27570411
|
| You and I, we aren't among the 10000, but there are
| potentially 10000 others who might be: https://xkcd.com/1053/
| PragmaticPulp wrote:
| Sorry. I'm probably to blame because I've posted this a
| couple times on HN before.
|
| It strikes a nerve with me because it caused so much trouble
| for everyone around him. He had other personal issues,
| though, so I should probably clarify that I'm not entirely
| blaming Amazon for his habits. Though his time at Amazon
| clearly did exacerbate his personal issues.
| dang wrote:
| (This was originally a reply to
| https://news.ycombinator.com/item?id=29473759 but I've pruned
| it to make the thread less top-heavy.)
| avalys wrote:
| I can totally picture this. Poor guy.
| kortex wrote:
| That sounds like the exact opposite of human-factors
| engineering. No one _likes_ taking blame. But when things go
| sideways, people are extra spicy and defensive, which makes
| them clam up and often withhold useful information, which can
| extend the outage.
|
| No-blame analysis is a much better pattern. Everyone wins. It's
| about building the system that builds the system. Stuff broke;
| fix the stuff that broke, then fix the things that _let stuff
| break_.
| 88913527 wrote:
| I don't think engineers can believe in no-blame analysis if
| they know it'll harm career growth. I can't unilaterally
| promote John Doe, I have to convince other leaders that John
| would do well the next level up. And in those discussions,
| they could bring up "but John has caused 3 incidents this
| year", and honestly, maybe they'd be right.
| SQueeeeeL wrote:
| Would they? Having 3 outages in a year sounds like an
| organization problem. Not enough safeguards to prevent very
| routine human errors. But instead of worrying about that we
| just assign a guy to take the fall
| JackFr wrote:
| Well, if John caused 3 outages and his peers Sally and
| Mike each caused 0, it's worth taking a deeper look.
| There's a real possibility he's getting screwed by a
| messed-up org; he could also be doing slapdash work, or he
| seriously might not understand the seriousness of an
| outage.
| jjav wrote:
| Worth a look, certainly. Also very possible that this
| John is upfront about honest postmortems and like a good
| leader takes the blame, whereas Sally and Mike are out
| all day playing politics looking for how to shift blame
| so nothing has their name attached. At most larger companies,
| that's how it goes.
| Kliment wrote:
| Or John's work is in frontline production use and Sally's
| and Mike's is not, so there's different exposure.
| crmd wrote:
| John's team might also be taking more calculated risks
| and running circles around Sally and Mike's teams with
| respect to innovation and execution. If your organization
| categorically punishes failures/outages, you end up with
| timid managers that are only playing defense, probably
| the opposite of what the leadership team wants.
| dolni wrote:
| If you work in a technical role and you _don't_ have the
| ability to break something, you're unlikely to be
| contributing in a significant way. Likely that would make
| you a junior developer whose every line of code is
| heavily scrutinized.
|
| Engineers should be experts and you should be able to
| trust them to make reasonable choices about the
| management of their projects.
|
| That doesn't mean there can't be some checks in place,
| and it doesn't mean that all engineers should be perfect.
|
| But you also have to acknowledge that adding all of those
| safeties has a cost. You can be a competent person who
| requires fewer safeties or less competent with more
| safeties.
|
| Which one provides more value to an organization?
| pm90 wrote:
| > Which one provides more value to an organization?
|
| Neither, they both provide the same value in the long
| term.
|
| Senior engineers cannot execute on everything they commit
| to without having a team of engineers they work with. If
| nobody trains junior engineers, the discipline would go
| extinct.
|
| Senior engineers provide value by building guardrails to
| enable junior engineers to provide value by delivering
| with more confidence.
| jaywalk wrote:
| You're not wrong, but it's possible that the organization
| is small enough that it's just not feasible to have
| enough safeguards that would prevent the outages John
| caused. And in that case, it's probably best that John
| not be promoted if he can't avoid those errors.
| kortex wrote:
| Current co is small. We are putting in the safeguards
| from Day 1. Well, okay technically like day 120, the
| first few months were a mad dash to MVP. But now that we
| have some breathing room, yeah, we put a lot of emphasis
| on preventing outages, detecting and diagnosing outages
| promptly, documenting them, doing the whole 5-why's
| thing, and preventing them in the future. We didn't have
| to, we could have kept mad dashing and growth hacking.
| But _very_ fortunately, we have a great culture here
| (founders have lots of hindsight from past startups).
|
| It's like a seed for crystal growth. Small company is
| exactly the best time to implement these things, because
| other employees will try to match the cultural norms and
| habits.
| jaywalk wrote:
| Well, I started at the small company I'm currently at
| around day 7300, where "source control" consisted of
| asking the one person who was in charge of all source
| code for a copy of the files you needed to work on, and
| then giving the updated files back. He'd write down the
| "checked out" files on a whiteboard to ensure that two
| people couldn't work on the same file at the same time.
|
| The fact that I've gotten it to the point of using git
| with automated build and deployment is a small miracle in
| itself. Not everybody gets to start from a clean slate.
| mountainofdeath wrote:
| There is no such thing as "no-blame" analysis. Even in the
| best organizations with the best effort to avoid it, there
| is always a subconscious "this person did it". It doesn't
| help that these incidents serve as convenient leverage for
| others to climb their own career ladder at your expense.
| [deleted]
| AnIdiotOnTheNet wrote:
| > I have to convince other leaders that John would do well
| the next level up.
|
| "Yes, John has made mistakes and he's always copped to them
| immediately and worked to prevent them from happening again
| in the future. You know who doesn't make mistakes? People
| who don't do anything."
| nix23 wrote:
| You know why SO-teams, firefighters and military pilots are
| so successful?
|
| -You don't hide anything
|
| -Errors will be made
|
| -After training/mission everyone talks about the errors (or
| potential ones) and how to prevent them
|
| -You don't make the same error twice
|
| Being afraid to make errors and learn from them creates a
| culture of hiding, a culture of denial and especially being
| afraid to take responsibility.
| jacquesm wrote:
| You can even make the same error twice, but you'd better
| have a _much_ better explanation the second time around
| than you had the first time around, because you already
| knew that what you did was risky and/or failure-prone.
|
| But usually it isn't the same person making the same
| mistake, usually it is someone else making the same
| mistake and nobody thought of updating
| processes/documentation to the point that the error would
| have been caught in time. Maybe they'll fix that after
| the second time ;)
| maximedupre wrote:
| Or just take responsibility. People will respect you for
| doing that and you will demonstrate leadership.
| artificial wrote:
| Way more fun argument: Outages just, uh... uh... find a
| way.
| melony wrote:
| And the guy who doesn't take responsibility gets promoted.
| Employees are not responsible for failures of management to
| set a good culture.
| tomrod wrote:
| Not in healthy organizations, they don't.
| foobiekr wrote:
| You can work an entire career and maybe enjoy life in one
| healthy organization in that entire time even if you work
| in a variety of companies. It just isn't that common,
| though of course voicing the _ideals_ is very, very
| common.
| jacquesm wrote:
| Once you reach a certain size there are surprisingly few
| healthy organizations; most of them turn into
| externalization engines with 4 beats per year.
| kortex wrote:
| The Gervais/Peter Principle is alive and well in many
| orgs. That doesn't mean that when you have the
| prerogative to change the culture, you just give up.
|
| I realize that isn't an easy thing to do. Often the best
| bet is to just jump around till you find a company that
| isn't a cultural superfund site.
| jrootabega wrote:
| Cynical/realist take: Take responsibility and then hope
| your bosses already love you, you can immediately both come
| with a way to prevent it from happening again, and convince
| them to give you the resources to implement it. Otherwise
| your responsibility is, unfortunately, just blood in the
| water for someone else to do all of that to protect the
| company against you and springboard their reputation on the
| descent of yours. There were already senior people scheming
| to take over your department from your bosses, now they
| have an excuse.
| mym1990 wrote:
| This seems like an absolutely horrid way of working or
| doing 'office politics'.
| geekbird wrote:
| Yes, and I personally have worked in environments that do
| just that. They said they didn't, but with management
| "personalities" plus stack ranking, you know damn well
| that they did.
| kumarakn wrote:
| I worked at Walmart Technology. I bravely wrote post-mortem
| documents owning the fault of my team (100+ people), owning it
| both technically and culturally as their leader. I put
| together a plan to fix it and executed it. I thought that was
| the right thing to do. This happened two times in my 10-year
| career there.
|
| Both times I was called out as a failure in my performance
| eval. Second time, I resigned and told them to find a better
| leader.
|
| Happy now I am out of such shitty place.
| gunapologist99 wrote:
| That's shockingly stupid. I also worked for a major Walmart
| IT services vendor in another life, and we always had to be
| careful about how we handled them, because they didn't
| always show a lot of respect for vendors.
|
| On another note, thanks for building some awesome stuff --
| walmart.com is awesome. I have both Prime and whatever
| they're currently calling Walmart's version, and I love that
| Walmart doesn't appear to mix SKUs together in the same
| bin, which seems to cause counterfeiting fraud at Amazon.
| gnat wrote:
| What's a "bin" in this context?
| AfterAnimator wrote:
| I believe he means a literal bin. E.g. Amazon takes
| products from all their sellers and chucks them in the
| same physical space, so they have no idea who actually
| sold the product when it's picked. So you could have
| gotten something from a dodgy 3rd party seller that
| repackages broken returns, etc, and Amazon doesn't
| maintain oversight of this.
| notinty wrote:
| Literally just a bin in a fulfillment warehouse.
|
| An Amazon listing doesn't guarantee a particular SKU.
| gnat wrote:
| Ah, whew. That's what I thought. Thanks! I asked because
| we make warehouse and retail management systems and every
| vendor or customer seems to give every word their own
| meanings (e.g., we use "bin" in our discounts engine to
| be a collection of products eligible for discounts, and
| "barcode" has at least three meanings depending on to
| whom you're speaking).
| throwawayHN378 wrote:
| Is WalMart.com awesome?
| temp6363t wrote:
| walmart.com user design sucks. My particular grudge right
| now: I'm shopping to go pick up some stuff (and indicate
| "in store pickup") and each time I search for the next item,
| it resets that filter, making me click on that filter for
| each item on my list.
| muvb00 wrote:
| Walmart.com, Am I the only one in the world who can't
| view their site on my phone? I tried it on a couple
| devices and couldn't get it to work. Scaling is fubar. I
| assumed this would be costing them millions/billions
| since it's impossible to buy something from my phone
| right now. S21+ in portrait on multiple browsers.
| handrous wrote:
| Almost every physical-store-chain company's website makes
| it way too hard to do the thing I _nearly always_ want
| out of their interface, which is to search the inventory
| of the X nearest locations. They all want to push online
| orders or 3rd-party-seller crap, it seems.
| CobrastanJorji wrote:
| Stories like this are why I'm really glad I stopped talking
| to that Walmart Technology recruiter a few years ago. I
| love working for places where senior leadership constantly
| repeat war stories about "that time I broke the flagship
| product" to reinforce the importance of blameless
| postmortems. You can't fix the process if the people who
| report to you feel the need to lie about why things go
| wrong.
| jacquesm wrote:
| Props to you; Walmart will never realize their loss,
| unfortunately. But one day there will be a headline (or even
| a couple of them) and you will know that if you had been
| there it might not have happened, and that in the end it is
| Walmart's customers that will pay the price for that, not
| their shareholders.
| dnautics wrote:
| that's awful. You should have been promoted for that.
| abledon wrote:
| is it just 'ceremony' to be called out on those things?
| (even if it is actually a positive sum total)
| ARandomerDude wrote:
| > Happy now I am out of such shitty place.
|
| Doesn't sound like it.
| emteycz wrote:
| But hope you found a better place?
| javajosh wrote:
| I firmly believe in the dictum "if you ship it you own it".
| That means you own all outages. It's not just an operator
| flubbing a command, or a bit of code that passed review when
| it shouldn't. It's all your dependencies that make your
| service work. You own ALL of them.
|
| People spend all this time threat modelling their stuff
| against malefactors, and yet so often people don't spend any
| time thinking about the threat model of _decay_. They don't
| do it when adding new dependencies (build- or runtime), and
| therefore are unprepared to handle an outage.
|
| There's a good reason for this, of course: modern software
| "best practices" encourage moving fast and breaking things,
| which includes "add this dependency we know nothing about,
| and which gives an unknown entity the power to poison our
| code or take down our service, arbitrarily, at runtime, but
| hey it's a cool thing with lots of GitHub stars and it's only
| one 'npm install' away".
|
| Just want to end with this PSA: Dependencies bad.
| syngrog66 wrote:
| if I were a black hat I would absolutely love GitHub and
| all the various language-specific package systems out
| there. giving me sooooo many ways to sneak arbitrary
| tailored malicious code into millions of installs around
| the world 24x7. sure, some of my attempts might get caught,
| or might not lead to a valuable outcome for me. but the
| percentage that does? can make it worth it. it's about scale
| and a massive parallelization of infiltration attempts.
| logic similar to the folks blasting out phishing emails or
| scam calls.
|
| I _love_ the ubiquity of thirdparty software from
| strangers, and the lack of bureaucratic gatekeepers. but I
| also _hate_ it in ways. and not enough people know about
| the dangers of this second thing.
| throwawayHN378 wrote:
| And yet, oddly enough, the Earth continues to spin and the
| internet continues to work. I think the system we have
| now is necessarily the system that must exist (in this
| particular case, not in all cases). Something more
| centralized is destined to fail. And while the open
| source nature of software introduces vulnerabilities, it
| also fixes them.
| syngrog66 wrote:
| > And, while the open source nature of software
| introduces vulnerabilities it also fixes them.
|
| dat gap tho... which was my point. smart black hats will
| be exploiting this gap, at scale. and the strategy will
| work because the majority of folks seem to be either
| lazy, ignorant or simply hurried for time.
|
| and btw your 1st sentence was rude. constructive feedback
| for the future
| AtlasBarfed wrote:
| That's a great philosophy.
|
| Ok, let's take an organization, let's call them, say
| Ammizzun. Totally not Amazon. Let's say you have a very
| aggressive hire/fire policy which worked really well in
| rapid scaling and growth of your company. Now you have a
| million odd customers highly dependent on systems that were
| built by people that are now one? two? three? four?
| hire/fire generations up-or-out or cashed-out cycles ago.
|
| So.... who owns it if the people that wrote it are
| lllloooooonnnnggg gone? Like, not just long gone one or two
| cycles ago so some institutional memory exists. I mean,
| GONE.
| javajosh wrote:
| A lot can go wrong as an organization grows, including
| loss of knowledge. At amazon "Ownership" officially rests
| with the non-technical money that owns voting shares.
| They control the board who controls the CEO. "Ownership"
| can be perverted to mean that you, a wage slave, are
| responsible for the mess that previous ICs left behind.
| The obvious thing to do in such a circumstance is quit
| (or don't apply). It is unfair and unpleasant to be
| treated in a way that gives you responsibility but no
| authority, and to participate in maintaining (and
| extending) that moral hazard, and as long as there are
| better companies you're better off working for them.
| Mezzie wrote:
| It's also a nightmare for software preservation. There's
| going to be a lot from this era that won't be usable 80
| years from now because everything is so interdependent and
| impossible to archive. It's going to be as messy and
| irretrievable as the Web pre Internet Archive + Wayback
| are.
| 88913527 wrote:
| Should I be penalized if an upstream dependency, owned by
| another team, fails? Did I lack due diligence in choosing
| to accept the risk that the other team couldn't deliver?
| These are real problems in the micro-services world,
| especially since I own UI and there are dozens of teams
| pumping out services, and I'm at the mercy of all of them.
| The best I can do is gracefully fail when services don't
| function in a healthy state.
| NikolaeVarius wrote:
| > Should I be penalized if an upstream dependency, owned
| by another team, fails?
|
| Yes
|
| > Did I lack due diligence in choosing to accept the risk
| that the other team couldn't deliver?
|
| Yes
| bandyaboot wrote:
| Where does this mindset end? Do I lack due diligence by
| choosing to accept that the cpu microcode on the system
| I'm deploying to works correctly?
| unionpivo wrote:
| If it's a brand new RISC-V CPU that was just released 5 min
| ago, and nobody has really tested it, then yes.
|
| If it's a standard CPU that everybody else uses, and it's not
| known to be bad, then no.
|
| Same for software. Is it OK to have a dependency on AWS
| services? Their history shows yes. A dependency on a brand
| new SaaS product? Nothing mission critical.
|
| Or npm/crates/pip packages. Packages that have been
| around and steadily maintained for a few years, and have
| active users, are worth checking out. Some random project
| from a single developer? Consider vendoring (and owning it
| if necessary).
| jrockway wrote:
| Why? Intel has Spectre/Meltdown which erased like half of
| everyone's capacity overnight.
| treis wrote:
| You choose the CPU and you choose what happens in a
| failure scenario. Part of engineering is making choices
| that meet the availability requirements of your service.
| And part of that is handling failures from dependencies.
|
| That doesn't extend to ridiculous lengths but as a rule
| you should engineer around any single point of failure.
| NikolaeVarius wrote:
| Yes? If you are worried about CPU microcode failing, then
| you do a NASA and have multiple CPU architectures doing
| calculations in a voting block. These are not unsolved
| problems.
| javajosh wrote:
| JPL goes further and buys multiple copies of all hardware
| and software media used for ground systems, and keeps
| them in storage "just in case". It's a relatively cheap
| insurance policy against the decay of progress.
| obstacle1 wrote:
| Say during due diligence two options are uncovered: use
| an upstream dependency owned by another team, or use that
| plus a 3P vendor for redundancy. Implementing parallel
| systems costs 10x more than the former and takes 5x
| longer. You estimate a 0.01% chance of serious failure
| for the former, and 0.001% for the latter.
|
| Now say you're a medium sized hyper-growth company in a
| competitive space. Does spending 10 times more and
| waiting 5 times longer for redundancy make business
| sense? You could argue that it'd be irresponsible to
| over-engineer the system in this case, since you delay
| getting your product out and potentially lose $ and
| ground to competitors.
|
| I don't think a black and white "yes, you should be
| punished" view is productive here.
| bityard wrote:
| You and many others here may be conflating two concepts
| which are actually quite separate.
|
| Taking blame is a purely punitive action and solves
| nothing. Taking responsibility means it's your job to
| correct the problem.
|
| I find that the more "political" the culture in the
| organization is, the more likely everyone is to search
| for a scapegoat to protect their own image when a mistake
| happens. The higher you go up in the management chain,
| the more important vanity becomes, and the more you see
| it happening.
|
| I have made plenty of technical decisions that turned out
| to be the wrong call in retrospect. I took
| _responsibility_ for those by learning from the mistake
| and reversing or fixing whatever was implemented.
| However, I never willfully took _blame_ for those
| mistakes because I believed I was doing the best job I
| could at the time.
|
| Likewise, the systems I manage sometimes fail because
| something that another team manages failed. Sometimes
| it's something dumb and could have easily been prevented.
| In these cases, it's easy to point blame and say, "Not our
| fault! That team or that person is being a fuckup and
| causing our stuff to break!" It's harder but much more
| useful to reach out and say, "hey, I see x system isn't
| doing what we expect, can we work together to fix it?"
| cyanydeez wrote:
| Every argument I have on the internet is between
| prescriptive and descriptive language.
|
| People tend to believe that if you can describe a problem
| that means you can prescribe a solution. Often times, the
| only way to survive is to make it clear that the first
| thing you are doing is describing the problem.
|
| After you do that, and it's clear that's all you are
| doing, then you follow up with a prescriptive description
| where you place clearly what could be done to manage a
| future scenario.
|
| If you don't create this bright line, you create a
| confused interpretation.
| javajosh wrote:
| My comment was made from the relatively simpler
| entrepreneurial perspective, not the corporate one. Corp
| ownership rests with people in the C-suite who are
| social/political lawyer types, not technical people. They
| delegate responsibility but not authority, because they
| can hire people, even smart people, to work under those
| conditions. This is an error mode where "blame" flows
| from those who control the money to those who control the
| technology. Luckily, not all money is stupid so some
| corps (and some parts of corps) manage to function even
| in the presence of risk and innovation failures. I mean
| the whole industry is effectively a distributed R&D
| budget that may or may not yield fruit. I suppose this is
| the market figuring out whether iterated R&D makes sense
| or not. (Based on history, I'd say it makes a lot of
| sense.)
| [deleted]
| javajosh wrote:
| I wish you wouldn't talk about "penalization" as if it
| was something that comes from a source of authority.
| _Your customers are depending on you_, and you've let
| them down, and the reason that's bad has nothing to do
| with what your boss will do to you in a review.
|
| The injustice that can and does happen is that you're
| explicitly given a narrow responsibility during
| development, and then a much broader responsibility
| during operation. This is patently unfair, and very
| common. For something like a failed uService you want to
| blame "the architect" that didn't anticipate these system
| level failures. What is the solution? Have plan b (and
| plan c) ready to go. If these services don't exist, then
| you must build them. It also implies a level of
| indirection that most systems aren't comfortable with,
| because we want to consume services directly (and for
| good reason) but reliability _requires_ that you never,
| ever consume a service directly, but instead from an in-
| process location that is failure aware.
|
| This is why reliable software is hard, and engineers are
| expensive.
|
| Oh, and it's also why you generally do NOT want to defer
| the last build step to runtime in the browser. If you
| start combining services on both the client and server,
| you're in for a world of hurt.
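|
| To make that "in-process, failure-aware location" concrete,
| here's a minimal sketch (Python; the class, thresholds, and
| fallback are made up for illustration, not any particular
| library):
|
|     import time
|
|     class FailureAwareClient:
|         """Wrap a primary service call with a fallback and a
|         simple circuit breaker, so callers never hit the
|         service directly."""
|
|         def __init__(self, primary, fallback,
|                      max_failures=3, reset_after=30.0):
|             self.primary = primary     # callable: the real service
|             self.fallback = fallback   # callable: plan B
|             self.max_failures = max_failures
|             self.reset_after = reset_after
|             self.failures = 0
|             self.opened_at = None      # when the breaker opened
|
|         def call(self, *args, **kwargs):
|             # Breaker open and cool-down not yet elapsed:
|             # go straight to plan B.
|             if self.opened_at is not None:
|                 if time.monotonic() - self.opened_at < self.reset_after:
|                     return self.fallback(*args, **kwargs)
|                 self.opened_at = None
|                 self.failures = 0
|             try:
|                 result = self.primary(*args, **kwargs)
|                 self.failures = 0
|                 return result
|             except Exception:
|                 self.failures += 1
|                 if self.failures >= self.max_failures:
|                     self.opened_at = time.monotonic()
|                 return self.fallback(*args, **kwargs)
|
| Callers depend on call() rather than on the service endpoint
| itself, so plan B gets exercised automatically when the
| primary misbehaves.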
| bostik wrote:
| Not penalised no, but questioned as to how well your
| graceful failure worked in the end.
|
| Remember: it may not be your fault, but it still is your
| problem.
| fragmede wrote:
| An analogy for illustrating this is:
|
| You get hit by a car and injured. The accident is the
| other driver's fault, but getting to the ER is your
| problem. The other driver may help and call an ambulance,
| but they might not even be able to help you if they also
| got hurt in the car crash.
| ssimpson wrote:
| when working on CloudFiles, we often had monitoring for our
| limited dependencies that was better than their
| monitoring. Don't just know what your stuff is doing, but
| what your whole dependency ecosystem is doing and know when
| it all goes south. also helps to learn where and how you
| can mitigate some of those dependencies.
| foobiekr wrote:
| This. We found very big, serious issues with our anti-
| DDOS provider because their monitoring sucked compared to
| ours. It was a sobering reality check when we realized
| that.
| insaneirish wrote:
| > No-blame analysis is a much better pattern. Everyone wins.
| It's about building the system that builds the system. Stuff
| broke; fix the stuff that broke, then fix the things that let
| stuff break.
|
| Yea, except it doesn't work in practice. I work with a lot of
| people who come from places with "blameless" post-mortem
| 'culture' and they've evangelized such a thing extensively.
|
| You know what all those people have proven themselves to
| really excel at? _Blaming people._
| kortex wrote:
| Ok, and? I don't doubt it fails in places. That doesn't
| mean that it doesn't work in practice. Our company does it
| just fine. We have a high trust, high transparency system
| and it's wonderful.
|
| It's like saying unit tests don't work in practice because
| bugs got through.
| kortilla wrote:
| Have you ever considered that the "no-blame" postmortems
| you are giving credit for everything are just a side
| effect of living in a high trust, high transparency
| system?
|
| In other words, "no-blame" should be an emergent property
| of a culture of trust. It's not something you can
| prescribe.
| kortex wrote:
| Yes, exactly. Culture of trust is the root. Many
| beneficial patterns emerge when you can have that: more
| critical PRs, blameless post-mortems, etc.
| maximedupre wrote:
| Damn, he had serious PTSD lol
| jonhohle wrote:
| On the retail/marketplace side this wasn't my experience, but
| we also didn't have any public dashboards. On Prime we
| occasionally had to refund in bulk, and when it was called for
| (internally or externally) we would write up a detailed post-
| mortem. This wasn't fun, but it was never about blaming a
| person and more about finding flaws in process or monitoring.
| mountainofdeath wrote:
| Former AWSser. I can totally believe that happened and
| continues to happen in some teams. Officially, it's not
| supposed to be done that way.
|
| Some AWS managers and engineers bring their corporate cultural
| baggage with them when they join AWS and it takes a few years
| to unlearn it.
| hinkley wrote:
| I am finding that I have a very bimodal response to "He did
| it". When I write an RCA or just talk about near misses, I may
| give you enough details to figure out that Tom was the one who
| broke it, but I'm not going to say Tom on the record anywhere,
| with one extremely obvious exception.
|
| If I think Tom has a toxic combination of poor judgement,
| Dunning-Kruger syndrome, and a hint of narcissism (I'm not sure
| but I may be repeating myself here), such that he won't listen
| to reason and he actively steers others into bad situations
| (and especially if he then disappears when shit hits the fan),
| then I will nail him to a fucking cross every chance I get.
| Public shaming is only a tool for getting people to discount
| advice from a bad actor. If it comes down to a vote between my
| idea and his, then I'm going to make sure everyone knows that
| his bets keep biting us in the ass. This guy kinda sounds like
| the Toxic Tom.
|
| What is important when I turned out to be the cause of the
| issue is a bit like some court cases. Would a reasonable person
| in this situation have come to the same conclusion I did? If
| so, then I'm just the person who lost the lottery. Either way,
| fixing it for me might fix it for other people. Sometimes the
| answer is, "I was trying to juggle three things at once and a
| ball got dropped." If the process dictated those three things
| then the process is wrong, or the tooling is wrong. If someone
| was asking me questions we should think about being more pro-
| active about deflecting them to someone else or asking them to
| come back in a half hour. Or maybe I shouldn't be trying to
| watch training videos while babysitting a deployment to
| production.
|
| If you never say "my bad" then your advice starts to sound like
| a lecture, and people avoid lectures so then you never get the
| whole story. Also as an engineer you should know that owning a
| mistake early on lets you get to what most of us consider the
| interesting bit of _solving the problem_ instead of talking
| about feelings for an hour and then using whatever is left of
| your brain afterward to fix the problem. In fact in some cases
| you can shut down someone who is about to start a rant (which
| is funny as hell because they look like their head is about to
| pop like a balloon when you say, "yep, I broke it, let's move
| on to how do we fix it?")
| bgribble wrote:
| To me, the point of "blameless" PM is not to hide the
| identity of the person who was closest to the failure point.
| You can't understand what happened unless you know who did
| what, when.
|
| "Blameless" to me means you acknowledge that the ultimate
| problem isn't that someone made a mistake that caused an
| outage. The problem is that you had a system in place where
| someone could make a single mistake and cause an outage.
|
| If someone fat-fingers a SQL query and drops your database,
| the problem isn't that they need typing lessons! If you put a
| DBA in a position where they have to be typing SQL directly
| at a production DB to do their job, THAT is the cause of the
| outage, the actual DBA's error is almost irrelevant because
| it would have happened eventually to someone.
| hinkley wrote:
| Naming someone is how you discover that not everyone in the
| organization believes in Blamelessness. Once it's out it's
| out, you can't put it back in.
|
| It's really easy for another developer to figure out who
| I'm talking about. Managers can't be arsed to figure it
| out, or at least pretend like they don't know.
| StreamBright wrote:
| Yep I can confirm that. The process when the outage is caused
| by you is called COE (correction of errors). I was oncall once
| for two teams because I was switching teams and I got 11
| escalations in 2 hours. 10 of these were caused by an overly
| sensitive monitoring setting. The 11th was a real one. Guess
| which one I ignored. :)
| kache_ wrote:
| Sometimes, these large companies tack on too much "necessary"
| incident "remediation" actions with Arbitrary Due Date SLAs
| that completely throw a wrench into any ongoing work. And ongoing,
| strategically defined ""muh high impact"" projects are what get
| you promoted, not doing incident remediations.
|
| When you get to the level you want, you get to not really give
| a shit and actually do The Right Thing. However, for all of the
| engineers clamoring to get out of the intermediate brick laying
| trenches, opening an incident can create perverse incentives.
| bendbro wrote:
| In my experience this is the actual reason for fear of the
| formal error correction process.
| pts_ wrote:
| Politicized cloud meh.
| ashr wrote:
| This is the exact opposite of my experience at AWS. Amazon is
| all about blameless fact finding when it comes to root cause
| analysis. Your company just hired a not so great engineer or
| misunderstood him.
| Insanity wrote:
| Adding my piece of anecdata to this.. the process is quite
| blameless. If a postmortem seems like it points blame, this
| is pointed out and removed.
| swiftcoder wrote:
| Blameless, maybe, but not repercussion-less. A bad CoE was
| liable to upend the team's entire roadmap and put their
| existing goals at risk. To be fair, management was fairly
| receptive to "we need to throw out the roadmap and push our
| launch out to the following reinvent", but it wasn't an
| easy position for teams to be in.
| kator wrote:
| I've worked for Amazon for 4 years, including stints at AWS,
| and even in my current role my team is involved in LSE's. I've
| never seen this behavior, the general culture has been find the
| problem, fix it, and then do root cause analysis to avoid it
| again.
|
| Jeff himself has said many times in All Hands and in public
| "Amazon is the best place to fail". Mainly because things will
| break, it's not that they break that's interesting, it's what
| you've learned and how you can avoid that problem in the
| future.
| jsperson wrote:
| I guess the question is why can't you (AWS) fix the problem
| of the status page not reflecting an outage? Maybe acceptable
| if the console has a hiccup, but when www.amazon.com isn't
| working right, there should be some yellow and red dots out
| there.
|
| With the size of your customer base there were man years
| spent confirming the outage after checking the status.
| andrewguenther wrote:
| Because there's a VP approval step for updating the status
| page and no repercussions for VPs who don't approve updates
| in a timely manner. Updating the status page is fully
| automated on both sides of VP approval. If the status page
| doesn't update, it's because a VP wouldn't do it.
| Eduard wrote:
| LSE?
| merciBien wrote:
| Large Scale Event
| dehrmann wrote:
| > explanation about why the outage was someone else's fault
|
| In my experience, it's rarely clear who was at fault for any
| sort of non-trivial outage. The issue tends to be at interfaces
| and involve multiple owners.
| jimt1234 wrote:
| Every incident review meeting I've ever been in starts out
| like, _"This meeting isn't to place blame..."_, then, 5 minutes
| later, it turns into the Blame Game.
| mijoharas wrote:
| That's a real shame, one of the leadership principles used to
| be "be vocally self-critical" which I think was supposed to
| explicitly counteract this kind of behaviour.
|
| I think they got rid of it at some point though.
| howdydoo wrote:
| Manually updated status pages are an anti-pattern to begin
| with. At that point, why not just call it a blog?
| jacquesm wrote:
| And this is exactly why you can expect these headlines to hit
| with great regularity. These things are never a problem at the
| individual level, they are always at the level of culture and
| organization.
| 300bps wrote:
| _being at fault for an outage was one of the worst things that
| could happen to you_
|
| Imagine how stressful life would be thinking that you had to be
| perfect all the time.
| errcorrectcode wrote:
| That's been most of my life. Welcome to perfectionism.
| soheil wrote:
| > I will always remember the look of sheer panic
|
| I don't know if you're exaggerating or not, but even if true
| why would anyone show that emotion about losing a job in the
| worst case?
|
| You certainly had a lot of relevant-to-today's-top-HN-post
| stories throughout your career. And I'm less and less surprised
| to continuously find PragmaticPulp as one of the top commenters
| if not the top that resonates with a good chunk of HN.
| sharpy wrote:
| Haha... This bring back memories. It really depends on the org.
|
| I've had push backs on my postmortems before because of
| phrasing that could be construed as laying some of the blame
| on some person/team when it's supposed to be blameless.
|
| And for a long time, it was fairly blameless. You would still
| be punished with the extra work of writing high quality
| postmortems, but I have seen people accidentally bring down
| critical tier-1 services and not be adversely affected in terms
| of promotion, etc.
|
| But somewhere along the way, it became politicized. Things like
| the wheel of death, public grilling of teams on why they didn't
| follow one of the thousands of best practices, etc, etc. Some
| orgs are still pretty good at keeping it blameless at the
| individual level, but... being a big company, your mileage may
| vary.
| hinkley wrote:
| We're in a situation where the balls of mud made people
| afraid to touch some things in the system. As experiences and
| processes have improved we've started to crack back into
| those things and guess what, when you are being groomed to
| own a process you're going to fuck it up from time to time.
| Objectively, we're still breaking production less often per
| year than other teams, but we are breaking it, and that's
| novel behavior, so we have to keep reminding people why.
|
| The moment that affects promotions negatively, or your
| coworkers throw you under the bus, you should 1) be assertive
| and 2) proof-read your resume as a precursor to job hunting.
| sharpy wrote:
| Or problems just persisting, because the fix is easy, but
| explaining it to others who do not work on the system is
| hard. Esp. justifying why it won't cause an issue, and
| being told that the fixes need to be done via scripts that
| will only ever be used once, but nevertheless need to be
| code reviewed and tested...
|
| I wanted to be proactive and fix things before they became
| an issue, but such things just drained life out of me, to
| the point I just left.
| staticassertion wrote:
| I don't think anecdotes like this are even worth sharing,
| honestly. There's so much context lost here, so much that can
| be lost in translation. No one should be drawing any
| conclusions from this post.
| amzn-throw wrote:
| It's popular to upvote this during outages, because it fits a
| narrative.
|
| The truth (as always) is more complex:
|
| * No, this isn't the broad culture. It's not even a blip. These
| are EXCEPTIONAL circumstances by extremely bad teams that - if
| and when found out - would be dealt with dramatically.
|
| * The broad culture is blameless post-mortems. Not whose fault
| is it. But what was the problem and how to fix it. And one of
| the internal "Ten commandments of AWS availability" is you own
| your dependencies. You don't blame others.
|
| * Depending on the service one customer's experience is not the
| broad experience. Someone might be having a really bad day but
| 99.9% of the region is operating successfully, so there is no
| reason to update the overall status dashboard.
|
| * Every AWS customer has a PERSONAL health dashboard in the
| console that should indicate _their_ experience.
|
| * Yes, VP approval is needed to make any updates on the status
| dashboard. But that's not as hard as it may seem. AWS
| executives are extremely operation-obsessed, and when there is
| an outage of any size are engaged with their service teams
| immediately.
| flerchin wrote:
| We knew us-east-1 was unuseable for our customers for 45
| minutes before amazon acknowledged anything was wrong _at
| all_. We made decisions _in the dark_ to serve our customers,
| because amazon dragged their feet communicating with us. Our
| customers were notified after 2 minutes.
|
| It's not acceptable.
| pokot0 wrote:
| Hiding behind a throw away account does not help your point.
| miken123 wrote:
| Well, the narrative is sort of what Amazon is asking for,
| heh?
|
| The whole us-east-1 management console is gone, what is
| Amazon posting for the management console on their website?
|
| "Service degradation"
|
| It's not a degradation if it's outright down. Use the red
| status a little bit more often, this is a "disruption", not a
| "degradation".
| taurath wrote:
| Yeah no kidding. Is there a ratio of how many people it has
| to be working for to be in yellow rather than red? Some
| internal person going "it works on my machine" while 99% of
| customers are down.
| whoknowswhat11 wrote:
| I've always wondered why services are not counted down more
| often. Is there some sliver of customers who have access to
| the management console for example?
|
| An increase in error rates - no biggie, any large system is
| going to have errors. But when 80%+ of customer loads in
| the region are impacted (across availability zones, for
| whatever good those do) - that counts as down, doesn't it?
| Error rates in one AZ - degraded. Multi-AZ failures - down?
| mynameisvlad wrote:
| SLAs. Officially acknowledging an incident means that
| they now _have_ to issue the SLA credits.
| res0nat0r wrote:
| The outage dashboard is normally only updated if a
| certain $X percent of hosts / service is down. If the EC2
| section were updated every time a rack in a datacenter
| went down, it would be red 24x7.
|
| It's only updated when a large percentage of customers
| are impacted, and most of the time this number is less
| than what the HN echo chamber makes it appear to be.
| mynameisvlad wrote:
| I mean, sure, there are technical reasons why you would
| want to buffer issues so they're only visible if
| something big went down (although one would argue that's
| exactly what the "degraded" status means).
|
| But if the official records say everything is green, a
| customer is going to have to push a lot harder to get the
| credits. There is a massive incentivization to "stay
| green".
| bwestpha wrote:
| Yes, there were. I'm from central Europe and we were at
| least able to get some pages of the console in us-east-1 -
| but I assume this was more caching related. Even though
| the console loaded and worked for listing some entries,
| we weren't able to post a support case or view SQS
| messages etc.
|
| So I agree that "degraded" is not the proper wording - but
| it wasn't completely gone either. So... hard to tell
| what a commonly acceptable wording would be here.
| vladvasiliu wrote:
| From France, when I connect to "my personal health
| dashboard" in eu-west-3, it says several services are
| having "issues" in us-east-1.
|
| To your point, for support center (which doesn't show a
| region) it says:
|
| _Description
|
| Increased Error Rates
|
| [09:01 AM PST] We are investigating increased error rates
| for the Support Center console and Support API in the US-
| EAST-1 Region.
|
| [09:26 AM PST] We can confirm increased error rates for
| the Support Center console and Support API in the US-
| EAST-1 Region. We have identified the root cause of the
| issue and are working towards resolution. _
| threecheese wrote:
| I'm part of a large org with a large AWS footprint, and
| we've had a few hundred folks on a call nearly all day. We
| have only a few workloads that are completely down; most
| are only degraded. This isn't a total outage, we are still
| doing business in east-1. Is it "red"? Maybe! We're all
| scrambling to keep the services running well enough for our
| customers.
| Thaxll wrote:
| Because the console works just fine in us-east-2, and the
| console entry on the status page does not distinguish regions.
|
| If the console works 100% in us-east-2 and not in us-east-1,
| why would they mark the console as completely down?
| keyle wrote:
| Well you know, like when a rocket explodes, it's a sudden
| and "unexpected rapid disassembly" or something...
|
| And a cleaner is called a "floor technician".
|
| Nothing really out of the ordinary for a service to be
| called degraded while "hey, the cache might still be
| working right?" ... or "Well you know, it works every other
| day except today, so it's just degradation" :-)
| lgylym wrote:
| Come on, we all know managers don't want to claim an outage
| till the last minute.
| codegeek wrote:
| "Yes, VP approval is needed to make any updates on the status
| dashboard."
|
| If services are clearly down, why is this needed? I can
| understand the oversight required for a company like Amazon
| but this sounds strange to me. If services are clearly down,
| I want that damn status update right away as a customer.
| tekromancr wrote:
| Oh, yes. Let me go look at the PERSONAL health dashboard
| and... oh, I need to sign into the console to view it... hmm
| oscribinn wrote:
| 100 BEZOBUCKS(tm) have been deposited to your account for
| this post.
| mrsuprawsm wrote:
| If your statement is true, then why is the AWS status page
| widely considered useless, and everyone congregates on HN
| and/or Twitter to actually know what's broken on AWS during
| an outage?
| andrewguenther wrote:
| > Yes, VP approval is needed to make any updates on the
| status dashboard. But that's not as hard as it may seem.
| AWS executives are extremely operation-obsessed, and when
| there is an outage of any size are engaged with their
| service teams immediately.
|
| My experience generally aligns with amzn-throw, but this
| right here is why. There's a manual step here and there's
| always drama surrounding it. The process to update the
| status page is fully automated on both sides of this step,
| if you removed VP approval, the page would update
| immediately. So if the page doesn't update, it is always a
| VP dragging their feet. Even worse is that lags in this
| step were never discussed in the postmortem reviews that I
| was a part of.
| Frost1x wrote:
| It's intentional plausible deniability. By creating the
| manual step you can shift blame away. It's just like the
| concept of personal health dashboards, which are designed
| to keep an information asymmetry between the host's view of
| reliability and the client's personal anecdata. On top of
| all of this, the metrics are pretty arbitrary.
|
| Let's not pretend businesses haven't been intentionally
| advertising in deceitful ways for decades if not hundreds
| of years. This just happens to be the current strategy in
| tech of lying to and deceiving customers to limit liability,
| responsibility, and recourse actions.
|
| To be fair, it's not just Amazon; they just happen to be
| the largest and most targeted whipping boy on the block.
| Few businesses will admit to liability under any
| circumstances. Liability has to be assessed externally.
| amichal wrote:
| I have in the past directed users here on HN who were
| complaining about https://status.aws.amazon.com to the
| Personal Health Dashboard at https://phd.aws.amazon.com/ as
| well. Unfortunately even though the account I was logged into
| this time only has a single S3 bucket in the EU, billed
| through the EU, and with zero direct dependencies on the US,
| the personal health dashboard was ALSO throwing "The request
| processing has failed because of an unknown error" messages.
| Whatever the problem was this time, it had global effects for
| the majority of Console users; the internet noticed it over
| 30 minutes before either the status page or the PHD was able
| to report it. There will be no explanation, and the official
| status page logs will say there were "increased API failure
| rates" for an hour.
|
| Now I guess it's possible that the 1000s and 1000s of us who
| noticed and commented are some tiny fraction of the user base,
| but if that's so you could at least publish a follow-up like
| other vendors do that says something like 0.00001% of API
| requests failed, affecting an estimated 0.001% of our users at
| the time.
| yaacov wrote:
| Can't comment on most of your post but I know a lot of Amazon
| engineers who think of the CoE process (Correction of Error,
| what other companies would call a postmortem) as punitive
| jrd259 wrote:
| I don't know any, and I have written or reviewed about 20
| andrewguenther wrote:
| They aren't _meant_ to be, but shitty teams are shitty. You
| can also create a COE and assign it to another team. When I
| was at AWS, I had a few COEs assigned to me by disgruntled
| teams just trying to make me suffer and I told them to
| pound sand. For my own team, I wrote COEs quite often and
| found it to be a really great process for surfacing
| systemic issues with our management chain and making real
| improvements, but it needs to be used correctly.
| nanis wrote:
| > * Depending on the service one customer's experience is not
| the broad experience. Someone might be having a really bad
| day but 99.9% of the region is operating successfully, so
| there is no reason to update the overall status dashboard.
|
| https://rachelbythebay.com/w/2019/07/15/giant/
| marcinzm wrote:
| >* Every AWS customer has a PERSONAL health dashboard in the
| console that should indicate their experience.
|
| You mean the one that is down right now?
| CobrastanJorji wrote:
| Seems like it's doing an exemplary job of indicating their
| experience, then.
| ultimoo wrote:
| > ...you own your dependencies. You don't blame others.
|
| Agreed, teams should invest resources in architecting their
| systems in a way that can withstand broken dependencies. How
| do AWS teams account for "core" dependencies (e.g. auth)
| that may not have alternatives?
| gunapologist99 wrote:
| This is the irony of building a "reliable" system across
| multiple AZ's.
| AtlasBarfed wrote:
| Because OTHERWISE people might think AMAZON is a
| DYSFUNCTIONAL company that is beginning to CRATER under its
| HORRIBLE work culture and constant H/FIRE cycle.
|
| See, AWS is basically turning into a long standing utility
| that needs to be reliable.
|
| Hey, do most institutions like that completely turn over
| their staff every three years? Yeah, no.
|
| Great for building it out and grabbing market share.
|
| Maybe not for being the basis of a reliable substrate of the
| modern internet.
|
| If there are dozens of bespoke systems that keep AWS afloat
| (disclosure: I have friends who worked there, and there are,
| and also Conway's law), and the people who wrote them are
| three generations of HIRE/FIRE ago....
|
| Not good.
| ctvo wrote:
| > Maybe not for being the basis of a reliable substrate of
| the modern internet.
|
| Maybe THEY will go to a COMPETITOR and THINGS MOVE ON if
| it's THAT BAD. I wasn't sure what the pattern for all caps
| was, so just giving it a shot there. Apologies if it's
| incorrect.
| AtlasBarfed wrote:
| I was mocking the parent, who was doing that. Yes it's
| awful. Effective? Sigh, yes. But awful.
| [deleted]
| the-pigeon wrote:
| What?!
|
| Everybody is very slow to update their outage pages because
| of SLAs. It's in a company's financial interest to deny
| outages and when they are undeniable to make them appear as
| short as possible. Status pages updating slowly is definitely
| by design.
|
| There's no large dev platform I've used that this wasn't true
| of their status pages.
| jjoonathan wrote:
| I haven't asked AWS employees specifically about blameless
| postmortems, but several of them have personally corroborated
| that the culture tends towards being adversarial and
| "performance focused." That's a tough environment for
| blameless debugging and postmortems. Like if I heard that
| someone has a rain forest tree-frog living happily in their
| outdoor Arizona cactus garden, I have doubts.
| azinman2 wrote:
| When I was at Google I didn't have a lot of exposure to the
| public infra side. However I do remember back in 2008 when
| a colleague was working on routing side of YouTube, he made
| a change that cost millions of dollars in mere hours before
| noticing and reverting it. He mentioned this to the larger
| team which gave applause during a tech talk. I cannot
| possibly generalize the culture differences between Amazon
| and Google, but at least in that one moment, the Google
| culture seemed to support that errors happen, they get
| noticed, and fixed without harming the perceived
| performance of those responsible.
| wolverine876 wrote:
| While I support that, how are the people involved
| evaluated?
| abdabab wrote:
| Google puts automation or process in place to avoid
| outages rather than pointing fingers. If an engineer
| causes an outage by mistake and then works to ensure that
| would never happen again, he made a positive impact.
| 1-6 wrote:
| Perhaps the reward structure should be changed to incentivize
| post-mortems. There could be several flaws that go
| underreported otherwise.
|
| We may run into the problem of everything being documented,
| and possibly deliberate acts, but for a service that relies
| heavily on uptime, that's a small price to pay for a
| bulletproof operation.
| A4ET8a8uTh0 wrote:
| Then we would drown in a sea of meetings and 'lessons
| learned' emails. There is a reason for post-mortems, but
| there has to be balance.
| 1-6 wrote:
| I find post-mortems interesting to read through especially
| when it's not my fault. Most of them would probably be
| routine to read through but there are occasional ones that
| make me cringe or laugh.
|
| Post-mortems can sometimes be thought of like safety
| training. There is a big imbalance of time dedicated to
| learning proper safety handling just for those small
| incidents.
| hinkley wrote:
| Does Disney still play the "Instructional Videos" series
| starring Goofy where he's supposed to be teaching you how
| to do something and instead we learn how NOT to do
| something? Or did I just date myself badly?
| throwaway82931 wrote:
| This fits with everything I've heard about terrible code
| quality at Amazon and engineers working ridiculous hours to
| close tickets any way they can. Amazon as a corporate entity
| seems to be remarkably distrustful of and hostile to its labor
| force.
| mbordenet wrote:
| When I worked for AMZN (2012-2015, Prime Video & Outbound
| Fulfillment), attempting to sweep issues under the rug was a
| clear path to termination. The Correction-Of-Error (COE)
| process can work wonders in a healthy, data-driven, growth-
| mindset culture. I wonder if the ex-Amazonian you're referring
| to did not leave AMZN by their own accord?
|
| Blame deflection is a recipe for repeat outages and unhappy
| customers.
| PragmaticPulp wrote:
| > I wonder if the ex-Amazonian you're referring to did not
| leave AMZN by their own accord?
|
| Entirely possible, and something I've always suspected.
| taf2 wrote:
| What if they just can't access the console to update the status
| page...
| Slartie wrote:
| They could still go into the data center, open up the status
| page servers' physical...ah wait, what if their keyfobs don't
| work?
| soheil wrote:
| This may not actually be that bad of a thing. If you think
| about it, if they're fighting tooth and nail to keep the
| status page green, that tells you they were probably doing
| that at every step of the way before the failure became
| imminent. Gotta have respect for that.
| mrweasel wrote:
| That's idiotic, the service is down regardless. If you foster
| that kind of culture, why have a status page at all?
|
| It makes AWS engineers look stupid, because it looks like they
| are not monitoring their services.
| mountainofdeath wrote:
| The status page is as much a political tool as a technical
| one. Giving your service a non-green state makes your entire
| management chain responsible. You don't want to be one that
| upsets some VP's advancement plans.
| nine_zeros wrote:
| > It makes AWS engineers look stupid, because it looks like
| they are not monitoring their services.
|
| Management.
| thefourthchime wrote:
| If someone needs to get to the console, you can make a url like
| this:
|
| https://us-west-1.console.aws.amazon.com/
| yabones wrote:
| Works for a lot of things, but not Route53... Which is great
| because that's the only thing I need to do in AWS today :)
| hulahoop wrote:
| I heard from my cousin who works at an Amazon warehouse that the
| conveyor belts stopped working and items were messed up and
| getting randomly removed off the belts.
| joshstrange wrote:
| This seems to be affecting Audible as well. I can't buy a book
| which sucks since I just finished the previous one in the series
| and I'm stuck in bed sick.
| griffinkelly wrote:
| Hosting and processing all the photos for the California
| International Marathon on EC2; this doesn't make dealing
| with impatient customers any easier
| bennyp101 wrote:
| Yea, Amazon Music has gone down for me in the UK now :(
|
| Looks like it might be getting worse
| [deleted]
| adamtester wrote:
| eu-west-1 is down for us
| PeterBarrett wrote:
| I hope you stay being the only person who has said that, 1
| region being gone is enough for me!
| adamtester wrote:
| I should have said, only the Console and CLI was down for us,
| our services remained up!
| ComputerGuru wrote:
| Console has lots of us-east-1 dependencies.
| strictfp wrote:
| "some customers may experience a slight elevation in error rates"
| --> everything is on fire
| kello wrote:
| ah corporate speak at its finest
| Xenoamorphous wrote:
| Maybe when the error rate hits 100% they'll say "error rate now
| stable".
| retbull wrote:
| "Only direction is up"
| hvgk wrote:
| ECR and API are fucked so it's impossible to scale anything to
| the point fire can come out :)
| soco wrote:
| I'm also experiencing a slight elevation in billing rates - got
| alarms for 10x consumption and I can't check on them... Edit:
| also API access is failing, terraform can't take anything down
| because "connection was forcibly closed"
| gchamonlive wrote:
| Imagine triggering a big instance for machine learning or a
| huge EMR cluster that would otherwise be short lived and not
| being able to scale it down.
|
| I am quite sure the AWS support will be getting many refund
| requests over the course of the week.
| lordnacho wrote:
| This got me thinking, are there any major chat services that
| would go down if a particular AWS/GCP/etc data centre went down?
|
| You don't want your service to go down, plus your team's comms at
| the same time.
| tyre wrote:
| Slack going down is a godsend for developer productivity.
| wizwit999 wrote:
| You should multi region something like that.
| milofeynman wrote:
| Remember when Facebook went down? Fb, Whatsapp, messenger,
| Instagram were all down. Don't know what they use internally
| DoctorOW wrote:
| Slack is pretty much full AWS, I've been switched over to Teams
| so I can't check.
| rodiger wrote:
| Slack is working fine for me
| umanwizard wrote:
| My company's Slack instance is currently fine.
| ChadyWady wrote:
| I'm impressed but Amazon Chime still appears to be working
| right now. It's sad because this is the one service that could
| go down and be a net benefit.
| perydell wrote:
| We have an SMS text thread with about 12 people, to which we
| send one message on the first of every month to make sure it
| is tested and ready to be used for communications if all
| other comms networks are down.
| dahak27 wrote:
| Especially if enough Amazon internal tools rely on it - would
| be funny if there were a repeat of the FB debacle where Amazon
| employees somehow couldn't communicate/get back into their
| offices because of the problem they were trying to fix
| umanwizard wrote:
| Last I knew, Amazon used all Microsoft stuff for business
| communication.
| itsyaboi wrote:
| Slack, as of last year.
|
| https://slack.com/blog/news/slack-aws-drive-development-
| agil...
| nostrebored wrote:
| And before that, Amazon Chime was the messaging and
| conferencing tool. Now that I'm not using it, I actually
| miss it a lot!
| shepherdjerred wrote:
| I cried tears of joy when Amazon finally switched to
| Slack last year
| manquer wrote:
| Slack uses Chime for A/V under the hood so I don't think
| it is all that different for non-text.[1]
|
| [1] https://www.theverge.com/2020/6/4/21280829/slack-
| amazon-aws-...
| shaftoe444 wrote:
| Company wide can't log in to console. Many, many SNS and SQS
| errors in us-east-1.
| _wldu wrote:
| Same here and I use us-east-2.
| 2bitlobster wrote:
| I wonder what the cost to the US economy is
| taormina wrote:
| Have folks considered a class-action lawsuit against these
| blatantly fraudulent SLAs to recoup costs?
| itsdrewmiller wrote:
| In my experience, despite whatever is published, companies will
| privately acknowledge and pay their SLA terms. (Which still only
| gets you, like, one day's worth of reimbursement if you're
| lucky.)
| adrr wrote:
| Retail SLAs are a small risk compared to the enterprise SLAs
| where an outage like this could cost Amazon tens of millions.
| I assume these contracts have discount tiers based on
| availability and anything below 99% would be a 100% discount
| for that bill cycle.
| jaywalk wrote:
| But those enterprise SLAs are the ones they'll be paying
| out. Retail SLAs are the ones that you'll have to fight
| for.
| larrik wrote:
| We are having a number of rolling issues, but the site is sort of
| up? I worry it'll get worse before it gets better.
|
| Nothing on their status page. But the Console is not working.
| commandlinefan wrote:
| We're seeing some stuff is up, and some stuff is down and some
| of the stuff that was up a little while ago is down now. It's
| getting worse as of 9:53 AM CST.
| 7six wrote:
| eu-central-1 as well
| htrp wrote:
| Looks like all aws internal APIs are down....
| taf2 wrote:
| status page looks very green even 17 minutes later...
| ZebusJesus wrote:
| Me thinks Venmo uses AWS because they are down as well. Status
| gator has AWS as on off on off on off. I can access my servers
| hosted in the west coast but I cannot access the AWS console,
| this is making for an interesting morning.
| errcorrectcode wrote:
| Alexa (jokes and flash briefing) is currently partially down for
| me. Skills and routines are working.
|
| Amazon customer service can't help me handle an order.
|
| If us-east-1 is out, then half of Amazon is out too.
| stoneham_guy wrote:
| AWS Management Console Home page is currently unavailable.
|
| That's the error I am getting when logging in to the AWS console
| soheil wrote:
| Can't even log in; this is the error I'm getting:
| Internal Error Please try again later
| tmarice wrote:
| AWS OpenSearch is also returning 500 in us-east-1.
| WesolyKubeczek wrote:
| Our instances are up (when I poke with SSH, say), but the console
| itself is under the weather.
| stoneham_guy wrote:
| AWS Management Console Home page is currently unavailable.
| swasheck wrote:
| they're a bit later than normal with their large annual post-
| thanksgiving us-east-1 outage
| chazu wrote:
| ECR borked for us in east-1
| the-rc wrote:
| You can get new tokens. Image pulling times out after ~30s,
| which tells me that maybe ECR is actually up, but it can't
| verify the caller's credentials or access image metadata from
| some other internal service. It's probably something low level
| that crashed, taking down anything built above it.
| the-rc wrote:
| Actually, images that do not exist will return the
| appropriate error within a few seconds, so it's really timing
| out when talking to the storage layer or similar.
| mancerayder wrote:
| The Personal Health Dashboard is unhealthy. It says unknown
| error, failure or exception.
|
| They need a monitor for the monitoring.
| albatross13 wrote:
| Welp, this is awkward (._. )
| echlipse wrote:
| imdb.com is down right now. But https://www.imdb.com/chart/top is
| reachable right now. Strange.
|
| https://imgur.com/a/apBT86o
| sakopov wrote:
| AWS Management console is dead/dying. Numerous errors across
| major services like S3 and EC2 in us-east-1. This looks pretty
| bad.
| zedpm wrote:
| Yep, PHD isn't loading, Cloudwatch is reporting SQS errors,
| metrics aren't loading, can't pull logs. This is in US-East-1.
| ec109685 wrote:
| Surprised Netflix is down. I thought they were hot/hot multi-
| region: https://netflixtechblog.com/active-active-for-multi-
| regional...
| ManuelKiessling wrote:
| Just watched two episodes of Better Call Saul on Netflix
| Germany without issues (while not being able to run my
| Terraform plan against my eu-central-1 infrastructure...).
| PaulHoule wrote:
| Makes me glad I am in us-east-2.
| heyitsguay wrote:
| I'm having issues with us-east-2 now. Console is down, then
| when I try to sign into a particular service I just get "please
| try again later".
| muttantt wrote:
| It's related to east-1, looks like some single point of
| failure not letting you access east-2 console URLs
| newhouseb wrote:
| We're flapping pretty hard in us-east-2 (looks API Gateway
| related, which is probably because it's an edge deployment
| which has a bunch of us-east-1 dependencies).
| muttantt wrote:
| us-east-2 is truly a hidden gem
| kohanz wrote:
| Except that it had a significant outage just a couple of
| weeks ago. Source: most of our stuff is on us-east-2.
| crad wrote:
| We make heavy usage of Kinesis Firehose in us-east-1.
|
| Issues started ~1:24am ET and resolved around 7:31am ET.
|
| Then really kicked in at a much larger scale at 10:32am ET.
|
| We're now seeing failures with connections to RDS Postgres and
| other services.
|
| Console is completely unavailable to me.
| m3nu wrote:
| Route53 is not updating new records. Console is also out.
| dylan604 wrote:
| >Issues started ~1:24am ET and resolved around 7:31am ET.
|
| First engineer found a clever hack using bubble gum
|
| >Then really kicked in at a much larger scale at 10:32am ET.
|
| Bubble gum dried out, and the connector lost connection again.
| Now, connector also fouled by the gum making a full replacement
| required.
| grumple wrote:
| Kinesis was the cause last Thanksgiving too iirc. It's the
| backbone of many services.
| [deleted]
| bilalq wrote:
| Some advice that may help:
|
| * Visit the console directly from another region's URL (e.g.,
| https://us-east-2.console.aws.amazon.com/console/home?region...).
| You can try this after you've successfully signed in but see the
| console failing to load as well.
|
| * If your AWS SSO app is hosted in a region other than us-east-1,
| you're probably fine to continue signing in with other
| accounts/roles.
|
| Of course, if all your stuff is in us-east-1, you're out of luck.
|
| EDIT: Removed incorrect advice about running AWS SSO in multiple
| regions.
| binaryblitz wrote:
| I don't think you can run SSO in multiple regions on the same
| AWS account.
| bilalq wrote:
| Thanks, corrected.
| staticassertion wrote:
| > Might also be a good idea to run AWS SSO in multiple regions
| if you're not already doing so.
|
| Is this possible?
|
| > AWS Organizations only supports one AWS SSO Region at a time.
| If you want to make AWS SSO available in a different Region,
| you must first delete your current AWS SSO configuration.
| Switching to a different Region also changes the URL for the
| user portal. [0]
|
| This seems to indicate you can only have one region.
|
| [0]
| https://docs.aws.amazon.com/singlesignon/latest/userguide/re...
| bilalq wrote:
| Good call. I just assumed you could for some reason. I guess
| the fallback is to devise your own SSO implementation using
| STS in another region if needed.
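|
| A rough sketch of that fallback (Python/boto3; the role ARN is
| hypothetical, and you still need some working credentials to
| call AssumeRole with):
|
|     import boto3
|
|     # Use a regional STS endpoint rather than the global one,
|     # which has historically been backed by us-east-1.
|     sts = boto3.client(
|         "sts",
|         region_name="us-west-2",
|         endpoint_url="https://sts.us-west-2.amazonaws.com",
|     )
|
|     creds = sts.assume_role(
|         RoleArn="arn:aws:iam::123456789012:role/BreakGlass",
|         RoleSessionName="outage-fallback",
|     )["Credentials"]
|
|     # Temporary credentials, used against a healthy region.
|     ec2 = boto3.client(
|         "ec2",
|         region_name="us-west-2",
|         aws_access_key_id=creds["AccessKeyId"],
|         aws_secret_access_key=creds["SecretAccessKey"],
|         aws_session_token=creds["SessionToken"],
|     )
|     print(len(ec2.describe_instances()["Reservations"]))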
| zackbloom wrote:
| I'm now getting failures searching for products on Amazon.com
| itself. This is somewhat surprising, as the narrative always was
| that Amazon didn't do a great job of dogfooding their own cloud
| platform.
| di4na wrote:
| They did more of it starting a few years back. It has been
| interesting to see how some services evolved far faster when
| retail started to use them. Seems that some customers are far
| more centric than others if you catch my drift...
| whoknowswhat11 wrote:
| My Amazon order history showed no orders, but now is showing my
| orders again - so stuff seems to be getting either fixed or
| intermittent outages.
| blahyawnblah wrote:
| Doesn't amazon.com run on us-east-1?
| zackbloom wrote:
| Update: I'm also getting Internal Errors trying to log into the
| Amazon.com site now as well.
| mabbo wrote:
| Are the actual _services_ down, or is it just the console and /or
| login page?
|
| For example, the sign-up page appears to be working:
| https://portal.aws.amazon.com/billing/signup#/start
|
| Are websites that run on AWS us-east up? Are the AWS CLIs
| working?
| pavel_lishin wrote:
| We're seeing issues with EventBridge, other folks are having
| trouble reaching S3.
|
| Looks like actual services.
| grumple wrote:
| I can tell you that some processes are not running, possibly
| due to SQS or SWF problems. Previous outages of this scale were
| caused by Kinesis outages. Can't connect via aws login at the
| cli either since we use SSO and that seems to be down.
| meepmorp wrote:
| EventBridge, CloudWatch. I've just started getting session
| errors with the console, too.
| 0xCMP wrote:
| Using cli to describe instances isn't working. Instances
| themselves seem fine so far.
| Waterluvian wrote:
| My ECS, EC2, Lambda, load balancer, and other services on us-
| east-1 still function. But these outages can sometimes
| propagate over time rather than instantly.
|
| I cannot access the admin console.
| snewman wrote:
| Anecdotally, we're seeing a small number of 500s from S3 and
| SQS, but mostly our service (which is at nontrivial scale, but
| mostly just uses EC2, S3, DynamoDB, and some basic network
| facilities including load balancers) seems fine, knock on wood.
| Either the problem is primarily in more complex services, or it
| is specific to certain AZs or shards or something.
| 2bitlobster wrote:
| Interactive Video Service (IVS) is down too
| Guest19023892 wrote:
| One of my sites went offline an hour ago because the web server
| stopped responding. I can't SSH into it or get any type of
| response. The database server in the same region and zone is
| continuing to run fine though.
| bijoo wrote:
| Interesting, is the site on a particular type of EC2
| instance, e.g. bare metal? I see c4.xlarge is doing fine in
| us-east-1.
| Guest19023892 wrote:
| It's just a t3a.nano instance since it's a project under
| development. However, I have a high number of t3a.nano
| instances in the same region operating as expected. This
| particular server has been running for years, so although
| it could be a coincidence it just went offline within
| minutes of the outage starting, it seems unlikely.
| Hopefully no hardware failures or corruption, and it'll
| just need a reboot once I can get access to AWS again.
| dangrossman wrote:
| My website that runs on US-East-1 is up.
|
| However, my Alexa (Echo) won't control my thermostat right now.
|
| And my Ring app won't bring up my cameras.
|
| Those services are run on AWS.
| kingcharles wrote:
| Now I'm imagining someone dying because they couldn't turn
| their heating on because AWS. The 21st Century is fucked up.
| [deleted]
| SEMW wrote:
| Definitely not just the console. We had hundreds of thousands
| of websocket connections to us-east-1 drop at 15:40, and new
| websocket connections to that region are still failing.
| (Luckily not a huge impact on our service cause we run in 6
| other regions, but still).
| andrew_ wrote:
| Side question: How happy are you with API Gateway's WebSocket
| service?
| SEMW wrote:
| No idea, we don't use it. These were websocket connections
| to processes on ec2, via NLB and cloudfront. Not sure
| exactly what part of that chain was broken yet.
| zedpm wrote:
| This whole time I've been seeing intermittent timeouts
| when checking a UDP service via NLB; I've been wondering
| if it's general networking trouble or something
| specifically with the NLB. EC2 hosts are all fine, as far
| as I can tell.
| sophacles wrote:
| I wasn't able to load my amazon.com wishlist, nor the shopping
| page through the app. Not an aws service specifically, but an
| amazon service that I couldn't use.
| heartbreak wrote:
| I'm getting blank pages from Amazon.com itself.
| nowahe wrote:
| I can't access anything related to Cloudfront, either through
| the CLI or console:
|
|     $ aws cloudfront list-distributions
|
|     An error occurred (HttpTimeoutException) when calling the
|     ListDistributions operation: Could not resolve DNS within
|     remaining TTL of 4999 ms
|
| However I can still access the distribution fine
| lambic wrote:
| We've had reports of some intermittent 500 errors from
| cloudfront, apart from that our sites are up.
| bijoo wrote:
| I see running EC2 instances are doing fine. However, starting
| stopped instances cannot be done through the AWS SDK due to an
| HTTP 500 error, even for the EC2 service. The CLI should be
| getting the HTTP 500 error too, since it likely uses the same
| API as the SDK.
| bkirkby wrote:
| fwiw, we are seeing errors when trying to publish to SNS although
| the aws status pages say nothing about SNS.
| saggy4 wrote:
| It seems that only the console is having the problem; the CLI
| works fine
| rhines wrote:
| CLI for EC2 works for me, but not ELB.
| MatthewCampbell wrote:
| CloudFormation changesets are reporting "InternalFailure" for
| us in us-east-1.
| albatross13 wrote:
| Not entirely true- we federate through ADFS and `saml2aws
| login` is currently failing with:
|
| error logging into aws role using saml assertion: error
| retrieving STS credentials using SAML: ServiceUnavailable:
| status code: 503
| zedpm wrote:
| I'm having CLI issues as well, they're using the same APIs
| under the hood. For example, I'm getting 503 errors for
| cloudwatch DescribeLogGroups.
| saggy4 wrote:
| Tried a few CLI commands; they seem to be working fine for me.
| Maybe it is not affecting everyone, or maybe it is just the
| start of something worse. :(
| technics256 wrote:
| try aws ecr describe-registry and you will get an error
| saggy4 wrote:
| Yes indeed, getting failures in the CLI as well
| [deleted]
| 999900000999 wrote:
| What if it never comes back up ?
| zegl wrote:
| I love that every time this happens, 100% of the services on
| https://status.aws.amazon.com are green.
| daniel-s wrote:
| That page is not loading for me... on which region is it
| hosted?
| qudat wrote:
| Status pages are hard
| mrweasel wrote:
| Not if you're AWS. At this point I'm fairly sure their status
| page is just a static html that always show all green.
| siva7 wrote:
| Well, it is.
| jtdev wrote:
| Why? Twitter and HN can tell me that AWS is having an outage,
| why can't AWS?
| AH4oFVbPT4f8 wrote:
| They sent their CEO into space, I am sure they have the
| resources to figure it out.
| cr3ative wrote:
| When they have too much pride in an all-green dash, sure.
| Allowing any engineer to declare a problem when first
| detected? Not so hard, but it doesn't make you look good if
| you have an ultra-twitchy finger. They have the balance badly
| wrong at the moment though.
| lukeschlather wrote:
| A trigger-happy status page gives realtime feedback for
| anyone doing a DoS attack. Even if you published that
| information publicly you would probably want it on a
| significant delay.
| pid-1 wrote:
| More like admitting failure is hard.
| smt88 wrote:
| No they're not.
|
| Step 1: deploy status checks to an external cloud.
| kube-system wrote:
| I agree, but does come with increased challenges with false
| positives.
|
| That being said, AWS status pages _are_ up.
| wruza wrote:
| "Falsehoods Programmers Believe About Status Pages"
| 0xmohit wrote:
| No wonder IMDB <https://www.imdb.com/> is down (returning 503).
| Sad that Amazon engineers don't implement what they teach their
| customers -- designing fault-tolerant and highly available
| systems.
| barbazoo wrote:
| It seems they updated it ~30 minutes after your comment.
| judge2020 wrote:
| I don't see why they couldn't provide an error rate graph like
| Reddit[0] or simply make services yellow saying "increased
| error rate detected, investigating..."
|
| 0: https://www.redditstatus.com/#system-metrics
| VWWHFSfQ wrote:
| because nobody cares when reddit is down. or at least, nobody
| is paying them to be up 99.999% of the time.
| willcipriano wrote:
| An executive has an OKR around uptime, and an automated system
| would prevent him or her from having control over the
| messaging. Therefore any effort to create one is squashed,
| leaving the people requesting it confused as to why, and
| without any explanation. Oldest story in the book.
| jkingsman wrote:
| Because Amazon has $$$$$ in their SLOs, and it costs them
| through the nose every minute they're down in payments made
| to customers and fees refunded. I trust them and most
| companies not to be outright fraudulent (although I'm sure
| some are), but it's totally understandable they'd be reticent
| to push the "Downtime Alert/Cost Us a Ton of Money" button
| until they're sure something serious is happening.
| jolux wrote:
| It should be costing them trust not to push it when they
| should though. A trustworthy company will err on the side
| of pushing it. AWS is a near-monopoly, so their
| unprofessional business practices have still yet to cost
| them.
| ethbr0 wrote:
| > _It should be costing them trust not to push it when
| they should though._
|
| This is what Amazon, the startup, understood.
|
| Step 1: _Always_ make it right and make the customer
| happy, even if it hurts in $.
|
| Step 2: If you find you're losing too much money over a
| particular issue, _fix the issue_.
|
| Amazon, one of the world's largest companies, seems to
| have forgotten that the risk of not reporting accurately
| isn't money, but _breaking the feedback chain_. Once you
| start gaming metrics, no leaders know what's really
| important to work on internally, because no leaders know
| what the actual issues are. It's late Soviet Union in a
| nutshell. If everyone is gaming the system at all levels,
| then eventually the ability to objectively execute
| decreases, because effort is misallocated due to
| misunderstanding.
| Kavelach wrote:
| > It's late Soviet Union in a nutshell
|
| How come an action of a private company in a capitalist
| country is like the Soviet Union?
| jolux wrote:
| Private companies are small centrally-planned economies
| within larger capitalist systems.
| dhsigweb wrote:
| I can Google and see how many apps, games, or other
| services are down. So them not "pushing some buttons" to
| confirm it isn't fooling anyone.
| btilly wrote:
| This is an incentive to dishonesty, leading to fraudulent
| payments and false advertising of uptime to potential
| customers.
|
| Hopefully it results in a class action lawsuit for enough
| money that Amazon decides that an automated system is
| better than trying to supply human judgement.
| jenkinstrigger wrote:
| Can someone just have a site ping all the GET endpoints
| on the AWS API? That is very far from "automating [their
| entire] system" but it's better than what they're doing.
| tynorf wrote:
| Something like this? https://stop.lying.cloud/
| lozenge wrote:
| It literally is fraudulent though.
|
| I don't think a region being down is something that you can
| be unsure about.
| hamburglar wrote:
| Oh, you can get pretty weaselly about what "down" means.
| If there is "just" an S3 issue, are all the various
| services which are still "available" but throwing an
| elevated number of errors because of their own internal
| dependency on S3 actually down or just "degraded?" You
| have to spin up the hair-splitting apparatus early in the
| incident to try to keep clear of the post-mortem party.
| :D
| w0m wrote:
| The more transparency you give, the harder it is to control
| the narrative. They have a general reputation for
| reliability, and exposing just how many actual
| errors/failures there are (that generally don't affect a
| large swath of users/use cases) would hurt that reputation
| for minimal gain.
| sakopov wrote:
| Those five 9s don't come easy. Sometimes you have to prop them
| up :)
| 1-6 wrote:
| It's hard to measure what five-9 is because you have to wait
| around until a 0.00001 occurs. Incentivizing post-mortems is
| absolutely critical in this case.
| notinty wrote:
| It's 0.001%; the first two 9s are the "99" in 99.999%.
|                 5N  = 99.999%
|                 3N  = 99.9%
|                 1N5 = 95%
|
| 5N is <26s of downtime per month; 3N is <43m12s.
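|
| For reference, the arithmetic is just (1 - availability) times
| the period; a quick Python sketch:
|
|     def downtime_per_month(availability_pct, days=30):
|         """Allowed downtime in minutes for a given availability."""
|         return (1 - availability_pct / 100) * days * 24 * 60
|
|     for a in (95.0, 99.9, 99.99, 99.999):
|         print(a, round(downtime_per_month(a), 2), "min/month")
|     # 95     -> 2160.0 min/month
|     # 99.9   -> 43.2 min/month
|     # 99.99  -> 4.32 min/month
|     # 99.999 -> 0.43 min/month (about 26 seconds)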
| 1-6 wrote:
| I considered writing it as a percentage but then decided
| against it and just moved the decimal instead. But good
| info for clarification.
| JoelMcCracken wrote:
| Every time someone asks to update the status page, managers
| say "nein"
| jjoonathan wrote:
| I wonder how often outages really happen. The official page
| is nonsense, of course, and we only collectively notice when
| the outage is big enough that lots of us are affected. On
| AWS, I see about a 3:1 ratio of "bump in the night" outages
| (quickly resolved, little corroboration) to mega too-big-to-
| hide outages. Does that mirror others' experiences?
| Spivak wrote:
| If you count any time AWS is having a problem that impacts
| our production workloads then I think it's about 5:1.
| Dealing with "AWS is down" outages is easy because I can
| just sit back and grab some popcorn, it's the "dammit I
| know this is AWS's fault" outages that are a PITA because
| you count yourself lucky to even get a report in your
| personalized dashboard.
| jjoonathan wrote:
| Yep.
|
| Random aside: any chance you are related to the Calculus
| on Manifolds Spivak?
| Spivak wrote:
| Nope, just a fan. It was the book that pioneered my love
| of math.
| clh1126 wrote:
| I had to log in to say that one of my favorite quotes of
| all time is from Calculus on Manifolds.
|
| He says that any good theorem is worth generalizing, and
| I've generalized that into a rule for life.
| kylemh wrote:
| https://aws.amazon.com/compute/sla/
|
| looks like only four 9's
| dotancohen wrote:
| > looks like only four 9's
|
| That's why the Germans are such good engineers.
| Did the drives fail? Nein. Did the CPU overheat?
| Nein. Did the power get cut? Nein. Did the
| network go down? Nein.
|
| That's "four neins" right there.
| [deleted]
| [deleted]
| NicoJuicy wrote:
| Not right now. I think they monitor if it appears on HN too.
| swiftcoder wrote:
| When I worked there it required the signoff of both your VP-
| level executive and the comms team to update the status page. I
| do not believe I ever received said signoff before the issues
| were resolved.
| queuebert wrote:
| Are they lying, or just prioritizing their own services?
| _verandaguy wrote:
| Willing to bet the status page gets updated by logic on us-
| east-1
| itsyaboi wrote:
| Status service is probably hosted in us-east-1
| AH4oFVbPT4f8 wrote:
| amazon.com seems to be having problems too. I get a
| "something went wrong" page with a new design/layout, which
| I assume is either new or a failsafe.
| KineticLensman wrote:
| Looks okay right now to this UK user of amazon.co.uk
| btilly wrote:
| It depends on which Amazon region you are being served
| from.
|
| It is very unlikely that Amazon would deliberately make
| your messages cross the Atlantic just to find an American
| region that is unable to serve you.
| bennyp101 wrote:
| https://music.amazon.co.uk is giving me an error since about
| 16:30 GMT
|
| "We are experiencing an error. Our apologies - We will be
| back up soon."
| [deleted]
| gbear0 wrote:
| I assume each service has its own health check that checks the
| service is accessible from an internal location, thus most are
| green. However, when Service A requires Service B to do work,
| but Service B is down, a simple access check on Service A
| clearly doesn't give a good representation of uptime.
|
| So what should a good health check actually report these
| days? Is it just its own status, or should it include a
| breakdown of the status of external dependencies as part of
| its folded-up status?
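| One common answer is to report both: the service's own
| liveness plus a folded-up breakdown of its dependencies. A
| minimal sketch (the dependency names and health URLs here
| are placeholders, not real endpoints):
|
|     # Health check reporting overall status plus per-dependency detail.
|     import json
|     import urllib.request
|
|     DEPENDENCIES = {
|         "object-store": "https://example.com/obj/healthz",
|         "queue": "https://example.com/queue/healthz",
|     }
|
|     def check_url(url, timeout=3):
|         try:
|             with urllib.request.urlopen(url, timeout=timeout) as resp:
|                 return "ok" if resp.status == 200 else "degraded"
|         except Exception as exc:
|             return "down: " + type(exc).__name__
|
|     def health():
|         deps = {name: check_url(url)
|                 for name, url in DEPENDENCIES.items()}
|         bad = any(v != "ok" for v in deps.values())
|         # The service itself is up, but flag it if a dependency is out.
|         overall = "degraded" if bad else "ok"
|         return {"status": overall, "dependencies": deps}
|
|     if __name__ == "__main__":
|         print(json.dumps(health(), indent=2))
|
| Whether "degraded" should flip the top-level light is exactly
| the judgement call the dashboard debate above is about.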
| goshx wrote:
| I remember the time when S3 went down and took the status page
| down with it
| notreallyserio wrote:
| Makes you wonder if they have to manually update the page when
| outages occur. That'd be a pretty bad way to go, so I'd hope
| not. Maybe the code to automatically update the page is in us-
| east-1? :)
| gromann wrote:
| Word on the street is the status page is just a JPG
| wfleming wrote:
| Something like that has impacted the status page in the past.
| There was a severe Kinesis outage last year
| (https://aws.amazon.com/message/11201/), and they couldn't
| update the service dashboard for quite a while because their
| tool to manage the service dashboard lives in us-east-1
| and depends on Kinesis.
| JohnJamesRambo wrote:
| > Goodhart's Law is expressed simply as: "When a measure
| becomes a target, it ceases to be a good measure."
|
| It's very frustrating. Why even have them?
| Spivak wrote:
| Because "uptime" and "nines" became a marketing term. Simple
| as that. But the problem is that any public-facing measure of
| availability becomes a defacto marketing term.
| hinkley wrote:
| Also, 4-5 nines is virtually impossible for complex systems,
| so the sort of responsible people who could make 3 nines
| true begin to check out, and now you're getting most of
| your info from the delusional, and you're lucky if you
| manage 2 objective nines.
| Enginerrrd wrote:
| The older I get the more I hate marketers. The whole field
| stands on the back of war-time propaganda research and it
| sure feels like it's the cause of so much rot in society.
| jbavari wrote:
| Well yea, it's eventually consistent ;)
| jrochkind1 wrote:
| Even better, when I try to go to console, I get:
|
| > AWS Management Console Home page is currently unavailable.
|
| > You can monitor status on the AWS Service Health Dashboard.
|
| "AWS Service Health Dashboard" is a link to
| status.aws.amazon.com... which is ALL GREEN. So... thanks for
| the suggestion?
|
| At this point the AWS service health dashboard is kind of
| famous for always being green, isn't it? It's a joke to its
| users. Do the folks who work on the relevant AWS internal
| team(s) know this and just not have the resources to do
| anything about it, or what? If it's a harder problem than
| you'd think for interesting technical reasons, that'd be
| interesting to hear about.
| kortex wrote:
| It's like trying to get the truth out of a kid that caused some
| trouble.
|
| Mom: Alexa, did you break something?
|
| Alexa: No.
|
| M: Really? What's this? _500 Internal server error_
|
| A: ok maybe management console is down
|
| M: Anything else?
|
| A: ...
|
| A: ... ok maybe cloudwatch logs
|
| M: Ah hah. What else?
|
| A: That's it, I swear!
|
| M: _503 ClientError_
|
| A: ...well okay secretsmanager might be busted too...
| hinkley wrote:
| There was a great response in r/relationship advice the other
| day where someone said that OP's partner forced a fight
| because they're planning to cheat on them, reconcile, and
| then will 'trickle out the truth' over the next 6 months. I'm
| stealing that phrase.
| mdni007 wrote:
| Funny I literally just asked my Alexa.
|
| Me: Alexa, is AWS down right now?
|
| Alexa: I'd rather not answer that
| hinkley wrote:
| Wise robot.
|
| That's a bit like involving your kid in an argument between
| parents.
| PopeUrbanX wrote:
| The very expensive EC2 instance I started this morning still
| works. Of course now I can't shut it down.
| ta20200710 wrote:
| EC2 or S3 showing red in any region literally requires personal
| approval of the CEO of AWS.
| dekhn wrote:
| Uhhhhh... what if the monitoring said it was hard down?
| They'd still not show red?
| choeger wrote:
| Probably they cannot. They outsourced this dashboard and it
| runs on AWS now ;).
| dia80 wrote:
| Unfortunately, errors don't require his approval...
| notreallyserio wrote:
| Is this true or a joke? This sort of policy is how you
| destroy trust.
| marcosdumay wrote:
| If you trust them at this point, you have not been paying
| attention, and you will probably continue to trust them
| after this.
| bsedlm wrote:
| maybe we gotta consider the publicly facing status pages as
| something other than a technical tool (e.g. marketing or PR
| or something like that, dunno)
| jeffrallen wrote:
| Well, no big deal, there's not really a lot of trust there
| to destroy...
| jedberg wrote:
| From what I've heard it's mostly true. Not only the CEO but
| a few SVPs can approve it, but yes a human must approve the
| update and it must be a high level exec.
|
| Part of the reason is because their SLAs are based on that
| dashboard, and that dashboard going red has a financial
| cost to AWS, so like any financial cost, it needs approval.
| orangepurple wrote:
| Being dishonest about SLAs seems to bear zero cost in
| this case?
| solatic wrote:
| Zero directly-attributable, calculable-at-time-of-
| decision cost. Of course there's a cost in terms of
| customers who leave because of the dishonest practice,
| but, who knows how many people that'll be? Out of the
| customers who left after the outage, who knows whether
| they left due to not communicating status promptly and
| honestly or whether it was for some other reason?
|
| Versus, if a company has X SLA contracts signed, that
| point to Y reimbursement for being out for Z minutes, so
| it's easily calculable.
| jedberg wrote:
| It's not really dishonest though because there is nuance.
| Most everything in EC2 is still working it seems, just
| the console is down. So is it really down? It should
| probably be yellow but not red.
| dekhn wrote:
| if you cannot access the control plane to create or
| destroy resources, it is down (partial availability). The
| jobs that are running are basically zombies.
| w0m wrote:
| Depending on the workload being run, users may or may not
| notice. Should be Yellow at a minimum.
| jedberg wrote:
| Seems like the API is still working and so is auto
| scaling. So they aren't really zombies.
|
| Partial availability isn't the same as no availability.
| electroly wrote:
| The API is NOT working -- it may not have been listed on
| the service health dashboard when you posted that, but it
| is now. We haven't been able to launch an instance at
| all, and we are continuously trying. We can't even start
| existing instances.
| dekhn wrote:
| I'm right in the middle of an AWS-run training and we
| literally can't run the exercises because of this.
|
| Let me repeat that: my AWS training that is run by AWS
| that I pay AWS for isn't working, because AWS is having
| control plane (or other) issues. This is several hours
| after the initial incident. We're doing training in us-
| west-2, but the identity service and other components run
| in us-east-1.
| justrudd wrote:
| I'm running EKS in us-west-2. My pods use a role ARN and
| identity token file to get temporary credentials via STS.
| STS can't return credentials right now. So my EKS cluster
| is "down" in the sense that I can't bring up new pods. I
| only noticed because an auto-scaling event failed.
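| For context, this is roughly the exchange that fails when STS
| is impaired: a sketch of what the SDK does with the role ARN
| and token file that EKS injects (the session name here is an
| arbitrary example, and the region comes from the pod's
| environment):
|
|     # Exchange the pod's projected web identity token for temporary
|     # credentials via STS (the EKS "IAM roles for service accounts" flow).
|     import os
|     import boto3
|
|     def credentials_from_web_identity():
|         role_arn = os.environ["AWS_ROLE_ARN"]
|         token_file = os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]
|         with open(token_file) as f:
|             token = f.read()
|         sts = boto3.client("sts")
|         # This STS call is the step that fails during the outage.
|         resp = sts.assume_role_with_web_identity(
|             RoleArn=role_arn,
|             RoleSessionName="example-session",
|             WebIdentityToken=token,
|         )
|         # AccessKeyId, SecretAccessKey, SessionToken, Expiration
|         return resp["Credentials"]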
| dekhn wrote:
| We ran through the whole 4.5 hour training and the
| training app didn't work the entire time.
| jjoonathan wrote:
| "Good at finding excuses" is not the same thing as
| "honest."
| paulryanrogers wrote:
| SNS seems to be at least partially down as well
| jtheory wrote:
| My company relies on DynamoDB, so we're totally down.
|
| edit: partly down; it's sporadically failing
| jrochkind1 wrote:
| Heroku is currently having major problems. My stuff is
| still up, but I can't deploy any new versions. Heroku
| runs their stuff on AWS. I have heard reports of other
| companies who run on AWS also having degraded service and
| outages.
|
| I'd say when other companies who run their infrastructure
| on AWS are going down, it's hard to argue it's not a real
| outage.
|
| But AWS status _has_ changed to yellow at this point.
| Probably heroku could be completely down because of an
| AWS problem, and AWS status would still not show red. But
| at least yellow tells us there's a problem, the
| distinction between yellow and red probably only matters
| at this point to lawyers arguing about the AWS SLA, the
| rest of us know yellow means "problems", red will never
| be seen, and green means "maybe problems anyway".
|
| I believe us-east-1 could be entirely missing, and they'd
| still only put a yellow, not a red, on the status page.
| After all, the other regions are all fine, right?
| dekhn wrote:
| Sure, but... that just raises more questions :)
|
| Taken literally, what you are saying is that the service
| could be down and an executive could override that,
| preventing them from paying customers for a service outage,
| even if the service did have an outage and the customer
| could prove it (screenshots, metrics from other cloud
| providers, many different folks seeing it).
|
| I'm sure there is some subtlety to this, but it does mean
| that large corps with influence should be talking to AWS
| to ensure that status information corresponds with actual
| service outages.
| [deleted]
| emodendroket wrote:
| I have no inside knowledge or anything but it seems like
| there are a lot of scenarios with degraded performance
| where people could argue about whether it really
| constitutes an outage.
| dilyevsky wrote:
| One time GCP argued that since they did return 404s on
| GCS for a few hours, that wasn't an uptime/latency SLA
| violation, so we were not entitled to a refund (though they
| refunded us anyway).
| Enginerrrd wrote:
| Man, between costs and shenanigans like this, why don't
| more companies self-host?
| dilyevsky wrote:
| 1. Leadership prefers to blame cloud when things break
| rather than take responsibility.
|
| 2. Cost is not an issue (until it is but you're already
| locked in so oh well)
|
| 3. Faang has drained the talent pool of people who know
| how
| pm90 wrote:
| Opex > Capex. If companies thought about the long term,
| yes, they might consider it. But unless the cloud providers
| fuck up really badly, they're OK to take the heat
| occasionally and tolerate a bit of nonsense.
| dilyevsky wrote:
| You can lease equipment you know...
| dekhn wrote:
| Yep. I was an SRE who worked at Google and also launched
| a product on Google Cloud. We had these arguments all the
| time, and the contract language often provides a way for
| the provider to weasel out.
| jedberg wrote:
| Like I said I never worked there and this is all hearsay
| but there is a lot of nuance here being missed like
| partial outages.
| dekhn wrote:
| This is no longer a partial outage. The status page
| reports elevated API error rates, DynamoDB issues, EC2
| API error rates, and my company's monitoring is
| significantly affected (IE, our IT folks can't tell us
| what isn't working) and my AWS training class isn't
| working either.
|
| If this needed a CEO to eventually get around to pressing
| a button that said "show users the actual information
| about a problem" that reflects poorly on amazon.
| dhsigweb wrote:
| My friend works at a telemetry company for monitoring, and
| they are working on alerting customers of cloud service
| outages before the cloud providers do, since the providers
| like to sit on their hands for a while (presumably to try
| and fix things before anyone notices).
| meetups323 wrote:
| Large corps with influence get what they want regardless.
| Status page goes red and the small corps start thinking
| they can get what they want too.
| scrose wrote:
| > Status page goes red and the small corps start thinking
| they can get what they want too.
|
| I think you mean "start thinking they can get what they
| pay for"
| notreallyserio wrote:
| I wonder how well known this is. You'd think it would be
| hard to hire ethical engineers with such a scheme in
| place and yet they have tens of thousands.
| sneak wrote:
| It's widespread industry knowledge now that AWS is publicly
| dishonest about downtime.
|
| When the biggest cloud provider in the world is famous for
| gaslighting, it sets expectations for our whole industry.
|
| It's fucking disgraceful that they tolerate such a lack of
| integrity in their organization.
| strictfp wrote:
| "some customers may experience a slight elevation in error
| rates" --> everything is on fire, absolutely nothing works
| ballenf wrote:
| https://downdetector.com
|
| Amazing and scary to see all the unrelated services down right
| now.
| nightpool wrote:
| I think it's pretty unlikely that both Google and Facebook
| are affected by this minor AWS outage, whatever DownDetector
| says. I even did a spot check on some of the smaller websites
| they report as "down", like canva.com, and didn't see any
| issues.
| zarkov99 wrote:
| You might be right about Google and Facebook, but this
| isn't minor at all. Impact is widespread.
| john37386 wrote:
| It's starting to show issues now. I agree it took a while
| before we could get real visibility into the incident.
| jrockway wrote:
| I wonder if other parts of Amazon do this. Like, if their
| inventory system thinks something is in stock but people
| can't find it in the warehouse, do they just not send it to
| you and hope you don't notice? AWS's culture sounds super
| broken.
|
| My favorite status page, though, is Slack's. You can read an
| article in the New York Times about how Slack was down for most
| of a day, and the status page is just like "some percentage of
| users experienced minor connectivity issues". "Some percentage"
| is code for "100%" and "minor" is code for "total". Good try.
| whoknew1122 wrote:
| The problem is that oftentimes you can't actually update
| the status page; most internal systems are down.
|
| We can't even update our product to say it's down, because
| accessing the product requires a process that is currently
| dead.
| thayne wrote:
| That's why your status page should be completely independent
| from the services it is monitoring (minus maybe something
| that automatically updates it). We use a third party to host
| our status page specifically so that we can update it even if
| all our systems are down.
| whoknew1122 wrote:
| I'm not saying you're wrong, or that the status page is
| architected properly. I'm just speaking to the current
| situation.
| davecap1 wrote:
| On top of that, the "Personalized Health Dashboard" doesn't
| work because I can't seem to log in to the console.
| meepmorp wrote:
| I'm logged in; you're missing an error message.
| davecap1 wrote:
| We have federated login with MFA required (which was
| failing). It just started working again.
|
| Scratch that... console is not loading at all now :)
| ricardobayes wrote:
| Wonder why almost all Amazon frontends look like they were
| written in C++
| [deleted]
| davikawasaki wrote:
| EKS works for us
| alephnan wrote:
| Vanguard has been slow all day. I'm going to guess Vanguard has a
| dependency on us-east-1.
| tgtweak wrote:
| I'm going with "BGP error" which is likely config-related, likely
| human error.
|
| Seems to be the trend with the last 5-6 big cloud outages.
| keyle wrote:
| Why does that webpage render like a dog? I get that it's
| under load, but the rendering itself is chugging something
| rare.
|
| Edit: wow, that webpage is humongous... never heard of
| paging?
| saggy4 wrote:
| I think only the console is down. The CLI is working fine
| for me in ap-southeast-1.
| kchoudhu wrote:
| At this point I have no idea why anyone would put anything in us-
| east-1.
|
| Also isolation is not as good as they would have you believe: I
| am unable to login to AWS Quicksight in us-west-2...
| bradhe wrote:
| Man, some conclusions are being _jumped_ to by this reply.
| InTheArena wrote:
| There is a very long history of US-east-1 being horrible.
| Just bad. We've told every client we can to get out of there.
| It's one of the oldest Amazon regions, and I think too much
| old legacy and weird stuff happens there. Use US-west-2.
| jedberg wrote:
| Or US-East-2.
| blahyawnblah wrote:
| Isn't us-east-1 where they deploy everything first? And the
| only region that has 100% of all available services?
| crad wrote:
| Been in us-east-1 for a long time. Things like Direct Connect
| and other integrations aren't easy or cheap to move and when
| you have other, bigger priorities, moving regions is not an
| easy decision to prioritize.
| RubberShoes wrote:
| "AWS Management Console Home page is currently unavailable. You
| can monitor status on the AWS Service Health Dashboard."
|
| And then Health Dashboard is 100% green. What a joke.
| nurgasemetey wrote:
| It seems that IMDB and Goodreads are also affected
| kingcharles wrote:
| Yeah, and Audible and Amazon.com search and the Amazon retail
| stores.
|
| Basically Amazon fucked all their own products too.
| picodguyo wrote:
| Funny, I just asked Alexa to set a timer and she said there was a
| problem doing that. Apparently timers require functioning us-
| east-1 now.
| minig33 wrote:
| I can't turn on my lights... the future is weird
| lsaferite wrote:
| And that is why my lighting automation has a baseline req
| that it works 100% without the internet and preferably
| without a central controller.
| organsnyder wrote:
| I love my Home Assistant setup for this reason. I can even
| get light bulbs pre-flashed with ESPHome now (my wife was
| bemused when I was updating the firmware on the
| lightbulbs).
| rodgerd wrote:
| HomeKit compatibility is a useful proxy for local API,
| since it's a hard requirement for HomeKit certification.
| m12k wrote:
| A former colleague told me years ago that us-east-1 is basically
| the guinea pig where changes get tested before being rolled out
| to the other regions, and as a result is less stable than the
| others. Does anyone know if there's any truth to this?
| wizwit999 wrote:
| False; it's often 4th, IIRC. SFO (us-west-1) is actually
| usually first.
| shepherdjerred wrote:
| At my org it was deployed in the middle, around the fourth wave
| iirc
| arpinum wrote:
| This is not true. Lambda updates us-east-1 last.
| treesknees wrote:
| I can't see why they'd use the most common/popular region as a
| guinea pig.
| Kye wrote:
| Problem: you've gone as far as you can go testing internally
| or with test groups. You know there are edge cases you'll
| only identify by having enough people test it.
|
| Solution: push it to production on the zone with the most
| users and see what breaks.
| sharpy wrote:
| The guideline has been to deploy to it last.
|
| If the team follows pipeline best practices, they are supposed
| to deploy to a single small region first, wait 24 hours, and
| then deploy to more, wait more, and deploy to more, until
| finally deploying to us-east-1.
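| A wave-based rollout like that can be expressed as data; an
| illustrative sketch (the region groupings and bake times here
| are made up, not AWS's actual plan):
|
|     # Staged rollout: deploy wave by wave, baking between waves.
|     import time
|
|     WAVES = [
|         (["eu-west-3"], 24),                  # one small region first
|         (["us-west-1", "ap-southeast-2"], 24),
|         (["us-west-2", "eu-west-1"], 24),
|         (["us-east-1"], 0),                   # biggest region last
|     ]
|
|     def deploy(region):
|         print("deploying to", region)         # placeholder for the real step
|
|     def rollout(dry_run=True):
|         for regions, bake_hours in WAVES:
|             for region in regions:
|                 deploy(region)
|             # A real pipeline gates on health metrics, not just a timer.
|             if not dry_run and bake_hours:
|                 time.sleep(bake_hours * 3600)
|
|     if __name__ == "__main__":
|         rollout()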
| mystcb wrote:
| Not sure it helps, but got this update from someone inside AWS a
| few moments ago.
|
| "We have identified the root cause of the issues in the US-EAST-1
| Region, which is a network issue with some network devices in
| that Region which is affecting multiple services, including the
| console but also services like S3. We are actively working
| towards recovery."
| rozenmd wrote:
| They finally updated the status page:
| https://status.aws.amazon.com/
| mystcb wrote:
| Ahh, good spot. It does seem that the AWS person I am
| speaking to has a few more details than what is shown on the
| page; they just messaged me the same update, but added:
|
| "All teams are engaged and continuing to work towards
| mitigation. We have confirmed the issue is due to multiple
| impaired network devices in the US-EAST-1 Region."
|
| Doesn't sound like they are having a good day there!
| alpha_squared wrote:
| That's a copy-paste, we got the same thing from our AWS
| contact. It's just enough info to confirm there's an issue, but
| not enough to give any indication on the scope or timeline to
| resolution.
| alexatalktome wrote:
| Internally the rumor is that our CICD pipelines failed to stop
| bad commits to certain AWS services. This isn't due to tests
| but due to actual pipelines infra failing.
|
| We've been told to disable all pipelines even if we have time
| blockers or manual approval steps or failing tests
| jdc0589 wrote:
| I love how they are sharing this stuff out to some clients,
| but it's technically under NDA.
| alfalfasprout wrote:
| Yeah, we got updates via NDA too lol. Such a joke that a
| status page update is considered privileged lol.
| romanhotsiy wrote:
| It's funny that the first place I go to learn about the outage is
| Hacker News and not https://status.aws.amazon.com/ (it still
| reports everything as "operating normally"...)
| albatross13 wrote:
| Yeah, I tend to go off of https://downdetector.com/status/aws-
| amazon-web-services/
| murph-almighty wrote:
| I always got the impression that downdetector worked by
| logging the number of times they get a hit for a particular
| service and using that as a heuristic to determine if
| something is down. If so, that's brilliant.
| albatross13 wrote:
| I think it's a bit simpler for AWS- there's a big red "I
| have a problem with AWS" button on that page. You click it,
| tell it what your problem is, and it logs a report. Unless
| that's what you were driving at and I missed it, it's
| early. Too early for AWS to be down :(
|
| Some 3600 people have hit that button in the last ~15
| minutes.
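| A report-count heuristic like that can be as simple as
| comparing the latest window against a rolling baseline; a
| toy sketch (the window count, multiplier, and threshold are
| arbitrary choices for illustration):
|
|     # Flag a service when report volume far exceeds its recent baseline.
|     from collections import deque
|
|     class OutageDetector:
|         def __init__(self, windows=24, multiplier=5, min_reports=50):
|             self.history = deque(maxlen=windows)  # reports per past window
|             self.multiplier = multiplier
|             self.min_reports = min_reports
|
|         def observe(self, reports):
|             if self.history:
|                 baseline = sum(self.history) / len(self.history)
|             else:
|                 baseline = 0
|             alarming = (reports >= self.min_reports and
|                         reports > self.multiplier * max(baseline, 1))
|             self.history.append(reports)
|             return alarming
|
|     detector = OutageDetector()
|     for count in [3, 5, 4, 6, 2, 3600]:  # e.g. ~3600 reports in 15 minutes
|         print(count, detector.observe(count))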
| cmg wrote:
| It's brilliant until the information is bad.
|
| When Facebook's properties all went down in October, people
| were saying that AT&T and other cell phone carriers were
| also down - because they couldn't connect to FB/Insta/etc.
| There were even some media reports that cited Downdetector,
| seemingly without understanding that its data is basically
| crowdsourced, and sometimes the crowd is wrong.
| bmcahren wrote:
| I made sure our incident response plan includes checking Hacker
| News and Twitter for _actual_ updates and information.
|
| As of right now, this thread and one update from a twitter
| user,
| https://twitter.com/SiteRelEnby/status/1468253604876333059 are
| all we have. I went into disaster recovery mode when I saw our
| traffic dropped to 0 suddenly at 10:30am ET. That was just the
| SQS/something else preventing our ELB logs from being extracted
| to DataDog though.
| unethical_ban wrote:
| So as of the time you posted this comment, were other
| services actually down? The way the 500 shows up, and the AWS
| status page, makes it sound like "only" the main landing
| page/mgt console is unavailable, not AWS services.
| jeremyjh wrote:
| Yes, they are still publishing lies on their status page.
| In this thread people are reporting issues with many
| services. I'm seeing periodic S3 PUT failures for the last
| 1.5 hours.
| alexatalktome wrote:
| AWS services are all built against each other so one
| failing will take down a bunch more, which take down more,
| like dominoes. Internally there's a list of >20 "public
| facing" AWS services impacted.
| authed wrote:
| I usually go on Twitter first for outages.
| 1-6 wrote:
| Community reporting > internal operations
| taf2 wrote:
| Now 57 minutes later and it still reports everything as
| operating normally.
| mijoharas wrote:
| It shows errors now.
| romanhotsiy wrote:
| It doesn't show errors with Lambda and we clearly do
| experience them.
| kingcharles wrote:
| Does McDonalds use AWS for the backend to their app?
|
| If I find out this is why I couldn't get my Happy Meal this
| morning I'm going to be really, really grumpy.
|
| EDIT: I'm REALLY grumpy now:
|
| https://aws.amazon.com/blogs/industries/aws-is-how-mcdonalds...
| hoofhearted wrote:
| The McDonalds App was showing on the frontpage of Down Detector
| at the same time as all the Amazon dependent services last I
| checked.
| Crespyl wrote:
| Apparently Taco Bell too, not being able to place an order and
| then also not being able to fall back to McDonalds was how I
| realized there was a larger outage :p
|
| What am I supposed to do for lunch now? Go to the drive through
| and order like a normal person? /s
|
| Grumble grumble
| dwighttk wrote:
| Is this why goodreads hasn't been working today?
| TameAntelope wrote:
| We run as much as we can out of us-east-2 because it has more
| uptime than us-east-1, and I don't think I've ever regretted that
| decision.
| p2t2p wrote:
| I have alerts going off because of that...
| kello wrote:
| Don't envy the engineers working on this right now. Good luck!
| nickysielicki wrote:
| The Amazon.com storefront was giving me issues loading search
| results -- this is the worst possible time of year for Amazon to
| have issues. It's horrifying and awesome to imagine hundreds of
| thousands (if not millions) of dollars of lost orders an hour --
| just from sluggish load times. Hugops to those dealing with this.
| throwanem wrote:
| Third worst time. It's not BFCM and it's not the week before
| Christmas; from prior high-volume ecommerce experience I
| suspect their purchase rate is elevated at this time but
| nowhere near those two peaks.
| n0cturne wrote:
| I just walked into the Amazon Books store at our local mall. They
| are letting everyone know at the entrance that "some items aren't
| available for purchase right now because our systems are down."
|
| So at least Amazon retail is feeling some of the pain from this
| outage!
| etimberg wrote:
| Seeing issues in ca-central-1 and us-east
| taf2 wrote:
| We're not down and we're in us-east-1... maybe there is more to
| this issue?
| bradhe wrote:
| I think it could just be the console?
| taf2 wrote:
| Found that it seems Lambda is impacted
| fsagx wrote:
| Everything fine with S3 in us-east-1 for me. Also just not able
| to access the console.
| abarringer wrote:
| Our services in AWS East are down.
___________________________________________________________________
(page generated 2021-12-07 23:00 UTC)