[HN Gopher] AWS's us-east-1 region is experiencing issues
___________________________________________________________________
AWS's us-east-1 region is experiencing issues
Author : zaltekk
Score : 87 points
Date : 2022-03-09 20:59 UTC (2 hours ago)
(HTM) web link (health.aws.amazon.com)
(TXT) w3m dump (health.aws.amazon.com)
| mtrunkat wrote:
| In our case (Apify.com) there was a complete outage of SQS
| (15mins+), most likely DNS problems + EC2 instances got restarted
| probably as a result of an SQS outage.
|
| EDIT: Also AWS Lambda seems to be down, and AWS EC2 APIs have a
| very high error rate and machines have slow startup times.
| BigGreenTurtle wrote:
| Yep, I saw empty responses for sqs.us-east-1.amazonaws.com for
| a while. Seems okay now though.
| saltypal wrote:
| Based on our telemetry, this started as NXDOMAINs for
| sqs.us-east-1.amazonaws.com beginning in modest volumes at 20:43 UTC and
| becoming a total outage at 20:48 UTC. Naturally, it was
| completely resolved by 20:57, 5 minutes before anything was
| posted in the "Personal Health Dashboard" in the AWS console.
|
| It takes a while to find a Vice President, I guess.
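The kind of telemetry described above (resolution failures for a hostname, then recovery) can be approximated with a small probe. This is a minimal sketch, not the commenter's tooling: `probe_dns` and its return labels are hypothetical names, and the resolver is injectable only so the function can be exercised without network access. Note that `socket.gaierror` covers resolution failures in general, so this sketch does not distinguish NXDOMAIN from other resolver errors.

```python
import socket
from typing import Callable


def probe_dns(hostname: str, resolver: Callable = socket.getaddrinfo) -> str:
    """Classify one resolution attempt as 'ok', 'empty', or 'failed'.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    that a fake resolver can be substituted in tests (an assumption of
    this sketch, not something described in the thread).
    """
    try:
        results = resolver(hostname, 443)
    except socket.gaierror:
        # Raised when the name cannot be resolved (NXDOMAIN among others).
        return "failed"
    # A resolver that returns no addresses is reported separately.
    return "ok" if results else "empty"
```

Run on a schedule and logged with timestamps, a probe like this would give a rough start and end time for a DNS-level outage, independent of the provider's status page.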
| mcqueenjordan wrote:
| Or perhaps triaging, root-causing, and fixing the issue is the
| highest-order bit?
| nostrebored wrote:
| It definitely is. For an issue like this, you will see
| relevant teams and delegates looped in very quickly. Getting
| approved wording about an outage requires some very senior
| people though. Often they have to be paged in as well.
|
| Having worked at a few other large tech companies now --
| Amazon's incident response process is honestly great. It's
| one of the things I miss about working there.
| saltypal wrote:
| This. We have a 4-person team and posted our own incident
| about this 7 minutes before Amazon did. Surely they can aim
| a little higher.
| ElevenLathe wrote:
| IME, this actually becomes more challenging as a company
| gets larger, not less (but that doesn't mean it can't be
| done).
| smachiz wrote:
| sure, but if those people are updating the status pages to
| say something isn't right and we're looking into it, we're
| doomed.
| viraptor wrote:
| Different people have different responsibilities. At Amazon
| scale, the comms and people doing a deep dive to fix stuff
| will not be the same.
| saltypal wrote:
| Separate teams. We have a tiny team and even _we_ appoint a
| group to fix and a group or individual to do nothing but
| communicate.
| halestock wrote:
| I can't help but wonder, with the increases in attrition across
| the industry, are we hitting some kind of tipping point where the
| institutional knowledge in these massive tech corporations is
| disappearing?
|
| Mistakes happen all the time but when all the people who
| intimately know how these systems work leave for other
| opportunities, disasters are bound to happen more and more.
| nyellin wrote:
| That's the problem we're out to solve with robusta.dev.
|
| We're slowly but surely converting the world's institutional
| technical knowledge into re-usable and automated runbooks.
| hughrr wrote:
| I'm just going to have to spend all day fixing the runbooks
| as well as the technology ;)
| zwirbl wrote:
| Just like the tech priests in Warhammer 40k, keeping occult old
| engineering, that no one could build anymore, running
| hughrr wrote:
| So today I find out my job title is tech priest. I was happy
| with necromancer before. Does it come with a pay rise?
| viraptor wrote:
| Not familiar with 40k. Was it a similar idea to nuclear-
| power-as-religion from Foundation?
| atty wrote:
| Not far off. The "golden age" of humanity was shattered
| long ago, with the mortal wounding of the god emperor, and
| knowledge of most of the greatest technology was
| lost. Millennia later, a cult has grown up that both
| worships and maintains technology as having machine
| spirits, which are somehow linked to the machine god
| itself. That god may or may not be the same or related to
| the god emperor of mankind, depending on the
| interpretation.
|
| Honestly the lore of w40k is quite fun to read, if you're
| into dystopian and fantasy sci-fi.
| aaronax wrote:
| Or how about Anathem, with the Ita class doing computer
| things and nuclear materials cared for by a select group?
| SketchySeaBeast wrote:
| If we want to normalize letting long term support people call
| themselves tech priests I'd very much appreciate it.
|
| "What were your duties at your last position?" "Performing
| the daily ministrations and singing the praise of the machine
| god."
| Traster wrote:
| I think this is a transient issue. When you're in growth mode
| you make a huge series of hacks to just keep things running and
| then when you leave.... well, it's a problem. But if the
| business is robust, and lives beyond you, what replaces your
| work is better documented, better tested, and maintainable.
|
| That's the dream. Obviously there are companies that sink
| between v1 and v2, but that's life.
|
| Fundamentally I think the cloud business _is_ robust, it's a
| fundamentally reasonable way of organising things (for enough
| people), which is why it attracts customers despite being
| arguably more expensive.
| fragmede wrote:
| You're right, but that's been true since the beginning of the
| tech boom (but isn't exclusive to tech) when no one works for a
| place for several decades. Companies weather this in different
| ways but attrition has always been around.
|
| What's causing people to believe that the latest round of
| attrition is any different?
| hkt wrote:
| I'd speculate that perhaps more senior people are moving,
| and/or a greater overall rate of attrition combined with much
| more complex technologies and organisations. In other words,
| it might be harder to become good at jobs now, and fewer
| people stick with them. Just a hunch but definitely seems to
| be where the incentives point with loyalty penalties and tech
| bloat.
| lyjackal wrote:
| noticed issues with SQS for a couple minutes. Errors from java
| sdk, `com.amazonaws.SdkClientException: Unable to execute HTTP
| request: sqs.us-east-1.amazonaws.com`
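Short-lived failures like the one above are commonly absorbed by retrying with exponential backoff and jitter. The following is a generic sketch, not the AWS SDK's own retry mechanism: `call_with_backoff`, its parameters, and the choice to retry only on `OSError` (which includes connection errors in Python) are all assumptions made for illustration.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.2,
                      max_delay=5.0, sleep=time.sleep):
    """Call `operation` until it succeeds or attempts run out.

    Retries only on OSError (network-level failures). The wait before
    each retry is a random duration up to an exponentially growing cap
    ("full jitter"). `sleep` is injectable so tests need not wait.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller see the last error.
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

With the defaults above the total wait is bounded by a few seconds, so a retry loop like this would smooth over brief blips but would not hide an outage lasting several minutes.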
| PeterBarrett wrote:
| SQS went down for us in us-east-1 and we lost health checks on
| instances there. Fully recovered now.
| asah wrote:
| us-east-1 again!
| amar0c wrote:
| My Aruba Instant ON APs are "offline" (orange) even though they
| work and I am online. My first thought is that some Cloud went
| into a nirvana state
| etaioinshrdlu wrote:
| Does AWS have a plan to improve this region?
|
| Do they acknowledge the problem?
|
| It's been a joke for years how bad us-east-1 is.
| consumer451 wrote:
| Nuke it from orbit
|
| It's the only way to be sure
| xilni wrote:
| This is why you are strongly urged not to rely on one region or
| AZ.
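As a rough illustration of why redundancy across AZs or regions is recommended, the availability of several redundant copies can be estimated as one minus the probability that all of them fail together. This is a back-of-the-envelope sketch that assumes failures are fully independent, which real deployments do not guarantee (shared control planes and global services can break that assumption). The function name is hypothetical.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments, each with
    availability `single`, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies
```

Under that independence assumption, two copies at 99% each come out at about 99.99%. Correlated failures pull the real figure down.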
| Johnny555 wrote:
| I would strongly urge not using us-east-1 -- of all the regions
| we're in, it's by far the most problematic. Use us-east-2 if
| you need good latency to the East Coast.
| tyingq wrote:
| Good advice, though AWS still has some services that don't work
| completely independently. Cloudfront, because of certificates.
| Route53. The control API for IAM (adding/removing roles, etc).
| And I wish they didn't have global-looking endpoints (like
| https://sts.amazonaws.com) that aren't really global or
| resilient.
| ranman wrote:
| STS will let you use regional endpoints now, right?
| tyingq wrote:
| Yes. It's just that the "global endpoint" is misleading.
| They don't repoint it if it fails. It really shouldn't
| exist given that's how it functions.
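For reference, regional STS endpoints follow the pattern `https://sts.<region>.amazonaws.com` in the standard commercial regions. The helper below is a small sketch with a hypothetical name. It only builds the URL and makes no network calls, and it does not cover other partitions such as the China regions, which use a different domain suffix.

```python
def regional_sts_endpoint(region: str) -> str:
    """Build the regional STS endpoint URL for a standard AWS region.

    Assumes the commercial partition (amazonaws.com). The validation is
    deliberately minimal and only rejects obviously malformed input.
    """
    if not region or any(ch in region for ch in "./: "):
        raise ValueError(f"not a plausible region name: {region!r}")
    return f"https://sts.{region}.amazonaws.com"


# With boto3 installed, the URL could be passed explicitly, for example:
#   boto3.client("sts", region_name="us-east-2",
#                endpoint_url=regional_sts_endpoint("us-east-2"))
```

Pinning a regional endpoint this way avoids depending on `https://sts.amazonaws.com`, which, as noted above, is not repointed on failure.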
| didip wrote:
| Multi AZ is great and should be by default, but multi Region is
| expensive.
| pid-1 wrote:
| Given the total amount of money I've lost due to a single AZ being
| down, it was totally worth it to NOT go multi az or multi
| region so far.
|
| Multi AZ isn't that hard, but generally requires extra costs
| (one nat gw per az, etc...)
|
| But multi region in AWS is a royal pain in the ass. Many
| services (like SSO) do not play well with multi region setups,
| making things really complicated even if you IaCed your whole
| stack.
| evrydayhustling wrote:
| Those costs are the actual reason you are encouraged to go
| multi-AZ!
|
| (I actually love that we have strategies and infrastructure
| for multi-region... it just tends to come up at scales and
| for applications where it is not justified.)
| systemvoltage wrote:
| Seems like it would be conflict of interest to increase
| robustness of single AZ (so it never goes down or has its own
| redundancy) vs. increased revenues from multi AZ deployment.
|
| What's the point of cloud if we have to manage robustness of
| their own infrastructure. I can understand if that's due to
| natural disasters and earthquakes, but the idea should be that
| a single AZ should never go down barring extraordinary
| circumstances. AWS should be auto-balancing, handling downtimes
| of a single AZ without the customer ever noticing it.
|
| It might not be a good analogy, but if a single Cloudflare edge
| datacenter goes down, it will automatically route traffic
| through others. Transparent and painless to the customer. I
| understand AWS is huge, and different services have different
| redundancy mechanisms, but just conceptually it feels like
| they're in a conflict of interest to increase robustness of
| their data centers - "We told you to have multi-AZ deployment,
| not our fault".
|
| Another way to put this: as an AWS customer, make sure to factor
| the 3x costs + management of a multi-AZ deployment into
| your total costs.
| thedougd wrote:
| They would simply charge for the privilege. An EC2 'always
| on' or whatever option that enabled your instance to live
| migrate between availability zones would be a nice and
| expensive option.
| systemvoltage wrote:
| Definitely. Then I wonder why we need the cloud :) if not
| for services (not EC2). Lot of mid-sized companies are
| re-evaluating:
| https://www.economist.com/business/2021/07/03/do-the-
| costs-o...
| m34 wrote:
| Might be true for running stuff in different regions/AZs but if
| the provisioning region is down (e.g. deploying lambda@edge)
| one does not really have an alternative
| easton wrote:
| From temuze last time:
|
| "If you're having SLA problems I feel bad for you son
|
| I got two 9 problems cuz of us-east-1"
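For anyone who wants the arithmetic behind the joke: "two nines" means 99% availability, which leaves 1% of the year as permitted downtime. A minimal sketch follows. The function name is hypothetical and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours(nines: int) -> float:
    """Hours of downtime per year permitted by an availability target
    written as a count of nines (2 -> 99%, 3 -> 99.9%, ...)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return HOURS_PER_YEAR * unavailability
```

Two nines works out to roughly 87.6 hours per year, while four nines leaves under an hour.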
| didip wrote:
| This is why us-east-1 is perfect for a chaos-testing, non-prod
| environment.
| 0xCAP wrote:
| Is us-east cursed or what?!
| csdvrx wrote:
| As usual?
| fotta wrote:
| Somehow AWS managed to make their new status page more opaque
| than the old one. It's like they want you to scroll through their
| gigantic list so they can fix the issue before you find the right
| line.
| operator1 wrote:
| What's up with all of the multi-platform outages lately? Seems
| abnormal looking at historical data. Are there issues affecting
| the internet backbone or something? Or just a coincidence?
| super_linear wrote:
| Absolutely no way to prove this but maybe Q1 deadlines coming
| up and people trying to launch things and make changes?
| frays wrote:
| Increase in attrition across the industry.
|
| A lot of institutional knowledge in these massive tech
| corporations is disappearing and we're starting to reach the
| tipping point.
| fragmede wrote:
| But there's always been attrition. What is different now that
| is affecting attrition rates and
| their effects?
| Spooky23 wrote:
| Things are bigger everywhere. My colleagues and I thought
| we were hot shit managing 5-7k applications and
| infrastructure. Amazon probably runs 20,000 orgs like mine.
|
| Also, times are good and rates are crazy. Even at VARs, you
| can make a lot of cash. I have a buddy who went from $150k
| to $600k. The guy paid off his mortgage and is at a point
| where he could burn out and work at Home Depot if he needed
| to.
| stone-monkey wrote:
| Probably increased salary and switch to permanent remote.
| Amazon is notorious for their frugality and they recently
| doubled their maximum salary cap to 350k. They would only
| have done this to stay competitive in the current job
| market. This implies that many of their existing employees
| are underpaid relative to their peers at comparable
| companies and they've likely seen a large uptick in
| attrition. Not to mention attrition begets more attrition,
| especially if it's "influential" employees who are leaving.
| 300bps wrote:
| Important to keep in mind that AWS has 250 services in 84
| Availability Zones in 26 regions.
|
| This outage is reportedly impacting 5 services in 1 region.
|
| For those impacted, pretty terrible. But as a heavy user of
| AWS, I've seen these notices posted multiple times on HN and
| haven't been impacted by one yet.
| quxbar wrote:
| For businesses with uptime guarantees and lots of boxes to
| spin up in failover scenario, this has been a very eventful
| 12 months. At least that's what I'm experiencing.
| xeromal wrote:
| Russian war is another juicy possibility
| adamrezich wrote:
| told myself I'd click this submission's comments link, CTRL+F
| `Russia`, & quit HN for the day if anything came up, thanks
| for not disappointing
| xeromal wrote:
| Haha, no problemo.
___________________________________________________________________
(page generated 2022-03-09 23:00 UTC)