[HN Gopher] AWS us-west-2 issues
___________________________________________________________________
AWS us-west-2 issues
AWS appears to be having Lambda and API Gateway issues again in
us-west-2. Symptoms on our end look similar to the August 24th partial
outage.
Author : kwindla
Score : 192 points
Date : 2022-09-28 17:03 UTC (5 hours ago)
| Victerius wrote:
| us-east-1's revenge.
| ramesh31 wrote:
| The moment I get a whiff of AWS issues, that's basically a wrap
| on the day. I'm not going to spend my time wondering why
| something isn't working today because _somewhere_ down the chain
| it is undoubtedly a silently failing AWS service driving me mad.
| The whole house of cards comes tumbling down and things stop
| working that Amazon will swear up down left and right are 100%
| healthy.
|
| Sadly this has become a monthly occurrence at this point.
| Monoculture, it turns out, is a really really bad idea.
| IceWreck wrote:
| Hey, at least it isn't us-east-2 this time
| adamkittelson wrote:
| Coincidentally that's two months in a row of a us-west-2 outage
| on the 4th Wednesday of the month. See you all back here on Oct
| 26?
| kwindla wrote:
| :-)
|
| August 24th was the first time we saw exactly this issue in 7
| years of heavy, multi-region, AWS use. So we put in place the
| ability to semi-automatically route around this more quickly,
| but we didn't fixate on it. Two data points is a line, however.
| (But maybe not yet a trend?)
| qbasic_forever wrote:
| Monthly certificate updates gone wrong perhaps? Oops
| zonkd1234 wrote:
| Lambda is returning 502 errors in US-West for me as well.
| camhart wrote:
| Same here
| whalesalad wrote:
| my least favorite region tbh
| enapupe wrote:
| Could any of you clarify to me why do we have "Edge" domain
| rather than a specific region?
| davidjfelix wrote:
| Yeah, you can select "edge" for some resources (API Gateway and
| Lambda are two that come to mind) which just means it's located
| in all regions plus some additional "edge" infrastructure that
| isn't available as a region. AWS puts some restrictions on edge
| resources since there isn't enough capacity or full-region
| functionality.
|
| Usually you pick this for stuff that is CDN oriented and front
| a regional service with that.
| Aaronstotle wrote:
| Now I know why cloud shell wasn't working around noon PDT
| pneumatic1 wrote:
| what happened to stop.lying.cloud?
| [deleted]
| willio58 wrote:
| It's crazy to me how long it takes Amazon to notify users that
| there is even an issue. I think it took 15 minutes for them to
| acknowledge an issue; that's a lot of time for our services.
| tyingq wrote:
| I continue to be irritated that services that consistently
| return errors are characterized as "increased error rates" or
| "increased latency". They seem to use those phrases for any
| kind of outage.
| dylan604 wrote:
| It's all networking issues as you're accessing them
| remotely, so of course it's increased error rates or
| increased latency. When a piece of gear stops responding the
| network tries to self-heal as designed which causes those
| issues.
| tyingq wrote:
| Well, yes, but, for example: Characterizing both a 5% error
| rate and a 100% error rate as "increased error rates" is
| not as helpful as it could be.
|
| I'm reasonably sure they know when a service isn't
| functional at all.
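
To make the 5% versus 100% distinction above concrete, here is a minimal sketch of a status label that reports the measured failure rate instead of a blanket "increased error rates". The function name and the 1% and 99% thresholds are illustrative assumptions, not anything AWS uses.

```python
def describe_error_rate(failed: int, total: int) -> str:
    """Return a status label that separates partial degradation from an outage.

    The 1% and 99% thresholds are illustrative assumptions, not AWS policy.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")

    rate = failed / total
    if rate >= 0.99:
        return f"outage ({rate:.0%} of requests failing)"
    if rate >= 0.01:
        return f"degraded ({rate:.0%} of requests failing)"
    return "operational"


if __name__ == "__main__":
    print(describe_error_rate(5, 100))
    print(describe_error_rate(100, 100))
```

Publishing the rate alongside the label lets a reader tell a blip from a service that is not functioning at all.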
| mbesto wrote:
| Yup, we're trying to fix this here: https://www.awareops.com/
| JshWright wrote:
| It was 41 minutes from the first sign of trouble in our
| monitoring (an SQS queue backing up because the lambda it
| triggers wasn't getting invoked) to AWS posting the
| "informational" status.
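
The kind of signal described above (a queue that keeps growing because its consumer is not being invoked) can be sketched as a simple check over recent queue-depth samples. This is a hedged illustration only: the function name, the sample source, and the five-sample window are assumptions, and a real setup would read the depth from whatever queue metric is already being collected.

```python
from typing import Sequence


def queue_is_backing_up(depths: Sequence[int], window: int = 5) -> bool:
    """Return True when the last `window` depth samples strictly increase.

    `depths` is a hypothetical series of queue-depth readings, oldest first.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    recent = list(depths[-window:])
    if len(recent) < window:
        return False  # not enough data to call it a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    print(queue_is_backing_up([3, 10, 40, 90, 200]))  # steadily growing
    print(queue_is_backing_up([5, 2, 6, 1, 4]))       # normal churn
```

Requiring several consecutive increases is a deliberate trade: it delays the alert by a few samples but avoids firing on ordinary bursts.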
| duluca wrote:
| I don't think you should rely on Amazon to tell you anything.
| socialismisok wrote:
| That's because it's a political issue inside AWS. They have the
| technology to report it automatically, but there's a strong
| pressure not to post "green-i" or yellow or red, because those
| things impact SLA payments.
|
| So if there's any way they can spin it as _not_ an outage they
| will try not to post it.
| ogn3rd wrote:
| Can confirm this, it's political. It's always great seeing
| the "rationality" for not doing the right thing. After the 3
| outages last December, I can remember a certain person in a
| certain global outage slack channel laughing at customers who
| weren't "resilient" enough to withstand the outage. Then AWS
| focused a resiliency campaign to document all those customers'
| risks, as if AWS's own poorly designed system wasn't the
| cause for people's inability to fail over. Glad I'm out of
| that place, it's super toxic these days.
| thedougd wrote:
| It would be nice if all services had a great story. The
| guidance for how to make Cognito and SSO multi-region is
| laughable.
| jeffwask wrote:
| This is my understanding as well. I have anecdotally heard
| swapping some of those dots from green to red can require up
| to CTO approval.
| dilyevsky wrote:
| I participated in multiple SLA penalty payment requests on
| the customer side and never once did we even discuss status
| page color
| socialismisok wrote:
| I participated in many status page color discussions and we
| always discussed SLA. _Shrug_
| dilyevsky wrote:
| You mean on the AWS/service provider side? I've also
| participated in internal incident response on the service
| provider side and we again made SLA refund decisions
| based on actual customer impact not our status page. But
| then again we diligently update our status pages so
| there's that.
| hatware wrote:
| That's a conflict of interest and it's a shame the proper
| governing bodies have not penalized Amazon for such shitty
| behavior.
| balls187 wrote:
| Cloud9 is also down. Joy.
|
| Edit to add: I can't seem to access my NHL Season Tickets via
| Ticketmaster either...
| makestuff wrote:
| One would think all of those ridiculous ticketmaster fees would
| make them be able to afford some sort of multi region/multi
| cloud setup...
| AaronM wrote:
| [10:33 AM PDT] [10:33 AM PDT] We are investigating increased
| error rates for invokes in the US-WEST-2 Region. We do not yet
| have a root cause, but are investigating multiple potential root
| causes in parallel. In addition, we are implementing filters on
| inbound traffic from a set of sources with recent significant
| traffic shifts, which may help mitigate the impact. We do not yet
| have a solid ETA, but will continue to provide updates as we
| progress.
|
| [10:13 AM PDT] [10:10 AM PDT] We are investigating increased
| error rates for invokes in the US-WEST-2 Region.
|
| AWS AppSync, AWS Batch, AWS Certificate Manager, AWS Cloud9, AWS
| CloudShell, AWS Device Farm, AWS Global Accelerator, AWS Greengrass,
| AWS IoT 1-Click, AWS IoT Device Management, AWS Lambda, AWS Proton,
| AWS Resource Access Manager, AWS RoboMaker, AWS Service Catalog,
| Amazon AppStream 2.0, Amazon CloudWatch, Amazon Connect, Amazon
| Elastic Compute Cloud, Amazon Elastic Container Registry, Amazon
| Elastic MapReduce, Amazon EventBridge, Amazon FinSpace, Amazon
| Kendra, Amazon Lightsail, Amazon Location Service, Amazon Managed
| Workflows for Apache Airflow, Amazon Nimble Studio, Amazon Pinpoint,
| Amazon SageMaker, Amazon WorkSpaces, EC2 Image Builder
| thallium205 wrote:
| Hear, hear!
| [deleted]
| kolanos wrote:
| 3h 52m and counting
| humbleharbinger wrote:
| S3-megalodon on call checking in
| fasteddie31003 wrote:
| I'm working on an app that tracks downtimes. I put some of my
| latest data up here:
| https://app.awareops.com/whatisdown/orgs/a7f95108-ead0-4718-...
| kwindla wrote:
| https://health.aws.amazon.com/health/status is now showing the
| issue.
|
| Our first canaries fired for this at 9:43 PDT local time. The
| status page updated at 10:13.
| [deleted]
| matthoiland wrote:
| Our first canaries at 9:21 PDT
| kwindla wrote:
| That's really interesting. We didn't get any user reports
| before our canaries fired, either. Now I have to think about
| what might explain the difference between your systems and
| ours. We're monitoring API Gateway health (more or less),
| because that's what we care about in this part of our
| infrastructure.
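
The gap between the canary times quoted above and the 10:13 status page update is simple arithmetic; a small sketch, assuming all times are same-day PDT in 24-hour `H:MM` form (the helper name is made up for illustration):

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both same-day "H:MM" strings."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # Canary at 9:43 vs. status page update at 10:13.
    print(minutes_between("9:43", "10:13"))
    # Canary at 9:21 vs. the same status page update.
    print(minutes_between("9:21", "10:13"))
```

With the times from this thread, that works out to 30 and 52 minutes of lead over the status page.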
| jvolkman wrote:
| From the latest update:
|
| > While we have seen improvements in error rates since 10:40 AM
| PDT, recovery has stalled and we do not have a clear ETA on full
| recovery. For customers that have dependencies on API Gateway and
| are experiencing error rates, we do not have any mitigations to
| recommend to address the issue on the customer side.
| dylan604 wrote:
| https://downdetector.com/status/aws-amazon-web-services/
|
| My Roomba isn't working!!!
|
| hahahaha!
| 0xbadcafebee wrote:
| You mean I have to pick up a broom and sweep WITH MY ARMS?!
| This really is the darkest timeline.
| chasd00 wrote:
| i wonder if any smart locks are down, people being locked in or
| out of their AirBNBs would be pretty funny.
| dylan604 wrote:
| of all things to connect to a smart home, the locks to my
| house would be the absolute bottom of the list. i hate coming
| home when the power is out and cannot open the garage door. i
| couldn't imagine not being able to get in at all.
|
| as far as not getting out, any lock unable to be unlocked
| from inside seems like something should not be allowed to be
| made. ever.
| readams wrote:
| I think consumer smart locks still let you use the regular
| lock if the power is out.
| giaour wrote:
| > as far as not getting out, any lock unable to be unlocked
| from inside seems like something should not be allowed to
| be made. ever.
|
| This would vary by jurisdiction, but locks without an
| unlock lever are usually prohibited by the fire code.
| sslayer wrote:
| Yet they exist, with a whole industry built around them
| [deleted]
| chasd00 wrote:
| interesting. my back door has a deadbolt requiring a key
| on both sides. ..it's an old house.
| jaywalk wrote:
| Why have you left it like that? Yeah it's against fire
| codes, but for a very good reason: it's incredibly
| unsafe.
| rootusrootus wrote:
| The locks are purely additive in functionality, they lose
| nothing compared to a manual lock. It still has a key for
| manual unlocking, still has the twist knob on the inside.
|
| But now I can see the status of the lock if I'm away from
| home, and I can lock it remotely if necessary. I can give
| out a keypad code to a house sitter. Or I can let someone
| in real-time from remote.
|
| All my 'smarthome' technology at home is this way. Nothing
| requires the Internet to work, and if the server fails then
| only the automation itself stops; all of the switches,
| locks, and such just fail back into working like any old
| school switch/lock/etc.
| jcrawfordor wrote:
| Just a few notes:
|
| 1. I'm not aware of any smart lock that cannot be locked
| and unlocked manually from the inside. This would violate
| fire code for residential structures in a lot of US
| jurisdictions.
|
| 2. Electronic locks in general are either line-powered or
| battery-powered. Line-powered locks are unusual in
| residential environments because of the higher complexity
| of installation (they're more often strikeplates than
| actuators, although in-door actuators are available).
|
| Battery-powered locks take one of two approaches to
| resolving power issues: most commonly on residential locks,
| there is still a key cylinder on the outside to manually
| lock and unlock. Less commonly on residential locks but
| more typical of commercial ones, there may be no key
| cylinder but instead an external connector that allows the
| programming tool (very common on commercial systems) or a
| 9v battery (common on residential units) to be connected to
| provide external power.
|
| 3. Cloud-reliant smart locks are pretty rare for practical
| reasons. Most are still fully functional (often minus
| remote control via app, but not always) without internet
| service. Even most commercial systems fall back to cached
| credentials in the door controller when the connection to
| the access server is lost, although annoyingly some of the
| newer "smarter" systems don't.
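
The cached-credential fallback in point 3 can be sketched roughly as follows. This is a toy model, not any vendor's design: `DoorController`, `check_remote`, and the use of `ConnectionError` to signal a lost link are all assumptions made for illustration.

```python
from typing import Callable, Dict


class DoorController:
    """Toy door controller that falls back to cached decisions when offline."""

    def __init__(self, check_remote: Callable[[str], bool]) -> None:
        # check_remote is a hypothetical call to the access server; it is
        # assumed to raise ConnectionError when the server is unreachable.
        self._check_remote = check_remote
        self._cache: Dict[str, bool] = {}

    def is_allowed(self, credential: str) -> bool:
        try:
            allowed = self._check_remote(credential)
        except ConnectionError:
            # Server unreachable: use the last known answer, deny if unknown.
            return self._cache.get(credential, False)
        self._cache[credential] = allowed
        return allowed


if __name__ == "__main__":
    online = {"up": True}

    def fake_server(credential: str) -> bool:
        if not online["up"]:
            raise ConnectionError("access server unreachable")
        return credential == "1234"

    door = DoorController(fake_server)
    print(door.is_allowed("1234"))  # checked against the server, then cached
    online["up"] = False
    print(door.is_allowed("1234"))  # served from the cache
    print(door.is_allowed("9999"))  # never seen while online, so denied
```

Denying unknown credentials while offline is one possible policy; the fail-open installs mentioned elsewhere in the thread make the opposite choice for egress.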
| AdamJacobMuller wrote:
| I have the Nest/Yale lock without the key, it
| definitively is not reliant on the Nest cloud being up
| (or internet working) for access, only to remotely
| lock/unlock or to program new codes. Do any locks
| actually fail to work with an already programmed PIN code
| if the internet is out? That seems like a massive
| failure. Go to dinner and your router crashes and you
| can't get back inside to fix it? Wow.
| jcrawfordor wrote:
| The only examples I know of are commercial systems, and
| specifically "cutting edge" commercial systems that are
| completely IP-based and cloud-managed. These are honestly
| kind of a disaster and I hope they don't catch on; they
| can be cheaper to install than conventional commercial
| systems (with ACU cabinets) but they achieve that
| cheapness by abandoning most of the reliability and
| security features of conventional designs. That said some
| of these get installed fail-open (e.g. loss of management
| means they stay unlocked) for fire egress reasons.
| oceanplexian wrote:
| I have Z-Wave locks and have no problem having them as part
| of my smart home. The 4x Lithium AA batteries in them last
| over a year, they don't talk to a "cloud service", but
| instead a physical server that I have total control over,
| and you can still use an old-fashioned key to unlock them.
| qbasic_forever wrote:
| Every smart or electronic lock I've used just augments the
| deadbolt and still has your physical key and its tumbler
| lock as a backup. I have an electronic deadbolt and have
| never gotten locked out even when its battery dies.
| hatware wrote:
| How long do you think batteries last?
| dylan604 wrote:
| you ask that like you are challenging my idea of not ever
| using smart locks on my home. instead, you're bringing up
| another reason to support that decision.
|
| so, which direction were you attempting to move the
| needle?
| hatware wrote:
| Batteries going out have never locked me out of my home.
|
| Seems like you're putting way too much thought into
| something that probably won't happen, being 2022 with
| notifications and all. I don't justify my choices with
| 0.1% chances.
| takeda wrote:
| You seem to put trust in technology to do important
| tasks, when they have problems securing stupid light
| bulbs.
|
| There's a joke about it:
|
| Tech Enthusiasts: Everything in my house is wired to the
| Internet of Things! I control it all from my smartphone!
| My smart-house is bluetooth enabled and I can give it
| voice commands via alexa! I love the future!
|
| Programmers / Engineers: The most recent piece of
| technology I own is a printer from 2004 and I keep a
| loaded gun ready to shoot it if it ever makes an
| unexpected noise.
| dylan604 wrote:
| I don't have a gun for that, but a baseball bat instead.
| That scene from Office Space made such an impression, and
| I swear one day, I will recreate it on a piece of gear
| that steps out of line.
| [deleted]
| disillusioned wrote:
| My Smart Lock:
|
| * Stores codes on the device itself and still unlocks with no
|   internet connectivity
| * Is a physical deadbolt inside that works without power
| * If the lock itself runs out of battery power from the OUTSIDE,
|   you can "jump" it with a 9-volt battery
| * Allows me to auto-lock after X period of time, or at night
| * Allows me to NEVER carry keys, ever. Or ever have to worry about
|   keys.
| * Allows me to manage multiple, time-boxed codes for people
|   (housekeeper can't get in at midnight)
|
| It's pretty damn great, honestly. And I stow a 9-volt in a
| flower box in case of emergency. (You still need to know
| the code, obviously.)
|
| It's also absolutely pick-proof/bump-proof because it has
| no key at all. Not even a backup key.
|
| It's the Yale x Nest lock and is really really nice.
| CobrastanJorji wrote:
| One hopes the us-west data centers don't use smart locks.
| Remember the Facebook incident a while back?
| dylan604 wrote:
| You'll find that they are using Nest instead of Ring locks
| so that it turns out this is a Goog v AMZN adversarial
| attack.
| fortunateregard wrote:
| On an unrelated note, I opened this url in a new tab, took a
| quick look, then came back to read the comments.
|
| A few minutes later I hear the fans on my pc start ramping up.
| Sure enough, I open the system monitor and see chrome going
| crazy on my CPU. In chrome, I open the task manager, then click
| sort by CPU. The entry at the top of the list reads:
|
| > subframe: facebook[dot]com
|
| I get taken to the open downdetector.com tab after double
| clicking the entry. After closing the tab everything goes back
| to normal.
|
| Does anyone know why or what downdetector/facebook would do
| that requires 100% of my CPU's resources?
|
| PS. I have ublock origin installed. My cpu is an i9-12900K.
| j_walter wrote:
| That is unrelated, but I've seen similar issues recently when
| logging into Canvas (you know for school). Maxes out my CPU
| and if I leave it long enough it crashes the tab due to
| memory...it's not displaying anything special...
| softwaredoug wrote:
| I guess when Skynet becomes self aware we can just wait for an
| AWS outage and then bulldoze the terminators into the landfill
| dylan604 wrote:
| We should start practicing with the Full Self Driving cars
| getting OTA updates
| restlake wrote:
| Yep same symptoms for us as the August outage and support
| confirmed an outage for us as well
| jvolkman wrote:
| Would be nice if they could update their status dashboard.
| thallium205 wrote:
| It is. It's just currently set at "Informational" severity.
| sarcasticadmin wrote:
| There was a decent delay in the alerting from AWS.
|
| I was just investigating this degradation in service for
| some of my systems for about 1 hour before seeing this
| alert raised.
| austinpena wrote:
| Seeing API gateway issues here:
| https://www.taloflow.ai/is-aws-down/us-west-2
| hazrmard wrote:
| My job application at amazon.jobs today got interrupted by this,
| I think. I hope they don't get a malformed application :/
| neuralspark wrote:
| Maybe it's trying to save you
| shahbaby wrote:
| This is divine intervention.
| mrweasel wrote:
| Maybe you broke it?
| dylan604 wrote:
| That resume that was submitted contained a known PDF-delivered
| attack. Guess their AV subscription ran out on that server?!
| hbn wrote:
| This is part of your test. You're failing!
| enapupe wrote:
| People, I've just started setting up API GW and deploying my code
| to another region. Rest assured the outage will normalize itself
| before I finish migrating. ETA: 15min-ish
| enapupe wrote:
| Did I say 15 minutes? Actually, _it depends_.
| dylan604 wrote:
| The first rule of publishing ETAs for updates is to triple
| the estimate. Be the hero that finishes before the published
| time vs finishing after! The second rule of ETAs is don't
| specify a time.
|
| I'm still amazed at the number of cults that place a specific
| date on the return of the savior, and even more by the people
| that go along with the rescheduling. The fact that I'm still
| waiting in Dallas for JFK's return is irritating. /s
| systemvoltage wrote:
| Honest question and frankly scared to ask because it sounds
| stupid: if you have like 30 mins of downtime on AWS every year
| and spend 3x cost on managing _their_ infrastructure risk by
| deploying it on multi-AZ and multi-region (and thereby AWS
| pushing reliability management back to the customer); is the
| value proposition of cloud just some dude to install a disk if it
| goes out on a rack in your office? Maybe there is a reverse
| incentive for AWS to keep their AZs slightly unreliable so that
| customers spend 3x or 9x or what have you to make sure nothing
| ever goes down.
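
For scale, the "30 mins of downtime every year" figure above can be turned into an availability percentage. A minimal sketch, assuming a 365-day year; the function name is made up for illustration:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def availability_percent(downtime_minutes: float) -> float:
    """Yearly availability, in percent, for a given number of minutes down."""
    if not 0 <= downtime_minutes <= MINUTES_PER_YEAR:
        raise ValueError("downtime must fit within one year")
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    print(f"{availability_percent(30):.4f}%")
```

Thirty minutes a year comes out a little above 99.99%, which is the baseline any multi-AZ or multi-region spend would have to improve on.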
|
| Like what's wrong with on-prem? Lack of diesel generators? We
| could just have that without AWS. Bare metal datacenter. Counter
| to most opinions, I think managing a server isn't that difficult.
| I am sort of a semi-professional and prosumer that has no trouble
| managing servers for years on end with less downtime than a whole
| fricking datacenter.
|
| There is more serious discussion and new revelations around this
| [1], [2]. Sometimes it is hard to ask questions about layers of
| abstractions that have built up and no one dares to think about
| getting rid of them.
|
| [1] https://www.economist.com/business/2021/07/03/do-the-
| costs-o...
|
| [2] https://oxide.computer/
| rrdharan wrote:
| Security, Compliance, and Geographic footprint.
|
| If you're a large multinational, you basically face the same
| threats as Google/AWS/MSFT but there's no way you can hire,
| train and keep as good a production security team as them (well
| maybe better than Azure, but I digress).
|
| You can't afford the upfront contractual / capital costs to
| maintain datacenters in every region.
|
| And finally you can't afford the armies of lawyers and
| compliance engineering teams to try and reason about your data
| residency and things like GDPR and CCPA.
|
| In other words, you're mostly paying for production security /
| privacy incident response, compliance (lawyers) and
| datacenters.
| mannyv wrote:
| Having managed a small data center in the past and having seen
| what it takes to manage multiple enterprise-scale data centers,
| the answer is "no, AWS does it better."
|
| The company that I'm in right now has two engineers (including
| myself) who are building and maintaining a product that serves
| millions of streams a week. There's no fucking way we could
| have done this ourselves. One F5 would cost more than our
| entire total AWS bill for two years - and we'd have to have at
| least 4 F5s if we wanted to try to match AWS. Plus the media
| encoders would cost a fortune.
|
| For some things it's fine to head over to lowendbox.com and
| pick up a cheap VPS hosting package. We could theoretically
| build our stack on top of a bunch of VPSs, sync everything with
| rsync, etc. But then we'd be spending time building
| infrastructure (which is pretty much valueless) instead of our
| product.
| [deleted]
| ggregoire wrote:
| Bunch of
|
| Database error: [Amazon](500310) Invalid operation: No response
| body.
|
| Database error: [Amazon](500310) Invalid operation: curlCode: 28,
| Timeout was reached
|
| in our logs during the last hour
| jmartens wrote:
| Must be a specific AZ, not seeing any issues here
|
| EDIT: I see now that it's an API Gateway issue, so that's why my
| team isn't impacted.
| sdfdsfsd wrote:
| CacheRules wrote:
| Success!
| atomon wrote:
| We're having issues as well
| mchusma wrote:
| I'm seeing multiple services as down: ECS, Medialive, etc.
| ksimukka wrote:
| Oh no!
| biermic wrote:
| Redshift COPY from s3 is also affected.
| -----------------------------------------------
| error: No response body.
| code: 30000
| context:
| query: 1154965
| location: xen_aws_credentials_mgr.cpp:403
| process: padbmaster [pid=15893]
| -----------------------------------------------
___________________________________________________________________
(page generated 2022-09-28 23:02 UTC)