hngopher.com

       [HN Gopher] AWS us-west-2 issues
       ___________________________________________________________________
        
       AWS us-west-2 issues
        
       AWS appears to be having lambda and API Gateway issues again in us-
       west-2. Symptoms on our end look similar to the August 24th partial
       outage.
        
       Author : kwindla
       Score  : 192 points
       Date   : 2022-09-28 17:03 UTC (5 hours ago)
        
       | Victerius wrote:
       | us-east-1's revenge.
        
       | ramesh31 wrote:
       | The moment I get a whiff of AWS issues, that's basically a wrap
       | on the day. I'm not going to spend my time wondering why
       | something isn't working today because _somewhere_ down the chain
       | it is undoubtedly a silently failing AWS service driving me mad.
       | The whole house of cards comes tumbling down and things stop
       | working that Amazon will swear up down left and right are 100%
       | healthy.
       | 
       | Sadly this has become a monthly occurrence at this point.
       | Monoculture, it turns out, is a really really bad idea.
        
       | IceWreck wrote:
       | Hey at least it isn't us-east-2 this time
        
       | adamkittelson wrote:
       | Coincidentally that's two months in a row of a us-west-2 outage
       | on the 4th Wednesday of the month. See you all back here on Oct
       | 26?
        
         | kwindla wrote:
         | :-)
         | 
         | August 24th was the first time we saw exactly this issue in 7
         | years of heavy, multi-region, AWS use. So we put in place the
         | ability to semi-automatically route around this more quickly,
         | but we didn't fixate on it. Two data points is a line, however.
         | (But maybe not yet a trend?)
        
         | qbasic_forever wrote:
         | Monthly certificate updates gone wrong perhaps? Oops
        
       | zonkd1234 wrote:
       | Lambda is returning 502 errors in US-West for me as well.
        
       | camhart wrote:
       | Same here
        
       | whalesalad wrote:
       | my least favorite region tbh
        
       | enapupe wrote:
       | Could any of you clarify to me why do we have "Edge" domain
       | rather than a specific region?
        
         | davidjfelix wrote:
         | Yeah, you can select "edge" for some resources (API Gateway and
         | Lambda are two that come to mind) which just means it's located
         | in all regions plus some additional "edge" infrastructure that
         | isn't available as a region. AWS puts some restrictions on edge
         | resources since there isn't enough capacity or full-region
         | functionality.
         | 
         | Usually you pick this for stuff that is CDN oriented and front
         | a regional service with that.
        
       | Aaronstotle wrote:
       | Now I know why cloud shell wasn't working around noon PDT
        
       | pneumatic1 wrote:
       | what happened to stop.lying.cloud?
        
       | [deleted]
        
       | willio58 wrote:
       | It's crazy to me how long it takes Amazon to notify users that
       | there is even an issue. I think it took 15 minutes for them to
       | acknowledge an issue, that's a lot of time for our services.
        
         | tyingq wrote:
         | I continue to be irritated that services that consistently
         | return errors are characterized as "increased error rates" or
         | "increased latency". They seem to use those phrases for any
         | kind of outage.
        
           | dylan604 wrote:
           | It's all networking issues as you're accessing them
           | remotely, so of course it's increased error rates or
           | increased latency. When a piece of gear stops responding the
           | network tries to self-heal as designed which causes those
           | issues.
        
             | tyingq wrote:
             | Well, yes, but, for example: Characterizing both a 5% error
             | rate and a 100% error rate as "increased error rates" is
             | not as helpful as it could be.
             | 
             | I'm reasonably sure they know when a service isn't
             | functional at all.
        
         | mbesto wrote:
         | Yup, we're trying to fix this here: https://www.awareops.com/
        
         | JshWright wrote:
         | It was 41 minutes from the first sign of trouble in our
         | monitoring (an SQS queue backing up because the lambda it
         | triggers wasn't getting invoked) to AWS posting the
         | "informational" status.
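
The signal described above (a queue backing up because its consumer is not being invoked) can be sketched as a simple trend check. This is a minimal illustration, not the commenter's actual monitoring: the function name `queue_is_backing_up`, the `min_samples` parameter, and the sample values are all hypothetical, and the depth readings are assumed to come from periodic polling of something like the SQS `ApproximateNumberOfMessages` queue attribute.

```python
def queue_is_backing_up(depths, min_samples=3):
    """Return True if the last `min_samples` depth readings strictly increase.

    `depths` is a list of queue-depth readings, oldest first. How the
    readings are collected is outside this sketch (assumed: periodic polling).
    """
    if len(depths) < min_samples:
        return False  # not enough data to call it a trend
    recent = depths[-min_samples:]
    # Every reading must be larger than the one before it.
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))


# Hypothetical readings: a healthy queue drains, a stuck consumer lets it grow.
healthy = [0, 2, 1, 0]
stuck = [0, 5, 40, 300]
print(queue_is_backing_up(healthy), queue_is_backing_up(stuck))
```

A strictly-increasing check is deliberately crude; a real alert would probably also want a minimum depth threshold so that small bursts do not page anyone.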
        
         | duluca wrote:
         | I don't think you should rely on Amazon to tell you anything.
        
         | socialismisok wrote:
         | That's because it's a political issue inside AWS. They have the
         | technology to report it automatically, but there's a strong
         | pressure not to post "green-i" or yellow or red, because those
         | things impact SLA payments.
         | 
         | So if there's any way they can spin it as _not_ an outage they
         | will try not to post it.
        
           | ogn3rd wrote:
           | Can confirm this, it's political. It's always great seeing
           | the "rationality" for not doing the right thing. After the 3
           | outages last December, I can remember a certain person in a
           | certain global outage slack channel laughing at customers who
           | weren't "resilient" enough to withstand the outage. Then AWS
           | focused a resiliency campaign to document all those customers'
           | risks, as if AWS's own poorly designed system wasn't the
           | cause for people's inability to fail over. Glad I'm out of
           | that place, it's super toxic these days.
        
             | thedougd wrote:
             | It would be nice if all services had a great story. The
             | guidance for how to make Cognito and SSO multi-region is
             | laughable.
        
           | jeffwask wrote:
           | This is my understanding as well. I have anecdotally heard
           | swapping some of those dots from green to red can require up
           | to CTO approval.
        
           | dilyevsky wrote:
           | I participated in multiple SLA penalty payment requests on
           | the customer side and never once did we even discuss status
           | page color
        
             | socialismisok wrote:
             | I participated in many status page color discussions and we
             | always discussed sla. _Shrug_
        
               | dilyevsky wrote:
               | You mean on the AWS/service provider side? I've also
               | participated in internal incident response on the service
               | provider side and we again made SLA refund decisions
               | based on actual customer impact not our status page. But
               | then again we diligently update our status pages so
               | there's that.
        
           | hatware wrote:
           | That's a conflict of interest and it's a shame the proper
           | governing bodies have not penalized Amazon for such shitty
           | behavior.
        
       | balls187 wrote:
       | Cloud9 is also down. Joy.
       | 
       | Edit to add: I can't seem to access my NHL Season Tickets via
       | Ticketmaster either...
        
         | makestuff wrote:
         | One would think all of those ridiculous ticketmaster fees would
         | make them be able to afford some sort of multi region/multi
         | cloud setup...
        
       | AaronM wrote:
       | [10:33 AM PDT] [10:33 AM PDT] We are investigating increased
       | error rates for invokes in the US-WEST-2 Region. We do not yet
       | have a root cause, but are investigating multiple potential root
       | causes in parallel. In addition, we are implementing filters on
       | inbound traffic from a set of sources with recent significant
       | traffic shifts, which may help mitigate the impact. We do not yet
       | have a solid ETA, but will continue to provide updates as we
       | progress.
       | 
       | [10:13 AM PDT] [10:10 AM PDT] We are investigating increased
       | error rates for invokes in the US-WEST-2 Region.
       | 
       | AWS AppSync AWS Batch AWS Certificate Manager AWS Cloud9 AWS
       | CloudShell AWS Device Farm AWS Global Accelerator AWS Greengrass
       | AWS IoT 1-Click AWS IoT Device Management AWS Lambda AWS Proton
       | AWS Resource Access Manager AWS RoboMaker AWS Service Catalog
       | Amazon AppStream 2.0 Amazon CloudWatch Amazon Connect Amazon
       | Elastic Compute Cloud Amazon Elastic Container Registry Amazon
       | Elastic MapReduce Amazon EventBridge Amazon FinSpace Amazon
       | Kendra Amazon Lightsail Amazon Location Service Amazon Managed
       | Workflows for Apache Airflow Amazon Nimble Studio Amazon Pinpoint
       | Amazon SageMaker Amazon WorkSpaces EC2 Image Builder
        
       | thallium205 wrote:
       | Here here!
        
         | [deleted]
        
       | kolanos wrote:
       | 3h 52m and counting
        
       | humbleharbinger wrote:
       | S3-megalodon on call checking in
        
       | fasteddie31003 wrote:
       | I'm working on an app that tracks downtimes. I put some of my
       | latest data up here:
       | https://app.awareops.com/whatisdown/orgs/a7f95108-ead0-4718-...
        
       | kwindla wrote:
       | https://health.aws.amazon.com/health/status is now showing the
       | issue.
       | 
       | Our first canaries fired for this at 9:43 PDT local time. The
       | status page updated at 10:13.
        
         | [deleted]
        
         | matthoiland wrote:
         | Our first canaries at 9:21 PDT
        
           | kwindla wrote:
           | That's really interesting. We didn't get any user reports
           | before our canaries fired, either. Now I have to think about
           | what might explain the difference between your systems and
           | ours. We're monitoring API Gateway health (more or less),
           | because that's what we care about in this part of our
           | infrastructure.
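
The gap between the canary times quoted above and the 10:13 status page update is simple clock arithmetic. A minimal sketch, assuming all times are same-day PDT as stated in the thread; the helper name `minutes_between` is made up for illustration.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes from `start` to `end`, both 'H:MM' strings on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


STATUS_PAGE_UPDATE = "10:13"  # time the status page updated, per the thread
first_canaries = {"kwindla": "9:43", "matthoiland": "9:21"}  # per the thread

# Minutes each team's canaries fired ahead of the status page.
lags = {who: minutes_between(t, STATUS_PAGE_UPDATE) for who, t in first_canaries.items()}
print(lags)  # {'kwindla': 30, 'matthoiland': 52}
```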
        
       | jvolkman wrote:
       | From the latest update:
       | 
       | > While we have seen improvements in error rates since 10:40 AM
       | PDT, recovery has stalled and we do not have a clear ETA on full
       | recovery. For customers that have dependencies on API Gateway and
       | are experiencing error rates, we do not have any mitigations to
       | recommend to address the issue on the customer side.
        
       | dylan604 wrote:
       | https://downdetector.com/status/aws-amazon-web-services/
       | 
       | My Roomba isn't working!!!
       | 
       | hahahaha!
        
         | 0xbadcafebee wrote:
         | You mean I have to pick up a broom and sweep WITH MY ARMS?!
         | This really is the darkest timeline.
        
         | chasd00 wrote:
         | i wonder if any smart locks are down, people being locked in or
         | out of their AirBNBs would be pretty funny.
        
           | dylan604 wrote:
           | of all things to connect to a smart home, the locks to my
           | house would be the absolute bottom of the list. i hate coming
           | home when the power is out and cannot open the garage door. i
           | couldn't imagine not being able to get in at all.
           | 
           | as far as not getting out, any lock unable to be unlocked
           | from inside seems like something should not be allowed to be
           | made. ever.
        
             | readams wrote:
             | I think consumer smart locks still let you use the regular
             | lock if the power is out.
        
             | giaour wrote:
             | > as far as not getting out, any lock unable to be unlocked
             | from inside seems like something should not be allowed to
             | be made. ever.
             | 
             | This would vary by jurisdiction, but locks without an
             | unlock lever are usually prohibited by the fire code.
        
               | sslayer wrote:
               | Yet they exist, with a whole industry built around them
        
               | [deleted]
        
               | chasd00 wrote:
               | interesting. my back door has a deadbolt requiring a key
               | on both sides. ..it's an old house.
        
               | jaywalk wrote:
               | Why have you left it like that? Yeah it's against fire
               | codes, but for a very good reason: it's incredibly
               | unsafe.
        
             | rootusrootus wrote:
             | The locks are purely additive in functionality, they lose
             | nothing compared to a manual lock. It still has a key for
             | manual unlocking, still has the twist knob on the inside.
             | 
             | But now I can see the status of the lock if I'm away from
             | home, and I can lock it remotely if necessary. I can give
             | out a keypad code to a house sitter. Or I can let someone
             | in real-time from remote.
             | 
             | All my 'smarthome' technology at home is this way. Nothing
             | requires the Internet to work, and if the server fails then
             | only the automation itself stops; all of the switches,
             | locks, and such just fail back into working like any old
             | school switch/lock/etc.
        
             | jcrawfordor wrote:
             | Just a few notes:
             | 
             | 1. I'm not aware of any smart lock that cannot be locked
             | and unlocked manually from the inside. This would violate
             | fire code for residential structures in a lot of US
             | jurisdictions.
             | 
             | 2. Electronic locks in general are either line-powered or
             | battery-powered. Line-powered locks are unusual in
             | residential environments because of the higher complexity
             | of installation (they're more often strikeplates than
             | actuators, although in-door actuators are available).
             | 
             | Battery-powered locks take one of two approaches to
             | resolving power issues: most commonly on residential locks,
             | there is still a key cylinder on the outside to manually
             | lock and unlock. Less commonly on residential locks but
             | more typical of commercial ones, there may be no key
             | cylinder but instead an external connector that allows the
             | programming tool (very common on commercial systems) or a
             | 9v battery (common on residential units) to be connected to
             | provide external power.
             | 
             | 3. Cloud-reliant smart locks are pretty rare for practical
             | reasons. Most are still fully functional (often minus
             | remote control via app, but not always) without internet
             | service. Even most commercial systems fall back to cached
             | credentials in the door controller when the connection to
             | the access server is lost, although annoyingly some of the
             | newer "smarter" systems don't.
        
               | AdamJacobMuller wrote:
               | I have the Nest/Yale lock without the key, it
               | definitively is not reliant on the Nest cloud being up
               | (or internet working) for access, only to remotely
               | lock/unlock or to program new codes. Do any locks
               | actually fail to work with an already programmed PIN code
               | if the internet is out? That seems like a massive
               | failure. Go to dinner and your router crashes and you
               | can't get back inside to fix it? Wow.
        
               | jcrawfordor wrote:
               | The only examples I know of are commercial systems, and
               | specifically "cutting edge" commercial systems that are
               | completely IP-based and cloud-managed. These are honestly
               | kind of a disaster and I hope they don't catch on; they
               | can be cheaper to install than conventional commercial
               | systems (with ACU cabinets) but they achieve that
               | cheapness by abandoning most of the reliability and
               | security features of conventional designs. That said some
               | of these get installed fail-open (e.g. loss of management
               | means they stay unlocked) for fire egress reasons.
        
             | oceanplexian wrote:
             | I have Z-Wave locks and have no problem having them as part
             | of my smart home. The 4x Lithium AA batteries in them last
             | over a year, they don't talk to a "cloud service", but
             | instead a physical server that I have total control over,
             | and you can still use an old-fashioned key to unlock them.
        
             | qbasic_forever wrote:
             | Every smart or electronic lock I've used just augments the
             | deadbolt and still has your physical key and its tumbler
             | lock as a backup. I have an electronic deadbolt and have
             | never gotten locked out even when its battery dies.
        
             | hatware wrote:
             | How long do you think batteries last?
        
               | dylan604 wrote:
               | you ask that like you are challenging my idea of not ever
               | using smart locks on my home. instead, you're bringing up
               | another reason to support that decision.
               | 
               | so, which direction were you attempting to move the
               | needle?
        
               | hatware wrote:
               | Batteries going out have never locked me out of my home.
               | 
               | Seems like you're putting way too much thought into
               | something that probably won't happen, being 2022 with
               | notifications and all. I don't justify my choices with
               | 0.1% chances.
        
               | takeda wrote:
               | You seem to put trust in technology to do important
               | tasks, when they have problems securing stupid light
               | bulbs.
               | 
               | There's a joke about it:
               | 
               | Tech Enthusiasts: Everything in my house is wired to the
               | Internet of Things! I control it all from my smartphone!
               | My smart-house is bluetooth enabled and I can give it
               | voice commands via alexa! I love the future!
               | 
               | Programmers / Engineers: The most recent piece of
               | technology I own is a printer from 2004 and I keep a
               | loaded gun ready to shoot it if it ever makes an
               | unexpected noise.
        
               | dylan604 wrote:
               | I don't have a gun for that, but a baseball bat instead.
               | That scene from Office Space made such an impression, and
               | I swear one day, I will recreate it on a piece of gear
               | that steps out of line.
        
               | [deleted]
        
             | disillusioned wrote:
             | My Smart Lock:
             | 
             | * Stores codes on the device itself and still unlocks with
             |   no internet connectivity
             | * Is a physical deadbolt inside that works without power
             | * If the lock itself runs out of battery power from the
             |   OUTSIDE, you can "jump" it with a 9-volt battery
             | * Allows me to auto-lock after X period of time, or at night
             | * Allows me to NEVER carry keys, ever. Or ever have to worry
             |   about keys.
             | * Allows me to manage multiple, time-boxed codes for people
             |   (housekeeper can't get in at midnight)
             | 
             | It's pretty damn great, honestly. And I stow a 9-volt in a
             | flower box in case of emergency. (You still need to know
             | the code, obviously.)
             | 
             | It's also absolutely pick-proof/bump-proof because it has
             | no key at all. Not even a backup key.
             | 
             | It's the Yale x Nest lock and is really really nice.
        
           | CobrastanJorji wrote:
           | One hopes the us-west data centers don't use smart locks.
           | Remember the Facebook incident a while back?
        
             | dylan604 wrote:
             | You'll find that they are using Nest instead of Ring locks
             | so that it turns out this is a Goog v AMZN adversarial
             | attack.
        
         | fortunateregard wrote:
         | On an unrelated note, I opened this url in a new tab, took a
         | quick look, then came back to read the comments.
         | 
         | A few minutes after I hear the fans on my pc start ramping up.
         | Sure enough, I open the system monitor and see chrome going
         | crazy on my CPU. In chrome, I open the task manager, then click
         | sort by CPU. The entry at the top of the list reads:
         | 
         | > subframe: facebook[dot]com
         | 
         | I get taken to the open downdetector.com tab after double
         | clicking the entry. After closing the tab everything goes back
         | to normal.
         | 
         | Does anyone know why or what downdetector/facebook would do
         | that requires 100% of my CPU's resources?
         | 
         | PS. I have ublock origin installed. My cpu is an i9-12900K.
        
           | j_walter wrote:
           | That is unrelated, but I've seen similar issues recently when
           | logging into Canvas (you know for school). Maxes out my CPU
           | and if I leave it long enough it crashes the tab due to
           | memory...it's not displaying anything special...
        
         | softwaredoug wrote:
         | I guess when Skynet becomes self aware we can just wait for an
         | AWS outage and then bulldoze the terminators into the landfill
        
           | dylan604 wrote:
           | We should start practicing with the Full Self Driving cars
           | getting OTA updates
        
       | restlake wrote:
       | Yep same symptoms for us as the August outage and support
       | confirmed an outage for us as well
        
         | jvolkman wrote:
         | Would be nice if they could update their status dashboard.
        
           | thallium205 wrote:
           | It is. It's just currently set at "Informational" severity.
        
             | sarcasticadmin wrote:
             | There was a decent delay in the alerting from AWS.
             | 
             | I was just investigating this degradation in service for
             | some of my systems for about 1 hour before seeing this
             | alert raised.
        
       | austinpena wrote:
       | Seeing API gateway issues here:
       | https://www.taloflow.ai/is-aws-down/us-west-2
        
       | hazrmard wrote:
       | My job application at amazon.jobs today got interrupted by this,
       | I think. I hope they don't get a malformed application :/
        
         | neuralspark wrote:
         | Maybe it's trying to save you
        
           | shahbaby wrote:
           | This is divine intervention.
        
         | mrweasel wrote:
         | Maybe you broke it?
        
           | dylan604 wrote:
           | That resume that was submitted contained a known PDF-delivered
           | attack. Guess their AV subscription ran out on that server?!
        
         | hbn wrote:
         | This is part of your test. You're failing!
        
       | enapupe wrote:
       | People, I've just started setting up API GW and deploying my code
       | to another region. Rest assured the outage will normalize itself
       | before I finish migrating. ETA: 15min-ish
        
         | enapupe wrote:
         | Did I say 15 minutes? Actually, _it depends_.
        
           | dylan604 wrote:
           | The first rule of publishing ETAs for updates is to triple
           | the estimate. Be the hero that finishes before the published
           | time vs finishing after! The second rule of ETAs is don't
           | specify a time.
           | 
           | I'm still amazed at the number of cults that place a specific
           | date on the return of the savior, and even more by the people
           | that go along with the rescheduling. The fact that I'm still
           | waiting in Dallas for JFK's return is irritating. /s
        
       | systemvoltage wrote:
       | Honest question and frankly scared to ask because it sounds
       | stupid: if you have like 30 mins of downtime on AWS every year
       | and spend 3x cost on managing _their_ infrastructure risk by
       | deploying it on multi-AZ and multi-region (and thereby AWS
       | pushing reliability management back to the customer); is the
       | value proposition of cloud just some dude to install a disk if it
       | goes out on a rack in your office? Maybe there is a reverse
       | incentive for AWS to keep their AZs slightly unreliable so that
       | customers spend 3x or 9x or what have you to make sure nothing
       | ever goes down.
       | 
       | Like what's wrong with on-prem? Lack of diesel generators? We
       | could just have that without AWS. Bare metal datacenter. Counter
       | to most opinions, I think managing a server isn't that difficult.
       | I am sort of a semi-professional and prosumer that has no trouble
       | managing servers for years on end with less downtime than a whole
       | fricking datacenter.
       | 
       | There is more serious discussion and new revelations around this
       | [1], [2]. Sometimes it is hard to ask questions about layers of
       | abstractions that have built up and no one dares to think about
       | getting rid of them.
       | 
       | [1] https://www.economist.com/business/2021/07/03/do-the-
       | costs-o...
       | 
       | [2] https://oxide.computer/
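
For scale, the "30 mins of downtime every year" figure in the question above works out to roughly 99.994% availability. A minimal sketch of that arithmetic, assuming a 365-day year; the function names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def availability(downtime_minutes):
    """Fraction of the year a service is up, given total minutes of downtime."""
    return 1 - downtime_minutes / MINUTES_PER_YEAR


def allowed_downtime_minutes(target):
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - target) * MINUTES_PER_YEAR


print(f"{availability(30):.4%}")  # 99.9943%
print(f"{allowed_downtime_minutes(0.9999):.1f} minutes per year at 99.99%")
```

By this measure, 30 minutes a year is already inside a "four nines" budget of about 52.6 minutes, which is the comparison the question implicitly turns on.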
        
         | rrdharan wrote:
         | Security, Compliance, and Geographic footprint.
         | 
         | If you're a large multinational, you basically face the same
         | threats as Google/AWS/MSFT but there's no way you can hire,
         | train and keep as good a production security team as them (well
         | maybe better than Azure, but I digress).
         | 
         | You can't afford the upfront contractual / capital costs to
         | maintain datacenters in every region.
         | 
         | And finally you can't afford the armies of lawyers and
         | compliance engineering teams to try and reason about your data
         | residency and things like GDPR and CCPA.
         | 
         | In other words, you're mostly paying for production security /
         | privacy incident response, compliance (lawyers) and
         | datacenters.
        
         | mannyv wrote:
         | Having managed a small data center in the past and having seen
         | what it takes to manage multiple enterprise-scale data centers,
         | the answer is "no, AWS does it better."
         | 
         | The company that I'm in right now has two engineers (including
         | myself) who are building and maintaining a product that serves
         | millions of streams a week. There's no fucking way we could
         | have done this ourselves. One F5 would cost more than our
         | entire total AWS bill for two years - and we'd have to have at
         | least 4 F5s if we wanted to try to match AWS. Plus the media
         | encoders would cost a fortune.
         | 
         | For some things it's fine to head over to lowendbox.com and
         | pick up a cheap VPS hosting package. We could theoretically
         | build our stack on top of a bunch of VPSs, sync everything with
         | rsync, etc. But then we'd be spending time building
         | infrastructure (which is pretty much valueless) instead of our
         | product.
        
           | [deleted]
        
       | ggregoire wrote:
       | Bunch of
       | 
       | Database error: [Amazon](500310) Invalid operation: No response
       | body.
       | 
       | Database error: [Amazon](500310) Invalid operation: curlCode: 28,
       | Timeout was reached
       | 
       | in our logs during the last hour
        
       | jmartens wrote:
       | Must be a specific AZ, not seeing any issues here
       | 
       | EDIT: I see now that it's an API Gateway issue, so that's why my
       | team isn't impacted.
        
       | sdfdsfsd wrote:
        
         | CacheRules wrote:
         | Success!
        
       | atomon wrote:
       | We're having issues as well
        
       | mchusma wrote:
       | I'm seeing multiple services as down: ECS, Medialive, etc.
        
         | ksimukka wrote:
         | Oh no!
        
       | biermic wrote:
       | Redshift COPY from s3 is also affected.
       | -----------------------------------------------
       |   error:  No response body.
       |   code:      30000
       |   context:
       |   query:     1154965
       |   location:  xen_aws_credentials_mgr.cpp:403
       |   process:   padbmaster [pid=15893]
       | -----------------------------------------------
        
       ___________________________________________________________________
       (page generated 2022-09-28 23:02 UTC)