[HN Gopher] Tell HN: AWS connectivity issues, but health dashboa...
       ___________________________________________________________________
        
       Tell HN: AWS connectivity issues, but health dashboard says
       everything fine
        
       About 15 minutes ago I got a call from a customer that our site was
       down. "Works for me," I said, because I could bring it up on my
       laptop. The customer said he couldn't get it on his phone, and then
       I confirmed I couldn't get it on my phone either.  The AWS Health
       Dashboard (https://health.aws.amazon.com/health/status) reports no
       issues at all. But DownDetector
       (https://downdetector.com/status/aws-amazon-web-services/) shows a
       spike in reports.  I can't even reach the AWS console through my
       phone.  So, AWS has connectivity issues to certain networks and
       their own health dashboard is lying to us about it. What gives?
       (All of this accurate as of 2:10 pm CST).  Update: as of 2:26 pm
       CST, the health dashboard reports that they are "investigating an
       issue". So, 45 minutes after Down Detector sees it, they do.
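         
        For anyone who wants to confirm this kind of split outage
        independently of the vendor dashboard, a quick probe run from
        each network (laptop on wifi, phone hotspot, a VPS somewhere
        else) is usually enough. A minimal sketch in Python, with a
        placeholder URL standing in for your own us-east-2 endpoint:
         
            # Run this from each vantage point and compare the results.
            # TARGET is a hypothetical health endpoint, not a real URL.
            import time
            import urllib.request
         
            TARGET = "https://app.example.com/health"
            TIMEOUT = 5  # seconds
         
            def probe(url):
                start = time.monotonic()
                try:
                    # urlopen raises on timeouts, DNS failures and 4xx/5xx
                    with urllib.request.urlopen(url, timeout=TIMEOUT) as r:
                        return "OK status=%s in %.2fs" % (
                            r.status, time.monotonic() - start)
                except Exception as exc:
                    return "FAIL %s: %s after %.2fs" % (
                        type(exc).__name__, exc, time.monotonic() - start)
         
            print(probe(TARGET))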
        
       Author : ccleve
       Score  : 104 points
       Date   : 2022-12-05 20:12 UTC (2 hours ago)
        
       | [deleted]
        
       | agilob wrote:
        | Anyone with affected RDS instances? We were getting intermittent
        | connectivity issues today... New pods, within 1-2 minutes of
        | startup, were suddenly getting timeouts connecting to MySQL DBs.
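        | 
        | If anyone wants to narrow down whether it's the network path or
        | the DB itself, a bare TCP connect against the RDS endpoint (no
        | MySQL auth involved) is a cheap first test. A rough sketch,
        | assuming a hypothetical endpoint name and the default MySQL
        | port:
        | 
        |     # Plain TCP reachability check; if this times out, the
        |     # problem is below the MySQL layer (routing, DNS, SGs).
        |     import socket, time
        | 
        |     HOST = "mydb.cluster-xyz.us-east-2.rds.amazonaws.com"  # placeholder
        |     PORT = 3306
        | 
        |     for attempt in range(5):
        |         start = time.monotonic()
        |         try:
        |             with socket.create_connection((HOST, PORT), timeout=3):
        |                 print("attempt %d: connected in %.2fs"
        |                       % (attempt, time.monotonic() - start))
        |         except OSError as exc:
        |             print("attempt %d: failed after %.2fs (%s)"
        |                   % (attempt, time.monotonic() - start, exc))
        |         time.sleep(2)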
        
       | [deleted]
        
       | hedgehog_irl wrote:
        | Everyone seems to overlook the point here: yet again, Amazon
        | were slow as hell to be honest with their customers. I get it,
        | up/down reports help, but why do you keep using a service that
        | lies to you about availability? I've read on HN in the past that
        | the dashboard can only be updated to reflect an issue with
        | approval (comments section on a similar posting, believe it if
        | you wish). So why not move to a hosting company that is
        | transparent and open about its status? I'll not make suggestions
        | as I don't want to be accused of shilling for a specific
        | provider, but there are plenty out there. 45 minutes to update
        | their public dash is too slow. They either don't care, don't
        | monitor, or they are trying to hide their stats for fear Jeff
        | will beat the staff for SLA violations. If any other provider
        | lied to customers the way AWS does, it wouldn't be tolerated.
        | Why do you tolerate this behaviour from AWS?
       | 
        | Edited to fix autocorrect issues
        
         | remus wrote:
          | I understand the frustration, but I'm not convinced monitoring
          | at large scale is that straightforward.
         | 
         | The core question is: what constitutes degraded service? Would
         | you say a service is experiencing downtime every time a 500
         | response is served? If you're serving millions to billions of
          | requests/sec, it seems a bit disproportionate to mark a service
         | down after a single 500 error, so then you need to work out
         | some kind of acceptable threshold.
         | 
         | What about latency? Again you're just going to draw a line in
         | the sand somewhere.
         | 
         | You end up with this big mix of metrics that define service
         | quality, so you then have a kind of meta problem of deciding
         | which metrics you should alert users on. Get too trigger happy
         | and it's going to cost you money and customer trust, and your
         | customers are going to get alert fatigue when it turns out the
         | issue you alerted them about was more of a false alarm. Set the
         | bar too high and you'll have angry customers wondering wtf is
         | going on.
         | 
         | All that to say I don't think there's a right answer.
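          | 
          | To make the "threshold" point concrete, the usual shape is an
          | error-rate alarm over a rolling window rather than paging on
          | any single 500. A toy sketch (the numbers are made up, not
          | anything AWS actually uses):
          | 
          |     # Flag the service as degraded only if more than 1% of
          |     # requests in the last 5 minutes returned a 5xx.
          |     import time
          |     from collections import deque
          | 
          |     WINDOW = 300        # seconds
          |     THRESHOLD = 0.01    # an arbitrary line in the sand
          | 
          |     events = deque()    # (timestamp, was_error)
          | 
          |     def record(status_code):
          |         now = time.time()
          |         events.append((now, status_code >= 500))
          |         while events and events[0][0] < now - WINDOW:
          |             events.popleft()
          | 
          |     def degraded():
          |         if not events:
          |             return False
          |         errors = sum(1 for _, is_err in events if is_err)
          |         return errors / len(events) > THRESHOLD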
        
           | leesalminen wrote:
            | We were pretty liberal with posting to our status page for
            | years and thought it was The Right Thing to do. I still do,
            | to a point.
           | 
            | But what ended up happening was that a competitor who didn't
            | have a status page at all would use our status page against
            | us in the sales process. They just never mentioned their own
            | lack of a status page for comparison.
           | 
           | This was the same competitor who went 100% down for ~4 days
           | during the busiest month of the year and only posted updates
           | to a private Facebook group. There was data loss that was
           | never publicly admitted to.
           | 
            | So, yeah, we implemented reasonable boundaries on what
            | constitutes a post to the status page. We also adopted a new
            | status page provider that lets us get more granular with
            | categorizing posts and allows users to subscribe to only the
            | "urgent" channels that pertain to them.
        
             | lamontcg wrote:
              | Before 2003-ish, Amazon used to have a static "gonefishing"
              | page on www.amazon.com that was manually triggered during
              | outages. Because newspaper reporters wrote scripts that
              | would detect the GF pages, they were removed and the site
              | was allowed to just spew 500s for whatever segment of
              | critical pages was busted.
        
           | hedgehog_irl wrote:
            | Very fair, but 45 minutes of an outage/disruption before
            | manually updating the public status is poor service. Why is
            | that acceptable for AWS to deliver to users?
        
         | ioman wrote:
          | AWS is the 800-pound gorilla in the cloud space. Are any of the
          | other cloud providers better with customer honesty?
        
           | autotune wrote:
            | Also, good luck trying to convince your company to migrate to
            | another cloud provider instead of, say, implementing a
            | multi-region strategy, which you should have been doing in
            | the first place.
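            | 
            | For what basic multi-region fallback can look like at the
            | client edge (separate from DNS-level failover such as Route
            | 53 health checks), here's a rough sketch with made-up
            | regional endpoints:
            | 
            |     # Try the primary regional endpoint first, fall back to
            |     # the secondary if it's unreachable. URLs are placeholders.
            |     import urllib.request
            | 
            |     ENDPOINTS = [
            |         "https://api-us-east-2.example.com/",  # primary
            |         "https://api-us-west-2.example.com/",  # fallback
            |     ]
            | 
            |     def fetch():
            |         last_error = None
            |         for base in ENDPOINTS:
            |             try:
            |                 with urllib.request.urlopen(base, timeout=3) as r:
            |                     return base, r.read()
            |             except Exception as exc:
            |                 last_error = exc  # try the next region
            |         raise RuntimeError("all regions failed: %r" % last_error)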
        
             | hedgehog_irl wrote:
              | Highlight the lack of transparency in reporting outages and
              | that's a start. If your MPLS or ISP provider operated in
              | the same way, the company wouldn't accept it.
        
               | autotune wrote:
               | My company is not going to spend hundreds of thousands of
               | dollars or more, and months or even years of effort, and
               | add additional constraints to the given pool of
               | candidates we are hiring for, to migrate to GCP or Azure
               | or DigitalOcean or Hetzner or wherever is considered more
               | trendy than AWS right now due to "a lack of transparency"
               | lmao. I would look completely incompetent to even suggest
               | the idea to anyone internally.
        
               | hedgehog_irl wrote:
                | But your company is willing to accept poor service and,
                | as a result, spend more money with the same provider to
                | ensure continuity. So essentially you reward AWS for
                | hiding their stats: they can claim high uptime figures,
                | and when an outage happens it's the user's fault for not
                | spending enough money with them on many, many instances
                | across availability zones to cover for the AWS mess-up.
                | I get it, redundancy is needed in systems, but with the
                | lack of proper reporting, users are forced to overspend
                | out of fear. It's a great business model: hook the
                | clients in with lies and then get them to reward you for
                | hiding facts. Clearly your company has money to burn
                | wasting it like this. Everyone knows they lie and are
                | blatant about it; why is it tolerated? As I said, I don't
                | see other enterprise providers getting away with this
                | kind of behaviour towards clients.
        
               | autotune wrote:
               | If you are willing to host your critical infra on some
               | dodgy startup alternative that might go away in 3 months
               | because you refuse to bend on your personal values and
               | separate them from what the typical organization actually
                | cares about, best of luck. I know HN tends to love the
               | underdog, but there is a time and place for that, and a
               | time and place to accept what you need to do to keep your
               | services online.
        
               | hedgehog_irl wrote:
                | So your logic is to accept poor-quality service to keep
                | your service online rather than trying to do better and
                | improve it. You're saying that rather than rewarding a
                | company trying to do better, we should just accept poor
                | service from AWS. How is this better than "hosting on
                | some dodgy start-up"? This has nothing to do with my
                | personal beliefs or opinions; I'm trying to understand
                | why it's accepted from AWS but not others. Edited to add
                | a point.
        
               | autotune wrote:
               | My logic is to build highly resilient infrastructure
               | given the constraints available. Your definition of "poor
                | service" is not what I have experienced in my 10-year
                | career as an SRE, because I build around your definition
               | of what makes it poor and make it work as it should. It's
               | called chaos engineering, and companies like Netflix have
               | been doing it for years with their Chaos Monkey tool and
               | SRE practices. Doesn't matter what cloud provider you go
               | to, there is ALWAYS unexpected and unannounced downtime
               | unless you build around, plan for, and expect it. But
               | sure, go ahead and tell us all how industry leaders like
               | them are wrong for sticking with what you call "poor
               | service."
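                | 
                | In case "chaos engineering" sounds abstract: the core of
                | a Chaos-Monkey-style test is just picking a random
                | instance from a fleet and killing it to prove the service
                | survives. A bare-bones sketch with boto3, kept in DryRun
                | mode so nothing is actually terminated; the tag filter is
                | a placeholder:
                | 
                |     # Pick one random instance from a hypothetical tagged
                |     # fleet and attempt a dry-run termination.
                |     import random
                |     import boto3
                | 
                |     ec2 = boto3.client("ec2", region_name="us-east-2")
                |     resp = ec2.describe_instances(Filters=[
                |         {"Name": "tag:chaos-eligible", "Values": ["true"]}
                |     ])
                |     ids = [i["InstanceId"]
                |            for r in resp["Reservations"]
                |            for i in r["Instances"]]
                | 
                |     if ids:
                |         victim = random.choice(ids)
                |         print("would terminate", victim)
                |         try:
                |             # DryRun=True only checks that the real call
                |             # would be allowed; it terminates nothing.
                |             ec2.terminate_instances(InstanceIds=[victim],
                |                                     DryRun=True)
                |         except Exception as exc:
                |             print(exc)  # "DryRunOperation" = would work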
        
               | hedgehog_irl wrote:
               | Ok simply question. Would you accept any other infra
               | service provider having such poor customer service and
               | not provided updated status of an outage/disruption for
               | 45 min.
        
               | autotune wrote:
               | You are deliberately avoiding the counter-points I
               | already specifically addressed in response to that
               | question to the point we are stuck in a loop, so I am
               | going to leave this thread now. If you feel you can do a
               | better job at SRE with your current mindset and believe
               | you are better at choosing which cloud providers are
               | worth using for an org, I welcome you to try.
        
               | simfree wrote:
               | Your company is hiring and retaining people who can't
               | work with tooling outside Amazon Web Services?
        
               | [deleted]
        
               | autotune wrote:
               | Training and getting up to speed takes time and money,
               | neither of which are unlimited for any organization. It's
               | not that they/we can't work with other cloud services,
               | it's that it would likely add up to months of additional
               | on-boarding time to get someone who wasn't familiar with
               | another cloud provider productive with infra at scale on
               | said provider.
        
               | karamanolev wrote:
               | Many companies are hiring and retaining specialists in
               | AWS-lock-in-technology, who lack experience with another-
               | cloud-provider-technology, so I don't know what's
               | surprising.
        
           | hedgehog_irl wrote:
            | Didn't say you have to go "cloud". Rent hardware in a DC and
            | run it yourself, or use a VPS. I mean, the cloud is just
            | "other people's hardware". And I'd thank you not to insult
            | gorillas like that by comparing them to Amazon.
        
             | koksik202 wrote:
              | This just brings a whole load of new expenses for staffing
              | physical locations. It creates more problems than it
              | solves.
        
               | simfree wrote:
               | Colocation or especially server rental generally requires
               | no persistent staffing. The datacenter has their own
               | staff for tasks requiring physical intervention, and you
               | have IPMI/iLO access to your servers for doing reboots
               | and similar.
        
               | hedgehog_irl wrote:
                | Not really. Renting from a DC provider means you just run
                | the host yourself; they deal with power, space, cooling,
                | etc.
        
           | uuddlrlrbaba wrote:
            | I'd ask instead if there are any cloud providers worse with
            | customer honesty.
        
             | hedgehog_irl wrote:
              | I'm sure there are two-bit VPS providers that claim to be
              | cloud and are terrible. But for the price and the claims of
              | service AWS makes, I dunno, they are at the scale where
              | they don't have to care about customers.
        
       | BWStearns wrote:
       | Seeing issues in Florida for us-east-2. Coworkers in NY can still
       | get to us-east-2.
        
       | _justinfunk wrote:
        | It does seem to be a networking issue. I have an EC2 instance in
        | us-east-2 that is accessible through a "Global Accelerator" but
        | not externally through my ISP.
        | 
        | That EC2 instance can talk to other EC2 instances in us-east-2 -
        | but none of those other instances are accessible externally.
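        | 
        | A quick way to see that kind of split from one machine is to
        | check both paths side by side: the accelerator hostname and the
        | regional endpoint directly. A small sketch, with placeholder
        | hostnames:
        | 
        |     # Compare reachability of the accelerator path vs the direct
        |     # regional path from the same network.
        |     import socket
        | 
        |     TARGETS = {
        |         "global accelerator": "a123abc.awsglobalaccelerator.com",
        |         "direct us-east-2": "my-alb-123.us-east-2.elb.amazonaws.com",
        |     }
        | 
        |     for label, host in TARGETS.items():
        |         try:
        |             with socket.create_connection((host, 443), timeout=3):
        |                 print(label, "reachable")
        |         except OSError as exc:
        |             print(label, "unreachable:", exc)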
        
         | leesalminen wrote:
         | Can confirm that Global Accelerator helped us avoid this issue
         | today.
        
       | mrobins wrote:
       | Confirmed by AWS: https://health.aws.amazon.com/health/status
        
       | ocdtrekkie wrote:
       | A vendor's cloud product is having significant issues. Figured HN
       | would tell me which major public cloud infrastructure fell over
       | to cause it. Never fails.
        
       | jstimps wrote:
        | We're seeing that external requests to ALBs in us-east-2 are
        | affected.
        
       | WFHRenaissance wrote:
       | Cyber attack?
        
       | jeremib wrote:
       | They've finally updated their status.
       | 
       | 12:26 PM PST We are investigating an issue, which may be
       | impacting Internet connectivity between some customer networks
       | and the US-EAST-2 Region.
        
       | biggerChris wrote:
        
       | the_snooze wrote:
        | I'm seeing it too, and surprised their health page says nothing.
        | The US East 2 console is unresponsive.
        | https://us-east-2.signin.aws.amazon.com/
        
         | mrobins wrote:
         | We're experiencing issues for some users but not all connecting
         | to resources in US East 2.
        
       | bloaf wrote:
       | There was definitely something going on last night too, I noticed
       | a number of sites having intermittent issues confirmed by down
       | detector.
        
       | muttantt wrote:
        | We've got our main stack all in us-east-2. It all seems to be
        | running currently.
        
         | joshuanapoli wrote:
         | We couldn't detect any problem accessing resources via
         | CloudFront/AppSync. Maybe the issue was specific to ELB.
        
         | ccleve wrote:
         | Try it on your phone.
        
           | muttantt wrote:
           | All accessible
        
           | mrobins wrote:
           | Works from my computer but not my phone (AT&T in NY).
        
           | jamroom wrote:
           | Yeah I can't access our services on us-east-2 on AT&T 5G but
           | CAN on my CenturyLink fiber.
        
       | AlphaWeaver wrote:
       | Appears to affect ELBs.
        
       | rshm wrote:
       | Could be resolved, east-2 console working fine on my end.
        
       | jamroom wrote:
       | We're still up on us-east-2 but lots of customers calling in that
       | they can't connect - makes me think there's some network down
       | somewhere.
        
       | Analemma_ wrote:
       | There are three kinds of lies: lies, damned lies, and cloud
       | status dashboards.
        
       | baq wrote:
       | It only turns yellow if the datacenter gets flooded by lava. Red
       | is probably a tactical nuke.
        
         | agilob wrote:
          | I read here on HN before that yellow requires a manual
          | signature on paper from a higher manager, because such a fault
          | affects their compensation and decreases stock value. Red
          | requires a signature from the C-level. It's not automated at
          | all - an almost worthless dashboard.
        
       | cathintexas wrote:
       | On our team, we are seeing that if you are on AT&T cell service
       | or have AT&T as your ISP, you can't reach AWS or our site in US-
       | east-2.
        
       | dixie_land wrote:
        | Keep in mind the AWS status dashboard solely reflects the
        | product-owning manager's discretion.
        | 
        | And the number of yellows ("green I" if you're old enough) is
        | definitely a material input to PIPs :)
        
         | joecot wrote:
         | Correct. No matter how down AWS is, their status page will only
         | show a disruption if a manager approves showing a disruption.
         | There is nothing automated to display the status, so the status
         | page is mostly worthless except for whatever AWS admits is
         | down.
        
           | whoknew1122 wrote:
            | All this is compounded by the fact that AWS builds on AWS. So
            | there can be a disruption of a service, but it's not really
            | the service's fault -- it's an upstream failure.
        
       | daneel_w wrote:
        | About two weeks ago all three of our Aurora DB instances in eu-
        | central-1 suddenly crashed and stayed offline for almost 55
        | minutes. Simultaneously we had random network problems going on
        | within our eu-central-1 VPC which we were unable to diagnose. We
        | still don't know what happened because we're not getting any
        | answers to our support request. The AWS health dashboard was all
        | green the entire time. No notifications were sent out.
        
       | mplanchard wrote:
       | We're seeing a similar thing for our us-east-2 properties. Some
       | of our team is able to reach them, but others aren't. Folks in
       | the midwest (Oklahoma and Michigan) can't even load the AWS
       | console, while people in Texas, California, Arizona, and
       | Pennsylvania can.
        
       | xup wrote:
       | I'm still seeing it on my end. Our currently-running EC2
       | instances are working fine, but the EC2 us-east-2 console webpage
       | doesn't load, and an EC2 instance in us-east-2 I rebooted has yet
       | to come back online.
        
       | nnf wrote:
       | I'm hearing from customers and other employees that our stuff at
       | AWS (us-east-2) is unreachable, but I'm able to get to it all
       | without any issue (via http & ssh). Perhaps there's a problem
       | upstream of AWS that's only affecting some ISPs?
        
         | thedougd wrote:
          | It's only impacting some ISPs. The outage varies across my
          | office locations. At my location it is out; however, I was able
          | to get access via the Cloudflare Warp VPN.
          | 
          | Edit: Sounds like AT&T
        
       | rshm wrote:
       | Snowflake confirmed AWS us-east-2 issues as well.
       | 
       | AWS - US East (Ohio): INC0073093
       | https://status.snowflake.com/incidents/yv40l966krl9
        
       ___________________________________________________________________
       (page generated 2022-12-05 23:02 UTC)