[HN Gopher] Slack's Outage on January 4th 2021
       ___________________________________________________________________
        
       Slack's Outage on January 4th 2021
        
       Author : benedikt
       Score  : 245 points
       Date   : 2021-02-01 12:59 UTC (10 hours ago)
        
 (HTM) web link (slack.engineering)
 (TXT) w3m dump (slack.engineering)
        
       | gscho wrote:
       | > our dashboarding and alerting service became unavailable.
       | 
       | Sounds like the monitoring system needs a monitoring system.
        
         | jrockway wrote:
          | It is quite awkward that "working" and "completely broken"
          | alerting systems have the same visible output -- no alerts.
         | 
         | For Prometheus users, I wrote alertmanager-status to let a
         | third-party "website up?" monitoring server check your
         | alertmanager: https://github.com/jrockway/alertmanager-status
         | 
         | (I also wrote one of the main Google Fiber monitoring systems
         | back when I was at Google. We spent quite a bit of time on
         | monitoring monitoring, because whenever there was an actual
         | incident people would ask us "is this real, or just the
         | monitoring system being down?" Previous monitoring systems were
         | flaky so people were kind of conditioned to ignore the improved
         | system -- so we had to have a lot of dashboards to show them
         | that there was really an ongoing issue.)
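          | 
          | The idea, as described above: expose an endpoint that an
          | external uptime checker can poll, and have that endpoint
          | check Alertmanager's own health endpoint. A stdlib-only
          | Python sketch of that shape (not the actual tool; the
          | checker's port 9099 is arbitrary):
          | 
          |     import urllib.request
          |     from http.server import BaseHTTPRequestHandler, HTTPServer
          | 
          |     AM = "http://localhost:9093/-/healthy"  # Alertmanager default
          | 
          |     def am_healthy():
          |         try:
          |             r = urllib.request.urlopen(AM, timeout=5)
          |             return r.status == 200
          |         except OSError:
          |             return False
          | 
          |     class Status(BaseHTTPRequestHandler):
          |         def do_GET(self):
          |             ok = am_healthy()
          |             self.send_response(200 if ok else 503)
          |             self.end_headers()
          |             self.wfile.write(b"OK" if ok else b"FAIL")
          | 
          |     # point the third-party "website up?" service at this port
          |     HTTPServer(("", 9099), Status).serve_forever()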
        
         | TonyTrapp wrote:
         | "Who monitors the monitors?"
        
       | tobobo wrote:
       | The big takeaway for me here is that this "provisioning service"
       | had enough internal dependencies that they couldn't bring up new
       | nodes. Seems like the worst thing possible during a big traffic
       | spike.
        
       | jeffbee wrote:
       | Why don't they mention what seems like a clear lesson: control
       | traffic has to be prioritized using IP DSCP bits, or else your
       | control systems can't recover from widespread frame drop events.
       | Does AWS TGW not support DSCP?
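        | 
        | For concreteness: marking control-plane traffic with a DSCP
        | class is a per-socket option on Linux. A minimal Python sketch
        | (the host name is hypothetical; whether any given cloud fabric
        | actually honours the bits is exactly the open question above):
        | 
        |     import socket
        | 
        |     CS6 = 48        # DSCP class selector 6, "network control"
        |     TOS = CS6 << 2  # DSCP sits in the top 6 bits of the TOS byte
        | 
        |     s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        |     s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
        |     s.connect(("control-plane.internal", 443))  # hypothetical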
        
       | dplgk wrote:
       | > On January 4th, one of our Transit Gateways became overloaded.
       | The TGWs are managed by AWS and are intended to scale
       | transparently to us. However, Slack's annual traffic pattern is a
       | little unusual: Traffic is lower over the holidays, as everyone
       | disconnects from work (good job on the work-life balance, Slack
       | users!). On the first Monday back, client caches are cold and
       | clients pull down more data than usual on their first connection
       | to Slack. We go from our quietest time of the whole year to one
       | of our biggest days quite literally overnight.
       | 
       | What's interesting is that when this happened, some HN comments
       | suggested it was the return from holiday traffic that caused it.
       | Others said, "nah, don't you think they know how to handle that
       | by now?"
       | 
        | Turns out Occam's razor applied here. The simplest answer was
        | the correct one: return-from-holiday traffic.
        
         | polote wrote:
         | But what HN predicted was wrong
         | https://news.ycombinator.com/item?id=25632346
         | 
         | "My bet is that this incident is caused by a big release after
         | a post-holiday "code freeze". "
        
         | hnlmorg wrote:
          | HN comments suggested a broad plethora of things. I'm not
          | surprised some of them landed on the right cause.
        
         | floatingatoll wrote:
         | That's very kind of you to remember :)
        
         | cett wrote:
          | Though the nuance is that Slack did know how to handle it;
          | AWS didn't.
        
           | fipar wrote:
           | I don't mean this ironically, but I think Slack did not
           | actually know how to handle it: they outsourced the handling
           | of this; they passed the buck.
           | 
           | This usually works well, under the rationale that "upstream
           | provider does this for a living, so they must be better than
           | us at this", but if you have too unique needs (or are just a
           | bit "unlucky"), it can fail too.
           | 
           | All this to say that the cloud isn't magic. From a risk/error
           | prevention point of view, it's not that different from
           | writing software for a single local machine: not every
           | programmer needs to know how to manually do memory
           | management, it makes a lot more sense to rely on your OS and
           | malloc (and friends) for this, but the caveat is that you do
           | need to account for the fact that malloc may fail. In the
           | cloud case, one can't just assume that you'll always be able
           | to provision a new instance, scale up a service, etc. The
           | cloud is like a utility company: normally very reliable, but
           | they do fail too.
        
             | [deleted]
        
             | SilasX wrote:
             | >This usually works well, under the rationale that
             | "upstream provider does this for a living, so they must be
              | better than us at this", but if your needs are unusual
              | enough (or you are just a bit "unlucky"), it can fail too.
             | 
             | Heh, a while ago I joked that one way to scale is to "make
             | it somebody else's problem", with the proviso that you need
             | to make sure that the someone else can handle the load. And
             | then (due to the context) a commenter balked at the idea
              | that a big player like YouTube would be unable to handle
              | the scaling of their core business.
             | 
             | https://news.ycombinator.com/item?id=23170685
             | 
             | (If they're really blaming it on AWS, it really takes guts
             | to do it so publicly, I think.)
        
             | Johnny555 wrote:
             | _I think Slack did not actually know how to handle it: they
             | outsourced the handling of this; they passed the buck_
             | 
             | The issue was a transit gateway, a core network component.
             | If they weren't in the cloud, this would have been a
             | router, so they "outsourced" it in the same way an on-prem
             | service outsources routing to Cisco. I guess the difference
             | is they might have had better visibility into the Cisco
             | router and known it was overloaded.
        
             | tw04 wrote:
             | >I don't mean this ironically, but I think Slack did not
             | actually know how to handle it: they outsourced the
             | handling of this; they passed the buck.
             | 
             | Isn't that literally supposed to be the sales pitch for the
             | cloud? Get away from the infrastructure as a whole so you
             | can focus on code, and let the cloud providers wave their
             | magic wand to enable scaling?
             | 
             | If you're saying now the story is: well rely on them to
             | auto scale, until they don't - then why would I bother? Now
             | you're telling me I need to go back to having
              | infrastructure experts, which means I can save a TON of
              | money by going with a hosting provider that allows
              | allocation of resources via API (which is basically all
              | of them).
        
               | JMTQp8lwXL wrote:
               | You have to know how to write code that fits into the
               | cloud. You can't arbitrarily read/write to the file
               | system, acting as if there's only one instance of the
               | server running (if you plan to run hundreds or
               | thousands). So even by waving the cloud 'magic wand', you
               | still need to understand writing code in a cloud-friendly
               | way. So in some sense, it's a shared responsibility
               | between the vendor and engineering. You need to
               | understand how to apply the tools being given to you.
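                | 
                | A trivial example of the difference (bucket and key
                | names invented; the point is that state goes somewhere
                | every instance can see, not onto one box's disk):
                | 
                |     import boto3
                | 
                |     s3 = boto3.client("s3")
                | 
                |     # instead of open("/var/data/session-123", "w") on
                |     # whichever instance happened to serve the request:
                |     s3.put_object(Bucket="example-shared-state",
                |                   Key="sessions/123",
                |                   Body=b'{"user": 42}')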
        
               | tw04 wrote:
               | Per the article, literally nothing in their code would
               | have solved the issue. AWS was supposed to auto-scale
               | TGWs and didn't.
               | 
               | >Our own serving systems scale quickly to meet these
               | kinds of peaks in demand (and have always done so
               | successfully after the holidays in previous years).
               | However, our TGWs did not scale fast enough. During the
               | incident, AWS engineers were alerted to our packet drops
               | by their own internal monitoring, and increased our TGW
               | capacity manually. By 10:40am PST that change had rolled
               | out across all Availability Zones and our network
               | returned to normal, as did our error rates and latency.
        
               | JMTQp8lwXL wrote:
               | Correct, I was disputing the point that you can freely
               | code without being mindful of the architecture even
               | though the selling point of cloud providers is "focus on
               | code, leave architecture to us". I'm not disputing in
               | this case AWS was at fault: as the customer, Slack did
               | everything right.
        
               | twblalock wrote:
               | If you are serious about reliability you always need
               | infrastructure experts.
               | 
               | AWS is pretty good about documenting the limits of their
               | systems, SLAs, how to configure them, etc. They don't
               | just say you should wave a magic wand -- and even if they
               | did say that, professional software engineers know
               | better.
               | 
               | "a hosting provider that allows allocation of resources
               | via API" is exactly what AWS is. Your infrastructure
               | experts come into the picture because they need to know
               | which resources to request, how to estimate the scale
               | they need, and how to configure them properly. They
               | should also be doing performance testing to see if the
               | claimed performance really holds up.
        
               | ncallaway wrote:
               | > Isn't that literally supposed to be the sales pitch for
               | the cloud?
               | 
               | Yes.
               | 
               | But a sales pitch is the _most positive_ framing of the
                | product possible. I wouldn't rely on the sales pitch
               | when making the decision about how much you should depend
               | on the cloud.
        
               | remus wrote:
               | > Isn't that literally supposed to be the sales pitch for
               | the cloud? Get away from the infrastructure as a whole so
               | you can focus on code, and let the cloud providers wave
               | their magic wand to enable scaling?
               | 
               | Clearly there are limits even with the largest cloud
                | providers. You'll have to apply a bit of critical
                | thought as to whether you're going to get near those
               | limits and what that might mean for your product.
               | Obviously that's easier said than done, but you could
               | argue that the cloud providers are still giving you
               | reasonable value if you can pass the buck on a given
               | issue for x years.
        
               | solidasparagus wrote:
               | No, the cloud provides scalable infrastructure, but once
               | you are in the 0.01% and you have very unique usage
               | patterns, you still need to know how to set up your
               | infrastructure for your needs. The difference is that
               | instead of writing and managing a scalable cache, you
               | just need to build the layer that knows to pre-provision
               | for that scale/talk with AWS to make sure the system has
               | sufficient capacity.
               | 
               | The cloud isn't some magic thing that solves all scaling
               | problems, it's a tool that gives you strong primitives
               | (and once you're a large enough customer, an active
               | partner) to help you solve your scaling problems.
        
               | thu2111 wrote:
               | This feels like AWS apologism.
               | 
               | Slack knew how to set up their infrastructure. Nothing in
               | the postmortem implies AWS was misconfigured. AWS spotted
               | the problem and fixed it entirely on their side.
               | 
               | Nothing in this report suggests that Slack has unique
               | usage patterns. Users returning to work after Christmas
               | is not a phenomenon unique to Slack.
               | 
               | Their problems were:
               | 
               | 1. The AWS infrastructure broke due to an event as
               | predictable as the start of the year. That's on Amazon.
               | 
                | 2. Their infrastructure is too complicated. Their auto-
                | scaling, driven by bad heuristics, created chaos by
                | shutting down machines whilst engineers were logged into
                | them (and it's not as if this was a good way to save
                | money), and their separation of Slack into many
                | different AWS accounts created weird bottlenecks they
                | had no way to understand or fix.
               | 
               | 3. They were unable to diagnose the root cause and the
               | outage ended when AWS noticed the problem and fixed their
               | gateway system themselves.
               | 
                |  _The cloud isn't some magic thing that solves all
               | scaling problems_
               | 
               | In this case it actually created scaling problems where
               | none needed to exist. AWS is expensive compared to
               | dedicated machines in a colo. Part of the justification
               | for that high cost is seamless scalability and ability to
               | 'flex'.
               | 
               | But Slack doesn't need the ability to flex here. Scaling
               | down over the holidays and then back up once people
               | returned to work just isn't that important for them -
               | it's unlikely there were a large number of jobs queued up
               | waiting to run on their spare hardware for a few days
               | anyway. It just wasn't a good way to save money: a
               | massive outage certainly cost them far more than they'll
               | ever save.
        
           | oxfordmale wrote:
           | If you hit yourself on the thumb while using a hammer, do you
           | blame the hammer manufacturer or yourself? TGW limits are
           | well documented.
        
             | 0dmethz wrote:
             | I mean, when the hammer manufacturer sells managed, auto
             | scaling thumb-avoiding services you might rely on that.
             | 
             | If I understand correctly they didn't initially hit a TGW
             | quota, it just didn't scale up fast enough.
        
               | sokoloff wrote:
               | "Hey Boss, this system that our team selected and
               | configured, behaved as documented but not in a way that
               | protected our customers' experience.
               | 
               | It's Amazon's fault, not ours..."
               | 
               | If someone came to me with that, I'd educate them on how
               | I saw it quite differently, politely but firmly.
        
               | 0dmethz wrote:
               | Unless I'm misunderstanding something the system did not
               | perform as documented. It should have scaled, it didn't.
               | 
               | When a critical piece of infrastructure fails under
                | massive load, I'm not sure it'll help much when you
               | politely tell your engineers they fucked up for not
               | anticipating it.
               | 
               | You learn lessons. Both Slack and AWS seem to have learnt
               | lessons here.
        
               | sokoloff wrote:
               | I agree with much of what you say, but if you change it
               | to "It's Amazon's fault, not ours", that's where I
               | diverge.
               | 
               | Slack did fuck up here, as evidenced by the outage and
               | you seem to at least partially agree by the fact that
               | Slack learned a lesson. Further, I think that
               | "understanding how your system scales up from a low
               | baseline to a high level of utilization (such as Black
               | Friday/Cyber Monday for e-commerce, or special event
               | launches, or a SuperBowl ad landing page)" is a standard,
               | "par for the course" cloud engineering topic to be on top
               | of nowadays.
        
           | coldcode wrote:
           | I actually thought something in AWS was a cause but did not
           | know about these internal systems.
        
           | ignoramous wrote:
           | True, but with any managed service, hidden limits and a cloud
           | provider's own engineering (or lack thereof) may come back to
           | bite the top 0.1% (the whales).
           | 
            | One approach to problems of scale is to trim scale down and
            | bound it across multiple disparate silos that do not
            | interact with each other at all, under any circumstances,
            | except perhaps to make quick, constant-time, scale-
            | independent decisions.
           | 
           | In short, do things that don't need scale.
        
           | jabart wrote:
            | If you have a known increase in traffic at a certain
            | date/time that auto-scaling (whether EC2, NLB/ALB or another
            | service) can't handle fast enough, you can let AWS know
            | through your support contract so they over-provision during
            | that window; otherwise the scale-up will take too long.
        
           | coldcode wrote:
           | I actually thought something in AWS was a cause but did not
           | know anything about how TGW works internally.
        
           | ak217 wrote:
           | I don't think that's true. Slack seems to have their core
           | online services split across a number of VPCs, and for some
           | reason decided to use Transit Gateway to connect them.
           | Transit Gateway is a special-purpose solution that is geared
           | toward cross-region and on-prem to VPC connections in
           | corporate networks, not to global high-traffic consumer
           | products. It's the wrong tool for the job. Its architecture
           | is antithetical to the other horizontally scalable AWS
            | solutions. It introduces a single (up to) 50 Gbps network hub
           | that all inter-service traffic must go through. Native AWS
           | architectures avoid such central hubs and provide a virtual
           | routing fabric instead.
           | 
           | Slack could have chosen one of many other AWS design patterns
           | such as VPC peering, transit VPC, IGW routing, or colocating
           | more services in fewer VPCs (with more granular IAM role
           | policies to separate operator privileges), to provide an
           | automatically scaled network fabric to connect their
           | services.
           | 
           | (This isn't to criticize Slack's engineering team. They have
           | successfully scaled their service in a short time, and I'm
           | happy with their product overall, and with their transparency
           | in this report. But I think AWS has the world's biggest and
           | most scalable network fabric - it's just a matter of knowing
           | how to harness it.)
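            | 
            | For reference, the peering alternative is mostly a few API
            | calls per VPC pair (boto3 sketch; the IDs and CIDR are
            | placeholders, and this assumes both VPCs sit in the same
            | account and region; a full mesh still needs one such
            | connection per pair of VPCs):
            | 
            |     import boto3
            | 
            |     ec2 = boto3.client("ec2")
            | 
            |     # request and accept the peering connection
            |     pcx = ec2.create_vpc_peering_connection(
            |         VpcId="vpc-aaaa1111", PeerVpcId="vpc-bbbb2222")
            |     pcx_id = (pcx["VpcPeeringConnection"]
            |                  ["VpcPeeringConnectionId"])
            |     ec2.accept_vpc_peering_connection(
            |         VpcPeeringConnectionId=pcx_id)
            | 
            |     # each side then needs a route pointing at the peering
            |     ec2.create_route(RouteTableId="rtb-cccc3333",
            |                      DestinationCidrBlock="10.1.0.0/16",
            |                      VpcPeeringConnectionId=pcx_id)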
        
           | curun1r wrote:
           | I remember going to a presentation by someone from FanDuel
           | where he discussed something similar. Their usage patterns
           | (heavy spike on NFL Sundays) caused similar problems with
           | infrastructure that expected more gradual build-up. They
           | engineered for it with synthetic traffic in advance of their
           | expected spike to ensure their infrastructure was warm.
           | 
           | TL;DR it's still your responsibility to understand the
           | limitations of your infrastructure decisions and engineer
           | your systems accordingly.
        
           | m463 wrote:
           | AWS might have had back-to-work traffic in lots of domains
           | simultaneously.
           | 
           | Or maybe their monitoring and response staff was just coming
           | back online.
        
           | tempest_ wrote:
           | Well Slack depended on the Cloud(tm).
           | 
            | It is interesting though, because a lot of the blog posts
           | like "How we handled a 3000% traffic increase overnight!"
           | boil down to "We turned up the AWS knob".
           | 
           | What happens when the AWS knob doesn't work?
        
             | buildawesome wrote:
             | You do what Slack did and call the maker of the AWS
             | Knob:tm:.
        
           | rrrrrrrrrrrryan wrote:
           | Some of my co-workers came from active.com (a website that
           | lets people register for marathons and events). The
           | infrastructure had to handle massive spikes because
           | registrations for big races would open all at once, so
           | scalability was everything.
           | 
           | They explained to me that they'd intentionally slam the
           | production website with external traffic a couple of times
           | per year, at a scheduled time in the middle of the night.
            | Like basically an order of magnitude greater than they'd
            | ever received in real life, just to try to find the breaking
           | point. The production website would usually go down for a
           | bit, but this was vastly better than the website going down
           | when actual real users are trying to sign up for the Boston
           | Marathon.
           | 
           | Slack probably should've anticipated this surge in traffic
            | after the holidays, and it _might_ have been able to run some
           | better simulations and fire drills before it occurred.
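            | 
            | The shape of such a drill is simple enough to sketch in a
            | few lines of Python (URL and numbers invented; a real drill
            | would use something like k6 or Locust and ramp up
            | gradually):
            | 
            |     import urllib.request
            |     from concurrent.futures import ThreadPoolExecutor
            | 
            |     TARGET = "https://www.example.com/health"  # placeholder
            | 
            |     def hit(_):
            |         try:
            |             urllib.request.urlopen(TARGET, timeout=10).read()
            |             return True
            |         except OSError:
            |             return False
            | 
            |     # 50,000 requests from 200 concurrent workers
            |     with ThreadPoolExecutor(max_workers=200) as pool:
            |         results = list(pool.map(hit, range(50_000)))
            | 
            |     print(f"errors: {results.count(False)}/{len(results)}")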
        
             | t0mas88 wrote:
             | Very good test. The guys at iracing.com should have done
             | this before organising the e-sports Daytona 24 hours race
             | last week, it was by far their largest event (boosted by
             | Covid lockdown). It crashed their central scheduling
             | service with a database deadlock. Classic case of a bug you
             | only find under heavy load.
        
             | bryan_w wrote:
             | The problem you run into is that while you can load test
             | your website with no problems, when running on shared
             | infrastructure (AWS), you have to account for everyone's
             | website being under load at the same time. That isn't as
             | easy to test or find bottlenecks for.
        
           | cle wrote:
           | > During the incident, AWS engineers were alerted to our
           | packet drops by their own internal monitoring, and increased
           | our TGW capacity manually. By 10:40am PST that change had
           | rolled out across all Availability Zones and our network
           | returned to normal, as did our error rates and latency.
           | 
           | Sounds like AWS knew how to handle it too.
           | 
           | Given how AWS has responded to past events like this, I'd bet
           | there's an internal post-mortem and they'll add mechanisms to
           | fix this scaling bottleneck for everyone.
           | 
           | Although one thing I'm not clear on is if this was really an
           | AWS issue or if Slack hit one of the documented limits of
           | Transit Gateway (such as bandwidth), after which AWS started
           | dropping packets. If that's the case then I don't see what
           | AWS could have done here, other than perhaps have ways to
           | monitor those limits, if they don't already. The details here
           | are a bit fuzzy in the post.
        
           | floatingatoll wrote:
           | If your oldest request was queued 5+ seconds ago in a near-
           | realtime system (such as Slack), CPU usage isn't your biggest
           | problem.
           | 
           | Slack wrote an autoscaling implementation that ignored
           | request queue depth and downsized their cluster based on CPU
           | usage alone, so while they knew how to _resolve_ it, I would
           | not go so far as to say they knew how to _prevent_ it. The
            | mistake of ignoring the maxage of the request queue is
            | perhaps the second most common blind spot in every Ops team
            | I've ever worked with. No insult to my fellow Ops folks, but
            | we've got to stop overlooking this.
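            | 
            | The guard is almost embarrassingly small once you decide to
            | write it. A sketch (names and thresholds invented; the point
            | is that the downscale decision consults queue age, not CPU
            | alone):
            | 
            |     def may_scale_down(cpu_util, oldest_queued_secs,
            |                        current_nodes, min_nodes):
            |         # never shrink while requests are visibly backing up
            |         if oldest_queued_secs > 1.0:
            |             return False
            |         # never shrink below the floor, however idle CPU looks
            |         if current_nodes <= min_nodes:
            |             return False
            |         return cpu_util < 0.25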
        
             | nicoburns wrote:
             | > The mistake of ignoring the maxage of the request queue
             | is perhaps the second most common blind spot in every Ops
             | team I've ever worked with. No insult to my fellow Ops
             | folks, but we've got to stop overlooking this.
             | 
             | What's the first?
        
               | floatingatoll wrote:
               | Non-randomized wallclock integers.
               | 
               | For example: "sleep 60 seconds", "cron 0 * * * *
               | command", "X-Retry-After: 300"
               | 
               | Found in: recurring jobs, backoff algorithms, oauth
               | tokens.
               | 
               | Found in: ops-created tasks, dev-released software.
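                | 
                | The usual fix is jitter, so the retries and periodic
                | jobs stop landing on the same wallclock second. A tiny
                | Python sketch (the 60s base and 20% spread are just
                | examples):
                | 
                |     import random, time
                | 
                |     def sleep_jittered(base=60, spread=0.2):
                |         # "sleep 60" becomes a sleep of 48..72 seconds
                |         time.sleep(base * random.uniform(1 - spread,
                |                                          1 + spread))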
        
               | encoderer wrote:
               | I'm building something at Cronitor to help detect those
               | hot-spots! If you want to learn more, email me: shane at
               | cronitor.io
        
           | twblalock wrote:
           | Lots of other services that use AWS didn't go down the same
           | day -- because they provisioned enough AWS capacity.
        
             | helper wrote:
             | Where's the button to provision more TGW capacity?
        
         | thunderbong wrote:
         | But Slack has been around for longer than a year, right?
         | Shouldn't they have noticed this happening earlier?
         | 
         | I mean, considering Slack is mostly used as a workplace chat
         | mechanism, they should have faced this kind of a scenario
         | previously and had a solution for this by now.
        
           | delfaras wrote:
            | Yeah, but this year the number of people working from home
            | who would connect to Slack directly at the beginning of
            | their work day must be much, much larger than in other
            | years.
        
             | steve_adams_86 wrote:
             | Another small but potentially relevant detail: Not many
             | people vacationed, so more people would have returned to
             | work at standard times. Many people travel over holidays
             | for example (usually) but this time around in many places
             | it wasn't even an option. Other people extend their
             | holidays to relax more, but I don't know any people
             | interested in staycations in their house. We've had enough
             | of it.
        
       | jeffrallen wrote:
       | tldr: "we added complexity into our system to make it safer and
       | that complexity blew us up. Also: we cannot scale up our system
       | without everything working well, so we couldn't fix our stuff.
       | Also: we were flying blind, probably because of the failing
       | complexity that was supposed to protect us."
       | 
       | I am really not impressed... with the state of IT. I could not
       | have done better, but isn't it too bad that we've built these
       | towers of sand that keep knocking each other over?
        
       | bovermyer wrote:
       | I really enjoy reading these write-ups, even if the causal
       | incident is not something I enjoy.
        
         | ignoramous wrote:
         | Not at all fun when you're actively involved in mitigating
         | these, though. Pretty rough and sometimes scars you for life.
        
           | cle wrote:
           | I work on critical services, similar to this. While nobody
           | likes the overall business impact of events like this, I love
           | being in the trenches when things like this happen. I enjoy
           | the pressure of having to use anything I can to mitigate as
           | quickly as possible.
           | 
           | I used to work in professional kitchens before software, and
           | it feels a lot like the pressure of a really busy night as a
           | line cook. Some people love it.
        
             | [deleted]
        
           | bovermyer wrote:
           | I've been in the middle of events like these and had to write
           | my share of postmortem documents.
           | 
           | Hearing about others' similar experiences makes me feel a
           | connection to them, and often teaches me something.
        
       | danw1979 wrote:
       | So many fails due to in-band control and monitoring are laid
       | bare, followed by this absolute chestnut -
       | 
       | > We've also set ourselves a reminder (a Slack reminder, of
       | course) to request a preemptive upscaling of our TGWs at the end
       | of the next holiday season.
        
       | jrockway wrote:
        | The thing that always worries me about cloud systems is the
       | hidden dependencies in your cloud provider that work until they
       | don't. They typically don't output logs and metrics, so you have
        | no choice but to pray that someone looks at your support ticket
        | and clicks their internal system's "fix it for this customer"
        | button.
       | 
       | I'll also say that I'm interested in ubiquitous mTLS so that you
       | don't have to isolate teams with VPCs and opaque proxies. I don't
       | think we have widely-available technology around yet that
       | eliminates the need for what Slack seems to have here, but
       | trusting the network has always seemed like a bad idea to me, and
       | this shows how a workaround can go wrong. (Of course, to avoid
       | issues like the confused deputy problem, which Slack suffered
       | from, you need some service to issue certs to applications as
       | they scale up that will be accepted by services that it is
       | allowed to talk to and rejected by all other services. In that
       | case, this postmortem would have said "we scaled up our web
       | frontends, but the service that issues them certificates to talk
       | to the backend exploded in a big ball of fire, so we were down."
       | Ya just can't win ;)
        
         | bengale wrote:
         | Experienced something similar with mongo atlas today. Our
         | primary node went down and the cluster didn't failover to
         | either of the secondaries. We got to sit with our production
         | environment completely offline while staring at two completely
         | functional nodes that we had no ability to use. Even when we
         | managed to get hold of support they also seemed unable to
         | trigger a failover and basically told us to wait for the
         | primary node to come back up. It took 90 minutes in the end and
          | has definitely made us rethink the future and the control
          | we've given over.
        
           | [deleted]
        
         | gen220 wrote:
         | Some customers have a hard requirement that their slack
         | instances be behind a unique VPC. Other customers are easier to
         | sell to if you sprinkle some "you'll get your own closed
         | network" on top of the offer, if security is something they've
         | been burned by in the past.
         | 
          | I agree with you that mTLS is the future. It exists within many
         | companies internally (as a VPC alternative!) and works great.
          | There are some problems around the certificate issuer being a
         | central point of failure, but these are known problems with
         | well-understood solutions.
         | 
         | I think there's mostly a non-technical barrier to be overcome
         | here, where the non-technical executives need to understand
         | that closed network != better security. mTLS's time in the sun
         | will only come when the aforementioned sales pitch is less
         | effective (or even counterproductive!) for Enterprise Inc., I
         | think.
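          | 
          | For anyone who hasn't seen it in practice, the server side of
          | mTLS is small. A Python sketch (certificate file names and the
          | port are placeholders; the CA here is whatever internal issuer
          | you trust):
          | 
          |     import socket
          |     import ssl
          | 
          |     ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
          |     ctx.load_cert_chain("server.crt", "server.key")
          |     ctx.load_verify_locations("internal-ca.pem")
          |     ctx.verify_mode = ssl.CERT_REQUIRED  # makes it mutual
          | 
          |     srv = socket.create_server(("0.0.0.0", 8443))
          |     with ctx.wrap_socket(srv, server_side=True) as tls:
          |         conn, addr = tls.accept()
          |         # the verified client cert is the caller's identity,
          |         # usable for ACLs instead of network reachability
          |         print(conn.getpeercert()["subject"])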
        
           | ViViDboarder wrote:
           | mTLS appears to work great between servers, but I've been
           | unable to get my iPhone to authenticate with a web server via
           | Safari using mTLS. Even after installing the cert, it never
           | presents it.
           | 
           | I wish it were better supported though.
        
       | Thaxll wrote:
        | Surprised that there are just a few metrics available for TGW:
        | https://docs.aws.amazon.com/vpc/latest/tgw/transit-gateway-c...
        | 
        | Probably the only way to see a problem is if you have a flat
        | line for bandwidth, but as the article suggested they had packet
        | drops, which do not appear in the CloudWatch metrics. AWS should
        | add those metrics, imo.
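        | 
        | The metrics that do exist can at least be pulled and alerted on
        | like anything else in CloudWatch. A boto3 sketch (the gateway ID
        | is a placeholder; BytesOut is one of the handful of metrics the
        | linked doc lists):
        | 
        |     import datetime as dt
        |     import boto3
        | 
        |     cw = boto3.client("cloudwatch")
        |     now = dt.datetime.utcnow()
        |     resp = cw.get_metric_statistics(
        |         Namespace="AWS/TransitGateway",
        |         MetricName="BytesOut",
        |         Dimensions=[{"Name": "TransitGateway",
        |                      "Value": "tgw-0123456789abcdef0"}],
        |         StartTime=now - dt.timedelta(hours=1),
        |         EndTime=now,
        |         Period=300,
        |         Statistics=["Sum"],
        |     )
        |     for p in sorted(resp["Datapoints"],
        |                     key=lambda p: p["Timestamp"]):
        |         print(p["Timestamp"], p["Sum"])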
        
         | miyuru wrote:
         | some TGW limits are documented here:
         | https://aws.amazon.com/transit-gateway/faqs/#Performance_and...
        
       | fullstop wrote:
       | I was kind of surprised to see that they are using Apache's
       | threaded workers and not nginx.
        
         | dijit wrote:
         | Apache mod_php is still much faster than php-fpm, and since
          | Slack uses a lot of PHP on the backend, it makes a lot of
          | sense for them.
        
           | saurik wrote:
           | I loved this article where someone started with the goal of
           | writing an article about how much faster nginx was, but then
           | discovered the opposite (and for some reason didn't change
           | the title of the article, which is hilarious)... both because
           | it showed the author "cared", but also because it showed that
           | people just assume what amounts to marketing myths (such as
           | that Apache and mod_php are ancient tech vs. the more modern
           | php-fpm stack) before bothering to verify anything.
           | 
           | https://www.eschrade.com/page/why-is-fastcgi-w-nginx-so-
           | much...
        
             | fullstop wrote:
             | When I moved things from Apache -> nginx years ago, I did
             | it not because it was faster but because the resource
             | requirements of nginx were so much more predictable under
             | load.
        
           | Denvercoder9 wrote:
           | That's not the reason, as mod_php only supports the prefork
           | MPM, and not the threaded MPMs.
        
             | johannes1234321 wrote:
              | Threaded MPMs and PHP work quite well; I fixed a few bugs
              | in that space some time ago.
              | 
              | There might, however, be issues in some of the millions of
              | libraries PHP potentially links in and calls from its
              | extensions, and those sometimes aren't thread safe, but
              | finding that and finding workarounds isn't easy ...
              | 
              | The other issue is that it's often slower (though there
              | were recent changes from custom thread-local storage to a
              | more modern one), and if there is a crash (i.e. a
              | recursion stack overflow ...) it affects all requests in
              | that process, not only the one that crashed.
        
           | cholmon wrote:
           | Are you saying that Slack uses mod_php? According to a Slack
           | Engineering blog post[0] from 9 months ago, Slack has been
           | using Hack/HHVM since 2016 in place of PHP. My understanding
           | is that HHVM can only be run over FastCGI, unless there's a
           | mod_hhvm that I'm unaware of.
           | 
           | [0]: https://slack.engineering/hacklang-at-slack-a-better-
           | php/
        
             | kmavm wrote:
             | This is accurate: Slack is exclusively using Hack/HHVM for
             | its application servers.
             | 
             | HHVM has an embedded web server (the folly project's
             | Proxygen), and can directly terminate HTTP/HTTPS itself.
             | Facebook uses it in this way. If you want to bring your own
             | webserver, though, FastCGI is the most practical way to do
             | so with HHVM.
        
           | user5994461 wrote:
           | mod_xyz plugins were deprecated long ago because they are
           | unstable. If you do a comparison you should compare fastcgi
           | versus fastcgi, that is the standard way to run web
            | applications. Running with a mod_* plugin should be faster
            | because it runs the interpreter directly inside the apache
            | process, but it also makes apache unstable.
           | 
           | mod_python was abandoned around a decade ago. It's crashing
           | on python 2.7.
           | 
           | mod_perl was dropped in 2012 with the release of apache 2.4.
           | It was kicked out of the project but continues to exist as a
           | separate project (not sure if it works at all).
        
       | nickthemagicman wrote:
        | Wow. The trail leads back to AWS. Weren't there a number of
        | other companies that were down around that same time, or was
        | that a different time?
        
         | conradfr wrote:
         | Does AWS compensate you in cases like this?
        
           | dewey wrote:
           | It depends: https://aws.amazon.com/legal/service-level-
           | agreements/
        
           | dandigangi wrote:
            | They do. I can't remember where the documentation is, but
            | there's a clause where they pay out as long as they aren't
            | meeting their SLAs.
           | 
           | Who knows the different ways they may be able to get out of
           | that. I assume this wasn't one of those times.
        
         | jjtheblunt wrote:
         | Is this the recent event you refer to?
         | https://aws.amazon.com/message/11201/
        
           | nickthemagicman wrote:
           | Ah yes, that was the one I was referring to. Looks like a
           | different event.
        
       | nhoughto wrote:
        | Never seen Transit Gateway before. I assume they wouldn't have
        | this problem if it was just a single VPC or done via VPC
        | peering?
        
       | sargun wrote:
       | I wonder why Slack uses TGW instead of VPC peering.
        
         | tikkabhuna wrote:
         | The "I've just done my Solutions Architect exam" answer would
         | be that TGW simplifies the topology by having a central hub,
         | rather than each VPC having to peer with all the other VPCs.
         | 
         | I wonder how many VPCs people have before transitioning over to
         | TGW.
        
           | saurik wrote:
           | I miss EC2 Classic :/. It always feels like the entire world
           | of VPCs must have come from the armies of network engineers
           | who felt like if the world didn't support all of the
           | complexity they had designed to fix a problem EC2 no longer
           | had--the tyranny of cables and hubs and devices acting as
           | routers--that maybe they would be out of a job or something,
           | and so rather than design hierarchical security groups Amazon
           | just brought back in every feature of network administration
           | I had been happily prepared to never have to think about
            | ever again :(.
        
             | BillinghamJ wrote:
             | Generally inclined to agree, but to be fair you can operate
             | a VPC in exactly the same way as EC2 Classic - give
             | everything public IPs, public subnets and ignore the
             | internal IPs. Pretty sure those are the defaults too
        
             | whatisthiseven wrote:
             | Agreed, I always thought VPC and all that complexity was a
             | big step backwards. My org is moving from a largely managed
             | network into AWS, and now we have to configure the whole
             | network and external gateways ourselves? What engineer
             | wants to do this?
             | 
             | VPCs are virtual, but I don't need VPCs, I need the entire
             | network layer virtualized and abstracted. As you
              | suggested, just grouping devices in a single network and
             | saying "let them all talk to each other, let this one talk
             | to that one over this port/IP" should be all I describe.
             | Let AWS figure out CIDR, routing, gateways, etc.
        
               | gen220 wrote:
               | People use it as a (imo lazy) form of enforcing access
               | control. If two services aren't in the same VPC, they
               | can't talk to each other. It theoretically limits the
               | damage of a rogue node.
               | 
               | Of course, it also creates a ton of overhead and
               | complexity, because you still have to wire all your VPCs
               | together to implement things like monitoring and log
               | aggregation, for example.
               | 
               | As other people have suggested, the better solution (imo)
               | is to have all your traffic be encrypted with mTLS, and
               | enforce your ACLs with certs instead of network
               | accessibility.
        
             | Lammy wrote:
             | My assumption is it's an IPv4 address exhaustion thing too.
        
               | jen20 wrote:
               | It's more to do with what entries you would put in a
               | routing table to get EC2 Classic over a DirectConnect,
               | no?
        
           | schoolornot wrote:
           | Exams these days are a cross between pre-sales training and
           | AWS dogma. Fundamentally all AWS services share the same
           | primitives. It would be great if AWS could take incidents
           | like these, provide some guidance on how to avoid them, and
           | then add 1 or 2 questions to the exam. It would give some
           | credence to the exams which are now basically crammed
           | material.
        
         | mbyio wrote:
         | Yeah this stuck out to me too. Adding an extra hop and point of
         | failure isn't acceptable IMO.
        
         | miyuru wrote:
         | this article from 4 months ago explains it.
         | 
         | https://slack.engineering/building-the-next-evolution-of-clo...
        
           | hanikesn wrote:
           | Interesting, I was wondering whether using a shared VPC would
           | have been the better solution, but it turns out they use a
           | shared VPC per region and peer them via TGW. IMHO it'd be
           | worth peering those regions individually, to get rid of that
            | potential bottleneck. Of course you lose quite a few
           | interesting features.
        
       | johnnymonster wrote:
       | TL;DR outage caused by traffic spike from people returning to
       | work after the holiday.
        
       | ianrw wrote:
       | I wonder if you can pre-warm TGWs like you can ELB? It would be
        | annoying to have to have AWS prewarm a bunch of your stuff, but
       | it's better than it going down.
        
       | kparaju wrote:
       | Some lessons I took from this retro:
       | 
        | - Disable autoscaling if appropriate during an outage. For
        | example, if the web server is degraded, it's probably best to
        | make sure that the backends don't autoscale down.
       | 
       | - Panic mode in Envoy is amazing!
       | 
       | - Ability to quickly scale your services is important, but that
       | metric should also take into account how quickly the underlying
       | infrastructure can scale. Your pods could spin up in 15 seconds
       | but k8s nodes will not!
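        | 
        | That last point is worth making concrete, because the pod
        | number is the one everybody quotes. A back-of-the-envelope
        | Python sketch (all numbers invented):
        | 
        |     def time_to_capacity(extra_pods, free_pod_slots,
        |                          pod_startup_s=15, node_startup_s=180):
        |         # pods that fit on existing nodes come up at pod speed;
        |         # everything else waits for new nodes first
        |         if extra_pods <= free_pod_slots:
        |             return pod_startup_s
        |         return node_startup_s + pod_startup_s
        | 
        |     # 100 extra pods, room for 20 on current nodes:
        |     # ~195s, not 15s, because most pods are waiting on nodes
        |     print(time_to_capacity(100, 20))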
        
       ___________________________________________________________________
       (page generated 2021-02-01 23:01 UTC)