[HN Gopher] Slack's Outage on January 4th 2021
___________________________________________________________________
Slack's Outage on January 4th 2021
Author : benedikt
Score : 245 points
Date : 2021-02-01 12:59 UTC (10 hours ago)
(HTM) web link (slack.engineering)
(TXT) w3m dump (slack.engineering)
| gscho wrote:
| > our dashboarding and alerting service became unavailable.
|
| Sounds like the monitoring system needs a monitoring system.
| jrockway wrote:
| It is quite awkward that "working" and "completely broken"
| alerting systems have the same visible output -- no alerts.
|
| For Prometheus users, I wrote alertmanager-status to let a
| third-party "website up?" monitoring server check your
| alertmanager: https://github.com/jrockway/alertmanager-status
|
| (I also wrote one of the main Google Fiber monitoring systems
| back when I was at Google. We spent quite a bit of time on
| monitoring monitoring, because whenever there was an actual
| incident people would ask us "is this real, or just the
| monitoring system being down?" Previous monitoring systems were
| flaky so people were kind of conditioned to ignore the improved
| system -- so we had to have a lot of dashboards to show them
| that there was really an ongoing issue.)
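|
| A minimal sketch of that external-probe idea, assuming
| Alertmanager's standard /-/healthy endpoint and a hypothetical
| hostname; the actual paging side is left as a comment:
|
|     import sys
|     import urllib.request
|
|     # Run from a host *outside* the monitored infrastructure,
|     # e.g. via cron on a third-party box.
|     ALERTMANAGER_URL = "https://alertmanager.example.com/-/healthy"
|
|     def alertmanager_healthy(url, timeout=5.0):
|         try:
|             with urllib.request.urlopen(url, timeout=timeout) as r:
|                 return r.status == 200
|         except OSError:
|             return False
|
|     if __name__ == "__main__":
|         if not alertmanager_healthy(ALERTMANAGER_URL):
|             # The "are my alerts working?" signal must not depend
|             # on the same stack it is checking.
|             sys.exit("alertmanager unreachable: page a human")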
| TonyTrapp wrote:
| "Who monitors the monitors?"
| tobobo wrote:
| The big takeaway for me here is that this "provisioning service"
| had enough internal dependencies that they couldn't bring up new
| nodes. Seems like the worst thing possible during a big traffic
| spike.
| jeffbee wrote:
| Why don't they mention what seems like a clear lesson: control
| traffic has to be prioritized using IP DSCP bits, or else your
| control systems can't recover from widespread frame drop events.
| Does AWS TGW not support DSCP?
| dplgk wrote:
| > On January 4th, one of our Transit Gateways became overloaded.
| The TGWs are managed by AWS and are intended to scale
| transparently to us. However, Slack's annual traffic pattern is a
| little unusual: Traffic is lower over the holidays, as everyone
| disconnects from work (good job on the work-life balance, Slack
| users!). On the first Monday back, client caches are cold and
| clients pull down more data than usual on their first connection
| to Slack. We go from our quietest time of the whole year to one
| of our biggest days quite literally overnight.
|
| What's interesting is that when this happened, some HN comments
| suggested it was the return from holiday traffic that caused it.
| Others said, "nah, don't you think they know how to handle that
| by now?"
|
| Turns out Occam's razor applied here. The simplest answer was
| the correct one: return-from-holiday traffic.
| polote wrote:
| But what HN predicted was wrong
| https://news.ycombinator.com/item?id=25632346
|
| "My bet is that this incident is caused by a big release after
| a post-holiday "code freeze". "
| hnlmorg wrote:
| HN comments suggested a broad plethora of things. I'm not
| surprised some happened upon the right cause.
| floatingatoll wrote:
| That's very kind of you to remember :)
| cett wrote:
| Though the nuance is that Slack did know how to handle it;
| AWS didn't.
| fipar wrote:
| I don't mean this ironically, but I think Slack did not
| actually know how to handle it: they outsourced the handling
| of this; they passed the buck.
|
| This usually works well, under the rationale that "upstream
| provider does this for a living, so they must be better than
| us at this", but if your needs are too unique (or you're just
| a bit "unlucky"), it can fail too.
|
| All this to say that the cloud isn't magic. From a risk/error
| prevention point of view, it's not that different from
| writing software for a single local machine: not every
| programmer needs to know how to manually do memory
| management, it makes a lot more sense to rely on your OS and
| malloc (and friends) for this, but the caveat is that you do
| need to account for the fact that malloc may fail. In the
| cloud case, one can't just assume that you'll always be able
| to provision a new instance, scale up a service, etc. The
| cloud is like a utility company: normally very reliable, but
| they do fail too.
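|
| A toy illustration of that caveat, in the same spirit as
| checking malloc's return value (AMI and retry policy are made
| up for the example): write the scale-up call as if it can
| fail, because it can.
|
|     import time
|     import boto3
|     from botocore.exceptions import ClientError
|
|     ec2 = boto3.client("ec2")
|
|     def provision_worker(ami_id, instance_type="c5.large"):
|         """Try to add capacity; treat failure as a normal outcome."""
|         for attempt in range(3):
|             try:
|                 resp = ec2.run_instances(
|                     ImageId=ami_id, InstanceType=instance_type,
|                     MinCount=1, MaxCount=1)
|                 return resp["Instances"][0]["InstanceId"]
|             except ClientError as err:
|                 code = err.response["Error"]["Code"]
|                 if code in ("InsufficientInstanceCapacity",
|                             "RequestLimitExceeded"):
|                     time.sleep(2 ** attempt)  # the "malloc" may recover
|                     continue
|                 raise
|         return None  # caller must degrade gracefully (shed load, etc.)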
| [deleted]
| SilasX wrote:
| >This usually works well, under the rationale that
| "upstream provider does this for a living, so they must be
| better than us at this", but if your needs are too unique
| (or you're just a bit "unlucky"), it can fail too.
|
| Heh, a while ago I joked that one way to scale is to "make
| it somebody else's problem", with the proviso that you need
| to make sure that the someone else can handle the load. And
| then (due to the context) a commenter balked at the idea
| that a big player like YouTube would be unable to handle the
| scaling of their core business.
|
| https://news.ycombinator.com/item?id=23170685
|
| (If they're really blaming it on AWS, it really takes guts
| to do it so publicly, I think.)
| Johnny555 wrote:
| _I think Slack did not actually know how to handle it: they
| outsourced the handling of this; they passed the buck_
|
| The issue was a transit gateway, a core network component.
| If they weren't in the cloud, this would have been a
| router, so they "outsourced" it in the same way an on-prem
| service outsources routing to Cisco. I guess the difference
| is they might have had better visibility into the Cisco
| router and known it was overloaded.
| tw04 wrote:
| >I don't mean this ironically, but I think Slack did not
| actually know how to handle it: they outsourced the
| handling of this; they passed the buck.
|
| Isn't that literally supposed to be the sales pitch for the
| cloud? Get away from the infrastructure as a whole so you
| can focus on code, and let the cloud providers wave their
| magic wand to enable scaling?
|
| If you're saying now the story is: well rely on them to
| auto scale, until they don't - then why would I bother? Now
| you're telling me I need to go back to having
| infrastructure experts, which means I can save a TON of money
| by going with a hosting provider that allows allocation of
| resources via API (which is basically all of them).
| JMTQp8lwXL wrote:
| You have to know how to write code that fits into the
| cloud. You can't arbitrarily read/write to the file
| system, acting as if there's only one instance of the
| server running (if you plan to run hundreds or
| thousands). So even by waving the cloud 'magic wand', you
| still need to understand how to write code in a cloud-friendly
| way. So in some sense, it's a shared responsibility
| between the vendor and engineering. You need to
| understand how to apply the tools being given to you.
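|
| A tiny sketch of that point (bucket name is made up): state
| that used to be a local file has to live somewhere every
| instance can reach, e.g. object storage.
|
|     import boto3
|
|     s3 = boto3.client("s3")
|     BUCKET = "example-shared-state"  # hypothetical
|
|     # Instead of open("/var/app/session_123.json", "w") on one box:
|     def save_session(session_id, payload):
|         s3.put_object(Bucket=BUCKET,
|                       Key=f"sessions/{session_id}.json", Body=payload)
|
|     def load_session(session_id):
|         obj = s3.get_object(Bucket=BUCKET,
|                             Key=f"sessions/{session_id}.json")
|         return obj["Body"].read()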
| tw04 wrote:
| Per the article, literally nothing in their code would
| have solved the issue. AWS was supposed to auto-scale
| TGWs and didn't.
|
| >Our own serving systems scale quickly to meet these
| kinds of peaks in demand (and have always done so
| successfully after the holidays in previous years).
| However, our TGWs did not scale fast enough. During the
| incident, AWS engineers were alerted to our packet drops
| by their own internal monitoring, and increased our TGW
| capacity manually. By 10:40am PST that change had rolled
| out across all Availability Zones and our network
| returned to normal, as did our error rates and latency.
| JMTQp8lwXL wrote:
| Correct, I was disputing the point that you can freely
| code without being mindful of the architecture even
| though the selling point of cloud providers is "focus on
| code, leave architecture to us". I'm not disputing that in
| this case AWS was at fault: as the customer, Slack did
| everything right.
| twblalock wrote:
| If you are serious about reliability you always need
| infrastructure experts.
|
| AWS is pretty good about documenting the limits of their
| systems, SLAs, how to configure them, etc. They don't
| just say you should wave a magic wand -- and even if they
| did say that, professional software engineers know
| better.
|
| "a hosting provider that allows allocation of resources
| via API" is exactly what AWS is. Your infrastructure
| experts come into the picture because they need to know
| which resources to request, how to estimate the scale
| they need, and how to configure them properly. They
| should also be doing performance testing to see if the
| claimed performance really holds up.
| ncallaway wrote:
| > Isn't that literally supposed to be the sales pitch for
| the cloud?
|
| Yes.
|
| But a sales pitch is the _most positive_ framing of the
| product possible. I wouldn't rely on the sales pitch
| when making the decision about how much you should depend
| on the cloud.
| remus wrote:
| > Isn't that literally supposed to be the sales pitch for
| the cloud? Get away from the infrastructure as a whole so
| you can focus on code, and let the cloud providers wave
| their magic wand to enable scaling?
|
| Clearly there are limits even with the largest cloud
| providers. You'll have to engage a bit of critical
| thought into whether you're going to get near those
| limits and what that might mean for your product.
| Obviously that's easier said than done, but you could
| argue that the cloud providers are still giving you
| reasonable value if you can pass the buck on a given
| issue for x years.
| solidasparagus wrote:
| No, the cloud provides scalable infrastructure, but once
| you are in the 0.01% and you have very unique usage
| patterns, you still need to know how to set up your
| infrastructure for your needs. The difference is that
| instead of writing and managing a scalable cache, you
| just need to build the layer that knows to pre-provision
| for that scale/talk with AWS to make sure the system has
| sufficient capacity.
|
| The cloud isn't some magic thing that solves all scaling
| problems, it's a tool that gives you strong primitives
| (and once you're a large enough customer, an active
| partner) to help you solve your scaling problems.
| thu2111 wrote:
| This feels like AWS apologism.
|
| Slack knew how to set up their infrastructure. Nothing in
| the postmortem implies AWS was misconfigured. AWS spotted
| the problem and fixed it entirely on their side.
|
| Nothing in this report suggests that Slack has unique
| usage patterns. Users returning to work after Christmas
| is not a phenomenon unique to Slack.
|
| Their problems were:
|
| 1. The AWS infrastructure broke due to an event as
| predictable as the start of the year. That's on Amazon.
|
| 2. Their infrastructure is too complicated. Their auto-
| scaling, driven by bad heuristics, created chaos by shutting
| down machines whilst engineers were logged into them (and
| it's not as if this was a good way to save money), and their
| separation of Slack into many different AWS accounts created
| weird bottlenecks they had no way to understand or fix.
|
| 3. They were unable to diagnose the root cause and the
| outage ended when AWS noticed the problem and fixed their
| gateway system themselves.
|
| _The cloud isn't some magic thing that solves all
| scaling problems_
|
| In this case it actually created scaling problems where
| none needed to exist. AWS is expensive compared to
| dedicated machines in a colo. Part of the justification
| for that high cost is seamless scalability and ability to
| 'flex'.
|
| But Slack doesn't need the ability to flex here. Scaling
| down over the holidays and then back up once people
| returned to work just isn't that important for them -
| it's unlikely there were a large number of jobs queued up
| waiting to run on their spare hardware for a few days
| anyway. It just wasn't a good way to save money: a
| massive outage certainly cost them far more than they'll
| ever save.
| oxfordmale wrote:
| If you hit your thumb while using a hammer, do you
| blame the hammer manufacturer or yourself? TGW limits are
| well documented.
| 0dmethz wrote:
| I mean, when the hammer manufacturer sells managed,
| auto-scaling thumb-avoiding services, you might rely on that.
|
| If I understand correctly they didn't initially hit a TGW
| quota, it just didn't scale up fast enough.
| sokoloff wrote:
| "Hey Boss, this system that our team selected and
| configured, behaved as documented but not in a way that
| protected our customers' experience.
|
| It's Amazon's fault, not ours..."
|
| If someone came to me with that, I'd educate them on how
| I saw it quite differently, politely but firmly.
| 0dmethz wrote:
| Unless I'm misunderstanding something the system did not
| perform as documented. It should have scaled, it didn't.
|
| When a critical piece of infrastructure fails under
| massive load, I'm not sure it'll help much when you
| politely tell your engineers they fucked up for not
| anticipating it.
|
| You learn lessons. Both Slack and AWS seem to have learnt
| lessons here.
| sokoloff wrote:
| I agree with much of what you say, but if you change it
| to "It's Amazon's fault, not ours", that's where I
| diverge.
|
| Slack did fuck up here, as evidenced by the outage, and
| you seem to at least partially agree, given that
| Slack learned a lesson. Further, I think that
| "understanding how your system scales up from a low
| baseline to a high level of utilization (such as Black
| Friday/Cyber Monday for e-commerce, or special event
| launches, or a SuperBowl ad landing page)" is a standard,
| "par for the course" cloud engineering topic to be on top
| of nowadays.
| coldcode wrote:
| I actually thought something in AWS was a cause but did not
| know about these internal systems.
| ignoramous wrote:
| True, but with any managed service, hidden limits and a cloud
| provider's own engineering (or lack thereof) may come back to
| bite the top 0.1% (the whales).
|
| One approach to solving problems of scale is to trim scale
| down and bound it across multiple disparate silos that do not
| interact with each other at all, under any circumstances,
| except perhaps to make quick, constant-time, scale-independent
| decisions.
|
| In short, do things that don't need scale.
| jabart wrote:
| If you have a known increase in traffic at a certain
| date/time that auto-scaling (whether EC2, NLB/ALB, or another
| service) can't handle fast enough, you can let AWS know
| through your support contract to over-provision during that
| window; otherwise the scale-up will take too long.
| coldcode wrote:
| I actually thought something in AWS was a cause but did not
| know anything about how TGW works internally.
| ak217 wrote:
| I don't think that's true. Slack seems to have their core
| online services split across a number of VPCs, and for some
| reason decided to use Transit Gateway to connect them.
| Transit Gateway is a special-purpose solution that is geared
| toward cross-region and on-prem to VPC connections in
| corporate networks, not to global high-traffic consumer
| products. It's the wrong tool for the job. Its architecture
| is antithetical to the other horizontally scalable AWS
| solutions. It introduces a single (up to) 50 Gbps network hub
| that all inter-service traffic must go through. Native AWS
| architectures avoid such central hubs and provide a virtual
| routing fabric instead.
|
| Slack could have chosen one of many other AWS design patterns
| such as VPC peering, transit VPC, IGW routing, or colocating
| more services in fewer VPCs (with more granular IAM role
| policies to separate operator privileges), to provide an
| automatically scaled network fabric to connect their
| services.
|
| (This isn't to criticize Slack's engineering team. They have
| successfully scaled their service in a short time, and I'm
| happy with their product overall, and with their transparency
| in this report. But I think AWS has the world's biggest and
| most scalable network fabric - it's just a matter of knowing
| how to harness it.)
| curun1r wrote:
| I remember going to a presentation by someone from FanDuel
| where he discussed something similar. Their usage patterns
| (heavy spike on NFL Sundays) caused similar problems with
| infrastructure that expected more gradual build-up. They
| engineered for it with synthetic traffic in advance of their
| expected spike to ensure their infrastructure was warm.
|
| TL;DR it's still your responsibility to understand the
| limitations of your infrastructure decisions and engineer
| your systems accordingly.
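|
| A toy version of that pre-warming idea (target URL and ramp
| numbers are invented); the real thing would of course be a
| proper load-testing tool:
|
|     import concurrent.futures
|     import time
|     import urllib.request
|
|     TARGET = "https://staging.example.com/healthz"  # hypothetical
|
|     def hit(url):
|         with urllib.request.urlopen(url, timeout=10) as r:
|             return r.status
|
|     def ramp(url, start_rps=10, peak_rps=500, step=50):
|         """Ramp synthetic load ahead of an expected spike so the
|         infrastructure scales (or fails) before real users arrive."""
|         with concurrent.futures.ThreadPoolExecutor(
|                 max_workers=peak_rps) as pool:
|             rps = start_rps
|             while rps <= peak_rps:
|                 futures = [pool.submit(hit, url) for _ in range(rps)]
|                 concurrent.futures.wait(futures, timeout=30)
|                 rps += step
|                 time.sleep(1)
|
|     if __name__ == "__main__":
|         ramp(TARGET)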
| m463 wrote:
| AWS might have had back-to-work traffic in lots of domains
| simultaneously.
|
| Or maybe their monitoring and response staff was just coming
| back online.
| tempest_ wrote:
| Well Slack depended on the Cloud(tm).
|
| It is interesting though, because a lot of the blog posts
| like "How we handled a 3000% traffic increase overnight!"
| boil down to "We turned up the AWS knob".
|
| What happens when the AWS knob doesn't work?
| buildawesome wrote:
| You do what Slack did and call the maker of the AWS
| Knob:tm:.
| rrrrrrrrrrrryan wrote:
| Some of my co-workers came from active.com (a website that
| lets people register for marathons and events). The
| infrastructure had to handle massive spikes because
| registrations for big races would open all at once, so
| scalability was everything.
|
| They explained to me that they'd intentionally slam the
| production website with external traffic a couple of times
| per year, at a scheduled time in the middle of the night.
| Like basically an order of magnitude greater than they'd
| ever received in real life, just to try to find the breaking
| point. The production website would usually go down for a
| bit, but this was vastly better than the website going down
| when actual real users were trying to sign up for the Boston
| Marathon.
|
| Slack probably should've anticipated this surge in traffic
| after the holidays, and it _might_ have been able to run some
| better simulations and fire drills before it occurred.
| t0mas88 wrote:
| Very good test. The guys at iracing.com should have done
| this before organising the e-sports Daytona 24 hours race
| last week; it was by far their largest event (boosted by
| Covid lockdown). It crashed their central scheduling
| service with a database deadlock. Classic case of a bug you
| only find under heavy load.
| bryan_w wrote:
| The problem you run into is that while you can load test
| your website with no problems, when running on shared
| infrastructure (AWS), you have to account for everyone's
| website being under load at the same time. That isn't as
| easy to test or find bottlenecks for.
| cle wrote:
| > During the incident, AWS engineers were alerted to our
| packet drops by their own internal monitoring, and increased
| our TGW capacity manually. By 10:40am PST that change had
| rolled out across all Availability Zones and our network
| returned to normal, as did our error rates and latency.
|
| Sounds like AWS knew how to handle it too.
|
| Given how AWS has responded to past events like this, I'd bet
| there's an internal post-mortem and they'll add mechanisms to
| fix this scaling bottleneck for everyone.
|
| Although one thing I'm not clear on is if this was really an
| AWS issue or if Slack hit one of the documented limits of
| Transit Gateway (such as bandwidth), after which AWS started
| dropping packets. If that's the case then I don't see what
| AWS could have done here, other than perhaps have ways to
| monitor those limits, if they don't already. The details here
| are a bit fuzzy in the post.
| floatingatoll wrote:
| If your oldest request was queued 5+ seconds ago in a near-
| realtime system (such as Slack), CPU usage isn't your biggest
| problem.
|
| Slack wrote an autoscaling implementation that ignored
| request queue depth and downsized their cluster based on CPU
| usage alone, so while they knew how to _resolve_ it, I would
| not go so far as to say they knew how to _prevent_ it. The
| mistake of ignoring the maxage of the request queue is
| perhaps the second most common blind spot in every Ops team
| I've ever worked with. No insult to my fellow Ops folks, but
| we've got to stop overlooking this.
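|
| A minimal sketch of that guard (thresholds invented for
| illustration): scale-down decisions should look at queue age,
| not CPU alone.
|
|     import time
|     from typing import Optional
|
|     MAX_QUEUE_AGE_S = 5.0      # near-realtime budget
|     SCALE_DOWN_CPU_PCT = 25.0  # the usual CPU-only trigger
|
|     def may_scale_down(avg_cpu_pct: float,
|                        oldest_queued_at: Optional[float]) -> bool:
|         """Allow scale-down only when CPU is low AND nothing has
|         been waiting too long."""
|         if oldest_queued_at is not None:
|             if time.time() - oldest_queued_at > MAX_QUEUE_AGE_S:
|                 # Requests are backing up; removing capacity now
|                 # makes it worse.
|                 return False
|         return avg_cpu_pct < SCALE_DOWN_CPU_PCT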
| nicoburns wrote:
| > The mistake of ignoring the maxage of the request queue
| is perhaps the second most common blind spot in every Ops
| team I've ever worked with. No insult to my fellow Ops
| folks, but we've got to stop overlooking this.
|
| What's the first?
| floatingatoll wrote:
| Non-randomized wallclock integers.
|
| For example: "sleep 60 seconds", "cron 0 * * * *
| command", "X-Retry-After: 300"
|
| Found in: recurring jobs, backoff algorithms, oauth
| tokens.
|
| Found in: ops-created tasks, dev-released software.
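|
| The usual fix is a little jitter so a whole fleet doesn't wake
| up on the same wallclock second; a small sketch:
|
|     import random
|     import time
|
|     def sleep_with_jitter(base=60.0, spread=0.25):
|         """Sleep roughly `base` seconds, +/- 25%, so callers
|         don't synchronize."""
|         time.sleep(base * random.uniform(1 - spread, 1 + spread))
|
|     def retry_after_seconds(base=300):
|         """Jittered value for a Retry-After-style header instead
|         of a constant 300."""
|         return base + random.randint(0, base // 5)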
| encoderer wrote:
| I'm building something at Cronitor to help detect those
| hot-spots! If you want to learn more, email me: shane at
| cronitor.io
| twblalock wrote:
| Lots of other services that use AWS didn't go down the same
| day -- because they provisioned enough AWS capacity.
| helper wrote:
| Where's the button to provision more TGW capacity?
| thunderbong wrote:
| But Slack has been around for longer than a year, right?
| Shouldn't they have noticed this happening earlier?
|
| I mean, considering Slack is mostly used as a workplace chat
| mechanism, they should have faced this kind of a scenario
| previously and had a solution for this by now.
| delfaras wrote:
| Yeah, but this year the number of people working from home
| who would connect to Slack directly at the beginning of their
| work day must be much, much larger than in other years.
| steve_adams_86 wrote:
| Another small but potentially relevant detail: Not many
| people vacationed, so more people would have returned to
| work at standard times. Many people usually travel over the
| holidays, for example, but this time around in many places
| it wasn't even an option. Other people extend their
| holidays to relax more, but I don't know any people
| interested in staycations in their house. We've had enough
| of it.
| jeffrallen wrote:
| tldr: "we added complexity into our system to make it safer and
| that complexity blew us up. Also: we cannot scale up our system
| without everything working well, so we couldn't fix our stuff.
| Also: we were flying blind, probably because of the failing
| complexity that was supposed to protect us."
|
| I am really not impressed... with the state of IT. I could not
| have done better, but isn't it too bad that we've built these
| towers of sand that keep knocking each other over?
| bovermyer wrote:
| I really enjoy reading these write-ups, even if the causal
| incident is not something I enjoy.
| ignoramous wrote:
| Not at all fun when you're actively involved in mitigating
| these, though. Pretty rough and sometimes scars you for life.
| cle wrote:
| I work on critical services, similar to this. While nobody
| likes the overall business impact of events like this, I love
| being in the trenches when things like this happen. I enjoy
| the pressure of having to use anything I can to mitigate as
| quickly as possible.
|
| I used to work in professional kitchens before software, and
| it feels a lot like the pressure of a really busy night as a
| line cook. Some people love it.
| [deleted]
| bovermyer wrote:
| I've been in the middle of events like these and had to write
| my share of postmortem documents.
|
| Hearing about others' similar experiences makes me feel a
| connection to them, and often teaches me something.
| danw1979 wrote:
| So many fails due to in-band control and monitoring are laid
| bare, followed by this absolute chestnut -
|
| > We've also set ourselves a reminder (a Slack reminder, of
| course) to request a preemptive upscaling of our TGWs at the end
| of the next holiday season.
| jrockway wrote:
| The thing that always worries me about cloud systems are the
| hidden dependencies in your cloud provider that work until they
| don't. They typically don't output logs and metrics, so you have
| no choice but to pray that someone looks at your support ticket and
| clicks their internal system's "fix it for this customer" button.
|
| I'll also say that I'm interested in ubiquitous mTLS so that you
| don't have to isolate teams with VPCs and opaque proxies. I don't
| think we have widely-available technology around yet that
| eliminates the need for what Slack seems to have here, but
| trusting the network has always seemed like a bad idea to me, and
| this shows how a workaround can go wrong. (Of course, to avoid
| issues like the confused deputy problem, which Slack suffered
| from, you need some service to issue certs to applications as
| they scale up that will be accepted by services that it is
| allowed to talk to and rejected by all other services. In that
| case, this postmortem would have said "we scaled up our web
| frontends, but the service that issues them certificates to talk
| to the backend exploded in a big ball of fire, so we were down."
| Ya just can't win ;)
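|
| For what it's worth, the server side of mTLS is not much code
| these days; the hard part really is the cert issuance and
| rotation described above. A sketch with placeholder file paths:
|
|     import http.server
|     import ssl
|
|     # The cert/key/CA files are placeholders; in practice they
|     # would come from the issuance service discussed above.
|     ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
|     ctx.load_cert_chain(certfile="server.crt", keyfile="server.key")
|     ctx.load_verify_locations(cafile="internal-ca.crt")
|     ctx.verify_mode = ssl.CERT_REQUIRED  # no client cert, no service
|
|     httpd = http.server.HTTPServer(
|         ("0.0.0.0", 8443), http.server.SimpleHTTPRequestHandler)
|     httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
|     httpd.serve_forever()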
| bengale wrote:
| Experienced something similar with MongoDB Atlas today. Our
| primary node went down and the cluster didn't failover to
| either of the secondaries. We got to sit with our production
| environment completely offline while staring at two completely
| functional nodes that we had no ability to use. Even when we
| managed to get hold of support they also seemed unable to
| trigger a failover and basically told us to wait for the
| primary node to come back up. It took 90 minutes in the end and
| has definitely made us rethink the future and the control
| we've given over.
| [deleted]
| gen220 wrote:
| Some customers have a hard requirement that their Slack
| instances be behind a unique VPC. Other customers, especially
| those who have been burned on security in the past, are easier
| to sell to if you sprinkle some "you'll get your own closed
| network" on top of the offer.
|
| I agree with you that mTLS is the future. It exists within many
| companies internally (as a VPC alternative!) and works great.
| There's some problems around the certificate issuer being a
| central point of failure, but these are known problems with
| well-understood solutions.
|
| I think there's mostly a non-technical barrier to be overcome
| here, where the non-technical executives need to understand
| that closed network != better security. mTLS's time in the sun
| will only come when the aforementioned sales pitch is less
| effective (or even counterproductive!) for Enterprise Inc., I
| think.
| ViViDboarder wrote:
| mTLS appears to work great between servers, but I've been
| unable to get my iPhone to authenticate with a web server via
| Safari using mTLS. Even after installing the cert, it never
| presents it.
|
| I wish it were better supported though.
| Thaxll wrote:
| Surprised that there are just a few metrics available for TGW:
| https://docs.aws.amazon.com/vpc/latest/tgw/transit-gateway-c...
|
| Probably the only way to see a problem is if you have a flat
| line for bandwidth, but as the article suggested they had packet
| drops, which do not appear in the CloudWatch metrics. AWS should
| add those metrics, IMO.
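|
| The handful of metrics that do exist can at least be alarmed
| on; a sketch (the TGW ID is a placeholder):
|
|     import datetime
|     import boto3
|
|     cw = boto3.client("cloudwatch")
|
|     def tgw_bytes_in(tgw_id, minutes=15):
|         """Recent BytesIn for a Transit Gateway; a sudden flat
|         line is the hint."""
|         now = datetime.datetime.utcnow()
|         resp = cw.get_metric_statistics(
|             Namespace="AWS/TransitGateway",
|             MetricName="BytesIn",
|             Dimensions=[{"Name": "TransitGateway", "Value": tgw_id}],
|             StartTime=now - datetime.timedelta(minutes=minutes),
|             EndTime=now,
|             Period=300,
|             Statistics=["Sum"])
|         return sorted(resp["Datapoints"],
|                       key=lambda d: d["Timestamp"])
|
|     print(tgw_bytes_in("tgw-0123456789abcdef0"))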
| miyuru wrote:
| some TGW limits are documented here:
| https://aws.amazon.com/transit-gateway/faqs/#Performance_and...
| fullstop wrote:
| I was kind of surprised to see that they are using Apache's
| threaded workers and not nginx.
| dijit wrote:
| Apache mod_php is still much faster than php-fpm, and since
| Slack uses a lot of PHP on the backend, it makes a lot of sense
| for them.
| saurik wrote:
| I loved this article where someone started with the goal of
| writing an article about how much faster nginx was, but then
| discovered the opposite (and for some reason didn't change
| the title of the article, which is hilarious)... both because
| it showed the author "cared", but also because it showed that
| people just assume what amounts to marketing myths (such as
| that Apache and mod_php are ancient tech vs. the more modern
| php-fpm stack) before bothering to verify anything.
|
| https://www.eschrade.com/page/why-is-fastcgi-w-nginx-so-
| much...
| fullstop wrote:
| When I moved things from Apache -> nginx years ago, I did
| it not because it was faster but because the resource
| requirements of nginx were so much more predictable under
| load.
| Denvercoder9 wrote:
| That's not the reason, as mod_php only supports the prefork
| MPM, and not the threaded MPMs.
| johannes1234321 wrote:
| Threaded MPMs and PHP work quite well, I fixed a few bugs
| in that space some time ago.
|
| There might, however, be issues in some of the millions of
| libraries PHP potentially links in and calls from its
| extensions, and those sometimes aren't thread safe; finding
| that out and finding workarounds isn't easy ...
|
| The other issue is that it's often slower (though there were
| recent changes from custom thread-local storage to a more
| modern one), and if there is a crash (e.g. a recursion stack
| overflow ...) it affects all requests in that process, not
| only the one.
| cholmon wrote:
| Are you saying that Slack uses mod_php? According to a Slack
| Engineering blog post[0] from 9 months ago, Slack has been
| using Hack/HHVM since 2016 in place of PHP. My understanding
| is that HHVM can only be run over FastCGI, unless there's a
| mod_hhvm that I'm unaware of.
|
| [0]: https://slack.engineering/hacklang-at-slack-a-better-
| php/
| kmavm wrote:
| This is accurate: Slack is exclusively using Hack/HHVM for
| its application servers.
|
| HHVM has an embedded web server (the folly project's
| Proxygen), and can directly terminate HTTP/HTTPS itself.
| Facebook uses it in this way. If you want to bring your own
| webserver, though, FastCGI is the most practical way to do
| so with HHVM.
| user5994461 wrote:
| mod_xyz plugins were deprecated long ago because they are
| unstable. If you do a comparison you should compare FastCGI
| versus FastCGI; that is the standard way to run web
| applications. Running with a mod should be faster because it
| runs the interpreter directly inside the Apache process, but
| it also makes Apache unstable.
|
| mod_python was abandoned around a decade ago. It crashes
| on Python 2.7.
|
| mod_perl was dropped in 2012 with the release of Apache 2.4.
| It was kicked out of the project but continues to exist as a
| separate project (not sure if it works at all).
| nickthemagicman wrote:
| Wow. The trail leads back to AWS. Wasn't there a number of other
| companies that were down around that same time or was that a
| different time?
| conradfr wrote:
| Does AWS compensate you in cases like this?
| dewey wrote:
| It depends: https://aws.amazon.com/legal/service-level-
| agreements/
| dandigangi wrote:
| They do. I can't remember where the documentation is but
| there's a clause where they pay out when they aren't
| meeting their SLAs.
|
| Who knows the different ways they may be able to get out of
| that. I assume this wasn't one of those times.
| jjtheblunt wrote:
| Is this the recent event you refer to?
| https://aws.amazon.com/message/11201/
| nickthemagicman wrote:
| Ah yes, that was the one I was referring to. Looks like a
| different event.
| nhoughto wrote:
| Never seen Transit Gateway before. I assume they wouldn't have
| this problem if it was just a single VPC or done via VPC
| peering?
| sargun wrote:
| I wonder why Slack uses TGW instead of VPC peering.
| tikkabhuna wrote:
| The "I've just done my Solutions Architect exam" answer would
| be that TGW simplifies the topology by having a central hub,
| rather than each VPC having to peer with all the other VPCs.
|
| I wonder how many VPCs people have before transitioning over to
| TGW.
| saurik wrote:
| I miss EC2 Classic :/. It always feels like the entire world
| of VPCs must have come from the armies of network engineers
| who felt like if the world didn't support all of the
| complexity they had designed to fix a problem EC2 no longer
| had--the tyranny of cables and hubs and devices acting as
| routers--that maybe they would be out of a job or something,
| and so rather than design hierarchical security groups Amazon
| just brought back in every feature of network administration
| I had been happily prepared to never have to think about
| ever again :(.
| BillinghamJ wrote:
| Generally inclined to agree, but to be fair you can operate
| a VPC in exactly the same way as EC2 Classic - give
| everything public IPs, public subnets and ignore the
| internal IPs. Pretty sure those are the defaults too
| whatisthiseven wrote:
| Agreed, I always thought VPC and all that complexity was a
| big step backwards. My org is moving from a largely managed
| network into AWS, and now we have to configure the whole
| network and external gateways ourselves? What engineer
| wants to do this?
|
| VPCs are virtual, but I don't need VPCs, I need the entire
| network layer virtualized and abstracted. As you
| suggested, just grouping devices in a single network and
| saying "let them all talk to each other, let this one talk
| to that one over this port/IP" should be all I need to describe.
| Let AWS figure out CIDR, routing, gateways, etc.
| gen220 wrote:
| People use it as an (imo lazy) form of enforcing access
| control. If two services aren't in the same VPC, they
| can't talk to each other. It theoretically limits the
| damage of a rogue node.
|
| Of course, it also creates a ton of overhead and
| complexity, because you still have to wire all your VPCs
| together to implement things like monitoring and log
| aggregation, for example.
|
| As other people have suggested, the better solution (imo)
| is to have all your traffic be encrypted with mTLS, and
| enforce your ACLs with certs instead of network
| accessibility.
| Lammy wrote:
| My assumption is it's an IPv4 address exhaustion thing too.
| jen20 wrote:
| It's more to do with what entries you would put in a
| routing table to get EC2 Classic over a DirectConnect,
| no?
| schoolornot wrote:
| Exams these days are a cross between pre-sales training and
| AWS dogma. Fundamentally all AWS services share the same
| primitives. It would be great if AWS could take incidents
| like these, provide some guidance on how to avoid them, and
| then add 1 or 2 questions to the exam. It would give some
| credence to the exams which are now basically crammed
| material.
| mbyio wrote:
| Yeah this stuck out to me too. Adding an extra hop and point of
| failure isn't acceptable IMO.
| miyuru wrote:
| this article from 4 months ago explains it.
|
| https://slack.engineering/building-the-next-evolution-of-clo...
| hanikesn wrote:
| Interesting, I was wondering whether using a shared VPC would
| have been the better solution, but it turns out they use a
| shared VPC per region and peer them via TGW. IMHO it'd be
| worth peering those regions individually, to get rid of that
| potential bottleneck. Of course you lose quite a few
| interesting features.
| johnnymonster wrote:
| TL;DR outage caused by traffic spike from people returning to
| work after the holiday.
| ianrw wrote:
| I wonder if you can pre-warm TGWs like you can ELB? It would be
| annoying to have AWS prewarm a bunch of your stuff, but
| it's better than it going down.
| kparaju wrote:
| Some lessons I took from this retro:
|
| - Disable autoscaling if appropriate during an outage. For
| example, if the web server is degraded, it's probably best to
| make sure that the backends don't autoscale down (see the
| sketch below).
|
| - Panic mode in Envoy is amazing!
|
| - Ability to quickly scale your services is important, but that
| metric should also take into account how quickly the underlying
| infrastructure can scale. Your pods could spin up in 15 seconds
| but k8s nodes will not!
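|
| A minimal sketch of the first point, assuming EC2 auto scaling
| groups (the group name is a placeholder): suspending only the
| Terminate process stops scale-down while still allowing new
| capacity to launch.
|
|     import boto3
|
|     autoscaling = boto3.client("autoscaling")
|
|     def freeze_scale_down(asg_name):
|         """During an incident, stop the ASG from terminating
|         instances out from under you."""
|         autoscaling.suspend_processes(
|             AutoScalingGroupName=asg_name,
|             ScalingProcesses=["Terminate"])
|
|     def unfreeze_scale_down(asg_name):
|         autoscaling.resume_processes(
|             AutoScalingGroupName=asg_name,
|             ScalingProcesses=["Terminate"])
|
|     # e.g. freeze_scale_down("web-tier-asg")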
___________________________________________________________________
(page generated 2021-02-01 23:01 UTC)