[HN Gopher] Slack's migration to a cellular architecture
___________________________________________________________________
Slack's migration to a cellular architecture
Author : serial_dev
Score : 206 points
Date : 2023-08-26 17:27 UTC (5 hours ago)
(HTM) web link (slack.engineering)
(TXT) w3m dump (slack.engineering)
| random3 wrote:
| This brings back memories - we specced an open distributed
| operating system called Metal Cell and built an implementation
| called Cell-OS. It was inspired by the "Datacenter as a computer"
| paper, but built with open-source tech.
|
| We had it running across bare metal, AWS, and Azure, and one of
| the key aspects was that it handled persistent workloads for big
| data, including distributed databases.
|
| Kubernetes was just getting built when we started and was
| supposed to be a Mesos scheduler initially.
|
| I assumed Kubernetes would get all the pieces in and make things
| easier, but I still miss the whole paradigm we had almost 10
| years ago.
|
| This is retro now :)
|
| https://github.com/cell-os/metal-cell
|
| https://github.com/cell-os/cell-os
| athoscouto wrote:
| Their siloing strategy, which I'll roughly refer to as resolving
| a request from a single AZ, is a good way to keep operations and
| monitoring simple.
|
| A past team of mine managed services in a similar fashion. We had
| a couple (usually 2-4) single AZ clusters with a thin (Envoy)
| layer to balance traffic between clusters.
|
| We could detect incidents in a single cluster by comparing
| metrics across clusters. Mitigation was easy, we could drain a
| cluster in under a minute, redirecting traffic to the other ones.
| Most traffic was intra-AZ, so it was fast and there were no
| cross-AZ networking fees.
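|
| (If it helps to picture it, the cross-cluster comparison was
| conceptually no fancier than this toy sketch; hypothetical Go,
| not what we actually ran:)
|
|     package main
|
|     import (
|         "fmt"
|         "sort"
|     )
|
|     // Flag clusters whose error rate sits far above the median
|     // of their peers; comparing like-for-like clusters makes a
|     // single-cluster incident stand out.
|     func outliers(errRate map[string]float64, tol float64) []string {
|         rates := make([]float64, 0, len(errRate))
|         for _, r := range errRate {
|             rates = append(rates, r)
|         }
|         sort.Float64s(rates)
|         median := rates[len(rates)/2]
|         var bad []string
|         for c, r := range errRate {
|             if r > median+tol {
|                 bad = append(bad, c)
|             }
|         }
|         return bad
|     }
|
|     func main() {
|         // az3 is clearly unhealthy relative to its peers.
|         fmt.Println(outliers(map[string]float64{
|             "az1": 0.2, "az2": 0.3, "az3": 9.5}, 2.0))
|     }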
|
| The downside is that most services were running in several
| clusters, so there was redundancy in compute, caches, etc.
|
| When we talked to people outside the company, e.g. solution
| architects from our cloud provider, they would be surprised at
| our architecture and immediately suggest multi-region clusters. I
| would joke that our single AZ clusters were a feature, not a bug.
|
| Nice to see other folks having success with a similar
| architecture!
| AugustoCAS wrote:
| I assume you were using AWS? I know some of the AZs of other
| cloud providers (Azure? Oracle? Google?) are not fully siloed.
| They might have independent power and networking, but be in the
| same physical location.
|
| I'm mentioning this for other people to be aware as one can
| easily make the assumption that an AZ is the same concept on
| all clouds, which is not true and painful to realise.
| ComputerGuru wrote:
| It sounds like you didn't have persistent data, and were only
| offering compute? If there's no need for a coherent master view
| accessible/writeable from all the clusters, there would be no
| reason to use a multi-region cluster whatsoever.
| athoscouto wrote:
| We did, but the persisted data didn't live inside those
| ephemeral compute clusters.
| slashdev wrote:
| So your data store was still multi AZ? I'm a little
| confused how you'd serve the same user's data consistently
| from multiple silos. Do you pin users to one AZ?
| lordofnorn wrote:
| Yeah, keep stateful stuff and stateless stuff separate;
| separate clusters, network spaces, cloud accounts, likely a
| mix of all that.
|
| Clearly define boundaries and acceptable behavior within
| boundaries.
|
| Set up telemetry and observability to monitor for
| threshold violations.
|
| Simple. Right?
| catchnear4321 wrote:
| i mean you could also just spin up a reeeeeally big
| compute node and just do it all there.
|
| fewer things to monitor. fewer things that can fail.
|
| just log in from time to time to update packages.
|
| see, cloud doesn't have to be complex.
| bushbaba wrote:
| The downside of single AZ clusters is capacity. If you need to
| drastically scale up, the compute might not be available in a
| single AZ.
| jldugger wrote:
| Indeed, this is the main problem I run into. We have to scale
| up capacity before the traffic can be redirected, or we
| basically double the scope of the outage briefly. This
| involves multiple layers of capacity bringup -- ASG brings up
| new nodes, then HPA brings up the new pods.
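|
| (Roughly the ordering, as a toy Go sketch; the helper functions
| are made-up stand-ins for the ASG/HPA/load-balancer APIs, not
| real clients:)
|
|     package main
|
|     import (
|         "context"
|         "fmt"
|     )
|
|     // Hypothetical stand-ins for the ASG, HPA and LB APIs.
|     func scaleUp(ctx context.Context, az string) error {
|         fmt.Println("scale up", az)
|         return nil
|     }
|
|     func waitHealthy(ctx context.Context, azs []string) error {
|         fmt.Println("wait for capacity in", azs)
|         return nil
|     }
|
|     func setWeight(ctx context.Context, az string, w int) error {
|         fmt.Println("set weight of", az, "to", w)
|         return nil
|     }
|
|     // Grow the surviving AZs first, wait for the new capacity
|     // to be healthy, and only then stop sending traffic to the
|     // bad AZ; otherwise the redirected load lands on cold
|     // capacity and briefly widens the outage.
|     func drainAZ(ctx context.Context, bad string, rest []string) error {
|         for _, az := range rest {
|             if err := scaleUp(ctx, az); err != nil {
|                 return err
|             }
|         }
|         if err := waitHealthy(ctx, rest); err != nil {
|             return err
|         }
|         return setWeight(ctx, bad, 0)
|     }
|
|     func main() {
|         _ = drainAZ(context.Background(), "us-east-1a",
|             []string{"us-east-1b", "us-east-1c"})
|     }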
| ec109685 wrote:
| If there's uncorrelated load you can also run on your
| hosts, you can share their spare capacity, with the hope
| that they don't spike at the same time.
|
| AWS does that with their Lambda architecture to reduce waste.
| Terretta wrote:
| If you have enough scale that this could be a problem,
| cookie-cutter across more, smaller AZs so any one outage is a
| smaller numerator of lost capacity over the denominator of
| total scale.
|
| Worth noting that requiring teams to use 3 AZs is a good idea
| because you get "n"-shaped patterns instead of mirrored
| patterns, which have very different characteristics for
| resilience and continuity.
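|
| (Back-of-the-envelope, my numbers not AWS's: with N equal cells
| you have to carry roughly N/(N-1) times peak to ride out losing
| one, so more and smaller cells shrink the overhead.)
|
|     package main
|
|     import "fmt"
|
|     // Overprovisioning needed to survive losing one of N
|     // equally sized AZs while still serving peak load.
|     func main() {
|         for _, n := range []float64{2, 3, 4, 6} {
|             fmt.Printf("%v AZs -> %.2fx peak capacity\n",
|                 n, n/(n-1))
|         }
|     }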
| athoscouto wrote:
| Even though each cluster was single AZ the whole system
| wasn't, so we weren't bound by the capacity of a single AZ.
|
| Most of the situations where we needed to drastically scale
| up were known ahead of time as well (e.g. a campaign from a
| customer), and we would preallocate instances or even more
| clusters.
|
| I may be forcing my memory, but if I'm not mistaken, our auto
| scaling was set up in a way that the system could handle
| sudden load increases of ~50% without noticeable disruption.
| Spikes bigger than this could lead to increased latency
| and/or error rate.
| kccqzy wrote:
| That's another way of saying your typical utilization ratio
| is about 66% (absorbing a 50% spike means running at no more
| than 1/1.5 of capacity). Which is on the low side honestly.
|
| That said, it's a trade off between efficiency and load
| spike tolerance. I trust that the trade off is made as an
| informed decision.
| rewmie wrote:
| > That said, it's a trade off between efficiency and load
| spike tolerance. I trust that the trade off is made as an
| informed decision.
|
| I don't think that relatively low utilization rates are
| the scenario that requires "informed decision". The only
| tradeoff in low utilization rate scenarios is cost, which
| might be outright cheaper and irrelevant once you do the
| math on the tradeoffs of using reserved instances vs the
| cost of scaling up with on-demand instances.
|
| You need to make a damn good case to chronically
| underprovision your system and expect it to autoscale
| your way into nickel-and-dime savings.
| ec109685 wrote:
| 66% isn't low utilization. You're always going to have
| micro spikes, and you never want to clip, so keeping some
| headroom around feels smart.
|
| Unless you co-mingle online and offline (batch) traffic
| on the same hosts, flat response times and high utilization
| aren't compatible.
| sitkack wrote:
| High utilization means high variability and low
| resiliency, and the last few percentage points of
| utilization cause highly non-linear effects.
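|
| (A toy single-queue model shows the shape: in an M/M/1 queue,
| time in system relative to service time grows like
| 1/(1 - utilization), so the last few points are brutal.)
|
|     package main
|
|     import "fmt"
|
|     // M/M/1 toy model: mean time in system, relative to the
|     // service time, is 1/(1-rho). Gentle, until it isn't.
|     func main() {
|         utils := []float64{0.5, 0.66, 0.8, 0.9, 0.95, 0.99}
|         for _, rho := range utils {
|             fmt.Printf("util %.0f%% -> x%.0f latency\n",
|                 rho*100, 1/(1-rho))
|         }
|     }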
| jasonwatkinspdx wrote:
| Yeah, I talked with a business that used a similar architecture
| for the same reasons. It can be really effective in multi-
| tenant apps where each customer's data is fully independent and
| private. They also used multiple Amazon organizational accounts
| as a security partition. It made a few things more difficult
| but they felt the peace of mind was worth it.
| endymi0n wrote:
| It took me some time to realize that Cloud Solution Architects
| are also just slightly more technical sales people in disguise
| whose only mission is upselling you onto more dependency. Same
| thing with their PR: every CxO these days says they need
| "multi-cloud", whatever that means; the costs are usually
| enormous and complexity rises, with questionable benefit.
|
| I did the math for our own stack after a setback month in
| client revenue, and decided to put all our servers into a
| single AZ in a single region. The only multi-AZ, multi-region
| services are our backups. Surviving bad machines happens often
| enough that it's priced in via using Kubernetes, but losing a
| whole AZ is a freak accident that's just SO rare that
| calculating real business risk, it seemed apt to pretend it
| just doesn't happen (sorry, Google Cloud Paris customers).
|
| Call me reckless, but I haven't looked back ever since, and it
| saves us thousands of dollars in cross-AZ fees per month alone.
| anon84873628 wrote:
| Yeah, for many businesses it probably isn't necessary to have
| crazy short RTO and RPO. Just restore the most recent backup
| in a new region and point at the cloud provider outage
| report...
| [deleted]
| memefrog wrote:
| _" For example slack is an incredibly successful product. But it
| seems like every week I encounter a new bug that makes it
| completely unusable for me, from taking seconds per character
| when typing to being completely unable to render messages.
| (Discord on the other hand has always been reliable and snappy
| despite, judging by my highly scientific googling, having 1/3rd
| as many employees. So it's not like chat apps are just
| intrinsically hard.) And yet slack's technical advice is popular
| and if I ran across it without having experienced the results
| myself it would probably seem compelling."_
|
| https://www.scattered-thoughts.net/writing/on-bad-advice/
| purpleturtle22 wrote:
| Can someone ELI5 the difference between using AWS availability
| zone affinity and then simply dropping the downed AZ at the top
| most routing point?
|
| Wouldn't that be the same thing, with the obvious caveat that
| you aren't using the routing technology Slack is using? (We
| don't - we use vanilla AWS offerings.)
| t0mas88 wrote:
| They decided to use every routing tool available at least once
| in their setup, so they can't do this. But there is no
| explanation in the blog about why they use so many platforms
| and so many routing tools. Sounds to me like they got
| themselves into a mess and decided to continue on that path.
| jonathankoren wrote:
| Somewhere, an engineering "leader" is going to point to this
| blog post and then say, "Well, that's how Slack did it!" and
| promptly copy this overwrought system
| vinnymac wrote:
| I'm not sure if you're being serious, but in any case: this
| will happen, as it always does, inevitably.
| sitkack wrote:
| Warning statement becomes the howto guide.
| Terretta wrote:
| You're doing it right.
| ec109685 wrote:
| Isn't that exactly what they are doing? Keeping requests within
| an AZ and using global DNS at the first hop into AZ.
| ec109685 wrote:
| Isn't that exactly what they are doing? Keeping requests within
| an AZ and, instead of using DNS at the first hop into the AZ,
| using Envoy to control traffic shaping and to make that initial
| decision about whether traffic needs to be routed away.
| chrisweekly wrote:
| I appreciate the clear explanation of the problem and the
| solution, which (as is so often the case) seems fairly simple or
| obvious in retrospect.
|
| Semi-related tangent: sometime around mid-2016, I came across a
| tool that helped visualize requests in near real-time, and showed
| what it "looks" like (ie, flow slows to trickle in service A
| during draining, while it ramps up in service B)... there was a
| really compelling demo, but I never bookmarked it and can't seem
| to find it. IIRC its name was a single word. Maybe someone
| reading this will know what I'm talking about... ?
| [deleted]
| mrkeen wrote:
| Vizceral
| xwowsersx wrote:
| Neat.
|
| > If a graph of nodes and edges with data about traffic
| volume is provided, it will render a traffic graph animating
| the connection volume between nodes.
|
| How would one go about providing such a graph? :)
| ThePhysicist wrote:
| So they run everything in AWS USE1? That doesn't seem very
| redundant, but then I guess if the whole of USE1 goes down Slack
| won't be the only service that will be affected.
| radicality wrote:
| Isn't the point of the article that they don't? And it
| describes how they implemented region drains to shift traffic
| between the different regions.
|
| edit: Hmm, or maybe not? I still sometimes confuse AWS
| terminology. Perhaps it is all in us-east-1, just in different
| availability zones (buildings?)
| ThePhysicist wrote:
| If I understand it correctly they have an edge network for
| ingress traffic but host all of their core services in a
| single AWS region (USE1) in multiple availability zones
| there.
| jldugger wrote:
| >edit: Hmm, or maybe not? I still sometimes confuse AWS
| terminology. Perhaps it is all in us-east-1, just in
| different availability zones (buildings?)
|
| Correct, us-east-1 has several AZs, names like us-east-1a,
| us-east-1b etc. IIRC us-east-1 has six of them now.
| messe wrote:
| AWS also uses Slack internally, so add that to the list of shit
| that can hit the fan if us-east-1/IAD goes down.
| mynameisvlad wrote:
| Don't they also use Chime? It wouldn't be a single point of
| failure.
| nostrebored wrote:
| To contribute to the tangled ball of messaging, slack also
| uses chime sdk to handle huddles
| skullone wrote:
| Lots of teams use Slack as well. Oddly enough, I didn't
| mind Chime as an end-user, but 6 years ago their API
| features were somewhat lacking.
| fotta wrote:
| Huh, I'm surprised they're not all in on Chime.
| shepherdjerred wrote:
| It was all on Chime until the Pandemic. Then they moved to
| Slack.
| [deleted]
| deanCommie wrote:
| the "whole" of USE1 very rarely goes down [0], because unlike
| other cloud providers, Amazon's availability zones are actually
| independent and decoupled, and if you're running on EC2 in a
| zonal way it's highly unlikely an outage will affect multiple
| zones.
|
| [0] There are of course exceptions that come once every few
| years, but most instances people can think of in terms of
| widespread outages are one specific _service_ going down in a
| region, creating a cascade of other dependencies. e.g. Lambda
| or Kinesis going down and impacting some other higher-level
| service, say, Translate.
| asah wrote:
| Am I missing something about us-east-1 reliability ?
|
| https://www.google.com/search?q=us-east-1+reliability
| https://www.google.com/search?q=us-east-1+outage
| temp_praneshp wrote:
| Yes. To put it a bit bluntly, you are using a very generic
| Google search and being blind to nuance.
|
| us-east-1 does have more problems than other regions due to a
| variety of reasons, but it rarely (i.e., once every few years)
| goes down as a whole. As long as you're in several AZs
| within us-east-1, the impact of most outages should not
| take you down completely. In the context of the comment you
| are replying to, your google search links are lazy and fail
| to see the big picture.
| oceanplexian wrote:
| AZs are oftentimes buildings right next to each other on the
| same street. People who think this is a great failure domain
| for your entire business are deeply misguided. All it takes
| is a hurricane, a truck hitting a pole, a fire, or any number
| of extremely common situations and infra will be wiped off
| the map. Build stuff to be properly multi-region.
| johannes1234321 wrote:
| But then everybody trying to recover from USE1 outage can't use
| Slack to coordinate the recovery ...
| enduser wrote:
| So.. IRC?
| anonshadow wrote:
| [flagged]
| [deleted]
| [deleted]
| UncleOxidant wrote:
| Initially read this as: "Slack's Migration to Cellular Automata"
| and now I'm a little disappointed.
| t0mas88 wrote:
| They got themselves into a mess here:
|
| > This turns out to have a lot of complexity lurking within.
| Slack does not share a common codebase or even runtime; services
| in the user-facing request path are written in Hack, Go, Java,
| and C++. This would necessitate a separate implementation in each
| language.
|
| This sounds crazy. I've seen several products where there is a
| core stack (e.g. Java) and then surrounding tools, analytics etc
| in Python, R and others. But why would you create such a mess for
| your primary user request path?
|
| Sure, they're not "just a chat app"; they have video, file
| sharing, etc., and a lot of integrations. But still this sounds
| like a company that had too much money and too little sense while
| growing rapidly.
| eikenberry wrote:
| What mess? That sounds like a healthy internal language
| ecosystem to me. You need _at least_ 2 primary languages to
| avoid accidental lock-in and maintain good developer diversity.
| That very paragraph is a great example of how the diversity
| helped them avoid the trap of plumbing it through their RPCs.
| t0mas88 wrote:
| Since when is an "internal language ecosystem" a good idea?
| Technology in a company like Slack exists to deliver useful
| features and good performance/stability to users faster than
| competitors can do it. For an app like theirs it doesn't
| sound like something that needs several disparate internal
| platforms that are slowing them down.
| nostrebored wrote:
| How is choosing the right language for a task/team slowing
| them down?
|
| For large scale, cross cutting initiatives you'll have some
| pain. For feature velocity, you'll see great results.
| Everything is a trade off.
| lopkeny12ko wrote:
| You're suggesting that needing to reimplement the same thing
| 5 times for every single language in use is a hallmark of a
| "healthy internal language ecosystem"?
| skullone wrote:
| It even pains me to see they're suffering from so many own
| goals. And it's unfortunately reflected in the poor experience
| using the Slack client. Not to mention the multiple deprecated
| bot/integration APIs with such bad feature parity between all
| the different ways to integrate your own tooling into Slack.
| snoman wrote:
| There was a time when this was the case (and electron was the
| punching bag for critics at the time, iirc) but I don't think
| this criticism is fair anymore. Slack is quite responsive and
| performant these days.
| nostrebored wrote:
| What do you mean? Slack is one of the most responsive and
| reliable tools I touch every day.
| lopkeny12ko wrote:
| I hope this is satire. Slack is one of the slowest work
| tools I've ever used. Every interaction and click visibly
| lags.
|
| It's a sad state of the world that almost every application
| now is written in Javascript and deployed with Electron,
| and massive memory usage and slow UIs have become accepted
| as the norm.
|
| Try any IRC client and tell me, with a straight face, that
| Slack is just as responsive.
| ladzoppelin wrote:
| So I only use Firefox and the Slack web client and don't
| experience any lag. I am surprised so many people use the
| Slack app over a web tab.
| skullone wrote:
| How slow are the rest of your tools? The Slack client
| probably performs worse today than it did a few years ago.
| It has the laggiest interface of any of my tools, you can
| watch your CPU spike to 60-80% just switching channels.
| Just do it right now, open up htop/top/atop/Activity
| Monitor - whatever you want, and just switch channels.
| Laugh as the Slack client wastes a universe's worth of time
| just... rendering a DOM with plain text. It is genuinely
| pathetic how bad the client is.
| tbrownaw wrote:
| "The right language for each job" was one of the heavy
| advertising points for microservices. Might still be, to some
| extent, even.
| BoorishBears wrote:
| The problem is most engineers don't understand the "job".
|
| They see the job as a strictly technical problem looking for
| the best technical solution. They don't look up and see how
| that problem fits into the larger organization.
|
| They think things like "I can make a microservice that
| encodes PDFs 10x faster by using Rust" and give an estimate
| based on that, never thinking about how we're going to need
| to hire 2 more Rust devs to keep that running, and we could
| have delivered twice as quickly if I had used our default
| Python stack, and now our "10x faster" doesn't matter because
| that feature is old news.
|
| Microservices are such an unfortunate concept because they
| attract the people least suited to use them: If your team
| can't handle a monolith, you shouldn't even be looking up
| what a microservice is.
| pavlov wrote:
| The only way you get Hack on that list of languages is that
| they had a policy of letting lead engineers starting a project
| choose the language at will, and they hired enough lead
| engineers who previously worked at FB/Meta.
| tlunter wrote:
| I think that Hack might've been on that list earlier than you
| think. Slack started as a PHP application.
| matwood wrote:
| Yeah. If they already had a large php codebase, moving to
| Hack makes complete sense.
| [deleted]
| awinter-py wrote:
| their backend being on 2G explains a lot of other stuff about
| their software
| heywhatupboys wrote:
| Is Slack dead? unironically. Does it have a future? With Teams,
| etc. coming out, it seems most companies do not want to go the
| Slack route
| aftbit wrote:
| Clearly no. Legacy inertia will carry it pretty far, even if
| literally nobody new tries to sign up for it. Our team is still
| using Slack and has no plans to migrate away at the moment.
| robertlagrant wrote:
| Teams is doing well because it's often an IT department's
| simplest choice, but I don't find it's great for users.
| zo1 wrote:
| Why would I choose Slack for my employees when Teams
| integrates so nicely with everything else in the "stack"?
| Teams is leaps and bounds ahead already, and Slack really
| missed the boat many years ago.
|
| Speaking of which, I'm going now to buy more Microsoft
| shares.
| Shared404 wrote:
| > Why would I choose Slack for my employees when Teams
| integrates so nicely with everything else in the "stack"?
|
| Does it really though? In my experience teams has a buggy
| integration with other things in the stack.
|
| And Teams itself is massively buggy and has been a resource
| hog the whole time I've used it.
| zdragnar wrote:
| I don't think I have ever heard someone favorably compare
| teams chat with slack before. Even when I worked at a
| company that used teams for video calls and MS for email
| and calendar and documents and what not, everyone used
| slack for chat.
|
| I don't think anyone was sad that slack didn't integrate
| with the other MS services "stack".
| fooster wrote:
| You are choosing teams. What are your employees choosing?
| In my experience teams is a terrible mess and a company
| using it would exclude me from working for the company
| because they very likely don't give a crap about the day to
| day experience of the employee.
| grokys wrote:
| Maybe because you value your employees being able to copy
| an image from your chat platform?
|
| (Teams still can't copy images, instead you get a massive
| base64 block of text iirc)
| packetlost wrote:
| The company I work for has a "Hours wasted because Teams
| sucks" page that gets updated at least weekly.
|
| Eventually the list will grow so large that we could probably
| attach a 5-figure dollar amount to it, if it hasn't already.
| Racing0461 wrote:
| Depending on the size of the company, that value is
| absolutely insignificant.
| hotnfresh wrote:
| Bigcos with robust sales truly can't afford the
| organizational-attentional cost of walking across the
| street to pick up a $10,000 coin.
| Racing0461 wrote:
| /s ?
| hotnfresh wrote:
| No, that's really how it is. They leave opportunities to
| save or make five-figure (and larger) amounts all the
| time, because it's not worth the distraction from other
| activities. And also from straight-up mis-management, but
| a lot of the time they know exactly what they're doing,
| and it's on purpose, and it's probably not a mistake.
| devmor wrote:
| If your goal is to monitor your staff and gather metrics on
| their communication - Teams outdoes Slack and is incomparable.
| If your goal is to have a platform that enables your employees
| to communicate with as little friction as possible, I have yet
| to see anything capable of replacing Slack.
|
| Teams especially is something I loathe using every day.
| Everything about the UI and UX gets in the way of what I'm
| trying to do, rather than assisting in or even enabling it.
| It's like it doesn't want me to communicate - it wants me to
| react and offer as little useful information as possible.
| ecshafer wrote:
| I went from a company using Teams to one using Slack a few
| years ago. Truly night and day. I have such a visceral hatred
| for Teams, it actually surprises me how much I can dislike
| some software that is for messaging. From how it can't copy
| and paste in and out of chat, to the way it sets laptops on
| fire, to its horrible UI. I really truly hate that software.
| Please just use Slack or, god forbid, set up an IRC node or
| something.
| [deleted]
| imperialdrive wrote:
| Agreed. Teams is already the most painful experience, and
| it's about to get even worse with the new 2.0 version being
| deployed.
| wasmitnetzen wrote:
| My employer buys no Microsoft SaaS service, since we're mostly
| on Google services, so a stand-alone like Slack works quite
| well. And nobody uses Google Chat.
|
| And besides that, the UX of Teams is miles behind Slack.
| walthamstow wrote:
| Not even GitHub? I believe that's the only MSFT service we
| have at my <40 people fintech dayjob
| quickthrower2 wrote:
| Slack is not good UX in my opinion. It is often hard to see
| what generated a message notification - so yeah, someone
| called me out, but who? Where? It shows me the latest thread
| as being from last month when I know there have been more
| recent ones. It doesn't collapse those threads, so 100-reply
| incident threads dominate that view. Slack doesn't scale well
| (UX-wise) above say 30 people.
| snoman wrote:
| Opinions are valid, for sure. I can tell you that I'm a
| happy slack user at a company of just over a hundred
| thousand.
|
| I haven't regularly used teams in about a year, but I would
| legitimately consider passing on a job offer where they
| used it.
|
| In a thread where many folks are talking about using the
| best tools for a job, teams is never the best tool for any
| form of digital communication.
| gumballindie wrote:
| "cellular architecture"
|
| What? Does amazon need to push for new sales points or are they
| simply making up architectures now?
| tedd4u wrote:
| Cell architecture goes way back, at least 10 years. Tumblr for
| example.
|
| http://highscalability.com/blog/2012/5/9/cell-architectures....
| CyberDildonics wrote:
| Sounds like it's two birds with one stone.
| ignoramous wrote:
| _ex-AWS here_
|
| May be marketing but it is an architecture born out of Amazon's
| (and AWS's) use of AWS:
|
| - _Reliable scalability: How Amazon.com scales in the cloud_ ,
| https://www.youtube.com/watch?v=QeW9wCB36ck&t=993 (2022)
|
| - _How AWS minimizes the blast radius of failures_ ,
| https://youtu.be/swQbA4zub20 (2018)
|
| For massive enterprise products like Slack that need close to
| 100% uptime across all their services, cells make sense.
| mike_hock wrote:
| Cells, interlinked.
| gumballindie wrote:
| Yeah that's what microservices were meant to achieve. I suppose
| the market is saturated with "microservices", so a new term
| was needed.
| ignoramous wrote:
| Microservices is one reason you need cells. If you haven't,
| the second talk I linked to might interest you.
| diarrhea wrote:
| A big term for a simple design principle indeed.
|
| But their implementation isn't as grim as what I had initially
| envisioned when hearing that term. I immediately thought of
| Smalltalk and the idea of objects sitting next to each other,
| forming a graph (of no particular structure... just a graph),
| passing messages to neighbours. Like cells in an organism send
| hormones and whatnot. That makes for a huge mess that cannot be
| reasoned about, hence why we instead went with stricter
| structures like trees for (single) inheritance. That's much
| closer to this silo approach, which seems nice and reasonable
| (although I get the impression considerable complexity was
| swept under the rug, like global DB consistency; the siloes
| cannot truly be siloed).
| mike_hock wrote:
| Why is that an either/or?
| [deleted]
| aftbit wrote:
| How can such an architecture function with respect to user data?
| If the primary DB instance handling your shard is in AZ-1 and
| AZ-1 gets drained, how can your writes continue to be serviced?
| [deleted]
| progbits wrote:
| Usually in distributed strongly consistent and durable systems,
| data is not considered committed until it has been persisted in
| multiple replicas.
|
| So if one goes down nothing is lost, but capacity and
| durability is degraded.
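|
| (A toy sketch of the idea in Go, not anyone's actual storage
| code: a write counts as committed once a majority of replicas,
| spread across AZs, have acknowledged it.)
|
|     package main
|
|     import (
|         "errors"
|         "fmt"
|     )
|
|     // A write "commits" once a majority of replicas ack it, so
|     // one drained or failed AZ costs headroom, not correctness.
|     func quorumWrite(replicas []func(string) error, v string) error {
|         need := len(replicas)/2 + 1
|         acks := 0
|         for _, write := range replicas {
|             if write(v) == nil {
|                 acks++
|             }
|         }
|         if acks < need {
|             return errors.New("quorum not reached")
|         }
|         return nil
|     }
|
|     func main() {
|         up := func(string) error { return nil }
|         down := func(string) error { return errors.New("az drained") }
|         // Two of three replicas ack: still committed.
|         err := quorumWrite([]func(string) error{up, up, down}, "m")
|         fmt.Println(err) // <nil>
|     }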
| skybrian wrote:
| That makes sense on its own, but doesn't it mean that there
| are lots of network requests happening between silos all the
| time? It doesn't seem very siloed.
|
| Or is this some lower-level service that "doesn't count"
| somehow?
| progbits wrote:
| It's siloed in the sense that if one is down, the others are
| not affected, as long as enough other replicas are healthy to
| keep the quorum.
|
| You always need cross-AZ traffic, otherwise your data is
| single homed (which we used to call "your data doesn't
| exist").
| dexwiz wrote:
| Multiple tiers of redundancy. There is usually redundancy
| within the AZ and then a follower copy in another AZ. Usually
| at least four copies exist for a tenant.
| danielovichdk wrote:
| "A single Slack API request from a user (for example, loading
| messages in a channel) may fan out into hundreds of RPCs to
| service backends, each of which must complete to return a correct
| response to the user."
|
| Not being a dick here but is this not a fairly obvious flaw?
|
| I mean why not keep a structured "message log" of all channels of
| all time?
|
| For every write the system updates the message log.
|
| I am guessing and making assumptions, I know.
| skullone wrote:
| XMPP was extensible to support all this in the early 2000s.
| Slack reinvented simple services in the most obtuse way. I have
| to use Slack and I sideline quarterback all the ways things
| could have been better every day.
| [deleted]
| madduci wrote:
| Cellular architecture? They've just rediscovered the art of
| redundant systems
| [deleted]
| Terretta wrote:
| Indeed, for 20+ years of distributed data centers (remember AZs
| are generally separate DCs near a city but on different grids,
| regions are geographically disparate cities) we called it the
| "shared nothing" architecture pattern.
|
| Here's AWS's 2019 guide for financial services in AWS, where
| the isolated stack concept is referenced under the parallel
| resiliency section and called "shared nothing":
|
| https://d1.awsstatic.com/Financial%20Services/Resilient%20Ap...
| politelemon wrote:
| It's a common pattern in tech. Everything old will be new
| again.
| donutshop wrote:
| But kubernetes
| [deleted]
| [deleted]
| benatkin wrote:
| To me it seems without the art. The costs will be passed on to
| the customers. I think there must be good ways to do redundancy
| without having all services running at full blast in each
| Availability Zone.
|
| It's a blunt tool, much like PHP. PHP does seem to be a good
| choice for them, but I wouldn't want to work there. It's all
| right, there are different ways to do stuff.
| gumballindie wrote:
| Oh hey they now have a new buzzword to sell!
| skullone wrote:
| But if they call it cellular architecture, it sounds much more
| exotic than a shared-nothing active/active service!
| inertially wrote:
| [dead]
| skullone wrote:
| So they used a feature built into a load balancer to gracefully
| drain traffic from specific availability zones? Odd that a
| feature found in load balancers from the last 25 years is a blog
| post worthy thing.
| progbits wrote:
| The other bit is separating the service into isolated cells so
| issues in one don't affect dependent services everywhere like
| they had experienced before.
|
| But yeah any good SRE could point this out years ago.
| skullone wrote:
| Just odd a company worth billions and billions of dollars is
| just now discovering HA models standard since the 90s. Can
| expand the Clos network architecture to these distributed
| service applications too. But judging by Slack's client
| quality, mature concepts such as those must be new to them.
| [deleted]
| antoniojtorres wrote:
| The linked AWS article specifically explains that it's not
| just the typical single load balancer for cross AZ routing.
| I frankly don't know where you're getting that this means
| that HA is new to them.
| skullone wrote:
| Of course this isn't a typical single load balancer for
| cross AZ - but the general gist of their "new"
| architecture is first principles level of design. But
| sure, we can celebrate their minor achievement I guess
| [deleted]
| jameshart wrote:
| That seems like a shallow dismissal. In a distributed system,
| making sure that sub requests are handled across distributed
| nodes within the local AZ, and correctly draining traffic from
| AZs with partial component service outages, is not as trivial
| as 'using a feature built in to a load balancer'.
| skullone wrote:
| It may be shallow, but architecting for this is not really
| "advanced, FAANG-only accessible methodology". I'm surprised
| their services have been as "reliable" as they have been
| considering such trivial stuff is just now being employed in
| their architecture.
| jameshart wrote:
| Half the complaints on here on architecture posts are 'you
| don't need this kind of stuff unless you're at FAANG
| scale'. Now we have a write up of something that's
| accessible to businesses at non-FAANG scale, and we have
| the new complaint, that this kind of stuff isn't worthy of
| FAANG-scale architecture.
| skullone wrote:
| Geo traffic distribution, multi regions/AZs with
| functionality to weight and drain traffic should be used
| in most SaaS services where a simple failure somewhere
| could cost users time and lose company money/goodwill.
| It's not terribly hard nor expensive.
| nostrebored wrote:
| Those are all much looser restrictions than routing
| traffic consistently to a cell
| mlhpdx wrote:
| Route 53 latency based routing -> APIGW or ALB -> Lambda
| or Step Functions -> DDB Global Table.
|
| No reserved capacity (pay for usage), so it works for
| boot strapping startups and provides superior resilience
| while being extremely simple to setup and involves almost
| zero maintenance or patching (even under the hug of
| death). I don't understand settling for less (and taking
| longer and paying more for it).
| robertlagrant wrote:
| > architecting for this is not really "advanced, FAANG-only
| accessible methodology"
|
| Sorry - where are you quoting this claim from?
| wilg wrote:
| The S in FAANG is for Slack.
| skullone wrote:
| My own words, but this is fairly trivial in the context
| of these massive companies with, presumably, PhDs working
| on their architecture.
| [deleted]
| [deleted]
| [deleted]
| colmmacc wrote:
| Close but I don't think it's quite 25 years! I added graceful
| draining to Apache httpd's mod_proxy and mod_proxy_balancer
| either in 2003 or 2004, and at the time I'm nearly certain it
| was the first software load balancer to have the feature, and
| it wasn't available on the hardware load balancers of the time
| that I had access to ... though I later learned that at least
| BigIP load balancers had the feature.
|
| At the time, we had healthy debates about whether the feature
| was useful enough to justify additional complexity, and whether
| there could be cases where it would backfire. To this day, it's
| an underused feature. I still regularly run into customers and
| configurations that cause unnecessary blips to their end-users,
| so it's nice to see when people dig in and make sure that the
| next level of networking is working as well as it can.
| robertlagrant wrote:
| Well played, HN.
| skullone wrote:
| I migrated some old BigIP load balancers over to Apache in
| 2004ish, and extended some of mod_proxy to do some "unholy"
| things at the time. We also did a lot of direct server return
| stuff when no load balancer you could buy could handle the
| amount of traffic statefully. Man, how times have changed,
| and lessons forgotten.
| djbusby wrote:
| Microsoft bought Convoy in 1998[0]. Then incorporated it into
| NT4sp6a and Win2k as NLB/WLBS. One of its features was to
| gracefully remove a server from the cluster after all
| connections were closed - draining. But, cluster not the same
| as an LB.
|
| [0] https://news.microsoft.com/1998/08/24/microsoft-corp-
| acquire...
| alberth wrote:
| Is Slack still written in Hack/PHP?
| aftbit wrote:
| from the article:
|
| >Slack does not share a common codebase or even runtime;
| services in the user-facing request path are written in Hack,
| Go, Java, and C++.
| skullone wrote:
| Man what a mess. Meanwhile, everyone else can extend a
| library used by their common services in a common language
| trivially.
| hotnfresh wrote:
| Meh. As long as you've got a good, typed interface for
| passing messages between them and for having a common
| understanding of (and versioning system for) key data
| structures, that's fine for this sort of thing where it's
| largely processing streams of small messages and events.
|
| ... but it's probably JSON and some JSON-Schema-based "now
| you have two problems" junk instead of what I described. In
| which case, yeah, ew, gross. Unless they've made some
| unusually good choices.
| nostrebored wrote:
| There are tons of approaches to align on service
| contracts for JSON based API calls. There are also
| libraries like gRPC which help make contracts explicit.
| Neither is really uncommon.
| xwowsersx wrote:
| What are some of those approaches? Are there formal
| methods and/or tools for doing this?
| wmf wrote:
| Almost everyone embraced polyglotism and microservices
| together.
| rs_rs_rs_rs_rs wrote:
| Let me guess, they should rewrite everything in Javascript?
| tomrod wrote:
| Nah, Excel. /s
| skullone wrote:
| Woe is us if they actually did.
| muglug wrote:
| Yes -- see my recent article
| https://slack.engineering/hakana-taking-hack-seriously/
|
| We use a few languages to serve client requests, but by far the
| biggest codebase is written in Hack, which runs inside an
| interpreter called HHVM that's also used at Facebook.
| WinLychee wrote:
| PHP has some excellent ideas that other languages can't
| replicate, while at the same time having terrible ideas that
| other languages don't have to think about. Overall a huge fan
| of modern PHP, thanks for this writeup.
| dcgudeman wrote:
| I noticed that the hack blog (https://hhvm.com/blog/)
| basically stopped posting updates since the end of 2022. As
| downstream users of hacklang development have you folks
| noticed a change in development pace or ambition within the
| hack development team?
| alberth wrote:
| I too am super curious about this.
|
| Plus, it seems telling that Threads was developed in Python
| - not Hack.
|
| (I'm aware IG is Python & it's the same team)
| rubyss wrote:
| You answered yourself there: Hack is still very widely
| used inside Meta, just less so in IG.
| xwowsersx wrote:
| Kinda makes sense you would use PHP, even though I'm sure
| many people are shocked by it. PHP was pretty much born in
| a web context. The language was created with servers and
| request/response in mind and it shows.
| koolba wrote:
| I really like the writing style in that article:
|
| > PHP makes it really easy to make a dynamically-rendered
| website. PHP also makes it really easy to create an utterly
| insecure dynamically-rendered website.
| alberth wrote:
| Hi Matt
|
| Thanks for Psalm!
|
| Curious, if Slack was built today from ground up - what tech
| stack do you think should/would be used?
| muglug wrote:
| That's a simple question that's hard to answer.
|
| A slightly different question that's a bit easier to
| answer: "if I could wave a magic wand and X million lines
| of code were instantly rewritten and all developers were
| instantly trained on that language".
|
| There the choice would be limited to languages that have
| similar or faster perf characteristics to Hack, without
| sacrificing developer productivity.
|
| Rust is out of the question (compile times for hundreds of
| devs would instantly sap productivity). PHP, Ruby, Node and
| Python are too slow -- for the moment at least.
|
| So it would be either Hack or Go. I don't know enough about
| JVM languages to know whether they would be a good fit.
| davedx wrote:
| Not erlang?
| conradfr wrote:
| But Discord uses Rust to improve performance bottlenecks
| in OTP ;)
| alberth wrote:
| I like your question way better than mine :)
|
| Some follow-up ...
|
| A. isn't PHP on par perf-wise with Hack these days? Re:
| "PHP is too slow" comment.
|
| B. have you ever looked into PHP-NGX? Its perf looks
| impressive, though you lose the benefit of statelessness
|
| https://github.com/rryqszq4/ngx-php
|
| https://www.techempower.com/benchmarks/#section=data-r21
| muglug wrote:
| > isn't PHP on par perf wise to Hack these days?
|
| No. But I don't have any numbers, because it's been years
| since the two languages were directly comparable on
| anything but a teeny tiny example program.
|
| Facebook gets big cost savings from a 1% improvement in
| performance, so they make sure that performance is as
| good as it can possibly be. They have a team of engineers
| working on the problem.
|
| PHP doesn't have any engineers working on performance
| full-time -- it's impossible for the language to compete
| there. Hack has also removed a bunch of PHP constructs
| (e.g. magic methods) that are a drain on performance, so
| there's no way to close the gap.
|
| But that should in no way make you choose Hack over PHP.
| Apart from anything else, the delta won't matter for
| 99.9% of websites.
| syspec wrote:
| Thank you for being brave enough not to suggest Rust.
| [deleted]
| fiddlerwoaroof wrote:
| The thing I don't understand about Slack is how the core
| functionality seems to have continuously degraded since I started
| using it in ~2015. When I started using it, its core message
| sending features basically didn't have the issues with delayed
| messages or failure to send that I had experienced with
| competitors. Now, I routinely have to reset the app/clear the
| cache and go through various dances to get files to upload
| reliably (add the file to a message, wait five or ten seconds,
| then hit send). It's nice to see these technical write-ups about
| improving the infrastructure behind Slack, but I'd like to see
| fewer feature launches and more stability improvements to make
| the web, desktop and mobile apps feel like reliable software
| again. (nice to haves would be re-launching the XMPP and IRC
| bridges)
| [deleted]
| tmpX7dMeXU wrote:
| Not to "works on my machine" you, but I...genuinely do not have
| these problems. I've never heard it from my team either. So we
| could at the very least say it's not a widespread global issue.
|
| Even the percentage of nerds that would want IRC or XMPP
| bridges back would have to be vanishingly small. I'd be annoyed
| if Slack reimplemented such functionality because it no doubt
| slows down future development. Slack has a number of mechanics
| that do not carry across to IRC or XMPP, and it already had
| them when the bridges were killed. I'd be annoyed if new
| features were
| compromised to increase compatibility with this blatant nerd
| vanity project.
| fiddlerwoaroof wrote:
| So, it's workspace and user/device specific: two of the
| workspaces I interact with regularly have these problems and
| the problems also show up intermittently for some users and
| not others. (Anecdotally, my experience is that
| Matrix/Element used to be annoying compared to the Slack
| experience and now I mostly prefer it to Slack)
|
| I would be fine with the understanding that the IRC bridge
| was missing functionality (and it always was). Although
| threads might make it impossible to implement in a nice way
| now.
|
| As far as new features go, I don't want any new features in
| Slack: it worked exactly like I wanted it to seven years ago
| and the new stuff is nice, but not worth the degradation in
| user experience.
| [deleted]
| fulladder wrote:
| I haven't used Slack in a long time, but isn't this just the
| normal enshittification cycle that occurs with all Internet
| products? The founders got a nice exit several years back, I
| doubt they stuck around at Salesforce for long, so it's natural
| that the product would deteriorate over time.
|
| Slack IRC bridging in the 2014/2015 era was great. We had a lot
| of people who spent their whole workday in a terminal window
| and weren't interested in running a web browser in the
| background continuously just for a chat room.
| memefrog wrote:
| >isn't this just the normal enshittification cycle that
| occurs with all Internet products?
|
| No! Stop diluting this word.
| jmull wrote:
| This is the Cory Doctorow sense of the word, is it not?
|
| (Or, now that I notice your username, maybe you're making
| an ironic joke, since complaining about the misuse of the
| word enshittification is a meme now?)
| fulladder wrote:
| > No! Stop diluting this word.
|
| Yes, you're right, I'm misusing it.
|
| However, I think that there is a phenomenon that happens to
| a lot of tech products that is more general than what
| Doctorow is talking about. There is a certain type of
| person who is attracted to building a new thing, and there
| is a different type of person who is attracted to a thing
| that is already successful. Pioneers and Settlers, as a
| former colleague of mine described it. In the context of
| Internet services, pioneers care a lot about attracting
| users initially so they tend to dwell on every minor
| detail. Settlers care a lot about stability, so gradual
| degradation over time (e.g., in performance, in other
| measures of quality) is tolerable as long as its rate is
| controllable and well-understood.
|
| I think that Doctorow's thesis is a special case of this
| where greed is the driving factor behind the gradual
| erosion of quality.
| fiddlerwoaroof wrote:
| > Isn't this just the normal enshittification cycle that
| occurs with all Internet products?
|
| Yeah, although one can dream that some SaaS company would do
| things differently
| ec109685 wrote:
| They support much much larger workspaces now, and support team
| to team shared channels, so the problem space is much more
| complex than 2015.
|
| Not saying they shouldn't fix their reliability. Every other
| week it seems like they have an outage with this or that.
|
| The Flickr style commit to production multiple times per day
| seems to have its limits. Perhaps longer canary and slower
| rollouts would help.
| dr_kiszonka wrote:
| Nice write-up! _"If no new requests from users are arriving in
| a siloed AZ, internal services in that AZ will naturally
| quiesce as they have no new work to do."_
|
| Not necessarily: due to some bug, there may be resource-
| hungry jobs running indefinitely. (Slack's engineers must have
| considered this; I am just nitpicking this particular part of the
| text.)
| ninkendo wrote:
| If you replace "because" with "if", your comment makes more
| sense. "If" there are such bugs, you are right, but such bugs
| might not exist.
___________________________________________________________________
(page generated 2023-08-26 23:00 UTC)