[HN Gopher] Slack's migration to a cellular architecture
       ___________________________________________________________________
        
       Slack's migration to a cellular architecture
        
       Author : serial_dev
       Score  : 206 points
       Date   : 2023-08-26 17:27 UTC (5 hours ago)
        
 (HTM) web link (slack.engineering)
 (TXT) w3m dump (slack.engineering)
        
       | random3 wrote:
        | This brings back memories - we spec'd an open distributed
       | operating system called Metal Cell and built an implementation
       | called Cell-OS. It was inspired by the "Datacenter as a computer"
       | paper, but built with open-source tech.
       | 
        | We had it running across bare metal, AWS and Azure, and one of
       | the key aspects was that it handled persistent workloads for big
       | data, including distributed databases.
       | 
       | Kubernetes was just getting built when we started and was
       | supposed to be a Mesos scheduler initially.
       | 
       | I assumed Kubernetes would get all the pieces in and make things
       | easier, but I still miss the whole paradigm we had almost 10
       | years ago.
       | 
       | This is retro now :)
       | 
       | https://github.com/cell-os/metal-cell
       | 
       | https://github.com/cell-os/cell-os
        
       | athoscouto wrote:
        | Their siloing strategy, which I'll roughly refer to as resolving
        | a request from a single AZ, is a good way to keep operations and
       | monitoring simple.
       | 
       | A past team of mine managed services in a similar fashion. We had
       | a couple (usually 2-4) single AZ clusters with a thin (Envoy)
       | layer to balance traffic between clusters.
       | 
       | We could detect incidents in a single cluster by comparing
        | metrics across clusters. Mitigation was easy: we could drain a
        | cluster in under a minute, redirecting traffic to the other ones.
        | Most traffic was intra-AZ, so it was fast and there were no
        | cross-AZ networking fees.
       | 
       | The downside is that most services were running in several
       | clusters, so there was redundancy in compute, caches, etc.
       | 
       | When we talked to people outside the company, e.g. solution
       | architects from our cloud provider, they would be surprised at
       | our architecture and immediately suggest multi-region clusters. I
       | would joke that our single AZ clusters were a feature, not a bug.
       | 
       | Nice to see other folks having success with a similar
       | architecture!
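
        A minimal sketch of the thin routing layer described above, not
        the commenter's actual Envoy setup but a hypothetical weighted
        picker in Go: setting a cluster's weight to zero drains it, and
        the remaining clusters absorb its share of traffic.

            package main

            import (
                "fmt"
                "math/rand"
            )

            // Cluster is a single-AZ backend pool with a routing weight.
            // Setting Weight to 0 drains it: no new requests go there.
            type Cluster struct {
                Name   string
                Weight int
            }

            // pick chooses a cluster with probability proportional to its weight.
            func pick(clusters []Cluster) *Cluster {
                total := 0
                for _, c := range clusters {
                    total += c.Weight
                }
                if total == 0 {
                    return nil // everything drained
                }
                n := rand.Intn(total)
                for i := range clusters {
                    if n < clusters[i].Weight {
                        return &clusters[i]
                    }
                    n -= clusters[i].Weight
                }
                return nil
            }

            func main() {
                clusters := []Cluster{
                    {Name: "az-1", Weight: 100},
                    {Name: "az-2", Weight: 100},
                    {Name: "az-3", Weight: 100},
                }
                clusters[0].Weight = 0 // drain az-1; az-2 and az-3 take its share
                counts := map[string]int{}
                for i := 0; i < 1000; i++ {
                    counts[pick(clusters).Name]++
                }
                fmt.Println(counts) // az-1 receives nothing
            }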
        
         | AugustoCAS wrote:
          | I assume you were using AWS? I know some of the AZs of other
         | cloud providers (Azure? Oracle? Google?) are not fully siloed.
         | They might have independent power and networking, but be in the
         | same physical location.
         | 
         | I'm mentioning this for other people to be aware as one can
         | easily make the assumption that an AZ is the same concept on
         | all clouds, which is not true and painful to realise.
        
         | ComputerGuru wrote:
         | It sounds like you didn't have persistent data, and were only
         | offering compute? If there's no need for a coherent master view
         | accessible/writeable from all the clusters, there would be no
            | reason to use a multi-region cluster whatsoever.
        
           | athoscouto wrote:
           | We did. But the persisted data didn't live inside those
            | ephemeral compute clusters.
        
             | slashdev wrote:
             | So your data store was still multi AZ? I'm a little
             | confused how you'd serve the same user's data consistently
             | from multiple silos. Do you pin users to one AZ?
        
             | lordofnorn wrote:
             | Yeah, keep stateful stuff and stateless stuff separate;
             | separate clusters, network spaces, cloud accounts, likely a
             | mix of all that.
             | 
             | Clearly define boundaries and acceptable behavior within
             | boundaries.
             | 
              | Set up telemetry and observability to monitor for
             | threshold violations.
             | 
             | Simple. Right?
        
               | catchnear4321 wrote:
               | i mean you could also just spin up a reeeeeally big
               | compute node and just do it all there.
               | 
               | fewer things to monitor. fewer things that can fail.
               | 
               | just log in from time to time to update packages.
               | 
               | see, cloud doesn't have to be complex.
        
         | bushbaba wrote:
          | The downside of single-AZ clusters is capacity. If you need to
          | drastically scale up, the compute might not be available in a
          | single AZ.
        
           | jldugger wrote:
            | Indeed, this is the main problem I run into. We have to scale
            | up capacity before the traffic can be redirected, or we
            | briefly double the scope of the outage. That involves
            | multiple layers of capacity bring-up -- the ASG brings up new
            | nodes, then the HPA brings up the new pods.
        
             | ec109685 wrote:
              | If there's uncorrelated load you can also run on your
              | hosts, then you can share their spare capacity, in the hope
              | that they don't spike at the same time.
              | 
              | AWS does that with their Lambda architecture to reduce
              | waste.
        
             | Terretta wrote:
              | If you have enough scale that this could be a problem,
              | cookie-cutter across more, smaller AZs so that any one
              | outage is a smaller numerator of lost capacity over the
              | denominator of total scale.
              | 
              | Worth noting that requiring teams to use 3 AZs is a good
              | idea because you get "n"-shaped patterns instead of mirror-
              | shaped patterns, which have very different characteristics
              | for resilience and continuity.
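
        A rough sketch of the scale-before-you-shift ordering jldugger
        describes above. The zone type and its scaleTo, readyNodes and
        setWeight helpers are hypothetical stand-ins, not real ASG, HPA
        or load balancer APIs: capacity is added to the surviving zones
        and confirmed ready before any traffic weight is moved.

            package main

            import (
                "context"
                "fmt"
                "time"
            )

            // zone is a hypothetical handle for one AZ's cluster.
            type zone struct {
                name    string
                nodes   int // nodes currently ready
                desired int
            }

            func (z *zone) scaleTo(n int)   { z.desired = n }
            func (z *zone) readyNodes() int { z.nodes = z.desired; return z.nodes } // pretend ASG/HPA converged
            func (z *zone) setWeight(w int) { fmt.Printf("%s weight=%d\n", z.name, w) }

            // drain shifts traffic away from `bad` only after the survivors
            // have absorbed its capacity, so the outage scope isn't briefly
            // doubled mid-shift.
            func drain(ctx context.Context, bad *zone, survivors []*zone) error {
                extra := bad.nodes / len(survivors)
                for _, z := range survivors {
                    z.scaleTo(z.nodes + extra)
                }
                // Wait until the new capacity is actually ready.
                for _, z := range survivors {
                    for z.readyNodes() < z.desired {
                        select {
                        case <-ctx.Done():
                            return ctx.Err()
                        case <-time.After(time.Second):
                        }
                    }
                }
                // Only now move traffic.
                bad.setWeight(0)
                for _, z := range survivors {
                    z.setWeight(100)
                }
                return nil
            }

            func main() {
                a := &zone{name: "az-1", nodes: 10}
                b := &zone{name: "az-2", nodes: 10}
                c := &zone{name: "az-3", nodes: 10}
                _ = drain(context.Background(), a, []*zone{b, c})
            }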
        
           | athoscouto wrote:
            | Even though each cluster was single-AZ, the whole system
           | wasn't, so we weren't bound by the capacity of a single AZ.
           | 
           | Most of the situations where we needed to drastically scale
           | up were known ahead of time as well (e.g. campaign from
           | customer), and we would preallocate instances or even more
           | clusters.
           | 
            | I may be forcing my memory, but if I'm not mistaken, our
            | autoscaling was set up in a way that the system could handle
           | sudden load increases of ~50% without noticeable disruption.
           | Spikes bigger than this could lead to increased latency
           | and/or error rate.
        
             | kccqzy wrote:
              | That's another way of saying your typical utilization ratio
              | is about 66%. Which is on the low side, honestly.
              | 
              | That said, it's a trade-off between efficiency and load
              | spike tolerance. I trust that the trade-off was an informed
              | decision.
        
               | rewmie wrote:
                | > That said, it's a trade-off between efficiency and load
                | > spike tolerance. I trust that the trade-off was an
                | > informed decision.
               | 
                | I don't think that relatively low utilization rates are
                | the scenario that requires an "informed decision". The
                | only tradeoff in low utilization rate scenarios is cost,
                | which might turn out cheaper and irrelevant once you do
                | the math on reserved instances vs the cost of scaling up
                | with on-demand instances.
                | 
                | You need to make a damn good case to chronically
                | underprovision your system and expect it to autoscale its
                | way into nickel-and-dime savings.
        
               | ec109685 wrote:
               | 66% isn't low utilization. You're always going to have
               | micro spikes, and you never want to clip, so keeping some
               | headroom around feels smart.
               | 
               | Unless you co-mingle online and offline (batch) traffic
                | on the same hosts, flat response times and high
                | utilization aren't compatible.
        
               | sitkack wrote:
                | High utilization means high variability and low
                | resiliency, and the last few percent of utilization
                | causes highly non-linear effects.
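
        Spelling out the arithmetic behind the ~66% figure above:
        absorbing a sudden +50% spike means provisioning roughly 1.5x the
        steady-state load, so steady-state utilization is about 1/1.5,
        i.e. two-thirds. A tiny Go snippet with the same arithmetic:

            package main

            import "fmt"

            func main() {
                // Headroom for a +50% spike means capacity = 1.5x the steady load.
                headroom := 0.5
                utilization := 1.0 / (1.0 + headroom)
                fmt.Printf("steady-state utilization ~ %.0f%%\n", utilization*100) // ~67%
            }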
        
         | jasonwatkinspdx wrote:
         | Yeah, I talked with a business that used a similar architecture
         | for the same reasons. It can be really effective in multi-
          | tenant apps where each customer's data is fully independent and
         | private. They also used multiple Amazon organizational accounts
         | as a security partition. It made a few things more difficult
         | but they felt the peace of mind was worth it.
        
         | endymi0n wrote:
          | It took me some time to realize that Cloud Solution Architects
          | are also just slightly more technical salespeople in disguise
          | whose only mission is upselling you onto more dependency. Same
          | thing with their PR: every CxO these days says they need
          | "multi-cloud", whatever that means, and the costs are usually
          | enormous, while complexity rises -- with questionable benefit.
          | 
          | I did the math for our own stack after a setback month in
          | client revenue, and decided to put all our servers into a
          | single AZ in a single region. The only multi-AZ, multi-region
          | services are our backups. Surviving bad machines happens often
          | enough that it's priced in via using Kubernetes, but losing a
          | whole AZ is a freak accident that's just SO rare that,
          | calculating real business risk, it seemed apt to pretend it
          | just doesn't happen (sorry, Google Cloud Paris customers).
          | 
          | Call me reckless, but I haven't looked back since, and it
          | saves us thousands of dollars in cross-AZ fees per month alone.
        
           | anon84873628 wrote:
           | Yeah, for many businesses it probably isn't necessary to have
           | crazy short RTO and RPO. Just restore the most recent backup
           | in a new region and point at the cloud provider outage
           | report...
        
           | [deleted]
        
       | memefrog wrote:
       | _" For example slack is an incredibly successful product. But it
       | seems like every week I encounter a new bug that makes it
       | completely unusable for me, from taking seconds per character
       | when typing to being completely unable to render messages.
       | (Discord on the other hand has always been reliable and snappy
       | despite, judging by my highly scientific googling, having 1/3rd
       | as many employees. So it's not like chat apps are just
       | intrinsically hard.) And yet slack's technical advice is popular
       | and if I ran across it without having experienced the results
       | myself it would probably seem compelling."_
       | 
       | https://www.scattered-thoughts.net/writing/on-bad-advice/
        
       | purpleturtle22 wrote:
        | Can someone ELI5 the difference between using AWS availability
        | zone affinity and then simply dropping the downed AZ at the
        | topmost routing point?
        | 
        | Wouldn't that be the same thing, with the obvious caveat that you
        | aren't using the routing technology Slack is using? (We don't -
        | we use vanilla AWS offerings.)
        
         | t0mas88 wrote:
         | They decided to use every routing tool available at least once
         | in their setup, so they can't do this. But there is no
         | explanation in the blog about why they use so many platforms
         | and so many routing tools. Sounds to me like they got
         | themselves into a mess and decided to continue on that path.
        
           | jonathankoren wrote:
           | Somewhere, an engineering "leader" is going to point to this
           | blog post and then say, "Well, that's how Slack did it!" and
            | promptly copy this overwrought system.
        
             | vinnymac wrote:
              | I'm not sure if you're being serious, but in any case: this
              | will happen, as it always does, inevitably.
        
               | sitkack wrote:
                | The warning statement becomes the how-to guide.
        
         | Terretta wrote:
         | You're doing it right.
        
         | ec109685 wrote:
         | Isn't that exactly what they are doing? Keeping requests within
          | an AZ and using global DNS at the first hop into the AZ.
        
         | ec109685 wrote:
          | Isn't that exactly what they are doing? Keeping requests within
          | an AZ, and instead of using DNS at the first hop into the AZ,
          | they use Envoy to control traffic shaping and make the initial
          | decision about whether traffic needs to be routed away.
        
       | chrisweekly wrote:
       | I appreciate the clear explanation of the problem and the
       | solution, which (as is so often the case) seems fairly simple or
       | obvious in retrospect.
       | 
       | Semi-related tangent: sometime around mid-2016, I came across a
       | tool that helped visualize requests in near real-time, and showed
        | what it "looks" like (i.e., flow slows to a trickle in service A
       | during draining, while it ramps up in service B)... there was a
       | really compelling demo, but I never bookmarked it and can't seem
       | to find it. IIRC its name was a single word. Maybe someone
       | reading this will know what I'm talking about... ?
        
         | [deleted]
        
         | mrkeen wrote:
         | Vizceral
        
           | xwowsersx wrote:
           | Neat.
           | 
           | > If a graph of nodes and edges with data about traffic
           | volume is provided, it will render a traffic graph animating
           | the connection volume between nodes.
           | 
           | How would one go about providing such a graph? :)
        
       | ThePhysicist wrote:
       | So they run everything in AWS USE1? That doesn't seem very
       | redundant, but then I guess if the whole of USE1 goes down Slack
       | won't be the only service that will be affected.
        
         | radicality wrote:
          | Isn't the point of the article that they don't? And it
          | describes how they implemented region drains to shift traffic
          | between the different regions.
          | 
          | edit: Hmm, or maybe not? I still sometimes confuse AWS
          | terminology. Perhaps it is all in us-east-1, just in different
          | availability zones (buildings?)
        
           | ThePhysicist wrote:
           | If I understand it correctly they have an edge network for
           | ingress traffic but host all of their core services in a
           | single AWS region (USE1) in multiple availability zones
           | there.
        
           | jldugger wrote:
            | >edit: Hmm, or maybe not? I still sometimes confuse AWS
            | terminology. Perhaps it is all in us-east-1, just in
            | different availability zones (buildings?)
           | 
           | Correct, us-east-1 has several AZs, names like us-east-1a,
           | us-east-1b etc. IIRC us-east-1 has six of them now.
        
         | messe wrote:
         | AWS also uses Slack internally, so add that to the list of shit
         | that can hit the fan if us-east-1/IAD goes down.
        
           | mynameisvlad wrote:
           | Don't they also use Chime? It wouldn't be a single point of
           | failure.
        
             | nostrebored wrote:
             | To contribute to the tangled ball of messaging, slack also
             | uses chime sdk to handle huddles
        
             | skullone wrote:
             | Lots of teams use Slack as well. Oddly enough, I didn't
             | mind Chime as an end-user, but 6 years ago their API
             | features were somewhat lacking.
        
           | fotta wrote:
           | Huh, I'm surprised they're not all in on Chime.
        
             | shepherdjerred wrote:
             | It was all on Chime until the Pandemic. Then they moved to
             | Slack.
        
           | [deleted]
        
         | deanCommie wrote:
         | the "whole" of USE1 very rarely goes down [0], because unlike
         | other cloud providers, Amazon's availability zones are actually
         | independent and decoupled, and if you're running on EC2 in a
         | zonal way it's highly unlikely an outage will affect multiple
         | zones.
         | 
         | [0] There are of course exceptions that come once every few
         | years, but most instances people can think of in terms of
          | widespread outages are one specific _service_ going down in a
          | region, creating a cascade of other dependencies, e.g. Lambda
         | or Kinesis going down and impacting some other higher-level
         | service, say, Translate.
        
           | asah wrote:
           | Am I missing something about us-east-1 reliability ?
           | 
           | https://www.google.com/search?q=us-east-1+reliability
           | https://www.google.com/search?q=us-east-1+outage
        
             | temp_praneshp wrote:
              | Yes. To put it a bit bluntly, you are using a very generic
              | Google search and being blind to nuance.
              | 
              | us-east-1 does have more problems than other regions for a
              | variety of reasons, but it rarely (i.e., once every few
              | years) goes down as a whole. As long as you're in several
              | AZs within us-east-1, the impact of most outages should not
              | take you down completely. In the context of the comment you
              | are replying to, your Google search links are lazy and fail
              | to see the big picture.
        
           | oceanplexian wrote:
            | AZs are buildings, oftentimes right next to each other on the
           | same street. People who think this is a great failure domain
           | for your entire business are deeply misguided. All it takes
           | is a hurricane, a truck hitting a pole, a fire, or any number
           | of extremely common situations and infra will be wiped off
           | the map. Build stuff to be properly multi-region.
        
         | johannes1234321 wrote:
         | But then everybody trying to recover from USE1 outage can't use
         | Slack to coordinate the recovery ...
        
       | enduser wrote:
       | So.. IRC?
        
       | anonshadow wrote:
       | [flagged]
        
         | [deleted]
        
         | [deleted]
        
       | UncleOxidant wrote:
       | Initially read this as: "Slack's Migration to Cellular Automata"
       | and now I'm a little disappointed.
        
       | t0mas88 wrote:
       | They got themselves into a mess here:
       | 
       | > This turns out to have a lot of complexity lurking within.
       | Slack does not share a common codebase or even runtime; services
       | in the user-facing request path are written in Hack, Go, Java,
       | and C++. This would necessitate a separate implementation in each
       | language.
       | 
       | This sounds crazy. I've seen several products where there is a
       | core stack (e.g. Java) and then surrounding tools, analytics etc
       | in Python, R and others. But why would you create such a mess for
       | your primary user request path?
       | 
        | Sure, they're not "just a chat app": they have video, file
        | sharing, etc. included, and a lot of integrations. But this
        | still sounds like a company that had too much money and too
        | little sense while growing rapidly.
        
         | eikenberry wrote:
         | What mess? That sounds like a healthy internal language
         | ecosystem to me. You need _at least_ 2 primary languages to
         | avoid accidental lock-in and maintain good developer diversity.
         | That very paragraph is a great example of how the diversity
         | helped them avoid the trap of plumbing it through their RPCs.
        
           | t0mas88 wrote:
            | Since when is an "internal language ecosystem" a good idea?
            | Technology in a company like Slack exists to deliver useful
            | features and good performance/stability to users faster than
            | competitors can. An app like theirs doesn't sound like
            | something that needs several disparate internal platforms
            | slowing it down.
        
             | nostrebored wrote:
             | How is choosing the right language for a task/team slowing
             | them down?
             | 
             | For large scale, cross cutting initiatives you'll have some
             | pain. For feature velocity, you'll see great results.
             | Everything is a trade off.
        
           | lopkeny12ko wrote:
           | You're suggesting that needing to reimplement the same thing
           | 5 times for every single language in use is a hallmark of a
           | "healthy internal language ecosystem"?
        
         | skullone wrote:
         | It even pains me to see they're suffering from so many own
         | goals. And it's unfortunately reflected in the poor experience
         | using the Slack client. Not to mention the multiple deprecated
         | bot/integration APIs with such bad feature parity between all
         | the different ways to integrate your own tooling into Slack.
        
           | snoman wrote:
           | There was a time when this was the case (and electron was the
           | punching bag for critics at the time, iirc) but I don't think
           | this criticism is fair anymore. Slack is quite responsive and
           | performant these days.
        
           | nostrebored wrote:
           | What do you mean? Slack is one of the most responsive and
           | reliable tools I touch every day.
        
             | lopkeny12ko wrote:
             | I hope this is satire. Slack is one of the slowest work
             | tools I've ever used. Every interaction and click visibly
             | lags.
             | 
             | It's a sad state of the world that almost every application
             | now is written in Javascript and deployed with Electron,
             | and massive memory usage and slow UIs have become accepted
             | as the norm.
             | 
             | Try any IRC client and tell me, with a straight face, that
             | Slack is just as responsive.
        
               | ladzoppelin wrote:
               | So I only use Firefox and the Slack web client and don't
               | experience any lag. I am surprised so many people use the
               | Slack app over a web tab.
        
             | skullone wrote:
             | How slow are the rest of your tools? The Slack client
             | probably performs worse today than it did a few years ago.
             | It has the laggiest interface of any of my tools, you can
             | watch your CPU spike to 60-80% just switching channels.
             | Just do it right now, open up htop/top/atop/Activity
             | Monitor - whatever you want, and just switch channels.
             | Laugh as the Slack client wastes a universe's worth of time
             | just... rendering a DOM with plain text. It is genuinely
             | pathetic how bad the client is.
        
         | tbrownaw wrote:
         | "The right language for each job" was one of the heavy
        | advertising points for microservices. Might still be to some
        | extent, even.
        
           | BoorishBears wrote:
           | The problem is most engineers don't understand the "job".
           | 
           | They see the job as a strictly technical problem looking for
           | the best technical solution. They don't look up and see how
           | that problem fits into the larger organization.
           | 
           | They think things like "I can make a microservice that
           | encodes PDFs 10x faster by using Rust" and give an estimate
           | based on that, never thinking about how we're going to need
           | to hire 2 more Rust devs to keep that running, and we could
           | have delivered twice as quickly if I had used our default
           | Python stack and now our "10x faster" doesn't matter because
           | that feature is old news.
           | 
           | Microservices are such an unfortunate concept because they
           | attract the people least suited to use them: If your team
           | can't handle a monolith, you shouldn't even be looking up
           | what a microservice is.
        
         | pavlov wrote:
        | The only way you get Hack on that list of languages is if they
        | had a policy of letting lead engineers who start a project
        | choose the language at will, and they hired enough lead
         | engineers who previously worked at FB/Meta.
        
           | tlunter wrote:
           | I think that Hack might've been on that list earlier than you
           | think. Slack started as a PHP application.
        
             | matwood wrote:
             | Yeah. If they already had a large php codebase, moving to
             | Hack makes complete sense.
        
         | [deleted]
        
       | awinter-py wrote:
       | their backend being on 2G explains a lot of other stuff about
       | their software
        
       | heywhatupboys wrote:
        | Is Slack dead? Unironically. Does it have a future? With Teams,
        | etc. coming out, it seems most companies do not want to go the
        | Slack route.
        
         | aftbit wrote:
         | Clearly no. Legacy inertia will carry it pretty far, even if
         | literally nobody new tries to sign up for it. Our team is still
         | using Slack and has no plans to migrate away at the moment.
        
         | robertlagrant wrote:
         | Teams is doing well because it's often an IT department's
         | simplest choice, but I don't find it's great for users.
        
           | zo1 wrote:
            | Why would I choose Slack for my employees when Teams
            | integrates so nicely with everything else in the "stack"?
            | Teams is leaps and bounds ahead already, and Slack really
            | missed the boat many years ago.
           | 
           | Speaking of which, I'm going now to buy more Microsoft
           | shares.
        
             | Shared404 wrote:
              | > Why would I choose Slack for my employees when Teams
              | > integrates so nicely with everything else in the "stack"?
             | 
              | Does it really though? In my experience Teams has a buggy
              | integration with other things in the stack.
              | 
              | And Teams itself has been massively buggy and a resource
              | hog for the whole time I've used it.
        
             | zdragnar wrote:
             | I don't think I have ever heard someone favorably compare
             | teams chat with slack before. Even when I worked at a
             | company that used teams for video calls and MS for email
             | and calendar and documents and what not, everyone used
             | slack for chat.
             | 
             | I don't think anyone was sad that slack didn't integrate
             | with the other MS services "stack".
        
             | fooster wrote:
             | You are choosing teams. What are your employees choosing?
             | In my experience teams is a terrible mess and a company
             | using it would exclude me from working for the company
             | because they very likely don't give a crap about the day to
             | day experience of the employee.
        
             | grokys wrote:
             | Maybe because you value your employees being able to copy
             | an image from your chat platform?
             | 
             | (Teams still can't copy images, instead you get a massive
             | base64 block of text iirc)
        
           | packetlost wrote:
           | The company I work for has a "Hours wasted because Teams
           | sucks" page that gets updated at least weekly.
           | 
           | Eventually the list will grow so large that we could probably
           | attach a 5-figure dollar amount to it, if it hasn't already.
        
             | Racing0461 wrote:
             | Depending on the size of the company, that value is
             | absolutely insignificant.
        
               | hotnfresh wrote:
               | Bigcos with robust sales truly can't afford the
               | organizational-attentional cost of walking across the
               | street to pick up a $10,000 coin.
        
               | Racing0461 wrote:
               | /s ?
        
               | hotnfresh wrote:
               | No, that's really how it is. They leave opportunities to
               | save or make five-figure (and larger) amounts all the
               | time, because it's not worth the distraction from other
               | activities. And also from straight-up mis-management, but
               | a lot of the time they know exactly what they're doing,
               | and it's on purpose, and it's probably not a mistake.
        
         | devmor wrote:
         | If your goal is to monitor your staff and gather metrics on
         | their communication - Teams outdoes Slack and is incomparable.
         | If your goal is to have a platform that enables your employees
         | to communicate with as little friction as possible, I have yet
         | to see anything capable of replacing Slack.
         | 
          | Teams especially is something I loathe using every day.
         | Everything about the UI and UX gets in the way of what I'm
         | trying to do, rather than assisting in or even enabling it.
         | It's like it doesn't want me to communicate - it wants me to
         | react and offer as little useful information as possible.
        
           | ecshafer wrote:
           | I went from a company using teams to slack a few years ago.
           | Truly night and day. I have such a visceral hatred for Teams,
           | it actually surprises me how much I can dislike some software
           | that is for messaging. From how it can't copy and paste in
           | and out of chat, to the way it sets laptops on fire, or its
           | horrible ui. I really truly hate that software. Please just
           | use slack or god forbid set up an irc node or something.
        
             | [deleted]
        
           | imperialdrive wrote:
           | Agreed. Teams is already the most painful experience, and
           | it's about to get even worse with the new 2.0 version being
           | deployed.
        
         | wasmitnetzen wrote:
         | My employer buys no Microsoft SaaS service, since we're mostly
         | on Google services, so a stand-alone like Slack works quite
         | well. And nobody uses Google Chat.
         | 
         | And besides that, the UX of Teams is miles behind Slack.
        
           | walthamstow wrote:
           | Not even GitHub? I believe that's the only MSFT service we
           | have at my <40 people fintech dayjob
        
           | quickthrower2 wrote:
           | Slack is not good UX in my opinion. It is often hard to see
           | what generated a message notification - so yeah someone
            | called me out, but who? Where? It shows me the latest thread
            | as being from last month when I know there have been more
            | recent ones. It doesn't collapse those threads, so 100-reply
           | incident threads dominate that view. Slack doesn't scale well
           | (UX-wise) above say 30 people.
        
             | snoman wrote:
             | Opinions are valid, for sure. I can tell you that I'm a
             | happy slack user at a company of just over a hundred
             | thousand.
             | 
             | I haven't regularly used teams in about a year, but I would
             | legitimately consider passing on a job offer where they
             | used it.
             | 
             | In a thread where many folks are talking about using the
             | best tools for a job, teams is never the best tool for any
             | form of digital communication.
        
       | gumballindie wrote:
       | "cellular architecture"
       | 
       | What? Does amazon need to push for new sales points or are they
       | simply making up architectures now?
        
         | tedd4u wrote:
         | Cell architecture goes way back, at least 10 years. Tumblr for
         | example.
         | 
         | http://highscalability.com/blog/2012/5/9/cell-architectures....
        
         | CyberDildonics wrote:
          | Sounds like it's two birds with one stone.
        
         | ignoramous wrote:
         | _ex-AWS here_
         | 
          | Maybe marketing, but it is an architecture born out of Amazon's
         | (and AWS's) use of AWS:
         | 
         | - _Reliable scalability: How Amazon.com scales in the cloud_ ,
         | https://www.youtube.com/watch?v=QeW9wCB36ck&t=993 (2022)
         | 
         | - _How AWS minimizes the blast radius of failures_ ,
         | https://youtu.be/swQbA4zub20 (2018)
         | 
         | For massive enterprise products like Slack that need close to
         | 100% uptime across all their services, cells make sense.
        
           | mike_hock wrote:
           | Cells, interlinked.
        
           | gumballindie wrote:
            | Yeah, that's what microservices were meant to achieve. I
            | suppose the market is saturated with "microservices", so a
            | new term was needed.
        
             | ignoramous wrote:
             | Microservices is one reason you need cells. If you haven't,
             | the second talk I linked to might interest you.
        
         | diarrhea wrote:
         | A big term for a simple design principle indeed.
         | 
         | But their implementation isn't as grim as what I had initially
         | envisioned when hearing that term. I immediately thought of
         | Smalltalk and the idea of objects sitting next to each other,
         | forming a graph (of no particular structure... just a graph),
         | passing messages to neighbours. Like cells in an organism send
         | hormones and whatnot. That makes for a huge mess that cannot be
          | reasoned about, which is why we instead went with stricter
         | structures like trees for (single) inheritance. That's much
         | closer to this silo approach, which seems nice and reasonable
         | (although I get the impression considerable complexity was
          | swept under the rug, like global DB consistency; the silos
         | cannot truly be siloed).
        
         | mike_hock wrote:
          | Why is that an either/or?
        
         | [deleted]
        
       | aftbit wrote:
       | How can such an architecture function with respect to user data?
       | If the DB instance primary handling your shard is in AZ-1 and
       | AZ-1 gets drained, how can your writes continue to be serviced?
        
         | [deleted]
        
         | progbits wrote:
         | Usually in distributed strongly consistent and durable systems,
         | data is not considered committed until it has been persisted in
         | multiple replicas.
         | 
         | So if one goes down nothing is lost, but capacity and
          | durability are degraded.
        
           | skybrian wrote:
           | That makes sense on its own, but doesn't it mean that there
           | are lots of network requests happening between silos all the
           | time? It doesn't seem very siloed.
           | 
           | Or is this some lower-level service that "doesn't count"
           | somehow?
        
             | progbits wrote:
              | It's siloed in the sense that if one is down, others are
              | not affected, as long as enough other replicas are healthy
              | to keep the quorum.
             | 
             | You always need cross-AZ traffic, otherwise your data is
             | single homed (which we used to call "your data doesn't
             | exist").
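
        A minimal sketch of the quorum-commit idea progbits describes,
        with a hypothetical Replica interface rather than any real
        database API: a write counts as committed only once a majority of
        replicas acknowledge it, so losing one AZ degrades capacity
        without losing data. A real system would issue the writes in
        parallel with timeouts; this only shows the counting.

            package main

            import (
                "errors"
                "fmt"
            )

            // Replica is a hypothetical interface for one copy of the data,
            // e.g. one database node per AZ. Append returns an error if the
            // replica is down or fails to persist the record.
            type Replica interface {
                Append(record []byte) error
            }

            // quorumWrite sends the record to every replica and reports success
            // only if a majority acknowledged it. With 3 replicas (one per AZ),
            // the write survives any single AZ being drained or down.
            func quorumWrite(replicas []Replica, record []byte) error {
                needed := len(replicas)/2 + 1
                acks := 0
                for _, r := range replicas {
                    if err := r.Append(record); err == nil {
                        acks++
                    }
                }
                if acks < needed {
                    return fmt.Errorf("write not committed: only %d/%d acks, need %d",
                        acks, len(replicas), needed)
                }
                return nil
            }

            // inMemoryReplica is a toy implementation used to exercise the sketch.
            type inMemoryReplica struct {
                down bool
                log  [][]byte
            }

            func (m *inMemoryReplica) Append(record []byte) error {
                if m.down {
                    return errors.New("replica unavailable")
                }
                m.log = append(m.log, record)
                return nil
            }

            func main() {
                replicas := []Replica{
                    &inMemoryReplica{},
                    &inMemoryReplica{},
                    &inMemoryReplica{down: true}, // one AZ drained
                }
                fmt.Println(quorumWrite(replicas, []byte("message"))) // <nil>: 2/3 acks
            }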
        
         | dexwiz wrote:
         | Multiple tiers of redundancy. There is usually redundancy
         | within the AZ and then a following copy in another AZ. Usually
         | at least four copies exist for a tenant.
        
       | danielovichdk wrote:
       | "A single Slack API request from a user (for example, loading
       | messages in a channel) may fan out into hundreds of RPCs to
       | service backends, each of which must complete to return a correct
       | response to the user."
       | 
        | Not being a dick here, but is this not a fairly obvious flaw?
        | 
        | I mean, why not keep a structured "message log" of all channels
        | for all time?
        | 
        | For every write, the system updates the message log.
        | 
        | I am guessing and making assumptions, I know.
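
        A rough sketch of the per-channel append-only log the commenter
        is imagining. This is entirely hypothetical and says nothing
        about Slack's actual storage design: writes append to a channel's
        log, and loading a channel becomes one sequential read instead of
        a fan-out of RPCs.

            package main

            import (
                "fmt"
                "sync"
            )

            // Message is one chat message in a channel.
            type Message struct {
                Seq    uint64
                Author string
                Text   string
            }

            // ChannelLog is a hypothetical append-only log, one per channel.
            type ChannelLog struct {
                mu       sync.Mutex
                messages []Message
            }

            // Append adds a message and assigns it the next sequence number.
            func (c *ChannelLog) Append(author, text string) Message {
                c.mu.Lock()
                defer c.mu.Unlock()
                m := Message{Seq: uint64(len(c.messages) + 1), Author: author, Text: text}
                c.messages = append(c.messages, m)
                return m
            }

            // ReadSince returns all messages after the given sequence number, so
            // loading a channel is one sequential scan instead of many RPCs.
            func (c *ChannelLog) ReadSince(seq uint64) []Message {
                c.mu.Lock()
                defer c.mu.Unlock()
                if seq >= uint64(len(c.messages)) {
                    return nil
                }
                return append([]Message(nil), c.messages[seq:]...)
            }

            func main() {
                log := &ChannelLog{}
                log.Append("alice", "hello")
                log.Append("bob", "hi")
                fmt.Println(log.ReadSince(0)) // both messages
            }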
        
         | skullone wrote:
          | XMPP was extensible to support all this in the early 2000s.
         | Slack reinvented simple services in the most obtuse way. I have
         | to use Slack and I sideline quarterback all the ways things
         | could have been better every day.
        
       | [deleted]
        
       | madduci wrote:
       | Cellular architecture? They've just rediscovered the art of
       | redundancy systems
        
         | [deleted]
        
         | Terretta wrote:
          | Indeed, for 20+ years of distributed data centers (remember AZs
          | are generally separate DCs near a city but on different grids,
          | while regions are geographically disparate cities) we called
          | this the "shared nothing" architecture pattern.
          | 
          | Here's AWS's 2019 guide for financial services on AWS, where
          | the isolated-stack concept is referenced under the parallel
          | resiliency section and called "shared nothing":
         | 
         | https://d1.awsstatic.com/Financial%20Services/Resilient%20Ap...
        
         | politelemon wrote:
         | It's a common pattern in tech. Everything old will be new
         | again.
        
           | donutshop wrote:
           | But kubernetes
        
         | [deleted]
        
           | [deleted]
        
         | benatkin wrote:
         | To me it seems without the art. The costs will be passed on to
         | the customers. I think there must be good ways to do redundancy
         | without having all services running at full blast in each
         | Availability Zone.
         | 
         | It's a blunt tool, much like PHP. PHP does seem to be a good
         | choice for them, but I wouldn't want to work there. It's all
         | right, there are different ways to do stuff.
        
         | gumballindie wrote:
         | Oh hey they now have a new buzzword to sell!
        
         | skullone wrote:
         | But if they call it cellular architecture, it sounds much more
         | exotic than a shared-nothing active/active service!
        
       | inertially wrote:
       | [dead]
        
       | skullone wrote:
       | So they used a feature built into a load balancer to gracefully
       | drain traffic from specific availability zones? Odd that a
        | feature found in load balancers for the last 25 years is a blog-
        | post-worthy thing.
        
         | progbits wrote:
         | The other bit is separating the service into isolated cells so
         | issues in one don't affect dependent services everywhere like
         | they had experienced before.
         | 
         | But yeah any good SRE could point this out years ago.
        
           | skullone wrote:
           | Just odd a company worth billions and billions of dollars is
            | just now discovering HA models standard since the 90s. You
            | can expand the Clos network architecture to these distributed
           | service applications too. But judging by Slack's client
           | quality, mature concepts such as those must be new to them.
        
             | [deleted]
        
             | antoniojtorres wrote:
             | The linked AWS article specifically explains that it's not
             | just the typical single load balancer for cross AZ routing.
             | I frankly don't know where you're getting that this means
             | that HA is new to them.
        
               | skullone wrote:
               | Of course this isn't a typical single load balancer for
               | cross AZ - but the general gist of their "new"
               | architecture is first principles level of design. But
               | sure, we can celebrate their minor achievement I guess
        
         | [deleted]
        
         | jameshart wrote:
         | That seems like a shallow dismissal. In a distributed system,
         | making sure that sub requests are handled across distributed
         | nodes within the local AZ, and correctly draining traffic from
         | AZs with partial component service outages, is not as trivial
         | as 'using a feature built in to a load balancer'.
        
           | skullone wrote:
           | It may be shallow, but architecting for this is not really
           | "advanced, FAANG-only accessible methodology". I'm surprised
           | their services have been as "reliable" as they have been
           | considering such trivial stuff is just now being employed in
           | their architecture.
        
             | jameshart wrote:
             | Half the complaints on here on architecture posts are 'you
             | don't need this kind of stuff unless you're at FAANG
             | scale'. Now we have a write up of something that's
                | accessible to businesses at non-FAANG scale, and we have
             | the new complaint, that this kind of stuff isn't worthy of
             | FAANG-scale architecture.
        
               | skullone wrote:
               | Geo traffic distribution, multi regions/AZs with
               | functionality to weight and drain traffic should be used
               | in most SaaS services where a simple failure somewhere
               | could cost users time and lose company money/goodwill.
               | It's not terribly hard nor expensive.
        
               | nostrebored wrote:
               | Those are all much looser restrictions than routing
                | traffic consistently to a cell.
        
               | mlhpdx wrote:
               | Route 53 latency based routing -> APIGW or ALB -> Lambda
               | or Step Functions -> DDB Global Table.
               | 
               | No reserved capacity (pay for usage), so it works for
                | bootstrapping startups and provides superior resilience
                | while being extremely simple to set up, with almost
               | zero maintenance or patching (even under the hug of
               | death). I don't understand settling for less (and taking
               | longer and paying more for it).
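
        A minimal sketch of the Lambda piece of that stack in Go, using
        the aws-lambda-go and aws-sdk-go-v2 libraries. The "Messages"
        table and its "pk"/"body" attributes are made-up names, and the
        table would be configured as a DynamoDB Global Table to get the
        multi-region replication mentioned above.

            package main

            import (
                "context"
                "net/http"

                "github.com/aws/aws-lambda-go/events"
                "github.com/aws/aws-lambda-go/lambda"
                "github.com/aws/aws-sdk-go-v2/aws"
                "github.com/aws/aws-sdk-go-v2/config"
                "github.com/aws/aws-sdk-go-v2/service/dynamodb"
                "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
            )

            var db *dynamodb.Client

            func init() {
                // Credentials and region come from the Lambda environment.
                cfg, err := config.LoadDefaultConfig(context.Background())
                if err != nil {
                    panic(err)
                }
                db = dynamodb.NewFromConfig(cfg)
            }

            // handler stores the request body under the caller-supplied id.
            // "Messages" and its "pk"/"body" attributes are illustrative names.
            func handler(
                ctx context.Context,
                req events.APIGatewayProxyRequest,
            ) (events.APIGatewayProxyResponse, error) {
                _, err := db.PutItem(ctx, &dynamodb.PutItemInput{
                    TableName: aws.String("Messages"),
                    Item: map[string]types.AttributeValue{
                        "pk":   &types.AttributeValueMemberS{Value: req.PathParameters["id"]},
                        "body": &types.AttributeValueMemberS{Value: req.Body},
                    },
                })
                if err != nil {
                    return events.APIGatewayProxyResponse{StatusCode: http.StatusInternalServerError}, err
                }
                return events.APIGatewayProxyResponse{StatusCode: http.StatusOK}, nil
            }

            func main() {
                lambda.Start(handler)
            }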
        
             | robertlagrant wrote:
             | > architecting for this is not really "advanced, FAANG-only
             | accessible methodology"
             | 
             | Sorry - where are you quoting this claim from?
        
               | wilg wrote:
               | The S in FAANG is for Slack.
        
               | skullone wrote:
               | My own words, but this is fairly trivial in the context
                | of these massive companies with, presumably, PhDs working
               | on their architecture.
        
               | [deleted]
        
               | [deleted]
        
               | [deleted]
        
         | colmmacc wrote:
         | Close but I don't think it's quite 25 years! I added graceful
         | draining to Apache httpd's mod_proxy and mod_proxy_balancer
         | either in 2003 or 2004, and at the time I'm nearly certain it
         | was the first software load balancer to have the feature, and
         | it wasn't available on the hardware load balancers of the time
         | that I had access to ... though I later learned that at least
         | BigIP load balancers had the feature.
         | 
         | At the time, we had healthy debates about whether the feature
         | was useful enough to justify additional complexity, and whether
         | there could be cases where it would backfire. To this day, it's
         | an underused feature. I still regularly run into customers and
         | configurations that cause unnecessary blips to their end-users,
         | so it's nice to see when people dig in and make sure that the
         | next level of networking is working as well as it can.
        
           | robertlagrant wrote:
           | Well played, HN.
        
           | skullone wrote:
           | I migrated some old BigIP load balancers over to Apache in
           | 2004ish, and extended some of mod_proxy to do some "unholy"
           | things at the time. We also did a lot of direct server return
           | stuff when no load balancer you could buy could handle the
           | amount of traffic statefully. Man, how times have changed,
            | and lessons forgotten.
        
           | djbusby wrote:
           | Microsoft bought Convoy in 1998[0]. Then incorporated it into
           | NT4sp6a and Win2k as NLB/WLBS. One of its features was to
           | gracefully remove a server from the cluster after all
            | connections were closed - draining. But a cluster is not the
            | same as an LB.
           | 
           | [0] https://news.microsoft.com/1998/08/24/microsoft-corp-
           | acquire...
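
        Draining at the load balancer pairs with graceful shutdown on the
        backend. As a small illustration in Go (standard library only,
        not tied to any particular load balancer), net/http's Shutdown
        stops accepting new connections and waits for in-flight requests,
        which is what lets an LB take a node out of rotation without
        dropping requests.

            package main

            import (
                "context"
                "log"
                "net/http"
                "os"
                "os/signal"
                "syscall"
                "time"
            )

            func main() {
                srv := &http.Server{
                    Addr: ":8080",
                    Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                        time.Sleep(2 * time.Second) // pretend this request takes a while
                        w.Write([]byte("ok\n"))
                    }),
                }

                // Serve until we're told to drain (e.g. the LB marks us as
                // draining and the orchestrator sends SIGTERM).
                go func() {
                    if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                        log.Fatal(err)
                    }
                }()

                stop := make(chan os.Signal, 1)
                signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
                <-stop

                // Stop accepting new connections, but give in-flight requests
                // up to 30 seconds to complete before exiting.
                ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
                defer cancel()
                if err := srv.Shutdown(ctx); err != nil {
                    log.Printf("forced shutdown: %v", err)
                }
            }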
        
       | alberth wrote:
       | Is Slack still written in Hack/PHP?
        
         | aftbit wrote:
         | from the article:
         | 
         | >Slack does not share a common codebase or even runtime;
         | services in the user-facing request path are written in Hack,
         | Go, Java, and C++.
        
           | skullone wrote:
           | Man what a mess. Meanwhile, everyone else can extend a
           | library used by their common services in a common language
           | trivially.
        
             | hotnfresh wrote:
             | Meh. As long as you've got a good, typed interface for
             | passing messages between them and for having a common
             | understanding of (and versioning system for) key data
             | structures, that's fine for this sort of thing where it's
              | largely processing streams of small messages and events.
             | 
             | ... but it's probably JSON and some JSON-Schema-based "now
             | you have two problems" junk instead of what I described. In
             | which case, yeah, ew, gross. Unless they've made some
             | unusually good choices.
        
               | nostrebored wrote:
               | There are tons of approaches to align on service
                | contracts for JSON-based API calls. There are also
                | frameworks like gRPC which help make contracts explicit.
                | Neither is really uncommon.
        
               | xwowsersx wrote:
               | What are some of those approaches? Are there formal
               | methods and/or tools for doing this?
        
             | wmf wrote:
             | Almost everyone embraced polyglotism and microservices
             | together.
        
             | rs_rs_rs_rs_rs wrote:
             | Let me guess, they should rewrite everything in Javascript?
        
               | tomrod wrote:
               | Nah, Excel. /s
        
               | skullone wrote:
               | Woe is us if they actually did.
        
         | muglug wrote:
         | Yes -- see my recent article https://slack.engineering/hakana-
         | taking-hack-seriously/
         | 
         | We use a few languages to serve client requests, but by far the
         | biggest codebase is written in Hack, which runs inside an
         | interpreter called HHVM that's also used at Facebook.
        
           | WinLychee wrote:
           | PHP has some excellent ideas that other languages can't
           | replicate, while at the same time having terrible ideas that
           | other languages don't have to think about. Overall a huge fan
           | of modern PHP, thanks for this writeup.
        
           | dcgudeman wrote:
           | I noticed that the hack blog (https://hhvm.com/blog/)
           | basically stopped posting updates since the end of 2022. As
           | downstream users of hacklang development have you folks
           | noticed a change in development pace or ambition within the
           | hack development team?
        
             | alberth wrote:
             | I too am super curious about this.
             | 
             | Plus, it seems telling that Threads was developed in Python
             | - not Hack.
             | 
             | (I'm aware IG is Python & it's the same team)
        
               | rubyss wrote:
               | You answered yourself there, Hack is still very widely
               | used inside meta, just less so in IG.
        
             | xwowsersx wrote:
             | Kinda makes sense you would use PHP, even though I'm sure
             | many people are shocked by it. PHP was pretty much born in
             | a web context. The language was created with servers and
             | request/response in mind and it shows.
        
           | koolba wrote:
           | I really like the writing style in that article:
           | 
           | > PHP makes it really easy to make a dynamically-rendered
           | website. PHP also makes it really easy to create an utterly
           | insecure dynamically-rendered website.
        
           | alberth wrote:
           | Hi Matt
           | 
           | Thanks for Psalm!
           | 
           | Curious, if Slack was built today from ground up - what tech
           | stack do you think should/would be used?
        
             | muglug wrote:
             | That's a simple question that's hard to answer.
             | 
             | A slightly different question that's a bit easier to
             | answer: "if I could wave a magic wand and X million lines
             | of code were instantly rewritten and all developers were
             | instantly trained on that language".
             | 
             | There the choice would be limited to languages that have
             | similar or faster perf characteristics to Hack, without
             | sacrificing developer productivity.
             | 
             | Rust is out of the question (compile times for hundreds of
             | devs would instantly sap productivity). PHP, Ruby, Node and
             | Python are too slow -- for the moment at least.
             | 
             | So it would be either Hack or Go. I don't know enough about
             | JVM languages to know whether they would be a good fit.
        
               | davedx wrote:
               | Not erlang?
        
               | conradfr wrote:
               | But Discord uses Rust to improve performance bottlenecks
               | in OTP ;)
        
               | alberth wrote:
               | I like your question way better than mine :)
               | 
               | Some follow-up ...
               | 
                | A. Isn't PHP on par perf-wise with Hack these days? Re:
                | the "PHP is too slow" comment.
                | 
                | B. Have you ever looked into PHP-NGX? Its perf looks
                | impressive, though you lose the benefit of statelessness.
               | 
               | https://github.com/rryqszq4/ngx-php
               | 
               | https://www.techempower.com/benchmarks/#section=data-r21
        
               | muglug wrote:
                | > Isn't PHP on par perf-wise with Hack these days?
               | 
               | No. But I don't have any numbers, because it's been years
               | since the two languages were directly comparable on
               | anything but a teeny tiny example program.
               | 
               | Facebook gets big cost savings from a 1% improvement in
               | performance, so they make sure that performance is as
               | good as it can possibly be. They have a team of engineers
               | working on the problem.
               | 
               | PHP doesn't have any engineers working on performance
               | full-time -- it's impossible for the language to compete
               | there. Hack has also removed a bunch of PHP constructs
               | (e.g. magic methods) that are a drain on performance, so
               | there's no way to close the gap.
               | 
               | But that should in no way make you choose Hack over PHP.
               | Apart from anything else, the delta won't matter for
               | 99.9% of websites.
        
               | syspec wrote:
               | Thank you for being brave enough not to suggest Rust.
        
             | [deleted]
        
       | fiddlerwoaroof wrote:
       | The thing I don't understand about Slack is how the core
       | functionality seems to have continuously degraded since I started
       | using it in ~2015. When I started using it, its core message
       | sending features basically didn't have the issues with delayed
       | messages or failure to send that I had experienced with
       | competitors. Now, I routinely have to reset the app/clear the
       | cache and go through various dances to get files to upload
       | reliably (add the file to a message, wait five or ten seconds,
       | then hit send). It's nice to see these technical write-ups about
       | improving the infrastructure behind Slack, but I'd like to see
       | fewer feature launches and more stability improvements to make
       | the web, desktop and mobile apps feel like reliable software
       | again. (nice to haves would be re-launching the XMPP and IRC
       | bridges)
        
         | [deleted]
        
         | tmpX7dMeXU wrote:
         | Not to "works on my machine" you, but I...genuinely do not have
         | these problems. I've never heard it from my team either. So we
         | could at the very least say it's not a widespread global issue.
         | 
         | Even the percentage of nerds that would want IRC or XMPP
         | bridges back would have to be vanishingly small. I'd be annoyed
         | if Slack reimplemented such functionality because it no doubt
         | slows down future development. Slack has a number of mechanics
         | that do not carry across to IRC or XMPP, and they did when they
         | killed the bridges. I'd be annoyed if new features were
         | compromised to increase compatibility with this blatant nerd
         | vanity project.
        
           | fiddlerwoaroof wrote:
           | So, it's workspace and user/device specific: two of the
           | workspaces I interact with regularly have these problems and
           | the problems also show up intermittently for some users and
           | not others. (Anecdotally, my experience is that
           | Matrix/Element used to be annoying compared to the Slack
           | experience and now I mostly prefer it to Slack)
           | 
           | I would be fine with the understanding that the IRC bridge
           | was missing functionality (and it always was). Although
           | threads might make it impossible to implement in a nice way
           | now.
           | 
           | As far as new features go, I don't want any new features in
           | Slack: it worked exactly like I wanted it to seven years ago
           | and the new stuff is nice, but not worth the degradation in
           | user experience.
        
         | [deleted]
        
         | fulladder wrote:
         | I haven't used Slack in a long time, but isn't this just the
         | normal enshittification cycle that occurs with all Internet
         | products? The founders got a nice exit several years back, I
         | doubt they stuck around at Salesforce for long, so it's natural
         | that the product would deteriorate over time.
         | 
         | Slack IRC bridging in the 2014/2015 era was great. We had a lot
         | of people who spent their whole workday in a terminal window
         | and weren't interested in running a web browser in the
         | background continuously just for a chat room.
        
           | memefrog wrote:
           | >isn't this just the normal enshittification cycle that
           | occurs with all Internet products?
           | 
           | No! Stop diluting this word.
        
             | jmull wrote:
             | This is the Cory Doctorow sense of the word, is it not?
             | 
             | (Or, now that I notice your username, maybe you're making
             | an ironic joke, since complaining about the misuse of the
              | word enshittification is a meme now?)
        
             | fulladder wrote:
             | > No! Stop diluting this word.
             | 
             | Yes, you're right, I'm misusing it.
             | 
             | However, I think that there is a phenomenon that happens to
             | a lot of tech products that is more general than what
             | Doctorow is talking about. There is a certain type of
             | person who is attracted to building a new thing, and there
             | is a different type of person who is attracted to a thing
             | that is already successful. Pioneers and Settlers, as a
             | former colleague of mine described it. In the context of
             | Internet services, pioneers care a lot about attracting
             | users initially so they tend to dwell on every minor
             | detail. Settlers care a lot about stability, so gradual
             | degradation over time (e.g., in performance, in other
             | measures of quality) is tolerable as long as its rate is
             | controllable and well-understood.
             | 
             | I think that Doctorow's thesis is a special case of this
             | where greed is the driving factor behind the gradual
             | erosion of quality.
        
           | fiddlerwoaroof wrote:
           | > Isn't this just the normal enshittification cycle that
           | occurs with all Internet products?
           | 
           | Yeah, although one can dream that some SaaS company would do
            | things differently.
        
         | ec109685 wrote:
          | They support much, much larger workspaces now, and support
          | team-to-team shared channels, so the problem space is much more
          | complex than in 2015.
         | 
         | Not saying they shouldn't fix their reliability. Every other
         | week it seems like they have an outage with this or that.
         | 
          | The Flickr-style "commit to production multiple times per day"
          | approach seems to have its limits. Perhaps longer canaries and
          | slower rollouts would help.
        
       | dr_kiszonka wrote:
        | Nice write-up!
        | 
        | > If no new requests from users are arriving in a siloed AZ,
        | > internal services in that AZ will naturally quiesce as they
        | > have no new work to do.
       | 
       | Not necessarily because, due to some bug, there may be resource-
       | hungry jobs running indefinitely. (Slack's engineers must have
       | considered this; I am just nitpicking this particular part of the
       | text.)
        
         | ninkendo wrote:
         | If you replace "because" with "if", your comment makes more
         | sense. "If" there are such bugs, you are right, but such bugs
         | might not exist.
        
       ___________________________________________________________________
       (page generated 2023-08-26 23:00 UTC)