[HN Gopher] Slack's Incident on 2-22-22
___________________________________________________________________
Slack's Incident on 2-22-22
Author : alphabettsy
Score : 122 points
Date : 2022-04-26 17:26 UTC (5 hours ago)
(HTM) web link (slack.engineering)
(TXT) w3m dump (slack.engineering)
| scrollaway wrote:
| A post-mortem with non-ISO dates? Even on THAT date?! :)
| smegsicle wrote:
| looks unambiguous to me lol
| mulmen wrote:
| Ok, what is the format string? a) m-dd-yy
| b) m-yy-dd
|
| I can't tell. How do you disambiguate?
| rjh29 wrote:
| It can't be either, 22 is not a valid month.
|
| I agree with you though, the point of a date like yyyy-mm-
| dd is to avoid working out stuff like this. You don't pick
| a date format based on whether the current date is
| ambiguous or not.
| mulmen wrote:
| Good catch. I updated my post. The question remains, how
| can this format be disambiguated?
|
| Agreed, this is why ISO8601 exists.
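|
| The disambiguation question above can be tried mechanically. The
| sketch below is only an illustration: the function name
| candidate_dates and the three candidate format strings are
| assumptions, not anything taken from the post. It tries each
| format and keeps the readings that yield a real calendar date.

```python
from datetime import datetime

def candidate_dates(text, formats=("%m-%d-%y", "%d-%m-%y", "%y-%m-%d")):
    """Return every (format, date) pair under which `text` parses."""
    results = []
    for fmt in formats:
        try:
            results.append((fmt, datetime.strptime(text, fmt).date()))
        except ValueError:
            pass  # this reading is impossible, e.g. it would need month 22
    return results

# Only one reading survives for the date in the title...
print(candidate_dates("2-22-22"))
# ...but a date such as 12-11-21 parses under all three formats.
print(candidate_dates("12-11-21"))
```

| A string only disambiguates itself when the other readings happen
| to be invalid, which is the point made above about not choosing a
| format based on the current date.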
| 58x14 wrote:
| Just my personal take, I think this is a really well-written
| incident postmortem. It's specific, extensive, candid, and dare I
| say, entertaining?
|
| Many incident reports are fully lacking in any meaningful detail,
| or wholly unapologetic. I actually enjoyed learning tidbits about
| the author, in particular their mention of
| https://how.complexsystems.fail/.
|
| Reading this boosted my confidence in Slack's teams, which should
| ultimately be the objective of a release like this. It's not pure
| PR nor a gruff legally-obligated disclosure.
|
| It helps that I wasn't really affected by this incident.
| vxNsr wrote:
| The fact that the current top comment thread is quibbling about
| the date format in the title seems to agree with this
| assessment, if there was anything real to complain about that's
| what we'd be seeing, instead we get bikeshedding on the date
| format in the title of a post.
| dijonman2 wrote:
| Which is aligned with US date formatting.
| ttul wrote:
| On today's internet, a frank post mortem delivers value to
| customers and PR gold too.
| _justinfunk wrote:
| I'm looking forward to http://thisincidentreportdoesnotexist.com
| launching sometime later this year.
| notacoward wrote:
| I love the diagrams of the cache<->DB cycle in normal vs.
| degenerate states. Those illustrate the problem very clearly and
| succinctly, and I hope they make it into a textbook some day.
| Kudos.
| godmode2019 wrote:
| Tidbit:
|
| 2-22-22 was also when Russia invaded Ukraine
|
| And Joe Biden's statement about the invasion was on 2-22-22 2:22pm
| on the dot.
|
| I could not figure out the significance of this more than 11:11
| was when WW1 ended, but it's probably something else.
| Kwpolska wrote:
| Nope, you're off by two days:
| https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukrai...
| oxfordmale wrote:
| Which major outage? According to the Slack uptime, there was
| barely 1.5 hours of outage :-)
|
| P.S.
|
| Yes I know the uptime is decided by committee, and doesn't reflect
| reality. I am just being cynical.
| olliej wrote:
| I find reading about these incidents super interesting, and I
| generally find the work performed by the folk keeping these
| services running (and dealing with the inevitable falling over of
| any computer system).
|
| At the same time it seems like a horrifying job I would never
| ever want :D
| diarrhea wrote:
| That date format is actually the worst I have ever encountered.
| m-d-y, with year in 2 digits, numbers not zero-padded, US "order"
| yet using dashes. It's like a moderator of /r/ISO8601 came up
| with the worst possible format _on purpose_. Am I missing
| something?
| HiJon89 wrote:
| I expected the top comment on hackernews to be something this
| pedantic and irrelevant to the content, and I was not
| disappointed
| elpakal wrote:
| I had to read the parent comment twice to understand that it
| was talking about the date in the title of the post and not
| anything relevant whatsoever.
| missedthecue wrote:
| I mean, there is literally no way to confuse it with another
| date, unless you go back 100 years, when Slack didn't exist.
|
| There is no 22nd month, so we know the 22s are the day and the
| year, leaving only the 2 to be the month. Is it really that
| difficult to parse?
| johannes1234321 wrote:
| > leaving only the 2 to be the day
|
| I think you meant "to be the month" there. qed#
| diarrhea wrote:
| Beautiful.
| bamboozled wrote:
| So we could either:
|
| a) Use a far superior date format which nearly the entire
| world uses by default and is better and simpler in many ways.
|
| b) Do logic when we see dates to try to work out what format the
| date is in.
|
| Going with _a_ seems like a no brainer...
| verve_rat wrote:
| Especially in this scenario where you are communicating to
| an international audience.
| burnte wrote:
| Yeah, but what about that party that I threw on 10/11/12. Did
| I set it for November 12, 2010, or October 11 in 2012? Or
| somewhen else?
| tjoff wrote:
| I'm so tired of having to do that game every time I see a
| date. It is not hard, but it is quite annoying. Especially
| since it isn't solvable in a lot of cases, so you try to
| reason your way to the most realistic interpretation.
|
| It shouldn't be this hard.
| true_religion wrote:
| It's not really that hard. Like the imperial system,
| Americans just memorize how it works as children and don't
| think about it anymore.
|
| Think about it like speaking a different language, except
| with numbers and not words.
| colejohnson66 wrote:
| Really HN? We're seriously downvoting this comment to
| oblivion? I get non-Americans get passionate in their
| anger at imperial units, but this person is just
| explaining why it's natural to us.
| dllthomas wrote:
| The complaint isn't about the particular other order, but
| the fact that the order is ambiguous. In this case that
| doesn't matter, but often it does.
|
| Americans memorize inches and yards, and often also
| memorize centimeters and meters, and working with
| _either_ is fine, but we're not so often faced with
| numbers where it might be inches _or_ centimeters and we
| have to figure out which (and when we are, it's
| sometimes a pain - certainly a bigger pain than working
| with known units).
|
| Or, working with your language analogy, please go fetch
| me some "pasta" without knowing whether I'm speaking
| Italian or Polish.
| xeromal wrote:
| Context matters in your pasta scenario.
| dllthomas wrote:
| Since the text itself doesn't clarify, context is the
| only way of resolving any of the scenarios. In each case
| it's usually sufficient and often not all that hard. But
| it's always harder than if the system in use was made
| explicit, and I understand the complaint (even if my
| annoyance at the ambiguity is quite significantly below
| the level where I would have complained myself,
| particularly in this case).
| erpellan wrote:
| Context also matters in the date parsing scenario.
| 11/12/22 could be several different dates depending on
| the context.
| xeromal wrote:
| Yeah, and in this post, it's clear what it is.
| ldh wrote:
| It may not have happened to you yet, but someday you'll
| see a date somewhere other than this post.
| xeromal wrote:
| This honestly has me laughing.
| ascar wrote:
| > Think about it like speaking a different language
|
| The correct analogy is I don't know which language is
| spoken and the same words get used in multiple languages
| with different meaning. Now I can apply heuristics to
| figure it out or in some cases I can only guess.
| huhtenberg wrote:
| I just thought it was a typo.
| noselasd wrote:
| That particular date is possible to understand, but the date
| format is not. (Is it really that fun to try to figure out
| what 12-11-21 means ?)
| CPLX wrote:
| November 21, 1812
|
| What do I win?
| jonpurdy wrote:
| Came here to complain specifically about this. 2022-02-22 is
| unambiguous, big endian, and sorts nicely. IDK why society
| still uses any other date formats considering how international
| everything is.
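|
| The "sorts nicely" claim can be checked with a small sketch. The
| sample dates and the parse_us helper are made up for illustration,
| and the helper assumes every two-digit year is in the 2000s.

```python
from datetime import date

iso = ["2022-02-22", "2021-12-31", "2022-01-05"]
us = ["2-22-22", "12-31-21", "1-5-22"]  # the same three days, m-d-yy

def parse_us(text):
    """Parse m-d-yy, assuming every two-digit year is 20yy."""
    m, d, y = (int(part) for part in text.split("-"))
    return date(2000 + y, m, d)

# Plain string sorting is already chronological for yyyy-mm-dd.
iso_sorted = sorted(iso)
# Plain string sorting of m-d-yy is not; it needs a parsing key.
us_text_sorted = sorted(us)
us_chrono = sorted(us, key=parse_us)

print(iso_sorted)
print(us_text_sorted, us_chrono)
```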
| [deleted]
| gleenn wrote:
| It's because people for hundreds of years have been saying
| "March second, nineteen sixty two" which they then write out
| in that order. As a programmer, people's frustrations are
| understandable, but you're a bit naive if you think even a
| percentage point of the speaking population of the world
| knows or is concerned with big endian-ness or sortability.
| However they speak English, at least in America, in that
| order, and that's the way they write it. Europeans only got
| it a little better.
| [deleted]
| theamk wrote:
| We do say "five dollars" while writing "$5", so saying and
| writing different things is not unheard of.
|
| And endianness / sorting comes up in real life pretty often
| - scanning for large numbers in the price list, or finding
| stuff in the sorted list.
|
| I think if history turned differently, we could have had
| sane time format in the US.
| ascar wrote:
| > Europeans only got it a little better.
|
| There is a reasonable argument for little endian dates (as
| in the least significant information is usually the most
| relevant as it changes most often), but apart from the "it
| has been like this forever" I don't see any reasonable
| argument for middle endian date formats. Then again, the US
| is notoriously resistant to the metric system too.
| mumblemumble wrote:
| Your error is expecting reasonableness. All linguistic
| conventions are either arbitrary or lost to time, and
| mostly only exist for tradition's sake.
| lilyball wrote:
| It's because it matches the way we speak dates aloud.
| When intended for human consumption, sortability and big-
| endianness doesn't matter, but matching the way we speak
| does. Maybe other cultures actually speak dates
| differently, I don't know, but I have never seen a native
| English speaker habitually speak dates any differently
| than "January 1st, 2001".
|
| All that said, I definitely agree with the original
| complaint, m-dd-yy is an atrocious format. If you're
| going to use dashes, stick with yyyy-mm-dd. Replacing the
| dashes with slashes, as in 2/22/22, would have been fine.
| rmccue wrote:
| "the twenty-sixth of April" would be the way I say
| today's date and anecdotally is in common usage in both
| countries I've lived in (the UK and Australia, both using
| d/m/y). I'd say it's about as frequent as "April the
| twenty-sixth" by itself, and definitely more common if
| you include the day ("Tuesday, the twenty-sixth of
| April").
| ChrisKnott wrote:
| In the UK I think "1st of January" is probably slightly
| more common than "January the 1st" although you hear
| both. "January 1st" (no "the") sounds American.
| verve_rat wrote:
| I'm from NZ and it is 100% normal to switch back and
| forth between "The second of March" and "March the 23rd".
|
| People I have met from Australia, South Africa, the UK,
| all have the same flexibility.
| [deleted]
| jc_811 wrote:
| Oh this is a great point! I'd never realized that. I know
| that in Spanish (and I assume many of the romance
| languages) we always say the day first, eg dos de febrero
| (2nd of February). In American English even though the
| day first technically is grammatically correct, we pretty
| much never say it in that order (February 2nd instead of
| the 2nd of February)
| ksdnjweusdnkl21 wrote:
| Have they not been doing the same with "fourth of July"? Or
| is this an exception?
| gleenn wrote:
| Our Independence Day is probably a special case. Clearly
| language is flexible enough to say all the formats, but
| the date format we write matches the most common
| verbalization.
| Hjfrf wrote:
| The one exception I can think of is a bug in the mssql
| datetime type (but not date or datetime2) where strings in
| that format are assumed to be yyyy-dd-mm if the locale
| dateformat is dmy (e.g. British English).
| charcircuit wrote:
| It's a shortened version of "February 22, 2022"
|
| It doesn't seem that bad to turn it into 2-22-22.
| lucideer wrote:
| It's also a shortened version of "22nd of February, 2022".
| smcl wrote:
| The issue is a great deal of the rest of the world don't do
| this, so you need to decide whether to apply best-guess
| heuristics to parse it or decide that it's a typo ("ah
| there's not 22 months, so maybe it's the 22nd of February or
| someone fat-fingered the 2nd of February...?").
|
| In this case you can lookup Slack outages to disambiguate it,
| but the frustration here - and I share it - is directed at
| the stubborn refusal to use a standard format that the rest
| of the world has agreed upon.
| Invictus0 wrote:
| It's a quirky nod to the fact that all the digits were 2 on
| that day in this format.
| vxNsr wrote:
| > _Am I missing something?_
|
| Yes, the numbers are all the same, and the author is based in
| the US, and thus is using the default format in the US. So odd
| that this is the top comment.
| samstave wrote:
| OMG - I thought I clicked on the tablet thread regarding
| Sumerian OOOs -- and I thought you were sarcastically making
| fun of the way the Sumerians captured dates on limestone
| tablets ~4,000 years ago...
|
| (i had scrolled immediately down, so the thread title wasn't
| visible when I was reading your comment)
|
| haha
| adamomada wrote:
| This is what you sometimes see for best-before dates in Canada.
| Even better, because our dates are "supposed to" be like 22/2
| but I don't think anyone here does that, except Quebec perhaps.
| Sometimes you just have no clue
| rcthompson wrote:
| The point is to describe the date using only the number 2.
| neerajk wrote:
| "Mcrib is objectively a better system for generating memcached
| configurations -- but its efficiency made the broader system
| behave in a less safe way." Be good but not _that_ good :)
| pierrebai wrote:
| Sometimes a new roll-out causes an outage; sometimes roll-outs are
| delayed due to the overall system architecture. Reading the
| post-mortem, I could not help but be reminded of this issue as
| described here: https://www.youtube.com/watch?v=y8OnoxKotPQ
| epmatsw wrote:
| McRib is a hilarious name for a service
| whoopdedo wrote:
| I certainly wouldn't trust its availability.
|
| (The original McDonald's McRib sandwich is well known for only
| being sold a limited time.)
|
| So "Mcrouter" comes from Memcache-Router, then the obvious
| McDonalds jokes are made and someone cleverly suggests "Mcrib"
| for the next service. But I can't think what the backronym
| would be for it. Memcache Ring Buffer maybe. Or Broker.
| phan wrote:
| memcache router interface broker
| bee_rider wrote:
| Apparently it generates configurations for Mcrouter. Could be
| MemCache-Router Instance Borker.
| true_religion wrote:
| I think you meant Broker, but the misspelling is an act of
| genius since we are talking about downtime caused by an
| infrastructure failure.
| qubyte wrote:
| And I misread it as McBorker and now I can't stop
| chuckling.
| erichurkman wrote:
| McBorker, Chaos Monkey's cousin.
| bee_rider wrote:
| Does it make it less of an act of genius if it was
| intentional?
| jonah-archive wrote:
| RIB is a common term in networking for "Routing Information
| Base" (being the set of all routes which could be chosen to
| be installed in the routing table (or FIB -- "Forwarding") by
| the control plane). I don't know that this is the actual
| etymology but it's not implausible.
| mescaline wrote:
| An over communication platform should have scheduled outages like
| this regularly!
| alex-zierhut wrote:
| What motivation would someone have to run a scheduled outage? I
| can't think of any.
| AndrewUnmuted wrote:
| Something about all of this feels like a scheduled outage, to
| me.
|
| I am suspicious, though cannot back this up at all, that they
| were ready for this incident and may have even planned for it.
| jdlshore wrote:
| This sort of handwavy conspiracy thinking is distressingly
| common. What basis do you have for your suspicion? Is it just
| "big company bad"?
| AndrewUnmuted wrote:
| > distressingly
|
| You're distressed by my thinking? That's odd.
|
| Slack is a terrible product that engulfs the worker inside
| a dead-eyed grunt culture, featuring an endless spree of
| work-life balance destroyers. It might be great for people
| who ask for things from others, but for the people who have
| to actually do the thing being asked of them, Slack is a
| nightmare world.
|
| Anyone with the psychology to make a product like Slack, is
| likely to engage in handwavy conspiracy thinking
| themselves.
| alphabettsy wrote:
| > engulfs the worker inside a dead-eyed grunt culture,
| featuring an endless spree of work-life balance
| destroyers. It might be great for people who ask for
| things from others, but for the people who have to
| actually do the thing being asked of them, Slack is a
| nightmare world.
|
| I think it depends on the organization and how you use
| it. In a previous role I would've agreed with you. People
| expected you to reply at all hours, where I am now that
| isn't the case.
|
| Tools do not create toxic culture or destroy work-life
| balance. Organizations do that.
| mulmen wrote:
| I choose to believe it is not common thinking but instead
| commonly verbalized among the minority with such thoughts.
| encryptluks2 wrote:
| > including the author -- which certainly made my role as
| Incident Commander more challenging!
|
| As if no other way to communicate exists?
|
| I remember using Slack, feeling fed up with emails, until I
| realized that if I wanted to sync Slack messages offline and have
| a standard way to view these messages that I was SOL. I am so
| glad that I've returned to email and optimized my workflow to use
| email effectively and efficiently. The best part is no more
| vendor lock-in.
| orf wrote:
| *22/2/22
| adamomada wrote:
| You know how the date looked strange to you? It's the same for
| your correction, but for other people
| orf wrote:
| For a statistically insignificant portion of people, sure. It
| doesn't make it any less correct.
| 4ggr0 wrote:
| For the whole of Europe it would be 22.02.2022, how is all
| of Europe statistically insignificant?
| orf wrote:
| The official EU rules say 22.02.2022, but nobody in
| Europe would have trouble parsing 22/2/22 or any
| variation thereof. And the / (or -) separator is indeed
| used in parts of the EU.
|
| It's the ordering that's significant, not the separator.
| dormando wrote:
| Hi! I'd like to offer some hopefully useful information if any
| Slack folks end up reading this, or anyone else with a similar
| infrastructure. I'll start with some tech and make a separate
| philosophical comment.
|
| Also caveat: I have no deep view into Slack's infrastructure so
| anything I say here may not even be relevant. YMMV.
|
| First some self promotion:
| https://github.com/memcached/memcached/wiki/Proxy memcached
| itself is shipping router/proxy software. Mcrouter is difficult
| to manage and unsupported. This proxy is community developed,
| more flexible, likely faster, and will support more native
| features of memcached. We're currently in a stabilization round
| ensuring it won't eat pets but all of the basic features have
| been in for a while. Documentation and example libraries are
| still needed but community feedback helps speed those up
| tremendously (or any kind of question/help request).
|
| It's not clear to me why memcached is being managed like this;
| mcrouter seems to only be used to abstract the configuration from
| the clients. It has a lot of features for redundant pools and so
| on. Especially with what sounds like globally immutable data and
| the threat of cascading failures during rolling upgrades it
| sounds like it would be very helpful here.
|
| If cost or pool sizes are the main reasons why the structure is
| flat, using Extstore
| (https://github.com/memcached/memcached/wiki/Extstore) can likely
| help. Even if object value sizes are in the realm of 500 bytes,
| using flash storage can still greatly reduce the amount of RAM
| necessary or reduce the pool size (granted the network can still
| keep up) with nearly identical performance. Extstore takes a lot
| of tradeoffs (i.e., keeping keys in RAM) to ensure most operations
| don't actually write to flash or double-read. Extstore's in use
| in tons of places and everyone's immediately addicted.
|
| Finally, the Meta Protocol
| (https://github.com/memcached/memcached/wiki/MetaCommands) can
| help with stampeding herds to help keep DB load from exploding
| without adding excess network roundtrips under normal conditions.
| I've seen lots of workarounds people build but this protocol
| extension gives a lot of flexibility you can use to help survive
| degraded states: anti-stampeding herd, serve-stale, better
| counter semantics, and so on.
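|
| To make the anti-stampede and serve-stale ideas concrete, here is
| a toy in-process sketch. It is NOT the memcached meta protocol or
| any client API; the class StampedeGuardCache, its methods and the
| claim_ttl parameter are hypothetical names invented for the
| example. The idea it illustrates: on a miss or an expired item,
| exactly one caller "wins" the right to refetch from the database,
| while everyone else is handed the stale value (or told a fetch is
| already in flight) instead of piling onto the DB.

```python
import time

class StampedeGuardCache:
    """Toy in-process cache; an illustration, not the memcached API."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> [value, expires_at, claimed_until]

    def set(self, key, value, ttl):
        self._items[key] = [value, self._clock() + ttl, 0.0]

    def get(self, key, claim_ttl=5.0):
        """Return (value, stale, won).

        won=True means this caller alone should refetch from the DB.
        """
        now = self._clock()
        item = self._items.get(key)
        if item is None:
            # Miss: leave a placeholder so later callers can see that
            # a refetch is already in flight.
            self._items[key] = [None, now, now + claim_ttl]
            return None, True, True
        value, expires_at, claimed_until = item
        if now < expires_at:
            return value, False, False  # fresh hit
        if now >= claimed_until:
            item[2] = now + claim_ttl   # this caller claims the refetch
            return value, True, True
        return value, True, False       # stale, someone else is fetching
```

| A caller that gets won=True fetches from the database and calls
| set(); callers that get won=False either use the stale value or
| retry shortly. The real flags for this behaviour are described on
| the MetaCommands wiki page linked above.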
| dormando wrote:
| Now a more philosoraptor style comment: I see Mcrib is a service
| built to quickly detect and replace memcached instances. I treat
| memcached in infrastructure as a very stable service. Meaning it
| is infrequently necessary to upgrade it, and it will generally
| not fail on its own. If it does it will be highly infrequent
| compared to services with higher churn or more
| complexity/dependencies. This means if they're failing often
| enough that you need to rapidly detect and replace them you have
| a more fundamental problem.
|
| From a structural standpoint I think my technical comment can be
| useful. If things really are failing this much A) you should
| figure out why and slow that down. B) if you have a generally
| stable system and understand the typical rate of failure, you can
| add tripwires into Mcrib to avoid over-culling services and
| loudly raise alarms. Then C) you can improve technical
| reliability with redundancy/extstore/etc.
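|
| Point B could look something like the sketch below. Nothing here
| is from Mcrib itself; ReplacementTripwire and its parameters are
| hypothetical, and the thresholds would come from the failure rate
| you actually observe. It allows at most max_replacements automated
| replacements per sliding window and refuses the rest, at which
| point the caller should page a human rather than keep culling.

```python
import time
from collections import deque

class ReplacementTripwire:
    """Sliding-window cap on automated node replacements (a sketch)."""

    def __init__(self, max_replacements, window_seconds, clock=time.monotonic):
        self.max_replacements = max_replacements
        self.window_seconds = window_seconds
        self._clock = clock
        self._recent = deque()  # timestamps of recently allowed replacements

    def allow(self):
        """Return True if one more replacement may proceed right now."""
        now = self._clock()
        # Forget replacements that have aged out of the window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_replacements:
            return False  # tripwire hit: stop culling and raise an alarm
        self._recent.append(now)
        return True
```

| With a typical failure rate of, say, one node a day, a cap of two
| replacements per hour would never trigger in normal operation but
| would stop a rolling upgrade from draining the whole pool.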
|
| I've also seen plenty of times where folks have a dependency of a
| service determine if that service is usable, which I disagree
| with quite strongly. Consul being down on a node should trigger
| something to consider if the service is dead. It's important both
| for reliability (don't kill perfectly working things because you
| end up having to design around it), and for maintainability as
| you've now made people afraid of upgrading Consul or other co-
| dependent services. Other similar failures are single-point-of-
| testing availability checking where instead you probably want two
| points of truth before shooting a service.
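|
| The "two points of truth" idea can be sketched as a simple quorum
| over independent health checks. The function should_replace and
| the example checks are hypothetical; each check is assumed to be a
| callable that returns True when the node looks healthy from its
| own vantage point.

```python
def should_replace(node, checks, required_failures=2):
    """Replace a node only if enough independent checks agree it is down."""
    failures = sum(1 for check in checks if not check(node))
    return failures >= required_failures

# Example: the Consul agent is down on the node, but memcached
# still answers on its port, so the node is left alone.
consul_agent_ok = lambda node: False
memcached_port_ok = lambda node: True
print(should_replace("mc-17", [consul_agent_ok, memcached_port_ok]))
```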
|
| Now you risk people being afraid of upgrading probably anything,
| which means they will work around it, abstract it, or needlessly
| replace it with something they feel safer managing. The latter is
| at best a waste of time, at worst a time bomb until you find out
| what conditions this new thing breaks under.
|
| This isn't advocating that you design without assuming anything
| can fail anywhere at any time; just pointing out that how often a
| service _should_ fail is extremely useful information when
| designing systems and designing fail safes, alerts, monitoring,
| etc.
| bognition wrote:
| It's likely that the memcached install is so large that the
| underlying instances themselves are failing. When you have
| hundreds or thousands of instances, failures in the instances
| themselves become pretty regular.
| belter wrote:
| A Date that is both a Palindrome and an Ambigram:
|
| https://www.jagranjosh.com/general-knowledge/22-02-2022-is-b...
| sva_ wrote:
| Well, not the way they format it.
___________________________________________________________________
(page generated 2022-04-26 23:00 UTC)