[HN Gopher] Slack is experiencing a service disruption
___________________________________________________________________
Slack is experiencing a service disruption
Author : tjwds
Score : 214 points
Date : 2021-09-30 18:31 UTC (4 hours ago)
(HTM) web link (status.slack.com)
(TXT) w3m dump (status.slack.com)
| jftuga wrote:
| https://downdetector.com/status/slack/
| 40acres wrote:
| Started a new job in July. It's my first time using slack so
| heavily for day to day business needs. After some adjustment I've
| gotten used to it and enjoy it a lot. Now it's down and I feel
| like I can't do anything. We have gchat available but no one else
| is using it and we can't recreate our slack channels in gchat.
| I've been rendered totally ineffective.
| shados wrote:
| My last couple of jobs have involved being on slack almost all
| the time.
|
| Since we couldn't do anything when it's down, I just got into
| the habit of keeping a Google Chat room that everyone knows
| about. If Slack goes down, we just hop into the google chat
| room. Google Chat got a lot better in the last few years, it's
| actually fairly serviceable now.
| giacaglia wrote:
| Could it be related to:
| https://techcrunch.com/2021/09/21/lets-encrypt-root-expiry/ ?
| iJohnDoe wrote:
| Related thread.
|
| https://news.ycombinator.com/item?id=28708544
|
| https://news.ycombinator.com/item?id=28711925
| tialaramex wrote:
| No
|
| However, in that regard it is interesting to see the shoe
| on the other foot for Firefox this time around.
|
| Historically Firefox used to rely on remembering recent
| intermediates it had seen to "fill in the blanks" when trying
| to access a site that doesn't bother providing any reason to
| trust the certificate it offers+.
|
| But, Mozilla spent years collecting knowledge of all
| unconstrained intermediates for trusted root CAs. Then they
| shipped the entire set as part of Firefox. So, a modern Firefox
| visiting a site that only has a Let's Encrypt certificate and
| no other reason to be trusted, sees that and goes oh, that's
| issued by Let's Encrypt's R3, which I have a copy of, and so
| it's trustworthy. Done.
|
| Several other popular browsers, by contrast, are looking at
| potentially stale local information. They may conclude that
| they can't trust this site because some stale data has expired,
| or in some cases they rely on an already expired certificate
| sent to them rather than disregarding it.
|
| + You're supposed to provide a "chain" (it doesn't actually
| need to be a chain, but that's most compatible) of other
| certificates to show why the leaf certificate you have is
| trustworthy, e.g. my certificate is from Z9, Z9 has a
| certificate from Big Trusted Corp Q46, Big Trusted Corp Q46 has
| a certificate from Very Famous Trusted CA, and the relying
| party (well, their client software) goes, "Oh, I see, and I
| trust Very Famous Trusted CA, so that means I trust you're
| you". But, lots of web sites (maybe 10-20% and more of the
| smallest with negligible IT budget) don't get this correct so
| web browsers try to work around it.
| Deathmax wrote:
| There's two Let's Encrypt issues at play at the moment.
|
| The issue that your comment about cached/discovered
| intermediates relates to is that some servers were still
| manually constructing the chain that goes leaf -> R3 -> DST
| Root X3, and the R3 intermediate signed by X3 expired on 29
| September. This chain hasn't been returned by Let's Encrypt
| since May 2021.
|
| The other issue relates to the current default chain returned
| by Let's Encrypt that goes leaf -> R3 -> ISRG Root X1 -> DST
| Root X3 for Android compatibility. Most clients are able to
| successfully build a valid chain from the leaf to the still
| valid ISRG X1. However, old versions of OpenSSL (pre-1.1.0)
| and several other TLS libraries that don't explore the graph
| correctly barf at a chain that terminates in the now expired
| X3 root.
|
| [1]: https://community.letsencrypt.org/t/production-chain-
| changes...
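
The second issue described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Cert`, `naive_validate`, and `build_path` are hypothetical names, no signatures are checked, and the dates are illustrative. The only facts taken from the thread are the shape of the served chain and that the DST Root X3 has expired while ISRG Root X1 is still valid.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    not_after: date


def is_valid_root(name: str, trust_store: dict, today: date) -> bool:
    """True if `name` is a trusted root that has not expired."""
    root = trust_store.get(name)
    return root is not None and root.not_after >= today


def naive_validate(chain: list, trust_store: dict, today: date) -> bool:
    """Walk the served chain to its end and anchor only on the last issuer."""
    if any(cert.not_after < today for cert in chain):
        return False
    return is_valid_root(chain[-1].issuer, trust_store, today)


def build_path(chain: list, trust_store: dict, today: date) -> Optional[list]:
    """Stop at the first certificate issued by a still-valid trusted root."""
    path = []
    for cert in chain:
        if cert.not_after < today:
            return None
        path.append(cert.subject)
        if is_valid_root(cert.issuer, trust_store, today):
            return path + [cert.issuer]
    return None


# Illustrative dates only (assumption); the old client still has the
# expired DST Root X3 in its trust store alongside ISRG Root X1.
trust_store = {
    "ISRG Root X1": Cert("ISRG Root X1", "ISRG Root X1", date(2035, 1, 1)),
    "DST Root X3": Cert("DST Root X3", "DST Root X3", date(2021, 9, 30)),
}
served_chain = [
    Cert("example.com", "R3", date(2021, 12, 1)),
    Cert("R3", "ISRG Root X1", date(2025, 1, 1)),
    Cert("ISRG Root X1", "DST Root X3", date(2024, 1, 1)),  # cross-sign
]
today = date(2021, 10, 1)

print(naive_validate(served_chain, trust_store, today))
print(build_path(served_chain, trust_store, today))
```

The naive walker rejects the chain because it only ever anchors on the last issuer, while the path builder stops as soon as it reaches a root it already trusts and ignores the extra cross-signed certificate. Real libraries do considerably more than this.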
| tialaramex wrote:
| Well, you're right that there are two issues, and it's
| perhaps unfortunate that ISRG chose to arrange that they
| happen next to each other (but to be fair I don't think we
| pressed them to do anything else back when it would have
| been possible, although I'm behind on some reading I don't
| think I'm _that_ far behind)
|
| But, in both cases Firefox's choice works out for it
| regardless. And unfortunately I believe -- though it's hard
| to tell for sure -- that some other browsers get the
| missing certificate case wrong, perhaps for a few hours and
| perhaps much longer. Lacking a guiding hand, they may end
| up choosing to "validate" the R3 -> DST Root CA X3 case
| which can't work. I think some logic will eventually expire
| this useless data and those browsers will work, but
| obviously that's not helpful if your site seems "broken"
| now.
|
| It will be clearer by tomorrow, but understandably Let's
| Encrypt's community site is under a deluge of "Help!?" type
| posts, which are a struggle for the volunteers.
| thomascgalvin wrote:
| Productivity, on the other hand, is through the roof.
| milkers wrote:
| For me it's the opposite: I can't add myself to the company VPC,
| because the command runs through Slack. What a bummer.
| sieabah wrote:
| More proof that SlackOps/ChatOps is a failed methodology. I'm
| not quite sure what the fascination is with it. It's
| certainly neat for querying infrastructure but I certainly
| wouldn't rely on it for critical infra.
|
| Maybe use this as an example that you guys shouldn't do that?
| JoBrad wrote:
| That feels like a leap too far. ChatOps should be a
| convenience, not the sole method of doing something.
| However conveniences are important.
| snvzz wrote:
| That's some terrible ops and a ridiculous single point of
| failure.
| kaustubhvp wrote:
| time to go home...
| dylan604 wrote:
| closes laptop. phew, short commute today.
| siva7 wrote:
| the covid years sufficiently described
| dylan604 wrote:
| if i were a poet, i might have tried to put it in a
| meter. since i'm a coder, i made it a one-liner.
| JonathanMerklin wrote:
| I'll do my next reboot at nine, I'll do my next commute
| with wine.
| kgeist wrote:
| We host Rocket.Chat ourselves (300+ people company), and when
| it's down (which is very rare), all it takes is to make a visit
| to the IT guy on the same floor. Also our infosec people are
| happy sensitive data is not stored on servers we don't control.
| There've been attempts to move to Slack or Google Chat without
| success. Is there something we're missing by not using Slack?
| ziddoap wrote:
| Did you intend to reply to the person you did?
|
| My understanding of the parent post was poking fun at the
| fact that these chat spaces (Slack or otherwise) allegedly
| end up being a productivity loss as people spend more time
| finding the meme-of-the-week to post rather than working.
|
| Nothing to do with the viability of Slack or whatever instant
| messenger you choose to use.
| kgeist wrote:
| Yes, my bad.
| nathanyz wrote:
| Could be related to the currently ongoing Let's Encrypt
| certificate expiration event(1) that is causing fun for engineers
| today(2).
|
| (1) https://community.letsencrypt.org/t/help-thread-for-dst-
| root...
|
| (2)
| https://twitter.com/search?q=letsencrypt&src=typed_query&f=l...
| betaby wrote:
| No. DNSSEC has nothing to do with HTTPS.
| nathanyz wrote:
| Interdependencies among services can cause all sorts of
| unexpected issues.
| kelnos wrote:
| It can, but it did not, in this case.
|
| Slack uses DigiCert, not LetsEncrypt, and if you poke
| entries into your /etc/hosts file for various slack
| hostnames, things work just fine (which they wouldn't if it
| was a TLS cert problem).
| eropple wrote:
| No, it's not. This is a fault that appears to be from
| removing DNSKEY and DS records at the exact same time.
|
| Let's Encrypt doesn't have _anything_ to do with that even
| if somebody uses it. Slack uses DigiCert.
| telesilla wrote:
| We had this issue with a desktop client, this seems to be working
| for us as a fix:
|
| https://twitter.com/Zap42/status/1443647882045927427
|
| Also, just waiting, or maybe rebooting.
|
| Edit: ah wrong thread sorry! This is re: letsencrypt.
| kelnos wrote:
| Slack doesn't appear to use LetsEncrypt (checking their cert on
| slack.com and app.slack.com shows a DigiCert cert), so this
| will not do anything to help.
| judge2020 wrote:
| Sounds like that's a different issue then, because that solves
| the LetsEncrypt issue - the Slack issue is thanks to DNSSec
| (and Slack uses a Digicert HTTPS certificate[0])
|
| 0: https://crt.sh/?Identity=slack.com&exclude=expired&match==
| iamjohnsears wrote:
| Switching to Cloudflare DNS fixed this for me
| daxuak wrote:
| Thank you, I switched to cloudflare/google DNS and flushed
| router's DNS cache, then it worked.
| michael_michael wrote:
| Thank you. Switching to 1.1.1.1 got me back online.
| eigthbits wrote:
| Good tip, back up, appreciate it
| black_13 wrote:
| Slack is a disruption. Slack is an answer to a question that no
| one is asking. "Did you ask the Slack channel" is newspeak for
| "I don't know".
| madars wrote:
| You can test it at home using "delv www.slack.com @4.2.2.1
| +rtrace" (4.2.2.[1-6] are Level 3's servers):
|     ;; fetch: www.slack.com/A
|     ;; fetch: com/DS
|     ;; fetch: ./DNSKEY
|     ;; fetch: slack.com/DS
|     ;; fetch: com/DNSKEY
|     ;; fetch: www.slack.com/DS
|     ;; validating slack.com/SOA: got insecure response; parent indicates it should be secure
|     ;; no valid RRSIG resolving 'www.slack.com/DS/IN': 4.2.2.1#53
|     ;; broken trust chain resolving 'www.slack.com/A/IN': 4.2.2.1#53
|     ;; resolution failed: broken trust chain
| mike-cardwell wrote:
| You can also just use dig. Running: `dig a www.slack.com`
| returns a SERVFAIL for me. Asking my resolver to skip the
| dnssec checking gives me the A record though: `dig +cd a
| www.slack.com`
|
| I then look at my unbound dns resolver logs:
|
| Sep 30 21:53:11 unbound[8985:0] info: validation failure
| <www.slack.com. A IN>: No DNSKEY record from 208.67.220.123 for
| key slack.com. while building chain of trust
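
The comparison described above (a normal lookup versus one with `+cd`) can be written down as a small decision rule. This is a hedged sketch: `rcode` and `diagnose` are hypothetical helpers, they only read the `status:` field that dig prints in its header line, and the sample header lines below are made up for illustration rather than captured from a real run.

```python
import re
from typing import Optional

STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def rcode(dig_output: str) -> Optional[str]:
    """Extract the response code from dig's ';; ->>HEADER<<-' line."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def diagnose(normal_output: str, cd_output: str) -> str:
    """Compare a normal lookup with one sent with +cd (checking disabled)."""
    normal, unchecked = rcode(normal_output), rcode(cd_output)
    if normal == "NOERROR":
        return "resolves normally"
    if normal == "SERVFAIL" and unchecked == "NOERROR":
        return "likely DNSSEC validation failure at the resolver"
    return "failure not explained by DNSSEC validation alone"


# Header lines in the shape dig prints them; the id values are made up.
normal_run = ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1111"
cd_run = ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2222"

print(diagnose(normal_run, cd_run))
```

The rule is deliberately conservative: it only points at DNSSEC when the same resolver answers once validation is skipped.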
| exikyut wrote:
| FWIW, after getting exactly the same output from `delv`
| locally, I noted that
|
|     $ dig +short slack.com @4.2.2.1
|     15.206.34.128
|
| and I take this to mean that, despite failing DNSSEC, Level 3
| is yielding an A record - and that the major players have
| basically monkey-patched this into working again, and it just
| needs to propagate through all the caches now?
| sys_64738 wrote:
| This is a good thing.
| testplzignore wrote:
| I like how all 5 status updates (as of now) basically say the
| same thing, though they rushed the third one and forgot to
| apologize :)
| zucked wrote:
| >We are aware of connectivity issues (...) In order to resolve
| this faster, your ISP (Internet Service Provider) will need to
| flush their DNS record for slack.com. Please reach out to your
| networking team to provide them with this information.
|
| Yeah, sure, lemme just pick up the phone and call my ISP and let
| 'em know to flush their DNS record. I'm sure the T1 rep will
| absolutely know how to handle that.
| kaustubhvp wrote:
| Have you tried calling Comcast/Xfinity ever?
| sieabah wrote:
| Well they aren't necessarily incorrect, TTLs can be a pain.
|
| However that's why when you do a migration you lower the TTLs
| and eat the cost of lookup requests. Wait until your previous
| TTL lease expires (which is 2x the TTL) then start the
| migration. After everything is known to be good you up the TTLs
| again. That way you have a way to "quickly" recover if issues
| arise.
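
The timing in the procedure above can be sketched as a small calculation. `migration_plan` is a hypothetical helper and the timestamps are invented; the default `safety_factor` of 2 mirrors the "2x the TTL" wait from the comment, although a resolver that honors TTLs should drop the old record after one TTL.

```python
from datetime import datetime, timedelta


def migration_plan(lowered_at: datetime, old_ttl: timedelta,
                   low_ttl: timedelta, safety_factor: int = 2) -> dict:
    """Return the key timestamps for a change made behind lowered TTLs."""
    # Records cached under the old TTL may live this long after lowering.
    earliest_start = lowered_at + safety_factor * old_ttl
    return {
        "earliest_start": earliest_start,
        # Once only low-TTL records are cached, a rollback should reach
        # TTL-honoring resolvers within roughly one low TTL.
        "rollback_window": low_ttl,
    }


plan = migration_plan(
    lowered_at=datetime(2021, 9, 27, 9, 0),
    old_ttl=timedelta(hours=24),
    low_ttl=timedelta(minutes=5),
)
print(plan["earliest_start"])
print(plan["rollback_window"])
```

Note that this only covers TTLs you control in your own zone; it does not shorten TTLs set on records in the parent zone.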
| walrus01 wrote:
| translation:
|
| a) we pushed a change that was not well thought out in its
| ramifications
|
| b) we probably didn't lower the TTL on our zones several days
| in advance of doing this change
|
| c) we reverted in a panic!
|
| d) by the way, did you know that many ISPs' caching nameservers
| don't respect low TTLs and will hold onto zones longer than
| they should?
|
| e) please go bother and waste the time of some first tier
| support rep at RCN or Comcast or CenturyLink, who _certainly_
| has the power to administer their DNS servers...
|
| oh jeez.
| yjftsjthsd-h wrote:
| > d) by the way, did you know that many ISPs' caching
| nameservers don't respect low TTLs and will hold onto zones
| longer than they should?
|
| In fairness, this part really is a dumb problem on the ISP's
| end.
| simcop2387 wrote:
| It's also partly because so many isps got tired of their
| caches being basically useless because of so many people
| setting obscenely low ttls too. There's no winning answer.
| [deleted]
| amelius wrote:
| Why not edit your /etc/hosts file directly?
| [deleted]
| hnarn wrote:
| Probably because slack doesn't want a long tail of future
| hard to pin down issues related to people not reverting their
| changes to their hosts file.
| betaby wrote:
| That's a laughably bad and dishonest response from Slack. They
| are redirecting the blame to ISPs now.
| paxys wrote:
| It's more dishonest to read one part of one line and draw
| conclusions from it without quoting the entire response.
| azundo wrote:
| Are they blaming them though? Is this not the actual fastest
| path to resolving the issue?
| lima wrote:
| No, the fastest path is putting the signature back. But I
| guess Route53 won't let them do that.
| walrus01 wrote:
| unless you are a customer of a _very tiny_ ISP the odds of
| getting the person on the phone who has root on their
| recursive caching nameservers provided to DHCP clients are
| minuscule.
| znpy wrote:
| And chances are they wouldn't flush that entry for you
| anyway.
| wk_end wrote:
| They should at least say something like "we're working with
| major ISPs to resolve the issue". It really comes off as
| walking away from a disaster of their own making.
| kreeben wrote:
| I wonder how much electricity will be spent flushing then
| repopulating the caches of each and every DNS server on the
| globe. Prolly not that much?
| lima wrote:
| No need to flush the full caches. Clearing caches for
| individual zones is not uncommon.
| alex_c wrote:
| Missing part of that quote is "This issue was caused by our
| own change and not related to any third-party DNS software
| and services."
|
| Seems refreshingly honest to me. Slack screwed up, it's not
| anyone else's fault, but there's nothing Slack can do about
| it now, here's a workaround.
| ziddoap wrote:
| Which part of the full Slack reply is dishonest?
| myth_drannon wrote:
| I managed to find a work around by connecting using my company's
| VPN but still can't connect directly on my home line
| phlhar wrote:
| Wow no DNS A Record for slack.com. Somebody fucked up
| [deleted]
| mhousley wrote:
| It appears that only certain DNS servers are affected. I can
| connect from my phone, but not from my home internet
| connection.
| _jal wrote:
| It looks like a failed attempt at implementing DNSSEC.
| [deleted]
| bastardoperator wrote:
| These are my favorite outages
| keville wrote:
| https://lists.dns-oarc.net/pipermail/dns-operations/2021-Sep...
| TobTobXX wrote:
| Huh, TIL:
|
| > Both Google and Cloudflare have a publicly accessible feature
| to flush the cache for a domain, so anyone could have done it:
| > https://developers.google.com/speed/public-dns/cache
| > https://1.1.1.1/purge-cache/
|
| Quite useful feature indeed.
| lima wrote:
| Unlikely to help in this particular case (which is a root-
| level DS record).
| syvanen wrote:
| It did help, as it also removes the cached DS record. Just
| try making a DNS query using those resolvers.
| terom wrote:
| https://dnsviz.net/d/slack.com/YVXX_g/dnssec/ the dnsviz
| analysis showing the slack.com zone DNSKEY existing at 12:55,
| followed by the .com zone DS record at 15:30. However, the
| next analysis at 17:24 shows both the .com zone DS and
| slack.com DNSKEY records have disappeared!
|
| Given that the slack.com DNSKEY shows up with a 1h TTL and the
| .com zone DS has a 24h TTL, they are screwed in the presence of
| cached slack.com DS records from the .com zone. Do not throw
| away your DNSKEY until your delegation's TTL has absolutely
| positively surely expired from any resolver caches!
|
| The slack.com domain is an AWS Route 53 zone, I'd be really
| interested to see a post-mortem explaining what happened here.
| Are they unable to recover the KSK/ZSK and restore the
| DNSKEY/etc records?
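
The cache-timing argument above can be modeled roughly as follows. This is a simplified sketch, not how any particular resolver is implemented: `earliest_safe_dnskey_removal` and `resolver_outcome` are hypothetical names, the example times are assumptions loosely based on the dnsviz observations, the 24h DS TTL is the figure quoted in the comment, and caching of the DNSKEY itself (1h TTL) is ignored.

```python
from datetime import datetime, timedelta

DS_TTL = timedelta(hours=24)  # .com DS TTL quoted in the comment above


def earliest_safe_dnskey_removal(ds_removed_at: datetime,
                                 ds_ttl: timedelta = DS_TTL) -> datetime:
    """A resolver may have cached the DS just before it was withdrawn."""
    return ds_removed_at + ds_ttl


def resolver_outcome(now: datetime, ds_cached_at: datetime,
                     dnskey_removed_at: datetime,
                     ds_ttl: timedelta = DS_TTL) -> str:
    """Classify what one validating resolver sees at time `now`."""
    ds_cached = ds_cached_at <= now < ds_cached_at + ds_ttl
    dnskey_published = now < dnskey_removed_at
    if ds_cached and dnskey_published:
        return "secure"
    if ds_cached:
        return "bogus"  # answered to clients as SERVFAIL
    return "insecure"  # assumes the parent no longer serves a DS


# Hypothetical times (assumption), chosen to fall between the observations.
ds_cached_at = datetime(2021, 9, 30, 15, 30)
removed_at = datetime(2021, 9, 30, 17, 0)

print(resolver_outcome(datetime(2021, 9, 30, 18, 0), ds_cached_at, removed_at))
print(earliest_safe_dnskey_removal(removed_at))
```

Under these assumptions, a resolver that cached the DS shortly before the change keeps failing validation until its cached copy expires, which is why the DNSKEY should outlive the DS by at least the DS TTL.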
| [deleted]
| tptacek wrote:
| Holy shit. That's bad.
|
| What this suggests is that Slack, for reasons passing
| understanding, enabled DNSSEC on their zones (with a DS record
| that essentially turns DNSSEC on, and the accompanying key
| records) --- then disabled DNSSEC by pulling all the records.
| But the DS records are in caches; validating resolvers go
| looking for the keys, which don't exist, and say "welp, I guess
| Slack.com doesn't exist".
| tialaramex wrote:
| Aren't you, in fact, the same Thomas Ptacek who has
| repeatedly claimed that DNSSEC is so irrelevant that events
| like this would go essentially unnoticed?
|
| Edited to add, e.g.
| https://news.ycombinator.com/item?id=22400167
|
| > DNSSEC is moribund and almost nobody uses it; in reality,
| the DNSSEC root private keys could land on Pastebin tomorrow
| and nothing would "break"
|
| We have this whole thread here about a "service disruption"
| for Slack, and nobody leaked the "root private keys" just one
| person made a dumb error and it blew up their site.
| tptacek wrote:
| No, I'm the Thomas Ptacek who has repeatedly claimed that
| the only impact DNSSEC is going to have on the Internet is
| causing outages like this. It's right there in the blog
| posts; in fact, it's even in the 2007 blog posts I wrote
| about this on the Matasano blog.
| ptomato wrote:
| > just one person made a dumb error and it blew up their
| site
|
| yeah, the dumb error they made was "using DNSSEC"
| bluejekyll wrote:
| I'm not going to defend DNSSEC here, because this outage
| and others continue to support tptacek's perspective on
| its usefulness.
|
| But, some governments are requiring DNSSEC, which
| regardless of its usefulness, puts companies that want
| those contracts in a bit of a bind.
|
| Perhaps it would make sense to split domains such that
| DNSSEC guarded ones would not negatively impact ones that
| do not have DNSSEC.
| tptacek wrote:
| The USG DNSSEC requirements, which seem to be a part of
| what happened, are fragmented and incoherent. OMB
| withdrew DNSSEC requirements in 2018, and CLOUD.GOV
| doesn't support it. But some older requirements documents
| still have them, and need to be updated.
|
| The important top-line thing to know here is that
| virtually all tech companies eschew DNSSEC (you can
| verify that for yourself with `host -t ds stripe.com`;
| substitute any other company for Stripe).
|
| DNSSEC-quarantine TLDs are a good idea.
| ithkuil wrote:
| We had a DNS related outage with route53. Some of our zones
| just lost some records and then they reappeared. Could that
| explain what happened to slack's DNSSEC related records?
| terom wrote:
| A good question, and apparently enough to elicit a response:
| No!
|
| > This issue was caused by our own change and not related
| to any third-party DNS software and services.
| aeden wrote:
| I wonder if they are using tooling that doesn't properly
| retain DNSKEY records for DS records that were recently
| removed? This is one of the reasons we perform controlled
| automated key rotation and removal in DNSimple, so that we can
| ensure we retain the keys in the authoritative zone on each key
| rollover, giving the DS records time to expire from caches.
| icedchai wrote:
| It was especially bad since their status page wouldn't even
| resolve! I eventually just restarted my local caching DNS
| server.
| keville wrote:
| If you can't reach slack.com, here's their status page:
| https://slack-status.azureedge.net/2021-09/06c1e17de93e7dc2
| omreaderhn wrote:
| Based on that status page and the mailing list email link you
| posted it seems like they still don't understand what's going
| on (if that email is correct).
| ricardobeat wrote:
| It seems to be standard practice to be as fuzzy as possible
| during an outage, and only share details later. Probably
| avoids anyone looking stupid if the initial hunch turns out
| to be wrong.
| pjlegato wrote:
| Many outages are caused by malicious actors attacking your
| infrastructure. This cause may not be apparent at all until
| much later.
|
| It's better to avoid leaking any information in general
| during the incident, as it's often not immediately possible
| to know whether a hostile adversary exists, nor what
| advantages the adversary might derive from detailed updates
| during the incident.
| patrickbolle wrote:
| Shopify has been down quite a bit today as well.
| jftuga wrote:
| Yes it has, https://downdetector.com/status/shopify/
___________________________________________________________________
(page generated 2021-09-30 23:01 UTC)