[HN Gopher] Slack is experiencing a service disruption
       ___________________________________________________________________
        
       Slack is experiencing a service disruption
        
       Author : tjwds
       Score  : 214 points
       Date   : 2021-09-30 18:31 UTC (4 hours ago)
        
 (HTM) web link (status.slack.com)
 (TXT) w3m dump (status.slack.com)
        
       | jftuga wrote:
       | https://downdetector.com/status/slack/
        
       | 40acres wrote:
       | Started a new job in July. It's my first time using slack so
       | heavily for day to day business needs. After some adjustment I've
       | gotten used to it and enjoy it a lot. Now it's down and I feel
       | like I can't do anything. We have gchat available but no one else
       | is using it and we can't recreate our slack channels in gchat.
       | I've been rendered totally ineffective.
        
         | shados wrote:
         | My last couple of jobs have involved being on slack almost all
         | the time.
         | 
         | Since we couldn't do anything when it's down, I just got into
         | the habit of keeping a Google Chat room that everyone knows
         | about. If Slack goes down, we just hop into the google chat
         | room. Google Chat got a lot better in the last few years, it's
         | actually fairly serviceable now.
        
       | giacaglia wrote:
       | Could it be related to:
       | https://techcrunch.com/2021/09/21/lets-encrypt-root-expiry/ ?
        
         | iJohnDoe wrote:
         | Related thread.
         | 
         | https://news.ycombinator.com/item?id=28708544
         | 
         | https://news.ycombinator.com/item?id=28711925
        
         | tialaramex wrote:
         | No
         | 
         | However, in that regard it is interesting to see the shoe
         | on the other foot for Firefox this time around.
         | 
         | Historically Firefox used to rely on remembering recent
         | intermediates it had seen to "fill in the blanks" when trying
         | to access a site that doesn't bother providing any reason to
         | trust the certificate it offers+.
         | 
         | But, Mozilla spent years collecting knowledge of all
         | unconstrained intermediates for trusted root CAs. Then they
         | shipped the entire set as part of Firefox. So, a modern Firefox
         | visiting a site that only has a Let's Encrypt certificate and
         | no other reason to be trusted, sees that and goes oh, that's
         | issued by Let's Encrypt's R3, which I have a copy of, and so
         | it's trustworthy. Done.
         | 
         | Whereas for several other popular browsers they're looking at
         | potentially stale local information, and may conclude that they
         | can't trust this site because some stale data has expired or
         | they rely on an already expired certificate sent to them in
         | some cases rather than disregarding it.
         | 
         | + You're supposed to provide a "chain" (it doesn't actually
         | need to be a chain, but that's most compatible) of other
         | certificates to show why the leaf certificate you have is
         | trustworthy, e.g. my certificate is from Z9, Z9 has a
         | certificate from Big Trusted Corp Q46, Big Trusted Corp Q46 has
         | a certificate from Very Famous Trusted CA, and the relying
         | party (well, their client software) goes, "Oh, I see, and I
         | trust Very Famous Trusted CA, so that means I trust you're
         | you". But, lots of web sites (maybe 10-20% and more of the
         | smallest with negligible IT budget) don't get this correct so
         | web browsers try to work around it.
        
           | Deathmax wrote:
           | There are two Let's Encrypt issues at play at the moment.
           | 
           | The issue that your comment about cached/discovered
           | intermediates relates to is that some servers were still
           | manually constructing the chain that goes leaf -> R3 -> DST
           | Root X3, and the R3 intermediate signed by X3 expired on 29
           | September. This chain hasn't been returned by Let's Encrypt
           | since May 2021.
           | 
           | The other issue relates to the current default chain returned
           | by Let's Encrypt that goes leaf -> R3 -> ISRG Root X1 -> DST
           | Root X3 for Android compatibility. Most clients are able to
           | successfully build a valid chain from the leaf to the still
           | valid ISRG X1, however, old versions of OpenSSL (pre-1.1.0)
           | and several other TLS libraries that don't explore the graph
           | correctly barf at a chain that terminates in the now expired
           | X3 root.
           | 
           | [1]: https://community.letsencrypt.org/t/production-chain-
           | changes...
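
A minimal sketch of the second issue described above, assuming a toy certificate model with no signature checking at all. The `Cert` class, both function names, the `example.org` leaf, and every date are hypothetical placeholders, not any TLS library's API or the real validity periods. It only contrasts a client that insists on the end of the served chain with one that stops at the first unexpired trust anchor it recognizes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    not_after: date  # last day on which the certificate is valid


def unexpired(cert: Cert, today: date) -> bool:
    return today <= cert.not_after


def validate_served_chain_only(served, trust_store, today) -> bool:
    """Follow the served chain to its very end and require that the
    final issuer is an unexpired trust anchor (no alternatives tried)."""
    if not all(unexpired(c, today) for c in served):
        return False
    anchor = trust_store.get(served[-1].issuer)
    return anchor is not None and unexpired(anchor, today)


def validate_with_path_building(served, trust_store, today) -> bool:
    """Walk the served chain and stop at the first certificate whose
    issuer is an unexpired trust anchor; ignore the rest of the chain."""
    for cert in served:
        if not unexpired(cert, today):
            return False
        anchor = trust_store.get(cert.issuer)
        if anchor is not None and unexpired(anchor, today):
            return True
    return False


# Placeholder dates: only their ordering relative to TODAY matters here.
TODAY = date(2021, 10, 1)
STILL_VALID = date(2030, 1, 1)
ALREADY_EXPIRED = date(2021, 9, 30)

# leaf -> R3 -> ISRG Root X1 (cross-signed by DST Root X3)
served_chain = [
    Cert("example.org", "R3", STILL_VALID),
    Cert("R3", "ISRG Root X1", STILL_VALID),
    Cert("ISRG Root X1", "DST Root X3", STILL_VALID),
]
trust_store = {
    "ISRG Root X1": Cert("ISRG Root X1", "ISRG Root X1", STILL_VALID),
    "DST Root X3": Cert("DST Root X3", "DST Root X3", ALREADY_EXPIRED),
}

print(validate_served_chain_only(served_chain, trust_store, TODAY))
print(validate_with_path_building(served_chain, trust_store, TODAY))
```

In this toy model the first check rejects the chain because it ends at the expired root, while the second accepts it as soon as it reaches the still-valid ISRG Root X1 anchor.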
        
             | tialaramex wrote:
             | Well, you're right that there are two issues, and it's
             | perhaps unfortunate that ISRG chose to arrange that they
             | happen next to each other (but to be fair I don't think we
             | pressed them to do anything else back when it would have
             | been possible, although I'm behind on some reading I don't
             | think I'm _that_ far behind)
             | 
             | But, in both cases Firefox's choice works out for it
             | regardless. And unfortunately I believe -- though it's hard
             | to tell for sure -- that some other browsers get the
             | missing certificate case wrong, perhaps for a few hours and
             | perhaps much longer. Lacking a guiding hand, they may end
             | up choosing to "validate" the R3 -> DST Root CA X3 case
             | which can't work. I think some logic will eventually expire
             | this useless data and those browsers will work, but
             | obviously that's not helpful if your site seems "broken"
             | now.
             | 
             | It will be clearer by tomorrow, but understandably Let's
             | Encrypt's community site is under a deluge of "Help!?" type
             | posts, which are a struggle for the volunteers.
        
       | thomascgalvin wrote:
       | Productivity, on the other hand, is through the roof.
        
         | milkers wrote:
         | For me it's the opposite: I cannot add myself to the company
         | VPC, because the command only works through Slack. What a bummer.
        
           | sieabah wrote:
           | More proof that SlackOps/ChatOps is a failed methodology. I'm
           | not quite sure what the fascination is with it. It's
           | certainly neat for querying infrastructure but I certainly
           | wouldn't rely on it for critical infra.
           | 
           | Maybe use this as an example that you guys shouldn't do that?
        
             | JoBrad wrote:
             | That feels like a leap too far. ChatOps should be a
             | convenience, not the sole method of doing something.
             | However conveniences are important.
        
           | snvzz wrote:
           | That's some terrible ops and a ridiculous single point of
           | failure.
        
           | kaustubhvp wrote:
           | time to go home...
        
             | dylan604 wrote:
             | closes laptop. phew, short commute today.
        
               | siva7 wrote:
               | the covid years sufficiently described
        
               | dylan604 wrote:
               | if i were a poet, i might have tried to put it in a
               | meter. since i'm a coder, i made it a one-liner.
        
               | JonathanMerklin wrote:
               | I'll do my next reboot at nine, I'll do my next commute
               | with wine.
        
         | kgeist wrote:
         | We host Rocket.Chat ourselves (300+ people company), and when
         | it's down (which is very rare), all it takes is to make a visit
         | to the IT guy on the same floor. Also our infosec people are
         | happy sensitive data is not stored on servers we don't control.
         | There've been attempts to move to Slack or Google Chat without
         | success. Is there something we're missing by not using Slack?
        
           | ziddoap wrote:
           | Did you intend to reply to the person you did?
           | 
           | My understanding of the parent post was poking fun at the
           | fact that these chat spaces (Slack or otherwise) allegedly
           | end up being a productivity loss as people spend more time
           | finding the meme-of-the-week to post rather than working.
           | 
           | Nothing to do with the viability of Slack or whatever instant
           | messenger you choose to use.
        
             | kgeist wrote:
             | Yes, my bad.
        
       | nathanyz wrote:
       | Could be related to the currently ongoing Let's Encrypt
       | certificate expiration event(1) that is causing fun for engineers
       | today(2).
       | 
       | (1) https://community.letsencrypt.org/t/help-thread-for-dst-
       | root...
       | 
       | (2)
       | https://twitter.com/search?q=letsencrypt&src=typed_query&f=l...
        
         | betaby wrote:
         | No. DNSSEC has nothing to do with HTTPS.
        
           | nathanyz wrote:
           | Interdependencies among services can cause all sorts of
           | unexpected issues.
        
             | kelnos wrote:
             | It can, but it did not, in this case.
             | 
             | Slack uses DigiCert, not LetsEncrypt, and if you poke
             | entries into your /etc/hosts file for various slack
             | hostnames, things work just fine (which they wouldn't if it
             | was a TLS cert problem).
        
             | eropple wrote:
             | No, it's not. This is a fault that appears to be from
             | removing DNSKEY and DNS records at the exact same time.
             | 
             | Let's Encrypt doesn't have _anything_ to do with that even
             | if somebody uses it. Slack uses DigiCert.
        
       | telesilla wrote:
       | We had this issue with a desktop client, this seems to be working
       | for us as a fix:
       | 
       | https://twitter.com/Zap42/status/1443647882045927427
       | 
       | Also, just waiting, or maybe rebooting.
       | 
       | Edit: ah wrong thread sorry! This is re: letsencrypt.
        
         | kelnos wrote:
         | Slack doesn't appear to use LetsEncrypt (checking their cert on
         | slack.com and app.slack.com shows a DigiCert cert), so this
         | will not do anything to help.
        
         | judge2020 wrote:
         | Sounds like that's a different issue then, because that solves
         | the LetsEncrypt issue - the Slack issue is thanks to DNSSec
         | (and Slack uses a Digicert HTTPS certificate[0])
         | 
         | 0: https://crt.sh/?Identity=slack.com&exclude=expired&match==
        
       | iamjohnsears wrote:
       | Switching to Cloudflare DNS fixed this for me
        
         | daxuak wrote:
         | Thank you, I switched to cloudflare/google DNS and flushed
         | router's DNS cache, then it worked.
        
         | michael_michael wrote:
         | Thank you. Switching to 1.1.1.1 got me back online.
        
           | eigthbits wrote:
           | Good tip, back up, appreciate it
        
       | black_13 wrote:
       | Slack is a disruption. Slack is an answer to a question that no
       | one is asking. "Did you ask the Slack channel" is newspeak for
       | "I don't know".
        
       | madars wrote:
       | You can test it at home using "delv www.slack.com @4.2.2.1
       | +rtrace" (4.2.2.[1-6] are Level 3's servers):
       | 
       |     ;; fetch: www.slack.com/A
       |     ;; fetch: com/DS
       |     ;; fetch: ./DNSKEY
       |     ;; fetch: slack.com/DS
       |     ;; fetch: com/DNSKEY
       |     ;; fetch: www.slack.com/DS
       |     ;; validating slack.com/SOA: got insecure response; parent indicates it should be secure
       |     ;; no valid RRSIG resolving 'www.slack.com/DS/IN': 4.2.2.1#53
       |     ;; broken trust chain resolving 'www.slack.com/A/IN': 4.2.2.1#53
       |     ;; resolution failed: broken trust chain
        
         | mike-cardwell wrote:
         | You can also just use dig. Running: `dig a www.slack.com`
         | returns a SERVFAIL for me. Asking my resolver to skip the
         | dnssec checking gives me the A record though: `dig +cd a
         | www.slack.com`
         | 
         | I then look at my unbound dns resolver logs:
         | 
         | Sep 30 21:53:11 unbound[8985:0] info: validation failure
         | <www.slack.com. A IN>: No DNSKEY record from 208.67.220.123 for
         | key slack.com. while building chain of trust
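
The comparison above (a plain query versus one with `+cd`) can be sketched as a small classifier. This is only an illustration: `parse_status`, `diagnose`, and the two sample header lines are hypothetical and hand-written for the example, not captured output, and the sketch assumes the usual `status:` field in dig's header line.

```python
import re


def parse_status(dig_output: str) -> str:
    """Extract the response code (e.g. NOERROR, SERVFAIL) from dig text."""
    match = re.search(r"status:\s*([A-Z]+)", dig_output)
    if match is None:
        raise ValueError("no status field found in dig output")
    return match.group(1)


def diagnose(status_normal: str, status_checking_disabled: str) -> str:
    """Compare a normal query with one sent with checking disabled (+cd)."""
    if status_normal == "NOERROR":
        return "resolves-normally"
    if status_normal == "SERVFAIL" and status_checking_disabled == "NOERROR":
        # The data is there, but the validating resolver refuses to return it.
        return "dnssec-validation-failure"
    return "fails-even-without-validation"


# Hand-written sample header lines, for illustration only.
SAMPLE_NORMAL = ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1"
SAMPLE_CD = ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2"

print(diagnose(parse_status(SAMPLE_NORMAL), parse_status(SAMPLE_CD)))
```

A result of `dnssec-validation-failure` only says that validation is what blocks the answer; the resolver's own logs, as quoted above, are still needed to see which record is missing.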
        
         | exikyut wrote:
         | FWIW, after getting exactly the same output from `delv`
         | locally, I noted that
         | 
         |     $ dig +short slack.com @4.2.2.1
         |     15.206.34.128
         | 
         | and I take this to mean that, despite failing DNSSEC, Level 3
         | is yielding an A record - and that the major players have
         | basically monkey-patched this into working again, and it just
         | needs to propagate through all the caches now?
        
       | sys_64738 wrote:
       | This is a good thing.
        
       | testplzignore wrote:
       | I like how all 5 status updates (as of now) basically say the
       | same thing, though they rushed the third one and forgot to
       | apologize :)
        
       | zucked wrote:
       | >We are aware of connectivity issues (...) In order to resolve
       | this faster, your ISP (Internet Service Provider) will need to
       | flush their DNS record for slack.com. Please reach out to your
       | networking team to provide them with this information.
       | 
       | Yeah, sure, lemme just pick up the phone and call my ISP and let
       | 'em know to flush their DNS record. I'm sure the T1 rep will
       | absolutely know how to handle that.
        
         | kaustubhvp wrote:
         | Have you tried calling Comcast/Xfinity ever?
        
         | sieabah wrote:
         | Well they aren't necessarily incorrect, TTLs can be a pain.
         | 
         | However that's why when you do a migration you lower the TTLs
         | and eat the cost of lookup requests. Wait until your previous
         | TTL lease expires (which is 2x the TTL) then start the
         | migration. After everything is known to be good you up the TTLs
         | again. That way you have a way to "quickly" recover if issues
         | arise.
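
The migration routine described above can be sketched as a small timing calculation. Everything here is a hypothetical illustration: the function name, the example timestamps, and the TTL values are made up, and the default `safety_factor=2.0` simply encodes the commenter's "2x the TTL" rule of thumb rather than any standard.

```python
from datetime import datetime, timedelta


def earliest_migration_time(
    ttl_lowered_at: datetime,
    old_ttl_seconds: int,
    safety_factor: float = 2.0,
) -> datetime:
    """Return the earliest time to start the migration, i.e. once caches
    that picked up the record under the old, long TTL should have let it
    expire (padded by safety_factor)."""
    wait = timedelta(seconds=old_ttl_seconds * safety_factor)
    return ttl_lowered_at + wait


# Hypothetical example: a 24h TTL was lowered to 300s at noon on 28 Sept.
lowered_at = datetime(2021, 9, 28, 12, 0)
start = earliest_migration_time(lowered_at, old_ttl_seconds=86400)
print("do not start the migration before", start)
```

After the cutover, well-behaved caches should pick up a rollback within roughly the new, lowered TTL, which is the "quick" recovery the comment refers to.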
        
         | walrus01 wrote:
         | translation:
         | 
         | a) we pushed a change that was not well thought out in its
         | ramifications
         | 
         | b) we probably didn't lower the TTL on our zones several days
         | in advance of doing this change
         | 
         | c) we reverted in a panic!
         | 
         | d) by the way, did you know that many ISPs' caching nameservers
         | don't respect low TTLs and will hold onto zones longer than
         | they should?
         | 
         | e) please go bother and waste the time of some first tier
         | support rep at RCN or Comcast or CenturyLink, who _certainly_
         | has the power to administer their DNS servers...
         | 
         | oh jeez.
        
           | yjftsjthsd-h wrote:
           | > d) by the way, did you know that many ISPs' caching
           | nameservers don't respect low TTLs and will hold onto zones
           | longer than they should?
           | 
           | In fairness, this part really is a dumb problem on the ISP's
           | end.
        
             | simcop2387 wrote:
             | It's also partly because so many isps got tired of their
             | caches being basically useless because of so many people
             | setting obscenely low ttls too. There's no winning answer.
        
         | [deleted]
        
         | amelius wrote:
         | Why not edit your /etc/hosts file directly?
        
           | [deleted]
        
           | hnarn wrote:
           | Probably because slack doesn't want a long tail of future
           | hard to pin down issues related to people not reverting their
           | changes to their hosts file.
        
         | betaby wrote:
         | That's a laughably bad and dishonest response from Slack. They
         | are redirecting the blame to ISPs now.
        
           | paxys wrote:
           | It's more dishonest to read one part of one line and draw
           | conclusions from it without quoting the entire response.
        
           | azundo wrote:
           | Are they blaming them though? Is this not the actual fastest
           | path to resolving the issue?
        
             | lima wrote:
             | No, the fastest path is putting the signature back. But I
             | guess Route53 won't let them do that.
        
             | walrus01 wrote:
             | unless you are a customer of a _very tiny_ ISP the odds of
             | getting the person on the phone who has root on their
             | recursive caching nameservers provided to DHCP clients are
             | minuscule.
        
               | znpy wrote:
               | And chances are they wouldn't flush that entry for you
               | anyway.
        
             | wk_end wrote:
             | They should at least say something like "we're working with
             | major ISPs to resolve the issue". It really comes off as
             | walking away from a disaster of their own making.
        
             | kreeben wrote:
             | I wonder how much electricity will be spent flushing then
             | repopulating the caches of each and every DNS server on the
             | globe. Prolly not that much?
        
               | lima wrote:
               | No need to flush the full caches. Clearing caches for
               | individual zones is not uncommon.
        
           | alex_c wrote:
           | Missing part of that quote is "This issue was caused by our
           | own change and not related to any third-party DNS software
           | and services."
           | 
           | Seems refreshingly honest to me. Slack screwed up, it's not
           | anyone else's fault, but there's nothing Slack can do about
           | it now, here's a workaround.
        
           | ziddoap wrote:
           | Which part of the full Slack reply is dishonest?
        
       | myth_drannon wrote:
       | I managed to find a work around by connecting using my company's
       | VPN but still can't connect directly on my home line
        
       | phlhar wrote:
       | Wow no DNS A Record for slack.com. Somebody fucked up
        
         | [deleted]
        
         | mhousley wrote:
         | It appears that only certain DNS servers are affected. I can
         | connect from my phone, but not from my home internet
         | connection.
        
         | _jal wrote:
         | It looks like a failed attempt at implementing DNSSEC.
        
         | [deleted]
        
       | bastardoperator wrote:
       | These are my favorite outages
        
       | keville wrote:
       | https://lists.dns-oarc.net/pipermail/dns-operations/2021-Sep...
        
         | TobTobXX wrote:
         | Huh, TIL:
         | 
         | > Both Google and Cloudflare have a publicly accessible feature
         | to flush the cache for a domain, so anyone could have done it:
         | > https://developers.google.com/speed/public-dns/cache
         | > https://1.1.1.1/purge-cache/
         | 
         | Quite useful feature indeed.
        
           | lima wrote:
           | Unlikely to help in this particular case (which is a root-
           | level DS record).
        
             | syvanen wrote:
             | It did help, as it also removes the cached DS record. Just
             | try making a DNS query using those resolvers.
        
         | terom wrote:
         | https://dnsviz.net/d/slack.com/YVXX_g/dnssec/ the dnsviz
         | analysis showing the slack.com zone DNSKEY existing at 12:55,
         | followed by the .com zone DS record at 15:30. However, the
         | next analysis at 17:24 shows both the .com zone DS and
         | slack.com DNSKEY records have disappeared!
         | 
         | Given that the slack.com DNSKEY shows up with a 1h TTL and the
         | .com zone DS has a 24h TTL, they are screwed in the presence of
         | cached slack.com DS records from the .com zone. Do not throw
         | away your DNSKEY until your delegation's TTL has absolutely
         | positively surely expired from any resolver caches!
         | 
         | The slack.com domain is an AWS Route 53 zone, I'd be really
         | interested to see a post-mortem explaining what happened here.
         | Are they unable to recover the KSK/ZSK and restore the
         | DNSKEY/etc records?
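
The rule stated above (keep the DNSKEY until the DS has expired from caches) reduces to simple arithmetic. A minimal sketch, assuming the 24h DS TTL mentioned in the comment; the function names and the removal timestamps are hypothetical, since the thread does not give the exact time the records were pulled.

```python
from datetime import datetime, timedelta


def earliest_safe_dnskey_removal(ds_removed_at: datetime, ds_ttl_seconds: int) -> datetime:
    """A resolver may have cached the DS just before it was withdrawn from
    the parent zone, so the DNSKEY must stay published for a full DS TTL."""
    return ds_removed_at + timedelta(seconds=ds_ttl_seconds)


def is_removal_safe(ds_removed_at: datetime, dnskey_removed_at: datetime, ds_ttl_seconds: int) -> bool:
    return dnskey_removed_at >= earliest_safe_dnskey_removal(ds_removed_at, ds_ttl_seconds)


DS_TTL = 24 * 3600  # the 24h DS TTL quoted above

# Hypothetical timestamps, chosen only to illustrate the check.
ds_removed = datetime(2021, 9, 30, 16, 0)
print(earliest_safe_dnskey_removal(ds_removed, DS_TTL))
print(is_removal_safe(ds_removed, ds_removed, DS_TTL))  # both pulled together
```

Pulling both records at the same moment fails the check, which matches the failure mode the dnsviz snapshots suggest.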
        
         | [deleted]
        
         | tptacek wrote:
         | Holy shit. That's bad.
         | 
         | What this suggests is that Slack, for reasons passing
         | understanding, enabled DNSSEC on their zones (with a DS record
         | that essentially turns DNSSEC on, and the accompanying key
         | records) --- then disabled DNSSEC by pulling all the records.
         | But the DS records are in caches; validating resolvers go
         | looking for the keys, which don't exist, and say "welp, I guess
         | Slack.com doesn't exist".
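
The failure described above can be modeled with a toy resolver. This is a sketch under loud assumptions: `ToyValidatingResolver` is hypothetical, it does no cryptography, it treats "DS cached but no DNSKEY served" as the only failure case, and it assumes the parent zone no longer publishes a DS once the cache entry is gone. The address is a documentation placeholder, not Slack's.

```python
class ToyValidatingResolver:
    """Tracks only when a cached DS record for each zone expires."""

    def __init__(self):
        self._ds_expiry = {}  # zone name -> time (seconds) the cached DS expires

    def cache_ds(self, zone: str, now: float, ttl: float) -> None:
        self._ds_expiry[zone] = now + ttl

    def flush(self, zone: str) -> None:
        # What an operator (or a public purge-cache form) would trigger.
        self._ds_expiry.pop(zone, None)

    def resolve(self, zone: str, now: float, zone_data: dict) -> str:
        expects_signed = self._ds_expiry.get(zone, 0) > now
        if expects_signed and "DNSKEY" not in zone_data:
            # The parent said "signed", but the keys are gone: refuse to answer.
            return "SERVFAIL"
        return zone_data["A"]


# Zone contents after DNSSEC was switched off: an A record but no DNSKEY.
unsigned_zone = {"A": "192.0.2.10"}

resolver = ToyValidatingResolver()
resolver.cache_ds("slack.com", now=0, ttl=86400)  # DS cached while signing was on

print(resolver.resolve("slack.com", now=3600, zone_data=unsigned_zone))
resolver.flush("slack.com")
print(resolver.resolve("slack.com", now=3600, zone_data=unsigned_zone))
```

In this model a resolver that never cached the DS, or one whose entry was flushed, answers normally, which is consistent with the reports in this thread that switching resolvers helped.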
        
           | tialaramex wrote:
           | Aren't you, in fact, the same Thomas Ptacek who has
           | repeatedly claimed that DNSSEC is so irrelevant that events
           | like this would go essentially unnoticed?
           | 
           | Edited to add, e.g.
           | https://news.ycombinator.com/item?id=22400167
           | 
           | > DNSSEC is moribund and almost nobody uses it; in reality,
           | the DNSSEC root private keys could land on Pastebin tomorrow
           | and nothing would "break"
           | 
           | We have this whole thread here about a "service disruption"
           | for Slack, and nobody leaked the "root private keys" just one
           | person made a dumb error and it blew up their site.
        
             | tptacek wrote:
             | No, I'm the Thomas Ptacek who has repeatedly claimed that
             | the only impact DNSSEC is going to have on the Internet is
             | causing outages like this. It's right there in the blog
             | posts; in fact, it's even in the 2007 blog posts I wrote
             | about this on the Matasano blog.
        
             | ptomato wrote:
             | > just one person made a dumb error and it blew up their
             | site
             | 
             | yeah, the dumb error they made was "using DNSSEC"
        
               | bluejekyll wrote:
               | I'm not going to defend DNSSEC here, because this outage
               | and others continue to support tptacek's perspective on
               | its usefulness.
               | 
               | But, some governments are requiring DNSSEC, which
               | regardless of its usefulness, puts companies that want
               | those contracts in a bit of a bind.
               | 
               | Perhaps it would make sense to split domains such that
               | DNSSEC guarded ones would not negatively impact ones that
               | do not have DNSSEC.
        
               | tptacek wrote:
               | The USG DNSSEC requirements, which seem to be a part of
               | what happened, are fragmented and incoherent. OMB
               | withdrew DNSSEC requirements in 2018, and CLOUD.GOV
               | doesn't support it. But some older requirements documents
               | still have them, and need to be updated.
               | 
               | The important top-line thing to know here is that
               | virtually all tech companies eschew DNSSEC (you can
               | verify that for yourself with `host -t ds stripe.com`;
               | substitute any other company for Stripe).
               | 
               | DNSSEC-quarantine TLDs are a good idea.
        
           | ithkuil wrote:
           | We had a DNS related outage with route53. Some of our zones
           | just lost some records and then they reappeared. Could that
           | explain what happened to slack's DNSSEC related records?
        
             | terom wrote:
             | A good question, and apparently enough to elicit a response:
             | No!
             | 
             | > This issue was caused by our own change and not related
             | to any third-party DNS software and services.
        
           | aeden wrote:
           | I wonder if they are using tooling that doesn't properly
           | retain DNSKEY records for DS records that were recently
           | removed? This is one of the reasons we perform controlled
           | automated key rotation and removal in DNSimple, so that we can
           | ensure we retain the keys in the authoritative zone on each
           | key rollover, giving the DS records time to expire from caches.
        
           | icedchai wrote:
           | It was especially bad since their status page wouldn't even
           | resolve! I eventually just restarted my local caching DNS
           | server.
        
       | keville wrote:
       | If you can't reach slack.com, here's their status page:
       | https://slack-status.azureedge.net/2021-09/06c1e17de93e7dc2
        
         | omreaderhn wrote:
         | Based on that status page and the list-serve email link you
         | posted it seems like they still don't understand what's going
         | on (if that email is correct).
        
           | ricardobeat wrote:
           | It seems to be standard practice to be as fuzzy as possible
           | during an outage, and only share details later. Probably
           | avoids anyone looking stupid if the initial hunch turns out
           | to be wrong.
        
             | pjlegato wrote:
             | Many outages are caused by malicious actors attacking your
             | infrastructure. This cause may not be apparent at all until
             | much later.
             | 
             | It's better to avoid leaking any information in general
             | during the incident, as it's often not immediately possible
             | to know whether a hostile adversary exists, nor what
             | advantages the adversary might derive from detailed updates
             | during the incident.
        
       | patrickbolle wrote:
       | Shopify has been down quite a bit today as well.
        
         | jftuga wrote:
         | Yes it has, https://downdetector.com/status/shopify/
        
       ___________________________________________________________________
       (page generated 2021-09-30 23:01 UTC)