[HN Gopher] The case of the recursive resolvers: What happened d...
       ___________________________________________________________________
        
       The case of the recursive resolvers: What happened during Slack's
       DNSSEC rollout
        
       Author : usrme
       Score  : 49 points
       Date   : 2021-11-29 11:36 UTC (11 hours ago)
        
 (HTM) web link (slack.engineering)
 (TXT) w3m dump (slack.engineering)
        
       | daper wrote:
       | Of the mistakes described, two come from a lack of understanding
       | of how exactly DNS works. But I agree it is in fact hard (see
       | [1]).
       | 
       | 1. "This strict DNS spec enforcement will reject a CNAME record
       | at the apex of a zone (as per RFC-2181), including the APEX of a
       | sub-delegated subdomain. This was the reason that customers using
       | VPN providers were disproportionately" - This is non-intuitive
       | and many people are surprised by that. You cannot create any
       | subdomain (even www.domain.tld) if you created "domain.tld CNAME
       | something...". Looks like not every server/resolver enforces that
       | restriction.
       | 
       | 2. "based on expert advice, our understanding at the time was
       | that DS records at the .com zone were never cached, so pulling it
       | from the registrar would cause resolvers to immediately stop
       | performing DNSSEC validation." - like any other record, they can
       | be cached. DNS also has negative caching (caching of "not found"
       | responses). Moreover, there are resolvers that allow configuring
       | a minimum TTL that can be higher than what your name servers
       | return (like unbound's "cache-min-ttl" option) or that can be
       | configured to serve stale responses in case of resolution
       | failures after the cached data expires [2]. That means returning
       | a TTL of "1s" will not work as you expect.
       | 
       | [1] https://blog.powerdns.com/2020/11/27/goodbye-dns-goodbye-
       | pow... [2] https://www.isc.org/blogs/2020-serve-stale/
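
To make the caching point above concrete, here is a minimal Python sketch of
how long a record can outlive its published TTL. It is not Unbound's or BIND's
actual implementation: ResolverPolicy, its fields, and both functions are
hypothetical names, and the model assumes a simple clamp to a minimum and
maximum TTL plus an optional serve-stale window.

```python
from dataclasses import dataclass


@dataclass
class ResolverPolicy:
    """Hypothetical resolver cache settings (illustrative names only)."""
    min_ttl: int = 0          # floor applied to every cached record, in seconds
    max_ttl: int = 86400      # ceiling applied to every cached record, in seconds
    serve_stale_for: int = 0  # extra seconds an expired record may still be served


def fresh_lifetime(record_ttl: int, policy: ResolverPolicy) -> int:
    """Seconds the resolver treats the record as fresh after caching it."""
    return max(policy.min_ttl, min(record_ttl, policy.max_ttl))


def worst_case_lifetime(record_ttl: int, policy: ResolverPolicy) -> int:
    """Seconds the record may still be answered from cache if upstream lookups fail."""
    return fresh_lifetime(record_ttl, policy) + policy.serve_stale_for


# A resolver with no overrides, and one with a one-hour floor plus one day of serve-stale.
default = ResolverPolicy()
sticky = ResolverPolicy(min_ttl=3600, serve_stale_for=86400)

print(fresh_lifetime(1, default))       # 1
print(fresh_lifetime(1, sticky))        # 3600
print(worst_case_lifetime(1, sticky))   # 90000
```

Under these assumed settings, a record published with a 1 second TTL can
still be answered from cache for more than a day, which is why pulling a DS
record does not take effect everywhere at once.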
        
         | btown wrote:
         | My (basic and conservative) mental model that "in DNS,
         | _everything including the lack of presence of a thing_ can be
         | cached" is why I'm very cautious before rolling out anything
         | from DKIM to DNSSEC. A deep understanding of specifications is
         | vital. I'm somewhat surprised an organization of Slack's scale
         | didn't have a consultant on the level of "I designed DNSSEC" on
         | hand for this.
        
           | belorn wrote:
           | DNS is a bit like network engineering, in that simple errors
           | tend to have large impacts, which rules out trial and error.
           | Before working as a sysadmin I thought that doing
           | experimental lab setups was something only researchers and
           | students did, but when you have an old system up and running,
           | it can be quite difficult to get in there and make changes
           | unless you are very sure about what you are doing.
           | 
           | As in networking, there can also be existing protocol errors
           | and plain broken things that have, for one reason or another,
           | seemingly been working for decades without causing a problem.
           | Internet flag day is one of those things that pokes at those
           | problems, and maybe one day we will see a test for CNAME at
           | the apex.
        
             | tptacek wrote:
             | It's worth noting that this by itself is a reason not to do
             | ambitious security things (and a global PKI is nothing if
             | not ambitious) at the layer of DNS. It's an extension of
             | the end-to-end argument, or at least of the logic used
             | in the Saltzer and Reed paper: because it's difficult and
             | error-prone to deploy policy code in the core of the
             | network (here: the "conceptual" core of the protocol
             | stack), we should work to get that policy further up the
             | stack and closer to the applications that actually care
             | about that policy.
             | 
             | The Saltzer and Reed paper, if I'm remembering right, even
             | calls out security as specifically one of those things you
             | don't want to be doing in the middle of the network.
             | 
             | See also: Zero Trust / BeyondCorp.
        
       | dogecoinbase wrote:
       | In addition to the other note that DNSSEC is _not_ required for
       | FedRAMP certification (it's even discouraged by cloud.gov!
       | https://cloud.gov/docs/compliance/domain-standards/ ), this is
       | some weirdly intellectually dishonest phrasing (linking to
       | tptacek's article Against DNSSEC:
       | https://sockpuppet.org/blog/2015/01/15/against-dnssec/ ):
       | 
       | > While we are aware of the debate around the utility of DNSSEC
       | among the DNS community, we are still committed to securing Slack
       | for our customers.
       | 
       | The argument is specifically that it doesn't provide that
       | security. At least it's neat to see actual begging the question
       | in the wild, I guess.
        
         | mpyne wrote:
         | FedRAMP is designed to provide reusable cybersecurity work
         | against the NIST security controls that your Federal agency's
         | Authorizing Official deems your Federal IT system must
         | implement.
         | 
         | Those security controls come from NIST SP 800-53. Two of them,
         | SC-20 and SC-21 (both linked from Slack's post-mortem), seem to
         | me to conspire to effectively require DNSSEC. Both are included
         | in the "Low" baseline of security controls, so they are
         | effectively required for all Federal IT systems unless your
         | Agency Authorizing Official wants to walk on the wild side.
         | 
         | So even if you get a FedRAMP certification, if you do it
         | without fully implementing SC-20 and SC-21, that just means
         | your customer needs to either convince their Agency Authorizing
         | Official to sign off on an ATO despite the missing SC-20 and
         | SC-21 security control, convince them to sign off on some sort
         | of Plan of Action and Milestones where Slack will commit to fix
         | this in the future (which is just kicking the can down the
         | road), or somehow manage to implement the same effect
         | completely within the customer end without help from Slack. All
         | you would have done is to spend a lot of money on FedRAMP
         | paperwork without making it appreciably easier for potential
         | customers who have to deal with compliance regimes to buy your
         | product.
         | 
         | Cloud.gov's argument is valid but all they posted is that they
         | don't implement SC-20 or SC-21 for their government customers,
         | and that the OMB M-08-23 mandate for DNSSEC is no longer
         | operative (not that no other DNSSEC mandate applies). Indeed,
         | they even explain how their customers should work to enable it
         | (presumably by refusing to use the non-DNSSEC-compliant
         | .app.cloud.gov services and instead using only their
         | DNSSEC-compliant custom domains).
         | 
         | FWIW I fully agree with tptacek's arguments against DNSSEC, and
         | will note that I recently stopped being able to navigate to
         | literally the entire .mil on my Linux host until I disabled
         | DNSSEC in systemd, for reasons that are still unclear to me
         | even now.
        
         | tylermenezes wrote:
         | > intellectually dishonest phrasing
         | 
         | Not everyone agrees with the linked argument. For example, I
         | disagree that browsers can't take advantage of DNSSEC, since
         | many are using DoH, and the rest of the article reads like
         | someone complaining that we need to wait for the perfect
         | protocol or nothing at all.
         | 
         | That's the thing about a debate... it's got arguments on both
         | sides.
        
           | dogecoinbase wrote:
           | It's fine to disagree with the linked argument, but you
           | actually have to do so. This is them presupposing that
           | "securing Slack for [their] customers" requires DNSSEC --
           | it's not engaging with the argument at all.
        
           | tptacek wrote:
           | I mean, I agree with you and don't find the language
           | disingenuous (I felt like it was more of a tell that the
           | people working on this cursed project weren't super read into
           | DNSSEC and DNS security in general, which isn't a knock; it's
           | a boring thing to keep up with, especially when the best-
           | practice answer is so simple --- just don't bother with
           | DNSSEC).
           | 
           | But I'd also say that DoH (1) largely obviates any need for
           | DNSSEC (the last-mile DNS problem is the only on-the-wire DNS
           | security problem that needs solving) and (2) doesn't enable
           | DANE in browsers, which is what people are talking about when
           | they talk about DNSSEC intersecting with browsers in any way
           | other than randomly making sites fall off the Internet.
        
       | Joe8Bit wrote:
       | I know we've all collectively accepted that DNSSEC is a terrible,
       | complicated blight on the world, but I still find it incredible
       | that an organisation with Slack's resources and access to
       | expertise can't make it work.
        
         | toomuchtodo wrote:
         | No tech company is infallible. All of them have outages, some
         | lasting hours, even days.
         | 
         | Complex systems can and will fail. Try to do better, of course,
         | but let's acknowledge that perfection will always exceed our
         | grasp. The world will continue to turn regardless.
         | 
         | One day it might just be your turn to break production.
        
           | tptacek wrote:
           | The subtext here isn't that Slack is bad at this (they are
           | not), but that DNSSEC is somehow intrinsically unsafe (it
           | probably is).
        
             | toomuchtodo wrote:
             | I agree with your points about DNSSEC (disclaimer: I have
             | not had the pleasure of having to implement it myself in
             | infra), but was attempting to communicate that DNSSEC isn't
             | the only area of ops that folks get exposed to these sorts
             | of unknowns or edge cases, and that no amount of resourcing
             | enables you to avoid these issues. For Slack, it was
             | DNSSEC. For Roblox, Consul. Facebook/Insta, software
             | defined BGP. Akamai, DNS.
             | 
             | Perhaps I did not read the room appropriately. Mea culpa.
        
         | tptacek wrote:
         | You say Slack, and I agree, that's telling, but you have to add
         | to that _AWS itself_ , which had a DNSSEC bug in its wildcard
         | record support as well. Slack and AWS together couldn't make
         | this feature work. Further: the open source tooling Slack (like
         | most places) relies on for deployment is also DNSSEC-hostile:
         | one of their problems is that Terraform's Route53 provider
         | doesn't safely disable DNSSEC once enabled. It's a mess
         | everywhere you look.
         | 
         | I think another interesting question here is why Slack bothered
         | in the first place. As was pointed out on the other DNSSEC
         | thread today: practically nobody in the technology industry
         | uses DNSSEC in the first place. Presumably, Slack did DNSSEC
         | (they don't anymore!) in service of FedRAMP compliance. Why?
         | Slack has one of the most popular products in all of computing.
         | What bad thing was going to happen if they said "nah, we're
         | going to go with Cloud.gov's recommendation and not this
         | FedRAMP document"?
        
           | x3n0ph3n3 wrote:
           | Because FedRAMP compliance is required for many US federal
           | (and now some state) customers, whom Slack can charge a
           | premium.
        
           | vimda wrote:
           | Gotta be FedRAMP compliant to do business with the US
           | government. Even worse, you have to be FedRAMP compliant to
           | work with anyone who works with the US government. From a
           | business (if not an engineering) standpoint, there's plenty
           | to gain in going through the motions.
        
             | tptacek wrote:
             | As was pointed out downthread, there are tech companies
             | that are "more" FedRAMP compliant (FedRAMP "High") without
             | DNSSEC support.
             | 
             | (Kenn White points out on Twitter that some of this may be
             | due to grandfathering --- though, the FedRAMP DNSSEC
             | requirement is pretty old.)
        
           | mpyne wrote:
           | > Presumably, Slack did DNSSEC (they don't anymore!) in
           | service of FedRAMP compliance. Why? Slack has one of the most
           | popular products in all of computing. What bad thing was
           | going to happen if they said "nah, we're going to go with
           | Cloud.gov's recommendation and not this FedRAMP document"?
           | 
           | As just one example, it's tremendously difficult, if not
           | impossible, to sell your cloud-based SaaS to Navy customers
           | if you have open FedRAMP compliance issues that you aren't at
           | least working to address.
           | 
           | I say "compliance" instead of "security" for a reason as
           | well, as "compliance" truly runs the show in Navy
           | cybersecurity. And if you want to sell to that market (and
           | it's hardly just Navy who runs this way), it's easier to
           | check the checkboxes than it is to argue about whether NIST
           | is right or cloud.gov is right.
        
         | technion wrote:
         | I know HN has collectively accepted this, but every time I'm
         | associated with an organisation that pays for a penetration
         | test, missing DNSSEC comes in as a high-risk finding, so much
         | so that I've given in to deploying it to avoid sitting with
         | non-technical managers doing the "here's why I disagree" all
         | over again. Outside of this group I definitely feel like I'm on
         | my own in that view.
        
         | belorn wrote:
         | _" It turned out that some resolvers become more strict when
         | DNSSEC signing is enabled at the authoritative name servers,
         | even while signing was not enabled at the root name servers
         | (i.e. before DS records were published to COM nameservers).
         | This strict DNS spec enforcement will reject a CNAME record at
         | the apex of a zone (as per RFC-2181), including the APEX of a
         | sub-delegated subdomain"_
         | 
         | Slack's second attempt wasn't a DNSSEC problem. Slack depended
         | on a permissive fallback of resolvers when encountering a plain
         | DNS protocol error. It is similar to how some websites in the
         | past relied on permissive browser implementations when facing
         | broken HTML/JS/CSS. Slack fixed their broken DNS as a result of
         | this.
         | 
         | Slack's third attempt was not the fault of Slack but rather a
         | software bug at Amazon. I would make the argument that Amazon's
         | primary product isn't DNS services, but they did fix their
         | bug after this.
         | 
         | The general conclusion I get from the article is not that
         | DNSSEC is broken, nor that it is too complicated. It is that
         | when making changes to your core infrastructure to make it more
         | secure, bugs that may have been lying dormant might pop up and
         | bite. I am sure some people have had that experience in domains
         | outside of DNS.
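
The CNAME rule quoted in the comment above can be sketched as a small zone
lint in Python. This is an illustration under stated assumptions, not any
resolver's real validation logic: cname_conflicts and the sample zone are
hypothetical, records are reduced to (owner, type) pairs, and the set of
types allowed next to a CNAME in a signed zone (RRSIG, NSEC) reflects my
reading of the DNSSEC RFCs.

```python
from collections import defaultdict

# Record types assumed to be allowed at the same owner name as a CNAME
# in a signed zone; everything else counts as a conflict.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(records):
    """Map each owner name holding a CNAME plus other data to the conflicting types.

    `records` is an iterable of (owner_name, record_type) pairs.
    """
    types_by_owner = defaultdict(set)
    for owner, rtype in records:
        # Normalise case and the trailing root dot so names compare equal.
        types_by_owner[owner.lower().rstrip(".")].add(rtype.upper())

    conflicts = {}
    for owner, types in types_by_owner.items():
        if "CNAME" in types:
            extra = types - ALLOWED_WITH_CNAME
            if extra:
                conflicts[owner] = sorted(extra)
    return conflicts


# Hypothetical sub-delegated zone whose apex carries a CNAME.
zone = [
    ("app.example.com.", "SOA"),
    ("app.example.com.", "NS"),
    ("app.example.com.", "CNAME"),  # conflicts with the SOA and NS at the apex
    ("www.example.com.", "CNAME"),  # fine: no other data at this name
    ("www.example.com.", "RRSIG"),
]

print(cname_conflicts(zone))  # {'app.example.com': ['NS', 'SOA']}
```

Because every zone apex must carry SOA and NS records, a CNAME there always
conflicts, and a strict resolver is entitled to reject it even if permissive
ones have tolerated it for years.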
        
           | ignoramous wrote:
           | You are not wrong, but by simply avoiding DNSSEC, Slack would
           | not have had the outage they did. Not to mention the drain on
           | eng resources, which perhaps may be even more expensive.
           | 
           | What one can't ignore is the underlying chicken-and-egg
           | problem that DNSSEC must overcome: there are not many DNSSEC
           | deployments, and hence not much of it has been tested in the
           | real world, which results in bugs despite the attention of
           | some of the most qualified engs, including the ones running
           | one of the largest nameserver deployments in the world.
           | 
           | https://apenwarr.ca/log/20201227
        
       | tptacek wrote:
       | Additional discussion, indirectly spurred by this, is here:
       | 
       | https://news.ycombinator.com/item?id=29381778
       | 
       | That thread, which is big, is probably the right place to take
       | general discussion of DNSSEC itself, though I'll snipe DNSSEC
       | here too. :)
        
       | [deleted]
        
       | teddyh wrote:
       | From what I can tell, the problem was not caused by DNSSEC
       | directly. It was caused by:
       | 
       | 1. A bug in Route 53 which caused wildcard records not to work
       | with DNSSEC signing. Anyone not using Route 53 would not have had
       | any problems with DNSSEC.
       | 
       | 2. Slack decided to revert the DNSSEC rollout, but botched the
       | process _badly_ , effectively locking themselves in the trunk and
       | throwing away the key. If they hadn't tried to revert the DNSSEC
       | rollout, or if they had been a bit more deliberate and careful
       | while doing it, this would not have happened.
        
       | jeffbee wrote:
       | Seems like an organizational failure, as they got conned by their
       | 3PAO into believing that DNSSEC was a requirement for FedRAMP
       | moderate when it's not. The disproof of this belief is that
       | Google has FedRAMP High (for Google Cloud and Workspace) but does
       | not use DNSSEC for google.com.
        
         | goalieca wrote:
         | If you use https everywhere, you will have a server certificate
         | with the hostname embedded in it. This is how TLS knows you're
         | talking to the right server.
        
         | mpyne wrote:
         | The ultimate arbiter of whether a cloud service gets used isn't
         | FedRAMP, it's the Agency Authorizing Official. FedRAMP just
         | makes much of the work reusable. With GCP, you can build
         | something that obeys and uses DNSSEC without needing google.com
         | to participate in DNSSEC.
         | 
         | Google Workspace is a good point though. I know there are many
         | users of it in government... maybe some AOs are fine signing
         | off on it even without the needed security controls, which is
         | an option they have at their discretion, with or without
         | FedRAMP.
        
       | dsXLII wrote:
       | It's always DNS.
        
         | eropple wrote:
         | This is a dirty lie.
         | 
         | Sometimes it's BGP.
        
           | vimda wrote:
           | And sometimes (as in the Facebook outage), it's both!
        
       ___________________________________________________________________
       (page generated 2021-11-29 23:00 UTC)