[HN Gopher] Last week's Let's Encrypt downtime
___________________________________________________________________
Last week's Let's Encrypt downtime
Author : agwa
Score : 150 points
Date : 2023-06-22 14:42 UTC (8 hours ago)
(HTM) web link (www.agwa.name)
(TXT) w3m dump (www.agwa.name)
| AdamJacobMuller wrote:
| Did we kill crt.sh? FATAL: terminating
| connection due to conflict with recovery DETAIL: User
| query might have needed to see row versions that must be removed.
| CONTEXT: SQL statement "SELECT c.ID, x509_print(c.CERTIFICATE,
| NULL, 196608), ca.ID, cac.CA_ID,
| digest(c.CERTIFICATE, 'sha1'::text),
| digest(c.CERTIFICATE, 'sha256'::text),
| x509_serialNumber(c.CERTIFICATE),
| digest(x509_publicKey(c.CERTIFICATE), 'sha256'::text),
| x509_rsamodulus(c.CERTIFICATE),
| x509_hasROCAFingerprint(c.CERTIFICATE),
| x509_hasClosePrimes(c.CERTIFICATE), c.CERTIFICATE
| FROM certificate c LEFT OUTER JOIN ca ON
| (c.ISSUER_CA_ID = ca.ID) LEFT OUTER JOIN
| ca_certificate cac ON (c.ID =
| cac.CERTIFICATE_ID) WHERE digest(c.CERTIFICATE,
| 'sha256') = t_bytea" PL/pgSQL function
| web_apis(text,text[],text[]) line 1757 at SQL statement
| ERROR: server conn crashed? server closed the connection
| unexpectedly This probably means the server terminated
| abnormally before or while processing the request.
|
| and now it's just a 502 error!
| agwa wrote:
| Unfortunately, crt.sh is chronically overloaded.
| AdamJacobMuller wrote:
| I've never seen it happen before, but, you would know better!
| mjw1007 wrote:
| > these certificates were already being rejected by Chrome and
| Safari for having invalid SCTs
|
| What's a good way to make an equivalent check from a script, if I
| want (in future) to be able to check whether I have a website
| whose certificate has such a problem?
| agwa wrote:
| Excellent question! The sctcheck command from
| https://github.com/google/certificate-transparency-go/ can be
| used to check the signatures of the embedded SCTs in a
| certificate.
|
| I've also got an online tool which you can use to test a site
| for CT policy compliance:
| https://sslmate.com/labs/ct_policy_analyzer/
|
| Example of a working site:
| https://sslmate.com/labs/ct_policy_analyzer/?sslmate.com
|
| Example of one of the sites affected by the Let's Encrypt
| incident:
| https://sslmate.com/labs/ct_policy_analyzer/?thecandyshake.c...
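For a scripted check, sctcheck is still the tool that verifies SCT signatures. The Python below is only a rough, stdlib-only sketch of the data it inspects: it decodes the TLS-encoded SCT list laid out in RFC 6962, section 3.3. It assumes you have already extracted the raw list bytes from the certificate's embedded-SCT extension (OID 1.3.6.1.4.1.11129.2.4.2) and unwrapped the surrounding OCTET STRINGs. The names Sct, parse_sct and parse_sct_list are hypothetical, and the code does not verify any signature.

```python
import struct
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class Sct(NamedTuple):
    version: int         # 0 means v1
    log_id: bytes        # SHA-256 hash of the log's public key
    timestamp: datetime  # when the log promised to include the certificate
    hash_alg: int        # TLS HashAlgorithm code (4 = SHA-256)
    sig_alg: int         # TLS SignatureAlgorithm code (3 = ECDSA)
    signature: bytes     # NOT verified by this sketch


def parse_sct(data: bytes) -> Sct:
    """Decode one serialized v1 SCT."""
    version = data[0]
    if version != 0:
        raise ValueError("unsupported SCT version %d" % version)
    log_id = data[1:33]
    (millis,) = struct.unpack(">Q", data[33:41])
    (ext_len,) = struct.unpack(">H", data[41:43])
    pos = 43 + ext_len  # skip the CtExtensions field
    hash_alg, sig_alg, sig_len = struct.unpack(">BBH", data[pos:pos + 4])
    signature = data[pos + 4:pos + 4 + sig_len]
    if len(signature) != sig_len:
        raise ValueError("truncated SCT signature")
    timestamp = EPOCH + timedelta(milliseconds=millis)
    return Sct(version, log_id, timestamp, hash_alg, sig_alg, signature)


def parse_sct_list(data: bytes) -> List[Sct]:
    """Decode a SignedCertificateTimestampList (2-byte length prefixes)."""
    (total,) = struct.unpack(">H", data[:2])
    if total != len(data) - 2:
        raise ValueError("SCT list length does not match the data")
    scts = []
    pos = 2
    while pos < len(data):
        (n,) = struct.unpack(">H", data[pos:pos + 2])
        scts.append(parse_sct(data[pos + 2:pos + 2 + n]))
        pos += 2 + n
    return scts
```

A script could use this to list the log IDs and timestamps a certificate carries. Deciding whether those SCTs are valid still requires checking each signature against the log's public key, which is what sctcheck and the online analyzer do.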
| jimmyl02 wrote:
| This is a great writeup and intro to certificate transparency
| overall. Glad to see that certificate authorities are being held
| accountable and to learn more about how it's done!
| jrpelkonen wrote:
| > I find it alarming that a week after the incident, 40% of the
| affected certificates are still in use, despite being rejected by
| the most popular browsers and despite affected subscribers being
| emailed by Let's Encrypt.
|
| This is perhaps a consequence of how well-oiled a machine LE
| typically is: people stop paying attention to it.
| tredre3 wrote:
| That is true but how come certbot had no awareness of
| revoked/withdrawn certificates before now? It seems like one of
| the things a CA is supposed to solve for you, and the fact that
| it doesn't is a bit alarming in itself.
|
| Though, as the following sentence points out, they were already
| working on it before the outage, so clearly they knew it was
| needed.
|
| 1. https://datatracker.ietf.org/doc/draft-ietf-acme-ari/
| 411111111111111 wrote:
| The CA can't solve it for you.
|
| The certificate authority signs certificate requests,
| creating certificates. The revocation process is necessary as
| well, but the CA doesn't have the ability to change the
| already issued certificate, thus it cannot take action.
|
| Software like certbot can solve it for you, but that's not
| affiliated with your CA.
| agwa wrote:
| The CA is part of the solution by using ARI to inform ACME
| clients to replace impacted certificates.
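As a minimal sketch of the client side of ARI, assuming the response shape described in the draft-ietf-acme-ari document linked earlier in the thread (a JSON object with a suggestedWindow that has RFC 3339 start and end fields): the client picks a uniformly random time inside the window and renews once that time has passed. The function names here are hypothetical, and fetching the renewalInfo resource from the CA is left out.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_rfc3339(value: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pick_renewal_time(renewal_info: dict, rng: random.Random) -> datetime:
    """Choose a uniformly random instant inside the suggested window."""
    window = renewal_info["suggestedWindow"]
    start = parse_rfc3339(window["start"])
    end = parse_rfc3339(window["end"])
    if end < start:
        raise ValueError("suggestedWindow ends before it starts")
    offset = rng.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset)


def should_renew_now(renewal_info: dict, now: datetime,
                     rng: Optional[random.Random] = None) -> bool:
    """Renew if the randomly chosen renewal time is already in the past."""
    rng = rng or random.Random()
    return pick_renewal_time(renewal_info, rng) <= now
```

When a CA wants affected certificates replaced, it can move the window entirely into the past, so every polling client that follows this logic decides to renew on its next check.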
| mcpherrinm wrote:
| Even before ARI, some integrated ACME/Web servers use
| OCSP as a way of knowing to renew if a cert was revoked.
| Plus if you're doing that you can pin the OCSP response
| while you're at it.
| 411111111111111 wrote:
| My point was that the CA can't solve it for you, they can
| only give you APIs and processes with which you can solve
| it yourself.
|
| If your webserver supports checking the certificate
| validity then it's not solved by the CA, it's been solved
| by the developers of that software and by you installing
| it.
| tialaramex wrote:
| I haven't looked at a list of revoked certificates, because I
| was busy, (and I no longer operate my own CT auditing software,
| so I'd have to poke around in crt.sh which is not much fun) but
| let's suppose these are a random sample of Let's Encrypt's ~2
| million issuances per day.
|
| What %age of the world's HTTPS web sites are "parked" and so
| there is nobody who expects them to actually work?
| BrandFromATVShow.example ? TeenDanceISawOnTikTok.example ?
| SomeShortEnglishWord.example ? Nobody cares, if they do visit,
| and there's a certificate failure, they realise that's not
| where they meant to go and leave.
|
| Then what %age are somebody's fever dream / retirement plan /
| abandoned start-up idea and so although the owner may notice
| _eventually_ that it's broken, that might not happen before
| automatic renewal "fixes" the problem anyway if ever.
| MyTownOlympicSwimmingPool.example JimAndBethsCakeShop.example
| and LikeAWSForDogsSomehow.example
|
| And then how about all the outfits which folded weeks, months,
| even in some cases years ago, but the ISP bill was paid, so,
| the web site continues to exist until somebody removes it, but
| of course nobody cares ? BoughtByGoogle.example and
| YetAnotherBayAreaCryptoStartup.example together with
| DefinitelyViableProduct.example and
| OopsWalmartAlreadySellsThatForLessMoney.example
|
| If it was 95% I'd be more worried, at 40% I'd need to actually
| check at least a decent sample and see for myself. In the time
| I was writing this post I checked one, it wasn't replaced...
| exactly, because the actual web site uses a certificate issued
| five days earlier. Chances are they've got a bunch of duplicate
| certificates, so the fact that some they don't use are broken
| has never come up - that's just rude (wastes other people's
| resources) but it works fine technically.
| tedunangst wrote:
| Renewing a cert without immediately deploying it seems like a
| reasonable practice in the face of CAs that will misissue
| through no fault of your own.
| schoen wrote:
| When we wrote Certbot, we thought (by analogy with prior
| practice) that many sysadmins would want to manually
| inspect certificates before deploying them! That's one
| reason that we kept old certificates around and used a
| symlink-updating system.
|
| As it turned out, misissued and invalid certs account for
| an incredibly small fraction of Let's Encrypt's issuance
| volume (I'm going to say < 1/10^8 offhand?) and manual
| inspection kind of gets in the way of automation, so the
| idea of separating these steps has come to seem kind of
| quaint, for me at least. I've also helped thousands of
| people on the Let's Encrypt forum and I think at most 2
| have said they were interested in looking at their new
| certs' contents before starting to use them.
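The symlink-updating idea can be sketched as follows. This is not Certbot's code: it is a minimal illustration, assuming a POSIX filesystem, of keeping every issued certificate on disk and atomically repointing a stable "live" path at the one currently in use. The function name and file layout are hypothetical.

```python
import os


def point_live_at(live_link: str, new_target: str) -> None:
    """Atomically make live_link a symlink to new_target.

    The link is created under a temporary name and then renamed over the
    old one, so readers always see either the old or the new certificate,
    never a missing file.
    """
    tmp_link = "%s.tmp.%d" % (live_link, os.getpid())
    os.symlink(new_target, tmp_link)
    try:
        os.replace(tmp_link, live_link)  # rename(2) swaps the link in one step
    except OSError:
        os.unlink(tmp_link)
        raise
```

Because older certificates stay in place, rolling back is just another call with the previous file as the target.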
| tedunangst wrote:
| I may not inspect it myself (which wouldn't even catch
| this issue), but letting it simmer for a week isn't hard.
| agwa wrote:
| That's a pretty good idea, and would also mitigate
| clients with slow clocks rejecting a certificate for not
| being valid yet.
| NovemberWhiskey wrote:
| Based on my experience, the capability model for certificate
| management usually went like:
|
| 1) Chaos: certificates requested and installed manually, either
| in response to incidents caused by expiration or calendar
| reminders
|
| 2) Monitoring: certificates requested and installed manually,
| in response to noisy alerting by probers looking for
| indications of pending expiration or other ill-health
|
| 3) Automation: continuous certificate provisioning,
| distribution and enablement either through platform or
| integration
|
| The Let's Encrypt revolution has taken a lot of people from
| stage 1 to stage 3 without stage 2 in between.
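A stage 2 style prober can be quite small. The sketch below, with hypothetical function names and an assumed 14-day warning threshold, classifies a certificate from its notAfter string in the format Python's ssl module reports (for example via SSLSocket.getpeercert()). It only covers pending expiration, not the "other ill-health" checks such as invalid SCTs or revocation.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter time, negative if past.

    not_after uses the ssl module's format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def expiry_status(not_after: str, warn_days: float = 14.0,
                  now: Optional[float] = None) -> str:
    """Classify a certificate as "ok", "warn" or "expired"."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "warn"
    return "ok"
```

With automated renewal in place, a "warn" result is less a prompt to renew by hand than a signal that the automation has silently stopped working.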
| hinkley wrote:
| Vernor Vinge has dominated the Singularity space in science
| fiction pretty much from the beginning of the concept.
|
| Rainbows End plays around in a time frame right around where we
| are now, just a bit before the sorts of doglegs we predict
| would presage a Singularity in your lifetime.
|
| At one point the protagonists need to attack a bad actor, and
| to make it work they need chaos on the internet. I don't recall
| exactly how this plays out, but the way they decide to achieve
| it is that one of the collaborators believes that they can
| reject a CA cert that affects 10% of all certificates in the
| wild, and the resulting pandemonium will give them
| approximately the sort of chaos they need.
|
| Sounds to me like maybe that is either no longer true, or never
| was.
| tialaramex wrote:
| [Spoilers]
|
| They don't need Chaos. They want to disable Rabbit, and they
| know Rabbit's certificates mostly tie back to a single CA,
| Credit Suisse. So they "revoke" Credit Suisse and accept the
| consequences, which (they acknowledge) are career ending for
| the Europeans. This is mostly a plot convenience because
| Rabbit is much too powerful to allow what Vinge wants to
| happen next.
|
| No, you can't actually "revoke" a root CA, the decision to
| trust (or not) a root is local. So this part of the novel is
| a fantasy. But even if you assume it means that the European
| authorities can somehow reach into Credit Suisse and cause it
| to revoke all the intermediates (which _maybe_ is a plausible
| reading) and so on down to end entity certificates, that
| doesn't really work either. Not on the time scale Vinge
| needs for the novel.
|
| Hours are conceivable but unlikely. Days maybe. A week. But
| the novel needs it to be seconds.
|
| There are two big obstacles to even the revocation which does
| really exist. Firstly humans are _much_ more enthusiastic
| about seeing Dancing Pigs than they are about safety, because
| safety is a very abstract idea, whereas seeing dancing pigs
| is an immediate reward. This is the Dancing Pigs problem, and
| we've put some effort in: it's _less_ likely that a random Chrome
| user would get their face ripped to pieces because they
| wanted Dancing Pigs and so bypassed the security checks
| that would have protected them than, say, fifteen years ago, but
| only somewhat.
|
| Secondly though, there's not a great enthusiasm technically
| for this sort of counter-measure. It's so rarely beneficial
| in practice. Most of the time those humans were right, we
| were just denying them Dancing Pigs. Their face _might_ get
| ripped to pieces, but to be honest it's as likely to be
| because they deliberately went to "Rip My Face To
| Pieces.example" as through anything we could have prevented.
| This is only barely a technical problem. So, when there are
| things we could do to get closer to what's in the novel, why
| would we?
|
| Building the PKI which exists in Vinge's novel is probably a
| bad expenditure of resources.
| francislavoie wrote:
| FWIW, if those websites used Caddy as their ACME client, then it
| would have detected the certificate being revoked as soon as
| possible via OCSP stapling and would have had the certificate
| renewed. It's a shame that other ACME clients aren't as robust to
| problems like this. (Disclaimer: I work on Caddy as a volunteer)
| agwa wrote:
| Note that the certificates were not revoked until 2023-06-19 at
| 18:00. In contrast, ARI was updated on 2023-06-15 at 22:43 to
| tell ARI-supporting clients (such as lego) to renew
| immediately. That means Caddy served broken certificates for
| almost 4 days longer than necessary.
|
| Are there plans for Caddy to support ARI?
| mholt wrote:
| > That means Caddy served broken certificates for almost 4
| days longer than necessary.
|
| This would be news to me. Do you have a source for Caddy
| serving any of the affected certificates? I'd like as much
| info as possible.
|
| > Are there plans for Caddy to support ARI?
|
| If ARI can be made into an effective mechanism, then yes.
| ACMEz already supports the current draft.
|
| I know Francis linked to a forum category; here are some more
| specific links for background:
|
| - https://community.letsencrypt.org/t/can-ari-conforming-clien...
|
| - https://community.letsencrypt.org/t/thoughts-from-starting-t...
| agwa wrote:
| _> This would be news to me. Do you have a source for Caddy
| serving any of the affected certificates? I'd like as much
| info as possible._
|
| That's news to you? I informed you last week that Caddy
| would serve broken certificates in this situation:
| https://news.ycombinator.com/item?id=36344549
|
| I omitted "would" from my previous comment, but I think
| it's pretty clear from Francis' comment that we're
| discussing a hypothetical situation, and neither of us knows
| if any of the 645 affected certificates were requested by
| Caddy or not.
|
| I skimmed the forum links (it would be productive if you
| could send an email summarizing your thoughts to the IETF
| ACME WG) and it seems like your complaints could also be
| said of OCSP so it's hard to figure out why OCSP is OK for
| Caddy but ARI isn't.
|
| FWIW, there's currently a ballot in the CABF which would
| make OCSP optional for CAs, so OCSP may be on the way out
| in the WebPKI.
| mholt wrote:
| You said:
|
| > Caddy served broken certificates
|
| So yes, that would be news to me. I'm asking for more
| information. If Caddy did not serve broken certificates,
| then I would appreciate clarification there so I know
| where to spend my energy.
|
| > (it would be productive if you could send an email
| summarizing your thoughts to the IETF ACME WG)
|
| I did this once and it was like talking into a black
| hole. All the responses I got to the issue I brought up
| were laced with complacency.
|
| > I skimmed the forum links and it seems like your
| complaints could also be said of OCSP so it's hard to
| figure out why OCSP is OK for Caddy but ARI isn't.
|
| Because OCSP does what it's intended to do. ARI does not.
|
| > FWIW, there's currently a ballot in the CABF which
| would make OCSP optional for CAs, so OCSP may be on the
| way out in the WebPKI.
|
| I am tracking that proposal and get daily notifications.
| It is only for short-lived certs. I would be thrilled if
| we could replace revocation -- and OCSP -- with short-lived
| certs.
| agwa wrote:
| _> So yes, that would be news to me. I'm asking for more
| information. If Caddy did not serve broken certificates,
| then I would appreciate clarification there so I know
| where to spend my energy._
|
| This is not engaging in good faith.
|
| _> I am tracking that proposal and get daily
| notifications. It is only for short-lived certs._
|
| It would make OCSP optional for all certificates. CRLs
| would be optional only for short-lived certs.
| mholt wrote:
| > This is not engaging in good faith.
|
| Sorry, come again? Why so combative?
| francislavoie wrote:
| > Note that the certificates were not revoked until
| 2023-06-19 at 18:00.
|
| Ah okay, I missed that.
|
| > Are there plans for Caddy to support ARI?
|
| It's... complicated. Matt argues that ARI does not make sense
| for a variety of reasons. You can find the complex and deep
| discussions about it on the LE forums. Do a Ctrl+F for ARI in
| https://community.letsencrypt.org/c/client-dev/14 to find
| them, there's a lot.
| ElongatedMusket wrote:
| Thanks for following through on this writeup! I knew LE certs
| were publicly logged but didn't know the logs were decentralized
| or how they hold the CA accountable. Appreciate the layman
| explanation.
| fruitreunion1 wrote:
| Will non-browser clients like curl/requests ever support checking
| CT logs? It's great that some browsers have it, but browsers are
| not the only clients using TLS with CAs. Also doesn't help that a
| lot of software can't use CA root stores with much granularity:
| https://news.ycombinator.com/item?id=33876949
| agwa wrote:
| Hopefully, although there are challenges to overcome. CT is a
| fast-moving ecosystem, with logs coming and going, and policies
| changing regularly. This requires CT-enforcing clients to be
| very on-the-ball with updates, both in the sense that the
| developers need to pay attention and update their code in time,
| and any users of the apps need to upgrade frequently. Browser
| makers can handle this because they are competently-staffed and
| well-resourced. The authors of non-browser apps need to know
| what they're getting into.
|
| A cautionary tale: there is a library for adding CT enforcement
| to Android apps. Earlier this year, every app using this
| library was suddenly unable to establish any TLS connections
| because Google stopped publishing a JSON file which the library
| should never have been consuming in the first place. There was
| plenty of warning that this would happen, but the author of the
| library was not on-the-ball.
| https://groups.google.com/g/certificate-transparency/c/38Lr9...
| NovemberWhiskey wrote:
| The elephant in the room is that TLS implementations for
| browsers and those in the libraries of common programming
| languages have diverged really substantially: Web PKI is
| massively more restrictive and depends on a bunch of technology
| that's not in the baseline PKI.
___________________________________________________________________
(page generated 2023-06-22 23:00 UTC)