[HN Gopher] About the Tailscale.com outage on March 7, 2024
       ___________________________________________________________________
        
       About the Tailscale.com outage on March 7, 2024
        
       Author : tatersolid
       Score  : 150 points
       Date   : 2024-03-30 15:34 UTC (7 hours ago)
        
 (HTM) web link (tailscale.com)
 (TXT) w3m dump (tailscale.com)
        
       | lmeyerov wrote:
       | Expiring certs strikes again!
       | 
       | I'd recommend, as part of the post mortem, moving their install
       | script off their marketing site or putting in some other fallback
       | so marketing site activity is unrelated to the customer operations
       | critical path. They're almost there on maintaining that typical
       | isolation, which helps because this kind of thing is common.
       | 
       | We track uptime of our various providers, and seeing bits like
       | the GitHub or Zendesk sites go down is more common than we
       | expected... and they're the good cases.
        
         | GICodeWarrior wrote:
         | Further, security of a marketing site tends to be lower
         | priority than the product itself, and an install script should
         | generally be secured similar to the product.
        
           | bradfitz wrote:
           | Yes. We're lamentably probably going to have to move it (the
           | install script), even though it has a nice URL today.
           | 
           | When we picked that URL, the marketing site was created and
           | run by the same people who built the rest of the product, so
           | it didn't seem like a concern at the time.
        
             | oliviabenson wrote:
             | You can achieve both. The only mistake you made was to
             | half-bake the proxy (doing it for IPv6 only): proxy every
             | http(s) request to tailscale.com. Vercel's platform is
             | valuable for a whole host of reasons, the networking side
             | isn't that important, your developers will greatly value
             | the use of Vercel even if every request is being proxied
             | through a web server hosting tailscale.com which responds
             | to a request for /install.sh instead of passing it through
             | to the marketing site.
             | 
             | (In Google Cloud you could do it entirely with load
             | balancing rules, no need to even run a web server)
        
               | bradfitz wrote:
               | That is exactly what I want us to do :)
        
             | ShakataGaNai wrote:
             | That's great that DevOps (or whatever their title) owns
             | both product and marketing sites. Far too many companies
             | (and DevOps teams) think the www site is "not important" or
             | "not their core job" and outsource it to either a less
             | qualified team, or out of the company altogether.
             | 
             | From an external perspective no one cares if www going down
             | isn't "your fault" or of "direct impact to the product".
             | It's a corporate blackeye either way.
        
         | j45 wrote:
         | Brings to mind if there's a service that will monitor all the
         | certs and their expiry.
         | 
         | Cloudflare seems to handle a fair bit of this if you host your
         | domain with them, but you have to use Cloudflare.
        
           | cassianoleal wrote:
           | Push their expiry date as a UNIX timestamp to whatever TSDB
           | you use and hook up an alert for when it gets close.
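
A minimal sketch of the "expiry as a UNIX timestamp" idea above, using only the Python standard library. The function names, the 21-day threshold, and how the value gets shipped to a TSDB are assumptions for illustration; nothing here is Tailscale's actual monitoring.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port and return the leaf certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiry_timestamp(not_after: str) -> int:
    """Convert a notAfter string like 'Jan  5 09:34:43 2018 GMT' to UNIX time."""
    return ssl.cert_time_to_seconds(not_after)


def should_alert(expiry_ts: int, warn_days: int = 21,
                 now: Optional[float] = None) -> bool:
    """True when the certificate expires within warn_days (or already has).

    warn_days=21 is an arbitrary illustrative threshold, not a recommendation.
    """
    current = time.time() if now is None else now
    return expiry_ts - current <= warn_days * 86400
```

Calling expiry_timestamp(fetch_not_after("example.com")) would give the number to push as a gauge. Note that an already-expired certificate makes the handshake itself fail with an ssl.SSLError, which a real check would presumably treat as an alert too.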
        
       | Scubabear68 wrote:
       | "That means the root issue with renewal is still a problem, and
       | we plan to address it in the short term much like our ancestors
       | did: multiple redundant calendar alerts and a designated window
       | to manually renew the certificates ourselves".
        
         | lijok wrote:
         | The other day I was looking for a system where we can track
         | recurring yearly/monthly/etc tasks (such as cert rotation) and
         | get alerted a week before and on the day.
         | 
         | ~2 hours into my search, contemplating building my own, someone
         | pointed out we can just use a shared gsuite calendar.
         | 
         | How the mind overcomplicates things sometimes..
        
           | _joel wrote:
           | I'd probably spend the effort adding SSL expiration to the
           | monitoring system for all the certs in use. Then trigger an
           | alert a month/week/whatever before they're due to expire.
        
             | rustcleaner wrote:
             | We both got the downdoot simultaneously, someone didn't
             | like our contributions. :^)
        
           | supriyo-biswas wrote:
           | Uptime Kuma[1] has a certificate expiry notification feature.
           | 
           | [1] https://github.com/louislam/uptime-kuma
        
             | rnewme wrote:
             | Uptime kuma is cool.
        
             | j45 wrote:
             | Thanks for the tip!
        
           | eastbound wrote:
           | Jira. Create an automation that clones tickets on due dates.
        
             | j45 wrote:
             | Jira is really underrated for its workflows and
             | automations.
             | 
             | My love/hate relationship with JIRA lasted until the
             | lightbulb moment that it's not supposed to work perfectly
             | out of the box, because no two places are the same.
        
       | nerdbaggy wrote:
       | I wonder what provider they use for their website. Sounds like a
       | lot of hoops to jump through for IPv6 when just about any other
       | provider has IPv6 support.
        
         | p1mrx wrote:
         | $ host www.tailscale.com
         | www.tailscale.com has address 76.76.21.21  # Vercel
         | www.tailscale.com has IPv6 address 2600:9000:a51d:27c1:6748:d035:a989:fb3c  # Amazon
         | www.tailscale.com has IPv6 address 2600:9000:a602:b1e6:5b89:50a1:7cf7:67b8  # Amazon
         | 
         | IPv4 uses a Let's Encrypt certificate, while IPv6 uses an
         | Amazon certificate.
        
           | opheliate wrote:
           | The Vercel IPv6 feature request & surrounding discussion
           | makes for frustrating reading:
           | https://github.com/orgs/vercel/discussions/47
        
             | opello wrote:
             | Only if you expand all the discussion they've hidden!
             | Ignorance is bliss, right?
             | 
             | > We are targeting to land support for IPv6 towards the
             | beginning of next year. We will communicate updates on this
             | issue.
             | 
             | Was from 2023-10-01, I guess it's early until June 30.
        
               | watermelon0 wrote:
               | I'd say that the beginning of the year ends at the end of
               | Q1, if not earlier.
        
             | miyuru wrote:
             | Vercel's VP of product has been asking for requirements for
             | IPv6 there. This should be a good one.
             | 
             | https://github.com/orgs/vercel/discussions/47#discussioncom
             | m...
             | 
             | It's painful to see tech providers go down this road, which
             | is pretty similar to what's happening at Boeing. (Business
             | taking over Engineering)
        
             | kawsper wrote:
             | It feels like the same anywhere, sadly. DigitalOcean's IPv6
             | support in their load balancer product has been "under
             | review" since 2021:
             | https://ideas.digitalocean.com/network/p/ipv6-for-load-
             | balan...
        
             | aftbit wrote:
             | Wow, all comments removed as spam or hidden by default,
             | update posted saying "We are targeting to land support for
             | IPv6 towards the beginning of next year." Well, Q1 2024 has
             | come and gone. Where's IPv6 support or the communication
             | about what is happening? Good reason to never use Vercel if
             | you ask me.
        
       | NelsonMinar wrote:
       | The conclusion is hilarious: "we plan to address it in the short
       | term much like our ancestors did: multiple redundant calendar
       | alerts and a designated window to manually renew the certificates
       | ourselves"
       | 
       | Devops is so 2023. Back to ops!
        
         | bradfitz wrote:
         | That was mostly a joke. You know we're going to fix it
         | properly, Nelson :)
         | 
         | (But super short term, yes.)
        
           | NelsonMinar wrote:
           | Fair enough! But you might want to add this feed to your
           | Google Reader, Brad :-)
           | https://scrutineer.tech/monitor/cert/tailscale.com.rss
        
         | j45 wrote:
         | It's a joke but also the least that should be in place while
         | whatever fix is coming is put into place.
         | 
         | A simple cron job looks like it would handle it, but what
         | usually ends up being needed with 10-15 of these types of tasks
         | is a simple, independent BPM workflow platform that tracks
         | whether each one happened or not... or anything else.
         | 
         | Learned this the hard way and won't do it any other way.
        
       | PuffinBlue wrote:
       | Migrating to a host that doesn't support IPv6 when it's important
       | to you seems...like a bad decision.
        
         | bradfitz wrote:
         | Suffice it to say neither their lack of IPv6 nor its importance
         | to us was evenly understood throughout the company.
        
           | j45 wrote:
           | This doesn't seem uncommon. Why learn IPv6 until you need to?
           | I know it has some great features.
        
           | PuffinBlue wrote:
           | I very much enjoy the diplomatic phrasing of this statement
           | :-)
        
         | speedgoose wrote:
         | IPv6 is important but step one is to remove the AAAA records.
        
       | sowbug wrote:
       | Two ideas for discussion.
       | 
       | Certificate Transparency is used to account for maliciously or
       | mistakenly issued certificates. Perhaps it could also be used to
       | assert the unavailability of correctly issued but obsolete
       | certificates that are believed to be purged but actually aren't.
       | (Services like KeyChest might already do this.)
       | 
       | Let's Encrypt is a miracle compared to the expensive pain of
       | getting a cert 20 years ago. Rather than resting on laurels,
       | would there be any benefit to renewing even more frequently, like
       | daily? This might have confined the Tailscale incident to a quick
       | "oops!" while the provider migration was still underway and being
       | actively watched.
        
         | striking wrote:
         | 90 day renewal is frequent enough in my book. It's not so often
         | as to be easy to miss, but often enough that the person setting
         | it up can witness the first renewal cycle (if they so choose,
         | which in this case they apparently did not).
        
           | sowbug wrote:
           | Right. I was thinking of keeping the same 90-day validity but
           | renewing much more frequently, rather than the 60-day period
           | that LE recommends. But I can see my questions have irked
           | other community members, so I'll leave it at that. :)
        
         | starttoaster wrote:
         | I renew some of my LetsEncrypt certificates monthly, which
         | should be plenty, in my opinion. Gets you about 2 buffer cycles
         | to notice the certificate isn't updating and recognize an issue
         | in your automation.
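
The arithmetic behind the renewal-cadence comments above can be sketched as follows. It assumes one reading of "buffer cycles": a certificate is issued on day 0, a renewal job then runs every fixed interval, and we count how many scheduled runs happen strictly before expiry if the automation silently breaks right after issuance. The function name is hypothetical.

```python
def attempts_before_expiry(validity_days: int, renew_every_days: int) -> int:
    """Count scheduled renewal runs that fall strictly before the cert expires.

    Runs happen on days renew_every_days, 2 * renew_every_days, ...;
    the certificate issued on day 0 expires on day validity_days.
    """
    if validity_days <= 0 or renew_every_days <= 0:
        raise ValueError("both arguments must be positive")
    return (validity_days - 1) // renew_every_days


# 90-day certificates, as discussed in the thread:
monthly = attempts_before_expiry(90, 30)     # 2 chances to notice a failure
every_60 = attempts_before_expiry(90, 60)    # 1 chance
daily = attempts_before_expiry(90, 1)        # 89 chances
```

Under this model, renewing more often does not change the 90-day validity; it only increases how many failed runs can be observed before the certificate actually lapses.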
        
       | agwa wrote:
       | Why does the proxy need to terminate TLS? If it were just a TCP
       | proxy, then at least the monitoring wouldn't have been fooled
       | into thinking the certificate wasn't about to expire.
       | 
       | Heck, a TCP proxy might even allow automatic renewal to work if
       | the domain validation is being done using a TLS-ALPN challenge.
        
         | fanf2 wrote:
         | I think a TCP proxy would also work with http challenges.
        
           | agwa wrote:
           | It would, but so would an HTTP proxy. It makes me think the
           | hosting provider doesn't use HTTP challenges.
        
             | fanf2 wrote:
             | An http proxy would need to be configured cleverly enough
             | to serve its own acme challenges directly, and proxy any
             | requests for the backend's acme challenges. Which is, I
             | think, the trick that was missed by the tailscale setup.
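
The routing rule described above might look roughly like this. The path prefix is the standard http-01 challenge location from RFC 8555; the function name, the return labels, and the idea of tracking the proxy's own pending tokens in a collection are assumptions for illustration, not a description of Tailscale's proxy.

```python
from typing import Collection

# Standard http-01 challenge location (RFC 8555, section 8.3).
ACME_PREFIX = "/.well-known/acme-challenge/"


def route_request(path: str, own_tokens: Collection[str]) -> str:
    """Decide whether the proxy answers a request itself or forwards it.

    own_tokens holds the challenge tokens the proxy's own ACME client is
    currently waiting on. Any other challenge token belongs to the
    backend's ACME client and must be passed through untouched.
    """
    if path.startswith(ACME_PREFIX):
        token = path[len(ACME_PREFIX):]
        if token in own_tokens:
            return "serve-locally"
    return "proxy-to-backend"
```

For example, with own_tokens = {"abc123"}, a request for /.well-known/acme-challenge/abc123 is answered by the proxy, while /.well-known/acme-challenge/zzz and every ordinary page request go to the backend, so both ACME clients can complete validation.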
        
         | bastawhiz wrote:
         | Not that it's an amazing reason, but H3 doesn't run over TCP,
         | and running a UDP proxy doesn't sound like a great time.
        
           | ignoramous wrote:
           | > _UDP proxy doesn't sound like a great time_
           | 
           | QUIC, in particular, is harder to proxy (if you're load
           | balancing, say: https://quicwg.org/ops-drafts/draft-ietf-
           | quic-manageability....).
           | 
           | If it is point-to-point and you control both those points
           | (forward A to B with ports open as approp), proxying any
           | protocol should be straightforward, no?
        
             | cayde wrote:
             | I don't believe they control the Vercel endpoint "B"
        
         | RulerOf wrote:
         | They may have AWS CloudFront CDN in front of it for IPv6. If
         | you're doing that, you're terminating TLS at CloudFront. I
         | don't believe that's optional.
        
         | amluto wrote:
         | A non-TLS-terminating proxy is a great thing to host on a
         | service like Hetzner. If you set up CAA correctly, then you are
         | trusting the provider for latency and availability only, and
         | you might as well avoid hilariously expensive services like
         | CloudFront or an EC2-based proxy.
         | 
         | Hmm, it looks like Tailscale is using NetActuate for
         | pkgs.tailscale.com. I bet NetActuate could help serve up a non-
         | terminating proxy with plenty of PoPs at a reasonable price.
         | Their website doesn't give pricing, but it sounds like the kind
         | of company that doesn't mark up egress 50x.
        
         | bradfitz wrote:
         | It doesn't. That was one of our mistakes and action items to
         | fix.
         | 
         | The original proxy was stood up quickly when it was first
         | discovered IPv6 was broken and the people standing up the proxy
         | didn't know at the time how ACME worked.
         | 
         | We'll be changing it to just a TCP proxy.
        
           | ikiris wrote:
           | > and the people standing up the proxy didn't know at the
           | time how ACME worked
           | 
           | yikes
        
             | bradfitz wrote:
             | To be fair, it didn't help the provider's docs were
             | inconsistent about whether dns-01 or http-01 was to be
             | used.
        
         | p1mrx wrote:
         | A TCP proxy discards the user's IP address, unless you use
         | something like the PROXY protocol[1], which then needs to be
         | supported by the target HTTPS server. You would also need a way
         | to prevent unauthorized users from injecting their own PROXY
         | header.
         | 
         | This isn't a problem if you don't need the user's IP address at
         | all, but it's often useful for logging and abuse detection.
         | 
         | [1] https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt
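
For reference, the human-readable version 1 header from the PROXY protocol spec linked above is a single text line sent before any application bytes. A small sketch of building it (the function name is hypothetical, and the addresses in the usage note are documentation-range placeholders):

```python
import ipaddress


def proxy_v1_header(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Build a PROXY protocol v1 line carrying the original client address."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if src.version != dst.version:
        raise ValueError("source and destination must be the same IP version")
    for port in (src_port, dst_port):
        if not 0 <= port <= 65535:
            raise ValueError("port out of range")
    family = "TCP4" if src.version == 4 else "TCP6"
    # The line is terminated by CRLF and precedes the proxied TLS bytes.
    return f"PROXY {family} {src} {dst} {src_port} {dst_port}\r\n".encode("ascii")
```

proxy_v1_header("192.0.2.1", "198.51.100.7", 56324, 443) yields b"PROXY TCP4 192.0.2.1 198.51.100.7 56324 443\r\n". As the comment notes, the receiving server has to expect this line, and should only accept it from the proxy's own addresses so that clients cannot forge it.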
        
       | snapplebobapple wrote:
       | I really like these guys, I wish their pricing wasn't so
       | ridiculous. Proper access control shouldn't cost 18 bucks a month
       | for a vpn, it's basically unsellable to management at that price
       | and the lower tiers are unsellable without it.
        
         | starttoaster wrote:
         | I'm really interested in what you're comparing Tailscale to
         | internally, because it does way more than just VPN. What are
         | the cheaper options, and do they also have an SSH feature,
         | oauth authentication to the network for automation services,
         | the ability to stand up VPN node loadbalancers in kubernetes
         | clusters, and ACME certificate request automation through
         | LetsEncrypt? Just to list a few features that I use from
         | Tailscale's free tier that I don't normally think of as the job
         | of a VPN service. And they're constantly adding new features
         | that make it a really interesting and competitive choice, in my
         | opinion. Honestly I'm mostly interested in this take because
         | I'm shocked by how much they offer in the cheaper tiers.
        
           | conception wrote:
           | But if you're comparing it to other VPNs and only need VPN
           | the pricing is bonkers.
        
             | starttoaster wrote:
             | That's fair. I guess if you just need a VPN, it doesn't
             | really make sense to consider a product that packs all of
             | these VPN-adjacent features. But part of my point was how
             | much you're able to do on Tailscale for free, so in that
             | regard, the pricing really doesn't seem as bonkers to me,
             | to be honest. The $6/month tier is also incredibly
             | reasonable for unlimited users, but it is annoying, I'll
             | grant, that they actually drop some ACL features from Free
             | tier to Starter. I suppose that's how they steer a large
             | number of enterprises away from Starter and toward
             | Premium. But if you actually make use of the feature set of
             | the platform (and if you have a decent DevOps team, I'm
             | pretty sure there are tools in there that they'll love), then
             | $18/User/month actually doesn't seem too outrageous to me.
        
             | newdee wrote:
             | Which other VPNs? And are you talking about trad VPN
             | (concentrators/hub & spoke)? If so, you wouldn't be
             | considering Tailscale anyway.
        
         | j45 wrote:
         | You can install headscale and self-host for nothing then.
         | 
         | Tailscale has competitors too with some overlaps, it might not
         | be fully what you're looking for.
         | 
         | All I know is within a few minutes I had more of a project
         | working together than without it.
         | 
         | It really is one of the more remarkably simple tools out there
         | for everything it does, and has a generous free tier with 100
         | devices and 3 users.
        
           | j45 wrote:
           | Link to headscale https://github.com/juanfont/headscale
        
           | j-krieger wrote:
           | Just a quick heads up, colleagues of mine could not
           | successfully host headscale themselves. In the end, they saw
           | the value and bought tailscale access.
           | 
           | Configuring wireguard really is that hard. Tailscale is
           | easily worth it
        
             | watermelon0 wrote:
             | IIRC, Wireguard is exclusively managed by Tailscale
             | clients, and not by the server (headscale in this case).
        
             | j45 wrote:
             | Any details? Could be adjacent self-hosting issues compared
             | to headscale itself.
             | 
             | Considering it can also run in a docker container, it's
             | next to trivial to install locally to try out
             | 
             | https://headscale.net/running-headscale-container/
        
             | hnarn wrote:
             | > colleagues of mine could not successfully host headscale
             | themselves
             | 
             | it'd be interesting to know why, I use it frequently at
             | work and it's worked pretty well so far.
        
       | aktuel wrote:
       | That's why I roll my VPN locally. One less party to worry about.
        
         | txutxu wrote:
         | The P in VPN has been perverted long time ago.
        
           | edward28 wrote:
           | Virtually public network
        
       | johnnyAghands wrote:
       | Wow, mad jelly their CI/CD and monitoring processes are robust
       | enough to trust a major rollout in December. That's a pretty
       | badass eng culture
       | 
       | That being said, still some unanswered questions:
       | 
       | - If the issue was ipv6 configuration breaking automated cert
       | renewals for ipv4, wouldn't they have hit this like.. a long time
       | ago? Did I miss something here?
       | 
       | - Why did this take 90 minutes to resolve? I know it's like a
       | blog post and not a real post-mortem, but some kind of timeline
       | would have been nice to include in the post.
       | 
       | - Why not move to a DNS provider that natively supports IPv6?
       | 
       | Also I'm curious if it's worth the overhead to have a dedicated
       | domain for scripts/packages? Do other folks do this? (excluding
       | third-parties like package repositories).
        
         | Thorrez wrote:
         | >- If the issue was ipv6 configuration breaking automated cert
         | renewals for ipv4, wouldn't they have hit this like.. a long
         | time ago? Did I miss something here?
         | 
         | AIUI, they switched to their current setup 90 days prior to the
         | outage. The initial cert they installed during their migration
         | lasted 90 days. So 90 days after the migration, they had an
         | outage.
        
       | smackeyacky wrote:
       | I've said it before and I'll say it again: expiring certs are the
       | new DNS for outages.
       | 
       | I still marvel at just how good Tailscale is. I'm a minor user
       | really but I have two sites that I use tailscale to access: a
       | couple of on-prem servers and my AWS production setup.
       | 
       | I can literally work from anywhere - had an issue over the
       | weekend where I was trying to deploy an ECS container but the
       | local wifi was so slow that the deploy kept timing out.
       | 
       | I simply SSH'd over to my on-prem development machine, did a git
       | pull of the latest code and did the deploy from there. All while
       | remaining secure with no open ports at all on my on-prem system
       | and none in AWS. Can even do testing against the production
       | Aurora database without any open ports on it, simply run a
       | tailscale agent in AWS on a nano sized EC2.
       | 
       | Got another developer you need to give access to your network to?
       | Tailscale makes that trivial (as it does revoking them).
       | 
       | Yeah, for that deployment I could just make a GitHub action or
       | something and avoid the perils of terrible internet, but for this
       | I like to do it manually and Tailscale lets me do just that.
        
       | gigatexal wrote:
       | Surely they can automate the renewal? It seems their solution is
       | a manual one. Am I being a simpleton?
        
       ___________________________________________________________________
       (page generated 2024-03-30 23:00 UTC)