[HN Gopher] About the Tailscale.com outage on March 7, 2024
___________________________________________________________________
About the Tailscale.com outage on March 7, 2024
Author : tatersolid
Score : 150 points
Date : 2024-03-30 15:34 UTC (7 hours ago)
(HTM) web link (tailscale.com)
(TXT) w3m dump (tailscale.com)
| lmeyerov wrote:
| Expiring certs strikes again!
|
| I'd recommend as part of the post mortem to move their install
| script off their marketing site or put in some other fallback
| so marketing site activity is unrelated to customer operations
| critical path. They're almost there for maintaining that typical
| isolation, which helps bc this kind of thing is common.
|
| We track uptime of our various providers, and seeing bits like
| the GitHub or Zendesk sites go down is more common than we
| expected... and they're the good cases.
| GICodeWarrior wrote:
| Further, security of a marketing site tends to be lower
| priority than the product itself, and an install script should
| generally be secured similar to the product.
| bradfitz wrote:
| Yes. We're lamentably probably going to have to move it (the
| install script), even though it has a nice URL today.
|
| When we picked that URL, the marketing site was created and
| run by the same people who built the rest of the product, so
| it didn't seem like a concern at the time.
| oliviabenson wrote:
| You can achieve both. The only mistake you made was to
| half-bake the proxy (doing it for IPv6 only): proxy every
| http(s) request to tailscale.com. Vercel's platform is
| valuable for a whole host of reasons, the networking side
| isn't that important, your developers will greatly value
| the use of Vercel even if every request is being proxied
| through a web server hosting tailscale.com which responds
| to a request for /install.sh instead of passing it through
| to the marketing site.
|
| (In Google Cloud you could do it entirely with load
| balancing rules, no need to even run a web server)
| bradfitz wrote:
| That is exactly what I want us to do :)
| ShakataGaNai wrote:
| That's great that DevOps (or whatever their title) owns
| both product and marketing sites. Far too many companies
| (and DevOps teams) think the www site is "not important" or
| "not their core job" and outsource it to either a less
| qualified team, or out of the company altogether.
|
| From an external perspective no one cares if www going down
| isn't "your fault" or of "direct impact to the product".
| It's a corporate black eye either way.
| j45 wrote:
| Brings to mind if there's a service that will monitor all the
| certs and their expiry.
|
| Cloudflare seems to handle a fair bit of this if you host your
| domain with them, but you have to use Cloudflare.
| cassianoleal wrote:
| Push their expiry date as a UNIX timestamp to whatever TSDB
| you use and hook up an alert for when it gets close.
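A minimal sketch of the approach described above, assuming Python and a Prometheus-style text line as the push format. The metric name `tls_cert_not_after_seconds` and the helper names are hypothetical; the thread does not name a specific TSDB or format.

```python
import socket
import ssl
import time
from typing import Optional


def not_after_to_unix(not_after: str) -> int:
    """Convert a certificate 'notAfter' string (GMT) to a UNIX timestamp."""
    return int(ssl.cert_time_to_seconds(not_after))


def seconds_until_expiry(expiry_ts: int, now: Optional[float] = None) -> float:
    """Seconds left before expiry; negative once the certificate has expired."""
    current = time.time() if now is None else now
    return expiry_ts - current


def metric_line(host: str, expiry_ts: int) -> str:
    """Format one sample; the metric name is a hypothetical choice."""
    return 'tls_cert_not_after_seconds{host="%s"} %d' % (host, expiry_ts)


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from the certificate a live server presents.

    Note: with the default context the handshake fails for an already
    expired certificate, so this is only useful for alerting ahead of time.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

The alert itself would then live in the TSDB, comparing the stored timestamp against the current time. Probing over both IPv4 and IPv6 would matter for a setup like the one in this incident, where each address family served a different certificate.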
| Scubabear68 wrote:
| "That means the root issue with renewal is still a problem, and
| we plan to address it in the short term much like our ancestors
| did: multiple redundant calendar alerts and a designated window
| to manually renew the certificates ourselves".
| lijok wrote:
| The other day I was looking for a system where we can track
| recurring yearly/monthly/etc tasks (such as cert rotation) and
| get alerted a week before and on the day.
|
| ~2 hours into my search, contemplating building my own, someone
| pointed out we can just use a shared gsuite calendar.
|
| How the mind overcomplicates things sometimes..
| _joel wrote:
| I'd probably spend the effort adding SSL expiration to the
| monitoring system for all the certs in use, then trigger alerts a
| month/week/whatever before they're due to expire.
| rustcleaner wrote:
| We both got the downdoot simultaneously, someone didn't
| like our contributions. :^)
| supriyo-biswas wrote:
| Uptime Kuma[1] has a certificate expiry notification feature.
|
| [1] https://github.com/louislam/uptime-kuma
| rnewme wrote:
| Uptime kuma is cool.
| j45 wrote:
| Thanks for the tip!
| eastbound wrote:
| Jira. Create an automation that clones tickets on due dates.
| j45 wrote:
| Jira is really underrated for its workflows and
| automations.
|
| Part of my love/hate relationship with JIRA lasted until the
| lightbulb moment that it's not supposed to work perfectly out of
| the box because no two places are the same.
| nerdbaggy wrote:
| I wonder what provider they use for their website. Sounds like a
| lot of hoops to jump through for IPV6 when just about any other
| provider has IPv6 support.
| p1mrx wrote:
| $ host www.tailscale.com
| www.tailscale.com has address 76.76.21.21  # Vercel
| www.tailscale.com has IPv6 address 2600:9000:a51d:27c1:6748:d035:a989:fb3c  # Amazon
| www.tailscale.com has IPv6 address 2600:9000:a602:b1e6:5b89:50a1:7cf7:67b8  # Amazon
|
| IPv4 uses a Let's Encrypt certificate, while IPv6 uses an
| Amazon certificate.
| opheliate wrote:
| The Vercel IPv6 feature request & surrounding discussion
| makes for frustrating reading:
| https://github.com/orgs/vercel/discussions/47
| opello wrote:
| Only if you expand all the discussion they've hidden!
| Ignorance is bliss, right?
|
| > We are targeting to land support for IPv6 towards the
| beginning of next year. We will communicate updates on this
| issue.
|
| Was from 2023-10-01, I guess it's early until June 30.
| watermelon0 wrote:
| I'd say that the beginning of the year ends at the end of
| Q1, if not earlier.
| miyuru wrote:
| Vercel's VP of product has been asking for requirements for
| IPv6 there. This should be a good one.
|
| https://github.com/orgs/vercel/discussions/47#discussioncomm...
|
| It's painful to see tech providers go down this road, which
| is pretty similar to what's happening at Boeing. (Business
| taking over Engineering)
| kawsper wrote:
| It feels like the same anywhere, sadly. DigitalOcean's IPv6
| support in their load balancer product has been "under
| review" since 2021:
| https://ideas.digitalocean.com/network/p/ipv6-for-load-balan...
| aftbit wrote:
| Wow, all comments removed as spam or hidden by default,
| update posted saying "We are targeting to land support for
| IPv6 towards the beginning of next year." Well, Q1 2024 has
| come and gone. Where's IPv6 support or the communication
| about what is happening? Good reason to never use Vercel if
| you ask me.
| NelsonMinar wrote:
| The conclusion is hilarious: "we plan to address it in the short
| term much like our ancestors did: multiple redundant calendar
| alerts and a designated window to manually renew the certificates
| ourselves"
|
| Devops is so 2023. Back to ops!
| bradfitz wrote:
| That was mostly a joke. You know we're going to fix it
| properly, Nelson :)
|
| (But super short term, yes.)
| NelsonMinar wrote:
| Fair enough! But you might want to add this feed to your
| Google Reader, Brad :-)
| https://scrutineer.tech/monitor/cert/tailscale.com.rss
| j45 wrote:
| It's a joke but also the least that should be in place while
| whatever fix is coming is put into place.
|
| A simple cronjob would look like it would handle it, but what
| usually ends up being needed with 10-15 of these types of tasks
| is a simple, independent bpm workflow platform that tracks
| whether it happened or not.. or anything else.
|
| Learned this the hard way and won't do it any other way.
| PuffinBlue wrote:
| Migrating to a host that doesn't support IPv6 when it's important
| to you seems...like a bad decision.
| bradfitz wrote:
| Suffice it to say neither their lack of IPv6 nor its importance
| to us was evenly understood throughout the company.
| j45 wrote:
| This doesn't seem uncommon. Why learn IPv6 until you need to?
| I know it has some great features.
| PuffinBlue wrote:
| I very much enjoy the diplomatic phrasing of this statement
| :-)
| speedgoose wrote:
| IPv6 is important but step one is to remove the AAAA records.
| sowbug wrote:
| Two ideas for discussion.
|
| Certificate Transparency is used to account for maliciously or
| mistakenly issued certificates. Perhaps it could also be used to
| assert the unavailability of correctly issued but obsolete
| certificates that are believed to be purged but actually aren't.
| (Services like KeyChest might already do this.)
|
| Let's Encrypt is a miracle compared to the expensive pain of
| getting a cert 20 years ago. Rather than resting on laurels,
| would there be any benefit to renewing even more frequently, like
| daily? This might have confined the Tailscale incident to a quick
| "oops!" while the provider migration was still underway and being
| actively watched.
| striking wrote:
| 90 day renewal is frequent enough in my book. It's not so often
| as to be easy to miss, but often enough that the person setting
| it up can witness the first renewal cycle (if they so choose,
| which in this case they apparently did not).
| sowbug wrote:
| Right. I was thinking of keeping the same 90-day validity but
| renewing much more frequently, rather than the 60-day period
| that LE recommends. But I can see my questions have irked
| other community members, so I'll leave it at that. :)
| starttoaster wrote:
| I renew some of my LetsEncrypt certificates monthly, which
| should be plenty, in my opinion. Gets you about 2 buffer cycles
| to notice the certificate isn't updating and recognize an issue
| in your automation.
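The cadence arithmetic in this sub-thread can be sketched as follows. This is a simplified model with assumptions not stated in the thread: every successful renewal yields a certificate with the full validity period, and the first failed renewal happens exactly one interval after the last successful issuance. The function names are hypothetical.

```python
def slack_days(validity_days: int, renew_every_days: int) -> int:
    """Days left on the current certificate when the first renewal fails."""
    if renew_every_days <= 0 or renew_every_days > validity_days:
        raise ValueError("renewal interval must be in 1..validity_days")
    return validity_days - renew_every_days


def buffer_cycles(validity_days: int, renew_every_days: int) -> float:
    """The same slack expressed in further renewal intervals."""
    return slack_days(validity_days, renew_every_days) / renew_every_days


if __name__ == "__main__":
    # 90-day validity with monthly, 60-day, and daily renewal.
    for interval in (30, 60, 1):
        print(interval, slack_days(90, interval), buffer_cycles(90, interval))
```

Under this model, monthly renewal of a 90-day certificate leaves 60 days (two further cycles) to notice a failure, the 60-day schedule leaves 30 days, and daily renewal leaves 89 days.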
| agwa wrote:
| Why does the proxy need to terminate TLS? If it were just a TCP
| proxy, then at least the monitoring wouldn't have been fooled
| into thinking the certificate wasn't about to expire.
|
| Heck, a TCP proxy might even allow automatic renewal to work if
| the domain validation is being done using a TLS-ALPN challenge.
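To illustrate what a TCP proxy that does not terminate TLS looks like, here is a minimal sketch using Python's asyncio. It only copies bytes in both directions, so the TLS handshake (and therefore the certificate) is whatever the upstream presents. The function names, buffer size, and error handling are assumptions for illustration; this is not Tailscale's proxy.

```python
import asyncio


async def _copy(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes one way until EOF, then half-close the write side."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
        if writer.can_write_eof():
            writer.write_eof()
    except OSError:
        # Peer reset or similar; the caller closes both sides.
        pass


async def _handle(client_r, client_w, upstream_host: str, upstream_port: int) -> None:
    """Connect to the upstream and relay bytes in both directions."""
    try:
        up_r, up_w = await asyncio.open_connection(upstream_host, upstream_port)
    except OSError:
        client_w.close()
        return
    try:
        await asyncio.gather(_copy(client_r, up_w), _copy(up_r, client_w))
    finally:
        client_w.close()
        up_w.close()


async def start_tcp_proxy(listen_host: str, listen_port: int,
                          upstream_host: str, upstream_port: int) -> asyncio.AbstractServer:
    """Start listening; the bytes are never parsed or decrypted here."""
    async def on_client(reader, writer):
        await _handle(reader, writer, upstream_host, upstream_port)

    return await asyncio.start_server(on_client, listen_host, listen_port)
```

Because the proxy never sees plaintext, it cannot add headers such as `X-Forwarded-For`, which is the trade-off raised later in the thread.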
| fanf2 wrote:
| I think a TCP proxy would also work with http challenges.
| agwa wrote:
| It would, but so would an HTTP proxy. It makes me think the
| hosting provider doesn't use HTTP challenges.
| fanf2 wrote:
| An http proxy would need to be configured cleverly enough
| to serve its own acme challenges directly, and proxy any
| requests for the backend's acme challenges. Which is I
| think the trick that was missed by the tailscale setup.
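The routing decision described above can be sketched as a small function. The path prefix is the one ACME http-01 validation uses; the idea that the proxy keeps a set of its own pending challenge tokens, and the names `route_request` and `local_tokens`, are assumptions made for illustration.

```python
from typing import AbstractSet

# Path prefix used by ACME http-01 validation requests.
ACME_PREFIX = "/.well-known/acme-challenge/"


def route_request(path: str, local_tokens: AbstractSet[str]) -> str:
    """Return "local" if the proxy should answer itself, else "backend".

    local_tokens holds the challenge tokens the proxy's own ACME client is
    currently waiting on. Any other challenge request is assumed to belong
    to the backend's ACME client and is passed through like normal traffic.
    """
    if path.startswith(ACME_PREFIX):
        token = path[len(ACME_PREFIX):]
        if token in local_tokens:
            return "local"
    return "backend"
```

For example, `route_request("/.well-known/acme-challenge/abc", {"abc"})` returns `"local"`, while the same path with an empty token set returns `"backend"`, so the backend's renewal is not blocked by the proxy.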
| bastawhiz wrote:
| Not that it's an amazing reason, but H3 doesn't run over TCP,
| and running a UDP proxy doesn't sound like a great time.
| ignoramous wrote:
| > _UDP proxy doesn't sound like a great time_
|
| QUIC, in particular, is harder to proxy (if you're load
| balancing, say:
| https://quicwg.org/ops-drafts/draft-ietf-quic-manageability....).
|
| If it is point-to-point and you control both those points
| (forward A to B with ports open as approp), proxying any
| protocol should be straightforward, no?
| cayde wrote:
| I don't believe they control the Vercel endpoint "B".
| RulerOf wrote:
| They may have AWS CloudFront CDN in front of it for IPv6. If
| you're doing that, you're terminating TLS at CloudFront. I
| don't believe that's optional.
| amluto wrote:
| A non-TLS-terminating proxy is a great thing to host on a
| service like Hetzner. If you set up CAA correctly, then you are
| trusting the provider for latency and availability only, and
| you might as well avoid hilariously expensive services like
| CloudFront or an EC2-based proxy.
|
| Hmm, it looks like Tailscale is using NetActuate for
| pkgs.tailscale.com. I bet NetActuate could help serve up a non-
| terminating proxy with plenty of PoPs at a reasonable price.
| Their website doesn't give pricing, but it sounds like the kind
| of company that doesn't mark up egress 50x.
| bradfitz wrote:
| It doesn't. That was one of our mistakes and action items to
| fix.
|
| The original proxy was stood up quickly when it was first
| discovered IPv6 was broken and the people standing up the proxy
| didn't know at the time how ACME worked.
|
| We'll be changing it to just a TCP proxy.
| ikiris wrote:
| > and the people standing up the proxy didn't know at the
| time how ACME worked
|
| yikes
| bradfitz wrote:
| To be fair, it didn't help the provider's docs were
| inconsistent about whether dns-01 or http-01 was to be
| used.
| p1mrx wrote:
| A TCP proxy discards the user's IP address, unless you use
| something like the PROXY protocol[1], which then needs to be
| supported by the target HTTPS server. You would also need a way
| to prevent unauthorized users from injecting their own PROXY
| header.
|
| This isn't a problem if you don't need the user's IP address at
| all, but it's often useful for logging and abuse detection.
|
| [1] https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt
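As a rough illustration of the two points above, the sketch below builds the human-readable version 1 PROXY header that a TCP proxy would send ahead of the client's bytes, and checks whether a connecting peer is on an allow-list before its header is trusted. The function names and the allow-list approach are assumptions for illustration; the thread does not say how the target server would be configured.

```python
import ipaddress
from typing import Iterable


def build_proxy_v1_header(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Build a PROXY protocol v1 line: PROXY TCP4|TCP6 src dst sport dport CRLF."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if src.version != dst.version:
        raise ValueError("source and destination must use the same address family")
    for port in (src_port, dst_port):
        if not 0 <= port <= 65535:
            raise ValueError("port out of range")
    family = "TCP4" if src.version == 4 else "TCP6"
    line = "PROXY %s %s %s %d %d\r\n" % (family, src, dst, src_port, dst_port)
    return line.encode("ascii")


def is_trusted_proxy(peer_ip: str, trusted_cidrs: Iterable[str]) -> bool:
    """Only honor a PROXY header when the TCP peer is a known proxy address."""
    peer = ipaddress.ip_address(peer_ip)
    return any(peer in ipaddress.ip_network(cidr) for cidr in trusted_cidrs)
```

A receiver that accepts the header from any peer lets clients spoof their address, which is why the allow-list check belongs on the server side rather than in the proxy.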
| snapplebobapple wrote:
| I really like these guys, I wish their pricing wasn't so
| ridiculous. Proper access control shouldn't cost 18 bucks a month
| for a VPN; it's basically unsellable to management at that price
| and the lower tiers are unsellable without it.
| starttoaster wrote:
| I'm really interested in what you're comparing Tailscale to
| internally, because it does way more than just VPN. What are
| the cheaper options, and do they also have an SSH feature,
| oauth authentication to the network for automation services,
| the ability to stand up VPN node loadbalancers in kubernetes
| clusters, and ACME certificate request automation through
| LetsEncrypt? Just to list a few features that I use from
| Tailscale's free tier that I don't normally think of as the job
| of a VPN service. And they're constantly adding new features
| that make it a really interesting and competitive choice, in my
| opinion. Honestly I'm mostly interested in this take because
| I'm shocked by how much they offer in the cheaper tiers.
| conception wrote:
| But if you're comparing it to other VPNs and only need VPN
| the pricing is bonkers.
| starttoaster wrote:
| That's fair. I guess if you just need a VPN, it doesn't
| really make sense to consider a product that packs all of
| these VPN-adjacent features. But part of my point was how
| much you're able to do on Tailscale for free, so in that
| regard, the pricing really doesn't seem as bonkers to me,
| to be honest. The $6/month tier is also incredibly
| reasonable for unlimited users, but it is annoying, I'll
| grant, that they actually drop some ACL features from Free
| tier to Starter. I suppose that's how they funnel a large
| number of enterprises from considering Starter in favor of
| Premium. But if you actually make use of the feature set of
| the platform, which if you have a decent DevOps team I'm
| pretty sure there's tools in there that they'll love, then
| $18/User/month actually doesn't seem too outrageous to me.
| newdee wrote:
| Which other VPNs? And are you talking about trad VPN
| (concentrators/hub & spoke)? If so, you wouldn't be
| considering Tailscale anyway.
| j45 wrote:
| You can install headscale and self-host for nothing then.
|
| Tailscale has competitors too with some overlaps, it might not
| be fully what you're looking for.
|
| All I know is within a few minutes I had more of a project
| working together than without it.
|
| It really is one of the more remarkably simple tools out there
| for everything it does, and has a generous free tier with 100
| devices and 3 users.
| j45 wrote:
| Link to headscale https://github.com/juanfont/headscale
| j-krieger wrote:
| Just a quick heads up, colleagues of mine could not
| successfully host headscale themselves. In the end, they saw
| the value and bought tailscale access.
|
| Configuring wireguard really is that hard. Tailscale is
| easily worth it
| watermelon0 wrote:
| IIRC, Wireguard is exclusively managed by Tailscale
| clients, and not by the server (headscale in this case).
| j45 wrote:
| Any details? Could be adjacent self-hosting issues compared
| to headscale itself.
|
| Considering it can also run in a docker container, it's
| next to trivial to install locally to try out
|
| https://headscale.net/running-headscale-container/
| hnarn wrote:
| > colleagues of mine could not successfully host headscale
| themselves
|
| it'd be interesting to know why, I use it frequently at
| work and it's worked pretty well so far.
| aktuel wrote:
| That's why I roll my VPN locally. One less party to worry about.
| txutxu wrote:
| The P in VPN was perverted a long time ago.
| edward28 wrote:
| Virtually public network
| johnnyAghands wrote:
| Wow, mad jelly their CI/CD and monitoring processes are robust
| enough to trust a major rollout in December. That's a pretty
| badass eng culture
|
| That being said, still some unanswered questions:
|
| - If the issue was ipv6 configuration breaking automated cert
| renewals for ipv4, wouldn't they have hit this like.. a long time
| ago? Did I miss something here?
|
| - Why did this take 90 minutes to resolve? I know it's like a
| blog post and not a real post-mortem, but some kind of timeline
| would have been nice to include in the post.
|
| - Why not move to a DNS provider that natively supports IPv6?
|
| Also I'm curious if it's worth the overhead to have a dedicated
| domain for scripts/packages? Do other folks do this? (excluding
| third-parties like package repositories).
| Thorrez wrote:
| >- If the issue was ipv6 configuration breaking automated cert
| renewals for ipv4, wouldn't they have hit this like.. a long
| time ago? Did I miss something here?
|
| AIUI, they switched to their current setup 90 days prior to the
| outage. The initial cert they installed during their migration
| lasted 90 days. So 90 days after the migration, they had an
| outage.
| smackeyacky wrote:
| I've said it before and I'll say it again: expiring certs are the
| new DNS for outages.
|
| I still marvel at just how good Tailscale is. I'm a minor user
| really but I have two sites that I use tailscale to access: a
| couple of on-prem servers and my AWS production setup.
|
| I can literally work from anywhere - had an issue over the
| weekend where I was trying to deploy an ECS container but the
| local wifi was so slow that the deploy kept timing out.
|
| I simply SSH'd over to my on-prem development machine, did a git
| pull of the latest code and did the deploy from there. All while
| remaining secure with no open ports at all on my on-prem system
| and none in AWS. Can even do testing against the production
| Aurora database without any open ports on it, simply run a
| tailscale agent in AWS on a nano sized EC2.
|
| Got another developer you need to give access to your network to?
| Tailscale makes that trivial (as it does revoking them).
|
| Yeah, for that deployment I could just make a GitHub action or
| something and avoid the perils of terrible internet, but for this
| I like to do it manually and Tailscale lets me do just that.
| gigatexal wrote:
| Surely they can automate the renewal? It seems their solution is
| a manual one. Am I being a simpleton?
___________________________________________________________________
(page generated 2024-03-30 23:00 UTC)