[HN Gopher] Fastmail 30 June outage post-mortem
___________________________________________________________________
Fastmail 30 June outage post-mortem
Author : _han
Score : 147 points
Date : 2023-07-06 14:08 UTC (8 hours ago)
(HTM) web link (www.fastmail.com)
(TXT) w3m dump (www.fastmail.com)
| xeromal wrote:
| Redundancy is the name of the game. Glad they realize this.
| lern_too_spel wrote:
| They will still be in only one datacenter. This is worse than
| its competitors.
| withinboredom wrote:
| You might have multiple datacenters, but until everyone isn't
| in the same office, you still have the same problem (office
| can burn down, or fall down and bury everyone).
|
| Also, it's email. It was literally designed to work in such a
| way that you can be down for days and still get your email.
| lern_too_spel wrote:
| > You might have multiple datacenters, but until everyone
| isn't in the same office, you still have the same problem
| (office can burn down, or fall down and bury everyone).
|
| Fastmail has multiple offices.
|
| > Also, it's email. It was literally designed to work in
| such a way that you can be down for days and still get your
| email.
|
| Delivery is designed that way, not storage. Once Fastmail
| tells the sender that a message was delivered to an inbox,
| Fastmail cannot ask the sender to redeliver it if its inbox
| storage is lost.
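|
| (A minimal sketch of that handoff, in Python with made-up
| hostnames: once the receiving server answers 250 at the end of
| DATA, the sending MTA drops its queued copy and will not retry,
| so durability is entirely the receiver's problem from then on.)
|
|     import smtplib
|     from email.message import EmailMessage
|
|     msg = EmailMessage()
|     msg["From"] = "alice@example.org"
|     msg["To"] = "bob@example.net"
|     msg["Subject"] = "hello"
|     msg.set_content("hi")
|
|     with smtplib.SMTP("mx.example.net", 25) as smtp:
|         # send_message() raises if the message is refused;
|         # returning normally means the server answered 250 and
|         # has accepted responsibility for the message.
|         smtp.send_message(msg)
|
|     # From here on, a lost message is a storage failure at the
|     # receiver; the sender's queue no longer holds a copy to
|     # redeliver.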
| hedora wrote:
| That doesn't seem to be true from a durability perspective:
|
| https://www.fastmail.help/hc/en-us/articles/1500000278242
|
| Read the section on "slots". They keep two copies in New
| Jersey, and one in Seattle.
|
| However, based on the post mortem, it sounds like they're not
| willing to invoke failover at the drop of a hat. They allude
| to needing complex routing to keep their old good-reputation
| IP addresses alive. That might have something to do with it.
| (They were "only" 3-5% down during the outage, which is bad
| for them, but not unusual by industry standards.)
| Lk7Of3vfJS2n wrote:
| I don't know if IP range reputation is part of SMTP but I wish
| sending email worked without it.
| _joel wrote:
| It's not part of SMTP, it's part of the kludge of anti-spam
| measures we bolted on top of it.
| micah94 wrote:
| ...bolted on top of DNS.
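|
| (Roughly how that works in practice, as a sketch: receivers
| consult DNS-based blocklists by reversing the sender's IPv4
| octets and querying the list's zone; an answer means "listed",
| NXDOMAIN means "not listed". Spamhaus zen is just a well-known
| example zone here, and real MTAs also inspect which 127.0.0.x
| code comes back.)
|
|     import socket
|
|     def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
|         # 192.0.2.1 -> 1.2.0.192.zen.spamhaus.org
|         query = ".".join(reversed(ip.split("."))) + "." + zone
|         try:
|             socket.gethostbyname(query)  # an answer = listed
|             return True
|         except socket.gaierror:
|             return False                 # NXDOMAIN = not listed
|
|     print(is_listed("192.0.2.1"))  # documentation range, not listed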
| nik736 wrote:
| Crazy that they were single homed.
|
| Edit: After seeing the network diagram I have even more
| questions. What happens if CF is down? This all seems cobbled
| together and very prone to failures.
| pc86 wrote:
| > This all seems cobbled together and very prone to failures.
|
| The entire internet.
| bsder wrote:
| If CF is down, they're down.
|
| The problem here is that _there isn't an alternative to
| Cloudflare_.
|
| They say this in the article. None of their DDoS solutions can
| take the heat except for Cloudflare.
|
| So, if you want resilience in the face of Cloudflare being
| down, you need to build another Cloudflare. Let me know when
| you build it. Lots of people will sign up.
| nik736 wrote:
| There are several providers in this space. Path, Voxility,
| etc.
| macinjosh wrote:
| To be fair. I've been a customer for 5+ years for my main
| personal email account and this is the first outage that has
| impacted me.
| hedgehog wrote:
| 15+ and there have been a few but nothing particularly
| notable. Gmail has had outages too, and I couldn't tell you
| from personal experience which is more reliable, which is
| interesting given the big difference in the complexity of the
| deployments (obviously Gmail also has the burden of much
| bigger scale).
| OhNoMyqueen wrote:
| They're still single homed, right?
|
| They just added redundancy to in/outbound routes.
| jmull wrote:
| Why assume CF is a simple box?
|
| I think the whole point of CF is that it isn't.
| ilyt wrote:
| Doesn't matter.
|
| Single or million boxes, if they apply wrong config, it
| doesn't work.
|
| You dual-home to hedge your bets and hope no two ISPs have a
| fuckup at the same time.
| jmull wrote:
| I think you're suggesting that some redundancy you get by
| accident -- by having two ISPs -- is better than the
| redundancy a single ISP could engineer.
|
| That's certainly possible in specific cases, but not a very
| good general principle to rely on. One CF could very well
| be better than two given ISPs.
| dfcowell wrote:
| That's a false dichotomy. You can absolutely have two
| ISPs, one of which is CF.
| jmull wrote:
| Sorry, that's not a false dichotomy.
|
| A second ISP isn't free, it has significant costs in
| terms of dollars and complexity. The question is, do CF
| _and_ another provider have significant benefits to
| justify the additional costs? For it to make sense you
| have to believe the redundancy CF provides is
| significantly lacking (and in a way that adding a second
| provider addresses). Maybe it's true, but it would be
| nuts to just assume it and start spending a lot of money.
| nik736 wrote:
| You have to plug in the cable somewhere. If the hardware
| where the x-connect is plugged in dies, has issues, has to be
| rebooted, etc. you have an issue and it's not like there were
| no CF issues ever.
| jmull wrote:
| Why does there have to be just one cable?
| withinboredom wrote:
| Do you think there are two cables that are never bundled
| in the same underground tube?
|
| All the redundancy in the world can't protect you from
| some random person digging in the wrong place.
| jmull wrote:
| Sure, maybe the underground tube was breached while
| someone was moving goalposts?
| hedora wrote:
| From the post-mortem, it doesn't sound like the
| problem was a single network cable. Redundant network
| switches have existed for a long time, and they're
| certainly using them (but not bothering to mention it in
| the post-mortem).
|
| Their problem was that they only have two transit
| providers, and one of them black-holed about 3-5% of the
| internet. Since it was a routing issue, I'd guess it was
| either a misconfiguration, or that the traffic is being
| split across dozens of paths, and one path had a correlated
| failure.
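|
| (If you wanted to see that from inside the datacenter, here's a
| crude sketch: attempt TCP connects to a spread of external
| destinations and count failures. A healthy egress path sits near
| 0%; a transit provider black-holing part of the internet shows up
| as a stubborn few percent. The probe list below is arbitrary; in
| practice you'd want hundreds of targets across many networks.)
|
|     import socket
|
|     PROBES = [
|         ("1.1.1.1", 443), ("8.8.8.8", 443), ("9.9.9.9", 443),
|         ("www.example.com", 443), ("www.wikipedia.org", 443),
|     ]
|
|     def failure_rate(probes, timeout=3.0):
|         failed = 0
|         for host, port in probes:
|             try:
|                 socket.create_connection((host, port),
|                                          timeout=timeout).close()
|             except OSError:
|                 failed += 1
|         return failed / len(probes)
|
|     print(f"unreachable: {failure_rate(PROBES):.0%}")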
| stuff4ben wrote:
| If CloudFlare is down, a significant portion of the Internet is
| down. Not that it's an excuse, but this isn't Microsoft or
| Apple. I'm sure funds have to be allocated to take into account
| the likelihood of something being down. But by all means write
| a blog post and tell them what they're doing wrong and how
| you'd fix it. Maybe they'll hire you...
| theideaofcoffee wrote:
| And you don't have to have the resources of Microsoft or
| Apple to plan and build for the eventuality that a provider
| becomes intermittent or unavailable. There are fundamental
| aspects of running an internet-facing service and they failed
| at one of the most basic.
| stuff4ben wrote:
| LOL ok, they "failed". They haven't had an outage like this
| in decades and this one only affected a small number of
| their clients. But sure, let's spend money on providing a
| backup for CF. Armchair QBs are the worst.
| theideaofcoffee wrote:
| Yet they still had the outage. I take exception to being
| called an 'armchair QB' when most of my career has been
| spent being called in to repair failures like this,
| providing postmortem advice to weather future ones and
| fix technical and cultural issues that give rise to just
| this type of thinking: oh, it won't happen to us because
| it has never happened to us.
| 2b3a51 wrote:
| In your experience, what kind of cost multiple is
| involved in remediation of the kinds of failure you deal
| with?
|
| Is it x2 or x100 or somewhere in between?
| theideaofcoffee wrote:
| Since you need two of everything, or more, two switches,
| two physical links, (hopefully) two physical racks or
| cabinets, and all that, it's minimum x2, but nowhere near
| x100. The cost for additional physical transit links is
| generally pretty reasonable, depending on the provider; if
| you have more links you can negotiate better rates, same
| with committed bandwidth. In general, the more you buy,
| the better the rates.
|
| There are a lot of aspects to that, but the cost of doing
| all of the above is a lot less than not having it, failing
| at the wrong moment, and losing money that way. Each
| business needs to weigh their risk against
| how much they want to invest and how much they think they
| can tolerate in terms of downtime.
| 2b3a51 wrote:
| Seems logical, thanks for engaging with the question.
| tedivm wrote:
| The issue isn't that they needed a backup to cloudflare.
| The problem was they only have a single internet provider
| at their datacenter, so they couldn't communicate with
| Cloudflare.
|
| I've honestly never had a service with a single outbound
| path. Most datacenters where you rent colo have two or
| three providers as part of their network. In the cases
| where I've had to manage my own networking inside of a
| datacenter I always pick two providers in case one fails.
|
| > Work is now underway to select a provider for a second
| transit connection directly into our servers -- either
| via Megaport, or from a service with their own physical
| presence in 365's New Jersey datacenter. Once we have
| this, we will be able to directly control our outbound
| traffic flow and route around any network with issues.
|
| Having multiple transit options is High Availability 101
| level stuff.
| toast0 wrote:
| > The issue isn't that they needed a backup to
| cloudflare. The problem was they only have a single
| internet provider at their datacenter, so they couldn't
| communicate with Cloudflare.
|
| That's not the issue. With Cloudflare MagicTransit,
| packets come in from Cloudflare, and egress normally.
| They were able to get packets from Cloudflare, but egress
| wasn't working to all destinations. I wasn't able to
| communicate with them from my CenturyLink DSL in Seattle,
| but when I forced a new IP that happened to be in a
| different /24, because I was seeing some other issues
| too, the fastmail issues resolved (although timing may be
| coincidental). Connecting via Verizon and T-Mobile, or a
| rented server in Seattle also worked. It's kind of a
| shame they don't provide services with IPv6, because if
| 5% of IPv4 failed and 5% of IPv6 failed, chances are good
| that the overall impact to users would be less than 5%,
| possibly much less, depending on exactly what the
| underlying issue was (which isn't disclosed); if it was a
| physical link issue, that's going to affect v4 and v6
| traffic that is routed over it, but if it's a BGP
| announcement issue, those are often separate.
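|
| (Back-of-the-envelope for the dual-stack point, assuming the v4
| and v6 failures really are independent, e.g. a BGP announcement
| problem on one address family only: a client that can fall back
| between families is cut off only when both paths fail.)
|
|     p_v4_down = 0.05
|     p_v6_down = 0.05
|     p_both_down = p_v4_down * p_v6_down  # independence assumed
|     print(f"{p_both_down:.2%}")          # 0.25%, vs 5% for v4-only
|
|     # If the failure is a shared physical link, the events are
|     # correlated and this estimate doesn't hold -- the caveat in
|     # the comment above.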
| withinboredom wrote:
| You'd be surprised at how many things break when
| different routes are chosen. Like etcd, MySQL, and so
| much more.
| tedivm wrote:
| Those are generally on internal networks and rarely need
| to communicate with the internet. They shouldn't be
| affected by this.
| withinboredom wrote:
| Twould be nice...
| arp242 wrote:
| > This all seems cobbled together and very prone to failures.
|
| AFAIK it's not like FastMail has a crazy number of network-
| related outages, so overall it doesn't seem that "prone to
| failure". As with many things, it's a trade-off with complexity
| and costs.
| paulryanrogers wrote:
| I'd argue that often the CDN or transit isn't drop-in
| replaceable. So it's usually more than 2x the cost as one has
| to maintain two architectures (or at least abstractions).
| That includes the expertise, plus either not optimizing for
| the strengths of either provider or building really robust
| abstractions/adapters.
| bm3 wrote:
| It is truly crazy that they do not have their own ASN and IP
| blocks.
|
| I cannot imagine running a service like that with cobbled
| together DIA circuits and leased IPs.
| daenney wrote:
| Incidents happen, that's life. You can hedge your bets, but some
| things are out of your control. Communicating with your customers
| however is entirely within your control. Fastmail did a poor job
| of it. Their status page was useless beyond an initial "we found
| an issue" and then nothing for almost 11hrs. Their Twitter
| account was the same story, didn't bother with the Mastodon
| account at all. Unfortunately they don't seem to realise or
| recognise that they dropped the ball on this and that goes
| entirely unaddressed.
|
| I'm also not really charmed with how they try to minimise the
| importance of the incident by repeating it only affected 3-5% of
| the customers. That might very well be. But those are real people
| and real businesses that rely on your services that were
| unavailable for the whole of the EU workday and a significant
| part of the US workday. Everyone I know who was affected is a
| paying customer, none of us have received so much as a
| communication or apology for it.
|
| For a company that's been on the internet since 1999, the single-
| homed setup is a little shocking. But fine, it's being addressed.
| But both the communication during and after the incident don't
| inspire a ton of confidence.
| luuurker wrote:
| On one hand, I too like better communications in situations
| like this. On the other, you knew they were aware of the issue
| and working on it. Updating the status page with a "we're
| trying to fix it" every hour or so wouldn't speed up the
| process of fixing the problem or help you in any way.
| daenney wrote:
| I didn't ask for an update every hour, but an 11hr stretch of
| silence is not cool. An update every 3 would've been fine.
| Even if there isn't much new to share, reiterating you're
| working on it is useful to reassure your customers. I'm
| fairly certain they could've found something more meaningful
| to communicate than "still twiddling our thumbs". For example,
| the update after 11hrs included the tip to switch to a VPN.
| That would've been useful to communicate.
| JohnMakin wrote:
| I'm not clear from the post-mortem why the outbound packets were
| having issues. Was it cloudflare? Did someone accidentally delete
| an outbound route? Why couldn't they see the issue themselves? I
| only have more questions now.
| ilyt wrote:
| It looks like their upstream provider fucked something up so
| until they release that info we can only speculate.
| electroly wrote:
| I'll admit that seeing they are single-homed makes their
| infrastructure sketchier than I'd assumed. I've been using Fastmail
| for years and I like them, but they are clearly big enough to
| have a second transit provider, and have been for many years. I'm
| amazed it took an outage for them to decide to get one. I
| appreciate the post-mortem but I felt better before I had read
| it.
| majkinetor wrote:
| Judging by the comments and upvotes here, and previously, I am
| actually amazed at how almost everybody seems to think he is
| perfect and infallible, and how he surely deserves 100%
| perfection from the day he is born to the day his
| grand-grand-... kids die.
| N0RMAN wrote:
| Am I the only one who thinks this post-mortem, for a small
| fraction of email users worldwide, is a pure marketing gimmick
| to attract more "technical" users?
| scandox wrote:
| I'm surprised they don't own their own IPs. In the email world I
| would say that's quite important. Seems a tad casual to say
| "luckily they are willing to lease them"...
| withinboredom wrote:
| First, someone has to be willing to sell them...
| gwright wrote:
| I'm a little rusty in this area, but I'm pretty sure you'll
| have a hard time getting a direct IP allocation (vs from your
| transit provider) _unless_ you are multi-homed, which
| apparently they were not.
|
| There is a nice tick up in complexity when you go to
| advertising your address space via BGP to multiple providers.
| tivert wrote:
| Yeah, me too, but it sounds like they have good reason and are
| working to fix it:
|
| > Thankfully, NYI were willing to lease us those addresses,
| because IP range reputation is really important in the email
| world, and those things are hardcoded all over the place -- but
| it has caused us complications due to more complex routing.
| Over the past year, we've been migrating to a new IP range.
| We've been running the two networks concurrently as we build up
| the trust reputation for the new addresses.
|
| Might be one of those things that they did when they were
| small, but then got hard to change. Hopefully they will fully
| own their new addresses.
|
| The thing I'm surprised about is that they only have "single
| path for traffic out to the internet."
| dizhn wrote:
| NYI. Good people.
| theideaofcoffee wrote:
| Netgear switches? In an environment like this? I'll give them the
| benefit of the doubt that that is maybe a provider-owned thing,
| and that they have an 'enterprise' line, but, really... Netgear.
| The firewall brand isn't revealed in the network diagram, but
| what is it, a $100 SonicWall? Should I be concerned, keeping all
| my email there, business and personal, about what other parts of
| their infrastructure they are cheaping out on?
|
| When you are running a service like this, redundancy among
| transit providers is the most basic, table-stakes thing you can
| do. It's almost negligent to not have that.
| incahoots wrote:
| To be fair, broadcasting your hardware via topology isn't what
| I consider a safe practice.
|
| As to the netgear switches, I would figure they'd have hot
| spares considering the cost savings. I'm not entirely sold on a
| specific vendor as the end-all-be-all for switching needs;
| support for most is less than stellar, and the need for hot
| spares grows with every quarterly report from the big-name
| vendors as they continue to push for larger margins.
|
| In environments like this, it's less about the vendor-specific
| product and more about the redundancy setup (which it appears
| they lacked, but are transparent about).
|
| Just my .02 cents
|
| edit: didn't realize this was such a "hot take"
| selykg wrote:
| > To be fair, broadcasting your hardware via topology isn't
| what I consider a safe practice.
|
| Ah yes, security through obscurity.
| incahoots wrote:
| You're not wrong but considering all of the recent 0-day
| exploits, I would argue that it's a better practice than
| the whack-a-mole response from vendors like Fortinet &
| Barracuda.
|
| https://www.bleepingcomputer.com/news/security/fortinet-
| new-...
|
| When the vendors you're buying from aren't taking security
| seriously, I suppose you take any necessary step in
| limiting exposure. I'd also argue that outside of the big
| boys like Cloudflare, no one else is displaying their
| topology via their own website.
| selykg wrote:
| That's fair, but let's still call it what it is. We
| shouldn't normalize hiding information as some form of
| security.
| incahoots wrote:
| Fair enough.
| dktnj wrote:
| Netgear do some half decent fully managed switches. It's not
| all blue crap off Amazon.
|
| The worst switches I ever used were HPE ones in the old C5000
| blade chassis. Absolute turds. Packet loss, constant port
| failures and complete hangs. HPE's solution was to tell us to
| buy new ones.
| bluedino wrote:
| The worst switches I've ever used would probably be various
| 'Cisco' switches from their small business line, usually ones
| that ran the same OS used when they were sold under different
| names like 3Com or Linksys.
| incahoots wrote:
| oh god, when I worked for a small telecom in the midwest
| they heavily used the 3Com switches. They were the bane of
| my existence, things would power loop, or my favorite,
| continue to work but prevent any sort of access to them.
| publicmail wrote:
| > It's not all blue crap off Amazon.
|
| To be honest, those little blue unmanaged Netgear switches
| aren't bad at all. We have dozens of them in our lab at work
| running 24/7 for like decades and have never had a failure
| that I remember.
| [deleted]
| ilyt wrote:
| I'd be far more worried about the single-homed internet
| connection than the brand of the switch.
| hedora wrote:
| If netgear has you worried, search for "white box switches".
| Most big iron runs off those these days.
| theideaofcoffee wrote:
| I am aware; I deploy white box gear. What concerns me is the
| software (some is better, some is worse), less so the
| merchant silicon the system is based on.
___________________________________________________________________
(page generated 2023-07-06 23:02 UTC)