[HN Gopher] Fastmail 30 June outage post-mortem
       ___________________________________________________________________
        
       Fastmail 30 June outage post-mortem
        
       Author : _han
       Score  : 147 points
       Date   : 2023-07-06 14:08 UTC (8 hours ago)
        
 (HTM) web link (www.fastmail.com)
 (TXT) w3m dump (www.fastmail.com)
        
       | xeromal wrote:
       | Redundancy is the name of the game. Glad they realize this.
        
         | lern_too_spel wrote:
          | They will still be in only one datacenter. This is worse than
          | their competitors.
        
           | withinboredom wrote:
           | You might have multiple datacenters, but until everyone isn't
           | in the same office, you still have the same problem (office
           | can burn down, or fall down and bury everyone).
           | 
           | Also, it's email. It was literally designed to work in such a
           | way that you can be down for days and still get your email.
        
             | lern_too_spel wrote:
             | > You might have multiple datacenters, but until everyone
             | isn't in the same office, you still have the same problem
             | (office can burn down, or fall down and bury everyone).
             | 
             | Fastmail has multiple offices.
             | 
             | > Also, it's email. It was literally designed to work in
             | such a way that you can be down for days and still get your
             | email.
             | 
             | Delivery is designed that way, not storage. Once Fastmail
             | tells the sender that a message was delivered to an inbox,
             | Fastmail cannot ask the sender to redeliver it if its inbox
             | storage is lost.
        
           | hedora wrote:
           | That doesn't seem to be true from a durability perspective:
           | 
           | https://www.fastmail.help/hc/en-us/articles/1500000278242
           | 
           | Read the section on "slots". They keep two copies in New
           | Jersey, and one in Seattle.
           | 
           | However, based on the post mortem, it sounds like they're not
           | willing to invoke failover at the drop of a hat. They allude
           | to needing complex routing to keep their old good-reputation
           | IP addresses alive. That might have something to do with it.
           | (They were "only" 3-5% down during the outage, which is bad
           | for them, but not unusual by industry standards.)
        
       | Lk7Of3vfJS2n wrote:
       | I don't know if IP range reputation is part of SMTP but I wish
       | sending email worked without it.
        
         | _joel wrote:
          | It's not part of SMTP, it's part of the kludge of anti-spam
         | measures we bolted on top of it.
        
           | micah94 wrote:
           | ...bolted on top of DNS.
        
       | nik736 wrote:
       | Crazy that they were single homed.
       | 
       | Edit: After seeing the network diagram I have even more
       | questions. What happens if CF is down? This all seems cobbled
       | together and very prone to failures.
        
         | pc86 wrote:
         | > This all seems cobbled together and very prone to failures.
         | 
         | The entire internet.
        
         | bsder wrote:
         | If CF is down, they're down.
         | 
          | The problem here is that _there isn't an alternative to
          | Cloudflare_.
         | 
          | They say this in the article. None of their DDoS solutions can
          | take the heat except for Cloudflare.
         | 
         | So, if you want resilience in the face of Cloudflare being
         | down, you need to build another Cloudflare. Let me know when
         | you build it. Lots of people will sign up.
        
           | nik736 wrote:
           | There are several providers in this space. Path, Voxility,
           | etc.
        
         | macinjosh wrote:
          | To be fair, I've been a customer for 5+ years for my main
         | personal email account and this is the first outage that has
         | impacted me.
        
           | hedgehog wrote:
            | 15+ years, and there have been a few, but nothing
            | particularly notable. Gmail has had outages too, and I
            | couldn't tell you from personal experience which is more
            | reliable, which is interesting given the big difference in
            | the complexity of the deployments (obviously Gmail also has
            | the burden of much bigger scale).
        
         | OhNoMyqueen wrote:
          | They're still single-homed, right?
          | 
          | They just added redundancy to in/outbound routes.
        
         | jmull wrote:
         | Why assume CF is a simple box?
         | 
         | I think the whole point of CF is that it isn't.
        
           | ilyt wrote:
           | Doesn't matter.
           | 
            | Single box or a million boxes, if they apply the wrong
            | config, it doesn't work.
            | 
            | You dual-home to hedge your bets and hope that no two ISPs
            | have a fuckup at the same time.
        
             | jmull wrote:
             | I think you're suggesting that some redundancy you get by
             | accident -- by having two ISPs -- is better than the
             | redundancy a single ISP could engineer.
             | 
             | That's certainly possible in specific cases, but not a very
             | good general principle to rely on. One CF could very well
             | be better than two given ISPs.
        
               | dfcowell wrote:
               | That's a false dichotomy. You can absolutely have two
                | ISPs, one of which is CF.
        
               | jmull wrote:
               | Sorry, that's not a false dichotomy.
               | 
                | A second ISP isn't free; it has significant costs in
                | terms of dollars and complexity. The question is, do CF
                | _and_ another provider have significant benefits to
                | justify the additional costs? For it to make sense you
                | have to believe the redundancy CF provides is
                | significantly lacking (and in a way that adding a second
                | provider addresses). Maybe it's true, but it would be
                | nuts to just assume it and start spending a lot of money.
        
           | nik736 wrote:
            | You have to plug in the cable somewhere. If the hardware
            | where the x-connect is plugged in dies, has issues, has to
            | be rebooted, etc., you have an issue, and it's not like CF
            | has never had issues of its own.
        
             | jmull wrote:
             | Why does there have to be just one cable?
        
               | withinboredom wrote:
               | Do you think there are two cables that are never bundled
               | in the same underground tube?
               | 
               | All the redundancy in the world can't protect you from
               | some random person digging in the wrong place.
        
               | jmull wrote:
               | Sure, maybe the underground tube was breached while
               | someone was moving goalposts?
        
             | hedora wrote:
              | From the post-mortem, it doesn't sound like the
             | problem was a single network cable. Redundant network
             | switches have existed for a long time, and they're
             | certainly using them (but not bothering to mention it in
             | the post-mortem).
             | 
             | Their problem was that they only have two transit
             | providers, and one of them black-holed about 3-5% of the
             | internet. Since it was a routing issue, I'd guess it was
             | either a misconfiguration, or that the traffic is being
             | split across dozens of paths, and one path had a correlated
             | failure.
        
         | stuff4ben wrote:
         | If CloudFlare is down, a significant portion of the Internet is
         | down. Not that it's an excuse, but this isn't Microsoft or
         | Apple. I'm sure funds have to be allocated to take into account
         | the likelihood of something being down. But by all means write
         | a blog post and tell them what they're doing wrong and how
         | you'd fix it. Maybe they'll hire you...
        
           | theideaofcoffee wrote:
           | And you don't have to have the resources of Microsoft or
           | Apple to plan and build for the eventuality that a provider
           | becomes intermittent or unavailable. There are fundamental
           | aspects of running an internet-facing service and they failed
           | at one of the most basic.
        
             | stuff4ben wrote:
             | LOL ok, they "failed". They haven't had an outage like this
             | in decades and this one only affected a small number of
             | their clients. But sure, let's spend money on providing a
             | backup for CF. Armchair QBs are the worst.
        
               | theideaofcoffee wrote:
               | Yet they still had the outage. I take exception to being
               | called an 'armchair QB' when most of my career has been
               | spent being called in to repair failures like this,
               | providing postmortem advice to weather future ones and
               | fix technical and cultural issues that give rise to just
               | this type of thinking: oh, it won't happen to us because
               | it has never happened to us.
        
               | 2b3a51 wrote:
               | In your experience, what kind of cost multiple is
               | involved in remediation of the kinds of failure you deal
               | with?
               | 
               | Is it x2 or x100 or somewhere in between?
        
               | theideaofcoffee wrote:
               | Since you need two of everything, or more, two switches,
               | two physical links, (hopefully) two physical racks or
               | cabinets, and all that, it's minimum x2, but nowhere near
                | x100. The cost of additional physical transit links is
                | generally pretty reasonable, depending on the provider;
                | if you have more links you can negotiate better rates,
                | and the same goes for committed bandwidth: the more you
                | buy, the better the rate.
               | 
               | There are a lot of aspects to that, but the cost of doing
               | all of the above is a lot less than not having it and
               | failing to have it at the wrong moment and losing money
               | that way. Each business needs to weigh their risk against
               | how much they want to invest and how much they think they
               | can tolerate in terms of downtime.
        
               | 2b3a51 wrote:
                | Seems logical, thanks for engaging with the question.
        
               | tedivm wrote:
               | The issue isn't that they needed a backup to cloudflare.
               | The problem was they only have a single internet provider
               | at their datacenter, so they couldn't communicate with
               | Cloudflare.
               | 
               | I've honestly never had a service with a single outbound
               | path. Most datacenters where you rent colo have two or
               | three providers as part of their network. In the cases
               | where I've had to manage my own networking inside of a
               | datacenter I always pick two providers in case one fails.
               | 
               | > Work is now underway to select a provider for a second
               | transit connection directly into our servers -- either
               | via Megaport, or from a service with their own physical
               | presence in 365's New Jersey datacenter. Once we have
               | this, we will be able to directly control our outbound
               | traffic flow and route around any network with issues.
               | 
               | Having multiple transit options is High Availability 101
               | level stuff.
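         
        A rough sketch of the kind of per-link egress check that multi-
        homing makes possible, in plain Python; the two upstream links
        and their local source addresses below are made-up placeholders,
        not anything from Fastmail's setup, and the check only proves the
        path through a given link if source-based policy routing is in
        place for those addresses.
         
            import socket
         
            # Placeholder per-link source addresses (documentation ranges).
            UPLINKS = {
                "transit-a": "198.51.100.10",
                "transit-b": "203.0.113.10",
            }
            PROBE_TARGETS = [("one.one.one.one", 443), ("dns.google", 443)]
         
            def egress_ok(source_ip, host, port, timeout=3.0):
                """True if a TCP connect succeeds when bound to source_ip."""
                try:
                    with socket.create_connection((host, port), timeout=timeout,
                                                  source_address=(source_ip, 0)):
                        return True
                except OSError:
                    return False
         
            for link, src in UPLINKS.items():
                results = [egress_ok(src, host, port) for host, port in PROBE_TARGETS]
                print(f"{link}: {sum(results)}/{len(results)} probes succeeded")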
        
               | toast0 wrote:
               | > The issue isn't that they needed a backup to
               | cloudflare. The problem was they only have a single
               | internet provider at their datacenter, so they couldn't
               | communicate with Cloudflare.
               | 
               | That's not the issue. With Cloudflare MagicTransit,
               | packets come in from Cloudflare, and egress normally.
               | They were able to get packets from Cloudflare, but egress
               | wasn't working to all destinations. I wasn't able to
               | communicate with them from my CenturyLink DSL in Seattle,
               | but when I forced a new IP that happened to be in a
               | different /24, because I was seeing some other issues
               | too, the fastmail issues resolved (although timing may be
               | coincidental). Connecting via Verizon and T-Mobile, or a
               | rented server in Seattle also worked. It's kind of a
               | shame they don't provide services with IPv6, because if
               | 5% of IPv4 failed and 5% of IPv6 failed, chances are good
               | that the overall impact to users would be less than 5%,
               | possibly much less, depending on exactly what the
               | underlying issue was (which isn't disclosed); if it was a
               | physical link issue, that's going to affect v4 and v6
               | traffic that is routed over it, but if it's a BGP
               | announcement issue, those are often separate.
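         
        A minimal sketch of the kind of dual-stack reachability check
        described in the comment above, using only the Python standard
        library; www.fastmail.com is just an example target, and the
        script only reports what is reachable from the current vantage
        point, which is exactly why results differ between ISPs.
         
            import socket
         
            def check_reachability(host, port=443, timeout=3.0):
                """Try a TCP connect to every A and AAAA record of host."""
                for family, label in ((socket.AF_INET, "IPv4"),
                                      (socket.AF_INET6, "IPv6")):
                    try:
                        infos = socket.getaddrinfo(host, port, family,
                                                   socket.SOCK_STREAM)
                    except socket.gaierror:
                        print(f"{label}: no addresses published")
                        continue
                    for *_, sockaddr in infos:
                        addr = sockaddr[0]
                        try:
                            # create_connection handles IPv6 literals too
                            with socket.create_connection((addr, port),
                                                          timeout=timeout):
                                print(f"{label}: {addr} reachable")
                        except OSError as exc:
                            print(f"{label}: {addr} unreachable ({exc})")
         
            check_reachability("www.fastmail.com")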
        
               | withinboredom wrote:
               | You'd be surprised at how many things break when
               | different routes are chosen. Like etcd, MySQL, and so
               | much more.
        
               | tedivm wrote:
               | Those are generally on internal networks and rarely need
               | to communicate with the internet. They shouldn't be
               | affected by this.
        
               | withinboredom wrote:
               | Twould be nice...
        
         | arp242 wrote:
         | > This all seems cobbled together and very prone to failures.
         | 
         | AFAIK it's not like FastMail has a crazy number of network-
         | related outages, so overall it doesn't seem that "prone to
         | failure". As with many things, it's a trade-off with complexity
         | and costs.
        
           | paulryanrogers wrote:
           | I'd argue that often the CDN or transit isn't drop-in
           | replaceable. So it's usually more than 2x the cost as one has
           | to maintain two architectures (or at least abstractions).
            | That includes the expertise, plus either not optimizing for
            | the strengths of either provider or building really robust
            | abstractions/adapters.
        
         | bm3 wrote:
         | It is truly crazy that they do not have their own ASN and IP
         | blocks.
         | 
         | I cannot imagine running a service like that with cobbled
         | together DIA circuits and leased IPs.
        
       | daenney wrote:
       | Incidents happen, that's life. You can hedge your bets, but some
       | things are out of your control. Communicating with your customers
       | however is entirely within your control. Fastmail did a poor job
       | of it. Their status page was useless beyond an initial "we found
       | an issue" and then nothing for almost 11hrs. Their Twitter
       | account was the same story, didn't bother with the Mastodon
       | account at all. Unfortunately they don't seem to realise or
       | recognise that they dropped the ball on this and that goes
       | entirely unaddressed.
       | 
       | I'm also not really charmed with how they try to minimise the
       | importance of the incident by repeating it only affected 3-5% of
       | the customers. That might very well be. But those are real people
       | and real businesses that rely on your services that were
       | unavailable for the whole of the EU workday and a significant
       | part of the US workday. Everyone I know who was affected is a
       | paying customer, none of us have received so much as a
       | communication or apology for it.
       | 
       | For a company that's been on the internet since 1999, the single-
       | homed setup is a little shocking. But fine, it's being addressed.
        | But the communication both during and after the incident
        | doesn't inspire a ton of confidence.
        
         | luuurker wrote:
         | On one hand, I too like better communications in situations
         | like this. On the other, you knew they were aware of the issue
         | and working on it. Updating the status page with a "we're
         | trying to fix it" every hour or so wouldn't speed up the
         | process of fixing the problem or help you in any way.
        
           | daenney wrote:
           | I didn't ask for an update every hour, but an 11hr stretch of
           | silence is not cool. An update every 3 would've been fine.
           | Even if there isn't much new to share, reiterating you're
           | working on it is useful to reassure your customers. I'm
           | fairly certain they could've found something more meaningful
            | to communicate than "still twiddling our thumbs". For
            | example, the update after 11hrs included the tip to switch
            | to a VPN. That would've been useful to communicate earlier.
        
       | JohnMakin wrote:
       | I'm not clear from the post-mortem why the outbound packets were
       | having issues. Was it cloudflare? Did someone accidentally delete
       | an outbound route? Why couldn't they see the issue themselves? I
       | only have more questions now.
        
         | ilyt wrote:
          | It looks like their upstream provider fucked something up,
          | so until they release that info we can only speculate.
        
       | electroly wrote:
        | I'll admit that being single-homed is sketchier than I had
        | assumed Fastmail's infrastructure was. I've been using Fastmail
       | for years and I like them, but they are clearly big enough to
       | have a second transit provider, and have been for many years. I'm
       | amazed it took an outage for them to decide to get one. I
       | appreciate the post-mortem but I felt better before I had read
       | it.
        
         | majkinetor wrote:
          | Judging by the comments and upvotes here, and previously, I
          | am actually amazed at how almost everybody thinks they are
          | perfect and infallible and surely deserve 100% perfection from
          | the day they are born to the day their great-great-...
          | grandkids die.
        
       | N0RMAN wrote:
        | Am I the only one who thinks that this post-mortem, for a
        | marginal share of email users worldwide, is a pure marketing
        | gimmick to attract more "technical" users?
        
       | scandox wrote:
       | I'm surprised they don't own their own IPs. In the email world I
       | would say that's quite important. Seems a tad casual to say
       | "luckily they are willing to lease them"...
        
         | withinboredom wrote:
         | First, someone has to be willing to sell them...
        
         | gwright wrote:
         | I'm a little rusty in this area, but I'm pretty sure you'll
         | have a hard time getting a direct IP allocation (vs from your
         | transit provider) _unless_ you are multi-homed, which
         | apparently they were not.
         | 
         | There is a nice tick up in complexity when you go to
         | advertising your address space via BGP to multiple providers.
        
         | tivert wrote:
         | Yeah, me too, but it sounds like they have good reason and are
         | working to fix it:
         | 
         | > Thankfully, NYI were willing to lease us those addresses,
         | because IP range reputation is really important in the email
         | world, and those things are hardcoded all over the place -- but
         | it has caused us complications due to more complex routing.
         | Over the past year, we've been migrating to a new IP range.
         | We've been running the two networks concurrently as we build up
         | the trust reputation for the new addresses.
         | 
         | Might be one of those things that they did when they were
         | small, but then got hard to change. Hopefully they will fully
         | own their new addresses.
         | 
         | The thing I'm surprised about is that they only have "single
         | path for traffic out to the internet."
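         
        A rough sketch of one way to watch the reputation of a new
        outbound range while it is being warmed up: query a few public
        DNS blocklists (DNSBLs). The zones below are common examples, the
        address is a documentation placeholder, and each list's usage
        policy applies; this is not a description of Fastmail's tooling.
         
            import socket
         
            DNSBL_ZONES = ["zen.spamhaus.org", "bl.spamcop.net"]
         
            def dnsbl_listings(ipv4):
                """Return the DNSBL zones that currently list this IPv4 address."""
                reversed_ip = ".".join(reversed(ipv4.split(".")))
                listed = []
                for zone in DNSBL_ZONES:
                    try:
                        # Any A record for <reversed-ip>.<zone> means "listed".
                        socket.gethostbyname(f"{reversed_ip}.{zone}")
                        listed.append(zone)
                    except socket.gaierror:
                        pass  # NXDOMAIN: not listed on this zone
                return listed
         
            for addr in ["192.0.2.1"]:  # placeholder address to check
                hits = dnsbl_listings(addr)
                print(addr, "listed on:", ", ".join(hits) or "none")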
        
       | dizhn wrote:
       | NYI. Good people.
        
       | theideaofcoffee wrote:
       | Netgear switches? In an environment like this? I'll give them the
       | benefit of the doubt that that is maybe a provider-owned thing,
       | and that they have an 'enterprise' line, but, really... Netgear.
       | The firewall brand isn't revealed in the network diagram, but
        | what is it, a $100 SonicWall? Keeping all my email there,
        | business and personal, should I be concerned about what other
        | parts of their infrastructure they are cheaping out on?
       | 
       | When you are running a service like this, redundancy among
       | transit providers is the most basic, table-stakes thing you can
       | do. It's almost negligent to not have that.
        
         | incahoots wrote:
         | To be fair, broadcasting your hardware via topology isn't what
         | I consider a safe practice.
         | 
          | As to the Netgear switches, I would figure they'd have hot
          | spares, given the cost savings. I'm not entirely sold on any
          | specific vendor as the end-all-be-all for switching needs;
          | support for most is less than stellar, and the case for hot
          | spares grows with every quarterly report from the big-name
          | vendors as they continue to push for larger margins.
         | 
          | In environments like this, it's less about the vendor-
          | specific product and more about the redundancy setup (which it
          | appears they lacked, though they are transparent about it).
         | 
         | Just my .02 cents
         | 
         | edit: didn't realize this was such a "hot take"
        
           | selykg wrote:
           | > To be fair, broadcasting your hardware via topology isn't
           | what I consider a safe practice.
           | 
           | Ah yes, security through obscurity.
        
             | incahoots wrote:
              | You're not wrong, but considering all of the recent 0-day
              | exploits, I would argue that it's a better practice than
              | the whack-a-mole response from vendors like Fortinet &
             | Barracuda.
             | 
             | https://www.bleepingcomputer.com/news/security/fortinet-
             | new-...
             | 
             | When the vendors you're buying from aren't taking security
             | seriously, I suppose you take any necessary step in
             | limiting exposure. I'd also argue that outside of the big
             | boys like Cloudflare, no one else is displaying their
             | topology via their own website.
        
               | selykg wrote:
               | That's fair, but let's still call it what it is. We
                | shouldn't normalize hiding information as some form of
                | security.
        
               | incahoots wrote:
               | Fair enough.
        
         | dktnj wrote:
         | Netgear do some half decent fully managed switches. It's not
         | all blue crap off Amazon.
         | 
         | The worst switches I ever used were HPE ones in the old C5000
         | blade chassis. Absolute turds. Packet loss, constant port
         | failures and complete hangs. HPE's solution was to tell us to
         | buy new ones.
        
           | bluedino wrote:
           | The worst switches I've ever used would probably be various
           | 'Cisco' switches from their small business line, usually ones
           | that ran the same OS used when they were sold under different
           | names like 3Com or Linksys.
        
             | incahoots wrote:
              | Oh god, when I worked for a small telecom in the Midwest
              | they heavily used the 3Com switches. They were the bane of
              | my existence; things would power loop, or, my favorite,
              | continue to work but prevent any sort of access to them.
        
           | publicmail wrote:
           | > It's not all blue crap off Amazon.
           | 
           | To be honest, those little blue unmanaged Netgear switches
           | aren't bad at all. We have dozens of them in our lab at work
           | running 24/7 for like decades and have never had a failure
           | that I remember.
        
         | [deleted]
        
         | ilyt wrote:
          | I'd be far more worried about the single-homed internet
          | connection than the brand of the switch.
        
         | hedora wrote:
         | If netgear has you worried, search for "white box switches".
         | Most big iron runs off those these days.
        
           | theideaofcoffee wrote:
            | I am aware; I deploy white box gear. What concerns me is
            | the software (some is better, some is worse), more so than
            | the merchant silicon the systems are based on.
        
       ___________________________________________________________________
       (page generated 2023-07-06 23:02 UTC)