[HN Gopher] Lichess: Post-Mortem of Our Longest Downtime
___________________________________________________________________
Lichess: Post-Mortem of Our Longest Downtime
Author : jpablo
Score : 141 points
Date : 2024-09-18 23:02 UTC (23 hours ago)
(HTM) web link (lichess.org)
(TXT) w3m dump (lichess.org)
| carlsborg wrote:
| The main lichess engine (lila, open source) is a single monolith
| program that's deployed on a single server. It serves ~5 million
| games per day. But there are several other pieces too. They
| discuss the architecture here
| https://www.youtube.com/watch?v=crKNBSpO2_I
|
| BTW consider donating if you use lichess.
| justinclift wrote:
| Wow. ~US$40k/mo running costs, with about US$5k/mo for server
| hosting:
|
| https://lichess.org/costs
|
| It _looks_ like the servers are individually managed via OVH or
| similar, rather than running their own gear in co-location or
| similar. Wonder why?
| squigz wrote:
| Surprising numbers, and really goes to show how cheap the
| hardware/software side is for this sort of thing if you do it
| right.
|
| I wonder what the "Misc dev salaries" is for - only curious
| because it's a flat $5k
| justinclift wrote:
| Heh heh heh.
|
| To me those numbers seem on the high side as I'm
| (personally) used to (for cheap projects) scavenging
| together stuff from Ebay before deploying to a data centre.
| ;)
| squigz wrote:
| lichess is hardly a "cheap project" though :P It's one of
| the most popular chess platforms
| justinclift wrote:
| Sure, but they seem to be extremely budget constrained.
| ;)
| me_me_me wrote:
| no surprise there tbh
|
| Here is a comparison of free and their premium accounts:
|
| https://lichess.org/features
| justinclift wrote:
| Looks like they're fulfilling their mission?
| tormeh wrote:
| Easy: If something is wrong with the physical gear it's OVH's
| problem rather than theirs. It also means no one has to ever
| go to the data center which is probably important for a
| geographically distributed team (I assume they are). Cheap,
| no-frills cloud is extremely underrated, IMO.
| benmmurphy wrote:
| It's also crazy how much cheaper it is than AWS. The primary
| DB is around $500/month with 32 CPU and 256 GB of RAM and
| 7TB. AWS RDS db.m6gd.8xlarge which is 32 CPU and 128 GB of
| RAM costs $2150/month before paying for storage as well.
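
The comparison above can be worked through as a quick calculation. This is a minimal sketch that uses only the approximate figures quoted in the comment (around $500/month versus $2150/month); the variable names are illustrative, and RDS storage is left out because the comment excludes it too.

```python
# Approximate monthly figures as quoted in the comment; not official pricing.
OVH_MONTHLY_USD = 500    # primary DB: 32 CPU, 256 GB RAM, 7 TB storage included
RDS_MONTHLY_USD = 2150   # db.m6gd.8xlarge: 32 CPU, 128 GB RAM, storage billed separately
OVH_RAM_GB = 256
RDS_RAM_GB = 128

# Headline ratio: how many times more the RDS instance costs per month.
monthly_ratio = RDS_MONTHLY_USD / OVH_MONTHLY_USD

# Normalizing by RAM widens the gap, since the RDS instance has half the memory.
ovh_usd_per_gb = OVH_MONTHLY_USD / OVH_RAM_GB
rds_usd_per_gb = RDS_MONTHLY_USD / RDS_RAM_GB
per_gb_ratio = rds_usd_per_gb / ovh_usd_per_gb

print(f"monthly price ratio: {monthly_ratio:.1f}x")
print(f"price per GB of RAM: ${ovh_usd_per_gb:.2f} vs ${rds_usd_per_gb:.2f} ({per_gb_ratio:.1f}x)")
```

On these numbers the managed instance is roughly 4x the monthly price and roughly 8-9x the price per GB of RAM, before any storage charges.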
| bryan_w wrote:
| Yeah, but you get what you pay for. That m6gd.8xlarge would
| never be subject to such a long network outage as once the
| hardware fault was detected, it would be moved to another
| machine
| beaviskhan wrote:
| Yup, and you also get to make AWS deal with OS upgrades,
| DB upgrades, backups, etc.
| squigz wrote:
| https://lichess.org/patron
| hilux wrote:
| I'm a patron!
|
| I really appreciate the benefits package for patrons. Thibault
| is zee best.
| holsta wrote:
| This response and post-mortem is superior to most commercial
| services I have seen in recent years.
| nomilk wrote:
| Exact same thought went through my head. Also note in the first
| few paragraphs they acknowledge the worst impacts to users.
| That's very selfless - often corporate postmortems _downplay_
| the impact, which frustrates users more. Incidentally, a
| critical service I use (Postmark) had an outage this week and I
| didn't even hear from them (I found out via a random twitter
| post). Shows the difference.
| CSMastermind wrote:
| Presumably because Lichess is free thus doesn't have
| contractual obligations and SLAs that they'll be sued for
| breaching.
| hyperbovine wrote:
| That's basically every aspect of their service. The founder
| Thibault Duplessis is criminally undercompensated (his choice)
| for running a site that is better designed, faster, and more
| popular than 99% of commercial websites out there.
| agentcoops wrote:
| I worked with him once on a job -- incredibly nice guy and
| obviously talented developer who used to work for the French
| agency responsible for the Scala Play Framework.
| https://github.com/lichess-org/lila and
| https://github.com/lichess-org/scalachess are great resources
| for anyone ever curious to see a production quality Scala3
| web application using Cats and all the properly functional
| properties of the language.
| notagoodidea wrote:
| Would you recommend it as a deep-dive to observe Scala in
| production?
| agentcoops wrote:
| I haven't looked at the code in ages, but it's probably
| the only scaled consumer web application written in Scala
| and moreover running on Scala 3 that you can see the end-
| to-end source for. You have all the Twitter open source
| Scala projects, of course, but that's just infrastructure
| for running a web application, rather than an actual
| production quality app -- and my sense is that in 2024
| there aren't many product teams outside of Twitter using
| their application tooling (as opposed to some of their
| data infrastructure, certainly the area where Scala sees
| the most use today with Spark etc).
|
| TLDR if you want to see production-quality Scala code
| that this very second is serving 40k chess games -- and
| mostly bullet/blitz where ms latency is of course crucial
| -- definitely take a look.
|
| Not as much hype for the language at the moment over Rust
| or Kotlin, say, but it remains my language of choice for
| web backends by far.
| redbell wrote:
| > so you, as our beneficiaries and stakeholders, who support us
| and encourage us -- _deserve to get clarification on what
| happened_
|
| Is it that complicated for big tech to reply politely with the
| above statement when they suddenly disable your account for no
| obvious reason?
| mewpmewp2 wrote:
| It may not be complicated, but it does require caring about
| what you do and your customers as opposed to going through
| basic minimum requirements to appear that you are doing
| something.
|
| It is much more difficult for corporate cogs to have that
| level of care compared to someone who does their things with
| passion.
| morgante wrote:
| The post-mortem is honest, but the infrastructure is well below
| what I'd expect from commercial services.
|
| If a commercial provider told me they're dependent on a single
| physical server, with no real path or plans to fail over to
| another server if they need to, I would consider it extremely
| negligent.
|
| It's fine to not use big cloud providers, but frankly it's
| pretty incompetent to not have the ability to quickly deploy to
| a new server.
| lukhas wrote:
| We're an understaffed charity.
| morgante wrote:
| Yeah I'm not criticizing it as a charity, just pointing out
| this definitely isn't "superior to most commercial
| services."
|
| That being said, removing dependence on single hardware
| nodes isn't something you need a big team for. I've done
| failover at 1-person startups.
| KolmogorovComp wrote:
| And yet even Meta recently had a multi-hour downtime,
| despite a budget thousands if not millions of times higher. Would
| you call them negligent too?
|
| By increasing the complexity you multiply the failure points
| and increase ongoing maintenance, which is the bottleneck
| (even more than money) for volunteer-driven projects.
| morgante wrote:
| To be clear, you don't need to make it more complex /
| failure-prone. I didn't say failover needs to be automated.
|
| Kubernetes or complex cloud services are not required to
| have some basic deployment automation.
|
| You can do it with a simple bash script if you need to.
| It's just pretty surprising to see the reaction to a
| hardware failure being to wait around for it to be repaired
| instead of simply spinning up a new host.
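
As a rough illustration of the "basic deployment automation" described above, here is a minimal sketch of a manually triggered redeploy to a fresh host. It is not Lichess's actual tooling: the host name, the `deploy` user, the release directory, and the service name are all hypothetical, and restoring data and switching DNS or traffic are out of scope.

```python
import shlex
import subprocess


def redeploy_commands(new_host, release_dir="/srv/app/release", service="app.service"):
    """Return the ordered commands for a manual redeploy to new_host.

    All names here (deploy user, release_dir, service) are hypothetical placeholders.
    """
    target = f"deploy@{new_host}"
    return [
        # Copy the current release to the new host.
        ["rsync", "-az", "--delete", f"{release_dir}/", f"{target}:{release_dir}/"],
        # Restart the application service on the new host.
        ["ssh", target, "sudo", "systemctl", "restart", service],
        # Confirm the service reports as active.
        ["ssh", target, "systemctl", "is-active", service],
    ]


def redeploy(new_host, dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    commands = redeploy_commands(new_host)
    for command in commands:
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run against a placeholder host: only prints what would be executed.
    redeploy("standby.example.org", dry_run=True)
```

Keeping the steps as plain data makes the dry run trivial, so an operator can review exactly what would run before triggering the real thing by hand.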
| ctippett wrote:
| Once the private link was reestablished, could they not have
| tunneled out to the internet via another server acting as a sort
| of gateway?
|
| Disclaimer: I'm not a network engineer so I may be
| misunderstanding the practicality and complexity of such a
| workaround.
| theideaofcoffee wrote:
| I guess some of my questions are addressed in the latter half of
| the post, but I'm still puzzled why a prominent service didn't
| have a plan for what looked like a run of the mill hardware
| outage. It's hard to know exactly what happened as I'm having
| trouble parsing some of the post (what is a 'network connector'?
| is it a cable? nic?). What were some of the 'increasingly
| outlandish' workarounds? Are they actually standing up production
| hosts manually, and was that the cause of a delay or
| unwillingness to get new hardware going? I think it would be
| important to have all of that set down either in documentation or
| code seeing as most of their technical staff are either
| volunteers, who may come and go, or part timers. Maybe they did,
| it's not clear.
|
| It's also weird seeing that they are still waiting on their
| provider to tell them exactly what was done to the hardware to
| get it going again, that's usually one of the first things a tech
| mentions: "ok, we replaced the optics in port 1" or "I replaced
| that cable after seeing increased error rates", something like
| that.
| lazyant wrote:
| summary for the lazy: OVH
___________________________________________________________________
(page generated 2024-09-19 23:01 UTC)