[HN Gopher] Lichess: Post-Mortem of Our Longest Downtime
       ___________________________________________________________________
        
       Lichess: Post-Mortem of Our Longest Downtime
        
       Author : jpablo
       Score  : 141 points
       Date   : 2024-09-18 23:02 UTC (23 hours ago)
        
 (HTM) web link (lichess.org)
 (TXT) w3m dump (lichess.org)
        
       | carlsborg wrote:
       | The main lichess engine (lila, open source) is a single monolith
       | program that's deployed on a single server. It serves ~5 million
       | games per day. But there are several other pieces too. They
       | discuss the architecture here
       | https://www.youtube.com/watch?v=crKNBSpO2_I
       | 
       | BTW consider donating if you use lichess.
        
         | justinclift wrote:
         | Wow. ~US$40k/mo running costs, with about US$5k/mo for server
         | hosting:
         | 
         | https://lichess.org/costs
         | 
         | It _looks_ like the servers are individually managed via OVH or
         | similar, rather than running their own gear in co-location or
         | similar. Wonder why?
        
           | squigz wrote:
           | Surprising numbers, and really goes to show how cheap the
           | hardware/software side is for this sort of thing if you do it
           | right.
           | 
           | I wonder what the "Misc dev salaries" is for - only curious
           | because it's a flat $5k
        
             | justinclift wrote:
             | Heh heh heh.
             | 
             | To me those numbers seem on the high side as I'm
             | (personally) used to (for cheap projects) scavenging
             | together stuff from Ebay before deploying to a data centre.
             | ;)
        
               | squigz wrote:
               | lichess is hardly a "cheap project" though :P It's one of
               | the most popular chess platforms
        
               | justinclift wrote:
               | Sure, but they seem to be extremely budget constrained.
               | ;)
        
               | me_me_me wrote:
               | no surprise there tbh
               | 
               | Here is a comparison of free and their premium accounts:
               | 
               | https://lichess.org/features
        
               | justinclift wrote:
               | Looks like they're fulfilling their mission?
        
           | tormeh wrote:
           | Easy: If something is wrong with the physical gear it's OVH's
           | problem rather than theirs. It also means no one has to ever
           | go to the data center which is probably important for a
           | geographically distributed team (I assume they are). Cheap,
           | no-frills cloud is extremely underrated, IMO.
        
           | benmmurphy wrote:
           | It's also crazy how much cheaper it is than AWS. The primary
           | DB is around $500/month with 32 CPU and 256 GB of RAM and
           | 7TB. AWS RDS db.m6gd.8xlarge which is 32 CPU and 128 GB of
           | RAM costs $2150/month before paying for storage as well.
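
The comparison in the comment above can be worked through as a small sketch. It uses only the figures quoted there ($500/month for 32 CPU and 256 GB of RAM, versus $2150/month for 32 CPU and 128 GB of RAM on RDS, storage excluded). The dictionary and function names are illustrative assumptions, and actual prices vary by region and over time.

```python
# Figures as quoted in the comment; RDS storage cost is excluded there too.
# All names below are illustrative, not taken from any pricing API.
DEDICATED = {"usd_per_month": 500, "cpu": 32, "ram_gb": 256}
RDS_M6GD_8XLARGE = {"usd_per_month": 2150, "cpu": 32, "ram_gb": 128}


def price_ratio(a, b):
    """How many times more expensive `a` is than `b` per month."""
    return a["usd_per_month"] / b["usd_per_month"]


def cost_per_gb_ram(server):
    """Monthly cost in USD per GB of RAM."""
    return server["usd_per_month"] / server["ram_gb"]


if __name__ == "__main__":
    print(f"monthly price ratio: {price_ratio(RDS_M6GD_8XLARGE, DEDICATED):.1f}x")
    print(f"dedicated: ${cost_per_gb_ram(DEDICATED):.2f} per GB RAM")
    print(f"RDS:       ${cost_per_gb_ram(RDS_M6GD_8XLARGE):.2f} per GB RAM")
```

On these numbers the RDS instance costs about 4.3 times as much per month while offering half the RAM, so roughly 8.6 times as much per GB of RAM, before storage.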
        
             | bryan_w wrote:
             | Yeah, but you get what you pay for. That m6gd.8xlarge would
             | never be subject to such a long network outage as once the
             | hardware fault was detected, it would be moved to another
             | machine
        
               | beaviskhan wrote:
               | Yup, and you also get to make AWS deal with OS upgrades,
               | DB upgrades, backups, etc.
        
         | squigz wrote:
         | https://lichess.org/patron
        
         | hilux wrote:
         | I'm a patron!
         | 
         | I really appreciate the benefits package for patrons. Thibault
         | is zee best.
        
       | holsta wrote:
       | This response and post-mortem is superior to most commercial
       | services I have seen in recent years.
        
         | nomilk wrote:
         | Exact same thought went through my head. Also note in the first
         | few paragraphs they acknowledge the worst impacts to users.
         | That's very selfless - often corporate postmortems _downplay_
         | the impact, which frustrates users more. Incidentally, a
         | critical service I use (Postmark) had an outage this week and I
         | didn't even hear from them (I found out via a random twitter
         | post). Shows the difference.
        
           | CSMastermind wrote:
           | Presumably because Lichess is free thus doesn't have
           | contractual obligations and SLAs that they'll be sued for
           | breaching.
        
         | hyperbovine wrote:
         | That's basically every aspect of their service. The founder
         | Thibault Duplessis is criminally undercompensated (his choice)
         | for running a site that is better designed, faster, and more
         | popular than 99% of commercial websites out there.
        
           | agentcoops wrote:
           | I worked with him once on a job -- incredibly nice guy and
           | obviously talented developer who used to work for the French
           | agency responsible for the Scala Play Framework.
           | https://github.com/lichess-org/lila and
           | https://github.com/lichess-org/scalachess are great resources
           | for anyone ever curious to see a production quality Scala3
           | web application using Cats and all the properly functional
           | properties of the language.
        
             | notagoodidea wrote:
             | Would you recommend it as a deep-dive to observe Scala in
             | production?
        
               | agentcoops wrote:
               | I haven't looked at the code in ages, but it's probably
               | the only scaled consumer web application written in Scala
               | and moreover running on Scala 3 that you can see the end-
               | to-end source for. You have all the Twitter open source
               | Scala projects, of course, but that's just infrastructure
               | for running a web application, rather than an actual
               | production quality app -- and my sense is that in 2024
               | there aren't many product teams outside of Twitter using
               | their application tooling (as opposed to some of their
               | data infrastructure, certainly the area where Scala sees
               | the most use today with Spark etc).
               | 
               | TLDR if you want to see production-quality Scala code
               | that this very second is serving 40k chess games -- and
               | mostly bullet/blitz where ms latency is of course crucial
               | -- definitely take a look.
               | 
               | Not as much hype for the language at the moment over Rust
               | or Kotlin, say, but it remains my language of choice for
               | web backends by far.
        
         | redbell wrote:
         | > so you, as our beneficiaries and stakeholders, who support us
         | and encourage us -- _deserve to get clarification on what
         | happened_
         | 
         | Is it that complicated for big tech to reply politely with the
         | above statement when they suddenly disable your account for no
         | obvious reason?
        
           | mewpmewp2 wrote:
           | It may not be complicated, but it does require caring about
           | what you do and your customers as opposed to going through
           | basic minimum requirements to appear that you are doing
           | something.
           | 
           | It is much more difficult for corporate cogs to have that
           | level of care compared to someone who does their things with
           | passion.
        
         | morgante wrote:
         | The post-mortem is honest, but the infrastructure is well below
         | what I'd expect from commercial services.
         | 
         | If a commercial provider told me they're dependent on a single
         | physical server, with no real path or plans to fail over to
         | another server if they need to, I would consider it extremely
         | negligent.
         | 
         | It's fine to not use big cloud providers, but frankly it's
         | pretty incompetent to not have the ability to quickly deploy to
         | a new server.
        
           | lukhas wrote:
           | We're an understaffed charity.
        
             | morgante wrote:
             | Yeah I'm not criticizing it as a charity, just pointing out
             | this definitely isn't "superior to most commercial
             | services."
             | 
             | That being said, removing dependence on single hardware
             | nodes isn't something you need a big team for. I've done
             | failover at 1-person startups.
        
           | KolmogorovComp wrote:
           | And yet even Meta recently had a multi-hour downtime,
           | despite a budget thousands if not millions of times higher. Would
           | you call them negligent too?
           | 
           | By increasing the complexity you multiply the failure points
           | and increase ongoing maintenance, which is the bottleneck
           | (even more than money) for volunteer-driven projects.
        
             | morgante wrote:
             | To be clear, you don't need to make it more complex /
             | failure-prone. I didn't say failover needs to be automated.
             | 
             | Kubernetes or complex cloud services are not required to
             | have some basic deployment automation.
             | 
             | You can do it with a simple bash script if you need to.
             | It's just pretty surprising to see the reaction to a
             | hardware failure being to wait around for it to be repaired
             | instead of simply spinning up a new host.
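
As a rough illustration of the "basic deployment automation" described in the comment above, here is a minimal sketch of a manual failover plan. It is written in Python rather than bash, and only prints the commands instead of running them. Everything specific is an assumption made for illustration: the host name `standby.example.org`, the release directory, and the service name are hypothetical and say nothing about how Lichess actually deploys.

```python
import shlex


def failover_plan(new_host, release_dir="/opt/app/release", service="app"):
    """Ordered shell commands for a manual failover.

    All arguments are hypothetical placeholders; nothing here is executed.
    """
    return [
        # Copy the current release to the replacement host.
        ["rsync", "-a", "--delete", f"{release_dir}/", f"{new_host}:{release_dir}/"],
        # Restart the service on the replacement host.
        ["ssh", new_host, "sudo", "systemctl", "restart", service],
        # Check that the service came up before traffic is pointed at it.
        ["ssh", new_host, "systemctl", "is-active", service],
    ]


def render(plan):
    """Turn each command into a shell-safe string for review (dry run)."""
    return [shlex.join(cmd) for cmd in plan]


if __name__ == "__main__":
    for line in render(failover_plan("standby.example.org")):
        print(line)
```

Keeping the plan as data and printing it first matches the point that failover need not be automated: an operator can read the steps, then run them by hand. Data replication and repointing traffic are deliberately left out of this sketch.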
        
       | ctippett wrote:
       | Once the private link was reestablished, could they not have
       | tunneled out to the internet via another server acting as a sort
       | of gateway?
       | 
       | Disclaimer: I'm not a network engineer so I may be
       | misunderstanding the practicality and complexity of such a
       | workaround.
        
       | theideaofcoffee wrote:
       | I guess some of my questions are addressed in the latter half of
       | the post, but I'm still puzzled why a prominent service didn't
       | have a plan for what looked like a run of the mill hardware
       | outage. It's hard to know exactly what happened as I'm having
       | trouble parsing some of the post (what is a 'network connector'?
       | is it a cable? nic?). What were some of the 'increasingly
       | outlandish' workarounds? Are they actually standing up production
       | hosts manually, and was that the cause of a delay or
       | unwillingness to get new hardware going? I think it would be
       | important to have all of that set down either in documentation or
       | code seeing as most of their technical staff are either
       | volunteers, who may come and go, or part timers. Maybe they did,
       | it's not clear.
       | 
       | It's also weird seeing that they are still waiting on their
       | provider to tell them exactly what was done to the hardware to
       | get it going again, that's usually one of the first things a tech
       | mentions: "ok, we replaced the optics in port 1" or "I replaced
       | that cable after seeing increased error rates", something like
       | that.
        
       | lazyant wrote:
       | summary for the lazy: OVH
        
       ___________________________________________________________________
       (page generated 2024-09-19 23:01 UTC)