https://lichess.org/@/thibault/blog/lichess-on-scala3-help-needed/2bpotLb0 Accessibility: Enable blind mode [ ] lichess.org Playlichess.org Create a gameArena tournamentsSwiss tournamentsSimultaneous exhibitions Puzzles PuzzlesPuzzle DashboardPuzzle StreakPuzzle StormPuzzle Racer Learn Chess basicsPracticeCoordinatesStudyCoaches Watch Lichess TVCurrent gamesStreamersBroadcastsVideo library Community PlayersTeamsForumBlog Tools Analysis boardOpening explorerBoard editorImport gameAdvanced search [ ] Sign in Community blogsBlog topicsLiked blog postsLichess blog [display] Photo by Manuel on Unsplash Lichess on scala3 - help needed thibaultDec 5, 20226321,039 views Software Development A week after deploying Lichess rewritten with scala 3, I had to revert to the scala 2 version. I need help diagnosing a peculiar JVM behaviour. TLDR The problem Since its recent rewrite to scala3, Lichess is using way more CPU than usual, but the problem takes a while to kick in and has peculiar patterns. Available data * 30-days monitoring for context + Overview + JVM CPU/threads/pools + JVM memory + JVM GC (G1) * peak time monitoring for granularity (4pm-7pm) + Overview + JVM CPU/threads/pools + JVM memory + JVM GC (G1) * Thread dumps at peak time Raw: 1 2 3 4 5 * JVM GC stats dumped every 500ms, at peak time Context The core of Lichess is a monolith written in scala. It serves thousands of pages per second, plays thousands of chess moves per second, while doing a lot of other things. It's a big program running on a big server with lots of CPU and memory. Until now, lila was written in scala 2.13.10, and it was running well and performing well. We'll call it lila2. About a month ago, I started migrating it to scala 3.2.1 - we'll call it lila3. It was fun and I'll write about it... After problems are solved. On the 2022-11-22 at 7AM UTC, I deployed lila3 to production, and it worked quite well, and performed quite well. During the next days, things went smoothly as I deployed a new version every morning. Symptoms But when I let it run for more than 24h, the CPU usage started increasing dramatically. It became obvious that lila3 could not handle more than 48h of uptime. In comparison, lila2 could easily stay up for 2 consecutive weeks. [96qYcEO] lila2 CPU usage, each spike is 24h. Vertical red lines are deploys/ restarts. [WyBiag3] lila3 CPU usage, each spike is 24h. CPU usage increases on the second day of runtime. A closer look at CPU usage The CPU usage is not regular, even during the worst times. Instead, it will be normal for a minute or two; then be crazy for a minute or two. Then back to normal. [VjBM6KI] Close-up look. Around 20% is normal CPU usage. We see it spike up to 80% for extended periods - that is abnormal. Memory pressure / GC problem? I don't think the garbage collector is to blame. During the worst times, when the JVM almost maxed out 90 CPUs, the GC was only using 3s of CPU time per minute. Allocations and memory usage also look rather normal. What about exceptions or log messages? There are none relevant. We have extensive logging in lila and no new errors are popping up. No timeouts either. Help I had to revert to the scala2 version of Lichess. This is very unfortunate, as the gap between lila2 and lila3 is huge, and every day we spend with lila2 in production makes it bigger. At this point I don't know how to diagnose and fix the problem with lila3. If you know about large JVM deployments and have an idea about what could be causing this, please let me know. You can use the Lichess programming discord channel or our email address. Please focus on the problem at hand and its quick resolution. I'm not looking for long-term architecture discussions right now, we'll do these after production is healthy again. I will update this post with more data and info, as people ask for them. FAQ For monitoring data, thread dumps and GC logs, see the top of this post. Versions lila2 and lila3 are compiled with java 17, and the prod server runs the same JVM: # java -version openjdk version "17" 2021-09-14 OpenJDK Runtime Environment (build 17+35-2724) OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing) # uname -a Linux manta 5.4.0-107-generic #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux JVM args are the same as before: -Xms30g -Xmx30g -XX:+UseG1GC lila2 runs scala 2.13.10, lila3 runs scala 3.2.1. SBT dependencies can be found in the lila github repo Discuss this blog post in the forum View more blog posts by thibault Time traveling machine Starting from scratch If you had to start Lichess from scratch, what would you change?Aug 11, 2022Something vaguely resembling a stylized robot How to create a Lichess bot Well, it depends. On you.Oct 29, 2021two persons asking and answering questions How to ask technical questions Simple guidelines to get quick and actionable answersOct 1, 2021 [display] How I started building Lichess I get this question sometimes. How did you decide to make a chess server? The truth is, I didn't.Sep 8, 2021 Reconnecting