# Finding Java Thread Leaks With JDK Flight Recorder and a Bit Of SQL

Posted at Feb 28, 2023

The other day at work, we had a situation where we suspected a thread leak in one particular service, i.e. code which continuously starts new threads without ever taking care of stopping them again. Each thread requires a bit of memory for its stack space, so starting an unbounded number of threads can be considered a form of memory leak, eventually causing your application to run out of memory. In addition, the more threads there are, the more overhead the operating system incurs for scheduling them, until the scheduler itself consumes most of the available CPU resources. It is thus vital to detect and fix this kind of problem early on.

The usual starting point for analyzing a suspected thread leak is taking a thread dump, for instance using the jstack CLI tool or via JDK Mission Control; if there's an unexpectedly large number of threads (oftentimes with similar or even identical names), then it's very likely that something is wrong indeed. But a thread dump by itself is only a snapshot of the thread state at a given time, i.e. it doesn't tell you how the thread count changes over time (perhaps there are many threads which are started but also stopped again?), and it also doesn't provide you with information about the cause, i.e. which part of your application is starting all those threads. Does it happen in your own code base, or within some 3rd party dependency? While the thread names and stacks in the dump can give you some idea, that information isn't necessarily enough for a conclusive root cause analysis.

Luckily, Java's built-in event recorder and performance analysis tool, JDK Flight Recorder (JFR), exposes all the data you need to identify thread leaks and their cause. So let's take a look at the details, bidding farewell to those pesky thread leaks once and for all!

The first JFR event type to look at is jdk.JavaThreadStatistics: recorded every second by default, it keeps track of active, accumulated, and peak thread counts. Here is a JFR recording from a simple thread leak demo application I've created (newest events at the top):

[Image: `jdk.JavaThreadStatistics` Events in JDK Mission Control]

The number of active threads is continuously increasing, never going back down again -- pretty clear that this is a thread leak. Now let's figure out where exactly all those threads are coming from. For this, two other JFR event types come in handy: jdk.ThreadStart and jdk.ThreadEnd. The former captures all the relevant information when a thread is started: time stamp, name of the new thread and the parent thread, and the stack trace of the parent thread when starting the child thread. The latter event type is recorded when a thread finishes. If we find many thread start events originating at the same code location without a corresponding end event (correlated via the thread id contained in the events), this is very likely a source of a thread leak.
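If you wanted to, you could do this correlation by hand, for instance with the JDK's jdk.jfr.consumer.RecordingFile API. Here is a minimal sketch (the class name is made up for illustration; the recording file name is the one used with JFR Analytics below) which counts the threads that were started during the recording but never ended:

```java
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class LeakedThreadCounter {

    public static void main(String[] args) throws Exception {
        Set<Long> started = new HashSet<>();
        Set<Long> ended = new HashSet<>();

        // correlate jdk.ThreadStart and jdk.ThreadEnd events via the thread id
        for (RecordedEvent event : RecordingFile.readAllEvents(Path.of("thread_leak_recording.jfr"))) {
            String type = event.getEventType().getName();
            if ("jdk.ThreadStart".equals(type)) {
                started.add(event.getThread("eventThread").getJavaThreadId());
            }
            else if ("jdk.ThreadEnd".equals(type)) {
                ended.add(event.getThread("eventThread").getJavaThreadId());
            }
        }

        started.removeAll(ended);
        System.out.println(started.size() + " thread(s) were started but never ended");
    }
}
```

This works, but it quickly becomes tedious as soon as you also want to group the leaked threads by the stack trace which started them, or relate them to their parent threads.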
This sort of event analysis is a perfect use case for JFR Analytics, a tool which allows you to analyze JFR recordings using standard SQL (leveraging Apache Calcite under the hood). In JFR Analytics, each event type is represented by its own "table". Finding thread start events without matching end events is as simple as running a LEFT JOIN on the two event types and keeping only those start events which don't have a join partner. So let's load the recording into the SQLLine command line client (see the README of JFR Analytics for instructions on building and launching this tool):

```
!connect jdbc:calcite:schemaFactory=org.moditect.jfranalytics.JfrSchemaFactory;schema.file=thread_leak_recording.jfr dummy dummy

!outputformat vertical
```

Run the following SQL query for finding thread start events without corresponding thread end events:

```sql
SELECT
  ts."startTime",
  ts."parentThread"."javaName" AS "parentThread",
  ts."eventThread"."javaName" AS "newThread",
  TRUNCATE_STACKTRACE(ts."stackTrace", 20) AS "stackTrace"
FROM "jdk.ThreadStart" ts LEFT JOIN "jdk.ThreadEnd" te
  ON ts."eventThread"."javaThreadId" = te."eventThread"."javaThreadId"
WHERE te."startTime" IS NULL;
```

Note how the parentThread and eventThread columns are of a complex SQL type, allowing you to refer to thread properties such as javaName or javaThreadId using dot notation. In that simple example recording, there's one stack trace which dominates the result set, so looking at any of the events reveals the culprit:

```
startTime   2023-02-26 11:36:04.284
javaName    executor-thread-0
javaName    pool-1060-thread-1
stackTrace  java.lang.System$2.start(Thread, ThreadContainer):2528
            jdk.internal.vm.SharedThreadContainer.start(Thread):160
            java.util.concurrent.ThreadPoolExecutor.addWorker(Runnable, boolean):953
            java.util.concurrent.ThreadPoolExecutor.execute(Runnable):1364
            java.util.concurrent.AbstractExecutorService.submit(Callable):145
            java.util.concurrent.Executors$DelegatedExecutorService.submit(Callable):791
            org.acme.GreetingResource.hello():18
            null
            null
            null
            null
            jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Object, Object[]):104
            java.lang.reflect.Method.invoke(Object, Object[]):578
            org.jboss.resteasy.core.MethodInjectorImpl.invoke(HttpRequest, HttpResponse, Object, Object[]):170
            org.jboss.resteasy.core.MethodInjectorImpl.invoke(HttpRequest, HttpResponse, Object):130
            org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(HttpRequest, HttpResponse, Object):660
            org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(HttpRequest, HttpResponse, Object):524
            org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(HttpRequest, HttpResponse, Object):474
            null
            org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter():364
```

The call for creating a new thread apparently is initiated by the GreetingResource::hello() method, by submitting a Callable to an executor service. And sure enough, this is what it looks like:

```java
@GET
@Produces(MediaType.TEXT_PLAIN)
public String hello() {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.submit(() -> {
        while (true) {
            Thread.sleep(1000L);
        }
    });
    return "Hello World";
}
```
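One possible fix -- just a sketch, assuming the intent simply was to run some background work per request -- is to reuse a single executor across requests instead of leaking a new one per invocation, and to shut it down when the resource is disposed (the @PreDestroy callback assumes the resource is a container-managed CDI bean):

```java
// inside the GreetingResource class
private final ExecutorService executor = Executors.newSingleThreadExecutor();

@GET
@Produces(MediaType.TEXT_PLAIN)
public String hello() {
    executor.submit(() -> {
        // do the actual background work here; make sure it terminates
    });
    return "Hello World";
}

@PreDestroy
void shutDown() {
    executor.shutdown();
}
```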
If things are not as clear-cut as in that contrived example, it can be useful to truncate stack traces to a reasonable line count (e.g. it should be safe to assume that the user code starting a thread is never further away than ten frames from the actual thread start call) and group by that. JFR Analytics provides the built-in function TRUNCATE_STACKTRACE for this purpose:

```sql
SELECT
  TRUNCATE_STACKTRACE(ts."stackTrace", 10) AS "stackTrace",
  COUNT(1) AS "threadCount"
FROM "jdk.ThreadStart" ts LEFT JOIN "jdk.ThreadEnd" te
  ON ts."eventThread"."javaThreadId" = te."eventThread"."javaThreadId"
WHERE te."startTime" IS NULL
GROUP BY
  TRUNCATE_STACKTRACE(ts."stackTrace", 10)
ORDER BY "threadCount" DESC;
```

This points at the problematic stack traces and code locations in a very pronounced way (output slightly adjusted for better readability):

```
stackTrace   java.lang.System$2.start(Thread, ThreadContainer):2528
             jdk.internal.vm.SharedThreadContainer.start(Thread):160
             java.util.concurrent.ThreadPoolExecutor.addWorker(Runnable, boolean):953
             java.util.concurrent.ThreadPoolExecutor.execute(Runnable):1364
             java.util.concurrent.AbstractExecutorService.submit(Callable):145
             java.util.concurrent.Executors$DelegatedExecutorService.submit(Callable):791
             org.acme.GreetingResource.hello():18
             null
             null
             null

threadCount  414
---
stackTrace   java.util.Timer.<init>(String, boolean):188
             jdk.jfr.internal.PlatformRecorder.lambda$createTimer$0(List):101
             null
             java.lang.Thread.run():1589

threadCount  1
```

Sometimes you may encounter a situation where new threads are started from within other threads in a 3rd party dependency, rather than directly from threads within your own code base. In that case, the stack traces of the thread start events may not tell you enough about the root cause of the problem, i.e. where those other "intermediary" threads are coming from and how they relate to your own code. To dig into the details here, you can leverage the fact that each jdk.ThreadStart event contains information about the parent thread which started a new thread.
So you can join the jdk.ThreadStart table to itself on the parent thread's id, also fetching the stack traces of the code which started those parent threads:

```sql
SELECT
  ts."startTime",
  pts."parentThread"."javaName" AS "grandParentThread",
  ts."parentThread"."javaName" AS "parentThread",
  ts."eventThread"."javaName" AS "newThread",
  TRUNCATE_STACKTRACE(pts."stackTrace", 15) AS "parentStackTrace",
  TRUNCATE_STACKTRACE(ts."stackTrace", 15) AS "stackTrace"
FROM "jdk.ThreadStart" ts LEFT JOIN "jdk.ThreadEnd" te
  ON ts."eventThread"."javaThreadId" = te."eventThread"."javaThreadId"
JOIN "jdk.ThreadStart" pts
  ON ts."parentThread"."javaThreadId" = pts."eventThread"."javaThreadId"
WHERE te."startTime" IS NULL;
```

Here, stackTrace is the trace of a thread (named "pool-728-thread-1") of an external library, "greeting provider", which starts another (leaking) thread (named "pool-729-thread-1"), and parentStackTrace points to the code in our own application (thread name "executor-thread-0") which started that first thread:

```
startTime          2023-02-28 09:15:24.493
grandParentThread  executor-thread-0
parentThread       pool-728-thread-1
newThread          pool-729-thread-1
parentStackTrace   java.lang.System$2.start(Thread, ThreadContainer):2528
                   jdk.internal.vm.SharedThreadContainer.start(Thread):160
                   java.util.concurrent.ThreadPoolExecutor.addWorker(Runnable, boolean):953
                   java.util.concurrent.ThreadPoolExecutor.execute(Runnable):1364
                   java.util.concurrent.AbstractExecutorService.submit(Runnable):123
                   java.util.concurrent.Executors$DelegatedExecutorService.submit(Runnable):786
                   com.example.greeting.GreetingService.greet():20
                   com.example.greeting.GreetingService_ClientProxy.greet()
                   org.acme.GreetingResource.hello():20
                   null
                   null
                   null
                   null
                   jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Object, Object[]):104
                   java.lang.reflect.Method.invoke(Object, Object[]):578
---
stackTrace         java.lang.System$2.start(Thread, ThreadContainer):2528
                   jdk.internal.vm.SharedThreadContainer.start(Thread):160
                   java.util.concurrent.ThreadPoolExecutor.addWorker(Runnable, boolean):953
                   java.util.concurrent.ThreadPoolExecutor.execute(Runnable):1364
                   java.util.concurrent.AbstractExecutorService.submit(Callable):145
                   java.util.concurrent.Executors$DelegatedExecutorService.submit(Callable):791
                   com.example.greeting.GreetingProvider.createGreeting():13
                   com.example.greeting.GreetingProvider_ClientProxy.createGreeting()
                   com.example.greeting.GreetingService.lambda$greet$0(AtomicReference):21
                   null
                   java.util.concurrent.Executors$RunnableAdapter.call():577
                   java.util.concurrent.FutureTask.run():317
                   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker):1144
                   java.util.concurrent.ThreadPoolExecutor$Worker.run():642
                   java.lang.Thread.run():1589
```

If the thread hierarchy is even deeper, you could continue down that path and keep joining more and more parent threads until you've arrived at the application's main thread. I was hoping to leverage recursive query support in Calcite for this purpose, but as it turned out, support for this currently only exists in the Calcite RelBuilder API, while the RECURSIVE keyword is not supported in SQL queries yet.

Equipped with JDK Flight Recorder, JDK Mission Control, and JFR Analytics, identifying and fixing thread leaks in your Java application becomes a relatively simple task. The jdk.JavaThreadStatistics, jdk.ThreadStart, and jdk.ThreadEnd event types are enabled in the default JFR profile, which is meant for permanent usage in production. I.e. you can keep a size-capped continuous recording running all the time, dump it into a file whenever needed, and then start the analysis process as described above.
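Such a continuous recording could for instance be started from within the application itself, as in the following sketch (the same can be achieved with the -XX:StartFlightRecording JVM option or the jcmd tool); the size cap and file name are just example values:

```java
import java.nio.file.Path;

import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class ContinuousRecordingStarter {

    public static void main(String[] args) throws Exception {
        // the "default" profile is the low-overhead configuration meant for production use
        Configuration defaultProfile = Configuration.getConfiguration("default");

        Recording recording = new Recording(defaultProfile);
        recording.setMaxSize(100 * 1024 * 1024); // keep at most ~100 MB of recording data on disk
        recording.setToDisk(true);
        recording.start();

        // ... later, when a thread leak is suspected, dump the data gathered so far:
        recording.dump(Path.of("thread_leak_recording.jfr"));
    }
}
```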
Taking things a step further, you could also set up monitoring and alerting on the number of active threads, e.g. by exposing the jdk.JavaThreadStatistics event via a remote JFR event recording stream, allowing you to react in real time whenever the active thread count reaches an unexpectedly high level.
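As a sketch of what such monitoring could look like, here is an in-process variant based on the JFR event streaming API (the jdk.management.jfr.RemoteRecordingStream class offers the same programming model for connecting to a remote JVM); the threshold of 500 threads is an arbitrary example value:

```java
import java.time.Duration;

import jdk.jfr.consumer.RecordingStream;

public class ThreadCountMonitor {

    public static void main(String[] args) throws Exception {
        try (RecordingStream stream = new RecordingStream()) {
            stream.enable("jdk.JavaThreadStatistics").withPeriod(Duration.ofSeconds(1));
            stream.onEvent("jdk.JavaThreadStatistics", event -> {
                long activeCount = event.getLong("activeCount");
                if (activeCount > 500) {
                    // hook up your alerting mechanism of choice here
                    System.err.println("Suspiciously high number of active threads: " + activeCount);
                }
            });
            stream.start(); // blocks; use startAsync() for running in the background
        }
    }
}
```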