[HN Gopher] Discovering a JDK Race Condition, and Debugging It i...
       ___________________________________________________________________
        
       Discovering a JDK Race Condition, and Debugging It in 30 Minutes
       with Fray
        
       Author : aoli-al
       Score  : 135 points
       Date   : 2025-06-07 19:01 UTC (1 days ago)
        
 (HTM) web link (aoli.al)
 (TXT) w3m dump (aoli.al)
        
       | latchkey wrote:
       | Maybe it is just me, but I can't read the text in the code
       | because the font is nearly white on white.
        
         | masklinn wrote:
         | The light mode is fine, but you're right the dark mode is truly
         | awful, the code blocks are unreadable.
         | 
         | edit: for some reason the author overrode the background color
         | on code blocks via an inline style of
         | background-color:#f0f0f0
         | 
         | from var(--code-background-color) = #f2f2f2
         | 
         | to make the background nigh imperceptibly darker, but then
         | while the stylesheet properly switches the background to
         | #01242e in dark mode, the inline override stays and blows it
         | to bits.
         | 
         | Not that it's amazing if you remove the inline style, on account
         | of operators and method names being styled pretty dark (#666
         | and #4070a0).
        
           | aoli-al wrote:
           | Thanks for pointing it out! Just did a quick fix using Claude
           | :)
        
             | malcolmgreaves wrote:
             | On mobile (Safari), the lines in the code blocks have
             | different font sizes. They also have different fonts. Some
             | are like 3-4x the size of other lines. No idea what could
             | be going wrong, but it does unfortunately make the code
             | blocks difficult to follow along.
        
               | aoli-al wrote:
               | should be fixed as well :)
        
               | NooneAtAll3 wrote:
               | any chance you can make light/dark mode switch a UI
               | button?
        
               | masklinn wrote:
               | On desktop I'd suggest installing an extension that adds
               | a toggle (they exist for Firefox and chrome at least):
               | adding a toggle manually is a bit of a chore, especially
               | if the css system you use does not build that in.
        
       | MaxBarraclough wrote:
       | Neat to see sleep calls artificially introduced to reliably
       | recreate the deadlock. [0]
       | 
       | Looks like fixing the underlying bug is still in-progress, [1] I
       | wonder how many lines of code it will take.
       | 
       | [0] https://github.com/aoli-al/jdk/commit/625420ba82d2b0ebac24d9...
       | 
       | [1] https://bugs.openjdk.org/browse/JDK-8358601
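
The idea of pausing threads at chosen points to force a bad interleaving can be sketched in a few lines. This is a generic illustration, not the code from the linked commit: the class and lock names are hypothetical, and a CountDownLatch plays the role of the artificial pause (a Thread.sleep at the same spot would widen the window in a similar way, only less reliably).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; none of these names come from the JDK bug or the linked commit.
class ForcedDeadlockDemo {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    // Starts two daemon threads that take the two locks in opposite order and
    // returns true once the JVM reports a monitor deadlock.
    static boolean reproduceDeadlock() {
        // Both threads count down after taking their first lock, so neither
        // requests its second lock until the other already holds it.
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread t1 = new Thread(() -> lockInOrder(LOCK_A, LOCK_B, bothHoldFirstLock));
        Thread t2 = new Thread(() -> lockInOrder(LOCK_B, LOCK_A, bothHoldFirstLock));
        t1.setDaemon(true); // daemon threads let the JVM exit despite the deadlock
        t2.setDaemon(true);
        t1.start();
        t2.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + 5_000_000_000L; // give up after 5 seconds
        while (System.nanoTime() < deadline) {
            long[] deadlocked = threads.findMonitorDeadlockedThreads(); // null if none
            if (deadlocked != null && deadlocked.length >= 2) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // the artificial pause: wait until the other thread holds its lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // never reached: each thread waits for the lock the other holds
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlock reproduced: " + reproduceDeadlock());
    }
}
```

Without the pause, the two threads would usually run one after the other and the deadlock would show up only rarely, which is what makes such bugs hard to reproduce by plain re-running.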
        
         | trhway wrote:
         | without reworking of the code all these checks of the executor
         | and queue state and queue manipulations have to be under a
         | mutex, and that is just a few lines.
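
As a rough illustration of the pattern described here (state check and queue manipulation under one mutex), consider the toy queue below. It is a hypothetical sketch, not the JDK executor code and not the proposed fix; GuardedTaskQueue and its methods are invented for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical toy queue, not the JDK executor code discussed in the bug report.
class GuardedTaskQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final Queue<Runnable> tasks = new ArrayDeque<>();
    private boolean shutdown = false; // only read or written while holding the lock

    // The state check and the enqueue happen under the same lock, so a task
    // can never slip into the queue after shutdownNow() has drained it.
    boolean submit(Runnable task) {
        lock.lock();
        try {
            if (shutdown) {
                return false;
            }
            tasks.add(task);
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Flips the state and drains the queue as one step with respect to submit().
    List<Runnable> shutdownNow() {
        lock.lock();
        try {
            shutdown = true;
            List<Runnable> drained = new ArrayList<>(tasks);
            tasks.clear();
            return drained;
        } finally {
            lock.unlock();
        }
    }

    int pending() {
        lock.lock();
        try {
            return tasks.size();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        GuardedTaskQueue queue = new GuardedTaskQueue();
        System.out.println("accepted before shutdown: " + queue.submit(() -> { }));
        System.out.println("drained: " + queue.shutdownNow().size());
        System.out.println("accepted after shutdown: " + queue.submit(() -> { }));
    }
}
```

The cost of this approach is that every submit and shutdown serializes on one lock, which is presumably why highly tuned executors avoid it.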
        
         | brabel wrote:
         | Bugs like these are pervasive in languages like Java that give
         | no protection against even the most basic race condition
         | causes. It's nearly impossible to write reliable concurrent
         | code. Fray only helps if you actually use it to test
         | everything, which is not realistic. I am convinced, after my
         | last year-long struggle to get a highly concurrent Java
         | (actually Kotlin, but Kotlin does not add much to help) module
         | working at work, that we should only use languages that provide safe
         | concurrency models, like Erlang/Elixir and Rust, or actor-like
         | like Dart and JavaScript, where concurrency is required.
        
           | gf000 wrote:
           | What _is_ a safe concurrency model? Like, actors can
           | trivially deadlock /livelock, they are no panacea at all, and
           | are trivial to recreate (there are a million java
           | implementations)
           | 
           | You make it sound like there is some modern development
           | superseding what java has, but that's absolutely not the
           | case.
           | 
           | Like even rust is just pretty much a no-overhead
           | `synchronized` on top of an object. It _is_ necessary there,
           | because data races are a fundamental memory safety issue, but
           | Java is immune to that (it has "safe" data races). Logical
           | bugs can trivially happen in either case - as an easy example
           | even if all your fields are atomically mutated, the whole
           | object may not make sense in certain states, like a date with
           | February the 31st. Rust does nothing against such, and
           | concurrent data structures have ample grounds for realistic
           | examples of the above.
        
             | mrkeen wrote:
             | > What is a safe concurrency model?
             | 
             | STM.
             | 
             | The terms 'atomic', 'thread-safe', and 'concurrent'
             | collections are thrown around too loosely for application
             | programmers IMO, for exactly your example above.
             | 
             | In other scenarios, 'atomics' refer to the ability to do
             | _one_ thing atomically. With STM, you can do two or more
             | things atomically.
             | 
             | Likewise with 'thread-safe'. Thread-safe seems to indicate
             | that the object won't break internally in the presence of
             | multiple threads, which is too low of a bar to clear if
             | your goal is to write an actually thread-safe application
             | out of so-called 'thread-safe' parts.
             | 
             | STM has _actual_ concurrent data structures, where you can
             | write straight-line code like 'if this collection has at
             | least 5 elements, then pop one'.
             | 
             | I don't think the Feb 31 example is that fair though,
             | because if you want to construct a representation of Feb
             | 31, who's going to stop you? And if you don't want to,
             | plain old static types is the solution.
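
The "at least 5 elements, then pop one" case is a compound operation: calling size() and then pop() on a thread-safe collection leaves a gap between the check and the action. Java's standard library does not ship an STM, so the sketch below uses a single intrinsic lock as a stand-in, which is an assumption of this example and not what the comment describes. AtLeastNStack is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical wrapper: makes "check the size, then pop" one indivisible step.
// A lock stands in for the transaction an STM would provide.
class AtLeastNStack<T> {
    private final Deque<T> items = new ArrayDeque<>();

    synchronized void push(T item) {
        items.push(item);
    }

    // Pops the top element only if at least minSize elements are present.
    // Because the method is synchronized, no other thread can change the
    // size between the check and the pop.
    synchronized Optional<T> popIfAtLeast(int minSize) {
        if (items.size() < minSize) {
            return Optional.empty();
        }
        return Optional.of(items.pop());
    }

    synchronized int size() {
        return items.size();
    }

    public static void main(String[] args) {
        AtLeastNStack<Integer> stack = new AtLeastNStack<>();
        for (int i = 1; i <= 5; i++) {
            stack.push(i);
        }
        System.out.println(stack.popIfAtLeast(5)); // five elements present, so one is popped
        System.out.println(stack.popIfAtLeast(5)); // only four remain, so nothing is popped
    }
}
```

Unlike a transaction, this only composes as long as every operation goes through the same lock; combining two such objects atomically needs extra coordination.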
        
               | gf000 wrote:
               | I couldn't give a better reply than this author:
               | 
               | https://joeduffyblog.com/2010/01/03/a-brief-retrospective-on...
               | 
               | Also, a phenomenal writing (as are his other posts) on
               | the whole concurrency landscape, see:
               | 
               | > A wondrous property of concurrent programming is the
               | sheer number and diversity of programming models
               | developed over the years. Actors, message-passing, data
               | parallel, auto-vectorization, ...; the titles roll off
               | the tongue, and yet none dominates and pervades. In fact,
               | concurrent programming is a multi-dimensional space with
               | a vast number of worthy points along its many axes.
        
               | mrkeen wrote:
               | I've read a few postmortems about STM. I have to take
               | them with a grain of salt because I usually read those
               | reports right after doing a bunch of STM programming, and
               | right before doing a bunch more STM programming. Reports
               | of its death have been greatly exaggerated.
               | 
               | Here it is in 2006 featuring the same Tim from your
               | article: https://www.youtube.com/watch?v=tve57vilywc
               | 
               | I didn't start using it in anger till 2013-2014 maybe?
               | But I don't recall any major differences between what the
               | video shows and how it works in 2025.
               | 
               | Anyway, postmortems usually boil down to two issues:
               | 
               | 1) That's not how programmers usually do it
               | 
               | 2) We couldn't pull it off
               | 
               | The most obvious explanation for 1 is 2. I, too, would be
               | disappointed by the low-adoption rates of my new
               | technology if I hadn't built it or released it to users.
               | 
               | But the article has some gems:
               | 
               | > Transactions unfortunately do not address one other
               | > issue, which turns out to be the most fundamental of all:
               | > sharing. Indeed, TM is insufficient - indeed, even
               | > dangerous - on its own because it makes it very easy to
               | > share data and access it from multiple threads;
               | 
               | I cannot read this charitably. This is _the only reason
               | for_, not _a damning reason against_. It's like doing
               | research & development on condoms, and then realising
               | it's a hopeless failure because they might be used for
               | dangerous activities like sex.
               | 
               | > I already mentioned a great virtue of transactions is
               | > their ability to nest. But I neglected to say how this
               | > works. And in fact when we began, we only recognized one
               | > form of nesting. You're in one atomic block and then
               | > enter into another one. What happens if that inner
               | > transaction commits or rolls back, before the fate of the
               | > outer transaction is known
               | 
               | You nest transactional _statements_, not the calls to
               | _atomic_. The happy-path for an _atomic_ is that it will
               | _commit_; it should be obvious a priori that something
               | that commits cannot be in the codepath that can be rolled
               | back.
               | 
               | > Then that same intern's casual statement pointing out an
               | > Earth-shattering flaw that would threaten the kind of TM
               | > we (and most of the industry at the time) were building.
               | > ...
               | >
               | > An update in-place system will allow that transaction to
               | > freely change the state of x. Of course, it will roll
               | > back here, because isItOwned changed to true. But by then
               | > it is too late: the other thread using x outside of a
               | > transaction will see constantly changing state - torn
               | > reads even - and who knows what will happen from there. A
               | > known flaw in any weakly atomic, update in-place TM.
               | >
               | > If this example appears contrived, it's not. It shows up
               | > in many circumstances.
               | 
               | I agree that it's not contrived. It's in the problem-
               | space of application writers. It's not a problem caused
               | by introducing STM. We want an STM system to allow safe
               | access to isItOwned & x, because it's a PITA to try to do
               | this with locks.
        
               | gf000 wrote:
               | `atomic` is their choice of syntax for an STM transaction
               | in their experimental C# runtime, it's _not_ an atomic
               | statement. Please take the time to actually read the
               | article, because you have obviously just skimmed over it.
               | This was not written by some nobody, he does know what he
               | talks about.
        
             | tialaramex wrote:
             | > the whole object may not make sense in certain states
             | 
             | "Make invalid states unrepresentable" - it's bad design
             | that February the 31st is a thing in your data structure
             | when that's invalid. You can't _always_ avoid this, but
             | it's appalling how bad most people's data structures are.
             | 
             | C's stdlib provides a tm structure in which day of the week
             | is stored in a _signed_ 32-bit integer. You know, for when
             | it's the negative two billionth day of the week...
        
               | nlitened wrote:
               | > "Make invalid states unrepresentable"
               | 
               | I think this phrase sounds good but is not applicable to
               | systems that touch messy reality.
               | 
               | For example, I think it's not even possible to apply it
               | to the `tm` structure, as leap seconds are not known in
               | advance.
        
               | tialaramex wrote:
               | I agree that messy reality can intervene; in the medium
               | term (for about a decade) we'll need to handle leap
               | seconds.
               | 
               | But we can do a _lot_ without challenging the messy
               | reality. 61 second minutes are (regrettably) a thing in
               | some time systems, but negative 1 million second minutes
               | are not a thing, there's no need for this to be a signed
               | integer!
        
               | kbolino wrote:
               | The struct is also used for date/time arithmetic and the
               | standard library explicitly supports out-of-range values
               | for this reason.
        
               | tialaramex wrote:
               | I have no doubt that C "explicitly supports" this, but
               | it's a bad idea.
               | 
               | The C standard library has the excuse that most of it is
               | very old. We should do better.
        
               | kbolino wrote:
               | Better for whom? If you want a dead-simple time type, use
               | time_t.
               | 
               | There are plenty of improvements needed in the C time
               | APIs, like sub-second precision, thread safety, and
               | timezone awareness. What benefit is there to making the
               | struct fields unsigned beyond some arbitrary purity test?
               | This is still C, there are still plenty of ways to make
               | invalid values. And it is nice to be able to subtract as
               | well as add.
               | 
               | Heck, there's no way to encode the full Gregorian
               | Calendar rules in the type system of any language I've
               | ever used, so every choice is going to be a compromise.
               | February 29 Not-A-Leap-Year and April 31 are still
               | invalid dates even if you can outlaw January 0 and March
               | 32.
               | 
               | Making all the fields in struct tm signed ints is clearly
               | there to allow them to be manipulated and consistently
               | so, since other types would obviously be better for size
               | if nothing else.
        
               | gf000 wrote:
               | This is more of a toy example for how a set of atomic
               | changes can still end up in an inconsistent state, e.g.
               | setting January the 31st and February 3rd in quick
               | succession from two or more different threads may result
               | in Feb 31st being visible from a third thread. This is
               | _not_ solved by Rust and your struct will even get the
               | Sync trait automatically, which may not be applicable
               | as in this case.
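
The interleaving described here can be replayed by hand. The sketch below is a hypothetical Java rendering with two AtomicInteger fields; it runs the writers' steps sequentially in one thread, in the order a bad schedule could produce, so the outcome is deterministic. With real threads the same state would be visible only occasionally.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical type: every field update is atomic, yet the pair can be
// observed in a state that neither writer intended.
class TornDate {
    final AtomicInteger month = new AtomicInteger(1);
    final AtomicInteger day = new AtomicInteger(1);

    void setMonth(int m) { month.set(m); }
    void setDay(int d) { day.set(d); }

    String read() { return month.get() + "/" + day.get(); }

    // Replays one possible interleaving step by step:
    // writer A wants January 31, writer B wants February 3.
    static String replayBadInterleaving() {
        TornDate date = new TornDate();
        date.setMonth(1);  // A: month = January
        date.setDay(31);   // A: day = 31          (January 31, valid)
        date.setMonth(2);  // B: month = February  (February 31 is now visible)
        String seenByReader = date.read(); // a third thread reading at this moment
        date.setDay(3);    // B: day = 3           (February 3, valid again)
        return seenByReader;
    }

    public static void main(String[] args) {
        System.out.println("reader saw: " + replayBadInterleaving());
    }
}
```

Here replayBadInterleaving() returns "2/31": no individual store was torn, but the reader still saw a date that no writer ever set.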
        
               | brabel wrote:
               | Given your example, I am convinced you've never written
               | any Rust. Of course it does stop you doing shit like
               | that. But in this example, even Java does it properly,
               | since the constructor runs to completion before any
               | Object is accessible to any Thread, not just the one
               | creating it. You need to validate the state of the object
               | in the constructor to prevent that, but TBH why are we
               | talking about this, it's almost completely unrelated to
               | concurrency models.
        
               | gf000 wrote:
               | Of course if you are creating a new object and you have
               | an atomic handle to it, it is trivial to solve. Like,
               | having immutable objects solves a _lot_ of these
               | problems.
               | 
               | But what I'm quite obviously talking about is a Rust
               | struct with 3 atomic fields. Just because I can safely
               | race on any of its fields, doesn't mean that the whole
               | struct can safely be shared, yet it will be inferred to
               | be Sync.
        
               | tialaramex wrote:
               | Object mutability isn't relevant here. A Date type which
               | is mutable can ensure that all mutations are valid, it
               | just can't do so while retaining this clumsy "LOL I'm
               | just a D-M-Y tuple" API.
               | 
               | We can see immediately that your type is broken because
               | it allows us to directly set the date to February 31st,
               | there's no concurrency bug needed, the type was always
               | defective.
        
               | gf000 wrote:
               | void setDate(int month, int day) {
               |     if (notValidDate(month, day)) { throw; }
               |     this.month = month; // atomic
               |     this.day = day;     // atomic
               | }
               | 
               | Yet the whole function is not
               | "atomic"/transactional/consistent, and two threads
               | running simultaneously may surface the above error.
               | 
               | Of course it can ensure that it is consistent, C code can
               | also just ensure that it is memory safe. This is just not
               | an inherent property, and in general you _will_ mess it
               | up.
               | 
               | The only difference is that we can reliably solve memory
               | safety issues (GC, Rust's ownership model), but we have
               | absolutely no way to solve concurrency issues in any
               | model. The only solution is.. having a single thread.
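
For this one type, the "it can ensure that it is consistent" case might look like the sketch below: validate first, then publish month and day together through a single atomic reference to an immutable value. This is an illustration under assumptions (ConsistentDate is a hypothetical class, and java.time.MonthDay stands in for the validated pair); it does not address the broader claim about concurrency in general.

```java
import java.time.DateTimeException;
import java.time.MonthDay;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical type: month and day are replaced together, never one at a time.
class ConsistentDate {
    private final AtomicReference<MonthDay> value =
            new AtomicReference<>(MonthDay.of(1, 1));

    // MonthDay.of rejects impossible combinations such as February 31 by
    // throwing DateTimeException, so nothing invalid is ever published.
    // Readers see either the old pair or the new pair, never a mix.
    void setDate(int month, int day) {
        MonthDay candidate = MonthDay.of(month, day);
        value.set(candidate);
    }

    MonthDay get() {
        return value.get();
    }

    public static void main(String[] args) {
        ConsistentDate date = new ConsistentDate();
        date.setDate(1, 31);
        date.setDate(2, 3);
        try {
            date.setDate(2, 31);
        } catch (DateTimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("current value: " + date.get());
    }
}
```

The design choice is the one mentioned earlier in the thread: an immutable value behind one atomic handle. It works because the invariant lives inside a single object; invariants that span several objects need a different mechanism.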
        
               | tialaramex wrote:
               | But you were critiquing Rust's model, yet you've written
               | C++ here. I agree it's perfectly easy to write the bug in
               | C++.
               | 
               | In Rust this improved type doesn't have the defect, to
               | call Rust's analogue of your setDate function you must
               | have the exclusive mutable reference, which means there's
               | no concurrency problem.
               | 
               | You have to do a whole lot of extra work to write the bug
               | and why would you, just write what you meant and it
               | behaves correctly.
        
               | gf000 wrote:
               | It's called pseudo-code, and some extra attempt on your
               | part to deliberately miss the point.
               | 
               | Give it another go at understanding what I'm saying,
               | cheers!
        
             | brabel wrote:
             | > Like, actors can trivially deadlock/livelock,
             | 
             | Oh my ... you've never seen a proper Actor language, have you?
             | 
             | Have a look at Erlang and Pony, for starters. It will open
             | your mind.
             | 
             | This in particular is great:
             | https://www.ponylang.io/discover/what-makes-pony-different/#...
             | 
             | > Pony doesn't have locks nor atomic operations or anything
             | like that. Instead, the type system ensures at compile time
             | that your concurrent program can never have data races. So
             | you can write highly concurrent code and never get it
             | wrong.
             | 
             | This is what I am talking about.
             | 
             | > You make it sound like there is some modern development
             | superseding what java has, but that's absolutely not the
             | case.
             | 
             | Both Actor-model languages and Rust (through a surprisingly
             | different path: tracking aliases and lifetimes) do
             | something that's impossible in Java (and most languages):
             | prevent data races due to improper locking (as mentioned
             | above, if your language even has locks and it doesn't make
             | them safe like Rust does, you know you're going to have a
             | really hard time. actor-languages just eliminate locks, and
             | "manual concurrency", completely). Other kinds of races are
             | still possible, but preventing data races goes a very, very
             | long way to making concurrency safe and easy.
        
               | gf000 wrote:
               | Is preventing data races (which is not particularly
               | hard if you are willing to give up certain properties,
               | e.g. just immutability alone solves it) that much of a
               | win?
               | 
               | You just made a bunch of concurrent algorithms un-
               | implementable that would give much better performance for
               | the benefit of.. having all the other unsolvable issues
               | with concurrency? Like, all the same issues are trivially
               | reproducible at a higher level, with loops within actors'
               | communication that only appear under certain, very
               | dynamic conditions, or a bunch of message passing ending
               | up in an inconsistent state, just not on an "object"
               | level, but on a "group of object" level.
        
               | jbritton wrote:
               | Perhaps there is some confusion here between data races
               | and race conditions. Rust and Pony prevent data races,
               | but not race conditions.
        
           | rand_r wrote:
           | Race conditions are generally solved with algorithms, not the
           | language. For example, defining a total ordering on locks and
           | only acquiring locks in that order to prevent deadlock.
           | 
           | I guess there are language features like
           | co-routines/co-operative multi-tasking that make certain
           | algorithms possible, but nothing about Java prevents
           | implementing sound concurrency algorithms in general.
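
The lock-ordering rule mentioned above can be sketched as follows. The names (OrderedLocking, Account, the numeric id used as the ordering key) are hypothetical and chosen for this example; the point is only that both threads acquire the two locks in the same global order, whichever direction the transfer goes.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example of deadlock avoidance through a total order on locks.
class OrderedLocking {
    static final class Account {
        final int id; // unique id, used as the global lock order
        final ReentrantLock lock = new ReentrantLock();
        long balance; // only touched while holding lock

        Account(int id, long balance) {
            this.id = id;
            this.balance = balance;
        }
    }

    // Always locks the account with the smaller id first, so two opposing
    // transfers can never each hold one lock while waiting for the other.
    static void transfer(Account from, Account to, long amount) {
        if (from.id == to.id) {
            throw new IllegalArgumentException("cannot transfer to the same account");
        }
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    // Runs transfers in both directions concurrently and returns the final balances.
    static long[] runOpposingTransfers(int rounds) {
        Account a = new Account(1, rounds);
        Account b = new Account(2, rounds);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) transfer(a, b, 1);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) transfer(b, a, 1);
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new long[] { a.balance, b.balance };
    }

    public static void main(String[] args) {
        long[] balances = runOpposingTransfers(1000);
        System.out.println(balances[0] + " " + balances[1]);
    }
}
```

If transfer() instead locked `from` before `to`, the two threads above could each take one lock and wait forever for the other. Note that the ordering is a convention the code has to follow everywhere; nothing in the language enforces it.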
        
             | mrkeen wrote:
             | > Race conditions are generally solved with algorithms, not
             | the language. For example, defining a total ordering on
             | locks
             | 
             | You wouldn't make that claim if your language didn't have
             | locks.
        
               | brabel wrote:
               | Exactly, this thread is full of ignorant comments. I was
               | talking about a certain class of race conditions that can
               | be completely prevented in some languages, like Rust
               | (through its aliasing rules that just make it impossible
               | to mutate things from different threads simultaneously,
               | among other things) and languages like Pony, for example,
               | as the language uses the Actor model for concurrency,
               | which means it has no locks at all (it doesn't need
               | them), though I mentioned Dart because Dart Isolates look
               | a lot like Actors (they are single-threaded but can send
               | messages and receive messages from other "actors",
               | similarly to JS workers).
        
       | TYMorningCoffee wrote:
       | Impressive! Can't wait to try Fray out at work.
        
       | exabrial wrote:
       | > Fray is a concurrency testing tool for Java that can help you
       | find and debug tricky race conditions that manifest as assertion
       | violations, run-time exceptions, or deadlocks. It performs
       | controlled concurrency testing using state-of-the-art techniques
       | such as probabilistic concurrency testing or partial order
       | sampling.
       | 
       | > Fray also provides deterministic replay capabilities for
       | debugging specific thread interleavings. Fray is designed to be
       | easy to use and can be integrated into existing testing
       | frameworks.
       | 
       | I wish I had this 20 years ago.
        
       | delusional wrote:
       | You appear to be one of the authors, so forgive me asking a
       | technical question.
       | 
       | In the technical paper, Section 5.4 you mention that Kotlin has
       | non-determinism in the scheduler. Where does this non-determinism
       | come from?
       | 
       | It seems unclear to me why Kotlin would inject randomness here,
       | and I suspect that you may actually have identified a false
       | positive in the Lincheck DSL.
        
         | aoli-al wrote:
         | The "randomness" comes from Kotlin coroutines and user-space
         | scheduling. For example, Kotlin runs multiple user-space
         | threads on the same physical thread. Fray only reschedules
         | physical threads. So when the applications under test use
         | coroutines/virtual threads, Fray cannot generate certain thread
         | interleavings. Also, it cannot deterministically replay because
         | the thread execution is no longer controlled by Fray.
         | 
         | In our paper, we found that Fray suffers from false negatives
         | because of this missing feature. Lincheck supports Kotlin
         | coroutines so it finds one more bug than Fray in LC-Bench.
         | 
         | We didn't make any claims about false positives in Lincheck.
        
           | delusional wrote:
           | > We didn't make any claims about false positives in
           | Lincheck.
           | 
           | To be clear, I made that claim :) I agree that the paper
           | makes no such claim.
        
       | AugustoCAS wrote:
       | [posted this in another thread, but maybe the author can clarify
       | this]
       | 
       | I wonder how this works when one runs tests in parallel (something
       | I always enable in any project). By this I mean configuring JUnit
       | to run as many tests as cores are available to speed up the run
       | of the whole test suite.
       | 
       | I took a peek at the code and I have the impression it doesn't
       | work that well as it hooks into when a thread is started. Also,
       | I'm not sure if this works with fibers.
        
         | aoli-al wrote:
         | Yes, Fray controls all application threads so it runs one test
         | per JVM. But you can always use multiple JVMs to run multiple
         | tests[1].
         | 
         | Fray currently does not support virtual threads. We do have an
         | open issue tracking it, but it is low priority.
         | 
         | [1]:
         | https://docs.gradle.org/current/userguide/java_testing.html#...
        
       ___________________________________________________________________
       (page generated 2025-06-08 23:01 UTC)