[HN Gopher] Discovering a JDK Race Condition, and Debugging It i...
___________________________________________________________________
Discovering a JDK Race Condition, and Debugging It in 30 Minutes
with Fray
Author : aoli-al
Score : 135 points
Date : 2025-06-07 19:01 UTC (1 days ago)
(HTM) web link (aoli.al)
(TXT) w3m dump (aoli.al)
| latchkey wrote:
| Maybe it is just me, but I can't read the text in the code
| because the font is nearly white on white.
| masklinn wrote:
| The light mode is fine, but you're right the dark mode is truly
| awful, the code blocks are unreadable.
|
| edit: for some reason the author overrode the background color
| on code blocks via an inline style of
| background-color:#f0f0f0
|
| from var(--code-background-color) = #f2f2f2
|
| to make the background nigh imperceptibly darker, but then
| while the stylesheet properly switches it to #01242e in dark
| mode the inline override stays and blows it to bits.
|
| Not that it's amazing if you remove the inline style, on account
| of operators and method names being styled pretty dark (#666
| and #4070a0).
| aoli-al wrote:
| Thanks for pointing it out! Just did a quick fix using Claude
| :)
| malcolmgreaves wrote:
| On mobile (Safari), the lines in the code blocks have
| different font sizes. They also have different fonts. Some
| are like 3-4x the size of other lines. No idea what could
| be going wrong, but it does unfortunately make the code
| blocks difficult to follow along.
| aoli-al wrote:
| should be fixed as well :)
| NooneAtAll3 wrote:
| any chance you can make light/dark mode switch a UI
| button?
| masklinn wrote:
| On desktop I'd suggest installing an extension that adds
| a toggle (they exist for Firefox and chrome at least):
| adding a toggle manually is a bit of a chore, especially
| if the css system you use does not build that in.
| MaxBarraclough wrote:
| Neat to see sleep calls artificially introduced to reliably
| recreate the deadlock. [0]
|
| Looks like fixing the underlying bug is still in-progress, [1] I
| wonder how many lines of code it will take.
|
| [0] https://github.com/aoli-al/jdk/commit/625420ba82d2b0ebac24d9...
|
| [1] https://bugs.openjdk.org/browse/JDK-8358601
| trhway wrote:
| without reworking of the code all these checks of the executor
| and queue state and queue manipulations have to be under a
| mutex, and that is just a few lines.
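A minimal Java sketch of the "check state and touch the queue under one mutex" idea. `GuardedWorkQueue` is a hypothetical toy class, not the JDK's executor code and not the proposed fix for JDK-8358601; it only illustrates why the state check and the queue manipulation need to sit inside the same critical section.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical toy class for illustration only; not taken from the JDK.
final class GuardedWorkQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private boolean shutdown = false;

    // The shutdown check and the enqueue happen under the same lock,
    // so a task can never be added after shutdownNow() has cleared the queue.
    boolean submit(Runnable task) {
        lock.lock();
        try {
            if (shutdown) {
                return false;
            }
            queue.add(task);
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Flips the state and drains the queue in one critical section.
    // Returns the number of pending tasks that were discarded.
    int shutdownNow() {
        lock.lock();
        try {
            shutdown = true;
            int dropped = queue.size();
            queue.clear();
            return dropped;
        } finally {
            lock.unlock();
        }
    }

    int pending() {
        lock.lock();
        try {
            return queue.size();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        GuardedWorkQueue q = new GuardedWorkQueue();
        System.out.println(q.submit(() -> {})); // accepted before shutdown
        System.out.println(q.shutdownNow());    // number of dropped tasks
        System.out.println(q.submit(() -> {})); // rejected after shutdown
    }
}
```

Without the shared lock, a thread could pass the `shutdown` check, be preempted, and then enqueue a task that nothing will ever run.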
| brabel wrote:
| Bugs like these are pervasive in languages like Java that give
| no protection against even the most basic race condition
| causes. It's nearly impossible to write reliable concurrent
| code. Fray only helps if you actually use it to test
| everything, which is not realistic. I am convinced, after my
| last year-long struggle to get a highly concurrent Java
| (actually Kotlin, but Kotlin does not add much to help) module
| working at work, that we should only use languages that provide
| safe concurrency models, like Erlang/Elixir and Rust, or
| actor-like ones like Dart and JavaScript, where concurrency is
| required.
| gf000 wrote:
| What _is_ a safe concurrency model? Like, actors can
| trivially deadlock /livelock, they are no panacea at all, and
| are trivial to recreate (there are a million java
| implementations)
|
| You make it sound like there is some modern development
| superseding what java has, but that's absolutely not the
| case.
|
| Like even rust is just pretty much a no-overhead
| `synchronized` on top of an object. It _is_ necessary there,
| because data races are a fundamental memory safety issue, but
| Java is immune to that (it has "safe" data races). Logical
| bugs can trivially happen in either case - as an easy example
| even if all your fields are atomically mutated, the whole
| object may not make sense in certain states, like a date with
| February the 31st. Rust does nothing against such, and
| concurrent data structures have ample grounds for realistic
| examples of the above.
| mrkeen wrote:
| > What is a safe concurrency model?
|
| STM.
|
| The terms 'atomic', 'thread-safe', and 'concurrent'
| collections are thrown around too loosely for application
| programmers IMO, for exactly your example above.
|
| In other scenarios, 'atomics' refer to the ability to do
| _one_ thing atomically. With STM, you can do two or more
| things atomically.
|
| Likewise with 'thread-safe'. Thread-safe seems to indicate
| that the object won't break internally in the presence of
| multiple threads, which is too low of a bar to clear if
| your goal is to write an actually thread-safe application
| out of so-called 'thread-safe' parts.
|
| STM has _actual_ concurrent data structures, where you can
| write straight-line code like 'if this collection has at
| least 5 elements, then pop one'.
|
| I don't think the Feb 31 example is that fair though,
| because if you want to construct a representation of Feb
| 31, who's going to stop you? And if you don't want to,
| plain old static types is the solution.
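For contrast with the STM description above, here is roughly what "if this collection has at least 5 elements, then pop one" looks like in plain Java, which has no built-in STM. This is only a sketch: `GuardedStack` and `popIfAtLeast` are hypothetical names, and the compound check-then-pop is made atomic with an ordinary monitor lock rather than a transaction.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical wrapper for illustration: every method takes the same monitor,
// so the size check and the pop cannot be separated by another thread.
final class GuardedStack<T> {
    private final Deque<T> items = new ArrayDeque<>();

    synchronized void push(T item) {
        items.push(item);
    }

    // Compound operation: check the size and pop in one critical section.
    synchronized Optional<T> popIfAtLeast(int minSize) {
        if (items.size() >= minSize) {
            return Optional.of(items.pop());
        }
        return Optional.empty();
    }

    synchronized int size() {
        return items.size();
    }

    public static void main(String[] args) {
        GuardedStack<Integer> stack = new GuardedStack<>();
        for (int i = 1; i <= 5; i++) {
            stack.push(i);
        }
        System.out.println(stack.popIfAtLeast(5)); // prints Optional[5]
        System.out.println(stack.popIfAtLeast(5)); // prints Optional.empty
    }
}
```

The lock only covers this one object. Doing the same across two independently locked collections needs a shared lock or a lock-ordering discipline, which is the composition problem the comment attributes to so-called thread-safe parts.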
| gf000 wrote:
| I couldn't give a better reply than this author:
|
| https://joeduffyblog.com/2010/01/03/a-brief-retrospective-on...
|
| Also, a phenomenal writing (as are his other posts) on
| the whole concurrency landscape, see:
|
| > A wondrous property of concurrent programming is the
| sheer number and diversity of programming models
| developed over the years. Actors, message-passing, data
| parallel, auto-vectorization, ...; the titles roll off
| the tongue, and yet none dominates and pervades. In fact,
| concurrent programming is a multi-dimensional space with
| a vast number of worthy points along its many axes.
| mrkeen wrote:
| I've read a few postmortems about STM. I have to take
| them with a grain of salt because I usually read those
| reports right after doing a bunch of STM programming, and
| right before doing a bunch more STM programming. Reports
| of its death have been greatly exaggerated.
|
| Here it is in 2006 featuring the same Tim from your
| article: https://www.youtube.com/watch?v=tve57vilywc
|
| I didn't start using it in anger till 2013-2014 maybe?
| But I don't recall any major differences between what the
| video shows and how it works in 2025.
|
| Anyway, postmortems usually boil down to two issues:
|
| 1) That's not how programmers usually do it
|
| 2) We couldn't pull it off
|
| The most obvious explanation for 1 is 2. I, too, would be
| disappointed by the low-adoption rates of my new
| technology if I hadn't built it or released it to users.
|
| But the article has some gems:
|
| > Transactions unfortunately do not address one other
| issue, which turns out to be the most fundamental of all:
| sharing. Indeed, TM is insufficient - indeed, even
| dangerous - on its own because it makes it very easy to
| share data and access it from multiple threads;
|
| I cannot read this charitably. This is _the only reason
| for_, not _a damning reason against_. It's like doing
| research & development on condoms, and then realising
| it's a hopeless failure because they might be used for
| dangerous activities like sex.
|
| > I already mentioned a great virtue of transactions is
| their ability to nest. But I neglected to say how this
| works. And in fact when we began, we only recognized one
| form of nesting. You're in one atomic block and then enter
| into another one. What happens if that inner transaction
| commits or rolls back, before the fate of the outer
| transaction is known
|
| You nest transactional _statements_, not the calls to
| _atomic_. The happy-path for an _atomic_ is that it will
| _commit_; it should be obvious a priori that something
| that commits cannot be in the codepath that can be rolled
| back.
|
| > Then that same intern's casual statement pointing out an
| Earth-shattering flaw that would threaten the kind of TM
| we (and most of the industry at the time) were building.
| ... An update in-place system will allow that transaction
| to freely change the state of x. Of course, it will roll
| back here, because isItOwned changed to true. But by then
| it is too late: the other thread using x outside of a
| transaction will see constantly changing state - torn
| reads even - and who knows what will happen from there. A
| known flaw in any weakly atomic, update in-place TM.
|
| > If this example appears contrived, it's not. It shows up
| in many circumstances.
|
| I agree that it's not contrived. It's in the problem
| space of application writers. It's not a problem caused
| by introducing STM. We want an STM system to allow safe
| access to isItOwned & x, because it's a PITA to try to do
| this with locks.
| gf000 wrote:
| `atomic` is their choice of syntax for an STM transaction
| in their experimental C# runtime, it's _not_ an atomic
| statement. Please take the time to actually read the
| article, because you have obviously just skimmed over it.
| This was not written by some nobody, he does know what he
| talks about.
| tialaramex wrote:
| > the whole object may not make sense in certain states
|
| "Make invalid states unrepresentable" - it's bad design
| that February the 31st is a thing in your data structure
| when that's invalid. You can't _always_ avoid this, but
| it's appalling how bad most people's data structures are.
|
| C's stdlib provides a tm structure in which day of the week
| is stored in a _signed_ 32-bit integer. You know, for when
| it's the negative two billionth day of the week...
| nlitened wrote:
| > "Make invalid states unrepresentable"
|
| I think this phrase sounds good but is not applicable to
| systems that touch messy reality.
|
| For example, I think it's not even possible to apply it
| to the `tm` structure, as leap seconds are not known in
| advance.
| tialaramex wrote:
| I agree that messy reality can intervene, in the medium
| term (for about a decade) we'll need to handle leap
| seconds
|
| But we can do a _lot_ without challenging the messy
| reality. 61 second minutes are (regrettably) a thing in
| some time systems, but negative 1 million second minutes
| are not a thing, there's no need for this to be a signed
| integer!
| kbolino wrote:
| The struct is also used for date/time arithmetic and the
| standard library explicitly supports out-of-range values
| for this reason.
| tialaramex wrote:
| I have no doubt that C "explicitly supports" this, but
| it's a bad idea.
|
| The C standard library has the excuse that most of it is
| very old. We should do better.
| kbolino wrote:
| Better for whom? If you want a dead-simple time type, use
| time_t.
|
| There are plenty of improvements needed in the C time
| APIs, like sub-second precision, thread safety, and
| timezone awareness. What benefit is there to making the
| struct fields unsigned beyond some arbitrary purity test?
| This is still C, there are still plenty of ways to make
| invalid values. And it is nice to be able to subtract as
| well as add.
|
| Heck, there's no way to encode the full Gregorian
| Calendar rules in the type system of any language I've
| ever used, so every choice is going to be a compromise.
| February 29 Not-A-Leap-Year and April 31 are still
| invalid dates even if you can outlaw January 0 and March
| 32.
|
| Making all the fields in struct tm signed ints is clearly
| there to allow them to be manipulated and consistently
| so, since other types would obviously be better for size
| if nothing else.
| gf000 wrote:
| This is more of a toy example for how a set of atomic
| changes can still end up in an inconsistent state, e.g.
| setting January the 31st and February 3rd in quick
| succession from two or more different threads may result
| in Feb 31st being visible from a third thread. This is
| _not_ solved by Rust and your struct will even get the
| Sync trait automatically, which may not be applicable,
| as in this case.
| brabel wrote:
| Given your example, I am convinced you've never written
| any Rust. Of course it does stop you doing shit like
| that. But in this example, even Java does it properly,
| since the constructor runs to completion before any
| Object is accessible to any Thread, not just the one
| creating it. You need to validate the state of the object
| in the constructor to prevent that, but TBH why are we
| talking about this, it's almost completely unrelated to
| concurrency models.
| gf000 wrote:
| Of course if you are creating a new object and you have
| an atomic handle to it, it is trivial to solve. Like,
| having immutable objects solves a _lot_ of these
| problems.
|
| But what I'm quite obviously talking about is a Rust
| struct with 3 atomic fields. Just because I can safely
| race on any of its fields, doesn't mean that the whole
| struct can safely be shared, yet it will be inferred to
| be Sync.
| tialaramex wrote:
| Object mutability isn't relevant here. A Date type which
| is mutable can ensure that all mutations are valid, it
| just can't do so while retaining this clumsy "LOL I'm
| just a D-M-Y tuple" API.
|
| We can see immediately that your type is broken because
| it allows us to directly set the date to February 31st,
| there's no concurrency bug needed, the type was always
| defective.
| gf000 wrote:
|     void setDate(int month, int day) {
|         if (notValidDate(month, day)) { throw; }
|         this.month = month; // atomic
|         this.day = day;     // atomic
|     }
|
| Yet the whole function is not
| "atomic"/transactional/consistent, and two threads
| running simultaneously may surface the above error.
|
| Of course it can ensure that it is consistent, C code can
| also just ensure that it is memory safe. This is just not
| an inherent property, and in general you _will_ mess it
| up.
|
| The only difference is that we can reliably solve memory
| safety issues (GC, Rust's ownership model), but we have
| absolutely no way to solve concurrency issues in any
| model. The only solution is.. having a single thread.
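For this particular toy case, one common Java workaround is the immutable-object approach mentioned earlier in the thread: validate the month and day together and publish them as a single immutable value through one atomic reference. The sketch below assumes that approach; `SafeDateHolder` and `onlySawWrittenValues` are hypothetical names, and `java.time.MonthDay` is used only because it already rejects combinations such as February 31st. It does not address the general point that individually atomic steps need not compose.

```java
import java.time.MonthDay;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for illustration: month and day are swapped as one
// immutable MonthDay, so readers never see half of one write and half of another.
final class SafeDateHolder {
    private final AtomicReference<MonthDay> current =
            new AtomicReference<>(MonthDay.of(1, 1));

    void setDate(int month, int day) {
        // MonthDay.of throws DateTimeException for impossible dates such as (2, 31).
        current.set(MonthDay.of(month, day));
    }

    MonthDay get() {
        return current.get();
    }

    // Two writers alternate between Jan 31 and Feb 3 while this thread reads.
    // Returns true if every observed value was one of the two written values.
    static boolean onlySawWrittenValues(int iterations) {
        SafeDateHolder holder = new SafeDateHolder();
        MonthDay jan31 = MonthDay.of(1, 31);
        MonthDay feb3 = MonthDay.of(2, 3);
        holder.setDate(1, 31);
        Thread a = new Thread(() -> {
            for (int i = 0; i < iterations; i++) holder.setDate(1, 31);
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < iterations; i++) holder.setDate(2, 3);
        });
        a.start();
        b.start();
        boolean ok = true;
        for (int i = 0; i < iterations; i++) {
            MonthDay seen = holder.get();
            ok &= seen.equals(jan31) || seen.equals(feb3);
        }
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(onlySawWrittenValues(100_000)); // prints true
    }
}
```

The trade-off is an allocation per update and no in-place mutation, which is part of what the thread is arguing about.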
| tialaramex wrote:
| But you were critiquing Rust's model, yet you've written
| C++ here. I agree it's perfectly easy to write the bug in
| C++.
|
| In Rust this improved type doesn't have the defect, to
| call Rust's analogue of your setDate function you must
| have the exclusive mutable reference, which means there's
| no concurrency problem.
|
| You have to do a whole lot of extra work to write the bug
| and why would you, just write what you meant and it
| behaves correctly.
| gf000 wrote:
| It's called pseudo-code, and some extra attempt on your
| part to deliberately miss the point.
|
| Give it another go at understanding what I'm saying,
| cheers!
| brabel wrote:
| > Like, actors can trivially deadlock/livelock,
|
| Oh my ... you've never seen a proper Actor language, have you?
|
| Have a look at Erlang and Pony, for starters. It will open
| your mind.
|
| This in particular is great:
| https://www.ponylang.io/discover/what-makes-pony-different/#...
|
| > Pony doesn't have locks nor atomic operations or anything
| like that. Instead, the type system ensures at compile time
| that your concurrent program can never have data races. So
| you can write highly concurrent code and never get it
| wrong.
|
| This is what I am talking about.
|
| > You make it sound like there is some modern development
| superseding what java has, but that's absolutely not the
| case.
|
| Both Actor-model languages and Rust (through a surprisingly
| different path: tracking aliases and lifetimes) do
| something that's impossible in Java (and most languages):
| prevent data races due to improper locking (as mentioned
| above, if your language even has locks and it doesn't make
| them safe like Rust does, you know you're going to have a
| really hard time. actor-languages just eliminate locks, and
| "manual concurrency", completely). Other kinds of races are
| still possible, but preventing data races goes a very, very
| long way to making concurrency safe and easy.
| gf000 wrote:
| Is preventing data races (which is not particularly
| hard if you are willing to give up certain properties,
| e.g. just immutability alone solves it) that much of a
| win?
|
| You just made a bunch of concurrent algorithms
| unimplementable that would give much better performance for
| the benefit of.. having all the other unsolvable issues
| with concurrency? Like, all the same issues are trivially
| reproducible at a higher level, with loops within actors'
| communication that only appear under certain, very
| dynamic conditions, or a bunch of message passing ending
| up in an inconsistent state, just not on an "object"
| level, but on a "group of object" level.
| jbritton wrote:
| Perhaps there is some confusion here between data races
| and race conditions. Rust and Pony prevent data races,
| but not race conditions.
| rand_r wrote:
| Race conditions are generally solved with algorithms, not the
| language. For example, defining a total ordering on locks and
| only acquiring locks in that order to prevent deadlock.
|
| I guess there are language features like co-
| routines/co-operative multi-tasking that make certain
| algorithms possible, but nothing about Java prevents
| implementing sound concurrency algorithms in general.
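The lock-ordering idea above can be sketched in Java as follows. This is an illustrative example under assumptions, not code from the article or the JDK: `OrderedTransfer` and `Account` are hypothetical names, and the total order is simply each account's numeric id, assumed unique.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example: locks are always taken in ascending account id,
// so two opposing transfers cannot each hold one lock while waiting for the other.
final class OrderedTransfer {
    static final class Account {
        final int id;                 // assumed unique; defines the global lock order
        final ReentrantLock lock = new ReentrantLock();
        long balance;                 // guarded by lock

        Account(int id, long balance) {
            this.id = id;
            this.balance = balance;
        }
    }

    static void transfer(Account from, Account to, long amount) {
        // Pick the lock order from the ids, not from the argument order.
        Account first = from.id < to.id ? from : to;
        Account second = first == from ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    // Runs transfers in both directions concurrently and returns the combined balance.
    static long runOpposingTransfers(int rounds) {
        Account a = new Account(1, 1_000);
        Account b = new Account(2, 1_000);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) transfer(a, b, 1);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) transfer(b, a, 1);
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println(runOpposingTransfers(100_000)); // prints 2000
    }
}
```

If `transfer` locked `from` and then `to` in argument order instead, the two threads could each grab one lock and wait forever on the other. The ordering rule is a convention the compiler does not check, which is the objection raised in the replies.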
| mrkeen wrote:
| > Race conditions are generally solved with algorithms, not
| the language. For example, defining a total ordering on
| locks
|
| You wouldn't make that claim if your language didn't have
| locks.
| brabel wrote:
| Exactly, this thread is full of ignorant comments. I was
| talking about a certain class of race conditions that can
| be completely prevented in some languages, like Rust
| (through its aliasing rules that just make it impossible
| to mutate things from different threads simultaneously,
| among other things) and languages like Pony, for example,
| as the language uses the Actor model for concurrency,
| which means it has no locks at all (it doesn't need
| them), though I mentioned Dart because Dart Isolates look
| a lot like Actors (they are single-threaded but can send
| messages and receive messages from other "actors",
| similarly to JS workers).
| TYMorningCoffee wrote:
| Impressive! Can't wait to try Fray out at work.
| exabrial wrote:
| > Fray is a concurrency testing tool for Java that can help you
| find and debug tricky race conditions that manifest as assertion
| violations, run-time exceptions, or deadlocks. It performs
| controlled concurrency testing using state-of-the-art techniques
| such as probabilistic concurrency testing or partial order
| sampling.
|
| > Fray also provides deterministic replay capabilities for
| debugging specific thread interleavings. Fray is designed to be
| easy to use and can be integrated into existing testing
| frameworks.
|
| I wish I had this 20 years ago.
| delusional wrote:
| You appear to be one of the authors, so forgive me asking a
| technical question.
|
| In the technical paper, Section 5.4 you mention that kotlin has
| non-determinism in the scheduler. Where does this non-determinism
| come from?
|
| It seems unclear to me why Kotlin would inject randomness here,
| and I suspect that you may actually have identified a false
| positive in the Lincheck DSL.
| aoli-al wrote:
| The "randomness" comes from Kotlin coroutines and user-space
| scheduling. For example, Kotlin runs multiple user-space
| threads on the same physical thread. Fray only reschedules
| physical threads. So when the applications under test use
| coroutines/virtual threads, Fray cannot generate certain thread
| interleavings. Also, it cannot deterministically replay because
| the thread execution is no longer controlled by Fray.
|
| In our paper, we found that Fray suffers from false negatives
| because of this missing feature. Lincheck supports Kotlin
| coroutines so it finds one more bug than Fray in LC-Bench.
|
| We didn't make any claims about false positives in Lincheck.
| delusional wrote:
| > We didn't make any claims about false positives in
| Lincheck.
|
| To be clear, I made that claim :) I agree that the paper
| makes no such claim.
| AugustoCAS wrote:
| [posted this in another thread, but maybe the author can clarify
| this]
|
| I wonder how this works when one runs tests in parallel (something
| I always enable in any project). By this I mean configuring JUnit
| to run as many tests as cores are available to speed up the run
| of the whole test suite.
|
| I took a peek at the code and I have the impression it doesn't
| work that well as it hooks into when a thread is started. Also,
| I'm not sure if this works with fibers.
| aoli-al wrote:
| Yes, Fray controls all application threads so it runs one test
| per JVM. But you can always use multiple JVMs to run multiple
| tests[1].
|
| Fray currently does not support virtual threads. We do have an
| open issue tracking it, but it is low priority.
|
| [1]:
| https://docs.gradle.org/current/userguide/java_testing.html#...
___________________________________________________________________
(page generated 2025-06-08 23:01 UTC)