Finding the "Second Bug" in glibc's Condition Variable

by Malte Skarupke

I continue to have no time for big programming projects, so here is a short blog post. Two years ago I looked into a bug in the glibc implementation of condition variables: sometimes pthread_cond_signal() doesn't do anything, which can easily hang your program. The bug is still not fixed, partially because a mitigation patch was available right away that seemed to make it go away. Except that people kept showing up in the bug report saying that they still hit the bug sometimes, raising the suspicion that there might be a second bug. I finally got around to looking into this. I found that the mitigation patch only helps a little, that it's still the same bug, and that the patch I submitted (unreviewed, don't use yet) would actually fix it.

As I mentioned last time, one of the affected programming languages is OCaml. Their master lock occasionally doesn't notify a sleeper, because sometimes pthread_cond_signal() doesn't do anything. And then the whole process hangs forever, because that's what happens when someone doesn't get woken up in cooperative multithreading.

Checking this code in TLA+ happens to be easier than the code I was checking last time, because this results in a deadlock, which can be checked quickly. Last time I had to write temporal formulas, which make TLA+ run slowly, but a deadlock is easy to find: find a state that has no successor states. All TLA+ has to do is enumerate all states. So whenever you can, try to write your TLA+ code so that it causes a deadlock on error.

The OCaml masterlock can be directly translated into the PlusCal language of TLA+. It just wraps a normal mutex and adds a count of waiters, to make cooperative multithreading slightly more efficient: you can quickly check if anyone actually wants the lock, so you don't have to give it up if nobody is waiting. For our purposes we don't even need the yield function; st_masterlock_acquire() and st_masterlock_release() ended up being enough. Then all we have to do is call those in a loop on multiple threads. Here is the code in PlusCal (it's calling the mutex and condition-variable functions we wrote last time):

    procedure acquire_masterlock() {
    acquire_masterlock_start:
        call lock_mutex();
    acquire_masterlock_loop:
        while (busy) {
            waiters := waiters + 1;
            call cv_wait();
    acquire_lock_after_wait:
            waiters := waiters - 1;
        };
    acquire_masterlock_after_loop:
        busy := TRUE;
        call unlock_mutex();
        return;
    }

    procedure release_masterlock() {
    release_masterlock_start:
        call lock_mutex();
    release_masterlock_after_lock:
        busy := FALSE;
        call unlock_mutex();
    release_masterlock_signal:
        call cv_signal();
        return;
    };

    fair process (Proc \in Procs)
    variable num_loops = MaxNumLoops;
    {
    proc_start:
        while (num_loops > 0) {
            num_loops := num_loops - 1;
            either {
                call acquire_masterlock();
            } or {
                goto proc_done;
            };
    proc_loop_done:
            call release_masterlock();
        };
    proc_done:
        skip;
    }

If you clicked on the github link, you will see that this is a direct translation. But the point is that it's fairly straightforward code.

Now we just have to run this with the magic number 3, three Procs and MaxNumLoops=3, and three hours later we have a trace that shows how you can hang even with the mitigation patch. (Full reproduce steps are at the bottom of this blog post.) One reason why I didn't find this problem last time is that the trace is even more complicated.
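For readers who haven't looked at the OCaml runtime, the C code being modeled is roughly the following. This is a simplified sketch, not the actual runtime source: the real functions are st_masterlock_acquire() and st_masterlock_release(), and the real code has extra details such as the yield path and atomic accesses to the waiter count. The names here are shortened for illustration, but the shape matches the PlusCal above.

    #include <pthread.h>

    /* Simplified sketch of the OCaml runtime's master lock:
       a plain mutex plus a "busy" flag and a waiter count.
       (Not the real runtime code; names abbreviated.) */
    typedef struct {
        pthread_mutex_t lock;    /* protects busy and waiters     */
        pthread_cond_t  is_free; /* signaled when busy drops to 0 */
        int             busy;    /* does someone hold the lock?   */
        int             waiters; /* how many threads are sleeping */
    } masterlock;

    static void masterlock_acquire(masterlock *m)
    {
        pthread_mutex_lock(&m->lock);
        while (m->busy) {
            m->waiters++;
            pthread_cond_wait(&m->is_free, &m->lock);
            m->waiters--;
        }
        m->busy = 1;
        pthread_mutex_unlock(&m->lock);
    }

    static void masterlock_release(masterlock *m)
    {
        pthread_mutex_lock(&m->lock);
        m->busy = 0;
        pthread_mutex_unlock(&m->lock);
        /* If pthread_cond_signal() silently does nothing here,
           a waiter sleeps forever and the process hangs. */
        pthread_cond_signal(&m->is_free);
    }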
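And the fair process (Proc \in Procs) above corresponds to a driver along these lines, continuing the same file as the sketch above. The thread and iteration counts are arbitrary choices for illustration; since the hang needs an extremely specific interleaving, looping like this in C is not a reliable reproducer, which is exactly why the model checker is the better tool here.

    #include <stdio.h>

    #define NUM_THREADS 3        /* arbitrary, mirrors the three Procs  */
    #define NUM_LOOPS   100000   /* arbitrary, unlike MaxNumLoops=3     */

    static masterlock ml = {
        PTHREAD_MUTEX_INITIALIZER,
        PTHREAD_COND_INITIALIZER,
        0, /* busy */
        0  /* waiters */
    };

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NUM_LOOPS; ++i) {
            masterlock_acquire(&ml);
            /* ... do a slice of work while holding the lock ... */
            masterlock_release(&ml);
        }
        return NULL;
    }

    int main(void) /* compile with: cc -pthread file.c */
    {
        pthread_t threads[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; ++i)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; ++i)
            pthread_join(threads[i], NULL);
        puts("done"); /* with the bug, this may never be reached */
        return 0;
    }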
The mitigation patch does help for smaller traces, with fewer threads or smaller numbers for MaxNumLoops. To hit the bug you need a very specific, long interleaving. This will be gibberish unless you have spent many hours studying the glibc condition variable code:

1. Find the magic interleaving that makes you hit the "potential steal" code path in pthread_cond_wait(), discussed last time.
2. Run the pthread_cond_broadcast() from the mitigation patch while no other thread is waiting, so that it will early-out without doing anything.
3. Then signal again and trigger a call to quiesce_and_switch_g1().
4. At the same time as step 3, have a thread wait, consuming the leftover signal from the "potential steal" in step 1, then wait again.
5. Now have step 3 finish with g_size=2 (because of the two waits) for the new g1, then signal after the switch, leaving g_size=1.
6. Then have step 4 finish the second wait, consuming the signal from step 5.

You're now left with g_size=1 for the new g1 even though nobody is waiting on it. The next call to pthread_cond_signal() just reduces g_size to 0 without waking anyone. Any signal after that will work again.

This is still really complicated for me, so I'm not sure exactly where the problem is, but it is suspicious that the pthread_cond_broadcast() in my trace would just early-out because nobody was waiting, so that the mitigation patch doesn't result in any change to the state. If that can happen, TLA+ just had to find an interleaving where the lingering signal from the "potential steal" causes problems later.

Why didn't I find this last time? Because I stopped looking after I had found the initial problem, and after the mitigation patch made the problem go away with my trace. It turns out I needed to run an even bigger job to find the problem with the mitigation patch.

So now the state of the glibc condition variable is:

1. It mostly works, except occasionally a thread can take too long to wake up, which will cause it to steal a signal from a later wakeup. (And then the later thread doesn't wake.)
2. To fix that, the code tries to detect whether we potentially stole a signal. But the response to that can leave the condition variable in a broken state, so that a later pthread_cond_signal() will signal on the wrong futex, causing sleepers to not be woken.
3. To fix that, there is a mitigation patch which various distributions are running. But that mitigation patch has two issues:
   + It doesn't work for the reasons that the author thinks it does (discussed last time).
   + It sometimes doesn't do anything, which allows the broken state to cause problems later. (Discovered in this blog post.)

Both of these fixes make the issue less likely to happen, but they also make it harder to understand (and debug).

What's the solution? I already submitted a patch for this a year ago: fix the start of the chain of errors by broadening the scope of grefs, so that no waiter can ever be missed when we close a group. Then the "potential stealing" code path becomes unnecessary, the bug in that path goes away, and the mitigation patch becomes unnecessary. The implementation also becomes simpler to reason about. My "patch" is actually a series of six patches, in which the first one fixes the bug and the remaining five just clean up code. (But a word of warning: this patch is unreviewed and I wouldn't be at all surprised if there is still a problem with it. This is tricky code. See the comment by Ricardo below.) If anyone has specific steps I can take to get this into glibc, I will try to do them.
Last time I gave up a bit too easily, because I got burned out on this problem after having spent way too many hours debugging it. Actually getting the fix in was too much extra work for a project done in my spare time. But it's been two years, so my burnout on this particular problem is gone and I'm game, as long as there are steps that actually work to get this in.

Appendix: Detailed Steps to Reproduce

1. Download TLA+: https://github.com/tlaplus/tlaplus/releases/tag/v1.7.1#latest-tla-files
2. Run the toolbox.
3. Click "File -> Open Spec -> Add New Spec", then select ocaml_mutex.tla.
4. Click "TLC Model Checker -> New Model", click OK.
5. In the model:
   + What is the behavior spec: leave as default "Spec"
   + What is the model?
     o UseSpinlockForCVMutex <- TRUE
     o AlwaysSetGrefs <- FALSE (this flag controls whether my patch should be used)
     o Procs <- {P0, P1, P2}, also select "Set of model values" and "Symmetry set"
     o UseSpinlockForMutex <- TRUE
     o UsePatch <- TRUE (this flag controls whether the mitigation patch should be used)
     o MaxNumLoops <- 3
   + What to check?
     o Check "Deadlock" (should be checked by default)
     o Invariants: leave empty
     o Properties: leave empty
6. Click "TLC Model Checker -> Run Model".
7. Watch the progress in the "Model Checking Results" column. Once per minute it should populate the table "Time, Diameter, States Found, Distinct States, Queue Size." After ~2.5 hours, when "Diameter" reaches 342, a new panel should open up titled "TLC Errors", and it should say "Deadlock reached." in the first window.
8. Now you have 341 states to look at in the Error-Trace, in full detail. The "pc" variable shows you which line each process was on for every state. All other variables are also visible. They are a pretty direct translation from the C code, and it should be possible to figure out the minor differences by having the C code and the TLA+ code side by side. I uploaded a table with only the relevant information here. (This was from a run with slightly different source code, with the AlwaysSetGrefs commented out. This means your trace will be slightly different and the line numbers of the spreadsheet won't align if you try to add more information. You'll need to create your own by exporting the trace.)

(Optional step before 6: Click on "Additional TLC Options", increase "Number of worker threads" for your machine, increase "Fraction of physical memory allocated to TLC", and set "Profiling" to "Off". This should speed up the checking.)

Published: September 17, 2022. Filed under: Programming. Tags: Condition variable, Glibc, TLA+

2 Comments to "Finding the "Second Bug" in glibc's Condition Variable"

1. Ricardo says:
   September 18, 2022 at 07:32

   In my private copy of NixOS/nixpkgs, I've tried replacing glibc's condvar mitigation patch (which I've been using for 18 months) with your patch (well, the first one out of the 6, which you said contains the minimal amount of changes to fix the bug) and I've run into a relatively consistent deadlock while building Poly/ML 5.9. Specifically, it freezes (for at least 2 hours) right after showing "*****Writing object code******". This is on a machine with 32 cores and lots of background compilation going on at the same time (i.e. with a very high load), and it happened at least 3 times already.
   It could be a bug in Poly/ML or even an unrelated bug, but I suspect that glibc's condition variable bug, or a related one, might not be completely fixed yet, even after applying your patch...

   + Malte Skarupke says:
     September 18, 2022 at 12:01

     Thanks, I added a warning to the text. I wouldn't be surprised if there were issues with the patch. This is complicated code. I have only proved it correct, not tried it. (Though it did pass all glibc tests.) I would have expected bugs with one of the later patches in the series, not the first... I'll re-review it, and also check if something needs to change about it when it is applied against the new glibc version.