Finding the "Second Bug" in glibc's Condition Variable

by Malte Skarupke

I continue to have no time for big programming projects, so here is a short blog post. Two years ago I looked into a bug in the glibc implementation of condition variables: sometimes pthread_cond_signal() doesn't do anything, which can easily hang your program. The bug is still not fixed, partially because a mitigation patch was available right away that seemed to make it go away. Except that people kept showing up in the bug report saying that they still hit the bug sometimes, raising the suspicion that there might be a second bug. I finally got around to looking into this. I found that the mitigation patch only helps a little, that it's still the same bug, and that the patch I submitted (unreviewed, don't use yet) would actually fix it.

As I mentioned last time, one of the affected programming languages is OCaml. Their master lock occasionally doesn't notify a sleeper, because sometimes pthread_cond_signal() doesn't do anything. And then the whole process hangs forever, because that's what happens when someone doesn't get woken up in cooperative multithreading.

Checking this code in TLA+ happens to be easier than the code I was checking last time, because this results in a deadlock, which can be checked quickly. Last time I had to write temporal formulas, which make TLA+ run slowly, but a deadlock is easy to find: find a state that has no successor states. All TLA+ has to do is enumerate all states. So whenever you can, try to write your TLA+ code so that it causes a deadlock on error.

The OCaml masterlock can be directly translated into the PlusCal language of TLA+. It just wraps a normal mutex and adds a count of waiters, to make cooperative multithreading slightly more efficient: you can quickly check if anyone actually wants the lock, so you don't have to give it up if nobody is waiting. For our purposes we don't even need the yield function; st_masterlock_acquire() and st_masterlock_release() ended up being enough. Then all we have to do is call those in a loop on multiple threads. Here is the code in PlusCal (it's calling the mutex and condition-variable functions we wrote last time):

    procedure acquire_masterlock() {
    acquire_masterlock_start:
        call lock_mutex();
    acquire_masterlock_loop:
        while (busy) {
            waiters := waiters + 1;
            call cv_wait();
    acquire_lock_after_wait:
            waiters := waiters - 1;
        };
    acquire_masterlock_after_loop:
        busy := TRUE;
        call unlock_mutex();
        return;
    }

    procedure release_masterlock() {
    release_masterlock_start:
        call lock_mutex();
    release_masterlock_after_lock:
        busy := FALSE;
        call unlock_mutex();
    release_masterlock_signal:
        call cv_signal();
        return;
    };

    fair process (Proc \in Procs)
    variable num_loops = MaxNumLoops;
    {
    proc_start:
        while (num_loops > 0) {
            num_loops := num_loops - 1;
            either {
                call acquire_masterlock();
            } or {
                goto proc_done;
            };
    proc_loop_done:
            call release_masterlock();
        };
    proc_done:
        skip;
    }

If you clicked on the github link, you will see that this is a direct translation. But the point is that it's fairly straightforward code.

Now we just have to run this with the magic number 3, three Procs and MaxNumLoops=3, and three hours later we have a trace that shows how you can hang even with the mitigation patch. (Full reproduce steps are at the bottom of this blog post.) One reason why I didn't find this problem last time is that the trace is even more complicated.
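For readers who haven't looked at the OCaml runtime, the C code being modeled is roughly the following. This is a simplified sketch, not the actual runtime source: the real functions are st_masterlock_acquire() and st_masterlock_release(), and the real code has extra details such as the yield path and atomic accesses to the waiter count. The names here are shortened for illustration, but the shape matches the PlusCal above.

    #include <pthread.h>

    /* Simplified sketch of the OCaml runtime's master lock:
       a plain mutex plus a "busy" flag and a waiter count.
       (Not the real runtime code; names abbreviated.) */
    typedef struct {
        pthread_mutex_t lock;    /* protects busy and waiters     */
        pthread_cond_t  is_free; /* signaled when busy drops to 0 */
        int             busy;    /* does someone hold the lock?   */
        int             waiters; /* how many threads are sleeping */
    } masterlock;

    static void masterlock_acquire(masterlock *m)
    {
        pthread_mutex_lock(&m->lock);
        while (m->busy) {
            m->waiters++;
            pthread_cond_wait(&m->is_free, &m->lock);
            m->waiters--;
        }
        m->busy = 1;
        pthread_mutex_unlock(&m->lock);
    }

    static void masterlock_release(masterlock *m)
    {
        pthread_mutex_lock(&m->lock);
        m->busy = 0;
        pthread_mutex_unlock(&m->lock);
        /* If pthread_cond_signal() silently does nothing here,
           a waiter sleeps forever and the process hangs. */
        pthread_cond_signal(&m->is_free);
    }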
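And the fair process (Proc \in Procs) above corresponds to a driver along these lines, continuing the same file as the sketch above. The thread and iteration counts are arbitrary choices for illustration; since the hang needs an extremely specific interleaving, looping like this in C is not a reliable reproducer, which is exactly why the model checker is the better tool here.

    #include <stdio.h>

    #define NUM_THREADS 3        /* arbitrary, mirrors the three Procs  */
    #define NUM_LOOPS   100000   /* arbitrary, unlike MaxNumLoops=3     */

    static masterlock ml = {
        PTHREAD_MUTEX_INITIALIZER,
        PTHREAD_COND_INITIALIZER,
        0, /* busy */
        0  /* waiters */
    };

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NUM_LOOPS; ++i) {
            masterlock_acquire(&ml);
            /* ... do a slice of work while holding the lock ... */
            masterlock_release(&ml);
        }
        return NULL;
    }

    int main(void) /* compile with: cc -pthread file.c */
    {
        pthread_t threads[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; ++i)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; ++i)
            pthread_join(threads[i], NULL);
        puts("done"); /* with the bug, this may never be reached */
        return 0;
    }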
The mitigation patch does help for smaller traces, with fewer threads or smaller numbers for MaxNumLoops. To hit the bug you need a very specific, long interleaving. This will be gibberish unless you have spent many hours studying the glibc condition variable code:

1. Find the magic interleaving that makes you hit the "potential steal" code path in pthread_cond_wait(), discussed last time.
2. Run the pthread_cond_broadcast() from the mitigation patch while no other thread is waiting, so that it will early-out without doing anything.
3. Then signal again and trigger a call to quiesce_and_switch_g1().
4. At the same time as step 3, have a thread wait, consuming the leftover signal from the "potential steal" in step 1, then wait again.
5. Now have step 3 finish with g_size=2 (because of the two waits) for the new g1, then signal after the switch, leaving g_size=1.
6. Then have step 4 finish the second wait, consuming the signal from step 5.

You're now left with g_size=1 for the new g1 even though nobody is waiting on it. The next call to pthread_cond_signal() just reduces g_size to 0 without waking anyone. Any signal after that will work again.

This is still really complicated for me, so I'm not sure exactly where the problem is, but it is suspicious that the pthread_cond_broadcast() in my trace would just early-out because nobody was waiting, so that the mitigation patch doesn't result in any change to the state. If that can happen, TLA+ just had to find an interleaving where the lingering signal from the "potential steal" causes problems later.

Why didn't I find this last time? Because I stopped looking after I had found the initial problem, and after the mitigation patch made the problem go away with my trace. It turns out I needed to run an even bigger job to find the problem with the mitigation patch.

So now the state of the glibc condition variable is:

1. It mostly works, except occasionally a thread can take too long to wake up, which will cause it to steal a signal from a later wakeup. (And then the later thread doesn't wake.)
2. To fix that, the code tries to detect whether we potentially stole a signal. But the response to that can leave the condition variable in a broken state, so that a later pthread_cond_signal() will signal on the wrong futex, causing sleepers to not be woken.
3. To fix that, there is a mitigation patch which various distributions are running. But that mitigation patch has two issues:
   + It doesn't work for the reasons that the author thinks it does (discussed last time).
   + It sometimes doesn't do anything, which allows the broken state to cause problems later. (Discovered in this blog post.)

Both of these fixes make the issue less likely to happen, but they also make it harder to understand (and debug).

What's the solution? I already submitted a patch for this a year ago: fix the start of the chain of errors by broadening the scope of grefs, so that no waiter can ever be missed when we close a group. Then the "potential stealing" code path becomes unnecessary, the bug in that path goes away, and the mitigation patch becomes unnecessary. The implementation also becomes simpler to reason about. My "patch" is actually a series of six patches, in which the first one fixes the bug and the remaining five just clean up code. (But a word of warning: this patch is unreviewed and I wouldn't be at all surprised if there is still a problem with it. This is tricky code. See the comment by Ricardo below.) If anyone has specific steps I can take to get this into glibc, I will try to do them.
Last time I gave up a bit too easily, because I got burned out on this problem after having spent way too many hours debugging it. Actually getting the fix in was too much extra work for a project done in my spare time. But it's been two years, so my burnout on this particular problem is gone and I'm game, as long as there are steps that actually work to get this in.

Appendix: Detailed Steps to Reproduce

1. Download TLA+: https://github.com/tlaplus/tlaplus/releases/tag/v1.7.1#latest-tla-files
2. Run the toolbox.
3. Click "File -> Open Spec -> Add New Spec", then select ocaml_mutex.tla.
4. Click "TLC Model Checker -> New Model", click OK.
5. In the model:
   + What is the behavior spec: leave as default "Spec"
   + What is the model?
     o UseSpinlockForCVMutex <- TRUE
     o AlwaysSetGrefs <- FALSE (this flag controls whether my patch should be used)
     o Procs <- {P0, P1, P2}, also select "Set of model values" and "Symmetry set"
     o UseSpinlockForMutex <- TRUE
     o UsePatch <- TRUE (this flag controls whether the mitigation patch should be used)
     o MaxNumLoops <- 3
   + What to check?
     o Check "Deadlock" (should be checked by default)
     o Invariants: leave empty
     o Properties: leave empty
6. Click "TLC Model Checker -> Run Model".
7. Watch the progress in the "Model Checking Results" column. Once per minute it should populate the table "Time, Diameter, States Found, Distinct States, Queue Size." After ~2.5 hours, when "Diameter" reaches 342, a new panel should open up titled "TLC Errors", and it should say "Deadlock reached." in the first window.
8. Now you have 341 states to look at in the Error-Trace, in full detail. The "pc" variable shows you which line each process was on for every state. All other variables are also visible. They are a pretty direct translation from the C code, and it should be possible to figure out the minor differences by having the C code and the TLA+ code side by side. I uploaded a table with only the relevant information here. (This was from a run with slightly different source code, with the AlwaysSetGrefs commented out. This means your trace will be slightly different and the line numbers of the spreadsheet won't align if you try to add more information. You'll need to create your own by exporting the trace.)

(Optional step before 6: Click on "Additional TLC Options", increase "Number of worker threads" for your machine, increase "Fraction of physical memory allocated to TLC", and set "Profiling" to "Off". This should speed up the checking.)

Published: September 17, 2022. Filed under: Programming. Tags: Condition variable, Glibc, TLA+

2 Comments to "Finding the "Second Bug" in glibc's Condition Variable"

1. Ricardo says:
   September 18, 2022 at 07:32

   In my private copy of NixOS/nixpkgs, I've tried replacing glibc's condvar mitigation patch (which I've been using for 18 months) with your patch (well, the first one out of the 6, which you said contains the minimal amount of changes to fix the bug) and I've run into a relatively consistent deadlock while building Poly/ML 5.9. Specifically, it freezes (for at least 2 hours) right after showing "*****Writing object code******". This is on a machine with 32 cores and lots of background compilation going on at the same time (i.e. with a very high load), and it happened at least 3 times already.
   It could be a bug in Poly/ML or even an unrelated bug, but I suspect that glibc's condition variable bug, or a related one, might not be completely fixed yet, even after applying your patch...

   + Malte Skarupke says:
     September 18, 2022 at 12:01

     Thanks, I added a warning to the text. I wouldn't be surprised if there were issues with the patch. This is complicated code. I have only proved it correct, not tried it. (Though it did pass all glibc tests.) I would have expected bugs with one of the later patches in the series, not the first... I'll re-review it, and also check if something needs to change about it when it is applied against the new glibc version.