Futexes in TLA+

Lorin Hochstein · formal methods · October 5, 2024

Justine Tunney recently wrote a blog post titled The Fastest Mutexes where she describes how she implemented mutexes in Cosmopolitan Libc. The post discusses how her implementation uses futexes by way of Mike Burrows's nsync library. From her post:

    nsync enlists the help of the operating system by using futexes. This is a great abstraction invented by Linux some years ago, that quickly found its way into other OSes. On MacOS, futexes are called ulock. On Windows, futexes are called WaitOnAddress(). The only OS Cosmo supports that doesn't have futexes is NetBSD, which implements POSIX semaphores in kernelspace, and each semaphore sadly requires creating a new file descriptor. But the important thing about futexes and semaphores is they allow the OS to put a thread to sleep. That's what lets nsync avoid consuming CPU time when there's nothing to do.

Before I read this post, I had no idea what futexes were or how they worked. I figured a good way to learn would be to model them in TLA+.

Note: I'm going to give a simplified account of how futexes work. In addition, I'm not an expert on this topic. In particular, I'm not a kernel programmer, so there may be important details I get wrong here.

Mutexes

Readers who have done multithreaded programming before are undoubtedly familiar with mutexes: they are a type of lock that allows the programmer to enforce mutual exclusion, so that we can guarantee that at most one thread accesses a resource, such as a shared variable. The locking mechanism is implemented by the operating system: locking and unlocking the mutex ultimately involves a system call. If you were programming in C on a Unix-y system like Linux, you'd use the pthreads API to access the mutex objects. Whichever pthreads implementation you use (e.g., glibc, musl) will make the relevant system calls for you.

#include <pthread.h>

...

// create a mutex
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&mutex);
// mutex is now locked
pthread_mutex_unlock(&mutex);
// mutex is now unlocked
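To make the API concrete, here's a minimal sketch (my own illustrative example, not code from Justine's post) of two threads using a mutex to protect a shared counter:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mutex);
        counter++;              // at most one thread executes this at a time
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  // always 200000 with the mutex
    return 0;
}

Without the lock/unlock pair, the two increments can interleave and the final count can come up short; with the mutex, mutual exclusion guarantees the expected result.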
Modeling a mutex in TLA+

Before we get into futexes, let's start off by modeling the desired behavior of a mutex in TLA+. I'll use the PlusCal algorithm language for my model, which can be translated directly to a TLA+ model (see my mutex.tla file).

---- MODULE mutex ----
CONSTANT Processes

(*--algorithm MutualExclusion
variables lock = {};

process proc \in Processes
begin
  ncs: skip;
  acq: await lock = {};
       lock := {self};
  cs:  skip;
  rel: lock := lock \ {self};
       goto ncs;
end process;
end algorithm; *)

Modeling threads (Processes)

This model defines a fixed set of Processes. You should really think of these as threads, but there's a convention in academia to refer to them as processes, so I used that convention. You can think of Processes in this model as a set of thread ids.

Modeling the lock

This model has only one explicit variable, named lock, which is my model of the mutex lock. I've modeled it as a set. When the lock is free, the set is empty, and when the lock is held by a process, the set contains a single element, the process id.

Process states

PlusCal models also have an implicit variable, pc, for the program counter. The program counter indicates which state each process is in. Each process can be in one of four states. We need to give a label to each of these states in our model.

Label  Description
ncs    in the non-critical section (lock is not held)
acq    waiting to acquire the lock
cs     in the critical section (lock is held)
rel    releasing the lock

We want to make sure that two processes are never in the critical section at the same time. We can express this desired property in TLA+ like this:

InCriticalSection(p) == pc[p] = "cs"

MutualExclusion == \A p1,p2 \in Processes :
    InCriticalSection(p1) /\ InCriticalSection(p2) => p1 = p2

We can specify MutualExclusion as an invariant and use the TLC model checker to verify that our specification satisfies this property. Check out mutex.cfg for how to configure TLC to check the invariant.

Overhead of traditional mutexes

With a traditional mutex, the application makes a system call every time the mutex is locked or unlocked. If the common usage pattern for an application is that only one thread tries to take the mutex, then you're paying a performance penalty for having to execute those system calls. If the mutex is being acquired and released in a tight loop, then the time spent servicing the system calls could presumably be a substantial fraction of the execution time of the loop.

for(...) {
    pthread_mutex_lock(&mutex);
    // do work here
    pthread_mutex_unlock(&mutex);
}

I don't know how high these overheads are in practice, but Justine Tunney provides some numbers in her blog post, as well as a link to Mark Waterman's mutex shootout with more performance numbers. Ideally, we would only pay the performance penalty for system calls when the mutex was under contention: when there were multiple threads trying to acquire the lock.

Futexes: avoid the syscalls where possible

Futexes provide a mechanism for avoiding system calls for taking a mutex lock in cases where the lock is not under contention. More specifically, futexes provide a set of low-level synchronization primitives. These primitives allow library authors to implement mutexes in such a way that they avoid making system calls when possible. Application programmers don't interact with futexes directly; they're hidden behind APIs like pthread_mutex_lock.

Primitives: wait and wake

The two primary futex primitives are a wait system call and a wake system call. Each of them takes an integer pointer as an argument, which I call futex. Here's a simplified C function prototype for each of them.

void futex_wait(int *futex);
void futex_wake(int *futex);

(Actual versions of these return status codes, but I won't be using those return codes in this blog post.)

Note that the futex_wait prototype shown above is incomplete: it needs to take an additional argument to guarantee correctness, but we'll get to that later.

I want to start off by providing some intuition on how to use futexes. When a thread calls the futex_wait system call, the kernel puts the thread to sleep until another thread calls futex_wake with the same futex argument.

Using primitives to build mutexes that avoid system calls

OK, so how do we actually use these things to create mutexes? Here's a naive (i.e., incorrect) implementation of lock and unlock functions that implement mutual exclusion using the futex calls. The lock function checks if the lock is free. If it is, it takes the lock; otherwise it waits to be woken up and then tries again.

#define FREE 0
#define ACQUIRED 1
#define CONTENDED 2

void futex_wait(int *futex);
void futex_wake(int *futex);

/**
 * Try to acquire the lock. On failure, wait and then try again.
 */
void lock(int *futex) {
    bool acquired = false;
    while (!acquired) {
        if (*futex == FREE) {
            *futex = ACQUIRED;
            acquired = true;
        } else {
            *futex = CONTENDED;
            futex_wait(futex);
        }
    }
}

/**
 * Release lock, wake threads that are waiting on lock
 */
void unlock(int *futex) {
    int prev = *futex;
    *futex = FREE;
    if (prev == CONTENDED) {
        futex_wake(futex);
    }
}
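For a sense of how these functions would be used, here's the earlier tight loop rewritten against them (hypothetical usage; futex_word is a name I made up for this sketch):

int futex_word = FREE;   // shared between threads, lives in userspace

for(...) {
    lock(&futex_word);
    // do work here
    unlock(&futex_word);
}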
Note that futex is just an ordinary pointer. In the fast path, the lock function just sets the futex to ACQUIRED; no system call is necessary. It's only when the futex is not free that it has to make the futex_wait system call. Similarly, on the unlock side, the futex_wake system call happens only when the futex is in the CONTENDED state.

Now's a good time to point out that futex is short for fast userspace mutex. The reason it's fast is that we can (sometimes) avoid system calls. And the reason we are able to avoid system calls is that, in the fast path, the threads coordinate by modifying a memory location that is accessible in userspace. By userspace, we mean that our futex variable is just an ordinary pointer that the threads all have direct access to: no system call is required to modify it. By contrast, when we call futex_wait and futex_wake, the kernel needs to read and modify kernel data structures, hence a system call is required.

The code above should provide you with an intuition of how futexes are supposed to work. The tricky part is writing the algorithm in such a way as to guarantee correctness for all possible thread schedules. There's a reason that Ulrich Drepper wrote a paper titled Futexes are Tricky: it's easy to get the lock/unlock methods wrong.

Why futex_wait needs another argument

There are many potential race conditions in our initial lock/unlock implementation, but let's focus on one in particular: the futex gets freed after the lock function checks whether it's free, but before it calls futex_wait. Here's what the scenario looks like (think of the red arrows as being like breakpoints):

[Figure: two threads shown side by side. The locking thread observes that the futex is not FREE; before it reaches its futex_wait call, the unlocking thread sets *futex = FREE and calls futex_wake, which wakes nobody. The locking thread then calls futex_wait and sleeps, having missed the wakeup.]

We need to prevent the situation where the unlock thread completes after the *futex == FREE check but before the futex_wait system call. We can prevent this by modifying the futex_wait function prototype to pass the value we expect the futex to have. It looks like this:

void futex_wait(int *futex, int val);

The lock code then looks like this instead:

void lock(int *futex) {
    if (*futex == FREE) {
        *futex = ACQUIRED;
    } else {
        *futex = CONTENDED;
        futex_wait(futex, CONTENDED);
        // try again after waking up
        lock(futex);
    }
}

The futex_wait system call will check that *futex == val. So, if *futex gets changed, the function will return immediately rather than sleeping. Here's the FUTEX_WAIT section of the Linux futex man page, which hopefully should be clear now.

    FUTEX_WAIT (since Linux 2.6.0)
        This operation tests that the value at the futex word pointed to by the address uaddr still contains the expected value val, and if so, then sleeps waiting for a FUTEX_WAKE operation on the futex word. The load of the value of the futex word is an atomic memory access (i.e., using atomic machine instructions of the respective architecture). This load, the comparison with the expected value, and starting to sleep are performed atomically and totally ordered with respect to other futex operations on the same futex word. If the thread starts to sleep, it is considered a waiter on this futex word. If the futex value does not match val, then the call fails immediately with the error EAGAIN.
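On Linux there is no dedicated futex_wait/futex_wake function in libc; both are operations of the futex(2) system call. Here's a minimal sketch of how the two wrappers used in this post might be realized (my own sketch, with error handling and the timeout argument omitted):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <stddef.h>
#include <unistd.h>

// Sleep until woken, but only if *futex still equals val
// (the kernel performs this check atomically before sleeping).
void futex_wait(int *futex, int val) {
    syscall(SYS_futex, futex, FUTEX_WAIT, val, NULL, NULL, 0);
}

// Wake at most one thread currently waiting on this futex word.
void futex_wake(int *futex) {
    syscall(SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 0);
}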
Atomic operations

In order for the lock/unlock implementations to guarantee correctness, we need to rely on what are called atomic operations when reading and modifying the futex across multiple threads. These are operations that the hardware guarantees are performed atomically, so that there are no possible race conditions. In my futex model, I assumed the existence of three atomic operations:

1. atomic store
2. atomic exchange
3. atomic compare and exchange

Atomic store isn't very interesting: it just says that we can atomically set the value of a variable, i.e., that when we do something like this, it happens atomically.

*futex = FREE

In PlusCal, atomic stores are just modeled as assigning a value to a variable, so there's not much else to say.

Atomic exchange

Atomic exchange looks like this:

oldval = atomic_exchange(x, newval)

You give atomic exchange two arguments: a variable (x) you want to modify, and the new value (newval) you want it to have. The atomic_exchange function will atomically set x=newval and return the value x had before the assignment. In PlusCal, I modeled it as a macro. Macros can't return values, so we need to pass in oldval as an argument.

macro atomic_exchange(x, oldval, newval)
begin
    oldval := x;
    x := newval;
end macro;

Then I can invoke it like this in my PlusCal model:

atomic_exchange(mem[a], uprev, Free);

And the resulting behavior is:

uprev := mem[a];
mem[a] := Free;

Atomic compare and exchange

Atomic compare and exchange is also known as compare-and-swap. It looks like this:

oldval = atomic_compare_exchange(x, expecting, newval)

It's similar to atomic exchange, except that it only performs the exchange if the variable x contains the value expecting. Basically, it's an atomic version of this:

oldval = x;
if (x == expecting)
    x = newval;

In my PlusCal model, I also implemented it as a macro:

macro atomic_compare_exchange(x, oldval, expecting, newval)
begin
    oldval := x;
    if x = expecting then
        x := newval;
    end if;
end macro;

Futex modeling basics

Here are the basic constants and variables in my model:

CONSTANTS Processes, Addresses, Free, Acquired, Contended

(*--algorithm futex

variables
    mem = [x \in Addresses |-> Free],
    a \in Addresses,
    ...

In addition to modeling Processes, my futex model also models a set of memory addresses as Addresses. I also defined three constants, Free, Acquired, and Contended, which have the same role as FREE, ACQUIRED, and CONTENDED in the example C code above. I model memory as a function (mem) that maps addresses to values, as well as a specific memory address (a). I use a as my futex.

At the top level, the algorithm should look familiar: it's basically the same as the mutex.tla one, except that it calls acquire_lock and release_lock procedures.

process proc \in Processes
begin
    ncs: skip;
    acq: call acquire_lock();
    cs:  skip;
    rel: call release_lock();
         goto ncs;
end process;
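Before diving into the lock procedures, it's worth noting that the three atomic operations assumed above correspond to primitives that C11 provides in <stdatomic.h>. A rough sketch of the correspondence (my own mapping, not from the post; note that C11's compare-exchange takes the expected value by pointer and writes the observed value back through it, which plays the role of the macro's oldval argument):

#include <stdatomic.h>

void atomics_demo(atomic_int *futex) {
    // atomic store
    atomic_store(futex, 0);

    // atomic exchange: set to 1, get the old value back
    int oldval = atomic_exchange(futex, 1);

    // atomic compare and exchange: if *futex == 1, set it to 2.
    // On failure, C11 writes the value it saw back into 'expecting',
    // so 'expecting' ends up holding the old value either way.
    int expecting = 1;
    atomic_compare_exchange_strong(futex, &expecting, 2);
    (void)oldval;
}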
acquire_lock: implementing lock

I based my implementation of the lock on a simplified version of the one in Ulrich Drepper's Futexes are Tricky paper. Here's the model for acquiring the lock. It doesn't render particularly well in WordPress, so you may want to view the futex.tla file on GitHub directly.

procedure acquire_lock()
variable lprev;
begin
  \* Attempt to acquire the lock
  Lcmpx1:
    atomic_compare_exchange(mem[a], lprev, Free, Acquired);
  Ltest:
    while lprev # Free do
      \* Mark the lock as contended, assuming it's in the acquired state
      Lcmpx2:
        atomic_compare_exchange(mem[a], lprev, Acquired, Contended);
        if lprev # Free then
          call_wait: call futex_wait(a, Contended);
        end if;
      \* Attempt to acquire the lock again
      Lcmpx3:
        atomic_compare_exchange(mem[a], lprev, Free, Contended);
    end while;
  \* if we reach here, we have the lock
  Lret:
    return;
end procedure;

Note that we need three separate atomic_compare_exchange calls to implement this (each with different arguments!). Yipe!

release_lock: implementing unlock

The unlock implementation is much simpler. We just set the futex to Free and then wake the waiters.

procedure release_lock()
variable uprev;
begin
  u_xch:
    atomic_exchange(mem[a], uprev, Free);
  u_wake:
    if uprev = Contended then
      call futex_wake(a);
    end if;
  u_ret:
    return;
end procedure;

Modeling the futex_wait/futex_wake calls

Finally, it's not enough in our model to just invoke futex_wait and futex_wake: we need to model the behavior of these calls as well! I'll say a little bit about the relevant data structures in the Linux kernel and how I modeled them.

Kernel data structures

The Linux kernel hashes futexes into buckets, and each of these buckets is associated with a futex_hash_bucket structure. For the purposes of our TLA+ model, the fields we care about are:

* a queue of threads (tasks) that are waiting on the futex
* a lock for protecting the structure

The kernel also uses a structure called a wake queue (see here for more details) to keep a list of the tasks that have been selected to be woken up. I modeled this list of threads-to-be-woken as a set. Here are the variables:

variables
    ...
    waitq = [x \in Addresses |-> <<>>], \* a map of addresses to wait queues
    qlock = {}, \* traditional mutex lock used by the kernel on the wait queue
    wake = {};  \* processes that have been sent a signal to wake up

I made the following assumptions to simplify my model:

* every futex (address) is associated with one queue, rather than hashing and bucketing
* I used a global lock instead of a per-queue lock

futex_wait

Here's the basic futex_wait algorithm (links go to Linux source code):

1. Take the lock.
2. Retrieve the current value of the futex.
3. Check if the passed value matches the futex value. If not, return.
4. Add the thread to the wait queue.
5. Release the lock.
6. Wait to be woken up.

Note: I assume that calling schedule() inside of the kernel at this point has the effect of putting the thread to sleep, but as I said earlier, I'm not a kernel programmer, so I'm not familiar with this code at all.

Here's my PlusCal model:

procedure futex_wait(addr, val)
begin
  wt_acq:
    await qlock = {};
    qlock := {self};
  wt_valcheck:
    if val /= mem[addr] then
      qlock := {};
      return;
    end if;
  \* Add the process to the wait queue for this address
  wt_enq:
    waitq[addr] := Append(waitq[addr], self);
    qlock := {};
  wt_wait:
    await self \in wake;
    wake := wake \ {self};
    return;
end procedure;

futex_wake

Here's what the futex_wake algorithm looks like:

1. Take the lock.
2. Add the tasks to be woken up to the wake queue.
3. Release the lock.
4. Wake the tasks on the wake queue.

Here's my PlusCal model of it. I chose to wake only one process in my model, but we could have woken all of the waiting processes.
procedure futex_wake(addr)
variables nxt = {}
begin
  wk_acq:
    await qlock = {};
    qlock := {self};
  wk_deq:
    if waitq[addr] /= <<>> then
      nxt := {Head(waitq[addr])};
      waitq[addr] := Tail(waitq[addr]);
    end if;
  wk_rel:
    qlock := {};
  wk_wake:
    wake := wake \union nxt;
    return;
end procedure;

You can see the full model on GitHub at futex.tla.

Checking for properties

One of the reasons to model in TLA+ is to check properties of the specification. I care about three things with this specification:

1. It implements mutual exclusion.
2. It doesn't make system calls when there's no contention.
3. Processes can't get stuck waiting on the queue forever.

Mutual exclusion

We check mutual exclusion the same way we did in our mutex.tla specification, by asserting that there are never two different processes in the critical section at the same time. This is our invariant:

MutualExclusion == \A p1,p2 \in Processes :
    pc[p1]="cs" /\ pc[p2]="cs" => p1=p2

No contention means no system calls

The whole point of using futexes to implement locks was to avoid system calls in the cases where there's no contention. Even if our algorithm satisfies mutual exclusion, that doesn't mean that it avoids these system calls. I wrote an invariant for the futex_wait system call that asserts that we only make the system call when there's contention. I called the invariant OnlyWaitUnderContention, and here's how I defined it, along with several helper definitions:

LockIsHeld == mem[a] /= Free

ProcessAttemptingToAcquireLock(p) ==
    pc[p] \in {"Lcmpx1", "Ltest", "Lcmpx2", "call_wait", "Lcmpx3",
               "wt_acq", "wt_valcheck", "wt_enq", "wt_wait"}

Contention == LockIsHeld /\ \E p \in Processes : ProcessAttemptingToAcquireLock(p)

OnlyWaitUnderContention == \A p \in Processes : pc[p]="call_wait" => Contention

Nobody gets stuck waiting

Recall earlier in the blog post how we had to modify the prototype of the futex_wait system call to take an additional argument, in order to prevent a race condition that could leave a process waiting forever on the queue. We want to make sure that we have actually addressed that risk. Note that the comments in the Linux source code specifically call out this risk.

I checked this by defining an invariant stating that it can never happen that a process is waiting while all of the other processes are past the point where they could wake up the waiter.

Stuck(x) ==
    /\ pc[x] = "wt_wait"
    /\ x \notin wake
    /\ \A p \in Processes \ {x} :
        \/ pc[p] \in {"ncs", "u_ret"}
        \/ /\ pc[p] \in {"wk_rel", "wk_wake"}
           /\ x \notin nxt[p]

NoneStuck == ~ \E x \in Processes : Stuck(x)

Final check: refinement

In addition to checking mutual exclusion, we can check that our futex-based lock model (futex.tla) implements our original high-level mutex model (mutex.tla) by means of a refinement mapping. To do that, we need to define mappings between the futex model variables and the mutex model variables. The mutex model has two variables:

* lock - the model of the lock
* pc - the program counters for the processes

I called my mappings lockBar and pcBar.
Here's what the mapping looks like:

InAcquireLock(p) == pc[p] \in {"Lcmpx1", "Ltest", "Lcmpx2", "call_wait", "Lcmpx3", "Lret"}

InFutexWait(p) == pc[p] \in {"wt_acq", "wt_valcheck", "wt_enq", "wt_wait"}

InReleaseLockBeforeRelease(p) == pc[p] \in {"u_xch"}

InReleaseLockAfterRelease(p) == pc[p] \in {"u_wake", "u_ret"}

InFutexWake(p) == pc[p] \in {"wk_acq", "wk_deq", "wk_rel", "wk_wake"}

lockBar == {p \in Processes :
    \/ pc[p] \in {"cs", "rel"}
    \/ InReleaseLockBeforeRelease(p)}

pcBar == [p \in Processes |->
    CASE pc[p] = "ncs" -> "ncs"
      [] pc[p] = "cs"  -> "cs"
      [] pc[p] = "acq" -> "acq"
      [] InAcquireLock(p) -> "acq"
      [] InFutexWait(p) -> "acq"
      [] pc[p] = "rel" -> "rel"
      [] InReleaseLockBeforeRelease(p) -> "rel"
      [] InReleaseLockAfterRelease(p) -> "ncs"
      [] InFutexWake(p) -> "ncs"
    ]

mutex == INSTANCE mutex WITH lock <- lockBar, pc <- pcBar

We can then define a property that says that our futex specification implements the mutex specification:

ImplementsMutex == mutex!Spec

Finally, in our futex.cfg file, we can specify that we want to check the invariants as well as this behavioral property. The relevant config lines look like this:

INVARIANT
    MutualExclusion
    OnlyWaitUnderContention
    NoneStuck

PROPERTY
    ImplementsMutex

You can find my repo with these models at https://github.com/lorin/futex-tla.
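As a coda, here is how the algorithm in the model might look translated back into executable C. This is my own Linux-only sketch, using C11 atomics and the raw futex(2) syscall; it follows the acquire_lock/release_lock structure of the model, but it's an illustration, not the code from nsync or from Drepper's paper:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

#define FREE 0
#define ACQUIRED 1
#define CONTENDED 2

static void futex_wait(atomic_int *f, int val) {
    syscall(SYS_futex, f, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *f) {
    syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* Mirrors acquire_lock: three compare-exchanges (Lcmpx1/2/3).
 * After each call, prev holds the value the futex had before it. */
static void lock(atomic_int *f) {
    int prev = FREE;
    atomic_compare_exchange_strong(f, &prev, ACQUIRED);      /* Lcmpx1 */
    while (prev != FREE) {
        prev = ACQUIRED;
        atomic_compare_exchange_strong(f, &prev, CONTENDED); /* Lcmpx2 */
        if (prev != FREE)
            futex_wait(f, CONTENDED);                        /* call_wait */
        prev = FREE;
        atomic_compare_exchange_strong(f, &prev, CONTENDED); /* Lcmpx3 */
    }
}

/* Mirrors release_lock: exchange, then wake only under contention. */
static void unlock(atomic_int *f) {
    if (atomic_exchange(f, FREE) == CONTENDED)               /* u_xch */
        futex_wake(f);                                       /* u_wake */
}

static atomic_int futex_word = FREE;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock(&futex_word);
        counter++;              /* the critical section */
        unlock(&futex_word);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 200000 if mutual exclusion holds */
    return 0;
}

Under no contention, lock and unlock here touch only the userspace futex word; the syscalls happen only on the contended paths, which is exactly the property the OnlyWaitUnderContention invariant checks in the model.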