https://probablydance.com/2022/12/05/fine-grained-locking-with-two-bit-mutexes/ Probably Dance I can program and like games Fine-grained Locking with Two-Bit Mutexes by Malte Skarupke Lets say you want to have a mutex for every item in a list with 10k elements. It feels a bit wasteful to use a std::mutex for each of those elements. In Linux std::mutex is 40 bytes, in Windows it's 80 bytes. But mutexes don't need to be that big. You can fit a mutex into two bits, and it's going to be fast. So we could fit a mutex into the low bits of a pointer. If your container already stores pointers, you might be able to store a mutex for each element with zero space overhead, not even any extra operations during initialization. You'd pay no cost until you use a mutex for the first time. Lets start with a mutex that uses one byte. It's easy now that C++ has added futexes to the standard: struct one_byte_mutex { void lock() { if (state.exchange(locked, std::memory_order_acquire) == unlocked) return; while (state.exchange(sleeper, std::memory_order_acquire) != unlocked) state.wait(sleeper, std::memory_order_relaxed); } void unlock() { if (state.exchange(unlocked, std::memory_order_release) == sleeper) state.notify_one(); } private: std::atomic state{ unlocked }; static constexpr uint8_t unlocked = 0; static constexpr uint8_t locked = 0b01; static constexpr uint8_t sleeper = 0b10; }; The futex operations on here are the call to "wait()" and "notify_one ()". These are like simpler versions of condition variables. The "state.wait(sleeper)" call will put the thread to sleep only if state ==sleeper. And "notify_one()" will wake one thread that went to sleep on this variable. You can do this on any atomic variable now. There are no special requirements. How can you sleep and notify on anything? The kernel has a hash table to store sleeping threads which is indexed by the address of the variable. So all you need is a unique address. Most operating systems support this, even Windows. (who, when they copied this idea from Linux, had the chutzpah to patent futexes) If an operating system doesn't have this, this is implemented with a global hash table in the standard library. This mutex is optimized for the common case when it is unlocked and there is no contention. In that case we only have to do one exchange () when locking, and one exchange() when unlocking. It's a bit tricky to convince yourself that this is correct. I verified the implementation in TLA+. The main trick to keeping it simple is that we only try to set the state to "locked" once. If that doesn't work, we instead set it to "sleeper". This is necessary because we don't know how many threads are sleeping, and unlock() clears both bits. So if there was one sleeping thread, it has to set the "sleeper" flag just in case there are others. One tricky interleaving is if a new thread comes in and does the initial "exchange()" call at a bad time, clearing the "sleeper" bit and causing an unlocking thread to not call notify_one(). In that case the newly entering thread also sets the sleeper flag, so it will call notify_one() when it unlocks. Two Bit Mutex The one_byte_mutex is simple and storing a mutex in one byte is good. Storing it in two bits, say the low bits of a pointer, is trickier. Here is an implementation that does that: template struct pointer_with_mutex { T* get() const { uint64_t masked = pointer.load(std::memory_order_relaxed) & ~both_bits; return reinterpret_cast(masked); } void set(T* ptr) { static_assert(std::alignment_of::value >= 4); uint64_t as_int = reinterpret_cast(ptr); uint64_t old = pointer.load(std::memory_order_relaxed); while (!pointer.compare_exchange_weak(old, (old & both_bits) | as_int, std::memory_order_relaxed)) { } } void lock() { uint64_t old = pointer.load(std::memory_order_relaxed); if (!(old & both_bits) && pointer.compare_exchange_strong(old, old | locked, std::memory_order_acquire)) return; for(;;) { if (old & sleeper) { pointer.wait(old, std::memory_order_relaxed); old = pointer.load(std::memory_order_relaxed); } else if (pointer.compare_exchange_weak(old, old | sleeper, std::memory_order_acquire)) { if (!(old & both_bits)) return; pointer.wait(old | sleeper, std::memory_order_relaxed); old = pointer.load(std::memory_order_relaxed); } } } void unlock() { uint64_t old = pointer.fetch_and(~both_bits, std::memory_order_release); if (old & sleeper) pointer.notify_one(); } private: std::atomic pointer{ 0 }; static constexpr uint64_t locked = 0b01; static constexpr uint64_t sleeper = 0b10; static constexpr uint64_t both_bits = locked | sleeper; }; This one is significantly larger, but it's a direct translation from the above, just replacing "exchange" with "compare_exchange" to leave the other bits unaffected. Plus some extra conditionals to skip the compare_exchange when we can. (my first attempt was to just replace exchange() with fetch_or(), which would lead to simpler code, but that just uses compare_exchange internally, in a way that was noticeably slower) The thing to point out is that none of the code depends on which bits we use, or on what the remaining bits are used for. In this case I use them for a pointer, which has to be at least four-byte-aligned, but you could use any two bits for the mutex and store anything in the remaining bits. Performance How does this perform? It's a bit slow when the mutex is contended, but it's actually faster than std::mutex for an unlocked mutex. Here are the timings on Windows (compiled with Visual Studio 2019): lock/unlock lock/unlock many single-threaded threads std::mutex 22ns 50ns one_byte_mutex 10ns 67ns pointer_with_mutex 18ns 94ns Here are the timings on Linux (compiled with clang-12 against libc++-12) lock/unlock lock/unlock many single-threaded threads std::mutex 12ns 71ns one_byte_mutex 8ns 228ns pointer_with_mutex 15ns 255ns This is running a benchmark where I call lock(); unlock(); in a loop. The first column is running the benchmark with a single thread, so we always hit the fast path where we don't have to go to sleep. The second column is running with sixteen threads, so the mutex will often be locked. I then divide the length of the benchmark by the number of lock() unlock() calls to get the time for the average call. These are running on the same machine, booted into either Windows 10 or Ubuntu 20.04. So if you expect that you will usually find your mutex unlocked, these will actually be both faster and smaller for you than std::mutex. But if you have lots of contention, you may have to put a bit more work into these to make them fast. (my first guess would be that this particular benchmark would benefit from spinning for a bit before sleeping, because the critical section is small. My second guess is that these futexes aren't implemented efficiently in the standard library yet. I didn't look into either of these guesses) Four Mutexes per Byte So if we only need two bits for a mutex, can we store four mutexes in a single byte? Maybe, but you can't control which mutex you want to wake up. futexes work by having a hash table with the address of the futex. And you can't address bits, you can only address bytes. So all four mutexes would have the same address, so they would all be the same futex. You could probably still make it work by using notify_all() instead of notify_one(), and that might be fine if you expect the mutex to almost always sit idle. Or alternatively you could have a look at the hash table that your standard library uses to implement futexes in userspace, copy it, and change the key to not just be a pointer. But I'll leave that as an exercise for the reader. Code The code for this is available here. It's released under the boost license. Also I'm trying out a donation button for the first time. If open source work like this is valuable for you, consider paying for it. The recommended donation is $20 (or your local cost for an item on a restaurant menu) for individuals, or $1000 for organizations. (or your local cost of hiring a contractor for a day) But any amount is appreciated: Make a one-time donation Thanks! I have no idea how much this is worth to people. Feedback appreciated. Donate Share this: * Twitter * Facebook * Like this: Like Loading... Related Published: December 5, 2022 Filed Under: Programming Tags: C++ : mutex 2 Comments to "Fine-grained Locking with Two-Bit Mutexes" 1. [9ab5c] Markus says: December 5, 2022 at 20:53 Can you share the TLA+ spec? Reply + [32d3e] Malte Skarupke says: December 6, 2022 at 17:42 It was already in the code repository. Or at least I thought it was, but turns out I had only included the file for the two_bit_mutex. Now both files are there: https://github.com/skarupke/two_bit_mutex Reply Leave a Reply Cancel reply Enter your comment here... [ ] Fill in your details below or click an icon to log in: * * * * Gravatar Email (required) (Address never made public) [ ] Name (required) [ ] Website [ ] WordPress.com Logo You are commenting using your WordPress.com account. ( Log Out / Change ) Twitter picture You are commenting using your Twitter account. ( Log Out / Change ) Facebook photo You are commenting using your Facebook account. ( Log Out / Change ) Cancel Connecting to %s [ ] Notify me of new comments via email. [ ] Notify me of new posts via email. [Post Comment] [ ] [ ] [ ] [ ] [ ] [ ] [ ] D[ ] This site uses Akismet to reduce spam. Learn how your comment data is processed. << Previous Post Search for: [ ] [Search] Recent Posts * Fine-grained Locking with Two-Bit Mutexes * Finding the "Second Bug" in glibc's Condition Variable * Sudoku Variants as Playful Proof Practice * Reasons why Babies Cry in the First Three Months, How to Tell The Cries Apart, and What to Do * Automated Game AI Testing Archives * December 2022 * September 2022 * June 2022 * February 2022 * January 2022 * October 2021 * July 2021 * April 2021 * January 2021 * November 2020 * October 2020 * August 2020 * July 2020 * June 2020 * May 2020 * April 2020 * March 2020 * January 2020 * December 2019 * September 2019 * August 2019 * June 2019 * April 2019 * March 2019 * June 2018 * May 2018 * April 2018 * January 2018 * December 2017 * November 2017 * October 2017 * September 2017 * August 2017 * February 2017 * January 2017 * December 2016 * November 2016 * June 2016 * April 2016 * March 2016 * February 2016 * December 2015 * September 2015 * July 2015 * June 2015 * May 2015 * February 2015 * January 2015 * December 2014 * November 2014 * October 2014 * September 2014 * August 2014 * June 2014 * May 2014 * April 2014 * March 2014 * February 2014 * January 2014 * October 2013 * September 2013 * August 2013 * May 2013 * February 2013 * January 2013 * December 2012 * November 2012 * October 2012 * August 2012 * July 2012 * April 2012 * March 2012 * February 2012 * January 2012 * October 2011 * September 2011 * August 2011 * July 2011 * June 2011 * May 2011 Categories * Children * Games * Links * Math * Politics and Economics * Programming * Uncategorized Meta * Register * Log in * Entries feed * Comments feed * WordPress.com [ ] [Search] Blog at WordPress.com. * Follow Following + [wpcom-] Probably Dance Join 167 other followers [ ] Sign me up + Already have a WordPress.com account? Log in now. * + [wpcom-] Probably Dance + Customize + Follow Following + Sign up + Log in + Copy shortlink + Report this content + View post in Reader + Manage subscriptions + Collapse this bar %d bloggers like this: [b]