[HN Gopher] Landing a new syscall, part 1: What is futex?
       ___________________________________________________________________
        
       Landing a new syscall, part 1: What is futex?
        
       Author : 1_player
       Score  : 108 points
       Date   : 2022-02-09 12:36 UTC (10 hours ago)
        
 (HTM) web link (www.collabora.com)
 (TXT) w3m dump (www.collabora.com)
        
       | __s wrote:
       | https://lore.kernel.org/lkml/20200612185122.327860-1-andreal...
       | goes over what futex2 introduces
        
         | wbl wrote:
         | Windows has had this one for a while.
        
           | gpderetta wrote:
           | The windows equivalent of futex_wait is WaitOnAddress. I
           | don't think it supports waiting on multiple addresses, but
           | I'm not a WIN32 programmer.
        
             | wbl wrote:
             | WaitForMultipleObjects. And it isn't from the Windows
             | heritage but from VMS.
        
               | gpderetta wrote:
               | WFMO is equivalent to select. This is different.
        
             | sumtechguy wrote:
             | Yeah from the 10 mins I spent in the docs I do not think
             | there is one baked in. You would probably have to roll your
             | own. It would not be pretty. https://docs.microsoft.com/en-
             | us/windows/win32/api/synchapi/...
        
           | ww520 wrote:
           | I don't think Windows has the equivalent of futex2, waiting
           | on multiple addresses. WaitOnAddress() in Windows is the
           | equivalent of futex, waiting on a single address.
           | 
           | WaitForMultipleObjects() in Windows is for acquiring multiple
           | locks, not for waiting on arbitrary addresses. The parameters
           | to WaitForMultipleObjects() are mutex/semaphore/event/etc
           | handles. When WaitForMultipleObjects() returns, the
           | ownerships of the locks are actually acquired. E.g. for
           | semaphore, the count is decremented by
           | WaitForMultipleObjects().
           | 
           | Futex2 is more generic where it just waits on multiple
           | arbitrary addresses. The locking is done separately.
        
           | __s wrote:
           | Relevant that Valve is sponsoring this for Wine's benefit
        
         | gpderetta wrote:
         | While futex_waitv is great, I would have also liked a
         | futex_wakev.
        
           | ot wrote:
           | As an optimization? Would futex support in io_uring be
           | enough? https://lwn.net/Articles/857835/
        
             | gpderetta wrote:
             | Possibly yes! Edit: but I'm having a deja vu. Did we
             | already have this conversation?!?
        
               | ot wrote:
               | Not that I remember :)
               | 
               | I went through that LKML thread and it looks like that
               | particular patch isn't going anywhere, but given the
               | general direction of supporting all syscalls in io_uring,
               | I'm confident it will happen eventually.
               | 
               | Support for wait, in particular, would be strictly more
               | powerful than waitv: you could wait for futexes _and_ IO
               | (or any other completions) at the same time!
        
       | LoveGracePeace wrote:
       | That web site name looked familiar, it's one I used to get tons
       | of recruiter spam from. I added it to my email server's block
       | list some time ago.
        
         | andrewshadura wrote:
         | That's grossly untrue. We never send recruiting spam.
        
         | tyingq wrote:
         | You're probably thinking of https://www.collabera.com/
         | 
         | Note the 'e' instead of the 'o'..."bera", not "bora".
        
       | ww520 wrote:
       | It's important to remember that futex is not a lock; it is a
       | conditional sleeping operation. Its name unfortunately looks
       | similar to mutex but it's not a lock.
       | 
       | A thread calls futex(&flag, FUTEX_WAIT) to sleep. The kernel
       | wakes up the caller thread when the futex flag's value might have
       | been changed, by another thread calling futex(&flag, FUTEX_WAKE).
       | When futex() returns, you don't get a lock. It just means the
       | value of the flag might have been changed.
       | 
       | Since it's not a lock, the value of the flag is not guaranteed to
       | be certain value when futex() returns. The flag can be changed
       | again by another thread between the futex's return and the next
       | instruction of the caller thread. It's important to put the
       | futex() call inside a loop calling compare_and_set() repeatedly
       | to check the expected value of the flag.
        
         | l33t2328 wrote:
         | I've always imagined a lock as a conditional sleeping
         | operation. What's the difference?
        
           | dataflow wrote:
           | I think what the parent was saying was that, for a futex, the
           | kernel wakes you up when the condition _might_ have been
           | satisfied, but spurious wakes are possible too. (I 'm
           | counting the simultaneous waking of multiple threads as
           | spurious wakes here.) That's not the case with locks, which
           | only wake up when the condition is satisfied.
        
         | RustyRussell wrote:
         | > Its name unfortunately looks similar to mutex but it's not a
         | lock.
         | 
         | Hmm, looking at the latest source code, my original quote has
         | been split, but:
         | 
         | * Fast Userspace Mutexes (which I call "Futexes!").
         | 
         | * (C) Rusty Russell, IBM 2002
         | 
         | ...
         | 
         | * "The futexes are also cursed."
         | 
         | * "But they come in a choice of three flavours!"
         | 
         | https://github.com/torvalds/linux/blob/master/kernel/futex/c...
        
         | dpratt71 wrote:
         | Stupid question: Why isn't such logic incorporated into the
         | call itself?
        
           | ww520 wrote:
           | Good question. It's for performance reason. Calling into
           | kernel is expensive. Lock acquired in the user mode is much
           | faster than lock acquired in kernel.
           | 
           | A typical lock acquisition using futex looks like:
           | (a) while !compare_and_swap(&lock, 0, 1) // see if flag is
           | free as 0, lock it as 1         (b)     futex(&lock, WAIT, 1)
           | // sleep until flag changes from 1
           | 
           | (a) runs in user mode and it's very fast. It's just one
           | CMPXCHG assembly instruction. If the lock is free, it's
           | acquired in one instruction as 1 (locked).
           | 
           | If the lock is not free, then do the expensive call into the
           | kernel to sleep via futex at (b). Futex() helps in detecting
           | the change of the value while putting the thread to sleep, to
           | avoid hogging the CPU.
        
             | RustyRussell wrote:
             | Importantly, prior to this (and, hell, even since) state of
             | the art was to try atomically, then yield (or sleep(0))
             | then try usleep.
             | 
             | The kernel had no idea what was going on, so had no idea
             | how to schedule such a thing. It particularly didn't know
             | to wake you when the lock (which it has no idea about)
             | became available.
        
             | chrchang523 wrote:
             | It is worth noting here that this "one assembly
             | instruction" is not that cheap. The hardware on a multicore
             | system does have to perform some locking under the hood to
             | execute that instruction. But yes, it still has enough of
             | an advantage over calling into the kernel to justify the
             | additional usage complexity.
        
               | chacham15 wrote:
               | Im pretty sure the same construct can be implemented
               | without the compare:                 int lock = 0;
               | void AcquireLock(int *lock){         while
               | (ATOMIC_SWAP(lock, 1)){           sleep(10); //or futex
               | or w/e         }       }            void ReleaseLock(int
               | *lock){         ATOMIC_SWAP(lock, 0);       }
        
               | ww520 wrote:
               | The 'lock' variable is shared among threads. Compare is
               | needed to avoid stomping on the lock acquired by another
               | thread.
        
               | ww520 wrote:
               | Yes, CMPXCHG is a relatively more expensive instruction.
               | In x86 it is a memory barrier instruction.
        
             | derefr wrote:
             | The gettimeofday(3) vDSO is pure-userspace code. Why not,
             | then, a futex(3) vDSO that does a while +
             | compare_and_swap(2) in userspace, but then contains a real
             | call to the futex(3) syscall?
        
               | ww520 wrote:
               | What you described is what the phread_mutex_lock() does
               | exactly, which is in user space. Application programmers
               | don't deal with futex directly, they call
               | phread_mutex_lock/phread_mutex_unlock.
        
               | tialaramex wrote:
               | This part of the optimisation is well-known and had been
               | for years before futex was invented, so there's no need
               | to provide it as a vDSO or any other special work,
               | userspace can (and was ready to) do that part anyway.
               | 
               | The futex is clever because previously the heavyweight
               | locking is in the kernel, but all you actually needed was
               | this very flimsy wait feature. BeOS for example, has a
               | "Benaphore" construction where you initialise an
               | expensive heavyweight lock but you try the userspace
               | trick before needing to actually take it. So that looks
               | superficially just as good as a futex...
               | 
               | ... and then you realise oh, the underlying locks need
               | kernel RAM to store their state, so the OS can only allow
               | each process to have so many of them and thus you can't
               | have so many Benaphores, they were expensive after all.
               | But a futex doesn't _use_ any kernel resources when you
               | aren 't calling the OS, Linux doesn't mind if your
               | program uses a billion futexes, that's fine, any
               | particular thread can only be _waiting_ on one of them,
               | and Linux is only tracking the waiting.
        
               | derefr wrote:
               | > This part of the optimisation is well-known and had
               | been for years before futex was invented, so there's no
               | need to provide it as a vDSO or any other special work,
               | userspace can (and was ready to) do that part anyway.
               | 
               | I guess, to me, the semantics of futex(2) on their own
               | seem ill-formed, without the wait + cmpxchg being part of
               | them. It feels like the user shouldn't have access to raw
               | "low-level" calls to futex(2), since there's only one
               | useful way to use those calls, and it's in combination
               | with other code, that's theoretically possible to get
               | wrong.
               | 
               | I'm used to thinking with "building reusable
               | crypto"-tinted glasses -- with crypto libraries, you
               | never want to expose a primitive that needs to be used a
               | certain way to be sound, when you could instead just
               | expose a higher-level primitive that inherently works
               | that one correct way.
               | 
               | Of course, there's nothing inherently _dangerous_ about
               | calling futex(2), so I suppose this kind of thinking is
               | meaningless here.
        
       | gpderetta wrote:
       | The article is nice, aside from the buggy and suboptimal
       | implementations of the mutex.
       | 
       | But I would like to stress that futexes are not just useful to
       | implement mutexes (nor they are even the most interesting
       | feature). They are a generalized waitqueue abstraction that can
       | be used for all kind of signaling primitives. Also the nice thing
       | about futexes is that they do not consume resources when they are
       | not in use: you could implement a fast pathed mutex on top of,
       | say, an eventfd, but allocating a file descriptor for each mutex
       | can potentially run out of resources quickly.
        
         | leshow wrote:
         | I didn't really get the example, futex is itself a syscall, so
         | how does implementing mutex in terms of futex help you create a
         | "fast userspace" mutex? The example presented looks to do
         | almost nothing in userspace. Am I missing something?
        
           | kvhdude wrote:
           | when the cas (compare and swap) succeeds in user space, the
           | subsequent futex call to wait will not be made, i suppose.
        
           | aidenn0 wrote:
           | mutex_lock makes 0 system calls in the fast case (when
           | cmpxchg succeeds on the first try). mutex_unlock makes 1
           | system call always, but TFA mentions that it can be optimized
           | by adding some user-space complexity to check if signalling
           | might be required (if you can demonstrate zero waiters, then
           | the FUTEX_WAKE is not needed).
        
             | gpderetta wrote:
             | Yes, usually you keep a (saturating) waiter count, at the
             | limit just one additional bit.
        
         | mfilion wrote:
         | The mutex example was purely for didactic purposes, and
         | definitely not a complete mutex. It was meant to help
         | demonstrate the basics of how a mutex works. Blog post has been
         | updated to clarify this.
        
         | dataflow wrote:
         | What are some use cases for having so many futexes that would
         | exceed the number of available file descriptors? Are you
         | assuming a low FD limit, or do you really mean
         | millions/billions of futexes?
        
       | waynesonfire wrote:
       | Wow so poorly written. Hope author does a proof read for part 2.
       | Mutual exclusion is tricky to follow and the bad grammer doesn't
       | help.
        
         | JanLikar wrote:
         | > and the bad grammer
         | 
         | I'm sorry but I have to ask; was this intentional?
        
           | softjobs wrote:
           | That's not bad grammar but bad spelling.
        
       | samatman wrote:
       | I know that account balances were just a "motivating example"
       | here, but please, I beg you: don't handle accounts this way.
       | 
       | Double-entry accounting predates computers by many centuries, and
       | solves a problem which computers still have. Use double-entry
       | accounting internally if you're handling money! This means both
       | the liability and the asset side of the equation are balanced
       | when written, and can be (should be, _must be_ ) balanced across
       | both entries as well.
        
         | derefr wrote:
         | In my mental model of a digital financial ledger, it just looks
         | like a table of transfer events, e.g. {timestamp, from_acct=A,
         | to_acct=B, balance_delta=V}.
         | 
         | Which, yes, tracks the counterparty balances as well as your
         | own. But it's not _double entry_ per se. Each transfer only has
         | its data in one place; there 's no second corresponding entry
         | in the table looking like {timestamp, from_acct=B, to_acct=A,
         | balance_delta=-V}. You don't _need_ that second entry; you can
         | already do everything you need with just the one. (E.g. someone
         | 's current balance is the sum of the balance_deltas for which
         | they're the receiver, minus the sum of the balance_deltas for
         | which they're the sender.)
         | 
         | Are you suggesting there's a point in having rows in
         | corresponding pairs like this? If so, what is it? (If it's data
         | integrity, IMHO that should be guaranteed at a lower level,
         | with cryptographic checksums et al.)
        
           | samatman wrote:
           | I'm not sure which of us is confused and it doesn't matter:
           | let us say that there is a confusion about double entry,
           | which I used to have and no longer do.
           | 
           | It's exactly having from: and to: in a single entry which
           | makes for double entry accounting. Or one from: and several
           | to: but that's effectively an elision.
           | 
           | By _implication_ there is a second double-entry on the other
           | side, but that 's often their business, not mine: as an
           | example, the Income statements in my personal ledger are
           | balanced by an Expenses:Payroll on the other side...
           | spiritually, I neither know nor care about the details of how
           | that accounting is done.
           | 
           | In the special case that both accounts are internal to the
           | business, then yes, expense on one account becomes asset on
           | the other. Whether that is one or two (double) entries
           | depends on a bunch of things.
           | 
           | In terms of specific data modeling, I would urge you that
           | _yes_ , the rows corresponding to any debit or credit should
           | include a 'from' field. The reasons are numerous, and have
           | more to do with audit than data integrity per se.
           | 
           | The sufficient reason is that there might not be a
           | counterparty inside your database.
        
         | PoignardAzur wrote:
         | Naive question: how does double-entry accounting help computer
         | accouting systems?
        
           | inopinatus wrote:
           | It's the original CRDT.
        
           | OJFord wrote:
           | Everything is accounted for, you can't accidentally (or
           | 'accidentally') have money come from or go nowhere, since it
           | must balance - account1 going up by a pound is matched to
           | account2 going down a pound.
        
         | exdsq wrote:
         | One of those great moments where domain knowledge results in a
         | substantially different system to the intuitive approach :)
        
       | im3w1l wrote:
       | I'm pretty sure the _mutex_lock_ and _mutex_unlock_ are bugged,
       | as they take a _mutex_ int by value. Would be unfortunate if
       | somone copy pasted those.
        
         | [deleted]
        
         | More-nitors wrote:
         | care to explain more?
        
           | smhenderson wrote:
           | Without having looked into any details I assume the parent is
           | saying that the integer passed by value is a flag that is
           | checked periodically and used to trigger a next step or to
           | release a lock.
           | 
           | If it's passed by value it will never change inside the
           | function and whatever is supposed to happen when it changes
           | will never get executed. The integers should be passed by
           | reference so when they change the function receiving them is
           | aware of the update.
        
             | gpderetta wrote:
             | Most importantly wake and wait are keyed by address. The
             | waker must provide to futex_wake the same address that the
             | waiter passed to futex_wait to wake it.
        
           | mockery wrote:
           | Not GP, but the code in the article passes 'unsigned int
           | mutex' by-value, and then operates on its address (which will
           | potentially be at a different stack address on each
           | invocation, and will certainly be different for different
           | threads.) It won't serve its purpose at all.
           | 
           | To my eye, the code is fundamentally broken in a way that's
           | surprising for an article trying to teach threading, and it's
           | likely confuse someone unfamiliar with these concepts.
        
           | im3w1l wrote:
           | The articles code for mutex_lock is this
           | mutex_lock(unsigned int mutex)       {         while
           | (!cmpxchg(&mutex, UNLOCKED, LOCKED))           futex(&mutex,
           | FUTEX_WAIT, LOCKED);       }
           | 
           | Now consider what happens if someone calls it
           | volatile unsigned int my_mutex = UNLOCKED;
           | mutex_lock(my_mutex);       ...       mutex_unlock(my_mutex);
           | 
           | my_mutex is copied into the mutex parameter. The mutex
           | parameter is updated to locked. mutex_lock exits, which
           | destroys the mutex variable. my_mutex still has the variable
           | UNLOCKED.
        
             | mfilion wrote:
             | Good catch, corrected!
        
               | waynesonfire wrote:
               | One more...
               | 
               | > We use &mutex as the first argument
        
               | andrealmeid wrote:
               | ops, fixed, thanks!
        
         | mfilion wrote:
         | As mentioned above, the mutex example was purely for didactic
         | purposes, and definitely not a complete mutex. It was meant to
         | help demonstrate the basics of how a mutex works. Blog post has
         | been updated to clarify this.
        
       | winrid wrote:
       | Neat timing, was just running gdb on MongoDB and saw it uses
       | futexes.
       | 
       | Was debugging why the instances were hanging... sigh
        
         | unilynx wrote:
         | That doesn't imply MongoDB mentions futexes anywhere in its
         | source code. Applications that use pthreads for their
         | multithreading (which is the vast majority of them) will also
         | show futex syscalls
        
       ___________________________________________________________________
       (page generated 2022-02-09 23:01 UTC)