[HN Gopher] Landing a new syscall, part 1: What is futex?
___________________________________________________________________
Landing a new syscall, part 1: What is futex?
Author : 1_player
Score : 108 points
Date : 2022-02-09 12:36 UTC (10 hours ago)
(HTM) web link (www.collabora.com)
(TXT) w3m dump (www.collabora.com)
| __s wrote:
| https://lore.kernel.org/lkml/20200612185122.327860-1-andreal...
| goes over what futex2 introduces
| wbl wrote:
| Windows has had this one for a while.
| gpderetta wrote:
| The windows equivalent of futex_wait is WaitOnAddress. I
| don't think it supports waiting on multiple addresses, but
| I'm not a WIN32 programmer.
| wbl wrote:
| WaitForMultipleObjects. And it isn't from the Windows
| heritage but from VMS.
| gpderetta wrote:
| WFMO is equivalent to select. This is different.
| sumtechguy wrote:
| Yeah from the 10 mins I spent in the docs I do not think
| there is one baked in. You would probably have to roll your
| own. It would not be pretty. https://docs.microsoft.com/en-
| us/windows/win32/api/synchapi/...
| ww520 wrote:
| I don't think Windows has the equivalent of futex2, waiting
| on multiple addresses. WaitOnAddress() in Windows is the
| equivalent of futex, waiting on a single address.
|
| WaitForMultipleObjects() in Windows is for acquiring multiple
| locks, not for waiting on arbitrary addresses. The parameters
| to WaitForMultipleObjects() are mutex/semaphore/event/etc
| handles. When WaitForMultipleObjects() returns, the
| ownerships of the locks are actually acquired. E.g. for
| semaphore, the count is decremented by
| WaitForMultipleObjects().
|
| Futex2 is more generic where it just waits on multiple
| arbitrary addresses. The locking is done separately.
| __s wrote:
| Relevant that Valve is sponsoring this for Wine's benefit
| gpderetta wrote:
| While futex_waitv is great, I would have also liked a
| futex_wakev.
| ot wrote:
| As an optimization? Would futex support in io_uring be
| enough? https://lwn.net/Articles/857835/
| gpderetta wrote:
| Possibly yes! Edit: but I'm having a deja vu. Did we
| already have this conversation?!?
| ot wrote:
| Not that I remember :)
|
| I went through that LKML thread and it looks like that
| particular patch isn't going anywhere, but given the
| general direction of supporting all syscalls in io_uring,
| I'm confident it will happen eventually.
|
| Support for wait, in particular, would be strictly more
| powerful than waitv: you could wait for futexes _and_ IO
| (or any other completions) at the same time!
| LoveGracePeace wrote:
| That web site name looked familiar, it's one I used to get tons
| of recruiter spam from. I added it to my email server's block
| list some time ago.
| andrewshadura wrote:
| That's grossly untrue. We never send recruiting spam.
| tyingq wrote:
| You're probably thinking of https://www.collabera.com/
|
| Note the 'e' instead of the 'o'..."bera", not "bora".
| ww520 wrote:
| It's important to remember that futex is not a lock; it is a
| conditional sleeping operation. Its name unfortunately looks
| similar to mutex but it's not a lock.
|
| A thread calls futex(&flag, FUTEX_WAIT) to sleep. The kernel
| wakes up the caller thread when the futex flag's value might have
| been changed, by another thread calling futex(&flag, FUTEX_WAKE).
| When futex() returns, you don't get a lock. It just means the
| value of the flag might have been changed.
|
| Since it's not a lock, the value of the flag is not guaranteed to
| be certain value when futex() returns. The flag can be changed
| again by another thread between the futex's return and the next
| instruction of the caller thread. It's important to put the
| futex() call inside a loop calling compare_and_set() repeatedly
| to check the expected value of the flag.
| l33t2328 wrote:
| I've always imagined a lock as a conditional sleeping
| operation. What's the difference?
| dataflow wrote:
| I think what the parent was saying was that, for a futex, the
| kernel wakes you up when the condition _might_ have been
| satisfied, but spurious wakes are possible too. (I 'm
| counting the simultaneous waking of multiple threads as
| spurious wakes here.) That's not the case with locks, which
| only wake up when the condition is satisfied.
| RustyRussell wrote:
| > Its name unfortunately looks similar to mutex but it's not a
| lock.
|
| Hmm, looking at the latest source code, my original quote has
| been split, but:
|
| * Fast Userspace Mutexes (which I call "Futexes!").
|
| * (C) Rusty Russell, IBM 2002
|
| ...
|
| * "The futexes are also cursed."
|
| * "But they come in a choice of three flavours!"
|
| https://github.com/torvalds/linux/blob/master/kernel/futex/c...
| dpratt71 wrote:
| Stupid question: Why isn't such logic incorporated into the
| call itself?
| ww520 wrote:
| Good question. It's for performance reason. Calling into
| kernel is expensive. Lock acquired in the user mode is much
| faster than lock acquired in kernel.
|
| A typical lock acquisition using futex looks like:
| (a) while !compare_and_swap(&lock, 0, 1) // see if flag is
| free as 0, lock it as 1 (b) futex(&lock, WAIT, 1)
| // sleep until flag changes from 1
|
| (a) runs in user mode and it's very fast. It's just one
| CMPXCHG assembly instruction. If the lock is free, it's
| acquired in one instruction as 1 (locked).
|
| If the lock is not free, then do the expensive call into the
| kernel to sleep via futex at (b). Futex() helps in detecting
| the change of the value while putting the thread to sleep, to
| avoid hogging the CPU.
| RustyRussell wrote:
| Importantly, prior to this (and, hell, even since) state of
| the art was to try atomically, then yield (or sleep(0))
| then try usleep.
|
| The kernel had no idea what was going on, so had no idea
| how to schedule such a thing. It particularly didn't know
| to wake you when the lock (which it has no idea about)
| became available.
| chrchang523 wrote:
| It is worth noting here that this "one assembly
| instruction" is not that cheap. The hardware on a multicore
| system does have to perform some locking under the hood to
| execute that instruction. But yes, it still has enough of
| an advantage over calling into the kernel to justify the
| additional usage complexity.
| chacham15 wrote:
| Im pretty sure the same construct can be implemented
| without the compare: int lock = 0;
| void AcquireLock(int *lock){ while
| (ATOMIC_SWAP(lock, 1)){ sleep(10); //or futex
| or w/e } } void ReleaseLock(int
| *lock){ ATOMIC_SWAP(lock, 0); }
| ww520 wrote:
| The 'lock' variable is shared among threads. Compare is
| needed to avoid stomping on the lock acquired by another
| thread.
| ww520 wrote:
| Yes, CMPXCHG is a relatively more expensive instruction.
| In x86 it is a memory barrier instruction.
| derefr wrote:
| The gettimeofday(3) vDSO is pure-userspace code. Why not,
| then, a futex(3) vDSO that does a while +
| compare_and_swap(2) in userspace, but then contains a real
| call to the futex(3) syscall?
| ww520 wrote:
| What you described is what the phread_mutex_lock() does
| exactly, which is in user space. Application programmers
| don't deal with futex directly, they call
| phread_mutex_lock/phread_mutex_unlock.
| tialaramex wrote:
| This part of the optimisation is well-known and had been
| for years before futex was invented, so there's no need
| to provide it as a vDSO or any other special work,
| userspace can (and was ready to) do that part anyway.
|
| The futex is clever because previously the heavyweight
| locking is in the kernel, but all you actually needed was
| this very flimsy wait feature. BeOS for example, has a
| "Benaphore" construction where you initialise an
| expensive heavyweight lock but you try the userspace
| trick before needing to actually take it. So that looks
| superficially just as good as a futex...
|
| ... and then you realise oh, the underlying locks need
| kernel RAM to store their state, so the OS can only allow
| each process to have so many of them and thus you can't
| have so many Benaphores, they were expensive after all.
| But a futex doesn't _use_ any kernel resources when you
| aren 't calling the OS, Linux doesn't mind if your
| program uses a billion futexes, that's fine, any
| particular thread can only be _waiting_ on one of them,
| and Linux is only tracking the waiting.
| derefr wrote:
| > This part of the optimisation is well-known and had
| been for years before futex was invented, so there's no
| need to provide it as a vDSO or any other special work,
| userspace can (and was ready to) do that part anyway.
|
| I guess, to me, the semantics of futex(2) on their own
| seem ill-formed, without the wait + cmpxchg being part of
| them. It feels like the user shouldn't have access to raw
| "low-level" calls to futex(2), since there's only one
| useful way to use those calls, and it's in combination
| with other code, that's theoretically possible to get
| wrong.
|
| I'm used to thinking with "building reusable
| crypto"-tinted glasses -- with crypto libraries, you
| never want to expose a primitive that needs to be used a
| certain way to be sound, when you could instead just
| expose a higher-level primitive that inherently works
| that one correct way.
|
| Of course, there's nothing inherently _dangerous_ about
| calling futex(2), so I suppose this kind of thinking is
| meaningless here.
| gpderetta wrote:
| The article is nice, aside from the buggy and suboptimal
| implementations of the mutex.
|
| But I would like to stress that futexes are not just useful to
| implement mutexes (nor they are even the most interesting
| feature). They are a generalized waitqueue abstraction that can
| be used for all kind of signaling primitives. Also the nice thing
| about futexes is that they do not consume resources when they are
| not in use: you could implement a fast pathed mutex on top of,
| say, an eventfd, but allocating a file descriptor for each mutex
| can potentially run out of resources quickly.
| leshow wrote:
| I didn't really get the example, futex is itself a syscall, so
| how does implementing mutex in terms of futex help you create a
| "fast userspace" mutex? The example presented looks to do
| almost nothing in userspace. Am I missing something?
| kvhdude wrote:
| when the cas (compare and swap) succeeds in user space, the
| subsequent futex call to wait will not be made, i suppose.
| aidenn0 wrote:
| mutex_lock makes 0 system calls in the fast case (when
| cmpxchg succeeds on the first try). mutex_unlock makes 1
| system call always, but TFA mentions that it can be optimized
| by adding some user-space complexity to check if signalling
| might be required (if you can demonstrate zero waiters, then
| the FUTEX_WAKE is not needed).
| gpderetta wrote:
| Yes, usually you keep a (saturating) waiter count, at the
| limit just one additional bit.
| mfilion wrote:
| The mutex example was purely for didactic purposes, and
| definitely not a complete mutex. It was meant to help
| demonstrate the basics of how a mutex works. Blog post has been
| updated to clarify this.
| dataflow wrote:
| What are some use cases for having so many futexes that would
| exceed the number of available file descriptors? Are you
| assuming a low FD limit, or do you really mean
| millions/billions of futexes?
| waynesonfire wrote:
| Wow so poorly written. Hope author does a proof read for part 2.
| Mutual exclusion is tricky to follow and the bad grammer doesn't
| help.
| JanLikar wrote:
| > and the bad grammer
|
| I'm sorry but I have to ask; was this intentional?
| softjobs wrote:
| That's not bad grammar but bad spelling.
| samatman wrote:
| I know that account balances were just a "motivating example"
| here, but please, I beg you: don't handle accounts this way.
|
| Double-entry accounting predates computers by many centuries, and
| solves a problem which computers still have. Use double-entry
| accounting internally if you're handling money! This means both
| the liability and the asset side of the equation are balanced
| when written, and can be (should be, _must be_ ) balanced across
| both entries as well.
| derefr wrote:
| In my mental model of a digital financial ledger, it just looks
| like a table of transfer events, e.g. {timestamp, from_acct=A,
| to_acct=B, balance_delta=V}.
|
| Which, yes, tracks the counterparty balances as well as your
| own. But it's not _double entry_ per se. Each transfer only has
| its data in one place; there 's no second corresponding entry
| in the table looking like {timestamp, from_acct=B, to_acct=A,
| balance_delta=-V}. You don't _need_ that second entry; you can
| already do everything you need with just the one. (E.g. someone
| 's current balance is the sum of the balance_deltas for which
| they're the receiver, minus the sum of the balance_deltas for
| which they're the sender.)
|
| Are you suggesting there's a point in having rows in
| corresponding pairs like this? If so, what is it? (If it's data
| integrity, IMHO that should be guaranteed at a lower level,
| with cryptographic checksums et al.)
| samatman wrote:
| I'm not sure which of us is confused and it doesn't matter:
| let us say that there is a confusion about double entry,
| which I used to have and no longer do.
|
| It's exactly having from: and to: in a single entry which
| makes for double entry accounting. Or one from: and several
| to: but that's effectively an elision.
|
| By _implication_ there is a second double-entry on the other
| side, but that 's often their business, not mine: as an
| example, the Income statements in my personal ledger are
| balanced by an Expenses:Payroll on the other side...
| spiritually, I neither know nor care about the details of how
| that accounting is done.
|
| In the special case that both accounts are internal to the
| business, then yes, expense on one account becomes asset on
| the other. Whether that is one or two (double) entries
| depends on a bunch of things.
|
| In terms of specific data modeling, I would urge you that
| _yes_ , the rows corresponding to any debit or credit should
| include a 'from' field. The reasons are numerous, and have
| more to do with audit than data integrity per se.
|
| The sufficient reason is that there might not be a
| counterparty inside your database.
| PoignardAzur wrote:
| Naive question: how does double-entry accounting help computer
| accouting systems?
| inopinatus wrote:
| It's the original CRDT.
| OJFord wrote:
| Everything is accounted for, you can't accidentally (or
| 'accidentally') have money come from or go nowhere, since it
| must balance - account1 going up by a pound is matched to
| account2 going down a pound.
| exdsq wrote:
| One of those great moments where domain knowledge results in a
| substantially different system to the intuitive approach :)
| im3w1l wrote:
| I'm pretty sure the _mutex_lock_ and _mutex_unlock_ are bugged,
| as they take a _mutex_ int by value. Would be unfortunate if
| somone copy pasted those.
| [deleted]
| More-nitors wrote:
| care to explain more?
| smhenderson wrote:
| Without having looked into any details I assume the parent is
| saying that the integer passed by value is a flag that is
| checked periodically and used to trigger a next step or to
| release a lock.
|
| If it's passed by value it will never change inside the
| function and whatever is supposed to happen when it changes
| will never get executed. The integers should be passed by
| reference so when they change the function receiving them is
| aware of the update.
| gpderetta wrote:
| Most importantly wake and wait are keyed by address. The
| waker must provide to futex_wake the same address that the
| waiter passed to futex_wait to wake it.
| mockery wrote:
| Not GP, but the code in the article passes 'unsigned int
| mutex' by-value, and then operates on its address (which will
| potentially be at a different stack address on each
| invocation, and will certainly be different for different
| threads.) It won't serve its purpose at all.
|
| To my eye, the code is fundamentally broken in a way that's
| surprising for an article trying to teach threading, and it's
| likely confuse someone unfamiliar with these concepts.
| im3w1l wrote:
| The articles code for mutex_lock is this
| mutex_lock(unsigned int mutex) { while
| (!cmpxchg(&mutex, UNLOCKED, LOCKED)) futex(&mutex,
| FUTEX_WAIT, LOCKED); }
|
| Now consider what happens if someone calls it
| volatile unsigned int my_mutex = UNLOCKED;
| mutex_lock(my_mutex); ... mutex_unlock(my_mutex);
|
| my_mutex is copied into the mutex parameter. The mutex
| parameter is updated to locked. mutex_lock exits, which
| destroys the mutex variable. my_mutex still has the variable
| UNLOCKED.
| mfilion wrote:
| Good catch, corrected!
| waynesonfire wrote:
| One more...
|
| > We use &mutex as the first argument
| andrealmeid wrote:
| ops, fixed, thanks!
| mfilion wrote:
| As mentioned above, the mutex example was purely for didactic
| purposes, and definitely not a complete mutex. It was meant to
| help demonstrate the basics of how a mutex works. Blog post has
| been updated to clarify this.
| winrid wrote:
| Neat timing, was just running gdb on MongoDB and saw it uses
| futexes.
|
| Was debugging why the instances were hanging... sigh
| unilynx wrote:
| That doesn't imply MongoDB mentions futexes anywhere in its
| source code. Applications that use pthreads for their
| multithreading (which is the vast majority of them) will also
| show futex syscalls
___________________________________________________________________
(page generated 2022-02-09 23:01 UTC)