[HN Gopher] Proposed futex2 allows Linux to mimic the NT kernel ...
___________________________________________________________________
Proposed futex2 allows Linux to mimic the NT kernel for better wine
performance
Author : marcodiego
Score : 309 points
Date : 2021-04-28 12:44 UTC (10 hours ago)
(HTM) web link (lkml.org)
(TXT) w3m dump (lkml.org)
| Shadonototro wrote:
| if linux people focused on offering compelling ecosystem for
| app/game distribution, they wouldn't need to bloat their os with
| slow and bloated emulators like wine
|
      | the reason why linux isn't gaining share on steam anymore is
      | because valve openly said linux has no ecosystem and is forced to
      | emulate windows -- proton
| ijlx wrote:
| Wine is not an emulator, it is a compatibility layer for
| Windows system calls on a Unix-like platform. If something runs
| on wine, the performance overhead should be unnoticeable/very
| small for the typical user.
| chungy wrote:
| Ironically, it seems quite often that Wine runs apps/games
| faster than Windows does.
| encryptluks2 wrote:
| A compatibility layer wouldn't require that you essentially
| emulate the same file system layout, registry and
| functionality. It is basically emulating a Windows
| environment.
| anthk wrote:
| So Windows 10 emulates Windows XP/2000/98 too?
| encryptluks2 wrote:
| Pretty much... they took a pile of crap and kept adding
| onto it.
        | Dah00n wrote:
        | That has nothing in common with emulating though. Running
        | ARM code on x64 is emulating. If adding code were
        | emulating, then Linux would be emulating Linux v0.1. Wine isn't
        | an emulator. PCSX2 is an emulator.
| chungy wrote:
| The word "emulate" is overloaded, but because there's
| often an assumption of ISA emulation, Wine for a time
| called itself "WINE Is Not an Emulator" to prevent said
| assumption.
|
        | Taking "emulate" at its dictionary definition, which is
        | to imitate or copy, it's easy to argue that Wine is an
        | emulator, because it copies and imitates the Windows ABI
        | and API (the earliest versions of Wine even called it
        | a "WINdows Emulator"). They just happen to draw a hard
        | line against emulating x86.
| samus wrote:
| In many cases it is indeed not so simple:
|
        | * Most nontrivial Windows programs would get very confused
        | if exposed to a trivially mapped Unix filesystem. It wouldn't be
        | so bad if every program used environment variables to
        | locate stuff, but you can't discount the possibility that
        | some of them hardcode paths. Also, application data in
        | Windows is usually kept in a common folder, while in Unix
        | filesystems it is usually split across `/usr/bin`,
        | `/usr/share`, `/usr/lib` and so on.
|
| * Many Windows applications store their configuration in
| the registry. That part is not too bad though; it should be
| completely doable in userspace.
|
        | * The application might issue system calls that have
        | different semantics, or that don't even exist on Linux. TFA
        | is significant because it paves the way to an efficient
        | implementation of one of these system calls.
| encryptluks2 wrote:
        | > Most nontrivial Windows programs would get very confused
        | if exposed to a trivially mapped Unix filesystem. It
        | wouldn't be so bad if every program used environment
        | variables to locate stuff, but you can't discount the
        | possibility that some of them hardcode paths. Also,
        | application data in Windows is usually kept in a common
        | folder, while in Unix filesystems it is usually split
        | across `/usr/bin`, `/usr/share`, `/usr/lib` and so on.
|
        | The application data in Windows would not be the same as
| `/usr/` paths. It would be the equivalent of `.config`
| and `.local`. The last thing anyone in Linux would want
| is for Windows to start populating their system folders,
| let alone probably creating hundreds of top-level folders
| in their `.config` or `.local` directory.
|
| > Many Windows applications store their configuration in
| the registry. That part is not too bad though; it should
| be completely doable in userspace.
|
| The Windows registry is an atrocious monster that should
| have died already. You can rarely ever truly delete an
| app in Windows, as there will be sometimes hundreds of
| registry keys leftover.
|
        | > TFA is significant because it paves the way to an
        | efficient implementation of one of these system calls.
|
| If there is mutual benefit outside of Windows, then sure
| why not. If it is built just for better Wine
| compatibility with Windows then I hope to see a rant by
| Linus telling them how ridiculous they are.
| fnord77 wrote:
| w.i.n.e = Wine Is Not an Emulator
| a-dub wrote:
| what's the nt api call called? WaitForMultipleObjects? takes me
| back 20 years just thinking about it...
| dboreham wrote:
    | The NT API functions would begin with NtXxxx;
    | WaitForMultipleObjects is a Win32 function.
| ziml77 wrote:
| If we're being pedantic here, it's not Win32.
|
| > (Note that this was formerly called the Win32 API. The name
| Windows API more accurately reflects its roots in 16-bit
| Windows and its support on 64-bit Windows.)
|
| Though Win32 is going to stick around for a while as can be
| seen in the URL for the page that's quoted from:
| https://docs.microsoft.com/en-
| us/windows/win32/apiindex/wind...
| a-dub wrote:
| i think it's still called win32 today because win32 was a
| massive overhaul. 16 bit windows was loaded with that far
| and near pointer gunk to stretch those 16 bit pointers
| which win32 was able to do away with thanks to larger
| pointer widths.
|
| regarding win32 vs. the nt api. i'm not super familiar with
| the pedantic distinctions there, but i do know, much to the
| programmer's chagrin, the thousands of windows api
| functions have varying levels of kernel entry, where i
| always have appreciated the (relatively) small number of
| unix style syscalls and the clear delineation between
| kernel and userspace api. (although i suppose futexes blur
| that, in name at least!)
| Joker_vD wrote:
        | Well, since the introduction of vsyscall and the vDSO, Linux
        | syscalls too have varying levels of kernel entry.
| beatrobot wrote:
| NtXxxx or ZwXxxx
| poizan42 wrote:
| I don't know why you are being downvoted, you are completely
| right. The NT api function is NtWaitForMultipleObjects. The
| Win32 apis WaitForMultipleObjects and
| WaitForMultipleObjectsEx are thin wrappers over this.
| AndriyKunitsyn wrote:
        | Can this NT API actually be called from
        | userspace? Or even from drivers?
| poizan42 wrote:
| Yes of course, they are all exported from ntdll.dll (in
| userspace) and ntoskrnl.exe (in kernelspace). You may
| need to link dynamically or make your own import
| libraries for many of them though.
| a-dub wrote:
| think of it as like select and pthread_cond_wait rolled
| into one generic api for waiting on one or more things
| from the kernel.
|
| it's actually pretty nice how generic they made it.
| andrealmeid wrote:
        | Sure, here's a userspace example:
| https://docs.microsoft.com/en-
| us/windows/win32/sync/waiting-...
        | leeter wrote:
        | Based on the docs, it appears kernel-mode drivers would call
        | KeWaitForMultipleObjects instead. User space
        | should just call WaitForMultipleObjects. While
        | NtWaitForMultipleObjects and ZwWaitForMultipleObjects
        | exist, they seem to be undocumented, which means they
        | probably have funky behavior. Anything in KernelBase.dll
        | is a pretty thin wrapper that I'd be hesitant to call
        | Win32 so much as a "really thin syscall wrapper".
| tobias3 wrote:
| WaitOnAddress and co. But that one was added with Windows 8, so
| it can't take you back 20 years :)
| gpderetta wrote:
| WFMO is indeed the api that the new futex_waitv is supposed
| to help accelerate under wine.
| Joker_vD wrote:
    | By the way, hearing about futex implementations and their API
    | considerations should also bring everyone about 20 years back.
| marcodiego wrote:
| Artificial benchmarks show a 4% improvement in some cases, which
| is huge. There aren't many tests to demonstrate the improvements
| on common use cases but there are already reports of subjective
| improvements with games having less stutter:
| https://www.phoronix.com/forums/forum/phoronix/latest-phoron...
| savant_penguin wrote:
| Really curious, why is 4% improvement huge?
| chalst wrote:
| "Huge" means different things in different contexts. A
| compiler optimisation that delivered a 4% increase in speed
| for a whole class of well-tuned C programs would be regarded
| as huge by people who follow compiler technology, but would
| not strike most users of computers as a big deal. I take it
| "huge" is meant in this sense of extraordinary within a class
| of potential improvements.
|
| If you can accumulate a series of 4% improvements, then even
| the computer consumer will start to see the significance.
| Someone wrote:
| I can't find where the 4% number comes from, but I expect
| that it comes from a micro benchmark that does little more
| than futex calls. Reason: if it is from a benchmark that
| represents real-world use, real-world use must spend at
| least 4% of its time in futex calls. If that were the case,
| somebody would have seen that, and work would have been
| spent on this.
|
| So no, this won't give you 4% in general.
| jstanley wrote:
| > If that were the case, somebody would have seen that,
| and work would have been spent on this.
|
| I would think that the linked patch is good evidence that
| work is being spent on this.
| xxs wrote:
| The microbenchmark part is part of the article, no idea
| why the downvotes here.
| Dah00n wrote:
        | Because the benchmark isn't doing what GP says it does.
| chalst wrote:
| From the lkml post, it's neither from a real-world case
| nor from an especially constructed futex benchmark, but
| from an existing kernel benchmark where it was thought it
| would be interesting to see how it performed with the new
| call.
| jhgb wrote:
| A 4% improvement of the 100m dash world record would have
| been considered phenomenal, I imagine.
| alexeldeib wrote:
| This reminded me of the 4 minute mile. The first 4 minute
| mile was a 3:59 run by Roger Bannister in 1954. Prior to
| that, the record was a 4:01 run in 1945.
|
| A 4% improvement on the 4:01 would be a 3:51. A 3:51
| wasn't achieved until 1966, about 20 years later.
|
| So 4% can be quite a jump!
| dbpatterson wrote:
| Extending your point, 4% more would go down to just under
| 3:42.... which has not yet happened, over a half century
| later.
| zimpenfish wrote:
| To put that into context, Bolt's 9.58s (current WR,
| considered phenomenal in and of itself, I think?) is
| about 1.7% faster than Powell's 9.74s (last pre-Bolt WR).
|
| A 4% improvement on Bolt's 9.58s would be about 9.19s -
| probably not something we'll see for several decades, if
| ever.
| jamiek88 wrote:
| I wonder if I'll ever see a Bolt again in my lifetime.
|
| Such dominance and grace.
|
| He jogged over the line with that record too.
| fefe23 wrote:
| Intel Xeons have huge price differences at the top tier for
| relatively minor performance differences.
|
| So if you get 4% for free here, that could mean hundreds of
| dollars savings compared to having to buy the next tier
| hardware. Multiplied by the number of Intel customers.
|
| 4% could mean the difference between "I can still use my old
| computer" and "I need to buy an upgrade".
| ltbarcly3 wrote:
| The price difference is due to artificial market
| segmentation, for example only xeons support ecc. It is not
| due to performance.
|
| (consumer chips are usually basically identical, just with
| functionality disabled)
| k1rcher wrote:
| This is interesting. Does anyone know of previous figures on
| Windows virtualization improvement?
| delroth wrote:
| It's not 4% improvement for the wine use case, it's 4%
| improvement in existing synthetic benchmarks for the futex
| syscall. And a 21% regression in performance in another
| synthetic benchmark.
|
    | It's likely the wine WFMO implementation is getting a much
    | more significant increase in performance than 4%, but the
    | patch description doesn't provide numbers.
| tombert wrote:
    | I don't know much about Wine, so I'm speaking in
    | generalities, but I've worked jobs where we spent weeks to
    | get <10% performance increases. Very often that last bit of
    | optimization is the hardest part.
| [deleted]
  | marcodiego wrote:
  | Because you don't have to change one line of your previously
  | running software, or replace your hardware, to get this
  | benefit. Of course, you will need a new version of the kernel
  | and wine, but it's only a matter of time until those hit distro
  | repositories. Also, consider that this will accumulate when
  | combined with new user space libs, servers, compilers,
  | drivers... Software will improve without any change, just by
  | keeping your distro updated.
|
| Yes, a 4% improvement is huge.
| sp332 wrote:
| Well, it's a relative measure. There's been a ton of
| optimization already. All the low-hanging fruit is gone. It's
| probably been a long time since a single optimization gained
| 4%.
| xxs wrote:
| > 4% performance improvement for up to 800 threads.
|
  | 800 threads on something like WaitForMultipleObjects is already
  | a horrific design if you even remotely care about performance.
  | The only viable case for 800 threads (and wine) would be
  | blocking IO, which is an extremely boring one.
| asveikau wrote:
| IIRC, WaitForMultipleObjects can only block on 63 handles.
| 800 threads blocked on 63 handles seems pretty weird.
| xxs wrote:
| >IIRC, WaitForMultipleObjects can only block on 63 handles
|
    | Yup (64, actually), and it's an extra annoying limitation. It
    | has been there since the 90s, WinNT (can't recall the win95
    | version as I never used it with win95). It's limited by
    | MAXIMUM_WAIT_OBJECTS, which is 64.
|
| For instance Java implements non-blocking IO
| (java.nio.Selector) by using multiple threads and sync
| between them to effectively achieve linux' epoll.
|
| ----
|
    | my point was mostly that the 'huge' 4% happens with massive
    | amounts of threads, and WaitForMultipleObjects won't see
    | even that... so kind of click-baity. Side note: sync
    | between threads can be done better than relying on OS
    | primitives, aside from real wait/notify stuff (which indeed
    | is implemented via futex)
| usefulcat wrote:
| Sure, but that's irrelevant for wine, which doesn't get to
| choose how the things that it's running (e.g. games) are
| implemented.
    | xxs wrote:
    | Of course. Yet, I would be massively surprised to see such
    | heavily multithreaded games with massive contention on
    | WaitForMultipleObjects. No one programs that way (again,
    | save for blocking IO... and then it totally doesn't matter)
| monocasa wrote:
| Unfortunately, people do program that way, and I've seen
| it way more in closed source Windows code than in the
| open source world where someone will (rightly) call them
| out.
|
| For instance, in a previous job, a C++ codebase existed
| where the architect had horrendously misunderstood the
| actor model and had instated a "all classes must extend
| company::Thread" rule in the coding standard. So close,
| yet so incredibly far away.
|
      | At the end of the day, wine (like windows) still has to
      | run these programs and do its best with terrible choices
      | that were made in the past.
    | xxs wrote:
    | Having a policy that requires extending something is properly
    | bad already. For actors you likely need a lock-free queue, and
    | much preferably a bounded one - so each message is a CAS
    | (very likely uncontended) and there is an upper limit on how
    | much stuff can be enqueued.
    |
    | I do believe there are tons of incredibly horrid code out
    | there (esp. when it comes to concurrency) - but most games
    | tend to use already established engines, less likely to have
    | glaring concurrency design blunders.
| monocasa wrote:
| I mean, yeah, you don't make a choice like that without
| making a lot of other terrible choices too. Like I said
| "horrendously misunderstood the actor model". And as you
| seem to be picking up on, they had a very C++98 view at
| best of how inheritance should work.
|
| In case it's not clear to anyone reading along: if you
| find yourself doing anything like this, take a long, deep
| look in the mirror; you're almost certainly making the
| wrong choice. Best I can say is that the architect
| learned from some of their choices in the years since
| from the resulting consequences.
|
| However, that code is still out there. The engineers have
| tried to rewrite since then, but it turns out that
| rewriting isn't adding new features. I don't agree with
| the company's calculus on the matter, but that's a tale
| as old as time.
|
| And you'd be surprised how many games aren't written
| well. Writing a game engine is a bit of a rite of passage
| among young coders, and a surprising subset of those
| actually ship. In my experience, a lack of maturity in
| the field and (we'll be courteous and say) 'questionable'
| threading models (and as you point out inheritance models
| too) go hand in hand.
|
| The point at the end of the day is that Wine users (like
| Windows users) expect the runtime to do the best it can
| even against terribly written applications. The users
| want to run that code, it isn't going to be fixed, so
| anything Wine can do better to paper over the (incredibly
| obvious to someone skilled in the art) deficiencies of
| that code is in their users' benefit.
| phkahler wrote:
| - For requeue, I measured an average of 21% decrease in
| performance compared to the original futex implementation. This
| is expected given the new design with individual flags. The
| performance trade-offs are explained at patch 4 ("futex2:
| Implement requeue operation").
|
| It's not clear that this is a win overall. But hey, if Windows
| games run better under wine...
| Denvercoder9 wrote:
| The old interface will remain available, so if requeue
| performance is important, you can keep using that. Also at
| least one kernel maintainer considers requeue to be
| historical ballast [1], so it's probably not used much
| anymore.
|
| [1] https://lwn.net/ml/linux-
| kernel/87o8thg031.fsf@nanos.tec.lin...
| derefr wrote:
| Better yet, since Wine is there to serve as a "smart" shim,
| it can offer a launch config parameter for the app, that will
| get Wine to point its futex symbols at its previous userland
| impl. (Sort of like the various Windows compatibility-mode
| flags.) So even Windows apps that do tons of requeues (and so
| presumably would be slowed down by unilaterally being made to
| do kernel futexes) would be able to be tuned back up to
| speed. And distros of Wine that provide pre-tuned installer
| scripts for various apps, like CrossOver, can add that flag
| to their scripts, so users don't even have to know.
| [deleted]
| gpderetta wrote:
    | IIRC glibc stopped using requeue a while ago for condition
    | variable wait morphing, so maybe requeue is not a hugely
    | important use case anymore.
| alexhutcheson wrote:
| Bummer that this doesn't seem to support or explicitly plan for
| the proposed FUTEX_SWAP operation[1] (unless I'm missing it).
|
| [1] https://lwn.net/Articles/824409/
| dundarious wrote:
    | I haven't read anything about FUTEX_SWAP since Aug 2020. Do you
    | know if there's been ongoing progress?
| Cloudef wrote:
| What happened to NTSYNC?
| https://lore.kernel.org/lkml/f4cc1a38-1441-62f8-47e4-0c67f5a...
| elynl wrote:
| It's still a thing! kernel:
| https://repo.or.cz/linux/zf.git/shortlog/refs/heads/winesync
| wine:
| https://repo.or.cz/wine/zf.git/shortlog/refs/heads/fastsync2
|
| They decided to rename it since the kernel maintainers don't
| want NT interfaces and having NT in the name doesn't help
    | Cloudef wrote:
    | That's good to hear. The Darling devs also went with the kernel
    | module approach. It's weird that the wine devs did not do the
    | same until now. Wineserver is nice but, as we see, it's
    | sometimes impossible to do accurate emulation of some APIs
    | there.
|
| https://docs.darlinghq.org/internals/basics/system-call-
| emul...
| aasasd wrote:
| I lowkey have to wonder why Linux seemingly goes all-in on
| monster functions that implement a variety of operations
| depending on the arguments (as far as I can see from casual
| encounters). In my experience, such code often both reads poorly
| on the calling side and looks very messy inside. I came to prefer
| making distinct functions that spell out their intended action in
| the names. Windows apparently went the same path.
|
| E.g.: syscall doesn't accept multiple handles and other bells and
| whistles? Well duh, it's not supposed to--just make a new one
| that does.
|
| Of course, it's probably too late to change this design choice
| now...
| whatshisface wrote:
| I can only offer speculation, but it could be that the
| alternative, a complicated API that requires several subsequent
| calls to use, is seen as more of a risk for bad programming and
| state mis-tracking. We definitely don't want programs leaving
| the kernel in a bad state by calling the first half of a
| sequence of syscalls then later on getting distracted or
| crashing and never finishing it. A single call gives atomicity
| guarantees that are otherwise hard to come by in languages
| without state tracking primitives like futures or borrow
| checked references.
|
| Another thing to remember is that in C, it's easier to make a
| call more specific, by hardcoding values than it is to make it
| more general, by picking between variants in a branch. I can
| see programmers wishing the API was more dynamic while writing
| a giant if-else tree to select the appropriate version of the
| syscall, but the opposite problem, when the same invocation is
| always needed but the API takes a lot of dynamic options, is
| not as bad because it is easy to write a bunch of literals in
| the line that calls the function. (If C had named arguments
| that approach would be almost flawless.)
| aasasd wrote:
| I don't see how `an_op_syscall(param)` is 'more complicated'
| or requires more calls or state than `a_syscall(OP_NAME,
| param)`.
| whatshisface wrote:
| Since C doesn't have first-class functions, if you needed
| to dynamically pick between values of OP_NAME you'd be glad
      | it was a parameter. Imagine ten option parameters, which
| your program needs to choose between depending on the
| circumstances. That's a huge branch tree. Imagine choosing
| between two options - you'd need to keep track of function
| pointers, or at least write your own enum. C makes it far
| nicer to handle enum values than it is to handle functions,
| because ints are first-class. I agree that there's no
| difference between a symbol in the name and a value in the
| args when your program always needs the same value, but
| sometimes it needs dynamism.
| delusional wrote:
| I don't see how function pointers are any harder to
| manage than an enum. Just slap those function pointers
| into an array or a switch, and away you go.
| aasasd wrote:
| You're talking here about the second half of your
| previous comment. This reasoning is ok by me, though I
| have to wonder how often, and why, do programmers need to
| decide dynamically in mid-flight whether they want to
| wait instead of wake (or whatever choices futex()
| offers), or that they need one handle instead of a
| vector.
|
| I'm talking above about the first half of your previous
| comment, where you say that having operation names in the
| function names would somehow require multiple calls and
| break atomicity. I can only understand this if Linux
| syscalls are parameterized to do one thing and then
| another thing and something else, in one call--which
| sounds horrible, but still can be expressed in function
| names.
| gpderetta wrote:
        | There are no direct function calls into the kernel.
|
        | At the end of the day a syscall is an interrupt
        | with the syscall parameters in a bunch of registers. As
        | you do not want to use a different interrupt (as there
        | are only so many interrupts [1]) for each kernel function,
        | the syscall id is passed as a parameter in a register. On
        | the kernel side the id is used to index into a function
        | pointer table to find and call the actual system
        | function.
|
| Turns out this table is actually limited in size so some
| related syscalls use another level of indirection with
| additional private tables (I think this is no longer a
| limitation and these syscall multiplexers are frowned
| upon for newer syscalls).
|
| All of this to say that FUTEX_WAIT and FUTEX_WAKE are
| just implementation details. A proper syscall wrapper
| would just expose distinct functions for each operation.
|
        | But for a very long time glibc did not expose any wrapper
        | for futexes, as they were considered Linux-specific
        | implementation details and one was supposed to use the
        | portable POSIX functions. So the futex man pages
        | documented the raw kernel interface. That didn't stop
        | anybody from using futexes of course, and I think glibc
        | now has proper wrappers.
|
| [1] yes, these days you would use sysenter or similar,
| but same difference.
| aasasd wrote:
| Thanks, this actually answers my original question in
| regard to the kernel's part of it. I'll take a look at
| what names and signatures libc has, and if they're still
| like that then the question is readdressed to libc--
| though I guess the answer then would be "because that's
| how it is in the kernel".
|
| P.S. Come to think of it, I sorta knew about the
| interrupt business, and vaguely about the later but
| fundamentally similar method (but not that there are many
| interrupts for syscalls, though thinking about eight-bit
| registers clears this up quickly). Guess I just forgot
| that libc is kept separate from the kernel, in the Linux
| world.
| whatshisface wrote:
        | > _I'm talking above about the first half of your
        | previous comment, where you say that having operation
        | names in the function names_
|
| I was addressing operation names in the function names in
| the second half of my comment, in the first half I was
| addressing breaking up specification into a state machine
| process like OpenGL does. State machines are another way
| of breaking up big functions with lots of options. They
| have pseudo default argument support as an advantage, and
| ease of implementation via static globals as a historical
| (before multithreading) advantage. State machines are
| kind of like builder pattern as a way to get named
| arguments and default arguments in a language that
| doesn't have either, except instead of the builder being
| a class it's a singleton (with all the attendant
| disadvantages.)
| aasasd wrote:
| Ah, no, that's also contrary to my taste. I love clear
| function names and I hate state. Immutable return values
| are way better than builders, in my experience, even if
| that means many arguments. And static globals are a whole
| other level of bad.
| whatshisface wrote:
| Builders can be a huge help when you get up to a dozen
| arguments and above. Seriously, I never want to see
| foo(13,15,91,39,LIB_YES,LIB_NO,5.0,183.5) ever again
| haha. Good engineers would comment each argument, good
| languages support named arguments, but when you have
| neither it can be a mess unless the library author used
| builders to force annotation of the values. A close
| relative of builders are settings struct arguments, which
| have their own pros and cons. I think Vulkan uses
| settings structs.
| aasasd wrote:
| I think OOP builders can work without extra state if they
| implement `->addThis()->addThat()` by actually doing
| functional `addThat(addThis(defaultObj))` and creating
| immutable values inside. But ultimately there's no good
| answer to the absence of named arguments, and by now not
| having them is a pure atrocity.
|
| Of course, I'm guessing C programmers would cringe at the
| prospect of going full COW for such use-cases.
| xiphias2 wrote:
| Efficient OpenGL programming was always hard for me, as I
| don't just need to learn the API, but also try to figure
| out how it's implemented, which is often undocumented and
| unspecified.
|
        | It's actually hard for me to understand that an API that
        | was created for efficient computing doesn't
        | specify/guarantee the relative speed/latency of
        | operations.
| whatshisface wrote:
| Yeah, there are some really interesting unexpected
| performance implications - for instance, it's actually a
| great idea to send updates to buffers in a way that
| implies new allocation, because most drivers will
| silently and automatically set up a multiple-buffering
| system that reuses resources efficiently and doesn't
| stall the pipeline. Sending partial updates looks better
| on the surface, but because it can stall while the driver
| waits for the resource you're updating to fall out of
| use, it can be worse unless you set up the triple
| buffering yourself!
| maxfurman wrote:
| Can someone please ELI5 what is a futex? Is it like a mutex?
| whoisburbansky wrote:
| It's a primitive that lets you build mutexes; you've got an
| atomic integer in userspace, attached to a wait queue in kernel
| space. Eli Bendersky has a pretty good rundown:
| https://eli.thegreenplace.net/2018/basics-of-futexes/
| bogomipz wrote:
    | This is a good post, thanks. One thing that confuses me is that
    | these two sentences seem to contradict each other:
|
| >"Before the introduction of futexes, system calls were
| required for locking and unlocking shared resources (for
| example semop)."
|
| >"A user-space program employs the futex() system call only
| when it is likely that the program has to block for a longer
| time until the condition becomes true"
|
| The first indicates futexes are an improvement because they
| avoid the overhead of system call but the second states that
| futex() is a system call.
|
| Is this similar to how a vDSO or vsyscall works where a page
| from kernel address space is mapped into all users processes?
| In other words is it similar to how gettimeofday works?[1]
|
| [1] https://0xax.gitbooks.io/linux-
| insides/content/SysCall/linux...
| kccqzy wrote:
| It's not a contradiction. The two sentences are perfectly
| clear to me. Of course futex() is itself a system call, but
| in case of no contention you don't need to invoke the
| system call. That's why futex() is an improvement. The key
| phrase is "only when" in the second sentence.
| bogomipz wrote:
| >'The key phrase is "only when" in the second sentence'
|
| Thank you for pointing this out. That makes sense. This
| is sort of like the idea that the fastest system call is
        | the one that is never made. Can you say how
        | something in userland knows if something "would" block
| without actually crossing the user/kernel boundary? Does
| the kernel expose a queue length counter or similar via a
| read-only page?
| tialaramex wrote:
| The trick is, the memory address futex cares about _is_
| how you signal. Normally, in the "uncontended" case,
| everybody involved uses atomic memory access to
| manipulate this memory location (for example they set it
| to 1 as a flag to say somebody is in the mutually
| excluded code and then back to zero to say now nobody is
| in there) and they pay no kernel cost at all for this
| feature. The kernel has no idea the memory is even
| special.
|
| _But_ in the contended case the flag is already set by
| somebody else, so you leave a sign (maybe you set it to
| 42) about the contention in the memory address, and you
        | tell the kernel that you're going to sleep and it should
| wake you back up once whoever is blocking you releases.
| When whoever was in there is done, they try to put things
| back to zero, but they discover you've left that sign
        | that you were waiting, so _they_ tell the kernel they're
        | done with the address and it's time to release you.
|
| None of the conventional semantics are enforced by the
| operating system. If you write buggy code which just
| scribbles nonsense on the magic flag location, your
| program doesn't work properly. Too bad. The kernel does
| contain some code that helps do some tricks lots of
| people want to do using futexes, and so for that reason
| you should stick to the conventions, but nobody will
| force you to.
| kccqzy wrote:
| The most common implementation of mutexes using futex()
| is that the userland simply spins a while waiting to
| acquire the mutex (just atomic operations, no system
| call). After a while the userland gives up spinning and
| makes the system call.
| gpderetta wrote:
| IMHO the nice thing about futexes is not that they allow
| fast-pathing uncontended synchronization.
|
| The userspace fast path can in principle be implemented
| on top of any kernel only synchronization primitive (a
| pipe for example).
|
| The nice thing about futexes is that they do not consume
| any kernel resources when uncontended, i.e. there are no
| open or close futex syscalls. Any kernel state is created
| lazily when needed (i.e. on the first wait) and destroyed
| automatically when no longer needed (i.e. when the last
| waiter is woken up). Coupled with the minimal userspace
| memory requirement (just a 32-bit word), this means that
| futexes can be embedded very cheaply in a lot of places.
| Iwan-Zotow wrote:
| It is the thing behind/inside a mutex. It makes uncontended
| mutex acquire/release ops a pure userspace thing.
| ROARosen wrote:
| Can someone please ELI5 what is a mutex?
| hyperman1 wrote:
| A mutex is the software equivalent of the lock on your toilet
| door. There is a zone in your code where only 1 thread at a
| time is allowed. So before entering, you lock the mutex. Now
| either you are allowed in or are added to a set of waiters.
| When you leave, you unlock the mutex, and 1 other waiter is
| allowed to enter.
|
| This sounds easy, but is in practice fraught with peril. You
| need high performance, not waste too much cpu time on
| sleepers, some notion of fairness when you have lots of
| waiters, minimal memory usage, etc...
| __init wrote:
| (Edited because I transposed "mutex" and "semaphore" in my
| mind.)
|
| A mutex is a lock -- if multiple users attempt to access a
| resource simultaneously, they each attempt to acquire the
| lock serially, and those not first are excluded (blocked)
| until the current holder releases the lock.
|
| Mutexes are typically implemented with semaphores:
|
| First a more developer-oriented explanation (because I
| imagine that's what you really want):
|
| A semaphore locks a shared resource by implementing two
| operations, "up" and "down". First, the semaphore is
| initialized to some positive integer, which represents how
| many concurrent "users" (typically threads) can use that
| resource. Then, each time a user wants to take advantage of
| that resource, it "down"s the semaphore, which atomically
| subtracts one from the current value -- but the key is that
| the semaphore's value can _never_ be negative, so if the
| value is currently zero, "down"ing a semaphore implies
| blocking until it's non-zero again. "Up"ing is just what
| you'd imagine: atomically increment the semaphore value, and
| unblock someone waiting, if necessary.
|
| Semaphores are generally seen as a fundamental primitive (one
| reason being that locks/mutexes can be implemented trivially
| as a semaphore initialized to one), but they also have broad
| use as a general synchronization mechanism between threads.
|
| For a true ELI5 of semaphores (I enjoy writing these):
|
| Imagine everyone in class needs to write something, but there
| are only three pencils available, so you devise a system (a
| system of mutual exclusion, per se) for everyone to follow
| while the pencils are used: First, you note how many pencils
| are available -- three. Then, each time someone asks you for
| one, (which we call "down"ing the semaphore) you subtract one
| from the number in your head and give them a pencil. But if
| that number is zero, you don't have any pencils left, so you
| ask the person to wait. When someone returns and is done with
| their pencil, you hand it off to someone waiting, or, if
| nobody is waiting, you add one to that number in your head
| and take the pencil back ("up"ing the semaphore). It's
| important that you decide to only handle one request at a
| time (atomicity) -- if you tried to do too many things at
| once, you might lose track of your pencil count and
| accidentally promise more pencils than you have.
| MayeulC wrote:
| > First, you note how many pencils are available -- three.
| Then, each time someone asks you for one, (which we call
| "down"ing the mutex) you subtract one from the number in
| your head and give them a pencil
|
| Isn't that a semaphore rather than a mutex?
|
| For ELI5 on semaphores, it's like going to the train
| station, airport, or McDonald's: there is a single queue
| for multiple counters. Everyone is given a paper with a
| number, and once a counter is free, the one with the lowest
| number can walk to it. A semaphore is just the number of
| free counters, it decreases when someone goes to one, and
| increases when one leaves. If the semaphore is zero, there
| is no free spot, and people will have to queue.
|
| It's quite close to another synchronization primitive,
| barriers: if you have to wait for _n_ tasks to finish, you
| make one with _n_ spots, and release everyone when all the
| spots are filled.
|
| To me, mutexes are semaphores, but with only one counter:
| there is only exactly one or zero resources available. If
| you take the mutex, others will have to wait for you to be
| done before they can take it. They can queue in any
| fashion, depending on the implementation: first come, first
| served; fastest first; most important first; random; etc.
| __init wrote:
| Oh no, you're absolutely right. My sleep-deprived brain
| has confused the names -- thanks for the heads up. I'll
| edit my post.
| e_proxus wrote:
| But it can't be turtles all the way down? Isn't a semaphore
| counter also shared state that needs to be protected if
| several threads want to "down" the value? Can it become
| negative?
| gpderetta wrote:
| Well, at the bottom of the turtle stack you have hardware
| primitives that guarantee atomicity of a bunch of
| word-sized arithmetic operations (at the minimum CAS or
| LL/SC).
| ROARosen wrote:
| Thanks, great analogy there, it really helped it 'click'!
|
| Please excuse my ignorance...
| marcodiego wrote:
| Mutex is named after "mutual exclusion", a primitive used
| to guarantee that certain parts of code run exclusively,
| never at the same time. This is important for
| synchronization and performance in multithreaded programs.
| xxs wrote:
| My ELI5 version of that locking mechanism, e.g. mutex, uses a
| toilet and the disaster if multiple people try taking the
| dump at the same time.
| brandmeyer wrote:
| The best introduction is probably the original presentation
| paper from 2002.
|
| Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux
|
| https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pd...
|
| Another great paper that came after some additional experience
| by glibc developers:
|
| Futexes are Tricky, by Ulrich Drepper
|
| https://www.akkadia.org/drepper/futex.pdf
| MaxBarraclough wrote:
| I see Drepper resisted the urge to pluralise _futex_ to
| _futices_.
| DSingularity wrote:
| Okay, consider a multi-threaded program. These programs have
| multiple threads running on multiple cores. Sometimes you want
| to coordinate the execution of multiple threads in such
| programs. A common example is when a thread is in a "critical
| section" where it is only safe for one thread to access/modify
| state. Sometimes threads will have to wait for each other. This
| is where the futex comes in. Instead of having the thread
| wait while occupying the processor, it can tell the OS that
| it needs to wait for this futex to become free. The OS will
| then suspend the thread and find another one to schedule in
| its place.
| bogomipz wrote:
| Ah Interesting. So is the concept of futex() similar to how a
| process in kernel mode will yield the CPU by putting itself
| on a wait_queue and call the schedule() function before it
| issues an I/O request then?
|
| I realize that one is userspace and the other kernel but
| would the above be the right mental model?
| gpderetta wrote:
| Yes, a futex is basically a waitqueue abstraction coupled
| with a TOCTOU guard.
| hulitu wrote:
| As far as I remember, a futex is a fast mutex.
| corbet wrote:
| See https://lwn.net/Articles/823513/ for a look at both the old
| and proposed new futex interfaces.
| achernya wrote:
| futex is a Fast Userspace muTEX. It's the syscall that
| helps implement a mutex when two or more threads are
| contending for the lock, letting other processes/threads
| be scheduled and do useful work during the wait.
| hyperman1 wrote:
| Futex stands for fast userspace mutex. It is a basic building
| block for implementing a mutex (amongst others).
|
| Afaik, if locking a mutex succeeds, futexes avoid a context
| switch to and from the kernel. The fast path happens completely
| in userspace. Only when a mutex lock causes a thread to sleep,
| the kernel is needed. All of this improves mutex performance.
|
| There are other locking primitives than mutexes, and the
| current futex interface is not optimal for implementing common
| windows ones. Hence this proposal.
| ece wrote:
| ESYNC is the way this is currently done; it's hit and miss
| in that some games don't work unless the limit on open
| files is set high enough. It's also slower than this
| proposal.
| xxs wrote:
| >"context switch to and from the kernel"
|
| This is a common misconception, a "context" switch is a lot
| more expensive than a "mode" switch that goes from user mode
| to the kernel mode (and back).
|
| Of course, if the futex has to wait, it'd be a context switch
| for real.
| gpderetta wrote:
| A futex is an ephemeral list of waiting threads or processes
| keyed on an address. It can be used to implement any
| synchronization or signaling primitive, like mutexes, condition
| variables, semaphores, etc.
| tialaramex wrote:
| This is a really nice insight into the kernel's model of
| what's going on. It's probably not the best way for an
| absolute beginner to think about why they _want_ this, but
| it's much better than some of the other explanations if you
| have trickier technical or performance questions, and it
| makes it obvious why it's clever and why you need the OS
| kernel to actually implement it.
| twoodfin wrote:
| Unclear to me from the writeup: Would it be possible to implement
| the original futex() API on top of futex2() or is Linux doomed to
| preserve most of two complete futex implementations?
| guipsp wrote:
| Would it be possible? Probably yes, but not performant.
| c7DJTLrn wrote:
| This might get me some angry replies, but does the feature set of
| Linux ever end? Will there ever be a day where someone says "you
| know what, Linux isn't built for this and your code adds too much
| complexity, let's not". It really just seems like Linux has
| become a dumping ground for anyone to introduce whatever they
| feel like as long as it passes code review, with nobody looking
| at the big picture.
| DSMan195276 wrote:
| Features get rejected all the time, even some pretty
| substantial ones. That said I think there's two things to keep
| in mind:
|
| 1. If someone is willing to take the time to develop and submit
| something to the kernel, it probably has _some_ redeeming
| factors to it. The original idea or code might not get merged,
| but something similar to meet those needs likely can be if
| it's really something the kernel can't do.
|
| 2. The modularity of the Linux Kernel build system means that
| adding new features doesn't have to be that big of a deal,
| since anybody who doesn't want them can just choose not to
| compile them in. It's not _always_ that simple, but for a lot
| of the added features it is.
| marcodiego wrote:
| Getting a new syscall on linux is hard. You'll often need user-
| space users before it gets accepted; that alone is hard. The
| Linux syscall set is very small considering the breadth of
| cases it serves today. Compared to NT's, it is small and
| well thought out.
| formerly_proven wrote:
| Windows has around 500 syscalls, Linux around 300.
|
| Both have several escape-hatches for arbitrary driver-
| userspace communication, which makes the number of syscalls
| pretty irrelevant. (E.g. Linux has ioctl, netlink, procfs,
| sysfs, ...)
| monocasa wrote:
| Well, there's also win32k.sys which provides an additional
| 1000 or so syscalls.
| Ericson2314 wrote:
| > Both have several escape-hatches for arbitrary driver-
| userspace communication, which makes the number of syscalls
| pretty irrelevant. (E.g. Linux has ioctl, netlink, procfs,
| sysfs, ...)
|
| Yes, we need a better metric for accounting for those
| things, which will paint a much more sobering picture of
| the complexity and growth of complexity of the userland <->
| kernel interface.
| GoblinSlayer wrote:
| The Linux syscall set is very small and POSIX. No matter
| what you think, the first version is always a prototype.
| pram wrote:
| That's the beauty of open source! Just fork your own Linux and
| then refuse any further patches or PRs. True freedom.
| dooglius wrote:
| No, because the Linux kernel is designed explicitly as a
| monolithic kernel, so as long as there is anything anyone wants
| to do that is "OS-like", it expects to own that.
| rkangel wrote:
| > with nobody looking at the big picture
|
| As I understand it, this is exactly what Linus does. He's the
| final arbiter of what makes it in, and has a history of setting
| high level direction for how stuff should be (usually prefixed
| with "I'm rejecting this because").
| gpderetta wrote:
| This patch both cleans up existing complexity and adds
| functionality needed for a significant use case.
| anoncake wrote:
| What purpose is a general-purpose kernel not built for?
| vondur wrote:
| I'm assuming if you build the kernel yourself, you may be able
| to strip out features you don't want.
| penagwin wrote:
| There are pieces of linux - say the networking - that already
| don't meet many people's use cases. People seem to like BSD for
| certain networking stuff due to differences in the respective
| networking stacks.
| zokula wrote:
| > People seem to like BSD for certain networking stuff due to
| differences in the respective networking stacks.
|
| Is that why *BSD has lower market share for servers and other
| networking equipment than it did 15 years ago?
| thedracle wrote:
| I mean, the futex2 proposal has been around for over a year.
|
| A lot of the history and discussion can be found here:
| https://lkml.org/lkml/2021/4/27/1208
|
| I personally feel like the kernel maintainers have been very
| conservative over the years regarding the introduction of new
| system calls, but then again I've maybe gotten used to the
| trauma of being familiar with windows internals.
| slaymaker1907 wrote:
| I think that will be the day somebody comes and creates a new
| FOSS kernel which gets rid of mostly unused legacy code. The
| problem with this sort of stuff is that it is hard to know what
| abstractions will actually be useful before said abstractions
| are available.
|
| I don't think anyone in their right mind would design the
| IndexedDB APIs the way they are with the background of modern
| JS. It would certainly be promise based and I think there would
| be a better distinction between an actual error with IndexedDB
| vs just an aborted transaction. It's also extremely confusing
| about how long you can have transactions ongoing. I suspect
| they would replace the nonsensical autocommit behavior of
| transactions currently with some sort of timeout mechanism.
| There is way too much asynchronous code out there these days to
| rely on the event loop for transactions committing.
|
| Linux is honestly pretty clean as far as outdated abstractions
| go, but Windows is definitely very guilty of keeping around old
| outdated abstractions. Having execution policies for PowerShell
| while batch scripts get off scot-free is really strange.
| These restrictions would only make sense if Windows didn't let
| users run VBScript, Batch, etc. Can anyone tell me how running
| a PowerShell script is inherently more dangerous than VBScript
| or Batch scripts?
| stjohnswarts wrote:
| That's a fair criticism but the kernel is very modular and the
| legacy stuff that no one will maintain gets cut all the time,
| so I think it's all a fair tradeoff. I'm not angry, just
| pragmatic, because if you don't evolve you die; look at how
| much market share BSD has lost over the past 15 years.
| Ericson2314 wrote:
| This shouldn't be getting down-votes.
|
| Consider the slogan "composition not configuration", Linux is
| squarely the latter. All the hardware, all the different use-
| cases, that are just added to what's mostly a monolith. You can
| disable them, but not really pull things apart.
|
| Now, Linux also has some of the most stringent code review in
| the industry, and they are well aware of issues of
| maintainability, but fundamentally C is not a sufficiently
| expressive language for them to architect the code base in a
| compositional manner.
|
| What really worries me is that as the totality of the userland
| uses an increasingly large interface to the kernel, it will be
| very hard to ever replace Linux with something else when we
| _do_ have something more compositional. I am therefore
| interested in, and somewhat working on, stuff that hopefully
| would end up with a multi-kernel NixOS (think Debian
| GNU/kFreeBSD) to try to get some kernel competition going
| again, and incentivize some userland portability.
|
| Note that I don't think portability <===> just use old system
| calls. I am no Unix purist. I would really like a kernel that
| does add new interfaces but also dispenses with the old cruft;
| think Rust vs C++.
| fabianhjr wrote:
| > Consider the slogan "composition not configuration", Linux
| is squarely the latter. All the hardware, all the different
| use-cases, that are just added to what's mostly a monolith.
| You can disable them, but not really pull things a part.
|
| Linux is a monolithic kernel; it sounds like you are more
| interested in microkernels.
| https://www.wikiwand.com/en/Microkernel
|
| One microkernel alternative could be Redox-OS:
| https://www.wikiwand.com/en/Redox_(operating_system)
| bruckie wrote:
| You could write a monolithic kernel using composition for
| your code structure. It would just be composition at build
| time rather than at runtime.
|
| It seems pretty hard to write a microkernel without using
| composition, though, since the runtime requirements kind of
| dictate your code structure.
| oalae5niMiel7qu wrote:
| > but fundamentally C is not a sufficiently expressive
| language for them to architect the code base in a
| compositional manner.
|
| Plan 9 would like to have a word with you.
| joppy wrote:
| I don't think this is an achievable goal for a kernel, since it
| has strictly more permissions than user code. There are some
| parts of user code that simply cannot be made faster by
| changing how they're written: no amount of rewriting in
| Rust/C/C++ will help if the relevant kernel primitives are not
| there. The kernel must move with the times, and incorporate new
| features as they are (tastefully) decided upon.
| fullstop wrote:
| The BSDs are always an option.
| zokula wrote:
| The BSDs were an option, but they are becoming more
| moribund and less used every year.
| bombcar wrote:
| Linus has done exactly that a few times; though usually he
| defines what would be needed to be merged.
|
| Some of the changes aren't compatible with the kernel direction
| as a whole and so they're a separate branch. Given enough time
| (and if the change is major AND ends up being widely used) it
| may get merged in.
|
| Those who want a specific kernel with a defined feature set
| often will look to one of the BSDs as a base (as often they
| also have reduced hardware requirements, think routers).
| yarg wrote:
| As a relative layman, this all seemed reasonable and desirable
| - and the fact that he integrated it with pthreads seems
| particularly cool to me.
|
| NUMA awareness especially stood out as highly significant.
|
| The new supported sizes seem nice and shiny - if well
| utilised, it could improve performance and capacity in general,
| capabilities in the low-end and scaling in the high-end.
| lizknope wrote:
| Linus Torvalds looks at the big picture all the time. That is
| basically his job as he doesn't write a lot of low level code
| anymore.
|
| Do you have an example of something that you consider part of
| the "dumping ground?"
| trillic wrote:
| So what? This code (and most drivers, modules, additional
| functionality, etc) will only be compiled in if necessary and
| will only actually be run if you use the feature.
| c7DJTLrn wrote:
| My concern is not about what _my_ flavor of kernel
| includes. It's about the size of the codebase, and whether
| the limited
| number of maintainers can keep up with the countless features
| being introduced.
| coolreader18 wrote:
| I mean, I think in terms of maintainer resources Linux is
| one of the projects with the least to worry about; there
| are always companies willing to have a team work on
| stuff.
| Ericson2314 wrote:
| Those megacorps also accumulate endless complexity
| internally, as does the industry as a whole. It's a
| problem larger than the discipline of reviewers of C code
| could ever hope to fix.
|
| Even the larger world outside the field is somewhat aware
| of the looming complexity problem:
| https://www.theatlantic.com/technology/archive/2017/09/savin...
|
| Basically, on our current trajectory we will end up with
| something like the biosphere: able to do stuff, but too
| complex to fully understand --- and probably a good deal
| more fragile than that?
| kaba0 wrote:
| That's what testing is for. We can't really reason
| globally about much smaller codebases either. Software is
| complex.
| cbmuser wrote:
| The kernel is regularly shedding off unmaintained code.
|
| So the codebase is actually not growing as much as it
| could.
| c7DJTLrn wrote:
| The code they're shedding off is drivers. They won't ever
| get rid of old syscalls because Linus refuses to break
| userspace, ever.
| zamalek wrote:
| > The use case lies in the Wine implementation of the Windows NT
| interface WaitMultipleObjects.
|
| This can't only be a win for emulating Windows, surely (future)
| Linux native code could benefit from this too?
| CoolGuySteve wrote:
| Linux's answer is io_uring with eventfd. I've noticed it has
| higher latency than WaitForMultipleObjects though. I've had a
| hard time getting the 99th percentile latency below 5usec.
|
| It would be nice if io_uring polling had a futex-like feature
| that spins in userspace rather than having to syscall
| io_uring_wait every time. I feel like that feature would be
| more universal than futex2.
|
| Adding an atomic int somewhere in the userspace would add
| ~100ns in the kernel path but save many microseconds in the
| userspace-spin path during high contention so it seems like a
| win to me.
| vardump wrote:
| Spinning consumes a lot of power, so I'd guess one would only
| want to do that when very low latency is truly required. At
| least I certainly wouldn't want it to be the default for
| the generic case.
|
| Of course a low latency option should also exist for those
| cases that require it and currently that pretty much means
| spinning.
| the8472 wrote:
| My understanding of io_uring is that you can already poll the
| CQ with atomics without any syscall. io_uring_enter() is only
| needed to inform the kernel that new SQ entries exist and
| that too can be replaced by polling via IORING_SETUP_SQPOLL
| in privileged processes.
| vardump wrote:
| > This can't only be a win for emulating Windows, surely
| (future) Linux native code could benefit from this too?
|
| That would certainly help porting Windows software to Linux.
| [deleted]
| monocasa wrote:
| Too bad bpf isn't allowed for non root yet. A bpf futex program
| like xok's wake programs would be more general, and way nicer.
| fefe23 wrote:
| What are you talking about?
|
| I have been using bpf in seccomp for non-root for years.
|
| Maybe your distro patched that out. The stock kernel allows bpf
| for non-root.
| monocasa wrote:
| Sort of... there's been push back in the post spectre days of
| expanding non root usage of BPF. Right now, a subset of trace
| programs and seccomp are allowed for non-root, but that
| hasn't expanded since 2016. Everything else requires
| CAP_SYS_ADMIN.
|
| They would probably remove those too if the balance of Linux
| didn't put "don't break user space" higher than "defense in
| depth security".
___________________________________________________________________
(page generated 2021-04-28 23:00 UTC)