[HN Gopher] The Fastest Mutexes
___________________________________________________________________
The Fastest Mutexes
Author : jart
Score : 472 points
Date : 2024-10-02 15:31 UTC (7 hours ago)
(HTM) web link (justine.lol)
(TXT) w3m dump (justine.lol)
| kccqzy wrote:
| > The reason why Cosmopolitan Mutexes are so good is because I
| used a library called nsync. It only has 371 stars on GitHub, but
| it was written by a distinguished engineer at Google called Mike
| Burrows.
|
| Indeed this is the first time I've heard of nsync, but Mike
| Burrows also wrote Google's production mutex implementation at
| https://github.com/abseil/abseil-cpp/blob/master/absl/synchr...
| I'm curious why this mutex implementation is absent from the
| author's benchmarks.
|
| By the way, if the author farms out to __ulock on macOS, this
| could be more simply achieved by just using the wait() and
| notify_one() member functions in libc++'s atomic library.
|
| A while ago there was also a giant thread related to improving
| Rust's mutex implementation at https://github.com/rust-
| lang/rust/issues/93740#issuecomment-.... What's interesting is
| that there's a detailed discussion of the inner workings of
| almost every popular mutex implementation.
| tialaramex wrote:
| Although it does get there eventually, that Rust thread is
| about Mara's work, which is why it eventually mentions her
| January 2023 book.
|
| The current Rust mutex implementation (which that thread does
| talk about later) landed earlier this year, and although on
| Linux it's not (much?) different, on Windows and Mac I
| believe it's new work.
|
| That said, Mara's descriptions of the guts of other people's
| implementations are still interesting, just make sure to check
| whether they're outdated for your situation.
| SkiFire13 wrote:
| > although if you're on Linux it's not (much?) different
|
| AFAIK one reason to switch was that mutexes on Linux and
| macOS were not guaranteed to be movable, so Rust's Mutex had
| to box the underlying OS mutex and was not const-
| constructible. So this was a considerable change.
| tialaramex wrote:
| That's the reason for Mara's Mutex. I know, it doesn't seem
| like five minutes, but Mara's is now the _previous_ version
| of the Rust Mutex implementation.
|
| std::sync::Mutex::new became const in 1.63, which was _over
| two years ago_.
| lilyball wrote:
| It looks like Mara also did the work to make it const-
| constructible, so it's still her implementation, no?
| https://github.com/rust-lang/rust/pull/97647
| tialaramex wrote:
| Sorry, maybe I was unclear, when I pointed out that this
| work was in 2022 that's not because I was trying to say
| it's recent, that's _two years ago_. The current work was
| in 2024.
|
| Here's the 2024 rework of Windows Mutex:
| https://github.com/rust-lang/rust/pull/121956/
|
| Edited to add: Since we're going back and forth, I read
| the blame log. The Linux implementation, with which I'm
| most familiar, is still 85% Mara's, although with some
| meaningful changes from 2024.
| the-rc wrote:
| Burrows is also responsible for the Burrows Wheeler Transform,
| Bigtable, Dapper and Chubby, among others.
| loeg wrote:
| > I'm curious why [Abseil's] mutex implementation is absent
| from the author's benchmarks.
|
| Possibly because it's C++ (as opposed to C)? I am speculating.
| vitus wrote:
| > Possibly because it's C++ (as opposed to C)?
|
| MSVC 2022's std::mutex is listed, though. (That said, GCC's /
| clang's std::mutex is not listed for Linux or macOS.)
|
| absl::Mutex does come with some microbenchmarks with a
| handful of points of comparison (std::mutex,
| absl::base_internal::SpinLock) which _might_ be useful to get
| an approximate baseline.
|
| https://github.com/abseil/abseil-
| cpp/blob/master/absl/synchronization/mutex_benchmark.cc
| BryantD wrote:
| Mike was a legend by the time I got to AV. The myth was that
| any time the search engine needed to be faster, he came in and
| rewrote a few core functions and went back to whatever else he
| was doing. Might be true, I just can't verify it personally.
| Extremely smart engineer who cares about efficiency.
|
| We did not, however, run on one server for any length of time.
| rnrn wrote:
| Is Cosmopolitan's mutex also less fair than the other
| implementations compared?
| cogman10 wrote:
| It's not fair, but (from the description) it does make some
| effort to be fairish. It'll queue up waiters in a linked list,
| which is fairly fair, but new arrivals can jump ahead of the
| line and grab the CAS before the list is processed.
|
| However, it has added logic to start chunking through the queue
| after 30 wakeups from a waiter. With that, waiting isn't
| indefinite, but it also isn't fair.
|
| I have no idea how that compares to other implementations'
| fairness. I know the JDK recently abandoned fairness because of
| the overhead and complexity it added around mutex handling.
| tialaramex wrote:
| Fairness is known to cause serious problems such as convoys,
| so while "unfair" isn't itself a desirable property, you
| might choose to be unfair because the alternatives are worse
| for some applications.
|
| Probably a specifically Fair version of any concurrent
| primitive which can be fair is worth giving a distinct name,
| the way you'd offer both a _stable_ and an _unstable_ sort,
| knowing that the unstable sort will often (though not always)
| be faster but some people cannot tolerate an unstable sort so
| it's useless for them.
|
| On the other hand, maybe it's an attractive nuisance and if
| you offered this you'd find most users took your FairMutex,
| and then were angry because of the inevitable problems from
| fairness, even though the unfair one was right there...
| greiskul wrote:
| In a lot of cases, programmers don't even care about
| fairness, but do care about starvation. Is there a word for
| structures like the one discussed here, that are unfair but
| appear to prevent unlimited starvation?
| DSMan195276 wrote:
| I don't think this lock is _guaranteed_ to prevent
| starvation, it just makes an effort at it. There are only
| two priority levels and a hard-coded 30 wake-ups required
| to enter high priority - if waiters were continually
| joining then there could always be more than one entry in
| high priority and an entry could get stuck there forever.
| Typically it won't matter, but if you have high
| contention then this might not be good enough.
| pizlonator wrote:
| Most fast locks these days are unfair. It turns out the perf
| advantage of unfairness is so massive that it's just Better
| (TM).
| tombert wrote:
| I have to admit that I have an extremely visceral, negative
| feeling whenever I see a mutex, simply because I've had to debug
| enough code written by engineers who don't really know how to use
| them, so a large part of previous jobs has been to remove locks
| from code and replace with some kind of queue or messaging
| abstraction [1].
|
| It's only recently that I've been actively looking into different
| locking algorithms, just because I've been diving in head-first
| to a lot of pure concurrency and distributed computing theory, a
| lot of which is about figuring out clever ways of doing mutexes
| with different tradeoffs.
|
| I've gotten a lot better with them now, and while I still
| personally will gravitate towards messaging-based concurrency
| over locks, I do feel the need to start playing with some of the
| more efficient locking tools in C, like nsync (mentioned in this
| article).
|
| [1] Before you give me shit over this, generally the replacement
| code runs at roughly the same speed, and I at least personally
| think that it's easier to reason about.
| aidenn0 wrote:
| I feel similarly about C's "volatile" (when used in
| multithreaded code rather than device drivers). I've seen
| people scatter volatile around randomly until the problem goes
| away. Given that volatile significantly disturbs the timing of
| a program, any timing-sensitive bugs can be masked by adding it
| around randomly.
| cogman10 wrote:
| There seems to be a lot of voodoo beliefs around concurrent
| programming that lead to really bad things.
|
| One of the best books I've read on it is Java concurrency in
| practice [1]. It does an excellent job of dispelling these
| occultic beliefs and letting the reader know exactly when and
| how concurrency should be implemented. It is applicable to
| more languages than just java, especially since many have
| adopted large parts of the java memory model.
|
| The worst things I usually find when reviewing concurrent
| code are people not using locks when they should, using
| locks when they shouldn't, and having inconsistent data
| guards. I've seen people throw in random locks to guard local
| non-shared state, which is just crazy town: "Multiple
| threads are running this code, so I'm adding a lock".
|
| I certainly prefer message passing over shared state.
| However, it's a little baffling to me why it's so hard for
| devs to grasp how to properly maintain shared state. Instead
| of just learning the basic rules, it gets couched in "It's
| just too hard to understand so keep adding things until it
| works".
|
| [1] https://www.amazon.com/Java-Concurrency-Practice-Brian-
| Goetz...
| LegionMammal978 wrote:
| > However, it's a little baffling to me why it's so hard
| for devs to grasp how to properly maintain shared state.
| Instead of just learning the basic rules, it gets couched
| in "It's just too hard to understand so keep adding things
| until it works".
|
| Probably because most people aren't aware that there are
| basic rules to be learned. I'd imagine the typical
| experience is, you're very familiar with single-threaded
| code, and now you're trying to let other threads work with
| your data. You have heard that there are many pitfalls, and
| that there are special-purpose tools like mutexes to avoid
| those, but you look at the examples and find them mostly
| baffling. "Why do they perform these incantations for this
| data but not that data, or in this place but not that
| place?" So you come up with some weird mental model and
| move on with your life, never aware that there are
| underlying principles for maintaining shared state.
|
| Personally, I didn't understand mutexes very well at all,
| until I started looking into what the atomic memory
| orderings from C++ et al. were supposed to mean.
| Maxatar wrote:
| Not too sure what the basic rules are and I'm not able to
| find any list of such rules.
|
| For me the biggest challenge when sharing state is that
| the only benefit I can see for parallelism is
| performance, so if I'm not gaining performance there is
| no reason to use parallelism. If I use coarse-grained
| mutexes then I end up with code that is straightforward to
| reason about, but I lose the performance benefit and in
| fact can end up slower than single-threaded code.
|
| If I use very fine grained mutexes then I end up with
| faster code that has very hard to find bugs that happen
| on very rare occasion.
|
| And then on top of that even if you do write correct fine
| grained locking, you can still end up with slow code due
| to cache behavior such as false sharing and cache
| coherence.
|
| So ultimately I disagree that writing parallel code is
| simple unless you're willing to give up performance in
| which case you may as well just stick to single threaded
| code or use parallelism among independent data. Writing
| correct parallel software that shares state and actually
| delivers substantial performance benefits is incredibly
| difficult, and I am skeptical that there is a set of
| simple rules that one can simply read about.
| tialaramex wrote:
| > Not too sure what the basic rules are and I'm not able
| to find any list of such rules.
|
| The _actual_ rules are completely terrifying because they
| involve the physics of microprocessors. If you've
| watched Grace Hopper's lectures where she gives out
| physical nanoseconds (pieces of wire that are the same
| length as the distance light travels in a nanosecond,
| thus, the maximum possible distance data could travel in
| that time) you can start to appreciate the problem. It is
| _literally impossible_ for the intuitive Sequentially
| Consistent model of how computers work to apply for
| today's fast yet concurrent processors. Light is too slow.
|
| However generally people mean either Java's memory model
| or the C++ 11 (and subsequently 14, 17, 20) memory models
| used in languages such as C++, C and Rust. Those rules
| are less terrifying but still pretty complicated and the
| programming language promises to somehow provide an
| environment where these rules (not the terrifying ones)
| are all you need to know to write software. So that's
| nice.
|
| It _can_ be simple to write parallel code for a language
| designed to make that easy. Yes, even if there's shared
| data. It only starts to get trickier if the shared data
| is _modified_; so long as it isn't, we can make copies of
| it safely and modern CPUs will do that without actual
| work by the programmer.
| cogman10 wrote:
| Are there popular languages that don't have memory models
| which make reasoning about concurrent code easier?
|
| A language with a notion of threading and shared state is
| going to have something akin to read/write barriers built
| into the language memory model to tame the beast.
| jcranmer wrote:
| I think tialaramex is overselling the complexity of
| concurrent memory models in practice, at least for end
| users. In reality, all modern memory models are based on
| the data-race-free theorem, which states that in the
| absence of data races--if your program is correctly
| synchronized--you can't tell that the hardware isn't
| sequentially consistent (i.e., what you naively expected
| it to do).
|
| Correct synchronization is based on the happens-before
| relation; a data race is defined as a write and a
| conflicting read or write such that neither happens-
| before the other. Within a thread, happens-before is just
| regular program order. Across threads, the main happens-
| before relation that matters is that a release-store on a
| memory location happens-before an acquire-load on that
| memory location (this can be generalized to any memory
| location _if_ they're both sequentially-consistent, but
| that's usually not necessary).
|
| The real cardinal rule of concurrent programming is to
| express your semantics at the highest possible level of
| what you're trying to do, and find some library that does
| all the nitty-gritty of the implementation. Can you
| express it with fork-join parallelism? Cool, use your
| standard library's implementation of fork-join and just
| don't care about it otherwise.
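|
| In C11 terms, that release/acquire edge looks like this (a
| minimal sketch; the thread functions and payload here are
| invented for illustration):
|
|     #include <pthread.h>
|     #include <stdatomic.h>
|     #include <stdbool.h>
|     #include <stdio.h>
|
|     int payload;        /* plain, non-atomic data */
|     atomic_bool ready;  /* the synchronizing flag */
|
|     void *producer(void *arg) {
|         payload = 42;                                      /* (1) */
|         atomic_store_explicit(&ready, true,
|                               memory_order_release);       /* (2) */
|         return NULL;
|     }
|
|     void *consumer(void *arg) {
|         while (!atomic_load_explicit(&ready,
|                                      memory_order_acquire)) /* (3) */
|             ;
|         /* (2) happens-before (3), so (1) happens-before (4):
|            this read is guaranteed to see 42, with no data race. */
|         printf("%d\n", payload);                           /* (4) */
|         return NULL;
|     }
|
|     int main(void) {
|         pthread_t a, b;
|         pthread_create(&a, NULL, producer, NULL);
|         pthread_create(&b, NULL, consumer, NULL);
|         pthread_join(a, NULL);
|         pthread_join(b, NULL);
|         return 0;
|     }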
| xxpor wrote:
| C?
| zozbot234 wrote:
| > Not too sure what the basic rules are and I'm not able
| to find any list of such rules.
|
| You may want to consider https://marabos.nl/atomics/ for
| an approachable overview that's still quite rigorous.
| cogman10 wrote:
| > Not too sure what the basic rules are and I'm not able
| to find any list of such rules.
|
| I'd suggest the book in my original comment, Java
| concurrency in practice.
|
| > If I use very fine grained mutexes then I end up with
| faster code that has very hard to find bugs that happen
| on very rare occasion.
|
| I agree this is a real risk if you are doing fine grained
| mutexes. But the rules are the same whether or not you
| want to follow them. If you have shared state (A, B, C)
| and you want to do a calculation based on the values of
| (A, B, C) then you need a mutex which locks (A, B, C).
| Certainly, that becomes a problem if you have calculations
| that just require (A, C) and you might want to avoid
| locking for B. In that case, you need a more complicated
| mechanism for locking than just simple mutexes, which is
| certainly easy to get wrong. When the (A, B, C) actions
| happen you have to ensure that the (A, C) actions can't
| happen at the same time.
|
| This isn't a complicated rule, but it is one that can be
| hard to follow if you are trying to do super fine grained
| locking. It's even trickier if you are going to abuse the
| platform to get correct results.
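|
| A minimal pthreads sketch of that rule (the struct and field
| names here are invented):
|
|     #include <pthread.h>
|
|     struct state {
|         pthread_mutex_t mu;  /* guards a, b, and c together */
|         int a, b, c;
|     };
|
|     /* Any calculation over (a, b, c) holds mu for the whole
|        read-modify-write, so it sees a consistent snapshot. */
|     int sum_abc(struct state *s) {
|         pthread_mutex_lock(&s->mu);
|         int r = s->a + s->b + s->c;
|         pthread_mutex_unlock(&s->mu);
|         return r;
|     }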
|
| But fine v. coarse isn't the problem I'm referring to when
| I say people get the simple rules wrong. Rather than
| worrying about fine vs coarse grained locking, I very
| frequently see code where mutexes and concurrency
| primitives are just peppered everywhere and haphazardly.
| We might call that super coarse grained.
| SJC_Hacker wrote:
| > For me the biggest challenge when sharing state is that
| the only benefit I can see for parallelism is
| performance, so if I'm not gaining performance there is
| no reason to use parallelism.
|
| Aside from performance, another very common reason is to
| not lock the UI from the user. Even in UI-less programs,
| the ability to abort some operation which is taking too
| long. Another is averaging out performance of compute
| tasks, even in the case where it would be faster to
| handle them sequentially. Without some degree of
| parallelism these things are not possible.
|
| Consider a web server. Without parallelism every single
| request is going to completely lock the program until it's
| complete. With parallelism, you can spawn off each
| request, and handle new ones as they come in. Perceived
| performance for the majority of users in this case is
| significantly improved even in the case of a single-
| processor system - e.g. you have 99 requests which each
| take a single second, and then one which takes 101
| seconds. Total request time is 200 seconds / 100 requests
| = 2 seconds average per request, but if that 101-second
| request comes in first, the other 99 are locked for 101
| seconds, so the average is now > 100 seconds per request ...
| Maxatar wrote:
| >Aside from performance, another very common reason is to
| not lock the UI from the user.
|
| This is not a good fit for parallelism; this is pretty
| much always accomplished using concurrency, i.e.
| async/await.
| f1shy wrote:
| Maybe because I had a complete semester of multiprogramming
| at uni, I find it almost trivial to work in such
| environments, and cannot comprehend why there is so much
| mysticism and voodoo around it. It's actually pretty simple.
| tombert wrote:
| I feel like it's not terribly hard to write something
| that more or less works using mutexes and the like, but I
| find it exceedingly hard to debug. You're at the mercy of
| timing and the scheduler, meaning that often just
| throwing a breakpoint and stepping through isn't as easy
| as it would be with a sequential program.
|
| I feel like with a queue or messaging abstraction, it can
| be easier to debug. Generally your actual _work_ is being
| done on a single thread, meaning that traditional
| debugging tools work fine, and as I've said in sibling
| comments, I also just think it's easier to reason about
| what's going on.
| tombert wrote:
| +1 for the Java Concurrency in Practice book. It's the book
| I recommend to nearly everyone who wants to get into
| concurrent programming. Goetz makes it a lot more
| approachable than most other books.
| hinkley wrote:
| Goetz has come a long way. I knew one of the people who
| contributed to that book, and he was a little frustrated
| about having to explain things to Goetz that he felt he
| shouldn't have had to. The implication was he'd already
| had this conversation with some of the other
| contributors.
|
| Sometimes though, the newbie is going to write the
| clearest documentation.
| hinkley wrote:
| I loved concurrent code when I was starting out. I'd taken
| a pretty good distributed computing class which started the
| ball rolling. They just fit into how my brain worked very
| well.
|
| Then I had to explain my code to other devs, either before
| or after they broke it, and over and over I got the message
| that I was being too clever. I've been writing Grug-brained
| concurrent code for so long I'm not sure I can still do the
| fancy shit anymore, but I'm okay with that. In fact I know
| I implemented multiple-reader single-writer at least a few
| times, and that came back to me during this thread, but I
| still can't remember how I implemented it.
| tombert wrote:
| That's something I'm afraid of for my latest project. I
| did some concurrent stuff that wasn't 100% clear would
| actually work, and I had to write a PlusCal spec to
| exhaustively prove to myself that what I was doing is
| actually OK.
|
| It works pretty well, and I'm getting decent speeds, but
| I'm really scared someone is going to come and "fix" all
| my code by doing it the "normal" way, and thus slow
| everything down. I've been trying to comment the hell out
| of everything, and I've shared the PlusCal spec, but no
| one else on my team knows PlusCal and I feel like most
| engineers don't actually read comments, so I think it's
| an inevitability that my baby is killed.
| tialaramex wrote:
| In most cases (in a C or C++ compiler, not Java) it's just
| straight up incorrect to use volatile for something other
| than memory mapped I/O. Yes, POSIX guarantees that in a
| specific case (signal handling IIRC) it'll do what you meant
| if you use volatile int. That's nice, but more generally this
| is not the right choice.
|
| Unfortunately Microsoft enshrined the situation (on Windows,
| on their compiler, on x86 and x86-64 only) that volatile
| primitive types are effectively atomics with Acquire-Release
| ordering. This is of course awkward when Microsoft tries to
| bring people to a non-x86 architecture and it can't just give
| them this because it would _suck_ really hard, so finally
| they have to grow up and teach their developers about actual
| atomics.
|
| !! Edited to fix: Previously this said Relaxed ordering; the
| ordering guaranteed by Microsoft is in fact Acquire-Release,
| hence it's expensive to provide for architectures where
| that's not the default.
| hinkley wrote:
| When Java implemented volatile it didn't do anything. Later
| when they fixed the memory model to deal with out of order
| execution they made it part of the fence semantics, and
| then it actually made some sense.
| zozbot234 wrote:
| The "volatile" keyword should _never_ be used for C /C++
| multithreaded code. It's specifically intended for access to
| device-mapped addresses and does not account for any specific
| memory model, so using it for multithreading will lead to
| breakage. Please use the C/C++ memory model facilities
| instead.
|
| (As a contrast, note that in Java the "volatile" keyword can
| be used for multithreading, but again this does not apply to
| C/C++.)
| hinkley wrote:
| I'm surprised that's true. C borrowed very heavily from
| Java when fixing the NUMA situations that were creeping
| into modern processors.
| jcranmer wrote:
| The C/C++ memory model is directly derived from the Java
| 5 memory model. However, the decision was made that
| volatile in C/C++ specifically referred to memory-mapped
| I/O stuff, and the extra machinery needed to effect the
| sequential consistency guarantees was undesirable. As a
| result, what is volatile in Java is _Atomic in C and
| std::atomic in C++.
|
| C/C++ also went further and adopted a few different
| notions of atomic variables, so you can choose between a
| sequentially-consistent atomic variable, a
| release/acquire atomic variable, a release/consume atomic
| variable (which ended up going unimplemented for
| reasons), and a fully relaxed atomic variable (whose
| specification turned out to be unexpectedly tortuous).
| aidenn0 wrote:
| > Please use the C/C++ memory model facilities instead
|
| I should point out that for more than half of my
| professional career, those facilities did not exist, so
| volatile was the most portable way of implementing e.g. a
| spinlock without the compiler optimizing away the check.
| There was a period after compilers began aggressively
| inlining and before C11 came out in which it could be
| quite hard to otherwise convince a compiler that a value
| might change.
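|
| For contrast, a sketch of the old idiom next to the C11
| facilities (illustrative only, not from any particular
| codebase):
|
|     #include <stdatomic.h>
|
|     /* Pre-C11: volatile stops the compiler from hoisting the
|        load, but guarantees nothing about atomicity or ordering. */
|     volatile int busy;
|
|     /* C11: atomic_flag gives a real test-and-set with ordering. */
|     atomic_flag lk = ATOMIC_FLAG_INIT;
|
|     void lock(void) {
|         while (atomic_flag_test_and_set_explicit(&lk,
|                                                  memory_order_acquire))
|             ;  /* spin */
|     }
|
|     void unlock(void) {
|         atomic_flag_clear_explicit(&lk, memory_order_release);
|     }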
| jart wrote:
| What are some examples of people using mutexes wrong? I know
| one gotcha is you need to maintain a consistent hierarchy.
| Usually the easiest way to not get snagged by that, is to have
| critical sections be small and pure. Java's whole MO of letting
| people add a synchronized keyword to an entire method was
| probably not the greatest idea.
| cogman10 wrote:
| When, how, and why.
|
| The biggest part of mutexes and how to properly use them is
| thinking of the consistency of the data that you are working
| with.
|
| Here's a really common bug (pseudocode):
|
|     if (lock { data.size() } > 0) {
|         value = lock { data.pop() }
|         lock { foo.add(value) }
|     }
|
| The issue here is size can change, pop can change, and foo
| can change in unexpected ways between each of the acquired
| locks.
|
| The right way to write this code is:
|
|     lock {
|         if (data.size() > 0) {
|             value = data.pop()
|             foo.add(value)
|         }
|     }
|
| That ensures the data is all in a consistent state while you
| are mutating it.
|
| Now, what does make this tricky is someone well-meaning might
| have decided to push the lock down a method.
|
| Imagine, for example, you have a `Foo` where all methods
| operate within a mutex.
|
| This code is also (likely) incorrect:
|
|     value = foo.bar()
|     if (value.bat()) {
|         foo.baz(value)
|     }
|
| The problem here is exactly the same problem as above.
| Between `foo.bar()` and `foo.baz()` the state of foo may have
| changed such that running `foo.baz(value)` is now a mistake.
| That's why the right thing to do is likely to have a
| `foo.barBaz()` method that encapsulates the `if` logic to
| avoid inconsistency (or to add another mutex).
|
| In java, the most common manifestation (that I see) of this
| looks like this:
|
|     var map = new ConcurrentHashMap();
|     if (map.get(foo) == null)
|         map.put(foo, new Value());
|
| Because now, you have a situation where the value of `foo` in
| the map could be 2 or more values depending on who gets it.
| So, if someone is mutating `Value` concurrently you have a
| weird hard-to-track-down data race.
|
| The solution to this problem in java is:
|
|     map.computeIfAbsent(foo, (unused) -> new Value());
| hinkley wrote:
| Composing locks is where Java usually blows up.
|
| And computeIfAbsent can end up holding the lock for too
| long if the function is slow.
| foobazgt wrote:
| Composing locks isn't a Java problem - it's a fundamental
| abstraction problem with locks. This is one of the
| reasons why you usually reach for higher level
| abstractions than mutexes.
|
| > And computeIfAbsent can end up holding the lock for too
| long if the function is slow.
|
| How is this different from any other lock-holding code
| written anywhere?
| hinkley wrote:
| I'm saying Java is exceptionally bad at this because
| every object is its own mutex.
|
| And you end up having to trade single-core performance
| for multi-core by deciding to speculatively calculate the
| object. If there's no object to make, the critical section
| is very small. But as the object sprouts features you
| start smashing face-first into Amdahl.
| foobazgt wrote:
| > because every object is its own mutex.
|
| Not true in any practical sense.
|
| > And you end up having to trade single core performance
| for multi core by deciding to speculatively calculate the
| object.
|
| What is the alternative you suggest? If you care about
| having the predicate actually hold, and you also don't
| want to have to hold the lock while constructing the
| object, then you're going to end up in an optimistic-
| concurrency scenario where you check the predicate under
| lock, compute the object, and check again before swapping
| the value in. You may end up having to throw your work
| away when you discover the predicate changed. Java is no
| better nor worse at doing this than anything else.
| hinkley wrote:
| > Not true in any practical sense.
|
| This is going to put a damper on any further
| conversation.
|
| Even with coarsening and elision, every synchronized
| function takes a lock on the enclosing object.
| another-acct wrote:
| Might want to move foo.add() out of the lock scope
| (assuming foo is a thread-private resource):
|
|     value = nil
|     lock {
|         if (data.size() > 0) {
|             value = data.pop()
|         }
|     }
|     if (value) {
|         foo.add(value)
|     }
| samatman wrote:
| This is one of the areas where Zig's combination of
| anonymous blocks and block-based defer really pays off.
| Creating a locked region of code is just this:
|
|     {
|         mutex.lock();
|         defer mutex.unlock();
|         // Do mutex things
|     }
|
| It's possible to get this wrong still, of course, but both
| the anonymous scope and the use of `defer` make it easier
| to get things right.
|
| Nothing can prevent poor engineering around mutex use
| though. I'd want a critical path for a concurrent hashmap
| to look like this:
|
|     {
|         shared_map.lock();
|         defer shared_map.unlock();
|         if (shared_map.getOrNull(foo) == null) {
|             shared_map.put(foo, new_val);
|         }
|     }
|
| Where the SharedMap type has an internal mutex, and a way
| to check it, and all operations panic if no lock has been
| acquired. It could have
| `shared_map.lockAndGet(OrNull?)(...)`, so that the kind of
| problem pattern you're describing would stand out on the
| page, but it's still a one-liner to do an atomic get when
| that's all you need the critical path to perform.
|
| I don't think these invariants are overly onerous to
| uphold, but one does have to understand that they're a hard
| requirement.
| jhatemyjob wrote:
| I digress but my autistic brain couldn't help itself.
| Provided that it's a recursive lock you could do this
| instead of adding a new method `foo.barBaz`:
|
|     foo.lock {
|         value = foo.bar()  // foo.lock within this method is ignored
|         if (value.bat()) {
|             foo.baz()      // foo.lock within this method is ignored
|         }
|     }
|
| Also, to catch this bug early, you could assert foo is
| locked in `value.bat` or something. But that may or may not
| be feasible depending on how the codebase is structured
| tombert wrote:
| Personally I've had issues with performance because of people
| using `synchronized` too liberally, where they end up locking
| a lot more code than necessary. I've also had issues with
| fairly typical circular dependencies causing deadlock, or at
| least pauses that aren't strictly necessary. Deadlock doesn't
| happen nearly as often as textbooks have led me to believe,
| but it can happen with sloppily written code.
|
| In regards to Java, at this point I almost never use the
| `synchronized` keyword anymore and instead (if I can't easily
| map to some kind of queuing abstraction) use the
| ReentrantLock object simply because of the ability to have
| lock acquisition time out, and also letting you opt-in to
| fairness if you'd like. It's not as pretty but it's more
| flexible and as far as I'm aware it doesn't affect
| performance much.
|
| For the most part, though, in Java, you can get away without
| (explicit) locks by simply abusing the built-in data
| structures. I know they're using their own synchronization
| techniques behind the scenes, but I trust those to be correct
| more than some ad-hoc stuff I'd write as an engineer.
| foobazgt wrote:
| Java's take on monitors was definitely not great, and people
| were emulating mutexes with them even in the language's
| earliest days.
|
| Still there are a lot of things that can go wrong with
| mutexes: forgetting to unlock in the case of exceptions,
| priority inversion, recursive locking, deadlock, needlessly
| high contention, etc.
|
| Java has had an excellent concurrency runtime with
| abstractions that are typically a better fit than a bare
| mutex for over 20 years now (c.f. Doug Lea). Synchronized
| still exists, because of Java's excellent backwards
| compatibility.
| foobiekr wrote:
| I've always disliked that lock cyclic dependencies are
| discussed as a hierarchy when what it really comes down to is
| a linear order of locks.
|
| The problem with lock _hierarchies_ as a concept is that a
| lock really should represent serialization of access to a
| particular pool of data, and should make no assumptions that
| it being held implies some other lock's domain is also held.
| The code that results when people do not maintain this kind
| of rigor is quite terrible, but hierarchies tend to steer
| people into thinking that way because they imply recursively
| taking locks.
|
| Stated differently: locks should be taken and released in a
| fixed order - so locks are ranked - but there should not be a
| model where all lower-ranked locks must be held for a given
| lock to be taken. The lock protects its domain and the
| ordering of take and release is to prevent deadlock, but
| there's no requirement for completeness.
| bob1029 wrote:
| > some kind of queue or messaging abstraction
|
| Agreed. I find things like LMAX Disruptor much easier to reason
| about.
| tombert wrote:
| Even within Java, something like BlockingQueue will get you
| pretty far, and that's built into the runtime.
|
| If I am allowed to use libraries, I end up using Vert.x for
| nearly everything. I think that their eventbus abstraction is
| easy enough to reason about, and even without it, simply
| using the non-blocking stuff it provides ends up being pretty
| handy.
| jeffbee wrote:
| Message passing is just outsourcing the lock, right? For
| example a Go channel is internally synchronized, nothing magic
| about it.
|
| Most of the mutex tragedies I have seen in my career have been
| in C, a useless language without effective scopes. In C++ it's
| pretty easy to use a scoped lock. In fact I'd say I have had
| more trouble with people who are trying to avoid locks than
| with people who use them. The avoiders either think their
| program order is obviously correct (totally wrong on modern
| CPUs) or that their atomics are faster (wrong again on many
| CPUs).
| loeg wrote:
| > Message passing is just outsourcing the lock, right?
|
| More or less, yeah. You can write an MPSC queue that doesn't
| explicitly use a lock (or even anything that looks like a
| lock).
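|
| For instance, something like Vyukov's intrusive MPSC queue,
| sketched here in C11 (illustrative, not production code; push
| is a single atomic exchange usable from any thread, pop is
| single-consumer):
|
|     #include <stdatomic.h>
|     #include <stddef.h>
|
|     struct node { _Atomic(struct node *) next; };
|
|     struct mpsc {
|         _Atomic(struct node *) head;  /* producers push here */
|         struct node *tail;            /* consumer pops here  */
|         struct node stub;             /* dummy node          */
|     };
|
|     void mpsc_init(struct mpsc *q) {
|         atomic_store(&q->stub.next, NULL);
|         atomic_store(&q->head, &q->stub);
|         q->tail = &q->stub;
|     }
|
|     void mpsc_push(struct mpsc *q, struct node *n) {
|         atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
|         struct node *prev =
|             atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
|         atomic_store_explicit(&prev->next, n, memory_order_release);
|     }
|
|     /* Returns NULL when empty or when a producer is mid-push. */
|     struct node *mpsc_pop(struct mpsc *q) {
|         struct node *tail = q->tail;
|         struct node *next =
|             atomic_load_explicit(&tail->next, memory_order_acquire);
|         if (tail == &q->stub) {  /* skip the dummy node */
|             if (!next) return NULL;
|             q->tail = next;
|             tail = next;
|             next = atomic_load_explicit(&tail->next, memory_order_acquire);
|         }
|         if (next) { q->tail = next; return tail; }
|         if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
|             return NULL;         /* producer mid-push; retry later */
|         mpsc_push(q, &q->stub);  /* re-insert dummy to free last node */
|         next = atomic_load_explicit(&tail->next, memory_order_acquire);
|         if (next) { q->tail = next; return tail; }
|         return NULL;
|     }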
| tombert wrote:
| It's definitely doing synchronization behind the scenes, no
| argument here. BlockingQueues in Java seem to use
| ReentrantLocks everywhere. It's outsourcing the lock to
| people who understand locks better.
|
| It just abstracts this detail away for me, and I personally
| trust the libraries implementing these abstractions to be
| more correct than some ad hoc thing I write. It's an
| abstraction that I personally find a lot easier to reason
| about, and so my thinking is this: if my reasoning is more
| likely to be correct because of the easier abstraction, and
| the internal synchronization is more likely to be correct,
| then it's more likely that my code will be correct.
|
| I don't do super low-level stuff at all, most of my stuff
| ends up touching a network, so the small differences between
| the built-in synchronized structures vs the regular ones
| really don't matter since any small gains I'd get on that
| will be eaten the first time I hit the network, so a
| considerably higher ROI for me is almost always figuring out
| how to reduce latency.
|
| If I did C or C++, I'd probably have different opinions on
| this stuff.
| foobazgt wrote:
| Every abstraction is about outsourcing the thing it's
| abstracting away. If using a queue solves your problem, you
| no longer have to deal with all the headaches that you can
| run into using a bare mutex.
| another-acct wrote:
| > C, a useless language without effective scopes
|
| Mutexes can be handled safely in C. It's "just another
| flavor" of resource management, which does take quite a bit
| of discipline. Cascading error paths / exit paths help.
| sgarland wrote:
| > C, a useless language
|
| You misspelled "fast as fuck" and "lingua franca of all
| architectures."
| another-acct wrote:
| > remove locks from code and replace with some kind of queue or
| messaging abstraction
|
| Shared-nothing message passing reflects the underlying (modern)
| computer architecture more closely, so I'd call the above a
| good move. Shared memory / symmetric multiprocessing is an
| abstraction that leaks like a sieve; it no longer reflects how
| modern computers are built (multiple levels of CPU caches,
| cores, sockets, NUMA, etc).
| tombert wrote:
| That's good to hear! I am pretty removed from underlying
| hardware now, so it makes me happy to hear that better way of
| doing things is catching on even in low-level land.
| gpderetta wrote:
| If you are doing pure shared nothing message passing, you do
| not need coherent caches; in fact cache coherency gets in the
| way of pure message passing.
|
| Vice versa, if you do pure message passing you are not
| benefiting from hardware-accelerated cache coherency and
| are leaving performance (and usability) on the floor.
| intelVISA wrote:
| Shared-nothing is typically The Right Choice in my experience
| as well. Maybe the odd atomic...
| alberth wrote:
| Are there any Linux distros built with/using Cosmo?
|
| (like Alpine use of musl)
| forrestthewoods wrote:
| Hrm. Big fan of Justine and their work. However, this is probably
| the _least_ interesting benchmark test case for a Mutex. You
| should never have a bunch of threads constantly spamming the same
| mutex. So which mutex implementation best handles this case isn't
| particularly interesting, imho.
| jart wrote:
| What do you consider a good benchmark test case for mutexes?
| pizlonator wrote:
| Very large multithreaded programs with lots of diverse uses
| of locks, including:
|
| - uncontended, mildly contended, and horrifically contended
|
| - short and long critical sections
|
| - contention among small numbers of threads and large numbers
| of threads
|
| - contention that happens when other locks are held
| recursively and those are also contended on (like, thread A
| wants a lock held by thread B, but thread B is trying to get
| a lock held by thread C)
|
| Different lock algos work well or badly depending on how the
| locks are used, so it's important to pick real programs as
| your benchmark rather than trying to cook a synthetic
| benchmark.
| convolvatron wrote:
| what else would you measure? certainly the uncontended case is
| important and a baseline, but otherwise this is kind of weak
| point for mutexes - that if you don't handle contention well
| then you have idle hardware or lots of additional scheduler
| work or kernel crossings.
|
| [edit - I forget to even mention one of the most important
| things, that locks that reform poorly under contention can have
| really negative systemic effects like hot spotting the memory
| network, and that would show up here too]
| tialaramex wrote:
| Uncontended is _crucial_. If you want to benchmark other
| things that's excellent, but if MutexA has crap uncontended
| performance then I'm on a loser if we pick MutexA unless I am
| absolutely sure we will have a lot of contention. Since
| contention is never desirable, that's rare.
|
| Think of this like the random input case for a sort
| benchmark. Do I want stuff like all-ascending, all-
| descending, zig-zag and so on? Sure, those are nice. But
| without the random input case the benchmark is not very
| helpful. I _might_ sort a zig-zag, I _might_ sort data
| that's already in ascending order, but I _will_ sort random
| data; that's going to happen or else I would not need a sort
| function.
| jart wrote:
| Uncontended is uninteresting, because all mutex
| implementations perform roughly the same here, give or take
| a nanosecond or two. If you're truly uncontended then a
| naive spin lock will actually seem fastest, because xchg is
| faster than cmpxchg, which is needed for good locks.
| loeg wrote:
| Uh, why do you say a naive spin lock would use xchg
| instead of cmpxchg? I don't think you could make a valid
| spinlock using xchg.
| jart wrote:
| On x86 you can. When xchg is used with a memory parameter
| it locks the bus. This is true even in the absence of a
| lock prefix. I included a spinlock implementation in the
| blog post. If you see any errors with it, then please let
| me know!
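|
| Roughly this shape, as a sketch (not the blog post's exact
| code):
|
|     #include <stdatomic.h>
|
|     atomic_int lk;  /* 0 = free, 1 = held */
|
|     void spin_lock(void) {
|         /* compiles to a plain `xchg` on x86, which is implicitly
|            locked whenever it has a memory operand */
|         while (atomic_exchange_explicit(&lk, 1, memory_order_acquire))
|             ;
|     }
|
|     void spin_unlock(void) {
|         atomic_store_explicit(&lk, 0, memory_order_release);
|     }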
| loeg wrote:
| Oh, sure, your 1-bit spinlock with no other state works.
| cogman10 wrote:
| > You should never have a bunch of threads constantly spamming
| the same mutex.
|
| I'm not sure I agree with this assessment. I can think of a few
| cases where you might end up with a bunch of threads
| challenging the same mutex.
|
| A simple example would be something like concurrently
| populating some data structure (list/dict/etc). Yes, you could
| accomplish this with message passing, but that uses more memory
| and would be slower than just having everything wait to write
| to a shared location.
| kccqzy wrote:
| > would be slower than just having everything wait to write
| to a shared location
|
| Nope.
| cogman10 wrote:
| Yup.
|
| Message passing has allocation pressure and cache
| coherency pressure not present when using a shared memory
| location. Especially as the amount of memory in question
| goes up, the benefit of a shared location increases in
| terms of the performance impact.
|
| Sure, for something silly like writing to an int, there is
| negative benefit in a shared location, but when you start
| talking about a dictionary with 1 million entries, the
| shared location becomes much more of a benefit vs all the
| copying and allocating you'd have to do if you tried to do
| the same thing with message passing.
|
| For some datastructures, it's the optimal way to move data
| around. For example, LMAX disruptor is about the fastest
| way to pass messages because of the shared memory and well
| tuned locks.
| forrestthewoods wrote:
| If you're trying to mutate a dictionary many times from many
| threads you're going to have a bad time. The fix isn't a
| faster mutex, it's don't do that.
| cogman10 wrote:
| Depends on the dictionary implementation. There's a number
| of thread safe dictionaries in the wild with varying
| degrees of parallelism performance. Pretty much all of them
| benefit from faster mutexes.
|
| For example, some thread safe dictionaries will segment
| their underlying key/value pairs which allows them to have
| concurrent reads and writes for a given segment which
| significantly improves performance.
| Salgat wrote:
| I'd say the vast majority of cases where I use a lock/semaphore
| is around very expensive resources, where the utilization of
| that resource vastly outweighs any performance overhead of the
| lock.
| another-acct wrote:
| This is how it should be. IIRC -- apologies, can't find a
| source --, Ulrich Drepper wrote somewhere about NPTL that its
| mutexes were not particularly lightweight, but that you
| should design your program for low contention anyways.
|
| For highly contended data structures, spinlocks (and nowadays
| explicit atomics) are likely better.
| joelthelion wrote:
| Could established libcs adopt this? If not, why not?
| loeg wrote:
| Maybe. nsync is Apache licensed.
| pizlonator wrote:
| Always cool to see new mutex implementations and shootouts
| between them, but I don't like how this one is benchmarked. Looks
| like a microbenchmark.
|
| Most of us who ship fast locks use very large multithreaded
| programs as our primary way of testing performance. The things
| that make a mutex fast or slow seem to be different for complex
| workloads with varied critical section length, varied numbers of
| threads contending, and varying levels of contention.
|
| (Source: I wrote the fast locks that WebKit uses, I'm the person
| who invented the ParkingLot abstraction for lock impls (now also
| used in Rust and Unreal Engine), and I previously did research on
| fast locks for Java and have a paper about that.)
| PaulDavisThe1st wrote:
| To add to this, as the original/lead author of a desktop app
| that frequently runs with many tens of threads, I'd like to see
| numbers on performance in _non-heavily contended cases_. As a
| real-time (audio) programmer, I am more concerned with (for
| example) the cost to take the mutex even when it is not already
| locked (which is the overwhelming situation in our app).
| Likewise, I want to know the cost of a try-lock operation that
| will fail, not what happens when N threads are contending.
|
| Of course, with Cosmopolitan being open source and all, I could
| do these measurements myself, but still ...
| ataylor284_ wrote:
| The writeup suggests this implementation is optimized for the
| not-locked case:
|
| > uses an optimistic CAS (compare and swap) immediately, so
| that locking happens quickly when there's no contention
| pizlonator wrote:
| Totally!
|
| Pro tip: if you really do know that contention is unlikely,
| and uncontended acquisition is super important, then it's
| theoretically impossible to do better than a spinlock.
|
| Reason: locks that have the ability to put the thread to
| sleep on a queue must do compare-and-swap (or at least an
| atomic RMW) on `unlock`. But spinlocks can get away with just
| doing a store-release (or just a store with a compiler fence
| on X86) to `unlock`.
|
| Spinlocks also have excellent throughput under most
| contention scenarios, though at the cost of power and being
| unkind to other apps on the system. If you want your spinlock
| to be hella fast on contention just make sure you
| `sched_yield` before each retry (or `SwitchToThread` on
| Windows, and on Darwin you can do a bit better with
| `thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1)`).
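|
| A sketch of why, with a Drepper-style futex mutex (Linux-only;
| the futex wrapper is hand-rolled here because libc doesn't
| export one):
|
|     #include <linux/futex.h>
|     #include <stdatomic.h>
|     #include <sys/syscall.h>
|     #include <unistd.h>
|
|     /* 0 = unlocked, 1 = locked, 2 = locked, maybe waiters */
|     static atomic_int m;
|
|     static void futex(int op, int val) {
|         syscall(SYS_futex, &m, op, val, NULL, NULL, 0);
|     }
|
|     void mutex_lock(void) {
|         int c = 0;
|         if (atomic_compare_exchange_strong(&m, &c, 1))
|             return;                      /* uncontended fast path */
|         if (c != 2)
|             c = atomic_exchange(&m, 2);  /* mark contended */
|         while (c != 0) {
|             futex(FUTEX_WAIT, 2);        /* sleep while m == 2 */
|             c = atomic_exchange(&m, 2);
|         }
|     }
|
|     void mutex_unlock(void) {
|         /* the atomic RMW is unavoidable: we must learn, atomically,
|            whether anyone went to sleep */
|         if (atomic_exchange(&m, 0) == 2)
|             futex(FUTEX_WAKE, 1);
|     }
|
| A spinlock's unlock, by contrast, is just the plain
| store-release, with no load anywhere in sight.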
| PaulDavisThe1st wrote:
| We use spinlocks where appropriate. In the 90s I recall
| that the general rule of thumb was if the lock is held for
| <10x the context switch time, spinlocks are generally a
| better choice. Not sure if that's still true of
| contemporary architectures.
|
| The more common pattern in rt/audio code is "try to take
| the lock, but have an alternate code path if that fails".
| It's not that there is never going to be contention, but it
| will be extremely rare, and when it occurs, it probably
| matters.
| RWLocks are also a common pattern, with the RT thread(s)
| being read-only (and still being required to fall back on
| an alternate code path if the read lock cannot be taken).
| pizlonator wrote:
| These days, fast lock implementations use the following
| rough idiom, or some idiom that is demonstrably not any
| slower even for short critical sections:
|
|     if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
|     for (unsigned i = 0; i < 40; ++i) {
|         /* spin for a bounded amount of time */
|         if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
|         sched_yield();
|     }
|     ... /* actual full lock algo goes here */
|
| So, the reason to use spinlocks isn't that they are
| faster for short critical sections, but that they don't
| have to CAS on unlock - and so they are faster especially
| in the uncontended case (and in all other cases too).
|
| In other words, the modern rule of thumb is something
| like: if you're going to grab the lock so frequently that
| the uncontended lock/unlock time shows up as a
| significant percentage of your execution time, then use a
| spinlock. That implies that you are probably holding the
| lock for <100x or even <1000x context switch time, or
| something around there.
| adastra22 wrote:
| Huh, interesting. TIL. Thanks for that!
| lilyball wrote:
| On Darwin, it's possible for a pure spinlock to produce a
| priority inversion deadlock, because Darwin has a quality
| of service implementation in the kernel that differs from
| how everyone else handles thread priority. In other
| kernels, a low-priority thread will still eventually be
| guaranteed a cpu slice, so if it's holding a spinlock, it
| will eventually make progress and unlock. On Darwin with
| Quality of Service, it's possible for higher-QoS threads to
| preempt lower-QoS threads indefinitely.
|
| For this reason, on Darwin you want to avoid spinlocks
| unless you know that all threads touching the spinlock are
| always running in the same QoS. Instead of spinlocks, your
| go-to for low-overhead locks there is os_unfair_lock, which
| is a spinlock variant that donates priority of the blocked
| threads to the running thread.
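|
| Usage is about as simple as a raw spinlock (minimal sketch of
| the real Darwin API):
|
|     #include <os/lock.h>
|
|     os_unfair_lock lk = OS_UNFAIR_LOCK_INIT;
|
|     void critical(void) {
|         os_unfair_lock_lock(&lk);   /* waiters donate their priority */
|         /* ... critical section ... */
|         os_unfair_lock_unlock(&lk);
|     }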
| pizlonator wrote:
| I've shipped code on Darwin that spinlocks and gets away
| with it without any noticeable cases of this happening.
|
| I know it can happen in theory. But theory and practice
| ain't the same.
|
| I worked for Apple when I shipped this too lmao
| ot wrote:
| > Reason: locks that have the ability to put the thread to
| sleep on a queue must do compare-and-swap (or at least an
| atomic RMW) on `unlock`. But spinlocks can get away with
| just doing a store-release (or just a store with a compiler
| fence on X86) to `unlock`.
|
| This is something I've been thinking about a lot over time:
| the CAS is only there to atomically determine, on unlock,
| whether there are any sleeping waiters so you have to do a
| futex_wake. I would really want some way to get away with
| non-fenced operations (at least on x86), but I don't know
| if it's just that nobody has figured out how, or there is a
| fundamental reason why that's not possible.
| pizlonator wrote:
| You do need a fence in the unlock path though (at least a
| release fence).
|
| I think the issue is that if you ask the CPU to just
| store something (like in a spin lock), whether or not
| there's a fence, it's an operation with limited data flow
| dependencies so it's easy for the CPU to execute. Even
| the fence can be handled using wacky speculation tricks.
|
| But if you want to do something like, "store this value
| but only if the old value satisfies some predicate", then
| there's a load and the whole thing is dependent on the
| load. So you're asking the CPU to load, then run a
| predicate, then store, and for that to be fenced, and
| atomic.
|
| Strictly more work. I don't think there's any trick to
| make it faster than just the store release.
| Animats wrote:
| > just make sure you `sched_yield` before each retry
|
| Assuming `sched_yield` does something.
|
| There's a futex congestion problem inside Wine's memory
| allocator. There are several levels of locks. If you're
| growing a buffer, in the sense of C's "realloc", and no
| buffer is available, memory allocation is locked during the
| allocation of a bigger buffer, copying of the contents, and
| release of the old buffer. "Push" type operations can force
| this. Two orders of magnitude performance drops ensue when
| multi-threaded programs are contending for that lock.[1]
|
| Inside one of the lock loops is a call to "YieldProcessor":
|
|     static void spin_lock( LONG *lock )
|     {
|         while (InterlockedCompareExchange( lock, -1, 0 ))
|             YieldProcessor();
|     }
|
| But the actual code for YieldProcessor is a NOP on x86:[2]
|
|     static FORCEINLINE void YieldProcessor(void)
|     {
|     #ifdef __GNUC__
|     #if defined(__i386__) || defined(__x86_64__)
|         __asm__ __volatile__( "rep; nop" : : : "memory" );
|     #elif defined(__arm__) || defined(__aarch64__)
|         __asm__ __volatile__( "dmb ishst\n\tyield" : : : "memory" );
|     #else
|         __asm__ __volatile__( "" : : : "memory" );
|     #endif
|     #endif
|     }
|
| Wine devs are aware of this, but the mess is bad enough
| that no one has tackled it. This is down in the core of
| what "malloc" calls, so changes there could have unexpected
| effects on many programs. Needs attention from someone
| really into mutexes.
|
| [1] https://forum.winehq.org/viewtopic.php?t=37688
|
| [2] https://gitlab.winehq.org/wine/wine/-/blob/HEAD/include
| /winn...
| pizlonator wrote:
| sched_yield isn't a nop
| secondcoming wrote:
| `rep; nop;` is actually the `pause` instruction. On older
| CPUs it's a standard nop, but on newer CPUs it's a more
| efficient nop.
|
| Spinning on the CMPXCHG is also a bad idea. You should
| spin on the read and only then attempt the CMPXCHG.
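|
| That is, a test-and-test-and-set loop, roughly (sketch):
|
|     #include <immintrin.h>  /* _mm_pause */
|     #include <stdatomic.h>
|
|     atomic_int lk;
|
|     void spin_lock(void) {
|         for (;;) {
|             /* spin read-only on the locally cached line first */
|             while (atomic_load_explicit(&lk, memory_order_relaxed))
|                 _mm_pause();  /* the `pause` / "rep; nop" hint */
|             /* only then attempt the expensive atomic RMW */
|             int expect = 0;
|             if (atomic_compare_exchange_weak_explicit(
|                     &lk, &expect, 1,
|                     memory_order_acquire, memory_order_relaxed))
|                 return;
|         }
|     }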
| uvdn7 wrote:
| I was thinking the same. There are many mutexes out there, some
| are better at certain workloads than the rest. DistributedMutex
| and SharedMutex come to mind (https://github.com/facebook/folly
| /blob/main/folly/synchroniz..., https://github.com/facebook/fol
| ly/blob/main/folly/SharedMute...) Just like hashmaps, it's
| rarely the case that a single hashmap is better under _all_
| possible workloads.
| pizlonator wrote:
| Yeah.
|
| I should say, though, that if you're on Windows then I have
| yet to find a real workload where SRWLock isn't the fastest
| (provided you're fine with no recursion and with a lock that
| is word-sized). That lock has made some kind of deal with the
| devil AFAICT.
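|
| For reference, it's a single pointer-sized word and trivially
| used (minimal sketch of the real Win32 API):
|
|     #include <windows.h>
|
|     SRWLOCK lk = SRWLOCK_INIT;  /* one word, no handle to create */
|
|     void writer(void) {
|         AcquireSRWLockExclusive(&lk);
|         /* ... write-side critical section ... */
|         ReleaseSRWLockExclusive(&lk);
|     }
|
|     void reader(void) {
|         AcquireSRWLockShared(&lk);
|         /* ... read-side critical section ... */
|         ReleaseSRWLockShared(&lk);
|     }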
| ngoldbaum wrote:
| This style of mutex will also power PyMutex in Python 3.13. I
| have real-world benchmarks showing how much faster PyMutex is
| than the old PyThread_type_lock that was available before 3.13.
| miohtama wrote:
| Any rough summary?
| ngoldbaum wrote:
| https://github.com/numpy/numpy/issues/26510#issuecomment-22
| 9...
|
| And now that I look at that again I realize I forgot to
| finish that up!
| 0xDEADFED5 wrote:
| Can I use PyMutex from my own Python code?
| intelVISA wrote:
| Fast locks are an oxymoron vs optimized CAS
| ot wrote:
| Yeah, that specific benchmark is actually likely to prefer
| undesirable behaviors, for example pathological unfairness:
| clearly the optimal scheduling of those threads runs first all
| the increments from the first thread, then all of the second
| thread, etc... because this will minimize inter-processor
| traffic.
|
| A mutex that sleeps for a fixed amount (for example 100us) on
| lock acquisition failure will probably get very close to that
| behavior (since it almost always bunches), and "win" the
| benchmark. Still, that would be a terrible mutex for any
| practical application where there is any contention.
|
| This is not to say that this mutex is not good (or that pthread
| mutexes are not bad), just that the microbenchmark in question
| does not measure anything that predicts performance in a real
| application.
| pizlonator wrote:
| Yeah! For all we know this performs great on real programs.
|
| And for all we know it's absolute trash on real programs.
| gok wrote:
| Consider adopting `os_sync_wait_on_address()` on Darwin for your
| futex needs
| jart wrote:
| I've used that. It's just as good as ulock although relatively
| new. The issue is that using this API makes cancelation points
| no longer atomic. SIGTHR needs to be able to know the exact
| instruction in memory where an asynchronous signal is delivered
| when interrupting a wait operation and that's not possible if
| it's inside an opaque library.
| jonathrg wrote:
| > It's still a new C library and it's a little rough around the
| edges. But it's getting so good, so fast, that I'm starting to
| view not using it in production as an abandonment of professional
| responsibility.
|
| What an odd statement. I appreciate the Cosmopolitan project, but
| these exaggerated claims of superiority are usually a pretty bad
| red flag.
| tredre3 wrote:
| I'd like to point out that Justine's claims are usually
| correct. It's just her shtick (or personality?) to use
| hyperbole and ego-stroking wording. I can see why some might
| see it as abrasive (it has caused drama before, namely in
| llamacpp).
| jancsika wrote:
| This case isn't abrasive, but it's certainly incoherent.
|
| Name a single case where professional responsibility would
| require C code _advertised_ as "rough around the edges" to
| be used anywhere near production. (The weasel words "starting
| to" do not help the logic of that sentence.)
|
| I can definitely understand how OP could view this as a red
| flag. The author should amend it.
| another-acct wrote:
| I also meant to comment about the grandstanding in her post.
|
| Technical achievement aside, when a person invents something
| new, the burden is _on them_ to prove that the new thing is a
| suitable replacement of / improvement over the existing
| stuff. "I'm starting to view /not/ using [cosmo] in
| production as an abandonment of professional responsibility"
| is emotional manipulation -- it's guilt-tripping.
| Professional responsibility is the exact opposite of what she
| suggests: it's not jumping on the newest bandwagon. "a little
| rough around the edges" is precisely what production
| environments don't want; predictability/stability is
| frequently more important than peak performance /
| microbenchmarks.
|
| Furthermore,
|
| > The C library is so deeply embedded in the software supply
| chain, and so depended upon, that you really don't want it to
| be a planet killer.
|
| This is just underhanded. She implicitly called glibc and
| musl "planet killers".
|
| First, technically speaking, it's just not true; and even if
| the implied statement were remotely true (i.e., if those
| mutex implementations were in fact responsible for a
| significant amount of cycles in actual workloads), the
| emotional load / snide remark ("planet killer") is
| unjustified.
|
| Second, she must know very well that whenever efficiency of
| computation is improved, we don't use that for running the
| same workloads as before at lower cost / smaller
| environmental footprint. Instead, we keep all CPUs pegged all
| the time, and efficiency improvements only ever translate to
| _larger profit_. A faster mutex too translates to more $$$
| pocketed, and not to less energy consumed.
|
| I find her tone of voice repulsive.
| vitus wrote:
| I agree overall with your sentiment but wanted to comment
| on one of your statements that I perceived to be hyperbole.
|
| > Second, she must know very well that whenever efficiency
| of computation is improved, we don't use that for running
| the same workloads as before at lower cost / smaller
| environmental footprint. Instead, we keep all CPUs pegged
| all the time, and efficiency improvements only ever
| translate to larger profit. A faster mutex too translates
| to more $$$ pocketed, and not to less energy consumed.
|
| It depends on the use case. If you can serve the same
| number of users / requests with fewer machines, then you
| buy and run fewer machines. (Yes, saving energy, but also
| saving on both capex and opex.)
|
| Also, when you're talking about anything resembling
| interactivity (as you might in the context of, say, a
| webserver), you really don't want to run anywhere close to
| 100% average utilization. With unbounded queues, you end up
| with arbitrarily high wait times; with bounded queues, you
| end up serving 503s and 429s and other server errors.
|
| That said, my experience with modern webservers is that you
| generally don't rely on mutexes for synchronizing most work
| across worker threads, and instead you try to keep your
| workloads as embarrassingly parallel as possible.
| almostgotcaught wrote:
| It's especially a red flag since an enormous number of projects
| (almost all of them?) will never tolerate shipping fat binaries
| (i.e., what Cosmopolitan C is in reality).
| another-acct wrote:
| Agreed; this is what I've always (silently) thought of those
| fat binaries. Absolute stroke of genius, no doubt, and also a
| total abomination (IMO) from a sustainability perspective.
| ataylor284_ wrote:
| The core of this is a library called nsync. It appears most of
| the improvements by Justine are upstreamed into nsync itself,
| which doesn't have any of the baggage of cosmopolitan.
| Whatever your opinion of the project or author, they've made a
| good effort to not lock you in.
| intelVISA wrote:
| Ironic given the generous size of the average Go binary.
| pmarreck wrote:
| I feel that there's a certain amount of hubris that comes along
| with spending long periods of time solo-coding on a computer,
| while perhaps unwittingly starved of social contact. Without any
| checks on you or your work's importance (normally provided by
| your bog-standard "job"), your achievements take on a grandeur
| that they might not have broadly earned, as impressive as they
| might be.
|
| An example is APE (which I otherwise feel is a very impressive
| hack). One criticism might be "oh, so I not only get to be
| insecure on one platform, I can be insecure on many all at
| once?"
|
| The longer you spend in technology, the more you realize that
| there are extremely few win-wins and very many win-somes,
| lose-somes (tradeoffs).
| fefe23 wrote:
| Have you considered that you may have a different kind of humor
| than Justine?
|
| Why would you even post this here? Who do you think this is
| helping?
| jonathrg wrote:
| It doesn't clearly come across as a joke.
| jart wrote:
| It's a splash of dry humor on a personal blog in an
| information-dense article.
| jonathrg wrote:
| OK, Poe's law at work then :)
| another-acct wrote:
| I think it's fair to comment not only on the subject, but on
| the writing itself, too.
|
| And it might help Justine improve her writing (and reach a
| larger audience -- after all, blog posts _intend_ to reach
| _some_ audience, don't they?). Of course you can always say,
| "if you find yourself alienated, it's your loss".
| gr4vityWall wrote:
| It came off as humor to me, at least.
| stonethrowaway wrote:
| > In 2012, Tunney started working for Google as a software
| engineer.[4] In March 2014, Tunney petitioned the US government
| on We the People to hold a referendum asking for support to
| retire all government employees with full pensions, transfer
| administrative authority to the technology industry, and appoint
| the executive chairman of Google Eric Schmidt as CEO of America.
|
| the absolute madman
| Dansvidania wrote:
| I wonder what they (Tunney) think of that now.
| 01HNNWZ0MV43FF wrote:
| Wild. Then again, in 2012 I was on a grippy sock vacation.
| wallstprog wrote:
| " I also managed to make contended nsync mutexes go 30% faster
| than nsync upstream on AARCH64, by porting it to use C11
| atomics."
|
| Curious about this -- so what do C11 atomics use underneath?
| At least in Linux, C++11 atomics _use_ pthreads (not the other
| way around).
| rcxdude wrote:
| It depends on which atomics. In principle most of them should
| map to an underlying CPU primitive, and only fall back to a
| mutex if it's not supported on the platform.
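|
| You can check which case you get with atomic_is_lock_free. A
| minimal sketch (with GCC, link -latomic for the fallback path):
|
|     #include <stdatomic.h>
|     #include <stdio.h>
|
|     typedef struct { long a, b, c; } Big; /* too wide for one op */
|
|     int main(void) {
|         _Atomic long small;
|         _Atomic Big big;
|         printf("%d\n", atomic_is_lock_free(&small)); /* 1: CPU op */
|         printf("%d\n", atomic_is_lock_free(&big));   /* 0: library */
|         return 0;
|     }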
| loeg wrote:
| Atomics mostly map to underlying compiler / CPU intrinsics, not
| pthreads.
| jeffbee wrote:
| I am also curious about this and the ambiguity of "AARCH64".
| There are 64-bit ARM ISA versions without single-instruction
| atomics (LSE), and on these what looks like an atomic op is
| actually an LL/SC retry loop with potentially unbounded
| runtime. The original AWS
| Graviton CPU had this behavior. The version of the ISA that you
| target can have significant performance impact.
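|
| Concretely, the same C11 operation compiles very differently
| depending on the targeted ISA level (a sketch; exact codegen
| varies by compiler):
|
|     #include <stdatomic.h>
|
|     int bump(atomic_int *p) {
|         /* -march=armv8-a:   LDXR/STXR retry loop (LL/SC), which
|          *                   can spin unboundedly under contention
|          * -march=armv8.1-a: one LDADDAL instruction (LSE) */
|         return atomic_fetch_add_explicit(p, 1, memory_order_acq_rel);
|     }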
| jart wrote:
| nsync has wrapper macros abstracting all the various atomics
| libraries, and those macros prevented it from using two things.
|
| 1. Weak CAS. nsync always uses strong CAS upstream to make the
| portability abstraction simpler. Being able to use weak CAS
| when appropriate helps avoid code being generated for an
| additional loop.
|
| 2. Updating the &expected parameter. nsync upstream always
| manually does another relaxed load when a CAS fails. This isn't
| necessary with the C11 atomics API, because it gives you a
| relaxed load of the expected value for free when it fails.
|
| Being able to exploit those two features resulted in a
| considerable improvement in nsync's mu_test.c benchmark for the
| contended mutex case, which I measured on RPI5.
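|
| For the curious, the difference looks roughly like this (an
| illustrative sketch, not nsync's actual code):
|
|     #include <stdatomic.h>
|
|     void set_bits(atomic_uint *word, unsigned bits) {
|         unsigned expected =
|             atomic_load_explicit(word, memory_order_relaxed);
|         /* The weak form may fail spuriously, so no hidden retry
|          * loop is generated; on failure 'expected' is refreshed
|          * with the observed value, so no extra relaxed load. */
|         while (!atomic_compare_exchange_weak_explicit(
|                    word, &expected, expected | bits,
|                    memory_order_acquire, memory_order_relaxed))
|             ;
|     }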
| bjourne wrote:
| I made a benchmark on this last year when I didn't know how slow
| pthread mutexes were:
| https://stackoverflow.com/questions/76965112/why-are-pthread...
| For my use case, the mutex wait amounted to roughly 40% of the
| total runtime and spinlocks were way faster. Perhaps nsync or
| Cosmopolitan would have made my code much faster.
|
| I still believe the FUD around spinlocks is overstated. For
| "normal" HPC code the number of threads should be <= the number
| of cores. In that scenario spinlocking will minimize the
| latency, which is likely what you care about the most. It beats
| having your performance ruined by DVFS.
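|
| For reference, the kind of spinlock I mean is the textbook
| test-and-set one; a minimal sketch:
|
|     #include <stdatomic.h>
|
|     static atomic_flag spin = ATOMIC_FLAG_INIT;
|
|     void spin_lock(void) {
|         /* Busy-wait: fine when threads <= cores, disastrous
|          * if the lock holder gets descheduled. */
|         while (atomic_flag_test_and_set_explicit(
|                    &spin, memory_order_acquire))
|             ;
|     }
|
|     void spin_unlock(void) {
|         atomic_flag_clear_explicit(&spin, memory_order_release);
|     }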
| amiga386 wrote:
| If it's so good, why haven't all C libraries adopted the same
| tricks?
|
| My bet is that its tricks are only reliably faster on certain
| architectures, or certain CPU models, or certain types of
| workload / access patterns... and a proper benchmarking of varied
| workloads on all supported hardware would not show the same
| benefits.
|
| Alternatively, maybe the semantics of the pthread API (that
| cosmopolitan is meant to be implementing) are somehow subtly
| different and this implementation isn't strictly compliant to the
| spec?
|
| I can't imagine it's that the various libc authors aren't keeping
| up in state-of-the-art research on OS primitives...
| jitl wrote:
| > I can't imagine it's that the various libc authors aren't
| keeping up in state-of-the-art research on OS primitives...
|
| is this sarcasm?
|
| (I don't know any libc maintainers, but as a maintainer of a
| few thingies myself, I do not try to implement state-of-the-art
| research, I try to keep my thingies stable and ensure the
| performance is acceptable; implementing research is out of my
| budget for "maintenance")
| amiga386 wrote:
| But if you maintain a few thingies, you'd probably know about
| rival thingies that do a similar thing, right?
|
| If the rival thingies got a speed boost recently, and they
| were open source, you'd want to have a look at how they did
| it and maybe get a similar speed boost for yourself.
| tialaramex wrote:
| This is nowhere near as common as you seem to think, and
| mostly only happens for the narrow cases where somebody is
| obsessed with a particular problem so that they'd actually
| _want_ to read other people's solutions. Most of the
| implementers do not have that sort of relationship to a
| problem they solved in the past.
|
| If in December you make a general purpose stable sort
| that's 25% faster than his, Orson Peters _is_ probably
| going to read your code and try to apply ideas from it. But
| sorting is the thing Peters really cares about, the people
| who wrote the stable sort in say Microsoft 's STL (their
| C++ standard library implementation) even if they still
| work there won't care enough to go back and change that
| unless told to do so.
| jitl wrote:
| It depends on the calculus about (time) budget and
| stability. Maybe I consider performance already acceptable,
| and don't have time budget to investigate beyond that.
| Maybe I look at "nsync", see its mutex (may) change the
| fairness semantics, and then decide not to adopt it because
| this may break my callers; or it's enough that it _may_
| change the fairness semantics, and I don't have the budget
| to test nsync or a new implementation based on the nsync
| algorithm to determine if the semantics differ.
| dist-epoch wrote:
| Politics, not-invented-here syndrome, old maintainers.
|
| It takes forever to change something in glibc, or the C++
| equivalent.
|
| There are many kinds of synchronization primitives. pthreads
| only supports a subset. If you are limiting yourself to them,
| you are most likely leaving performance on the table, but you
| gain portability.
| aseipp wrote:
| Those projects often have dozens of other priorities beyond
| just one specific API, and obsessing over individual APIs isn't
| a good way to spend the limited time they have. In any case, as
| a concrete example to disprove your claim, you can look at
| malloc and string routines in your average libc on Linux.
|
| glibc's malloc is tolerable but loses handily to more modern
| alternatives in overall speed and scalability (it fragments
| badly and degrades over time, not to mention it has a dozen
| knobs that can deeply impact real life workloads like
| MALLOC_ARENA_MAX). musl malloc is completely awful in terms of
| performance at every level; in a multithreaded program, using
| musl's allocator will destroy your performance so badly that it
| nearly qualifies as malpractice, in my experience.
|
| musl doesn't even have things like SIMD-optimized string
| comparison routines. You would be shocked at how many CPU
| cycles in a non-trivial program are spent on those tasks, and
| yes it absolutely shows up in non-trivial profiles, and yes
| improving this improves all programs nearly universally.
| glibc's optimized routines are good, but they can _always_, it
| seems, become faster.
|
| These specific things aren't "oh, they're hyper specific
| optimizations for one architecture that don't generalize".
| These two things in particular -- we're talking 2-5x wall clock
| reduction, and drastically improved long-term working set
| utilization, in nearly all workloads for any given program.
| These are well explored and understood spaces with good known
| approaches. So why didn't they take them? Because, as always,
| they probably had other things to do (or conflicting priorities
| like musl prioritizing simplicity over peak performance, even
| when that philosophy is actively detrimental to users.)
|
| I'm not blaming these projects or anything. Nobody sets out and
| says "My program is slow as shit and does nothing right, and I
| designed it that way and I'm proud of it." But the idea that
| the people working on them have made only the perfect pareto
| frontier of design choices just isn't realistic in the
| slightest and doesn't capture the actual dynamics of how most
| of these projects are run.
| Uehreka wrote:
| So on the one hand, all this Cosmo/ape/redbean stuff sounds
| incredible, and the comments on these articles are usually pretty
| positive and don't generally debunk the concepts. But on the
| other hand, I never hear mention of anyone else using these
| things (I get that not everyone shares what they're doing in a
| big way, but after so many years I'd expect to have seen a couple
| project writeups talk about them). Every mention of
| Cosmo/ape/redbean I've ever seen is from Justine's site.
|
| So I've gotta ask: Is there a catch? Are these tools doing
| something evil to achieve what they're achieving? Is the whole
| thing a tom7-esque joke/troll that I don't get because I'm not as
| deep into compilers/runtimes? Or are these really just ingenious
| tools that haven't caught on yet?
| almostgotcaught wrote:
| > So I've gotta ask: Is there a catch? Are these tools doing
| something evil to achieve what they're achieving?
|
| it's not that complicated; they're fat binaries (plus I guess a
| lot of papering over the differences between the platforms)
| that exploit a quirk of the Thompson shell:
|
| > One day, while studying old code, I found out that it's
| possible to encode Windows Portable Executable files as a UNIX
| Sixth Edition shell script, due to the fact that the Thompson
| Shell didn't use a shebang line.
|
| (https://justine.lol/ape.html)
|
| so the answer is simple: I can't think of anyone that _wants_
| to ship fat binaries.
| fragmede wrote:
| Anyone focused on customer experience wants to ship a binary
| that just works, without customers having to know what a fat
| binary even is.
| thefaux wrote:
| I don't think macOS will allow you to run a downloaded
| cosmo binary without going into security settings and
| enabling it. That's not an experience I'd want my customers
| to have personally which means if you care about mac
| normies, this isn't a viable approach.
| bluGill wrote:
| Unless the customer is one who has issues with large
| downloads. Not everyone has fiber to the home with massive
| storage on modern computers. Some people have entry-level
| systems and settle for slow internet. Often they are in
| poor or "third world" places, which means maybe you don't
| care about them, but they exist.
| 01HNNWZ0MV43FF wrote:
| Fat binaries are fine. Electron is a fat binary in a
| different sense.
| 0cf8612b2e1e wrote:
| Given how fat modern websites are, compassion for the
| bandwidth-deprived seems rare.
| cycomanic wrote:
| Correct me if I'm wrong, but I always thought Apple universal
| binaries are fat binaries? So why did Apple build this
| ability if nobody wants it?
| amiga386 wrote:
| APE works through cunning trickery that might get patched out
| any day now (and in OpenBSD, it has been).
|
| Most people producing cross-platform software _don't_ want a
| single executable that runs on every platform, they want a
| single _codebase_ that works correctly on each platform they
| support.
|
| With that in mind, languages like Go letting you cross-compile
| for all your targets (provided you avoid CGO) are delightful...
| but the three-ways-executable magic trick of APE,
| while really clever, doesn't inspire confidence it'll work
| forever, and for the most part it doesn't gain you anything.
| Each platform has their own packaging/signing requirements. You
| might as well compile a different target for each platform.
| 0x1ceb00da wrote:
| Wasn't the ELF format modified upstream to accommodate
| cosmo? That makes it kinda official. Still hard to see a use
| case for it. If you want everyone to be able to run your
| program, just write a web app, a Win32 program, or a Java
| applet. 20-year-old Java applets still run on modern JVMs.
| bcoates wrote:
| You can't reasonably assume the end user system has a
| system JVM installed (they probably don't, in fact) so
| they're not really an alternative to a fat binary -- if you
| can install dependencies, you can just pick a single-target
| binary while you're at it.
| bmacho wrote:
| Justine has made similar claims about POSIX now allowing binary
| in shell scripts
|
| > This is an idea whose time has come; POSIX even changed
| their rules about binary in shell scripts specifically to
| let us do it.
|
| https://justine.lol/cosmo3/
|
| > Jilles Tjoelker from the FreeBSD project played an
| instrumental role in helping me to get the POSIX rules
| changed to allow binary in shell scripts, which is what
| made this project possible.
|
| https://justine.lol/ape.html
|
| but that's not true, as recently discussed here:
| https://news.ycombinator.com/item?id=41636569
| fragmede wrote:
| > Most people producing cross-platform software don't want a
| single executable that runs on every platform
|
| They don't? Having one file to download instead of a maze of
| "okay so what do you have" is way easier than the current
| mess. It would be very nice not to have to ask users what
| platform they're on, they just click a link and get the
| thing.
| bluGill wrote:
| Tradeoffs. While there is a point to that, memory,
| bandwidth, and storage are not free. Users with constraints
| will want a smaller executable. Of course all three are pretty
| cheap these days, so more and more you can get by with just
| one does-it-all executable and nobody will care about the
| downsides, but let's not lose track of them.
| hyperion2010 wrote:
| > Most people
|
| Well, I'm used to not being most people, but I'd much rather
| be able to produce a single identical binary for my users
| that works everywhere than the platform-specific nonsense I
| have to go through right now. Having to maintain different
| special build processes for different platforms is a stupid
| waste of time.
|
| Frankly this is how it always should have worked except for
| the monopolistic behavior of various platforms in the past.
| bjourne wrote:
| The binary is only one part of the puzzle (and largely
| solved by WSL). Installation/uninstallation and desktop
| integration is just as much of a hassle.
| cycomanic wrote:
| I don't get the argument. If the binary part is solved by
| WSL, it isn't useless, is it? Otherwise why would MS invest
| so many resources into it?
| bjourne wrote:
| WSL can do a lot more than just run Linux binaries.
| cyberax wrote:
| > With that in mind that respect, languages like go letting
| you cross compile for all your targets (provided you avoid
| CGO)
|
| Even that is not a big deal in most cases, if you use zig
| to wrap CC: https://dev.to/kristoff/zig-makes-go-cross-
| compilation-just-...
| jrockway wrote:
| Does this still work? The article is from 2021 but when I
| last tried it this year, Go appeared to (newly) depend on
| headers that Zig doesn't need and thus it doesn't work. The
| GitHub issue was something like "yeah, we don't need those,
| so I guess Go doesn't work anymore". Without the actual
| error message I can't find the issue, however, so maybe I
| imagined this.
| cyberax wrote:
| Yes, I use it in my PKCS#11 client glue code.
| dwattttt wrote:
| I believe these are the issues:
| https://github.com/golang/go/issues/52690
| https://github.com/ziglang/zig/issues/14989
| jkachmar wrote:
| reposting my comment from another time this discussion came up:
|
| "Cosmopolitan has basically always felt like the interesting
| sort of technical loophole that makes for a fun blog post which
| is almost guaranteed to make it to the front page of HN (or
| similar) purely based in ingenuity & dedication to the bit.
|
| as a piece of foundational technology, in the way that `libc`
| necessarily is, it seems primarily useful for fun toys and
| small personal projects.
|
| with that context, it always feels a little strange to see it
| presented as a serious alternative to something like `glibc`,
| `musl`, or `msvcrt`; it's a very cute hack, but if i were to
| find it in something i seriously depend on i think i'd be a
| little taken aback."
| blenderob wrote:
| > Is there a catch?
|
| I am only speaking for myself here. While cosmo and ape do seem
| very clever, I do not need this type of clever stuff in my work
| if the ordinary stuff already works fine.
|
| Like for example if I can already cross-compile my project to
| other OSes and platforms or if I've got the infra to build my
| project for other OSes and platforms, I've no reason to look
| for a solution that lets me build one binary that works
| everywhere.
|
| There's also the thing that ape uses clever hacks to be able to
| run on multiple OSes. What if those hacks break someday due to
| how executable formats evolve? What if nobody has the time to
| patch APE to make it compatible with those changes?
|
| But my boring tools like gcc, clang, go, rust, etc. will
| continue to get updated and they will continue to work with
| evolving OSes. So I just tend to stick with the boring. That's
| why I don't bother with the clever because the boring just
| works for me.
| tangus wrote:
| Last year I needed to make a small webapp to be hosted on a
| Windows server, and I thought RedBean would be ideal for it.
| Unfortunately it was too buggy (at least on Windows).
|
| I don't know whether RedBean is production-ready now, but a
| year and a half ago, that was the catch.
| jart wrote:
| Give the latest nightly build a try:
| https://cosmo.zip/pub/cosmos/bin/ Windows has been a long
| hard march, but we've recently hit near feature completion.
| As of last month, the final major missing piece of the puzzle
| was implemented, which is the ability to send UNIX signals
| between processes. Cosmopolitan does such a good job on
| Windows now that it's not only been sturdy for redbean, but
| much more mature and complicated software as well, like
| Emacs, GNU Make, clang, Qt, and more.
| dundarious wrote:
| Mozilla llamafile uses it. Bundles model weights and an
| executable into a single file, that can be run from any
| cosmo/ape platform, and spawns a redbean http server for you to
| interact with the LLM. Can also run it without the integrated
| weights, and read weights from the filesystem. It's the easiest
| "get up and go" for local LLMs you could possibly create.
| sfn42 wrote:
| Most people aren't writing C as far as I know. We use Java, C#,
| Go, Python, etc.; some lunatics even use Node.
|
| We generally don't care if some mutex is 3x faster than some
| other mutex. Most of the time, if I'm even using a mutex, which
| is rare in itself, the performance of the mutex is generally
| the least of my concerns.
|
| I'm sure it matters to someone, but most people couldn't give
| two shits if they know what they're doing. We're not writing
| code where it's going to make a noticeable difference. There
| are thousands of things in our code we could optimize that
| would make a greater impact than a faster mutex, but we're not
| looking at those either because it's fast enough the way it is.
| 1st1 wrote:
| > I've even had the opportunity to make upstream contributions.
| For example, I found and fixed a bug in his mutex unlock function
| that had gone undiscovered for years.
|
| I see a stream of improvements to the vendored nsync inside
| the cosmopolitan project [1]. Are you planning on upstreaming
| most of those too?
|
| A separate question -- is using the upstream nsync as safe as
| using your fork?
|
| [1]
| https://github.com/jart/cosmopolitan/commits/master/third_pa...
| jart wrote:
| If Burrows wants my C11 atomics refactoring then he shall have
| it. Beyond that, my changes mostly concern libc integration,
| systems integration, and portability. Our projects have
| different goals in those areas, so I'm not sure he needs them.
| yshui wrote:
| I had the pleasure of reverse-engineering Win32 SRWLOCKs, and
| based on the author's description of nsync it is very close to
| how
| SRWLOCK works internally. Kind of surprised how much faster nsync
| is compared to SRWLOCK.
| betimsl wrote:
| And here I thought musl was better than libc _sigh_
| favorited wrote:
| musl _is_ a libc, and while it is superior in some ways, it is
| inferior in others. If you want a statically linked libc, or a
| permissively licensed libc, musl is a fantastic choice. If you
| want a libc with the fastest malloc implementation, you'll
| probably want to look elsewhere.
| akira2501 wrote:
| Production isn't about speed, efficiency, or obviously "clever
| hacks."
|
| If I have to sacrifice 50% of my efficiency to ensure that I
| never get called on Sunday at 3am to fix a broken system, no
| kidding, I'll make that trade every time.
|
| Production is about _reliability_. And writing reliable code is
| 10x harder than writing "fast" code.
| kgeist wrote:
| >Contention is where mutex implementations show their inequality.
| Mark was so impressed by Microsoft's SRWLOCK that he went on to
| recommend Linux and FreeBSD users consider targeting Windows if
| mutex contention is an issue.
|
| Interesting, I remember reading a detailed article where they
| found that there's a lot of severe contention in the Windows
| kernel, compared to Linux. I think it was when they were trying
| to parallelize Chrome builds?
| TinkersW wrote:
| Maybe they weren't using SRWLock; at least last time I checked,
| std::mutex didn't use it with MS STL (they were stuck with
| CRITICAL_SECTION because of binary compatibility).
| markdoubleyou wrote:
| I'm the Mark who's referenced there. When I did that original
| benchmark I discovered that the underlying mutex used by
| MSVCRT did change between versions. For example, in Visual
| C++ 2013, they used the Windows Concurrency Runtime, which
| was awful under heavy contention. Newer MSVCRT versions use
| SRWLOCK.
|
| (And I wouldn't characterize myself as being _overly_
| impressed... for my particular scenario I wrote, "if you
| have a poorly written app that's bottlenecked on a lock, then
| consider targeting Windows to make the best of a bad
| situation." A better approach, of course, would be to just
| improve your code!)
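|
| For readers who haven't used it: SRWLOCK is Win32's slim
| reader/writer lock; it's pointer-sized and needs no explicit
| destruction, unlike CRITICAL_SECTION. The exclusive-mode usage
| under discussion looks like this:
|
|     #include <windows.h>
|
|     SRWLOCK lock = SRWLOCK_INIT;
|
|     void guarded_work(void) {
|         AcquireSRWLockExclusive(&lock);
|         /* ... critical section ... */
|         ReleaseSRWLockExclusive(&lock);
|     }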
| dumdood wrote:
| Threads and mutexes are the most complicated things in computer
| science. I am always skeptical of new implementations until
| they've been used for several years at scale. Bugs in these
| threading mechanisms often elude even the most intense scrutiny.
| When Java hit the scene in the mid-90s, it exposed all manner of
| thread and mutex bugs in Solaris. I don't want the fastest mutex
| implementation - I want a reliable one.
___________________________________________________________________
(page generated 2024-10-02 23:00 UTC)