[HN Gopher] The Fastest Mutexes
___________________________________________________________________
The Fastest Mutexes
Author : jart
Score : 472 points
Date : 2024-10-02 15:31 UTC (7 hours ago)
(HTM) web link (justine.lol)
(TXT) w3m dump (justine.lol)
| kccqzy wrote:
| > The reason why Cosmopolitan Mutexes are so good is because I
| used a library called nsync. It only has 371 stars on GitHub, but
| it was written by a distinguished engineer at Google called Mike
| Burrows.
|
| Indeed this is the first time I've heard of nsync, but Mike
| Burrows also wrote Google's production mutex implementation at
| https://github.com/abseil/abseil-cpp/blob/master/absl/synchr...
| I'm curious why this mutex implementation is absent from the
| author's benchmarks.
|
| By the way, if the author farms out to __ulock on macOS, this
| could be more simply achieved by just using the wait() and
| notify_one() member functions in libc++'s atomic library.
|
| A while ago there was also a giant thread related to improving
| Rust's mutex implementation at https://github.com/rust-
| lang/rust/issues/93740#issuecomment-.... What's interesting is
| that there's a detailed discussion of the inner workings of
| almost every popular mutex implementation.
| tialaramex wrote:
| Although it does get there eventually, that Rust thread is
| about Mara's work, which is why it eventually mentions her
| January 2023 book.
|
| The current Rust mutex implementation (which that thread does
| talk about later) landed earlier this year, and although on
| Linux it's not (much?) different, on Windows and Mac I
| believe it's new work.
|
| That said, Mara's descriptions of the guts of other people's
| implementations are still interesting, just make sure to check
| whether they're outdated for your situation.
| SkiFire13 wrote:
| > although if you're on Linux it's not (much?) different
|
| AFAIK one reason to switch was that mutexes on Linux and
| macOS were not guaranteed to be movable, so Rust's Mutex had
| to box the underlying OS mutex and was not const-
| constructible. So this was a considerable change.
| tialaramex wrote:
| That's the reason for Mara's Mutex. I know, it doesn't seem
| like five minutes, but Mara's is now the _previous_ version
| of the Rust Mutex implementation.
|
| std::sync::Mutex::new became const in 1.63, which was _over
| two years ago_.
| lilyball wrote:
| It looks like Mara also did the work to make it const-
| constructible, so it's still her implementation, no?
| https://github.com/rust-lang/rust/pull/97647
| tialaramex wrote:
| Sorry, maybe I was unclear, when I pointed out that this
| work was in 2022 that's not because I was trying to say
| it's recent, that's _two years ago_. The current work was
| in 2024.
|
| Here's the 2024 rework of Windows Mutex:
| https://github.com/rust-lang/rust/pull/121956/
|
| Edited to add: Since we're going back and forth, I read
| the blame log. The Linux implementation, with which I'm
| most familiar, is still 85% Mara's, although with some
| meaningful changes from 2024.
| the-rc wrote:
| Burrows is also responsible for the Burrows Wheeler Transform,
| Bigtable, Dapper and Chubby, among others.
| loeg wrote:
| > I'm curious why [Abseil's] mutex implementation is absent
| from the author's benchmarks.
|
| Possibly because it's C++ (as opposed to C)? I am speculating.
| vitus wrote:
| > Possibly because it's C++ (as opposed to C)?
|
| MSVC 2022's std::mutex is listed, though. (That said, GCC's /
| clang's std::mutex is not listed for Linux or macOS.)
|
| absl::Mutex does come with some microbenchmarks with a
| handful of points of comparison (std::mutex,
| absl::base_internal::SpinLock) which _might_ be useful to get
| an approximate baseline.
|
| https://github.com/abseil/abseil-
| cpp/blob/master/absl/synchronization/mutex_benchmark.cc
| BryantD wrote:
| Mike was a legend by the time I got to AV. The myth was that
| any time the search engine needed to be faster, he came in and
| rewrote a few core functions and went back to whatever else he
| was doing. Might be true, I just can't verify it personally.
| Extremely smart engineer who cares about efficiency.
|
| We did not, however, run on one server for any length of time.
| rnrn wrote:
| Is Cosmopolitan's mutex also less fair than the other
| implementations compared?
| cogman10 wrote:
| It's not fair, but (from the description) it does make some
| effort to be fairish. It'll queue up waiters in a linked list,
| which is fairly fair, but new arrivals can jump ahead of the
| line and grab the CAS before the list is processed.
|
| However, it has added logic to start chunking through the queue
| after 30 wakeups from a waiter. With that, waiting isn't
| indefinite, but it also isn't fair.
|
| I have no idea how that compares to other implementations'
| fairness. I know the JDK recently abandoned fairness because of
| the overhead and complexity it added around mutex handling.
| tialaramex wrote:
| Fairness is known to cause serious problems such as convoys,
| so while "unfair" isn't itself a desirable property, you
| might choose to be unfair because the alternatives are worse
| for some applications.
|
| Probably a specifically Fair version of any concurrent
| primitive which can be fair is worth giving a distinct name,
| the way you'd offer both a _stable_ and an _unstable_ sort,
| knowing that the unstable sort will often (though not always)
| be faster but some people cannot tolerate an unstable sort so
| it's useless for them.
|
| On the other hand, maybe it's an attractive nuisance and if
| you offered this you'd find most users took your FairMutex,
| and then were angry because of the inevitable problems from
| fairness, even though the unfair one was right there...
| greiskul wrote:
| In a lot of cases, programmers don't even care about
| fairness, but do care about starvation. Is there a word for
| structures like the one discussed here, that are unfair but
| appear to prevent unlimited starvation?
| DSMan195276 wrote:
| I don't think this lock is _guaranteed_ to prevent
| starvation, it just makes an effort at it. There are only
| two priority levels and a hard-coded 30 wake-ups required
| to enter high priority - if waiters were continually
| joining then there could always be more than one entry in
| high priority and an entry could get stuck there forever.
| Typically it won't matter, but if you have high
| contention then this might not be good enough.
| pizlonator wrote:
| Most fast locks these days are unfair. It turns out the perf
| advantage of unfairness is so massive that it's just Better
| (TM).
| tombert wrote:
| I have to admit that I have an extremely visceral, negative
| feeling whenever I see a mutex, simply because I've had to debug
| enough code written by engineers who don't really know how to use
| them, so a large part of previous jobs has been to remove locks
| from code and replace with some kind of queue or messaging
| abstraction [1].
|
| It's only recently that I've been actively looking into different
| locking algorithms, just because I've been diving in head-first
| to a lot of pure concurrency and distributed computing theory, a
| lot of which is about figuring out clever ways of doing mutexes
| with different tradeoffs.
|
| I've gotten a lot better with them now, and while I still
| personally will gravitate towards messaging-based concurrency
| over locks, I do feel the need to start playing with some of the
| more efficient locking tools in C, like nsync (mentioned in this
| article).
|
| [1] Before you give me shit over this, generally the replacement
| code runs at roughly the same speed, and I at least personally
| think that it's easier to reason about.
| aidenn0 wrote:
| I feel similarly about C's "volatile" (when used in
| multithreaded code rather than device drivers). I've seen
| people scatter volatile around randomly until the problem goes
| away. Given that volatile significantly disturbs the timing of
| a program, any timing-sensitive bugs can be masked by adding it
| around randomly.
| cogman10 wrote:
| There seems to be a lot of voodoo beliefs around concurrent
| programming that lead to really bad things.
|
| One of the best books I've read on it is Java concurrency in
| practice [1]. It does an excellent job of dispelling these
| occultic beliefs and letting the reader know exactly when and
| how concurrency should be implemented. It is applicable to
| more languages than just java, especially since many have
| adopted large parts of the java memory model.
|
| The worst things I usually find when reviewing concurrent
| code are people not using locks when they should, using
| locks when they shouldn't, and having inconsistent data
| guards. I've seen people throw in random locks to guard local
| non-shared state, which is just crazy town: "Multiple
| threads are running this code, so I'm adding a lock".
|
| I certainly prefer message passing over shared state.
| However, it's a little baffling to me why it's so hard for
| devs to grasp how to properly maintain shared state. Instead
| of just learning the basic rules, it gets couched in "It's
| just too hard to understand so keep adding things until it
| works".
|
| [1] https://www.amazon.com/Java-Concurrency-Practice-Brian-
| Goetz...
| LegionMammal978 wrote:
| > However, it's a little baffling to me why it's so hard
| for devs to grasp how to properly maintain shared state.
| Instead of just learning the basic rules, it gets couched
| in "It's just too hard to understand so keep adding things
| until it works".
|
| Probably because most people aren't aware that there are
| basic rules to be learned. I'd imagine the typical
| experience is, you're very familiar with single-threaded
| code, and now you're trying to let other threads work with
| your data. You have heard that there are many pitfalls, and
| that there are special-purpose tools like mutexes to avoid
| those, but you look at the examples and find them mostly
| baffling. "Why do they perform these incantations for this
| data but not that data, or in this place but not that
| place?" So you come up with some weird mental model and
| move on with your life, never aware that there are
| underlying principles for maintaining shared state.
|
| Personally, I didn't understand mutexes very well at all,
| until I started looking into what the atomic memory
| orderings from C++ et al. were supposed to mean.
| Maxatar wrote:
| Not too sure what the basic rules are and I'm not able to
| find any list of such rules.
|
| For me the biggest challenge when sharing state is that
| the only benefit I can see for parallelism is
| performance, so if I'm not gaining performance there is
| no reason to use parallelism. If I use coarse-grained
| mutexes then I end up with code that is straightforward to
| reason about, but I lose the performance benefit and in
| fact can end up slower than single-threaded code.
|
| If I use very fine grained mutexes then I end up with
| faster code that has very hard to find bugs that happen
| on very rare occasion.
|
| And then on top of that even if you do write correct fine
| grained locking, you can still end up with slow code due
| to cache behavior such as false sharing and cache
| coherence.
|
| So ultimately I disagree that writing parallel code is
| simple unless you're willing to give up performance in
| which case you may as well just stick to single threaded
| code or use parallelism among independent data. Writing
| correct parallel software that shares state and actually
| delivers substantial performance benefits is incredibly
| difficult, and I am skeptical that there is a set of
| simple rules that one can simply read about.
| tialaramex wrote:
| > Not too sure what the basic rules are and I'm not able
| to find any list of such rules.
|
| The _actual_ rules are completely terrifying because they
| involve the physics of microprocessors. If you've
| watched Grace Hopper's lectures where she gives out
| physical nanoseconds (pieces of wire that are the same
| length as the distance light travels in a nanosecond,
| thus, the maximum possible distance data could travel in
| that time) you can start to appreciate the problem. It is
| _literally impossible_ for the intuitive Sequentially
| Consistent model of how computers work to apply for
| today's fast yet concurrent processors. Light is too slow.
|
| However generally people mean either Java's memory model
| or the C++ 11 (and subsequently 14, 17, 20) memory models
| used in languages such as C++, C and Rust. Those rules
| are less terrifying but still pretty complicated and the
| programming language promises to somehow provide an
| environment where these rules (not the terrifying ones)
| are all you need to know to write software. So that's
| nice.
|
| It _can_ be simple to write parallel code for a language
| designed to make that easy. Yes, even if there's shared
| data. It only starts to get trickier if the shared data
| is _modified_; so long as it isn't, we can make copies of
| it safely and modern CPUs will do that without actual
| work by the programmer.
| cogman10 wrote:
| Are there popular languages that don't have memory models
| which make reasoning about concurrent code easier?
|
| A language with a notion of threading and shared state is
| going to have something akin to read/write barriers built
| into the language memory model to tame the beast.
| jcranmer wrote:
| I think tialaramex is overselling the complexity of
| concurrent memory models in practice, at least for end
| users. In reality, all modern memory models are based on
| the data-race-free theorem, which states that in the
| absence of data races--if your program is correctly
| synchronized--you can't tell that the hardware isn't
| sequentially consistent (i.e., what you naively expected
| it to do).
|
| Correct synchronization is based on the happens-before
| relation; a data race is defined as a write and a
| conflicting read or write such that neither happens-
| before the other. Within a thread, happens-before is just
| regular program order. Across threads, the main happens-
| before relation that matters is that a release-store on a
| memory location happens-before an acquire-load on that
| memory location (this can be generalized to any memory
| location _if_ they're both sequentially-consistent, but
| that's usually not necessary).
|
| The real cardinal rule of concurrent programming is to
| express your semantics at the highest possible level of
| what you're trying to do, and find some library that does
| all the nitty-gritty of the implementation. Can you
| express it with fork-join parallelism? Cool, use your
| standard library's implementation of fork-join and just
| don't care about it otherwise.
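|
| In C11 terms, that release/acquire edge looks like this (a
| minimal sketch; the thread functions and payload here are
| invented for illustration):
|
|     #include <pthread.h>
|     #include <stdatomic.h>
|     #include <stdbool.h>
|     #include <stdio.h>
|
|     int payload;        /* plain, non-atomic data */
|     atomic_bool ready;  /* the synchronizing flag */
|
|     void *producer(void *arg) {
|         payload = 42;                                      /* (1) */
|         atomic_store_explicit(&ready, true,
|                               memory_order_release);       /* (2) */
|         return NULL;
|     }
|
|     void *consumer(void *arg) {
|         while (!atomic_load_explicit(&ready,
|                                      memory_order_acquire)) /* (3) */
|             ;
|         /* (2) happens-before (3), so (1) happens-before (4):
|            this read is guaranteed to see 42, with no data race. */
|         printf("%d\n", payload);                           /* (4) */
|         return NULL;
|     }
|
|     int main(void) {
|         pthread_t a, b;
|         pthread_create(&a, NULL, producer, NULL);
|         pthread_create(&b, NULL, consumer, NULL);
|         pthread_join(a, NULL);
|         pthread_join(b, NULL);
|         return 0;
|     }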
| xxpor wrote:
| C?
| zozbot234 wrote:
| > Not too sure what the basic rules are and I'm not able
| to find any list of such rules.
|
| You may want to consider https://marabos.nl/atomics/ for
| an approachable overview that's still quite rigorous.
| cogman10 wrote:
| > Not too sure what the basic rules are and I'm not able
| to find any list of such rules.
|
| I'd suggest the book in my original comment, Java
| concurrency in practice.
|
| > If I use very fine grained mutexes then I end up with
| faster code that has very hard to find bugs that happen
| on very rare occasion.
|
| I agree this is a real risk if you are doing fine grained
| mutexes. But the rules are the same whether or not you
| want to follow them. If you have shared state (A, B, C)
| and you want to do a calculation based on the values of
| (A, B, C) then you need a mutex which locks (A, B, C).
| Certainly, that becomes a problem if you have calculations
| that just require (A, C) and you might want to avoid
| locking for B. In that case, you need a more complicated
| mechanism for locking than just simple mutexes, which is
| certainly easy to get wrong. When the (A, B, C) actions
| happen you have to ensure that the (A, C) actions can't
| happen at the same time.
|
| This isn't a complicated rule, but it is one that can be
| hard to follow if you are trying to do super fine grained
| locking. It's even trickier if you are going to abuse the
| platform to get correct results.
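|
| A minimal pthreads sketch of that rule (the struct and field
| names here are invented):
|
|     #include <pthread.h>
|
|     struct state {
|         pthread_mutex_t mu;  /* guards a, b, and c together */
|         int a, b, c;
|     };
|
|     /* Any calculation over (a, b, c) holds mu for the whole
|        read-modify-write, so it sees a consistent snapshot. */
|     int sum_abc(struct state *s) {
|         pthread_mutex_lock(&s->mu);
|         int r = s->a + s->b + s->c;
|         pthread_mutex_unlock(&s->mu);
|         return r;
|     }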
|
| But fine v. coarse isn't the problem I'm referring to when
| I say people get the simple rules wrong. Rather than
| worrying about fine vs coarse grained locking, I very
| frequently see code where mutexes and concurrency
| primitives are just peppered everywhere and haphazardly.
| We might call that super coarse grained.
| SJC_Hacker wrote:
| > For me the biggest challenge when sharing state is that
| the only benefit I can see for parallelism is
| performance, so if I'm not gaining performance there is
| no reason to use parallelism.
|
| Aside from performance, another very common reason is to
| not lock the UI from the user. Even in UI-less programs,
| the ability to abort some operation which is taking too
| long. Another is averaging out performance of compute
| tasks, even in the case where it would be faster to
| handle them sequentially. Without some degree of
| parallelism these things are not possible.
|
| Consider a web server. Without parallelism every single
| request is going to completely lock the program until it's
| complete. With parallelism, you can spawn off each
| request, and handle new ones as they come in. Perceived
| performance for the majority of users in this case is
| significantly improved even in the case of a single-
| processor system - e.g. you have 99 requests which each
| take a single second, and then one which takes 101
| seconds. Total request time is 200 seconds / 100 requests
| = 2 seconds average per request, but if that 101-second
| request comes in first, the other 99 are locked for 101
| seconds, so the average is now > 100 seconds per request ...
| Maxatar wrote:
| >Aside from performance, another very common reason is to
| not lock the UI from the user.
|
| This is not a good fit for parallelism; this is pretty
| much always accomplished using concurrency, i.e.
| async/await.
| f1shy wrote:
| Maybe because I had a complete semester of multiprogramming
| at uni, I find it almost trivial to work in such
| environments, and cannot comprehend why there is so much
| mysticism and voodoo around it. It's actually pretty simple.
| tombert wrote:
| I feel like it's not terribly hard to write something
| that more or less works using mutexes and the like, but I
| find it exceedingly hard to debug. You're at the mercy of
| timing and the scheduler, meaning that often just
| throwing a breakpoint and stepping through isn't as easy
| as it would be with a sequential program.
|
| I feel like with a queue or messaging abstraction, it can
| be easier to debug. Generally your actual _work_ is being
| done on a single thread, meaning that traditional
| debugging tools work fine, and as I've said in sibling
| comments, I also just think it's easier to reason about
| what's going on.
| tombert wrote:
| +1 for the Java Concurrency in Practice book. It's the book
| I recommend to nearly everyone who wants to get into
| concurrent programming. Goetz makes it a lot more
| approachable than most other books.
| hinkley wrote:
| Goetz has come a long way. I knew one of the people who
| contributed to that book, and he was a little frustrated
| about having to explain things to Goetz that he felt he
| shouldn't have had to. The implication was he'd already
| had this conversation with some of the other
| contributors.
|
| Sometimes though, the newbie is going to write the
| clearest documentation.
| hinkley wrote:
| I loved concurrent code when I was starting out. I'd taken
| a pretty good distributed computing class which started the
| ball rolling. They just fit into how my brain worked very
| well.
|
| Then I had to explain my code to other devs, either before
| or after they broke it, and over and over I got the message
| that I was being too clever. I've been writing Grug-brained
| concurrent code for so long I'm not sure I can still do the
| fancy shit anymore, but I'm okay with that. In fact I know
| I implemented multiple-reader single-writer at least a few
| times, and that came back to me during this thread, but I
| still can't remember how I implemented it.
| tombert wrote:
| That's something I'm afraid of for my latest project. I
| did some concurrent stuff that wasn't 100% clear would
| actually work, and I had to write a PlusCal spec to
| exhaustively prove to myself that what I was doing is
| actually OK.
|
| It works pretty well, and I'm getting decent speeds, but
| I'm really scared someone is going to come and "fix" all
| my code by doing it the "normal" way, and thus slow
| everything down. I've been trying to comment the hell out
| of everything, and I've shared the PlusCal spec, but no
| one else on my team knows PlusCal and I feel like most
| engineers don't actually read comments, so I think it's
| an inevitability that my baby is killed.
| tialaramex wrote:
| In most cases (in a C or C++ compiler, not Java) it's just
| straight up incorrect to use volatile for something other
| than memory mapped I/O. Yes, POSIX guarantees that in a
| specific case (signal handling IIRC) it'll do what you meant
| if you use volatile int. That's nice, but more generally this
| is not the right choice.
|
| Unfortunately Microsoft enshrined the situation (on Windows,
| on their compiler, on x86 and x86-64 only) that volatile
| primitive types are effectively atomics with Acquire-Release
| ordering. This is of course awkward when Microsoft tries to
| bring people to a non-x86 architecture and it can't just give
| them this because it would _suck_ really hard, so finally
| they have to grow up and teach their developers about actual
| atomics.
|
| !! Edited to fix: Previously this said Relaxed ordering; the
| ordering guaranteed by Microsoft is in fact Acquire-Release,
| hence it's expensive to provide for architectures where
| that's not the default.
| hinkley wrote:
| When Java implemented volatile it didn't do anything. Later
| when they fixed the memory model to deal with out of order
| execution they made it part of the fence semantics, and
| then it actually made some sense.
| zozbot234 wrote:
| The "volatile" keyword should _never_ be used for C /C++
| multithreaded code. It's specifically intended for access to
| device-mapped addresses and does not account for any specific
| memory model, so using it for multithreading will lead to
| breakage. Please use the C/C++ memory model facilities
| instead.
|
| (As a contrast, note that in Java the "volatile" keyword can
| be used for multithreading, but again this does not apply to
| C/C++.)
| hinkley wrote:
| I'm surprised that's true. C borrowed very heavily from
| Java when fixing the NUMA situations that were creeping
| into modern processors.
| jcranmer wrote:
| The C/C++ memory model is directly derived from the Java
| 5 memory model. However, the decision was made that
| volatile in C/C++ specifically referred to memory-mapped
| I/O stuff, and the extra machinery needed to effect the
| sequential consistency guarantees was undesirable. As a
| result, what is volatile in Java is _Atomic in C and
| std::atomic in C++.
|
| C/C++ also went further and adopted a few different
| notions of atomic variables, so you can choose between a
| sequentially-consistent atomic variable, a
| release/acquire atomic variable, a release/consume atomic
| variable (which ended up going unimplemented for
| reasons), and a fully relaxed atomic variable (whose
| specification turned out to be unexpectedly tortuous).
| aidenn0 wrote:
| > Please use the C/C++ memory model facilities instead
|
| I should point out that for more than half of my
| professional career, those facilities did not exist, so
| volatile was the most portable way of implementing e.g. a
| spinlock without the compiler optimizing away the check.
| There was a period after compilers began aggressively
| inlining and before C11 came out in which it could be
| quite hard to otherwise convince a compiler that a value
| might change.
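|
| For contrast, a sketch of the old idiom next to the C11
| facilities (illustrative only, not from any particular
| codebase):
|
|     #include <stdatomic.h>
|
|     /* Pre-C11: volatile stops the compiler from hoisting the
|        load, but guarantees nothing about atomicity or ordering. */
|     volatile int busy;
|
|     /* C11: atomic_flag gives a real test-and-set with ordering. */
|     atomic_flag lk = ATOMIC_FLAG_INIT;
|
|     void lock(void) {
|         while (atomic_flag_test_and_set_explicit(&lk,
|                                                  memory_order_acquire))
|             ;  /* spin */
|     }
|
|     void unlock(void) {
|         atomic_flag_clear_explicit(&lk, memory_order_release);
|     }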
| jart wrote:
| What are some examples of people using mutexes wrong? I know
| one gotcha is you need to maintain a consistent hierarchy.
| Usually the easiest way to not get snagged by that, is to have
| critical sections be small and pure. Java's whole MO of letting
| people add a synchronized keyword to an entire method was
| probably not the greatest idea.
| cogman10 wrote:
| When, how, and why.
|
| The biggest part of mutexes and how to properly use them is
| thinking of the consistency of the data that you are working
| with.
|
| Here's a really common bug (pseudocode):
|
|     if (lock { data.size() } > 0) {
|         value = lock { data.pop() }
|         lock { foo.add(value) }
|     }
|
| The issue here is size can change, pop can change, and foo
| can change in unexpected ways between each of the acquired
| locks.
|
| The right way to write this code is:
|
|     lock {
|         if (data.size() > 0) {
|             value = data.pop()
|             foo.add(value)
|         }
|     }
|
| That ensures the data is all in a consistent state while you
| are mutating it.
|
| Now, what does make this tricky is someone well-meaning might
| have decided to push the lock down a method.
|
| Imagine, for example, you have a `Foo` where all methods
| operate within a mutex.
|
| This code is also (likely) incorrect:
|
|     value = foo.bar()
|     if (value.bat()) {
|         foo.baz(value)
|     }
|
| The problem here is exactly the same problem as above.
| Between `foo.bar()` and `foo.baz()` the state of foo may have
| changed such that running `foo.baz(value)` is now a mistake.
| That's why the right thing to do is likely to have a
| `foo.barBaz()` method that encapsulates the `if` logic to
| avoid inconsistency (or to add another mutex).
|
| In java, the most common manifestation (that I see) of this
| looks like this:
|
|     var map = new ConcurrentHashMap();
|     if (map.get(foo) == null)
|         map.put(foo, new Value());
|
| Because now, you have a situation where the value of `foo` in
| the map could be 2 or more values depending on who gets it.
| So, if someone is mutating `Value` concurrently you have a
| weird hard-to-track-down data race.
|
| The solution to this problem in java is:
|
|     map.computeIfAbsent(foo, (unused) -> new Value());
| hinkley wrote:
| Composing locks is where Java usually blows up.
|
| And computeIfAbsent can end up holding the lock for too
| long if the function is slow.
| foobazgt wrote:
| Composing locks isn't a Java problem - it's a fundamental
| abstraction problem with locks. This is one of the
| reasons why you usually reach for higher level
| abstractions than mutexes.
|
| > And computeIfAbsent can end up holding the lock for too
| long if the function is slow.
|
| How is this different from any other lock-holding code
| written anywhere?
| hinkley wrote:
| I'm saying Java is exceptionally bad at this because
| every object is its own mutex.
|
| And you end up having to trade single-core performance
| for multi-core by deciding to speculatively calculate the
| object. If there's no object to make, the critical section
| is very small. But as the object sprouts features you
| start smashing face-first into Amdahl.
| foobazgt wrote:
| > because every object is its own mutex.
|
| Not true in any practical sense.
|
| > And you end up having to trade single core performance
| for multi core by deciding to speculatively calculate the
| object.
|
| What is the alternative you suggest? If you care about
| having the predicate actually hold, and you also don't
| want to have to hold the lock while constructing the
| object, then you're going to end up in an optimistic-
| concurrency scenario where you check the predicate under
| lock, compute the object, and check again before swapping
| the value in. You may end up having to throw your work
| away when you discover the predicate changed. Java is no
| better nor worse at doing this than anything else.
| hinkley wrote:
| > Not true in any practical sense.
|
| This is going to put a damper on any further
| conversation.
|
| Even with coarsening and elision, every synchronized
| function takes a lock on the enclosing object.
| another-acct wrote:
| Might want to move foo.add() out of the lock scope
| (assuming foo is a thread-private resource):
|
|     value = nil
|     lock {
|         if (data.size() > 0) {
|             value = data.pop()
|         }
|     }
|     if (value) {
|         foo.add(value)
|     }
| samatman wrote:
| This is one of the areas where Zig's combination of
| anonymous blocks and block-based defer really pays off.
| Creating a locked region of code is just this:
|
|     {
|         mutex.lock();
|         defer mutex.unlock();
|         // Do mutex things
|     }
|
| It's possible to get this wrong still, of course, but both
| the anonymous scope and the use of `defer` make it easier
| to get things right.
|
| Nothing can prevent poor engineering around mutex use
| though. I'd want a critical path for a concurrent hashmap
| to look like this:
|
|     {
|         shared_map.lock();
|         defer shared_map.unlock();
|         if (shared_map.getOrNull(foo) == null) {
|             shared_map.put(foo, new_val);
|         }
|     }
|
| Where the SharedMap type has an internal mutex, and a way
| to check it, and all operations panic if no lock has been
| acquired. It could have
| `shared_map.lockAndGet(OrNull?)(...)`, so that the kind of
| problem pattern you're describing would stand out on the
| page, but it's still a one-liner to do an atomic get when
| that's all you need the critical path to perform.
|
| I don't think these invariants are overly onerous to
| uphold, but one does have to understand that they're a hard
| requirement.
| jhatemyjob wrote:
| I digress but my autistic brain couldn't help itself.
| Provided that it's a recursive lock you could do this
| instead of adding a new method `foo.barBaz`:
|
|     foo.lock {
|         value = foo.bar()  // foo.lock within this method is ignored
|         if (value.bat()) {
|             foo.baz()      // foo.lock within this method is ignored
|         }
|     }
|
| Also, to catch this bug early, you could assert foo is
| locked in `value.bat` or something. But that may or may not
| be feasible depending on how the codebase is structured
| tombert wrote:
| Personally I've had issues with performance because of people
| using `synchronized` too liberally, where they end up locking
| a lot more code than necessary. I've also had issues with
| fairly typical circular dependencies causing deadlock, or at
| least pauses that aren't strictly necessary. Deadlock doesn't
| happen nearly as often as textbooks have led me to believe,
| but it can happen with sloppily written code.
|
| In regards to Java, at this point I almost never use the
| `synchronized` keyword anymore and instead (if I can't easily
| map to some kind of queuing abstraction) use the
| ReentrantLock object simply because of the ability to have
| lock acquisition time out, and also letting you opt-in to
| fairness if you'd like. It's not as pretty but it's more
| flexible and as far as I'm aware it doesn't affect
| performance much.
|
| For the most part, though, in Java, you can get away without
| (explicit) locks by simply abusing the built-in data
| structures. I know they're using their own synchronization
| techniques behind the scenes, but I trust those to be correct
| more than some ad-hoc stuff I'd write as an engineer.
| foobazgt wrote:
| Java's take on monitors was definitely not great, and people
| were emulating mutexes with them even in the language's
| earliest days.
|
| Still there are a lot of things that can go wrong with
| mutexes: forgetting to unlock in the case of exceptions,
| priority inversion, recursive locking, deadlock, needlessly
| high contention, etc.
|
| Java has had an excellent concurrency runtime with
| abstractions that are typically a better fit than a bare
| mutex for over 20 years now (c.f. Doug Lea). Synchronized
| still exists, because of Java's excellent backwards
| compatibility.
| foobiekr wrote:
| I've always disliked that lock cyclic dependencies are
| discussed as a hierarchy when what it really comes down to is
| a linear order of locks.
|
| The problem with lock _hierarchies_ as a concept is that a
| lock really should represent serialization of access to a
| particular pool of data, and should make no assumptions that
| it being held implies some other lock's domain is also held.
| The code that results when people do not maintain this kind
| of rigor is quite terrible, but hierarchies tend to steer
| people into thinking that way because they imply recursively
| taking locks.
|
| Stated differently: locks should be taken and released in a
| fixed order - so locks are ranked - but there should not be a
| model where all lower-ranked locks must be held for a given
| lock to be taken. The lock protects its domain and the
| ordering of take and release is to prevent deadlock, but
| there's no requirement for completeness.
| bob1029 wrote:
| > some kind of queue or messaging abstraction
|
| Agreed. I find things like LMAX Disruptor much easier to reason
| about.
| tombert wrote:
| Even within Java, something like BlockingQueue will get you
| pretty far, and that's built into the runtime.
|
| If I am allowed to use libraries, I end up using Vert.x for
| nearly everything. I think that their eventbus abstraction is
| easy enough to reason about, and even without it, simply
| using the non-blocking stuff it provides ends up being pretty
| handy.
| jeffbee wrote:
| Message passing is just outsourcing the lock, right? For
| example a Go channel is internally synchronized, nothing magic
| about it.
|
| Most of the mutex tragedies I have seen in my career have been
| in C, a useless language without effective scopes. In C++ it's
| pretty easy to use a scoped lock. In fact I'd say I have had
| more trouble with people who are trying to avoid locks than
| with people who use them. The avoiders either think their
| program order is obviously correct (totally wrong on modern
| CPUs) or that their atomics are faster (wrong again on many
| CPUs).
| loeg wrote:
| > Message passing is just outsourcing the lock, right?
|
| More or less, yeah. You can write an MPSC queue that doesn't
| explicitly use a lock (or even anything that looks like a
| lock).
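|
| For instance, something like Vyukov's intrusive MPSC queue,
| sketched here in C11 (illustrative, not production code; push
| is a single atomic exchange usable from any thread, pop is
| single-consumer):
|
|     #include <stdatomic.h>
|     #include <stddef.h>
|
|     struct node { _Atomic(struct node *) next; };
|
|     struct mpsc {
|         _Atomic(struct node *) head;  /* producers push here */
|         struct node *tail;            /* consumer pops here  */
|         struct node stub;             /* dummy node          */
|     };
|
|     void mpsc_init(struct mpsc *q) {
|         atomic_store(&q->stub.next, NULL);
|         atomic_store(&q->head, &q->stub);
|         q->tail = &q->stub;
|     }
|
|     void mpsc_push(struct mpsc *q, struct node *n) {
|         atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
|         struct node *prev =
|             atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
|         atomic_store_explicit(&prev->next, n, memory_order_release);
|     }
|
|     /* Returns NULL when empty or when a producer is mid-push. */
|     struct node *mpsc_pop(struct mpsc *q) {
|         struct node *tail = q->tail;
|         struct node *next =
|             atomic_load_explicit(&tail->next, memory_order_acquire);
|         if (tail == &q->stub) {  /* skip the dummy node */
|             if (!next) return NULL;
|             q->tail = next;
|             tail = next;
|             next = atomic_load_explicit(&tail->next, memory_order_acquire);
|         }
|         if (next) { q->tail = next; return tail; }
|         if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
|             return NULL;         /* producer mid-push; retry later */
|         mpsc_push(q, &q->stub);  /* re-insert dummy to free last node */
|         next = atomic_load_explicit(&tail->next, memory_order_acquire);
|         if (next) { q->tail = next; return tail; }
|         return NULL;
|     }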
| tombert wrote:
| It's definitely doing synchronization behind the scenes, no
| argument here. BlockingQueues in Java seem to use
| ReentrantLocks everywhere. It's outsourcing the lock to
| people who understand locks better.
|
| It just abstracts this detail away for me, and I personally
| trust the libraries implementing these abstractions to be
| more correct than some ad hoc thing I write. It's an
| abstraction that I personally find a lot easier to reason
| about, and so my thinking is this: if my reasoning is more
| likely to be correct because of the easier abstraction, and
| the internal synchronization is more likely to be correct,
| then it's more likely that my code will be correct.
|
| I don't do super low-level stuff at all, most of my stuff
| ends up touching a network, so the small differences between
| the built-in synchronized structures vs the regular ones
| really don't matter since any small gains I'd get on that
| will be eaten the first time I hit the network, so a
| considerably higher ROI for me is almost always figuring out
| how to reduce latency.
|
| If I did C or C++, I'd probably have different opinions on
| this stuff.
| foobazgt wrote:
| Every abstraction is about outsourcing the thing it's
| abstracting away. If using a queue solves your problem, you
| no longer have to deal with all the headaches that you can
| run into using a bare mutex.
| another-acct wrote:
| > C, a useless language without effective scopes
|
| Mutexes can be handled safely in C. It's "just another
| flavor" of resource management, which does take quite a bit
| of discipline. Cascading error paths / exit paths help.
| sgarland wrote:
| > C, a useless language
|
| You misspelled "fast as fuck" and "lingua franca of all
| architectures."
| another-acct wrote:
| > remove locks from code and replace with some kind of queue or
| messaging abstraction
|
| Shared-nothing message passing reflects the underlying (modern)
| computer architecture more closely, so I'd call the above a
| good move. Shared memory / symmetric multiprocessing is an
| abstraction that leaks like a sieve; it no longer reflects how
| modern computers are built (multiple levels of CPU caches,
| cores, sockets, NUMA, etc).
| tombert wrote:
| That's good to hear! I am pretty removed from underlying
| hardware now, so it makes me happy to hear that better way of
| doing things is catching on even in low-level land.
| gpderetta wrote:
| If you are doing pure shared nothing message passing, you do
| not need coherent caches; in fact cache coherency gets in the
| way of pure message passing.
|
| Vice versa, if you do pure message passing you are not
| benefiting from hardware-accelerated cache coherency and
| are leaving performance (and usability) on the floor.
| intelVISA wrote:
| Shared-nothing is typically The Right Choice in my experience
| as well. Maybe the odd atomic...
| alberth wrote:
| Are there any Linux distros built with/using Cosmo?
|
| (like Alpine use of musl)
| forrestthewoods wrote:
| Hrm. Big fan of Justine and their work. However, this is probably
| the _least_ interesting benchmark test case for a Mutex. You
| should never have a bunch of threads constantly spamming the same
| mutex. So which mutex implementation best handles this case isn't
| particularly interesting, imho.
| jart wrote:
| What do you consider a good benchmark test case for mutexes?
| pizlonator wrote:
| Very large multithreaded programs with lots of diverse uses
| of locks, including:
|
| - uncontended, mildly contended, and horrifically contended
|
| - short and long critical sections
|
| - contention among small numbers of threads and large numbers
| of threads
|
| - contention that happens when other locks are held
| recursively and those are also contended on (like, thread A
| wants a lock held by thread B, but thread B is trying to get
| a lock held by thread C)
|
| Different lock algos work well or badly depending on how the
| locks are used, so it's important to pick real programs as
| your benchmark rather than trying to cook a synthetic
| benchmark.
| convolvatron wrote:
| what else would you measure? certainly the uncontended case is
| important and a baseline, but otherwise this is kind of weak
| point for mutexes - that if you don't handle contention well
| then you have idle hardware or lots of additional scheduler
| work or kernel crossings.
|
| [edit - I forget to even mention one of the most important
| things, that locks that reform poorly under contention can have
| really negative systemic effects like hot spotting the memory
| network, and that would show up here too]
| tialaramex wrote:
| Uncontended is _crucial_. If you want to benchmark other
| things that's excellent, but if MutexA has crap uncontended
| performance then I'm on a loser if we pick MutexA unless I am
| absolutely sure we will have a lot of contention. Since
| contention is never desirable, that's rare.
|
| Think of this like the random input case for a sort
| benchmark. Do I want stuff like all-ascending, all-
| descending, zig-zag and so on? Sure, those are nice. But
| without the random input case the benchmark is not very
| helpful. I _might_ sort a zig-zag, I _might_ sort data
| that's already in ascending order, but I _will_ sort random
| data; that's going to happen or else I would not need a sort
| function.
| jart wrote:
| Uncontended is uninteresting, because all mutex
| implementations perform roughly the same here, give or take
| a nanosecond or two. If you're truly uncontended then a
| naive spin lock will actually seem fastest, because xchg is
| faster than cmpxchg, which is needed for good locks.
| loeg wrote:
| Uh, why do you say a naive spin lock would use xchg
| instead of cmpxchg? I don't think you could make a valid
| spinlock using xchg.
| jart wrote:
| On x86 you can. When xchg is used with a memory parameter
| it locks the bus. This is true even in the absence of a
| lock prefix. I included a spinlock implementation in the
| blog post. If you see any errors with it, then please let
| me know!
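|
| Roughly this shape, as a sketch (not the blog post's exact
| code):
|
|     #include <stdatomic.h>
|
|     atomic_int lk;  /* 0 = free, 1 = held */
|
|     void spin_lock(void) {
|         /* compiles to a plain `xchg` on x86, which is implicitly
|            locked whenever it has a memory operand */
|         while (atomic_exchange_explicit(&lk, 1, memory_order_acquire))
|             ;
|     }
|
|     void spin_unlock(void) {
|         atomic_store_explicit(&lk, 0, memory_order_release);
|     }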
| loeg wrote:
| Oh, sure, your 1-bit spinlock with no other state works.
| cogman10 wrote:
| > You should never have a bunch of threads constantly spamming
| the same mutex.
|
| I'm not sure I agree with this assessment. I can think of a few
| cases where you might end up with a bunch of threads
| challenging the same mutex.
|
| A simple example would be something like concurrently
| populating some data structure (list/dict/etc). Yes, you could
| accomplish this with message passing, but that uses more memory
| and would be slower than just having everything wait to write
| to a shared location.
| kccqzy wrote:
| > would be slower than just having everything wait to write
| to a shared location
|
| Nope.
| cogman10 wrote:
| Yup.
|
| Message passing has allocation pressure and cache
| coherency pressure not present when using a shared memory
| location. Especially as the amount of memory in question
| goes up, the benefit of a shared location increases in
| terms of the performance impact.
|
| Sure, for something silly like writing to an int, there is
| negative benefit in a shared location, but when you start
| talking about a dictionary with 1 million entries, the
| shared location becomes much more of a benefit vs all the
| copying and allocating you'd have to do if you tried to do
| the same thing with message passing.
|
| For some datastructures, it's the optimal way to move data
| around. For example, LMAX disruptor is about the fastest
| way to pass messages because of the shared memory and well
| tuned locks.
| forrestthewoods wrote:
| If you're trying to mutate a dictionary many times from many
| threads you're going to have a bad time. The fix isn't a
| faster mutex, it's don't do that.
| cogman10 wrote:
| Depends on the dictionary implementation. There's a number
| of thread safe dictionaries in the wild with varying
| degrees of parallelism performance. Pretty much all of them
| benefit from faster mutexes.
|
| For example, some thread safe dictionaries will segment
| their underlying key/value pairs which allows them to have
| concurrent reads and writes for a given segment which
| significantly improves performance.
| Salgat wrote:
| I'd say the vast majority of cases where I use a lock/semaphore
| is around very expensive resources, where the utilization of
| that resource vastly outweighs any performance overhead of the
| lock.
| another-acct wrote:
| This is how it should be. IIRC -- apologies, can't find a
| source --, Ulrich Drepper wrote somewhere about NPTL that its
| mutexes were not particularly lightweight, but that you
| should design your program for low contention anyways.
|
| For highly contended data structures, spinlocks (and nowadays
| explicit atomics) are likely better.
| joelthelion wrote:
| Could established libcs adopt this? If not, why not?
| loeg wrote:
| Maybe. nsync is Apache licensed.
| pizlonator wrote:
| Always cool to see new mutex implementations and shootouts
| between them, but I don't like how this one is benchmarked. Looks
| like a microbenchmark.
|
| Most of us who ship fast locks use very large multithreaded
| programs as our primary way of testing performance. The things
| that make a mutex fast or slow seem to be different for complex
| workloads with varied critical section length, varied numbers of
| threads contending, and varying levels of contention.
|
| (Source: I wrote the fast locks that WebKit uses, I'm the person
| who invented the ParkingLot abstraction for lock impls (now also
| used in Rust and Unreal Engine), and I previously did research on
| fast locks for Java and have a paper about that.)
| PaulDavisThe1st wrote:
| To add to this, as the original/lead author of a desktop app
| that frequently runs with many tens of threads, I'd like to see
| numbers on performance in _non-heavily contended cases_. As a
| real-time (audio) programmer, I am more concerned with (for
| example) the cost to take the mutex even when it is not already
| locked (which is the overwhelming situation in our app).
| Likewise, I want to know the cost of a try-lock operation that
| will fail, not what happens when N threads are contending.
|
| Of course, with Cosmopolitan being open source and all, I could
| do these measurements myself, but still ...
| ataylor284_ wrote:
| The writeup suggests this implementation is optimized for the
| not-locked case:
|
| > uses an optimistic CAS (compare and swap) immediately, so
| that locking happens quickly when there's no contention
| pizlonator wrote:
| Totally!
|
| Pro tip: if you really do know that contention is unlikely,
| and uncontended acquisition is super important, then it's
| theoretically impossible to do better than a spinlock.
|
| Reason: locks that have the ability to put the thread to
| sleep on a queue must do compare-and-swap (or at least an
| atomic RMW) on `unlock`. But spinlocks can get away with just
| doing a store-release (or just a store with a compiler fence
| on X86) to `unlock`.
|
| Spinlocks also have excellent throughput under most
| contention scenarios, though at the cost of power and being
| unkind to other apps on the system. If you want your spinlock
| to be hella fast on contention just make sure you
| `sched_yield` before each retry (or `SwitchToThread` on
| Windows, and on Darwin you can do a bit better with
| `thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1)`).
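|
| A sketch of why, with a Drepper-style futex mutex (Linux-only;
| the futex wrapper is hand-rolled here because libc doesn't
| export one):
|
|     #include <linux/futex.h>
|     #include <stdatomic.h>
|     #include <sys/syscall.h>
|     #include <unistd.h>
|
|     /* 0 = unlocked, 1 = locked, 2 = locked, maybe waiters */
|     static atomic_int m;
|
|     static void futex(int op, int val) {
|         syscall(SYS_futex, &m, op, val, NULL, NULL, 0);
|     }
|
|     void mutex_lock(void) {
|         int c = 0;
|         if (atomic_compare_exchange_strong(&m, &c, 1))
|             return;                      /* uncontended fast path */
|         if (c != 2)
|             c = atomic_exchange(&m, 2);  /* mark contended */
|         while (c != 0) {
|             futex(FUTEX_WAIT, 2);        /* sleep while m == 2 */
|             c = atomic_exchange(&m, 2);
|         }
|     }
|
|     void mutex_unlock(void) {
|         /* the atomic RMW is unavoidable: we must learn, atomically,
|            whether anyone went to sleep */
|         if (atomic_exchange(&m, 0) == 2)
|             futex(FUTEX_WAKE, 1);
|     }
|
| A spinlock's unlock, by contrast, is just the plain
| store-release, with no load anywhere in sight.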
| PaulDavisThe1st wrote:
| We use spinlocks where appropriate. In the 90s I recall
| that the general rule of thumb was if the lock is held for
| <10x the context switch time, spinlocks are generally a
| better choice. Not sure if that's still true of
| contemporary architectures.
|
| The more common pattern in rt/audio code is "try to take
| the lock, but have an alternate code path if that fails".
| It's not that there is never going to be contention, but it
| will be extremely rare, and when it occurs, it probably
| matters.
| RWLocks are also a common pattern, with the RT thread(s)
| being read-only (and still being required to fall back on
| an alternate code path if the read lock cannot be taken).
| pizlonator wrote:
| These days, fast lock implementations use the following
| rough idiom, or some idiom that is demonstrably not any
| slower even for short critical sections:
|
|     if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
|     for (unsigned i = 0; i < 40; ++i) {
|         /* spin for a bounded amount of time */
|         if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
|         sched_yield();
|     }
|     ... /* actual full lock algo goes here */
|
| So, the reason to use spinlocks isn't that they are
| faster for short critical sections, but that they don't
| have to CAS on unlock - and so they are faster especially
| in the uncontended case (and in all other cases too).
|
| In other words, the modern rule of thumb is something
| like: if you're going to grab the lock so frequently that
| the uncontended lock/unlock time shows up as a
| significant percentage of your execution time, then use a
| spinlock. That implies that you are probably holding the
| lock for <100x or even <1000x context switch time, or
| something around there.
| adastra22 wrote:
| Huh, interesting. TIL. Thanks for that!
| lilyball wrote:
| On Darwin, it's possible for a pure spinlock to produce a
| priority inversion deadlock, because Darwin has a quality
| of service implementation in the kernel that differs from
| how everyone else handles thread priority. In other
| kernels, a low-priority thread will still eventually be
| guaranteed a cpu slice, so if it's holding a spinlock, it
| will eventually make progress and unlock. On Darwin with
| Quality of Service, it's possible for higher-QoS threads to
| preempt lower-QoS threads indefinitely.
|
| For this reason, on Darwin you want to avoid spinlocks
| unless you know that all threads touching the spinlock are
| always running in the same QoS. Instead of spinlocks, your
| go-to for low-overhead locks there is os_unfair_lock, which
| is a spinlock variant that donates priority of the blocked
| threads to the running thread.
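|
| Usage is about as simple as a raw spinlock (minimal sketch of
| the real Darwin API):
|
|     #include <os/lock.h>
|
|     os_unfair_lock lk = OS_UNFAIR_LOCK_INIT;
|
|     void critical(void) {
|         os_unfair_lock_lock(&lk);   /* waiters donate their priority */
|         /* ... critical section ... */
|         os_unfair_lock_unlock(&lk);
|     }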
| pizlonator wrote:
| I've shipped code on Darwin that spinlocks and gets away
| with it without any noticeable cases of this happening.
|
| I know it can happen in theory. But theory and practice
| ain't the same.
|
| I worked for Apple when I shipped this too lmao
| ot wrote:
| > Reason: locks that have the ability to put the thread to
| sleep on a queue must do compare-and-swap (or at least an
| atomic RMW) on `unlock`. But spinlocks can get away with
| just doing a store-release (or just a store with a compiler
| fence on X86) to `unlock`.
|
| This is something I've been thinking about a lot over time:
| the CAS is only there to atomically determine, on unlock,
| whether there are any sleeping waiters so you have to do a
| futex_wake. I would really want some way to get away with
| non-fenced operations (at least on x86), but I don't know
| if it's just that nobody has figured out how, or there is a
| fundamental reason why that's not possible.
| pizlonator wrote:
| You do need a fence in the unlock path though (at least a
| release fence).
|
| I think the issue is that if you ask the CPU to just
| store something (like in a spin lock), whether or not
| there's a fence, it's an operation with limited data flow
| dependencies so it's easy for the CPU to execute. Even
| the fence can be handled using wacky speculation tricks.
|
| But if you want to do something like, "store this value
| but only if the old value satisfies some predicate", then
| there's a load and the whole thing is dependent on the
| load. So you're asking the CPU to load, then run a
| predicate, then store, and for that to be fenced, and
| atomic.
|
| Strictly more work. I don't think there's any trick to
| make it faster than just the store release.
| Animats wrote:
| > just make sure you `sched_yield` before each retry
|
| Assuming `sched_yield` does something.
|
| There's a futex congestion problem inside Wine's memory
| allocator. There are several levels of locks. If you're
| growing a buffer, in the sense of C's "realloc", and no
| buffer is available, memory allocation is locked during the
| allocation of a bigger buffer, copying of the contents, and
| release of the old buffer. "Push" type operations can force
| this. Two orders of magnitude performance drops ensue when
| multi-threaded programs are contending for that lock.[1]
|
| Inside one of the lock loops is a call to "YieldProcessor":
|
|     static void spin_lock( LONG *lock )
|     {
|         while (InterlockedCompareExchange( lock, -1, 0 ))
|             YieldProcessor();
|     }
|
| But the actual code for YieldProcessor is a NOP on x86:[2]
|
|     static FORCEINLINE void YieldProcessor(void)
|     {
|     #ifdef __GNUC__
|     #if defined(__i386__) || defined(__x86_64__)
|         __asm__ __volatile__( "rep; nop" : : : "memory" );
|     #elif defined(__arm__) || defined(__aarch64__)
|         __asm__ __volatile__( "dmb ishst\n\tyield" : : : "memory" );
|     #else
|         __asm__ __volatile__( "" : : : "memory" );
|     #endif
|     #endif
|     }
|
| Wine devs are aware of this, but the mess is bad enough
| that no one has tackled it. This is down in the core of
| what "malloc" calls, so changes there could have unexpected
| effects on many programs. Needs attention from someone
| really into mutexes.
|
| [1] https://forum.winehq.org/viewtopic.php?t=37688
|
| [2] https://gitlab.winehq.org/wine/wine/-/blob/HEAD/include
| /winn...
| pizlonator wrote:
| sched_yield isn't a nop
| secondcoming wrote:
| `rep; nop;` is actually the `pause` instruction. On older
| CPUs it's a standard nop, but on newer CPUs it's a more
| efficient nop.
|
| Spinning on the CMPXCHG is also a bad idea. You should
| spin on the read and only then attempt the CMPXCHG.
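|
| That is, a test-and-test-and-set loop, roughly (sketch):
|
|     #include <immintrin.h>  /* _mm_pause */
|     #include <stdatomic.h>
|
|     atomic_int lk;
|
|     void spin_lock(void) {
|         for (;;) {
|             /* spin read-only on the locally cached line first */
|             while (atomic_load_explicit(&lk, memory_order_relaxed))
|                 _mm_pause();  /* the `pause` / "rep; nop" hint */
|             /* only then attempt the expensive atomic RMW */
|             int expect = 0;
|             if (atomic_compare_exchange_weak_explicit(
|                     &lk, &expect, 1,
|                     memory_order_acquire, memory_order_relaxed))
|                 return;
|         }
|     }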
| uvdn7 wrote:
| I was thinking the same. There are many mutexes out there, some
| are better at certain workloads than the rest. DistributedMutex
| and SharedMutex come to mind (https://github.com/facebook/folly
| /blob/main/folly/synchroniz..., https://github.com/facebook/fol
| ly/blob/main/folly/SharedMute...) Just like hashmaps, it's
| rarely the case that a single hashmap is better under _all_
| possible workloads.
| pizlonator wrote:
| Yeah.
|
| I should say, though, that if you're on Windows then I have
| yet to find a real workload where SRWLock isn't the fastest
| (provided you're fine with no recursion and with a lock that
| is word-sized). That lock has made some kind of deal with the
| devil AFAICT.
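|
| For reference, it's a single pointer-sized word and trivially
| used (minimal sketch of the real Win32 API):
|
|     #include <windows.h>
|
|     SRWLOCK lk = SRWLOCK_INIT;  /* one word, no handle to create */
|
|     void writer(void) {
|         AcquireSRWLockExclusive(&lk);
|         /* ... write-side critical section ... */
|         ReleaseSRWLockExclusive(&lk);
|     }
|
|     void reader(void) {
|         AcquireSRWLockShared(&lk);
|         /* ... read-side critical section ... */
|         ReleaseSRWLockShared(&lk);
|     }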
| ngoldbaum wrote:
| This style of mutex will also power PyMutex in Python 3.13. I
| have real-world benchmarks showing how much faster PyMutex is
| than the old PyThread_type_lock that was available before 3.13.
| miohtama wrote:
| Any rough summary?
| ngoldbaum wrote:
| https://github.com/numpy/numpy/issues/26510#issuecomment-22
| 9...
|
| And now that I look at that again I realize I forgot to
| finish that up!
| 0xDEADFED5 wrote:
| Can I use PyMutex from my own Python code?
| intelVISA wrote:
| Fast locks are an oxymoron vs optimized CAS
| ot wrote:
| Yeah, that specific benchmark is actually likely to prefer
| undesirable behaviors, for example pathological unfairness:
| clearly the optimal scheduling of those threads runs first all
| the increments from the first thread, then all of the second
| thread, etc... because this will minimize inter-processor
| traffic.
|
| A mutex that sleeps for a fixed amount (for example 100us) on
| lock acquisition failure will probably get very close to that
| behavior (since it almost always bunches), and "win" the
| benchmark. Still, that would be a terrible mutex for any
| practical application where there is any contention.
|
| This is not to say that this mutex is not good (or that pthread
| mutexes are not bad), just that the microbenchmark in question
| does not measure anything that predicts performance in a real
| application.
| pizlonator wrote:
| Yeah! For all we know this performs great on real programs.
|
| And for all we know it's absolute trash on real programs.
| gok wrote:
| Consider adopting `os_sync_wait_on_address()` on Darwin for your
| futex needs
| jart wrote:
| I've used that. It's just as good as ulock although relatively
| new. The issue is that using this API makes cancelation points
| no longer atomic. SIGTHR needs to be able to know the exact
| instruction in memory where an asynchronous signal is delivered
| when interrupting a wait operation and that's not possible if
| it's inside an opaque library.
| jonathrg wrote:
| > It's still a new C library and it's a little rough around the
| edges. But it's getting so good, so fast, that I'm starting to
| view not using it in production as an abandonment of professional
| responsibility.
|
| What an odd statement. I appreciate the Cosmopolitan project, but
| these exaggerated claims of superiority are usually a pretty bad
| red flag.
| tredre3 wrote:
| I'd like to point out that Justine's claims are usually
| correct. It's just her shtick (or personality?) to use
| hyperbole and ego-stroking wording. I can see why some might
| see it as abrasive (it has caused drama before, namely in
| llamacpp).
| jancsika wrote:
| This case isn't abrasive, but it's certainly incoherent.
|
| Name a single case where professional responsibility would
| require C code _advertised_ as "rough around the edges" to
| be used anywhere near production. (The weasel words "starting
| to" do not help the logic of that sentence.)
|
| I can definitely understand how OP could view this as a red
| flag. The author should amend it.
| another-acct wrote:
| I also meant to comment about the grandstanding in her post.
|
| Technical achievement aside, when a person invents something
| new, the burden is _on them_ to prove that the new thing is a
| suitable replacement of / improvement over the existing
| stuff. "I'm starting to view /not/ using [cosmo] in
| production as an abandonment of professional responsibility"
| is emotional manipulation -- it's guilt-tripping.
| Professional responsibility is the exact opposite of what she
| suggests: it's not jumping on the newest bandwagon. "a little
| rough around the edges" is precisely what production
| environments don't want; predictability/stability is
| frequently more important than peak performance /
| microbenchmarks.
|
| Furthermore,
|
| > The C library is so deeply embedded in the software supply
| chain, and so depended upon, that you really don't want it to
| be a planet killer.
|
| This is just underhanded. She implicitly called glibc and
| musl "planet killers".
|
| First, technically speaking, it's just not true; and even if
| the implied statement were remotely true (i.e., if those
| mutex implementations were in fact responsible for a
| significant amount of cycles in actual workloads), the
| emotional load / snide remark ("planet killer") is
| unjustified.
|
| Second, she must know very well that whenever efficiency of
| computation is improved, we don't use that for running the
| same workloads as before at lower cost / smaller
| environmental footprint. Instead, we keep all CPUs pegged all
| the time, and efficiency improvements only ever translate to
| _larger profit_. A faster mutex too translates to more $$$
| pocketed, and not to less energy consumed.
|
| I find her tone of voice repulsive.
| vitus wrote:
| I agree overall with your sentiment but wanted to comment
| on one of your statements that I perceived to be hyperbole.
|
| > Second, she must know very well that whenever efficiency
| of computation is improved, we don't use that for running
| the same workloads as before at lower cost / smaller
| environmental footprint. Instead, we keep all CPUs pegged
| all the time, and efficiency improvements only ever
| translate to larger profit. A faster mutex too translates
| to more $$$ pocketed, and not to less energy consumed.
|
| It depends on the use case. If you can serve the same
| number of users / requests with fewer machines, then you
| buy and run fewer machines. (Yes, saving energy, but also
| saving on both capex and opex.)
|
| Also, when you're talking about anything resembling
| interactivity (as you might in the context of, say, a
| webserver), you really don't want to run anywhere close to
| 100% average utilization. With unbounded queues, you end up
| with arbitrarily high wait times; with bounded queues, you
| end up serving 503s and 429s and other server errors.
|
| That said, my experience with modern webservers is that you
| generally don't rely on mutexes for synchronizing most work
| across worker threads, and instead you try to keep your
| workloads as embarrassingly parallel as possible.
| almostgotcaught wrote:
| It's especially a red flag since an enormous number of projects
| (almost all of them?) will never tolerate shipping fat binaries
| (i.e., what Cosmopolitan C is in reality).
| another-acct wrote:
| Agreed; this is what I've always (silently) thought of those
| fat binaries. Absolute stroke of genius, no doubt, and also a
| total abomination (IMO) from a sustainability perspective.
| ataylor284_ wrote:
| The core of this is a library called nsync. It appears most of
| the improvements by Justine are upstreamed into nsync itself,
| which doesn't have any of the baggage of cosmopolitan.
| Whatever your opinion of the project or author, they've made a
| good effort to not lock you in.
| intelVISA wrote:
| Ironic given the generous size of the average Go binary.
| pmarreck wrote:
| I feel that there's a certain amount of hubris that comes along
| with spending long periods of time solo-coding on a computer,
| while perhaps unwittingly starved of social contact. Without any
| checks on you or your work's importance (normally provided by
| your bog-standard "job"), your achievements take on a grandeur
| that they might not have broadly earned, as impressive as they
| might be.
|
| An example is APE (which I otherwise feel is a very impressive
| hack). One criticism might be "oh, so I not only get to be
| insecure on one platform, I can be insecure on many all at
| once?"
|
| The longer you spend in technology, the more you realize that
| there are extremely few win-wins and very many win-somes,
| lose-somes (tradeoffs).
| fefe23 wrote:
| Have you considered that you may have a different kind of humor
| than Justine?
|
| Why would you even post this here? Who do you think this is
| helping?
| jonathrg wrote:
| It doesn't clearly come across as a joke.
| jart wrote:
| It's a splash of dry humor on a personal blog in an
| information-dense article.
| jonathrg wrote:
| OK, Poe's law at work then :)
| another-acct wrote:
| I think it's fair to comment not only on the subject, but on
| the writing itself, too.
|
| And it might help Justine improve her writing (and reach a
| larger audience -- after all, blog posts _intend_ to reach
| _some_ audience, don't they?). Of course you can always say,
| "if you find yourself alienated, it's your loss".
| gr4vityWall wrote:
| It came off as humor to me, at least.
| stonethrowaway wrote:
| > In 2012, Tunney started working for Google as a software
| engineer.[4] In March 2014, Tunney petitioned the US government
| on We the People to hold a referendum asking for support to
| retire all government employees with full pensions, transfer
| administrative authority to the technology industry, and appoint
| the executive chairman of Google Eric Schmidt as CEO of America.
|
| the absolute madman
| Dansvidania wrote:
| I wonder what they (Tunney) think of that now.
| 01HNNWZ0MV43FF wrote:
| Wild. Then again, in 2012 I was on a grippy sock vacation.
| wallstprog wrote:
| " I also managed to make contended nsync mutexes go 30% faster
| than nsync upstream on AARCH64, by porting it to use C11
| atomics."
|
| Curious about this -- so what do C11 atomics use underneath?
| At least in Linux, C++11 atomics _use_ pthreads (not the other
| way around).
| rcxdude wrote:
| It depends on which atomics. In principle most of them should
| map to an underlying CPU primitive, and only fall back to a
| mutex if it's not supported on the platform.
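|
| You can check which case you get with atomic_is_lock_free. A
| minimal sketch (with GCC, link -latomic for the fallback path):
|
|     #include <stdatomic.h>
|     #include <stdio.h>
|
|     typedef struct { long a, b, c; } Big; /* too wide for one op */
|
|     int main(void) {
|         _Atomic long small;
|         _Atomic Big big;
|         printf("%d\n", atomic_is_lock_free(&small)); /* 1: CPU op */
|         printf("%d\n", atomic_is_lock_free(&big));   /* 0: library */
|         return 0;
|     }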
| loeg wrote:
| Atomics mostly map to underlying compiler / CPU intrinsics, not
| pthreads.
| jeffbee wrote:
| I am also curious about this and the ambiguity of "AARCH64".
| There are 64-bit ARM ISA versions without single-instruction
| atomics (LSE), and on these what looks like an atomic op is
| actually an LL/SC retry loop with potentially unbounded
| runtime. The original AWS
| Graviton CPU had this behavior. The version of the ISA that you
| target can have significant performance impact.
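|
| Concretely, the same C11 operation compiles very differently
| depending on the targeted ISA level (a sketch; exact codegen
| varies by compiler):
|
|     #include <stdatomic.h>
|
|     int bump(atomic_int *p) {
|         /* -march=armv8-a:   LDXR/STXR retry loop (LL/SC), which
|          *                   can spin unboundedly under contention
|          * -march=armv8.1-a: one LDADDAL instruction (LSE) */
|         return atomic_fetch_add_explicit(p, 1, memory_order_acq_rel);
|     }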
| jart wrote:
| nsync has wrapper macros abstracting all the various atomics
| libraries, and those macros prevented it from using two things.
|
| 1. Weak CAS. nsync always uses strong CAS upstream to make the
| portability abstraction simpler. Being able to use weak CAS
| when appropriate helps avoid code being generated for an
| additional loop.
|
| 2. Updating the &expected parameter. nsync upstream always
| manually does another relaxed load when a CAS fails. This isn't
| necessary with the C11 atomics API, because it gives you a
| relaxed load of the expected value for free when it fails.
|
| Being able to exploit those two features resulted in a
| considerable improvement in nsync's mu_test.c benchmark for the
| contended mutex case, which I measured on RPI5.
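|
| For the curious, the difference looks roughly like this (an
| illustrative sketch, not nsync's actual code):
|
|     #include <stdatomic.h>
|
|     void set_bits(atomic_uint *word, unsigned bits) {
|         unsigned expected =
|             atomic_load_explicit(word, memory_order_relaxed);
|         /* The weak form may fail spuriously, so no hidden retry
|          * loop is generated; on failure 'expected' is refreshed
|          * with the observed value, so no extra relaxed load. */
|         while (!atomic_compare_exchange_weak_explicit(
|                    word, &expected, expected | bits,
|                    memory_order_acquire, memory_order_relaxed))
|             ;
|     }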
| bjourne wrote:
| I made a benchmark on this last year when I didn't know how slow
| pthread mutexes were:
| https://stackoverflow.com/questions/76965112/why-are-pthread...
| For my use case, the mutex wait amounted to roughly 40% of the
| total runtime and spinlocks were way faster. Perhaps nsync or
| Cosmopolitan would have made my code much faster.
|
| I still believe the FUD around spinlocks is overstated. For
| "normal" HPC code the number of threads should be <= the number
| of cores. In that scenario spinlocking will minimize the
| latency, which is likely what you care about the most. It beats
| having your performance ruined by DVFS.
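|
| For reference, the kind of spinlock I mean is the textbook
| test-and-set one; a minimal sketch:
|
|     #include <stdatomic.h>
|
|     static atomic_flag spin = ATOMIC_FLAG_INIT;
|
|     void spin_lock(void) {
|         /* Busy-wait: fine when threads <= cores, disastrous
|          * if the lock holder gets descheduled. */
|         while (atomic_flag_test_and_set_explicit(
|                    &spin, memory_order_acquire))
|             ;
|     }
|
|     void spin_unlock(void) {
|         atomic_flag_clear_explicit(&spin, memory_order_release);
|     }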
| amiga386 wrote:
| If it's so good, why haven't all C libraries adopted the same
| tricks?
|
| My bet is that its tricks are only reliably faster on certain
| architectures, or certain CPU models, or certain types of
| workload / access patterns... and a proper benchmarking of varied
| workloads on all supported hardware would not show the same
| benefits.
|
| Alternatively, maybe the semantics of the pthread API (that
| cosmopolitan is meant to be implementing) are somehow subtly
| different and this implementation isn't strictly compliant to the
| spec?
|
| I can't imagine it's that the various libc authors aren't keeping
| up in state-of-the-art research on OS primitives...
| jitl wrote:
| > I can't imagine it's that the various libc authors aren't
| keeping up in state-of-the-art research on OS primitives...
|
| is this sarcasm?
|
| (I don't know any libc maintainers, but as a maintainer of a
| few thingies myself, I do not try to implement state-of-the-art
| research, I try to keep my thingies stable and ensure the
| performance is acceptable; implementing research is out of my
| budget for "maintenance")
| amiga386 wrote:
| But if you maintain a few thingies, you'd probably know about
| rival thingies that do a similar thing, right?
|
| If the rival thingies got a speed boost recently, and they
| were open source, you'd want to have a look at how they did
| it and maybe get a similar speed boost for yourself.
| tialaramex wrote:
| This is nowhere near as common as you seem to think, and
| mostly only happens for the narrow cases where somebody is
| obsessed with a particular problem so that they'd actually
| _want_ to read other people's solutions. Most of the
| implementers do not have that sort of relationship to a
| problem they solved in the past.
|
| If in December you make a general purpose stable sort
| that's 25% faster than his, Orson Peters _is_ probably
| going to read your code and try to apply ideas from it. But
| sorting is the thing Peters really cares about, the people
| who wrote the stable sort in say Microsoft 's STL (their
| C++ standard library implementation) even if they still
| work there won't care enough to go back and change that
| unless told to do so.
| jitl wrote:
| It depends on the calculus about (time) budget and
| stability. Maybe I consider performance already acceptable,
| and don't have time budget to investigate beyond that.
| Maybe I look at "nsync", see its mutex (may) change the
| fairness semantics, and then decide not to adopt it because
| this may break my callers; or it's enough that it _may_
| change the fairness semantics, and I don't have the budget
| to test nsync or a new implementation based on the nsync
| algorithm to determine if the semantics differ.
| dist-epoch wrote:
| Politics, not-invented-here syndrome, old maintainers.
|
| It takes forever to change something in glibc, or the C++
| equivalent.
|
| There are many kinds of synchronization primitives. pthreads
| only supports a subset. If you are limiting yourself to them,
| you are most likely leaving performance on the table, but you
| gain portability.
| aseipp wrote:
| Those projects often have dozens of other priorities beyond
| just one specific API, and obsessing over individual APIs isn't
| a good way to spend the limited time they have. In any case, as
| a concrete example to disprove your claim, you can look at
| malloc and string routines in your average libc on Linux.
|
| glibc's malloc is tolerable but loses handily to more modern
| alternatives in overall speed and scalability (it fragments
| badly and degrades over time, not to mention it has a dozen
| knobs that can deeply impact real life workloads like
| MALLOC_ARENA_MAX). musl malloc is completely awful in terms of
| performance at every level; in a multithreaded program, using
| musl's allocator will destroy your performance so badly that it
| nearly qualifies as malpractice, in my experience.
|
| musl doesn't even have things like SIMD-optimized string
| comparison routines. You would be shocked at how many CPU
| cycles in a non-trivial program are spent on those tasks, and
| yes it absolutely shows up in non-trivial profiles, and yes
| improving this improves all programs nearly universally.
| glibc's optimized routines are good, but they can _always_, it
| seems, become faster.
|
| These specific things aren't "oh, they're hyper specific
| optimizations for one architecture that don't generalize".
| These two things in particular -- we're talking 2-5x wall clock
| reduction, and drastically improved long-term working set
| utilization, in nearly all workloads for any given program.
| These are well explored and understood spaces with good known
| approaches. So why didn't they take them? Because, as always,
| they probably had other things to do (or conflicting priorities
| like musl prioritizing simplicity over peak performance, even
| when that philosophy is actively detrimental to users.)
|
| I'm not blaming these projects or anything. Nobody sets out and
| says "My program is slow as shit and does nothing right, and I
| designed it that way and I'm proud of it." But the idea that
| the people working on them have made only the perfect pareto
| frontier of design choices just isn't realistic in the
| slightest and doesn't capture the actual dynamics of how most
| of these projects are run.
| Uehreka wrote:
| So on the one hand, all this Cosmo/ape/redbean stuff sounds
| incredible, and the comments on these articles are usually pretty
| positive and don't generally debunk the concepts. But on the
| other hand, I never hear mention of anyone else using these
| things (I get that not everyone shares what they're doing in a
| big way, but after so many years I'd expect to have seen a couple
| project writeups talk about them). Every mention of
| Cosmo/ape/redbean I've ever seen is from Justine's site.
|
| So I've gotta ask: Is there a catch? Are these tools doing
| something evil to achieve what they're achieving? Is the whole
| thing a tom7-esque joke/troll that I don't get because I'm not as
| deep into compilers/runtimes? Or are these really just ingenious
| tools that haven't caught on yet?
| almostgotcaught wrote:
| > So I've gotta ask: Is there a catch? Are these tools doing
| something evil to achieve what they're achieving?
|
| it's not that complicated; they're fat binaries (plus I guess a
| lot of papering over the differences between the platforms)
| that exploit a quirk of the Thompson shell:
|
| > One day, while studying old code, I found out that it's
| possible to encode Windows Portable Executable files as a UNIX
| Sixth Edition shell script, due to the fact that the Thompson
| Shell didn't use a shebang line.
|
| (https://justine.lol/ape.html)
|
| so the answer is simple: I can't think of anyone that _wants_
| to ship fat binaries.
| fragmede wrote:
| Anyone focused on customer experience wants to ship a binary
| that just works, without customers having to know what a fat
| binary even is.
| thefaux wrote:
| I don't think macOS will allow you to run a downloaded
| cosmo binary without going into security settings and
| enabling it. That's not an experience I'd want my customers
| to have personally which means if you care about mac
| normies, this isn't a viable approach.
| bluGill wrote:
| Unless the customer is one who has issues with large
| downloads. Not everyone has fiber to the home with massive
| storage on modern computers. Some people have entry-level
| systems and settle for slow internet. Often they are in
| poor or "third world" places, which means maybe you don't
| care about them, but they exist.
| 01HNNWZ0MV43FF wrote:
| Fat binaries are fine. Electron is a fat binary in a
| different sense.
| 0cf8612b2e1e wrote:
| Given how fat modern websites are, compassion for the
| bandwidth-deprived seems rare.
| cycomanic wrote:
| Correct me if I'm wrong, but I always thought Apple universal
| binaries are fat binaries? So why did Apple build this
| ability if nobody wants it?
| amiga386 wrote:
| APE works through cunning trickery that might get patched out
| any day now (and in OpenBSD, it has been).
|
| Most people producing cross-platform software _don't_ want a
| single executable that runs on every platform, they want a
| single _codebase_ that works correctly on each platform they
| support.
|
| With that in mind, languages like Go letting you cross-compile
| for all your targets (provided you avoid CGO) are delightful...
| but the three-ways-executable magic trick of APE,
| while really clever, doesn't inspire confidence it'll work
| forever, and for the most part it doesn't gain you anything.
| Each platform has their own packaging/signing requirements. You
| might as well compile a different target for each platform.
| 0x1ceb00da wrote:
| Wasn't the ELF format modified upstream to accommodate
| cosmo? That makes it kinda official. Still hard to see a use
| case for it. If you want everyone to be able to run your
| program, just write a web app, a Win32 program, or a Java
| applet. 20-year-old Java applets still run on modern JVMs.
| bcoates wrote:
| You can't reasonably assume the end user system has a
| system JVM installed (they probably don't, in fact) so
| they're not really an alternative to a fat binary -- if you
| can install dependencies, you can just pick a single-target
| binary while you're at it.
| bmacho wrote:
| Justine has made similar claims about POSIX now allowing binary
| in shell scripts
|
| > This is an idea whose time has come; POSIX even changed
| their rules about binary in shell scripts specifically to
| let us do it.
|
| https://justine.lol/cosmo3/
|
| > Jilles Tjoelker from the FreeBSD project played an
| instrumental role in helping me to get the POSIX rules
| changed to allow binary in shell scripts, which is what
| made this project possible.
|
| https://justine.lol/ape.html
|
| but that's not true, as recently discussed here:
| https://news.ycombinator.com/item?id=41636569
| fragmede wrote:
| > Most people producing cross-platform software don't want a
| single executable that runs on every platform
|
| They don't? Having one file to download instead of a maze of
| "okay so what do you have" is way easier than the current
| mess. It would be very nice not to have to ask users what
| platform they're on, they just click a link and get the
| thing.
| bluGill wrote:
| Tradeoffs. While there is a point to that, memory,
| bandwidth, and storage are not free. Users with constraints
| will want a smaller executable. Of course all three are pretty
| cheap these days, so more and more you can get by with just
| one does-it-all executable and nobody will care about the
| downsides, but let's not lose track of them.
| hyperion2010 wrote:
| > Most people
|
| Well, I'm used to not being most people, but I'd much rather
| be able to produce a single identical binary for my users
| that works everywhere than the platform-specific nonsense I
| have to go through right now. Having to maintain different
| special build processes for different platforms is a stupid
| waste of time.
|
| Frankly this is how it always should have worked except for
| the monopolistic behavior of various platforms in the past.
| bjourne wrote:
| The binary is only one part of the puzzle (and largely
| solved by WSL). Installation/uninstallation and desktop
| integration is just as much of a hassle.
| cycomanic wrote:
| I don't get the argument. If the binary part is solved by
| WSL, it isn't useless, is it? Otherwise why would MS invest
| so many resources into it?
| bjourne wrote:
| WSL can do a lot more than just run Linux binaries.
| cyberax wrote:
| > With that in mind that respect, languages like go letting
| you cross compile for all your targets (provided you avoid
| CGO)
|
| Even that is not a big deal in most cases, if you use zig
| to wrap CC: https://dev.to/kristoff/zig-makes-go-cross-
| compilation-just-...
| jrockway wrote:
| Does this still work? The article is from 2021 but when I
| last tried it this year, Go appeared to (newly) depend on
| headers that Zig doesn't need and thus it doesn't work. The
| GitHub issue was something like "yeah, we don't need those,
| so I guess Go doesn't work anymore". Without the actual
| error message I can't find the issue, however, so maybe I
| imagined this.
| cyberax wrote:
| Yes, I use it in my PKCS#11 client glue code.
| dwattttt wrote:
| I believe these are the issues:
| https://github.com/golang/go/issues/52690
| https://github.com/ziglang/zig/issues/14989
| jkachmar wrote:
| reposting my comment from another time this discussion came up:
|
| "Cosmopolitan has basically always felt like the interesting
| sort of technical loophole that makes for a fun blog post which
| is almost guaranteed to make it to the front page of HN (or
| similar) purely based in ingenuity & dedication to the bit.
|
| as a piece of foundational technology, in the way that `libc`
| necessarily is, it seems primarily useful for fun toys and
| small personal projects.
|
| with that context, it always feels a little strange to see it
| presented as a serious alternative to something like `glibc`,
| `musl`, or `msvcrt`; it's a very cute hack, but if i were to
| find it in something i seriously depend on i think i'd be a
| little taken aback."
| blenderob wrote:
| > Is there a catch?
|
| I am only speaking for myself here. While cosmo and ape do seem
| very clever, I do not need this type of clever stuff in my work
| if the ordinary stuff already works fine.
|
| Like for example if I can already cross-compile my project to
| other OSes and platforms or if I've got the infra to build my
| project for other OSes and platforms, I've no reason to look
| for a solution that lets me build one binary that works
| everywhere.
|
| There's also the thing that ape uses clever hacks to be able to
| run on multiple OSes. What if those hacks break someday due to
| how executable formats evolve? What if nobody has the time to
| patch APE to make it compatible with those changes?
|
| But my boring tools like gcc, clang, go, rust, etc. will
| continue to get updated and they will continue to work with
| evolving OSes. So I just tend to stick with the boring. That's
| why I don't bother with the clever because the boring just
| works for me.
| tangus wrote:
| Last year I needed to make a small webapp to be hosted on a
| Windows server, and I thought RedBean would be ideal for it.
| Unfortunately it was too buggy (at least on Windows).
|
| I don't know whether RedBean is production-ready now, but a
| year and a half ago, that was the catch.
| jart wrote:
| Give the latest nightly build a try:
| https://cosmo.zip/pub/cosmos/bin/ Windows has been a long
| hard march, but we've recently hit near feature completion.
| As of last month, the final major missing piece of the puzzle
| was implemented, which is the ability to send UNIX signals
| between processes. Cosmopolitan does such a good job on
| Windows now that it's not only been sturdy for redbean, but
| much more mature and complicated software as well, like
| Emacs, GNU Make, clang, Qt, and more.
| dundarious wrote:
| Mozilla llamafile uses it. Bundles model weights and an
| executable into a single file, that can be run from any
| cosmo/ape platform, and spawns a redbean http server for you to
| interact with the LLM. Can also run it without the integrated
| weights, and read weights from the filesystem. It's the easiest
| "get up and go" for local LLMs you could possibly create.
| sfn42 wrote:
| Most people aren't writing C as far as I know. We use Java, C#,
| Go, Python, etc.; some lunatics even use Node.
|
| We generally don't care if some mutex is 3x faster than some
| other mutex. Most of the time, if I'm even using a mutex, which
| is rare in itself, the performance of the mutex is generally
| the least of my concerns.
|
| I'm sure it matters to someone, but most people couldn't give
| two shits if they know what they're doing. We're not writing
| code where it's going to make a noticeable difference. There
| are thousands of things in our code we could optimize that
| would make a greater impact than a faster mutex, but we're not
| looking at those either because it's fast enough the way it is.
| 1st1 wrote:
| > I've even had the opportunity to make upstream contributions.
| For example, I found and fixed a bug in his mutex unlock function
| that had gone undiscovered for years.
|
| I see a stream of improvements to the vendored nsync inside
| the cosmopolitan project [1]. Are you planning on upstreaming
| most of those too?
|
| A separate question -- is using the upstream nsync as safe as
| using your fork?
|
| [1]
| https://github.com/jart/cosmopolitan/commits/master/third_pa...
| jart wrote:
| If Burrows wants my C11 atomics refactoring then he shall have
| it. Beyond that, my changes mostly concern libc integration,
| systems integration, and portability. Our projects have
| different goals in those areas, so I'm not sure he needs them.
| yshui wrote:
| I had the pleasure of reverse-engineering Win32 SRWLOCKs, and
| based on the author's description of nsync it is very close to
| how
| SRWLOCK works internally. Kind of surprised how much faster nsync
| is compared to SRWLOCK.
| betimsl wrote:
| And here I thought musl was better than libc _sigh_
| favorited wrote:
| musl _is_ a libc, and while it is superior in some ways, it is
| inferior in others. If you want a statically linked libc, or a
| permissively licensed libc, musl is a fantastic choice. If you
| want a libc with the fastest malloc implementation, you'll
| probably want to look elsewhere.
| akira2501 wrote:
| Production isn't about speed, efficiency, or obviously "clever
| hacks."
|
| If I have to sacrifice 50% of my efficiency to ensure that I
| never get called on Sunday at 3am to fix a broken system, no
| kidding, I'll make that trade every time.
|
| Production is about _reliability_. And writing reliable code is
| 10x harder than writing "fast" code.
| kgeist wrote:
| >Contention is where mutex implementations show their inequality.
| Mark was so impressed by Microsoft's SRWLOCK that he went on to
| recommend Linux and FreeBSD users consider targeting Windows if
| mutex contention is an issue.
|
| Interesting, I remember reading a detailed article where they
| found that there's a lot of severe contention in the Windows
| kernel, compared to Linux. I think it was when they were trying
| to parallelize Chrome builds?
| TinkersW wrote:
| Maybe they weren't using SRWLock; at least last time I checked,
| std::mutex didn't use it with MS STL (they were stuck with
| CRITICAL_SECTION because of binary compatibility).
| markdoubleyou wrote:
| I'm the Mark who's referenced there. When I did that original
| benchmark I discovered that the underlying mutex used by
| MSVCRT did change between versions. For example, in Visual
| C++ 2013, they used the Windows Concurrency Runtime, which
| was awful under heavy contention. Newer MSVCRT versions use
| SRWLOCK.
|
| (And I wouldn't characterize myself as being _overly_
| impressed... for my particular scenario I wrote, "if you
| have a poorly written app that's bottlenecked on a lock, then
| consider targeting Windows to make the best of a bad
| situation." A better approach, of course, would be to just
| improve your code!)
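|
| For readers who haven't used it: SRWLOCK is Win32's slim
| reader/writer lock; it's pointer-sized and needs no explicit
| destruction, unlike CRITICAL_SECTION. The exclusive-mode usage
| under discussion looks like this:
|
|     #include <windows.h>
|
|     SRWLOCK lock = SRWLOCK_INIT;
|
|     void guarded_work(void) {
|         AcquireSRWLockExclusive(&lock);
|         /* ... critical section ... */
|         ReleaseSRWLockExclusive(&lock);
|     }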
| dumdood wrote:
| Threads and mutexes are the most complicated things in computer
| science. I am always skeptical of new implementations until
| they've been used for several years at scale. Bugs in these
| threading mechanisms often elude even the most intense scrutiny.
| When Java hit the scene in the mid-90s, it exposed all manner of
| thread and mutex bugs in Solaris. I don't want the fastest mutex
| implementation - I want a reliable one.
___________________________________________________________________
(page generated 2024-10-02 23:00 UTC)