[HN Gopher] Why does musl make my Rust code so slow? (2020)
       ___________________________________________________________________
        
       Why does musl make my Rust code so slow? (2020)
        
       Author : croemer
       Score  : 44 points
       Date   : 2023-12-12 18:13 UTC (4 hours ago)
        
 (HTM) web link (andygrove.io)
 (TXT) w3m dump (andygrove.io)
        
       | croemer wrote:
       | This appears to still be an issue in late 2023 when using the
       | default allocator, as I discovered here:
       | https://github.com/nextstrain/nextclade/issues/1338
        
       | croemer wrote:
       | Previous discussion in 2020:
       | https://news.ycombinator.com/item?id=23080290
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Why does musl make my Rust code so slow?_ -
         | https://news.ycombinator.com/item?id=23080290 - May 2020 (72
         | comments)
        
       | flohofwoe wrote:
       | Hmm... if the allocator affects performance so much, then maybe
       | the code simply uses the allocator too frequently?
       | 
       | In C++, if someone starts frantically searching for a faster
       | general purpose allocator then it's usually a sign of putting
       | each C++ object into its own heap allocation behind a smart
       | pointer, creating and destroying tiny objects all the time, and
       | then wondering why so much time is spent in the allocator.
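        | 
        | For illustration, roughly this shape, sketched in Rust rather
        | than C++ (names made up, not from the article):
        | 
        |   struct Point { x: f64, y: f64 }
        |   
        |   fn main() {
        |       // One heap allocation per tiny object: the allocator is
        |       // hit a million times.
        |       let boxed: Vec<Box<Point>> = (0..1_000_000)
        |           .map(|i| Box::new(Point { x: i as f64, y: 0.0 }))
        |           .collect();
        |   
        |       // Versus a handful of allocations for the whole
        |       // collection, stored inline.
        |       let inline: Vec<Point> = (0..1_000_000)
        |           .map(|i| Point { x: i as f64, y: 0.0 })
        |           .collect();
        |   
        |       println!("{} {}", boxed.len(), inline.len());
        |   }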
        
         | a_humean wrote:
          | Not sure it's right to blame the code when changing the base
          | image/allocator results in a 30x performance improvement on
          | the same hardware. I would be inclined to blame a poor choice
          | of base image/allocator.
        
           | croemer wrote:
            | It's not a Docker thing: we're getting the same 30x
            | performance hit without using Docker, simply by compiling
            | musl vs. glibc binaries with the default allocator chosen by
            | Rust.
        
         | croemer wrote:
          | Every optimization that you don't need to do is a win. With
          | glibc, the performance impact of unnecessary allocations might
          | be minimal, but with musl, allocations suddenly matter a lot.
        
         | the_mitsuhiko wrote:
         | > Hmm... if the allocator affects performance so much, then
         | maybe the code simply uses the allocator too frequently?
         | 
          | I'm not familiar with musl's allocator, so it's hard for me to
          | say anything about it. However, what is pretty common is that
          | modern allocators use thread-local caches to avoid hammering
          | global locks too much. Since Rust code is often very
          | concurrent, you don't need many allocations before these
          | optimizations pay off.
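          | 
          | A minimal sketch of the workload shape where this matters:
          | several threads allocating and freeing small objects at once
          | (illustrative code, not from the article). Under a malloc
          | with one global lock the threads serialize; with per-thread
          | caches they mostly don't. Build it with and without --target
          | x86_64-unknown-linux-musl to compare.
          | 
          |   use std::thread;
          |   
          |   fn main() {
          |       let handles: Vec<_> = (0..8)
          |           .map(|_| {
          |               thread::spawn(|| {
          |                   for i in 0..1_000_000u64 {
          |                       // Allocate and drop a small Vec; every
          |                       // iteration goes through malloc/free.
          |                       let v = vec![i; 16];
          |                       std::hint::black_box(&v);
          |                   }
          |               })
          |           })
          |           .collect();
          |       for h in handles {
          |           h.join().unwrap();
          |       }
          |   }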
        
         | wredue wrote:
          | In a lot of ways, Rust definitely encourages heavy use of the
          | allocator.
         | 
         | Even if you program better for it, one of the prevailing
         | opinions on "learning about lifetimes" is to "not learn
         | lifetimes, just make copies".
        
       | returningfory2 wrote:
       | Minimizing Docker image size is my favorite premature
       | optimization.
       | 
       | In most cases it's not even an optimization! Docker's layered
       | caching means the most relevant size is the diff between the base
       | image and the final image. This is generally independent of which
       | base image you use.
       | 
        | It has a bunch of negative consequences, as this blog post shows.
       | 
       | It's something that many people new to Docker feel they should be
       | doing, so you see it a lot.
       | 
       | Just an all-round great premature optimization.
        
         | sshine wrote:
         | I deploy Docker images on a lot of small devices with sketchy
         | internet connections.
         | 
         | During development, it really does matter if my images are 4MB
         | or 80MB.
         | 
          | Just as when compile times and test runs start stretching into
          | the minutes, my feedback cycle is broken.
         | 
         | I also compile Rust statically to musl, so I'll seriously
         | consider building the final image on the device to avoid
         | copying the base image during deployment. Or ditch Docker
         | entirely.
        
           | Xelynega wrote:
           | Isn't their point that docker brought incrementalness to
           | this?
           | 
           | So just like when you change one c file you don't have to
           | recompile the whole project, updating the last layer in the
           | image should only require a few mb of transfer, the base
           | layers would only be transfered once.
        
           | returningfory2 wrote:
            | My initial suspicion is that this falls into the "In most
           | cases it's not even an optimization" case, but do correct me
           | if I'm wrong.
           | 
           | When you create a new Docker image with a recompiled Rust
           | binary, only the last Docker layer will need to be downloaded
           | to the device. Incremental caching means Docker won't re-
           | download the base layer. The last layer is basically the new
           | binary itself. So I don't see how it would be slower than
           | copying the raw binary. But maybe I'm missing something.
        
         | maxmcd wrote:
         | Maybe "picking alpine images because they are small without
         | thinking deeply about it" is the mistake here?
         | 
          | Debian images can end up using a lot of apt-get tricks to
          | install dependencies cleanly. Way easier with Alpine. I think
         | there are plenty of reasons people are quick to make this
         | decision without the necessary depth of investigation.
         | 
          | Minimizing Docker image size is great. Cold starts with large
         | image downloads can cause all sorts of fun issues. Yes, if
         | you're shipping lots of dependencies in your containers the
         | size of the base image can quickly be irrelevant. And yes, I
         | think it is usually reached for prematurely, but the benefits
         | are there for the few that need them.
        
         | egnehots wrote:
         | Yes, but on the other hand, people who don't care can create
         | huge images (especially in Node.js and Java ecosystems) that
         | slow down CI and have a large attack surface.
        
         | eikenberry wrote:
          | You don't use Alpine because it is the smallest anymore, but
          | because it has the best package repo. It beats the big two,
          | Fedora and Debian (and Nix), in terms of package coverage and
          | being kept up to date. At least in all my testing.
          | 
          | I don't think it is a premature optimization anymore; it has
          | become the best practice by default. The network effect at
          | work.
        
           | whoopdedo wrote:
           | It has to do that because upstream software won't have
           | Alpine-specific build pipelines. (Though I expect a lot more
            | do nowadays.) Meanwhile, everyone tests on Fedora, Debian,
            | and Ubuntu, so if something isn't in the distribution
            | repository, or the version isn't up to date, it's as simple
            | as having your Dockerfile fetch and build from upstream.
        
           | amarshall wrote:
            | The up-to-dateness claim is at best only partly true. As far
            | as total packages go, Alpine definitely loses to all three in
            | your cohort. No need to guess; there are plenty of statistics
            | at https://repology.org/repositories/statistics
        
       | zX41ZdbW wrote:
        | This is only the case when you use the default malloc, memcpy,
        | or string functions from libc.
       | 
       | In ClickHouse, we use jemalloc as a memory allocator and custom
       | memcpy:
       | https://github.com/ClickHouse/ClickHouse/blob/master/base/gl...
       | 
        | So a musl build does not have to imply performance degradation.
        | But using musl is not tied to Docker anyway: ClickHouse is a
        | single self-contained binary, and it is easy to use without
        | Docker.
        
         | MuffinFlavored wrote:
          | Fantasy land: it'd be cool if something like this got
          | upstreamed into stdlib/libc repos as, say, "fast_memcpy" (I'm
          | guessing it's not backwards-compatible enough to make it the
          | default) so large, powerful players don't all have their own
          | "slightly-non-standard" way of doing this.
        
       | croemer wrote:
       | This recent blog post explains what happened here:
       | https://www.tweag.io/blog/2023-08-10-rust-static-link-with-m...
       | 
       | "When it comes to musl, there's a certain performance trap: its
       | allocator does not perform well in multi-core environments."
       | 
       | "A common cause of the above symptom is thread contention:
       | multiple threads are fighting to acquire the same resource
       | instead of doing useful work. Although it's better than deadlock
       | or livelock, since the application runs to completion, it's still
       | a significant waste of CPU."
       | 
       | "The real source of thread contention is in the malloc
       | implementation of musl. A malloc implementation must be thread-
       | safe, as multiple threads may allocate memory at once or even
       | free memory allocated in other threads. Thus, the thread
       | synchronization logic could become a bottleneck.
       | 
       | Although I have not deep-dived into the musl malloc codebase to
       | locate the contention, replacing it with a cutting-edge malloc
       | implementation like mimalloc is sufficient to rectify the problem
       | and enhance performance. Mimalloc minimizes contention and
       | enhances multi-threaded performance. In fact, this replacement
       | can make the application run even faster than the default glibc
       | build."
        
         | kragen wrote:
         | per-thread malloc pools are probably enough to solve this
         | problem and are probably only a few dozen lines of code? i
         | haven't written one so i could be wrong, but it sounds like it
         | might be an acceptable fix to musl
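          | 
          | something like this toy thread-local free list for a single
          | block size (a sketch; a real malloc also needs size classes
          | and cross-thread free, which is where the lines of code go):
          | 
          |   use std::cell::RefCell;
          |   
          |   const BLOCK: usize = 64;
          |   
          |   thread_local! {
          |       // each thread recycles its own freed blocks, so the
          |       // fast path never touches a global lock
          |       static FREE: RefCell<Vec<Box<[u8; BLOCK]>>> =
          |           RefCell::new(Vec::new());
          |   }
          |   
          |   fn pool_alloc() -> Box<[u8; BLOCK]> {
          |       FREE.with(|f| f.borrow_mut().pop())
          |           .unwrap_or_else(|| Box::new([0u8; BLOCK]))
          |   }
          |   
          |   fn pool_free(b: Box<[u8; BLOCK]>) {
          |       FREE.with(|f| f.borrow_mut().push(b));
          |   }
          |   
          |   fn main() {
          |       let b = pool_alloc();
          |       pool_free(b);
          |       let _reused = pool_alloc(); // pops the block freed above
          |   }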
        
           | mperham wrote:
           | > probably only a few dozen lines of code
           | 
           | Classic HN.
        
       | Ericson2314 wrote:
        | libc is a bad idea. Deciding that the default allocator and the
        | code for talking to the kernel should be provided by the same
        | library was a bad call.
       | 
       | Musl should be able to exist without providing _any_ allocator.
        
         | sidkshatriya wrote:
          | You can use other allocators with musl. For example, in
          | Chimera Linux, musl uses LLVM's Scudo allocator instead of the
          | default allocator packaged with musl libc.
        
       | NelsonMinar wrote:
       | Is there somewhere I can read more about Debian slim? I see
       | there's a bookworm-slim but it's not clear what's removed or how
        | much smaller it is. I'm asking for Proxmox; I've been using
        | Alpine containers because they are so small, but wouldn't mind
        | other options. Debian-12-standard is 126MB (Alpine is 3MB).
        
       ___________________________________________________________________
       (page generated 2023-12-12 23:01 UTC)