[HN Gopher] Why does musl make my Rust code so slow? (2020)
___________________________________________________________________
Why does musl make my Rust code so slow? (2020)
Author : croemer
Score : 44 points
Date : 2023-12-12 18:13 UTC (4 hours ago)
(HTM) web link (andygrove.io)
(TXT) w3m dump (andygrove.io)
| croemer wrote:
| This appears to still be an issue in late 2023 when using the
| default allocator, as I discovered here:
| https://github.com/nextstrain/nextclade/issues/1338
| croemer wrote:
| Previous discussion in 2020:
| https://news.ycombinator.com/item?id=23080290
| dang wrote:
| Thanks! Macroexpanded:
|
| _Why does musl make my Rust code so slow?_ -
| https://news.ycombinator.com/item?id=23080290 - May 2020 (72
| comments)
| flohofwoe wrote:
| Hmm... if the allocator affects performance so much, then maybe
| the code simply uses the allocator too frequently?
|
| In C++, if someone starts frantically searching for a faster
| general purpose allocator then it's usually a sign of putting
| each C++ object into its own heap allocation behind a smart
| pointer, creating and destroying tiny objects all the time, and
| then wondering why so much time is spent in the allocator.
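|
| Roughly, in Rust terms (with a made-up tiny Point type just for
| illustration), the difference between the two patterns is:
|
|     #[derive(Clone, Copy)]
|     struct Point { x: f64, y: f64 }
|
|     // One heap allocation per element: the allocator dominates.
|     fn boxed(n: usize) -> Vec<Box<Point>> {
|         (0..n).map(|i| Box::new(Point { x: i as f64, y: 0.0 }))
|               .collect()
|     }
|
|     // Elements stored inline in the Vec's buffer: only a few
|     // (amortized) allocations for the whole collection.
|     fn inline(n: usize) -> Vec<Point> {
|         (0..n).map(|i| Point { x: i as f64, y: 0.0 }).collect()
|     }
|
|     fn main() {
|         println!("{} {}", boxed(1_000_000).len(),
|                           inline(1_000_000).len());
|     }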
| a_humean wrote:
| Not sure it's right to blame the code when changing the base
| image/allocator yields a 30x performance improvement on the
| same hardware. I would be inclined to blame a poor choice of
| base image/allocator.
| croemer wrote:
| It's not a Docker thing; we're getting the same 30x
| performance hit without using Docker, simply by compiling musl
| vs glibc binaries with the default allocator chosen by Rust.
| croemer wrote:
| Every optimization that you don't need to do is a win. With
| glibc, the performance impact of unnecessary allocations might
| be minimal, but with musl, allocations suddenly matter a lot.
| the_mitsuhiko wrote:
| > Hmm... if the allocator affects performance so much, then
| maybe the code simply uses the allocator too frequently?
|
| I'm not familiar with musl's allocator so it's hard for me to
| say anything about it. However, what is pretty common is that
| modern allocators use thread-local allocation to avoid
| hammering global locks too much. Since Rust programs tend to
| be very concurrent, you don't need many allocations to benefit
| from these optimizations.
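|
| As a minimal sketch of the kind of workload that exposes this
| (plain std, nothing musl-specific assumed): every thread
| allocates and frees lots of small strings, so they all hit
| malloc/free at once, and under a lock-heavy allocator they
| mostly end up waiting on each other.
|
|     use std::thread;
|
|     fn main() {
|         let handles: Vec<_> = (0..8).map(|t| {
|             thread::spawn(move || {
|                 let mut total = 0usize;
|                 for i in 0..1_000_000 {
|                     // Each iteration allocates and frees a
|                     // small String, hammering the allocator
|                     // from all threads concurrently.
|                     let s = format!("thread {t} item {i}");
|                     total += s.len();
|                 }
|                 total
|             })
|         }).collect();
|
|         let sum: usize = handles.into_iter()
|             .map(|h| h.join().unwrap())
|             .sum();
|         println!("{sum}");
|     }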
| wredue wrote:
| In a lot of ways, Rust definitely encourages heavy use of the
| allocator.
|
| Even if you try to program around it, one of the prevailing
| opinions on "learning about lifetimes" is to "not learn
| lifetimes, just make copies".
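|
| For example, the borrowed version below allocates nothing,
| while the "just make copies" route allocates a fresh String
| on every call:
|
|     // Borrowing: no allocation, the result's lifetime is tied
|     // to the input.
|     fn first_word(s: &str) -> &str {
|         s.split_whitespace().next().unwrap_or("")
|     }
|
|     // Copying: every call allocates a new String on the heap.
|     fn first_word_owned(s: &str) -> String {
|         s.split_whitespace().next().unwrap_or("").to_string()
|     }
|
|     fn main() {
|         let line = String::from("hello world");
|         println!("{}", first_word(&line));
|         println!("{}", first_word_owned(&line));
|     }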
| returningfory2 wrote:
| Minimizing Docker image size is my favorite premature
| optimization.
|
| In most cases it's not even an optimization! Docker's layered
| caching means the most relevant size is the diff between the base
| image and the final image. This is generally independent of which
| base image you use.
|
| It has a bunch of negative consequences, like the one in this
| blog post.
|
| It's something that many people new to Docker feel they should be
| doing, so you see it a lot.
|
| Just an all-round great premature optimization.
| sshine wrote:
| I deploy Docker images on a lot of small devices with sketchy
| internet connections.
|
| During development, it really does matter if my images are 4MB
| or 80MB.
|
| Just as when compile times and test runs start stretching into
| the minutes, my feedback cycle is broken.
|
| I also compile Rust statically to musl, so I'll seriously
| consider building the final image on the device to avoid
| copying the base image during deployment. Or ditch Docker
| entirely.
| Xelynega wrote:
| Isn't their point that Docker brought incrementality to this?
|
| So, just like you don't have to recompile the whole project
| when you change one C file, updating the last layer in the
| image should only require a few MB of transfer; the base
| layers would only be transferred once.
| returningfory2 wrote:
| My initial suspicion is that this falls into the "In most
| cases it's not even an optimization" case, but do correct me
| if I'm wrong.
|
| When you create a new Docker image with a recompiled Rust
| binary, only the last Docker layer will need to be downloaded
| to the device. Incremental caching means Docker won't re-
| download the base layer. The last layer is basically the new
| binary itself. So I don't see how it would be slower than
| copying the raw binary. But maybe I'm missing something.
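|
| Concretely, with a Dockerfile shaped roughly like this (the
| binary path is made up), the FROM layer stays cached on the
| device and a redeploy only ships the final COPY layer:
|
|     FROM debian:bookworm-slim
|     # Only this layer changes when the Rust binary is rebuilt,
|     # so only a few MB are transferred per deploy.
|     COPY target/release/myapp /usr/local/bin/myapp
|     ENTRYPOINT ["/usr/local/bin/myapp"]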
| maxmcd wrote:
| Maybe "picking alpine images because they are small without
| thinking deeply about it" is the mistake here?
|
| Debian images can end up using a lot of apt-get tricks to
| install dependencies cleanly. Way easier with alpine. I think
| there are plenty of reasons people are quick to make this
| decision without the necessary depth of investigation.
|
| Minimizing docker image size is great. Cold starts with large
| image downloads can cause all sorts of fun issues. Yes, if
| you're shipping lots of dependencies in your containers the
| size of the base image can quickly be irrelevant. And yes, I
| think it is usually reached for prematurely, but the benefits
| are there for the few that need them.
| egnehots wrote:
| Yes, but on the other hand, people who don't care can create
| huge images (especially in Node.js and Java ecosystems) that
| slow down CI and have a large attack surface.
| eikenberry wrote:
| You don't use Alpine because it is the smallest anymore, but
| because it has the best package repo. It beats the big two,
| Fedora and Debian (and Nix), in terms of package coverage and
| being kept up to date. At least in all my testing.
|
| I don't think it is a premature optimization anymore, it has
| become the best practice by default. The network effect at
| work.
| whoopdedo wrote:
| It has to do that because upstream software won't have
| Alpine-specific build pipelines. (Though I expect a lot more
| do nowadays.) Meanwhile everyone tests on Fedora, Debian, and
| Ubuntu so if it isn't in the distribution repository, or the
| version isn't up-to-date, it's as simple as having your
| Dockerfile fetch and build from upstream.
| amarshall wrote:
| The up-to-dateness claim is at best only partly true. As far
| as total package counts go, Alpine definitely loses to all
| three in your cohort. No need to guess; plenty of statistics at
| https://repology.org/repositories/statistics
| zX41ZdbW wrote:
| That is only the case when you use the default malloc, default
| memcpy, or default string functions from libc.
|
| In ClickHouse, we use jemalloc as a memory allocator and custom
| memcpy:
| https://github.com/ClickHouse/ClickHouse/blob/master/base/gl...
|
| So the Musl build does not imply performance degradation. And
| the usage of Musl is not really tied to Docker, because
| ClickHouse is a single self-contained binary anyway, and it is
| easy to use without Docker.
| MuffinFlavored wrote:
| Fantasy land: it'd be cool if something like this got
| upstreamed into stdlib/libc repos as a "fast_memcpy" (I'm
| guessing it's not backwards-compatible enough to make it the
| default), so the large, powerful players don't all have their
| own slightly-non-standard way of doing this?
| croemer wrote:
| This recent blog post explains what happened here:
| https://www.tweag.io/blog/2023-08-10-rust-static-link-with-m...
|
| "When it comes to musl, there's a certain performance trap: its
| allocator does not perform well in multi-core environments."
|
| "A common cause of the above symptom is thread contention:
| multiple threads are fighting to acquire the same resource
| instead of doing useful work. Although it's better than deadlock
| or livelock, since the application runs to completion, it's still
| a significant waste of CPU."
|
| "The real source of thread contention is in the malloc
| implementation of musl. A malloc implementation must be thread-
| safe, as multiple threads may allocate memory at once or even
| free memory allocated in other threads. Thus, the thread
| synchronization logic could become a bottleneck.
|
| Although I have not deep-dived into the musl malloc codebase to
| locate the contention, replacing it with a cutting-edge malloc
| implementation like mimalloc is sufficient to rectify the problem
| and enhance performance. Mimalloc minimizes contention and
| enhances multi-threaded performance. In fact, this replacement
| can make the application run even faster than the default glibc
| build."
| kragen wrote:
| per-thread malloc pools are probably enough to solve this
| problem and are probably only a few dozen lines of code? i
| haven't written one so i could be wrong, but it sounds like it
| might be an acceptable fix to musl
| mperham wrote:
| > probably only a few dozen lines of code
|
| Classic HN.
| Ericson2314 wrote:
| libc is a bad idea. Deciding that the default allocator and the
| way to talk to the kernel should be provided by the same
| library is a bad design.
|
| Musl should be able to exist without providing _any_ allocator.
| sidkshatriya wrote:
| You can use other allocators with musl. For example in Chimera
| Linux, musl uses the llvm scudo allocator instead of the
| default allocator packaged with musl libc.
| NelsonMinar wrote:
| Is there somewhere I can read more about Debian slim? I see
| there's a bookworm-slim but it's not clear what's removed or how
| much smaller it is. I'm asking for Proxmox; been using Alpine
| containers because they are so small but wouldn't mind other
| options. Debian-12-standard is 126MB (Alpine is 3MB).
___________________________________________________________________
(page generated 2023-12-12 23:01 UTC)