[HN Gopher] Rust std fs slower than Python? No, it's hardware
___________________________________________________________________
Rust std fs slower than Python? No, it's hardware
Author : Pop_-
Score : 555 points
Date : 2023-11-29 09:18 UTC (13 hours ago)
(HTM) web link (xuanwo.io)
(TXT) w3m dump (xuanwo.io)
| royjacobs wrote:
| I was prepared to read the article and scoff at the author's
| misuse of std::fs. However, the article is a delightful
| succession of rabbit holes and mysteries. Well written and very
| interesting!
| bri3d wrote:
| This was such a good article! The debugging was smart (writing
| test programs to peel each layer off), the conclusion was
| fascinating and unexpected, and the writing was clear and easy
| to follow.
| sgift wrote:
| Either the author changed the headline to something less
| clickbaity in the meantime, or you edited it for clickbait, Pop_-
| (in that case: shame on you) - current headline: "Rust std fs
| slower than Python!? No, it's hardware!"
| epage wrote:
| Based on the /r/rust thread, the author seemed to change the
| headline based on feedback to make it less clickbait-y.
| xuanwo wrote:
| Sorry for the clickbaity title, I have changed it based on
| others' advice.
| thechao wrote:
| I disagree that it's clickbait-y. Diving down from Python
| bindings to ucode is ... not how things usually go. Doubly
| so, since Python is a very mature runtime, and I'd be
| inclined to believe they've dug up file-reading Kung Fu not
| available to the Average Joe.
| Pop_- wrote:
| The author has updated the title and also contacted me. But
| unfortunately I'm no longer able to update it.
| Pesthuf wrote:
| Clickbait headline, but the article is great!
| joshfee wrote:
| Surprisingly I think this usage of clickbait is totally
| reasonable because it matches the author's initial
| thoughts/experiences of "what?! this can't be right..."
| saghm wrote:
| I think there might be a range of where people draw the line
| between reasonable headlines and clickbait, because I tend to
| think of clickbait as something where the "answer" to some
| question is intentionally left out to try to bait people into
| clicking. For this article, something I'd consider clickbait
| would be something like "Rust std fs is slower than Python?"
| without the answer after. More commonly, the headline isn't
| phrased directly as a question, but instead of saying something
| like "So-and-so musician loves burritos", it will leave out the
| main detail and say something like "The meal so-and-so eats
| before every concert", which is trying to get you to click and
| have to read through lots of extraneous prose just to find the
| word "burritos".
|
| Having a hook to get people to want to read the article is
| reasonable in my opinion; after all, if you could fit every
| detail in the size of a headline, you wouldn't need an article
| at all! Clickbait inverts this by _only_ having enough
| substance that you could get all the info in the headline, but
| instead it leaves out the one detail that's interesting and
| then pads it with fluff that you're forced to click and read
| through if you want the answer.
| iampims wrote:
| Most interesting article I've read this week. Excellent write-up.
| Pop_- wrote:
| Disclaimer: The title has been changed to "Rust std fs slower
| than Python!? No, it's hardware!" to avoid clickbait. However,
| I'm not able to fix the title on HN.
| 3cats-in-a-coat wrote:
| What's the TLDR on how... hardware performs differently on two
| software runtimes?
| lynndotpy wrote:
| One of the very first things in the article is a TLDR section
| that points you to the conclusion.
|
| > In conclusion, the issue isn't software-related. Python
| outperforms C/Rust due to an AMD CPU bug.
| j16sdiz wrote:
| It _is_ software-related. The CPU just performs badly on
| some _software_ instructions.
| xuanwo wrote:
| FSRM is a CPU feature embedded in the microcode (in this
| instance, amd-ucode) that software such as glibc cannot
| interact with. I refer to it as hardware because I
| consider microcode a part of the hardware.
| pornel wrote:
| AMD's implementation of the `rep movsb` instruction is
| surprisingly slow when addresses are page aligned. Python's
| allocator happens to add a 16-byte offset that avoids the
| hardware quirk/bug.
| sound1 wrote:
| thank you, upvoted!
| sharperguy wrote:
| "Works on contingency? No, money down!"
| pvg wrote:
| You can mail hn@ycombinator.com and they can change it for you
| to whatever.
| quietbritishjim wrote:
| I'm a bit confused about the premise. This is not comparing pure
| Python code against some native (C or Rust) code. It's comparing
| one Python wrapper around native code (Python's file read method)
| against another Python wrapper around some native code (OpenDAL).
| OK it's still interesting that there's a difference in
| performance, but it's very odd to describe it as "slower than
| Python". Did they expect that the Python standard library is all
| written in pure Python? On the contrary, I would expect the
| implementations of functions in Python's standard library to be
| native and, individually, highly optimised.
|
| I'm not surprised the conclusion had something to do with the way
| that native code works. Admittedly I was surprised at the
| specific answer - still a very interesting article despite the
| confusing start.
|
| Edit: The conclusion also took me a couple of attempts to parse.
| There's a heading "C is slower than Python with specified
| offset". To me, as a native English speaker, this reads as "C is
| slower (than Python) with specified offset" i.e. it sounds like
| they took the C code, specified the same offset as Python, and
| then it's still slower than Python. But it's the opposite: once
| the offset from Python was also specified in the C code, the C
| code was then faster. Still very interesting once I got what they
| were saying though.
| xuanwo wrote:
| Thanks for the comments. I have fixed the headers :)
| crabbone wrote:
| > individually, highly optimised.
|
| Now why would you expect _that_?
|
| What happened to OP is pure chance. CPython's C code doesn't
| even care about const-consistency. It's flush with dynamic
| memory allocations, a bunch of helper / convenience calls...
| Even stuff like arithmetic does dynamic memory allocation...
|
| Normally, you don't expect CPython to perform well, not if you
| have any experience working with it. Whenever you want to
| improve performance you want to sidestep all the functionality
| available there.
|
| Also, while Python doesn't have a standard library, since it
| doesn't have a standard... the library that's distributed with
| it is _mostly_ written in Python. Of course, some of it comes
| written in C, but there's also a sizable fraction of that C
| code that's essentially Python code translated mechanically
| into C (a good example of this is Python's binary search
| implementation which was originally written in Python, and
| later translated into C using Python's C API).
|
| What one would expect is that functionality that is simple to
| map to operating system functionality has a relatively thin
| wrapper. I.e. reading files wouldn't require much in terms of
| binding code because, essentially, it goes straight into the
| system interface.
| codr7 wrote:
| Have you ever attempted to write a scripting language that
| performs better?
|
| I have, several, and it's far from trivial.
|
| The basics are seriously optimized for typical use cases,
| take a look at the source code for the dict type.
| svieira wrote:
| Raymond Hettinger's talk _Modern Python Dictionaries: A
| confluence of a dozen great ideas_ is an awesome "history
| of how we got these optimizations" and a walk through why
| they are so effective -
| https://www.youtube.com/watch?v=npw4s1QTmPg
| codr7 wrote:
| Yeah, I had a nice chat with Raymond Hettinger at a Pycon
| in Birmingham/UK back in the day (had no idea who he was
| at the time). He seemed like a dedicated and intelligent
| person, I'm sure we can thank him for some of that.
| crabbone wrote:
| > Have you ever attempted to write a scripting language
| that performs better?
|
| No, because "scripting language" is not a thing.
|
| But, if we are talking about implementing languages, then I
| worked with many language implementations. The most
| comparable one that I know fairly well, inside-and-out
| would be the AVM, i.e. the ActionScript Virtual Machine.
| It's not well-written either unfortunately.
|
| I've looked at implementations of Lua, Emacs Lisp and
| Erlang at different times and to various degrees. I'm also
| somewhat familiar with SBCL and ECL, the implementation
| side. There are different things the authors looked for in
| these implementations. For example, SBCL emphasizes
| performance, where ECL emphasizes simplicity and interop
| with C.
|
| If I had to grade language implementations I've seen,
| Erlang would absolutely take the cake. It's a very
| thoughtful and disciplined program whose authors went to
| great lengths to design and implement it. CPython is on the
| lower end of such programs. It's anarchic and very unevenly
| implemented; you run into comments testifying that the
| author didn't know what they were doing, what their
| predecessor did, or what to do next. Sometimes the code is
| written from that perspective as well: if the author
| somehow drives themselves into a corner and no longer knows
| what the reference count is, they'll just hammer it and
| hope all references are dead (well, maybe).
|
| It's the code style that, unfortunately, I associate with
| proprietary projects where deadlines and cost dictate the
| quality, where concurrency problems are solved with sleeps,
| and if that doesn't work, then the sleep delay is doubled.
| It's not because I specifically hate code being
| proprietary, but because I meet that kind of code in my day
| job more than I meet it in hobby open-source projects.
|
| > take a look at the source code for the dict type.
|
| I wrote a Protobuf parser in C with the intention of
| exposing its bindings to Python. Dictionaries were a
| natural choice for the hash-map Protobuf elements. I
| benchmarked my implementation against C++ (Google's)
| implementation only to discover that std::map wins against
| Python's dictionary by a landslide.
|
| Maybe Python's dict isn't as bad as most of the rest of the
| interpreter, but being the best of the worst still doesn't
| make it good.
| codr7 wrote:
| Except it is, because everyone sort of knows what it
| means: an interpreted language that prioritizes
| convenience over performance;
| Perl/Python/Ruby/Lua/PHP/etc.
|
| SBCL is definitely a different beast.
|
| I would expect Emacs Lisp & Lua to be more similar.
|
| Erlang had plenty more funding and stricter requirements.
|
| C++'s std::map has most likely gotten even more attention
| than Python's dict, but I'm not sure from your comment if
| you're including Python's VM dispatch in that comparison.
|
| What are you trying to prove here?
| wahern wrote:
| > The basics are seriously optimized for typical use cases,
| take a look at the source code for the dict type
|
| Python is well micro-optimized, but the broader
| architecture of the language and especially the CPython
| implementation did not put much concern into performance,
| even for a dynamically typed scripting language. For
| example, in CPython values of built-in types are still
| allocated as regular objects and passed by reference; this
| is atrocious for performance and no amount of micro
| optimization will suffice to completely bridge the
| performance gap for tasks which stress this aspect of
| CPython. By contrast, primitive types in Lua (including PUC
| Lua, the reference, non-JIT implementation) and JavaScript
| are passed around internally as scalar values, and the
| languages were designed with this in mind.
|
| Perl is similar to Python in this regard--the language
| constructs and type systems weren't designed for high
| primitive operation throughput. Rather, performance
| considerations were focused on higher level, functional
| tasks. For example, Perl string objects were designed to
| support fast concatenation and copy-on-write references,
| optimizations which pay huge dividends for the tasks for
| which Perl became popular. Perl can often seem ridiculously
| fast for naive string munging compared to even compiled
| languages, yet few people care to defend Perl as a
| performant language per se.
| qd011 wrote:
| I don't understand why Python gets shit for being a slow
| language when it's slow but no credit for being fast when it's
| fast just because "it's not really Python".
|
| If I write Python and my code is fast, to me that sounds like
| Python is fast, I couldn't care less whether it's because the
| implementation is in another language or for some other reason.
| paulddraper wrote:
| Yeah, it's weird.
| afdbcreid wrote:
| Usually, yes, but when it's a bug in the hardware, it's not
| really that Python is fast, more that the CPython developers
| were lucky enough not to hit the bug.
| munch117 wrote:
| How do you know that it's luck?
| cozzyd wrote:
| Because the offset is entirely due to space for the
| PyObject header.
| munch117 wrote:
| The PyObject header is a target for optimisation.
| Performance regressions are likely to be noticed, and if
| a different header layout is faster, then it's entirely
| possible that it will be used for purely empirical
| reasons. Trying different options and picking the best
| performing one is not luck, even if you can't explain why
| it's the best performing.
| cozzyd wrote:
| I suspect any size other than 0 would lead to this.
|
| But the Zen3/4 were developed far, far after the PyObject
| header...
| adgjlsfhk1 wrote:
| because the offset here is a result of Python's reference
| counting, which predates Zen 3 by ~20 years
| benrutter wrote:
| I wonder if it's because we're sometimes talking at cross
| purposes.
|
| For me, coding is almost exclusively using Python libraries
| like numpy to call out to other languages like C or Fortran.
| It feels silly to me to say I'm not coding in Python.
|
| On the other hand, if you're writing those libraries, coding
| to you is mostly writing FORTRAN and c optimizations. It
| probably feels silly to say you're coding in Python just
| because that's where your code is called from.
| kbenson wrote:
| Because for any nontrivial case you would expect
| Python+compiled library, with the associated marshaling of
| data, to be slower than that library in its native
| implementation without any interop/marshaling required.
|
| When you see an interpreted language faster than a compiled
| one, it's worth looking at why, because _most_ of the time it's
| because there's some hidden issue causing the other to be
| slow (which could just be a different and much worse
| implementation).
|
| Put another way, you can do a lot to make a Honda Civic very
| fast, but when you hear one goes up against a Ferrari and
| wins your first thoughts should be about what the test was,
| how the Civic was modified, and if the Ferrari had problems
| or the test wasn't to its strengths at all. If you just think
| "yeah, I love Civics, that's awesome" then you're not
| thinking critically enough about it.
| Attummm wrote:
| In this case, Python's code (opening and loading the
| content of a file) operates almost fully within its C
| runtime.
|
| The C components initiate the system call and manage the
| file pointer, which loads the data from the disk into a
| pyobj string.
|
| Therefore, it isn't so much Python itself that is being
| tested, but rather Python's underlying C runtime.
| kbenson wrote:
| Yep, and the next logical question when both
| implementations are for the most part bare metal
| (compiled and low-level), is why is there a large
| difference? Is it a matter of implementation/algorithm,
| inefficiency, or a bug somewhere? In this case, that
| search turned up a hardware issue that should be
| addressed, which is why it's so useful to examine these
| things.
| rafaelmn wrote:
| But you will care if that "python" breaks - you get to drop
| down to C/C++ and debug native code. Likewise for adding
| features or understanding the implementation. Not to mention
| having to deal with native build tooling and platform
| specific stuff.
|
| It's completely fair to say that's not python because it
| isn't - any language out there can FFI to C and it has the
| same problems mentioned above.
| IshKebab wrote:
| Because when people talk about Python performance they're
| talking about the performance of Python code itself, not
| C/Rust code that it's wrapping.
|
| Pretty much any language can wrap C/Rust code.
|
| Why does it matter?
|
| 1. Having to split your code across 2 languages via FFI is a
| huge pain.
|
| 2. You are still writing _some_ Python. There's plenty of
| code that is pure Python. That code is slow.
| munch117 wrote:
| Of course in this case there's no FFI involved - the _open_
| function is built-in. It's as pure-Python as it can get.
| IshKebab wrote:
| Not sure I agree there, but anyway in this case the
| performance had nothing to do with Python being a slow or
| fast language.
| insanitybit wrote:
| >I don't understand why Python gets shit for being a slow
| language when it's slow but no credit for being fast when
| it's fast just because "it's not really Python".
|
| What's there to understand? When it's fast it's not really
| Python, it's C. C is fast. Python can call out to C. You
| don't have to care that the implementation is in another
| language, but it is.
| fl0ki wrote:
| The premise is that any time you say "Python [...] faster than
| Rust [...]" you get page views even if it's not true. People
| have noticed after the last few dozen times something like this
| was posted.
| lambda wrote:
| I'm a bit confused by why you are confused.
|
| It's surprising that something as simple as reading a file is
| slower in the Rust standard library than in the Python standard
| library. Even knowing that a Python standard library call like
| this is written in C, you'd still expect the Rust standard
| library call to be of a similar speed; so you'd expect either
| that you're using it wrong, or that the Rust standard library
| has some weird behavior.
|
| In this case, it turns out that neither was the case; there's
| just a weird hardware performance cliff based on the exact
| alignment of an allocation on particular hardware.
|
| So, yeah, I'd expect a filesystem read to be pretty well
| optimized in Python, but I'd expect the same in Rust, so it's
| surprising that the latter was so much slower, and especially
| surprising that it turned out to be hardware and allocator
| dependent.
| drtgh wrote:
| >Rust std fs slower than Python!? No, it's hardware!
|
| >...
|
| >Python features three memory domains, each representing
| different allocation strategies and optimized for various
| purposes.
|
| >...
|
| >Rust is slower than Python only on my machine.
|
| If one library performs wildly better than the other in the same
| test, on the same hardware, how can that not be a software-
| related problem? Sounds like a contradiction.
|
| Maybe it should be considered a coding issue and/or a missing
| feature? IMHO it would be expected that Rust's std library
| performs well without making all its users circumvent the issue
| manually.
|
| The article is well investigated, so I assume the author just
| wants to show that the problem exists without creating
| controversy; otherwise I cannot understand it.
| Pop_- wrote:
| The root cause is AMD's bad support for rep movsb (which is a
| hardware problem). However, Python by default reads into a
| buffer with a small offset, while lower-level languages (Rust
| and C) do not, which is why Python seems to perform better
| than C/Rust. It "accidentally" avoided the hardware problem.
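|
| For illustration, a minimal Rust sketch of the same trick (the
| file path and size are assumptions, and 0x20 mimics the offset
| CPython's object header happens to introduce):
|
|     use std::fs::File;
|     use std::io::Read;
|
|     fn main() -> std::io::Result<()> {
|         const LEN: usize = 64 << 20; // 64 MiB, arbitrary
|         // Over-allocate so we can choose where the destination
|         // buffer starts relative to a page boundary.
|         let mut backing = vec![0u8; LEN + 4096 + 0x20];
|         // Distance to the next page boundary, plus 0x20 so the
|         // destination is *not* page aligned.
|         let skip = backing.as_ptr().align_offset(4096) + 0x20;
|         let buf = &mut backing[skip..skip + LEN];
|         let mut f = File::open("/tmp/file")?; // hypothetical file
|         // The kernel now copies into an offset destination,
|         // sidestepping the rep movsb slow path on affected CPUs.
|         f.read_exact(buf)?;
|         Ok(())
|     }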
| CoastalCoder wrote:
| I'm not sure it makes sense to pin this only on AMD.
|
| Whenever you're writing performance-critical software, you
| need to consider the relevant combinations of hardware +
| software + workload + configuration.
|
| Sometimes a problem can be created or fixed by adjusting any
| one / some subset of those details.
| hobofan wrote:
| If that's a bug that only happens with AMD CPUs, I think
| that's totally fair.
|
| If we start adding in exceptions at the top of the software
| stack for individuals failures of specific CPUs/vendors,
| that seems like a strong regression from where we are today
| in terms of ergonomics of writing performance-critical
| software. We can't be writing individual code for each N x
| M x O x P combination of hardware + software + workload +
| configuration (even if you can narrow down the "relevant"
| ones).
| jpc0 wrote:
| > We can't be writing individual code for each N x M x O
| x P combination of hardware + software + workload +
| configuration
|
| That is kind of exactly what you would do when optimising
| for popular platforms.
|
| If this error occurs on an AMD CPU used by half your
| users, is your response to your users going to be "just
| buy a different CPU", or are you going to fix it in code
| and ship a "performance improvement on XYZ platform"
| update?
| jacoblambda wrote:
| Nobody said "just buy a different CPU" anywhere in this
| discussion or the article. And they are pinning the root
| cause on AMD which is completely fair because they are
| the source of the issue.
|
| Given that the fix is within the memory allocator, there
| is already a relatively trivial fix for users who really
| need it (recompile with jemalloc as the global memory
| allocator).
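|
| A minimal sketch of that mitigation (using the
| tikv-jemallocator crate, one common packaging of jemalloc
| for Rust):
|
|     use tikv_jemallocator::Jemalloc;
|
|     #[global_allocator]
|     static GLOBAL: Jemalloc = Jemalloc;
|
|     fn main() {
|         // Heap allocations, including read buffers, now come
|         // from jemalloc, whose layouts happen to dodge the
|         // page-aligned rep movsb slow path.
|         let data = std::fs::read("/tmp/file").expect("read failed");
|         println!("{} bytes", data.len());
|     }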
|
| For everyone else, it's probably better to wait until AMD
| reports back with an analysis from their side and either
| recommends an "official" mitigation or pushes out a
| microcode update.
| ansible wrote:
| The fix is that AMD needs to develop, test and deploy a
| microcode update for their affected CPUs, and then the
| problem is truly fixed for everyone, not just the people
| who have detected the issue and tried to mitigate it.
| richardwhiuk wrote:
| You are going to be disappointed when you find out
| there's lots of architecture and CPU specific code in
| software libraries and the kernel.
| pmontra wrote:
| Well, if Excel ran at half the speed (or at half the
| speed of LibreOffice Calc!) on half of the machines out
| there, somebody at Redmond would notice, find the
| hardware bug, and work around it.
|
| I guess that in most big companies it suffices that there
| is a problem with their own software running on the
| laptop of a C* manager or of somebody close to one. When
| I was working for a mobile operator, the antennas the
| network division cared about most were the ones close to
| the home of the CEO. If he could make his test calls with
| no problems, they had time to fix the problems in the
| rest of the network across the country.
| Pop_- wrote:
| It's a known issue for AMD and has been confirmed by multiple
| people, and by the data provided by the author. It's fair
| to pin this problem on AMD.
| formerly_proven wrote:
| That extra 0x20 (32-byte) offset is the size of the PyBytes
| object header for anyone wondering; 64 bits each for type
| object pointer, reference count, base pointer and item count.
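|
| A sketch of that 0x20-byte layout as a Rust struct (field
| names are illustrative, mirroring the description above
| rather than CPython's exact source):
|
|     #[repr(C)]
|     struct PyBytesHeader {
|         ob_type: *const u8, // type object pointer
|         ob_refcnt: isize,   // reference count
|         ob_base: *const u8, // base pointer
|         ob_size: isize,     // item count
|     }
|     // size_of::<PyBytesHeader>() == 0x20 on 64-bit targets,
|     // so read destinations start 32 bytes past page alignment.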
| mrweasel wrote:
| Thank you, because I was wondering if some Python developer
| found the same issue and decided to just implement the
| offset. It makes much more sense that it just happens to
| work out that way in Python.
| meneer_oke wrote:
| It doesn't just seem faster; "seem" would imply that it isn't
| the case. It currently is faster on that setup.
|
| But since the Python runtime is written in C, the issue can't
| be Python vs C.
| TylerE wrote:
| C is a very wide target. There are plenty of things that
| one can do "in C" that no human would ever write. For
| instance, the C code generated by languages like nim and
| zig that essentially use C as a sort of IR.
| meneer_oke wrote:
| That is true. With C, a lot is possible.
|
| > However, python by default has a small offset when
| reading memories while lower level language (rust and c)
|
| Yet if the runtime is made with C, then that statement is
| incorrect.
| bilkow wrote:
| By that line of thought, you could also argue that the
| slow path of the C and Rust versions is actually
| implemented in C, as memcpy is in glibc. Hence, Python
| being faster than Rust would also mean in this case that
| Python is faster than C.
|
| The point is not that one language is faster than
| another. The point is that the default way to implement
| something in a language ended up being surprisingly
| faster when compared to other languages in this specific
| scenario due to a performance issue in the hardware.
|
| In other words: on this specific hardware, the default
| way to do this in Python is faster than the default way
| to do this in C and Rust. That can be true, as Python
| does not use C in the default way, it adds an offset! You
| can change your implementation in any of those languages
| to make it faster, in this case by just adding an offset,
| so it doesn't mean that "Python is faster than C or Rust
| in general".
| topaz0 wrote:
| It's obviously not python vs c -- the time difference turns
| out to be in kernel code (system call) and not user code at
| all, and the post explicitly constructs a c program that
| doesn't have the slowdown by adding a memory offset. It
| just turns up by default in a comparison of python vs c
| code because python reads have a memory offset by default
| (for completely unrelated reasons) and analogous c reads
| don't by default. In principle you could also construct
| python code that does see this slowdown, it would just be
| much less likely to show up at random. So the python vs c
| comp is a total red herring here, it just happened to be
| what the author noticed and used as a hook to understand
| the problem.
| magicalhippo wrote:
| I recall when Pentium was introduced we were told to avoid
| rep and write a carefully tuned loop ourselves. To go really
| fast one could use the FPU to do the loads and stores.
|
| Not too long ago I read in Intel's optimization guidelines
| that rep was now faster again and should be used.
|
| Seems most of these things need to be benchmarked on the
| CPU, as they change "all the time". I've sped up plenty of
| code by just replacing hand crafted assembly with high-level
| functional equivalent code.
|
| Of course so-slow-it's-bad is different; however, a runtime-
| determined implementation choice would avoid that as well.
| mwcampbell wrote:
| Years ago, Rust's standard library used jemalloc. That decision
| substantially increased the minimum executable size, though. I
| didn't publicly complain about it back then (as far as I can
| recall), but perhaps others did. So the Rust library team
| switched to using the OS's allocator by default.
|
| Maybe using an alternative allocator only solves the problem by
| accident and there's another way to solve it intentionally; I
| don't yet fully understand the problem. My point is that using
| a different allocator by default was already tried.
| saghm wrote:
| > I didn't publicly complain about it back then (as far as I
| can recall), but perhaps others did. So the Rust library team
| switched to using the OS's allocator by default.
|
| I've honestly never worked in a domain where binary size ever
| really mattered beyond maybe invoking `strip` on a binary
| before deploying it, so I try to keep an open mind. That
| said, this has always been a topic of discussion around
| Rust[0], and while I obviously don't have anything against
| binary sizes being smaller, bugs like this do make me wonder
| about huge changes like switching the default allocator where
| we can't really test all of the potential side effects; next
| time, the unintended consequences might not be worth the
| tradeoff.
|
| [0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=fals
| e&qu...
| exxos wrote:
| It's the hardware. Of course Rust remains the fastest and safest
| language and you must rewrite your applications in Rust.
| dang wrote:
| You've been posting like this so frequently as to cross into
| abusing the forum, so I've banned the account.
|
| If you don't want to be banned, you're welcome to email
| hn@ycombinator.com and give us reason to believe that you'll
| follow the rules in the future. They're here:
| https://news.ycombinator.com/newsguidelines.html.
| Aissen wrote:
| Associated glibc bug (Zen 4 though):
| https://sourceware.org/bugzilla/show_bug.cgi?id=30994
| Arnavion wrote:
| The bug is also about Zen 3, and even mentions the 5900X (the
| article author's CPU).
| nabakin wrote:
| If you read the bug tracker, a comment mentions this affects
| Zen 3 and Zen 4
| fweimer wrote:
| And AMD is investigating: https://inbox.sourceware.org/libc-
| alpha/20231115190559.29112...
| explodingwaffle wrote:
| Anyone else feeling the frequency illusion with rep movsb?
|
| (https://lock.cmpxchg8b.com/reptar.html)
| a1o wrote:
| > Rust developers might consider switching to jemallocator for
| improved performance
|
| I am curious if this is something that everyone can do to get
| free performance or if there are caveats. Can C codebases benefit
| from this too? Is this performance that is simply left on the
| table currently?
| nicoburns wrote:
| I think it's pretty much free performance that's being left on
| the table. There's a slight cost to binary size. And it may not
| perform better in absolutely all circumstances (but it will in
| almost all).
|
| Rust used to use jemalloc by default but switched as people
| found this surprising as the default.
| Pop_- wrote:
| Switching to a non-default allocator does not always bring a
| performance boost. It really depends on your workload, which
| requires profiling and benchmarking. But C/C++/Rust and other
| lower-level languages should all at least be able to choose
| from these allocators. One caveat is binary size: a custom
| allocator does add more bytes to the executable.
| vlovich123 wrote:
| I don't know why people still look to jemalloc. Mimalloc
| outperforms the standard allocator on nearly every single
| benchmark. Glibc's allocator & jemalloc both are long in the
| tooth & don't actually perform as well as state of the art
| allocators. I wish Rust would switch to mimalloc or the
| latest tcmalloc (not the one in gperftools).
| masklinn wrote:
| > I wish Rust would switch to mimalloc or the latest
| tcmalloc (not the one in gperftools).
|
| That's nonsensical. Rust uses the system allocators for
| reliability, compatibility, binary bloat, maintenance
| burden, ..., not because they're _good_ (they were not when
| Rust switched away from jemalloc, and they aren't now).
|
| If you want to use mimalloc in your rust programs, you can
| just set it as global allocator same as jemalloc, that
| takes all of three lines:
| https://github.com/purpleprotocol/mimalloc_rust#usage
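|
| For reference, those three lines (per the linked README) are:
|
|     use mimalloc::MiMalloc;
|
|     #[global_allocator]
|     static GLOBAL: MiMalloc = MiMalloc;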
|
| If you want the rust compiler to link against mimalloc
| rather than jemalloc, feel free to test it out and open an
| issue, but maybe take a gander at the previous attempt:
| https://github.com/rust-lang/rust/pull/103944 which died
| for the exact same reason as the one before that
| (https://github.com/rust-lang/rust/pull/92249) did:
| unacceptable regression of max-rss.
| vlovich123 wrote:
| I know it's easy to change but the arguments for using
| glibc's allocator are less clear to me:
|
| 1. Reliability - how is an alternate allocator less
| reliable? Seems like a FUD-based argument. Unless by
| reliability you mean performance in which case yes -
| jemalloc isn't reliably faster than standard allocators,
| but mimalloc is.
|
| 2. Compatibility - again sounds like a FUD argument. How
| is compatibility reduced by swapping out the allocator?
| You don't even have to do it on all systems if you want.
| Glibc is just unequivocally bad.
|
| 3. Binary bloat - This one is maybe an OK argument
| although I don't know what size difference we're talking
| about for mimalloc. Also, most people aren't writing
| hello world applications so the default should probably
| be for a good allocator. I'd also note that having a
| dependency of the std runtime on glibc in the first place
| likely bloats your binary more than the specific
| allocator selected.
|
| 4. Maintenance burden - I don't really buy this argument.
| In both cases you're relying on a 3rd party to maintain
| the code.
| masklinn wrote:
| > I know it's easy to change but the arguments for using
| glibc's allocator are less clear to me:
|
| You can find them at the original motivation for removing
| jemalloc, 7 years ago: https://github.com/rust-
| lang/rust/issues/36963
|
| Also it's not "glibc's allocator", it's the system
| allocator. If you're unhappy with glibc's, get that
| replaced.
|
| > 1. Reliability - how is an alternate allocator less
| reliable?
|
| Jemalloc had to be disabled on various platforms and
| architectures, there is no reason to think mimalloc or
| tcmalloc are any different.
|
| The system allocator, while shit, is always there and
| functional, the project does not have to curate its
| availability across platforms.
|
| > 2. Compatibility - again sounds like a FUD argument.
| How is compatibility reduced by swapping out the
| allocator?
|
| It makes interactions with anything which _does_ use the
| system allocator worse, and almost certainly fails to
| interact correctly with some of the more specialised
| system facilities (e.g. malloc.conf) or tooling (in rust,
| jemalloc as shipped did not work with valgrind).
|
| > Also, most people aren't writing hello world
| applications
|
| Most people aren't writing applications bound on
| allocation throughput either
|
| > so the default should probably be for a good allocator.
|
| Probably not, no.
|
| > I'd also note that having a dependency of the std
| runtime on glibc in the first place likely bloats your
| binary more than the specific allocator selected.
|
| That makes no sense whatsoever. The libc is the system's
| and dynamically linked. And changing allocator does not
| magically unlink it.
|
| > 4. Maintenance burden - I don't really buy this
| argument.
|
| It doesn't matter that you don't buy it. Having to ship,
| resync, debug, and curate (cf (1)) an allocator is a
| maintenance burden. With a system allocator, all the
| project does is ensure it calls the system allocators
| correctly, the rest is out of its purview.
| vlovich123 wrote:
| The reason the reliability & compatibility arguments
| don't make sense to me is that jemalloc is still in use
| for rustc (again - not sure why they haven't switched to
| mimalloc) which has all the same platform requirements as
| the standard library. There's also no reason an alternate
| allocator can't be used on Linux specifically because
| glibc's allocator is just bad full stop.
|
| > It makes interactions with anything which does use the
| system allocator worse
|
| That's a really niche argument. Most people are not doing
| any of that and malloc.conf is only for people who are
| tuning the glibc allocator which is a silly thing to do
| when mimalloc will outperform whatever tuning you do (yes
| - glibc really is that bad).
|
| > or tooling (in rust, jemalloc as shipped did not work
| with valgrind)
|
| That's a fair argument, but it's not an unsolvable one.
|
| > Most people aren't writing applications bound on
| allocation throughput either
|
| You'd be surprised at how big an impact the allocator can
| make even when you don't think you're bound on
| allocations. There's also all sorts of other things
| beyond allocation throughput & glibc sucks at all of them
| (e.g. freeing memory, behavior in multithreaded programs,
| fragmentation etc etc).
|
| > The libc is the system's and dynamically linked. And
| changing allocator does not magically unlink it
|
| I meant that the dependency on libc at all in the
| standard library bloats the size of a statically linked
| executable.
| josephg wrote:
| > jemalloc is still in use for rustc (again - not sure
| why they haven't switched to mimalloc)
|
| Performance of rustc matters a lot! If the rust compiler
| runs faster when using mimalloc, please benchmark &
| submit a patch to the compiler.
| vlovich123 wrote:
| Any links to instructions on how to run said benchmarks?
| masklinn wrote:
| I literally linked two attempts to use mimalloc in rustc
| just a few comments upthread.
| charcircuit wrote:
| I've never not gotten increased performance by swapping out
| the allocator.
| nh2 wrote:
| Be aware `jemalloc` will make you suffer the observability
| issues of `MADV_FREE`. `htop` will no longer show the truth
| about how much memory is in use.
|
| *
| https://github.com/jemalloc/jemalloc/issues/387#issuecomment...
|
| * https://gitlab.haskell.org/ghc/ghc/-/issues/17411
|
| Apparently now `jemalloc` will call `MADV_DONTNEED` 10 seconds
| after `MADV_FREE`:
| https://github.com/JuliaLang/julia/issues/51086#issuecomment...
|
| So while this "fixes" the issue, it'll introduce a confusing
| time delay between you freeing the memory and you observing
| that in `htop`.
|
| But according to https://jemalloc.net/jemalloc.3.html you can
| set `opt.muzzy_decay_ms = 0` to remove the delay.
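|
| (That tunable can also be set from the environment, e.g.
| `MALLOC_CONF=muzzy_decay_ms:0 ./your-app`, assuming your
| binary links jemalloc.)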
|
| Still, the musl author has some reservations against making
| `jemalloc` the default:
|
| https://www.openwall.com/lists/musl/2018/04/23/2
|
| > It's got serious bloat problems, problems with undermining
| ASLR, and is optimized pretty much only for being as fast as
| possible without caring how much memory you use.
|
| With the above-mentioned tunables, this should be mitigated to
| some extent, but the general "theme" (focusing on e.g.
| performance vs memory usage) will likely still mean "it's a
| tradeoff" or "it's no tradeoff, but only if you set tunables to
| what you need".
| a1o wrote:
| Thank you! That was very thorough! I will be reading the
| links. :)
| singron wrote:
| Note that glibc has a similar problem in multithreaded
| contexts. It strands unused memory in thread-local pools,
| which grows your memory usage over time like a memory leak.
| We got lower memory usage that didn't grow over time by
| switching to jemalloc.
|
| Example of this:
| https://github.com/prestodb/presto/issues/8993
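|
| (For what it's worth, a common glibc-side mitigation is to
| cap the number of arenas, e.g. by setting MALLOC_ARENA_MAX=2
| in the environment.)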
| masklinn wrote:
| The musl remark is funny, because jemalloc's use of pretty
| fine-grained arenas sometimes leads to better memory
| utilisation through reduced fragmentation. For instance
| Aerospike couldn't fit in available memory under (admittedly
| old) glibc, and jemalloc fixed the issue:
| http://highscalability.com/blog/2015/3/17/in-memory-
| computin...
|
| And this is not a one-off: https://hackernoon.com/reducing-
| rails-memory-use-on-amazon-l...
| https://engineering.linkedin.com/blog/2021/taming-memory-
| fra...
|
| jemalloc also has extensive observability / debugging
| capabilities, which can provide a useful global view of the
| system, it's been used to debug memleaks in JNI-bridge code:
| https://www.evanjones.ca/java-native-leak-bug.html
| https://technology.blog.gov.uk/2015/12/11/using-jemalloc-
| to-...
| dralley wrote:
| glibc isn't totally free of such issues
| https://www.algolia.com/blog/engineering/when-allocators-
| are...
| the8472 wrote:
| Aiming to please people who panic about their RSS numbers
| seems... misguided? It seems like worrying about RAM being
| "used" as file cache[0].
|
| If you want to gauge whether your system is memory-limited
| look at the PSI metrics instead.
|
| [0] https://www.linuxatemyram.com/
| TillE wrote:
| jemalloc and mimalloc are very popular in C and C++ software,
| yes. There are few drawbacks, and it's really easy to benchmark
| different allocators against each other in your particular use
| case.
| kragen wrote:
| basically that's why jason wrote it in the first place, but
| other allocators have caught up since then to some extent. so
| jemalloc might make your c either slower or faster, you'll have
| to test to know. it's pretty reliable at being close to the
| best choice
|
| does tend to use more ram tho
| secondcoming wrote:
| You can override the allocator for any app via LD_PRELOAD
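|
| For example (the library path is an assumption; it varies by
| distro): `LD_PRELOAD=/usr/lib/libjemalloc.so.2 ./your-app`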
| fsniper wrote:
| The article itself is a great read and it has fascinating info
| related to this issue.
|
| However, I am more interested/concerned about another part: how
| the issue is reported/recorded and how the communications are
| handled.
|
| Reporting is done over Discord, which is a proprietary
| environment that is not indexed or searchable, and will not be
| archived.
|
| Communications and deliberations are done over Discord and
| Telegram, which is probably worse than Discord in this context.
|
| This blog post and the GitHub repository are the lingering
| remains of them. If Xuanwo had not blogged this, it would be
| lost to the timeline.
|
| Isn't this fascinating?
| amluto wrote:
| I sent this to the right people.
| londons_explore wrote:
| So the obvious thing to do... Send a patch to change the
| "copy_user_generic" kernel method to use a different memory
| copying implementation when the CPU is detected to be a bad one
| and the memory alignment is one that triggers the slowness bug...
| p3n1s wrote:
| Not obvious. Seems like if it can be corrected with microcode
| just have people use updated microcode rather than litter the
| kernel with fixes that are effectively patchable software
| problems.
|
| The accepted fix would not be trivial for anyone not already
| experienced with the kernel. More importantly, it isn't
| obvious what the right way to enable the workaround is. The
| best way is probably to measure at boot time; otherwise, how
| do you know which models and steppings are affected?
| londons_explore wrote:
| I don't think AMD does microcode updates for performance
| issues, do they? I thought it was strictly correctness or
| security issues.
|
| If the vendor won't patch it, then a workaround is the next
| best thing. There shouldn't be many - that's why all copying
| code is in just a handful of functions.
| p3n1s wrote:
| A significant performance degradation due to normal use of
| the instruction (FSRM) not otherwise documented is a
| correctness problem. Especially considering that the
| workaround is to avoid using the CPU feature in many cases.
| People pay for this CPU feature; now they need kernel
| tooling to warn them when they fall back to some slower
| workaround because of an alignment issue way up the stack.
| prirun wrote:
| If AMD has a performance issue and doesn't fix it, AMD
| should pay the negative publicity costs rather than kernel
| and library authors adding exceptions. IMHO.
| pmontra wrote:
| > However, mmap has other uses too. It's commonly used to
| allocate large regions of memory for applications.
|
| Slack is allocating 1132 GB of virtual memory on my laptop right
| now. I don't know if they are using mmap but that's 1100 GB more
| than the physical memory.
| Waterluvian wrote:
| I'm not sure allocations mean anything practical anymore. I
| recall OSX allocating ridiculous amounts of virtual memory to
| stuff but never found OSX or the software to ever feel slow and
| pagey.
| dietrichepp wrote:
| The way I describe mmap these days is to say it allocates
| address space. This can sometimes be a clearer way of
| describing it, since the physical memory will only get
| allocated once you use the memory (maybe never).
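|
| A minimal sketch of that idea on Linux, using the libc crate
| (the 1 TiB figure is arbitrary):
|
|     fn main() {
|         let len = 1usize << 40; // 1 TiB of address space
|         let p = unsafe {
|             libc::mmap(
|                 std::ptr::null_mut(),
|                 len,
|                 libc::PROT_NONE, // no access yet, nothing committed
|                 libc::MAP_PRIVATE | libc::MAP_ANONYMOUS
|                     | libc::MAP_NORESERVE,
|                 -1,
|                 0,
|             )
|         };
|         assert_ne!(p, libc::MAP_FAILED);
|         // Physical pages are committed later, on demand, e.g.
|         // with mprotect(PROT_READ | PROT_WRITE) on used ranges.
|     }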
| byteknight wrote:
| But is it not still limited to RAM + page/swap size?
| wbkang wrote:
| I don't think so, but it's difficult to find an actual
| reference. For sure it does overcommit like crazy. Here's
| an output from my mac:
|
| % ps aux | sort -k5 -rh | head -1
|
| xxxxxxxx 88273 1.2 0.9 1597482768 316064 ?? S 4:07PM
| 35:09.71
| /Applications/Slack.app/Contents/Frameworks/Slack Helper
| (Renderer).app/...
|
| Since ps displays the vsz column in KiB, 1597482768
| corresponds to 1 TB+.
| aseipp wrote:
| Maybe I'm misunderstanding you but: no, you can allocate
| terabytes of address space on modern 64-bit Linux on a
| machine with only 8GB of RAM with overcommit. Try it; you
| can allocate 2^46 bytes of space (~= 70 TB) today, with
| no problem. There is no limit to the allocation space in
| an overcommit system; there is only a limit to the actual
| working set, which is very different.
| j16sdiz wrote:
| You can do it without overcommit -- you can just back the
| mmap with a file
| Pop_- wrote:
| I don't know why but this really makes me laugh
| aseipp wrote:
| That is Chromium doing it, and yes, it is using mmap to create
| a very large, (almost certainly) contiguous range of memory.
| Many runtimes do this, because it's useful (on 64-bit systems)
| to create a ridiculously large virtually mapped address space
| and then only commit small parts of it over time as needed,
| because it makes memory allocation simpler in several ways;
| notably it means you don't have to worry about allocating new
| address spaces when simply allocating memory, and it means
| answering things like "Is this a heap object?" is easier.
| rasz wrote:
| dolphin emulator has recent example of this: https://dolphin-
| emu.org/blog/2023/11/25/dolphin-progress-rep...
|
| seems its not without perils on Windows:
|
| "In an ideal world, that would be all we have to say about
| the new solution. But for Windows users, there's a special
| quirk. On most operating systems, we can use a special flag
| to signal that we don't really care if the system has 32 GiB
| of real memory. Unfortunately, Windows has no convenient way
| to do this. Dolphin still works fine on Windows computers
| that have less than 32 GiB of RAM, but if Windows is set to
| automatically manage the size of the page file, which is the
| case by default, starting any game in Dolphin will cause the
| page file to balloon in size. Dolphin isn't actually writing
| to all this newly allocated space in the page file, so there
| are no concerns about performance or disk lifetime. Also,
| Windows won't try to grow the page file beyond the amount of
| available disk space, and the page file shrinks back to its
| previous size when you close Dolphin, so for the most part
| there are no real consequences... "
| comonoid wrote:
| jemalloc was Rust's default allocator till 2018.
|
| https://internals.rust-lang.org/t/jemalloc-was-just-removed-...
| titaniumtown wrote:
| Extremely well written article! Very surprising outcome.
| diamondlovesyou wrote:
| AMD's string store is not like Intel's. Generally, you don't want
| to use it until you are past the CPU's L2 size (L3 is a victim
| cache), making ~2k WAY too small. Once past that point, it's
| profitable to use string store, and should run at "DRAM speed".
| But it has a high startup cost, hence 256bit vector loads/stores
| should be used until that threshold is met.
| rasz wrote:
| Or you leave it as is, forcing AMD to fix their shit. "Fast
| string mode" has been strongly hinted as _the_ optimal way over
| 30 years ago with the Pentium Pro, further reinforced over 10
| years ago with ERMSB and with FSRM 4 years ago. AMD, get with
| the program.
| js2 wrote:
| Isn't the high startup cost what FSRM is intended to solve?
|
| > With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally
| added to AMD's CPU functions analog to Intel's
| X86_FEATURE_FSRM. Intel had already introduced this in 2017
| with the Ice Lake Client microarchitecture. But now AMD is
| obviously using this feature to increase the performance of REP
| MOVSB for short and very short operations. This improvement
| applies to Intel for string lengths between 1 and 128 bytes and
| one can assume that AMD's implementation will look the same for
| compatibility reasons.
|
| https://www.igorslab.de/en/cracks-on-the-core-3-yet-the-5-gh...
| diamondlovesyou wrote:
| Fast is relative here. These are microcoded instructions,
| which are generally terrible for latency: microcoded
| instructions don't get branch prediction benefits, nor OoO
| benefits (they lock the FE/scheduler while running). Small
| memcpy/moves are always latency bound, hence even if the HW
| supports "fast" rep store, you're better off not using them.
| L2 is wicked fast, and these copies are linear, so prediction
| will be good.
|
| Note that for rep store to be better it must overcome the
| cost of the initial latency and then catch up to the 32byte
| vector copies, which yes generally have not-as-good-perf vs
| DRAM speed, but they aren't that bad either. Thus for small
| copies.... just don't use string store.
|
| All this is not even considering non-temporal loads/stores;
| many larger copies would see better perf by not trashing the
| L2 cache, since the destination or source is often not
| inspected right after. String stores don't have a non-
| temporal option, so this has to be done with vectors.
| js2 wrote:
| I'm not sure that your comment is responsive to the
| original post.
|
| FSRM is fast on Intel, even with single byte strings. AMD
| claims to support FSRM with recent CPUs but performs poorly
| on small strings, so code which Just Works on Intel has a
| performance regression when running on AMD.
|
| Now here you're saying `REP MOVSB` shouldn't be used on AMD
| with small strings. In that case, AMD CPUs shouldn't
| advertise FSRM. As long as they're advertising it, it
| shouldn't perform worse than the alternative.
|
| https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
|
| https://sourceware.org/bugzilla/show_bug.cgi?id=30994
|
| I'm not a CPU expert so perhaps I'm misinterpreting you and
| we're talking past each other. If so, please clarify.
| forrestthewoods wrote:
| Delightful article. Thank you, author, for sharing! I felt like
| I experienced every shocking twist and surprise in your journey,
| as if I was right there with you all along.
| darkwater wrote:
| Totally unrelated but: this post talks about the bug being first
| discovered in OpenDAL [1], which seems to be an Apache
| (Incubator) project to add an abstraction layer for storage over
| several types of storage backend. What's the point/use case of
| such an abstraction? Anybody using it?
|
| [1] https://opendal.apache.org/
| the8472 wrote:
| There are two dedicated CPU feature flags to indicate that REP
| STOS/MOV are fast and usable as a short instruction sequence for
| memset/memcpy. Having to hand-roll optimized routines for each
| new CPU generation has been an ongoing pain for decades.
|
| And yet here we are again. Shouldn't this be part of some timing
| testsuite of CPU vendors by now?
| giancarlostoro wrote:
| So correct me if I am wrong, but does this mean you need to
| compile two executables for a specific compile-time build? Or
| do you just need to compile it on specific hardware? Wondering
| what the fix would be; some sort of runtime check?
| immibis wrote:
| glibc has the ability to dynamically link a different version
| of a function based on the CPU.
| dralley wrote:
| Glibc supports runtime selection of different optimized
| paths, yes. There was a recent discussion about a security
| vulnerability in that feature (discussion
| https://news.ycombinator.com/item?id=37756357), but in
| essence this is exactly the kind of thing it's useful for.
| fweimer wrote:
| The exact nature of the fix is unclear at present.
|
| During dynamic linking, glibc picks a memcpy implementation
| which seems most appropriate for the current machine. We have
| about 13 different implementations just for x86-64. We could
| add another one for current(ish) AMD CPUs, select a different
| existing implementation for them, or change the default for a
| configurable cutover point in a parameterized implementation.
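|
| (As a stopgap, the cutover is already steerable from the
| environment on current glibc; something like
| GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
| keeps copies below that size on the vector path instead of
| rep movsb. Exact behavior varies by glibc version.)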
| ww520 wrote:
| Since the CPU instructions are the same, instruction patching
| at startup or install time can be used. Just patch in the
| correct instructions for the respective hardware.
| the8472 wrote:
| The sibling comments mention the hardware specific dynamic
| linking in glibc that's used for function calls. But if your
| compiler inlines memcpy (usually for short, fixed-sized
| copies) into the binary then yes you'll have to compile it
| for a specific CPU to get optimal performance. But that's
| true for all target-dependent optimizations.
|
| More broadly compatible routines will still work on newer
| CPUs, they just won't yield the best performance.
|
| It still would be nice if such central routines could just be
| compiled to the REP-prefixed instructions and would deliver
| (near-)optimal performance so we could stop worrying about
| that particular part.
| lxe wrote:
| I wonder what other things we can improve by removing spectre
| mitigations and tuning hugepages, syscall latency, and core
| affinity.
| lxe wrote:
| So Python isn't affected by the bug because pymalloc performs
| better on buggy CPUs than jemalloc or malloc?
| js2 wrote:
| No, it has nothing to do with pymalloc's performance. Rather,
| the performance issue only occurs when using `rep movsb` on AMD
| CPUs with page-aligned destination buffers, and pymalloc just
| happens to produce offset (non-page-aligned) buffers in this
| case, which avoids the slow path.
| jokethrowaway wrote:
| Clickbait title but interesting article.
|
| This has nothing to do with Python or Rust.
| codedokode wrote:
| Why is there a need to move memory? Can hardware not DMA data
| into non-page-aligned memory? Or does Linux not want to load
| non-aligned data?
| wmf wrote:
| The Linux page cache keeps data page-aligned so if you want the
| data to be unaligned Linux will copy it.
| codedokode wrote:
| What if I don't want to use cache?
| tedunangst wrote:
| Pull out some RAM sticks.
| wmf wrote:
| You can use O_DIRECT although that also forces alignment
| IIRC.
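|
| A hedged sketch of an O_DIRECT read in Rust (the 512-byte
| alignment is an assumption; the real requirement depends on
| the device and filesystem):
|
|     use std::fs::OpenOptions;
|     use std::io::Read;
|     use std::os::unix::fs::OpenOptionsExt;
|
|     fn main() -> std::io::Result<()> {
|         let mut f = OpenOptions::new()
|             .read(true)
|             .custom_flags(libc::O_DIRECT) // bypass the page cache
|             .open("/tmp/file")?; // hypothetical test file
|         // O_DIRECT requires buffer, length, and file offset to
|         // be block-aligned; over-allocate and slice to align.
|         let mut backing = vec![0u8; 4096 + 512];
|         let skip = backing.as_ptr().align_offset(512);
|         let buf = &mut backing[skip..skip + 4096];
|         f.read_exact(buf)?;
|         Ok(())
|     }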
| eigenform wrote:
| would be lovely if ${cpu_vendor} would document exactly how
| FSRM/ERMS/etc are implemented and what the expected behavior is
___________________________________________________________________
(page generated 2023-11-29 23:00 UTC)