[HN Gopher] Wild performance tricks
___________________________________________________________________
Wild performance tricks
Author : tbillington
Score : 65 points
Date : 2025-09-23 08:16 UTC (2 days ago)
(HTM) web link (davidlattimore.github.io)
(TXT) w3m dump (davidlattimore.github.io)
| Strilanc wrote:
| Every one of these "performance tricks" is describing how to
| convince rust's borrow checker that you're allowed to do a thing.
| It's more like "performance permission slips".
| oleganza wrote:
| You don't have to play this game - you can always write within
| unsafe { ... } like in plain old C or C++. But people do choose
| to play this game because it helps them to write code that is
| also correct, where "correct" has an old-school meaning of
| "actually doing what it is supposed to do and not doing what
| it's not supposed to".
| ManlyBread wrote:
| That just makes it seem like there's no point in using this
| language in the first place.
| maccard wrote:
| Don't let perfect be the enemy of good.
|
| Software is built on abstractions - if all your app code is
| written without unsafe and you have one low-level unsafe
| block to allow for something, you get the value of Rust for
| all your app logic and you know the actual bug is in the
| unsafe code.
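|
| A minimal sketch of that pattern, with get_unchecked chosen
| purely as an example of a single audited unsafe block behind a
| safe API:
|
|     fn first_byte(bytes: &[u8]) -> Option<u8> {
|         if bytes.is_empty() {
|             None
|         } else {
|             // SAFETY: the emptiness check above guarantees
|             // that index 0 is in bounds.
|             Some(unsafe { *bytes.get_unchecked(0) })
|         }
|     }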
| Ar-Curunir wrote:
| This is an issue that you would face in any language with
| strong typing. It only rears its head in Rust because Rust
| tries to give you both low-level control _and_ strong types.
|
| For example, in something like Go (which has a weaker type
| system than Rust), you wouldn't think twice about paying for
| the re-allocation in the buffer-reuse example.
|
| Of course, in something like C or C++ you could do these things
| via simple pointer casts, but then you run the risk of
| invoking undefined behaviour.
| jstimpfle wrote:
| In C I wouldn't use such a fluffy high-level approach in the
| first place. I wouldn't use contiguous unbounded vec-slices.
| And no, I wouldn't attempt trickery with overwriting input
| buffers. That's a bad, inflexible approach that will bite at
| the next refactor. Instead, I would first make sure there's a
| way to cheaply allocate fixed-size buffers (like 4 K buffers
| or whatever) and stream into those. Memory should be used in
| an allocate/write-once/release fashion whenever possible. This
| approach leads to straightforward, efficient architecture and
| bug-free code. It's also much better for
| concurrency/parallelism.
| kibwen wrote:
| _> In C I wouldn't use such a fluffy high-level approach
| in the first place._
|
| Sure, though that's because C has abstraction like Mars has
| a breathable atmosphere.
|
| _> This approach leads to straightforward, efficient
| architecture and bug-free code. It's also much better for
| concurrency/parallelism._
|
| This claim is wild considering that Rust code is more bug-
| free than C code while being just as efficient, while
| keeping in mind that Rust makes parallelism so much easier
| than C that it stops being funny and starts being tragic.
| dwattttt wrote:
| > straightforward, efficient architecture and bug-free code
|
| The grace with which C handles projects of high complexity
| disagrees.
|
| You get a simple implementation only by ignoring edge cases
| or improvements that increase complexity.
| jandrewrogers wrote:
| > in something like C or C++ you could do these things via
| simple pointer casts
|
| No you don't. You explicitly start a new object lifetime at
| the address, either of the same type or a different type.
| There are standard mechanisms for this.
|
| Developers who can't be bothered to do things correctly are
| why languages like Rust exist.
| Ar-Curunir wrote:
| And that is safer... how?
| jstimpfle wrote:
| Yup -- yet another article only solving language-level problems
| instead of teaching something about real constraints (i.e.
| hardware performance characteristics). Booooring. This kind of
| article is why I still haven't mustered the energy to get up to
| date with Rust. I'm still writing C (or C-in-C++) and having
| fun, most of the time feeling like I'm solving actual technical
| problems.
| the-smug-one wrote:
| The rayon thing is neat.
| kibwen wrote:
| ...Except that Rust is thread-safe, so expressing your
| algorithm in terms that the borrow checker accepts makes safe
| parallelism possible, as shown in the example using Rayon to
| trivially parallelize an operation. This is the whole point of
| Rust, and to say that C and C++ fail at thread-safety would be
| the understatement of the century.
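|
| For illustration, a tiny sketch of that kind of trivial
| parallelisation, assuming the rayon crate and made-up data:
|
|     use rayon::prelude::*;
|
|     fn total_len(names: &[String]) -> usize {
|         // Swapping iter() for par_iter() is all it takes; the
|         // borrow checker guarantees the closure only reads
|         // shared data, so Rayon can fan it out across threads
|         // without data races.
|         names.par_iter().map(|s| s.len()).sum()
|     }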
| Cheetahlee01 wrote:
| Just want to know some hacking tricks
| vlovich123 wrote:
| > Now that we have a Vec with no non-static lifetimes, we can
| safely move it to another thread.
|
| I liked most of the tricks, but this one seems pointless. This
| is no different from a transmute, as accessing the borrower
| requires an assume_init, which I believe is technically UB when
| called on an uninit. Unless the point is that you're going to be
| working with Owned but want to just transmute the Vec safely.
|
| Overall I like the into_iter/collect trick to avoid unsafe. It
| was also most of the article, just various ways to apply this
| trick in different scenarios. Very neat!
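|
| For reference, a rough sketch of the into_iter/collect trick as
| I understand it, assuming #[repr(transparent)] newtypes and that
| SymbolId wraps a u32:
|
|     use std::sync::atomic::AtomicU32;
|
|     #[repr(transparent)]
|     struct SymbolId(u32);
|
|     #[repr(transparent)]
|     struct AtomicSymbolId(AtomicU32);
|
|     fn into_atomic(v: Vec<SymbolId>) -> Vec<AtomicSymbolId> {
|         // Fully safe, element-wise conversion; the standard
|         // library's in-place collect specialisation can reuse
|         // the original heap allocation, and the per-element
|         // work optimises away when the layouts match.
|         v.into_iter()
|             .map(|s| AtomicSymbolId(AtomicU32::new(s.0)))
|             .collect()
|     }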
| ComputerGuru wrote:
| You misunderstood the purpose of that trick. The vector is not
| going to be accessed again, the idea is to move it to another
| thread so it can be dropped in parallel (never accessed).
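|
| As a hedged sketch of that idea (the function name is made up):
|
|     fn drop_in_background<T: Send + 'static>(v: Vec<T>) {
|         // The vector is never read again; it is handed to
|         // another thread purely so its (potentially expensive)
|         // deallocation happens off the hot path.
|         std::thread::spawn(move || drop(v));
|     }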
| quotemstr wrote:
| > Even if it were stable, it only works with slices of primitive
| types, so we'd have to lose our newtypes (SymbolId etc).
|
| That's weird. I'd expect it to work with _any_ type, primitive or
| not, newtype or not, with a sufficiently simple memory layout,
| the rough equivalent of what C++ calls a "standard-layout type"
| or (formerly) a "POD".
|
| I don't like magical stdlibs and I don't like user types being
| less powerful than built-in ones.
|
| Clever workaround doing a no-op transformation of the whole
| vector though! Very nearly zero-cost.
|
| > It would be possible to ensure that the proper Vec was restored
| for use-cases where that was important, however it would add
| extra complexity and might be enough to convince me that it'd be
| better to just use transmute.
|
| Great example of Rust being built such that you have to deal with
| error returns _and_ think about C++-style exception safety.
|
| > The optimisation in the Rust standard library that allows reuse
| of the heap allocation will only actually work if the size and
| alignment of T and U are the same
|
| Shouldn't it work when T and U are the same size and T has
| stricter alignment requirements than U but not exactly the same
| alignment? In this situation, any U would be properly aligned
| because T is even more aligned.
| aw1621107 wrote:
| > I'd expect it to work with _any_ type, primitive or not,
| newtype or not, with a sufficiently simple memory layout, the
| rough equivalent of what C++ calls a "standard-layout type" or
| (formerly) a "POD".
|
| This might be related in part to the fact that Rust chose to
| create specific AtomicU8/AtomicU16/etc. types instead of going
| for Atomic<T> like in C++. The reasoning for forgoing the
| latter is [0]:
|
| > However the consensus was that having unsupported atomic
| types either fail at monomorphization time or fall back to
| lock-based implementations was undesirable.
|
| That doesn't mean that one couldn't hypothetically try to write
| from_mut_slice<T> where T is a transparent newtype over one of
| the supported atomics, but I'm not sure whether that function
| signature is expressible at the moment. Maybe if/when safe
| transmutes land, since from_mut_slice is basically just doing a
| transmute?
|
| > Shouldn't it work when T and U are the same size and T has
| stricter alignment requirements than U but not exactly the same
| alignment? In this situation, any U would be properly aligned
| because T is even more aligned.
|
| I think this optimization does what you say? A quick skim of
| the source code [1] seems to show that the alignments don't
| have to exactly match:
|
|     //! # Layout constraints
|     //! <snip>
|     //! Alignments of `T` must be the same or larger than `U`.
|     //! Since alignments are always a power of two _larger_
|     //! implies _is a multiple of_.
|
| And later:
|
|     const fn in_place_collectible<DEST, SRC>(
|         step_merge: Option<NonZeroUsize>,
|         step_expand: Option<NonZeroUsize>,
|     ) -> bool {
|         if const { SRC::IS_ZST || DEST::IS_ZST
|             || mem::align_of::<SRC>() < mem::align_of::<DEST>() } {
|             return false;
|         }
|         // Other code that deals with non-alignment conditions
|     }
|
| [0]:
| https://github.com/Amanieu/rfcs/blob/more_atomic_types/text/...
|
| [1]: https://github.com/rust-
| lang/rust/blob/c58a5da7d48ff3887afe4...
| quotemstr wrote:
| > I think this optimization does what you say?
|
| Cool. Thanks for checking! I guess the article should be
| tweaked a bit --- it states that the alignment has to match
| exactly.
| 0x1ceb00da wrote:
| > Great example of Rust being built such that you have to deal
| with error returns and think about C++-style exception safety.
|
| Not really. Panics are supposed to be used in super exceptional
| situations, where the only course of action is to abort the
| whole unit of work you're doing and throw away all the
| resources. However, you do have to be careful in critical code
| because things like integer overflow can also raise a panic.
| quotemstr wrote:
| > be careful in critical code because things like integer
| overflow can also raise a panic
|
| So you can basically panic anywhere. I understand people have
| looked at no-panic markers (like C++ noexcept) but the
| proposals haven't gone anywhere. Consequently, you need to
| maintain the basic exception safety guarantee [1] at all
| times. In safe Rust, the compiler enforces this level of
| safety in most cases on its own, but there are situations in
| which you can temporarily violate program invariants and
| panic before being able to restore them. (A classic example
| is debiting from one bank account before crediting to
| another. If you panic in the middle, the money is lost.)
|
| If you want that bank code to be robust against panics, you
| need to use something like
| https://docs.rs/scopeguard/latest/scopeguard/
|
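| As a sketch of what that might look like (Account and transfer
| are hypothetical; guard and ScopeGuard::into_inner are the
| actual scopeguard API):
|
|     use scopeguard::guard;
|
|     struct Account { balance: i64 }
|
|     fn transfer(from: &mut Account, to: &mut Account, amount: i64) {
|         from.balance -= amount;
|         // If anything below panics, the guard's closure runs
|         // during unwinding and refunds the debit, restoring
|         // the invariant.
|         let from = guard(from, |from| from.balance += amount);
|         to.balance += amount; // could panic, e.g. debug overflow
|         // Success: defuse the guard so the refund never runs.
|         scopeguard::ScopeGuard::into_inner(from);
|     }
|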
| In unsafe Rust, you basically have the same burden of
| exception safety that C++ creates, except your job as an
| unsafe Rust programmer is harder than a C++ programmer's
| because Rust doesn't have a noexcept. Without noexcept, it's
| hard to reason about which calls can panic and which can't,
| so it's hard to make bulletproof cleanup paths.
|
| Most Rust programmers don't think much about panics, so I
| assume most Rust programs are full of latent bugs of this
| sort. That's why I usually recommend panic=abort.
|
| [1] https://en.wikipedia.org/wiki/Exception_safety#Classifica
| tio...
| ComputerGuru wrote:
| I don't like relying on (release-only) llvm optimizations for a
| number of reasons, but primarily a) they break between releases,
| more often than you'd think, b) they're part of the reason why
| debug builds of rust software are so much slower (at runtime)
| than release builds, c) they're much harder to verify (and very
| opaque).
|
| For non-performance-sensitive code, sure, go ahead and rely on
| the rust compiler to compile away the allocation of a whole new
| vector of a different type to convert from T to AtomicT, but
| where the performance matters, for my money I would go with the
| transmute 100% of the time (assuming the underlying type was
| decorated with #[repr(transparent)], though it would be nice if
| we could statically assert that). It'll perform better in debug
| mode, it's obvious what you are doing, it's guaranteed not to
| break in a minor rustc update, and it'll work with &mut [T]
| instead of an owned Vec<T> (which is a big one).
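|
| For what it's worth, that static assertion is expressible today;
| a hedged sketch of the pointer-cast variant, reusing the
| hypothetical SymbolId/AtomicSymbolId newtypes:
|
|     use std::mem::{align_of, size_of};
|     use std::sync::atomic::AtomicU32;
|
|     #[repr(transparent)]
|     struct SymbolId(u32);
|
|     #[repr(transparent)]
|     struct AtomicSymbolId(AtomicU32);
|
|     // Statically assert that the layouts really do match.
|     const _: () = assert!(
|         size_of::<SymbolId>() == size_of::<AtomicSymbolId>()
|             && align_of::<SymbolId>() == align_of::<AtomicSymbolId>()
|     );
|
|     fn as_atomic(s: &mut [SymbolId]) -> &mut [AtomicSymbolId] {
|         // SAFETY: both sides are #[repr(transparent)] wrappers
|         // over types with identical layout, as checked above.
|         unsafe {
|             &mut *(s as *mut [SymbolId] as *mut [AtomicSymbolId])
|         }
|     }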
| mjmas wrote:
| The particular optimisation for the non-copying Vec -> IntoIter
| -> Vec transform is actually hard-coded in the standard library as
| a special case of collecting an Iterator into a Vec. It doesn't
| rely on the backend for this.
|
| Though this optimisation is treated as an implementation detail
| [1].
|
| [1]: https://doc.rust-
| lang.org/stable/std/vec/struct.Vec.html#imp...
| 0x1ceb00da wrote:
| > It'd be reasonable to think that this will have a runtime cost,
| however it doesn't. The reason is that the Rust standard library
| has a nice optimisation in it that when we consume a Vec and
| collect the result into a new Vec, in many circumstances, the
| heap allocation of the original Vec can be reused. This applies
| in this case. But even with the heap allocation being
| reused, we're still looping over all the elements to transform
| them, right? Because the in-memory representation of an
| AtomicSymbolId is identical to that of a SymbolId, our loop
| becomes a no-op and is optimised away.
|
| Those optimisations that this code relies on are literally
| undefined behaviour. The compiler doesn't guarantee it's gonna
| apply those optimisations. So your code might suddenly become
| super slow and you'll have to go digging in to see why. Is this
| undefined behaviour better than just having an unsafe block? I'm
| not so sure. The unsafe code will be easier to read and you won't
| need any comments or a blog to explain why we're doing voodoo
| stuff because the logic of the code will explain its intentions.
| stouset wrote:
| You're misusing the term "undefined behavior". You can
| certainly say that these kinds of performance optimizations
| aren't guaranteed.
| steveklabnik wrote:
| > Those optimisations that this code relies on are literally
| undefined behaviour.
|
| You cannot get undefined behavior in Rust without an unsafe
| block.
|
| > The compiler doesn't guarantee it's gonna apply those
| optimisations.
|
| This is a different concept than UB.
|
| However, for the "heap allocation can be re-used", Rust does
| talk about this: https://doc.rust-
| lang.org/stable/std/vec/struct.Vec.html#imp...
|
| It cannot guarantee it for arbitrary iterators, but the
| map().collect() re-use is well known, and the machinery is
| there to do this, so while other implementations may not, rustc
| always will.
|
| Basically, it is implementation-defined behavior. (If it were
| C/C++ it would be 'unspecified behavior' because rustc does not
| document exactly when it does this, but this is a _very_ fine
| nitpick and not language Rust currently uses, though I'd argue
| it should.)
|
| > So your code might suddenly become super slow and you'll have
| to go digging in to see why.
|
| That's why wild has performance tests, to ensure that if a
| change breaks rustc's ability to optimize, it'll be noticed,
| and therefore fixed.
| 0x1ceb00da wrote:
| > That's why wild has performance tests, to ensure that if a
| change breaks rustc's ability to optimize, it'll be noticed,
| and therefore fixed.
|
| But benchmarks won't tell us which optimisation suddenly
| stopped working. This looks so similar to the argument
| against UB to me. Something breaks, but you don't know what,
| where, and why.
| steveklabnik wrote:
| It is true that it won't tell you, for sure. It's just that
| UB means something very specific when discussing language
| semantics.
| 0x1ceb00da wrote:
| I see. These optimisations might not be UB as understood
| in compiler lingo, but this is a kind of "undefined
| behaviour", as in anything could happen. And honestly the
| problems it might cause don't look that different from
| those caused by UB (from compiler lingo). Not to mention,
| using unsafe for writing optimised code will generate
| same-ish code in both debug and release mode, so DX will
| be better too.
| Ar-Curunir wrote:
| The optimization not getting applied doesn't mean that
| "anything could happen". Your code would just run slower.
| The result of this computation would still be correct and
| would match what you would expect to happen. This is the
| opposite of undefined behaviour, where the result is
| literally undefined, and, in particular, can be garbage.
| tuckerman wrote:
| As an example, parts of the C++ standard library (though
| none of the core language, I believe) are covered by
| complexity requirements, but implementations can still
| vary widely. E.g. std::sort needs to be linearithmic, but
| someone could still implement a very slow version without
| it being UB (even if it were quadratic or something, it
| still wouldn't be UB, just not standards-conforming).
|
| UB is really about the observable behavior of the
| abstract machine, which is limited to reads/writes to
| volatile data and I/O library calls [1].
|
| [1] http://open-std.org/jtc1/sc22/open/n2356/intro.html
|
| Edit: to clarify the example
| ashvardanian wrote:
| I'd strongly caution against many of those "performance tricks."
| Spawning an asynchronous task on a separate thread, often with a
| heap-allocated handle, solely to deallocate a local object is a
| dubious pattern -- especially given how typical allocators behave
| under the hood.
|
| I frequently encounter use-cases akin to the "Sharded Vec Writer"
| idea, and I agree it can be valuable. But if performance is a
| genuine requirement, the implementation needs to be very
| different. I once attempted to build a general-purpose trait for
| performing parallel in-place updates of a Vec<T>, and found it
| extremely difficult to express cleanly in Rust without
| degenerating into unsafe or brittle abstractions.
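|
| For concreteness, a minimal sketch of safe in-place parallel
| writes over disjoint shards, assuming rayon (the fill logic is
| made up):
|
|     use rayon::prelude::*;
|
|     fn fill_in_parallel(buf: &mut [u32], shard_len: usize) {
|         // par_chunks_mut hands each worker a disjoint
|         // &mut [u32], so the writes are in place and race-free
|         // without locks or unsafe.
|         buf.par_chunks_mut(shard_len)
|             .enumerate()
|             .for_each(|(i, shard)| {
|                 for (j, slot) in shard.iter_mut().enumerate() {
|                     *slot = (i * shard_len + j) as u32;
|                 }
|             });
|     }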
___________________________________________________________________
(page generated 2025-09-25 23:00 UTC)