[HN Gopher] Small Strings in Rust: smolstr vs. smartstring
___________________________________________________________________
Small Strings in Rust: smolstr vs. smartstring
Author : airstrike
Score : 116 points
Date : 2024-08-24 14:58 UTC (8 hours ago)
(HTM) web link (fasterthanli.me)
(TXT) w3m dump (fasterthanli.me)
| masklinn wrote:
| [2020]. I wouldn't be surprised if the field had changed a fair
| bit since.
| brigadier132 wrote:
| Ok, I dived into the rabbit hole a bit and found this:
|
| https://github.com/rosetta-rs/string-rosetta-rs?tab=readme-o...
|
| Click the charts for actually readable data.
|
| From a quick scan, it seems like hipstr is the current fastest
| string implementation and is actively maintained.
| IshKebab wrote:
| And has the best name!
| skavi wrote:
| Those benchmarks aren't great.
|
| A real world-ish workload made generic over all these types
| would provide much more interesting results. Especially if
| heap usage was also tracked.
| dathinab wrote:
| the author has some recommendations, to quote formatting):
|
| > Suggestions:
|
| >
|
| > Generally, String
|
| > If you deal mostly with string literals but want some
| flexibility (like clap), generally you'll want Cow<'static,
| str>
|
| > If a profiler says your strings are a problem:
|
| > Try different crates and settings for that crate out with
| a profiler
|
| > O(1) clones are important when doing a lot of clones.
|
| > For one-off allocations, they are slower.
|
| > For short-lived programs, look into string interning
|
| i.e. author thinks you should profile with your own use
| case
|
| EDIT: Not sure why HN removes indention between a leading >
| and the start of the text, makes quoting unnecessary hard
| epage wrote:
| Real world benchmarks also wouldn't be great because they
| would be showing how well it works in someone else's
| program, rather than yours.
|
| This at least gives you an idea of what the relative cost
| of different operations are so you can consider what are
| frequent operations in your program and then benchmark a
| couple from there, rather than everything.
| epage wrote:
| Also, smol_str was removed from the comparison because
| matklad, the author of smol_str, suggests ecow
|
| > I'd rather say the opposite: users of those crates should
| switch to ecow. It is exactly what smol_str would have been,
| if it were a proper crate with a stable API, rather than an
| implementation detail of rust-analyzer. > > It's a drop-in
| replacement for String, with O(1) clone and SSO, and I
| believe this is all you need. Other crates either have
| needlessly restricted API (no mutation), questionable
| implementation choices, or a bunch of ad hoc traits in the
| API.
|
| https://www.reddit.com/r/rust/comments/117ksvr/ecow_compact_.
| ..
| abhorrence wrote:
| Sadly it seems like some of the images have broken since it was
| originally posted. :(
| 01HNNWZ0MV43FF wrote:
| This capture looks a little better. Even the CSS looks off on
| the live site. Maybe someone should ping him? I think he's on
| Mastodon
| https://web.archive.org/web/20201002084042/https://fastertha...
| cute_boi wrote:
| Actually previous design was better to my eyes.
| pronoiac wrote:
| I emailed him.
| fasterthanlime wrote:
| Thanks for the heads up!
|
| I fixed the SVG graphs and made a couple layout updates based
| on the feedback here and some earlier feedback from my
| subreddit, see the before-after here.[1]
|
| I've been mostly focused on function lately, the redesign is,
| let's say, a work in progress.
|
| (Oh also, I use they/them pronouns these days [2])
|
| [1] https://imgur.com/a/before-after-2024-08-24-PAmeHFX [2]
| https://fasterthanli.me/articles/state-of-the-
| fasterthanlime...
| unshavedyak wrote:
| On the note of small strings, Compact String[1] was i believed
| released after this article and has a nifty trick. Where Smol and
| Smart can fit 22 and 23 bytes, CompactStr can fit 24! Which is
| kinda nutty imo, that's the full size of the normal String on the
| stack.. but packed with actual string data.
|
| There's a nice explanation on their readme[2]. Love tricks like
| this.
|
| [1]: https://github.com/ParkMyCar/compact_str
|
| [2]: https://github.com/ParkMyCar/compact_str?tab=readme-ov-
| file#...
| dataflow wrote:
| Note for C++ developers: Their trick is only possible because
| the strings are UTF-8 and not null-terminated. It wouldn't work
| as a drop-in for standard strings in C++.
| kloop wrote:
| std::string isn't null terminated (or at least it isn't
| guaranteed to be, I don't think it's forbidden for an
| implementation to do that).
|
| That's why the c_str method exists, so you can get a pointer
| to a null terminated character array
| searealist wrote:
| I think std::string is actually required to be null
| terminated now in the latest standards. But even before it
| was basically required as that was the only way to make
| c_str() constant time.
| dataflow wrote:
| > But even before it was basically required as that was
| the only way to make c_str() constant time.
|
| Note that c_str() could've inserted the terminator in
| constant time, as long as the string kept space reserved
| for that. So this wouldn't have violated constant-time-
| ness, and it's not an insane thing to do considering it
| would save some instructions elsewhere. But yeah, the
| flexibility wasn't all that useful even in the beginning,
| and became even more useless as soon as C++ incorporated
| threading, due to the constness of the function.
| unclad5968 wrote:
| std::string is guaranteed to be null terminated since
| c++11.
|
| std::string::c_str returns the same address as &string[0]
| dataflow wrote:
| > std::string::c_str returns the same address as
| &string[0]
|
| Note that this by itself doesn't imply null-termination,
| though as you say strings are indeed null-terminated now.
|
| Edit: This is not particularly obvious, but the reason
| this has nothing to do with allocation or the return
| value is that the implementation could still leave space
| for the null terminator, but avoid actually setting it to
| zero until c_str() is invoked. That would neither affect
| the returned pointer nor the constant-time guarantee.
| masspro wrote:
| Why downvotes? Ignoring C++11 changes for a minute, c_str
| is absolutely able to cause a reallocation if it needs
| to, such that after it returns, its return value and the
| value of data() are the same. Ergo no, the GP statement
| alone doesn't imply anything.
| dpassens wrote:
| But the return value guarantee is, as far as I can see, a
| C++11 change, so you can't ignore those. And as of C++11,
| c_str is noexcept, so it cannot allocate anything.
| dataflow wrote:
| Please see my edit.
| dataflow wrote:
| > std::string isn't null terminated
|
| It is as of C++11. The constness of c_str() threw a wrench
| into that as soon as C++ got a threading model.
| neonsunset wrote:
| Null-terminated strings really are a mistake. Make vectorized
| algorithms problematic by forcing them to account for page
| size and paged memory in general as well as always scan for
| NUL, cannot be easily sliced without re-allocation, are
| opposite to how languages with better string primitives
| define them and in general don't save much by passing a
| single pointer over ptr + length.
| mananaysiempre wrote:
| More of a knock against C than C++, seeing as today's C++
| tends to prefer `string_view`s or at worst iterator pairs,
| neither of which use null-termination. (Saying that as
| someone who prefers C to C++ and avoids Rust exactly because
| of "better C++" vibes--I don't _want_ a better version of
| C++, if anything I want a much much simpler one that's worse
| at some things.)
|
| That said, I don't see why it wouldn't be possible to cram in
| 24 bytes of null-terminated payload (so 23 useful ones, the
| best you could hope for with null termination) into the same
| structure the same way by storing the compact version with
| null termination and ensuring the last byte is also always
| zero. For extra style points, define the last byte to be 24
| minus payload length instead so you don't need to recompute
| the length in the inline case.
|
| To be clear: none of this makes null termination not dumb.
| dataflow wrote:
| > That said, I don't see why it wouldn't be possible to
| cram in 24 bytes of null-terminated payload
|
| Note that I didn't say this is impossible, just that the
| given trick wouldn't work.
|
| However, this _is_ impossible for general strings. The only
| way could possibly make this work is if you constrain the
| inline string somehow (e.g., to UTF-8), so that some
| shorter strings failing that constraint are forced to go on
| the heap too. Otherwise you have 1 fixed zero byte at the
| end, and 23 fully flexible bytes, leaving you no way to
| represent an out-of-line string.
|
| (Well, you could do it if you use the address as a key into
| some static map or such where you shove the real data, but
| that's cheating and beside the point here.)
| aw1621107 wrote:
| > That said, I don't see why it wouldn't be possible to
| cram in 24 bytes of null-terminated payload (so 23 useful
| ones, the best you could hope for with null termination)
| into the same structure the same way by storing the compact
| version with null termination and ensuring the last byte is
| also always zero.
|
| I want to say libc++ and maybe MSVC do something along
| those lines in their std::string implementations.
|
| > For extra style points, define the last byte to be 24
| minus payload length instead so you don't need to recompute
| the length in the inline case.
|
| IIRC Facebook's FBString from Folly does (did?) that?
| tialaramex wrote:
| > I want to say libc++ and maybe MSVC do something along
| those lines in their std::string implementations.
|
| Here's Raymond Chen on this topic earlier this same year.
| As a bonus since you're looking at this in August it's
| more or less correct now, whereas when it was published
| it had numerous serious errors. Whether the standard
| library implementation of such an important type should
| be so complicated that an expert makes numerous errors is
| another question...
|
| https://devblogs.microsoft.com/oldnewthing/20240510-00/?p
| =10...
|
| So, libc++ gets closest, 1 flag byte + 22 bytes of text +
| 1 byte of ASCII NUL = 24 bytes
|
| The others are much worse, larger (32 bytes on modern
| computers) yet with lower SSO capacity (15 bytes of
| text).
| silisili wrote:
| This is interesting, thanks.
|
| > 0b11111110 - All 1s with a trailing 0, indicates heap
| allocated
|
| > 0b11XXXXXX - Two leading 1s, indicates inline, with the
| trailing 6 bits used to store the length
|
| I stared at this for too long, as it allows collision. Then I
| realized you'd never set the third bit, it should probably have
| been written 0b110XXXXX and recorded that 5 bits are used for
| length. Right or did I understand it wrong?
| tialaramex wrote:
| Probably this isn't helpful anyway - what's actually going on
| is more complicated and is explained later at a high level or
| I'll try now:
|
| Rust has "niches" - bit patterns which are never used by that
| type and thus can be occupied by something else in a sum type
| (Rust's enum) which adds to that type. But stable Rust
| doesn't allow third parties to promise arbitrary niches exist
| for a type they made.
|
| However, if you make a simple enumeration of N possibilities
| that _automatically_ has a niche of all the M-N bit patterns
| which weren 't needed by your enumeration in the M value
| machine integer that was chosen to store this enumerated type
| (M will typically be 256 or 65536 depending on how many
| things you enumerated)
|
| So, CompactString has a custom enum type LastUtf8Char which
| it uses for the last byte in its data structure - this has
| values V0 through V191 corresponding to the 192 possible last
| bytes of a UTF-8 string. That leaves 64 bit patterns unused.
| Then L0 through L23 represent lengths - inline strings of
| length 0 to 23 inclusive which didn't need this last byte (if
| it was 24 then that's V0 through V191). Now we've got 40 bit
| patterns left.
|
| Then one bit pattern (the pattern equivalent to the unsigned
| integer 216) signifies that this string data lives on the
| heap, the rest should be interpreted accordingly, and another
| (217) signifies that it's a weird static allocation (I do not
| know why you'd do this)
|
| That leaves 38 bit patterns unused when the type is finished
| using any it wanted so there's still a niche for
| Option<CompactString> or MyCustomType<CompactString>
| parkmycar wrote:
| Author of compact_str here, you hit the nail on the head,
| great explanation!
|
| > ... and another (217) signifies that it's a weird static
| allocation (I do not know why you'd do this)
|
| In addition to String Rust also has str[1], which is an
| entirely different type. It's common to represent string
| literals known at compile time as `&'static str`, but they
| can't be used in all of the same places that a String can.
| For example, you can't put a &'static str into a
| Vec<String> unless you first heap allocate and create a
| String. We added the additional variant of 217 so users of
| CompactString could abstract over both string literals and
| strings created at runtime to solve cases like the example.
|
| [1]: https://doc.rust-lang.org/std/primitive.str.html
| tialaramex wrote:
| Thanks! And the explanation of 217 makes sense too.
|
| Since I have you here, wouldn't it be better to name that
| type "LastByte" or something? It's not a (Rust) char, and
| it's not necessarily UTF-8 whereas it is definitely the
| last byte.
| parkmycar wrote:
| Ha, naming is hard! You're totally right, it used to be
| just the values of the last byte of a UTF-8 char (and
| before that it was NonMaxU8) but now represents more.
| I'll update it once I'm back at my computer, thanks!
| shikon7 wrote:
| Maybe you need it if pointers are 16 bytes?
| parkmycar wrote:
| Hey I'm the author of compact_str, thanks for the kind words!
|
| Fun fact, it was this fasterthanlime post that originally
| inspired me to play around with small strings and led to the
| creation of compact_str.
| conaclos wrote:
| compact_str is a fantastic crate, used by many projects. Do
| you know the byteyarn crate [0]? This could be nice to add
| this to the `Similar Crates` section if it makes sense.
|
| [0] https://docs.rs/byteyarn/latest/byteyarn/
| parkmycar wrote:
| I do, mcyoung wrote a great blogpost[1] about it! Good
| idea, I'm AFK at the moment but will add it to the 'Similar
| Crates' section once I'm back
|
| [1] https://mcyoung.xyz/2023/08/09/yarns/
| tomcam wrote:
| I've been wondering something for ages. Where did you get the
| 24 byte number, and how does it compare in Unicode terms?
| That is, did you analyze a large corpus and determine that 24
| bytes was right for the largest number of strings? And does
| it come out to, say, 10 Unicode characters?
| mastax wrote:
| It's nice that rustc's niche optimization lets smolstr be
| implemented with a simple enum, rather than having to do some
| unsafe union bit packing[0]. The only concession that had to be
| made to the compiler is using an enum for the InlineSize value to
| show that the last 3 bits of that aren't used.
|
| [0]: https://github.com/rust-
| analyzer/smol_str/blob/fde86a5c0cb8f...
| jtrueb wrote:
| I made humanize-bytes[1] for that formatting reason (1000 vs
| 1024). Coincidentally, it uses smartstring to avoid allocations.
|
| [1] https://crates.io/crates/humanize-bytes
| weinzierl wrote:
| Here [1] is a nice talk that discusses various options and trade-
| offs for small string and small vector optimization in Rust.
|
| [1]
| https://m.youtube.com/watch?time_continue=2658&v=tLX_nvWD738...
| avinassh wrote:
| slightly related: earlier I was experimenting with Rust string
| allocations, even though my code did almost same as standard
| library, the heap allocations were taking 10x of time. Relevant
| code: fn alloc_string(s: &str) -> NonNull<u8> {
| let boxed_slice = s.as_bytes().to_owned().into_boxed_slice();
| NonNull::new(Box::into_raw(boxed_slice) as *mut u8).unwrap()
| }
|
| Approximately, if stdlib was taking 1us, but my code was about
| 14-15us for large strings. Profiling also did not help. Anyone
| have any guesses? Here is the full code:
| https://github.com/avinassh/string-alloc
| hummingly wrote:
| When you call to_owned, it creates a Vec<u8>. This must then be
| truncated due to the call to into_boxed_slice. The
| documentation says this could potentially mean a re-allocation
| depending on the allocator. When this happens the old
| allocation needs to be dropped too. You could try using a
| different allocator but before this I would recommend to
| replace those two calls with Box::from(s.as_bytes()).
| dathinab wrote:
| on interesting thing to realize is that some small string types
| go beyond just the basic small len storage optimization
|
| - compact_str e.g. depends on the string being valid utf-8, and
| in turn has larger short strings
|
| - smol_str e.g. is a a enum over `[u8; CAP] | &'static str |
| Arc<str>` this means it avoids any allocations for static strings
| and has very fast clones, leading to similar perf.
| characteristics as string internalization in some use cases (like
| the use-cases it was designed for). But at the cost of it being
| immutable only and the heap allocation being slightly larger for
| the Rc.
|
| Other interesting differences can be the handling of shrinking
| mutable Strings, do you re-inline it or not? What is better here
| is highly use-case dependent.
|
| In the end there are many design decisions where there is no
| clear winner but it's a question of trade off with use-case
| specific preferences.
| conaclos wrote:
| Readers who like this article may also like a more recent one
| [0]. It designs a compact string with extra capabilities. The
| crate was released under the name byteyarn [1]
|
| [0] https://mcyoung.xyz/2023/08/09/yarns/
|
| [1] https://docs.rs/byteyarn/latest/byteyarn/
___________________________________________________________________
(page generated 2024-08-24 23:00 UTC)