[HN Gopher] Small Strings in Rust: smolstr vs. smartstring
       ___________________________________________________________________
        
       Small Strings in Rust: smolstr vs. smartstring
        
       Author : airstrike
       Score  : 116 points
       Date   : 2024-08-24 14:58 UTC (8 hours ago)
        
 (HTM) web link (fasterthanli.me)
 (TXT) w3m dump (fasterthanli.me)
        
       | masklinn wrote:
       | [2020]. I wouldn't be surprised if the field had changed a fair
       | bit since.
        
         | brigadier132 wrote:
         | Ok, I dived into the rabbit hole a bit and found this:
         | 
         | https://github.com/rosetta-rs/string-rosetta-rs?tab=readme-o...
         | 
         | Click the charts for actually readable data.
         | 
         | From a quick scan, it seems like hipstr is the current fastest
         | string implementation and is actively maintained.
        
           | IshKebab wrote:
           | And has the best name!
        
           | skavi wrote:
           | Those benchmarks aren't great.
           | 
           | A real world-ish workload made generic over all these types
           | would provide much more interesting results. Especially if
           | heap usage was also tracked.
        
             | dathinab wrote:
             | the author has some recommendations, to quote formatting):
             | 
             | > Suggestions:
             | 
             | >
             | 
             | > Generally, String
             | 
             | > If you deal mostly with string literals but want some
             | flexibility (like clap), generally you'll want Cow<'static,
             | str>
             | 
             | > If a profiler says your strings are a problem:
             | 
             | > Try different crates and settings for that crate out with
             | a profiler
             | 
             | > O(1) clones are important when doing a lot of clones.
             | 
             | > For one-off allocations, they are slower.
             | 
             | > For short-lived programs, look into string interning
             | 
             | i.e. author thinks you should profile with your own use
             | case
             | 
             | EDIT: Not sure why HN removes indention between a leading >
             | and the start of the text, makes quoting unnecessary hard
        
             | epage wrote:
             | Real world benchmarks also wouldn't be great because they
             | would be showing how well it works in someone else's
             | program, rather than yours.
             | 
             | This at least gives you an idea of what the relative cost
             | of different operations are so you can consider what are
             | frequent operations in your program and then benchmark a
             | couple from there, rather than everything.
        
           | epage wrote:
           | Also, smol_str was removed from the comparison because
           | matklad, the author of smol_str, suggests ecow
           | 
           | > I'd rather say the opposite: users of those crates should
           | switch to ecow. It is exactly what smol_str would have been,
           | if it were a proper crate with a stable API, rather than an
           | implementation detail of rust-analyzer. > > It's a drop-in
           | replacement for String, with O(1) clone and SSO, and I
           | believe this is all you need. Other crates either have
           | needlessly restricted API (no mutation), questionable
           | implementation choices, or a bunch of ad hoc traits in the
           | API.
           | 
           | https://www.reddit.com/r/rust/comments/117ksvr/ecow_compact_.
           | ..
        
       | abhorrence wrote:
       | Sadly it seems like some of the images have broken since it was
       | originally posted. :(
        
         | 01HNNWZ0MV43FF wrote:
         | This capture looks a little better. Even the CSS looks off on
         | the live site. Maybe someone should ping him? I think he's on
         | Mastodon
         | https://web.archive.org/web/20201002084042/https://fastertha...
        
           | cute_boi wrote:
           | Actually previous design was better to my eyes.
        
         | pronoiac wrote:
         | I emailed him.
        
           | fasterthanlime wrote:
           | Thanks for the heads up!
           | 
           | I fixed the SVG graphs and made a couple layout updates based
           | on the feedback here and some earlier feedback from my
           | subreddit, see the before-after here.[1]
           | 
           | I've been mostly focused on function lately, the redesign is,
           | let's say, a work in progress.
           | 
           | (Oh also, I use they/them pronouns these days [2])
           | 
           | [1] https://imgur.com/a/before-after-2024-08-24-PAmeHFX [2]
           | https://fasterthanli.me/articles/state-of-the-
           | fasterthanlime...
        
       | unshavedyak wrote:
       | On the note of small strings, Compact String[1] was i believed
       | released after this article and has a nifty trick. Where Smol and
       | Smart can fit 22 and 23 bytes, CompactStr can fit 24! Which is
       | kinda nutty imo, that's the full size of the normal String on the
       | stack.. but packed with actual string data.
       | 
       | There's a nice explanation on their readme[2]. Love tricks like
       | this.
       | 
       | [1]: https://github.com/ParkMyCar/compact_str
       | 
       | [2]: https://github.com/ParkMyCar/compact_str?tab=readme-ov-
       | file#...
        
         | dataflow wrote:
         | Note for C++ developers: Their trick is only possible because
         | the strings are UTF-8 and not null-terminated. It wouldn't work
         | as a drop-in for standard strings in C++.
        
           | kloop wrote:
           | std::string isn't null terminated (or at least it isn't
           | guaranteed to be, I don't think it's forbidden for an
           | implementation to do that).
           | 
           | That's why the c_str method exists, so you can get a pointer
           | to a null terminated character array
        
             | searealist wrote:
             | I think std::string is actually required to be null
             | terminated now in the latest standards. But even before it
             | was basically required as that was the only way to make
             | c_str() constant time.
        
               | dataflow wrote:
               | > But even before it was basically required as that was
               | the only way to make c_str() constant time.
               | 
               | Note that c_str() could've inserted the terminator in
               | constant time, as long as the string kept space reserved
               | for that. So this wouldn't have violated constant-time-
               | ness, and it's not an insane thing to do considering it
               | would save some instructions elsewhere. But yeah, the
               | flexibility wasn't all that useful even in the beginning,
               | and became even more useless as soon as C++ incorporated
               | threading, due to the constness of the function.
        
             | unclad5968 wrote:
             | std::string is guaranteed to be null terminated since
             | c++11.
             | 
             | std::string::c_str returns the same address as &string[0]
        
               | dataflow wrote:
               | > std::string::c_str returns the same address as
               | &string[0]
               | 
               | Note that this by itself doesn't imply null-termination,
               | though as you say strings are indeed null-terminated now.
               | 
               | Edit: This is not particularly obvious, but the reason
               | this has nothing to do with allocation or the return
               | value is that the implementation could still leave space
               | for the null terminator, but avoid actually setting it to
               | zero until c_str() is invoked. That would neither affect
               | the returned pointer nor the constant-time guarantee.
        
               | masspro wrote:
               | Why downvotes? Ignoring C++11 changes for a minute, c_str
               | is absolutely able to cause a reallocation if it needs
               | to, such that after it returns, its return value and the
               | value of data() are the same. Ergo no, the GP statement
               | alone doesn't imply anything.
        
               | dpassens wrote:
               | But the return value guarantee is, as far as I can see, a
               | C++11 change, so you can't ignore those. And as of C++11,
               | c_str is noexcept, so it cannot allocate anything.
        
               | dataflow wrote:
               | Please see my edit.
        
             | dataflow wrote:
             | > std::string isn't null terminated
             | 
             | It is as of C++11. The constness of c_str() threw a wrench
             | into that as soon as C++ got a threading model.
        
           | neonsunset wrote:
           | Null-terminated strings really are a mistake. Make vectorized
           | algorithms problematic by forcing them to account for page
           | size and paged memory in general as well as always scan for
           | NUL, cannot be easily sliced without re-allocation, are
           | opposite to how languages with better string primitives
           | define them and in general don't save much by passing a
           | single pointer over ptr + length.
        
           | mananaysiempre wrote:
           | More of a knock against C than C++, seeing as today's C++
           | tends to prefer `string_view`s or at worst iterator pairs,
           | neither of which use null-termination. (Saying that as
           | someone who prefers C to C++ and avoids Rust exactly because
           | of "better C++" vibes--I don't _want_ a better version of
           | C++, if anything I want a much much simpler one that's worse
           | at some things.)
           | 
           | That said, I don't see why it wouldn't be possible to cram in
           | 24 bytes of null-terminated payload (so 23 useful ones, the
           | best you could hope for with null termination) into the same
           | structure the same way by storing the compact version with
           | null termination and ensuring the last byte is also always
           | zero. For extra style points, define the last byte to be 24
           | minus payload length instead so you don't need to recompute
           | the length in the inline case.
           | 
           | To be clear: none of this makes null termination not dumb.
        
             | dataflow wrote:
             | > That said, I don't see why it wouldn't be possible to
             | cram in 24 bytes of null-terminated payload
             | 
             | Note that I didn't say this is impossible, just that the
             | given trick wouldn't work.
             | 
             | However, this _is_ impossible for general strings. The only
             | way could possibly make this work is if you constrain the
             | inline string somehow (e.g., to UTF-8), so that some
             | shorter strings failing that constraint are forced to go on
             | the heap too. Otherwise you have 1 fixed zero byte at the
             | end, and 23 fully flexible bytes, leaving you no way to
             | represent an out-of-line string.
             | 
             | (Well, you could do it if you use the address as a key into
             | some static map or such where you shove the real data, but
             | that's cheating and beside the point here.)
        
             | aw1621107 wrote:
             | > That said, I don't see why it wouldn't be possible to
             | cram in 24 bytes of null-terminated payload (so 23 useful
             | ones, the best you could hope for with null termination)
             | into the same structure the same way by storing the compact
             | version with null termination and ensuring the last byte is
             | also always zero.
             | 
             | I want to say libc++ and maybe MSVC do something along
             | those lines in their std::string implementations.
             | 
             | > For extra style points, define the last byte to be 24
             | minus payload length instead so you don't need to recompute
             | the length in the inline case.
             | 
             | IIRC Facebook's FBString from Folly does (did?) that?
        
               | tialaramex wrote:
               | > I want to say libc++ and maybe MSVC do something along
               | those lines in their std::string implementations.
               | 
               | Here's Raymond Chen on this topic earlier this same year.
               | As a bonus since you're looking at this in August it's
               | more or less correct now, whereas when it was published
               | it had numerous serious errors. Whether the standard
               | library implementation of such an important type should
               | be so complicated that an expert makes numerous errors is
               | another question...
               | 
               | https://devblogs.microsoft.com/oldnewthing/20240510-00/?p
               | =10...
               | 
               | So, libc++ gets closest, 1 flag byte + 22 bytes of text +
               | 1 byte of ASCII NUL = 24 bytes
               | 
               | The others are much worse, larger (32 bytes on modern
               | computers) yet with lower SSO capacity (15 bytes of
               | text).
        
         | silisili wrote:
         | This is interesting, thanks.
         | 
         | > 0b11111110 - All 1s with a trailing 0, indicates heap
         | allocated
         | 
         | > 0b11XXXXXX - Two leading 1s, indicates inline, with the
         | trailing 6 bits used to store the length
         | 
         | I stared at this for too long, as it allows collision. Then I
         | realized you'd never set the third bit, it should probably have
         | been written 0b110XXXXX and recorded that 5 bits are used for
         | length. Right or did I understand it wrong?
        
           | tialaramex wrote:
           | Probably this isn't helpful anyway - what's actually going on
           | is more complicated and is explained later at a high level or
           | I'll try now:
           | 
           | Rust has "niches" - bit patterns which are never used by that
           | type and thus can be occupied by something else in a sum type
           | (Rust's enum) which adds to that type. But stable Rust
           | doesn't allow third parties to promise arbitrary niches exist
           | for a type they made.
           | 
           | However, if you make a simple enumeration of N possibilities
           | that _automatically_ has a niche of all the M-N bit patterns
           | which weren 't needed by your enumeration in the M value
           | machine integer that was chosen to store this enumerated type
           | (M will typically be 256 or 65536 depending on how many
           | things you enumerated)
           | 
           | So, CompactString has a custom enum type LastUtf8Char which
           | it uses for the last byte in its data structure - this has
           | values V0 through V191 corresponding to the 192 possible last
           | bytes of a UTF-8 string. That leaves 64 bit patterns unused.
           | Then L0 through L23 represent lengths - inline strings of
           | length 0 to 23 inclusive which didn't need this last byte (if
           | it was 24 then that's V0 through V191). Now we've got 40 bit
           | patterns left.
           | 
           | Then one bit pattern (the pattern equivalent to the unsigned
           | integer 216) signifies that this string data lives on the
           | heap, the rest should be interpreted accordingly, and another
           | (217) signifies that it's a weird static allocation (I do not
           | know why you'd do this)
           | 
           | That leaves 38 bit patterns unused when the type is finished
           | using any it wanted so there's still a niche for
           | Option<CompactString> or MyCustomType<CompactString>
        
             | parkmycar wrote:
             | Author of compact_str here, you hit the nail on the head,
             | great explanation!
             | 
             | > ... and another (217) signifies that it's a weird static
             | allocation (I do not know why you'd do this)
             | 
             | In addition to String Rust also has str[1], which is an
             | entirely different type. It's common to represent string
             | literals known at compile time as `&'static str`, but they
             | can't be used in all of the same places that a String can.
             | For example, you can't put a &'static str into a
             | Vec<String> unless you first heap allocate and create a
             | String. We added the additional variant of 217 so users of
             | CompactString could abstract over both string literals and
             | strings created at runtime to solve cases like the example.
             | 
             | [1]: https://doc.rust-lang.org/std/primitive.str.html
        
               | tialaramex wrote:
               | Thanks! And the explanation of 217 makes sense too.
               | 
               | Since I have you here, wouldn't it be better to name that
               | type "LastByte" or something? It's not a (Rust) char, and
               | it's not necessarily UTF-8 whereas it is definitely the
               | last byte.
        
               | parkmycar wrote:
               | Ha, naming is hard! You're totally right, it used to be
               | just the values of the last byte of a UTF-8 char (and
               | before that it was NonMaxU8) but now represents more.
               | I'll update it once I'm back at my computer, thanks!
        
           | shikon7 wrote:
           | Maybe you need it if pointers are 16 bytes?
        
         | parkmycar wrote:
         | Hey I'm the author of compact_str, thanks for the kind words!
         | 
         | Fun fact, it was this fasterthanlime post that originally
         | inspired me to play around with small strings and led to the
         | creation of compact_str.
        
           | conaclos wrote:
           | compact_str is a fantastic crate, used by many projects. Do
           | you know the byteyarn crate [0]? This could be nice to add
           | this to the `Similar Crates` section if it makes sense.
           | 
           | [0] https://docs.rs/byteyarn/latest/byteyarn/
        
             | parkmycar wrote:
             | I do, mcyoung wrote a great blogpost[1] about it! Good
             | idea, I'm AFK at the moment but will add it to the 'Similar
             | Crates' section once I'm back
             | 
             | [1] https://mcyoung.xyz/2023/08/09/yarns/
        
           | tomcam wrote:
           | I've been wondering something for ages. Where did you get the
           | 24 byte number, and how does it compare in Unicode terms?
           | That is, did you analyze a large corpus and determine that 24
           | bytes was right for the largest number of strings? And does
           | it come out to, say, 10 Unicode characters?
        
       | mastax wrote:
       | It's nice that rustc's niche optimization lets smolstr be
       | implemented with a simple enum, rather than having to do some
       | unsafe union bit packing[0]. The only concession that had to be
       | made to the compiler is using an enum for the InlineSize value to
       | show that the last 3 bits of that aren't used.
       | 
       | [0]: https://github.com/rust-
       | analyzer/smol_str/blob/fde86a5c0cb8f...
        
       | jtrueb wrote:
       | I made humanize-bytes[1] for that formatting reason (1000 vs
       | 1024). Coincidentally, it uses smartstring to avoid allocations.
       | 
       | [1] https://crates.io/crates/humanize-bytes
        
       | weinzierl wrote:
       | Here [1] is a nice talk that discusses various options and trade-
       | offs for small string and small vector optimization in Rust.
       | 
       | [1]
       | https://m.youtube.com/watch?time_continue=2658&v=tLX_nvWD738...
        
       | avinassh wrote:
       | slightly related: earlier I was experimenting with Rust string
       | allocations, even though my code did almost same as standard
       | library, the heap allocations were taking 10x of time. Relevant
       | code:                 fn alloc_string(s: &str) -> NonNull<u8> {
       | let boxed_slice = s.as_bytes().to_owned().into_boxed_slice();
       | NonNull::new(Box::into_raw(boxed_slice) as *mut u8).unwrap()
       | }
       | 
       | Approximately, if stdlib was taking 1us, but my code was about
       | 14-15us for large strings. Profiling also did not help. Anyone
       | have any guesses? Here is the full code:
       | https://github.com/avinassh/string-alloc
        
         | hummingly wrote:
         | When you call to_owned, it creates a Vec<u8>. This must then be
         | truncated due to the call to into_boxed_slice. The
         | documentation says this could potentially mean a re-allocation
         | depending on the allocator. When this happens the old
         | allocation needs to be dropped too. You could try using a
         | different allocator but before this I would recommend to
         | replace those two calls with Box::from(s.as_bytes()).
        
       | dathinab wrote:
       | on interesting thing to realize is that some small string types
       | go beyond just the basic small len storage optimization
       | 
       | - compact_str e.g. depends on the string being valid utf-8, and
       | in turn has larger short strings
       | 
       | - smol_str e.g. is a a enum over `[u8; CAP] | &'static str |
       | Arc<str>` this means it avoids any allocations for static strings
       | and has very fast clones, leading to similar perf.
       | characteristics as string internalization in some use cases (like
       | the use-cases it was designed for). But at the cost of it being
       | immutable only and the heap allocation being slightly larger for
       | the Rc.
       | 
       | Other interesting differences can be the handling of shrinking
       | mutable Strings, do you re-inline it or not? What is better here
       | is highly use-case dependent.
       | 
       | In the end there are many design decisions where there is no
       | clear winner but it's a question of trade off with use-case
       | specific preferences.
        
       | conaclos wrote:
       | Readers who like this article may also like a more recent one
       | [0]. It designs a compact string with extra capabilities. The
       | crate was released under the name byteyarn [1]
       | 
       | [0] https://mcyoung.xyz/2023/08/09/yarns/
       | 
       | [1] https://docs.rs/byteyarn/latest/byteyarn/
        
       ___________________________________________________________________
       (page generated 2024-08-24 23:00 UTC)