[HN Gopher] Why Polars rewrote its Arrow string data type
       ___________________________________________________________________
        
       Why Polars rewrote its Arrow string data type
        
       Author : fanf2
       Score  : 52 points
       Date   : 2024-08-06 14:42 UTC (8 hours ago)
        
 (HTM) web link (pola.rs)
 (TXT) w3m dump (pola.rs)
        
       | padthai wrote:
       | I am getting Vietnam flashbacks of all the small
       | incompatibilities between Numpy arrays and Pandas Series
        
         | grahamjameson wrote:
         | https://youtu.be/J0zjJKBK9ys
        
       | justinsaccount wrote:
       | Is there a standalone c/c++ implementation of this string type
       | anywhere?
        
         | lor_louis wrote:
         | It looks a lot like cedarDB's "german strings".
         | 
         | <https://cedardb.com/blog/german_strings/>
         | 
         | You could probably write a C++ implementation based on the
         | article.
         | 
         | Note that it's not all that useful if you don't plan on
         | searching for text based on their prefix. From my
         | understanding, this is mostly a better way to store SStables in
         | RAM/partially on disk if you mmap.
        
           | o11c wrote:
           | Hm, both the main article and your link are wasteful for
           | strings of length 13-15, which are still pretty common. As a
           | rule for SSO, if you're already take up the bytes
           | unconditionally, it's going to be worth complexifying your
           | "size" calculation to squeeze a little more into the no-alloc
           | case.
           | 
           | That said, note that there are a _lot_ of strings around 20
           | bytes (stringified 64-bit integers or floats), so pushing to
           | 24-1 is a reasonable choice too.
           | 
           | I'd use 3 size classes:
           | 
           | * 0-15 bytes, stored in-line. If you need C-string
           | compatibility (which is fairly likely _somewhere_ in your
           | application), ensure that size 15 is encoded as a zero at the
           | last byte.
           | 
           | * up to 3.75GB (if my mental math is right), stored with a
           | mangled 32-bit size. Alternatively you could use a simpler
           | cutoff if it makes the mangling easier. Another possibility
           | would be to have a 16-bit size class.
           | 
           | * add support for very large strings (likely with a
           | completely different allocation method) too; a 4GB limit
           | sucks and is easily exceeded. If need be this can use an
           | extra indirection.
           | 
           | Honestly, with a 16-byte budget, I'd consider spending more
           | of that on the prefix - you can get 8 bytes with enough
           | contortion elsewhere.
           | 
           | Duplicating the prefix is probably worth it in more cases
           | than you might think, since it does speed up comparisons.
           | Just remember, you have to check the size to distinguish
           | "a\0" from "a" too, not just from "a\0\0\0\0".
        
       | ismailmaj wrote:
       | I might have missed it in the article but I'm not sure why the
       | prefix is stored for strings that can't be inlined.
        
         | pgwhalen wrote:
         | Ctrl-F for "Some motivations are as follows" under the "String
         | view with short string optimizations" section here:
         | https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQ...
         | 
         | Copying here:
         | 
         | > Having the 4-byte prefix directly accessible (without
         | indirection through an offset into a separate data buffer) can
         | substantially improve the performance of comparisons returning
         | false. This prefix can be encoded with multi-column hash keys
         | to accelerate aggregations, joins. Sorts would likely also be
         | significantly faster with this representation (experiments
         | would tell for certain)
         | 
         | > Certain algorithms (for example "prefix of string" or "suffix
         | of string" -- e.g. PREFIX("foobar", 3) -> "bar") can execute by
         | manipulating StringView values only and not requiring any
         | memory copying of large strings.
         | 
         | This document was an early proposal for adding what is now
         | called the StringView (and ByteView) types to the Arrow format
         | itself.
        
         | make3 wrote:
         | the first n bytes are likely by far the most often accessed in
         | practices, specifically for sorting & filtering, etc. Storing
         | them inline is likely a huge optimization for little cost.
        
       | Rygian wrote:
       | > "short string optimization": A short enough string can be
       | stored "in place" [...] An optimization that's impossible in
       | Rust, by the way ;).
       | 
       | Author is not aware of
       | https://docs.rs/compact_str/latest/compact_str/ or
       | https://github.com/bodil/smartstring
        
         | diggan wrote:
         | That documentation talks about all the benefits and "can mostly
         | be used as a drop in replacement for String", but what are the
         | tradeoffs? When cannot it be used?
        
           | tialaramex wrote:
           | It looks like there's a bunch of "garbage collection" type
           | activity where you may have some bytes which were once part
           | of a string but aren't now used, and you're always paying for
           | the overhead of this optimisation even if it's useless for
           | your problem.
           | 
           | Suppose you work only with 500-4000 byte strings, maybe
           | they're short reviews, and each ends with a rating in star
           | emoji, ***** is the best * is the worst. [[HN ate my star
           | emoji of course]]
           | 
           | So your reviews never fit in the "optimised" string slot, but
           | also the prefix is just opening words from a review, which in
           | some review styles will be the start of seemingly unrelated
           | anecdotes. "My grandfather used to tell me" it'll get to the
           | review eventually, and you'll see why they're connected, but
           | the _suffix_ is useful and that 's not stored in a "German
           | string" data structure.
           | 
           | Or maybe you have a high turnover of somewhat related medium
           | size strings, so then that garbage collection step costs
           | quite a lot of overhead.
        
         | charliermarsh wrote:
         | Was this section removed? I'm not seeing it in the linked post.
        
           | tialaramex wrote:
           | It seems to be a quote from
           | https://cedardb.com/blog/german_strings/ which is about this
           | German Strings type (implemented in Polars)
           | 
           | But yeah, it's pretty ignorant to assume Rust can't do this
           | since the best available examples (as with many things) are
           | in Rust. CompactString is _really_ nice. On a typical modern
           | (64-bit) computer CompactString takes 24 bytes and holds up
           | to 24 bytes of UTF-8 text inline, while also having a niche.
           | 
           | I guess the confusion arises because C++ people tend to
           | assume that anywhere Rust differs from the practice in the
           | C++ community it's a mistake, even though that's often
           | because C++ made the wrong choice? Rust's &str is "just"
           | &[u8] plus a rule about the meaning of these bytes, and
           | Rust's String is correspondingly "just" Vec<u8> plus a rule
           | about the meaning of those bytes. C++ couldn't have done the
           | former because it only belatedly got the fat pointer slice
           | reference (as the troubled std::span) years after having a
           | string data type.
           | 
           | Rust _didn 't_ do this in the stdlib, but not because it's
           | impossible, because it's a trade off and they wanted the
           | provided stdlib type to be straightforward. If you need or
           | even just want the trade off, you can just cargo add
           | compact_str
        
         | ladyanita22 wrote:
         | Why is it impossible in Rust? Do you have any source for that?
        
           | steveklabnik wrote:
           | It is not. It is not implemented in std::string::String, but
           | (as pointed out elsewhere in this thread) there are other
           | string implementations that have it.
           | 
           | It was decided explicitly against for the standard library,
           | because not every optimization is universally good, and
           | keeping String as a thin wrapper over Vec<T> is a good
           | default.
        
       | Rygian wrote:
       | > When you access a transient string, the string itself won't
       | know whether the data it points to is still valid, so you as a
       | programmer need to ensure that every transient string you use is
       | actually still valid.
       | 
       | In my mind this reads identical to "if you're a security
       | practitioner, worry about this bit here."
        
       | renewiltord wrote:
       | Haha, quite a huge improvement in a point-release. 0.20.5 to
       | 0.20.6.
       | 
       | This is a cool article. Good motivation. Good explanation. Plots.
       | ~~One small thing is that the plots are missing a legend so I had
       | to hover~~. Nevermind, I don't know why it didn't show (or why I
       | thought that) because I can clearly see them on the x-axis now.
        
       ___________________________________________________________________
       (page generated 2024-08-06 23:00 UTC)