[HN Gopher] Why Polars rewrote its Arrow string data type
___________________________________________________________________
Why Polars rewrote its Arrow string data type
Author : fanf2
Score : 52 points
Date : 2024-08-06 14:42 UTC (8 hours ago)
(HTM) web link (pola.rs)
(TXT) w3m dump (pola.rs)
| padthai wrote:
| I am getting Vietnam flashbacks of all the small
| incompatibilities between Numpy arrays and Pandas Series
| grahamjameson wrote:
| https://youtu.be/J0zjJKBK9ys
| justinsaccount wrote:
| Is there a standalone c/c++ implementation of this string type
| anywhere?
| lor_louis wrote:
| It looks a lot like cedarDB's "german strings".
|
| <https://cedardb.com/blog/german_strings/>
|
| You could probably write a C++ implementation based on the
| article.
|
| Note that it's not all that useful if you don't plan on
| searching for text based on their prefix. From my
| understanding, this is mostly a better way to store SStables in
| RAM/partially on disk if you mmap.
| o11c wrote:
| Hm, both the main article and your link are wasteful for
| strings of length 13-15, which are still pretty common. As a
| rule for SSO, if you're already take up the bytes
| unconditionally, it's going to be worth complexifying your
| "size" calculation to squeeze a little more into the no-alloc
| case.
|
| That said, note that there are a _lot_ of strings around 20
| bytes (stringified 64-bit integers or floats), so pushing to
| 24-1 is a reasonable choice too.
|
| I'd use 3 size classes:
|
| * 0-15 bytes, stored in-line. If you need C-string
| compatibility (which is fairly likely _somewhere_ in your
| application), ensure that size 15 is encoded as a zero at the
| last byte.
|
| * up to 3.75GB (if my mental math is right), stored with a
| mangled 32-bit size. Alternatively you could use a simpler
| cutoff if it makes the mangling easier. Another possibility
| would be to have a 16-bit size class.
|
| * add support for very large strings (likely with a
| completely different allocation method) too; a 4GB limit
| sucks and is easily exceeded. If need be this can use an
| extra indirection.
|
| Honestly, with a 16-byte budget, I'd consider spending more
| of that on the prefix - you can get 8 bytes with enough
| contortion elsewhere.
|
| Duplicating the prefix is probably worth it in more cases
| than you might think, since it does speed up comparisons.
| Just remember, you have to check the size to distinguish
| "a\0" from "a" too, not just from "a\0\0\0\0".
| ismailmaj wrote:
| I might have missed it in the article but I'm not sure why the
| prefix is stored for strings that can't be inlined.
| pgwhalen wrote:
| Ctrl-F for "Some motivations are as follows" under the "String
| view with short string optimizations" section here:
| https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQ...
|
| Copying here:
|
| > Having the 4-byte prefix directly accessible (without
| indirection through an offset into a separate data buffer) can
| substantially improve the performance of comparisons returning
| false. This prefix can be encoded with multi-column hash keys
| to accelerate aggregations, joins. Sorts would likely also be
| significantly faster with this representation (experiments
| would tell for certain)
|
| > Certain algorithms (for example "prefix of string" or "suffix
| of string" -- e.g. PREFIX("foobar", 3) -> "bar") can execute by
| manipulating StringView values only and not requiring any
| memory copying of large strings.
|
| This document was an early proposal for adding what is now
| called the StringView (and ByteView) types to the Arrow format
| itself.
| make3 wrote:
| the first n bytes are likely by far the most often accessed in
| practices, specifically for sorting & filtering, etc. Storing
| them inline is likely a huge optimization for little cost.
| Rygian wrote:
| > "short string optimization": A short enough string can be
| stored "in place" [...] An optimization that's impossible in
| Rust, by the way ;).
|
| Author is not aware of
| https://docs.rs/compact_str/latest/compact_str/ or
| https://github.com/bodil/smartstring
| diggan wrote:
| That documentation talks about all the benefits and "can mostly
| be used as a drop in replacement for String", but what are the
| tradeoffs? When cannot it be used?
| tialaramex wrote:
| It looks like there's a bunch of "garbage collection" type
| activity where you may have some bytes which were once part
| of a string but aren't now used, and you're always paying for
| the overhead of this optimisation even if it's useless for
| your problem.
|
| Suppose you work only with 500-4000 byte strings, maybe
| they're short reviews, and each ends with a rating in star
| emoji, ***** is the best * is the worst. [[HN ate my star
| emoji of course]]
|
| So your reviews never fit in the "optimised" string slot, but
| also the prefix is just opening words from a review, which in
| some review styles will be the start of seemingly unrelated
| anecdotes. "My grandfather used to tell me" it'll get to the
| review eventually, and you'll see why they're connected, but
| the _suffix_ is useful and that 's not stored in a "German
| string" data structure.
|
| Or maybe you have a high turnover of somewhat related medium
| size strings, so then that garbage collection step costs
| quite a lot of overhead.
| charliermarsh wrote:
| Was this section removed? I'm not seeing it in the linked post.
| tialaramex wrote:
| It seems to be a quote from
| https://cedardb.com/blog/german_strings/ which is about this
| German Strings type (implemented in Polars)
|
| But yeah, it's pretty ignorant to assume Rust can't do this
| since the best available examples (as with many things) are
| in Rust. CompactString is _really_ nice. On a typical modern
| (64-bit) computer CompactString takes 24 bytes and holds up
| to 24 bytes of UTF-8 text inline, while also having a niche.
|
| I guess the confusion arises because C++ people tend to
| assume that anywhere Rust differs from the practice in the
| C++ community it's a mistake, even though that's often
| because C++ made the wrong choice? Rust's &str is "just"
| &[u8] plus a rule about the meaning of these bytes, and
| Rust's String is correspondingly "just" Vec<u8> plus a rule
| about the meaning of those bytes. C++ couldn't have done the
| former because it only belatedly got the fat pointer slice
| reference (as the troubled std::span) years after having a
| string data type.
|
| Rust _didn 't_ do this in the stdlib, but not because it's
| impossible, because it's a trade off and they wanted the
| provided stdlib type to be straightforward. If you need or
| even just want the trade off, you can just cargo add
| compact_str
| ladyanita22 wrote:
| Why is it impossible in Rust? Do you have any source for that?
| steveklabnik wrote:
| It is not. It is not implemented in std::string::String, but
| (as pointed out elsewhere in this thread) there are other
| string implementations that have it.
|
| It was decided explicitly against for the standard library,
| because not every optimization is universally good, and
| keeping String as a thin wrapper over Vec<T> is a good
| default.
| Rygian wrote:
| > When you access a transient string, the string itself won't
| know whether the data it points to is still valid, so you as a
| programmer need to ensure that every transient string you use is
| actually still valid.
|
| In my mind this reads identical to "if you're a security
| practitioner, worry about this bit here."
| renewiltord wrote:
| Haha, quite a huge improvement in a point-release. 0.20.5 to
| 0.20.6.
|
| This is a cool article. Good motivation. Good explanation. Plots.
| ~~One small thing is that the plots are missing a legend so I had
| to hover~~. Nevermind, I don't know why it didn't show (or why I
| thought that) because I can clearly see them on the x-axis now.
___________________________________________________________________
(page generated 2024-08-06 23:00 UTC)