[HN Gopher] Meta String: A more space-efficient string encoding ...
___________________________________________________________________
Meta String: A more space-efficient string encoding than UTF-8 in
Fury
Author : chaokunyang
Score : 56 points
Date : 2024-05-07 14:02 UTC (8 hours ago)
(HTM) web link (fury.apache.org)
(TXT) w3m dump (fury.apache.org)
| chaokunyang wrote:
| In rpc/serialization systems, we often need to send
| namespace/path/filename/fieldName/packageName/moduleName/
| className/enumValue strings between processes.
|
| Those strings are mostly ASCII. To transfer them between
| processes, we encode them with UTF-8, which takes one byte per
| char and is not actually space-efficient.
|
| If we take a deeper look, we find that most chars are lowercase
| letters plus '.', '$' and '_', which can be expressed in a much
| smaller range, 0~31. But one byte can represent the range 0~255,
| so the extra bits are wasted, and this cost is not negligible.
| In a dynamic serialization framework, such metadata can take a
| considerable cost compared to the actual data.
|
| So we proposed a new string encoding in Fury, which we call meta
| string encoding. It encodes most chars using 5 bits instead of
| the 8 bits per char of UTF-8, which brings 37.5% space savings
| compared to UTF-8.
|
| For strings that can't be represented in 5 bits, we also propose
| a 6-bit encoding, which brings 25% space savings.
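The 5-bit scheme described above can be sketched as follows; the 29-symbol alphabet here (lowercase plus '.', '_', '$') is an assumption for illustration, not Fury's actual wire format:

```python
# Sketch of a 5-bit "meta string" codec. 29 symbols fit in 5 bits;
# codes are packed MSB-first and the last byte is zero-padded.
CHARSET = "abcdefghijklmnopqrstuvwxyz._$"
CODE = {c: i for i, c in enumerate(CHARSET)}

def encode_meta(s: str) -> bytes:
    buf, bits, out = 0, 0, bytearray()
    for ch in s:
        buf = (buf << 5) | CODE[ch]   # append one 5-bit code
        bits += 5
        while bits >= 8:
            bits -= 8
            out.append((buf >> bits) & 0xFF)
    if bits:                          # flush, zero-padding the last byte
        out.append((buf << (8 - bits)) & 0xFF)
    return bytes(out)

def decode_meta(data: bytes, length: int) -> str:
    buf = int.from_bytes(data, "big")
    total = len(data) * 8
    # char i occupies the 5 bits starting at stream offset 5*i
    return "".join(
        CHARSET[(buf >> (total - 5 * (i + 1))) & 0b11111]
        for i in range(length)
    )

s = "org.apache.fury.benchmark.data"
packed = encode_meta(s)
# 30 chars * 5 bits = 150 bits -> 19 bytes, vs 30 bytes of UTF-8
assert decode_meta(packed, len(s)) == s
```

Note that the decoder needs the character count (or the bit length) out of band, since the zero padding is indistinguishable from the code for 'a'.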
| masfuerte wrote:
| You don't include '/' in the list of special characters so how
| are you using this to encode paths? As an array of strings?
| chaokunyang wrote:
| We can use `|` instead; `|` is not used in paths. Currently we
| have only 2 spare chars that can be added. I haven't decided
| what to include.
| chaokunyang wrote:
| For spec details, please see fury meta string spec:
| https://fury.apache.org/docs/specification/fury_xlang_serial...
| chaokunyang wrote:
| For implementation details, https://github.com/apache/incubator-
| fury/blob/main/java/fury... can be taken as an example.
| agilob wrote:
| If you're interested in this topic, there exist more string
| compression algorithms that save data transfer or memory in
| different ways. Look into Thrift Compact Protocol, MessagePack,
| Protocol Buffers and Avro.
| chaokunyang wrote:
| Protobuf only supports varint encoding, and Fury supports that
| too. But as far as I know, protobuf uses UTF-8 for string
| encoding. Could you share more details on how protobuf
| compresses strings?
| chaokunyang wrote:
| And Fury meta strings are not used to compress the data plane;
| they are used to compress the metadata within the data plane.
| bawolff wrote:
| Seems like they are just using a custom 5 bit charset for strings
| with a limited alphabet.
|
| That's not really a rethink of string encoding, just using a
| different fixed encoding.
|
| Given the use case, not sure I understand why they don't just
| use static Huffman.
| chaokunyang wrote:
| We tried Huffman first, but we can't know which strings will be
| serialized. To use Huffman, we must collect most strings and
| compute a static Huffman tree ahead of time. If not, we must
| send the Huffman stats to the peer, and that cost would be
| bigger.
| masklinn wrote:
| > If using huffman, we must collect most strings and compute
| a static huffman tree.
|
| Isn't that essentially what you've done to define your
| bespoke alphabets / partial encodings?
| chaokunyang wrote:
| The thing is that we don't know the frequency with which every
| char occurs.
| ryan-c wrote:
| You don't need to. You compute a static huffman tree over
| fixed probabilities you've estimated or generated from a
| corpus.
| chaokunyang wrote:
| Good suggestion, I was planning it as an optional feature. We
| can crawl some GitHub repos, extract the most common types,
| their class names, paths and field names, and use such data as
| the corpus.
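Building such a static code from fixed, corpus-derived frequencies can be sketched like this; the frequency table below is hypothetical, standing in for real stats harvested from a corpus of class and package names:

```python
# Build a static Huffman code once from fixed symbol frequencies;
# both peers would hardcode the resulting table.
import heapq
from itertools import count

def build_code(freqs: dict) -> dict:
    tie = count()  # tiebreaker so heapq never compares the dict payloads
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # prefix left-subtree codes with 0, right-subtree codes with 1
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Toy frequencies for identifier-ish text (an assumption, not real stats)
freqs = {"e": 12, "a": 9, "r": 7, ".": 6, "o": 5, "t": 4, "_": 2, "z": 1}
code = build_code(freqs)
# Frequent symbols get short codes, rare symbols long ones
assert len(code["e"]) <= len(code["z"])
```

The point of the tree being static is that only the symbols travel on the wire; the table itself never does.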
| ryan-c wrote:
| Crude static Huffman example, that could definitely be improved:
|
|     5 bits   0    0000-1111    acde fghi lmno rstu
|     7 bits   10   00000-11111  bjkp qvwx yzAB CDEF GHIK LMNO PRST UVWY
|     7 bits   110  0000-1111    JQXZ 0123 4567 89/.
|     8 bits   111  00000-11110  !@#$ $%^& *()_ {}[] `~+= |\"' ;:<> ,?
|                                (space at the end)
|     16 bits  11111111  00000000-11111111 plus UTF-8
|                                1 to 256 unicode characters encoded as UTF-8
|
| You could even include some bigrams in this sort of
| table.
|
| There's some code here that could maybe be used for that
| sort of static huffman tree:
|
| https://github.com/ryancdotorg/lztp/blob/main/generate-
| seed....
|
| Alternatively, have something like this:
|
|     00  000000-111111  1-64 characters, 5 bits each
|     01  000000-111111  1-64 characters, 6 bits each
|     10  000000-111111  1-64 characters, ~6.4 bits each (character
|                        set of 84 characters, every 5 packed into
|                        32 bits)
|     11  000000-111111  1-64 characters, UTF-8
|
| This is vaguely similar to how data is encoded for QR
| codes.
|
| As pointed out elsewhere, this will not outperform zstd
| with a dictionary trained on a corpus, but zstd _would_
| require pulling in a lot of code.
| masklinn wrote:
| > This is vaguely similar to how data is encoded for QR
| codes.
|
| Doesn't QR use variable static encodings (alphabets)?
| e.g. mode 1 is numerics, mode 2 is uppercase
| alphanumeric, mode 3 is binary, and mode 4 is jis
| (japanese).
| ryan-c wrote:
| QR codes use a series of (mode, length, characters)
| segments - a given code can switch between modes. I think
| for QR codes the length encoding is mode-specific.
| topaz0 wrote:
| The fact that the example is "org.apache.fury.benchmark.data"
| makes me think that there is probably a lot of room to do
| better than 5 bits per char, if the goal is simply to reduce
| space. E.g. you probably know ahead of time that the string
| "org.apache.fury" is going to be very common in this context,
| and giving that a 2-bit or even an 8-bit codepoint would be a
| huge space savings over the project. That said, it's not
| super surprising that you'd rather have a suboptimal encoding
| than have the complexity of a decoder that depends on the
| project. But of course you've already chosen to sacrifice
| simplicity for the sake of space to some degree by rolling your
| own encoding when you could use readily agreed-upon ones, which
| people assume is because space is at a premium. So it's natural
| for a reader to wonder whether this much complexity is worth
| this much space savings. I think what's missing is the
| justification of the tradeoffs, at least in the blog post: both
| the space vs. time tradeoffs, which I can imagine are important
| here, and the space vs. complexity tradeoffs, which are more
| subjective.
| chaokunyang wrote:
| We have such options in Fury too. We support registering
| classes; in such cases the string is encoded as a varint id,
| which is a kind of dict encoding. But not all users like to
| register classes. Meta string tries to reduce space cost when
| such dict encoding is not available.
|
| Your suggestions are great. I should clarify why dict encoding
| can't always be used to reduce cost.
|
| As for performance, it won't be an issue. The strings used for
| meta encoding are limited, so we can cache the encoded result;
| it's not on the critical path.
| chaokunyang wrote:
| We first considered using some lossless compression algorithms,
| but since we are just encoding meta strings such as
| classname/fieldname/packagename/path, there isn't much
| duplicated pattern to detect, so the compression won't have a
| very good effect. If we were encoding generic strings,
| compression algorithms such as zstd/deflate would be better, but
| that's not the case here. This is why we named it meta string.
| smsm42 wrote:
| It looks like the same idea as static Huffman without the
| variable-length part. Comparing it to UTF-8 is pointless, of
| course - UTF-8 is a universal text representation (whose major
| selling point is that it's also ASCII-compatible and human-
| readable), and this is pretty much a compression algorithm. It
| is obvious that natural-language text is redundant, so no wonder
| it's compressible, and I am sure a lot of literature exists on
| which compression works best on which kinds of texts.
| vlovich123 wrote:
| When doing ad-hoc space optimized encodings, it's usually a good
| idea to compare the compress+decompress speed in addition to the
| resultant size against just compressing the data. This may win on
| one-off encodings of strings within an RPC, but the compression
| mechanism probably wins on the overall RPC message unless the
| message is very small & you can't build a more optimized baseline
| dictionary for small messages.
|
| It's really hard to actually beat something like zstd with these
| ad-hoc compression schemes, since the compressed size will be
| bigger than what zstd produces with a dictionary (or on larger
| messages without one) AND the speed of a round trip through zstd
| is likely faster.
| chaokunyang wrote:
| It's not designed to beat zstd; the two are used for different
| scenarios. zstd is used for data compression; the meta string
| encoding here is used for metadata encoding. The metadata here is
| mostly classname/fieldname/packagename/namespace/path/modulename.
| For such small strings, statistical compression won't produce a
| smaller size.
| vlovich123 wrote:
| > For such small string, any statistical compression won't
| have a smaller size
|
| That is a statement you can make, but you have to actually
| demonstrate this is true against zstd's --train mechanism [1]
| which generates a more optimized "small string" dictionary
| based on representative training data precisely for this use-
| case.
|
| [1]
| https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md
|
| > those are used for different scenarios. zstd is used for
| data compression, the meta string encoding here is used for
| meta meta encoding
|
| Not sure what this means. The motivating use-case described
| for the alternate string encoding is precisely compression.
| taeric wrote:
| I'm curious about this. Why wouldn't statistical compression go
| smaller? Assuming you restrict the statistics in question to
| be tailored to the metadata that is being compressed, it
| feels like it should be comparable? Yes, you may need/want to
| store a precomputed dictionary at the receiving end so that
| you aren't sending that every time, but that is really no
| different from having custom code that you need at receiving
| end. Right?
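The precomputed-dictionary idea can be demonstrated with zlib's preset-dictionary support from the Python standard library (zstd's trained dictionaries are more sophisticated, but the mechanism, a dictionary preshared out of band, is the same; the sample dictionary below is invented for the sketch):

```python
# Compress a tiny string against a preshared dictionary using raw
# DEFLATE (negative wbits => no zlib framing). Both sides must hold
# the identical dictionary, obtained out of band.
import zlib

dictionary = b"org.apache.fury.benchmark.data org.apache.fury.serializer"

def compress_small(msg: bytes) -> bytes:
    c = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=dictionary)
    return c.compress(msg) + c.flush()

def decompress_small(blob: bytes) -> bytes:
    d = zlib.decompressobj(-15, zdict=dictionary)
    return d.decompress(blob) + d.flush()

msg = b"org.apache.fury.benchmark.data.Sample"
packed = compress_small(msg)
assert decompress_small(packed) == msg
# The long shared prefix is emitted as one back-reference into the
# dictionary, so even a 37-byte message shrinks.
assert len(packed) < len(msg)
```

zstd's `--train` builds the dictionary automatically from sample files instead of hand-picking its contents.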
| Dylan16807 wrote:
| > For such small string, any statistical compression won't
| have a smaller size.
|
| You can hardcode expected frequencies and throw arithmetic
| encoding at it and the average size will probably drop a
| meaningful amount.
|
| And I can't easily find an example corpus, but the
| description of these strings sounds like they'd often have
| repetition, so another symbol to encode repetition and make
| this into a normal compression algorithm is probably worth
| it.
|
| I wonder how many of these strings start with org.apache.
| chaokunyang wrote:
| We use meta string mainly for object serialization. Users pass
| an object to Fury, and we return a binary to them. In such
| cases, the serialized binary is mostly 200~1000 bytes - not big
| enough for zstd to work well.
|
| For dictionary encoding, Fury uses a similar technique too. When
| a string first comes up, we encode it as binary; later writes in
| the same object graph write that string as a varint id.
|
| Zstd can be used with Fury too, for multiple serializations
| across different objects. Fury doesn't have a context for
| compression unless users enable Fury meta share mode. If meta
| share mode is not enabled, zstd can be used to compress data
| across multiple object graph serializations.
| vlovich123 wrote:
| > In such cases, the serialized binary are mostly in 200~1000
| bytes. Not big enough for zstd to work
|
| You're not referring to the same dictionary that I am. Look
| at --train in [1].
|
| If you have a training corpus of representative data, you can
| generate a dictionary that you preshare on both sides which
| will perform much better for very small binaries (including
| 200-1k bytes). It's the same kind of technique but zstd's
| mechanism is absolutely general whereas purpose-built non-
| entropy based dictionaries are more limited & less likely to
| outperform.
|
| If you want maximum flexibility (i.e. you don't know the
| universe of representative messages ahead of time or you want
| maximum compression performance), you can gather this corpus
| transparently as messages are generated & then generate a
| dictionary & attach it as sideband metadata to a message.
| You'll probably need to defer the decoding if it references a
| dictionary not yet received (i.e. send delivers messages out-
| of-order from generation). There are other techniques you can
| apply, but the general rule is that your custom encoding
| scheme is unlikely to outperform zstd + a representative
| training corpus. If it does, you'd need to actually show this
| rather than try to argue from first principles.
|
| [1]
| https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md
| irq-1 wrote:
| Wow. Didn't know about Train and custom dictionaries. Very
| cool.
| eps wrote:
| It's a compression via domain-specific frequency encoding.
|
| Good, but that's a pretty common technique for cases where every
| bit counts.
| chaokunyang wrote:
| Yes, in many rpc/serialization systems, we need to serialize
| class names, field names and package names so we can support
| dynamic deserialization and type forward/backward compatibility.
| In such cases, this meta may take more bits than the actual
| data, and this is where meta string brings gains.
| vbezhenar wrote:
| What about using gzip on messages? I'd expect for it to yield
| similar savings while keeping message format simpler.
| masklinn wrote:
| You'd never use gzip for that, it has two dozen bytes of
| overhead before you can even start building a deflate stream.
| Even a raw deflate stream might not recoup the overhead on
| values which are only a few dozen bytes at the most, and if
| you're not using a static huffman tree the overhead for the
| tree is going to end all possibility of a gain.
|
| Furthermore LZ constructions work off of redundancy, which is
| difficult to find in large amounts in such short strings. The
| overhead of framing will just increase the size of the payload
| after construction.
|
| You _could_ define just a static huffman tree and make that
| part of your spec.
| chaokunyang wrote:
| Yes, totally agree. I tried gzip first, but it introduces about
| 10~20 bytes of overhead up front. So I proposed this meta
| encoding here. If we had much bigger strings, gzip would be
| better, but all the strings we want to compress here are just
| classname/fieldname/packagename/namespace/path/modulename. We
| don't have enough repeated patterns for gzip to work.
| Dylan16807 wrote:
| If you rip off the gzip headers and footers, DEFLATE
| applied to a small string only has 10 bits of overhead, 2
| bytes with padding.
|
| The default Huffman table is not suited for this use case,
| and a Huffman table in general is a waste of space these
| days, but the overhead is very small and can be reduced to
| nothing if you know the original string length.
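The framing overhead being discussed can be measured directly with Python's stdlib: gzip adds an 18-byte header/footer, zlib a 6-byte one, and a raw DEFLATE stream (negative wbits) none at all:

```python
# Compare the same DEFLATE payload under three framings.
import gzip
import zlib

s = b"org.apache.fury.benchmark.data"

gz = gzip.compress(s)                        # 10-byte header + 8-byte footer
zl = zlib.compress(s, 9)                     # 2-byte header + 4-byte Adler-32
c = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw DEFLATE, no framing at all
raw = c.compress(s) + c.flush()

# Identical payload, only the framing differs
assert len(zl) == len(raw) + 6
assert len(gz) == len(raw) + 18
assert zlib.decompress(raw, -15) == s
```

Even the raw stream may fail to shrink a short, low-redundancy identifier, which is the thread's point: for these strings the win has to come from a denser character encoding or a preshared dictionary, not from framing-free LZ.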
| chaokunyang wrote:
| Using gzip on the whole stream would introduce extra CPU cost,
| and we try to make our serialization implementation as fast as
| possible. If we apply gzip only to such small meta strings, it
| emits a bigger size; I believe some stats are written into the
| data.
| NohatCoder wrote:
| I can't help but feel like something has gone fundamentally wrong
| when there are so many arbitrary strings in the system that this
| optimisation makes a tangible difference.
| chaokunyang wrote:
| Meta string is not designed for encoding arbitrary strings. It
| is used to encode a limited, enumerated set of strings only,
| such as classname/fieldname/packagename/namespace/path/
| modulename. This is why we named it meta string: it's used in
| cases where lossless statistical compression can't provide a
| gain.
| smsm42 wrote:
| Why is it wrong? It is a widely known fact that texts are
| redundant and compress well. Many systems work with texts and
| there's an advantage when those texts are human-comprehensible.
| You can make a system which would auto-compress and decompress
| any text of sufficient length - and some do - but there's
| absolutely no surprise that texts are compressible and it
| doesn't indicate there's something wrong. Texts are not
| optimized for space, they are optimized for human convenience.
| zokier wrote:
| But these are not general text, they are identifiers. If you
| are building space-efficient RPC system then I do think it is
| reasonable to question why are these identifiers passed as
| strings in such quantity that it makes a difference.
|
| Off the top of my head, a better approach could be to have some
| mechanism to establish mapping from these human-readable
| identifiers to numeric identifiers (protocol handshake or
| some other schema exchange), and then use those numeric
| identifiers in the actual high-volume messages.
|
| edit: umm... seems like fury is doing something like that
| already https://fury.apache.org/docs/guide/java_object_graph_
| guide#m... so I am a bit puzzled whether this saving really
| makes a meaningful difference?!
| smsm42 wrote:
| Identifiers are textual also (though shorter). You don't
| have arbitrary byte strings as identifiers. And yes, if you
| build RPC specifically, then you have to be ready for the fact
| fact that you will have to deal with symbolic identifiers.
| You can work around that by transcoding them somehow into
| an efficient format - like protobuf does by forcing the
| schema to provide translation, or by additional
| communication as you mentioned, but it's unavoidable that
| the symbolic identifiers will pop up somewhere and they
| will be human-readable and thus redundant.
|
| > I am bit puzzled if this saving really makes meaningful
| difference?!
|
| They make a somewhat misleading claim - "37.5% space
| efficient against UTF-8" - but that doesn't tell us how
| much gain it is in a typical protocol interaction. It could
| be a 37.5% improvement on 0.1% of all data or on 50% of all
| data - depending on that, the overall effect could vary
| drastically. I guess if they are doing it, it makes some
| sense for them, but it's hard to figure out how much it
| saves in the big picture.
| amiga386 wrote:
| Everything old is new again. This reminds me of ZSCII, which was
| how Infocom games encoded string data to fit ~100kb of text data
| into a Commodore 64, packing 3 characters into every 2 bytes.
|     --first byte-------  --second byte---
|     7 6 5 4 3 2 1 0      7 6 5 4 3 2 1 0   bit
|       --first--  --second---  --third--
|
| The 5-bit character encoding was one of three character sets:
|     Z-char   6789abcdef0123456789abcdef
|     current  --------------------------
|       A0     abcdefghijklmnopqrstuvwxyz
|       A1     ABCDEFGHIJKLMNOPQRSTUVWXYZ
|       A2      ^0123456789.,!?_#'"/\-:()
|              --------------------------
|
| Z-char 0 was space in all character sets, z-char 1 was newline,
| z-chars 2 and 3 switched to one of the other character sets
| depending on which one you were already in _for the next
| character only_ , and z-chars 4 and 5 switched permanently, the
| "shift-lock".
|
| https://inform-fiction.org/zmachine/standards/z1point0/sect0...
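The 3-characters-into-2-bytes packing can be sketched like this (alphabet A0 only, padding with shift codes as the Z-machine spec does; a full implementation would also handle the shift z-chars 2-5):

```python
# Pack three 5-bit Z-chars into each 16-bit word, Z-machine style:
# bit 15 is the end-of-string flag, then three 5-bit fields.
A0 = "abcdefghijklmnopqrstuvwxyz"  # z-chars 6..31 map to alphabet A0

def zchar(ch: str) -> int:
    # z-chars 0..5 are reserved (space, newline, shifts)
    return A0.index(ch) + 6

def pack(word: str) -> bytes:
    zchars = [zchar(c) for c in word]
    while len(zchars) % 3:
        zchars.append(5)  # pad the final word with shift codes
    out = bytearray()
    for i in range(0, len(zchars), 3):
        w = (zchars[i] << 10) | (zchars[i + 1] << 5) | zchars[i + 2]
        if i + 3 == len(zchars):
            w |= 0x8000  # set the end-of-string flag on the last word
        out += w.to_bytes(2, "big")
    return bytes(out)

# "hello" (5 chars) pads to 6 z-chars -> 2 words -> 4 bytes, vs 5 in ASCII
assert len(pack("hello")) == 4
```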
| chaokunyang wrote:
| This is interesting, thanks for sharing this. I will take a
| deep look at it.
| chubot wrote:
| Apple's NSString apparently uses an in-memory encoding (not wire
| encoding) that can go as low as 5 bits per byte, as of OS X 10.10
| in 2014:
|
| https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-point...
|
| _If the length is between 0 and 7, store the string as raw
| eight-bit characters._
|
| _If the length is 8 or 9, store the string in a six-bit
| encoding, using the alphabet "eilotrm.apdnsIc
| ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX"._
|
| _If the length is 10 or 11, store the string in a five-bit
| encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013"_
|
| ---
|
| _How about 5 bits? This isn't totally ludicrous. There are
| probably a lot of strings which are just lowercase, for example.
| 5 bits gives 32 possible values. If you include the whole
| lowercase alphabet, there are 6 extra values, which you could
| allot to the more common uppercase letters, or some symbols, or
| digits, or some mix._
|
| Related thread -
| https://lobste.rs/s/5417dx/storing_data_pointers#c_l4zfrv
|
| It's interesting how network formats and in-memory formats kinda
| converge in design, because retrieving data from memory to CPU is
| so expensive now.
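The tagged-pointer trick can be sketched as follows; the exact bit layout here (length in the low 4 bits, 5-bit characters above it) is an illustration, not Apple's actual format, though the 32-character alphabet is the one quoted above:

```python
# Squeeze a short string into one 64-bit word: 11 chars * 5 bits = 55
# bits, leaving room for a small length/tag field.
FIVE_BIT = "eilotrm.apdnsIc ufkMShjTRxgC4013"  # NSString's 5-bit alphabet

def tag_encode(s: str) -> int:
    assert len(s) <= 11 and all(c in FIVE_BIT for c in s)
    word = len(s)  # stash the length in the low 4 bits
    for i, c in enumerate(s):
        word |= FIVE_BIT.index(c) << (4 + 5 * i)
    return word

def tag_decode(word: int) -> str:
    n = word & 0xF
    return "".join(FIVE_BIT[(word >> (4 + 5 * i)) & 31] for i in range(n))

w = tag_encode("proto.calls")  # 11 characters, the maximum
assert w < 2 ** 64            # fits in a single register
assert tag_decode(w) == "proto.calls"
```

Anything outside the alphabet or longer than 11 characters would spill to a real heap allocation, which is exactly the fallback arcticbull describes below.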
| arcticbull wrote:
| ... only in tagged pointers, meaning the total size can't exceed
| roughly 60 bits. Beyond that, the canonical representation of
| NSString is UTF-16, and of Swift strings UTF-8.
|
| So not "in memory" (although obviously it is in memory) it's
| when the goal is to fit the totality of the string plus a tag
| into a register.
|
| If they can't completely encode the 10-11 character value using
| the 5-bit charset it spills to UTF-16 on the heap.
|
| > It's interesting how network formats and in-memory formats
| kinda converge in design, because retrieving data from memory
| to CPU is so expensive now.
|
| Honestly Strings are probably one of the worst examples of
| this. Strings have like 50 different representations depending
| on what you're trying to do with them.
|
| 1. Wire format for storage and retrieval is generally UTF-8.
|
| 2. If you're trying to parse a string you generally want to
| expand it into code points.
|
| 3. If you're trying to sort or compare strings you generally
| want to expand that into a normalized form like NFC or NFD.
|
| 4. If you're trying to edit text, you generally want grapheme
| clusters.
|
| 5. If you're trying to present text, you generally want pixels.
|
| The way to think about it is: "there is no such thing as the
| length of a string without context." Strings are basically a
| bytecode format for representing concepts.
|
| [1]
| https://developer.apple.com/documentation/foundation/nsstrin...
___________________________________________________________________
(page generated 2024-05-07 23:01 UTC)