[HN Gopher] Meta String: A more space-efficient string encoding ...
       ___________________________________________________________________
        
       Meta String: A more space-efficient string encoding than UTF-8 in
       Fury
        
       Author : chaokunyang
       Score  : 56 points
       Date   : 2024-05-07 14:02 UTC (8 hours ago)
        
 (HTM) web link (fury.apache.org)
 (TXT) w3m dump (fury.apache.org)
        
       | chaokunyang wrote:
        | In rpc/serialization systems, we often need to send namespace/
        | path/filename/fieldName/packageName/moduleName/className/
        | enumValue strings between processes.
        | 
        | Those strings are mostly ASCII strings. To transfer them
        | between processes, we encode such strings using UTF-8. That
        | encoding takes one byte for every char, which is not actually
        | space efficient.
        | 
        | If we take a deeper look, we find that most chars are
        | lowercase letters plus '.', '$' and '_', which can be
        | expressed in a much smaller range, 0~31. But one byte can
        | represent the range 0~255, so the significant bits are
        | wasted, and this cost is not ignorable. In a dynamic
        | serialization framework, such meta can take considerable cost
        | compared to the actual data.
        | 
        | So we proposed a new string encoding in Fury which we call
        | meta string encoding. It encodes most chars using 5 bits
        | instead of the 8 bits UTF-8 uses, which brings 37.5% space
        | savings compared to UTF-8.
        | 
        | For strings that can't be represented in 5 bits, we also
        | proposed an encoding using 6 bits, which brings 25% savings.
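         
        A minimal sketch of the 5-bit packing idea described above (the
        alphabet and zero-padding here are my own illustrative assumptions,
        not Fury's actual table):

```python
# Illustrative 32-char alphabet (assumption, not Fury's real table).
ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|01"
ENC = {c: i for i, c in enumerate(ALPHABET)}

def encode5(s: str) -> bytes:
    """Pack each char as a 5-bit code, MSB-first, zero-padding the last byte."""
    bits, nbits, out = 0, 0, bytearray()
    for ch in s:
        bits = (bits << 5) | ENC[ch]
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

s = "org.apache.fury"
print(len(s.encode("utf-8")), "->", len(encode5(s)))  # 15 -> 10 bytes
```

        For lengths divisible by 8 the saving is exactly 37.5% (8 chars
        become 5 bytes); padding makes very short strings save a bit less.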
        
         | masfuerte wrote:
         | You don't include '/' in the list of special characters so how
         | are you using this to encode paths? As an array of strings?
        
           | chaokunyang wrote:
            | We can use `|` instead; `|` is not used in paths.
            | Currently we have only 2 chars that can be extended. I
            | haven't decided what to include
        
       | chaokunyang wrote:
       | For spec details, please see fury meta string spec:
       | https://fury.apache.org/docs/specification/fury_xlang_serial...
        
       | chaokunyang wrote:
        | For implementation details, https://github.com/apache/incubator-
        | fury/blob/main/java/fury... can be taken as an example
        
       | agilob wrote:
        | If you're interested in this topic, there exist more string
       | compression algorithms that save data transfer or memory in
       | different ways. Look into Thrift Compact Protocol, MessagePack,
       | Protocol Buffers and Avro.
        
         | chaokunyang wrote:
          | Protobuf only supports varint encoding; Fury supports that
          | too. But IMO protobuf uses UTF-8 for string encoding. Could
          | you share more details on how protobuf compresses strings?
        
         | chaokunyang wrote:
          | And Fury meta string is not used to compress the data
          | plane; it's used to compress meta in the data plane.
        
       | bawolff wrote:
       | Seems like they are just using a custom 5 bit charset for strings
       | with a limited alphabet.
       | 
       | That's not really a rethink of string encoding, just using a
       | different fixed encoding.
       | 
        | Given the use case, not sure I understand why not just use
        | static huffman.
        
         | chaokunyang wrote:
          | We tried huffman first, but we can't know which strings
          | will be serialized. If using huffman, we must collect most
          | strings and compute a static huffman tree. If not, we must
          | send the huffman stats to the peer, and that cost will be
          | bigger.
        
           | masklinn wrote:
           | > If using huffman, we must collect most strings and compute
           | a static huffman tree.
           | 
           | Isn't that essentially what you've done to define your
           | bespoke alphabets / partial encodings?
        
             | chaokunyang wrote:
              | The thing is that we don't know the frequency with
              | which every char occurs
        
               | ryan-c wrote:
               | You don't need to. You compute a static huffman tree over
               | fixed probabilities you've estimated or generated from a
               | corpus.
        
               | chaokunyang wrote:
                | Good suggestion, I was planning it as an optional
                | feature. We can crawl some GitHub repos, extract most
                | types, extract their class names, paths and field
                | names, and use such data as the corpus
        
               | ryan-c wrote:
                | Crude static Huffman example, that could definitely
                | be improved:
                | 
                |    5 bits  0 0000-1111      acde fghi lmno rstu
                |    7 bits  10 00000-11111   bjkp qvwx yzAB CDEF GHIK LMNO PRST UVWY
                |    7 bits  110 0000-1111    JQXZ 0123 4567 89/.
                |    8 bits  111 00000-11110  !@#$ $%^& *()_ {}[] `~+= |\"' ;:<> ,?  (space at the end)
                |   16 bits  11111111 00000000-11111111  plus 1 to 256 unicode characters encoded as UTF-8
               | 
               | You could even include some bigrams in this sort of
               | table.
               | 
               | There's some code here that could maybe be used for that
               | sort of static huffman tree:
               | 
               | https://github.com/ryancdotorg/lztp/blob/main/generate-
               | seed....
               | 
                | Alternatively, have something like this:
                | 
                |   00 000000-111111  1-64 characters, 5 bits each
                |   01 000000-111111  1-64 characters, 6 bits each
                |   10 000000-111111  1-64 characters, ~6.4 bits each
                |      (character set of 84 characters, every 5 packed into 32 bits)
                |   11 000000-111111  1-64 characters, UTF-8
               | 
               | This is vaguely similar to how data is encoded for QR
               | codes.
               | 
               | As pointed out elsewhere, this will not outperform zstd
               | with a dictionary trained on a corpus, but zstd _would_
               | require pulling in a lot of code.
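         
                To make the mechanism concrete, here's a toy static
                prefix code in the same spirit (the table below is an
                illustrative choice of mine, not the one sketched
                above): frequent characters get a short 5-bit code,
                everything else escapes to a 9-bit literal.

```python
# Assumed-frequent characters (illustrative): code is '0' + 4-bit index.
COMMON = "etaoinshrdlu."

def encode_bits(s: str) -> str:
    """Return the encoding as a '0'/'1' string (prefix-free by construction)."""
    out = []
    for ch in s:
        if ch in COMMON:
            out.append("0" + format(COMMON.index(ch), "04b"))
        else:
            out.append("1" + format(ord(ch), "08b"))  # escape: 9-bit raw byte
    return "".join(out)

print(len(encode_bits("org.apache.fury")))  # 95 bits vs 120 for UTF-8
```

                Because the first bit disambiguates the two code
                lengths, a decoder can walk the bit stream
                unambiguously; a real table would tier the lengths as
                in the sketch above.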
        
               | masklinn wrote:
               | > This is vaguely similar to how data is encoded for QR
               | codes.
               | 
               | Doesn't QR use variable static encodings (alphabets)?
               | e.g. mode 1 is numerics, mode 2 is uppercase
               | alphanumeric, mode 3 is binary, and mode 4 is jis
               | (japanese).
        
               | ryan-c wrote:
               | QR codes use a series of (mode, length, characters)
               | segments - a given code can switch between modes. I think
               | for QR codes the length encoding is mode-specific.
        
           | topaz0 wrote:
           | The fact that the example is "org.apache.fury.benchmark.data"
           | makes me think that there is probably a lot of room to do
           | better than 5 bits per char, if the goal is simply to reduce
           | space. E.g. you probably know ahead of time that the string
           | "org.apache.fury" is going to be very common in this context,
           | and giving that a 2-bit or even an 8-bit codepoint would be a
           | huge space savings over the project. That said, it's not
           | super surprising that you'd rather have a suboptimal encoding
           | than have the complexity of a decoder that depends on the
           | project. But of course you've already chosen to sacrifice
           | simplicity for the sake of space to some degree by rolling
           | your own encoding when you could use readily agreed-upon
           | ones, which people assume is because space is at a premium.
            | It's not surprising, then, that the reader wonders why
            | this space savings is worth the complexity while that one
            | is not. I think what's missing is the justification of
           | the tradeoffs, at least in the blog post. Both the space vs
           | time tradeoffs which I can imagine are important here, and
           | the space vs. complexity tradeoffs that are more subjective.
        
             | chaokunyang wrote:
              | We have such options in Fury too. We support registering
              | classes; in such cases, the string will be encoded with
              | a varint id. It's a kind of dict encoding. But not all
              | users like to register classes. Meta string tries to
              | reduce space cost when such dict encoding is not
              | available.
              | 
              | Your suggestions are great. I should clarify why dict
              | encoding can't always be used to reduce cost.
              | 
              | As for performance, it won't be an issue. The strings
              | for meta encoding are limited, so we can cache the
              | encoded results. It's not in the critical path.
        
         | chaokunyang wrote:
          | We first considered some lossless compression algorithms,
          | but since we are just encoding meta strings such as
          | classname/fieldname/packagename/path, there isn't much
          | duplication pattern to detect, so compression won't have a
          | very good effect. If we were encoding generic strings,
          | compression algorithms such as zstd/deflate would be better,
          | but that's not the case here. This is why we name it meta
          | string.
        
         | smsm42 wrote:
          | It looks like it's the same idea as static Huffman without
          | the variable-length part. Comparing it to UTF-8 is
          | pointless, of course - UTF-8 is a universal text
          | representation (whose major selling point is that it's also
          | ASCII-compatible and human-readable), and this is pretty
          | much a compression algorithm. It is obvious that natural-
          | language text is redundant, so no wonder it's compressible,
          | and I am sure a lot of literature exists on which
          | compression works best on which kinds of text.
        
       | vlovich123 wrote:
       | When doing ad-hoc space optimized encodings, it's usually a good
       | idea to compare the compress+decompress speed in addition to the
       | resultant size against just compressing the data. This may win on
       | one-off encodings of strings within an RPC, but the compression
       | mechanism probably wins on the overall RPC message unless the
       | message is very small & you can't build a more optimized baseline
       | dictionary for small messages.
       | 
        | It's really hard to actually beat something like zstd with
        | these ad-hoc compression schemes, since the compressed size
        | will be bigger than what zstd produces with a dictionary (or
        | on larger messages without one) AND a round trip through zstd
        | is likely faster.
        
         | chaokunyang wrote:
          | It's not designed to beat zstd; those are used for different
          | scenarios. zstd is used for data compression; the meta
          | string encoding here is used for meta data encoding. Meta
          | data here is classname/fieldname/packagename/namespace/path/
          | modulename mostly. For such small strings, any statistical
          | compression won't have a smaller size.
        
           | vlovich123 wrote:
           | > For such small string, any statistical compression won't
           | have a smaller size
           | 
           | That is a statement you can make, but you have to actually
           | demonstrate this is true against zstd's --train mechanism [1]
           | which generates a more optimized "small string" dictionary
           | based on representative training data precisely for this use-
           | case.
           | 
           | [1]
           | https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md
           | 
            | > those are used for different scenarios. zstd is used for
            | data compression; the meta string encoding here is used
            | for meta data encoding
           | 
           | Not sure what this means. The motivating use-case described
           | for the alternate string encoding is precisely compression.
        
           | taeric wrote:
           | I'm curious on this. Why wouldn't statistical compression go
           | smaller? Assuming you restrict the statistics in question to
           | be tailored to the metadata that is being compressed, it
           | feels like it should be comparable? Yes, you may need/want to
            | store a precomputed dictionary at the receiving end so
            | that you aren't sending it every time, but that is really
            | no different from having custom code that you need at the
            | receiving end. Right?
        
           | Dylan16807 wrote:
           | > For such small string, any statistical compression won't
           | have a smaller size.
           | 
           | You can hardcode expected frequencies and throw arithmetic
           | encoding at it and the average size will probably drop a
           | meaningful amount.
           | 
           | And I can't easily find an example corpus, but the
           | description of these strings sounds like they'd often have
           | repetition, so another symbol to encode repetition and make
           | this into a normal compression algorithm is probably worth
           | it.
           | 
            | I wonder how many of these strings start with org.apache
        
         | chaokunyang wrote:
          | We use meta string mainly for object serialization. Users
          | pass an object to Fury, and we return a binary to them. In
          | such cases, the serialized binary is mostly in 200~1000
          | bytes. Not big enough for zstd to work.
          | 
          | For dictionary encoding, Fury uses similar tech too. When a
          | string first comes up, we encode it as binary; later writes
          | in the same object graph will write such a string as a
          | varint id.
          | 
          | Zstd can be used with Fury too, for multiple serializations
          | across different objects. Fury doesn't have a context for
          | compression unless users enable Fury meta share mode. If
          | meta share mode is not enabled, zstd can be used to compress
          | data across multiple object graph serializations.
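         
          The first-occurrence dictionary trick described here can be
          sketched as follows (the wire tags and varint layout are
          illustrative, not Fury's actual format): the first time a
          string appears we write its bytes, and later occurrences are
          replaced by a small varint id.

```python
def write_varint(n: int) -> bytes:
    """LEB128-style varint: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_stream(strings):
    seen = {}
    out = bytearray()
    for s in strings:
        if s in seen:
            out += b"\x01" + write_varint(seen[s])  # back-reference by id
        else:
            seen[s] = len(seen)
            data = s.encode("utf-8")
            out += b"\x00" + write_varint(len(data)) + data  # literal
    return bytes(out)

enc = encode_stream(["org.apache.fury", "org.apache.fury", "name"])
print(len(enc))  # repeat occurrence costs 2 bytes instead of 17
```

          The second occurrence of "org.apache.fury" shrinks from 17
          bytes (tag + length + 15 chars) to 2 bytes (tag + id).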
        
           | vlovich123 wrote:
           | > In such cases, the serialized binary are mostly in 200~1000
           | bytes. Not big enough for zstd to work
           | 
           | You're not referring to the same dictionary that I am. Look
           | at --train in [1].
           | 
           | If you have a training corpus of representative data, you can
           | generate a dictionary that you preshare on both sides which
           | will perform much better for very small binaries (including
           | 200-1k bytes). It's the same kind of technique but zstd's
           | mechanism is absolutely general whereas purpose-built non-
           | entropy based dictionaries are more limited & less likely to
           | outperform.
           | 
           | If you want maximum flexibility (i.e. you don't know the
           | universe of representative messages ahead of time or you want
           | maximum compression performance), you can gather this corpus
           | transparently as messages are generated & then generate a
           | dictionary & attach it as sideband metadata to a message.
           | You'll probably need to defer the decoding if it references a
           | dictionary not yet received (i.e. send delivers messages out-
           | of-order from generation). There are other techniques you can
           | apply, but the general rule is that your custom encoding
           | scheme is unlikely to outperform zstd + a representative
           | training corpus. If it does, you'd need to actually show this
           | rather than try to argue from first principles.
           | 
           | [1]
           | https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md
        
             | irq-1 wrote:
             | Wow. Didn't know about Train and custom dictionaries. Very
             | cool.
        
       | eps wrote:
       | It's a compression via domain-specific frequency encoding.
       | 
       | Good, but that's a pretty common technique for cases where every
       | bit counts.
        
         | chaokunyang wrote:
          | Yes, in many rpc/serialization systems, we need to serialize
          | class names, field names and package names so that we can
          | support dynamic deserialization and type forward/backward
          | compatibility. In such cases, that meta may take more bits
          | than the actual data; this is where meta string brings gains.
        
       | vbezhenar wrote:
        | What about using gzip on messages? I'd expect it to yield
        | similar savings while keeping the message format simpler.
        
         | masklinn wrote:
         | You'd never use gzip for that, it has two dozen bytes of
         | overhead before you can even start building a deflate stream.
         | Even a raw deflate stream might not recoup the overhead on
         | values which are only a few dozen bytes at the most, and if
         | you're not using a static huffman tree the overhead for the
         | tree is going to end all possibility of a gain.
         | 
         | Furthermore LZ constructions work off of redundancy, which is
         | difficult to find in large amounts in such short strings. The
         | overhead of framing will just increase the size of the payload
         | after construction.
         | 
         | You _could_ define just a static huffman tree and make that
         | part of your spec.
        
           | chaokunyang wrote:
            | Yes, totally agree. I tried gzip first, but it introduces
            | about 10~20 bytes of cost up front. So I proposed this
            | meta encoding. If we had a much bigger string, gzip would
            | be better, but all the strings we want to compress here
            | are just classname/fieldname/packagename/namespace/path/
            | modulename. We don't have enough repeated patterns for
            | gzip to work.
        
             | Dylan16807 wrote:
             | If you rip off the gzip headers and footers, DEFLATE
             | applied to a small string only has 10 bits of overhead, 2
             | bytes with padding.
             | 
             | The default Huffman table is not suited for this use case,
             | and a Huffman table in general is a waste of space these
             | days, but the overhead is very small and can be reduced to
             | nothing if you know the original string length.
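         
              A quick check of the framing-overhead point, using
              Python's zlib (raw deflate via a negative wbits value)
              against the gzip container:

```python
import gzip
import zlib

def raw_deflate(data: bytes) -> bytes:
    # wbits=-15 produces a raw deflate stream with no zlib/gzip framing.
    c = zlib.compressobj(9, zlib.DEFLATED, -15)
    return c.compress(data) + c.flush()

s = b"org.apache.fury.benchmark.data"
raw, gz = raw_deflate(s), gzip.compress(s, 9)
print(len(s), len(raw), len(gz))  # gzip adds an 18-byte header/trailer
```

              The gzip container alone accounts for 18 bytes (10-byte
              header plus CRC32 and size trailer), which is why it
              can't pay for itself on strings this short.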
        
         | chaokunyang wrote:
          | Using gzip on the whole stream will introduce extra cpu
          | cost; we try to make our serialization implementation as
          | fast as possible. If we apply gzip only to such small meta
          | strings, it emits a bigger size. I believe some stats are
          | written in the data.
        
       | NohatCoder wrote:
       | I can't help but feel like something has gone fundamentally wrong
       | when there are so many arbitrary strings in the system that this
       | optimisation makes a tangible difference.
        
         | chaokunyang wrote:
          | Meta string is not designed for encoding arbitrary strings;
          | it is used to encode limited, enumerated strings only, such
          | as classname/fieldname/packagename/namespace/path/modulename.
          | This is why we name it meta string: it is used in cases
          | where lossless statistical compression can't provide a gain.
        
         | smsm42 wrote:
         | Why is it wrong? It is a widely known fact that texts are
         | redundant and compress well. Many systems work with texts and
         | there's an advantage when those texts are human-comprehensible.
         | You can make a system which would auto-compress and decompress
         | any text of sufficient length - and some do - but there's
         | absolutely no surprise that texts are compressible and it
         | doesn't indicate there's something wrong. Texts are not
         | optimized for space, they are optimized for human convenience.
        
           | zokier wrote:
           | But these are not general text, they are identifiers. If you
           | are building space-efficient RPC system then I do think it is
           | reasonable to question why are these identifiers passed as
           | strings in such quantity that it makes a difference.
           | 
            | Off the top of my head, a better approach could be to have
            | some
           | mechanism to establish mapping from these human-readable
           | identifiers to numeric identifiers (protocol handshake or
           | some other schema exchange), and then use those numeric
           | identifiers in the actual high-volume messages.
           | 
           | edit: umm... seems like fury is doing something like that
           | already https://fury.apache.org/docs/guide/java_object_graph_
            | guide#m... so I am a bit puzzled whether this saving
            | really makes a meaningful difference?!
        
             | smsm42 wrote:
             | Identifiers are textual also (though shorter). You don't
             | have arbitrary byte strings as identifiers. And yes, if you
              | build RPC specifically, then you have to be ready for
              | the fact that you will have to deal with symbolic
              | identifiers.
             | You can work around that by transcoding them somehow into
             | an efficient format - like protobuf does by forcing the
             | schema to provide translation, or by additional
             | communication as you mentioned, but it's unavoidable that
             | the symbolic identifiers will pop up somewhere and they
             | will be human-readable and thus redundant.
             | 
             | > I am bit puzzled if this saving really makes meaningful
             | difference?!
             | 
              | They make a somewhat misleading claim - "37.5% space
              | efficient against UTF-8" - but that doesn't tell us how
              | much gain it is on a typical protocol interaction. It
              | could
             | be 37.5% improvement on 0.1% of all data or on 50% of all
             | data - depending on that, the overall effect could vary
             | drastically. I guess if they are doing it, it makes some
             | sense for them, but it's hard to figure out how much it
             | saves in the big picture.
        
       | amiga386 wrote:
        | Everything old is new again. This reminds me of ZSCII, which was
        | how Infocom games encoded string data to fit ~100kb of text data
        | into a Commodore 64, packing 3 characters into every 2 bytes:
        | 
        |    --first byte-------   --second byte---
        |    7    6 5 4 3 2  1 0   7 6 5  4 3 2 1 0
        |    bit  --first--  --second---  --third--
       | 
        | The 5-bit character encoding was one of three character sets:
        | 
        |    Z-char 6789abcdef0123456789abcdef
        | current   --------------------------
        |   A0      abcdefghijklmnopqrstuvwxyz
        |   A1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
        |   A2      ^0123456789.,!?_#'"/\-:()
        |           --------------------------
       | 
       | Z-char 0 was space in all character sets, z-char 1 was newline,
       | z-chars 2 and 3 switched to one of the other character sets
       | depending on which one you were already in _for the next
       | character only_ , and z-chars 4 and 5 switched permanently, the
       | "shift-lock".
       | 
       | https://inform-fiction.org/zmachine/standards/z1point0/sect0...
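         
        The 3-characters-into-2-bytes packing can be sketched directly
        (a minimal model of the Z-machine layout above; the real spec
        adds shifts, abbreviations and more):

```python
A0 = "abcdefghijklmnopqrstuvwxyz"  # z-chars 6..31 in alphabet A0

def pack(zchars):
    """Pack 5-bit z-chars, three per big-endian 16-bit word; top bit ends it."""
    zs = list(zchars)
    while len(zs) % 3:
        zs.append(5)  # padding (z-char 5 is a shift code in the real spec)
    words = [(zs[i] << 10) | (zs[i + 1] << 5) | zs[i + 2]
             for i in range(0, len(zs), 3)]
    words[-1] |= 0x8000  # high bit of the last word marks end of string
    return b"".join(w.to_bytes(2, "big") for w in words)

def unpack(data):
    out = []
    for i in range(0, len(data), 2):
        w = int.from_bytes(data[i:i + 2], "big")
        out += [(w >> 10) & 0x1F, (w >> 5) & 0x1F, w & 0x1F]
        if w & 0x8000:
            break
    return out

zs = [A0.index(c) + 6 for c in "hello"]
print(len(pack(zs)))  # 5 letters fit in 4 bytes
```

        Five letters thus take 4 bytes instead of 5, and the saving
        grows toward 37.5% for longer runs of a single alphabet.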
        
         | chaokunyang wrote:
         | This is interesting, thanks for sharing this. I will take a
         | deep look at it.
        
       | chubot wrote:
       | Apple's NSString apparently uses an in-memory encoding (not wire
       | encoding) that can go as low as 5 bits per byte, as of OS X 10.10
       | in 2014:
       | 
       | https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-point...
       | 
        | _If the length is between 0 and 7, store the string as raw
        | eight-bit characters._
        | 
        | _If the length is 8 or 9, store the string in a six-bit
        | encoding, using the alphabet "eilotrm.apdnsIc
        | ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX"._
        | 
        | _If the length is 10 or 11, store the string in a five-bit
        | encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013"._
       | 
       | ---
       | 
        | _How about 5 bits? This isn't totally ludicrous. There are
       | probably a lot of strings which are just lowercase, for example.
       | 5 bits gives 32 possible values. If you include the whole
       | lowercase alphabet, there are 6 extra values, which you could
       | allot to the more common uppercase letters, or some symbols, or
       | digits, or some mix._
       | 
       | Related thread -
       | https://lobste.rs/s/5417dx/storing_data_pointers#c_l4zfrv
       | 
       | It's interesting how network formats and in-memory formats kinda
       | converge in design, because retrieving data from memory to CPU is
       | so expensive now.
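         
        A toy version of that tagged-pointer packing (the 5-bit
        alphabet is the one quoted above; the tag/length layout here is
        a simplification of mine, not Apple's actual bit layout):

```python
ALPHA5 = "eilotrm.apdnsIc ufkMShjTRxgC4013"  # 5-bit alphabet from the post

def pack_tagged(s: str) -> int:
    """Pack up to 11 alphabet chars plus a length field into one integer."""
    assert len(s) <= 11 and all(c in ALPHA5 for c in s)
    v = len(s)  # stand-in for the real tag/length bits
    for c in s:
        v = (v << 5) | ALPHA5.index(c)
    return v

v = pack_tagged("plist.path")  # hypothetical 10-char example
print(v.bit_length())  # 54 bits: fits in a 64-bit register with room for tags
```

        Ten characters cost 50 bits, leaving headroom in a 64-bit word
        for the pointer tag, which is the whole point of the trick.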
        
         | arcticbull wrote:
          | ... only in tagged pointers, meaning the total size can't
          | exceed ~60 bits. Beyond that, the canonical representation
          | of NSString is UTF-16 and of Swift strings is UTF-8.
         | 
         | So not "in memory" (although obviously it is in memory) it's
         | when the goal is to fit the totality of the string plus a tag
         | into a register.
         | 
         | If they can't completely encode the 10-11 character value using
         | the 5-bit charset it spills to UTF-16 on the heap.
         | 
         | > It's interesting how network formats and in-memory formats
         | kinda converge in design, because retrieving data from memory
         | to CPU is so expensive now.
         | 
          | Honestly, strings are probably one of the worst examples of
          | this. Strings have like 50 different representations
          | depending on what you're trying to do with them.
         | 
         | 1. Wire format for storage and retrieval is generally UTF-8.
         | 
         | 2. If you're trying to parse a string you generally want to
         | expand it into code points.
         | 
         | 3. If you're trying to sort or compare strings you generally
         | want to expand that into a normalized form like NFC or NFD.
         | 
         | 4. If you're trying to edit text, you generally want grapheme
         | clusters.
         | 
         | 5. If you're trying to present text, you generally want pixels.
         | 
          | The way to think about it is: "there is no such thing as the
          | length of a string without context." Strings are basically a
          | bytecode format for representing concepts.
         | 
         | [1]
         | https://developer.apple.com/documentation/foundation/nsstrin...
        
       ___________________________________________________________________
       (page generated 2024-05-07 23:01 UTC)