https://fury.apache.org/blog/fury_meta_string_37_5_percent_space_efficient_encoding_than_utf8/

Skip to main content
 
Fury LogoFury Logo
StartIntroductionGuideSpecificationDownloadBlog
ASF

  * Foundation
  * License
  * Events
  * Privacy
  * Security
  * Sponsorship
  * Thanks
  * Code of Conduct

 
[                    ]
All our posts

  * Meta String: A 37.5% space efficient string encoding than UTF-8
    in Fury serialization
  * Fury v0.5.0 Released
  * Apache Fury: A blazing fast multi-language serialization
    framework powered by JIT and zero-copy
  * Fury 0.4.1 Released
  * Fury 0.4.0 Released
  * Fury v0.3.1 released
  * Fury v0.3.0 released
  * Fury v0.2.1 released
  * Fury v0.2.0 released
  * Fury v0.1.2 released
  * Fury v0.1.1 released
  * Fury v0.1.0 released

Meta String: A 37.5% space efficient string encoding than UTF-8 in
Fury serialization

May 6, 2024 * 7 min read
Shawn Yang
Shawn Yang
Apache Fury PPMC Member

Background 

In rpc/serialization systems, we often need to send namespace/path/
filename/fieldName/packageName/moduleName/className/enumValue string
between processes.

Those strings are mostly ascii strings. In order to transfer between
processes, we encode such strings using utf-8 encodings. Such
encoding will take one byte for every char, which is not space
efficient actually.

If we take a deeper look, we will found that most chars are lowercase
chars, ., $ and _, which can be expressed in a much smaller range
0~32. But one byte can represent range 0~255, the significant bits
are wasted, and this cost is not ignorable. In a dynamic
serialization framework, such meta will take considerable cost
compared to actual data.

So we proposed a new string encoding algorithm which we called meta
string encoding in Fury. It will encode most chars using 5 bits
instead of 8 bits in utf-8 encoding, which can bring 37.5% space cost
savings compared to utf-8 encoding.

Meta String Introduction 

Meta string encoding algorithm is mainly used to encode meta strings
such as field names, namespace, packageName, className, path and
filename. Such a string is enumerated and limited, so the encoding
performance is not important since we can cache the encoding result.

Meta string encoding uses 5/6 bits instead of 8 bits in utf-8
encoding for every chars. Since it uses less bits than utf8, it can
bring 37.5% space cost savings compared to utf-8 and has a smaller
encoded binary size, which uses less storage and makes the network
transfer faster.

More details about meta string spec can be found in Fury xlang
serialization specification.

Encoding Algorithms 

String binary encoding algorithm:

        Algorithm           Pattern             Description
                                      every char is written using 5
                                      bits, a-z: 0b00000~0b11001, ._$
                                      |: 0b11010~0b11101, prepend one
LOWER_SPECIAL             a-z._$|     bit at the start to indicate
                                      whether strip last char since
                                      last byte may have 7 redundant
                                      bits(1 indicates strip last
                                      char)
                                      every char is written using 6
                                      bits, a-z: 0b00000~0b11001,
                                      A-Z: 0b11010~0b110011, 0~9:
                                      0b110100~0b111101, ._:
LOWER_UPPER_DIGIT_SPECIAL a-zA-Z0~9._ 0b111110~0b111111, prepend one
                                      bit at the start to indicate
                                      whether strip last char since
                                      last byte may have 7 redundant
                                      bits(1 indicates strip last
                                      char)
UTF-8                     any chars   UTF-8 encoding

If we use LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL, we must add a
strip last char flag in encoded data. This is because every char will
be encoded using 5/6 bits, and the last char may have 1~7 bits which
are unused by encoding, such bits may cause an extra char to be read,
which we must strip off.

Here is encoding code snippet in java, see
org.apache.fury.meta.MetaStringEncoder#encodeGeneric(char[], int) for
more details:

private byte[] encodeGeneric(char[] chars, int bitsPerChar) {
  int totalBits = chars.length * bitsPerChar + 1;
  int byteLength = (totalBits + 7) / 8; // Calculate number of needed bytes
  byte[] bytes = new byte[byteLength];
  int currentBit = 1;
  for (char c : chars) {
    int value =
        (bitsPerChar == 5) ? charToValueLowerSpecial(c) : charToValueLowerUpperDigitSpecial(c);
    // Encode the value in bitsPerChar bits
    for (int i = bitsPerChar - 1; i >= 0; i--) {
      if ((value & (1 << i)) != 0) {
        // Set the bit in the byte array
        int bytePos = currentBit / 8;
        int bitPos = currentBit % 8;
        bytes[bytePos] |= (byte) (1 << (7 - bitPos));
      }
      currentBit++;
    }
  }
  boolean stripLastChar = bytes.length * 8 >= totalBits + bitsPerChar;
  if (stripLastChar) {
    bytes[0] = (byte) (bytes[0] | 0x80);
  }
  return bytes;
}

private int charToValueLowerSpecial(char c) {
  if (c >= 'a' && c <= 'z') {
    return c - 'a';
  } else if (c == '.') {
    return 26;
  } else if (c == '_') {
    return 27;
  } else if (c == '$') {
    return 28;
  } else if (c == '|') {
    return 29;
  } else {
    throw new IllegalArgumentException("Unsupported character for LOWER_SPECIAL encoding: " + c);
  }
}

private int charToValueLowerUpperDigitSpecial(char c) {
  if (c >= 'a' && c <= 'z') {
    return c - 'a';
  } else if (c >= 'A' && c <= 'Z') {
    return 26 + (c - 'A');
  } else if (c >= '0' && c <= '9') {
    return 52 + (c - '0');
  } else if (c == specialChar1) {
    return 62;
  } else if (c == specialChar2) {
    return 63;
  } else {
    throw new IllegalArgumentException(
        "Unsupported character for LOWER_UPPER_DIGIT_SPECIAL encoding: " + c);
  }
}

Here is decoding code snippet in golang, see go/fury/meta/
meta_string_decoder.go:70 for more details:

func (d *Decoder) decodeGeneric(data []byte, algorithm Encoding) ([]byte, error) {
        bitsPerChar := 5
        if algorithm == LOWER_UPPER_DIGIT_SPECIAL {
                bitsPerChar = 6
        }
        // Retrieve 5 bits every iteration from data, convert them to characters, and save them to chars
        // "abc" encodedBytes as [00000] [000,01] [00010] [0, corresponding to three bytes, which are 0, 68, 0
        // Take the highest digit first, then the lower, in order

        // here access data[0] before entering the loop, so we had to deal with empty data in Decode method
        // totChars * bitsPerChar <= totBits < (totChars + 1) * bitsPerChar
        stripLastChar := (data[0] & 0x80) >> 7
        totBits := len(data)*8 - 1 - int(stripLastChar)*bitsPerChar
        totChars := totBits / bitsPerChar
        chars := make([]byte, totChars)
        bitPos, bitCount := 6, 1 // first highest bit indicates whether strip last char
        for i := 0; i < totChars; i++ {
                var val byte = 0
                for i := 0; i < bitsPerChar; i++ {
                        if data[bitCount/8]&(1<<bitPos) > 0 {
                                val |= 1 << (bitsPerChar - i - 1)
                        }
                        bitPos = (bitPos - 1 + 8) % 8
                        bitCount++
                }
                ch, err := d.decodeChar(val, algorithm)
                if err != nil {
                        return nil, err
                }
                chars[i] = ch
        }
        return chars, nil
}

Select Best Encoding 

For most lowercase characters, meta string will use 5 bits to encode
every char. For string containing uppercase chars, meta string will
try to convert the string into a lower case representation by
inserting some markers, and compare used bytes with 6 bits encoding,
then select the encoding which has smaller encoded size.

Here is the common encoding selection strategy:

      Encoding Flag          Pattern         Encoding Algorithm
LOWER_SPECIAL             every char is LOWER_SPECIAL
                          in a-z._|
                          every char is
                          in a-z._      replace first upper case char
FIRST_TO_LOWER_SPECIAL    except first  to lower case, then use
                          char is upper LOWER_SPECIAL
                          case
                                        replace every upper case char
                                        by | + lower case, then use
ALL_TO_LOWER_SPECIAL      every char is LOWER_SPECIAL, use this
                          in a-zA-Z._   encoding if it's smaller than
                                        Encoding
                                        LOWER_UPPER_DIGIT_SPECIAL
                                        use LOWER_UPPER_DIGIT_SPECIAL
LOWER_UPPER_DIGIT_SPECIAL every char is encoding if it's smaller than
                          in a-zA-Z._   Encoding
                                        FIRST_TO_LOWER_SPECIAL
UTF8                      any utf-8     use UTF-8 encoding
                          char
Compression               any utf-8     lossless compression
                          char

For package name, module name or namespace, LOWER_SPECIAL will be
used mostly. ALL_TO_LOWER_SPECIAL can be used too, since it can
represent the same chars as LOWER_SPECIAL without using more bits,
but also support string with uppercase chars.

For className, FIRST_TO_LOWER_SPECIAL will be used mostly. If there
are multiple uppercase chars, then ALL_TO_LOWER_SPECIAL will be used
instead. If a string contains digits, then LOWER_UPPER_DIGIT_SPECIAL
will be used.

Finally, utf8 will be the fallback encoding if the string contains
some chars which is not in range a-z0-9A-Z.

Encoding Flags and Data jointly 

  * Depending on the case, one can choose encoding flags + data
    jointly, using 3 bits of first byte for flags and other bytes for
    data. This can be useful since there are some holes remaining in
    last byte, adding flags in data doesn't always increase
    serialized bytes size.
  * Or one can use a header to encode such flags with other meta such
    as encoded size, this is what Fury does in https://github.com/
    apache/incubator-fury/pull/1556

Benchmark 

utf8 encoding uses 30 bytes for string
org.apache.fury.benchmark.data, fury meta string uses only 19 bytes.
utf8 encoding uses 12 bytes for string MediaContent, but fury meta
string uses only 9 bytes.

// utf8 use 30 bytes, we use only 19 bytes
assertEquals(encoder.encode("org.apache.fury.benchmark.data").getBytes().length, 19);
// utf8 uses 12 bytes, we use only 9 bytes.
assertEquals(encoder.encode("MediaContent").getBytes().length, 9);

Tags:

  * fury

 
Older Post
Fury v0.5.0 Released

  * Background
  * Meta String Introduction
  * Encoding Algorithms
  * Select Best Encoding
  * Encoding Flags and Data jointly
  * Benchmark

Community

  * Mailing list
  * Slack
  * Twitter

Docs

  * Install
  * Usage
  * Benchmark

Repositories

  * Fury
  * Website

Apache Incubator logoApache Incubator logo

Apache Fury is an effort undergoing incubation at The Apache Software
Foundation (ASF), sponsored by the Apache Incubator. Incubation is
required of all newly accepted projects until a further review
indicates that the infrastructure, communications, and decision
making process have stabilized in a manner consistent with other
successful ASF projects. While incubation status is not necessarily a
reflection of the completeness or stability of the code, it does
indicate that the project has yet to be fully endorsed by the ASF.

Copyright (c) 2024 The Apache Software Foundation, Licensed under the
Apache License, Version 2.0.
Apache, the names of Apache projects, and the feather logo are either
registered trademarks or trademarks of the Apache Software Foundation
in the United States and/or other countries.