https://huonw.github.io/blog/2024/03/qr-base10-base64/

Huon on the internet

  * About
  * Blog
  * GitHub
  * Twitter

10 > 64, in QR codes

By Huon Wilson -- 31 Mar 2024

Encoding data in decimal requires many more characters than the same
data encoded in base64--06513249 vs YWJj--but using decimal is better
when stored in a QR code. The magic of QR modes means all those extra
digits are stored efficiently, almost as though there was no encoding
at all. Decimal encoding makes for QR codes that store more data, or
are easier to scan.

In this post, we'll see:

  * how using a decimal encoding (slightly) reduces the density of a
    QR code containing a URL, in practice;
  * why QR code modes make this work: decimal data is all URL-safe
    and stores in a QR code Numeric mode efficiently, while base64
    ends up with 75% waste because it has to be stored in Binary
    mode;
  * how this allows shovelling maximum data into a URL in a QR code.

Two QR codes, the left one is version 16 (81 small squares), while
the right is slightly less dense, version 14 (73 small squares)
base64 decimal

Two QR codes that encode the same data: the left one using base64 is
slightly denser than the right using decimal (compare the 'position'
patterns in the top center). The larger modules (little squares) of
the decimal QR makes for easier scanning.

As a case study, in Mechanical sympathy for QR codes: making NSW
check-in better, I explored the QR codes used for COVID
contact-tracing check-ins here in Australia. It turns out several
states plopped a bunch of information in URLs in their QR codes, as a
base64-encoded JSON blob, presumably because that's convenient.

NSW used URLs like the 228-character https://www.service.nsw.gov.au/
campaign/service-nsw-mobile-app?data=
eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ
==, with that eyJ... being the blob. At error correction level H,
this can be encoded into a 81x81 QR code (version 16).

If we need to continue storing data in JSON like that (that post
describes why we don't), we can do better than base64: by rewriting
the data to decimal, we get the 353-character https://
www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=
072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939.

Anyone sensible would know this is longer... 50% longer. Fortunately,
QR codes transcend sense, so this requires 20% fewer modules (little
squares): it fits into a 73x73 QR code (version 14).

QR codes with fewer modules are easier to scan.

Complicated data in URLs

QR codes can store arbitrary data, but often, they're used to store a
URL, so that anyone can get something useful when they scan by
loading a website. At the same time, a QR code might be placed
somewhere without internet access, and thus also want to include
enough information to allow someone with an appropriate app to
interact with it usefully while offline. This can make for URLs with
a big blob of data in them.

However, a URL is a constrained container: trying to shove arbitrary
data into a URL can hit problems with special characters being
misinterpreted or mangled. Fortunately, URLs are mostly just text,
and there's lots of ways to convert arbitrary data into text: the 
Binary-to-text encoding wikipedia page describes 28, at the time of
writing.

Base64 is arguably one of the most common: it is built into every web
browser, like the one you're using to read this. It encodes 3 bytes
into 4 selected from a reduced 64-character alphabet. It is a
sensible default choice. It can be included literally in JSON, HTML/
XML attributes, CSS and, of course, URLs.

All of this means base64 data can end up in QR codes, where a URL
containing a blob of data is encoded. Unfortunately, base64 is a poor
method for encoding binary data in a QR code: the choice of alphabet
forces the QR code to store data in an needlessly inefficient way.

Encoding schemes

There's many, many ways to encode a set of arbitrary bytes into a
reduced character set. The ones we'll consider in detail are base64,
base10, and base45 from RFC 9285, which is explicitly designed for QR
codes (others like base16 (hex), base32 or base36 are strictly
worse).

encoding        characters          input:output    example    URL
                                        ratio                 safe?
none      all bytes               1                 abc      no
base64    0-9, A-Z, a-z, +, /     1.33              YWJj     partial
base64url 0-9, A-Z, a-z, -, _     1.33              YWJj     yes
base45    0-9, A-Z, $%*+-.:/,     1.5               0EC92    no
          space
base10    0-9                     2.41              06513249 yes

Some of these are "URL-safe": encoded data can be splatted directly
into a URL without requiring any special escaping or handling, while
others are not or might require special handling from a server. We're
talking about putting data into URLs, so URL-safety is a requirement,
and thus the choices to consider are:

  * the base64url URL-safe variant of base64, and
  * base10 (decimal).

I've invented "base10" by treating the bytes as a huge
(little-endian) base-256 integer, and then printing the integer in a
different base. This doesn't scale for large data, but works fine for
the couple of kilobytes that can be stored in a QR code. A Python
version might be:

1  import math
2
3  _DIGITS_PER_BYTE = math.log(2**8, 10)
4
5  def b10encode(data: bytes) -> str:
6      raw = str(int.from_bytes(data, byteorder="little"))
7      # Handle leading zeros so that, for instance, b"", b"\x00"
8      # and b"\x00\x00" each have their own unique encoding, not
9      # just "0".
10     encoded_length = math.ceil(len(data) * _DIGITS_PER_BYTE)
11     prefix = "0" * (encoded_length - len(raw))
12     return prefix + raw
13
14 def b10decode(s: str) -> bytes:
15     # Deduce the length of the result from the input, matching
16     # b64encode's zero-padding (NB. a real implementation
17     # should validate the length is valid)
18     decoded_length = math.floor(len(s) / _DIGITS_PER_BYTE)
19     return int(s).to_bytes(
20         length=decoded_length,
21         byteorder="little",
22     )

The "input:output ratio" column shows number of output characters
required above the number of input characters, on average^0.

For example, for encoding in base10, a long input hits the average
almost exactly, while short inputs can vary more:

input (hexadecimal)           output               ratio
01                  001                        3/1 = 3
12 34 56            05649426                   8/3 = 2.67
FF FE ... 01 00       00000496...7615 (617 digits) 617/256 = 2.41

QR modes

A QR code stored data in a bitstream, with input data encoded in
segments. Each segment of data can be encoded in one of four
different modes. Each mode supports different input alphabets, like
the "Alphanumeric" mode which only supports storing numbers,
upper-case letters and some punctuation, defining a way to map those
characters to bits for storage.

    mode              characters          bits per char overhead
Numeric      10: 0-9                      3.33          0.34%
Alphanumeric 45: 0-9, A-Z, $%*+-.:/ space 5.5           0.15%
Binary       256: arbitrary bytes         8             0%
Kanji        8189^1                       13            0.0041%

Multiple input characters are stored together for the numeric and
alphanumeric modes, which leads to the fractional bits for a single
character. For instance, the numeric mode stores groups of 3 digits
into 10 bits, like 123456 is encoded in two chunks, 123 then 456, in
20 bits.

The overhead captures how much "wastage" there is. Not much, in any
encoding! It's convenient that 10^3 is only slightly less than 2^10,
and 45^2 is less than 2^11.

When a QR code contains multiple segements of data, it still
represents one long stream of characters, so clever choice of splits
allows picking the best mode for each significant substring of the
input.

Decimal, a la mode

The mode required for storing encoded data depends on the output
character set. For the encodings we looked at above^2:

encoding      characters      QR mode
base64url A-Z, a-z, 0-9, -, _ Binary
base10    0-9                 Numeric

Base64url contains lower-case letters its output and thus requires
the Binary mode when stored in a QR code. 3 input bytes (24 bits of
input) turn into 4 output characters, and those 4 characters have to
be stored in 4x8 = 32 bits in the QR code. This makes for an average
overhead of 33%: 1 input byte is stored as 1.33 bytes in the QR code.
Each byte could store up to 256 different values, but the base64
encoding only uses 64 of them. That's wasting 75% of the values, or 2
bits in every stored byte.

For base10, the calculation isn't as simple, but we can work through
it:

  * The input will be some number of 8-bit bytes, each with 2^8 = 256
    possible values.
  * Each input byte turns into log(256, 10) [?] 2.408 output digits, on
    average.
  * 3 digits are stored in 10 bits.
  * Putting this together, each input byte requires 10 / 3 * log(256,
    10) [?] 8.027 bits to store.

This works out to 1 input byte being stored as 1.0034 bytes, on
average: an overhead of 0.34%.

This is exactly the overhead of the Numeric mode itself, and there's
no overhead for binary-to-text base10 encoding step! Base10 requires
242% more characters to encode the same data, but those characters
can be stored into a QR code efficiently. There's no waste when
stored.

Putting it all together

Let's go back to our URL from before: https://www.service.nsw.gov.au/
campaign/service-nsw-mobile-app?data=
eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ
==

The data= parameter contains some base64-encoded JSON. I explored in
an earlier earlier post how this can be encoded better without JSON
or base64... but let's assume that it needs to remain. We can take the
eyJ0I... blob and decode it to the underlying JSON:
{"t":"covid19_business","bid":"121321","bname":"Test NSW Government
QR code","baddress":"Business address goes here "}.

Running these bytes through the b10encode function above gives a
rather long number:
072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939.

The decimal encoding looks long to a human, but it is simpler for the
QR code encoding, only consuming barely more space than the raw JSON
requires[^base16-base32-base36]:

     string           length
raw JSON          118 bytes
base64 encoding   160 characters
base64 QR storage 160 bytes
base10 encoding   285 digits
base10 QR storage 119 bytes

After publishing this article, people asked about other encodings
like base16 (hex), base32 and base36 (which fit into the Alphanumeric
mode), or base 8 (fitting into Numeric mode but being easier to work
with than base 10). These are less efficient^3 than base10 using
Numeric mode, but are potentially more convenient in some cases.

In a URL, the rest of the URL is not purely numeric, so actually
seeing the benefits of this encoding requires using two segments:

  * one with the "boring" bits of the URL at the start, likely using
    the Binary mode
  * one with the big blob of base10 data, using the Numeric mode

Using Segno 1.6.1, this is possible to guarantee^4 like:

1 url_prefix = "https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data="
2 encoded_data = b10encode(data)
3
4 # Two segments, so that they're encoded with separate modes
5 qr = segno.make([url_prefix, encoded_data])

The NSW COVID check-in QR codes used error correction level H. Using
this, and just swapping the encoding of the data parameter, we see
the result from the start of the post:

Two QR codes, same as the ones at the start of the blog post base64
base10

Two QR codes that encode the same data: the left uses base64, giving
a 81x81 QR code (version 16), while the right uses base10, giving a
73x73 QR code (version 14).

Extremes

Being clever about modes, encodings and segments allows one to meet
the limits of the QR code format. QR codes can store up to 23.6
kilobits [?] 3.0 KB, by using version 40 with error correction level L.
This translates into approximately 7k decimal digits in the Numeric
mode, or 3k bytes in the Binary mode.

We can try this out with some test URLs like http://example.com/
{encoded_data}. The maximum we can fit into a QR code looks like:

encoding  input data URL length
base64url 2.2 KB     2.9k
base10    2.9 KB     7.0k

The much longer URL of base10 theoretically doesn't matter: the QR
mode means it squishes in very efficiently.

However, after publishing this article, it was pointed out that iOS
doesn't correctly scan the huge base10 URL, resolving to just http://
example.com instead of http://example.com/310.... So URL length does
matter! It seems like this is a different length limit to normal URL
restrictions, since 8KB is a reasonable lower limit for "normal"
browsers these days, so 7k characters should be safe. Pasting the
same link into Safari directly works fine!

Thus, if you're (ab)using this trick, caveat QRtor: do your own
testing.

Two QR codes, both version 40 (177 squares on each side). The one on
the left uses base64, the one on the right . base64 decimal

Two maxed-out QR codes that both only slightly different to most
humans, but you and I know the second one contains 33% more data.

Summary

Paying attention to the details of the QR code format allow being
more efficient. The trick in this post is using the Decimal mode to
store arbitrary binary data in URLs, with minimal overhead. This
works because QR code modes allow packing particular character sets
into the underlying QR bitstream efficiently.

This allows reducing the density of a QR code to make it easier to
scan, or jamming in more data.

Comments:

  * Hacker News
  * Lobsters

Share this on

  * Twitter
  * Facebook
  * Reddit

 0. The average number of output characters required for each input
    is log(size of input alphabet) / log(size of output alphabet).
    For instance, for encoding bytes to decimal, log(256) / log(10) =
    log10(256) = 2.4082.... -

 1. The 8189 count was computed by this Python one-liner: len({(diff
    >> 8) * 0xc0 + (diff & 0xff) for diff in [*(code - 0x8140 for
    code in range(0x8140, 0x9ffc+1)), *(code - 0xc140 for code in
    range(0xe040, 0xebbf+1))]}). It naively computes the number of
    unique outputs of Segno's Kanji encoder, assuming that all inputs
    from 0x8140 to 09FFC and 0xE040 to 0xEBBF are valid. -

 2. This only uses two of the modes: what about Alphanumeric and
    Kanji? According to encodeURIComponent the largest subset of
    Alphanumeric that's URL safe is only 39 characters (0-9, A-Z,
    *-.), and losing 5 characters introduces an additional 3.9%
    overhead, similar to how going from 256 characters to 64 for
    base64 has 33% overhead. It seems like Kanji are similarly not
    URL safe, and thus violate that requirement too. -

 3. This table shows the values for all those options, including the
    results on the example JSON data, sorted by overhead / storage
    size:

                                        example  example QR  example
    base      mode         overhead     encoding   storage      QR
                                                   (bytes)   version
    10    Numeric      0.35%            285      119         14
                                        digits
                       3.9%             179
    39    Alphanumeric log(45) / log    chars    124         14
                       (39) - 1
                       6.2%             183
    36    Alphanumeric log(45) / log    chars    126         15
                       (36) - 1
                       10%              189
    32    Alphanumeric 5 bits into 1    chars    130         15
                       char (5.5 bits)
                       11%
    8     Numeric      3 bits into 1    315      132         15
    bits               digit (3.33      digits
                       bits)
    8                  25%              354
    bytes Numeric      8 bits into 3    digits   148         15
                       digits (10 bits)
    64    Binary       33%              160      160         16
                                        chars
                       37.5%
    16    Alphanumeric 1 byte (8 bits)  236      163         16
                       into 2 chars (11 digits
                       bits)

    There's two variants of base 8: encoding each byte as its own 3
    digit base-8 number (bytes), or ignoring byte boundaries and
    encoding 3 bits at a time (bits) which is similar to how base 10
    and base 36 work. This was suggested by a reader.

    Base 39 is maximal URL-safe subset of the Alphanumeric character
    set^2. -

 4. In the QR code standard (ISO/IEC 18004:2015(E)), Annex J
    "Optimisation of bit stream length" suggests a method to
    automatically split up an input stream to optimise the modes, so
    a sufficiently smart library could just take a single string and
    arrange the segments itself. -

 

I'm Huon Wilson @huon_w, a mathematically and statistically inclined
software engineer. I have been a long-term volunteer on Rust's core
team, a compiler engineer on the Swift team at Apple, and a senior
software engineer at CSIRO's Data61, working on the StellarGraph
graph machine learning library.

Latest posts

  * Mechanical sympathy for QR codes: making NSW check-in better

    QR codes are now critical infrastructure here in NSW, Australia.
    Let's learn how to make them better.

  * QR error correction helps and hinders scanning

    A QR code can use one of four error correction levels. Higher
    error correction forces denser codes, but allows scanning in more
    situations. A trade-off!

  * The joy of cooking (an app)

    I've been writing a bunch of code designed to enhance the life of
    just me and my wife. It's great fun.

Huon Wilson -- 2021
hit counter