https://huonw.github.io/blog/2024/03/qr-base10-base64/ Huon on the internet * About * Blog * GitHub * Twitter 10 > 64, in QR codes By Huon Wilson -- 31 Mar 2024 Encoding data in decimal requires many more characters than the same data encoded in base64--06513249 vs YWJj--but using decimal is better when stored in a QR code. The magic of QR modes means all those extra digits are stored efficiently, almost as though there was no encoding at all. Decimal encoding makes for QR codes that store more data, or are easier to scan. In this post, we'll see: * how using a decimal encoding (slightly) reduces the density of a QR code containing a URL, in practice; * why QR code modes make this work: decimal data is all URL-safe and stores in a QR code Numeric mode efficiently, while base64 ends up with 75% waste because it has to be stored in Binary mode; * how this allows shovelling maximum data into a URL in a QR code. Two QR codes, the left one is version 16 (81 small squares), while the right is slightly less dense, version 14 (73 small squares) base64 decimal Two QR codes that encode the same data: the left one using base64 is slightly denser than the right using decimal (compare the 'position' patterns in the top center). The larger modules (little squares) of the decimal QR makes for easier scanning. As a case study, in Mechanical sympathy for QR codes: making NSW check-in better, I explored the QR codes used for COVID contact-tracing check-ins here in Australia. It turns out several states plopped a bunch of information in URLs in their QR codes, as a base64-encoded JSON blob, presumably because that's convenient. NSW used URLs like the 228-character https://www.service.nsw.gov.au/ campaign/service-nsw-mobile-app?data= eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ ==, with that eyJ... being the blob. At error correction level H, this can be encoded into a 81x81 QR code (version 16). If we need to continue storing data in JSON like that (that post describes why we don't), we can do better than base64: by rewriting the data to decimal, we get the 353-character https:// www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data= 072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939. Anyone sensible would know this is longer... 50% longer. Fortunately, QR codes transcend sense, so this requires 20% fewer modules (little squares): it fits into a 73x73 QR code (version 14). QR codes with fewer modules are easier to scan. Complicated data in URLs QR codes can store arbitrary data, but often, they're used to store a URL, so that anyone can get something useful when they scan by loading a website. At the same time, a QR code might be placed somewhere without internet access, and thus also want to include enough information to allow someone with an appropriate app to interact with it usefully while offline. This can make for URLs with a big blob of data in them. However, a URL is a constrained container: trying to shove arbitrary data into a URL can hit problems with special characters being misinterpreted or mangled. Fortunately, URLs are mostly just text, and there's lots of ways to convert arbitrary data into text: the Binary-to-text encoding wikipedia page describes 28, at the time of writing. Base64 is arguably one of the most common: it is built into every web browser, like the one you're using to read this. It encodes 3 bytes into 4 selected from a reduced 64-character alphabet. It is a sensible default choice. It can be included literally in JSON, HTML/ XML attributes, CSS and, of course, URLs. All of this means base64 data can end up in QR codes, where a URL containing a blob of data is encoded. Unfortunately, base64 is a poor method for encoding binary data in a QR code: the choice of alphabet forces the QR code to store data in an needlessly inefficient way. Encoding schemes There's many, many ways to encode a set of arbitrary bytes into a reduced character set. The ones we'll consider in detail are base64, base10, and base45 from RFC 9285, which is explicitly designed for QR codes (others like base16 (hex), base32 or base36 are strictly worse). encoding characters input:output example URL ratio safe? none all bytes 1 abc no base64 0-9, A-Z, a-z, +, / 1.33 YWJj partial base64url 0-9, A-Z, a-z, -, _ 1.33 YWJj yes base45 0-9, A-Z, $%*+-.:/, 1.5 0EC92 no space base10 0-9 2.41 06513249 yes Some of these are "URL-safe": encoded data can be splatted directly into a URL without requiring any special escaping or handling, while others are not or might require special handling from a server. We're talking about putting data into URLs, so URL-safety is a requirement, and thus the choices to consider are: * the base64url URL-safe variant of base64, and * base10 (decimal). I've invented "base10" by treating the bytes as a huge (little-endian) base-256 integer, and then printing the integer in a different base. This doesn't scale for large data, but works fine for the couple of kilobytes that can be stored in a QR code. A Python version might be: 1 import math 2 3 _DIGITS_PER_BYTE = math.log(2**8, 10) 4 5 def b10encode(data: bytes) -> str: 6 raw = str(int.from_bytes(data, byteorder="little")) 7 # Handle leading zeros so that, for instance, b"", b"\x00" 8 # and b"\x00\x00" each have their own unique encoding, not 9 # just "0". 10 encoded_length = math.ceil(len(data) * _DIGITS_PER_BYTE) 11 prefix = "0" * (encoded_length - len(raw)) 12 return prefix + raw 13 14 def b10decode(s: str) -> bytes: 15 # Deduce the length of the result from the input, matching 16 # b64encode's zero-padding (NB. a real implementation 17 # should validate the length is valid) 18 decoded_length = math.floor(len(s) / _DIGITS_PER_BYTE) 19 return int(s).to_bytes( 20 length=decoded_length, 21 byteorder="little", 22 ) The "input:output ratio" column shows number of output characters required above the number of input characters, on average^0. For example, for encoding in base10, a long input hits the average almost exactly, while short inputs can vary more: input (hexadecimal) output ratio 01 001 3/1 = 3 12 34 56 05649426 8/3 = 2.67 FF FE ... 01 00 00000496...7615 (617 digits) 617/256 = 2.41 QR modes A QR code stored data in a bitstream, with input data encoded in segments. Each segment of data can be encoded in one of four different modes. Each mode supports different input alphabets, like the "Alphanumeric" mode which only supports storing numbers, upper-case letters and some punctuation, defining a way to map those characters to bits for storage. mode characters bits per char overhead Numeric 10: 0-9 3.33 0.34% Alphanumeric 45: 0-9, A-Z, $%*+-.:/ space 5.5 0.15% Binary 256: arbitrary bytes 8 0% Kanji 8189^1 13 0.0041% Multiple input characters are stored together for the numeric and alphanumeric modes, which leads to the fractional bits for a single character. For instance, the numeric mode stores groups of 3 digits into 10 bits, like 123456 is encoded in two chunks, 123 then 456, in 20 bits. The overhead captures how much "wastage" there is. Not much, in any encoding! It's convenient that 10^3 is only slightly less than 2^10, and 45^2 is less than 2^11. When a QR code contains multiple segements of data, it still represents one long stream of characters, so clever choice of splits allows picking the best mode for each significant substring of the input. Decimal, a la mode The mode required for storing encoded data depends on the output character set. For the encodings we looked at above^2: encoding characters QR mode base64url A-Z, a-z, 0-9, -, _ Binary base10 0-9 Numeric Base64url contains lower-case letters its output and thus requires the Binary mode when stored in a QR code. 3 input bytes (24 bits of input) turn into 4 output characters, and those 4 characters have to be stored in 4x8 = 32 bits in the QR code. This makes for an average overhead of 33%: 1 input byte is stored as 1.33 bytes in the QR code. Each byte could store up to 256 different values, but the base64 encoding only uses 64 of them. That's wasting 75% of the values, or 2 bits in every stored byte. For base10, the calculation isn't as simple, but we can work through it: * The input will be some number of 8-bit bytes, each with 2^8 = 256 possible values. * Each input byte turns into log(256, 10) [?] 2.408 output digits, on average. * 3 digits are stored in 10 bits. * Putting this together, each input byte requires 10 / 3 * log(256, 10) [?] 8.027 bits to store. This works out to 1 input byte being stored as 1.0034 bytes, on average: an overhead of 0.34%. This is exactly the overhead of the Numeric mode itself, and there's no overhead for binary-to-text base10 encoding step! Base10 requires 242% more characters to encode the same data, but those characters can be stored into a QR code efficiently. There's no waste when stored. Putting it all together Let's go back to our URL from before: https://www.service.nsw.gov.au/ campaign/service-nsw-mobile-app?data= eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ == The data= parameter contains some base64-encoded JSON. I explored in an earlier earlier post how this can be encoded better without JSON or base64... but let's assume that it needs to remain. We can take the eyJ0I... blob and decode it to the underlying JSON: {"t":"covid19_business","bid":"121321","bname":"Test NSW Government QR code","baddress":"Business address goes here "}. Running these bytes through the b10encode function above gives a rather long number: 072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939. The decimal encoding looks long to a human, but it is simpler for the QR code encoding, only consuming barely more space than the raw JSON requires[^base16-base32-base36]: string length raw JSON 118 bytes base64 encoding 160 characters base64 QR storage 160 bytes base10 encoding 285 digits base10 QR storage 119 bytes After publishing this article, people asked about other encodings like base16 (hex), base32 and base36 (which fit into the Alphanumeric mode), or base 8 (fitting into Numeric mode but being easier to work with than base 10). These are less efficient^3 than base10 using Numeric mode, but are potentially more convenient in some cases. In a URL, the rest of the URL is not purely numeric, so actually seeing the benefits of this encoding requires using two segments: * one with the "boring" bits of the URL at the start, likely using the Binary mode * one with the big blob of base10 data, using the Numeric mode Using Segno 1.6.1, this is possible to guarantee^4 like: 1 url_prefix = "https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=" 2 encoded_data = b10encode(data) 3 4 # Two segments, so that they're encoded with separate modes 5 qr = segno.make([url_prefix, encoded_data]) The NSW COVID check-in QR codes used error correction level H. Using this, and just swapping the encoding of the data parameter, we see the result from the start of the post: Two QR codes, same as the ones at the start of the blog post base64 base10 Two QR codes that encode the same data: the left uses base64, giving a 81x81 QR code (version 16), while the right uses base10, giving a 73x73 QR code (version 14). Extremes Being clever about modes, encodings and segments allows one to meet the limits of the QR code format. QR codes can store up to 23.6 kilobits [?] 3.0 KB, by using version 40 with error correction level L. This translates into approximately 7k decimal digits in the Numeric mode, or 3k bytes in the Binary mode. We can try this out with some test URLs like http://example.com/ {encoded_data}. The maximum we can fit into a QR code looks like: encoding input data URL length base64url 2.2 KB 2.9k base10 2.9 KB 7.0k The much longer URL of base10 theoretically doesn't matter: the QR mode means it squishes in very efficiently. However, after publishing this article, it was pointed out that iOS doesn't correctly scan the huge base10 URL, resolving to just http:// example.com instead of http://example.com/310.... So URL length does matter! It seems like this is a different length limit to normal URL restrictions, since 8KB is a reasonable lower limit for "normal" browsers these days, so 7k characters should be safe. Pasting the same link into Safari directly works fine! Thus, if you're (ab)using this trick, caveat QRtor: do your own testing. Two QR codes, both version 40 (177 squares on each side). The one on the left uses base64, the one on the right . base64 decimal Two maxed-out QR codes that both only slightly different to most humans, but you and I know the second one contains 33% more data. Summary Paying attention to the details of the QR code format allow being more efficient. The trick in this post is using the Decimal mode to store arbitrary binary data in URLs, with minimal overhead. This works because QR code modes allow packing particular character sets into the underlying QR bitstream efficiently. This allows reducing the density of a QR code to make it easier to scan, or jamming in more data. Comments: * Hacker News * Lobsters Share this on * Twitter * Facebook * Reddit 0. The average number of output characters required for each input is log(size of input alphabet) / log(size of output alphabet). For instance, for encoding bytes to decimal, log(256) / log(10) = log10(256) = 2.4082.... - 1. The 8189 count was computed by this Python one-liner: len({(diff >> 8) * 0xc0 + (diff & 0xff) for diff in [*(code - 0x8140 for code in range(0x8140, 0x9ffc+1)), *(code - 0xc140 for code in range(0xe040, 0xebbf+1))]}). It naively computes the number of unique outputs of Segno's Kanji encoder, assuming that all inputs from 0x8140 to 09FFC and 0xE040 to 0xEBBF are valid. - 2. This only uses two of the modes: what about Alphanumeric and Kanji? According to encodeURIComponent the largest subset of Alphanumeric that's URL safe is only 39 characters (0-9, A-Z, *-.), and losing 5 characters introduces an additional 3.9% overhead, similar to how going from 256 characters to 64 for base64 has 33% overhead. It seems like Kanji are similarly not URL safe, and thus violate that requirement too. - 3. This table shows the values for all those options, including the results on the example JSON data, sorted by overhead / storage size: example example QR example base mode overhead encoding storage QR (bytes) version 10 Numeric 0.35% 285 119 14 digits 3.9% 179 39 Alphanumeric log(45) / log chars 124 14 (39) - 1 6.2% 183 36 Alphanumeric log(45) / log chars 126 15 (36) - 1 10% 189 32 Alphanumeric 5 bits into 1 chars 130 15 char (5.5 bits) 11% 8 Numeric 3 bits into 1 315 132 15 bits digit (3.33 digits bits) 8 25% 354 bytes Numeric 8 bits into 3 digits 148 15 digits (10 bits) 64 Binary 33% 160 160 16 chars 37.5% 16 Alphanumeric 1 byte (8 bits) 236 163 16 into 2 chars (11 digits bits) There's two variants of base 8: encoding each byte as its own 3 digit base-8 number (bytes), or ignoring byte boundaries and encoding 3 bits at a time (bits) which is similar to how base 10 and base 36 work. This was suggested by a reader. Base 39 is maximal URL-safe subset of the Alphanumeric character set^2. - 4. In the QR code standard (ISO/IEC 18004:2015(E)), Annex J "Optimisation of bit stream length" suggests a method to automatically split up an input stream to optimise the modes, so a sufficiently smart library could just take a single string and arrange the segments itself. - I'm Huon Wilson @huon_w, a mathematically and statistically inclined software engineer. I have been a long-term volunteer on Rust's core team, a compiler engineer on the Swift team at Apple, and a senior software engineer at CSIRO's Data61, working on the StellarGraph graph machine learning library. Latest posts * Mechanical sympathy for QR codes: making NSW check-in better QR codes are now critical infrastructure here in NSW, Australia. Let's learn how to make them better. * QR error correction helps and hinders scanning A QR code can use one of four error correction levels. Higher error correction forces denser codes, but allows scanning in more situations. A trade-off! * The joy of cooking (an app) I've been writing a bunch of code designed to enhance the life of just me and my wife. It's great fun. Huon Wilson -- 2021 hit counter