https://ptrchm.com/posts/base32-explained/ * Home * Writing * About * Email me * * Base32 Encoding Explained How does Base32 (or any Base2n) work exactly at the bit level? Base64 is one of those things I've used countless times, without ever fully understanding what it does. For my UUID shortener, I chose to implement Base32 -- a more readable and URL-friendly alternative to Base64. I didn't want to rely on any third-party libraries for the encoding part. This meant going back to basics and learning what actually happens when you encode data to a Base2n format. This post is my attempt to explain and visualize how Base32 encoding works at the bit level and how it relates to other Base2n formats, like Base64. We'll implement a Base32 encoder from scratch. I'll be using Ruby for the example code, but the principles should be easily translated to any programming language. If you want to jump straight to the code, here's the result. In this post * What's Base2n encoding? * Pushing bits: how to convert a number to Base8 * Base32 implementation (RFC 4648) + Encoding arbitrary data + Decoding + Towards a more generic approach * Base64 and beyond What's Base2n encoding? Just as Base10 (decimal) uses 10 digits to represent numbers, Base2n encoding is simply another way to represent numbers (or bytes), using 2n digits instead: Base2 (binary), Base16 (hexadecimal), Base32, Base64, etc. The more digits we have, the more compact the representation is going to be. For example, let's look at different representations of the same 128-bit number (UUID): 3d89a119-b3f8-4107-8f29-3daf5b964e50 # standard UUID string 0x3d89a119b3f841078f293daf5b964e50 # hex 81797519916847327862337238645062651472 # decimal 1xh6ghkczr843rya9xnxdsckjg # base32 (Crockford's variant) # and binary: 111101100010011010000100011001101100111111100001000001000001111000111100101001001111011010111101011011100101100100111001010000 The number of available digits determines how many bits can be represented by a single character. For example: with binary we can encode 1 bit of data into each character (2^1 = 2 character combinations of 1 bit). Similarly, in Base16 (hex) we can encode 4 bits of data into a single character (2^4 = 16 character combinations of 4 bits). Pushing bits: how to convert a number to Base8 With that in mind, let's start with a simple example: converting the number 249 into Base8 (octal). There are several ways to achieve that, but to illustrate the general method for Base2n encoding, we'll use bitwise operations. Here's what the number 249 looks like in binary (an 8-bit representation): 11111001 Base8 uses 8 digits (0-7), so we can encode 3 bits of data into a single character (2^3 = 8 character combinations of 3 bits). digits = ['0', '1', '2', '3', '4', '5', '6', '7'] The number 249 translates to 371 in octal. Here's how the binary representation looks like, in groups of 3 bits: 011 111 001 +-+ +-+ +-+ 3 7 1 The method is pretty simple: looking at the binary representation, we'll split the value into groups of 3 bits each. Each group represents a character/digit in our target system (its index in the digits array, to be more specific). Starting from the lower-order bits (right-hand), we'll use a bit mask to extract 3 bits at a time: 7 # decimal (max. index in the digits array) 0x7 # hex 111 # binary (3 bits, our chunk size) Let's extract the first 3 bits. We'll do this by masking every bit except the first 3 lower (right-hand) bits: number = 249 digit = number & 0x7 #=> 1 (001) This gives us the first digit in the result: 1. The bitwise AND operation looks like this: 011 111 001 & 000 000 111 = 000 000 001 Now we have to shift to the next 3 bits. We'll push the rightmost bits the right, removing them from the number: number = number >> 3 #=> 252 (000 001 111) The bitwise operation looks like this: >> 011 111 001 >> 011 111 00 >> 011 111 0 >> 011 111 Now, we can extract the next 3 bits: 000 011 111 & 000 000 111 = 000 000 111 This gives us the next digit: 7 (0b111). We keep doing this as long as number > 0, so in this case, just one more time. Shift to the last 3 bits: >> 011 111 >> 011 11 >> 011 1 >> 011 And extract the last 3 bits: 000 000 011 & 000 000 111 = 000 000 011 0b011 is 3 in our target system, so putting it all together, we get our final result: 371. Here's the code representing the whole sequence: digits = ['0', '1', '2', '3', '4', '5', '6', '7'] result = [] number = 249 while number > 0 digit = digits[number & 0x7] result.push(digit) number = number >> 3 end puts result.reverse.join #=> 371 The method will be virtually the same for any Base2n encoding. For Base8, the digits array is not strictly necessary, but for larger bases, such as Base16 or Base32, we'll need to define a list of characters that represent the additional digits. Now, let's change the code to encode the number 249 into Base16 (4 bits per character). To do this, we'll define the 16 characters and modify the bit mask to extract 4 bits at a time: digits = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'] result = [] number = 249 while number > 0 digit = digits[number & 0xf] # 0xf = 0b1111 result.push(digit) number = number >> 4 end puts result.reverse.join #=> f9 In the same manner, you can extend the code to support any Base2n encoding: simply add more characters to the alphabet, calculate the bit mask and the number of bits to shift in each iteration, based on the number of characters in the alphabet. We'll explore this process in more detail in the next section. An important caveat The method described above is used to encode numbers. We're treating the input as one chunk of fixed-size data, encoding bits from right to left (least significant first). However, the standard Base32 and Base64 algorithms that operate on arbitrary data will split the input into smaller chunks, starting from left to right (most significant first). For example, in the BasedUUID gem, I treat the UUID as a 128-bit number, encoding it from right to left. This method is also used by ULID and TypeID. Keep in mind that this approach will not yield the same result as encoding the string or raw byte representation of the UUID. Base32 implementation (RFC 4648) Encoding arbitrary data In Base32, each character can encode 5 bits of data (2^5 = 32 character combinations of 5 bits). The RFC 4648 defines the following set of characters: Value Encoding Value Encoding Value Encoding Value Encoding 0 A 9 J 18 S 27 3 1 B 10 K 19 T 28 4 2 C 11 L 20 U 29 5 3 D 12 M 21 V 30 6 4 E 13 N 22 W 31 7 5 F 14 O 23 X 6 G 15 P 24 Y (pad) = 7 H 16 Q 25 Z 8 I 17 R 26 2 There are other variants of the encoding, like Crockford's Base32, and the alphabet can be customized to fit specific needs^1. However, for this article, we'll stick to the RFC 4648 version. Encoding arbitrary data to Base32 is going to be slightly different than what I described earlier. We need to split the input string into chunks (5 bytes each), starting from the left (most significant first). Let's follow the RFC algorithm step by step to encode an example string: foobar, which should be encoded to MZXW6YTBOI======. Here's a visualization of what we're going to do: 1. Groups of 8 bits 01100110 01101111 01101111 01100010 01100001 +------+ +------+ +------+ +------+ +------+ 102 111 111 098 097 f o o b a 1. Groups of 8 bits 01100110 01101111 01101111 01100010 01100001 2. Join into a single number 01100110 01101111 01101111 01100010 01100001 2. Join into a single number 0110011001101111011011110110001001100001 3. Split into groups of 5 bits 0110011001101111011011110110001001100001 3. Split into groups of 5 bits 01100 11001 10111 10110 11110 11000 10011 00001 4. Each group is mapped to a char in the ALPHABET 01100 11001 10111 10110 11110 11000 10011 00001 4. Each group is mapped to a char in the ALPHABET 01100 11001 10111 10110 11110 11000 10011 00001 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 12 25 23 22 30 24 19 1 M Z X W 6 Y T B 4. Each group is mapped to a char in the ALPHABET M Z X W 6 Y T B 4. Each group is mapped to a char in the ALPHABET MZXW6YTB First, we'll convert the string to an array of bytes. Then we'll split the array into groups of 5 bytes. Since each byte is 8 bits, this means each group will be 40 bits (8 bits per byte * 5 bytes). bytes = 'foobar'.bytes #=> [102, 111, 111, 98, 97, 114] chunks = bytes.each_slice(5).to_a # => [[102, 111, 111, 98, 97], [114]] The last group contains only 1 byte, so it's not divisible by 5. We'll deal with this problem later. For now, let's focus on the first chunk: 102 111 111 098 097 +-+ +-+ +-+ +-+ +-+ f o o b a And the binary representation: 01100110 01101111 01101111 01100010 01100001 +------+ +------+ +------+ +------+ +------+ 102 111 111 098 097 f o o b a First, we'll combine these bytes into a single 40-bit number: chunk = [102, 111, 111, 98, 97] buf = 0 chunk.each do |byte| buf = (buf << 8) + byte end buf #=> 439956234849 (01100 11001 10111 10110 11110 11000 10011 00001) Here's the binary version of the loop above: 1100110 << 8 110011000000000 + 1101111 = 110011001101111 110011001101111 << 8 11001100110111100000000 + 1101111 = 11001100110111111011111 11001100110111101101111 << 8 1100110011011110110111100000000 + 1100010 = 1100110011011110110111101100010 1100110011011110110111101100010 << 8 110011001101111011011110110001000000000 + 1100001 = 110011001101111011011110110001001100001 The result is a 40-bit number. We'll divide this number into eight 5-bit groups, with each 5-bit group being encoded into a single character: 01100 11001 10111 10110 11110 11000 10011 00001 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 12 25 23 22 30 24 19 1 # index in the ALPHABET array M Z X W 6 Y T B # the character it represents Here's the code to extract those 5-bit groups and encode them into 8 characters: ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567".split("") encoded = Array.new(8) j = encoded.length - 1 encoded.length.times do |i| encoded[j] = ALPHABET[(buf >> 5 * i) & 0x1f] j -= 1 end encoded.join #=> MZXW6YTB This operation is very similar to what we did in our Base8 example. We use the 0x1f bit mask to extract 5 bits at a time. After each extraction, we shift the number right by 5 bits to access the next 5 bits. The extracted 5 bits act as an index to find the corresponding character in the ALPHABET array. Now we have our first chunk encoded to Base32, but it's time to deal with the tricky part. The RFC says that if the last group contains less than 40 bits, it must be padded with zeros until the total number of bits becomes divisible by 5. Each group of 5 bytes should yield 8 encoded characters. If the last chunk produces fewer than 8 characters, we'll pad the remaining space with =. In our example, the last chunk contains only 1 byte (8 bits), so we'll pad it with 2 bits to reach 10 bits, the smallest number divisible by 5. Two calculations are necessary here: 1. Determining how many characters the last chunk will produce. 2. Calculating the number of bits needed to pad the chunk to make the total divisible by 5. For the first part, we'll use this formula: 40 bits = 8 characters 8 bits = x characters x = 8 * 8 / 40 = 1.6 Rounding up, we know this chunk should yield 2 characters Now, for the padding calculation: padding = 5 - (chunk.size * 8) % 5 In our case, the last chunk is a single byte (114, 01110010, representing the letter r). We'll pad it with 2 bits at the end: 01110010 << 2 011100100 << 2 0111001000 << 2 This gives us 10 bits that will be encoded to 2 characters. 01110 01000 +---+ +---+ 14 8 # index in the ALPHABET array O I # the character it represents Translated to code, here's the whole encoding sequence: chunk = [114] # the last chunk, the letter "r" bits_in_chunk = chunk.length * 8 number_of_characters = (bits_in_chunk / 5.to_f).ceil # how many encoded characters this chunk will produce padding = bits_in_chunk < 40 ? 5 - bits_in_chunk % 5 : 0 # how many bits we need to pad to make the number of bits divisible by 5 buf = 0 chunk.each do |byte| buf = (buf << 8) + byte # combine the bytes into a single number end buf <<= padding # add missing bits to the right encoded = Array.new(8) # an array to hold 8 encoded characters j = number_of_characters - 1 # encode 2 characters number_of_characters.times do |i| encoded[j] = ALPHABET[(buf >> 5 * i) & 0x1f] # extract 5 bits at a time j -= 1 end # pad the result with 6 '=' (8 - number_of_characters).times do |i| encoded[number_of_characters + i] = "=" end encoded.join #=> OI====== We're now done with the encoding part. At this point, our code should cover all the test vectors described by the RFC: BASE32("") = "" BASE32("f") = "MY======" BASE32("fo") = "MZXQ====" BASE32("foo") = "MZXW6===" BASE32("foob") = "MZXW6YQ=" BASE32("fooba") = "MZXW6YTB" BASE32("foobar") = "MZXW6YTBOI======" Decoding Now, let's reverse the process. At the bit level, the process of turning MZXW6YTBOI====== back to foobar will look like this: 1. Groups of 5 bits 01100 11001 10111 10110 11110 10110 10011 00001 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 12 25 23 22 30 22 19 1 1. Groups of 5 bits 01100 11001 10111 10110 11110 10110 10011 00001 2. Join into a single number 01100 11001 10111 10110 11110 10110 10011 00001 2. Join into a single number 0110011001101111011011110101101001100001 3. Split into groups of 8 bits 0110011001101111011011110101101001100001 3. Split into groups of 8 bits 01100110 01101111 01101111 01100010 01100001 4. Map each byte to a character 01100110 01101111 01101111 01100010 01100001 4. Map each byte to a character 01100110 01101111 01101111 01100010 01100001 +------+ +------+ +------+ +------+ +------+ 102 111 111 098 097 f o o b a 4. Map each byte to a character f o o b a 4. Map each byte to a character fooba Step by step, to decode a Base32 string, we have to do the following: 1. Remove the = padding characters from the input string. 2. Split the string into an array of characters. 3. Turn each character into its index in the ALPHABET array. 4. Divide the array to chunks of 8 bytes (40 bits = 8 * 5 encoded bits). 5. Calculate the number of original bytes the given chunk represents (when the last chunk has less than 40 bits). 6. Calculate the bit padding applied when encoding. 7. Combine the bytes into a single number and strip the padding. 8. Decode the number by extracting 1 byte (8 bits) at a time. So, starting with the string MZXW6YTBOI======: 1. MZXW6YTBOI 2. ["M", "Z", "X", "W", "6", "Y", "T", "B", "O", "I"] 3. [12, 25, 23, 22, 30, 24, 19, 1, 14, 8] 4. [[12, 25, 23, 22, 30, 24, 19, 1], [14, 8]] The first chunk of the array has exactly 8 elements. We'll combine the values into a single 40-bit number: 01100 11001 10111 10110 11110 10110 10011 00001 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 12 25 23 22 30 22 19 1 01100 << 5 = 0110000000 0110000000 + 11001 = 0110011001 0110011001 << 5 011001100100000 + 10111 = 011001100101111 011001100101111 << 5 01100110010111100000 + 10110 = 01100110010111110110 01100110010111110110 << 5 0110011001011111011000000 + 11110 = 0110011001011111011001110 0110011001011111011001110 << 5 011001100101111101100111000000 + 11000 = 011001100101111101100111011000 011001100101111101100111011000 << 5 01100110010111110110011101100000000 + 10011 = 01100110010111110110011101100010011 01100110010111110110011101100010011 << 5 0110011001011111011001110110001001100000 + 00001 = 0110011001011111011001110110001001100001 We should end up with the exact same number that we initially encoded. To decode it, we'll group the bits into 8-bit segments (bytes) and extract them one by one: 01100110 01101111 01101111 01100010 01100001 +------+ +------+ +------+ +------+ +------+ 102 111 111 098 097 f o o b a To extract 8 bits at a time, we'll use the 0xff (11111111) bit mask: 01100110 01101111 01101111 01100010 01100001 & 00000000 00000000 00000000 00000000 11111111 = 00000000 00000000 00000000 00000000 01100001 +------+ 97 For the second chunk (OI, representing 2 bytes), we have to do some extra work. We need to calculate 1. The number of bytes this chunk represents in the original foobar string. 2. The number of bits that were added as padding during the encoding process. The whole decoding sequence: str = "MZXW6YTBOI" str = str.delete("=") bytes = str.each_char.map { |char| ALPHABET.index(char) } bytes.each_slice(8).map do |chunk| number_of_original_bytes = (chunk.length * 5.0 / 8.0).floor padding = chunk.length < 8 ? 5 - (number_of_original_bytes * 8) % 5 : 0 buf = 0 chunk.each do |byte| buf = (buf << 5) + byte # each byte in the chunk represents 5 bits end buf >>= padding # remove the padding (in this case, the last 2 bits) decoded = Array.new(number_of_original_bytes) j = decoded.length - 1 number_of_original_bytes.times do |i| # extract 8 bits at a time and convert to a character decoded[i] = ((buf >> 8 * j) & 0xff).chr j -= 1 end decoded end.join #=> foobar At this point, we should have a working Base32 encoder and decoder. Towards a more generic approach Now let's wrap the code into a class. We'll also replace the hardcoded values with constants to make it a little more descriptive and universal: class Base32 ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567".split("") PADDING_CHAR = "=" BITS_PER_BYTE = 8 # 1 byte = 8 bits BITS_PER_CHAR = Math.log2(ALPHABET.length).round # 5 = 32 chars = 2^5 - number of bits encoded into a single character in the ALPHABET BITS_PER_CHUNK = BITS_PER_CHAR.lcm(BITS_PER_BYTE) # 40 (least common mutliple of 5 and 8) CHARS_PER_CHUNK = BITS_PER_CHUNK / BITS_PER_CHAR # 8 CHUNK_LENGTH = BITS_PER_CHUNK / BITS_PER_BYTE # 5 ENCODING_MASK = ALPHABET.length - 1 # 0x1f DECODING_MASK = 0xff def self.encode(str) # ... end def self.decode(str) # ... end end The result may be slightly more difficult to follow, but the point of this change should be clear in the next section^2. The complete implementation can be found here. The pitfalls of Base32 Compared to Base64, Base32 offers a lot more flexibility and there are multiple variations of the encoding. The alphabet is fully customizable: you can change the order in the alphabet, replace certain characters to avoid accidental profanity or skip characters that look similar, like in the Crockford's variant. But this is a choice you'll only get to make once and you can't easily change the alphabet later. This flexibility makes Base32 tied to the specific implementation. The lack of a universal standard is an important reason why it's not nearly as popular as Base64. Base64 and beyond Now that we know how to implement Base32, let's try to use the same method to implement our own Base64 encoder^3. Base64 encodes 6 bits of data into a single character (2^6 = 64 character combinations of 6 bits). This means that our chunk has to be the least common multiple of 8 and 6, that is 24 (8 bits per byte, 6 bits per character). Our existing code should be able handle it out of the box. The only required change is to replace the ALPHABET array: ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".split The rest of the code will be identical. Every other value will be calculated automatically based on the length of the ALPHABET: BITS_PER_CHARACTER = 6 BITS_PER_CHUNK = 24 CHARS_PER_CHUNK = 4 CHUNK_LENGTH = 3 ENCODING_MASK = 0x3f # mask to extract 6 bits at a time In theory, the method I described could be used to implement any Base2n encoding. However, considering that the ASCII character set includes only 95 printable characters, Base64 is the highest Base2n encoding you'll see in everyday use. --------------------------------------------------------------------- With all that said, implementing a Base2n encoder is a problem you'll hardly ever have to deal with in the real world. But I hope the examples in this post helped you understand what exactly happens behind the scenes when you use Base64 or Base32. If you have any comments, questions or suggestions, don't hesitate to reach out to me via email. --------------------------------------------------------------------- 1. For example, you may want to remove certain vowels like a, u to reduce the chance of accidental profanity in the encoded strings. -[?] 2. To make it more universal and configurable, the class would have to be rewritten to use instance variables instead of constants, but I feel that's beyond the scope of this article. -[?] 3. We're doing this purely for educational purposes. I'd never recommend rolling your own Base64 code. You should always use the built-in Base64 implementation in your programming language of choice. -[?] Published on 2023-12-17 (Updated: 2023-12-17) author photo I'm Piotr, a freelance developer living in Warsaw, Poland. This is a place where I explore different topics related to software development -- or whatever happens to be interesting to me at the moment. If you found this post helpful, let me know. Copyright (c) 2023 * Home * Writing * About * Email me