https://ptrchm.com/posts/base32-explained/

  * Home
  * Writing
  * About
  * Email me

  *  
  *  

Base32 Encoding Explained

How does Base32 (or any Base2n) work exactly at the bit level?

Base64 is one of those things I've used countless times, without ever
fully understanding what it does. For my UUID shortener, I chose to
implement Base32 -- a more readable and URL-friendly alternative to
Base64. I didn't want to rely on any third-party libraries for the
encoding part. This meant going back to basics and learning what
actually happens when you encode data to a Base2n format.

This post is my attempt to explain and visualize how Base32 encoding
works at the bit level and how it relates to other Base2n formats,
like Base64.

We'll implement a Base32 encoder from scratch. I'll be using Ruby for
the example code, but the principles should be easily translated to
any programming language.

If you want to jump straight to the code, here's the result.

In this post

  * What's Base2n encoding?
  * Pushing bits: how to convert a number to Base8
  * Base32 implementation (RFC 4648)
      + Encoding arbitrary data
      + Decoding
      + Towards a more generic approach
  * Base64 and beyond

What's Base2n encoding?

Just as Base10 (decimal) uses 10 digits to represent numbers, Base2n
encoding is simply another way to represent numbers (or bytes), using
2n digits instead: Base2 (binary), Base16 (hexadecimal), Base32,
Base64, etc. The more digits we have, the more compact the
representation is going to be.

For example, let's look at different representations of the same
128-bit number (UUID):

3d89a119-b3f8-4107-8f29-3daf5b964e50   # standard UUID string
0x3d89a119b3f841078f293daf5b964e50     # hex
81797519916847327862337238645062651472 # decimal
1xh6ghkczr843rya9xnxdsckjg             # base32 (Crockford's variant)

# and binary:
111101100010011010000100011001101100111111100001000001000001111000111100101001001111011010111101011011100101100100111001010000

The number of available digits determines how many bits can be
represented by a single character. For example: with binary we can
encode 1 bit of data into each character (2^1 = 2 character
combinations of 1 bit). Similarly, in Base16 (hex) we can encode 4
bits of data into a single character (2^4 = 16 character combinations
of 4 bits).

Pushing bits: how to convert a number to Base8

With that in mind, let's start with a simple example: converting the
number 249 into Base8 (octal). There are several ways to achieve
that, but to illustrate the general method for Base2n encoding, we'll
use bitwise operations.

Here's what the number 249 looks like in binary (an 8-bit
representation):

11111001

Base8 uses 8 digits (0-7), so we can encode 3 bits of data into a
single character (2^3 = 8 character combinations of 3 bits).

digits = ['0', '1', '2', '3', '4', '5', '6', '7']

The number 249 translates to 371 in octal. Here's how the binary
representation looks like, in groups of 3 bits:

011 111 001
+-+ +-+ +-+
 3   7   1

The method is pretty simple: looking at the binary representation,
we'll split the value into groups of 3 bits each. Each group
represents a character/digit in our target system (its index in the
digits array, to be more specific).

Starting from the lower-order bits (right-hand), we'll use a bit mask
to extract 3 bits at a time:

7   # decimal (max. index in the digits array)
0x7 # hex
111 # binary (3 bits, our chunk size)

Let's extract the first 3 bits. We'll do this by masking every bit
except the first 3 lower (right-hand) bits:

number = 249
digit  = number & 0x7 #=> 1 (001)

This gives us the first digit in the result: 1.

The bitwise AND operation looks like this:

  011 111 001
& 000 000 111
= 000 000 001

Now we have to shift to the next 3 bits. We'll push the rightmost
bits the right, removing them from the number:

number = number >> 3 #=> 252 (000 001 111)

The bitwise operation looks like this:

>> 011 111 001

>>  011 111 00

>>   011 111 0

>>    011 111

Now, we can extract the next 3 bits:

  000 011 111
& 000 000 111
= 000 000 111

This gives us the next digit: 7 (0b111).

We keep doing this as long as number > 0, so in this case, just one
more time. Shift to the last 3 bits:

>>     011 111

>>      011 11

>>       011 1

>>        011

And extract the last 3 bits:

  000 000 011
& 000 000 111
= 000 000 011

0b011 is 3 in our target system, so putting it all together, we get
our final result: 371.

Here's the code representing the whole sequence:

digits = ['0', '1', '2', '3', '4', '5', '6', '7']
result = []
number = 249

while number > 0
  digit = digits[number & 0x7]
  result.push(digit)
  number = number >> 3
end

puts result.reverse.join #=> 371

The method will be virtually the same for any Base2n encoding. For
Base8, the digits array is not strictly necessary, but for larger
bases, such as Base16 or Base32, we'll need to define a list of
characters that represent the additional digits.

Now, let's change the code to encode the number 249 into Base16 (4
bits per character). To do this, we'll define the 16 characters and
modify the bit mask to extract 4 bits at a time:

digits = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f']
result = []
number = 249

while number > 0
  digit = digits[number & 0xf] # 0xf = 0b1111
  result.push(digit)
  number = number >> 4
end

puts result.reverse.join #=> f9

In the same manner, you can extend the code to support any Base2n
encoding: simply add more characters to the alphabet, calculate the
bit mask and the number of bits to shift in each iteration, based on
the number of characters in the alphabet. We'll explore this process
in more detail in the next section.

 An important caveat

The method described above is used to encode numbers. We're treating
the input as one chunk of fixed-size data, encoding bits from right
to left (least significant first). However, the standard Base32 and
Base64 algorithms that operate on arbitrary data will split the input
into smaller chunks, starting from left to right (most significant
first).

For example, in the BasedUUID gem, I treat the UUID as a 128-bit
number, encoding it from right to left. This method is also used by
ULID and TypeID. Keep in mind that this approach will not yield the
same result as encoding the string or raw byte representation of the
UUID.

Base32 implementation (RFC 4648)

Encoding arbitrary data

In Base32, each character can encode 5 bits of data (2^5 = 32
character combinations of 5 bits). The RFC 4648 defines the following
set of characters:

Value Encoding  Value Encoding  Value Encoding  Value Encoding
    0 A             9 J            18 S            27 3
    1 B            10 K            19 T            28 4
    2 C            11 L            20 U            29 5
    3 D            12 M            21 V            30 6
    4 E            13 N            22 W            31 7
    5 F            14 O            23 X
    6 G            15 P            24 Y         (pad) =
    7 H            16 Q            25 Z
    8 I            17 R            26 2

There are other variants of the encoding, like Crockford's Base32,
and the alphabet can be customized to fit specific needs^1. However,
for this article, we'll stick to the RFC 4648 version.

Encoding arbitrary data to Base32 is going to be slightly different
than what I described earlier. We need to split the input string into
chunks (5 bytes each), starting from the left (most significant
first).

Let's follow the RFC algorithm step by step to encode an example
string: foobar, which should be encoded to MZXW6YTBOI======.

Here's a visualization of what we're going to do:

1. Groups of 8 bits

01100110 01101111 01101111 01100010 01100001
+------+ +------+ +------+ +------+ +------+
  102      111      111      098      097
  f        o        o        b        a

1. Groups of 8 bits

01100110 01101111 01101111 01100010 01100001

2. Join into a single number

01100110 01101111 01101111 01100010 01100001

2. Join into a single number

0110011001101111011011110110001001100001

3. Split into groups of 5 bits

0110011001101111011011110110001001100001

3. Split into groups of 5 bits

01100 11001 10111 10110 11110 11000 10011 00001

4. Each group is mapped to a char in the ALPHABET

01100 11001 10111 10110 11110 11000 10011 00001

4. Each group is mapped to a char in the ALPHABET

01100 11001 10111 10110 11110 11000 10011 00001
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
 12    25    23    22    30    24    19    1
 M     Z     X     W     6     Y     T     B

4. Each group is mapped to a char in the ALPHABET




 M     Z     X     W     6     Y     T     B

4. Each group is mapped to a char in the ALPHABET




 MZXW6YTB

First, we'll convert the string to an array of bytes. Then we'll
split the array into groups of 5 bytes. Since each byte is 8 bits,
this means each group will be 40 bits (8 bits per byte * 5 bytes).

bytes = 'foobar'.bytes #=> [102, 111, 111, 98, 97, 114]
chunks = bytes.each_slice(5).to_a

# => [[102, 111, 111, 98, 97], [114]]

The last group contains only 1 byte, so it's not divisible by 5.
We'll deal with this problem later. For now, let's focus on the first
chunk:

102 111 111 098 097
+-+ +-+ +-+ +-+ +-+
 f   o   o   b   a

And the binary representation:

01100110 01101111 01101111 01100010 01100001
+------+ +------+ +------+ +------+ +------+
  102      111      111      098      097
  f        o        o        b        a

First, we'll combine these bytes into a single 40-bit number:

chunk = [102, 111, 111, 98, 97]
buf = 0
chunk.each do |byte|
  buf = (buf << 8) + byte
end

buf #=> 439956234849 (01100 11001 10111 10110 11110 11000 10011 00001)

Here's the binary version of the loop above:

1100110 << 8
110011000000000 + 1101111 = 110011001101111

110011001101111 << 8
11001100110111100000000 + 1101111 = 11001100110111111011111

11001100110111101101111 << 8
1100110011011110110111100000000 + 1100010 = 1100110011011110110111101100010

1100110011011110110111101100010 << 8
110011001101111011011110110001000000000 + 1100001 = 110011001101111011011110110001001100001

The result is a 40-bit number. We'll divide this number into eight
5-bit groups, with each 5-bit group being encoded into a single
character:

01100 11001 10111 10110 11110 11000 10011 00001
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
 12    25    23    22    30    24    19    1    # index in the ALPHABET array
 M     Z     X     W     6     Y     T     B    # the character it represents

Here's the code to extract those 5-bit groups and encode them into 8
characters:

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567".split("")
encoded = Array.new(8)
j = encoded.length - 1

encoded.length.times do |i|
  encoded[j] = ALPHABET[(buf >> 5 * i) & 0x1f]
  j -= 1
end

encoded.join #=> MZXW6YTB

This operation is very similar to what we did in our Base8 example.
We use the 0x1f bit mask to extract 5 bits at a time. After each
extraction, we shift the number right by 5 bits to access the next 5
bits. The extracted 5 bits act as an index to find the corresponding
character in the ALPHABET array.

Now we have our first chunk encoded to Base32, but it's time to deal
with the tricky part.

The RFC says that if the last group contains less than 40 bits, it
must be padded with zeros until the total number of bits becomes
divisible by 5. Each group of 5 bytes should yield 8 encoded
characters. If the last chunk produces fewer than 8 characters, we'll
pad the remaining space with =.

In our example, the last chunk contains only 1 byte (8 bits), so
we'll pad it with 2 bits to reach 10 bits, the smallest number
divisible by 5.

Two calculations are necessary here:

 1. Determining how many characters the last chunk will produce.
 2. Calculating the number of bits needed to pad the chunk to make
    the total divisible by 5.

For the first part, we'll use this formula:

40 bits = 8 characters
8 bits  = x characters

x = 8 * 8 / 40 = 1.6

Rounding up, we know this chunk should yield 2 characters

Now, for the padding calculation:

padding = 5 - (chunk.size * 8) % 5

In our case, the last chunk is a single byte (114, 01110010,
representing the letter r). We'll pad it with 2 bits at the end:

  01110010 << 2

 011100100 << 2

0111001000 << 2

This gives us 10 bits that will be encoded to 2 characters.

01110 01000
+---+ +---+
 14    8    # index in the ALPHABET array
 O     I    # the character it represents

Translated to code, here's the whole encoding sequence:

chunk = [114] # the last chunk, the letter "r"
bits_in_chunk = chunk.length * 8
number_of_characters = (bits_in_chunk / 5.to_f).ceil     # how many encoded characters this chunk will produce
padding = bits_in_chunk < 40 ? 5 - bits_in_chunk % 5 : 0 # how many bits we need to pad to make the number of bits divisible by 5

buf = 0
chunk.each do |byte|
  buf = (buf << 8) + byte # combine the bytes into a single number
end
buf <<= padding # add missing bits to the right

encoded = Array.new(8) # an array to hold 8 encoded characters
j = number_of_characters - 1

# encode 2 characters
number_of_characters.times do |i|
  encoded[j] = ALPHABET[(buf >> 5 * i) & 0x1f] # extract 5 bits at a time
  j -= 1
end

# pad the result with 6 '='
(8 - number_of_characters).times do |i|
  encoded[number_of_characters + i] = "="
end

encoded.join #=> OI======

We're now done with the encoding part. At this point, our code should
cover all the test vectors described by the RFC:

BASE32("")       = ""
BASE32("f")      = "MY======"
BASE32("fo")     = "MZXQ===="
BASE32("foo")    = "MZXW6==="
BASE32("foob")   = "MZXW6YQ="
BASE32("fooba")  = "MZXW6YTB"
BASE32("foobar") = "MZXW6YTBOI======"

Decoding

Now, let's reverse the process. At the bit level, the process of
turning MZXW6YTBOI====== back to foobar will look like this:

1. Groups of 5 bits

01100 11001 10111 10110 11110 10110 10011 00001
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
 12    25    23    22    30    22    19     1

1. Groups of 5 bits

01100 11001 10111 10110 11110 10110 10011 00001

2. Join into a single number

01100 11001 10111 10110 11110 10110 10011 00001

2. Join into a single number

0110011001101111011011110101101001100001

3. Split into groups of 8 bits

0110011001101111011011110101101001100001

3. Split into groups of 8 bits

01100110 01101111 01101111 01100010 01100001

4. Map each byte to a character

01100110 01101111 01101111 01100010 01100001

4. Map each byte to a character

01100110 01101111 01101111 01100010 01100001
+------+ +------+ +------+ +------+ +------+
  102      111      111      098      097
   f        o        o        b        a

4. Map each byte to a character




   f        o        o        b        a

4. Map each byte to a character




   fooba

Step by step, to decode a Base32 string, we have to do the following:

 1. Remove the = padding characters from the input string.
 2. Split the string into an array of characters.
 3. Turn each character into its index in the ALPHABET array.
 4. Divide the array to chunks of 8 bytes (40 bits = 8 * 5 encoded
    bits).
 5. Calculate the number of original bytes the given chunk represents
    (when the last chunk has less than 40 bits).
 6. Calculate the bit padding applied when encoding.
 7. Combine the bytes into a single number and strip the padding.
 8. Decode the number by extracting 1 byte (8 bits) at a time.

So, starting with the string MZXW6YTBOI======:

1. MZXW6YTBOI
2. ["M", "Z", "X", "W", "6", "Y", "T", "B", "O", "I"]
3. [12, 25, 23, 22, 30, 24, 19, 1, 14, 8]
4. [[12, 25, 23, 22, 30, 24, 19, 1], [14, 8]]

The first chunk of the array has exactly 8 elements. We'll combine
the values into a single 40-bit number:

01100 11001 10111 10110 11110 10110 10011 00001
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
 12    25    23    22    30    22    19     1

01100 << 5 = 0110000000
0110000000 + 11001 = 0110011001

0110011001 << 5
011001100100000 + 10111 = 011001100101111

011001100101111 << 5
01100110010111100000 + 10110 = 01100110010111110110

01100110010111110110 << 5
0110011001011111011000000 + 11110 = 0110011001011111011001110

0110011001011111011001110 << 5
011001100101111101100111000000 + 11000 = 011001100101111101100111011000

011001100101111101100111011000 << 5
01100110010111110110011101100000000 + 10011 = 01100110010111110110011101100010011

01100110010111110110011101100010011 << 5
0110011001011111011001110110001001100000 + 00001 = 0110011001011111011001110110001001100001

We should end up with the exact same number that we initially
encoded. To decode it, we'll group the bits into 8-bit segments
(bytes) and extract them one by one:

01100110 01101111 01101111 01100010 01100001
+------+ +------+ +------+ +------+ +------+
  102      111      111      098      097
   f        o        o        b        a

To extract 8 bits at a time, we'll use the 0xff (11111111) bit mask:

  01100110 01101111 01101111 01100010 01100001
& 00000000 00000000 00000000 00000000 11111111
= 00000000 00000000 00000000 00000000 01100001
                                      +------+
                                         97

For the second chunk (OI, representing 2 bytes), we have to do some
extra work. We need to calculate

 1. The number of bytes this chunk represents in the original foobar
    string.
 2. The number of bits that were added as padding during the encoding
    process.

The whole decoding sequence:

str = "MZXW6YTBOI"
str = str.delete("=")
bytes = str.each_char.map { |char| ALPHABET.index(char) }

bytes.each_slice(8).map do |chunk|
  number_of_original_bytes = (chunk.length * 5.0 / 8.0).floor
  padding = chunk.length < 8 ? 5 - (number_of_original_bytes * 8) % 5 : 0

  buf = 0
  chunk.each do |byte|
    buf = (buf << 5) + byte # each byte in the chunk represents 5 bits
  end
  buf >>= padding # remove the padding (in this case, the last 2 bits)

  decoded = Array.new(number_of_original_bytes)
  j = decoded.length - 1

  number_of_original_bytes.times do |i|
    # extract 8 bits at a time and convert to a character
    decoded[i] = ((buf >> 8 * j) & 0xff).chr
    j -= 1
  end

  decoded
end.join #=> foobar

At this point, we should have a working Base32 encoder and decoder.

Towards a more generic approach

Now let's wrap the code into a class. We'll also replace the
hardcoded values with constants to make it a little more descriptive
and universal:

class Base32
  ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567".split("")
  PADDING_CHAR = "="

  BITS_PER_BYTE  = 8  # 1 byte = 8 bits
  BITS_PER_CHAR  = Math.log2(ALPHABET.length).round # 5 = 32 chars = 2^5 - number of bits encoded into a single character in the ALPHABET
  BITS_PER_CHUNK = BITS_PER_CHAR.lcm(BITS_PER_BYTE) # 40 (least common mutliple of 5 and 8)

  CHARS_PER_CHUNK = BITS_PER_CHUNK / BITS_PER_CHAR  # 8
  CHUNK_LENGTH    = BITS_PER_CHUNK / BITS_PER_BYTE  # 5

  ENCODING_MASK = ALPHABET.length - 1 # 0x1f
  DECODING_MASK = 0xff

  def self.encode(str)
    # ...
  end

  def self.decode(str)
    # ...
  end
end

The result may be slightly more difficult to follow, but the point of
this change should be clear in the next section^2.

The complete implementation can be found here.

The pitfalls of Base32

Compared to Base64, Base32 offers a lot more flexibility and there
are multiple variations of the encoding. The alphabet is fully
customizable: you can change the order in the alphabet, replace
certain characters to avoid accidental profanity or skip characters
that look similar, like in the Crockford's variant. But this is a
choice you'll only get to make once and you can't easily change the
alphabet later.

This flexibility makes Base32 tied to the specific implementation.
The lack of a universal standard is an important reason why it's not
nearly as popular as Base64.

Base64 and beyond

Now that we know how to implement Base32, let's try to use the same
method to implement our own Base64 encoder^3.

Base64 encodes 6 bits of data into a single character (2^6 = 64
character combinations of 6 bits). This means that our chunk has to
be the least common multiple of 8 and 6, that is 24 (8 bits per byte,
6 bits per character).

Our existing code should be able handle it out of the box. The only
required change is to replace the ALPHABET array:

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".split

The rest of the code will be identical. Every other value will be
calculated automatically based on the length of the ALPHABET:

BITS_PER_CHARACTER = 6
BITS_PER_CHUNK     = 24
CHARS_PER_CHUNK    = 4
CHUNK_LENGTH       = 3
ENCODING_MASK      = 0x3f # mask to extract 6 bits at a time

In theory, the method I described could be used to implement any
Base2n encoding. However, considering that the ASCII character set
includes only 95 printable characters, Base64 is the highest Base2n
encoding you'll see in everyday use.

---------------------------------------------------------------------

With all that said, implementing a Base2n encoder is a problem you'll
hardly ever have to deal with in the real world. But I hope the
examples in this post helped you understand what exactly happens
behind the scenes when you use Base64 or Base32.

If you have any comments, questions or suggestions, don't hesitate to
reach out to me via email.

---------------------------------------------------------------------

 1. For example, you may want to remove certain vowels like a, u to
    reduce the chance of accidental profanity in the encoded
    strings. -[?]

 2. To make it more universal and configurable, the class would have
    to be rewritten to use instance variables instead of constants,
    but I feel that's beyond the scope of this article. -[?]

 3. We're doing this purely for educational purposes. I'd never
    recommend rolling your own Base64 code. You should always use the
    built-in Base64 implementation in your programming language of
    choice. -[?]

Published on 2023-12-17 (Updated: 2023-12-17)
author photo

I'm Piotr, a freelance developer living in Warsaw, Poland. This is a
place where I explore different topics related to software
development -- or whatever happens to be interesting to me at the
moment.

If you found this post helpful, let me know.

Copyright (c) 2023

  * Home
  * Writing
  * About
  * Email me