[HN Gopher] How does Base32 (or any Base2^n) work exactly?
       ___________________________________________________________________
        
       How does Base32 (or any Base2^n) work exactly?
        
       Author : pchm
       Score  : 39 points
       Date   : 2023-12-17 15:17 UTC (7 hours ago)
        
 (HTM) web link (ptrchm.com)
 (TXT) w3m dump (ptrchm.com)
        
       | daisydaisytts wrote:
       | Just putting this here for alternate baseX:
       | https://github.com/qntm/base2048
        
       | Uptrenda wrote:
       | A few other bases that are interesting:
       | 
       | Base36: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
       | 
       | Good encoding for binary data in textual contexts. Such as where
       | you have parameter inputs or database fields that are constrained
       | and only accept certain characters. The lack of spaces means that
       | it can be used on the command-line easily. Example use: IRC
       | channel names.
       | 
       | Base64:
       | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
       | 
       | Same as above but it adds lower-case alphabet characters. This is
       | important because as you restrict the number of characters
       | allowed in a byte: the length of the string goes up massively.
       | With more characters the coding is more efficient. Example use:
       | YouTube video ids.
       | 
       | Base92: "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrst
       | uvwxyz~!@#$%^&*()_+{}\"<>?`-=[];',./|"
       | 
       | Base92 is every character you can make on a standard key-board
       | (I've replaced space with pipe here.) It includes many characters
       | that have special meanings on the command-line or may be used as
       | delimiters in text-based protocols. So while this offers a more
       | 'efficient' encoding scheme for binary data it may break in some
       | contexts. It's best where the input allows for typical
       | formatting. Example use: forum / chat messages.
       | 
       | BaseN encoding schemes are interesting because they allow you to
       | use standard text-fields in many systems to store binary data.
       | The most well-known here is base64 which allows browsers to embed
       | whole files as text and store them directly in the HTML. Some
       | sites use these for optimization hacks.
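
The point above about alphabet size versus string length can be
checked with a little arithmetic. This is a minimal sketch, assuming
the input is treated as one big integer (real codecs such as Base64
work on fixed-size chunks, so their lengths can differ slightly). The
function name encoded_length is made up for this illustration.

```python
def encoded_length(num_bytes: int, base: int) -> int:
    """Smallest number of base-`base` digits able to represent every
    possible input of `num_bytes` bytes (input viewed as one integer)."""
    limit = 256 ** num_bytes   # number of distinct inputs
    digits, capacity = 0, 1
    while capacity < limit:    # exact integer arithmetic, no float log
        capacity *= base
        digits += 1
    return digits


if __name__ == "__main__":
    for base in (16, 32, 36, 62, 64, 92):
        print(f"base{base}: {encoded_length(16, base)} chars for 16 bytes")
```

For a 16-byte value such as a GUID, going from base16 to base64 saves
ten characters, while going from base64 to base92 saves only two more.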
        
         | MiddleEndian wrote:
         | >Base64:
         | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
         | 
         | This is only 62 characters!
        
           | dexwiz wrote:
           | Base64 uses / and +
        
             | explaininjs wrote:
             | Sometimes. Other times those chars are not allowed in the
             | embedding context (paths, for instance), so you have to use
             | '+' and ','. Or maybe '_' and '-'.
        
           | gnabgib wrote:
           | Suspect they mistyped `Base62` (since they seem to be
           | favouring non powers of two)
        
         | explaininjs wrote:
         | That is not base64, it's base62. You can tell because it only
         | has 62 symbols. To get base64 you have to add 2 symbols that
         | you arbitrarily select from the master "table of symbols to add
         | to base62 to get to base64 depending on what the platform is
         | and what characters are restricted in it" [1]. For instance you
         | might use `@`, except in an email. Or `/`, but not in an fs
         | path or URL.
         | 
         | As for base92, those symbols might all be easy to enter on your
         | keyboard, but on international layouts the process can be quite
         | involved indeed.
         | 
         | I prefer base36 for this reason. Want a compact random string?
         | Math.random().toString(36). Watch out to prefix it with a char
         | for settings that disallow leading digits, though! (variable
         | identifiers, css class names, etc.)
         | 
         | [1] https://en.wikipedia.org/wiki/Base64#Variants_summary_table
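
A rough Python counterpart of the base36 trick above, as a sketch
only. Python has no built-in integer-to-base36 formatter, so the
helper to_base36 is written by hand here. The names to_base36 and
random_identifier, the 64-bit default and the "x" prefix are all
assumptions made for this example.

```python
import secrets

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Format a non-negative integer in base36 (lowercase digits)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def random_identifier(bits: int = 64, prefix: str = "x") -> str:
    """Random base36 string with a letter prefix, so the result never
    starts with a digit (safe as a variable or CSS class name)."""
    return prefix + to_base36(secrets.randbits(bits))


if __name__ == "__main__":
    print(random_identifier())
```

Python's int(text, 36) parses the result back, which makes the helper
easy to sanity-check.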
        
           | romanhn wrote:
           | Base62 is fantastic for URL-friendly encoding. I use GUIDs
           | for primary keys in my web app, and encode them for frontend
           | consumption using Base62. Looks much neater and doesn't cause
           | issues like Base64 extra characters might.
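
One way the GUID-to-Base62 idea above could look, as a sketch and not
necessarily how the commenter does it. Base62 has no single standard
alphabet; the digit order below (0-9, A-Z, a-z) and the fixed width
of 22 characters are assumptions made for this example.

```python
import uuid

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
WIDTH = 22  # 62**22 > 2**128, so 22 digits always fit a 128-bit value


def uuid_to_base62(u: uuid.UUID) -> str:
    """Encode a UUID as a fixed-width, URL-friendly Base62 string."""
    n = u.int
    out = []
    while n:
        n, remainder = divmod(n, 62)
        out.append(ALPHABET[remainder])
    return "".join(reversed(out)).rjust(WIDTH, ALPHABET[0])


def base62_to_uuid(text: str) -> uuid.UUID:
    """Inverse of uuid_to_base62."""
    n = 0
    for ch in text:
        n = n * 62 + ALPHABET.index(ch)  # ValueError on a foreign char
    return uuid.UUID(int=n)


if __name__ == "__main__":
    u = uuid.uuid4()
    short = uuid_to_base62(u)
    print(u, "->", short)
    assert base62_to_uuid(short) == u
```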
        
         | paulddraper wrote:
         | > base36 textual contexts
         | 
         | Better IMO is base 32 with U (obscenity), 0/O (ambiguity), and
         | I (ambiguity) removed.
        
           | gnabgib wrote:
           | So.. crockford32 mentioned in the article?
        
           | djbusby wrote:
           | That's Crockford Base32, not RFC Base32
           | 
           | https://en.m.wikipedia.org/wiki/Base32
        
             | paulddraper wrote:
             | Crockford is a bit different, and normalizes I/1/O/0 on
             | parsing.
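
The normalize-on-parsing behaviour mentioned above can be sketched
like this. It decodes a Crockford Base32 string to an integer and
leaves out the optional check symbols; the function name
crockford_decode is made up for this example.

```python
# Crockford's alphabet: digits plus letters, without I, L, O and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# On input, O is read as 0, I and L are read as 1, hyphens are dropped.
_NORMALIZE = str.maketrans({"O": "0", "I": "1", "L": "1", "-": None})


def crockford_decode(text: str) -> int:
    """Decode a Crockford Base32 string (case-insensitive) to an int."""
    cleaned = text.upper().translate(_NORMALIZE)
    value = 0
    for ch in cleaned:
        # .index raises ValueError for symbols outside the alphabet (e.g. U)
        value = value * 32 + CROCKFORD.index(ch)
    return value


if __name__ == "__main__":
    # "O1", "o1" and "01" all decode to the same number.
    print(crockford_decode("O1"), crockford_decode("o1"), crockford_decode("01"))
```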
        
           | charlieyu1 wrote:
           | Do we really expect humans to read baseX encodings directly
           | often enough to make ambiguity checks worthwhile?
        
             | saltcured wrote:
             | Sometimes. Imagine if this is being used to generate
             | something like a DOI or other catalog number for some data
             | or physical artifact. As research scales up, the size of
             | these identifiers also benefits from a more compact
             | encoding.
             | 
             | These kinds of IDs might be printed in a research paper
             | (perhaps in a figure caption or bibliography/reference
             | entry). Then, someone might be reading this from a printed
             | copy of the paper rather than a PDF with a link in it.
             | 
             | Or, researchers might be verbally referencing a particular
             | item during some meeting. It might be recognizable among
             | some peers actively working with the same artifacts, but
             | might also need to be typed back into some search form to
             | get back to online metadata etc.
             | 
             | Another place the same identifier might be is on a printed
             | label for physical artifacts in an archive. Of course, you
             | might also want something like a 2D barcode for scanning,
             | but it is helpful to have something human readable.
        
           | maxcoder4 wrote:
           | Removing characters for obscenity is pointless (there are
           | thousands of ways to evade this "filter"), English-centric
           | and honestly a weird idea.
           | 
           | I've always heard that the reason is another ambiguity (u/v),
           | which makes more sense to me.
        
             | paulddraper wrote:
             | Base64/Base32/ASCII is English-centric.
             | 
             | Might be weird to you personally, but there's literally
             | government agencies to prevent obscenities.
        
           | ljm wrote:
           | What makes the letter U obscene?
        
       | gnabgib wrote:
       | Hey Piotr/pchm, I'm not sure I follow your argument that Base32
       | is less popular because it's not a standard (there is a standard
       | - RFC4648 as you mention).
       | 
       | Not implementing the RFC is not implementing Base32: changing
       | the order, or using 32 emoji, does not make it Base32. Put another
       | way, you can change the order of characters in Base64, or use a
       | different dictionary, and indeed there are several variants of
       | that too (BinHex4, Uuencoding, Base64Url, B64) - there are
       | specific implementation detail concerns there too.
       | 
       | Base64 won out as a reasonably dense way to encode binary data in
       | 7-bit safe ASCII for use in email, and later http headers (where
       | spacing and line length may be modified in transit, and some
       | ASCII characters are prohibited - eg 0x00/null). Part of the
       | reason is that bit-grouping makes encode/decode simpler (you can
       | use bit shifting). Something like ASCII85/Base85 is a denser
       | encoding, and close to the maximum you can get in 7-bit safe
       | ASCII (94 characters 33-126 if space is important, 95 if space
       | quantity can be preserved), but you have to use multiply/divide
       | instructions. The intersection of bit-shift speed (power of 2)
       | and 7-bit safe ASCII characters (max 94 values) is: binary,
       | base4, octal, hexadecimal, base32, and base64.
       | 
       | For human readability, especially verbal communication,
       | hexadecimal or base32 are advantageous in that they are more
       | dense than decimal, can be generated via bit-shifting vs more
       | complex processor instructions, but you needn't also communicate
       | the character's case (unlike Base64).
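
To illustrate the bit-shifting point above, here is a minimal Base32
encoder that uses only shifts and masks, with no multiply or divide.
It uses the RFC 4648 alphabet and omits the "=" padding. The function
name b32encode_shift is made up for this sketch.

```python
import base64

RFC4648_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def b32encode_shift(data: bytes) -> str:
    """Unpadded RFC 4648 Base32 using only shifts and masks."""
    out = []
    buffer = 0  # bit accumulator
    bits = 0    # number of not-yet-emitted bits held in buffer
    for byte in data:
        buffer = (buffer << 8) | byte
        bits += 8
        while bits >= 5:
            bits -= 5
            out.append(RFC4648_ALPHABET[(buffer >> bits) & 0x1F])
            buffer &= (1 << bits) - 1  # keep only the leftover bits
    if bits:
        # Left-align the final partial group, filling with zero bits.
        out.append(RFC4648_ALPHABET[(buffer << (5 - bits)) & 0x1F])
    return "".join(out)


if __name__ == "__main__":
    sample = b"foobar"
    print(b32encode_shift(sample))
    # Cross-check against the standard library, minus its padding.
    assert b32encode_shift(sample) == base64.b32encode(sample).decode().rstrip("=")
```

A Base85 or Base62 encoder cannot be written this way, because 85
and 62 are not powers of two and the digits do not line up with bit
boundaries.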
        
         | pchm wrote:
         | You make some good points. What I was trying to say is that
         | even though there is the RFC, it's quite common to modify the
         | alphabet or use other variants like Crockford's (mainly to
         | avoid random profanity, e.g. in the URL identifiers).
         | 
         | When you see a Base64 string, you can be pretty certain that
         | it's the standard version. With Base32, it's not obvious which
         | variant was used.
         | 
         | Many languages don't provide a stdlib Base32 implementation
         | (Ruby doesn't), but Base64 is pretty much always included.
         | Maybe this influenced my perception of the lack of a universal
         | standard.
         | 
         | Anyway, I should work on that section to communicate my point
         | better.
        
           | gnabgib wrote:
           | In some cases (luck of the data, but often when encoding
           | ASCII without padding) you won't see the non alphanumeric
           | characters (62nd and 63rd place) in Base64 either. So you
           | _can't_ always tell the difference between Base64,
           | Base64Url, Xxencode, or B64.
           | 
           | "Hello, world!" = `SGVsbG8sIHdvcmxkIQ` (base64, base64url),
           | `BG4JgP4wg65RjQalY6E` (Xxencode), or `G4JgP4wg65RjQalY6E`
           | (b64). A legitimate reason for choosing B64 over Base64 would
           | be: it maintains ASCII sort-order.
           | 
           | Any language that has to deal with HTTP (or MIME) has to
           | encode/decode Base64 in order to support some headers (eg
           | Basic auth) and features (binary data from a form
           | submission). There is no similar HTTP need for Base32, so
           | perhaps it's less surprising it's not in the standard
           | library?
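
The "Hello, world!" example above can be reproduced with Python's
standard base64 module. This sketch covers only Base64 and Base64Url,
since Xxencode and B64 have no standard-library codec. The two-byte
input at the end is an arbitrary value chosen to force the 62nd and
63rd symbols to appear.

```python
import base64

data = b"Hello, world!"
standard = base64.b64encode(data).decode("ascii")
url_safe = base64.urlsafe_b64encode(data).decode("ascii")

print(standard)  # SGVsbG8sIHdvcmxkIQ==
print(url_safe)  # SGVsbG8sIHdvcmxkIQ==

# Identical here, because this input never hits symbols 62 and 63.
assert standard == url_safe
assert standard.rstrip("=") == "SGVsbG8sIHdvcmxkIQ"

# An input that does hit them shows where the variants differ.
tricky = bytes([0xFB, 0xFF])
print(base64.b64encode(tricky).decode("ascii"))          # +/8=
print(base64.urlsafe_b64encode(tricky).decode("ascii"))  # -_8=
```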
        
       | malodyets wrote:
       | I'm a big fan of base58
       | 
       | + almost as efficient as base64
       | + no special characters
       | + no padding characters
        
         | cerved wrote:
         | I'm a big fan of not base anything encoding
        
         | Dylan16807 wrote:
         | Base64 doesn't need padding so that one's easy.
         | 
         | No special characters... I mean it's true, but there's not many
         | places I'm worried about inability to mix in some - and _.
         | 
         | Base58 also avoids a _couple_ confusable characters, but that
         | only matters when copying by hand, and if I'm copying by hand
         | I'd rather use base32.
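
On padding being optional: the "=" characters carry no data, and the
number needed follows from the string length alone. Python's decoder
happens to insist on them, so this sketch puts them back before
decoding. The function name b64decode_unpadded is made up for this
example.

```python
import base64


def b64decode_unpadded(text: str) -> bytes:
    """Decode Base64 text whose trailing '=' padding was stripped."""
    padding = "=" * (-len(text) % 4)  # 0, 1 or 2 for valid input
    return base64.b64decode(text + padding)


if __name__ == "__main__":
    print(b64decode_unpadded("SGVsbG8sIHdvcmxkIQ"))
```

An input length of 1 modulo 4 cannot come from a valid encoding; this
sketch does not special-case it and leaves the error to the standard
library.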
        
       | waynesonfire wrote:
       | Which encoding eliminates symbols that can be confused, like O vs
       | 0? Or I vs l?
        
         | Snawoot wrote:
         | z-base-32 is one example:
         | https://philzimmermann.com/docs/human-oriented-base-32-encod...
        
       | rsk wrote:
       | If anyone is interested, here is my article about optimization of
       | encoding/decoding u128 to base62 (non power of 2):
       | https://dev.to/rsk/optimization-of-u128-to-base62-encoding-3...
        
       | sedatk wrote:
       | Base32 encouraged me to develop my own Base32-encoder on
       | .NET. I eventually added other encoding types over the years,
       | leading to the library called SimpleBase. It's now being used by
       | popular packages like Ipfs.Core, net-dns, and KubeOps.
       | 
       | https://github.com/ssg/SimpleBase
        
       ___________________________________________________________________
       (page generated 2023-12-17 23:00 UTC)