[HN Gopher] How does Base32 (or any Base2^n) work exactly?
___________________________________________________________________
How does Base32 (or any Base2^n) work exactly?
Author : pchm
Score : 39 points
Date : 2023-12-17 15:17 UTC (7 hours ago)
(HTM) web link (ptrchm.com)
(TXT) w3m dump (ptrchm.com)
| daisydaisytts wrote:
| Just putting this here for alternate baseX:
| https://github.com/qntm/base2048
| Uptrenda wrote:
| A few other bases that are interesting:
|
| Base36: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
|
| Good encoding for binary data in textual contexts, such as where
| you have parameter inputs or database fields that are constrained
| and only accept certain characters. The lack of spaces means that
| it can be used on the command-line easily. Example use: IRC
| channel names.
|
| Base64:
| 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
|
| Same as above but it adds lower-case alphabet characters. This is
| important because the fewer characters the alphabet allows, the
| longer the encoded string becomes. With more characters the
| coding is more efficient. Example use: YouTube video ids.
|
| Base92: "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrst
| uvwxyz~!@#$%^&*()_+{}\"<>?`-=[];',./|"
|
| Base92 is every character you can make on a standard keyboard
| (I've replaced space with pipe here.) It includes many characters
| that have special meanings on the command-line or may be used as
| delimiters in text-based protocols. So while this offers a more
| 'efficient' encoding scheme for binary data it may break in some
| contexts. It's best where the input allows for typical
| formatting. Example use: forum / chat messages.
|
| BaseN encoding schemes are interesting because they allow you to
| use standard text-fields in many systems to store binary data.
| The most well-known here is base64 which allows browsers to embed
| whole files as text and store them directly in the HTML. Some
| sites use these for optimization hacks.
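To put rough numbers on how alphabet size affects encoded length, here is a minimal Python sketch. It assumes the data is treated as one big integer and counts the fewest symbols that can represent it; `encoded_length` is a hypothetical helper, not part of any library. Real formats that add padding or work in fixed blocks (for example padded Base64, which emits 24 characters for 16 bytes) can come out slightly longer.

```python
def encoded_length(num_bytes: int, alphabet_size: int) -> int:
    """Fewest symbols needed to represent any value of num_bytes bytes."""
    limit = 1 << (8 * num_bytes)  # number of distinct values to cover
    length, capacity = 0, 1
    while capacity < limit:
        capacity *= alphabet_size  # one more symbol multiplies capacity
        length += 1
    return length


# Symbols needed for a 16-byte value (e.g. a GUID) in the bases discussed.
for size in (16, 32, 36, 62, 64, 92):
    print(f"base{size}: {encoded_length(16, size)} symbols")
```

For 16 bytes this gives 32 symbols in base16, 26 in base32, 25 in base36, 22 in base62 and base64, and 20 in base92, so the gain from each larger alphabet shrinks quickly.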
| MiddleEndian wrote:
| >Base64:
| 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
|
| This is only 62 characters!
| dexwiz wrote:
| Base64 uses / and +
| explaininjs wrote:
| Sometimes. Other times those chars are not allowed in the
| embedding context (paths, for instance), so you have to use
| '+' and ','. Or maybe '_' and '-'.
| gnabgib wrote:
| Suspect they mistyped `Base62` (since they seem to be
| favouring non powers of two)
| explaininjs wrote:
| That is not base64, it's base62. You can tell because it only
| has 62 symbols. To get base64 you have to add 2 symbols that
| you arbitrarily select from the master "table of symbols to add
| to base62 to get to base64 depending on what the platform is
| and what characters are restricted in it" [1]. For instance you
| might use `@`, except in an email. Or `/`, but not in an fs
| path or URL.
|
| As for base92, those symbols might all be easy to enter on your
| keyboard, but on international layouts the process can be quite
| involved indeed.
|
| I prefer base36 for this reason. Want a compact random string?
| Math.random().toString(36). Watch out to prefix it with a char
| for settings that disallow leading digits though! (variable
| identifiers, css class names, etc.)
|
| [1] https://en.wikipedia.org/wiki/Base64#Variants_summary_table
| romanhn wrote:
| Base62 is fantastic for URL-friendly encoding. I use GUIDs
| for primary keys in my web app, and encode them for frontend
| consumption using Base62. Looks much neater and doesn't cause
| issues like Base64 extra characters might.
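A minimal Python sketch of what Base62-encoding a GUID could look like. The comment above does not describe the actual scheme, so the alphabet order (taken from the 62-character string earlier in the thread) and the helper names `base62_encode` and `base62_decode` are assumptions. Because 62 is not a power of two, the digits come from repeated division rather than bit shifting.

```python
import uuid

# Assumed alphabet: digits, then upper-case, then lower-case letters.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(value: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value:
        value, remainder = divmod(value, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first


def base62_decode(text: str) -> int:
    """Decode a Base62 string back to an integer."""
    value = 0
    for ch in text:
        value = value * 62 + ALPHABET.index(ch)
    return value


guid = uuid.uuid4()
token = base62_encode(guid.int)  # at most 22 characters for 128 bits
assert uuid.UUID(int=base62_decode(token)) == guid
```

Note that this sketch drops leading zeros, so tokens vary in length; a real implementation might left-pad to 22 characters to keep them fixed-width.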
| paulddraper wrote:
| > base36 textual contexts
|
| Better IMO is base 32 with U (obscenity), 0/O (ambiguity), and
| I (ambiguity) removed.
| gnabgib wrote:
| So.. crockford32 mentioned in the article?
| djbusby wrote:
| That's Crockford Base32, not RFC Base32
|
| https://en.m.wikipedia.org/wiki/Base32
| paulddraper wrote:
| Crockford is a bit different, and normalizes I/1/O/0 on
| parsing.
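The normalization on parsing can be sketched roughly as follows in Python. This is an illustrative decoder to an integer, not a full implementation (no check symbols), and `crockford_decode` is a hypothetical name. It relies on Crockford's alphabet omitting I, L, O and U, and on the decoder accepting lower case, treating O as 0 and I or L as 1, and ignoring hyphens.

```python
# Crockford's 32 symbols: digits plus letters without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Look-alike characters are mapped to the digit they resemble.
_NORMALIZE = str.maketrans({"O": "0", "I": "1", "L": "1"})


def crockford_decode(text: str) -> int:
    """Decode a Crockford Base32 string to an integer, normalizing input."""
    cleaned = text.upper().replace("-", "").translate(_NORMALIZE)
    value = 0
    for ch in cleaned:
        # str.index raises ValueError for symbols outside the alphabet (e.g. U).
        value = (value << 5) | CROCKFORD.index(ch)
    return value


# "1O", "lo" and "10" all parse to the same value.
assert crockford_decode("1O") == crockford_decode("lo") == crockford_decode("10")
```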
| charlieyu1 wrote:
| Do we really expect humans to read baseX encodings directly
| often enough to make ambiguity checks worthwhile?
| saltcured wrote:
| Sometimes. Imagine if this is being used to generate
| something like a DOI or other catalog number for some data
| or physical artifact. As research scales up, the size of
| these identifiers also benefits from a more compact
| encoding.
|
| These kinds of IDs might be printed in a research paper
| (perhaps in a figure caption or bibliography/reference
| entry). Then, someone might be reading this from a printed
| copy of the paper rather than a PDF with a link in it.
|
| Or, researchers might be verbally referencing a particular
| item during some meeting. It might be recognizable among
| some peers actively working with the same artifacts, but
| might also need to be typed back into some search form to
| get back to online metadata etc.
|
| Another place the same identifier might be is on a printed
| label for physical artifacts in an archive. Of course, you
| might also want something like a 2D barcode for scanning,
| but it is helpful to have something human readable.
| maxcoder4 wrote:
| Removing characters for obscenity is pointless (thousands of
| ways to evade this "filter"), English-centric and honestly a
| weird idea.
|
| I've always heard that the reason is another ambiguity (u/v)
| which makes more sense to me.
| paulddraper wrote:
| Base64/Base32/ASCII is English-centric.
|
| Might be weird to you personally, but there's literally
| government agencies to prevent obscenities.
| ljm wrote:
| What makes the letter U obscene?
| gnabgib wrote:
| Hey Piotr/pchm, I'm not sure I follow your argument that Base32
| is less popular because it's not a standard (there is a standard
| - RFC4648 as you mention).
|
| Not implementing the RFC is not implementing Base32; changing
| the order, or using 32 emoji, does not make it Base32. Put another
| way, you can change the order of characters in Base64, or use a
| different dictionary, and indeed there are several variants of
| that too (BinHex4, Uuencoding, Base64Url, B64) - there are
| specific implementation detail concerns there too.
|
| Base64 won out as a reasonably dense way to encode binary data in
| 7-bit safe ASCII for use in email, and later http headers (where
| spacing and line length may be modified in transit, and some
| ASCII characters are prohibited - eg 0x00/null). Part of the
| reason is that bit-grouping makes encode/decode simpler (you can
| use bit shifting). Something like ASCII85/Base85 is a denser
| encoding, and close to the maximum you can get in 7-bit safe
| ASCII (94 characters, 33-126, if space is important; 95 if space
| quantity can be preserved), but you have to use multiply/divide
| instructions. The intersection of bit-shift speed (power of 2)
| and 7-bit safe ASCII characters (max 94 values) is: binary,
| base4, octal, hexadecimal, base32, and base64.
|
| For human readability, especially verbal communication,
| hexadecimal or base32 are advantageous in that they are more
| dense than decimal, can be generated via bit-shifting vs more
| complex processor instructions, but you needn't also communicate
| the character's case (unlike Base64).
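The bit-shifting point can be illustrated with a small Python sketch of an unpadded Base32 encoder using the RFC 4648 alphabet. `base32_encode` is a hypothetical name, and the standard library's `base64.b32encode` is used only to cross-check the result. Bytes are shifted into a buffer and 5-bit groups are shifted back out, with no multiplication or division involved.

```python
import base64

RFC4648_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def base32_encode(data: bytes) -> str:
    """Encode bytes as unpadded RFC 4648 Base32 using only shifts and masks."""
    out = []
    buffer = 0  # holds bits not yet emitted
    bits = 0    # number of valid bits in buffer
    for byte in data:
        buffer = (buffer << 8) | byte
        bits += 8
        while bits >= 5:
            bits -= 5
            out.append(RFC4648_ALPHABET[(buffer >> bits) & 0x1F])
        buffer &= (1 << bits) - 1  # keep only the leftover bits
    if bits:
        # Left-align the final partial group, filling with zero bits.
        out.append(RFC4648_ALPHABET[(buffer << (5 - bits)) & 0x1F])
    return "".join(out)


sample = b"Hello, world!"
expected = base64.b32encode(sample).decode("ascii").rstrip("=")
assert base32_encode(sample) == expected
```

The same loop works for any power-of-two base by changing the group width (6 bits for Base64, 4 for hexadecimal) and the alphabet.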
| pchm wrote:
| You make some good points. What I was trying to say is that
| even though there is the RFC, it's quite common to modify the
| alphabet or use other variants like Crockford's (mainly to
| avoid random profanity, e.g. in the URL identifiers).
|
| When you see a Base64 string, you can be pretty certain that
| it's the standard version. With Base32, it's not obvious which
| variant was used.
|
| Many languages don't provide a stdlib Base32 implementation
| (Ruby doesn't), but Base64 is pretty much always included.
| Maybe this influenced my perception of the lack of a universal
| standard.
|
| Anyway, I should work on that section to communicate my point
| better.
| gnabgib wrote:
| In some cases (luck of the data, but often when encoding
| ASCII without padding) you won't see the non alphanumeric
| characters (62nd and 63rd place) in Base64 either. So you
| _can't_ always tell the difference between Base64,
| Base64Url, Xxencode, or B64.
|
| "Hello, world!" = `SGVsbG8sIHdvcmxkIQ` (base64, base64url),
| `BG4JgP4wg65RjQalY6E` (Xxencode), or `G4JgP4wg65RjQalY6E`
| (b64). A legitimate reason for choosing B64 over Base64 would
| be: it maintains ASCII sort-order.
|
| Any language that has to deal with HTTP (or MIME) has to
| encode/decode Base64 in order to support some headers (eg
| Basic auth) and features (binary data from a form
| submission). There is no similar HTTP need for Base32, so
| perhaps it's less surprising it's not in the standard
| library?
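The Base64 versus Base64Url case can be checked with Python's standard `base64` module. This sketch covers only those two variants (Xxencode and B64 are not in the standard library), and `variants` is a hypothetical helper that strips padding to match the strings quoted above.

```python
import base64


def variants(data: bytes):
    """Return (standard, url-safe) Base64 encodings with padding removed."""
    standard = base64.b64encode(data).decode("ascii").rstrip("=")
    url_safe = base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")
    return standard, url_safe


# No 62nd/63rd symbols are needed, so the two variants are identical.
print(variants(b"Hello, world!"))

# These bytes hit both special positions, so the variants differ.
print(variants(b"\xfb\xff"))
```

The first call yields `SGVsbG8sIHdvcmxkIQ` twice; the second yields `+/8` and `-_8`, the only kind of input where the variant can be told apart by inspection.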
| malodyets wrote:
| I'm a big fan of base58
|
| + almost as efficient as base64
| + no special characters
| + no padding characters
| cerved wrote:
| I'm a big fan of not base anything encoding
| Dylan16807 wrote:
| Base64 doesn't need padding so that one's easy.
|
| No special characters... I mean it's true, but there's not many
| places I'm worried about inability to mix in some - and _.
|
| Base58 also avoids a _couple_ confusable characters, but that
| only matters when copying by hand, and if I'm copying by hand
| I'd rather use base32.
| waynesonfire wrote:
| Which encoding eliminates symbols that can be confused, like O vs
| 0? Or I vs l?
| Snawoot wrote:
| z-base-32 is one example:
| https://philzimmermann.com/docs/human-oriented-base-32-encod...
| rsk wrote:
| If anyone is interested, here is my article about optimization of
| encoding/decoding u128 to base62 (non power of 2)
| https://dev.to/rsk/optimization-of-u128-to-base62-encoding-3...
| sedatk wrote:
| Base32 encouraged me to develop my own Base32-encoder on
| .NET. I eventually added other encoding types over the years,
| leading to the library called SimpleBase. It's now being used by
| popular packages like Ipfs.Core, net-dns, and KubeOps.
|
| https://github.com/ssg/SimpleBase
___________________________________________________________________
(page generated 2023-12-17 23:00 UTC)