# Text Encoding Concerns in Ada

September 2, 2025

I wrote a very long reply to someone on the Comp.Lang.Ada Usenet newsgroup about handling text encoding in Ada, which is important information for modern programs and is very poorly documented; so I figured I should duplicate the reply here for posterity. I was replying to this post, archived here: files/ada-encodings.eml. It has an amusingly large collection of quotes of people talking about the issues with text encoding in Ada, and then asks how to handle text encoding for their specific project.

***

I've written about this at length before because it's a major pain point, but I can't find any of my old writing on it so I've rewritten it here lol. I go into extremely verbose detail on all the recommendations and the issues at play below, but to summarize:

* You really should use Unicode both in storage/interchange and internally
* Use `Wide_Wide_<>` internally everywhere in your program
* Use Ada's Streams facility to read/write external text as binary, transcoding it manually using `UTF_Encoding` (or custom-implemented routines if you need non-Unicode encodings)
* You can use `Text_Streams` to get a binary stream even from stdin/stdout/stderr, although with some annoying caveats regarding `Text_IO` adding spurious end-of-file newlines
* Be careful with string functions that inspect the contents of strings even for `Wide_Wide_String`s, because Unicode can have tricky issues (basically, only ever look for/split on/etc. hardcoded valid sequences/characters, due to issues with multi-codepoint graphemes)

***

Right off the bat, in modern code either on its own or interfacing with other modern code, you really should use Unicode, and really really should use UTF-8. If you use Latin-1 or Windows-1252 or some weird regional encoding everyone will hate you, and if you restrict inputs to 7-bit ASCII everyone will hate you too lol. And people will get annoyed if you use UTF-16 or UTF-32 instead of UTF-8 as the interchange/storage format in a new program.

But first, looking at how you deal with text internally within your program, you *really* only have two options (technically there are more, but the others are not good): storing UTF-8 in `String`s (you have to use a `String` even for individual characters), or storing UTF-32 in `Wide_Wide_String`/`Wide_Wide_Character`s.

When storing UTF-8 in a `String` (for good practice, use the `Ada.Strings.UTF_Encoding.UTF_8_String` subtype just to indicate that it is UTF-8 and not Latin-1), the main thing is that you can't freely use, and have to be very cautious with (and really should just avoid if possible), any of the built-in `String`/`Unbounded_String` utilities that inspect or manipulate the contents of the text. With `Wide_Wide_<>`, you're technically wasting 11 out of every 32 bits of memory for alignment reasons (or 24 out of 32 bits with text that's mostly ASCII with only the occasional higher character), but eh, not that big a deal *on modern systems capable of running a modern hosted environment*.

Note that there is zero chance in hell that UTF-32 will ever be adopted as an interchange or storage encoding (except in isolated singular corporate apps *maybe*), so UTF-32 should purely be an internal implementation detail: incoming text in whatever encoding gets converted to it, and outgoing text always gets converted from it.
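To make that concrete, here's a minimal sketch of doing the conversion once at the boundary with the standard `Ada.Strings.UTF_Encoding.Wide_Wide_Strings` package; the procedure and variable names are just made up for illustration:

```ada
--  Minimal sketch: transcode once at the boundary, work on
--  Wide_Wide_String internally.  Procedure/variable names are
--  made up for illustration; the packages are standard Ada 2012.
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Boundary_Sketch is
   package UTF renames Ada.Strings.UTF_Encoding;
   package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Pretend these raw UTF-8 bytes came from a file/socket/stdin.
   --  "naïve": 6 bytes, 5 scalar values (the ï is C3 AF).
   Incoming : constant UTF.UTF_8_String :=
     "na" & Character'Val (16#C3#) & Character'Val (16#AF#) & "ve";

   --  Decode exactly once at the boundary; raises
   --  UTF.Encoding_Error if the input is malformed UTF-8.
   Internal : constant Wide_Wide_String := WWS.Decode (Incoming);

   --  ...all internal processing uses Wide_Wide_String...

   --  Encode exactly once on the way back out.
   Outgoing : constant UTF.UTF_8_String := WWS.Encode (Internal);
begin
   pragma Assert (Incoming'Length = 6 and Internal'Length = 5);
   pragma Assert (Outgoing = Incoming);
end Boundary_Sketch;
```

(`Decode`/`Encode` also have overloads taking an explicit `Encoding_Scheme` (UTF_8, UTF_16BE, UTF_16LE) if you need something other than plain UTF-8.)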
And you should only convert at the I/O "boundary": don't have half of your program dealing with the native string encoding and half dealing with `Wide_Wide_<>` (with the only exception being that if you don't need to look at the string's contents and are just passing it through, then you can and should avoid transcoding at all). I personally use `Wide_Wide_<>` for everything, just because it's more convenient to have more useful built-in string functions and it makes dealing with input/output encoding much easier later (detailed below). I would never use `Wide_<>` unless you're exclusively targeting Windows or something, because UTF-16 is just inconvenient: it has none of the benefits of UTF-8 nor any of the benefits of UTF-32, and most of the downsides of both. Plus, since Ada standardized wide characters so early, there are additional fuckups relating to UCS-2-to-UTF-16 incompatibilities like Windows has[1] and you absolutely do not want to deal with that.

I'm unfortunate enough to know most of the nuances of Unicode and I won't subject you to all of them, but a lot of the statements in your collection are a bit oversimplified (UCS-4 has a number of additional differences from UTF-32 regarding "valid encodings", namely that all valid Unicode codepoints (0x0–0x10FFFF inclusive) are allowed in UCS-4 but only Unicode scalar values (0x0–0xD7FF and 0xE000–0x10FFFF inclusive) are valid in UTF-32), and are missing some additional information. A key detail is that even with UTF-32, where each Unicode scalar value is held in one array element rather than being variable-width like UTF-8/UTF-16, you still can't treat strings as arbitrary arrays like 7-bit ASCII, because a grapheme can be made up of multiple Unicode scalar values. Even with ASCII characters there's the possibility of combining diacritics or such that would break if you split the string between them.

Also, I just stumbled across `Ada.Strings.Text_Buffers`, which seems to be new to Ada 2022. It makes "string builder" stuff much more convenient because you can write text using any of Ada's string types and then get a string out of it in whatever encoding you want (and with the correct system-specific line endings, which is a whole 'nother issue with Ada strings) instead of needing to fiddle with all that manually; maybe that'll be useful if you can use Ada 2022.

[1]: https://wtf-8.codeberg.page/#ill-formed-utf-16

***

Okay, so I've discussed the internal representation and its issues, but now we get into input/output transcoding... this is just a nightmare in Ada, with one almost-decent solution, and even it has caveats and bugs, uggh.

In general, the `Text_IO` packages will always transcode the input file from whatever format they decide you're getting and transcode your given output into some other format, and it's annoying to configure what encoding is used at compile time[2] and impossible to change at runtime, which makes the `Text_IO` packages just useless for anything beyond Latin-1/ASCII IMO. Even if you get GNAT whipped into shape for your codebase's needs, you're abandoning all portability should a hypothetical second Ada implementation that you might want to use arise. The only way to get full control of the input and output encodings is to use one of Ada's ways of performing binary I/O and then manually convert strings to/from binary yourself. I personally prefer using Streams over `Sequential_IO`/`Direct_IO`, using `UTF_Encoding` (or the new `Text_Buffers`) to convert to/from the specific format I want before reading from or writing to the stream.
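As a rough sketch of what that looks like when writing to stdout (again, the procedure and variable names here are my own invention; the packages are the standard ones):

```ada
--  Rough sketch: write UTF-8 bytes straight to stdout through a binary
--  stream, bypassing Text_IO's transcoding.  Procedure/variable names
--  are invented for the example.
with Ada.Text_IO;
with Ada.Text_IO.Text_Streams;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Stream_Sketch is
   package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   Stdout : constant Ada.Text_IO.Text_Streams.Stream_Access :=
     Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output);

   --  Internal text stays Wide_Wide_String right up to the boundary.
   --  "héllo, world" followed by a newline.
   Message : constant Wide_Wide_String :=
     "h" & Wide_Wide_Character'Val (16#E9#) & "llo, world" &
     Wide_Wide_Character'Val (16#0A#);
begin
   --  String'Write emits the raw encoded bytes with no transcoding and
   --  no array bounds (unlike String'Output).  Mind the spurious
   --  trailing-newline caveat discussed below.
   String'Write (Stdout, WWS.Encode (Message));
end Stream_Sketch;
```

Reading is the mirror image: pull the raw bytes off the stream yourself and `Decode` them into your internal `Wide_Wide_String`s.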
There is one singular bug though: if you use `Ada.Text_IO.Text_Streams` to get a byte stream from a `Text_IO` output file (the only way to read/write binary data from stdin, stdout, and stderr at all), then after you write through the stream and the file is closed, an extra newline will always be added. The Ada standard requires that `Text_IO` always output a newline if the output didn't end with one, and the stream from `Text_Streams` completely bypasses all of the `Text_IO` package's bookkeeping, so from its perspective nothing was written to the file (let alone a newline) and it has to add one.[3] So you either just have to deal with output files having an empty trailing line, or make sure to strip the final newline off of the text you're outputting.

[2]: The problem is that GNAT completely changes how the `Text_IO` packages behave with regards to text encoding through opaque methods. The encodings used by `Text_IO` are mostly (but not entirely) based off of the `-gnatW` flag, which configures the encoding of *THE PROGRAM'S SOURCE CODE*. Absolutely batshit that they abused the source file encoding flag as the only way for the programmer to configure what encoding the program reads and writes, which is completely orthogonal to the source code format.

[3]: When I was more active on IRC, either Lucretia or Shark8 (both of whom you quoted) would complain about this in #ada on libera.chat every time `Text_IO` was brought up lol (can't remember which of them it was). It is extremely annoying even when you use `Text_IO` directly rather than through streams, because it's messing with my damn file even when I didn't ask it to.

***

Sorry for it being so long, but that's the horror of working with text XD, particularly in older things like Ada that didn't have the benefit of modern hindsight for how text encoding would end up and had to bolt on solutions afterwards that just don't work right. Although at least Ada is better than the C/C++ nightmare[4] or Windows or really any software created prior to Unicode 1.1 (1993).

[4]: https://github.com/mpv-player/mpv/commit/1e70e82baa91

* * *

Contact via email: alex [at] nytpu.com or through anywhere else I'm at: gopher://nytpu.com/0/about

Copyright (c) 2025 nytpu - CC BY-SA 4.0