Text layout is a loose hierarchy of segmentation

Oct 26, 2020

I love text layout, and have been working with it in one form or other for over 35 years. Yet, knowledge about it is quite arcane. I don't believe there is a single place where it's all properly written down. I have some explanation for that: while basic text layout is very important for UI, games, and other contexts, a lot of the "professional" needs around text layout are embedded in much more complicated systems such as Microsoft Word or a modern Web browser. A complete account of text layout would be at least a small book. Since there's no way I can write that now, this blog post is a small step towards it: an attempt to describe the "big picture," using the conceptual framework of a "loose hierarchy."

Essentially, a text layout engine breaks the input into finer and finer grains, then reassembles the results into a text layout object suitable for drawing, measurement, and hit testing. The main hierarchy is concerned with laying out the entire paragraph as a single line of text. Line breaking is also important, but has a separate, parallel hierarchy.

The main text layout hierarchy

The hierarchy is: paragraph segmentation as the coarsest granularity, followed by rich text style and BiDi analysis, then itemization (coverage by font), then Unicode script, and shaping clusters as the finest.

[diagram of the layout hierarchy]

Paragraph segmentation

The coarsest, and also simplest, segmentation task is paragraph segmentation. Most of the time, paragraphs are simply separated by newline (U+000A) characters, though Unicode in its infinite wisdom specifies a number of code point sequences that function as paragraph separators in plain text:

* U+000A LINE FEED
* U+000B VERTICAL TAB
* U+000C FORM FEED
* U+000D CARRIAGE RETURN
* U+000D U+000A (CR + LF)
* U+0085 NEXT LINE
* U+2028 LINE SEPARATOR
* U+2029 PARAGRAPH SEPARATOR

In rich text, paragraphs are usually indicated through markup rather than special characters, for example <p> or <br> in HTML. But in this post, as in most text layout APIs, we'll treat rich text as plain text + attribute spans.
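To make this first level concrete, here is a minimal Rust sketch of paragraph segmentation (the function name and shape are mine, not from any particular library), splitting a string into byte ranges at the separators listed above:

```rust
use std::ops::Range;

/// Split text into paragraph ranges at the Unicode paragraph separators
/// listed above, treating CR + LF as a single separator.
fn paragraph_ranges(text: &str) -> Vec<Range<usize>> {
    let mut paragraphs = Vec::new();
    let mut start = 0;
    let mut iter = text.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        let is_sep = matches!(
            c,
            '\u{000A}' | '\u{000B}' | '\u{000C}' | '\u{000D}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
        );
        if !is_sep {
            continue;
        }
        paragraphs.push(start..i);
        let mut next_start = i + c.len_utf8();
        // CR immediately followed by LF counts as one separator, not two.
        if c == '\u{000D}' {
            if let Some(&(j, '\u{000A}')) = iter.peek() {
                next_start = j + 1;
                iter.next();
            }
        }
        start = next_start;
    }
    // The final paragraph runs to the end of the text.
    paragraphs.push(start..text.len());
    paragraphs
}
```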
Rich text style
A paragraph of rich text may contain spans that can affect
formatting. In particular, choice of font, font weight, italic or no,
and a number of other attributes can affect text layout. Thus, each
paragraph is typically broken into some number of style runs, so
that within a run the style is consistent.
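As a sketch of what this looks like in code (the type and field names here are mine, not any specific toolkit's API), a paragraph of rich text can be represented as plain text plus attribute spans, and the style runs fall out by walking the spans:

```rust
use std::ops::Range;

/// A sketch of the "plain text + attribute spans" representation, limited to
/// attributes that affect layout.
#[derive(Clone, PartialEq)]
struct Style {
    font_family: String,
    weight: u16, // 400 = regular, 700 = bold, etc.
    italic: bool,
    size: f32,   // em size in pixels (or points)
}

struct RichText {
    text: String,                      // plain text of the paragraph
    spans: Vec<(Range<usize>, Style)>, // sorted, non-overlapping attribute spans
}

/// Style runs: maximal ranges over which the resolved style is constant.
/// With sorted, non-overlapping spans this is just the spans themselves,
/// with any gaps filled by the default style.
fn style_runs(rich: &RichText, default: &Style) -> Vec<(Range<usize>, Style)> {
    let mut runs = Vec::new();
    let mut pos = 0;
    for (range, style) in &rich.spans {
        if range.start > pos {
            runs.push((pos..range.start, default.clone()));
        }
        runs.push((range.clone(), style.clone()));
        pos = range.end;
    }
    if pos < rich.text.len() {
        runs.push((pos..rich.text.len(), default.clone()));
    }
    runs
}
```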
Note that some style changes don't necessarily affect text layout. A
classic example is color. Firefox, rather famously, does not create
segmentation boundaries for color changes. If a color boundary
cuts a ligature, it uses fancy graphics techniques to render parts of
the ligature in different colors. But this is a subtle refinement and
I think not required for basic text rendering. For more details, see
Text Rendering Hates You.
Bidirectional analysis
Completely separate from the style spans, a paragraph may in general
contain both left-to-right and right-to-left text. The need for
bidirectional (BiDi) text is certainly one of the things that makes
text layout more complicated.
Fortunately, this part of the stack is defined by a standard (UAX #9
), and there are a number of good implementations. The interested
reader is referred to Unicode Bidirectional Algorithm basics. The key
takeaway here is that BiDi analysis is done on the plain text of the
entire paragraph, and the result is a sequence of level runs, where
the level of each run defines whether it is LTR or RTL.
The level runs and the style runs are then merged, so that in
subsequent stages each run is of a consistent style and
directionality. As such, for the purpose of defining the hierarchy,
the result of BiDi analysis could alternatively be considered an
implicit or derived rich text span.
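Here is a rough sketch of that merge step (again, the names are illustrative rather than a real library's API): walk the style runs and the BiDi level runs in parallel and cut a new run wherever either one changes. By UAX #9 convention, even levels are LTR and odd levels are RTL.

```rust
use std::ops::Range;

/// Merge style runs with BiDi level runs into runs of constant style and
/// direction. Assumes both run lists cover the whole paragraph from offset 0.
/// `S` is whatever style type the rich text representation uses.
fn merge_runs<S: Clone>(
    style_runs: &[(Range<usize>, S)],
    level_runs: &[(Range<usize>, u8)],
) -> Vec<(Range<usize>, S, u8)> {
    let mut merged = Vec::new();
    let (mut i, mut j) = (0, 0);
    let mut pos = 0;
    while i < style_runs.len() && j < level_runs.len() {
        let style_end = style_runs[i].0.end;
        let level_end = level_runs[j].0.end;
        let end = style_end.min(level_end);
        merged.push((pos..end, style_runs[i].1.clone(), level_runs[j].1));
        pos = end;
        // Advance whichever run (or both) ends here.
        if style_end == end { i += 1; }
        if level_end == end { j += 1; }
    }
    merged
}
```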
In addition to BiDi, which I consider a basic requirement, a more
sophisticated text layout engine will also be able to handle vertical
writing modes, including mixed cases where short strings are
horizontal within the vertical primary direction. Extremely
sophisticated layout engines will also be able to handle ruby text
and other ways of annotating the main text flow with intercalated
strings. See Requirements for Japanese Text Layout for many examples
of sophisticated layout requirements; the scope of this blog post
really is basic text layout of the kind needed in user interfaces.
Itemization (font coverage)
Itemization is the trickiest and least well specified part of the
hierarchy. There is no standard for it, and no common implementation.
Rather, each text layout engine deals with it in its own special way.
Essentially, the result of itemization is to choose a single concrete
font for a run, from a font collection. Generally a font collection
consists of a main font (selected by font name from system fonts, or
loaded as a custom asset), backed by a fallback stack. The fallback
fonts are usually system fonts, but thanks to Noto it is possible to
bundle a fallback stack with an application, if you don't mind
spending a few hundred megabytes on the assets.
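At its simplest, itemization can be sketched as a walk over the text, picking the first font in the fallback stack that covers each character and merging adjacent characters that land on the same font. The types here are hypothetical stand-ins, and the normalization and emoji issues discussed below are exactly what this naive version gets wrong:

```rust
use std::collections::HashSet;
use std::ops::Range;

/// A deliberately naive itemization sketch. `Font` is a stand-in type, and
/// `coverage` stands in for a real cmap lookup.
struct Font {
    coverage: HashSet<char>,
}

/// Assumes a non-empty stack whose first entry is the main font.
fn itemize<'a>(text: &str, stack: &'a [Font]) -> Vec<(Range<usize>, &'a Font)> {
    let mut runs: Vec<(Range<usize>, &Font)> = Vec::new();
    for (i, c) in text.char_indices() {
        let end = i + c.len_utf8();
        // First covering font wins; if nothing covers the character, fall back
        // to the main font, which will render .notdef.
        let font = stack
            .iter()
            .find(|f| f.coverage.contains(&c))
            .unwrap_or(&stack[0]);
        let extends_last =
            matches!(runs.last(), Some((_, run_font)) if std::ptr::eq(*run_font, font));
        if extends_last {
            runs.last_mut().unwrap().0.end = end;
        } else {
            runs.push((i..end, font));
        }
    }
    runs
}
```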
Why is it so tricky? A few reasons, which I'll touch on.
First, it's not so easy to determine whether a font can render a
particular string of text. One reason is Unicode normalization. For
example, the string "é" can be encoded as U+00E9 (in NFC encoding) or
as U+0065 U+0301 (in NFD encoding). Due to the principle of Unicode
equivalence, these should be rendered identically, but a font may
have coverage for only one or the other in its Character to Glyph
Index Mapping (cmap) table. The shaping engine has all the Unicode
logic to handle these cases.
Of course, realistic fonts with Latin coverage will have both of
these particular sequences covered in the cmap table, but edge cases
certainly do happen, both in extended Latin ranges, and other scripts
such as Hangul, which has complex normalization rules (thanks in part
to a Korean standard for normalization which is somewhat at odds with
Unicode). It's worth noting that DirectWrite gets Hangul
normalization quite wrong.
I believe a similar situation exists with the Arabic presentation
forms; see Developing Arabic fonts for more detail on that.
Because of these tricky normalization and presentation issues, the
most robust way to determine whether a font can render a string is to
try it. This is how LibreOffice has worked for a while, and in 2015
Chromium followed. See also Eliminating Simple Text for more
background on the Chromium text layout changes.
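As a sketch of the "try it" approach, assuming the harfbuzz_rs bindings (the function here is mine): shape the string with the candidate font and check whether any glyph comes out as glyph id 0, the .notdef glyph.

```rust
use harfbuzz_rs::{shape, Face, Font, UnicodeBuffer};

/// Can this font render the string? Shape it and look for .notdef glyphs.
fn font_can_render(font_data: &[u8], text: &str) -> bool {
    let face = Face::from_bytes(font_data, 0);
    let font = Font::new(face);
    let output = shape(&font, UnicodeBuffer::new().add_str(text), &[]);
    // After shaping, `codepoint` holds a glyph id, not a Unicode code point;
    // glyph id 0 is .notdef, i.e. a character the font couldn't map.
    output.get_glyph_infos().iter().all(|info| info.codepoint != 0)
}
```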
Another whole class of complexity is emoji. A lot of emoji can be
rendered with either text or emoji presentation, and there are no
hard and fast rules to pick one or the other. Generally the text
presentation is in a symbol font, and the emoji presentation is in a
separate color font. A particularly tough example is the smiling
emoji, which began its encoding life as 0x01 in Code page 437, the
standard 8-bit character encoding of the original IBM PC, and is now
U+263A in Unicode. However, the suggested default presentation is
text, which won't do in a world which expects color. Apple on iOS
unilaterally chose an emoji presentation, so many text stacks follow
Apple's lead. (Incidentally, the most robust way to encode such emoji
is to append a variation selector to pin down the presentation.)
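As a small aside in code (the function name is mine), pinning the presentation is just a matter of appending the appropriate variation selector:

```rust
/// Append a variation selector to pin down the presentation: U+FE0F requests
/// emoji (color) presentation, U+FE0E requests text presentation.
fn pin_emoji_presentation(c: char) -> String {
    format!("{}\u{FE0F}", c)
}

// pin_emoji_presentation('\u{263A}') yields "☺\u{FE0F}", selecting the color
// emoji form of U+263A.
```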
Another source of complexity when trying to write a cross-platform
text layout engine is querying the system fonts. See Font fallback
deep dive for more information about that.
I should note one thing, which might help people doing archaeology of
legacy text stacks: it used to be pretty common for text layout to
resolve "compatibility" forms such as NFKC and NFKD, and this can
lead to various problems. But today it is more common to solve that
particular problem by providing a font stack with massive Unicode
coverage, including all the code points in the relevant compatibility
ranges.
Script
The shaping of text, or the transformation of a sequence of code
points into a sequence of positioned glyphs, depends on the script.
Some scripts, such as Arabic and Devanagari, have extremely elaborate
shaping rules, while others, such as Chinese, are a fairly
straightforward mapping from code point into glyph. Latin is
somewhere in the middle, starting with a straightforward mapping, but
ligatures and kerning are also required for high quality text layout.
Determining script runs is reasonably straightforward - many
characters have a Unicode script property which uniquely identifies
which script they belong to. However, some characters, such as space,
are "common," so the assigned script just continues the previous run.
A simple example is "hello мир". This string is broken into two
script runs: "hello " is Latn, and "мир" is Cyrl.
Shaping (cluster)
At this point, we have a run of constant style, font, direction, and
script. It is ready for shaping. Shaping is a complicated process
that converts a string (sequence of Unicode code points) into
positioned glyphs. For the purpose of this blog post, we can
generally treat it as a black box. Fortunately, a very high quality
open source implementation exists, in the form of HarfBuzz.
We're not quite done with segmentation, though, as shaping assigns
substrings in the input to clusters of glyphs. The correspondence
depends a lot on the font. In Latin, the string "fi" is often shaped
to a single glyph (a ligature). For complex scripts such as
Devanagari, a cluster is most often a syllable in the source text,
and complex reordering can happen within the cluster.
Clusters are important for hit testing, or determining the
correspondence between a physical cursor position in the text layout
and the offset within the text. Generally, they can be ignored if the
text will only be rendered, not edited (or selected).
Note that these shaping clusters are distinct from grapheme clusters.
The "fi" example has two grapheme clusters but a single shaping
cluster, so a grapheme cluster boundary can cut a shaping cluster.
Since it's possible to move the cursor between the "f" and "i", one
tricky problem is to determine the cursor location in that case.
Fonts do have a caret table, but implementation is spotty. A more
robust solution is to portion the width of the cluster equally to
each grapheme cluster within the cluster. See also Let's Stop
Ascribing Meaning to Code Points for a detailed dive into grapheme
clusters.
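A sketch of that equal-division heuristic, using the unicode-segmentation crate to count grapheme clusters (the function name is mine). For the "fi" ligature, this places a single caret position at half the ligature's advance width.

```rust
use unicode_segmentation::UnicodeSegmentation;

/// Given the source text of one shaping cluster and its total advance width,
/// produce caret offsets (from the leading edge of the cluster) for each
/// grapheme cluster boundary strictly inside the cluster.
fn caret_positions_in_cluster(cluster_text: &str, cluster_width: f32) -> Vec<f32> {
    let grapheme_count = cluster_text.graphemes(true).count().max(1);
    let step = cluster_width / grapheme_count as f32;
    (1..grapheme_count).map(|i| i as f32 * step).collect()
}
```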
Line breaking
While short strings can be considered a single strip, longer strings
require breaking into lines. Doing this properly is quite a tricky
problem. In this post, we treat it as a separate (small) hierarchy,
parallel to the main text layout hierarchy above.
The problem can be factored into identifying line break candidates,
then choosing a subset of those candidates as line breaks that
satisfy the layout constraints. The main constraint is that lines
should fit within the specified maximum width. It's common to use a
greedy algorithm, but high end typography tends to use an algorithm
that minimizes a raggedness score for the paragraph. Knuth and Plass
have a famous paper, Breaking Paragraphs into Lines, that describes
the algorithm used in TeX in detail. But we'll focus on the problems
of determining candidates and measuring the widths, as these are
tricky enough.
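For reference, the greedy strategy is simple enough to sketch in a few lines (names are mine; this ignores hyphen insertion, trailing-space trimming, and desperate breaks): given the widths of the segments between successive break candidates, start a new line whenever the next segment would overflow.

```rust
/// Greedy line breaking: returns the indices of the segments that start new
/// lines, given the width of each segment between successive break candidates.
fn greedy_breaks(segment_widths: &[f32], max_width: f32) -> Vec<usize> {
    let mut breaks = Vec::new();
    let mut line_width = 0.0;
    for (i, &w) in segment_widths.iter().enumerate() {
        // Break before segment i if it would overflow a non-empty line.
        if line_width + w > max_width && line_width > 0.0 {
            breaks.push(i);
            line_width = 0.0;
        }
        line_width += w;
    }
    breaks
}
```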
In theory, the Unicode Line Breaking Algorithm (UAX #14) identifies
positions in a string that are candidate line breaks. In practice,
there are some additional subtleties. For one, some languages (Thai
is the most common) don't use spaces to divide words, so they need some
kind of natural language processing (based on a dictionary) to
identify word boundaries. For two, automatic hyphenation is often
desirable, as it fills lines more efficiently and makes the right
edge less ragged. Liang's algorithm is most common for automatically
inferring "soft hyphens," and there are many good implementations of
it.
Android's line breaking implementation (in the Minikin library)
applies an additional refinement: since email addresses and URLs are
common in strings displayed on mobile devices, and since the UAX #14
rules give poor choices for those, it has an additional parser to
detect those cases and apply different rules.
Finally, if words are very long or the maximum width is very narrow,
it's possible for a word to exceed that width. In some cases, the
line can be "overfull," but it's more common to break the word at the
last grapheme cluster boundary that still fits inside the line. In
Android, these are known as "desperate breaks."
So, to recap: after paragraph segmentation (also known as "hard
breaks"), there is a loose hierarchy of three kinds of line break
candidates: word breaks as determined by UAX #14 (with possible
"tailoring"), soft hyphens, and finally grapheme cluster boundaries.
The first is preferred, but the other two may be used in order to
satisfy the layout constraints.
This leaves another problem, which is surprisingly tricky to get fully
right: how to measure the width of a line between two candidate
breaks, in order to validate that it fits within the maximum width
(or, in the more general case, to help compute a global raggedness
score). For Latin text in a normal font, this seems almost
ridiculously easy: just measure the width of each word, and add them
up. But in the general case, things are nowhere near as simple.
First, while in Latin, most line break candidates are at space
characters, in the fully general case they can cut anywhere in the
text layout hierarchy, even in the middle of a cluster. An additional
complication is that hyphenation can add a hyphen character.
Even without hyphenation, because shaping is Turing complete, the
width of a line (a substring between two line break candidates) can
in principle be an arbitrary function of its text. Of course, such
extreme cases are rare; it's most common for the width to be exactly
the sum of the widths of the words, and even in the other cases this
tends to be a good approximation.
So getting this exactly right in the general case is conceptually not
difficult, but is horribly inefficient: for each candidate for the
end of the line, perform text layout (mostly shaping) on the
substring from the beginning of the line (possibly inserting a
hyphen), and measure the width of that layout.
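In code, the exact-but-slow version might look like the following sketch, where shaped_width is a stand-in for running the layout pipeline above on a substring (appending a hyphen when the candidate is a soft hyphen):

```rust
/// Find the furthest break candidate such that the line still fits, by laying
/// out and measuring the substring for each candidate in turn.
/// Assumes candidates are sorted byte offsets and that widths grow
/// monotonically enough that stopping at the first overflow is safe.
fn last_fitting_candidate(
    text: &str,
    line_start: usize,
    candidates: &[usize],
    max_width: f32,
    shaped_width: &dyn Fn(&str) -> f32,
) -> Option<usize> {
    candidates
        .iter()
        .copied()
        .take_while(|&end| shaped_width(&text[line_start..end]) <= max_width)
        .last()
}
```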
Very few text layout engines even try to handle this general case,
using various heuristics and approximations which work well most of
the time, but break down when presented with a font with shaping
rules that change widths aggressively. DirectWrite does, however,
using very clever techniques that took several years of iteration.
The full story is in harfbuzz/harfbuzz#1463 (comment). Further
analysis, towards a goal of getting this implemented in an open
source text layout engine, is in yeslogic/allsorts#29. If and when
either HarfBuzz or Allsorts implements the lower-level logic, I'll
probably want to write another blog post explaining in more detail
how a higher level text layout engine can take advantage of it.
A great example of how line breaking can go wrong is Firefox bug
479829, in which an "f + soft hyphen + f" sequence in the text is
shaped as the "ff" ligature, then the line is broken at the soft
hyphen. Because Firefox reuses the existing shaping rather than
reshaping the line, it actually renders with the ligature glyph split
across lines:
[Example of the layout bug in Firefox]
Implementations to study
While I still feel a need for a solid, high-level, cross-platform
text layout engine, there are good implementations to study. In open
source, one of my favorites (though I am biased) is the Android text
stack, based on Minikin for its lower levels. It is fairly capable
and efficient, and also makes a concerted effort to get "all of
Unicode" right, including emoji. It is also reasonably simple and the
code is accessible.
While not open source, DirectWrite is also well worth study, as it is
without question one of the most capable engines, supporting Word and
the previous iteration of Edge before it was abandoned in favor of
Chromium. Note that there is a proposal for a cross-platform
implementation and also potentially to take it open-source. If that
were to happen, it would be something of a game changer.
Chromium and Firefox are a rich source as well, especially as they've
driven a lot of the improvements in HarfBuzz. However, their text
layout stacks are quite complex and do not have a clean, documented
API boundary with the rest of the application, so they are not as
suitable for study as the others I've chosen here.
Android
Paragraph and style segmentation (with BiDi) is done at higher
levels, in Layout.java and StaticLayout.java. At that point, runs are
handed to Minikin for lower-level processing. Most of the rest of the
hierarchy is in Layout.cpp, and ultimately shaping is done by
HarfBuzz.
Minikin also contains a sophisticated line breaking implementation,
including Knuth-Plass style optimized breaking.
Android deals with shaping boundaries by using heuristics to further
segment the text at implied word boundaries (which are also used as
the grain for the layout cache). If a font does shaping across these
boundaries, the shaping context is simply lost. This is a reasonable
compromise, especially on mobile, as results are always consistent,
i.e. the width used for measurement never mismatches the width used
for layout. And none of the fonts in the system stack have exotic
behavior such as shaping across spaces.
Android does base its itemization on cmap coverage, and builds
sophisticated bitmap structures for fast queries. As such, it can get
normalization issues wrong, but overall this seems like a reasonable
compromise. In particular, most of the normalization issues you'll
run into involve Latin and the combining diacritical marks, both of
which are supplied by Roboto, which in turn has massive Unicode
coverage (and thus less need to rely on normalization logic). But
with custom fonts, handling may be less than ideal, resulting in more
fallback to Roboto than might actually be needed.
Note that Minikin was also the starting point for libTxt, the text
layout library used in Flutter.
DirectWrite
Some notes on things I've found while studying the API; these
observations are quite a bit in the weeds, but might be useful to
people wanting to deeply understand or engage the API.
Hit testing in DirectWrite is based on leading/trailing positions,
while in Android it's based on primary and secondary. The latter is
more useful for text editing, but leading/trailing is a more
well-defined concept (for one, it doesn't rely on paragraph
direction). For more information on this topic, see
linebender/piet#323. My take is that proper hit testing requires iterating through
the text layout to access lower level structures.
While Core Text (see below) exposes a hierarchy of objects,
DirectWrite uses the TextLayout as the primary interface, and exposes
internal structure (even including lines) by iterating over a
callback per run in the confusingly named Draw method. The
granularity of this callback is a glyph run, which corresponds to
"script" in the hierarchy above. Cluster information is provided in
an associated glyph run description structure.
There are other ways to access lower level text layout capabilities,
including TextAnalyzer, which computes BiDi and line break
opportunities, script runs, and shaping. In fact, the various methods
on that interface represent much of the internal structure of the
text layout engine. Itemization, however, is done in the FontFallback
interface, which was added later.
Core Text
Another high quality implementation is Core Text. I don't personally
find it as well designed as DirectWrite, but it does get the job
done. In general, though, Core Text is considered a lower level
interface, and applications are recommended to use a higher level
mechanism (Cocoa text on macOS, Text Kit on iOS).
When doing text layout on macOS, it's probably better to use the
platform-provided itemization method (CTFontCreateForString), rather
than getting the font list and doing itemization in the client. See
linebender/skribo#14 for more information on this tradeoff.
Druid/Piet
At this point, the Druid GUI toolkit does not have its own native
text layout engine, but rather provides a cross-platform API that
delegates to the platform text layout engines, DirectWrite and
Core Text in particular.
The situation on Linux is currently unsatisfactory, as it's based on
the Cairo toy text API. There is work ongoing to improve this, but no
promises when.
While the Piet text API is currently fairly basic, I do think it's a
good starting point for text layout, especially in the Rust
community. While the complexity of Web text basically forces browsers
to do all their text layout from scratch, for UI text there are
serious advantages to using the platform text layout capabilities,
including more consistency with native UI, and less code to compile
and ship.
Pango
I should at least mention Pango, which provides text layout
capabilities for Gtk and other software. It is open source and has a
long history, but is more focused on the needs of Linux and in my
opinion is less suitable as a cross-platform engine, though there is
porting work for both Windows and macOS. As evidence that it hasn't
quite been keeping up to date, the Windows integration is all based on
GDI+ rather than the more recent Direct2D and DirectWrite, so
capabilities are quite limited by modern standards.
The question of level
A consistent theme in the design of text layout APIs is: what level?
Ideally the text layout engine provides a high level API, meaning
that rich text (in some concrete representation) comes in, along with
the fonts, and a text layout object comes out. However, this is not
always adequate.
In particular, word processors and web browsers have vastly more
complex layout requirements than can be expressed in a reasonable
"attributed string" representation of rich text. For these
applications, it makes sense to break apart the task of text layout,
and provide unbundled access to these lower levels. Often, that
corresponds to lower levels in the hierarchy I've presented. A good
choice of boundary is style runs (including BiDi), as it simplifies
the question of rich text representation; expressing the style of a
single run is simpler than a data structure which can represent all
formatting requirements for the rich text.
Until fairly recently, web browsers tended to use platform text
capabilities for the lower levels, but ultimately they needed more
control, so for the most part, they do all the layout themselves,
deferring to the platform only when absolutely necessary, for example
to enumerate the system fonts for fallback.
The desire to accommodate both UI and browser needs motivated the
design of the skribo API, and explains why it only handles single
style runs. Unfortunately, the lack of a complementary high level
driver proved to be quite a mistake, as there was no easy way for
applications to use the library. We will be rethinking some of these
decisions in coming months.
Other resources
A book in progress on text layout is Fonts and Layout for Global
Scripts by Simon Cozens. There is more emphasis on complex script
shaping and fonts, but touches on some of the same concepts as here.
Another useful resource is Modern text rendering with Linux:
Overview, which has a Linux focus and explains Pango in more detail.
It also links the SIGGRAPH 2018 - Digital typography slide deck,
which is quite informative.
Thanks to Chris Morgan for review and examples.