suckless.org

       libgrapheme.7 - libgrapheme - unicode string library
 (HTM) git clone git://git.suckless.org/libgrapheme
            1 .Dd 2022-08-26
            2 .Dt LIBGRAPHEME 7
            3 .Os suckless.org
            4 .Sh NAME
            5 .Nm libgrapheme
            6 .Nd Unicode string library
            7 .Sh SYNOPSIS
            8 .In grapheme.h
            9 .Sh DESCRIPTION
           10 The
           11 .Nm
           12 library provides functions to properly handle Unicode strings according
           13 to the Unicode specification.
           14 Unicode strings are made up of user-perceived characters (so-called
           15 .Dq grapheme clusters ,
           16 see
           17 .Sx MOTIVATION )
           18 that are made up of one or more Unicode codepoints, which in turn
           19 are encoded in one or more bytes in an encoding like UTF-8.
           20 .Pp
           21 There is a widespread misconception that it is enough to simply
           22 determine codepoints in a string and treat them as user-perceived
           23 characters to be Unicode compliant.
           24 While this may work in some cases, this assumption quickly breaks,
           25 especially for non-Western languages and decomposed Unicode strings
           26 where user-perceived characters are usually represented using multiple
           27 codepoints.
           28 .Pp
           29 Despite this complicated multilevel structure of Unicode strings,
           30 .Nm
           31 provides functions to work with them at the byte level (i.e. UTF-8
           32 .Sq char
           33 arrays) while also offering codepoint-level functions.
           34 .Pp
           35 Every documented function's manual page provides a self-contained
           36 example illustrating the possible usage.
           37 .Sh SEE ALSO
           38 .Xr grapheme_decode_utf8 3 ,
           39 .Xr grapheme_encode_utf8 3 ,
           40 .Xr grapheme_is_character_break 3 ,
           41 .Xr grapheme_next_character_break 3 ,
           42 .Xr grapheme_next_line_break 3 ,
           43 .Xr grapheme_next_sentence_break 3 ,
           44 .Xr grapheme_next_word_break 3 ,
           45 .Xr grapheme_next_character_break_utf8 3 ,
           46 .Xr grapheme_next_line_break_utf8 3 ,
           47 .Xr grapheme_next_sentence_break_utf8 3 ,
           48 .Xr grapheme_next_word_break_utf8 3
           49 .Sh STANDARDS
           50 .Nm
           51 is compliant with the Unicode 14.0.0 specification.
           52 .Sh MOTIVATION
           53 The idea behind every character encoding scheme like ASCII or Unicode
           54 is to express abstract characters (which can be thought of as shapes
           55 making up a written language). ASCII, for instance, which comprises the
           56 range 0 to 127, assigns the number 65 (0x41) to the abstract character
           57 .Sq A .
           58 This number is called a
           59 .Dq codepoint ,
           60 and all codepoints of an encoding make up its so-called
           61 .Dq code space .
           62 .Pp
           63 Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
           64 first 128 codepoints are identical to ASCII's. The additional
           65 codepoints are needed as Unicode's goal is to express all writing systems
           66 of the world.
           67 To give an example, the abstract character
           68 .Sq \[u00C4]
           69 is not expressible in ASCII, given that no ASCII codepoint has been assigned
           70 to it.
           71 It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
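To make the step from codepoint to bytes concrete, here is a minimal sketch of UTF-8 encoding done by hand. The name utf8_encode is hypothetical and is not part of the library (which offers grapheme_encode_utf8(3) for this purpose); for brevity the sketch does not reject surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not libgrapheme API: writes the UTF-8 encoding
 * of cp into out (which must hold at least 4 bytes) and returns the
 * number of bytes written, or 0 if cp lies outside the Unicode code
 * space 0 to 0x10FFFF. Surrogates (0xD800 to 0xDFFF) are not rejected.
 */
size_t
utf8_encode(uint_least32_t cp, unsigned char out[4])
{
	if (cp < 0x80) {
		/* 0xxxxxxx: identical to ASCII */
		out[0] = (unsigned char)cp;
		return 1;
	} else if (cp < 0x800) {
		/* 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp < 0x10000) {
		/* 1110xxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	} else if (cp <= 0x10FFFF) {
		/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}

	return 0;
}
```

With this sketch, the codepoint 0x41 yields the single byte 0x41, whereas the codepoint 0xC4 yields the two bytes 0xC3 0x84.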
           72 .Pp
           73 One may assume that this process is straightforward, but as more and
           74 more codepoints were assigned to abstract characters, the Unicode
           75 Consortium (which defines the Unicode standard) faced a problem:
           76 many (mostly non-European) languages have such a large number of
           77 abstract characters that it would exhaust the available Unicode code
           78 space if one tried to assign a codepoint to each abstract character.
           79 The solution to that problem is best introduced with an example: Consider
           80 the abstract character
           81 .Sq \[u01DE] ,
           82 which is
           83 .Sq A
           84 with an umlaut and a macron added to it.
           85 In this sense, one can consider
           86 .Sq \[u01DE]
           87 as a two-fold modification (namely
           88 .Dq add umlaut
           89 and
           90 .Dq add macron )
           91 of the
           92 .Dq base character
           93 .Sq A .
           94 .Pp
           95 The Unicode Consortium adopted this idea by assigning codepoints to
           96 modifications.
           97 For example, the codepoint 0x308 represents adding an umlaut and 0x304
           98 represents adding a macron, and thus, the codepoint sequence
           99 .Dq 0x41 0x308 0x304 ,
          100 namely the base character
          101 .Sq A
          102 followed by the umlaut and macron modifiers, represents the abstract
          103 character
          104 .Sq \[u01DE] .
          105 As a side note, the single codepoint 0x1DE was also assigned to
          106 .Sq \[u01DE] ,
          107 which is a good example of the fact that there can be multiple
          108 representations of a single abstract character in Unicode.
          109 .Pp
          110 Expressing a single abstract character with multiple codepoints solved
          111 the code space exhaustion problem, and the concept has been greatly
          112 expanded since its first introduction (emojis, joiners, etc.). A sequence
          113 (which can also have the length 1) of codepoints that belong together
          114 this way and represent an abstract character is called a
          115 .Dq grapheme cluster .
          116 .Pp
          117 In many applications it is necessary to count the number of
          118 user-perceived characters, i.e. grapheme clusters, in a string.
          119 A good example of this is a terminal text editor, which needs to
          120 properly align characters on a grid.
          121 This is fairly simple with ASCII strings, where you just count the number
          122 of bytes (as each byte is a codepoint and each codepoint is a grapheme
          123 cluster).
          124 With Unicode strings, it is a common mistake to simply adopt the
          125 ASCII approach and count the number of codepoints.
          126 This is wrong, as, for example, the sequence
          127 .Dq 0x41 0x308 0x304 ,
          128 while made up of 3 codepoints, is a single grapheme cluster and
          129 represents the user-perceived character
          130 .Sq \[u01DE] .
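The mismatch described above can be sketched in a few lines of plain C. The names count_codepoints and decomposed are made up for this illustration and are not part of the library; the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h> /* strlen() yields the byte count */

/*
 * Hypothetical helper: counts the codepoints in a NUL-terminated
 * string that is assumed to be valid UTF-8, by counting every byte
 * that is not a continuation byte (bit pattern 10xxxxxx).
 */
size_t
count_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++) {
		if (((unsigned char)*s & 0xC0) != 0x80) {
			n++;
		}
	}

	return n;
}

/*
 * The codepoint sequence 0x41 0x308 0x304 encoded in UTF-8:
 * 'A', then 0xCC 0x88 (0x308), then 0xCC 0x84 (0x304).
 */
const char decomposed[] = "A\xCC\x88\xCC\x84";
```

Here strlen(decomposed) gives 5 and count_codepoints(decomposed) gives 3, yet a user perceives a single character, so neither count is suitable for aligning text on a grid.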
          131 .Pp
          132 The proper way to segment a string into user-perceived characters
          133 is to segment it into its grapheme clusters by applying the Unicode
          134 grapheme cluster breaking algorithm (UAX #29).
          135 It is based on a complex ruleset and lookup tables and determines whether a
          136 grapheme cluster ends or is continued between two codepoints.
          137 Libraries like ICU and libunistring, which also offer this functionality,
          138 are often bloated, incorrect, difficult to use or not reasonably
          139 statically linkable.
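To give a feel for what "a break between two codepoints" means, the following toy sketch counts clusters under one deliberately crude assumption: a codepoint in the range 0x300 to 0x36F (combining diacritical marks) never starts a new cluster, and every other codepoint does. This is not the UAX #29 algorithm and ignores, among other things, Hangul syllables, emoji joiner sequences and regional indicators; the names decode and count_clusters_toy are hypothetical and not library API. For real use, see grapheme_next_character_break_utf8(3).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decodes one codepoint from s into *cp and
 * returns the number of bytes consumed. The input is assumed to be
 * valid UTF-8; nothing is validated.
 */
static size_t
decode(const unsigned char *s, uint_least32_t *cp)
{
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		*cp = ((uint_least32_t)(s[0] & 0x1F) << 6) |
		      (uint_least32_t)(s[1] & 0x3F);
		return 2;
	} else if ((s[0] & 0xF0) == 0xE0) {
		*cp = ((uint_least32_t)(s[0] & 0x0F) << 12) |
		      ((uint_least32_t)(s[1] & 0x3F) << 6) |
		      (uint_least32_t)(s[2] & 0x3F);
		return 3;
	}

	*cp = ((uint_least32_t)(s[0] & 0x07) << 18) |
	      ((uint_least32_t)(s[1] & 0x3F) << 12) |
	      ((uint_least32_t)(s[2] & 0x3F) << 6) |
	      (uint_least32_t)(s[3] & 0x3F);
	return 4;
}

/*
 * Toy approximation, NOT UAX #29: a new cluster starts at every
 * codepoint except a combining diacritical mark (0x300 to 0x36F)
 * that follows another codepoint.
 */
size_t
count_clusters_toy(const char *str)
{
	const unsigned char *s = (const unsigned char *)str;
	uint_least32_t cp;
	size_t n = 0;

	while (*s != '\0') {
		s += decode(s, &cp);
		if (n == 0 || cp < 0x300 || cp > 0x36F) {
			n++;
		}
	}

	return n;
}
```

Under this assumption, both the sequence 0x41 0x308 0x304 and the single codepoint 0x1DE count as one cluster. The many cases the toy rule gets wrong are exactly why the full ruleset and lookup tables are needed.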
          140 .Pp
          141 Analogously, the standard provides algorithms to separate strings by
          142 words, sentences and lines, convert cases and compare strings.
          143 The motivation behind
          144 .Nm
          145 is to make Unicode handling suck less and abide by the UNIX philosophy.
          146 .Sh AUTHORS
          147 .An Laslo Hunhold Aq Mt dev@frign.de