            1 LIBGRAPHEME(7) - Miscellaneous Information Manual
            2 
            3 # NAME
            4 
            5 **libgrapheme** - Unicode string library
            6 
            7 # SYNOPSIS
            8 
            9 **#include <grapheme.h>**
           10 
           11 # DESCRIPTION
           12 
           13 The
           14 **libgrapheme**
           15 library provides functions to properly handle Unicode strings according
           16 to the Unicode specification in regard to character, word, sentence and
           17 line segmentation and case detection and conversion.
           18 
           19 Unicode strings are made up of user-perceived characters (so-called
           20 "grapheme clusters",
           21 see
           22 *MOTIVATION*)
           23 that are composed of one or more Unicode codepoints, which in turn
           24 are encoded in one or more bytes in an encoding like UTF-8.
           25 
           26 There is a widespread misconception that it is enough to simply
           27 determine codepoints in a string and treat them as user-perceived
           28 characters to be Unicode compliant.
           29 While this may work in some cases, this assumption quickly breaks,
           30 especially for non-Western languages and decomposed Unicode strings
           31 where user-perceived characters are usually represented using multiple
           32 codepoints.
           33 
           34 Despite this complicated multilevel structure of Unicode strings,
           35 **libgrapheme**
           36 provides functions to work with them at the byte level (i.e. UTF-8
           37 'char'
           38 arrays) while also offering codepoint-level functions.
           39 Additionally, it is a
           40 "freestanding"
           41 library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
           42 a standard library. This makes it easy to use in bare metal environments.
           43 
           44 Every documented function's manual page provides a self-contained
           45 example illustrating the possible usage.
           46 
           47 # SEE ALSO
           48 
           49 grapheme\_decode\_utf8(3),
           50 grapheme\_encode\_utf8(3),
           51 grapheme\_is\_character\_break(3),
           52 grapheme\_is\_lowercase(3),
           53 grapheme\_is\_lowercase\_utf8(3),
           54 grapheme\_is\_titlecase(3),
           55 grapheme\_is\_titlecase\_utf8(3),
           56 grapheme\_is\_uppercase(3),
           57 grapheme\_is\_uppercase\_utf8(3),
           58 grapheme\_next\_character\_break(3),
           59 grapheme\_next\_character\_break\_utf8(3),
           60 grapheme\_next\_line\_break(3),
           61 grapheme\_next\_line\_break\_utf8(3),
           62 grapheme\_next\_sentence\_break(3),
           63 grapheme\_next\_sentence\_break\_utf8(3),
           64 grapheme\_next\_word\_break(3),
           65 grapheme\_next\_word\_break\_utf8(3),
           66 grapheme\_to\_lowercase(3),
           67 grapheme\_to\_lowercase\_utf8(3),
           68 grapheme\_to\_titlecase(3),
           69 grapheme\_to\_titlecase\_utf8(3),
           70 grapheme\_to\_uppercase(3),
           71 grapheme\_to\_uppercase\_utf8(3)
           72 
           73 # STANDARDS
           74 
           75 **libgrapheme**
           76 is compliant with the Unicode 15.0.0 specification.
           77 
           78 # MOTIVATION
           79 
           80 The idea behind every character encoding scheme like ASCII or Unicode
           81 is to express abstract characters (which can be thought of as shapes
           82 making up a written language). ASCII, for instance, which comprises the
           83 range 0 to 127, assigns the number 65 (0x41) to the abstract character
           84 'A'.
           85 This number is called a
           86 "codepoint",
           87 and all codepoints of an encoding make up its so-called
           88 "code space".
           89 
           90 Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
           91 first 128 codepoints are identical to ASCII's. The additional
           92 codepoints are needed as Unicode's goal is to express all writing systems
           93 of the world.
           94 To give an example, the abstract character
           95 'Ä'
           96 is not expressible in ASCII, given no ASCII codepoint has been assigned
           97 to it.
           98 It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
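
To make the step from codepoint to bytes concrete, the following is a minimal sketch of UTF-8 encoding in C. The helper name `sketch_utf8_encode` is hypothetical and not part of libgrapheme (the library's own encoder is documented in grapheme\_encode\_utf8(3)); it only illustrates why a codepoint such as 0xC4 no longer fits into a single byte.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not libgrapheme API): encode one codepoint as
 * UTF-8 into out (room for 4 bytes) and return the number of bytes
 * written, or 0 if cp is a surrogate or lies outside the code space.
 */
size_t
sketch_utf8_encode(uint_least32_t cp, unsigned char out[4])
{
	if (cp <= 0x7F) {
		/* ASCII range: the byte equals the codepoint */
		out[0] = (unsigned char)cp;
		return 1;
	} else if (cp <= 0x7FF) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp <= 0xFFFF) {
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			return 0; /* surrogates cannot be encoded */
		}
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	} else if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}

	return 0; /* beyond 0x10FFFF */
}
```

With this sketch, the codepoint 0x41 yields the single byte 0x41, exactly as in ASCII, while 0xC4 yields the two bytes 0xC3 0x84.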
           99 
          100 One may assume that this process is straightforward, but as more and
          101 more codepoints were assigned to abstract characters, the Unicode
          102 Consortium (that defines the Unicode standard) was facing a problem:
          103 Many (mostly non-European) languages have such a large number of
          104 abstract characters that it would exhaust the available Unicode code
          105 space if one tried to assign a codepoint to each abstract character.
          106 The solution to that problem is best introduced with an example: Consider
          107 the abstract character
          108 'Ǟ',
          109 which is
          110 'A'
          111 with an umlaut and a macron added to it.
          112 In this sense, one can consider
          113 'Ǟ'
          114 as a two-fold modification (namely
          115 "add umlaut"
          116 and
          117 "add macron")
          118 of the
          119 "base character"
          120 'A'.
          121 
          122 The Unicode Consortium adopted this idea by assigning codepoints to
          123 modifications.
          124 For example, the codepoint 0x308 represents adding an umlaut and 0x304
          125 represents adding a macron, and thus, the codepoint sequence
          126 "0x41 0x308 0x304",
          127 namely the base character
          128 'A'
          129 followed by the umlaut and macron modifiers, represents the abstract
          130 character
          131 'Ǟ'.
          132 As a side note, the single codepoint 0x1DE was also assigned to
          133 'Ǟ',
          134 which is a good example of the fact that there can be multiple
          135 representations of a single abstract character in Unicode.
          136 
          137 Expressing a single abstract character with multiple codepoints solved
          138 the code space exhaustion problem, and the concept has been greatly
          139 expanded since its first introduction (emojis, joiners, etc.). A sequence
          140 (which can also have the length 1) of codepoints that belong together
          141 this way and represent an abstract character is called a
          142 "grapheme cluster".
          143 
          144 In many applications it is necessary to count the number of
          145 user-perceived characters, i.e. grapheme clusters, in a string.
          146 A good example of this is a terminal text editor, which needs to
          147 properly align characters on a grid.
          148 This is pretty simple with ASCII strings, where you just count the number
          149 of bytes (as each byte is a codepoint and each codepoint is a grapheme
          150 cluster).
          151 With Unicode strings, it is a common mistake to simply adapt the
          152 ASCII approach and count the number of codepoints.
          153 This is wrong, as, for example, the sequence
          154 "0x41 0x308 0x304",
          155 while made up of 3 codepoints, is a single grapheme cluster and
          156 represents the user-perceived character
          157 'Ǟ'.
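
The mismatch between the counts can be reproduced with a few lines of C. This is a minimal sketch, assuming the input is valid UTF-8; the helper name `sketch_count_codepoints` is hypothetical and not part of libgrapheme.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not libgrapheme API): count the codepoints in a
 * buffer that is assumed to hold valid UTF-8 by counting every byte
 * that is not a continuation byte (bit pattern 10xxxxxx).
 */
size_t
sketch_count_codepoints(const char *s, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		if (((unsigned char)s[i] & 0xC0) != 0x80) {
			n++; /* lead byte or ASCII byte starts a codepoint */
		}
	}

	return n;
}
```

For the UTF-8 encoding of "0x41 0x308 0x304", namely the 5 bytes `"A\xCC\x88\xCC\x84"`, this returns 3, although a user perceives a single character. Neither the byte count nor the codepoint count gives the number of grapheme clusters.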
          158 
          159 The proper way to segment a string into user-perceived characters
          160 is to segment it into its grapheme clusters by applying the Unicode
          161 grapheme cluster breaking algorithm (UAX #29).
          162 It is based on a complex ruleset and lookup tables and determines whether a
          163 grapheme cluster ends or is continued between two codepoints.
          164 Libraries like ICU and libunistring, which also offer this functionality,
          165 are often bloated, incorrect, difficult to use or not reasonably
          166 statically linkable.
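
To give a feeling for what "deciding between two codepoints" means, here is a deliberately tiny toy in C. It is not the UAX #29 algorithm and not how libgrapheme is implemented: it only mimics the single rule "do not break before an extending character" (GB9), and it assumes that the only extending characters are those in the Combining Diacritical Marks block (0x300 to 0x36F). All function names are hypothetical; the real check is documented in grapheme\_is\_character\_break(3).

```c
#include <stddef.h>
#include <stdint.h>

/* Toy assumption: only 0x300..0x36F count as extending characters. */
int
sketch_is_extend(uint_least32_t cp)
{
	return cp >= 0x300 && cp <= 0x36F;
}

/*
 * Toy break test between the adjacent codepoints a and b: a cluster
 * continues if b is an extending character, otherwise it ends. The
 * left codepoint is unused here; the real ruleset needs both sides
 * (and additional state).
 */
int
sketch_is_break(uint_least32_t a, uint_least32_t b)
{
	(void)a;

	return !sketch_is_extend(b);
}

/* Count clusters in a codepoint array under the toy rule above. */
size_t
sketch_count_clusters(const uint_least32_t *cp, size_t n)
{
	size_t i, clusters;

	if (n == 0) {
		return 0;
	}

	/* the first codepoint always opens a cluster */
	for (clusters = 1, i = 1; i < n; i++) {
		if (sketch_is_break(cp[i - 1], cp[i])) {
			clusters++;
		}
	}

	return clusters;
}
```

Under these assumptions the sequence "0x41 0x308 0x304" counts as one cluster and "0x41 0x42" as two. Everything the toy ignores (line endings, Hangul syllables, emoji sequences, regional indicators and more) is what makes the full ruleset and its lookup tables necessary.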
          167 
          168 Analogously, the standard provides algorithms to separate strings by
          169 words, sentences and lines, convert cases and compare strings.
          170 The motivation behind
          171 **libgrapheme**
          172 is to make Unicode handling suck less and abide by the UNIX philosophy.
          173 
          174 # AUTHORS
          175 
          176 Laslo Hunhold ([dev@frign.de](mailto:dev@frign.de))
          177 
          178 suckless.org - 2022-10-06