suckless.org

       libgrapheme.sh - libgrapheme - unicode string library
 (HTM) git clone git://git.suckless.org/libgrapheme
       ---
       libgrapheme.sh (6347B)
       ---
cat << EOF
.Dd ${MAN_DATE}
.Dt LIBGRAPHEME 7
.Os suckless.org
.Sh NAME
.Nm libgrapheme
.Nd unicode string library
.Sh SYNOPSIS
.In grapheme.h
.Sh DESCRIPTION
The
.Nm
library provides functions to properly handle Unicode strings according
to the Unicode specification in regard to character, word, sentence and
line segmentation and case detection and conversion.
.Pp
Unicode strings are made up of user-perceived characters (so-called
.Dq grapheme clusters ,
see
.Sx MOTIVATION )
that are composed of one or more Unicode codepoints, which in turn
are encoded in one or more bytes in an encoding like UTF-8.
.Pp
There is a widespread misconception that it is enough to simply
determine codepoints in a string and treat them as user-perceived
characters to be Unicode compliant.
While this may work in some cases, this assumption quickly breaks,
especially for non-Western languages and decomposed Unicode strings
where user-perceived characters are usually represented using multiple
codepoints.
.Pp
Despite this complicated multilevel structure of Unicode strings,
.Nm
provides methods to work with them at the byte-level (i.e. UTF-8
.Sq char
arrays) while also offering codepoint-level methods.
Additionally, it is a
.Dq freestanding
library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
a standard library.
This makes it easy to use in bare-metal environments.
.Pp
Every documented function's manual page provides a self-contained
example illustrating the possible usage.
.Sh SEE ALSO
.Xr grapheme_decode_utf8 3 ,
.Xr grapheme_encode_utf8 3 ,
.Xr grapheme_is_character_break 3 ,
.Xr grapheme_is_lowercase 3 ,
.Xr grapheme_is_lowercase_utf8 3 ,
.Xr grapheme_is_titlecase 3 ,
.Xr grapheme_is_titlecase_utf8 3 ,
.Xr grapheme_is_uppercase 3 ,
.Xr grapheme_is_uppercase_utf8 3 ,
.Xr grapheme_next_character_break 3 ,
.Xr grapheme_next_character_break_utf8 3 ,
.Xr grapheme_next_line_break 3 ,
.Xr grapheme_next_line_break_utf8 3 ,
.Xr grapheme_next_sentence_break 3 ,
.Xr grapheme_next_sentence_break_utf8 3 ,
.Xr grapheme_next_word_break 3 ,
.Xr grapheme_next_word_break_utf8 3 ,
.Xr grapheme_to_lowercase 3 ,
.Xr grapheme_to_lowercase_utf8 3 ,
.Xr grapheme_to_titlecase 3 ,
.Xr grapheme_to_titlecase_utf8 3 ,
.Xr grapheme_to_uppercase 3 ,
.Xr grapheme_to_uppercase_utf8 3
.Sh STANDARDS
.Nm
is compliant with the Unicode ${UNICODE_VERSION} specification.
.Sh MOTIVATION
The idea behind every character encoding scheme like ASCII or Unicode
is to express abstract characters (which can be thought of as shapes
making up a written language).
ASCII, for instance, which comprises the range 0 to 127, assigns the
number 65 (0x41) to the abstract character
.Sq A .
This number is called a
.Dq codepoint ,
and all codepoints of an encoding make up its so-called
.Dq code space .
.Pp
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
first 128 codepoints are identical to ASCII's.
The additional codepoints are needed as Unicode's goal is to express all
writing systems of the world.
To give an example, the abstract character
.Sq \[u00C4]
is not expressible in ASCII, given no ASCII codepoint has been assigned
to it.
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
.Pp
One may assume that this process is straightforward, but as more and
more codepoints were assigned to abstract characters, the Unicode
Consortium (which defines the Unicode standard) faced a problem:
Many (mostly non-European) languages have such a large number of
abstract characters that it would exhaust the available Unicode code
space if one tried to assign a codepoint to each abstract character.
The solution to that problem is best introduced with an example: Consider
the abstract character
.Sq \[u01DE] ,
which is
.Sq A
with an umlaut and a macron added to it.
In this sense, one can consider
.Sq \[u01DE]
as a two-fold modification (namely
.Dq add umlaut
and
.Dq add macron )
of the
.Dq base character
.Sq A .
.Pp
The Unicode Consortium adopted this idea by assigning codepoints to
modifications.
For example, the codepoint 0x308 represents adding an umlaut and 0x304
represents adding a macron, and thus, the codepoint sequence
.Dq 0x41 0x308 0x304 ,
namely the base character
.Sq A
followed by the umlaut and macron modifiers, represents the abstract
character
.Sq \[u01DE] .
As a side note, the single codepoint 0x1DE was also assigned to
.Sq \[u01DE] ,
which is a good example of the fact that there can be multiple
representations of a single abstract character in Unicode.
.Pp
Expressing a single abstract character with multiple codepoints solved
the code space exhaustion problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.).
A sequence of codepoints (which can also have the length 1) that belong
together this way and represent an abstract character is called a
.Dq grapheme cluster .
.Pp
In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string.
A good example of this is a terminal text editor, which needs to
properly align characters on a grid.
This is pretty simple with ASCII strings, where you just count the number
of bytes (as each byte is a codepoint and each codepoint is a grapheme
cluster).
With Unicode strings, it is a common mistake to simply adapt the
ASCII approach and count the number of codepoints.
This is wrong, as, for example, the sequence
.Dq 0x41 0x308 0x304 ,
while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
.Sq \[u01DE] .
.Pp
The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29).
It is based on a complex ruleset and lookup tables and determines if a
grapheme cluster ends or is continued between two codepoints.
Libraries like ICU and libunistring, which also offer this functionality,
are often bloated, incorrect, difficult to use or not reasonably
statically linkable.
.Pp
Analogously, the standard provides algorithms to separate strings by
words, sentences and lines, convert cases and compare strings.
The motivation behind
.Nm
is to make Unicode handling suck less and abide by the UNIX philosophy.
.Sh AUTHORS
.An Laslo Hunhold Aq Mt dev@frign.de
EOF