![libgrapheme](libgrapheme.svg)

libgrapheme is an extremely simple freestanding C99 library providing
utilities for properly handling strings according to the latest
Unicode standard 15.0.0. It offers fully Unicode-compliant

* __grapheme cluster__ (i.e. user-perceived character) __segmentation__
* __word segmentation__
* __sentence segmentation__
* detection of permissible __line break opportunities__
* __case detection__ (lower-, upper- and title-case)
* __case conversion__ (to lower-, upper- and title-case)

on UTF-8 strings and codepoint arrays, both of which may also be
NUL-terminated.
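To make the difference between bytes, codepoints and grapheme clusters
concrete, here is a minimal sketch that counts the codepoints in a UTF-8
buffer. The function `count_codepoints` is a hypothetical helper and not
part of libgrapheme; it assumes well-formed input and does no validation.
For the family emoji used in the example further down (18 bytes), it
reports 5 codepoints, even though a user perceives a single character.
Closing that gap is what grapheme cluster segmentation is for.

```c
#include <stddef.h>

/*
 * Count the codepoints in the len-byte UTF-8 buffer s by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * Assumption: the input is well-formed; nothing is validated.
 * This is an illustration only and not part of libgrapheme.
 */
size_t
count_codepoints(const char *s, size_t len)
{
        size_t i, n = 0;

        for (i = 0; i < len; i++) {
                if (((unsigned char)s[i] & 0xC0) != 0x80) {
                        n++;
                }
        }

        return n;
}
```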

The necessary lookup tables are automatically generated from the Unicode
standard data (contained in the tarball) and heavily compressed. Over
10,000 automatically generated conformance tests and over 150 unit tests
ensure conformance and correctness.

No complicated build system is involved; everything is done with a
single POSIX-compliant Makefile. All you need is a C99 compiler, since
the lookup-table generators and compressors, which only run at build
time, are also written in C99.
The resulting library is freestanding and thus does not even depend on a
standard library being present at runtime, making it a suitable choice
for bare-metal applications.

It is also much smaller and much faster than the other established Unicode
string libraries (ICU, GNU's libunistring, libutf8proc).

Development
-----------
You can [browse](//git.suckless.org/libgrapheme) the source code
repository or get a copy with the following command:

        git clone https://git.suckless.org/libgrapheme

Download
--------
libgrapheme follows the [semantic versioning](https://semver.org/) scheme.

* [libgrapheme-2.0.2](//dl.suckless.org/libgrapheme/libgrapheme-2.0.2.tar.gz) (2022-11-02)
* [libgrapheme-1.0.0](//dl.suckless.org/libgrapheme/libgrapheme-1.0.0.tar.gz) (2021-12-22)

Getting Started
---------------
Automatically configuring and installing libgrapheme via

        ./configure
        make install

will install the header grapheme.h and both the static library
libgrapheme.a and the dynamic library libgrapheme.so (with symlinks) in
the respective directories. The conformance and unit tests can be run with

        make test

and comparative benchmarks against libutf8proc (the only Unicode library
compliant enough to allow a meaningful comparison) can be run with

        make benchmark

You can access the manual [here](man/) or via libgrapheme(7) by typing

        man libgrapheme

and following the pages it refers to, e.g.
[grapheme\_next\_character\_break\_utf8(3)](man/grapheme_next_character_break_utf8.3/).
Each page contains code examples and an extensive description. To give
one example that is also given in the manuals, the following code
separates a given string 'Tëst 👨‍👩‍👦 🇺🇸 नी நி!'
into its user-perceived characters:

        #include <grapheme.h>
        #include <stdint.h>
        #include <stdio.h>

        int
        main(void)
        {
                /* UTF-8 encoded input */
                const char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
                                "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
                                "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
                                "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
                size_t ret, len, off;

                printf("Input: \"%s\"\n", s);

                /* print each grapheme cluster with byte-length */
                printf("grapheme clusters in NUL-delimited input:\n");
                for (off = 0; s[off] != '\0'; off += ret) {
                        ret = grapheme_next_character_break_utf8(s + off, SIZE_MAX);
                        printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
                }
                printf("\n");

                /*
                 * do the same, but this time string is length-delimited;
                 * 17 bytes end inside the family emoji, after its second
                 * member
                 */
                len = 17;
                printf("grapheme clusters in input delimited to %zu bytes:\n", len);
                for (off = 0; off < len; off += ret) {
                        ret = grapheme_next_character_break_utf8(s + off, len - off);
                        printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
                }

                return 0;
        }

This code can be compiled with the following command, where the
optional -static flag requests static linking:

        cc [-static] -o example example.c -lgrapheme

The output is

        Input: "Tëst 👨‍👩‍👦 🇺🇸 नी நி!"
        grapheme clusters in NUL-delimited input:
         1 bytes | T
         2 bytes | ë
         1 bytes | s
         1 bytes | t
         1 bytes |  
        18 bytes | 👨‍👩‍👦
         1 bytes |  
         8 bytes | 🇺🇸
         1 bytes |  
         6 bytes | नी
         1 bytes |  
         6 bytes | நி
         1 bytes | !

        grapheme clusters in input delimited to 17 bytes:
         1 bytes | T
         2 bytes | ë
         1 bytes | s
         1 bytes | t
         1 bytes |  
        11 bytes | 👨‍👩
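The 8-byte flag cluster in the output above is a pair of regional
indicator codepoints (U+1F1FA and U+1F1F8, 4 bytes each). Such pairs
show why cluster segmentation cannot be decided from two neighbouring
codepoints alone. In a run of regional indicators, whether a break falls
between two of them depends on how many came before (rules GB12 and GB13
of Unicode Standard Annex #29). The following is a minimal sketch of
that one rule with explicit state. The names `ri_is_break` and
`count_ri_clusters` are hypothetical and not part of libgrapheme, and
all other segmentation rules are deliberately ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* regional indicator symbols occupy U+1F1E6..U+1F1FF */
static bool
is_ri(uint_least32_t cp)
{
        return cp >= UINT32_C(0x1F1E6) && cp <= UINT32_C(0x1F1FF);
}

/*
 * Report whether a cluster break lies between the adjacent codepoints
 * a and b, considering only regional-indicator pairing (GB12/GB13);
 * every other segmentation rule is ignored in this sketch.
 * *paired must be false before the first call; on entry it tells
 * whether a already completes a pair, and it is updated for b.
 */
bool
ri_is_break(uint_least32_t a, uint_least32_t b, bool *paired)
{
        if (is_ri(a) && is_ri(b) && !*paired) {
                *paired = true; /* b completes the pair started by a */
                return false;
        }

        *paired = false;
        return true;
}

/* count the clusters in cp[0..len) by carrying the state along */
size_t
count_ri_clusters(const uint_least32_t *cp, size_t len)
{
        bool paired = false;
        size_t i, n = (len > 0);

        for (i = 1; i < len; i++) {
                n += ri_is_break(cp[i - 1], cp[i], &paired);
        }

        return n;
}
```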

Motivation
----------
The goal of this project is to be a suckless and statically linkable
alternative to the existing bloated, complicated, overscoped and/or
incorrect solutions for Unicode string handling (ICU, GNU's
libunistring, libutf8proc, etc.), motivating more hackers to properly
handle Unicode strings in their projects and allowing this even in
embedded applications.

The problem can be easily seen when looking at the sizes of the respective
libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a,
libicutest.a, libicutu.a, libicuuc.a) is around 38 MB and libunistring
(libunistring.a) is around 2 MB, which is unacceptable for static
linking. Both take many minutes to compile even on a good computer and
require a lot of dependencies, including Python for ICU. On
the other hand, libgrapheme (libgrapheme.a) only weighs in at around
300 KB and is compiled (including Unicode data parsing and compression)
in under a second, requiring nothing but a C99 compiler and POSIX make(1).

Some libraries, like libutf8proc and libunistring, are incorrect because
their APIs are based on assumptions that have not held for years
(e.g. offering stateless grapheme cluster segmentation even though the
underlying algorithm is not stateless). In addition, libutf8proc's
UTF-8 decoder is unsafe, as it accepts overlong encodings, which can
easily be used for exploits.
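To illustrate the overlong-encoding problem: the bytes 0xC0 0xAF decode
to '/' in a lenient decoder, so they can slip a path separator past a
filter that only looks for the single byte 0x2F. Below is a minimal
sketch of a strict decoder that rejects such sequences by checking the
decoded value against the smallest codepoint permitted for the sequence
length. The names `decode_utf8_strict` and `INVALID_CP` are assumptions
made for this illustration; this is not libgrapheme's implementation
or API.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical marker for undecodable input (U+FFFD) */
#define INVALID_CP UINT32_C(0xFFFD)

/*
 * Decode the first codepoint of the len-byte buffer s into *cp and
 * return the number of bytes consumed. Malformed input, including
 * overlong encodings, yields INVALID_CP.
 */
size_t
decode_utf8_strict(const char *s, size_t len, uint_least32_t *cp)
{
        static const struct {
                unsigned char mask; /* bits identifying the lead byte */
                unsigned char lead; /* expected value of those bits */
                uint_least32_t min; /* smallest codepoint of this length */
        } seq[] = {
                { 0x80, 0x00, UINT32_C(0x00000) }, /* 0xxxxxxx */
                { 0xE0, 0xC0, UINT32_C(0x00080) }, /* 110xxxxx + 1 byte */
                { 0xF0, 0xE0, UINT32_C(0x00800) }, /* 1110xxxx + 2 bytes */
                { 0xF8, 0xF0, UINT32_C(0x10000) }, /* 11110xxx + 3 bytes */
        };
        const unsigned char *b = (const unsigned char *)s;
        uint_least32_t val;
        size_t n, i;

        *cp = INVALID_CP;
        if (len == 0) {
                return 0;
        }

        /* derive the number n of continuation bytes from the lead byte */
        for (n = 0; n < sizeof(seq) / sizeof(*seq); n++) {
                if ((b[0] & seq[n].mask) == seq[n].lead) {
                        break;
                }
        }
        if (n == sizeof(seq) / sizeof(*seq)) {
                return 1; /* stray continuation or invalid lead byte */
        }

        val = b[0] & (0xFF ^ seq[n].mask);
        for (i = 1; i <= n; i++) {
                if (i >= len || (b[i] & 0xC0) != 0x80) {
                        return i; /* sequence ends prematurely */
                }
                val = (val << 6) | (b[i] & 0x3F);
        }

        /* reject overlong encodings, surrogates and out-of-range values */
        if (val < seq[n].min || (val >= 0xD800 && val <= 0xDFFF) ||
            val > 0x10FFFF) {
                return n + 1;
        }

        *cp = val;
        return n + 1;
}
```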

ICU and libunistring offer a lot of functions, and most of their weight
comes from locale data provided by the Unicode standard, which is applied
implementation-specifically (!) for some things. In such cases, however,
the same standard always defines a sane 'default' behaviour as an
alternative, which is satisfactory in 99% of cases and which you can
rely on.

For some languages, for instance, it is necessary to have a dictionary
on hand to always accurately determine where a word begins and ends. The
defaults provided by the standard, though, already do a great job of
respecting the language's boundaries in the general case and are not too
taxing in terms of performance.

Author
------
* Laslo Hunhold (dev@frign.de)

Please contact me if you have information that could be added to this page.