[HN Gopher] Libgrapheme: A simple freestanding C99 library for U...
___________________________________________________________________
Libgrapheme: A simple freestanding C99 library for Unicode
Author : harporoeder
Score : 58 points
Date : 2022-11-15 17:16 UTC (5 hours ago)
(HTM) web link (libs.suckless.org)
(TXT) w3m dump (libs.suckless.org)
| daaaaaaan wrote:
| https://twitter.com/kuschku/status/1156488420413362177
| raspyberr wrote:
| https://en.wikipedia.org/wiki/Talk:Suckless.org
| manifoldgeo wrote:
| I've been trying to get into C this past week, and this is a
| great coincidence to see this today! I was just thinking how
| convenient it is to type emojis right into strings in Python and
| print them. I assumed C didn't have much unicode compatibility,
| though I didn't research it.
|
| I gave libgrapheme a try, and it compiled just as the
| instructions said it would. The hello-world program also mostly
| worked, but in my terminal it malformed several things. For
| example the American flag emoji rendered in my terminal as
| [U][S], and the family emoji rendered as three distinct emoji
| faces (side-by-side) rather than one grouped one.
|
| I went to a website that lets me copy emojis to my clipboard, and
| I directly copy-pasted the American flag into my terminal, and I
| still got [U][S], so I think the problem is just with the
| terminal and not the library.
|
| edit: Indeed, this is a problem in Gnome terminal. I found a
| Bugzilla link[0] that is still open. The official name for the
| grouped emoji type is "ZWJ sequence"[1], short for Zero-Width
| Joiner, and it appears not a lot of terminals support them. If
| anyone knows of a good one for Linux, please let me know!
|
| Great stuff, thank you for sharing!
|
| References:
|
| [0]: https://gitlab.gnome.org/GNOME/vte/-/issues/2317
|
| [1]: https://emojipedia.org/emoji-zwj-sequence/
| csande17 wrote:
| > I was just thinking how convenient it is to type emojis right
| into strings in Python and print them. I assumed C didn't have
| much unicode compatibility, though I didn't research it.
|
| Libgrapheme is a nice library, but it doesn't really have
| anything to do with this.
|
| Almost all modern terminal emulators use the UTF-8 character
| encoding. In order to successfully output Unicode characters,
| your programming language doesn't actually need much "Unicode
| support"; it just needs to be able to send UTF-8-encoded bytes
| to stdout. (That's why many modern programming languages like
| Go and Zig define strings as a simple array of bytes.) Modern C
| compilers allow you to printf("é") and get the appropriate
| behavior.
|
| As you mention, the terminal emulator also needs to be able to
| _decode_ and _display_ those UTF-8 bytes correctly, and a lot
| of terminals don't get it right in some situations. Off the
| top of my head, I don't know of a terminal that actually
| implements the entire (very complex) set of Unicode text
| rendering behaviors; maybe one of the web-based ones that run
| in Electron? macOS's Terminal.app is also pretty good IIRC.
|
| Where libgrapheme comes in is if you want to analyze or
| manipulate a UTF-8-encoded string. It provides operations like
| "split into words" and "convert to uppercase". A surprising
| number of programs never need to do that stuff, but if you do,
| libgrapheme will give you a Unicode-compatible implementation.
| (Many more basic operations, like concatenating two strings,
| will work just fine without libgrapheme.)
| mananaysiempre wrote:
| (Not a language or Unicode expert, the following likely has
| important mistakes.)
|
| > Off the top of my head, I don't know of a terminal that
| actually implements the entire (very complex) set of Unicode
| text rendering behaviors
|
| There are at least two reasons for this:
|
| First, nobody actually seems to know how bidirectional text
| should interact with terminal control sequences, or indeed
| how it should be typeset on a terminal in the first place
| (where are the paragraph boundaries?). There is the
| pre-Unicode bidirectional support mode (BDSM, I kid you not)
| in ECMA-48[1] and TR/53[2], which AFAIK nobody implements nor
| cares about; there are terminal emulators endorsed by
| bidi-language users[3], which AFAIK nobody has written down
| the behaviour of; there is the Freedesktop bidi terminal
| spec[4], which is a draft and AFAIK nobody implements yet
| either but at least some people care about; finally, there
| are bidi-language users who say that spec is a mistake[5].
|
| Second, aside from bidi and a smattering of other things such
| as emoji, there _is_ no detailed "Unicode text rendering
| behaviour", only standards specific to font formats--the most
| recent among them being OpenType, which is dubiously
| compatible across implementations, decently documented only
| through painstaking reverse engineering (sometimes in
| words[6], sometimes only in Freetype library code), and
| generally full of snakes[7]. And it has no notion of a
| monospace font--only of a (proportional) font where all
| Lat/Cyr/Grk characters just happen to have the same advance.
|
| AFAICT that is not negligence or an oversight, but rather a
| concession to the fact that there are scripts which don't
| really have a notion of monospace in the typographic
| tradition and in fact are written such that it's extremely
| unclear what monospace would even mean--certainly not one or
| two cells per codepoint (e.g. Burmese or Tibetan; apparently
| there _are_ Arabic monospace fonts[8] but I've no idea how
| the hell they work). Not coincidentally, those are the
| scripts where you really, really need that shaper, otherwise
| nothing looks anywhere close to correct.
|
| [This post could have been titled "_Contra_ Muratori on
| Unicode in terminal emulators".]
|
| [1] https://www.ecma-international.org/publications-and-
| standard...
|
| [2] https://www.ecma-international.org/publications-and-
| standard...
|
| [3] https://news.ycombinator.com/item?id=8086417
|
| [4] https://terminal-wg.pages.freedesktop.org/bidi/
|
| [5] http://litcave.rudi.ir/
|
| [6] https://github.com/n8willis/opentype-shaping-documents
|
| [7] https://litherum.blogspot.com/2019/03/addition-font.html
|
| [8] https://news.ycombinator.com/item?id=10395464
| csande17 wrote:
| Here's another fun Unicode pitfall: does _any_ terminal
| provide a way to display Chinese and Japanese text
| simultaneously, using the appropriate versions of the
| glyphs for each language's characters?
| mananaysiempre wrote:
| As far as existing terminals are concerned, I don't know.
| FWIW, there are similar problems (though only to the
| point of looking wrong, not of misunderstanding) in other
| scripts: Cyrillic as used in Bulgarian and a number of
| other languages[1] and even Latin as used in Polish[2].
|
| Even the Han version, though, does not seem to me to be
| the sort of "what does it even _mean_?" problem like
| those I listed above; more like what you want the input
| to be. You can make your terminal keep language state,
| e.g. using the deprecated language tags. Pro: some form
| of this likely already needs to happen for bidi support;
| similar to what HTML does. Con: no text file or program
| ever did this; your nice UTF-8-only terminal is now
| stateful and goes mad after `head /dev/urandom`.
| Alternatively, you can require the driving program to
| emit variation selectors for each Han character. Pro: the
| state and the ensuing madness is now limited; you can
| still pretend you're looking at a stream of characters.
| Con: no text file or program ever did this; neither does
| HTML although it theoretically could.
|
| [1] https://commons.wikimedia.org/wiki/File:Cyrillic_alte
| rnates....
|
| [2]
| https://www.twardoch.com/download/polishhowto/kreska.html
| duskwuff wrote:
| > First, nobody actually seems to know how bidirectional
| text should interact with terminal control sequences...
|
| This goes beyond just bidirectional text. The traditional
| behavior of text in a terminal is based around two key
| assumptions, both of which break down catastrophically when
| dealing with non-ASCII text:
|
| 1) The state of a terminal can be represented as a set of
| cells, each of which has exactly one glyph in it and can be
| drawn independently from any other cell.
|
| 2) Printing a character will write a glyph to the cell the
| cursor is in and move the cursor to the right by one cell
| (or down to the next line).
|
| The first assumption breaks down when dealing with
| full-width characters and ligatures/complex scripts, but can
| at least be papered over to handle full-width. The second
| assumption breaks down when exposed to virtually any
| interesting typographical feature (RTL, combining
| characters and ZWJ, shaped characters, etc). And I'm not
| sure it's possible to fix without some pretty substantial
| changes to how terminals operate -- standard terminal
| control sequences, and the code that uses them, are all
| built around these assumptions; introducing new behaviors
| like "the cursor doesn't always move from left to right" or
| "erasing the middle of a string might change how the rest
| of it displays" _will_ break existing applications.
|
| The ECMA standards are of absolutely no help in the matter.
| They were written in the early 1990s, before Unicode came
| onto the scene. Their idea of "international language
| support" was supporting both French and German.
| __d wrote:
| Does anyone know offhand whether this does comparisons? And
| normalization?
| gigel82 wrote:
| This is interesting, particularly for implementing Intl in JS
| engines without the mega-heavy ICU. But I wonder how portable it
| really is.
|
| Sometimes I have to dig very deep to find that what folks call
| "portable C" is actually POSIX-dependent.
|
| It doesn't appear to be the case after going through the code for
| a bit, so that's promising.
| mananaysiempre wrote:
| You can also refer to the Unicode routines of other small JS
| engines[1,2]; those don't use ICU either, although the
| implementations are mercilessly size-optimized (to put it
| politely) and restricted to what the target JS version requires
| (e.g. Duktape does casemapping but no normalization). Still,
| Bellard's in particular look like he had a small Unicode
| processing library lying around and just copied it into the
| tree, not like he was forced to write the absolute minimum to
| do a JS interpreter, so they can even be compared with dedicated
| libraries like libgrapheme, libutf8proc or libutf.
|
| [1] https://github.com/bellard/quickjs/blob/master/libunicode.c
|
| [2] https://github.com/svaarala/duktape/blob/master/src-
| input/du...
| dochtman wrote:
| Maybe a comparison to ICU4X is more interesting.
___________________________________________________________________
(page generated 2022-11-15 23:01 UTC)