[HN Gopher] How to chop off bytes of an UTF-8 string to fit into...
       ___________________________________________________________________
        
       How to chop off bytes of an UTF-8 string to fit into a small slot
       and look nice
        
       Author : domm
       Score  : 81 points
        Date   : 2024-06-04 11:25 UTC (1 day ago)
        
 (HTM) web link (domm.plix.at)
 (TXT) w3m dump (domm.plix.at)
        
       | pianohacker wrote:
       | I cut my programmer teeth on Koha years and years ago. Still one
       | of the warmest open-source communities I've ever been involved
       | in, especially to a shy teenager with a lot of opinions.
       | 
       | Great to see new faces in the community, sad to see the sheer
       | insanity of MARC21 still causing chaos. MARCXML is gonna make it
       | obsolete Any Day Now!
        
         | quink wrote:
          | MARCXML is just a new format for encoding the vast majority
          | of the MARC 21 standard (or, for that matter, any other MARC
          | variety).
         | 
         | BIBFRAME is gonna make it obsolete Any Day Now!
         | 
         | ("any day now" in this sector means that librarians have been
         | talking about it for two decades and in about two decades
         | something might actually happen)
        
           | tingletech wrote:
           | I do wish the community had leaned into the MODS direction vs
           | going over to RDF.
        
             | quink wrote:
              | Yeah, I don't know what the sell is there... throw away
              | the fidelity of your data when WEMI/FRBR/BIBFRAME/semantic
              | web is coming any ~year~ decade now, soon (lol), while
              | re-learning everything, definitely going out to tender
              | because your current system won't do it, and shifting your
              | processes and integrations. All so you can end up halfway
              | to DC? Yeah, no.
             | 
              | The reason libraries no longer do anywhere near as much
              | cataloguing hasn't got much to do with MARC 21 being
              | hard.
        
               | tingletech wrote:
               | The fidelity seems pretty good, at least if you convert
               | to MARCXML and then use the XSLT from Library of Congress
               | to generate it. IIRC it has record types for all the FRBR
               | levels. It is also not flat like DC. It was a joy to work
               | with from a record aggregator perspective, especially if
               | you were generating it from MARC. You can even put the
               | full table of contents into it. At the time that one of
               | my colleagues wrote the "MARC Must Die" article (at least
               | if I remember correctly) the teams working on RDA and
               | MODS had a lot of overlap and MODS was being designed
               | with the era's cataloging theory in mind. There was a
               | moment in time where it seemed like a "new MARC" might go
               | in that direction.
               | 
               | Having catalogers or metadata librarians write directly
               | in MODS XML by hand never made sense (although some folks
               | tried this), but as far as something usable to ship
               | around I'd rather get MODS than MARC or dublin core. I
               | really don't want to have to query a triple store to
               | aggregate records.
               | 
                | Catalogers ideally would have tools that make it easy
                | for them to follow RDA/AACR2 descriptive practices
                | without having to think about the details of MARC or
                | MODS or linked data.
               | 
               | I've been out of the business for a couple of years, so I
               | have not been following Library of Congress' BIBFRAME
               | transition.
        
       | re wrote:
       | Further reading:
       | 
       | * https://hoytech.github.io/truncate-presentation/ /
       | https://metacpan.org/pod/Unicode::Truncate
       | 
       | * https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundarie...
       | 
       | Truncating at codepoint boundaries at least avoids generating
       | invalid (non-UTF-8) strings, but can still result in confusing or
       | incorrect displays for human readers, so for best results the
       | truncation algorithm should take extended grapheme clusters into
       | account, which are probably the closest thing that Unicode has to
       | what most people think of as "characters".
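       [Ed.: the byte-budget truncation discussed above can be sketched in
       Python. This is a simplified illustration, not the article's Perl
       code: it groups each base code point with its trailing combining
       marks via the standard library's unicodedata, which covers the
       diacritics case but not full UAX #29 extended grapheme clusters
       (emoji ZWJ sequences, regional-indicator flags, Hangul jamo); those
       need a real segmentation library such as the third-party regex
       module or PyICU.]

```python
# Sketch: truncate a string to at most max_bytes of UTF-8 without
# separating a base character from the combining marks that follow it.
# NOT full UAX #29 grapheme segmentation; see the hedge above.
import unicodedata

def clusters(text):
    """Yield each base code point together with its trailing combining marks."""
    group = ''
    for ch in text:
        # A new group starts at every non-combining code point.
        if group and unicodedata.combining(ch) == 0:
            yield group
            group = ''
        group += ch
    if group:
        yield group

def truncate_utf8(text, max_bytes):
    """Keep whole clusters while the UTF-8 byte budget allows."""
    used, out = 0, []
    for cluster in clusters(text):
        size = len(cluster.encode('utf-8'))
        if used + size > max_bytes:
            break
        out.append(cluster)
        used += size
    return ''.join(out)
```

       With a 2-byte budget, "e" + COMBINING DIAERESIS (3 bytes as one
       cluster) is dropped entirely instead of leaving a bare "e".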
        
         | iforgotpassword wrote:
          | To avoid this and a bunch of other confusion, when accepting
          | user input I recommend normalizing it to the composed form
          | before writing to a DB or file. While Unicode-aware tools and
          | software should handle either form just fine, you probably
          | want to avoid having something in the pipeline somewhere that
          | treats the decomposed and composed forms of the same string
          | as different.
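         [Ed.: the normalization step suggested here is a single call in
         Python's standard library; a minimal illustration:]

```python
# Normalize user input to the composed (NFC) form before storing it, so
# that "e" + COMBINING DIAERESIS and the precomposed "ë" compare equal.
import unicodedata

decomposed = 'e\u0308'                        # NFD: two code points
composed = unicodedata.normalize('NFC', decomposed)

assert composed == '\xeb'                     # one code point, U+00EB
assert composed != decomposed                 # yet they render the same
```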
        
           | arp242 wrote:
           | It's good advice to normalise to pre-composed form, but that
           | doesn't solve the problem the previous poster mentioned as
           | not everything exists as a composed form. That said: _most_
           | things do have a composed form, so you can probably get away
            | with it - right up to when you can't.
        
             | quink wrote:
              | Yeah, working on a library system, our path was to compose
             | everything (taking into account of course that the octet
             | sizes specified in the directory may or may not actually be
             | accurate depending on whatever system produced the record)
             | and around the same time deprecate any pretense we had of
             | supporting MARC-8.
        
           | magicalhippo wrote:
           | How does that affect filenames?
           | 
           | IIRC the lower levels of Windows will happily work with
           | filenames that are not valid Unicode strings, for example if
           | you use the kernel API rather than Win32.
           | 
           | But what about Win32? If you create a file before
           | normalization and then open it using the normalized form,
           | will it open the same file or return file not found?
           | 
           | What about other systems? For example AWS' S3 allows UTF-8
           | keys, with no mention of normalization[1].
           | 
           | On the phone so can't try myself right now.
           | 
           | Anyway for general text I agree, but for identifiers,
           | filenames and such I prefer to treat them as opaquely as
           | possible.
           | 
           | [1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/ob
           | ject...
        
             | heinrich5991 wrote:
             | No need to go to the kernel API to create filenames that
             | are invalid UTF-16. The Win32 API will happily let you do
             | it.
        
               | magicalhippo wrote:
               | I was AFK so couldn't check, but yeah you're right.
               | 
                | I just made two files named a.txt which happily sat
                | next to each other, one being NFC and the other NFD.
               | 
               | So yeah, don't mess with the normalization of filenames.
        
               | panzi wrote:
               | They're both valid UTF-16, though. Can you create a
               | filename with only half of a surrogate pair in it?
               | 
               | I don't use Windows, so I can't check. Linux literally
               | allows any arbitrary byte except for 0x00 and 0x2F ('/'
               | in ASCII/UTF-8). It's a problem for programming languages
               | that want to only use valid Unicode strings, like Python.
               | Rust has a separate type "OsString" to handle that, with
               | either lossy conversion to "String" or a conversion
                | method that can fail. Python uses lone surrogates (the
                | "surrogateescape" error handler) to represent invalid
                | byte sequences in filenames.
               | It's all a mess. JavaScript doesn't give a damn about the
               | validity of their UTF-16 strings.
               | 
                | (Note that Rust's OsString is different from its CString
               | type. Well, I guess under Unix they're the same, but
               | under Windows OsString is UTF-16 (or "WTF-16", because it
               | isn't actually valid UTF-16 in all cases).)
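               [Ed.: the Python mechanism mentioned above is the
               "surrogateescape" error handler, which os.fsdecode and
               os.fsencode use on POSIX; a small sketch:]

```python
# Python round-trips non-UTF-8 filename bytes by mapping each invalid
# byte 0xNN to the lone surrogate U+DCNN ("surrogateescape"). A strict
# UTF-8 encoder would reject the result, but it survives the round trip.
raw = b'caf\xff.txt'                          # not valid UTF-8
name = raw.decode('utf-8', errors='surrogateescape')

assert name == 'caf\udcff.txt'                # lone surrogate stand-in
assert name.encode('utf-8', errors='surrogateescape') == raw
```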
        
               | magicalhippo wrote:
               | Yeah seems to be same with Win32.
               | 
               | I tried using U+13161 EGYPTIAN HIEROGLYPH G029[1], which
               | resulted in a string of length 2 as expected.
               | 
               | Using both chars (code units) and just the first char
               | (code unit) worked equally fine. In Windows Explorer the
               | first one shows the stork as expected, while the second
               | shows that "invalid character" rectangle.
               | 
               | So yeah, treating filenames as nearly-opaque byte
               | sequences is probably the best approach.
               | 
               | [1]: https://en.wiktionary.org/wiki/%F0%93%85%A1
        
             | moefh wrote:
             | > What about other systems?
             | 
             | The Linux kernel doesn't validate filenames in any way, so
             | a filename in Linux can contain any byte except 0x2F ('/',
             | which is interpreted as directory separator) and 0x00
             | (which signals the end of the byte string).
             | 
             | ETA: of course some file systems have other limitations,
             | for example '\' is not valid in FAT32.
        
           | account42 wrote:
           | Not all grapheme clusters have composed forms so
           | normalization doesn't actually gain you anything here.
        
             | nordsieck wrote:
             | > Not all grapheme clusters have composed forms so
             | normalization doesn't actually gain you anything here.
             | 
             | Just because the worst case can't improve doesn't mean that
             | making the average case better is worthless.
        
               | panzi wrote:
                | I saw something about Arabic text, where naive
                | truncation at codepoint boundaries turns one word into
                | a different word! The sequence of codepoints generates
                | something that is represented as a single glyph in
                | fonts, but truncated it becomes totally different
                | glyphs. I don't remember more details and I don't know
                | any Arabic, but grapheme clusters aren't just about
                | adding diacritics to Latin characters. In other
                | languages it all might work quite differently. So
                | truncating at word boundaries (at breakable white-space
                | or punctuation) is probably best. Though of course that
                | way you might truncate the string by a lot.
                | _shrug-emoji_
               | 
               | (I don't think the talk where the stuff about Arabic was
               | mentioned was Plain Text by Dylan Beattie, but I haven't
               | re-watched it to confirm. So maybe it is. Can't remember
               | the name of any other talk about the subject right now.)
        
               | gmueckl wrote:
               | Randomly truncating words can have the same effect in any
               | language. It's outright trivial to find examples in
                | English or German. I don't understand why one has to
                | invoke Arabic script for a good example.
        
               | asabil wrote:
               | Yes, but you don't end up with different glyphs. Arabic
               | script has letter shaping, that means a letter can have
               | up to 4 shapes based on its position within the word. If
               | you chop off the last letter, the previous one which used
               | to have a "middle" position shape suddenly changes into
               | "terminal" position shape.
        
           | mort96 wrote:
           | The emoji "" can't be normalized further -- it's a ""
           | followed by a "". If you just split on code points rather
           | than grapheme clusters, even after normalizing, your naive
           | truncation algorithm will have accidentally changed the skin
           | colors of emoji. Or turned the flag of Norway into an "". Or
           | turned the rainbow flag into a white flag .
           | 
           | EDIT: oh lord Hacker News strips emoji. You get the idea even
           | though HN ruined the illustrations. Not my fault HN is
           | broken.
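             [Ed.: the failure modes described above are easy to
             demonstrate in any language that exposes code points; a
             Python sketch, with the emoji written as escapes since HN
             strips them:]

```python
# Slicing by code points yields valid Unicode that nonetheless says
# something different from the original string.
thumbs_dark = '\U0001F44D\U0001F3FF'    # thumbs-up + dark skin modifier
assert thumbs_dark[:1] == '\U0001F44D'  # truncation changed the skin tone

norway = '\U0001F1F3\U0001F1F4'         # regional indicators N + O (flag)
assert norway[:1] == '\U0001F1F3'       # a lone "N" indicator, no flag

rainbow = '\U0001F3F3\uFE0F\u200D\U0001F308'  # white flag + ZWJ + rainbow
assert rainbow[:2] == '\U0001F3F3\uFE0F'      # now just a white flag
```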
        
             | recursive wrote:
             | Presumably referring to Fitzpatrick modifiers.
        
             | tingletech wrote:
             | HN is not broken, it's working as designed.
        
               | postmodest wrote:
               | [poop]
        
               | mort96 wrote:
               | It makes technical conversations about Unicode
               | ridiculously annoying. It's working as designed and the
               | design is broken.
        
               | lisper wrote:
               | Whether the absence of emojis on HN is a feature or a bug
               | is arguable. But if you can't figure out a way to work
               | around this constraint (e.g. put your example literally
               | anywhere else on the web and post a link here) HN is
               | probably not a good fit for you.
        
               | mort96 wrote:
               | Reading a discussion thread where each message is just a
               | link to some pastebin with the actual message isn't very
               | nice. Besides, I wasn't going to write the message again
               | after HN removed arbitrary parts of it, hence the edit; I
               | think people got the gist. You may feel that discussion
               | about Unicode doesn't belong on HN but I feel otherwise.
        
               | lisper wrote:
               | Reading discussion threads full of silly emojis isn't
               | "very nice" either, at least for a certain kind of
               | audience. It's a tradeoff, and the powers that be at HN
               | have decided to optimize for sober discussion over
               | expressivity. It's a defensible decision. Keeping HN from
               | degenerating into Reddit is already hard enough.
        
               | mort96 wrote:
                | I'm sure you'd have survived my uncensored message.
        
               | lisper wrote:
               | Of course. It's everyone else's emojis that would get
               | annoying.
        
             | AlienRobot wrote:
             | More sites should have this bug. Whenever I see a colored
             | icon in the middle of black text I get inexplicably angry.
        
               | mort96 wrote:
               | Might I suggest therapy?
        
           | Sharlin wrote:
           | Even if you only support scripts for which Unicode has
           | composed codepoints, these days you likely can't get away
            | without properly handling emoji, and the numerous emoji
            | that are made of multiple code points (eg. skin color and
            | gender variants as well as flags) have no precomposed
            | versions.
        
         | masklinn wrote:
          | For actual best results you'd probably want to truncate at a
          | word or syllable boundary, and it should likely be
          | language-specific.
        
       | gpvos wrote:
        | Better to use grapheme clusters than individual code points.
        | After all, you don't want to chop the diaeresis off an e.
        
       | electroly wrote:
       | Don't do this. Use a language (like C#) or library (like
       | libunistring) that can do grapheme cluster segmentation. In .NET
       | it's StringInfo.GetTextElementEnumerator(). In libunistring it's
       | u8_grapheme_breaks(). In ICU4C it's
       | icu::BreakIterator::createCharacterInstance(). In Ruby it's
       | each_grapheme_cluster(). Other ecosystems with rich Unicode
       | support should have similar functionality.
        
         | daneel_w wrote:
         | "I had a problem, and here's my working solution for my
         | specific case."
         | 
         | -"Don't do this. Instead use a completely different programming
         | language."
        
           | electroly wrote:
           | In Perl (OP's chosen language) you can use the Unicode::Util
           | package. That's why I was pretty clear that you can use a
           | different language _or_ a different library. This seems to be
           | a pretty uncharitable reading of my post. Use the right tool
           | for the job.
        
         | memco wrote:
          | That's fine unless you are a language or library creator, in
          | which case knowing how to do it properly can't be deferred to
          | someone else. Porting someone else's correct implementation
          | is a fine start, but someone somewhere has to implement this,
          | and unless those who do that kind of work share their
          | knowledge and experience, it will remain esoteric knowledge
          | locked away. Most of us are not those people, but some are.
        
           | neonsunset wrote:
           | Hi, I'm one of the people who are library authors in this
           | area.
           | 
            | This article is very specific to Perl, and the approach it
            | takes is also questionable - it does not look efficient.
            | 
            | You will be better off reading the excellent Wikipedia page
            | on UTF-8: https://en.wikipedia.org/wiki/UTF-8
           | 
            | Now, extended grapheme cluster enumeration is much more
            | complex than finding the next non-continuation byte (or
            | counting such), but to perform it correctly you would
            | ultimately end up reading the official spec at unicode.org
            | and perusing reference implementations like ICU (which is
            | painful to read) or the standard libraries/popular packages
            | for Rust/Java/C#/Swift (the decent ones I'm aware of; do
            | not look at C++).
        
           | electroly wrote:
           | As it turns out, I _am_ writing my own language, and my
           | language supports grapheme cluster segmentation. I just used
           | libunistring (and before that, I used ICU). TFA is not doing
           | this correctly at all; the Unicode specification provides the
           | rules for grapheme cluster segmentation if you wish to
            | implement it yourself[0]. There's nothing to be learned from
           | TFA's hacky and fundamentally incorrect approach. OP's
           | technique will freely chop combining code points that needed
           | to be kept.
           | 
           | [0] https://unicode.org/reports/tr29/
        
         | simonw wrote:
          | I pasted your comment here into GPT-4o and asked for the
          | Python equivalent; it suggested this, which seems to work
          | well:
          | 
          |     import regex as re
          | 
          |     def grapheme_clusters(text):
          |         # \X is the regex pattern that matches a grapheme cluster
          |         pattern = re.compile(r'\X')
          |         return [
          |             match.group(0)
          |             for match in pattern.finditer(text)
          |         ]
         | 
         | https://chatgpt.com/share/481c9c94-0431-4fcb-82aa-a44a4f3c21...
        
           | AlotOfReading wrote:
            | Note that _regex_ is not the _re_ module from the stdlib;
            | it's a separate third-party module that directly exposes
            | more powerful regex capabilities, like grapheme
            | clustering.
        
             | simonw wrote:
             | That's a good callout, here's the docs for \X in that regex
             | module: https://github.com/mrabarnett/mrab-
             | regex?tab=readme-ov-file#...
        
       | hoten wrote:
       | I wrote a JavaScript version of this using Intl:
       | https://github.com/GoogleChrome/lighthouse/blob/9baac0ae9da7...
        
       | LeonidasXIV wrote:
          | This will probably fail if the thing being chopped off is a
          | composed emoji, like a flag emoji (where it can chop off the
          | second letter of the ISO code and leave a first letter that
          | is bewildering to the user but completely valid) or a ZWJ
          | sequence emoji, which will leave a color or half a family or
          | other shenanigans, depending on where it cuts.
        
         | amelius wrote:
         | Why is that a problem? If you cut off the country, at least you
         | know that there was a flag. If you cut off the entire grapheme,
         | then you know nothing!
        
           | bux93 wrote:
           | Well, if the user entered a French flag, and then you show it
           | back to them as a white flag, you may cause a bit of an
           | international incident. Or worse, accusations of telling very
           | old jokes.
        
           | recursive wrote:
           | I think you'd just get the first letter of the country code
           | and not a flag at all.
        
       | Aransentin wrote:
        | There are some hard-to-handle edge cases when doing display
        | length truncation in Unicode, e.g. the character U+FDFD (the
        | Arabic Bismillah ligature) is only three bytes in UTF-8 but
        | can be very long depending on the typeface*, so "completely"
        | solving it is quite hard and has to depend on feedback from
        | your rasterization engine.
       | 
       | (*Rendered version on Wikipedia:
       | https://commons.wikimedia.org/wiki/File:Lateef_unicode_U%2BF... )
        
         | account42 wrote:
          | This is a completely unrelated problem, since the article is
          | quite clearly about limiting to a certain maximum byte length
          | and not display length. Display length depends on the font
          | and shaping engine even without Unicode.
        
       | tingletech wrote:
       | MARC can do all kinds of crazy things. I used to work with folks
       | who had been hacking on MARC since the 1960s. If I remember
       | correctly, at one point it got punched onto dangling chad cards
       | (and of course was used to print the cards in the card catalog in
       | the library).
       | 
       | > The real problem is that USMARC uses an int with 4 digits to
       | store the size of a field, followed by 5 digits for the offset.
       | 
       | A colleague told me they used to exploit this "feature" to leave
       | hidden messages in MARC records between fields.
        
         | quink wrote:
          | Well, until some system comes along that relies on the
          | directory for the tags only and just splits the record using
          | the separator characters. That is a valid enough approach,
          | either to work around bad encoding or when your record is on
          | something other than a magnetic tape and you don't need to
          | know what exact offset to move to.
         | 
         | Hidden is a very relative term there.
        
           | tingletech wrote:
           | I never really worked with MARC much (except for a script for
           | generating patron records once a quarter to load new students
           | and staff into III and somehow we marked obsolete users to
           | change their status) but I used to work at the successor
           | organization to the University of California Division of
           | Library Automation (nee University Library Automation
           | Program), and one of the folks telling MARC war stories was
            | describing doing this with a tool he created
           | specifically for creating pathological MARC records. They
           | aggregated records from local systems into the systemwide
           | "Melvyl" (during ULAP they produced microfiche binders of the
           | union catalog) -- I don't know that they ever redistributed
           | the MARC to other display systems.
        
       | fl0ki wrote:
       | Fun fact: part of why TOML 1.1 has taken so long to land is
       | because of open questions around unicode key normalization. That
       | in itself sounds dry and boring, but the discussion threads are
       | anything but.
       | 
       | https://github.com/toml-lang/toml/issues/994
       | 
       | https://github.com/toml-lang/toml/issues/966
       | 
       | https://github.com/toml-lang/toml/issues/989
       | 
       | https://github.com/toml-lang/toml/pull/979
        
       ___________________________________________________________________
       (page generated 2024-06-05 23:02 UTC)