[HN Gopher] How to chop off bytes of an UTF-8 string to fit into...
___________________________________________________________________
How to chop off bytes of an UTF-8 string to fit into a small slot
and look nice
Author : domm
Score : 81 points
Date   : 2024-06-04 11:25 UTC (1 day ago)
(HTM) web link (domm.plix.at)
(TXT) w3m dump (domm.plix.at)
| pianohacker wrote:
| I cut my programmer teeth on Koha years and years ago. Still one
| of the warmest open-source communities I've ever been involved
| in, especially to a shy teenager with a lot of opinions.
|
| Great to see new faces in the community, sad to see the sheer
| insanity of MARC21 still causing chaos. MARCXML is gonna make it
| obsolete Any Day Now!
| quink wrote:
| MARCXML is just a new format for encoding the vast majority
| of the MARC 21 standard (or, for that matter, any other MARC
| variety).
|
| BIBFRAME is gonna make it obsolete Any Day Now!
|
| ("any day now" in this sector means that librarians have been
| talking about it for two decades and in about two decades
| something might actually happen)
| tingletech wrote:
| I do wish the community had leaned into the MODS direction vs
| going over to RDF.
| quink wrote:
| Yeah, I don't know what the sell is there... throw away the
| fidelity of your data because WEMI/FRBR/BIBFRAME/semantic web
| is coming any ~year~ decade now, soon (lol), while re-learning
| everything, definitely going out to tender (because your
| current system won't do it), and shifting your processes
| and integrations. All so you can end up halfway to DC. Yeah,
| no.
|
| The reason libraries don't do cataloguing any longer
| anywhere near as much hasn't got much to do with MARC 21
| being hard.
| tingletech wrote:
| The fidelity seems pretty good, at least if you convert
| to MARCXML and then use the XSLT from Library of Congress
| to generate it. IIRC it has record types for all the FRBR
| levels. It is also not flat like DC. It was a joy to work
| with from a record aggregator perspective, especially if
| you were generating it from MARC. You can even put the
| full table of contents into it. At the time that one of
| my colleagues wrote the "MARC Must Die" article (at least
| if I remember correctly) the teams working on RDA and
| MODS had a lot of overlap and MODS was being designed
| with the era's cataloging theory in mind. There was a
| moment in time where it seemed like a "new MARC" might go
| in that direction.
|
| Having catalogers or metadata librarians write directly
| in MODS XML by hand never made sense (although some folks
| tried this), but as far as something usable to ship
| around I'd rather get MODS than MARC or dublin core. I
| really don't want to have to query a triple store to
| aggregate records.
|
| Catalogers ideally would have tools that make it easy for
| them to follow RDA/AACR2 descriptive practices without
| having to think about the details of MARC or MODS or
| linked data.
|
| I've been out of the business for a couple of years, so I
| have not been following Library of Congress' BIBFRAME
| transition.
| re wrote:
| Further reading:
|
| * https://hoytech.github.io/truncate-presentation/ /
| https://metacpan.org/pod/Unicode::Truncate
|
| * https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundarie...
|
| Truncating at codepoint boundaries at least avoids generating
| invalid (non-UTF-8) strings, but can still result in confusing or
| incorrect displays for human readers, so for best results the
| truncation algorithm should take extended grapheme clusters into
| account, which are probably the closest thing that Unicode has to
| what most people think of as "characters".
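|
| A minimal Python sketch of that idea - grapheme-aware
| truncation to a byte budget - assuming the third-party
| "regex" module for the \X pattern (illustrative helper, not
| the article's Perl code):
|       import regex
|
|       def truncate_utf8(text, max_bytes, ellipsis="..."):
|           # Keep whole extended grapheme clusters within max_bytes.
|           if len(text.encode("utf-8")) <= max_bytes:
|               return text
|           budget = max_bytes - len(ellipsis.encode("utf-8"))
|           out, used = [], 0
|           for cluster in regex.findall(r"\X", text):
|               size = len(cluster.encode("utf-8"))
|               if used + size > budget:
|                   break
|               out.append(cluster)
|               used += size
|           return "".join(out) + ellipsis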
| iforgotpassword wrote:
| To avoid this and a bunch of other confusion, when accepting
| user input I recommend normalizing it to the composed form
| before writing to a DB or file. While Unicode-aware tools and
| software should handle either form just fine, you probably want
| to avoid having something in the pipeline somewhere that
| treats the decomposed and composed forms of the same string as
| different.
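|
| A minimal Python sketch of that normalization step (using the
| stdlib unicodedata module; the function name is just
| illustrative):
|       import unicodedata
|
|       def normalize_input(s: str) -> str:
|           # NFC composes e.g. "e" + U+0301 into the single
|           # code point U+00E9 where a composed form exists.
|           return unicodedata.normalize("NFC", s)
|
|       assert normalize_input("e\u0301") == "\u00e9"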
| arp242 wrote:
| It's good advice to normalise to pre-composed form, but that
| doesn't solve the problem the previous poster mentioned, since
| not everything exists in a composed form. That said: _most_
| things do have a composed form, so you can probably get away
| with it - right up to when you can't.
| quink wrote:
| Yeah, working on a library system, our path was to compose
| everything (taking into account of course that the octet
| sizes specified in the directory may or may not actually be
| accurate depending on whatever system produced the record)
| and around the same time deprecate any pretense we had of
| supporting MARC-8.
| magicalhippo wrote:
| How does that affect filenames?
|
| IIRC the lower levels of Windows will happily work with
| filenames that are not valid Unicode strings, for example if
| you use the kernel API rather than Win32.
|
| But what about Win32? If you create a file before
| normalization and then open it using the normalized form,
| will it open the same file or return file not found?
|
| What about other systems? For example AWS' S3 allows UTF-8
| keys, with no mention of normalization[1].
|
| On the phone so can't try myself right now.
|
| Anyway for general text I agree, but for identifiers,
| filenames and such I prefer to treat them as opaquely as
| possible.
|
| [1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object...
| heinrich5991 wrote:
| No need to go to the kernel API to create filenames that
| are invalid UTF-16. The Win32 API will happily let you do
| it.
| magicalhippo wrote:
| I was AFK so couldn't check, but yeah you're right.
|
| Just made two files named a.txt which happily sat next to
| each other, one being NFC and the other NFD.
|
| So yeah, don't mess with the normalization of filenames.
| panzi wrote:
| They're both valid UTF-16, though. Can you create a
| filename with only half of a surrogate pair in it?
|
| I don't use Windows, so I can't check. Linux literally
| allows any arbitrary byte except for 0x00 and 0x2F ('/'
| in ASCII/UTF-8). It's a problem for programming languages
| that want to only use valid Unicode strings, like Python.
| Rust has a separate type "OsString" to handle that, with
| either lossy conversion to "String" or a conversion
| method that can fail. Python uses the surrogateescape error
| handler (lone surrogates) to represent invalid byte
| sequences in filenames.
| It's all a mess. JavaScript doesn't give a damn about the
| validity of their UTF-16 strings.
|
| (Note that Rust's OsString is different from its CString
| type. Well, I guess under Unix they're the same, but
| under Windows OsString is UTF-16 (or "WTF-16", because it
| isn't actually valid UTF-16 in all cases).)
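|
| For illustration, a small Python sketch of that filename
| round-trip (assuming Linux with a UTF-8 locale):
|       import os
|
|       raw = b"caf\xe9.txt"             # Latin-1 e-acute, not valid UTF-8
|       name = os.fsdecode(raw)          # surrogateescape: 'caf\udce9.txt'
|       assert os.fsencode(name) == raw  # losslessly back to the same bytes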
| magicalhippo wrote:
| Yeah seems to be same with Win32.
|
| I tried using U+13161 EGYPTIAN HIEROGLYPH G029[1], which
| resulted in a string of length 2 as expected.
|
| Using both chars (code units) and just the first char
| (code unit) worked equally fine. In Windows Explorer the
| first one shows the stork as expected, while the second
| shows that "invalid character" rectangle.
|
| So yeah, treating filenames as nearly-opaque byte
| sequences is probably the best approach.
|
| [1]: https://en.wiktionary.org/wiki/%F0%93%85%A1
| moefh wrote:
| > What about other systems?
|
| The Linux kernel doesn't validate filenames in any way, so
| a filename in Linux can contain any byte except 0x2F ('/',
| which is interpreted as directory separator) and 0x00
| (which signals the end of the byte string).
|
| ETA: of course some file systems have other limitations,
| for example '\' is not valid in FAT32.
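|
| A quick Python sketch of that (assumes Linux and a filesystem
| like ext4 that doesn't enforce an encoding):
|       import os, tempfile
|
|       d = tempfile.mkdtemp().encode()
|       path = d + b"/not-utf8-\xff"     # \xff can never appear in UTF-8
|       with open(path, "wb") as f:
|           f.write(b"hello")
|       print(os.listdir(d))             # [b'not-utf8-\xff']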
| account42 wrote:
| Not all grapheme clusters have composed forms so
| normalization doesn't actually gain you anything here.
| nordsieck wrote:
| > Not all grapheme clusters have composed forms so
| normalization doesn't actually gain you anything here.
|
| Just because the worst case can't improve doesn't mean that
| making the average case better is worthless.
| panzi wrote:
| I saw something about Arabic text, where that naive
| truncation at codepoint boundaries turns one word into a
| different word! The sequence of codepoints generates
| something that is represented as a single glyph in fonts,
| but truncated it's totally different glyphs. I don't
| remember more details, and I don't know any Arabic, but
| grapheme clusters aren't just about adding diacritics to
| Latin characters. In other languages it all might work
| quite differently. So truncating at word boundaries (at
| breakable white-space or punctuation) is probably best.
| Though of course that way you might truncate the string
| by a lot. _shrug-emoji_
|
| (I don't think the talk where the stuff about Arabic was
| mentioned was Plain Text by Dylan Beattie, but I haven't
| re-watched it to confirm. So maybe it is. Can't remember
| the name of any other talk about the subject right now.)
| gmueckl wrote:
| Randomly truncating words can have the same effect in any
| language. It's outright trivial to find examples in
| English or German. I don't understand why one has to
| invoke Arabic script for a good example.
| asabil wrote:
| Yes, but you don't end up with different glyphs. Arabic
| script has letter shaping, meaning a letter can have
| up to 4 shapes based on its position within the word. If
| you chop off the last letter, the previous one which used
| to have a "middle" position shape suddenly changes into
| "terminal" position shape.
| mort96 wrote:
| The emoji "" can't be normalized further -- it's a ""
| followed by a "". If you just split on code points rather
| than grapheme clusters, even after normalizing, your naive
| truncation algorithm will have accidentally changed the skin
| colors of emoji. Or turned the flag of Norway into an "". Or
| turned the rainbow flag into a white flag .
|
| EDIT: oh lord Hacker News strips emoji. You get the idea even
| though HN ruined the illustrations. Not my fault HN is
| broken.
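|
| A small Python sketch of the effect, with escape sequences
| since HN strips the emoji themselves (assumes a recent version
| of the third-party "regex" module for \X):
|       import regex
|
|       rainbow = "\U0001F3F3\uFE0F\u200D\U0001F308"  # rainbow flag
|       print(len(rainbow))                           # 4 code points
|       print(len(regex.findall(r"\X", rainbow)))     # 1 grapheme cluster
|       print(ascii(rainbow[:1]))                     # naive cut: plain white flag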
| recursive wrote:
| Presumably referring to Fitzpatrick modifiers.
| tingletech wrote:
| HN is not broken, it's working as designed.
| postmodest wrote:
| [poop]
| mort96 wrote:
| It makes technical conversations about Unicode
| ridiculously annoying. It's working as designed and the
| design is broken.
| lisper wrote:
| Whether the absence of emojis on HN is a feature or a bug
| is arguable. But if you can't figure out a way to work
| around this constraint (e.g. put your example literally
| anywhere else on the web and post a link here) HN is
| probably not a good fit for you.
| mort96 wrote:
| Reading a discussion thread where each message is just a
| link to some pastebin with the actual message isn't very
| nice. Besides, I wasn't going to write the message again
| after HN removed arbitrary parts of it, hence the edit; I
| think people got the gist. You may feel that discussion
| about Unicode doesn't belong on HN but I feel otherwise.
| lisper wrote:
| Reading discussion threads full of silly emojis isn't
| "very nice" either, at least for a certain kind of
| audience. It's a tradeoff, and the powers that be at HN
| have decided to optimize for sober discussion over
| expressivity. It's a defensible decision. Keeping HN from
| degenerating into Reddit is already hard enough.
| mort96 wrote:
| I'm sure you'd have survived my uncensored message.
| lisper wrote:
| Of course. It's everyone else's emojis that would get
| annoying.
| AlienRobot wrote:
| More sites should have this bug. Whenever I see a colored
| icon in the middle of black text I get inexplicably angry.
| mort96 wrote:
| Might I suggest therapy?
| Sharlin wrote:
| Even if you only support scripts for which Unicode has
| composed codepoints, these days you likely can't get away
| without properly handling emoji, and the numerous emoji that
| are made of multiple code points (e.g. skin color and gender
| variants, as well as flags) have no precomposed versions.
| masklinn wrote:
| For actual best results you'd probably want to truncate at the
| word or syllable boundary, and it should likely be language
| specific.
| gpvos wrote:
| Better use grapheme clusters than Unicode characters. After all,
| you don't want to chop the diaeresis off an e.
| electroly wrote:
| Don't do this. Use a language (like C#) or library (like
| libunistring) that can do grapheme cluster segmentation. In .NET
| it's StringInfo.GetTextElementEnumerator(). In libunistring it's
| u8_grapheme_breaks(). In ICU4C it's
| icu::BreakIterator::createCharacterInstance(). In Ruby it's
| each_grapheme_cluster(). Other ecosystems with rich Unicode
| support should have similar functionality.
| daneel_w wrote:
| "I had a problem, and here's my working solution for my
| specific case."
|
| -"Don't do this. Instead use a completely different programming
| language."
| electroly wrote:
| In Perl (OP's chosen language) you can use the Unicode::Util
| package. That's why I was pretty clear that you can use a
| different language _or_ a different library. This seems to be
| a pretty uncharitable reading of my post. Use the right tool
| for the job.
| memco wrote:
| That's fine unless you are a language or library creator, in
| which case knowing how to do it properly can't be deferred to
| someone else. Perhaps porting someone else's correct
| implementation is good, but someone somewhere has to implement
| this, and unless those who do that kind of work share their
| knowledge and experience it will remain esoteric knowledge
| locked away. Most of us are not those people, but some are.
| neonsunset wrote:
| Hi, I'm one of the people who are library authors in this
| area.
|
| This article is very specific to Perl, and the way it does it
| is also questionable - it does not look efficient.
|
| You will be better off reading the excellent Wikipedia page on
| UTF-8: https://en.wikipedia.org/wiki/UTF-8
|
| Now, extended grapheme cluster enumeration is much more
| complex than finding the next non-continuation byte (or
| counting such), but to do it correctly you would ultimately
| end up reading the official spec at unicode.org and perusing
| reference implementations like ICU (which is painful to read)
| or the standard libraries/popular packages for
| Rust/Java/C#/Swift (the decent ones I'm aware of; do not
| look at C++).
| electroly wrote:
| As it turns out, I _am_ writing my own language, and my
| language supports grapheme cluster segmentation. I just used
| libunistring (and before that, I used ICU). TFA is not doing
| this correctly at all; the Unicode specification provides the
| rules for grapheme cluster segmentation if you wish to
| implement it yourself[0]. There's nothing to be learned from
| TFA's hacky and fundamentally incorrect approach. OP's
| technique will freely chop combining code points that needed
| to be kept.
|
| [0] https://unicode.org/reports/tr29/
| simonw wrote:
| I pasted your comment here into GPT-4o and asked for the Python
| equivalent; it suggested this, which seems to work well:
|       import regex as re
|
|       def grapheme_clusters(text):
|           # \X is the regex pattern that matches a grapheme cluster
|           pattern = re.compile(r'\X')
|           return [
|               match.group(0)
|               for match in pattern.finditer(text)
|           ]
|
| https://chatgpt.com/share/481c9c94-0431-4fcb-82aa-a44a4f3c21...
| AlotOfReading wrote:
| Note that _regex_ is not the _re_ module from the stdlib;
| it's a separate third-party module that directly exposes
| more powerful, PCRE-style capabilities like grapheme
| clustering.
| simonw wrote:
| That's a good callout; here are the docs for \X in that regex
| module:
| https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#...
| hoten wrote:
| I wrote a JavaScript version of this using Intl:
| https://github.com/GoogleChrome/lighthouse/blob/9baac0ae9da7...
| LeonidasXIV wrote:
| This will probably fail if the thing being chopped off is a
| composed emoji, like a flag emoji (where it can chop off the
| second letter of the ISO code and leave just the first letter,
| completely valid but bewildering to the user) or a ZWJ-sequence
| emoji, which will leave a color or half a family or other
| shenanigans, depending on where it cuts.
| amelius wrote:
| Why is that a problem? If you cut off the country, at least you
| know that there was a flag. If you cut off the entire grapheme,
| then you know nothing!
| bux93 wrote:
| Well, if the user entered a French flag, and then you show it
| back to them as a white flag, you may cause a bit of an
| international incident. Or worse, accusations of telling very
| old jokes.
| recursive wrote:
| I think you'd just get the first letter of the country code
| and not a flag at all.
| Aransentin wrote:
| There are some hard-to-handle edge cases when doing display-
| length truncation in Unicode, e.g. the character U+FDFD (the
| Basmala ligature) is only three bytes in UTF-8 but can render
| very wide depending on the typeface*, so "completely" solving
| it is quite hard and has to depend on feedback from your
| rasterization engine.
|
| (*Rendered version on Wikipedia:
| https://commons.wikimedia.org/wiki/File:Lateef_unicode_U%2BF... )
| account42 wrote:
| This is a completely unrelated problem, since the article is
| quite clearly about limiting to a certain maximum byte length,
| not display length. And display length depends on the font and
| shaping engine even without Unicode.
| tingletech wrote:
| MARC can do all kinds of crazy things. I used to work with folks
| who had been hacking on MARC since the 1960s. If I remember
| correctly, at one point it got punched onto dangling chad cards
| (and of course was used to print the cards in the card catalog in
| the library).
|
| > The real problem is that USMARC uses an int with 4 digits to
| store the size of a field, followed by 5 digits for the offset.
|
| A colleague told me they used to exploit this "feature" to leave
| hidden messages in MARC records between fields.
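|
| For reference, a rough Python sketch of how those directory
| entries are laid out (12 bytes each: 3-char tag, 4-digit field
| length, 5-digit offset from the base address; not production
| MARC parsing):
|       def marc_fields(record: bytes):
|           base = int(record[12:17])  # leader 12-16: base address of data
|           directory = record[24:record.index(b"\x1e")]
|           for i in range(0, len(directory), 12):
|               tag = directory[i:i + 3].decode()
|               length = int(directory[i + 3:i + 7])
|               start = int(directory[i + 7:i + 12])
|               yield tag, record[base + start:base + start + length]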
| quink wrote:
| Well, until some system comes along that relies on the
| directory for the tags only and just splits the record using
| the separator characters. Which is a valid enough approach,
| either to work around bad encoding or when your record is on
| something other than a magnetic tape and you don't need to
| know what exact offset to move to.
|
| Hidden is a very relative term there.
| tingletech wrote:
| I never really worked with MARC much (except for a script for
| generating patron records once a quarter to load new students
| and staff into III and somehow we marked obsolete users to
| change their status) but I used to work at the successor
| organization to the University of California Division of
| Library Automation (nee University Library Automation
| Program), and one of the folks telling MARC war stories was
| describing doing this with a tool he created
| specifically for creating pathological MARC records. They
| aggregated records from local systems into the systemwide
| "Melvyl" (during ULAP they produced microfiche binders of the
| union catalog) -- I don't know that they ever redistributed
| the MARC to other display systems.
| fl0ki wrote:
| Fun fact: part of why TOML 1.1 has taken so long to land is
| open questions around Unicode key normalization. That
| in itself sounds dry and boring, but the discussion threads are
| anything but.
|
| https://github.com/toml-lang/toml/issues/994
|
| https://github.com/toml-lang/toml/issues/966
|
| https://github.com/toml-lang/toml/issues/989
|
| https://github.com/toml-lang/toml/pull/979
___________________________________________________________________
(page generated 2024-06-05 23:02 UTC)