[HN Gopher] Unicode shenanigans: Martine Ã©crit en UTF-8
___________________________________________________________________
Unicode shenanigans: Martine Ã©crit en UTF-8
Author : 082349872349872
Score : 46 points
Date : 2024-10-05 18:51 UTC (3 days ago)
(HTM) web link (blog.poisson.chat)
(TXT) w3m dump (blog.poisson.chat)
| keybored wrote:
| Martine Ã©crit thought she was an ugly duckling. In reality she
| was a UTF-8 swan.
| netsharc wrote:
| When UTF-8 wasn't universal (geez, I'm old... It was still this
| century though!), a page I found to figure out what was going on
| when I encountered Mojibake is: http://www.jeppesn.dk/utf-8.html
| . Amazingly it hasn't suffered linkrot and is still online.
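|
| For anyone who hasn't debugged this before, here is a minimal
| sketch of how the classic mojibake arises and how it can sometimes
| be undone. The sample string and the choice of cp1252 as the wrong
| codec are assumptions for illustration, not taken from the article.

```python
# Illustrative only: the sample text and the cp1252 codec are assumptions.
original = "Martine écrit en UTF-8"

# Encode correctly as UTF-8, then decode with the wrong single-byte codec.
# "é" is the two bytes C3 A9, which cp1252 reads as "Ã" followed by "©".
garbled = original.encode("utf-8").decode("cp1252")

# If no bytes were lost along the way, reversing both steps recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
```

| The repair only works while every garbled character still maps back
| to a byte in the wrong codec; once something has replaced or dropped
| a character, the original is gone.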
|
| And I fully agree with footnote 2, why is an extended Unicode
| character being used in place of the apostrophe?
| https://tedclancy.wordpress.com/2015/06/03/which-unicode-cha...
| LegionMammal978 wrote:
| Because the apostrophe as used in English is a punctuation
| mark, not a letter, and especially not a modifier letter. The
| author argues that any name ought to be matched by anything in
| \w, and we should avoid a punctuation character for that
| reason, but he doesn't mention other punctuation marks like the
| hyphen that also commonly occur in names.
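|
| To make the \w point concrete, here is a rough sketch using
| Python's re module, where \w is Unicode-aware for str patterns. The
| names are made up for the example, and other regex engines may
| define \w differently.

```python
import re
import unicodedata

# Hypothetical sample names; each uses a different joining character.
names = {
    "U+0027 typewriter apostrophe": "O'Brien",
    "U+2019 right single quotation mark": "O\u2019Brien",
    "U+02BC modifier letter apostrophe": "O\u02bcBrien",
    "U+002D hyphen-minus": "Jean-Luc",
}


def is_single_word(name: str) -> bool:
    """Return True if the whole name is matched by one run of \\w."""
    return re.fullmatch(r"\w+", name) is not None


# Which names survive as a single \w+ token.
results = {label: is_single_word(name) for label, name in names.items()}

# General categories of the four joining characters.
categories = {ch: unicodedata.category(ch) for ch in "'\u2019\u02bc-"}
```

| Only the modifier letter (category Lm) is matched by \w here; both
| quote-style apostrophes and the hyphen are punctuation and split the
| name.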
| bazzargh wrote:
| I read this and wondered,
|
| "what Planet software is Planet Haskell using that it's doing
| these odd things - oh, it's Venus? https://github.com/rubys/venus
| ... wow that's old... and doesn't that use something like
| filesystem storage for the feeds and couldn't this happen if you
| stored the xml with no character set specified and the parser
| messed it up on the way back in? Which since it's ending up with
| a windows encoding...wait, why am I remembering any of
| this....oh. _checks credits for Venus, recordscratch_ yes,
| that's me, wondering why I fixed Venus to run on Windows 18 years
| ago"
|
| https://github.com/rubys/venus/commit/210781768705d20dc3cbe6...
|
| Sorry about that. As I recall, at the time I was the only person
| using Venus on Windows, and I was just running a hacked-up
| version on a local machine. There are some conditionals in there
| about whether it uses libxml2 or not (in modern Python, it
| wouldn't need to). That call doesn't take a charset parameter, and
| my guess is the problems begin as soon as libxml2 tries to parse
| the file on disk. I think my own version was frankensteined to
| the extent of using a SQLite back end so I didn't have to deal
| with Windows files any more.
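|
| I haven't re-run the old code, so this is only a sketch of the
| general failure mode, not a reproduction of what Venus or libxml2
| actually did. It uses the standard library's ElementTree, and the
| mislabelled declaration is an assumption standing in for whatever
| made the parser pick a single-byte encoding.

```python
import xml.etree.ElementTree as ET

# The same UTF-8 bytes are used in both documents.
payload = "écrit".encode("utf-8")  # b'\xc3\xa9crit'

# No encoding declaration: an XML parser defaults to UTF-8 for these bytes.
no_decl = b"<title>" + payload + b"</title>"

# Hypothetical bad label: identical bytes, but declared as Latin-1.
wrong_decl = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<title>" + payload + b"</title>"
)

good = ET.fromstring(no_decl).text    # decoded as UTF-8
bad = ET.fromstring(wrong_decl).text  # each byte decoded as one Latin-1 char
```

| Once the text has been decoded the wrong way and written back out,
| every later reader sees the garbled form as if it were the real
| title.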
|
| The author of Venus, Sam Ruby, is on here (rubys) but it looks
| like he hasn't checked in for a long time; last I saw he was over
| at fly.io.
|
| Oh and the even funnier part of this all is, back when this was
| written, Sam's blog was THE go-to place to look up the list of
| Mojibake mistakes you'd made...
|
| https://intertwingly.net/stories/2004/04/14/i18n.html#Cleani...
| andai wrote:
| I often run into this when I do stuff in Python and forget to add
| encoding="utf-8" to open(). I think they're finally changing this
| to the default.
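|
| A small sketch of the difference, assuming a scratch file in a
| temporary directory. The cp1252 read stands in for a platform whose
| default encoding isn't UTF-8; that part is simulated, not what any
| particular machine does. As far as I know the change is PEP 686,
| which makes UTF-8 mode the default in Python 3.15.

```python
import os
import tempfile

text = "Martine écrit en UTF-8"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "post.txt")  # hypothetical file name

    # Explicit encoding on write...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and on read, so the result does not depend on the platform default.
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

    # Simulates forgetting encoding= on a machine whose default is cp1252.
    with open(path, encoding="cp1252") as f:
        misread = f.read()
```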
|
| Actually I ran into a separate issue on Windows where Python will
| automatically replace the line endings depending on the OS. So I
| had to specify newline='\n' as an argument to open() or it would
| alter the newlines to Windows format in the output.
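|
| Roughly what that looks like, as a sketch with made-up file names.
| Passing newline="\r\n" explicitly reproduces on any OS the
| translation that text mode applies by default on Windows, so the
| two outputs can be compared byte for byte.

```python
import os
import tempfile

lines = "first\nsecond\n"

with tempfile.TemporaryDirectory() as tmp:
    translated = os.path.join(tmp, "translated.txt")  # hypothetical names
    untouched = os.path.join(tmp, "untouched.txt")

    # newline="\r\n" rewrites every "\n" as "\r\n" on the way out.
    with open(translated, "w", encoding="utf-8", newline="\r\n") as f:
        f.write(lines)

    # newline="\n" writes "\n" through unchanged on every OS.
    with open(untouched, "w", encoding="utf-8", newline="\n") as f:
        f.write(lines)

    # Read back in binary mode to see the bytes that actually hit the disk.
    with open(translated, "rb") as f:
        translated_bytes = f.read()
    with open(untouched, "rb") as f:
        untouched_bytes = f.read()
```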
|
| (My fault for not running it in WSL, I guess.)
| Dwedit wrote:
| Wikipedia used to have this picture for an illustration of
| Mojibake:
| https://dic.academic.ru/pictures/wiki/files/76/Letter_to_Rus... A
| very good job from the postal employees who corrected it.
___________________________________________________________________
(page generated 2024-10-08 23:00 UTC)