[HN Gopher] Unicode shenanigans: Martine A(c)crit en UTF-8
       ___________________________________________________________________
        
       Unicode shenanigans: Martine A(c)crit en UTF-8
        
       Author : 082349872349872
       Score  : 46 points
       Date   : 2024-10-05 18:51 UTC (3 days ago)
        
 (HTM) web link (blog.poisson.chat)
 (TXT) w3m dump (blog.poisson.chat)
        
       | keybored wrote:
       | Martine A(c)crit thought she was an ugly duckling. In reality she
       | was a UTF-8 swan.
        
       | netsharc wrote:
       | When UTF-8 wasn't universal (geez, I'm old... It was still this
       | century though!), the page I used to figure out what was going
       | on whenever I ran into Mojibake was
       | http://www.jeppesn.dk/utf-8.html. Amazingly, it hasn't suffered
       | link rot and is still online.
       | 
       | And I fully agree with footnote 2, why is an extended Unicode
       | character being used in place of the apostrophe?
       | https://tedclancy.wordpress.com/2015/06/03/which-unicode-cha...
        
         | LegionMammal978 wrote:
         | Because the apostrophe as used in English is a punctuation
         | mark, not a letter, and especially not a modifier letter. The
         | author argues that any name ought to be matched by anything in
         | \w, and we should avoid a punctuation character for that
         | reason, but he doesn't mention other punctuation marks like the
         | hyphen that also commonly occur in names.
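
The \w point above can be checked directly. This is a minimal sketch, assuming Python's `re` module with a `str` pattern (where `\w` is Unicode-aware); the names "O'Neil" and "Jean-Luc" are hypothetical examples, not taken from the linked article.

```python
import re
import unicodedata

# With a str pattern, \w matches Unicode word characters by default.
NAME = re.compile(r"\w+")

# Hypothetical sample names, one per candidate character.
candidates = {
    "ASCII apostrophe U+0027": "O\u0027Neil",
    "right single quote U+2019": "O\u2019Neil",
    "modifier letter apostrophe U+02BC": "O\u02BCNeil",
    "hyphen-minus U+002D": "Jean\u002DLuc",
}

def is_single_word(s):
    """True if the whole string consists only of \\w characters."""
    return NAME.fullmatch(s) is not None

# Which spellings survive a whole-string \w+ match.
results = {label: is_single_word(s) for label, s in candidates.items()}

# General category of each character: only U+02BC is a letter (Lm).
categories = {ch: unicodedata.category(ch) for ch in "\u0027\u2019\u02BC\u002D"}
```

Only the U+02BC spelling matches `\w+` as a single word; both punctuation apostrophes and the hyphen split the name, which is the gap the comment points out.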
        
       | bazzargh wrote:
       | I read this and wondered,
       | 
       | "what Planet software is Planet Haskell using that it's doing
       | these odd things - oh, it's Venus? https://github.com/rubys/venus
       | ... wow that's old... and doesn't that use something like
       | filesystem storage for the feeds and couldn't this happen if you
       | stored the xml with no character set specified and the parser
       | messed it up on the way back in? Which since it's ending up with
       | a Windows encoding...wait, why am I remembering any of
       | this....oh. _checks credits for Venus, recordscratch_ yes,
       | that's me, wondering why I fixed Venus to run on Windows 18 years
       | ago"
       | 
       | https://github.com/rubys/venus/commit/210781768705d20dc3cbe6...
       | 
       | Sorry about that. As I recall, at the time I was the only person
       | using Venus on Windows, and I was just running a hacked-up
       | version on a local machine. There are some conditionals in there
       | about whether it uses libxml2 or not (in modern Python, it
       | wouldn't need to). That call doesn't take a charset parameter,
       | and my guess is the problems begin as soon as libxml2 tries to
       | parse the file on disk. I think my own version was frankensteined
       | to the extent of using a SQLite back end so I didn't have to deal
       | with Windows files any more.
       | 
       | The author of Venus, Sam Ruby, is on here (rubys) but it looks
       | like he hasn't checked in for a long time; last I saw he was over
       | at fly.io.
       | 
       | Oh, and the even funnier part of all this is that, back when this
       | was written, Sam's blog was THE go-to place to look up the list
       | of Mojibake mistakes you'd made...
       | 
       | https://intertwingly.net/stories/2004/04/14/i18n.html#Cleani...
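
The failure mode guessed at above (UTF-8 bytes read back under a Windows encoding) can be reproduced in a few lines. This is a sketch only, assuming the wrong codec is Windows-1252 (`cp1252`); the thread does not confirm which encoding Venus actually ended up with, and `garble` and `repair` are hypothetical helper names.

```python
def garble(text, wrong_encoding="cp1252"):
    """Encode as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)

def repair(text, wrong_encoding="cp1252"):
    """Undo garble(): recover the original bytes, then decode as UTF-8."""
    return text.encode(wrong_encoding).decode("utf-8")

title = "Martine écrit en UTF-8"

# "é" is U+00E9, i.e. bytes C3 A9 in UTF-8; cp1252 reads them as "Ã" and "©".
garbled = garble(title)      # "Martine Ã©crit en UTF-8"
restored = repair(garbled)   # back to the original title
```

The round trip works here because both bytes of "é" are defined in cp1252. It is not guaranteed in general: a few byte values are undefined in Python's `cp1252` codec, so `garble` can raise `UnicodeDecodeError` on other input.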
        
       | andai wrote:
       | I often run into this when I do stuff in Python and forget to add
       | encoding="utf-8" to open(). I think they're finally changing this
       | to the default.
       | 
       | Actually I ran into a separate issue on Windows where Python will
       | automatically replace the line endings depending on the OS. So I
       | had to specify newline='\n' as an argument to open() or it would
       | alter the newlines to Windows format in the output.
       | 
       | (My fault for not running it in WSL, I guess.)
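
Both open() arguments mentioned above can be seen in one place. This is a minimal sketch using only the standard library; `write_text_portably`, `read_raw`, `demo` and the file name are hypothetical names chosen for illustration.

```python
import os
import tempfile

def write_text_portably(path, text):
    # encoding="utf-8": do not rely on the platform or locale default.
    # newline="\n": write "\n" unchanged instead of translating it
    # to os.linesep (which is "\r\n" on Windows).
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

def read_raw(path):
    # Binary mode returns the bytes exactly as stored on disk.
    with open(path, "rb") as f:
        return f.read()

def demo():
    # Write a line with a non-ASCII character, then inspect the raw bytes.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "martine.txt")
        write_text_portably(path, "Martine écrit en UTF-8\n")
        return read_raw(path)
```

Calling `demo()` should return the same bytes on every OS: "é" stored as C3 A9 and a bare LF at the end, with no CR inserted.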
        
       | Dwedit wrote:
       | Wikipedia used to have this picture as an illustration of
       | Mojibake:
       | https://dic.academic.ru/pictures/wiki/files/76/Letter_to_Rus...
       | A very good job from the postal employees who corrected it.
        
       ___________________________________________________________________
       (page generated 2024-10-08 23:00 UTC)