Post AkfTIZJdIu0Te0dsTQ by flensrocker@troet.cafe
 (DIR) Post #AkfRXszoDQY17JXu6a by ErikUden@mastodon.de
       2024-08-05T20:11:12Z
       
       0 likes, 0 repeats
       
       Can someone explain why when things go wrong with Unicode, sometimes this happens specifically? “Ãœ” = Ü, “Ã¶” = ö, “Ã¤” = ä. When files are transferred through FTP or stored on servers that don't accept umlauts within filenames (or something like that) I find some files renamed to include these characters as replacements for them. What's going on here?
       
 (DIR) Post #AkfRn0g3wE5zHhWNYO by mrtnsnp@mastodon.social
       2024-08-05T20:13:51Z
       
       0 likes, 0 repeats
       
       @ErikUden The bytes are then read as an 8-bit encoding of some sort, rather than the UTF-8 encoding that is in the input. https://www.youtube.com/watch?v=MijmeoH9LT4
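
The misreading described above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the "8-bit encoding of some sort" is Windows-1252 (`cp1252` in Python); the thread does not say which codec the server actually used.

```python
# Sketch: reproduce the mojibake from the question by decoding UTF-8 bytes
# with the wrong single-byte codec. cp1252 is an assumption, not a known fact
# about the server in question.
original = "Üöä"
utf8_bytes = original.encode("utf-8")   # each umlaut becomes two bytes
misread = utf8_bytes.decode("cp1252")   # each byte is shown as its own character

print(utf8_bytes.hex(" "))
print(misread)
```

Every umlaut turns into two characters because the reader treats each of its two UTF-8 bytes as a separate character.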
       
 (DIR) Post #AkfRvWjWdgat1LCbFw by vogelchr@chaos.social
       2024-08-05T20:15:26Z
       
       0 likes, 0 repeats
       
       @ErikUden probably a variation of https://en.m.wikipedia.org/wiki/Mojibake#Other_Western_European_languages (see section Other Western European languages).
       
 (DIR) Post #AkfS5S3esP7XPsLrWK by klarmarx@det.social
       2024-08-05T20:17:15Z
       
       0 likes, 0 repeats
       
       @ErikUden That's actually a UTF-8 thing. It encodes all Unicode characters using 1 to 4 bytes. The first 128 characters (0 to 127) match ASCII. The 8th bit, which would otherwise help encode characters 128 to 255, acts as a flag that the byte is part of a multi-byte sequence. So the character sequences you gave are just the Latin-1 interpretation of the UTF-8 encoded characters. Or in old web terms: wrong codepage 😎
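
To make the "1 to 4 bytes" and "8th bit as a flag" points concrete, here is a small sketch that prints the UTF-8 bytes of a few characters in binary. The helper name `describe` and the sample characters are only illustrative choices.

```python
def describe(ch):
    """Return (byte count, bit pattern) of one character's UTF-8 encoding."""
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    return len(data), bits

# ASCII letter, umlaut, euro sign, emoji: 1, 2, 3 and 4 bytes respectively.
for ch in ["A", "Ü", "€", "😎"]:
    count, bits = describe(ch)
    print(f"U+{ord(ch):04X} -> {count} byte(s): {bits}")
```

Single-byte (ASCII) characters start with a 0 bit. In a multi-byte sequence the lead byte starts with 110, 1110 or 11110, and every following byte starts with 10.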
       
 (DIR) Post #AkfSCiU8CRg8OICMG8 by bert_hubert@fosstodon.org
       2024-08-05T20:18:34Z
       
       0 likes, 0 repeats
       
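       @ErikUden This is UTF-8. Capital U umlaut is Unicode 220 (decimal), which in the variable-length UTF-8 encoding ends up as the two bytes 195 156 (decimal), which read as Ãœ in a single-byte encoding such as Windows-1252 (not ASCII proper, which only covers 0 to 127). https://en.wikipedia.org/wiki/UTF-8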
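
The arithmetic from 220 to 195 156 can be worked through by hand. The sketch below covers only the two-byte case of UTF-8 (code points 128 to 2047); `encode_two_byte` is a hypothetical helper for illustration, not a library function.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    lead = 0b11000000 | (code_point >> 6)          # 110xxxxx: top 5 bits
    cont = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, cont])

encoded = encode_two_byte(220)             # 220 is U+00DC, capital U umlaut
print(list(encoded))
print(encoded == "Ü".encode("utf-8"))      # compare with Python's own encoder
```

For 220 (binary 11011100), the top bits 00011 go into the lead byte (192 + 3 = 195) and the low six bits 011100 go into the continuation byte (128 + 28 = 156).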
       
 (DIR) Post #AkfSTjrKLO5zZkAHNw by kami_kadse@don.linxx.net
       2024-08-05T20:21:38Z
       
       0 likes, 0 repeats
       
       @ErikUden probably not unicode, but differing file name encodings on different systems (and file names don't have a BOM, so there might be guessing involved at some stage of conversion).
       
 (DIR) Post #AkfSVs1UxefDTVwGZ6 by birne@troet.cafe
       2024-08-05T20:22:01Z
       
       0 likes, 0 repeats
       
       @ErikUden Magic
       
 (DIR) Post #AkfSm17SuRDIJtQfL6 by Pentropy@lazysocial.de
       2024-08-05T20:24:57Z
       
       0 likes, 0 repeats
       
       @ErikUden which file system is that? remember seeing such on some old dos or windows file system and some samba servers. something about windows code pages if i remember right. anyway, file system is what i would start with
       
 (DIR) Post #AkfTIZJdIu0Te0dsTQ by flensrocker@troet.cafe
       2024-08-05T20:30:50Z
       
       0 likes, 0 repeats
       
       @ErikUden Encoding is hard and there will always be problems. Just use ASCII characters. And all lowercase, because some filesystems are case sensitive and some are not. Lowercase letters, numbers, underscore.
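
A rough sketch of the "lowercase letters, numbers, underscore" rule as a renaming helper. The function name `safe_name`, the German transliteration table, and the choice to keep the dot before the file extension are all assumptions made for this example, not something the post specifies.

```python
import re
import unicodedata

# Assumed transliterations for German; other languages would need other tables.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def _clean(part):
    # Compose first so "u" + combining diaeresis is seen as "ü", then lowercase.
    part = unicodedata.normalize("NFC", part).lower()
    part = "".join(GERMAN.get(ch, ch) for ch in part)
    # Strip remaining accents, then drop anything still outside ASCII.
    part = unicodedata.normalize("NFKD", part)
    part = part.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", part).strip("_")

def safe_name(name):
    """Reduce a file name to lowercase ASCII letters, digits and underscores."""
    stem, dot, ext = name.rpartition(".")
    if not dot:                      # no extension present
        return _clean(name)
    return f"{_clean(stem)}.{_clean(ext)}"

print(safe_name("Übergröße Foto.JPG"))
```

Different original names can collapse to the same result, so a real renaming script would also need to check for collisions.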
       
 (DIR) Post #AkfTKFHmkGK86OZs24 by b90g@gruene.social
       2024-08-05T20:31:07Z
       
       0 likes, 0 repeats
       
       @ErikUden lol ftp :)
       
 (DIR) Post #AkfVSvsOTjLuQ2yG9I by jk@nfdi.social
       2024-08-05T20:51:06Z
       
       0 likes, 0 repeats
       
       @ErikUden The phenomenon is called Mojibake and emerges when a string of bytes is interpreted in a different character set from its original encoding. The examples come from Unicode (UTF-8) strings interpreted as ISO 8859-1 or a proprietary Windows character set, and then reconverted to UTF-8 afterwards. It happens quite frequently with databases: I had a MySQL database containing a lot of Mojibake (sometimes applied twice). I wrote a bunch of scripts to repair it. https://en.wikipedia.org/wiki/Mojibake
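
A repair along these lines can be sketched by reversing the wrong step: encode the damaged text back with the codec it was misread as, then decode the bytes as UTF-8. This is not the poster's script; `fix_mojibake`, its parameters, and the default of Windows-1252 are assumptions for illustration.

```python
def fix_mojibake(text, wrong_codec="cp1252", max_rounds=3):
    """Undo UTF-8-read-as-single-byte mojibake, including repeated damage."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode(wrong_codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break            # text does not look like mojibake (any more)
        if repaired == text:
            break            # pure ASCII: nothing left to change
        text = repaired
    return text

# Build test data by damaging a clean string once and twice.
once = "Müller".encode("utf-8").decode("cp1252")
twice = once.encode("utf-8").decode("cp1252")
print(once, twice)
print(fix_mojibake(once), fix_mojibake(twice))
```

The loop stops as soon as a round fails, which is what leaves correct text untouched. It is still a heuristic: text that legitimately contains a sequence like "Ã¼" would be altered, and bytes that Python's `cp1252` codec leaves undefined (such as 0x81) cannot be round-tripped this way.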
       
 (DIR) Post #AkfWaTVRwR78hZOp2e by chessert@mastodon.online
       2024-08-05T21:07:41Z
       
       0 likes, 0 repeats
       
       @ErikUden One of my first coding projects was cleaning up codepage character translations like this using Perl. MS Word 6.0 docs -> HTML safe characters -> XML documents -> HTML web pages. Ah, what fun times those were! 😉
       
 (DIR) Post #AkfZngK9tbpY5C6hgu by danlyke@researchbuzz.masto.host
       2024-08-05T21:43:40Z
       
       0 likes, 0 repeats
       
       @ErikUden I realize you have a lot of answers already, but: "Üüöä" encoded in UTF-8 is \xC3\x9C\xC3\xBC\xC3\xB6\xC3\xA4. You can look up those bytes in 8859-1: C3 is "Ã", I'm not sure how it's getting the "œ" (though 9C is in a blank area, so...) but the ¼ is BC, ¶ is B6, and so forth. https://en.wikipedia.org/wiki/ISO/IEC_8859-1
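
On the open question about "œ": one likely explanation, assuming the misreading codec was Windows-1252 rather than strict ISO 8859-1, is that Windows-1252 assigns printable characters to the 0x80 to 0x9F range, where ISO 8859-1 has only control codes. A short comparison using Python's built-in codecs:

```python
raw = b"\xc3\x9c"                       # UTF-8 bytes of "Ü"

as_latin1 = raw.decode("latin-1")       # ISO 8859-1: 0x9C is a C1 control code
as_cp1252 = raw.decode("cp1252")        # Windows-1252: 0x9C is "œ"

print([f"U+{ord(c):04X}" for c in as_latin1])
print([f"U+{ord(c):04X}" for c in as_cp1252])
print(as_cp1252)
```

Under ISO 8859-1 the second character is the invisible control code U+009C, while under Windows-1252 it is U+0153, the "œ" seen in the question.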
       
 (DIR) Post #AkgIa6or5Rlw8gMqHY by TheLancashireman@hostux.social
       2024-08-06T06:05:26Z
       
       0 likes, 0 repeats
       
       @ErikUden Make sure your FTP client uses binary mode. I think the default is ASCII. But the problem isn't restricted to FTP. I see it with some email-based systems too. And not just umlauts - I see the Ã character inserted for no apparent reason.