hngopher.com

       [HN Gopher] I couldn't debug the code because of my name
       ___________________________________________________________________
        
       I couldn't debug the code because of my name
        
       Author : mikasjp
       Score  : 158 points
       Date   : 2021-10-18 08:40 UTC (2 days ago)
        
 (HTM) web link (mikolaj-kaminski.com)
 (TXT) w3m dump (mikolaj-kaminski.com)
        
       | m_kos wrote:
       | Isn't it bizarre that we have self-driving cars, the ISS, and
       | phones with 50 megapixel cameras but still struggle with
       | character encoding?
        
         | tetha wrote:
         | Character encoding is in a special class of problems. Like time
         | handling.
         | 
         | If you pick up a halfway non-ancient framework in a somewhat
         | common language with a somewhat non-terrible persistence like
         | postgres, you just don't have problems. Just don't care, and it
         | just works.
         | 
         | But it's super easy to derail that fragile correctness with
         | something like MySQL's utf8-ish handling, or some OS's path
         | handling, or 'efficiency', or a user or frontend dev submitting
         | data in a wrong encoding. And then it gets mangled. And then
         | the user is unhappy.
         | 
         | At that point, it becomes very hard to argue why one of the two
         | things is wrong, and the other is not. While the user argues
         | the other way around. Because both look correct, if you look
         | from the right angle. And the only reason why I am right is
         | because of some standard, while the customer is right because
         | of money.
         | 
         | And yes, it is very 'surprising' why our software now functions
         | correctly for Russian or Greek customers.
        
       | darkhorn wrote:
       | I think it is a Java related issue. Relevant issue occurs in
       | Jaspersoft Report. You cannot install Jaspersoft Report on
       | Turkish Windows no matter what.
        
       | dmingod666 wrote:
       | The domain name to the website is all ascii..
        
         | zamalek wrote:
         | If you use a Microsoft account to set up windows then you have
         | no control over the local username.
        
           | dmingod666 wrote:
           | That sucks.. always hated the idea of an online account to
           | access your local system..
        
           | moonchrome wrote:
           | This is exactly why I don't do that initially - I don't mind
           | my account being linked - but I've been bitten by the home
           | path bugs multiple times, I unplug my pc during setup
        
       | numpad0 wrote:
       | Oh, is it not common knowledge that you should not use UTF-8 in
       | a Windows username? That has been the case since the Windows 95
       | days. Only recently has it supposedly improved, after Microsoft
       | Account login became semi-mandatory.
        
         | progval wrote:
         | On the contrary, the first bug happens because docker-compose
         | tries to decode the path as UTF-8, but it is not UTF-8-encoded.
         | ("'utf-8' codec can't decode byte")
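
A minimal sketch of the failure progval describes. It assumes the
folder name reached the tool as Windows-1250 bytes (a code page commonly
used for Polish); the thread does not confirm which code page was
involved, and `try_utf8` is a hypothetical helper.

```python
# Assumption: the name was written out in Windows-1250, not UTF-8.
name = "Miko\u0142aj"            # "Mikołaj"
raw = name.encode("cp1250")      # b'Miko\xb3aj': "ł" is the single byte 0xB3


def try_utf8(data: bytes) -> str:
    """Decode strictly as UTF-8 and report the failure instead of raising."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


result = try_utf8(raw)
print(result)  # the message starts with "'utf-8' codec can't decode byte 0xb3"
```

The same name encoded as UTF-8 decodes without complaint, so the bytes
on the wire, not the name itself, are what trips the strict decoder.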
        
         | chris_overseas wrote:
         | I don't think this bug is anything to do with Windows, rather
         | it is due to the way the paths are handled in the IDE's
         | codebase. Presumably the same problem exists when using these
         | IDEs in conjunction with a path containing non-ascii characters
         | in the Linux or macOS world.
        
           | numpad0 wrote:
           | Isn't it some compilation option issue in native part? I
           | thought it's a line on .sln or include library in a C++
           | source or something that has to be explicitly specified when
           | building a Win32 binary.
        
             | GoblinSlayer wrote:
             | IntelliJ has a native part?
        
         | Fordec wrote:
         | A lot of adults today weren't even alive in 95. Also, the
         | assumption that people are familiar with windows vs other
         | operating systems is becoming less and less valid. And as the
         | world gets more globalised and remote, it can no longer be
         | assumed that all technical people come from an Anglo-American
         | culture.
        
       | david422 wrote:
       | There's also this article: falsehoods-programmers-believe-about-
       | names:
       | 
       | https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...
       | 
       | Certainly informative if you haven't seen it before.
       | 
       | My takeaway from it was to design your system to accommodate
       | as much as possible, but since it would basically be impossible
       | to accommodate everything, aim for your target audience.
        
       | ygra wrote:
       | One way of working around such issues is to use subst. That way
       | the application thinks your project directory is actually located
       | on P:\ or something like that.
        
       | rcxdude wrote:
       | Sadly there is even still software which fails to build or even
       | fails to run when there is a space in a filename (as is super
       | common on windows file paths, as well as autogenerated CI build
       | folders). It's ridiculous to no end that software cannot handle
       | paths correctly.
        
       | tazjin wrote:
       | The number of random encoding problems that still exist is
       | bizarre. I recently left a UK job, having already left the
       | country more than a year earlier, and in their attempt to mail a
       | P45 form to my new address (in Moscow) the only bits that
       | survived were the string "c/o" and the postal code.
        
       | tediousdemise wrote:
       | The solution to this is extremely simple: don't validate
       | usernames, period.
       | 
       | The rationale is from an article someone linked here ("Falsehoods
       | Programmers Believe About Names"):
       | 
       | > Anything someone tells you is their name is--by definition--an
       | appropriate identifier for them.
       | 
       | If you try to validate by checking for profanity, knowing full
       | well that people can have names that contain profane substrings,
       | I have a tongue-in-cheek message for you-- _you are a fucking
       | asshole_.
        
       | xlii wrote:
       | A problem very similar to the one described started my exodus
       | from Google services.
       | 
       | I also have non-latin characters in my name however I knew it was
       | always an issue so I never used it in paths etc.
       | 
       | At some point, a long time ago, I was tasked with some
       | maintenance on a Google Cloud service (can't remember the name of
       | the service now) which was doable only through a Python CLI
       | utility, and it failed with a very similar Python error.
       | 
       | What I found out rather quickly is that utility took my name from
       | Google+ profile, which did include those non-latin characters. No
       | biggie - I thought and fired e-mail to support (yeah it was those
       | times it was still that easy). Few hours passed and I received
       | information that this won't be fixed anytime soon and the best
       | course of action would be to change my name.
       | 
       | Of course, the support person probably meant that I should
       | remove the diacritics from my Google+ profile, but it still left
       | an unpleasant aftertaste for years to come.
        
         | nullspace wrote:
         | > the best course of action would be to change my name
         | 
         | As someone who has been told this, for other reasons, I
         | empathize. My reaction has always been - "Your system can't
         | even handle names, you need to fix it".
         | 
         | Edit: I wish there was a library / service that helped you
         | handle all sorts of edge cases in names, so that you don't
         | have to worry about it. Just use a user-id, and set / get a
         | name from a lib / service that can actually handle it.
        
           | dymk wrote:
           | Has that reaction ever resulted in the other party fixing
           | their system in a timely manner?
        
         | mjevans wrote:
         | This is exactly why I hate the way Python3 handles Unicode.
         | 
         | EVERY language should _try_ to handle Unicode such that if a
         | data sequence were valid before it remains valid after. NONE
         | should ever FORCE validation, since sometimes, like in the
         | article's case, the correct answer is GIGO. Just pass it
         | through and hope it continues to work. Sometimes the error is
         | trying to enforce that validation.
        
           | geofft wrote:
           | Python 3 usually handles this correctly, and I'm a little bit
           | confused what's going on in the article, exactly.
           | 
           | For UNIX path names (and other OS data like environment
           | variables), Python uses the "surrogateescape" error handling
           | method, which does exactly what you ask. Any byte sequence
           | can be converted to a string. If it decodes as valid UTF-8,
           | it will do that. If it hits a byte that does not decode as
           | valid UTF-8 (necessarily a byte >= 128), it will map it to
           | code points U+DC80 through U+DCFF. These are in a reserved
           | range of code points ("surrogates", which make it possible
           | to represent code points > 0xFFFF in UTF-16), and they can't
           | show up in actual Unicode text (i.e., there is no UTF-8
           | encoding of them, strictly speaking, and if you applied the
           | UTF-8 encoding algorithm to a code point in the U+D800 to
           | U+DFFF range, you would get bytes that aren't valid UTF-8).
           | 
           | On the way out, this is reversed. So you get the results you
           | expect if your filenames are in UTF-8, but since UNIX has no
           | requirement that filenames are indeed UTF-8 (the only
           | constraint is they can't contain NUL or ASCII-forward-slash),
           | the bytes are preserved in a funky-looking format in Python
           | and you get the exact same output on the other end.
           | 
           | See https://www.python.org/dev/peps/pep-0383/ for more on
           | what's going on. The tl;dr for users of Python is that if you
           | want to interact with, say, subprocess output as mostly-
           | normal strings (instead of bytes) but you want to be robust
           | to non-UTF-8 bytes, you should do something like
           | subprocess.check_output(["some", "command"],
           | errors="surrogateescape")
           | 
           | You don't need to do this for APIs that directly interact
           | with pathnames, because they do it already. You just need to
           | do it for things like subprocess output and file contents
           | that Python doesn't know you want to handle in this way.
           | 
           | ...
           | 
           | On Windows, however, path names must be valid Unicode and are
           | stored in UTF-16. So the idea of a "ł" that doesn't decode
           | properly shouldn't even happen! Mikołaj's home directory
           | ought to be a very boring (and valid) 004d 0069 006b 006f
           | 0142 0061 006a on disk.
           | 
           | Windows doesn't enforce that file paths are _valid_ UTF-16
           | though (specifically, the surrogate code points are only
           | supposed to show up in a certain way, but nothing enforces
           | that and you can have random surrogates on disk), and hence
           | Rust, which internally represents all strings in UTF-8, has a
           | solution ("WTF-8") that's basically the inverse of
           | surrogateescape - it uses extrapolated-UTF-8-encoding-of-
           | surrogates to handle unpaired surrogates.
           | http://simonsapin.github.io/wtf-8/ But it seems very odd to
           | me that the directory C:\Users\Mikołaj would actually contain
           | any of those, and if it doesn't, I would expect it to very
           | easily turn into a Python Unicode string.
           | 
           | Maybe this is from a Python version before
           | https://www.python.org/dev/peps/pep-0529/ , which is claimed
           | to "fail to round-trip characters outside of the user's
           | active code page"? Maybe this is from a Python version
           | _after_ that change and it's wrong?
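
The round trip geofft describes can be sketched with explicit
`decode`/`encode` calls. The byte string below is a hypothetical
non-UTF-8 file name, and `strict_encode_fails` is a helper introduced
only for illustration.

```python
raw = b"Miko\xb3aj"  # hypothetical file name bytes that are not valid UTF-8

# On the way in: the undecodable byte 0xB3 becomes the lone surrogate U+DCB3.
text = raw.decode("utf-8", errors="surrogateescape")

# On the way out: the same handler turns the surrogate back into byte 0xB3.
restored = text.encode("utf-8", errors="surrogateescape")


def strict_encode_fails(value: str) -> bool:
    """Return True if the string cannot be encoded as strict UTF-8."""
    try:
        value.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True


print(ascii(text))                # 'Miko\udcb3aj'
print(restored == raw)            # True
print(strict_encode_fails(text))  # True
```

The escaped string is only safe to hand back to APIs that use the same
error handler; a strict UTF-8 encode rejects the lone surrogate.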
        
             | nightpool wrote:
             | The incorrect docker-compose file was _generated_ by Java
             | (Jetbrains) but _consumed_ by Python (docker-compose). The
             | GP comment was complaining about Python's strict Unicode
             | consumption, not Java's invalid Unicode generation.
        
           | nightpool wrote:
           | How is this Python's fault? It's not like the `docker-
           | compose` file would have worked any better if it silently
           | replaced one of the volumes with an inaccessible file.
           | Instead, you'd just get a failure from the Windows filesystem
           | API when you tried to access or create a file at "C:\\\Users\
           | \\Mikoaj\\\AppData\\\Local\\\JetBrains\\\Rider2021.2\\\log\\\
           | DebuggerWorker\\\\\", right?
        
       | sschueller wrote:
       | Many years ago I could not access the apple developer panel
       | because of the umlaut in my last name. It was eventually fixed
       | but I was quite surprised that such a large company would run
       | into such a basic issue.
        
         | rodgerd wrote:
         | If you look at many of the responses here it's sadly
         | unsurprising: small-minded provincialism or outright xenophobia
         | are no less common amongst programmers than the general
         | population.
        
         | [deleted]
        
         | devrand wrote:
         | My last name has an apostrophe in it which Apple apparently
         | loves to embed directly into their JavaScript unescaped. For a
         | long time neither I nor Apple could look up AppleCare status on
         | my stuff as they were all linked to my Apple ID. The portal
         | would thus require me to login, but then would just show a
           | partially rendered page as my last name was causing a JS
         | syntax error.
        
           | nneonneo wrote:
           | Hmm, it sure sounds like John <script>alert(1);</script>Doe
           | (Bobby Tables' distant cousin) should sign up for an Apple
           | account. An XSS attack which could target the AppleCare reps'
           | machines could be catastrophically bad...
        
           | doubled112 wrote:
           | You'd think the apostrophe would be common enough they'd know
           | it could happen, but no.
           | 
           | I love to enter it and see what each vendor and website's
           | backend does with it.
           | 
           | The Staples Canada website, for example, returns it as &#39;
           | (HTML escaped) A couple times I've logged in, it seems to
           | escape a new character. I'm currently up to &amp;amp;#39;
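
One plausible way to end up with a growing chain of `&amp;` is a flow
that escapes on every save and never unescapes on load. This is a
guess at the mechanism, not a description of any vendor's code;
`save_and_reload` is hypothetical, and Python's `html.escape` emits
`&#x27;` where the site in the comment emits `&#39;`.

```python
import html


def save_and_reload(value: str, round_trips: int) -> str:
    """Hypothetical buggy flow: escape on each save, never unescape."""
    for _ in range(round_trips):
        value = html.escape(value)
    return value


name = "O'Brien"
once = save_and_reload(name, 1)
thrice = save_and_reload(name, 3)

print(once)    # O&#x27;Brien
print(thrice)  # O&amp;amp;#x27;Brien
```

Escaping exactly once, at the point where the value is rendered into
HTML, and storing the raw value avoids the accumulation.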
        
         | irrational wrote:
         | >such a large company would run into such a basic issue
         | 
         | Every large company is just a conglomeration of smaller
           | departments. Each department has individual contributors. Some
           | individual contributor in that department wrote the code and if
           | nobody else in their department caught it, nobody else at the
         | large company would have caught it since they have their own
         | work to consider and don't have time to look at other people's
         | stuff.
        
           | lostgame wrote:
           | I think what OP means is that a company so large should have
           | the resources to test such edge cases.
        
       | supernes wrote:
       | It's somewhat common to see videogames issue a patch shortly
       | after release where they fix crashes due to non-ASCII Windows
       | usernames or non-English locales. I'm not sure what the root
       | cause of the confusion is, other than text strings being hard in
       | general.
        
         | GoblinSlayer wrote:
         | It's text encoding confusion:
         | https://en.wikipedia.org/wiki/Mojibake
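
A small illustration of mojibake: UTF-8 bytes read back as if they
were Windows-1252. The choice of code page is an assumption made for
the example.

```python
original = "Miko\u0142aj"                # "Mikołaj"
utf8_bytes = original.encode("utf-8")    # "ł" becomes the two bytes C5 82

# Reading those bytes with the wrong code page yields two unrelated characters.
garbled = utf8_bytes.decode("cp1252")

# As long as nothing was lost, reversing the mistake recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)               # MikoÅ‚aj
print(repaired == original)  # True
```

The repair only works while every byte survived the wrong decode; once
a replacement character has been substituted, the original is gone.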
        
         | jerf wrote:
         | It's easy to think the answer is "just UTF-8 everything" but
         | unfortunately the long and twisty history of filesystems means
         | that's not the correct answer, and the "correct answer" is
         | really hard to write down quickly.
         | 
         | If you never display the filename, the answer is to treat
         | existing filenames as bags of bytes, but that breaks down as
         | soon as you need to display them, or if you need to manipulate
         | them by appending unicode to them, in which case you have to
         | decide on an encoding.
         | 
         | Unicode encodings tend to mangle non-Unicode values because
         | they're specified to replace whatever they can't understand
         | with a particular Unicode character, usually represented as a
         | diamond with an inverted ? inside of it.
         | 
         | There's some obscure solutions to this problem, like
         | https://simonsapin.github.io/wtf-8/ (which includes discussion
         | of the 16 bit encodings you need for Windows), but I haven't
         | seen broad support for them. You need a deliberately
         | "noncompliant" encoding/decoding system that doesn't replace
         | unknown characters with replacement characters. Fortunately,
         | compliant systems are becoming more and more popular and
         | available. Unfortunately, that can make file name handling
         | _harder_ than when you had a non-Unicode-compliant handling
         | system for your strings.
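
A sketch of the trade-off jerf describes: keep the raw bytes for
opening the file and decode only for display. `display_name` and the
sample bytes are hypothetical.

```python
def display_name(raw_name: bytes) -> str:
    """Decode for display only; undecodable bytes become U+FFFD."""
    return raw_name.decode("utf-8", errors="replace")


raw = b"Miko\xb3aj"              # hypothetical non-UTF-8 file name
shown = display_name(raw)        # safe to show to a user
reencoded = shown.encode("utf-8")

print(ascii(shown))      # 'Miko\ufffdaj'
print(reencoded == raw)  # False: the original byte 0xB3 is lost
```

Because the displayed form cannot be turned back into the original
bytes, the bytes themselves have to be what is passed to the filesystem.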
        
           | nyanpasu64 wrote:
           | Rust uses WTF-8 on Windows for OsStr[ing] and Path[Buf]. It's
           | zero-overhead to cast from &str to &OsStr/&Path to &[u8]
           | (though converting WTF-8 to UTF-16 costs an extra operation
           | when performing a Win32 function call). However this doesn't
           | solve the inability to round-trip "possibly-valid UTF-8/16"
           | to "Unicode text" and back (though Python's surrogateescape
           | might be one viable approach).
           | 
           | Other libraries handle this even worse than Rust. On Linux
           | (filenames are bytes), Qt is unable to open files with
           | invalid UTF-8 names, while GTK can open them (but shows an
           | "invalid encoding" message instead of the original filename),
           | which I think is a good-enough approach.
        
         | garaetjjte wrote:
         | Part of the problem is legacy Windows cruft. For a long time,
         | to handle Unicode characters properly you needed to explicitly
         | use the wide-char UTF-16 functions. The legacy narrow encoding
         | is a system-wide setting that couldn't be set to UTF-8, so
         | only a subset of characters would be represented correctly.
         | Only recently did they introduce the ability to set an
         | application's narrow encoding to UTF-8 with setlocale, which
         | is a lot saner.
        
         | mkotowski wrote:
         | In the case of home-grown code, it could simply be a question
         | of programmer awareness. There are still many outdated and/or
         | unfinished tutorials that use WinAPI without any concern for
         | enabling Unicode and wide-char support.
         | 
         | If we are talking about ready game engines like Unity and
         | Unreal... it is probably a naive assumption about input being 1
         | byte wide and things getting lost because of that in some
         | gamedev-made script.
        
         | jan_Inkepa wrote:
         | I've been bitten on a few small releases by forgetting that C#
         | localises number->string conversion by default (which makes
         | sense. But if you forget, and you're writing floats to csv
         | files and the decimal points become decimal commas....).
        
         | breakingcups wrote:
         | It's also a common thing that Silent (aka CookiePLMonster)
         | fixes in the games he patches.
         | 
         | See for example: -
         | https://cookieplmonster.github.io/2020/05/23/silentpatch-maf...
         | - https://cookieplmonster.github.io/2021/02/27/silentpatch-
         | yak...
        
       | amarshall wrote:
       | For a list of strings that often cause problems to, e.g., add to
       | a test suite, see https://github.com/minimaxir/big-list-of-
       | naughty-strings
        
         | tomaslaureano wrote:
         | Great resource! I usually use pangrams (holoalphabetic
         | sentences like "The quick brown fox jumps over the lazy dog")
         | to ensure that my code can handle all the alphabet characters
         | for the languages that should be supported at the very minimum.
        
         | munk-a wrote:
         | It's also important to width-test fields. Never forget to make
         | sure that WWWWWWWWWWWW doesn't cause weird application
         | wrapping.
        
           | aidenn0 wrote:
           | I used a system where the maximum length on the "new
           | password" field in the change password form was longer than
           | the password field in the login form.
           | 
           | The symptom was that I could login if I used my password
           | manager browser plugin, but not if I pasted it from my
           | password manager.
        
             | kevinmgranger wrote:
             | You're lucky they weren't different lengths in the backend.
             | I've been bitten by that surprise one too many times (which
             | is any number higher than zero)
        
               | aidenn0 wrote:
               | The most ridiculous thing is the UI for setting the
               | password even said "X-Y characters long, must include at
               | least one..." but the login page could not support Y
               | characters.
        
             | pferde wrote:
             | I have seen a windows app with a text field whose max
             | character count was somehow determined by system font size
             | - probably a crude way to make sure the entered text fits
             | the hard-coded field size.
             | 
             | The problem was that this field was used to enter a
             | 10-digit code, and as it turns out, on default Windows10
             | system, the fonts are set up so that this field only fit 8
             | of them. Oops! :)
        
               | munk-a wrote:
               | I'd like to see how that App would work with me sitting
               | here fonts cranked up to 175%. I've never heard of a
               | setup like that though - it sounds like it'd be
               | surprisingly intricate to actually configure.
        
             | munk-a wrote:
             | I maintained a system where we had unbounded password
             | length... but only respected the first six characters of
             | the password. (we did fix that).
        
           | amarshall wrote:
           | Related (we do this at my work):
           | https://en.wikipedia.org/wiki/Pseudolocalization
        
         | OskarS wrote:
         | An enormously useful list, I've used it several times, and it
         | can often dig up some real nastiness if you haven't been super
         | careful.
         | 
         | This entry, by the way, is a fantastic little easter egg in the
         | list: https://github.com/minimaxir/big-list-of-naughty-
         | strings/blo...
        
           | vertis wrote:
           | No, seriously, wake up
        
             | [deleted]
        
         | ryanianian wrote:
         | Very handy. My previous simple test-case was simply a selection
         | from this well-known text-file which is simply a collection of
         | somewhat uncommon unicode characters, usually used for
         | rendering tests.
         | 
         | https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
         | 
         | But this set of strings is specifically designed to cause edge-
         | case errors.
         | 
         | Also don't forget Spolsky's seminal "The Absolute Minimum Every
         | Software Developer Absolutely, Positively Must Know About
         | Unicode and Character Sets (No Excuses!)".
         | 
         | https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
        
       | spicybright wrote:
       | So frustrating how this still happens. It's too Latin-centric.
        
       | mikasjp wrote:
       | I think the whole problem is keeping the character encoding
       | consistent in the applications and their dependencies.
       | Programmers often forget this because they avoid non-ASCII
       | characters in their code.
        
       | mkotowski wrote:
       | I, too, have the letter Ł in my name, and yes, it is a sick
       | joke that so many things, even in supposedly modern systems,
       | assume that the world runs on ASCII.
       | 
       | In the case of the Windows operating system, the worst fact is
       | that every single part of it behaves differently. Some parts
       | display the path with a wrong encoding, but handle it correctly.
       | A third-party app can display it correctly, but fails while
       | trying to access any file. From what I remember, even the built-
       | in PATH variable editor/manager goes through some arcane steps to
       | display the letters in a wrong way, but getting them to work
       | _sometimes_.
       | 
       | I can only imagine how much more painful it is for someone
       | using any of the less widely used writing systems or those with
       | more advanced features compared to ASCII (Hebrew's RTL, Arabic
       | script's medial and final forms, et cetera).
        
         | gerdesj wrote:
         | Can Ł have an alternative representation? For example the
         | German ß => ss. Also I think ö can be written as oe.
         | 
         | In English we simply shake the big bag of letters, pick a few
         | at random and then throw them at the page until a few stick.
        
           | q3k wrote:
           | > Can Ł have an alternative representation?
           | 
           | Nope. Neither can ż, ć, ś, ą or ę. You can, and people do
           | write them as z, c, s, a and e when writing in a restricted
           | character set, but that is not 'correct' and is not a
           | bijection, i.e. „półka” and „polka” mean two different
           | things.
           | 
           | There's also the case of technically-same-sounding-
           | especially-recently ż/rz and ó/u (whose replacement would let
           | you get rid of two 'non standard' characters), but for
           | historical reasons these are not interchangeable.
        
             | gerdesj wrote:
             | I do find this sort of stuff fascinating and also faintly
             | frustrating but of course my mother tongue is (in)famous
             | for being a bit loose at first sight.
             | 
             | According to one of my employees (Polish) Ł sounds roughly
             | like w as in win or water but not as in what. A quick read
             | of this: https://en.wikipedia.org/wiki/%C5%81 doesn't help
             | too much.
             | 
             | Does enforcing Ł instead of say w cause your written
             | language to fail in some way? I don't want to cause
             | offense, I want to understand the causes of difference.
        
               | q3k wrote:
               | 'W' in Polish is already used, but for a different sound
               | - it's pronounced like the English 'v'. 'V' in turn is
               | not present in the Polish alphabet (in the sense of it not
               | being present in words of Polish origin).
               | 
               | If you wanna change that, you might as well change the
               | entire writing system of the language, eg. to be more in
               | line with some other, more common writing system (ie.
               | other latin alphabets or the cyrillic alphabet which
               | would probably make the most sense phonetically). But no-
               | one's gonna go for that any time soon.
        
               | gerdesj wrote:
               | "If you wanna change".
               | 
               | I think we have found the disconnect: you quite happily
               | use a word like "wanna" which is nonsense in English. It's
               | allowed because it is understandable. Wanna is "want to".
               | 
               | Ooh, "gonna": That'll be "going to".
               | 
               | What's gonna to you is l bar for me or vice versa or
               | something 8)
        
       | xwdv wrote:
       | What's wrong with just writing it as Mikolaj? It's not like it's
       | a kanji or something.
        
         | sophacles wrote:
         | Because that's not their name?
        
         | dahfizz wrote:
         | Their URL is even mikolaj-kaminski.com . I get it's annoying,
         | but I would never use non-ascii chars in a username / file
         | path.
        
           | jerf wrote:
           | So, what does 阿部明仁 do in this case?
           | 
           | Polish may be close enough that an approximation is available
           | in English, but there's an awful lot of languages that don't
           | have a large overlap with English characters.
           | 
           | In the Asian case above, if someone with that name did try to
           | "convert to English" they are ironically just as likely to
           | end up with Akihito Abe as the ASCII, which will be just as
           | broken!
        
             | numpad0 wrote:
             | Assuming that hypothetical guy is an average Japanese
             | male (somewhat leaning right), he'd just turn IME off.
             | Japanese input on desktop consists of the following three
             | states:
             | 
             | - IME On state. IME captures and interprets keypresses as
             | engraved and generates corresponding Kana-Kanji text.
             | 
             | - IME Off state. IME passes through keypresses as engraved
             | on keytops.
             | 
             | - Direct Input state. IME becomes dormant.
             | 
             | In IME Off state, the keyboard behaves as a plain jp106(or
             | ANSI if it is) keyboard, like I'm doing right now. The
             | cases where you would use conversion with IME on for an
             | English word is when you have reasons for the word to be in
             | "full width"(usually for typesetting reasons).
        
               | jerf wrote:
               | I don't think it's something that people should 'just
               | know' that when Windows asks them their name during
               | install time, they _ought_ to use 7-bit clean ASCII for
               | everything, no matter where they are in the world or how
               | much they know about other languages. When Windows says
               | "What is your name?", they ought to be able to _use_
               | their name without things breaking.
               | 
               | I'm sure a computer savvy speaker of a fully-non-Latin
               | language may still guess this is a good idea, but
               | "computer savvy" doesn't cover everyone... and they
               | shouldn't _have_ to.
               | 
               | "Just use 7-bit-clean ASCII English" is not a solution to
               | this problem.
        
             | dahfizz wrote:
             | They could use a different name as their windows name (Do
             | people use their real names as their usernames? I never
             | do). Or, they would have to go through the pain of finding
             | a real solution, like the author did.
             | 
             | Considering JetBrains seems unwilling to fix this bug,
             | maybe the best solution of all is to switch to an IDE that
             | works.
        
           | Kye wrote:
           | "You're holding it wrong"
           | 
           | The problem is the technology, not the user using it in a
           | reasonable way. ł is older than computers and the only reason
           | computers struggle with it is lack of foresight or choosing
           | to make things harder for most of the world by some of the
           | people involved early on.
        
             | dahfizz wrote:
             | Obviously the IDE is at fault here. Rider has a bug with
             | Unicode.
             | 
             | BUT, there is an easy workaround to avoid all Unicode
             | related bugs: don't use Unicode. If that's morally
             | objectionable for you, then you can keep fighting this
             | fight.
        
               | bivargen wrote:
               | Avoiding unicode, or anything but 7-bit ASCII, is like
               | chiseling text into a stone instead of using pen and
               | paper because the pen might break. Fix the pen! Or
               | replace it with a computer (and we're back full circle)!
               | 
               | Avoiding it is not morally objectionable, it's just
               | stupid.
        
               | tremon wrote:
               | I think it's reasonable to find that morally
               | objectionable: English is the only language* that can be
               | fully represented in ASCII, so pretending that ASCII is
               | all you need excludes a large part of the world.
               | 
               | * yes, by and large. Many languages make do, but even the
               | European languages that use the same script as English
               | cannot be fully represented:
               | 
               | - Pretty much all mainland European languages use accents
               | (simple example, in Spanish el and él are different
               | words)
               | 
               | - French misses ç
               | 
               | - German/Swiss/Austrian misses ß
               | 
               | - Spanish misses ñ
               | 
               | - Dutch misses ĳ
        
               | InitialLastName wrote:
               | It's naive of you to maintain the facade that English can
               | be fully represented in ASCII. We've just had longer than
               | other languages to adapt to that particular encoding
               | technology, and the good luck to have a code set built to
               | represent our language become the lingua franca of
               | computer technology.
        
               | Symbiote wrote:
               | Not even Britain and Ireland can manage with ASCII: they
               | need £ and €.
               | 
               | I agree with you, and disagree strongly with dahfizz, who
               | is essentially telling people their name and language are
               | unacceptable.
        
               | Muromec wrote:
               | Cyrillic-writing countries miss all of their alphabets
               | and so does Greek.
        
         | ludamad wrote:
         | For the record, it's a stark pronunciation difference as ł has
         | drifted to a very different "w" sound
        
           | MadeThisToReply wrote:
           | Yep. For example, the name of the third-largest city in
           | Poland is "Łódź", which might look like it's pronounced
           | "lods", but is actually pronounced more like "wootch".
        
             | garaetjjte wrote:
             | Sometimes you end up with parcel addressed to city "??d?".
             | Shipping systems cannot cope with non-ASCII chars more
             | often than I would expect...
        
               | greenshackle2 wrote:
               | I've seen shipping labels with HTML encoded characters,
               | like &eacute; and &egrave;. I'm not sure if that's better
               | or worse:
               | 
               | &Lstrok;&oacute;d&zacute;
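
Both label failures described above are easy to reproduce. The following is a minimal Python sketch, assuming the shipping system either forces text into ASCII or HTML-escapes it; the variable names are illustrative and not taken from any real shipping software.

```python
import html

# "Łódź", written with escapes so the source file itself stays ASCII.
city = "\u0141\u00f3d\u017a"

# Forcing the name into ASCII with replacement yields the "??d?" label.
ascii_label = city.encode("ascii", errors="replace").decode("ascii")

# Escaping instead of replacing keeps the information, but the label
# then shows numeric character references rather than letters.
escaped_label = city.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# Named HTML entities such as the ones quoted above decode back to the name.
decoded = html.unescape("&Lstrok;&oacute;d&zacute;")

print(ascii_label)
print(escaped_label)
print(decoded == city)
```

The escaped form is recoverable and the question-mark form is not, which is one argument for "better" in the comparison above.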
        
         | ssivark wrote:
         | That's about as aggravating as asking Ryan to change name to
         | Pyan -- because the encoding doesn't support "R" and "P" looks
         | very similar.
        
         | no_time wrote:
         | Because it's not his name. Imagine you are John but you had to
         | make do with Yohn because the people designing your software
         | didn't need the letter J...
        
           | kmlx wrote:
           | it was 30 years ago when i discovered that it doesn't really
           | matter what my name is. the system i'm interacting with
           | expects my name to be "john" or something like that. so i let
           | it be.
           | 
           | 30 years later and i completely dropped all non-latin chars
           | from my name in any and all forms. from airplane tickets to
           | passport to you name it.
           | 
           | and you know what? no one cared about non-latin. not even the
           | government. i loled when i actually realised.
           | 
           | i've encountered zero issues ever since.
           | 
           | and it's been the same for lots of my friends. they just
           | adopted some western name. case closed, no more issues.
           | 
           | it all depends on how much importance you attribute to your
           | name. for me it's always been a random variable. for others
           | it's a matter of pride. but to the "system" it will be a
           | "random list of chars", sometimes latin, other times utf.
        
           | zanderwohl wrote:
           | It's not strange to localize your name. In ASL for example,
           | you could sign your English name letter-by-letter, but it's
           | much more common to have a totally new sign for your name -
           | usually a word combined with the first letter of your name.
           | Taking part in a different system often means taking on
           | another name.
        
             | q3k wrote:
             | It seems that you're implying computers are universally
             | american and therefore people are expected to
             | speak/use/adapt to american.
        
               | thereddaikon wrote:
               | That's the harsh way to put it. A more diplomatic way is
               | that computing is not unique in having deeply ingrained
               | artifacts of the language and culture that birthed it and
               | developed many of the paradigms.
               | 
               | Take anything having to do with seamanship. There are
               | many terms that date back to early modern English that
               | simply don't make sense anymore yet are accepted and
               | universal because the British Empire had a large and
               | enduring influence on maritime matters and happened to be
               | at the forefront of most modern developments until about
               | 70 years ago.
               | 
               | In some cases this is actually built into laws and
               | industry practice. Pilots speak English. That's the
               | rules. Don't like it? Invent the time machine and beat
               | Wilbur and Orville. For much the same reason, science
               | speaks Latin.
               | 
               | This technical debt is difficult if not impossible to
               | overcome, especially in regards to computers because we
               | still haven't cracked general purpose AI. Software will
               | only accommodate what it was written to accommodate.
               | 
               | Recognizing the problem and working to fix it is all well
               | and good. But it's wise to understand that this won't be
               | solved any time soon so in the meantime it is pragmatic
               | to operate in such a way to maximize compatibility.
               | 
               | After all, I still have to call it a Foc'sle even if I
               | think that's dumb or isn't inclusive of my culture.
        
               | xxpor wrote:
               | There's also the practical consideration that English,
               | due to having a) an alphabet b) letter shapes that aren't
               | affected by surrounding letters and c) no diacritics, is
               | the easiest major language to store and display on a
               | computer. Even if silicon valley ended up in a country
               | with a logographic writing system, I'd bet that the first
               | character set that would have been used would have been
               | Latin based
        
               | [deleted]
        
           | AdrianB1 wrote:
           | My name contains non-Latin characters (apparently strange as
           | we use a Latin language), but in 40 years of working with
           | computers I learned to avoid using the original form and
           | always convert to ASCII; yes, it is not my name, but my pride
           | and sense of entitlement are not hurt at all.
           | 
           | Sometimes it is better to avoid being hit by the bus even if
           | you are right.
        
         | wbsss4412 wrote:
         | So the solution is for the user to change their entire windows
         | account name, rather than handling common characters in your
         | code?
        
           | toast0 wrote:
           | For a user, changing their account (probably creating a new
           | user, since rename apparently doesn't change the directory),
           | is something they can do.
           | 
           | Changing all software to respect their perfectly valid name
           | isn't something they can do.
           | 
           | They shouldn't need to change their name, but if they do,
           | they can ignore all the broken software and go about their
           | day.
           | 
           | This particular user is more capable than most, and found a
           | workaround for this particular problem, which is good... But
           | this is not likely to be the last of the problems.
        
           | dahfizz wrote:
           | Of course it would be better if all code was bug free. But
           | that's impossible. As a user, avoiding unicode is a pretty
           | easy way to avoid bugs like this - it's the rational thing to
           | do.
        
           | Jensson wrote:
           | When you have non-standard characters in your name you
           | quickly learn to never use them in computers since even
           | though most systems work fine, some don't. And you can't fix
           | all the thousands of systems your name has to interact with.
           | 
           | I even had trouble booking flight tickets since their
           | security system couldn't parse my name, and then had to go
           | through some special security check due to it returning
           | errors. After that, never again. Not sure how they managed to
           | do it but they had some basic rules that they used to say "no
           | real name can look like this, this is a fake person!" and
           | just kicked it out.
        
             | wbsss4412 wrote:
             | I totally understand what you're saying, but it's also a
             | sad state of affairs when we can't handle "non standard
             | characters".
             | 
             | Standard characters (ie english) are only used by a small
             | subset (maybe 5-10%) of the global population.
        
             | yuliyp wrote:
             | They're not non-standard characters. They're just as much a
             | part of the Polish alphabet as 'a' and 'b' are.
        
               | Jensson wrote:
               | That is exactly what I meant. My name doesn't have non-
               | standard characters either from the perspective of my
               | home country, it is just normal letters in the alphabet,
               | but not in the English alphabet.
        
             | q3k wrote:
             | > When you have non-standard characters in your name
             | 
             | 'standard' by what measure? Ł is more standard than X or Q
             | in the polish alphabet.
             | 
             | ~ Sincerely, a person whose name contains „ń” and
             | therefore had to deal with this bullshit his entire life.
        
               | Jensson wrote:
               | From a programmer's perspective. The characters in my name
               | are standard where I come from, but they are not standard
               | to the international air travel security systems likely
               | developed by Americans.
               | 
               | Edit: You know how aircraft travel security always
               | transforms your name into letters from the English
               | alphabet to parse? Yeah, it transformed my name and then
               | the resulting string looked so bad that the system
               | rejected that. The original name doesn't look bad, but
               | after transformations it did...
        
         | miloignis wrote:
         | From the article:
         | 
         | The first idea was to change the username to one that does not
         | contain Polish characters. It turned out that Windows does not
         | rename the user's folder when changing the username. Manually
         | renaming the folder was not an option. This way I could corrupt
         | my profile in the system.
         | 
         | The end of the article is about how to change the directory
         | where the temporary files go to one not under the user folder.
        
       | jasonpeacock wrote:
       | And yet it's one of the simplest things to add non-ASCII chars to
       | your tests to validate their handling.
       | 
       | It's like not testing if your calculator application can handle
       | negative numbers or decimals.
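
A test along these lines can be very small. The following is a hedged Python sketch, not the test suite of any product discussed here: `roundtrip_under` and `SAMPLE_NAMES` are hypothetical names, and the sample user names are only illustrative.

```python
import tempfile
from pathlib import Path


def roundtrip_under(username: str) -> str:
    """Write and read back a file below a directory named after `username`."""
    with tempfile.TemporaryDirectory() as root:
        home = Path(root) / username
        home.mkdir()
        target = home / "settings.txt"
        target.write_text("ok", encoding="utf-8")
        return target.read_text(encoding="utf-8")


# A plain name, a name with a Polish letter, an apostrophe, and a path
# component with spaces and parentheses.
SAMPLE_NAMES = ["john", "Miko\u0142aj", "d'Artagnan", "Program Files (x86)"]

for name in SAMPLE_NAMES:
    assert roundtrip_under(name) == "ok", name
```

Pure-Python file I/O like this usually passes. The failures discussed in this thread tend to appear when the path is handed to another process or a native API, so a fuller test would also launch the real toolchain from such a directory.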
        
         | nradov wrote:
         | In fact it's trivial to generate a text file of all valid
         | Unicode code points and use that as input to unit tests.
        
           | yakubin wrote:
           | It may be faster to generate them on the fly. Iterating over
           | ranges of integers is a lot faster than reading files from
           | disk.
        
           | Someone wrote:
           | I would have to do research on whether the list of valid code
           | points depends on the Unicode version. For example, can
           | regional indicator code points
           | (https://en.wikipedia.org/wiki/Regional_indicator_symbol)
           | appear in isolation? If not, is that different in Unicode <
           | 6, where those code points weren't assigned yet?
           | 
           | Similarly, what about tags
           | (https://en.wikipedia.org/wiki/Tags_(Unicode_block) )? Do
           | these _require_ an U+E007F CANCEL TAG?
           | 
           | The 66 noncharacters certainly need consideration.
           | http://www.unicode.org/faq/private_use.html says:
           | 
           |  _"Because of this complicated history and confusing changes
           | of wording in the standard over the years regarding what are
           | now known as noncharacters, there is still considerable
           | disagreement about their use and whether they should be
           | considered "illegal" or "invalid" in various contexts"_
           | 
           | Edit: also, testing all code points likely is overkill and
           | using code points in isolation likely isn't enough. Most
           | tests are better off with something like the big list of
           | naughty strings (https://github.com/minimaxir/big-list-of-
           | naughty-strings)
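
Generating code points on the fly, as suggested above, can be sketched in a few lines of Python. This is a sketch under stated assumptions: it walks every Unicode scalar value (every code point except the surrogates), not only assigned characters, and the helper names are made up for illustration. Whether the 66 noncharacters belong in a given test is left to the caller, in line with the caveat quoted above.

```python
def unicode_scalar_values():
    """Yield every code point except the surrogate range U+D800..U+DFFF."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        yield cp


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


total = 0
noncharacters = 0
for cp in unicode_scalar_values():
    ch = chr(cp)
    # Each scalar value should survive a UTF-8 round trip unchanged.
    assert ch.encode("utf-8").decode("utf-8") == ch
    total += 1
    noncharacters += is_noncharacter(cp)

print(total, noncharacters)
```

Single code points exercise the encoder, but not combining sequences or paired symbols, so this complements a curated list of difficult strings rather than replacing it.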
        
       | mrweasel wrote:
       | It's a pretty good test case. Similarly we found a number of bugs
       | in a Django application and path handling, because I happened to
       | be using Windows for six months, while the rest of the team was
       | on Linux and Mac.
        
       | umvi wrote:
       | Using non-ascii characters in file paths, toolchain config files,
       | and other non-display contexts is just asking for trouble, even
       | if it is your name...
        
         | fluxem wrote:
         | Also spaces. I spent half an hour debugging why cmake cuda
         | build was failing.
        
           | munk-a wrote:
           | A lack of support for spaces at this point is unacceptable.
           | I, personally, despise spaces in paths but on windows a whole
           | bunch of default system paths already have spaces embedded in
           | them in major ways... and let's not forget parens as well -
           | thanks "Program Files (x86)"
        
         | bbarnett wrote:
         | This wouldn't have happened if using rust!
        
           | burnished wrote:
           | Some of the other attempts are a little subtle, this one is a
           | pretty blatant attempt to rile up the folks that are already
           | angry about rust for whatever reason. Please stop.
        
           | nightfly wrote:
           | Can you knock it off??? This is even more annoying than
           | out-of-place rust evangelism
        
         | jasonpeacock wrote:
         | This is the modern, post-ASCII computing world, we should no
         | longer be willing to settle for the lowest-common-denominator
         | of ASCII-only strings.
         | 
         | There's no excuse for actively supported, _paid_ products to
         | have these problems today.
        
           | amenod wrote:
           | True. But these actively supported, paid products build upon
           | layers and layers of no-longer-supported, free/opensource
           | products. Good luck fixing them.
           | 
           | Not saying that this is OK, just explaining why using non-
           | ascii characters, in this day and age, is still asking for
           | trouble.
        
             | SAI_Peregrinus wrote:
             | This is on the Windows version.
             | 
             | Windows 2000 is when the OS changed to UTF-16 by default.
             | Before that Windows NT was UCS-2, IIRC only the DOS-based
             | Windows versions were Windows-1252 internally, starting
             | from Windows 1.0. So while ł wasn't supported in Windows 1,
             | characters like ñ were. Windows has literally NEVER been an
             | ASCII-based OS.
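
The code page claim can be checked with Python's built-in codecs. This is a small sketch; `fits` is a hypothetical helper, and the Central European code page cp1250 is added here only for comparison, it is not mentioned in the comment above.

```python
def fits(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded losslessly in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


n_tilde = "\u00f1"   # ñ
l_stroke = "\u0142"  # ł

print(fits(n_tilde, "cp1252"))      # Western European code page
print(fits(l_stroke, "cp1252"))
print(fits(l_stroke, "cp1250"))     # Central European code page
print(fits(l_stroke, "utf-16-le"))  # what the wide-character APIs use
```

Any code path that squeezes a wide-character path through the Western European code page therefore loses ł, even though the OS itself stores it without trouble.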
        
               | horsawlarway wrote:
               | Sure, but having used a lot of the windows system apis
               | (admittedly - a lot of years ago) it was a complete
               | hodgepodge of which api would take a char vs a wchar, and
               | then they tried to hide the whole thing behind tchar,
               | which just made it even harder to keep track of.
               | 
               | Basically - I agree: This shouldn't be a problem, and 7
               | months is a long time to wait for a basic fix. But there
               | are a lot of footguns hanging around in windows code with
               | respect to character encodings.
               | 
               | Just looking at the first result on google for "c++ get
               | windows home directory" shows this:
               | https://docs.microsoft.com/en-
               | us/windows/win32/api/userenv/n...
               | 
               | Which takes a long pointer to tchar string (LPTSTR) - so
               | this behavior is dependent on the unicode settings of the
               | project at compile time, even today.
        
               | david_allison wrote:
               | > Windows 2000 is when the OS changed to UTF-16 by
               | default.
               | 
               | Paths are UTF-16 + unpaired surrogates, so a Windows path
               | isn't legally representable in UTF-8.
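
The unpaired-surrogate case can be illustrated in Python, whose `str` type also tolerates lone surrogates. A minimal sketch follows; the path is invented, and `utf8_strict` is a hypothetical helper.

```python
from typing import Optional

# A made-up path whose last character is an unpaired high surrogate.
odd_path = "C:\\Users\\\ud800"


def utf8_strict(text: str) -> Optional[bytes]:
    """Encode as standard UTF-8, or return None if that is impossible."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


strict = utf8_strict(odd_path)

# The "surrogatepass" handler emits the three-byte form anyway. The result
# is not valid UTF-8, but it round-trips with the same handler.
lenient = odd_path.encode("utf-8", errors="surrogatepass")
restored = lenient.decode("utf-8", errors="surrogatepass")

print(strict)
print(lenient)
print(restored == odd_path)
```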
        
           | ainar-g wrote:
           | _Especially_ if those products are developed by a company
           | from Russia, where Cyrillic is used. For me, a Russian
           | myself, this situation is honestly ridiculous.
        
             | zczc wrote:
             | Russian companies generally have ascii-only username
             | policies
        
             | mbesto wrote:
             | Do you write "if" statements in Cyrillic when you write in
             | <insert Python/Ruby/Java/.NET/whatever>?
        
               | pavel_lishin wrote:
               | It would be very amusing to see "esli" in an if
               | statement, given how much it looks and sounds like "else"
               | at a brief glance.
        
               | GoblinSlayer wrote:
               | I thought ArnoldC was just a couple of #define's, but
               | looks like it isn't.
        
               | nine_k wrote:
               | No. Keywords are ASCII everywhere (no, APL's are not
               | words). Mixing English in keywords and non-English in
               | identifiers feels odd.
               | 
               | Algol-68 supported localized sets of keywords;
               | fortunately this language is gone.
               | 
               | You can #define non-ASCII stuff in modern C++. It's your
               | best chance to "localize" a mainstream language.
               | 
               | Same would work for Clojure, but Lisp uses a lot of
               | quirky abbreviations like `cdr` or `setq` that give
               | awkward translations.
        
         | gumby wrote:
         | This is blaming the victim
        
         | BiteCode_dev wrote:
         | Unfortunately, it's true, most toolchains are stuck in the
         | past, and don't deal with non-ascii characters or even spaces
         | very well. In fact, I just learned that spaces in .desktop files
         | values could cause trouble after a long debugging.
         | 
         | But it's a shame.
         | 
         | In Europe, we do have a lot of non-ascii characters everywhere.
         | Ubuntu puts a "Vidéo" and a "Téléchargements" directory in my
         | $HOME because I'm french. If I were to use my name as my
         | username I would have even more troubles.
         | 
         | I'm careful with not using special chars in names for work, but
         | it feels like I'm a girl trying to not dress sexy in the wrong
         | part of town: necessary, but I shouldn't have to do this, and
         | it's definitely the others to blame.
         | 
         | All in all, I thank the Gods of encoding for Python 3 unicode
         | handling. Having a scripting language that does the right thing
         | out of the box is wonderful on this side of the pond.
        
           | mjevans wrote:
           | "The right thing" for filesystem entries is transparently
           | copy, do not evaluate. A file path is a mem-copied, length
           | value sized block of identifier you don't ever mangle. If you
           | must mangle it, touch only the necessary areas as directed.
           | (E.G. join with os.sep and do not normalize anything).
           | 
           | Want to offer Unicode validation? Sure having that as an
           | OPTION is fine. Forcing it means I can't rely on that tool to
           | handle real world data which happens to not be valid but is
           | still a valid file-system address.
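
Python's own answer to the "copy, do not evaluate" requirement is the `surrogateescape` error handler, which `os.fsdecode` and `os.fsencode` use on POSIX systems. The sketch below applies the handler explicitly so that it behaves the same on every platform; the file name is invented for illustration.

```python
# Bytes that are not valid UTF-8 but are a legal file name on most
# POSIX file systems.
raw_name = b"report-\xff\xfe.txt"

# Strict decoding refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles each undecodable byte through as a lone
# surrogate (U+DC80..U+DCFF), so encoding restores the exact bytes.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
restored = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok)
print(restored == raw_name)
```

The string in `as_text` is safe to join and pass back to the file system, but printing or logging it with a strict encoder can still fail, so validation stays an option rather than a precondition.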
        
           | GoblinSlayer wrote:
           | No seriously, create a user d'Artagnan.
        
       | simonblack wrote:
       | Isn't this one of those "100 things Programmers don't know about
       | People's Names" things?
       | 
       | Like the poor, it will be with us always.
        
         | xdfgh1112 wrote:
         | I don't know, it's just a Unicode character? Not even a newer
         | one, it's just 2 utf8 bytes. Pretty much everything should
         | support that in 2021.
         | 
         | When I think of 100 things I think of stuff like "some people
         | spell their name in all lowercase and get really funny if you
         | change it"
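
The "2 utf8 bytes" point, and what goes wrong when those bytes are read with the wrong code page, can be shown in Python. The name is taken from the article's URL, and the choice of Windows-1252 as the wrong decoder is an assumption for illustration.

```python
name = "Miko\u0142aj"  # "Mikołaj"

# ł (U+0142) takes two bytes in UTF-8.
l_bytes = "\u0142".encode("utf-8")

# Decoding the UTF-8 bytes as Windows-1252 turns the two bytes of ł
# into two unrelated characters, the familiar mojibake.
mojibake = name.encode("utf-8").decode("cp1252")

print(l_bytes)
print(mojibake)
```

Supporting the character is rarely the hard part. The failures come from two components disagreeing about which encoding the bytes are in.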
        
           | numpad0 wrote:
           | Yeah so double byte characters cost extra. I don't know, a
           | checkbox or something default off. Always did still does.
           | Double width costs even more.
        
             | horsawlarway wrote:
             | you're getting downvoted, but between tchar hiding wchar vs
             | char... this literally could be someone toggling off the
             | "UNICODE" checkbox in visual studio somewhere.
        
           | hprotagonist wrote:
           | windows probably defaults to latin-1
        
             | bryanrasmussen wrote:
             | the default windows encoding is UTF-16, a long time ago it
             | was Windows-1252 https://en.wikipedia.org/wiki/Windows-1252
        
               | hprotagonist wrote:
               | or CP-1251, in some locations.
        
       | f311a wrote:
       | That's a pretty common problem, especially for cyrillic names.
       | People just use ASCII names.
        
       | souptonuts wrote:
       | Idk changing your stupid fucking name could be a fix too
        
       | Dannymetconan wrote:
       | I can very much relate to this but also have very little sympathy
       | here.
       | 
       | I have a special character in my name, an apostrophe, and it
       | causes trouble regularly online and with tooling. A number of
       | years ago I decided just to never use it when it came to anything
       | to do with technical work be it email, logins or usernames.
       | 
       | Unicode characters are a pain to deal with and I have suffered
       | from it first hand trying to handle it. At the end of the day it
       | is much easier just to not use the special characters and move on
       | with your life rather than be battling the constant frustration.
       | 
       | I'm sure these tools have lots of issues open, and you would be
       | surprised at the amount of time, effort and testing that would
       | be required to provide full Unicode support. Most people would
       | see it as a very small positive and not worth the effort. I find
       | it hard to disagree.
        
         | vultour wrote:
         | I'm really surprised someone technically minded thought it's a
         | good idea to put a non ASCII character in their username. I'd
         | never do that.
        
           | ctdonath wrote:
           | I'm really surprised someone technically minded thought it's
           | a good idea to not allow non ASCII alphanumerics in a
           | username.
           | 
           | Unicode has been a thing since 1988. Names have included non
           | a-z characters since forever.
        
         | jltsiren wrote:
         | My legal last name is "Sirén". When I was younger, I almost
         | always used "Siren", because it was easier to type. Then, ~15
         | years ago, I started noticing that American websites sometimes
         | rejected it, because they considered it inappropriate.
         | Sometimes "Sirén" would work, sometimes it worked but caused
         | minor annoyances, and sometimes it would not work for technical
         | reasons.
         | 
         | Both versions work most of the time these days, but I still run
         | into trouble once in a while no matter which name I use.
        
           | 10000truths wrote:
           | Why would Siren be an inappropriate name?
        
             | lostgame wrote:
             | Someone who I know has the last name 'Island' and was
             | unable to sign up for Facebook forever because they thought
             | it was a fake last name.
             | 
             | Maybe 'Siren' is similar. It's a pre-existing word that
             | perhaps flags some sort of weird edge case.
        
       | pledess wrote:
       | The article offers a solution of
       | idea.system.path=${root.dir}/JetBrains/Rider/system but doesn't
       | mention the C:\JetBrains directory permissions. Directory
       | permissions under %LOCALAPPDATA% (the location that works for
       | people without a Polish character) should restrict write access
       | to one user. With the Windows default behavior, creating
       | C:\JetBrains would inherit permissions from C:\ - and wouldn't
       | restrict write access to one user. Maybe 99% of the time this is
       | irrelevant (i.e., there's no realistic threat from malicious
       | actors who control unprivileged user accounts on your own
       | development machine). Still, it's a potential downside of the
       | solution, and more motivation for the vendor to fix their code so
       | that Polish characters can be used under %LOCALAPPDATA%.
        
         | Kwpolska wrote:
         | If you are on a multi-user system, the path "C:\JetBrains"
         | isn't really ideal (what if other users also need Rider and
         | have non-ASCII usernames?). That said, you can easily change
         | file permissions on Windows if the default ones don't work for
         | you.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-10-20 23:00 UTC)