[HN Gopher] I couldn't debug the code because of my name
___________________________________________________________________
I couldn't debug the code because of my name
Author : mikasjp
Score : 158 points
Date : 2021-10-18 08:40 UTC (2 days ago)
(HTM) web link (mikolaj-kaminski.com)
(TXT) w3m dump (mikolaj-kaminski.com)
| m_kos wrote:
| Isn't it bizarre that we have self-driving cars, the ISS, and
| phones with 50 megapixel cameras but still struggle with
| character encoding?
| tetha wrote:
| Character encoding is in a special class of problems. Like time
| handling.
|
| If you pick up a halfway non-ancient framework in a somewhat
| common language with a somewhat non-terrible persistence like
| postgres, you just don't have problems. Just don't care, and it
| just works.
|
| But it's super easy to derail that fragile correctness with
| something like MySQL's utf8-ish handling, or some OS's path
| handling, or 'efficiency', or a user or frontend dev submitting
| data in a wrong encoding. And then it gets mangled. And then
| the user is unhappy.
|
| At that point, it becomes very hard to argue why one of the two
| things is wrong, and the other is not. While the user argues
| the other way around. Because both look correct, if you look
| from the right angle. And the only reason why I am right is
| because of some standard, while the customer is right because
| of money.
|
| And yes, it is very 'surprising' that our software now functions
| correctly for Russian or Greek customers.
| darkhorn wrote:
| I think it is a Java related issue. Relevant issue occurs in
| Jaspersoft Report. You cannot install Jaspersoft Report on
| Turkish Windows no matter what.
| dmingod666 wrote:
| The domain name to the website is all ascii..
| zamalek wrote:
| If you use a Microsoft account to set up windows then you have
| no control over the local username.
| dmingod666 wrote:
| That sucks.. always hated the idea of an online account to
| access your local system..
| moonchrome wrote:
| This is exactly why I don't do that initially - I don't mind
| my account being linked - but I've been bitten by the home
| path bugs multiple times, I unplug my pc during setup
| numpad0 wrote:
| Oh, it's not common knowledge that you should not use UTF-8 in a
| Windows username? That has been the case since the Windows 95
| days. Only recently has it supposedly improved, after Microsoft
| Account login became semi-mandatory.
| progval wrote:
| On the contrary, the first bug happens because docker-compose
| tries to decode the path as UTF-8, but it is not UTF-8-encoded.
| ("'utf-8' codec can't decode byte")
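|
| To illustrate the kind of failure quoted above, here is a minimal
| Python sketch. The path and the choice of cp1250 as the legacy
| code page are assumptions made for illustration, not details
| confirmed by the article.

```python
# Hypothetical path; cp1250 (Central European) is an assumed legacy code page.
path = "C:\\Users\\Miko\u0142aj"  # \u0142 is the Polish letter l-with-stroke

# Bytes as a program using the legacy code page would write them.
raw = path.encode("cp1250")

# A consumer that assumes UTF-8 cannot decode those bytes.
try:
    raw.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(decode_error)

# Decoding with the encoding that was actually used recovers the path.
recovered = raw.decode("cp1250")
```

| The bytes themselves are fine; the failure comes only from the
| producer and the consumer disagreeing about the encoding.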
| chris_overseas wrote:
| I don't think this bug is anything to do with Windows, rather
| it is due to the way the paths are handled in the IDE's
| codebase. Presumably the same problem exists when using these
| IDEs in conjunction with a path containing non-ascii characters
| in the Linux or macOS world.
| numpad0 wrote:
| Isn't it some compilation option issue in native part? I
| thought it's a line on .sln or include library in a C++
| source or something that has to be explicitly specified when
| building a Win32 binary.
| GoblinSlayer wrote:
| IntelliJ has a native part?
| Fordec wrote:
| A lot of adults today weren't even alive in 95. Also, the
| assumption that people are familiar with windows vs other
| operating systems is becoming less and less valid. And as the
| world gets more globalised and remote, it can no longer be
| assumed that all technical people come from an Anglo-American
| culture.
| david422 wrote:
| There's also this article: falsehoods-programmers-believe-about-
| names:
|
| https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...
|
| Certainly informative if you haven't seen it before.
|
| My takeaway from it was to design your system to accommodate as
| many cases as possible, but since it is basically impossible to
| accommodate them all, aim for your target audience.
| ygra wrote:
| One way of working around such issues is to use subst. That way
| the application thinks your project directory is actually located
| on P:\ or something like that.
| rcxdude wrote:
| Sadly there is even still software which fails to build or even
| fails to run when there is a space in a filename (as is super
| common on windows file paths, as well as autogenerated CI build
| folders). It's ridiculous to no end that software cannot handle
| paths correctly.
| tazjin wrote:
| The number of random encoding problems that still exist is
| bizarre. I recently left a UK job, having already left the
| country more than a year ago, and in their attempt to mail a P45
| form to my new address (in Moscow) the only bits that survived
| were the string "c/o" and the postal code.
| tediousdemise wrote:
| The solution to this is extremely simple: don't validate
| usernames, period.
|
| The rationale is from an article someone linked here ("Falsehoods
| Programmers Believe About Names"):
|
| > Anything someone tells you is their name is--by definition--an
| appropriate identifier for them.
|
| If you try to validate by checking for profanity, knowing full
| well that people can have names that contain profane substrings,
| I have a tongue-in-cheek message for you-- _you are a fucking
| asshole_.
| xlii wrote:
| A problem very similar to the one described started my exodus
| from Google services.
|
| I also have non-Latin characters in my name; however, I knew
| they were always an issue, so I never used them in paths etc.
|
| At some point, a long time ago, I was tasked with doing some
| maintenance on a Google Cloud service (can't remember the name of
| the service now), which was doable only through a Python CLI
| utility, and it failed with a very similar Python error.
|
| What I found out rather quickly is that the utility took my name
| from my Google+ profile, which did include those non-Latin
| characters. "No biggie", I thought, and fired off an e-mail to
| support (yeah, those were the times when it was still that easy).
| A few hours passed and I received word that this wouldn't be
| fixed anytime soon and that the best course of action would be to
| change my name.
|
| Of course, the support person probably meant removing the
| diacritics from my Google+ profile, but it still left an
| unpleasant aftertaste for years to come.
| nullspace wrote:
| > the best course of action would be to change my name
|
| As someone who has been told this, for other reasons, I
| empathize. My reaction has always been - "Your system can't
| even handle names, you need to fix it".
|
| Edit: I wish there was a library / service that helped you
| handle all sorts of edge cases in names, so that you don't
| have to worry about it. Just use a user-id, and set / get a
| name from a lib / service that can actually handle it.
| dymk wrote:
| Has that reaction ever resulted in the other party fixing
| their system in a timely manner?
| mjevans wrote:
| This is exactly why I hate the way Python3 handles Unicode.
|
| EVERY language should _try_ to handle Unicode such that if a
| data sequence were valid before it remains valid after. NONE
| should ever FORCE validation, since sometimes, like in the
| article's case, the correct answer is GIGO. Just pass it
| through and hope it continues to work. Sometimes the error is
| trying to enforce that validation.
| geofft wrote:
| Python 3 usually handles this correctly, and I'm a little bit
| confused what's going on in the article, exactly.
|
| For UNIX path names (and other OS data like environment
| variables), Python uses the "surrogateescape" error handling
| method, which does exactly what you ask. Any byte sequence
| can be converted to a string. If it decodes as valid UTF-8,
| it will do that. If it hits a byte that does not decode as
| valid UTF-8 (necessarily a byte >= 128), it will map it to
| code points U+DC80 through U+DCFF. These are in a reserved
| range of code points ("surrogates", which make it possible
| to represent code points > 0xFFFF in UTF-16), and they can't
| show up in actual Unicode text (i.e., there is no UTF-8
| encoding of them, strictly speaking, and if you applied the
| UTF-8 encoding algorithm to a code point in the U+D800 to
| U+DFFF range, you would get bytes that aren't valid UTF-8).
|
| On the way out, this is reversed. So you get the results you
| expect if your filenames are in UTF-8, but since UNIX has no
| requirement that filenames are indeed UTF-8 (the only
| constraint is they can't contain NUL or ASCII-forward-slash),
| the bytes are preserved in a funky-looking format in Python
| and you get the exact same output on the other end.
|
| See https://www.python.org/dev/peps/pep-0383/ for more on
| what's going on. The tl;dr for users of Python is that if you
| want to interact with, say, subprocess output as mostly-
| normal strings (instead of bytes) but you want to be robust
| to non-UTF-8 bytes, you should do something like
| subprocess.check_output(["some", "command"],
| errors="surrogateescape")
|
| You don't need to do this for APIs that directly interact
| with pathnames, because they do it already. You just need to
| do it for things like subprocess output and file contents
| that Python doesn't know you want to handle in this way.
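|
| A small sketch of the round trip described above. The file name
| is made up for illustration; the only real APIs used are
| bytes.decode and str.encode with the "surrogateescape" error
| handler from PEP 383.

```python
# Hypothetical file name encoded in Latin-1, so it is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
text = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# On the way out the surrogate is turned back into the original byte.
round_trip = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, which is why the handler
# has to be named explicitly for data Python does not treat as a path.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

| The string is usable as text for most purposes, and the original
| bytes come back unchanged as long as the same error handler is
| used in both directions.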
|
| ...
|
| On Windows, however, path names must be valid Unicode and are
| stored in UTF-16. So the idea of a "ł" that doesn't decode
| properly shouldn't even happen! Mikołaj's home directory
| ought to be a very boring (and valid) 004d 0069 006b 006f
| 0142 0061 006a on disk.
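|
| The code units listed above can be reproduced with a short
| Python sketch. Big-endian UTF-16 is used here only so that each
| code unit prints in reading order; it is not a claim about the
| byte order Windows uses on disk.

```python
name = "Miko\u0142aj"  # \u0142 is the letter l-with-stroke

# Encode to UTF-16 without a byte-order mark, big-endian for readability.
encoded = name.encode("utf-16-be")

# Split into 16-bit code units and format each one as four hex digits.
code_units = [encoded[i:i + 2].hex() for i in range(0, len(encoded), 2)]

print(" ".join(code_units))
```

| Every character of the name is in the Basic Multilingual Plane,
| so each one takes exactly one code unit and no surrogates appear.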
|
| Windows doesn't enforce that file paths are _valid_ UTF-16
| though (specifically, the surrogate code points are only
| supposed to show up in a certain way, but nothing enforces
| that and you can have random surrogates on disk), and hence
| Rust, which internally represents all strings in UTF-8, has a
| solution ("WTF-8") that's basically the inverse of
| surrogateescape - it uses extrapolated-UTF-8-encoding-of-
| surrogates to handle unpaired surrogates.
| http://simonsapin.github.io/wtf-8/ But it seems very odd to
| me that the directory C:\Users\Mikołaj would actually contain
| any of those, and if it doesn't, I would expect it to very
| easily turn into a Python Unicode string.
|
| Maybe this is from a Python version before
| https://www.python.org/dev/peps/pep-0529/ , which is claimed
| to "fail to round-trip characters outside of the user's
| active code page"? Maybe this is from a Python version
| _after_ that change and it's wrong?
| nightpool wrote:
| The incorrect docker-compose file was _generated_ by Java
| (Jetbrains) but _consumed_ by Python (docker-compose). The
| GP comment was complaining about Python's strict Unicode
| consumption, not Java's invalid Unicode generation.
| nightpool wrote:
| How is this Python's fault? It's not like the `docker-
| compose` file would have worked any better if it silently
| replaced one of the volumes with an inaccessible file.
| Instead, you'd just get a failure from the Windows filesystem
| API when you tried to access or create a file at "C:\\\Users\
| \\Mikoaj\\\AppData\\\Local\\\JetBrains\\\Rider2021.2\\\log\\\
| DebuggerWorker\\\\\", right?
| sschueller wrote:
| Many years ago I could not access the apple developer panel
| because of the umlaut in my last name. It was eventually fixed
| but I was quite surprised that such a large company would run
| into such a basic issue.
| rodgerd wrote:
| If you look at many of the responses here it's sadly
| unsurprising: small-minded provincialism or outright xenophobia
| are no less common amongst programmers than the general
| population.
| [deleted]
| devrand wrote:
| My last name has an apostrophe in it which Apple apparently
| loves to embed directly into their JavaScript unescaped. For a
| long time neither I nor Apple could look up AppleCare status on
| my stuff as they were all linked to my Apple ID. The portal
| would thus require me to login, but then would just show a
| partially rendered page as my last name was causing a JS
| syntax error.
| nneonneo wrote:
| Hmm, it sure sounds like John <script>alert(1);</script>Doe
| (Bobby Tables' distant cousin) should sign up for an Apple
| account. An XSS attack which could target the AppleCare reps'
| machines could be catastrophically bad...
| doubled112 wrote:
| You'd think the apostrophe would be common enough they'd know
| it could happen, but no.
|
| I love to enter it and see what each vendor and website's
| backend does with it.
|
| The Staples Canada website, for example, returns it as &#39;
| (HTML escaped). A couple of times when I've logged in, it seems
| to have escaped one more character. I'm currently up to &amp;#39;
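|
| A sketch of how that kind of creeping escape can happen, using
| Python's html module. The name is hypothetical, and note that
| html.escape writes the apostrophe as &#x27; rather than &#39;,
| so the exact strings differ from what the website shows.

```python
import html

name = "O'Brien"  # hypothetical name containing an apostrophe

# Escaping once is correct when embedding the value in HTML.
once = html.escape(name, quote=True)

# Escaping an already-escaped value turns "&" into "&amp;" each time.
twice = html.escape(once, quote=True)
thrice = html.escape(twice, quote=True)

for value in (once, twice, thrice):
    print(value)

# Unescaping exactly as many times as it was escaped restores the name.
restored = html.unescape(html.unescape(html.unescape(thrice)))
```

| The usual way to avoid this is to store the raw value and escape
| only once, at the point of output.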
| irrational wrote:
| >such a large company would run into such a basic issue
|
| Every large company is just a conglomeration of smaller
| departments. Each department has individual contributors. Some
| individual contributor in that department wrote the code and if
| nobody else in their department caught it, nobody else at the
| large company would have caught it since they have their own
| work to consider and don't have time to look at other people's
| stuff.
| lostgame wrote:
| I think what OP means is that a company so large should have
| the resources to test such edge cases.
| supernes wrote:
| It's somewhat common to see videogames issue a patch shortly
| after release where they fix crashes due to non-ASCII Windows
| usernames or non-English locales. I'm not sure what the root
| cause of the confusion is, other than text strings being hard in
| general.
| GoblinSlayer wrote:
| It's text encoding confusion:
| https://en.wikipedia.org/wiki/Mojibake
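|
| A minimal sketch of how mojibake is produced: text is written as
| UTF-8 and then read back as if it were a single-byte code page.
| The choice of cp1252 as the wrong encoding is an assumption for
| illustration.

```python
original = "Miko\u0142aj"  # \u0142 is the letter l-with-stroke

# Written correctly as UTF-8: the letter becomes the two bytes C5 82.
utf8_bytes = original.encode("utf-8")

# Read back with the wrong single-byte encoding: each byte becomes
# its own character, so one letter turns into two unrelated ones.
mojibake = utf8_bytes.decode("cp1252")

# As long as nothing was lost, reversing the two steps repairs it.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

| Here the damage is reversible; it stops being reversible once a
| byte has no mapping in the wrong encoding or the text has been
| "fixed" a second time.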
| jerf wrote:
| It's easy to think the answer is "just UTF-8 everything" but
| unfortunately the long and twisty history of filesystems means
| that's not the correct answer, and the "correct answer" is
| really hard to write down quickly.
|
| If you never display the filename, the answer is to treat
| existing filenames as bags of bytes, but that breaks down as
| soon as you need to display them, or if you need to manipulate
| them by appending unicode to them, in which case you have to
| decide on an encoding.
|
| Unicode encodings tend to mangle non-Unicode values because
| they're specified to replace whatever they can't understand
| with a particular Unicode character, usually represented as a
| diamond with an inverted ? inside of it.
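|
| A sketch of that replacement behaviour in Python, where it has to
| be requested with errors="replace" (the default is to raise). The
| cp1250-encoded name is an assumed example of non-UTF-8 input.

```python
# Bytes that are valid cp1250 but not valid UTF-8 (assumed example input).
raw = "Miko\u0142aj".encode("cp1250")

# Lenient decoding swaps the undecodable byte for U+FFFD.
replaced = raw.decode("utf-8", errors="replace")
print(ascii(replaced))

# The original byte is gone: re-encoding cannot reproduce the input.
re_encoded = replaced.encode("utf-8")
lost_information = re_encoded != raw
```

| This is fine for display, but a file name that has been through
| this step can no longer be used to open the file.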
|
| There's some obscure solutions to this problem, like
| https://simonsapin.github.io/wtf-8/ (which includes discussion
| of the 16 bit encodings you need for Windows), but I haven't
| seen broad support for them. You need a deliberately
| "noncompliant" encoding/decoding system that doesn't replace
| unknown characters with replacement characters. Fortunately,
| compliant systems are becoming more and more popular and
| available. Unfortunately, that can make file name handling
| _harder_ than when you had a non-Unicode-compliant handling
| system for your strings.
| nyanpasu64 wrote:
| Rust uses WTF-8 on Windows for OsStr[ing] and Path[Buf]. It's
| zero-overhead to cast from &str to &OsStr/&Path to &[u8]
| (though converting WTF-8 to UTF-16 costs an extra operation
| when performing a Win32 function call). However this doesn't
| solve the inability to round-trip "possibly-valid UTF-8/16"
| to "Unicode text" and back (though Python's surrogateescape
| might be one viable approach).
|
| Other libraries handle this even worse than Rust. On Linux
| (filenames are bytes), Qt is unable to open files with
| invalid UTF-8 names, while GTK can open them (but shows an
| "invalid encoding" message instead of the original filename),
| which I think is a good-enough approach.
| garaetjjte wrote:
| Part of the problem is legacy Windows cruft. For a long time, to
| properly handle Unicode characters you needed to explicitly use
| the wide-char UTF-16 functions. The legacy narrow encoding is a
| system-wide setting that couldn't be set to UTF-8, so only a
| subset of characters would be represented correctly. Only
| recently did they introduce the ability to set the narrow
| encoding for an application to UTF-8 with setlocale, which is a
| lot saner.
| mkotowski wrote:
| In the case of home-grown code, it could simply be a question
| of programmer awareness. There are still many outdated and/or
| unfinished tutorials that use WinAPI without any concern about
| enabling Unicode and wide-char support.
|
| If we are talking about ready game engines like Unity and
| Unreal... it is probably a naive assumption about input being 1
| byte wide and things getting lost because of that in some
| gamedev-made script.
| jan_Inkepa wrote:
| I've been bitten on a few small releases by forgetting that C#
| localises number->string conversion by default (which makes
| sense. But if you forget, and you're writing floats to csv
| files and the decimal points become decimal commas....).
| breakingcups wrote:
| It's also a common thing that Silent (aka CookiePLMonster)
| fixes in the games he patches.
|
| See for example:
| - https://cookieplmonster.github.io/2020/05/23/silentpatch-maf...
| - https://cookieplmonster.github.io/2021/02/27/silentpatch-yak...
| amarshall wrote:
| For a list of strings that often cause problems (to add to a
| test suite, for example), see
| https://github.com/minimaxir/big-list-of-naughty-strings
| tomaslaureano wrote:
| Great resource! I usually use pangrams (holoalphabetic
| sentences like "The quick brown fox jumps over the lazy dog")
| to ensure that my code can handle all the alphabet characters
| for the languages that should be supported at the very minimum.
| munk-a wrote:
| It's also important to width-test fields. Never forget to make
| sure that WWWWWWWWWWWW doesn't cause weird application
| wrapping.
| aidenn0 wrote:
| I used a system where the maximum length on the "new
| password" field in the change password form was longer than
| the password field in the login form.
|
| The symptom was that I could login if I used my password
| manager browser plugin, but not if I pasted it from my
| password manager.
| kevinmgranger wrote:
| You're lucky they weren't different lengths in the backend.
| I've been bitten by that surprise one too many times (which
| is any number higher than zero)
| aidenn0 wrote:
| The most ridiculous thing is the UI for setting the
| password even said "X-Y characters long, must include at
| least one..." but the login page could not support Y
| characters.
| pferde wrote:
| I have seen a windows app with a text field whose max
| character count was somehow determined by system font size
| - probably a crude way to make sure the entered text fits
| the hard-coded field size.
|
| The problem was that this field was used to enter a
| 10-digit code, and as it turns out, on default Windows10
| system, the fonts are set up so that this field only fit 8
| of them. Oops! :)
| munk-a wrote:
| I'd like to see how that App would work with me sitting
| here with fonts cranked up to 175%. I've never heard of a
| setup like that though - it sounds like it'd be
| surprisingly intricate to actually configure.
| munk-a wrote:
| I maintained a system where we had unbounded password
| length... but only respected the first six characters of
| the password. (we did fix that).
| amarshall wrote:
| Related (we do this at my work):
| https://en.wikipedia.org/wiki/Pseudolocalization
| OskarS wrote:
| An enormously useful list, I've used it several times, and it
| can often dig up some real nastiness if you haven't been super
| careful.
|
| This entry, by the way, is a fantastic little easter egg in the
| list:
| https://github.com/minimaxir/big-list-of-naughty-strings/blo...
| vertis wrote:
| No, seriously, wake up
| [deleted]
| ryanianian wrote:
| Very handy. My previous simple test-case was simply a selection
| from this well-known text-file which is simply a collection of
| somewhat uncommon unicode characters, usually used for
| rendering tests.
|
| https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
|
| But this set of strings is specifically designed to cause edge-
| case errors.
|
| Also don't forget Spolsky's seminal "The Absolute Minimum Every
| Software Developer Absolutely, Positively Must Know About
| Unicode and Character Sets (No Excuses!)".
|
| https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
| spicybright wrote:
| So frustrating how this still happens. It's too latin centric.
| mikasjp wrote:
| I think the whole problem is keeping the character encoding
| consistent in the applications and their dependencies.
| Programmers often forget this because they avoid non-ASCII
| characters in their code.
| mkotowski wrote:
| I, too, have the letter Ł in my name, and yes, it is a sick joke
| that so many things, even in supposedly modern systems, make
| the assumption that the world runs on ASCII.
|
| In the case of the Windows operating system, the worst fact is
| that every single part of it behaves differently. Some parts
| display the path with a wrong encoding, but handle it correctly.
| A third-party app can display it correctly, but fails while
| trying to access any file. From what I remember, even the built-
| in PATH variable editor/manager goes through some arcane steps to
| display the letters in a wrong way, but getting them to work
| _sometimes_.
|
| I can only imagine how much more pain it is for someone using any
| of the less widely-used writing systems or those with more
| advanced features compared to ASCII (Hebrew's RTL, Arabic script's
| medial and final forms, etcetera).
| gerdesj wrote:
| Can Ł have an alternative representation? For example the
| German ß => ss. Also I think ö can be written as oe.
|
| In English we simply shake the big bag of letters, pick a few
| at random and then throw them at the page until a few stick.
| q3k wrote:
| > Can Ł have an alternative representation?
|
| Nope. Neither can ż, ć, ś, ą or ę. You can, and people do,
| write them as z, c, s, a and e when writing in a restricted
| character set, but that is not 'correct' and is not a
| bijection, i.e. „półka" and „polka" mean two different
| things.
|
| There's also the case of technically-same-sounding-
| especially-recently ż/rz and ó/u (whose replacement would let
| you get rid of two 'non standard' characters), but for
| historical reasons these are not interchangeable.
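|
| For what it's worth, the usual programmatic way of falling back
| to a restricted character set is to decompose the text and drop
| the combining marks. A minimal Python sketch (strip_marks is a
| hypothetical helper name) shows both that this is lossy and that
| it leaves ł alone, because ł has no Unicode decomposition.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (a lossy fallback)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The accents on z, o and c are removed, but l-with-stroke survives.
stripped_word = strip_marks("\u017c\u00f3\u0142\u0107")

# Two different inputs can collapse to the same output, so the mapping
# cannot be reversed.
collision = strip_marks("el") == strip_marks("\u00e9l")

# Even after stripping, the result is not guaranteed to be ASCII.
still_non_ascii = not stripped_word.isascii()
```

| So a pipeline that relies on this trick to "make names ASCII"
| will still hand a non-ASCII string to whatever comes next.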
| gerdesj wrote:
| I do find this sort of stuff fascinating and also faintly
| frustrating but of course my mother tongue is (in)famous
| for being a bit loose at first sight.
|
| According to one of my employees (Polish) Ł sounds roughly
| like w as in win or water but not as in what. A quick read
| of this: https://en.wikipedia.org/wiki/%C5%81 doesn't help
| too much.
|
| Does enforcing Ł instead of say w cause your written
| language to fail in some way? I don't want to cause
| offense, I want to understand the causes of difference.
| q3k wrote:
| 'W' in Polish is already used, but for a different sound
| - it's pronounced like the English 'v'. 'V' in turn is
| not present in the Polish alphabet (in the sense of it not
| being present in words of Polish origin).
|
| If you wanna change that, you might as well change the
| entire writing system of the language, eg. to be more in
| line with some other, more common writing system (ie.
| other latin alphabets or the cyrillic alphabet which
| would probably make the most sense phonetically). But no-
| one's gonna go for that any time soon.
| gerdesj wrote:
| "If you wanna change".
|
| I think we have found the disconnect: you quite happily
| use a word like "wanna" which is nonsense in English. It's
| allowed because it is understandable. Wanna is "want to".
|
| Ooh, "gonna": That'll be "going to".
|
| What's gonna to you is l bar for me or vice versa or
| something 8)
| bagswatchesus wrote:
| Not sure how they managed to do it but they had some basic rules
| that they used to say "no real name can look like this, this is a
| fake person!" and just kicked it out.
| https://www.thelvbags.co/louis-vuitton-wallets-and-purses.ht...
| xwdv wrote:
| What's wrong with just writing it as Mikolaj? It's not like it's
| a kanji or something.
| sophacles wrote:
| Because that's not their name?
| dahfizz wrote:
| Their URL is even mikolaj-kaminski.com . I get it's annoying,
| but I would never use non-ascii chars in a username / file
| path.
| jerf wrote:
| So, what does 阿部明仁 do in this case?
|
| Polish may be close enough that an approximation is available
| in English, but there's an awful lot of languages that don't
| have a large overlap with English characters.
|
| In the Asian case above, if someone with that name did try to
| "convert to English" they are ironically just as likely to
| end up with Ａｋｉｈｉｔｏ　Ａｂｅ as the ASCII, which will be just as
| broken!
| numpad0 wrote:
| Assuming that hypothetical guy is an average Japanese
| male (somewhat leaning right), he'd just turn IME off.
| Japanese input on desktop consists of the following three
| states:
|
| - IME On state. IME captures and interprets keypresses as
| engraved and generates corresponding Kana-Kanji texts.
|
| - IME Off state. IME passes through keypresses as engraved
| on keytops.
|
| - Direct Input state. IME becomes dormant.
|
| In IME Off state, the keyboard behaves as a plain jp106 (or
| ANSI if it is) keyboard, like I'm doing right now. The
| cases where you would use conversion with IME on for an
| English word are when you have reasons for the word to be in
| "full width" (usually for typesetting reasons).
| jerf wrote:
| I don't think it's something that people should 'just
| know' that when Windows asks them their name during
| install time, they _ought_ to use 7-bit clean ASCII for
| everything, no matter where they are in the world or how
| much they know about other languages. When Windows says
| "What is your name?", they ought to be able to _use_
| their name without things breaking.
|
| I'm sure a computer savvy speaker of a fully-non-Latin
| language may still guess this is a good idea, but
| "computer savvy" doesn't cover everyone... and they
| shouldn't _have_ to.
|
| "Just use 7-bit-clean ASCII English" is not a solution to
| this problem.
| dahfizz wrote:
| They could use a different name as their windows name (Do
| people use their real names as their usernames? I never
| do). Or, they would have to go through the pain of finding
| a real solution, like the author did.
|
| Considering JetBrains seems unwilling to fix this bug,
| maybe the best solution of all is to switch to an IDE that
| works.
| Kye wrote:
| "You're holding it wrong"
|
| The problem is the technology, not the user using it in a
| reasonable way. ł is older than computers and the only reason
| computers struggle with it is lack of foresight or choosing
| to make things harder for most of the world by some of the
| people involved early on.
| dahfizz wrote:
| Obviously the IDE is at fault here. Rider has a bug with
| Unicode.
|
| BUT, there is an easy workaround to avoid all Unicode
| related bugs: don't use Unicode. If that's morally
| objectionable for you, then you can keep fighting this
| fight.
| bivargen wrote:
| Avoiding Unicode, or anything but 7-bit ASCII, is like
| chiseling text into stone instead of using pen and
| paper because the pen might break. Fix the pen! Or
| replace it with a computer (and we're back full circle)!
|
| Avoiding it is not morally objectionable, it's just
| stupid.
| tremon wrote:
| I think it's reasonable to find that morally
| objectionable: English is the only language* that can be
| fully represented in ASCII, so pretending that ASCII is
| all you need excludes a large part of the world.
|
| * yes, by and large. Many languages make do, but even the
| European languages that use the same script as English
| cannot be fully represented:
|
| - Pretty much all mainland European languages use accents
| (simple example, in Spanish el and él are different
| words)
|
| - French misses ç
|
| - German/Swiss/Austrian misses ß
|
| - Spanish misses ñ
|
| - Dutch misses ĳ
| InitialLastName wrote:
| It's naive of you to maintain the facade that English can
| be fully represented in ASCII. We've just had longer than
| other languages to adapt to that particular encoding
| technology, and the good luck to have a code set built to
| represent our language become the lingua franca of
| computer technology.
| Symbiote wrote:
| Not even Britain and Ireland can manage with ASCII: they
| need £ and €.
|
| I agree with you, and disagree strongly with dahfizz, who
| is essentially telling people their name and language are
| unacceptable.
| Muromec wrote:
| Cyrillic-writing countries miss all of their alphabets
| and so does Greek.
| ludamad wrote:
| For the record, it's a stark pronunciation difference as ł has
| drifted to a very different "w" sound
| MadeThisToReply wrote:
| Yep. For example, the name of the third-largest city in
| Poland is "Łódź", which might look like it's pronounced
| "lods", but is actually pronounced more like "wootch".
| garaetjjte wrote:
| Sometimes you end up with parcel addressed to city "??d?".
| Shipping systems cannot cope with non-ASCII chars more
| often than I would expect...
| greenshackle2 wrote:
| I've seen shipping labels with HTML encoded characters,
| like &eacute; and &egrave;. I'm not sure if that's better
| or worse:
|
| &#321;&oacute;d&#378;
| ssivark wrote:
| That's about as aggravating as asking Ryan to change name to
| Pyan -- because the encoding doesn't support "R" and "P" looks
| very similar.
| no_time wrote:
| Because it's not his name. Imagine you are John but you had to
| make do with Yohn because the people designing your software
| didn't need the letter J...
| kmlx wrote:
| it was 30 years ago when i discovered that it doesn't really
| matter what my name is. the system i'm interacting with
| expects my name to be "john" or something like that. so i let
| it be.
|
| 30 years later and i completely dropped all non-latin chars
| from my name in any and all forms. from airplane tickets to
| passport to you name it.
|
| and you know what? no one cared about non-latin. not even the
| government. i loled when i actually realised.
|
| i've encountered zero issues ever since.
|
| and it's been the same for lots of my friends. they just
| adopted some western name. case closed, no more issues.
|
| it all depends on how much importance you attribute to your
| name. for me it's always been a random variable. for others
| it's a matter of pride. but to the "system" it will be a
| "random list of chars", sometimes latin, other times utf.
| zanderwohl wrote:
| It's not strange to localize your name. In ASL for example,
| you could sign your English name letter-by-letter, but it's
| much more common to have a totally new sign for your name -
| usually a word combined with the first letter of your name.
| Taking part in a different system often means taking on
| another name.
| q3k wrote:
| It seems that you're implying computers are universally
| american and therefore people are expected to
| speak/use/adapt to american.
| thereddaikon wrote:
| That's the harsh way to put it. A more diplomatic way is
| that computing is not unique in having deeply ingrained
| artifacts of the language and culture that birthed it and
| developed many of the paradigms.
|
| Take anything having to do with seamanship. There are
| many terms that date back to early modern English that
| simply don't make sense anymore yet are accepted and
| universal because the British Empire had a large and
| enduring influence on maritime matters and happened to be
| at the forefront of most modern developments until about
| 70 years ago.
|
| In some cases this is actually built into laws and
| industry practice. Pilots speak English. That's the
| rules. Don't like it? Invent the time machine and beat
| Wilbur and Orville. For much the same reason, science
| speaks Latin.
|
| This technical debt is difficult if not impossible to
| overcome, especially in regards to computers because we
| still haven't cracked general purpose AI. Software will
| only accommodate what it was written to accommodate.
|
| Recognizing the problem and working to fix it is all well
| and good. But it's wise to understand that this won't be
| solved any time soon so in the meantime it is pragmatic
| to operate in such a way to maximize compatibility.
|
| After all, I still have to call it a Foc'sle even if I
| think that's dumb or isn't inclusive of my culture.
| xxpor wrote:
| There's also the practical consideration that English,
| due to having a) an alphabet b) letter shapes that aren't
| affected by surrounding letters and c) no diacritics, is
| the easiest major language to store and display on a
| computer. Even if silicon valley ended up in a country
| with a logographic writing system, I'd bet that the first
| character set that would have been used would have been
| Latin based
| [deleted]
| AdrianB1 wrote:
| My name contains non-Latin characters (apparently strange as
| we use a Latin language), but in 40 years of working with
| computers I learned to avoid using the original form and
| always convert to ASCII; yes, it is not my name, but my pride
| and sense of entitlement are not hurt at all.
|
| Sometimes it is better to avoid being hit by the bus even if
| you are right.
| wbsss4412 wrote:
| So the solution is for the user to change their entire windows
| account name, rather than handling common characters in your
| code?
| toast0 wrote:
| For a user, changing their account (probably creating a new
| user, since rename apparently doesn't change the directory),
| is something they can do.
|
| Changing all software to respect their perfectly valid name
| isn't something they can do.
|
| They shouldn't need to change their name, but if they do,
| they can ignore all the broken software and go about their
| day.
|
| This particular user is more capable than most, and found a
| workaround for this particular problem, which is good... But
| this is not likely to be the last of the problems.
| dahfizz wrote:
| Of course it would be better if all code was bug free. But
| that's impossible. As a user, avoiding unicode is a pretty
| easy way to avoid bugs like this - it's the rational thing to
| do.
| Jensson wrote:
| When you have non-standard characters in your name you
| quickly learn to never use them in computers since even
| though most systems work fine, some don't. And you can't fix
| all the thousands of systems your name has to interact with.
|
| I even had trouble booking flight tickets since their
| security system couldn't parse my name, and then had to go
| through some special security check due to it returning
| errors. After that, never again. Not sure how they managed to
| do it but they had some basic rules that they used to say "no
| real name can look like this, this is a fake person!" and
| just kicked it out.
| wbsss4412 wrote:
| I totally understand what you're saying, but it's also a
| sad state of affairs when we can't handle "non standard
| characters".
|
| Standard characters (i.e. English) are only used by a small
| subset (maybe 5-10%) of the global population.
| yuliyp wrote:
| They're not non-standard characters. They're just as much a
| part of the Polish alphabet as 'a' and 'b' are.
| Jensson wrote:
| That is exactly what I meant. My name doesn't have non-
| standard characters either from the perspective of my
| home country, it is just normal letters in the alphabet,
| but not in the English alphabet.
| q3k wrote:
| > When you have non-standard characters in your name
|
| 'standard' by what measure? Ł is more standard than X or Q
| in the Polish alphabet.
|
| ~ Sincerely, a person whose name contains „ń" and
| therefore had to deal with this bullshit his entire life.
| Jensson wrote:
| From a programmer's perspective. The characters in my name
| are standard where I come from, but they are not standard
| to the international air travel security systems likely
| developed by Americans.
|
| Edit: You know how aircraft travel security always
| transforms your name into letters from the English
| alphabet to parse? Yeah, it transformed my name and then
| the resulting string looked so bad that the system
| rejected that. The original name doesn't look bad, but
| after transformations it did...
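|
| A minimal Python sketch of how a name can come out mangled after
| being folded to the English alphabet. This is an illustration only,
| not how any airline system is known to work; the function names and
| the EXTRA table are hypothetical. Stripping diacritics through
| Unicode decomposition handles "ń", but "ł" has no decomposition and
| is silently dropped unless it is mapped by hand.

```python
import unicodedata


def naive_ascii_fold(name: str) -> str:
    """Decompose (NFKD), then drop every character that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical hand-written table for letters that have no decomposition.
EXTRA = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "ß": "ss"})


def better_ascii_fold(name: str) -> str:
    """Apply the explicit table first, then fall back to the naive fold."""
    return naive_ascii_fold(name.translate(EXTRA))


print(naive_ascii_fold("Mikołaj Kamiński"))   # Mikoaj Kaminski
print(better_ascii_fold("Mikołaj Kamiński"))  # Mikolaj Kaminski
```

| The naive result has lost a letter, which is the kind of output a
| downstream "does this look like a real name" rule could reject.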
| miloignis wrote:
| From the article:
|
| The first idea was to change the username to one that does not
| contain Polish characters. It turned out that Windows does not
| rename the user's folder when changing the username. Manually
| renaming the folder was not an option. This way I could corrupt
| my profile in the system.
|
| The end of the article is about how to change the directory
| where the temporary files go to one not under the user folder.
| jasonpeacock wrote:
| And yet it's one of the simplest things to add non-ASCII chars to
| your tests to validate their handling.
|
| It's like not testing if your calculator application can handle
| negative numbers or decimals.
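|
| A sketch of what such a test could look like in Python, assuming
| the code under test only needs to create and read a file below a
| temporary directory. The helper name `roundtrip` and the SAMPLES
| list are made up for illustration; the sample names are taken from
| this thread.

```python
import tempfile
from pathlib import Path


def roundtrip(dir_prefix: str, file_name: str) -> bool:
    """Write and re-read a file under a directory with an awkward name."""
    with tempfile.TemporaryDirectory(prefix=dir_prefix) as tmp:
        target = Path(tmp) / file_name
        target.write_text("ok", encoding="utf-8")
        return target.exists() and target.read_text(encoding="utf-8") == "ok"


# Non-ASCII letters, spaces, parentheses and an apostrophe.
SAMPLES = ["Mikołaj Kamiński", "Program Files (x86)", "d'Artagnan", "Sirén"]

for sample in SAMPLES:
    print(sample, roundtrip(sample + "_", sample + ".txt"))
```

| In a real suite the body of `roundtrip` would call the tool being
| tested instead of plain file I/O.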
| nradov wrote:
| In fact it's trivial to generate a text file of all valid
| Unicode code points and use that as input to unit tests.
| yakubin wrote:
| It may be faster to generate them on the fly. Iterating over
| ranges of integers is a lot faster than reading files from
| disk.
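|
| A minimal sketch of the on-the-fly approach in Python. It yields
| every Unicode scalar value (all code points except the surrogate
| range) and, as an assumption about what "valid" should mean here,
| skips the 66 noncharacters by default. It does not check whether a
| code point is assigned in any particular Unicode version. The name
| `scalar_values` is hypothetical.

```python
def scalar_values(include_noncharacters: bool = False):
    """Yield one-character strings for code points U+0000..U+10FFFF."""
    for cp in range(0x110000):
        # Surrogates are not scalar values and cannot be encoded as UTF-8.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        # Noncharacters: U+FDD0..U+FDEF plus the last two code points
        # of each of the 17 planes (U+xFFFE and U+xFFFF).
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        if is_nonchar and not include_noncharacters:
            continue
        yield chr(cp)


def count_utf8_roundtrips() -> int:
    """Check that every yielded character survives a UTF-8 round trip."""
    count = 0
    for ch in scalar_values():
        assert ch.encode("utf-8").decode("utf-8") == ch
        count += 1
    return count


print(count_utf8_roundtrips())  # 1111998
```

| Feeding single code points through a function is only a smoke test;
| it says nothing about combining sequences or longer strings.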
| Someone wrote:
| I would have to do research on whether the list of valid code
| points depends on the Unicode version. For example, can
| regional indicator code points
| (https://en.wikipedia.org/wiki/Regional_indicator_symbol)
| appear in isolation? If not, is that different in Unicode <
| 6, where those code points weren't assigned yet?
|
| Similarly, what about tags
| (https://en.wikipedia.org/wiki/Tags_(Unicode_block) )? Do
| these _require_ an U+E007F CANCEL TAG?
|
| The 66 noncharacters certainly need consideration.
| http://www.unicode.org/faq/private_use.html says:
|
| _"Because of this complicated history and confusing changes
| of wording in the standard over the years regarding what are
| now known as noncharacters, there is still considerable
| disagreement about their use and whether they should be
| considered "illegal" or "invalid" in various contexts"_
|
| Edit: also, testing all code points likely is overkill and
| using code points in isolation likely isn't enough. Most
| tests are better off with something like the big list of
| naughty strings (https://github.com/minimaxir/big-list-of-
| naughty-strings)
| mrweasel wrote:
| It's a pretty good test case. Similarly we found a number of bugs
| in a Django application and path handling, because I happened to
| be using Windows for six months, while the rest of the team was
| on Linux and Mac.
| umvi wrote:
| Using non-ascii characters in file paths, toolchain config files,
| and other non-display contexts is just asking for trouble, even
| if it is your name...
| fluxem wrote:
| Also spaces. I spent half an hour debugging why cmake cuda
| build was failing.
| munk-a wrote:
| A lack of support for spaces at this point is unacceptable.
| I, personally, despise spaces in paths but on windows a whole
| bunch of default system paths already have spaces embedded in
| them in major ways... and let's not forget parens as well -
| thanks "Program Files (x86)"
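|
| A small Python sketch of why unquoted paths with spaces break
| command lines. It uses POSIX shell splitting rules via `shlex`, so
| it is only an analogy for what a build script might do, not a
| diagnosis of the cmake failure above; the path and the `compile`
| command are hypothetical.

```python
import shlex

# Hypothetical source path containing spaces and parentheses.
src = "/opt/Program Files (x86)/demo/main.cu"

# Naive string concatenation: the path falls apart into several arguments.
naive_argv = shlex.split("compile " + src)

# Quoting the path keeps it together as a single argument.
quoted_argv = shlex.split("compile " + shlex.quote(src))

print(naive_argv)   # ['compile', '/opt/Program', 'Files', '(x86)/demo/main.cu']
print(quoted_argv)  # ['compile', '/opt/Program Files (x86)/demo/main.cu']
```

| Passing an argument list straight to the process API, instead of
| building one command string, avoids the quoting step altogether.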
| bbarnett wrote:
| This wouldn't have happened if using rust!
| burnished wrote:
| Some of the other attempts are a little subtle, this one is a
| pretty blatant attempt to rile up the folks that are already
| angry about rust for whatever reason. Please stop.
| nightfly wrote:
| Can you knock it off??? This is even more annoying than
| out-of-place rust evangelism
| jasonpeacock wrote:
| This is the modern, post-ASCII computing world, we should no
| longer be willing to settle for the lowest-common-denominator
| of ASCII-only strings.
|
| There's no excuse for actively supported, _paid_ products to
| have these problems today.
| amenod wrote:
| True. But these actively supported, paid products build upon
| layers and layers of no-longer-supported, free/opensource
| products. Good luck fixing them.
|
| Not saying that this is OK, just explaining why using non-
| ascii characters, in this day and age, is still asking for
| trouble.
| SAI_Peregrinus wrote:
| This is on the Windows version.
|
| Windows 2000 is when the OS changed to UTF-16 by default.
| Before that Windows NT was UCS-2, IIRC only the DOS-based
| Windows versions were Windows-1252 internally, starting
| from Windows 1.0. So while l wasn't supported in Windows 1,
| characters like n were. Windows has literally NEVER been an
| ASCII-based OS.
| horsawlarway wrote:
| Sure, but having used a lot of the windows system apis
| (admittedly - a lot of years ago) it was a complete
| hodgepodge of which api would take a char vs a wchar, and
| then they tried to hide the whole thing behind tchar,
| which just made it even harder to keep track of.
|
| Basically - I agree: This shouldn't be a problem, and 7
| months is a long time to wait for a basic fix. But there
| are a lot of footguns hanging around in windows code with
| respect to character encodings.
|
| Just looking at the first result on google for "c++ get
| windows home directory" shows this:
| https://docs.microsoft.com/en-
| us/windows/win32/api/userenv/n...
|
| Which takes a long pointer to tchar string (LPTSTR) - so
| this behavior is dependent on the unicode settings of the
| project at compile time, even today.
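|
| A rough Python illustration of what is at stake in the char-vs-wchar
| split described above. Python's codecs stand in for the conversion
| to a narrow code page; the real Windows conversion may substitute
| characters differently, and nothing here is claimed to be the
| actual cause of the bug in the article.

```python
name = "Mikołaj"

# cp1252 (Western European) has no 'ł', so a strict conversion fails.
try:
    name.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# A lossy conversion silently replaces the letter instead.
lossy = name.encode("cp1252", errors="replace").decode("cp1252")

# cp1250 (Central European) does contain 'ł', so the result depends on
# which code page the machine happens to use.
fits_cp1250 = name.encode("cp1250").decode("cp1250") == name

# The wide (UTF-16) form keeps the name intact regardless of code page.
wide_ok = name.encode("utf-16-le").decode("utf-16-le") == name

print(fits_cp1252, lossy, fits_cp1250, wide_ok)  # False Miko?aj True True
```

| A path that went through the lossy step no longer names an existing
| directory, so the failure shows up far from the conversion itself.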
| david_allison wrote:
| > Windows 2000 is when the OS changed to UTF-16 by
| default.
|
| Paths are UTF-16 + unpaired surrogates, so a Windows path
| isn't legally representable in UTF-8.
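|
| The point about unpaired surrogates can be shown with a short
| Python sketch. Strict UTF-8 refuses a lone surrogate; Python's
| "surrogatepass" error handler is one non-standard way to carry it
| through anyway. The variable names are illustrative.

```python
lone = "\ud800"  # an unpaired high surrogate

# Strict UTF-8 has no encoding for a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" emits the three-byte form that strict decoders reject.
relaxed = lone.encode("utf-8", "surrogatepass")

print(strict_ok)                                        # False
print(relaxed)                                          # b'\xed\xa0\x80'
print(relaxed.decode("utf-8", "surrogatepass") == lone)  # True
```

| Any tool that insists on strictly valid UTF-8 for paths therefore
| cannot address such a file at all.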
| ainar-g wrote:
| _Especially_ if those products are developed by a company
| from Russia, where Cyrillic is used. For me, a Russian
| myself, this situation is honestly ridiculous.
| zczc wrote:
| Russian companies generally have ascii-only username
| policies
| mbesto wrote:
| Do you write "if" statements in Cyrillic when you write in
| <insert Python/Ruby/Java/.NET/whatever>?
| pavel_lishin wrote:
| It would be very amusing to see "esli" in an if
| statement, given how much it looks and sounds like "else"
| at a brief glance.
| GoblinSlayer wrote:
| I thought ArnoldC was just a couple of #define's, but
| looks like it isn't.
| nine_k wrote:
| No. Keywords are ASCII everywhere (no, APL's are not
| words). Mixing English in keywords and non-English in
| identifiers feels odd.
|
| Algol-68 supported localized sets of keywords;
| fortunately this language is gone.
|
| You can #define non-ASCII stuff in modern C++. It's your
| best chance to "localize" a mainstream language.
|
| Same would work for Clojure, but Lisp uses a lot of
| quirky abbreviations like `cdr` or `setq` that give
| awkward translations.
| gumby wrote:
| This is blaming the victim
| BiteCode_dev wrote:
| Unfortunately, it's true, most toolchains are stuck in the
| past, and don't deal with non-ascii characters or even spaces
| very well. In fact, I just learned that spaces in .desktop file
| values could cause trouble after a long debugging session.
|
| But it's a shame.
|
| In Europe, we do have a lot of non-ascii characters everywhere.
| Ubuntu puts a "Vidéo" and a "Téléchargements" directory in my
| $HOME because I'm french. If I were to use my name as my
| username I would have even more troubles.
|
| I'm careful with not using special chars in names for work, but
| it feels like I'm a girl trying to not dress sexy in the wrong
| part of town: necessary, but I shouldn't have to do this, and
| it's definitely the others to blame.
|
| All in all, I thank the Gods of encoding for Python 3 unicode
| handling. Having a scripting language that does the right thing
| out of the box is wonderful on this side of the pond.
| mjevans wrote:
| "The right thing" for filesystem entries is transparently
| copy, do not evaluate. A file path is a mem-copied, length
| value sized block of identifier you don't ever mangle. If you
| must mangle it, touch only the necessary areas as directed.
| (E.G. join with os.sep and do not normalize anything).
|
| Want to offer Unicode validation? Sure having that as an
| OPTION is fine. Forcing it means I can't rely on that tool to
| handle real world data which happens to not be valid but is
| still a valid file-system address.
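|
| A minimal Python sketch of the "copy, do not evaluate" idea. It
| assumes a byte-oriented (POSIX-style) file name; the sample names
| are made up. The "surrogateescape" error handler lets undecodable
| bytes round-trip unchanged, while Unicode normalization quietly
| produces a different byte sequence and therefore a different name.

```python
import os
import unicodedata

# A file name in Latin-1 bytes; this is not valid UTF-8.
raw = b"caf\xe9.txt"

# Decode without losing information, then get the exact bytes back.
as_text = raw.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
print(round_tripped == raw)  # True

# Joining path components does not inspect or alter the name.
joined = os.path.join("data", as_text)
print(joined.endswith(as_text))  # True

# Normalizing, by contrast, changes the stored bytes.
nfd = "cafe\u0301.txt"                    # 'e' followed by a combining accent
nfc = unicodedata.normalize("NFC", nfd)   # single precomposed 'é'
print(nfd.encode("utf-8") == nfc.encode("utf-8"))  # False
```

| On a file system that compares names byte by byte, the normalized
| string would simply fail to find the original file.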
| GoblinSlayer wrote:
| No seriously, create a user d'Artagnan.
| simonblack wrote:
| Isn't this one of those "100 things Programmers don't know about
| People's Names" things?
|
| Like the poor, it will be with us always.
| xdfgh1112 wrote:
| I don't know, it's just a Unicode character? Not even a newer
| one, it's just 2 utf8 bytes. Pretty much everything should
| support that in 2021.
|
| When I think of 100 things I think of stuff like "some people
| spell their name in all lowercase and get really funny if you
| change it"
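|
| The "2 utf8 bytes" remark is easy to check in Python. This sketch
| only inspects the character itself and says nothing about how any
| particular tool handles it.

```python
import unicodedata

ch = "ł"
utf8 = ch.encode("utf-8")

print(f"U+{ord(ch):04X}", unicodedata.name(ch))  # U+0142 LATIN SMALL LETTER L WITH STROKE
print(utf8, len(utf8))                           # b'\xc5\x82' 2
```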
| numpad0 wrote:
| Yeah so double byte characters costs extra. I don't know, a
| checkbox or something default off. Always did still does.
| Double width costs even more.
| horsawlarway wrote:
| you're getting downvoted, but between tchar hiding wchar vs
| char... this literally could be someone toggling off the
| "UNICODE" checkbox in visual studio somewhere.
| hprotagonist wrote:
| windows probably defaults to latin-1
| bryanrasmussen wrote:
| the default windows encoding is UTF-16, a long time ago it
| was Windows-1252 https://en.wikipedia.org/wiki/Windows-1252
| hprotagonist wrote:
| or CP-1251, in some locations.
| f311a wrote:
| That's a pretty common problem, especially for cyrillic names.
| People just use ASCII names.
| souptonuts wrote:
| Idk changing your stupid fucking name could be a fix too
| Dannymetconan wrote:
| I can very much relate to this but also have very little sympathy
| here.
|
| I have a special character in my name, an apostrophe, and it
| causes trouble regularly online and with tooling. A number of
| years ago I decided just to never use it when it came to anything
| to do with technical work be it email, logins or usernames.
|
| Unicode characters are a pain to deal with and I have suffered
| from it first hand trying to handle it. At the end of the day it
| is much easier just to not use the special characters and move on
| with your life rather than be battling the constant frustration.
|
| I'm sure these tools have lots of issues open, and you would be
| surprised at the amount of time, effort and testing that would be
| required to provide full Unicode support. Most people would see
| it as a very small positive and not worth the effort. I find it
| hard to disagree.
| vultour wrote:
| I'm really surprised someone technically minded thought it's a
| good idea to put a non ASCII character in their username. I'd
| never do that.
| ctdonath wrote:
| I'm really surprised someone technically minded thought it's
| a good idea to not allow non ASCII alphanumerics in a
| username.
|
| Unicode has been a thing since 1988. Names have included non
| a-z characters since forever.
| jltsiren wrote:
| My legal last name is "Sirén". When I was younger, I almost
| always used "Siren", because it was easier to type. Then, ~15
| years ago, I started noticing that American websites sometimes
| rejected it, because they considered it inappropriate.
| Sometimes "Sirén" would work, sometimes it worked but caused
| minor annoyances, and sometimes it would not work for technical
| reasons.
|
| Both versions work most of the time these days, but I still run
| into trouble once in a while no matter which name I use.
| 10000truths wrote:
| Why would Siren be an inappropriate name?
| lostgame wrote:
| Someone who I know has the last name 'Island' and was
| unable to sign up for Facebook forever because they thought
| it was a fake last name.
|
| Maybe 'Siren' is similar. It's a pre-existing word that
| perhaps flags some sort of weird edge case.
| pledess wrote:
| The article offers a solution of
| idea.system.path=${root.dir}/JetBrains/Rider/system but doesn't
| mention the C:\JetBrains directory permissions. Directory
| permissions under %LOCALAPPDATA% (the location that works for
| people without a Polish character) should restrict write access
| to one user. With the Windows default behavior, creating
| C:\JetBrains would inherit permissions from C:\ - and wouldn't
| restrict write access to one user. Maybe 99% of the time this is
| irrelevant (i.e., there's no realistic threat from malicious
| actors who control unprivileged user accounts on your own
| development machine). Still, it's a potential downside of the
| solution, and more motivation for the vendor to fix their code so
| that Polish characters can be used under %LOCALAPPDATA%.
| Kwpolska wrote:
| If you are on a multi-user system, the path "C:\JetBrains"
| isn't really ideal (what if other users also need Rider and
| have non-ASCII usernames?). That said, you can easily change
| file permissions on Windows if the default ones don't work for
| you.
| [deleted]
___________________________________________________________________
(page generated 2021-10-20 23:00 UTC)