[HN Gopher] Debian opens a can of username worms
___________________________________________________________________
Debian opens a can of username worms
Author : jwilk
Score : 240 points
Date : 2024-12-06 09:55 UTC (1 days ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| rini17 wrote:
| Perhaps it's time to agree upon how to Unicode in identifiers?
| The normalization, unprintable characters, confusing characters
| with same glyphs, etc. It's obviously problematic when everyone
| is doing it on their own.
| magicalhippo wrote:
| As long as I can enter my Zalgo[1] username, I'm fine with your
| suggestion.
|
| [1]: https://en.wikipedia.org/wiki/Zalgo_text
| m000 wrote:
| Good luck bringing everyone together. There's still a ton of
| Microsoft software that relies on the presence of the BOM [1],
| despite practically everyone else not using it. And
| bidirectional rsync between practically everything else and a
| Mac still requires `--iconv=utf-8,utf-8-mac` to avoid problems
| because of homographs.
|
| [1] https://en.wikipedia.org/wiki/Byte_order_mark
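|
| Both problems can be reproduced in a few lines. This is only a
| sketch: it assumes the Mac-side issue is decomposed (NFD-style)
| versus precomposed (NFC) accented characters, and the sample
| strings are made up for illustration.

```python
import codecs
import unicodedata

# A UTF-8 file written with a BOM starts with the bytes EF BB BF.
raw = codecs.BOM_UTF8 + "username".encode("utf-8")

# Plain "utf-8" keeps the BOM as U+FEFF; "utf-8-sig" strips it if present.
with_bom = raw.decode("utf-8")
without_bom = raw.decode("utf-8-sig")

# The same visible name in precomposed (NFC) and decomposed (NFD) form.
nfc_name = unicodedata.normalize("NFC", "caf\u00e9")
nfd_name = unicodedata.normalize("NFD", "caf\u00e9")

# The two forms compare unequal until both are normalized the same way.
same_after_normalizing = unicodedata.normalize("NFC", nfd_name) == nfc_name
```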
| bayindirh wrote:
| The first bar to clear is "The Turkish Test"[0], then we can
| talk about Unicode. It'll smooth the rest of the process a lot.
|
| You can't guess how many workarounds I implement to make sure
| that a stray application doesn't get "i" or "I" in their naive
| codepaths, and start burning mid-flight (e.g.: Kodi, Pagico,
| some old Java programs, oh my...).
|
| [0]: https://blog.codinghorror.com/whats-wrong-with-turkey/
| beardyw wrote:
| The date format part is ridiculous. Americans are almost
| unique in using mm/dd/yyyy, so an assumption of that would be
| plain wrong.
| bayindirh wrote:
| Localization libraries handle these parts well, since date
| is same with Europe (and generally stored as time-date
| objects rather than pure strings). None of the number
| shenanigans cause problems since these numbers are always
| stored as IEEE754 or other decimal formats. Money is no
| problem as well.
|
| However, when you go through an upper() or lower() or
| anything which plays with capitalization, and if that data
| is being fed to a hash algorithm or anything which mucks
| with strings, boy, oh boy...
|
| The easiest way is to sanitize these programmatic parts
| with a forced locale of en_US or plain old "C". If the
| string is not user-facing and never localized, just
| force its locale. It's the only sane way.
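|
| A rough illustration in Python, whose str case methods ignore
| the process locale (so they behave like the "forced locale"
| case). The key name is a made-up example.

```python
# Python's str.upper()/lower() do not consult the process locale, so an
# internal key is lowered the same way on every machine.
key = "TITLE_ID"
normalized = key.lower()            # "title_id", never "t\u0131tle_\u0131d"

# The Turkish letters themselves do not round-trip through case mapping.
dotless_upper = "\u0131".upper()            # dotless i -> "I"
round_trip = "\u0131".upper().lower()       # "i", not the original dotless i
dotted_lower = "\u0130".lower()             # dotted capital I -> "i" + U+0307
```

| Feeding any of these through a hash before and after a case
| change gives different digests, which is the failure mode
| described above.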
| oblio wrote:
| > since date is same with Europe
|
| Do you mean MM-DD-YYYY? No, the vast majority of Europe
| does DD-MM-YYYY in some form or another.
| bayindirh wrote:
| No, I mean DD-MM-YYYY. We use the same format with the
| vast majority of Europe.
| oblio wrote:
| I'm confused, are you talking about the US? The US for
| sure does not default to DD-MM-YYYY.
| bayindirh wrote:
| I'm talking about the Turkish Language, its locale and
| its peculiarities since it has letters "i" and "I".
|
| So, I'm talking about Turkish date format. Turkey uses
| DD-MM-YYYY format, like most of Europe.
| a3w wrote:
| https://xkcd.com/1179/ I heard the US and A are moving to
| the hissing cat date format shown here.
| Muromec wrote:
| I kind of like the one using roman numerals for month.
| Reasonable people would figure out that other reasonable
| people would not use roman numerals for _days_ , so the
| order can be implicit. I like implicit ordering, it
| always makes things more interesting.
| bluGill wrote:
| I have switched to yyyymmdd for everything - it is usually
| obvious to everyone what date I mean.
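|
| One practical side effect, sketched with arbitrary sample
| dates: yyyymmdd strings sort chronologically as plain text.

```python
from datetime import date

dates = [date(2024, 12, 6), date(2023, 1, 31), date(2024, 2, 29)]

# Format each date as yyyymmdd, then sort the strings as plain text.
stamps = [d.strftime("%Y%m%d") for d in dates]
sorted_stamps = sorted(stamps)

# Sorting the date objects first gives the same order.
chronological = [d.strftime("%Y%m%d") for d in sorted(dates)]
```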
| bayindirh wrote:
| I also use the same format while naming my files, or in
| changelogs or whatnot, but not all documents are suitable
| for that, and in the presentation layer you need to match
| the country standards.
|
| However, date is mostly presentation and internal storage
| of these are vastly different than what we see generally.
| bluGill wrote:
| I don't match country standards. That is the point.
| kelnos wrote:
| It depends on what you're doing, though. If you're
| helping people fill out documents (even non-government
| documents), then you really need to match the country
| standard.
|
| Localization is important; some countries outright
| require it if you're going to do business within their
| borders. But even where it's not required, you will lose
| customers if your website/application/product feels
| "foreign". I'm not sure date ordering is a big enough
| deal to trigger that feeling in anyone, but unless it's a
| huge burden to format things the way people expect, I
| would do so for the UX benefits.
| throw0101a wrote:
| > _Perhaps it's time to agree upon how to Unicode in
| identifiers?_
|
| And then update all data structures that refer to them (like
| _last_ and _w_ / _who_ , also NFS), as well as file formats
| (like _cpio_ , _tar_ , and _pax_ which encodes ownership).
| maccard wrote:
| Yes. Those formats have had 20 years since Unicode was
| standardised, and things like my terminal still routinely
| break when given "unexpected" inputs. Practically every other
| application can handle it.
| layer8 wrote:
| Unicode has provided a specification for Unicode identifiers
| since 2005: https://www.unicode.org/reports/tr31/
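|
| As a rough point of comparison (not a username validator):
| Python's own identifier syntax is based on the XID_Start and
| XID_Continue properties from that report, so str.isidentifier()
| gives a feel for what such a rule accepts. The sample strings
| are arbitrary.

```python
# str.isidentifier() checks Python's identifier syntax, which follows the
# XID_Start / XID_Continue scheme with a few language-specific tweaks.
samples = ["jwilk", "na\u00efve", "\u03c0", "2fast", "user name", "a-b"]

accepted = [s for s in samples if s.isidentifier()]
rejected = [s for s in samples if not s.isidentifier()]
```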
| rini17 wrote:
| Great! Is there a library for their validation? ICU seems to
| have only spoof checker for confusables.
| rurban wrote:
| libu8ident
| westurner wrote:
| rurban/libu8ident : https://github.com/rurban/libu8ident
| :
|
| > _unicode security guidelines for identifiers_
| westurner wrote:
| ICU: International Components for Unicode:
| https://en.wikipedia.org/wiki/International_Components_for_U...
|
| unicode-org/icu: https://github.com/unicode-org/icu
|
| Microsoft/ICU: https://github.com/microsoft/icu
|
| IDN: Internationalized domain name:
| https://en.wikipedia.org/wiki/Internationalized_domain_name
|
| Punycode: https://en.wikipedia.org/wiki/Punycode
|
| IDN homograph attack:
| https://en.wikipedia.org/wiki/IDN_homograph_attack
|
| CWE-1007: Insufficient Visual Distinction of Homoglyphs
| Presented to User:
| https://cwe.mitre.org/data/definitions/1007.html
|
| GNU libidn/libidn2: https://gitlab.com/libidn/libidn2
|
| Comparison of regular expression engines > Language
| features > Part 2; Unicode support:
| https://en.wikipedia.org/wiki/Comparison_of_regular_expressi...
| secondcoming wrote:
| Would punycode be suitable?
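|
| For reference, a small sketch of what that would look like,
| using Python's built-in "punycode" codec on an arbitrary
| sample name. The encoded form uses only ASCII letters, digits
| and hyphens.

```python
name = "b\u00fccher"

# Encode the Unicode name into its ASCII-only Punycode form and back.
ascii_form = name.encode("punycode")
restored = ascii_form.decode("punycode")

# Every byte of the encoded form is an ASCII letter, digit or hyphen.
only_safe_chars = all(chr(b).isalnum() or chr(b) == "-" for b in ascii_form)
```

| The encoded form is not readable by the person who owns the
| name, which is the usual objection to using it outside DNS.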
| tiahura wrote:
| When you think about all the time, money and effort that have
| been wasted on Unicode...
| kalleboo wrote:
| Yeah we should have all just stuck to Shift-JIS
| Joker_vD wrote:
| Vseki triabva da izpolzva latinitsa, absoliutno s'm s'glasen.
|
| After all, it's objectively the most perfect set of characters
| for any reasonable human language.
| febusravenga wrote:
| Random cross-language-script observation.
|
| In Bulgarian, latinitsa ("latin alphabet") transliterated to
| latin alphabet is just "latinitsa" or "latinica".
|
| In Polish "cyrillic" is "cyrylica" - basically reverse.
| pjc50 wrote:
| What's your preferred solution for representing the CJK
| languages?
| tiahura wrote:
| Computing did pretty well in the prior 50 years.
| pjc50 wrote:
| That's not an answer. Be specific. How do you want to
| represent the 97k CJK characters?
| vman81 wrote:
| I really don't want to be snarky or sarcastic, so I'll
| just be plain. Many people are unwilling or unable to
| understand a problem that doesn't affect them directly.
| Like - "UTF is woke" kind of people. They are out there.
| CorrectHorseBat wrote:
| Not for the majority of the world population who doesn't
| know English
| jcranmer wrote:
| I still remember the days when I couldn't use p and e in
| the same document, because there was no codepage that
| contained both of them. I also remember the days when
| pretty much any website that had non-English text had to
| have instructions on it for how to view it properly,
| because mojibake was so bloody common.
|
| (It should also tell you something that not only is there a
| name for "computers failed at charsets", but the name is
| Japanese.)
| umanwizard wrote:
| Only if you could expect a given person to only ever deal
| with one language. Anything international sucked and was a
| much bigger pain than now.
|
| It would be impossible to e.g. build a site like Reddit
| where people can comment in any language.
| vman81 wrote:
| Computing has improved massively over the last 50 years,
| not least because it now can accommodate people's diverse
| languages.
| kryptiskt wrote:
| No, it didn't. There were all kinds of encodings out there,
| and dealing with code pages was way worse than any
| inconveniences that Unicode has brought. Unicode was
| created for a reason, not just to torture US programmers
| with the diversity of scripts in the world.
|
| Maybe it was nice if you worked for a US company without
| any operations abroad, which includes absolutely none of
| those which mattered.
| account42 wrote:
| You still need to deal with "codepages" to differentiate
| between Japanese Unicode and Chinese Unicode even if it's
| called a language and not codepage now.
| CorrectHorseBat wrote:
| Han unification sucks indeed but if you get the wrong
| font it's still readable
| numpad0 wrote:
| Sometimes, not always. Depends on how similar specific
| characters happen to be.
| dotancohen wrote:
| Only if your name isn't Dong Jiu Er Gong Ren Yan Wang .
| throw0101a wrote:
| > _Computing did pretty well in the prior 50 years._
|
| Contra:
|
| * https://stackoverflow.com/questions/25812790/wrong-character...
| Muromec wrote:
| I had to, in the year of our lord 2024, deal with a certain
| non-unicode system that ate one specific Cyrillic symbol
| when producing an open data artifact mandated by law. It
| was never fun then and it still manages to create
| problems.
| account42 wrote:
| Something that doesn't unify different characters. So not
| Unicode.
| Cthulhu_ wrote:
| What alternative do you propose? I mean personally I think that
| emoji don't belong in unicode, but at the same time it's been
| integrated into society for many years now and it's made
| communications platforms so much more streamlined.
|
| But how else would you represent non-latin characters? More
| character sets?
| a3w wrote:
| > emoji don't belong in unicode
|
| Well, they are defined as: "an intermediate technology until
| we find a way to transfer images over data connections."
|
| So it was always a technology that was 40 years too late to
| the party?
| layer8 wrote:
| Without it, all textual data would need its own charset header,
| and you couldn't freely copy & paste between pieces of text
| with different charsets without creating mangled garbage. This
| was the situation before Unicode (except that charsets were
| often only implicit, so you had to guess which it is).
| card_zero wrote:
| > naming things is one of the hard things to do in computer
| science
|
| I've been thinking about that a lot lately. Code is text, it's
| arranged linearly, code has to be readable, identifiers are thus
| short strings that try to express short essays about the purpose
| of the variable or whatever it is, and then ideally there's a
| longer version of the essay in a comment, but not too long
| because that would clutter up the code as well (because it's
| text, arranged linearly). And we have code folding to tidy them
| up, for what good it does, and ideally an even longer version of
| the essay in documentation except nobody writes that.
|
| What if it wasn't text, and wasn't linear, and we didn't have an
| expectation that code should be strings of stupid over-terse
| names and hieroglyphic symbols? So I was thinking vaguely about
| investigating graphic-based programming, but it's probably worse,
| IDK. It could automatically assign arbitrary icons* instead of
| identifiers, and you could write tooltip-like comments to
| describe them as and when you want to, and everything could be
| laid out nicely with diagrams and different pages instead of like
| a text file. I suppose this is all merely cosmetic? The thing
| with the insistence on code being _written_ as strings of text
| feels very primitive, is all. It causes this problem.
|
| * Which doesn't solve the problem, I admit, because now you have
| to remember what the icons mean, but maybe that's easier?
| jstanley wrote:
| I don't think remembering the meaning of icons is easier,
| because in order to think about it you have to be able to
| pronounce it inside your head.
|
| And code isn't just linear, it can be spread across multiple
| files in a directory tree, functions can call each other, etc.
| c22 wrote:
| _> in order to think about it you have to be able to
| pronounce it inside your head._
|
| I'm not sure this is universal.
| vidarh wrote:
| Indeed, some people do not even have an inner voice, the
| same way some of us don't "see" things in our mind's eye.
| Neither prevents you from thinking about words or visual
| objects.
| pjc50 wrote:
| > I was thinking vaguely about investigating graphic-based
| programming, but it's probably worse, IDK. It could
| automatically assign arbitrary icons* instead of identifiers,
| and you could write tooltip-like comments to describe them as
| and when you want to, and everything could be laid out nicely
| with diagrams and different pages instead of like a text file.
|
| Have you ever read large electronic schematics? That's
| basically it .. except all the important things have to be
| identified by text anyway, because it's a massive challenge to
| the imagination to come up with two hundred different
| pictograms.
|
| Of course, if you really want your identifiers to be
| pictograms, why not just use kanji for your identifiers? The
| Japanese language and Unicode provide tens of thousands of
| ready made pictograms for your convenience!
|
| The only nonlinear programming environments that have really
| worked are the spreadsheet (which is still linear within each
| cell) and Labview. Possible shoutout to Unity blueprints, but
| when those get too complicated spaghetti .. people rewrite them
| in linear text code.
| card_zero wrote:
| _Sigh_
|
| I guess you're right. This has been a dimly-felt wish of mine
| for some 25 years, but probably pie in the sky.
|
| Edit: I see there are a _lot_ of visual programming
| languages.
|
| https://en.wikipedia.org/wiki/Visual_programming_language
| 9dev wrote:
| I don't think that has to be the answer, though. We can
| probably all agree that plaintext code is not the best form
| to represent the schematics of a process, and neither are
| images. But that seems to be a very limited set of options,
| and I wonder if there aren't any other dimensions to
| express what is essentially persisted chains of reasoning.
| For an example of alternative modes of input, have a look
| at the Reactable, a pretty innovative way to compose music.
| Sadly I think they didn't disrupt the music industry as
| they should have, but it's a pretty good example of a new
| way to think about making sounds.
|
| Edit: forgot the link. Here is: http://reactable.com
| WillAdams wrote:
| Longer than that --- I would argue it goes back to Hermann
| Hesse's _The Glass Bead Game_ (originally published as
| Magister Ludi) --- but Hesse seems to have gone out of
| style.
|
| That said, I keep trying various ones, and will keep hoping
| that someday someone will make a graphical tool able to
| make a GUI program.
|
| Nodezator seems promising.
| auxym wrote:
| > Have you ever read large electronic schematics? That's
| basically it .. except all the important things have to be
| identified by text anyway, because it's a massive challenge
| to the imagination to come up with two hundred different
| pictograms.
|
| As a mechanical engineer who works with Labview and Simulink,
| as well as more conventional code (python mostly), that is
| indeed a very good description. First glance at a large
| labview program feels very much like first glance at a large
| and complex electronics schematic. Lots of wire everywhere
| and you're not even sure where to start.
|
| I think a nice "best of both worlds" approach is a graphical
| "high level" view which shows the flow of data, at least for
| "data transformation" kind of programs, and code for the low
| level logic (what actually happens in the blocks). Sort of
| like nodal editors in Blender and NLE apps. Fortunately
| Simulink makes it easy to drop in a Matlab function call,
| Labview not so much (need to get into C FFI or use a really
| old version of .net or something).
|
| The thought I have about spreadsheets (might have read that
| on here), is that spreadsheets make the data visible and hide
| the code. Text-based programming hides the data but shows the
| code. I'm not sure what something that makes both code and
| data first class and visible would look like, but I'd be
| curious for sure (for engineering type applications at
| least). Best I've found so far (and what I actually use for a lot
| of data processing tasks) is a Jupyter notebook making
| plentiful use of df.head() and df.plot().
| umanwizard wrote:
| It's odd to say those characters come from the Japanese
| language when they were invented in China to write Chinese,
| are still used for that purpose, and were only introduced to
| Japan 2000 years later.
| taneq wrote:
| > The only nonlinear programming environments that have
| really worked are the spreadsheet (which is still linear
| within each cell) and Labview. Possible shoutout to Unity
| blueprints, but when those get too complicated spaghetti ..
| people rewrite them in linear text code.
|
| Not 100% sure what you mean by 'nonlinear' here (flow
| control?) but almost all industrial and mining equipment is
| programmed in visual languages on PLCs. Ladder Logic looks
| like, well, a stylized electrical drawing of a bunch of
| relays wired up to perform logical operations. Function Block
| Diagram looks like a PCB layout, but the 'integrated
| circuits' are function blocks (basically functors) and the
| 'traces' are copying data between the function
| blocks. Not great for implementing hardcore algorithms but
| you can do a surprising amount with them (once you get used
| to coding with both hands tied behind your back) and they
| sure are accessible to people who otherwise wouldn't be
| programming.
|
| Of course, as you say, when things get genuinely complicated,
| it's much nicer to use a 'real' programming language (or even
| just Structured Text, which is pretty much just Pascal).
|
| Then again, even with electronics, once things get complex
| enough don't we start using text (eg. VHDL)? Expressing
| designs is always a tradeoff between simplicity and
| 'obviousness' on the one hand, and representational
| efficiency on the other. Structured text sits right in the
| sweet spot between the two.
| jcranmer wrote:
| Graphical programming is one of those things that's often
| suggested as an improvement on textual programming, and just
| about every implementation tends to disappoint. I know, when
| working on compilers, that nearly every time I go "I think I
| want to see the CFG as a graph here," I tend to realize no,
| that's not quite what I wanted. For a complex function, the
| surprising superpower is just to have an editor that shows the
| opening brace line of every currently-open brace.
|
| Another case in point: when was the last time you saw someone
| use a flowchart to describe the pseudocode of an algorithm, as
| opposed to writing, er, pseudocode? Flowcharts used to be the
| dominant way to do this, decades ago, but they seem to me to
| have been thoroughly supplanted by pseudocode...
| WillAdams wrote:
| I think the problem here is that there isn't an agreed-upon
| answer for the question:
|
| >What does an algorithm look like?
|
| And any effort to answer it which gets beyond the size of a
| single diagram/screen/page/poster becomes a problem like:
|
| https://blueprintsfromhell.tumblr.com/
|
| https://scriptsofanotherdimension.tumblr.com/
|
| I like to think of myself as a visual person, and I wish
| there was a good solution here, and I keep looking for and
| trying different solutions other folks have made (current two
| iterations are BlockCAD and OpenSCAD Graph Editor) --- I'd be
| glad of other suggestions, esp. if able to make graphic user
| interfaces more complex than the OpenSCAD Customizer.
| card_zero wrote:
| Argh! Wire-wrapped backplanes! That wasn't the fantasy at
| all!
| WillAdams wrote:
| Yes, the fantasy is something like Hermann Hesse's _The
| Glass Bead Game_ which I mentioned elsethread --- what is
| the closest available tool to that?
|
| How do such tools manage the problem of
| encapsulation/modularity becoming the "wall of text"
| which one is trying to escape, just a pretty wall w/ all
| the labels in boxes decorated/connected w/ lines?
| AlienRobot wrote:
| The difficulty in naming things is that you're trying to encode
| semantics and an interface contract in a name. If you give up
| doing that, it's easy.
|
| For example, say you have getFoo(). It's clear it gets the foo.
| But later you introduce getFooAsync(). Suddenly it's no longer
| clear whether getFoo() is sync or async, because you didn't
| call it getFooSync().
|
| If instead you used names like getFoo1, getFoo2, getFoo3, etc.,
| the semantics you're providing is that there are multiple
| "ways" to getFoo without making promises (a contract) about
| what the function actually does in its name.
|
| Although this sounds like bad naming practices (it is), it
| effectively solves the naming problem. Apply this to CSS, and
| instead of .red-button or .secondary-button, you get .button1,
| .button2, .button3, and you just don't have to think about WHY
| are you creating a button to give it a class and start styling
| it.
| card_zero wrote:
| Yep, that sort of thing happens _constantly._ Things get
| misleading names because the first three alternatives I came
| up with were also misleading. So I agree, and indeed I
| considered a foo bar baz scheme instead of icons, same
| difference. Then you have to look somewhere else for what the
| thing does. Self-documenting code doesn't really work, and
| strict naming schemes are long-winded and worse than ad-
| libbing it, so it would have to be comments, but then the
| comments get forgotten and no longer reflect the code. I give
| up, might take up woodwork instead.
| mmsc wrote:
| I wonder how this will affect ssh. OpenSSH recently restricted
| more characters for valid usernames:
| https://github.com/openssh/openssh-portable/commit/7ef3787c8...
| cedws wrote:
| This is a great example of how one poor decision, or one piece
| of code that is too liberal cascades into an avalanche of
| shitty workarounds.
| throw0101a wrote:
| It should be noted that shell metacharacters are also not
| allowed under POSIX:
|
| *
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
| A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b
| c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3
| 4 5 6 7 8 9 . _ -
|
| *
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
|
| (Hyphen forbidden as first character.)
| linuxftw wrote:
| I think it will be fine. Everyone will quickly learn the lesson
| "Use something other than ASCII letters and numbers at your own
| peril."
|
| Similar to people who put spaces in file names, it should be a
| fire-able offense.
| lexicality wrote:
| any software that can't handle spaces in filenames is broken
| Muromec wrote:
| All of the software is broken (including security wise) all
| the time anyway.
| bdangubic wrote:
| this is exactly right... I spoke a few years ago with a
| mate who is a software dev at one of the major car
| companies... since then I wouldn't sit in the car from
| that company if my life depended on it...
|
| then I thought - if I spoke to any dev in any industry I
| would also stop doing whatever their software is
| controlling and end up moving to live with amish or some
| wilderness without electricity
| hiccuphippo wrote:
| Was that the fireable offense? I always thought the offense
| was not putting quotes around filenames in scripts.
| dfranke wrote:
| Allowing purely numeric usernames seems like a terrible idea to
| me, because it creates ambiguity between what's a username and
| what's a UID. It's common for tools like ls or ps to display a
| username when one is found and fall back to displaying a UID if
| it isn't, and similarly tools like chown will accept either a UID
| or a username and disambiguate based on whether it's numeric or
| not. Now suppose there's a numeric username that doesn't match
| its own UID, but does match some other user's UID. It doesn't
| take a lot of imagination to see how this would lead to
| vulnerabilities.
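|
| A toy model of that ambiguity, assuming a hypothetical user
| table (the names, UIDs and helper functions below are invented
| for illustration and do not mirror any real tool's code):

```python
# Hypothetical user database: name -> UID.
# The user *named* "1000" has UID 1001; alice has UID 1000.
USERS = {"alice": 1000, "1000": 1001}


def is_numeric(spec):
    """True only for plain ASCII digit strings such as "1000"."""
    return spec.isascii() and spec.isdigit()


def resolve_name_first(spec):
    """Treat spec as a username if one exists, else as a numeric UID."""
    if spec in USERS:
        return USERS[spec]
    if is_numeric(spec):
        return int(spec)
    raise KeyError(spec)


def resolve_numeric_first(spec):
    """Treat spec as a UID whenever it looks numeric, else as a username."""
    if is_numeric(spec):
        return int(spec)
    return USERS[spec]


# The same input names two different accounts depending on the tool's order.
name_first = resolve_name_first("1000")
numeric_first = resolve_numeric_first("1000")
```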
| throw0101a wrote:
| Talk to POSIX:
|
| > _A string that is used to identify a user; see also User
| Database. To be portable across systems conforming to
| POSIX.1-2017, the value is composed of characters from the
| portable filename character set. The <hyphen-minus> character
| should not be used as the first character of a portable user
| name._
|
| *
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
|
| The "portable filename character set" is defined as:
| A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b
| c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3
| 4 5 6 7 8 9 . _ -
|
| *
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
|
| So only a hyphen as the first character is forbidden.
|
| Given that you can't necessarily control where usernames come
| from (e.g., LDAP lookups), properly speaking your system has to
| handle everything anyway, even if you don't allow local
| creation.
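|
| The quoted rule is small enough to sketch as a check. The
| function name is made up, and this covers only the portability
| rule quoted above, not any particular distribution's policy.

```python
import re

# Portable filename character set: A-Z a-z 0-9 . _ -
# with the extra rule that a hyphen must not be the first character.
PORTABLE_USERNAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")


def is_portable_username(name):
    """Return True if name uses only the portable set and has no leading '-'."""
    return PORTABLE_USERNAME.fullmatch(name) is not None
```

| Note that a purely numeric name such as "1000" passes this
| check, which is exactly the case objected to upthread.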
| dfranke wrote:
| Yes, I'm aware, and POSIX has many such bugs that make
| command input or output unavoidably ambiguous if certain
| unexpected characters are present that they didn't think to
| prohibit. A lot of the revisions that went into POSIX 2024
| were aimed at fixing some of these, such as standardizing
| find -print0 and xargs -0. The fact that this one got
| overlooked doesn't mean it's a good idea to make the
| situation worse and harder for future POSIX revisions to
| address.
| bluGill wrote:
| It is time for POSIX to get with the times. Computers are
| used in more than the US and Canada (for the most generous
| interpretation of American in ASCII I'm including Canada,
| their French speakers will not be happy with that, not to
| mention first nations of which I know nothing but imagine
| their written language needs more than ASCII). UTF8 has been
| standard for decades now, just state that as of POSIX 2025
| all of UTF8 is allowed in all string contexts unless there is
| a specific list of exception characters for that context
| (that is they never do a list of allowed characters). They
| probably need to standardize on utf8 normalization functions
| and when they must be used in string comparisons. Probably
| also need some requirement that an alternate utf8 character
| entry scheme exist on all keyboards.
|
| The above is a lot of work and will probably take more than a
| year to put into the standard, much less implement, but
| anything less is just user hostile. Sometimes committees
| need to lead from the front not just write down existing
| practice.
| chikere232 wrote:
| Sounds like lots of work and a lot of new bugs for no real
| value.
| throw0101a wrote:
| > _It is time for POSIX to get with the times._
|
| "Be the change that you wish to see in the world." --
| Mahatma Gandhi
|
| It's free to join:
|
| * https://www.opengroup.org/austin/lists.html
|
| * https://www.opengroup.org/austin/
| atoav wrote:
| Sure, go ahead. Write the PR and make sure to test against
| all other things used in production.
|
| Let's talk again in 30 years when you're done.
| jerf wrote:
| Oh, it's been closer to 20 years for the rest of the
| world to catch up to Unicode than 30. We aren't at
| "perfect" now but we're certainly down to the trickier
| corner cases that are difficult to even see how you solve
| the problems at all, let alone code the solutions, and
| that's just reality's ugly nose sticking in to our
| pristine world of numbers.
|
| But there really isn't any other solution. Yes, there
| will be an uncomfortable transition. Yes, it blows. But
| there isn't any other solution that is going to work
| other than _deal with it_ and take the hits as they come.
| The software needs to be updated. The presumption that
| usernames are from some 7-bit ASCII subset is simply
| unreasonable. We'll be chasing bugs with these features
| for years. But that's not some sort of optional aspect
| that we can somehow work around. It's just what is coming
| down the pike. Better to grasp the nettle firmly [1] than
| shy away from it.
|
| At least this transition can learn a lot from previous
| transitions, e.g., I would mandate something like NFKC
| normalization applied at the operating system level on
| the way in for API calls:
| https://en.wikipedia.org/wiki/Unicode_equivalence Unicode
| case folding decisions can also be made at that point.
| The point here not being these specific suggestions per
| se, but that previous efforts have already created a
| world where I can reference these problems and solutions
| with specific existing terminology and standards, rather
| than being the bleeding-edge code that is figuring this
| all out for the first time.
|
| [1]: https://www.phrases.org.uk/meanings/grasp-the-nettle.html
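|
| A minimal sketch of what "normalize on the way in" could mean,
| using Python's unicodedata. The helper name is hypothetical, and
| str.casefold() here is the language-independent default folding,
| so it does not address locale-specific cases such as Turkish.

```python
import unicodedata


def canonical_username(name):
    """Hypothetical helper: NFKC-normalize, case-fold, then re-normalize."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Folding can produce sequences that are not normalized, so run NFKC again.
    return unicodedata.normalize("NFKC", folded)


# Compatibility characters and case variants collapse to one spelling.
ligature = canonical_username("\ufb01sh")      # "fi" ligature + "sh"
fullwidth = canonical_username("\uff21dmin")   # fullwidth "A" + "dmin"
sharp_s = canonical_username("Stra\u00dfe")    # German sharp s
```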
| atoav wrote:
| Don't get me wrong, I think using UTF-8 everywhere is how
| things should be.
|
| But this is not a "let's just" or "why don't we" type of
| endeavor. This is a _major_ undertaking, and as such
| people are needed who (A) think it is worth the effort
| and (B) are willing to follow through with all the
| consequences.
|
| Open Source software lives from contributions and if
| you're not willing to do it, why should others spend
| years of their lives for it?
|
| In the end this is a question of: are the benefits worth
| the effort? What do we win? Where do things get simpler?
| Where more complicated? How do you pull it off if half
| the distributions use UTF8 and the other half uses the
| legacy way? How would tooling deal with this split? etc.
| atoav wrote:
| To add a little bit of context:
|
| You know what I think would be way _worse_ than today's
| reduced character-set usernames with some special rules or
| "just" using utf-8 for them?
|
| Both. Imagine a world where some usernames are UTF-8 some
| are not and it is hard to figure out which is which. That
| would be worse than just leaving things as they are.
|
| Avoiding that situation makes pulling the whole thing off
| even harder, since there needs to be a high amount of
| coordination between many projects, distros etc.
| gray_-_wolf wrote:
| > Unicode case folding decisions can also be made at that
| point
|
| Ok I will bite. How do you intend to do case folding
| without knowing the language the string is in? Will every
| filename or whatever also have its language as part of
| the string? I am not sure what the plan is there.
| somat wrote:
| I would say it is not the place of posix to prescribe how
| it should be, the job of posix is describe what it is, a
| common operating environment. this is why posix is such a
| mess and why I feel it is not a big deal to deviate from
| posix, however posix fills an important role in getting
| everyone on the same page for interoperability.
|
| In my opinion the way to improve this, is bottom up, not
| top down. Start with linux (these days posix is largely
| "what does linux do?"), get a patch in that changes the
| definition of the user name from a subset of ascii to a
| subset of utf-8. what subset? that is a much harder problem
| with utf-8 than ascii, good luck. get a similar patch in
| for a few of the bsd. then you tell posix what the os's are
| doing. and fight to get it included.
|
| On the subject of what unicode subset. perhaps the most
| enlightened thing to do is the same as the unix filesystem
| and punt. one neat thing about the unix filesystem is that
| names are not defined in an encoding but as a set of bytes.
| This has problems and has made many people very mad. but it
| does mean your file system can be in whatever encoding you
| want, transitioning to utf-8 was easy (mainly due to the
| clever backwards compatible nature of utf-8) and we were
| not locked into a problematic encoding like on windows.
| perhaps just define that the name is an array of bytes and
| call it a day. that sounds like the unix way to me.
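|
| A rough sketch of the "names are just bytes" idea, using the
| surrogateescape error handler that Python applies to undecodable
| filename bytes on Unix. The sample name is a made-up example.

```python
# A name stored as raw Latin-1 bytes; it is not valid UTF-8.
raw_name = b"caf\xe9"

# surrogateescape smuggles undecodable bytes through a str as lone surrogates.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = as_text.encode("utf-8", errors="surrogateescape")

# Pure ASCII names are byte-for-byte identical in ASCII and UTF-8.
ascii_name = "emollier"
same_bytes = ascii_name.encode("ascii") == ascii_name.encode("utf-8")

print(repr(as_text), restored == raw_name, same_bytes)
```

| The round trip works because nothing interprets the bytes; the
| cost is that two tools may display the same name differently.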
| tssva wrote:
| "however posix fills an important role in getting
| everyone on the same page for interoperability."
|
| Isn't that exactly what the posix username rules are
| doing? Specifying a set of characters which are portable
| across systems to allow for interoperability between
| current and legacy unix systems along with most non-unix
| systems.
|
| "Start with linux"
|
| Which linux? Debian/Ubuntu, Redhat/Fedora, shadow-utils,
| and systemd all differ.
|
| "get a patch in that changes the definition of the user
| name from a subset of ascii to a subset of utf-8"
|
| ASCII is a subset of UTF-8 so the POSIX definition
| already specifies a subset of UTF-8.
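|
| For reference, a sketch of a check for the portable user name
| rule as I understand the POSIX definition: characters from the
| portable filename character set, with no leading hyphen. The
| function name is hypothetical, and individual distributions apply
| their own stricter or looser rules.

```python
import re

# Portable filename character set (A-Z a-z 0-9 . _ -), no leading hyphen.
PORTABLE_USER_NAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")

def is_portable_user_name(name: str) -> bool:
    """Return True if name uses only the portable character set."""
    return PORTABLE_USER_NAME.fullmatch(name) is not None

for candidate in ["jwilk", "-rf", "1234", "\u00e9mollier", "joe smith"]:
    print(repr(candidate), is_portable_user_name(candidate))
```

| Note that a purely numeric name such as "1234" passes this check,
| which is why the numeric-username question is a separate debate.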
| PhilipRoman wrote:
| Some practical concerns I have with UTF-8 are similar (or
| even the same, depending on font) characters which can be
| used in malicious ways (think package names, URLs, etc),
| not to even mention RTL text and other control characters.
| Every time I add logging code, I make sure that any
| "interesting" characters are unambiguously escaped or
| otherwise signaled out-of-band. Having English as an
| international writing standard is perfectly fine and I say
| that as a non-native speaker with a non-ascii name.
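|
| One way to do that kind of escaping in Python, as a sketch: the
| unicode_escape codec renders everything outside printable ASCII
| as backslash escapes. The helper name and sample strings are
| assumptions for illustration.

```python
def escape_for_log(text: str) -> str:
    """Render text as printable ASCII, escaping everything else."""
    # unicode_escape turns non-ASCII and control characters into \x, \u, \U
    # escapes and doubles backslashes, so the result is unambiguous.
    return text.encode("unicode_escape").decode("ascii")

spoofed = "adm\u0456n"        # Cyrillic "i" (U+0456) in place of Latin "i"
rtl_trick = "abc\u202etxt"    # RIGHT-TO-LEFT OVERRIDE hidden in the string

print(escape_for_log(spoofed))
print(escape_for_log(rtl_trick))
```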
| abdullahkhalids wrote:
| A good chunk of the world does not speak english or latin
| character based languages. They should be able to
| interact with computers completely in their own languages
| and alphabet sets, even if those are written right-to-
| left or top-to-bottom.
|
| Of course, someone has to do the work to make this
| possible. And no one is obliged to do it. But to suggest
| that such work should not be done at all does not sit
| right.
| hnthrowaway6543 wrote:
| > A good chunk of the world does not speak english or
| latin character based languages.
|
| nearly everyone in a first world country knows the
| English alphabet though. a vast majority of the
| developing world as well. just look at street view on
| Google maps in any country, there's going to be a ton of
| street signs using English characters, even in non-
| touristy areas.
|
| > They should be able to interact with computers
| completely in their own languages and alphabet sets, even
| if those are written right-to-left or top-to-bottom.
|
| if you're a typical android/ios end user you're
| interacting with a computer in your native language
| anyway. this discussion only applies to low level power
| users.
|
| in that case: why? these aren't user-facing features.
| this is like saying that people should be able to use
| symbols native to their language rather than greek
| letters when writing math papers.
|
| it might not be "fair" that English is overrepresented in
| computing but it also hasn't demonstrably been a barrier
| to entry. Japan, Korea and China have dominated,
| particularly in hardware.
|
| if you think it should be fixed why stop at usernames?
| why represent uids with 1234 instead of 一二三四 ?
| abdullahkhalids wrote:
| > if you're a typical android/ios end user you're
| interacting with a computer in your native language
| anyway. this discussion only applies to low level power
| users.
|
| I don't think you realize how poor this experience is.
| Partly the reason being that the underlying system is so
| english focused, that app developers have to do so much
| work to get things working.
|
| > if you think it should be fixed why stop at usernames?
| why represent uids with 1234 instead of 一二三四 ?
|
| I mean, if the computers had first been built in south
| east asia, they would have been.
| hnthrowaway6543 wrote:
| it's certainly hard to localize everything but billions
| of people use ios/android in India, China, SEA, MENA,
| etc... i think it's fair to say that at the end user
| level, computers are in fact usable by non-English
| speakers.
|
| individual apps may not be as usable, but that's on the
| developers. good counter-example, a lot of japanese
| games, even made within the past 5 years, require setting
| the Windows system locale to Japanese to function
| properly. and as someone who played a fair number of
| japanese doujin games in the 00s/10s, it used to be
| _every_ game with this problem.
|
| > I mean, if the computers had first been built in south
| east asia, they would have been.
|
| debatable as CJK heavily use Arabic numerals everywhere,
| but even if they did, so what? you'd learn those symbols
| and get used to it. the same way that if you're a unix
| sysadmin you get used to only being able to use a small
| subset of ASCII characters for usernames.
| abdullahkhalids wrote:
| > it's certainly hard to localize everything but billions
| of people use ios/android in India, China, SEA, MENA,
| etc... i think it's fair to say that at the end user
| level, computers are in fact usable by non-English
| speakers.
|
| It's important to contextualize these discussions in
| socioeconomics. Computers are not just fun play things.
| They are serious tools used for economic activities.
| Their usage, through their design, has significant impact
| on the social systems of society. Non-latin-language
| speakers are able to use poorly localized computers, but
| they are only able to use them less well than the latin-
| language speakers. At least in South Asia, there is a
| huge economic divide between those who can speak English
| and those who can't, where causality runs both ways, and
| in more recent times exacerbated by the inability of some
| to use technology. And that economic divide then causes
| huge sociopolitical problems in societies.
|
| If computers are means for economic progress, we
| shouldn't put the condition that one has to somehow learn
| English to use them well. But isn't localization
| sufficient? No it isn't. Ignore even that localization
| requires some members of your language to be dual
| speakers. The current era of economic progress is
| characterized by software development. But if the only
| way you can develop software is to learn a foreign
| language, then surely we are denying economic progress to
| some communities.
|
| P.S. I will repeat. Nobody has to do any work to help
| other communities. But to assert that such work should
| not happen is plain wrong.
| hnthrowaway6543 wrote:
| you're confusing "speaking English" with "knowing the
| English alphabet." these things are orthogonal. 95%+ of
| people in those countries know the english alphabet. i
| just threw down google maps street view at a random spot
| in Phnom Penh and instantly found english letters visible
| from the street, on advertisements[0]. then i threw it
| down in a much smaller Thai city that i had never heard
| of, Nakhon Sawan, and instantly found English on the
| street.[1] i've been in China, Japan and Korea enough to
| know english characters are all over the place. the
| English alphabet is omnipresent _everywhere_, i think
| you fail to realize this. nobody who is using a computer
| in these places is getting confused by the english
| alphabet.
|
| > But to assert that such work should not happen is plain
| wrong.
|
| i assert it should not happen because it's not solving an
| actual problem, the same way that changing "x" and "y" to
| "k" and "t" in algebra doesn't solve a problem, and
| trying to "solve" it will lead to a monstrous amount of
| incompatibilities and confusion. here's a really good
| comparison: ipv6. IPv6 _is_ solving a problem, maybe in a
| way people disagree with, but definitely a real
| problem... and yet _we still can't make ipv6 fucking
| work_ after God knows how many years, and trying to get
| IPv6 networking at any sort of scale is a massive fucking
| headache. now we want to go through the same headaches to
| support... umlauts in usernames? yeah, no thanks.
|
| there's enough real work left to be done in the world
| that we shouldn't waste time with stupid makework like
| this.
|
| or maybe in 30 years i'll be able to call up IT support
| and say "hey i forgot my password, can you reset it? my
| username is Shen Wang s`wd. ... need me to spell that
| for you?"
|
| edit: somewhat ironically, HN swallowed a few of the
| unicode characters in my theoretical future username...
|
| [0] https://i.imgur.com/0WkG0ze.png
|
| [1] https://i.imgur.com/VhDR5Xh.png
| abdullahkhalids wrote:
| I am from Pakistan. At least in South Asia, there are
| english characters everywhere because the infrastructure
| is primarily designed for the rich english-speaking
| classes, while the poor are left behind. A serious
| political problem.
|
| I have seen many non-english speaking people interact
| with computers in English, both poor people and old folks
| in rich families who don't know English. They kinda
| recognize the shape of words, or they go by icons. They
| don't actually know the meaning of anything. They can
| only do a limited set of pre-memorized actions. Scamming
| them is easy. If they get stuck, they need to beg someone
| to help them.
|
| Again, I will say this. There are two problems here. One
| for users and one for developers. Users must be able to
| read in their own language. Developers must be able to
| develop in their own language.
| wongarsu wrote:
| > They kinda recognize the shape of words, or they go by
| icons. They don't actually know the meaning of anything.
|
| That's kind of true of a lot of English computer users
| too.
|
| But more to the point, what you are advocating for is
| translating the interface. Which I think nobody is
| against, and which is a common thing to do (at least for
| countries people care about, which sadly excludes a lot
| of the poorer parts of the world). The username prompt
| should read "username" in Pakistani. That doesn't
| automatically mean it has to accept non-ascii input too,
| as long as you accept unicode in the display name.
|
| > Developers must be able to develop in their own
| language.
|
| I learned coding in Pascal before I learned that "if" is
| an English word. English helps, but in the end keywords
| in programming languages and shell commands are only
| mnemonics. Knowing the translation helps but isn't
| necessary. What's important are documentation, tutorials
| and other resources in a language the developer
| understands.
| citrin_ru wrote:
| > nearly everyone in a first world country knows the
| English alphabet though
|
| And not only 1st world. Actually the bigger the country the
| more everything is localized - from dubbed films to food
| packaging labels. In a small country one would see more
| English/Spanish/French etc. because they don't have
| resources to localize everything.
| Muromec wrote:
| Oh no please, I don't want to have my linux username in
| Cyrillic. Thanks but no, thanks!
|
| I know enough linux to see 10 ways in which it will make
| things worse at some point.
| notpushkin wrote:
| This isn't quite black and white.
|
| Right now, I can set up and use Linux in my language,
| have my display name in my script, but my username and
| password are ASCII-only and are available on the standard
| English keyboard anywhere. If I run into trouble, I can
| SSH in _from any device in the world_ without any issue.
| I can just borrow a laptop from anyone, switch to English
| if needed, and jump right in.
|
| Having a common denominator set of characters for such
| things is just really, really useful. I'd rather focus on
| all the other things that need to be localised.
| folmar wrote:
| "Without any issue" is a stretch; using a French keyboard
| is a bad enough experience for passwords. Not everyone uses
| standard English keyboards.
| wongarsu wrote:
| The French keyboard is the most notable example of anyone
| using something other than query or quertz. Even Japan
| and China use an extended querty. But even with the
| French keyboard the only issue is that everything is in
| the wrong place, not that the standard 26 "English"
| letters don't exist or are hard to reach.
|
| Meanwhile using a, e or s in a username or password will
| make your life much harder once you are in a foreign
| country. Never mind any letter that isn't derived from
| the Latin alphabet.
| oarsinsync wrote:
| > something other than query or quertz. Even Japan and
| China use an extended querty
|
| qwerty
| citrin_ru wrote:
| I have an impression that people confuse learning English
| (which is hard unless your native language is a
| Germanic/Romance one) with learning to recognize and type
| Latin characters, which is easy, and people around the
| world already use the Latin alphabet without knowing any
| English. You may escape the Latin alphabet if you have spent
| your whole life in a remote village but for people living in
| cities around the world it should be familiar and not a
| barrier at all. It's hard to escape Latin characters in
| the modern world and this ship has already sailed like it
| or not (I mostly do).
| smitelli wrote:
| > similar (or even the same, depending on font)
| characters which can be used in malicious ways
|
| These are called "confusables" and boy does that well run
| deep: https://www.unicode.org/Public/security/16.0.0/conf
| usables.t...
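|
| A crude illustration of the problem, not the real confusables
| algorithm: this sketch guesses each letter's script from its
| Unicode character name and flags mixed-script strings. The
| helper name and sample strings are hypothetical.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Approximate the scripts in text from Unicode character names."""
    # Rough heuristic: the first word of a letter's name is usually its script
    # (e.g. "LATIN SMALL LETTER A", "CYRILLIC SMALL LETTER A").
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

latin = "paypal"
mixed = "p\u0430ypal"   # second letter is CYRILLIC SMALL LETTER A

print(latin == mixed)
print(sorted(scripts_used(latin)), sorted(scripts_used(mixed)))
```

| A real check would use the Unicode script property and the
| confusables data file rather than character names.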
| miki123211 wrote:
| > Computers are used in more than the US and Canada
|
| Even if you speak US (or Canadian) English exclusively,
| there are still some words that are just impossible to
| spell correctly in pure ASCII, e.g. résumé, café etc.
| drdeca wrote:
| "correctly". I don't consider it "incorrect" English when
| someone writes "cafe" or "resume". It seems to me a
| little bit pedantic to insist that those words must have
| the accent marks in order to be correct (when using them
| in English).
| sneak wrote:
| Yeah, loanwords are different words than the original
| word.
|
| The correct plural of "baby" in German is "babys".
| rurban wrote:
| Almost nobody supports string search and comparison API
| functions for unicode. The unicode security tables for
| unicode identifiers are hopelessly broken.
|
| Not even the simplest tools, like grep, support unicode
| yet. This didn't happen in the last 15 years, even if there
| are patches and libs.
| ygra wrote:
| Wasn't one way to make grep faster setting LANG=C to
| avoid using language-aware string comparison? If so,
| shouldn't Unicode be supported by default or what would,
| say, de_DE.UTF-8 actually compare to make it slower?
| patrick451 wrote:
| Honestly, I just don't care. UTF8 is excessively
| complicated. ASCII is simple.
| citrin_ru wrote:
| Unicode opens a whole can of worms. World is already full
| of software which in theory supports non-ASCII texts but in
| practice breaks for some use cases. It's easy to allow
| UTF8, it's hard to test all possible use cases and to
| foresee them to know what to test. Nowadays I use mostly
| English so don't see localization bugs but when I used my
| native language with software/internet (~10y ago) I've
| encountered too many bugs and avoided using non-ASCII in
| things like usernames/passwords, file names and other places
| where utf-8 may be allowed but causes problems later. Just
| allowing UTF-8 is rarely enough. Localization is hard so
| better to start with places where it is important.
| Usernames IMHO are not one of them.
| numpad0 wrote:
| NO. PLEASE DON'T. This wreaks havoc especially on East
| Asian users because Unicode is poorly supported in console
| on top of being binary non-canonical in both entry and
| display.
|
| Meaning,
| - :potato: OR :potatoh: may display as :eggplant: OR :potato:
| - isEqual(`:eggplant:`, `:eggplant:`) may fail OR succeed
| - trying to type :sequence: breaks console until reboot
| - typing :potato: may work but not :eggplant:
| - users don't know how to spell :eggplant:
| - etc.
|
| If you must, please fix Unicode first so that user entry
| and display would have 1:1 relationship. I do have Han
| Unification in mind, but I believe the problem isn't unique
| to the unification or East Asia.
| NoMoreNicksLeft wrote:
| > properly speaking your system has to handle everything
| anyway, even if you don't allow local creation.
|
| Honestly, I try not to be a pessimist, but this sounds like
| the opening narration to some dystopian doomsday movie.
| Titled something like _You're Not Wrong_, I suppose.
| macintux wrote:
| At the meatspace level, purely numeric usernames are
| problematic.
|
| I was working as a contractor at a Fortune 500 firm several
| years ago when they introduced a new ERP system which
| apparently encouraged the company to switch to numeric system
| IDs. Fortunately the technical teams, especially Linux support,
| objected and it was overruled, but I was just as worried about
| the communications problems that would result.
|
| When everyone has a system ID that matches a consistent
| pattern, like "YZ12345", IDs are easy to recognize in
| documentation and data. An ID like "1234567" could be
| practically anything.
| PhilipRoman wrote:
| I really like the concept of adding some redundancy to ids,
| like a prefix. It helps to disambiguate things (kind of like
| static typing). A good example is also bank numbers, which
| must be a multiple of 97 +1, enabling fast client-side
| validation against typos.
| cupantae wrote:
| Could you give a reference on this 97 rule? I'm intrigued.
| az09mugen wrote:
| I was also intrigued, so I searched and on wikipedia (
| https://en.wikipedia.org/wiki/International_Bank_Account_Num...
| ), in the section "Validating the IBAN" it is written:
|   Interpret the string as a decimal integer and compute the
|   remainder of that number on division by 97
|   If the remainder is 1, the check digit test is passed and
|   the IBAN might be valid
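|
| The quoted steps can be sketched in Python as below. The two
| preparation steps (moving the first four characters to the end
| and replacing letters with numbers, A=10 through Z=35) come from
| the same validation procedure but are not in the excerpt above.
| The sample IBAN is the commonly used GB documentation example;
| the function name is hypothetical.

```python
def iban_checksum_ok(iban: str) -> bool:
    """Apply the mod-97 check; this does not verify country-specific length."""
    compact = iban.replace(" ", "").upper()
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits (first four characters) to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace letters with numbers: A=10, B=11, ..., Z=35; digits stay as-is.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))  # documentation example
print(iban_checksum_ok("GB82 WEST 1234 5698 7654 33"))  # last digit mistyped
```

| Because 97 is prime, any single mistyped digit changes the
| remainder, which is what makes the client-side typo check work.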
| Spooky23 wrote:
| It's pretty common in places that handle Tax data.
|
| At the end of the day, pushing opinionated bullshit doesn't
| belong in utilities. If there's a security vulnerability,
| sell that and push for incorporation into NIST standards.
| hulitu wrote:
| > Allowing purely numeric usernames seems like a terrible idea
| to me
|
| "I'm not a number, i am a free man. Ha ha ha ha ha"
| kps wrote:
| "Who is UID 0?"
|
| "You are UID 6."
| wombatpm wrote:
| You have an off by one error. But I honestly don't know
| which you should change to with the spirit of the show.
| thephyber wrote:
| I am also worried about more subtle bugs caused by usernames
| that are not strictly only-numeric, such as "10e2" or
| "0xDEADBEEF".
| Ferret7446 wrote:
| It shouldn't be a problem as long as the system disallows a
| numeric username to be the same as an existing UID (excepting
| the case where the matching UID is assigned to said username).
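|
| To make the worry about numeric-looking names in this subthread
| concrete, here is a small sketch showing how ordinary Python
| parsers read such strings. The helper name is hypothetical, and
| other languages and tools will have their own parsing quirks.

```python
def numeric_readings(name: str) -> dict[str, object]:
    """Show how common parsers would interpret a username-like string."""
    parsers = (
        ("int", int),                            # plain base-10 integer
        ("float", float),                        # accepts exponent notation
        ("int_auto_base", lambda s: int(s, 0)),  # honors 0x / 0o / 0b prefixes
    )
    readings: dict[str, object] = {}
    for label, parse in parsers:
        try:
            readings[label] = parse(name)
        except ValueError:
            pass  # this parser does not treat the name as a number
    return readings

for candidate in ["1234", "10e2", "0xDEADBEEF", "alice"]:
    print(candidate, numeric_readings(candidate))
```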
| Spooky23 wrote:
| There's lots of dumb things that you can do. Where do the
| safety bumpers stop?
| pas wrote:
| wherever each community puts them?
| huhtenberg wrote:
| Sounds like a solution in search of a problem.
|
| And a disruptive solution with unclear side effects at that.
| johnisgood wrote:
| > If a keyboard input system provides the former sequence of
| bytes, but the username is stored in the login infrastructure
| using the latter sequence of [bytes], then a naive comparison
| will not find the user "émollier" in the system. Unicode defines
| in Annex 15 a few normalization forms as a way to work around
| this problem. But a correct use of these normalization forms
| still requires coordination and standardization among all
| programs accessing the data.
|
| ICU could work, but adds an extra dependency, there is also GNU's
| libunistring.
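|
| For a sense of what the normalization step looks like, here is
| a sketch using Python's standard unicodedata module instead of
| ICU or libunistring. It assumes the name in question starts with
| an accented "e"; the helper name is hypothetical.

```python
import unicodedata

composed = "\u00e9mollier"     # accented e as one code point (U+00E9)
decomposed = "e\u0301mollier"  # "e" followed by COMBINING ACUTE ACCENT

def same_user(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)            # naive comparison
print(same_user(composed, decomposed))   # comparison after normalization
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

| The catch named in the quote remains: every program that stores
| or compares the name has to agree on the same form.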
| resource_waste wrote:
| This is important because Debian-family is used on many servers?
|
| Debian seems to just squander resources on things a few powerful
| people care about.
|
| All my servers have been Debian-based, so I can't be too hard on
| them, but whenever I see someone recommend a Debian-family distro
| as a Desktop OS, I feel like I need to call the police.
| perlgeek wrote:
| Just imagine how many poorly-written shell scripts will break
| when we suddenly allow dollars, quotes, backticks and the like
| in usernames. Heck, even allowing spaces sounds like horror to me.
|
| On the display side, I'm sure most tools that display usernames
| won't make it easy to see if there are leading or trailing
| whitespace characters, double blanks, tabs etc in usernames.
|
| This sounds like support hell to me.
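|
| A sketch of the failure mode and one mitigation, using Python's
| shlex module to build a shell command. The hostile username is
| made up, and nothing here is executed by a shell.

```python
import shlex

hostile_name = "bob $(touch /tmp/pwned) `id` 'x'"  # hypothetical username

# Unsafe: interpolating the raw name lets the shell see $(), backticks, quotes.
unsafe_command = f"id {hostile_name}"

# Safer: shlex.quote wraps the name so a POSIX shell treats it as one word.
safe_command = f"id {shlex.quote(hostile_name)}"

# Parsing the safe command back yields exactly two words: "id" and the name.
print(shlex.split(safe_command) == ["id", hostile_name])

# repr() makes leading/trailing blanks and tabs visible when displaying names.
print(repr(" alice\t"))
```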
| gmuslera wrote:
| The problem could be old scripts or systems that don't handle
| UTF-8 (and they don't need to be the ones where the username
| was defined). I'm not sure whether e.g. the Bobby Tables trick
| could be done with characters whose UTF-8 representation gets
| read as pure ASCII.
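|
| On that last question, a quick check in Python: in well-formed
| UTF-8, every byte of a non-ASCII character has its high bit set,
| so a quote or semicolon byte cannot hide inside one. This sketch
| says nothing about tools that transliterate names or decode them
| with a different charset. The sample characters are arbitrary.

```python
samples = ["\u00e9", "\u0416", "\u4e2d", "\U0001f600"]  # 2, 2, 3 and 4 bytes

def has_ascii_byte(ch: str) -> bool:
    """Return True if any byte of ch's UTF-8 encoding is in the ASCII range."""
    return any(byte < 0x80 for byte in ch.encode("utf-8"))

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(), has_ascii_byte(ch))
```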
| Starlevel004 wrote:
| Breaking shell scripts sounds like a good idea to me. The
| faster they die the better the world gets.
| Rygian wrote:
| That's going to be a very bumpy road, even if everyone were
| to agree that the destination is appealing.
| bigstrat2003 wrote:
| Yeah for better or for worse compatibility is king. I
| _despise_ shell scripts, they are an absolute nightmare to
| work with and full of footguns. But they are so commonplace
| that people are not going to tolerate YOLO breaking
| changes.
| raverbashing wrote:
| Yeah I think ESH
|
| While we have more modern shells the fact that bash (or
| even sh) is the "common denominator" 30 yrs on is both good
| and awful
|
| We need a PowerShell for Linux
| ygra wrote:
| Not even that is free of footguns, especially around
| argument parsing and calling native commands.
| chikere232 wrote:
| Perhaps unix isn't for you?
| makeitdouble wrote:
| Thing is, they don't die. Instead you get the short end of
| the stick.
|
| You'd have to be pretty darn important for an org to fix
| their scripts because of your name or the username you
| created. Or it would need to happen at a larger scale, but
| then that wouldn't be so controversial in the first place.
| codedokode wrote:
| But spaces have been allowed in filenames since the 80s, hasn't
| software had enough time to adapt?
| michaelt wrote:
| Microsoft's Windows 95 put spaces into "c:\My Documents" and
| "c:\Program Files" so that developers targeting Windows were
| _forced_ to support spaces in filenames.
|
| Of course, in those days if an OS upgrade broke some third
| party software, the end user _paid for an upgrade_. So
| although Microsoft forced developers' hands, the developers
| all got paid for their trouble. And you'd only have your hand
| forced that way once or twice a decade.
|
| Windows at the time was also all about the GUI file-pickers.
| Breaking the command line? Shell scripts? What are those?
| toast0 wrote:
| And now it's \Users, presumably because after 20 years,
| Microsoft gave up?
| hwc wrote:
| Or someone got tired of typing long paths.
| Uvix wrote:
| They changed from \Documents and Settings to \Users in
| Vista, alongside other profile rejiggering (e.g.
| introducing AppData folders). By that point software had
| either been fixed or would never be fixed, so keeping a
| space in the name wasn't particularly useful.
| rcxdude wrote:
| It's still very common for usernames to have spaces,
| though.
| alterom wrote:
| _And now it's \Users, presumably because after 20 years,
| Microsoft gave up?_
|
| Only if you assume that people rarely have spaces in
| their Windows login names (e.g. "Joe Smith").
|
| Either that, or Windows users have learned to _not be
| scared of spaces_ in filenames, usernames, and _their own
| literal names_.
| numpad0 wrote:
| Windows set up with Microsoft Account uses abbreviated
| e-mail for user names, because UTF-8 breaks apps,
| including many East Asian apps.
|
| non-Western Windows users always knew never to use
| anything outside ASCII for usernames, passwords, or any
| programmatically used identifiers. It's English users
| that haven't learned it.
| throw16180339 wrote:
| IIRC, they changed it to get more value out of the 260
| character MAX_PATH. I know there was some sort of
| manifest to enable longer paths, but I'm not sure what
| the current status is.
| LegionMammal978 wrote:
| The status quo is that officially, you still have to both
| set a registry key (or equivalently, set an option in the
| Group Policy Editor) and add an element to each
| application manifest.
|
| The official workaround at runtime is to use the "\\?\"
| prefix with an absolute path to create an unrestricted
| verbatim pathname. For instance, the fs::canonicalize()
| function in Rust will always return such a pathname, to
| many programmers' dismay, since outside tools often choke
| on them.
|
| The unofficial workaround is to set the undocumented
| IsLongPathAwareProcess bit in the process's PEB. The Go
| runtime does this, but silently falls back to "\\?\"
| prefixes if the Windows version is too old.
|
| (Note that in general, canonicalizing paths is safer on
| Windows than on Unix-like systems, since open directories
| cannot be renamed.)
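|
| A sketch of the prefix workaround as a pure string operation,
| using Python's ntpath module so it runs on any platform. It
| only handles drive-absolute paths; UNC paths need a different
| prefix form and are rejected here. The function name is
| hypothetical.

```python
import ntpath  # Windows path rules, importable on any platform

VERBATIM_PREFIX = "\\\\?\\"  # the four characters \\?\

def to_verbatim(path: str) -> str:
    """Prefix a drive-absolute Windows path to make it a verbatim path."""
    if path.startswith(VERBATIM_PREFIX):
        return path
    drive, _ = ntpath.splitdrive(path)
    if len(drive) != 2 or drive[1] != ":" or not ntpath.isabs(path):
        raise ValueError("only drive-absolute paths like C:\\dir are handled")
    # Verbatim paths are not normalized by Windows, so resolve "." and ".." here.
    return VERBATIM_PREFIX + ntpath.normpath(path)

print(to_verbatim("C:\\a\\..\\b\\file.txt"))
```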
| 3eb7988a1663 wrote:
| OneDrive breaks that convention. Last two companies I was
| at, the corporate location was something like
| "$HOME/OneDrive - $COMPANY". That the two companies had
| the same format tells me it is a default and/or suggested
| practice for some reason.
| bigstrat2003 wrote:
| That doesn't sound right. Microsoft is _obsessed_ with
| backwards compatibility, going so far as to accommodate
| programs that were _writing to Windows' private memory_
| just to preserve it. Deliberately breaking programs isn't
| in their ethos at all.
| sltkr wrote:
| The new filesystem APIs were introduced with Windows 95,
| so there was no backward compatibility to break. _New_
| programs using those _new_ APIs were forced to support
| spaces in directories. Using spaces in the system
| directories forced application developers to consider
| that scenario and deal with it appropriately.
|
| Meanwhile, DOS and Windows 3.1 applications that did run
| on Windows 95 could access files under a backward
| compatible 8.3 scheme, like C:\Progra~1\ instead of
| "C:\Program Files".
| bigstrat2003 wrote:
| That's a good point, thanks for pointing it out.
| michaelt wrote:
| I'm thinking of the transitions from Windows 3.1 to
| Windows 95 (IIRC introducing 32-bit and filenames longer
| than 8 characters) and the transition from Windows 95 to
| Windows XP (IIRC introducing a proper permission system,
| thus breaking anything that relied on being able to write
| things outside of user-owned folders)
|
| I agree they were famously accommodating in those days.
| But they also had enough market power that if they said
| users could only write to one folder and it had a space
| in the filename, developers who disliked it couldn't vote
| with their feet.
| lousken wrote:
| And yet... if you create a user using a display name e.g.
| Peter Cenicka in AAD and deploy a PC with intune you will
| get home folder called PeterCenicka.[0] It breaks SO MANY
| things. And no, that beta UTF8 system wide setting does
| not work with 3rd party apps.
|
| I just don't understand why they don't use part of the
| email address as the home folder name. And just because
| of this stupidity, user display names have to be without
| any of these characters
|
| Microsoft ... PLEASE
|
| [0] https://doitpshway.com/do-not-use-diacritics-in-aad-
| user-dis...
| dizhn wrote:
| A lot of software still had issues and asked the user to
| use C:\Directory directly. Some probably still do.
| reginald78 wrote:
| I remember trying to install Visual Studio in the mid-
| late 2000s (when SSDs made hard drive space small again)
| to a directory other than C: and found that after
| following a rather convoluted process you could only
| actually move maybe 20% of the install files off C:.
| StefanBatory wrote:
| It is still the same. :(
| yonatan8070 wrote:
| I've seen some things installing directly into C:\,
| NVIDIA's software jumps to mind
| akira2501 wrote:
| C:\Progra~1
|
| They didn't force anything.
| moritzwarhier wrote:
| Did they intentionally use only folder names with spaces
| that are at least 9 characters long and with the space
| after the first 6, so that the 8.3 version contains no
| spaces?
|
| Pretty clever if so :D
| volemo wrote:
| What 'bout "C:\My Documents" though?
| cobbaut wrote:
| That came later, end of 1996 with OSR2.
| repiret wrote:
| A space in an otherwise 8.3 file name would still be
| treated as a long file name and get a ~1 short name alias.
| moritzwarhier wrote:
| Thanks for the clarification!
|
| I was curious about a deep dive into this topic, and
| skimmed the MS doc pages after a Google search. They
| mentioned different Windows APIs and Long file names, but
| the only mention of the tilde compat layer I found was
| very superficial ("some file-systems" use the tilde as
| special character), so I abandoned my initial interest in
| getting up to speed on this during a 2min weekend read.
| deltarholamda wrote:
| My last name has an apostrophe in it. This isn't super weird
| or anything, there have been "O'Haras" and "O'Neills" (with 2
| Ls) forever.
|
| And yet whenever I deal with a computer system I don't put
| the apostrophe in because even in 2024 it is completely
| jacked up. Sometimes it's just disallowed. Sometimes I get
| "\\\'" showing up. Sometimes I get "'". I've seen
| "’". One time, one system accepted it, but another
| system that accessed the same data didn't allow apostrophes
| so the person using the second system couldn't access the
| record, and it took 2 phone calls and 3 people to come up
| with a workaround.
|
| It doesn't work often enough that I don't even try anymore.
| There are just too many opportunities for it to get forgotten
| or handled improperly from all directions.
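|
| Most of those symptoms look like string escaping gone wrong at
| some layer. As a sketch of the database side only, a
| parameterized query stores the apostrophe without any escaping.
| The table and column names are made up for illustration.

```python
import sqlite3

surname = "O'Neill"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (surname TEXT)")

# The ? placeholder passes the value separately from the SQL text, so the
# apostrophe needs no escaping and cannot terminate a string literal.
conn.execute("INSERT INTO people (surname) VALUES (?)", (surname,))

stored = conn.execute(
    "SELECT surname FROM people WHERE surname = ?", (surname,)
).fetchone()[0]
print(stored)
conn.close()
```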
| soneil wrote:
| I had fun in the vmware-broadcom transition because the
| broadcom portal doesn't allow that, but the vmware portal
| did. Not even in my username, just in the surname field.
| The new portal ate it on that so hard, I wasn't even
| allowed to create a ticket to do anything about it.
|
| Not as bad as when I was once issued a first.o'last@corp
| email address though ..
| mixmastamyk wrote:
| There may be a Unicode character that looks like
| apostrophe but has no quoting semantics. I use an arrow
| instead of greater-than symbol in my prompt for the same
| reason. To avoid copy/paste issues.
| jcranmer wrote:
| Non-ASCII characters in email addresses have even worse
| compatibility issues than punctuation characters.
| Punctuation fails because people don't know the standard.
| Non-ASCII fails because people don't know the _latest_
| standard.
| deltarholamda wrote:
| >Not as bad as when I was once issued a first.o'last@corp
| email address though
|
| Oh, man, that happened to me too, way back in the late
| 90s. I had forgotten about that.
|
| It broke things all over the place. Even now you run into
| the occasional validator that is convinced that the plus
| sign is not valid in email addresses.
| mschuster91 wrote:
| > Even now you run into the occasional validator that is
| convinced that the plus sign is not valid in email
| addresses.
|
| These are intentional IMHO - force people to use their
| actual email address so a potential breach can't be tied
| back to the service. That's the _only_ reason why someone
| would use a + in the first place.
| tolciho wrote:
| Some validators are silly regular expressions that
| someone wrote in a minute without thinking about it
| ("Mastering Regular Expressions" has a regex associated
| with it for better matching an address; that regex is
| quite the sight to behold). And disallowing + is a crummy
| solution to whatever "force people to use their actual
| email address" means given that someone with full control
| of a domain can invent the alias
| whatevertheywant@example.org instead of using something
| with a + in it, or they can spin up an alternate address
| on some alternate provider, etc.
|
| Other reasons folks use + in their email is to do mail
| routing (except where crappy web services disallow the +
| because they relied on a crappy regex) but then again I
| have no idea what "potential breach can't be tied back to
| the service" is meant to mean.
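|
| As a rough illustration of the alternative to a hand-rolled
| regex, here is a deliberately loose check that accepts plus
| signs, apostrophes and long TLDs. The function name and the
| example addresses are hypothetical, and this is a sketch under
| the assumption that the mail server makes the final call on
| deliverability, not a full address parser:

```python
def plausible_email(address: str) -> bool:
    """Loose sanity check: split at the last '@', require non-empty sides."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    if any(ch.isspace() for ch in address):
        return False
    # Everything else (plus signs, apostrophes, long TLDs) is left to the
    # mail server, which is the only real authority on deliverability.
    return "." in domain and not domain.startswith(".") and not domain.endswith(".")


for candidate in ("first.o'last@corp.example", "user+tag@example.org", "no-at-sign"):
    print(candidate, plausible_email(candidate))
```

| Sending a confirmation mail is usually a more reliable test
| than any pattern match.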
| mschuster91 wrote:
| > but then again I have no idea what "potential breach
| can't be tied back to the service" is meant to mean.
|
| Easy. Say I subscribe as "username+servicename@gmail.com"
| everywhere, when I get spam at that email address that
| service must have been either breached or sold off my
| data.
| jonathanlydall wrote:
| More likely just a default.
|
| I built the authentication system on our website and as a
| regular user of Gmail + aliasing I was very surprised
| when my brother pointed out our website didn't allow
| them.
|
| Turns out the default for Microsoft's ASP.NET Identity
| Framework is to disallow special characters, but simply
| setting a flag in its configuration rectified this.
| graemep wrote:
| > And yet whenever I deal with a computer system I don't
| put the apostrophe in because even in 2024
|
| In usernames or in name fields for text generally?
|
| I assume things like bank systems can deal with it because
| they should match things like IDs?
| deltarholamda wrote:
| Name fields in general.
|
| But sometimes I don't have control, e.g. another person
| is inputting the data and dutifully duplicates my name.
| That's how I ended up with the 2 phone calls/3 person
| situation, which happened about a month ago.
|
| Hell, my driver's license is missing the apostrophe
| because the system doesn't accept it.
|
| When somebody is trying to find me in a computer there's
| a whole litany of things they have to try, including
| assuming "First O'Lastname" got bashed into "First O.
| Lastname".
|
| I think about this every time I read an article extolling
| the wonders of technology.
| tsimionescu wrote:
| Generally, countries' systems only handle characters in
| names that are common in that country. Virtually no
| banking or ID system in Europe or the USA will handle
| Chinese names, for example. Even if they did at the
| technical level, it wouldn't actually help at a holistic
| level, because people who interact with these systems
| (bank tellers, policemen, etc) can't be expected to
| recognize any writing system in the world.
|
| So, the reality is that you have to adapt to the country
| you're trying to live or do business in and the name
| systems that they can actually use. This can even mean
| you have to adopt a name that people can actually
| pronounce, as many Chinese people do when interacting
| with people outside East Asia
|
| For example, Chinese is particularly sensitive to tone
| accent, which extremely few people outside that area can
| even distinguish, leading to hopeless mispronunciation.
| Consider that Ma2 and Ma4 are completely different words
| for a Chinese speaker, while a French speaker who hasn't
| studied this wouldn't even be able to tell that you are
| intentionally pronouncing things differently and not just
| your intonation.
|
| And for a reverse example, if you want to move or do
| business in Japan, you should adopt a well-known Japanese
| pronunciation of your name, as otherwise Japanese
| speakers, who have an extremely limited syllable
| inventory compared to most other languages in the world,
| will just not be able to follow your name.
| graemep wrote:
| That is true, but I think this example shows systems
| being too restrictive. If people can read Latin letters
| the system should accept apostrophes.
| jorvi wrote:
| > One time, one system accepted it, but another system that
| accessed the same data didn't allow apostrophes so the
| person using the second system couldn't access the record,
| and it took 2 phone calls and 3 people to come up with a
| workaround.
|
| There's still a lot of organisations that somewhere in
| their e-mail processing chain cannot deal with 4-letter
| TLD e-mail addresses*. Even worse is that the front-end is
| often a relatively new framework and will happily accept
| your e-mail, only to then have it silently fail forever.
| Mercifully a lot of those organisations have their customer
| service authorized to change your e-mail address manually,
| but if they don't.. good luck.
| wongarsu wrote:
| NPX on Windows was broken for years when your username had a
| space in it. Never underestimate how long bugs can stay
| around when they don't affect any of the developers and for
| everyone else the workaround is quicker than fixing it.
| slightwinder wrote:
| Problem is, the design of Unix shells is older, and they have
| some parts which automatically split on space if not handled
| carefully. This is really annoying.
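|
| The splitting behaviour can be reproduced without a shell. The
| sketch below uses Python's shlex module, which follows
| POSIX-style word splitting; the username and file name are
| hypothetical:

```python
import shlex

username = "john doe"  # hypothetical username containing a space

# Naive interpolation: the command line now has four words, not three.
naive = f"chown {username} report.txt"
print(shlex.split(naive))

# Quoting keeps the username together as a single argument.
quoted = f"chown {shlex.quote(username)} report.txt"
print(shlex.split(quoted))

# Shell metacharacters are neutralised the same way.
print(shlex.quote("bob;>/hacked"))
```

| Passing an argument list straight to the program, with no
| shell in between, avoids the splitting step altogether.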
| rossy wrote:
| For people using NSS modules like winbind, most of those
| scripts are already broken
| wolrah wrote:
| > Just imagine how many poorly-written shell scripts will break
| when we suddenly allow dollars, quotes, backticks and the likes
| in username. Heck, even allowing spaces sound like horror to
| me.
|
| If we're admitting they're poorly-written, why can't we admit
| that they're already broken regardless of whether that
| brokenness is currently being triggered? Allowing symbols or
| spaces didn't break anything, it was broken from day one just
| no one noticed.
|
| Why is the answer always "go out of your way to not upset the
| broken garbage that's been around forever" rather than "throw
| Zalgo at it and fix what breaks so it's no longer broken and
| won't be broken in the future"?
|
| Bug compatibility is the worst behavior of the computing
| industry. Let the bad code break and more importantly call it
| out so everyone knows where the blame belongs.
| tsimionescu wrote:
| Because people don't care about the presence or absence of
| bugs, they care about getting their work or leisure done with
| the help of the computer. If the computer isn't working, then
| they can't get their work done, and so they are mad at
| whoever broke it (for example by upgrading it, or by adding a
| username with spaces inside it). If it's working, then
| they're happy, no matter how dangerously on the precipice it
| is.
| raverbashing wrote:
| Yes yes it is
|
| Same for when people are being too clever and use a password
| generator with all the characters for things you need to
| call/pass on some types of config files
|
| No, you're not being smart for adding double quotes to a
| generated password, in fact _quite the contrary_. And guess who
| needs to try all types of escapes for that?!
|
| TFA seems like another of Debian's self inflicted problems by
| people trying to be "too smart"
| nmstoker wrote:
| Unfortunate ambiguous uses of the word drop throughout the
| otherwise excellent article
| TimK65 wrote:
| There are three uses of the word "drop," all of which are
| correct.
|
| The latter-day meaning of "drop" is an abomination.
| toast0 wrote:
| I dropped X off at Y. Then X dropped off the face of the map,
| never to be seen again.
|
| Many words and phrases in English are self-antonyms.
| fargle wrote:
| > The src:shadow package had dropped a Debian-specific patch,
|
| shoot, that's evil. had not noticed this. i read this as
| "removed", not "was released". now idk.
|
| this pseudo-definition of dropped as "released" is beyond
| stupid. yikes!
| account42 wrote:
| Always fun to see people poke the Unicode dragon only to be
| dumbstruck by its true size as it stands up in preparation of
| engulfing them with the fire of unintended consequences.
| beardygo wrote:
| Indeed. As a speaker of several languages, including an RTL
| language (they haven't even considered the problems with RTL
| marks etc), I say stay with ASCII for usernames, keep UTF for
| full names.
|
| If restricted ASCII a-z is good enough for passport names
| worldwide, it's good enough for usernames.
| macbr wrote:
| I'm confused - my name as written on my passport definitely
| contains non ASCII characters?
| extraduder_ire wrote:
| What is it in the machine-readable section at the bottom?
| My passport takes the apostrophe out of my name down there.
| belorn wrote:
| What is the point of a machine-readable name when there
| is a machine-readable passport number which should be
| unique for each issuing country? In this age I would
| assume that places which use machines to read passports
| also are connected to international databases where the
| unique number is checked for validation. My country also
| mandated passport with chips in them for the last couple
| of decades, so by now there are no longer any valid
| passports without such chip.
|
| If I had to guess, it seems the machine-readable section
| is just backward compatibility for machines built during
| the period where people started doing machine reading of
| passports but had yet to start putting chips into them.
|
| (as a fun side note, smart phones can read the chip on
| passports and this is then used by some digital identity
| providers to establish identity on account creation, in
| combination with the phone camera).
| Muromec wrote:
| There is no database to query unless you issued the
| document (except revocation database). There is a chip
| with CMS signed data in it and MRZ is used for key
| agreement to read the data.
|
| To know that MRZ and data aren't from a different person
| or document, they have the name in ASCII. It all kinda
| works and makes sense in the end.
|
| When you read the card with phone camera it uses mrz too
| belorn wrote:
| Looking it up, the mrz are only there to validate that
| the information stored on the document is the same as the
| information provided by the chip, and to make any
| eavesdrop attacks between the reader and the chip less
| likely to succeed. It's an optional standard.
|
| The data on the chip is authenticated through a country
| signing key. This part is mandatory and prevent the
| person who carries the document from falsifying the
| information on the chip. There is also an optional active
| authentication chip to prevent someone from copying a
| passport even if they have a copy of the MRZ and a copy of
| the traffic between chip and reader.
|
| The MRZ is also part of the older standard which is
| intended to be replaced by a newer system that has card
| access numbers, which mean that the mrz and the ascii it
| embeds could very well be gone from passports. This new
| standard was implemented in the EU by 2014, so there might
| be passports issued now without the MRZ.
| macbr wrote:
| Oh, yeah. No non-ASCII in the "machine readable" part.
| Though I've never seen anything use that section. My
| national id card also has a "machine readable" section -
| but that doesn't even contain my whole name: It's just
| cut off after 20 letters.
| Muromec wrote:
| You probably have ASCII-adjacent name to begin with, so
| people who can read some kind of language using Latin
| letters will simply ignore "funny dots and dashes" and
| pronounce it kinda wrong.
|
| It's on a different level from having a name originally
| written in a different alphabet entirely. At this point you
| just have it written in two scripts, with second being
| ASCII.
| mschuster91 wrote:
| > If restricted ASCII a-z is good enough for passport names
| worldwide, it's good enough for usernames.
|
| Passports (and credit cards) are the best example why ASCII-
| only is horribly broken. It's 2024, people want to type in
| their name as they write it normally, and they have the
| reasonable expectation of IT "dealing with it" behind the
| scenes.
|
| Unfortunately, that expectation isn't reality, and it's all
| too common people are being rejected at the border or their
| card transactions are denied because braindead policies leave
| no other option but to blanket deny in case of mismatches.
| tgbugs wrote:
| I made a design decision for a standard for dataset structure
| to explicitly ban characters beyond ascii [A-Za-z0-9.,-_ ]
| precisely because all the positivity around utf-8 often leads
| people to think that it comes with no additional complexity
| cost. There is an escape hatch with a way to indicate that a
| dataset uses unicode filenames but the standard states that any
| consumer may reject such datasets because unicode support is
| explicitly not required.
|
| I got pushback from people who would not have to implement or
| maintain the systems for being a backward asciite so seeing
| this article is rather vindicating.
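|
| One pitfall if such an allow-list is ever transcribed literally
| into a regular expression: inside a character class, ",-_" is
| read as a range from "," (0x2C) to "_" (0x5F), which quietly
| admits characters such as ";", "<" and ">". A minimal sketch,
| assuming Python's re module and hypothetical file names:

```python
import re

# Transcribed literally, ",-_" becomes a range from "," (0x2C) to "_" (0x5F).
literal = re.compile(r"[A-Za-z0-9.,-_ ]+")
# Moving the hyphen to the end makes it a plain character.
intended = re.compile(r"[A-Za-z0-9.,_ -]+")

for name in ("data_set-01.csv", "a;b", "x<y>"):
    print(repr(name), bool(literal.fullmatch(name)), bool(intended.fullmatch(name)))
```

| Only the second pattern rejects the names containing ";" and
| "<", which is presumably what a rule like this is meant to do.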
| miohtama wrote:
| I remember useradd and adduser when learning Linux and oh boy
| what a confusion it was... Why not just one command
| abigail95 wrote:
| if you cannot handle UTF-8 anywhere anything approaching text
| could be, your program is malformed and should be deprecated and
| removed.
|
| if you wrote code that couldn't handle bob;>/hacked in a
| username, you would and should be laughed at.
|
| why are we using this ancient stuff?
| knorker wrote:
| It's not just programs. And it's not just semantics of all-
| numeric username. It's also whether you want usernames that you
| cannot type, nor possibly even render.
|
| Definitely you can't spell it to someone else.
|
| Who owns that file? Oh, it's right-to-left non breaking space
| smiley snowman Chinese sign for water, I love that guy!
| abigail95 wrote:
| If people want to set up a Debian environment where people
| are mixing RTL and Hanzi I see no reason for that to be
| prohibited.
|
| Debian has opinions but I disagree that they should extend
| that far.
|
| If my employee Zalgo-fies everything. I don't file a bug
| report with Debian. I just fire them.
| Muromec wrote:
| >If my employee Zalgo-fies everything. I don't file a bug
| report with Debian. I just fire them.
|
| With such a clearly North American attitude you may as well
| use ASCII for everything.
| Izkata wrote:
| > Who owns that file? Oh, it's right-to-left non breaking
| space smiley snowman Chinese sign for water, I love that guy!
|
| This reminds me, around 10 years ago on the chat app we used
| at work, we were able to change our nicknames and I made mine
| start with a combining character instead of a regular
| character. No one could ping me, it broke that part of the UI
| when they tried.
| adrian_b wrote:
| This thread, like the parent thread, is full of comments
| which are completely outdated, because there already exist
| standards for Unicode identifiers and obviously they forbid
| such cases.
|
| See e.g. RFC 8264. Only a restricted set of characters is
| permitted in identifiers, mostly letters and digits.
|
| This is enough to write any user name, without allowing
| "smiley snowman Chinese sign for water" or other such
| nonsense.
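|
| To give a feel for what a "letters and digits" rule looks like
| in practice, here is a much simplified sketch built on Unicode
| general categories. It is not the RFC 8264 algorithm; the
| function name and the exact policy (letters, marks and decimal
| digits, no leading combining mark) are assumptions made for
| illustration only:

```python
import unicodedata


def acceptable_identifier(name: str) -> bool:
    """Hypothetical, simplified policy; not the RFC 8264 algorithm."""
    name = unicodedata.normalize("NFC", name)
    if not name:
        return False
    # A leading combining mark has no base character to attach to.
    if unicodedata.category(name[0]).startswith("M"):
        return False
    # Allow letters (L*), marks (M*) and decimal digits (Nd) only.
    return all(
        unicodedata.category(ch)[0] in "LM" or unicodedata.category(ch) == "Nd"
        for ch in name
    )


for sample in ("zo\u00eb", "user42", "\u2603", "bob\u200f", "\u0301abc"):
    print(repr(sample), acceptable_identifier(sample))
```

| A category filter like this rejects the snowman, the
| right-to-left mark and the non-breaking space, but a Chinese
| character such as U+6C34 is a letter and still passes.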
| drtgh wrote:
| With Unicode the same grapheme can be written with a sequence
| of one or more code points, and each code point can be a
| sequence of one or more code units.
|
| For example "å" can be written with U+00E5, and the same visual
| glyph "å" with U+0061 + U+030A (U+0061 {a} plus the code point
| U+030A {Combining Ring Above}).
|
| Another homoglyph Unicode user name example:
|
| * is Café == Café ?
|
| * C + a + f + e + U+0301 vs C + a + f + é
|
| * UTF-8: 43616665CC81 vs 436166C3A9
|
| As one user has pointed out in another comment, some kind of
| standardisation for that specific use case with some kind of
| normalisation would be needed first (nevertheless a database
| search would want a different one, and so on). The above
| examples are among the simpler ones, there are also unprintable
| characters, etc.
|
| It can be done as in "nothing is impossible", but it's not that
| easy, it's actually complex.
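|
| The two byte sequences quoted above can be reproduced, and
| reconciled, with Python's standard unicodedata module. This is
| a minimal sketch; choosing NFC as the comparison form is an
| assumption, not something any standard mandates for usernames:

```python
import unicodedata

decomposed = "Cafe\u0301"   # C a f e + U+0301 COMBINING ACUTE ACCENT
precomposed = "Caf\u00e9"   # C a f + U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(decomposed.encode("utf-8").hex())   # bytes of the decomposed spelling
print(precomposed.encode("utf-8").hex())  # bytes of the precomposed spelling

# Code point by code point the strings differ ...
print(decomposed == precomposed)
# ... but they compare equal once both are brought to the same form.
print(unicodedata.normalize("NFC", decomposed) == unicodedata.normalize("NFC", precomposed))
```

| Whichever form is chosen, it has to be applied both when the
| name is stored and when it is looked up.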
| abigail95 wrote:
| If a user picks a presentation layer that displays a from
| noncomparable alphabets, but has them look identical - that's
| a choice they can and should be able to make. I think it's
| dumb but I'm not here to hold anyone's hand.
|
| It's the user's choice whether 43616665CC81 == 436166C3A9,
| same for Café == Café. But they are distinct and separate
| choices. Text and bytes are separate things.
|
| We accept that case sensitivity exists and whether a
| user/business/program treats them as identical is and _should
| always be_ their choice to make.
|
| There is abstract complexity in the problem, but the context
| in which text is used solves most of that.
|
| If I have handwritten notes and I make a copy but write the
| second one in cursive and ask someone if they say the same
| thing - the correct answer isn't "we need to create a
| standard to normalize the presentation of text" - it's "be
| more precise in what you are asking".
|
| Whether Café == Café depends on if it's written on a road
| sign, or a network packet with a fixed byte size.
|
| Unprintable characters are not text and should not be stored
| in text fields. Neither are control characters, and as far as
| I'm concerned should not be included in any text encoding
| standard. Formatting and terminal processing _should never be
| stored in-band_ , that's an obvious design flaw that should
| be corrected.
|
| We already deal with ambiguity within ASCII re I vs l vs 1.
| Some fonts render those identically - Using those fonts in a
| passport is bad design. Saying we should avoid having to
| compare those characters at all because _some people/systems
| might confuse them_ is misguided.
|
| This isn't a true rebuttal of what you were saying but some
| of my next thoughts.
| alterom wrote:
| _> This isn't a true rebuttal of what you were saying but
| some of my next thoughts._
|
| I feel it's a rebuttal enough, and it provides a clear
| answer to the parent's question:
|
| * is Café == Café ?
|
| * C + a + f + e + U+0301 vs C + a + f + é
|
| * UTF-8: 43616665CC81 vs 436166C3A9
|
| When we're talking about username/password fields, what
| we're really talking about is _keystrokes_, or the _input
| sequences_ that the user makes to identify themselves.
|
| Android lock screen patterns are passwords, and the answer
| is blatantly clear there: the _same_ shape drawn in a
| _different_ way is a _different_ pattern.
|
| The context here isn't "are these two strings saying the
| same text".
|
| It's "is the person typing this text _who they say they
| are_ ", boiled down to "can they repeat the input sequence
| provided at registration".
|
| So, we get the answers:
|
| * _C + a + f + e + U+0301 != C + a + f + é_ if either can be
| _intentionally_ produced by the user at the log-in screen
| (i.e., if these Unicode sequences can be produced by
| different _keystroke sequences_, and the user knows which
| output they're producing)
|
| * _C + a + f + e + U+0301 == C + a + f + é_ if _either_ can be
| obtained as a result of the _same_ keystroke sequence
| (i.e., if virtual/physical keyboard + OS combinations may
| represent the same keystroke sequence with _different
| character sequences_ provided to the program).
|
| * If both are true, _neither should be allowed_
|
| The case of _not all input devices having the keys
| requisite for reproducing the input sequence_ would boil
| down to either deciding based on context, or _asking_ the
| user if they are sure they want to limit themselves to the
| particular hardware/software combinations to log into the
| service.
|
| For example, a username like BDZhILKA is perfectly fine
| _if_ you only ever want to log into the service from
| devices where a Ukrainian keyboard is available.
|
| Which would be an appropriate assumption for e.g. Ukrainian
| government systems, where Ukrainian language support is
| _required by law_, but not in a general context (what if
| user travels outside Ukraine, and wants to log in from a
| device they don't own and can't enable Ukrainian input
| on?).
|
| One can't hit the "Zh" key if their keyboard lacks it.
|
| Same goes for the concern raised in the article:
|
| _> I see and type my username hundreds times a day, people
| use it to address me in written and spoken conversations
| with it, etc._
|
| Good. That means that @BDZhILKA is only appropriate where
| _everyone can be assumed to be able to write and speak
| Ukrainian_ , which doesn't even hold universally true _in
| Ukraine_ , unless it's a government office.
|
| That's to say, most people reading this comment won't be
| able to address me as @BDZhILKA in either a _spoken
| conversation_ or a _written_ one (copy-pasting is not
| _writing_).
|
| At the same time, if I _can_ type "BDZhILKA", it should be
| my _choice_ to have that as a username/log-in name, since
| _only_ being able to log in from devices with a Ukrainian
| keyboard would be a _security feature_ for me. I know that
| I will have that on _my_ devices, but an adversary may not.
|
| Similarly, a log-in name like @SIRNIK _should_ be
| acceptable if I wanted it.
|
| Note that it's not the same as @CIPHIK - the former uses
| Ukrainian character set. @SIRNIK != @CIPHIK for
| authentication purposes because I typed in _different input
| sequences_ to produce these glyphs on the screen.
|
| This is not a Unicode issue either; ASCII with codepages
| for internationalization had the same problem. Homoglyphs
| aren't limited to accents or complex Unicode sequences.
|
| With Unicode, SIRNIK is not a problematic username -
| there's only _one_ way to type that particular byte
| sequence in. Before Unicode, it was, because the letters
| were encoded as different _bytes_ in KOI-8 (Unix) vs.
| Windows-1251 character sets, and the user didn't
| necessarily have a choice about _which one is being used to
| record their input_.
|
| The problem wasn't limited to log-in screens, of course; it
| resulted in hilariously unreadable words which have since
| been enshrined in memes, like "bNOPNIa" for "Vopros"
| ("question", a common first word in a chat message asking
| about how to make text readable).
|
| See, bNOPNIa (KOI-8) == Vopros (Windows-1251); same bytes.
| Whether to allow that as a log-in or password (e.g. on a
| Linux machine) depended on whether you wanted to allow the
| user to log in from Windows devices too.
|
| Obviously, for local accounts on Windows 95 machines, it
| was not an issue, as Windows encoding would be the only one
| available on a Windows log-in screen. The context gives all
| the answers.
|
| All of this directly follows from the "not a true rebuttal"
| you typed, and I frankly don't see what else there is to
| say on the matter, or how else to say what you said to get
| that point across.
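|
| The KOI-8 versus Windows-1251 mix-up described above is easy to
| reproduce with Python's built-in codecs. The Cyrillic spellings
| in this sketch are an assumption, reconstructed from the
| transliterated "Vopros" and "bNOPNIa" in the text:

```python
word = "Вопрос"                  # assumed Cyrillic spelling of "Vopros"
raw = word.encode("cp1251")      # the bytes a Windows-1251 system would store
misread = raw.decode("koi8_r")   # the same bytes interpreted as KOI8-R

print(raw.hex())
print(misread)

# In Unicode the two strings are simply different code point sequences,
# so neither can be mistaken for the other at the byte level.
print(word.encode("utf-8") == misread.encode("utf-8"))
```

| The byte string never changes here; only the table used to
| interpret it does, which is why the choice of encoding had to
| be agreed on out of band.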
| adrian_b wrote:
| The discussion thread at LWN has already mentioned standards
| for Unicode identifiers (RFC 8264 and RFC 8265), which
| prescribe how to handle all these problems, i.e. which
| characters should be allowed in identifiers and how to
| normalize and compare Unicode identifiers.
| anon-3988 wrote:
| Nah, you can use whatever you want for _display_.
|
| We have our tower of babel here and we are telling people not
| to use it? I am not even native English user btw. Having a
| lingua franca allowed me to understand someone from Russia,
| China, Japan, etc.
|
| Maybe once we have easily accessible ML translate nuances in
| one language to another without loss we can all talk in our own
| languages and just translate each others words.
| abigail95 wrote:
| I think people should be able to configure systems to handle
| a broad range of text from popular encoding standards like
| UTF-8.
|
| Limiting text-space because of communication is a strange
| objection that I don't think will hold up over time.
| numpad0 wrote:
| Unicode is a garbage standard that breaks apart so easily.
| That's why people hate ideas like yours. You're right in an
| ideal world but not in this baseline reality with Unicode.
| adrian_b wrote:
| Except that it is much better than anything that had
| existed before it.
|
| The earlier handling of non-English alphabets or writing
| systems was horrible in MS-DOS and Windows.
|
| While some serious mistakes have been made in the
| development of Unicode, its main principles were fine and
| it does not have any competition.
|
| Feel free to propose and implement a better standard.
| numpad0 wrote:
| Just skim other branches of this tree. Unicode is non-
| canonical in many ways.
|
| You can't guarantee that the same binary representation
| is reproduced on every machine.
|
| That kind of encoding system has no place "under the
| hood". That should be obvious.
| PhilipRoman wrote:
| I really love this powerless use of "should". If you spit on
| billions of lines of code, all you will get is a dry mouth. The
| reality defines "what is", unless you have lots of tanks and
| people under your control, in which case you can change the
| reality.
|
| There is tons of useful code which you will likely never
| encounter, that helps people accomplish their tasks every day.
| Do you think there is some central authority who is going to go
| building to building and dd if=/dev/zero every shell script
| they find?
| abigail95 wrote:
| This is a contemporary discussion, today, concerning hundreds
| perhaps thousands of lines of code. That's it.
|
| If someone is objecting to changes because of things like
| "bob;>/hacked". That is laughable, and I will continue to
| point and laugh. Imagine limiting URL encoding because of SQL
| injection.
|
| We can fix this, then fix the things that break - and then we
| can improve.
|
| Or we can ossify into stone. Your choice.
| PhilipRoman wrote:
| >if you cannot handle UTF-8 anywhere anything approaching
| text could be, your program is malformed and should be
| deprecated and removed.
|
| I was referring to this. Don't get me wrong, I also would
| love to make sweeping changes to many things in computing.
| I still think it is perfectly valid to impose reasonable
| limitations on input even if the program could
| theoretically handle it - it prevents all kinds of problems
| at the very root (like allocating disproportionate amounts
| of resources, infinite timeouts, etc).
| chikere232 wrote:
| oh yes, let's break things to gain nothing of value
| gspr wrote:
| Perhaps nothing of value _to you_.
|
| I'll hazard a guess that your preferred username can be
| expressed in a small subset of ASCII? And to hell with everyone
| else?
| knorker wrote:
| I'll hazard a guess that your preferred username can't be
| written by 99.99999% of the world, and would always have to
| be copy-pasted?
| Ylpertnodi wrote:
| Yeah, us foreigners, up to our usual tricks again.
| knorker wrote:
| By any definition of the word, I'm a foreigner.
|
| So if you meant to imply that I'm an American, you've
| guessed wrong.
| chikere232 wrote:
| If your personal identity is threatened by having to use an
| ascii alphanumeric login name, you're kind of creating
| problems for yourself for no reason...
|
| There is a field for the full name of the person if you want
| to, and at least on my linux it warns for non-ascii
| characters but allows them
| anon-3988 wrote:
| It's a give and take. If you allow for anything beyond latin,
| then you have to accept that there will be a class of
| software that will be difficult to interact with.
|
| Latin-like language system is simply superior for machine
| purposes. I am sorry, but I don't even want to think of
| supporting the entire unicode in my software. I am not going
| to even attempt to reverse that emoji.
| chikere232 wrote:
| It gets real fun when it's something you need to look up
| and have match, like a username.
|
| Because then it has to be normalised in the right way for
| comparisons to work, or it will only match if your input
| method happens to produce the exact same variant.
|
| ... And unicode is an evolving standard where this
| normalisation sometimes changes between standards, so the
| names as normalised in the old version of your standard
| library might disagree with the new version. So you need to
| care for that transition.
|
| ... And often this is implemented separately for different
| languages, so you can get names that won't match if you
| normalise them in python, java or C.
|
| ... And as all implementations, these unicode
| implementations sometimes have bugs, so you need to think
| not only about matching supported unicode versions, but
| matching bugs.
|
| ... And any change in these normalisations can in theory
| lead to two usernames that used to be distinct becoming
| identical.
|
| It's a deep well
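|
| A small sketch of how two distinct usernames can collapse into
| one. It does not reproduce drift between Unicode versions; it
| only assumes that one part of a system compares names under NFC
| and another under NFKC, and the example names are hypothetical:

```python
import unicodedata

# The Unicode database version this Python build was compiled against.
print("Unicode data version:", unicodedata.unidata_version)

a = "\ufb01sh"   # starts with U+FB01 LATIN SMALL LIGATURE FI
b = "fish"

for form in ("NFC", "NFKC"):
    same = unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
    print(form, same)
```

| Under NFC the names stay distinct; under NFKC they become the
| same string, so the form has to be fixed once and recorded
| alongside the data.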
| khaled wrote:
| > And unicode is an evolving standard where this
| normalisation sometimes changes between standards
|
| Unicode normalization is subject to its stability policy,
| and Unicode no longer allows adding new canonically
| equivalent code points.
|
| https://www.unicode.org/policies/stability_policy.html
| adrian_b wrote:
| There are many variants of the Latin alphabet and the
| English alphabet contains only a subset of the letters
| contained in the other variants.
|
| There is no reason to consider the English alphabet as
| superior for machine purposes, in comparison to other Latin
| alphabets.
|
| Its dominance in IT is caused by the fact that most of the
| development of commercial computers after WWII has been
| done at IBM and other US companies, not by any properties
| of the English alphabet.
| layer8 wrote:
| The issue is that it has already been broken (read: has allowed
| arbitrary byte sequences) for a long time, and the debate is
| about what to restrict it to.
| codedokode wrote:
| Don't you think that it would be better to get rid of usernames
| in UI? They only provide unique data for fingerprinting and do
| almost nothing useful on a single-user system. Wouldn't it be
| better to simply have a default name like "primary user" or "main
| user" for the first user and skip one step in installation
| process? Also it frees you from typing a username on login for a
| single-user system.
| eviks wrote:
| Single user systems can just not ask for a username if there is
| only one, they control the UI
| knorker wrote:
| So in the future I may not be able to even type the name of
| another user? Admins and other users not being able to type
| usernames sounds very bad.
|
| And I say that as someone whose native language has more letters
| than English.
| zvr wrote:
| Most people are too young to remember that when you typed your
| username in all-caps in the login prompt (because the CapsLock
| key was on by accident, for example), the login(8) program
| assumed you were in a connection that could only do 7-bit (upper
| case, but no lower case characters) and immediately switched the
| tty settings and you were then presented with a "\PASSWORD: "
| prompt.
| roelschroeven wrote:
| Don't you mean 6-bit? 7-bit ASCII supports lower case
| characters. Or maybe there are other 7-bit character sets that
| don't have lower case characters and it was one of those?
| jks wrote:
| PETSCII? On the Commodore 64 you could press the Commodore
| key and Shift together to change character sets between
| lowercase and the graphical characters.
|
| But the Unix login thing might have been because of
| teletypes?
| https://www.columbia.edu/cu/computinghistory/teletype/ claims
| that ASR 33 used 8-bit ASCII but was uppercase only - not
| sure if the "8-bit" claim can be true.
|
| On some Unix (and Linux) systems, you can still enter a kind
| of retro mode with "stty olcuc iuclc" (output lowercase to
| uppercase, input uppercase to lowercase) and turning on Caps
| Lock.
| zvr wrote:
| You are of course correct that 7-bit ASCII includes lower
| case characters. I don't think there exists "6-bit ASCII",
| but the original ASCII did not have lower case (the slots
| were empty). We're talking early '60s here.
|
| I'm not even sure it was only about ASCII. I suppose I should
| have written a more generic "character set" (which supports
| or not lower case characters) rather than "7-bit".
|
| In the cases where you could only communicate in a single
| case (upper), you typed the commands in the usual letters
| (e.g., "LS") and capital letters were designated by a
| preceding backslash (e.g., "ECHO \JOHN \DOE"). That's why you
| were seeing the "\PASSWORD: " prompt, the initial letter was
| capital (as it still is).
|
| Just for fun, I checked my current Debian system. The
| getty(8) command still supports it:
|
| -U, --detect-case
| Turn on support for detecting an uppercase-only terminal.
| This setting will detect a login name containing only
| capitals as indicating an uppercase-only terminal and turn
| on some upper-to-lower case conversions. Note that this has
| no support for any Unicode characters.
| soneil wrote:
| This reminds me of the systemd bug where usernames starting with
| a digit were mishandled (#15141).
|
| It seems to me like something that "should" be relaxed, but we
| need to have high confidence in the entire foodchain. adduser
| seems like the last place it should be changed, not the first -
| anyone requiring "enough rope" is already served by useradd.
| hwc wrote:
| My work machine uses my complete email address as a username
| (this was done by someone in the IT department). Vim gets
| confused when I use the `gf` command to open a path that contains
| an '@' character in it.
| bjourne wrote:
| Honestly, it is super brain-dead that Linux and other operating
| systems still have such massive problems with "special"
| characters. Just the other day I had to help someone who had
| trouble building. The cause turned out to be that they had
| dropped filenames with parentheses in the source directory which,
| apparently, confused bash which make relies on. Such trash is
| everywhere on Linux systems. Eventually you learn to only use
| [a-zA-Z0-9-_.] in names because anything else will inevitably
| confuse some tool or another (even capital letters can be a
| PITA)... I so wish someone would take it upon themselves to clean
| up this mess, but it's probably too much work and too many who
| are nay-sayers conditioned to it who don't see the need for
| changes.
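|
| A minimal sketch of the usual workaround for the parentheses problem
| above, assuming the file names end up in a POSIX shell command line.
| Python's shlex.quote is a real standard-library function; the
| build_command helper and the file names are hypothetical.

```python
import shlex

def build_command(program: str, *filenames: str) -> str:
    """Return a command line in which every word is quoted for POSIX sh."""
    return " ".join(shlex.quote(word) for word in (program, *filenames))

# Hypothetical file name containing characters an unquoted shell would interpret.
cmd = build_command("cat", "notes (draft).txt", "plain.txt")
print(cmd)

# Splitting the quoted line again recovers the original words unchanged.
print(shlex.split(cmd))
```

| Quoting only helps where the command line is built by code you
| control; a Makefile that interpolates names unquoted is still exposed.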
| hiccuphippo wrote:
| As someone who needs non-ascii characters to write my name:
| _please don't_. You are making things worse just to be
| "courteous" about something we don't care about and will actually
| be annoyed at if we have to find how to write a letter in the
| keyboard or, worst-case scenario, figure out how to change the
| layout to the correct one _before I even logged in_.
| jks wrote:
| Likewise. My last name contains a non-ascii character. In ~2009
| I started at a company whose admin conveniently set up an
| account for me on their Ubuntu server... on which no-one could
| then log in locally because the login manager crashed when
| trying to display the list of users. I logged in via ssh and
| changed my name to the nearest ASCII equivalent.
|
| I always feel slightly worried on sites that demand that I give
| my full legal name (such as the US ESTA form), and then refuse
| to handle it because it includes "illegal" characters.
| ASalazarMX wrote:
| This has happened to me with _passwords_ containing foreign
| characters. The system would accept it, but further logons
| would be impossible. Now I always strip diacritics to be
| safe.
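|
| One way to strip diacritics as described above, sketched with
| Python's unicodedata module. The strip_diacritics name is
| hypothetical, and the approach only removes combining marks: letters
| without a canonical decomposition (such as "ø") pass through as-is.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("contraseña"))
print(strip_diacritics("ø"))  # no decomposition, so it is left alone
```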
| jks wrote:
| A friend mentioned using control characters in passwords...
| like ^F and ^B, but not ^C because that's the interrupt
| character. Feels vaguely risky to me (does ^U empty the
| line? does ^W delete the last word? does your terminal
| emulator do some weird encoding like it does for cursor
| keys?) but if it works, why not?
| jowea wrote:
| I suspect I have run into a couple bugs because of
| password generators putting characters that some backend
| system cannot process in the password. Half wish they just
| did DKWhhjwqjkwqjmHSJKHAIUHQwdmlsadkl instead.
| hughesjj wrote:
| I remember in school learning that technically speaking
| on Unix you could have the backspace character as part of
| your password too
|
| But for the same reason with ^W and ^U I have no idea how
| you'd implement that in an interactive prompt without
| escaping
| beardygo wrote:
| Full legal name as appears on machine readable zone in your
| own passport. Allowed characters are A-Z only, see MRZ
| specifications:
|
| https://en.wikipedia.org/wiki/Machine-readable_passport
| Muromec wrote:
| What's a legal name? It presumes it's somehow different
| from other ... illegal names. But in which way? Which law
| has a say?
| beezlebroxxxxxx wrote:
| "Legal name" is a catch-all term that usually means
| "approved for use on government issued ID". Are there
| instances when that's not always the case and some forms
| of ID (not just, say, an ID card, but also in tax
| filings, for example) actually have different rules?
| Amazingly, sometimes yes. But usually that's what it
| means.
| Muromec wrote:
| I get what it could mean, but it's jurisdiction-bound and
| doesn't resolve unambiguously, doesn't match the MRZ and isn't
| always ASCII.
| Dylan16807 wrote:
| The name the legal system uses to refer to you.
| Muromec wrote:
| Legal system as in court of law? They tend to use more
| letters than I have in my actual passport (definitely
| more than fits into mrz) and depending on which court we
| talk about they also use different alphabets. They also
| assume a certain structure in those names, which differs
| from one court to another.
| Dylan16807 wrote:
| Are you using courts that insist on different alphabets?
| Then you have multiple legal names.
|
| And some operations are based on exactly what's on your
| passport.
|
| It's more than court, taxes are an important and relevant
| set of laws.
| Muromec wrote:
| Yes, I had the pleasure of dealing with two courts that use
| two different alphabets this year. One of the two
| referenced the other. The name as written by either of the two
| does not match what's actually written in my passport. It isn't a
| complicated name by any reasonable metric.
|
| Taxes are easier -- they just use IDs, and names are display-only
| kind of stuff, sourced from the base registry.
| doubled112 wrote:
| Just having an apostrophe in my last name causes me issues.
|
| Yes, that's me, Mr. O&Conner
| j-bos wrote:
| As another nonascii character named individual who's lost hours
| of life calling service reps for companies that used my utf-8
| name, I second this.
| zigzag312 wrote:
| I sometimes use this as a quick test of software quality. If it
| can't handle non-ascii characters in 2024, then it will
| probably be more trouble than it's worth.
| SuperSandro2000 wrote:
| They are clearly bored and want to start a year long bug hunt
| through half of unix
| Muromec wrote:
| That sounds like a good kind of bored and bug hunting through
| half the unix sounds like fun too.
| db48x wrote:
| Agreed! I'm half tempted to join the hunt myself just for
| kicks...
| kej wrote:
| I wonder if it would work to do something like the punycode
| system for internationalized domain names. Shell scripts could
| handle a name like `xn--0civ130n` just fine, and user-facing
| utilities could choose to convert that to :sparkle::unicorn: when
| appropriate. The same homograph protections would probably work,
| as well.
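|
| A rough sketch of the punycode idea above, using Python's built-in
| "punycode" codec. The to_ascii_label and to_display_name helpers are
| hypothetical, and this skips the normalization and homograph checks
| that real IDNA processing performs.

```python
def to_ascii_label(name: str) -> str:
    """Encode a Unicode name as an ASCII label with an 'xn--' prefix."""
    if name.isascii():
        return name
    return "xn--" + name.encode("punycode").decode("ascii")

def to_display_name(label: str) -> str:
    """Reverse to_ascii_label for labels that carry the 'xn--' prefix."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

stored = to_ascii_label("bücher")
print(stored)
print(to_display_name(stored))
```

| A real scheme would also have to reject plain ASCII names that start
| with "xn--", otherwise the mapping is ambiguous.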
| dsr_ wrote:
| I will remind everyone that there are a minimum of three
| identifiers here.
|
| The UID, which is an integer. Ownership resides here; it's the
| primary key. Can be used by programs.
|
| The username, each of which must be unique and maps to one UID --
| but multiple usernames can map to the same UID. Used by humans
| and programs to login.
|
| The GECOS field, or "human readable name", which is only used as
| a display label. Some systems include a structure inside this for
| additional info like phone number, office number, or similar. I
| don't think anyone would object to UTF-8 here.
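|
| The three identifiers can be seen side by side in passwd(5)-style
| records. The sketch below parses a made-up sample (the users and the
| parse_passwd helper are hypothetical) and shows two usernames
| sharing one UID, with a UTF-8 display name in the GECOS field.

```python
from collections import defaultdict
from typing import NamedTuple

class PasswdEntry(NamedTuple):
    username: str
    uid: int
    gecos: str

def parse_passwd(text: str) -> list[PasswdEntry]:
    """Parse lines of the form name:password:UID:GID:GECOS:home:shell."""
    entries = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _pw, uid, _gid, gecos, _home, _shell = line.split(":")
        entries.append(PasswdEntry(name, int(uid), gecos))
    return entries

# Made-up sample data, not taken from any real system.
SAMPLE = """\
root:x:0:0:root:/root:/bin/bash
toor:x:0:0:Alternate root:/root:/bin/sh
zoe:x:1000:1000:Zoë Müller,Room 12,555-0100:/home/zoe:/bin/bash
"""

entries = parse_passwd(SAMPLE)
names_by_uid = defaultdict(list)
for entry in entries:
    names_by_uid[entry.uid].append(entry.username)

print(dict(names_by_uid))
print(entries[2].gecos.split(",")[0])  # display name, first GECOS subfield
```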
| seu wrote:
| The fact that this whole discussion happens in English partially
| explains why there is a discussion at all. The whole problem
| could have been avoided if the development of computers had been
| a more international effort.
| seiferteric wrote:
| OMG Can't believe this, I ran into this exact thing at my last
| job. We discovered a security vuln in several of our services
| because we were accepting unsanitized usernames and doing things
| with them (passing them to scripts etc.), but only after passing
| them to useradd/usermod etc., so we thought they were safe. Of
| course you could put in things like ";", "&", ">" etc. and do
| whatever you want. I discovered that Debian DISABLED the username
| sanity checks and could not believe it. Anyway, I installed a
| patched version as well as sanitized input and other stuff to
| resolve the issue.
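|
| A sketch of one mitigation for the injection described above,
| assuming the scripts are launched from Python: pass the username as
| a single argv element instead of splicing it into a shell string.
| The echo_via_argv helper and the hostile name are hypothetical; the
| child process here just echoes its argument back.

```python
import subprocess
import sys

def echo_via_argv(username: str) -> str:
    """Run a child process with the username as one argument, no shell."""
    child = "import sys; sys.stdout.write(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", child, username],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Shell metacharacters arrive in the child as plain text.
hostile = "bob; echo pwned > /tmp/x"
print(echo_via_argv(hostile))
```

| This does not replace validation: a name starting with "-" can
| still be mistaken for an option by the program that receives it.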
| IshKebab wrote:
| > Most Debian users don't work with useradd, or groupadd,
| directly. Instead, Debian has long supplied its own adduser (and
| addgroup) utilities, originally written by founder Ian Murdock.
| These act as simpler front ends to useradd
|
| One of the dumbest things Debian has done.
| rurban wrote:
| They are so stupid, I cannot believe it!
|
| Names are identifiers, and as such need to stay identifiable. There
| exist Unicode security guidelines and rules for identifiers that they
| don't know about. My libu8ident library would help with that.
| UniverseHacker wrote:
| Clearly we should open up usernames to be an unlimited size set
| of mixed data types: e.g. the first "character" could be a hand
| drawn picture of a cat, the second the entire text of the US
| constitution in unicode, and so on. We could then extend this
| flexibility to filenames, passwords, and Unix commands.
| Internally, this could involve replacing all text strings with
| folders on a filesystem where you can put any files you want in
| any desired order. /s
| adrian_b wrote:
| As already pointed out in that discussion thread, there are
| standards for Unicode identifiers, e.g. RFC 8264 and RFC 8265.
|
| All Unicode characters have types, like letter, digit,
| punctuation, mathematical operator and so on. The standard for
| identifiers allows in identifiers only certain types of Unicode
| characters and it defines rules for normalization and
| comparison of identifiers.
|
| So rules for handling Unicode identifiers have already been
| defined. Whoever wants this functionality should just implement
| the standards.
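|
| To illustrate the two ingredients mentioned above (character types
| and normalization), here is a simplified sketch using Python's
| unicodedata module. It is not an implementation of RFC 8264 or RFC
| 8265; the allowed-category set is an assumption chosen for brevity.

```python
import unicodedata

# Assumption: letters and decimal digits only, far cruder than the RFC rules.
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Nd"}

def normalize_identifier(name: str) -> str:
    """Bring equivalent spellings to one canonical form (NFC)."""
    return unicodedata.normalize("NFC", name)

def is_plausible_identifier(name: str) -> bool:
    """Accept only non-empty names whose characters fall in allowed categories."""
    name = normalize_identifier(name)
    return bool(name) and all(
        unicodedata.category(ch) in ALLOWED_CATEGORIES for ch in name
    )

composed = "jos\u00e9"     # é as a single code point
decomposed = "jose\u0301"  # e followed by a combining acute accent

print(composed == decomposed)
print(normalize_identifier(composed) == normalize_identifier(decomposed))
print(is_plausible_identifier("a\u200bb"))  # contains a zero-width space
```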
|
| One may have opinions whether this is worthwhile or not for a
| certain application, but strawman arguments about cat pictures
| and other impossible dangers are no longer valid.
| UniverseHacker wrote:
| Apparently humor is no longer valid?
| nineteen999 wrote:
| I have an affectionate place in my heart for Debian, the
| community is passionate, they have wonderful ideals, hell I even
| helped found a charity which distributes it on used PC's
| discarded by large companies to disadvantaged people over 20
| years ago which is still running today. It was my favourite
| distro for a long time after I moved on from Slackware in the
| late 90's, I used it at home, I used it in my job at a small ISP
| on everything from x86 to Sun Sparc to DEC Alpha hardware. We are
| lucky in the Linux community to have them. I couldn't care less
| about derivatives like Ubuntu; they seem to be one step too far removed.
|
| But over the years the bikeshedding and some of the poor
| technical decisions started to wear on me. The debconf approach
| of asking a million questions on install bothered me. In my
| current job we use it on small industrial ARM PC's and it does a
| great job there at a large scale distributed over a wide variety
| of environments and geographical area, scorching heat, freezing
| cold and everything in between. But that's easy because it's a
| single system image which we deploy to hundreds of devices and it
| only requires minimal customisation to perform the required
| tasks.
|
| But our datacenter servers remain RHEL for the simple reason ...
| the deployment and broad customisation process per server is
| easy, LDAP integration is straightforward and the customer wants
| to pay for support from the vendor even though we never use it.
| Security updates and bugfixes are delivered quickly and the
| vendor's commitment to stability is commendable. It's a no
| brainer. More and more companies started to move their workloads
| to RHEL once it came out and unfortunately it just didn't make
| sense to bother with distributions outside of RHEL/Fedora for my
| personal use anymore, some sort of work/life balance is needed
| and I don't want to spend my personal computing time remembering
| all the idiosyncrasies between different Linux distributions
| anymore. I would argue that Debian is pretty idiosyncratic and
| opinionated if you have come from more traditional UNIX systems
| in the 90's, while RHEL/Fedora more closely model an "evolution"
| of those classic systems if you like. It will be interesting to
| see what happens to RHEL in the coming years as Red Hat becomes
| more and more absorbed into the IBM environment.
| finnthehuman wrote:
| That's the reality of deploying professionally though. I've got
| a soft spot for debian from using it for over 20 years too, but
| choosing open source often means picking the vendor that
| accommodates the use case. Many products have the enterprise
| upsell of good LDAP/AD integration but that's just a nice to
| have when you're really buying it for the ability to call someone
| when shit goes sideways.
|
| And when you don't need the support net, it's often gonna be
| ubuntu because that's what most people are comfortable with. Or
| yocto if you're shipping a custom OS. And containers are so
| ephemeral and purpose specific it means distro doesn't matter
| as much.
|
| I'm still rooting for them the most. They're community based,
| an important upstream, and stable has never done me dirty. It's
| still my go-to for "I don't want to think hard about, or worry
| about this system."
| jcarrano wrote:
| I don't get it. What's the purpose of changing the default rule
| in shadow-utils? Not only is it completely unnecessary and a
| risk for shell injections, it also risks introducing
| incompatibilities between Debian and any other system.
|
| I feel that there are already too many other things to fix to be
| wasting time in creating new potential bugs.
| thway15269037 wrote:
| Before opening this can of worms, can we finally address that
| there is a hard, hardcoded limit of 255 bytes per file name
| (folder name) in Linux? Yeah, 255 bytes, that is, like 63
| Japanese characters or emojis or maybe less. And in the kernel, too,
| so you physically cannot correct this issue by using another
| filesystem or something.
|
| Before anyone asks: yes, these folders do occur in real life, and
| I'm tired of pretending that they do not.
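|
| How many characters fit in 255 bytes depends on the encoding. A
| quick sketch, assuming file names are stored as UTF-8 (the
| max_repeats helper and the sample characters are illustrative):

```python
NAME_MAX = 255  # bytes per file name component on Linux

def max_repeats(ch: str, limit: int = NAME_MAX) -> int:
    """How many copies of `ch` fit in `limit` bytes when encoded as UTF-8."""
    return limit // len(ch.encode("utf-8"))

for label, ch in [("ASCII letter", "a"), ("hiragana", "あ"), ("emoji", "🦄")]:
    size = len(ch.encode("utf-8"))
    print(f"{label}: {size} bytes each, {max_repeats(ch)} fit")
```

| Emoji built from several code points (skin tones, ZWJ sequences)
| take more bytes each, so fewer of them fit.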
| cratermoon wrote:
| My take: user names are _not_ strings, though they may be
| _represented_ as strings. As such, a type, e.g. Username, would
| provide a constrained and consistent range of allowed values,
| much as a type like float32 allows (within IEEE 754 rules).
|
| It's time for programmers to stop treating everything that can be
| represented by a string as anything representable by a string
| type.
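|
| A small sketch of the "username as a type" idea, assuming a
| conservative ASCII-only rule. The pattern and the Username class
| are illustrative assumptions, not the rule any particular tool
| enforces.

```python
import re
from dataclasses import dataclass

# Assumption: lowercase ASCII letter or underscore first, at most 32 characters.
_USERNAME_PATTERN = re.compile(r"[a-z_][a-z0-9_-]{0,31}")

@dataclass(frozen=True)
class Username:
    """A string that has been checked once, at construction time."""
    value: str

    def __post_init__(self) -> None:
        if not _USERNAME_PATTERN.fullmatch(self.value):
            raise ValueError(f"invalid username: {self.value!r}")

    def __str__(self) -> str:
        return self.value

print(Username("alice"))

try:
    Username("bob; rm -rf /")
except ValueError as exc:
    print(exc)
```

| Functions that accept Username instead of str can then rely on the
| check having happened, rather than re-validating at every call site.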
| jmclnx wrote:
| Company I work at moved to an ID like [A-Z]Employee-number. Moot
| point for them :
| okasaki wrote:
| Aren't pretty much all devices nowadays owned by a single person?
|
| What's the use case for non-system usernames at all?
|
| Why not just "user" and "root"?
| ipython wrote:
| This sounds like a security nightmare just waiting to happen.
| Nothing like embedding gigantic libraries like libicu into
| security critical code bases so you can do things like Unicode
| normalization and comparison functions on usernames.
___________________________________________________________________
(page generated 2024-12-07 23:01 UTC)