[HN Gopher] Debian opens a can of username worms
       ___________________________________________________________________
        
       Debian opens a can of username worms
        
       Author : jwilk
       Score  : 240 points
       Date   : 2024-12-06 09:55 UTC (1 days ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | rini17 wrote:
       | Perhaps it's time to agree upon how to Unicode in identifiers?
       | The normalization, unprintable characters, confusing characters
       | with same glyphs, etc. It's obviously problematic when everyone
       | is doing it on their own.
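
A minimal sketch of two of the problems listed above, normalization and look-alike characters, using only Python's standard `unicodedata` module. The helper names `normalize_name` and `scripts_hint` are hypothetical, and the script check is a crude heuristic based on Unicode character names, not a real confusables detector.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)


def normalize_name(name: str) -> str:
    """Hypothetical helper: NFC-normalize a name before comparing it."""
    return unicodedata.normalize("NFC", name)


def scripts_hint(name: str) -> set:
    """Rough script hint: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in name}


latin = "paypal"
mixed = "p\u0430yp\u0430l"  # Cyrillic 'а' (U+0430) in place of Latin 'a'

print(composed == decomposed)                                   # raw strings differ
print(normalize_name(composed) == normalize_name(decomposed))   # equal after NFC
print(normalize_name(latin) == normalize_name(mixed))           # still different
print(sorted(scripts_hint(mixed)))                              # mixed scripts
```

Normalization resolves the first pair but does nothing for the second, which is why the two problems usually need separate handling.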
        
         | magicalhippo wrote:
         | As long as I can enter my Zalgo[1] username, I'm fine with your
         | suggestion.
         | 
         | [1]: https://en.wikipedia.org/wiki/Zalgo_text
        
         | m000 wrote:
         | Good luck bringing everyone together. There's still a ton of
         | Microsoft software that relies on the presence of the BOM [1],
         | despite practically everyone else not using it. And
         | bidirectional rsync between practically everything else and a
         | Mac still requires `--iconv=utf-8,utf-8-mac` to avoid problems
         | because of homographs.
         | 
         | [1] https://en.wikipedia.org/wiki/Byte_order_mark
        
         | bayindirh wrote:
         | The first bar to clear is "The Turkish Test"[0], then we can
         | talk about Unicode. It'll smooth the rest of the process a lot.
         | 
         | You can't guess how many workarounds I implement to make sure
         | that a stray application doesn't get "i" or "I" in their naive
         | codepaths, and start burning mid-flight (e.g.: Kodi, Pagico,
         | some old Java programs, oh my...).
         | 
         | [0]: https://blog.codinghorror.com/whats-wrong-with-turkey/
        
           | beardyw wrote:
           | The date format part is ridiculous. Americans are almost
           | unique in using mm/dd/yyyy, so an assumption of that would be
           | plain wrong.
        
             | bayindirh wrote:
             | Localization libraries handle these parts well, since date
             | is same with Europe (and generally stored as time-date
             | objects rather than pure strings). None of the number
             | shenanigans cause problems since these numbers are always
             | stored as IEEE754 or other decimal formats. Money is no
             | problem as well.
             | 
             | However, when you go through an upper() or lower() or
             | anything which plays with capitalization, and if that data
             | is being fed to a hash algorithm or anything which mucks
             | with strings, boy, oh boy...
             | 
             | The easiest way is to sanitize these programmatic parts
             | with forced locale of en_US or plain old "C". If the
             | string is not user-facing and never localized, just
             | force its locale. It's the only sane way.
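
The casing hazard described above can be sketched in Python. Python's `str.upper()` and `str.lower()` use Unicode's default, locale-independent mappings, so this does not reproduce a Turkish-locale runtime; it only shows why the four Turkish i letters do not round-trip under the default rules. The `ascii_lower` helper is a hypothetical stand-in for "force the C locale" on internal, non-localized keys.

```python
import string

dotted_capital = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"   # ı, LATIN SMALL LETTER DOTLESS I

# Default Unicode mappings: I <-> i, which is not what Turkish text expects
# (Turkish pairs i with İ and ı with I).
print("I".lower(), "i".upper())
print(len(dotted_capital.lower()))  # 'i' plus a combining dot above
print(dotless_small.upper())

# Translation table that touches only A-Z and ignores any locale setting.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(key: str) -> str:
    """Hypothetical helper: lowercase A-Z only, leave other characters alone."""
    return key.translate(_ASCII_LOWER)


print(ascii_lower("TITLE"))
```

Using a fixed ASCII-only mapping for identifiers that feed hashes or lookups keeps the result the same regardless of the user's locale.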
        
               | oblio wrote:
               | > since date is same with Europe
               | 
               | Do you mean MM-DD-YYYY? No, the vast majority of Europe
               | does DD-MM-YYYY in some form or another.
        
               | bayindirh wrote:
               | No, I mean DD-MM-YYYY. We use the same format as the
               | vast majority of Europe.
        
               | oblio wrote:
               | I'm confused, are you talking about the US? The US for
               | sure does not default to DD-MM-YYYY.
        
               | bayindirh wrote:
               | I'm talking about the Turkish Language, its locale and
               | its peculiarities since it has letters "i" and "I".
               | 
               | So, I'm talking about Turkish date format. Turkey uses
               | DD-MM-YYYY format, like most of Europe.
        
             | a3w wrote:
             | https://xkcd.com/1179/ I heard the US and A are moving to
             | the hissing cat date format shown here.
        
               | Muromec wrote:
               | I kind of like the one using roman numerals for month.
               | Reasonable people would figure out that other reasonable
               | people would not use roman numerals for _days_ , so the
               | order can be implicit. I like implicit ordering, it
               | always makes things more interesting.
        
             | bluGill wrote:
             | I have switched to yyyymmdd for everything - it is usually
             | obvious to everyone what date I mean.
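
A small sketch of one practical property of the yyyymmdd form mentioned above: the strings sort in the same order as the dates they represent. The `stamp` helper is a hypothetical name.

```python
from datetime import date


def stamp(d: date) -> str:
    """Format a date as yyyymmdd."""
    return d.strftime("%Y%m%d")


dates = [date(2024, 12, 6), date(2023, 1, 31), date(2024, 2, 1)]

# Sorting the strings gives the same order as sorting the dates themselves.
stamps = sorted(stamp(d) for d in dates)
print(stamps)
```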
        
               | bayindirh wrote:
               | I also use the same format while naming my files, or in
               | changelogs or whatnot, but not all documents are suitable
               | for that, and in the presentation layer you need to match
               | the country standards.
               | 
               | However, date format is mostly a presentation concern;
               | internal storage is vastly different from what we
               | generally see.
        
               | bluGill wrote:
               | I don't match country standards. That is the point.
        
               | kelnos wrote:
               | It depends on what you're doing, though. If you're
               | helping people fill out documents (even non-government
               | documents), then you really need to match the country
               | standard.
               | 
               | Localization is important; some countries outright
               | require it if you're going to do business within their
               | borders. But even where it's not required, you will lose
               | customers if your website/application/product feels
               | "foreign". I'm not sure date ordering is a big enough
               | deal to trigger that feeling in anyone, but unless it's a
               | huge burden to format things the way people expect, I
               | would do so for the UX benefits.
        
         | throw0101a wrote:
         | > _Perhaps it 's time to agree upon how to Unicode in
         | identifiers?_
         | 
         | And then update all data structures that refer to them (like
         | _last_ and _w_ / _who_ , also NFS), as well as file formats
         | (like _cpio_ , _tar_ , and _pax_ which encodes ownership).
        
           | maccard wrote:
           | Yes. Those formats have had 20 years since Unicode was
           | standardised, and things like my terminal still routinely
           | break when given "unexpected" inputs. Practically every other
           | application can handle it.
        
         | layer8 wrote:
         | Unicode has provided a specification for Unicode identifiers
         | since 2005: https://www.unicode.org/reports/tr31/
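
As a rough illustration only: Python's own identifier rule is derived from the XID_Start/XID_Continue properties that the report linked above defines, so `str.isidentifier()` gives a quick feel for what such a rule accepts. This is Python's rule, not a full implementation of the report or of its security recommendations, and `is_candidate_identifier` is a hypothetical helper.

```python
import unicodedata


def is_candidate_identifier(name: str) -> bool:
    """Hypothetical check: NFKC-normalize, then apply Python's identifier rule."""
    return unicodedata.normalize("NFKC", name).isidentifier()


for candidate in ["alice", "na\u00efve", "\u540d\u524d", "1abc", "a-b", ""]:
    print(repr(candidate), is_candidate_identifier(candidate))
```

Note that this accepts look-alike characters from different scripts, so it answers "is this a well-formed identifier" and not "is this identifier safe to display".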
        
           | rini17 wrote:
           | Great! Is there a library for their validation? ICU seems to
           | have only spoof checker for confusables.
        
             | rurban wrote:
             | libu8ident
        
               | westurner wrote:
               | rurban/libu8ident : https://github.com/rurban/libu8ident
               | :
               | 
               | > _unicode security guidelines for identifiers_
        
             | westurner wrote:
             | ICU: International Components for Unicode:
             | https://en.wikipedia.org/wiki/International_Components_for_U...
             | 
             | unicode-org/icu: https://github.com/unicode-org/icu
             | 
             | Microsoft/ICU: https://github.com/microsoft/icu
             | 
             | IDN: Internationalized domain name:
             | https://en.wikipedia.org/wiki/Internationalized_domain_name
             | 
             | Punycode: https://en.wikipedia.org/wiki/Punycode
             | 
             | IDN homograph attack:
             | https://en.wikipedia.org/wiki/IDN_homograph_attack
             | 
             | CWE-1007: Insufficient Visual Distinction of Homoglyphs
             | Presented to User:
             | https://cwe.mitre.org/data/definitions/1007.html
             | 
             | GNU libidn/libidn2: https://gitlab.com/libidn/libidn2
             | 
             | Comparison of regular expression engines > Language
             | features > Part 2; Unicode support:
             | https://en.wikipedia.org/wiki/Comparison_of_regular_expressi...
        
         | secondcoming wrote:
         | Would punycode be suitable?
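
For reference, a sketch of what Punycode does to a non-ASCII label, using the `punycode` codec that ships with Python. The helper names are hypothetical. Whether this would suit usernames is the open question here; the sketch only shows that the encoding is a reversible mapping to ASCII.

```python
def to_punycode(label: str) -> str:
    """Encode a Unicode label to its ASCII Punycode form."""
    return label.encode("punycode").decode("ascii")


def from_punycode(encoded: str) -> str:
    """Decode an ASCII Punycode string back to Unicode."""
    return encoded.encode("ascii").decode("punycode")


encoded = to_punycode("m\u00fcnchen")
print(encoded)
print(from_punycode(encoded))
```

The encoded form is ASCII-safe but not human-friendly, and two visually identical names from different scripts still encode to different strings.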
        
       | tiahura wrote:
       | When you think about all the time, money and effort that have
       | been wasted on Unicode...
        
         | kalleboo wrote:
         | Yeah we should have all just stuck to Shift-JIS
        
         | Joker_vD wrote:
         | Vseki triabva da izpolzva latinitsa, absoliutno s'm s'glasen.
         | 
         | After all, it's objectively the most perfect set of characters
         | for any reasonable human language.
        
           | febusravenga wrote:
           | Random cross-language-script observation.
           | 
           | In Bulgarian, latinitsa ("latin alphabet") transliterated to
           | latin alphabet is just "latinitsa" or "latinica".
           | 
           | In Polish "cyrillic" is "cyrylica" - basically reverse.
        
         | pjc50 wrote:
         | What's your preferred solution for representing the CJK
         | languages?
        
           | tiahura wrote:
           | Computing did pretty well in the prior 50 years.
        
             | pjc50 wrote:
             | That's not an answer. Be specific. How do you want to
             | represent the 97k CJK characters?
        
               | vman81 wrote:
               | I really don't want to be snarky or sarcastic, so I'll
               | just be plain. Many people are unwilling or unable to
               | understand a problem that doesn't affect them directly.
               | Like - "UTF is woke" kind of people. They are out there.
        
             | CorrectHorseBat wrote:
             | Not for the majority of the world population who doesn't
             | know English
        
             | jcranmer wrote:
             | I still remember the days when I couldn't use p and e in
             | the same document, because there was no codepage that
             | contained both of them. I also remember the days when
             | pretty much any website that had non-English text had to
             | have instructions on it for how to view it properly,
             | because mojibake was so bloody common.
             | 
             | (It should also tell you something that not only is there a
             | name for "computers failed at charsets", but the name is
             | Japanese.)
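
Both failure modes described above can be reproduced with Python's built-in codecs. The specific characters are an assumption for illustration (a Cyrillic letter and an accented Latin letter); `fits_codepage` is a hypothetical helper.

```python
text = "caf\u00e9"  # "café"

# Mojibake: UTF-8 bytes read back as if they were Latin-1.
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)


def fits_codepage(s: str, codec: str) -> bool:
    """Hypothetical helper: can every character of s be stored in this codec?"""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


mixed = "\u043f and \u00e9"  # Cyrillic 'п' next to Latin 'é'
for codec in ["cp1251", "cp1252", "utf-8"]:
    print(codec, fits_codepage(mixed, codec))
```

Each legacy code page can hold one of the two letters but not both, which is the situation a single document could not escape before Unicode.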
        
             | umanwizard wrote:
             | Only if you could expect a given person to only ever deal
             | with one language. Anything international sucked and was a
             | much bigger pain than now.
             | 
             | It would be impossible to e.g. build a site like Reddit
             | where people can comment in any language.
        
             | vman81 wrote:
             | Computing has improved massively over the last 50 years,
             | not least because it now can accommodate people's diverse
             | languages.
        
             | kryptiskt wrote:
             | No, it didn't. There were all kinds of encodings out there,
             | and dealing with code pages was way worse than any
             | inconveniences that Unicode has brought. Unicode was
             | created for a reason, not just to torture US programmers
             | with the diversity of scripts in the world.
             | 
             | Maybe it was nice if you worked for a US company without
             | any operations abroad, which includes absolutely none of
             | those which mattered.
        
               | account42 wrote:
               | You still need to deal with "codepages" to differentiate
               | between Japanese Unicode and Chinese Unicode even if it's
               | called a language and not codepage now.
        
               | CorrectHorseBat wrote:
               | Han unification sucks indeed but if you get the wrong
               | font it's still readable
        
               | numpad0 wrote:
               | Sometimes, not always. Depends on how similar specific
               | characters happen to be.
        
             | dotancohen wrote:
             | Only if your name isn't Dong Jiu Er Gong Ren Yan Wang .
        
             | throw0101a wrote:
             | > _Computing did pretty well in the prior 50 years._
             | 
             | Contra:
             | 
             | * https://stackoverflow.com/questions/25812790/wrong-character...
        
             | Muromec wrote:
             | I had to, in the year of our lord 2024, deal with a certain
             | non-unicode system that ate one specific Cyrillic symbol
             | when producing an open data artifact mandated by law. It
             | was never fun then and it still manages to create
             | problems.
        
           | account42 wrote:
           | Something that doesn't unify different characters. So not
           | Unicode.
        
         | Cthulhu_ wrote:
         | What alternative do you propose? I mean personally I think that
         | emoji don't belong in unicode, but at the same time it's been
         | integrated into society for many years now and it's made
         | communications platforms so much more streamlined.
         | 
         | But how else would you represent non-latin characters? More
         | character sets?
        
           | a3w wrote:
           | > emoji don't belong in unicode
           | 
           | Well, they are defined as: "an intermediate technology until
           | we find a way to transfer images over data connections."
           | 
           | So it was always a technology that was 40 years too late to
           | the party?
        
         | layer8 wrote:
         | Without it, all textual data would need its own charset header,
         | and you couldn't freely copy & paste between pieces of text
         | with different charsets without creating mangled garbage. This
         | was the situation before Unicode (except that charsets were
         | often only implicit, so you had to guess which it is).
        
       | card_zero wrote:
       | > naming things is one of the hard things to do in computer
       | science
       | 
       | I've been thinking about that a lot lately. Code is text, it's
       | arranged linearly, code has to be readable, identifiers are thus
       | short strings that try to express short essays about the purpose
       | of the variable or whatever it is, and then ideally there's a
       | longer version of the essay in a comment, but not too long
       | because that would clutter up the code as well (because it's
       | text, arranged linearly). And we have code folding to tidy them
       | up, for what good it does, and ideally an even longer version of
       | the essay in documentation except nobody writes that.
       | 
       | What if it wasn't text, and wasn't linear, and we didn't have an
       | expectation that code should be strings of stupid over-terse
       | names and hieroglyphic symbols? So I was thinking vaguely about
       | investigating graphic-based programming, but it's probably worse,
       | IDK. It could automatically assign arbitrary icons* instead of
       | identifiers, and you could write tooltip-like comments to
       | describe them as and when you want to, and everything could be
       | laid out nicely with diagrams and different pages instead of like
       | a text file. I suppose this is all merely cosmetic? The thing
       | with the insistence on code being _written_ as strings of text
       | feels very primitive, is all. It causes this problem.
       | 
       | * Which doesn't solve the problem, I admit, because now you have
       | to remember what the icons mean, but maybe that's easier?
        
         | jstanley wrote:
         | I don't think remembering the meaning of icons is easier,
         | because in order to think about it you have to be able to
         | pronounce it inside your head.
         | 
         | And code isn't just linear, it can be spread across multiple
         | files in a directory tree, functions can call each other, etc.
        
           | c22 wrote:
           | _> in order to think about it you have to be able to
           | pronounce it inside your head._
           | 
           | I'm not sure this is universal.
        
             | vidarh wrote:
             | Indeed, some people do not even have an inner voice, the
             | same way some of us don't "see" things in our mind's eye.
             | Neither prevents you from thinking about words or visual
             | objects.
        
         | pjc50 wrote:
         | > I was thinking vaguely about investigating graphic-based
         | programming, but it's probably worse, IDK. It could
         | automatically assign arbitrary icons* instead of identifiers,
         | and you could write tooltip-like comments to describe them as
         | and when you want to, and everything could be laid out nicely
         | with diagrams and different pages instead of like a text file.
         | 
         | Have you ever read large electronic schematics? That's
         | basically it .. except all the important things have to be
         | identified by text anyway, because it's a massive challenge to
         | the imagination to come up with two hundred different
         | pictograms.
         | 
         | Of course, if you really want your identifiers to be
         | pictograms, why not just use kanji for your identifiers? The
         | Japanese language and Unicode provide tens of thousands of
         | ready made pictograms for your convenience!
         | 
         | The only nonlinear programming environments that have really
         | worked are the spreadsheet (which is still linear within each
         | cell) and Labview. Possible shoutout to Unity blueprints, but
         | when those get too complicated spaghetti .. people rewrite them
         | in linear text code.
        
           | card_zero wrote:
           | _Sigh_
           | 
           | I guess you're right. This has been a dimly-felt wish of mine
           | for some 25 years, but probably pie in the sky.
           | 
           | Edit: I see there are a _lot_ of visual programming
           | languages.
           | 
           | https://en.wikipedia.org/wiki/Visual_programming_language
        
             | 9dev wrote:
             | I don't think that has to be the answer, though. We can
             | probably all agree that plaintext code is not the best form
             | to represent the schematics of a process, and neither are
             | images. But that seems to be a very limited set of options,
             | and I wonder if there aren't any other dimensions to
             | express what is essentially persisted chains of reasoning.
             | For an example of alternative modes of input, have a look
             | at the Reactable, a pretty innovative way to compose music.
             | Sadly I think they didn't disrupt the music industry as
             | they should have, but it's a pretty good example of a new
             | way to think about making sounds.
             | 
             | Edit: forgot the link. Here is: http://reactable.com
        
             | WillAdams wrote:
             | Longer than that --- I would argue it goes back to Hermann
             | Hesse's _The Glass Bead Game_ (originally published as
             | Magister Ludi) --- but Hesse seems to have gone out of
             | style.
             | 
             | That said, I keep trying various ones, and will keep hoping
             | that someday someone will make a graphical tool able to
             | make a GUI program.
             | 
             | Nodezator seems promising.
        
           | auxym wrote:
           | > Have you ever read large electronic schematics? That's
           | basically it .. except all the important things have to be
           | identified by text anyway, because it's a massive challenge
           | to the imagination to come up with two hundred different
           | pictograms.
           | 
           | As a mechanical engineer who works with Labview and Simulink,
           | as well as more conventional code (python mostly), that is
           | indeed a very good description. First glance at a large
           | labview program feels very much like first glance at a large
           | and complex electronics schematic. Lots of wire everywhere
           | and you're not even sure where to start.
           | 
           | I think a nice "best of both worlds" approach is a graphical
           | "high level" view which shows the flow of data, at least for
           | "data transformation" kind of programs, and code for the low
           | level logic (what actually happens in the blocks). Sort of
           | like nodal editors in Blender and NLE apps. Fortunately
           | Simulink makes it easy to drop in a Matlab function call,
           | Labview not so much (need to get into C FFI or use a really
           | old version of .net or something).
           | 
           | The thought I have about spreadsheets (might have read that
           | on here), is that spreadsheets make the data visible and hide
           | the code. Text-based programming hides the data but shows the
           | code. I'm not sure what something that makes both code and
           | data first class and visible would look like, but I'd be
           | curious for sure (for engineering type applications at
           | least). Best I've found so far (and what I actually use for a lot
           | of data processing tasks) is a Jupyter notebook making
           | plentiful use of df.head() and df.plot().
        
           | umanwizard wrote:
           | It's odd to say those characters come from the Japanese
           | language when they were invented in China to write Chinese,
           | are still used for that purpose, and were only introduced to
           | Japan 2000 years later.
        
           | taneq wrote:
           | > The only nonlinear programming environments that have
           | really worked are the spreadsheet (which is still linear
           | within each cell) and Labview. Possible shoutout to Unity
           | blueprints, but when those get too complicated spaghetti ..
           | people rewrite them in linear text code.
           | 
           | Not 100% sure what you mean by 'nonlinear' here (flow
           | control?) but almost all industrial and mining equipment is
           | programmed in visual languages on PLCs. Ladder Logic looks
           | like, well, a stylized electrical drawing of a bunch of
           | relays wired up to perform logical operations. Function Block
           | Diagram looks like a PCB layout, but the 'integrated
           | circuits' are function blocks (basically functors) and the
           | 'traces' are copying data between the function
           | blocks. Not great for implementing hardcore algorithms but
           | you can do a surprising amount with them (once you get used
           | to coding with both hands tied behind your back) and they
           | sure are accessible to people who otherwise wouldn't be
           | programming.
           | 
           | Of course, as you say, when things get genuinely complicated,
           | it's much nicer to use a 'real' programming language (or even
           | just Structured Text, which is pretty much just Pascal).
           | 
           | Then again, even with electronics, once things get complex
           | enough don't we start using text (eg. VHDL)? Expressing
           | designs is always a tradeoff between simplicity and
           | 'obviousness' on the one hand, and representational
           | efficiency on the other. Structured text sits right in the
           | sweet spot between the two.
        
         | jcranmer wrote:
         | Graphical programming is one of those things that's often
         | suggested as an improvement on textual programming, and just
         | about every implementation tends to disappoint. I know, when
         | working on compilers, that nearly every time I go "I think I
         | want to see the CFG as a graph here," I tend to realize no,
         | that's not quite what I wanted. For a complex function, the
         | surprising superpower is just to have an editor that shows the
         | opening brace line of every currently-open brace.
         | 
         | Another case in point: when was the last time you saw someone
         | use a flowchart to describe the pseudocode of an algorithm, as
         | opposed to writing, er, pseudocode? Flowcharts used to be the
         | dominant way to do this, decades ago, but they seem to me to
         | have been thoroughly supplanted by pseudocode...
        
           | WillAdams wrote:
           | I think the problem here is that there isn't an agreed-upon
           | answer for the question:
           | 
           | >What does an algorithm look like?
           | 
           | And any effort to answer it which gets beyond the size of a
           | single diagram/screen/page/poster becomes a problem like to:
           | 
           | https://blueprintsfromhell.tumblr.com/
           | 
           | https://scriptsofanotherdimension.tumblr.com/
           | 
           | I like to think of myself as a visual person, and I wish
           | there was a good solution here, and I keep looking for and
           | trying different solutions other folks have made (current two
           | iterations are BlockCAD and OpenSCAD Graph Editor) --- I'd be
           | glad of other suggestions, esp. if able to make graphic user
           | interfaces more complex than the OpenSCAD Customizer.
        
             | card_zero wrote:
             | Argh! Wire-wrapped backplanes! That wasn't the fantasy at
             | all!
        
               | WillAdams wrote:
               | Yes, the fantasy is something like to Hermann Hesse's _The
               | Glass Bead Game_ which I mentioned elsethread --- what is
               | the closest available tool to that?
               | 
               | How do such tools manage the problem of
               | encapsulation/modularity becoming the "wall of text"
               | which one is trying to escape, just a pretty wall w/ all
               | the labels in boxes decorated/connected w/ lines?
        
         | AlienRobot wrote:
         | The difficulty in naming things is that you're trying to encode
         | semantics and an interface contract in a name. If you give up
         | doing that, it's easy.
         | 
         | For example, say you have getFoo(). It's clear it gets the foo.
         | But later you introduce getFooAsync(). Suddenly it's no longer
         | clear whether getFoo() is sync or async, because you didn't
         | call it getFooSync().
         | 
         | If instead you used names like getFoo1, getFoo2, getFoo3, etc.,
         | the semantics you're providing is that there are multiple
         | "ways" to getFoo without making promises (a contract) about
         | what the function actually does in its name.
         | 
         | Although this sounds like bad naming practices (it is), it
         | effectively solves the naming problem. Apply this to CSS, and
         | instead of .red-button or .secondary-button, you get .button1,
         | .button2, .button3, and you just don't have to think about WHY
         | are you creating a button to give it a class and start styling
         | it.
        
           | card_zero wrote:
           | Yep, that sort of thing happens _constantly._ Things get
           | misleading names because the first three alternatives I came
           | up with were also misleading. So I agree, and indeed I
           | considered a foo bar baz scheme instead of icons, same
           | difference. Then you have to look somewhere else for what the
           | thing does. Self-documenting code doesn't really work, and
           | strict naming schemes are long-winded and worse than ad-
           | libbing it, so it would have to be comments, but then the
           | comments get forgotten and no longer reflect the code. I give
           | up, might take up woodwork instead.
        
       | mmsc wrote:
       | I wonder how this will affect ssh. OpenSSH recently restricted
       | more characters for valid usernames:
       | https://github.com/openssh/openssh-portable/commit/7ef3787c8...
        
         | cedws wrote:
         | This is a great example of how one poor decision, or one piece
         | of code that is too liberal cascades into an avalanche of
         | shitty workarounds.
        
         | throw0101a wrote:
         | It should be noted that shell metacharacters are also not
         | allowed under POSIX:
         | 
         | * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | 
         |     A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
         |     a b c d e f g h i j k l m n o p q r s t u v w x y z
         |     0 1 2 3 4 5 6 7 8 9 . _ -
         | 
         | *
         | https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | 
         | (Hyphen forbidden as first character.)
        
         | linuxftw wrote:
         | I think it will be fine. Everyone will quickly learn the lesson
         | "Use something other than ASCII letters and numbers at your own
         | peril."
         | 
         | Similar to people who put spaces in file names, it should be a
         | fire-able offense.
        
           | lexicality wrote:
           | any software that can't handle spaces in filenames is broken
        
             | Muromec wrote:
             | All of the software is broken (including security wise) all
             | the time anyway.
        
               | bdangubic wrote:
               | this is exactly right... I spoke a few years ago with a
               | mate who is a software dev at one of the major car
               | companies... since then I wouldn't sit in the car from
               | that company if my life depended on it...
               | 
               | then I thought - if I spoke to any dev in any industry I
               | would also stop doing whatever their software is
               | controlling and end up moving to live with the Amish or in
               | some wilderness without electricity
        
           | hiccuphippo wrote:
           | Was that the fireable offense? I always thought the offense
           | was not putting quotes around filenames in scripts.
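
The quoting point can be sketched with Python's standard `shlex` module, which follows POSIX shell word-splitting rules. The filename is made up for illustration.

```python
import shlex

filename = "quarterly report.txt"

# Interpolating the name directly lets the shell split it into two words.
unquoted = f"rm {filename}"
# shlex.quote wraps the name so the shell sees a single argument.
quoted = f"rm {shlex.quote(filename)}"

print(shlex.split(unquoted))
print(shlex.split(quoted))
```

Passing arguments as a list (for example to `subprocess.run`) avoids the shell's word splitting altogether, which is usually simpler than quoting.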
        
       | dfranke wrote:
       | Allowing purely numeric usernames seems like a terrible idea to
       | me, because it creates ambiguity between what's a username and
       | what's a UID. It's common for tools like ls or ps to display a
       | username when one is found and fall back to displaying a UID if
       | it isn't, and similarly tools like chown will accept either a UID
       | or a username and disambiguate based on whether it's numeric or
       | not. Now suppose there's a numeric username that doesn't match
       | its own UID, but does match some other user's UID. It doesn't
       | take a lot of imagination to see how this would lead to
       | vulnerabilities.
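
A toy illustration of the ambiguity described above. The `USERS` table and both resolver functions are hypothetical and do not claim to reproduce what any particular tool does; they only show that two reasonable lookup policies disagree once a purely numeric username exists.

```python
# Hypothetical user database: username -> UID.
# The user literally named "1000" has UID 1001; alice owns UID 1000.
USERS = {"alice": 1000, "1000": 1001}

def resolve_name_first(spec: str) -> int:
    """Treat spec as a username if one exists, otherwise as a numeric UID."""
    if spec in USERS:
        return USERS[spec]
    return int(spec)

def resolve_numeric_first(spec: str) -> int:
    """Treat an all-digit spec as a UID, otherwise look it up as a username."""
    if spec.isdigit():
        return int(spec)
    return USERS[spec]

print(resolve_name_first("1000"))     # the account named "1000"
print(resolve_numeric_first("1000"))  # alice's UID
```

Two tools that pick different policies would hand the same input to different accounts.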
        
         | throw0101a wrote:
         | Talk to POSIX:
         | 
         | > _A string that is used to identify a user; see also User
         | Database. To be portable across systems conforming to
         | POSIX.1-2017, the value is composed of characters from the
         | portable filename character set. The <hyphen-minus> character
         | should not be used as the first character of a portable user
         | name._
         | 
         | * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | 
         | The "portable filename character set" is defined as:
         |     A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
         |     a b c d e f g h i j k l m n o p q r s t u v w x y z
         |     0 1 2 3 4 5 6 7 8 9 . _ -
         | 
         | * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | 
         | So only a hyphen as the first character is forbidden.
         | 
         | Given that you can't necessarily control where usernames come
         | from (e.g., LDAP lookups), properly speaking your system has to
         | handle everything anyway, even if you don't allow local
         | creation.
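
The quoted POSIX wording can be turned into a small check. This is a sketch under the assumption that the "should not" about a leading hyphen is treated as a hard rule; `is_portable_username` is a made-up name, and real systems add their own limits (length, reserved names) that are not modeled here.

```python
import re

# Portable filename character set from the POSIX text quoted above:
# A-Z a-z 0-9 . _ -  with the hyphen not allowed as the first character.
PORTABLE_USERNAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")

def is_portable_username(name: str) -> bool:
    return PORTABLE_USERNAME.fullmatch(name) is not None

for candidate in ["alice", "1000", "-rf", "j\u00f6rg", "a.b_c-d"]:
    print(candidate, is_portable_username(candidate))
```

Note that a purely numeric name such as `1000` passes, which is exactly the gap being discussed.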
        
           | dfranke wrote:
           | Yes, I'm aware, and POSIX has many such bugs that make
           | command input or output unavoidably ambiguous if certain
           | unexpected characters are present that they didn't think to
           | prohibit. A lot of the revisions that went into POSIX 2024
           | were aimed at fixing some of these, such as standardizing
           | find -print0 and xargs -0. The fact that this one got
           | overlooked doesn't mean it's a good idea to make the
           | situation worse and harder for future POSIX revisions to
           | address.
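
Why NUL-delimited output (as produced by `find -print0` and consumed by `xargs -0`) removes one such ambiguity can be sketched in a few lines. The `parse` helper and the sample names are illustrative assumptions: a filename may contain a newline but never a NUL byte, so only the NUL-separated stream can be split back into the original names.

```python
names = [b"report.txt", b"odd\nname.txt"]  # second name contains a newline

newline_stream = b"\n".join(names) + b"\n"
nul_stream = b"\0".join(names) + b"\0"

def parse(stream: bytes, sep: bytes) -> list[bytes]:
    # Drop the empty piece that follows the trailing separator.
    return stream.split(sep)[:-1]

print(parse(newline_stream, b"\n"))  # three pieces: one name was cut in two
print(parse(nul_stream, b"\0"))      # the original two names
```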
        
           | bluGill wrote:
           | It is time for POSIX to get with the times. Computers are
           | used in more than the US and Canada (for the most generous
           | interpretation of American in ASCII I'm including Canada,
           | their French speakers will not be happy with that, not to
           | mention first nations of which I know nothing but imagine
           | their written language needs more than ASCII). UTF8 has been
           | standard for decades now, just state that as of POSIX 2025
           | all of UTF8 is allowed in all string contexts unless there is
           | a specific list of exception characters for that context
           | (that is they never do a list of allowed characters). They
           | probably need to standardize on utf8 normalization functions
           | and when they must be used in string comparisons. Probably
           | also need some requirement that an alternate utf8 character
           | entry scheme exist on all keyboards.
           | 
           | The above is a lot of work and will probably take more than a
           | year to put into the standard, much less implement, but
           | anything less is just user hostile. Sometimes committees
           | need to lead from the front not just write down existing
           | practice.
        
             | chikere232 wrote:
             | Sounds like lots of work and a lot of new bugs for no real
             | value.
        
             | throw0101a wrote:
             | > _It is time for POSIX to get with the times._
             | 
             | "Be the change that you wish to see in the world." --
             | Mahatma Gandhi
             | 
             | It's free to join:
             | 
             | * https://www.opengroup.org/austin/lists.html
             | 
             | * https://www.opengroup.org/austin/
        
             | atoav wrote:
             | Sure, go ahead. Write the PR and make sure to test against
             | all other things used in production.
             | 
             | Let's talk again in 30 years when you're done.
        
               | jerf wrote:
               | Oh, it's been closer to 20 years for the rest of the
               | world to catch up to Unicode than 30. We aren't at
               | "perfect" now but we're certainly down to the trickier
               | corner cases that are difficult to even see how you solve
               | the problems at all, let alone code the solutions, and
               | that's just reality's ugly nose sticking in to our
               | pristine world of numbers.
               | 
               | But there really isn't any other solution. Yes, there
               | will be an uncomfortable transition. Yes, it blows. But
               | there isn't any other solution that is going to work
               | other than _deal with it_ and take the hits as they come.
               | The software needs to be updated. The presumption that
               | usernames are from some 7-bit ASCII subset is simply
               | unreasonable. We'll be chasing bugs with these features
               | for years. But that's not some sort of optional aspect
               | that we can somehow work around. It's just what is coming
               | down the pike. Better to grasp the nettle firmly [1] than
               | shy away from it.
               | 
               | At least this transition can learn a lot from previous
               | transitions, e.g., I would mandate something like NFKC
               | normalization applied at the operating system level on
               | the way in for API calls:
               | https://en.wikipedia.org/wiki/Unicode_equivalence
               | Unicode case folding decisions can also be made at that point.
               | The point here not being these specific suggestions per
               | se, but that previous efforts have already created a
               | world where I can reference these problems and solutions
               | with specific existing terminology and standards, rather
               | than being the bleeding-edge code that is figuring this
               | all out for the first time.
               | 
               | [1]: https://www.phrases.org.uk/meanings/grasp-the-nettle.html
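
A rough sketch of what "NFKC on the way in, plus case folding" could look like, using Python's standard `unicodedata` module. The function name `canonical_username` is hypothetical, and this is one possible policy rather than anything an operating system is known to apply.

```python
import unicodedata

def canonical_username(name: str) -> str:
    # NFKC folds compatibility forms (full-width letters, ligatures)
    # into their plain equivalents; casefold() then removes case
    # distinctions. Normalize once more because folding can leave the
    # string in a non-normalized form.
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)

print(canonical_username("\uff21lice"))  # full-width "A" followed by "lice"
print(canonical_username("\ufb01ona"))   # "fi" ligature followed by "ona"
```

With this policy, differently typed spellings of one name collapse to a single stored key, at the cost of rejecting names that differ only in those ways.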
        
               | atoav wrote:
               | Don't get me wrong, I think using UTF-8 everywhere is how
               | things should be.
               | 
               | But this is not a "let's just" or "why don't we" type of
               | endeavor. This is a _major_ undertaking, and as such
               | people are needed who (A) think it is worth the effort
               | and (B) are willing to follow through with all the
               | consequences.
               | 
               | Open Source software lives from contributions and if
               | you're not willing to do it, why should others spend
               | years of their lives for it?
               | 
               | In the end this is a question of: are the benefits worth
               | the effort? What do we win? Where do things get simpler?
               | Where more complicated? How do you pull it off if half
               | the distributions use UTF8 and the other half uses the
               | legacy way? How would tooling deal with this split? etc.
        
               | atoav wrote:
               | To add a little bit of context:
               | 
               | You know what I think would be way _worse_ than today's
               | reduced character set usernames with some special rules or
               | "just" using utf-8 for them?
               | 
               | Both. Imagine a world where some usernames are UTF-8 some
               | are not and it is hard to figure out which is which. That
               | would be worse than just leaving things as they are.
               | 
               | Avoiding that situation makes pulling the whole thing off
               | even harder, since there needs to be a high amount of
               | coordination between many projects, distros etc.
        
               | gray_-_wolf wrote:
               | > Unicode case folding decisions can also be made at that
               | point
               | 
               | Ok I will bite. How do you intend to do case folding
               | without knowing the language the string is in? Will every
               | filename or whatever also have its language as part of
               | the string? I am not sure what the plan is there.
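
The language dependence can be seen with the Turkish dotted and dotless i. The sketch below uses Python's built-in, locale-independent case mappings; `round_trips` is a made-up helper. In Turkish, uppercase "I" pairs with dotless "ı" (U+0131), which a language-blind mapping cannot know.

```python
def round_trips(ch: str) -> bool:
    # Does upper-casing and then lower-casing give the same letter back?
    return ch.upper().lower() == ch

# Language-blind mappings as implemented by Python's str methods:
assert "I".lower() == "i"              # Turkish would expect dotless U+0131
assert "\u0131".upper() == "I"         # dotless i uppercases to plain I
assert "\u0130".lower() == "i\u0307"   # dotted capital I becomes i + combining dot

print(round_trips("a"))        # an ordinary letter survives the round trip
print(round_trips("\u0131"))   # dotless i does not
```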
        
             | somat wrote:
             | I would say it is not the place of posix to prescribe how
             | it should be, the job of posix is describe what it is, a
             | common operating environment. this is why posix is such a
             | mess and why I feel it is not a big deal to deviate from
             | posix, however posix fills an important role in getting
             | everyone on the same page for interoperability.
             | 
             | In my opinion the way to improve this, is bottom up, not
             | top down. Start with linux (these days posix is largely
             | "what does linux do?"), get a patch in that changes the
             | definition of the user name from a subset of ascii to a
             | subset of utf-8. what subset? that is a much harder problem
             | with utf-8 than ascii, good luck. get a similar patch in
             | for a few of the bsds. then you tell posix what the os's are
             | doing. and fight to get it included.
             | 
             | On the subject of what unicode subset. perhaps the most
             | enlightened thing to do is the same as the unix filesystem
             | and punt. one neat thing about the unix filesystem is that
             | names are not defined in an encoding but as a set of bytes.
             | This has problems and has made many people very mad. but it
             | does mean your file system can be in whatever encoding you
             | want, transitioning to utf-8 was easy (mainly due to the
             | clever backwards compatible nature of utf-8) and we were
             | not locked into a problematic encoding like on windows.
             | perhaps just define that the name is a array of bytes and
             | call it a day. that sounds like the unix way to me.
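
A small sketch of the "names are just bytes" idea from the consumer's side. Python handles OS filenames with the `surrogateescape` error handler so that byte strings that are not valid UTF-8 still round-trip; the sample name below is an arbitrary Latin-1 string chosen for illustration.

```python
raw = b"caf\xe9"  # Latin-1 bytes for "cafe" with an accented e; not valid UTF-8

# Strict decoding refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape carries the undecodable byte through as a lone
# surrogate, so encoding gives back exactly the original bytes.
as_text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok, round_tripped == raw)
```

The name survives untouched, but nothing guarantees it displays or compares sensibly, which is the trade-off of punting on the encoding.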
        
               | tssva wrote:
               | "however posix fills an important role in getting
               | everyone on the same page for interoperability."
               | 
               | Isn't that exactly what the posix username rules are
               | doing? Specifying a set of characters which are portable
               | across systems to allow for interoperability between
               | current and legacy unix systems along with most non-unix
               | systems.
               | 
               | "Start with linux"
               | 
               | Which linux? Debian/Ubuntu, Redhat/Fedora, shadow-utils,
               | and systemd all differ.
               | 
               | "get a patch in that changes the definition of the user
               | name from a subset of ascii to a subset of utf-8"
               | 
               | ASCII is a subset of UTF-8 so the POSIX definition
               | already specifies a subset of UTF-8.
        
             | PhilipRoman wrote:
             | Some practical concerns I have with UTF-8 are similar (or
             | even the same, depending on font) characters which can be
             | used in malicious ways (think package names, URLs, etc),
             | not to even mention RTL text and other control characters.
             | Every time I add logging code, I make sure that any
             | "interesting" characters are unambiguously escaped or
             | otherwise signaled out-of-band. Having English as an
             | international writing standard is perfectly fine and I say
             | that as a non-native speaker with a non-ascii name.
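
One way to do the kind of escaping described above, sketched with Python's built-in `unicode_escape` codec. The function name `log_safe` is hypothetical, and this is only one possible scheme, not necessarily the commenter's.

```python
def log_safe(text: str) -> str:
    # unicode_escape turns every non-ASCII or control character (and the
    # backslash itself) into a visible backslash escape.
    return text.encode("unicode_escape").decode("ascii")

print(log_safe("user=admin"))
print(log_safe("user=\u0430dmin"))      # Cyrillic "a" becomes visible
print(log_safe("name=abc\u202etxt"))    # right-to-left override becomes visible
```

Plain ASCII passes through unchanged, so ordinary log lines stay readable while lookalikes and direction overrides stand out.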
        
               | abdullahkhalids wrote:
               | A good chunk of the world does not speak english or latin
               | character based languages. They should be able to
               | interact with computers completely in their own languages
               | and alphabet sets, even if those are written right-to-
               | left or top-to-bottom.
               | 
               | Of course, someone has to do the work to make this
               | possible. And no one is obliged to do it. But to suggest
               | that, such work should not be done at all, does not sit
               | right.
        
               | hnthrowaway6543 wrote:
               | > A good chunk of the world does not speak english or
               | latin character based languages.
               | 
               | nearly everyone in a first world country knows the
               | English alphabet though. a vast majority of the
               | developing world as well. just look at street view on
               | Google maps in any country, there's going to be a ton of
               | street signs using English characters, even in non-
               | touristy areas.
               | 
               | > They should be able to interact with computers
               | completely in their own languages and alphabet sets, even
               | if those are written right-to-left or top-to-bottom.
               | 
               | if you're a typical android/ios end user you're
               | interacting with a computer in your native language
               | anyway. this discussion only applies to low level power
               | users.
               | 
               | in that case: why? these aren't user-facing features.
               | this is like saying that people should be able to use
               | symbols native to their language rather than greek
               | letters when writing math papers.
               | 
               | it might not be "fair" that English is overrepresented in
               | computing but it also hasn't demonstrably been a barrier
               | to entry. Japan, Korea and China have dominated,
               | particularly in hardware.
               | 
               | if you think it should be fixed why stop at usernames?
               | why represent uids with 1234 instead of 一二三四?
        
               | abdullahkhalids wrote:
               | > if you're a typical android/ios end user you're
               | interacting with a computer in your native language
               | anyway. this discussion only applies to low level power
               | users.
               | 
               | I don't think you realize how poor this experience is.
               | Partly the reason being that the underlying system is so
               | english focused, that app developers have to do so much
               | work to get things working.
               | 
               | > if you think it should be fixed why stop at usernames?
               | why represent uids with 1234 instead of 一二三四?
               | 
               | I mean, if the computers had first been built in south
               | east asia, they would have been.
        
               | hnthrowaway6543 wrote:
               | it's certainly hard to localize everything but billions
               | of people use ios/android in India, China, SEA, MENA,
               | etc... i think it's fair to say that at the end user
               | level, computers are in fact usable by non-English
               | speakers.
               | 
               | individual apps may not be as usable, but that's on the
               | developers. good counter-example, a lot of japanese
               | games, even made within the past 5 years, require setting
               | the Windows system locale to Japanese to function
               | properly. and as someone who played a fair number of
               | japanese doujin games in the 00s/10s, it used to be
               | _every_ game with this problem.
               | 
               | > I mean, if the computers had first been built in south
               | east asia, they would have been.
               | 
               | debatable as CJK heavily use Arabic numerals everywhere,
               | but even if they did, so what? you'd learn those symbols
               | and get used to it. the same way that if you're a unix
               | sysadmin you get used to only being able to use a small
               | subset of ASCII characters for usernames.
        
               | abdullahkhalids wrote:
               | > it's certainly hard to localize everything but billions
               | of people use ios/android in India, China, SEA, MENA,
               | etc... i think it's fair to say that at the end user
               | level, computers are in fact usable by non-English
               | speakers.
               | 
               | It's important to contextualize these discussions in
               | socioeconomics. Computers are not just fun play things.
               | They are serious tools used for economic activities.
               | Their usage, through their design, has significant impact
               | on the social systems of society. Non-latin-language
               | speakers are able to use poorly localized computers, but
               | they are only able to use them less well than the latin-
               | language speakers. At least in South Asia, there is a
               | huge economic divide between those who can speak English
               | and those who can't, where causality runs both ways, and
               | in more recent times exacerbated by the inability of some
               | to use technology. And that economic divide then causes
               | huge sociopolitical problems in societies.
               | 
               | If computers are means for economic progress, we
               | shouldn't put the condition that one has to somehow learn
               | English to use them well. But isn't localization
               | sufficient? No it isn't. Ignore even that localization
               | requires some members of your language to be dual
               | speakers. The current era of economic progress is
               | characterized by software development. But if the only
               | way you can develop software is to learn a foreign
               | language, then surely we are denying economic progress to
               | some communities.
               | 
               | P.S. I will repeat. Nobody has to do any work to help
               | other communities. But to assert that such work should
               | not happen is plain wrong.
        
               | hnthrowaway6543 wrote:
               | you're confusing "speaking English" with "knowing the
               | English alphabet." these things are orthogonal. 95%+ of
               | people in those countries know the english alphabet. i
               | just threw down google maps street view at a random spot
               | in Phnom Penh and instantly found english letters visible
               | from the street, on advertisements[0]. then i threw it
               | down in a much smaller Thai city that i had never heard
               | of, Nakhon Sawan, and instantly found English on the
               | street.[1] i've been in China, Japan and Korea enough to
               | know english characters are all over the place. the
               | English alphabet is omnipresent _everywhere_ , i think
               | you fail to realize this. nobody who is using a computer
               | in these places is getting confused by the english
               | alphabet.
               | 
               | > But to assert that such work should not happen is plain
               | wrong.
               | 
               | i assert it should not happen because it's not solving an
               | actual problem, the same way that changing "x" and "y" to
               | "k" and "t" in algebra doesn't solve a problem, and
               | trying to "solve" it will lead to a monstrous amount of
               | incompatibilities and confusion. here's a really good
               | comparison: ipv6. IPv6 _is_ solving a problem, maybe in a
               | way people disagree with, but definitely a real
               | problem... and yet _we still can't make ipv6 fucking
               | work_ after God knows how many years, and trying to get
               | IPv6 networking at any sort of scale is a massive fucking
               | headache. now we want to go through the same headaches to
               | support... umlauts in usernames? yeah, no thanks.
               | 
               | there's enough real work left to be done in the world
               | that we shouldn't waste time with stupid makework like
               | this.
               | 
               | or maybe in 30 years i'll be able to call up IT support
               | and say "hey i forgot my password, can you reset it? my
               | username is Shen Wang  s`wd. ... need me to spell that
               | for you?"
               | 
               | edit: somewhat ironically, HN swallowed a few of the
               | unicode characters in my theoretical future username...
               | 
               | [0] https://i.imgur.com/0WkG0ze.png
               | 
               | [1] https://i.imgur.com/VhDR5Xh.png
        
               | abdullahkhalids wrote:
               | I am from Pakistan. At least in South Asia, there are
               | english characters everywhere because the infrastructure
               | is primarily designed for the rich english-speaking
               | classes, while the poor are left behind. A serious
               | political problem.
               | 
               | I have seen many non-english speaking people interact
               | with computers in English, both poor people and old folks
               | in rich families who don't know English. They kinda
               | recognize the shape of words, or they go by icons. They
               | don't actually know the meaning of anything. They can
               | only do a limited set of pre-memorized actions. Scamming
               | them is easy. If they get stuck, they need to beg someone
               | to help them.
               | 
               | Again, I will say this. There are two problems here. One
               | for users and one for developers. Users must be able to
               | read in their own language. Developers must be able to
               | develop in their own language.
        
               | wongarsu wrote:
               | > They kinda recognize the shape of words, or they go by
               | icons. They don't actually know the meaning of anything.
               | 
               | That's kind of true of a lot of English computer users
               | too.
               | 
               | But more to the point, what you are advocating for is
               | translating the interface. Which I think nobody is
               | against, and which is a common thing to do (at least for
               | countries people care about, which sadly excludes a lot
               | of the poorer parts of the world). The username prompt
               | should read "username" in Pakistani. That doesn't
               | automatically mean it has to accept non-ascii input too,
               | as long as you accept unicode in the display name.
               | 
               | > Developers must be able to develop in their own
               | language.
               | 
               | I learned coding in Pascal before I learned that "if" is
               | an English word. English helps, but in the end keywords
               | in programming languages and shell commands are only
               | mnemonics. Knowing the translation helps but isn't
               | necessary. What's important are documentation, tutorials
               | and other resources in a language the developer
               | understands.
        
               | citrin_ru wrote:
               | > nearly everyone in a first world country knows the
               | English alphabet though
               | 
               | And not only 1st world. Actually the bigger country the
               | more everything is localized - from dubbed films to food
               | packaging labels. In a small country one would see more
               | English/Spanish/French etc. because they don't have
               | resources to localize everything.
        
               | Muromec wrote:
               | Oh no please, I don't want to have my linux username in
               | Cyrillic. Thanks but no, thanks!
               | 
               | I know enough linux to see 10 ways in which it will make
               | things worse at some point.
        
               | notpushkin wrote:
               | This isn't quite black and white.
               | 
               | Right now, I can set up and use Linux in my language,
               | have my display name in my script, but my username and
               | password are ASCII-only and are available on the standard
               | English keyboard anywhere. If I run into trouble, I can
               | SSH in _from any device in the world_ without any issue.
               | I can just borrow a laptop from anyone, switch to English
               | if needed, and jump right in.
               | 
               | Having a common denominator set of characters for such
               | things is just really, really useful. I'd rather focus on
               | all the other things that need to be localised.
        
               | folmar wrote:
               | Without any issue is a stretch, using a French keyboard
               | is a bad enough experience for passwords; not everyone uses
               | standard English keyboards.
        
               | wongarsu wrote:
               | The French keyboard is the most notable example of anyone
               | using something other than query or quertz. Even Japan
               | and China use an extended querty. But even with the
               | French keyboard the only issue is that everything is in
               | the wrong place, not that the standard 26 "English"
               | letters don't exist or are hard to reach.
               | 
               | Meanwhile using a, e or s in a username or password will
               | make your life much harder once you are in a foreign
               | country. Never mind any letter that isn't derived from
               | the Latin alphabet.
        
               | oarsinsync wrote:
               | > something other than query or quertz. Even Japan and
               | China use an extended querty
               | 
               | qwerty
        
               | citrin_ru wrote:
               | I have an impression that people confuse learning English
               | (which is hard unless your native language is a
               | Germanic/Romance one) with learning to recognize and type
               | Latin characters which is easy and people around the
               | world already use Latin alphabet without knowing any
               | English. You may escape Latin alphabet if you have spent
               | a whole life in a remote village but for people living in
               | cities around the world it should be familiar and not a
               | barrier at all. It's hard to escape Latin characters in
               | the modern world and this ship has already sailed like it
               | or not (I mostly do).
        
               | smitelli wrote:
               | > similar (or even the same, depending on font)
               | characters which can be used in malicious ways
               | 
               | These are called "confusables" and boy does that well run
               | deep:
               | https://www.unicode.org/Public/security/16.0.0/confusables.t...
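
A crude sketch of spotting one kind of confusable: a name that mixes scripts. This does not implement the confusables data linked above. `scripts_used` is a made-up heuristic that reads the script from the first word of each letter's Unicode character name, and the sample strings are invented.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    # Heuristic: the first word of a letter's Unicode name is usually
    # its script ("LATIN", "CYRILLIC", "GREEK", ...).
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

latin = "paypal"
spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(latin == spoof)        # the strings differ even if they render alike
print(scripts_used(latin))
print(scripts_used(spoof))
```

Normalization does not help here: the Cyrillic letter is a distinct character, not a compatibility form of the Latin one.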
        
             | miki123211 wrote:
             | > Computers are used in more than the US and Canada
             | 
             | Even if you speak US (or Canadian) English exclusively,
             | there are still some words that are just impossible to
               | spell correctly in pure ASCII, e.g. résumé, café etc.
        
               | drdeca wrote:
               | "correctly". I don't consider it "incorrect" English when
               | someone writes "cafe" or "resume". It seems to me a
               | little bit pedantic to insist that those words must have
               | the accent marks in order to be correct (when using them
               | in English).
        
               | sneak wrote:
               | Yeah, loanwords are different words than the original
               | word.
               | 
               | The correct plural of "baby" in German is "babys".
        
             | rurban wrote:
             | Almost nobody supports string search and comparison API
             | functions for unicode. The unicode security tables for
             | unicode identifiers are hopelessly broken.
             | 
             | Not even the simplest tools, like grep, support unicode
             | yet. This didn't happen in the last 15 years, even if there
             | are patches and libs.
        
               | ygra wrote:
               | Wasn't one way to make grep faster setting LANG=C to
               | avoid using language-aware string comparison? If so,
               | shouldn't Unicode be supported by default or what would,
               | say, de_DE.UTF-8 actually compare to make it slower?
        
             | patrick451 wrote:
             | Honestly, I just don't care. UTF8 is excessively
             | complicated. ASCII is simple.
        
             | citrin_ru wrote:
             | Unicode opens a whole can of worms. World is already full
             | of software which in theory supports non-ASCII texts but in
             | practice breaks for some use cases. It's easy to allow
             | UTF8, it's hard to test all possible use cases and to
             | foresee them to know what to test. Nowadays I use mostly
             | English so don't see localization bugs but when I used my
             | native language with software/internet (~10y ago) I've
             | encountered too many bugs and avoided using non-ASCII in
             | things like usernames/password, file names and other places
             | when utf-8 may be allowed but causes problems later. Just
             | allowing UTF-8 is rarely enough. Localization is hard so
             | better to start with places where it is important.
             | Usernames are IMHO not one of them.
        
             | numpad0 wrote:
             | NO. PLEASE DON'T. This wreaks havoc especially on East
             | Asian users because Unicode is poorly supported in console
             | on top of being binary non-canonical in both entry and
             | display.
             | 
             | Meaning,
             | 
             |   - :potato: OR :potatoh: may display as :eggplant: OR :potato:
             |   - isEqual(`:eggplant:`, `:eggplant:`) may fail OR succeed
             |   - trying to type :sequence: breaks console until reboot
             |   - typing :potato: may work but not :eggplant:
             |   - users don't know how to spell :eggplant:
             |   - etc.
             | 
             | If you must, please fix Unicode first so that user entry
             | and display would have 1:1 relationship. I do have Han
             | Unification in mind, but I believe the problem isn't unique
             | to the unification or East Asia.
        
           | NoMoreNicksLeft wrote:
           | > properly speaking your system has to handle everything
           | anyway, even if you don't allow local creation.
           | 
           | Honestly, I try not to be a pessimist, but this sounds like
           | the opening narration to some dystopian doomsday movie.
           | Titled something like _You're Not Wrong_, I suppose.
        
         | macintux wrote:
         | At the meatspace level, purely numeric usernames are
         | problematic.
         | 
         | I was working as a contractor at a Fortune 500 firm several
         | years ago when they introduced a new ERP system which
         | apparently encouraged the company to switch to numeric system
         | IDs. Fortunately the technical teams, especially Linux support,
         | objected and it was overruled, but I was just as worried about
         | the communications problems that would result.
         | 
         | When everyone has a system ID that matches a consistent
         | pattern, like "YZ12345", IDs are easy to recognize in
         | documentation and data. An ID like "1234567" could be
         | practically anything.
        
           | PhilipRoman wrote:
           | I really like the concept of adding some redundancy to ids,
           | like a prefix. It helps to disambiguate things (kind of like
           | static typing). A good example is also bank numbers, which
           | must be a multiple of 97 +1, enabling fast client-side
           | validation against typos.
        
             | cupantae wrote:
             | Could you give a reference on this 97 rule? I'm intrigued.
        
               | az09mugen wrote:
               | I was also intrigued, so I searched, and on Wikipedia
               | ( https://en.wikipedia.org/wiki/International_Bank_Account_Num... ),
               | in the section "Validating the IBAN", it is written:
               | 
               |   Interpret the string as a decimal integer and compute
               |   the remainder of that number on division by 97
               | 
               |   If the remainder is 1, the check digit test is passed
               |   and the IBAN might be valid
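               | 
               | A minimal Python sketch of that check, for anyone who
               | wants to try it. The two quoted steps are the end of the
               | procedure; before them, the first four characters are
               | moved to the end and each letter is replaced by a number
               | (A=10 ... Z=35). The function name is made up for
               | illustration, and passing this test does not prove the
               | account exists.
               | 

```python
def iban_passes_mod97(iban: str) -> bool:
    """Return True if the string passes the IBAN mod-97 check-digit test."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits (first four characters) to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps "0".."9" to 0..9 and "A".."Z" to 10..35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # Interpret as a decimal integer; a valid IBAN leaves remainder 1.
    return int(digits) % 97 == 1


if __name__ == "__main__":
    print(iban_passes_mod97("GB82 WEST 1234 5698 7654 32"))  # Wikipedia's example
    print(iban_passes_mod97("GB82 WEST 1234 5698 7654 33"))  # one digit changed
```

               | Python integers are arbitrary precision, so the long
               | number needs no chunking here; languages with fixed-width
               | integers usually compute the remainder piecewise.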
        
           | Spooky23 wrote:
           | It's pretty common in places that handle Tax data.
           | 
           | At the end of the day, pushing opinionated bullshit doesn't
           | belong in utilities. If there's a security vulnerability,
           | sell that and push for incorporation into NIST standards.
        
         | hulitu wrote:
         | > Allowing purely numeric usernames seems like a terrible idea
         | to me
         | 
         | "I'm not a number, i am a free man. Ha ha ha ha ha"
        
           | kps wrote:
           | "Who is UID 0?"
           | 
           | "You are UID 6."
        
             | wombatpm wrote:
             | You have an off by one error. But I honestly don't know
             | which you should change to with the spirit of the show.
        
         | thephyber wrote:
         | I am also worried about more subtle bugs caused by usernames
         | that are not strictly only-numeric, such as "10e2" or
         | "0xDEADBEEF".
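         | 
         | A small Python sketch of why those names are risky: several
         | lenient parsers happily turn them into numbers. The helper
         | name is hypothetical; the point is only that "is this a
         | number?" has more than one answer depending on the parser a
         | tool happens to use.
         | 

```python
def parses_as_number(name: str) -> bool:
    """Return True if any of a few common lenient parsers accepts the name."""
    parsers = (
        int,                   # plain decimal, e.g. "1234"
        lambda s: int(s, 0),   # prefixed literals, e.g. "0xDEADBEEF"
        float,                 # scientific notation, e.g. "10e2"; also "nan"
    )
    for parse in parsers:
        try:
            parse(name)
            return True
        except ValueError:
            continue
    return False


if __name__ == "__main__":
    for candidate in ("1234", "10e2", "0xDEADBEEF", "nan", "emollier"):
        print(candidate, parses_as_number(candidate))
```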
        
         | Ferret7446 wrote:
         | It shouldn't be a problem as long as the system disallows a
         | numeric username to be the same as an existing UID (excepting
         | the case where the matching UID is assigned to said username).
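         | 
         | Roughly what I have in mind, as a Python sketch. The function
         | and the dict of existing accounts are hypothetical, not any
         | real tool's API.
         | 

```python
def numeric_name_allowed(name: str, uid: int, existing: dict[str, int]) -> bool:
    """Apply the rule above to a new account.

    `existing` maps usernames already on the system to their UIDs;
    `uid` is the UID that would be assigned to the new account.
    """
    if not (name.isascii() and name.isdigit()):
        return True  # the rule only constrains purely numeric names
    as_number = int(name)
    if as_number == uid:
        return True  # the name matches the UID assigned to this very account
    # Otherwise the name must not look like somebody else's UID.
    return as_number not in existing.values()


if __name__ == "__main__":
    accounts = {"root": 0, "alice": 1000}
    print(numeric_name_allowed("1000", 1001, accounts))  # clashes with alice's UID
    print(numeric_name_allowed("1001", 1001, accounts))  # name equals own UID
```

         | This only covers the moment of creation; a UID handed out
         | later could still collide with an existing numeric name
         | unless the allocator checks in the other direction too.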
        
         | Spooky23 wrote:
         | There's lots of dumb things that you can do. Where do the
         | safety bumpers stop?
        
           | pas wrote:
           | wherever each community puts them?
        
       | huhtenberg wrote:
       | Sounds like a solution in search of a problem.
       | 
       | And a disruptive solution with unclear side effects at that.
        
       | johnisgood wrote:
       | > If a keyboard input system provides the former sequence of
       | bytes, but the username is stored in the login infrastructure
       | using the latter sequence of [bytes], then a naive comparison
       | will not find the user "emollier" in the system. Unicode defines
       | in Annex 15 a few normalization forms as a way to work around
       | this problem. But a correct use of these normalization forms
       | still requires coordination and standardization among all
       | programs accessing the data.
       | 
       | ICU could work, but adds an extra dependency, there is also GNU's
       | libunistring.
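       | 
       | For illustration, the mismatch and the fix in Python, whose
       | standard library ships the normalization forms. The username
       | is my guess at what the article's example looks like with the
       | accent restored, and NFC is just one possible choice; the hard
       | part from the quote (every program agreeing on it) is not
       | something a snippet solves.
       | 

```python
import unicodedata

composed = "\u00e9mollier"     # "é" as a single code point, U+00E9
decomposed = "e\u0301mollier"  # "e" followed by combining acute accent, U+0301


def same_username(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(composed.encode("utf-8"))    # bytes for the composed form
    print(decomposed.encode("utf-8"))  # different bytes, same visible text
    print(composed == decomposed)      # naive comparison
    print(same_username(composed, decomposed))
```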
        
       | resource_waste wrote:
       | This is important because Debian-family is used on many servers?
       | 
       | Debian seems to just squander resources on things a few powerful
       | people care about.
       | 
       | All my servers have been Debian-based, so I can't be too hard on
       | them, but whenever I see someone recommend a Debian-family distro
       | as a Desktop OS, I feel like I need to call the police.
        
       | perlgeek wrote:
       | Just imagine how many poorly-written shell scripts will break
       | when we suddenly allow dollars, quotes, backticks and the like
       | in usernames. Heck, even allowing spaces sounds like horror to me.
       | 
       | On the display side, I'm sure most tools that display usernames
       | won't make it easy to see if there are leading or trailing
       | whitespace characters, double blanks, tabs etc in usernames.
       | 
       | This sounds like support hell to me.
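       | 
       | A sketch of the failure mode and two ways around it, in
       | Python rather than shell. The username is a made-up worst
       | case; shlex.quote() and passing an argument list to
       | subprocess.run() are standard library features, the helper
       | names are mine.
       | 

```python
import shlex
import subprocess
import sys

# Hypothetical username containing characters a POSIX shell treats specially,
# plus trailing blanks that are invisible in most displays.
hostile_name = "joe $(touch /tmp/pwned) `id` 'x'  "


def unsafe_command(name: str) -> str:
    # Plain interpolation: a shell would expand $(...) and backticks
    # and split the result on whitespace.
    return f"echo Hello {name}"


def quoted_command(name: str) -> str:
    # shlex.quote() wraps the value so a POSIX shell sees one literal word.
    return f"echo Hello {shlex.quote(name)}"


def echo_via_argv(name: str) -> str:
    # No shell involved: the name travels as a single argv entry.
    # repr() in the child makes leading/trailing blanks visible.
    child = "import sys; print(repr(sys.argv[1]))"
    result = subprocess.run(
        [sys.executable, "-c", child, name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(unsafe_command(hostile_name))
    print(quoted_command(hostile_name))
    print(echo_via_argv(hostile_name))
```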
        
         | gmuslera wrote:
         | The problem could be old scripts or systems that don't handle
         | UTF-8 (and they don't need to be the ones where the username
         | was defined). I'm not sure whether, e.g., the Bobby Tables
         | trick could be done with characters whose UTF-8 representation
         | gets read as pure ASCII.
        
         | Starlevel004 wrote:
         | Breaking shell scripts sounds like a good idea to me. The
         | faster they die the better the world gets.
        
           | Rygian wrote:
           | That's going to be a very bumpy road, even if everyone were
           | to agree that the destination is appealing.
        
             | bigstrat2003 wrote:
             | Yeah for better or for worse compatibility is king. I
             | _despise_ shell scripts, they are an absolute nightmare to
             | work with and full of footguns. But they are so commonplace
             | that people are not going to tolerate YOLO breaking
             | changes.
        
             | raverbashing wrote:
             | Yeah I think ESH
             | 
             | While we have more modern shells the fact that bash (or
             | even sh) is the "common denominator" 30 yrs on is both good
             | and awful
             | 
             | We need a PowerShell for Linux
        
               | ygra wrote:
               | Not even that is free of footguns, especially around
               | argument parsing and calling native commands.
        
           | chikere232 wrote:
           | Perhaps unix isn't for you?
        
           | makeitdouble wrote:
           | Thing is, they don't die. Instead you get the short end of
           | the stick.
           | 
           | You'd have to be pretty darn important for an org to fix
           | their scripts because of your name or the username you
           | created. Or it would need to happen at a larger scale, but
           | then that wouldn't be so controversial in the first place.
        
         | codedokode wrote:
         | But spaces have been allowed in filenames since the 80s; hasn't
         | software had enough time to adapt?
        
           | michaelt wrote:
           | Microsoft's Windows 95 put spaces into "c:\My Documents" and
           | "c:\Program Files" so that developers targeting Windows were
           | _forced_ to support spaces in filenames.
           | 
           | Of course, in those days if an OS upgrade broke some third
           | party software, the end user _paid for an upgrade_. So
           | although Microsoft forced developers' hands, the developers
           | all got paid for their trouble. And you'd only have your hand
           | forced that way once or twice a decade.
           | 
           | Windows at the time was also all about the GUI file-pickers.
           | Breaking the command line? Shell scripts? What are those?
        
             | toast0 wrote:
             | And now it's \Users, presumably because after 20 years,
             | Microsoft gave up?
        
               | hwc wrote:
               | Or someone got tired of typing long paths.
        
               | Uvix wrote:
               | They changed from \Documents and Settings to \Users in
               | Vista, alongside other profile rejiggering (e.g.
               | introducing AppData folders). By that point software had
               | either been fixed or would never be fixed, so keeping a
               | space in the name wasn't particularly useful.
        
               | rcxdude wrote:
               | It's still very common for usernames to have spaces,
               | though.
        
               | alterom wrote:
               | _And now it's \Users, presumably because after 20 years,
               | Microsoft gave up?_
               | 
               | Only if you assume that people rarely have spaces in
               | their Windows login names (e.g. "Joe Smith").
               | 
               | Either that, or Windows users have learned to _not be
               | scared of spaces_ in filenames, usernames, and _their own
               | literal names_.
        
               | numpad0 wrote:
               | Windows set up with Microsoft Account uses abbreviated
               | e-mail for user names, because UTF-8 breaks apps,
               | including many East Asian apps.
               | 
               | non-Western Windows users always knew never to use
               | anything outside ASCII for usernames, passwords, or any
               | programmatically used identifiers. It's English users
               | that haven't learned it.
        
               | throw16180339 wrote:
               | IIRC, they changed it to get more value out of the 260
               | character MAX_PATH. I know there was some sort of
               | manifest to enable longer paths, but I'm not sure what
               | the current status is.
        
               | LegionMammal978 wrote:
               | The status quo is that officially, you still have to both
               | set a registry key (or equivalently, set an option in the
               | Group Policy Editor) and add an element to each
               | application manifest.
               | 
               | The official workaround at runtime is to use the "\\?\"
               | prefix with an absolute path to create an unrestricted
               | verbatim pathname. For instance, the fs::canonicalize()
               | function in Rust will always return such a pathname, to
               | many programmers' dismay, since outside tools often choke
               | on them.
               | 
               | The unofficial workaround is to set the undocumented
               | IsLongPathAwareProcess bit in the process's PEB. The Go
               | runtime does this, but silently falls back to "\\?\"
               | prefixes if the Windows version is too old.
               | 
               | (Note that in general, canonicalizing paths is safer on
               | Windows than on Unix-like systems, since open directories
               | cannot be renamed.)
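               | 
               | To make the prefix workaround concrete, a rough Python
               | sketch that builds such a verbatim path from an ordinary
               | absolute one. It uses the standard ntpath module so it
               | runs on any OS; the function name is hypothetical, and
               | the "UNC\" form for network shares is how I understand
               | the convention. Since Windows skips its usual path
               | normalization for these paths, the sketch normalizes
               | first.
               | 

```python
import ntpath

VERBATIM_PREFIX = "\\\\?\\"  # the four characters \\?\


def to_verbatim(path: str) -> str:
    """Turn an absolute Windows path into its verbatim (long-path) form."""
    if path.startswith(VERBATIM_PREFIX):
        return path  # already verbatim, leave untouched
    if not ntpath.isabs(path):
        raise ValueError("verbatim paths must be absolute")
    # Resolve "..", "." and forward slashes now, because they are
    # taken literally once the prefix is present.
    normalized = ntpath.normpath(path)
    if normalized.startswith("\\\\"):
        # \\server\share\dir becomes \\?\UNC\server\share\dir
        return VERBATIM_PREFIX + "UNC\\" + normalized[2:]
    return VERBATIM_PREFIX + normalized


if __name__ == "__main__":
    print(to_verbatim("C:\\Users\\joe\\file.txt"))
    print(to_verbatim("\\\\server\\share\\dir"))
```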
        
               | 3eb7988a1663 wrote:
               | OneDrive breaks that convention. Last two companies I was
               | at, the corporate location was something like
               | "$HOME/OneDrive - $COMPANY". That the two companies had
               | the same format tells me it is a default and/or suggested
               | practice for some reason.
        
             | bigstrat2003 wrote:
             | That doesn't sound right. Microsoft is _obsessed_ with
             | backwards compatibility, going so far as to accommodate
             | programs that were _writing to Windows' private memory_
             | just to preserve it. Deliberately breaking programs isn't
             | in their ethos at all.
        
               | sltkr wrote:
               | The new filesystem APIs were introduced with Windows 95,
               | so there was no backward compatibility to break. _New_
               | programs using those _new_ APIs were forced to support
               | spaces in directories. Using spaces in the system
               | directories forced application developers to consider
               | that scenario and deal with it appropriately.
               | 
               | Meanwhile, DOS and Windows 3.1 applications that did run
               | on Windows 95 could access files under a backward
               | compatible 8.3 scheme, like C:\Progra~1\ instead of
               | "C:\Program Files".
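               | 
               | A deliberately simplified Python sketch of how such an
               | alias is derived: drop spaces, uppercase, keep six
               | characters and append "~1". The real algorithm also
               | handles collisions, illegal characters and more, so
               | treat this as an illustration only; the function name
               | is made up.
               | 

```python
def simplified_short_name(long_name: str, index: int = 1) -> str:
    """Approximate the 8.3 alias of a long file name (no collision handling)."""

    def clean(part: str) -> str:
        # Spaces and extra dots do not survive into the short name.
        return "".join(ch for ch in part.upper() if ch not in " .")

    base, dot, ext = long_name.rpartition(".")
    if not dot:
        # No extension at all: the whole name is the base.
        base, ext = long_name, ""
    short = clean(base)[:6] + f"~{index}"
    ext = clean(ext)[:3]
    return f"{short}.{ext}" if ext else short


if __name__ == "__main__":
    print(simplified_short_name("Program Files"))
    print(simplified_short_name("My Documents"))
    print(simplified_short_name("Long Document Name.docx"))
```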
        
               | bigstrat2003 wrote:
               | That's a good point, thanks for pointing it out.
        
               | michaelt wrote:
               | I'm thinking of the transitions from Windows 3.1 to
               | Windows 95 (IIRC introducing 32-bit and filenames longer
               | than 8 characters) and the transition from Windows 95 to
               | Windows XP (IIRC introducing a proper permission system,
               | thus breaking anything that relied on being able to write
               | things outside of user-owned folders)
               | 
               | I agree they were famously accommodating in those days.
               | But they also had enough market power that if they said
               | users could only write to one folder and it had a space
               | in the filename, developers who disliked it couldn't vote
               | with their feet.
        
               | lousken wrote:
               | And yet... if you create a user using a display name, e.g.
               | Peter Cenicka, in AAD and deploy a PC with Intune, you will
               | get a home folder called PeterCenicka.[0] It breaks SO MANY
               | things. And no, that beta UTF-8 system-wide setting does
               | not work with 3rd party apps.
               | 
               | I just don't understand why they don't use part of the
               | email address as the home folder name. And just because
               | of this stupidity, user display names have to be without
               | any of these characters.
               | 
               | Microsoft ... PLEASE
               | 
               | [0] https://doitpshway.com/do-not-use-diacritics-in-aad-
               | user-dis...
        
             | dizhn wrote:
             | A lot of software still had issues and asked the user to
             | use C:\Directory directly. Some probably still do.
        
               | reginald78 wrote:
               | I remember trying to install Visual Studio in the mid-
               | late 2000s (when SSDs made hard drive space small again)
               | to a directory other than C: and found that after
               | following a rather convoluted process you could only
               | actually move maybe 20% of the install files off C:.
        
               | StefanBatory wrote:
               | It is still the same. :(
        
               | yonatan8070 wrote:
               | I've seen some things installing directly into C:\,
               | NVIDIA's software jumps to mind
        
             | akira2501 wrote:
             | C:\Progra~1
             | 
             | They didn't force anything.
        
               | moritzwarhier wrote:
               | Did they intentionally use only folder names with spaces
               | that are at least 9 characters long and with the space
               | after the first 6, so that the 8.3 version contains no
               | spaces?
               | 
               | Pretty clever if so :D
        
               | volemo wrote:
               | What 'bout "C:\My Documents" though?
        
               | cobbaut wrote:
               | That came later, end of 1996 with OSR2.
        
               | repiret wrote:
               | A space in an otherwise 8.3 file name would still be
               | treated as a long file name and get a ~1 short name alias.
        
               | moritzwarhier wrote:
               | Thanks for the clarification!
               | 
               | I was curious about a deep dive into this topic, and
               | skimmed the MS doc pages after a Google search. They
               | mentioned different Windows APIs and Long file names, but
               | the only mention of the tilde compat layer I found was
               | very superficial ("some file-systems" use the tilde as
               | special character), so I abandoned my initial interest in
               | getting up to speed on this during a 2min weekend read.
        
           | deltarholamda wrote:
           | My last name has an apostrophe in it. This isn't super weird
           | or anything, there have been "O'Haras" and "O'Neills" (with 2
           | Ls) forever.
           | 
           | And yet whenever I deal with a computer system I don't put
           | the apostrophe in because even in 2024 it is completely
           | jacked up. Sometimes it's just disallowed. Sometimes I get
           | "\\\'" showing up. Sometimes I get "&apos;". I've seen
           | "&#8217;". One time, one system accepted it, but another
           | system that accessed the same data didn't allow apostrophes
           | so the person using the second system couldn't access the
           | record, and it took 2 phone calls and 3 people to come up
           | with a workaround.
           | 
           | It doesn't work often enough that I don't even try anymore.
           | There are just too many opportunities for it to get forgotten
           | or handled improperly from all directions.
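           | 
           | For what it's worth, none of this needs to be hard. A small
           | Python sketch of the two things that usually go wrong:
           | the name is stored through a parameterized query (no manual
           | backslash escaping), and HTML escaping is applied once, at
           | output time only. The table and names are hypothetical.
           | 

```python
import html
import sqlite3

surname = "O'Neill"

# Storage: pass the value as a parameter instead of pasting it into the SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (surname TEXT)")
conn.execute("INSERT INTO people (surname) VALUES (?)", (surname,))
stored = conn.execute(
    "SELECT surname FROM people WHERE surname = ?", (surname,)
).fetchone()[0]

# Display: escape exactly once, when producing HTML.
escaped_once = html.escape(stored)
# Escaping already-escaped text is how entities end up visible to the user.
escaped_twice = html.escape(escaped_once)

if __name__ == "__main__":
    print(stored)
    print(escaped_once)
    print(escaped_twice)
```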
        
             | soneil wrote:
             | I had fun in the vmware-broadcom transition because the
             | broadcom portal doesn't allow that, but the vmware portal
             | did. Not even in my username, just in the surname field.
             | The new portal ate it on that so hard, I wasn't even
             | allowed to create a ticket to do anything about it.
             | 
             | Not as bad as when I was once issued a first.o'last@corp
             | email address though ..
        
               | mixmastamyk wrote:
               | There may be a Unicode character that looks like
               | apostrophe but has no quoting semantics. I use an arrow
               | instead of greater-than symbol in my prompt for the same
               | reason. To avoid copy/paste issues.
        
               | jcranmer wrote:
               | Non-ASCII characters in email addresses have even worse
               | compatibility issues than punctuation characters.
               | Punctuation fails because people don't know the standard.
               | Non-ASCII fails because people don't know the _latest_
               | standard.
        
               | deltarholamda wrote:
               | >Not as bad as when I was once issued a first.o'last@corp
               | email address though
               | 
               | Oh, man, that happened to me too, way back in the late
               | 90s. I had forgotten about that.
               | 
               | It broke things all over the place. Even now you run into
               | the occasional validator that is convinced that the plus
               | sign is not valid in email addresses.
        
               | mschuster91 wrote:
               | > Even now you run into the occasional validator that is
               | convinced that the plus sign is not valid in email
               | addresses.
               | 
               | These are intentional IMHO - force people to use their
               | actual email address so a potential breach can't be tied
               | back to the service. That's the _only_ reason why someone
               | would use a + in the first place.
        
               | tolciho wrote:
               | Some validators are silly regular expressions that
               | someone wrote in a minute without thinking about it
               | ("Mastering Regular Expressions" has a regex associated
               | with it for better matching an address; that regex is
               | quite the sight to behold). And disallowing + is a crummy
               | solution to whatever "force people to use their actual
               | email address" means given that someone with full control
               | of a domain can invent the alias
               | whatevertheywant@example.org instead of using something
               | with a + in it, or they can spin up an alternate address
               | on some alternate provider, etc.
               | 
               | Other reasons folks use + in their email is to do mail
               | routing (except where crappy web services disallow the +
               | because they relied on a crappy regex) but then again I
               | have no idea what "potential breach can't be tied back to
               | the service" is meant to mean.
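               | 
               | To show what I mean by a silly regex, a Python sketch
               | with a made-up example of the kind of pattern that gets
               | written in a minute, next to a much looser check. Neither
               | is a real validator from any particular product, and the
               | loose one is not RFC-complete either; sending a
               | confirmation mail is the only real test.
               | 

```python
import re

# Hypothetical "written in a minute" pattern: no "+", no apostrophe,
# and only two- or three-letter top-level domains.
NAIVE_EMAIL = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,3}$")


def loosely_valid(address: str) -> bool:
    """Accept anything shaped like local@domain.tld without whitespace."""
    local, at, domain = address.rpartition("@")
    if not at or not local:
        return False
    if any(ch.isspace() for ch in address):
        return False
    return "." in domain


if __name__ == "__main__":
    samples = (
        "user+tag@example.org",
        "first.o'last@example.org",
        "user@example.info",
        "not-an-address",
    )
    for sample in samples:
        print(sample, bool(NAIVE_EMAIL.match(sample)), loosely_valid(sample))
```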
        
               | mschuster91 wrote:
               | > but then again I have no idea what "potential breach
               | can't be tied back to the service" is meant to mean.
               | 
               | Easy. Say I subscribe as "username+servicename@gmail.com"
               | everywhere, when I get spam at that email address that
               | service must have been either breached or sold off my
               | data.
        
               | jonathanlydall wrote:
               | More likely just a default.
               | 
               | I built the authentication system on our website and as a
               | regular user of Gmail + aliasing I was very surprised
               | when my brother pointed out our website didn't allow
               | them.
               | 
               | Turns out the default for Microsoft's ASP.NET Identity
               | Framework is to disallow special characters, but simply
               | setting a flag in its configuration rectified this.
        
             | graemep wrote:
             | > And yet whenever I deal with a computer system I don't
             | put the apostrophe in because even in 2024
             | 
             | In usernames or in name fields for text generally?
             | 
             | I assume things like bank systems can deal with it because
             | they should match things like IDs?
        
               | deltarholamda wrote:
               | Name fields in general.
               | 
               | But sometimes I don't have control, e.g. another person
               | is inputting the data and dutifully duplicates my name.
               | That's how I ended up with the 2 phone calls/3 person
               | situation, which happened about a month ago.
               | 
               | Hell, my driver's license is missing the apostrophe
               | because the system doesn't accept it.
               | 
               | When somebody is trying to find me in a computer there's
               | a whole litany of things they have to try, including
               | assuming "First O'Lastname" got bashed into "First O.
               | Lastname".
               | 
               | I think about this every time I read an article extolling
               | the wonders of technology.
        
               | tsimionescu wrote:
               | Generally, countries' systems only handle characters in
               | names that are common in that country. Virtually no
               | banking or ID system in Europe or the USA will handle
               | Chinese names, for example. Even if they did at the
               | technical level, it wouldn't actually help at a holistic
               | level, because people who interact with these systems
               | (bank tellers, policemen, etc) can't be expected to
               | recognize any writing system in the world.
               | 
               | So, the reality is that you have to adapt to the country
               | you're trying to live or do business in and the name
               | systems that they can actually use. This can even mean
               | you have to adopt a name that people can actually
               | pronounce, as many Chinese people do when interacting
               | with people outside East Asia.
               | 
               | For example, Chinese is particularly sensitive to tone
               | accent, which extremely few people outside that area can
               | even distinguish, leading to hopeless mispronunciation.
               | Consider that Ma2 and Ma4 are completely different words
               | for a Chinese speaker, while a French speaker who hasn't
               | studied this wouldn't even be able to tell that you are
               | intentionally pronouncing things differently and not just
               | your intonation.
               | 
               | And for a reverse example, if you want to move or do
               | business in Japan, you should adopt a well-known Japanese
               | pronunciation of your name, as otherwise Japanese
               | speakers, who have an extremely limited syllable
               | inventory compared to most other languages in the world,
               | will just not be able to follow your name.
        
               | graemep wrote:
               | That is true, but I think this example shows systems
               | being too restrictive. If people can read Latin letters
               | the system should accept apostrophes.
        
             | jorvi wrote:
             | > One time, one system accepted it, but another system that
             | accessed the same data didn't allow apostrophes so the
             | person using the second system couldn't access the record,
             | and it took 2 phone calls and 3 people to come up with a
             | workaround.
             | 
             | There's still a lot of organisations that somewhere in
             | their e-mail processing chain cannot deal with 4-letter
             | TLD e-mail addresses*. Even worse is that the front-end is
             | often a relatively new framework and will happily accept
             | your e-mail, only to then have it silently fail forever.
             | Mercifully a lot of those organisations have their customer
             | service authorized to change your e-mail address manually,
             | but if they don't.. good luck.
        
           | wongarsu wrote:
           | NPX on Windows was broken for years when your username had a
           | space in it. Never underestimate how long bugs can stay
           | around when they don't affect any of the developers and for
           | everyone else the workaround is quicker than fixing it.
        
           | slightwinder wrote:
           | Problem is, the design of Unix shells is older, and they have
           | some parts which automatically split on space if not handled
           | carefully. This is really annoying.
        
         | rossy wrote:
         | For people using NSS modules like winbind, most of those
         | scripts are already broken
        
         | wolrah wrote:
         | > Just imagine how many poorly-written shell scripts will break
         | when we suddenly allow dollars, quotes, backticks and the like
         | in usernames. Heck, even allowing spaces sounds like horror to
         | me.
         | 
         | If we're admitting they're poorly-written, why can't we admit
         | that they're already broken regardless of whether that
         | brokenness is currently being triggered? Allowing symbols or
         | spaces didn't break anything, it was broken from day one just
         | no one noticed.
         | 
         | Why is the answer always "go out of your way to not upset the
         | broken garbage that's been around forever" rather than "throw
         | Zalgo at it and fix what breaks so it's no longer broken and
         | won't be broken in the future"?
         | 
         | Bug compatibility is the worst behavior of the computing
         | industry. Let the bad code break and more importantly call it
         | out so everyone knows where the blame belongs.
        
           | tsimionescu wrote:
           | Because people don't care about the presence or absence of
           | bugs, they care about getting their work or leisure done with
           | the help of the computer. If the computer isn't working, then
           | they can't get their work done, and so they are mad at
           | whoever broke it (for example by upgrading it, or by adding a
           | username with spaces inside it). If it's working, then
           | they're happy, no matter how dangerously on the precipice it
           | is.
        
         | raverbashing wrote:
         | Yes yes it is
         | 
         | Same for when people are being too clever and use a password
         | generator with all the characters for things you need to
         | call/pass on some types of config files
         | 
         | No, you're not being smart for adding double quotes to a
         | generated password, in fact _quite the contrary_. And guess who
         | needs to try all types of escapes for that?!
         | 
         | TFA seems like another of Debian's self inflicted problems by
         | people trying to be "too smart"
        
       | nmstoker wrote:
       | Unfortunate ambiguous uses of the word drop throughout the
       | otherwise excellent article
        
         | TimK65 wrote:
         | There are three uses of the word "drop," all of which are
         | correct.
         | 
         | The latter-day meaning of "drop" is an abomination.
        
           | toast0 wrote:
           | I dropped X off at Y. Then X dropped off the face of the map,
           | never to be seen again.
           | 
           | Many words and phrases in English are self-antonyms.
        
         | fargle wrote:
         | > The src:shadow package had dropped a Debian-specific patch,
         | 
         | shoot, that's evil. had not noticed this. i read this as
         | "removed", not "was released". now idk.
         | 
         | this pseudo-definition of dropped as "released" is beyond
         | stupid. yikes!
        
       | account42 wrote:
       | Always fun to see people poke the Unicode dragon only to be
       | dumbstruck by its true size as it stands up in preparation of
       | engulfing them with the fire of unintended consequences.
        
         | beardygo wrote:
         | Indeed. As a speaker of several languages, including an RTL
         | language (they haven't even considered the problems with RTL
         | marks etc), I say stay with ASCII for usernames, keep UTF for
         | full names.
         | 
         | If restricted ASCII a-z is good enough for passport names
         | worldwide, it's good enough for usernames.
        
           | macbr wrote:
           | I'm confused - my name as written on my passport definitely
           | contains non ASCII characters?
        
             | extraduder_ire wrote:
             | What is it in the machine-readable section at the bottom?
             | My passport takes the apostrophe out of my name down there.
        
               | belorn wrote:
               | What is the point of a machine-readable name when there
               | is a machine-readable passport number which should be
               | unique for each issuing country? In this age I would
               | assume that places which use machines to read passports
               | also are connected to international databases where the
               | unique number is checked for validation. My country also
               | mandated passport with chips in them for the last couple
               | of decades, so by now there are no longer any valid
               | passports without such chip.
               | 
               | If I had to guess, it seems the machine-readable section
               | is just backward compatibility for machines built during
               | the period where people started doing machine reading of
               | passports but had yet to start putting chips into them.
               | 
               | (as a fun side note, smart phones can read the chip on
               | passports and this is then used by some digital identity
               | providers to establish identity on account creation, in
               | combination with the phone camera).
        
               | Muromec wrote:
               | There is no database to query unless you issued the
               | document (except revocation database). There is a chip
               | with CMS signed data in it and MRZ is used for key
               | agreement to read the data.
               | 
               | To know that the MRZ and the data aren't from a different
               | person or document, they have the name in ASCII. It all kinda
               | works and makes sense in the end.
               | 
               | When you read the card with a phone camera, it uses the MRZ too.
        
               | belorn wrote:
               | Looking it up, the mrz are only there to validate that
               | the information stored on the document is the same as the
               | information provided by the chip, and to make any
               | eavesdrop attacks between the reader and the chip less
               | likely to succeed. It's an optional standard.
               | 
               | The data on the chip is authenticated through a country
               | signing key. This part is mandatory and prevents the
               | person who carries the document from falsifying the
               | information on the chip. There is also an optional active
               | authentication chip to prevent someone from copying a
               | passport even if they have a copy of the MRZ and a copy of
               | the traffic between chip and reader.
               | 
               | The MRZ is also part of the older standard which is
               | intended to be replaced by a newer system that has card
               | access numbers, which means that the MRZ and the ASCII it
               | embeds could very well be gone from passports. This new
               | standard was implemented in the EU by 2014, so there might
               | be passports issued now without the MRZ.
        
               | macbr wrote:
               | Oh, yeah. No non-ASCII in the "machine readable" part.
               | Though I've never seen anything use that section. My
               | national id card also has a "machine readable" section -
               | but that doesn't even contain my whole name: It's just
               | cut off after 20 letters.
        
             | Muromec wrote:
             | You probably have ASCII-adjacent name to begin with, so
             | people who can read some kind of language using Latin
             | letters will simply ignore "funny dots and dashes" and
             | pronounce it kinda wrong.
             | 
             | It's on a different level from having a name originally
             | written in a different alphabet entirely. At this point you
             | just have it written in two scripts, with second being
             | ASCII.
        
           | mschuster91 wrote:
           | > If restricted ASCII a-z is good enough for passport names
           | worldwide, it's good enough for usernames.
           | 
           | Passports (and credit cards) are the best example why ASCII-
           | only is horribly broken. It's 2024, people want to type in
           | their name as they write it normally, and they have the
           | reasonable expectation of IT "dealing with it" behind the
           | scenes.
           | 
           | Unfortunately, that expectation isn't reality, and it's all
           | too common people are being rejected at the border or their
           | card transactions are denied because braindead policies leave
           | no other option but to blanket deny in case of mismatches.
        
         | tgbugs wrote:
         | I made a design decision for a standard for dataset structure
         | to explicitly ban characters beyond ascii [A-Za-z0-9.,-_ ]
         | precisely because all the positivity around utf-8 often leads
         | people to think that it comes with no additional complexity
         | cost. There is an escape hatch with a way to indicate that a
         | dataset uses unicode filenames but the standard states that any
         | consumer may reject such datasets because unicode support is
         | explicitly not required.
         | 
         | I got pushback from people who would not have to implement or
         | maintain the systems for being a backward asciite so seeing
         | this article is rather vindicating.
        
       | miohtama wrote:
       | I remember useradd and adduser when learning Linux and oh boy
       | what a confusion it was... Why not just one command
        
       | abigail95 wrote:
       | if you cannot handle UTF-8 anywhere anything approaching text
       | could be, your program is malformed and should be deprecated and
       | removed.
       | 
       | if you wrote code that couldn't handle bob;>/hacked in a
       | username, you would and should be laughed at.
       | 
       | why are we using this ancient stuff?
        
         | knorker wrote:
         | It's not just programs. And it's not just semantics of all-
         | numeric username. It's also whether you want usernames that you
         | cannot type, nor possibly even render.
         | 
         | Definitely you can't spell it to someone else.
         | 
         | Who owns that file? Oh, it's right-to-left non breaking space
         | smiley snowman Chinese sign for water, I love that guy!
        
           | abigail95 wrote:
           | If people want to set up a Debian environment where people
           | are mixing RTL and Hanzi I see no reason for that to be
           | prohibited.
           | 
           | Debian has opinions but I disagree that they should extend
           | that far.
           | 
           | If my employee Zalgo-fies everything. I don't file a bug
           | report with Debian. I just fire them.
        
             | Muromec wrote:
             | >If my employee Zalgo-fies everything. I don't file a bug
             | report with Debian. I just fire them.
             | 
             | With such a clearly North American attitude you may as well
             | use ASCII for everything.
        
           | Izkata wrote:
           | > Who owns that file? Oh, it's right-to-left non breaking
           | space smiley snowman Chinese sign for water, I love that guy!
           | 
           | This reminds me, around 10 years ago on the chat app we used
           | at work, we were able to change our nicknames and I made mine
           | start with a combining character instead of a regular
           | character. No one could ping me, it broke that part of the UI
           | when they tried.
        
           | adrian_b wrote:
           | This thread, like the parent thread, is full of comments
           | that are completely outdated, because standards for Unicode
           | identifiers already exist and obviously they forbid such
           | cases.
           | 
           | See e.g. RFC 8264. Only a restricted set of characters is
           | permitted in identifiers, mostly letters and digits.
           | 
           | This is enough to write any user name, without allowing
           | "smiley snowman Chinese sign for water" or other such
           | nonsense.
        
         | drtgh wrote:
         | With Unicode the same grapheme can be written with a sequence
         | of one or more code points, and each code point can be a
         | sequence of one or more code units.
         | 
         | For example "å" can be written with U+00E5, and the same visual
         | glyph "å" with U+0061 + U+030A (U+0061 {a} plus the code point
         | U+030A {Combining Ring Above}).
         | 
         | Another homoglyph Unicode user name example:
         | 
         | * is Café == Café ?
         | 
         | * C + a + f + e + U+0301 vs C + a + f + é
         | 
         | * Utf8: 43616665CC81 vs 436166C3A9
         | 
         | As one user has pointed out in another comment, some kind of
         | standardisation for that specific use case with some kind of
         | normalisation would be needed first (nevertheless a database
         | search would want a different one, and so on). The above
         | examples are among the simpler ones, there are also unprintable
         | characters, etc.
         | 
         | It can be done as in "nothing is impossible", but it's not that
         | easy, it's actually complex.
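         | 
         | A minimal Python sketch of the comparison problem described
         | above, using the two UTF-8 byte sequences quoted in this
         | comment. It relies only on the standard unicodedata module;
         | the helper name same_username is hypothetical, and NFC is just
         | one possible choice of normalisation form, not the full set of
         | preparation rules that a standard such as RFC 8265 describes.

```python
import unicodedata

# The two UTF-8 byte sequences quoted in the comment above.
decomposed = bytes.fromhex("43616665CC81").decode("utf-8")   # "Cafe" + U+0301
precomposed = bytes.fromhex("436166C3A9").decode("utf-8")    # "Caf" + U+00E9


def same_username(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(decomposed == precomposed)               # raw code point sequences differ
print(len(decomposed), len(precomposed))       # number of code points in each
print(same_username(decomposed, precomposed))  # comparison after normalisation
```

         | A byte-wise or code-point-wise comparison treats the two names
         | as different users; comparing after normalisation treats them
         | as the same one. Which behaviour is wanted depends on the
         | context, which is the point being argued in this subthread.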
        
           | abigail95 wrote:
           | If a user picks a presentation layer that displays "a" from
           | noncomparable alphabets, but has them look identical - that's
           | a choice they can and should be able to make. I think it's
           | dumb but I'm not here to hold anyone's hand.
           | 
           | It's the user's choice whether 43616665CC81 == 436166C3A9,
           | same for Café == Café. But they are distinct and separate
           | choices. Text and bytes are separate things.
           | 
           | We accept that case sensitivity exists and whether a
           | user/business/program treats them as identical is and _should
           | always be_ their choice to make.
           | 
           | There is abstract complexity in the problem, but the context
           | in which text is used solves most of that.
           | 
           | If I have handwritten notes and I make a copy but write the
           | second one in cursive and ask someone if they say the same
           | thing - the correct answer isn't "we need to create a
           | standard to normalize the presentation of text" - it's "be
           | more precise in what you are asking".
           | 
           | Whether Café == Café depends on if it's written on a road
           | sign, or a network packet with a fixed byte size.
           | 
           | Unprintable characters are not text and should not be stored
           | in text fields. Neither are control characters, and as far as
           | I'm concerned should not be included in any text encoding
           | standard. Formatting and terminal processing _should never be
           | stored in-band_ , that's an obvious design flaw that should
           | be corrected.
           | 
           | We already deal with ambiguity within ASCII re I vs l vs 1.
           | Some fonts render those identically - Using those fonts in a
           | passport is bad design. Saying we should avoid having to
           | compare those characters at all because _some people /systems
           | might confuse them_ is misguided.
           | 
           | This isn't a true rebuttal of what you were saying but some
           | of my next thoughts.
        
             | alterom wrote:
             | _> This isn't a true rebuttal of what you were saying but
             | some of my next thoughts._
             | 
             | I feel it's a rebuttal enough, and it provides a clear
             | answer to the parent's question:
             | 
             | * is Café == Café ?
             | 
             | * C + a + f + e + U+0301 vs C + a + f + é
             | 
             | * Utf8: 43616665CC81 vs 436166C3A9
             | 
             | When we're talking about username/password fields, what
             | we're really talking about is _keystrokes_ , or the _input
             | sequences_ that the user makes to identify themselves.
             | 
             | Android lock screen patterns are passwords, and the answer
             | is blatantly clear there: the _same_ shape drawn in a
             | _different_ way is a _different_ pattern.
             | 
             | The context here isn't "are these two strings saying the
             | same text".
             | 
             | It's "is the person typing this text _who they say they
             | are_ ", boiled down to "can they repeat the input sequence
             | provided at registration".
             | 
             | So, we get the answers:
             | 
             | * _C + a + f + e + U+0301 != C + a + f + é_ if either can be
             | _intentionally_ produced by the user at the log-in screen
             | (i.e., if these Unicode sequences can be produced by
             | different _keystroke sequences_ , and the user knows which
             | output they're producing)
             | 
             | * _C + a + f + e + U+0301 == C + a + f + é_ if _either_ can
             | be obtained as a result of the _same_ keystroke sequence
             | (i.e., if virtual/physical keyboard + OS combinations may
             | represent the same keystroke sequence with _different
             | character sequences_ provided to the program).
             | 
             | * If both are true, _neither should be allowed_
             | 
             | The case of _not all input devices having the keys
             | requisite for reproducing the input sequence_ would boil
             | down to either deciding based on context, or _asking_ the
             | user if they are sure they want to limit themselves to the
             | particular hardware /software combinations to log into the
             | service.
             | 
             | For example, a username like BDZhILKA is perfectly fine
             | _if_ you only ever want to log into the service from
             | devices where a Ukrainian keyboard is available.
             | 
             | Which would be an appropriate assumption for e.g. Ukrainian
             | government systems, where Ukrainian language support is
             | _required by law_ , but not in a general context (what if
             | the user travels outside Ukraine, and wants to log in from a
             | device they don't own and can't enable Ukrainian input
             | on?).
             | 
             | One can't hit the "Zh" key if their keyboard lacks it.
             | 
             | Same goes for the concern raised in the article:
             | 
             |  _> I see and type my username hundreds times a day, people
             | use it to address me in written and spoken conversations
             | with it, etc._
             | 
             | Good. That means that @BDZhILKA is only appropriate where
             | _everyone can be assumed to be able to write and speak
             | Ukrainian_ , which doesn't even hold universally true _in
             | Ukraine_ , unless it's a government office.
             | 
             | That's to say, most people reading this comment won't be
             | able to address me as @BDZhILKA in either a _spoken
             | conversation_ or a _written_ one (copy-pasting is not
             | _writing_ ).
             | 
             | At the same time, if I _can_ type  "BDZhILKA", it should be
             | my _choice_ to have that as a username /log-in name, since
             | _only_ being able to log in from devices with a Ukrainian
             | keyboard would be a _security feature_ for me. I know that
             | I will have that on _my_ devices, but an adversary may not.
             | 
             | Similarly, a log-in name like @SIRNIK _should_ be
             | acceptable if I wanted it.
             | 
             | Note that it's not the same as @CIPHIK - the former uses
             | Ukrainian character set. @SIRNIK != @CIPHIK for
             | authentication purposes because I typed in _different input
             | sequences_ to produce these glyphs on the screen.
             | 
             | This is not a Unicode issue either; ASCII with codepages
             | for internationalization had the same problem. Homoglyphs
             | aren't limited to accents or complex Unicode sequences.
             | 
             | With Unicode, SIRNIK is not a problematic username -
             | there's only _one_ way to type that particular byte
             | sequence in. Before Unicode, it was, because the letters
             | were encoded as different _bytes_ in KOI-8 (Unix) vs.
             | Windows-1251 character sets, and the user didn't
             | necessarily have a choice about _which one is being used to
             | record their input_.
             | 
             | The problem wasn't limited to log-in screens, of course; it
             | resulted in hilariously unreadable words which have since
             | been enshrined in memes, like "bNOPNIa" for "Vopros"
             | ("question", a common first word in a chat message asking
             | about how to make text readable).
             | 
             | See, bNOPNIa (KOI-8) == Vopros (Windows-1251); same bytes.
             | Whether to allow that as a log-in or password (e.g. on a
             | Linux machine) depended on whether you wanted to allow the
             | user to log in from Windows devices too.
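             | 
             | The "same bytes, different text" effect can be reproduced
             | with a short Python sketch. It assumes that "Vopros" and
             | "bNOPNIa" above are ASCII transliterations of the Cyrillic
             | strings spelled out with escapes below; cp1251 and koi8_r
             | are Python's standard codec names for Windows-1251 and
             | KOI8-R.

```python
# "Vopros" in Cyrillic, written with escapes so this source stays ASCII.
vopros = "\u0412\u043e\u043f\u0440\u043e\u0441"

raw = vopros.encode("cp1251")    # the bytes a Windows-1251 system stores
misread = raw.decode("koi8_r")   # the same bytes interpreted as KOI8-R

print(raw.hex())                        # the shared byte sequence
print(misread == vopros)                # the decoded text differs
print(misread.encode("koi8_r") == raw)  # yet the bytes never changed
```

             | A log-in check that compares bytes accepts either spelling,
             | while one that compares decoded text does not, so the
             | outcome depends on which encodings the log-in devices are
             | allowed to use.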
             | 
             | Obviously, for local accounts on Windows 95 machines, it
             | was not an issue, as Windows encoding would be the only one
             | available on a Windows log-in screen. The context gives all
             | the answers.
             | 
             | All of this directly follows from the "not a true rebuttal"
             | you typed, and I frankly don't see what else there is to
             | say on the matter, or how else to say what you said to get
             | that point across.
        
           | adrian_b wrote:
           | The discussion thread at LWN has already mentioned standards
           | for Unicode identifiers (RFC 8264 and RFC 8265), which
           | prescribe how to handle all these problems, i.e. which
           | characters should be allowed in identifiers and how to
           | normalize and compare Unicode identifiers.
        
         | anon-3988 wrote:
         | Nah, you can use whatever you want for _display_.
         | 
         | We have our tower of babel here and we are telling people not
         | to use it? I am not even native English user btw. Having a
         | lingua franca allowed me to understand someone from Russia,
         | China, Japan, etc.
         | 
         | Maybe once we have easily accessible ML that can translate
         | nuances from one language to another without loss, we can all
         | talk in our own languages and just translate each other's words.
        
           | abigail95 wrote:
           | I think people should be able to configure systems to handle
           | a broad range of text from popular encoding standards like
           | UTF-8.
           | 
             | Limiting text-space because of communication is a strange
           | objection that I don't think will hold up over time.
        
             | numpad0 wrote:
             | Unicode is a garbage standard that breaks apart so easily.
             | That's why people hate ideas like yours. You're right in an
             | ideal world but not in this baseline reality with Unicode.
        
               | adrian_b wrote:
               | Except that it is much better than anything that had
               | existed before it.
               | 
               | The earlier handling of non-English alphabets or writing
               | systems was horrible in MS-DOS and Windows.
               | 
               | While some serious mistakes have been made in the
               | development of Unicode, its main principles were fine and
               | it does not have any competition.
               | 
               | Feel free to propose and implement a better standard.
        
               | numpad0 wrote:
               | Just skim other branches of this tree. Unicode is non-
               | canonical in many ways.
               | 
               | You can't guarantee that the same binary representation
               | is reproduced on every machine.
               | 
               | That kind of encoding system has no place "under the
               | hood". That should be obvious.
        
         | PhilipRoman wrote:
         | I really love this powerless use of "should". If you spit on
         | billions of lines of code, all you will get is a dry mouth. The
         | reality defines "what is", unless you have lots of tanks and
         | people under your control, in which case you can change the
         | reality.
         | 
         | There is tons of useful code which you will likely never
         | encounter, that helps people accomplish their tasks every day.
         | Do you think there is some central authority who is going to go
         | building to building and dd if=/dev/zero every shell script
         | they find?
        
           | abigail95 wrote:
           | This is a contemporary discussion, today, concerning hundreds
           | perhaps thousands of lines of code. That's it.
           | 
           | If someone is objecting to changes because of things like
           | "bob;>/hacked". That is laughable, and I will continue to
           | point and laugh. Imagine limiting URL encoding because of SQL
           | injection.
           | 
           | We can fix this, then fix the things that break - and then we
           | can improve.
           | 
           | Or we can ossify into stone. Your choice.
        
             | PhilipRoman wrote:
             | >if you cannot handle UTF-8 anywhere anything approaching
             | text could be, your program is malformed and should be
             | deprecated and removed.
             | 
             | I was referring to this. Don't get me wrong, I also would
             | love to make sweeping changes to many things in computing.
             | I still think it is perfectly valid to impose reasonable
             | limitations on input even if the program could
             | theoretically handle it - it prevents all kinds of problems
             | at the very root (like allocating disproportionate amounts
             | of resources, infinite timeouts, etc).
        
       | chikere232 wrote:
       | oh yes, let's break things to gain nothing of value
        
         | gspr wrote:
         | Perhaps nothing of value _to you_.
         | 
         | I'll hazard a guess that your preferred username can be
         | expressed in a small subset of ASCII? And to hell with everyone
         | else?
        
           | knorker wrote:
           | I'll hazard a guess that your preferred username can't be
           | written by 99.99999% of the world, and would always have to
           | be copy-pasted?
        
             | Ylpertnodi wrote:
             | Yeah, us foreigners, up to our usual tricks again.
        
               | knorker wrote:
               | By any definition of the word, I'm a foreigner.
               | 
               | So if you meant to imply that I'm an American, you've
               | guessed wrong.
        
           | chikere232 wrote:
           | If your personal identity is threatened by having to use an
           | ascii alphanumeric login name, you're kind of creating
           | problems for yourself for no reason...
           | 
           | There is a field for the full name of the person if you want
           | to, and at least on my linux it warns for non-ascii
           | characters but allows them
        
           | anon-3988 wrote:
           | It's a give and take. If you allow for anything beyond latin,
           | then you have to accept that there will be a class of
           | software that will be difficult to interact with.
           | 
           | Latin-like language system is simply superior for machine
           | purposes. I am sorry, but I don't even want to think of
           | supporting the entire unicode in my software. I am not going
           | to even attempt to reverse that emoji.
        
             | chikere232 wrote:
             | It gets real fun when it's something you need to look up
             | and have match, like a username.
             | 
             | Because then it has to be normalised in the right way for
             | comparisons to work, or it will only match if your input
             | method happens to produce the exact same variant.
             | 
             | ... And unicode is an evolving standard where this
             | normalisation sometimes changes between standards, so the
             | names as normalised in the old version of your standard
             | library might disagree with the new version. So you need to
             | care for that transition.
             | 
             | ... And often this is implemented separately for different
             | languages, so you can get names that won't match if you
             | normalise them in python, java or C.
             | 
             | ... And as all implementations, these unicode
             | implementations sometimes have bugs, so you need to think
             | not only about matching supported unicode versions, but
             | matching bugs.
             | 
             | ... And any change in these normalisations can in theory
             | lead to two usernames that used to be distinct becoming
             | identical.
             | 
             | It's a deep well
        
               | khaled wrote:
               | > And unicode is an evolving standard where this
               | normalisation sometimes changes between standards
               | 
               | Unicode normalization is subject to its stability policy,
               | and Unicode no longer allows adding new canonically
               | equivalent code points.
               | 
               | https://www.unicode.org/policies/stability_policy.html
        
             | adrian_b wrote:
             | There are many variants of the Latin alphabet and the
             | English alphabet contains only a subset of the letters
             | contained in the other variants.
             | 
             | There is no reason to consider the English alphabet as
             | superior for machine purposes, in comparison to other Latin
             | alphabets.
             | 
             | Its dominance in IT is caused by the fact that most of the
             | development of commercial computers after WWII has been
             | done at IBM and other US companies, not by any properties
             | of the English alphabet.
        
         | layer8 wrote:
         | The issue is that it has already been broken (read: has allowed
         | arbitrary byte sequences) for a long time, and the debate is
         | about what to restrict it to.
        
       | codedokode wrote:
       | Don't you think that it would be better to get rid of usernames
       | in UI? They only provide unique data for fingerprinting and do
       | almost nothing useful on a single-user system. Wouldn't it be
       | better to simply have a default name like "primary user" or "main
       | user" for the first user and skip one step in installation
       | process? Also it frees you from typing a username on login for a
       | single-user system.
        
         | eviks wrote:
         | Single user systems can just not ask for a username if there is
         | only one, they control the UI
        
       | knorker wrote:
       | So in the future I may not be able to even type the name of
       | another user? Admins and other users not being able to type
       | usernames sounds very bad.
       | 
       | And I say that as someone whose native language has more letters
       | than English.
        
       | zvr wrote:
       | Most people are too young to remember that when you typed your
       | username in all-caps in the login prompt (because the CapsLock
       | key was on by accident, for example), the login(8) program
       | assumed you were in a connection that could only do 7-bit (upper
       | case, but no lower case characters) and immediately switched the
       | tty settings and you were then presented with a "\PASSWORD: "
       | prompt.
        
         | roelschroeven wrote:
         | Don't you mean 6-bit? 7-bit ASCII supports lower case
         | characters. Or maybe there are other 7-bit character sets that
         | don't have lower case characters and it was one of those?
        
           | jks wrote:
           | PETSCII? On the Commodore 64 you could press the Commodore
           | key and Shift together to change character sets between
           | lowercase and the graphical characters.
           | 
           | But the Unix login thing might have been because of
           | teletypes?
           | https://www.columbia.edu/cu/computinghistory/teletype/ claims
           | that ASR 33 used 8-bit ASCII but was uppercase only - not
           | sure if the "8-bit" claim can be true.
           | 
           | On some Unix (and Linux) systems, you can still enter a kind
           | of retro mode with "stty olcuc iuclc" (output lowercase to
           | uppercase, input uppercase to lowercase) and turning on Caps
           | Lock.
        
           | zvr wrote:
           | You are of course correct that 7-bit ASCII includes lower
           | case characters. I don't think there exists "6-bit ASCII",
           | but the original ASCII did not have lower case (the slots
           | were empty). We're talking early '60s here.
           | 
           | I'm not even sure it was only about ASCII. I suppose I should
           | have written a more generic "character set" (which supports
           | or not lower case characters) rather than "7-bit".
           | 
           | In the cases where you could only communicate in a single
           | case (upper), you typed the commands in the usual letters
           | (e.g., "LS") and capital letters were designated by a
           | preceding backslash (e.g., "ECHO \JOHN \DOE"). That's why you
           | were seeing the "\PASSWORD: " prompt, the initial letter was
           | capital (as it still is).
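           | 
           | The convention described above can be sketched in a few lines
           | of Python. This is only an illustration of the rule as stated
           | in this comment (fold everything to lower case, let a
           | preceding backslash keep the next letter capital), not the
           | actual tty driver code; the function name is hypothetical.

```python
def from_uppercase_terminal(line: str) -> str:
    """Fold an uppercase-only line to lower case.

    A backslash marks the following character as a real capital.
    """
    out = []
    keep_capital = False
    for ch in line:
        if ch == "\\" and not keep_capital:
            keep_capital = True   # marker: next character stays upper case
            continue
        out.append(ch.upper() if keep_capital else ch.lower())
        keep_capital = False
    return "".join(out)


print(from_uppercase_terminal(r"ECHO \JOHN \DOE"))
print(from_uppercase_terminal(r"\PASSWORD: "))
```

           | Read in the other direction, the same rule explains the
           | prompt: a capital "P" followed by lower-case letters has to be
           | shown as "\PASSWORD: " on a terminal that only has upper case.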
           | 
           | Just for fun, I checked my current Debian system. The
           | getty(8) command still supports it:
           | 
           |     -U, --detect-case
           |         Turn on support for detecting an uppercase-only
           |         terminal. This setting will detect a login name
           |         containing only capitals as indicating an
           |         uppercase-only terminal and turn on some
           |         upper-to-lower case conversions. Note that this has
           |         no support for any Unicode characters.
        
       | soneil wrote:
       | This reminds me of the systemd bug where usernames starting with
       | a digit were mishandled (#15141).
       | 
       | It seems to me like something that "should" be relaxed, but we
       | need to have high confidence in the entire foodchain. adduser
       | seems like the last place it should be changed, not the first -
       | anyone requiring "enough rope" is already served by useradd.
        
       | hwc wrote:
       | My work machine uses my complete email address as a user name
       | (this was done by someone in the IT department). Vim gets
       | confused when I use the `gf` command to open a path that contains
       | an '@' character in it.
        
       | bjourne wrote:
       | Honestly, it is super brain-dead that Linux and other operating
       | systems still have such massive problems with "special"
       | characters. Just the other day I had to help someone who had
       | trouble building. The cause turned out to be that they had
       | dropped filenames with parentheses in the source directory which,
       | apparently, confused bash which make relies on. Such trash is
       | everywhere on Linux systems. Eventually you learn to only use
       | [a-zA-Z0-9-_.] in names because anything else will inevitably
       | confuse some tool or another (even capital letters can be a
       | PITA)... I so wish someone would take it upon themselves to clean
       | up this mess, but it's probably too much work and too many who
       | are nay-sayers conditioned to it who don't see the need for
       | changes.
        
       | hiccuphippo wrote:
       | As someone who needs non-ascii characters to write my name:
       | _please don't_. You are making things worse just to be
       | "courteous" about something we don't care about and will actually
       | be annoyed at if we have to find how to write a letter in the
       | keyboard or, worst case scenario, figure out how to change the
       | layout to the correct one _before I even logged in_.
        
         | jks wrote:
         | Likewise. My last name contains a non-ascii character. In ~2009
         | I started at a company whose admin conveniently set up an
         | account for me on their Ubuntu server... on which no-one could
         | then log in locally because the login manager crashed when
         | trying to display the list of users. I logged in via ssh and
         | changed my name to the nearest ASCII equivalent.
         | 
         | I always feel slightly worried on sites that demand that I give
         | my full legal name (such as the US ESTA form), and then refuse
         | to handle it because it includes "illegal" characters.
        
           | ASalazarMX wrote:
           | This has happened to me with _passwords_ containing foreign
           | characters. The system would accept it, but further logons
           | would be impossible. Now I always strip diacritics to be
           | safe.
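
One plausible cause of the lockout described above (an assumption, not something the poster confirmed) is that the same visible character can reach the system as different byte sequences depending on the keyboard or input method. A minimal Python sketch of that mismatch, and of the "strip diacritics" workaround, using a hypothetical helper named `strip_diacritics`:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("José Ñandú"))
    # The same visible "é" can arrive as one code point or as two.
    composed, decomposed = "\u00e9", "e\u0301"
    print(composed == decomposed)
    print(composed.encode("utf-8"), decomposed.encode("utf-8"))
```

Letters that have no canonical decomposition (for example "Ł") pass through this helper unchanged, so it is not a general transliteration.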
        
             | jks wrote:
             | A friend mentioned using control characters in passwords...
             | like ^F and ^B, but not ^C because that's the interrupt
             | character. Feels vaguely risky to me (does ^U empty the
             | line? does ^W delete the last word? does your terminal
             | emulator do some weird encoding like it does for cursor
             | keys?) but if it works, why not?
        
               | jowea wrote:
               | I suspect I have run into a couple bugs because of
               | password generators putting characters that some backend
               | system cannot process in the password. Half wish they just
               | did DKWhhjwqjkwqjmHSJKHAIUHQwdmlsadkl instead.
        
               | hughesjj wrote:
               | I remember in school learning that technically speaking
               | on Unix you could have the backspace character as part of
               | your password too
               | 
               | But for the same reason with ^W and ^U I have no idea how
               | you'd implement that in an interactive prompt without
               | escaping
        
           | beardygo wrote:
           | Full legal name as appears on machine readable zone in your
           | own passport. Allowed characters are A-Z only, see MRZ
           | specifications:
           | 
           | https://en.wikipedia.org/wiki/Machine-readable_passport
        
             | Muromec wrote:
             | What's a legal name? It presumes it's somehow different
             | from other ... illegal names. But in which way? Which law
             | has a say?
        
               | beezlebroxxxxxx wrote:
               | "Legal name" is a catch-all term that usually means
               | "approved for use on government issued ID". Are there
               | instances when that's not always the case and some forms
               | of ID (not just, say, an ID card, but also in tax
               | filings, for example) actually have different rules?
               | Amazingly, sometimes yes. But usually that's what it
               | means.
        
               | Muromec wrote:
               | I get what it could mean but it's jurisdiction bound and
               | doesn't resolve unambiguously, doesn't match MRZ and isn't
               | always ASCII.
        
               | Dylan16807 wrote:
               | The name the legal system uses to refer to you.
        
               | Muromec wrote:
               | Legal system as in court of law? They tend to use more
               | letters than I have in my actual passport (definitely
               | more than fits into mrz) and depending on which court we
               | talk about they also use different alphabets. They also
               | assume certain structure in those names, which differs
               | from one court to another.
        
               | Dylan16807 wrote:
               | Are you using courts that insist on different alphabets?
               | Then you have multiple legal names.
               | 
               | And some operations are based on exactly what's on your
               | passport.
               | 
               | It's more than court, taxes are an important and relevant
               | set of laws.
        
               | Muromec wrote:
               | Yes, I had the pleasure of dealing with two courts that use
               | two different alphabets this year. One of the two
               | referenced the other. The name written in neither of the
               | two matches what's actually written in my passport. It
               | isn't a complicated name by any reasonable metric.
               | 
               | Taxes are easier -- they just use IDs, and names are
               | display only kind of stuff, sourced from the base registry.
        
         | doubled112 wrote:
         | Just having an apostrophe in my last name causes me issues.
         | 
         | Yes, that's me, Mr. O&amp;Conner
        
         | j-bos wrote:
         | As another nonascii character named individual who's lost hours
         | of life calling service reps for companies that used my utf-8
         | name, I second this.
        
         | zigzag312 wrote:
         | I sometimes use this as a quick test of software quality. If it
         | can't handle non-ascii characters in 2024, then it will
         | probably be more trouble than it's worth.
        
       | SuperSandro2000 wrote:
       | They are clearly bored and want to start a year long bug hunt
       | through half of unix
        
         | Muromec wrote:
         | That sounds like a good kind of bored and bug hunting through
         | half the unix sounds like fun too.
        
           | db48x wrote:
           | Agreed! I'm half tempted to join the hunt myself just for
           | kicks...
        
       | kej wrote:
       | I wonder if it would work to do something like the punycode
       | system for internationalized domain names. Shell scripts could
       | handle a name like `xn--0civ130n` just fine, and user-facing
       | utilities could choose to convert that to :sparkle::unicorn: when
       | appropriate. The same homograph protections would probably work,
       | as well.
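
The punycode idea above can be sketched with Python's built-in `punycode` codec. This is only an illustration under stated assumptions: the helper names `to_ascii_name` and `to_display_name` are hypothetical, the `xn--` prefix is borrowed from IDNA, and the normalization and homograph checks that real IDNA processing performs are skipped.

```python
PREFIX = "xn--"  # ACE prefix used by internationalized domain names


def to_ascii_name(name: str) -> str:
    """Return an ASCII-only stored form; pure-ASCII names pass through."""
    if name.isascii():
        return name
    return PREFIX + name.encode("punycode").decode("ascii")


def to_display_name(stored: str) -> str:
    """Reverse to_ascii_name for user-facing tools."""
    if stored.startswith(PREFIX):
        return stored[len(PREFIX):].encode("ascii").decode("punycode")
    return stored


if __name__ == "__main__":
    for name in ["alice", "münchen", "日本語"]:
        stored = to_ascii_name(name)
        print(name, "->", stored, "->", to_display_name(stored))
```

A real scheme would also have to decide what to do with plain ASCII names that already start with `xn--`, which this sketch would misread as encoded.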
        
       | dsr_ wrote:
       | I will remind everyone that there are a minimum of three
       | identifiers here.
       | 
       | The UID, which is an integer. Ownership resides here; it's the
       | primary key. Can be used by programs.
       | 
       | The username, each of which must be unique and maps to one UID --
       | but multiple usernames can map to the same UID. Used by humans
       | and programs to login.
       | 
       | The GECOS field, or "human readable name", which is only used as
       | a display label. Some systems include a structure inside this for
       | additional info like phone number, office number, or similar. I
       | don't think anyone would object to UTF-8 here.
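
The three identifiers listed above can be seen side by side by parsing text in the passwd(5) layout. The sample entries below are invented for illustration, and `parse_passwd` is a hypothetical helper, not part of any library:

```python
from collections import defaultdict

# Hypothetical sample in passwd(5) layout:
# name:password:UID:GID:GECOS:home:shell
SAMPLE = """\
jdoe:x:1000:1000:Jörg Doe,Room 12,555-0100:/home/jdoe:/bin/bash
jd:x:1000:1000:Jörg Doe (alias):/home/jdoe:/bin/sh
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
"""


def parse_passwd(text):
    """Return ({uid: [usernames]}, {username: display name})."""
    names_by_uid = defaultdict(list)
    display_by_name = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _pw, uid, _gid, gecos, _home, _shell = line.split(":")
        names_by_uid[int(uid)].append(name)
        # By convention the first comma-separated GECOS subfield is the full name.
        display_by_name[name] = gecos.split(",")[0]
    return dict(names_by_uid), display_by_name


if __name__ == "__main__":
    by_uid, display = parse_passwd(SAMPLE)
    print(by_uid)   # two usernames share UID 1000
    print(display)  # only the display label contains non-ASCII text
```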
        
       | seu wrote:
       | The fact that this whole discussion happens in English partially
       | explains why there is a discussion at all. The whole problem
       | could have been avoided if the development of computers had been
       | a more international effort.
        
       | seiferteric wrote:
       | OMG Can't believe this, I ran into this exact thing at my last
       | job. We discovered a security vuln in several of our services
       | because we were accepting unsanitized usernames. We were doing
       | things with them (passing them to scripts etc.), but only after
       | passing them to useradd/usermod etc., so we thought they were
       | safe. Of course you could put in things like ";", "&", ">"
       | etc. and do whatever you want. I discovered that Debian DISABLED
       | the username sanity checks and could not believe it. Anyway, I
       | installed a patched version as well as sanitized input and other
       | stuff to resolve the issue.
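
The failure mode described above, usernames containing ";", "&" or ">" reaching a shell, can be sketched as follows. This is a minimal illustration and not the poster's actual fix: `USERNAME_RE`, `safe_username` and `build_command` are hypothetical names, and the pattern is an assumed conservative policy rather than the exact rule any particular tool enforces.

```python
import re
import shlex

# Assumed conservative policy: lowercase letter or underscore first, then
# lowercase letters, digits, underscore or hyphen, at most 32 characters.
USERNAME_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def safe_username(name: str) -> str:
    """Return name unchanged if it matches the allow-list, else raise."""
    if not USERNAME_RE.fullmatch(name):
        raise ValueError(f"rejected username: {name!r}")
    return name


def build_command(name: str) -> list[str]:
    """Build an argv list so that no shell ever parses the username."""
    return ["id", "--", safe_username(name)]


if __name__ == "__main__":
    print(build_command("alice"))
    hostile = "bob; rm -rf /tmp/x"
    # If a shell string is unavoidable, quoting neutralises metacharacters.
    print(shlex.quote(hostile))
    try:
        build_command(hostile)
    except ValueError as err:
        print(err)
```

Passing the argv list to `subprocess.run` without `shell=True` keeps the name a single argument even if validation is ever loosened.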
        
       | IshKebab wrote:
       | > Most Debian users don't work with useradd, or groupadd,
       | directly. Instead, Debian has long supplied its own adduser (and
       | addgroup) utilities, originally written by founder Ian Murdock.
       | These act as simpler front ends to useradd
       | 
       | One of the dumbest things Debian has done.
        
       | rurban wrote:
       | They are so stupid, I cannot believe!
       | 
       | Names are identifiers, and as such need to stay identifiable.
       | There exist unicode security guidelines and rules for
       | identifiers, which they don't know about. My libu8ident library
       | would help with that.
        
       | UniverseHacker wrote:
       | Clearly we should open up usernames to be an unlimited size set
       | of mixed data types: e.g. the first "character" could be a hand
       | drawn picture of a cat, the second the entire text of the US
       | constitution in unicode, and so on. We could then extend this
       | flexibility to filenames, passwords, and Unix commands.
       | Internally, this could involve replacing all text strings with
       | folders on a filesystem where you can put any files you want in
       | any desired order. /s
        
         | adrian_b wrote:
         | As already pointed out in that discussion thread, there are
         | standards for Unicode identifiers, e.g. RFC 8264 and RFC 8265.
         | 
         | All Unicode characters have types, like letter, digit,
         | punctuation, mathematical operator and so on. The standard for
         | identifiers allows in identifiers only certain types of Unicode
         | characters and it defines rules for normalization and
         | comparison of identifiers.
         | 
         | So rules for handling Unicode identifiers have already been
         | defined. Whoever wants this functionality should just implement
         | the standards.
         | 
         | One may have opinions whether this is worthwhile or not for a
         | certain application, but strawman arguments about cat pictures
         | and other impossible dangers are no longer valid.
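
To give a feel for what "allowed character types plus normalization and comparison rules" means in practice, here is a deliberately simplified Python sketch. It is not an implementation of RFC 8264 or RFC 8265: the allow-list of general categories and the helper name `normalize_identifier` are assumptions made for illustration, and the real standards classify code points by derived properties rather than by this short list.

```python
import unicodedata

# Assumed allow-list of Unicode general categories: letters, marks, digits.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd"}


def normalize_identifier(name: str) -> str:
    """Lowercase and NFC-normalize, then reject disallowed characters."""
    normalized = unicodedata.normalize("NFC", name.lower())
    if not normalized:
        raise ValueError("empty identifier")
    for ch in normalized:
        category = unicodedata.category(ch)
        if category not in ALLOWED_CATEGORIES:
            raise ValueError(f"U+{ord(ch):04X} has disallowed category {category}")
    return normalized


if __name__ == "__main__":
    # Two spellings of the same name compare equal after normalization.
    print(normalize_identifier("Jose\u0301") == normalize_identifier("jos\u00e9"))
    for bad in ["a b", "a\u200bb", "\U0001F984"]:
        try:
            normalize_identifier(bad)
        except ValueError as err:
            print(err)
```

Comparison then reduces to comparing the normalized strings, which is the part that ad hoc implementations tend to get wrong.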
        
           | UniverseHacker wrote:
           | Apparently humor is no longer valid?
        
       | nineteen999 wrote:
       | I have an affectionate place in my heart for Debian, the
       | community is passionate, they have wonderful ideals, hell I even
       | helped found a charity which distributes it on used PC's
       | discarded by large companies to disadvantaged people over 20
       | years ago which is still running today. It was my favourite
       | distro for a long time after I moved on from Slackware in the
       | late 90's, I used it at home, I used it in my job at a small ISP
       | on everything from x86 to Sun Sparc to DEC Alpha hardware. We are
       | lucky in the Linux community to have them. I could care less
       | about derivatives like Ubuntu, seems to be one too far removed.
       | 
       | But over the years the bikeshedding and some of the poor
       | technical decisions started to wear on me. The debconf approach
       | of asking a million questions on install bothered me. In my
       | current job we use it on small industrial ARM PC's and it does a
       | great job there at a large scale distributed over a wide variety
       | of environments and geographical area, scorching heat, freezing
       | cold and everything in between. But that's easy because it's a
       | single system image which we deploy to hundreds of devices and it
       | only requires minimal customisation to perform the required
       | tasks.
       | 
       | But our datacenter servers remain RHEL for the simple reason ...
       | the deployment and broad customisation process per server is
       | easy, LDAP integration is straightforward and the customer wants
       | to pay for support from the vendor even though we never use it.
       | Security updates and bugfixes are delivered quickly and the
       | vendor's commitment to stability is commendable. It's a no
       | brainer. More and more companies started to move their workloads
       | to RHEL once it came out and unfortunately it just didn't make
       | sense to bother with distributions outside of RHEL/Fedora for my
       | personal use anymore, some sort of work/life balance is needed
       | and I don't want to spend my personal computing time remembering
       | all the idiosyncrasies between different Linux distributions
       | anymore. I would argue that Debian is pretty idiosyncratic and
       | opinionated if you have come from more traditional UNIX systems
       | in the 90's, while RHEL/Fedora more closely model an "evolution"
       | of those classic systems if you like. It will be interesting to
       | see what happens to RHEL in the coming years as Redhat becomes
       | more and more absorbed into the IBM environment.
        
         | finnthehuman wrote:
         | That's the reality of deploying professionally though. I've got
         | a soft spot for debian from using it for over 20 years too, but
         | choosing open source often means picking the vendor that
         | accommodates the use case. Many products have the enterprise
         | upsell of good LDAP/AD integration but that's just a nice to
         | have when you're really buying it for the ability to call someone
         | when shit goes sideways.
         | 
         | And when you don't need the support net, it's often gonna be
         | ubuntu because that's what most people are comfortable with. Or
         | yocto if you're shipping a custom OS. And containers are so
         | ephemeral and purpose specific it means distro doesn't matter
         | as much.
         | 
         | I'm still rooting for them the most. They're community based,
         | an important upstream, and stable has never done me dirty. It's
         | still my go-to for "I don't want to think hard about, or worry
         | about this system."
        
       | jcarrano wrote:
       | I don't get it. What's the purpose of changing the default rule
       | in shadow-utils? Not only is it completely unnecessary, it
       | introduces risks of shell injection and also risks introducing
       | incompatibilities between Debian and any other system.
       | 
       | I feel that there are already too many other things to fix to be
       | wasting time in creating new potential bugs.
        
       | thway15269037 wrote:
       | Before opening this can of worms, can we finally address that
       | there is a hard, hardcoded limit of 255 bytes per file name
       | (folder name) in Linux? Yeah, 255 bytes, that is, like 63
       | Japanese characters or emojis or maybe less. And in kernel, too,
       | so you physically cannot correct this issue by using another
       | filesystem or something.
       | 
       | Before anyone asks: yes, these folders do occur in real life, and
       | I'm tired of pretending that they do not.
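
A quick check of the arithmetic behind the 255-byte limit, assuming names are stored as UTF-8. The helper `chars_that_fit` is hypothetical. The figure of 63 corresponds to characters that take 4 bytes each, such as most emoji; common Japanese characters take 3 bytes each, which gives 85.

```python
NAME_MAX = 255  # bytes per path component on Linux


def chars_that_fit(ch: str) -> int:
    """How many copies of ch fit in one file name when encoded as UTF-8."""
    return NAME_MAX // len(ch.encode("utf-8"))


if __name__ == "__main__":
    samples = [
        ("ASCII letter", "a"),
        ("Latin letter with diacritic", "é"),
        ("kanji", "日"),
        ("emoji", "😀"),
    ]
    for label, ch in samples:
        print(f"{label}: {len(ch.encode('utf-8'))} bytes each, "
              f"{chars_that_fit(ch)} per name")
```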
        
       | cratermoon wrote:
       | My take: user names are _not_ strings, though they may be
       | _represented_ as strings. As such, a type, e.g. Username, would
       | provide a constrained and consistent range of allowed values,
       | much as a type like float32 allows (within IEEE 754 rules).
       | 
       | It's time for programmers to stop treating everything that can be
       | represented by a string as anything representable by a string
       | type.
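
The "a username is not a string" idea can be sketched as a small value type that validates once at construction. The class name `Username` comes from the comment above; the pattern it enforces is only an assumed policy chosen for illustration.

```python
import re
from dataclasses import dataclass

# Assumed policy for illustration; a real system would choose its own rules.
_PATTERN = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


@dataclass(frozen=True)
class Username:
    """An immutable, validated username; invalid values cannot exist."""

    value: str

    def __post_init__(self) -> None:
        if not _PATTERN.fullmatch(self.value):
            raise ValueError(f"not a valid username: {self.value!r}")

    def __str__(self) -> str:
        return self.value


def home_directory(user: Username) -> str:
    """Functions that accept Username never need to re-validate."""
    return f"/home/{user}"


if __name__ == "__main__":
    print(home_directory(Username("alice")))
    try:
        Username("alice; reboot")
    except ValueError as err:
        print(err)
```

The design choice is that validation happens in exactly one place, so code that receives a `Username` can rely on the constraint instead of checking a raw string again.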
        
       | jmclnx wrote:
       | Company I work at moved to an ID like [A-Z]Employee-number. Moot
       | point for them :
        
       | okasaki wrote:
       | Aren't pretty much all devices nowadays owned by a single person?
       | 
       | What's the use case for non-system usernames at all?
       | 
       | Why not just "user" and "root"?
        
       | ipython wrote:
       | This sounds like a security nightmare just waiting to happen.
       | Nothing like embedding gigantic libraries like libicu into
       | security critical code bases so you can do things like Unicode
       | normalization and comparison functions on usernames.
        
       ___________________________________________________________________
       (page generated 2024-12-07 23:01 UTC)