[HN Gopher] Debian opens a can of username worms
       ___________________________________________________________________
        
       Debian opens a can of username worms
        
       Author : jwilk
       Score  : 158 points
       Date   : 2024-12-06 09:55 UTC (13 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | rini17 wrote:
       | Perhaps it's time to agree upon how to Unicode in identifiers?
       | The normalization, unprintable characters, confusing characters
       | with same glyphs, etc. It's obviously problematic when everyone
       | is doing it on their own.
        
         | magicalhippo wrote:
         | As long as I can enter my Zalgo[1] username, I'm fine with your
         | suggestion.
         | 
         | [1]: https://en.wikipedia.org/wiki/Zalgo_text
        
         | m000 wrote:
         | Good luck bringing everyone together. There's still a ton of
         | Microsoft software that relies on the presence of the BOM [1],
         | despite practically everyone else not using it. And
         | bidirectional rsync between practically everything else and a
         | Mac still requires `--iconv=utf-8,utf-8-mac` to avoid problems
         | because of homographs.
         | 
         | [1] https://en.wikipedia.org/wiki/Byte_order_mark
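
The rsync problem above comes down to normalization: the same visible
name can be stored as different code point sequences. A minimal sketch
in Python, assuming (as is generally understood) that "utf-8-mac"
refers to the decomposed form used by older macOS file systems. The
helper name same_name is made up for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point (NFC form)
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent (NFD form)


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # False: different code points
print(composed.encode("utf-8"))          # b'caf\xc3\xa9'
print(decomposed.encode("utf-8"))        # b'cafe\xcc\x81'
print(same_name(composed, decomposed))   # True once both are normalized
```

Both strings render identically, which is why a byte-wise sync tool
sees two different file names unless one side is converted.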
        
         | bayindirh wrote:
         | The first bar to clear is "The Turkish Test"[0], then we can
         | talk about Unicode. It'll smooth the rest of the process a lot.
         | 
         | You can't guess how many workarounds I implement to make sure
         | that a stray application doesn't get "i" or "I" in their naive
         | codepaths, and start burning mid-flight (e.g.: Kodi, Pagico,
         | some old Java programs, oh my...).
         | 
         | [0]: https://blog.codinghorror.com/whats-wrong-with-turkey/
        
           | beardyw wrote:
           | The date format part is ridiculous. Americans are almost
           | unique in using mm/dd/yyyy, so an assumption of that would be
           | plain wrong.
        
             | bayindirh wrote:
             | Localization libraries handle these parts well, since date
             | is same with Europe (and generally stored as time-date
             | objects rather than pure strings). None of the number
             | shenanigans cause problems since these numbers are always
             | stored as IEEE754 or other decimal formats. Money is no
             | problem as well.
             | 
             | However, when you go through an upper() or lower() or
             | anything which plays with capitalization, and if that data
             | is being fed to a hash algorithm or anything which mucks
             | with strings, boy, oh boy...
             | 
             | The easiest way is to sanitize these programmatic parts
             | with forced locale of en_US or plain old "C". If the
             | string is not user-facing and is never localized, just
             | force its locale. It's the only sane way.
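
A rough illustration of the casing problem described above. Python's
str.upper() is already locale-independent, so it stands in for the
"forced C locale" behaviour; turkish_upper is a hypothetical stand-in
for what a Turkish-locale uppercase would do, not a real library call.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ, Latin capital letter I with dot above


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase the way a Turkish locale would."""
    # In Turkish, "i" uppercases to "İ"; dotless "ı" already maps to "I".
    return text.replace("i", DOTTED_CAPITAL_I).upper()


def invariant_upper(text: str) -> str:
    """Uppercase with fixed, locale-independent rules."""
    return text.upper()


key = "title"
print(invariant_upper(key))                         # TITLE
print(turkish_upper(key))                           # TİTLE
print(invariant_upper(key) == turkish_upper(key))   # False
```

If the uppercased string is then hashed or used as a lookup key, the
two variants no longer match, which is the failure mode the comment
describes.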
        
             | a3w wrote:
             | https://xkcd.com/1179/ I heard the US and A are moving to
             | the hissing cat date format shown here.
        
               | Muromec wrote:
               | I kind of like the one using roman numerals for month.
               | Reasonable people would figure out that other reasonable
               | people would not use roman numerals for _days_, so the
               | order can be implicit. I like implicit ordering, it
               | always makes things more interesting.
        
             | bluGill wrote:
             | I have switched to yyyymmdd for everything - it is usually
             | obvious to everyone what date I mean.
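
One practical property of yyyymmdd, sketched in Python: the strings
sort in the same order as the dates they represent. The sample dates
are arbitrary.

```python
from datetime import date

dates = [date(2024, 12, 6), date(2023, 1, 15), date(2024, 2, 1)]

# Format each date as yyyymmdd.
stamps = [d.strftime("%Y%m%d") for d in dates]
print(stamps)   # ['20241206', '20230115', '20240201']

# Plain string sorting matches chronological sorting.
chronological = [d.strftime("%Y%m%d") for d in sorted(dates)]
print(sorted(stamps) == chronological)   # True

# For comparison, the US-style rendering of the first date.
print(dates[0].strftime("%m/%d/%Y"))     # 12/06/2024
```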
        
               | bayindirh wrote:
               | I also use the same format while naming my files, or in
               | changelogs or whatnot, but not all documents are suitable
               | for that, and in the presentation layer you need to match
               | the country standards.
               | 
               | However, date is mostly presentation and internal storage
               | of these are vastly different than what we see generally.
        
               | bluGill wrote:
               | I don't match country standards. That is the point.
        
               | kelnos wrote:
               | It depends on what you're doing, though. If you're
               | helping people fill out documents (even non-government
               | documents), then you really need to match the country
               | standard.
               | 
               | Localization is important; some countries outright
               | require it if you're going to do business within their
               | borders. But even where it's not required, you will lose
               | customers if your website/application/product feels
               | "foreign". I'm not sure date ordering is a big enough
               | deal to trigger that feeling in anyone, but unless it's a
               | huge burden to format things the way people expect, I
               | would do so for the UX benefits.
        
         | throw0101a wrote:
         | > _Perhaps it's time to agree upon how to Unicode in
         | identifiers?_
         | 
         | And then update all data structures that refer to them (like
         | _last_ and _w_ / _who_ , also NFS), as well as file formats
         | (like _cpio_ , _tar_ , and _pax_ which encodes ownership).
        
           | maccard wrote:
           | Yes. Those formats have had 20 years since Unicode was
           | standardised, and things like my terminal still routinely
           | break when given "unexpected" inputs. Practically every other
           | application can handle it.
        
         | layer8 wrote:
         | Unicode has provided a specification for Unicode identifiers
         | since 2005: https://www.unicode.org/reports/tr31/
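
For a feel of what TR31-style identifier rules accept, Python's own
identifier syntax is based on the XID_Start / XID_Continue properties
with NFKC normalization. The sketch below uses that as an
approximation; it is Python's profile, not a full TR31 implementation,
and it does nothing about confusable characters. The function name is
made up for illustration.

```python
import unicodedata


def looks_like_identifier(name: str) -> bool:
    """Rough TR31-style check using Python's own identifier rules."""
    # Python normalizes identifiers to NFKC when parsing source code,
    # so the same normalization is applied here before checking.
    return unicodedata.normalize("NFKC", name).isidentifier()


candidates = ["alice", "caf\u00e9", "\u540d\u524d", "1abc", "a-b", "a b", ""]
for candidate in candidates:
    print(repr(candidate), looks_like_identifier(candidate))
```

Letters from any script pass, while a leading digit, a hyphen, a space,
or an empty string is rejected.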
        
           | rini17 wrote:
           | Great! Is there a library for their validation? ICU seems to
           | have only spoof checker for confusables.
        
             | rurban wrote:
             | libu8ident
        
         | secondcoming wrote:
         | Would punycode be suitable?
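
For reference, this is what a Punycode round trip looks like with
Python's built-in "punycode" codec. The sample name is arbitrary;
whether this is a good fit for usernames is a separate question.

```python
name = "m\u00fcnchen"   # "münchen"

# Encode to an ASCII-only form and back again.
ascii_form = name.encode("punycode")
print(ascii_form)                      # b'mnchen-3ya'
print(ascii_form.decode("punycode"))   # münchen
```

The encoded form uses only ASCII letters, digits and hyphens, but it is
not readable without decoding.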
        
       | tiahura wrote:
       | When you think about all the time, money and effort that have
       | been wasted on Unicode...
        
         | kalleboo wrote:
         | Yeah we should have all just stuck to Shift-JIS
        
         | Joker_vD wrote:
         | Vseki triabva da izpolzva latinitsa, absoliutno s'm s'glasen.
         | 
         | After all, it's objectively the most perfect set of characters
         | for any reasonable human language.
        
           | febusravenga wrote:
           | Random cross-language-script observation.
           | 
           | In Bulgarian, latinitsa ("latin alphabet") transliterated to
           | latin alphabet is just "latinitsa" or "latinica".
           | 
           | In Polish "cyrillic" is "cyrylica" - basically reverse.
        
         | pjc50 wrote:
         | What's your preferred solution for representing the CJK
         | languages?
        
           | tiahura wrote:
           | Computing did pretty well in the prior 50 years.
        
             | pjc50 wrote:
             | That's not an answer. Be specific. How do you want to
             | represent the 97k CJK characters?
        
               | vman81 wrote:
               | I really don't want to be snarky or sarcastic, so I'll
               | just be plain. Many people are unwilling or unable to
               | understand a problem that doesn't affect them directly.
               | Like - "UTF is woke" kind of people. They are out there.
        
             | CorrectHorseBat wrote:
             | Not for the majority of the world population who doesn't
             | know English
        
             | jcranmer wrote:
             | I still remember the days when I couldn't use p and e in
             | the same document, because there was no codepage that
             | contained both of them. I also remember the days when
             | pretty much any website that had non-English text had to
             | have instructions on it for how to view it properly,
             | because mojibake was so bloody common.
             | 
             | (It should also tell you something that not only is there a
             | name for "computers failed at charsets", but the name is
             | Japanese.)
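
Both problems are easy to reproduce with Python's codecs. The two
characters below (é and π) are illustrative picks, not necessarily the
ones the comment had in mind, and encodable is a made-up helper.

```python
def encodable(text: str, codec: str) -> bool:
    """Hypothetical helper: can `text` be represented in `codec`?"""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


latin = "\u00e9"   # é
greek = "\u03c0"   # π

# Each legacy code page covers one of the two characters, not both.
print(encodable(latin, "cp1252"), encodable(greek, "cp1252"))         # True False
print(encodable(latin, "iso8859_7"), encodable(greek, "iso8859_7"))   # False True
print(encodable(latin + greek, "utf-8"))                              # True

# Mojibake: UTF-8 bytes decoded with the wrong legacy charset.
print(latin.encode("utf-8").decode("latin-1"))                        # Ã©
```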
        
             | umanwizard wrote:
             | Only if you could expect a given person to only ever deal
             | with one language. Anything international sucked and was a
             | much bigger pain than now.
             | 
             | It would be impossible to e.g. build a site like Reddit
             | where people can comment in any language.
        
             | vman81 wrote:
             | Computing has improved massively over the last 50 years,
             | not least because it now can accommodate people's diverse
             | languages.
        
             | kryptiskt wrote:
             | No, it didn't. There were all kinds of encodings out there,
             | and dealing with code pages was way worse than any
             | inconveniences that Unicode has brought. Unicode was
             | created for a reason, not just to torture US programmers
             | with the diversity of scripts in the world.
             | 
             | Maybe it was nice if you worked for a US company without
             | any operations abroad, which includes absolutely none of
             | those which mattered.
        
               | account42 wrote:
               | You still need to deal with "codepages" to differentiate
               | between Japanese Unicode and Chinese Unicode even if it's
               | called a language and not codepage now.
        
               | CorrectHorseBat wrote:
               | Han unification sucks indeed but if you get the wrong
               | font it's still readable
        
             | dotancohen wrote:
             | Only if your name isn't Dong Jiu Er Gong Ren Yan Wang .
        
             | throw0101a wrote:
             | > _Computing did pretty well in the prior 50 years._
             | 
             | Contra:
             | 
             | * https://stackoverflow.com/questions/25812790/wrong-character...
        
             | Muromec wrote:
             | I had to, in the year of our lord 2024, deal with a certain
             | non-unicode system that ate one specific Cyrillic symbol
             | when producing an open data artifact mandated by law. It
             | was never fun then, and it still manages to create
             | problems.
        
           | account42 wrote:
           | Something that doesn't unify different characters. So not
           | Unicode.
        
         | Cthulhu_ wrote:
         | What alternative do you propose? I mean personally I think that
         | emoji don't belong in unicode, but at the same time it's been
         | integrated into society for many years now and it's made
         | communications platforms so much more streamlined.
         | 
         | But how else would you represent non-latin characters? More
         | character sets?
        
           | a3w wrote:
           | > emoji don't belong in unicode
           | 
           | Well, they are defined as: "an intermediate technology until
           | we find a way to transfer images over data connections."
           | 
           | So it was always a technology that was 40 years too late to
           | the party?
        
         | layer8 wrote:
         | Without it, all textual data would need its own charset header,
         | and you couldn't freely copy & paste between pieces of text
         | with different charsets without creating mangled garbage. This
         | was the situation before Unicode (except that charsets were
         | often only implicit, so you had to guess which it is).
        
       | card_zero wrote:
       | > naming things is one of the hard things to do in computer
       | science
       | 
       | I've been thinking about that a lot lately. Code is text, it's
       | arranged linearly, code has to be readable, identifiers are thus
       | short strings that try to express short essays about the purpose
       | of the variable or whatever it is, and then ideally there's a
       | longer version of the essay in a comment, but not too long
       | because that would clutter up the code as well (because it's
       | text, arranged linearly). And we have code folding to tidy them
       | up, for what good it does, and ideally an even longer version of
       | the essay in documentation except nobody writes that.
       | 
       | What if it wasn't text, and wasn't linear, and we didn't have an
       | expectation that code should be strings of stupid over-terse
       | names and hieroglyphic symbols? So I was thinking vaguely about
       | investigating graphic-based programming, but it's probably worse,
       | IDK. It could automatically assign arbitrary icons* instead of
       | identifiers, and you could write tooltip-like comments to
       | describe them as and when you want to, and everything could be
       | laid out nicely with diagrams and different pages instead of like
       | a text file. I suppose this is all merely cosmetic? The thing
       | with the insistence on code being _written_ as strings of text
       | feels very primitive, is all. It causes this problem.
       | 
       | * Which doesn't solve the problem, I admit, because now you have
       | to remember what the icons mean, but maybe that's easier?
        
         | jstanley wrote:
         | I don't think remembering the meaning of icons is easier,
         | because in order to think about it you have to be able to
         | pronounce it inside your head.
         | 
         | And code isn't just linear, it can be spread across multiple
         | files in a directory tree, functions can call each other, etc.
        
           | c22 wrote:
           | _> in order to think about it you have to be able to
           | pronounce it inside your head._
           | 
           | I'm not sure this is universal.
        
             | vidarh wrote:
             | Indeed, some people do not even have an inner voice, the
             | same way some of us don't "see" things in our mind's eye.
             | Neither prevents you from thinking about words or visual
             | objects.
        
         | pjc50 wrote:
         | > I was thinking vaguely about investigating graphic-based
         | programming, but it's probably worse, IDK. It could
         | automatically assign arbitrary icons* instead of identifiers,
         | and you could write tooltip-like comments to describe them as
         | and when you want to, and everything could be laid out nicely
         | with diagrams and different pages instead of like a text file.
         | 
         | Have you ever read large electronic schematics? That's
         | basically it .. except all the important things have to be
         | identified by text anyway, because it's a massive challenge to
         | the imagination to come up with two hundred different
         | pictograms.
         | 
         | Of course, if you really want your identifiers to be
         | pictograms, why not just use kanji for your identifiers? The
         | Japanese language and Unicode provide tens of thousands of
         | ready made pictograms for your convenience!
         | 
         | The only nonlinear programming environments that have really
         | worked are the spreadsheet (which is still linear within each
         | cell) and Labview. Possible shoutout to Unity blueprints, but
         | when those get too complicated spaghetti .. people rewrite them
         | in linear text code.
        
           | card_zero wrote:
           | _Sigh_
           | 
           | I guess you're right. This has been a dimly-felt wish of mine
           | for some 25 years, but probably pie in the sky.
           | 
           | Edit: I see there are a _lot_ of visual programming
           | languages.
           | 
           | https://en.wikipedia.org/wiki/Visual_programming_language
        
             | 9dev wrote:
             | I don't think that has to be the answer, though. We can
             | probably all agree that plaintext code is not the best form
             | to represent the schematics of a process, and neither are
             | images. But that seems to be a very limited set of options,
             | and I wonder if there aren't any other dimensions to
             | express what is essentially persisted chains of reasoning.
             | For an example of alternative modes of input, have a look
             | at the Reactable, a pretty innovative way to compose music.
             | Sadly I think they didn't disrupt the music industry as
             | they should have, but it's a pretty good example of a new
             | way to think about making sounds.
             | 
             | Edit: forgot the link. Here is: http://reactable.com
        
             | WillAdams wrote:
             | Longer than that --- I would argue it goes back to Hermann
             | Hesse's _The Glass Bead Game_ (originally published as
             | Magister Ludi) --- but Hesse seems to have gone out of
             | style.
             | 
             | That said, I keep trying various ones, and will keep hoping
             | that someday someone will make a graphical tool able to
             | make a GUI program.
             | 
             | Nodezator seems promising.
        
           | auxym wrote:
           | > Have you ever read large electronic schematics? That's
           | basically it .. except all the important things have to be
           | identified by text anyway, because it's a massive challenge
           | to the imagination to come up with two hundred different
           | pictograms.
           | 
           | As a mechanical engineer who works with Labview and Simulink,
           | as well as more conventional code (python mostly), that is
           | indeed a very good description. First glance at a large
           | labview program feels very much like first glance at a large
           | and complex electronics schematic. Lots of wire everywhere
           | and you're not even sure where to start.
           | 
           | I think a nice "best of both worlds" approach is a graphical
           | "high level" view which shows the flow of data, at least for
           | "data transformation" kind of programs, and code for the low
           | level logic (what actually happens in the blocks). Sort of
           | like nodal editors in Blender and NLE apps. Fortunately
           | Simulink makes it easy to drop in a Matlab function call,
           | Labview not so much (need to get into C FFI or use a really
           | old version of .net or something).
           | 
           | The thought I have about spreadsheets (might have read that
           | on here), is that spreadsheets make the data visible and hide
           | the code. Text-based programming hides the data but shows the
           | code. I'm not sure what something that makes both code and
           | data first class and visible would look like, but I'd be
           | curious for sure (for engineering type applications at
           | least). Best I've found so far (and what I actually use for a lot
           | of data processing tasks) is a Jupyter notebook making
           | plentiful use of df.head() and df.plot().
        
           | umanwizard wrote:
           | It's odd to say those characters come from the Japanese
           | language when they were invented in China to write Chinese,
           | are still used for that purpose, and were only introduced to
           | Japan 2000 years later.
        
           | taneq wrote:
           | > The only nonlinear programming environments that have
           | really worked are the spreadsheet (which is still linear
           | within each cell) and Labview. Possible shoutout to Unity
           | blueprints, but when those get too complicated spaghetti ..
           | people rewrite them in linear text code.
           | 
           | Not 100% sure what you mean by 'nonlinear' here (flow
           | control?) but almost all industrial and mining equipment is
           | programmed in visual languages on PLCs. Ladder Logic looks
           | like, well, a stylized electrical drawing of a bunch of
           | relays wired up to perform logical operations. Function Block
           | Diagram looks like a PCB layout, but the 'integrated
           | circuits' are function blocks (basically functors) and the
           | 'traces' are copying data between the function
           | blocks. Not great for implementing hardcore algorithms but
           | you can do a surprising amount with them (once you get used
           | to coding with both hands tied behind your back) and they
           | sure are accessible to people who otherwise wouldn't be
           | programming.
           | 
           | Of course, as you say, when things get genuinely complicated,
           | it's much nicer to use a 'real' programming language (or even
           | just Structured Text, which is pretty much just Pascal).
           | 
           | Then again, even with electronics, once things get complex
           | enough don't we start using text (eg. VHDL)? Expressing
           | designs is always a tradeoff between simplicity and
           | 'obviousness' on the one hand, and representational
           | efficiency on the other. Structured text sits right in the
           | sweet spot between the two.
        
         | jcranmer wrote:
         | Graphical programming is one of those things that's often
         | suggested as an improvement on textual programming, and just
         | about every implementation tends to disappoint. I know, when
         | working on compilers, that nearly every time I go "I think I
         | want to see the CFG as a graph here," I tend to realize no,
         | that's not quite what I wanted. For a complex function, the
         | surprising superpower is just to have an editor that shows the
         | opening brace line of every currently-open brace.
         | 
         | Another case in point: when was the last time you saw someone
         | use a flowchart to describe the pseudocode of an algorithm, as
         | opposed to writing, er, pseudocode? Flowcharts used to be the
         | dominant way to do this, decades ago, but they seem to me to
         | have been thoroughly supplanted by pseudocode...
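
The "show the opening line of every currently-open brace" feature can
be sketched with a simple stack. This is a toy version that assumes
C-like source and ignores braces inside strings and comments; the
function name and the sample source are made up for illustration.

```python
def open_brace_lines(source: str, line_no: int) -> list[str]:
    """Return the lines holding each brace still open before line `line_no` (1-based)."""
    stack = []  # line numbers of "{" characters not yet closed
    lines = source.splitlines()
    for number, line in enumerate(lines[: line_no - 1], start=1):
        for char in line:
            if char == "{":
                stack.append(number)
            elif char == "}" and stack:
                stack.pop()
    return [lines[n - 1].strip() for n in stack]


code = """\
int main() {
    for (int i = 0; i < 3; i++) {
        if (i == 1) {
            puts("one");
        }
        puts("loop");
    }
}
"""

print(open_brace_lines(code, 4))
# ['int main() {', 'for (int i = 0; i < 3; i++) {', 'if (i == 1) {']
print(open_brace_lines(code, 6))
# ['int main() {', 'for (int i = 0; i < 3; i++) {']
```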
        
           | WillAdams wrote:
           | I think the problem here is that there isn't an agreed-upon
           | answer for the question:
           | 
           | >What does an algorithm look like?
           | 
           | And any effort to answer it which gets beyond the size of a
           | single diagram/screen/page/poster becomes a problem like:
           | 
           | https://blueprintsfromhell.tumblr.com/
           | 
           | https://scriptsofanotherdimension.tumblr.com/
           | 
           | I like to think of myself as a visual person, and I wish
           | there was a good solution here, and I keep looking for and
           | trying different solutions other folks have made (current two
           | iterations are BlockCAD and OpenSCAD Graph Editor) --- I'd be
           | glad of other suggestions, esp. if able to make graphic user
           | interfaces more complex than the OpenSCAD Customizer.
        
             | card_zero wrote:
             | Argh! Wire-wrapped backplanes! That wasn't the fantasy at
             | all!
        
               | WillAdams wrote:
               | Yes, the fantasy is something like Hermann Hesse's _The
               | Glass Bead Game_ which I mentioned elsethread --- what is
               | the closest available tool to that?
               | 
               | How do such tools manage the problem of
               | encapsulation/modularity becoming the "wall of text"
               | which one is trying to escape, just a pretty wall w/ all
               | the labels in boxes decorated/connected w/ lines?
        
         | AlienRobot wrote:
         | The difficulty in naming things is that you're trying to encode
         | semantics and an interface contract in a name. If you give up
         | doing that, it's easy.
         | 
         | For example, say you have getFoo(). It's clear it gets the foo.
         | But later you introduce getFooAsync(). Suddenly it's no longer
         | clear whether getFoo() is sync or async, because you didn't
         | call it getFooSync().
         | 
         | If instead you used names like getFoo1, getFoo2, getFoo3, etc.,
         | the semantics you're providing is that there are multiple
         | "ways" to getFoo without making promises (a contract) about
         | what the function actually does in its name.
         | 
         | Although this sounds like bad naming practices (it is), it
         | effectively solves the naming problem. Apply this to CSS, and
         | instead of .red-button or .secondary-button, you get .button1,
         | .button2, .button3, and you just don't have to think about WHY
         | you are creating a button to give it a class and start styling
         | it.
        
           | card_zero wrote:
           | Yep, that sort of thing happens _constantly._ Things get
           | misleading names because the first three alternatives I came
           | up with were also misleading. So I agree, and indeed I
           | considered a foo bar baz scheme instead of icons, same
           | difference. Then you have to look somewhere else for what the
           | thing does. Self-documenting code doesn't really work, and
           | strict naming schemes are long-winded and worse than ad-
           | libbing it, so it would have to be comments, but then the
           | comments get forgotten and no longer reflect the code. I give
           | up, might take up woodwork instead.
        
       | mmsc wrote:
       | I wonder how this will affect ssh. OpenSSH recently restricted
       | more characters for valid usernames:
       | https://github.com/openssh/openssh-portable/commit/7ef3787c8...
        
         | cedws wrote:
         | This is a great example of how one poor decision, or one piece
         | of code that is too liberal cascades into an avalanche of
         | shitty workarounds.
        
         | throw0101a wrote:
         | It should be noted that shell metacharacters are also not
         | allowed under POSIX:
         | 
         | * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         |     A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
         |     a b c d e f g h i j k l m n o p q r s t u v w x y z
         |     0 1 2 3 4 5 6 7 8 9 . _ -
         | 
         | * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | 
         | (Hyphen forbidden as first character.)
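
The rule quoted above (portable filename character set, no leading
hyphen) fits in one regular expression. A minimal sketch; the function
name is made up, and real tools such as useradd or OpenSSH apply their
own, different restrictions.

```python
import re

# Portable filename character set: A-Z a-z 0-9 . _ -
# plus the rule that a user name should not start with a hyphen.
PORTABLE_USERNAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")


def is_portable_username(name: str) -> bool:
    """Hypothetical helper: does `name` satisfy the portable user name rule?"""
    return PORTABLE_USERNAME.fullmatch(name) is not None


for name in ["alice", "john.doe-2", "1234", "-rf", "a;b", "caf\u00e9", ""]:
    print(repr(name), is_portable_username(name))
```

Shell metacharacters and non-ASCII letters are rejected, while a purely
numeric name such as "1234" is accepted.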
        
         | linuxftw wrote:
         | I think it will be fine. Everyone will quickly learn the lesson
         | "Use something other than ASCII letters and numbers at your own
         | peril."
         | 
         | Similar to people who put spaces in file names, it should be a
         | fire-able offense.
        
           | lexicality wrote:
           | any software that can't handle spaces in filenames is broken
        
             | Muromec wrote:
             | All of the software is broken (including security wise) all
             | the time anyway.
        
               | bdangubic wrote:
               | this is exactly right... I spoke a few years ago with a
               | mate who is a software dev at one of the major car
               | companies... since then I wouldn't sit in the car from
               | that company if my life depended on it...
               | 
               | then I thought - if I spoke to any dev in any industry I
               | would also stop doing whatever their software is
               | controlling and end up moving to live with amish or some
               | wilderness without electricity
        
           | hiccuphippo wrote:
           | Was that the fireable offense? I always thought the offense
           | was not putting quotes around filenames in scripts.
        
       | dfranke wrote:
       | Allowing purely numeric usernames seems like a terrible idea to
       | me, because it creates ambiguity between what's a username and
       | what's a UID. It's common for tools like ls or ps to display a
       | username when one is found and fall back to displaying a UID if
       | it isn't, and similarly tools like chown will accept either a UID
       | or a username and disambiguate based on whether it's numeric or
       | not. Now suppose there's a numeric username that doesn't match
       | its own UID, but does match some other user's UID. It doesn't
       | take a lot of imagination to see how this would lead to
       | vulnerabilities.
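
A toy model of the ambiguity described above. The user database is a
made-up dictionary and both resolver functions are hypothetical; they
only illustrate two plausible lookup orders, not what any specific
tool does.

```python
# Hypothetical user database: name -> UID.
# The user *named* "1000" has UID 2000; alice has UID 1000.
USERS = {"alice": 1000, "1000": 2000}


def is_numeric(spec: str) -> bool:
    """True for a non-empty string of ASCII digits."""
    return spec.isascii() and spec.isdigit()


def resolve_name_first(spec: str) -> int:
    """Try the string as a user name, then fall back to a numeric UID."""
    if spec in USERS:
        return USERS[spec]
    if is_numeric(spec):
        return int(spec)
    raise KeyError(spec)


def resolve_numeric_first(spec: str) -> int:
    """Treat an all-digit string as a UID, otherwise look up the name."""
    if is_numeric(spec):
        return int(spec)
    return USERS[spec]


print(resolve_name_first("1000"))      # 2000 (the user named "1000")
print(resolve_numeric_first("1000"))   # 1000 (alice's UID)
```

Two tools that disagree on the lookup order end up acting on different
accounts for the same input string.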
        
         | throw0101a wrote:
         | Talk to POSIX:
         | 
         | > _A string that is used to identify a user; see also User
         | Database. To be portable across systems conforming to
         | POSIX.1-2017, the value is composed of characters from the
         | portable filename character set. The <hyphen-minus> character
         | should not be used as the first character of a portable user
         | name._
         | 
         | * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | 
         | The "portable filename character set" is defined as:
         |     A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
         |     a b c d e f g h i j k l m n o p q r s t u v w x y z
         |     0 1 2 3 4 5 6 7 8 9 . _ -
         | 
         | * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | 
         | So only a hyphen as the first character is forbidden.
         | 
         | Given that you can't necessarily control where usernames come
         | from (e.g., LDAP lookups), properly speaking your system has to
         | handle everything anyway, even if you don't allow local
         | creation.
        
           | dfranke wrote:
           | Yes, I'm aware, and POSIX has many such bugs that make
           | command input or output unavoidably ambiguous if certain
           | unexpected characters are present that they didn't think to
           | prohibit. A lot of the revisions that went into POSIX 2024
           | were aimed at fixing some of these, such as standardizing
           | find -print0 and xargs -0. The fact that this one got
           | overlooked doesn't mean it's a good idea to make the
           | situation worse and harder for future POSIX revisions to
           | address.
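
Why NUL-delimited output matters, sketched in Python rather than shell.
The streams below imitate what a newline-delimited listing and a
find -print0 style listing would contain; the file names and the two
parse functions are made up for illustration.

```python
names = ["report.txt", "notes\n2024.txt", "plain"]

# Newline-delimited listing: one name per line.
newline_stream = "".join(name + "\n" for name in names)
# NUL-delimited listing: each name terminated by a NUL byte.
nul_stream = "".join(name + "\0" for name in names)


def parse_lines(stream: str) -> list[str]:
    """Split a newline-terminated listing back into names."""
    return stream.split("\n")[:-1]


def parse_nul(stream: str) -> list[str]:
    """Split a NUL-terminated listing back into names."""
    return stream.split("\0")[:-1]


print(parse_lines(newline_stream))
# ['report.txt', 'notes', '2024.txt', 'plain']
print(parse_nul(nul_stream) == names)   # True
```

A file name may contain a newline but never a NUL, so only the second
format can be parsed back without ambiguity.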
        
           | bluGill wrote:
           | It is time for POSIX to get with the times. Computers are
           | used in more than the US and Canada (for the most generous
           | interpretation of American in ASCII I'm including Canada,
           | their French speakers will not be happy with that, not to
           | mention first nations of which I know nothing but imagine
           | their written language needs more than ASCII). UTF8 has been
           | standard for decades now, just state that as of POSIX 2025
           | all of UTF8 is allowed in all string contexts unless there is
           | a specific list of exception characters for that context
           | (that is they never do a list of allowed characters). They
           | probably need to standardize on utf8 normalization functions
           | and when they must be used in string comparisons. Probably
           | also need some requirement that an alternate utf8 character
           | entry scheme exist on all keyboards.
           | 
           | The above is a lot of work and will probably take more than a
           | year to put into the standard, much less implement, but
           | anything less is just user hostile. Sometimes committees
           | need to lead from the front not just write down existing
           | practice.
        
             | chikere232 wrote:
             | Sounds like lots of work and a lot of new bugs for no real
             | value.
        
             | throw0101a wrote:
             | > _It is time for POSIX to get with the times._
             | 
             | "Be the change that you wish to see in the world." --
             | Mahatma Gandhi
             | 
             | It's free to join:
             | 
             | * https://www.opengroup.org/austin/lists.html
             | 
             | * https://www.opengroup.org/austin/
        
             | atoav wrote:
             | Sure, go ahead. Write the PR and make sure to test against
             | all other things used in production.
             | 
             | Let's talk again in 30 years when you're done.
        
               | jerf wrote:
               | Oh, it's been closer to 20 years for the rest of the
               | world to catch up to Unicode than 30. We aren't at
               | "perfect" now but we're certainly down to the trickier
               | corner cases that are difficult to even see how you solve
               | the problems at all, let alone code the solutions, and
               | that's just reality's ugly nose sticking in to our
               | pristine world of numbers.
               | 
               | But there really isn't any other solution. Yes, there
               | will be an uncomfortable transition. Yes, it blows. But
               | there isn't any other solution that is going to work
               | other than _deal with it_ and take the hits as they come.
               | The software needs to be updated. The presumption that
               | usernames are from some 7-bit ASCII subset is simply
               | unreasonable. We'll be chasing bugs with these features
               | for years. But that's not some sort of optional aspect
               | that we can somehow work around. It's just what is coming
               | down the pike. Better to grasp the nettle firmly [1] than
               | shy away from it.
               | 
               | At least this transition can learn a lot from previous
               | transitions, e.g., I would mandate something like NFKC
               | normalization applied at the operating system level on
               | the way in for API calls:
               | https://en.wikipedia.org/wiki/Unicode_equivalence Unicode
               | case folding decisions can also be made at that point.
               | The point here not being these specific suggestions per
               | se, but that previous efforts have already created a
               | world where I can reference these problems and solutions
               | with specific existing terminology and standards, rather
               | than being the bleeding-edge code that is figuring this
               | all out for the first time.
               | 
               | [1]: https://www.phrases.org.uk/meanings/grasp-the-nettle.html
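               | 
               | A rough Python sketch of that "on the way in" step,
               | using the standard unicodedata module. The function
               | name and the choice to fold case after normalizing are
               | assumptions for illustration, not something any
               | operating system is known to do today:

```python
import unicodedata

def canonical_username(raw: str) -> str:
    """Hypothetical 'on the way in' step: NFKC-normalize, then case-fold."""
    folded = unicodedata.normalize("NFKC", raw).casefold()
    # Case folding can undo normalization in rare cases, so normalize again.
    return unicodedata.normalize("NFKC", folded)
```

               | With this, "\ufb01sh" (starting with the "fi" ligature)
               | and "Fish" both map to "fish", so a later byte-wise
               | comparison treats them as the same name.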
        
             | somat wrote:
             | I would say it is not the place of posix to prescribe how
             | it should be, the job of posix is describe what it is, a
             | common operating environment. this is why posix is such a
             | mess and why I feel it is not a big deal to deviate from
             | posix, however posix fills an important role in getting
             | everyone on the same page for interoperability.
             | 
             | In my opinion the way to improve this, is bottom up, not
             | top down. Start with linux (these days posix is largely
             | "what does linux do?"), get a patch in that changes the
             | definition of the user name from a subset of ascii to a
             | subset of utf-8. what subset? that is a much harder problem
             | with utf-8 than ascii, good luck. get a similar patch in
             | for a few of the bsd. then you tell posix what the os's are
             | doing. and fight to get it included.
             | 
             | On the subject of what unicode subset. perhaps the most
             | enlightened thing to do is the same as the unix filesystem
             | and punt. one neat thing about the unix filesystem is that
             | names are not defined in an encoding but as a set of bytes.
             | This has problems and has made many people very mad. but it
             | does mean your file system can be in whatever encoding you
             | want, transitioning to utf-8 was easy (mainly due to the
             | clever backwards compatible nature of utf-8) and we were
             | not locked into a problematic encoding like on windows.
             | perhaps just define that the name is a array of bytes and
             | call it a day. that sounds like the unix way to me.
        
               | tssva wrote:
               | "however posix fills an important role in getting
               | everyone on the same page for interoperability."
               | 
               | Isn't that exactly what the posix username rules are
               | doing? Specifying a set of characters which are portable
               | across systems to allow for interoperability between
               | current and legacy unix systems along with most non-unix
               | systems.
               | 
               | "Start with linux"
               | 
               | Which linux? Debian/Ubuntu, Redhat/Fedora, shadow-utils,
               | and systemd all differ.
               | 
               | "get a patch in that changes the definition of the user
               | name from a subset of ascii to a subset of utf-8"
               | 
               | ASCII is a subset of UTF-8 so the POSIX definition
               | already specifies a subset of UTF-8.
        
             | PhilipRoman wrote:
             | Some practical concerns I have with UTF-8 are similar (or
             | even the same, depending on font) characters which can be
             | used in malicious ways (think package names, URLs, etc),
             | not to even mention RTL text and other control characters.
             | Every time I add logging code, I make sure that any
             | "interesting" characters are unambiguously escaped or
             | otherwise signaled out-of-band. Having English as an
             | international writing standard is perfectly fine and I say
             | that as a non-native speaker with a non-ascii name.
        
               | abdullahkhalids wrote:
               | A good chunk of the world does not speak english or latin
               | character based languages. They should be able to
               | interact with computers completely in their own languages
               | and alphabet sets, even if those are written right-to-
               | left or top-to-bottom.
               | 
               | Of course, someone has to do the work to make this
               | possible. And no one is obliged to do it. But to suggest
               | that, such work should not be done at all, does not sit
               | right.
        
               | hnthrowaway6543 wrote:
               | > A good chunk of the world does not speak english or
               | latin character based languages.
               | 
               | nearly everyone in a first world country knows the
               | English alphabet though. a vast majority of the
               | developing world as well. just look at street view on
               | Google maps in any country, there's going to be a ton of
               | street signs using English characters, even in non-
               | touristy areas.
               | 
               | > They should be able to interact with computers
               | completely in their own languages and alphabet sets, even
               | if those are written right-to-left or top-to-bottom.
               | 
               | if you're a typical android/ios end user you're
               | interacting with a computer in your native language
               | anyway. this discussion only applies to low level power
               | users.
               | 
               | in that case: why? these aren't user-facing features.
               | this is like saying that people should be able to use
               | symbols native to their language rather than greek
               | letters when writing math papers.
               | 
               | it might not be "fair" that English is overrepresented in
               | computing but it also hasn't demonstrably been a barrier
               | to entry. Japan, Korea and China have dominated,
               | particularly in hardware.
               | 
               | if you think it should be fixed why stop at usernames?
               | why represent uids with 1234 instead of Yi Er San Si ?
        
               | abdullahkhalids wrote:
               | > if you're a typical android/ios end user you're
               | interacting with a computer in your native language
               | anyway. this discussion only applies to low level power
               | users.
               | 
               | I don't think you realize how poor this experience is.
               | Partly the reason being that the underlying system is so
               | english focused, that app developers have to do so much
               | work to get things working.
               | 
               | > if you think it should be fixed why stop at usernames?
               | why represent uids with 1234 instead of Yi Er San Si ?
               | 
               | I mean, if the computers had first been built in south
               | east asia, they would have been.
        
               | hnthrowaway6543 wrote:
               | it's certainly hard to localize everything but billions
               | of people use ios/android in India, China, SEA, MENA,
               | etc... i think it's fair to say that at the end user
               | level, computers are in fact usable by non-English
               | speakers.
               | 
               | individual apps may not be as usable, but that's on the
               | developers. good counter-example, a lot of japanese
               | games, even made within the past 5 years, require setting
               | the Windows system locale to Japanese to function
               | properly. and as someone who played a fair number of
               | japanese doujin games in the 00s/10s, it used to be
               | _every_ game with this problem.
               | 
               | > I mean, if the computers had first been built in south
               | east asia, they would have been.
               | 
               | debatable as CJK heavily use Arabic numerals everywhere,
               | but even if they did, so what? you'd learn those symbols
               | and get used to it. the same way that if you're a unix
               | sysadmin you get used to only being able to use a small
               | subset of ASCII characters for usernames.
        
               | Muromec wrote:
               | Oh no please, I don't want to have my linux username in
               | Cyrillic. Thanks but no, thanks!
               | 
               | I know enough linux to see 10 ways in which it will make
               | things worse at some point.
        
             | miki123211 wrote:
             | > Computers are used in more than the US and Canada
             | 
             | Even if you speak US (or Canadian) English exclusively,
             | there are still some words that are just impossible to
             | spell correctly in pure ASCII, e.g. résumé, café etc.
        
               | drdeca wrote:
               | "correctly". I don't consider it "incorrect" English when
               | someone writes "cafe" or "resume". It seems to me a
               | little bit pedantic to insist that those words must have
               | the accent marks in order to be correct (when using them
               | in English).
        
               | sneak wrote:
               | Yeah, loanwords are different words than the original
               | word.
               | 
               | The correct plural of "baby" in German is "babys".
        
             | rurban wrote:
             | Almost nobody supports string search and comparison API
             | functions for unicode. The unicode security tables for
             | unicode identifiers are hopelessly broken.
             | 
             | Not even the simplest tools, like grep, support unicode
             | yet. This didn't happen in the last 15 years, even if there
             | are patches and libs.
        
         | macintux wrote:
         | At the meatspace level, purely numeric usernames are
         | problematic.
         | 
         | I was working as a contractor at a Fortune 500 firm several
         | years ago when they introduced a new ERP system which
         | apparently encouraged the company to switch to numeric system
         | IDs. Fortunately the technical teams, especially Linux support,
         | objected and it was overruled, but I was just as worried about
         | the communications problems that would result.
         | 
         | When everyone has a system ID that matches a consistent
         | pattern, like "YZ12345", IDs are easy to recognize in
         | documentation and data. An ID like "1234567" could be
         | practically anything.
        
           | PhilipRoman wrote:
           | I really like the concept of adding some redundancy to ids,
           | like a prefix. It helps to disambiguate things (kind of like
           | static typing). Another good example is bank account numbers,
           | which must equal a multiple of 97 plus 1, enabling fast
           | client-side validation against typos.
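           | 
           | A small Python sketch of that kind of check, assuming the
           | "bank numbers" meant here are IBANs, whose rearranged
           | digits must leave a remainder of 1 when divided by 97. It
           | only tests the checksum, not country-specific lengths:

```python
def iban_checksum_ok(iban: str) -> bool:
    """Return True if the IBAN passes the mod-97 check."""
    s = iban.replace(" ", "").upper()
    if not s.isalnum() or not s.isascii():
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Letters become two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

           | Because 97 is prime and does not divide 10, changing any
           | single digit always changes the remainder, so a one-digit
           | typo never slips through.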
        
         | hulitu wrote:
         | > Allowing purely numeric usernames seems like a terrible idea
         | to me
         | 
         | "I'm not a number, i am a free man. Ha ha ha ha ha"
        
           | kps wrote:
           | "Who is UID 0?"
           | 
           | "You are UID 6."
        
         | thephyber wrote:
         | I am also worried about more subtle bugs caused by usernames
         | that are not strictly only-numeric, such as "10e2" or
         | "0xDEADBEEF".
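         | 
         | To make the worry concrete, here is a hedged Python sketch
         | that lists how a few common parsers would read a name as a
         | number. The helper is hypothetical, and other languages and
         | tools have their own, different parsing rules:

```python
def numeric_readings(name: str) -> dict:
    """Collect the ways common Python parsers would read `name` as a number."""
    readings = {}
    for label, parse in (
        ("int base 10", int),
        ("int, prefix-aware", lambda s: int(s, 0)),
        ("float", float),
    ):
        try:
            readings[label] = parse(name)
        except ValueError:
            # This parser does not accept the name as a number.
            pass
    return readings
```

         | Here "10e2" is read as a float and "0xDEADBEEF" as a
         | prefix-aware integer, and even an innocent-looking name like
         | "inf" is accepted by float().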
        
         | Ferret7446 wrote:
         | It shouldn't be a problem as long as the system disallows a
         | numeric username to be the same as an existing UID (excepting
         | the case where the matching UID is assigned to said username).
        
       | huhtenberg wrote:
       | Sounds like a solution in search of a problem.
       | 
       | And a disruptive solution with unclear side effects at that.
        
       | johnisgood wrote:
       | > If a keyboard input system provides the former sequence of
       | bytes, but the username is stored in the login infrastructure
       | using the latter sequence of [bytes], then a naive comparison
       | will not find the user "émollier" in the system. Unicode defines
       | in Annex 15 a few normalization forms as a way to work around
       | this problem. But a correct use of these normalization forms
       | still requires coordination and standardization among all
       | programs accessing the data.
       | 
       | ICU could work, but adds an extra dependency, there is also GNU's
       | libunistring.
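       | 
       | For illustration only, the same mismatch in Python's standard
       | unicodedata module rather than ICU or libunistring, assuming
       | the two byte sequences in the quote are the precomposed and
       | decomposed spellings of an accented "e":

```python
import unicodedata

precomposed = "\u00e9mollier"   # accented e as one code point (U+00E9)
decomposed = "e\u0301mollier"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def same_user(a: str, b: str) -> bool:
    """Compare two names after NFC normalization (one of the Annex 15 forms)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

       | The two strings render identically but differ as bytes, so a
       | plain == fails while same_user() succeeds. That only helps if
       | every program touching the name normalizes the same way.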
        
       | resource_waste wrote:
       | This is important because Debian-family is used on many servers?
       | 
       | Debian seems to just squander resources on things a few powerful
       | people care about.
       | 
       | All my servers have been Debian-based, so I can't be too hard on
       | them, but whenever I see someone recommend a Debian-family distro
       | as a Desktop OS, I feel like I need to call the police.
        
       | perlgeek wrote:
       | Just imagine how many poorly-written shell scripts will break
       | when we suddenly allow dollars, quotes, backticks and the like
       | in usernames. Heck, even allowing spaces sounds like horror to me.
       | 
       | On the display side, I'm sure most tools that display usernames
       | won't make it easy to see if there are leading or trailing
       | whitespace characters, double blanks, tabs etc in usernames.
       | 
       | This sounds like support hell to me.
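       | 
       | A small Python sketch of the word-splitting half of that
       | problem, using the standard shlex module to mimic how a POSIX
       | shell tokenizes a command line. The chown command and path are
       | made up, and shlex does not model $ or backtick expansion:

```python
import shlex

def naive_command(username: str) -> str:
    # Mimics an unquoted $USER in a shell script: the name is pasted in as-is.
    return f"chown {username} /srv/data"

def quoted_command(username: str) -> str:
    # shlex.quote wraps the name so a POSIX shell sees exactly one word.
    return f"chown {shlex.quote(username)} /srv/data"
```

       | Splitting naive_command("joe smith") yields four words instead
       | of three, while the quoted version keeps "joe smith" as a
       | single argument.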
        
         | gmuslera wrote:
         | The problem could be old scripts or systems that don't handle
         | UTF-8 (and they needn't be the ones where the username was
         | defined). I'm not sure whether, e.g., the Bobby Tables trick
         | could be done with characters whose UTF-8 representation is
         | read as pure ASCII.
        
         | Starlevel004 wrote:
         | Breaking shell scripts sounds like a good idea to me. The
         | faster they die the better the world gets.
        
           | Rygian wrote:
           | That's going to be a very bumpy road, even if everyone were
           | to agree that the destination is appealing.
        
             | bigstrat2003 wrote:
             | Yeah for better or for worse compatibility is king. I
             | _despise_ shell scripts, they are an absolute nightmare to
             | work with and full of footguns. But they are so commonplace
             | that people are not going to tolerate YOLO breaking
             | changes.
        
           | chikere232 wrote:
           | Perhaps unix isn't for you?
        
           | makeitdouble wrote:
           | Thing is, they don't die. Instead you get the short end of
           | the stick.
           | 
           | You'd have to be pretty darn important for an org to fix
           | their scripts because of your name or the username you
           | created. Or it would need to happen at a larger scale, but
           | then that wouldn't be so controversial in the first place.
        
         | codedokode wrote:
         | But spaces have been allowed in filenames since the 80s, hasn't
         | software had enough time to adapt?
        
           | michaelt wrote:
           | Microsoft's Windows 95 put spaces into "c:\My Documents" and
           | "c:\Program Files" so that developers targeting Windows were
           | _forced_ to support spaces in filenames.
           | 
           | Of course, in those days if an OS upgrade broke some third
           | party software, the end user _paid for an upgrade_. So
           | although Microsoft forced developers' hands, the developers
           | all got paid for their trouble. And you'd only have your hand
           | forced that way once or twice a decade.
           | 
           | Windows at the time was also all about the GUI file-pickers.
           | Breaking the command line? Shell scripts? What are those?
        
             | toast0 wrote:
             | And now it's \Users, presumably because after 20 years,
             | Microsoft gave up?
        
               | hwc wrote:
               | Or someone got tired of typing long paths.
        
               | Uvix wrote:
               | They changed from \Documents and Settings to \Users in
               | Vista, alongside other profile rejiggering (e.g.
               | introducing AppData folders). By that point software had
               | either been fixed or would never be fixed, so keeping a
               | space in the name wasn't particularly useful.
        
               | rcxdude wrote:
               | It's still very common for usernames to have spaces,
               | though.
        
               | alterom wrote:
               | _And now it's \Users, presumably because after 20 years,
               | Microsoft gave up?_
               | 
               | Only if you assume that people rarely have spaces in
               | their Windows login names (e.g. "Joe Smith").
               | 
               | Either that, or Windows users have learned to _not be
               | scared of spaces_ in filenames, usernames, and _their own
               | literal names_.
        
             | bigstrat2003 wrote:
             | That doesn't sound right. Microsoft is _obsessed_ with
             | backwards compatibility, going so far as to accommodate
             | programs that were _writing to Windows' private memory_
             | just to preserve it. Deliberately breaking programs isn't
             | in their ethos at all.
        
               | sltkr wrote:
               | The new filesystem APIs were introduced with Windows 95,
               | so there was no backward compatibility to break. _New_
               | programs using those _new_ APIs were forced to support
               | spaces in directories. Using spaces in the system
               | directories forced application developers to consider
               | that scenario and deal with it appropriately.
               | 
               | Meanwhile, DOS and Windows 3.1 applications that did run
               | on Windows 95 could access files under a backward
               | compatible 8.3 scheme, like C:\Progra~1\ instead of
               | "C:\Program Files".
        
               | bigstrat2003 wrote:
               | That's a good point, thanks for pointing it out.
        
               | michaelt wrote:
               | I'm thinking of the transitions from Windows 3.1 to
               | Windows 95 (IIRC introducing 32-bit and filenames longer
               | than 8 characters) and the transition from Windows 95 to
               | Windows XP (IIRC introducing a proper permission system,
               | thus breaking anything that relied on being able to write
               | things outside of user-owned folders)
               | 
               | I agree they were famously accommodating in those days.
               | But they also had enough market power that if they said
               | users could only write to one folder and it had a space
               | in the filename, developers who disliked it couldn't vote
               | with their feet.
        
             | dizhn wrote:
             | A lot of software still had issues and asked the user to
             | use C:\Directory directly. Some probably still do.
        
               | reginald78 wrote:
               | I remember trying to install Visual Studio in the
               | mid-late 2000s (when SSDs made hard drive space small
               | again) to a directory other than C: and found that after
               | following a rather convoluted process you could only
               | actually move maybe 20% of the install files off C:.
        
               | StefanBatory wrote:
               | It is still the same. :(
        
               | yonatan8070 wrote:
               | I've seen some things installing directly into C:\,
               | NVIDIA's software jumps to mind
        
             | akira2501 wrote:
             | C:\Progra~1
             | 
             | They didn't force anything.
        
           | deltarholamda wrote:
           | My last name has an apostrophe in it. This isn't super weird
           | or anything, there have been "O'Haras" and "O'Neills" (with 2
           | Ls) forever.
           | 
           | And yet whenever I deal with a computer system I don't put
           | the apostrophe in because even in 2024 it is completely
           | jacked up. Sometimes it's just disallowed. Sometimes I get
           | "\\\'" showing up. Sometimes I get "&apos;". I've seen
           | "&#8217;". One time, one system accepted it, but another
           | system that accessed the same data didn't allow apostrophes
           | so the person using the second system couldn't access the
           | record, and it took 2 phone calls and 3 people to come up
           | with a workaround.
           | 
           | It doesn't work often enough that I don't even try anymore.
           | There are just too many opportunities for it to get forgotten
           | or handled improperly from all directions.
        
             | soneil wrote:
             | I had fun in the vmware-broadcom transition because the
             | broadcom portal doesn't allow that, but the vmware portal
             | did. Not even in my username, just in the surname field.
             | The new portal ate it on that so hard, I wasn't even
             | allowed to create a ticket to do anything about it.
             | 
             | Not as bad as when I was once issued a first.o'last@corp
             | email address though ..
        
               | mixmastamyk wrote:
               | There may be a Unicode character that looks like
               | apostrophe but has no quoting semantics. I use an arrow
               | instead of greater-than symbol in my prompt for the same
               | reason. To avoid copy/paste issues.
        
               | jcranmer wrote:
               | Non-ASCII characters in email addresses have even worse
               | compatibility issues than punctuation characters.
               | Punctuation fails because people don't know the standard.
               | Non-ASCII fails because people don't know the _latest_
               | standard.
        
               | deltarholamda wrote:
               | >Not as bad as when I was once issued a first.o'last@corp
               | email address though
               | 
               | Oh, man, that happened to me too, way back in the late
               | 90s. I had forgotten about that.
               | 
               | It broke things all over the place. Even now you run into
               | the occasional validator that is convinced that the plus
               | sign is not valid in email addresses.
        
               | mschuster91 wrote:
               | > Even now you run into the occasional validator that is
               | convinced that the plus sign is not valid in email
               | addresses.
               | 
               | These are intentional IMHO - force people to use their
               | actual email address so a potential breach can't be tied
               | back to the service. That's the _only_ reason why someone
               | would use a + in the first place.
        
             | graemep wrote:
             | > And yet whenever I deal with a computer system I don't
             | put the apostrophe in because even in 2024
             | 
             | In usernames or in name fields for text generally?
             | 
             | I assume things like bank systems can deal with it because
             | they should match things like IDs?
        
               | deltarholamda wrote:
               | Name fields in general.
               | 
               | But sometimes I don't have control, e.g. another person
               | is inputting the data and dutifully duplicates my name.
               | That's how I ended up with the 2 phone calls/3 person
               | situation, which happened about a month ago.
               | 
               | Hell, my driver's license is missing the apostrophe
               | because the system doesn't accept it.
               | 
               | When somebody is trying to find me in a computer there's
               | a whole litany of things they have to try, including
               | assuming "First O'Lastname" got bashed into "First O.
               | Lastname".
               | 
               | I think about this every time I read an article extolling
               | the wonders of technology.
        
             | jorvi wrote:
             | > One time, one system accepted it, but another system that
             | accessed the same data didn't allow apostrophes so the
             | person using the second system couldn't access the record,
             | and it took 2 phone calls and 3 people to come up with a
             | workaround.
             | 
             | There's still a lot of organisations that somewhere in
             | their e-mail processing chain cannot deal with 4-letter
             | TLD e-mail addresses*. Even worse is that the front-end is
             | often a relatively new framework and will happily accept
             | your e-mail, only to then have it silently fail forever.
             | Mercifully a lot of those organisations have their customer
             | service authorized to change your e-mail address manually,
             | but if they don't.. good luck.
        
           | wongarsu wrote:
           | NPX on windows was broken for years when your username had a
           | space in it. Never underestimate how long bugs can stay
           | around when they don't affect any of the developers and for
           | everyone else the workaround is quicker than fixing them.
        
           | slightwinder wrote:
           | Problem is, the design of Unix shells is older, and they have
           | some parts which automatically split on space if not handled
           | carefully. This is really annoying.
        
         | rossy wrote:
         | For people using NSS modules like winbind, most of those
         | scripts are already broken
        
         | wolrah wrote:
         | > Just imagine how many poorly-written shell scripts will break
         | when we suddenly allow dollars, quotes, backticks and the like
         | in usernames. Heck, even allowing spaces sounds like horror to
         | me.
         | 
         | If we're admitting they're poorly-written, why can't we admit
         | that they're already broken regardless of whether that
         | brokenness is currently being triggered? Allowing symbols or
         | spaces didn't break anything, it was broken from day one just
         | no one noticed.
         | 
         | Why is the answer always "go out of your way to not upset the
         | broken garbage that's been around forever" rather than "throw
         | Zalgo at it and fix what breaks so it's no longer broken and
         | won't be broken in the future"?
         | 
         | Bug compatibility is the worst behavior of the computing
         | industry. Let the bad code break and more importantly call it
         | out so everyone knows where the blame belongs.
        
       | nmstoker wrote:
       | Unfortunate ambiguous uses of the word drop throughout the
       | otherwise excellent article
        
         | TimK65 wrote:
         | There are three uses of the word "drop," all of which are
         | correct.
         | 
         | The latter-day meaning of "drop" is an abomination.
        
           | toast0 wrote:
           | I dropped X off at Y. Then X dropped off the face of the map,
           | never to be seen again.
           | 
           | Many words and phrases in English are self-antonyms.
        
         | fargle wrote:
         | > The src:shadow package had dropped a Debian-specific patch,
         | 
         | shoot, that's evil. had not noticed this. i read this as
         | "removed", not "was released". now idk.
         | 
         | this pseudo-definition of dropped as "released" is beyond
         | stupid. yikes!
        
       | account42 wrote:
       | Always fun to see people poke the Unicode dragon only to be
       | dumbstruck by its true size as it stands up in preparation of
       | engulfing them with the fire of unintended consequences.
        
         | beardygo wrote:
         | Indeed. As a speaker of several languages, including an RTL
         | language (they haven't even considered the problems with RTL
         | marks etc), I say stay with ASCII for usernames, keep UTF for
         | full names.
         | 
         | If restricted ASCII a-z is good enough for passport names
         | worldwide, it's good enough for usernames.
        
           | macbr wrote:
           | I'm confused - my name as written on my passport definitely
           | contains non ASCII characters?
        
             | extraduder_ire wrote:
             | What is it in the machine-readable section at the bottom?
             | My passport takes the apostrophe out of my name down there.
        
             | Muromec wrote:
             | You probably have an ASCII-adjacent name to begin with, so
             | people who can read some kind of language using Latin
             | letters will simply ignore "funny dots and dashes" and
             | pronounce it kinda wrong.
             | 
             | It's on a different level from having a name originally
             | written in a different alphabet entirely. At this point you
             | just have it written in two scripts, with the second being
             | ASCII.
        
           | mschuster91 wrote:
           | > If restricted ASCII a-z is good enough for passport names
           | worldwide, it's good enough for usernames.
           | 
           | Passports (and credit cards) are the best example why ASCII-
           | only is horribly broken. It's 2024, people want to type in
           | their name as they write it normally, and they have the
           | reasonable expectation of IT "dealing with it" behind the
           | scenes.
           | 
           | Unfortunately, that expectation isn't reality, and it's all
           | too common people are being rejected at the border or their
           | card transactions are denied because braindead policies leave
           | no other option but to blanket deny in case of mismatches.
        
         | tgbugs wrote:
         | I made a design decision for a standard for dataset structure
         | to explicitly ban characters beyond ascii [A-Za-z0-9.,-_ ]
         | precisely because all the positivity around utf-8 often leads
         | people to think that it comes with no additional complexity
         | cost. There is an escape hatch with a way to indicate that a
         | dataset uses unicode filenames but the standard states that any
         | consumer may reject such datasets because unicode support is
         | explicitly not required.
         | 
         | I got pushback from people who would not have to implement or
         | maintain the systems for being a backward asciite so seeing
         | this article is rather vindicating.
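         | 
         | A minimal sketch of such an allowlist check, in Python. The
         | names NAIVE, STRICT and is_portable_name are made up for
         | illustration and are not part of any standard. One trap worth
         | showing: if "[A-Za-z0-9.,-_ ]" is pasted into a regex as-is,
         | ",-_" is read as a range from "," to "_" and admits far more
         | than intended.
         | 
```python
import re

# Naive transcription of "[A-Za-z0-9.,-_ ]": inside a character class,
# ",-_" is a RANGE from "," (0x2C) to "_" (0x5F), so it also admits
# characters such as "/", "<", ">" and "@".
NAIVE = re.compile(r"[A-Za-z0-9.,-_ ]+")

# Hypothetical stricter reading: the hyphen is placed last, so it is a
# literal hyphen rather than a range operator.
STRICT = re.compile(r"[A-Za-z0-9., _-]+")


def is_portable_name(name: str) -> bool:
    """Return True if every character is in the intended ASCII allowlist."""
    return STRICT.fullmatch(name) is not None


print(NAIVE.fullmatch("a<b") is not None)       # the naive class accepts "<"
print(is_portable_name("a<b"))                  # the strict class rejects it
print(is_portable_name("data_set-01, v2.csv"))  # plain ASCII name passes
print(is_portable_name("caf\u00e9.csv"))        # non-ASCII name is rejected
```
         | 
         | Whether space and comma belong in the set is a policy choice;
         | the sketch only mirrors the characters listed above.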
        
       | miohtama wrote:
       | I remember useradd and adduser when learning Linux and oh boy
       | what a confusion it was... Why not just one command?
        
       | abigail95 wrote:
       | if you cannot handle UTF-8 anywhere anything approaching text
       | could be, your program is malformed and should be deprecated and
       | removed.
       | 
       | if you wrote code that couldn't handle bob;>/hacked in a
       | username, you would and should be laughed at.
       | 
       | why are we using this ancient stuff?
        
         | knorker wrote:
         | It's not just programs. And it's not just semantics of all-
         | numeric username. It's also whether you want usernames that you
         | cannot type, nor possibly even render.
         | 
         | Definitely you can't spell it to someone else.
         | 
         | Who owns that file? Oh, it's right-to-left non breaking space
         | smiley snowman Chinese sign for water, I love that guy!
        
           | abigail95 wrote:
           | If people want to set up a Debian environment where people
           | are mixing RTL and Hanzi I see no reason for that to be
           | prohibited.
           | 
           | Debian has opinions but I disagree that they should extend
           | that far.
           | 
           | If my employee Zalgo-fies everything. I don't file a bug
           | report with Debian. I just fire them.
        
             | Muromec wrote:
             | >If my employee Zalgo-fies everything. I don't file a bug
             | report with Debian. I just fire them.
             | 
             | With such a clearly North American attitude you may as well
             | use ASCII for everything.
        
         | drtgh wrote:
         | With Unicode the same grapheme can be written with a sequence
         | of one or more code points, and each code point can be a
         | sequence of one or more code units.
         | 
         | For example "å" can be written with U+00E5, and the same visual
         | glyph "å" with U+0061 + U+030A ( U+0061 {a} plus the code point
         | U+030A {Combining Ring Above}).
         | 
         | Another homoglyph Unicode user name example:
         | 
         | * is Café == Café ?
         | 
         | * C + a + f + e + U+0301 {Combining Acute Accent} vs C + a + f + é
         | 
         | * Utf8: 43616665CC81 vs 436166C3A9
         | 
         | As one user has pointed out in another comment, some kind of
         | standardisation for that specific use case with some kind of
         | normalisation would be needed first (nevertheless a database
         | search would want a different one, and so on). The above
         | examples are among the simpler ones, there are also unprintable
         | characters, etc.
         | 
         | It can be done as in "nothing is impossible", but it's not that
         | easy, it's actually complex.
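         | 
         | The two byte sequences above can be reproduced with Python's
         | standard unicodedata module. This is only a sketch: the helper
         | same_after_nfc is a made-up name, and choosing NFC (rather
         | than NFD, NFKC or NFKD) is an assumption, not a recommendation
         | for usernames.
         | 
```python
import unicodedata

composed = "Caf\u00e9"      # C a f + U+00E9 (precomposed e with acute)
decomposed = "Cafe\u0301"   # C a f e + U+0301 (combining acute accent)


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                    # False: different code points
print(composed.encode("utf-8").hex().upper())    # 436166C3A9
print(decomposed.encode("utf-8").hex().upper())  # 43616665CC81
print(same_after_nfc(composed, decomposed))      # True
```
         | 
         | A plain == compares code points, so the two spellings differ
         | until both sides are normalised to the same form.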
        
           | abigail95 wrote:
           | If a user picks a presentation layer that displays an "a" from
           | noncomparable alphabets, but has them look identical - that's
           | a choice they can and should be able to make. I think it's
           | dumb but I'm not here to hold anyone's hand.
           | 
           | It's the user's choice whether 43616665CC81 == 436166C3A9,
           | same for Café == Café. But they are distinct and separate
           | choices. Text and bytes are separate things.
           | 
           | We accept that case sensitivity exists and whether a
           | user/business/program treats them as identical is and _should
           | always be_ their choice to make.
           | 
           | There is abstract complexity in the problem, but the context
           | in which text is used solves most of that.
           | 
           | If I have handwritten notes and I make a copy but write the
           | second one in cursive and ask someone if they say the same
           | thing - the correct answer isn't "we need to create a
           | standard to normalize the presentation of text" - it's "be
           | more precise in what you are asking".
           | 
           | Whether Café == Café depends on if it's written on a road
           | sign, or a network packet with a fixed byte size.
           | 
           | Unprintable characters are not text and should not be stored
           | in text fields. Neither are control characters, and as far as
           | I'm concerned should not be included in any text encoding
           | standard. Formatting and terminal processing _should never be
           | stored in-band_ , that's an obvious design flaw that should
           | be corrected.
           | 
           | We already deal with ambiguity within ASCII re I vs l vs 1.
           | Some fonts render those identically - Using those fonts in a
           | passport is bad design. Saying we should avoid having to
           | compare those characters at all because _some people /systems
           | might confuse them_ is misguided.
           | 
           | This isn't a true rebuttal of what you were saying but some
           | of my next thoughts.
        
         | anon-3988 wrote:
         | Nah, you can use whatever you want for _display_.
         | 
         | We have our tower of babel here and we are telling people not
         | to use it? I am not even a native English user btw. Having a
         | lingua franca allowed me to understand someone from Russia,
         | China, Japan, etc.
         | 
         | Maybe once we have easily accessible ML that translates nuances
         | in one language to another without loss we can all talk in our
         | own languages and just translate each other's words.
        
           | abigail95 wrote:
           | I think people should be able to configure systems to handle
           | a broad range of text from popular encoding standards like
           | UTF-8.
           | 
           | Limiting text-space because of communication is a strange
           | objection that I don't think will hold up over time.
        
         | PhilipRoman wrote:
         | I really love this powerless use of "should". If you spit on
         | billions of lines of code, all you will get is a dry mouth. The
         | reality defines "what is", unless you have lots of tanks and
         | people under your control, in which case you can change the
         | reality.
         | 
         | There is tons of useful code which you will likely never
         | encounter, that helps people accomplish their tasks every day.
         | Do you think there is some central authority who is going to go
         | building to building and dd if=/dev/zero every shell script
         | they find?
        
           | abigail95 wrote:
           | This is a contemporary discussion, today, concerning hundreds
           | perhaps thousands of lines of code. That's it.
           | 
           | If someone is objecting to changes because of things like
           | "bob;>/hacked", that is laughable, and I will continue to
           | point and laugh. Imagine limiting URL encoding because of SQL
           | injection.
           | 
           | We can fix this, then fix the things that break - and then we
           | can improve.
           | 
           | Or we can ossify into stone. Your choice.
        
             | PhilipRoman wrote:
             | >if you cannot handle UTF-8 anywhere anything approaching
             | text could be, your program is malformed and should be
             | deprecated and removed.
             | 
             | I was referring to this. Don't get me wrong, I also would
             | love to make sweeping changes to many things in computing.
             | I still think it is perfectly valid to impose reasonable
             | limitations on input even if the program could
             | theoretically handle it - it prevents all kinds of problems
             | at the very root (like allocating disproportionate amounts
             | of resources, infinite timeouts, etc).
        
       | chikere232 wrote:
       | oh yes, let's break things to gain nothing of value
        
         | gspr wrote:
         | Perhaps nothing of value _to you_.
         | 
         | I'll hazard a guess that your preferred username can be
         | expressed in a small subset of ASCII? And to hell with everyone
         | else?
        
           | knorker wrote:
           | I'll hazard a guess that your preferred username can't be
           | written by 99.99999% of the world, and would always have to
           | be copy-pasted?
        
             | Ylpertnodi wrote:
             | Yeah, us foreigners, up to our usual tricks again.
        
               | knorker wrote:
               | By any definition of the word, I'm a foreigner.
               | 
               | So if you meant to imply that I'm an American, you've
               | guessed wrong.
        
           | chikere232 wrote:
           | If your personal identity is threatened by having to use an
           | ascii alphanumeric login name, you're kind of creating
           | problems for yourself for no reason...
           | 
           | There is a field for the full name of the person if you want
           | to, and at least on my linux it warns for non-ascii
           | characters but allows them
        
           | anon-3988 wrote:
           | It's a give and take. If you allow for anything beyond latin,
           | then you have to accept that there will be a class of
           | software that will be difficult to interact with.
           | 
           | Latin-like language system is simply superior for machine
           | purposes. I am sorry, but I don't even want to think of
           | supporting the entire unicode in my software. I am not going
           | to even attempt to reverse that emoji.
        
             | chikere232 wrote:
             | It gets real fun when it's something you need to look up
             | and have match, like a username.
             | 
             | Because then it has to be normalised in the right way for
             | comparisons to work, or it will only match if your input
             | method happens to produce the exact same variant.
             | 
             | ... And unicode is an evolving standard where this
             | normalisation sometimes changes between standards, so the
             | names as normalised in the old version of your standard
             | library might disagree with the new version. So you need to
             | care for that transition.
             | 
             | ... And often this is implemented separately for different
             | languages, so you can get names that won't match if you
             | normalise them in python, java or C.
             | 
             | ... And as all implementations, these unicode
             | implementations sometimes have bugs, so you need to think
             | not only about matching supported unicode versions, but
             | matching bugs.
             | 
             | ... And any change in these normalisations can in theory
             | lead to two usernames that used to be distinct becoming
             | identical.
             | 
             | It's a deep well
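             | 
             | To make the "distinct names becoming identical" risk
             | concrete, here is a small Python sketch. The functions
             | lookup_key and find_collisions are hypothetical, and
             | using NFKC plus case folding as the comparison key is an
             | assumption for illustration, not what any login stack
             | is known to do.
             | 
```python
import unicodedata
from collections import defaultdict


def lookup_key(username: str) -> str:
    """Hypothetical comparison key: NFKC normalisation, then case folding."""
    return unicodedata.normalize("NFKC", username).casefold()


def find_collisions(usernames):
    """Group usernames whose keys coincide; return only groups of 2 or more."""
    groups = defaultdict(list)
    for name in usernames:
        groups[lookup_key(name)].append(name)
    return [names for names in groups.values() if len(names) > 1]


users = [
    "Caf\u00e9",                 # precomposed e-acute, capital C
    "cafe\u0301",                # e + combining acute accent
    "\uff43\uff41\uff46\u00e9",  # fullwidth c, a, f + precomposed e-acute
    "cafe",                      # plain ASCII, no accent
]

for group in find_collisions(users):
    print([name.encode("unicode_escape").decode("ascii") for name in group])

# The Unicode data version this Python build normalises against:
print(unicodedata.unidata_version)
```
             | 
             | Three of the four names share one key while being
             | different byte strings on disk, which is exactly the
             | case a migration between library versions or languages
             | would have to audit.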
        
               | khaled wrote:
               | > And unicode is an evolving standard where this
               | normalisation sometimes changes between standards
               | 
               | Unicode normalization is subject to its stability policy,
               | and Unicode no longer allows adding new canonically
               | equivalent code points.
               | 
               | https://www.unicode.org/policies/stability_policy.html
        
         | layer8 wrote:
         | The issue is that it has already been broken (read: has allowed
         | arbitrary byte sequences) for a long time, and the debate is
         | about what to restrict it to.
        
       | codedokode wrote:
       | Don't you think that it would be better to get rid of usernames
       | in UI? They only provide unique data for fingerprinting and do
       | almost nothing useful on a single-user system. Wouldn't it be
       | better to simply have a default name like "primary user" or "main
       | user" for the first user and skip one step in installation
       | process? Also it frees you from typing a username on login for a
       | single-user system.
        
         | eviks wrote:
         | Single user systems can just not ask for a username if there is
         | only one, they control the UI
        
       | knorker wrote:
       | So in the future I may not be able to even type the name of
       | another user? Admins and other users not being able to type
       | usernames sounds very bad.
       | 
       | And I say that as someone whose native language has more letters
       | than English.
        
       | zvr wrote:
       | Most people are too young to remember that when you typed your
       | username in all-caps in the login prompt (because the CapsLock
       | key was on by accident, for example), the login(8) program
       | assumed you were in a connection that could only do 7-bit (upper
       | case, but no lower case characters) and immediately switched the
       | tty settings and you were then presented with a "\PASSWORD: "
       | prompt.
        
         | roelschroeven wrote:
         | Don't you mean 6-bit? 7-bit ASCII supports lower case
         | characters. Or maybe there are other 7-bit character sets that
         | don't have lower case characters and it was one of those?
        
           | jks wrote:
           | PETSCII? On the Commodore 64 you could press the Commodore
           | key and Shift together to change character sets between
           | lowercase and the graphical characters.
           | 
           | But the Unix login thing might have been because of
           | teletypes?
           | https://www.columbia.edu/cu/computinghistory/teletype/ claims
           | that ASR 33 used 8-bit ASCII but was uppercase only - not
           | sure if the "8-bit" claim can be true.
           | 
           | On some Unix (and Linux) systems, you can still enter a kind
           | of retro mode with "stty olcuc iuclc" (output lowercase to
           | uppercase, input uppercase to lowercase) and turning on Caps
           | Lock.
        
       | soneil wrote:
       | This reminds me of the systemd bug where usernames starting with
       | a digit were mishandled (#15141).
       | 
       | It seems to me like something that "should" be relaxed, but we
       | need to have high confidence in the entire foodchain. adduser
       | seems like the last place it should be changed, not the first -
       | anyone requiring "enough rope" is already served by useradd.
        
       | hwc wrote:
       | My work machine uses my complete email address as a username
       | (this was done by someone in the IT department). Vim gets
       | confused when I use the `gf` command to open a path that contains
       | an '@' character in it.
        
       | bjourne wrote:
       | Honestly, it is super brain-dead that Linux and other operating
       | systems still have such massive problems with "special"
       | characters. Just the other day I had to help someone who had
       | trouble building. The cause turned out to be that they had
       | dropped filenames with parentheses in the source directory which,
       | apparently, confused bash which make relies on. Such trash is
       | everywhere on Linux systems. Eventually you learn to only use
       | [a-zA-Z0-9-_.] in names because anything else will inevitably
       | confuse some tool or another (even capital letters can be a
       | PITA)... I so wish someone would take it upon themselves to clean
       | up this mess, but it's probably too much work, and there are too
       | many nay-sayers conditioned to it who don't see the need for
       | changes.
        
       | hiccuphippo wrote:
       | As someone who needs non-ascii characters to write my name:
       | _please don't_. You are making things worse just to be
       | "courteous" about something we don't care about and will actually
       | be annoyed at if we have to find how to write a letter on the
       | keyboard or, worst-case scenario, figure out how to change the
       | layout to the correct one _before I even logged in_.
        
         | jks wrote:
         | Likewise. My last name contains a non-ascii character. In ~2009
         | I started at a company whose admin conveniently set up an
         | account for me on their Ubuntu server... on which no-one could
         | then log in locally because the login manager crashed when
         | trying to display the list of users. I logged in via ssh and
         | changed my name to the nearest ASCII equivalent.
         | 
         | I always feel slightly worried on sites that demand that I give
         | my full legal name (such as the US ESTA form), and then refuse
         | to handle it because it includes "illegal" characters.
        
           | ASalazarMX wrote:
           | This has happened to me with _passwords_ containing foreign
           | characters. The system would accept it, but further logons
           | would be impossible. Now I always strip diacritics to be
           | safe.
        
             | jks wrote:
             | A friend mentioned using control characters in passwords...
             | like ^F and ^B, but not ^C because that's the interrupt
             | character. Feels vaguely risky to me (does ^U empty the
             | line? does ^W delete the last word? does your terminal
             | emulator do some weird encoding like it does for cursor
             | keys?) but if it works, why not?
        
               | jowea wrote:
               | I suspect I have run into a couple bugs because of
               | password generators putting characters that some backend
               | system cannot process in the password. Half wish they just
               | did DKWhhjwqjkwqjmHSJKHAIUHQwdmlsadkl instead.
        
           | beardygo wrote:
           | Full legal name as it appears in the machine-readable zone of
           | your own passport. Allowed characters are A-Z only, see MRZ
           | specifications:
           | 
           | https://en.wikipedia.org/wiki/Machine-readable_passport
        
             | Muromec wrote:
             | What's a legal name? It presumes it's somehow different
             | from other ... illegal names. But in which way? Which law
             | has a say?
        
         | doubled112 wrote:
         | Just having an apostrophe in my last name causes me issues.
         | 
         | Yes, that's me, Mr. O&amp;Conner
        
       | SuperSandro2000 wrote:
       | They are clearly bored and want to start a year long bug hunt
       | through half of unix
        
         | Muromec wrote:
         | That sounds like a good kind of bored and bug hunting through
         | half the unix sounds like fun too.
        
       | kej wrote:
       | I wonder if it would work to do something like the punycode
       | system for internationalized domain names. Shell scripts could
       | handle a name like `xn--0civ130n` just fine, and user-facing
       | utilities could choose to convert that to :sparkle::unicorn: when
       | appropriate. The same homograph protections would probably work,
       | as well.
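       | 
       | A rough sketch of that idea using the "punycode" codec that
       | ships with Python. The helper names are made up, and reusing
       | the "xn--" prefix for usernames is an assumption borrowed from
       | IDNA. Real IDNA also applies mapping and normalisation steps
       | that this sketch skips.
       | 
```python
PREFIX = "xn--"  # ACE prefix from IDNA; applying it to usernames is an assumption


def to_ascii_name(display_name: str) -> str:
    """Encode a Unicode name into an ASCII-only form via Punycode (RFC 3492)."""
    if display_name.isascii():
        return display_name
    return PREFIX + display_name.encode("punycode").decode("ascii")


def to_display_name(ascii_name: str) -> str:
    """Reverse to_ascii_name(); names without the prefix pass through as-is."""
    if ascii_name.startswith(PREFIX):
        return ascii_name[len(PREFIX):].encode("ascii").decode("punycode")
    return ascii_name


stored = to_ascii_name("\u2728\U0001F984")  # sparkles + unicorn
print(stored, stored.isascii())
print(to_display_name(stored) == "\u2728\U0001F984")
print(to_ascii_name("alice"))
```
       | 
       | One open question the sketch does not solve: an ordinary ASCII
       | username that happens to start with "xn--" would be ambiguous,
       | so such names would probably have to be reserved.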
        
       | dsr_ wrote:
       | I will remind everyone that there are a minimum of three
       | identifiers here.
       | 
       | The UID, which is an integer. Ownership resides here; it's the
       | primary key. Can be used by programs.
       | 
       | The username, each of which must be unique and maps to one UID --
       | but multiple usernames can map to the same UID. Used by humans
       | and programs to login.
       | 
       | The GECOS field, or "human readable name", which is only used as
       | a display label. Some systems include a structure inside this for
       | additional info like phone number, office number, or similar. I
       | don't think anyone would object to UTF-8 here.
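       | 
       | The three identifiers can be seen by splitting passwd(5)-style
       | lines. A small Python sketch follows; the sample entries are
       | invented, and PasswdEntry and parse_passwd_line are
       | illustrative names, not an existing API (the standard library's
       | pwd module exposes the same fields as pw_name, pw_uid and
       | pw_gecos on Unix).
       | 
```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    username: str  # login name; each one maps to exactly one UID
    uid: int       # numeric owner id, the "primary key"
    gecos: str     # free-form display field, often comma-separated


def parse_passwd_line(line: str) -> PasswdEntry:
    """Parse one passwd(5)-style line: name:password:UID:GID:GECOS:home:shell."""
    name, _pw, uid, _gid, gecos, _home, _shell = line.rstrip("\n").split(":")
    return PasswdEntry(name, int(uid), gecos)


# Made-up sample data: two usernames sharing UID 1000.
sample = [
    "jdoe:x:1000:1000:J\u00f6rg D\u00f6e,Room 12,555-0100:/home/jdoe:/bin/bash",
    "jd:x:1000:1000:J\u00f6rg D\u00f6e:/home/jdoe:/bin/bash",
]
entries = [parse_passwd_line(line) for line in sample]

for entry in entries:
    full_name = entry.gecos.split(",")[0]  # first GECOS sub-field
    print(entry.uid, entry.username, full_name)
```
       | 
       | Note that the non-ASCII text sits only in the GECOS field,
       | while the login names and the UID stay ASCII and numeric.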
        
       | seu wrote:
       | The fact that this whole discussion happens in English partially
       | explains why there is a discussion at all. The whole problem
       | could have been avoided if the development of computers had been
       | a more international effort.
        
       | seiferteric wrote:
       | OMG Can't believe this, I ran into this exact thing at my last
       | job. We discovered a security vuln in several of our services
       | because we were accepting unsanitized usernames. We were doing
       | things with them (passing them to scripts etc.), but only after
       | passing them to useradd/usermod etc., so we thought they were
       | safe. Of course you could put in things like ";", "&", ">" etc.
       | and do whatever you want. I discovered that Debian DISABLED the
       | username sanity checks and could not believe it. Anyway, I
       | installed a patched version as well as sanitized input and other
       | stuff to resolve the issue.
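       | 
       | For readers who have not seen this class of bug, a hedged
       | Python sketch of the difference between building a shell
       | string and passing an argument vector. The child process here
       | is just the Python interpreter echoing its argument; it stands
       | in for whatever script would receive the username and is not
       | taken from the services described above.
       | 
```python
import shlex
import subprocess
import sys

hostile = "bob;>/hacked"  # example name taken from this thread

# Unsafe pattern, shown only as a string and never executed here:
# a shell would treat ";" as a command separator and ">" as a redirect.
unsafe_cmd = "echo hello " + hostile

# Safer: pass an argument vector, so no shell ever parses the name.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # the child saw the name as one literal argument

# If a shell string is unavoidable, quote every untrusted piece.
print("echo hello " + shlex.quote(hostile))
```
       | 
       | Quoting and argument vectors cover the shell layer only;
       | validating the name at the point of entry is still worthwhile.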
        
       | IshKebab wrote:
       | > Most Debian users don't work with useradd, or groupadd,
       | directly. Instead, Debian has long supplied its own adduser (and
       | addgroup) utilities, originally written by founder Ian Murdock.
       | These act as simpler front ends to useradd
       | 
       | One of the dumbest things Debian has done.
        
       | rurban wrote:
       | They are so stupid, I cannot believe it!
       | 
       | Names are identifiers, and as such need to stay identifiable.
       | There exist Unicode security guidelines and rules for
       | identifiers, which they don't know about. My libu8ident library
       | would help with that.
        
       | UniverseHacker wrote:
       | Clearly we should open up usernames to be an unlimited size set
       | of mixed data types: e.g. the first "character" could be a hand
       | drawn picture of a cat, the second the entire text of the US
       | constitution in unicode, and so on. We could then extend this
       | flexibility to filenames, passwords, and Unix commands.
       | Internally, this could involve replacing all text strings with
       | folders on a filesystem where you can put any files you want in
       | any desired order. /s
        
       | nineteen999 wrote:
       | I have an affectionate place in my heart for Debian, the
       | community is passionate, they have wonderful ideals, hell I even
       | helped found a charity which distributes it on used PC's
       | discarded by large companies to disadvantaged people over 20
       | years ago which is still running today. It was my favourite
       | distro for a long time after I moved on from Slackware in the
       | late 90's, I used it at home, I used it in my job at a small ISP
       | on everything from x86 to Sun Sparc to DEC Alpha hardware. We are
       | lucky in the Linux community to have them. I couldn't care less
       | about derivatives like Ubuntu, which seem to be one step too far
       | removed.
       | 
       | But over the years the bikeshedding and some of the poor
       | technical decisions started to wear on me. The debconf approach
       | of asking a million questions on install bothered me. In my
       | current job we use it on small industrial ARM PC's and it does a
       | great job there at a large scale distributed over a wide variety
       | of environments and geographical area, scorching heat, freezing
       | cold and everything in between. But that's easy because it's a
       | single system image which we deploy to hundreds of devices and it
       | only requires minimal customisation to perform the required
       | tasks.
       | 
       | But our datacenter servers remain RHEL for the simple reason ...
       | the deployment and broad customisation process per server is
       | easy, LDAP integration is straightforward and the customer wants
       | to pay for support from the vendor even though we never use it.
       | Security updates and bugfixes are delivered quickly and the
       | vendor's commitment to stability is commendable. It's a no
       | brainer. More and more companies started to move their workloads
       | to RHEL once it came out and unfortunately it just didn't make
       | sense to bother with distributions outside of RHEL/Fedora for my
       | personal use anymore, some sort of work/life balance is needed
       | and I don't want to spend my personal computing time remembering
       | all the idiosyncrasies between different Linux distributions
       | anymore. I would argue that Debian is pretty idiosyncratic and
       | opinionated if you have come from more traditional UNIX systems
       | in the 90's, while RHEL/Fedora more closely model an "evolution"
       | of those classic systems if you like. It will be interesting to
       | see what happens to RHEL in the coming years as Red Hat becomes
       | more and more absorbed into the IBM environment.
        
         | finnthehuman wrote:
         | That's the reality of deploying professionally though. I've got
         | a soft spot for debian from using it for over 20 years too, but
         | choosing open source often means picking the vendor that
         | accommodates the use case. Many products have the enterprise
         | upsell of good LDAP/AD integration but that's just a nice to
         | have when you're really buying it for the ability to call someone
         | when shit goes sideways.
         | 
         | And when you don't need the support net, it's often gonna be
         | ubuntu because that's what most people are comfortable with. Or
         | yocto if you're shipping a custom OS. And containers are so
         | ephemeral and purpose specific it means distro doesn't matter
         | as much.
         | 
         | I'm still rooting for them the most. They're community based,
         | an important upstream, and stable has never done me dirty. It's
         | still my go-to for "I don't want to think hard about, or worry
         | about this system."
        
       | jcarrano wrote:
       | I don't get it. What's the purpose of changing the default rule
       | in shadow-utils? Not only is it completely unnecessary and
       | introduces risks for shell injections, it also risks introducing
       | incompatibilities between Debian and any other system.
       | 
       | I feel that there are already too many other things to fix to be
       | wasting time in creating new potential bugs.
        
       | thway15269037 wrote:
       | Before opening this can of worms, can we finally address that
       | there is a hard, hardcoded limit of 255 bytes per file name
       | (folder name) in Linux? Yeah, 255 bytes, that is, like 85
       | Japanese characters or 63 emojis, or maybe fewer. And in kernel,
       | too, so you physically cannot correct this issue by using another
       | filesystem or something.
       | 
       | Before anyone asks: yes, these folders do occur in real life, and
       | I'm tired of pretending that they do not.
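       | 
       | The arithmetic behind the byte limit, as a short Python
       | sketch. NAME_MAX = 255 is taken from the figure above, and
       | max_repeats and fits are made-up helper names. Kana and most
       | kanji take 3 bytes in UTF-8, while most emoji take 4 (more for
       | multi-code-point sequences).
       | 
```python
NAME_MAX = 255  # bytes per path component, the limit quoted in the thread


def max_repeats(ch: str, limit: int = NAME_MAX) -> int:
    """How many copies of `ch` fit in `limit` bytes when encoded as UTF-8."""
    return limit // len(ch.encode("utf-8"))


def fits(name: str, limit: int = NAME_MAX) -> bool:
    """True if `name` encodes to at most `limit` bytes of UTF-8."""
    return len(name.encode("utf-8")) <= limit


print(max_repeats("a"))           # 1 byte each
print(max_repeats("\u3042"))      # hiragana "a": 3 bytes each
print(max_repeats("\U0001F984"))  # unicorn emoji: 4 bytes each
print(fits("\u3042" * 85), fits("\u3042" * 86))
```
       | 
       | The limit counts encoded bytes, not characters, so the usable
       | length depends entirely on the script being written.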
        
       ___________________________________________________________________
       (page generated 2024-12-06 23:01 UTC)