[HN Gopher] Debian opens a can of username worms
___________________________________________________________________
Debian opens a can of username worms
Author : jwilk
Score : 158 points
Date : 2024-12-06 09:55 UTC (13 hours ago)
(HTM) web link (lwn.net)
| rini17 wrote:
| Perhaps it's time to agree upon how to Unicode in identifiers?
| The normalization, unprintable characters, confusing characters
| with same glyphs, etc. It's obviously problematic when everyone
| is doing it on their own.
| magicalhippo wrote:
| As long as I can enter my Zalgo[1] username, I'm fine with your
| suggestion.
|
| [1]: https://en.wikipedia.org/wiki/Zalgo_text
| m000 wrote:
| Good luck bringing everyone together. There's still a ton of
| Microsoft software that relies on the presence of the BOM [1],
| despite practically everyone else not using it. And
| bidirectional rsync between practically everything else and a
| Mac still requires `--iconv=utf-8,utf-8-mac` to avoid problems
| because of homographs.
|
| [1] https://en.wikipedia.org/wiki/Byte_order_mark
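The rsync/Mac mismatch described above is, as far as I know, a
normalization issue: the same visible name can be stored precomposed
(NFC) or decomposed (NFD). A minimal sketch with Python's standard
unicodedata module; the file name and the same_name helper are made up
for illustration:

```python
import unicodedata

# Hypothetical file name, built in both normalization forms.
nfc_name = unicodedata.normalize("NFC", "re\u0301sume\u0301.txt")  # precomposed e-acute
nfd_name = unicodedata.normalize("NFD", nfc_name)                  # e + combining acute

def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc_name == nfd_name)           # raw code point comparison
print(len(nfc_name), len(nfd_name))   # the two forms differ in length
print(same_name(nfc_name, nfd_name))  # comparison after normalization
```

Both strings render identically, which is why a tool that compares raw
bytes sees two different files.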
| bayindirh wrote:
| The first bar to clear is "The Turkish Test"[0], then we can
| talk about Unicode. It'll smooth the rest of the process a lot.
|
| You can't guess how many workarounds I implement to make sure
| that a stray application doesn't get "i" or "I" in their naive
| codepaths, and start burning mid-flight (e.g.: Kodi, Pagico,
| some old Java programs, oh my...).
|
| [0]: https://blog.codinghorror.com/whats-wrong-with-turkey/
| beardyw wrote:
| The date format part is ridiculous. Americans are almost
| unique in using mm/dd/yyyy, so an assumption of that would be
| plain wrong.
| bayindirh wrote:
| Localization libraries handle these parts well, since the date
| format is the same as in Europe (and dates are generally stored as
| date-time objects rather than pure strings). None of the number
| shenanigans cause problems since these numbers are always
| stored as IEEE754 or other decimal formats. Money is no
| problem as well.
|
| However, when you go through an upper() or lower() or
| anything which plays with capitalization, and if that data
| is being fed to a hash algorithm or anything which mucks
| with strings, boy, oh boy...
|
| The easiest way is to sanitize these programmatic parts
| with a forced locale of en_US or plain old "C". If the
| string is not user-facing and never localized, just
| force its locale. It's the only sane way.
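A rough sketch of the "force the C locale" idea for programmatic
strings. Python's str.lower() is not locale-sensitive, so it cannot
reproduce the Turkish-locale bug directly, but it still applies Unicode
case rules; the hypothetical ascii_lower helper below only ever touches
A-Z, which is roughly what the "C" locale does:

```python
def ascii_lower(s: str) -> str:
    """Lowercase only A-Z; leave every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE (Turkish dotted capital I).
dotted_capital_i = "\u0130"

print(len(dotted_capital_i.lower()))  # Unicode lowercasing yields two code points
print(ascii_lower("TITLE_ID"))        # only ASCII letters are mapped
print(ascii_lower("\u0130D") == "\u0130d")  # the non-ASCII letter is left alone
```

Feeding keys through something like ascii_lower before hashing keeps the
result independent of the user's locale.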
| a3w wrote:
| https://xkcd.com/1179/ I heard the US and A are moving to
| the hissing cat date format shown here.
| Muromec wrote:
| I kind of like the one using roman numerals for month.
| Reasonable people would figure out that other reasonable
| people would not use roman numerals for _days_ , so the
| order can be implicit. I like implicit ordering, it
| always makes things more interesting.
| bluGill wrote:
| I have switched to yyyymmdd for everything - it is usually
| obvious to everyone what date I mean.
| bayindirh wrote:
| I also use the same format while naming my files, or in
| changelogs or whatnot, but not all documents are suitable
| for that, and in the presentation layer you need to match
| the country standards.
|
| However, date is mostly presentation and internal storage
| of these are vastly different than what we see generally.
| bluGill wrote:
| I don't match country standards. That is the point.
| kelnos wrote:
| It depends on what you're doing, though. If you're
| helping people fill out documents (even non-government
| documents), then you really need to match the country
| standard.
|
| Localization is important; some countries outright
| require it if you're going to do business within their
| borders. But even where it's not required, you will lose
| customers if your website/application/product feels
| "foreign". I'm not sure date ordering is a big enough
| deal to trigger that feeling in anyone, but unless it's a
| huge burden to format things the way people expect, I
| would do so for the UX benefits.
| throw0101a wrote:
| > _Perhaps it's time to agree upon how to Unicode in
| identifiers?_
|
| And then update all data structures that refer to them (like
| _last_ and _w_ / _who_ , also NFS), as well as file formats
| (like _cpio_ , _tar_ , and _pax_ which encodes ownership).
| maccard wrote:
| Yes. Those formats have had 20 years since Unicode was
| standardised, and things like my terminal still routinely
| break when given "unexpected" inputs. Practically every other
| application can handle it.
| layer8 wrote:
| Unicode has provided a specification for Unicode identifiers
| since 2005: https://www.unicode.org/reports/tr31/
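Python's own identifier syntax is based on UAX #31 (via PEP 3131), so
str.isidentifier() gives a rough feel for what such a rule accepts.
This is only an illustration, not a username policy, and it does
nothing about confusables; looks_like_identifier is a made-up helper:

```python
import unicodedata

def looks_like_identifier(name: str) -> bool:
    """Hypothetical check: NFKC-normalize, then apply Python's identifier rule."""
    return unicodedata.normalize("NFKC", name).isidentifier()

candidates = ["user_name", "na\u00efve", "\u540d\u524d", "1abc", "a-b", "a b"]
for candidate in candidates:
    print(ascii(candidate), looks_like_identifier(candidate))
```

Letters from any script are accepted, while a leading digit, a hyphen,
or a space is rejected.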
| rini17 wrote:
| Great! Is there a library for their validation? ICU seems to
| have only spoof checker for confusables.
| rurban wrote:
| libu8ident
| secondcoming wrote:
| Would punycode be suitable?
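For reference, a small sketch of what Punycode does to a non-ASCII name,
using the "punycode" codec that ships with Python. Whether this is a
good fit for usernames is a separate question; the sample word is just
an example:

```python
# Encode a non-ASCII name to its ASCII-only Punycode form and back.
encoded = "b\u00fccher".encode("punycode")
decoded = encoded.decode("punycode")

print(encoded)                       # ASCII-only bytes
print(decoded == "b\u00fccher")      # decoding restores the original
```

The encoded form stays within letters, digits, and hyphen, but it is not
readable by the person who owns the name.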
| tiahura wrote:
| When you think about all the time, money and effort that have
| been wasted on Unicode...
| kalleboo wrote:
| Yeah we should have all just stuck to Shift-JIS
| Joker_vD wrote:
| Vseki triabva da izpolzva latinitsa, absoliutno s'm s'glasen.
|
| After all, it's objectively the most perfect set of characters
| for any reasonable human language.
| febusravenga wrote:
| Random cross-language-script observation.
|
| In Bulgarian, latinitsa ("latin alphabet") transliterated to
| latin alphabet is just "latinitsa" or "latinica".
|
| In Polish "cyrillic" is "cyrylica" - basically reverse.
| pjc50 wrote:
| What's your preferred solution for representing the CJK
| languages?
| tiahura wrote:
| Computing did pretty well in the prior 50 years.
| pjc50 wrote:
| That's not an answer. Be specific. How do you want to
| represent the 97k CJK characters?
| vman81 wrote:
| I really don't want to be snarky or sarcastic, so I'll
| just be plain. Many people are unwilling or unable to
| understand a problem that doesn't affect them directly.
| Like - "UTF is woke" kind of people. They are out there.
| CorrectHorseBat wrote:
| Not for the majority of the world population who doesn't
| know English
| jcranmer wrote:
| I still remember the days when I couldn't use p and e in
| the same document, because there was no codepage that
| contained both of them. I also remember the days when
| pretty much any website that had non-English text had to
| have instructions on it for how to view it properly,
| because mojibake was so bloody common.
|
| (It should also tell you something that not only is there a
| name for "computers failed at charsets", but the name is
| Japanese.)
| umanwizard wrote:
| Only if you could expect a given person to only ever deal
| with one language. Anything international sucked and was a
| much bigger pain than now.
|
| It would be impossible to e.g. build a site like Reddit
| where people can comment in any language.
| vman81 wrote:
| Computing has improved massively over the last 50 years,
| not least because it now can accommodate people's diverse
| languages.
| kryptiskt wrote:
| No, it didn't. There were all kinds of encodings out there,
| and dealing with code pages was way worse than any
| inconveniences that Unicode has brought. Unicode was
| created for a reason, not just to torture US programmers
| with the diversity of scripts in the world.
|
| Maybe it was nice if you worked for a US company without
| any operations abroad, which includes absolutely none of
| those which mattered.
| account42 wrote:
| You still need to deal with "codepages" to differentiate
| between Japanese Unicode and Chinese Unicode even if it's
| called a language and not codepage now.
| CorrectHorseBat wrote:
| Han unification sucks indeed but if you get the wrong
| font it's still readable
| dotancohen wrote:
| Only if your name isn't Dong Jiu Er Gong Ren Yan Wang .
| throw0101a wrote:
| > _Computing did pretty well in the prior 50 years._
|
| Contra:
|
| * https://stackoverflow.com/questions/25812790/wrong-
| character...
| Muromec wrote:
| I had to, in the year of our lord 2024, deal with a certain
| non-unicode system that ate one specific Cyrillic symbol
| when producing an open data artifact mandated by law. It
| was never fun then and it still manages to create
| problems.
| account42 wrote:
| Something that doesn't unify different characters. So not
| Unicode.
| Cthulhu_ wrote:
| What alternative do you propose? I mean personally I think that
| emoji don't belong in unicode, but at the same time it's been
| integrated into society for many years now and it's made
| communications platforms so much more streamlined.
|
| But how else would you represent non-latin characters? More
| character sets?
| a3w wrote:
| > emoji don't belong in unicode
|
| Well, they are defined as: "an intermediate technology until
| we find a way to transfer images over data connections."
|
| So it was always a technology that was 40 years too late to
| the party?
| layer8 wrote:
| Without it, all textual data would need its own charset header,
| and you couldn't freely copy & paste between pieces of text
| with different charsets without creating mangled garbage. This
| was the situation before Unicode (except that charsets were
| often only implicit, so you had to guess which it is).
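The "mangled garbage" case is easy to reproduce. A minimal sketch,
assuming text written as UTF-8 and then read by something that guesses
Latin-1; the sample string is arbitrary:

```python
original = "na\u00efve caf\u00e9"

raw = original.encode("utf-8")    # bytes as written by a UTF-8 program
misread = raw.decode("latin-1")   # same bytes, decoded with the wrong charset

print(ascii(misread))                   # each accented letter became two characters
print(raw.decode("utf-8") == original)  # the right charset round-trips cleanly
```

Without a charset label travelling with the bytes, the reader has no
reliable way to know which of the two decodings was intended.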
| card_zero wrote:
| > naming things is one of the hard things to do in computer
| science
|
| I've been thinking about that a lot lately. Code is text, it's
| arranged linearly, code has to be readable, identifiers are thus
| short strings that try to express short essays about the purpose
| of the variable or whatever it is, and then ideally there's a
| longer version of the essay in a comment, but not too long
| because that would clutter up the code as well (because it's
| text, arranged linearly). And we have code folding to tidy them
| up, for what good it does, and ideally an even longer version of
| the essay in documentation except nobody writes that.
|
| What if it wasn't text, and wasn't linear, and we didn't have an
| expectation that code should be strings of stupid over-terse
| names and hieroglyphic symbols? So I was thinking vaguely about
| investigating graphic-based programming, but it's probably worse,
| IDK. It could automatically assign arbitrary icons* instead of
| identifiers, and you could write tooltip-like comments to
| describe them as and when you want to, and everything could be
| laid out nicely with diagrams and different pages instead of like
| a text file. I suppose this is all merely cosmetic? The thing
| with the insistence on code being _written_ as strings of text
| feels very primitive, is all. It causes this problem.
|
| * Which doesn't solve the problem, I admit, because now you have
| to remember what the icons mean, but maybe that's easier?
| jstanley wrote:
| I don't think remembering the meaning of icons is easier,
| because in order to think about it you have to be able to
| pronounce it inside your head.
|
| And code isn't just linear, it can be spread across multiple
| files in a directory tree, functions can call each other, etc.
| c22 wrote:
| _> in order to think about it you have to be able to
| pronounce it inside your head._
|
| I'm not sure this is universal.
| vidarh wrote:
| Indeed, some people do not even have an inner voice, the
| same way some of us don't "see" things in our mind's eye.
| Neither prevents you from thinking about words or visual
| objects.
| pjc50 wrote:
| > I was thinking vaguely about investigating graphic-based
| programming, but it's probably worse, IDK. It could
| automatically assign arbitrary icons* instead of identifiers,
| and you could write tooltip-like comments to describe them as
| and when you want to, and everything could be laid out nicely
| with diagrams and different pages instead of like a text file.
|
| Have you ever read large electronic schematics? That's
| basically it .. except all the important things have to be
| identified by text anyway, because it's a massive challenge to
| the imagination to come up with two hundred different
| pictograms.
|
| Of course, if you really want your identifiers to be
| pictograms, why not just use kanji for your identifiers? The
| Japanese language and Unicode provide tens of thousands of
| ready made pictograms for your convenience!
|
| The only nonlinear programming environments that have really
| worked are the spreadsheet (which is still linear within each
| cell) and Labview. Possible shoutout to Unity blueprints, but
| when those get too complicated spaghetti .. people rewrite them
| in linear text code.
| card_zero wrote:
| _Sigh_
|
| I guess you're right. This has been a dimly-felt wish of mine
| for some 25 years, but probably pie in the sky.
|
| Edit: I see there are a _lot_ of visual programming
| languages.
|
| https://en.wikipedia.org/wiki/Visual_programming_language
| 9dev wrote:
| I don't think that has to be the answer, though. We can
| probably all agree that plaintext code is not the best form
| to represent the schematics of a process, and neither are
| images. But that seems to be a very limited set of options,
| and I wonder if there aren't any other dimensions to
| express what is essentially persisted chains of reasoning.
| For an example of alternative modes of input, have a look
| at the Reactable, a pretty innovative way to compose music.
| Sadly I think they didn't disrupt the music industry as
| they should have, but it's a pretty good example of a new
| way to think about making sounds.
|
| Edit: forgot the link. Here is: http://reactable.com
| WillAdams wrote:
| Longer than that --- I would argue it goes back to Hermann
| Hesse's _The Glass Bead Game_ (originally published as
| Magister Ludi) --- but Hesse seems to have gone out of
| style.
|
| That said, I keep trying various ones, and will keep hoping
| that someday someone will make a graphical tool able to
| make a GUI program.
|
| Nodezator seems promising.
| auxym wrote:
| > Have you ever read large electronic schematics? That's
| basically it .. except all the important things have to be
| identified by text anyway, because it's a massive challenge
| to the imagination to come up with two hundred different
| pictograms.
|
| As a mechanical engineer who works with Labview and Simulink,
| as well as more conventional code (python mostly), that is
| indeed a very good description. First glance at a large
| labview program feels very much like first glance at a large
| and complex electronics schematic. Lots of wire everywhere
| and you're not even sure where to start.
|
| I think a nice "best of both worlds" approach is a graphical
| "high level" view which shows the flow of data, at least for
| "data transformation" kind of programs, and code for the low
| level logic (what actually happens in the blocks). Sort of
| like nodal editors in Blender and NLE apps. Fortunately
| Simulink makes it easy to drop in a Matlab function call,
| Labview not so much (need to get into C FFI or use a really
| old version of .net or something).
|
| The thought I have about spreadsheets (might have read that
| on here), is that spreadsheets make the data visible and hide
| the code. Text-based programming hides the data but shows the
| code. I'm not sure what something that makes both code and
| data first class and visible would look like, but I'd be
| curious for sure (for engineering type applications at
| least). Best I've found so far (and what I actually use for a lot
| of data processing tasks) is a Jupyter notebook making
| plentiful use of df.head() and df.plot().
| umanwizard wrote:
| It's odd to say those characters come from the Japanese
| language when they were invented in China to write Chinese,
| are still used for that purpose, and were only introduced to
| Japan 2000 years later.
| taneq wrote:
| > The only nonlinear programming environments that have
| really worked are the spreadsheet (which is still linear
| within each cell) and Labview. Possible shoutout to Unity
| blueprints, but when those get too complicated spaghetti ..
| people rewrite them in linear text code.
|
| Not 100% sure what you mean by 'nonlinear' here (flow
| control?) but almost all industrial and mining equipment is
| programmed in visual languages on PLCs. Ladder Logic looks
| like, well, a stylized electrical drawing of a bunch of
| relays wired up to perform logical operations. Function Block
| Diagram looks like a PCB layout, but the 'integrated
| circuits' are function blocks (basically functors) and the
| 'traces' are copying data between the function
| blocks. Not great for implementing hardcore algorithms but
| you can do a surprising amount with them (once you get used
| to coding with both hands tied behind your back) and they
| sure are accessible to people who otherwise wouldn't be
| programming.
|
| Of course, as you say, when things get genuinely complicated,
| it's much nicer to use a 'real' programming language (or even
| just Structured Text, which is pretty much just Pascal).
|
| Then again, even with electronics, once things get complex
| enough don't we start using text (eg. VHDL)? Expressing
| designs is always a tradeoff between simplicity and
| 'obviousness' on the one hand, and representational
| efficiency on the other. Structured text sits right in the
| sweet spot between the two.
| jcranmer wrote:
| Graphical programming is one of those things that's often
| suggested as an improvement on textual programming, and just
| about every implementation tends to disappoint. I know, when
| working on compilers, that nearly every time I go "I think I
| want to see the CFG as a graph here," I tend to realize no,
| that's not quite what I wanted. For a complex function, the
| surprising superpower is just to have an editor that shows the
| opening brace line of every currently-open brace.
|
| Another case in point: when was the last time you saw someone
| use a flowchart to describe the pseudocode of an algorithm, as
| opposed to writing, er, pseudocode? Flowcharts used to be the
| dominant way to do this, decades ago, but they seem to me to
| have been thoroughly supplanted by pseudocode...
| WillAdams wrote:
| I think the problem here is that there isn't an agreed-upon
| answer for the question:
|
| >What does an algorithm look like?
|
| And any effort to answer it which gets beyond the size of a
| single diagram/screen/page/poster becomes a problem like:
|
| https://blueprintsfromhell.tumblr.com/
|
| https://scriptsofanotherdimension.tumblr.com/
|
| I like to think of myself as a visual person, and I wish
| there was a good solution here, and I keep looking for and
| trying different solutions other folks have made (current two
| iterations are BlockCAD and OpenSCAD Graph Editor) --- I'd be
| glad of other suggestions, esp. if able to make graphic user
| interfaces more complex than the OpenSCAD Customizer.
| card_zero wrote:
| Argh! Wire-wrapped backplanes! That wasn't the fantasy at
| all!
| WillAdams wrote:
| Yes, the fantasy is something like Hermann Hesse's _The
| Glass Bead Game_ which I mentioned elsethread --- what is
| the closest available tool to that?
|
| How do such tools manage the problem of
| encapsulation/modularity becoming the "wall of text"
| which one is trying to escape, just a pretty wall w/ all
| the labels in boxes decorated/connected w/ lines?
| AlienRobot wrote:
| The difficulty in naming things is that you're trying to encode
| semantics and an interface contract in a name. If you give up
| doing that, it's easy.
|
| For example, say you have getFoo(). It's clear it gets the foo.
| But later you introduce getFooAsync(). Suddenly it's no longer
| clear whether getFoo() is sync or async, because you didn't
| call it getFooSync().
|
| If instead you used names like getFoo1, getFoo2, getFoo3, etc.,
| the semantics you're providing is that there are multiple
| "ways" to getFoo without making promises (a contract) about
| what the function actually does in its name.
|
| Although this sounds like bad naming practices (it is), it
| effectively solves the naming problem. Apply this to CSS, and
| instead of .red-button or .secondary-button, you get .button1,
| .button2, .button3, and you just don't have to think about WHY
| are you creating a button to give it a class and start styling
| it.
| card_zero wrote:
| Yep, that sort of thing happens _constantly._ Things get
| misleading names because the first three alternatives I came
| up with were also misleading. So I agree, and indeed I
| considered a foo bar baz scheme instead of icons, same
| difference. Then you have to look somewhere else for what the
| thing does. Self-documenting code doesn't really work, and
| strict naming schemes are long-winded and worse than ad-
| libbing it, so it would have to be comments, but then the
| comments get forgotten and no longer reflect the code. I give
| up, might take up woodwork instead.
| mmsc wrote:
| I wonder how this will affect ssh. OpenSSH recently restricted
| more characters for valid usernames:
| https://github.com/openssh/openssh-portable/commit/7ef3787c8...
| cedws wrote:
| This is a great example of how one poor decision, or one piece
| of code that is too liberal cascades into an avalanche of
| shitty workarounds.
| throw0101a wrote:
| It should be noted that shell metacharacters are also not
| allowed under POSIX:
|
| *
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
| A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b
| c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3
| 4 5 6 7 8 9 . _ -
|
| *
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
|
| (Hyphen forbidden as first character.)
| linuxftw wrote:
| I think it will be fine. Everyone will quickly learn the lesson
| "Use something other than ASCII letters and numbers at your own
| peril."
|
| Similar to people who put spaces in file names, it should be a
| fire-able offense.
| lexicality wrote:
| any software that can't handle spaces in filenames is broken
| Muromec wrote:
| All of the software is broken (including security wise) all
| the time anyway.
| bdangubic wrote:
| this is exactly right... I spoke a few years ago with a
| mate who is a software dev at one of the major car
| companies... since then I wouldn't sit in the car from
| that company if my life depended on it...
|
| then I thought - if I spoke to any dev in any industry I
| would also stop doing whatever their software is
| controlling and end up moving to live with amish or some
| wilderness without electricity
| hiccuphippo wrote:
| Was that the fireable offense? I always thought the offense
| was not putting quotes around filenames in scripts.
| dfranke wrote:
| Allowing purely numeric usernames seems like a terrible idea to
| me, because it creates ambiguity between what's a username and
| what's a UID. It's common for tools like ls or ps to display a
| username when one is found and fall back to displaying a UID if
| it isn't, and similarly tools like chown will accept either a UID
| or a username and disambiguate based on whether it's numeric or
| not. Now suppose there's a numeric username that doesn't match
| its own UID, but does match some other user's UID. It doesn't
| take a lot of imagination to see how this would lead to
| vulnerabilities.
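A toy model of the ambiguity, assuming a tool that tries the argument as
a user name first and only falls back to treating it as a numeric UID.
The USERS table and resolve_owner are invented for illustration and are
not any real tool's code:

```python
# Toy user database: name -> UID. "1000" is a user whose own UID is 1001.
USERS = {"alice": 1000, "1000": 1001}

def resolve_owner(spec: str) -> int:
    """Name lookup first, numeric fallback second."""
    if spec in USERS:
        return USERS[spec]
    if spec.isascii() and spec.isdigit():
        return int(spec)
    raise KeyError(f"unknown user: {spec}")

print(resolve_owner("alice"))  # alice's UID
print(resolve_owner("1000"))   # the user *named* 1000, not alice
print(resolve_owner("1002"))   # no such name, so treated as a raw UID
```

A script that passes "1000" meaning alice's UID silently ends up
targeting a different account.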
| throw0101a wrote:
| Talk to POSIX:
|
| > _A string that is used to identify a user; see also User
| Database. To be portable across systems conforming to
| POSIX.1-2017, the value is composed of characters from the
| portable filename character set. The <hyphen-minus> character
| should not be used as the first character of a portable user
| name._
|
| *
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
|
| The "portable filename character set" is defined as:
| A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b
| c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3
| 4 5 6 7 8 9 . _ -
|
| *
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
|
| So only a hyphen as the first character is forbidden.
|
| Given that you can't necessarily control where usernames come
| from (e.g., LDAP lookups), properly speaking your system has to
| handle everything anyway, even if you don't allow local
| creation.
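The quoted rule is small enough to write down as a check. A minimal
sketch, assuming only the two constraints quoted above (portable
filename character set, no leading hyphen); is_portable_username is a
made-up name:

```python
import re

# Portable filename character set, with hyphen excluded from the first position.
PORTABLE_NAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")

def is_portable_username(name: str) -> bool:
    """True if the whole name fits the portable character set rule."""
    return PORTABLE_NAME.fullmatch(name) is not None

for name in ["alice", "1000", "-rf", "a;b", "j\u00f6rg", ""]:
    print(ascii(name), is_portable_username(name))
```

Note that a purely numeric name such as "1000" passes, which is exactly
the case being argued about.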
| dfranke wrote:
| Yes, I'm aware, and POSIX has many such bugs that make
| command input or output unavoidably ambiguous if certain
| unexpected characters are present that they didn't think to
| prohibit. A lot of the revisions that went into POSIX 2024
| were aimed at fixing some of these, such as standardizing
| find -print0 and xargs -0. The fact that this one got
| overlooked doesn't mean it's a good idea to make the
| situation worse and harder for future POSIX revisions to
| address.
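Why NUL-delimited output (as with find -print0 and xargs -0) removes the
ambiguity, sketched with made-up file names, one of which contains a
newline:

```python
names = ["report.txt", "two\nlines.txt", "with space.txt"]

newline_stream = "\n".join(names)  # one name per line
nul_stream = "\0".join(names)      # NUL between names

# The embedded newline makes one name look like two.
print(newline_stream.split("\n") == names)
# NUL cannot appear inside a file name, so splitting is unambiguous.
print(nul_stream.split("\0") == names)
```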
| bluGill wrote:
| It is time for POSIX to get with the times. Computers are
| used in more than the US and Canada (for the most generous
| interpretation of American in ASCII I'm including Canada,
| their French speakers will not be happy with that, not to
| mention first nations of which I know nothing but imagine
| their written language needs more than ASCII). UTF8 has been
| standard for decades now, just state that as of POSIX 2025
| all of UTF8 is allowed in all string contexts unless there is
| a specific list of exception characters for that context
| (that is they never do a list of allowed characters). They
| probably need to standardize on utf8 normalization functions
| and when they must be used in string comparisons. Probably
| also need some requirement that an alternate utf8 character
| entry scheme exist on all keyboards.
|
| The above is a lot of work and will probably take more than a
| year to put into the standard, much less implement, but
| anything less is just user hostile. Sometimes committees
| need to lead from the front not just write down existing
| practice.
| chikere232 wrote:
| Sounds like lots of work and a lot of new bugs for no real
| value.
| throw0101a wrote:
| > _It is time for POSIX to get with the times._
|
| "Be the change that you wish to see in the world." --
| Mahatma Gandhi
|
| It's free to join:
|
| * https://www.opengroup.org/austin/lists.html
|
| * https://www.opengroup.org/austin/
| atoav wrote:
| Sure, go ahead. Write the PR and make sure to test against
| all other things used in production.
|
| Let's talk again in 30 years when you're done.
| jerf wrote:
| Oh, it's been closer to 20 years for the rest of the
| world to catch up to Unicode than 30. We aren't at
| "perfect" now but we're certainly down to the trickier
| corner cases that are difficult to even see how you solve
| the problems at all, let alone code the solutions, and
| that's just reality's ugly nose sticking in to our
| pristine world of numbers.
|
| But there really isn't any other solution. Yes, there
| will be an uncomfortable transition. Yes, it blows. But
| there isn't any other solution that is going to work
| other than _deal with it_ and take the hits as they come.
| The software needs to be updated. The presumption that
| usernames are from some 7-bit ASCII subset is simply
| unreasonable. We'll be chasing bugs with these features
| for years. But that's not some sort of optional aspect
| that we can somehow work around. It's just what is coming
| down the pike. Better to grasp the nettle firmly [1] than
| shy away from it.
|
| At least this transition can learn a lot from previous
| transitions, e.g., I would mandate something like NFKC
| normalization applied at the operating system level on
| the way in for API calls:
| https://en.wikipedia.org/wiki/Unicode_equivalence Unicode
| case folding decisions can also be made at that point.
| The point here not being these specific suggestions per
| se, but that previous efforts have already created a
| world where I can reference these problems and solutions
| with specific existing terminology and standards, rather
| than being the bleeding-edge code that is figuring this
| all out for the first time.
|
| [1]: https://www.phrases.org.uk/meanings/grasp-the-nettle.html
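One way the "normalize on the way in" idea could look, as a sketch only:
NFKC, then case folding, then NFKC again in case folding produced
something that normalizes further. canonical_username is a hypothetical
name, and a real policy would need more than this (confusables, for a
start):

```python
import unicodedata

def canonical_username(name: str) -> str:
    """Hypothetical canonical form: NFKC, then case folding, then NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)

# Fullwidth letters, a sharp s, and a decomposed accent, respectively.
for raw in ["\uff21\uff44\uff4d\uff49\uff4e", "Stra\u00dfe", "Jose\u0301"]:
    print(ascii(raw), "->", ascii(canonical_username(raw)))
```

Comparing canonical forms rather than raw input means "Admin" typed with
fullwidth letters cannot coexist with a plain "admin".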
| somat wrote:
| I would say it is not the place of posix to prescribe how
| it should be, the job of posix is describe what it is, a
| common operating environment. this is why posix is such a
| mess and why I feel it is not a big deal to deviate from
| posix, however posix fills an important role in getting
| everyone on the same page for interoperability.
|
| In my opinion the way to improve this, is bottom up, not
| top down. Start with linux (these days posix is largely
| "what does linux do?"), get a patch in that changes the
| definition of the user name from a subset of ascii to a
| subset of utf-8. what subset? that is a much harder problem
| with utf-8 than ascii, good luck. get a similar patch in
| for a few of the bsd. then you tell posix what the os's are
| doing. and fight to get it included.
|
| On the subject of what unicode subset. perhaps the most
| enlightened thing to do is the same as the unix filesystem
| and punt. one neat thing about the unix filesystem is that
| names are not defined in an encoding but as a set of bytes.
| This has problems and has made many people very mad. but it
| does mean your file system can be in whatever encoding you
| want, transitioning to utf-8 was easy(mainly due to the
| clever backwards compatible nature of utf-8) and we were
| not locked into a problematic encoding like on windows.
| perhaps just define that the name is a array of bytes and
| call it a day. that sounds like the unix way to me.
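The "names are just bytes" model has a visible consequence for any
language with Unicode strings. A small sketch of how Python copes, using
the surrogateescape error handler (which, as I understand it, is also
what os.fsdecode relies on under POSIX); the byte string is a made-up
Latin-1 file name:

```python
raw_name = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Undecodable bytes are smuggled through as lone surrogate code points.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(ascii(as_text))
print(round_tripped == raw_name)  # the original bytes survive the round trip
```

The name can be carried around and handed back to the kernel intact,
even though it cannot be displayed meaningfully as UTF-8.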
| tssva wrote:
| "however posix fills an important role in getting
| everyone on the same page for interoperability."
|
| Isn't that exactly what the posix username rules are
| doing? Specifying a set of characters which are portable
| across systems to allow for interoperability between
| current and legacy unix systems along with most non-unix
| systems.
|
| "Start with linux"
|
| Which linux? Debian/Ubuntu, Redhat/Fedora, shadow-utils,
| and systemd all differ.
|
| "get a patch in that changes the definition of the user
| name from a subset of ascii to a subset of utf-8"
|
| ASCII is a subset of UTF-8 so the POSIX definition
| already specifies a subset of UTF-8.
| PhilipRoman wrote:
| Some practical concerns I have with UTF-8 are similar (or
| even the same, depending on font) characters which can be
| used in malicious ways (think package names, URLs, etc),
| not to even mention RTL text and other control characters.
| Every time I add logging code, I make sure that any
| "interesting" characters are unambiguously escaped or
| otherwise signaled out-of-band. Having English as an
| international writing standard is perfectly fine and I say
| that as a non-native speaker with a non-ascii name.
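A sketch of the kind of log escaping described above, using Python's
built-in unicode_escape codec so that anything outside printable ASCII
shows up as a visible escape. escape_for_log is a hypothetical helper
and the sample strings are invented:

```python
def escape_for_log(value: str) -> str:
    """Render every non-ASCII or control character as a visible escape."""
    return value.encode("unicode_escape").decode("ascii")

print(escape_for_log("p\u0430ypal"))       # Cyrillic letter among Latin ones
print(escape_for_log("doc\u202etxt.exe"))  # right-to-left override
print(escape_for_log("line1\nline2"))      # embedded newline
```

Backslashes in the input are themselves escaped by the codec, so the
logged form stays unambiguous.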
| abdullahkhalids wrote:
| A good chunk of the world does not speak english or latin
| character based languages. They should be able to
| interact with computers completely in their own languages
| and alphabet sets, even if those are written right-to-
| left or top-to-bottom.
|
| Of course, someone has to do the work to make this
| possible. And no one is obliged to do it. But to suggest
| that, such work should not be done at all, does not sit
| right.
| hnthrowaway6543 wrote:
| > A good chunk of the world does not speak english or
| latin character based languages.
|
| nearly everyone in a first world country knows the
| English alphabet though. a vast majority of the
| developing world as well. just look at street view on
| Google maps in any country, there's going to be a ton of
| street signs using English characters, even in non-
| touristy areas.
|
| > They should be able to interact with computers
| completely in their own languages and alphabet sets, even
| if those are written right-to-left or top-to-bottom.
|
| if you're a typical android/ios end user you're
| interacting with a computer in your native language
| anyway. this discussion only applies to low level power
| users.
|
| in that case: why? these aren't user-facing features.
| this is like saying that people should be able to use
| symbols native to their language rather than greek
| letters when writing math papers.
|
| it might not be "fair" that English is overrepresented in
| computing but it also hasn't demonstrably been a barrier
| to entry. Japan, Korea and China have dominated,
| particularly in hardware.
|
| if you think it should be fixed why stop at usernames?
| why represent uids with 1234 instead of 一二三四?
| abdullahkhalids wrote:
| > if you're a typical android/ios end user you're
| interacting with a computer in your native language
| anyway. this discussion only applies to low level power
| users.
|
| I don't think you realize how poor this experience is.
| Partly the reason being that the underlying system is so
| english focused, that app developers have to do so much
| work to get things working.
|
| > if you think it should be fixed why stop at usernames?
| why represent uids with 1234 instead of 一二三四?
|
| I mean, if the computers had first been built in south
| east asia, they would have been.
| hnthrowaway6543 wrote:
| it's certainly hard to localize everything but billions
| of people use ios/android in India, China, SEA, MENA,
| etc... i think it's fair to say that at the end user
| level, computers are in fact usable by non-English
| speakers.
|
| individual apps may not be as usable, but that's on the
| developers. good counter-example, a lot of japanese
| games, even made within the past 5 years, require setting
| the Windows system locale to Japanese to function
| properly. and as someone who played a fair number of
| japanese doujin games in the 00s/10s, it used to be
| _every_ game with this problem.
|
| > I mean, if the computers had first been built in south
| east asia, they would have been.
|
| debatable as CJK heavily use Arabic numerals everywhere,
| but even if they did, so what? you'd learn those symbols
| and get used to it. the same way that if you're a unix
| sysadmin you get used to only being able to use a small
| subset of ASCII characters for usernames.
| Muromec wrote:
| Oh no please, I don't want to have my linux username in
| Cyrillic. Thanks but no, thanks!
|
| I know enough linux to see 10 ways in which it will make
| things worse at some point.
| miki123211 wrote:
| > Computers are used in more than the US and Canada
|
| Even if you speak US (or Canadian) English exclusively,
| there are still some words that are just impossible to
| spell correctly in pure ASCII, e.g. résumé, café etc.
| drdeca wrote:
| "correctly". I don't consider it "incorrect" English when
| someone writes "cafe" or "resume". It seems to me a
| little bit pedantic to insist that those words must have
| the accent marks in order to be correct (when using them
| in English).
| sneak wrote:
| Yeah, loanwords are different words than the original
| word.
|
| The correct plural of "baby" in German is "babys".
| rurban wrote:
| Almost nobody supports string search and comparison API
| functions for unicode. The unicode security tables for
| unicode identifiers are hopelessly broken.
|
| Not even the simplest tools, like grep, support Unicode yet.
| This hasn't happened in the last 15 years, even though there
| are patches and libs.
| macintux wrote:
| At the meatspace level, purely numeric usernames are
| problematic.
|
| I was working as a contractor at a Fortune 500 firm several
| years ago when they introduced a new ERP system which
| apparently encouraged the company to switch to numeric system
| IDs. Fortunately the technical teams, especially Linux support,
| objected and it was overruled, but I was just as worried about
| the communications problems that would result.
|
| When everyone has a system ID that matches a consistent
| pattern, like "YZ12345", IDs are easy to recognize in
| documentation and data. An ID like "1234567" could be
| practically anything.
| PhilipRoman wrote:
| I really like the concept of adding some redundancy to ids,
| like a prefix. It helps to disambiguate things (kind of like
| static typing). A good example is also bank numbers, which
| must be a multiple of 97 +1, enabling fast client-side
| validation against typos.
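| The "multiple of 97 + 1" rule reads like the IBAN check, where the
| rearranged number must leave remainder 1 when divided by 97. A
| sketch under that assumption (the function name is hypothetical,
| and real validation would also check the per-country length):

```python
def iban_checksum_ok(iban: str) -> bool:
    """Check only the mod-97 checksum of an IBAN-style identifier."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Digits stay as they are; letters become 10..35 (A=10, ..., Z=35).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))  # commonly cited example
print(iban_checksum_ok("GB82 WEST 1234 5698 7654 33"))  # last digit mistyped
```

| Because 97 is prime, changing any single digit always changes the
| remainder, which is what makes the check useful against typos.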
| hulitu wrote:
| > Allowing purely numeric usernames seems like a terrible idea
| to me
|
| "I'm not a number, i am a free man. Ha ha ha ha ha"
| kps wrote:
| "Who is UID 0?"
|
| "You are UID 6."
| thephyber wrote:
| I am also worried about more subtle bugs caused by usernames
| that are not strictly only-numeric, such as "10e2" or
| "0xDEADBEEF".
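| To illustrate the worry, here is a small sketch that asks which of
| Python's own number parsers accept a given username. The helper
| name looks_numeric is hypothetical, and other languages and tools
| will draw the line in different places.

```python
def looks_numeric(name: str) -> list:
    """List the standard Python parsers that accept name as a number."""
    parsers = (
        ("int base 10", int),
        ("int with prefix", lambda s: int(s, 0)),  # honours 0x, 0o, 0b
        ("float", float),
    )
    accepted = []
    for label, parse in parsers:
        try:
            parse(name)
        except ValueError:
            continue
        accepted.append(label)
    return accepted


for candidate in ("1234", "10e2", "0xDEADBEEF", "inf", "alice"):
    print(candidate, looks_numeric(candidate))
```

| Note that even an all-letter name such as "inf" is accepted by
| float(), so "contains a letter" is not a sufficient test.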
| Ferret7446 wrote:
| It shouldn't be a problem as long as the system disallows a
| numeric username to be the same as an existing UID (excepting
| the case where the matching UID is assigned to said username).
| huhtenberg wrote:
| Sounds like a solution in search of a problem.
|
| And a disruptive solution with unclear side effects at that.
| johnisgood wrote:
| > If a keyboard input system provides the former sequence of
| bytes, but the username is stored in the login infrastructure
| using the latter sequence of [bytes], then a naive comparison
| will not find the user "émollier" in the system. Unicode defines
| in Annex 15 a few normalization forms as a way to work around
| this problem. But a correct use of these normalization forms
| still requires coordination and standardization among all
| programs accessing the data.
|
| ICU could work, but adds an extra dependency, there is also GNU's
| libunistring.
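| A minimal sketch of the comparison problem from the quote, using
| Python's standard unicodedata module as a stand-in for ICU or
| libunistring. The function name and the choice of NFC are
| assumptions; the quote's point is that every program touching the
| data would have to agree on the same form.

```python
import unicodedata


def same_username(typed: str, stored: str) -> bool:
    """Compare two usernames after normalizing both to NFC."""
    return unicodedata.normalize("NFC", typed) == unicodedata.normalize("NFC", stored)


typed = "e\u0301mollier"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
stored = "\u00e9mollier"   # precomposed U+00E9

print(typed == stored)               # naive comparison of code points
print(same_username(typed, stored))  # comparison after normalization
```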
| resource_waste wrote:
| This is important because Debian-family is used on many servers?
|
| Debian seems to just squander resources on things a few powerful
| people care about.
|
| All my servers have been Debian-based, so I can't be too hard on
| them, but whenever I see someone recommend a Debian-family distro
| as a Desktop OS, I feel like I need to call the police.
| perlgeek wrote:
| Just imagine how many poorly-written shell scripts will break
| when we suddenly allow dollars, quotes, backticks and the likes
| in usernames. Heck, even allowing spaces sounds like horror to me.
|
| On the display side, I'm sure most tools that display usernames
| won't make it easy to see if there are leading or trailing
| whitespace characters, double blanks, tabs etc in usernames.
|
| This sounds like support hell to me.
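| A sketch of the failure mode, written in Python rather than shell
| so it can be run safely. The command and the path /srv/data are
| hypothetical, and shlex.split only approximates how a POSIX shell
| splits words; nothing here is executed.

```python
import shlex


def naive_command(username: str) -> str:
    """Build a shell command by pasting the username in unquoted."""
    return f"chown {username} /srv/data"


def quoted_command(username: str) -> str:
    """Build the same command with the username quoted for the shell."""
    return f"chown {shlex.quote(username)} /srv/data"


# A username with a space is split into two arguments unless quoted.
print(shlex.split(naive_command("joe smith")))
print(shlex.split(quoted_command("joe smith")))

# Shell metacharacters stay inert inside single quotes.
print(quoted_command("bob;>/hacked"))
```

| Passing arguments as a list (for example to subprocess.run without
| shell=True) avoids the quoting question entirely, which is the
| option most shell scripts do not have.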
| gmuslera wrote:
| The problem could be old scripts or systems that don't handle
| UTF-8 (which needn't be the ones where the username was
| defined). I'm not sure if e.g. the Bobby Tables trick could be
| done with characters whose UTF-8 representation is read as
| pure ASCII.
| Starlevel004 wrote:
| Breaking shell scripts sounds like a good idea to me. The
| faster they die the better the world gets.
| Rygian wrote:
| That's going to be a very bumpy road, even if everyone were
| to agree that the destination is appealing.
| bigstrat2003 wrote:
| Yeah for better or for worse compatibility is king. I
| _despise_ shell scripts, they are an absolute nightmare to
| work with and full of footguns. But they are so commonplace
| that people are not going to tolerate YOLO breaking
| changes.
| chikere232 wrote:
| Perhaps unix isn't for you?
| makeitdouble wrote:
| Thing is, they don't die. Instead you get the short end of
| the stick.
|
| You'd have to be pretty darn important for an org to fix
| their scripts because of your name or the username you
| created. Or it would need to happen at a larger scale, but
| then that wouldn't be so controversial in the first place.
| codedokode wrote:
| But spaces have been allowed in filenames since the 80s; hasn't
| software had enough time to adapt?
| michaelt wrote:
| Microsoft's Windows 95 put spaces into "c:\My Documents" and
| "c:\Program Files" so that developers targeting Windows were
| _forced_ to support spaces in filenames.
|
| Of course, in those days if an OS upgrade broke some third
| party software, the end user _paid for an upgrade_. So
| although Microsoft forced developers' hands, the developers
| all got paid for their trouble. And you'd only have your hand
| forced that way once or twice a decade.
|
| Windows at the time was also all about the GUI file-pickers.
| Breaking the command line? Shell scripts? What are those?
| toast0 wrote:
| And now it's \Users, presumably because after 20 years,
| Microsoft gave up?
| hwc wrote:
| Or someone got tired of typing long paths.
| Uvix wrote:
| They changed from \Documents and Settings to \Users in
| Vista, alongside other profile rejiggering (e.g.
| introducing AppData folders). By that point software had
| either been fixed or would never be fixed, so keeping a
| space in the name wasn't particularly useful.
| rcxdude wrote:
| It's still very common for usernames to have spaces,
| though.
| alterom wrote:
| _And now it's \Users, presumably because after 20 years,
| Microsoft gave up?_
|
| Only if you assume that people rarely have spaces in
| their Windows login names (e.g. "Joe Smith").
|
| Either that, or Windows users have learned to _not be
| scared of spaces_ in filenames, usernames, and _their own
| literal names_.
| bigstrat2003 wrote:
| That doesn't sound right. Microsoft is _obsessed_ with
| backwards compatibility, going so far as to accommodate
| programs that were _writing to Windows' private memory_
| just to preserve it. Deliberately breaking programs isn't
| in their ethos at all.
| sltkr wrote:
| The new filesystem APIs were introduced with Windows 95,
| so there was no backward compatibility to break. _New_
| programs using those _new_ APIs were forced to support
| spaces in directories. Using spaces in the system
| directories forced application developers to consider
| that scenario and deal with it appropriately.
|
| Meanwhile, DOS and Windows 3.1 applications that did run
| on Windows 95 could access files under a backward
| compatible 8.3 scheme, like C:\Progra~1\ instead of
| "C:\Program Files".
| bigstrat2003 wrote:
| That's a good point, thanks for pointing it out.
| michaelt wrote:
| I'm thinking of the transitions from Windows 3.1 to
| Windows 95 (IIRC introducing 32-bit and filenames longer
| than 8 characters) and the transition from Windows 95 to
| Windows XP (IIRC introducing a proper permission system,
| thus breaking anything that relied on being able to write
| things outside of user-owned folders)
|
| I agree they were famously accommodating in those days.
| But they also had enough market power that if they said
| users could only write to one folder and it had a space
| in the filename, developers who disliked it couldn't vote
| with their feet.
| dizhn wrote:
| A lot of software still had issues and asked the user to
| use C:\Directory directly. Some probably still do.
| reginald78 wrote:
| I remember trying to install Visual Studio in the mid-
| late 2000s (when SSDs made hard drive space small again)
| to a directory other than C: and found that after
| following a rather convoluted process you could only
| actually move maybe 20% of the install files off C:.
| StefanBatory wrote:
| It is still the same. :(
| yonatan8070 wrote:
| I've seen some things installing directly into C:\,
| NVIDIA's software jumps to mind
| akira2501 wrote:
| C:\Progra~1
|
| They didn't force anything.
| deltarholamda wrote:
| My last name has an apostrophe in it. This isn't super weird
| or anything, there have been "O'Haras" and "O'Neills" (with 2
| Ls) forever.
|
| And yet whenever I deal with a computer system I don't put
| the apostrophe in because even in 2024 it is completely
| jacked up. Sometimes it's just disallowed. Sometimes I get
| "\\\'" showing up. Sometimes I get "'". I've seen
| "’". One time, one system accepted it, but another
| system that accessed the same data didn't allow apostrophes
| so the person using the second system couldn't access the
| record, and it took 2 phone calls and 3 people to come up
| with a workaround.
|
| It doesn't work often enough that I don't even try anymore.
| There are just too many opportunities for it to get forgotten
| or handled improperly from all directions.
| soneil wrote:
| I had fun in the vmware-broadcom transition because the
| broadcom portal doesn't allow that, but the vmware portal
| did. Not even in my username, just in the surname field.
| The new portal ate it on that so hard, I wasn't even
| allowed to create a ticket to do anything about it.
|
| Not as bad as when I was once issued a first.o'last@corp
| email address though ..
| mixmastamyk wrote:
| There may be a Unicode character that looks like
| apostrophe but has no quoting semantics. I use an arrow
| instead of greater-than symbol in my prompt for the same
| reason. To avoid copy/paste issues.
| jcranmer wrote:
| Non-ASCII characters in email addresses have even worse
| compatibility issues than punctuation characters.
| Punctuation fails because people don't know the standard.
| Non-ASCII fails because people don't know the _latest_
| standard.
| deltarholamda wrote:
| >Not as bad as when I was once issued a first.o'last@corp
| email address though
|
| Oh, man, that happened to me too, way back in the late
| 90s. I had forgotten about that.
|
| It broke things all over the place. Even now you run into
| the occasional validator that is convinced that the plus
| sign is not valid in email addresses.
| mschuster91 wrote:
| > Even now you run into the occasional validator that is
| convinced that the plus sign is not valid in email
| addresses.
|
| These are intentional IMHO - force people to use their
| actual email address so a potential breach can't be tied
| back to the service. That's the _only_ reason why someone
| would use a + in the first place.
| graemep wrote:
| > And yet whenever I deal with a computer system I don't
| put the apostrophe in because even in 2024
|
| In usernames or in name fields for text generally?
|
| I assume things like bank systems can deal with it because
| they should match things like IDs?
| deltarholamda wrote:
| Name fields in general.
|
| But sometimes I don't have control, e.g. another person
| is inputting the data and dutifully duplicates my name.
| That's how I ended up with the 2 phone calls/3 person
| situation, which happened about a month ago.
|
| Hell, my driver's license is missing the apostrophe
| because the system doesn't accept it.
|
| When somebody is trying to find me in a computer there's
| a whole litany of things they have to try, including
| assuming "First O'Lastname" got bashed into "First O.
| Lastname".
|
| I think about this every time I read an article extolling
| the wonders of technology.
| jorvi wrote:
| > One time, one system accepted it, but another system that
| accessed the same data didn't allow apostrophes so the
| person using the second system couldn't access the record,
| and it took 2 phone calls and 3 people to come up with a
| workaround.
|
| There's still a lot of organisations that somewhere in
| their e-mail processing chain cannot deal with 4-letter
| TLD e-mail addresses*. Even worse is that the front-end is
| often a relatively new framework and will happily accept
| your e-mail, only to then have it silently fail forever.
| Mercifully a lot of those organisations have their customer
| service authorized to change your e-mail address manually,
| but if they don't.. good luck.
| wongarsu wrote:
| NPX on windows was broken for years when your username had a
| space in it. Never underestimate how long bugs can stay
| around when it doesn't affect any of the developers and for
| everyone else the workaround is quicker than fixing it
| slightwinder wrote:
| Problem is, the design of Unix shells is older, and they have
| some parts which automatically split on space if not handled
| carefully. This is really annoying.
| rossy wrote:
| For people using NSS modules like winbind, most of those
| scripts are already broken
| wolrah wrote:
| > Just imagine how many poorly-written shell scripts will break
| when we suddenly allow dollars, quotes, backticks and the likes
| in usernames. Heck, even allowing spaces sounds like horror to
| me.
|
| If we're admitting they're poorly-written, why can't we admit
| that they're already broken regardless of whether that
| brokenness is currently being triggered? Allowing symbols or
| spaces didn't break anything, it was broken from day one just
| no one noticed.
|
| Why is the answer always "go out of your way to not upset the
| broken garbage that's been around forever" rather than "throw
| Zalgo at it and fix what breaks so it's no longer broken and
| won't be broken in the future"?
|
| Bug compatibility is the worst behavior of the computing
| industry. Let the bad code break and more importantly call it
| out so everyone knows where the blame belongs.
| nmstoker wrote:
| Unfortunate ambiguous uses of the word drop throughout the
| otherwise excellent article
| TimK65 wrote:
| There are three uses of the word "drop," all of which are
| correct.
|
| The latter-day meaning of "drop" is an abomination.
| toast0 wrote:
| I dropped X off at Y. Then X dropped off the face of the map,
| never to be seen again.
|
| Many words and phrases in English are self-antonyms.
| fargle wrote:
| > The src:shadow package had dropped a Debian-specific patch,
|
| shoot, that's evil. had not noticed this. i read this as
| "removed", not "was released". now idk.
|
| this pseudo-definition of dropped as "released" is beyond
| stupid. yikes!
| account42 wrote:
| Always fun to see people poke the Unicode dragon only to be
| dumbstruck by its true size as it stands up in preparation of
| engulfing them with the fire of unintended consequences.
| beardygo wrote:
| Indeed. As a speaker of several languages, including RTL
| language (they haven't even considered the problems with RTL
| marks etc), I say stay with ASCII for usernames, keep UTF for
| full names.
|
| If restricted ASCII a-z is good enough for passport names
| worldwide, it's good enough for usernames.
| macbr wrote:
| I'm confused - my name as written on my passport definitely
| contains non ASCII characters?
| extraduder_ire wrote:
| What is it in the machine-readable section at the bottom?
| My passport takes the apostrophe out of my name down there.
| Muromec wrote:
| You probably have an ASCII-adjacent name to begin with, so
| people who can read some kind of language using Latin
| letters will simply ignore "funny dots and dashes" and
| pronounce it kinda wrong.
|
| It's on a different level from having a name originally
| written in a different alphabet entirely. At this point you
| just have it written in two scripts, with second being
| ASCII.
| mschuster91 wrote:
| > If restricted ASCII a-z is good enough for passport names
| worldwide, it's good enough for usernames.
|
| Passports (and credit cards) are the best example why ASCII-
| only is horribly broken. It's 2024, people want to type in
| their name as they write it normally, and they have the
| reasonable expectation of IT "dealing with it" behind the
| scenes.
|
| Unfortunately, that expectation isn't reality, and it's all
| too common people are being rejected at the border or their
| card transactions are denied because braindead policies leave
| no other option but to blanket deny in case of mismatches.
| tgbugs wrote:
| I made a design decision for a standard for dataset structure
| to explicitly ban characters beyond ascii [A-Za-z0-9.,-_ ]
| precisely because all the positivity around utf-8 often leads
| people to think that it comes with no additional complexity
| cost. There is an escape hatch with a way to indicate that a
| dataset uses unicode filenames but the standard states that any
| consumer may reject such datasets because unicode support is
| explicitly not required.
|
| I got pushback from people who would not have to implement or
| maintain the systems for being a backward asciite so seeing
| this article is rather vindicating.
| miohtama wrote:
| I remember useradd and adduser when learning Linux and oh boy
| what a confusion it was... Why not just one command
| abigail95 wrote:
| if you cannot handle UTF-8 anywhere anything approaching text
| could be, your program is malformed and should be deprecated and
| removed.
|
| if you wrote code that couldn't handle bob;>/hacked in a
| username, you would and should be laughed at.
|
| why are we using this ancient stuff?
| knorker wrote:
| It's not just programs. And it's not just semantics of all-
| numeric username. It's also whether you want usernames that you
| cannot type, nor possibly even render.
|
| Definitely you can't spell it to someone else.
|
| Who owns that file? Oh, it's right-to-left non breaking space
| smiley snowman Chinese sign for water, I love that guy!
| abigail95 wrote:
| If people want to set up a Debian environment where people
| are mixing RTL and Hanzi I see no reason for that to be
| prohibited.
|
| Debian has opinions but I disagree that they should extend
| that far.
|
| If my employee Zalgo-fies everything. I don't file a bug
| report with Debian. I just fire them.
| Muromec wrote:
| >If my employee Zalgo-fies everything. I don't file a bug
| report with Debian. I just fire them.
|
| With such a clearly North American attitude you may as well
| use ASCII for everything.
| drtgh wrote:
| With Unicode the same grapheme can be written with a sequence
| of one or more code points, and each code point can be a
| sequence of one or more code units.
|
| For example "å" can be written with U+00E5, and the same visual
| glyph "å" with U+0061 + U+030A (U+0061 {a} plus the code point
| U+030A {Combining Ring Above}).
|
| Another homoglyph Unicode user name example:
|
| * is Café == Café ?
|
| * C + a + f + e + U+0301 {Combining Acute Accent} vs C + a + f + é
|
| * UTF-8: 43616665CC81 vs 436166C3A9
|
| As one user has pointed out in another comment, some kind of
| standardisation for that specific use case with some kind of
| normalisation would be needed first (nevertheless a database
| search would want a different one, and so on). The above
| examples are among the simpler ones, there are also unprintable
| characters, etc.
|
| It can be done as in "nothing is impossible", but it's not that
| easy, it's actually complex.
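| The grapheme / code point / code unit layers described above can be
| inspected directly. A small sketch in Python; the helper name
| describe is hypothetical, and the emoji line is an added example
| that is not from the comment.

```python
def describe(text: str) -> dict:
    """Count code points and code units for one string."""
    return {
        "code_points": len(text),                           # Python str length
        "utf8_hex": text.encode("utf-8").hex().upper(),     # UTF-8 code units (bytes)
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
    }


decomposed = "Cafe\u0301"   # C, a, f, e, U+0301 COMBINING ACUTE ACCENT
precomposed = "Caf\u00e9"   # C, a, f, U+00E9

print(describe(decomposed))
print(describe(precomposed))
print(describe("\U0001F600"))  # one code point outside the Basic Multilingual Plane
```

| Both spellings show as the same four graphemes on screen, yet they
| differ at every lower layer, so the layer at which a comparison is
| made has to be chosen explicitly.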
| abigail95 wrote:
| If a user picks a presentation layer that displays a from
| noncomparable alphabets, but has them look identical - that's
| a choice they can and should be able to make. I think it's
| dumb but I'm not here to hold anyone's hand.
|
| It's the user's choice whether 43616665CC81 == 436166C3A9,
| same for Café == Café. But they are distinct and separate
| choices. Text and bytes are separate things.
|
| We accept that case sensitivity exists and whether a
| user/business/program treats them as identical is and _should
| always be_ their choice to make.
|
| There is abstract complexity in the problem, but the context
| in which text is used solves most of that.
|
| If I have handwritten notes and I make a copy but write the
| second one in cursive and ask someone if they say the same
| thing - the correct answer isn't "we need to create a
| standard to normalize the presentation of text" - it's "be
| more precise in what you are asking".
|
| Whether Café == Café depends on if it's written on a road
| sign, or a network packet with a fixed byte size.
|
| Unprintable characters are not text and should not be stored
| in text fields. Neither are control characters, and as far as
| I'm concerned should not be included in any text encoding
| standard. Formatting and terminal processing _should never be
| stored in-band_ , that's an obvious design flaw that should
| be corrected.
|
| We already deal with ambiguity within ASCII re I vs l vs 1.
| Some fonts render those identically - Using those fonts in a
| passport is bad design. Saying we should avoid having to
| compare those characters at all because _some people /systems
| might confuse them_ is misguided.
|
| This isn't a true rebuttal of what you were saying but some
| of my next thoughts.
| anon-3988 wrote:
| Nah, you can use whatever you want for _display_.
|
| We have our tower of babel here and we are telling people not
| to use it? I am not even native English user btw. Having a
| lingua franca allowed me to understand someone from Russia,
| China, Japan, etc.
|
| Maybe once we have easily accessible ML that can translate
| nuances from one language to another without loss, we can all
| talk in our own languages and just translate each other's words.
| abigail95 wrote:
| I think people should be able to configure systems to handle
| a broad range of text from popular encoding standards like
| UTF-8.
|
| Limiting text-space because of communication is a strange
| objection that I don't think will hold up over time.
| PhilipRoman wrote:
| I really love this powerless use of "should". If you spit on
| billions of lines of code, all you will get is a dry mouth. The
| reality defines "what is", unless you have lots of tanks and
| people under your control, in which case you can change the
| reality.
|
| There is tons of useful code which you will likely never
| encounter, that helps people accomplish their tasks every day.
| Do you think there is some central authority who is going to go
| building to building and dd if=/dev/zero every shell script
| they find?
| abigail95 wrote:
| This is a contemporary discussion, today, concerning hundreds
| perhaps thousands of lines of code. That's it.
|
| If someone is objecting to changes because of things like
| "bob;>/hacked". That is laughable, and I will continue to
| point and laugh. Imagine limiting URL encoding because of SQL
| injection.
|
| We can fix this, then fix the things that break - and then we
| can improve.
|
| Or we can ossify into stone. Your choice.
| PhilipRoman wrote:
| >if you cannot handle UTF-8 anywhere anything approaching
| text could be, your program is malformed and should be
| deprecated and removed.
|
| I was referring to this. Don't get me wrong, I also would
| love to make sweeping changes to many things in computing.
| I still think it is perfectly valid to impose reasonable
| limitations on input even if the program could
| theoretically handle it - it prevents all kinds of problems
| at the very root (like allocating disproportionate amounts
| of resources, infinite timeouts, etc).
| chikere232 wrote:
| oh yes, let's break things to gain nothing of value
| gspr wrote:
| Perhaps nothing of value _to you_.
|
| I'll hazard a guess that your preferred username can be
| expressed in a small subset of ASCII? And to hell with everyone
| else?
| knorker wrote:
| I'll hazard a guess that your preferred username can't be
| written by 99.99999% of the world, and would always have to
| be copy-pasted?
| Ylpertnodi wrote:
| Yeah, us foreigners, up to our usual tricks again.
| knorker wrote:
| By any definition of the word, I'm a foreigner.
|
| So if you meant to imply that I'm an American, you've
| guessed wrong.
| chikere232 wrote:
| If your personal identity is threatened by having to use an
| ascii alphanumeric login name, you're kind of creating
| problems for yourself for no reason...
|
| There is a field for the full name of the person if you want
| to, and at least on my linux it warns for non-ascii
| characters but allows them
| anon-3988 wrote:
| It's a give and take. If you allow for anything beyond latin,
| then you have to accept that there will be a class of
| software that will be difficult to interact with.
|
| Latin-like language system is simply superior for machine
| purposes. I am sorry, but I don't even want to think of
| supporting the entire unicode in my software. I am not going
| to even attempt to reverse that emoji.
| chikere232 wrote:
| It gets real fun when it's something you need to look up
| and have match, like a username.
|
| Because then it has to be normalised in the right way for
| comparisons to work, or it will only match if your input
| method happens to produce the exact same variant.
|
| ... And unicode is an evolving standard where this
| normalisation sometimes changes between standards, so the
| names as normalised in the old version of your standard
| library might disagree with the new version. So you need to
| care for that transition.
|
| ... And often this is implemented separately for different
| languages, so you can get names that won't match if you
| normalise them in python, java or C.
|
| ... And as all implementations, these unicode
| implementations sometimes have bugs, so you need to think
| not only about matching supported unicode versions, but
| matching bugs.
|
| ... And any change in these normalisations can in theory
| lead to two usernames that used to be distinct becoming
| identical.
|
| It's a deep well
| khaled wrote:
| > And unicode is an evolving standard where this
| normalisation sometimes changes between standards
|
| Unicode normalization is subject to its stability policy,
| and Unicode no longer allows adding new canonically
| equivalent code points.
|
| https://www.unicode.org/policies/stability_policy.html
| layer8 wrote:
| The issue is that it has already been broken (read: has allowed
| arbitrary byte sequences) for a long time, and the debate is
| about what to restrict it to.
| codedokode wrote:
| Don't you think that it would be better to get rid of usernames
| in UI? They only provide unique data for fingerprinting and do
| almost nothing useful on a single-user system. Wouldn't it be
| better to simply have a default name like "primary user" or "main
| user" for the first user and skip one step in installation
| process? Also it frees you from typing a username on login for a
| single-user system.
| eviks wrote:
| Single user systems can just not ask for a username if there is
| only one, they control the UI
| knorker wrote:
| So in the future I may not be able to even type the name of
| another user? Admins and other users not being able to type
| usernames sounds very bad.
|
| And I say that as someone whose native language has more letters
| than English.
| zvr wrote:
| Most people are too young to remember that when you typed your
| username in all-caps in the login prompt (because the CapsLock
| key was on by accident, for example), the login(8) program
| assumed you were in a connection that could only do 7-bit (upper
| case, but no lower case characters) and immediately switched the
| tty settings and you were then presented with a "\PASSWORD: "
| prompt.
| roelschroeven wrote:
| Don't you mean 6-bit? 7-bit ASCII supports lower case
| characters. Or maybe there are other 7-bit character sets that
| don't have lower case characters and it was one of those?
| jks wrote:
| PETSCII? On the Commodore 64 you could press the Commodore
| key and Shift together to change character sets between
| lowercase and the graphical characters.
|
| But the Unix login thing might have been because of
| teletypes?
| https://www.columbia.edu/cu/computinghistory/teletype/ claims
| that ASR 33 used 8-bit ASCII but was uppercase only - not
| sure if the "8-bit" claim can be true.
|
| On some Unix (and Linux) systems, you can still enter a kind
| of retro mode with "stty olcuc iuclc" (output lowercase to
| uppercase, input uppercase to lowercase) and turning on Caps
| Lock.
| soneil wrote:
| This reminds me of the systemd bug where usernames starting with
| a digit were mishandled (#15141).
|
| It seems to me like something that "should" be relaxed, but we
| need to have high confidence in the entire foodchain. adduser
| seems like the last place it should be changed, not the first -
| anyone requiring "enough rope" is already served by useradd.
| hwc wrote:
| My work machine uses my complete email address as a username
| (this was done by someone in the IT department). Vim gets
| confused when I use the `gf` command to open a path that contains
| an '@' character in it.
| bjourne wrote:
| Honestly, it is super brain-dead that Linux and other operating
| systems still have such massive problems with "special"
| characters. Just the other day I had to help someone who had
| trouble building. The cause turned out to be that they had
| dropped filenames with parentheses in the source directory which,
| apparently, confused bash which make relies on. Such trash is
| everywhere on Linux systems. Eventually you learn to only use
| [a-zA-Z0-9-_.] in names because anything else will inevitably
| confuse some tool or another (even capital letters can be a
| PITA)... I so wish someone would take it upon themselves to clean
| up this mess, but it's probably too much work and there are too
| many nay-sayers conditioned to it who don't see the need for
| changes.
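The "only use [a-zA-Z0-9-_.]" habit described above can be turned into a small check. This is a minimal sketch, not a standard tool: the function name `is_portable_name` is hypothetical, and rejecting a leading hyphen is an extra assumption (it follows POSIX's advice for portable filenames, since such names are easily mistaken for command-line options).

```python
import re

# Letters, digits, '.', '_' and '-': the same set the comment lists,
# which is also POSIX's "portable filename character set".
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_name(name: str) -> bool:
    """Return True if `name` sticks to the conservative character set."""
    # Empty names and the special directory entries are never usable.
    if name in ("", ".", ".."):
        return False
    # Assumption: reject a leading '-' so the name cannot look like an option.
    if name.startswith("-"):
        return False
    # fullmatch() requires the whole string to match, so a trailing
    # newline or any other stray character is rejected as well.
    return _PORTABLE.fullmatch(name) is not None


for candidate in ["build-1.0_final.tar.gz", "notes (copy).txt", "-rf"]:
    print(candidate, is_portable_name(candidate))
```

A check like this only flags risky names; it does not make tools that mishandle other characters any more correct.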
| hiccuphippo wrote:
| As someone who needs non-ascii characters to write my name:
| _please don't_. You are making things worse just to be
| "courteous" about something we don't care about and will actually
| be annoyed at if we have to find how to write a letter on the
| keyboard or, worst-case scenario, figure out how to change the
| layout to the correct one _before I even logged in_.
| jks wrote:
| Likewise. My last name contains a non-ascii character. In ~2009
| I started at a company whose admin conveniently set up an
| account for me on their Ubuntu server... on which no-one could
| then log in locally because the login manager crashed when
| trying to display the list of users. I logged in via ssh and
| changed my name to the nearest ASCII equivalent.
|
| I always feel slightly worried on sites that demand that I give
| my full legal name (such as the US ESTA form), and then refuse
| to handle it because it includes "illegal" characters.
| ASalazarMX wrote:
| This has happened to me with _passwords_ containing foreign
| characters. The system would accept it, but further logons
| would be impossible. Now I always strip diacritics to be
| safe.
| jks wrote:
| A friend mentioned using control characters in passwords...
| like ^F and ^B, but not ^C because that's the interrupt
| character. Feels vaguely risky to me (does ^U empty the
| line? does ^W delete the last word? does your terminal
| emulator do some weird encoding like it does for cursor
| keys?) but if it works, why not?
| jowea wrote:
| I suspect I have run into a couple bugs because of
| password generators putting characters that some backend
| system cannot process in the password. I half wish they just
| did DKWhhjwqjkwqjmHSJKHAIUHQwdmlsadkl instead.
| beardygo wrote:
| Full legal name as appears on machine readable zone in your
| own passport. Allowed characters are A-Z only, see MRZ
| specifications:
|
| https://en.wikipedia.org/wiki/Machine-readable_passport
| Muromec wrote:
| What's a legal name? It presumes it's somehow different
| from other ... illegal names. But in which way? Which law
| has a say?
| doubled112 wrote:
| Just having an apostrophe in my last name causes me issues.
|
| Yes, that's me, Mr. O&Conner
| SuperSandro2000 wrote:
| They are clearly bored and want to start a year long bug hunt
| through half of unix
| Muromec wrote:
| That sounds like a good kind of bored and bug hunting through
| half the unix sounds like fun too.
| kej wrote:
| I wonder if it would work to do something like the punycode
| system for internationalized domain names. Shell scripts could
| handle a name like `xn--0civ130n` just fine, and user-facing
| utilities could choose to convert that to :sparkle::unicorn: when
| appropriate. The same homograph protections would probably work,
| as well.
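The punycode idea above can be tried with Python's built-in `punycode` codec. This is a rough sketch under stated assumptions: the helper names `to_ace` and `from_ace` are hypothetical, the `xn--` prefix is borrowed from internationalized domain names, and the normalization and homograph checks that real IDNA processing performs are skipped entirely.

```python
ACE_PREFIX = "xn--"  # prefix borrowed from internationalized domain names


def to_ace(label: str) -> str:
    """Return an ASCII-only form of `label` that scripts can handle."""
    if label.isascii():
        # Plain ASCII names pass through unchanged.
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(label: str) -> str:
    """Return the display form of a name produced by to_ace()."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


stored = to_ace("münchen")
print(stored)            # xn--mnchen-3ya
print(from_ace(stored))  # münchen
```

One design wrinkle a real scheme would have to settle: a plain ASCII name that already starts with `xn--` is ambiguous here, so such names would need to be reserved or rejected.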
| dsr_ wrote:
| I will remind everyone that there are a minimum of three
| identifiers here.
|
| The UID, which is an integer. Ownership resides here; it's the
| primary key. Can be used by programs.
|
| The username, each of which must be unique and maps to one UID --
| but multiple usernames can map to the same UID. Used by humans
| and programs to login.
|
| The GECOS field, or "human readable name", which is only used as
| a display label. Some systems include a structure inside this for
| additional info like phone number, office number, or similar. I
| don't think anyone would object to UTF-8 here.
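To make the three identifiers concrete, here is a small sketch that parses passwd(5)-style lines (`name:password:UID:GID:GECOS:home:shell`). The sample accounts are invented for illustration, and `parse_passwd_line` is a hypothetical helper, not part of any library; on a real system the standard `pwd` module exposes the same fields as `pw_name`, `pw_uid` and `pw_gecos`.

```python
from collections import defaultdict
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    name: str   # login name: unique, used by humans and programs
    uid: int    # numeric owner: what the kernel actually checks
    gecos: str  # display label, conventionally comma-separated subfields


def parse_passwd_line(line: str) -> PasswdEntry:
    # passwd(5) layout: name:password:UID:GID:GECOS:home:shell
    name, _pw, uid, _gid, gecos, _home, _shell = line.rstrip("\n").split(":")
    return PasswdEntry(name, int(uid), gecos)


# Invented sample data: two login names share UID 0.
SAMPLE = """\
root:x:0:0:Charlie Root:/root:/bin/sh
toor:x:0:0:Bourne-again Superuser:/root:/bin/bash
alice:x:1000:1000:Alice Example,Room 12,555-0100:/home/alice:/bin/bash
"""

entries = [parse_passwd_line(line) for line in SAMPLE.splitlines()]

# Several usernames may map to one UID, but each username maps to one UID.
names_by_uid = defaultdict(list)
for entry in entries:
    names_by_uid[entry.uid].append(entry.name)

print(dict(names_by_uid))
print(entries[2].gecos.split(",")[0])  # first GECOS subfield: the full name
```

Because only the GECOS field is a pure display label, it is the one place where arbitrary UTF-8 never has to pass through code that treats the value as a key.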
| seu wrote:
| The fact that this whole discussion happens in English partially
| explains why there is a discussion at all. The whole problem
| could have been avoided if the development of computers had been
| a more international effort.
| seiferteric wrote:
| OMG Can't believe this, I ran into this exact thing at my last
| job. We discovered a security vuln in several of our services
| because we were accepting unsanitized usernames. We were
| doing things with them (passing them to scripts etc.), but only
| after passing them to useradd/usermod etc., so we thought they were
| safe. Of course, you could put in things like ";", "&", ">"
| etc. and do whatever you want. I discovered that Debian DISABLED
| the username sanity checks and could not believe it. Anyway, I
| installed a patched version as well as sanitized input and other
| stuff to resolve the issue.
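The failure mode described above (trusting a name because useradd accepted it, then handing it to a script) can be guarded against in the calling code. The sketch below is one possible approach, not what the commenter actually deployed: the pattern is a conservative assumption modelled on the traditional lowercase username rule, and the script path `/usr/local/bin/provision-user` is hypothetical.

```python
import re
import shlex

# Assumption: lowercase letter or underscore first, then lowercase letters,
# digits, underscores or hyphens, 32 characters at most.
_SAFE_USERNAME = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def is_safe_username(name: str) -> bool:
    """Validate the name ourselves instead of trusting an upstream tool."""
    return _SAFE_USERNAME.fullmatch(name) is not None


def build_argv(name: str) -> list[str]:
    """Build an argument vector for a (hypothetical) provisioning script."""
    if not is_safe_username(name):
        raise ValueError(f"refusing unsafe username: {name!r}")
    # A list passed to subprocess.run() is executed without a shell,
    # so ';', '&' and '>' would not be interpreted even if they got through.
    return ["/usr/local/bin/provision-user", name]


print(build_argv("alice"))
print(is_safe_username("bob; rm -rf ~"))  # False
# If a string truly must go through a shell, quote it explicitly:
print(shlex.quote("bob; rm -rf ~"))       # 'bob; rm -rf ~'
```

Validation and shell-free execution are independent layers here; either one alone would have blunted the ";" and "&" tricks, and keeping both means a change in the upstream tool's checks does not silently reopen the hole.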
| IshKebab wrote:
| > Most Debian users don't work with useradd, or groupadd,
| directly. Instead, Debian has long supplied its own adduser (and
| addgroup) utilities, originally written by founder Ian Murdock.
| These act as simpler front ends to useradd
|
| One of the dumbest things Debian has done.
| rurban wrote:
| They are so stupid, I cannot believe it!
|
| Names are identifiers, and as such need to stay identifiable. There
| exist Unicode security guidelines and rules for identifiers that
| they don't know about. My libu8ident library would help with that.
| UniverseHacker wrote:
| Clearly we should open up usernames to be an unlimited size set
| of mixed data types: e.g. the first "character" could be a hand
| drawn picture of a cat, the second the entire text of the US
| constitution in unicode, and so on. We could then extend this
| flexibility to filenames, passwords, and Unix commands.
| Internally, this could involve replacing all text strings with
| folders on a filesystem where you can put any files you want in
| any desired order. /s
| nineteen999 wrote:
| I have an affectionate place in my heart for Debian, the
| community is passionate, they have wonderful ideals, hell I even
| helped found a charity which distributes it on used PC's
| discarded by large companies to disadvantaged people over 20
| years ago which is still running today. It was my favourite
| distro for a long time after I moved on from Slackware in the
| late 90's, I used it at home, I used it in my job at a small ISP
| on everything from x86 to Sun Sparc to DEC Alpha hardware. We are
| lucky in the Linux community to have them. I couldn't care less
| about derivatives like Ubuntu; they seem to be one step too far removed.
|
| But over the years the bikeshedding and some of the poor
| technical decisions started to wear on me. The debconf approach
| of asking a million questions on install bothered me. In my
| current job we use it on small industrial ARM PC's and it does a
| great job there at a large scale distributed over a wide variety
| of environments and geographical area, scorching heat, freezing
| cold and everything in between. But that's easy because it's a
| single system image which we deploy to hundreds of devices and it
| only requires minimal customisation to perform the required
| tasks.
|
| But our datacenter servers remain RHEL for the simple reason ...
| the deployment and broad customisation process per server is
| easy, LDAP integration is straightforward and the customer wants
| to pay for support from the vendor even though we never use it.
| Security updates and bugfixes are delivered quickly and the
| vendor's commitment to stability is commendable. It's a
| no-brainer. More and more companies started to move their workloads
| to RHEL once it came out and unfortunately it just didn't make
| sense to bother with distributions outside of RHEL/Fedora for my
| personal use anymore, some sort of work/life balance is needed
| and I don't want to spend my personal computing time remembering
| all the idiosyncrasies between different Linux distributions
| anymore. I would argue that Debian is pretty idiosyncratic and
| opinionated if you have come from more traditional UNIX systems
| in the 90's, while RHEL/Fedora more closely model an "evolution"
| of those classic systems if you like. It will be interesting to
| see what happens to RHEL in the coming years as Red Hat becomes
| more and more absorbed into the IBM environment.
| finnthehuman wrote:
| That's the reality of deploying professionally though. I've got
| a soft spot for debian from using it for over 20 years too, but
| choosing open source often means picking the vendor that
| accommodates the use case. Many products have the enterprise
| upsell of good LDAP/AD integration but that's just a nice to
| have when you're really buying it for the ability to call someone
| when shit goes sideways.
|
| And when you don't need the support net, it's often gonna be
| ubuntu because that's what most people are comfortable with. Or
| yocto if you're shipping a custom OS. And containers are so
| ephemeral and purpose specific it means distro doesn't matter
| as much.
|
| I'm still rooting for them the most. They're community based,
| an important upstream, and stable has never done me dirty. It's
| still my go-to for "I don't want to think hard about, or worry
| about this system."
| jcarrano wrote:
| I don't get it. What's the purpose of changing the default rule
| in shadow-utils? Not only is it completely unnecessary and
| introduces risks for shell injections, it also risks introducing
| incompatibilities between Debian and any other system.
|
| I feel that there are already too many other things to fix to be
| wasting time in creating new potential bugs.
| thway15269037 wrote:
| Before opening this can of worms, can we finally address that
| there is a hard, hardcoded limit of 255 bytes per file name
| (folder name) in Linux? Yeah, 255 bytes, that is, like 63
| Japanese characters or emojis or maybe less. And in kernel, too,
| so you physically cannot correct this issue by using another
| filesystem or something.
|
| Before anyone asks: yes, these folders do occur in real life, and
| I am tired of pretending that they do not.
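The arithmetic behind the 255-byte complaint is easy to check: the limit counts encoded bytes, not characters, so the number of characters that fit depends on how many bytes each one takes in UTF-8. The sketch below hard-codes 255 as an assumption taken from the comment rather than querying the filesystem, and the helper names are hypothetical.

```python
NAME_MAX = 255  # bytes per file name component, as stated in the comment


def max_repeats(ch: str, limit: int = NAME_MAX) -> int:
    """How many copies of `ch` fit into `limit` bytes of UTF-8."""
    return limit // len(ch.encode("utf-8"))


def fits(name: str, limit: int = NAME_MAX) -> bool:
    """True if `name` is short enough once encoded as UTF-8."""
    return len(name.encode("utf-8")) <= limit


print(max_repeats("a"))   # 255: ASCII is one byte per character
print(max_repeats("あ"))  # 85: hiragana takes three bytes
print(max_repeats("😀"))  # 63: this emoji takes four bytes
print(fits("あ" * 86))    # False: 258 bytes
```

So a name made of three-byte characters tops out at 85 of them, and the figure of 63 applies to four-byte characters such as most emoji; emoji built from several code points fit even fewer.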
___________________________________________________________________
(page generated 2024-12-06 23:01 UTC)