[HN Gopher] A popular but wrong way to convert a string to upper...
       ___________________________________________________________________
        
       A popular but wrong way to convert a string to uppercase or
       lowercase
        
       Author : ingve
       Score  : 111 points
       Date   : 2024-10-08 08:11 UTC (14 hours ago)
        
 (HTM) web link (devblogs.microsoft.com)
 (TXT) w3m dump (devblogs.microsoft.com)
        
       | vardump wrote:
        | As always, Raymond is right. (And as usual, I could guess it
        | was him before even clicking the link.)
       | 
        | That said, 99% of the time when doing an upper- or lowercase
        | operation you're interested in just the 7-bit ASCII range of
        | characters.
       | 
       | For the remaining 1%, there's ICU library. Just like Raymond Chen
       | mentioned.
        
         | fhars wrote:
          | No, when you are doing string manipulation, you are almost
          | never interested in just the seven-bit ASCII range, as there
          | is almost no language that can be written using just that.
        
           | daemin wrote:
           | I would argue that for most programs when you're doing string
           | manipulation you're doing it for internal programming reasons
           | - logs, error messages, etc. In that case you are in nearly
           | full control of the strings and therefore can declare that
           | you're only working with ASCII.
           | 
           | The other normal cases of string usage are file paths and
           | user interface, and the needed operations can be done with
           | simple string functions, and even in UTF8 encoding the
           | characters you care about are in the ASCII range. With file
            | paths the manipulations that you're most often doing are
            | path-based, so you only care about the '/', '\', ':', and
            | '.' ASCII characters. With user interface elements you're
            | likely to be
           | using them as just static data and only substituting values
           | into placeholders when necessary.
        
             | pistoleer wrote:
             | > I would argue that for most programs when you're doing
             | string manipulation you're doing it for internal
             | programming reasons - logs, error messages, etc. In that
             | case you are in nearly full control of the strings and
             | therefore can declare that you're only working with ASCII.
             | 
             | Why would you argue that? In my experience it's about
             | formatting things that are addressed to the user, where the
             | hardest and most annoying localization problems matter a
             | lot. That includes sorting the last name "van den Berg"
             | just after "Bakker", stylizing it as "Berg, van den", and
             | making sure this capitalization is correct and not "Van Den
              | Berg". There is no built-in standard library function in
              | any language that does any of that. It's so much larger
              | than ASCII and even larger than Unicode.
             | 
             | Another user said that the main takeaway is that you can't
             | process strings until you know their language (locale), and
             | that is exactly correct.
        
               | daemin wrote:
               | I would maintain that your program has more string
               | manipulation for error messages and logging than for
               | generating localised formatted names.
               | 
               | Further I do say that if you're creating text for
               | presenting to the user then the most common operation
               | would be replacement of some field in pre-defined text.
               | 
               | In your case I would design it so that the correctly
               | capitalised first name, surname, and variations of those
               | for sorting would be generated at the data entry point
               | (manually or automatically) and then just used when
               | needed in user facing text generation. Therefore the only
               | string operation needed would be replacement of
               | placeholders like the fmt and standard library provide.
               | This uses more memory and storage but these are cheaper
               | now.
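A minimal Python sketch of the precompute-at-entry idea above (the record fields and names are hypothetical): the correctly cased display form and a sort form are stored once, and user-facing text only substitutes placeholders, never case-maps the name itself.

```python
# Hypothetical record produced at data entry time; user-facing text
# generation only does placeholder replacement, never case mapping.
record = {
    "display_name": "Jan van den Berg",   # entered/verified once
    "sort_form": "Berg, van den, Jan",    # precomputed sort variant
}

greeting = "Dear {display_name},".format(**record)
assert greeting == "Dear Jan van den Berg,"
```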
        
               | pistoleer wrote:
               | I agree, but the logging formatters don't really do much
               | beyond trivially pasting in placeholders.
               | 
               | And as for data entry... Maybe in an ideal world. In the
               | current world, marred by importing previously mangled
               | datasets, a common solution in the few companies I've
               | worked at is to just not do anything, which leaves ugly
               | edges, yet is "good enough".
        
             | BoringTimesGang wrote:
             | Now double all of that effort, so you can get it to work
             | with Windows' UTF-16 wstrings.
        
             | heisenzombie wrote:
             | File paths? I think filesystem paths are generally "bags of
             | bytes" that the OS might interpret as UTF-16 (Windows) or
             | UTF-8 (macOS, Linux).
             | 
             | For example:
             | https://en.m.wikipedia.org/wiki/Program_Files#Localization
        
               | vardump wrote:
               | File paths are scary. The last I checked (which is
               | admittedly a while ago), Windows didn't for example care
               | about correct UTF-16 surrogate pairs at all, it'd happily
               | accept invalid UTF-16 strings.
               | 
               | So use standard string processing libraries on path names
               | at your own peril.
               | 
               | It's a good idea to consider file paths as a bag of
               | bytes.
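A small Python sketch (not Windows-specific) of what an ill-formed UTF-16 name looks like: an unpaired surrogate is rejected by strict encoding, but the `surrogatepass` error handler can materialize it, modeling the kind of invalid name a lenient filesystem might hand back.

```python
# An unpaired high surrogate is not valid UTF-16, but "surrogatepass"
# lets us materialize the raw code unit anyway.
lone = "\ud800"                                 # unpaired high surrogate
raw = lone.encode("utf-16-le", "surrogatepass")
assert raw == b"\x00\xd8"                       # the bare D800 code unit

strict_rejected = False
try:
    lone.encode("utf-16-le")                    # strict encoding
except UnicodeEncodeError:
    strict_rejected = True
assert strict_rejected                          # standard processing balks
```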
        
               | Someone wrote:
               | > It's a good idea to consider file paths as a bag of
               | bytes
               | 
               | (Nitpick: _sequence_ of bytes)
               | 
               | Also very limiting. If you do that, you can't, for
               | example, show a file name to the user as a string or
               | easily use a shell to process data in your file system
               | (do you type "/bin" or "\x2F\x62\x69\x6E"?)
               | 
                | Unix, from the start, claimed file names were byte
                | sequences, yet assumed many of those to encode ASCII.
               | 
               | That's part of why Plan 9 made the choice _"names may
               | contain any printable character (that is, any character
               | outside hexadecimal 00-1F and 80-9F)"_
               | (https://9fans.github.io/plan9port/man/man9/intro.html)
        
               | daemin wrote:
               | That's what I mean, you treat filesystem paths as bags of
               | bytes separated by known ASCII characters, as the only
               | path manipulation that you generally need to do is to
               | append a path, remove a path, change extension, things
               | that only care about those ASCII characters. You only
               | modify the path strings at those known characters and
               | leave everything in between as is (with some exceptions
               | using OS API specific functions as needed).
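That approach can be sketched in Python with bytes (the path and helper names here are illustrative): the only bytes ever interpreted are known ASCII separators, and everything between them passes through untouched, whatever its encoding.

```python
# The path is opaque bytes; we only ever look at ASCII '/' and '.'.
path = "photos/Données/café.txt".encode("utf-8")

def change_extension(p: bytes, new_ext: bytes) -> bytes:
    # Split at the last ASCII '.', leaving all other bytes as-is.
    stem, dot, _ = p.rpartition(b".")
    return (stem if dot else p) + b"." + new_ext

def append_segment(p: bytes, seg: bytes) -> bytes:
    # Join at an ASCII '/', never touching the segment contents.
    return p.rstrip(b"/") + b"/" + seg

assert change_extension(path, b"bak") == "photos/Données/café.bak".encode("utf-8")
assert append_segment(b"photos", b"2024") == b"photos/2024"
```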
        
               | netsharc wrote:
               | IIRC, the FAT filesystem (before Windows 95) allowed
               | lowercase letters, but there's a layer in the filesystem
               | driver that converted everything to uppercase, e.g. if
               | you did the command "more readme.txt", the more command
               | would ask the filesystem for "readme.txt" and it would
               | search for "README.TXT" in the file allocation table.
               | 
               | I think I once hex-edited the FA-table to change a
               | filename to have a lowercase name (or maybe it was disk
               | corruption), trying to delete that file didn't work
               | because it would be trying to delete "FOO", and couldn't
               | find it because the file was named "FOo".
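The anecdote can be modeled with a toy Python dictionary standing in for the file allocation table: the driver upcases every requested name before lookup, so a raw directory entry that somehow got mixed case (hex edit, corruption) becomes unreachable through the normal path.

```python
# Toy FAT model: raw directory entries, one with illegal mixed case.
directory = {"README.TXT": "contents", "FOo": "orphaned"}

def fat_lookup(name: str):
    # The driver upcases every request before searching the table.
    return directory.get(name.upper())

assert fat_lookup("readme.txt") == "contents"  # found as README.TXT
assert fat_lookup("FOo") is None               # request becomes "FOO"
```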
        
           | vardump wrote:
           | > as there is almost no language that can be written using
           | just that.
           | 
           | 99% of use cases I've seen have nothing to do with human
           | language.
           | 
            | The 1% human language case needs to be handled properly
            | using a proper Unicode library.
           | 
           | Your mileage (percentages) may vary depending on your job.
        
             | 9dev wrote:
             | It's funny how software developers live in bubbles so much.
             | Whether you deal with human language a lot or almost not at
             | all depends entirely on your specific domain. Anyone
              | working on user interfaces of any kind must accommodate
              | proper encoding, for example; that includes pretty much
             | every line-of-business app out there, which is _a lot of
             | code_.
        
             | inexcf wrote:
             | Why do you need upper- or lowercase conversion in cases
             | that have nothing to do with human language?
        
               | vardump wrote:
               | Here's an example. Hypothetically say you want to build
               | an HTML parser.
               | 
               | You might encounter tags like <html>, <HTML>, <Html>,
               | etc., but you want to perform a hash table lookup.
               | 
               | So first you're going to normalize to either lower- or
               | uppercase.
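That normalization step can be sketched in Python (the dispatch table is hypothetical): for protocol identifiers like tag names, fold only the ASCII letters A-Z, so the result is locale-independent and stable as a hash key.

```python
# ASCII-only case fold: maps exactly A-Z to a-z, nothing else, so no
# locale or Unicode rule can ever change which bucket a tag lands in.
_ASCII_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "abcdefghijklmnopqrstuvwxyz",
)

def normalize_tag(name: str) -> str:
    return name.translate(_ASCII_LOWER)

handlers = {"html": "handle_html"}     # hypothetical dispatch table

assert normalize_tag("HTML") in handlers
assert normalize_tag("Html") in handlers
assert normalize_tag("İMG") == "İmg"   # non-ASCII passes through untouched
```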
        
               | Muromec wrote:
               | But but, I want to have a custom web component and
               | register it under my own name, which can only be properly
               | written in Ukrainian Cyrillic. How dare you not let me
               | have it.
        
               | inexcf wrote:
                | Ah, I see, we disagree on what is "human language". An
                | abbreviation like HTML and its different
                | capitalisations sound to me a lot like a feature of
                | human language.
        
               | recursive wrote:
               | Is this a serious argument? Humans don't directly use
               | HTML to communicate with each other. It's a document
               | markup language rendered by user agents, developed
               | against a specification.
        
               | tannhaeuser wrote:
               | Markup languages and SGML in particular absolutely are
               | designed for digital text communication by humans and to
               | be written using plain text editors; it's kindof the
               | entire point of avoiding binary data constructs.
               | 
               | And to GP, SGML/HTML actually has a facility to define
               | uppercasing rules beyond ASCII, namely the LCNMSTRT,
               | UCNMSTRT, LCNMCHAR, UCNMCHAR options in the SYNTAX NAMING
               | section in the SGML declaration introduced in the
               | "Extended Naming Rules" revision of ISO 8879 (SGML std,
                | cf. https://sgmljs.net/docs/sgmlrefman.html). Like
                | basically everything else on this level, these rules
                | are still used by HTML 5 to this date; in particular,
                | while element names can contain arbitrary characters,
                | only those in the IRV (ASCII) get case-folded for
                | canonicalization.
        
               | ARandumGuy wrote:
               | Converting string case is almost never something you want
               | to do for text that's displayed to the end user, but
               | there are many situations where you need to do it
               | internally. Generally when the spec is case insensitive,
               | but you still need to verify or organize things using
               | string comparison.
        
             | kergonath wrote:
             | Right. That's why I still get mail with my name mangled and
             | my street name barely recognisable. Because I'm in the 1%.
             | Too bad for me...
             | 
             | In all seriousness, though, in the real world ASCII works
             | only for a subset of a handful of languages. The vast
             | majority of the population does not read or write any
             | English in their day to day lives. As far as end users are
             | concerned, you should probably swap your percentages.
             | 
             | ASCII is mostly fine within your programs like the parser
             | you mention in your other comment. But even then, it's
             | better if a Chinese user name does not break your reporting
             | or logging systems or your parser, so it's still a good
             | idea to take Unicode seriously. Otherwise, anything that
             | comes from a user or gets out of the program needs to
             | behave.
        
               | vardump wrote:
               | I said use a Unicode library if input data is actual
               | human language. Which names and addresses are.
               | 
               | 99% case being ASCII data generated by other software of
               | unknown provenance. (Or sometimes by humans, but it's
               | still data for machines, not for humans.)
        
               | kergonath wrote:
               | I am really not sure about this 99%. A lot of programs
               | deal with quite a lot of user-provided data, which you
               | don't control.
        
               | Muromec wrote:
               | Who and why still tries to lowercase/uppercase names?
               | Please tell them to stop.
        
               | kergonath wrote:
               | Hell if I know. I don't know what kind of abomination
               | e-commerce websites run on their backend, I just see the
               | consequences.
        
             | elpocko wrote:
             | Every search feature everywhere has to be case-insensitive
             | or it's unusable. Search seems like a pretty ubiquitous
             | feature in a lot of software, and has to work regardless of
             | locale/encoding.
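A minimal Python sketch of locale-free case-insensitive matching via `str.casefold`, which applies Unicode full case folding (so "ß" matches "ss"); note it is still not locale-aware, e.g. Turkish dotless i would need extra handling.

```python
# Case-insensitive substring search using Unicode full case folding.
def contains_ci(haystack: str, needle: str) -> bool:
    return needle.casefold() in haystack.casefold()

assert contains_ci("Straße 12", "STRASSE")   # ß folds to ss
assert contains_ci("Grüße", "grüsse")
assert not contains_ci("Bakker", "berg")
```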
        
         | sebstefan wrote:
         | Yes please, keep making software that mangles my actual last
         | name at every step of the way. 99% of the world loves it when
         | you only care about the USA.
        
           | Muromec wrote:
           | If it needs to uppercase names it probably interfaces with
           | something forsaken like Sabre/Amadeus that only understands
           | ASCII anyway.
           | 
           | The real problem is accepting non-ASCII input from user where
           | you later assume it's ASCII-only and safe to bitfuck around.
        
             | sebstefan wrote:
             | From experience anything banking adjacent will usually fuck
             | it up as well
             | 
             | For some reason they have a hard-on for putting last names
             | in capital letters and they still have systems in place
             | that use ASCII
        
               | Muromec wrote:
               | If it uses ASCII anyway, what's the problem then? Don't
               | accept non-ASCII user input.
        
               | sebstefan wrote:
               | First off: And exclude 70% of the world?
               | 
               | Usually they'll accept it, but some parts of the backend
               | are still running code from the 60's.
               | 
               | So you get your name rendered properly on the web
               | interface, and most core features, but one day you're
               | wandering off from the beaten path, by, like, requesting
               | some insurance contract, and you'll see your name at the
               | top with some characters mangled, depending on what your
                | name's like. Mine is just accented Latin characters,
                | so it usually drops the accents; not sure how it would
                | work if your name was in an entirely different
                | alphabet.
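The accent-dropping mangling described above can be sketched in Python (this illustrates the behaviour, it is not a recommendation): decompose the string, then strip combining marks. A name in a non-Latin alphabet gets no such fallback at all.

```python
import unicodedata

def drop_accents(name: str) -> str:
    # NFKD splits 'é' into 'e' + combining accent; we then discard
    # every combining mark -- exactly the lossy ASCII fallback described.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

assert drop_accents("Renée Françoise") == "Renee Francoise"
# A Cyrillic name has no combining marks to strip; nothing is "fixed":
assert drop_accents("Мирослава") == "Мирослава"
```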
        
               | Muromec wrote:
               | >First off: And exclude 70% of the world?
               | 
               | Guess what, I'm part of this 70% and I also work in a
               | bank and I know exactly how.
               | 
               | Not a single letter in my name (any of them) can be
               | represented with ASCII. When it is represented in UTF-8,
               | most of the people who have to see it can't read it
               | anyway.
               | 
               | So my identity document issued by the country which
               | doesn't use Latin alphabet includes ASCII-representation
               | of my name in addition to canonical form in Ukrainian
               | Cyrillic. That ASCII-rendering is happily accepted by all
               | kinds of systems that only speak ASCII.
               | 
               | People still can't pronounce it and it got misspelled
               | like _yesterday_ when dictated over the phone.
               | 
                | Now regarding the accents, it's illegal not to support
                | them per GDPR (as per case law, discussed here a few
                | years ago).
        
             | InfamousRece wrote:
             | Some systems are still using EBCDIC.
        
         | crazygringo wrote:
          | > _That said, 99% of the time when doing an upper- or
          | lowercase operation you 're interested in just the 7-bit
          | ASCII range of characters._
         | 
         | I think it's more the exact opposite.
         | 
         | The only times I'm dealing with 7-bit ASCII is for internal
         | identifiers like variable names or API endpoints. Which is a
         | lot of the time, but I can't ever think of when I've needed my
         | code to change their case. It might literally be never.
         | 
         | On the other hand, needing to switch between upper, lower, and
         | title case happens all the time, always with people's names and
          | article titles and product names and whatnot. Which are
          | _never_ in ASCII because this isn't 1990.
        
           | hinkley wrote:
           | And you could argue that if the internal identifiers need to
           | be capitalized or lower-cased, you've already lost.
           | 
           | On an enterprise app these little string manipulations are a
           | drop in the bucket. In a game they might not be. Sort that
           | stuff out at compile time, or commit time.
        
       | blenderob wrote:
        | Issues like this are why I gave up on C++. There are so many
        | ways to do something and every way is freaking wrong!
       | 
       | An acceptable solution is given at the end of the article:
       | 
       | > If you use the International Components for Unicode (ICU)
       | library, you can use u_strToUpper and u_strToLower.
       | 
       | Makes you wonder why this isn't part of the C++ standard library
       | itself. Every revision of the C++ standard brings with itself
       | more syntax and more complexity in the language. But as a user of
       | C++ I don't need more syntax and more complexity in the language.
        | But I do need more standard library functions that solve these
        | ordinary real-world programming problems.
        
         | BoringTimesGang wrote:
          | > Issues like this are why I gave up on C++. There are so
          | many ways to do something and every way is freaking wrong!
         | 
         | These are mostly unicode or linguistics problems.
        
           | tralarpa wrote:
           | The fact that the standard library works against you doesn't
           | help (to_lower takes an int, but only kind of works
           | (sometimes) correctly on unsigned char, and wchar_t is
           | implicitly promoted to int).
        
             | BoringTimesGang wrote:
             | to_lower is in the std namespace but is actually just part
             | of the C89 standard, meaning it predates both UTF8 and
             | UTF16. Is the alternative that it should be made unusable,
             | and more existing code broken? A modern user has to include
             | one of the c-prefix headers to use it, already hinting to
             | them that 'here be dragons'.
             | 
             | But there are always dragons. It's strings. The mere
             | assumption that they can be transformed int-by-int,
             | irrespective of encoding, is wrong. As is the assumption
             | that a sensible transformation to lower case without error
             | handling _exists_.
        
         | pistoleer wrote:
         | > There are so many ways to do something and every way is
         | freaking wrong!
         | 
         | That's life! The perfect way does not exist. The best you can
         | do is be aware of the tradeoffs, and languages like C++
         | absolutely throw them in your face at every single opportunity.
         | It's fatiguing, and writing in javascript or python allows us
         | to uphold the facade that everything is okay and that we don't
         | have to worry about a thing.
        
           | pornel wrote:
           | JS and Python are still old enough to have been created when
           | Unicode was in its infancy, so they have their own share of
           | problems from using UCS-2 (such as indexing strings by what
           | is now a UTF-16 code unit, rather than by a codepoint or a
           | grapheme cluster).
           | 
           | Swift has been developed in the modern times, and it's able
           | to tackle Unicode properly, e.g. makes distinction between
           | codepoints and grapheme clusters, and steers users away from
           | random-access indexing and having a single (incorrect) notion
           | of a string length.
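The several competing notions of "string length" mentioned above can be sketched in Python, which indexes by code point:

```python
s = "👍"                                      # U+1F44D, outside the BMP
assert len(s) == 1                           # code points (Python's unit)
assert len(s.encode("utf-16-le")) // 2 == 2  # UTF-16 code units (JS's unit)
assert len(s.encode("utf-8")) == 4           # UTF-8 bytes

# Grapheme clusters are yet another layer: e + combining accent is two
# code points but one user-perceived character.
assert len("e\u0301") == 2
```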
        
         | bayindirh wrote:
          | I don't think it's a C++ problem. You just can't transform
          | anything developed in "ancient" times into something
          | Unicode-aware in a single swoop.
         | 
         | On the other hand, libicu is 37MB by itself, so it's not
         | something someone can write in a weekend and ship.
         | 
          | Any tool which is old enough will have a thousand ways to do
          | something. This is the inevitability of software and
          | programming languages. In the domain of C++, which has the
          | size of a mammoth now, everyone expects this huge pony to
          | learn new tricks, but everybody has a different idea of the
          | "new tricks", so more features are added on top of its
          | already impressive and very long list of features and
          | capabilities.
         | 
         | You want libICU built-in? There must be other folks who want
         | that too. So you may need to find them and work with them to
         | make your dream a reality.
         | 
          | So, C++ is doing fine. It's not that they omitted Unicode
          | during the design phase. Unicode arrived later, and it has
          | to be integrated by other means. This is what libraries are
          | for.
        
           | pornel wrote:
            | Being developed in, _and having to stay compatible with_,
            | ancient times is a real problem of C++.
           | 
           | The now-invalid assumptions couldn't have been avoided 50
           | years ago. Fixing them now in C++ is difficult or impossible,
           | but still, the end result is a ton of brokenness baked into
           | C++.
           | 
           | Languages developed in the 21st century typically have some
           | at least half-decent Unicode support built-in. Unicode is big
           | and complex, but there's a lot that a language can do to at
           | least not silently destroy the encoding.
        
             | cm2187 wrote:
             | That explains why there are two functions, one for ascii
             | and one for unicode. That doesn't explain why the unicode
             | functions are hard to use (per the article).
        
               | BoringTimesGang wrote:
               | Because human language is hard to boil down to a simple
               | computing model and the problem is underdefined, based on
               | naive assumptions.
               | 
                | Or perhaps I should say naïve.
        
               | cm2187 wrote:
               | Well pretty much every other more recent language solved
               | that problem.
        
               | kccqzy wrote:
               | Almost no programming language, perhaps other than Swift,
               | solved that problem. Just use the article's examples as
               | test cases. It's just as wrong as the C++ version in the
               | article, except it's wrong with nicer syntax.
        
               | zahlman wrote:
                | Python's strings have uppercase, lowercase and case-
                | folding methods that don't choke on this. They don't
                | use UTF-16 internally (they _can_ use UCS-2 for
                | strings whose code points will fit in that range;
                | while a string might store code points from the
                | surrogate-pair range, they're never interpreted as
                | surrogate pairs, but instead as an error encoding so
                | that e.g. invalid UTF-8 can be round-tripped), so
                | they're never worried about surrogate pairs, and
                | Python knows a few things about localized text casing:
                | 
                |     >>> 'ß'.upper()
                |     'SS'
                |     >>> 'ß'.lower()
                |     'ß'
                |     >>> 'ß'.casefold()
                |     'ss'
                | 
                | There are a lot of really complicated tasks for
                | Unicode strings. String casing isn't really one of
                | them.
                | 
                | (No, Python can't turn 'SS' back into 'ß'. But doing
                | that requires metadata about language that a string
                | simply doesn't represent.)
        
               | kccqzy wrote:
                | Still breaks on, for example, Turkish dotted and
                | dotless i (i/İ vs. ı/I). It's impossible to do
                | correctly without language information.
                | 
                | > (No, Python can't turn 'SS' back into 'ß'. But
                | doing that requires metadata about language that a
                | string simply doesn't represent.)
                | 
                | Yes, that's my point. Because in typical languages
                | strings don't store language metadata, this is
                | impossible to do correctly in general.
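The Turkish round-trip failure can be demonstrated with Python's default (locale-free) Unicode mappings:

```python
# Default Unicode case tables, with no Turkish locale information:
assert "ı".upper() == "I"        # dotless ı upcases to plain I...
assert "I".lower() == "i"        # ...but I downcases to dotted i
# İ lowercases (default tables) to 'i' plus a combining dot above:
assert "İ".lower() == "i\u0307"
```

Without knowing the text is Turkish, no library can decide whether `I` should become `i` or `ı`, which is exactly the point above.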
        
               | zahlman wrote:
               | I'm not seeing anything in the Swift documentation about
               | strings carrying language metadata, either, though?
        
               | kccqzy wrote:
                | This lowercase function takes a locale argument:
                | https://developer.apple.com/documentation/foundation/nsstrin...
               | 
               | It looks like an old NSString method that's available in
               | both Obj-C and Swift.
               | 
                | The casefold function is even older than that:
                | https://developer.apple.com/documentation/foundation/nsstrin...
                | Its documentation specifically includes a discussion
                | of the Turkish I/ı issue.
        
               | tedunangst wrote:
                | But that's wrong. The upper case for ß is ẞ.
        
               | IncreasePosts wrote:
               | That was only adopted in Germany like 7 years ago!
        
               | kccqzy wrote:
                | Well, languages and conventions change. The € sign was
                | added not that long ago and it was somewhat painful.
               | Chinese language uses a single character to refer to
               | chemical elements so when IUPAC names new elements they
               | will invent new characters. Etc.
        
               | cm2187 wrote:
               | C#'s "ToUpper" takes an optional CultureInfo argument if
               | you want to play around with how to treat different
               | languages. Again, solved problem decades ago.
        
               | tialaramex wrote:
                | Rust will cheerfully:
                | 
                |     assert_eq!("ὀδυσσεύς", "ὈΔΥΣΣΕΎΣ".to_lowercase());
                | 
                | [Notice that this is in fact entirely impossible with
                | the naive strategy, since Greek cares about the
                | position of symbols: capital sigma lowercases to ς at
                | the end of a word and to σ elsewhere.]
               | 
               | Some of the latter examples aren't cases where a
               | programming language or library should just "do the right
               | thing" but cases of ambiguity where you need locale
               | information to decide what's appropriate, which isn't
                | "just as wrong as the C++ version", it's a whole other
                | problem. It isn't _wrong_ to capitalise a-acute as a
                | capital A-acute, it's just _not always appropriate_
                | depending on the locale.
        
               | MBCook wrote:
               | So what?
               | 
               | That doesn't prevent adding a new function that converts
               | an entire string to upper or lowercase in a Unicode aware
               | way.
               | 
               | What would be wrong with adding new correct functions to
               | the standard library to make this easy? There are already
               | namespaces in C++ so you don't even have to worry about
               | collisions.
               | 
               | That's the problem I see. It's fine if you have a history
               | of stuff that's not that great in hindsight. But what's
               | wrong with having a better standard library going
               | forward?
               | 
               | It's not like this is an esoteric thing.
        
           | relaxing wrote:
           | It's been 30 years. Unicode predates C++98. Java saw the
           | writing on the wall. There's no excuse.
        
             | bayindirh wrote:
             | > There's no excuse.
             | 
              | I politely disagree. None of the programming languages
              | which started integrating Unicode were targeting
              | everything from bare metal to GUI, incl. embedded and OS
              | development, at the same time.
             | 
             | C++ has a great target area when compared to other
             | programming languages. There are widely used libraries
             | which compile correctly on PDP-11s, even if they are
             | updated constantly.
             | 
             | You can't just say "I'll be just making everything Unicode
             | aware, backwards compatibility be damned, eh".
        
               | blenderob wrote:
               | But we don't have to make everything Unicode aware.
               | Backward compatibility is indeed very important in C++.
               | Like you rightly said, it still has to work for PDP-11
               | without breaking anything.
               | 
               | But the C++ overlords could always add a new type that is
               | Unicode-aware. Converting one Unicode string to another
               | is a purely in-memory, in-CPU operation. It does not need
               | any I/O and it does not need any interaction with
               | peripherals. So one can dream that such a type along with
               | its conversion routines could be added to an updated
               | standard library without breaking existing code that
               | compiles correctly on PDP-11s.
        
               | bayindirh wrote:
               | > Converting one Unicode string to another is a purely
               | in-memory, in-CPU operation.
               | 
                | ...but it's a complex operation. This is what libICU
                | is mostly for. You can't just look up a single table
                | and map one string to another the way you can with the
                | ASCII table or any other simple encoding.
               | 
                | German has ß uppercasing to SS (or to capital ẞ,
                | depending on the year), Turkish has the i/İ and ı/I
                | pairs, and tons of other languages have other rules.
                | 
                | Especially these ı/I and i/İ pairs break tons of
                | applications in very unexpected ways. I don't remember
                | how many bugs I've reported, and how many workarounds
                | I have implemented in my systems.
               | 
                | Adding a type is nice, but the surrounding machinery
                | is so big that it brings tons of work with it. Unicode
                | is such a complicated system that, I've read, you even
                | need two UTF-16 code units (4 bytes in total) to
                | encode a single character. This is insane (as in
                | complexity; I guess they have their reasons).
        
               | blenderob wrote:
               | Thanks for the reply! Really appreciate the time you have
               | taken to write down a thoughtful reply.
        
               | bayindirh wrote:
               | No problems! If you want a slightly longer write-up,
               | here's a classic I constantly share with people:
               | 
               | https://blog.codinghorror.com/whats-wrong-with-turkey/
        
               | SAI_Peregrinus wrote:
                | > Unicode is such a complicated system that, I've
                | read, you even need two UTF-16 code units (4 bytes in
                | total) to encode a single character. This is insane
                | (as in complexity; I guess they have their reasons).
               | 
               | Because there are more than 65,535 characters. That's
               | just writing systems, not Unicode's fault. Most of the
               | unnecessary complexity of Unicode is legacy
               | compatibility: UTF-16 & UTF-32 are bad ideas that
               | increase complexity, but they predate UTF-8 which
               | actually works decently well so they get kept around for
               | backwards compatibility. Likewise with the need for
               | multiple normalization forms.
        
               | bayindirh wrote:
               | I mean, I already know some Unicode internals and
               | linguistics (since I developed a language-specific
               | compression algorithm back in the day), but I have never
               | seen a single character requiring four bytes (and I know
               | Emoji chaining for skin color, etc.).
               | 
               | So, seeing this just moved the complexity of Unicode one
               | notch up in my head, and I respect the guys who designed
               | and made it work. It was not whining or complaining of
               | any sort. :)
        
               | fluoridation wrote:
               | Cuneiform codepoints are 17 bits long. If you're using
               | UTF-16 you'll need two code units to represent a
               | character.
        
             | gpderetta wrote:
             | Java ended up picking UCS-2 and getting screwed.
        
           | akira2501 wrote:
           | > libicu is 37MB by itself, so it's not something someone can
           | write in a weekend and ship.
           | 
           | Isn't that mostly just from tables derived from the Unicode
           | standard?
        
           | zahlman wrote:
           | >You just can't transform anything developed in "ancient"
           | times to unicode aware in a single swoop.
           | 
            | Even for Python it took well over a decade, and people
            | _still_ complain about the fact that they don't get to
            | treat byte-sequences transparently as text any more - as
            | if they _want_ to wrestle with the `basestring` supertype,
            | getting `UnicodeDecodeError` from an encoding operation or
            | vice-versa, trying to guess the encoding of someone else's
            | data instead of expecting it to be decoded on the other
            | side....
           | 
           | But in C++ (and in C), you have the additional problem that
           | the 8-bit integer type was _named for_ the concept of a
           | character of text, even though it clearly cannot actually
           | represent any such thing. (Not to mention the whole bit about
           | `char` being a separate type from both `signed char` and
           | `unsigned char`, without defined signedness.)
        
           | ectospheno wrote:
           | > Any tool which is old enough will have a thousand ways to
           | do something.
           | 
           | Only because of the strange desire of programmers to never
           | stop. Not every program is a never ending story. Most are
           | short stories their authors bludgeon into a novel.
           | 
           | Programming languages bloat into stupidity for the same
           | reason. Nothing is ever removed. Programmers need editors.
        
             | fluoridation wrote:
             | So how do you design a language that accommodates both the
             | people who need a codebase to be stable for decades and the
             | people who want the bleeding edge all the time, backwards
             | compatibility be damned?
        
               | the_gorilla wrote:
               | You don't. Any language that tries to do both turns into
               | an unusable abomination like C++. Good languages are
               | stable and the bleeding edge is just the "new thing" and
               | not necessarily better than the old thing.
        
               | fluoridation wrote:
                | C++ _doesn't_ try to do that. It aims to remain as
               | backwards compatible as possible, which is what the GP is
               | complaining about.
        
         | Muromec wrote:
         | Well, the only time you can do str lower where unicode locale
         | awareness will be a problem is when you do it on the user
         | input, like names.
         | 
          | How about you just don't? If it's a constant in your code,
          | you probably use ASCII anyway or can do a static mapping. If
          | it's user input - just don't str lower / str upper it.
        
         | pjmlp wrote:
          | Because it is a fight to put anything into an ISO-managed
          | language, and only the strongest persevere long enough to
          | make it happen.
         | 
         | Regardless of what ISO language we are talking about.
        
           | gpderetta wrote:
            | Yes, significantly smaller libraries had a hard time
            | getting into the standard. Getting the equivalent of ICU
            | in would be almost impossible. And good luck keeping it up
            | to date.
        
       | appointment wrote:
        | The key takeaway here is that you can't correctly process a
        | string if you don't know what language it's in. That includes
        | variants of the same language with different rules, e.g.
        | en-US and en-GB or es-MX and es-ES.
       | 
       | If you are handling multilingual text the locale is mandatory
       | metadata.
        
         | zarzavat wrote:
         | Different parts of a string can be in different languages
         | too[1].
         | 
          | The lowercase of "DON'T FUSS ABOUT FUSSBALL" is "don't fuss
          | about fußball". Unless you're in Switzerland.
         | 
         | [1] https://en.wikipedia.org/wiki/Code-switching
        
           | schoen wrote:
            | Probably "don't fuss about Fußball" for the same reasons,
            | right?
        
           | thiht wrote:
            | I thought the German language deprecated the use of ß
            | years ago, no? I learned German for a year and that's what
            | the teacher told us, but maybe it's not the whole story
        
             | 47282847 wrote:
              | Incorrect. ß is still a thing.
        
               | CamperBob2 wrote:
               | Going by what you and the grandparent wrote, it's not
                | just a thing, but two _different_ things: ß ẞ
               | 
               | It is probably time for an Esperanto advocate to show up
               | and set us all straight.
        
               | D-Coder wrote:
               | Pri kio vi parolas? En Esperanto, unu letero egalas unu
               | sonon.
               | 
               | What are you talking about? In Esperanto, one letter
               | equals one sound.
        
               | selenography wrote:
               | > set us all straight.
               | 
                | Se fareblus oni, jam farintus oni. (If it could be
                | done, it would already have been done. It definitely
                | won't happen on an echo-change day like today, either.
                | ;))
               | 
               | Contra my comrade's comment, Esperanto orthography is
               | firmly European, and so retains European-style casing
               | distinctions; every sound thus still has two letters --
               | or at least two codepoints.
               | 
                | (There aren't any eszettesque digraphs, but that's not
               | saying much.)
        
               | TZubiri wrote:
                | Germans run Über Long Term Support dialects
        
       | ahartmetz wrote:
       | ...and that is why you use QString if you are using the Qt
       | framework. QString is a string class that actually does what you
       | want when used in the obvious way. It probably helps that it was
       | mostly created by people with "ASCII+" native languages. Or with
       | customers that expect not exceedingly dumb behavior. The methods
       | are called QString::toUpper() and QString::toLower() and take
       | only the implicit "this" argument, unlike Win32 LCMapStringEx()
       | which takes 5-8 arguments...
        
         | vardump wrote:
         | You just want a banana, but you also get the gorilla. And the
         | jungle.
        
         | cannam wrote:
         | QString::toUpper/toLower are _not_ locale-aware
         | (https://doc.qt.io/qt-6/qstring.html#toLower)
         | 
         | Qt does have a locale-aware equivalent
         | (QLocale::toUpper/toLower) which calls out to ICU if available.
         | Otherwise it falls back to the QString functions, so you have
         | to be confident about how your build is configured. Whether it
         | works or not has very little to do with the design of QString.
        
           | ahartmetz wrote:
           | I don't see a problem with that. You can have it done locale-
           | aware or not and "not" seems like a sane default. QString
            | will uppercase 'ü' to 'Ü' just fine without
            | locale-awareness
           | whereas std::string doesn't handle non-ASCII according to the
           | article. The cases where locale matters are probably very
           | rare and the result will probably be reasonable anyway.
        
         | aetherspawn wrote:
         | I will admit I don't love the Qt licensing model, but most
         | things in Qt just work as they are supposed to, and on every
         | platform too.
        
       | cyxxon wrote:
        | Small nitpick: the example "LATIN SMALL LETTER SHARP S ("ß"
        | U+00DF) uppercases to the two-character sequence "SS"[3]:
        | Straße = STRASSE" is slightly wrong, it seems to me, as we now
        | actually have an uppercase version of it, so it should
        | uppercase to "Latin Capital Letter Sharp S" (U+1E9E). The
        | double-S form is still widely used, though.
        
         | mkayokay wrote:
          | Duden mentions this: "Bei Verwendung von Großbuchstaben
          | steht traditionellerweise SS für ß. In manchen Schriften
          | gibt es aber auch einen entsprechenden Großbuchstaben; seine
          | Verwendung ist fakultativ <§ 25 E3>." (When writing in
          | capital letters, SS traditionally stands for ß. Some
          | typefaces also have a corresponding capital letter; its use
          | is optional.)
          | 
          | But isn't it also dependent on the glyphs available in the
          | font used? So e.g. it needs to be ensured that U+1E9E
          | exists?
        
         | Muromec wrote:
         | But what if you need to uppercase the historical record in a
         | vital records registry from 1950ies, but and OCRed last week?
         | Now you need to not just be locale-aware, but you locale should
         | be versioned.
        
         | pjmlp wrote:
          | Lowering case is even better, because a Swiss user would
          | expect the two-character sequence "SS" to be converted into
          | "ss" and not "ß".
          | 
          | And thus we add a country-specific locale to the party.
        
         | Rygian wrote:
         | The footnote #3 in the article (called as part of your quote)
         | covers the different ways to uppercase ss with more detail.
        
       | serbuvlad wrote:
       | The real insights here are that strings in C++ suck and UTF-16 is
       | extremely unintuitive.
        
         | criddell wrote:
         | Strings in C++ standard library do suck (and C++ is my favorite
         | language).
         | 
         | As for UTF-16, well, I don't know that UTF-8 is a whole lot
         | more intuitive:
         | 
         | > And for UTF-8 data, you have the same issues discussed
         | before: Multibyte characters will not be converted properly,
         | and it breaks for case mappings that alter string lengths.
        
           | recursive wrote:
           | UTF-16 has all the complexity of UTF-8 _plus_ surrogate
           | pairs.
        
             | zahlman wrote:
             | Surrogate pairs aren't more complex than UTF-8's scheme for
             | determining the number of bytes used to represent a code
             | point. (Arguably the logic is slightly simpler.) But the
             | important point is that UTF-16 _pretends to_ be a constant-
             | length encoding while actually having the surrogate-pair
              | loophole - that's because it's a hack on top of UCS-2
             | (which originally worked well enough for Microsoft to get
             | married to; but then the BMP turned out not to be enough
             | code points). UTF-8 is clearly designed from scratch to be
             | a multi-byte encoding (and, while the standard now makes
             | the corresponding sequences illegal, the scheme was
             | designed to be able to support much higher code points - up
             | to 2^42 if we extend the logic all the way; hypothetical
             | 6-byte sequences starting with values FC or FD would neatly
             | map up to 2^31).
        
       | PhilipRoman wrote:
       | Thought this was going to be about and-not-ing bytes with 0x20.
       | Wrong for most inputs but sure as hell faster than anything else.
        
       | high_na_euv wrote:
       | In cpp basic things are hard
        
         | onemoresoop wrote:
         | It's subjective but I find C++ extremely ugly.
        
         | johnnyjeans wrote:
         | nothing about working with locales, or text in general, is
         | basic. we were decades into working with digital computers
         | before we moved past switchboards and LEDs. don't take for
         | granted just how high of a perch upon the shoulders of giants
         | you have. that's exactly how the mistakes in the blog post get
         | made.
        
       | SleepyMyroslav wrote:
       | In gamedev there is simple rule: don't try to do any of that.
       | 
       | If it is text game needs to show to user then every version of
       | the text that is needed is a translated text. Programmer will
       | never know if context or locale will need word order changes or
       | anything complicated. Just trust the translation team.
       | 
        | If text is coming from the user - then change the design
        | until it's not needed to 'convert'. There are major issues
        | just showing the user back what he entered! Because the font
        | for editing and displayed
       | text could be different. Not even mentioning RTL and other
       | issues.
       | 
       | Once ppl learn about localization the questions like why a
       | programming language does not do this 'simple text operation' are
       | just a newcomer detector. :)
        
         | zahlman wrote:
          | >If text is coming from the user - then change the design
          | until it's not needed to 'convert'
         | 
         | In games, you can possibly get away with this. Most other
         | people need to worry about things like string collation
         | (locale-aware sorting) for user-supplied text.
        
         | fluoridation wrote:
         | >Once ppl learn about localization the questions like why a
         | programming language does not do this 'simple text operation'
         | are just a newcomer detector. :)
         | 
         | I think you are purposefully misinterpreting the question.
         | They're not asking about converting the case of any Unicode
         | string with locale sensitivity, they're asking about converting
         | the case of ASCII characters.
         | 
         | What if your game needs to talk to a server and do some string
         | manipulation in between requests? Are you really going to
         | architect everything so that the client doesn't need to handle
         | any of that ever?
        
           | squeaky-clean wrote:
           | > They're not asking about converting the case of any Unicode
           | string with locale sensitivity, they're asking about
           | converting the case of ASCII characters.
           | 
           | I'm confused now. The article specifically mentions issues
           | with UTF-16 and UTF-32 unicode characters outside the basic
           | multilingual plane (BMP).
        
             | fluoridation wrote:
             | I'm referring to the people who call case conversion in
             | general "a simple text operation". Say you have an
             | std::string and you want to make it lower case. If you
             | assume it contains just ASCII that's a simpler operation
             | than if you assume it contains UTF-8, but C++ doesn't
             | provide a single function that does either of them. A
             | person can rightly complain that the former is a basic
             | functionality that the language should include; personally,
             | I would agree. And you could say "wow, doesn't this person
             | realize that case conversion in Unicode is actually
             | complicated? They must be really inexperienced." It could
             | be that the other person really doesn't know about Unicode,
             | or it could mean that you and them are thinking about
             | entirely different problems and you're being judgemental a
             | bit too eagerly.
        
               | squeaky-clean wrote:
               | For ascii in C++ isn't there std::tolower / std::toupper?
               | If you're not dealing with unsigned char types there
               | isn't a simple case conversion function, but that's for a
               | good reason as the article lays out.
        
               | fluoridation wrote:
               | Those functions take and return single characters. What's
               | missing is functions that operate on strings. You can use
               | them in combination with std::transform(), but as the
               | article points out, even if you're just dealing with
               | ASCII you can easily do it wrong. I've been using C++ for
               | over 20 years and I didn't know tolower() and toupper()
               | were non-addressable. There's really no excuse for the
               | library not having simple case conversion functions that
               | operate on strings in-place.
        
           | SleepyMyroslav wrote:
           | >What if your game needs to talk to a server and do some
           | string manipulation in between requests? Are you really going
           | to architect everything so that the client doesn't need to
           | handle any of that ever?
           | 
            | Of course! Your string manipulations with user-entered
            | attributes like display names or chat messages are 1
            | millimeter away from good old SQL 'Bobby; DROP TABLE
            | students'. Never ever do that if you can avoid it. Every
            | time someone 'just concatenates' two strings, e.g. to add
            | a 'symbol that represents an input button', the programmer
            | introduces a bug that will be both annoying and wrong.
            | Games should use substitution patterns guided by the
            | translation team, because there is no ASCII-only culture
            | among the ~15 locales typically supported by big
            | publishers.
            | 
            | There are exceptions, like platform-provided services to
            | filter banned words in chat. And even there you don't have
            | to do 'things with ASCII characters'. Yeah, players will
            | input unsupported symbols everywhere they can, and you
            | need good replacement characters for those, plus regular
            | fixes for popular emoji support. That is expected by
            | communities now.
        
         | beeboobaa3 wrote:
         | > There are major issues just to show user back what he
         | entered! Because the font for editing and displayed text could
         | be different. Not even mentioning RTL and other issues.
         | 
         | Your web browser is doing it right now as you are reading this
         | comment.
        
           | rty32 wrote:
           | And web development is not game development? And chances are
           | that games don't ship chromium with them?
        
       | HPsquared wrote:
       | I thought this was going to be about adding or subtracting 32.
       | Old school.
        
       | the_gorilla wrote:
       | Why are some functions addressable in C++ and others not? Seems
       | like a pointless design oversight.
        
         | bialpio wrote:
         | Footnote in the article provides the following explanation:
         | "The standard imposes this limitation because the
         | implementation may need to add default function parameters,
         | template default parameters, or overloads in order to
         | accomplish the various requirements of the standard."
        
       | flareback wrote:
       | He gave 4 examples of how it's done incorrectly, but zero actual
       | examples of doing it correctly.
        
         | commandlinefan wrote:
          | for (int i = 0; i < strlen(s); i++) {
          |     s[i] ^= 0x20;
          | }
        
           | vardump wrote:
            | Surely you meant:
            | 
            |     s[i] &= ~0x20;
            | 
            | We're talking about converting to upper case after all! As
            | an added benefit, every space character (0x20) is now a
            | NUL byte!
        
           | calibas wrote:
           | Thank you for this universal approach. I can now toggle
           | capitalization on/off for any character, instead of just
           | being limited to alphabetic ones!
           | 
           | Jokes aside, I was kinda hoping for a good answer that
           | doesn't rely on a Windows API or an external library, but I'm
           | not sure there is one. It's a rather complex problem when you
           | account for more than just ASCII and the English language.
        
             | TZubiri wrote:
             | Next up, check out our vector addition implementation of
             | Hello+World. Spoiler alert, the result is Zalgo
        
         | TheGeminon wrote:
         | > Okay, so those are the problems. What's the solution?
         | 
         | > If you need to perform a case mapping on a string, you can
         | use LCMapStringEx with LCMAP_LOWERCASE or LCMAP_UPPERCASE,
         | possibly with other flags like LCMAP_LINGUISTIC_CASING. If you
         | use the International Components for Unicode (ICU) library, you
         | can use u_strToUpper and u_strToLower.
        
       | PoignardAzur wrote:
       | So I'm going to be that guy and say it:
       | 
       | Man, I'm happy we don't need to deal with this crap in Rust, and
       | we can just use String::to_lowercase. Not having to worry about
       | things makes coding fun.
        
       | himinlomax wrote:
        | > And in certain forms of the French language, capitalizing
        | an accented character causes the accent to be dropped: à
        | Paris = A PARIS.
       | 
        | That's incorrect: using diacritics on capital letters is
        | always the preferred form; it's just that dropping them is
        | acceptable, as it was often done for technical reasons.
        
       | codr7 wrote:
       | C++, where every line of code is a book waiting to be written.
        
       ___________________________________________________________________
       (page generated 2024-10-08 23:00 UTC)