[HN Gopher] A popular but wrong way to convert a string to upper...
___________________________________________________________________
A popular but wrong way to convert a string to uppercase or
lowercase
Author : ingve
Score : 111 points
Date : 2024-10-08 08:11 UTC (14 hours ago)
(HTM) web link (devblogs.microsoft.com)
(TXT) w3m dump (devblogs.microsoft.com)
| vardump wrote:
| As always, Raymond is right. (And as usual, I could guess it's
| him before even clicking the link.)
|
| That said, 99% of the time when doing an upper- or lowercase
| operation you're interested only in the 7-bit ASCII range of
| characters.
|
| For the remaining 1%, there's ICU library. Just like Raymond Chen
| mentioned.
| fhars wrote:
| No, when you are doing string manipulation, you are almost
| never interested in just the seven-bit ASCII range, as there is
| almost no language that can be written using just that.
| daemin wrote:
| I would argue that for most programs when you're doing string
| manipulation you're doing it for internal programming reasons
| - logs, error messages, etc. In that case you are in nearly
| full control of the strings and therefore can declare that
| you're only working with ASCII.
|
| The other normal cases of string usage are file paths and
| user interface, and the needed operations can be done with
| simple string functions, and even in UTF8 encoding the
| characters you care about are in the ASCII range. With file
| paths, the manipulations you're most often doing are path-
| based, so you only care about the '/', '\', ':', and '.' ASCII
| characters. With user interface elements you're likely to be
| using them as just static data and only substituting values
| into placeholders when necessary.
| pistoleer wrote:
| > I would argue that for most programs when you're doing
| string manipulation you're doing it for internal
| programming reasons - logs, error messages, etc. In that
| case you are in nearly full control of the strings and
| therefore can declare that you're only working with ASCII.
|
| Why would you argue that? In my experience it's about
| formatting things that are addressed to the user, where the
| hardest and most annoying localization problems matter a
| lot. That includes sorting the last name "van den Berg"
| just after "Bakker", stylizing it as "Berg, van den", and
| making sure this capitalization is correct and not "Van Den
| Berg". There is no built in standard library function in
| any language that does any of that. It's so much larger
| than ASCII and even larger than Unicode.
|
| Another user said that the main takeaway is that you can't
| process strings until you know their language (locale), and
| that is exactly correct.
| daemin wrote:
| I would maintain that your program has more string
| manipulation for error messages and logging than for
| generating localised formatted names.
|
| Further, I'd say that if you're creating text for
| presenting to the user then the most common operation
| would be replacement of some field in pre-defined text.
|
| In your case I would design it so that the correctly
| capitalised first name, surname, and variations of those
| for sorting would be generated at the data entry point
| (manually or automatically) and then just used when
| needed in user facing text generation. Therefore the only
| string operation needed would be replacement of
| placeholders like the fmt and standard library provide.
| This uses more memory and storage but these are cheaper
| now.
| pistoleer wrote:
| I agree, but the logging formatters don't really do much
| beyond trivially pasting in placeholders.
|
| And as for data entry... Maybe in an ideal world. In the
| current world, marred by importing previously mangled
| datasets, a common solution in the few companies I've
| worked at is to just not do anything, which leaves ugly
| edges, yet is "good enough".
| BoringTimesGang wrote:
| Now double all of that effort, so you can get it to work
| with Windows' UTF-16 wstrings.
| heisenzombie wrote:
| File paths? I think filesystem paths are generally "bags of
| bytes" that the OS might interpret as UTF-16 (Windows) or
| UTF-8 (macOS, Linux).
|
| For example:
| https://en.m.wikipedia.org/wiki/Program_Files#Localization
| vardump wrote:
| File paths are scary. The last I checked (which is
| admittedly a while ago), Windows didn't for example care
| about correct UTF-16 surrogate pairs at all, it'd happily
| accept invalid UTF-16 strings.
|
| So use standard string processing libraries on path names
| at your own peril.
|
| It's a good idea to consider file paths as a bag of
| bytes.
| Someone wrote:
| > It's a good idea to consider file paths as a bag of
| bytes
|
| (Nitpick: _sequence_ of bytes)
|
| Also very limiting. If you do that, you can't, for
| example, show a file name to the user as a string or
| easily use a shell to process data in your file system
| (do you type "/bin" or "\x2F\x62\x69\x6E"?)
|
| Unix, from the start, claimed file names were byte
| sequences, yet assumed many of those to encode ASCII.
|
| That's part of why Plan 9 made the choice _"names may
| contain any printable character (that is, any character
| outside hexadecimal 00-1F and 80-9F)"_
| (https://9fans.github.io/plan9port/man/man9/intro.html)
| daemin wrote:
| That's what I mean, you treat filesystem paths as bags of
| bytes separated by known ASCII characters, as the only
| path manipulation that you generally need to do is to
| append a path, remove a path, change extension, things
| that only care about those ASCII characters. You only
| modify the path strings at those known characters and
| leave everything in between as is (with some exceptions
| using OS API specific functions as needed).
| netsharc wrote:
| IIRC, the FAT filesystem (before Windows 95) allowed
| lowercase letters, but there was a layer in the filesystem
| driver that converted everything to uppercase, e.g. if
| you did the command "more readme.txt", the more command
| would ask the filesystem for "readme.txt" and it would
| search for "README.TXT" in the file allocation table.
|
| I think I once hex-edited the FA-table to change a
| filename to have a lowercase name (or maybe it was disk
| corruption), trying to delete that file didn't work
| because it would be trying to delete "FOO", and couldn't
| find it because the file was named "FOo".
| vardump wrote:
| > as there is almost no language that can be written using
| just that.
|
| 99% of use cases I've seen have nothing to do with human
| language.
|
| The 1% human-language case needs to be handled properly
| using a proper Unicode library.
|
| Your mileage (percentages) may vary depending on your job.
| 9dev wrote:
| It's funny how software developers live in bubbles so much.
| Whether you deal with human language a lot or almost not at
| all depends entirely on your specific domain. Anyone
| working on user interfaces of any kind must accommodate
| proper encoding, for example; that includes pretty much
| every line-of-business app out there, which is _a lot of
| code_.
| inexcf wrote:
| Why do you need upper- or lowercase conversion in cases
| that have nothing to do with human language?
| vardump wrote:
| Here's an example. Hypothetically say you want to build
| an HTML parser.
|
| You might encounter tags like <html>, <HTML>, <Html>,
| etc., but you want to perform a hash table lookup.
|
| So first you're going to normalize to either lower- or
| uppercase.
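As a minimal sketch of that normalization (names here are hypothetical): fold only the ASCII letters A-Z, which matches what HTML does for tag names, and leave everything else alone.

```python
# Hypothetical sketch: ASCII-only case folding for tag-name lookup.
# Only A-Z is folded; locale-sensitive letters are deliberately left
# untouched, so the result is stable across locales.

def ascii_lower(s: str) -> str:
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

handlers = {"html": "handle_html", "body": "handle_body"}  # lookup table

print(ascii_lower("HTML"))       # html
print(ascii_lower("Html"))       # html
print(ascii_lower("İ") == "İ")   # True: no Turkish-İ surprise
```

Using str.lower() here instead would be subtly different: it maps the dotted capital İ to 'i' plus a combining dot, which an ASCII-only fold avoids.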
| Muromec wrote:
| But but, I want to have a custom web component and
| register it under my own name, which can only be properly
| written in Ukrainian Cyrillic. How dare you not let me
| have it.
| inexcf wrote:
| Ah, I see, we disagree on what is "human language". An
| abbreviation like HTML and its different capitalisations
| sound to me a lot like a feature of human language.
| recursive wrote:
| Is this a serious argument? Humans don't directly use
| HTML to communicate with each other. It's a document
| markup language rendered by user agents, developed
| against a specification.
| tannhaeuser wrote:
| Markup languages and SGML in particular absolutely are
| designed for digital text communication by humans and to
| be written using plain text editors; it's kindof the
| entire point of avoiding binary data constructs.
|
| And to GP, SGML/HTML actually has a facility to define
| uppercasing rules beyond ASCII, namely the LCNMSTRT,
| UCNMSTRT, LCNMCHAR, UCNMCHAR options in the SYNTAX NAMING
| section in the SGML declaration introduced in the
| "Extended Naming Rules" revision of ISO 8879 (SGML std,
| cf. https://sgmljs.net/docs/sgmlrefman.html). Like
| basically everything else on this level, these rules are
| still used by HTML 5 to this date; in particular, while
| element names can contain arbitrary characters, only those
| in the IRV (ASCII) get case-folded for canonization.
| ARandumGuy wrote:
| Converting string case is almost never something you want
| to do for text that's displayed to the end user, but
| there are many situations where you need to do it
| internally. Generally when the spec is case insensitive,
| but you still need to verify or organize things using
| string comparison.
| kergonath wrote:
| Right. That's why I still get mail with my name mangled and
| my street name barely recognisable. Because I'm in the 1%.
| Too bad for me...
|
| In all seriousness, though, in the real world ASCII works
| only for a subset of a handful of languages. The vast
| majority of the population does not read or write any
| English in their day to day lives. As far as end users are
| concerned, you should probably swap your percentages.
|
| ASCII is mostly fine within your programs like the parser
| you mention in your other comment. But even then, it's
| better if a Chinese user name does not break your reporting
| or logging systems or your parser, so it's still a good
| idea to take Unicode seriously. Otherwise, anything that
| comes from a user or gets out of the program needs to
| behave.
| vardump wrote:
| I said use a Unicode library if input data is actual
| human language. Which names and addresses are.
|
| 99% case being ASCII data generated by other software of
| unknown provenance. (Or sometimes by humans, but it's
| still data for machines, not for humans.)
| kergonath wrote:
| I am really not sure about this 99%. A lot of programs
| deal with quite a lot of user-provided data, which you
| don't control.
| Muromec wrote:
| Who and why still tries to lowercase/uppercase names?
| Please tell them to stop.
| kergonath wrote:
| Hell if I know. I don't know what kind of abomination
| e-commerce websites run on their backend, I just see the
| consequences.
| elpocko wrote:
| Every search feature everywhere has to be case-insensitive
| or it's unusable. Search seems like a pretty ubiquitous
| feature in a lot of software, and has to work regardless of
| locale/encoding.
| sebstefan wrote:
| Yes please, keep making software that mangles my actual last
| name at every step of the way. 99% of the world loves it when
| you only care about the USA.
| Muromec wrote:
| If it needs to uppercase names it probably interfaces with
| something forsaken like Sabre/Amadeus that only understands
| ASCII anyway.
|
| The real problem is accepting non-ASCII input from user where
| you later assume it's ASCII-only and safe to bitfuck around.
| sebstefan wrote:
| From experience anything banking adjacent will usually fuck
| it up as well
|
| For some reason they have a hard-on for putting last names
| in capital letters and they still have systems in place
| that use ASCII
| Muromec wrote:
| If it uses ASCII anyway, what's the problem then? Don't
| accept non-ASCII user input.
| sebstefan wrote:
| First off: And exclude 70% of the world?
|
| Usually they'll accept it, but some parts of the backend
| are still running code from the 60's.
|
| So you get your name rendered properly on the web
| interface, and most core features, but one day you're
| wandering off from the beaten path, by, like, requesting
| some insurance contract, and you'll see your name at the
| top with some characters mangled, depending on what your
| name's like. Mine is just accented Latin characters, so it
| usually drops the accents; not sure how it would work if
| your name was in an entirely different alphabet
| Muromec wrote:
| >First off: And exclude 70% of the world?
|
| Guess what, I'm part of this 70% and I also work in a
| bank and I know exactly how.
|
| Not a single letter in my name (any of them) can be
| represented with ASCII. When it is represented in UTF-8,
| most of the people who have to see it can't read it
| anyway.
|
| So my identity document issued by the country which
| doesn't use Latin alphabet includes ASCII-representation
| of my name in addition to canonical form in Ukrainian
| Cyrillic. That ASCII-rendering is happily accepted by all
| kinds of systems that only speak ASCII.
|
| People still can't pronounce it and it got misspelled
| like _yesterday_ when dictated over the phone.
|
| Now regarding the accents, it's illegal to not support
| them per GDPR (as per case law, discussed here a few years
| ago).
| InfamousRece wrote:
| Some systems are still using EBCDIC.
| crazygringo wrote:
| > _That said, 99% of the time when doing an upper- or
| lowercase operation you're interested just in the 7-bit
| ASCII range of characters._
|
| I think it's more the exact opposite.
|
| The only times I'm dealing with 7-bit ASCII is for internal
| identifiers like variable names or API endpoints. Which is a
| lot of the time, but I can't ever think of when I've needed my
| code to change their case. It might literally be never.
|
| On the other hand, needing to switch between upper, lower, and
| title case happens all the time, always with people's names and
| article titles and product names and whatnot. Which are _never_
| in ASCII because this isn 't 1990.
| hinkley wrote:
| And you could argue that if the internal identifiers need to
| be capitalized or lower-cased, you've already lost.
|
| On an enterprise app these little string manipulations are a
| drop in the bucket. In a game they might not be. Sort that
| stuff out at compile time, or commit time.
| blenderob wrote:
| Issues like this are why I gave up on C++. There are so
| many ways to do something and every way is freaking wrong!
|
| An acceptable solution is given at the end of the article:
|
| > If you use the International Components for Unicode (ICU)
| library, you can use u_strToUpper and u_strToLower.
|
| Makes you wonder why this isn't part of the C++ standard library
| itself. Every revision of the C++ standard brings with itself
| more syntax and more complexity in the language. But as a user of
| C++ I don't need more syntax and more complexity in the language.
| But I do need more standard library functions that solve these
| ordinary real-world programming problems.
| BoringTimesGang wrote:
| >It is issues like this due to which I gave up on C++. There
| are so many ways to do something and every way is freaking
| wrong!
|
| These are mostly unicode or linguistics problems.
| tralarpa wrote:
| The fact that the standard library works against you doesn't
| help (to_lower takes an int, but only kind of works
| (sometimes) correctly on unsigned char, and wchar_t is
| implicitly promoted to int).
| BoringTimesGang wrote:
| to_lower is in the std namespace but is actually just part
| of the C89 standard, meaning it predates both UTF8 and
| UTF16. Is the alternative that it should be made unusable,
| and more existing code broken? A modern user has to include
| one of the c-prefix headers to use it, already hinting to
| them that 'here be dragons'.
|
| But there are always dragons. It's strings. The mere
| assumption that they can be transformed int-by-int,
| irrespective of encoding, is wrong. As is the assumption
| that a sensible transformation to lower case without error
| handling _exists_.
| pistoleer wrote:
| > There are so many ways to do something and every way is
| freaking wrong!
|
| That's life! The perfect way does not exist. The best you can
| do is be aware of the tradeoffs, and languages like C++
| absolutely throw them in your face at every single opportunity.
| It's fatiguing, and writing in javascript or python allows us
| to uphold the facade that everything is okay and that we don't
| have to worry about a thing.
| pornel wrote:
| JS and Python are still old enough to have been created when
| Unicode was in its infancy, so they have their own share of
| problems from using UCS-2 (such as indexing strings by what
| is now a UTF-16 code unit, rather than by a codepoint or a
| grapheme cluster).
|
| Swift has been developed in the modern times, and it's able
| to tackle Unicode properly, e.g. makes distinction between
| codepoints and grapheme clusters, and steers users away from
| random-access indexing and having a single (incorrect) notion
| of a string length.
| bayindirh wrote:
| I don't think it's a C++ problem. You just can't transform
| anything developed in "ancient" times to be Unicode-aware in a
| single swoop.
|
| On the other hand, libicu is 37MB by itself, so it's not
| something someone can write in a weekend and ship.
|
| Any tool which is old enough will have a thousand ways to do
| something. This is the inevitability of software and
| programming languages. In the domain of C++, which is of a
| mammoth size now, everyone expects this huge pony to learn new
| tricks, but everybody has a different idea of the "new tricks",
| so more features are added on top of its already impressive and
| very long list of features and capabilities.
|
| You want libICU built-in? There must be other folks who want
| that too. So you may need to find them and work with them to
| make your dream a reality.
|
| So, C++ is doing fine. It's not that they omitted Unicode
| during the design phase. Unicode arrived later, and it has to
| be integrated by other means. This is what libraries are for.
| pornel wrote:
| Being developed in, _and having to stay compatible with_,
| ancient times is a real problem of C++.
|
| The now-invalid assumptions couldn't have been avoided 50
| years ago. Fixing them now in C++ is difficult or impossible,
| but still, the end result is a ton of brokenness baked into
| C++.
|
| Languages developed in the 21st century typically have some
| at least half-decent Unicode support built-in. Unicode is big
| and complex, but there's a lot that a language can do to at
| least not silently destroy the encoding.
| cm2187 wrote:
| That explains why there are two functions, one for ascii
| and one for unicode. That doesn't explain why the unicode
| functions are hard to use (per the article).
| BoringTimesGang wrote:
| Because human language is hard to boil down to a simple
| computing model and the problem is underdefined, based on
| naive assumptions.
|
| Or perhaps I should say naïve.
| cm2187 wrote:
| Well pretty much every other more recent language solved
| that problem.
| kccqzy wrote:
| Almost no programming language, perhaps other than Swift,
| solved that problem. Just use the article's examples as
| test cases. It's just as wrong as the C++ version in the
| article, except it's wrong with nicer syntax.
| zahlman wrote:
| Python's strings have uppercase, lowercase and case-
| folding methods that don't choke on this. They don't use
| UTF-16 internally (they _can_ use UCS-2 for strings whose
| code points will fit in that range; while a string might
| store code points from the surrogate-pair range, they're
| never interpreted as surrogate pairs, but instead as an
| error encoding so that e.g. invalid UTF-8 can be round-
| tripped) so they're never worried about surrogate pairs,
| and it knows a few things about localized text casing:
|     >>> 'ß'.upper()
|     'SS'
|     >>> 'ß'.lower()
|     'ß'
|     >>> 'ß'.casefold()
|     'ss'
|
| There are a lot of really complicated tasks for Unicode
| strings. String casing isn't really one of them.
|
| (No, Python can't turn 'SS' back into 'ß'. But doing
| that requires metadata about language that a string
| simply doesn't represent.)
| kccqzy wrote:
| Still breaks on, for example, Turkish i vs I. It's
| impossible to do correctly without language information.
|
| > (No, Python can't turn 'SS' back into 'ß'. But doing
| that requires metadata about language that a string
| simply doesn't represent.)
|
| Yes that's my point. Because in typical languages strings
| don't store language metadata, this is impossible to do
| correctly in general.
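Python illustrates the point: its case mappings are locale-independent, so Turkish text comes out wrong no matter what. A small demonstration (not a fix):

```python
# Python's str.lower() is locale-independent. For Turkish this is
# wrong twice over: dotted capital İ lowercases to 'i' + U+0307
# COMBINING DOT ABOVE (two code points), and plain I lowercases to
# 'i', never to the dotless 'ı' a Turkish locale would require.

print(len("İ".lower()))                      # 2 code points, not 1
print([hex(ord(c)) for c in "İ".lower()])
print("I".lower())                           # 'i', not 'ı'
```

Getting the Turkish mappings right needs a locale-aware API (e.g. ICU), which is exactly the language metadata a bare string doesn't carry.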
| zahlman wrote:
| I'm not seeing anything in the Swift documentation about
| strings carrying language metadata, either, though?
| kccqzy wrote:
| This lowercase function takes a locale argument:
| https://developer.apple.com/documentation/foundation/nsstrin...
|
| It looks like an old NSString method that's available in
| both Obj-C and Swift.
|
| The casefold function is even older than that:
| https://developer.apple.com/documentation/foundation/nsstrin...
| Its documentation specifically includes a discussion of the
| Turkish İ/I issue.
| tedunangst wrote:
| But that's wrong. The upper case for ß is ẞ.
| IncreasePosts wrote:
| That was only adopted in Germany like 7 years ago!
| kccqzy wrote:
| Well, languages and conventions change. The € sign was
| added not that long ago and it was somewhat painful. The
| Chinese language uses a single character to refer to
| chemical elements so when IUPAC names new elements they
| will invent new characters. Etc.
| cm2187 wrote:
| C#'s "ToUpper" takes an optional CultureInfo argument if
| you want to play around with how to treat different
| languages. Again, solved problem decades ago.
| tialaramex wrote:
| Rust will cheerfully:
| assert_eq!("οδυσσευς", "ΟΔΥΣΣΕΥΣ".to_lowercase());
|
| [Notice that this is in fact entirely impossible with the
| naive strategy since Greek cares about position of
| symbols]
|
| Some of the latter examples aren't cases where a
| programming language or library should just "do the right
| thing" but cases of ambiguity where you need locale
| information to decide what's appropriate, which isn't
| "just as wrong as the C++ version" it's a whole other
| problem. It isn't _wrong_ to capitalise A-acute as a
| capital A-acute, it's just _not always appropriate_
| depending on the locale.
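Python's str.lower() implements the same rule, Final_Sigma being the one context-sensitive mapping in Unicode's default (locale-free) case algorithm:

```python
# Unicode's Final_Sigma rule: capital Σ lowercases to ς at the end
# of a word and to σ elsewhere. Python's str.lower() applies it.

print("ΟΔΥΣΣΕΥΣ".lower())   # medial sigmas become σ, the final one ς
print("ΣΣ".lower())
```

This is a useful contrast to the locale cases: position-dependent sigma is decidable from the string alone, so libraries can and do get it right without locale input.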
| MBCook wrote:
| So what?
|
| That doesn't prevent adding a new function that converts
| an entire string to upper or lowercase in a Unicode aware
| way.
|
| What would be wrong with adding new correct functions to
| the standard library to make this easy? There are already
| namespaces in C++ so you don't even have to worry about
| collisions.
|
| That's the problem I see. It's fine if you have a history
| of stuff that's not that great in hindsight. But what's
| wrong with having a better standard library going
| forward?
|
| It's not like this is an esoteric thing.
| relaxing wrote:
| It's been 30 years. Unicode predates C++98. Java saw the
| writing on the wall. There's no excuse.
| bayindirh wrote:
| > There's no excuse.
|
| I politely disagree. None of the programming languages that
| started integrating Unicode were targeting everything from
| bare metal to GUI, incl. embedded and OS development, at the
| same time.
|
| C++ has a much larger target area than most other
| programming languages. There are widely used libraries
| which compile correctly on PDP-11s, even if they are
| updated constantly.
|
| You can't just say "I'll be just making everything Unicode
| aware, backwards compatibility be damned, eh".
| blenderob wrote:
| But we don't have to make everything Unicode aware.
| Backward compatibility is indeed very important in C++.
| Like you rightly said, it still has to work for PDP-11
| without breaking anything.
|
| But the C++ overlords could always add a new type that is
| Unicode-aware. Converting one Unicode string to another
| is a purely in-memory, in-CPU operation. It does not need
| any I/O and it does not need any interaction with
| peripherals. So one can dream that such a type along with
| its conversion routines could be added to an updated
| standard library without breaking existing code that
| compiles correctly on PDP-11s.
| bayindirh wrote:
| > Converting one Unicode string to another is a purely
| in-memory, in-CPU operation.
|
| ...but it's a complex operation. This is what libICU is
| mostly for. You can't just look up a single table and
| convert a string to another like you can with the ASCII
| table or any other simple encoding.
|
| Germans have their ß to SS (or capital ẞ, depending on
| the year), Turkish has its ı/I and i/İ pairs, and tons of
| other languages have other rules.
|
| Especially these ı/I and i/İ pairs break tons of
| applications in very unexpected ways. I don't remember how
| many bugs I reported, and how many workarounds I have
| implemented in my systems.
|
| Adding a type is nice, but the surrounding machinery is
| so big, it brings tons of work with itself. Unicode is
| such a complicated system that I read you even need two
| UTF-16 code units (4 bytes in total) to encode a single
| character. This is insane (as in complexity, I guess they
| have their reasons).
| blenderob wrote:
| Thanks for the reply! Really appreciate the time you have
| taken to write down a thoughtful reply.
| bayindirh wrote:
| No problems! If you want a slightly longer write-up,
| here's a classic I constantly share with people:
|
| https://blog.codinghorror.com/whats-wrong-with-turkey/
| SAI_Peregrinus wrote:
| > Unicode is such a complicated system that I read you even
| need two UTF-16 code units (4 bytes in total) to encode a
| single character. This is insane (as in complexity, I guess
| they have their reasons).
|
| Because there are more than 65,535 characters. That's
| just writing systems, not Unicode's fault. Most of the
| unnecessary complexity of Unicode is legacy
| compatibility: UTF-16 & UTF-32 are bad ideas that
| increase complexity, but they predate UTF-8 which
| actually works decently well so they get kept around for
| backwards compatibility. Likewise with the need for
| multiple normalization forms.
| bayindirh wrote:
| I mean, I already know some Unicode internals and
| linguistics (since I developed a language-specific
| compression algorithm back in the day), but I have never
| seen a single character requiring four bytes (and I know
| Emoji chaining for skin color, etc.).
|
| So, seeing this just moved the complexity of Unicode one
| notch up in my head, and I respect the guys who designed
| and made it work. It was not whining or complaining of
| any sort. :)
| fluoridation wrote:
| Cuneiform codepoints are 17 bits long. If you're using
| UTF-16 you'll need two code units to represent a
| character.
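A quick check (in Python) of what that looks like in practice:

```python
# A code point above U+FFFF, e.g. CUNEIFORM SIGN A (U+12000), needs a
# surrogate pair in UTF-16: two 16-bit code units, four bytes total.

sign_a = "\U00012000"
print(hex(ord(sign_a)))                  # 0x12000: needs 17 bits
print(len(sign_a.encode("utf-16-le")))   # 4 bytes (two code units)
print(len(sign_a.encode("utf-8")))       # 4 bytes as well
```

UTF-8 also spends four bytes here, but with ordinary multi-byte sequences rather than the surrogate-pair machinery.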
| gpderetta wrote:
| Java ended up picking UCS-2 and getting screwed.
| akira2501 wrote:
| > libicu is 37MB by itself, so it's not something someone can
| write in a weekend and ship.
|
| Isn't that mostly just from tables derived from the Unicode
| standard?
| zahlman wrote:
| >You just can't transform anything developed in "ancient"
| times to unicode aware in a single swoop.
|
| Even for Python it took well over a decade, and people
| _still_ complain about the fact that they don't get to treat
| byte-sequences transparently as text any more - as if they
| _want_ to wrestle with the `basestring` supertype, getting
| `UnicodeDecodeError` from an encoding operation or vice-
| versa, trying to guess the encoding of someone else's data
| instead of expecting it to be decoded on the other side....
|
| But in C++ (and in C), you have the additional problem that
| the 8-bit integer type was _named for_ the concept of a
| character of text, even though it clearly cannot actually
| represent any such thing. (Not to mention the whole bit about
| `char` being a separate type from both `signed char` and
| `unsigned char`, without defined signedness.)
| ectospheno wrote:
| > Any tool which is old enough will have a thousand ways to
| do something.
|
| Only because of the strange desire of programmers to never
| stop. Not every program is a never ending story. Most are
| short stories their authors bludgeon into a novel.
|
| Programming languages bloat into stupidity for the same
| reason. Nothing is ever removed. Programmers need editors.
| fluoridation wrote:
| So how do you design a language that accommodates both the
| people who need a codebase to be stable for decades and the
| people who want the bleeding edge all the time, backwards
| compatibility be damned?
| the_gorilla wrote:
| You don't. Any language that tries to do both turns into
| an unusable abomination like C++. Good languages are
| stable and the bleeding edge is just the "new thing" and
| not necessarily better than the old thing.
| fluoridation wrote:
| C++ _doesn't_ try to do that. It aims to remain as
| backwards compatible as possible, which is what the GP is
| complaining about.
| Muromec wrote:
| Well, the only time you can do str lower where unicode locale
| awareness will be a problem is when you do it on the user
| input, like names.
|
| How about you just don't? If it's a constant in your code, you
| probably use ASCII anyway or can do a static mapping. If it's
| user input -- just don't str lower / str upper it.
| pjmlp wrote:
| Because it is a fight to put anything on a ISO managed
| language, and only the strongest persevere long enough to make
| it happen.
|
| Regardless of what ISO language we are talking about.
| gpderetta wrote:
| Yes, significantly smaller libraries had a hard time getting
| onto the standard. Getting the equivalent of ICU would be
| almost impossible. And good luck keeping it up to date.
| appointment wrote:
| The key takeaway here is that you can't correctly process a
| string if you don't know what language it's in. That includes
| variants of the same language with different rules, e.g.
| en-US and en-UK or es-MX and es-ES.
|
| If you are handling multilingual text the locale is mandatory
| metadata.
| zarzavat wrote:
| Different parts of a string can be in different languages
| too[1].
|
| The lowercase of "DON'T FUSS ABOUT FUSSBALL" is "don't fuss
| about fußball". Unless you're in Switzerland.
|
| [1] https://en.wikipedia.org/wiki/Code-switching
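The round trip is lossy, as a quick Python check shows (Python uppercases ß to "SS" per the default Unicode mapping):

```python
# Uppercasing 'ß' produces 'SS', and lowercasing cannot undo it:
# there is no way to know whether 'SS' came from 'ss' or 'ß'.

up = "Fußball".upper()
print(up)           # FUSSBALL
print(up.lower())   # fussball -- the ß is gone
```

Recovering "fußball" would need a German dictionary or locale knowledge, which is exactly the metadata a plain string lacks.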
| schoen wrote:
| Probably "don't fuss about Fußball" for the same reasons,
| right?
| thiht wrote:
| I thought the German language deprecated the use of ß years
| ago, no? I learned German for a year and that's what the
| teacher told us, but maybe it's not the whole story
| 47282847 wrote:
| Incorrect. ß is still a thing.
| CamperBob2 wrote:
| Going by what you and the grandparent wrote, it's not
| just a thing, but two _different_ things: ẞ ß
|
| It is probably time for an Esperanto advocate to show up
| and set us all straight.
| D-Coder wrote:
| Pri kio vi parolas? En Esperanto, unu letero egalas unu
| sonon.
|
| What are you talking about? In Esperanto, one letter
| equals one sound.
| selenography wrote:
| > set us all straight.
|
| Se fareblus oni, jam farintus oni. (It definitely won't
| happen on an echo-change day like today, either. ;))
|
| Contra my comrade's comment, Esperanto orthography is
| firmly European, and so retains European-style casing
| distinctions; every sound thus still has two letters --
| or at least two codepoints.
|
| (There aren't any eszettesque bigraphs, but that's not
| saying much.)
| TZubiri wrote:
| Germans run Über Long Term Support dialects
| ahartmetz wrote:
| ...and that is why you use QString if you are using the Qt
| framework. QString is a string class that actually does what you
| want when used in the obvious way. It probably helps that it was
| mostly created by people with "ASCII+" native languages. Or with
| customers that expect not exceedingly dumb behavior. The methods
| are called QString::toUpper() and QString::toLower() and take
| only the implicit "this" argument, unlike Win32 LCMapStringEx()
| which takes 5-8 arguments...
| vardump wrote:
| You just want a banana, but you also get the gorilla. And the
| jungle.
| cannam wrote:
| QString::toUpper/toLower are _not_ locale-aware
| (https://doc.qt.io/qt-6/qstring.html#toLower)
|
| Qt does have a locale-aware equivalent
| (QLocale::toUpper/toLower) which calls out to ICU if available.
| Otherwise it falls back to the QString functions, so you have
| to be confident about how your build is configured. Whether it
| works or not has very little to do with the design of QString.
| ahartmetz wrote:
| I don't see a problem with that. You can have it done locale-
| aware or not and "not" seems like a sane default. QString
| will uppercase 'ü' to 'Ü' just fine without locale-awareness
| whereas std::string doesn't handle non-ASCII according to the
| article. The cases where locale matters are probably very
| rare and the result will probably be reasonable anyway.
| aetherspawn wrote:
| I will admit I don't love the Qt licensing model, but most
| things in Qt just work as they are supposed to, and on every
| platform too.
| cyxxon wrote:
| Small nitpick: the example "LATIN SMALL LETTER SHARP S ("ß"
| U+00DF) uppercases to the two-character sequence "SS":[3]
| Straße = STRASSE" is slightly wrong, it seems to me, as we now
| actually have an uppercase version of that, so it should
| uppercase to "Latin Capital Letter Sharp S" (U+1E9E). The
| double-S thing is still widely used, though.
| mkayokay wrote:
| Duden mentions this: "Bei Verwendung von Großbuchstaben steht
| traditionellerweise SS für ß. In manchen Schriften gibt es aber
| auch einen entsprechenden Großbuchstaben; seine Verwendung ist
| fakultativ <§ 25 E3>." (When writing in capital letters, SS
| traditionally stands for ß. Some typefaces also have a
| corresponding capital letter; its use is optional.)
|
| But isn't it also dependent on the available glyphs in the font
| used? So, e.g., it needs to be ensured that U+1E9E exists?
| Muromec wrote:
| But what if you need to uppercase a historical record in a
| vital records registry from the 1950s that was OCRed last week?
| Now you not only need to be locale-aware, your locale needs to
| be versioned.
| pjmlp wrote:
| Lowering case is even better, because a Swiss user would expect
| the two-character sequence "SS" to be converted into "ss" and
| not "ß".
|
| And thus we add country specific locale to the party.
| Rygian wrote:
| The footnote #3 in the article (called as part of your quote)
| covers the different ways to uppercase ss with more detail.
| serbuvlad wrote:
| The real insights here are that strings in C++ suck and UTF-16 is
| extremely unintuitive.
| criddell wrote:
| Strings in C++ standard library do suck (and C++ is my favorite
| language).
|
| As for UTF-16, well, I don't know that UTF-8 is a whole lot
| more intuitive:
|
| > And for UTF-8 data, you have the same issues discussed
| before: Multibyte characters will not be converted properly,
| and it breaks for case mappings that alter string lengths.
| recursive wrote:
| UTF-16 has all the complexity of UTF-8 _plus_ surrogate
| pairs.
| zahlman wrote:
| Surrogate pairs aren't more complex than UTF-8's scheme for
| determining the number of bytes used to represent a code
| point. (Arguably the logic is slightly simpler.) But the
| important point is that UTF-16 _pretends to_ be a constant-
| length encoding while actually having the surrogate-pair
| loophole - that 's because it's a hack on top of UCS-2
| (which originally worked well enough for Microsoft to get
| married to; but then the BMP turned out not to be enough
| code points). UTF-8 is clearly designed from scratch to be
| a multi-byte encoding (and, while the standard now makes
| the corresponding sequences illegal, the scheme was
| designed to be able to support much higher code points - up
| to 2^42 if we extend the logic all the way; hypothetical
| 6-byte sequences starting with values FC or FD would neatly
| map up to 2^31).
| PhilipRoman wrote:
| Thought this was going to be about and-not-ing bytes with 0x20.
| Wrong for most inputs but sure as hell faster than anything else.
| high_na_euv wrote:
| In cpp basic things are hard
| onemoresoop wrote:
| It's subjective but I find C++ extremely ugly.
| johnnyjeans wrote:
| nothing about working with locales, or text in general, is
| basic. we were decades into working with digital computers
| before we moved past switchboards and LEDs. don't take for
| granted just how high of a perch upon the shoulders of giants
| you have. that's exactly how the mistakes in the blog post get
| made.
| SleepyMyroslav wrote:
| In gamedev there is simple rule: don't try to do any of that.
|
| If it is text game needs to show to user then every version of
| the text that is needed is a translated text. Programmer will
| never know if context or locale will need word order changes or
| anything complicated. Just trust the translation team.
|
| If text is coming from user - then change design until its not
| needed to 'convert'. There are major issues just to show user
| back what he entered! Because the font for editing and displayed
| text could be different. Not even mentioning RTL and other
| issues.
|
| Once ppl learn about localization the questions like why a
| programming language does not do this 'simple text operation' are
| just a newcomer detector. :)
| zahlman wrote:
| >If text is coming from the user, then change the design until
| there is no need to 'convert'
|
| In games, you can possibly get away with this. Most other
| people need to worry about things like string collation
| (locale-aware sorting) for user-supplied text.
| fluoridation wrote:
| >Once ppl learn about localization the questions like why a
| programming language does not do this 'simple text operation'
| are just a newcomer detector. :)
|
| I think you are purposefully misinterpreting the question.
| They're not asking about converting the case of any Unicode
| string with locale sensitivity, they're asking about converting
| the case of ASCII characters.
|
| What if your game needs to talk to a server and do some string
| manipulation in between requests? Are you really going to
| architect everything so that the client doesn't need to handle
| any of that ever?
| squeaky-clean wrote:
| > They're not asking about converting the case of any Unicode
| string with locale sensitivity, they're asking about
| converting the case of ASCII characters.
|
| I'm confused now. The article specifically mentions issues
| with UTF-16 and UTF-32 unicode characters outside the basic
| multilingual plane (BMP).
| fluoridation wrote:
| I'm referring to the people who call case conversion in
| general "a simple text operation". Say you have an
| std::string and you want to make it lower case. If you
| assume it contains just ASCII that's a simpler operation
| than if you assume it contains UTF-8, but C++ doesn't
| provide a single function that does either of them. A
| person can rightly complain that the former is a basic
| functionality that the language should include; personally,
| I would agree. And you could say "wow, doesn't this person
| realize that case conversion in Unicode is actually
| complicated? They must be really inexperienced." It could
| be that the other person really doesn't know about Unicode,
| or it could mean that you and them are thinking about
| entirely different problems and you're being judgemental a
| bit too eagerly.
| squeaky-clean wrote:
| For ascii in C++ isn't there std::tolower / std::toupper?
| If you're not dealing with unsigned char types there
| isn't a simple case conversion function, but that's for a
| good reason as the article lays out.
| fluoridation wrote:
| Those functions take and return single characters. What's
| missing is functions that operate on strings. You can use
| them in combination with std::transform(), but as the
| article points out, even if you're just dealing with
| ASCII you can easily do it wrong. I've been using C++ for
| over 20 years and I didn't know tolower() and toupper()
| were non-addressable. There's really no excuse for the
| library not having simple case conversion functions that
| operate on strings in-place.
| SleepyMyroslav wrote:
| >What if your game needs to talk to a server and do some
| string manipulation in between requests? Are you really going
| to architect everything so that the client doesn't need to
| handle any of that ever?
|
| Of course! String manipulation of user-entered attributes like
| display names or chat messages is one millimeter away from good
| old SQL 'Bobby; drop table students'. Never ever do that if you
| can avoid it. Every time someone 'just concatenates' two
| strings, e.g. to add a 'symbol that represents an input
| button', the programmer creates a bad bug that will be both
| annoying and wrong. Games should use substitution patterns
| guided by the translation team, because there is no pure-ASCII
| culture among the roughly 15 locales typically supported by big
| publishers.
|
| There are exceptions, like platform-provided services to filter
| banned words in chat. And even there you don't have to do
| 'things with ASCII characters'. Yeah, players will input
| unsupported symbols everywhere they can, and you need good
| replacement characters for those, and you need to fix support
| for popular emojis regularly. That is expected by communities
| now.
| beeboobaa3 wrote:
| > There are major issues just to show user back what he
| entered! Because the font for editing and displayed text could
| be different. Not even mentioning RTL and other issues.
|
| Your web browser is doing it right now as you are reading this
| comment.
| rty32 wrote:
| And web development is not game development? And chances are
| that games don't ship chromium with them?
| HPsquared wrote:
| I thought this was going to be about adding or subtracting 32.
| Old school.
| the_gorilla wrote:
| Why are some functions addressable in C++ and others not? Seems
| like a pointless design oversight.
| bialpio wrote:
| Footnote in the article provides the following explanation:
| "The standard imposes this limitation because the
| implementation may need to add default function parameters,
| template default parameters, or overloads in order to
| accomplish the various requirements of the standard."
| flareback wrote:
| He gave 4 examples of how it's done incorrectly, but zero actual
| examples of doing it correctly.
| commandlinefan wrote:
| for (int i = 0; i < strlen(s); i++) {
|     s[i] ^= 0x20;
| }
| vardump wrote:
| Surely you meant: s[i] &= ~0x20;
|
| We're talking about converting to upper case after all! As an
| added benefit, every space character (0x20) is now a NUL
| byte!
| calibas wrote:
| Thank you for this universal approach. I can now toggle
| capitalization on/off for any character, instead of just
| being limited to alphabetic ones!
|
| Jokes aside, I was kinda hoping for a good answer that
| doesn't rely on a Windows API or an external library, but I'm
| not sure there is one. It's a rather complex problem when you
| account for more than just ASCII and the English language.
| TZubiri wrote:
| Next up, check out our vector addition implementation of
| Hello+World. Spoiler alert, the result is Zalgo
| TheGeminon wrote:
| > Okay, so those are the problems. What's the solution?
|
| > If you need to perform a case mapping on a string, you can
| use LCMapStringEx with LCMAP_LOWERCASE or LCMAP_UPPERCASE,
| possibly with other flags like LCMAP_LINGUISTIC_CASING. If you
| use the International Components for Unicode (ICU) library, you
| can use u_strToUpper and u_strToLower.
| PoignardAzur wrote:
| So I'm going to be that guy and say it:
|
| Man, I'm happy we don't need to deal with this crap in Rust, and
| we can just use String::to_lowercase. Not having to worry about
| things makes coding fun.
| himinlomax wrote:
| > And in certain forms of the French language, capitalizing an
| accented character causes the accent to be dropped: à Paris = A
| PARIS.
|
| That's incorrect, using diacritics on capital letters is always
| the preferred form, it's just that dropping them is acceptable as
| it was often done for technical reasons.
| codr7 wrote:
| C++, where every line of code is a book waiting to be written.
___________________________________________________________________
(page generated 2024-10-08 23:00 UTC)