[HN Gopher] The C Standard Library Function Isspace() Depends on...
       ___________________________________________________________________
        
       The C Standard Library Function Isspace() Depends on Locale
        
       Author : jandeboevrie
       Score  : 50 points
       Date   : 2023-06-06 17:40 UTC (5 hours ago)
        
 (HTM) web link (www.evanjones.ca)
 (TXT) w3m dump (www.evanjones.ca)
        
       | rwmj wrote:
       | See also a classic glibc bug: _" [0-9] matches 1/4 1 2 3 and
       | others, but not 9 (and other nines)"_
       | (https://news.ycombinator.com/item?id=17557243)
       | 
       | Sadly they renamed the upstream bug report to something more
       | sober (although didn't fix it).
        
       | kazinator wrote:
       | Scanning a floating-point value with strtod depends on locale. If
       | it's in some locale where the decimal point is a comma, it may
       | stop recognizing the standard 123.456+EE notation.
       | 
       | The fix is never to call setlocale(); calling setlocale is like
       | asking "f___ my C program".
       | 
       | ISO C localization was designed back in the 1980s, when nobody
       | had any real experience with localizing. In a greenfield C
       | program, it's best to do it all yourself from scratch and stay
       | away from C localization, so you can depend on strtod and isspace
       | to do what they are supposed to.
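
One way to keep strtod predictable even when some other part of the program has called setlocale() is to switch the calling thread to the "C" locale just for the duration of the parse. The sketch below assumes a POSIX.1-2008 system that provides newlocale(), uselocale(), and freelocale() (some platforms, macOS among them, may need <xlocale.h>); parse_double_c is a hypothetical helper name, not a libc function.

```c
#define _POSIX_C_SOURCE 200809L  /* request newlocale/uselocale/freelocale */
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper (not part of libc): parse a double with the "C"
   locale's rules, whatever the process-wide locale happens to be. */
double parse_double_c(const char *text, char **end)
{
    /* Build a locale object whose LC_NUMERIC category is "C". */
    locale_t c_numeric = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    assert(c_numeric != (locale_t)0);

    /* Switch only the calling thread, parse, then switch back. */
    locale_t previous = uselocale(c_numeric);
    double value = strtod(text, end);
    uselocale(previous);

    freelocale(c_numeric);
    return value;
}
```

Calling parse_double_c("1.5abc", &end) should yield 1.5 with end pointing at "abc". Creating and freeing the locale object on every call keeps the sketch short; a real program would probably create it once and reuse it.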
        
         | pavlov wrote:
         | It's useful to remember that the 1980s approach to localization
         | preceded the global Internet.
         | 
         | The way most software worked was that you bought it from a
         | local reseller. It came localized for your country (perhaps by
         | the reseller or importer rather than the original authors of
         | the software), and then you'd use it to conduct your local
         | business. Data interchange wasn't that common.
         | 
         | Desktop printers were hugely important because a hard copy was
         | how you'd share anything. If you needed to get the information
         | somewhere fast, you'd then fax it.
         | 
         | When you did, rarely, need to exchange files, you'd use floppies.
         | Maybe you'd take your WordPerfect document to a professional
         | print shop so they would do a layout using cutting-edge desktop
         | publishing technology.
         | 
         | So the notion that somebody in Germany might receive American
         | files, or vice versa, wasn't really a primary concern. It was
         | considered far more important that the Germans, and everybody
         | else, would be able to work with their data with the number
         | formatting that was preferred (and sometimes legally mandated).
        
           | orf wrote:
           | Cool but that was over 40 years ago. Who cares, and why
           | hasn't it been improved since then?
        
             | pvh wrote:
             | To the former, curious people with an interest in how and
             | why the world came to be as it is.
             | 
             | To the latter, obviously "it" has improved, but ecosystem
             | effects make certain changes very difficult and expensive
             | to coordinate and what we see here is the scars from that
             | process.
             | 
             | Everything you see in the world grew out of things that
             | came before, and was made by fallible people working with
             | limited time, energy, and perspective.
             | 
             | Honestly, I'm a bit surprised someone with a three letter
             | handle wouldn't already recognize this. Surely you have
             | been around here for a while.
        
             | JdeBP wrote:
             | It actually _has_ been improved. See discussion of
             | isspace_l() elsewhere on this page.
        
             | IshKebab wrote:
             | I guess anyone that develops in C is happy living in the
             | 80s amongst the footguns. Anyone who isn't has moved on to
             | other languages where it _has_ been improved.
        
         | bitwize wrote:
         | s/greenfield C program/greenfield program/
         | 
         | s/from C localization.*$/from C./
        
       | nemothekid wrote:
       | mpv's locale rant is another "blog post" about the frustration of
       | locale.
       | 
       | https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...
        
         | Dwedit wrote:
         | This is the famous "shitfucked retarded legacy braindeath"
         | post.
        
       | eMSF wrote:
       | While writing a fancy word counter I learnt that glibc iswspace
       | (or the glibc locale data) actually does not consider non-
       | breaking spaces as, well, spaces even when using a Unicode
       | locale. This apparently conforms to ISO 30112. (For example
       | MSVCRT does do so.)
       | 
       | I happened to notice this via a result mismatch as GNU wc does
       | count NBSPs as word separators. Even though it uses iswspace, it
       | also additionally checks for a hard coded set of Unicode non-
       | breaking spaces.
       | 
       | (I have to say I'm a bit surprised at getting voted
       | hidden here. I thought this was mostly related to the topic at
       | hand. I would of course gladly be corrected if mistaken about the
       | details.)
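
The combination described above (locale-driven iswspace plus a hard-coded set of non-breaking spaces) might look roughly like the following. This is a sketch, not the wc source: the function names are hypothetical, the four code points are an illustrative set that should be checked against the real implementation, and it assumes wint_t values are Unicode code points, as they are with glibc.

```c
#include <assert.h>
#include <stdbool.h>
#include <wctype.h>

/* Illustrative set of non-breaking spaces; assumes wint_t holds
   Unicode code points. */
static bool is_nonbreaking_space(wint_t wc)
{
    return wc == 0x00A0    /* NO-BREAK SPACE */
        || wc == 0x2007    /* FIGURE SPACE */
        || wc == 0x202F    /* NARROW NO-BREAK SPACE */
        || wc == 0x2060;   /* WORD JOINER */
}

/* Hypothetical word-separator test: whatever the current locale calls
   whitespace, plus the hard-coded set above. */
bool is_word_separator(wint_t wc)
{
    return iswspace(wc) != 0 || is_nonbreaking_space(wc);
}

/* Self-check: the NBSP result does not depend on the locale data. */
void is_word_separator_demo(void)
{
    assert(is_word_separator(L' '));
    assert(is_word_separator(0x00A0));
    assert(!is_word_separator(L'a'));
}
```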
        
       | emmelaich wrote:
       | I think I've mentioned it before, but the isspace() and similar
       | man pages used to warn that they made sense only if ascii.
       | 
       | So the recommendation was to always do (isascii() &&
       | iswhatever()).
       | 
       | With the advent of locales they seem to have just omitted this
       | rather than put in a warning or hint.
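
The old (isascii() && iswhatever()) idiom can be sketched with an explicit range check, which avoids relying on isascii() being available. The wrapper name is hypothetical, and the guard only narrows the input range: the ctype function it calls is still the locale-aware one, so this assumes the locale classifies 7-bit characters the usual way.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical wrapper in the spirit of (isascii(c) && isspace(c)):
   only 7-bit values are ever handed to the locale-dependent function. */
int ascii_guarded_isspace(int c)
{
    return c >= 0 && c <= 0x7F && isspace(c) != 0;
}

/* Self-check showing the intended behaviour. */
void ascii_guarded_isspace_demo(void)
{
    assert(ascii_guarded_isspace(' '));
    assert(ascii_guarded_isspace('\t'));
    assert(!ascii_guarded_isspace('x'));
    assert(!ascii_guarded_isspace(0x85));  /* high byte: rejected by the guard */
    assert(!ascii_guarded_isspace(-1));    /* negative values are rejected too */
}
```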
        
       | nirvanis wrote:
       | Somewhat related tip: prepend LANG=C to many console commands
       | such as grep to speed up many tools processing large files, as
       | they will assume ASCII input (which is probably what you have in
       | most cases)
        
         | emmelaich wrote:
         | and set it for consistency of ordering (collation) between
         | sort, join, tsort, look, etc.
        
       | zokier wrote:
       | C locales are one of those optimistic features that we have
       | inherited that in retrospect ended up being misguided and more
       | trouble than worth; any program that really needs to deal with
       | localization probably will end up pulling something like ICU to
       | deal with all sorts of cases, and on the other hand locales cause
       | all sorts of weird issues for programs not really expecting to be
       | localized. As a bonus, locale support incurs a heavy performance
       | hit.
       | 
       | In this case it's extra-awkward to have an attempt of having
       | unicode support on a function that takes a single char as an
       | input; it can't actually handle arbitrary unicode codepoints
       | anyways.
       | 
       | I feel a common theme with these sort of things is the thinking
       | that difficult problems can be made tractable by presenting a
       | "simple" naive interface while fudging things behind the scenes.
       | Those supposedly simple interfaces actually become complex to
       | think about once you start asking difficult questions about
       | correctness, error handling, and edge cases.
        
         | Someone wrote:
         | > In this case its extra-awkward to have an attempt of having
         | unicode support on a function that takes a single char as an
         | input
         | 
         | Nitpick: it doesn't take a _char_ ; it takes an _int_ that must
         | either be representable as unsigned char or be equal to EOF
         | (https://en.cppreference.com/w/c/string/byte/isspace)
         | 
         | Given that description, I don't think anybody attempted to have
         | unicode support for _isspace_.
         | 
         | IMO, the bug is to call _isspace_ for bytes extracted from
         | utf-8 data.
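
The argument constraint matters in everyday code because plain char may be signed, so a byte such as 0xC3 can arrive as a negative int. A common way to stay inside the permitted range is to convert each byte to unsigned char first, as in this sketch (count_leading_space is a hypothetical helper). The conversion makes the call well defined; it does not make the answer locale-independent, and it does not make byte-wise classification of UTF-8 meaningful.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count leading whitespace bytes in a
   NUL-terminated string.  Each byte is converted to unsigned char, so
   the value passed to isspace() is always within 0..UCHAR_MAX. */
size_t count_leading_space(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0' && isspace((unsigned char)s[n]))
        n++;
    return n;
}
```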
        
       | djoldman wrote:
       | For implications in python, see for example:
       | 
       | https://docs.python.org/3/library/re.html#re.LOCALE
        
       | david2ndaccount wrote:
       | Yep.
       | 
       | Because of the terribleness of locales, you should always just
       | roll your own to handle ascii, or use an explicitly unicode-aware
       | function. Anything in libc that relies on locale is unusable
       | because of this.
       | 
       | Besides, isspace is such a trivial function that you don't want
       | to actually call an extern function (possibly even dynamically
       | linked and thus having to hit the PLT) for it, you want something
       | that is easily inlined.
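
A roll-your-own ASCII version along these lines can be very small. The sketch below is one possible shape, with a hypothetical name; it assumes an ASCII-compatible execution character set and checks that assumption at compile time.

```c
#include <assert.h>
#include <stdbool.h>

/* The range test below relies on '\t'..'\r' being the contiguous
   codes 9..13, which holds for ASCII-compatible character sets. */
static_assert('\t' == 9 && '\r' == 13,
              "ASCII execution character set assumed");

/* Hypothetical replacement for isspace(): the six characters that the
   "C" locale treats as whitespace, with no locale lookup and no
   out-of-line call. */
static inline bool ascii_isspace(int c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}
```

Because the function never indexes a table, any int is a safe argument, including negative values and code points above 255.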
        
       | JdeBP wrote:
       | The fix if one wants to stick to the standard library is to make
       | use of
       | 
       |     locale_t posixctypelocale = newlocale(LC_CTYPE_MASK, "POSIX", NULL);
       | 
       | saved somewhere early on and then
       | 
       |     b = isspace_l(c, posixctypelocale);
       | 
       | and
       | 
       |     b = iswspace_l(wc, posixctypelocale);
       | 
       | whenever one needs them.
       | 
       | The irony is that systems based upon the BSD C library like MacOS
       | and FreeBSD will have this.
       | 
       | * https://pubs.opengroup.org/onlinepubs/9699919799/functions/n...
       | 
       | * https://pubs.opengroup.org/onlinepubs/9699919799/functions/i...
       | 
       | * https://pubs.opengroup.org/onlinepubs/9699919799/functions/i...
       | 
       | * https://man.freebsd.org/cgi/man.cgi?query=xlocale&sektion=3
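
Put together as a compilable unit, the approach in this comment could look like the sketch below. The variable name follows the comment; the accessor and wrapper functions are hypothetical. It assumes a POSIX.1-2008 system (on macOS the declarations may live in <xlocale.h>), and the lazy initialisation shown is not thread-safe.

```c
#define _POSIX_C_SOURCE 200809L  /* request newlocale and the *_l functions */
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <wctype.h>

/* Created once and kept for the life of the program. */
static locale_t posixctypelocale;

/* Hypothetical accessor: create the POSIX LC_CTYPE locale on first use.
   Not thread-safe; a real program might initialise it at startup. */
static locale_t posix_ctype(void)
{
    if (posixctypelocale == (locale_t)0) {
        posixctypelocale = newlocale(LC_CTYPE_MASK, "POSIX", (locale_t)0);
        assert(posixctypelocale != (locale_t)0);
    }
    return posixctypelocale;
}

/* Hypothetical wrappers that ignore the global and per-thread locale. */
int posix_isspace(int c)
{
    return isspace_l(c, posix_ctype()) != 0;
}

int posix_iswspace(wint_t wc)
{
    return iswspace_l(wc, posix_ctype()) != 0;
}
```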
        
       | wahern wrote:
       | The fact that Unicode codepoints were being passed to isspace
       | instead of iswspace indicates the relevant code was already
       | fubar'd.
       | 
       | > For example, isspace(0x01fe) is true. I can't figure out why
       | this might be considered a whitespace character
       | 
       | Because the only valid values (independent of locale) that can be
       | passed to isspace are 0 to UCHAR_MAX and -1/EOF, where UCHAR_MAX
       | refers to unsigned char (usually 255), not Unicode character.
       | Most implementations I've seen (glibc, musl, OpenBSD) index the
       | passed value into a locale-specific array of length UCHAR_MAX +
       | 1, possibly masking the index and/or return values. But TIL macOS
       | (and possibly FreeBSD and NetBSD at some point, if not currently)
       | had vestigial support for passing higher values as part of a
       | presumably long-abandoned approach to I18N.
       | 
       | EDIT: FWIW, based on the glibc code (ctype/isctype.c),
       | 
       |     int
       |     __isctype (int ch, int mask)
       |     {
       |       return (((uint16_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_CLASS) + 128)
       |               [(int) (ch)] & mask);
       |     }
       | 
       | where isspace(c) seems to be translated to __isctype(c, _ISspace)
       | there's a good chance the array is being overflowed. Without
       | looking further (glibc isn't the easiest code to grok), I'd guess
       | the array size is probably 128 + UCHAR_MAX with the offset of 128
       | (instead of 1) to handle the common case, especially on systems
       | where char is signed, of people passing in negative values,
       | though that only works for a locale like ASCII where -1/EOF and
       | 255/(unsigned char)-1 aren't ambiguous.
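
The offset-table idea guessed at above can be illustrated with a toy version. This is not glibc's data structure: the names, the flag encoding, and the exact size (128 extra slots in front of UCHAR_MAX + 1 regular ones) are assumptions made for the sketch. It shows why small negative arguments happen to work while larger out-of-range values would read outside the table.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative layout only: slots for indexes -128..-1 sit in front of
   the usual 0..UCHAR_MAX entries. */
enum { TABLE_OFFSET = 128, TABLE_SIZE = TABLE_OFFSET + UCHAR_MAX + 1 };

static const unsigned char space_table[TABLE_SIZE] = {
    [TABLE_OFFSET + ' ']  = 1,
    [TABLE_OFFSET + '\t'] = 1,
    [TABLE_OFFSET + '\n'] = 1,
    [TABLE_OFFSET + '\v'] = 1,
    [TABLE_OFFSET + '\f'] = 1,
    [TABLE_OFFSET + '\r'] = 1,
};

/* Pointer to the entry for character 0, so negative indexes are valid. */
static const unsigned char *const space_lookup = space_table + TABLE_OFFSET;

int table_isspace(int c)
{
    /* Anything outside -128..UCHAR_MAX would read outside the table. */
    assert(c >= -TABLE_OFFSET && c <= UCHAR_MAX);
    return space_lookup[c];
}
```

With this layout an argument such as 0x01fe is simply out of bounds, which is why the sketch asserts on the range instead of indexing blindly.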
        
         | JdeBP wrote:
         | FreeBSD layers narrow and wide character typing on top of a
         | single common mechanism based upon 32-bit signed "runes".
         | Basically, the 256 narrow characters are treated as the first
         | 256 characters in the Unicode BMP, and an accident of
         | implementation allows one to pass in other Unicode code points
         | to the narrow character functions, given that the int and
         | wint_t types are designed to be trivially convertible to a
         | "rune".
        
         | dvh wrote:
         | Seems like instead of a boolean result, a better API would
         | return a tri-state: space, no space, bad request
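
A tri-state wrapper of the kind suggested here is easy to sketch on top of the standard function. The enum and function names are hypothetical; the range check mirrors the rule that the argument must be EOF or representable as unsigned char.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Hypothetical tri-state result type. */
enum space_result { SPACE_NO, SPACE_YES, SPACE_BAD_ARGUMENT };

/* Reject arguments isspace() is not defined for, instead of passing
   them through and hoping for the best. */
enum space_result checked_isspace(int c)
{
    if (c != EOF && (c < 0 || c > UCHAR_MAX))
        return SPACE_BAD_ARGUMENT;
    return isspace(c) ? SPACE_YES : SPACE_NO;
}

/* Self-check covering all three outcomes. */
void checked_isspace_demo(void)
{
    assert(checked_isspace(' ') == SPACE_YES);
    assert(checked_isspace('a') == SPACE_NO);
    assert(checked_isspace(EOF) == SPACE_NO);
    assert(checked_isspace(0x01fe) == SPACE_BAD_ARGUMENT);
}
```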
        
           | TeMPOraL wrote:
           | This is already the case; "bad request" is usually returned
           | as SIGSEGV / 0xC0000005 and similar.
        
           | rm445 wrote:
           | Reminds me of a joke in a John Meaney novel: a physicist
           | discovers faster-than-light travel through a dimension with
           | unusual properties, gaining inspiration from Java's booleans
           | having three states (true, false and NullPointerException).
        
             | Phrodo_00 wrote:
             | Hate to point it out, but java booleans can only be true or
             | false. Booleans (capital B) can be true, false or null.
        
           | bitwize wrote:
           | So... True, False, and File Not Found?
        
             | teddyh wrote:
             | Reference:
             | https://thedailywtf.com/articles/What_Is_Truth_0x3f_
        
       ___________________________________________________________________
       (page generated 2023-06-06 23:00 UTC)