[HN Gopher] The C Standard Library Function Isspace() Depends on...
___________________________________________________________________
The C Standard Library Function Isspace() Depends on Locale
Author : jandeboevrie
Score : 50 points
Date : 2023-06-06 17:40 UTC (5 hours ago)
(HTM) web link (www.evanjones.ca)
(TXT) w3m dump (www.evanjones.ca)
| [deleted]
| rwmj wrote:
| See also a classic glibc bug: _" [0-9] matches 1/4 1 2 3 and
| others, but not 9 (and other nines)"_
| (https://news.ycombinator.com/item?id=17557243)
|
| Sadly they renamed the upstream bug report to something more
| sober (although didn't fix it).
| kazinator wrote:
| Scanning a floating-point value with strtod depends on locale. If
| it's in some locale where the decimal point is a comma, it may
| stop recognizing the standard 123.456+EE notation.
|
| The fix is never to call setlocale(); calling setlocale is like
| asking "f___ my C program".
|
| ISO C localization was designed back in the 1980s, when nobody
| had any real experience with localizing. In a greenfield C
| program, it's best to do it all yourself from scratch and stay
| away from C localization, so you can depend on strtod and isspace
| to do what they are supposed to.
| pavlov wrote:
| It's useful to remember that the 1980s approach to localization
| preceded the global Internet.
|
| The way most software worked was that you bought it from a
| local reseller. It came localized for your country (perhaps by
| the reseller or importer rather than the original authors of
| the software), and then you'd use it to conduct your local
| business. Data interchange wasn't that common.
|
| Desktop printers were hugely important because a hard copy was
| how you'd share anything. If you needed to get the information
| somewhere fast, you'd then fax it.
|
| On the rare occasions you did need to exchange files, you'd use floppies.
| Maybe you'd take your WordPerfect document to a professional
| print shop so they would do a layout using cutting-edge desktop
| publishing technology.
|
| So the notion that somebody in Germany might receive American
| files, or vice versa, wasn't really a primary concern. It was
| considered far more important that the Germans, and everybody
| else, would be able to work with their data with the number
| formatting that was preferred (and sometimes legally mandated).
| orf wrote:
| Cool but that was over 40 years ago. Who cares, and why
| hasn't it been improved since then?
| pvh wrote:
| To the former, curious people with an interest in how and
| why the world came to be as it is.
|
| To the latter, obviously "it" has improved, but ecosystem
| effects make certain changes very difficult and expensive
| to coordinate and what we see here is the scars from that
| process.
|
| Everything you see in the world grew out of things that
| came before, and was made by fallible people working with
| limited time, energy, and perspective.
|
| Honestly, I'm a bit surprised someone with a three letter
| handle wouldn't already recognize this. Surely you have
| been around here for a while.
| JdeBP wrote:
| It actually _has_ been improved. See discussion of
| isspace_l() elsewhere on this page.
| IshKebab wrote:
| I guess anyone that develops in C is happy living in the
| 80s amongst the footguns. Anyone who isn't has moved on to
| other languages where it _has_ been improved.
| bitwize wrote:
| s/greenfield C program/greenfield program/
|
| s/from C localization.*$/from C./
| nemothekid wrote:
| mpv's locale rant is another "blog post" about the frustration of
| locale.
|
| https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...
| Dwedit wrote:
| This is the famous "shitfucked retarded legacy braindeath"
| post.
| eMSF wrote:
| While writing a fancy word counter I learnt that glibc iswspace
| (or the glibc locale data) actually does not consider non-
| breaking spaces as, well, spaces even when using a Unicode
| locale. This apparently conforms to ISO 30112. (For example
| MSVCRT does do so.)
|
| I happened to notice this via a result mismatch as GNU wc does
| count NBSPs as word separators. Even though it uses iswspace, it
| also additionally checks for a hard coded set of Unicode non-
| breaking spaces.
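|
| The approach described above (iswspace plus a hard-coded list)
| might look roughly like the sketch below. The function names are
| hypothetical and the set of code points is an assumption of this
| sketch, not copied from GNU wc. It also assumes wchar_t values are
| Unicode code points, as they are on glibc.
|

```c
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Code points treated as non-breaking spaces in this sketch:
 * U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE,
 * U+202F NARROW NO-BREAK SPACE, U+2060 WORD JOINER.
 * The exact set is an assumption. */
static bool is_nbsp(wint_t wc)
{
    return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* A word separator is anything iswspace() accepts in the current
 * locale, plus the hard-coded non-breaking spaces above. */
static bool is_word_separator(wint_t wc)
{
    return iswspace(wc) != 0 || is_nbsp(wc);
}
```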
|
| (I have to say I'm a bit surprised at getting voted
| hidden here. I thought this was mostly related to the topic at
| hand. I would of course gladly be corrected if mistaken about the
| details.)
| emmelaich wrote:
| I think I've mentioned it before, but the isspace() and similar
| man pages used to warn that they made sense only if ascii.
|
| So the recommendation was to always do (isascii() &&
| iswhatever()).
|
| With the advent of locales they seem to have just omitted this
| rather than put in a warning or hint.
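|
| The isascii() guard recommended above can be sketched as follows.
| The wrapper name is hypothetical. Note that isascii() comes from
| POSIX (where it is marked obsolescent), not from ISO C.
|

```c
#include <ctype.h>

/* Hypothetical wrapper following the old man-page advice: only trust
 * isspace() for values in the 7-bit ASCII range. The && short-circuits,
 * so isspace() is never called with a negative or out-of-range value. */
static int ascii_guarded_isspace(int c)
{
    return isascii(c) && isspace(c);
}
```

| Bytes such as 0xA0 are rejected by the guard before any
| locale-specific table is consulted.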
| nirvanis wrote:
| Somewhat related tip: prepend LANG=C to many console commands
| such as grep to speed up many tools processing large files, as
| they will assume ASCII input (which is probably what you have in
| most cases)
| emmelaich wrote:
| and set it for consistency of ordering (collation) between
| sort, join, tsort, look, etc.
| zokier wrote:
| C locales are one of those optimistic features that we have
| inherited that in retrospect ended up being misguided and more
| trouble than worth; any program that really needs to deal with
| localization probably will end up pulling something like ICU to
| deal with all sorts of cases, and on the other hand locales cause
| all sorts of weird issues for programs not really expecting to be
| localized. As a bonus locale support incurs heavy performance
| hit.
|
| In this case it's extra-awkward to have an attempt of having
| unicode support on a function that takes a single char as an
| input; it can't actually handle arbitrary unicode codepoints
| anyways.
|
| I feel a common theme with these sort of things is the thinking
| that difficult problems can be made tractable by presenting a
| "simple" naive interface while fudging things behind the scenes.
| Those supposedly simple interfaces actually become complex to
| think about once you start asking difficult questions about
| correctness, error handling, and edge cases.
| Someone wrote:
| > In this case its extra-awkward to have an attempt of having
| unicode support on a function that takes a single char as an
| input
|
| Nitpick: it doesn't take a _char_ ; it takes an _int_ that must
| either be representable as unsigned char or be equal to EOF
| (https://en.cppreference.com/w/c/string/byte/isspace)
|
| Given that description, I don't think anybody attempted to have
| unicode support for _isspace_.
|
| IMO, the bug is to call _isspace_ for bytes extracted from
| utf-8 data.
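|
| A minimal sketch of honoring that argument contract when walking
| a char buffer: convert each element to unsigned char before the
| call, so the value is always in 0..UCHAR_MAX even where plain char
| is signed. The function name is hypothetical. This only makes the
| call well-defined; it does not make isspace meaningful for bytes
| taken out of multi-byte UTF-8 sequences.
|

```c
#include <ctype.h>
#include <stddef.h>

/* Count the bytes in s that isspace() classifies as whitespace
 * in the current locale. */
static size_t count_space_bytes(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* The cast keeps the argument within unsigned char range. */
        if (isspace((unsigned char)*s))
            n++;
    }
    return n;
}
```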
| djoldman wrote:
| For implications in python, see for example:
|
| https://docs.python.org/3/library/re.html#re.LOCALE
| david2ndaccount wrote:
| Yep.
|
| Because of the terribleness of locales, you should always just
| roll your own to handle ascii, or use an explicitly unicode-aware
| function. Anything in libc that relies on locale is unusable
| because of this.
|
| Besides, isspace is such a trivial function that you don't want
| to actually call an extern function (possibly even dynamically
| linked and thus having to hit the PLT) for it, you want something
| that is easily inlined.
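|
| A roll-your-own ASCII version can be as small as the sketch below.
| The name ascii_isspace is hypothetical, and the range test assumes
| an ASCII execution character set, where \t, \n, \v, \f and \r are
| the consecutive values 9 through 13.
|

```c
#include <stdbool.h>

/* Locale-independent check for the six characters that isspace()
 * accepts in the "C" locale: space, \t, \n, \v, \f, \r.
 * Any int is a valid argument; nothing is used as an array index. */
static inline bool ascii_isspace(int c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}
```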
| JdeBP wrote:
| The fix if one wants to stick to the standard library is to make
| use of
|
|     locale_t posixctypelocale =
|         newlocale(LC_CTYPE_MASK, "POSIX", NULL);
|
| saved somewhere early on and then
|
|     b = isspace_l(c, posixctypelocale);
|
| and
|
|     b = iswspace_l(wc, posixctypelocale);
|
| whenever one needs them.
|
| The irony is that systems based upon the BSD C library like MacOS
| and FreeBSD will have this.
|
| * https://pubs.opengroup.org/onlinepubs/9699919799/functions/n...
|
| * https://pubs.opengroup.org/onlinepubs/9699919799/functions/i...
|
| * https://pubs.opengroup.org/onlinepubs/9699919799/functions/i...
|
| * https://man.freebsd.org/cgi/man.cgi?query=xlocale&sektion=3
| wahern wrote:
| The fact that Unicode codepoints were being passed to isspace
| instead of iswspace indicates the relevant code was already
| fubar'd.
|
| > For example, isspace(0x01fe) is true. I can't figure out why
| this might be considered a whitespace character
|
| Because the only valid values (independent of locale) that can be
| passed to isspace are 0 to UCHAR_MAX and -1/EOF, where UCHAR_MAX
| refers to unsigned char (usually 255), not Unicode character.
| Most implementations I've seen (glibc, musl, OpenBSD) index the
| passed value into a locale-specific array of length UCHAR_MAX +
| 1, possibly masking the index and/or return values. But TIL macOS
| (and possibly FreeBSD and NetBSD at some point, if not currently)
| had vestigial support for passing higher values as part of a
| presumably long-abandoned approach to I18N.
|
| EDIT: FWIW, based on the glibc code (ctype/isctype.c),
|
|     int __isctype (int ch, int mask)
|     {
|       return (((uint16_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_CLASS) + 128)
|               [(int) (ch)] & mask);
|     }
|
| where isspace(c) seems to be translated to __isctype(c, _ISspace)
| there's a good chance the array is being overflowed. Without
| looking further (glibc isn't the easiest code to grok), I'd guess
| the array size is probably 128 + UCHAR_MAX with the offset of 128
| (instead of 1) to handle the common case, especially on systems
| where char is signed, of people passing in negative values,
| though that only works for a locale like ASCII where -1/EOF and
| 255/(unsigned char)-1 aren't ambiguous.
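|
| The offset-by-128 indexing can be illustrated with a toy model.
| Everything below is hypothetical: the table, its 384-entry size
| and the names are assumptions of this sketch rather than glibc's
| actual data. Unlike the quoted __isctype, the toy version checks
| the range, which is what keeps a value like 0x01fe from reading
| past the end of the table.
|

```c
#include <stdint.h>

enum { TOY_SPACE = 1 };

/* 384 entries: indices 0..127 stand for arguments -128..-1,
 * indices 128..383 stand for arguments 0..255. Only the six
 * "C" locale whitespace characters are marked. */
static const uint16_t toy_table[128 + 256] = {
    [128 + ' ']  = TOY_SPACE,
    [128 + '\t'] = TOY_SPACE,
    [128 + '\n'] = TOY_SPACE,
    [128 + '\v'] = TOY_SPACE,
    [128 + '\f'] = TOY_SPACE,
    [128 + '\r'] = TOY_SPACE,
};

static int toy_isspace(int ch)
{
    /* Anything outside -128..255 would index outside the table. */
    if (ch < -128 || ch > 255)
        return 0;
    /* Same shape as the quoted code: base pointer + 128, then [ch]. */
    return (toy_table + 128)[ch] & TOY_SPACE;
}
```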
| JdeBP wrote:
| FreeBSD layers narrow and wide character typing on top of a
| single common mechanism based upon 32-bit signed "runes".
| Basically, the 256 narrow characters are treated as the first
| 256 characters in the Unicode BMP, and an accident of
| implementation allows one to pass in other Unicode code points
| to the narrow character functions, given that the int and
| wint_t types are designed to be trivially convertible to a
| "rune".
| dvh wrote:
| Seems like instead of a boolean result, a better API would
| return a tri-state: space, not space, bad request
| TeMPOraL wrote:
| This is already the case; "bad request" is usually returned
| as SIGSEGV / 0xC0000005 and similar.
| rm445 wrote:
| Reminds me of a joke in a John Meaney novel, a physicist
| discovers faster-than-light travel through a dimension with
| unusual properties, gaining inspiration from Java's booleans
| having three states (true, false and NullPointerException).
| Phrodo_00 wrote:
| Hate to point it out, but java booleans can only be true or
| false. Booleans (capital B) can be true, false or null.
| bitwize wrote:
| So... True, False, and File Not Found?
| teddyh wrote:
| Reference:
| https://thedailywtf.com/articles/What_Is_Truth_0x3f_
___________________________________________________________________
(page generated 2023-06-06 23:00 UTC)