[HN Gopher] Fun with Glibc and the Ctype.h Functions
___________________________________________________________________
Fun with Glibc and the Ctype.h Functions
Author : picture
Score : 24 points
Date : 2021-09-30 05:36 UTC (1 days ago)
(HTM) web link (rachelbythebay.com)
| _kst_ wrote:
| IMHO the more interesting oddity about the functions declared in
| <ctype.h> is that they work with unsigned char, which means that
| they have undefined behavior if you pass a negative char value
| (other than EOF, which is typically -1).
|
| This means that if you have a char value (say, an element of a
| string), you need to cast it to unsigned char before passing it
| to any of the is*() functions.
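A minimal sketch of the cast described above, assuming a hypothetical helper named `count_alpha` (not from the thread) that walks a NUL-terminated string:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes in s that isalpha() accepts
 * under the current LC_CTYPE locale. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Plain char may be signed, so bytes >= 0x80 can be negative.
         * Converting to unsigned char first keeps the argument inside
         * the domain the is*() functions are defined for. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

Without the cast, `isalpha(*s)` would receive a negative int for any high-bit byte on platforms where char is signed.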
| gumby wrote:
| The rant behind her post
| (https://drewdevault.com/2020/09/25/A-story-of-two-libcs.html ),
| which has had some circulation, really shows its author's limited
| perspective.
|
| glibc needs to solve two hard problems: be very fast and run on
| innumerable systems. Some of that conditional stuff is because
| all the world is not Linux or BSD; some of the macrology is there
| to make sure such handling is performed everywhere needed, and of
| course the preprocessor is the closest a language like C can get
| to preprocessing.
|
| I was in the code as glibc started to exist (we paid for a lot of
| it) and it looked like Musl: very straightforward.
| cryptonector wrote:
| The definition of the ctype functions as working on unsigned
| char values and EOF + CHAR_BIT being 8 everywhere now basically
| means that there isn't much locale-specificity to the ctype
| functions: they can be made to work with ASCII, ISO-8859-*,
| and... EBCDIC, but not UTF-8 in general (just ASCII) or any
| Unicode encoding (idk, maybe they can be made to be locale-
| specific for Shift-JIS, but only for ASCII in Shift-JIS).
|
| And... yes, glibc does have support for EBCDIC, which is
| probably ultimately why it has these run-time indirections in
| its ctype. There's no other reason to have run-time
| indirections for ctype functions given the limitation of
| unsigned char values + EOF. That means this code can be
| simplified a great deal.
|
| Anyways, yes, Drew DeVault's rant misses glibc's need to
| support EBCDIC, but glibc is exactly like this for every little
| thing -- an unmaintainable mess. There has to be a better way
| to produce a fast C library w/o being such a mess on the
| inside.
| Hello71 wrote:
| > all the world is not Linux or BSD
|
| since when does glibc run on bsd
| LukeShu wrote:
| Well for one, Debian GNU/kFreeBSD.
| jcelerier wrote:
| didn't glibc exist before linux ? surely it would have been
| running on bsd then
| tyingq wrote:
| I do get what you're saying, but musl also has to live in many
| different worlds. Using the example where glibc is trawling
| into endianness in the post you linked, for example. Musl runs
| on a bunch of different big and little endian router boxes and
| other unusual use cases. While I haven't tested, I'm guessing
| that their much simpler isalnum() works fine on all of them.
|
| Musl does have a lot less legacy to contend with, and musl is
| often much slower than glibc, so your point stands, of course.
| masklinn wrote:
| > While I haven't tested, I'm guessing that their much
| simpler isalnum() works fine on all of them.
|
| isalnum works fine on both, it only veers off when you get
| into UB which is UB.
|
| If you define "works fine" as "gives correct answers even in
| ub" then musl's is completely broken since it only gives
| correct answers for english in ascii.
| cryptonector wrote:
| It can't give correct answers for anything other than
| English in UTF-8 locales.
|
| It can't give correct answers for any non-Latin scripts in
| any locales.
|
| The problem is ctype and POSIX.
|
| Given that, making ctype only work for ASCII (and maybe
| EBCDIC if you're really unlucky, which glibc is) is
| basically sufficient.
| jcelerier wrote:
| musl's "isalpha" is trivially wrong, for instance it wouldn't
| support "ç" (0xe7) or "ß" (0xdf) in ISO 8859-1 which are
| both alphabetic characters which fit in an unsigned char.
| cryptonector wrote:
| ctype is trivially non-localizable to locales with codesets
| larger than sizeof(unsigned char) anyways. Maybe the
| problem here is POSIX.
| jcelerier wrote:
| oh yes, no code written in 2021 should use that mess. but
| with glibc aiming for some level of posix compatibility.. hard to
| blame them for at least trying to make it work.
| cryptonector wrote:
| Hmm, well, I mean, if ctype can't work for any
| interesting non-ASCII (and non-EBCDIC) cases (no one
| should still be using ISO-8859 locales...)... maybe stop
| trying so hard?
| tyingq wrote:
| Those both return 0 for isalpha() on glibc for me, with or
| without export LC_CTYPE=iso_8859_1
|
| Is there some other setup I'd need to do to see it work in
| glibc?
| jcelerier wrote:
| most likely you need to build the locale on your system
| (uncomment the relevant line in /etc/locale.gen and run
| sudo locale-gen).
|
| here:
|     #include <ctype.h>
|     #include <locale.h>
|     #include <stdio.h>
|     int main(int argc, char** argv) {
|         setlocale(LC_CTYPE, "fr_FR.iso88591");
|         if(isalpha('ç')) printf("ok\n");
|     }
|
| prints ok (with the file in the correct encoding)
| _kst_ wrote:
| isalpha() works with the "C" locale unless you first call
| setlocale().
|
| For example, on my system isalpha(0xe7) is true if I first
| call setlocale(LC_ALL, "en_US.iso88591").
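The dependence on setlocale() can be sketched as follows. The helper name `isalpha_in_locale` is hypothetical, and which locale names are available depends on what has been generated on the system:

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: returns 1 if ch is alphabetic under the named
 * LC_CTYPE locale, 0 if it is not, and -1 if the locale could not be
 * selected (for example because it has not been generated).
 * Note: this changes the process-wide locale and does not restore it. */
int isalpha_in_locale(const char *locale_name, unsigned char ch)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return isalpha(ch) != 0;
}
```

In the "C" locale only the 52 ASCII letters are alphabetic, so `isalpha_in_locale("C", 0xe7)` returns 0. With an ISO 8859-1 locale such as the "en_US.iso88591" mentioned above, the same call would be expected to return 1, provided that locale is installed.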
| jcelerier wrote:
| well, yes, in "normal" C programs you're supposed to
| fetch the locale from the user's env vars (with setlocale
| (LC_ALL, ""))
| anonymousiam wrote:
| Ran it on 32-bit ARM, 64-bit ARM, 32-bit x86, and 64-bit x86. All
| had different results, but all were the same until index 549,
| which is greater than the maximum value for unsigned char (255).
| zx2c4 wrote:
| Here are some branchless/constant-time versions of those
| functions that don't rely on locale:
| https://git.zx2c4.com/wireguard-tools/tree/src/ctype.h
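One common way to write locale-independent classifiers without explicit branches is an unsigned range check. The sketch below is an illustration of that idea only; the function names are hypothetical, it is not the code at the link above, and it assumes an ASCII execution character set:

```c
/* Hypothetical ASCII-only classifiers. Each one is a single unsigned
 * comparison, so any int argument (including negative values) is safe. */

int ascii_isdigit(int c)
{
    /* Values below '0' wrap around to huge unsigned numbers, so one
     * comparison covers both ends of the range '0'..'9'. */
    return (unsigned)c - '0' < 10u;
}

int ascii_isalpha(int c)
{
    /* OR-ing with 0x20 folds 'A'..'Z' onto 'a'..'z'; the same
     * wrap-around comparison then checks the 26-letter range. */
    return ((unsigned)c | 0x20u) - 'a' < 26u;
}

int ascii_isalnum(int c)
{
    /* Bitwise OR avoids the short-circuit of ||. */
    return ascii_isdigit(c) | ascii_isalpha(c);
}
```

Whether the generated machine code is actually branch-free or constant-time depends on the compiler and target, so that property should be verified rather than assumed.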
| malkia wrote:
| I like the suffix in 0x80001FU
| josephcsible wrote:
| Here's what the C standard says about character handling
| functions:
|
| > In all cases the argument is an int, the value of which shall
| be representable as an unsigned char or shall equal the value of
| the macro EOF. If the argument has any other value, the behavior
| is undefined.
|
| So this is just a case of glibc being optimized in a way that's
| really unforgiving if you commit that particular UB.
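A defensive wrapper can check the precondition quoted above before calling into the library. This is a sketch with a hypothetical name, `checked_isalnum`, which simply rejects arguments outside the defined domain:

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>   /* for EOF */

/* Hypothetical wrapper: returns 0 for any argument that is neither EOF
 * nor representable as unsigned char, instead of invoking undefined
 * behavior. In-domain arguments are passed through to isalnum(). */
int checked_isalnum(int c)
{
    if (c != EOF && (c < 0 || c > UCHAR_MAX))
        return 0;
    return isalnum(c) != 0;
}
```

With this wrapper an out-of-range value such as 549 yields 0 rather than a read past the end of a lookup table.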
| cryptonector wrote:
| No, this is a case of glibc trying to support localization of
| ctype in spite of the fact that it can't be localized to
| anything other than English in UTF-8 locales, anything other
| than Latin scripts in ISO-8859-* locales, or English in C/POSIX
| or EBCDIC locales. And then on top of that trying to be fast.
|
| I'd give up on supporting localization for ctype.
|
| This makes me think, too, "never use ctype, just hardcode my
| own that assumes ASCII".
| guidovranken wrote:
| This also applies to C++ <locale> functions, like std::isspace.
|
| Another fun one: With FD_CLR, FD_ISSET, FD_SET you can corrupt
| memory by merely passing a socket descriptor that is not in
| 0..1023 (FD_SETSIZE is 1024 on glibc).
| Pass a negative integer for some undefined behavior as well
| (shift by negative value occurs here [1])
|
| [1]
| https://github.com/lattera/glibc/blob/895ef79e04a953cac14938...
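The FD_SET hazard can be avoided by validating the descriptor first. A minimal sketch, assuming a hypothetical guard function named `checked_fd_set`:

```c
#include <sys/select.h>

/* Hypothetical guard: adds fd to the set only if it fits.
 * Returns 0 on success, -1 if fd is negative or >= FD_SETSIZE,
 * the two cases in which FD_SET would index outside the fd_set. */
int checked_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

A caller would run FD_ZERO on the set once and then treat a -1 return as a signal to fall back to an interface without this limit, such as poll().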
___________________________________________________________________
(page generated 2021-10-01 23:00 UTC)