[HN Gopher] Glibc Buffer Overflow in Iconv
       ___________________________________________________________________
        
       Glibc Buffer Overflow in Iconv
        
       Author : theamk
       Score  : 155 points
       Date   : 2024-04-21 04:31 UTC (18 hours ago)
        
 (HTM) web link (www.openwall.com)
 (TXT) w3m dump (www.openwall.com)
        
       | keikobadthebad wrote:
        | Sounds like you had better upgrade to the fixed glibc version
        | if you're running PHP...
        
         | thenickdude wrote:
         | Or else edit /usr/lib/x86_64-linux-gnu/gconv/gconv-modules and
          | comment out this section:
          | 
          |   #      from               to                 module           cost
          |   alias  ISO2022CNEXT//     ISO-2022-CN-EXT//
          |   module ISO-2022-CN-EXT//  INTERNAL           ISO-2022-CN-EXT  1
          |   module INTERNAL           ISO-2022-CN-EXT//  ISO-2022-CN-EXT  1
         | 
         | Then run "iconvconfig" to rebuild the iconv cache. This
         | disables that charset completely.
        
           | LXicon wrote:
            | As a test on various distros, I ran:
            | 
            |       iconv -l | grep 2022 | grep -i cn
            | 
            | and it listed ISO-2022-CN-EXT// and ISO2022CNEXT// before
            | I made any changes. After editing the modules and running
            | iconvconfig, the command no longer showed those charsets.
           | 
            | This was handy since AlmaLinux 8 has a
            | /usr/lib64/gconv/gconv-modules file, but the file to edit
            | was /usr/lib64/gconv/gconv-modules.d/gconv-modules-
            | extra.conf
        
             | voyagerfan5761 wrote:
             | Thanks a bunch to you and thenickdude for the test command
             | and config to change.
             | 
             | I have an old VPS that isn't worth trying to update to a
             | newer OS image, because I'm already (slowly) migrating
             | things off of it before the current paid-up term expires,
             | but it definitely won't get the newer glibc. Disabling the
             | vulnerable character encoding works for me, since no
             | legitimate user of the server will need these conversion
             | pairs.
        
       | saagarjha wrote:
       | > I hope Charles will share further detail with oss-security in
       | due time, but meanwhile his upcoming OffensiveCon talk abstract
       | reveals a bit
       | 
       | Wonder what the story is here. Burned 0 day? Not worth
       | exploiting? lolz?
        
         | bawolff wrote:
          | Why do you think there is a story here? Withholding the
          | nitty-gritty details until your conference talk is not that
          | unusual.
        
           | saagarjha wrote:
            | Well, for one, library maintainers don't usually learn
            | about bugs in their software from the abstracts of
            | security talks.
        
             | kzrdude wrote:
             | There's a CVE for it - CVE-2024-2961 - and fixed glibc
             | versions have already been released (and reached Ubuntu LTS
             | for one thing) so I think it's fine.
        
             | bawolff wrote:
          | Responsible disclosure generally means you tell the
          | maintainers first, before doing your splashy talk. That
          | appears to be what happened here, since there is a CVE and
          | we know mostly what was fixed. The talk would probably just
          | go into the nitty-gritty details of how it was found and
          | how it's exploitable, stuff a skilled researcher would
          | already be able to figure out from what has already been
          | publicly released.
        
       | blueflow wrote:
        | Man, I wish everything were UTF-8 so that iconv would not be
        | needed anymore. Too bad it's defined in POSIX.
        
         | rwmj wrote:
         | A corollary to this is that if we had a simpler function for
         | converting between UTF-8 and UTF-16LE, then I could remove all
         | uses of iconv from my code, since I only use it to convert
         | to/from MS Windows formats. (iconv's API is ugly and difficult
         | to use correctly.)
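          | 
          | For illustration, a minimal sketch of the boilerplate in
          | question, assuming glibc's iconv (a real caller would also
          | loop and grow the buffer on E2BIG):
          | 
          |       #include <iconv.h>
          |       #include <stdio.h>
          |       #include <string.h>
          | 
          |       int main(void)
          |       {
          |           char in[] = "h\xC3\xA9llo";   /* UTF-8 input */
          |           char out[64];        /* guess a size up front */
          |           char *ip = in, *op = out;
          |           size_t il = strlen(in), ol = sizeof(out);
          | 
          |           iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
          |           if (cd == (iconv_t)-1) {
          |               perror("iconv_open");
          |               return 1;
          |           }
          |           /* iconv mutates all four pointer/size args */
          |           if (iconv(cd, &ip, &il, &op, &ol) == (size_t)-1)
          |               perror("iconv");  /* E2BIG/EILSEQ/EINVAL */
          |           printf("%zu UTF-16LE bytes\n", sizeof(out) - ol);
          |           iconv_close(cd);
          |           return 0;
          |       }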
        
           | blueflow wrote:
            | Ugh, I forgot about Windows. Windows used UCS-2 some
            | decades ago; from there, UCS-2 got standardized into
            | Microsoft's EFI FAT spec, and that document got
            | incorporated into the EFI spec, which is now the
            | mechanism by which x86 boots.
        
             | mananaysiempre wrote:
             | There's more fun stuff like that: because Microsoft wrote
             | the spec for the Language Server Protocol, the offsets in
             | it are in UTF-16 code units even though the transport
             | format is UTF-8. (This is less awful than it sounds,
             | because a code point takes two UTF-16 code units iff it
             | takes the maximum allowed four UTF-8 ones, i.e. if its
             | UTF-8 starts with >= 0xF0. But it's still pretty awful.)
              | And of course UCS-2 had also been baked into Java and
              | JavaScript/ECMAScript at about the same time (mid-
              | 1990s) and only afterwards was it half-brokenly
              | extended to UTF-16.
             | 
             | In defense of all those usages (except the LSP one, which
             | is indefensible), the original pitches for Unicode[1]
             | literally said that it was intended to be fixed-width, an
             | international ASCII of sorts; that was to be achieved by
             | restricting it to "commercially-relevant" text (and Han
             | unification). Then it turned out there are plenty of very
             | rare Han characters people really, really want to see
             | encoded (for place and personal names, etc.). Of course, in
             | hindsight an encoding with nontrivial string equivalences
             | (e.g. combining diacritics) was never going to be as simple
             | to handle as ASCII.
             | 
             | [1] https://unicode.org/history/unicode88.pdf
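              | 
              | A hypothetical helper sketching the counting rule from
              | the first paragraph (assumes the input is already
              | valid UTF-8, as LSP transport is):
              | 
              |       #include <stddef.h>
              | 
              |       /* one UTF-16 unit per lead byte, two if the
              |          lead byte is >= 0xF0 (4-byte sequence) */
              |       size_t utf16_units(const unsigned char *s,
              |                          size_t n)
              |       {
              |           size_t u = 0;
              |           for (size_t i = 0; i < n; i++) {
              |               if ((s[i] & 0xC0) == 0x80)
              |                   continue;   /* continuation byte */
              |               u += (s[i] >= 0xF0) ? 2 : 1;
              |           }
              |           return u;
              |       }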
        
           | neonsunset wrote:
           | It's mostly a libc issue. Conversion is completely painless
           | in better languages.
        
             | blueflow wrote:
             | Charset conversion being painless where?
        
               | rwmj wrote:
               | Not any that I'm aware of. You have your choice of
               | programming languages where they got it wrong (Python 3,
               | Ruby), languages which are incredibly nit-picky (Rust),
               | languages which are full of footguns (C, C++), languages
               | which pass on the issue (OCaml), languages which assume
               | everything is UTF8 (Perl), languages which assume
               | everything is UTF16 except when they forgot about planes
               | and assume UCS-2 (Java, everything Windows), languages
               | which are mad (APL), ...
        
               | im3w1l wrote:
                | I'm curious what you think Python got wrong.
        
               | rwmj wrote:
                | Your program will crash at runtime if you pass any
                | string that isn't valid UTF-8 and you forgot to use
                | bytes. And using .decode requires that your input
                | have a valid, known encoding, which is rarely true in
                | the real world of messy data.
               | 
                | It's a problem with Unix filenames, where the
                | encoding is just a convention. A Python program that
                | doesn't take great care can crash on a parameter that
                | takes a filename, even if that filename is just
                | passed to a function like 'open', where no
                | sanitisation or conversion would be necessary.
               | 
                | This is a very real problem in hivex, our Windows
                | registry library, where the Python bindings don't
                | really work well. The Windows registry is a
                | hodgepodge of random encodings, essentially whatever
                | the program that wrote the registry key thought it
                | was using at the time. When parsed through Python as
                | a string, this means you'll get Unicode decoding
                | errors all over the place.
               | 
               | Also more in this article:
               | https://changelog.complete.org/archives/9938-the-python-
               | unic...
        
               | lyu07282 wrote:
                | You can tell .decode what to do with errors so it
                | won't throw an exception; the default is 'strict'. I
                | think Python 3.6+ did a pretty good job with it
                | overall.
               | 
               | Don't get me wrong it's still painful and annoying and
               | bug prone, but the point is, it's encodings, it's always
               | going to suck no matter what.
        
               | SAI_Peregrinus wrote:
               | > It's a problem with Unix filenames where the encoding
               | is just a convention
               | 
               | UNIX filenames are NOT necessarily printable text,
               | they're byte strings. Don't treat them as printable text.
               | They're sequences of bytes not containing 0x00 or 0x2F,
               | but _with no encoding_.
               | 
               | Same for Windows registry keys. Don't mistake byte
               | strings for text.
               | 
               | Text is a byte string with an encoding which describes
               | which byte values are valid and which characters each
               | byte (or sequence of bytes) corresponds to. If the
               | encoding information is discarded, it stops being text
               | and becomes just a byte string.
        
               | Karellen wrote:
               | > UNIX filenames are NOT necessarily printable text,
               | they're byte strings. Don't treat them as printable text.
               | 
               | But they _nearly always_ also happen to be printable
               | text. Treating them as printable text is _soooo_
               | convenient 99.9% of the time.
               | 
               | Having them be almost printable text, but not quite, is
               | an API design wart that is begging to be used
               | incorrectly. It's the inverse of an affordance.
        
               | SAI_Peregrinus wrote:
               | Sure, I can agree with that. Just because the API is bad
               | doesn't mean you can treat it as though it's the API you
               | wish it were.
        
               | jcranmer wrote:
               | To be honest, I disagree. Unix filenames are printable
               | text, they're just text that the OS chooses not to
               | enforce any validation on. Especially since the charset
               | is implied by your choice of locale, which is a user
               | decision, not a kernel decision. But we've transitioned
               | into a world where pretty much all systems have settled
               | on UTF-8 as the charset of choice, and the OS's refusal
               | to even permit kernel options to forcibly validate
               | filenames as UTF-8 is starting to look like a poor
               | decision.
               | 
               | (IIRC, at least one of the BSDs has actually moved to
               | forcing filenames to be UTF-8 and refusing path names
               | that aren't UTF-8. Would only that Linux moved down that
               | path as well so that we could be done with this farce.)
        
               | SAI_Peregrinus wrote:
               | A filename consisting of nothing but ASCII Bell
               | characters (0x07) is valid. Those are non-printable
               | characters that (used to) make a sound from the PC's
               | speaker. POSIX filenames can be sound, not text.
               | 
               | I'd agree it'd be nice if we could restrict filenames to
               | valid UTF-8. But that's not the API that existing
               | filesystems provide, nor what (most of) the existing OSes
               | enforce.
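                | 
                | A minimal sketch of that bell-filename claim,
                | assuming an ordinary POSIX filesystem:
                | 
                |       #include <fcntl.h>
                |       #include <stdio.h>
                |       #include <unistd.h>
                | 
                |       int main(void)
                |       {
                |           /* three BEL (0x07) bytes: a perfectly
                |              valid POSIX filename */
                |           int fd = open("\a\a\a",
                |                         O_CREAT | O_WRONLY, 0644);
                |           if (fd < 0) {
                |               perror("open");
                |               return 1;
                |           }
                |           close(fd);
                |           puts("created a file you can hear");
                |           return 0;
                |       }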
        
               | edflsafoiewq wrote:
               | That's an important way of looking at it, and is correct
               | as far as traditional UNIX operations like fopen go. But
               | it isn't the whole story because many _other_ operations
               | require treating paths as text. For example, converting
               | them to URIs, putting them in ZIP files, or looking up a
               | file on a filesystem which internally stores filenames in
               | UCS-2.
               | 
               | Threading both of these needles at once basically
               | requires viewing paths as potentially-invalid encoded
               | text.
        
               | tialaramex wrote:
                | Of these outcomes I like the choice to pick nits
                | best. I felt the same way before I learned any Rust
                | (i.e. I wrote C where I had to manually groom
                | strings), so this went in my pile of things to like
                | about Rust. Apparently Rust only narrowly chose to
                | have &str (the string slice type, which is guaranteed
                | to be UTF-8 text) at all, rather than just &[u8] (a
                | slice of bytes) everywhere, and to my mind that's a
                | pretty serious benefit.
               | 
               | The choice to have this in a language with the safe/
               | unsafe distinction works very nicely because in so many
               | languages you'd have this promised UTF-8 type and then in
               | practice everybody and their dog uses the unsafe assumed
               | conversion because it's easier, but in Rust you're pulled
               | up short because that conversion needs an unsafe block,
               | your local style may require (and good practice certainly
               | does) that you address this with a safety comment
               | explaining why it's OK and... it just isn't, in most
               | cases. So you write the _safe_ conversion instead unless
               | you really need not to. This is a really nice nudge, you
                | _can_ do the Wrong Thing(tm), but it's just easier
                | not to.
        
               | masfuerte wrote:
               | Rust still has its warts. When dealing with an archive,
               | say, you can find yourself needing to deal with Windows
               | strings on Unix, or vice versa, but Rust only provides
               | the string type for the platform you are running on. It
               | could do with UnixString and WindowsString in addition to
               | OsString.
               | 
               | MFC had a similar problem for years. It had a CString
               | class which was ANSI or Unicode, depending on how your
               | compiled your app, but moderately often you needed the
               | other one, so it should have had a CStringA and CStringW
               | too, with nice conversions between them.
        
               | tialaramex wrote:
                | OsString isn't "the native string type"; it's a
                | container for whatever was convenient for Rust's
                | internals on that OS. There are convenience functions
                | so it probably _feels_ like it's "UnixString" or
                | "WindowsString", but it's neither.
               | 
                | In most of these file format cases what you've got is
                | &[u8] or &[u16], and _maybe_ it's the NonZero variant
                | instead, so I think it's fine to be explicit that
                | that's what is going on and maybe in the process
                | remind you to check - is this data UTF16LE? UTF16
                | with a BOM? UCS2 with a nod and a wink? Just
                | arbitrary 16-bit integers and good luck?
               | 
               | But like I said, I favoured "picking nits" long before I
               | learned Rust, so mileage may vary.
        
               | pornel wrote:
                | IMHO Rust got OsString wrong - it indirectly promised
                | that UTF-8 can always be cast to it without any
                | copying or allocations, so on Windows it has to use
                | WTF-8 rather than UCS-2/UTF-16. Instead of being the
                | OS's preferred string type, it's merely a trick to
                | preserve unpaired surrogates.
        
               | tialaramex wrote:
                | Huh. Where is the "indirect promise"? Is there an
                | `as` function (conventionally hinted as "free", so
                | it's not good if there's one which isn't very cheap)?
                | Like as_os_string or something?
        
               | LegionMammal978 wrote:
               | It's because str implements AsRef<OsStr> [0]. The
               | function signature promises that whenever you have a
               | borrowed &'a str, the standard library can give you a
               | borrowed &'a OsStr with the same data.
               | 
               | Since references can't have destructors (they don't own
               | the data like an OsString does), it means that the
               | standard library can't give you a newly-allocated string
               | without leaking it. Since obviously it isn't going to do
               | that, the &OsStr must instead just act as a view into the
               | underlying &str. And the conversion can't enforce any
               | extra restrictions on the input string without breaking
               | backward compatibility.
               | 
               | The overall effect is that whatever format OsStr uses, it
               | has to be a superset of UTF-8.
               | 
               | [0] https://doc.rust-
               | lang.org/1.77.2/std/ffi/struct.OsStr.html#i...
        
               | tialaramex wrote:
               | Ah, an AsRef<OsStr>, yep that would do it. Thanks.
        
               | jcranmer wrote:
                | I wouldn't say it's necessarily wrong; it may be
                | (accidental) foresight. Windows has added a UTF-8
                | code page, which means you can get full support, as
                | near as makes no difference, with the A functions
                | instead of the W functions.
               | 
               | That said, even now in 2024, it's not clear how much of a
               | bet Windows is making on UTF-8 versus UTF-16.
        
               | pornel wrote:
               | I would be surprised if the UTF-8 support in Windows was
               | anything deeper than the A functions creating a W string
               | and calling W functions, which is what Rust is doing
               | already.
        
               | pjmlp wrote:
               | The first time I had to deal with this was on Turbo
               | Pascal for Windows, where we got C strings alongside
               | Pascal ones.
               | 
               | Then when 32 bit came to be we had the other variations
               | on top.
               | 
               | Got to love leaky abstractions.
        
               | randrus wrote:
                | > ...languages which are mad...
               | 
               | Reference to "Celestial Emporium of Benevolent
               | Knowledge"?
        
               | rwmj wrote:
               | It was a sly dig at APL for using non-ASCII characters as
               | regular operators. Actually I have no idea how those were
               | implemented in the language. Presumably not as Unicode
               | since APL predates Unicode by quite a considerable number
               | of years. Does anyone know?
               | 
               | As a meta-joke I was also considering:
               | 
               | APL (j/k)
        
               | mlochbaum wrote:
               | In modern APLs a character scalar is just a Unicode code
               | point, which you might consider UTF-32. It's no trouble
               | to work with. Although interfacing with things like
               | filenames that can be invalid UTF-8 is a bit of a mess;
               | Dyalog encodes these using the character range that is
               | reserved for UTF-16 surrogates and therefore unused. If
               | you know you want to work with a byte sequence instead,
               | you can use Unicode characters 0-255, and an array of
               | these will be optimized to use one byte per character in
               | dialects that care about performance.
        
               | vbezhenar wrote:
               | Java is pretty painless. Never had any issues with it.
        
               | bobmcnamara wrote:
                | Java doesn't read Unicode BOMs sanely. Heaven help
                | you parsing mixed input files.
        
               | neonsunset wrote:
                | It must be very difficult to use one of the dozens of
                | packages that let you detect and pick the correct
                | encoding to wrap a file stream before reading it into
                | a string.
        
               | neonsunset wrote:
               | Not in C/C++ which is what makes you ask this question :)
               | 
               | Rust, C#, Java and Go are fairly straightforward in this
               | regard.
               | 
               | `Encoding.{Name}.GetString/GetBytes`
        
             | neonsunset wrote:
             | -4 badge of honor :D
             | 
              | Either way, I suggest to the readers who might feel
              | upset over this statement to explore something outside
              | of C and C++; liking those, when it comes to strings,
              | is nothing short of Stockholm syndrome.
             | 
              | I'm working on a UTF-8 string library for C#, and over
              | the last 6-8 months I explored string design in Rust,
              | Swift, Go, C, C++ and a little in other languages. C
              | and C++ were, by far, the most horrifying in the amount
              | of footguns as well as the average effort required to
              | perform trivial operations (including the transcoding
              | discussed here).
             | 
              | Strings are _not_ easy. But that does not mean their
              | complexity has to be unjustified or unreasonable, which
              | it is in C++ and C (for reasons somewhat different
              | although overlapping). The problem comes from the fact
              | that C and C++ do not enjoy the benefit of the
              | hindsight Rust had when designing its string around
              | being UTF-8 exclusive, with special types to express
              | opaque, ANSI or UTF-16 encodings for situations where
              | UTF-8 won't do.
             | 
             | But I assure you, there will be strong negative correlation
             | here between complaining about string complexity and using
             | Rust, or C#/Java or even Go. Keep in mind that Go's strings
             | are still a poor design that lets you arbitrarily tear code
              | points and forgoes the richness and safety of Rust strings.
             | Same, to an extent, applies to C# and Java strings, though
             | they are also safe mostly through a quirk of UTF-16 where
             | you can only ever tear non-BMP code points, which happen
             | infrequently at the edges of substrings or string slices as
             | the offsets are produced by scanning or from known good
             | constants.
             | 
              | If, at your own peril, you still wish to stay with C++,
              | then you may want to look at QString from Qt, which
              | shows what a decent string-type UX should look like.
        
               | burntsushi wrote:
               | Go's strings aren't poor design. The only difference
               | between a Go string and a Rust &str/String is that the
               | latter is required to be valid UTF-8. In Go, a string is
               | only conventionally valid UTF-8. It is permitted to
               | contain invalid UTF-8. This is a feature, not a bug,
               | because it more closely represents the reality of data
               | encoded in a file on Unix. Of course, this feature comes
               | with a trade-off, because Rust's _guarantee_ that
               | &str/String is valid UTF-8 is _also_ a feature and not a
               | bug.
               | 
               | I wrote more about this here:
               | https://blog.burntsushi.net/bstr/#motivation-based-on-
               | concep...
               | 
               | I mention gecko as an example repository that contains
               | data that isn't valid UTF-8. But it isn't unique. The
               | cpython repository does too. When you make your string
               | type have the invariant that it must be valid UTF-8,
               | you're giving up _something_ when it comes to writing
               | tools that process the contents of arbitrary files.
        
           | bdd8f1df777b wrote:
            | I've seldom seen UTF-16 text files, even under Windows.
            | Sometimes, but not that often.
        
             | Tempest1981 wrote:
             | iirc, RC files are often UTF-16
             | 
             | https://stackoverflow.com/questions/72143553
             | 
             | > it looks like switching back to UTF-16 would unblock your
             | experience with Resource Editor
             | 
             | Edit: reportedly changed in VS2017 15.9
        
             | frutiger wrote:
             | You need to translate any text content into UTF-16 before
             | calling Win32 APIs in most cases.
        
               | Mindless2112 wrote:
               | You can set the code page to UTF-8 now.
               | 
               | https://learn.microsoft.com/en-
               | us/windows/apps/design/global...
        
           | torstenvl wrote:
           | This is pretty easy to implement.
           | 
           | https://pastebin.com/bmxCNZiG
           | 
            | Getting the hi and lo UTF-16 code units can use the BSD-
            | licensed
            | 
            |       uint16_t le16toh(uint16_t little_endian_16bits)
            | 
            | and
            | 
            |       uint16_t htole16(uint16_t host_16bits)
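            | 
            | A hypothetical sketch of the surrogate math involved
            | (not the pastebin code), using htole16 from <endian.h>
            | on glibc, or <sys/endian.h> on the BSDs:
            | 
            |       #include <endian.h>
            |       #include <stdint.h>
            | 
            |       /* encode cp as UTF-16LE; returns unit count */
            |       int put_utf16le(uint32_t cp, uint16_t out[2])
            |       {
            |           if (cp < 0x10000) {     /* BMP: one unit */
            |               out[0] = htole16((uint16_t)cp);
            |               return 1;
            |           }
            |           cp -= 0x10000;          /* 20 bits remain */
            |           out[0] =
            |             htole16((uint16_t)(0xD800 | (cp >> 10)));
            |           out[1] =
            |             htole16((uint16_t)(0xDC00 | (cp & 0x3FF)));
            |           return 2;
            |       }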
        
           | Jarred wrote:
           | Try simdutf. You'll get a performance boost too.
        
           | nitwit005 wrote:
            | I'd encourage you to just write the function. It's
            | ultimately just two encodings of the same data. You can
            | figure things out from the Wikipedia pages for UTF-8 and
            | UTF-16.
        
           | heftig wrote:
           | `wcstombs` and `mbstowcs` sound like they might do this?
           | 
            | They're C89 standard functions and should convert between
            | "wide strings" and "multibyte strings", which should be
            | native UTF-16 and UTF-8 if your current locale is a UTF-8
            | locale.
           | 
           | Apparently this works on Windows since Windows 10 version
           | 1803 (April 2018).
           | 
           | There are also "restartable" variants `wcsrtombs` and
           | `mbsrtowcs` where the conversion state is explicitly stored,
           | instead of (presumably) a thread-local variable.
           | 
           | C11 added "secure" variants (with an `_s` suffix) of all
           | these which check the destination buffer size and have
           | different return values.
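            | 
            | A minimal round-trip sketch (error handling kept short;
            | note that on Linux the wide side is UTF-32, not UTF-16):
            | 
            |       #include <locale.h>
            |       #include <stdio.h>
            |       #include <stdlib.h>
            | 
            |       int main(void)
            |       {
            |           /* adopt the environment's UTF-8 locale */
            |           setlocale(LC_ALL, "");
            |           const char *u8 = "h\xC3\xA9llo";
            |           wchar_t w[64];
            |           size_t n = mbstowcs(w, u8, 64);
            |           if (n == (size_t)-1) {
            |               perror("mbstowcs");
            |               return 1;
            |           }
            |           printf("%zu wide chars\n", n);
            | 
            |           char back[64];
            |           if (wcstombs(back, w, sizeof back)
            |                   == (size_t)-1)
            |               return 1;
            |           puts(back);    /* original UTF-8 again */
            |           return 0;
            |       }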
        
         | bawolff wrote:
         | We will always have to read historical formats from time to
         | time. UTF-8 already has extremely good penetration.
        
           | Longhanks wrote:
           | Cries in WinAPI
        
             | jsheard wrote:
             | They're getting there, starting from later builds of
             | Windows 10 there's a manifest flag you can set on your
             | executables which makes all of the legacy ASCII interfaces
             | accept and return UTF-8 instead. Windows is probably just
             | converting to and from UTF-16 internally but that's not
             | your problem anymore.
        
               | im3w1l wrote:
                | Will it refuse to run on older versions of Windows,
                | or silently do the wrong thing?
        
               | jsheard wrote:
               | I believe it silently does the wrong thing so you should
               | probably enforce a minimum Windows version by some other
               | means. The last version to not support that UTF-8 flag
               | has been officially EOL for years so cutting it off
               | completely is on the table, depending on your audience.
        
               | andreyv wrote:
                | Even the original version 1507 is still supported on
                | the LTSC channel. Support for the UTF-8 manifest
                | setting was added only in version 1903.
        
               | Tempest1981 wrote:
               | Maybe this? https://learn.microsoft.com/en-
               | us/windows/apps/design/global...
        
             | pjmlp wrote:
              | It isn't that bad actually (licks WinRT scars).
        
         | Karellen wrote:
          | > Too bad it's defined in POSIX.
         | 
          | Well, a conforming implementation could just return
          | -1/EINVAL from `iconv_open()` for any given pair of
          | character encodings.
         | 
         | https://manpages.debian.org/bookworm/manpages-dev/iconv_open...
        
         | TacticalCoder wrote:
          | > Man, I wish everything were UTF-8 so that iconv would not
          | be needed anymore. Too bad it's defined in POSIX.
         | 
          | I wish _nothing_ was in UTF-8 and UTF-8 was relegated to
          | properties files. There are codebases out there with
          | complete i18n and l10n in more languages than most here
          | have ever worked with, where there are _zero_ Unicode
          | characters allowed in source code files (with pre-commit
          | hooks preventing committing such source code files).
         | 
          | Bruce Schneier was right all along in 1998 or whatever the
          | date was when he said: _"Unicode is too complex to ever be
          | secure"_.
         | 
          | We've seen countless exploits based on Unicode. The latest
          | (re)posted here on HN was a few days ago: some Unicode
          | parsing bug affecting OpenSSL. Why? To allow support for
          | internationalized domain names and/or internationalized
          | emails.
         | 
         | Something that should _never_ have been authorized.
         | 
         | We don't need more of what brings countless security exploits:
         | we need less of it.
         | 
          | Relegate Unicode to translation/properties files, where it
          | belongs.
         | 
         | Sure, Unicode is great for documents, chat, etc.
         | 
         | But everything in UTF-8? emails? domain names? source code?
         | This is madness.
         | 
          | I don't understand how anyone can regard HANGUL fillers
          | being valid in source code as somehow a great win for our
          | industry.
        
           | blueflow wrote:
           | Let me guess, English is your native language?
        
             | apantel wrote:
              | This is the obvious complaint, but ASCII is the only
              | common subset of the various encoding schemes. For some
              | things like programming languages and protocols I think
              | it makes sense to have an ASCII constraint. You can
              | express pretty much any sound phonetically using the
              | Latin alphabet.
        
               | hobs wrote:
                | The last part is so obviously wrong as to be silly;
                | off the top of my head, voiced lateral fricatives are
                | used in some of my home languages and would not be
                | possible to explain in simple Latin-alphabet terms.
               | 
               | How about a tonal language? Or a whistling language? Or a
               | clicking language?
        
         | kingspact wrote:
         | UTF-8 is bloat and should never have been a first-class
         | encoding on Unix. Unix is Western and Latin.
        
         | snnn wrote:
          | Man, if English were the only human language in this world,
          | who would need UTF-8? The other encodings exist because
          | they are more efficient for the other languages, especially
          | the Chinese, Japanese, and Korean languages. UTF-8 takes
          | 50% more space than the alternatives. Too bad modern Linux
          | systems only support UTF-8 locales.
        
           | loeg wrote:
           | The other encodings mostly exist for historical reasons;
           | efficiency is just not a huge factor in 2024.
        
           | Karellen wrote:
            | > Too bad modern Linux systems only support UTF-8
            | locales.
           | 
            | Do they? On my system:
            | 
            |       $ grep _ /etc/locale.gen | grep -v UTF-8 | wc -l
            |       183
           | 
           | That's 183 non-UTF-8 locales that are available on my system.
           | OK, I don't have any non-UTF-8 locales currently configured
           | for use, but I don't have to install anything extra for them
           | to be available. Just uncomment some configuration lines and
           | re-run `locale-gen`.
           | 
           | https://manpages.debian.org/bookworm/locales/locale-
           | gen.8.en...
        
             | snnn wrote:
              | But the reality is: most glibc functions like `dirname`
              | cannot handle non-UTF-8 encodings, because in some
              | encodings (like GBK) the second byte of a double-byte
              | character can fall in the ASCII range, which means that
              | when you search for an ASCII char (like '\') in a char
              | array, you may accidentally hit half of a non-English
              | character. Therefore, people in Asia usually do not use
              | the non-UTF-8 locales.
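              | 
              | A small sketch of that trap (0x81 0x5C is a legal GBK
              | lead/trail byte pair, chosen purely for illustration):
              | 
              |       #include <stdio.h>
              |       #include <string.h>
              | 
              |       int main(void)
              |       {
              |           /* one double-byte GBK char, then "path" */
              |           const char gbk[] = "\x81\\path";
              |           const char *hit = strchr(gbk, '\\');
              |           /* "finds" a backslash that is really the
              |              trail byte of the first character */
              |           printf("0x5C at offset %td\n", hit - gbk);
              |           return 0;
              |       }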
        
       | CodesInChaos wrote:
       | I wonder what the PHP specific part is. Does it automatically
       | convert the encoding based on a request header?
        
         | fweimer wrote:
          | That's the other direction (legacy charset conversion to
          | UCS-4 or UTF-8). That direction is often reachable via the
          | charset parameter in the Content-Type header and similar
          | MIME contexts.
         | 
         | HTTP theoretically supports Accept-Charset, but it's
         | deprecated:
         | 
         | https://www.rfc-editor.org/rfc/rfc9110.html#name-accept-char...
         | 
         | But I think on-the-fly charset conversion in the web server is
         | quite rare. Apache httpd does not seem to implement it:
         | https://httpd.apache.org/docs/2.4/content-negotiation.html#m...
         | 
         | The charset in question does not have a locale associated with
         | it (it's not even ASCII-transparent), so I don't think it's
         | usable in a local context together with SUID/SGID/AT_SECURE
         | programs.
        
         | lyu07282 wrote:
          | My guess is that it's application-specific: PHP
          | applications that use the iconv function in some specific
          | way, in some specific context, will be vulnerable.
         | 
         | https://www.php.net/manual/en/function.iconv.php
        
         | pengaru wrote:
         | Your comment knocked loose a long dormant memory
         | 
         | https://en.wikipedia.org/wiki/Magic_quotes
        
       ___________________________________________________________________
       (page generated 2024-04-21 23:01 UTC)