[HN Gopher] The History and rationale of the Python 3 Unicode mo...
       ___________________________________________________________________
        
       The History and rationale of the Python 3 Unicode model for the
       operating system
        
       Author : goranmoomin
       Score  : 69 points
       Date   : 2022-12-14 15:01 UTC (8 hours ago)
        
 (HTM) web link (vstinner.github.io)
 (TXT) w3m dump (vstinner.github.io)
        
       | zgrxat wrote:
       | The Unicode model is complex and overengineered. Like many Python
       | features, it requires permanent marketing until people believe it
       | is good. It provides a steady income stream for those paid to
       | "work" on Python.
       | 
       | The right way would have been utf-8, which is the only reasonable
       | representation. Indexing could have been broken early in Python 3
       | or outsourced to a special type.
        
         | jerf wrote:
         | Do you mean the right way would be for Python to use utf-8 for
         | file names, or for the filesystem? I'm not sure how "utf-8" is
         | the solution in this case for the problem.
         | 
         | The problem here is that file names aren't any kind of Unicode.
         | You can have a utf-8 filename right next to a utf-16 file name
         | right next to a Shift-JIS filename right next to one that's
         | just binary gibberish that isn't a legal string in any encoding
         | (that has any restrictions at all). They're just bytes with no
         | indication of what coding scheme they are at all. You can't
         | solve this problem with any Unicode solution. (Even the one
         | that tries to wrap bad bytes in a recoverable way, whose name
         | escapes me and I can't recall if it's actually a standard or
         | not, would still make me nervous. You really need code that
         | just treats the filenames as opaque bytes, and only later at
         | display time tries to figure out if they're UTF-8 or not.)
         | 
         | It doesn't matter what Unicode model Python used. That may
         | solve Python's _other_ problems, sure. But filenames would
         | always be some sort of problem. Even if you do just treat them
         | as bytes, all that really does is make your potential errors
         | deterministic, which is still better than a crash probably, but
          | the filesystem's lack of encoding standards is a fundamental
         | problem that can't be solved by higher code levels.
        
           | StefanKarpinski wrote:
           | On UNIX, paths are UTF-8 by convention, but not forced to be
           | valid. Treating paths as UTF-8 works very well as long as you
            | haven't also made the mistake of requiring your UTF-8
            | strings to be valid (which Python did, unfortunately).
           | 
           | On Windows, paths are UTF-16 by convention, but also not
           | forced to be valid. However, invalid UTF-16 can be faithfully
           | converted to WTF-8 and converted back losslessly, so you can
            | translate Windows paths to WTF-8 and everything Just
           | Works(tm) [1].
           | 
           | There aren't any operating systems I'm aware of where paths
           | are actually Shift-JIS by convention, so that seems like a
           | non-issue. Using "UTF-8 by convention" strings works on all
           | modern OSes.
           | 
           | [1] Ok, here's why the WTF-8 thing works so well. If we write
           | WTF-16 for potentially invalid UTF-16 (just arbitrary
           | sequences of 16-bit code units), then the mapping between
           | WTF-16 and WTF-8 space is a bijection because it's losslessly
           | round-trippable. But more importantly, this WTF-8/16
           | bijection is also a homomorphism with respect to pretty much
           | any string operation you can think of. For example
           | `utf16_concat(a, b) == utf8_concat(wtf8(a), wtf8(b))` for
           | arbitrary UTF-16 strings a and b. Similar identities hold for
           | other string operations like searching for substrings or
           | splitting on specific strings.
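            | 
            | A rough Python sketch of the round-trip property, using the
            | stdlib's 'surrogatepass' handler (it's WTF-8-like for
            | unpaired surrogates, though unlike true WTF-8 it doesn't
            | fuse newly adjacent surrogate pairs on concatenation):

```python
# Round-trip invalid UTF-16 through a WTF-8-like byte form losslessly.
raw = b"\x00a\xd8\x00\x00b"  # UTF-16-BE: 'a', a lone high surrogate, 'b'
s = raw.decode("utf-16-be", errors="surrogatepass")
wtf8 = s.encode("utf-8", errors="surrogatepass")  # generalized UTF-8 bytes
back = wtf8.decode("utf-8", errors="surrogatepass")
assert back.encode("utf-16-be", errors="surrogatepass") == raw  # lossless

# Concatenation commutes with the encoding (the homomorphism property):
t = "xyz"
assert (s + t).encode("utf-8", errors="surrogatepass") == wtf8 + t.encode("utf-8")
```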
        
             | ChrisSD wrote:
             | Just to clarify further, note that actually preserving the
             | "as if" behaviour does involve some complexity in the
             | implementation. E.g. appending to WTF-8 has to be handled
             | carefully to ensure it remains truly the same as doing so
             | with WTF-16. This is because any newly paired surrogate has
             | to be converted to its proper UTF-8 encoding. Similarly
             | splitting WTF-8 can potentially break apart what was valid
             | UTF-8 (though I'm not totally convinced that there's a good
             | use case for actually doing this, at least for Windows
             | paths).
             | 
             | Of course the implementation details are something that can
             | and should be handled by a library instead of doing it
             | manually.
        
             | lmm wrote:
             | > There aren't any operating systems I'm aware of where
             | paths are actually Shift-JIS by convention, so that seems
             | like a non-issue. Using "UTF-8 by convention" strings works
             | on all modern OSes.
             | 
             | Nonsense. Unix paths use the system locale by convention,
             | and it's entirely normal for that to be Shift-JIS.
        
             | msla wrote:
             | > On UNIX, paths are UTF-8 by convention, but not forced to
             | be valid.
             | 
             | On UNIX, paths are a sequence of bytes, with two bytes
             | being sacred to the kernel (0x2F, used to separate path
             | elements, and 0x00, used to terminate paths) and no other
             | bytes being interpreted in any way. Any character encoding
             | which respects the sacred bytes by not using them to encode
             | any other characters is therefore usable to make UNIX
             | paths; in fact, a UNIX path can contain multiple encodings,
             | as long as they're all suitably respectful.
             | 
             | That requirement for respect means that UTF-16 and UCS-2
             | and UCS-4 are not suitable. UTF-7 is, however, as is UTF-8,
             | and all of the ISO/IEC 8859 encodings are as well, not to
             | mention a whole raft of non-standard "extended ASCII"
             | character sets. In theory, UTF-16 in some suitably
             | respectful encoding would work, too, but gouge my eyes out
             | with a goddamned spoon.
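              | 
              | Easy to demonstrate on any typical Linux filesystem (a
              | sketch; the directory and byte string are arbitrary):

```python
# Any bytes except 0x00 and 0x2F are accepted as a UNIX filename.
import os
import tempfile

d = tempfile.mkdtemp()
name = b"\xde\xad\xbe\xef"  # not valid UTF-8 (nor much else)
with open(os.path.join(d.encode(), name), "wb"):
    pass
assert name in os.listdir(d.encode())  # the kernel never decoded it
```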
        
               | deathanatos wrote:
               | They know all that. The operative words there are "by
               | convention", not "by requirement".
               | 
               | I.e., their comment is an RFC "OUGHT TO", not an RFC
               | "MUST".
        
               | msla wrote:
               | My point is that assuming UNIX paths are UTF-8 by default
               | is fragile, and that assuming UNIX paths have any
               | consistent character encoding is also somewhat fragile.
               | You can check to see if the path is UTF-8, but,
               | otherwise, mitts off unless the user explicitly tells you
               | an encoding.
        
               | deathanatos wrote:
               | > _assuming UNIX paths are UTF-8 by default is fragile,
               | and that assuming UNIX paths have any consistent
               | character encoding is also somewhat fragile._
               | 
               | Again, we're both aware that it's not guaranteed. But the
               | convention these days is nonetheless UTF-8.
               | 
               | > _mitts off unless the user explicitly tells you an
               | encoding._
               | 
               | This doesn't adequately solve the problem, though. Typed
                | languages have to emit _some type of value_: having that
                | be the "string" type of the language is _useful_, as you
               | can do things like printf a message that includes the
               | filename. (Or display an "Open file..." dialog. Or...)
               | 
               | There are middle-grounds, such as byte smuggling and
               | escaping. But if you take the stance that filenames are
               | arbitrary bags of bytes (which is actually a subset: the
               | reality is even worse) -- then anything that returns a
               | filename is stuck: it can't return a string ("mitts
               | off"). You can take Rust's approach with Path (a type
               | specific to Paths) but people wail about that all the
               | time too ("why are there so many types?"), and you can't
                | print it _because it's not text!_
               | 
               | "Error out if the file name isn't conventional" is a
               | pragmatic tradeoff: bad file names will cause errors, but
               | it makes basically all other operations much more
               | tractable. It's not worth supporting insane file names.
               | 
               | There are workarounds, of course (such as just
                | replacement-charactering ("�") anything that can't be
               | understood), and finding a format that can encode Paths
               | when transmitting them, but these all take more time and
               | effort. Allowing non-text files introduces unnecessary
               | complexity and bugs into every single program that needs
               | to deal with the file system.
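                | 
                | The "byte smuggling" middle ground is what Python itself
                | does for OS data via PEP 383's surrogateescape handler; a
                | small sketch with a made-up filename:

```python
# surrogateescape maps each undecodable byte to a lone low surrogate
# so arbitrary bytes can round-trip through the str type.
raw = b"caf\xe9.txt"  # Latin-1 bytes; \xe9 is invalid as UTF-8
s = raw.decode("utf-8", errors="surrogateescape")
assert s == "caf\udce9.txt"            # bad byte smuggled as U+DCE9
assert s.encode("utf-8", errors="surrogateescape") == raw  # lossless
```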
        
         | mistrial9 wrote:
         | > requires permanent marketing
         | 
         | LOL this is great, so true
        
         | froh wrote:
          | utf-8 needs 24 bits per code point for the logographic unicode
          | scripts, while utf-16 needs 16 bits. So if you're into selling
         | silicon or don't care about these languages, or you only care
         | about heavily mixed content (like html), then yes, utf-8 would
         | be The Right Way (TM).
        
         | StefanKarpinski wrote:
         | Absolutely right. Deprecating direct string indexing would have
         | been the right move. Require writing `str.chars()` to get
         | something that lets you slice by Unicode characters (i.e. code
         | points); provide `str.graphemes()` and
         | `str.grapheme_clusters()` to get something that lets you slice
         | by graphemes and grapheme clusters, respectively. Cache an
         | index structure that lets you do that kind of indexing
         | efficiently once you've asked for it the first time. Provide an
         | API to clear the caches.
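          | 
          | A hypothetical sketch of what such an API could look like
          | (none of these names exist in Python; this is the proposal,
          | not current behavior):

```python
# An explicit code-point view with a cached index, built on demand.
class CharsView:
    def __init__(self, s: str):
        self._cps = list(s)  # decode once; acts as the cached index

    def __getitem__(self, i):
        return self._cps[i]  # O(1) after the one-time build

    def __len__(self):
        return len(self._cps)

v = CharsView("he\u0301llo")  # 'e' + combining acute: 6 code points
assert len(v) == 6
assert v[1] == "e"
```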
         | 
         | Not allowing strings to represent invalid Unicode is also a
         | huge mistake (and essentially forced by the representation
         | strategy that they adopted). It forces any programmer who wants
         | to robustly handle potentially invalid string data to use byte
         | vectors instead. Which is exactly what they did with OS paths,
         | but that's far from the only place you can get invalid strings.
         | You can get invalid strings almost anywhere! Worse, since it's
         | incredibly inconvenient to work with byte vectors when you want
         | to do stringlike stuff, no one does it unless forced to, so
         | this design choice effectively guarantees that all Python code
         | that works with strings will blow up if it encounters anything
         | invalid--which is a very common occurrence.
         | 
         | If only there was a type that behaves like a string and
         | supports all the handy string operations but which handles
         | invalid data gracefully. Then you could write robust string
         | code conveniently. But at that point, you should just make that
         | the standard string type! This isn't hypothetical, it's exactly
         | how Burnt Sushi's bstr type [1] works in Rust and how the
         | standard String type works in Julia.
         | 
         | [1] https://github.com/BurntSushi/bstr
        
           | Jasper_ wrote:
           | It's worth noting that Python str's are sequences of code
           | points, not scalar values. This was a truly horrendous
           | mistake made mostly out of ignorance, but now they rely upon
           | it in surrogateescape to hide "invalid" data, so...
           | 
            | I have ranted for long hours to friends about the insanity of
           | Python 3's text model before. It's mostly the blind leading
           | the blind.
        
             | 323 wrote:
             | I've seen this happen multiple times in Python, where they
             | implement a feature without a thorough survey of what other
             | languages are doing. Then when it turns out the new feature
             | has a major broken part, they fix that, but again, without
             | looking how it's done in other places. asyncio is a major
             | example.
        
           | Animats wrote:
           | Unicode string indexing should have been made lazy, rather
           | than deprecated. Random access to strings is rare. Mostly,
           | operations are moving forward linearly or using saved
           | positions.
           | 
           | So, only build the index for random access if needed.
           | Optimize "advance one glyph" and "back up one glyph"
           | expressed as indexing, and you'll get most of the frequently
           | used cases. Have the "index" functions that return a string
           | index return an opaque type that's a byte index. Attempting
           | to convert that to an integer forces creation of the string
           | index.
           | 
           | This preserves the user visible semantics but keeps
           | performance.
           | 
           | PyPy does something like this.
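            | 
            | A rough sketch of the lazy idea (not PyPy's actual
            | implementation): keep UTF-8 bytes and build a code-point to
            | byte-offset index only when someone does random access.

```python
class LazyIndexStr:
    def __init__(self, data: bytes):
        self._data = data
        self._offsets = None  # built lazily on first random access

    def _build_index(self):
        # Byte offset of each code point, plus a final sentinel.
        offs, pos = [], 0
        for ch in self._data.decode("utf-8"):
            offs.append(pos)
            pos += len(ch.encode("utf-8"))
        offs.append(pos)
        self._offsets = offs

    def __getitem__(self, i):
        if self._offsets is None:  # first random access pays the cost
            self._build_index()
        return self._data[self._offsets[i]:self._offsets[i + 1]].decode("utf-8")

s = LazyIndexStr("héllo".encode("utf-8"))
assert s[1] == "é"
```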
        
           | wk_end wrote:
           | Deprecating direct string indexing might have been the
           | "right" move in some sense but it's hard to imagine it ever
           | happening in Python; it's such a natural and frequently-used
           | operation in the language that you would've broken an
           | enormous amount of code. Because indexing strings is
           | syntactically identical to indexing lists, and because
           | there's no* real static typing, there'd be no* good and
           | robust way to automate the conversion.
           | 
           | Like, the changes Python 3 made were honestly pretty subtle,
           | and nearly 15 years later there's still people reluctant to
           | upgrade. If they broke string indexing I'm pretty sure Python
           | 3 adoption would make Perl 6 uptake look impressive.
        
           | Spivak wrote:
           | > It forces any programmer who wants to robustly handle
           | potentially invalid string data to use byte vectors instead.
           | 
           | Is this not the only thing you can really do? If strings
           | could hold invalid unicode then they effectively become bytes
           | and you now have to be wary of every possible string. I would
           | rather Python just do away with strings entirely in the
           | integration points with the OS and make you either keep them
           | as opaque bytes or decode them.
           | 
           | > behaves like a string and supports all the handy string
           | operations but which handles invalid data gracefully
           | 
           | Unless Guido was feeling particularly practical I can't
           | imagine this ever making it into the stdlib because the
           | choice of "gracefully" is somewhat arbitrary and application
           | dependent.
        
         | zokier wrote:
         | Utf8 would not help with the issue in the article in any way.
        
           | StefanKarpinski wrote:
           | It's not at all obvious how it helps, but it does.
           | 
           | First, why is Python unable to represent invalid path names
           | as strings? Because internally it converts strings from
            | UTF-8, UTF-16, or any other encoding, to a fixed-width array
           | of decoded Unicode code points. The width of integer used to
           | represent code points is determined by the largest code point
           | in the string: if the string is ASCII, it can use a byte
           | (uint8) per character; if the string is non-ASCII but all
           | BMP, then it can use a uint16 per character; otherwise it has
           | to use uint32 per character.
           | 
           | Why does Python do all this? So that you can have O(1)
           | character indexing. If you gave up on that, you wouldn't need
           | to convert the string at all, you could just leave it as
           | (potentially invalid) UTF-8 data.
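            | 
            | The width switching is observable in CPython (PEP 393), e.g.
            | by comparing memory footprints of equal-length strings:

```python
# Per-character storage width grows with the largest code point.
import sys

ascii_s = "a" * 100           # 1 byte per char
bmp_s   = "\u0151" * 100      # 2 bytes per char (non-Latin-1 BMP)
astral  = "\U0001F600" * 100  # 4 bytes per char
assert sys.getsizeof(bmp_s) > sys.getsizeof(ascii_s)
assert sys.getsizeof(astral) > sys.getsizeof(bmp_s)
```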
           | 
           | Suppose you get an invalid path on UNIX where paths are UTF-8
           | by convention? What does Python do with this string? It can't
           | convert it to an array of code points because invalid UTF-8
           | doesn't correspond to a code point (well, it can if it's just
           | illegal, not malformed, but in general, we have to consider
           | completely malformed strings that don't even follow the basic
           | UTF-8 format). So Python is stuck: it can only replace the
           | invalid data with something like the Unicode replacement
           | character. But then you can't do anything useful with that
           | because it's not the correct name of the path you're trying
           | to work with.
           | 
           | How does using UTF-8 to represent strings help? Because you
           | can represent invalid strings: just leave them as-is and
           | don't try to decode them unless you have to. Sure, you can't
           | decode them as code points, but that's actually a pretty
           | unusual thing to do. If someone asks for decoding, _then_ you
           | can give an error. What about Windows where paths are UTF-16
           | by convention? You can convert them to WTF-8 and everything
           | works out. (Described in way more detail here:
           | https://news.ycombinator.com/item?id=33984308).
        
             | zokier wrote:
             | > How does using UTF-8 to represent strings help? Because
             | you can represent invalid strings: just leave them as-is
             | and don't try to decode them unless you have to. Sure, you
             | can't decode them as code points, but that's actually a
             | pretty unusual thing to do. If someone asks for decoding,
             | _then_ you can give an error
             | 
             | How is that better than just handling paths as `bytes`?
        
             | ilyt wrote:
             | > Why does Python do all this? So that you can have O(1)
             | character indexing. If you gave up on that, you wouldn't
             | need to convert the string at all, you could just leave it
             | as (potentially invalid) UTF-8 data.
             | 
              | Seems like "giving up" would've been the better choice,
              | considering just how rare that operation is. Or
              | alternatively doing the conversion lazily, the first time
              | an operation needing runes instead of bytes happens.
             | 
             | Most string operations are not accessing string by index
             | and most of them even at O(n) would be fast enough because
             | n is small. Like in typical "get a file name, extract some
             | info from it", you're doing extraction once and anything
             | after that doesn't need character indexing, because you
             | already got the relevant data.
        
             | masklinn wrote:
             | > How does using UTF-8 to represent strings help? Because
             | you can represent invalid strings: just leave them as-is
             | and don't try to decode them unless you have to.
             | 
             | That's not UTF8. That's a bag'o bytes which might be UTF8.
             | Very different thing.
             | 
             | > Sure, you can't decode them as code points, but that's
             | actually a pretty unusual thing to do.
             | 
             | It's not, any unicode-aware text processing does it
             | implicitly. This means any such processing has to either
             | perform its own validation that the input is valid, or it
             | may fly off the rails entirely if fed nonsense. This also
                | increases the risk of security issues, either outright UB, or
             | the ability to smuggle payloads through overlong encoding.
        
               | [deleted]
        
               | StefanKarpinski wrote:
               | > That's not UTF8.
               | 
               | True; I was careful not to call it that, but treating
               | strings as UTF-8 by convention does make sense.
               | 
               | > It's not, any unicode-aware text processing does it
                | implicitly. This means any such processing has to
               | either perform its own validation that the input is
               | valid, or it may fly off the rails entirely if fed
               | nonsense.
               | 
               | In theory, but that's just not how most string operations
               | actually work. If you have two UTF-8 strings and you want
               | to concatenate them, you just concatenate the bytes. It
               | would be ridiculously inefficient to decode the code
               | points in each string and then re-encode them back into a
               | destination buffer. If you have two UTF-8 strings and you
               | want to see if one is a substring of the other and at
               | what byte index, you just look for the bytes of one as a
               | "substring" of the bytes of the other. Again, it would be
               | ridiculously inefficient to decode the code points in
               | each and do matching on code points. But what if the
               | strings aren't valid UTF-8?! Both of those operations
               | work just fine even if the strings aren't valid and
               | produce sensible, intuitive results.
               | 
               | If you're implementing a browser or a terminal that has
               | to actually display UTF-8 as characters then sure, you
               | have to actually decode characters. Similarly, if you're
               | parsing text somehow, then you have to decode characters.
                | But many programs only do concatenation and search and
               | other operations like that which are actually implemented
               | in terms of byte sequences, not characters.
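                | 
                | Concretely, both operations behave sensibly on invalid
                | UTF-8 precisely because they are byte-level (a sketch
                | with an arbitrary invalid byte string):

```python
a = b"caf\xe9"   # invalid UTF-8: stray Latin-1 byte
b_ = b".txt"
assert a + b_ == b"caf\xe9.txt"     # concatenation: just append bytes
assert (a + b_).find(b".txt") == 4  # substring search: byte search
```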
        
               | masklinn wrote:
               | > True; I was careful not to call it that
               | 
               | You specifically called it UTF8, repeatedly. The very
               | comment I quoted asserts that "Utf8 would help deal with
               | the issue [of garbage inputs]" (in its denial of the
               | opposite assertion). You also did it in
               | https://news.ycombinator.com/item?id=33986421
               | 
               | > If you have two UTF-8 strings and you want to
               | concatenate them, you just concatenate the bytes.
               | 
               | That's not a unicode-aware operation, it's mostly a
               | unicode-irrelevant operation (though unicode awareness
               | can be useful in edge cases because of special grapheme
               | clusters, but that's very task-specific).
               | 
               | > But what if the strings aren't valid UTF-8?! Both of
               | those operations work just fine even if the strings
               | aren't valid and produce sensible, intuitive results.
               | 
                | If your content is not actually UTF-8, you can end up
                | with valid UTF-8, thus changing the semantics of the
                | content.
               | You can also end up with overlong UTF-8, which also
               | changes the semantics of the content in a worse way.
        
               | StefanKarpinski wrote:
                | The comment that you're quoting wasn't mine. The
                | comment you link to says "UTF-8 by convention". If either
               | string is valid, then the result is as expected. If
               | you're concatenating two strings that are both invalid
               | UTF-8, there's not much you can do that's better than
               | just concatenating the bytes together... which is exactly
               | what treating them as byte arrays would end up doing (but
               | it's less convenient). If you're worried about invalid
               | UTF-8 you can check for validity (which again, is exactly
               | what you end up doing if you use byte arrays).
        
               | adgjlsfhk1 wrote:
               | The problem with using strict UTF-8 for paths is that
               | paths aren't guaranteed to be valid UTF-8. How do you
                | want to write a program that opens a path whose name is
               | invalid UTF-8?
        
               | masklinn wrote:
               | > The problem with using strict UTF-8 for paths is that
               | paths aren't guaranteed to be valid UTF-8.
               | 
               | Ok but I'm not saying to do that. I'm saying if you have
               | not-utf8 strings don't call them UTF8.
               | 
               | > How do you want to write a program that opens a path
                | whose name is invalid UTF-8?
               | 
               | That's not my problem given I'm not advocating for that.
        
               | StefanKarpinski wrote:
               | The issue is that when you're implementing something like
               | a programming language or a robust general purpose
               | utility, then simply not being able to open--or list or
               | remove or stat--paths with invalid names is not really
               | acceptable.
        
       | PaulHoule wrote:
       | One time my job was standardizing a machine learning model
       | training system in Python so you could develop models on a Mac or
       | PC and then train them with large data sets on a DGX-1 for
       | production.
       | 
       | Over the course of a few months I developed quite a list of
       | configuration problems that could break Python scripts and some
        | of the worst involved character encoding. In particular, in many
       | Pythons, writing invalid Unicode in a print statement could crash
       | your Python. It's one thing to say "don't write invalid Unicode
       | in a print" but if you are sucking in a large number of 3rd party
       | libraries you can't control their use of print. It turned out
       | many customers had CSV files with invalid Unicode sequences so we
       | couldn't control the input. People would say "just use docker"
       | but the team was always finding and making docker images with
       | strangely configured Pythons. (If anything Docker seemed to make
       | it easier and faster to use and install misconfigured software,
       | not make it easier to get things under control.)
       | 
       | It's good news that Python, like Java, is defaulting to UTF-8
       | instead of whatever locale it gets from the OS because frequently
       | this locale is not just "not UTF-8" but something really bizarre.
        
       | sakras wrote:
       | I'm a little confused, how can a file name be non-decodable? A
       | file with that name exists, so someone somewhere knows how to
       | decode it. Why wouldn't Python just always use the same encoding
       | as the OS it's running on? Is this some locale-related thing?
        
         | jjtheblunt wrote:
         | Does trying to decode, on Unix, a Windows file path (not
         | filename) like "drive:blah" freak it out as an example of non-
         | decodable?
        
           | johannes1234321 wrote:
           | That's a valid file name. Non-decodable are byte sequences
           | which are invalid UTF-8 byte sequences (if UTF-8 is expected)
        
         | masklinn wrote:
         | > A file with that name exists, so someone somewhere knows how
         | to decode it.
         | 
         | No. A unix filename is just a bunch of bytes (two of them being
         | off-limits). There is no requirement that it be in _any_
         | encoding.
         | 
         | You can always use a fallback encoding (an iso-8859) to get
          | _something_ out of the garbage, but it's just that, garbage.
         | 
         | Windows has a similar issue, NTFS paths are sequences of UCS2
         | code units, but there's no guarantee that they form any sort of
         | valid UTF-16 string, you can find random lone surrogates for
         | instance.
         | 
         | And I'm sure network filesystems have invented their own even
         | worse issues, because being awful is what they do.
         | 
         | > Why wouldn't Python just always use the same encoding as the
         | OS it's running on?
         | 
          | 1. because OSes don't really have encodings, Python has a
         | function to try and retrieve FS encoding[0] but per the above
         | there's no requirement that it is correct for any file, let
         | alone the one you actually want to open (hell technically
         | speaking it's not even a property of the FS)
         | 
          | 2. because OSes lie and user configurations are garbage, you
          | can't even trust the user's locale to be configured properly
          | for reading _files_ (another mistake Python 3 made,
          | incidentally)
         | 
         | 3. because the user may not even have created the file, it
         | might come from a broken archive, or some random download from
         | someone having fun with filenames, or from fetching crap from
         | an FTP or network share
         | 
          | There are a few FS / FS configurations which are reliable, in
          | which case they either error or pre-mangle the files on
          | intake.
         | 
         | IIRC ZFS can be configured to only accept valid UTF-8
         | filenames, HFS(+) requires valid unicode (stored as UTF-16) and
         | APFS does as well (stored as UTF-8).
         | 
         | [0]
         | https://docs.python.org/3/library/sys.html#sys.getfilesystem...
        
         | StefanKarpinski wrote:
         | Unfortunately, neither UNIX nor Windows require path names to
         | be valid Unicode. UNIX interprets them as "UTF-8 by convention"
         | and Windows as "UTF-16 by convention" but both actually allow
         | arbitrary sequences of code units. It would be nice if this
         | didn't actually occur, but alas, it does, and if you're writing
         | general purpose utilities that work with files, you don't want
         | them to simply crash when this happens.
        
         | heisenzombie wrote:
         | It's filesystem-dependent, but a lot of filesystems treat
         | filenames as just an arbitrary sequence of bytes. Most OSes
         | sort of hide this from you, but there's nothing stopping you
          | doing weird stuff like:
          | 
          |     > touch `echo 00: DEADBEEF | xxd -r`
          |     > ls
          |     'ey'$'\276\357'
        
         | CorrectHorseBat wrote:
          | Unix allows any byte in a filename except 0x00 (ASCII NUL) and
          | 0x2f (ASCII '/'). A name doesn't have to be decodable as text.
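
This byte-oriented behavior is easy to demonstrate from Python, whose `os` functions accept bytes paths and, when given bytes, return bytes (sketch; assumes a Unix filesystem that permits arbitrary bytes in names):

```python
import os
import tempfile

# On Unix, any byte except 0x00 (NUL) and 0x2f ('/') is legal in a
# filename, so names need not be valid text in any encoding. Python's
# os functions accept bytes paths, and given bytes they return bytes.
d = tempfile.mkdtemp().encode()
weird = os.path.join(d, b"\xde\xad\xbe\xef")  # not valid UTF-8
open(weird, "wb").close()
print(os.listdir(d))  # [b'\xde\xad\xbe\xef']
```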
        
       | cabirum wrote:
       | Filenames come from multiple filesystems (and not just
       | filesystems!), with different encodings and such. Both NTFS and
       | ext3 can use arbitrary byte sequences as names, not representable
       | in UTF-8. (Contrary to popular belief, NTFS works with any
       | sequence of code units except 0x0000, and names may contain
       | invalid UTF-16, such as unpaired surrogates.)
       | 
       | The only sane way to deal with them all is to use bytes and bytes
       | only (or use a sane language with an encoding-agnostic string
       | type; see: golang). Other solutions introduce loss of data.
        
         | deathanatos wrote:
         | > _The only sane way to deal with them all is to use bytes and
         | bytes only_
         | 
         | It's not _really_, though. Yes, it is if you want to round-trip
         | the name of the file and be fully general over the bad situation
         | provided by the OS. But as soon as you want to _display_ that
         | filename, or really do much of anything with it, you're back
         | into the world of hurt the article describes, and you need some
         | super complicated type or logic that can convey to the user
         | "hey, this file's name is some random undecodable garbage".
         | 
         | The sane state would have been "file names are text", but
         | filenames/Unix predates software engineering having a good
         | discipline around text encoding, and it predates Unicode. I'd
         | go as far as to argue that "file names are text and \n is
         | disallowed" is the sane state, as multi-line filenames also
         | just cause all sorts of unnecessary complexity.
         | 
         | The problem, of course, is that most OSes enforce neither
         | constraint, which is what leads to things like the struggle
         | Python has gone through, or the Path type in Rust, and the
         | unrealistic nature of trying to move from the present state to
         | such a sane state. (Linux is loath to break compatibility, and
         | I bet Windows would be similar here. There are a number of
         | arguments for that, but in the meantime, it continues to be the
         | case that file names are ridiculously complicated if handled
         | fully correctly.
         | 
         | Or, you just decode them as Unicode (UTF-8/-16 depending on OS)
         | and give up if the user is crazy. 99.99% of the time that'll
         | work just fine.)
         | 
         | > _Or use a sane language with encoding-agnostic string type;
         | see: golang_
         | 
         | Of course, this might be the root of our disagreement. Types
         | that permit the representation of invalid values are not
         | desirable, to me. Especially in a core type, like strings.
        
           | throw0101a wrote:
           | > _It's not really, though. Yes, if you want to round-trip
           | the name of the file and be fully general over the bad
           | situation provided by the OS. But as soon as you want to
           | display that filename, or really do much of anything with it_
           | 
           | IMHO, that is mostly not the file system's problem. Bits
           | going into and out of the file system should act as
           | deterministically as possible.
           | 
           | If you want to start being 'clever' about these problems, do
           | it in userland.
        
             | deathanatos wrote:
             | And the point is that this line of thinking introduces a
             | metric crap ton of unnecessary complexity downstream.
             | 
             | Yeah, an FS _could_ say that[1]. Alternatively, it could
             | say that names are intended to be read by humans, and that
             | names should be a single line, to support things that
             | people reasonably expect to be able to do with a file
             | system, like being able to list the files on the
             | filesystem, without whatever is doing the listing needing
             | to decide if it is going to just say "I can't display raw
             | bytes, obviously, so here's a best-effort of .txt" or if it
             | has to invent some sort of grammar to encode/escape bytes
             | into text.
             | 
             | I.e., a file system is meant to solve a problem, which is
             | the storage _and organization_ of data, and from that, that
             | names are meant to be used by humans is one of the
             | requirements of both the OS & the filesystem. (The OS, as
             | it presents the file hierarchy to userspace, e.g., in the
             | VFS, and the filesystem, as it of course needs to store the
             | data.)
             | 
             | By your logic, we should permit NUL and '/' in file names,
             | too.
             | 
             | [1] And, I mean, most _do_, which is why we're in the
             | situation we're in, why languages are faced with a barrel
             | of bad choices, and why probably 90% of shell scripts fail
             | to properly handle file names. Is this what we _want?_
        
               | [deleted]
        
           | saurik wrote:
           | > Or, you just decode them as Unicode (UTF-8/-16 depending on
           | OS) and give up if the user is crazy. 99.99% of the time
           | that'll work just fine.)
           | 
           | You have to be really careful about the way in which you give
           | up here, as the user isn't necessarily in control of the
           | files on disk: it might be an attacker. If I can hide files
           | in folders from various parts of a Python program -- as was
           | possible with early versions of Python 3, which simply
           | skipped such files -- or cause corruption in how the
           | filenames are round-tripped during a copy (allowing me to
           | bypass a filter in one place but then write to a file with a
           | different critical name in another place) I gain a lot of
           | flexibility with my exploits.
        
             | deathanatos wrote:
             | Yeah, and that's a problem. But again, Python is faced with
             | bad choices because the interface exposed by OSes for file
             | systems is basically broken.
             | 
             | (But that might also fall into the 0.01% of cases where
             | you'd need to pay attention.)
             | 
             | Perhaps you might argue for a dedicated type like Rust's
             | Path, but that too has issues, and I don't think Python had
             | the "design freedom" to go there. (Although it does have
             | pathlib, nowadays, so I wonder how it handles this; but it
             | would also need to duplicate a lot of the functionality in
             | os...)
        
           | CorrectHorseBat wrote:
           | ZFS even has a utf8only flag that disallows anything but
           | utf-8. Sadly it hasn't caught on with other filesystems.
        
             | ilyt wrote:
             | Kinda waste of CPU to test for that tbh. Broken app won't
             | work any better coz of it
        
           | ilyt wrote:
           | > It's not really, though. Yes, if you want to round-trip the
           | name of the file and be fully general over the bad situation
           | provided by the OS. But as soon as you want to display that
           | filename, or really do much of anything with it, you're back
           | into the world of hurt the article describes, and you need
           | some super complicated type or logic that can convey to the
           | user "hey, this file's name is some random undecodable
           | garbage".
           | 
           | It really is. Once you are using a bytes type you can handle
           | displaying it to the user as you see fit, but that's not a
           | decision the stdlib should make for you. Not only that, if a
           | file-related function operates on bytes then by definition it
           | doesn't care about FS semantics, and you're also not wasting
           | any CPU on useless conversions to string.
           | 
           | The blob of bytes acts as an opaque token for the file
           | location, as it should. If you want to display it to users
           | _you have to quote it anyway_, because you can have a
           | perfectly valid UTF-8 file name that contains a newline or
           | special UTF-8 characters that will fuck up how it is
           | displayed, possibly even maliciously.
           | 
           | >> Or use a sane language with encoding-agnostic string type;
           | see: golang
           | 
           | >Of course, this might be the root of our disagreement. Types
           | that permit the representation of invalid values are not
           | desirable, to me. Especially in a core type, like strings.
           | 
           | string is defined as a "read-only slice of bytes", not a
           | "UTF-8 slice of bytes". That is mostly so that conversion
           | between the two is free. If you want to pay the cost of
           | validation you can, but you are not forced to.
           | 
           | The functions operating on them, where relevant to the
           | function's purpose, will treat them as UTF-8.
           | 
           | It might sound bad in theory, and it did to me at first, but
           | in practice it has given me the fewest problems of any
           | language.
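
The quoting point above applies even to perfectly valid Unicode names; a small Python illustration (the filename is hypothetical):

```python
# Even a fully valid UTF-8 filename can wreck naive display: newlines
# and control characters are legal in names. Whatever the underlying
# type, a name must be quoted/escaped before being shown line-by-line.
tricky = "report\n.txt"  # valid text, hostile to line-based output
print(tricky)            # prints across two lines
print(repr(tricky))      # 'report\n.txt' -- unambiguous, one line
```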
        
         | chungy wrote:
         | NTFS disallows 0x002F as well; beyond that, it is possible to
         | use the entire set even from Windows (by installing Interix,
         | for instance). Windows NT was designed around UCS-2, so the
         | file system didn't have to do much but be a dumb bag of bytes
         | for file names; character interpretation was left to upper
         | layers. You can mostly pretend it's UTF-16, until it's not.
         | 
         | Something fun on file systems: ZFS lets you set "utf8only=on"
         | which will guarantee no file names will ever be possible that
         | are not valid UTF-8. I use it on all my systems.
        
           | jcranmer wrote:
           | > Something fun on file systems: ZFS lets you set
           | "utf8only=on" which will guarantee no file names will ever be
           | possible that are not valid UTF-8. I use it on all my
           | systems.
           | 
           | This is the sort of option that should be available for all
           | filesystems, and distros should default new installs to
           | turning it on.
        
             | lmm wrote:
             | Please don't unless you've thoroughly tested in multiple
             | cultures. Unicode is broken for Japanese; on some systems
             | it's possible to change a setting to make it broken for
             | Chinese instead, but the only non-theoretical way to make a
             | program that can list a directory full of Japanese
             | filenames and a directory full of Chinese filenames
             | correctly is to have that program be encoding-aware.
        
             | deckard1 wrote:
              | If we had a time machine, then yes. In fact, it shouldn't
              | even be an option: UTF-8, period. Always.
              | 
              | But alas, we do not, and someone will plug in a memory
              | card or USB drive with some FAT-formatted filesystem and
              | your nice universe crumbles before you.
        
               | jcranmer wrote:
               | Handling non-UTF-8 pathnames via a filesystem mount
               | option is completely viable. Hell, I'd go even further
               | and suggest that the kernel itself reject non-UTF-8
               | pathnames entirely, handling filesystems with the non-
               | UTF-8-pathname option enabled via a translation layer
               | that converts \x00-\xff to \u0000-\u00ff (and vice
               | versa).
               | 
                | The lesson of text is clear: text needs to have a defined
                | charset. Filesystem names are _absolutely_ textual
                | (listing a directory for human display is one of the most
                | common operations). If it's not UTF-8, then what is it?
                | How is the application supposed to figure it out?
                | (Automatic encoding detection is flaky at the best of
                | times, and with the lengths of typical filenames it is
                | almost impossible to make reliable.) The reality is that
                | most applications _already_ assume that filesystem names
                | are UTF-8, and non-UTF-8 names break them.
               | 
                | We are at a stage of Unicode penetration where I think it
                | is reasonable to treat a filesystem that has non-UTF-8
                | filenames as a problem to be solved with some kind of
                | repair tool like fsck, not something that every
                | application is supposed to worry about.
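
The \x00-\xff to \u0000-\u00ff translation layer described above is, incidentally, exactly what decoding as Latin-1 does; a quick Python sketch:

```python
# Decoding as Latin-1 maps every byte 0x00-0xff to the codepoint
# U+0000-U+00FF of the same value, so any byte sequence round-trips
# losslessly -- the 1:1 translation layer proposed above.
raw = b"\xde\xad\xbe\xef"
text = raw.decode("latin-1")
assert text == "\u00de\u00ad\u00be\u00ef"
assert text.encode("latin-1") == raw  # lossless round trip
```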
        
               | chungy wrote:
               | This is why "convmv" exists :) and also why my /tmp is a
               | normal tmpfs where file names can be any byte sequence. I
               | can, for instance, extract random archives there, use
               | convmv to make them UTF-8 (if not already), and then it
               | can go on my ZFS file systems.
        
         | gumby wrote:
         | I have used filesystems that permitted the byte 0x0 in file
         | names. Good way to attack a C program.
        
           | GeorgeTirebiter wrote:
           | I sometimes think the C decision to encode strings as NUL-
           | terminated sequences of non-zero bytes was one of the worst.
           | 
           | The other candidate is the atrocious operator precedence. OK,
           | we now love parens to 'fix' this -- might as well use lisp...
           | ;-)
        
             | ilyt wrote:
             | It was unequivocally one of the worst decisions in
             | programming. I wouldn't be surprised if the cost of dealing
             | with it were in the billions by now. It occasionally saved
             | a byte on devices with little memory, which kinda made
             | sense at the time, but not anymore.
        
               | stabbles wrote:
               | It makes perfect sense when writing and reading strings
               | from disk, since you don't have to worry about where and
               | how to store the length integer (its width in bytes and
               | its endianness).
        
         | russdill wrote:
         | Even using bytes can lead to loss of data when trying to copy
         | from one file-system to another or dealing with operating
         | systems that treat filenames differently when different system
         | calls are used.
        
       | zzzeek wrote:
       | SQLAlchemy creator here.
       | 
       | No single improvement in Python has made my life more
       | dramatically easier than the introduction and standardization of
       | strings as Unicode. The basic reason is that it clearly placed
       | the burden of encoding/decoding on the *outside* edges of a
       | program: right when data is coming in, or right when data is
       | going out. If you want the world to consider the data in between
       | to be "a string", you have to do that for us.
       | 
       | For me, it meant that every PEP-249 DBAPI finally took on the
       | role of handling database encodings fully and completely.
       | Enormous reams of arbitrary string encode/decode logic in
       | SQLAlchemy, having to guess its way around all the different
       | decisions every DBAPI made about this, gone. If I get a string
       | from the DBAPI, it's decoded. If I have a string, I can send it
       | out, DBAPI handles it. If it contains characters that aren't
       | appropriate for the database's encoding, it raises. Great! I send
       | the users off to the manual for their database, they aren't
       | complaining to me. Total lifesaver.
       | 
       | I see a lot of odd complaints here which seem to be people
       | wanting non-textual bytestrings to do things that make sense for
       | text; if someone has a use case like that, there can be libraries
       | that do the particular non-textual-but-text-like manipulations
       | they need, however, these are not the "standard" uses for textual
       | strings in Python. The standard use for textual strings is...to
       | represent text! If I have data that's bytes, then I use bytes!
       | These threads always seem to fill up with the vanishingly small
       | number of people that have issues with this. Just like the anti-
       | ORM threads. All the while the vast majority of devs are just
       | getting work done.
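
The "raises at the edge" behavior described above is plain Python semantics; a minimal sketch, where `latin-1` stands in for whatever charset a database happens to be configured with:

```python
# Encoding only at the program's edge means failures surface loudly
# there instead of silently corrupting data in the middle. Text with
# characters outside the target charset raises UnicodeEncodeError.
s = "knight \u265e"  # U+265E has no latin-1 representation
try:
    s.encode("latin-1")
except UnicodeEncodeError as exc:
    print(exc)
```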
        
         | Alir3z4 wrote:
         | Unicode errors were some of the most painful and annoying
         | issues I had, whether working with XML files, parsing text, or
         | just dealing with anything text-related. When I upgraded to
         | Python 3, they just vanished.
         | 
         | Anti-ORM people probably don't know what an ORM is, or are
         | confused about its usage and the benefits it brings.
         | 
         | Once you're dealing with several tables and relations and lots
         | of aggregation and annotation, raw SQL is a one-way ticket to
         | insanity.
         | 
         | By the way, thank you very much for making SQLAlchemy. You have
         | saved millions of hours of developer time, and god knows how
         | many SQL bugs have been prevented by using it.
        
           | kstrauser wrote:
           | I agree completely. Turns out all those TypeErrors that
           | suddenly popped up when switching to Python 3 were actually
           | uncaught logic errors all along. The biggest annoyance coming
           | from the string/bytes split was realizing how sloppy I'd been
           | about treating them identically.
        
       | mmastrac wrote:
       | Rust uses a solution similar to his original proposal,
       | OsString/OsStr:
       | https://doc.rust-lang.org/std/ffi/struct.OsString.html
       | 
       | I wonder if Unicode needs an "undecodable byte escape" sequence
       | for these raw sequences that cannot be mapped into the string
       | space, allowing a guaranteed round-trip between the OS and
       | Unicode. For this to work, you'd need a guarantee that all
       | codepage/Unicode mappings are consistent and 1:1, which might not
       | be possible.
       | 
         | Paths are tricky because they act as both an identifier and a
         | handle, and the string encodings of the identifier and the
         | handle are _close but not exactly the same_ when Unicode is in
         | the mix. In the ASCII world, you can assume they are the same.
        
         | LegionMammal978 wrote:
         | Python already has the "surrogateescape" error handler [0] that
         | performs something similar to what you described: undecodable
         | bytes are translated into unpaired U+DC80 to U+DCFF surrogates.
         | Of course, this isn't standardized in any way, but I've found
         | it useful myself for smuggling raw pathnames through Java.
         | 
         | [0] https://peps.python.org/pep-0383/
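
A short demonstration of the handler's round-trip behavior:

```python
# PEP 383's surrogateescape handler smuggles each undecodable byte
# through str as a lone surrogate in U+DC80..U+DCFF, and the encoder
# turns those surrogates back into the original bytes.
raw = b"valid \xff invalid"  # \xff can never appear in UTF-8
s = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff" in s  # the bad byte became the lone surrogate U+DCFF
assert s.encode("utf-8", errors="surrogateescape") == raw
```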
        
       ___________________________________________________________________
       (page generated 2022-12-14 23:01 UTC)