[HN Gopher] The History and rationale of the Python 3 Unicode mo...
___________________________________________________________________
The History and rationale of the Python 3 Unicode model for the
operating system
Author : goranmoomin
Score : 69 points
Date : 2022-12-14 15:01 UTC (8 hours ago)
(HTM) web link (vstinner.github.io)
(TXT) w3m dump (vstinner.github.io)
| zgrxat wrote:
| The Unicode model is complex and overengineered. Like many Python
| features, it requires permanent marketing until people believe it
| is good. It provides a steady income stream for those paid to
| "work" on Python.
|
| The right way would have been utf-8, which is the only reasonable
| representation. Indexing could have been broken early in Python 3
| or outsourced to a special type.
| jerf wrote:
| Do you mean the right way would be for Python to use utf-8 for
| file names, or for the filesystem? I'm not sure how "utf-8" is
| the solution in this case for the problem.
|
| The problem here is that file names aren't any kind of Unicode.
| You can have a utf-8 filename right next to a utf-16 file name
| right next to a Shift-JIS filename right next to one that's
| just binary gibberish that isn't a legal string in any encoding
| (that has any restrictions at all). They're just bytes with no
| indication of what coding scheme they are at all. You can't
| solve this problem with any Unicode solution. (Even the one
| that tries to wrap bad bytes in a recoverable way, whose name
| escapes me and I can't recall if it's actually a standard or
| not, would still make me nervous. You really need code that
| just treats the filenames as opaque bytes, and only later at
| display time tries to figure out if they're UTF-8 or not.)
|
| It doesn't matter what Unicode model Python used. That may
| solve Python's _other_ problems, sure. But filenames would
| always be some sort of problem. Even if you do just treat them
| as bytes, all that really does is make your potential errors
| deterministic, which is still better than a crash probably, but
| the filesystem's lack of encoding standards is a fundamental
| problem that can't be solved by higher code levels.
| StefanKarpinski wrote:
| On UNIX, paths are UTF-8 by convention, but not forced to be
| valid. Treating paths as UTF-8 works very well as long as you
| haven't also made the mistake of requiring your UTF-8 strings
| to be valid (which Python did, unfortunately).
|
| On Windows, paths are UTF-16 by convention, but also not
| forced to be valid. However, invalid UTF-16 can be faithfully
| converted to WTF-8 and converted back losslessly, so you can
| translate Windows paths to WTF-8 and everything Just
| Works(tm) [1].
|
| There aren't any operating systems I'm aware of where paths
| are actually Shift-JIS by convention, so that seems like a
| non-issue. Using "UTF-8 by convention" strings works on all
| modern OSes.
|
| [1] Ok, here's why the WTF-8 thing works so well. If we write
| WTF-16 for potentially invalid UTF-16 (just arbitrary
| sequences of 16-bit code units), then the mapping between
| WTF-16 and WTF-8 space is a bijection because it's losslessly
| round-trippable. But more importantly, this WTF-8/16
| bijection is also a homomorphism with respect to pretty much
| any string operation you can think of. For example
| `utf16_concat(a, b) == utf8_concat(wtf8(a), wtf8(b))` for
| arbitrary UTF-16 strings a and b. Similar identities hold for
| other string operations like searching for substrings or
| splitting on specific strings.
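A rough illustration of the concatenation identity in Python, using the `surrogatepass` error handler to stand in for WTF-8. This is close but not exact: unlike true WTF-8, `surrogatepass` does not merge a newly paired surrogate into a single four-byte sequence, which is fine here because no pair forms at the seam.

```python
# Sketch: "surrogatepass" encodes lone surrogates the same way WTF-8
# does, so it can illustrate the homomorphism for concatenation.
def wtf8(s: str) -> bytes:
    # Encode a str that may contain unpaired surrogates (ill-formed UTF-16).
    return s.encode("utf-8", "surrogatepass")

a = "abc\ud800"   # ends with a lone high surrogate: invalid UTF-16
b = "def"
# Concatenating then encoding equals encoding then concatenating.
assert wtf8(a + b) == wtf8(a) + wtf8(b)
```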
| ChrisSD wrote:
| Just to clarify further, note that actually preserving the
| "as if" behaviour does involve some complexity in the
| implementation. E.g. appending to WTF-8 has to be handled
| carefully to ensure it remains truly the same as doing so
| with WTF-16. This is because any newly paired surrogate has
| to be converted to its proper UTF-8 encoding. Similarly
| splitting WTF-8 can potentially break apart what was valid
| UTF-8 (though I'm not totally convinced that there's a good
| use case for actually doing this, at least for Windows
| paths).
|
| Of course the implementation details are something that can
| and should be handled by a library instead of doing it
| manually.
| lmm wrote:
| > There aren't any operating systems I'm aware of where
| paths are actually Shift-JIS by convention, so that seems
| like a non-issue. Using "UTF-8 by convention" strings works
| on all modern OSes.
|
| Nonsense. Unix paths use the system locale by convention,
| and it's entirely normal for that to be Shift-JIS.
| msla wrote:
| > On UNIX, paths are UTF-8 by convention, but not forced to
| be valid.
|
| On UNIX, paths are a sequence of bytes, with two bytes
| being sacred to the kernel (0x2F, used to separate path
| elements, and 0x00, used to terminate paths) and no other
| bytes being interpreted in any way. Any character encoding
| which respects the sacred bytes by not using them to encode
| any other characters is therefore usable to make UNIX
| paths; in fact, a UNIX path can contain multiple encodings,
| as long as they're all suitably respectful.
|
| That requirement for respect means that UTF-16 and UCS-2
| and UCS-4 are not suitable. UTF-7 is, however, as is UTF-8,
| and all of the ISO/IEC 8859 encodings are as well, not to
| mention a whole raft of non-standard "extended ASCII"
| character sets. In theory, UTF-16 in some suitably
| respectful encoding would work, too, but gouge my eyes out
| with a goddamned spoon.
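A quick Python sketch of that point (POSIX-only; the filename bytes here are arbitrary non-UTF-8 examples chosen for illustration):

```python
# On a POSIX filesystem, any bytes except 0x00 and 0x2F form a legal
# filename; the kernel never interprets them as characters.
import os
import tempfile

d = tempfile.mkdtemp().encode()       # work in bytes throughout
name = b"\xde\xad\xbe\xef"            # not valid UTF-8 under any decoding
with open(os.path.join(d, name), "wb") as f:
    f.write(b"hello")

# Listing with a bytes argument returns the raw bytes unchanged.
assert name in os.listdir(d)
```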
| deathanatos wrote:
| They know all that. The operative words there are "by
| convention", not "by requirement".
|
| I.e., their comment is an RFC "OUGHT TO", not an RFC
| "MUST".
| msla wrote:
| My point is that assuming UNIX paths are UTF-8 by default
| is fragile, and that assuming UNIX paths have any
| consistent character encoding is also somewhat fragile.
| You can check to see if the path is UTF-8, but,
| otherwise, mitts off unless the user explicitly tells you
| an encoding.
| deathanatos wrote:
| > _assuming UNIX paths are UTF-8 by default is fragile,
| and that assuming UNIX paths have any consistent
| character encoding is also somewhat fragile._
|
| Again, we're both aware that it's not guaranteed. But the
| convention these days is nonetheless UTF-8.
|
| > _mitts off unless the user explicitly tells you an
| encoding._
|
| This doesn't adequately solve the problem, though. Typed
| languages have to emit _some type of value_: having that
| be the "string" type of the language is _useful_, as you
| can do things like printf a message that includes the
| filename. (Or display an "Open file..." dialog. Or...)
|
| There are middle-grounds, such as byte smuggling and
| escaping. But if you take the stance that filenames are
| arbitrary bags of bytes (which is actually a subset: the
| reality is even worse) -- then anything that returns a
| filename is stuck: it can't return a string ("mitts
| off"). You can take Rust's approach with Path (a type
| specific to Paths) but people wail about that all the
| time too ("why are there so many types?"), and you can't
| print it _because it's not text!_
|
| "Error out if the file name isn't conventional" is a
| pragmatic tradeoff: bad file names will cause errors, but
| it makes basically all other operations much more
| tractable. It's not worth supporting insane file names.
|
| There are workarounds, of course (such as just
| replacement-charactering (U+FFFD) anything that can't be
| understood), and finding a format that can encode Paths
| when transmitting them, but these all take more time and
| effort. Allowing non-text files introduces unnecessary
| complexity and bugs into every single program that needs
| to deal with the file system.
| mistrial9 wrote:
| > requires permanent marketing
|
| LOL this is great, so true
| froh wrote:
| utf-8 needs 24 bits per glyph for the logographic unicode
| scripts, while utf-16 needs 16 bits. So if you're into selling
| silicon or don't care about these languages, or you only care
| about heavily mixed content (like html), then yes, utf-8 would
| be The Right Way (TM).
| StefanKarpinski wrote:
| Absolutely right. Deprecating direct string indexing would have
| been the right move. Require writing `str.chars()` to get
| something that lets you slice by Unicode characters (i.e. code
| points); provide `str.graphemes()` and
| `str.grapheme_clusters()` to get something that lets you slice
| by graphemes and grapheme clusters, respectively. Cache an
| index structure that lets you do that kind of indexing
| efficiently once you've asked for it the first time. Provide an
| API to clear the caches.
|
| Not allowing strings to represent invalid Unicode is also a
| huge mistake (and essentially forced by the representation
| strategy that they adopted). It forces any programmer who wants
| to robustly handle potentially invalid string data to use byte
| vectors instead. Which is exactly what they did with OS paths,
| but that's far from the only place you can get invalid strings.
| You can get invalid strings almost anywhere! Worse, since it's
| incredibly inconvenient to work with byte vectors when you want
| to do stringlike stuff, no one does it unless forced to, so
| this design choice effectively guarantees that all Python code
| that works with strings will blow up if it encounters anything
| invalid--which is a very common occurrence.
|
| If only there was a type that behaves like a string and
| supports all the handy string operations but which handles
| invalid data gracefully. Then you could write robust string
| code conveniently. But at that point, you should just make that
| the standard string type! This isn't hypothetical, it's exactly
| how Burnt Sushi's bstr type [1] works in Rust and how the
| standard String type works in Julia.
|
| [1] https://github.com/BurntSushi/bstr
| Jasper_ wrote:
| It's worth noting that Python str's are sequences of code
| points, not scalar values. This was a truly horrendous
| mistake made mostly out of ignorance, but now they rely upon
| it in surrogateescape to hide "invalid" data, so...
|
| I have ranted for long hours to friends about the insanity of
| Python 3's text model before. It's mostly the blind leading
| the blind.
| 323 wrote:
| I've seen this happen multiple times in Python, where they
| implement a feature without a thorough survey of what other
| languages are doing. Then when it turns out the new feature
| has a major broken part, they fix that, but again, without
| looking how it's done in other places. asyncio is a major
| example.
| Animats wrote:
| Unicode string indexing should have been made lazy, rather
| than deprecated. Random access to strings is rare. Mostly,
| operations are moving forward linearly or using saved
| positions.
|
| So, only build the index for random access if needed.
| Optimize "advance one glyph" and "back up one glyph"
| expressed as indexing, and you'll get most of the frequently
| used cases. Have the "index" functions that return a string
| index return an opaque type that's a byte index. Attempting
| to convert that to an integer forces creation of the string
| index.
|
| This preserves the user visible semantics but keeps
| performance.
|
| PyPy does something like this.
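A toy sketch of that idea (class and names hypothetical): forward iteration never builds an index; the first random access builds the code-point offset table once.

```python
# Hypothetical lazy-indexed string over a UTF-8 buffer: linear scans
# decode on the fly; the first integer index triggers an O(n) build.
class LazyStr:
    def __init__(self, data: bytes):
        self.data = data
        self._offsets = None

    def __iter__(self):
        # Forward iteration needs no index.
        return iter(self.data.decode("utf-8"))

    def __getitem__(self, i: int) -> str:
        if self._offsets is None:
            # A byte starts a code point iff it is not a UTF-8
            # continuation byte (10xxxxxx).
            self._offsets = [j for j, b in enumerate(self.data)
                             if b & 0xC0 != 0x80]
            self._offsets.append(len(self.data))
        return self.data[self._offsets[i]:self._offsets[i + 1]].decode("utf-8")

s = LazyStr("héllo".encode("utf-8"))
assert list(s) == ["h", "é", "l", "l", "o"]
assert s[1] == "é"
```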
| wk_end wrote:
| Deprecating direct string indexing might have been the
| "right" move in some sense but it's hard to imagine it ever
| happening in Python; it's such a natural and frequently-used
| operation in the language that you would've broken an
| enormous amount of code. Because indexing strings is
| syntactically identical to indexing lists, and because
| there's no* real static typing, there'd be no* good and
| robust way to automate the conversion.
|
| Like, the changes Python 3 made were honestly pretty subtle,
| and nearly 15 years later there's still people reluctant to
| upgrade. If they broke string indexing I'm pretty sure Python
| 3 adoption would make Perl 6 uptake look impressive.
| Spivak wrote:
| > It forces any programmer who wants to robustly handle
| potentially invalid string data to use byte vectors instead.
|
| Is this not the only thing you can really do? If strings
| could hold invalid unicode then they effectively become bytes
| and you now have to be wary of every possible string. I would
| rather Python just do away with strings entirely in the
| integration points with the OS and make you either keep them
| as opaque bytes or decode them.
|
| > behaves like a string and supports all the handy string
| operations but which handles invalid data gracefully
|
| Unless Guido was feeling particularly practical I can't
| imagine this ever making it into the stdlib because the
| choice of "gracefully" is somewhat arbitrary and application
| dependent.
| zokier wrote:
| Utf8 would not help with the issue in the article in any way.
| StefanKarpinski wrote:
| It's not at all obvious how it helps, but it does.
|
| First, why is Python unable to represent invalid path names
| as strings? Because internally it converts strings from
| UTF-8, UTF-16, or any other encoding, to a fixed-width array
| of decoded Unicode code points. The width of integer used to
| represent code points is determined by the largest code point
| in the string: if the string is ASCII, it can use a byte
| (uint8) per character; if the string is non-ASCII but all
| BMP, then it can use a uint16 per character; otherwise it has
| to use uint32 per character.
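This per-string width choice (CPython's PEP 393 flexible representation) can be observed directly by measuring marginal bytes per character:

```python
# Marginal storage per character in CPython's PEP 393 strings:
# ASCII -> 1 byte, BMP -> 2 bytes, astral plane -> 4 bytes.
import sys

def bytes_per_char(ch: str, n: int = 1000) -> float:
    # Size difference between 2n and n copies, divided by n.
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n

assert bytes_per_char("a") == 1.0           # ASCII
assert bytes_per_char("\u20ac") == 2.0      # U+20AC euro sign, BMP
assert bytes_per_char("\U0001f600") == 4.0  # emoji, astral plane
```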
|
| Why does Python do all this? So that you can have O(1)
| character indexing. If you gave up on that, you wouldn't need
| to convert the string at all, you could just leave it as
| (potentially invalid) UTF-8 data.
|
| Suppose you get an invalid path on UNIX where paths are UTF-8
| by convention? What does Python do with this string? It can't
| convert it to an array of code points because invalid UTF-8
| doesn't correspond to a code point (well, it can if it's just
| illegal, not malformed, but in general, we have to consider
| completely malformed strings that don't even follow the basic
| UTF-8 format). So Python is stuck: it can only replace the
| invalid data with something like the Unicode replacement
| character. But then you can't do anything useful with that
| because it's not the correct name of the path you're trying
| to work with.
|
| How does using UTF-8 to represent strings help? Because you
| can represent invalid strings: just leave them as-is and
| don't try to decode them unless you have to. Sure, you can't
| decode them as code points, but that's actually a pretty
| unusual thing to do. If someone asks for decoding, _then_ you
| can give an error. What about Windows where paths are UTF-16
| by convention? You can convert them to WTF-8 and everything
| works out. (Described in way more detail here:
| https://news.ycombinator.com/item?id=33984308).
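For the record, what CPython actually does for paths is a reversible variant of the replacement idea: the `surrogateescape` error handler (PEP 383) smuggles each undecodable byte into a lone surrogate and restores it on encode.

```python
# PEP 383 surrogateescape: undecodable byte 0xNN becomes the lone
# surrogate U+DCNN, so the original bytes can always be recovered.
raw = b"caf\xe9"                     # latin-1 "café": invalid as UTF-8
s = raw.decode("utf-8", "surrogateescape")
assert s == "caf\udce9"              # the bad byte rides along as U+DCE9
assert s.encode("utf-8", "surrogateescape") == raw   # lossless round trip
```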
| zokier wrote:
| > How does using UTF-8 to represent strings help? Because
| you can represent invalid strings: just leave them as-is
| and don't try to decode them unless you have to. Sure, you
| can't decode them as code points, but that's actually a
| pretty unusual thing to do. If someone asks for decoding,
| _then_ you can give an error
|
| How is that better than just handling paths as `bytes`?
| ilyt wrote:
| > Why does Python do all this? So that you can have O(1)
| character indexing. If you gave up on that, you wouldn't
| need to convert the string at all, you could just leave it
| as (potentially invalid) UTF-8 data.
|
| Seems like "giving up" would've been the better choice,
| considering just how rare that operation is. Or
| alternatively, doing the conversion lazily the first time an
| operation needs runes instead of bytes.
|
| Most string operations are not accessing string by index
| and most of them even at O(n) would be fast enough because
| n is small. Like in typical "get a file name, extract some
| info from it", you're doing extraction once and anything
| after that doesn't need character indexing, because you
| already got the relevant data.
| masklinn wrote:
| > How does using UTF-8 to represent strings help? Because
| you can represent invalid strings: just leave them as-is
| and don't try to decode them unless you have to.
|
| That's not UTF8. That's a bag'o bytes which might be UTF8.
| Very different thing.
|
| > Sure, you can't decode them as code points, but that's
| actually a pretty unusual thing to do.
|
| It's not, any unicode-aware text processing does it
| implicitly. This means any such processing has to either
| perform its own validation that the input is valid, or it
| may fly off the rails entirely if fed nonsense. This also
| increases the risk of security issues, either outright UB or
| the ability to smuggle payloads through overlong encoding.
| [deleted]
| StefanKarpinski wrote:
| > That's not UTF8.
|
| True; I was careful not to call it that, but treating
| strings as UTF-8 by convention does make sense.
|
| > It's not, any unicode-aware text processing does it
| implicitly. This means any such processing has to
| either perform its own validation that the input is
| valid, or it may fly off the rails entirely if fed
| nonsense.
|
| In theory, but that's just not how most string operations
| actually work. If you have two UTF-8 strings and you want
| to concatenate them, you just concatenate the bytes. It
| would be ridiculously inefficient to decode the code
| points in each string and then re-encode them back into a
| destination buffer. If you have two UTF-8 strings and you
| want to see if one is a substring of the other and at
| what byte index, you just look for the bytes of one as a
| "substring" of the bytes of the other. Again, it would be
| ridiculously inefficient to decode the code points in
| each and do matching on code points. But what if the
| strings aren't valid UTF-8?! Both of those operations
| work just fine even if the strings aren't valid and
| produce sensible, intuitive results.
|
| If you're implementing a browser or a terminal that has
| to actually display UTF-8 as characters then sure, you
| have to actually decode characters. Similarly, if you're
| parsing text somehow, then you have to decode characters.
| But many programs only do concatenation and search and
| other operations like that which are actually implemented
| in terms of byte sequences, not characters.
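Both claims are easy to check with plain byte strings (the invalid bytes here are arbitrary examples):

```python
# Concatenation and substring search are byte-wise operations; they
# behave sensibly even when the buffers are not valid UTF-8.
a = b"hello \xff"          # 0xFF can never appear in valid UTF-8
b_ = b"\xfe world"
combined = a + b_
assert combined == b"hello \xff\xfe world"
assert combined.find(b"world") == 9    # byte index, no decoding needed
```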
| masklinn wrote:
| > True; I was careful not to call it that
|
| You specifically called it UTF8, repeatedly. The very
| comment I quoted asserts that "Utf8 would help deal with
| the issue [of garbage inputs]" (in its denial of the
| opposite assertion). You also did it in
| https://news.ycombinator.com/item?id=33986421
|
| > If you have two UTF-8 strings and you want to
| concatenate them, you just concatenate the bytes.
|
| That's not a unicode-aware operation, it's mostly a
| unicode-irrelevant operation (though unicode awareness
| can be useful in edge cases because of special grapheme
| clusters, but that's very task-specific).
|
| > But what if the strings aren't valid UTF-8?! Both of
| those operations work just fine even if the strings
| aren't valid and produce sensible, intuitive results.
|
| If your content is not actually UTF-8, concatenation can
| accidentally produce valid UTF-8, thus changing the
| semantics of the content.
| You can also end up with overlong UTF-8, which also
| changes the semantics of the content in a worse way.
| StefanKarpinski wrote:
| The comment that you're quoting wasn't mine. The
| comment you link to says "UTF-8 by convention". If either
| string is valid, then the result is as expected. If
| you're concatenating two strings that are both invalid
| UTF-8, there's not much you can do that's better than
| just concatenating the bytes together... which is exactly
| what treating them as byte arrays would end up doing (but
| it's less convenient). If you're worried about invalid
| UTF-8 you can check for validity (which again, is exactly
| what you end up doing if you use byte arrays).
| adgjlsfhk1 wrote:
| The problem with using strict UTF-8 for paths is that
| paths aren't guaranteed to be valid UTF-8. How do you
| want to write a program that opens a path whose name is
| invalid UTF-8?
| masklinn wrote:
| > The problem with using strict UTF-8 for paths is that
| paths aren't guaranteed to be valid UTF-8.
|
| Ok but I'm not saying to do that. I'm saying if you have
| not-utf8 strings don't call them UTF8.
|
| > How do you want to write a program that opens a path
| whose name is invalid UTF-8?
|
| That's not my problem given I'm not advocating for that.
| StefanKarpinski wrote:
| The issue is that when you're implementing something like
| a programming language or a robust general purpose
| utility, then simply not being able to open--or list or
| remove or stat--paths with invalid names is not really
| acceptable.
| PaulHoule wrote:
| One time my job was standardizing a machine learning model
| training system in Python so you could develop models on a Mac or
| PC and then train them with large data sets on a DGX-1 for
| production.
|
| Over the course of a few months I developed quite a list of
| configuration problems that could break Python scripts and some
| of the worst involved character encoding. In particular, in many
| Pythons, writing invalid Unicode in a print statement could crash
| your Python. It's one thing to say "don't write invalid Unicode
| in a print" but if you are sucking in a large number of 3rd party
| libraries you can't control their use of print. It turned out
| many customers had CSV files with invalid Unicode sequences so we
| couldn't control the input. People would say "just use docker"
| but the team was always finding and making docker images with
| strangely configured Pythons. (If anything Docker seemed to make
| it easier and faster to use and install misconfigured software,
| not make it easier to get things under control.)
|
| It's good news that Python, like Java, is defaulting to UTF-8
| instead of whatever locale it gets from the OS because frequently
| this locale is not just "not UTF-8" but something really bizarre.
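The crash mode described above is easy to reproduce without touching the OS configuration, by printing to a stream whose encoding can't represent the text (a sketch; real scripts hit this via a misconfigured locale on sys.stdout):

```python
# A print() to a narrowly-encoded stream raises UnicodeEncodeError,
# which in an unprepared script is an uncaught crash.
import io

ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print("caf\u00e9", file=ascii_stream)   # 'é' has no ASCII encoding
    crashed = False
except UnicodeEncodeError:
    crashed = True
assert crashed
```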
| sakras wrote:
| I'm a little confused, how can a file name be non-decodable? A
| file with that name exists, so someone somewhere knows how to
| decode it. Why wouldn't Python just always use the same encoding
| as the OS it's running on? Is this some locale-related thing?
| jjtheblunt wrote:
| Does trying to decode, on Unix, a Windows file path (not
| filename) like "drive:blah" freak it out as an example of non-
| decodable?
| johannes1234321 wrote:
| That's a valid file name. Non-decodable are byte sequences
| which are invalid UTF-8 byte sequences (if UTF-8 is expected)
| masklinn wrote:
| > A file with that name exists, so someone somewhere knows how
| to decode it.
|
| No. A unix filename is just a bunch of bytes (two of them being
| off-limits). There is no requirement that it be in _any_
| encoding.
|
| You can always use a fallback encoding (an iso-8859) to get
| _something_ out of the garbage, but it's just that: garbage.
|
| Windows has a similar issue, NTFS paths are sequences of UCS2
| code units, but there's no guarantee that they form any sort of
| valid UTF-16 string, you can find random lone surrogates for
| instance.
|
| And I'm sure network filesystems have invented their own even
| worse issues, because being awful is what they do.
|
| > Why wouldn't Python just always use the same encoding as the
| OS it's running on?
|
| 1. because OSes don't really have encodings; Python has a
| function to try and retrieve FS encoding[0] but per the above
| there's no requirement that it is correct for any file, let
| alone the one you actually want to open (hell technically
| speaking it's not even a property of the FS)
|
| 2. because OSes lie and user configurations are garbage; you
| can't even trust the user's locale to be configured properly
| for reading _files_ (another mistake Python 3 made,
| incidentally)
|
| 3. because the user may not even have created the file, it
| might come from a broken archive, or some random download from
| someone having fun with filenames, or from fetching crap from
| an FTP or network share
|
| There are a few FS / FS configurations which are reliable, in
| which case they either error or pre-mangle the files on intake.
|
| IIRC ZFS can be configured to only accept valid UTF-8
| filenames, HFS(+) requires valid unicode (stored as UTF-16) and
| APFS does as well (stored as UTF-8).
|
| [0]
| https://docs.python.org/3/library/sys.html#sys.getfilesystem...
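For reference, the two knobs involved; per the comment above, these are process-wide guesses, not properties of any particular file:

```python
# sys.getfilesystemencoding() reports the codec Python uses for OS
# path conversions; the paired error handler keeps round trips lossless.
import sys

enc = sys.getfilesystemencoding()       # usually 'utf-8' on modern systems
err = sys.getfilesystemencodeerrors()   # 'surrogateescape' on POSIX
print(enc, err)
```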
| StefanKarpinski wrote:
| Unfortunately, neither UNIX nor Windows require path names to
| be valid Unicode. UNIX interprets them as "UTF-8 by convention"
| and Windows as "UTF-16 by convention" but both actually allow
| arbitrary sequences of code units. It would be nice if this
| didn't actually occur, but alas, it does, and if you're writing
| general purpose utilities that work with files, you don't want
| them to simply crash when this happens.
| heisenzombie wrote:
| It's filesystem-dependent, but a lot of filesystems treat
| filenames as just an arbitrary sequence of bytes. Most OSes
| sort of hide this from you, but there's nothing stopping you
| doing weird stuff like:
|
|     > touch `echo 00: DEADBEEF | xxd -r`
|     > ls
|     'ey'$'\276\357'
| CorrectHorseBat wrote:
| Unix allows any byte in a filename except 0x0 (ASCII Nul) and
| 0x2f (ASCII '/'), anything else is allowed. It doesn't have to
| be decodable to text.
| cabirum wrote:
| Filenames come from multiple filesystems (and not just
| filesystems!), with different encodings and such. Both ntfs and
| ext3 can use arbitrary bytes as names, not representable in utf8.
| (Contrary to popular belief, ntfs works with any byte sequences
| except 0x0000, and may contain invalid utf16 code points)
|
| The only sane way to deal with them all is to use bytes and bytes
| only (Or use a sane language with encoding-agnostic string type;
| see: golang) Other solutions introduce loss of data.
| deathanatos wrote:
| > _The only sane way to deal with them all is to use bytes and
| bytes only_
|
| It's not _really_ , though. Yes, if you want to round-trip the
| name of the file and be fully general over the bad situation
| provided by the OS. But as soon as you want to _display_ that
| filename, or really do much of anything with it, you're back
| into the world of hurt the article describes, and you need some
| super complicated type or logic that can convey to the user
| "hey, this file's name is some random undecodable garbage".
|
| The sane state would have been "file names are text", but
| filenames/Unix predates software engineering having a good
| discipline around text encoding, and it predates Unicode. I'd
| go as far as to argue that "file names are text and \n is
| disallowed" is the sane state, as multi-line filenames also
| just cause all sorts of unnecessary complexity.
|
| The problem, of course, is that most OS enforce neither
| constraint, which is what leads to things like the struggle
| Python has gone through, or the Path type in Rust, and the
| unrealistic nature of trying to move from the present state to
| such a sane state. (Linux is loath to break compatibility, and
| I bet Windows would be similar, here. There are a number of
| arguments for that, but in the meantime, it continues to be
| that file names are ridiculously complicated if handled fully
| correctly.
|
| Or, you just decode them as Unicode (UTF-8/-16 depending on OS)
| and give up if the user is crazy. 99.99% of the time that'll
| work just fine.)
|
| > _Or use a sane language with encoding-agnostic string type;
| see: golang_
|
| Of course, this might be the root of our disagreement. Types
| that permit the representation of invalid values are not
| desirable, to me. Especially in a core type, like strings.
| throw0101a wrote:
| > _It's not really, though. Yes, if you want to round-trip
| the name of the file and be fully general over the bad
| situation provided by the OS. But as soon as you want to
| display that filename, or really do much of anything with it_
|
| IMHO, that is mostly not the file system's problem. Bits
| going into and out of the file system should act as
| deterministically as possible.
|
| If you want to start being 'clever' about these problems, do
| it in userland.
| deathanatos wrote:
| And the point is that this line of thinking introduces a
| metric crap ton of unnecessary complexity downstream.
|
| Yeah, an FS _could_ say that [1]. Alternatively, it could say
| that names are intended to be read by humans, and that
| names should be a single line, to support things that
| people reasonable expect to be able to do with a file
| system, like being able to list the files on the
| filesystem, without whatever is doing the listing needing
| to decide if it is going to just say "I can't display raw
| bytes, obviously, so here's a best-effort of .txt" or if it
| has to invent some sort of grammar to encode/escape bytes
| into text.
|
| I.e., a file system is meant to solve a problem, which is
| the storage _and organization_ of data, and from that, that
| names are meant to used by humans is one of the
| requirements of both the OS & the filesystem. (The OS, as
| it presents the file hierarchy to userspace, e.g., in the
| VFS, and the filesystem, as it of course needs to store the
| data.)
|
| By your logic, we should permit nul and '/' in file names,
| too.
|
| [1] And, I mean, most _do._ Which is why we're in the
| situation we're in, and why languages are faced with a
| barrel of bad choices, and why probably 90% of shell
| scripts fail to properly handle file names. Is this what we
| _want?_
| [deleted]
| saurik wrote:
| > Or, you just decode them as Unicode (UTF-8/-16 depending on
| OS) and give up if the user is crazy. 99.99% of the time
| that'll work just fine.)
|
| You have to be really careful about the way in which you give
| up here, as the user isn't necessarily in control of the
| files on disk: it might be an attacker. If I can hide files
| in folders from various parts of a Python program -- as was
| possible with early versions of Python 3, which simply
| skipped such files -- or cause corruption in how the
| filenames are round-tripped during a copy (allowing me to
| bypass a filter in one place but then write to a file with a
| different critical name in another place) I gain a lot of
| flexibility with my exploits.
| deathanatos wrote:
| Yeah, and that's a problem. But again, Python is faced with
| bad choices because the interface exposed by OSes for file
| systems is basically broken.
|
| (But, that might also fall into the 1% case where you'd
| need to pay attention.)
|
| Perhaps you might argue for a dedicated type like Rust's
| Path, but that too has issues, and I don't think Python had
| the "design freedom" to go there. (Although it does have
| pathlib, nowadays, so I wonder how it handles this; but it
| would also need to duplicate a lot of the functionality in
| os...)
| CorrectHorseBat wrote:
| ZFS even has a utf8only flag that disallows anything but
| utf-8. Sadly it hasn't caught on with other filesystems.
| ilyt wrote:
| Kinda waste of CPU to test for that tbh. Broken app won't
| work any better coz of it
| ilyt wrote:
| > It's not really, though. Yes, if you want to round-trip the
| name of the file and be fully general over the bad situation
| provided by the OS. But as soon as you want to display that
| filename, or really do much of anything with it, you're back
| into the world of hurt the article describes, and you need
| some super complicated type or logic that can convey to the
| user "hey, this file's name is some random undecodable
| garbage".
|
| It really is. Once you are using a byte type you can handle
| displaying it to the user as you see fit, but that's not a
| decision the stdlib should make for you. Not only that: if a
| file-related function operates on bytes, then by definition it
| doesn't care about FS semantics, and you're also not wasting
| any CPU on useless conversions to string.
|
| The blob of bytes acts as an opaque token for the file's
| location, as it should. If you want to display it to users
| _you have to quote it anyway_, because a perfectly valid
| utf8 file name can contain newlines or special utf8
| characters that will fuck up how it is displayed, possibly
| even in a malicious way.
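|
| A minimal sketch of that approach in Python (the use of the
| current directory here is just for illustration): pass bytes to
| os.listdir() so filenames stay raw bytes, and decode lossily
| only at display time:

```python
import os

# Passing bytes to os.listdir() makes it return filenames as raw
# bytes, with no decoding applied by the stdlib.
for raw in os.listdir(b"."):
    # Decode only at display time; undecodable bytes become
    # U+FFFD instead of crashing the program.
    shown = raw.decode("utf-8", errors="replace")
    # Quote before display: even a valid UTF-8 name can contain
    # newlines or control characters.
    print(repr(shown))
```

| The raw bytes remain the token you pass back to open() and
| friends; the decoded string is display-only.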
|
| >> Or use a sane language with encoding-agnostic string type;
| see: golang
|
| >Of course, this might be the root of our disagreement. Types
| that permit the representation of invalid values are not
| desirable, to me. Especially in a core type, like strings.
|
| string is defined as "read-only slice of bytes", not "UTF8
| slice of bytes". That is mostly so conversion between them
| is free. If you want to pay the cost of validation you can,
| but you are not forced to.
|
| The functions operating on them will, where relevant to
| their purpose, treat the contents as UTF8.
|
| It might sound bad in theory, and it did to me at first,
| but in practice it has given me the fewest problems of any
| language I've used.
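|
| Python's bytes type offers the same trade-off: constructing
| bytes never validates an encoding, and you pay for validation
| only if and when you decode. A small sketch:

```python
data = b"caf\xe9"  # latin-1 encoded bytes; not valid UTF-8

# Validation is opt-in: it happens only when you decode.
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# If you know the actual encoding, decoding succeeds.
print(data.decode("latin-1"))  # café
```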
| chungy wrote:
| NTFS disallows 0x002F as well; it is possible to use the entire
| set even from Windows (installing Interix, for instance).
| Windows NT was designed around UCS-2; the file system
| didn't have to do much but be a dumb bag of bytes for file
| names, since character interpretation happened in upper
| layers. You can mostly pretend it's UTF-16, until it's not.
|
| Something fun on file systems: ZFS lets you set "utf8only=on"
| which will guarantee no file names will ever be possible that
| are not valid UTF-8. I use it on all my systems.
| jcranmer wrote:
| > Something fun on file systems: ZFS lets you set
| "utf8only=on" which will guarantee no file names will ever be
| possible that are not valid UTF-8. I use it on all my
| systems.
|
| This is the sort of option that should be available for all
| filesystems, and distros should default new installs to
| turning it on.
| lmm wrote:
| Please don't unless you've thoroughly tested in multiple
| cultures. Unicode is broken for Japanese; on some systems
| it's possible to change a setting to make it broken for
| Chinese instead, but the only non-theoretical way to make a
| program that can list a directory full of Japanese
| filenames and a directory full of Chinese filenames
| correctly is to have that program be encoding-aware.
| deckard1 wrote:
| If we had a time machine, then yes. In fact, it shouldn't
| even be an option: UTF-8, period. Always.
|
| But alas, we do not, and someone will plug in a memory card
| or USB drive with some FAT-formatted filesystem, and your
| nice universe crumbles before you.
| jcranmer wrote:
| Handling non-UTF-8 pathnames via a filesystem mount
| option is completely viable. Hell, I'd go even further
| and suggest that the kernel itself reject non-UTF-8
| pathnames entirely, handling filesystems with the non-
| UTF-8-pathname option enabled via a translation layer
| that converts \x00-\xff to \u0000-\u00ff (and vice
| versa).
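|
| That byte-to-code-point translation (\x00-\xff to \u0000-\u00ff
| and back) is exactly what the Latin-1 codec does, so it can be
| sketched in Python in a few lines:

```python
# Latin-1 maps each byte 1:1 to the Unicode code point with the
# same value, so every possible byte string round-trips exactly.
raw = bytes(range(256))
as_text = raw.decode("latin-1")       # never raises
assert as_text.encode("latin-1") == raw
assert as_text[0xE9] == "\u00e9"      # byte 0xE9 -> U+00E9
```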
|
| The lesson of text is clear: text needs to have a defined
| charset. Filesystem names are _absolutely_ textual
| (listing a directory for human display is one of the most
| common operations). If it's not UTF-8, then what is it?
| How is the application supposed to figure it out?
| (Automatic encoding detection is flaky in the best of
| times, but with the lengths of typical filenames, it is
| almost impossible to be reliable). The reality is that
| most applications _already_ assume that filesystem names
| are UTF-8, and non-UTF-8 names break them.
|
| We are at a stage in Unicode penetration that I think it
| is reasonable to treat a filesystem that has non-UTF-8
| filenames as a problem that needs to be solved with some
| kind of repair tool like fsck, not something that every
| application is supposed to worry about.
| chungy wrote:
| This is why "convmv" exists :) and also why my /tmp is a
| normal tmpfs where file names can be any byte sequence. I
| can, for instance, extract random archives there, use
| convmv to make them UTF-8 (if not already), and then it
| can go on my ZFS file systems.
| gumby wrote:
| I have used filesystems that permitted a character represented
| as 0x0. Good way to attack a C program.
| GeorgeTirebiter wrote:
| I sometimes think the C decision to encode strings as
| NUL-terminated sequences of non-zero bytes was one of the
| worst.
|
| The other candidate is the atrocious operator precedence. OK,
| we now love parens to 'fix' this -- might as well use lisp...
| ;-)
| ilyt wrote:
| It was unequivocally one of the worst decisions in
| programming. I wouldn't be surprised if the costs of dealing
| with it were in the billions by now. It occasionally saved a
| byte on devices with little memory, which kinda made sense
| at the time, but not anymore.
| stabbles wrote:
| It makes perfect sense when writing and reading strings
| from disk, since you don't have to worry about where and how
| to store the integer for the length (#bytes & endianness).
| russdill wrote:
| Even using bytes can lead to loss of data when copying from
| one file system to another, or when dealing with operating
| systems that treat filenames differently depending on which
| system calls are used.
| zzzeek wrote:
| SQLAlchemy creator here.
|
| No single improvement in Python has made my life more
| dramatically easier than the introduction and standardization of
| strings as Unicode. The basic reason is that it clearly placed
| the burden of encoding / decoding on the *outside* edges of a
| program: right when data is coming in, or right when data is
| going out. If you want the world to consider the data between
| those edges to be "a string", you have to do that for us.
|
| For me, it meant that every PEP-249 DBAPI finally took on the
| role of handling database encodings fully and completely.
| Enormous reams of arbitrary string encode/decode logic in
| SQLAlchemy, having to guess its way around all the different
| decisions every DBAPI made about this, gone. If I get a string
| from the DBAPI, it's decoded. If I have a string, I can send it
| out, DBAPI handles it. If it contains characters that aren't
| appropriate for the database's encoding, it raises. Great! I send
| the users off to the manual for their database, they aren't
| complaining to me. Total lifesaver.
|
| I see a lot of odd complaints here which seem to be people
| wanting non-textual bytestrings to do things that make sense for
| text; if someone has a use case like that, there can be libraries
| that do the particular non-textual-but-text-like manipulations
| they need, however, these are not the "standard" uses for textual
| strings in Python. The standard use for textual strings is...to
| represent text! If I have data that's bytes, then I use bytes!
| These threads always seem to fill up with the vanishingly small
| number of people that have issues with this. Just like the anti-
| ORM threads. All the while the vast majority of devs are just
| getting work done.
| Alir3z4 wrote:
| Unicode errors were some of the most painful and annoying
| issues I had, whether working with XML files, parsing text,
| or dealing with anything text-related. When upgrading to
| Python 3, they just vanished.
|
| Anti-ORM people probably don't know what an ORM is, or are
| confused about its usage and the benefits it brings.
|
| Once you're dealing with several tables, relations, and
| lots of aggregation and annotation, raw SQL is a one-way
| ticket to insanity.
|
| By the way, thank you very much for making SQLAlchemy. You
| saved millions of hours of developer time, and god knows how
| many SQL bugs have been prevented by using it.
| kstrauser wrote:
| I agree completely. Turns out all those TypeErrors that
| suddenly popped up when switching to Python 3 were actually
| uncaught logic errors all along. The biggest annoyance coming
| from the string/bytes split was realizing how sloppy I'd been
| about treating them identically.
| mmastrac wrote:
| Rust uses a solution similar to his original proposal,
| OsString/OsStr:
| https://doc.rust-lang.org/std/ffi/struct.OsString.html
|
| I wonder if unicode needs an "undecodable byte escape"
| sequence for these raw sequences that cannot be mapped into
| the string space, allowing for a guaranteed round-trip
| between the OS and Unicode. For this to work, you'd need a
| guarantee that all codepage/unicode mappings are consistent
| and 1:1, which might not be possible.
|
| Paths are tricky because they act as both an identifier and a
| handle, and the string encoding of the identifier and handle are
| _close but not exactly the same_ when unicode is in the mix.
| In the ASCII world, you could assume they were the same.
| LegionMammal978 wrote:
| Python already has the "surrogateescape" error handler [0] that
| performs something similar to what you described: undecodable
| bytes are translated into unpaired U+DC80 to U+DCFF surrogates.
| Of course, this isn't standardized in any way, but I've found
| it useful myself for smuggling raw pathnames through Java.
|
| [0] https://peps.python.org/pep-0383/
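|
| A short demonstration of the round-trip (the filename here is
| made up): each undecodable byte becomes a lone surrogate in the
| U+DC80-U+DCFF range, and encoding with the same handler
| restores the original bytes exactly:

```python
raw = b"caf\xe9.txt"  # latin-1 bytes; \xe9 is not valid UTF-8

# Undecodable bytes are smuggled through as lone surrogates.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "caf\udce9.txt"

# Encoding with the same handler recovers the exact bytes.
assert name.encode("utf-8", errors="surrogateescape") == raw
```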
___________________________________________________________________
(page generated 2022-12-14 23:01 UTC)