[HN Gopher] PEP 686 - Make UTF-8 mode default
___________________________________________________________________
PEP 686 - Make UTF-8 mode default
Author : GalaxySnail
Score : 195 points
Date : 2024-04-26 11:55 UTC (11 hours ago)
(HTM) web link (peps.python.org)
(TXT) w3m dump (peps.python.org)
| Macha wrote:
| > And many other popular programming languages, including
| Node.js, Go, Rust, and Java uses UTF-8 by default.
|
| Oh, I missed Java moving from UTF-16 to UTF-8.
| PurpleRamen wrote:
| Seems it happened two years ago, with Java 18.
| rootext wrote:
| It seems you are mixing up two things: the internal string
| representation and the read/write encoding. Java has never used
| UTF-16 as the default for the latter.
| cryptonector wrote:
| Not even on Windows?
| layer8 wrote:
| No, file I/O on Windows in general doesn't use UTF-16, but
| the regional code page, or nowadays UTF-8 if the
| application decides so.
| int_19h wrote:
| Depends on what you define as "file I/O", though. NTFS
| filenames are UTF-16 (or rather UCS-2). As far as file
| contents, there isn't really a standard, but FWIW for a
| long time most Windows apps - Notepad being the canonical
| example - would, when asked to save anything as "Unicode",
| save it as UTF-16.
| layer8 wrote:
| I'm talking about the default behavior of Microsoft's C
| runtime (MSVCRT.DLL) that everyone is/was using.
|
| UTF-16 text files are rather rare, as is using Notepad's
| UTF-16 options. The only semi-common use I know of is
| *.reg files saved from regedit. One issue with UTF-16 is
| that it has two different serializations (BE and LE), and
| hence generally requires a BOM to disambiguate.
| hashmash wrote:
| With Java, the default encoding when converting bytes to
| strings was originally platform dependent, but now it's
| UTF-8. UTF-16 and latin-1 encodings are (still*) used
| internally by the String class, and the JVM uses a modified
| UTF-8 encoding like it always has.
|
| * The String class originally only used UTF-16 encoding, but
| since Java 9 it also uses a single-byte-per-character latin-1
| encoding when possible.
| Myrmornis wrote:
| Hm TIL, I thought that the string encoding argument to .decode()
| and .encode() was required, but now I see it defaults to "utf-8".
| Did that change at some point?
| _ache_ wrote:
| You can verify on the documentation by switching the version.
|
| So ... since 3.2:
| https://docs.python.org/3.2/library/stdtypes.html#bytes.deco...
| In 3.1 it was the default encoding of string (the type str I
| guess).
| https://docs.python.org/3.1/library/stdtypes.html#bytes.deco...
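|
| For illustration, a minimal sketch with any recent Python 3:
|
|     >>> "café".encode()          # encoding defaults to "utf-8"
|     b'caf\xc3\xa9'
|     >>> b'caf\xc3\xa9'.decode()  # decoding also defaults to "utf-8"
|     'café'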
| LeoPanthera wrote:
| > ChatGPT4 says it's always been that way since the beginning
| of Python3
|
| This is not a reliable way to look up information. It doesn't
| know when it's wrong.
| Affric wrote:
| Make UTF-8 default on Windows
| johannes1234321 wrote:
| Since Windows Version 1903 (May 2019 Update) they push for
| UTF-8. But Windows is a big pile of compatible legacy.
| tedivm wrote:
| That's exactly what this proposal (which has been accepted) is
| going to do.
| lolinder wrote:
| I think they mean that the Windows operating system should
| default to UTF-8.
| pjc50 wrote:
| In addition to ApiFunctionA and ApiFunctionW, introduce
| ApiFunction8? (times whole API surface)
|
| Introduce a #define
| UNICODE_NO_REALLY_ALL_UNICODE_WE_MEAN_IT_THIS_TIME ?
| cryptonector wrote:
| ApiFunctionA is UTF-8 capable. Needs a run-time switch too,
| not just compile-time.
| garaetjjte wrote:
| It's now possible, but for years the excuse was that MBCS
| encodings only supported characters up to 2 bytes.
| ComputerGuru wrote:
| Only under windows 11, I believe. And that switch is off by
| default.
| int_19h wrote:
| You're thinking of the global setting that is enabled by
| the user and applies to all apps that operate in terms of
| "current code page" - if enabled, that codepage becomes
| 65001 (UTF-8).
|
| However, on Win10+, apps themselves can explicitly opt
| into UTF-8 for all non-widechar Win32 APIs regardless of
| the current locale/codepage.
| sebazzz wrote:
| Yes: https://learn.microsoft.com/en-
| us/windows/win32/sbscs/applic...
|
| > On Windows 10, this element forces a process to use UTF-8
| as the process code page. For more information, see Use the
| UTF-8 code page. On Windows 10, the only valid value for
| activeCodePage is UTF-8.
|
| > This element was first added in Windows 10 version 1903
| (May 2019 Update). You can declare this property and
| target/run on earlier Windows builds, but you must handle
| legacy code page detection and conversion as usual. This
| element has no attributes.
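|
| Once a process has opted in this way, its active ANSI code
| page becomes 65001. A minimal sketch of how one might verify
| that from Python on Windows (GetACP is the relevant Win32
| call):
|
|     import ctypes
|
|     # 65001 means the process is running with UTF-8 as its
|     # ANSI code page
|     print(ctypes.windll.kernel32.GetACP())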
| layer8 wrote:
| That would break so many applications and workflows that it
| will never happen.
| lexicality wrote:
| > Additionally, many Python developers using Unix forget that the
| default encoding is platform dependent. They omit to specify
| encoding="utf-8" when they read text files encoded in UTF-8
|
| "forget" or possibly simply aren't made well enough aware? I
| genuinely thought that python would only use UTF-8 for everything
| unless you explicitly ask it to do otherwise.
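|
| A minimal sketch of the difference, assuming a machine whose
| locale encoding is not UTF-8 (say cp1252 on Windows) and a
| hypothetical notes.txt saved as UTF-8:
|
|     import locale
|
|     # what open() falls back to when no encoding is given
|     locale.getpreferredencoding(False)   # e.g. 'cp1252'
|
|     # relies on that platform default; may silently mis-decode
|     text = open("notes.txt").read()
|
|     # explicit, which is effectively what PEP 686 makes the default
|     text = open("notes.txt", encoding="utf-8").read()
|
| UTF-8 mode can already be opted into today with `python -X
| utf8` or the PYTHONUTF8=1 environment variable.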
| jillesvangurp wrote:
| Not relying on flaky system defaults is a good thing. These
| things have a way of turning around and being different than what
| you assume them to be. A few years ago I was dealing with Ubuntu
| and some init.d scripts. One issue I ran into was that some
| script we used to launch Java (this was before docker) was
| running as root (bad, I know) and with a shell that did not set
| UTF-8 as the default, as would be completely normal for regular
| users. And of course that revealed some bad APIs that we were
| using in Java that use the OS default. Most of these things have
| variants that allow you to set the encoding at this point and a
| lot of static code checkers will warn you if you use the wrong
| one. But of course it only takes one place for this to start
| messing up content.
|
| These days it's less of an issue, but I would simply never
| rely on the OS to get this right. Most uses of encodings
| other than UTF-8 are extremely likely to be unintentional at this
| point. And if it is intentional, you should be very explicit
| about it and not rely on weird indirect configuration through the
| OS that may or may not line up.
|
| So, good change. Anything that breaks over this is probably
| better off with the simple fix added. And it's not worth leaving
| everything else as broken as it is with content corruption bugs
| just waiting to happen.
| nerdponx wrote:
| Default text file encoding being platform-dependent always drove
| me nuts. This is a welcome change.
|
| I also appreciate that they did not attempt to tackle filesystem
| encoding here, which is a separate issue that drives me nuts, but
| separately.
| layer8 wrote:
| Historically it made sense, when most software was local-only,
| and text files were expected to be in the local encoding. Not
| just platform-dependent, but user's preferred locale-dependent.
| This is also how the C standard library operates.
|
| For example, on Unix/Linux, using iso-8859-1 was common when
| using Western-European languages, and in Europe it became
| common to switch to iso-8859-15 after the Euro was introduced,
| because it contained the euro sign (€). UTF-8 only began to
| work flawlessly in the later aughts. Debian switched to it as
| the default with the Etch release in 2007.
| anthk wrote:
| Emacs was amazing for that; builtin text
| encoders/decoders/transcoders for everything.
| hollerith wrote:
| My experience was that brittleness around text encoding in
| Emacs (versions 22 and 23 or so) was a constant source of
| annoyance for years.
|
| IIRC, the main way this brittleness bit me was that every
| time a buffer containing a non-ASCII character was saved,
| Emacs would engage me in a conversation (which I found
| tedious and distracting) about what coding system I would
| like to use to save the file, and I never found a sane way
| to configure it to avoid such conversations even after
| spending hours learning about how Emacs does coding
| systems: I simply had to wait (a year or 3) for a new
| version of Emacs in which the code for saving buffers
| worked better.
|
| I think some people _like_ engaging in these conversations
| with their computers even though the conversations are very
| boring and repetitive and that such conversation-likers are
| numerous among Emacs users or at least Emacs maintainers.
| da_chicken wrote:
| It's still not that uncommon to see programs on Linux not
| understanding multibyte UTF-8.
|
| It's also true that essentially nothing on Linux supports the
| UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but
| it is explicitly allowed in the specifications. Since
| Microsoft tends to always include a BOM in any flavor of
| Unicode, this means Linux often chokes on valid UTF-8 text
| files from Windows systems.
| nerdponx wrote:
| Interestingly, Python is one of those programs.
|
| You need to use the special "utf-8-sig" encoding for that,
| which is not prominently advertised anywhere in the
| documentation (but it is stated deep inside the "Unicode
| HOWTO").
|
| I never understood why ignoring this special character
| requires a totally separate encoding.
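|
| For example, with a hypothetical bom.txt written by Notepad
| with a leading BOM:
|
|     open("bom.txt", encoding="utf-8").read()      # '\ufeffhello'
|     open("bom.txt", encoding="utf-8-sig").read()  # 'hello'
|
| utf-8-sig also reads BOM-less files fine; it only strips the
| marker when it is present.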
| duskwuff wrote:
| > I never understood why ignoring this special character
| requires a totally separate encoding.
|
| Because the BOM is indistinguishable from the "real"
| UTF-8 encoding of U+FEFF (zero-width no-break space).
| Trimming that codepoint in the UTF-8 decoder means that
| some strings like "\uFEFF" can't be safely round-tripped;
| adding it in the encoder is invalid in many contexts.
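|
| Concretely, the byte sequence is the same and only the
| interpretation differs:
|
|     >>> "\ufeff".encode("utf-8")
|     b'\xef\xbb\xbf'
|     >>> b'\xef\xbb\xbf'.decode("utf-8")      # keeps the codepoint
|     '\ufeff'
|     >>> b'\xef\xbb\xbf'.decode("utf-8-sig")  # drops it as a BOM
|     ''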
| tialaramex wrote:
| The BOM cases are at best a consequence of trying to use
| poor quality Windows software to do stuff it's not suited
| to. It's true that in terms of Unicode text it's valid for
| a UTF-8 string to have a BOM, but just because that's true
| in the text itself doesn't magically change file formats
| which long pre-dated that.
|
| Most obviously shebang (the practice of writing
| #!/path/to/interpreter at the start of a script) is
| specifically defined on those first two bytes. It doesn't
| make any sense to have a BOM here because that's not the
| format, and inventing a new rule later which says you can
| do it doesn't make that true, any more than in 2024 the
| German government can decide Germany didn't invade Poland
| in 1939, that's not how Time's Arrow works.
| fbdab103 wrote:
| A different one that just bit me the other day was implicitly
| changing line endings. Local testing on my corporate laptop all
| went according to plan. Deploy to linux host and downstream
| application cannot consume it because it requires CRLF.
|
| Just one of those stupid little things you have to remember
| from time to time. Although, why newly written software
| requires a specific line terminator is a valid question.
| Dwedit wrote:
| With system-default code pages on Windows, it's not only
| platform-dependent, it's also System Locale dependent.
|
| Windows badly dropped the ball here by not providing a simple
| opt-in way to make all the Ansi functions (TextOutA, etc) use
| the UTF-8 code page, until many many years later with the
| manifest file. This should have been a feature introduced in
| NT4 or Windows 98, not something that's put off until midway
| through Windows 10's development cycle.
| sheepscreek wrote:
| I suspect that is a symptom of Microsoft being an enormously
| large organization. Coordinating a change like this that cuts
| across all apps, services and drivers is monumental. Honestly
| it is quite refreshing to see them do it with Copilot
| integration across all things MS. I don't use it though, just
| admire the valiant effort and focus it takes to pull off
| something like this.
|
| Of course - goes without saying, only works when the
| directive comes from all the way at the top. Otherwise there
| will be just too many conflicting incentives for any real
| change to happen.
|
| While I am on this topic - I want to mention Apple. It is
| absolutely bonkers how they have done exactly this
| countless times. Like changing their entire platform
| architecture! It could have been like opening a can of worms
| but they knew what they were doing. Kudos to them.
|
| Also..(sorry, this is becoming a long post) civil and
| industrial engineering firms routinely pull off projects like
| that. But the point I wanted to emphasize is that it's very
| uncommon in tech, which prides itself on having decentralized
| and
| semi-autonomous teams vs centralized and highly aligned
| teams.
| Euphorbium wrote:
| I thought it was default since python 3.
| lucb1e wrote:
| You may be thinking of strings where the u"" prefix was made
| obsolete in python3. Then again, trying on Python 2.7 just now,
| typing "eku" results in it printing the UTF-8 bytes for those
| characters so I don't actually know what that u prefix ever
| did, but one of the big py2-to-3 changes was strings having an
| encoding and byte strings being for byte sequences without
| encodings
|
| This change seems to be about things like open('filename',
| mode='r') mainly on Windows where the default encoding is not
| UTF-8 and so you'd have to specify open('filename', mode='r',
| encoding='UTF-8')
| aktiur wrote:
| > strings having an encoding and byte strings being for byte
| sequences without encodings
|
| You got it kind of backwards. A `str` is a sequence of Unicode
| codepoints ( _not_ UTF-8, which is a specific encoding for
| Unicode codepoints), without reference to any encoding. A
| `bytes` object is an arbitrary sequence of octets. If you have some
| `bytes` object that somehow stands for text, you need to know
| that it is text and what its encoding is to be able to
| interpret it correctly (by decoding it to `str`).
|
| And, if you got a `str` and want to serialize it (for writing
| or transmitting), you need to choose an encoding, because
| different encodings will generate different `bytes`.
|
| As an example:
|
|     >>> "événement".encode("utf-8")
|     b'\xc3\xa9v\xc3\xa8nement'
|     >>> "événement".encode("latin-1")
|     b'\xe9v\xe8nement'
| jcranmer wrote:
| Python has two types of strings: byte strings (every
| character is in the range of 0-255) and Unicode strings
| (every character is a Unicode codepoint). In Python 2.x, ""
| maps to a byte string and u"" maps to a Unicode string; in
| Python 3.x, "" maps to a unicode string and b"" maps to a
| byte string.
|
| If you typed in "eku" in Python 2.7, what you get is a string
| consisting of the hex chars 0xC3 0xA9 0xC4 0xB7 0xC5 0xAF,
| which if you printed it out and displayed it as UTF-8--the
| default of most terminals--would appear to be eku. But
| "eku"[1] would return a byte string of \xa9 which isn't valid
| UTF-8 and would likely display as garbage.
|
| If you instead had used u"eku", you'd instead get a string of
| three Unicode code points, U+00E9 U+0137 U+016F. And
| u"eku"[1] would return u"k", which is a valid Unicode
| character.
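|
| In Python 3 the same literal is a Unicode string by default,
| so the pitfall disappears:
|
|     >>> s = "éķů"
|     >>> len(s)               # three codepoints, not six bytes
|     3
|     >>> s[1]
|     'ķ'
|     >>> s.encode("utf-8")    # the byte form is explicit
|     b'\xc3\xa9\xc4\xb7\xc5\xaf'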
| d0mine wrote:
| The Python source code is utf-8 by default in Python 3. But it
| says nothing about a character encoding used to save to a file.
| It is locale-dependent by default.
|
|     # string literals create str objects using utf-8 by default
|     Path("filenames use their own encoding").write_text(
|         "file content encoding uses yet another encoding")
|
| The corresponding encodings are:
|
| - utf-8 [tokenize.open]
| - sys.getfilesystemencoding() [os.fsencode]
| - locale.getpreferredencoding() [open]
| a-french-anon wrote:
| Why not utf-8-sig, though? It handles _optional_ BOMs. Had to fix
| a script last week that choked on it.
| orf wrote:
| Because changing Python to silently prefix all IO with an
| invisible BOM isn't a good idea.
| int_19h wrote:
| The expectation isn't for it to generate BOM in the output,
| but to handle BOM gracefully when it occurs in the input.
| shellac wrote:
| At this point nothing ought to be inserting BOMs in utf-8. It's
| not recommended, and I think choking on it is reasonable
| behaviour these days.
| Athas wrote:
| Why were BOMs ever allowed for UTF-8?
| plorkyeran wrote:
| When UTF-8 was still very much not the default encoding for
| text files it was useful to have a way to signal that a
| file was UTF-8 and not the local system encoding.
| josefx wrote:
| Some editors used them to help detect UTF-8 encoded files.
| Since a BOM is also a valid zero-width no-break space
| character, it served as a nice easter egg for people who
| ended up editing their Linux shell scripts with a Windows
| text editor.
| Dwedit wrote:
| Basically every C# program will insert BOMs into text files
| by default unless you opt-out.
| anordal wrote:
| The following heuristic has become increasingly true over the
| last couple of decades: If you have some kind of "charset"
| configuration anywhere, and it's not UTF-8, it's wrong.
|
| Python 2 was charset agnostic, so it always worked, but the
| change in Python 3 was not purely an improvement. How do you
| tell a Python 3 script from a Python 2 script?
|
| * If it contains the string "utf-8", it's Python3.
|
| * If it only works if your locale is C.UTF-8, it's Python3.
|
| Needless to say, I welcome this change. The way I understand it,
| it would "repair" Python 3.
| Aerbil313 wrote:
| Nice. Now the only thing we need is JS to switch to UTF-8. But of
| course JS can't improve, because unlike any other programming
| language, we need to be compatible with code written in 1995.
| Animats wrote:
| Is the internal encoding in CPython UTF-8 yet?
|
| You can index through Python strings with a subscript, but random
| access is rare enough that it's probably worthwhile to lazily
| index a string when needed. If you just need to advance or back
| up by 1, you don't need an index. So an internal representation
| of UTF-8 is quite possible.
| rogerbinns wrote:
| The PyUnicode object is what represents a str. If the UTF-8
| bytes are ever requested, then a bytes object is created on
| demand and cached as part of the PyUnicode, and freed when
| the PyUnicode itself is freed.
|
| Separately from that the codepoints making up the string are
| stored in a straightforward array allowing random access. The
| size of each codepoint can be 1, 2, or 4 bytes. When you create
| a PyUnicode you have to specify the maximum codepoint value
| which is rounded up to 127, 255, 65535, or 1,114,111. That
| determines if 1, 2, or 4 bytes is used.
|
| If the maximum codepoint value is 127 then that array
| representation can be used for the UTF-8 directly. So the
| answer to your question is that many strings are stored as
| UTF-8 because all the codepoints are <= 127.
|
| Separately from that, advancing through strings should not be
| done by codepoints anyway. A user perceived character (aka
| grapheme cluster) is made up of one or more codepoints. For
| example an e with an accent could be the e codepoint followed
| by a combining accent codepoint. The phoenix emoji is really
| the bird emoji, a zero width joiner, and then the fire emoji. Some
| writing systems used by hundreds of millions of people are
| similar to having consonants, with combining marks to represent
| vowels.
|
| This emoji - 🤦🏼‍♂️ - is 5 codepoints. There is a good blog post
| diving into it and how various languages report its "length".
| https://hsivonen.fi/string-length/
|
| Source: I've just finished implementing Unicode TR29 which
| covers this for a Python C extension.
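|
| For instance, a user-perceived "é" can be one codepoint or
| two, which is why stepping codepoint by codepoint is not the
| same as stepping character by character:
|
|     >>> import unicodedata
|     >>> composed = "\u00e9"      # 'é' as one precomposed codepoint
|     >>> decomposed = "e\u0301"   # 'e' + combining acute accent
|     >>> len(composed), len(decomposed)
|     (1, 2)
|     >>> composed == decomposed   # compares codepoints, not graphemes
|     False
|     >>> unicodedata.normalize("NFC", decomposed) == composed
|     True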
___________________________________________________________________
(page generated 2024-04-26 23:01 UTC)