[HN Gopher] PEP 686 - Make UTF-8 mode default
___________________________________________________________________
PEP 686 - Make UTF-8 mode default
Author : GalaxySnail
Score : 195 points
Date : 2024-04-26 11:55 UTC (11 hours ago)
(HTM) web link (peps.python.org)
(TXT) w3m dump (peps.python.org)
| Macha wrote:
| > And many other popular programming languages, including
| Node.js, Go, Rust, and Java uses UTF-8 by default.
|
| Oh, I missed Java moving from UTF-16 to UTF-8.
| PurpleRamen wrote:
| Seems it happened two years ago, with Java 18.
| rootext wrote:
| It seems you are mixing up two things: the internal string
| representation and the read/write encoding. Java has never used
| UTF-16 as the default for the latter.
| cryptonector wrote:
| Not even on Windows?
| layer8 wrote:
| No, file I/O on Windows in general doesn't use UTF-16, but
| the regional code page, or nowadays UTF-8 if the
| application decides so.
| int_19h wrote:
| Depends on what you define as "file I/O", though. NTFS
| filenames are UTF-16 (or rather UCS-2). As far as file
| contents, there isn't really a standard, but FWIW for a
| long time most Windows apps - Notepad being the canonical
| example - would, when asked to save anything as "Unicode",
| save it as UTF-16.
| layer8 wrote:
| I'm talking about the default behavior of Microsoft's C
| runtime (MSVCRT.DLL) that everyone is/was using.
|
| UTF-16 text files are rather rare, as is using Notepad's
| UTF-16 options. The only semi-common use I know of is
| *.reg files saved from regedit. One issue with UTF-16 is
| that it has two different serializations (BE and LE), and
| hence generally requires a BOM to disambiguate.
| hashmash wrote:
| With Java, the default encoding when converting bytes to
| strings was originally platform dependent, but now it's
| UTF-8. UTF-16 and latin-1 encodings are (still*) used
| internally by the String class, and the JVM uses a modified
| UTF-8 encoding like it always has.
|
| * The String class originally only used UTF-16 encoding, but
| since Java 9 it also uses a single-byte-per-character latin-1
| encoding when possible.
| Myrmornis wrote:
| Hm TIL, I thought that the string encoding argument to .decode()
| and .encode() was required, but now I see it defaults to "utf-8".
| Did that change at some point?
| _ache_ wrote:
| You can verify on the documentation by switching the version.
|
| So ... since 3.2:
| https://docs.python.org/3.2/library/stdtypes.html#bytes.deco...
| In 3.1 it was the default encoding of string (the type str I
| guess).
| https://docs.python.org/3.1/library/stdtypes.html#bytes.deco...
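|
| For illustration, a minimal sketch with any recent Python 3:
|
|     >>> "café".encode()          # encoding defaults to "utf-8"
|     b'caf\xc3\xa9'
|     >>> b'caf\xc3\xa9'.decode()  # decoding also defaults to "utf-8"
|     'café'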
| LeoPanthera wrote:
| > ChatGPT4 says it's always been that way since the beginning
| of Python3
|
| This is not a reliable way to look up information. It doesn't
| know when it's wrong.
| Affric wrote:
| Make UTF-8 default on Windows
| johannes1234321 wrote:
| Since Windows Version 1903 (May 2019 Update) they push for
| UTF-8. But Windows is a big pile of compatible legacy.
| tedivm wrote:
| That's exactly what this proposal (which has been accepted) is
| going to do.
| lolinder wrote:
| I think they mean that the Windows operating system should
| default to UTF-8.
| pjc50 wrote:
| In addition to ApiFunctionA and ApiFunctionW, introduce
| ApiFunction8? (times whole API surface)
|
| Introduce a #define
| UNICODE_NO_REALLY_ALL_UNICODE_WE_MEAN_IT_THIS_TIME ?
| cryptonector wrote:
| ApiFunctionA is UTF-8 capable. Needs a run-time switch too,
| not just compile-time.
| garaetjjte wrote:
| It's now possible, but for years the excuse was that MBCS
| encodings only supported characters up to 2 bytes.
| ComputerGuru wrote:
| Only under windows 11, I believe. And that switch is off by
| default.
| int_19h wrote:
| You're thinking of the global setting that is enabled by
| the user and applies to all apps that operate in terms of
| "current code page" - if enabled, that codepage becomes
| 65001 (UTF-8).
|
| However, on Win10+, apps themselves can explicitly opt
| into UTF-8 for all non-widechar Win32 APIs regardless of
| the current locale/codepage.
| sebazzz wrote:
| Yes: https://learn.microsoft.com/en-
| us/windows/win32/sbscs/applic...
|
| > On Windows 10, this element forces a process to use UTF-8
| as the process code page. For more information, see Use the
| UTF-8 code page. On Windows 10, the only valid value for
| activeCodePage is UTF-8.
|
| > This element was first added in Windows 10 version 1903
| (May 2019 Update). You can declare this property and
| target/run on earlier Windows builds, but you must handle
| legacy code page detection and conversion as usual. This
| element has no attributes.
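|
| Once a process has opted in this way, its active ANSI code
| page becomes 65001. A minimal sketch of how one might verify
| that from Python on Windows (GetACP is the relevant Win32
| call):
|
|     import ctypes
|
|     # 65001 means the process is running with UTF-8 as its
|     # ANSI code page
|     print(ctypes.windll.kernel32.GetACP())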
| layer8 wrote:
| That would break so many applications and workflows that it
| will never happen.
| lexicality wrote:
| > Additionally, many Python developers using Unix forget that the
| default encoding is platform dependent. They omit to specify
| encoding="utf-8" when they read text files encoded in UTF-8
|
| "forget" or possibly simply aren't made well enough aware? I
| genuinely thought that python would only use UTF-8 for everything
| unless you explicitly ask it to do otherwise.
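|
| A minimal sketch of the difference, assuming a machine whose
| locale encoding is not UTF-8 (say cp1252 on Windows) and a
| hypothetical notes.txt saved as UTF-8:
|
|     import locale
|
|     # what open() falls back to when no encoding is given
|     locale.getpreferredencoding(False)   # e.g. 'cp1252'
|
|     # relies on that platform default; may silently mis-decode
|     text = open("notes.txt").read()
|
|     # explicit, which is effectively what PEP 686 makes the default
|     text = open("notes.txt", encoding="utf-8").read()
|
| UTF-8 mode can already be opted into today with `python -X
| utf8` or the PYTHONUTF8=1 environment variable.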
| jillesvangurp wrote:
| Not relying on flaky system defaults is a good thing. These
| things have a way of turning around and being different than what
| you assume them to be. A few years ago I was dealing with Ubuntu
| and some init.d scripts. One issue I ran into was that some
| script we used to launch Java (this was before docker) was
| running as root (bad, I know) and with a shell that did not set
| UTF-8 as the default, as would be completely normal for regular
| users. And of course that revealed some bad APIs that we were
| using in Java that use the OS default. Most of these things have
| variants that allow you to set the encoding at this point and a
| lot of static code checkers will warn you if you use the wrong
| one. But of course it only takes one place for this to start
| messing up content.
|
| These days it's less of an issue, but I would simply never
| rely on the OS to get this right. Most uses of encodings
| other than UTF-8 are extremely likely to be unintentional at this
| point. And if it is intentional, you should be very explicit
| about it and not rely on weird indirect configuration through the
| OS that may or may not line up.
|
| So, good change. Anything that breaks over this is probably
| better off with the simple fix added. And it's not worth leaving
| everything else as broken as it is with content corruption bugs
| just waiting to happen.
| nerdponx wrote:
| Default text file encoding being platform-dependent always drove
| me nuts. This is a welcome change.
|
| I also appreciate that they did not attempt to tackle filesystem
| encoding here, which is a separate issue that drives me nuts, but
| separately.
| layer8 wrote:
| Historically it made sense, when most software was local-only,
| and text files were expected to be in the local encoding. Not
| just platform-dependent, but user's preferred locale-dependent.
| This is also how the C standard library operates.
|
| For example, on Unix/Linux, using iso-8859-1 was common when
| using Western-European languages, and in Europe it became
| common to switch to iso-8859-15 after the Euro was introduced,
| because it contained the euro sign (€). UTF-8 only began to
| work flawlessly in the later aughts. Debian switched to it as
| the default with the Etch release in 2007.
| anthk wrote:
| Emacs was amazing for that; builtin text
| encoders/decoders/transcoders for everything.
| hollerith wrote:
| My experience was that brittleness around text encoding in
| Emacs (versions 22 and 23 or so) was a constant source of
| annoyance for years.
|
| IIRC, the main way this brittleness bit me was that every
| time a buffer containing a non-ASCII character was saved,
| Emacs would engage me in a conversation (which I found
| tedious and distracting) about what coding system I would
| like to use to save the file, and I never found a sane way
| to configure it to avoid such conversations even after
| spending hours learning about how Emacs does coding
| systems: I simply had to wait (a year or 3) for a new
| version of Emacs in which the code for saving buffers
| worked better.
|
| I think some people _like_ engaging in these conversations
| with their computers even though the conversations are very
| boring and repetitive and that such conversation-likers are
| numerous among Emacs users or at least Emacs maintainers.
| da_chicken wrote:
| It's still not that uncommon to see programs on Linux not
| understanding multibyte UTF-8.
|
| It's also true that essentially nothing on Linux supports the
| UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but
| it is explicitly allowed in the specifications. Since
| Microsoft tends to always include a BOM in any flavor of
| Unicode, this means Linux often chokes on valid UTF-8 text
| files from Windows systems.
| nerdponx wrote:
| Interestingly, Python is one of those programs.
|
| You need to use the special "utf-8-sig" encoding for that,
| which is not prominently advertised anywhere in the
| documentation (but it is stated deep inside the "Unicode
| HOWTO").
|
| I never understood why ignoring this special character
| requires a totally separate encoding.
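|
| For example, with a hypothetical bom.txt written by Notepad
| with a leading BOM:
|
|     open("bom.txt", encoding="utf-8").read()      # '\ufeffhello'
|     open("bom.txt", encoding="utf-8-sig").read()  # 'hello'
|
| utf-8-sig also reads BOM-less files fine; it only strips the
| marker when it is present.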
| duskwuff wrote:
| > I never understood why ignoring this special character
| requires a totally separate encoding.
|
| Because the BOM is indistinguishable from the "real"
| UTF-8 encoding of U+FEFF (zero-width no-break space).
| Trimming that codepoint in the UTF-8 decoder means that
| some strings like "\uFEFF" can't be safely round-tripped;
| adding it in the encoder is invalid in many contexts.
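|
| Concretely, the byte sequence is the same and only the
| interpretation differs:
|
|     >>> "\ufeff".encode("utf-8")
|     b'\xef\xbb\xbf'
|     >>> b'\xef\xbb\xbf'.decode("utf-8")      # keeps the codepoint
|     '\ufeff'
|     >>> b'\xef\xbb\xbf'.decode("utf-8-sig")  # drops it as a BOM
|     ''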
| tialaramex wrote:
| The BOM cases are at best a consequence of trying to use
| poor quality Windows software to do stuff it's not suited
| to. It's true that in terms of Unicode text it's valid for
| a UTF-8 string to have a BOM, but just because that's true
| in the text itself doesn't magically change file formats
| which long pre-dated that.
|
| Most obviously shebang (the practice of writing
| #!/path/to/interpreter at the start of a script) is
| specifically defined on those first two bytes. It doesn't
| make any sense to have a BOM here because that's not the
| format, and inventing a new rule later which says you can
| do it doesn't make that true, any more than in 2024 the
| German government can decide Germany didn't invade Poland
| in 1939, that's not how Time's Arrow works.
| fbdab103 wrote:
| A different one that just bit me the other day was implicitly
| changing line endings. Local testing on my corporate laptop all
| went according to plan. Deploy to linux host and downstream
| application cannot consume it because it requires CRLF.
|
| Just one of those stupid little things you have to remember
| from time to time. Although, why newly written software
| requires a specific line terminator is a valid question.
| Dwedit wrote:
| With system-default code pages on Windows, it's not only
| platform-dependent, it's also System Locale dependent.
|
| Windows badly dropped the ball here by not providing a simple
| opt-in way to make all the Ansi functions (TextOutA, etc) use
| the UTF-8 code page, until many many years later with the
| manifest file. This should have been a feature introduced in
| NT4 or Windows 98, not something that's put off until midway
| through Windows 10's development cycle.
| sheepscreek wrote:
| I suspect that is a symptom of Microsoft being an enormously
| large organization. Coordinating a change like this that cuts
| across all apps, services and drivers is monumental. Honestly
| it is quite refreshing to see them do it with Copilot
| integration across all things MS. I don't use it though, just
| admire the valiant effort and focus it takes to pull off
| something like this.
|
| Of course - goes without saying, only works when the
| directive comes from all the way at the top. Otherwise there
| will be just too many conflicting incentives for any real
| change to happen.
|
| While I am on this topic - I want to mention Apple. It is
| absolutely bonkers how they have done exactly this
| countless times. Like changing their entire platform
| architecture! It could have been like opening a can of worms
| but they knew what they were doing. Kudos to them.
|
| Also..(sorry, this is becoming a long post) civil and
| industrial engineering firms routinely pull off projects like
| that. But the point I wanted to emphasize is that it's very
| uncommon in tech, which prides itself on having decentralized
| and
| semi-autonomous teams vs centralized and highly aligned
| teams.
| Euphorbium wrote:
| I thought it was default since python 3.
| lucb1e wrote:
| You may be thinking of strings where the u"" prefix was made
| obsolete in python3. Then again, trying on Python 2.7 just now,
| typing "eku" results in it printing the UTF-8 bytes for those
| characters so I don't actually know what that u prefix ever
| did, but one of the big py2-to-3 changes was strings having an
| encoding and byte strings being for byte sequences without
| encodings
|
| This change seems to be about things like open('filename',
| mode='r') mainly on Windows where the default encoding is not
| UTF-8 and so you'd have to specify open('filename', mode='r',
| encoding='UTF-8')
| aktiur wrote:
| > strings having an encoding and byte strings being for byte
| sequences without encodings
|
| You got it kind of backwards. A `str` is a sequence of Unicode
| codepoints ( _not_ UTF-8, which is a specific encoding for
| Unicode codepoints), without reference to any encoding. A
| `bytes` object is an arbitrary sequence of octets. If you have some
| `bytes` object that somehow stands for text, you need to know
| that it is text and what its encoding is to be able to
| interpret it correctly (by decoding it to `str`).
|
| And, if you got a `str` and want to serialize it (for writing
| or transmitting), you need to choose an encoding, because
| different encodings will generate different `bytes`.
|
| As an example:
|
|     >>> "événement".encode("utf-8")
|     b'\xc3\xa9v\xc3\xa8nement'
|     >>> "événement".encode("latin-1")
|     b'\xe9v\xe8nement'
| jcranmer wrote:
| Python has two types of strings: byte strings (every
| character is in the range of 0-255) and Unicode strings
| (every character is a Unicode codepoint). In Python 2.x, ""
| maps to a byte string and u"" maps to a Unicode string; in
| Python 3.x, "" maps to a unicode string and b"" maps to a
| byte string.
|
| If you typed in "eku" in Python 2.7, what you get is a string
| consisting of the hex chars 0xC3 0xA9 0xC4 0xB7 0xC5 0xAF,
| which if you printed it out and displayed it as UTF-8--the
| default of most terminals--would appear to be eku. But
| "eku"[1] would return a byte string of \xa9 which isn't valid
| UTF-8 and would likely display as garbage.
|
| If you instead had used u"eku", you'd instead get a string of
| three Unicode code points, U+00E9 U+0137 U+016F. And
| u"eku"[1] would return u"k", which is a valid Unicode
| character.
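|
| In Python 3 the same literal is a Unicode string by default,
| so the pitfall disappears:
|
|     >>> s = "éķů"
|     >>> len(s)               # three codepoints, not six bytes
|     3
|     >>> s[1]
|     'ķ'
|     >>> s.encode("utf-8")    # the byte form is explicit
|     b'\xc3\xa9\xc4\xb7\xc5\xaf'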
| d0mine wrote:
| The Python source code is utf-8 by default in Python 3. But it
| says nothing about a character encoding used to save to a file.
| It is locale-dependent by default.
|
|     # string literals create str objects using utf-8 by default
|     Path("filenames use their own encoding").write_text(
|         "file content encoding uses yet another encoding")
|
| The corresponding encodings are:
|
| - utf-8 [tokenize.open]
| - sys.getfilesystemencoding() [os.fsencode]
| - locale.getpreferredencoding() [open]
| a-french-anon wrote:
| Why not utf-8-sig, though? It handles _optional_ BOMs. Had to fix
| a script last week that choked on it.
| orf wrote:
| Because changing Python to silently prefix all IO with an
| invisible BOM isn't a good idea.
| int_19h wrote:
| The expectation isn't for it to generate BOM in the output,
| but to handle BOM gracefully when it occurs in the input.
| shellac wrote:
| At this point nothing ought to be inserting BOMs in utf-8. It's
| not recommended, and I think choking on it is reasonable
| behaviour these days.
| Athas wrote:
| Why were BOMs ever allowed for UTF-8?
| plorkyeran wrote:
| When UTF-8 was still very much not the default encoding for
| text files it was useful to have a way to signal that a
| file was UTF-8 and not the local system encoding.
| josefx wrote:
| Some editors used them to help detect UTF-8 encoded files.
| Since a BOM is also a valid zero-width no-break space
| character, it served as a nice easter egg for people who
| ended up editing their Linux shell scripts with a Windows
| text editor.
| Dwedit wrote:
| Basically every C# program will insert BOMs into text files
| by default unless you opt-out.
| anordal wrote:
| The following heuristic has become increasingly true over the
| last couple of decades: If you have some kind of "charset"
| configuration anywhere, and it's not UTF-8, it's wrong.
|
| Python 2 was charset agnostic, so it always worked, but the
| change in Python 3 was not purely an improvement. How do you
| tell a Python 3 script from a Python 2 script?
|
| * If it contains the string "utf-8", it's Python3.
|
| * If it only works if your locale is C.UTF-8, it's Python3.
|
| Needless to say, I welcome this change. The way I understand it,
| it would "repair" Python 3.
| Aerbil313 wrote:
| Nice. Now the only thing we need is JS to switch to UTF-8. But of
| course JS can't improve, because unlike any other programming
| language, we need to be compatible with code written in 1995.
| Animats wrote:
| Is the internal encoding in CPython UTF-8 yet?
|
| You can index through Python strings with a subscript, but random
| access is rare enough that it's probably worthwhile to lazily
| index a string when needed. If you just need to advance or back
| up by 1, you don't need an index. So an internal representation
| of UTF-8 is quite possible.
| rogerbinns wrote:
| The PyUnicode object is what represents a str. If the UTF-8
| bytes are ever requested, then a bytes object is created on
| demand and cached as part of the PyUnicode, and freed when
| the PyUnicode itself is freed.
|
| Separately from that the codepoints making up the string are
| stored in a straightforward array allowing random access. The
| size of each codepoint can be 1, 2, or 4 bytes. When you create
| a PyUnicode you have to specify the maximum codepoint value
| which is rounded up to 127, 255, 65535, or 1,114,111. That
| determines if 1, 2, or 4 bytes is used.
|
| If the maximum codepoint value is 127 then that array
| representation can be used for the UTF-8 directly. So the
| answer to your question is that many strings are stored as
| UTF-8 because all the codepoints are <= 127.
|
| Separately from that, advancing through strings should not be
| done by codepoints anyway. A user perceived character (aka
| grapheme cluster) is made up of one or more codepoints. For
| example an e with an accent could be the e codepoint followed
| by a combining accent codepoint. The phoenix emoji is really
| the bird emoji, a zero width joiner, and then the fire emoji. Some
| writing systems used by hundreds of millions of people are
| similar to having consonants, with combining marks to represent
| vowels.
|
| This emoji - 🤦🏼‍♂️ - is 5 codepoints. There is a good blog post
| diving into it and how various languages report its "length".
| https://hsivonen.fi/string-length/
|
| Source: I've just finished implementing Unicode TR29 which
| covers this for a Python C extension.
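|
| For instance, a user-perceived "é" can be one codepoint or
| two, which is why stepping codepoint by codepoint is not the
| same as stepping character by character:
|
|     >>> import unicodedata
|     >>> composed = "\u00e9"      # 'é' as one precomposed codepoint
|     >>> decomposed = "e\u0301"   # 'e' + combining acute accent
|     >>> len(composed), len(decomposed)
|     (1, 2)
|     >>> composed == decomposed   # compares codepoints, not graphemes
|     False
|     >>> unicodedata.normalize("NFC", decomposed) == composed
|     True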
___________________________________________________________________
(page generated 2024-04-26 23:01 UTC)