[HN Gopher] Transcoding Latin 1 strings to UTF-8 strings at 12 G...
       ___________________________________________________________________
        
       Transcoding Latin 1 strings to UTF-8 strings at 12 GB/s using
       AVX-512
        
       Author : mariuz
       Score  : 109 points
        Date   : 2023-08-20 10:52 UTC (1 day ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | justin101 wrote:
        | Where does one even go about finding 12 GB of pure Latin text?
        
         | Rebelgecko wrote:
         | I had the same question, wondering what sort of workflow would
         | have this task in the critical path. Maybe if the Library of
         | Congress needs to change their default text encoding it'll save
         | a minute or two?
         | 
         | The benchmark result is cool, but I'm curious how well it works
         | with smaller outputs. When I've played around with SIMD stuff
          | in the past, you can't necessarily go off of metrics like
          | "bytes generated per cycle", because of how much CPU freq can
          | vary when using SIMD instructions, context switching costs,
          | and different thermal properties (e.g. maybe the work per
          | cycle is higher with SIMD, but the CPU generates heat much
          | more quickly and downclocks itself).
        
         | [deleted]
        
         | martijnvds wrote:
         | The Vatican?
        
           | ant6n wrote:
           | The latin in latin-1 refers to the alphabet, not the
           | language. In fact latin-1 can encode many Western European
           | languages.
        
             | CoastalCoder wrote:
             | I believe it was a joke.
             | 
             | But the humour may have been lost in translation. It's
             | funnier in the original ASCII.
        
               | mmastrac wrote:
               | The high bit is generally used to indicate humour.
        
         | lovasoa wrote:
          | Not sure whether that was sarcastic, but ISO-8859-1 (Latin 1)
          | encodes most European languages, not just Latin.
         | 
         | https://en.wikipedia.org/wiki/ISO/IEC_8859-1
        
           | ko27 wrote:
            | But where do you find it? Almost the entire internet is
           | UTF-8. You can always transcode to Latin 1 for testing
           | purposes, but that raises the question of practical benefits
           | of this algorithm.
        
             | tgv wrote:
             | Older corpora are probably still in Latin-1 or some
              | variant. That could include decades of newspaper
              | publications.
        
         | [deleted]
        
         | the8472 wrote:
         | It's not necessarily about sustained throughput spent only in
         | this routine. It can be small bursts of processing text
         | segments that are then handed off to other parts of the
         | program.
         | 
          | Once a program is optimized to the point where no leaf method
          | / hot loop takes up more than a few percent of runtime, and
          | algorithmic improvements aren't available or are extremely
          | hard to implement, the speed of all the basic routines
          | (memcpy, allocations, string processing, data structures)
          | starts to matter. The constant factors elided by Big-O
          | notation start to matter.
        
         | [deleted]
        
       | londons_explore wrote:
        | Since values 0-127 are used _far_ more frequently than 128-255
        | in latin-1, it might make more sense to have a fast path which
        | simply loads 512 bits at a time (i.e. 64 bytes), detects if any
        | are 0x80 or above, and if not just outputs them verbatim.
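        | 
        | Something like this, as an untested sketch of what I mean
        | (assuming AVX-512BW; not the code from the article):
        | 
        |     #include <immintrin.h>
        |     #include <stddef.h>
        | 
        |     /* Copy 64-byte blocks verbatim while they are pure
        |        ASCII; return how many input bytes were handled so
        |        the caller can fall back to the general Latin-1 to
        |        UTF-8 routine for the rest. */
        |     static size_t ascii_fast_path(const unsigned char *in,
        |                                   size_t len,
        |                                   unsigned char *out) {
        |       size_t i = 0;
        |       while (i + 64 <= len) {
        |         __m512i v =
        |             _mm512_loadu_si512((const void *)(in + i));
        |         /* one mask bit per byte, set when byte >= 0x80 */
        |         if (_mm512_movepi8_mask(v) != 0) break;
        |         _mm512_storeu_si512((void *)(out + i), v);
        |         i += 64;
        |       }
        |       return i;
        |     }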
        
         | NelsonMinar wrote:
          | The article has a whole section about that; you might enjoy
          | reading it. He reports a ~20% speedup on his test data.
        
         | twoodfin wrote:
         | I don't know if the article has been updated since your
         | comment, but this approach is discussed & benchmarked. For the
         | benchmarked data set it's a winner.
        
           | wffurr wrote:
           | The article was indeed updated since I read it and the parent
           | comment this morning.
        
         | jojobas wrote:
          | Either way, throughput will depend on the fraction of
          | characters above 127; what input data gave 12 GB/s seems to
          | be a mystery.
        
           | reaperhulk wrote:
           | The article states it's the French version of the Mars
           | wikipedia entry and the repository has a link to the file he
           | used in the readme: https://raw.githubusercontent.com/lemire/
           | unicode_lipsum/main...
        
           | [deleted]
        
       | redox99 wrote:
       | 12GB/s seems a bit slow. I'd expect the only bottleneck to be
       | memory bandwidth.
       | 
        | A dual-channel DDR4 system's memory bandwidth is ~40 GB/s, and
        | DDR5's ~80 GB/s.
       | 
       | Since this operation requires both a read and a write, you'd
       | expect half that.
        
         | peppermint_gum wrote:
         | > A dual channel DDR4 system memory bandwidth is ~40GB/s, and
         | DDR5 ~80GB/s.
         | 
         | It's impossible to saturate the memory bandwidth on a modern
         | CPU with a single thread, even if all you do is reads with
         | absolutely no processing. The bottleneck is how fast
         | outstanding cache misses can be satisfied.
         | 
         | The article even links to a benchmark that attempts to measure
         | what it calls "sustainable memory bandwidth":
         | https://www.cs.virginia.edu/stream/ref.html
        
       | jojobas wrote:
        | Interesting to see how a non-AVX, non-branching version would
        | do: you'd need a prefilled array for the extra pointer advance
        | (0/1) and seemingly two more for the bitbanging.
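        | 
        | For instance, an untested scalar sketch (using the high bit
        | directly for the 0/1 advance rather than a prefilled table;
        | the output buffer needs one spare byte of slack):
        | 
        |     #include <stddef.h>
        | 
        |     size_t latin1_to_utf8_branchless(const unsigned char *in,
        |                                      size_t len,
        |                                      unsigned char *out) {
        |       unsigned char *start = out;
        |       for (size_t i = 0; i < len; i++) {
        |         unsigned char c = in[i];
        |         unsigned char hi = c >> 7;          /* 0 or 1 */
        |         unsigned char m = (unsigned char)(0 - hi);
        |         /* first byte: c itself, or the 0xC2/0xC3 lead byte */
        |         out[0] = (unsigned char)(((0xC0 | (c >> 6)) & m)
        |                                  | (c & (unsigned char)~m));
        |         /* second byte is scratch when the input is ASCII */
        |         out[1] = (unsigned char)(0x80 | (c & 0x3F));
        |         out += 1 + hi;
        |       }
        |       return (size_t)(out - start);
        |     }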
        
         | [deleted]
        
         | xiphias2 wrote:
          | Another option would be a lookup vector of 256 16-bit
          | entries, keeping the pointer-advance vector as you suggested.
        
       | londons_explore wrote:
       | Every time someone writes some really carefully micro-optimized
       | piece of code like this, I worry that the implementation won't be
       | shared with the whole world.
       | 
        | This code only makes people's lives better if many languages
        | and frameworks that translate Latin-1 to UTF-8 are updated to
        | have this new faster implementation.
       | 
       | If this took 3 days to write and benchmark, then to save 3 days
       | of human time, we probably need to get this into the hands of
       | hundreds of millions of people, saving each person a few hundred
       | microseconds.
        
         | re-thc wrote:
         | > I worry that the implementation won't be shared with the
         | whole world.
         | 
         | Considering the author also created
         | https://github.com/simdutf/simdutf it's likely used or will be
         | used in NodeJs amongst other things. Is that good enough?
        
         | magicalhippo wrote:
         | > This code only makes people's lives better if many languages
         | and frameworks that translates latin-1 to utf8 are updated to
         | have this new faster implementation.
         | 
          | Except CPUs evolve, and what was once a fast way of doing
          | things may no longer be very fast. And with ASM you get no
          | compiler to generate better-targeted instructions.
          | 
          | I've seen many instances where significant performance was
          | gained by swapping out an old hand-written ASM routine for a
          | plain language version.
          | 
          | If you ever add some optimized ASM to your code, do a
          | performance check at startup or similar, and have the plain
          | language version as a fallback.
        
           | TinkersW wrote:
           | It is written with intrinsics not ASM.
           | 
           | Compilers understand intrinsics and can optimize around them,
            | and CPUs evolve improved SIMD instruction sets at a snail's
            | pace.
           | 
           | Intel doesn't even really support AVX512 yet for consumer
           | hardware, and maybe never will, so this code is mostly only
           | good for very modern AMD.
        
             | bruce343434 wrote:
             | What do you mean "optimize around them"? Do you have a
             | godbolt/codegen example of suboptimal intrinsic calls being
             | optimized?
        
             | magicalhippo wrote:
             | I'm talking about which instructions and idioms are
             | optimal. AFAIK, with intrinsics the compiler won't
             | completely change what you've written.
             | 
              | Back in the day, REP MOVSB was the fastest way to copy
              | bytes; then the Pentium came and rolling your own loop
              | was better. Then CPUs improved and REP MOVSB was suddenly
              | better again[1], for those CPUs. And then it changed
              | again...
             | 
             | Similar story for other idioms where implementation details
             | on CPUs change. Compilers can respond and target your exact
             | CPU.
             | 
              | [1]: https://github.com/golang/go/issues/14630 (notice how
              | one commenter says the same patch that gives a 1.6x boost
              | for the OP gives them a 5x degradation)
        
         | maxerickson wrote:
         | Are you also worried about my hobby vegetable garden being a
         | waste of time?
         | 
         | I'm sure I could get my tomato fix at the farmers market.
        
         | whoknowswhat11 wrote:
          | Is AVX-512 broadly available and error-free, with no stalls,
          | slowdowns or other side effects? For a long time it felt like
          | a corner-case Intel thing.
        
           | jacoblambda wrote:
           | In terms of being broadly available, most of AVX-512 (ER, PF,
           | 4FMAPS, and 4VNNIW haven't been available on any new hardware
           | since 2017) is available on basically any Intel cpu
           | manufactured since 2020 as well as on all AMD Zen4 (2022 and
           | on) cpus.
           | 
           | https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
           | 
           | I can't speak to being error free or other issues but it
           | should at the very least be present on any modern desktop,
           | laptop, or server x86 CPU you could buy today.
           | 
            | Edit: I forgot to mention, but Intel's Alder Lake CPUs only
            | have partial support, presumably due to some issue with E
            | cores. I'd guess Intel will get their shit together
            | eventually wrt this now that AMD is shipping all their
            | hardware with this instruction set.
        
             | unnah wrote:
             | Intel seems to be going for market segmentation, with
             | AVX-512 only available on their server CPUs. The option to
             | enable AVX-512 has been removed from Alder Lake CPUs since
             | 2022, and there is no AVX-512 on Raptor Lake.
             | 
             | AMD also keeps making and selling Zen 3 and Zen 2 chips as
             | lower-cost products, and those do not have AVX-512.
        
               | the8472 wrote:
               | With AVX10 intel will make the instructions available
               | again on all segments. SIMD register width will vary
               | between cores but the instructions will be there.
        
               | wtallis wrote:
                | I don't think it was intentional market segmentation,
                | just poor planning: the whole heterogeneous-cores
                | strategy seems to have been thrown together in a hurry
                | and they didn't have time to add AVX-512 to their Atom
                | cores in an area-efficient way (so as not to negate the
                | point of having E-cores).
        
             | nullifidian wrote:
             | >most of AVX-512 is available on basically any Intel cpu
             | manufactured since 2020
             | 
              | That's incorrect. On the consumer CPU side, Intel
              | introduced AVX-512 for one generation in 2021 (Rocket
              | Lake), but then removed AVX-512 from the subsequent Alder
              | Lake using BIOS updates, and fused it off in later
              | revisions. It's also absent from the current Raptor Lake.
              | So actually it's only available on Intel's server-grade
              | CPUs.
             | 
             | >Edit: I forgot to mention but Intel's Alder lake CPUs only
             | have partial support presumably due to some issue with E
             | cores.
             | 
             | No, this wiki page is outdated.
        
           | papercrane wrote:
            | The latest Intel architecture (Sapphire Rapids) supports it
            | without downclocking. AMD Zen 4 also supports it, although
            | their implementation is double pumped; I'm not sure what
            | the real-world performance impact of that is.
        
             | adrian_b wrote:
             | There is a huge confusion about this "double pumped" thing.
             | 
             | All that this means is that Zen 4 uses the same execution
             | units both for 256-bit operations and for 512-bit
             | operations. This means that the throughput in instructions
             | per cycle for 512-bit operations is half of that for
             | 256-bit operations, but the throughput in bytes per cycle
             | is the same.
             | 
             | However the 512-bit operations need fewer resources for
             | instruction fetching and decoding and for micro-operation
             | storing and dispatching, so in most cases using 512-bit
             | instructions on Zen 4 provides a big speed-up.
             | 
             | Even if Zen 4 is "double pumped", its 256-bit throughput is
             | higher than that of Sapphire Rapids, so after dividing by
             | two, for most instructions it has exactly the same 512-bit
             | throughput as Sapphire Rapids, i.e. two 512-bit register-
             | register instructions per cycle.
             | 
             | The only exceptions are that Sapphire Rapids (with the
             | exception of the cheap SKUs) can do 2 FMA instructions per
             | cycle, while Zen 4 can do only 1 FMA + 1 FADD instructions
             | per cycle, and that Sapphire Rapids has a double throughput
             | for loads and stores from the L1 cache memory. There are
             | also a few 512-bit instructions where Zen 4 has better
             | throughput or latency than Sapphire Rapids, e.g. some of
             | the shuffles.
        
         | eesmith wrote:
         | You should also worry about how other peoples' time is wasted
         | when you miss important details then comment about easily
         | assuaged worries.
         | 
         | Quoting the article "I use GCC 11 on an Ice Lake server. My
         | source code is available.", linking to
         | https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
         | .
         | 
         | From the README at the top-level:
         | 
         | > Unless otherwise stated, I make no copyright claim on this
         | code: you may consider it to be in the public domain.
         | 
         | > Don't bother forking this code: just steal it.
        
         | stkdump wrote:
          | It's unlikely that this makes anyone's life better. It is
          | more of a curiosity, and maybe a teachable example of how to
          | do SIMD. I would venture a guess that there are very few
          | workloads that require this conversion for more than a few
          | KB, and over time, as the world migrates to Unicode, there
          | will be fewer and fewer.
        
         | slashdev wrote:
          | The author is a French Canadian academic at Université du
          | Québec à Montréal. He is one of the more famous figures in
         | computer science in all of Canada, with over 5000 citations
         | (which is stretching the meaning of famous, but still.) This is
         | not closed source work optimizing for some company product,
         | this is research for publication on his blog or in computer
         | science journals.
        
           | benreesman wrote:
           | He's one of the most famous computer scientists in general!
           | 
            | The audience for wicked-clever, low/no-branch, cache-aware
            | SIMD sorcery is admittedly not everyone, but if you end up
            | with that kind of problem, this is a go-to!
        
         | SomeoneFromCA wrote:
          | It is mostly educational code. Once you learn AVX-512 you can
          | get boosts in many areas.
        
       | SomeoneFromCA wrote:
        | Another proof that Linus is not always right. There were many
        | folks who just blindly regurgitated that AVX-512 is evil,
        | without even actually knowing a thing about it.
        
         | wtallis wrote:
         | > Another proof that Linus is not always right.
         | 
         | No, this is just a case of the right answer changing over time,
         | as _good_ AVX-512 implementations became available, long after
         | its introduction. And nothing in this article even comes close
         | to addressing the main concern with the early AVX-512
         | implementation: the significant performance penalty it imposes
          | on other code due to Skylake's slow power state transitions.
         | Microbenchmarks like this made AVX-512 look good even on
         | Skylake, because they ignore the broader effects.
        
           | nwallin wrote:
           | To add to your point, this benchmark would not have run on
           | Skylake at all. It uses the _mm512_maskz_compress_epi8
           | instruction, which wasn't introduced until Ice Lake.
        
           | SomeoneFromCA wrote:
            | So is your point that he is indeed always right? Or that he
            | was right in that particular case (he was not)?
            | 
            | If you remember, Linus complained not about the particular
            | implementation of AVX-512, but about the concept itself. It
            | also looks kinda ignorant of him (and anyone else who
            | thinks the same way) to believe that AVX-512 is only about
            | 512-bit width or has no potential, when it is simply a
            | better SIMD ISA compared to AVX1/2. What he did was just
            | express himself in his trademark silly edgy maximalist way.
            | It is an absolute pleasure to work with, it gives a great
            | performance boost, and he should have been more careful
            | with his statements.
        
             | a1369209993 wrote:
             | > So your point he is indeed always right?
             | 
             | No, their point is that this does not _refute_ said
             | hypothetical claim. That is, their point is that it is not,
             | as you claimed:
             | 
             | > Another proof that Linus is not always right.
             | 
              | (I don't know if their point is _correct_, but it's of the
             | form "your argument against X is invalid", not "X is
             | correct".)
        
         | camel-cdr wrote:
          | It kinda depends. I wouldn't be surprised if properly
          | optimized AVX2 could get the same performance, since it looks
          | like the operation is memory bottlenecked.
        
           | SomeoneFromCA wrote:
            | Nah, AVX-512 is a more performant design due to its support
            | for masking. It does not in fact depend on anything. Those
            | who compare AVX2 favorably or equally with AVX-512 have
            | never used either of them.
        
       | ko27 wrote:
        | > Latin 1 standard is still in widespread use inside some
        | systems (such as browsers)
       | 
       | That doesn't seem to be correct. UTF-8 is used by 98% of all the
       | websites. I am not sure if it's even worth the trouble for
       | libraries to implement this algorithm, since Latin-1 encoding is
       | being phased out.
       | 
       | https://w3techs.com/technologies/details/en-utf8
        
         | pzmarzly wrote:
         | And yet HTTP/1.1 headers should be sent in Latin1 (is this
         | fixed in HTTP/2 or HTTP/3?). And WebKit's JavaScriptCore has
         | special handling for Latin1 strings in JS, for performance
         | reasons I assume.
        
           | ko27 wrote:
           | > should be sent in Latin1
           | 
            | Do you have a source on that "should" part? Because the
            | spec disagrees: https://www.rfc-
            | editor.org/rfc/rfc7230#section-3.2.4:
           | 
           | > Historically, HTTP has allowed field content with text in
           | the ISO-8859-1 charset [ISO-8859-1], supporting other
           | charsets only through use of [RFC2047] encoding. In practice,
           | most HTTP header field values use only a subset of the US-
           | ASCII charset [USASCII]. Newly defined header fields SHOULD
           | limit their field values to US-ASCII octets.
           | 
           | In practice and by spec, HTTP headers should be ASCII
           | encoded.
        
             | missblit wrote:
             | The spec may disagree, but webservers do sometimes send
             | bytes outside the ASCII range, and the most sensible way to
             | deal with that on the receiving side is still by treating
             | them as latin1 to match (last I checked) what browsers do
             | with it.
             | 
             | I do agree that latin1 headers shouldn't be _sent_ out
             | though.
        
             | nicktelford wrote:
             | ISO-8859-1 (aka. Latin-1) is a superset of ASCII, so all
             | ASCII strings are also valid Latin-1 strings.
             | 
             | The section you quoted actually suggests that
             | implementations should support ISO-8859-1 to ensure
             | compatibility with systems that use it.
        
               | ko27 wrote:
                | You should read it again:
                | 
                | > Newly defined header fields SHOULD limit their field
                | values to US-ASCII octets
                | 
                | ASCII octets! That means you SHOULD NOT send Latin1-
                | encoded headers, which is the opposite of what pzmarzly
                | was saying. I don't disagree that Latin-1 is a superset
                | of ASCII or with having backward compatibility in mind,
                | but that's not relevant to my response.
        
               | layer8 wrote:
               | SHOULD is a recommendation, not a requirement, and it
               | refers only to newly-defined header fields, not existing
               | ones. The text implies that 8-bit characters in existing
               | fields are to be interpreted as ISO-8859-1.
        
               | jart wrote:
               | Haven't you heard of Postel's Maxim?
               | 
               | Web servers need to be able to receive and decode latin1
               | into utf-8 regardless of what the RFC recommends people
                | send. The fact that it's going to become rarer over
                | time to have the 8th bit set in headers means you can
                | write a simpler algorithm than what Lemire did that
                | assumes an ASCII average case.
                | https://github.com/jart/cosmopolitan/
                | blob/755ae64e73ef5ef7d1... That goes 23 GB/s on my
                | machine using just SSE2 (rather than AVX-512). However
                | it goes much slower if the text is full of European
                | diacritics. Lemire's algorithm is better at decoding
                | those.
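                | 
                | Roughly the shape of that kind of approach, as an
                | untested sketch (not the cosmopolitan code itself):
                | 
                |     #include <emmintrin.h>  /* SSE2 */
                |     #include <stddef.h>
                | 
                |     size_t latin1_to_utf8(const unsigned char *in,
                |                           size_t n,
                |                           unsigned char *out) {
                |       unsigned char *p = out;
                |       size_t i = 0;
                |       while (i + 16 <= n) {
                |         __m128i v = _mm_loadu_si128(
                |             (const __m128i *)(in + i));
                |         if (_mm_movemask_epi8(v) != 0) break;
                |         _mm_storeu_si128((__m128i *)p, v); /* ASCII */
                |         p += 16;
                |         i += 16;
                |       }
                |       for (; i < n; i++) {  /* scalar fallback */
                |         unsigned char c = in[i];
                |         if (c < 0x80) {
                |           *p++ = c;
                |         } else {
                |           *p++ = (unsigned char)(0xC0 | (c >> 6));
                |           *p++ = (unsigned char)(0x80 | (c & 0x3F));
                |         }
                |       }
                |       return (size_t)(p - out);
                |     }
                | 
                | (A real implementation would go back to the SIMD loop
                | after handling a non-ASCII block; this is just the
                | shape.)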
        
               | HideousKojima wrote:
               | >Haven't you heard of Postel's Maxim?
               | 
               | Otherwise known as "Making other people's incompetence
                | and inability to implement a specification _your_
                | problem." Just because it's a widely quoted maxim
               | doesn't make it good advice.
        
         | kannanvijayan wrote:
         | One place I know where latin1 is still used is as an internal
         | optimization in javascript engines. JS strings are composed of
         | 16-bit values, but the vast majority of strings are ascii. So
         | there's a motivation to store simpler strings using 1 byte per
         | char.
         | 
         | However, once that optimization has been decided, there's no
         | point in leaving the high bit unused, so the engines keep
         | optimized "1-byte char" strings as Latin1.
        
           | HideousKojima wrote:
           | >So there's a motivation to store simpler strings using 1
           | byte per char.
           | 
           | What advantage would this have over UTF-7, especially since
           | the upper 128 characters wouldn't match their Unicode values?
        
             | laurencerowe wrote:
             | > What advantage would this have over UTF-7, especially
             | since the upper 128 characters wouldn't match their Unicode
             | values?
             | 
             | (I'm going to assume you mean UTF-8 here rather than UTF-7
              | since UTF-7 is not really useful for anything; it's just
              | a way to pack Unicode into only 7-bit ASCII characters.)
             | 
             | Fixed width string encodings like Latin-1 let you directly
             | index to a particular character (code point) within a
             | string without having to iterate from the beginning of the
             | string.
             | 
             | JavaScript was originally specified in terms of UCS-2 which
             | is a 16 bit fixed width encoding as this was commonly used
             | at the time in both Windows and Java. However there are
             | more than 64k characters in all the world's languages so it
             | eventually evolved to UTF-16 which allows for wide
             | characters.
             | 
             | However because of this history indexing into a JavaScript
             | string gives you the 16-bit code unit which may be only
             | part of a wide character. A string's length is defined in
             | terms of 16-bit code units but iterating over a string
             | gives you full characters.
             | 
             | Using Latin-1 as an optimisation allows JavaScript to
             | preserve the same semantics around indexing and length.
             | While it does require translating 8 bit Latin-1 character
             | codes to 16 bit code points, this can be done very quickly
             | through a lookup table. This would not be possible with
             | UTF-8 since it is not fixed width.
             | 
             | EDIT: A lookup table may not be required. I was confused by
             | new TextDecoder('latin1') actually using windows-1252.
             | 
             | More modern languages just use UTF-8 everywhere because it
             | uses less space on average and UTF-16 doesn't save you from
             | having to deal with wide characters.
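              | 
              | To be concrete about the widening point above: because
              | Latin-1 byte values coincide with the code points
              | U+0000..U+00FF, going to 16-bit code units is just a
              | zero-extension (untested sketch):
              | 
              |     #include <stddef.h>
              |     #include <stdint.h>
              | 
              |     void latin1_to_utf16(const uint8_t *in, size_t n,
              |                          uint16_t *out) {
              |       for (size_t i = 0; i < n; i++)
              |         out[i] = in[i];  /* no lookup table needed */
              |     }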
        
             | layer8 wrote:
             | Latin1 does match the Unicode values (0-255).
        
           | layer8 wrote:
           | Java nowadays does the same.
        
         | TheRealPomax wrote:
         | Only because those websites include `<meta charset="utf-8">`.
        | _Browsers_ don't use utf-8 unless you tell them to, so we tell
         | them to. But there's an entire internet archive's worth of
         | pages that don't tell them to.
        
           | ko27 wrote:
            | Not including charset="utf-8" doesn't mean that the website
            | is not UTF-8. Do you have a source on a significant
            | percentage of websites being Latin-1 while omitting the
            | charset declaration? I don't believe that's the case.
           | 
           | > Browsers don't use utf-8 unless you tell them to
           | 
            | This is wrong. You can prove this very easily by creating
            | an HTML file with UTF-8 text while omitting the charset. It
            | will render correctly.
        
             | missblit wrote:
              | Be careful, since at least Chrome may choose a different
              | charset if loading a file from disk versus from an HTTP
              | URL (yes, this has tripped me up more than once).
             | 
             | I've observed Chrome to usually default to windows-1252
             | (latin1) for UTF-8 documents loaded from the network.
        
             | bawolff wrote:
             | > This is wrong. You can prove this very easily by creating
             | a HTML file with UTF-8 text while omitting the charset. It
             | will render correctly.
             | 
             | I'm pretty sure this is incorrect.
        
               | electroly wrote:
               | The following .html file encoded in UTF-8, when loaded
               | from disk in Google Chrome (so no server headers hinting
                | anything), yields document.characterSet == "UTF-8". If
                | you replace the non-ASCII character with a plain "a",
                | it becomes "windows-1252".
               | <html>a
               | 
                | This renders correctly in Chrome and does not show
               | mojibake as you might have expected from old browsers.
               | Explicitly specifying a character set just ensures you're
               | not relying on the browser's heuristics.
        
               | bawolff wrote:
               | There may be a difference here between local and network,
               | as well as if the multi-byte utf-8 character appears in
               | the first 1024 bytes or how much network delay there is
               | before that character appears.
        
               | electroly wrote:
               | The original claim was that browsers don't ever use UTF-8
               | unless you specify it. Then ko27 provided a
               | counterexample that clearly shows that a browser _can_
               | choose UTF-8 without you specifying it. You then said
               | "I'm pretty sure this is incorrect"--which part? ko27's
               | counterexample is correct; I tried it and it renders
               | correctly as ko27 said. If you do it, the browser does
               | choose UTF-8. I'm not sure where you're going with this
               | now. This was a minimal counterexample for a narrow
               | claim.
        
             | TheRealPomax wrote:
             | Answering your "do you have a source" question, yeah: "the
             | entire history of the web prior to HTML5's release", which
             | the internet has already forgotten is a rather recent thing
             | (2008). And even then, it took a while for HTML5 to become
             | the _de facto_ format, because it took the majority of the
              | web years before they'd changed over their tooling from
             | HTML 4.01 to HTML5.
             | 
             | > This is wrong. You can prove this very easily by creating
             | a HTML file with UTF-8 text
             | 
              | No, but I will create an HTML file with _latin-1_ text,
              | because that's what we're discussing: HTML files that
              | _don't_ use UTF-8 (and so by definition don't _contain_
              | UTF-8 either).
             | 
             | While modern browsers will guess the encoding by examining
             | the content, if you make an html file that just has plain
             | text, then it won't magically convert it to UTF-8: create a
             | file with `<html><head><title>encoding
             | check</title></head><body><h1>Not much here, just plain
             | text</h1><p>More text that's not special</p></body></html>`
             | in it. Load it in your browser through an http server (e.g.
             | `python -m http.server`), and then hit up the dev tools
             | console and look at `document.characterSet`.
             | 
             | Both firefox and chrome give me "windows-1252" on Windows,
             | for which the "windows" part in the name is of course
              | irrelevant; what matters is what it's _not_, which is that
             | it's not UTF-8, because the content has nothing in it to
             | warrant UTF-8.
        
               | ko27 wrote:
                | Okay, it's good that we agree then on my original
                | premise: the vast majority of websites (by quantity and
                | popularity) on the Internet today are using UTF-8
                | encoding, and Latin-1 is being phased out.
               | 
               | Btw I appreciate your edited response, but still you were
               | factually incorrect about:
               | 
               | > Browsers don't use utf-8 unless you tell them to
               | 
               | Browsers can use UTF-8 even if we don't tell them. I am
               | already aware of the extra heuristics you wrote about.
               | 
               | > HTML file with latin-1 ... which is that it's not
               | UTF-8, because the content has nothing in it to warrant
               | UTF-8
               | 
                | You are incorrect here as well: try using some
                | non-ASCII latin-1 special character and you will see
                | that browsers default to document.characterSet UTF-8,
                | not windows-1252.
        
               | lelandbatey wrote:
                | > You are incorrect here as well: try using some
                | non-ASCII latin-1 special character and you will see
                | that browsers default to document.characterSet UTF-8,
                | not windows-1252.
               | 
               | I decided to try this experimentally. In my findings, if
               | neither the server nor the page contents indicate that a
               | file is UTF-8, then the browser NEVER defaults to setting
               | document.characterSet to UTF-8, instead basically always
               | assuming that it's "windows-1252" a.k.a. "latin1". Read
               | on for my methodology, an exact copy of my test data, and
               | some particular oddities at the end.
               | 
               | To begin, we have three '.html' files, one with ASCII
               | only characters, a second file with two separate
               | characters that are specifically latin1 encoded, and a
               | third with those same latin1 characters but encoded using
                | UTF-8. Those two characters are:
                | 
                |     Ë - "Latin Capital Letter E with Diaeresis"
                |         Latin1 encoding: 0xCB - UTF-8 encoding: 0xC3 0x8B
                |         https://www.compart.com/en/unicode/U+00CB
                |     ¥ - "Yen Sign"
                |         Latin1 encoding: 0xA5 - UTF-8 encoding: 0xC2 0xA5
                |         https://www.compart.com/en/unicode/U+00A5
               | 
               | To avoid copy-paste errors around encoding, I've dumped
               | the contents of each file as "hexdumps", which you can
               | transform back into their binary form by feeding the
               | hexdump form into the command 'xxd -r -p -'.
               | $ cat ascii.html | xxd -p         3c68746d6c3e3c686561643
               | e3c7469746c653e656e636f64696e67206368         65636b20415
               | 34349493c2f7469746c653e3c2f686561643e3c626f64793e
               | 3c68313e4e6f74206d75636820686572652c206a75737420706c61696
               | e20         746578743c2f68313e3c703e4d6f72652074657874207
               | 46861742773206e         6f74207370656369616c3c2f703e3c2f6
               | 26f64793e3c2f68746d6c3e0a         $ cat latinone.html |
               | xxd -p         3c68746d6c3e3c686561643e3c7469746c653e656e
               | 636f64696e67206368         65636b206c6174696e313c2f746974
               | 6c653e3c2f686561643e3c626f6479         3e3c68313e54686973
               | 2069732061206c6174696e31206368617261637465         722030
               | 7841353a20a53c2f68313e3c703e54686973206973206368617220
               | 307843423a20cb3c2f703e3c2f626f64793e3c2f68746d6c3e0a
               | $ cat utf8.html | xxd -p         3c68746d6c3e3c686561643e
               | 3c7469746c653e656e636f64696e67206368         65636b207574
               | 663820203c2f7469746c653e3c2f686561643e3c626f6479         
               | 3e3c68313e54686973206973206120757466382020206368617261637
               | 465         7220307841353a20c2a53c2f68313e3c703e546869732
               | 069732063686172         203078433338423a20c38b3c2f703e3c2
               | f626f64793e3c2f68746d6c3e0a
               | 
               | The full contents of my current folder is as such:
               | $ ls -a .         .  ..  ascii.html  latinone.html
               | utf8.html
               | 
               | Now that we have our test files, we can serve them via a
               | very basic HTTP server. But first, we must verify that
               | all responses from the HTTP server do not contain a
               | header implying the content type; we want the browser to
               | have to make a guess based on nothing but the contents of
               | the file. So, we run the server and check to make sure
               | it's not being well intentioned and guessing the content
               | type:                   $ curl -s -vvv
               | 'http://127.0.0.1:8000/ascii.html' 2>&1 | egrep -v -e
               | 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'         >
               | GET /ascii.html HTTP/1.1         > Accept: */*         >
               | < HTTP/1.0 200 OK         < Server: SimpleHTTP/0.6
               | Python/3.10.7         < Content-type: text/html
               | $ curl -s -vvv 'http://127.0.0.1:8000/latinone.html' 2>&1
               | | egrep -v -e
               | 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'         >
               | GET /latinone.html HTTP/1.1         > Accept: */*
               | >         < HTTP/1.0 200 OK         < Server:
               | SimpleHTTP/0.6 Python/3.10.7         < Content-type:
               | text/html              $ curl -s -vvv
               | 'http://127.0.0.1:8000/utf8.html' 2>&1 | egrep -v -e
               | 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'         >
               | GET /utf8.html HTTP/1.1         > Accept: */*         >
               | < HTTP/1.0 200 OK         < Server: SimpleHTTP/0.6
               | Python/3.10.7         < Content-type: text/html
               | 
               | Now we've verified that we won't have our observations
               | muddled by the server doing its own detection, so our
               | results from the browser should be able to tell us
               | conclusively if the presence of a latin1 character causes
               | the browser to use UTF-8 encoding. To test, I loaded each
               | web page in Firefox and Chromium and checked what
               | `document.characterSet` said.                   Firefox
               | (v116.0.3):             http://127.0.0.1:8000/ascii.html
               | result of `document.characterSet`: "windows-1252"
               | http://127.0.0.1:8000/latinone.html  result of
               | `document.characterSet`: "windows-1252"
               | http://127.0.0.1:8000/utf8.html      result of
               | `document.characterSet`: "windows-1252"
               | Chromium (v115.0.5790.170):
               | http://127.0.0.1:8000/ascii.html     result of
               | `document.characterSet`: "windows-1252"
               | http://127.0.0.1:8000/latinone.html  result of
               | `document.characterSet`: "macintosh"
               | http://127.0.0.1:8000/utf8.html      result of
               | `document.characterSet`: "windows-1252"
               | 
                | So in my testing, neither browser EVER guesses that any
                | of these pages are UTF-8; all these browsers seem to
                | mostly default to assuming that if no content-type is
                | set in the document or in the headers then the encoding
                | is "windows-1252" (bar Chromium and the Latin1
                | characters, which bizarrely caused Chromium to guess
                | that it's "macintosh" encoded?). Also note that if I add
                | the exact character you proposed to the text body, it
                | still doesn't cause the browser to start assuming
                | everything is UTF-8; the only change is that Chromium
                | starts to think the latinone.html file is also
                | "windows-1252" instead of "macintosh".
        
               | [deleted]
        
               | ko27 wrote:
               | You are missing the point of what we were discussing. If
               | page content IS UTF-8 but there is no charset or server
               | headers, then browsers will still treat it as UTF-8.
               | 
               | Here is a far simpler test:
               | 
                | Save this content as a UTF-8 .html file (the character
                | after "<head>" being any non-ASCII character):
               | 
               | <head>a
               | 
               | document.characterSet is UTF-8.
        
               | capitainenemo wrote:
               | A simpler test FWIW.. type:
               | data:text/html,<html>
               | 
               | Into your url bar and inspect that. Avoids server messing
               | with encoding values. And yes, here on my linux machine
               | in firefox it is windows-1252 too.
               | 
               | (You can type the complete document, but <html> is
               | sufficient. Browsers autocomplete a valid document. BTW,
               | data:text/html,<html contenteditable> is something I use
               | quite a lot)
               | 
               | But yeah, I think windows-1252 is standard for quirks
               | mode, for historical reasons.
        
               | layer8 wrote:
               | The historical (and present?) default is to use the local
               | character set, which on US Windows is Windows-1252, but
               | for example on Japanese Windows is Shift-JIS. The
               | expectation is that users will tend to view web pages
               | from their region.
        
               | bawolff wrote:
               | > Both firefox and chrome give me "windows-1252" on
               | Windows, for which the "windows" part in the name is of
               | course irrelevant; what matters is what it's not, which
               | is that it's not UTF-8, because the content has nothing
               | in it to warrant UTF-8.
               | 
               | While technically latin-1/iso-8859-1 is a different
               | encoding than windows-1252, html5 spec says browsers are
               | supposed to treat latin1 as windows-1252.
        
               | electroly wrote:
               | Chromium (and I'm sure other browsers, but I didn't test)
               | will sniff character set heuristically regardless of the
               | HTML version or quirks mode. It's happy to choose UTF-8
               | if it sees something UTF-8-like in there. I don't know
               | how to square this with your earlier claim of "Browsers
               | don't use utf-8 unless you tell them to."
               | 
               | That is, the following UTF-8 encoded .html files all
               | produce document.characterSet == "UTF-8" and render as
               | expected without mojibake, despite not saying anything
                | about UTF-8. Change the non-ASCII character to a plain
                | "a" to get windows-1252 again.
               | <html>a              <!DOCTYPE html><html>a
               | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
               | "http://www.w3.org/TR/html4/strict.dtd"><html>a
               | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html>a
        
         | fulafel wrote:
          | It's the default HTTP character set. It's not clear whether
          | the above stat page is about which charsets are explicitly
          | specified.
          | 
          | Also headers (mostly relevant for header values) are, I
          | think, ISO-8859-1.
        
           | rhdunn wrote:
            | Be aware of the WHATWG Encoding specification [1], which
            | says that latin1, ISO-8859-1, etc. are aliases of the
            | windows-1252 encoding, _not_ the proper latin1 encoding. As
            | a result, browsers and operating systems will display those
            | files differently! It also aliases the ASCII encoding to
            | windows-1252.
           | 
           | [1] https://encoding.spec.whatwg.org/#names-and-labels
        
           | ko27 wrote:
            | Since HTML5, UTF-8 is the default charset. And headers are
            | parsed as ASCII-encoded in almost all cases, although
            | ISO-8859-1 is supported.
        
             | fulafel wrote:
             | I tried to find confirmation of this but found only: https:
             | //html.spec.whatwg.org/multipage/semantics.html#charse...
             | 
             | > The Encoding standard requires use of the UTF-8 character
             | encoding and requires use of the "utf-8" encoding label to
             | identify it. Those
             | 
             | Sounds to me like it tells you that you have to explicitly
             | declare the charset as UTF-8, so you don't get the HTTP
             | default of Latin-1.
             | 
              | (But that's just one "living standard", not exactly
              | synonymous with HTML5, and it might change, or might have
              | been different last week.)
        
               | ko27 wrote:
               | > so you don't get the HTTP default of Latin-1.
               | 
               | That's not what your linked spec says. You can try it
               | yourself, in any browser. If you omit the encoding the
               | browser uses heuristics to guess, but it will always work
               | if you write UTF-8 even without meta charset or encoding
               | header.
        
               | fulafel wrote:
                | I don't doubt browsers use heuristics. But spec-wise I
                | think it's your turn to provide a reference in favour
                | of a utf-8-is-default interpretation :)
        
               | ko27 wrote:
                | No it isn't. My original point is that Latin-1 is used
                | very rarely on the Internet and is being phased out.
                | Now it's your turn to provide some references that a
                | significant percentage of websites are omitting the
                | encoding (which is required by spec!) and using
                | Latin-1.
               | 
               | But if you insist, here is this quote:
               | 
               | https://www.w3docs.com/learn-html/html-character-
               | sets.html
               | 
               | > UTF-8 is the default character encoding for HTML5.
               | However, it was used to be different. ASCII was the
               | character set before it. And the ISO-8859-1 was the
               | default character set from HTML 2.0 till HTML 4.01.
               | 
               | or another:
               | 
               | https://www.dofactory.com/html/charset
               | 
               | > If a web page starts with <!DOCTYPE html> (which
               | indicates HTML5), then the above meta tag is optional,
               | because the default for HTML5 is UTF-8.
        
               | bawolff wrote:
               | > My original point is that Latin-1 is used very rarely
               | on Internet and is being phased out.
               | 
               | Nobody disagrees with this, but this is a very different
               | statement from what you said originally in regards to
               | what the default is. Things can be phased out but still
               | have the old default with no plan to change the default.
               | 
               | Re other sources - how about citing the actual spec
               | instead of sketchy websites that seem likely to have
               | incorrect information.
        
               | rhdunn wrote:
               | The WHATWG HTML spec [1] has various heuristics it
               | uses/specifies for detecting the character encoding.
               | 
               | In point 8, it says an implementation _may_ use
               | heuristics to detect the encoding. It has a note which
               | states:
               | 
               | > The UTF-8 encoding has a highly detectable bit pattern.
               | Files from the local file system that contain bytes with
               | values greater than 0x7F which match the UTF-8 pattern
               | are very likely to be UTF-8, while documents with byte
               | sequences that do not match it are very likely not. When
               | a user agent can examine the whole file, rather than just
               | the preamble, detecting for UTF-8 specifically can be
               | especially effective.
               | 
               | In point 9, the implementation can return an
               | implementation or user-defined encoding. Here, it
               | suggests a locale-based default encoding, including
               | windows-1252 for "en".
               | 
               | As such, implementations _may_ be capable of detecting
               | /defaulting to UTF-8, but are equally likely to default
               | to windows-1252, Shift_JIS, or other encoding.
               | 
               | [1] https://html.spec.whatwg.org/#determining-the-
               | character-enco...
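                | 
                | For what it's worth, the structural part of that
                | check is small. A simplified, untested sketch (lead
                | and continuation bytes only; a real detector would
                | also reject overlong forms, surrogates and values
                | above U+10FFFF):
                | 
                |     #include <stddef.h>
                | 
                |     int looks_like_utf8(const unsigned char *p,
                |                         size_t n) {
                |       size_t i = 0;
                |       while (i < n) {
                |         unsigned char c = p[i];
                |         size_t extra;
                |         if (c < 0x80) extra = 0;
                |         else if ((c & 0xE0) == 0xC0) extra = 1;
                |         else if ((c & 0xF0) == 0xE0) extra = 2;
                |         else if ((c & 0xF8) == 0xF0) extra = 3;
                |         else return 0;     /* invalid lead byte */
                |         if (i + extra >= n) return 0; /* truncated */
                |         for (size_t k = 1; k <= extra; k++)
                |           if ((p[i + k] & 0xC0) != 0x80) return 0;
                |         i += extra + 1;
                |       }
                |       return 1;
                |     }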
        
       | masfuerte wrote:
       | Is this useful? Most Latin 1 text is really Windows 1252, which
       | has additional characters that don't have the same regular
        | mapping to Unicode. So this conversion will mangle curly quotes
       | and the Euro sign, among others.
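        | 
        | For reference, a few of the 0x80-0x9F bytes and the code
        | points Windows-1252 actually means by them (a strict Latin-1
        | conversion maps these bytes to the C1 control characters
        | U+0080-U+009F instead):
        | 
        |     0x80 -> U+20AC euro sign
        |     0x85 -> U+2026 horizontal ellipsis
        |     0x91 -> U+2018 left single quotation mark
        |     0x92 -> U+2019 right single quotation mark
        |     0x93 -> U+201C left double quotation mark
        |     0x94 -> U+201D right double quotation mark
        |     0x96 -> U+2013 en dash
        |     0x97 -> U+2014 em dash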
        
         | dotancohen wrote:
         | > Most Latin 1 text is really Windows 1252
         | 
         | I'd say that the vast majority of Latin-1 that I've encountered
         | is just ASCII. Where have you seen Windows-1252 presented with
         | a Latin-1 header or other encoding declaration that declared it
         | as Latin-1?
        
       ___________________________________________________________________
       (page generated 2023-08-21 23:01 UTC)