[HN Gopher] Transcoding Latin 1 strings to UTF-8 strings at 12 G...
___________________________________________________________________
Transcoding Latin 1 strings to UTF-8 strings at 12 GB/s using
AVX-512
Author : mariuz
Score : 109 points
Date : 2023-08-20 10:52 UTC (1 day ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| justin101 wrote:
| Where does one even go about finding 12 GB of pure latin text?
| Rebelgecko wrote:
| I had the same question, wondering what sort of workflow would
| have this task in the critical path. Maybe if the Library of
| Congress needs to change their default text encoding it'll save
| a minute or two?
|
| The benchmark result is cool, but I'm curious how well it works
| with smaller outputs. When I've played around with SIMD stuff
| in the past, you can't necessarily go off of metrics like "bytes
| generated per cycle", because of how much CPU frequency can vary
| when using SIMD instructions, context switching costs, and
| different thermal properties (e.g. maybe the work per cycle is
| higher with SIMD, but the CPU generates heat much more quickly
| and downclocks itself).
| [deleted]
| martijnvds wrote:
| The Vatican?
| ant6n wrote:
| The latin in latin-1 refers to the alphabet, not the
| language. In fact latin-1 can encode many Western European
| languages.
| CoastalCoder wrote:
| I believe it was a joke.
|
| But the humour may have been lost in translation. It's
| funnier in the original ASCII.
| mmastrac wrote:
| The high bit is generally used to indicate humour.
| lovasoa wrote:
| Not sure whether that was sarcastic, but ISO-8859-1 (Latin 1)
| encodes most European languages, not just Latin.
|
| https://en.wikipedia.org/wiki/ISO/IEC_8859-1
| ko27 wrote:
| But where do you find it? Almost the entirety of the internet is
| UTF-8. You can always transcode to Latin 1 for testing
| purposes, but that raises the question of practical benefits
| of this algorithm.
| tgv wrote:
| Older corpora are probably still in Latin-1 or some
| variant. That could include decades of newspaper
| publications.
| [deleted]
| the8472 wrote:
| It's not necessarily about sustained throughput spent only in
| this routine. It can be small bursts of processing text
| segments that are then handed off to other parts of the
| program.
|
| Once a program is optimized to the point where no leaf method /
| hot loop takes up more than a few percent of runtime, and
| algorithmic improvements aren't available or are extremely hard
| to implement, the speed of all the basic routines (memcpy,
| allocations, string processing, data structures) starts to
| matter. The constant factors elided by Big-O notation start to
| matter.
| [deleted]
| londons_explore wrote:
| Since values 0-127 are used _far_ more frequently than 128-255 in
| latin-1, it might make more sense to have a fast path which
| loads 512 bits at a time (i.e. 64 bytes), detects if any are 0x80
| or above, and if not just outputs them verbatim.
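|
| A minimal sketch of that fast path (not the article's code;
| assumes AVX-512BW is available and the name is made up):
|
|     #include <immintrin.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     // Copy 64-byte blocks verbatim while they are pure ASCII;
|     // stop at the first block containing a byte >= 0x80 so a
|     // general Latin-1 -> UTF-8 transcoder can take over.
|     size_t ascii_fast_path(const uint8_t *src, size_t len,
|                            uint8_t *dst) {
|         size_t i = 0;
|         for (; i + 64 <= len; i += 64) {
|             __m512i block = _mm512_loadu_si512(src + i);
|             // Mask of bytes with the high bit set (>= 0x80).
|             __mmask64 non_ascii = _mm512_movepi8_mask(block);
|             if (non_ascii != 0)
|                 break;
|             _mm512_storeu_si512(dst + i, block);
|         }
|         return i;  // bytes consumed == bytes written so far
|     }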
| NelsonMinar wrote:
| The article has a whole section about that; you might enjoy
| reading it. He reports a ~20% speedup on his test data.
| twoodfin wrote:
| I don't know if the article has been updated since your
| comment, but this approach is discussed & benchmarked. For the
| benchmarked data set it's a winner.
| wffurr wrote:
| The article was indeed updated since I read it and the parent
| comment this morning.
| jojobas wrote:
| Either way, throughput will depend on the fraction of >192
| characters; what input data gave 12 GB/s seems to be a mystery.
| reaperhulk wrote:
| The article states it's the French version of the Mars
| wikipedia entry and the repository has a link to the file he
| used in the readme:
| https://raw.githubusercontent.com/lemire/unicode_lipsum/main...
| [deleted]
| redox99 wrote:
| 12GB/s seems a bit slow. I'd expect the only bottleneck to be
| memory bandwidth.
|
| A dual channel DDR4 system memory bandwidth is ~40GB/s, and DDR5
| ~80GB/s.
|
| Since this operation requires both a read and a write, you'd
| expect half that.
| peppermint_gum wrote:
| > A dual channel DDR4 system memory bandwidth is ~40GB/s, and
| DDR5 ~80GB/s.
|
| It's impossible to saturate the memory bandwidth on a modern
| CPU with a single thread, even if all you do is reads with
| absolutely no processing. The bottleneck is how fast
| outstanding cache misses can be satisfied.
|
| The article even links to a benchmark that attempts to measure
| what it calls "sustainable memory bandwidth":
| https://www.cs.virginia.edu/stream/ref.html
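|
| A crude way to see this for yourself (a sketch, not the STREAM
| benchmark itself): sum a buffer much larger than the last-level
| cache from a single thread and divide bytes read by elapsed
| time; the result typically lands well below the theoretical
| channel bandwidth.
|
|     #include <chrono>
|     #include <cstdint>
|     #include <cstdio>
|     #include <vector>
|
|     int main() {
|         const size_t n = size_t(1) << 27;   // 2^27 x 8 B = 1 GiB
|         std::vector<uint64_t> buf(n, 1);
|         auto t0 = std::chrono::steady_clock::now();
|         uint64_t sum = 0;
|         for (size_t i = 0; i < n; i++)      // reads only, no writes
|             sum += buf[i];
|         auto t1 = std::chrono::steady_clock::now();
|         double s = std::chrono::duration<double>(t1 - t0).count();
|         std::printf("%.1f GB/s (checksum %llu)\n",
|                     n * sizeof(uint64_t) / s / 1e9,
|                     (unsigned long long)sum);
|     }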
| jojobas wrote:
| Interesting to see how a non-AVX, branchless version would do;
| you'd need a prefilled array for the extra pointer advance (0/1)
| and seemingly two more for the bit-banging.
| [deleted]
| xiphias2 wrote:
| Another option would be a vector of 256 16-bit entries, while
| keeping the pointer-advance vector as you suggested.
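|
| Something along these lines (a rough sketch of that table-driven,
| branchless idea; table names and layout are my own, and the
| output buffer needs one spare byte at the end):
|
|     #include <stddef.h>
|     #include <stdint.h>
|     #include <string.h>
|
|     // utf8_tbl[b] packs the UTF-8 encoding of code point b into
|     // 16 bits (low byte first, assuming the little-endian store
|     // below); len_tbl[b] is the number of output bytes (1 or 2).
|     static uint16_t utf8_tbl[256];
|     static uint8_t  len_tbl[256];
|
|     static void init_tables(void) {
|         for (int b = 0; b < 256; b++) {
|             if (b < 0x80) {
|                 utf8_tbl[b] = (uint16_t)b;
|                 len_tbl[b]  = 1;
|             } else {
|                 uint8_t lead = 0xC0 | (b >> 6);    // 110000xx
|                 uint8_t cont = 0x80 | (b & 0x3F);  // 10xxxxxx
|                 utf8_tbl[b] = (uint16_t)(lead | (cont << 8));
|                 len_tbl[b]  = 2;
|             }
|         }
|     }
|
|     static size_t latin1_to_utf8(const uint8_t *src, size_t n,
|                                  uint8_t *dst) {
|         uint8_t *out = dst;
|         for (size_t i = 0; i < n; i++) {
|             uint16_t enc = utf8_tbl[src[i]];
|             memcpy(out, &enc, 2);      // always store two bytes...
|             out += len_tbl[src[i]];    // ...advance by 1 or 2
|         }
|         return (size_t)(out - dst);
|     }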
| londons_explore wrote:
| Every time someone writes some really carefully micro-optimized
| piece of code like this, I worry that the implementation won't be
| shared with the whole world.
|
| This code only makes people's lives better if many languages and
| frameworks that translate Latin-1 to UTF-8 are updated to have
| this new faster implementation.
|
| If this took 3 days to write and benchmark, then to save 3 days
| of human time, we probably need to get this into the hands of
| hundreds of millions of people, saving each person a few hundred
| microseconds.
| re-thc wrote:
| > I worry that the implementation won't be shared with the
| whole world.
|
| Considering the author also created
| https://github.com/simdutf/simdutf, it's likely used, or will be
| used, in Node.js amongst other things. Is that good enough?
| magicalhippo wrote:
| > This code only makes people's lives better if many languages
| and frameworks that translate Latin-1 to UTF-8 are updated to
| have this new faster implementation.
|
| Except CPUs evolve, and what was once a fast way of doing things
| may no longer be very fast. And with ASM you've got no compiler
| to generate better-targeted instructions.
|
| I've seen many instances where significant performance was
| gained by swapping out an old hand-written ASM routine for a
| plain-language version.
|
| If you ever add some optimized ASM to your code, do a
| performance check at startup or similar, and have the plain
| language version as a fallback.
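|
| A minimal sketch of that kind of runtime check (assuming GCC or
| Clang on x86; convert_plain and convert_avx512 are hypothetical
| names standing in for the two implementations):
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     // Hypothetical implementations provided elsewhere.
|     size_t convert_plain(const uint8_t *src, size_t n, uint8_t *dst);
|     size_t convert_avx512(const uint8_t *src, size_t n, uint8_t *dst);
|
|     size_t convert(const uint8_t *src, size_t n, uint8_t *dst) {
|         // Evaluated once; falls back to the plain version on
|         // CPUs without the needed AVX-512 subsets.
|         static const bool has_avx512 =
|             __builtin_cpu_supports("avx512bw") &&
|             __builtin_cpu_supports("avx512vbmi2");
|         return has_avx512 ? convert_avx512(src, n, dst)
|                           : convert_plain(src, n, dst);
|     }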
| TinkersW wrote:
| It is written with intrinsics not ASM.
|
| Compilers understand intrinsics and can optimize around them,
| and CPUs evolve improved SIMD instruction sets at a snail's
| pace.
|
| Intel doesn't even really support AVX512 yet for consumer
| hardware, and maybe never will, so this code is mostly only
| good for very modern AMD.
| bruce343434 wrote:
| What do you mean "optimize around them"? Do you have a
| godbolt/codegen example of suboptimal intrinsic calls being
| optimized?
| magicalhippo wrote:
| I'm talking about which instructions and idioms are
| optimal. AFAIK, with intrinsics the compiler won't
| completely change what you've written.
|
| Back in the day, REP MOVSB was the fastest way to copy
| bytes; then the Pentium came and rolling your own loop was
| better. Then CPUs improved and REP MOVSB was suddenly
| better again[1], for those CPUs. And then it changed
| again...
|
| Similar story for other idioms where implementation details
| on CPUs change. Compilers can respond and target your exact
| CPU.
|
| [1]: https://github.com/golang/go/issues/14630 (notice how
| one commenter reports that the same patch that gives a 1.6x
| boost for the OP gives them a 5x degradation)
| maxerickson wrote:
| Are you also worried about my hobby vegetable garden being a
| waste of time?
|
| I'm sure I could get my tomato fix at the farmers market.
| whoknowswhat11 wrote:
| Is AVX-512 broadly available and error-free, with no stalls,
| slowdowns, or other side effects? For a long time it felt like a
| corner-case Intel thing.
| jacoblambda wrote:
| In terms of being broadly available, most of AVX-512 (ER, PF,
| 4FMAPS, and 4VNNIW haven't been available on any new hardware
| since 2017) is available on basically any Intel cpu
| manufactured since 2020 as well as on all AMD Zen4 (2022 and
| on) cpus.
|
| https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
|
| I can't speak to being error free or other issues but it
| should at the very least be present on any modern desktop,
| laptop, or server x86 CPU you could buy today.
|
| Edit: I forgot to mention but Intel's Alder lake CPUs only
| have partial support presumably due to some issue with E
| cores. I'd guess Intel will get their shit together
| eventually wrt this now that AMD is shipping all their
| hardware with this instruction set.
| unnah wrote:
| Intel seems to be going for market segmentation, with
| AVX-512 only available on their server CPUs. The option to
| enable AVX-512 has been removed from Alder Lake CPUs since
| 2022, and there is no AVX-512 on Raptor Lake.
|
| AMD also keeps making and selling Zen 3 and Zen 2 chips as
| lower-cost products, and those do not have AVX-512.
| the8472 wrote:
| With AVX10, Intel will make the instructions available
| again on all segments. SIMD register width will vary
| between cores, but the instructions will be there.
| wtallis wrote:
| I don't think it was intentional market segmentation,
| just poor planning: the whole heterogeneous cores strategy
| seems to have been thrown together in a hurry, and they
| didn't have time to add AVX-512 to their Atom cores in an
| area-efficient way (so as not to negate the point of
| having E-cores).
| nullifidian wrote:
| >most of AVX-512 is available on basically any Intel cpu
| manufactured since 2020
|
| That's incorrect. On the consumer CPU side, Intel introduced
| AVX-512 for one generation in 2021 (Rocket Lake), but then
| removed AVX-512 from the subsequent Alder Lake using BIOS
| updates, and fused it off in later revisions. It's also
| absent from the current Raptor Lake. So actually it's only
| available on Intel's server-grade CPUs.
|
| >Edit: I forgot to mention but Intel's Alder lake CPUs only
| have partial support presumably due to some issue with E
| cores.
|
| No, this wiki page is outdated.
| papercrane wrote:
| The latest Intel architecture (Sapphire Rapids) supports it
| without downclocking. AMD Zen 4 also supports it, although
| their implementation is double-pumped; not sure what the real-
| world performance impact of that is.
| adrian_b wrote:
| There is a huge confusion about this "double pumped" thing.
|
| All that this means is that Zen 4 uses the same execution
| units both for 256-bit operations and for 512-bit
| operations. This means that the throughput in instructions
| per cycle for 512-bit operations is half of that for
| 256-bit operations, but the throughput in bytes per cycle
| is the same.
|
| However the 512-bit operations need fewer resources for
| instruction fetching and decoding and for micro-operation
| storing and dispatching, so in most cases using 512-bit
| instructions on Zen 4 provides a big speed-up.
|
| Even if Zen 4 is "double pumped", its 256-bit throughput is
| higher than that of Sapphire Rapids, so after dividing by
| two, for most instructions it has exactly the same 512-bit
| throughput as Sapphire Rapids, i.e. two 512-bit register-
| register instructions per cycle.
|
| The only exceptions are that Sapphire Rapids (with the
| exception of the cheap SKUs) can do 2 FMA instructions per
| cycle, while Zen 4 can do only 1 FMA + 1 FADD instructions
| per cycle, and that Sapphire Rapids has a double throughput
| for loads and stores from the L1 cache memory. There are
| also a few 512-bit instructions where Zen 4 has better
| throughput or latency than Sapphire Rapids, e.g. some of
| the shuffles.
| eesmith wrote:
| You should also worry about how other peoples' time is wasted
| when you miss important details then comment about easily
| assuaged worries.
|
| Quoting the article "I use GCC 11 on an Ice Lake server. My
| source code is available.", linking to
| https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
| .
|
| From the README at the top-level:
|
| > Unless otherwise stated, I make no copyright claim on this
| code: you may consider it to be in the public domain.
|
| > Don't bother forking this code: just steal it.
| stkdump wrote:
| It's unlikely that this makes anyone's life better. It is more
| a curiosity and maybe a teachable thing on how to do SIMD. I
| would venture the guess that there are very few workloads that
| require this conversion for more than a few KB, and over time
| as the world migrates to Unicode it will be less and less.
| slashdev wrote:
| The author is a French Canadian academic at Université du
| Québec à Montréal. He is one of the more famous figures in
| computer science in all of Canada, with over 5000 citations
| (which is stretching the meaning of famous, but still.) This is
| not closed source work optimizing for some company product,
| this is research for publication on his blog or in computer
| science journals.
| benreesman wrote:
| He's one of the most famous computer scientists in general!
|
| The audience for wicked-clever, low/no branch, cache aware,
| SIMD sorcery is admittedly not everyone, but if you end up
| with that kind of problem, this is a go to!
| SomeoneFromCA wrote:
| It is mostly educational code. Once you learn AVX-512 you
| can get boosts in many areas.
| SomeoneFromCA wrote:
| Another proof that Linus is not always right. There were many
| folks who just blindly regurgitated that AVX-512 is evil, without
| actually knowing a thing about it.
| wtallis wrote:
| > Another proof that Linus is not always right.
|
| No, this is just a case of the right answer changing over time,
| as _good_ AVX-512 implementations became available, long after
| its introduction. And nothing in this article even comes close
| to addressing the main concern with the early AVX-512
| implementation: the significant performance penalty it imposes
| on other code due to Skylake's slow power state transitions.
| Microbenchmarks like this made AVX-512 look good even on
| Skylake, because they ignore the broader effects.
| nwallin wrote:
| To add to your point, this benchmark would not have run on
| Skylake at all. It uses the _mm512_maskz_compress_epi8
| intrinsic (VPCOMPRESSB), which wasn't introduced until Ice Lake.
| SomeoneFromCA wrote:
| So your point is that he is indeed always right? Or that he was
| right in that particular case (he was not)?
|
| If you remember, Linus complained not about the particular
| implementation of AVX-512, but about the concept itself. It also
| looks kind of ignorant of him (and anyone else who thinks the
| same way) to believe that AVX-512 is only about 512-bit width or
| that it has no potential, when it is simply a better SIMD ISA
| compared to AVX1/2. He just expressed himself in his trademark
| silly, edgy, maximalist way. It is an absolute pleasure to work
| with, gives a great performance boost, and he should have been
| more careful with his statements.
| a1369209993 wrote:
| > So your point he is indeed always right?
|
| No, their point is that this does not _refute_ said
| hypothetical claim. That is, their point is that it is not,
| as you claimed:
|
| > Another proof that Linus is not always right.
|
| (I don't know if their point is _correct_, but it's of the
| form "your argument against X is invalid", not "X is
| correct".)
| camel-cdr wrote:
| It kinda depends. I wouldn't be surprised if properly
| optimized AVX2 could get the same performance, since it looks
| like the operation is memory-bottlenecked.
| SomeoneFromCA wrote:
| Nah, AVX-512 is a more performant design due to its support for
| masking. It does not in fact depend on anything. Those who
| compare AVX2 favorably or equally with AVX-512 have never used
| either of them.
| ko27 wrote:
| > Latin 1 standard is still in widespread use inside some systems
| (such as browsers)
|
| That doesn't seem to be correct. UTF-8 is used by 98% of all the
| websites. I am not sure if it's even worth the trouble for
| libraries to implement this algorithm, since Latin-1 encoding is
| being phased out.
|
| https://w3techs.com/technologies/details/en-utf8
| pzmarzly wrote:
| And yet HTTP/1.1 headers should be sent in Latin1 (is this
| fixed in HTTP/2 or HTTP/3?). And WebKit's JavaScriptCore has
| special handling for Latin1 strings in JS, for performance
| reasons I assume.
| ko27 wrote:
| > should be sent in Latin1
|
| Do you have a source on that "should" part? Because the spec
| disagrees:
| https://www.rfc-editor.org/rfc/rfc7230#section-3.2.4
|
| > Historically, HTTP has allowed field content with text in
| the ISO-8859-1 charset [ISO-8859-1], supporting other
| charsets only through use of [RFC2047] encoding. In practice,
| most HTTP header field values use only a subset of the US-
| ASCII charset [USASCII]. Newly defined header fields SHOULD
| limit their field values to US-ASCII octets.
|
| In practice and by spec, HTTP headers should be ASCII
| encoded.
| missblit wrote:
| The spec may disagree, but webservers do sometimes send
| bytes outside the ASCII range, and the most sensible way to
| deal with that on the receiving side is still by treating
| them as latin1 to match (last I checked) what browsers do
| with it.
|
| I do agree that latin1 headers shouldn't be _sent_ out
| though.
| nicktelford wrote:
| ISO-8859-1 (aka. Latin-1) is a superset of ASCII, so all
| ASCII strings are also valid Latin-1 strings.
|
| The section you quoted actually suggests that
| implementations should support ISO-8859-1 to ensure
| compatibility with systems that use it.
| ko27 wrote:
| You should read it again:
|
| > Newly defined header fields SHOULD limit their field
| values to US-ASCII octets
|
| ASCII octets! That means you SHOULD NOT send Latin-1
| encoded headers; the opposite of what pzmarzly was
| saying. I don't disagree about Latin-1 being a superset of
| ASCII or about having backward compatibility in mind, but
| that's not relevant to my response.
| layer8 wrote:
| SHOULD is a recommendation, not a requirement, and it
| refers only to newly-defined header fields, not existing
| ones. The text implies that 8-bit characters in existing
| fields are to be interpreted as ISO-8859-1.
| jart wrote:
| Haven't you heard of Postel's Maxim?
|
| Web servers need to be able to receive and decode latin1
| into utf-8 regardless of what the RFC recommends people
| send. The fact that it's going to become rarer over time
| to have the 8th bit set in headers means you can write a
| simpler algorithm than what Lemire did that assumes an
| ASCII average case:
| https://github.com/jart/cosmopolitan/blob/755ae64e73ef5ef7d1...
| That goes 23 GB/s on my machine using just SSE2 (rather
| than AVX-512). However it goes much slower if the text is
| full of European diacritics. Lemire's algorithm is better
| at decoding those.
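|
| The general shape of that ASCII-average-case approach is roughly
| the following (a sketch only, not the cosmopolitan code):
|
|     #include <emmintrin.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     size_t latin1_to_utf8_sse2(const uint8_t *src, size_t n,
|                                uint8_t *dst) {
|         uint8_t *out = dst;
|         size_t i = 0;
|         // Fast path: copy 16 ASCII bytes per iteration.
|         while (i + 16 <= n) {
|             __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
|             if (_mm_movemask_epi8(v) != 0)
|                 break;             // a high bit appeared; go scalar
|             _mm_storeu_si128((__m128i *)out, v);
|             out += 16;
|             i += 16;
|         }
|         // Scalar path (also handles the tail); kept simple: once
|         // a non-ASCII byte shows up we stay scalar to the end.
|         for (; i < n; i++) {
|             uint8_t b = src[i];
|             if (b < 0x80) {
|                 *out++ = b;
|             } else {
|                 *out++ = 0xC0 | (b >> 6);
|                 *out++ = 0x80 | (b & 0x3F);
|             }
|         }
|         return (size_t)(out - dst);
|     }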
| HideousKojima wrote:
| >Haven't you heard of Postel's Maxim?
|
| Otherwise known as "Making other people's incompetence
| and inability to implement a specification _your_
| problem." Just because it's a widely quoted maxim
| doesn't make it good advice.
| kannanvijayan wrote:
| One place I know where latin1 is still used is as an internal
| optimization in javascript engines. JS strings are composed of
| 16-bit values, but the vast majority of strings are ascii. So
| there's a motivation to store simpler strings using 1 byte per
| char.
|
| However, once that optimization has been decided, there's no
| point in leaving the high bit unused, so the engines keep
| optimized "1-byte char" strings as Latin1.
| HideousKojima wrote:
| >So there's a motivation to store simpler strings using 1
| byte per char.
|
| What advantage would this have over UTF-7, especially since
| the upper 128 characters wouldn't match their Unicode values?
| laurencerowe wrote:
| > What advantage would this have over UTF-7, especially
| since the upper 128 characters wouldn't match their Unicode
| values?
|
| (I'm going to assume you mean UTF-8 here rather than UTF-7,
| since UTF-7 is not really useful for anything; it's just a
| way to pack Unicode into only 7-bit ASCII characters.)
|
| Fixed width string encodings like Latin-1 let you directly
| index to a particular character (code point) within a
| string without having to iterate from the beginning of the
| string.
|
| JavaScript was originally specified in terms of UCS-2 which
| is a 16 bit fixed width encoding as this was commonly used
| at the time in both Windows and Java. However there are
| more than 64k characters in all the world's languages so it
| eventually evolved to UTF-16 which allows for wide
| characters.
|
| However because of this history indexing into a JavaScript
| string gives you the 16-bit code unit which may be only
| part of a wide character. A string's length is defined in
| terms of 16-bit code units but iterating over a string
| gives you full characters.
|
| Using Latin-1 as an optimisation allows JavaScript to
| preserve the same semantics around indexing and length.
| While it does require translating 8 bit Latin-1 character
| codes to 16 bit code points, this can be done very quickly
| through a lookup table. This would not be possible with
| UTF-8 since it is not fixed width.
|
| EDIT: A lookup table may not be required. I was confused by
| new TextDecoder('latin1') actually using windows-1252.
|
| More modern languages just use UTF-8 everywhere because it
| uses less space on average and UTF-16 doesn't save you from
| having to deal with wide characters.
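|
| Concretely, the indexing difference looks like this (a sketch,
| nothing engine-specific; only the 1- and 2-byte UTF-8 cases are
| decoded):
|
|     #include <cstdint>
|     #include <string>
|
|     // Latin-1: one byte per code point, and the byte value IS
|     // the code point, so indexing is a single array access.
|     char32_t nth_code_point_latin1(const std::string &s, size_t i) {
|         return (unsigned char)s[i];                    // O(1)
|     }
|
|     // UTF-8: skip i variable-width code points first, then decode.
|     char32_t nth_code_point_utf8(const std::string &s, size_t i) {
|         size_t pos = 0;
|         for (size_t seen = 0; seen < i; seen++) {      // O(i)
|             unsigned char b = s[pos];
|             pos += (b < 0x80) ? 1 : (b < 0xE0) ? 2
|                  : (b < 0xF0) ? 3 : 4;
|         }
|         unsigned char b = s[pos];
|         if (b < 0x80) return b;
|         if (b < 0xE0)
|             return ((b & 0x1F) << 6) |
|                    ((unsigned char)s[pos + 1] & 0x3F);
|         return 0xFFFD;   // 3- and 4-byte sequences omitted here
|     }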
| layer8 wrote:
| Latin1 does match the Unicode values (0-255).
| layer8 wrote:
| Java nowadays does the same.
| TheRealPomax wrote:
| Only because those websites include `<meta charset="utf-8">`.
| _Browsers_ don't use utf-8 unless you tell them to, so we tell
| them to. But there's an entire internet archive's worth of
| pages that don't tell them to.
| ko27 wrote:
| Not including charset="utf-8" doesn't mean that the website
| is not UTF-8. Do you have a source on a significant
| percentage of websites being Latin-1 while omitting the charset
| encoding? I don't believe that's the case.
|
| > Browsers don't use utf-8 unless you tell them to
|
| This is wrong. You can prove this very easily by creating a
| HTML file with UTF-8 text while omitting the charset. It will
| render correctly.
| missblit wrote:
| Be careful, since at least Chrome may choose a different
| charset if loading a file from disk versus from a HTTP URL
| (yes this has tripped me up more than once).
|
| I've observed Chrome to usually default to windows-1252
| (latin1) for UTF-8 documents loaded from the network.
| bawolff wrote:
| > This is wrong. You can prove this very easily by creating
| a HTML file with UTF-8 text while omitting the charset. It
| will render correctly.
|
| I'm pretty sure this is incorrect.
| electroly wrote:
| The following .html file encoded in UTF-8, when loaded
| from disk in Google Chrome (so no server headers hinting
| anything), yields document.characterSet == "UTF-8". If
| you make it "a" instead of "a" it becomes "windows-1252".
| <html>a
|
| This renders correctly in Chrome and does not show
| mojibake as you might have expected from old browsers.
| Explicitly specifying a character set just ensures you're
| not relying on the browser's heuristics.
| bawolff wrote:
| There may be a difference here between local and network,
| as well as if the multi-byte utf-8 character appears in
| the first 1024 bytes or how much network delay there is
| before that character appears.
| electroly wrote:
| The original claim was that browsers don't ever use UTF-8
| unless you specify it. Then ko27 provided a
| counterexample that clearly shows that a browser _can_
| choose UTF-8 without you specifying it. You then said
| "I'm pretty sure this is incorrect"--which part? ko27's
| counterexample is correct; I tried it and it renders
| correctly as ko27 said. If you do it, the browser does
| choose UTF-8. I'm not sure where you're going with this
| now. This was a minimal counterexample for a narrow
| claim.
| TheRealPomax wrote:
| Answering your "do you have a source" question, yeah: "the
| entire history of the web prior to HTML5's release", which
| the internet has already forgotten is a rather recent thing
| (2008). And even then, it took a while for HTML5 to become
| the _de facto_ format, because it took the majority of the
| web years before they'd changed over their tooling from
| HTML 4.01 to HTML5.
|
| > This is wrong. You can prove this very easily by creating
| a HTML file with UTF-8 text
|
| No, but I will create an HTML file with _latin-1_ text,
| because that's what we're discussing: HTML files that
| _don't_ use UTF-8 (and so by definition don't _contain_
| UTF-8 either).
|
| While modern browsers will guess the encoding by examining
| the content, if you make an html file that just has plain
| text, then it won't magically convert it to UTF-8: create a
| file with `<html><head><title>encoding
| check</title></head><body><h1>Not much here, just plain
| text</h1><p>More text that's not special</p></body></html>`
| in it. Load it in your browser through an http server (e.g.
| `python -m http.server`), and then hit up the dev tools
| console and look at `document.characterSet`.
|
| Both Firefox and Chrome give me "windows-1252" on Windows,
| for which the "windows" part in the name is of course
| irrelevant; what matters is what it's _not_, which is that
| it's not UTF-8, because the content has nothing in it to
| warrant UTF-8.
| ko27 wrote:
| Okay, it's good that we agree then on my original
| premise: the vast majority of websites (by quantity and
| popularity) on the Internet today are using UTF-8
| encoding, and Latin-1 is being phased out.
|
| Btw I appreciate your edited response, but still you were
| factually incorrect about:
|
| > Browsers don't use utf-8 unless you tell them to
|
| Browsers can use UTF-8 even if we don't tell them. I am
| already aware of the extra heuristics you wrote about.
|
| > HTML file with latin-1 ... which is that it's not
| UTF-8, because the content has nothing in it to warrant
| UTF-8
|
| You are incorrect here as well, try using some latin-1
| special character like "a" and you will see that browsers
| default to document.characterSet UTF-8 not windows-1252
| lelandbatey wrote:
| > You are incorrect here as well, try using some latin-1
| special character like "a" and you will see that browsers
| default to document.characterSet UTF-8 not windows-1252
|
| I decided to try this experimentally. In my findings, if
| neither the server nor the page contents indicate that a
| file is UTF-8, then the browser NEVER defaults to setting
| document.characterSet to UTF-8, instead basically always
| assuming that it's "windows-1252" a.k.a. "latin1". Read
| on for my methodology, an exact copy of my test data, and
| some particular oddities at the end.
|
| To begin, we have three '.html' files: one with ASCII-only
| characters, a second file with two separate characters
| that are specifically latin1 encoded, and a third with
| those same latin1 characters but encoded using UTF-8.
| Those two characters are:
|
|     Ë - "Latin Capital Letter E with Diaeresis"
|         - Latin1 encoding: 0xCB
|         - UTF-8 encoding: 0xC3 0x8B
|         - https://www.compart.com/en/unicode/U+00CB
|     ¥ - "Yen Sign"
|         - Latin1 encoding: 0xA5
|         - UTF-8 encoding: 0xC2 0xA5
|         - https://www.compart.com/en/unicode/U+00A5
|
| To avoid copy-paste errors around encoding, I've dumped
| the contents of each file as "hexdumps", which you can
| transform back into their binary form by feeding the
| hexdump form into the command 'xxd -r -p -'.
|
|     $ cat ascii.html | xxd -p
|     3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e6720
|     636865636b2041534349493c2f7469746c653e3c2f686561643e
|     3c626f64793e3c68313e4e6f74206d75636820686572652c20
|     6a75737420706c61696e20746578743c2f68313e3c703e
|     4d6f7265207465787420746861742773206e6f74207370656369616c
|     3c2f703e3c2f626f64793e3c2f68746d6c3e0a
|
|     $ cat latinone.html | xxd -p
|     3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e6720
|     636865636b206c6174696e313c2f7469746c653e3c2f686561643e
|     3c626f64793e3c68313e546869732069732061206c6174696e3120
|     63686172616374657220307841353a20a53c2f68313e3c703e
|     54686973206973206368617220307843423a20cb
|     3c2f703e3c2f626f64793e3c2f68746d6c3e0a
|
|     $ cat utf8.html | xxd -p
|     3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e6720
|     636865636b207574663820203c2f7469746c653e3c2f686561643e
|     3c626f64793e3c68313e5468697320697320612075746638202020
|     63686172616374657220307841353a20c2a53c2f68313e3c703e
|     546869732069732063686172203078433338423a20c38b
|     3c2f703e3c2f626f64793e3c2f68746d6c3e0a
|
| The full contents of my current folder is as such:
|
|     $ ls -a .
|     .  ..  ascii.html  latinone.html  utf8.html
|
| Now that we have our test files, we can serve them via a
| very basic HTTP server. But first, we must verify that
| all responses from the HTTP server do not contain a
| header implying the content type; we want the browser to
| have to make a guess based on nothing but the contents of
| the file. So, we run the server and check to make sure
| it's not being well intentioned and guessing the content
| type:
|
|     $ curl -s -vvv 'http://127.0.0.1:8000/ascii.html' 2>&1 \
|         | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
|     > GET /ascii.html HTTP/1.1
|     > Accept: */*
|     >
|     < HTTP/1.0 200 OK
|     < Server: SimpleHTTP/0.6 Python/3.10.7
|     < Content-type: text/html
|
|     $ curl -s -vvv 'http://127.0.0.1:8000/latinone.html' 2>&1 \
|         | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
|     > GET /latinone.html HTTP/1.1
|     > Accept: */*
|     >
|     < HTTP/1.0 200 OK
|     < Server: SimpleHTTP/0.6 Python/3.10.7
|     < Content-type: text/html
|
|     $ curl -s -vvv 'http://127.0.0.1:8000/utf8.html' 2>&1 \
|         | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
|     > GET /utf8.html HTTP/1.1
|     > Accept: */*
|     >
|     < HTTP/1.0 200 OK
|     < Server: SimpleHTTP/0.6 Python/3.10.7
|     < Content-type: text/html
|
| Now we've verified that we won't have our observations
| muddled by the server doing its own detection, so our
| results from the browser should be able to tell us
| conclusively if the presence of a latin1 character causes
| the browser to use UTF-8 encoding. To test, I loaded each
| web page in Firefox and Chromium and checked what
| `document.characterSet` said.
|
|     Firefox (v116.0.3):
|       http://127.0.0.1:8000/ascii.html
|         result of `document.characterSet`: "windows-1252"
|       http://127.0.0.1:8000/latinone.html
|         result of `document.characterSet`: "windows-1252"
|       http://127.0.0.1:8000/utf8.html
|         result of `document.characterSet`: "windows-1252"
|
|     Chromium (v115.0.5790.170):
|       http://127.0.0.1:8000/ascii.html
|         result of `document.characterSet`: "windows-1252"
|       http://127.0.0.1:8000/latinone.html
|         result of `document.characterSet`: "macintosh"
|       http://127.0.0.1:8000/utf8.html
|         result of `document.characterSet`: "windows-1252"
|
| So in my testing, neither browser EVER guesses that any
| of these pages are UTF-8; both browsers seem to mostly
| default to assuming that if no content-type is set in the
| document or in the headers then the encoding is
| "windows-1252" (bar Chromium and the Latin1 characters,
| which bizarrely caused Chromium to guess that it's
| "macintosh" encoded?). Also note that if I add the exact
| character you proposed (a) to the text body, it still
| doesn't cause the browser to start assuming everything is
| UTF-8; the only change is that Chromium starts to think
| the latinone.html file is also "windows-1252" instead of
| "macintosh".
| [deleted]
| ko27 wrote:
| You are missing the point of what we were discussing. If
| page content IS UTF-8 but there is no charset or server
| headers, then browsers will still treat it as UTF-8.
|
| Here is a far simpler test:
|
| Save this content as .html utf-8 file:
|
| <head>a
|
| document.characterSet is UTF-8.
| capitainenemo wrote:
| A simpler test FWIW.. type:
| data:text/html,<html>
|
| Into your url bar and inspect that. Avoids server messing
| with encoding values. And yes, here on my linux machine
| in firefox it is windows-1252 too.
|
| (You can type the complete document, but <html> is
| sufficient. Browsers autocomplete a valid document. BTW,
| data:text/html,<html contenteditable> is something I use
| quite a lot)
|
| But yeah, I think windows-1252 is standard for quirks
| mode, for historical reasons.
| layer8 wrote:
| The historical (and present?) default is to use the local
| character set, which on US Windows is Windows-1252, but
| for example on Japanese Windows is Shift-JIS. The
| expectation is that users will tend to view web pages
| from their region.
| bawolff wrote:
| > Both firefox and chrome give me "windows-1252" on
| Windows, for which the "windows" part in the name is of
| course irrelevant; what matters is what it's not, which
| is that it's not UTF-8, because the content has nothing
| in it to warrant UTF-8.
|
| While technically latin-1/iso-8859-1 is a different
| encoding than windows-1252, html5 spec says browsers are
| supposed to treat latin1 as windows-1252.
| electroly wrote:
| Chromium (and I'm sure other browsers, but I didn't test)
| will sniff character set heuristically regardless of the
| HTML version or quirks mode. It's happy to choose UTF-8
| if it sees something UTF-8-like in there. I don't know
| how to square this with your earlier claim of "Browsers
| don't use utf-8 unless you tell them to."
|
| That is, the following UTF-8 encoded .html files all
| produce document.characterSet == "UTF-8" and render as
| expected without mojibake, despite not saying anything
| about UTF-8. Change "a" to "a" to get windows-1252 again.
|     <html>a
|     <!DOCTYPE html><html>a
|     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
|       "http://www.w3.org/TR/html4/strict.dtd"><html>a
|     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html>a
| fulafel wrote:
| It's the default HTTP character set. It's not clear whether the
| above stat page is about what charsets are explicitly
| specified.
|
| Also, headers (this is mostly relevant for header values) are,
| I think, ISO-8859-1.
| rhdunn wrote:
| Be aware that the WHATWG Encoding specification [1] says
| that latin1, ISO-8859-1, etc. are aliases of the
| windows-1252 encoding, _not_ the proper latin1 encoding. As a
| result, browsers and operating systems will display those
| files differently! It also aliases the ASCII encoding to
| windows-1252.
|
| [1] https://encoding.spec.whatwg.org/#names-and-labels
| ko27 wrote:
| Since HTML5, UTF-8 is the default charset. And for headers,
| they are parsed as ASCII-encoded in almost all cases, although
| ISO-8859-1 is supported.
| fulafel wrote:
| I tried to find confirmation of this but found only:
| https://html.spec.whatwg.org/multipage/semantics.html#charse...
|
| > The Encoding standard requires use of the UTF-8 character
| encoding and requires use of the "utf-8" encoding label to
| identify it. Those
|
| Sounds to me like it tells you that you have to explicitly
| declare the charset as UTF-8, so you don't get the HTTP
| default of Latin-1.
|
| (But that's just one "living standard", not exactly
| synonymous with HTML5, and it might change, or might
| have been different last week..)
| ko27 wrote:
| > so you don't get the HTTP default of Latin-1.
|
| That's not what your linked spec says. You can try it
| yourself, in any browser. If you omit the encoding the
| browser uses heuristics to guess, but it will always work
| if you write UTF-8 even without meta charset or encoding
| header.
| fulafel wrote:
| I don't doubt browsers use heuristics. But spec-wise I
| think it's your turn to provide a reference in favour
| of a utf-8-is-default interpretation :)
| ko27 wrote:
| No it isn't. My original point is that Latin-1 is used
| very rarely on Internet and is being phased out. Now it's
| your turn to provide some references that a significant
| percentage of websites are omitting encoding (which is
| required by spec!) and using Latin-1.
|
| But if you insist, here is this quote:
|
| https://www.w3docs.com/learn-html/html-character-sets.html
|
| > UTF-8 is the default character encoding for HTML5.
| However, it was used to be different. ASCII was the
| character set before it. And the ISO-8859-1 was the
| default character set from HTML 2.0 till HTML 4.01.
|
| or another:
|
| https://www.dofactory.com/html/charset
|
| > If a web page starts with <!DOCTYPE html> (which
| indicates HTML5), then the above meta tag is optional,
| because the default for HTML5 is UTF-8.
| bawolff wrote:
| > My original point is that Latin-1 is used very rarely
| on Internet and is being phased out.
|
| Nobody disagrees with this, but this is a very different
| statement from what you said originally in regards to
| what the default is. Things can be phased out but still
| have the old default with no plan to change the default.
|
| Re other sources - how about citing the actual spec
| instead of sketchy websites that seem likely to have
| incorrect information.
| rhdunn wrote:
| The WHATWG HTML spec [1] has various heuristics it
| uses/specifies for detecting the character encoding.
|
| In point 8, it says an implementation _may_ use
| heuristics to detect the encoding. It has a note which
| states:
|
| > The UTF-8 encoding has a highly detectable bit pattern.
| Files from the local file system that contain bytes with
| values greater than 0x7F which match the UTF-8 pattern
| are very likely to be UTF-8, while documents with byte
| sequences that do not match it are very likely not. When
| a user agent can examine the whole file, rather than just
| the preamble, detecting for UTF-8 specifically can be
| especially effective.
|
| In point 9, the implementation can return an
| implementation or user-defined encoding. Here, it
| suggests a locale-based default encoding, including
| windows-1252 for "en".
|
| As such, implementations _may_ be capable of
| detecting/defaulting to UTF-8, but are equally likely to
| default to windows-1252, Shift_JIS, or another encoding.
|
| [1] https://html.spec.whatwg.org/#determining-the-character-enco...
| masfuerte wrote:
| Is this useful? Most Latin 1 text is really Windows 1252, which
| has additional characters that don't have the same regular
| mapping to Unicode. So this conversion will mangle curly quotes
| and the Euro sign, among others.
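|
| For reference, the bytes that differ are 0x80-0x9F: Latin-1 maps
| them to the C1 control characters (U+0080-U+009F), while
| Windows-1252 mostly puts printable characters there. A few of
| the affected slots (partial table, just as a sketch):
|
|     #include <cstdint>
|
|     // Only a handful of the 0x80-0x9F slots are shown; most of
|     // the rest of that range also differs and is omitted here.
|     // Outside 0x80-0x9F the two encodings agree.
|     uint32_t windows1252_to_codepoint(uint8_t b) {
|         switch (b) {
|             case 0x80: return 0x20AC;  // EURO SIGN
|             case 0x91: return 0x2018;  // LEFT SINGLE QUOTATION MARK
|             case 0x92: return 0x2019;  // RIGHT SINGLE QUOTATION MARK
|             case 0x93: return 0x201C;  // LEFT DOUBLE QUOTATION MARK
|             case 0x94: return 0x201D;  // RIGHT DOUBLE QUOTATION MARK
|             default:   return b;       // not accurate for the other
|                                        // 0x80-0x9F slots (see above)
|         }
|     }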
| dotancohen wrote:
| > Most Latin 1 text is really Windows 1252
|
| I'd say that the vast majority of Latin-1 that I've encountered
| is just ASCII. Where have you seen Windows-1252 presented with
| a Latin-1 header or other encoding declaration that declared it
| as Latin-1?
___________________________________________________________________
(page generated 2023-08-21 23:01 UTC)