[HN Gopher] Scan HTML faster with SIMD instructions - Chrome edition
___________________________________________________________________
Scan HTML faster with SIMD instructions - Chrome edition
Author : p_l
Score : 100 points
Date : 2024-06-11 10:26 UTC (2 days ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| nasretdinov wrote:
| To me it's fascinating that you indeed _can_ speed up what
| appears to be a non-parallelisable task like parsing HTML (or
| JSON to a lesser extent) using SIMD. And that the gains are so
| substantial too.
| dzaima wrote:
| There's a surprising number of things that can be parsed in a
| parallel fashion if one tries hard enough, though keeping
| significant (or any) gains is trickier. The operation in the
| article, though, is only a rather tiny part of parsing HTML.
| floxy wrote:
| Can anyone recommend pointers to find out more about creating
| programming languages / markup languages that would be more
| parallel friendly for parsing? Or maybe even embarrassingly
| parallel? How does that affect the grammar, etc.. Maybe you
| need a different formalism than BNF to describe it,
| and you don't use regex to tokenize, and there is no
| recursive descent, or state machines in parsing, etc..
| magicalhippo wrote:
| Now all we need is for pre-2000s web to come back and we could
| have instant web surfing.
| Const-me wrote:
| The reason the single-lookup version works is that the bytes
| being searched for have unique low 4 bits. The lookup generates a
| vector with bytes equal to low_nibble_mask[ source[i] & 0x0F ].
| The table contains the bytes being searched for, in the elements
| which correspond to the low 4 bits of those bytes.
|
| The next instruction compares bytes in the source vector for
| equality with the results of that lookup. So, for each byte in
| the source vector, the first 3 instructions are computing the
| following expression, in parallel for each of the 16 bytes in the
| vector: bool eq = ( b == low_nibble_mask[ b & 0x0F ] );
|
| The rest is trivial: identifying the index of the first byte
| which compared true.
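The per-byte predicate described above can be sketched in scalar C. This is an illustration, not Chrome's actual code; the target set { '<', '&', '\r', '\0' } is assumed from the article's scanning problem, and it works only because those four bytes have distinct low nibbles (0xC, 0x6, 0xD, 0x0). Unused slots hold 0x00, which can only ever match slot 0, so they never produce a false positive.

```c
#include <stddef.h>
#include <stdint.h>

/* Table indexed by the low nibble of each input byte; each target
   byte sits at the slot given by its own low 4 bits. */
static const uint8_t low_nibble_mask[16] = {
    0x00 /* '\0' */, 0, 0, 0, 0, 0, 0x26 /* '&' */, 0,
    0, 0, 0, 0, 0x3C /* '<' */, 0x0D /* '\r' */, 0, 0,
};

/* The expression the three SIMD instructions compute per byte. */
static int is_special(uint8_t b) {
    return b == low_nibble_mask[b & 0x0F];
}

/* The whole scan: the SIMD version tests 16 bytes per iteration and
   replaces this loop with a movemask + trailing-zero count. */
static size_t scan_special(const uint8_t *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (is_special(p[i])) return i;
    return n;
}
```

In the real PSHUFB version, the table lookup, the AND, and the compare each happen once for all 16 lanes at a time.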
|
| And one more thing.
|
| > If you have an old SSE2 PC, only the simple SIMD approach is
| available
|
| That PSHUFB table-lookup instruction was introduced in the SSSE3
| extension. Now in 2024 it's hard to find an old SSE2 PC which
| doesn't also support SSSE3. You'd need something like a Pentium D
| (launched in 2005) or an AMD K10 (launched in 2007).
| ilrwbwrkhv wrote:
| Do you still use c++ or have you moved to rust?
| Const-me wrote:
| I mostly use C++ and C#, and I'm not a fan of Rust.
| victor106 wrote:
| I love C#.
|
| > I'm not a fan of Rust.
|
| I wonder why? Care to elaborate?
| neonsunset wrote:
| A little more on topic, if you like SIMD _and_ C#, dotnet
| /runtime now has an introductory guide to
| Vector128/256/512<T> API:
|
| https://github.com/dotnet/runtime/blob/main/docs/coding-
| guid...
|
| Now, the default syntax can still be a bit wordier with byref
| arithmetic than ideal, but you can add a few extension
| methods to make it look closer to Rust's pointer
| arithmetic:
|
| https://github.com/U8String/U8String/blob/main/Sources/U8
| Str... (unrelated note but .NET lowers this to vpternlogd
| if AVX512VL is available)
|
| You can find more examples of SIMD use in dotnet/runtime
| if that interests you:
|
| https://grep.app/search?q=Vector%28128%7C256%7C512%29%3C%
| 5Cw...
|
| Many paths are vectorized: to/from hexadecimal byte
| conversion, unicode encoding/decoding, various hash
| algorithms, counting and searching for elements, chunking
| text, there's a BLAS library now too and more.
| Const-me wrote:
| > I wonder why?
|
| The whole point of Rust is memory safety and security,
| but I don't care too much about either of them. I mostly
| work on offline desktop apps. When I do work on network
| exposed servers, I use C# which delivers even better
| safety and security for a reasonable performance
| overhead.
|
| In pursuit of that memory safety, Rust sacrificed pretty
| much everything else.
|
| The language is incredibly hard to learn. The compiler is
| even slower than C++'s. It's hard to do R&D in Rust because
| memory safety forces minor changes into large refactors.
| The recent push to async/await is controversial, to say
| the least.
|
| It's impossible to implement any non-trivial data
| structures in safe Rust. In my line of work, designing
| data structures is more important than writing executable
| code, because the gap between RAM bandwidth and processor
| performance has been growing for decades now.
|
| Also, interesting post and discussion:
| https://news.ycombinator.com/item?id=40172033
| akira2501 wrote:
| Rust. Never mentioned. Shoehorned into every conversation
| anyways.
| adrian_b wrote:
| Not only did the Pentium D (2005) lack SSSE3, but so did the
| later CPUs up to and including the Core Solo and Core Duo,
| which were launched at the beginning of 2006.
|
| SSSE3 (a.k.a. Merom New Instructions), including PSHUFB, was
| first introduced in the Core 2 line (mid 2006).
|
| AMD launched its first CPUs with SSSE3 only in 2011 (Bobcat
| & Bulldozer).
| mananaysiempre wrote:
| As a side note, it's also technically possible to match an
| arbitrary set of bytes < 128 in two lookups (or one lookup and
| one variable shift), by vectorizing
| !!(bitset[b >> 3] & oneshl[b & 7]), where bitset[i] has its
| jth bit set iff i*8+j is in the set, and oneshl[i] is 1<<i.
| (I don't remember where I saw this first, unfortunately; a
| pull request to a Rust something or other.) If you're on x86
| and thus using PSHUFB for the lookup, making the first shift
| an arithmetic one will also see you reject bytes >= 128
| automatically.
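The bitset expression above can be modeled in scalar C. This is a sketch, not the pull request mentioned; the table contents here are an assumption, built for the article's set { '<', '&', '\r', '\0' }. A 16-byte bitset covers all bytes < 128, which is why both lookups fit in a single PSHUFB each.

```c
#include <stdint.h>
#include <string.h>

/* Build the table: bitset[i] has bit j set iff byte i*8+j is in the
   set. Assumes every set member is < 128. */
static void build_bitset(const uint8_t *set, size_t n, uint8_t bitset[16]) {
    memset(bitset, 0, 16);
    for (size_t i = 0; i < n; i++)
        bitset[set[i] >> 3] |= (uint8_t)(1u << (set[i] & 7));
}

/* The vectorized expression !!(bitset[b >> 3] & oneshl[b & 7]),
   written with a shift instead of the oneshl table. Bytes >= 128
   are rejected, as the arithmetic-shift PSHUFB index trick would
   do automatically on x86. */
static int in_set(uint8_t b, const uint8_t bitset[16]) {
    if (b >= 128) return 0;
    return (bitset[b >> 3] >> (b & 7)) & 1;
}
```

The SIMD version would do both indexed loads with PSHUFB (one on b >> 3, one on b & 7) and AND the results lane-wise.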
| o11c wrote:
| And how does `strcspn` from various libcs compare?
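For reference, strcspn is the scalar libc baseline for this kind of scan: it returns the length of the initial span of a string containing none of the reject bytes. A minimal sketch (the reject set here is assumed from the article; note that '\0' cannot be listed in the reject string, though strcspn always stops at the terminator anyway):

```c
#include <string.h>

/* Scan for the first '<', '&' or '\r'; the implicit string
   terminator covers the '\0' case. */
static size_t scan_markup(const char *s) {
    return strcspn(s, "<&\r");
}
```

Glibc's strcspn is itself vectorized on common targets, so this comparison is more interesting than it may first appear.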
| klysm wrote:
| It's always crazy to me that we have picked human-first
| encodings for the data we send between computers. JSON is in a
| similar boat. Absolutely enormous amounts of computing power are
| spent on serialization and deserialization, and that latency is
| experienced everywhere too. If we used zero-copy serialization
| structures everywhere, a whole lot of compute would be saved.
| p_l wrote:
| A considerable portion of that was early ARPANET work, often
| involving somewhat lacking access to hardware, so in its
| formative years the internet had a lot of teletypes attached
| directly to network ports.
|
| Also one can't forget about TIPs, which provided "phone modem
| to telnet-like" bridge service in ARPANET.
|
| Another part was how text was the one thing that people _had_
| standardised enough between different computers. FTP's ASCII
| and binary modes aren't about line-ending conversion for
| Unix's sake, but exist because "binary" meant totally
| different things on different hosts (could be 8-bit octets,
| could be 18-bit words, could be 36-bit words, could be 60-bit;
| before the internet fully settled down there were 40-bit
| hosts too).
|
| Also people are scared of compilers.
|
| All of that led to a cultural bias towards textual protocols.
| bigbuppo wrote:
| So... what you're saying is that we should do what COBOL did in
| 1960?
| dingdingdang wrote:
| I tend to agree here. Judiciously used JSON (these days often
| running under zstd compression) is seldom the bottleneck in
| anything, and it allows immediate human debugging if need be.
| scottlamb wrote:
| > The results follow my expectations: the simplest vectorized
| classification routine has the best performance. However, you may
| observe that even a rather naive SIMD approach can be quite fast
| in this instance.
|
| I've recently written my first SIMD code [1]. This matches my
| observation: you get a big improvement just moving from scalar
| code to autovectorized code (i.e. working in fixed widths and
| telling the compiler to use specific CPU features you've
| detected), another decent improvement going to basic use of
| vendor intrinsics, and then more and more modest improvements
| from extra sophistication.
|
| [1] uyvy->i420 pixel format conversion code, a much easier
| application of SIMD. No branching, just a bunch of pixels
| transformed in identical fashion.
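The conversion scottlamb describes can be sketched as naive scalar C. This is an illustration under stated assumptions, not his linked code: each UYVY pixel pair is laid out as [U, Y0, V, Y1], and this version simply drops chroma from odd rows to get from 4:2:2 to 4:2:0 (a better one would average row pairs). Assumes even width and height.

```c
#include <stddef.h>
#include <stdint.h>

/* UYVY (packed 4:2:2) -> I420 (planar 4:2:0). Outputs: a w*h luma
   plane y, and (w/2)*(h/2) chroma planes u and v. */
static void uyvy_to_i420(const uint8_t *src, int w, int h,
                         uint8_t *y, uint8_t *u, uint8_t *v) {
    for (int r = 0; r < h; r++) {
        const uint8_t *row = src + (size_t)r * w * 2;
        for (int c = 0; c < w; c += 2) {
            y[r * w + c]     = row[c * 2 + 1];
            y[r * w + c + 1] = row[c * 2 + 3];
            if ((r & 1) == 0) { /* keep one chroma sample per 2x2 block */
                u[(r / 2) * (w / 2) + c / 2] = row[c * 2];
                v[(r / 2) * (w / 2) + c / 2] = row[c * 2 + 2];
            }
        }
    }
}
```

The inner loop has no branches that depend on pixel values, which is what makes this an easy first SIMD target.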
| ijidak wrote:
| It's amazing how ChatGPT makes code the reader may not be
| familiar with faster to grok.
|
| The examples in this article assume familiarity with SIMD.
|
| In the past, I would have tabled an article like this for
| deciphering later.
|
| But, with ChatGPT, I was able to ask about just a few bits of
| code and immediately grok the implementation.
|
| From a technology standpoint, the future is looking scary
| productive.
|
| Can't wait to see what new vistas developers open up with these
| new super powers.
| secondcoming wrote:
| You're assuming the output from ChatGPT is correct and that
| it's not bluffing.
___________________________________________________________________
(page generated 2024-06-13 23:00 UTC)