[HN Gopher] Scan HTML faster with SIMD instructions - Chrome edition
       ___________________________________________________________________
        
       Scan HTML faster with SIMD instructions - Chrome edition
        
       Author : p_l
       Score  : 100 points
       Date   : 2024-06-11 10:26 UTC (2 days ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | nasretdinov wrote:
       | To me it's fascinating that you indeed _can_ speed up what
       | appears to be a non-parallelisable task like parsing HTML (or
       | JSON to a lesser extent) using SIMD. And that the gains are so
       | substantial too.
        
         | dzaima wrote:
          | There's a surprising number of things that are possible to
          | parse in a parallel fashion if one tries hard enough, though
          | keeping significant (or any) gains is trickier. The operation
          | in the article, though, is only a rather tiny part of parsing
          | HTML.
        
           | floxy wrote:
            | Can anyone recommend pointers for finding out more about
            | creating programming languages / markup languages that
            | would be more parallel-friendly to parse? Or maybe even
            | embarrassingly parallel? How does that affect the grammar,
            | etc.? Maybe you need a different formalism rather than BNF
            | to describe it, and you don't use regex to tokenize, and
            | there is no recursive descent or state machines in parsing,
            | etc.
        
         | magicalhippo wrote:
          | Now all we need is for the pre-2000s web to come back and we
          | could have instant web surfing.
        
       | Const-me wrote:
        | The reason why the single-lookup version works is that the
        | bytes being searched for have unique low 4 bits. So the lookup
        | generates a vector with bytes equal to
        | low_nibble_mask[ source[i] & 0x0F ]. The table contains the
        | bytes being searched for, in the elements which correspond to
        | the low 4 bits of those bytes.
        | 
        | The next instruction compares bytes in the source vector for
        | equality with the results of that lookup. So the first 3
        | instructions are computing the following expression, in
        | parallel for each of the 16 bytes in the vector:
        | 
        |     bool eq = ( b == low_nibble_mask[ b & 0x0F ] );
       | 
        | The rest is trivial: identifying the index of the first byte
        | which compared true.
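        | 
        | A minimal sketch of that classification step with SSE
        | intrinsics (my own illustration, assuming SSSE3; the name
        | classify16 is mine, not from the article):
        | 
        |     #include <immintrin.h>
        |     #include <cstdint>
        | 
        |     // Returns a 16-bit mask with bit i set when the i-th input
        |     // byte satisfies b == low_nibble_mask[ b & 0x0F ].
        |     static inline uint32_t classify16(const char* p,
        |                                       __m128i low_nibble_mask)
        |     {
        |         __m128i src = _mm_loadu_si128((const __m128i*)p);
        |         __m128i low = _mm_and_si128(src, _mm_set1_epi8(0x0F));
        |         __m128i lut = _mm_shuffle_epi8(low_nibble_mask, low);
        |         __m128i eq  = _mm_cmpeq_epi8(src, lut);
        |         return (uint32_t)_mm_movemask_epi8(eq);
        |     }
        | 
        | A nonzero result can then be fed to a trailing-zero count to
        | get the index of the first matching byte.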
       | 
       | And one more thing.
       | 
       | > If you have an old SSE2 PC, only the simple SIMD approach is
       | available
       | 
       | That PSHUFB table lookup SSE instruction was introduced in SSSE3
       | extension. Now in 2024 it's hard to find an old SSE2 PC which
        | doesn't also support SSSE3. You would need something like a
        | Pentium D launched in 2005, or an AMD K10 launched in 2007.
        
         | ilrwbwrkhv wrote:
         | Do you still use c++ or have you moved to rust?
        
           | Const-me wrote:
           | I mostly use C++ and C#, and I'm not a fan of Rust.
        
             | victor106 wrote:
             | I love C#.
             | 
             | > I'm not a fan of Rust.
             | 
             | I wonder why? Care to elaborate?
        
               | neonsunset wrote:
               | A little more on topic, if you like SIMD _and_ C#, dotnet
               | /runtime now has an introductory guide to
               | Vector128/256/512<T> API:
               | 
               | https://github.com/dotnet/runtime/blob/main/docs/coding-
               | guid...
               | 
                | Now, the default syntax can still be a bit wordier with
                | byref arithmetic than ideal, but you can add a few
                | extension methods to make it look closer to Rust's
                | pointer arithmetic:
               | 
               | https://github.com/U8String/U8String/blob/main/Sources/U8
               | Str... (unrelated note but .NET lowers this to vpternlogd
               | if AVX512VL is available)
               | 
               | You can find more examples of SIMD use in dotnet/runtime
               | if that interests you:
               | 
               | https://grep.app/search?q=Vector%28128%7C256%7C512%29%3C%
               | 5Cw...
               | 
               | Many paths are vectorized: to/from hexadecimal byte
               | conversion, unicode encoding/decoding, various hash
               | algorithms, counting and searching for elements, chunking
               | text, there's a BLAS library now too and more.
        
               | Const-me wrote:
               | > I wonder why?
               | 
               | The whole point of Rust is memory safety and security,
               | but I don't care too much about either of them. I mostly
               | work on offline desktop apps. When I do work on network
               | exposed servers, I use C# which delivers even better
               | safety and security for a reasonable performance
               | overhead.
               | 
                | In pursuit of that memory safety, Rust sacrificed
                | pretty much everything else.
               | 
               | The language is incredibly hard to learn. The compiler is
               | even slower than C++. It's hard to do R&D in Rust because
               | memory safety forces minor changes into large refactors.
               | The recent push to async/await is controversial, to say
               | the least.
               | 
                | It's impossible to implement any non-trivial data
                | structures in safe Rust. In my line of work, designing
                | data structures is more important than writing
                | executable code, because the gap between RAM bandwidth
                | and processor performance has been growing for decades
                | now.
               | 
               | Also, interesting post and discussion:
               | https://news.ycombinator.com/item?id=40172033
        
           | akira2501 wrote:
           | Rust. Never mentioned. Shoehorned into every conversation
           | anyways.
        
         | adrian_b wrote:
          | Not only did the Pentium D (2004) lack SSSE3; so did the
          | later CPUs, up to the Core Solo, which was launched at the
          | beginning of 2006.
          | 
          | SSSE3 (a.k.a. Merom New Instructions), including PSHUFB, was
          | first introduced in the Core 2 Duo (mid 2006).
          | 
          | AMD launched its first CPUs with SSSE3 only in 2011 (Bobcat
          | & Bulldozer).
        
         | mananaysiempre wrote:
          | As a side note, it's also technically possible to do an
          | arbitrary set of bytes < 128 in two lookups (or one lookup
          | and one variable shift), by vectorizing
          | 
          |     !!(bitset[b >> 3] & oneshl[b & 7]),
         | 
         | where bitset[i] has its jth bit set iff i*8+j is in the set,
         | and oneshl[i] is 1<<i. (I don't remember where I saw this
         | first, unfortunately, a pull request to a Rust something or
         | other.) If you're on x86 and thus using PSHUFB for the lookup,
         | making the first shift an arithmetic one will also see you
         | reject bytes >= 128 automatically.
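          | 
          | A rough sketch of how the two-lookup version might look with
          | SSSE3 intrinsics (my own illustration; the helper name and
          | the sign-bit trick standing in for a per-byte arithmetic
          | shift are assumptions, not from the comment above):
          | 
          |     #include <immintrin.h>
          |     #include <stdint.h>
          | 
          |     // bitset[i] has bit j set iff byte i*8+j is in the
          |     // set; oneshl[i] = 1 << i for i < 8, and 0 otherwise.
          |     static inline uint32_t in_set16(__m128i src,
          |                                     __m128i bitset,
          |                                     __m128i oneshl)
          |     {
          |         __m128i lo3 = _mm_and_si128(src,
          |                                     _mm_set1_epi8(0x07));
          |         __m128i hi = _mm_and_si128(_mm_srli_epi16(src, 3),
          |                                    _mm_set1_epi8(0x0F));
          |         // Keep the sign bit in the index so PSHUFB zeroes
          |         // bytes >= 128, rejecting them automatically.
          |         hi = _mm_or_si128(hi,
          |             _mm_and_si128(src, _mm_set1_epi8((char)0x80)));
          |         __m128i bits = _mm_shuffle_epi8(bitset, hi);
          |         __m128i mask = _mm_shuffle_epi8(oneshl, lo3);
          |         __m128i hit = _mm_cmpeq_epi8(
          |             _mm_and_si128(bits, mask), mask);
          |         return (uint32_t)_mm_movemask_epi8(hit);
          |     }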
        
       | o11c wrote:
       | And how does `strcspn` from various libcs compare?
        
       | klysm wrote:
        | It's always crazy to me that we have picked human-first
        | encodings for the data we send between computers. JSON is in a
        | similar boat. Absolutely enormous amounts of computing power
        | are spent on serialization and deserialization, and that
        | latency is experienced everywhere too. If we used zero-copy
        | serialization structures everywhere, a whole lot of compute
        | would be saved.
        
         | p_l wrote:
          | A considerable portion of that was early ARPANET work, often
          | involving somewhat lacking access to hardware, so in its
          | formative years the internet had a lot of teletypes attached
          | directly to network ports.
          | 
          | Also, one can't forget about TIPs, which provided a "phone
          | modem to telnet-like" bridge service in the ARPANET.
          | 
          | Another part was that text was the one thing people _had_
          | standardised enough between different computers. FTP's ASCII
          | and binary modes aren't about line-ending conversion for
          | Unix, but exist because "binary" meant totally different
          | things on different hosts (could be 8-bit octets, could be
          | 28-bit words, could be 36-bit words, could be 60-bit; before
          | the internet fully settled down there were 40-bit hosts too).
          | 
          | Also, people are scared of compilers.
          | 
          | All of that led to a cultural bias towards textual protocols.
        
         | bigbuppo wrote:
            | So... what you're saying is that we should do what COBOL
            | did in 1960?
        
           | dingdingdang wrote:
            | I tend to agree here; judiciously used JSON (these days
            | often running under zstd compression) is seldom the
            | bottleneck for anything and allows immediate human
            | debugging if need be.
        
       | scottlamb wrote:
       | > The results follow my expectations: the simplest vectorized
       | classification routine has the best performance. However, you may
       | observe that even a rather naive SIMD approach can be quite fast
       | in this instance.
       | 
       | I've recently written my first SIMD code [1]. This matches my
       | observation: you get a big improvement just moving from scalar
       | code to autovectorized code (i.e. working in fixed widths and
       | telling the compiler to use specific CPU features you've
       | detected), another decent improvement going to basic use of
       | vendor intrinsics, and then more and more modest improvements
       | from extra sophistication.
       | 
       | [1] uyvy->i420 pixel format conversion code, a much easier
       | application of SIMD. No branching, just a bunch of pixels
       | transformed in identical fashion.
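        | 
        | A hypothetical illustration of the "fixed widths + telling the
        | compiler which CPU features to target" point (not scottlamb's
        | actual code; the function and its saturating-add kernel are
        | made up):
        | 
        |     #include <cstdint>
        |     #include <cstddef>
        | 
        |     // Compiled for a specific feature set (here AVX2 via a
        |     // GCC/Clang target attribute), a fixed-width, branch-free
        |     // loop like this typically autovectorizes cleanly.
        |     __attribute__((target("avx2")))
        |     void add_saturate_64(const uint8_t* a, const uint8_t* b,
        |                          uint8_t* out)
        |     {
        |         for (size_t i = 0; i < 64; i++) {
        |             unsigned s = unsigned(a[i]) + unsigned(b[i]);
        |             out[i] = uint8_t(s > 255 ? 255 : s);
        |         }
        |     }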
        
       | ijidak wrote:
       | It's amazing how ChatGPT makes code the reader may not be
       | familiar with faster to grok.
       | 
       | The examples in this article assume familiarity with SIMD.
       | 
       | In the past, I would have tabled an article like this for
       | deciphering later.
       | 
       | But, with ChatGPT, I was able to ask about just a few bits of
       | code and immediately grok the implementation.
       | 
       | From a technology standpoint, the future is looking scary
       | productive.
       | 
       | Can't wait to see what new vistas developers open up with these
       | new super powers.
        
         | secondcoming wrote:
          | You're assuming the output from ChatGPT is correct and that
          | it's not bluffing.
        
       ___________________________________________________________________
       (page generated 2024-06-13 23:00 UTC)