[HN Gopher] 1brc merykitty's magic SWAR: 8 lines of code explain...
       ___________________________________________________________________
        
       1brc merykitty's magic SWAR: 8 lines of code explained in 3k words
        
       Author : signa11
       Score  : 140 points
       Date   : 2024-03-09 05:57 UTC (17 hours ago)
        
 (HTM) web link (questdb.io)
 (TXT) w3m dump (questdb.io)
        
       | majke wrote:
       | Can you vectorize this with SSE. Most of the magic could be done
       | with vector of four 32 bit integers.
       | 
       | The question is if the cost of setting up the initial vector and
       | extracting the result isnt prohibitive.
        
         | gfody wrote:
         | I imagine code like this gets auto-vectorized either right from
         | the start or after Hotspot detects a hotspot
        
           | haxen wrote:
           | I wonder what you mean here. What code exactly would get
           | auto-vectorized? Parsing the temperature surely not, since
           | it's not in an array of fixed-size fields.
        
             | gfody wrote:
             | once you've done the work to implement parsing as
             | branchless bitwise operations on integers I think that code
             | can be extended automatically to operate on larger
             | registers
        
         | dzaima wrote:
         | It can, and various other 1BRC implementations have (though I
         | doubt hotspot would manage to do it itself, never mind that
         | most 1BRC submissions were run with Graal for the lower startup
         | overhead). The 32x32-64-bit multiplication poses an issue on
         | the default SSE2 as it has no 32-bit or 64-bit multiplication,
         | but SSE4.1 adds exactly the needed thing, pmuldq (although, as
         | that returns a 64-bit result, you'd need two of such to process
         | a full vector of 32-bit ints).
        
         | haxen wrote:
         | The temperature fields are interleaved with name fields, so I
         | don't think you'd get any extra benefit from SSE. Also, since
         | the temperature field is variable-sized, it would probably not
         | pay off even if it was stored by column.
         | 
         | SSE was successfully applied to finding the delimiter between
         | the name and the temperature, though.
        
           | dzaima wrote:
           | It can still be beneficial - yes you need to load the
           | temperatures individually, but there's enough operations
           | afterward to make it worth it.
        
       | readthenotes1 wrote:
       | Why does they compute and throw away nextLineStart?
       | 
       | long nextLineStart = ...
        
         | robin_reala wrote:
         | That's explained at the end of the post: in the real code,
         | that's the most efficient place to calculate it, and it was
         | passed back out as well.
        
           | __float wrote:
           | That was some key context lost in the simplification of the
           | original code[1] for this blog post, though. The blog post's
           | code _returns_ the temperature (throwing away the next line),
           | whereas the original adds it to a map and returns the next
           | line.
           | 
           | [1] https://github.com/gunnarmorling/1brc/blob/dfec2cdbe6a033
           | 4cf...
        
             | haxen wrote:
             | There are many variations of the original code used in
             | different solutions. Many of them return the temperature
             | like the variant used in the post, but they split out the
             | part that finds the decimal dot into a separate function.
             | Then you can reuse that twice: to finish parsing the
             | temperature, and to advance the cursor to the next row.
        
       | pacaro wrote:
       | We used to use the BCD opcodes for this kind of thing. Masking
       | off the 0x30, shift digits (if you want packed BCD)
       | 
       | I can't imagine that has been efficient for decades tho
        
         | jsheard wrote:
         | The BCD instructions are one of the few things that was dropped
         | in the x86-64 transition, so it wouldn't work at all anymore.
        
           | kragen wrote:
           | they'd never been extended to work on more than 8 bits, even
           | in the 8086; they only really existed for 8080 compatibility,
           | and arguably the 8080 primarily had them to ease the path
           | from earlier intel processors designed for, if i'm not
           | mistaken, literal pocket calculators
           | 
           | in pocket calculators you have to display the result in
           | decimal after performing a single arithmetic operation, so
           | it's much more efficient to do the arithmetic in bcd than to
           | convert from decimal to binary, perform the arithmetic
           | operation, and then convert back
        
             | pacaro wrote:
             | Yeah, if you wanted numbers greater than 99 you had to use
             | them in the x87 (if you had one)
        
         | pcwalton wrote:
         | Agner says that AAA has latency 5 on Cannon Lake, so using that
         | instruction is a bit faster than doing the operations manually.
         | But if you vectorize (or use SWAR) I imagine you can start to
         | beat the legacy instructions with larger numbers.
        
       | jdright wrote:
       | > the real mystery is how a single guy working alone could come
       | up with all this in just a few days
       | 
       | Really not a mystery. This is not that complex for people with
       | low level programming experience (embedded, (older) games,
       | demoscene, performance systems, etc.).
       | 
       | Knowing these tricks and knowing how to apply them is basically a
       | requirement on these areas.
        
         | AtlasBarfed wrote:
         | Imo the most important class in comp sci is computer
         | architecture, specifically one that takes you from gates to a
         | rudimentary alu, registers, to microcode, to some form of
         | assembly coding.
         | 
         | There is nothing magical in computers once you can do that
         | deconstruction in your head.
        
         | AlotOfReading wrote:
         | Bit twiddling hacks are increasingly becoming a lost art even
         | in the embedded/low level space. There's relatively few people
         | around with any day-to-day need for them and I've found myself
         | deleting a lot of the legacy ones I find because the compiler
         | situation has changed so much from when they were originally
         | written.
         | 
         | I think this is an excellent skills showcase, even if I don't
         | think it's magical.
        
           | at_a_remove wrote:
           | Back when the word "nanotechnology" used to mean itty-bitty
           | robots* rather than some extremely fine chemical processes
           | done in sequence, or extremely small particle size, I had
           | wondered if we could _get_ to the point of said itty-bity
           | robots before all of the brilliant people who made playable
           | games for restricted consoles like the Atari had retired.
           | 
           | * More specifically, miniaturized robots consisting of one or
           | more robot arms with multiple degrees of freedom, a CPU, some
           | kind of storage (perhaps something like wound-up DNA serving
           | as tape on reels), executing programs written by humans.
        
             | MichaelZuo wrote:
             | That concepts sounds like fantasy, cells are shaped like
             | cells because 'itty-bitty robots', and most other
             | configurations, don't work thermodynamically or physically
             | or chemically at such small scales.
        
               | at_a_remove wrote:
               | Cells are shaped like cells (which is to say _ugly bags
               | of mostly water_ ) because they are self-perpetuating
               | chemical reactions in solution. We usually conduct our
               | chemical reactions in vessels that are round along at
               | least one axis, maybe almost two for our round-bottomed
               | flasks. This is natural because we're enclosing a maximum
               | volume of our solution with a minimal surface (cell
               | membrane or borosilicate glass).
               | 
               | On a macro scale, life tends to be bilaterally or
               | radially symmetrical, but our robots are not necessarily
               | like that, just considering even factory robots which are
               | often an arm and little else. So, at the micro scale, I
               | don't think they have to resemble "life" there, either.
               | I'm hardly suggesting some kind of tinkertoy with one
               | atom here and another atom there and the strut being,
               | instead of wood, a covalent bond. No, I think we would
               | need more atoms than that.
               | 
               | Frankly, we haven't much tried to scale down our
               | robotics. Oh, you'll find someone who will produce the
               | flagellar motor (a mere forty-five nanometers, compared
               | to the ten thousand nanometers of a human cell) but not
               | much else. I wouldn't worry about the thermodynamics and
               | quantum effects until we're down to that motor level.
        
         | jandrewrogers wrote:
         | Yes. These are standard idioms in low-level performance
         | engineering. This is the kind of thing that would be a fun
         | puzzle for someone with this skill set, something you could
         | come up with in an hour. This particular example does
         | demonstrate a relatively complete understanding of the toolset.
         | 
         | I feel like this kind of knowledge is becoming a lost art these
         | days despite having lost no relevance.
        
           | winwang wrote:
           | "a sufficient smart compiler...." (/s)
           | 
           | I had a question recently: would an extra 65th bit in some
           | registers have a good use case with stuff with SWAR?
        
       | WoodenChair wrote:
       | For context on why you should read this: this is a well written
       | article that explains in detail how some bitwise operations work
       | to parse a temperature (think -10.3, or 3.4) with no conditionals
       | (no if-statements or switches). This results in very fast
       | parsing. The language used is Java but it could just as easily be
       | C or Swift.
        
         | mnau wrote:
         | Result is fast parsing with no error checking.
         | 
         | If the input wasn't in expected format (e.g. different decimal
         | separator), the result would be incorrect.
         | 
         | Fun ss an exercise, but hard to understand, maintain.
        
       | plokhotnyuk wrote:
       | What an amazing step by step explanation!
       | 
       | More than 2 years ago I found that byte array view var handles
       | are quite suitable to cook efficient SWAR routines with
       | Java/Scala.
       | 
       | See a lot of other examples of SWAR usage, like parsing Base16/64
       | strings, java.time.* and number values directly from byte arrays:
       | 
       | https://github.com/plokhotnyuk/jsoniter-scala/blob/master/js...
        
       | quercusa wrote:
       | As per the article: SWAR is "SIMD Within A Register"
        
       | teo_zero wrote:
       | I remember the first time I read about this technique used by the
       | Linux kernel to find the \0 at the end of the string. If course
       | XORing with another char will detect that char instead.
       | 
       | But before you run to implement it as a drop-in replacement for
       | strlen() all over your code, please note that in real-world cases
       | the string length is not guaranteed to be a multiple of 8. When
       | you read the last long, you might overrun your memory buffer and
       | trigger a page fault.
        
         | Karellen wrote:
         | > When you read the last long, you might overrun your memory
         | buffer and trigger a page fault.
         | 
         | You know that page boundaries are always on a `PAGE_SIZE`
         | alignment, which you can get with `getpagesize()` or
         | `sysconf(_SC_PAGESIZE)`.
         | 
         | ...but overrunning the declared size of a memory allocation is
         | still UB, even if you don't trigger a page fault, so YMMV ;-)
        
       | winwang wrote:
       | If people like stuff like this, the simdjson paper uses similar
       | techniques, is extremely well-written and has great examples.
       | 
       | Paper: https://arxiv.org/abs/1902.08318
       | 
       | Github: https://github.com/simdjson/simdjson
        
         | djmips wrote:
         | This isn't SWAR though but I get how it might appeal.
        
         | jsheard wrote:
         | Rusts standard hashmap (hashbrown) is another practical modern
         | example, it _can_ take advantage of 128-bit SIMD but on
         | platforms where that 's not supported it uses SWAR tricks to
         | implement the same algorithms.
        
       | teucris wrote:
       | Great article and given the context of the code this is a
       | brilliant solution, but note that this assumes that the data is
       | well-formed. Much of the value in an effective, battled-worn
       | parser is efficient error checking and recovery.
        
         | Karellen wrote:
         | It would be interesting to see a breakdown of the ways in which
         | ill-formed inputs could affect the output. And how much work it
         | would be to detect such inputs and return some kind of sentinel
         | error value - in the same style as the current code.
         | 
         | (Not quite interesting enough for me to do it myself though ;-)
        
       | tayo42 wrote:
       | Pretty cool. I'll have to come back to this
       | 
       | But can't someone just ask merry kitty where they got the idea
       | from?
        
       | djmips wrote:
       | Used to do SWAR on the 68000 to good effect. 4 bytes operated on
       | in parallel in a single instruction. Tricky to handle the
       | overflow IIRC. I really like this article!
        
       | gigatexal wrote:
       | This article helped me understand bit fiddling better than any
       | before it. Thanks to the author and the author of the really
       | novel Java solution to the 1BRC challenge.
        
       ___________________________________________________________________
       (page generated 2024-03-09 23:00 UTC)