[HN Gopher] 1brc merykitty's magic SWAR: 8 lines of code explain...
___________________________________________________________________
1brc merykitty's magic SWAR: 8 lines of code explained in 3k words
Author : signa11
Score : 140 points
Date : 2024-03-09 05:57 UTC (17 hours ago)
(HTM) web link (questdb.io)
(TXT) w3m dump (questdb.io)
| majke wrote:
| Can you vectorize this with SSE. Most of the magic could be done
| with vector of four 32 bit integers.
|
| The question is if the cost of setting up the initial vector and
| extracting the result isnt prohibitive.
| gfody wrote:
| I imagine code like this gets auto-vectorized either right from
| the start or after Hotspot detects a hotspot
| haxen wrote:
| I wonder what you mean here. What code exactly would get
| auto-vectorized? Parsing the temperature surely not, since
| it's not in an array of fixed-size fields.
| gfody wrote:
| once you've done the work to implement parsing as
| branchless bitwise operations on integers I think that code
| can be extended automatically to operate on larger
| registers
| dzaima wrote:
| It can, and various other 1BRC implementations have (though I
| doubt hotspot would manage to do it itself, never mind that
| most 1BRC submissions were run with Graal for the lower startup
| overhead). The 32x32-64-bit multiplication poses an issue on
| the default SSE2 as it has no 32-bit or 64-bit multiplication,
| but SSE4.1 adds exactly the needed thing, pmuldq (although, as
| that returns a 64-bit result, you'd need two of such to process
| a full vector of 32-bit ints).
| haxen wrote:
| The temperature fields are interleaved with name fields, so I
| don't think you'd get any extra benefit from SSE. Also, since
| the temperature field is variable-sized, it would probably not
| pay off even if it was stored by column.
|
| SSE was successfully applied to finding the delimiter between
| the name and the temperature, though.
| dzaima wrote:
| It can still be beneficial - yes you need to load the
| temperatures individually, but there's enough operations
| afterward to make it worth it.
| readthenotes1 wrote:
| Why does they compute and throw away nextLineStart?
|
| long nextLineStart = ...
| robin_reala wrote:
| That's explained at the end of the post: in the real code,
| that's the most efficient place to calculate it, and it was
| passed back out as well.
| __float wrote:
| That was some key context lost in the simplification of the
| original code[1] for this blog post, though. The blog post's
| code _returns_ the temperature (throwing away the next line),
| whereas the original adds it to a map and returns the next
| line.
|
| [1] https://github.com/gunnarmorling/1brc/blob/dfec2cdbe6a033
| 4cf...
| haxen wrote:
| There are many variations of the original code used in
| different solutions. Many of them return the temperature
| like the variant used in the post, but they split out the
| part that finds the decimal dot into a separate function.
| Then you can reuse that twice: to finish parsing the
| temperature, and to advance the cursor to the next row.
| pacaro wrote:
| We used to use the BCD opcodes for this kind of thing. Masking
| off the 0x30, shift digits (if you want packed BCD)
|
| I can't imagine that has been efficient for decades tho
| jsheard wrote:
| The BCD instructions are one of the few things that was dropped
| in the x86-64 transition, so it wouldn't work at all anymore.
| kragen wrote:
| they'd never been extended to work on more than 8 bits, even
| in the 8086; they only really existed for 8080 compatibility,
| and arguably the 8080 primarily had them to ease the path
| from earlier intel processors designed for, if i'm not
| mistaken, literal pocket calculators
|
| in pocket calculators you have to display the result in
| decimal after performing a single arithmetic operation, so
| it's much more efficient to do the arithmetic in bcd than to
| convert from decimal to binary, perform the arithmetic
| operation, and then convert back
| pacaro wrote:
| Yeah, if you wanted numbers greater than 99 you had to use
| them in the x87 (if you had one)
| pcwalton wrote:
| Agner says that AAA has latency 5 on Cannon Lake, so using that
| instruction is a bit faster than doing the operations manually.
| But if you vectorize (or use SWAR) I imagine you can start to
| beat the legacy instructions with larger numbers.
| jdright wrote:
| > the real mystery is how a single guy working alone could come
| up with all this in just a few days
|
| Really not a mystery. This is not that complex for people with
| low level programming experience (embedded, (older) games,
| demoscene, performance systems, etc.).
|
| Knowing these tricks and knowing how to apply them is basically a
| requirement on these areas.
| AtlasBarfed wrote:
| Imo the most important class in comp sci is computer
| architecture, specifically one that takes you from gates to a
| rudimentary alu, registers, to microcode, to some form of
| assembly coding.
|
| There is nothing magical in computers once you can do that
| deconstruction in your head.
| AlotOfReading wrote:
| Bit twiddling hacks are increasingly becoming a lost art even
| in the embedded/low level space. There's relatively few people
| around with any day-to-day need for them and I've found myself
| deleting a lot of the legacy ones I find because the compiler
| situation has changed so much from when they were originally
| written.
|
| I think this is an excellent skills showcase, even if I don't
| think it's magical.
| at_a_remove wrote:
| Back when the word "nanotechnology" used to mean itty-bitty
| robots* rather than some extremely fine chemical processes
| done in sequence, or extremely small particle size, I had
| wondered if we could _get_ to the point of said itty-bity
| robots before all of the brilliant people who made playable
| games for restricted consoles like the Atari had retired.
|
| * More specifically, miniaturized robots consisting of one or
| more robot arms with multiple degrees of freedom, a CPU, some
| kind of storage (perhaps something like wound-up DNA serving
| as tape on reels), executing programs written by humans.
| MichaelZuo wrote:
| That concepts sounds like fantasy, cells are shaped like
| cells because 'itty-bitty robots', and most other
| configurations, don't work thermodynamically or physically
| or chemically at such small scales.
| at_a_remove wrote:
| Cells are shaped like cells (which is to say _ugly bags
| of mostly water_ ) because they are self-perpetuating
| chemical reactions in solution. We usually conduct our
| chemical reactions in vessels that are round along at
| least one axis, maybe almost two for our round-bottomed
| flasks. This is natural because we're enclosing a maximum
| volume of our solution with a minimal surface (cell
| membrane or borosilicate glass).
|
| On a macro scale, life tends to be bilaterally or
| radially symmetrical, but our robots are not necessarily
| like that, just considering even factory robots which are
| often an arm and little else. So, at the micro scale, I
| don't think they have to resemble "life" there, either.
| I'm hardly suggesting some kind of tinkertoy with one
| atom here and another atom there and the strut being,
| instead of wood, a covalent bond. No, I think we would
| need more atoms than that.
|
| Frankly, we haven't much tried to scale down our
| robotics. Oh, you'll find someone who will produce the
| flagellar motor (a mere forty-five nanometers, compared
| to the ten thousand nanometers of a human cell) but not
| much else. I wouldn't worry about the thermodynamics and
| quantum effects until we're down to that motor level.
| jandrewrogers wrote:
| Yes. These are standard idioms in low-level performance
| engineering. This is the kind of thing that would be a fun
| puzzle for someone with this skill set, something you could
| come up with in an hour. This particular example does
| demonstrate a relatively complete understanding of the toolset.
|
| I feel like this kind of knowledge is becoming a lost art these
| days despite having lost no relevance.
| winwang wrote:
| "a sufficient smart compiler...." (/s)
|
| I had a question recently: would an extra 65th bit in some
| registers have a good use case with stuff with SWAR?
| WoodenChair wrote:
| For context on why you should read this: this is a well written
| article that explains in detail how some bitwise operations work
| to parse a temperature (think -10.3, or 3.4) with no conditionals
| (no if-statements or switches). This results in very fast
| parsing. The language used is Java but it could just as easily be
| C or Swift.
| mnau wrote:
| Result is fast parsing with no error checking.
|
| If the input wasn't in expected format (e.g. different decimal
| separator), the result would be incorrect.
|
| Fun ss an exercise, but hard to understand, maintain.
| plokhotnyuk wrote:
| What an amazing step by step explanation!
|
| More than 2 years ago I found that byte array view var handles
| are quite suitable to cook efficient SWAR routines with
| Java/Scala.
|
| See a lot of other examples of SWAR usage, like parsing Base16/64
| strings, java.time.* and number values directly from byte arrays:
|
| https://github.com/plokhotnyuk/jsoniter-scala/blob/master/js...
| quercusa wrote:
| As per the article: SWAR is "SIMD Within A Register"
| teo_zero wrote:
| I remember the first time I read about this technique used by the
| Linux kernel to find the \0 at the end of the string. If course
| XORing with another char will detect that char instead.
|
| But before you run to implement it as a drop-in replacement for
| strlen() all over your code, please note that in real-world cases
| the string length is not guaranteed to be a multiple of 8. When
| you read the last long, you might overrun your memory buffer and
| trigger a page fault.
| Karellen wrote:
| > When you read the last long, you might overrun your memory
| buffer and trigger a page fault.
|
| You know that page boundaries are always on a `PAGE_SIZE`
| alignment, which you can get with `getpagesize()` or
| `sysconf(_SC_PAGESIZE)`.
|
| ...but overrunning the declared size of a memory allocation is
| still UB, even if you don't trigger a page fault, so YMMV ;-)
| winwang wrote:
| If people like stuff like this, the simdjson paper uses similar
| techniques, is extremely well-written and has great examples.
|
| Paper: https://arxiv.org/abs/1902.08318
|
| Github: https://github.com/simdjson/simdjson
| djmips wrote:
| This isn't SWAR though but I get how it might appeal.
| jsheard wrote:
| Rusts standard hashmap (hashbrown) is another practical modern
| example, it _can_ take advantage of 128-bit SIMD but on
| platforms where that 's not supported it uses SWAR tricks to
| implement the same algorithms.
| teucris wrote:
| Great article and given the context of the code this is a
| brilliant solution, but note that this assumes that the data is
| well-formed. Much of the value in an effective, battled-worn
| parser is efficient error checking and recovery.
| Karellen wrote:
| It would be interesting to see a breakdown of the ways in which
| ill-formed inputs could affect the output. And how much work it
| would be to detect such inputs and return some kind of sentinel
| error value - in the same style as the current code.
|
| (Not quite interesting enough for me to do it myself though ;-)
| tayo42 wrote:
| Pretty cool. I'll have to come back to this
|
| But can't someone just ask merry kitty where they got the idea
| from?
| djmips wrote:
| Used to do SWAR on the 68000 to good effect. 4 bytes operated on
| in parallel in a single instruction. Tricky to handle the
| overflow IIRC. I really like this article!
| gigatexal wrote:
| This article helped me understand bit fiddling better than any
| before it. Thanks to the author and the author of the really
| novel Java solution to the 1BRC challenge.
___________________________________________________________________
(page generated 2024-03-09 23:00 UTC)