[HN Gopher] 1brc merykitty's magic SWAR: 8 lines of code explain...
       ___________________________________________________________________
        
       1brc merykitty's magic SWAR: 8 lines of code explained in 3k words
        
       Author : signa11
       Score  : 235 points
       Date   : 2024-03-09 05:57 UTC (1 day ago)
        
 (HTM) web link (questdb.io)
 (TXT) w3m dump (questdb.io)
        
       | majke wrote:
       | Can you vectorize this with SSE? Most of the magic could be
       | done with a vector of four 32-bit integers.
       | 
       | The question is whether the cost of setting up the initial
       | vector and extracting the result isn't prohibitive.
        
         | gfody wrote:
         | I imagine code like this gets auto-vectorized either right from
         | the start or after Hotspot detects a hotspot
        
           | haxen wrote:
           | I wonder what you mean here. What code exactly would get
           | auto-vectorized? Parsing the temperature surely not, since
           | it's not in an array of fixed-size fields.
        
             | gfody wrote:
             | once you've done the work to implement parsing as
             | branchless bitwise operations on integers I think that code
             | can be extended automatically to operate on larger
             | registers
        
               | haxen wrote:
               | I think it would work if that was the only code in the
               | loop. But the loop spans several more nontrivial
               | operations, including hashtable insertion.
        
         | dzaima wrote:
         | It can, and various other 1BRC implementations have (though I
         | doubt hotspot would manage to do it itself, never mind that
         | most 1BRC submissions were run with Graal for the lower startup
         | overhead). The 32x32->64-bit multiplication poses an issue on
         | the default SSE2 as it has no 32-bit or 64-bit multiplication,
         | but SSE4.1 adds exactly the needed thing, pmuldq (although, as
         | that returns a 64-bit result, you'd need two of such to process
         | a full vector of 32-bit ints).
        
           | RaisingSpear wrote:
           | Did the challenge limit submissions to only using SSE2? Seems
           | odd, given the prevalence of SSE4.2 support.
           | 
           | PMULUDQ is in SSE2, though I haven't checked if that's usable
           | for the problem here. There's also PMULLD in SSE4.1 if you
           | only need a 32-bit result. But for summing digits, perhaps
           | SSE2's PMADDWD could be sufficient?
        
             | dzaima wrote:
             | The official 1BRC was Java-only, so no using any
             | architecture-specific SIMD at all; the test system did have
             | AVX2 though, and that's what most non-competing native
             | solutions (including mine) targeted.
             | 
             | Completely forgot about pmuludq, that works too for SSE2.
             | But a 32-bit result is insufficient for the magic number
             | method, needs to be at least 36-bit. I originally used
             | vpmaddubsw+vpmaddwd, but switched to vpmuldq for the
             | reduced register pressure, and I was already only parsing 4
             | numbers in ymm registers so the 64-bit result didn't affect
             | me (after parsing 4 temperatures I immediately did the
             | respective hashmap stuff for each).
        
         | haxen wrote:
         | The temperature fields are interleaved with name fields, so I
         | don't think you'd get any extra benefit from SSE. Also, since
         | the temperature field is variable-sized, it would probably not
         | pay off even if it was stored by column.
         | 
         | SSE was successfully applied to finding the delimiter between
         | the name and the temperature, though.
        
           | dzaima wrote:
           | It can still be beneficial - yes you need to load the
           | temperatures individually, but there's enough operations
           | afterward to make it worth it.
        
       | readthenotes1 wrote:
       | Why do they compute and throw away nextLineStart?
       | 
       | long nextLineStart = ...
        
         | robin_reala wrote:
         | That's explained at the end of the post: in the real code,
         | that's the most efficient place to calculate it, and it was
         | passed back out as well.
        
           | __float wrote:
           | That was some key context lost in the simplification of the
           | original code[1] for this blog post, though. The blog post's
           | code _returns_ the temperature (throwing away the next line),
           | whereas the original adds it to a map and returns the next
           | line.
           | 
           | [1] https://github.com/gunnarmorling/1brc/blob/dfec2cdbe6a033
           | 4cf...
        
             | haxen wrote:
             | There are many variations of the original code used in
             | different solutions. Many of them return the temperature
             | like the variant used in the post, but they split out the
             | part that finds the decimal dot into a separate function.
             | Then you can reuse that twice: to finish parsing the
             | temperature, and to advance the cursor to the next row.
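             | 
             | A minimal sketch of that split (made-up names, not any
             | particular solution's code; it assumes `word` holds 8
             | bytes loaded little-endian from the start of the
             | temperature field):
             | 
             |   // digits 0x30..0x39 have bit 4 set; '.' (0x2E) does
             |   // not, and the mask only looks at bytes 1..3, so the
             |   // lowest byte with bit 4 clear is the decimal dot
             |   static int dotPos(long word) {
             |     return Long.numberOfTrailingZeros(~word & 0x10101000);
             |   }
             | 
             |   // dot byte + one fractional digit + '\n' = 3 bytes
             |   static long nextLineStart(long tempStart, long word) {
             |     return tempStart + (dotPos(word) >>> 3) + 3;
             |   }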
        
       | pacaro wrote:
       | We used to use the BCD opcodes for this kind of thing:
       | masking off the 0x30, shifting the digits together (if you
       | want packed BCD).
       | 
       | I can't imagine that has been efficient for decades tho
        
         | jsheard wrote:
         | The BCD instructions are one of the few things that was dropped
         | in the x86-64 transition, so it wouldn't work at all anymore.
        
           | kragen wrote:
           | they'd never been extended to work on more than 8 bits, even
           | in the 8086; they only really existed for 8080 compatibility,
           | and arguably the 8080 primarily had them to ease the path
           | from earlier intel processors designed for, if i'm not
           | mistaken, literal pocket calculators
           | 
           | in pocket calculators you have to display the result in
           | decimal after performing a single arithmetic operation, so
           | it's much more efficient to do the arithmetic in bcd than to
           | convert from decimal to binary, perform the arithmetic
           | operation, and then convert back
        
             | pacaro wrote:
             | Yeah, if you wanted numbers greater than 99 you had to use
             | them in the x87 (if you had one)
        
         | pcwalton wrote:
         | Agner says that AAA has latency 5 on Cannon Lake, so using that
         | instruction is a bit faster than doing the operations manually.
         | But if you vectorize (or use SWAR) I imagine you can start to
         | beat the legacy instructions with larger numbers.
        
       | jdright wrote:
       | > the real mystery is how a single guy working alone could come
       | up with all this in just a few days
       | 
       | Really not a mystery. This is not that complex for people with
       | low level programming experience (embedded, (older) games,
       | demoscene, performance systems, etc.).
       | 
       | Knowing these tricks and knowing how to apply them is
       | basically a requirement in these areas.
        
         | AtlasBarfed wrote:
         | Imo the most important class in comp sci is computer
         | architecture, specifically one that takes you from gates,
         | to a rudimentary ALU and registers, to microcode, to some
         | form of assembly coding.
         | 
         | There is nothing magical in computers once you can do that
         | deconstruction in your head.
        
         | AlotOfReading wrote:
         | Bit twiddling hacks are increasingly becoming a lost art even
         | in the embedded/low level space. There are relatively few
         | people around with any day-to-day need for them, and I've
         | found myself deleting a lot of the legacy ones because the
         | compiler situation has changed so much from when they were
         | originally written.
         | 
         | I think this is an excellent skills showcase, even if I don't
         | think it's magical.
        
           | at_a_remove wrote:
           | Back when the word "nanotechnology" used to mean itty-bitty
           | robots* rather than some extremely fine chemical processes
           | done in sequence, or extremely small particle size, I had
           | wondered if we could _get_ to the point of said itty-bitty
           | robots before all of the brilliant people who made playable
           | games for restricted consoles like the Atari had retired.
           | 
           | * More specifically, miniaturized robots consisting of one or
           | more robot arms with multiple degrees of freedom, a CPU, some
           | kind of storage (perhaps something like wound-up DNA serving
           | as tape on reels), executing programs written by humans.
        
             | MichaelZuo wrote:
             | That concept sounds like fantasy: cells are shaped like
             | cells because 'itty-bitty robots', and most other
             | configurations, don't work thermodynamically or
             | physically or chemically at such small scales.
        
               | at_a_remove wrote:
               | Cells are shaped like cells (which is to say _ugly
               | bags of mostly water_) because they are
               | self-perpetuating
               | chemical reactions in solution. We usually conduct our
               | chemical reactions in vessels that are round along at
               | least one axis, maybe almost two for our round-bottomed
               | flasks. This is natural because we're enclosing a maximum
               | volume of our solution with a minimal surface (cell
               | membrane or borosilicate glass).
               | 
               | On a macro scale, life tends to be bilaterally or
               | radially symmetrical, but our robots are not necessarily
               | like that, just considering even factory robots which are
               | often an arm and little else. So, at the micro scale, I
               | don't think they have to resemble "life" there, either.
               | I'm hardly suggesting some kind of tinkertoy with one
               | atom here and another atom there and the strut being,
               | instead of wood, a covalent bond. No, I think we would
               | need more atoms than that.
               | 
               | Frankly, we haven't much tried to scale down our
               | robotics. Oh, you'll find someone who will produce the
               | flagellar motor (a mere forty-five nanometers, compared
               | to the ten thousand nanometers of a human cell) but not
               | much else. I wouldn't worry about the thermodynamics and
               | quantum effects until we're down to that motor level.
        
               | MichaelZuo wrote:
               | Do you realize you can determine this for yourself, right
               | now?
               | 
               | By just taking standard equations found in textbooks,
               | plugging in some numbers, and realizing that things
               | such as the Ideal Gas Law don't hold at such scales.
               | 
               | You can even derive the equations for yourself to
               | confirm you're not being misled by the textbooks, but
               | that takes more skill and knowledge than is likely
               | feasible to acquire without advanced study.
        
         | jandrewrogers wrote:
         | Yes. These are standard idioms in low-level performance
         | engineering. This is the kind of thing that would be a fun
         | puzzle for someone with this skill set, something you could
         | come up with in an hour. This particular example does
         | demonstrate a relatively complete understanding of the toolset.
         | 
         | I feel like this kind of knowledge is becoming a lost art these
         | days despite having lost no relevance.
        
           | winwang wrote:
           | "a sufficiently smart compiler..." (/s)
           | 
           | I had a question recently: would an extra 65th bit in some
           | registers have a good use case for stuff like SWAR?
        
             | rep_lodsb wrote:
             | Not directly, but I think it could be an alternative to the
             | carry flag present on most CPU architectures (today that
             | means x86 and ARM, and also historically pretty much
             | anything else that was at all significant).
             | 
             | The much-hyped RISC-V is one of the few that doesn't have a
             | flag register, because it's been claimed to be a
             | performance bottleneck - in my somewhat heretical opinion,
             | the actual reason is that its designers are deluded into
             | thinking that C and UNIX are as fundamental to software as
             | transistors are to hardware, and thus anything beyond
             | what's needed to support that environment is not worth
             | implementing.
             | 
             | But an extra bit per register would be like having a
             | separate carry flag for each, potentially solving some
             | problems with parallelization and also allowing checks for
             | integer overflow at the point a value is stored into memory
             | or used in a subsequent operation.
             | 
             | Having some kind of carry flag enables various programming
             | tricks that can also be useful for SWAR, for example
             | subtract-with-carry of a register with itself will either
             | set or clear all bits.
        
             | anonymoushn wrote:
             | These techniques are mostly not that useful because you can
             | just use much bigger registers.
        
       | WoodenChair wrote:
       | For context on why you should read this: this is a
       | well-written article that explains in detail how some bitwise
       | operations work to parse a temperature (think -10.3, or 3.4)
       | with no conditionals
       | (no if-statements or switches). This results in very fast
       | parsing. The language used is Java but it could just as easily be
       | C or Swift.
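       | 
       | To give a taste, the core is roughly the following
       | (paraphrased from the post, so treat it as a sketch): `word`
       | holds the 8 bytes starting at the temperature, loaded
       | little-endian.
       | 
       |   int dotPos = Long.numberOfTrailingZeros(~word & 0x10101000);
       |   long signed = (~word << 59) >> 63;         // all-ones if '-'
       |   long removeSignMask = ~(signed & 0xFF);    // drop the '-' byte
       |   long digits =
       |       ((word & removeSignMask) << (28 - dotPos)) & 0x0F000F0F00L;
       |   // one multiply sums hundreds*100 + tens*10 + ones
       |   long absValue = ((digits * 0x640a0001L) >>> 32) & 0x3FF;
       |   long value = (absValue ^ signed) - signed;  // negate if '-'
       |   // value is the temperature in tenths: "-10.3" -> -103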
        
         | mnau wrote:
         | The result is fast parsing with no error checking.
         | 
         | If the input weren't in the expected format (e.g. a
         | different decimal separator), the result would be incorrect.
         | 
         | Fun as an exercise, but hard to understand and maintain.
        
       | plokhotnyuk wrote:
       | What an amazing step by step explanation!
       | 
       | More than 2 years ago I found that byte array view var
       | handles are quite suitable for cooking up efficient SWAR
       | routines in Java/Scala.
       | 
       | See a lot of other examples of SWAR usage, like parsing Base16/64
       | strings, java.time.* and number values directly from byte arrays:
       | 
       | https://github.com/plokhotnyuk/jsoniter-scala/blob/master/js...
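       | 
       | The basic ingredient is just a 64-bit little-endian view over
       | a plain byte[] (minimal sketch; on x86 HotSpot typically
       | compiles the get down to a single unaligned load):
       | 
       |   import java.lang.invoke.MethodHandles;
       |   import java.lang.invoke.VarHandle;
       |   import java.nio.ByteOrder;
       | 
       |   static final VarHandle LONG_LE = MethodHandles
       |       .byteArrayViewVarHandle(long[].class,
       |                               ByteOrder.LITTLE_ENDIAN);
       | 
       |   // 8 bytes of buf starting at offset, as one SWAR word
       |   static long wordAt(byte[] buf, int offset) {
       |       return (long) LONG_LE.get(buf, offset);
       |   }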
        
       | quercusa wrote:
       | As per the article: SWAR is "SIMD Within A Register"
        
       | teo_zero wrote:
       | I remember the first time I read about this technique used by
       | the Linux kernel to find the \0 at the end of a string. Of
       | course XORing with another char will detect that char instead.
       | 
       | But before you run to implement it as a drop-in replacement for
       | strlen() all over your code, please note that in real-world cases
       | the string length is not guaranteed to be a multiple of 8. When
       | you read the last long, you might overrun your memory buffer and
       | trigger a page fault.
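       | 
       | For reference, the usual Java form of the trick (sketched from
       | memory, so double-check before dropping it in):
       | 
       |   // true if any byte of v is zero -- the classic SWAR test
       |   static boolean hasZeroByte(long v) {
       |       return ((v - 0x0101010101010101L)
       |               & ~v & 0x8080808080808080L) != 0;
       |   }
       | 
       |   // to find some other byte c, XOR it to zero first
       |   static boolean hasByte(long v, byte c) {
       |       return hasZeroByte(v ^ (0x0101010101010101L * (c & 0xFFL)));
       |   }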
        
         | Karellen wrote:
         | > When you read the last long, you might overrun your memory
         | buffer and trigger a page fault.
         | 
         | Page boundaries are always on a `PAGE_SIZE` alignment
         | (which you can get with `getpagesize()` or
         | `sysconf(_SC_PAGESIZE)`), so an aligned 8-byte read can
         | never cross into an unmapped page.
         | 
         | ...but overrunning the declared size of a memory allocation is
         | still UB, even if you don't trigger a page fault, so YMMV ;-)
        
           | o11c wrote:
           | Also, as I learned recently: pages you touch via UB reads
           | are allowed to change under you. You're not guaranteed to
           | get the same value next time you access them, even if
           | nothing else in your process is touching that memory.
        
             | Karellen wrote:
             | Anything is allowed to happen if your program invokes UB.
             | All bets are off (according to the C standard/abstract
             | machine) for the rest of the runtime of your program, from
             | the first time it happens.
             | 
             | In fact, because the compiler is allowed to re-order your
             | program provided that the end result of any non-UB
             | computation is the same (the "as-if" rule[0]), if part of
             | your program invokes UB, it's possible for that to "reach
             | backwards" and invalidate computations that appear to
             | happen before the UB is/was invoked (by a straightforward
             | reading of the source code).
             | 
             | [0] https://en.wikipedia.org/wiki/As-if_rule
        
       | winwang wrote:
       | If people like stuff like this, the simdjson paper uses similar
       | techniques, is extremely well-written and has great examples.
       | 
       | Paper: https://arxiv.org/abs/1902.08318
       | 
       | Github: https://github.com/simdjson/simdjson
        
         | djmips wrote:
         | This isn't SWAR though but I get how it might appeal.
        
       | teucris wrote:
       | Great article and given the context of the code this is a
       | brilliant solution, but note that this assumes that the data is
       | well-formed. Much of the value in an effective, battle-worn
       | parser is efficient error checking and recovery.
        
         | Karellen wrote:
         | It would be interesting to see a breakdown of the ways in which
         | ill-formed inputs could affect the output. And how much work it
         | would be to detect such inputs and return some kind of sentinel
         | error value - in the same style as the current code.
         | 
         | (Not quite interesting enough for me to do it myself though ;-)
        
       | tayo42 wrote:
       | Pretty cool. I'll have to come back to this
       | 
       | But can't someone just ask merykitty where they got the idea
       | from?
        
         | Terretta wrote:
         | Not saying the idea came from this, but if you were coding in
         | late 70s to mid 80s, you may well have written lots of code
         | using these techniques because 6502 assembly was fun and
         | manageable enough to just write your software in it, especially
         | if you had a Beagle Bros book at hand.
        
           | qingcharles wrote:
           | Anyone who coded in the demoscene too was writing oodles of
           | weird bit-twiddling guff like this, from experience...
        
         | mihaic wrote:
         | I think something like this was a lot more common knowledge
         | around 20 years ago, and now it seems like the mystic arts.
         | It just takes a mindset of thinking that everything is bytes
         | and binary; after you've seen a couple of SWAR snippets for
         | stuff like popcount or first significant bit, you're able to
         | make these inferences yourself.
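         | 
         | E.g. the popcount one (more or less what Long.bitCount does
         | in the JDK): each step sums adjacent bit fields of doubled
         | width, and the final multiply adds all eight byte sums into
         | the top byte.
         | 
         |   static int popcount(long v) {
         |       v = v - ((v >>> 1) & 0x5555555555555555L); // 2-bit sums
         |       v = (v & 0x3333333333333333L)
         |         + ((v >>> 2) & 0x3333333333333333L);     // 4-bit sums
         |       v = (v + (v >>> 4)) & 0x0F0F0F0F0F0F0F0FL; // byte sums
         |       return (int) ((v * 0x0101010101010101L) >>> 56);
         |   }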
        
       | djmips wrote:
       | Used to do SWAR on the 68000 to good effect. 4 bytes operated on
       | in parallel in a single instruction. Tricky to handle the
       | overflow IIRC. I really like this article!
        
       | gigatexal wrote:
       | This article helped me understand bit fiddling better than any
       | before it. Thanks to the author and the author of the really
       | novel Java solution to the 1BRC challenge.
        
       | stavros wrote:
       | Can someone explain to me how the 1BRC isn't bottlenecked on
       | I/O? I don't understand how the CPU is the bottleneck.
        
         | benhoyt wrote:
         | Local disk I/O is no longer the bottleneck on modern systems:
         | https://benhoyt.com/writings/io-is-no-longer-the-bottleneck/
         | 
         | In addition, the official 1BRC explicitly evaluated results on
         | a RAM disk to avoid I/O speed entirely:
         | https://github.com/gunnarmorling/1brc?tab=readme-ov-file#eva...
         | "Programs are run from a RAM disk (i.o. the IO overhead for
         | loading the file from disk is not relevant)"
        
           | davidmurdoch wrote:
           | I was surprised by this recently when optimizing a build
           | process that uses an intermediate write-to-disk step. I
           | replaced the intermediate filesystem API with an in-memory
           | one and it was not measurably faster. Not even by a single
           | millisecond.
        
             | 10000truths wrote:
             | How much data were you writing? If you don't fill the OS's
             | page cache and the SSD controller's DRAM cache, and you're
             | not blocking on fsync() or O_DIRECT or some other explicit
             | flushing mechanism, then you're not going to see much of a
             | difference in throughput.
        
               | davidmurdoch wrote:
               | Only a few MB written then read, so that is a likely
               | explanation.
        
           | stavros wrote:
           | Ahh, thanks, I didn't know about the ramdisk. Very
           | interesting about I/O not being the bottleneck, though.
        
           | mike_hearn wrote:
           | Unfortunately there are still cases where local disk I/O can
           | be a serious bottleneck that you do have to be aware of:
           | 
           | 1. Cloud VMs sometimes have very slow access even to devices
           | advertised as local disks, depending on which cloud
           | 
           | 2. Windows
           | 
           | The latter often catches people out because they develop
           | intuitions about file I/O speed on Linux, which is extremely
           | fast, and then their app runs 10x slower on Windows due to
           | opening file handles being much slower (and sometimes they
           | don't sign their software and virus scanners make things much
           | slower again).
        
             | NikkiA wrote:
             | Apparently, the cause of the long-standing Windows disk
             | IO problem was discovered a month or so ago, and MS were
             | said to be working on a fix.
             | 
             | Whether it'll be constrained to Win 11 is yet to be seen.
        
         | nkurz wrote:
         | I haven't looked at the problem closely enough to answer, but
         | could we start from the other direction: what makes you think
         | that memory I/O would be the bottleneck?
         | 
         | From my limited understanding, we sequentially bring a large
         | text file into L1 and then do a single read for each value. On
         | most processors we can do two of these per cycle. The slow part
         | will be bringing it into L1 from RAM, but sequential reads are
         | pretty fast.
         | 
         | We then do some processing on each read. At a glance, maybe 4
         | cycles worth in this optimized version? Then we need to write
         | the result somewhere, presumably with a random read (or two?)
         | first. Is this the part you are thinking is going to be the I/O
         | bottleneck?
         | 
         | I'm not saying it's obviously CPU limited, but it doesn't seem
         | obvious that it wouldn't be.
         | 
         | Edit: I hadn't considered that you might have meant "disk I/O".
         | As others have said, that's not really a factor here.
        
           | haxen wrote:
           | > maybe 4 cycles worth in this optimized version?
           | 
           | It's quite a bit more than that, just the code discussed in
           | the post is around 20 instructions, and there's a bunch more
           | concerns like finding the delimiter between the name and the
           | temperature, and hashtable operations. All put together, it
           | comes to around 80 cycles per row.
           | 
           | When explaining the timing of 1.5 seconds, one must take into
           | account that it's parallelized across 8 CPU cores.
        
             | nkurz wrote:
             | You are right. In my defense, I meant to say "about 4
             | cycles per byte of input" but in my editing I messed this
             | up. I'd just deleted a sentence talking about the number of
             | bytes per cycle we could bring in from RAM, but was worried
             | my estimate was outdated. I started trying to research the
             | current answer, then gave up and deleted it, leaving the
             | other sentence wrong.
        
           | stavros wrote:
           | Sorry, yes, I meant disk I/O, I should have clarified.
        
         | codegladiator wrote:
         | The test is executed with memfs. The file and everything is
         | in RAM at startup.
        
         | menaerus wrote:
         | Since the dataset is small enough to fit into the Linux
         | kernel page cache, and since the benchmark is repeated five
         | consecutive times, the first iteration will be bottlenecked
         | by disk I/O but the remaining four will not - i.e. all the
         | data will be in RAM (the page cache).
        
         | superjan wrote:
         | For some background: here is an interview with Daniel Lemire
         | who built an entire career based on the observation that I/O is
         | often not the bottleneck. https://corecursive.com/frontiers-of-
         | performance-with-daniel...
        
         | fulafel wrote:
         | Not being IO bound is part of the test design in this case.
         | https://github.com/gunnarmorling/1brc?tab=readme-ov-file#eva...
         | 
         | "Programs are run from a RAM disk (i.o. the IO overhead for
         | loading the file from disk is not relevant)"
        
       | kwillets wrote:
       | Multiplying digit bitfields by their respective powers of 10 and
       | shift/adding via MUL is a (well?) known technique, see Lemire
       | https://lemire.me/blog/2023/11/28/parsing-8-bit-integers-qui... .
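       | 
       | The best-known form is probably the 8-ASCII-digit parser from
       | that line of work, transcribed here to Java as a sketch. Each
       | multiply merges adjacent fields, with the place values baked
       | into the constants (10*256 + 1, 100*65536 + 1, and
       | 10000*2^32 + 1):
       | 
       |   // parses exactly 8 ASCII digits, e.g. "12345678" ->
       |   // 12345678; val is the 8 bytes loaded little-endian
       |   static int parseEightDigits(long val) {
       |       val = (val & 0x0F0F0F0F0F0F0F0FL) * 2561L >>> 8;
       |       val = (val & 0x00FF00FF00FF00FFL) * 6553601L >>> 16;
       |       return (int) ((val & 0x0000FFFF0000FFFFL)
       |                     * 42949672960001L >>> 32);
       |   }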
        
       | MrYellowP wrote:
       | > the real mystery is how a single guy working alone could come
       | up with all this in just a few days of casually doing an online
       | challenge with a T-shirt and a coffee mug as a reward.
       | 
       | Why is that a mystery? There are still people out there who
       | actually know how to program a CPU and actually understand
       | what they're doing. The _real_ mystery is the lack of deeper
       | understanding in most people who call themselves programmers,
       | combined with the fact that they don't appear to know that
       | they're seriously lacking.
        
       | londons_explore wrote:
       | This algorithm could be done 32x in parallel with AVX
       | instructions...
       | 
       | But Java doesn't really support SIMD use of the CPU except in a
       | very few specific cases, which is sad.
        
         | haxen wrote:
         | Can you provide more details on how that would work, given
         | that the input is CSV rows rather than an array of integers,
         | and that getting to the next row depends on finding the
         | delimiter in the current row?
        
           | anonymoushn wrote:
           | There isn't really a dependency there. You can scan for a
           | bunch of delimiters and save a bunch of row start and end
           | positions and process the rows from there.
           | 
           | If (unlike the solutions that won this challenge) you would
           | like to use a hash that consumes all the bytes in the input,
           | then scanning for all the delimiters and bucketizing strings
           | by length also lets you avoid branch mispredictions caused by
           | random lengths.
        
           | londons_explore wrote:
           | You could choose 32 start points approximately evenly through
           | the file, and process starting at each point in parallel.
        
         | dzaima wrote:
         | Not quite 32x as there are 3-5 bytes of input per number, and a
         | range of -999...999 for the output (and you'd generally want to
         | load those consecutive bytes at the same time). 4x or 8x is
         | certainly possible though; and you can even do that in Java
         | with jdk.incubator.vector to some extent, although the
         | available operations are somewhat limited, and Hotspot's
         | codegen leaves enough to be desired.
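         | 
         | For the delimiter-scanning side it's usable enough, e.g.
         | (sketch; needs JDK 16+ and --add-modules
         | jdk.incubator.vector):
         | 
         |   import jdk.incubator.vector.*;
         | 
         |   static final VectorSpecies<Byte> B = ByteVector.SPECIES_256;
         | 
         |   // index of the first ';' in buf[off..off+31], or 32
         |   // if there is none in this window
         |   static int findDelim(byte[] buf, int off) {
         |       return ByteVector.fromArray(B, buf, off)
         |                        .compare(VectorOperators.EQ, (byte) ';')
         |                        .firstTrue();
         |   }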
        
           | londons_explore wrote:
           | But there are 1 billion rows... So plenty of parallelism.
           | Your 32 lanes can be working on totally different parts of
           | the dataset.
        
             | dzaima wrote:
             | Processing 32 inputs at 1-byte granularity means that
             | you'll need to expand each of those 32 inputs across 5
             | registers, and means you lose the ability to trivially
             | shift them along (would need 15 blends to align the 5 input
             | bytes to 3 output bytes for the 32xi8 approach, vs a single
             | vpsllvd or some shuffle for 8xi32), or do any arithmetic
             | wider than 8-bit (i.e. no 64-bit multiply at the core of
             | the method in the article). More lanes just for the sake of
             | it makes no sense (not counting unrolling as you can unroll
             | anything to any amount (give or take register pressure)).
        
             | dzaima wrote:
             | Also, fwiw, re: your gather/scatter approach at
             | https://news.ycombinator.com/item?id=38867074:
             | 
             | I believe you can't get away with just 16+16-bit per state
             | transition case - you have to validate that the input is
             | what the entry was for, and your proposed 16-bit fields
             | don't include anything for that. (having a full
             | 2^32-element table won't help you, and would be a massive
             | 16GB by itself anyways).
             | 
             | I don't see how 2^16 counters can work, as there are
             | clearly 400x400 = 160000 different name x temperature
             | combinations to track; lowering to 160 temperatures is
             | awful. (maybe you could squeeze that together with the
             | output state bits though, which are somewhat fewer, but
             | doesn't help much).
             | 
             | Which leaves with just the option of increasing to 8 bytes
             | per entry.
             | 
             | That ends up as a massive table as e.g. a window of "4\nAb"
             | has 400x40 possible input states, and is 10x141 cases, for
             | a total of 22560000 transitions, 180MB; and there are more
             | than just this window, and you need some empty space to
             | reduce collisions (though some enormous L3-s maybe could
             | still cover it. maybe.)
             | 
             | The current best native solution works at ~2 cycles per
             | input byte per core, so your method per 4 bytes has to beat
             | 8 cycles/element. Your described solution is (cycle counts
             | on Intel, as Zen 4 has much slower gather/scatter):
             | 
             | 1. gather the next block of 4 bytes for each element; let's
             | say 0.33c/elt as these are all in L1.
             | 
             | 2. gather 8-byte values from the state transition table; as
             | the table is massive and random access, that's limited by
             | the RAM to cache interface; though you only need 8 bytes,
             | the CPU will fetch 64B cache lines, so even with DDR5 at
             | 64GB/s that's just 1B reads per second, or 3c/elt at 3GHz.
             | 
             | 3. vpconflictd for when elements hit the same histogram
             | bucket (~1.2c/elt on Intel; 0.08c/elt on Zen 4; also needs
             | some fallback for when the collisions do happen (luckily
             | unlikely if you mask out states that don't terminate the
             | record); also means that these steps 3&4&5 cannot be
             | interleaved without a separate histogram table for each).
             | 
             | 4. gather the histogram previous value (0.5c/elt?).
             | 
             | 5. scatter the incremented histogram values (0.7c/elt?).
             | 
             | Furthermore, to handle the rarer temperatures/collisions
             | (the distribution of names is uniform so there's no point
             | in handling only a subset of those), you need some fallback
             | mechanism. Targeting 400 temperatures gives you a 5% chance
             | of a failed lookup per entry (i.e., for a 16-element
             | register, there's a 56% chance that at least one will hit a
             | bad temperature), and you must restore the failed lane to a
             | working one very fast, or otherwise all lanes will end up
             | dying quite quickly. Some options:
             | 
             | 6.a. do a scalar ctz+blsr loop over a mask of the failed
             | entries (a bunch of cycles and very branch-mispredicty);
             | 
             | 6.b. increase to ~600 targeted temperatures, for only 0.2%
             | chance of a failed lookup (still 3.1% over a 16-element
             | vector), and do a scalar fallback when it does happen
             | (still somewhat mispredicty);
             | 
             | 6.c. store all failed entries to a buffer (scatter,
             | 0.7c/elt, probably even with most entries being masked off?
             | could be wrong though); and then of course whatever code
             | for actually handling these 5% of more complex entries.
             | 
             | So that's somewhere around 0.33 + 3 + 1.2 + 0.5 + 0.7 + 0.7
             | = 6.43 cycles per 4 bytes, not far from the 8 cycles of
             | current solutions, and that's not counting any of the
             | intermediate arithmetic, assuming ideal cache & RAM
             | performance (and that the CPU can actually OoO over all of
             | the RAM latency and not flop on the gather/scatter of the
             | histogram not having known addresses for a while), and
             | doesn't leave much for the table initialization.
        
       | neonsunset wrote:
       | You don't need such SWAR tricks in C# - it offers a
       | first-class cross-platform SIMD API instead.
       | 
       | Evidently it works really well, as can be seen from the C#
       | solution that is the fastest publicly seen 1BRC implementation
       | so far:
       | https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
        
       | EE84M3i wrote:
       | This state space is pretty small - is there a super-optimizer
       | that will just spit out a similar result?
        
       ___________________________________________________________________
       (page generated 2024-03-10 23:01 UTC)