[HN Gopher] 1brc merykitty's magic SWAR: 8 lines of code explain...
___________________________________________________________________
1brc merykitty's magic SWAR: 8 lines of code explained in 3k words
Author : signa11
Score : 235 points
Date : 2024-03-09 05:57 UTC (1 day ago)
(HTM) web link (questdb.io)
(TXT) w3m dump (questdb.io)
| majke wrote:
| Can you vectorize this with SSE? Most of the magic could be done
| with a vector of four 32-bit integers.
|
| The question is whether the cost of setting up the initial
| vector and extracting the result isn't prohibitive.
| gfody wrote:
| I imagine code like this gets auto-vectorized either right from
| the start or after Hotspot detects a hotspot
| haxen wrote:
| I wonder what you mean here. What code exactly would get
| auto-vectorized? Parsing the temperature surely not, since
| it's not in an array of fixed-size fields.
| gfody wrote:
| once you've done the work to implement parsing as
| branchless bitwise operations on integers I think that code
| can be extended automatically to operate on larger
| registers
| haxen wrote:
| I think it would work if that was the only code in the
| loop. But the loop spans several more nontrivial
| operations, including hashtable insertion.
| dzaima wrote:
| It can, and various other 1BRC implementations have (though I
| doubt hotspot would manage to do it itself, never mind that
| most 1BRC submissions were run with Graal for the lower startup
| overhead). The 32x32->64-bit multiplication poses an issue on
| the default SSE2 as it has no 32-bit or 64-bit multiplication,
| but SSE4.1 adds exactly the needed thing, pmuldq (although, as
| that returns a 64-bit result, you'd need two of such to process
| a full vector of 32-bit ints).
| RaisingSpear wrote:
| Did the challenge limit submissions to only using SSE2? Seems
| odd, given the prevalence of SSE4.2 support.
|
| PMULUDQ is in SSE2, though I haven't checked if that's usable
| for the problem here. There's also PMULLD in SSE4.1 if you
| only need a 32-bit result. But for summing digits, perhaps
| SSE2's PMADDWD could be sufficient?
| dzaima wrote:
| The official 1BRC was Java-only, so no using any
| architecture-specific SIMD at all; the test system did have
| AVX2 though, and that's what most non-competing native
| solutions (including mine) targeted.
|
| Completely forgot about pmuludq, that works too for SSE2.
| But a 32-bit result is insufficient for the magic number
| method, needs to be at least 36-bit. I originally used
| vpmaddubsw+vpmaddwd, but switched to vpmuldq for the
| reduced register pressure, and I was already only parsing 4
| numbers in ymm registers so the 64-bit result didn't affect
| me (after parsing 4 temperatures I immediately did the
| respective hashmap stuff for each).
| haxen wrote:
| The temperature fields are interleaved with name fields, so I
| don't think you'd get any extra benefit from SSE. Also, since
| the temperature field is variable-sized, it would probably not
| pay off even if it was stored by column.
|
| SSE was successfully applied to finding the delimiter between
| the name and the temperature, though.
| dzaima wrote:
| It can still be beneficial - yes you need to load the
| temperatures individually, but there's enough operations
| afterward to make it worth it.
| readthenotes1 wrote:
| Why do they compute and throw away nextLineStart?
|
| long nextLineStart = ...
| robin_reala wrote:
| That's explained at the end of the post: in the real code,
| that's the most efficient place to calculate it, and it was
| passed back out as well.
| __float wrote:
| That was some key context lost in the simplification of the
| original code[1] for this blog post, though. The blog post's
| code _returns_ the temperature (throwing away the next line),
| whereas the original adds it to a map and returns the next
| line.
|
| [1] https://github.com/gunnarmorling/1brc/blob/dfec2cdbe6a033
| 4cf...
| haxen wrote:
| There are many variations of the original code used in
| different solutions. Many of them return the temperature
| like the variant used in the post, but they split out the
| part that finds the decimal dot into a separate function.
| Then you can reuse that twice: to finish parsing the
| temperature, and to advance the cursor to the next row.
| pacaro wrote:
| We used to use the BCD opcodes for this kind of thing. Masking
| off the 0x30, shift digits (if you want packed BCD)
|
| I can't imagine that has been efficient for decades tho
| jsheard wrote:
| The BCD instructions are one of the few things that was dropped
| in the x86-64 transition, so it wouldn't work at all anymore.
| kragen wrote:
| they'd never been extended to work on more than 8 bits, even
| in the 8086; they only really existed for 8080 compatibility,
| and arguably the 8080 primarily had them to ease the path
| from earlier intel processors designed for, if i'm not
| mistaken, literal pocket calculators
|
| in pocket calculators you have to display the result in
| decimal after performing a single arithmetic operation, so
| it's much more efficient to do the arithmetic in bcd than to
| convert from decimal to binary, perform the arithmetic
| operation, and then convert back
| pacaro wrote:
| Yeah, if you wanted numbers greater than 99 you had to use
| them in the x87 (if you had one)
| pcwalton wrote:
| Agner says that AAA has latency 5 on Cannon Lake, so using that
| instruction is a bit faster than doing the operations manually.
| But if you vectorize (or use SWAR) I imagine you can start to
| beat the legacy instructions with larger numbers.
| jdright wrote:
| > the real mystery is how a single guy working alone could come
| up with all this in just a few days
|
| Really not a mystery. This is not that complex for people with
| low level programming experience (embedded, (older) games,
| demoscene, performance systems, etc.).
|
| Knowing these tricks and knowing how to apply them is basically
| a requirement in these areas.
| AtlasBarfed wrote:
| Imo the most important class in comp sci is computer
| architecture, specifically one that takes you from gates to a
| rudimentary ALU, registers, microcode, and some form of
| assembly coding.
|
| There is nothing magical in computers once you can do that
| deconstruction in your head.
| AlotOfReading wrote:
| Bit twiddling hacks are increasingly becoming a lost art even
| in the embedded/low-level space. There are relatively few people
| around with any day-to-day need for them and I've found myself
| deleting a lot of the legacy ones I find because the compiler
| situation has changed so much from when they were originally
| written.
|
| I think this is an excellent skills showcase, even if I don't
| think it's magical.
| at_a_remove wrote:
| Back when the word "nanotechnology" used to mean itty-bitty
| robots* rather than some extremely fine chemical processes
| done in sequence, or extremely small particle size, I had
| wondered if we could _get_ to the point of said itty-bitty
| robots before all of the brilliant people who made playable
| games for restricted consoles like the Atari had retired.
|
| * More specifically, miniaturized robots consisting of one or
| more robot arms with multiple degrees of freedom, a CPU, some
| kind of storage (perhaps something like wound-up DNA serving
| as tape on reels), executing programs written by humans.
| MichaelZuo wrote:
| That concept sounds like fantasy; cells are shaped like
| cells because 'itty-bitty robots', and most other
| configurations, don't work thermodynamically or physically
| or chemically at such small scales.
| at_a_remove wrote:
| Cells are shaped like cells (which is to say _ugly bags
| of mostly water_) because they are self-perpetuating
| chemical reactions in solution. We usually conduct our
| chemical reactions in vessels that are round along at
| least one axis, maybe almost two for our round-bottomed
| flasks. This is natural because we're enclosing a maximum
| volume of our solution with a minimal surface (cell
| membrane or borosilicate glass).
|
| On a macro scale, life tends to be bilaterally or
| radially symmetrical, but our robots are not necessarily
| like that, just considering even factory robots which are
| often an arm and little else. So, at the micro scale, I
| don't think they have to resemble "life" there, either.
| I'm hardly suggesting some kind of tinkertoy with one
| atom here and another atom there and the strut being,
| instead of wood, a covalent bond. No, I think we would
| need more atoms than that.
|
| Frankly, we haven't much tried to scale down our
| robotics. Oh, you'll find someone who will produce the
| flagellar motor (a mere forty-five nanometers, compared
| to the ten thousand nanometers of a human cell) but not
| much else. I wouldn't worry about the thermodynamics and
| quantum effects until we're down to that motor level.
| MichaelZuo wrote:
| Do you realize you can determine this for yourself, right
| now?
|
| By just taking standard equations found in textbooks,
| plugging in some numbers, and realizing that things such
| as Ideal Gas Law don't hold at such scales.
|
| You can even derive the equations yourself to confirm
| you're not being misled by the textbooks, but that takes
| more skill and knowledge than is likely feasible to
| acquire without advanced study.
| jandrewrogers wrote:
| Yes. These are standard idioms in low-level performance
| engineering. This is the kind of thing that would be a fun
| puzzle for someone with this skill set, something you could
| come up with in an hour. This particular example does
| demonstrate a relatively complete understanding of the toolset.
|
| I feel like this kind of knowledge is becoming a lost art these
| days despite having lost no relevance.
| winwang wrote:
| "a sufficient smart compiler...." (/s)
|
| I had a question recently: would an extra 65th bit in some
| registers have a good use case with stuff with SWAR?
| rep_lodsb wrote:
| Not directly, but I think it could be an alternative to the
| carry flag present on most CPU architectures (today that
| means x86 and ARM, and also historically pretty much
| anything else that was at all significant).
|
| The much-hyped RISC-V is one of the few that doesn't have a
| flag register, because it's been claimed to be a
| performance bottleneck - in my somewhat heretical opinion,
| the actual reason is that its designers are deluded into
| thinking that C and UNIX are as fundamental to software as
| transistors are to hardware, and thus anything beyond
| what's needed to support that environment is not worth
| implementing.
|
| But an extra bit per register would be like having a
| separate carry flag for each, potentially solving some
| problems with parallelization and also allowing checks for
| integer overflow at the point a value is stored into memory
| or used in a subsequent operation.
|
| Having some kind of carry flag enables various programming
| tricks that can also be useful for SWAR, for example
| subtract-with-carry of a register with itself will either
| set or clear all bits.
| anonymoushn wrote:
| These techniques are mostly not that useful because you can
| just use much bigger registers.
| WoodenChair wrote:
| For context on why you should read this: this is a well written
| article that explains in detail how some bitwise operations work
| to parse a temperature (think -10.3, or 3.4) with no conditionals
| (no if-statements or switches). This results in very fast
| parsing. The language used is Java but it could just as easily be
| C or Swift.
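| A rough Java sketch of the branchless parse described above --
| assuming, as in 1BRC, that each temperature has exactly one
| fractional digit and that 8 bytes starting at the number were
| loaded little-endian into a long (class and helper names here
| are illustrative, not taken from the original solution):

```java
class SwarTemperature {
    // Parses "-99.9".."99.9" (always one fractional digit, as in
    // 1BRC) from a little-endian long, returning tenths of a
    // degree: "-10.3" -> -103.
    static int parseTemperature(long word) {
        // ASCII digits have bit 4 set; '.' and '-' do not, so the
        // lowest clear bit 4 among bytes 1..3 locates the dot.
        int dotPos = Long.numberOfTrailingZeros(~word & 0x10101000L);
        // All-ones if the first byte is '-', else all-zeros.
        long signed = (~word << 59) >> 63;
        long designMask = ~(signed & 0xFF);      // drop the '-' byte
        // Shift so bytes 1, 2, 4 hold hundreds, tens and tenths,
        // then keep just the low (BCD) nibbles of those bytes.
        long digits = ((word & designMask) << (28 - dotPos)) & 0x0F000F0F00L;
        // One multiply sums 100*d2 + 10*d1 + d0 into bits 32..41.
        long absValue = ((digits * 0x640a0001L) >>> 32) & 0x3FF;
        return (int) ((absValue ^ signed) - signed); // apply sign
    }

    // Demo helper: pack a row prefix into a little-endian long.
    static long littleEndian(String s) {
        long w = 0;
        for (int i = Math.min(s.length(), 8) - 1; i >= 0; i--)
            w = (w << 8) | (s.charAt(i) & 0xFFL);
        return w;
    }
}
```

| Note the product carries useful bits up to bit 41, which is why
| a multiply returning only a 32-bit result is not enough for the
| magic-number step.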
| mnau wrote:
| Result is fast parsing with no error checking.
|
| If the input wasn't in expected format (e.g. different decimal
| separator), the result would be incorrect.
|
| Fun as an exercise, but hard to understand and maintain.
| plokhotnyuk wrote:
| What an amazing step by step explanation!
|
| More than 2 years ago I found that byte array view var handles
| are quite suitable for cooking up efficient SWAR routines in
| Java/Scala.
|
| See a lot of other examples of SWAR usage, like parsing Base16/64
| strings, java.time.* and number values directly from byte arrays:
|
| https://github.com/plokhotnyuk/jsoniter-scala/blob/master/js...
| quercusa wrote:
| As per the article: SWAR is "SIMD Within A Register"
| teo_zero wrote:
| I remember the first time I read about this technique, used by
| the Linux kernel to find the \0 at the end of a string. Of
| course, XORing with another char first will detect that char
| instead.
|
| But before you run to implement it as a drop-in replacement for
| strlen() all over your code, please note that in real-world cases
| the string length is not guaranteed to be a multiple of 8. When
| you read the last long, you might overrun your memory buffer and
| trigger a page fault.
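| A minimal Java sketch of that kernel trick, plus the XOR
| variant for finding an arbitrary byte such as the ';' delimiter
| (names are illustrative):

```java
class SwarScan {
    private static final long ONES  = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;

    // True if any byte of v is zero. This exact form has no false
    // positives: a byte's 0x80 survives only if the byte was 0x00.
    static boolean hasZeroByte(long v) {
        return ((v - ONES) & ~v & HIGHS) != 0;
    }

    // Index (0-7, little-endian byte order) of the first byte equal
    // to b, or -1: XOR turns matching bytes into zero bytes, then
    // the zero-byte test above locates the lowest one.
    static int firstIndexOf(long v, int b) {
        long x = v ^ (ONES * (b & 0xFF));
        long m = (x - ONES) & ~x & HIGHS;
        return m == 0 ? -1 : Long.numberOfTrailingZeros(m) >>> 3;
    }
}
```

| As the comment above notes, the caller still has to ensure the
| 8-byte load itself stays inside the buffer.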
| Karellen wrote:
| > When you read the last long, you might overrun your memory
| buffer and trigger a page fault.
|
| You can avoid the fault, though: page boundaries are always on
| a `PAGE_SIZE` alignment, which you can get with `getpagesize()`
| or `sysconf(_SC_PAGESIZE)`, so an aligned 8-byte load never
| crosses a page boundary.
|
| ...but overrunning the declared size of a memory allocation is
| still UB, even if you don't trigger a page fault, so YMMV ;-)
| o11c wrote:
| Also, as I learned recently: UB minor faults are allowed to
| change under you. You're not guaranteed to get the same value
| next time you access them, even if nothing else in your
| process is touching that memory.
| Karellen wrote:
| Anything is allowed to happen if your program invokes UB.
| All bets are off (according to the C standard/abstract
| machine) for the rest of the runtime of your program, from
| the first time it happens.
|
| In fact, because the compiler is allowed to re-order your
| program provided that the end result of any non-UB
| computation is the same (the "as-if" rule[0]), if part of
| your program invokes UB, it's possible for that to "reach
| backwards" and invalidate computations that appear to
| happen before the UB is/was invoked (by a straightforward
| reading of the source code).
|
| [0] https://en.wikipedia.org/wiki/As-if_rule
| winwang wrote:
| If people like stuff like this, the simdjson paper uses similar
| techniques, is extremely well-written and has great examples.
|
| Paper: https://arxiv.org/abs/1902.08318
|
| Github: https://github.com/simdjson/simdjson
| djmips wrote:
| This isn't SWAR, though I get how it might appeal.
| teucris wrote:
| Great article, and given the context of the code this is a
| brilliant solution, but note that it assumes the data is
| well-formed. Much of the value in an effective, battle-worn
| parser is efficient error checking and recovery.
| Karellen wrote:
| It would be interesting to see a breakdown of the ways in which
| ill-formed inputs could affect the output. And how much work it
| would be to detect such inputs and return some kind of sentinel
| error value - in the same style as the current code.
|
| (Not quite interesting enough for me to do it myself though ;-)
| tayo42 wrote:
| Pretty cool. I'll have to come back to this
|
| But can't someone just ask merykitty where they got the idea
| from?
| Terretta wrote:
| Not saying the idea came from this, but if you were coding in
| late 70s to mid 80s, you may well have written lots of code
| using these techniques because 6502 assembly was fun and
| manageable enough to just write your software in it, especially
| if you had a Beagle Bros book at hand.
| qingcharles wrote:
| Anyone who coded in the demoscene too was writing oodles of
| weird bit-twiddling guff like this, from experience...
| mihaic wrote:
| I think something like this was a lot more common knowledge
| around 20 years ago, and now it seems like a mystic art. It
| just takes a mindset of thinking that everything is bytes and
| binary; after you've seen a couple of SWAR snippets for stuff
| like popcount or finding the first significant bit, you can
| make these inferences yourself.
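| For reference, the popcount and first-significant-bit tricks
| mentioned above look roughly like this as portable SWAR in Java
| (the JDK's Long.bitCount / Long.lowestOneBit compile down to
| single instructions where the hardware has them):

```java
class SwarBits {
    // Classic SWAR population count: build 2-bit, 4-bit, then
    // 8-bit per-lane sums, then add all bytes with one multiply.
    static int popcount(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);            // 2-bit sums
        x = (x & 0x3333333333333333L)
          + ((x >>> 2) & 0x3333333333333333L);                // 4-bit sums
        x = (x + (x >>> 4)) & 0x0F0F0F0F0F0F0F0FL;            // 8-bit sums
        return (int) ((x * 0x0101010101010101L) >>> 56);      // total
    }

    // Isolate the lowest set bit without branching: two's
    // complement negation flips all bits below the lowest 1.
    static long lowestSetBit(long x) {
        return x & -x;
    }
}
```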
| djmips wrote:
| Used to do SWAR on the 68000 to good effect. 4 bytes operated on
| in parallel in a single instruction. Tricky to handle the
| overflow IIRC. I really like this article!
| gigatexal wrote:
| This article helped me understand bit fiddling better than any
| before it. Thanks to the author and the author of the really
| novel Java solution to the 1BRC challenge.
| stavros wrote:
| Can someone explain to me how the BRC isn't bottlenecked on I/O?
| I don't understand how the CPU is the bottleneck.
| benhoyt wrote:
| Local disk I/O is no longer the bottleneck on modern systems:
| https://benhoyt.com/writings/io-is-no-longer-the-bottleneck/
|
| In addition, the official 1BRC explicitly evaluated results on
| a RAM disk to avoid I/O speed entirely:
| https://github.com/gunnarmorling/1brc?tab=readme-ov-file#eva...
| "Programs are run from a RAM disk (i.o. the IO overhead for
| loading the file from disk is not relevant)"
| davidmurdoch wrote:
| I was surprised by this recently when optimizing a build
| process that uses an intermediate write-to-disk step. I
| replaced the intermediate filesystem API with an in-memory
| one and it was not measurably faster. Not even by a single
| millisecond.
| 10000truths wrote:
| How much data were you writing? If you don't fill the OS's
| page cache and the SSD controller's DRAM cache, and you're
| not blocking on fsync() or O_DIRECT or some other explicit
| flushing mechanism, then you're not going to see much of a
| difference in throughput.
| davidmurdoch wrote:
| Only a few MB written then read, so that is a likely
| explanation.
| stavros wrote:
| Ahh, thanks, I didn't know about the ramdisk. Very
| interesting about I/O not being the bottleneck, though.
| mike_hearn wrote:
| Unfortunately there are still cases where local disk I/O can
| be a serious bottleneck that you do have to be aware of:
|
| 1. Cloud VMs sometimes have very slow access even to devices
| advertised as local disks, depending on which cloud
|
| 2. Windows
|
| The latter often catches people out because they develop
| intuitions about file I/O speed on Linux, which is extremely
| fast, and then their app runs 10x slower on Windows due to
| opening file handles being much slower (and sometimes they
| don't sign their software and virus scanners make things much
| slower again).
| NikkiA wrote:
| Apparently, the cause of the long-standing Windows disk I/O
| problem was discovered a month or so ago, and MS was said
| to be working on a fix.
|
| Whether it'll be constrained to Win 11 is yet to be seen.
| nkurz wrote:
| I haven't looked at the problem closely enough to answer, but
| could we start from the other direction: what makes you think
| that memory I/O would be the bottleneck?
|
| From my limited understanding, we sequentially bring a large
| text file into L1 and then do a single read for each value. On
| most processors we can do two of these per cycle. The slow part
| will be bringing it into L1 from RAM, but sequential reads are
| pretty fast.
|
| We then do some processing on each read. At a glance, maybe 4
| cycles worth in this optimized version? Then we need to write
| the result somewhere, presumably with a random read (or two?)
| first. Is this the part you are thinking is going to be the I/O
| bottleneck?
|
| I'm not saying it's obviously CPU limited, but it doesn't seem
| obvious that it wouldn't be.
|
| Edit: I hadn't considered that you might have meant "disk I/O".
| As others have said, that's not really a factor here.
| haxen wrote:
| > maybe 4 cycles worth in this optimized version?
|
| It's quite a bit more than that, just the code discussed in
| the post is around 20 instructions, and there's a bunch more
| concerns like finding the delimiter between the name and the
| temperature, and hashtable operations. All put together, it
| comes to around 80 cycles per row.
|
| When explaining the timing of 1.5 seconds, one must take into
| account that it's parallelized across 8 CPU cores.
| nkurz wrote:
| You are right. In my defense, I meant to say "about 4
| cycles per byte of input" but in my editing I messed this
| up. I'd just deleted a sentence talking about the number of
| bytes per cycle we could bring in from RAM, but was worried
| my estimate was outdated. I started trying to research the
| current answer, then gave up and deleted it, leaving the
| other sentence wrong.
| stavros wrote:
| Sorry, yes, I meant disk I/O, I should have clarified.
| codegladiator wrote:
| The test is executed with memfs. The file and everything is in
| RAM at startup.
| menaerus wrote:
| Since the dataset is small enough to fit into the Linux
| kernel page cache, and since the benchmark is repeated for 5
| consecutive runs, the first iteration will be bottlenecked by
| disk I/O but the remaining 4 will not - i.e. all the data
| will already be in RAM (the page cache).
| superjan wrote:
| For some background: here is an interview with Daniel Lemire,
| who built an entire career on the observation that I/O is
| often not the bottleneck. https://corecursive.com/frontiers-of-
| performance-with-daniel...
| fulafel wrote:
| Not being IO bound is part of the test design in this case.
| https://github.com/gunnarmorling/1brc?tab=readme-ov-file#eva...
|
| "Programs are run from a RAM disk (i.o. the IO overhead for
| loading the file from disk is not relevant)"
| kwillets wrote:
| Multiplying digit bitfields by their respective powers of 10 and
| shift/adding via MUL is a (well?) known technique, see Lemire
| https://lemire.me/blog/2023/11/28/parsing-8-bit-integers-qui... .
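| The technique can be sketched in Java as the classic eight-digit
| parse from that line of work: subtract the ASCII bias, then fold
| digit pairs together with three multiplies (the constants each
| encode two powers of 10; the helper name is illustrative):

```java
class SwarDigits {
    // Parse exactly 8 ASCII digits held little-endian in a long
    // (first digit in the lowest byte) into an int.
    static int parseEightDigits(long val) {
        final long mask = 0x000000FF000000FFL;
        final long mul1 = 0x000F424000000064L;  // 100 + (1_000_000 << 32)
        final long mul2 = 0x0000271000000001L;  // 1 + (10_000 << 32)
        val -= 0x3030303030303030L;             // ASCII -> digit values
        val = (val * 10) + (val >>> 8);         // 2-digit sums per pair
        // Each multiply scales two 2-digit fields by their powers
        // of 10 at once; the >>> 32 collects the 8-digit total.
        return (int) ((((val & mask) * mul1)
                + (((val >>> 16) & mask) * mul2)) >>> 32);
    }

    // Demo helper: pack an 8-char digit string little-endian.
    static long littleEndian(String s) {
        long w = 0;
        for (int i = 7; i >= 0; i--)
            w = (w << 8) | (s.charAt(i) & 0xFFL);
        return w;
    }
}
```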
| MrYellowP wrote:
| > the real mystery is how a single guy working alone could come
| up with all this in just a few days of casually doing an online
| challenge with a T-shirt and a coffee mug as a reward.
|
| Why is that a mystery? There are still people out there who
| actually know how to program a CPU and actually understand what
| they're doing. The _real_ mystery is the lack of deeper
| understanding in most people who call themselves programmers,
| combined with the fact that they don't appear to know that
| they're seriously lacking.
| londons_explore wrote:
| This algorithm could be done 32x in parallel with AVX
| instructions...
|
| But Java doesn't really support SIMD use of the CPU except in a
| very few specific cases, which is sad.
| haxen wrote:
| Can you provide more details on how that would work, given
| that the input is CSV rows and not an array of integers alone,
| and that getting to the next row depends on finding the
| delimiter in the current row?
| anonymoushn wrote:
| There isn't really a dependency there. You can scan for a
| bunch of delimiters and save a bunch of row start and end
| positions and process the rows from there.
|
| If (unlike the solutions that won this challenge) you would
| like to use a hash that consumes all the bytes in the input,
| then scanning for all the delimiters and bucketizing strings
| by length also lets you avoid branch mispredictions caused by
| random lengths.
| londons_explore wrote:
| You could choose 32 start points approximately evenly through
| the file, and process starting at each point in parallel.
| dzaima wrote:
| Not quite 32x, as there are 3-5 bytes of input per number and
| a range of -999...999 for the output (and you'd generally want
| to load those consecutive bytes at the same time). 4x or 8x is
| certainly possible though; and you can even do that in Java
| with jdk.incubator.vector to some extent, although the
| available operations are somewhat limited, and Hotspot's
| codegen leaves enough to be desired.
| londons_explore wrote:
| But there are 1 billion rows... So plenty of parallelism.
| Your 32 lanes can be working on totally different parts of
| the dataset.
| dzaima wrote:
| Processing 32 inputs at 1-byte granularity means that
| you'll need to expand each of those 32 inputs across 5
| registers, and means you lose the ability to trivially
| shift them along (would need 15 blends to align the 5 input
| bytes to 3 output bytes for the 32xi8 approach, vs a single
| vpsllvd or some shuffle for 8xi32), or do any arithmetic
| wider than 8-bit (i.e. no 64-bit multiply at the core of
| the method in the article). More lanes just for the sake of
| it makes no sense (not counting unrolling as you can unroll
| anything to any amount (give or take register pressure)).
| dzaima wrote:
| Also, fwiw, re: your gather/scatter approach at
| https://news.ycombinator.com/item?id=38867074:
|
| I believe you can't get away with just 16+16-bit per state
| transition case - you have to validate that the input is
| what the entry was for, and your proposed 16-bit fields
| don't include anything for that. (having a full
| 2^32-element table won't help you, and would be a massive
| 16GB by itself anyways).
|
| I don't see how 2^16 counters can work, as there are
| clearly 400x400 = 160000 different name x temperature
| combinations to track; lowering to 160 temperatures is
| awful. (maybe you could squeeze that together with the
| output state bits though, which are somewhat fewer, but
| doesn't help much).
|
| Which leaves with just the option of increasing to 8 bytes
| per entry.
|
| That ends up as a massive table as e.g. a window of "4\nAb"
| has 400x40 possible input states, and is 10x141 cases, for
| a total of 22560000 transitions, 180MB; and there are more
| than just this window, and you need some empty space to
| reduce collisions (though some enormous L3-s maybe could
| still cover it. maybe.)
|
| The current best native solution works at ~2 cycles per
| input byte per core, so your method per 4 bytes has to beat
| 8 cycles/element. Your described solution is (cycle counts
| on Intel, as Zen 4 has much slower gather/scatter):
|
| 1. gather the next block of 4 bytes for each element; let's
| say 0.33c/elt as these are all in L1.
|
| 2. gather 8-byte values from the state transition table; as
| the table is massive and random access, that's limited by
| the RAM to cache interface; though you only need 8 bytes,
| the CPU will fetch 64B cache lines, so even with DDR5 at
| 64GB/s that's just 1B reads per second, or 3c/elt at 3GHz.
|
| 3. vpconflictd for when elements hit the same histogram
| bucket (~1.2c/elt on Intel; 0.08c/elt on Zen 4; also needs
| some fallback for when the collisions do happen (luckily
| unlikely if you mask out states that don't terminate the
| record); also means that these steps 3&4&5 cannot be
| interleaved without a separate histogram table for each).
|
| 4. gather the histogram previous value (0.5c/elt?).
|
| 5. scatter the incremented histogram values (0.7c/elt?).
|
| Furthermore, to handle the rarer temperatures/collisions
| (the distribution of names is uniform so there's no point
| in handling only a subset of those), you need some fallback
| mechanism. Targeting 400 temperatures gives you a 5% chance
| of a failed lookup per entry (i.e., for a 16-element
| register, there's a 56% chance that at least one will hit a
| bad temperature), and you must restore the failed lane to a
| working one very fast, or otherwise all lanes will end up
| dying quite quickly. Some options:
|
| 6.a. do a scalar ctz+blsr loop over a mask of the failed
| entries (a bunch of cycles and very branch-mispredicty);
|
| 6.b. increase to ~600 targeted temperatures, for only 0.2%
| chance of a failed lookup (still 3.1% over a 16-element
| vector), and do a scalar fallback when it does happen
| (still somewhat mispredicty);
|
| 6.c. store all failed entries to a buffer (scatter,
| 0.7c/elt, probably even with most entries being masked off?
| could be wrong though); and then of course whatever code
| for actually handling these 5% of more complex entries.
|
| So that's somewhere around 0.33 + 3 + 1.2 + 0.5 + 0.7 + 0.7
| = 6.43 cycles per 4 bytes, not far from the 8 cycles of
| current solutions, and that's not counting any of the
| intermediate arithmetic, assuming ideal cache & RAM
| performance (and that the CPU can actually OoO over all of
| the RAM latency and not flop on the gather/scatter of the
| histogram not having known addresses for a while), and
| doesn't leave much for the table initialization.
| neonsunset wrote:
| You don't need to do such SWAR tricks in C# - it offers a
| first-class cross-platform SIMD API instead.
|
| Evidently it works really well, as can be seen from the C#
| solution that is the fastest 1BRC entry publicly seen so far:
| https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
| EE84M3i wrote:
| This state space is pretty small - is there a super-optimizer
| that will just spit out a similar result?
___________________________________________________________________
(page generated 2024-03-10 23:01 UTC)