[HN Gopher] The Billion Row Challenge (1BRC) - Step-by-Step from...
___________________________________________________________________
The Billion Row Challenge (1BRC) - Step-by-Step from 71s to 1.7s
Author : mfiguiere
Score : 145 points
Date : 2024-02-22 14:49 UTC (8 hours ago)
(HTM) web link (questdb.io)
(TXT) w3m dump (questdb.io)
| haxen wrote:
| Cool, I see mfiguiere linked to my recent blog post! Let me share
| a few words about it...
|
| I took part in the One Billion Row challenge (1BRC). It was a lot
| of fun, but also a great learning experience. People came up with
| some pretty incredible optimization tricks. Taken together there
| are a huge number of them, and they are all mingled up inside the
| individual solutions. They also operate on many levels -- from
| quite high-level down to incredibly low-level and detailed.
|
| In retrospect, I could see there was a good number of tricks that
| are relatively easy to grasp, and reusable in other projects. I
| felt the urge to do a writeup that captures this knowledge in one
| place, isolating and explaining each of the tricks.
| e12e wrote:
| Thank you for taking the time to write it up. Really nice to
| see some modern Java, as well as the ideas with comments - many
| of which seem to generalize quite well - beyond Java and the
| jvm.
| fearthetelomere wrote:
| If you do find the time, I'm sure that writeup would be very
| valuable!
| compsciphd wrote:
| I wonder if using a trie instead of a hash would have provided a
| performance win.
|
| If you're parsing the file row by row, iterating over the trie as
| you process each character (which, as they argue for calculating
| the int value, is what you have to do to hash it anyway) should
| be similar. What you'd end up with is micro-architectural issues
| around cache performance.
| haxen wrote:
| Interestingly enough, that was my first idea. But when you
| consider the tiny keyset size, it would be hard to beat two
| machine instructions to calculate the hash + a single array
| lookup.
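|
| A rough sketch of that shape in Rust (illustrative only, not the
| post's actual code; names and sizes are made up): hash the key,
| mask it into a flat slot array, and linearly probe on the rare
| collision, so the common case really is one hash plus one indexed
| access.
|
|     struct Slot {
|         key: Vec<u8>,
|         sum: i64,
|         count: u32,
|     }
|
|     // power of two comfortably above the 10,000 max key count,
|     // so `hash & (TABLE_SIZE - 1)` works as the index mask
|     const TABLE_SIZE: usize = 1 << 15;
|
|     fn find_slot(table: &[Option<Slot>], key: &[u8],
|                  hash: u64) -> usize {
|         let mut idx = (hash as usize) & (TABLE_SIZE - 1);
|         loop {
|             match &table[idx] {
|                 // occupied by a different key: try the next slot
|                 Some(slot) if slot.key.as_slice() != key => {
|                     idx = (idx + 1) & (TABLE_SIZE - 1);
|                 }
|                 // empty or matching slot: this is the one
|                 _ => return idx,
|             }
|         }
|     }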
| gunnarmorling wrote:
| There is one entry which uses a trie, but it's not at the top,
| IIRC (which may or may not be related to using a trie; there are
| many other relevant design decisions).
| o11c wrote:
| There's a reason everybody uses hashing - if your data is
| trusted, it does very few operations.
|
| Tries have to allocate, walk, and interpret multiple nodes.
| Perhaps not as bad as trees (though that depends on how many
| possible trie node representations there are, vs what the data
| density is at each level), but still worse than the non-
| colliding hash.
|
| That said, with a finite dataset a perfect hash would probably
| beat a general hash.
| koliber wrote:
| Was the list of city names known? Could you statically pre-
| build the Trie structure?
| haxen wrote:
| It was known, but the requirement was that the program keep
| working for an arbitrary keyset that conforms to the
| specified rules (up to 10,000 unique keys, each up to 100
| bytes in length).
| paulddraper wrote:
| Tries are rarely the most efficient option.
|
| Kind of like linked lists...conceptually pleasant but rarely if
| ever the best option for real-world perf.
| anonymoushn wrote:
| Top solutions tend to use hashes that ignore many input bytes.
| Even if they did not, they would use hashes that consume 4 or 8
| bytes of input at a time (you could do more if the language
| exposed aesenc).
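|
| For instance, a hash that eats the input 8 bytes per step might
| look roughly like this (an illustrative sketch, not any specific
| entry; the mixing constant is arbitrary):
|
|     fn hash_u64_chunks(key: &[u8]) -> u64 {
|         let mut h: u64 = 0;
|         for chunk in key.chunks(8) {
|             // widen the (possibly short) final chunk to 8 bytes
|             let mut buf = [0u8; 8];
|             buf[..chunk.len()].copy_from_slice(chunk);
|             // mix one 8-byte word per iteration
|             h = (h ^ u64::from_le_bytes(buf))
|                 .wrapping_mul(0x9E3779B97F4A7C15);
|         }
|         h
|     }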
| scottlamb wrote:
| Fun read. I haven't used Java in a long time, so even the
| "idiomatic Java code that would pass muster with any seasoned
| Java developer" was a bit shocking.
|
| I know it's a cliche that someone brings up Rust in any
| programming thread, but I can't help myself. ;-) Several of the
| techniques here are easy enough in Rust that I just do them
| routinely.
|
| * Rayon makes it easy to process line-wise in reasonable-sized
| chunks without allocating a string per line.
|
| * The standard integer parsing stuff takes a slice, not an owned
| string.
|
| * The standard hash map doesn't have the same expectation "that
| we pass in an instance of the key class that is equal to an
| existing instance"; you can call HashMap<String>::get with a
| &str. (One limitation is that std::collections::HashMap::entry
| does expect a K, so there's some redundancy if you need to insert
| afterward, but you could drop to
| hashbrown::HashMap::raw_entry_mut to avoid that.)
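|
| A small illustration of that last point with the std API (my own
| example, not code from the post):
|
|     use std::collections::HashMap;
|
|     fn main() {
|         let mut totals: HashMap<String, i64> = HashMap::new();
|         totals.insert("Zagreb".to_string(), 0);
|
|         // Lookup can borrow: HashMap<String, _> accepts a &str
|         // key, so no per-line String allocation is needed here.
|         let city: &str = "Zagreb";
|         if let Some(t) = totals.get_mut(city) {
|             *t += 42;
|         }
|
|         // entry(), by contrast, takes an owned K (a String
|         // here), even if the key turns out to already exist.
|         *totals.entry(city.to_string()).or_insert(0) += 42;
|     }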
| haxen wrote:
| I actually wrote a Rust version as well, and yes, it was far
| easier to write, far less code (although not incorporating all
| the tricks), completely safe, and pretty fast -- but still 2x
| slower than my end result in Java.
|
| HashMap was quite a bottleneck in Rust as well, for many
| reasons, not just the key allocation problem. But it was very
| easy to implement the same kind of custom hashtable, which
| almost by default ends up having a better memory layout than in
| Java.
|
| https://github.com/mtopolnik/rust-1brc/blob/main/src/main.rs
|
| I bet I could improve on it just a bit and match the Java time.
| Not sure about topping it, though.
| jasonjmcghee wrote:
| I thought everyone used hashbrown now
|
| https://github.com/rust-lang/hashbrown
| IshKebab wrote:
| They do because the standard library implementation was
| changed to use Hashbrown. However the standard library
| _API_ has a slight limitation in that you can't say "I
| have a reference to a string. If there's an entry for it,
| give me that. Otherwise clone the string and insert a new
| entry. Also only do one key lookup."
|
| You end up either having to do two lookups or always
| cloning the string even if you ended up not needing it.
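|
| Concretely, the two std-only patterns look something like this (a
| sketch; `city` stands in for a &str sliced out of the input):
|
|     use std::collections::HashMap;
|
|     fn bump(totals: &mut HashMap<String, i64>, city: &str) {
|         // Option 1: allocate only on a miss, but pay for two
|         // hash lookups.
|         if let Some(t) = totals.get_mut(city) {
|             *t += 1;
|         } else {
|             totals.insert(city.to_string(), 1);
|         }
|
|         // Option 2 (instead of the above): a single lookup via
|         // entry(), but the key is cloned into a String on every
|         // call, hit or miss.
|         // *totals.entry(city.to_string()).or_insert(0) += 1;
|     }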
| scottlamb wrote:
| > HashMap was quite a bottleneck in Rust as well, for many
| reasons, not just the key allocation problem.
|
| I'd be curious to hear more. Other than the default hash
| function being quite expensive, and occasionally wanting to
| use the more expressive hashbrown API, I've been happy with
| Rust's std::collections::HashMap.
| haxen wrote:
| The reason the custom hashtable wins out isn't something
| generally applicable. For the very specific dataset used in
| the challenge, the hash function could be radically
| simplified, to just a single multiply and rotate left.
|
| To be fair, I didn't try out the hashbrown API. Maybe that,
| together with FxHash, would have been enough for the
| official dataset with 97% of keys < 16 bytes. But, I was
| optimizing for the "10k" dataset as well, with a lot of
| longer keys, and hashing just the first 8 bytes was the
| winning idea.
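|
| In Rust terms, that kind of hash is roughly (a sketch of the
| idea, not the exact code from either solution; the constant and
| rotation count here are just placeholders):
|
|     fn hash_prefix(key: &[u8]) -> u64 {
|         // look only at the first up-to-8 bytes of the key
|         let n = key.len().min(8);
|         let mut buf = [0u8; 8];
|         buf[..n].copy_from_slice(&key[..n]);
|         u64::from_le_bytes(buf)
|             .wrapping_mul(0x517cc1b727220a95)
|             .rotate_left(17)
|     }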
| scottlamb wrote:
| > For the very specific dataset used in the challenge,
| the hash function could be radically simplified, to just
| a single multiply and rotate left.
|
| That's just a custom hash function, though; couldn't you
| do that with the standard std::collections::HashMap?
| haxen wrote:
| I didn't bother to try, so not sure. There would probably
| be some challenges and I don't see how I'd accomplish it
| without some branch instruction in the custom hasher.
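|
| For reference, the std way to try it would be a custom Hasher
| plugged in through BuildHasherDefault; a rough, untested sketch
| (constants illustrative), where the length check is the kind of
| branch I mean:
|
|     use std::collections::HashMap;
|     use std::hash::{BuildHasherDefault, Hasher};
|
|     #[derive(Default)]
|     struct PrefixHasher(u64);
|
|     impl Hasher for PrefixHasher {
|         fn write(&mut self, bytes: &[u8]) {
|             // only the first up-to-8 bytes of each write get
|             // mixed in; the min() is the branch in question
|             let n = bytes.len().min(8);
|             let mut buf = [0u8; 8];
|             buf[..n].copy_from_slice(&bytes[..n]);
|             self.0 = (self.0 ^ u64::from_le_bytes(buf))
|                 .wrapping_mul(0x517cc1b727220a95)
|                 .rotate_left(17);
|         }
|         fn finish(&self) -> u64 {
|             self.0
|         }
|     }
|
|     type PrefixMap<V> =
|         HashMap<Vec<u8>, V, BuildHasherDefault<PrefixHasher>>;
|
| The map would then be built with HashMap::default() or
| with_hasher() rather than new(), since new() is only defined for
| the default RandomState.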
| gunnarmorling wrote:
| What an excellent post, one of the best on 1BRC I've come across
| so far. Big shout-out to Marko for participating in the
| challenge, making a strong push for clarifying corner cases of
| the rules, and sharing his experiences in this amazing write-up!
| haxen wrote:
| Thanks for the praise Gunnar, but we all owe it to you for
| organizing it, and especially sticking through thick and thin
| when it took off, and needed lots of attention to evaluate
| everyone and maintain a level playing field!
| Keyframe wrote:
| I must admit I only glanced at the article, and did the same in
| the past when the topic came up. How does this work, the speed of
| 1.7s I mean? Taking a look at the input, let's say the average
| entry is 16 bytes and there's a billion of them; that's ~16 GB.
| The average read speed of SSDs is what, ~500 MB/s? Scanning alone
| at max throughput would take half a minute. This must be relying
| on DDR4+ read speeds, which would probably come in under a second
| in certain cases. Is that what's going on, a RAM disk?
| haxen wrote:
| I used to benchmark a lot on an enterprise-grade SSD 10 years
| ago, and that was already at 2 GB/s. Today, even my laptop's
| SSD supports multiple GB/s.
|
| But you're right about the contest -- each program was measured
| five times in a row, and there was enough RAM to fit the entire
| file into the page cache.
|
| The best time using all 32 cores (64 hyperthreads) on the
| evaluation machine was 323 milliseconds.
| Keyframe wrote:
| I went to the original repo, it's indeed a RAM disk. That
| makes it clear then. I was already almost excited about some
| IO wizardry going on.
|
| _Results are determined by running the program on a Hetzner
| AX161 dedicated server (32 core AMD EPYC(tm) 7502P (Zen2),
| 128 GB RAM).
|
| Programs are run from a RAM disk (i.e. the I/O overhead for
| loading the file from disk is not relevant), using 8 cores of
| the machine._
| haxen wrote:
| Running without RAM cache would be a great followup to this
| challenge. I think a time around 2-3 seconds should be
| achievable. But, it would be highly sensitive to the
| hardware setup and how well the disk-reading code is placed
| on cores relative to the connection to the disk. Not sure
| what it would take to allow hundreds of contestants to
| benefit from the best arrangement.
| pulse7 wrote:
| PCIe 3.0 x4 SSD runs at ~3500MB/s, PCIe 4.0 x4 SSD runs at
| ~7000MB/s and PCIe 5.0 x4 SSD runs at ~14000MB/s (yes -
| megabytes, not megabits)!
| Keyframe wrote:
| blazing! I had ye olde SATA SSD bricks in mind, though.
| sroussey wrote:
| A single KIOXIA SSD on PCIe 5.0 can do that 14000 MB/s and 2M
| IOPS on 4 lanes. Some servers have 100+ lanes. You can have some
| serious speed if you want!
___________________________________________________________________
(page generated 2024-02-22 23:00 UTC)