[HN Gopher] The Billion Row Challenge (1BRC) - Step-by-Step from...
       ___________________________________________________________________
        
       The Billion Row Challenge (1BRC) - Step-by-Step from 71s to 1.7s
        
       Author : mfiguiere
       Score  : 145 points
       Date   : 2024-02-22 14:49 UTC (8 hours ago)
        
 (HTM) web link (questdb.io)
 (TXT) w3m dump (questdb.io)
        
       | haxen wrote:
       | Cool, I see mfiguiere linked to my recent blog post! Let me share
       | a few words about it...
       | 
       | I took part in the One Billion Row challenge (1BRC). It was a lot
       | of fun, but also a great learning experience. People came up with
       | some pretty incredible optimization tricks. Taken together
       | there are a huge number of them, and they are all mingled up in
       | individual solutions. They also operate on many levels -- from
       | quite high-level down to incredibly low-level and detailed.
       | 
       | In retrospect, I could see there was a good number of tricks that
       | are relatively easy to grasp, and reusable in other projects. I
       | felt the urge to do a writeup that captures this knowledge in one
       | place, isolating and explaining each of the tricks.
        
         | e12e wrote:
         | Thank you for taking the time to write it up. Really nice to
         | see some modern Java, as well as the ideas with commentary -
         | many of which seem to generalize quite well beyond Java and
         | the JVM.
        
         | fearthetelomere wrote:
         | If you do find the time, I'm sure that writeup would be very
         | valuable!
        
       | compsciphd wrote:
       | I wonder if using a trie instead of a hash would have provided a
       | performance win.
       | 
       | If you're parsing the file row by row, stepping through the trie
       | as you process each character (which is what you have to do to
       | hash the key anyway, and similar to how they compute the int
       | value on the fly) should cost about the same. What you'd end up
       | with is micro-architectural issues around cache performance.
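       | 
       | Something like this, as a rough Rust sketch (the layout is made
       | up, and a real version would probably flatten the nodes):
       | 
       |   // Hypothetical byte-trie with 256-way fanout per node.
       |   struct TrieNode {
       |       children: Vec<Option<TrieNode>>, // 256 slots, by byte value
       |       slot: Option<usize>,             // index into station stats
       |   }
       | 
       |   impl TrieNode {
       |       fn new() -> Self {
       |           TrieNode {
       |               children: (0..256).map(|_| None).collect(),
       |               slot: None,
       |           }
       |       }
       | 
       |       // Walk (and grow) the trie for one row's key, stopping at
       |       // ';'. Returns the stats slot plus the offset where the
       |       // temperature text starts.
       |       fn lookup(&mut self, row: &[u8]) -> (&mut Option<usize>, usize) {
       |           let mut node = self;
       |           let mut i = 0;
       |           while row[i] != b';' {
       |               node = node.children[row[i] as usize]
       |                   .get_or_insert_with(TrieNode::new);
       |               i += 1;
       |           }
       |           (&mut node.slot, i + 1)
       |       }
       |   }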
        
         | haxen wrote:
         | Interestingly enough, that was my first idea. But when you
         | consider the tiny keyset size, it would be hard to beat two
         | machine instructions to calculate the hash + a single array
         | lookup.
        
         | gunnarmorling wrote:
         | There is one entry which uses a trie, but it's not at the top,
         | IIRC (which may or may not be related to using a trie; there
         | are many other relevant design decisions).
        
         | o11c wrote:
         | There's a reason everybody uses hashing - if your data is
         | trusted, it does very few operations.
         | 
         | Tries have to allocate, walk, and interpret multiple nodes.
         | Perhaps not as bad as trees (though that depends on how many
         | possible trie node representations there are, vs what the data
         | density is at each level), but still worse than the non-
         | colliding hash.
         | 
         | That said, with a finite dataset a perfect hash would probably
         | beat a general hash.
        
           | koliber wrote:
           | Was the list of city names known? Could you statically pre-
           | build the Trie structure?
        
             | haxen wrote:
             | It was known, but the requirement was that the program keep
             | working for an arbitrary keyset that conforms to the
             | specified rules (up to 10,000 unique keys, each up to 100
             | bytes in length).
        
         | paulddraper wrote:
         | Tries are rarely the most efficient option.
         | 
         | Kind of like linked lists...conceptually pleasant but rarely if
         | ever the best option for real-world perf.
        
         | anonymoushn wrote:
         | Top solutions tend to use hashes that ignore many input bytes.
         | Even if they did not, they would use hashes that consume 4 or 8
         | bytes of input at a time (you could do more if the language
         | exposed aesenc).
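         | 
         | As a rough sketch (mine, not from any particular entry),
         | "consume 8 bytes at a time" looks something like this in Rust:
         | 
         |   // FxHash-style loop: fold 8-byte chunks of the key into the
         |   // state with a multiply; the last chunk is zero-padded.
         |   fn hash_u64_chunks(key: &[u8]) -> u64 {
         |       const K: u64 = 0x9E3779B97F4A7C15; // arbitrary odd constant
         |       let mut h = 0u64;
         |       for chunk in key.chunks(8) {
         |           let mut buf = [0u8; 8];
         |           buf[..chunk.len()].copy_from_slice(chunk);
         |           h = (h ^ u64::from_le_bytes(buf))
         |               .wrapping_mul(K)
         |               .rotate_left(23);
         |       }
         |       h
         |   }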
        
       | scottlamb wrote:
       | Fun read. I haven't used Java in a long time, so even the
       | "idiomatic Java code that would pass muster with any seasoned
       | Java developer" was a bit shocking.
       | 
       | I know it's a cliche that someone brings up Rust in any
       | programming thread, but I can't help myself. ;-) Several of the
       | techniques here are easy enough in Rust that I just do them
       | routinely.
       | 
       | * Rayon makes it easy to process line-wise in reasonable-sized
       | chunks without allocating a string per line.
       | 
       | * The standard integer parsing stuff takes a slice, not an owned
       | string.
       | 
       | * The standard hash map doesn't have the same expectation "that
       | we pass in an instance of the key class that is equal to an
       | existing instance"; you can call HashMap<String>::get with a
       | &str. (One limitation is that std::collections::HashMap::entry
       | does expect a K, so there's some redundancy if you need to insert
       | afterward, but you could drop to
       | hashbrown::HashMap::raw_entry_mut to avoid that.)
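       | 
       | A rough sketch of those last two points (Stats is a placeholder;
       | the raw-entry API is hashbrown's and may be feature-gated
       | depending on the version):
       | 
       |   use std::collections::HashMap;
       | 
       |   #[derive(Default)]
       |   struct Stats { count: u64, sum: i64 }
       | 
       |   // std: lookup by &str works because String borrows as str.
       |   fn get_std<'a>(map: &'a HashMap<String, Stats>, key: &str)
       |       -> Option<&'a Stats> {
       |       map.get(key)
       |   }
       | 
       |   // hashbrown: raw_entry_mut gives find-or-insert in one lookup,
       |   // cloning the key only when it is actually missing.
       |   fn upsert(
       |       map: &mut hashbrown::HashMap<String, Stats>,
       |       key: &str,
       |       t: i64,
       |   ) {
       |       let (_, stats) = map
       |           .raw_entry_mut()
       |           .from_key(key)
       |           .or_insert_with(|| (key.to_owned(), Stats::default()));
       |       stats.count += 1;
       |       stats.sum += t;
       |   }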
        
         | haxen wrote:
         | I actually wrote a Rust version as well, and yes, it was far
         | easier to write, far less code (although not incorporating all
         | the tricks), completely safe, and pretty fast -- but still 2x
         | slower than my end result in Java.
         | 
         | HashMap was quite a bottleneck in Rust as well, for many
         | reasons, not just the key allocation problem. But it was very
         | easy to implement the same kind of custom hashtable, which
         | almost by default ends up having a better memory layout than in
         | Java.
         | 
         | https://github.com/mtopolnik/rust-1brc/blob/main/src/main.rs
         | 
         | I bet I could improve on it just a bit and match the Java time.
         | Not sure about topping it, though.
        
           | jasonjmcghee wrote:
           | I thought everyone used hashbrown now
           | 
           | https://github.com/rust-lang/hashbrown
        
             | IshKebab wrote:
             | They do because the standard library implementation was
             | changed to use Hashbrown. However the standard library
              | _API_ has a slight limitation in that you can't say "I
             | have a reference to a string. If there's an entry for it,
             | give me that. Otherwise clone the string and insert a new
             | entry. Also only do one key lookup."
             | 
             | You end up either having to do two lookups or always
             | cloning the string even if you ended up not needing it.
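             | 
             | Concretely, with std only (a sketch, names mine):
             | 
             |   type Map = std::collections::HashMap<String, u32>;
             | 
             |   // Option 1: two lookups, clone only on a miss.
             |   fn bump_twice(map: &mut Map, key: &str) {
             |       if let Some(v) = map.get_mut(key) {
             |           *v += 1;
             |       } else {
             |           map.insert(key.to_owned(), 1);
             |       }
             |   }
             | 
             |   // Option 2: one lookup, but the key is always cloned.
             |   fn bump_clone(map: &mut Map, key: &str) {
             |       *map.entry(key.to_owned()).or_insert(0) += 1;
             |   }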
        
           | scottlamb wrote:
           | > HashMap was quite a bottleneck in Rust as well, for many
           | reasons, not just the key allocation problem.
           | 
           | I'd be curious to hear more. Other than the default hash
           | function being quite expensive, and occasionally wanting to
           | use the more expressive hashbrown API, I've been happy with
           | Rust's std::collections::HashMap.
        
             | haxen wrote:
             | The reason the custom hashtable wins out isn't something
             | generally applicable. For the very specific dataset used in
             | the challenge, the hash function could be radically
             | simplified, to just a single multiply and rotate left.
             | 
             | To be fair, I didn't try out the hashbrown API. Maybe that,
             | together with FxHash, would have been enough for the
             | official dataset with 97% of keys < 16 bytes. But, I was
             | optimizing for the "10k" dataset as well, with a lot of
             | longer keys, and hashing just the first 8 bytes was the
             | winning idea.
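             | 
             | Roughly like this (my sketch; the constant and rotation
             | count are made up, not the ones from the actual solution):
             | 
             |   // Hash only the first 8 key bytes: load them as a u64,
             |   // then a single multiply and rotate-left.
             |   fn hash_first_8(key: &[u8]) -> u64 {
             |       let mut buf = [0u8; 8];
             |       let n = key.len().min(8);
             |       buf[..n].copy_from_slice(&key[..n]);
             |       u64::from_le_bytes(buf)
             |           .wrapping_mul(0x9E3779B97F4A7C15)
             |           .rotate_left(17)
             |   }
             | 
             | The min() guards short keys; anything shorter than 8 bytes
             | just gets zero-padded.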
        
               | scottlamb wrote:
               | > For the very specific dataset used in the challenge,
               | the hash function could be radically simplified, to just
               | a single multiply and rotate left.
               | 
               | That's just a custom hash function, though; couldn't you
               | do that with the standard std::collections::HashMap?
        
               | haxen wrote:
               | I didn't bother to try, so not sure. There would probably
               | be some challenges and I don't see how I'd accomplish it
               | without some branch instruction in the custom hasher.
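               | 
               | A hypothetical sketch of what that could look like with
               | Vec<u8> keys (constants made up; the branch is the
               | length check in write()):
               | 
               |   use std::collections::HashMap;
               |   use std::hash::{BuildHasherDefault, Hasher};
               | 
               |   #[derive(Default)]
               |   struct First8Hasher(u64);
               | 
               |   impl Hasher for First8Hasher {
               |       fn write(&mut self, bytes: &[u8]) {
               |           let mut buf = [0u8; 8];
               |           let n = bytes.len().min(8); // the branch
               |           buf[..n].copy_from_slice(&bytes[..n]);
               |           self.0 = u64::from_le_bytes(buf)
               |               .wrapping_mul(0x9E3779B97F4A7C15)
               |               .rotate_left(17);
               |       }
               |       fn finish(&self) -> u64 {
               |           self.0
               |       }
               |   }
               | 
               |   // std hashes a Vec<u8> as a length prefix plus the
               |   // bytes, so write() runs twice; the last call (the
               |   // key bytes) wins here.
               |   type Build = BuildHasherDefault<First8Hasher>;
               |   type FastMap<V> = HashMap<Vec<u8>, V, Build>;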
        
       | gunnarmorling wrote:
       | What an excellent post, one of the best on 1BRC I've come across
       | so far. Big shout-out to Marko for participating in the
       | challenge, making a strong push for clarifying corner cases of
       | the rules, and sharing his experiences in this amazing write-up!
        
         | haxen wrote:
         | Thanks for the praise Gunnar, but we all owe it to you for
         | organizing it, and especially sticking through thick and thin
         | when it took off, and needed lots of attention to evaluate
         | everyone and maintain a level playing field!
        
       | Keyframe wrote:
       | I must admit I only glanced at the article, now and in the past
       | when the topic appeared. How does this work, the speed of 1.7s I
       | mean? Taking a look at the input, let's say the average entry is
       | 16 bytes, and there's a billion of them; that's ~16GB. Average
       | read speed of SSDs is what, ~500MB/s? Scanning alone at max
       | throughput would take half a minute. This must be relying on
       | DDR4+ read speeds, which would probably come in under a second
       | in certain cases. Is that what's going on, a RAM disk?
        
         | haxen wrote:
         | I used to benchmark a lot on an enterprise-grade SSD 10 years
         | ago, and that was already at 2 GB/s. Today, even my laptop's
         | SSD supports multiple GB/s.
         | 
         | But you're right about the contest -- each program was measured
         | five times in a row, and there was enough RAM to fit the entire
         | file into the page cache.
         | 
         | The best time using all 32 cores (64 hyperthreads) on the
         | evaluation machine was 323 milliseconds.
        
           | Keyframe wrote:
           | I went to the original repo, it's indeed a RAM disk. That
           | makes it clear then. I was already almost excited about some
           | IO wizardry going on.
           | 
           |  _Results are determined by running the program on a Hetzner
           | AX161 dedicated server (32 core AMD EPYC(tm) 7502P (Zen2),
           | 128 GB RAM).
           | 
            | Programs are run from a RAM disk (i.e. the IO overhead for
           | loading the file from disk is not relevant), using 8 cores of
           | the machine._
        
             | haxen wrote:
             | Running without RAM cache would be a great followup to this
             | challenge. I think a time around 2-3 seconds should be
             | achievable. But, it would be highly sensitive to the
             | hardware setup and how well the disk-reading code is placed
             | on cores relative to the connection to the disk. Not sure
             | what it would take to allow hundreds of contestants to
             | benefit from the best arrangement.
        
         | pulse7 wrote:
         | PCIe 3.0 x4 SSD runs at ~3500MB/s, PCIe 4.0 x4 SSD runs at
         | ~7000MB/s and PCIe 5.0 x4 SSD runs at ~14000MB/s (yes -
         | megabytes, not megabits)!
        
           | Keyframe wrote:
           | blazing! I had ye olde SSD SATA bricks in mind though.
        
           | sroussey wrote:
            | A single KIOXIA SSD on PCIe 5 can do that 14000MB/s and 2M
            | IOPS on 4 lanes. Some servers have 100+ lanes. You can have
            | some serious speed if you want!
        
       ___________________________________________________________________
       (page generated 2024-02-22 23:00 UTC)