[HN Gopher] My favourite small hash table
       ___________________________________________________________________
        
       My favourite small hash table
        
       Author : speckx
       Score  : 140 points
       Date   : 2025-12-09 14:47 UTC (19 hours ago)
        
 (HTM) web link (www.corsix.org)
 (TXT) w3m dump (www.corsix.org)
        
       | clbrmbr wrote:
       | Awesome blog! Looking at the code I feel like there's a kindred
       | soul behind that keyboard, but there's no About page afaict. Who
       | beeth this mysterious writer?
        
         | LargoLasskhyfv wrote:
         | https://github.com/corsix
        
         | loeg wrote:
         | He's done some interesting work with crc32 ~recently:
         | 
         | https://www.corsix.org/content/fast-crc32c-4k
         | 
         | https://github.com/corsix/fast-crc32/
        
       | judofyr wrote:
        | Is there a specific reason to store the key + value as a
        | `uint64_t` instead of just using a struct like this?
        | 
        |     struct slot {
        |         uint32_t key;
        |         uint32_t value;
        |     };
        
         | zimpenfish wrote:
         | Maybe trying to avoid struct padding? Although having done a
         | quick test on {arm64, amd64} {gcc, clang}, they all give the
         | same `sizeof` for a struct with 2x`uint32_t`, a struct with a
         | single `uint64_t`, or a bare `uint64_t`.
        
           | simonask wrote:
            | In any struct where all fields have the same size (and no
            | field type requires higher alignment than its size), it is
            | guaranteed on every (relevant) ABI that there are no
            | padding bytes.
        
             | zimpenfish wrote:
             | TIL! Thanks!
        
         | nitnelave wrote:
         | The alignment constraint is different, which they use to be
         | able to load both as a 64-bit integer and compare to 0 (the
         | empty slot).
         | 
         | You could work around that with a union or casts with explicit
         | alignment constraints, but this is the shortest way to express
         | that.
        
           | Asooka wrote:
           | In that case you can use bit fields in a union:
            | 
            |     union slot {
            |         uint64_t keyvalue;
            |         struct {
            |             uint64_t key: 32;
            |             uint64_t value: 32;
            |         };
            |     };
           | 
           | Since both members of the union are effectively the exact
           | same type, there is no issue. C99: "If the member used to
           | access the contents of a union is not the same as the member
           | last used to store a value, the object representation of the
           | value that was stored is reinterpreted as an object
           | representation of the new type". Meaning, you can initialise
           | keyvalue and that will initialise both key and value, so
            | writing `union slot s = {0}` initialises everything to 0. One
           | issue is that the exact layout for bit fields is
           | implementation defined, so if you absolutely need to know
           | where key and value are in memory, you will have to read
           | GCC's manual (or just experiment). Another is that you cannot
           | take the address of key or value individually, but if your
           | code was already using uint64_t, you probably don't need to.
           | 
           | Edit: Note also that you can cast a pointer to slot to a
           | pointer to uint64_t and that does not break strict aliasing
           | rules.
        
             | nitnelave wrote:
              | You can probably get away with just a union between one
              | 64-bit integer and two 32-bit integers.
        
           | crest wrote:
            | C has finally gained `alignas`, so you can avoid the union
            | hack, or you could just rely on malloc always returning the
            | maximum alignment anyway.
        
         | loeg wrote:
         | No real reason. Slightly terser to compare with zero to find an
         | empty slot.
        
         | mwkaufma wrote:
         | Or better, just store keys and values in separate arrays, so
         | you can have compact cache lines of just keys when probing.
        
       | Aardwolf wrote:
       | > The table occupies at most 32 GiB of memory.
       | 
       | This constraint allows making a linear array of all the 4 billion
       | values, with the key as array index, which fits in 16 GiB.
       | Another 500 MiB is enough to have a bit indicating present or not
       | for each.
       | 
       | Perhaps text strings as keys and values would give a more
       | interesting example...
        
         | dragontamer wrote:
         | This hashtable implements a multiset. Not (merely) a simple
         | set.
        
         | re wrote:
         | > a linear array of all the 4 billion values, with the key as
         | array index, which fits in 16 GiB
         | 
         | The hash table has the significant advantage of having a _much_
         | smaller _minimum_ size.
         | 
         | > Perhaps text strings as keys and values would give a more
         | interesting example
         | 
         | Keep reading to "If keys and values are larger than 32 bits"
        
         | kazinator wrote:
         | That's not a constraint as much as the worst case size.
         | 
         | If you actually only have a handful of entries in the table, it
         | is measurable in bytes.
         | 
         | A linear array of all 4 billion possible values will occupy 16
         | GiB (of virtual memory) upfront. We have then essentially
          | replaced the hash table with a radix tree --- the one made up
          | of the page directories and tables of the VM system. If only the
         | highest and lowest value are present in the table, then we only
         | need the highest and lowest page (4 kB or whatever) to be
         | mapped. It's not very compact for small sets; storing N numbers
         | in random locations could require as many as N virtual memory
         | pages to be committed.
        
       | artur44 wrote:
       | I always find it interesting how often the simplest hash table
       | layouts end up performing best in real workloads. Once you avoid
       | pointer chasing and keep everything in a compact array, CPU
       | caches do most of the heavy lifting.
       | 
       | It's also a good reminder that clarity of layout often beats more
       | "clever" designs, especially when the dataset fits comfortably in
       | memory.
        
         | hinkley wrote:
         | Until you get high memory contention from the rest of the code.
         | Once eviction gets high you get some pretty counterintuitive
         | improvements by fixing things that seem like they shouldn't
         | need to be fixed.
         | 
         | My best documented case was a 10x speed up from removing a
         | double lookup that was killing caches.
        
           | crest wrote:
            | My best improvement was just bit-interleaving both axes of a
            | 2x32-bit integer coordinate (aka z-curve). I obtained a
            | factor of ~100 (yes, a factor, not a percent) throughput
            | improvement over locality in only one dimension. All it took
            | was ~10 lines of bit twiddling. The runtime went from a bit
            | above 300ms to slightly less than 3ms.
        
             | throw-the-towel wrote:
                | I'm wondering how you folks even come up with this kind
                | of optimisation.
        
               | hinkley wrote:
               | Sheer stubbornness.
               | 
                | Profilers lie, and some more than others. I've gotten 3x
                | from code with a perfectly rectangular profile output.
               | 
               | Part of it comes down to a trick game devs steal from
               | finance: give each task a budget and keep it to the
               | budget even if it's not the tall tent pole.
               | 
               | You should not spend 10% of your response time on
               | telemetry and logging combined. Yet I pulled 10% TTFB out
               | of just the logging and telemetry code on a project. It
               | was a frog boiling situation. Every new epic used the new
               | code and determining the cumulative cost wasn't easy.
        
             | hinkley wrote:
             | End to end gets weird. I was asked to look at an admin
             | page, nobody could figure out why it was 30s. Literally the
             | first thing I tried got it under 4 and the second down to
             | three. It was pulling the same list of rows twice, applying
             | two filters and then looking at the intersection. I changed
             | the signature to send the list as input instead of the
             | query constraints. Then I changed them to avoid the
             | intersect.
             | 
              | If you had asked me to bet on which one would have the
              | bigger impact, I would have split the difference.
             | 
             | My second favorite was similar. Two functions making a call
             | instead of sharing the answer. Profiler said 10%
             | cumulative. I removed half. Instead of 5% I got 20%. Which
             | just demonstrates how much data a profiler cannot show you.
        
         | saltcured wrote:
         | To me, these sorts of examples always seem contrived. To the
         | first order, I've never had a real hash table problem that was
         | on machine word keys.
         | 
         | I've nearly always had a variable length string or other
         | complex structure that was being hashed, not their handles.
         | 
         | Back in my early career in C, this would be a generic API to
         | hash and store void pointers, but the pointers were not being
         | hashed. The domain-specific hash function needed to downcast
         | and perform the appropriate remote memory access to fetch the
         | variable-length material that was actually being hashed.
        
         | jasonwatkinspdx wrote:
         | I'm a big fan of the basic power of two choices hash table
         | design. It's simple to understand and implement, has reasonable
         | constant factors, and hits high load factors on real world
         | datasets.
         | 
         | You can use more elaborate probe and relocation schemes, but
         | just choosing the less full bucket and resizing if both choices
         | are full gets you surprisingly far.
        
           | kccqzy wrote:
           | Power-of-two length is also the natural choice for a growable
           | linear array where the designer has no idea how many elements
           | there will be.
        
       | ww520 wrote:
        | This hash table is pretty good: at best a lookup needs only one
        | memory read when there's no collision. Randomized keys can still
        | collide to some degree, though.
        
       | air217 wrote:
       | Used Gemini to explain this. Very interesting and I think Gemini
       | did a good job explaining the Robin hood fairness mechanism
       | 
       | https://gemini.google.com/share/5add15a1c12f
        
       | Joker_vD wrote:
       | > If the Zba extension is present, sh3add.uw is a single
       | instruction for zero-extending idx from 32 bits to 64,
       | multiplying it by sizeof(uint64_t), and adding it to slots.
       | 
        | Yay, we've got an equivalent of the SIB byte, but as three
        | (six?) separate opcodes. Well, sub-opcodes.
       | 
       | It's a shame though that Zcmp extension didn't get into RVA23
       | even as an optional extension.
        
       | attractivechaos wrote:
        | This is a smart implementation of Robin Hood hashing that I was
        | not aware of. In my understanding, a standard implementation
        | keeps the probe length of each entry. This one avoids that due
        | to its extra constraints. I don't quite understand the following
        | strategy, though:
       | 
       | > _To meet property (3) [if the key 0 is present, its value is
       | not 0] ... "array index plus one" can be stored rather than
       | "array index"._
       | 
       | If hash code can take any value in [0,2^32), how to define a
       | special value for empty buckets? The more common solution is to
       | have a special key, not a special hash code, for empty slots,
       | which is easier to achieve. In addition, as the author points
        | out, supporting generic keys requires storing 32-bit hash
       | values. With the extra 4 bytes per bucket, it is not clear if
       | this implementation is better than plain linear probing (my
       | favorite). The fastest hash table implementations like boost and
       | abseil don't often use Robin Hood hashing.
        
       | Panzerschrek wrote:
       | I don't understand the part about key removal. Is it correct to
       | shift values to the left? Can't it break lookups for keys
       | inserted into their natural positions?
        
       | unixhero wrote:
       | I am dipping my toes in programming. But I can't follow this
       | without graphical representations of the tables as the author
       | (brilliantly) walks through it.
        
       ___________________________________________________________________
       (page generated 2025-12-10 10:01 UTC)