[HN Gopher] Optimizing Open Addressing
___________________________________________________________________
Optimizing Open Addressing
Author : TheNumbat
Score : 108 points
Date : 2023-04-02 17:16 UTC (5 hours ago)
(HTM) web link (thenumb.at)
(TXT) w3m dump (thenumb.at)
| mgaunard wrote:
| I find there is too much emphasis on open addressing in blog
| posts and such.
|
| There are a lot of real and common data structures where chaining
| is much better suited.
|
| For example, whenever your data is part of one or more other
| node-based data structures and you layer an index on top of
| that.
|
| Chaining also tends to waste a lot less memory.
| dragontamer wrote:
| In my experience / tests, it's way easier to write a high-
| performance open-addressing hash table than a high-performance
| chaining one.
|
| That being said, a basic chaining table is easier to write.
|
| > Chaining also tends to waste a lot less memory.
|
| How so? A typical 32-bit integer or 32-bit float uses 4 bytes,
| but a typical pointer is 8 bytes, and there's also the internal
| malloc/new chunk header to consider.
|
| That means that to store a 4-byte integer in a chained hash
| table, you need 8 bytes (pointer) + 8 bytes (glibc malloc chunk
| header) + 4 bytes (integer/float), or 20 bytes total in practice.
|
| If you have, say, 50 integers inside the hashmap, you'll use 50
| integers * 20 bytes each == 1000+ bytes for the pointers/data
| alone, plus even more for the table itself. (E.g. a table of size
| 50 needs 50 * 8-byte head pointers even when empty, for a total
| of 1000 bytes + 400 bytes == 1400 bytes or so.)
|
| In contrast, each slot of an open-addressing hash table of 4-byte
| integers takes up only 4 bytes. So 50 integers in a hashmap at
| ~50% load factor (i.e. size the table at 128 slots or so) come to
| 128 * 4 == 512 bytes.
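|
| A minimal sketch of that arithmetic (the 8-byte glibc chunk
| header is an assumption; the numbers just mirror the estimates
| above):
|
|     #include <cstddef>
|     #include <cstdio>
|
|     int main() {
|         // Assumptions: 8-byte pointers, 8-byte glibc chunk header.
|         const std::size_t n = 50;
|         std::size_t per_node = 8 /*next*/ + 8 /*chunk hdr*/ + 4 /*int*/;
|         std::size_t chained  = n * per_node + n * 8; // + bucket heads
|         std::size_t open     = 128 * 4;  // ~50% load factor, 4B slots
|         std::printf("chained: %zu, open: %zu\n", chained, open);
|         // prints: chained: 1400, open: 512
|     }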
|
| EDIT: It's because of this huge practical difference in memory
| use that I'm pretty sure open addressing performs so well in
| comparison. When your data structure is half the size of the
| competition or smaller, it's easier to stay in L1 cache, among
| other size benefits.
| mgaunard wrote:
| Nodes don't have to be allocated with malloc (that is
| actually the worst thing you could possibly do).
|
| An open-addressing hash table only performs well up to 50-75%
| bucket usage. Chaining doesn't care and performs the same up
| to 100%.
| a1369209993 wrote:
| > (that is actually the worst thing you could possibly do)
|
| Nonsense, there are many worse things you could do. For
| example, you could allocate nodes with mmap, at 4KB per
| node. (Which is a thing I've actually _done_ in the context
| of cursed bootstrap code where malloc isn't available and
| having more than a handful of (in that case doubly-linked-
| list) nodes means the environment is deranged anyway.)
| dragontamer wrote:
| > Nodes don't have to be allocated with malloc (that is
| actually the worst thing you could possibly do).
|
| I mean... the most obvious place to place nodes is...
| inside the table itself. Also known as Open Addressing.
|
| Or what, are you going to implement a second, custom heap
| algorithm to manage your nodes? I guess there are "buddy
| allocators", but buddy allocators aren't exactly free
| either. Whatever method you're using to keep track of nodes
| is going to have some level of inefficiency.
|
| Whatever allocation you're doing to get these nodes (above
| and beyond generic malloc) is time you could have spent
| writing open addressing instead.
|
| > An open-addressing hash table only performs well up to
| 50-75% bucket usage. Chaining doesn't care and performs the
| same up to 100%.
|
| Hardly an apples-to-apples comparison. As I pointed out, a
| 50% load factor on 50x uint32 with open addressing is only
| ~400 bytes used.
|
| While a 100% load factor on 50x uint32 is 1400 bytes used
| with chaining (plus additional overhead in practice due to
| malloc fragmentation).
| mgaunard wrote:
| I already explained. Objects in non-trivial data
| structures are part of several data structures
| concurrently.
| dragontamer wrote:
| Perhaps we need to start talking with actual code? Here's
| just a simple idea that's in my brain right now.
|
|     template <class Data> struct HashNode {
|         // Probably should be shared_ptr<HashNode<Data>>, but
|         // that's now adding even more inefficiencies, like a
|         // ref_count per element.
|         HashNode<Data>* nextPointerChain;
|         // Maybe Data* d if you're sharing it with other
|         // data structures?
|         Data d;
|     };
|
|     template <class Data, int size> struct ChainHashTable {
|         HashNode<Data> theTable[size];
|     };
|
|     template <class Data, int size> struct LinearProbingHashTable {
|         Data theTable[size];
|         // sized to "size"; 0 means empty, 1 means occupied
|         std::vector<bool> occupied;
|     };
|
| nextPointerChain takes up 8 bytes, even when it's nullptr.
| No other data structure needs to have the nextPointerChain.
| If we use shared_ptr<HashNode<Data>> as per typical modern
| C++, there are even more inefficiencies that I forgot about.
| EDIT: You probably can get away with unique_ptr, now that I
| think of it.
|
| ----------
|
| Pointers aren't free, btw. If you're sharing a pointer with
| many different parts of your program, that's definitely one
| of those things you'll want to start getting rid of. Not
| only does a pointer cost 8 bytes, but it's also
| _SIGNIFICANTLY_ slower and cache-unfriendly in practice.
|
| L1 cache works by reading data in and around anything you
| access. If you're only using 8 bytes of a 64-byte cache line
| for a pointer indirection (8/64), you're wasting the other
| 56 bytes of that fetch.
|
| Linear probing is extremely fast because it tends to
| blitz through L1 cache thanks to locality of data. Simple
| linear probing (or Robin Hood augmented probing) is
| probably the fastest, and simplest, approach for modern
| CPUs.
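|
| A minimal linear-probing lookup sketch (names and the hash
| constant are illustrative; assumes a power-of-two table that
| is never completely full):
|
|     #include <cstddef>
|     #include <cstdint>
|     #include <vector>
|
|     struct LinearProbeSet {
|         std::vector<uint32_t> slots; // power-of-two size, 0 == empty
|         explicit LinearProbeSet(std::size_t pow2) : slots(pow2, 0) {}
|
|         bool contains(uint32_t key) const {
|             std::size_t mask = slots.size() - 1;
|             std::size_t i = (key * 2654435761u) & mask; // Fibonacci hash
|             while (slots[i] != 0) {      // probe adjacent slots; they
|                 if (slots[i] == key)     // usually share a cache line
|                     return true;
|                 i = (i + 1) & mask;
|             }
|             return false;                // hit an empty slot: not present
|         }
|     };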
| mgaunard wrote:
| You still fail to understand what being part of multiple
| data structures means.
|
| Shared pointers, a separate container to check occupancy,
| suggesting a container of pointers: all these things suggest
| you have no idea how to do hybrid data structures.
|
| The nodes contain the data along with multiple pointers to
| chain them into the various data structures they are part of
| (trees, lists, hash tables, etc.). The object in question
| contains state that is several cache lines long.
|
| You cannot just put the nodes inside your buckets: you don't
| get to control where the nodes live, and moving them would
| invalidate the other data structures. Though I suppose it's
| a possibility if you know ahead of time how many nodes
| you're going to need, but in that case you may be able to go
| for perfect hashing to begin with (unless you're building
| some kind of fixed-size cache).
|
| A given object, for example, can be part of several hash
| maps, based on the various pieces of state it is useful to
| index by.
|
| In practice you'd use a pool to allocate all your nodes in
| blocks. There is no per-node overhead with this allocation
| strategy.
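|
| For instance, a minimal block-pool sketch (illustrative, not
| any particular library's allocator); nodes come out of large
| slabs, so there is no per-node malloc header:
|
|     #include <cstddef>
|     #include <new>
|     #include <vector>
|
|     template <class Node>
|     struct NodePool {
|         std::vector<Node*> slabs;
|         std::size_t next = 0;
|         static constexpr std::size_t kSlab = 1024;
|
|         Node* alloc() {          // bump-pointer allocation from a slab
|             if (slabs.empty() || next == kSlab) {
|                 slabs.push_back(static_cast<Node*>(
|                     ::operator new(kSlab * sizeof(Node))));
|                 next = 0;
|             }
|             return &slabs.back()[next++]; // caller placement-news here
|         }
|     };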
| TheNumbat wrote:
| Agreed - I do mention in the post that open addressing is
| impractical when using intrusive linked lists, which are
| common in low level data structures.
| dragontamer wrote:
| So at best you've removed just 8 bytes, in exchange for
| rather inconvenient programming:
|     LinearProbingHashTable<FooBar*, size*2>
|     ChainHashTable<FooBar*, size>
|
| Note that LinearProbingHashTable only uses 8 bytes per
| pointer, while ChainHashTable still needs nextPtr, taking
| up 16 bytes.
|
| The *2 is for a 50% load factor on the linear table. But
| linear still wins at, say, a 75% load factor (simple linear)
| or 95% (which is doable with Robin Hood).
| kevin_thibedeau wrote:
| > only performs well up to 50-75% bucket usage.
|
| Robin Hood performs well into the high 90s.
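|
| The trick, sketched minimally (names illustrative; assumes
| the table is grown before it fills): an element that has
| probed farther than the resident one steals its slot, which
| keeps probe lengths uniform even at high load:
|
|     #include <cstddef>
|     #include <cstdint>
|     #include <utility>
|     #include <vector>
|
|     struct Slot { uint32_t key = 0; uint8_t dist = 0; bool used = false; };
|
|     struct RobinHoodSet {
|         std::vector<Slot> slots;  // power-of-two size
|         explicit RobinHoodSet(std::size_t pow2) : slots(pow2) {}
|
|         void insert(uint32_t key) {
|             std::size_t mask = slots.size() - 1;
|             std::size_t i = (key * 2654435761u) & mask;
|             Slot cur{key, 0, true};
|             while (slots[i].used) {
|                 if (slots[i].key == cur.key) return; // already present
|                 if (slots[i].dist < cur.dist)        // resident is "richer";
|                     std::swap(slots[i], cur);        // it yields the slot
|                 i = (i + 1) & mask;
|                 ++cur.dist;
|             }
|             slots[i] = cur;
|         }
|     };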
| anarazel wrote:
| One situation where I painfully learned that it doesn't is
| when you iterate over one hash table to fill another. To
| defend against that, one needs to add some per-hashtable
| randomized state to the hash IV.
|
| Through bad experience I also learned that you need to grow
| not just due to fill factor, but also due to disproportionate
| chain length, even with the above defense in place.
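|
| A minimal sketch of that defense (the mixer is a
| splitmix64-style placeholder, not any particular table's
| hash): each table gets its own seed, so two tables never
| agree on a probe order:
|
|     #include <cstdint>
|     #include <random>
|
|     struct SeededHash {
|         uint64_t seed = std::random_device{}(); // fresh per table
|         uint64_t operator()(uint64_t x) const {
|             x ^= seed;                    // per-table state in the IV
|             x *= 0x9e3779b97f4a7c15ull;   // splitmix64-style mixing
|             x ^= x >> 32;
|             return x;
|         }
|     };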
| 10000truths wrote:
| > To defend against that one needs to add some per-
| hashtable randomized state into the hash IV.
|
| You should almost always be doing this anyway, since most
| hash tables use user-provided data as keys, which opens up a
| well-known avenue for denial-of-service attacks:
|
| https://lwn.net/Articles/474912/
| anarazel wrote:
| I think that's often fixed using a _per run_ random seed
| rather than a per-table seed. The patch in the Python bug
| linked from that article seems to do that, for example.
| grive wrote:
| I would tend to agree.
|
| This article conveniently designs its benchmark around exactly
| the case where open addressing works well.
|
| In almost all of the map implementations I have used so far,
| chaining was necessary and intrusive linked lists were used.
|
| An open-addressing scheme (cuckoo) was preferable, however, for
| a read-mostly map allowing concurrent reads and locked writes.
|
| I would be more interested in an article exploring the
| compromises involved there, e.g. the cost of the additional
| indirection when accessing the value.
|
| Another criticism of this article: the many tables throughout
| reference different baselines. A single baseline would be much
| preferable, to keep a sense of scale for each change.
| [deleted]
| tialaramex wrote:
| I'm not convinced there are "a lot" of these real and common
| structures.
|
| They certainly do arise, particularly if you're also doing
| concurrency: linked lists are attractive because a compare-
| exchange type mechanism is ideally suited to such structures,
| and that has been implemented cheaply in hardware. But this
| article was about the _default_, and we don't want defaults
| tailored to niche uses. The _default_ T-shirt size shouldn't
| be big enough for The Rock. The _default_ ceiling height in new
| buildings shouldn't be 1.5 metres. The _default_ beverage
| offered at your restaurant shouldn't be Mountain Dew Code Red.
| It's OK that these things are options, that somebody who wants
| them can have them, but they're bad defaults.
|
| I'd want to see some measurements from real-world projects to
| believe chaining should be the default. If I take, say, the
| Chromium codebase and the LibreOffice codebase, and I look at
| the places they use any type of hash table, how often are they
| going to be for stuff that'd be "better suited" to chaining? Am
| I going to consistently find 50%? 10%? 1%?
| mgaunard wrote:
| Pretty much 95% of all hybrid data structures.
|
| Now, hybrid data structures are only used in systems
| programming, because they require you to be particularly
| careful about object lifetimes and such.
| eternalban wrote:
| Choice-of-two + bias (pick L) + fixed-size bin slices of the
| backing arrays = very high loading, strictly 2 ops (max) per
| R/W, trivial to write.
|
| I typically just use a very fast cryptographic hash (Blake3 is
| awesome) and have plenty of bits for h1, h2 (you can have
| additional choices but it's a fast-diminishing return).
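|
| A minimal sketch of that scheme (bin size and names are
| illustrative): hash twice, inspect both candidate bins, and
| place the key in the emptier one, biased left on ties:
|
|     #include <array>
|     #include <cstddef>
|     #include <cstdint>
|     #include <vector>
|
|     constexpr std::size_t kBin = 8;  // fixed-size bin slice
|     struct Bin { std::array<uint64_t, kBin> keys{}; std::size_t n = 0; };
|
|     struct TwoChoiceTable {
|         std::vector<Bin> bins;
|         explicit TwoChoiceTable(std::size_t nbins) : bins(nbins) {}
|
|         bool insert(uint64_t key, uint64_t h1, uint64_t h2) {
|             Bin& a = bins[h1 % bins.size()];
|             Bin& b = bins[h2 % bins.size()];
|             Bin& pick = (a.n <= b.n) ? a : b; // bias: prefer h1 on ties
|             if (pick.n == kBin) return false; // bin full: grow/rehash
|             pick.keys[pick.n++] = key;
|             return true;
|         }
|     };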
| utopcell wrote:
| Reads like a great summary of the hash map ideas that work in
| practice. I would have loved to see flat_hash_map<> thrown into
| the benchmark mix.
| TheNumbat wrote:
| I just benchmarked absl::flat_hash_map and got results
| comparable to Robin Hood with a load factor between 75% and
| 90%, which makes sense. It's also faster for looking up missing
| keys, so it seems like a good option. I didn't benchmark the
| maximum probe lengths, though, so I'm not sure on that front.
| camel-cdr wrote:
| I tried adding the maps to the benchmark [1], but I wasn't
| able to, since they aren't type-generic yet. You may want to
| benchmark against [2] and [3], which are the best-performing
| ones in the above benchmark.
|
| [1] https://github.com/martinus/map_benchmark/
|
| [2] https://github.com/martinus/unordered_dense/blob/main/includ...
|
| [3] https://github.com/ktprime/emhash
|
| Edit: I had the wrong link for [2] and [3], somehow...
| infinet wrote:
| I learned about open addressing a long time ago when I was
| optimizing Dnsmasq. The scenario involved looking up a very large
| list of domain names to block, or to combine with ipset in order
| to customize routing based on domain names. Open addressing
| worked very well on these large immutable lists.
| camel-cdr wrote:
| Sometimes a very basic fixed-size chaining hash table is
| actually the best you can do.
|
| Take, for example, tcc, which is a very fast C compiler.
|
| It just uses a basic chaining hash table: `Node *nodes[16384];`.
|
| Since most translation units have far fewer than 16384 tokens,
| most lookups result in a direct hit.
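|
| A minimal sketch of that shape (the Node fields are
| illustrative, not tcc's actual TokenSym):
|
|     #include <cstdint>
|     #include <cstring>
|
|     struct Node { Node* next; uint32_t hash; const char* name; };
|     static Node* nodes[16384];            // fixed bucket count
|
|     Node* lookup(const char* name, uint32_t h) {
|         for (Node* n = nodes[h & 16383]; n; n = n->next)
|             if (n->hash == h && std::strcmp(n->name, name) == 0)
|                 return n;                 // usually the very first node
|         return nullptr;
|     }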
| tialaramex wrote:
| _Is_ that the best TCC could have done here? It's easy to
| write this in C, which is one obvious reason for TCC to do it,
| but it's not at all clear it's the best way.
|
| What TCC is actually storing in TokenSym (which you've named
| "Node" in this context) is four pointers to other information
| about symbols, a unique ID, and a string.
|
| If we used a compact inline string (like CompactString), we
| could squeeze all this into 60 bytes, unless the string is more
| than 24 bytes long, such as TCC's own
| warn_implicit_function_declaration or
| is_compatible_unqualified_types, which would need a heap
| allocation, as they do today, for the string itself.
|
| If we do _that_, we can build a linear-probed, open-addressed
| hash table and avoid paying for the extra indirection any time
| the symbol name is 24 bytes or shorter. I'd expect this to be
| markedly better.
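|
| Roughly this layout, as a sketch (names illustrative, not
| tcc's actual types):
|
|     #include <cstdint>
|
|     struct HeapStr { char* ptr; uint64_t len; };
|
|     struct CompactStr {           // 24 bytes of inline string storage
|         union {
|             char    inline_buf[24]; // used when the name fits
|             HeapStr heap;           // heap fallback for long names
|         };
|         // (length/discriminant bits elided for brevity)
|     };
|
|     struct SymEntry {   // ~60 bytes of payload (64 with padding)
|         void*      sym[4];  // four pointers to symbol information
|         uint32_t   id;      // unique token ID
|         CompactStr name;
|     };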
|
| But it's a lot of work in C, so it isn't a surprise TCC doesn't
| do it.
| mananaysiempre wrote:
| TCC has the advantage of being a batch process, so it can
| insert things using open addressing without ever considering
| how it intends to delete them, then just drop the whole thing
| on the floor at the end. Of course, if you are also a batch
| process, you absolutely should do that too.
| citizen_friend wrote:
| This is an advantage of C-style programming. If you understand
| the problem and tailor a solution, you can exploit simplifying
| assumptions that make the code easier and faster.
|
| General libraries have to handle all kinds of cases you might
| not care about. The most egregious example is assuming every
| piece of code might be threaded, so everything needs to be
| protected by locks. (I would guess >75% of dictionaries never
| remove a key, and simply throw away the whole thing at the
| end.)
| pornel wrote:
| Modern CPUs keep making optimal algorithms weirder. Speculative
| superscalar execution and the colossal gap between the CPU and
| memory speed means that often a brute-force solution that fits in
| a cache line wins over solutions that would feel more elegant or
| efficient.
| sporadicity wrote:
| Have you tried absl::flat_hash_map? It uses SIMD in a different
| way than described in this article, and Google claims that it
| saves them a lot of memory because it still works pretty well at
| 90-95% occupancy.
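|
| The core trick, sketched with SSE2 (the general Swiss-table
| approach, not Google's exact code): one metadata byte per
| slot, 16 of them compared at once:
|
|     #include <cstdint>
|     #include <emmintrin.h> // SSE2
|
|     // Returns a 16-bit mask of slots whose control byte matches
|     // h2 (the 7-bit hash fragment stored per slot).
|     int match_group(const uint8_t* ctrl, uint8_t h2) {
|         __m128i group  = _mm_loadu_si128(
|             reinterpret_cast<const __m128i*>(ctrl));
|         __m128i needle = _mm_set1_epi8(static_cast<char>(h2));
|         return _mm_movemask_epi8(_mm_cmpeq_epi8(group, needle));
|     }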
| anonymoushn wrote:
| I've benchmarked swiss tables and found that (for hit-heavy
| workloads) a minimum of 2 loads per lookup is expensive
| compared to 1.
___________________________________________________________________
(page generated 2023-04-02 23:00 UTC)