[HN Gopher] How to implement a hash table in C (2021)
       ___________________________________________________________________
        
       How to implement a hash table in C (2021)
        
       Author : benhoyt
       Score  : 127 points
       Date   : 2024-07-06 02:36 UTC (20 hours ago)
        
 (HTM) web link (benhoyt.com)
 (TXT) w3m dump (benhoyt.com)
        
       | MrVandemar wrote:
       | Enjoyable article and thorough run through. I didn't read all of
       | it, but it really took me back to learning 'C' in comp sci in the
       | early 90s. Great times.
        
       | suprjami wrote:
       | It's the ihih guy! Thanks for being one of the few C bloggers out
       | there.
        
         | brabel wrote:
         | I enjoy reading posts about C just to see how much boilerplate
         | and footguns higher languages let you avoid. I would like to
         | write C, but I just have no trust I would remember or even know
         | how to do things right (use `calloc` vs `malloc`, free the
         | table when allocation of `entries` fails etc).
        
           | 082349872349872 wrote:
           | The thing about boilerplate is that you were always rewriting
           | it, so avoiding the footguns became habit.
           | 
           | (So ... which takes longer, writing the data structure you
           | want, as we all did then, or searching for the perfect
           | library, as some people do now?)
           | 
           | EDIT: awk(1) had built-in hash tables, and dates from 1977.
           | Note that generic hashing becomes much more useful after
           | D-space gets larger than 16-64k! (14-16 bits of address)
        
           | mauvehaus wrote:
           | Counterpoint: when you're writing C, you end up only
           | implementing as much complexity as you actually need. I
           | rolled a hash table for a project that only need to contain
           | integer keys and (I think) string values. It didn't need to
           | carry around all the complexity to handle hashing arbitrary
           | structured data for keys, etc. Multi-threading was supported
           | but a simple lock over the whole table was sufficient for
           | correctness and met the required performance.
           | 
           | The whole thing was perhaps 200 lines of straightforward
           | code. If you don't need to carry around support for all the
           | third-prime-numbered-Tuesday-of-the-month scenarios that
           | general purpose libraries have to, you end up with code you
           | can understand fully and don't have to spend time reasoning
           | about complexity that doesn't exist.
        
             | Horffupolde wrote:
             | Until it does and you do and you don't.
        
           | suprjami wrote:
           | I find that stuff enjoyable. I almost never use malloc, I
           | learnt to use calloc so it's my go-to allocator.
           | 
           | Whenever I write an allocation I try to write a free after
           | it, then fill in the code in between.
           | 
           | Use of static analysers and sanitisers helps too.
           | 
           | Like some people enjoy making all their hand tools from
           | scratch in their workshop, there's a craft and discipline to
           | follow and a satisfaction in doing things well, I enjoy the
           | craft of and discipline C for the same reason.
        
         | rlinge wrote:
         | I started to make some new toys in C too:
         | 
         | https://github.com/webd90kb/webd/tree/master/codes/c_project...
         | 
         | https://github.com/webd90kb/webd/tree/master/codes/scripts/e...
        
       | akira2501 wrote:
       | I really like the API design of libjudy for general hash table
       | like interfaces. It prevents you from having to hash a key twice
       | just to do "check if key is not present, then set value,
       | otherwise leave original value in hash." The pattern also meshes
       | better with the way iterators naturally work.
       | 
       | Also, in terms of this table, if you add the computed key hash
       | value to the stored hash entry, you can check that the value of
       | the hashes match before you do the full strcmp. If you have a
       | weak hash you might also get a benefit from checking that the
       | first characters match before calling the full strcmp.
       | 
       | It would also make rehashing easier since you already have the
       | full key available to you and don't have to use your internal set
       | function to move entries into the new table. In the posted
       | implementation the big-O semantics of a rehash are worst case.
       | 
       | Anyways.. "man hsearch.3" if you want something cheap and easy.
        
         | shawn_w wrote:
         | POSIX hsearch tables are terrible. There's a reason why nobody
         | uses them.
         | 
         | One at a time per program, having to know how big the table
         | needs to be at initialization time with no automatic resizing,
         | an API that's very limited (no way to remove entries, iterate
         | over entries, tell how many entries are in the table, ...)
        
           | akira2501 wrote:
           | > One at a time per program
           | 
           | Reentrant versions have existed for quite some time.
           | 
           | > having to know how big the table needs to be at
           | initialization time with no automatic resizing
           | 
           | The posted implementation resizes by creating an entirely new
           | table and then moving all entries. The exact same mechanism
           | is available with hsearch.
           | 
           | > no way to remove entries, iterate over entries
           | 
           | Strong downside but the posted implementation has no removal
           | function either. It also leaks key memory on resize and free.
           | It's iterator would need modification to allow delete during
           | iteration.
           | 
           | > tell how many entries are in the table
           | 
           | somewhat_reasonable_point++;
        
             | shawn_w wrote:
             | >> One at a time per program
             | 
             | >Reentrant versions have existed for quite some time.
             | 
             | Not in POSIX. glibc has one, but that doesn't do much good
             | if you want your code to work portably. I do see they've
             | made their way into Free and Net BSD, but not Open. Are
             | they available for Mac? Any of the surviving commercial
             | unixes?
             | 
             | >> having to know how big the table needs to be at
             | initialization time with no automatic resizing
             | 
             | >The posted implementation resizes by creating an entirely
             | new table and then moving all entries. The exact same
             | mechanism is available with hsearch.
             | 
             | That's how resizing a hash table typically works, yes.
             | Except it's normally done automatically and transparently
             | by the hash table code. hsearch doesn't do that. And since
             | it doesn't give you a way to iterate over values in the
             | table, you'd have to keep a separate list of everything you
             | add to it in order to manually delete the existing table
             | and make a bigger one and re-add everything to it... Like I
             | said, it's terrible.
        
       | timo-e-aus-e wrote:
       | I played around with C++ when I was at university. Then never
       | touched it again. So, with a grin I stumble over things like
       | 
       | "void* ht_get(...)"
       | 
       | Wait. What? A void pointer? Interesting... I have no clue.
       | 
       | I like articles like these. For someone not familiar with C it's
       | a perfect level. In terms of explanation and the code itself.
        
         | floam wrote:
         | It has no type, so you can cast it to whatever it actually is.
         | Even a function you will call.
         | 
         | The main use case is generic functions. And in data structures
         | like this.
        
           | shawn_w wrote:
           | Casting between void pointers and function pointers is,
           | strictly speaking, undefined behavior. Though it does tend to
           | work, it's not something to rely on.
           | 
           | (And you don't need casts when converting between void
           | pointers and data pointers)
        
             | HexDecOctBin wrote:
             | AFAIK it's undefined behaviour in ISO C, but not so in
             | POSIX.
        
             | elteto wrote:
             | The dlopen/dlsym API uses void *, there's really no
             | alternative in C.
             | 
             | It might be undefined in the sense that a language lawyer
             | might care, but for any sane implementation it's completely
             | valid.
        
             | maccard wrote:
             | c11 fixed this. C++ allows casting as long as you don't
             | access the intermediate casted item and cast it back to the
             | type it started as.
             | 
             | In practice though, it works pretty much everywhere in my
             | experience.
        
         | unwind wrote:
         | This is a bit of a pet peeve with anonymous Internet writing,
         | but: are we supposed to know when you were at university? Do
         | you know when I was? Why not state something like the number of
         | decades or whatever?
         | 
         | And yeah, a void * is how you express pointer to data of
         | unknown type in C.
        
           | kchr wrote:
           | > This is a bit of a pet peeve with anonymous Internet
           | writing, but: are we supposed to know when you were at
           | university? Do you know when I was? Why not state something
           | like the number of decades or whatever?
           | 
           | Does it matter? I tend to read it as "haven't touched it/had
           | any use for it since university".
        
         | maccard wrote:
         | Remember that this is for c, not c++. Void pointers are
         | entirely unnecessary in c++ unless you're interoperating with
         | existing libraries and can't change the interface. You'll get
         | faster and safer code with templates and function overloading
        
       | a-dub wrote:
       | might be neat if there were acceleration wins to be had with
       | instructions for computing hash ints from machine ints and
       | strings.
       | 
       | or even better, a full fledged linear probing hash lookup
       | instruction.
        
       | commandersaki wrote:
       | There was a fantastic benchmark of C and C++ hash tables doing
       | the rounds a few weeks ago, it's pretty fun reading:
       | https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/.
       | 
       | Unless I really didn't want to introduce dependencies, or reduce
       | code size, I think I'd use an off the shelf hash table
       | implementation these days. It's still a fun exercise building
       | your own though.
        
         | fenesiistvan wrote:
         | One important consideration is missing from this benchmark: how
         | they behave with multi-threading. There can be big differences
         | between diffent hash table implementation when you have to use
         | them from different threads (built-in smart thread safety vs
         | external dumb locks, etc)
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _How to implement a hash table in C_ -
       | https://news.ycombinator.com/item?id=26590234 - March 2021 (156
       | comments)
        
       | iamevn wrote:
       | if (size + size < size) {
       | 
       | I know size_t wraps around on overflow so this would actually
       | work, but this still makes me do a double take.
        
       | Joker_vD wrote:
       | > but it is non-ideal that I'm only allowing half the range of
       | size_t.
       | 
       | I am fairly certain that in C it's actually impossible to have an
       | object whose size is larger than half of the range of size_t
       | unless ptrdiff_t is wider than size_t, which normally isn't.
       | Unless, of course, C standard decided to make subtracting two
       | valid pointers into the same array (or one past the end) a
       | potential UB, just because.
        
       | SloopJon wrote:
       | I was noodling in this area recently, trying to speed up some
       | code similar to the tr utility:                   $ echo abcdef
       | |tr abc ghi         ghidef
       | 
       | For an eight-bit character set, I found that building an array to
       | map every character improved on linear search, even for short
       | replacement strings and relatively short input strings.
       | 
       | There isn't as easy a win for Unicode, so I played with some
       | simple hash tables. Although the conventional wisdom is to use
       | the high-order bits of a hash function's output, FNV-1a is not so
       | good for short inputs of one or two bytes. (I used the 32-bit
       | variant.) It was better just to use the low-order bits.
        
       | aronhegedus wrote:
       | This came up on an interview test where I had to implement this
       | from scratch! Fun to read
        
       | foresto wrote:
       | In case anyone here is not already familiar with it, gperf is a
       | perfect hash function generator.
       | 
       | https://www.gnu.org/software/gperf/manual/gperf.html
        
       ___________________________________________________________________
       (page generated 2024-07-06 23:01 UTC)