[HN Gopher] How to implement a hash table in C (2021)
___________________________________________________________________
How to implement a hash table in C (2021)
Author : benhoyt
Score : 127 points
Date : 2024-07-06 02:36 UTC (20 hours ago)
(HTM) web link (benhoyt.com)
(TXT) w3m dump (benhoyt.com)
| MrVandemar wrote:
| Enjoyable article and thorough run through. I didn't read all of
| it, but it really took me back to learning 'C' in comp sci in the
| early 90s. Great times.
| suprjami wrote:
| It's the ihih guy! Thanks for being one of the few C bloggers out
| there.
| brabel wrote:
| I enjoy reading posts about C just to see how much boilerplate
| and footguns higher languages let you avoid. I would like to
| write C, but I just have no trust I would remember or even know
| how to do things right (use `calloc` vs `malloc`, free the
| table when allocation of `entries` fails etc).
| 082349872349872 wrote:
| The thing about boilerplate is that you were always rewriting
| it, so avoiding the footguns became habit.
|
| (So ... which takes longer, writing the data structure you
| want, as we all did then, or searching for the perfect
| library, as some people do now?)
|
| EDIT: awk(1) had built-in hash tables, and dates from 1977.
| Note that generic hashing becomes much more useful after
| D-space gets larger than 16-64k! (14-16 bits of address)
| mauvehaus wrote:
| Counterpoint: when you're writing C, you end up only
| implementing as much complexity as you actually need. I
| rolled a hash table for a project that only need to contain
| integer keys and (I think) string values. It didn't need to
| carry around all the complexity to handle hashing arbitrary
| structured data for keys, etc. Multi-threading was supported
| but a simple lock over the whole table was sufficient for
| correctness and met the required performance.
|
| The whole thing was perhaps 200 lines of straightforward
| code. If you don't need to carry around support for all the
| third-prime-numbered-Tuesday-of-the-month scenarios that
| general purpose libraries have to, you end up with code you
| can understand fully and don't have to spend time reasoning
| about complexity that doesn't exist.
| Horffupolde wrote:
| Until it does and you do and you don't.
| suprjami wrote:
| I find that stuff enjoyable. I almost never use malloc, I
| learnt to use calloc so it's my go-to allocator.
|
| Whenever I write an allocation I try to write a free after
| it, then fill in the code in between.
|
| Use of static analysers and sanitisers helps too.
|
| Like some people enjoy making all their hand tools from
| scratch in their workshop, there's a craft and discipline to
| follow and a satisfaction in doing things well, I enjoy the
| craft of and discipline C for the same reason.
| rlinge wrote:
| I started to make some new toys in C too:
|
| https://github.com/webd90kb/webd/tree/master/codes/c_project...
|
| https://github.com/webd90kb/webd/tree/master/codes/scripts/e...
| akira2501 wrote:
| I really like the API design of libjudy for general hash table
| like interfaces. It prevents you from having to hash a key twice
| just to do "check if key is not present, then set value,
| otherwise leave original value in hash." The pattern also meshes
| better with the way iterators naturally work.
|
| Also, in terms of this table, if you add the computed key hash
| value to the stored hash entry, you can check that the value of
| the hashes match before you do the full strcmp. If you have a
| weak hash you might also get a benefit from checking that the
| first characters match before calling the full strcmp.
|
| It would also make rehashing easier since you already have the
| full key available to you and don't have to use your internal set
| function to move entries into the new table. In the posted
| implementation the big-O semantics of a rehash are worst case.
|
| Anyways.. "man hsearch.3" if you want something cheap and easy.
| shawn_w wrote:
| POSIX hsearch tables are terrible. There's a reason why nobody
| uses them.
|
| One at a time per program, having to know how big the table
| needs to be at initialization time with no automatic resizing,
| an API that's very limited (no way to remove entries, iterate
| over entries, tell how many entries are in the table, ...)
| akira2501 wrote:
| > One at a time per program
|
| Reentrant versions have existed for quite some time.
|
| > having to know how big the table needs to be at
| initialization time with no automatic resizing
|
| The posted implementation resizes by creating an entirely new
| table and then moving all entries. The exact same mechanism
| is available with hsearch.
|
| > no way to remove entries, iterate over entries
|
| Strong downside but the posted implementation has no removal
| function either. It also leaks key memory on resize and free.
| It's iterator would need modification to allow delete during
| iteration.
|
| > tell how many entries are in the table
|
| somewhat_reasonable_point++;
| shawn_w wrote:
| >> One at a time per program
|
| >Reentrant versions have existed for quite some time.
|
| Not in POSIX. glibc has one, but that doesn't do much good
| if you want your code to work portably. I do see they've
| made their way into Free and Net BSD, but not Open. Are
| they available for Mac? Any of the surviving commercial
| unixes?
|
| >> having to know how big the table needs to be at
| initialization time with no automatic resizing
|
| >The posted implementation resizes by creating an entirely
| new table and then moving all entries. The exact same
| mechanism is available with hsearch.
|
| That's how resizing a hash table typically works, yes.
| Except it's normally done automatically and transparently
| by the hash table code. hsearch doesn't do that. And since
| it doesn't give you a way to iterate over values in the
| table, you'd have to keep a separate list of everything you
| add to it in order to manually delete the existing table
| and make a bigger one and re-add everything to it... Like I
| said, it's terrible.
| timo-e-aus-e wrote:
| I played around with C++ when I was at university. Then never
| touched it again. So, with a grin I stumble over things like
|
| "void* ht_get(...)"
|
| Wait. What? A void pointer? Interesting... I have no clue.
|
| I like articles like these. For someone not familiar with C it's
| a perfect level. In terms of explanation and the code itself.
| floam wrote:
| It has no type, so you can cast it to whatever it actually is.
| Even a function you will call.
|
| The main use case is generic functions. And in data structures
| like this.
| shawn_w wrote:
| Casting between void pointers and function pointers is,
| strictly speaking, undefined behavior. Though it does tend to
| work, it's not something to rely on.
|
| (And you don't need casts when converting between void
| pointers and data pointers)
| HexDecOctBin wrote:
| AFAIK it's undefined behaviour in ISO C, but not so in
| POSIX.
| elteto wrote:
| The dlopen/dlsym API uses void *, there's really no
| alternative in C.
|
| It might be undefined in the sense that a language lawyer
| might care, but for any sane implementation it's completely
| valid.
| maccard wrote:
| c11 fixed this. C++ allows casting as long as you don't
| access the intermediate casted item and cast it back to the
| type it started as.
|
| In practice though, it works pretty much everywhere in my
| experience.
| unwind wrote:
| This is a bit of a pet peeve with anonymous Internet writing,
| but: are we supposed to know when you were at university? Do
| you know when I was? Why not state something like the number of
| decades or whatever?
|
| And yeah, a void * is how you express pointer to data of
| unknown type in C.
| kchr wrote:
| > This is a bit of a pet peeve with anonymous Internet
| writing, but: are we supposed to know when you were at
| university? Do you know when I was? Why not state something
| like the number of decades or whatever?
|
| Does it matter? I tend to read it as "haven't touched it/had
| any use for it since university".
| maccard wrote:
| Remember that this is for c, not c++. Void pointers are
| entirely unnecessary in c++ unless you're interoperating with
| existing libraries and can't change the interface. You'll get
| faster and safer code with templates and function overloading
| a-dub wrote:
| might be neat if there were acceleration wins to be had with
| instructions for computing hash ints from machine ints and
| strings.
|
| or even better, a full fledged linear probing hash lookup
| instruction.
| commandersaki wrote:
| There was a fantastic benchmark of C and C++ hash tables doing
| the rounds a few weeks ago, it's pretty fun reading:
| https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/.
|
| Unless I really didn't want to introduce dependencies, or reduce
| code size, I think I'd use an off the shelf hash table
| implementation these days. It's still a fun exercise building
| your own though.
| fenesiistvan wrote:
| One important consideration is missing from this benchmark: how
| they behave with multi-threading. There can be big differences
| between diffent hash table implementation when you have to use
| them from different threads (built-in smart thread safety vs
| external dumb locks, etc)
| dang wrote:
| Discussed at the time:
|
| _How to implement a hash table in C_ -
| https://news.ycombinator.com/item?id=26590234 - March 2021 (156
| comments)
| iamevn wrote:
| if (size + size < size) {
|
| I know size_t wraps around on overflow so this would actually
| work, but this still makes me do a double take.
| Joker_vD wrote:
| > but it is non-ideal that I'm only allowing half the range of
| size_t.
|
| I am fairly certain that in C it's actually impossible to have an
| object whose size is larger than half of the range of size_t
| unless ptrdiff_t is wider than size_t, which normally isn't.
| Unless, of course, C standard decided to make subtracting two
| valid pointers into the same array (or one past the end) a
| potential UB, just because.
| SloopJon wrote:
| I was noodling in this area recently, trying to speed up some
| code similar to the tr utility: $ echo abcdef
| |tr abc ghi ghidef
|
| For an eight-bit character set, I found that building an array to
| map every character improved on linear search, even for short
| replacement strings and relatively short input strings.
|
| There isn't as easy a win for Unicode, so I played with some
| simple hash tables. Although the conventional wisdom is to use
| the high-order bits of a hash function's output, FNV-1a is not so
| good for short inputs of one or two bytes. (I used the 32-bit
| variant.) It was better just to use the low-order bits.
| aronhegedus wrote:
| This came up on an interview test where I had to implement this
| from scratch! Fun to read
| foresto wrote:
| In case anyone here is not already familiar with it, gperf is a
| perfect hash function generator.
|
| https://www.gnu.org/software/gperf/manual/gperf.html
___________________________________________________________________
(page generated 2024-07-06 23:01 UTC)