[HN Gopher] How to implement a hash table in C
___________________________________________________________________
How to implement a hash table in C
Author : benhoyt
Score : 239 points
Date : 2021-03-26 09:39 UTC (13 hours ago)
(HTM) web link (benhoyt.com)
(TXT) w3m dump (benhoyt.com)
| sobriquet9 wrote:
| Will this expression correctly handle overflow?
|
| size_t mid = (low + high) / 2;
| mywittyname wrote:
| Interesting question.
|
| Since this is integer division, I think you can do:
| size_t low_s = low >> 1;
| size_t high_s = high >> 1;
| size_t mid = (low_s + high_s);
|
| ...With no overflow worries. (expanded for clarity)
|
| That would effectively be distributing the division. So it's
| equivalent to low / 2 + high / 2.
| saagarjha wrote:
| This gives a different result when both low and high are odd.
| chrchang523 wrote:
| Not an issue since the following is at the top of the function:
| if (size + size < size) {
|     return NULL;  // size too big; avoid overflow
| }
| kt66 wrote:
| what about (low>>1) + (high>>1) + (1 & low & high)
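|
| Spelled out in compilable form, both overflow-safe variants look
| like this (just a sketch, assuming low <= high and both already
| of type size_t):
|
|     #include <stddef.h>
|
|     /* Never forms low + high, so it can't overflow. */
|     static size_t mid_safe(size_t low, size_t high) {
|         return low + (high - low) / 2;
|     }
|
|     /* "Distributed" division, with a correction term for the
|        case where both low and high are odd. */
|     static size_t mid_distributed(size_t low, size_t high) {
|         return (low >> 1) + (high >> 1) + (1 & low & high);
|     }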
| thamer wrote:
| > When the hash table gets too full, we need to allocate a larger
| array and move the items over. This is absolutely required when
| the number of items in the hash table has reached the size of the
| array, but usually you want to do it when the table is half or
| three-quarters full.
|
| Yes, but _how_ you resize is important too: if you have a
| threshold size (like 3/4 full) at which you block and re-
| distribute all elements into a new array, you will incur a
| significant pause when this happens, e.g.
| https://log.kv.io/post/2009/05/15/leaky-hashtables
|
| Instead, when you reach the threshold amount you can create the
| new array and then gradually migrate the keys in small batches
| either with each operation or with a background thread. So on
| `get` if we're in the process of resizing: first check the new
| table, then the old one, and before returning migrate N keys from
| the old table to the new one. Free the old array once all the
| keys are migrated.
|
| I wrote a small hash table implementation with gradual re-hashing
| a while back, search for DICT_MAX_LOAD and dict_rehash here:
| https://github.com/nicolasff/ht/blob/master/dict.c#L193
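|
| For illustration, here is a minimal sketch of just the lookup
| path described above, with int keys and linear probing for
| brevity (the names are made up; this is not the dict.c code
| linked above):
|
|     #include <stdlib.h>
|
|     struct slot { int key, val, full; };
|
|     struct dict {
|         struct slot *old_tab, *new_tab; /* new_tab set while resizing */
|         size_t old_cap, new_cap;
|         size_t migrated;                /* old buckets already scanned */
|     };
|
|     /* Slot holding key, or first empty slot (assumes the table
|        is never completely full). */
|     static size_t probe(struct slot *tab, size_t cap, int key) {
|         size_t i = (size_t)(unsigned)key % cap;
|         while (tab[i].full && tab[i].key != key)
|             i = (i + 1) % cap;
|         return i;
|     }
|
|     /* Move a small batch of entries from old_tab to new_tab;
|        free the old array once everything has been migrated. */
|     static void migrate_some(struct dict *d, size_t batch) {
|         while (batch-- > 0 && d->migrated < d->old_cap) {
|             struct slot *s = &d->old_tab[d->migrated++];
|             if (s->full)
|                 d->new_tab[probe(d->new_tab, d->new_cap, s->key)] = *s;
|         }
|         if (d->migrated == d->old_cap) {
|             free(d->old_tab);
|             d->old_tab = d->new_tab;
|             d->old_cap = d->new_cap;
|             d->new_tab = NULL;
|         }
|     }
|
|     int dict_get(struct dict *d, int key, int *val_out) {
|         if (d->new_tab)
|             migrate_some(d, 4);   /* piggyback a bit of migration */
|         struct slot *s;
|         if (d->new_tab) {         /* still resizing: new table first */
|             s = &d->new_tab[probe(d->new_tab, d->new_cap, key)];
|             if (!s->full)
|                 s = &d->old_tab[probe(d->old_tab, d->old_cap, key)];
|         } else {
|             s = &d->old_tab[probe(d->old_tab, d->old_cap, key)];
|         }
|         if (!s->full)
|             return 0;             /* not found */
|         *val_out = s->val;
|         return 1;
|     }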
| moistbar wrote:
| I'm not a particularly good programmer, so I may be missing
| something here, but I'd think that moving the items should be
| its own function that happens in between gets and sets, no?
| Besides adding a decidedly non-read-only operation to what
| should be read-only, shouldn't the movement happen when you're
| _not_ trying to do anything with the hash table?
| jeffffff wrote:
| typically rehashing is done during insert. if you were going
| to do it while not doing anything else with the hash table
| you'd have to use a background thread, and then you'd have to
| do some form of synchronization, so there would be
| significant additional complexity and overhead.
| OskarS wrote:
| It's usually not the right call to use a hash table like you're
| describing. You're not saving any work (all keys have to be
| migrated sooner or later) and by doing it like this you incur a
| non-trivial amount of overhead on most operations.
|
| Hash tables generally work on the assumption that operations
| are O(1) in the amortized sense, that individual inserts might
| be expensive because of rehashing, but in aggregate they are
| very cheap. In practice, this is almost always a fine
| assumption, and it is how the vast majority of hash tables in
| all programming languages work.
|
| The kinds of hash tables you are talking about have their uses
| in places where you have soft real-time constraints, but that
| is honestly a pretty niche use case.
| ModernMech wrote:
| > Hash tables generally work on the assumption that
| operations are O(1) in the amortized sense, that individual
| inserts might be expensive because of rehashing, but in
| aggregate they are very cheap.
|
| Hash tables aren't O(1) in an amortized sense, they are O(1)
| in a statistical sense. The O(1) complexity of hash tables is
| dependent entirely on the size of the underlying array, the
| method of collision resolution, and the quality of the
| (pre)hash function in relation to the distribution of data
| you're storing in the table. Whether you insert 1 item or 1
| million items into the table, you're still getting O(1)
| computational complexity if the table is designed well. If
| not, you could be getting O(n) complexity in the worst case.
|
| While the fixed cost of hashing is irrelevant in the O(1)
| nature of hash table operations, it does mean that a hash
| table in practice could be slower (in terms of wall time)
| than a linked list for insert/find/remove operations. Despite
| a linked list being O(n) for those operations, a linked list
| doesn't have to pay the fixed cost of hashing. So for low
| values of n, the linked list might be a better choice. I
| think this is what you mean by "O(1) in the amortized sense"
| but it's not what we mean when we say the computational
| complexity of these operations on a hash table is O(1).
|
| What the OP is talking about is an example of amortization of
| resize(), essentially adding a fixed cost to insert() and
| remove() so you don't have to pay a huge sum when you call
| resize(). But that doesn't mean insert() and remove() are no
| longer O(1).
| justinpombrio wrote:
| Amortized cost is a technical term in complexity analysis:
| https://en.wikipedia.org/wiki/Amortized_analysis
|
| If you use the terminology precisely, the worst case cost
| of a regular hash table insertion is O(n) because of the
| potential resize, but its amortized cost is O(1) because
| that's what it "averages out to".
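|
| For example, with capacity doubling, n insertions into an
| initially empty table trigger element copies totalling roughly
| 1 + 2 + 4 + ... + n/2 < n, so the total work over all n
| insertions is less than 2n: O(1) per insertion amortized, even
| though the single insertion that triggers a resize costs O(n).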
| sgerenser wrote:
| Agreed this seems like major overkill, but then again so does
| writing your own hash table from scratch in C. :shrug:
| legulere wrote:
| If your application is not negatively affected by such jitter
| why not use a garbage collected language?
| jeffffff wrote:
| iirc memcached either uses or used to use
| https://en.wikipedia.org/wiki/Linear_hashing to avoid
| rehashing induced pauses. it definitely makes sense for that
| use case but you're correct that it's not the right call for
| a general purpose hash table implementation.
| pjam wrote:
| The progressive rehashing described in the article is very
| similar to what Redis does [1].
|
| Just thought I'd share an example of a use case where the
| incremental rehashing logic makes sense.
|
| [1]: https://github.com/redis/redis/blob/unstable/src/dict.c
| vladf wrote:
| What happens if you have a sequence of puts that triggers an
| expansion and some keys get migrated, but all of a sudden you
| get a series of removes, such that you need to perform a shrink
| to keep memory usage linear?
|
| Do you now end up with three live arrays? You can probably make
| things even more pathological...
| coliveira wrote:
| Hash tables are not great for data structures that change
| dramatically in size. If you have this situation, the best
| thing to do is to copy one hash table into another and
| discard the original data structure.
| chasd00 wrote:
| can you do hashtable sharding? like have a hashtable point to
| another hashtable by using something appended to the key? just
| a thought on a procrastination filled Friday afternoon.
| stephc_int13 wrote:
| You can also allocate a small hash table at first, and
| reallocate it in a large virtual memory region once it's about
| the size of a page (4k), so you won't have to reallocate it
| later.
| kbenson wrote:
| > So on `get` if we're in the process of resizing
|
| Not just _get_ , but _set_ also, right? Otherwise I would think
| a large group of set operations that overflowed more than once
| might be problematic to deal with. Or maybe I'm missing
| something that makes that a non-issue?
| tkhattra wrote:
| yes, _set_ also. go's map implementation uses this technique
| - allocate a new array of buckets and gradually migrate from
| old to new buckets (i think it does the migration during
| _set_ and _delete_ only, not _get_).
| tengbretson wrote:
| Unless you need the specific memory characteristics of a hash
| table, a trie data structure is a pretty good solution to a lot
| of the same problems and in my opinion is conceptually easier to
| understand.
| taeric wrote:
| I have never seen a trie described as easier than a hash table.
| Curious what makes you say they are easier?
| tengbretson wrote:
| I find their performance characteristics easier to naively
| understand without resorting to profiling with production
| data.
| taeric wrote:
| I should have worded my question more as "do you have a
| recommendation to describe tries?" In particular,
| construction is tough.
| tengbretson wrote:
| The C# implementation here is pretty approachable for
| just about any kind of programming background.
|
| https://www.geeksforgeeks.org/trie-insert-and-search/
| [deleted]
| bluedino wrote:
| What if this became the new "invert a binary tree on a
| whiteboard"?
| Google234 wrote:
| This is pretty straightforward stuff.
| 0xbadcafebee wrote:
| Hash tables are deceptively complicated, like sorting
| algorithms.
| cogman10 wrote:
| Like sorting algorithms, they can be as complicated as you
| want them to be.
|
| A merge sort isn't the fastest sort but it's pretty easy
| for anyone to implement. Something like a timsort is a bit
| more complex but is what a lot of languages use nowadays.
|
| Hash tables are in the same boat. A basic hashtable is
| pretty easy to implement. It's only when you start looking
| at things like ideal item placement and optimal filling
| that things start to get more complex.
| kragen wrote:
| Dunno, I wrote
| https://news.ycombinator.com/item?id=26595520 just now
| and my super dumb hash table has two bugs in its 21 lines
| of code:
|
| 1. It doesn't update--attempts to change the value
| associated with an existing key are silently ignored,
| although they do consume buckets.
|
| 2. It loops forever if the table is full _and the key it
| was looking for is negative_, because that invokes
| conversion of a negative signed value to an unsigned
| value, which I think is UB in C.
|
| Maybe it has more bugs I haven't found yet.
|
| Additionally the reason it took me 15 minutes to write 21
| lines of code was that I changed a bunch of things in
| midstream, before posting the comment:
|
| 1. At first there was no "full" field in the hash table
| bucket, so there was no way to distinguish full buckets
| from empty buckets.
|
| 2. And then when I added it, I added it as "dead", with
| the opposite Boolean sense of "full", so I would have
| needed a separate initialization routine (to set all the
| dead fields to something nonzero). Instead I swapped the
| sense.
|
| 3. I was trying to detect the case where we'd looped back
| around to our starting point, indicating that the table
| was full and the search was unsuccessful, and I realized
| that the condition I wanted was not b == orig, which
| would always fail on the first iteration, but (b + 1) %
| table_size == orig. But doing that on every iteration
| seemed obscene, so I incremented b (mod table_size) to
| start out with instead.
|
| 4. At some point I think I realized I'd forgotten to set
| the .full field on newly inserted items.
|
| 5. Also I realized I'd forgotten to test the table[b].k
| == k condition in `get` as well, which would have made
| every search unsuccessful.
|
| So I think it's reasonable to say that even a pretty
| limited hash-table implementation has plenty of
| opportunities to write bugs, more than I expected. But
| maybe I'm just running on low blood sugar this afternoon
| or something.
| SavantIdiot wrote:
| This shouldn't be a mystery to someone who's had a
| datastructures course. It's a fundamental data structure. Yes,
| optimization always goes down a rabbit-hole, but Knuth's
| "Algorithms in C" was mandatory in most engineering schools' CS
| degrees in the early 90's, and it's essentially what this website
| describes, but with more rigor.
| [deleted]
| kragen wrote:
| _Algorithms in C_ , which I strongly recommend, is actually
| by Knuth's student Sedgewick.
| oldmanhorton wrote:
| Ok, but school was years ago. I've had years of relevant
| experience that would make me a good fit for some job since,
| but I've forgotten how to implement a hash table in C within
| 30 minutes. Is that really the gatekeeper we want?
| kragen wrote:
| "Is that really the gatekeeper we want?" Yes, I think so? I
| mean implementing a hash table in C is really easy. What is
| there to "forget"? If you can't do it, either you don't
| know what a hash table is, or you don't know how to program
| in C.
|
|     enum { table_size = 1024 };
|     typedef struct item { int k, v, full; } item;
|     static item table[table_size];
|
|     void put(item kv) {
|         // dumbest possible hash function
|         size_t orig = kv.k % table_size;
|         // use linear probing for collisions
|         size_t b = (orig + 1) % table_size;
|         while (table[b].full && b != orig)
|             b = (b + 1) % table_size;
|         if (b == orig)
|             abort();  // table full
|         table[b] = kv;
|         table[b].full = 1;
|     }
|
|     item *get(int k) {
|         size_t orig = k % table_size;
|         size_t b = (orig + 1) % table_size;
|         while (table[b].full && table[b].k != k && b != orig) {
|             b = (b + 1) % table_size;
|         }
|         return (table[b].full && table[b].k == k)
|             ? &table[b] : NULL;
|     }
|
| I haven't tested that (just as I wouldn't in a whiteboard
| interview) so I don't know if it works. I did briefly skim
| the article we're talking about, but I concluded I didn't
| have anything to learn from it, so I didn't read it. I
| found a bunch of bugs in my implementation as I was writing
| it. And of course it's a pretty dumb hash table: linear
| probing is very suboptimal, it uses a statically-allocated
| non-resizable hash table, it's specialized for a certain
| type of keys, and so on.
|
| But are you saying you can't even do _that_? Probably it
| would be bad to hire you for a C programming job, then,
| unless it was an entry-level position for you to learn C
| in.
|
| ***
|
| Evidently the above comment took me 15 minutes to write.
| Oh, and now I see another bug or at least lacuna: if you
| try to update the value of an existing key, it doesn't
| fail, but it also doesn't update, instead adding a new
| entry that will never be read. And then I did write a
| minimal smoke test and run it, and unsurprisingly, it does
| work:
|
|     int main() {
|         put((item) { 4, 7 });
|         put((item) { 1028, 9 });
|         put((item) { 3, 25 });
|         printf("%d %d %d %p %p\n",
|                get(4)->v, get(1028)->v, get(3)->v, get(5), get(3));
|         return 0;
|     }
|
| Writing the test and the above additional commentary, and
| running the test, took another 8 minutes.
|
| Oh, and there's another bug where it will loop infinitely
| when the hash table is full and the key is negative.
| (Actually it invokes UB but typically the result will just
| be an infinite loop.)
|
| If it takes you 60 or 100 minutes to write this, or to
| write something better, maybe you still know C and know
| what hash tables are. But if you throw up your hands and
| say "I don't remember, it's been a long time since I was in
| school"? Either you don't know C, or you don't know what a
| hash table is, which probably means you've never written
| anything in C but toy programs. (This is a toy program, of
| course, but many toy programs in C don't need hash tables.)
|
| The compilable and fixed toy program in question is at
| http://canonical.org/~kragen/sw/dev3/dumbhash.c if you want
| to probe it for bugs.
|
| It occurs to me that you might have been talking about
| something that isn't a C programming job. For example, we
| might be talking about a job programming an HTML database
| front-end in Python using Django. And of course it would be
| silly to reject someone for a job like that because they
| didn't know C, unless they claimed to know C on their
| resume. And it's totally reasonable for a Python
| programmer, or a Python/Perl/JS/bash programmer, to not
| understand hash tables.
| jstanley wrote:
| Knowing how to implement a hash table in C doesn't strike
| me as something you can "forget". If you know how to write
| C and you know how a hash table works, what is there to
| forget?
| 908B64B197 wrote:
| Honestly it's such a simple concept it shouldn't be hard to
| remember.
| philzook wrote:
| Sedgewick's?
| seanwilson wrote:
| I wouldn't ask someone to whiteboard it but I think "what is a
| hash table?" is a brilliant interview question. The depth
| someone can go into tells you a lot and can bring up lots of
| other topics to discuss. It tells you a lot if a senior
| developer doesn't know what a hash table is or only knows "it's
| a class from Java" too.
| 908B64B197 wrote:
| > I wouldn't ask someone to whiteboard it
|
| I did. Works very well as an interview question.
| anaerobicover wrote:
| Who said anything about interviews? This is just in the context
| of "something interesting to read with your morning cuppa".
| thrower123 wrote:
| Hashtables are probably more directly useful than all the
| binary tree gymnastics.
|
| The K&R C Programming Language book has nearly the same example
| (although I recall it using linked-lists for hash collisions
| and not probing).
|
| I would be mildly suspicious of anyone claiming to have a
| recent CS degree who wasn't familiar with the ideas.
| [deleted]
| vram22 wrote:
| IIRC, a very simple implementation of a hash table was shown in
| the classic Kernighan & Ritchie C book, "The C Programming
| Language" (which was one of the first good programming books that
| I bought and read, early in my programming career).
|
| It was in either the 1st (pre-ANSI C) or the 2nd (ANSI C)
| edition. Just a handful of lines of C code, and easy enough for
| even beginners to understand. Of course, it was not a production-
| quality hash table implementation, nor was it meant to be. I
| still remember my excitement at understanding the logic and its
| usefulness at the time, even though I did not know about the
| importance of data structures and algorithms then.
|
| Edit: Even though it was simple, IIRC, it at least handled
| collisions.
| benhoyt wrote:
| Nice. Yeah, looks like it's in section "6.6 Table Lookup", and
| it implements a simple, fixed-size hash table (array size 101)
| of char* to char*, and it handles collisions via linked lists.
| vram22 wrote:
| Cool, thanks for looking it up :)
| daneelsan wrote:
| This might not be the best place to ask but: I'm wondering what
| the best data structure is for these needs:
|
| * there are fields and values, similar to a struct
| * I know at initialization time what fields there are (assume a
|   string)
| * at run time I want to be able to get those values, accessing
|   them via the key/field
| * also I want to update those values
| * what I don't care about is inserting/deleting fields, because
|   that's not allowed
|
| I guess what I want is a struct-like data structure, but in an
| interpreted language.
|
| Might've typed too fast...
| whateveracct wrote:
| if the fields are different types, how do you know the type of
| key access's return?
| daneelsan wrote:
| I guess that can be solved by creating an Expression type
| daneelsan wrote:
| Sorry if I wasn't clear enough. Yes I want something like an
| object in javascript, like a struct in C but at runtime. But I
| want to implement this myself. At first I thought a hash table
| was the right choice. But I don't need the insert delete
| increase capacity that this data structure typically needs to
| implement. I want to implement this in Mathematica. The
| language doesn't have native objects given that the main idea
| is for things to be immutable. But I still can program data
| structures, so I'm looking for one...
| mywittyname wrote:
| A class? After all, a class is a struct with some additional
| features to handle functions.
|
| You can use a static instance if you just need a single object.
| ben509 wrote:
| You're looking for something similar to Python's namedtuple.[1]
| (Though modern Python should use dataclasses from the stdlib or
| the attrs package.)
|
| Essentially, you have an array/list/vector under the hood, and
| a mapping of field names to array indices to govern field
| access.
|
| If you're trying to implement this at a lower level, in the
| interpreted language itself, take a look at the slots
| mechanism[2] in Python.
|
| [1]:
| https://docs.python.org/3/library/collections.html#collectio...
|
| [2]:
| https://docs.python.org/3/reference/datamodel.html#object.__...
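|
| A rough sketch of that layout in C terms (values simplified to
| doubles, and all the names here are made up):
|
|     #include <stddef.h>
|     #include <string.h>
|
|     /* Field names fixed at creation; values in a flat array. */
|     struct record {
|         const char **fields;  /* field names, set once at init */
|         double *values;       /* one slot per field */
|         size_t nfields;
|     };
|
|     static int field_index(const struct record *r, const char *name) {
|         for (size_t i = 0; i < r->nfields; i++)
|             if (strcmp(r->fields[i], name) == 0)
|                 return (int)i;
|         return -1;            /* unknown field */
|     }
|
|     static double record_get(const struct record *r, const char *name) {
|         int i = field_index(r, name);
|         return i >= 0 ? r->values[i] : 0.0;
|     }
|
|     static void record_set(struct record *r, const char *name, double v) {
|         int i = field_index(r, name);
|         if (i >= 0)
|             r->values[i] = v;
|     }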
| daneelsan wrote:
| That seems like the thing I'm looking for! Thanks for the
| info
| sgtnoodle wrote:
| It depends on what you want its footprint in memory to look
| like. Since you don't need to mutate the structure, a densely
| packed structure seems optimal. For small structures, linearly
| searching through an array would be fast. For larger
| structures, a sorted array could be binary searched
| efficiently.
|
| It also depends on how you can represent the string and their
| typical size. If the strings are small, then allocating fixed
| size chunks to store them inline might make sense. If the
| strings are long, it might make sense to go crazy and use a
| "trie" data structure to minimize comparisons.
|
| I believe python uses hash maps for classes by default, but you
| can alternatively use "slots", which are less mutable and
| presumably more densely packed.
| daneelsan wrote:
| Yes. At some point I encountered this trie with payloads
| (which would act as the field values).
| tyingq wrote:
| _" I guess what I want is a struct like data structure, but in
| an interpreted language"_
|
| Python ffi is an option:
| https://cffi.readthedocs.io/en/latest/using.html#working-wit...
|
| Though I don't know what advantage that route would have over
| most interpreted language's already existing associative arrays
| / hashmaps / dicts.
|
| There may be some module already built that serves your needs.
| DiscoDB is a good example..a very fast hashmap that uses
| mmap(). https://discodb.readthedocs.io/en/latest/
| ectopod wrote:
| Use an object?
| brundolf wrote:
| Sounds like you want a struct except you want to be able to
| reflect on the property keys (which you can't do normally in
| C/C++/Rust)?
|
| I got thrown off by "but in an interpreted language" so I'm not
| sure whether you're requesting this for C or for another
| language. In JS, at least the V8 engine does some of this for
| you: if you have a group of objects that always have the same
| set of properties, V8 will detect this in many cases and
| optimize them under the hood as a struct or class (it must also
| store the string keys somewhere since those can be iterated in
| JS).
|
| In C/C++ you'd have to store the keys as strings somehow, and
| also add some way to index the struct via those keys (since you
| can't even do that with a normal struct). You may be stuck with
| a hashmap one way or another.
| nicoburns wrote:
| Rust allows you to do this with a macro. The serde macros are
| pretty close to what you need, but a slight variant could
| give you a much more ergonomic API.
| brundolf wrote:
| Right, I thought about suggesting a macro for C/C++ but I
| don't know enough about their macro systems to know for
| sure that it would work
| nicoburns wrote:
| I believe C++ has "Runtime Type Information" (RTTI) that
| allows you to do this. Although I also hear that it's not
| great.
| MaxBarraclough wrote:
| To expand on _not great_ : RTTI is one of the few
| features of C++ where you pay for it even if you don't
| use it, so it's not all that popular. It can be disabled
| with compiler flags to shrink the binary. LLVM's coding
| standard prohibits use of RTTI, they instead use their
| own 'hand-rolled' solution. [0] RTTI is also banned by
| the Google C++ coding standard. [1]
|
| [0] https://llvm.org/docs/CodingStandards.html#do-not-
| use-rtti-o...
|
| [1]
| https://google.github.io/styleguide/cppguide.html#Run-
| Time_T...
| frankenst1 wrote:
| JavaScript also lacks a proper hash table. I wonder how much
| faster a native implementation could possibly be over, say,
| `Map()` (which keeps insertion order by maintaining linked lists
| and does not have constant-time operations).
| perlgeek wrote:
| What's wrong with using objects as hashes?
| tyingq wrote:
| There's a table at the bottom of
| https://developer.mozilla.org/en-
| US/docs/Web/JavaScript/Refe... that shows some reasons why.
| frankenst1 wrote:
| They can be faster than `Map` but:
|
| 1) Slower than hash tables
|
| 2) `{}` is not empty but has default keys
|
| 3) `{}` has multiple read-only keys
|
| 2 and 3 are caused by the prototype chain (properties not
| present on the initial object will be attempted to be
| resolved via the prototype hierarchy) which can be avoided
| via `Object.create(null)`
| winrid wrote:
| Isn't it faster to, instead of using a backing array, use a
| pool of memory that you just point into with pointers addressed
| with the hash?
|
| Nasty stuff that you can do in a language like C :)
| ufo wrote:
| As far as the CPU is concerned, there isn't a big difference if
| you access a block of memory via pointers or via an array.
| Converting the hash result to a memory address is not the
| bottleneck. The main bottlenecks are the hash function itself
| and the logic for dealing with collisions.
| selimnairb wrote:
| And depending on your use case, you may not need to implement
| growing the hash table, or deletions, making it even simpler/less
| daunting to implement in C (since this avoids a lot of the memory
| management parts of the job).
| benhoyt wrote:
| Yeah, good point. And then you can allocate the hash table
| struct and all your buckets in one call to malloc().
| tirrex wrote:
| You skipped hs_remove, but depending on the implementation, the
| hs_remove strategy may change your hs_set, e.g. using tombstones.
| I've recently learnt that you don't need tombstones at all in
| linear-probing hash maps; you just do backward-shift removals,
| which may be a simple addition to your map. Check out these repos
| for example implementations:
|
| (C) https://github.com/tezc/sc/tree/master/map
|
| (C++) https://github.com/rigtorp/HashMap
| speps wrote:
| A more detailed explanation of "backward shift deletion":
| https://codecapsule.com/2013/11/17/robin-hood-hashing-backwa...
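|
| Roughly, the idea looks like this (a sketch assuming a linear-
| probing table of int keys that is never completely full):
|
|     #include <stddef.h>
|
|     enum { CAP = 1024 };
|     struct slot { int key, val, full; };
|     static struct slot table[CAP];
|
|     static size_t home(int key) {
|         return (size_t)(unsigned)key % CAP;
|     }
|
|     /* Backward-shift deletion: instead of leaving a tombstone,
|        pull later entries back into the hole until we reach an
|        empty bucket. */
|     void table_remove(int key) {
|         size_t i = home(key);
|         while (table[i].full && table[i].key != key)
|             i = (i + 1) % CAP;
|         if (!table[i].full)
|             return;               /* key not present */
|         table[i].full = 0;        /* open a hole at i */
|
|         size_t j = i;
|         for (;;) {
|             j = (j + 1) % CAP;
|             if (!table[j].full)
|                 break;            /* hit an empty bucket: done */
|             size_t h = home(table[j].key);
|             /* If h lies cyclically in (i, j], the entry at j is
|                still reachable from its home bucket; leave it.
|                Otherwise shift it back into the hole. */
|             int reachable = (i <= j) ? (i < h && h <= j)
|                                      : (i < h || h <= j);
|             if (!reachable) {
|                 table[i] = table[j];
|                 table[j].full = 0;
|                 i = j;
|             }
|         }
|     }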
| chrchang523 wrote:
| Well, he did say "simple hash table". In my experience, many
| applications have no need for single-element deletion.
| attractivechaos wrote:
| I wrote a blog post [1] on this a couple of years ago. This is
| a quite old technique [2] but was not widely used until
| recently. I was surprised that I hadn't found this earlier.
|
| [1] https://attractivechaos.wordpress.com/2019/12/28/deletion-
| fr...
|
| [2] https://en.wikipedia.org/wiki/Linear_probing#Deletion
| tirrex wrote:
| I can't find it right now, but I saw an older version of the
| wiki page that had a C code example showing how to do deletion
| with this. This is how I discovered it. I knew your blog, but
| apparently I hadn't visited recently.
|
| > I think the new deletion algorithm without tombstones is
| overall better than traditional algorithms. It should become
| the preferred way to implement hash tables
|
| I agree with you 100%. Tombstone has no advantage over this.
| annilt wrote:
| https://en.wikipedia.org/wiki/Open_addressing
|
| Maybe it was open addressing page? It has pseudo code for
| that.
| benhoyt wrote:
| Thanks for this (and parent reply for mentioning it), that's
| simple and clever!
|
| > The basic idea is not complex: when we delete an element,
| we move the next element to the deleted location if it is
| pushed away by the deleted element due to hash collision; we
| repeat this process until we come to an empty bucket.
|
| I wonder, is it still _amortized_ constant time? My gut
| (which is not very good at computer science) tells me that it
| still may be, as with an average probe length of ~1.4 (which
| I was seeing), you won't have to walk more than 1 or 2
| elements on average.
| hnfong wrote:
| It shouldn't take more operations (asymptotically) to
| delete than to probe/insert so as long as the original list
| is still amortized constant time then the delete operation
| still is.
| [deleted]
| e12e wrote:
| I guess the idea is to contrast DIY bsearch vs DIY hash table -
| but it's a bit odd to mention bsearch is in the standard lib,
| then not use it?
|
| > Here's how you'd do it in C (with and without bsearch).
|
| Judging by a random hit on the net - it would've been quite
| straightforward to use the stdlib version?
|
| https://www.tutorialspoint.com/c_standard_library/c_function...
| st_goliath wrote:
| POSIX actually specifies a hash table interface in the standard
| library as well: https://man7.org/linux/man-
| pages/man3/hsearch.3.html
|
| In typical Unix fashion, however, it is designed to use
| internal, global state, so you can only have _one_ table at a
| time.
|
| There is a GNU extension that provides the same functions with
| an `_r` suffix that allows you to work with more than one table.
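|
| Usage looks roughly like this (a single global table created
| with hcreate() and torn down with hdestroy()):
|
|     #include <search.h>
|     #include <stdio.h>
|
|     int main(void) {
|         if (hcreate(32) == 0)     /* one fixed-size, global table */
|             return 1;
|
|         ENTRY e = { .key = "apple", .data = "fruit" };
|         hsearch(e, ENTER);        /* insert */
|
|         ENTRY q = { .key = "apple" };
|         ENTRY *found = hsearch(q, FIND);
|         if (found != NULL)
|             printf("%s -> %s\n", found->key, (char *)found->data);
|
|         hdestroy();               /* note: does not free keys/data */
|         return 0;
|     }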
| Tyrubias wrote:
| I've never really understood why standard Unix functions rely
| on internal global state. For example, strtok():
|
| https://en.cppreference.com/w/c/string/byte/strtok
|
| Was there some reason it was optimal to write these functions
| this way back in the early days of Unix?
| jandrese wrote:
| It probably saved a few bytes of memory or disk. A PDP11
| had 64k of memory and had to be shared by multiple
| simultaneous users. A lot of the Unix tradeoffs can be
| traced back to limitations of 1970s hardware.
|
| Plus you weren't going to need more than one anyway, your
| program wasn't going to be big enough.
| Zambyte wrote:
| Because they were written with the expectation that they
| would be used in composable programs that each do one thing
| well.
| kragen wrote:
| Well, if you wanted multiple dbm files or hash tables or
| whatever, it was easy enough to run multiple processes.
| Unix processes on the PDP-11 could only address 64 KiB of
| memory--PDP-11 addresses are 16-bit, and C strongly
| encourages a von Neumann model where your program and data
| are in the same address space, so you couldn't even use 64
| KiB for your program and another 64 KiB for your data,
| though the PDP-11 hardware was capable of doing this. (On
| things like the AVR, this is handled by copying your data
| into RAM at startup unless you mark it PROGMEM, in which
| case you can't pass pointers to it to ordinary functions.
| PDP-11 Unix didn't have anything like this.)
|
| As long as you don't have multiple threads (and you
| shouldn't, and on Unix you couldn't) the static-variable
| interface in these cases was usually more convenient,
| though more bug-prone.
|
| There's also a micro-efficiency question.
|
| Global variables are at a known memory address. If you put
| them in a struct and pass around a pointer to it, they are
| at a dynamically computed memory address. Every time you
| want to access them, you have to add the field offset to
| the struct's base pointer. This is a small amount of extra
| computation, and how much it actually costs you depends on
| the hardware--on the 8080 it's a mess that probably
| requires you to spill a couple of registers onto the stack,
| and on modern implementations of amd64 like the ones from
| Intel, it probably takes literally zero extra time, because
| you probably aren't keeping all of your addition-capable
| functional units busy anyway.
|
| I don't know the PDP-11's instruction set well enough to
| say, but I suspect the cost of such indexed addressing was
| intermediate between these extremes: much like the 8086, it
| had an "index" addressing mode, mode 6, which added an
| immediate 16-bit offset to a base register
| https://pages.cpsc.ucalgary.ca/~dsb/PDP11/AddrModes.html,
| and it doesn't seem to have had a pure immediate addressing
| mode where the address of the variable is just an immediate
| word without any indexing. But if you didn't need a base
| register for your variables, you could use PC-relative
| addressing, thus saving a register and decreasing register
| pressure (the PDP-11 only had 8 general-purpose registers,
| and one of them was PC, so it really only had 7).
|
| I don't have any idea what actual PDP-11 compilers did,
| though, and that matters a lot here.
| tyingq wrote:
| A fair amount of the timeline for Unix was when it
| supported multiple processes, but not multiple threads.
| SunOS/Solaris didn't have threads until 1992, for example,
| and it still didn't support SMP, so global state wasn't an
| issue initially. Strtok() predates 1992. They did add
| strtok_r() once Posix threads were a thing.
| ludocode wrote:
| hcreate_r() can be used on some of the BSDs as well, but of
| course they're not 100% compatible with GNU. I've implemented
| a platform-independent POSIX hsearch as a wrapper to a
| templated hashtable in Pottery:
|
| https://github.com/ludocode/pottery/tree/master/examples/pot.
| ..
|
| Be warned, hsearch is terrible. You should probably only use
| this if you have some existing code that depends on hsearch
| and you want to make it portable.
| smcameron wrote:
| The article actually does both (uses bsearch, and also the home
| made one) presumably just to demonstrate both.
| smcameron wrote:
| (I noticed this because I was wondering why the cmp()
| function was taking void pointers only to cast them to
| "item*", and it occurred to me it would make sense if it were
| passed as a function pointer to a generic library function
| with no knowledge of "item*"... as it is when bsearch is
| used.)
| e12e wrote:
| Heh, I missed that - just skimming the code on mobile. Maybe
| there should be a slight edit in the text and/or the custom
| binary search could be dropped (as it's not the main focus of
| the article).
| SavantIdiot wrote:
| How are keys removed so that the table doesn't overflow? Are you
| really going to recommend growing the table by duplicating it?
|
| > Beware: this is a teaching tool and not a library, so it's not
| very well tested.
|
| Dear OP: This really bugs me. If you're not going to put the
| effort in, and instead crank out some content that might be
| buggy, then please don't pollute the web with more half-assed
| "look what I can do" pages.
| kljsandjb wrote:
| 100% agree with U.
| Tyrubias wrote:
| Yes, I believe you are intended to duplicate the table once a
| certain load factor (fraction of elements filled) of your
| choosing is reached.
|
| For linear probing, this is recommended to keep your probe
| lengths logarithmic:
|
| https://www.sciencedirect.com/science/article/abs/pii/019667...
|
| Removing elements for a hash table using linear probing is
| fairly non-trivial. Without using so-called "tombstones" to
| flag an element as deleted, you essentially have to shift
| elements up in the table.
|
| Also, your rudeness is really quite unwarranted. If everyone
| took that attitude, where would the teaching examples ever come
| from?
| humantorso wrote:
| This is the worst attitude I have seen on HN. Look out people -
| it's the internet's gatekeeper - making sure only
| legitimate/fully thought out things are published to the
| internet. By golly, we wouldn't want to waste bits and bytes on
| just random things being available on the internet.
| st_goliath wrote:
| > Are you really going to recommend growing the table by
| duplicating it?
|
| Um... yes.
|
| You want the data linearly in memory in the first place so you
| can have constant time access by index. If you want to grow a
| dynamic array, doing a realloc can easily cause the same effect
| as the article does manually (alloc new, copy data, free old),
| because the memory immediately after may be occupied by
| something else if you did _any_ other malloc/calloc. Copying
| over the data is linear in time, thus degrading your constant
| time insert to linear time as soon as the table is full if you
| do a realloc for single items instead.
|
| You may find the choice of growing _when half full_ as in the
| article questionable, because then there will always be an
| unused half, but pretty much every remotely practical dynamic
| array implementation in C (and probably a ton of other
| languages) does that (grow by capacity * 2 _when full_) to
| maintain _amortized_ constant time when appending.
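|
| A sketch of that grow step for a plain dynamic array
| (hypothetical names; a hash table would re-insert items rather
| than memcpy them, since bucket indices depend on the capacity):
|
|     #include <stdlib.h>
|     #include <string.h>
|
|     /* Double the capacity: allocate a new block, copy the old
|        items over, free the old block. Returns 0 on failure. */
|     static int grow(int **items, size_t *cap) {
|         size_t new_cap = *cap ? *cap * 2 : 16;
|         int *bigger = malloc(new_cap * sizeof *bigger);
|         if (bigger == NULL)
|             return 0;
|         if (*cap)
|             memcpy(bigger, *items, *cap * sizeof *bigger);
|         free(*items);
|         *items = bigger;
|         *cap = new_cap;
|         return 1;
|     }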
| tyingq wrote:
| It's also how many popular, in production, hashtable
| implementations work.
|
| From the uthash user guide, for example:
|
| _" Normal expansion - This hash attempts to keep fewer than
| 10 items in each bucket. When an item is added that would
| cause a bucket to exceed this number, the number of buckets
| in the hash is doubled and the items are redistributed into
| the new buckets. In an ideal world, each bucket will then
| contain half as many items as it did before."_
|
| So, the person questioning "why" should instead be explaining
| "why not".
| solids wrote:
| The problem with your way of thinking, is who draws the line to
| define what's "half-assed"? Because everything can be improved.
| You should take ideas from this kind of articles, not fully
| working code.
| the-dude wrote:
| This comment bugs me.
| tyingq wrote:
| I suppose it's a personal preference thing, but I like
| oversimplified implementations when the purpose is educational.
|
| I am more bugged by the accusation that the author did this
| because _" look what I can do"_. I don't see any evidence of
| that.
| ffdeadbeef wrote:
| do we really need more hostile, sneering, condescending, toxic
| pricks like you policing the web?
| kstenerud wrote:
| A teaching tool is deliberately simplified so as not to
| overwhelm students with details they don't need to worry about
| yet. This is absolutely the right way to go about it.
| le-mark wrote:
| >Are you really going to recommend growing the table by
| duplicating it?
|
| Setting a reasonable initial size then doubling it and copying
| when (if!) it needs to grow is a pragmatic, reasonable approach
| that's quite common!
| knome wrote:
| https://github.com/python/cpython/blob/master/Objects/dictob.
| ..
| ape4 wrote:
| This is a fun exercise, but if anyone is thinking about doing
| this for a project - don't - use a library.
| philwelch wrote:
| "But Doctor, I am Pagliacci!"
| kevin_thibedeau wrote:
| Existing libraries make decisions that aren't necessarily
| suitable for all applications. Most hash table code makes
| excessive use of malloc/realloc or has power of two growth
| which is problematic in memory constrained systems. If you want
| to minimize waste by using high load factors, you're going to
| want Robin Hood hashing, which is still relatively uncommon in
| library code.
| attractivechaos wrote:
| Although Robin Hood hashing has good performance under high
| load, it needs to record the probe length in each bucket.
| This partly cancels its advantage, especially for small keys.
| In my evaluation, Robin Hood hashing doesn't use less memory
| in comparison to others.
|
| > _Most hash table code makes excessive use of malloc
| /realloc or has power of two growth_
|
| Libraries based on open addressing only call malloc when
| there is not enough room. That is not "excessive". Also, if
| you know the maximum size before hand, many libraries allow
| you to preallocate buckets such that you never trigger
| rehashing in this case.
| ModernMech wrote:
| Small nit: in table 2 with the columns labeled "Key, Hash, Hash
| modulo 16", the thing called "hash" is actually known as a
| "prehash", while the "hash modulo 16" is the hash. In the context
| of hash tables, pre-hashing is when you take something like a
| string or a data structure and turn it into a number. Hashing is
| the mapping from that number to one of the hash table bins. The
| distinction is often lost in practice, but if we're talking about
| hash table design and implementation it's important to make this
| clear at least once.
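|
| In code, the split looks something like this (FNV-1a shown as
| one common choice of prehash; the names here are made up):
|
|     #include <stdint.h>
|     #include <stddef.h>
|
|     /* "Prehash": turn a string into a number (FNV-1a here). */
|     static uint64_t prehash_str(const char *s) {
|         uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
|         while (*s) {
|             h ^= (unsigned char)*s++;
|             h *= 1099511628211ULL;            /* FNV prime */
|         }
|         return h;
|     }
|
|     /* "Hash" in the sense above: map the number onto a bucket. */
|     static size_t bucket_index(uint64_t prehashed, size_t nbuckets) {
|         return (size_t)(prehashed % nbuckets);
|     }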
| joosters wrote:
| Is that always true? It depends whether or not you think of the
| hash as a property of the hashtable, or a property of the key
| itself. e.g. you may have string objects that store their hash
| values internally - a useful optimisation if you are dealing
| with lots of long read-only strings, given that the hash
| function will have a cost of at least O(n) for the string
| length. Inserting a string into a hashtable wouldn't change its
| hash (what if the string is inserted into two tables of
| different sizes?)
| ModernMech wrote:
| Right, it does depend on this perspective, which is why I say
| the distinction is often lost in practice. But from the
| perspective of hash table designers (at least historically),
| the hash can in fact change and is dependent on the size of
| the table. The distinction is important because the two
| operations are separable. I guess people didn't want to call
| one thing a "hash" and the other "hash mod k" because there
| are ways to do this other than mod. They could have called
| one "hash" and the other "post hash" but it is what it is.
|
| All I'm trying to say is if you want to know more about hash
| tables, and you bother to venture into the literature or some
| textbooks, you're going to run into this distinction sooner
| rather than later, so I wanted to point that out here.
| varajelle wrote:
| > Even while writing this I realized I'd used malloc instead of
| calloc to allocate the entries array, which meant the keys may
| not have been initialized to NULL.
|
| I think it is still UB to allocate a struct containing pointers
| with calloc without explicitly setting these pointers to 0 (or
| NULL). The C standard doesn't specify the binary representation
| of the null pointer value. In practice, that will work on
| virtually all current platforms, but it's still UB.
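|
| The pedantically portable alternative would look something like
| this (hypothetical entry type):
|
|     #include <stdlib.h>
|
|     struct entry { const char *key; void *value; };
|
|     /* Don't rely on all-bits-zero being a null pointer: store
|        NULL into each pointer field explicitly after the calloc. */
|     struct entry *alloc_entries(size_t n) {
|         struct entry *e = calloc(n, sizeof *e);
|         if (e == NULL)
|             return NULL;
|         for (size_t i = 0; i < n; i++) {
|             e[i].key = NULL;
|             e[i].value = NULL;
|         }
|         return e;
|     }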
| ajross wrote:
| I think that's true only in the narrowest sense of the C
| language standard. At the ABI and runtime linkage level,
| platforms (yes, every single one of them) document data in the
| .bss section to be zero, and C compilers place pointers there
| that are defined to be initialized to NULL per the standard.
|
| So, no, in practice it's well-defined, fully specified behavior
| on any given platform.
| ncmncm wrote:
| Any particular platform can declare it to be defined, by fiat.
| "With our compiler, all zeroed raw storage treated as pointers
| are null."
|
| Anywhere that they technically could, you can treat it as if
| they actually had so declared, and just didn't publicize it
| enough.
| dpedu wrote:
| NULL is guaranteed to equal integer 0 by standard, and
| converting 0 to a pointer is guaranteed to be a NULL pointer.
|
| > An integer constant expression with the value 0, or such an
| expression cast to type void *, is called a null pointer
| constant. If a null pointer constant is converted to a pointer
| type, the resulting pointer, called a null pointer, is
| guaranteed to compare unequal to a pointer to any object or
| function.
| layer8 wrote:
| An integer constant expression is a syntactical construct,
| not a runtime entity. The C standard provides no guarantee
| that bytes (chars) with value zero can be read from memory as
| a null pointer.
| adamrezich wrote:
| wait what? my C is a bit rusty but when would reading 0
| from memory & casting to a (void?) pointer type ever !=
| NULL?
| layer8 wrote:
| C implementations are allowed to use a not-all-bits-zero
| representation for null pointer values. They are merely
| required to convert integer constant expressions with
| value zero to a null pointer value -- regardless of how
| the resulting null pointer value is represented in
| memory.
|
| See also these FAQs:
|
| http://c-faq.com/null/machnon0.html
|
| http://c-faq.com/null/machexamp.html
| cygx wrote:
| Implementation-defined as far as I'm aware. If pointers with
| all bits zero actually represent null pointers, things will be
| fine.
|
| _edit:_
|
| I haven't found any smoking gun quotation in the C11 standard
| on short notice, but there is footnote 296 (p.348) on calloc()
| zeroing all bits:
|
| _> Note that this need not be the same as the representation
| of floating-point zero or a null pointer constant._
|
| Nothing about this being UB in cases where it _does_ represent
| a null pointer constant.
| benhoyt wrote:
| Wow, thanks, I did not know that. Though that would be a
| kinda crazy platform or implementation! I'm betting that a
| lot of people depend on this.
| yudlejoza wrote:
| 99% of the effort is 'what hash function?'
| varjag wrote:
| FWIW there's a standard hash table API in POSIX.
| jandrese wrote:
| Yeah, but it's pretty lame since you have to specify the size
| of the table up front and can't change it later. They didn't
| implement the hard stuff!
| queuebert wrote:
| This might sound stupid, but could you store N-1 entries, and
| then store a pointer to the next hash as the Nth entry? So
| you dynamically add fixed tables as you run out of space?
|
| Yes, I realize the lookup would get complicated.
| banana_giraffe wrote:
| Nope. The POSIX API lets you have one table.
|
| There is a GNU extension that lets you have more than one
| table.
| ludocode wrote:
| There are many reasons why it's lame! Here's a short list:
|
| - There is only one global hash table for the whole process!
| If you want individual hash tables you need to use
| implementation-specific extensions (hcreate_r().)
|
| - There is no way to remove a key from a hash table. No
| implementation I know of provides an extension to do it. If
| you want to truly remove a key you must destroy and rebuild
| the table.
|
| - There is no requirement for a means to grow the table. On
| some platforms it's possible to run out of space. If you want
| to truly grow you must destroy and rebuild the table.
|
| - There is no standard way to free keys or data stored in the
| table. Destroying the table leaks all contents; you must keep
| track of keys and data externally and free them yourself.
| Only OpenBSD frees keys by default, and only NetBSD has
| extensions to call callbacks to free keys and data.
|
| - Keys must be strings. You cannot provide custom types or
| void pointers as keys. There is no way to specify custom hash
| or comparison functions. The quality of the hash function is
| unspecified.
|
| - When using ENTER, you may not want to allocate the key
| unless it is necessary to insert a new entry. Since the given
| key is inserted directly, it's not necessarily safe to use a
| stack-allocated buffer for lookups. It's awkward to replace
| the key with an allocation after insertion and it's unclear
| whether it's allowed.
|
| This doesn't even get into incompatibilities between the
| various implementations. You will encounter all of the above
| flaws even if your only target is glibc.
|
| No one should ever be encouraged to use POSIX hash tables.
| They should be deprecated and we should all just pretend they
| don't exist.
| jandrese wrote:
| IMHO this is one area where the C committee could improve
| the situation by mandating a sane and useful hashtable
| implementation for the stdlib.
|
| Or at the very least just deprecate the old one. I can't
| remember ever seeing it used in code.
| TonyTrapp wrote:
| If you had to implement the hash table every time you wanted to
| use one, you'd probably realize that there is a better
| alternative data structure. Quite often your data is not as
| randomly distributed as you may initially think it is.
| kragen wrote:
| One of the cool nuances of hash tables is that you can exploit
| the nonrandomness of your data to get a more uniform
| distribution of keys over hash buckets than you would get with
| an ideal random-oracle hash function. For example, it's very
| common to use hash functions which will produce four
| consecutive output values (possibly shuffled) for string keys
| "r0", "r1", "r2", and "r3", which guarantees that those keys
| will not collide with each other. Hashpjw is one such widely-
| used hash function; FNV is another. (Better not use linear
| probing with such a hash function, though! And it's probably
| better not to use them in cases where the keys might be chosen
| adversarially to produce a DoS.)
| jolux wrote:
| What?
| selljamhere wrote:
| I think the parent comment is highlighting that a hash table
| is just one implementation of a "Map" abstract data type.
| Depending on the use case, other implementations (such as
| using a heap, in the C++ STL) may have better performance.
| wffurr wrote:
| Or for the example of word frequencies in the article, a
| prefix trie.
| jolux wrote:
| True, but as far as I know it's rarely the worst choice you
| can make, and it's plenty fast for a lot of applications. I
| don't know that there's a single data structure that's
| always better, which is what the GP seems to be implying.
| iforgotpassword wrote:
| > Quite often you data is not as randomly distributed as you
| may initially think it is.
|
| That's why it's important to pick the right hash function. A
| hash table usually goes a long way before it becomes a
| performance issue.
___________________________________________________________________
(page generated 2021-03-26 23:01 UTC)