[HN Gopher] Solving "Two Sum" in C with a tiny hash table
___________________________________________________________________
Solving "Two Sum" in C with a tiny hash table
Author : ingve
Score : 102 points
Date : 2023-06-27 20:53 UTC (2 days ago)
(HTM) web link (nullprogram.com)
(TXT) w3m dump (nullprogram.com)
| mananaysiempre wrote:
| Linear probing for a hash table that you're not going to delete
| things from. Textbooks usually tell you to prefer other probing
| patterns unless you need deletion (and e.g. Go interface casts
| use quadratic probing[1]). Or is it still advantageous for some
| reason? (Locality?)
|
| [1]
| https://github.com/golang/go/blob/03cd8a7b0e47f2141a1dbdad9f...
| via https://lukasatkinson.de/2018/interface-dispatch/
| badtension wrote:
| I was really hoping for something more creative than the
| traditional hashmap approach.
|
| Still pretty cool, I wish I was that confident with rolling my
| own data structures.
| abhishekjha wrote:
| All you need is a murmur hash function to create a hash table.
| That is something that people should copy.
| tialaramex wrote:
| This (a hash table) is a basic vocabulary type, and it's only
| _barely_ understandable that C doesn't provide one given
| when it was invented. The hash table which people are likely
| to knock together this way in C isn't very good, which makes
| sense - but if the C library provided one then the effort to
| do it well only needs to happen once.
|
| This isn't like say, Bloom filters where most people don't
| need one. It's like sort, which I notice C _does_ provide in
| its standard library.
| camel-cdr wrote:
| > but if the C library provided one then the effort to do
| it well only needs to happen once.
|
| It would've happened only once, but in 1987, and then we
| would've stuck with it, because everybody relies on the
| exact behavior.
|
| Just look at `rand()`, it's basically unusable.
| jandrese wrote:
| It's not technically C stdlib, but POSIX.1 is the next best
| thing:
|
| https://linux.die.net/man/3/hcreate
|
| That said, the implementation leaves a lot to be desired.
| Xeoncross wrote:
| I'm sure my next interview will require I know this "optimum
| solution".
|
| Jokes aside, I'm very glad to see another article using more than
| a single language to showcase something. Chances are much higher
| I can read one language well, or parts of both.
| taeric wrote:
| I used to use this as an interview question. I passed people
| more for being able to talk about different tradeoffs than any
| optimum solution, though. Gave a yes vote to more than a few
| that messed up and had bugs. (Granted, most of the bugs were in
| extending this to the "3 sum" idea.)
| [deleted]
| [deleted]
| floor_ wrote:
| Just a warning that this is secretly a Go article for those who
| are Go-intolerant.
| chongli wrote:
| Yes. There's not a single line of C code in it. It's
| practically a bait-and-switch.
|
| _Edit: you're right. There is C code intermixed with Go. It's
| tricky to tell apart on a skim. Need to read carefully_
| o1y32 wrote:
| ... I'm sure there is some C code there. In fact most of the
| code is in C.
| codetrotter wrote:
| Well, at least it's not Rust am I right fellas? Hahahah
|
| _(I like Rust.)_
| Const-me wrote:
| Given the constraints, the proposed implementation is not
| efficient at all. Here's a quick benchmark to compare two
| implementations: https://gist.github.com/Const-
| me/265b8a3e89f17f4268d34d64771...
|
| In theory, the hash map is O(n), and std::sort approach is
| O(n*log(n))
|
| And yet when measured on my computer, std::sort is more than
| twice as fast. Specifically, for hash map my program prints
| 3574736 (best of 10 tries), for std::sort only 1695712. My CPU
| base frequency is 3.8 GHz, so these numbers measured with RDTSC
| translate to 941 and 446 microseconds, respectively.
|
| I'm using a CPU with AMD Zen 3 cores, and compiled that code for
| AVX2 ISA, with Visual Studio 2022.
| attractivechaos wrote:
| std::unordered_map is notoriously slow, several times slower
| than a fast hashmap implementation like Google's absl or
| Martin's robin-hood-hashing [1]. In addition, your
| implementation lets the hashmap grow from a small size. As you
| know the total length, you could preallocate the hash table and
| avoid rehashing. As typical hash table implementations double
| the capacity during rehashing, preallocating the buckets should
| be about twice as fast if we don't consider the cache and the
| complex mechanism behind the STL allocator.
|
| That said, std::sort is not the fastest sort implementation,
| either. It is hard to say which is faster for a small N of
| 10000. Your general point is still right in that an O(n*log(n))
| algorithm is not necessarily slower than an O(n) algorithm in
| practice.
|
| [1]: https://github.com/martinus/robin-hood-hashing
| _a_a_a_ wrote:
| > Your general point is still right in that an O(n*log(n))
| algorithm is not necessarily slower than an O(n) algorithm in
| practice.
|
| Sounds reasonable
|
| > for a small N of 10000
|
| Ah, now that is not what I would call a small N
| tialaramex wrote:
| 10_000 is easily in the ballpark of constant factors k
| which really happen in software.
|
| Suppose my O(n) algorithm is a random access walk of nums.
| So I'm causing ~10_000 memory fetches, maybe they each take
| k=100 cycles, so it takes 1 million cycles total.
|
| But the O(n*log(n)) algorithm happens to be linear, it does
| access some nearby objects more than once, but every damn
| access is in cache. That's only 100_000 cycles, it's _ten
| times faster_ despite you not thinking 10000 is a small N.
|
| Stalls are _huge_. A modern CPU can do a crazy amount of
| arithmetic in the time it takes to fetch from even an L3
| shared cache, never mind main memory. If you need to go
| out, to the network or to "disk" it starts to look faster
| to do a whole bunch of difficult work again instead. Oh the
| 2417th digit of Pi is on the hard disk? Thanks but I'll
| just calculate it again instead 'cos I don't got all day.
| _a_a_a_ wrote:
| Thank you for the lesson in memory access times; I actually
| know something about the subject already :-)
|
| > But the O(n*log(n)) algorithm
|
| I think you're leaving out the "* log(n)" part. But
| on reflection, you have made your point, thank you.
| aidenn0 wrote:
| If you pick base 2 for the log, then log2(10000) is about 13. So
| if you are a constant factor of >13 faster for the work
| that dominates, the n*log(n) will be faster.
|
| Slightly OT:
|
| I remember in my algorithms class we used to joke that
| log(log(N)) was the same as a constant factor of 5 (in the
| 64-bit era, perhaps we should say 6).
| creata wrote:
| Can someone explain why std::unordered_map _has_ to be slow?
|
| I vaguely remember that it's because it isn't allowed to
| invalidate references when inserting elements.
| tialaramex wrote:
| Yes, it's obliged to be the design you'd have been taught
| in the 1980s or maybe 1990s in a typical data structures
| class, linked lists of entry nodes rather than a modern
| open addressed map.
|
| You can satisfy typical _real world_ needs with better
| structures now, even when people want stable references to
| the nodes, but you can't implement the API that
| std::unordered_map promises with the faster choices. So
| there are "near drop-in" replacements, but you can't
| actually swap those into the C++ standard library.
|
| The chained design has terrible locality, which means doing
| lookups (failed or successful) in the hash table will cause
| more memory fetches, which are very slow, compared to a
| modern design which was optimised for this.
|
| Imagine we're looking at a Swiss Table, we're looking for
| 12530, we hash that, we get an index into our hash table,
| that causes our first memory fetch, to pull in the RAM
| where that part of the hash table lives. We check the key,
| this one is wrong, we step to the next, it's in the same
| blob of RAM but it's wrong, we step to the next, it's
| empty, we're done, one memory fetch, nothing found.
|
| Now std::unordered_map, we're looking for 12530, we hash
| it, we get an index into our hash table, we do a memory
| fetch. We get a pointer to a node. We do another memory
| fetch, we have a key to examine, it's wrong, we get a
| pointer to another node, we do another memory fetch, we
| have another key to examine, this one is wrong too, but
| there are no more pointers, _three_ memory fetches and
| nothing found.
| kccqzy wrote:
| That's exactly it. Also because of that, it has to be
| implemented with chaining rather than open addressing.
| Const-me wrote:
| Adding `map.reserve( vec.size() );` indeed helped, but still
| slower than sorting these entries. Specifically, the time for
| the hash map is 740 microseconds this way.
|
| I think the main reason for the observed behavior is memory
| management. The sorted vector takes under 80kb of RAM,
| allocated with a single malloc() call, and the entire vector
| fits in L2 cache which is very fast to access. OTOH, without
| advanced shenanigans like custom allocators, the hash map
| needs 1 malloc() call per entry.
|
| On a general note, I noticed that when the data being handled
| fits in the faster in-core caches, dense buffers pretty often
| perform best, despite the theoretical big-Os. Even if this
| means spending some extra time to sort the data.
| attractivechaos wrote:
| > _the [std::unordered_map] hash map needs 1 malloc() call
| per entry._
|
| This is a key reason why std::unordered_map is several
| times slower than fast hashmap implementations that only
| use a couple of long flat arrays per hashmap object.
| Nonetheless, common STL implementations use their own
| allocators by default and would not call malloc() on small
| objects frequently.
| nly wrote:
| If you wanted to see the effect of memory allocations then
| you could use boosts intrusive unordered_set implementation
|
| https://www.boost.org/doc/libs/1_82_0/doc/html/boost/intrus
| i...
|
| It doesn't own the underlying memory, so you can
| preallocate your nodes
| charlieyu1 wrote:
| Just learning c++, curious how reserve works. I thought map
| would allocate memory exponentially automatically, so an
| early reserve at best saves you a few reallocations.
| Const-me wrote:
| That reserve() method only reserves memory for the hash
| table inside the container. It does help with performance
| to an extent, but the memory for the items stored in that
| map is still allocated one by one, as new entries are
| inserted into the map.
|
| I think the main reason why reserve() helped
| substantially wasn't saved malloc() calls, it was saved
| re-hashes. Each time the hash table grows, it needs to
| re-insert all items currently stored in the map into the
| new, larger table of buckets.
| tialaramex wrote:
| Sort of. For std::unordered_map there are two types of
| memory allocation, there's a table, of pointers, and then
| there are nodes, which themselves have a pointer to a
| next node if any, forming a "chain".
|
| The table is indeed grown exponentially as you describe.
| If you tell std::unordered_map that you expect 10000
| entries using reserve, it will calculate how big it would
| expect to grow a table for 10000 entries and make it that
| size immediately.
|
| The nodes however are created when you insert data. Even
| though you called reserve, it is an allocation each time
| you insert to make space for the new node.
|
| If you used a Swiss table (e.g. from Abseil) the reserve
| behaves more as a person might expect, it will allocate a
| data structure big enough to insert at least 10 000
| entries successfully, and no further allocations are
| needed unless you keep going.
| tromp wrote:
| Sorting provides an alternative solution: sort the array, and
| then traverse a linear path over pairs of indices from (0,n-1)
| to (n-1,0) whose sum straddles the target:
|
|     int main() {
|         /* read in and sort array a */
|         int i = 0, j = N - 1, sum;
|         while ((sum = a[i] + a[j]) != tgt) {
|             if (sum < tgt) i++;
|             else j--;
|         }
|         printf("%d %d\n", i, j);
|     }
| colejohnson66 wrote:
| The hashmap solution here is ideally O(n) as you only iterate
| once (assuming O(1) lookups and inserts in the map). A
| sort/search method, even using quicksort, would be O(n*log(n)).
| enedil wrote:
| Even if it were O(1) not amortized, it's not possible to beat
| the sorting in actual execution time for reasonable inputs
| (less than 10B elements probably)
| tromp wrote:
| You can use radix sort which is linear for fixed width ints.
| jjgreen wrote:
| O(n*log(n) + n) = O(n*log(n))
| topposter32 wrote:
| What's log base 2 of a million or a billion?
| tromp wrote:
| Although hashmaps are O(n) in practice, they can be worse
| than O(n log n) on some rare inputs that cause lots of hash
| collisions. Several sorting algorithms such as mergesort and
| heapsort are guaranteed O(n log n).
| nkurz wrote:
| I think it's a little trickier than that, since the problem
| asks for the indices not the numbers, which would be lost in
| the sorted version. You'd have to keep an additional mapping
| from the sorted to original positions. And are we counting the
| sorting and space for the maps as free? Or do we have to
| account for them somehow?
|
| Mostly, I just don't like the problem specification. What does
| it actually mean to "solve it efficiently"? How many times are
| we going to be searching each input array? How much memory do
| we have available? What tradeoff between memory and computation
| is allowable? How much does a comparison cost us versus
| accessing memory? Are all targets equally likely?
|
| Assuming as theoretical computer scientists tend to do that we
| can ignore all constant time operations, that access to memory
| is free, that an unknown but fixed amount memory is available,
| and that we'll be searching the same array forever ruins the
| problem for me. Fill in the numbers and we can talk about
| what's optimal, but the "spherical cow" approach does nothing
| for me.
| tialaramex wrote:
| For the specifics, as someone else observed, walking nums to
| find the values we've discovered and identify their indices
| is very affordable, it's N operations, and if we can't afford
| N we're in trouble since we definitely need to do at least N
| comparisons.
|
| To your general point, sure, one thing that's nice (but takes
| a colossal effort) about Advent of Code is that it has
| specific inputs for players, and they have specific answers,
| so although it's usual for participants to write code which
| could (you hope) solve the problem in the abstract, the
| actual test is whether you solved the input you actually had.
| Sometimes not all conceivable inputs are even tractable (and
| some people find this situation frustrating if it happens);
| only the inputs given to real players may be tractable.
|
| For this exercise the provided Go solution gets to handwave a
| bunch of problems. For example, Go's "map" hash table
| allocates whenever it wants to and has amortized table growth
| on top. That can really add up for a problem like this.
|
| If I attacked it in Rust I would make
| HashMap::with_capacity(nums.len()) which does a single
| allocation up front, I might not _need_ all this space if I
| exit early, but it won 't hurt. I don't know if there's even
| a way to write that in Go.
| marginalia_nu wrote:
| For an input array of values I[], create an array A = [ 0 ...
| n-1 ] where n = len(I). Sort the indices in A by their
| corresponding value in I. Now A _is_ the mapping: A[k] is the
| index in I of the (k+1)-th largest value, and you can binary
| search in the same way, by comparing I[A[i]].
|
| This is strictly more space efficient than a hash table.
| You're not going to win any awards for being cache efficient,
| but then again, when the competition is a damn hash table
| it's not too bad ;P
| raverbashing wrote:
| > You'd have to keep an additional mapping from the sorted to
| original positions.
|
| 2*O(n) is still O(n)
| taeric wrote:
| Though, at the low level, it is perfectly valid to compare
| the full costs, not just the growth rates. Right?
| ape4 wrote:
| It seems like it could be a "waste" to make a hash or sorted
| list only to throw it away after one use. Depending on
| what's required.
| [deleted]
| tromp wrote:
| > I think it's a little trickier than that, since the problem
| asks for the indices not the numbers, which would be lost in
| the sorted version.
|
| Good point.
|
| > You'd have to keep an additional mapping from the sorted to
| original positions.
|
| It's simpler to just find the two numbers in the original
| unsorted array.
___________________________________________________________________
(page generated 2023-06-29 23:02 UTC)