[HN Gopher] Solving "Two Sum" in C with a tiny hash table
       ___________________________________________________________________
        
       Solving "Two Sum" in C with a tiny hash table
        
       Author : ingve
       Score  : 102 points
       Date   : 2023-06-27 20:53 UTC (2 days ago)
        
 (HTM) web link (nullprogram.com)
 (TXT) w3m dump (nullprogram.com)
        
       | mananaysiempre wrote:
       | Linear probing for a hash table that you're not going to delete
       | things from. Textbooks usually tell you to prefer other probing
       | patterns unless you need deletion (and e.g. Go interface casts
       | use quadratic probing[1]). Or is it still advantageous for some
       | reason? (Locality?)
       | 
       | [1]
       | https://github.com/golang/go/blob/03cd8a7b0e47f2141a1dbdad9f...
       | via https://lukasatkinson.de/2018/interface-dispatch/
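For readers who haven't seen it, the linear-probing pattern under discussion can be sketched in a few lines of C (names like `upsert` and the table size are illustrative, not the article's actual code):

```c
#include <stdint.h>

#define EXP 10  /* table holds 1<<EXP slots (power of two for masking) */

struct table {
    int32_t keys[1 << EXP];
    int32_t vals[1 << EXP];
    uint8_t used[1 << EXP];
};

/* Multiplicative hash; any decent integer mixer works here. */
static uint32_t hash32(int32_t x)
{
    uint32_t h = (uint32_t)x * 0x9e3779b9u;
    return h ^ (h >> 16);
}

/* Insert key->val if absent, else return the existing slot.
 * Linear probing: on collision, step to the next slot, wrapping
 * via the mask. Assumes the table never fills completely. */
static int32_t *upsert(struct table *t, int32_t key, int32_t val)
{
    uint32_t mask = (1 << EXP) - 1;
    for (uint32_t i = hash32(key) & mask; ; i = (i + 1) & mask) {
        if (!t->used[i]) {          /* empty slot: insert here */
            t->used[i] = 1;
            t->keys[i] = key;
            t->vals[i] = val;
            return &t->vals[i];
        }
        if (t->keys[i] == key)      /* key already present */
            return &t->vals[i];
    }
}
```

One locality argument for linear probing: after the first probe, subsequent probes are adjacent slots, usually on the same cache line, whereas quadratic probing jumps farther with each step.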
        
       | badtension wrote:
       | I was really hoping for something more creative than the
       | traditional hashmap approach.
       | 
       | Still pretty cool, I wish I was that confident with rolling my
       | own data structures.
        
         | abhishekjha wrote:
          | All you need is a murmur hash function to create a hash table.
         | That is something that people should copy.
        
           | tialaramex wrote:
           | This (a hash table) is a basic vocabulary type, and it's only
            | _barely_ understandable that C doesn't provide one given
           | when it was invented. The hash table which people are likely
           | to knock together this way in C isn't very good, which makes
           | sense - but if the C library provided one then the effort to
           | do it well only needs to happen once.
           | 
           | This isn't like say, Bloom filters where most people don't
           | need one. It's like sort, which I notice C _does_ provide in
           | its standard library.
        
             | camel-cdr wrote:
             | > but if the C library provided one then the effort to do
             | it well only needs to happen once.
             | 
             | It would've happened only once, but in 1987, and then we
              | would've stuck with it, because everybody relies on the
             | exact behavior.
             | 
             | Just look at `rand()`, it's basically unusable.
        
             | jandrese wrote:
             | It's not technically C stdlib, but POSIX.1 is the next best
             | thing:
             | 
             | https://linux.die.net/man/3/hcreate
             | 
             | That said, the implementation leaves a lot to be desired.
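For reference, a minimal sketch of the POSIX API linked above in use. Note the limitations that make it disappointing in practice: one global table per process, string keys only, and no per-entry deletion (`posix_table_demo` is an illustrative name):

```c
#include <search.h>   /* POSIX hcreate/hsearch/hdestroy */

/* Create the (single, process-global) table, insert one entry,
 * look it up, and tear the table down again. */
int posix_table_demo(void)
{
    if (!hcreate(64))                /* room for roughly 64 entries */
        return -1;

    ENTRY e = { .key = "answer", .data = (void *)(long)42 };
    hsearch(e, ENTER);               /* insert */

    ENTRY q = { .key = "answer" };
    ENTRY *found = hsearch(q, FIND); /* lookup */
    int result = found ? (int)(long)found->data : -1;

    hdestroy();                      /* frees the table, not the keys */
    return result;
}
```

Storing an integer through the `void *data` field, as above, is a common but slightly grubby idiom; real code usually points `data` at a proper struct.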
        
       | Xeoncross wrote:
        | I'm sure my next interview will require I know this "optimum
       | solution".
       | 
       | Jokes aside, I'm very glad to see another article using more than
       | a single language to showcase something. Chances are much higher
       | I can read one language well, or parts of both.
        
         | taeric wrote:
         | I used to use this as an interview question. I passed people
         | more for being able to talk about different tradeoffs than any
         | optimum solution, though. Gave a yes vote to more than a few
         | that messed up and had bugs. (Granted, most of the bugs were in
         | extending this to the "3 sum" idea.)
        
           | [deleted]
        
         | [deleted]
        
       | floor_ wrote:
        | Just a warning that this is secretly a Go article for those
        | who are Go-intolerant.
        
         | chongli wrote:
         | Yes. There's not a single line of C code in it. It's
         | practically a bait-and-switch.
         | 
         |  _Edit: you're right. There is C code intermixed with Go. It's
         | tricky to tell apart on a skim. Need to read carefully_
        
           | o1y32 wrote:
           | ... I'm sure there is some C code there. In fact most code is
           | in C.
        
         | codetrotter wrote:
         | Well, at least it's not Rust am I right fellas? Hahahah
         | 
         |  _(I like Rust.)_
        
       | Const-me wrote:
       | Given the constraints, the proposed implementation is not
       | efficient at all. Here's a quick benchmark to compare two
       | implementations: https://gist.github.com/Const-
       | me/265b8a3e89f17f4268d34d64771...
       | 
       | In theory, the hash map is O(n), and std::sort approach is
       | O(n*log(n))
       | 
       | And yet when measured on my computer, std::sort is more than
       | twice as fast. Specifically, for hash map my program prints
       | 3574736 (best of 10 tries), for std::sort only 1695712. My CPU
       | base frequency is 3.8 GHz, so these numbers measured with RDTSC
       | translate to 941 and 446 microseconds, respectively.
       | 
       | I'm using a CPU with AMD Zen 3 cores, and compiled that code for
       | AVX2 ISA, with Visual Studio 2022.
        
         | attractivechaos wrote:
         | std::unordered_map is notoriously slow, several times slower
         | than a fast hashmap implementation like Google's absl or
         | Martin's robin-hood-hashing [1]. In addition, your
         | implementation lets the hashmap grow from a small size. As you
         | know the total length, you could preallocate the hash table and
         | avoid rehashing. As typical hash table implementations double
         | the capacity during rehashing, preallocating the buckets should
         | be about twice as fast if we don't consider the cache and the
         | complex mechanism behind the STL allocator.
         | 
         | That said, std::sort is not the fastest sort implementation,
         | either. It is hard to say which is faster for a small N of
         | 10000. Your general point is still right in that an O(n*log(n))
         | algorithm is not necessarily slower than an O(n) algorithm in
         | practice.
         | 
         | [1]: https://github.com/martinus/robin-hood-hashing
        
           | _a_a_a_ wrote:
           | > Your general point is still right in that an O(n*log(n))
           | algorithm is not necessarily slower than an O(n) algorithm in
           | practice.
           | 
           | Sounds reasonable
           | 
           | > for a small N of 10000
           | 
           | Ah, now that is not what I would call a small N
        
             | tialaramex wrote:
             | 10_000 is easily in the ballpark of constant factors k
             | which really happen in software.
             | 
             | Suppose my O(n) algorithm is a random access walk of nums.
             | So I'm causing ~10_000 memory fetches, maybe they each take
             | k=100 cycles, so it takes 1 million cycles total.
             | 
              | But suppose the O(n*log(n)) algorithm's memory access is
              | linear: it touches some nearby objects more than once, but every damn
             | access is in cache. That's only 100_000 cycles, it's _ten
             | times faster_ despite you not thinking 10000 is a small N.
             | 
             | Stalls are _huge_. A modern CPU can do a crazy amount of
             | arithmetic in the time it takes to fetch from even an L3
             | shared cache, never mind main memory. If you need to go
              | out, to the network or to "disk" it starts to look faster
             | to do a whole bunch of difficult work again instead. Oh the
             | 2417th digit of Pi is on the hard disk? Thanks but I'll
             | just calculate it again instead 'cos I don't got all day.
        
               | _a_a_a_ wrote:
                | Thank you for the lesson in memory access times; I
                | actually know something about the subject already :-)
               | 
               | > But the O(n*log(n)) algorithm
               | 
               | I think you're badly missing out the "* log(n)" part. But
               | on reflection, you have made your point, thank you.
        
             | aidenn0 wrote:
              | If you pick base 2 for the log, then log2(10000) is about 13. So
             | if you are a constant factor of >13 faster for the work
             | that dominates, the n*log(n) will be faster.
             | 
             | Slightly OT:
             | 
             | I remember in my algorithms class we used to joke that
             | log(log(N)) was the same as a constant factor of 5 (in the
             | 64-bit era, perhaps we should say 6).
        
           | creata wrote:
           | Can someone explain why std::unordered_map _has_ to be slow?
           | 
           | I vaguely remember that it's because it isn't allowed to
           | invalidate references when inserting elements.
        
             | tialaramex wrote:
             | Yes, it's obliged to be the design you'd have been taught
             | in the 1980s or maybe 1990s in a typical data structures
             | class, linked lists of entry nodes rather than a modern
             | open addressed map.
             | 
             | You can satisfy typical _real world_ needs with better
             | structures now, even when people want stable references to
              | the nodes, but you can't implement the API that
             | std::unordered_map promises with the faster choices. So
             | there are "near drop-in" replacements, but you can't
             | actually swap those into the C++ standard library.
             | 
             | The chained design has terrible locality, which means doing
             | lookups (failed or successful) in the hash table will cause
             | more memory fetches, which are very slow, compared to a
             | modern design which was optimised for this.
             | 
             | Imagine we're looking at a Swiss Table, we're looking for
             | 12530, we hash that, we get an index into our hash table,
             | that causes our first memory fetch, to pull in the RAM
             | where that part of the hash table lives. We check the key,
             | this one is wrong, we step to the next, it's in the same
             | blob of RAM but it's wrong, we step to the next, it's
             | empty, we're done, one memory fetch, nothing found.
             | 
             | Now std::unordered_map, we're looking for 12530, we hash
             | it, we get an index into our hash table, we do a memory
             | fetch. We get a pointer to a node. We do another memory
             | fetch, we have a key to examine, it's wrong, we get a
             | pointer to another node, we do another memory fetch, we
             | have another key to examine, this one is wrong too, but
             | there are no more pointers, _three_ memory fetches and
             | nothing found.
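The chained layout described above can be sketched as follows (illustrative names; each `p = p->next` step in the lookup is one of the extra memory fetches in question):

```c
#include <stddef.h>

/* One heap-allocated node per entry; each bucket is the head of
 * a singly linked chain of such nodes. */
struct node {
    int key, val;
    struct node *next;
};

struct chained {
    struct node **buckets;
    size_t nbuckets;
};

/* Lookup: one fetch for the bucket pointer, then one additional
 * fetch per node walked -- each potentially a cache miss, since
 * nodes live wherever the allocator put them. */
static struct node *chained_find(const struct chained *m, int key)
{
    size_t b = (size_t)(unsigned)key % m->nbuckets; /* trivial hash */
    for (struct node *p = m->buckets[b]; p; p = p->next)
        if (p->key == key)
            return p;
    return NULL;
}
```

An open-addressed table stores the keys inline in one flat array instead, so the same lookup usually touches a single cache line.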
        
             | kccqzy wrote:
             | That's exactly it. Also because of that, it has to be
             | implemented with chaining rather than open addressing.
        
           | Const-me wrote:
           | Adding `map.reserve( vec.size() );` indeed helped, but still
           | slower than sorting these entries. Specifically, the time for
            | the hash map is 740 microseconds this way.
           | 
           | I think the main reason for the observed behavior is memory
           | management. The sorted vector takes under 80kb of RAM,
           | allocated with a single malloc() call, and the entire vector
           | fits in L2 cache which is very fast to access. OTOH, without
           | advanced shenanigans like custom allocators, the hash map
           | needs 1 malloc() call per entry.
           | 
           | On a general note, I noticed that when the data being handled
           | fits in the faster in-core caches, dense buffers pretty often
           | perform best, despite these theoretical big O-s. Even if this
           | means spending some extra time to sort the data.
        
             | attractivechaos wrote:
             | > _the [std::unordered_map] hash map needs 1 malloc() call
             | per entry._
             | 
             | This is a key reason why std::unordered_map is several
             | times slower than fast hashmap implementations that only
             | use a couple of long flat arrays per hashmap object.
             | Nonetheless, common STL implementations use their own
             | allocators by default and would not call malloc() on small
             | objects frequently.
        
             | nly wrote:
             | If you wanted to see the effect of memory allocations then
             | you could use boosts intrusive unordered_set implementation
             | 
             | https://www.boost.org/doc/libs/1_82_0/doc/html/boost/intrus
             | i...
             | 
             | It doesn't own the underlying memory, so you can
             | preallocate your nodes
        
             | charlieyu1 wrote:
              | Just learning C++, curious how reserve works. I thought map
              | would allocate memory exponentially automatically, so an
              | early reserve at best saves you a few reallocations
        
               | Const-me wrote:
               | That reserve() method only reserves memory for the hash
               | table inside the container. It does help with performance
               | to an extent, but the memory for the items stored in that
               | map is still allocated one by one, as new entries are
               | inserted into the map.
               | 
               | I think the main reason why reserve() helped
                | substantially wasn't saved malloc() calls; it was saved
               | re-hashes. Each time the hash table grows, it needs to
                | re-insert all items currently stored in the map into the
                | new, larger table of buckets.
        
               | tialaramex wrote:
               | Sort of. For std::unordered_map there are two types of
               | memory allocation, there's a table, of pointers, and then
               | there are nodes, which themselves have a pointer to a
               | next node if any, forming a "chain".
               | 
               | The table is indeed grown exponentially as you describe.
               | If you tell std::unordered_map that you expect 10000
               | entries using reserve, it will calculate how big it would
               | expect to grow a table for 10000 entries and make it that
               | size immediately.
               | 
               | The nodes however are created when you insert data. Even
               | though you called reserve, it is an allocation each time
               | you insert to make space for the new node.
               | 
               | If you used a Swiss table (e.g. from Abseil) the reserve
               | behaves more as a person might expect, it will allocate a
               | data structure big enough to insert at least 10 000
               | entries successfully, and no further allocations are
               | needed unless you keep going.
        
       | tromp wrote:
        | Sorting provides an alternative solution: sort the array, and
        | then traverse a linear path over pairs of indices from (0,n-1)
        | to (n-1,0) whose sum straddles the target:
        | 
        |     int main() {
        |         /* read in and sort array a */
        |         int i = 0, j = N - 1, sum;
        |         while ((sum = a[i] + a[j]) != tgt) {
        |             if (sum < tgt)
        |                 i++;
        |             else
        |                 j--;
        |         }
        |         printf("%d %d\n", i, j);
        |     }
        
         | colejohnson66 wrote:
         | The hashmap solution here is ideally O(n) as you only iterate
         | once (assuming O(1) lookups and inserts in the map). A
          | sort/search method, even using quicksort, would be O(n log n).
        
           | enedil wrote:
            | Even if it were O(1), not amortized, it's not possible to
            | beat sorting in actual execution time for reasonable inputs
            | (less than 10B elements probably)
        
           | tromp wrote:
            | You can use radix sort, which is linear for fixed-width ints.
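A minimal sketch of that idea: LSD radix sort on 32-bit unsigned values does four counting-sort passes, one per byte, so the total work is O(4n) = O(n) for fixed-width integers (the function name and scratch-buffer convention are illustrative):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sort a[0..n-1] in place; tmp must hold n elements of scratch.
 * Each pass is a stable counting sort on one byte, least
 * significant byte first. */
static void radix_sort_u32(uint32_t *a, size_t n, uint32_t *tmp)
{
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};

        /* Histogram this byte, offset by one for the prefix sum. */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xff) + 1]++;

        /* Prefix sums turn counts into bucket start offsets. */
        for (int b = 0; b < 256; b++)
            count[b + 1] += count[b];

        /* Stable scatter into tmp, then copy back. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xff]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
}
```

Signed ints need one extra wrinkle (flip the sign bit, or bias the top-byte pass), omitted here for brevity.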
        
           | jjgreen wrote:
            | O(n log n + n) = O(n log n)
        
           | topposter32 wrote:
           | What's log base 2 of a million or a billion?
        
           | tromp wrote:
           | Although hashmaps are O(n) in practice, they can be worse
           | than O(n log n) on some rare inputs that cause lots of hash
           | collisions. Several sorting algorithms such as mergesort and
           | heapsort are guaranteed O(n log n).
        
         | nkurz wrote:
         | I think it's a little trickier than that, since the problem
         | asks for the indices not the numbers, which would be lost in
         | the sorted version. You'd have to keep an additional mapping
         | from the sorted to original positions. And are we counting the
         | sorting and space for the maps as free? Or do we have to
         | account for them somehow?
         | 
         | Mostly, I just don't like the problem specification. What does
         | it actually mean to "solve it efficiently"? How many times are
         | we going to be searching each input array? How much memory do
         | we have available? What tradeoff between memory and computation
         | is allowable? How much does a comparison cost us versus
         | accessing memory? Are all targets equally likely?
         | 
         | Assuming as theoretical computer scientists tend to do that we
         | can ignore all constant time operations, that access to memory
         | is free, that an unknown but fixed amount memory is available,
         | and that we'll be searching the same array forever ruins the
         | problem for me. Fill in the numbers and we can talk about
         | what's optimal, but the "spherical cow" approach does nothing
         | for me.
        
           | tialaramex wrote:
           | For the specifics, as someone else observed, walking nums to
           | find the values we've discovered and identify their indices
           | is very affordable, it's N operations, and if we can't afford
           | N we're in trouble since we definitely need to do at least N
           | comparisons.
           | 
           | To your general point, sure, one thing that's nice (but takes
           | a colossal effort) about Advent of Code is that it has
           | specific inputs for players, and they have specific answers,
           | so although it's usual for participants to write code which
           | could (you hope) solve the problem in the abstract, the
           | actual test is whether you solved the input you actually had.
            | Sometimes not all conceivable inputs are even tractable (and
            | some people find this situation frustrating if it happens);
            | only the inputs given to real players may be suitable.
           | 
           | For this exercise the provided Go solution gets to handwave a
           | bunch of problems. For example, Go's "map" hash table
           | allocates whenever it wants to and has amortized table growth
           | on top. That can really add up for a problem like this.
           | 
           | If I attacked it in Rust I would make
           | HashMap::with_capacity(nums.len()) which does a single
           | allocation up front, I might not _need_ all this space if I
            | exit early, but it won't hurt. I don't know if there's even
           | a way to write that in Go.
        
           | marginalia_nu wrote:
            | For an input array of values I[], create an index array A =
            | [ 0 ... n-1 ] where n = len(I). Sort the indices in A by their
            | corresponding value in I. Now A _is_ the mapping: A[i] is the
            | index in I of the (i+1)-th largest value, and you can binary
           | search in the same way, by comparing I[A[i]].
           | 
           | This is strictly more space efficient than a hash table.
           | You're not going to win any awards for being cache efficient,
           | but then again, when the competition is a damn hash table
           | it's not too bad ;P
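That index-sorting idea can be sketched in C (illustrative names; standard `qsort` has no context parameter, hence the file-scope pointer -- `qsort_r` would avoid it on platforms that have it):

```c
#include <stdlib.h>
#include <stddef.h>

/* Values currently being argsorted; qsort comparators take no
 * user context, so this is passed out of band. */
static const int *g_vals;

static int cmp_by_val(const void *pa, const void *pb)
{
    int a = g_vals[*(const size_t *)pa];
    int b = g_vals[*(const size_t *)pb];
    return (a > b) - (a < b);
}

/* Fill idx[0..n-1] with 0..n-1, then sort those indices by the
 * values they point at in vals, leaving vals itself untouched. */
static void argsort(const int *vals, size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
    g_vals = vals;
    qsort(idx, n, sizeof *idx, cmp_by_val);
}
```

After this, `idx` is exactly the mapping the parent comment describes: `vals[idx[0]]` is the smallest value, and `idx[0]` is its position in the original array.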
        
           | raverbashing wrote:
           | > You'd have to keep an additional mapping from the sorted to
           | original positions.
           | 
           | 2*O(n) is still O(n)
        
             | taeric wrote:
             | Though, at the low level, it is perfectly valid to compare
             | the full costs, not just the growth rates. Right?
        
           | ape4 wrote:
           | It seems like it could be a "waste" to make a hash or sorted
            | list only to throw it away after one use. Depending on
           | what's required.
        
           | [deleted]
        
           | tromp wrote:
           | > I think it's a little trickier than that, since the problem
           | asks for the indices not the numbers, which would be lost in
           | the sorted version.
           | 
           | Good point.
           | 
           | > You'd have to keep an additional mapping from the sorted to
           | original positions.
           | 
           | It's simpler to just find the two numbers in the original
           | unsorted array.
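That recovery step can be sketched as one extra linear scan over the original array (a hypothetical helper; the conditional handles the duplicate-value case where x == y):

```c
#include <stddef.h>

/* Given the two *values* x and y found by the sorted two-pointer
 * pass, scan the original unsorted array once to recover their
 * indices. Writes them to out[2] and returns 1 on success. */
static int find_indices(const int *nums, size_t n,
                        int x, int y, size_t out[2])
{
    size_t found = 0;
    for (size_t i = 0; i < n && found < 2; i++) {
        int v = nums[i];
        if (found == 0) {
            if (v == x || v == y)        /* first hit: either value */
                out[found++] = i;
        } else {
            /* Second hit must be the *other* value (works when
             * x == y too, since "other" is then the same value). */
            int want = (nums[out[0]] == x) ? y : x;
            if (v == want)
                out[found++] = i;
        }
    }
    return found == 2;
}
```

This keeps the overall cost at O(n log n) for the sort plus O(n) for the scan, with no auxiliary index mapping.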
        
       ___________________________________________________________________
       (page generated 2023-06-29 23:02 UTC)