hngopher.com

       [HN Gopher] WyHLL: The most accurate 3-bits HyperLogLog
       ___________________________________________________________________
        
       WyHLL: The most accurate 3-bits HyperLogLog
        
       Author : wangyi_fudan
       Score  : 41 points
       Date   : 2021-03-13 05:33 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | vitus wrote:
       | Aside: I'm admittedly grumpy about the commit history, which is
       | filled with entries such as "Add files via upload" and "Update
       | README.md".
       | 
       | That, together with the lack of documentation, makes it much more
       | difficult to figure out what's special about this implementation.
        
         | vitus wrote:
         | https://github.com/redis/redis/blob/unstable/src/hyperloglog...
         | seems to be the starting point, given a number of identical
         | comments (complete with typos, e.g. "Estimate cardinality form
         | register histogram"), function names, macros. Copyright isn't
         | preserved, which is troubling in itself. There's a brief
         | mention of Redis's HLL implementation in the README, which is a
         | hint re: the code's origins.
         | 
         | The first divergence I see is in hllSparseToDense, although it
         | seems more like a tweak to the input/output (passing in o->ptr
         | instead of o, returning hdr instead of C_OK / C_ERR), than an
         | algorithmic difference.
         | 
         | Line 550 contains a looser check:                   if (span ==
         | 0) return -1;
         | 
         | omitting the check that p >= end.
         | 
         | And... the only meaningful algorithmic change I found is in
         | hllAdd, wherein we invalidate the cache if hllDenseAdd()
         | returns 1.
         | 
         | There might be something else, but a lot of the details look to
         | be standard (e.g. impl of murmurhash64a).
        
       | zwarag wrote:
       | For anyone wondering what HyperLogLog is:
       | 
       | HyperLogLog is an algorithm for the count-distinct problem,
       | approximating the number of distinct elements in a multiset.
       | Calculating the exact cardinality of a multiset requires an
       | amount of memory proportional to the cardinality, which is
       | impractical for very large data sets. Probabilistic cardinality
       | estimators, such as the HyperLogLog algorithm, use significantly
       | less memory than this, at the cost of obtaining only an
       | approximation of the cardinality. The HyperLogLog algorithm is
       | able to estimate cardinalities of > 109 with a typical accuracy
       | (standard error) of 2%, using 1.5 kB of memory.
       | 
       | Source: https://en.wikipedia.org/wiki/HyperLogLog
        
         | skrebbel wrote:
         | For anyone wondering what a multiset is:
         | 
         | A multiset is a set where each element can be present multiple
         | times. In other words, it's like an array or list but you don't
         | care about the order.
        
         | wenc wrote:
         | Hyperloglog is part of a class of algorithms known as
         | "sketches", which use precomputed probabilistic data structures
         | to perform fast but approximate computations on very large data
         | sets.
         | 
         | Supported computations include count distinct, frequency,
         | sampling, and quantiles and histograms.
         | 
         | There's a project called Apache Datasketches (developed at
         | Yahoo) that implements production versions of these algorithms.
         | They are useful in the implementations of search engines,
         | discussion forum software, etc. that are designed for scale.
         | 
         | https://datasketches.apache.org/docs/Background/TheChallenge...
         | 
         | Another sketch is the well-known Bloom filter, which can
         | quickly test if an element is part of a set without ever
         | returning a false negative (useful for quickly checking a large
         | database for whether a particular username is still available).
        
       | yorwba wrote:
       | Is there a companion paper somewhere explaining how it works? I
       | tried to read the comments in the source but didn't really
       | understand much. (Might be related to the fact that I never knew
       | much about vanilla HyperLogLog to begin with.)
        
         | jalk wrote:
         | This should get what you want
         | https://googlethatforyou.com/?q=hyperloglog%20paper
        
           | pwnguin wrote:
           | But what about the actual software question? What makes this
           | different than redis hyper log log? Or statsite
        
         | Manozco wrote:
         | This article by antirez helped me understand HyperLogLog a
         | couple years ago http://antirez.com/news/75
        
         | rahimiali wrote:
         | based on the comments it appears to rely on this paper:
         | https://arxiv.org/abs/1702.01284
        
       ___________________________________________________________________
       (page generated 2021-03-14 23:02 UTC)