[HN Gopher] WyHLL: The most accurate 3-bits HyperLogLog
___________________________________________________________________
WyHLL: The most accurate 3-bits HyperLogLog
Author : wangyi_fudan
Score : 41 points
Date : 2021-03-13 05:33 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| vitus wrote:
| Aside: I'm admittedly grumpy about the commit history, which is
| filled with entries such as "Add files via upload" and "Update
| README.md".
|
| That, together with the lack of documentation, makes it much more
| difficult to figure out what's special about this implementation.
| vitus wrote:
| https://github.com/redis/redis/blob/unstable/src/hyperloglog...
| seems to be the starting point, given a number of identical
| comments (complete with typos, e.g. "Estimate cardinality form
| register histogram"), function names, macros. Copyright isn't
| preserved, which is troubling in itself. There's a brief
| mention of Redis's HLL implementation in the README, which is a
| hint re: the code's origins.
|
| The first divergence I see is in hllSparseToDense, although it
| seems more like a tweak to the input/output (passing in o->ptr
| instead of o, returning hdr instead of C_OK / C_ERR), than an
| algorithmic difference.
|
| Line 550 contains a looser check: if (span ==
| 0) return -1;
|
| omitting the check that p >= end.
|
| And... the only meaningful algorithmic change I found is in
| hllAdd, wherein we invalidate the cache if hllDenseAdd()
| returns 1.
|
| There might be something else, but a lot of the details look to
| be standard (e.g. impl of murmurhash64a).
| zwarag wrote:
| For anyone wondering what HyperLogLog is:
|
| HyperLogLog is an algorithm for the count-distinct problem,
| approximating the number of distinct elements in a multiset.
| Calculating the exact cardinality of a multiset requires an
| amount of memory proportional to the cardinality, which is
| impractical for very large data sets. Probabilistic cardinality
| estimators, such as the HyperLogLog algorithm, use significantly
| less memory than this, at the cost of obtaining only an
| approximation of the cardinality. The HyperLogLog algorithm is
| able to estimate cardinalities of > 109 with a typical accuracy
| (standard error) of 2%, using 1.5 kB of memory.
|
| Source: https://en.wikipedia.org/wiki/HyperLogLog
| skrebbel wrote:
| For anyone wondering what a multiset is:
|
| A multiset is a set where each element can be present multiple
| times. In other words, it's like an array or list but you don't
| care about the order.
| wenc wrote:
| Hyperloglog is part of a class of algorithms known as
| "sketches", which use precomputed probabilistic data structures
| to perform fast but approximate computations on very large data
| sets.
|
| Supported computations include count distinct, frequency,
| sampling, and quantiles and histograms.
|
| There's a project called Apache Datasketches (developed at
| Yahoo) that implements production versions of these algorithms.
| They are useful in the implementations of search engines,
| discussion forum software, etc. that are designed for scale.
|
| https://datasketches.apache.org/docs/Background/TheChallenge...
|
| Another sketch is the well-known Bloom filter, which can
| quickly test if an element is part of a set without ever
| returning a false negative (useful for quickly checking a large
| database for whether a particular username is still available).
| yorwba wrote:
| Is there a companion paper somewhere explaining how it works? I
| tried to read the comments in the source but didn't really
| understand much. (Might be related to the fact that I never knew
| much about vanilla HyperLogLog to begin with.)
| jalk wrote:
| This should get what you want
| https://googlethatforyou.com/?q=hyperloglog%20paper
| pwnguin wrote:
| But what about the actual software question? What makes this
| different than redis hyper log log? Or statsite
| Manozco wrote:
| This article by antirez helped me understand HyperLogLog a
| couple years ago http://antirez.com/news/75
| rahimiali wrote:
| based on the comments it appears to rely on this paper:
| https://arxiv.org/abs/1702.01284
___________________________________________________________________
(page generated 2021-03-14 23:02 UTC)