[HN Gopher] Computer scientists invent an efficient new way to c...
       ___________________________________________________________________
        
       Computer scientists invent an efficient new way to count
        
       Author : jasondavies
       Score  : 726 points
       Date   : 2024-05-16 14:54 UTC (1 day ago)
        
 (HTM) web link (www.quantamagazine.org)
 (TXT) w3m dump (www.quantamagazine.org)
        
       | pixelmonkey wrote:
        | This algorithm seems to resemble HyperLogLog (and all its
        | variants), which is also cited in the research paper. It uses
        | the same insight -- that runs of heads or tails carry
        | estimation value -- but flips the idea on its head (heh): the
        | simpler algorithm described here discards memorized values on
        | the basis of runs of heads/tails.
       | 
       | This also works especially well (that is, efficiently) in the
       | streaming case, allowing you to keep something resembling a
       | "counter" for the distinct elements, albeit with a error rate.
       | 
       | The benefit of HyperLogLog is that it behaves similarly to a hash
        | set in some respects -- you can add items, count the distinct ones,
       | and, importantly, merge two HLLs together (union), all the while
       | keeping memory fixed to mere kilobytes even for billion-item
       | sets. In distributed data stores, this is the trick behind
       | Elasticsearch/OpenSearch cardinality agg, as well as behind
       | Redis/Redict with its PFADD/PFMERGE/PFCOUNT.
       | 
       | I am not exactly sure how this CVM algorithm compares to HLL, but
       | they got Knuth to review it, and they claim an undergrad can
       | implement it easily, so it must be pretty good!
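        | 
        | For reference, a minimal sketch of the CVM idea in Python (my
        | own rendition, not the paper's pseudocode; `capacity` is the
        | fixed memory budget, and I use the looping variant of the
        | halving step from Knuth's note):
        | 
        |     import random
        | 
        |     def cvm_distinct(stream, capacity=100):
        |         p = 1.0      # current sampling probability
        |         buf = set()  # at most `capacity` sampled elements
        |         for item in stream:
        |             buf.discard(item)  # only the last occurrence counts
        |             if random.random() < p:
        |                 buf.add(item)  # keep with probability p
        |             while len(buf) >= capacity:
        |                 # full: evict each survivor with probability 1/2
        |                 buf = {x for x in buf if random.random() < 0.5}
        |                 p /= 2
        |         return len(buf) / p
        | 
        | Each distinct element ends up in `buf` with probability p, based
        | only on the coin flips after its last occurrence, so len(buf)/p
        | is an unbiased estimate of the distinct count.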
        
         | hmottestad wrote:
         | It's also possible to use HLL to estimate the cardinality of
         | joins since it's possible to estimate both the union and the
         | intersection of two HLLs.
         | 
         | http://oertl.github.io/hyperloglog-sketch-estimation-paper/
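          | 
          | A sketch of the inclusion-exclusion step, using exact Python
          | sets as stand-ins for HLL estimates (a real HLL returns
          | approximate counts, and the linked paper discusses better
          | intersection estimators):
          | 
          |     def intersection_estimate(a, b):
          |         # |A & B| ~ |A| + |B| - |A union B|; with HLLs every
          |         # term is itself an estimate, so errors compound
          |         return len(a) + len(b) - len(a | b)
          | 
          |     print(intersection_estimate({1, 2, 3}, {2, 3, 4}))  # 2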
        
           | j-pb wrote:
           | It's a really interesting open problem to get the cost of
           | these down so that they can be used to heuristically select
           | the variable order for worst case optimal joins during
           | evaluation.
           | 
           | It's somewhere on the back of my todo list, and I have the
           | hunch that it would enable instance optimal join algorithms.
           | 
            | I've dubbed these the Atreides Family of Joins:
            | 
            | - Jessicas Join: The cost of each variable is based on the
            |   smallest number of rows that might be proposed for that
            |   variable by each joined relation.
            | - Pauls join: The cost of each variable is based on the
            |   smallest number of distinct values that will actually be
            |   proposed for that variable from each joined relation.
            | - Letos join: The cost of each variable is based on the
            |   actual size of the intersection.
           | 
           | In a sense each of the variants can look further into the
           | future.
           | 
            | I'm using the first and the second in a triplestore I built
            | in Rust [1] and it's a lot faster than Oxigraph. But I
            | suspect that the constant factors would make the third
            | infeasible (yet).
           | 
           | 1: https://github.com/triblespace/tribles-
           | rust/blob/master/src/...
        
             | refset wrote:
             | Having read something vaguely related recently [0] I
             | believe "Lookahead Information Passing" is the common term
             | for this general idea. That paper discusses the use of
             | bloom filters (not HLL) in the context of typical binary
             | join trees.
             | 
             | > Letos join
             | 
             | God-Emperor Join has a nice ring to it.
             | 
             | [0] "Simple Adaptive Query Processing vs. Learned Query
             | Optimizers: Observations and Analysis" -
             | https://www.vldb.org/pvldb/vol16/p2962-zhang.pdf
        
               | j-pb wrote:
                | Thanks for the interesting paper!
                | 
                |   We now formally define our _God-Emperor Join_,
                |   henceforth denoted join_ge...
               | 
               | Nice work with TXDB btw, it's funny how much impact
               | Clojure, Datomic and Datascript had outside their own
               | ecosystem!
               | 
               | Let me return the favour with an interesting paper [1]
               | that should be especially relevant to the columnar data
               | layout of TXDB. I'm currently building a succinct on-disk
               | format with it [2], but you might be able to simply add
               | some auxiliary structures to your arrow columns instead.
               | 
               | 1: https://aidanhogan.com/docs/ring-graph-wco.pdf
               | 
               | 2: https://github.com/triblespace/tribles-
               | rust/blob/archive/src...
        
               | refset wrote:
               | _> Nice work with TXDB btw_
               | 
               | It's X.T. (as in 'Cross-Time' / https://xtdb.com), but
               | thank you! :)
               | 
               |  _> 1: https://aidanhogan.com/docs/ring-graph-wco.pdf_
               | 
               | Oh nice, I recall skimming this team's precursor paper
               | "Worst-Case Optimal Graph Joins in Almost No Space"
               | (2021) - seems like they've done a lot more work since
               | though, so definitely looking forward to reading it:
               | 
               |  _> The conference version presented the ring in terms of
               | the Burrows-Wheeler transform. We present a new
               | formulation of the ring in terms of stable sorting on
               | column databases, which we hope will be more accessible
               | to a broader audience not familiar with text indexing_
        
               | j-pb wrote:
                | My apologies! It's even more embarrassing given the fact
               | that I looked up the website, knowing that I _always_
               | swap them after having written too many `tx-listen` in my
               | career.
               | 
               | They expanded their work to wider relations and made the
               | whole framework a lot more penetrable. I think they over-
               | complicate things a bit with their forward-extension, so
               | I'm keeping every column twice (still much better than
               | all permutations), which in turn allows for ad-hoc
               | cardinality estimation.
               | 
                | Also, the 1-based indexing with inclusive ranges is
                | really not doing them any favours. Most formulas become
                | much more streamlined and simpler with 0-based indexing
                | and exclusive ranges. (see `base_range` and `restrict_range`
               | in [0])
               | 
               | 0: https://github.com/triblespace/tribles-
               | rust/blob/e3ad6f21cdc...
        
           | jalk wrote:
           | Iirc intersection requires the HLLs to have similar
           | cardinality, otherwise the result is way off.
        
         | willvarfar wrote:
         | Just curious, dusting off my distant school memories :)
         | 
         | How do the HLL and CVM that I hear about relate to reservoir
         | sampling which I remember learning?
         | 
         | I once had a job at a hospital (back when 'whiz kids' were
         | being hired by pretty much every business) where I used
         | reservoir sampling to make small subsets of records that were
         | stored on DAT tapes.
        
           | michaelmior wrote:
           | I guess there is a connection in the sense that with
           | reservoir sampling, each sample observed has an equal chance
           | of remaining when you're done. However, if you have
           | duplicates in your samples, traditional algorithms for
           | reservoir sampling do not do anything special with
           | duplicates. So you can end up with duplicates in your output
           | with some probability.
           | 
            | I guess maybe it's more interesting to look at it the other way.
           | How is the set of samples you're left with at the end of CVM
           | related to the set of samples you get with reservoir
           | sampling?
        
             | _a_a_a_ wrote:
             | Was wondering the same, thanks for an answer.
        
           | krackers wrote:
           | Knuth's presentation of this [1] seems very very similar to
           | the heap-version (top-k on a uniform deviate) of reservoir
           | sampling as mentioned in [2]. The difference is in how
           | duplicates are handled. I wouldn't be surprised if this
           | algorithm was in fact already in use somewhere!
           | 
           | [1] https://cs.stanford.edu/~knuth/papers/cvm-note.pdf [2]
           | https://florian.github.io/reservoir-sampling/
           | 
           | Edit: Another commenter [3] brought up the BJKST algorithm
            | which seems to be a similar procedure, except using a suitably
           | "uniform" hash function (pairwise independence) as the
           | deviate instead of a random number.
           | 
           | [3] https://news.ycombinator.com/item?id=40389178
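            | 
            | For comparison, a sketch of the top-k-on-a-uniform-deviate
            | formulation from [2] (my own Python, for illustration):
            | 
            |     import heapq, random
            | 
            |     def reservoir_sample(stream, k):
            |         heap = []  # min-heap of (deviate, item)
            |         for item in stream:
            |             u = random.random()
            |             if len(heap) < k:
            |                 heapq.heappush(heap, (u, item))
            |             elif u > heap[0][0]:
            |                 # evict the smallest of the k kept deviates
            |                 heapq.heapreplace(heap, (u, item))
            |         return [item for _, item in heap]
            | 
            | The CVM twist is that each new occurrence of a duplicate
            | effectively re-draws its deviate, so only the last
            | occurrence of each element matters.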
        
       | imoverclocked wrote:
       | When do we stop calling this counting and start calling it
       | estimation?
        
         | pmontra wrote:
          | Yes, even the subtitle states "a simple algorithm for estimating
         | large numbers of distinct objects".
        
           | cubefox wrote:
           | This doesn't excuse the title though.
        
         | vlovich123 wrote:
         | As soon as people start reading past the headline.
        
           | 112233 wrote:
            | tbh, the title (and introduction) did a lot to dissuade me
            | from finishing the (really good) article. It was actually
            | informative, so why dress it up as SEO blogspam?
        
             | seanhunter wrote:
             | Presumably so it is optimised for search engines and people
             | find it.
             | 
             | Publishers generally have good data about where their
             | audience comes from. They wouldn't do this if it wasn't the
             | best way they know of to maximise readership.
        
               | a57721 wrote:
               | I've been following Quanta for some time, I'm sure they
               | don't care about SEO and number of visitors, they care
               | about the quality of texts. They write for the general
               | audience, and even though they try to preserve the
               | scientific accuracy, sometimes their explanations may
               | seem oversimplified and even a bit confusing when you
               | come from the same field. It's not clickbait, it's their
               | popular science style.
        
             | HDThoreaun wrote:
             | So they can get paid for their work? Are you giving them
             | anything?
        
             | lupire wrote:
             | It's Quanta. Their mission is to make laypeople like math
             | (not understand math), so they drown the math in sugar.
        
         | card_zero wrote:
         | Seems this is one of those things like UUIDs where we rely on
         | it being very unlikely to be wrong, because statistics.
         | 
         | > the accuracy of this technique scales with the size of the
         | memory.
         | 
         | I wonder if that's proportional to the number of distinct items
         | to count, though.
         | 
         | > if the [memory] is so big that it fits all the words, then we
         | can get 100% accuracy
         | 
         | Yes, but then the algorithm isn't being used any more, that's
         | just normal counting.
         | 
         | They counted the distinct words in Hamlet with a memory size of
         | 100 words, about 2.5% of the number to find, and got a result
         | that was off by 2. If you do the same with the whole of
         | Shakespeare, again using 2.5% of the memory needed to hold all
         | the distinct words, is the accuracy better?
         | 
         | Anyway, this is limited to counting, and doesn't help list what
         | the words _are,_ though quickly counting them first is perhaps
         | a way to speed up the task of actually finding them?
        
         | BlackFly wrote:
          | When it is actually impossible to count something, and when
          | the error between an estimate and an exact answer is not
          | significant, the pedantic distinction is not helpful.
          | 
          | The same thing happens with measurement. No measurement is
          | ever exact. If I said I was measuring a count, someone would
          | probably correct me and say that I am counting.
         | 
         | Common speech is like that.
        
       | matt3210 wrote:
       | CS guys always wanting to throw away a good system and start from
       | scratch.
        
         | Miraltar wrote:
         | Which good system are we talking about ?
        
         | 112233 wrote:
         | Which good system are you referring to?
        
         | baq wrote:
         | Tell me you didn't read the article without telling me you
         | didn't read the article.
         | 
         | ...but actually, I think you didn't even read the headline...?
        
         | coldtea wrote:
          | Yeah, if only we had stuck with those stone wheels from 20,000
         | B.C.
        
           | datavirtue wrote:
           | Well, they are easier to work on.
        
       | biscuit1v9 wrote:
       | > The trick, he said, is to rely on randomization.
       | 
       | > When the space is full, press pause and flip a coin for each
       | word. Heads, and the word stays on the list; tails, and you
       | delete it.
       | 
        | I wasn't expecting it to go that far: randomization. How can you
        | verify that the answer is good? It's only an approximation,
        | maybe..
        
         | 8372049 wrote:
         | Did you read the (entire) article?
        
         | seanhunter wrote:
         | The proof is pretty straightforward and is included in the
         | paper https://arxiv.org/abs/2301.10191
        
         | HDThoreaun wrote:
         | Yes, the result is an estimation.
        
           | biscuit1v9 wrote:
            | Thanks. Just wanted to be sure I didn't misunderstand.
        
       | akamoonknight wrote:
       | I don't know a word or phrase for this, but I really enjoy any
       | examples of "thinking outside the box" like this because it's
       | something I struggle with in my professional career. Learning not
       | only the right ways to solve problems, but figuring out the
       | questions to ask that make solving the problems you have easier
       | or even in some cases possible. In this case, it's hey, we don't
       | need exact numbers if we can define a probabilistic range given
       | defined parameters. Other problems are gonna have other
       | questions. I guess my hope is that if I see enough examples I'll
       | be able to eventually internalize the thought process and apply
       | it correctly.
        
         | wmwragg wrote:
         | I think it's generally thought of as "lateral thinking", Edward
         | de Bono has written a few books about it you might find
         | interesting.
        
           | coldtea wrote:
           | And some more commonplace words like "creativity" (as in
           | "creative solution") etc. would apply.
        
           | bzuker wrote:
           | any particular one you'd recommend?
        
         | dentemple wrote:
         | To be fair, this was a university research team. Literally, a
         | team of folks who can, all day everyday, iterate over a single
         | topic using the Scientific Method.
         | 
         | If you were paid by a big company to sit at a whiteboard all
          | day with a team of equally intelligent engineers, I'm sure
          | you'd come up with SOMETHING that would look like an
         | "outside the box" solution to the rest of the world.
         | 
         | However, most of us are paid to work the JIRA factory line
         | instead, which limits the amount of time we can spend
         | experimenting on just one single problem.
        
       | 112233 wrote:
       | After skimming Knuth's paper -- does the algorithm work if values
       | are hashed, that is, the "uniform deviate" is selected
       | deterministically for each unique value of the stream?
        
         | devnonymous wrote:
         | Not sure which Knuth paper you're referring to but skimming
         | through the article my understanding is this algorithm works
         | /only/ if the values are hashable. IOW how else does one define
          | unique/distinct values?
        
           | 112233 wrote:
           | https://cs.stanford.edu/~knuth/papers/cvm-note.pdf
           | 
            | note how the "u"s are selected every time a value is not in
            | the list. I don't read that as being a hash.
        
             | hcs wrote:
             | I think the analysis relies on independent random "u"s,
             | even for the same key.
        
           | hcs wrote:
           | https://cs.stanford.edu/~knuth/papers/cvm-note.pdf
           | 
           | Looks like it was posted at the time
           | https://news.ycombinator.com/item?id=36079213 but not much
           | discussed. I found it over here
           | https://news.ycombinator.com/item?id=40387594
        
       | saulrh wrote:
       | Huh, that's a clever twist on reservoir sampling. Neat.
        
       | usgroup wrote:
       | I found the paper took about as long to read as the blog post and
       | is more informative:
       | 
       | https://arxiv.org/pdf/2301.10191
       | 
       | It is about estimating the cardinality of a set of elements
       | derived from a stream. The algorithm is so simple, you can code
       | it and play with it whilst you read the paper.
       | 
       | The authors are explicit about the target audience and purpose
       | for the algorithm: undergraduates and textbooks.
        
         | vanderZwan wrote:
          | If you refer to the subtitle of the paper - _An Algorithm for
          | the (Text) Book_ - I think that is actually a reference to
          | something Paul Erdos allegedly said about how some proofs are
          | so elegant in their simplicity and beauty that they are "from
          | The Book", like representing some divine Platonic ideal.
         | 
         | Given that Knuth himself reviewed it, he might have remarked
         | that this was one of those algorithms! Perhaps the authors
         | decided to include it in the title as a not-so-humble brag
         | (which would be well-earned if that's the case!)
         | 
         |  _edit: originally this comment said Knuth was the one who said
         | this about some algorithms being from The Book, but that was my
         | faulty memory._
        
           | kibibu wrote:
           | I thought The Book was an Erdos thing. I wonder who used it
           | first.
        
             | stevesimmons wrote:
             | "During a lecture in 1985, Erdos said, `You don't have to
             | believe in God, but you should believe in The Book.`"
             | 
             | https://en.wikipedia.org/wiki/Proofs_from_THE_BOOK
        
             | vanderZwan wrote:
             | I think you're right, I must have confused the two. I'll
             | edit my comment to reduce the spread of misinformation.
        
           | usgroup wrote:
           | From the abstract: "... All the current state-of-the-art
           | algorithms are, however, beyond the reach of an undergraduate
           | textbook owing to their reliance on the usage of notions such
           | as pairwise independence and universal hash functions. We
           | present a simple, intuitive, sampling-based space-efficient
           | algorithm whose description and the proof are accessible to
           | undergraduates with the knowledge of basic probability theory
           | ...."
        
             | Sharlin wrote:
              | The point is that the subtitle is pretty clearly intended
              | to have a dual meaning; it wouldn't be phrased like that
              | otherwise.
        
             | dchftcs wrote:
              | This is really twisting it; I don't find pairwise
              | independence to be more advanced than applying a Chernoff
              | bound. In this problem it'd mostly be the difference of
              | using a Chebyshev-type bound or a Chernoff bound.
             | 
             | Pairwise independence is to give the algorithm stronger
             | guarantees and let it work with a simple hash function like
             | ax+b, otherwise probably most existing algorithms can be
             | simplified in the same way. The most similar algorithm is
              | BJKST, which is almost identical except for specifying the
             | sampling mechanism to require less randomness.
             | 
             | To someone who worked on this type of thing before, it's
              | like seeing people familiar with LLMs say "oh yet another
             | X-billion parameter model on github doing more or less the
             | same".
        
           | cschmidt wrote:
            | I liked this part. They got Knuth to review it, and he found
            | mistakes. That's kind of cool, in its own way.
            | 
            |   We are deeply grateful to Donald E. Knuth for his thorough
            |   review, which not only enhanced the quality of this paper
            |   (including fixing several errors) but has also inspired us
            |   for higher standards.
        
         | coldtea wrote:
         | > _The authors are explicit about the target audience and
         | purpose for the algorithm: undergraduates and textbooks._
         | 
         | Doesn't seem like it. Seems like an algorithm (similar to other
         | approximate cardinality estimation algorithms) with huge
         | applicability.
        
           | usgroup wrote:
           | From the abstract: "All the current state-of-the-art
           | algorithms are, however, beyond the reach of an undergraduate
           | textbook owing to their reliance on the usage of notions such
           | as pairwise independence and universal hash functions. We
           | present a simple, intuitive, sampling-based space-efficient
           | algorithm whose description and the proof are accessible to
           | undergraduates with the knowledge of basic probability
           | theory."
        
             | coldtea wrote:
              | That just says that this is also a simpler and more
              | accessible algorithm, suitable even for undergraduate
              | textbooks.
             | 
             | Not that this is just useful for textbooks, a mere academic
             | toy example, which would be something else entirely.
             | 
             | This is both accessible AND efficient.
        
         | swores wrote:
         | > _" The authors are explicit about the target audience and
         | purpose for the algorithm: undergraduates and textbooks."_
         | 
         | If you're saying it's just for "undergraduates and textbooks",
         | as opposed to just being simple enough for them to use but not
         | limited to them, would you mind explaining what makes it useful
         | for undergrads but not for professionals?
        
           | pfsalter wrote:
           | My interpretation from the paper is that this algorithm is
           | simpler than other options but also worse. So in a
           | professional context you'd use one of those instead
        
           | usgroup wrote:
           | From the abstract: "All the current state-of-the-art
           | algorithms are, however, beyond the reach of an undergraduate
           | textbook owing to their reliance on the usage of notions such
           | as pairwise independence and universal hash functions. We
           | present a simple, intuitive, sampling-based space-efficient
           | algorithm whose description and the proof are accessible to
           | undergraduates with the knowledge of basic probability
           | theory."
        
             | swores wrote:
                | That still only speaks to it being simple enough for
                | students, not whether it's too simple for any other use
                | vs. useful enough that students who learn it will spend
                | the rest of their lives using it.
             | 
             | For example word processor software is commonly described
             | as simple enough for children to use at school, that
             | doesn't mean that word processor software is of no use to
             | adults.
        
               | adgjlsfhk1 wrote:
                | the reason it's too simple for most real-world use is
                | that HyperLogLog is the "good" version of this technique
                | (but it's harder to prove that it works)
        
               | aidenn0 wrote:
               | Simplicity in an algorithm has limited direct impact on
               | its usefulness in industry where libraries are prevalent.
               | 
                | Consider mapping structures with Θ(k) expected lookup
                | times. The simplest implementation is a hash table with
                | chaining. Open addressing is a bit more complicated, but
                | also more common. Tries, which have O(k) worst-case
                | lookup times, are often covered in undergraduate courses
                | and are definitely easier to analyze and implement than
                | forms of open-addressed hash tables that have O(k)
                | guarantees (e.g. cuckoo hashing).
        
         | resonious wrote:
         | The blog post was more than half padding. Good that the
         | algorithm is so simple it's hard to write a full length blog
         | post about it!
        
         | aspenmayer wrote:
         | Link to abstract: https://arxiv.org/abs/2301.10191
        
         | cb321 wrote:
          | I agree the paper is better than the blog post, although one
          | criticism I have of the CVM paper is that it has a
          | termination/abort condition, where Knuth's CVM note (ref'ed
          | elsewhere in this thread) instead just loops the halving step
          | until it frees space in the reservoir. It seems more work to
          | explain the https://en.wikipedia.org/wiki/Up_tack than just to
          | do the loop. [1]
         | 
         | [1] https://news.ycombinator.com/item?id=40388878
        
           | imglorp wrote:
            | On that note, I'm also unfamiliar with this \ operator
            | notation, which is used without explanation:
            | 
            |   X <- X \ {a_i}
        
             | jtanderson wrote:
             | That is conventional set subtraction notation. "Assign to X
             | the same set minus all elements of the set {a_i}".
             | 
             | One example source, but it is pretty common in general:
             | http://www.mathwords.com/s/set_subtraction.htm
        
             | rocqua wrote:
             | Set difference.
             | 
              | Set X becomes X without element ai. This is the case
              | whether or not ai was in the set X before the step was
              | taken.
        
           | _a_a_a_ wrote:
            | I've known that symbol for decades, never knew its name -
           | up-tack it is. Ta!
        
       | hum3hum3 wrote:
        | The counting/estimation technique is rather like a floating-
        | point number: an integer exponent k and a mantissa (the sampled
        | population).
        
       | melq wrote:
        | Estimating the number of unique elements in a set and counting
        | the number of unique elements in a set are very different things.
       | Cool method, bad headline.
        
         | dools wrote:
         | It's an approximation, not an estimation.
        
           | ranguna wrote:
           | Still very different things, no?
        
             | coldtea wrote:
             | It's the same thing at different degrees of accuracy. The
             | goal is the same.
        
               | amelius wrote:
               | Still, counting things and counting unique things are two
               | different procedures.
        
           | blackkettle wrote:
           | Actually, my understanding is that it is an estimation
           | because in the given context we don't know or cannot compute
           | the true answer due to some kind of constraint (here memory
           | or the size of |X|). An approximation is when we use a
           | simplified or rounded version of an exact number that we
           | actually know.
        
           | davidmurdoch wrote:
           | The authors of the article disagree with you.
        
           | chrisweekly wrote:
           | For someone who's pretty well-versed in English, but not a
           | math-oriented computer scientist, this seems like a
           | distinction without a difference. Please remedy my ignorance.
        
             | lupire wrote:
             | My GP was wrong, but the words are different.
             | 
              | Estimation is a procedure that generates an estimate, which
             | is a kind of approximation, while approximation is a result
             | value. They are different "types", as a computer scientist
             | would say. An approximation is any value that is
             | justifiably considered to be nearly exact. ("prox" means
             | "near". See also "proximate" and "proxy".)
             | 
             | Estimation is one way to generate an approximation. An
             | estimate is a subtype of an approximation. There are non-
             | estimation ways to generate an approximation. For example,
             | take an exact value and round it to the nearest multiple of
             | 100. That generates an approximation, but does not use
             | estimation.
        
               | jameshart wrote:
               | I'm not sure the linguistic differences here are as cut
               | and dried as you would like them to be. Estimate and
               | approximate are both verbs, so you can derive nouns from
               | them both for the process of doing the thing, and for the
               | thing that results from such a process.
               | 
               | Estimation is the process of estimating. It produces an
               | estimate.
               | 
               | Approximation is the process of approximating. It
               | produces an approximation.
               | 
               | You can also derive adjectives from the verbs as well.
               | 
               | An estimate is an estimated value.
               | 
               | An approximation is an approximate value.
               | 
               | But you're right that the 'approximate' terms make claims
               | about the _result_ - that it is in some way near to the
               | correct value - while the 'estimate' derived terms all
               | make a claim about the _process that produced the result_
               | (ie that it was based on data that is known to be
               | incomplete, uncertain, or approximate)
        
         | jameshart wrote:
         | They're not _very_ different things; the terms are used
         | interchangeably in most contexts because in the real world all
         | counting methods have some nonzero error rate.
         | 
         | We talk about 'counting votes' in elections, for example, yet
         | when things are close we perform 'recounts' which we fully
         | expect can produce slightly different numbers than the original
         | count.
         | 
         | That means that vote counting is actually vote estimating, and
         | recounting is just estimating with a tighter error bound.
         | 
         | I kind of think the mythology of the 'countless stones'
         | (https://en.wikipedia.org/wiki/Countless_stones) is a sort of
         | folk-reminder that you can never be too certain that you
         | counted something right. Even something as big and solid and
         | static as a standing stone.
         | 
         | The situations where counting is not estimating are limited to
         | the mathematical, where you can assure yourself of exhaustively
         | never missing any item or ever mistaking one thing's identity
         | for another's.
        
           | lupire wrote:
            | Come on. There is a fundamental difference between trying to
            | get an exactly correct answer and not trying to get an
            | exactly correct answer.
        
             | jameshart wrote:
             | It's not a fundamental difference, it's a fundamental
             | constraint.
             | 
             | There are circumstances - and in real life those
             | circumstances are _very common_ - where you must accept
             | that getting an exactly correct answer is not realistic.
             | Yet nonetheless you want to 'count' things anyway.
             | 
             | We still call procedures for counting things under those
             | circumstances 'counting'.
             | 
             | The constraints on this problem (insufficient memory to
             | remember all the unique items you encounter) are one such
             | situation where even computerized counting isn't going to
             | be exact.
        
               | Levitz wrote:
               | I agree with you, but we are talking theory here. The
               | algorithm doesn't count, it estimates.
               | 
               | You can make an algorithm that counts, you can make an
               | algorithm that estimates, this is the second.
        
               | sangnoir wrote:
               | Estimation _is_ counting with error bars.
               | 
               | Frankly, most of what you consider counting in your
               | comment needs error bars - ask anyone who operated an
               | all-cash cash-register how frequently end-of-day
               | reconciliation didn't match the actual cash in the drawer
               | (to the nearest dollar.)
               | 
               | The following is a list from my personal experience - of
               | presumably precisely countable things that didn't turn
               | out to be the case: the number of computers owned by an
               | fairly large regional business, the number of (virtual)
               | servers operated by a moderately sized team, the number
               | of batteries sold in a financial year by a battery
               | company.
        
               | shkkmo wrote:
               | Counting is a subset of estimation, not a synonym.
               | 
               | If I estimated the number of quarters in a stack by
               | weighing them, that would be different from estimating
               | the number of quaters in a stack by counting them. Both
               | methods of estimation have error bars.
               | 
               | The list you provide is of categories that don't have
               | clear definitions. If you have a sufficiently clear
               | definition for a category given your population, it has a
               | precise count (though your counting methodologies will
               | still be estimates.) If your definition is too fuzzy,
               | then you don't actually have a countable set.
        
               | seadan83 wrote:
                | Counting and estimation are different by definition. One
                | is a full enumeration, the other an extrapolation from
                | sampled data. In both cases 'accuracy' is a factor. Even
                | if we are counting the number of stars, it is still a
                | different technique from estimating the number of stars.
               | 
                | I could try to count fibers in muscle or grains of sand
                | on the beach; chances are accuracy would be low. One can
                | be smart about technique for more accurate counts, e.g.
                | get 10M sand counters and give them each 1kg of sand,
                | which they then count grain by grain with tweezers and a
                | microscope. That is counting. At the same time, we could
               | find an average count of grains in 1kg sand from a random
               | 100 of our counters, and then estimate what an expected
               | total would be. The estimate can be used to confirm the
               | accuracy of the counts.
        
               | jameshart wrote:
               | And by that definition this is a counting algorithm.
        
           | samatman wrote:
           | > _the terms are used interchangeably in most contexts_
           | 
           | Counting and estimating are not used interchangeably in most
           | contexts.
           | 
           | > _because in the real world all counting methods have some
           | nonzero error rate._
           | 
           | The possibility that the counting process may be defective
           | does not make it an estimation.
           | 
           | > _We talk about 'counting votes' in elections, for example,
           | yet when things are close we perform 'recounts' which we
           | fully expect can produce slightly different numbers than the
           | original count._
           | 
           | We talk about counting votes in elections because votes are
           | counted. The fact that the process isn't perfect is a defect;
           | this does not make it estimation.
           | 
           | > _That means that vote counting is actually vote estimating,
           | and recounting is just estimating with a tighter error
           | bound._
           | 
           | No. Exit polling is estimation. Vote counting is counting.
           | Vote recounting is also counting, and does not necessarily
           | impose a tighter error bound, nor necessarily derive a
           | different number.
           | 
           | > _The situations where counting is not estimating are
           | limited to the mathematical, where you can assure yourself of
           | exhaustively never missing any item or ever mistaking one
           | thing's identity for another's._
           | 
           | So like, computers? Regardless, this is wrong. Estimating
           | something and counting it are not the same thing. Estimation
           | has uncertainty, counting may have error.
           | 
           | This is like saying addition estimates a sum because you
           | might get it wrong. It's just not true.
        
             | jameshart wrote:
             | So, IEEE floating point doesn't support 'addition' then.
        
               | samatman wrote:
               | IEEE 754 defines an exact binary result for the addition
               | of any two floats.
               | 
               | That this bit-identical result is not the same operation
               | as addition of real numbers is irrelevant, because floats
               | aren't reals.
               | 
               | f1 + f2 is not an estimation. Even treating it as an
               | approximation will get you into trouble. It's not that
               | either, it's a floating-point result, and algorithms
               | making heavy use of floating point had better understand
               | precisely what f1 + f2 is going to give you if they want
               | to obtain maximum precision and accuracy.
        
               | aidenn0 wrote:
               | Cool, so next time I have numbers that aren't reals to
               | perform math on, I can use floats.
        
         | Koshkin wrote:
         | True - for (relatively) small numbers. For large (huge) numbers
         | estimation is usually considered to be equivalent to counting,
         | and the result is sometimes represented using the "scientific"
         | notation (i.e. "floating-point") rather than as an integer. For
         | example, the mole is an integer whose value is only known
         | approximately (and no one cares about the exact value anyway).
        
           | OutOfHere wrote:
           | This doesn't justify estimation to be equivalent to counting
           | even if some mathematicians consider them to be the same.
           | Floating points are for estimation. Integers are for
           | counting. The two are not the same, not even for large
           | numbers.
        
             | Koshkin wrote:
             | "Equivalent" and "the same" are sometimes equivalent. (Or
             | the same.)
             | 
             |  _It depends on what the meaning of the word 'is' is._
             | 
             | https://libquotes.com/bill-clinton/quote/lby0h7o
        
       | aaron695 wrote:
       | Knuth talks about this paper here - "The CVM Algorithm for
       | Estimating Distinct Elements in Streams" -
       | https://cs.stanford.edu/~knuth/papers/cvm-note.pdf
        
       | thesz wrote:
        | HyperLogLog uses additions; it keeps sums. Thus, you can
        | subtract one HLL's sums from another's. This is useful if the
        | stream supports deletion. Streams with deletions can be found
        | in log-structured merge trees, for one example, so one can
        | estimate the count of distinct elements across the whole LSM
        | tree hierarchy.
       | 
       | The algorithm in the paper does not allow for deletions.
       | 
        | Also, if one counts statistics of a stream of large elements
        | (say, SHA-512 hashes, 64 bytes per hash), this algorithm
        | requires some storage for elements from the stream, so the
        | memory requirement is O(table size * element size).
        
       | kromem wrote:
       | Ah, so Thanos was just conducting a census.
        
         | rcarmo wrote:
         | I see what you did there.
        
       | dist-epoch wrote:
       | Do they assume any particular distribution of the items?
       | 
       | Otherwise Nassim Taleb would like a word with them.
        
         | coldtea wrote:
         | No, and Taleb is not relevant for this.
        
           | dist-epoch wrote:
           | It is if the distribution is extreme.
        
             | coldtea wrote:
              | If Hamlet were the word "Hamlet" 100000000 times and the
              | word "pancake" once, it would still give a good estimate
              | - even if pancake gets counted as 0.
        
       | drycabinet wrote:
       | Does finding the number of unique elements in a set actually
       | require comparison of each element with everything else? Can't
       | you use a hashtable? For every element, add it to the table
       | (ignore if already exists), and finally, take a count of keys.
        
         | jangxx wrote:
         | Using a hashtable is the "normal" approach mentioned in the
         | article. It works of course, but requires memory to store each
         | unique element (or their hashes). If you have less memory
         | available, the described algorithm can still give a very good
         | approximation.
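          | 
          | In Python terms, the "normal" exact approach is just:
          | 
          |     def distinct_exact(stream):
          |         # memory grows with the number of distinct elements
          |         return len(set(stream))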
        
         | xwolfi wrote:
          | Imagine a million elements. How big must your hashtable be?
          | The article explains it very well, did you miss it? It's a
          | way to save memory.
         | 
         | But to be honest I implemented it, ran it on Hamlet, and it's
         | very wrong, it's barely useful but maybe if you just need a
         | vague idea...
        
           | cb321 wrote:
           | How big was your _thresh_? I found it pretty accurate:
           | https://news.ycombinator.com/item?id=40388878
        
         | theginger wrote:
         | That is fine when you have say 1 million values and only 1000
         | are unique. But when you have 1 million values and about 900
         | thousand are unique you are putting more or less the whole data
         | set into memory.
        
         | Anduia wrote:
         | Using a hashtable is effective because you only compare
         | elements within their hash buckets, not the entire set.
         | However, they can become inefficient with very large datasets
         | due to memory usage and processing time, which is where
         | approximate counts shine.
        
           | datavirtue wrote:
            | This algorithm is still spinning a lot of random numbers. I
            | would guess that this is much less overhead than hashing,
            | but it still seems like it could be significant.
        
         | ReleaseCandidat wrote:
         | > Does finding the number of unique elements in a set
         | 
         | That's easy, that's all of them. Sorry, could not resist.
         | 
         | Yes, hashing is the usual method. In a sorted list you can
         | compare to the following element.
        
         | seadan83 wrote:
         | Imagine 1PB of data and you expect 30% of it to be unique. That
         | needs 300TB RAM to store unique elements. Keep in mind the
         | values in a hash table are the elements themselves, so a
          | hashtable of perhaps 300TB. Doing that without that much RAM,
         | even swapping to disk can be tough.
        
       | gh0stcloud wrote:
       | very interesting solution. A perfect example of how even some
       | very complicated problems can sometimes have simple solutions.
       | Will definitely try to write an implementation of this.
       | 
       | The only minor downside to this is that it's obviously not
       | deterministic since it depends on chance. But for many
       | applications where the dataset is so big it doesn't fit in
       | memory, that's probably just a tiny rounding error anyway.
        
       | rep_lodsb wrote:
       | >What if you're Facebook, and you want to count the number of
       | distinct users who log in each day, even if some of them log in
       | from multiple devices and at multiple times?
       | 
       | Seems like a bad example of when this algorithm could be useful
       | in practice.
       | 
       | If you already know you will want this info when designing the
       | login process, it's simple: keep track of the date of last login
       | for each account, and increment the unique user counter when the
       | stored value is different from the current one.
       | 
       | And even if not, it should still be possible to "replay" a stream
       | of login events from the database later to do this analysis.
       | Unless maybe you already had years worth of data?
        
       | novaRom wrote:
       | I wish we had a theory connecting randomness with time/space
       | complexity.
       | 
       | Intuitively I think the key to many computational limitations is
       | making use of randomness.
        
         | ilya_m wrote:
         | Your intuition is correct. See
         | https://en.wikipedia.org/wiki/BPP_(complexity)
        
       | ilya_m wrote:
       | "In a recent paper" - really? The paper first appeared in ESA
       | 2022. The revised version (with some errors fixed) is from May
       | 2023. To be fair, compared to the rate of progress in this area
       | between Flajolet & Martin (1985) and the previous SOTA (Blasiok,
       | 2018), a couple of years of sitting on this news is not a lot.
        
         | davidmurdoch wrote:
         | Sounds like this qualifies as "recent" to me.
        
       | ngrilly wrote:
        | It is quite amazing that, after so many talented researchers
        | worked on this problem for decades, the authors were able to
        | come up with such an elegant and simple solution. Another proof
        | that research is very necessary and useful.
        
       | unnouinceput wrote:
       | "Computer scientists invent an efficient new way to approximate
       | counting large unique entries" - fixed the title
        
       | renonce wrote:
        | At first I found this paragraph confusing:
       | 
       | > Keep going through Hamlet, adding new words as you go. If you
       | come to a word that's already on your list, flip a coin again. If
       | it's tails, delete the word; heads, and the word stays on the
       | list.
       | 
        | Why would you delete a word already on the list by flipping
        | coins? Doesn't this reduce the accuracy by counting fewer words
        | than expected? And will the word be added to the list later?
       | 
       | After thinking about it for a while and reading the paper, I've
       | finally developed a good mental model for how this algorithm
       | works as below, which should convince even a high schooler why
       | the algorithm works:
       | 
       | 1. You are given a streaming set of elements from [n] (a finite
       | set of n distinct elements). Now let's give each element a random
       | uniform real number in [0,1]. This helps us choose the elements
       | we want: if we choose all elements with the number below 0.5, we
       | get about half of the elements.
       | 
       | 2. Assume for a moment we have unbounded memory. Now we maintain
       | a table of already seen elements, and for each element we keep
        | the _last_ real number attached to that element: so if an
        | element A appears three times with the numbers 0.3, 0.7, 0.6, we
       | keep A with the number 0.6.
       | 
       | 3. Our memory is bounded! So we keep only the elements below a
       | threshold 2^-r, that is, 1, 0.5, 0.25, etc. So at first we will
       | keep all elements, but when we don't have enough memory, we will
       | need to filter existing elements according to the threshold. It's
        | easy to see that when we decrease the threshold, only elements
        | already in memory will meet the criteria and continue to exist
        | in memory. No other elements in the stream will be below the
        | threshold. Also note that whether an element is in memory
        | depends only on its _last_ occurrence. It can exist in memory
        | for a while, get dropped because a later occurrence does not
        | meet the threshold, and
       | get back in.
       | 
       | 4. We don't actually have a real number attached to every
       | element! But we can pretend as if there is one. For each new
       | element X from the stream, we replace the number in memory with
       | its attached number, and we only care if its attached number is
       | below 2^-r or not. If it is, it should be in our memory, and if
       | it's not, it should be out. Once the number is in memory, it's a
       | random number in [0,2^-r] and we care no further.
       | 
       | When increasing r, we only care about whether the number is in
       | [0,2^{-r-1}]. This has exactly 1/2 probability. So each number
       | has 1/2 probability of getting out, and 1/2 probability of being
       | kept in memory.
       | 
        | 5. Now it's easy to see that whether an element is in memory
        | depends solely on the real number attached to its last
        | occurrence. That is, each distinct element has exactly 2^-r
        | probability of being in memory, so 2^r multiplied by the number
        | of elements in memory gives a good estimate of the number of
        | distinct elements.
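        | 
        | Running this mental model literally in Python (a sketch; the
        | real algorithm never stores the deviates, as step 4 explains):
        | 
        |     import random
        | 
        |     def distinct_via_deviates(stream, capacity=100):
        |         r = 0
        |         kept = {}  # element -> deviate of its last occurrence
        |         for item in stream:
        |             kept.pop(item, None)  # only the last one matters
        |             u = random.random()
        |             if u < 2.0 ** -r:
        |                 kept[item] = u
        |             while len(kept) >= capacity:
        |                 r += 1  # halve the threshold
        |                 kept = {e: v for e, v in kept.items()
        |                         if v < 2.0 ** -r}
        |         return len(kept) * 2 ** r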
       | 
        | They criticized previous approaches as relying on hash
        | functions. A simplified version of the hash-based approach is
        | as follows:
       | 
       | 1. Make a hash function that maps distinct elements to different
       | uniform real numbers. So all As compute to 0.3, all Bs compute to
       | 0.7, etc.
       | 
       | 2. Use that hash function to transform all elements. The same
       | element maps to the same real number, and different elements map
       | to different real numbers.
       | 
       | 3. Keep a set of N smallest distinct real numbers. This works by
       | inserting numbers from the stream to the set. It's easy to see
       | that adding the same element multiple times has exactly no
       | effect; it's either too big to be in the set or the same as an
       | existing number in the set.
       | 
       | 4. This gives the Nth smallest real number in the set as K. The
       | number of distinct elements is approximated as N/K.
       | 
        | This algorithm is remarkable, but not because it's efficient or
        | new. Both algorithms using hash functions and algorithms
        | without them have been around and are optimal. I assume this is
        | the first optimal algorithm without hash functions that is very
        | easy to understand, even by undergraduates.
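        | 
        | And a sketch of the hash-based scheme above (often called KMV,
        | "k minimum values"; my own Python, keeping the N smallest
        | distinct hash values):
        | 
        |     import hashlib, heapq
        | 
        |     def kmv_distinct(stream, n=256):
        |         def h(x):  # pseudo-uniform number in [0, 1)
        |             d = hashlib.sha256(repr(x).encode()).digest()
        |             return int.from_bytes(d[:8], "big") / 2.0 ** 64
        |         heap, members = [], set()  # n smallest distinct hashes
        |         for item in stream:
        |             v = h(item)
        |             if v in members:
        |                 continue  # duplicates have no effect
        |             if len(heap) < n:
        |                 heapq.heappush(heap, -v)
        |                 members.add(v)
        |             elif v < -heap[0]:  # below the current Nth smallest
        |                 members.discard(-heapq.heapreplace(heap, -v))
        |                 members.add(v)
        |         if len(heap) < n:
        |             return len(heap)  # fewer than n distinct: exact
        |         return n / -heap[0]   # N/K; (n-1)/K is the unbiased form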
        
         | majikandy wrote:
         | I had the same problem with the same paragraph and still don't
         | quite get it.
         | 
         | Unfortunately I struggle to follow the detailed explanation you
         | gave... since you seem to understand it... can you confirm that
         | they really do mean to throw away the word in the list they
         | just found?
         | 
         | Eg
         | 
         | ACT I SCENE Elsinore A platform before the castle FRANCISCO at
         | his post. Enter to him
         | 
          | If buffer max is 16, I am supposed to randomly halve it first?
         | 
          | Act Scene Elsinore Platform Before The His Post
         | 
          | Now what? The next words are "Bernardo Bernardo who's there"
        
           | Majromax wrote:
           | Yes, you're supposed to throw it away.
           | 
           | The key insight is that words should appear in your scratch
           | space with equal probability, no matter how often they appear
           | in the source text. If you have a scratch space of size one,
           | then the sequence of "apple" x 16 + "banana" x 1 should have
            | equal chances of the scratch space containing [apple] or
           | [banana] at the end of the sequence, at least averaged over
           | all 17 permutations.
           | 
           | One way to achieve this result is to make the scratch space
           | memory-free. Rather than think of it as "remove the word with
           | probability x", think of it as " _always_ remove the word,
           | then re-add it with probability (1-x) ".
        
         | reichstein wrote:
         | The way I'd describe it is:
         | 
          | - You have a buffer. You initially try to fit every unique
          |   word on there.
          | - If the buffer gets full, you know that you can't fit all
          |   the unique words in there. So you decide to keep only a
          |   fraction, _p1_, of them. Run through the buffer, keep each
          |   value with probability _p1_.
          | - Keep adding new words, again only with probability _p1_.
          | - If the buffer gets full again, _p1_ was too large, so you
          |   choose a lower fraction, _p2_, retain existing elements
          |   only with probability _p2_/_p1_, and continue adding new
          |   words with probability _p2_.
          | - Every time the buffer gets full, you lower the fraction of
          |   words you try to retain.
         | 
         | The choice of using _pn_ = (1/2)^n is just easy for a computer,
         | it only needs entire random bits at each step.
         | 
         | What I _don't_ get is how this is correct for counting unique
         | words. If I have a text of 16384 unique words, then I can
         | accept that each will be in the final list with the same
         | probability. But if I take that list and repeat the last word
         | 30000 times, then it becomes overwhelmingly plausible that that
         | word is in the final list, even though I haven't changed the
         | number of unique words. Maybe there is some statistical
         | property that evens things out, but I couldn't see it from the
         | article.
        
       | gnfargbl wrote:
       | On the topic of counting things, I would like to mention this
       | efficient and easily-implemented algorithm for finding the top-
       | _k_ items in a stream, which I think is perhaps not as well known
       | as it should be:
       | 
       |  _A Simple Algorithm for Finding Frequent Elements in Streams and
       | Bags_
       | 
       |  _Karp, Shenker & Papadimitriou_
       | 
       | https://www.cs.umd.edu/~samir/498/karp.pdf
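       | 
       | A compact Python sketch of the counter idea (this is the classic
       | Misra-Gries formulation; the paper's presentation differs in
       | details):
       | 
       |     def frequent_elements(stream, k):
       |         # Returns a superset of all elements occurring more than
       |         # len(stream)/k times, using at most k-1 counters;
       |         # candidates must be verified with a second pass.
       |         counters = {}
       |         for x in stream:
       |             if x in counters:
       |                 counters[x] += 1
       |             elif len(counters) < k - 1:
       |                 counters[x] = 1
       |             else:
       |                 for key in list(counters):  # decrement all counters
       |                     counters[key] -= 1
       |                     if counters[key] == 0:
       |                         del counters[key]
       |         return set(counters)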
        
         | vanderZwan wrote:
         | > _the top-k items in a stream_
         | 
         | Hmm, this is phrased in a way that sounds different (to my
         | ears) than the abstract, which says:
         | 
         | > _it is often desirable to identify from a very long sequence
         | of symbols (or tuples, or packets) coming from a large alphabet
         | those symbols whose frequency is above a given threshold_
         | 
         | Your description suggests finding a fixed number of k items,
         | with the guarantee that they will be the top ones. The abstract
         | sounds like it determines an a priori unknown number of items
         | that meet the criterion of having a frequency above a given
         | threshold.
         | 
         | So "find the 100 oldest users" vs "find all users older than
         | 30".
         | 
         | Am I misunderstanding you or the abstract? (English is not my
         | first language)
        
       | ceving wrote:
       | Can some native speaker tell me: is "count" the new "guess"?
        
         | dekhn wrote:
         | It's estimation, not counting.
        
         | Culonavirus wrote:
         | In the post-GPT world? Yes. Wait, maybe actually no? Depends on
         | context.
         | https://www.reddit.com/r/ChatGPT/comments/16jvl4x/wait_actua...
        
         | Koshkin wrote:
         | https://news.ycombinator.com/item?id=40394015
        
       | arn3n wrote:
       | Does anyone else notice that quanta consistently has good
       | illustrations and diagrams for their articles? Good
       | visualizations and graphics can do extraordinary things for the
       | understandability of a topic -- people like 3Blue1Brown
       | understand this, and I'm glad quanta puts effort into it.
        
       | Gbotex wrote:
       | nice read!
        
       | cb321 wrote:
       | This Nim program might clarify some things for someone.
       |     import std/[random, math, sets, times]; type F = float
       | 
       |     template uniqCEcvm*(it; xfm; k=100, alpha=0.05): untyped =
       |       ## Return (unbiased estim. cardinality of `it`, 1-alpha CI ebound*(1 +- e)).
       |       var x1 = initHashSet[type(xfm)](k); var x = x1.addr
       |       var x2 = initHashSet[type(xfm)](k); var y = x2.addr
       |       var p  = 1.0          # p always 2^-i means could ditch FP
       |       var n  = 0            #..RNG & arith for bit shifts/masks.
       |       let t0 = epochTime()
       |       for e in it:
       |         inc n   #; let e = xfm # to compute randState.next here
       |         x[].excl e          # 'd be nice to reduce to 1 hash by
       |         if rand(1.0) < p:   #..excl w/prob 1 - p but then cannot
       |           x[].incl e        #..add new keys.  p -> 0 anyway.
       |         while x[].len == k: # KnuthLoop not CVMIf guards against
       |           for e in x[]:     #..unlikely case of freeing nothing.
       |             if rand(1.0) < 0.5: y[].incl e
       |           p *= 0.5
       |           swap x, y; y[].clear # | e.bound conservative by ~15X
       |       (int(x[].len.F/p), sqrt(12.0*ln(8.0*n.F/alpha)/k.F),
       |        (epochTime() - t0)*1e9/n.F)
       | 
       |     when isMainModule:  # Takes nItem/trial dupMask space nTrial
       |       when defined danger: randomize()
       |       import std/[strutils, os, strformat]
       |       let n   = if paramCount()>0: parseInt(paramStr(1)) else: 100000
       |       let msk = (if paramCount()>1: parseInt(paramStr(2)) else: 0xFFFF).uint64
       |       let k   = if paramCount()>2: parseInt(paramStr(3)) else: 512 # 32KiB
       |       let m   = if paramCount()>3: parseInt(paramStr(4)) else: 35
       |       var ex  = initHashSet[uint64]()
       |       var xs  = newSeq[uint64](n)
       |       for j in 1..m:
       |         xs.setLen 0; ex.clear
       |         for i in 1..n: xs.add randState.next and msk
       |         for e in xs: ex.incl e
       |         let (apx, err, t) = uniqCEcvm(xs, uint64, k)
       |         let ap = apx.F; let an = ex.len.F
       |         echo ex.len," ",apx,&" eb: {err:.4f} |x-a|/x: {abs(ap-an)/an:.4f} {t:.1f}ns"
       | 
       | First few lines of laptop output:
       | 
       |     51160 49152 eb: 0.6235 |x-a|/x: 0.0392 10.5ns
       |     51300 51840 eb: 0.6235 |x-a|/x: 0.0105 10.7ns
       |     51250 55168 eb: 0.6235 |x-a|/x: 0.0764 10.9ns
       |     51388 52352 eb: 0.6235 |x-a|/x: 0.0188 10.5ns
       | 
       | Ditching the floating point RNG & arithmetic might be able to
       | lower that time per element cost to more like 5 ns or maybe
       | 2GiB/s for 8B ints. The toggle back & forth between old & new
       | HashSet is just to re-use the same L1 CPU cache memory.
       | 
       | Note also that the error bound in the CVM paper seems very
       | conservative. My guess is that it could maybe be tightened by
       | 15+X (at the scale of those example numbers), but that might also
       | require some kind of special function evaluation. Anyone have any
       | ideas on that?
       | 
       | Also note that since this is just estimating unique identities,
       | you can always just count unique 64-bit hashes of keys (or maybe
       | even _32-bit_ hashes, depending on your accuracy needs and
       | cardinalities) if you care about the key space or if key compares
       | are slow.
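       | 
       | For example, something like this (a sketch, assuming 64-bit
       | collisions are negligible at your cardinalities):
       | 
       |     import hashlib
       | 
       |     def as_u64(key) -> int:
       |         # Feed the estimator fixed-size ints instead of raw keys.
       |         return int.from_bytes(hashlib.blake2b(
       |             str(key).encode(), digest_size=8).digest(), "big")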
        
         | menthe wrote:
         | Frankly, who can read this!? I am not sure what's worse: the
         | multi-line comments spanning multiple lines of code, the
         | multiple instructions on a single line, or the apparent
         | disconnect from the pseudo-code in the article.
        
           | V1ndaar wrote:
           | I would blame the majority of your criticism on the fact that
           | HN is not the best place to read code. Also, syntax
           | highlighting & basic familiarity with Nim helps.
           | 
           | His code is doing a few more things than necessary. The
           | actual algorithm is inside the `uniqCEcvm` template. The `it`
           | it receives is anything you can iterate over (a collection or
           | an iterator). Multiple things in one line really only appear
           | where they directly relate to the part at the beginning of
           | the line.
           | 
           | The `when isMainModule` is Nim's way of Python's `if __name__
           | == "__main__"`. The entire part below that is really just a
           | mini CL interface to bench different (random) examples. Final
           | thing to note maybe, the last expression of a block (e.g. of
           | the template here) will be returned.
           | 
           | And well, the style of comments is just personal preference
           | of course. Whether you prefer to stay strictly below 80 cols
           | or not, shrug.
           | 
           | I grant you that the usage of 2 sets + pointer access to them
           | + swapping makes it harder to follow than necessary. But I
           | assume the point of it was not on "how to write the simplest
           | looking implementation of the algorithm as it appears in the
           | paper". But rather to showcase a full implementation of a
           | reasonably optimized version.
           | 
           | Here's a version (only the algorithm) following the paper
           | directly:
           | 
           |     proc estimate[T](A: seq[T], e, d: float): float =
           |       let m = A.len
           |       var p = 1.0
           |       var thr = ceil(12.0 / e^2 * log2(8*m.float / d))
           |       var kh = initHashSet[T](thr.round.int)
           |       for i in 0 ..< m:
           |         kh.excl A[i]
           |         if rand(1.0) < p:
           |           kh.incl A[i]
           |         if kh.card.float >= thr:
           |           for el in toSeq(kh): # clean out ~half probabilistically
           |             if rand(1.0) < 0.5:
           |               kh.excl el
           |           p /= 2.0
           |           if kh.card.float >= thr:
           |             return -1.0
           |       result = kh.card.float / p
        
             | menthe wrote:
             | Thank you very much for having taken the time. Your
             | comments and function are both very helpful!
        
       | eterevsky wrote:
       | Computer scientists invent a memory-efficient way to estimate the
       | size of a subset
        
         | m3kw9 wrote:
         | It seems fast too, as you can use fewer rounds of flips to get
         | an estimate, meaning you may not need to go through the entire
         | "book" to estimate the number of distinct words.
        
         | seadan83 wrote:
         | The subset is very important: it is the subset of unique
         | elements.
        
           | Koshkin wrote:
           | It's not a "subset," it's "the set of equivalence classes."
        
       | vitus wrote:
       | From the paper [0]:
       | 
       | > We state the following well-known concentration bound, Chernoff
       | bound, for completeness.
       | 
       | Which variant of the Chernoff bound is this? This is almost the
       | (looser variant of the) multiplicative form, but it's not quite
       | right (per the use of 1+delta instead of a single parameter
       | beta). In particular, that bound is only guaranteed to hold for
       | delta >= 0 (not beta = 1 + delta > 0 as asserted in the paper).
       | 
       | [0] https://arxiv.org/pdf/2301.10191
       | 
       | [1]
       | https://en.wikipedia.org/wiki/Chernoff_bound#Multiplicative_...
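       | 
       | For reference, the standard multiplicative statements (as on the
       | Wikipedia page; not quoted from the paper) are
       | 
       |     \Pr[X \ge (1+\delta)\mu] \le \exp\left(-\frac{\delta^2 \mu}{2+\delta}\right)
       |         \quad \text{for } \delta \ge 0,
       | 
       |     \Pr[|X-\mu| \ge \delta\mu] \le 2\exp\left(-\frac{\delta^2 \mu}{3}\right)
       |         \quad \text{for } 0 \le \delta \le 1.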
       | 
       | edit: to be clear: I'm not sure at all whether this breaks the
       | proof of correctness, although I'm having a bit of difficulty
       | following the actual details (I think I'd need to work through
       | the intermediate steps on paper).
        
       | mudiadamz wrote:
       | Python implementation:
       | 
       |     def streaming_algorithm(A, epsilon, delta):
       |         # Initialize parameters
       |         p = 1
       |         X = set()
       |         thresh = math.ceil((12 / epsilon ** 2) * math.log(8 * len(A) / delta))
       | 
       |         # Process the stream
       |         for ai in A:
       |             if ai in X:
       |                 X.remove(ai)
       |             if random.random() < p:
       |                 X.add(ai)
       |             if len(X) == thresh:
       |                 X = {x for x in X if random.random() >= 0.5}
       |                 p /= 2
       |                 if len(X) == thresh:
       |                     return '[?]'
       |         return len(X) / p
       | 
       |     # Example usage
       |     A = [1, 2, 3, 1, 2, 3]
       |     epsilon = 0.1
       |     delta = 0.01
       | 
       |     output = streaming_algorithm(A, epsilon, delta)
       |     print(output)
        
         | planede wrote:
         | return '[?]'
         | 
         | what's this?
        
           | jbaber wrote:
           | In some symbolic logic classes, that character "bottom"
           | represents "false" and the flipped "top" means true.
           | 
           | Don't know what they're getting at in the code, though.
        
             | dentemple wrote:
             | Once again proving the need for comments in code.
             | Especially for comments that are more useful than
             | "initialize parameters"
        
             | hollerith wrote:
             | >In some symbolic logic classes, that character "bottom"
             | represents "false"
             | 
             | That's unfortunate, because in the study of computer
             | programming languages, it means "undefined" (raise an
             | error).
        
               | tsss wrote:
               | Not always. It is also the uninhabited bottom type.
        
               | hollerith wrote:
               | My point is that there is a difference between a Python
               | function's returning false and the function's raising an
               | error, and sometimes the difference really matters, so it
               | would be regrettable if logic teachers actually did use
               | [?] to mean false because programming-language theorists
               | use it to mean something whose only reasonable
               | translation in the domain of practical programming is to
               | raise an error.
               | 
               | I have no idea what your point is.
        
           | martinsnow wrote:
           | An easy way to identify who copies code without understanding
           | it.
        
             | mudiadamz wrote:
             | You can just replace it with something like: print
             | ('Invalid thresh or something')
        
               | martinsnow wrote:
               | This however looks scary so an innocent copy/paste
               | programmer wouldn't touch it.
        
           | rcarmo wrote:
           | An error condition. I decided to do away with it and take a
           | small hit on the error by assuming the chances of the trimmed
           | set being equal to the threshold are very small and that the
           | error condition is effectively doing nothing.
           | 
           | I also changed the logic from == to >= to trigger
           | unfailingly, and pass in the "window"/threshold to allow my
           | code to work without internal awareness of the length of the
           | iterable:
           | 
           |     from random import random
           | 
           |     def estimate_uniques(iterable, window_size=100):
           |         p = 1
           |         seen = set()
           | 
           |         for i in iterable:
           |             if i not in seen:
           |                 seen.add(i)
           |             if random() > p:
           |                 seen.remove(i)
           |             if len(seen) >= window_size:
           |                 seen = {s for s in seen if random() < 0.5}
           |                 p /= 2
           |         return int(len(seen) / p)
           | 
           | I also didn't like the possible "set thrashing" when an item
           | is removed and re-added for high values of p, so I inverted
           | the logic. This should work fine for any iterable.
        
         | sestep wrote:
         | Is this ChatGPT? Also, I feel like this would be more useful if
         | it included import statements.
        
           | mudiadamz wrote:
           |     import math
           |     import random
        
         | ericmcer wrote:
         | I don't think there is a single variable name or comment in
         | this entire code block that conveys any information. Name stuff
         | well! Especially if you want random strangers to gaze upon your
         | code in wonder.
        
           | foobarian wrote:
           | Speaking of, one of my favorite discoveries with Unicode is
           | that there is a ton of code points acceptable for symbol
           | identifiers in various languages that I just can't wait to
           | abuse.
           | 
           | >>> a=3
           | 
           | >>> b=6
           | 
           | >>> a+b
           | 
           | 9
        
             | 8372049 wrote:
             | You're even using a and b, very good.
        
             | pragma_x wrote:
             | Ah yes, programming like the vikings intended.
        
           | tgv wrote:
           | The names are literally taken from the paper.
        
             | seaman1921 wrote:
             | Well, the paper also contains the code, so I doubt anyone
             | who looked at the paper cares about this paste; for folks
             | who did not read the paper, this is not very readable.
        
           | sneva wrote:
           | > Name stuff well
           | 
           | OP is following the same variable names as the article. I
           | prefer that over changing the variable names and then having
           | to figure out which variable in the code maps to which name
           | in the article.
        
         | rcarmo wrote:
         | That's not streaming if you're already aware of the length of
         | the iterable.
        
           | axitanull wrote:
           | In Python, you can simply substitute `A` with an iterable or
           | generator object, which can be of unknown length.
        
             | mattkrause wrote:
             | But for this algorithm, you need to know the total length
             | ("m") to set the threshold for the register purges.
             | 
             | Does it still work if you update m as you go?
        
               | empath-nirvana wrote:
               | You actually don't need to do that part in the algorithm.
               | If you don't know the length of the list, you can just
               | choose a threshold that seems reasonable and calculate
               | the margin of error after you're done processing (or I
               | guess at whatever checkpoints you want if it's
               | continuous).
               | 
               | In this example, they have the length of the list and
               | choose the threshold to give them a desired margin of
               | error.
        
               | rcarmo wrote:
               | See https://news.ycombinator.com/item?id=40390192
        
               | istjohn wrote:
               | You would want to calculate the threshold by choosing
               | your target epsilon and delta and an 'm' equal to the
               | largest conceivable size of the stream. Fortunately, the
               | threshold increases with log(m), so it's inexpensive to
               | anticipate several orders of magnitude more data than
               | necessary. If you wanted, you could work backwards to
               | calculate the actual 'epsilon' and 'delta' values for the
               | actual 'm' of the stream after the fact.
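               | 
               | Concretely, using the natural-log form from the Python
               | snippet upthread (a sketch; the paper's log base may
               | differ):
               | 
               |     import math
               | 
               |     def thresh(epsilon, delta, m):
               |         # buffer size the guarantee asks for at stream length m
               |         return math.ceil(12 / epsilon**2 * math.log(8 * m / delta))
               | 
               |     def epsilon_for(k, delta, m):
               |         # inverted: the error bound a buffer of size k buys you
               |         return math.sqrt(12 * math.log(8 * m / delta) / k)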
        
               | cb321 wrote:
               | Besides the ideas from istjohn, empath-nirvana, and
               | rcarmo, you can also just "flip the script": solve for
               | epsilon and report that as a 1-delta confidence interval
               | for the worst-case data distribution, as here:
               | https://news.ycombinator.com/item?id=40388878
               | 
               | Best case error is of course zero, but if you look at my
               | output then you will see, as I did, that the worst case
               | is a very conservative bound (i.e. 15X bigger than what
               | might "tend to happen"). That matters a lot for "space
               | usage" since the error =~ 1/sqrt(space), implying you
               | need a lot more space for lower errors: 15^2 = 225X more
               | space. Space optimization is usually well attended to
               | for this kind of problem. And, hey, maybe you know
               | something about the input data distribution?
               | 
               | So, in addition to the worst case bound, average case
               | errors under various distributional scenarios would be
               | very interesting. Or even better "measuring as you go"
               | enough distributional meta data to get a tighter error
               | bound. That latter starts to sound like it's one of
               | Knuth's Hard Questions Which if You Solve He'll Sign your
               | PhD Thesis territory, though. Maybe a starting point
               | would be some kind of online entropy(distribution)
               | estimation, perhaps inspired by
               | https://arxiv.org/abs/2105.07408 . And sure, maybe you
               | need to bound the error ahead of time instead of
               | inspecting it at any point in the stream.
        
             | rcarmo wrote:
             | I know that. See:
             | https://news.ycombinator.com/item?id=40390192
        
       | klaussilveira wrote:
       | This actually comes at a good time. I'm currently refactoring a
       | system that counts visits using inserts into a table, with a
       | tuple of date and IP. I was planning to replace it with a HLL
       | approach, but this is really interesting.
        
       | pytness wrote:
       | Is it me or is the description of the algo wrong?
       | > Round 1. Keep going through Hamlet, adding new words as you go.
       | If you come to a word that's already on your list, flip a coin
       | again. If it's tails, delete the word;
       | 
       | If I follow this description of "check if exists in list ->
       | delete":
       | 
       |     if hash_set.contains(word) {
       |         if !keep_a_word(round) {
       |             hash_set.remove(word);
       |             continue;
       |         }
       |     } else {
       |         hash_set.insert(word.to_string());
       |     }
       | 
       | The algorithm runs for ~20 iterations:
       | 
       |     Total word count: 31955 | limit: 1000
       |     End Round: 20, word count: 737
       |     Unique word count: 7233
       |     Estimated unique word count: 772800512
       | 
       | But if I save the word first and then delete the same word:
       | 
       |     hash_set.insert(word.to_string());
       | 
       |     if !keep_a_word(round) {
       |         hash_set.remove(word);
       |         continue;
       |     }
       | 
       | It gets the correct answer:
       | 
       |     Total word count: 31955 | 1000
       |     End Round: 3, word count: 905
       |     Unique word count: 7233
       |     Estimated unique word count: 7240
        
         | Alexanfa wrote:
         | Was just now solving it and came to see if others had the same
         | issue. Yep, you are right.
         | 
         |     function generateRandomNumbers(c, n) {
         |         let randomNumbers = new Array(c);
         |         for (let i = 0; i < randomNumbers.length; i++) {
         |             let randomNumber = Math.floor(Math.random() * (n + 1));
         |             randomNumbers[i] = randomNumber;
         |         }
         |         return randomNumbers;
         |     }
         | 
         |     function run(w, wS, m, r) {
         |         function round(r) {
         |             while (wS.size < m) {
         |                 const next = w.next()
         |                 if (next.done) return true;
         |                 wS.add(next.value)
         |                 prune(next.value, r)
         |             }
         |             return false
         |         }
         |         function prune(v, r) {
         |             for (let i = 0; i < r; i++) {
         |                 const flip = new Boolean(Math.round(Math.random()))
         |                 if (flip == false) {
         |                     wS.delete(v)
         |                 }
         |             }
         |         }
         |         function purge(wS) {
         |             const copy = new Set(wS)
         |             copy.forEach(ith => {
         |                 const flip = new Boolean(Math.round(Math.random()))
         |                 if (flip == false) {
         |                     wS.delete(ith)
         |                 }
         |             })
         |         }
         |         const done = round(r);
         |         if (!done) {
         |             purge(wS)
         |             return run(w, wS, r+1,m)
         |         }
         |         console.log(`Round ${r} done. ${wS.size} Estimate: ${wS.size / (1/Math.pow(2,r))}`)
         |     }
         | 
         |     const memory = 1000
         |     const words = generateRandomNumbers(3000000, 15000)
         |     const w = words[Symbol.iterator]() // create an iterator
         |     const wS = new Set();
         |     run(w, wS, memory, 0);
        
           | Alexanfa wrote:
           | Noticed an error;
           | 
           |     return run(w, wS, r+1,m)
           | 
           | Should be changed to:
           | 
           |     return run(w, wS, m, r+1)
        
         | josephernest wrote:
         | I ran into the same problem.
         | 
         | When implementing the exact method as described in quanta
         | magazine (without looking at the arxiv paper), I always had
         | estimates like 461746372167462146216468796214962164.
         | 
         | Then after reading the arxiv paper, I got the correct
         | estimate, with this code (very close to mudiadamz's comment
         | solution):
         | 
         |     import numpy as np
         | 
         |     L = np.random.randint(0, 3900, 30557)
         |     print(f"{len(set(L))=}")
         |     thresh = 100
         |     p = 1
         |     mem = set()
         | 
         |     for k in L:
         |         if k in mem:
         |             mem.remove(k)
         |         if np.random.rand() < p:
         |             mem.add(k)
         |         if len(mem) == thresh:
         |             mem = {m for m in mem if np.random.rand() < 0.5}
         |             p /= 2
         |     print(f"{len(mem)=} {p=} {len(mem)/p=}")
         | 
         | Or equivalently:
         | 
         |     import numpy as np
         | 
         |     L = np.random.randint(0, 3900, 30557)
         |     print(f"{len(set(L))=}")
         |     thresh = 100
         |     p = 1
         |     mem = []
         |     for k in L:
         |         if k not in mem:
         |             mem += [k]
         |         if np.random.rand() > p:
         |             mem.remove(k)
         |         if len(mem) == thresh:
         |             mem = [m for m in mem if np.random.rand() < 0.5]
         |             p /= 2
         |     print(f"{len(mem)=} {p=} {len(mem)/p=}")
         | 
         | Now I found the problem with the Quanta Magazine formulation.
         | By reading:
         | 
         | > Round 1. Keep going through Hamlet, adding new words as you
         | go. If you come to a word that's already on your list, flip a
         | coin again. If it's tails, delete the word; heads, and the word
         | stays on the list. Proceed in this fashion until you have 100
         | words on the whiteboard. Then randomly delete about half again,
         | based on the outcome of 100 coin tosses. That concludes Round
         | 1.
         | 
         | we want to write:
         | 
         |     for k in L:
         |         if k not in mem:
         |             mem += [k]
         |         else:
         |             if np.random.rand() > p:
         |                 mem.remove(k)
         |         if len(mem) == thresh:
         |             mem = [m for m in mem if np.random.rand() < 0.5]
         |             p /= 2
         | 
         | whereas it should be (correct):
         | 
         |     for k in L:
         |         if k not in mem:
         |             mem += [k]
         |         if np.random.rand() > p:    # without the else
         |             mem.remove(k)
         |         if len(mem) == thresh:
         |             mem = [m for m in mem if np.random.rand() < 0.5]
         |             p /= 2
         | 
         | Just this little "else" made it wrong!
        
           | Alexanfa wrote:
           | Quanta:                   Round 1. Keep going through Hamlet,
           | adding new words as you go. If you come to a word that's
           | already on your list, flip a coin again. If it's tails,
           | delete the word; heads, and the word stays on the list.
           | 
           | To:                   Round 1. Keep going through Hamlet, but
           | now flipping a coin for each word. If it's tails, delete the
           | word if it exists; heads, and add the word  if it's not
           | already on the list.
           | 
           | Old edit:                   Round 1. Keep going through
           | Hamlet, adding words but now flipping a coin immediately
           | after adding it. If it's tails, delete the word; heads, and
           | the word stays on the list.
        
             | josephernest wrote:
             | > adding words but now flipping a coin immediately after
             | adding it
             | 
             | Edit: I thought your formulation was correct but not
             | really:
             | 
             | We flip the coin after adding, but we also flip the coin
             | _even if we didn't add the word_ (because it was already
             | there). This is subtle!
             | 
             | wrong:
             | 
             |     if k not in mem:
             |         mem += [k]
             |         if np.random.rand() > p:
             |             mem.remove(k)
             | 
             | wrong:
             | 
             |     if k not in mem:
             |         mem += [k]
             |     else:
             |         if np.random.rand() > p:
             |             mem.remove(k)
             | 
             | correct:
             | 
             |     if k not in mem:
             |         mem += [k]
             |     if k in mem:      # not the same as "else" here
             |         if np.random.rand() > p:
             |             mem.remove(k)
             | 
             | correct:
             | 
             |     if k not in mem:
             |         mem += [k]
             |     if np.random.rand() > p:
             |         mem.remove(k)
        
               | Alexanfa wrote:
               | Ah, I'm using a set instead of a list, so I just always
               | add and then do the coin-toss removal.
        
       | jtanderson wrote:
       | Given the topic of the paper[0], the footnote is especially
       | charming:
       | 
       | > the authors decided to forgo the old convention of alphabetical
       | ordering of authors in favor of a randomized ordering, denoted by
       | r. The publicly verifiable record of the randomization is
       | available at https://www.aeaweb.org/journals/policies/random-
       | author-order...
       | 
       | [0]: https://arxiv.org/pdf/2301.10191
       | 
       | edit: formatting
        
       | theelous3 wrote:
       | > Imagine that you're sent to a pristine rainforest to carry out
       | a wildlife census.
       | 
       | alt-f4
        
       | mring33621 wrote:
       | Feels like reservoir sampling to me
        
       | prerok wrote:
       | While the algorithm is interesting and I commend the authors, it
       | should be noted that this is a guesstimate only. So it's not
       | really an efficient way to count distinct values, but an
       | efficient way to get an approximate count of distinct values.
        
       | unbalancedevh wrote:
       | I wonder how the error changes if you start a new round just
       | before finishing the stream, compared to if the last word in the
       | stream just fills the buffer and ends a round.
        
       | deadeye wrote:
       | Am I correct in thinking the accuracy of this method has more to
       | do with the distribution in the sample than the algo itself?
       | Meaning, given a 100% unique list of items, how far off would
       | this estimate be?
        
         | MarkusQ wrote:
         | It doesn't appear to be heavily dependent on the distribution.
         | It mostly depends on how large your buffer is compared to the
         | number of unique items in the list. If the buffer is the same
         | size or larger, it would be exact. If it's half the size needed
         | to hold all the unique items, you drop O(1 bit) of precision;
         | at one quarter, O(2 bits), etc.
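         | 
         | You can see this with a quick experiment (a sketch of the
         | plain CVM loop; the paper's version also re-checks the buffer
         | after halving):
         | 
         |     import random
         | 
         |     def cvm(stream, buf):
         |         kept, p = set(), 1.0
         |         for x in stream:
         |             kept.discard(x)
         |             if random.random() < p:
         |                 kept.add(x)
         |             if len(kept) == buf:
         |                 kept = {e for e in kept if random.random() < 0.5}
         |                 p /= 2
         |         return len(kept) / p
         | 
         |     # 10,000 distinct items: exact while buf exceeds that,
         |     # then noisier with each halving of the buffer
         |     for buf in (20_000, 10_000, 5_000, 2_500):
         |         print(buf, cvm(range(10_000), buf))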
        
       | SoftTalker wrote:
       | New interview question: Tell me how you would estimate the number
       | of unique words in _Hamlet_ using only 100 elements.
        
       | brianhorakh wrote:
       | I found myself wondering if this approach could be applied to
       | CFD (computational fluid dynamics) methods to reduce the total
       | volume of data points while still getting approximately close to
       | the correct final answer?
        
       | xpe wrote:
       | I have questions for Quanta Magazine -- this is probably a shot
       | in the dark, but if anyone can ask them a few questions, here
       | would be three of mine:
       | 
       | 1. Have independent researchers evaluated the effectiveness of
       | Quanta Magazine at their mission? [1]
       | 
       | 2. In the case of this article, did the editors consider
       | including visualization? If not, why not?
       | 
       | 3. Does Quanta have interactive designers and/or data scientists
       | on staff? How many? Why not more?
       | 
       | ## Background & Biases
       | 
       | I'm trying to overcome my disappointment in Quanta by being
       | curious about their mission, organization, and constraints. I am
       | glad they exist, but sometimes your "closest allies" can be one's
       | harshest critics.
       | 
       | I'm greatly troubled by the level of scientific, mathematical,
       | and rational understanding among the denizens of the world. But
       | for some reason, the state of science writing bothers me more. It
       | would seem that I hold out hope that science writers could do
       | better. This may be unfair, I admit.
       | 
       | Anyhow, rather than just bash Quanta for being, say, not as good
       | as the best math blogs or YouTube channels (such as 3Blue1Brown),
       | I really want to figure out (a) if I'm missing something; (b) if
       | they are actively trying to improve; and (c) what we can all
       | learn from their experience.
       | 
       | [1] From https://www.quantamagazine.org/about/ : "Quanta Magazine
       | is an editorially independent online publication launched by the
       | Simons Foundation in 2012 to enhance public understanding of
       | science. Why Quanta? Albert Einstein called photons "quanta of
       | light." Our goal is to "illuminate science.""
        
       | fnord77 wrote:
       | With other count-distinct algorithms, you can do unions and
       | intersections on the "sets" (theta sketches, even Bloom
       | filters).
       | 
       | Can you do this with CVM?
        
       | burjui wrote:
       | The algorithm uses less memory but more CPU time because of
       | rather frequent deletions, so it's a tradeoff, not just a
       | generally better algorithm, as the article may suggest.
        
         | empath-nirvana wrote:
         | the list is small so the cost of deletions should be small.
        
       | minkzilla wrote:
       | There is a big practicality problem I see with this algorithm.
       | The thresh defined in the paper relies on the length of the
       | stream. It seems to me that in a scenario where you have a big
       | enough data set to desire a solution that doesn't just store
       | every unique value, you don't know the length. I did not make it
       | through all the proofs, but I believe they use the fact that the
       | defined threshold has the length in it to prove error bounds. If
       | I were to use this in a scenario where I need to know error
       | bounds, I would probably ballpark the length of my stream to
       | estimate error bounds and then use the algorithm with a ballpark
       | threshold depending on my system's memory.
       | 
       | Another practical thing is the "exception" if nothing is removed
       | on line 6 in the original algorithm. This also seems needed for
       | the proof, but you would not want it in production, though the chance
       | of hitting it should be vanishingly small so maybe worth the
       | gamble?
       | 
       | Here is my faithful interpretation of the algorithm. And then a
       | re-interpretation with some "practical" improvements that almost
       | certainly make the provability of the correctness impossible.
       |     func CountUnique(scanner *bufio.Scanner, epsilon float64, delta float64, m int) int {
       |         X := make(map[string]bool)
       |         p := 1.0
       |         thresh := int(math.Ceil((12 / (epsilon * epsilon)) * math.Log(8*float64(m)/delta)))
       | 
       |         for scanner.Scan() {
       |             a := scanner.Text()
       |             delete(X, a)
       |             if rand.Float64() < p {
       |                 X[a] = true
       |             }
       | 
       |             if len(X) == thresh {
       |                 for key := range X {
       |                     if rand.Float64() < 0.5 {
       |                         delete(X, key)
       |                     }
       |                 }
       |                 p /= 2
       |                 if len(X) == thresh {
       |                     panic("Error")
       |                 }
       |             }
       |         }
       |         return int(float64(len(X)) / p)
       |     }
       | 
       |     func CountUnique2(scanner *bufio.Scanner, thresh int) int {
       |         // threshold passed in, based on system memory / estimates
       |         X := make(map[string]bool)
       |         p := 1.0
       | 
       |         for scanner.Scan() {
       |             a := scanner.Text()
       |             delete(X, a)
       |             if rand.Float64() < p {
       |                 X[a] = true
       |             }
       |             if len(X) >= thresh { // >= instead of == and remove the panic
       |                 for key := range X {
       |                     if rand.Float64() < 0.5 {
       |                         delete(X, key)
       |                     }
       |                 }
       |                 p /= 2
       |             }
       |         }
       |         return int(float64(len(X)) / p)
       |     }
       | 
       | I tested it with Shakespeare's work. The actual unique word count
       | is 71,595. With the second algorithm it is interesting to play
       | with the threshold. Here are some examples.
       | 
       |     threshold  1000: Mean Absolute Error: 2150.44,
       |                      Root Mean Squared Error: 2758.33,
       |                      Standard Deviation: 2732.61
       |     threshold  2000: Mean Absolute Error: 1723.72,
       |                      Root Mean Squared Error: 2212.74,
       |                      Standard Deviation: 2199.39
       |     threshold 10000: Mean Absolute Error: 442.76,
       |                      Root Mean Squared Error: 556.74,
       |                      Standard Deviation: 555.53
       |     threshold 50000: Mean Absolute Error: 217.28,
       |                      Root Mean Squared Error: 267.39,
       |                      Standard Deviation: 262.84
        
         | empath-nirvana wrote:
         | If you don't know the length of the stream in advance, you can
         | just calculate the margin of error when you're done, no?
        
           | cb321 wrote:
           | You just solve for epsilon. That's what I did:
           | https://news.ycombinator.com/user?id=cb321
        
       | zero_k wrote:
       | I was involved with implementing the DNF volume counting version
       | of this with the authors. You can see my blog post of it here:
       | 
       | https://www.msoos.org/2023/09/pepin-our-probabilistic-approx...
       | 
       | And the code here: https://github.com/meelgroup/pepin
       | 
       | Often, 30% of the time is spent in IO of reading the file, that's
       | how incredibly fast this algorithm is. Crazy stuff.
       | 
       | BTW, Knuth contributed to the algo. Knuth's notes:
       | https://cs.stanford.edu/~knuth/papers/cvm-note.pdf
       | 
       | He actually took time off (a whole month) from TAOCP to do this.
       | Also, he is exactly as crazy good as you'd imagine. Just mind-
       | blowing.
        
         | Hnrobert42 wrote:
         | That's really interesting and thanks for sharing.
         | 
         | I am very curious about the extraordinarily gifted. What made
         | you think Knuth is crazy good? Was there a particular moment?
         | Was it how fast he grokked ideas? Was it his ability to ELI5?
        
           | zero_k wrote:
           | What made me realize is that I saw some snippets of emails he
           | wrote to a colleague. It was... insane. You could see his
           | mind race. He recognized patterns in minutes that would take
           | me days, if not weeks, to recognize. Also, he actually writes
           | and runs code, overnight if need be. It was a bit of a shock
           | to me. He's not in an ivory tower. He's very much hands-on,
           | and when he's behind the wheel, you're in for a ride.
        
             | PartiallyTyped wrote:
             | I feel extremely jealous of you.
        
             | almostgotcaught wrote:
             | > He recognized patterns in minutes that would take me
             | days, if not weeks, to recognize... he actually writes and
             | runs code, overnight if need be
             | 
             | 70-80 years of actually being hands-on and i bet you'd be
             | pretty quick too. dude is definitely naturally "gifted" but
             | it seems pretty obvious being hands-on has a lot to do with
             | it.
        
               | gspetr wrote:
               | Experience and age have diminishing returns.
               | 
               | Biden's been hands-on in his domain for over 50 years,
               | yet "quick" is definitely not the word that comes to most
               | people's mind when they think of him nowadays.
        
               | jedberg wrote:
               | Actually quick is definitely something that comes to
               | mind. Quick in politics is of course relative, but the
               | speed with which he has enacted major changes (for
               | example marijuana legalization) is pretty quick in the
               | realm of politics when congress is of the other party.
        
             | devnonymous wrote:
             | > Also, he actually writes and runs code, overnight if need
             | be...He's not in an ivory tower. He's very much hands on,
             | and when he's behind the wheel, you're in for a ride.
             | 
             | That's such an admirable thing. Something to aspire to,
             | given his age. I wonder if one can say the same thing of
             | all the 'thought leaders' of the industry who go around
             | pontificating about code hygiene and tidiness.
        
         | jandrese wrote:
         | So now we have you to blame for a delay on the release of his
         | next book. :)
        
           | zero_k wrote:
           | Not me, the authors -- I'm a fly on the wall compared to them
           | ;) This is some serious work, I just did a fast
           | implementation. Re implementation -- it turns out that there
           | are parts of some standard libraries that this problem pushes
           | to its limits, that we had to go around during
           | implementation. So there were still some cool challenges
           | involved. I was also pretty happy about the late binding/lazy
           | evaluation thing I came up with. Of course Knuth just did it
           | (check his notes), without even thinking about it :D What is
           | an achievement for me is a lazy Monday coffee for him, but oh
           | well!
        
         | sesteel wrote:
         | Maybe you'd know, but why would one choose to not sort favoring
         | larger counts and drop the bottom half when full? It may be
         | obvious to others, but I'd be curious.
        
           | zero_k wrote:
           | The guarantees would not hold, I'm pretty sure ;) Maybe one
           | of the authors could chip in, but my hunch is that with that
           | you could actually introduce arbitrarily large errors. The
           | beauty of this algorithm really is its simplicity. Of course,
           | simple is.. not always easy. This absolute masterpiece by
           | Knuth should demonstrate this quite well:
           | 
           | https://www.sciencedirect.com/science/article/pii/0022000078.
           | ..
           | 
           | It's an absolutely trivial algorithm. Its average-case
           | analysis is ridiculously hard. Hence why I think this whole
           | Ordo obsession needs to be refined -- worst-case complexity
           | often has little to do with real-world behavior.
        
       | pierrebai wrote:
       | IDK if my explanation is correct, but I do believe it is. It
       | goes as follows.
       | 
       | Imagine that you have a container of potentially limitless
       | capacity. The container starts with small capacity, equal to the
       | real limited capacity that the real algorithm uses. As you add
       | elements, when the container is full, its capacity is doubled,
       | but all elements are then placed in a random position.
       | 
       | When you're done, you're told the occupancy of the subset of the
       | large container corresponding to the initial size, and how many
       | times the container size was doubled. Multiplying that occupancy
       | by two to the power of the number of doublings gives you an
       | approximation of the real size.
       | 
       | The small catch is that in the actual algorithm, due to the
       | discarding, the final number of elements, the occupancy, is
       | somewhat erroneous.
       | 
       | EDIT
       | 
       | Another way to say this: you've got a container of limited
       | capacity S. When full, you "virtually" double its size and then
       | randomly move elements over the full "imaginary" size of the
       | virtual container. So after the first filling, you end up with
       | about 1/2 the elements. After the second filling, 1/4, etc.
       | Also, since your "virtual" container is now larger, when you add
       | a new element, there is only a 1/2^n chance that it will be
       | placed inside your limited-capacity view of the entire virtual
       | container.
       | 
       | At the end, the approximate real count is the number of elements
       | you got, multiplied by 2 to the power of the number of size
       | doublings.
       | 
       | Again, it is as if you have a small window into a limitless
       | container.
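       | 
       | A toy Python restatement of that picture (equivalent in spirit
       | to the CVM loop; `window` stands in for the real capacity):
       | 
       |     import random
       | 
       |     def count_via_window(stream, window=100):
       |         kept = set()
       |         doublings = 0  # times the virtual container has doubled
       |         for x in stream:
       |             kept.discard(x)
       |             # an element lands inside the window with chance 2^-doublings
       |             if random.random() < 0.5 ** doublings:
       |                 kept.add(x)
       |             if len(kept) == window:
       |                 # doubling scatters ~half the elements out of view
       |                 kept = {e for e in kept if random.random() < 0.5}
       |                 doublings += 1
       |         return len(kept) * 2 ** doublings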
        
       | OutOfHere wrote:
       | The title is misleading. This is about probabilistic counting,
       | therefore about estimation. It is not clear if the efficiency
       | benefit extends to exact counting. Counting and estimation are
       | not the same concepts.
        
         | orf wrote:
         | Obviously it wouldn't
        
         | Koshkin wrote:
         | This is about estimation.
         | 
         | https://news.ycombinator.com/item?id=40394015
        
       | u8 wrote:
       | I took a crack at implementing this in Go. For anyone curious I
       | settled for algorithm 2 as I can just use a map as the base set
       | structure.
       | 
       | https://github.com/tristanisham/f0
        
       ___________________________________________________________________
       (page generated 2024-05-17 23:01 UTC)