[HN Gopher] Computer scientists invent an efficient new way to c...
___________________________________________________________________
Computer scientists invent an efficient new way to count
Author : jasondavies
Score : 726 points
Date : 2024-05-16 14:54 UTC (1 day ago)
(HTM) web link (www.quantamagazine.org)
(TXT) w3m dump (www.quantamagazine.org)
| pixelmonkey wrote:
| This algorithm seems to resemble HyperLogLog (and all its
| variants), which is also cited in the research paper. It uses the
| same insight -- that tracking whether we've hit a "run" of heads
| or tails has estimation value -- but flips the idea on its head
| (heh), leading to the simpler algorithm described here, which
| discards memorized values on the basis of runs of heads/tails.
|
| This also works especially well (that is, efficiently) in the
| streaming case, allowing you to keep something resembling a
| "counter" for the distinct elements, albeit with a error rate.
|
| The benefit of HyperLogLog is that it behaves similarly to a hash
| set in some respects -- you can add items, count the distinct ones,
| and, importantly, merge two HLLs together (union), all the while
| keeping memory fixed to mere kilobytes even for billion-item
| sets. In distributed data stores, this is the trick behind
| Elasticsearch/OpenSearch cardinality agg, as well as behind
| Redis/Redict with its PFADD/PFMERGE/PFCOUNT.
|
| I am not exactly sure how this CVM algorithm compares to HLL, but
| they got Knuth to review it, and they claim an undergrad can
| implement it easily, so it must be pretty good!
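|
| For a sense of what "merge = union" means mechanically, here is a
| toy register sketch in the HLL spirit (my own illustration, not
| Redis or Elasticsearch code; blake2b is an arbitrary hash choice,
| and the small/large-range bias corrections real implementations
| apply are skipped):
|
|     import hashlib
|
|     class ToyHLL:
|         def __init__(self, b=10):
|             self.b, self.m = b, 1 << b      # m = 2^b registers
|             self.reg = [0] * self.m
|         def add(self, item):
|             h = int.from_bytes(hashlib.blake2b(
|                 str(item).encode(), digest_size=8).digest(), "big")
|             idx = h >> (64 - self.b)        # first b bits pick a register
|             rest = h & ((1 << (64 - self.b)) - 1)
|             rank = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
|             self.reg[idx] = max(self.reg[idx], rank)
|         def merge(self, other):             # union is just elementwise max
|             self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]
|         def count(self):                    # raw estimate, no corrections
|             alpha = 0.7213 / (1 + 1.079 / self.m)
|             raw = sum(2.0 ** -r for r in self.reg)
|             return alpha * self.m * self.m / raw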
| hmottestad wrote:
| It's also possible to use HLL to estimate the cardinality of
| joins since it's possible to estimate both the union and the
| intersection of two HLLs.
|
| http://oertl.github.io/hyperloglog-sketch-estimation-paper/
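|
| A crude way to get an intersection out of union-capable sketches
| is inclusion-exclusion on the three estimates (a sketch of the
| idea only, not the linked paper's estimator, which is better
| behaved):
|
|     def intersection_estimate(card_a, card_b, card_union):
|         # |A ∩ B| ~= |A| + |B| - |A ∪ B|; noisy when the sets
|         # differ a lot in size
|         return max(0.0, card_a + card_b - card_union)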
| j-pb wrote:
| It's a really interesting open problem to get the cost of
| these down so that they can be used to heuristically select
| the variable order for worst case optimal joins during
| evaluation.
|
| It's somewhere on the back of my todo list, and I have the
| hunch that it would enable instance optimal join algorithms.
|
| I've dubbed these the Atreides Family of Joins:
|
| - Jessicas Join: The cost of each variable is based on the
| smallest number of rows that might be proposed for that
| variable by each joined relation.
|
| - Pauls join: The cost of each variable is based on the
| smallest number of distinct values that will actually be
| proposed for that variable from each joined relation.
|
| - Letos join: The cost of each variable is based on the actual
| size of the intersection.
|
| In a sense each of the variants can look further into the
| future.
|
| I'm using the first and the second in a triplestore I built
| in Rust [1] and it's a lot faster than Oxigraph. But I
| suspect that the constant factors would make the third
| infeasible (yet).
|
| 1: https://github.com/triblespace/tribles-
| rust/blob/master/src/...
| refset wrote:
| Having read something vaguely related recently [0] I
| believe "Lookahead Information Passing" is the common term
| for this general idea. That paper discusses the use of
| bloom filters (not HLL) in the context of typical binary
| join trees.
|
| > Letos join
|
| God-Emperor Join has a nice ring to it.
|
| [0] "Simple Adaptive Query Processing vs. Learned Query
| Optimizers: Observations and Analysis" -
| https://www.vldb.org/pvldb/vol16/p2962-zhang.pdf
| j-pb wrote:
| Thanks for the interesting paper! We now
| formally define our _God-Emperor Join_ henceforth denoted
| join_ge...
|
| Nice work with TXDB btw, it's funny how much impact
| Clojure, Datomic and Datascript had outside their own
| ecosystem!
|
| Let me return the favour with an interesting paper [1]
| that should be especially relevant to the columnar data
| layout of TXDB. I'm currently building a succinct on-disk
| format with it [2], but you might be able to simply add
| some auxiliary structures to your arrow columns instead.
|
| 1: https://aidanhogan.com/docs/ring-graph-wco.pdf
|
| 2: https://github.com/triblespace/tribles-
| rust/blob/archive/src...
| refset wrote:
| _> Nice work with TXDB btw_
|
| It's X.T. (as in 'Cross-Time' / https://xtdb.com), but
| thank you! :)
|
| _> 1: https://aidanhogan.com/docs/ring-graph-wco.pdf_
|
| Oh nice, I recall skimming this team's precursor paper
| "Worst-Case Optimal Graph Joins in Almost No Space"
| (2021) - seems like they've done a lot more work since
| though, so definitely looking forward to reading it:
|
| _> The conference version presented the ring in terms of
| the Burrows-Wheeler transform. We present a new
| formulation of the ring in terms of stable sorting on
| column databases, which we hope will be more accessible
| to a broader audience not familiar with text indexing_
| j-pb wrote:
| My apologies! It's even more embarrassing given the fact
| that I looked up the website, knowing that I _always_
| swap them after having written too many `tx-listen` in my
| career.
|
| They expanded their work to wider relations and made the
| whole framework a lot more penetrable. I think they over-
| complicate things a bit with their forward-extension, so
| I'm keeping every column twice (still much better than
| all permutations), which in turn allows for ad-hoc
| cardinality estimation.
|
| Also the 1 based indexing with inclusive ranges is really
| not doing them any favours. Most formulas become much more
| streamlined and simpler with 0 based indexing and
| exclusive ranges. (see `base_range` and `restrict_range`
| in [0])
|
| 0: https://github.com/triblespace/tribles-
| rust/blob/e3ad6f21cdc...
| jalk wrote:
| Iirc intersection requires the HLLs to have similar
| cardinality, otherwise the result is way off.
| willvarfar wrote:
| Just curious, dusting off my distant school memories :)
|
| How do the HLL and CVM that I hear about relate to reservoir
| sampling which I remember learning?
|
| I once had a job at a hospital (back when 'whiz kids' were
| being hired by pretty much every business) where I used
| reservoir sampling to make small subsets of records that were
| stored on DAT tapes.
| michaelmior wrote:
| I guess there is a connection in the sense that with
| reservoir sampling, each sample observed has an equal chance
| of remaining when you're done. However, if you have
| duplicates in your samples, traditional algorithms for
| reservoir sampling do not do anything special with
| duplicates. So you can end up with duplicates in your output
| with some probability.
|
| I guess maybe it's more interesting to look at it the other way.
| How is the set of samples you're left with at the end of CVM
| related to the set of samples you get with reservoir
| sampling?
| _a_a_a_ wrote:
| Was wondering the same, thanks for an answer.
| krackers wrote:
| Knuth's presentation of this [1] seems very very similar to
| the heap-version (top-k on a uniform deviate) of reservoir
| sampling as mentioned in [2]. The difference is in how
| duplicates are handled. I wouldn't be surprised if this
| algorithm was in fact already in use somewhere!
|
| [1] https://cs.stanford.edu/~knuth/papers/cvm-note.pdf [2]
| https://florian.github.io/reservoir-sampling/
|
| Edit: Another commenter [3] brought up the BJKST algorithm
| which seems to be similar procedure except using a suitably
| "uniform" hash function (pairwise independence) as the
| deviate instead of a random number.
|
| [3] https://news.ycombinator.com/item?id=40389178
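|
| For reference, the "top-k on a uniform deviate" flavour of
| reservoir sampling is roughly this (a sketch of the idea in [2],
| names my own):
|
|     import heapq, random
|
|     def reservoir_sample(stream, k):
|         # keep the k items with the largest random keys;
|         # those form a uniform random k-subset of the stream
|         heap = []                      # min-heap of (key, item)
|         for item in stream:
|             key = random.random()
|             if len(heap) < k:
|                 heapq.heappush(heap, (key, item))
|             elif key > heap[0][0]:
|                 heapq.heapreplace(heap, (key, item))
|         return [item for _, item in heap]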
| imoverclocked wrote:
| When do we stop calling this counting and start calling it
| estimation?
| pmontra wrote:
| Yes, even the subtitle states "a simple algorithm for estimating
| large numbers of distinct objects".
| cubefox wrote:
| This doesn't excuse the title though.
| vlovich123 wrote:
| As soon as people start reading past the headline.
| 112233 wrote:
| tbh, the title (and introduction) did a lot to dissuade me
| from finishing the (really good) article. It was actually
| informative, why dress it up as SEO blogspam?
| seanhunter wrote:
| Presumably so it is optimised for search engines and people
| find it.
|
| Publishers generally have good data about where their
| audience comes from. They wouldn't do this if it wasn't the
| best way they know of to maximise readership.
| a57721 wrote:
| I've been following Quanta for some time, I'm sure they
| don't care about SEO and number of visitors, they care
| about the quality of texts. They write for the general
| audience, and even though they try to preserve the
| scientific accuracy, sometimes their explanations may
| seem oversimplified and even a bit confusing when you
| come from the same field. It's not clickbait, it's their
| popular science style.
| HDThoreaun wrote:
| So they can get paid for their work? Are you giving them
| anything?
| lupire wrote:
| It's Quanta. Their mission is to make laypeople like math
| (not understand math), so they drown the math in sugar.
| card_zero wrote:
| Seems this is one of those things like UUIDs where we rely on
| it being very unlikely to be wrong, because statistics.
|
| > the accuracy of this technique scales with the size of the
| memory.
|
| I wonder if that's proportional to the number of distinct items
| to count, though.
|
| > if the [memory] is so big that it fits all the words, then we
| can get 100% accuracy
|
| Yes, but then the algorithm isn't being used any more, that's
| just normal counting.
|
| They counted the distinct words in Hamlet with a memory size of
| 100 words, about 2.5% of the number to find, and got a result
| that was off by 2. If you do the same with the whole of
| Shakespeare, again using 2.5% of the memory needed to hold all
| the distinct words, is the accuracy better?
|
| Anyway, this is limited to counting, and doesn't help list what
| the words _are,_ though quickly counting them first is perhaps
| a way to speed up the task of actually finding them?
| BlackFly wrote:
| When it is actually impossible to count something and when the
| error between estimation and an exact answer is not significant,
| the pedantic distinction is not helpful.
|
| The same thing happens with measurement. No measurement is ever
| exact. If I said I was measuring a count, someone would
| probably correct me and say that I am counting.
|
| Common speech is like that.
| matt3210 wrote:
| CS guys always wanting to throw away a good system and start from
| scratch.
| Miraltar wrote:
| Which good system are we talking about ?
| 112233 wrote:
| Which good system are you referring to?
| baq wrote:
| Tell me you didn't read the article without telling me you
| didn't read the article.
|
| ...but actually, I think you didn't even read the headline...?
| coldtea wrote:
| Yeah, if only we had stuck with those stone wheels from 20,000
| B.C.
| datavirtue wrote:
| Well, they are easier to work on.
| biscuit1v9 wrote:
| > The trick, he said, is to rely on randomization.
|
| > When the space is full, press pause and flip a coin for each
| word. Heads, and the word stays on the list; tails, and you
| delete it.
|
| I wasn't expecting it to go that far: randomization. How can you
| verify if the answer is good? Only approximation, maybe..
| 8372049 wrote:
| Did you read the (entire) article?
| seanhunter wrote:
| The proof is pretty straightforward and is included in the
| paper https://arxiv.org/abs/2301.10191
| HDThoreaun wrote:
| Yes, the result is an estimation.
| biscuit1v9 wrote:
| Thanks. Just wanted to be sure I didn't misunderstand.
| akamoonknight wrote:
| I don't know a word or phrase for this, but I really enjoy any
| examples of "thinking outside the box" like this because it's
| something I struggle with in my professional career. Learning not
| only the right ways to solve problems, but figuring out the
| questions to ask that make solving the problems you have easier
| or even in some cases possible. In this case, it's hey, we don't
| need exact numbers if we can define a probabilistic range given
| defined parameters. Other problems are gonna have other
| questions. I guess my hope is that if I see enough examples I'll
| be able to eventually internalize the thought process and apply
| it correctly.
| wmwragg wrote:
| I think it's generally thought of as "lateral thinking", Edward
| de Bono has written a few books about it you might find
| interesting.
| coldtea wrote:
| And some more commonplace words like "creativity" (as in
| "creative solution") etc. would apply.
| bzuker wrote:
| any particular one you'd recommend?
| dentemple wrote:
| To be fair, this was a university research team. Literally, a
| team of folks who can, all day everyday, iterate over a single
| topic using the Scientific Method.
|
| If you were paid by a big company to sit at a whiteboard all
| day with a team of equally intelligent engineers, I'm sure
| you'd come up with SOMETHING that would look like an
| "outside the box" solution to the rest of the world.
|
| However, most of us are paid to work the JIRA factory line
| instead, which limits the amount of time we can spend
| experimenting on just one single problem.
| 112233 wrote:
| After skimming Knuth's paper -- does the algorithm work if values
| are hashed, that is, the "uniform deviate" is selected
| deterministically for each unique value of the stream?
| devnonymous wrote:
| Not sure which Knuth paper you're referring to but skimming
| through the article my understanding is this algorithm works
| /only/ if the values are hashable. IOW how else does one define
| unique/distinct values ?
| 112233 wrote:
| https://cs.stanford.edu/~knuth/papers/cvm-note.pdf
|
| note how "u"s are selected every time a value is not in the list.
| I don't read it as being a hash.
| hcs wrote:
| I think the analysis relies on independent random "u"s,
| even for the same key.
| hcs wrote:
| https://cs.stanford.edu/~knuth/papers/cvm-note.pdf
|
| Looks like it was posted at the time
| https://news.ycombinator.com/item?id=36079213 but not much
| discussed. I found it over here
| https://news.ycombinator.com/item?id=40387594
| saulrh wrote:
| Huh, that's a clever twist on reservoir sampling. Neat.
| usgroup wrote:
| I found the paper took about as long to read as the blog post and
| is more informative:
|
| https://arxiv.org/pdf/2301.10191
|
| It is about estimating the cardinality of a set of elements
| derived from a stream. The algorithm is so simple, you can code
| it and play with it whilst you read the paper.
|
| The authors are explicit about the target audience and purpose
| for the algorithm: undergraduates and textbooks.
| vanderZwan wrote:
| If you refer to the subtitle of the paper - _An Algorithm for
| the (Text) Book_ - I think that is actually a reference to
| something Paul Erdos allegedly said about some proofs being so
| elegant in their simplicity and beauty that they are "from The
| Book", like representing some divine Platonic ideal.
|
| Given that Knuth himself reviewed it, he might have remarked
| that this was one of those algorithms! Perhaps the authors
| decided to include it in the title as a not-so-humble brag
| (which would be well-earned if that's the case!)
|
| _edit: originally this comment said Knuth was the one who said
| this about some algorithms being from The Book, but that was my
| faulty memory._
| kibibu wrote:
| I thought The Book was an Erdos thing. I wonder who used it
| first.
| stevesimmons wrote:
| "During a lecture in 1985, Erdos said, `You don't have to
| believe in God, but you should believe in The Book.`"
|
| https://en.wikipedia.org/wiki/Proofs_from_THE_BOOK
| vanderZwan wrote:
| I think you're right, I must have confused the two. I'll
| edit my comment to reduce the spread of misinformation.
| usgroup wrote:
| From the abstract: "... All the current state-of-the-art
| algorithms are, however, beyond the reach of an undergraduate
| textbook owing to their reliance on the usage of notions such
| as pairwise independence and universal hash functions. We
| present a simple, intuitive, sampling-based space-efficient
| algorithm whose description and the proof are accessible to
| undergraduates with the knowledge of basic probability theory
| ...."
| Sharlin wrote:
| The point is that the subtitle is pretty clearly intended
| to have a dual meaning; it wouldn't be phrased like that
| otherwise.
| dchftcs wrote:
| This is really twisting it, I don't find pairwise
| independence to be more advanced than applying a Chernoff
| bound. In this problem it'd mostly be the difference between
| using a Chebyshev-type bound or a Chernoff bound.
|
| Pairwise independence is to give the algorithm stronger
| guarantees and let it work with a simple hash function like
| ax+b, otherwise probably most existing algorithms can be
| simplified in the same way. The most similar algorithm is
| BJKST, which is almost identical except for specifying the
| sampling mechanism to require less randomness.
|
| To someone who worked on this type of thing before, it's
| like seeing people familiar with LLMs say "oh yet another
| X-billion parameter model on github doing more or less the
| same".
| cschmidt wrote:
| I liked this part. They got Knuth to review it, and he found
| mistakes. That's kind of cool, in its own way.
| We are deeply grateful to Donald E. Knuth for his thorough
| review, which not only enhanced the quality of this
| paper (including fixing several errors) but has also
| inspired us for higher standards.
| coldtea wrote:
| > _The authors are explicit about the target audience and
| purpose for the algorithm: undergraduates and textbooks._
|
| Doesn't seem like it. Seems like an algorithm (similar to other
| approximate cardinality estimation algorithms) with huge
| applicability.
| usgroup wrote:
| From the abstract: "All the current state-of-the-art
| algorithms are, however, beyond the reach of an undergraduate
| textbook owing to their reliance on the usage of notions such
| as pairwise independence and universal hash functions. We
| present a simple, intuitive, sampling-based space-efficient
| algorithm whose description and the proof are accessible to
| undergraduates with the knowledge of basic probability
| theory."
| coldtea wrote:
| That just says that this is also a simpler and more
| accessible algorithm, suitable even for undergraduate
| textbooks.
|
| Not that this is just useful for textbooks, a mere academic
| toy example, which would be something else entirely.
|
| This is both accessible AND efficient.
| swores wrote:
| > _" The authors are explicit about the target audience and
| purpose for the algorithm: undergraduates and textbooks."_
|
| If you're saying it's just for "undergraduates and textbooks",
| as opposed to just being simple enough for them to use but not
| limited to them, would you mind explaining what makes it useful
| for undergrads but not for professionals?
| pfsalter wrote:
| My interpretation from the paper is that this algorithm is
| simpler than other options but also worse. So in a
| professional context you'd use one of those instead.
| usgroup wrote:
| From the abstract: "All the current state-of-the-art
| algorithms are, however, beyond the reach of an undergraduate
| textbook owing to their reliance on the usage of notions such
| as pairwise independence and universal hash functions. We
| present a simple, intuitive, sampling-based space-efficient
| algorithm whose description and the proof are accessible to
| undergraduates with the knowledge of basic probability
| theory."
| swores wrote:
| That still only speaks to it being simple enough for
| students, not whether it's too simple for any other use vs.
| useful enough that students who learn it will spend the
| rest of their lives using it.
|
| For example word processor software is commonly described
| as simple enough for children to use at school, that
| doesn't mean that word processor software is of no use to
| adults.
| adgjlsfhk1 wrote:
| the reason it's too simple for most real world use is
| that hyper-log-log is the "good" version of this
| technique (but it's harder to prove that it works).
| aidenn0 wrote:
| Simplicity in an algorithm has limited direct impact on
| its usefulness in industry where libraries are prevalent.
|
| Consider mapping structures with Θ(k) expected lookup
| times. The simplest implementation is a hash table with
| chaining. Open-addressing is a bit more complicated, but
| also more common. Tries, which have O(k) worst-case
| lookup times are often covered in Undergraduate courses
| and are definitely easier to analyze and implement than
| forms of open-addressed hash-tables that have O(k)
| guarantees (e.g. Cuckoo hashing).
| resonious wrote:
| The blog post was more than half padding. Good that the
| algorithm is so simple it's hard to write a full length blog
| post about it!
| aspenmayer wrote:
| Link to abstract: https://arxiv.org/abs/2301.10191
| cb321 wrote:
| I agree the paper is better than the blog post, although one
| criticism I have of the CVM paper is that it has a
| termination/algorithm-exit condition instead of what Knuth's CVM
| notes (referenced elsewhere in this thread) do, which is just a
| loop to ensure the reservoir halving step actually frees space.
| It seems more work to explain the
| https://en.wikipedia.org/wiki/Up_tack than to just do the loop.
| [1]
|
| [1] https://news.ycombinator.com/item?id=40388878
| imglorp wrote:
| On that note, I'm also unfamiliar with this \ operator
| notation which is used without explanation.
|              X <- X \ {ai}
| jtanderson wrote:
| That is conventional set subtraction notation. "Assign to X
| the same set minus all elements of the set {a_i}".
|
| One example source, but it is pretty common in general:
| http://www.mathwords.com/s/set_subtraction.htm
| rocqua wrote:
| Set difference.
|
| Set X becomes X without element ai. This is the case whether
| or not ai was in the set X before the step was taken.
| _a_a_a_ wrote:
| I've known that symbol for decades, never knew its name -
| up-tack it is. Ta!
| hum3hum3 wrote:
| The counting/estimation technique is rather like a floating-point
| number: an integer exponent k and a mantissa of a population.
| melq wrote:
| Estimating the number of unique elements in a set and counting
| the number of unique elements in a set are very different things.
| Cool method, bad headline.
| dools wrote:
| It's an approximation, not an estimation.
| ranguna wrote:
| Still very different things, no?
| coldtea wrote:
| It's the same thing at different degrees of accuracy. The
| goal is the same.
| amelius wrote:
| Still, counting things and counting unique things are two
| different procedures.
| blackkettle wrote:
| Actually, my understanding is that it is an estimation
| because in the given context we don't know or cannot compute
| the true answer due to some kind of constraint (here memory
| or the size of |X|). An approximation is when we use a
| simplified or rounded version of an exact number that we
| actually know.
| davidmurdoch wrote:
| The authors of the article disagree with you.
| chrisweekly wrote:
| For someone who's pretty well-versed in English, but not a
| math-oriented computer scientist, this seems like a
| distinction without a difference. Please remedy my ignorance.
| lupire wrote:
| My GP was wrong, but the words are different.
|
| Estimation is a procedure that generates an estimate, which
| is a kind of approximation, while approximation is a result
| value. They are different "types", as a computer scientist
| would say. An approximation is any value that is
| justifiably considered to be nearly exact. ("prox" means
| "near". See also "proximate" and "proxy".)
|
| Estimation is one way to generate an approximation. An
| estimate is a subtype of an approximation. There are non-
| estimation ways to generate an approximation. For example,
| take an exact value and round it to the nearest multiple of
| 100. That generates an approximation, but does not use
| estimation.
| jameshart wrote:
| I'm not sure the linguistic differences here are as cut
| and dried as you would like them to be. Estimate and
| approximate are both verbs, so you can derive nouns from
| them both for the process of doing the thing, and for the
| thing that results from such a process.
|
| Estimation is the process of estimating. It produces an
| estimate.
|
| Approximation is the process of approximating. It
| produces an approximation.
|
| You can also derive adjectives from the verbs as well.
|
| An estimate is an estimated value.
|
| An approximation is an approximate value.
|
| But you're right that the 'approximate' terms make claims
| about the _result_ - that it is in some way near to the
| correct value - while the 'estimate' derived terms all
| make a claim about the _process that produced the result_
| (ie that it was based on data that is known to be
| incomplete, uncertain, or approximate)
| jameshart wrote:
| They're not _very_ different things; the terms are used
| interchangeably in most contexts because in the real world all
| counting methods have some nonzero error rate.
|
| We talk about 'counting votes' in elections, for example, yet
| when things are close we perform 'recounts' which we fully
| expect can produce slightly different numbers than the original
| count.
|
| That means that vote counting is actually vote estimating, and
| recounting is just estimating with a tighter error bound.
|
| I kind of think the mythology of the 'countless stones'
| (https://en.wikipedia.org/wiki/Countless_stones) is a sort of
| folk-reminder that you can never be too certain that you
| counted something right. Even something as big and solid and
| static as a standing stone.
|
| The situations where counting is not estimating are limited to
| the mathematical, where you can assure yourself of exhaustively
| never missing any item or ever mistaking one thing's identity
| for another's.
| lupire wrote:
| Come on. There is a fundamental difference between trying to
| get an exact answer and not trying to get an exactly
| correct answer.
| jameshart wrote:
| It's not a fundamental difference, it's a fundamental
| constraint.
|
| There are circumstances - and in real life those
| circumstances are _very common_ - where you must accept
| that getting an exactly correct answer is not realistic.
| Yet nonetheless you want to 'count' things anyway.
|
| We still call procedures for counting things under those
| circumstances 'counting'.
|
| The constraints on this problem (insufficient memory to
| remember all the unique items you encounter) are one such
| situation where even computerized counting isn't going to
| be exact.
| Levitz wrote:
| I agree with you, but we are talking theory here. The
| algorithm doesn't count, it estimates.
|
| You can make an algorithm that counts, you can make an
| algorithm that estimates, this is the second.
| sangnoir wrote:
| Estimation _is_ counting with error bars.
|
| Frankly, most of what you consider counting in your
| comment needs error bars - ask anyone who operated an
| all-cash cash-register how frequently end-of-day
| reconciliation didn't match the actual cash in the drawer
| (to the nearest dollar.)
|
| The following is a list from my personal experience - of
| presumably precisely countable things that didn't turn
| out to be the case: the number of computers owned by an
| fairly large regional business, the number of (virtual)
| servers operated by a moderately sized team, the number
| of batteries sold in a financial year by a battery
| company.
| shkkmo wrote:
| Counting is a subset of estimation, not a synonym.
|
| If I estimated the number of quarters in a stack by
| weighing them, that would be different from estimating
| the number of quarters in a stack by counting them. Both
| methods of estimation have error bars.
|
| The list you provide is of categories that don't have
| clear definitions. If you have a sufficiently clear
| definition for a category given your population, it has a
| precise count (though your counting methodologies will
| still be estimates.) If your definition is too fuzzy,
| then you don't actually have a countable set.
| seadan83 wrote:
| Counting and estimation are different by definition. One
| is a full enumeration, the other an extrapolation from
| sampled data. In both cases 'accuracy' is a factor. Even
| if we are counting the number of stars, it is still a
| different technique compared to estimating the number
| of stars.
|
| I could try to count fibers in a muscle or grains of sand
| on the beach; chances are accuracy would be low. One can
| be smart about technique for more accurate counts, e.g.
| get 10M sand counters and give them each 1kg of sand,
| whose grains they then count with tweezers and a
| microscope. That is counting. At the same time, we could
| find an average count of grains in 1kg sand from a random
| 100 of our counters, and then estimate what an expected
| total would be. The estimate can be used to confirm the
| accuracy of the counts.
| jameshart wrote:
| And by that definition this is a counting algorithm.
| samatman wrote:
| > _the terms are used interchangeably in most contexts_
|
| Counting and estimating are not used interchangeably in most
| contexts.
|
| > _because in the real world all counting methods have some
| nonzero error rate._
|
| The possibility that the counting process may be defective
| does not make it an estimation.
|
| > _We talk about 'counting votes' in elections, for example,
| yet when things are close we perform 'recounts' which we
| fully expect can produce slightly different numbers than the
| original count._
|
| We talk about counting votes in elections because votes are
| counted. The fact that the process isn't perfect is a defect;
| this does not make it estimation.
|
| > _That means that vote counting is actually vote estimating,
| and recounting is just estimating with a tighter error
| bound._
|
| No. Exit polling is estimation. Vote counting is counting.
| Vote recounting is also counting, and does not necessarily
| impose a tighter error bound, nor necessarily derive a
| different number.
|
| > _The situations where counting is not estimating are
| limited to the mathematical, where you can assure yourself of
| exhaustively never missing any item or ever mistaking one
| thing's identity for another's._
|
| So like, computers? Regardless, this is wrong. Estimating
| something and counting it are not the same thing. Estimation
| has uncertainty, counting may have error.
|
| This is like saying addition estimates a sum because you
| might get it wrong. It's just not true.
| jameshart wrote:
| So, IEEE floating point doesn't support 'addition' then.
| samatman wrote:
| IEEE 754 defines an exact binary result for the addition
| of any two floats.
|
| That this bit-identical result is not the same operation
| as addition of real numbers is irrelevant, because floats
| aren't reals.
|
| f1 + f2 is not an estimation. Even treating it as an
| approximation will get you into trouble. It's not that
| either, it's a floating-point result, and algorithms
| making heavy use of floating point had better understand
| precisely what f1 + f2 is going to give you if they want
| to obtain maximum precision and accuracy.
| aidenn0 wrote:
| Cool, so next time I have numbers that aren't reals to
| perform math on, I can use floats.
| Koshkin wrote:
| True - for (relatively) small numbers. For large (huge) numbers
| estimation is usually considered to be equivalent to counting,
| and the result is sometimes represented using the "scientific"
| notation (i.e. "floating-point") rather than as an integer. For
| example, the mole is an integer whose value is only known
| approximately (and no one cares about the exact value anyway).
| OutOfHere wrote:
| This doesn't justify estimation to be equivalent to counting
| even if some mathematicians consider them to be the same.
| Floating points are for estimation. Integers are for
| counting. The two are not the same, not even for large
| numbers.
| Koshkin wrote:
| "Equivalent" and "the same" are sometimes equivalent. (Or
| the same.)
|
| _It depends on what the meaning of the word 'is' is._
|
| https://libquotes.com/bill-clinton/quote/lby0h7o
| aaron695 wrote:
| Knuth talks about this paper here - "The CVM Algorithm for
| Estimating Distinct Elements in Streams" -
| https://cs.stanford.edu/~knuth/papers/cvm-note.pdf
| thesz wrote:
| HyperLogLog uses additions, it keeps sums. Thus, you can subtract
| one HLL's sums from another's. This is useful if the stream
| supports deletion. Streams with deletions can be found in
| log-structured merge trees, for one example, so one can estimate
| the count of distinct elements in all of the LSM tree hierarchy.
|
| The algorithm in the paper does not allow for deletions.
|
| Also, if one counts statistics of the stream of large elements
| (say, SHA-512 hashes, 64 bytes per hash), this algorithm requires
| some storage for elements from this stream, so the memory
| requirement is O(table size * element size).
| kromem wrote:
| Ah, so Thanos was just conducting a census.
| rcarmo wrote:
| I see what you did there.
| dist-epoch wrote:
| Do they assume any particular distribution of the items?
|
| Otherwise Nassim Taleb would like a word with them.
| coldtea wrote:
| No, and Taleb is not relevant for this.
| dist-epoch wrote:
| It is if the distribution is extreme.
| coldtea wrote:
| If Hamlet were 100000000 repetitions of the word Hamlet and 1
| occurrence of the word pancake, it would still give a good
| estimate - even if pancake gets 0.
| drycabinet wrote:
| Does finding the number of unique elements in a set actually
| require comparison of each element with everything else? Can't
| you use a hashtable? For every element, add it to the table
| (ignore if already exists), and finally, take a count of keys.
| jangxx wrote:
| Using a hashtable is the "normal" approach mentioned in the
| article. It works of course, but requires memory to store each
| unique element (or their hashes). If you have less memory
| available, the described algorithm can still give a very good
| approximation.
| xwolfi wrote:
| Imagine a million elements. How big must your hashtable be?
| The article explains it very well, did you miss it? It's a way
| to save memory.
|
| But to be honest I implemented it, ran it on Hamlet, and it's
| very wrong, it's barely useful but maybe if you just need a
| vague idea...
| cb321 wrote:
| How big was your _thresh_? I found it pretty accurate:
| https://news.ycombinator.com/item?id=40388878
| theginger wrote:
| That is fine when you have say 1 million values and only 1000
| are unique. But when you have 1 million values and about 900
| thousand are unique you are putting more or less the whole data
| set into memory.
| Anduia wrote:
| Using a hashtable is effective because you only compare
| elements within their hash buckets, not the entire set.
| However, they can become inefficient with very large datasets
| due to memory usage and processing time, which is where
| approximate counts shine.
| datavirtue wrote:
| This algorithm is still spinning a lot of random. I would
| guess that this is much less overhead than hashing but still
| seems like it could be significant.
| ReleaseCandidat wrote:
| > Does finding the number of unique elements in a set
|
| That's easy, that's all of them. Sorry, could not resist.
|
| Yes, hashing is the usual method. In a sorted list you can
| compare to the following element.
| seadan83 wrote:
| Imagine 1PB of data and you expect 30% of it to be unique. That
| needs 300TB RAM to store unique elements. Keep in mind the
| values in a hash table are the elements themselves, so a
| hashtable of perhaps 300TB. Doing that without that much RAM,
| even swapping to disk can be tough.
| gh0stcloud wrote:
| very interesting solution. A perfect example of how even some
| very complicated problems can sometimes have simple solutions.
| Will definitely try to write an implementation of this.
|
| The only minor downside to this is that it's obviously not
| deterministic since it depends on chance. But for many
| applications where the dataset is so big it doesn't fit in
| memory, that's probably just a tiny rounding error anyway.
| rep_lodsb wrote:
| >What if you're Facebook, and you want to count the number of
| distinct users who log in each day, even if some of them log in
| from multiple devices and at multiple times?
|
| Seems like a bad example of when this algorithm could be useful
| in practice.
|
| If you already know you will want this info when designing the
| login process, it's simple: keep track of the date of last login
| for each account, and increment the unique user counter when the
| stored value is different from the current one.
|
| And even if not, it should still be possible to "replay" a stream
| of login events from the database later to do this analysis.
| Unless maybe you already had years worth of data?
| novaRom wrote:
| I wish we had a theory connecting randomness with time/space
| complexity.
|
| Intuitively I think the key to many computational limitations is
| making use of randomness.
| ilya_m wrote:
| Your intuition is correct. See
| https://en.wikipedia.org/wiki/BPP_(complexity)
| ilya_m wrote:
| "In a recent paper" - really? The paper first appeared in ESA
| 2022. The revised version (with some errors fixed) is from May
| 2023. To be fair, compared to the rate of progress in this area
| between Flajolet & Martin (1985) and the previous SOTA (Blasiok,
| 2018), a couple of years of sitting on this news is not a lot.
| davidmurdoch wrote:
| Sounds like this qualifies as "recent" to me.
| ngrilly wrote:
| It is quite amazing that, after so many talented researchers
| worked on this problem for decades, the authors were
| able to come up with such an elegant and simple solution. Another
| proof that research is very necessary and useful.
| unnouinceput wrote:
| "Computer scientists invent an efficient new way to approximate
| counting large unique entries" - fixed the title
| renonce wrote:
| At first I found this paragraph confusing:
|
| > Keep going through Hamlet, adding new words as you go. If you
| come to a word that's already on your list, flip a coin again. If
| it's tails, delete the word; heads, and the word stays on the
| list.
|
| Why would you delete the word already on the list by flipping
| coins? Doesn't this reduce the accuracy by counting less words
| than expected? And will the word be added to the list later?
|
| After thinking about it for a while and reading the paper, I've
| finally developed a good mental model for how this algorithm
| works as below, which should convince even a high schooler why
| the algorithm works:
|
| 1. You are given a streaming set of elements from [n] (a finite
| set of n distinct elements). Now let's give each element a random
| uniform real number in [0,1]. This helps us choose the elements
| we want: if we choose all elements with the number below 0.5, we
| get about half of the elements.
|
| 2. Assume for a moment we have unbounded memory. Now we maintain
| a table of already seen elements, and for each element we keep
| the _last_ real number attached with that element: so if an
| element A appears three times with the number 0.3, 0.7, 0.6, we
| keep A with the number 0.6.
|
| 3. Our memory is bounded! So we keep only the elements below a
| threshold 2^-r, that is, 1, 0.5, 0.25, etc. So at first we will
| keep all elements, but when we don't have enough memory, we will
| need to filter existing elements according to the threshold. It's
| easy to see that when we decrease threshold, only elements
| already in memory will meet the criteria and continue to exist in
| memory. No other elements in the stream will be below threshold.
| Also note that whether an element is in memory depends only on
| its _last_ occurrence. It can exist in memory for a while, get
| dropped because a later element does not meet the threshold, and
| get back in.
|
| 4. We don't actually have a real number attached to every
| element! But we can pretend as if there is one. For each new
| element X from the stream, we replace the number in memory with
| its attached number, and we only care if its attached number is
| below 2^-r or not. If it is, it should be in our memory, and if
| it's not, it should be out. Once the number is in memory, it's a
| random number in [0,2^-r] and we care no further.
|
| When increasing r, we only care about whether the number is in
| [0,2^{-r-1}]. This has exactly 1/2 probability. So each number
| has 1/2 probability of getting out, and 1/2 probability of being
| kept in memory.
|
| 5. Now it's easy to see that whether an element is in the memory
| depends solely on the real number attached to its last occurrence.
| That is, each distinct element has exactly 2^-r probability of
| being in memory. 2^r multiplied by the number of elements in
| memory gives a good estimate of number of distinct elements.
|
| They criticized previous approaches as relying on hash functions.
| A simplified version of the approach is as follows:
|
| 1. Make a hash function that maps distinct elements to different
| uniform real numbers. So all As compute to 0.3, all Bs compute to
| 0.7, etc.
|
| 2. Use that hash function to transform all elements. The same
| element maps to the same real number, and different elements map
| to different real numbers.
|
| 3. Keep a set of N smallest distinct real numbers. This works by
| inserting numbers from the stream to the set. It's easy to see
| that adding the same element multiple times has exactly no
| effect; it's either too big to be in the set or the same as an
| existing number in the set.
|
| 4. This gives the Nth smallest real number in the set as K. The
| number of distinct elements is approximated as N/K.
|
| This algorithm is remarkable, but not in that it's efficient or
| new. Both algorithms using a hash function and algorithms
| without hash functions have been around and are optimal. I assume
| this algorithm is the first optimal algorithm without hash
| functions that is very easy to understand even by undergraduates.
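|
| A minimal sketch of that hash-based variant (the "keep the N
| smallest hashed values" idea; blake2b and the n/K estimate are my
| choices here, and I believe the standard KMV estimator actually
| uses (n-1)/K):
|
|     import hashlib, heapq
|
|     def kmv_estimate(stream, n=256):
|         def h(x):  # deterministic "uniform deviate" in [0, 1) per value
|             d = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
|             return int.from_bytes(d, "big") / 2.0**64
|         smallest = []   # negated max-heap of the n smallest distinct hashes
|         seen = set()
|         for item in stream:
|             v = h(item)
|             if v in seen:
|                 continue
|             if len(smallest) < n:
|                 heapq.heappush(smallest, -v); seen.add(v)
|             elif v < -smallest[0]:
|                 seen.discard(-heapq.heappushpop(smallest, -v)); seen.add(v)
|         if len(smallest) < n:     # fewer than n distinct values: exact count
|             return len(smallest)
|         return n / -smallest[0]   # n-th smallest of D uniforms is ~ n/D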
| majikandy wrote:
| I had the same problem with the same paragraph and still don't
| quite get it.
|
| Unfortunately I struggle to follow the detailed explanation you
| gave... since you seem to understand it... can you confirm that
| they really do mean to throw away the word in the list they
| just found?
|
| Eg
|
| ACT I SCENE Elsinore A platform before the castle FRANCISCO at
| his post. Enter to him
|
| If buffer max is 16, am I supposed to randomly halve it first?
|
| Act Scene Elisnore Platform Before The His Post
|
| Now what? The next words are "Bernado Bernado who's there"
| Majromax wrote:
| Yes, you're supposed to throw it away.
|
| The key insight is that words should appear in your scratch
| space with equal probability, no matter how often they appear
| in the source text. If you have a scratch space of size one,
| then the sequence of "apple" x 16 + "banana" x 1 should have
| equal chances of the scratch space containing [apple] or
| [banana] at the end of the sequence, at least averaged over
| all 17 permutations.
|
| One way to achieve this result is to make the scratch space
| memory-free. Rather than think of it as "remove the word with
| probability x", think of it as " _always_ remove the word,
| then re-add it with probability (1-x) ".
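|
| In code terms the per-word update is roughly (a sketch only, with
| X as the scratch set and p the current keep-probability, as in
| the paper):
|
|     import random
|
|     def observe(X, word, p):
|         X.discard(word)            # always forget the word first
|         if random.random() < p:    # then re-admit it with probability p
|             X.add(word)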
| reichstein wrote:
| The way I'd describe it is:
|
| - You have a buffer. You initially try to fit every unique word
| on there.
|
| - If the buffer gets full, you know that you can't fit all the
| unique words in there. So you decide to keep only a fraction,
| _p1_, of them. Run through the buffer, keep each value with
| probability _p1_.
|
| - Keep adding new words, again only with probability _p1_.
|
| - If the buffer gets full again, _p1_ was too large, so you
| choose a lower fraction, _p2_, retain existing elements only
| with probability _p2_/_p1_, and continue adding new words with
| probability _p2_.
|
| - Every time the buffer gets full, you lower the fraction of
| words you try to retain.
|
| The choice of _pn_ = (1/2)^n is just easy for a computer: it
| only needs single random bits at each step.
|
| What I _don't_ get is how this is correct for counting unique
| words. If I have a text of 16384 unique words, then I can
| accept that each will be in the final list with the same
| probability. But if I take that list and repeat the last word
| 30000 times, then it becomes overwhelmingly plausible that that
| word is in the final list, even though I haven't changed the
| number of unique words. Maybe there is some statistical
| property that evens things out, but I couldn't see it from the
| article.
| gnfargbl wrote:
| On the topic of counting things, I would like to mention this
| efficient and easily-implemented algorithm for finding the top-
| _k_ items in a stream, which I think is perhaps not as well known
| as it should be:
|
| _A Simple Algorithm for Finding Frequent Elements in Streams and
| Bags_
|
| _Karp, Shenker & Papadimitriou_
|
| https://www.cs.umd.edu/~samir/498/karp.pdf
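|
| For the curious, the core of that scheme is a few lines (a
| Misra-Gries-style sketch from memory, not code from the paper;
| it returns a superset of the items occurring more than n/k
| times, which you can verify with a second pass):
|
|     def frequent_candidates(stream, k):
|         counters = {}
|         for x in stream:
|             if x in counters:
|                 counters[x] += 1
|             elif len(counters) < k - 1:
|                 counters[x] = 1
|             else:  # decrement everything, drop counters that hit zero
|                 counters = {y: c - 1 for y, c in counters.items() if c > 1}
|         return set(counters)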
| vanderZwan wrote:
| > _the top-k items in a stream_
|
| Hmm, this is phrased in a way that sounds different (to my
| ears) than the abstract, which says:
|
| > _it is often desirable to identify from a very long sequence
| of symbols (or tuples, or packets) coming from a large alphabet
| those symbols whose frequency is above a given threshold_
|
| Your description suggests finding a fixed number k of items,
| with the guarantee that they will be the top ones. The abstract
| sounds like it determines an a priori unknown number of items
| that meet the criterion of frequency above a given threshold.
|
| So "find the 100 oldest users" vs "find all users older than
| 30".
|
| Am I misunderstanding you or the abstract? (English is not my
| first language)
| ceving wrote:
| Can some native speaker tell me: is "count" the new "guess"?
| dekhn wrote:
| It's estimation, not counting.
| Culonavirus wrote:
| In the post-GPT world? Yes. Wait, maybe actually no? Depends on
| context.
| https://www.reddit.com/r/ChatGPT/comments/16jvl4x/wait_actua...
| Koshkin wrote:
| https://news.ycombinator.com/item?id=40394015
| arn3n wrote:
| Does anyone else notice that quanta consistently has good
| illustrations and diagrams for their articles? Good
| visualizations and graphics can do extraordinary things for the
| understandability of a topic -- people like 3Blue1Brown
| understand this, and I'm glad quanta puts effort into it.
| Gbotex wrote:
| nice read!
| cb321 wrote:
| This Nim program might clarify some things for someone.
|     import std/[random, math, sets, times]; type F = float
|
|     template uniqCEcvm*(it; xfm; k=100, alpha=0.05): untyped =
|       ## Return (unbiased estim. cardinality of `it`,
|       ## 1-alpha CI ebound*(1 +- e)).
|       var x1 = initHashSet[type(xfm)](k); var x = x1.addr
|       var x2 = initHashSet[type(xfm)](k); var y = x2.addr
|       var p = 1.0           # p always 2^-i means could ditch FP
|       var n = 0             #..RNG & arith for bit shifts/masks.
|       let t0 = epochTime()
|       for e in it:
|         inc n               #; let e = xfm # to compute randState.next here
|         x[].excl e          # 'd be nice to reduce to 1 hash by
|         if rand(1.0) < p:   #..excl w/prob 1 - p but then cannot
|           x[].incl e        #..add new keys. p -> 0 anyway.
|         while x[].len == k: # KnuthLoop not CVMIf guards against
|           for e in x[]:     #..unlikely case of freeing nothing.
|             if rand(1.0) < 0.5: y[].incl e
|           p *= 0.5
|           swap x, y; y[].clear  # | e.bound conservative by ~15X
|       (int(x[].len.F/p), sqrt(12.0*ln(8.0*n.F/alpha)/k.F),
|        (epochTime() - t0)*1e9/n.F)
|
|     when isMainModule:      # Takes nItem/trial dupMask space nTrial
|       when defined danger: randomize()
|       import std/[strutils, os, strformat]
|       let n   = if paramCount()>0: parseInt(paramStr(1)) else: 100000
|       let msk = (if paramCount()>1: parseInt(paramStr(2)) else: 0xFFFF).uint64
|       let k   = if paramCount()>2: parseInt(paramStr(3)) else: 512 # 32KiB
|       let m   = if paramCount()>3: parseInt(paramStr(4)) else: 35
|       var ex  = initHashSet[uint64]()
|       var xs  = newSeq[uint64](n)
|       for j in 1..m:
|         xs.setLen 0; ex.clear
|         for i in 1..n: xs.add randState.next and msk
|         for e in xs: ex.incl e
|         let (apx, err, t) = uniqCEcvm(xs, uint64, k)
|         let ap = apx.F; let an = ex.len.F
|         echo ex.len," ",apx,&" eb: {err:.4f} |x-a|/x: {abs(ap-an)/an:.4f} {t:.1f}ns"
|
| First few lines of laptop output:
|
|     51160 49152 eb: 0.6235 |x-a|/x: 0.0392 10.5ns
|     51300 51840 eb: 0.6235 |x-a|/x: 0.0105 10.7ns
|     51250 55168 eb: 0.6235 |x-a|/x: 0.0764 10.9ns
|     51388 52352 eb: 0.6235 |x-a|/x: 0.0188 10.5ns
|
| Ditching the floating point RNG & arithmetic might be able to
| lower that time per element cost to more like 5 ns or maybe
| 2GiB/s for 8B ints. The toggle back & forth between old & new
| HashSet is just to re-use the same L1 CPU cache memory.
|
| Note also that the error bound in the CVM paper seems very
| conservative. My guess is that it could maybe be tightened by
| 15+X (at the scale of those example numbers), but that might also
| require some kind of special function evaluation. Anyone have any
| ideas on that?
|
| Also note that since this is just estimating unique identities,
| you can always just count unique 64-bit hashes of keys (or maybe
| even _32-bit_ hashes depending on your accuracy needs and
| cardinalities) if the key space is large or key compares are
| slow.
| menthe wrote:
| Frankly, who can read this!? I am not sure what's worse, the
| multi-line comments spanning multiple lines of code, having
| multiple instructions on a single line, or the apparent
| disconnect from the pseudo-code in the article.
| V1ndaar wrote:
| I would blame the majority of your criticism on the fact that
| HN is not the best place to read code. Also, syntax
| highlighting & basic familiarity with Nim helps.
|
| His code is doing a few more things than necessary. The
| actual algorithm is inside the `uniqCEcvm` template. The `it`
| it receives is anything you can iterate over (a collection or
| an iterator). Multiple things in one line really only appear
| where they directly relate to the part at the beginning of
| the line.
|
| The `when isMainModule` is Nim's version of Python's `if __name__
| == "__main__"`. The entire part below that is really just a
| mini CL interface to bench different (random) examples. Final
| thing to note maybe, the last expression of a block (e.g. of
| the template here) will be returned.
|
| And well, the style of comments is just personal preference
| of course. Whether you prefer to stay strictly below 80 cols
| or not, shrug.
|
| I grant you that the usage of 2 sets + pointer access to them
| + swapping makes it harder to follow than necessary. But I
| assume the point was not to write the simplest-looking
| implementation of the algorithm as it appears in the paper,
| but rather to showcase a full implementation of a reasonably
| optimized version.
|
| Here's a version (only the algorithm) following the paper
| directly:
|
|     proc estimate[T](A: seq[T], e, d: float): float =
|       let m = A.len
|       var p = 1.0
|       var thr = ceil(12.0 / e^2 * log2(8*m.float / d))
|       var kh = initHashSet[T](thr.round.int)
|       for i in 0 ..< m:
|         kh.excl A[i]
|         if rand(1.0) < p:
|           kh.incl A[i]
|         if kh.card.float >= thr:
|           for el in toSeq(kh): # clean out ~half probabilistically
|             if rand(1.0) < 0.5: kh.excl el
|           p /= 2.0
|           if kh.card.float >= thr:
|             return -1.0
|       result = kh.card.float / p
| menthe wrote:
| Thank you very much for having taken the time.. Your
| comments and function are both very helpful!
| eterevsky wrote:
| Computer scientists invent a memory-efficient way to estimate the
| size of a subset
| m3kw9 wrote:
| It seems fast too, as you can use fewer rounds of flips and get
| an estimate, meaning you may not need to go through the entire
| "book" to get an estimate of distinct words.
| seadan83 wrote:
| The subset is very important, it is the subset of unique
| elements.
| Koshkin wrote:
| It's not a "subset," it's "the set of equivalence classes."
| vitus wrote:
| From the paper [0]:
|
| > We state the following well-known concentration bound, Chernoff
| bound, for completeness.
|
| Which variant of the Chernoff bound is this? This is almost the
| (looser variant of the) multiplicative form, but it's not quite
| right (per the use of 1+delta instead of a single parameter
| beta). In particular, that bound is only guaranteed to hold for
| delta >= 0 (not beta = 1 + delta > 0 as asserted in the paper)
|
| [0] https://arxiv.org/pdf/2301.10191
|
| [1]
| https://en.wikipedia.org/wiki/Chernoff_bound#Multiplicative_...
|
| edit: to be clear: I'm not sure at all whether this breaks the
| proof of correctness, although I'm having a bit of difficulty
| following the actual details (I think I'd need to work through
| the intermediate steps on paper).
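|
| (For reference, the two-sided multiplicative form I have in mind
| is, for 0 < delta < 1: Pr[|X - mu| >= delta*mu] <=
| 2*exp(-delta^2*mu/3), with the one-sided upper tail
| Pr[X >= (1+delta)*mu] <= exp(-delta^2*mu/(2+delta)) holding for
| all delta >= 0.)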
| mudiadamz wrote:
| Python implementation:
|
|     def streaming_algorithm(A, epsilon, delta):
|         # Initialize parameters
|         p = 1
|         X = set()
|         thresh = math.ceil((12 / epsilon ** 2) *
|                            math.log(8 * len(A) / delta))
|
|         # Process the stream
|         for ai in A:
|             if ai in X:
|                 X.remove(ai)
|             if random.random() < p:
|                 X.add(ai)
|             if len(X) == thresh:
|                 X = {x for x in X if random.random() >= 0.5}
|                 p /= 2
|                 if len(X) == thresh:
|                     return '⊥'
|         return len(X) / p
|
|     # Example usage
|     A = [1, 2, 3, 1, 2, 3]
|     epsilon = 0.1
|     delta = 0.01
|     output = streaming_algorithm(A, epsilon, delta)
|     print(output)
| planede wrote:
|              return '⊥'
|
| what's this?
| jbaber wrote:
| In some symbolic logic classes, that character "bottom"
| represents "false" and a flipped "top" means true.
|
| Don't know what they're getting at in the code, though.
| dentemple wrote:
| Once again proving the need for comments in code.
| Especially for comments that are more useful than
| "initialize parameters"
| hollerith wrote:
| >In some symbolic logic classes, that character "bottom"
| represents "false"
|
| That's unfortunate, because in the study of computer
| programming languages, it means "undefined" (raise an
| error).
| tsss wrote:
| Not always. It is also the uninhabited bottom type.
| hollerith wrote:
| My point is that there is a difference between a Python
| function's returning false and the function's raising an
| error, and sometimes the difference really matters, so it
| would be regrettable if logic teachers actually did use
| ⊥ to mean false because programming-language theorists
| use it to mean something whose only reasonable
| translation in the domain of practical programming is to
| raise an error.
|
| I have no idea what your point is.
| martinsnow wrote:
| An easy way to identify who copies code without understanding
| it.
| mudiadamz wrote:
| You can just replace it with something like: print
| ('Invalid thresh or something')
| martinsnow wrote:
| This however looks scary so an innocent copy/paste
| programmer wouldn't touch it.
| rcarmo wrote:
| An error condition. I decided to do away with it and take a
| small hit on the error by assuming the chances of the trimmed
| set being equal to the threshold are very small and that the
| error condition is effectively doing nothing.
|
| I also changed the logic from == to >= to trigger
| unfailingly, and pass in the "window"/threshold to allow my
| code to work without internal awareness of the length of the
| iterable:
|
|     from random import random
|
|     def estimate_uniques(iterable, window_size=100):
|         p = 1
|         seen = set()
|         for i in iterable:
|             if i not in seen:
|                 seen.add(i)
|                 if random() > p:
|                     seen.remove(i)
|             if len(seen) >= window_size:
|                 seen = {s for s in seen if random() < 0.5}
|                 p /= 2
|         return int(len(seen) / p)
|
| I also didn't like the possible "set thrashing" when an item
| is removed and re-added for high values of p, so I inverted
| the logic. This should work fine for any iterable.
| sestep wrote:
| Is this ChatGPT? Also, I feel like this would be more useful if
| it included import statements.
| mudiadamz wrote:
|     import math
|     import random
| ericmcer wrote:
| I don't think there is a single variable name or comment in
| this entire code block that conveys any information. Name stuff
| well! Especially if you want random strangers to gaze upon your
| code in wonder.
| foobarian wrote:
| Speaking of, one of my favorite discoveries with Unicode is
| that there is a ton of code points acceptable for symbol
| identifiers in various languages that I just can't wait to
| abuse.
|
| >>> a=3
|
| >>> b=6
|
| >>> a+b
|
| 9
| 8372049 wrote:
| You're even using a and b, very good.
| pragma_x wrote:
| Ah yes, programming like the vikings intended.
| tgv wrote:
| The names are literally taken from the paper.
| seaman1921 wrote:
| well the paper also contains the code so I doubt anyone who
| looked at the paper cares about this paste - for folks who
| did not read the paper this is not very readable
| sneva wrote:
| > Name stuff well
|
| OP is following the same variable names as the article. I
| prefer that over changing the variable names and then having
| to figure out which variable name in the code maps to which
| in the article.
| rcarmo wrote:
| That's not streaming if you're already aware of the length of
| the iterable.
| axitanull wrote:
| In python, you can simply substitute `A` with an iterable or
| generator object, which can be a of unknown length.
| mattkrause wrote:
| But for this algorithm, you need to know the total length
| ("m") to set the threshold for the register purges.
|
| Does it still work if you update m as you go?
| empath-nirvana wrote:
| You actually don't need to do that part in the algorithm.
| If you don't know the length of the list, you can just
| choose a threshold that seems reasonable and calculate
| the margin of error after you're done processing. (or I
| guess at whatever checkpoints you want if it's
| continuous)
|
| In this example, they have the length of the list and
| choose the threshold to give them a desired margin of
| error.
| rcarmo wrote:
| See https://news.ycombinator.com/item?id=40390192
| istjohn wrote:
| You would want to calculate the threshold by choosing
| your target epsilon and delta and an 'm' equal to the
| largest conceivable size of the stream. Fortunately, the
| threshold increases with log(m), so it's inexpensive to
| anticipate several orders of magnitude more data than
| necessary. If you wanted, you could work backwards to
| calculate the actual 'epsilon' and 'delta' values for the
| actual 'm' of the stream after the fact.
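| As a rough sketch of that calculation (hypothetical helper names,
| using the thresh = ceil((12/eps^2) * ln(8m/delta)) formula from the
| paper that also shows up in code further down the thread):
|     import math
|
|     def threshold(eps, delta, m):
|         # buffer size (in stored elements) for target error eps,
|         # failure probability delta, and an upper bound m on stream length
|         return math.ceil(12 / eps**2 * math.log(8 * m / delta))
|
|     def epsilon_for(thresh, delta, m):
|         # work backwards: given the buffer size actually used,
|         # solve the same formula for the error bound eps
|         return math.sqrt(12 * math.log(8 * m / delta) / thresh)
|
|     t = threshold(0.05, 0.01, 10**9)
|     print(t)                             # roughly 1.3e5 slots
|     print(epsilon_for(t, 0.01, 10**12))  # ~0.056: 1000x more data barely hurts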
| cb321 wrote:
| Besides the ideas from istjohn, empath-nirvana, and
| rcarmo, you can also just "flip the script": solve for
| epsilon and report that as 1-delta confidence interval
| for the worst case data distribution as here:
| https://news.ycombinator.com/item?id=40388878
|
| Best case error is of course zero, but if you look at my
| output then you will see as I did that the worst case is
| a very conservative bound (i.e. 15X bigger than what
| might "tend to happen"). That matters a lot for "space
| usage" since the error =~ 1/sqrt(space), implying you need
| a lot more space for lower errors: 15^2 = 225X more
| space. Space optimization is usually well attended to for
| this kind of problem. And, hey, maybe you know something
| about the input data distribution?
|
| So, in addition to the worst case bound, average case
| errors under various distributional scenarios would be
| very interesting. Or even better "measuring as you go"
| enough distributional meta data to get a tighter error
| bound. The latter starts to sound like it's one of
| Knuth's Hard Questions Which if You Solve He'll Sign your
| PhD Thesis territory, though. Maybe a starting point
| would be some kind of online entropy(distribution)
| estimation, perhaps inspired by
| https://arxiv.org/abs/2105.07408 . And sure, maybe you
| need to bound the error ahead of time instead of
| inspecting it at any point in the stream.
| rcarmo wrote:
| I know that. See:
| https://news.ycombinator.com/item?id=40390192
| klaussilveira wrote:
| This actually comes at a good time. I'm currently refactoring a
| system that counts visits using inserts into a table, with a
| tuple of date and IP. I was planning to replace it with a HLL
| approach, but this is really interesting.
| pytness wrote:
| Is it me or is the description of the algo wrong?
| > Round 1. Keep going through Hamlet, adding new words as you go.
| If you come to a word that's already on your list, flip a coin
| again. If it's tails, delete the word;
|
| If I follow this description of "check if exists in list ->
| delete":
|     if hash_set.contains(word) {
|         if !keep_a_word(round) {
|             hash_set.remove(word);
|             continue;
|         }
|     } else {
|         hash_set.insert(word.to_string());
|     }
|
| The algorithm runs for ~20 iterations:
|     Total word count: 31955 | limit: 1000
|     End Round: 20, word count: 737
|     Unique word count: 7233
|     Estimated unique word count: 772800512
|
| But if I save the word first and then delete the same word:
|     hash_set.insert(word.to_string());
|     if !keep_a_word(round) {
|         hash_set.remove(word);
|         continue;
|     }
|
| It gets the correct answer:
|     Total word count: 31955 | 1000
|     End Round: 3, word count: 905
|     Unique word count: 7233
|     Estimated unique word count: 7240
| Alexanfa wrote:
| Was just now solving it and came to see if others had the same
| issue. Yep, you are right.
|     function generateRandomNumbers(c, n) {
|         let randomNumbers = new Array(c);
|         for (let i = 0; i < randomNumbers.length; i++) {
|             let randomNumber = Math.floor(Math.random() * (n + 1));
|             randomNumbers[i] = randomNumber;
|         }
|         return randomNumbers;
|     }
|
|     function run(w, wS, m, r) {
|         function round(r) {
|             while (wS.size < m) {
|                 const next = w.next()
|                 if (next.done) return true;
|                 wS.add(next.value)
|                 prune(next.value, r)
|             }
|             return false
|         }
|         function prune(v, r) {
|             for (let i = 0; i < r; i++) {
|                 const flip = new Boolean(Math.round(Math.random()))
|                 if (flip == false) {
|                     wS.delete(v)
|                 }
|             }
|         }
|         function purge(wS) {
|             const copy = new Set(wS)
|             copy.forEach(ith => {
|                 const flip = new Boolean(Math.round(Math.random()))
|                 if (flip == false) {
|                     wS.delete(ith)
|                 }
|             })
|         }
|         const done = round(r);
|         if (!done) {
|             purge(wS)
|             return run(w, wS, r+1,m)
|         }
|         console.log(`Round ${r} done. ${wS.size} Estimate: ${wS.size / (1/Math.pow(2,r))}`)
|     }
|
|     const memory = 1000
|     const words = generateRandomNumbers(3000000, 15000)
|     const w = words[Symbol.iterator]() // create an iterator
|     const wS = new Set();
|     run(w, wS, memory, 0);
| Alexanfa wrote:
| Noticed an error:
|     return run(w, wS, r+1,m)
|
| Should be changed to:
|     return run(w, wS, m, r+1)
| josephernest wrote:
| I got the same problem.
|
| When implementing the exact method as described in quanta
| magazine (without looking at the arxiv paper), I always had
| estimates like 461746372167462146216468796214962164.
|
| Then after reading the arxiv paper, I got the correct
| estimate, with this code (very close to mudiadamz's comment
| solution):
|     import numpy as np
|
|     L = np.random.randint(0, 3900, 30557)
|     print(f"{len(set(L))=}")
|     thresh = 100
|     p = 1
|     mem = set()
|     for k in L:
|         if k in mem:
|             mem.remove(k)
|         if np.random.rand() < p:
|             mem.add(k)
|         if len(mem) == thresh:
|             mem = {m for m in mem if np.random.rand() < 0.5}
|             p /= 2
|     print(f"{len(mem)=} {p=} {len(mem)/p=}")
|
| Or equivalently:
|     import numpy as np
|
|     L = np.random.randint(0, 3900, 30557)
|     print(f"{len(set(L))=}")
|     thresh = 100
|     p = 1
|     mem = []
|     for k in L:
|         if k not in mem:
|             mem += [k]
|         if np.random.rand() > p:
|             mem.remove(k)
|         if len(mem) == thresh:
|             mem = [m for m in mem if np.random.rand() < 0.5]
|             p /= 2
|     print(f"{len(mem)=} {p=} {len(mem)/p=}")
|
| Now I found the quanta magazine formulation problem. By
| reading:
|
| > Round 1. Keep going through Hamlet, adding new words as you
| go. If you come to a word that's already on your list, flip a
| coin again. If it's tails, delete the word; heads, and the word
| stays on the list. Proceed in this fashion until you have 100
| words on the whiteboard. Then randomly delete about half again,
| based on the outcome of 100 coin tosses. That concludes Round
| 1.
|
| we want to write:
|     for k in L:
|         if k not in mem:
|             mem += [k]
|         else:
|             if np.random.rand() > p:
|                 mem.remove(k)
|         if len(mem) == thresh:
|             mem = [m for m in mem if np.random.rand() < 0.5]
|             p /= 2
|
| whereas it should be (correct):
|     for k in L:
|         if k not in mem:
|             mem += [k]
|         if np.random.rand() > p:  # without the else
|             mem.remove(k)
|         if len(mem) == thresh:
|             mem = [m for m in mem if np.random.rand() < 0.5]
|             p /= 2
|
| Just this little "else" made it wrong!
| Alexanfa wrote:
| Quanta:
|     Round 1. Keep going through Hamlet, adding new words as you
|     go. If you come to a word that's already on your list, flip
|     a coin again. If it's tails, delete the word; heads, and the
|     word stays on the list.
|
| To:
|     Round 1. Keep going through Hamlet, but now flipping a coin
|     for each word. If it's tails, delete the word if it exists;
|     heads, and add the word if it's not already on the list.
|
| Old edit:
|     Round 1. Keep going through Hamlet, adding words but now
|     flipping a coin immediately after adding it. If it's tails,
|     delete the word; heads, and the word stays on the list.
| josephernest wrote:
| > adding words but now flipping a coin immediately after
| adding it
|
| Edit: I thought your formulation was correct but not
| really:
|
| We flip the coin after adding, but we also flip the coin
| _even if we didn't add the word_ (because it was already
| there). This is subtle!
|
| wrong:
|     if k not in mem:
|         mem += [k]
|         if np.random.rand() > p:
|             mem.remove(k)
|
| wrong:
|     if k not in mem:
|         mem += [k]
|     else:
|         if np.random.rand() > p:
|             mem.remove(k)
|
| correct:
|     if k not in mem:
|         mem += [k]
|     if k in mem:  # not the same as "else" here
|         if np.random.rand() > p:
|             mem.remove(k)
|
| correct:
|     if k not in mem:
|         mem += [k]
|     if np.random.rand() > p:
|         mem.remove(k)
| Alexanfa wrote:
| Ah, I'm using a set instead of list so I just always add
| and then toss remove.
| jtanderson wrote:
| Given the topic of the paper[0], the footnote is especially
| charming:
|
| > the authors decided to forgo the old convention of alphabetical
| ordering of authors in favor of a randomized ordering, denoted by
| r. The publicly verifiable record of the randomization is
| available at https://www.aeaweb.org/journals/policies/random-
| author-order...
|
| [0]: https://arxiv.org/pdf/2301.10191
|
| edit: formatting
| theelous3 wrote:
| > Imagine that you're sent to a pristine rainforest to carry out
| a wildlife census.
|
| alt-f4
| mring33621 wrote:
| Feels like reservoir sampling to me
| prerok wrote:
| While the algorithm is interesting and I commend the authors for
| it, it should be noted that this is a guesstimate
| only. So, it's not really an efficient way to count distinct
| values but an efficient way to get an approximate count of
| distinct values.
| unbalancedevh wrote:
| I wonder how the error changes if you start a new round just
| before finishing the stream, compared to if the last word in the
| stream just fills the buffer and ends a round.
| deadeye wrote:
| Am I correct in thinking the accuracy of this method has more to
| do with the distribution in the sample than the algo itself?
| Meaning, given a 100% unique list of items, how far off would
| this estimate be?
| MarkusQ wrote:
| It doesn't appear to be heavily dependent on the distribution.
| It mostly depends on how large your buffer is compared to the
| number of unique items in the list. If the buffer is the same
| size or larger, it would be exact. If it's half the size needed
| to hold all the unique items, you drop O(1 bit) of precision;
| at one quarter, O(2 bits), etc.
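| A quick way to sanity-check that is a small simulation. A rough
| sketch (hypothetical names, using the remove-then-reinsert form of
| the algorithm posted elsewhere in the thread) on a 100% unique
| stream with different buffer sizes:
|     import random
|
|     def cvm_estimate(stream, buffer_size):
|         # remove the element, re-insert it with probability p, and
|         # halve the kept sample whenever the buffer fills up
|         p, mem = 1.0, set()
|         for x in stream:
|             mem.discard(x)
|             if random.random() < p:
|                 mem.add(x)
|             if len(mem) >= buffer_size:
|                 mem = {m for m in mem if random.random() < 0.5}
|                 p /= 2
|         return len(mem) / p
|
|     stream = list(range(100_000))           # every item unique
|     for size in (200_000, 50_000, 12_500):
|         runs = [cvm_estimate(stream, size) for _ in range(20)]
|         print(size, sum(runs) / len(runs))  # exact when the buffer fits, noisier below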
| SoftTalker wrote:
| New interview question: Tell me how you would estimate the number
| of unique words in _Hamlet_ using only 100 elements.
| brianhorakh wrote:
| I found myself wondering if this approach could be applied to CFD
| (computational fluid dynamics) methods to reduce the total volume
| of data points while still getting approximately close to the
| correct final answer?
| xpe wrote:
| I have questions for Quanta Magazine -- this is probably a shot
| in the dark, but if anyone can ask them a few questions, here
| would be three of mine:
|
| 1. Have independent researchers evaluated the effectiveness of
| Quanta Magazine at their mission? [1]
|
| 2. In the case of this article, did the editors consider
| including visualization? If not, why not?
|
| 3. Does Quanta have interactive designers and/or data scientists
| on staff? How many? Why not more?
|
| ## Background & Biases
|
| I'm trying to overcome my disappointment in Quanta by being
| curious about their mission, organization, and constraints. I am
| glad they exist, but sometimes your "closest allies" can be one's
| harshest critics.
|
| I'm greatly troubled by the level of scientific, mathematical,
| and rational understanding among the denizens of the world. But
| for some reason, the state of science writing bothers me more. It
| would seem that I hold out hope that science writers could do
| better. This may be unfair, I admit.
|
| Anyhow, rather than just bash Quanta for being, say, not as good
| as the best math blogs or YouTube channels (such as 3Blue1Brown),
| I really want to figure out (1) if I'm missing something; (2)
| if they are actively trying to improve; and (3) what we can all
| learn from their experience.
|
| [1] From https://www.quantamagazine.org/about/ : "Quanta Magazine
| is an editorially independent online publication launched by the
| Simons Foundation in 2012 to enhance public understanding of
| science. Why Quanta? Albert Einstein called photons "quanta of
| light." Our goal is to "illuminate science.""
| fnord77 wrote:
| With other count-distinct algorithms, you can do unions and
| intersections on the "sets" (theta sketches, even bloom
| filters).
|
| Can you do this with CVM?
| burjui wrote:
| The algorithm uses less memory, but more CPU time because of
| rather frequent deletions, so it's a tradeoff, not just a generally
| better algorithm, as the article may suggest.
| empath-nirvana wrote:
| the list is small so the cost of deletions should be small.
| minkzilla wrote:
| There is a big practicality problem I see with this algorithm.
| The thresh defined in the paper relies on the length of the
| stream. It seems to me that in a scenario where you have a big
| enough data set to desire a solution that doesn't just store
| every unique value you don't know the length. I did not make it
| through all the proofs but I believe they use the fact that the
| defined threshold has the length in it to prove error bounds. If
| I were to use this in a scenario where I need to know error
| bounds I would probably ballpark the length of my stream to
| estimate error bounds and then use the algorithm with a ballpark
| threshold depending on my system's memory.
|
| Another practical thing is the "exception" if nothing is removed
| on line 6 in the original algorithm. This also seems needed for
| the proof but you would not want it in production, though the chance
| of hitting it should be vanishingly small so maybe worth the
| gamble?
|
| Here is my faithful interpretation of the algorithm. And then a
| re-interpretation with some "practical" improvements that almost
| certainly make the provability of the correctness impossible.
|     func CountUnique(scanner *bufio.Scanner, epsilon float64, delta float64, m int) int {
|         X := make(map[string]bool)
|         p := 1.0
|         thresh := int(math.Ceil((12 / (epsilon * epsilon)) * math.Log(8*float64(m)/delta)))
|         for scanner.Scan() {
|             a := scanner.Text()
|             delete(X, a)
|             if rand.Float64() < p {
|                 X[a] = true
|             }
|             if len(X) == thresh {
|                 for key := range X {
|                     if rand.Float64() < 0.5 {
|                         delete(X, key)
|                     }
|                 }
|                 p /= 2
|                 if len(X) == thresh {
|                     panic("Error")
|                 }
|             }
|         }
|         return int(float64(len(X)) / p)
|     }
|
|     func CountUnique2(scanner *bufio.Scanner, thresh int) int {
|         // threshold passed in, based on system memory / estimates
|         X := make(map[string]bool)
|         p := 1.0
|         for scanner.Scan() {
|             a := scanner.Text()
|             delete(X, a)
|             if rand.Float64() < p {
|                 X[a] = true
|             }
|             if len(X) >= thresh { // >= instead of == and remove the panic below
|                 for key := range X {
|                     if rand.Float64() < 0.5 {
|                         delete(X, key)
|                     }
|                 }
|                 p /= 2
|             }
|         }
|         return int(float64(len(X)) / p)
|     }
|
| I tested it with Shakespeare's work. The actual unique word count
| is 71,595. With the second algorithm it is interesting to play
| with the threshold. Here are some examples.
|
|     threshold   Mean Absolute Error   Root Mean Squared Error   Standard Deviation
|      1000       2150.44               2758.33                   2732.61
|      2000       1723.72               2212.74                   2199.39
|     10000        442.76                556.74                    555.53
|     50000        217.28                267.39                    262.84
| empath-nirvana wrote:
| If you don't know the length of the stream in advance, you can
| just calculate the margin of error when you're done, no?
| cb321 wrote:
| You just solve for epsilon. That's what I did:
| https://news.ycombinator.com/user?id=cb321
| zero_k wrote:
| I was involved with implementing the DNF volume counting version
| of this with the authors. You can see my blog post of it here:
|
| https://www.msoos.org/2023/09/pepin-our-probabilistic-approx...
|
| And the code here: https://github.com/meelgroup/pepin
|
| Often, 30% of the time is spent in the IO of reading the file; that's
| how incredibly fast this algorithm is. Crazy stuff.
|
| BTW, Knuth contributed to the algo. Knuth's notes:
| https://cs.stanford.edu/~knuth/papers/cvm-note.pdf
|
| He actually took time off (a whole month) from TAOCP to do this.
| Also, he is exactly as crazy good as you'd imagine. Just mind-
| blowing.
| Hnrobert42 wrote:
| That's really interesting and thanks for sharing.
|
| I am very curious about the extraordinarily gifted. What made
| you think Knuth is crazy good? Was there a particular moment?
| Was it how fast he grokked ideas? Was it his ability to ELI5?
| zero_k wrote:
| What made me realize is that I saw some snippets of emails he
| wrote to a colleague. It was... insane. You could see his
| mind race. He recognized patterns in minutes that would take
| me days, if not weeks, to recognize. Also, he actually writes
| and runs code, overnight if need be. It was a bit of a shock
| to me. He's not in an ivory tower. He's very much hands on,
| and when he's behind the wheel, you're in for a ride.
| PartiallyTyped wrote:
| I feel extremely jealous of you.
| almostgotcaught wrote:
| > He recognized patterns in minutes that would take me
| days, if not weeks, to recognize... he actually writes and
| runs code, overnight if need be
|
| 70-80 years of actually being hands-on and i bet you'd be
| pretty quick too. dude is definitely naturally "gifted" but
| it seems pretty obvious being hands-on has a lot to do with
| it.
| gspetr wrote:
| Experience and age have diminishing returns.
|
| Biden's been hands-on in his domain for over 50 years,
| yet "quick" is definitely not the word that comes to most
| people's mind when they think of him nowadays.
| jedberg wrote:
| Actually quick is definitely something that comes to
| mind. Quick in politics is of course relative, but the
| speed with which he has enacted major changes (for
| example marijuana legalization) is pretty quick in the
| realm of politics when congress is of the other party.
| devnonymous wrote:
| > Also, he actually writes and runs code, overnight if need
| be...He's not in an ivory tower. He's very much hands on,
| and when he's behind the wheel, you're in for a ride.
|
| That's such an admirable thing. Something to aspire to,
| given his age. I wonder if one can say the same thing of
| all the 'thought leaders' of the industry who go around
| pontificating about code hygiene and tidiness.
| jandrese wrote:
| So now we have you to blame for a delay on the release of his
| next book. :)
| zero_k wrote:
| Not me, the authors -- I'm a fly on the wall compared to them
| ;) This is some serious work, I just did a fast
| implementation. Re implementation -- it turns out that there
| are parts of some standard libraries that this problem pushes
| to their limits, which we had to work around during
| implementation. So there were still some cool challenges
| involved. I was also pretty happy about the late binding/lazy
| evaluation thing I came up with. Of course Knuth just did it
| (check his notes), without even thinking about it :D What is
| an achievement for me is a lazy Monday coffee for him, but oh
| well!
| sesteel wrote:
| Maybe you'd know, but why would one choose to not sort favoring
| larger counts and drop the bottom half when full? It may be
| obvious to others, but I'd be curious.
| zero_k wrote:
| The guarantees would not hold, I'm pretty sure ;) Maybe one
| of the authors could chip in, but my hunch is that with that
| you could actually introduce arbitrarily large errors. The
| beauty of this algorithm really is its simplicity. Of course,
| simple is.. not always easy. This absolute masterpiece by
| Knuth should demonstrate this quite well:
|
| https://www.sciencedirect.com/science/article/pii/0022000078.
| ..
|
| It's an absolutely trivial algorithm. Its average-case
| analysis is ridiculously hard. Hence why I think this whole
| Ordo obsession needs to be refined -- worst case complexity
| has often little to do with real-world behavior.
| pierrebai wrote:
| IDK if my explanation is correct, but I do believe it is. It
| goes as follows.
|
| Imagine that you have a container of potentially limitless
| capacity. The container starts with a small capacity, equal to the
| real limited capacity that the real algorithm uses. As you add
| elements, when the container is full, its capacity is doubled,
| but all elements are then placed in a random position.
|
| When you're done, you're told the occupancy of the subset of the
| large container corresponding to the initial size and how many
| times the container size was doubled. Multiplying that occupancy
| by two to the power of the number of doublings gives you an
| approximation of the real size.
|
| The small catch is that in the actual algorithm, due to the
| discarding, the final number of elements, the occupancy, is
| somewhat erroneous.
|
| EDIT
|
| Another way to say this: you got a container of limited capacity
| S. When full, you "virtually" double its size and then randomly
| move elements over the full "imaginary" size of the virtual
| container. So after the first filling, you end up with about 1/2
| the elements. After the second filling 1/4, etc. Also, since now
| your "virtual" container is larger, when you add a new element,
| there is only a 1/2^n chance it will be placed inside your
| limited-capacity view of the entire virtual container.
|
| At the end, the approximate real count is the number of elements
| you got multiplied by 2 to the power of the number of size
| doublings.
|
| Again, it is as if you have a small window into a limitless
| container.
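| A tiny sketch of that intuition (hypothetical names, not code from
| the paper): keep each distinct element with probability 2^-r, scale
| the survivors back up by 2^r, and the average lands on the true
| count even though only a small window of elements is ever kept.
|     import random
|
|     def windowed_estimate(n_unique, doublings):
|         # keep each distinct element with probability 2**-doublings,
|         # then scale the surviving count back up
|         p = 2.0 ** -doublings
|         kept = sum(1 for _ in range(n_unique) if random.random() < p)
|         return kept / p
|
|     # on average only ~50000/64 ~ 780 elements survive,
|     # yet the mean estimate is close to 50000
|     runs = [windowed_estimate(50_000, 6) for _ in range(200)]
|     print(sum(runs) / len(runs))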
| OutOfHere wrote:
| The title is misleading. This is about probabilistic counting,
| therefore about estimation. It is not clear if the efficiency
| benefit extends to exact counting. Counting and estimation are
| not the same concepts.
| orf wrote:
| Obviously it wouldn't
| Koshkin wrote:
| This is about estimation.
|
| https://news.ycombinator.com/item?id=40394015
| u8 wrote:
| I took a crack at implementing this in Go. For anyone curious I
| settled for algorithm 2 as I can just use a map as the base set
| structure.
|
| https://github.com/tristanisham/f0
___________________________________________________________________
(page generated 2024-05-17 23:01 UTC)