[HN Gopher] Implementing "seen by" functionality with Postgres
___________________________________________________________________
Implementing "seen by" functionality with Postgres
Author : kiwicopple
Score : 143 points
Date : 2022-07-18 14:27 UTC (8 hours ago)
(HTM) web link (supabase.com)
(TXT) w3m dump (supabase.com)
| js4ever wrote:
| About 500 rps max? That seems pretty low; I can easily do
| something like 25k rps with 1 CPU using ClickHouse for the same
| scenario. Am I missing something?
| fermentation wrote:
| This is a bit of a weird post. The author sets up benchmarks,
| shows that hstore appears to work best, then suggests using
| HyperLogLog despite it performing worse. The stated reason is
| that it scales better, but the author didn't really discuss how
| HLL works, so I'm not sure why it scales better other than that
| it has a cool name.
| asperous wrote:
| I agree; why not do the work to show that it scales better? It
| violates scientific integrity to have a hypothesis, run the
| experiment, find that your hypothesis was wrong, and then say it
| is probably right regardless.
| hardwaresofton wrote:
| Looks like I could have been much clearer, apologies!
|
| The main idea behind picking HLL was that it was a good
| compromise: fast to operate on, close to the forget-nothing
| nature of the assoc table, and it avoids the
| should-you-really-do-that nature of storing possibly hundreds of
| thousands or millions of numeric kvs in an hstore column alone.
|
| There's definitely a whole 'nother post in there on pushing it
| to its limits, but this one was already long enough, I thought!
| kiwicopple wrote:
| If you like to jump to the comments before reading the post (like
| me), this is an exploration of a fairly common situation:
| calculating the total number of people who have "viewed/seen" a
| blog post.
|
| tldr: use HyperLogLog [0] if you can install extensions on your
| Postgres database, or hstore [1] if you can't.
|
| HyperLogLog is a probabilistic data structure which can
| "estimate" the number of viewers, so it scales very well for
| Twitter/FB-scale platforms.
|
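| A minimal sketch of both options (assuming a simple posts table;
| the article's actual schema may differ):
|
|     -- hstore: one key per viewer's user id (exact, but rows grow)
|     CREATE EXTENSION IF NOT EXISTS hstore;
|     ALTER TABLE posts ADD COLUMN seen_by hstore DEFAULT ''::hstore;
|     UPDATE posts SET seen_by = seen_by || hstore('42', '1')
|       WHERE id = 1;
|     SELECT coalesce(array_length(akeys(seen_by), 1), 0) AS seen
|       FROM posts WHERE id = 1;
|
|     -- HLL: fixed-size probabilistic counter (approximate, tiny)
|     CREATE EXTENSION IF NOT EXISTS hll;
|     ALTER TABLE posts ADD COLUMN seen_hll hll DEFAULT hll_empty();
|     UPDATE posts SET seen_hll = hll_add(seen_hll, hll_hash_integer(42))
|       WHERE id = 1;
|     SELECT hll_cardinality(seen_hll) AS seen
|       FROM posts WHERE id = 1;
|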
| [0] HyperLogLog: https://en.wikipedia.org/wiki/HyperLogLog
|
| [1] hstore: https://www.postgresql.org/docs/current/hstore.html
| WayToDoor wrote:
| So how does the HyperLogLog work exactly? It seems like the
| solution of choice, but I'd like to know more about the
| internals.
| kiwicopple wrote:
| I don't have a good enough grasp of HLL to summarize it
| accurately, so for convenience here is the original paper [0],
| and hopefully HN comes through with a more digestible
| explanation.
|
| The blog post uses Citus' Postgres extension [1], which uses an
| augmented variant of the original algorithm to reduce the size
| of the data structure so that it is fixed-size.
|
| [0] HLL paper:
| http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
|
| [1] postgresql-hll: https://github.com/citusdata/postgresql-hll
| periodontal wrote:
| Basically, hash all your data items, keep track of the maximum
| number of leading 0 bits seen in a hash so far.
|
| This is a decent proxy (though not error-free) for how many
| distinct items you've seen so far, since large numbers of
| leading zeros should be uncommon. Finding one means you probably
| saw a lot of other things to get to that point (intuitively,
| about N leading zeros are expected after seeing 2^N items).
|
| This is actually the same thing that a proof-of-work
| cryptocurrency does to control difficulty: change the target
| number of leading zeros so miners have to do more/less work to
| find a match.
|
| Of course, you could get "lucky" with a single counter and the
| resolution isn't great, so HLL first separates the values into
| buckets which are estimated separately and then combined to
| give a more robust estimate.
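|
| A toy, single-bucket version of that intuition can even be
| written in plain SQL with the built-in hashtext() (just an
| illustration, not the real extension):
|
|     -- Hash each item to 32 unsigned bits, track the max count of
|     -- leading zero bits, and estimate distinct items as roughly
|     -- 2^max_zeros.
|     SELECT 2 ^ max(31 - floor(log(2,
|              greatest(hashtext(email)::bigint & 4294967295, 1))))
|            AS rough_distinct_estimate
|     FROM unnest(ARRAY['a@example.com', 'b@example.com',
|                       'a@example.com']) AS t(email);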
| [deleted]
| hardwaresofton wrote:
| Hey, I wrote the article --- the way I think of HyperLogLog and
| similar (but different) approaches like Bloom filters and count-
| min sketches is that the key insight is hashing.
|
| The fact that we can reduce some piece of data to a hash, then
| work on the distributions of data/entropy of those hashes in
| number space in many different ways, is what makes these data
| structures work.
|
| I think of it this way --- if you have a hashing algo that
| turns "someone@example.com" into "7395718184927", you can
| really easily count with relative certainty how often you see
| that email address by keeping a sparse numeric associative
| array, since getting the exact same value again produces the
| same hash. Doing just that isn't SUPER useful, because you get
| strict cardinality (same as just checking equality with the
| other keys). But if you choose a weak enough, fast hashing
| function (or multiple, in the case of a count-min sketch), some
| values will collide while others don't -- which means there
| will be some amount of error -- and you can control that to
| suit your needs, on a scale from everything-in-one-bucket to
| exact-cardinality numbers of buckets.
|
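| As a contrived SQL illustration of that trade-off (assuming a
| hypothetical events(email) table): with only 16 buckets,
| distinct emails will collide, which is exactly the
| accuracy-for-space trade.
|
|     -- Shrinking the hash space to 16 buckets: collisions inflate
|     -- counts, but the structure stays tiny no matter how many
|     -- distinct emails exist.
|     SELECT (hashtext(email)::bigint & 4294967295) % 16 AS bucket,
|            count(*) AS approx_hits
|     FROM events
|     GROUP BY bucket
|     ORDER BY bucket;
|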
| Here's an actual proper guide (CS lectures, as you might
| expect!) I found after a quick search, which you might find
| useful:
|
| https://williams-cs.github.io/cs358-f21/lectures/lecture10/l...
|
| (IMO focus on the algorithm description)
|
| And here's some old HN discussion:
|
| https://news.ycombinator.com/item?id=7508658
|
| To really try and understand the concept I found it useful a
| while back to actually try and build it:
|
| https://vadosware.io/post/countmin-sketch-in-haskell/
|
| http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf (This
| paper has a really good visual depiction)
|
| I find the count-min sketch to be much more approachable
| [deleted]
| mongrelion wrote:
| Sidetracking a bit: does anybody know what tool was used to
| generate the table diagrams?
| kiwicopple wrote:
| These ones are custom (we use figma)
| junon wrote:
| If it's a tool I'd love to know, too. However it very well
| might just be a hand-crafted vector.
| proto-n wrote:
| If I understand correctly, these numbers are the result of users
| viewing posts uniformly at random? While the post is a nice
| overview of options, I doubt the numbers generalize to the much
| more realistic scenario of a power-law distribution of view
| counts across posts.
| hardwaresofton wrote:
| I thought of that! I cover it very crudely and the generation
| code isn't as nice as I would have liked, but I did try to pick
| a certain number of posts to receive a disproportionate amount
| of views (and also made sure posts were not generated equally by
| everyone).
|
| I'd love to have some feedback on the attached repo!
| ilrwbwrkhv wrote:
| The ending was weird. After all that test setup and data
| crunching it became a bit of "I feel this is better".
| maxloh wrote:
| The mentioned Postgres extension, citus/postgresql-hll, hasn't
| been updated for about a year.
|
| Is it really reliable to use it in production?
| frankgrecojr wrote:
| Have there been issues that have gone unaddressed?
| cldellow wrote:
| When we used it (granted, years ago), it would crash
| occasionally.
|
| Looking at GitHub, there's at least one open issue around
| crashes based on how large of an HLL you use.
| SigmundA wrote:
| Can anyone else see the source? I am getting a 404 and would
| like to see the SQL statements that implement these tests.
|
| >not to mention a lot of concurrent database updates (which isn't
| necessarily Postgres's forte to begin with)
|
| Isn't PG a multithreaded db server whose default row store is
| tuned for OLTP with row-level locking rather than OLAP? I could
| see this being said for SQLite but not PG.
| cldellow wrote:
| I can't see the source either.
|
| Re the concurrent updates - yes, Postgres is multi-process.
| However, the hstore and HLL approaches are going to have a lot
| of contention at the row level. The association table wouldn't
| have that contention, since each user would get their own row.
|
| The hstore/HLL approaches _could_ be sharded on a hash of the
| user's ID. In fact, this would work really well for HLL since I
| think you can union HLL structures quickly. But I don't think
| any of this was addressed in the article...
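|
| For illustration, a sketch of that union property using
| postgresql-hll's aggregate (the sharded table here is
| hypothetical):
|
|     -- One hll per (post_id, shard) row; shards can be updated
|     -- independently and unioned losslessly at read time.
|     SELECT post_id,
|            hll_cardinality(hll_union_agg(seen_by)) AS approx_viewers
|     FROM post_seen_shards
|     GROUP BY post_id;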
| SigmundA wrote:
| >Re the concurrent updates - yes, Postgres is multi-process.
| However, the hstore and HLL approaches are going to have a
| lot of contention at the row level. The association table
| wouldn't have that contention, since each user would get
| their own row.
|
| Each process is an OS-level thread. Again, saying that
| concurrent updates are not PG's forte seems pretty ridiculous
| just because some of your approaches use in-row storage rather
| than a typical relational design. The statement seems
| unnecessary and may give someone with less experience the wrong
| idea about PG's concurrency abilities.
| mdoms wrote:
| bastawhiz wrote:
| A way to do this for read-heavy applications is to use the
| separate table to store who has viewed the post, but then use a
| trigger to update a seen-count column. When you write a (unique)
| row to the table, the trigger counts all the views and updates
| the posts table. Reads then incur no extra overhead; only views
| do. You can queue/process the views at your leisure, or pause
| them altogether as a circuit breaker.
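|
| A minimal sketch of that trigger (table and column names are
| hypothetical, not from the article):
|
|     -- Assoc table of unique views; the trigger denormalizes a
|     -- count onto posts.
|     CREATE TABLE post_seen_by (
|       post_id bigint NOT NULL,
|       user_id bigint NOT NULL,
|       PRIMARY KEY (post_id, user_id)
|     );
|
|     CREATE FUNCTION bump_seen_count() RETURNS trigger AS $$
|     BEGIN
|       -- Rows are unique per (post, user), so every insert is one
|       -- new viewer.
|       UPDATE posts SET seen_count = seen_count + 1
|         WHERE id = NEW.post_id;
|       RETURN NEW;
|     END;
|     $$ LANGUAGE plpgsql;
|
|     CREATE TRIGGER post_seen_by_count
|       AFTER INSERT ON post_seen_by
|       FOR EACH ROW EXECUTE FUNCTION bump_seen_count();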
| hardwaresofton wrote:
| This is a great tip, thanks for sharing -- reducing read
| overhead to a minimum and even delaying the processing is
| fantastic.
|
| This would have been great to include as a sort of improved
| assoc-table approach.
| hyzyla wrote:
| I expected to find that solution in the article. Or, if someone
| doesn't like triggers, you can just insert a row into the assoc
| table; if it fails with a duplication error, do nothing,
| otherwise increase the counter.
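|
| Something like this (a sketch, reusing the hypothetical tables
| from the trigger example above):
|
|     -- Insert-if-new; bump the counter only when a row was
|     -- actually inserted.
|     WITH ins AS (
|       INSERT INTO post_seen_by (post_id, user_id)
|       VALUES (1, 42)
|       ON CONFLICT DO NOTHING
|       RETURNING 1
|     )
|     UPDATE posts SET seen_count = seen_count + 1
|     WHERE id = 1 AND EXISTS (SELECT 1 FROM ins);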
| stareatgoats wrote:
| I have to admit this is over my head. But sometimes being stupid
| is also what works, and I've always abided by the old adage that
| "it's better to have a tractor that works than a limo that you
| can't get out of the garage". I realize that in this case maybe
| the problem was just a pretext for demonstrating HyperLogLog,
| but in case someone is looking for a simple solution to this
| very problem, maybe this naive approach could work (use at your
| own peril!):
|
| So I'd create an associated table that records the user, time,
| and object viewed, plus a field with the rolled-up count on the
| object record. The problem with duplicate views is solved by
| wrapping the insert and the upping of the counter in the same
| transaction: if the insert fails because it breaches the unique
| index, then the upping fails too.
|
| I probably missed something?
| hardwaresofton wrote:
| Hey, sorry the post wasn't better written!
|
| I tried to cover all the basic or obvious approaches (counter,
| assoc table) so that people could see the complexity ramp up.
|
| You're right that an assoc table is a very workable solution,
| especially along with the partitioning capabilities of Postgres
| these days.
| davidw wrote:
| Unless I knew there was going to be a huge boatload of data,
| I'd go for an association table too. That's the cleanest thing
| in terms of the DB.
| cldellow wrote:
| I felt like I didn't understand the blog post very well,
| either.
|
| It spends a lot of time building a benchmark framework and
| measuring things. At the end, hstore is the choice supported by
| the data. But instead the author writes:
|
| > If we go strictly with the data, the best way looks to be the
| hstore-powered solution, but I think the HLL is probably the
| right choice.
|
| ?!? Why bother defining a framework for measurement if you're
| just going to substitute your own judgment? Perhaps the
| framework didn't capture something important -- like the
| absolute number of people looking at any given post, concurrent
| updates, etc.
|
| I'm also confused by the benchmark. The autocannon script has:
|
|     // 10% of the time, get *all* the users
|     request.method = 'GET'
|     request.path = `/posts/${postId}/seen-by/users`
|
| Does this imply there's a path it can hit to get the list of
| users who saw a post? HLL is a probabilistic data structure, so
| it doesn't support that with 100% accuracy. You could reverse
| engineer it from the list of users, but the performance will
| vary with the number of users, which didn't seem to be captured
| by the benchmark. I tried to look at the source code, but the
| GitLab URL was 404ing for me.
| dvlsg wrote:
| I don't think I'm getting 404s, but the repo's gitlab url
| does keep bouncing me to a login page.
|
| > You need to sign in or sign up before continuing.
|
| Has gitlab always enforced that, even for public repos? Or is
| the linked repo not public?
| jakear wrote:
| The linked repo isn't public; you can check through the
| author's public repos on GitLab.
| derefr wrote:
| > ?!? Why bother defining a framework for measurement if
| you're just going to substitute your own judgment? Perhaps
| the framework didn't capture something important -- like the
| absolute number of people looking at any given post,
| concurrent updates, etc.
|
| Just guessing: the benchmark tells you the time complexity. It
| doesn't tell you the space complexity. The author is optimizing
| between the time and space complexity of the solutions, with
| the time-complexity benchmarks as one input. (Also, at a scale
| the benchmark doesn't reach, space complexity starts to affect
| time complexity, as large datasets become less able to stay hot
| in the disk cache.)
| hardwaresofton wrote:
| Yup this is it! You've said it much better than I did,
| thank you.
| nnnnnande wrote:
| Yeah, this is probably the reason and the author even
| elaborates on this in the sentences following the bit
| quoted by cldellow:
|
| > Even though the data says hstore, knowing that posts will
| be seen by more and more people over time, I might choose
| the HLL solution for an actual implementation. It's far
| less likely to pose a bloated row problem, [...]
| hardwaresofton wrote:
| Just posting here too, but yup this is exactly what I was
| trying to convey.
|
| Hstore might have been the fastest, but the way I am using it
| might not work out at the scales this use case could grow to.
|
| Bigger benchmarks could have been done! Maybe a multi-part post
| would have been better, so I could split the methodology from
| the results/blabbering about the approach!
| derefr wrote:
| If the goal is to count distinct users who have seen something,
| but _also_ to efficiently retrieve the actual set of users who
| have seen something -- then why wasn't a roaring bitmap
| (https://pgxn.org/dist/pg_roaringbitmap/) considered as an
| option? Presuming user IDs are non-sparse nonnegative integers
| (as SERIAL/BIGSERIAL pkeys usually are), a roaring bitmap
| should do everything hstore does, just a lot more efficiently.
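|
| A sketch of what that might look like, assuming the
| pg_roaringbitmap extension's rb_* functions and integer user
| ids:
|
|     CREATE EXTENSION roaringbitmap;
|     ALTER TABLE posts ADD COLUMN seen_by roaringbitmap
|       DEFAULT rb_build('{}'::int[]);
|
|     -- Record that user 42 saw post 1.
|     UPDATE posts SET seen_by = rb_add(seen_by, 42) WHERE id = 1;
|
|     -- Exact distinct count, plus the full membership list.
|     SELECT rb_cardinality(seen_by) AS seen_count,
|            rb_to_array(seen_by) AS user_ids
|     FROM posts WHERE id = 1;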
| djKianoosh wrote:
| The docs/paper for roaring bitmaps say they're to be used for
| large datasets, but I must have missed where "large" is defined.
| In practice, how large does a set have to be to take advantage
| of roaring bitmaps?
| derefr wrote:
| The essential consideration is that roaring bitmaps are
| _compressed_ -- trading increased CPU overhead per read/write
| for lower size-on-disk => more rows fetched per IOP + better
| cache coherency. So think of the point where the CPU overhead
| of compression/decompression on every read/write becomes worth
| those gains.
|
| Most developers make this kind of judgement every day: when
| deciding whether to e.g. gzip a file before SCPing it
| somewhere; or whether to apply e.g. lz4 to the data being
| written to a KV store like Redis. IMHO that same intuition
| applies here.
|
| See also: the heuristic Redis uses to decide whether to back a
| set with a hashtable vs. a sorted list, which sits around 512
| elements: https://redis.com/ebook/part-2-core-
| concepts/01chapter-9-red...
|
| ---
|
| Mind you, the compression in roaring bitmaps is very
| efficient, has extremely low start-up costs, and is
| effectively "random access" rather than streaming; these low
| costs are why roaring bitmaps are preferred over other
| compressed-bitmap approaches.
|
| And also, it's not like, at small data sizes, even a "large"
| (hundreds/thousands of ops) CPU penalty would really be _much_
| of a penalty, either -- because, with small data, you wouldn't
| be paying it very often! And within a "heavyweight" RDBMS like
| Postgres, which does a bunch of things when you make a request
| to it, these sorts of small CPU costs usually get lost entirely
| under the static overhead of query planning. (Postgres takes
| advantage of this property in a lot of other ways as well.)
|
| It's only when you're trying to do something _at scale_, but
| where dataset size _isn't_ the scaling factor (example: running
| a core DNS server -- trillions of queries, but against only a
| small amount of data), that a roaring bitmap would be an
| explicit "wrong choice" vs. a sorted-list or b-tree or
| hashtable-backed presence set.
| 5e92cb50239222b wrote:
| > deciding whether to e.g. gzip a file before SCPing it
| somewhere
|
| Off topic, but to not waste time on such trivia in the future,
| add
|
|     Compression yes
|
| to your ~/.ssh/config.
| derefr wrote:
| Not a good idea if you're frequently sending things that
| are _already_ compressed, though. Which many formats
| today are.
|
| (Also, I was using `scp` as a general abstraction;
| personally, I'm not often pushing things around between
| VMs directly, but rather between VMs and object storage;
| where part of the question is not just about the
| transport encoding, but rather whether it's worth it to
| have the data be compressed-at-rest in the object store.
| Usually depends on how many times you're going to be
| downloading it.)
| hardwaresofton wrote:
| Author here -- a roaring bitmap could absolutely have been used!
|
| The concept is similar and it absolutely should have been
| mentioned.
|
| > Presuming user IDs are non-sparse nonnegative integers (as
| SERIAL/BIGSERIAL pkeys usually are),
|
| I personally wanted to build a method that easily worked for
| textual content, but with that constraint (which is valid, and
| what I ended up doing in practice) a roaring bitmap definitely
| works!
|
| I want to give this a try in the future.
| jakswa wrote:
| I just wanted to say thanks for linking. I feel like I'm always
| on the search for new PG storage extensions, even if I'm not
| able to use them everywhere (sad to see AWS RDS wasn't in the
| list of vendors that include/support it).
___________________________________________________________________
(page generated 2022-07-18 23:00 UTC)