[HN Gopher] Netflix's Distributed Counter Abstraction
___________________________________________________________________
Netflix's Distributed Counter Abstraction
Author : benocodes
Score : 35 points
Date : 2024-11-13 19:31 UTC (3 hours ago)
(HTM) web link (netflixtechblog.com)
(TXT) w3m dump (netflixtechblog.com)
| mannyv wrote:
| I wonder how they're going to go about purging all the counters
| that end up unused once the employee and/or team leaves?
|
| I can see someone setting up a huge number of counters then
| leaving...and in a hundred years their counters are taking up TB
| of space and thousands of requests-per-second.
| singron wrote:
| There is a retention policy, so the raw events aren't kept very
| long. The rollups probably compress really well in their time
| series database, which I'm guessing also has a retention
| policy.
|
| If you have high cardinality metrics, it can still be really
| painful, although I think you will feel the pain initially and
| it won't take years. Usually these systems have a way to
| inspect what metrics or counters are using the most resources
| and then they can be reduced or purged from time to time.
| rshrin wrote:
| Yes, once the events are aggregated (and optionally moved to
| a cost-effective storage for audits), we don't need them
| anymore in the primary storage. You can check the retention
| section in the article. The rollups themselves can have TTL
| if the users wish to set that on a namespace. Although doing
| that, they have to be fine with certain timing issues on when
| the rollups expire and new events are aggregated.
| ilrwbwrkhv wrote:
| Looks a bit overengineered due to Netflix's own microservices
| nonsense.
|
| I would be more interested in how a higher traffic video company
| like Pornhub handles things like this.
| oreoftw wrote:
| How would you design it to support mentioned use cases?
| klaussilveira wrote:
| HyperLogLog and PostgreSQL:
| https://github.com/citusdata/postgresql-hll
|
| Or even simpler, Roaring Bitmaps:
| https://pncnmnp.github.io/blogs/roaring-bitmaps.html
|
| https://blog.quastor.org/p/grab-rate-limiting
|
| https://github.com/RoaringBitmap/CRoaring
| philjohn wrote:
| HyperLogLog doesn't support exact counters though, does it?
| That seems to be one of the core requirements of the queue-
| based solution.
| rshrin wrote:
| HyperLogLog (or even Count-Min Sketch) will not support
| some of the requirements of even the Best-Effort counter
| (clearing counts for specific keys, decrementing counts
| by any arbitrary number, having a TTL on counts etc.).
| For Accurate counters, we are trying to solve for multi-
| region read/write availability at low single-digit
| millisecond latency at cheap costs, using the
| infrastructure we already operate and deploy. There are
| also other requirements such as tracking the provenance
| of increments, which play a part.
| philjohn wrote:
| Have you worked with distributed counters before? It's a hard
| problem to solve. Typical tradeoffs are lower cardinality for
| exact counters.
|
| The queue solution is pretty elegant.
| Alupis wrote:
| > Netflix's own microservices nonsense
|
| How many times has Netflix been entirely down over the years?
|
| Seems it's not "nonesense".
| vlovich123 wrote:
| It's a bit weird to not compare this to HyperLogLog & similar
| techniques that are designed to solve exactly this problem but
| much more cheaply (at least as far as I understand).
| zug_zug wrote:
| I came here to write the same thing. Getting an estimate
| accurate for at least 5 digits on all netflix video watches
| worldwide can all be done with intelligent sampling (like
| hyperloglog) and likely one macbook air as the backend. And
| aside from the compute save the complexity and implementation
| time would be much lower too.
| rshrin wrote:
| Fwiw, we didn't mention any probabilistic data structures
| because they don't satisfy some of the basic requirements we
| had for the Best-Effort counter. HyperLogLog is designed for
| cardinality estimation, not for incrementing or decrementing
| specific counts (which in our case could be any arbitrary
| +ve/-ve number per key). AFAIK, both Count-Min Sketch and
| HyperLogLog do not support clearing counts for specific keys.
| I believe Count-Min Sketch cannot support decrement as well.
| The core EvCache solution for the Best-Effort counter is like
| 5 lines of code. And EvCache can handle millions of
| operations/second relatively cheaply.
| millipede wrote:
| > EVCache
|
| EVCache is a disaster. The code base has no concept of a
| threading model. The code is almost completely untested* too. I
| was on call at least 2 time when EVcache blew up on us. I tried
| root causing it and the code is a rats nest. Avoid!
|
| * https://github.com/Netflix/EVCache
| dopamean wrote:
| Why would netflix put their blog on medium?
___________________________________________________________________
(page generated 2024-11-13 23:00 UTC)