[HN Gopher] Monarch: Google's Planet-Scale In-Memory Time Series...
___________________________________________________________________
Monarch: Google's Planet-Scale In-Memory Time Series Database
Author : mlerner
Score : 173 points
Date : 2022-05-14 16:12 UTC (6 hours ago)
(HTM) web link (www.micahlerner.com)
(TXT) w3m dump (www.micahlerner.com)
| yegle wrote:
| Google Cloud Monitoring's time series database is backed by
| Monarch.
|
| The query language is MQL, which closely resembles the internal
| Python-based query language:
| https://cloud.google.com/monitoring/mql
| sleepydog wrote:
| MQL is an improvement over the internal language, IMO. There
| are some missing features around literal tables, but otherwise
| the language is more consistent and flexible.
| holly76 wrote:
| candiddevmike wrote:
| Interesting that Google replaced a pull-based metric system
| similar to Prometheus with a push-based system... I thought one
| of the selling points of Prometheus and the pull-based dance was
| how scalable it was?
| lokar wrote:
| It's sort of a pull/push hybrid. The client connects to the
| collection system and is told how often to send each metric (or
| group of them) back over that same connection. You configure
| per target/metric collection policy centrally.
| jeffbee wrote:
| Prometheus itself has no scalability at all. Without
| distributed evaluation it hits a brick wall.
| gttalbot wrote:
| This. Any new large query or aggregation in the
| Borgmon/Prometheus model requires re-solving federation and
| ongoing maintenance of runtime configuration. That might
| technically be scalable in that you could do it, but you have
| to maintain it and pay the labor cost. It's not practical
| over a certain size or system complexity. It's also friction:
| you can only do the queries you can afford to set up.
|
| That's why Google spent all that money to build Monarch. At
| the end of the day Monarch is vastly cheaper in person time
| and resources than manually-configured Borgmon/Prometheus.
| And there is much less friction in trying new queries, etc.
| halfmatthalfcat wrote:
| Can you elaborate? I've run Prometheus at some scale and it's
| performed fine.
| lokar wrote:
| You pretty quickly exceed what one instance can handle for
| memory, CPU, or both. At that point you don't have any really
| good options to scale while maintaining a flat namespace
| (you need to partition).
| preseinger wrote:
| Sure? Prometheus scales with a federation model, not a
| single flat namespace.
| gttalbot wrote:
| This means each new query over a certain size becomes a
| federation problem, so the friction for trying new things
| becomes very high above the scale of a single instance.
|
| Monitoring as a service has a lot of advantages.
| preseinger wrote:
| Well you obviously don't issue metrics queries over
| arbitrarily large datasets, right? The Prometheus
| architecture reflects this invariant. You constrain
| queries against both time and domain boundaries.
| gttalbot wrote:
| Monarch can support both ad-hoc and periodic, standing
| queries of arbitrarily large size, and has the means to
| spread the computation out over many intermediate mixer
| and leaf nodes. It does query push-down so that the
| "expensive" parts of aggregations, joins, etc., can be
| done in massively parallel fashion at the leaf level.
|
| It scales so well that many aggregations are set up and
| computed for every service across the whole company (CPU,
| memory usage, error rates, etc.). For basic monitoring
| you can run a new service in production and go and look
| at a basic dashboard for it without doing anything else
| to set up monitoring.
| gttalbot wrote:
| Also, very large ad-hoc queries are supported, with
| really good user isolation, so that (in general) users
| don't harm each other.
| preseinger wrote:
| Prometheus is highly scalable?? What are you talking about??
| dijit wrote:
| It is not.
|
| It basically does the opposite of what every scalable
| system does.
|
| To get HA you double your number of pollers.
|
| To scale your queries you aggregate them into other
| prometheii.
|
| If this is scalability: everything is scalable.
| preseinger wrote:
| I don't understand how the properties you're describing
| imply that Prometheus isn't scalable.
|
| High Availability always requires duplication of effort.
| Scaling queries always requires sharding and aggregation
| at some level.
|
| I've deployed stock Prometheus at global scale, O(100k)
| targets, with great success. You have to understand and
| buy into Prometheus' architectural model, of course.
| dijit wrote:
| The ways in which you can scale Prometheus are ways in which
| you can scale anything.
|
| It does not, itself, have highly scalable properties
| built in.
|
| It does not do sharding, it does not do proxying, it does
| not do batching, it does not do anything that would allow
| it to run multiple servers and query over multiple
| servers.
|
| Look, I'm not saying that it doesn't work; but when I
| read about Borgmon and Prometheus, I understood the
| design goal was intentionally not to solve these hard
| problems, and instead to use them as primitive time series
| systems that can be deployed with a small footprint
| basically everywhere (and individually queried).
|
| I submit to you, I could also have an InfluxDB on every
| server and get the same "scalability".
|
| The difference being that I can actually run a huge InfluxDB
| cluster with a dataset that exceeds the capabilities of a
| single machine.
| preseinger wrote:
| It seems like you're asserting a very specific definition
| of scalability that excludes Prometheus' scalability
| model. Scalability is an abstract property of a system
| that can be achieved in many different ways. It doesn't
| require any specific model of sharding, batching, query
| replication, etc. Do you not agree?
| dijit wrote:
| You need to go back to computer science class.
|
| I'm not defining terms.
|
| https://en.wikipedia.org/wiki/Database_scalability
|
| Scalability means running a single workload across
| multiple machines.
|
| Prometheus intentionally does not scale this way.
|
| I'm not being mean, it is fact.
|
| It has made engineering design trade-offs, and one of
| those means it is not built to scale. This is fine; I'm
| not here pooping on your baby.
|
| You can build scalable systems on top of things which do
| not individually scale.
| preseinger wrote:
| Scalability isn't a well-defined term, and Prometheus
| isn't a database. :shrug:
| dijit wrote:
| Wrong on both counts
|
| Sorry for being rude, but this level of ignorance is
| extremely frustrating.
| jeffbee wrote:
| Prometheus cannot evaluate a query over time series that
| do not fit in the memory of a single node, therefore it
| is not scalable.
|
| The fact that it could theoretically ingest an infinite
| amount of data that it cannot thereafter query is not
| very interesting.
| preseinger wrote:
| It can? It just partitions the query over multiple nodes?
| lokar wrote:
| Lots of systems provide redundancy with 2X cost. It's not
| that hard.
| dilyevsky wrote:
| You can set up distributed evaluation similar to how it was
| done in Borgmon, but you have to do it manually (or maybe
| write an operator to automate it). One of Monarch's core
| ideas is to do that behind the scenes for you.
| jeffbee wrote:
| Prometheus' own docs say that distributed evaluation is
| "deemed infeasible".
| dilyevsky wrote:
| That's just, like... you know... their opinion, man.
| https://prometheus.io/docs/prometheus/latest/federation/
| bpicolo wrote:
| Prometheus federation isn't distributed evaluation. It
| "federates" from other nodes onto a single node.
|
| > Federation allows a Prometheus server to scrape
| selected time series from another Prometheus server
| dilyevsky wrote:
| Collect directly from shard-level Prometheus, then
| aggregate using /federate at another level. That's how
| Thanos also works, AFAIK.
| buro9 wrote:
| That's what Mimir solves
| deepsun wrote:
| How does it compare to VictoriaMetrics?
| dilyevsky wrote:
| It was originally push, but I think they went back to a sort
| of scheduled pull mode after a few years. There was a very
| in-depth review doc written about this internally, which maybe
| will get published some day.
| atdt wrote:
| What's the go/ link?
| dilyevsky wrote:
| Can't remember - just search on moma /s
| gttalbot wrote:
| Pull collection eventually became a real scaling bottleneck for
| Monarch.
|
| The way the "pull" collection worked was that there was an
| external process-discovery mechanism, which the leaf backend
| processes used to find the entities they were monitoring. The
| leaves would connect to an endpoint that each entity's
| collection library listened on, and those collection libraries
| would stream the metric measurements back according to the
| schedules that the leaves sent.
|
| Several problems.
|
| First, the leaf-side data structures and TCP connections become
| very expensive. If a leaf process is connecting to many, many
| thousands of monitored entities, TCP buffers aren't free,
| keep-alives aren't free, and neither are a host of other data
| structures. Eventually this became an...interesting...fraction
| of the CPU and RAM on these leaf processes.
|
| Second, this implies a service discovery mechanism so that the
| leaves can find the entities to monitor. This was a combination
| of code in Monarch and an external discovery service. This was
| a constant source of headaches and outages, as the appearance
| and disappearance of entities is really spiky and
| unpredictable. Any burp in operation of the discovery service
| could cause a monitoring outage as well. Relatedly, the
| technical "powers that be" decided that the particular
| discovery service, of which Monarch was the largest user,
| wasn't really something that was suitable for the
| infrastructure at scale. This decision was made largely
| independently of Monarch, but required Monarch to move off.
|
| Third, Monarch does replication, up to three ways. In the pull-
| based system, it wasn't possible to guarantee that the
| measurement that each replica sees is the same measurement with
| the same microsecond timestamp. This was a huge data quality
| issue that made the distributed queries much harder to make
| correct and performant. Also, the clients had to pay both in
| persistent TCP connections on their side and in RAM, state
| machines, etc., for this replication, as a connection would be
| made from each backend leaf process holding a replica for a
| given client.
|
| Fourth, persistent TCP connections and load balancers don't
| really play well together.
|
| Fifth, not everyone wants to accept incoming connections in
| their binary.
|
| Sixth, if the leaf process doesn't need to know the collection
| policies for all the clients, those policies don't have to be
| distributed and updated to all of them. At scale this matters
| for both machine resources and reliability. This can be made a
| separate service, pushed to the "edge", etc.
|
| Switching from a persistent connection to the clients pushing
| measurements in distinct RPCs as they were recorded eventually
| solved all of these problems. It was a very intricate
| transition that took a long time. A lot of people worked very
| hard on this, and should be very proud of their work. I hope
| some of them jump into the discussion! (At the very least
| they'll add things I missed/didn't remember... ;^)
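|
| For flavor, the shape of the change looks roughly like this
| (a toy sketch, not Monarch's actual API or client library):
| instead of the backend holding a connection open to every
| monitored task, each task periodically fires off a
| self-contained write request.
|
|   # Toy push-style reporter (invented API): each report is a
|   # standalone request, so the collection backend holds no
|   # long-lived per-client connection state.
|   import time
|
|   def send_write_rpc(payload):
|       """Stand-in for a real RPC; real code would batch/retry."""
|       print("write:", payload)
|
|   def report_loop(task, read_metrics, period_s, iterations=3):
|       """Client side: sample metrics and push them each period."""
|       for _ in range(iterations):
|           send_write_rpc({
|               "task": task,
|               "timestamp": time.time(),
|               "metrics": read_metrics(),
|           })
|           time.sleep(period_s)
|
|   report_loop("demo-task", lambda: {"rpc_latency_ms": 12.5},
|               period_s=0.1)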
| nickstinemates wrote:
| Too small for me, I was looking more for the scale of the
| universe.
| yayr wrote:
| In case this can be deployed single-handed, it might be useful
| on a spaceship... would need some relativistic time accounting,
| though.
| kasey_junk wrote:
| A _huge_ difference between Monarch and other TSDBs that isn't
| outlined in this overview is that a storage primitive for schema
| values is the histogram. Most (maybe all besides Circonus) TSDBs
| try to create histograms at query time using counter primitives.
|
| All of those query-time histogram aggregations are making pretty
| subtle trade-offs that make analysis fraught.
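|
| A toy illustration of the kind of thing that goes wrong
| (invented numbers, raw samples standing in for stored
| distributions): per-source percentiles can't be recombined at
| query time, but the distributions themselves can.
|
|   # Toy example: per-server percentiles can't be averaged; the
|   # distributions have to be merged before taking the
|   # percentile, which is what histogram-valued points allow.
|   import statistics
|
|   server_a = [10, 11, 12, 13, 300]   # one slow outlier
|   server_b = [10, 10, 11, 11, 12]
|
|   def p90(values):
|       """90th percentile using the 'inclusive' method."""
|       return statistics.quantiles(values, n=10,
|                                   method="inclusive")[8]
|
|   avg_of_p90s = (p90(server_a) + p90(server_b)) / 2
|   true_p90 = p90(server_a + server_b)
|   print(avg_of_p90s, true_p90)  # very different numbers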
| teraflop wrote:
| Is it really that different from, say, the way Prometheus
| supports histogram-based quantiles?
| https://prometheus.io/docs/practices/histograms/
|
| Granted, it looks like Monarch supports a more cleanly-defined
| schema for distributions, whereas Prometheus just relies on you
| to define the buckets yourself and follow the convention of
| using a "le" label to expose them. But the underlying
| representation (an empirical CDF) seems to be the same, and so
| the accuracy tradeoffs should also be the same.
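|
| For concreteness, the estimate both end up doing is roughly
| this (invented buckets and counts): find the bucket where the
| target rank falls and interpolate linearly inside it, so the
| answer is only as good as the bucket boundaries.
|
|   # Sketch of quantile estimation from cumulative ("le"-style)
|   # buckets; the linear interpolation inside the chosen bucket
|   # is where the approximation error comes from.
|   def quantile_from_buckets(q, buckets):
|       """buckets: (upper_bound, cumulative_count), ascending."""
|       total = buckets[-1][1]
|       rank = q * total
|       prev_bound, prev_count = 0.0, 0
|       for bound, count in buckets:
|           if count >= rank:
|               frac = (rank - prev_count) / (count - prev_count)
|               return prev_bound + frac * (bound - prev_bound)
|           prev_bound, prev_count = bound, count
|       return buckets[-1][0]
|
|   latency_ms = [(5, 120), (10, 480), (25, 950), (100, 1000)]
|   print(quantile_from_buckets(0.99, latency_ms))
|   # 85.0 -- all we actually know is "somewhere in (25, 100]"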
| spullara wrote:
| Much different. When you are reporting histograms you can
| combine them and see the true p50 or whatever across all the
| individual systems reporting the metric.
| [deleted]
| nvarsj wrote:
| Can you elaborate a bit? You can do the same in Prometheus
| by summing the bucket counts. Not sure what you mean by
| "true p50" either. With buckets it's always an
| approximation based on the bucket widths.
| spullara wrote:
| Ah, I misunderstood what you meant. If you are reporting
| static buckets I get how that is better than what folks
| typically do but how do you know the buckets a priori?
| Others back their histograms with things like
| https://github.com/tdunning/t-digest. It is pretty
| powerful as the buckets are dynamic based on the data and
| histograms can be added together.
| gttalbot wrote:
| Yes. This. Also, displaying histograms in heatmap format
| can allow you to intuit the behavior of layered distributed
| systems, caches, etc. Relatedly, exemplars allowed tying
| related data to histogram buckets. For example, RPC traces
| could be tied to the latency bucket & time at which they
| complete, giving a natural means to tie metrics monitoring
| and tracing, so you can "go to the trace with the problem".
| This is described in the paper as well.
| spullara wrote:
| Wavefront also has histogram ingestion (I wrote the original
| implementation, I'm sure it is much better now). Hugely
| important if you ask me but honestly I don't think that many
| customers use it.
| sujayakar wrote:
| I've been pretty happy with datadog's distribution type [1]
| that uses their own approximate histogram data structure [2]. I
| haven't evaluated their error bounds deeply in production yet,
| but I haven't had to tune any bucketing. The linked paper [3]
| claims a fixed percentage of relative error per percentile.
|
| [1] https://docs.datadoghq.com/metrics/distributions/
|
| [2] https://www.datadoghq.com/blog/engineering/computing-
| accurat...
|
| [3] https://arxiv.org/pdf/1908.10693.pdf
| hn_go_brrrrr wrote:
| In my experience, Monarch storing histograms and being unable
| to rebucket on the fly is a big problem. A percentile line on a
| histogram will be incredibly misleading, because it's trying to
| figure out what the p50 of a bunch of buckets is. You'll see
| monitoring artifacts like large jumps and artificial plateaus
| as a result of how requests fall into buckets. The bucketer on
| the default RPC latency metric might not be well tuned for your
| service. I've seen countless experienced oncallers tripped up
| by this, because "my graphs are lying to me" is not their first
| thought.
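|
| A toy example of the effect (invented numbers): two services
| with very different true medians can land on the exact same
| interpolated estimate once everything collapses into one wide
| bucket, which is where the plateaus come from.
|
|   # Once most samples fall into one wide bucket, the estimated
|   # median is pinned by the bucket edges, not by the data.
|   def estimated_median(samples, bounds):
|       cum = [(b, sum(1 for s in samples if s <= b))
|              for b in bounds]          # cumulative counts
|       rank = 0.5 * len(samples)
|       prev_b, prev_c = 0.0, 0
|       for b, c in cum:
|           if c >= rank:
|               frac = (rank - prev_c) / (c - prev_c)
|               return prev_b + frac * (b - prev_b)
|           prev_b, prev_c = b, c
|
|   bounds = [4, 16, 64, 256]      # wide, power-of-4 style buckets
|   fast = [17, 18, 19, 20] * 25   # true median ~18.5
|   slow = [55, 58, 60, 62] * 25   # true median ~59
|   print(estimated_median(fast, bounds),
|         estimated_median(slow, bounds))  # both come out as 40.0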
| heinrichhartman wrote:
| Circonus Histograms solve that by using a universal bucketing
| scheme. Details are explained in this paper:
| https://arxiv.org/abs/2001.06561
|
| Disclaimer: I am a co-author.
| kasey_junk wrote:
| My personal opinion is that they should have done a
| log-linear histogram, which solves the problems you mention
| (with other trade-offs), but to me the big news was making
| the DB flexible enough to have that data type.
|
| Leaving the world of a single numeric type for each datum
| will influence the next generation of open-source metrics
| DBs.
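|
| Rough sketch of the log-linear idea (not exactly the Circonus
| scheme from the paper upthread, just the flavor): the bucket
| is derived from the value's decimal exponent plus a linear
| slice within it, so there's nothing to tune per metric and
| buckets from different sources always line up.
|
|   # Rough log-linear bucketing sketch: bucket = order of
|   # magnitude + first two significant digits. Float edge
|   # cases are ignored for brevity.
|   import math
|
|   def log_linear_bucket(value):
|       """(lower, upper) of the bucket holding a positive value."""
|       exponent = math.floor(math.log10(value))
|       mantissa = value / 10.0 ** exponent      # in [1, 10)
|       slot = min(int(mantissa * 10), 99)
|       width = 10.0 ** (exponent - 1)
|       return slot * width, (slot + 1) * width
|
|   for v in [0.0042, 3.7, 88.1, 1234.0]:
|       print(v, "->", log_linear_bucket(v))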
| jrockway wrote:
| I definitely remember a lot of time spent tweaking histogram
| buckets for performance vs. accuracy. The default bucketing
| algorithm at the time was powers of 4 or something very
| unusual like that.
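|
| Something like this, if memory serves (don't quote me on the
| exact base):
|
|   # What an exponential bucketer with base 4 produces; buckets
|   # get very wide very quickly at the high end.
|   boundaries_ms = [4 ** i for i in range(8)]
|   print(boundaries_ms)  # [1, 4, 16, 64, 256, 1024, 4096, 16384]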
| shadowgovt wrote:
| It's because powers of four were great for the original
| application of statistics on high-traffic services, where
| the primary thing the user was interested in was deviations
| from the norm, and with a high-traffic system the signal
| for what the norm is would be very strong.
|
| I tried applying it to a service with much lower traffic
| and found the bucketing to be extremely fussy.
| [deleted]
| 8040 wrote:
| I broke this once several years ago. I even use the incident
| number in my random usernames to see if a Googler recognizes it.
| ajstiles wrote:
| Wow - that was a doozy.
| ikiris wrote:
| ahahahahahaha
| orf wrote:
| How did you break it?
| ikiris wrote:
| IIRC they didn't not break it.
| [deleted]
| zoover2020 wrote:
| This is also why I love HN. So niche!
| dijit wrote:
| Discussion from 2020:
| https://news.ycombinator.com/item?id=24303422
| klysm wrote:
| From a quick skim, I don't really grasp why this is a useful
| spot in the trade-off space. Seems risky.
| dijit wrote:
| There's a good talk on Monarch https://youtu.be/2mw12B7W7RI
|
| Why it exists is laid out quite plainly.
|
| The pain of it is we're all jumping on Prometheus (Borgmon)
| without considering why Monarch exists. Monarch doesn't have a
| good equivalent outside of Google.
|
| Maybe some weird mix of TimescaleDB backed by CockroachDB with
| a Prometheus push gateway.
| AlphaSite wrote:
| Wavefront is based on FoundationDB which I've always found
| pretty cool.
|
| [1] https://news.ycombinator.com/item?id=16879392
|
| Disclaimer: I work at vmware on an unrelated thing.
| codethief wrote:
| The first time I heard about Monarch was in discussions about the
| hilarious "I just want to serve 5 terabytes" video[0].
|
| [0]: https://m.youtube.com/watch?v=3t6L-FlfeaI
| pm90 wrote:
| A lot of Google projects seem to rely on other Google projects.
| In this case Monarch relies on Spanner.
|
| I guess it's nice to publish at least the conceptual design so
| that others can implement it in the "rest of the world" case.
| Working with OSS can be painful, slow, and time-consuming, so
| this seems like a reasonable middle ground (although selfishly
| I do wish all of this was source-available).
| praptak wrote:
| Spanner may be hard to set up even with source code available.
| It relies on atomic clocks for reliable ordering of events.
| [deleted]
| joshuamorton wrote:
| I don't think there's any Spanner _necessity_, and IIRC Monarch
| existed pre-Spanner.
| gttalbot wrote:
| Correct. Spanner is used to hold configuration state, but is
| not in the serving path.
| sydthrowaway wrote:
| Stop overhyping software with buzzwords
___________________________________________________________________
(page generated 2022-05-14 23:00 UTC)