[HN Gopher] Reducing logging cost by two orders of magnitude using CLP
___________________________________________________________________
Reducing logging cost by two orders of magnitude using CLP
Author : ath0
Score : 167 points
Date : 2022-09-30 10:08 UTC (12 hours ago)
(HTM) web link (www.uber.com)
(TXT) w3m dump (www.uber.com)
| shrubble wrote:
| This is basically sysadmin 101, however.
|
| Compressing logs has been a thing since the mid-1990s.
|
| Minimizing writes to disk, or setting up a way to coalesce the
| writes, has also been around for as long as we have had disk
| drives. If you don't have enough RAM on your system to buffer the
| writes so that more of the writes get turned into sequential
| writes, your disk performance will suffer - this too has been
| known since the 1990s.
| sa46 wrote:
 | Sysadmin 101 doesn't involve separating the dynamic portions
 | of similar but unstructured log lines to dramatically improve
 | compression and search performance.
|
| > Zstandard or Gzip do not allow gaps in the repetitive
| pattern; therefore when a log type is interleaved by variable
| values, they can only identify the multiple substrings of the
| log type as repetitive.
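 |
 | A quick way to see the effect (a made-up illustration in
 | Python using zlib, a Gzip-family codec, rather than CLP
 | itself; the log line and its split are invented):
 |
 |     import random, zlib
 |
 |     random.seed(0)
 |     pairs = [(i, random.randrange(10**6)) for i in range(10000)]
 |
 |     # Unstructured: the repeated log type is interleaved with
 |     # variable values, so the codec only sees broken-up
 |     # substrings of the log type.
 |     raw = "\n".join(
 |         "Task %d assigned to container: [ContainerID: %d]" % p
 |         for p in pairs).encode()
 |
 |     # Structured: one copy of the log type, variables stored
 |     # apart from it.
 |     template = b"Task {} assigned to container: [ContainerID: {}]\n"
 |     variables = "\n".join("%d %d" % p for p in pairs).encode()
 |
 |     print("interleaved:", len(zlib.compress(raw, 9)))
 |     print("separated:  ", len(zlib.compress(template + variables, 9)))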
| SkeuomorphicBee wrote:
| "Page not found"
|
 | Apparently the Uber site noticed I'm not in the USA and
 | automatically redirected me to a localized version, which
 | doesn't exist. If their web-development capabilities are any
 | indication, I'll skip their development tips.
| detaro wrote:
| https://web.archive.org/web/20220930114340/https://www.uber....
| works
| emj wrote:
| https://www.uber.com/en-US/blog/reducing-logging-cost-by-two...
| hknmtt wrote:
| yep
| twunde wrote:
| Original CLP Paper:
| https://www.usenix.org/system/files/osdi21-rodrigues.pdf
|
| Github project for CLP: https://github.com/y-scope/clp
|
 | The interesting part about the article isn't that structured data
 | is easier to compress and store, it's that there's a relatively
 | new way to efficiently transform unstructured logs to structured
 | data. For those shipping unstructured logs to an observability
 | backend, this could be a way to save significant money.
| SergeAx wrote:
 | Wow, that's awesome! What are the chances of seeing it
 | integrated into the Loki stack?
| gdcohen wrote:
| Does anyone have a simple explanation of how it structures the
| log data?
| benmanns wrote:
| Figure 2[2] from the article is pretty good.
|
| [2]: https://blog.uber-cdn.com/cdn-
| cgi/image/width=2216,quality=8...
| VectorLock wrote:
 | It seems very cool to get logmine-style log line templating
 | built right in. I've found it's very helpful to run logs
 | through it, and having a log system that can do this at
 | ingest time for quicker querying seems like it'd have
 | amazing log-digging workflow benefits.
| pmarreck wrote:
| 1) Use Zstandard with a generated dictionary kept separate
| from the data, but moreover:
|
| 2) Organize the log data into tables with columns, and then
| compress _by column_ (so, each column has its own
| dictionary). This lets the compression algorithm perform
| optimally, since now all the similar data is right next to
| itself. (This reminds me of the Burrows-Wheeler transform,
| except much more straightforward, thanks to how similar log
| lines are.)
|
 | 3) Search is performed without decompression. Somehow they
 | use the dictionary to index into the table; very clever. I
 | would have just compressed the search term using the same
 | dictionary and done a binary search for that, but I think that
 | would only work for exact matches.
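 |
 | A minimal sketch of point 2 in Python with the `zstandard`
 | bindings (the rows are invented, and CLP's actual on-disk
 | format differs; the per-column dictionary is only pointed at
 | in a comment, since compressing each column as its own stream
 | already shows the layout effect):
 |
 |     import random
 |     import zstandard as zstd
 |
 |     random.seed(0)
 |     # Invented "table" of parsed log lines: id, level, service.
 |     rows = [(str(random.randrange(10**6)).encode(), b"INFO",
 |              b"api-server") for _ in range(20000)]
 |
 |     c = zstd.ZstdCompressor(level=19)
 |
 |     # Row-oriented layout: similar values are interleaved.
 |     row_wise = len(c.compress(b"\n".join(b" ".join(r) for r in rows)))
 |
 |     # Column-oriented layout: each column compressed as its own
 |     # stream, so similar data sits next to itself. A trained
 |     # per-column dictionary (zstandard.train_dictionary), kept
 |     # separate from the data, would take this further.
 |     col_wise = sum(len(c.compress(b"\n".join(col)))
 |                    for col in zip(*rows))
 |
 |     print(row_wise, col_wise)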
| gcr wrote:
| Does the user have to specify the "schema" (each unique log
| message type) manually? or is it learned automatically (a
| la gzip and friends)? I wasn't able to discover this from a
| cursory readthrough of the paper...
| tkhattra wrote:
 | The paper mentions that CLP comes with a default set of
 | schemas, but you can also provide your own rules for
 | better compression and faster search.
| Veserv wrote:
| If you have exactly two logging operations in your entire
| program:
|
| log("Began {x} connected to {y}")
|
| log("Ended {x} connected to {y}")
|
| We can label the first one logging operation 1 and the second
| one logging operation 2.
|
| Then, if logging operation 1 occurs we can write out:
|
 | 1, {x}, {y}
 |
 | instead of "Began {x} connected to {y}", because we can
 | reconstruct the message as long as we know which operation
 | occurred (1) and the values of all the variables in the
 | message. This general strategy can be extended to any number
 | of logging operations by just giving them all a unique ID.
|
| That is basically the source of their entire improvement. The
| only other thing that may cause a non-trivial improvement is
| that they delta encode their timestamps instead of writing
| out what looks to be a 23 character timestamp string.
|
| The columnar storage of data and dictionary deduplication,
| what is called Phase 2 in the article, is still not fully
| implemented according to the article authors and is only
| expected to result in a 2x improvement. In contrast, the
| elements I mentioned previously, Phase 1, were responsible
| for a 169x(!) improvement in storage density.
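 |
 | A toy version of that scheme in Python (the names are
 | invented, and real CLP also dictionary-encodes the variable
 | values themselves):
 |
 |     templates = {}           # template string -> log type ID
 |
 |     def encode(template, variables, ts, prev_ts):
 |         # Store (type ID, timestamp delta, variables) instead
 |         # of the rendered message string.
 |         type_id = templates.setdefault(template, len(templates))
 |         return (type_id, ts - prev_ts, variables)
 |
 |     def decode(record, by_id, prev_ts):
 |         type_id, delta, variables = record
 |         return prev_ts + delta, by_id[type_id].format(*variables)
 |
 |     r = encode("Began {} connected to {}", ("a", "b"),
 |                1664536081, 1664536080)
 |     by_id = {v: k for k, v in templates.items()}
 |     print(r)                       # (0, 1, ('a', 'b'))
 |     print(decode(r, by_id, 1664536080))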
| taftster wrote:
| I'm not trying to flame bait here, but this whole article refutes
| the "Java is Dead" sentiment that seems to float around regularly
| among developers.
|
| This is a very complicated and sophisticated architecture that
| leverages the JVM to the hilt. The "big data" architecture that
| Java and the JVM ecosystem present is really something to be
| admired, and it can definitely move big data.
|
| I know that competition to this architecture must exist in other
 | frameworks or platforms. But what exactly would replace the HDFS,
 | Spark, and Yarn configuration described in the article? Are there
 | non-JVM equivalents of this stack, or of other big data projects
 | like Storm, Hive, Flink, and Cassandra?
|
| And granted, Hadoop is somewhat "old" at this point. But I think
| it (and Google's original map-reduce paper) significantly moved
| the needle in terms of architecture. Hadoop's Map-Reduce might be
| dated, but HDFS is still being used very successfully in big data
| centers. Has the cloud and/or Kubernetes completely replaced the
| described style of architecture at this point?
|
| Honest questions above, interested in other thoughts.
| foobarian wrote:
| Not answering your primary question, I know. But I wonder where
| you are getting the "Java is Dead" sentiment - I am not getting
| it at all in my (web/enterprisey) circle, if anything there is
| a lot of excitement due to new LTS versions and other JVM
| languages like Kotlin. And I am also finding a lot of gratitude
| for the language not changing in drastic ways (can you imagine
| a Python 2->3 like transition?) despite the siren call of fancy
| new PL features.
| taftster wrote:
| Maybe it's just a little cliche and maybe the phrase "XXX is
| Dying" is too easily thrown around for click-bait and
| hyperbole. It can probably be applied to any language that
| isn't garnering recent fandom. You could probably just as
| easily say, "Is C# dead?" or "Is Ruby on Rails dead?" or "Is
| Python dead?" or "Is Rust dead?" (kidding on those last
| ones).
|
| And yes, I'm with you. I'm super excited about the changes to
| the Java language, and the JVM continues to be superior for
 | many workloads. HotSpot is arguably one of the best virtual
 | machines in existence today.
|
| But there are plenty of "Java is dead" blog posts and
| comments here on HN to substantiate my original viewpoint.
| Maybe because I make a living with Java, I have a bias
| towards those articles but filter out others, so I don't have
| a clean picture of this sentiment and it's more in my head.
| npalli wrote:
 | I didn't read the article that way. FTA, the sense is that Java
 | is not dead in the same sense that COBOL is not dead: it is
 | "legacy" technology that you now have to work around because it
 | is too costly to operate and maintain. Ironically, the two main
 | technical fixes in this article for the issues with their whole
 | JVM setup are CLP (the subject of the article) and moving
 | non-Spark logs to ClickHouse, both of which are written in C++.
 |
 | With cloud operating costs dominating expenses at many
 | companies, one can expect more migration away from JVM setups
 | to simpler (Golang) and close-to-the-metal (Rust, C++)
 | architectures.
| taftster wrote:
| Just to probe. COBOL doesn't have many (if any) updates to
| it, though. And there are no big data architectures being
| built around it. Equating "Java is Dead" to the same meaning
| as "COBOL is Dead" doesn't seem like a legitimate comparison.
|
| But I do get your points and don't necessarily disagree with
| them. I just don't see this as "legacy" technology, but maybe
| more like "mature"?
| npalli wrote:
| Yes, "mature" would have been more accurate for Java, some
| exaggeration on my end. I was trying to convey the sense of
| excitement for new projects and developers in Java but it
| is not fair to Java to be compared to COBOL. Primarily
| because Java is actively developed, lot more developers
| etc. Nevertheless Cloud is so big nowadays that people are
| looking for alternatives to the JVM world. 10 years ago it
| would been a close to default option.
| lenkite wrote:
 | Java supports AOT via Graal, so you can have non-JVM setups
 | already.
| cjalmeida wrote:
 | Of note, Java !== JVM. Spark and Flink, for instance, are
 | written in Scala, which is alive and well :).
|
| My best effort in finding replacements of those tools that
| don't leverage the JVM:
|
 | HDFS: Any cloud object store like S3/AzBlob, really. In some
 | workloads the data locality provided by HDFS may be important;
 | Alluxio can help here (but I cheat, it's a JVM product).
|
 | Spark: Different approach, but you could use Dask, Ray, or dbt
 | plus any analytical SQL DB like ClickHouse. If you're in the
 | cloud and are not processing 10s of TB at a time, spinning up
 | an ephemeral HUGE VM and using something in-memory like DuckDB,
 | Polars or DataFrame.jl is much faster (see the sketch after
 | this comment).
|
| Yarn: Kubernetes Jobs. Period. At this point I don't see any
| advantage of Yarn, including running Spark workloads.
|
 | Hive: Maybe ClickHouse for some SQL-like experience. Faster,
 | but likely not at the same scale.
|
| Storm/Flink/Cassandra: no clue.
|
| My preferred "modern" FOSS stack (for many reasons) is Python
| based, with the occasional Julia/Rust thrown in. For a medium
| scale (ie. few TB daily ingestion), I would go with:
|
| Kubernetes + Airflow + ad-hoc Python jobs + Polars + Huge
| ephemeral VMs.
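 |
 | The promised sketch of the "huge ephemeral VM" option, using
 | recent Polars (the file path and column names are made up):
 |
 |     import polars as pl
 |
 |     # Lazily scan a day of logs, push the filter down, and
 |     # aggregate in RAM on one big machine.
 |     out = (
 |         pl.scan_parquet("logs/2022-09-30/*.parquet")
 |           .filter(pl.col("level") == "ERROR")
 |           .group_by("service")
 |           .agg(pl.len().alias("errors"))
 |           .collect()
 |     )
 |     print(out)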
| ofrzeta wrote:
| There's ScyllaDB as a replacement for Cassandra.
| https://www.scylladb.com/
| mdaniel wrote:
| That's probably only true for extremely license-permissive
| shops:
|
| https://github.com/apache/cassandra/blob/trunk/LICENSE.txt
| (Apache 2)
|
 | https://github.com/scylladb/scylladb/blob/master/LICENSE.AGP...
| ddorian43 wrote:
 | The next step is to build it in a lower-level language with
 | modern hardware in mind, to be 2x+ faster than the Java
 | alternatives. See ScyllaDB, Redpanda, Quickwit, YugabyteDB,
 | etc.
| xani_ wrote:
 | I'm more surprised that they don't just ship logs to a central
 | log-gathering service directly, instead of saving plain files
 | and then moving them around.
| tomgs wrote:
| _Disclaimer: I run Developer Relations for Lightrun._
|
| There is another way to tackle the problem for most normal, back-
| end applications: Dynamic Logging[0].
|
 | Instead of adding a large amount of logs during development
 | (and then having to deal with compressing and transforming them
 | later), one can instead choose to only add the logs required at
 | runtime.
|
| This is a workflow shift, and as such should be handled with
| care. But for the majority of logs used for troubleshooting, it's
| actually a saner approach: Don't make a priori assumptions about
| what you might need in production, then try and "massage" the
| right parts out of it when the problem rears its head.
|
| Instead, when facing an issue, add logs where and when you need
| them to almost "surgically" only get the bits you want. This way,
| logging cost reduction happens naturally - because you're never
| writing many of the logs to begin with.
|
| Note: we're not talking about removing logs needed for
| compliance, forensics or other regulatory reasons here, of
| course. We're talking about those logs that are used by
| developers to better understand what's going on inside the
| application: the "print this variable" or "show this user's
| state" or "show me which path the execution took" type logs, the
 | ones you look at once and then forget about (while their costs
 | pile on and on).
|
| We call this workflow "Dynamic Logging", and have a fully-
| featured version of the product available for use at the website
| with up to 3 live instances.
|
| On a personal - albeit obviously biased - note, I was an SRE
| before I joined the company, and saw an early demo of the
| product. I remember uttering a very verbal f-word during the
| demonstration, and thinking that I want me one of these nice
| little IDE thingies this company makes. It's a different way to
| think about logging - I'll give you that - but it makes a world
| of sense to me.
|
| [0] https://docs.lightrun.com/logs/
| thethimble wrote:
 | Perhaps I'm misunderstanding, but what happens if you've had a
 | one-off production issue (job failed, etc.) and you hadn't
 | dynamically logged the corresponding code? You can't go back in
 | time and enable logging for that failure, right?
| tomgs wrote:
| That would entail time-travelling and capturing that exact
| spot in the code, which is usually done by exception
| monitoring/handling products (plenty exist on the market).
|
| We're more after ongoing situations, where the issue is
| either hard to reproduce locally or requires very specific
 | state - APIs returning wrong data, vague API 500 errors,
 | application transaction issues, misbehaving caches, 3rd
 | party library errors - that kind of stuff.
|
| If you're looking at the app and your approach would normally
| be to add another hotfix with logging because some specific
| piece of information is missing, this approach works
| beautifully.
| mdaniel wrote:
| > which is usually done by exception monitoring/handling
| products (plenty exist on the market).
|
 | Only if one considers the bug/unexpected condition to be an
 | _exception_; the only thing worse than nothing being an
 | exception is everything being an exception.
| mdaniel wrote:
| An alternative approach, IMHO, is to log all the things and
| just be judicious about expunging old stuff -- I believe the
| metrics community buys into this approach, too, storing high
| granularity captures for a week or whatever, and then rolling
| them up into larger aggregates for longer-term storage
|
 | I would also at least _try_ a cluster-local log buffering
 | system that forwards INFO and above as received, but buffers
 | DEBUG and below, optionally allowing someone to uncork them
 | if required, getting the "time traveling logging" you were
 | describing. The risk, of course, is that the more links in
 | that transmission chain, the more opportunities for something
 | to go sideways and take out _all_ logs, which would be :-(
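 |
 | For the single-process part of that, Python's stdlib already
 | has a building block: logging.handlers.MemoryHandler holds
 | records and flushes them to a target when a record at
 | flushLevel or above arrives (or when the buffer fills). A
 | minimal sketch; the cluster-local forwarder is hand-waved as
 | a StreamHandler:
 |
 |     import logging
 |     from logging.handlers import MemoryHandler
 |
 |     log = logging.getLogger("app")
 |     log.setLevel(logging.DEBUG)
 |
 |     # INFO and above are forwarded as received.
 |     immediate = logging.StreamHandler()
 |     immediate.setLevel(logging.INFO)
 |     log.addHandler(immediate)
 |
 |     # Everything (including DEBUG) is buffered; the backlog is
 |     # emitted only when an ERROR arrives or the buffer fills.
 |     backlog = MemoryHandler(capacity=10000,
 |                             flushLevel=logging.ERROR,
 |                             target=logging.StreamHandler())
 |     log.addHandler(backlog)
 |
 |     log.debug("held in memory")   # buffered only
 |     log.info("sent right away")   # emitted now (and buffered)
 |     log.error("boom")             # uncorks the whole backlog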
| jedberg wrote:
| I'm glad someone put a name on the concept I've been advocating
| for a decade. Thank you! It's something we added at Netflix
| when we realized our logging costs were out of control.
|
| We had a dashboard where you could flip on certain logging only
| as needed.
| tomgs wrote:
 | I know Mykyta, who does dev productivity at Netflix now, and
 | he said something to that effect ;)
|
| I tried finding you on twitter but no go since DMs are
| closed.
|
| Would be happy to pick your brain about the topic -
| tom@granot.dev is where I'm at if you have the time!
| VectorLock wrote:
| Sounds pretty cool. How much?
| tomgs wrote:
| Pricing is here:
|
| https://lightrun.com/pricing
| hermanradtke wrote:
| I don't consider Free for one agent and "contact us" for
| everything else to be pricing.
| jbergens wrote:
| > After implementing Phase 1, we were surprised to see we had
| achieved a compression ratio of 169x.
|
 | Sounds interesting, now I want to read up on CLP. Not that we
 | have much log text to worry about.
| prionassembly wrote:
| Man, I was expecting constraint linear programming.
| demux wrote:
| I was expecting constraint logic programming!
| laweijfmvo wrote:
| I thought it was going to be about trees until I saw the
| URL...
| dylan604 wrote:
| They're just saving lumberjacks money by not having to own
| their own trucks. They can just load their chainsaws in an
| Uber XL. Boom! Another industry disruption! /s
| kazinator wrote:
| Log4j is 350,000 lines of code ... and you still need an add-on
| to compress logs?
| hobs wrote:
 | This just in: Uber rediscovers what all of us database people
 | already knew, that structured data is usually way easier to
 | compress and store and index and query than unstructured blobs
 | of text, which is why we kept telling you to stop storing JSON
 | in your databases.
| kazinator wrote:
| Maybe people do things like that because:
|
| - their application parses and generates JSON already, so it's
| low-effort.
|
| - the JSON can have various shapes: database records generally
| don't do that.
|
| - even if it has the same shape, it can change over time; they
| don't want to deal with the insane hassle of upgrade-time DB
| schema changes in existing installations
|
| The alternative to JSON-in-DB is to have a persistent object
| store. That has downsides too.
| xwolfi wrote:
 | It's possibly a hassle to change shape in a DB, but have you
 | lived through changing the shape of data in a JSON store
 | where historical data is important?
 |
 | You either don't care about the past and can't read it
 | anymore, version your writer and reader each time you realize
 | an address in a new country has yet another frigging field, or
 | parse each JSON value to add the new shape in place in your
 | store.
 |
 | JSON doesn't solve the problem of shape evolution, but it
 | tempts you very strongly to think you can ignore it.
| VectorLock wrote:
| You do database migrations or you handle data with a
| variable shape. Where do you want to put your effort? The
| latter makes rollbacks easier at least.
| xwolfi wrote:
 | But you don't care as much about rollback (we rarely roll
 | back a successful migration, only on prod issues in the
 | next few days, and we always prepare for it with a reverse
 | script) as you care about the past (can you read data
 | from 2 years ago? this can matter a lot more, and you
 | must know it's never done on unstructured data: the code
 | evolves with the new shape, week after week, and you're
 | stuck unearthing the old consumer code to understand the
 | past). It's never perfect, but the belief that data
 | doesn't need structure or a well-automated history of
 | migrations is dangerous.
 |
 | I've seen myself, anecdotally, that most of the time with
 | JSON data the past is ignored during the ramp-up to
 | scale, and then once the product is alive and popular,
 | problems start arising. Usually the cry is "why the hell
 | don't we have a dumb DB rather than this magic dynamic
 | crap". I now work in a more serious giant corporation
 | where dumb DBs are the default, and it's way more
 | comfortable than I was led to believe when I was younger.
| rubyist5eva wrote:
 | Postgres has excellent JSON support; it's one of my favorite
 | features, and a nice middle ground between having a schema for
 | absolutely everything and standing up MongoDB beside it because
 | it's web scale. We in particular leverage json-schema in our
 | application for a big portion of our JSON data, and it works
 | great.
| jd_mongodb wrote:
| MongoDB actually doesn't store JSON. It stores a binary
| encoding of JSON called BSON (Binary JSON) which encodes type
| and size information.
|
| This means we can encode objects in your program directly
| into objects in the database. It also means we can natively
| encode documents, sub-documents, arrays, geo-spatial
| coordinates, floats, ints and decimals. This is a primary
| function of the driver.
|
| This also allows us to efficiently index these fields, even
| sub-documents and arrays.
|
| All MongoDB collections are compressed on disk by default.
|
| (I work for MongoDB)
| rubyist5eva wrote:
| Thanks for the clarification. I appreciate the info, and
| all-in-all I think MongoDB seems really good these days.
| Though I did have a lot of problems with earlier versions
| (which, I acknowledge have mostly been resolved in current
| versions) that have kinda soured me on the product and I
| hesitate to reach for it on new projects when Postgres has
| been rock-solid for me for over a decade. I wish you guys
| all the best in building Mongo.
|
 | Getting back to my main point, though, minus the bit of
 | sarcasm: a general rule of thumb that has served me well is
 | that if you already have something like PostgreSQL stood up,
 | you can generally take it much further than you may initially
 | think before having to complicate your infrastructure by
 | setting up another database (not just MongoDB, but pretty
 | much anything else).
| jodrellblank wrote:
| This just in: "I knew that already hahaha morons" still as
| unhelpful and uninteresting a comment as ever.
| stingraycharles wrote:
 | Yeah, it's silly; even when Spark is writing unstructured logs,
 | that doesn't mean that you can't parse them after the fact and
 | store them in a structured way. Even if it doesn't work for
 | 100% of the cases, it's very easy to achieve for 99% of them,
 | in which case you'll still keep a "raw_message" column which
 | you can query as text.
|
| Next up: Uber discovers column oriented databases are more
| efficient for data warehouses.
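 |
 | A minimal sketch of the parse-after-the-fact idea (the
 | pattern and field names are invented; real pipelines keep a
 | library of such patterns):
 |
 |     import re
 |
 |     # One invented pattern; lines it misses keep only the raw
 |     # text, which stays queryable as a "raw_message" column.
 |     PATTERN = re.compile(
 |         r"(?P<ts>\S+) (?P<level>[A-Z]+) task=(?P<task>\d+) "
 |         r"(?P<msg>.*)")
 |
 |     def structure(line):
 |         m = PATTERN.match(line)
 |         row = m.groupdict() if m else {}
 |         row["raw_message"] = line
 |         return row
 |
 |     print(structure(
 |         "2022-09-30T10:08:00Z INFO task=42 container started"))
 |     print(structure("a line the pattern does not cover"))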
| foobiekr wrote:
 | It is pretty well known at this point by much of the industry
 | that Uber has the same promo policy incentives as Google.
 | That's what happens when you ape Google.
| xani_ wrote:
 | Modern SQL engines can index JSON, though. And it can be
 | structured.
| dewey wrote:
| There's nothing wrong with storing json in your database if the
| tradeoffs are clear and it's used in a sensible way.
|
| Having structured data and an additional json/jsonb column
| where it makes sense can be very powerful. There's a reason
| every new release of Postgres improves on the performance and
| features available for the json data type.
| (https://www.postgresql.org/docs/9.5/functions-json.html)
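 |
 | For Postgres, that can look like this (a minimal sketch;
 | assumes a reachable database and the psycopg2 package, and
 | the "events" table is invented):
 |
 |     import psycopg2
 |     from psycopg2.extras import Json
 |
 |     conn = psycopg2.connect("dbname=app")  # assumed local DB
 |     with conn, conn.cursor() as cur:
 |         # Fixed columns for what you always query on, plus a
 |         # jsonb column for the variable remainder.
 |         cur.execute("""
 |             CREATE TABLE IF NOT EXISTS events (
 |                 id         bigserial PRIMARY KEY,
 |                 created_at timestamptz NOT NULL DEFAULT now(),
 |                 payload    jsonb NOT NULL)""")
 |         # A GIN index speeds up containment (@>) queries.
 |         cur.execute("""CREATE INDEX IF NOT EXISTS events_payload_idx
 |                        ON events USING gin (payload)""")
 |         cur.execute("INSERT INTO events (payload) VALUES (%s)",
 |                     [Json({"kind": "login", "user": 42})])
 |         cur.execute("""SELECT id FROM events
 |                        WHERE payload @> '{"kind": "login"}'""")
 |         print(cur.fetchall())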
| marcosdumay wrote:
| > There's nothing wrong with storing json in your database if
| the tradeoffs are clear and it's used in a sensible way.
|
| Of course. If there was, postgres wouldn't even support it.
|
 | The GP's rant is usually thrown at people who default to JSON
 | instead of thinking about it and maybe coming up with an
 | adequate structure. There are way too many of those people.
| VectorLock wrote:
| Rigid structures and schemas are nice, but having document
| oriented data also has its advantages.
| inkeddeveloper wrote:
| And so are document storage databases.
| xwolfi wrote:
 | It can't be. JSON has a huge structural problem: it's a text
 | representation of a schema+value list, where the schema is
 | repeated with each value. It improved on XML because it
 | doesn't repeat the schema twice, at least...
 |
 | It's nonsensical most of the time: use a table, and transform
 | values out of the DB or in the consumer.
 |
 | The reason Postgres does it is that lazy developers overused
 | the JSON columns, then got fucked and say Postgres is slow
 | (talking from repeated experience here). Yeah, searching in a
 | random unstructured blob is slow, surprise.
 |
 | I don't dislike the idea of storing JSON and structured data
 | together, but... you don't need performance then. Transferring
 | a binary representation of a table and having a binary-to-
 | object converter in your consumer (even Chrome) is several
 | orders of magnitude faster than parsing strings, especially
 | with JSON's vomit of schema at every value.
| dewey wrote:
| > It's nonsensical most of the time
|
| As usual, it comes down to being sensible about how to use
| a given tool. You can start with a json column and later
| when access patterns become clear you split out specific
| keys that are often accessed / queried on into specific
| columns.
|
| Another good use case for data that you want to have but
| don't have to query on often:
| https://supabase.com/blog/audit
|
| > Yeah searching in random unstructured blob is slow,
| surprise.
|
| If your use case it to search / query on it often then
| jsonb column is the wrong choice. You can have an index on
| a json key, which works reasonably well but I'd probably
| not put it in the hot path:
 | https://www.postgresql.org/docs/current/datatype-json.html#J...
| groestl wrote:
| I think it's important to note the difference between
| unstructured and schemaless. JSON is very much not a blob of
| text. One layer's structured data is another layer's opaque
| blob.
| xwolfi wrote:
 | It's the same! Look, the syntax of JSON does not impose a
 | structure (what fields, what order, can I remove a field?),
 | making it dangerous for any stable parsing over time.
| polotics wrote:
| Word! Storing JSON is so often the most direct and explicit way
| of accruing technical debt: "We don't really know what
| structure the data we'll get should have, just specify that
| it's going to be JSON"...
| gtowey wrote:
| I like to say that when you try to make a "schemaless"
| database, you've just made 1000 different schemas instead.
| Gh0stRAT wrote:
| Yeah, "Schemaless" is a total misnomer. You either have
| "schema-on-write" or "schema-on-read".
| TickleSteve wrote:
| "schema in code" covers all bases.
| layer8 wrote:
| Schemaless means there's no assurance that the stored
| data matches any consistent schema. You may try to apply
| a schema on read, but you don't know if the data being
| read will match it.
| stingraycharles wrote:
| But if you're not storing data as JSON, can you _really_ say
| you're agile? /s
| weego wrote:
| Look, we'll just get it in this way for now, once it's live
| we'll have all the time we need to change the schema in the
| background
| stingraycharles wrote:
| We don't have a use case yet, but let's just collect all
| the data and figure out what to do with it later!
|
| It's funny how these cliches repeat everywhere in the
| industry, and it's almost impossible for people to figure
| this out beforehand. It seems like everyone needs to deal
| with data lakes (at scale) at least once in their life
| before they truly appreciate the costs of the flexibility
| they offer.
| beckingz wrote:
 | The Data Exhaust approach is simultaneously bad and
 | justifiable. You should measure what matters, and think about
 | what you want to measure and why before collecting data. On
 | the other hand, collecting data in case what you want to
 | measure changes later is usually a lowish-cost way of maybe
 | having the right data already on hand later.
| stingraycharles wrote:
| Oh I agree, that's why I was careful to put "at scale" in
| there -- these types of approaches are typically good
| when you're still trying to understand your problem
| domain, and have not yet hit production scale.
|
| But I've met many a customer that's spending 7-figures on
| a yearly basis on data that they have yet to extract
| value from. The rationale is typically "we don't know yet
| what parameters are important to the model we come up
| with later", but even then, you could do better than
| store everything in plaintext JSON on S3.
| kevindong wrote:
| You can't realistically expect every log format to get a
| custom schema declared for it prior to deployment.
| xwolfi wrote:
 | If you never intend to monitor them systematically,
 | absolutely!
 |
 | If you're a bit serious, you can at least impose a date, a
 | time to the millisecond, a pointer to the source of the log
 | line, a level, and a message. Let's be crazy and even say the
 | message could have a structure too, but I can feel the weight
 | of the effort on your shoulders, so let's say you've already
 | saved yourself the embarrassment a colleague of mine faced
 | when he realized he couldn't give me millisecond timestamps,
 | rendering a latency calculation over past data impossible.
| kevindong wrote:
| Sorry if I was ambiguous before. When I said "log
| format", I was referring to the message part of the log
| line. Standardized timestamp, line in the source code
| that emitted the log line, and level are the bare minimum
| for all logging.
|
 | Keeping the message part of the log line's format in sync
 | with some external store is deviously difficult, particularly
 | when the interesting parts of the log are the dynamic
 | portions that can take on multiple shapes.
| bcjordan wrote:
| Always bugged me that highly repetitive logs take up so much
| space!
|
| I'm curious, are there any managed services / simple to use
| setups to take advantage of something like this for massive log
| storage and search? (Most hosted log aggregators I've looked at
| charge by the raw text GB processed)
| xani_ wrote:
 | They would still charge you per raw GB processed, regardless
 | of the compression used.
 |
 | IIRC Elasticsearch compresses by default with LZ4.
| hericium wrote:
 | ZFS as an underlying filesystem offers several compression
 | algos and suits raw log storage well.
| LilBytes wrote:
| Deduplication can literally save petabytes.
| paulmd wrote:
| deduplication is probably the biggest "we don't do that
| here" in the ZFS world lol, at this point I think even the
| authors of that feature have disowned it.
|
 | it does what it says on the tin, but this comes at a much
 | higher price than almost any other ZFS feature: you have to
 | store the dedup tables in memory, permanently, to get any
 | performance out of the system, so the rule of thumb is that
 | you need at least 20GB of RAM per TB stored. In practice you
 | only want to do it if your data is HIGHLY duplicated, and
 | that's often a smell that building a layered image from a
 | common ancestor using the snapshot functionality is going
 | to be a better option.
|
| and once you've committed to deduplication, you're
| committed... dedup metadata builds up over time and the
| only time it gets purged is if you remove ALL references to
| ANY dedup'd blocks on that pool. So practically speaking
| this is a commitment to running multiple pools and
| migrating them at some point. That's not a _huge_ problem
 | for enterprise, but most people usually want to run "one
| big pool" for their home stuff. But all in all, even for
| enterprise, you have to really know that you want it and
| it's going to produce big gains for your specific use-case.
|
 | In contrast, LZ4 compression is basically free (actually
 | it's usually faster due to reduced IOPS) and still performs
 | very well on things like column-oriented stores, or even
 | just unstructured JSON blobs, and it imposes no particular
 | limitations on the pool; it's just compressed blocks.
| trueleo wrote:
 | Check out https://github.com/parseablehq/parseable ... we are
 | building a log storage and analysis platform in Rust. A
 | columnar format helps a lot in reducing overall size, but
 | then you have a little computational overhead to deal with
 | conversion and compression. This trade-off will be there,
 | but we are discovering ways to minimise it with Rust.
| otikik wrote:
| I really didn't know whether this was going to be an article
| about structuring sequential information or about a more
| efficient way to produce wood. Hacker news!
|
 | I clicked, found out, and was disappointed that this wasn't
 | about wood.
|
| Maybe I should start that woodworking career change already.
| eterm wrote:
| Given that uber.com is prominently displayed in the title, I
| don't believe this charming and relatable little anecdote about
| title confusion.
| Cerium wrote:
| Eh, I didn't notice Uber while I was considering if this was
| about automation of forestry operations. Though I did stop to
| consider if there is even enough waste in the industry to
| have potential gains of 100x. I doubt there is more than 2x
| available.
| otikik wrote:
| You overestimate my reading speed and underestimate my
| clicking speed, good sir
___________________________________________________________________
(page generated 2022-09-30 23:01 UTC)