[HN Gopher] Reducing logging cost by two orders of magnitude usi...
       ___________________________________________________________________
        
       Reducing logging cost by two orders of magnitude using CLP
        
       Author : ath0
       Score  : 167 points
       Date   : 2022-09-30 10:08 UTC (12 hours ago)
        
 (HTM) web link (www.uber.com)
 (TXT) w3m dump (www.uber.com)
        
       | shrubble wrote:
       | This is basically sysadmin 101, however.
       | 
       | Compressing logs has been a thing since the mid-1990s.
       | 
       | Minimizing writes to disk, or setting up a way to coalesce the
       | writes, has also been around for as long as we have had disk
       | drives. If you don't have enough RAM on your system to buffer the
       | writes so that more of the writes get turned into sequential
       | writes, your disk performance will suffer - this too has been
       | known since the 1990s.
        
         | sa46 wrote:
          | Sysadmin 101 doesn't involve separating out the dynamic
          | portions of similar but unstructured log lines to dramatically
          | improve compression and search performance.
         | 
         | > Zstandard or Gzip do not allow gaps in the repetitive
         | pattern; therefore when a log type is interleaved by variable
         | values, they can only identify the multiple substrings of the
         | log type as repetitive.
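          | 
          | A toy illustration of the idea (my own sketch, not CLP's
          | actual code): split each line into a static template plus
          | the variable values, so the repetitive part is stored once
          | and the variables can be handled separately.
          | 
          |     import re
          | 
          |     # Treat numeric tokens as the variable parts; the rest
          |     # of the line is the static "log type".
          |     VAR = r"\b\d+\b"
          | 
          |     def split_line(line):
          |         variables = re.findall(VAR, line)
          |         template = re.sub(VAR, "<VAR>", line)
          |         return template, variables
          | 
          |     t, v = split_line("Task 12 assigned to container 88")
          |     # t == "Task <VAR> assigned to container <VAR>"
          |     # v == ["12", "88"]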
        
       | SkeuomorphicBee wrote:
       | "Page not found"
       | 
       | Apparently the Uber site noticed I'm not in the USA and
       | automatically redirects to a localized version, which doesn't
        | exist. If their web-development capabilities are any indication,
        | I'll skip their development tips.
        
         | detaro wrote:
         | https://web.archive.org/web/20220930114340/https://www.uber....
         | works
        
         | emj wrote:
         | https://www.uber.com/en-US/blog/reducing-logging-cost-by-two...
        
         | hknmtt wrote:
         | yep
        
       | twunde wrote:
       | Original CLP Paper:
       | https://www.usenix.org/system/files/osdi21-rodrigues.pdf
       | 
       | Github project for CLP: https://github.com/y-scope/clp
       | 
        | The interesting part about the article isn't that structured data
        | is easier to compress and store, it's that there's a relatively
        | new way to efficiently transform unstructured logs into
        | structured data. For those shipping unstructured logs to an
        | observability backend, this could be a way to save significant
        | money.
        
         | SergeAx wrote:
          | Wow, that's awesome! What are the chances of seeing it
          | integrated into the Loki stack?
        
         | gdcohen wrote:
         | Does anyone have a simple explanation of how it structures the
         | log data?
        
           | benmanns wrote:
           | Figure 2[2] from the article is pretty good.
           | 
           | [2]: https://blog.uber-cdn.com/cdn-
           | cgi/image/width=2216,quality=8...
        
             | VectorLock wrote:
              | It seems very cool to get logmine-style log line templating
              | built right in. I've found it's very helpful to run logs
              | through it, and having a log system that can do this at
              | ingest time for quicker querying seems like it'd have
              | amazing log-digging workflow benefits.
        
           | pmarreck wrote:
           | 1) Use Zstandard with a generated dictionary kept separate
           | from the data, but moreover:
           | 
           | 2) Organize the log data into tables with columns, and then
           | compress _by column_ (so, each column has its own
           | dictionary). This lets the compression algorithm perform
           | optimally, since now all the similar data is right next to
           | itself. (This reminds me of the Burrows-Wheeler transform,
           | except much more straightforward, thanks to how similar log
           | lines are.)
           | 
            | 3) Search is performed without decompression. Somehow they
            | use the dictionary to index into the table - very clever. I
            | would have just compressed the search term using the same
            | dictionary and done a binary search for that, but I think
            | that would only work for exact matches.
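            | 
            | A rough sketch of point 2 (my own toy example using the
            | python-zstandard package, not CLP's implementation):
            | compress each column of the log table on its own, so the
            | compressor sees long runs of similar data.
            | 
            |     import zstandard as zstd
            | 
            |     rows = [
            |         ("2022-09-30T10:08:01", "INFO", "task 1 started"),
            |         ("2022-09-30T10:08:02", "INFO", "task 2 started"),
            |         ("2022-09-30T10:08:03", "WARN", "task 1 retried"),
            |     ]
            | 
            |     cctx = zstd.ZstdCompressor(level=19)
            |     compressed_columns = []
            |     for column in zip(*rows):       # rows -> columns
            |         blob = "\n".join(column).encode()
            |         compressed_columns.append(cctx.compress(blob))
            |     # Each column compresses better than the interleaved
            |     # lines would, since similar values sit next to each
            |     # other.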
        
             | gcr wrote:
              | Does the user have to specify the "schema" (each unique log
              | message type) manually? Or is it learned automatically (a
              | la gzip and friends)? I wasn't able to discover this from a
              | cursory read-through of the paper...
        
               | tkhattra wrote:
                | The paper mentions that CLP comes with a default set of
                | schemas, but you can also provide your own rules for
                | better compression and faster search.
        
           | Veserv wrote:
           | If you have exactly two logging operations in your entire
           | program:
           | 
           | log("Began {x} connected to {y}")
           | 
           | log("Ended {x} connected to {y}")
           | 
           | We can label the first one logging operation 1 and the second
           | one logging operation 2.
           | 
           | Then, if logging operation 1 occurs we can write out:
           | 
           | 1, {x}, {y} instead of "Began {x} connected to {y}" because
           | we can reconstruct the message as long as we know what
           | operation occurred, 1, and the value of all the variables in
           | the message. This general strategy can be extended to any
           | number of logging operations by just giving them all a unique
           | ID.
           | 
           | That is basically the source of their entire improvement. The
           | only other thing that may cause a non-trivial improvement is
           | that they delta encode their timestamps instead of writing
           | out what looks to be a 23 character timestamp string.
           | 
           | The columnar storage of data and dictionary deduplication,
           | what is called Phase 2 in the article, is still not fully
           | implemented according to the article authors and is only
           | expected to result in a 2x improvement. In contrast, the
           | elements I mentioned previously, Phase 1, were responsible
           | for a 169x(!) improvement in storage density.
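            | 
            | A minimal sketch of that encoding (my own illustration,
            | not the paper's exact format): store a template ID plus
            | the variable values, and delta-encode the timestamp.
            | 
            |     TEMPLATES = {
            |         1: "Began {x} connected to {y}",
            |         2: "Ended {x} connected to {y}",
            |     }
            | 
            |     def encode(prev_ts, ts, template_id, x, y):
            |         # (timestamp delta, template ID, variables)
            |         return (ts - prev_ts, template_id, x, y)
            | 
            |     def decode(prev_ts, record):
            |         delta, template_id, x, y = record
            |         msg = TEMPLATES[template_id].format(x=x, y=y)
            |         return prev_ts + delta, msg
            | 
            |     rec = encode(1664532481, 1664532483, 1, "a", "b")
            |     # rec == (2, 1, 'a', 'b'); decode(1664532481, rec)
            |     # == (1664532483, 'Began a connected to b')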
        
       | taftster wrote:
       | I'm not trying to flame bait here, but this whole article refutes
       | the "Java is Dead" sentiment that seems to float around regularly
       | among developers.
       | 
       | This is a very complicated and sophisticated architecture that
       | leverages the JVM to the hilt. The "big data" architecture that
       | Java and the JVM ecosystem present is really something to be
       | admired, and it can definitely move big data.
       | 
       | I know that competition to this architecture must exist in other
       | frameworks or platforms. But what exactly would replace the HDFS,
       | Spark, Yarn configuration described by the article? Are there
       | equivalents of this stack in other non-JVM deployments, or to
       | other big data projects, like Storm, Hive, Flink, Cassandra?
       | 
       | And granted, Hadoop is somewhat "old" at this point. But I think
       | it (and Google's original map-reduce paper) significantly moved
       | the needle in terms of architecture. Hadoop's Map-Reduce might be
       | dated, but HDFS is still being used very successfully in big data
       | centers. Has the cloud and/or Kubernetes completely replaced the
       | described style of architecture at this point?
       | 
       | Honest questions above, interested in other thoughts.
        
         | foobarian wrote:
         | Not answering your primary question, I know. But I wonder where
         | you are getting the "Java is Dead" sentiment - I am not getting
          | it at all in my (web/enterprisey) circle; if anything, there
          | is a lot of excitement due to new LTS versions and other JVM
         | languages like Kotlin. And I am also finding a lot of gratitude
         | for the language not changing in drastic ways (can you imagine
         | a Python 2->3 like transition?) despite the siren call of fancy
         | new PL features.
        
           | taftster wrote:
           | Maybe it's just a little cliche and maybe the phrase "XXX is
           | Dying" is too easily thrown around for click-bait and
           | hyperbole. It can probably be applied to any language that
           | isn't garnering recent fandom. You could probably just as
           | easily say, "Is C# dead?" or "Is Ruby on Rails dead?" or "Is
           | Python dead?" or "Is Rust dead?" (kidding on those last
           | ones).
           | 
           | And yes, I'm with you. I'm super excited about the changes to
           | the Java language, and the JVM continues to be superior for
           | many workloads. Hotspot is arguably one of the best virtual
           | machines that exists today.
           | 
           | But there are plenty of "Java is dead" blog posts and
           | comments here on HN to substantiate my original viewpoint.
           | Maybe because I make a living with Java, I have a bias
           | towards those articles but filter out others, so I don't have
           | a clean picture of this sentiment and it's more in my head.
        
         | npalli wrote:
          | I didn't read the article that way. FTA, the sense is Java is
          | not dead in the same sense COBOL is not dead, that is, it's
          | "legacy" technology that you now have to work around because
          | it is too costly to operate and maintain. Ironically, from
          | this article the two main technical solves for the issues
          | with their whole JVM setup are CLP (which is the main
          | article) and moving to Clickhouse for non-Spark logs, both of
          | which are written in C++.
          | 
          | With cloud operating costs dominating the expenses at
          | companies, one can see more migration away from JVM setups to
          | simpler (Golang) and closer-to-metal architectures (Rust,
          | C++).
        
           | taftster wrote:
           | Just to probe. COBOL doesn't have many (if any) updates to
           | it, though. And there are no big data architectures being
           | built around it. Equating "Java is Dead" to the same meaning
           | as "COBOL is Dead" doesn't seem like a legitimate comparison.
           | 
           | But I do get your points and don't necessarily disagree with
           | them. I just don't see this as "legacy" technology, but maybe
           | more like "mature"?
        
             | npalli wrote:
              | Yes, "mature" would have been more accurate for Java;
              | some exaggeration on my end. I was trying to convey the
              | sense of excitement for new projects and developers in
              | Java, but it is not fair to compare Java to COBOL,
              | primarily because Java is actively developed, has a lot
              | more developers, etc. Nevertheless, the cloud is so big
              | nowadays that people are looking for alternatives to
              | the JVM world. 10 years ago it would have been close to
              | the default option.
        
           | lenkite wrote:
            | Java supports AOT via Graal, so you can have non-JVM setups
            | already.
        
         | cjalmeida wrote:
         | Of note, Java !== JVM. Spark and Flink, for instance, are
         | written in Scala which is alive and well :).
         | 
          | My best effort at finding replacements for those tools that
          | don't leverage the JVM:
         | 
         | HDFS: Any cloud object store like S3/AzBlob, really. In some
         | workloads data locality provided by HDFS may be important.
         | Alluxio can help here (but I cheat, it's a JVM product)
         | 
          | Spark: Different approach, but you could use Dask, Ray, or dbt
          | plus any analytical SQL DB like Clickhouse. If you're in the
          | cloud and are not processing tens of TB at a time, spinning up
          | an ephemeral HUGE VM and using something in-memory like DuckDB,
          | Polars, or DataFrame.jl is much faster.
         | 
         | Yarn: Kubernetes Jobs. Period. At this point I don't see any
         | advantage of Yarn, including running Spark workloads.
         | 
         | Hive: Maybe Clickhouse for some SQL-like experience. Faster but
         | likely not at the same scale.
         | 
         | Storm/Flink/Cassandra: no clue.
         | 
         | My preferred "modern" FOSS stack (for many reasons) is Python
         | based, with the occasional Julia/Rust thrown in. For a medium
         | scale (ie. few TB daily ingestion), I would go with:
         | 
         | Kubernetes + Airflow + ad-hoc Python jobs + Polars + Huge
         | ephemeral VMs.
        
           | ofrzeta wrote:
           | There's ScyllaDB as a replacement for Cassandra.
           | https://www.scylladb.com/
        
             | mdaniel wrote:
             | That's probably only true for extremely license-permissive
             | shops:
             | 
             | https://github.com/apache/cassandra/blob/trunk/LICENSE.txt
             | (Apache 2)
             | 
             | https://github.com/scylladb/scylladb/blob/master/LICENSE.AG
             | P...
        
         | ddorian43 wrote:
          | The next step is to build it in a lower-level language with
          | modern hardware in mind, to be 2x+ faster than the Java
          | alternatives. See scylladb, redpanda, quickwit, yugabytedb,
          | etc.
        
       | xani_ wrote:
        | I'm more surprised that they don't just ship logs to a central
        | log-gathering service directly, instead of saving plain files
        | and then moving them around.
        
       | tomgs wrote:
       | _Disclaimer: I run Developer Relations for Lightrun._
       | 
       | There is another way to tackle the problem for most normal, back-
       | end applications: Dynamic Logging[0].
       | 
        | Instead of adding a large amount of logging during development
        | (and then having to deal with compressing and transforming it
        | later), one can instead choose to only add the logs required at
        | runtime.
       | 
       | This is a workflow shift, and as such should be handled with
       | care. But for the majority of logs used for troubleshooting, it's
       | actually a saner approach: Don't make a priori assumptions about
       | what you might need in production, then try and "massage" the
       | right parts out of it when the problem rears its head.
       | 
       | Instead, when facing an issue, add logs where and when you need
       | them to almost "surgically" only get the bits you want. This way,
       | logging cost reduction happens naturally - because you're never
       | writing many of the logs to begin with.
       | 
       | Note: we're not talking about removing logs needed for
       | compliance, forensics or other regulatory reasons here, of
       | course. We're talking about those logs that are used by
       | developers to better understand what's going on inside the
       | application: the "print this variable" or "show this user's
       | state" or "show me which path the execution took" type logs, the
        | ones you look at once and then forget about (while their costs
        | pile on and on).
       | 
       | We call this workflow "Dynamic Logging", and have a fully-
       | featured version of the product available for use at the website
       | with up to 3 live instances.
       | 
       | On a personal - albeit obviously biased - note, I was an SRE
       | before I joined the company, and saw an early demo of the
       | product. I remember uttering a very verbal f-word during the
       | demonstration, and thinking that I want me one of these nice
       | little IDE thingies this company makes. It's a different way to
       | think about logging - I'll give you that - but it makes a world
       | of sense to me.
       | 
       | [0] https://docs.lightrun.com/logs/
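        | 
        | A bare-bones illustration of the general idea (plain Python
        | logging, not our product's mechanism; the logger names are
        | made up): keep verbose logging off by default, and raise a
        | single logger's level only while you're investigating, then
        | drop it again.
        | 
        |     import logging
        | 
        |     logging.basicConfig(level=logging.INFO)
        |     log = logging.getLogger("payments.checkout")
        | 
        |     def set_runtime_level(name, level):
        |         # Called from an admin endpoint / ops tool at runtime.
        |         logging.getLogger(name).setLevel(level)
        | 
        |     log.debug("cart state: %s", {"items": 3})  # suppressed
        |     set_runtime_level("payments.checkout", logging.DEBUG)
        |     log.debug("cart state: %s", {"items": 3})  # now emitted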
        
         | thethimble wrote:
          | Perhaps I'm misunderstanding, but what happens if you've had a
          | one-off production issue (job failed, etc.) and you hadn't
          | dynamically logged the corresponding code? You can't go back
          | in time and enable logging for that failure, right?
        
           | tomgs wrote:
           | That would entail time-travelling and capturing that exact
           | spot in the code, which is usually done by exception
           | monitoring/handling products (plenty exist on the market).
           | 
           | We're more after ongoing situations, where the issue is
           | either hard to reproduce locally or requires very specific
           | state - APIs returning wrong data, vague API 500 errors,
            | application transaction issues, misbehaving caches, third-
            | party library errors - that kind of stuff.
           | 
           | If you're looking at the app and your approach would normally
           | be to add another hotfix with logging because some specific
           | piece of information is missing, this approach works
           | beautifully.
        
             | mdaniel wrote:
             | > which is usually done by exception monitoring/handling
             | products (plenty exist on the market).
             | 
              | Only if one considers the bug/unexpected condition to be
              | an _exception_; the only thing worse than nothing being
              | an exception is everything being an exception.
        
           | mdaniel wrote:
            | An alternative approach, IMHO, is to log all the things and
            | just be judicious about expunging old stuff -- I believe the
            | metrics community buys into this approach, too, storing high-
            | granularity captures for a week or whatever and then rolling
            | them up into larger aggregates for longer-term storage.
            | 
            | I would also at least _try_ a cluster-local log buffering
            | system that forwards INFO and above as received, but buffers
            | DEBUG and below, optionally allowing someone to uncork them
            | if required, getting the "time traveling logging" you were
            | describing. The risk, of course, is that the more links in
            | that transmission chain, the more opportunities for something
            | to go sideways and take out _all_ logs, which would be :-(
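            | 
            | A rough approximation of that idea with the standard
            | library (just a sketch, not a full forwarder): buffer
            | records in memory and flush the whole buffer to the real
            | handler only when something at ERROR or above shows up.
            | 
            |     import logging
            |     from logging.handlers import MemoryHandler
            | 
            |     # Stand-in for the real shipper/forwarder.
            |     target = logging.StreamHandler()
            |     buffered = MemoryHandler(
            |         capacity=10_000,
            |         flushLevel=logging.ERROR,  # "uncork" trigger
            |         target=target,
            |     )
            | 
            |     log = logging.getLogger("worker")
            |     log.setLevel(logging.DEBUG)
            |     log.addHandler(buffered)
            | 
            |     log.debug("held in memory for now")
            |     log.error("flushes the buffered DEBUG context too")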
        
         | jedberg wrote:
         | I'm glad someone put a name on the concept I've been advocating
         | for a decade. Thank you! It's something we added at Netflix
         | when we realized our logging costs were out of control.
         | 
         | We had a dashboard where you could flip on certain logging only
         | as needed.
        
           | tomgs wrote:
            | I know Mykyta, who does dev productivity at Netflix now, and
            | he said something to that effect ;)
            | 
            | I tried finding you on Twitter, but no go since DMs are
            | closed.
           | 
           | Would be happy to pick your brain about the topic -
           | tom@granot.dev is where I'm at if you have the time!
        
         | VectorLock wrote:
         | Sounds pretty cool. How much?
        
           | tomgs wrote:
           | Pricing is here:
           | 
           | https://lightrun.com/pricing
        
             | hermanradtke wrote:
             | I don't consider Free for one agent and "contact us" for
             | everything else to be pricing.
        
       | jbergens wrote:
       | > After implementing Phase 1, we were surprised to see we had
       | achieved a compression ratio of 169x.
       | 
        | Sounds interesting, now I want to read up on CLP. Not that we
        | have much log text to worry about.
        
       | prionassembly wrote:
       | Man, I was expecting constraint linear programming.
        
         | demux wrote:
         | I was expecting constraint logic programming!
        
           | laweijfmvo wrote:
           | I thought it was going to be about trees until I saw the
           | URL...
        
             | dylan604 wrote:
             | They're just saving lumberjacks money by not having to own
             | their own trucks. They can just load their chainsaws in an
             | Uber XL. Boom! Another industry disruption! /s
        
       | kazinator wrote:
       | Log4j is 350,000 lines of code ... and you still need an add-on
       | to compress logs?
        
       | hobs wrote:
        | This just in: Uber rediscovers what all of us database people
        | already knew, that structured data is usually way easier to
        | compress, store, index, and query than unstructured blobs of
        | text, which is why we kept telling you to stop storing json in
        | your databases.
        
         | kazinator wrote:
         | Maybe people do things like that because:
         | 
         | - their application parses and generates JSON already, so it's
         | low-effort.
         | 
         | - the JSON can have various shapes: database records generally
         | don't do that.
         | 
          | - even if it has the same shape, it can change over time; they
          | don't want to deal with the insane hassle of upgrade-time DB
          | schema changes in existing installations.
         | 
         | The alternative to JSON-in-DB is to have a persistent object
         | store. That has downsides too.
        
           | xwolfi wrote:
            | It's a hassle to change shape in a db, possibly, but have
            | you lived through changing the shape of data in a json
            | store where historical data is important?
            | 
            | You either don't care about the past and can't read it
            | anymore, version your writer and reader each time you
            | realize an address in a new country has yet another
            | frigging field, or parse each json value to add the new
            | shape in place in your store.
            | 
            | Json doesn't solve the problem of shape evolution, but it
            | tempts you very strongly to think you can ignore it.
        
             | VectorLock wrote:
             | You do database migrations or you handle data with a
             | variable shape. Where do you want to put your effort? The
             | latter makes rollbacks easier at least.
        
               | xwolfi wrote:
                | But you don't care as much about rollback (we rarely
                | roll back successful migrations, and only on prod
                | issues in the next few days, and always prepare for
                | it with a reverse script) as you care about the past
                | (can you read data from 2 years ago? this can matter
                | a lot more, and you must know it's never done on
                | unstructured data: the code evolves with the new
                | shape, week after week, and you're stuck when it's
                | the old consumer code you need to unearth to
                | understand the past). It's never perfect, but the
                | belief that data don't need structure nor a well-
                | automated history of migrations is dangerous.
                | 
                | I've seen myself, anecdotally, that most of the time
                | with json data, the past is ignored during the ramp-
                | up to scale, and then once the product is alive and
                | popular, problems start arising. Usually the cry is
                | "why the hell don't we have a dumb db rather than
                | this magic dynamic crap". I now work in a more
                | serious giant corporation where dumb dbs are the
                | default, and it's way more comfortable than I was
                | led to believe when I was younger.
        
         | rubyist5eva wrote:
          | Postgres has excellent JSON support; it's one of my favorite
          | features. It's a nice middle ground between having a schema
          | for absolutely everything and standing up mongodb beside it
          | because it's web scale. We in particular leverage json-schema
          | in our application for a big portion of our json data, and it
          | works great.
        
           | jd_mongodb wrote:
           | MongoDB actually doesn't store JSON. It stores a binary
           | encoding of JSON called BSON (Binary JSON) which encodes type
           | and size information.
           | 
           | This means we can encode objects in your program directly
           | into objects in the database. It also means we can natively
           | encode documents, sub-documents, arrays, geo-spatial
           | coordinates, floats, ints and decimals. This is a primary
           | function of the driver.
           | 
           | This also allows us to efficiently index these fields, even
           | sub-documents and arrays.
           | 
           | All MongoDB collections are compressed on disk by default.
           | 
           | (I work for MongoDB)
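              | 
              | A small illustration (assuming PyMongo's bson module):
              | the document is encoded to typed binary rather than a
              | JSON string, so ints, floats and arrays keep their
              | types on the way in and out.
              | 
              |     import bson  # ships with PyMongo
              | 
              |     doc = {"user_id": 42, "score": 9.5, "tags": ["a"]}
              |     data = bson.encode(doc)  # bytes with type/size info
              |     assert bson.decode(data) == doc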
        
             | rubyist5eva wrote:
              | Thanks for the clarification. I appreciate the info, and
              | all in all I think MongoDB seems really good these days.
              | Though I did have a lot of problems with earlier versions
              | (which, I acknowledge, have mostly been resolved in
              | current versions) that have kinda soured me on the
              | product, and I hesitate to reach for it on new projects
              | when Postgres has been rock-solid for me for over a
              | decade. I wish you guys all the best in building Mongo.
              | 
              | Getting back to my main point, though, minus the bit of
              | sarcasm: a general rule of thumb that has served me well
              | is that if you already have something like postgresql
              | stood up, you can generally take it much further than
              | you may initially think before having to complicate
              | your infrastructure by setting up another database (not
              | just mongodb, but pretty much anything else).
        
         | jodrellblank wrote:
         | This just in: "I knew that already hahaha morons" still as
         | unhelpful and uninteresting a comment as ever.
        
         | stingraycharles wrote:
          | Yeah, it's silly; even when Spark is writing unstructured
          | logs, that doesn't mean that you can't parse them after the
          | fact and store them in a structured way. Even if it doesn't
          | work for 100% of the cases, it's very easy to achieve for 99%
          | of them, in which case you'll still keep a "raw_message"
          | column which you can query as text.
          | 
          | Next up: Uber discovers column-oriented databases are more
          | efficient for data warehouses.
        
         | foobiekr wrote:
         | It is pretty well known at this point by much of the industry
         | that Uber has the same promo policy incentives as Google.
          | That's what happens when you ape Google.
        
         | xani_ wrote:
          | Modern SQL engines can index JSON, though. And it can be
          | structured.
        
         | dewey wrote:
         | There's nothing wrong with storing json in your database if the
         | tradeoffs are clear and it's used in a sensible way.
         | 
         | Having structured data and an additional json/jsonb column
         | where it makes sense can be very powerful. There's a reason
         | every new release of Postgres improves on the performance and
         | features available for the json data type.
         | (https://www.postgresql.org/docs/9.5/functions-json.html)
        
           | marcosdumay wrote:
           | > There's nothing wrong with storing json in your database if
           | the tradeoffs are clear and it's used in a sensible way.
           | 
           | Of course. If there was, postgres wouldn't even support it.
           | 
            | The GP's rant is usually thrown at people that default to
            | json instead of thinking about it and maybe coming up with
            | an adequate structure. There are way too many of those
            | people.
        
             | VectorLock wrote:
             | Rigid structures and schemas are nice, but having document
             | oriented data also has its advantages.
        
               | inkeddeveloper wrote:
               | And so are document storage databases.
        
           | xwolfi wrote:
            | It can't be. Json has a huge structural problem: it's a
            | text representation of a schema+value list, where the
            | schema is repeated with each value. It improved on xml
            | because it doesn't repeat the schema twice, at least...
            | 
            | It's nonsensical most of the time: do a table, transform
            | values out of the db or in the consumer.
            | 
            | The reason postgres does it is because lazy developers
            | overused the json columns, then got fucked, and now say
            | postgres is slow (talking from repeated experience here).
            | Yeah, searching in a random unstructured blob is slow,
            | surprise.
            | 
            | I don't dislike the idea of storing json and structured
            | data together, but... you don't need performance then.
            | Transferring a binary representation of a table and having
            | a binary-to-object converter in your consumer (even
            | chrome) is several orders of magnitude faster than parsing
            | strings, especially with json's vomit of schema at every
            | value.
        
             | dewey wrote:
             | > It's nonsensical most of the time
             | 
             | As usual, it comes down to being sensible about how to use
             | a given tool. You can start with a json column and later
             | when access patterns become clear you split out specific
             | keys that are often accessed / queried on into specific
             | columns.
             | 
             | Another good use case for data that you want to have but
             | don't have to query on often:
             | https://supabase.com/blog/audit
             | 
             | > Yeah searching in random unstructured blob is slow,
             | surprise.
             | 
              | If your use case is to search/query on it often, then a
              | jsonb column is the wrong choice. You can have an index on
              | a json key, which works reasonably well, but I'd probably
              | not put it in the hot path:
             | https://www.postgresql.org/docs/current/datatype-
             | json.html#J...
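              | 
              | For what it's worth, a small sketch of that pattern
              | (Postgres via psycopg2; the connection string, table,
              | and key names here are made up):
              | 
              |     import psycopg2
              | 
              |     conn = psycopg2.connect("dbname=app")
              |     with conn, conn.cursor() as cur:
              |         # Expression index on one hot json key.
              |         cur.execute(
              |             "CREATE INDEX IF NOT EXISTS idx_user "
              |             "ON events ((payload->>'user_id'))"
              |         )
              |         cur.execute(
              |             "SELECT count(*) FROM events "
              |             "WHERE payload->>'user_id' = %s",
              |             ("42",),
              |         )
              |         print(cur.fetchone()[0])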
        
         | groestl wrote:
         | I think it's important to note the difference between
         | unstructured and schemaless. JSON is very much not a blob of
         | text. One layer's structured data is another layer's opaque
         | blob.
        
           | xwolfi wrote:
            | It's the same! Look, the syntax of json does not impose a
            | structure (what fields, what order, can I remove a field?),
            | making it dangerous for any stable parsing over time.
        
         | polotics wrote:
         | Word! Storing JSON is so often the most direct and explicit way
         | of accruing technical debt: "We don't really know what
         | structure the data we'll get should have, just specify that
         | it's going to be JSON"...
        
           | gtowey wrote:
           | I like to say that when you try to make a "schemaless"
           | database, you've just made 1000 different schemas instead.
        
             | Gh0stRAT wrote:
             | Yeah, "Schemaless" is a total misnomer. You either have
             | "schema-on-write" or "schema-on-read".
        
               | TickleSteve wrote:
               | "schema in code" covers all bases.
        
               | layer8 wrote:
               | Schemaless means there's no assurance that the stored
               | data matches any consistent schema. You may try to apply
               | a schema on read, but you don't know if the data being
               | read will match it.
        
           | stingraycharles wrote:
           | But if you're not storing data as JSON, can you _really_ say
            | you're agile? /s
        
             | weego wrote:
              | Look, we'll just get it in this way for now; once it's live
              | we'll have all the time we need to change the schema in the
              | background.
        
               | stingraycharles wrote:
               | We don't have a use case yet, but let's just collect all
               | the data and figure out what to do with it later!
               | 
               | It's funny how these cliches repeat everywhere in the
               | industry, and it's almost impossible for people to figure
               | this out beforehand. It seems like everyone needs to deal
               | with data lakes (at scale) at least once in their life
               | before they truly appreciate the costs of the flexibility
               | they offer.
        
               | beckingz wrote:
               | The Data Exhaust approach is simultaneously bad and
               | justifiable. You should measure what matters and think
               | about what you want to measure and why before collecting
                | data. On the other hand, collecting data in case what you
                | want to measure changes later is usually a lowish-cost
                | way of maybe having the right data in hand later.
        
               | stingraycharles wrote:
               | Oh I agree, that's why I was careful to put "at scale" in
               | there -- these types of approaches are typically good
               | when you're still trying to understand your problem
               | domain, and have not yet hit production scale.
               | 
               | But I've met many a customer that's spending 7-figures on
               | a yearly basis on data that they have yet to extract
               | value from. The rationale is typically "we don't know yet
               | what parameters are important to the model we come up
               | with later", but even then, you could do better than
               | store everything in plaintext JSON on S3.
        
           | kevindong wrote:
           | You can't realistically expect every log format to get a
           | custom schema declared for it prior to deployment.
        
             | xwolfi wrote:
             | If you never intend to monitor them systematically,
             | absolutely!
             | 
              | If you're a bit serious, you can at least impose a
              | date, a time to the millisecond, a pointer to the
              | source of the log line, a level, and a message. Let's
              | be crazy and even say the message could have a
              | structure too, but I can feel the weight of effort on
              | your shoulders, and say you've already saved yourself
              | the embarrassment a colleague of mine faced when he
              | realized he couldn't give me a millisecond timestamp,
              | rendering a latency calculation on past data
              | impossible.
        
               | kevindong wrote:
               | Sorry if I was ambiguous before. When I said "log
               | format", I was referring to the message part of the log
               | line. Standardized timestamp, line in the source code
               | that emitted the log line, and level are the bare minimum
               | for all logging.
               | 
                | Keeping the message part of the log line's format in
                | sync with some external store is deviously difficult,
                | particularly when the interesting parts of the log
                | are the dynamic portions that can take on multiple
                | shapes.
        
       | bcjordan wrote:
       | Always bugged me that highly repetitive logs take up so much
       | space!
       | 
       | I'm curious, are there any managed services / simple to use
       | setups to take advantage of something like this for massive log
       | storage and search? (Most hosted log aggregators I've looked at
       | charge by the raw text GB processed)
        
         | xani_ wrote:
         | They would still charge you per raw GB processed regardless of
         | compression used.
         | 
          | IIRC Elasticsearch compresses by default with LZ4.
        
         | hericium wrote:
          | ZFS as an underlying filesystem offers several compression
          | algos and suits raw log storage well.
        
           | LilBytes wrote:
           | Deduplication can literally save petabytes.
        
             | paulmd wrote:
             | deduplication is probably the biggest "we don't do that
             | here" in the ZFS world lol, at this point I think even the
             | authors of that feature have disowned it.
             | 
             | it does what it says on the tin, but this comes at a much
             | higher price than almost any other ZFS feature: you have to
             | store the dedup tables in memory, permanently, to get any
              | performance out of the system, so the rule of thumb is
              | that you need at least 20GB of RAM per TB stored. In
              | practice you only want to do it if your data is HIGHLY
              | duplicated, and that's often a smell that building a
              | layered image from a common ancestor using the snapshot
              | functionality is going to be a better option.
             | 
             | and once you've committed to deduplication, you're
             | committed... dedup metadata builds up over time and the
             | only time it gets purged is if you remove ALL references to
             | ANY dedup'd blocks on that pool. So practically speaking
             | this is a commitment to running multiple pools and
             | migrating them at some point. That's not a _huge_ problem
              | for enterprise, but most people usually want to run "one
             | big pool" for their home stuff. But all in all, even for
             | enterprise, you have to really know that you want it and
             | it's going to produce big gains for your specific use-case.
             | 
              | In contrast, LZ4 compression is basically free (actually
              | it's usually faster due to reduced IOPS) and still performs
              | very well on things like column-oriented stores, or even
              | just unstructured json blobs, and imposes no particular
              | limitations on the pool; it's just compressed blocks.
        
         | trueleo wrote:
          | Check out https://github.com/parseablehq/parseable ... we
          | are building a log storage and analysis platform in Rust. A
          | columnar format helps a lot in reducing overall size, but
          | then you have a little computational overhead for conversion
          | and compression. This trade-off will be there, but we are
          | discovering ways to minimise it with Rust.
        
       | otikik wrote:
       | I really didn't know whether this was going to be an article
       | about structuring sequential information or about a more
        | efficient way to produce wood. Hacker News!
       | 
        | I clicked, found out, and was disappointed that this wasn't
        | about wood.
       | 
       | Maybe I should start that woodworking career change already.
        
         | eterm wrote:
         | Given that uber.com is prominently displayed in the title, I
         | don't believe this charming and relatable little anecdote about
         | title confusion.
        
           | Cerium wrote:
           | Eh, I didn't notice Uber while I was considering if this was
           | about automation of forestry operations. Though I did stop to
           | consider if there is even enough waste in the industry to
           | have potential gains of 100x. I doubt there is more than 2x
           | available.
        
           | otikik wrote:
           | You overestimate my reading speed and underestimate my
           | clicking speed, good sir
        
       ___________________________________________________________________
       (page generated 2022-09-30 23:01 UTC)