[HN Gopher] The Art of Logging
       ___________________________________________________________________
        
       The Art of Logging
        
       Author : kiyanwang
       Score  : 45 points
       Date   : 2022-10-02 10:42 UTC (3 days ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | adra wrote:
        | Logs meant for machines should probably be expressed as
        | metrics instead, because the output of metrics lends itself
        | to easy higher-order analysis. Logs can do the same, but it
        | takes more human effort to get there.
        
         | deathanatos wrote:
          | Until you want to know _which_ query is slow, which metrics
          | won't tell you.
          | 
          | Being able to do "where query_time_ms > 5000" and then
          | _seeing the query_ has made some of my systems incredibly
          | debuggable[1].
          | 
          | Or just finding a failing request with "where response_code =
          | 500". Yeah, I can "grep 500", but you would be _shocked_ at
          | how often those characters occur by happenstance in logs.
          | "grep ' 500 '" is a bit closer, but you're just dancing
          | around a solution engineered to do the job: structured logs.
         | 
          | [1] And then finding out someone drew a rather large
          | bounding box that managed to be centered precisely between
          | Philly & NYC and encapsulate both. This kills the DB.
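          | 
          | A minimal sketch of that kind of filter over structured
          | JSON logs, assuming hypothetical field names
          | (query_time_ms, query) rather than any standard schema:
          | 
                package main
                
                import (
                    "bufio"
                    "encoding/json"
                    "fmt"
                    "os"
                )
                
                // entry models only the fields this filter cares about;
                // unknown fields in each log line are simply ignored.
                type entry struct {
                    QueryTimeMs float64 `json:"query_time_ms"`
                    Query       string  `json:"query"`
                }
                
                func main() {
                    sc := bufio.NewScanner(os.Stdin)
                    for sc.Scan() {
                        var e entry
                        if json.Unmarshal(sc.Bytes(), &e) != nil {
                            continue // skip lines that aren't JSON
                        }
                        if e.QueryTimeMs > 5000 {
                            fmt.Println(e.Query) // the slow query itself
                        }
                    }
                }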
        
           | adra wrote:
            | Most metric systems include tags, which allow you
            | infinite dimensions of data correlation at your
            | fingertips. Want the full query, or just a normalized
            | version of it? Adding such a simple requirement should
            | take a couple of minutes in most metric engines.
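            | 
            | As a sketch with Prometheus's Go client (one common
            | tagged-metrics engine; the metric and label names here
            | are invented), a latency histogram tagged by query. One
            | caveat: raw query strings as label values blow up
            | cardinality in most backends, which is why the normalized
            | form is the safer tag.
            | 
                  package main
                  
                  import (
                      "github.com/prometheus/client_golang/prometheus"
                      "github.com/prometheus/client_golang/prometheus/promauto"
                  )
                  
                  // queryTime is a latency histogram tagged with a
                  // normalized query string.
                  var queryTime = promauto.NewHistogramVec(
                      prometheus.HistogramOpts{
                          Name:    "query_duration_seconds",
                          Help:    "Query latency by normalized query.",
                          Buckets: prometheus.DefBuckets,
                      },
                      []string{"query"},
                  )
                  
                  func observe(normalizedQuery string, seconds float64) {
                      queryTime.WithLabelValues(normalizedQuery).Observe(seconds)
                  }
                  
                  func main() {
                      observe("SELECT * FROM users WHERE id = ?", 0.042)
                  }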
        
         | jasonpeacock wrote:
         | Logs should be the input to generate metrics (in addition to
         | any metrics published directly).
         | 
         | Otherwise, it's hard to convert a stack trace into a metric ;)
         | But you can process the logs to generate a metric about the
         | frequency of stack traces.
         | 
         | Ditto for logs telling you "user Foo doesn't have permissions
         | for that operation". Again, you can make a metric about "user
         | permissions failure" but you still need the log to know that it
         | was user Foo.
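          | 
          | A toy sketch of that processing step; the marker substrings
          | are assumptions about one app's log text, not a standard:
          | 
                package main
                
                import (
                    "bufio"
                    "fmt"
                    "os"
                    "strings"
                )
                
                func main() {
                    // Derive counters from a log stream on stdin.
                    var stackTraces, permFailures int
                    sc := bufio.NewScanner(os.Stdin)
                    for sc.Scan() {
                        line := sc.Text()
                        if strings.Contains(line, "panic:") {
                            stackTraces++
                        }
                        if strings.Contains(line, "doesn't have permissions") {
                            permFailures++
                        }
                    }
                    // A real pipeline would push these to a metrics
                    // backend instead of printing them.
                    fmt.Printf("stack_traces_total %d\n", stackTraces)
                    fmt.Printf("permission_failures_total %d\n", permFailures)
                }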
        
           | rroot wrote:
           | If you need a metric on the frequency of stack traces, you
           | have a way more serious problem than structuring your logs.
        
       | Veserv wrote:
        | Storing a human-readable format is an amateurish practice with
        | immense performance and resource costs. It conflates the
        | storage format with the visualization/processing format.
        | 
        | In basically every case you should store/log a dense binary
        | dump of your varying data plus an ID that uniquely identifies
        | the data format. Then, when a human wants to read the log, you
        | do a post-processing step in your debug environment that
        | formats the data in a human-readable way.
        | 
        | This moves string formatting off the target and prevents the
        | frankly outrageous amount of redundant string duplication
        | involved in storing human-readable logs. It also avoids a
        | roundtrip through stringly-typed logs and allows for efficient
        | binary parsing of your logged data in its natural format
        | instead of brittle string parsing.
        | 
        | It is frankly challenging to come up with even a single case
        | where you should not do it this way, as long as your tooling
        | is mature enough to manage the separated encoding/decoding
        | pipeline. Even if it is not, you can always run the post-
        | processing step on all your new raw logs and store the result
        | as your "official logs", which still gives you the performance
        | advantages on your production systems while being no worse
        | than any existing scheme.
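        | 
        | A minimal sketch of the scheme; the record ID and field
        | layout are invented for illustration, and error handling is
        | elided:
        | 
              package main
              
              import (
                  "bytes"
                  "encoding/binary"
                  "fmt"
              )
              
              // evSlowQuery uniquely identifies one record layout; the
              // decoder owns the format string, so the target never
              // formats any text.
              const evSlowQuery uint16 = 0x0001
              
              // Target side: append fixed-size fields behind the ID.
              func encodeSlowQuery(buf *bytes.Buffer, queryID, ms uint32) {
                  binary.Write(buf, binary.LittleEndian, evSlowQuery)
                  binary.Write(buf, binary.LittleEndian, queryID)
                  binary.Write(buf, binary.LittleEndian, ms)
              }
              
              // Debug side: decode and render human-readably, offline.
              func decode(r *bytes.Reader) string {
                  var id uint16
                  binary.Read(r, binary.LittleEndian, &id)
                  switch id {
                  case evSlowQuery:
                      var queryID, ms uint32
                      binary.Read(r, binary.LittleEndian, &queryID)
                      binary.Read(r, binary.LittleEndian, &ms)
                      return fmt.Sprintf("slow query %d took %dms", queryID, ms)
                  }
                  return "unknown record"
              }
              
              func main() {
                  var buf bytes.Buffer
                  encodeSlowQuery(&buf, 42, 7300)
                  fmt.Println(decode(bytes.NewReader(buf.Bytes())))
              }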
        
         | capableweb wrote:
         | > It is frankly challenging to come up with even a single case
         | where you should not do it this way
         | 
          | One use case for a human-readable format: your service
          | doesn't produce TBs of logs every month, and you can easily
          | handle the volume on commodity hardware. Why complicate
          | something that doesn't need it?
          | 
          | Both methods have their place.
        
       | jrockway wrote:
        | I think the philosophy for logs is that they are THE user
        | interface for operators of your service. If you run a SaaS,
        | that's you! If you provide software for self-hosting, that's
        | your customers. It can't be an afterthought; this is how
        | ordinary people peer deep into your system to understand
        | what's wrong with it. If they figure it out, they don't page
        | you. If the logs are useless, you can't fix them after they
        | page you. Be paranoid and make your subsystems show their
        | work. Anything you'd want to know while debugging, print it;
        | don't make the user play "hey, just run this and tell me what
        | it returns". There is an art here, and many treat it as an
        | afterthought.
       | 
        | I always liked structured logs because they free my mind from
        | deciding how to format arguments. "New HTTP request
        | for /foobar from 1.2.3.4 with request-id
        | ac31a9e0-5d57-44de-9e98-60fa94d3e866" is a pain to read and
        | maintain. `log.Debug("incoming http request", logger.IP("remote-
        | ip", req.Address), logger.String("request-id", req.ID),
        | logger.String("route", req.Route))` is easy to write, and the
        | resulting logs are easy to analyze!
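        | 
        | As a concrete sketch with uber-go/zap (whose text format
        | comes up below), forcing the JSON production config down to
        | debug level; the field values are the ones from the example:
        | 
              package main
              
              import "go.uber.org/zap"
              
              func main() {
                  // Production config emits JSON; lower the level so
                  // Debug lines actually appear.
                  cfg := zap.NewProductionConfig()
                  cfg.Level = zap.NewAtomicLevelAt(zap.DebugLevel)
                  logger, err := cfg.Build()
                  if err != nil {
                      panic(err)
                  }
                  defer logger.Sync()
              
                  logger.Debug("incoming http request",
                      zap.String("remote-ip", "1.2.3.4"),
                      zap.String("request-id", "ac31a9e0-5d57-44de-9e98-60fa94d3e866"),
                      zap.String("route", "/foobar"),
                  )
              }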
       | 
        | I do like myself some pretty colorful logs, but prefer JSON
        | as the intermediary. It's pretty easy to post-process the
        | logs into something understandable. I wrote
        | https://github.com/jrockway/json-logs for this task. Give it
        | a stream of JSON logs, get colorful text out. Add/drop fields
        | or select/highlight interesting lines with JQ or regular
        | expressions (with -A -B -C, of course). Installable on Linux
        | or Mac with Homebrew ;) It's one of the few pieces of
        | software I wrote for myself that passes the "toothbrush
        | test": I use it twice a day, every day. And that's on a good
        | day, when I'm not pulling my hair out over an obscure issue
        | ;)
        
         | xani_ wrote:
          | Logging structured data, with an option for human-readable
          | output, is probably the sane default approach here.
          | Sometimes you just want to run an app from the command
          | line, send it a request, and see what happens.
        
           | jrockway wrote:
           | I could go either way on making it optional. I try to avoid
           | configuration complexity and the log format is a piece of
           | complexity that can be removed from your application with
           | post-processing.
           | 
           | For example, I wrote opinionated-server which lets you pick
           | text or JSON logging. Even though it has text logging, when
           | I'm developing a server locally, I just pipe the JSON through
           | jlog because I have super-refined that format to my personal
           | tastes, whereas zap's builtin text format isn't as nice.
           | 
            | But I will say that when I create loggers for unit tests,
            | I use the text formatter if one is available. I have never
            | been interested in parsing JSON logs interspersed with
            | plain text (though it is entirely possible to do).
        
       | bob1029 wrote:
       | I'd recommend trying SQLite for logging. We've been doing this
       | for a half decade and it's fantastic.
       | 
        | There's an index on our trace/correlation column, so we can
        | instantly pull all log entries for a specific user
        | interaction with a query like:
        | 
              SELECT *
              FROM LogEntries
              WHERE TraceId = @TraceId
        | 
        | We also use a single connection and WAL, so we can write many
        | thousands of entries per second if necessary.
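        | 
        | A minimal sketch of that setup from Go, via database/sql and
        | the mattn/go-sqlite3 driver; the schema and column names are
        | illustrative guesses, not our production schema:
        | 
              package main
              
              import (
                  "database/sql"
                  "log"
              
                  _ "github.com/mattn/go-sqlite3" // SQLite driver
              )
              
              func main() {
                  db, err := sql.Open("sqlite3", "logs.db")
                  if err != nil {
                      log.Fatal(err)
                  }
                  defer db.Close()
                  db.SetMaxOpenConns(1) // single writer connection
              
                  for _, stmt := range []string{
                      `PRAGMA journal_mode = WAL`, // readers don't block the writer
                      `CREATE TABLE IF NOT EXISTS LogEntries (
                           Timestamp TEXT, TraceId TEXT, Level TEXT, Message TEXT)`,
                      `CREATE INDEX IF NOT EXISTS LogEntries_TraceId
                           ON LogEntries (TraceId)`,
                  } {
                      if _, err := db.Exec(stmt); err != nil {
                          log.Fatal(err)
                      }
                  }
              
                  _, err = db.Exec(
                      `INSERT INTO LogEntries VALUES (datetime('now'), ?, ?, ?)`,
                      "trace-123", "INFO", "hello from sqlite logging")
                  if err != nil {
                      log.Fatal(err)
                  }
              }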
        
       | 0xbadcafebee wrote:
       | Logs are for humans, metrics are for machines.
       | 
       | Metrics tell you there's a problem. They're an application's way
       | of reporting metadata or state. They are designed to be small and
       | frequent and machine readable/writeable, with small bits of
       | useful information that are of themselves quite useless, but are
       | very useful in aggregate.
       | 
        | Logs describe the problem in detail to a human. They're a
        | developer's way of diagnosing problems in an application. They
        | can have machine-readable components, but being machine-
        | readable is not the point. The point is for a human to quickly
        | diagnose and fix an arbitrary problem, however you decide to
        | do that. It's _extremely annoying_ if you can't read a log
        | because it was machine-formatted.
       | 
        | From the article:
        | 
              Invest time in designing your log structure
        | 
        | Don't. Eventually you will have to deal with logs with a
        | random structure. Invest in telemetry management and
        | distributed tracing.
        | 
              Log as much as possible.
       | 
        | Don't. It's about quality, not quantity. Too many logs with
        | too little information leave you buried and unable to
        | diagnose quickly. Too much information is also bad for
        | security; always mask sensitive information. I can't tell you
        | how many giant companies have been hacked because of this.
        | And cost _is_ a factor: I have seen people stop logging
        | because it was costing them too much money, when they should
        | have been logging smarter (fewer messages with more useful
        | info) and utilizing metrics.
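        | 
        | A toy sketch of masking before anything hits the log stream;
        | the field list is a hypothetical per-application allowlist of
        | sensitive keys, not a standard:
        | 
              package main
              
              import "fmt"
              
              // sensitive names the fields that must never be logged
              // verbatim; which names belong here is per-application.
              var sensitive = map[string]bool{
                  "card_number": true,
                  "password":    true,
                  "api_token":   true,
              }
              
              // mask returns a copy of fields with sensitive values
              // redacted.
              func mask(fields map[string]string) map[string]string {
                  out := make(map[string]string, len(fields))
                  for k, v := range fields {
                      if sensitive[k] {
                          out[k] = "[REDACTED]"
                      } else {
                          out[k] = v
                      }
                  }
                  return out
              }
              
              func main() {
                  fmt.Println(mask(map[string]string{
                      "user":        "foo",
                      "card_number": "4242424242424242",
                  }))
              }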
        | 
              Keeping consistency is everyone's priority
       | 
       | It is guaranteed that you will have to deal with some metrics and
       | logs in arbitrary formats. Do not worry about making them
       | perfect. Invest in telemetry management and distributed tracing.
        
         | rrwo wrote:
         | Yes.
         | 
         | I used to work at a company that invested too much time and
         | effort into an ELK stack that handled specialist log formats
         | sent over a custom UDP service. It was brittle and fell over in
         | all sorts of strange ways. This was all for monitoring a
         | handful of servers.
         | 
         | The only good that came of it was when we realized that the
         | support staff could search the logs and diagnose 95% of
         | customer complaints without bothering developers. ("Your
         | transaction failed because you entered a bad card number.")
         | 
         | But that was because the text logs were more useful than the
         | metrics.
        
       | jasonpeacock wrote:
        | This is called "structured logging", and there are lots of
        | articles about it already, plus libraries to support it.
        | 
        | For example, a great structured logging library for Python:
       | 
       | https://www.structlog.org/en/stable/
        
       | kerblang wrote:
       | I parse my team's nginx logs with a script even though they
       | aren't in JSON. I'm fine with it and other people can do as they
       | like. I also do all my application logging as just vanilla logs,
       | and I use grep & less like the cretin I am, plus a tiny bit of
       | bash for helping with gzip and so on.
       | 
       | If you want to go log-crazy with a big elaborate deal and the
       | whole ELK thing and so forth and so on, by all means, if you can
        | afford it - which means afford _to see it all the way
        | through_. Too many times I've had to deal with teams where
        | everybody says, "Yeah... it would be nice to have logs..."
        | because someone built the Big Elaborate Thing and nobody can
        | figure it out, find it, etc. It becomes a long-term "would
        | be" instead of an "is".
        
       | deathanatos wrote:
       | > _correlation_id_ ... microservices
       | 
        | We actually did something slightly different: the correlation
        | ID is generated at the start of a request, where it enters
        | "our stuff" as a whole. But downstream "microservice"-style
        | requests get a _different_ correlation ID, with the original
        | as a prefix. So if you have lots of microservice calls being
        | made, you can select any subtree by its prefix. The original
        | correlation ID is the entire tree, but you might have,
        | 
              b8a0b848-92e6-49e7-bde2-283e165b02e4           # the original request
              b8a0b848-92e6-49e7-bde2-283e165b02e4/authz     # request to authz service
              b8a0b848-92e6-49e7-bde2-283e165b02e4/foo-db/0  # first subrequest to foo-db
              b8a0b848-92e6-49e7-bde2-283e165b02e4/foo-db/1  # second subrequest to foo-db
              b8a0b848-92e6-49e7-bde2-283e165b02e4/foo-db/2  # third subrequest to foo-db
              b8a0b848-92e6-49e7-bde2-283e165b02e4/bar-db/0  # first subrequest to bar-db
       | 
        | etc. So you can pull just the original request with
        | "= 'b8a0b848-92e6-49e7-bde2-283e165b02e4'", a whole subsystem
        | with something like "starts_with
        | 'b8a0b848-92e6-49e7-bde2-283e165b02e4/foo-db/'", and so on.
        | _Very_ convenient. (Our graphs weren't even that deep; we
        | really just had "backends for your frontends" and "backends",
        | so it was mostly used to correlate between the BE and the
        | BE-for-FE.)
       | 
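        | A minimal sketch of the derivation, with invented names
        | (corr, child); real code would thread this through request
        | contexts and propagate it in a header, and our "/authz"
        | example above skipped the counter for single-call services:
        | 
              package main
              
              import "fmt"
              
              // corr holds one request's correlation ID plus
              // per-service subrequest counters.
              type corr struct {
                  id       string
                  counters map[string]int
              }
              
              func newCorr(id string) *corr {
                  return &corr{id: id, counters: map[string]int{}}
              }
              
              // child returns the ID for the next subrequest to a
              // service: parent ID + "/<service>/<n>", so any subtree
              // is a prefix match.
              func (c *corr) child(service string) string {
                  n := c.counters[service]
                  c.counters[service]++
                  return fmt.Sprintf("%s/%s/%d", c.id, service, n)
              }
              
              func main() {
                  c := newCorr("b8a0b848-92e6-49e7-bde2-283e165b02e4")
                  fmt.Println(c.child("foo-db")) // .../foo-db/0
                  fmt.Println(c.child("foo-db")) // .../foo-db/1
                  fmt.Println(c.child("bar-db")) // .../bar-db/0
              }
        | 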
        | You need structured logs first, so if you haven't got those,
        | start there.
       | 
       | There's this thing called "OpenTracing", which I hear is a sort-
       | of standardization of this, but I've not had time to investigate
       | it.
        
       | gmuslera wrote:
        | This is an opinionated vision of logs, for a particular use
        | case and a particular set of available resources.
        | 
        | For simple enough logs (like the nginx/apache example shown),
        | such a simple format enables quick-and-dirty search and
        | information extraction with common unix command-line tools
        | (grep, cut, sort, uniq, tail, and sometimes rev or awk). It
        | is not an approach that works for all logs and all
        | situations, but sometimes it is enough. And simplicity is a
        | good thing to have.
        
         | rconti wrote:
          | Plus, it's virtually impossible to visually scan log
          | messages as they scroll by when each entry is 10 lines long
          | instead of 1 or 2.
        
       | swlkr wrote:
        | I was looking for a mention of logfmt in the article and
        | couldn't find one.
        | 
        | Also, Stripe's canonical log lines [1] solve this problem
        | nicely.
       | 
       | 1: https://stripe.com/blog/canonical-log-lines
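        | 
        | A minimal sketch of the pattern (the Stripe post describes
        | it; these field names and the naive logfmt-ish quoting are
        | mine): accumulate fields across the request's lifetime, then
        | emit one wide line at the end.
        | 
              package main
              
              import (
                  "fmt"
                  "sort"
                  "strings"
              )
              
              // canonical accumulates key/value fields for one request.
              type canonical struct{ fields map[string]string }
              
              func (c *canonical) set(k, v string) { c.fields[k] = v }
              
              // emit writes a single wide, sorted, logfmt-style line.
              func (c *canonical) emit() {
                  keys := make([]string, 0, len(c.fields))
                  for k := range c.fields {
                      keys = append(keys, k)
                  }
                  sort.Strings(keys)
                  parts := make([]string, len(keys))
                  for i, k := range keys {
                      v := c.fields[k]
                      if strings.ContainsAny(v, " \"") {
                          v = fmt.Sprintf("%q", v) // naive quoting
                      }
                      parts[i] = k + "=" + v
                  }
                  fmt.Println("canonical-log-line " + strings.Join(parts, " "))
              }
              
              func main() {
                  c := &canonical{fields: map[string]string{}}
                  c.set("http_method", "GET")
                  c.set("http_path", "/foobar")
                  c.set("status", "200")
                  c.set("duration_ms", "12.3")
                  c.emit()
              }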
        
         | ithkuil wrote:
         | You can read about logfmt at https://brandur.org/logfmt
         | 
          | It's quite nice at first glance. My main qualm with it is
          | that there is no formal spec, and I got bitten a few times
          | by incompatibilities between emitters and parsers,
          | especially around quoted values.
          | 
          | (In particular, we had a Rust binary emitting badly escaped
          | values, such that fluentbit emitted structured logs with
          | thousands of fields, which in turn caused Graylog to
          | explode. This was particularly gnarly because, at the same
          | time, we had a production incident to troubleshoot without
          | historical logs.)
        
         | bornfreddy wrote:
         | TIL - thank you!
         | 
          | Canonical log lines don't solve the exact same problem
          | (they provide all the information for some API call in a
          | single place), but they do need to be structured to work.
         | 
         | Logfmt: https://www.cloudbees.com/blog/logfmt-a-log-format-
         | thats-eas...
        
       ___________________________________________________________________
       (page generated 2022-10-05 23:01 UTC)