[HN Gopher] The Art of Logging
___________________________________________________________________
The Art of Logging
Author : kiyanwang
Score : 45 points
Date : 2022-10-02 10:42 UTC (3 days ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| adra wrote:
| Logs for machines should probably be expressed as metrics
| instead, because one can actually take the output of metrics for
| easy higher order analysis. Logs can do the same, but it takes
| more human effort to get there.
| deathanatos wrote:
| Until you want to know _which_ query is slow, which metrics
| won't tell you.
|
| Being able to do "where query_time_ms > 5000" and then _seeing
| the query_ has made some of my systems incredibly debuggable[1].
|
| Or just finding a failing request with "where response_code =
| 500". Yeah, I can "grep 500", but you would be _shocked_ at how
| often those characters occur by happenstance in logs. "grep
| ' 500 '" is a bit closer, but... you're just dancing around a
| solution engineered to do the job: structured logs.
|
| [1] and then finding out someone drew a rather large bounding
| box that managed to be centered precisely between and
| encapsulate both Philly & NYC. This kills the DB.
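|
| A minimal sketch of what makes that kind of query possible,
| using Go's standard log/slog as an illustration (field names
| assumed, not taken from any particular system):
|
|     package main
|
|     import (
|         "log/slog"
|         "os"
|     )
|
|     func main() {
|         // Every entry is one JSON object; a log store can then
|         // filter on fields instead of grepping raw text, e.g.
|         // "where query_time_ms > 5000" or "where response_code = 500".
|         log := slog.New(slog.NewJSONHandler(os.Stdout, nil))
|         log.Info("query finished",
|             "query", "SELECT * FROM orders WHERE ...",
|             "query_time_ms", 5230,
|             "response_code", 500)
|     }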
| adra wrote:
| Most metric systems include tags, which give you infinite
| dimensions of data correlation at your fingertips.
| Want the full query, or just a normalized version of the
| query? It should take a couple minutes in most metric engines
| to add such a simple requirement.
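|
| As an illustration of tags, a sketch with Prometheus's Go
| client (metric and label names are made up; full vs. normalized
| query is just the label value you choose to record):
|
|     package main
|
|     import (
|         "time"
|
|         "github.com/prometheus/client_golang/prometheus"
|     )
|
|     // One extra dimension per label; recording the normalized
|     // query keeps label cardinality under control.
|     var queryDuration = prometheus.NewHistogramVec(
|         prometheus.HistogramOpts{
|             Name: "db_query_duration_seconds",
|             Help: "Query latency, tagged by normalized query.",
|         },
|         []string{"query"},
|     )
|
|     func main() {
|         prometheus.MustRegister(queryDuration)
|         elapsed := 5230 * time.Millisecond
|         queryDuration.
|             WithLabelValues("SELECT * FROM orders WHERE id = ?").
|             Observe(elapsed.Seconds())
|     }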
| jasonpeacock wrote:
| Logs should be the input to generate metrics (in addition to
| any metrics published directly).
|
| Otherwise, it's hard to convert a stack trace into a metric ;)
| But you can process the logs to generate a metric about the
| frequency of stack traces.
|
| Ditto for logs telling you "user Foo doesn't have permissions
| for that operation". Again, you can make a metric about "user
| permissions failure" but you still need the log to know that it
| was user Foo.
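|
| A sketch of that direction of flow, assuming JSON logs on
| stdin with "msg" and an optional "stack_trace" field (Go):
|
|     package main
|
|     import (
|         "bufio"
|         "encoding/json"
|         "fmt"
|         "os"
|         "strings"
|     )
|
|     func main() {
|         counts := map[string]int{}
|         sc := bufio.NewScanner(os.Stdin)
|         for sc.Scan() {
|             var e map[string]any
|             if json.Unmarshal(sc.Bytes(), &e) != nil {
|                 continue // not a structured line; skip
|             }
|             if _, ok := e["stack_trace"]; ok {
|                 counts["stack_traces"]++
|             }
|             if m, _ := e["msg"].(string); strings.Contains(m, "permission") {
|                 counts["permission_failures"]++
|             }
|         }
|         // The metric is derived; the log still names user Foo.
|         for name, n := range counts {
|             fmt.Printf("%s %d\n", name, n)
|         }
|     }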
| rroot wrote:
| If you need a metric on the frequency of stack traces, you
| have a way more serious problem than structuring your logs.
| Veserv wrote:
| Storing a human-readable format is an amateurish practice with
| immense performance and resource costs. It conflates the storage
| format with the visualization/processing format.
|
| In basically every case you should store/log a dense binary dump
| of your varying data and an ID that uniquely identifies the data
| format. Then, when a human wants to read the log, you do a post-
| processing step in your debug environment that formats the data
| in a human readable way.
|
| This moves string formatting off of the target and prevents the
| frankly outrageous amount of redundant string duplication when
| storing human readable logs. It also avoids a roundtrip through
| stringly-typed logs and allows for efficient binary parsing of
| your logged data in their natural data format instead of doing
| brittle string parsing.
|
| It is frankly challenging to come up with even a single case
| where you should not do it this way as long as your tooling is
| mature enough to manage the separated encoding/decoding pipeline.
| Even if it is not, you could always just run the post-processing
| step on all your new raw logs and store that afterwards as your
| "official logs" which still gives you the performance advantages
| on your production systems while being no worse than any existing
| scheme.
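|
| A minimal sketch of such a scheme in Go (format IDs and record
| layout invented for illustration):
|
|     package main
|
|     import (
|         "bytes"
|         "encoding/binary"
|         "fmt"
|     )
|
|     // The format table lives in the debug environment, not on
|     // the target, so the hot path never touches strings.
|     var formats = map[uint16]string{
|         1: "query finished in %d ms, rows=%d",
|     }
|
|     // On the target: write the format ID plus raw operands.
|     func logRecord(w *bytes.Buffer, id uint16, args ...uint64) {
|         binary.Write(w, binary.LittleEndian, id)
|         binary.Write(w, binary.LittleEndian, uint8(len(args)))
|         for _, a := range args {
|             binary.Write(w, binary.LittleEndian, a)
|         }
|     }
|
|     // In the debug environment: re-unite IDs with format strings.
|     func decode(r *bytes.Buffer) {
|         var id uint16
|         var n uint8
|         for binary.Read(r, binary.LittleEndian, &id) == nil {
|             binary.Read(r, binary.LittleEndian, &n)
|             args := make([]any, n)
|             for i := range args {
|                 var v uint64
|                 binary.Read(r, binary.LittleEndian, &v)
|                 args[i] = v
|             }
|             fmt.Printf(formats[id]+"\n", args...)
|         }
|     }
|
|     func main() {
|         var buf bytes.Buffer
|         logRecord(&buf, 1, 5230, 42) // 19 bytes, no formatting cost
|         decode(&buf)                 // "query finished in 5230 ms, rows=42"
|     }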
| capableweb wrote:
| > It is frankly challenging to come up with even a single case
| where you should not do it this way
|
| One use case for a human-readable format: your service doesn't
| emit TBs of logs every month, and you can easily handle the
| volume on commodity hardware. Why complicate something that
| doesn't need it?
|
| Both methods have their place.
| jrockway wrote:
| I think the philosophy for logs is that they are THE user
| interface for operators of your service. If you are SaaS, that's
| you! If you provide software for self-hosting, that's your
| customers. It can't be an afterthought; this is how ordinary
| people can peer deep into your system to understand what's wrong
| with it. If they figure it out, they don't page you. If it's
| useless, you can't fix it after they page you. Be paranoid and
| make your subsystems show their work. Anything you want to know
| while debugging, print it; don't make the user play "hey, just
| run this and tell me what it returns". I think there is an art
| here, and many treat it as an afterthought.
|
| I always liked structured logs because it frees my mind from
| coming up with how I want to format arguments. "New HTTP request
| for /foobar from 1.2.3.4 with request-id
| ac31a9e0-5d57-44de-9e98-60fa94d3e866" is a pain to read and
| maintain. By contrast,
|
|     log.Debug("incoming http request",
|         logger.IP("remote-ip", req.Address),
|         logger.String("request-id", req.ID),
|         logger.String("route", req.Route))
|
| is easy to write, and the resulting logs are easy to analyze!
|
| I do like myself some pretty colorful logs, but prefer JSON as
| the intermediary. It's pretty easy to post-process the logs into
| something understandable. I wrote
| https://github.com/jrockway/json-logs for this task. Give it a
| stream of JSON logs, get colorful text out. Add/drop fields or
| select/highlight interesting lines with JQ or regular expressions
| (with -A -B -C of course). Installable on Linux or Mac with
| Homebrew ;) It's one of the few pieces of software I wrote for
| myself that passes the "toothbrush test". I use it twice a day
| every day. And that's a good day when I'm not pulling my hair out
| over an obscure issue ;)
| xani_ wrote:
| Logging structurally, with an option for human-readable output,
| is probably the sane default approach here. Sometimes you just
| want to run an app from the command line, send a request to it,
| and see what happens.
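|
| A sketch of that default with Go's log/slog (the flag name is
| arbitrary): structured JSON unless a human asks for text.
|
|     package main
|
|     import (
|         "flag"
|         "log/slog"
|         "os"
|     )
|
|     func main() {
|         pretty := flag.Bool("pretty", false, "human-readable logs")
|         flag.Parse()
|
|         var h slog.Handler = slog.NewJSONHandler(os.Stderr, nil)
|         if *pretty {
|             h = slog.NewTextHandler(os.Stderr, nil)
|         }
|         log := slog.New(h)
|         log.Info("incoming http request",
|             "route", "/foobar", "remote-ip", "1.2.3.4")
|     }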
| jrockway wrote:
| I could go either way on making it optional. I try to avoid
| configuration complexity and the log format is a piece of
| complexity that can be removed from your application with
| post-processing.
|
| For example, I wrote opinionated-server which lets you pick
| text or JSON logging. Even though it has text logging, when
| I'm developing a server locally, I just pipe the JSON through
| jlog because I have super-refined that format to my personal
| tastes, whereas zap's builtin text format isn't as nice.
|
| But I will say that when I create loggers for unit tests, I
| use the text formatter if one is available. I have never been
| interested in parsing JSON logs interspersed among plain text
| (though it is certainly possible to do).
| bob1029 wrote:
| I'd recommend trying SQLite for logging. We've been doing this
| for a half decade and it's fantastic.
|
| There's an index on our trace/correlation column, so we can
| instantly pull all log entries for a specific user interaction
| with a query like:
|
|     SELECT * FROM LogEntries WHERE TraceId = @TraceId
|
| We also use a single connection and WAL, so we can write many
| thousands of entries per second if necessary.
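|
| A sketch of a setup like that in Go (schema and SQLite driver
| assumed; the poster's actual stack may differ):
|
|     package main
|
|     import (
|         "database/sql"
|         "log"
|
|         _ "github.com/mattn/go-sqlite3"
|     )
|
|     func main() {
|         db, err := sql.Open("sqlite3", "logs.db")
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer db.Close()
|
|         // WAL lets the single writer coexist with readers; the
|         // TraceId index makes per-interaction lookups instant.
|         for _, stmt := range []string{
|             `PRAGMA journal_mode=WAL;`,
|             `CREATE TABLE IF NOT EXISTS LogEntries (
|                  Id      INTEGER PRIMARY KEY,
|                  TraceId TEXT NOT NULL,
|                  Ts      TEXT NOT NULL,
|                  Level   TEXT NOT NULL,
|                  Message TEXT NOT NULL)`,
|             `CREATE INDEX IF NOT EXISTS IX_LogEntries_TraceId
|                  ON LogEntries(TraceId)`,
|         } {
|             if _, err := db.Exec(stmt); err != nil {
|                 log.Fatal(err)
|             }
|         }
|
|         _, err = db.Exec(
|             `INSERT INTO LogEntries (TraceId, Ts, Level, Message)
|              VALUES (?, datetime('now'), 'INFO', 'request started')`,
|             "some-trace-id")
|         if err != nil {
|             log.Fatal(err)
|         }
|     }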
| 0xbadcafebee wrote:
| Logs are for humans, metrics are for machines.
|
| Metrics tell you there's a problem. They're an application's way
| of reporting metadata or state. They are designed to be small,
| frequent, and machine readable/writeable: small bits of
| information that are by themselves quite useless, but very
| useful in aggregate.
|
| Logs describe the problem in detail to a human. They're a
| developer's way of diagnosing problems in an application. They
| can have machine-readable components to them, but being machine-
| readable is not the point. The point is for a human to quickly
| diagnose and fix an arbitrary problem, however you decide to do
| that. It's _extremely annoying_ if you can't read a log because
| it was machine-formatted.
|
| From the article:
|
|     Invest time in designing your log structure
|
| Don't. Eventually you will have to deal with logs with a random
| structure. Invest in telemetry management and distributed
| tracing.
|
|     Log as much as possible.
|
| Don't. It's about quality, not quantity. Too many logs with too
| little information leave you buried and unable to diagnose
| quickly. And too much information is bad for security; always
| mask sensitive information (a sketch of this follows below). I
| can't tell you how many giant companies have been hacked because
| of this. And cost _is_ a factor: I have seen people stop logging
| because it was costing them too much money, when they should
| have been logging smarter (fewer messages with more useful info)
| and utilizing metrics.
|     Keeping consistency is everyone's priority
|
| It is guaranteed that you will have to deal with some metrics and
| logs in arbitrary formats. Do not worry about making them
| perfect. Invest in telemetry management and distributed tracing.
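|
| On masking, a minimal sketch with Go's log/slog (the list of
| sensitive keys is illustrative):
|
|     package main
|
|     import (
|         "log/slog"
|         "os"
|     )
|
|     func main() {
|         sensitive := map[string]bool{"password": true, "card_number": true}
|         h := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
|             // Rewrite sensitive attributes before they are
|             // emitted, so raw values never reach storage.
|             ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
|                 if sensitive[a.Key] {
|                     a.Value = slog.StringValue("****")
|                 }
|                 return a
|             },
|         })
|         log := slog.New(h)
|         // ...,"user":"Foo","card_number":"****"}
|         log.Info("payment failed",
|             "user", "Foo", "card_number", "4111111111111111")
|     }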
| rrwo wrote:
| Yes.
|
| I used to work at a company that invested too much time and
| effort into an ELK stack that handled specialist log formats
| sent over a custom UDP service. It was brittle and fell over in
| all sorts of strange ways. This was all for monitoring a
| handful of servers.
|
| The only good that came of it was when we realized that the
| support staff could search the logs and diagnose 95% of
| customer complaints without bothering developers. ("Your
| transaction failed because you entered a bad card number.")
|
| But that was because the text logs were more useful than the
| metrics.
| jasonpeacock wrote:
| This is called "Structured Logging" and there's lots of articles
| already about it, and libraries to support it.
|
| For example, a great structured logging library for
| Python:
|
| https://www.structlog.org/en/stable/
| kerblang wrote:
| I parse my team's nginx logs with a script even though they
| aren't in JSON. I'm fine with it and other people can do as they
| like. I also do all my application logging as just vanilla logs,
| and I use grep & less like the cretin I am, plus a tiny bit of
| bash for helping with gzip and so on.
|
| If you want to go log-crazy with a big elaborate deal and the
| whole ELK thing and so forth and so on, by all means, if you can
| afford it - which means afford _to see it all the way through_.
| Too many times I've had to deal with teams where everybody says,
| "Yeah... It would be nice to have logs...." because someone built
| the Big Elaborate Thing and nobody can figure it out, find it,
| etc. etc. It becomes a long-term "would be" instead of "is".
| deathanatos wrote:
| > _correlation_id_ ... microservices
|
| We actually did something slightly different: the correlation ID
| is generated at the start of a request, where it enters "our
| stuff" as a whole. But downstream "microservice" style requests
| get a _different_ correlation ID, but with the original as a
| prefix. So if you have lots of microservice calls being made this
| means you can select any subtree with some prefix. The original
| correlation ID is the entire tree, but you might have:
|
|     b8a0b848-92e6-49e7-bde2-283e165b02e4          # the original request
|     b8a0b848-92e6-49e7-bde2-283e165b02e4/authz    # request to authz service
|     b8a0b848-92e6-49e7-bde2-283e165b02e4/foo-db/0 # first subrequest to foo-db
|     b8a0b848-92e6-49e7-bde2-283e165b02e4/foo-db/1 # second subrequest to foo-db
|     b8a0b848-92e6-49e7-bde2-283e165b02e4/foo-db/2 # third subrequest to foo-db
|     b8a0b848-92e6-49e7-bde2-283e165b02e4/bar-db/0 # first subrequest to bar-db
|
| etc. So you can pull just the original request with
| "= 'b8a0b848-92e6-49e7-bde2-283e165b02e4'", a whole subsystem
| with something like "starts_with
| 'b8a0b848-92e6-49e7-bde2-283e165b02e4/foo-db/'", and so on.
| _Very_ convenient. (Our graphs weren't even that deep; we really
| just had "backends for your frontends" and "backends", and that
| was mostly it, so it was mostly used to correlate between the BE
| and the BE-for-FE.)
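|
| A sketch of the scheme in Go (helper names invented):
|
|     package main
|
|     import "fmt"
|
|     // Child calls append a path segment to the parent's ID, so
|     // any subtree of the request is a prefix match in the logs.
|     func childID(parent, service string, n int) string {
|         return fmt.Sprintf("%s/%s/%d", parent, service, n)
|     }
|
|     func main() {
|         root := "b8a0b848-92e6-49e7-bde2-283e165b02e4" // minted at the edge
|         fmt.Println(childID(root, "foo-db", 0))
|         fmt.Println(childID(root, "foo-db", 1))
|         // Log store query for the subsystem:
|         //   WHERE correlation_id LIKE '<root>/foo-db/%'
|     }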
|
| You need structured logs for this, so if you've not got those,
| start there.
|
| There's this thing called "OpenTracing", which I hear is a sort-
| of standardization of this, but I've not had time to investigate
| it.
| gmuslera wrote:
| The article presents an opinionated vision of logs, for a
| particular use case and set of available resources.
|
| For simple enough logs (like the nginx/apache example shown),
| having such a simple format enables quick-and-dirty search and
| information extraction with common unix command-line tools (like
| grep, cut, sort, uniq, tail, and sometimes rev or awk). It is
| not an approach that works for all logs and all situations, but
| sometimes it is enough. And simplicity is a good thing to have.
| rconti wrote:
| Plus, it's virtually impossible to visually scan log messages
| as they go by when each entry is 10 lines long instead of 1 or
| 2.
| swlkr wrote:
| I was looking for logfmt mentioned in the article and couldn't
| find it.
|
| Also, Stripe's canonical log lines[1] solve this problem nicely.
|
| 1: https://stripe.com/blog/canonical-log-lines
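|
| The gist of canonical log lines, sketched with Go's log/slog
| (field names are illustrative): one wide, structured entry per
| request, emitted at the end, carrying everything you'd query.
|
|     package main
|
|     import (
|         "log/slog"
|         "os"
|         "time"
|     )
|
|     func main() {
|         log := slog.New(slog.NewJSONHandler(os.Stdout, nil))
|         start := time.Now()
|         // ... handle the request ...
|         log.Info("canonical-log-line",
|             "request_id", "ac31a9e0-5d57-44de-9e98-60fa94d3e866",
|             "route", "/foobar",
|             "status", 200,
|             "duration_ms", time.Since(start).Milliseconds())
|     }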
| ithkuil wrote:
| You can read about logfmt at https://brandur.org/logfmt
|
| It's quite nice at first glance. My main qualm with it is that
| there is no formal spec and I got bitten a few times by
| incompatibilities between emitters and parsers especially
| around quoted values.
|
| (In particular, we had a Rust binary emitting badly escaped
| values, such that fluentbit emitted structured logs with
| thousands of fields, which in turn caused Graylog to explode.
| That was particularly gnarly because at the same time we had a
| production incident to troubleshoot without historical logs.)
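|
| A sketch of the quoting pitfall in Go (one plausible escaping
| convention; the absence of a spec is exactly the problem):
|
|     package main
|
|     import (
|         "fmt"
|         "strconv"
|         "strings"
|     )
|
|     // Values with spaces, quotes, or '=' must be quoted and
|     // escaped, or a parser splits one value into bogus fields.
|     func pair(k, v string) string {
|         if strings.ContainsAny(v, ` "=`) {
|             return k + "=" + strconv.Quote(v)
|         }
|         return k + "=" + v
|     }
|
|     func main() {
|         fmt.Println(pair("level", "error"),
|             pair("msg", `dial failed: host "db-1" unreachable`))
|         // level=error msg="dial failed: host \"db-1\" unreachable"
|     }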
| bornfreddy wrote:
| TIL - thank you!
|
| Canonical log lines don't solve the exact same problem (they
| provide all the information for some API call in a single
| place), but they do need to be structured to work.
|
| Logfmt: https://www.cloudbees.com/blog/logfmt-a-log-format-
| thats-eas...
___________________________________________________________________
(page generated 2022-10-05 23:01 UTC)