[HN Gopher] The Syslog Hell
___________________________________________________________________
The Syslog Hell
Author : rdpintqogeogsaa
Score : 83 points
Date : 2021-05-10 10:27 UTC (12 hours ago)
(HTM) web link (techblog.bozho.net)
(TXT) w3m dump (techblog.bozho.net)
| sam_lowry_ wrote:
| You have not seen the journalctl inferno yet.
| rwmj wrote:
| The main problem with the journal is how slow it is, but other
| than that, having a structured format that can carry multi-line
| messages and even large blobs (e.g. core dumps) is a reasonable
| idea. I still run rsyslog so I get real text files as well.
| fullstop wrote:
| I was forced to use systemd and journald for a while and I've
| seriously grown to like them, especially journald.
| lathiat wrote:
| Yep
| hughrr wrote:
| They're really nice until everything on the node is
| completely broken. Then they are a massive obstruction to
| access and understanding what went wrong due to the
| opaqueness. It's best to do some drills on a purposely broken
| system to gain some deep insight into recovery scenarios.
| theamk wrote:
| I am curious what kind of problems you have seen. I am
| transitioning some of my systems to journald, and I am very
| interested in the things that could go wrong.
|
| So far, I have tried looking at logs from a dead system using
| "journalctl -D" - it seemed to work. And the way the log files
| from each boot are kept separate is pretty handy. Other than
| that, the only problems I have seen are having more to type
| and having to learn more commands.
|
| Am I in for a nasty surprise?
| hughrr wrote:
| Key issues for me were machine-id-related problems and journal
| corruption, plus journalctl dumping core after a
| recovery-mode boot. That left us with no tools to deal
| with a compressed journal and no strace to find out what
| it was doing. I actually attached a USB disk to the
| machine and cp'ed the files off it in the end. This was
| inconvenient as the node was 1550 miles away from me.
| oblio wrote:
| If everything is broken and you need to do recovery, don't
| you usually do that from another system anyway, and mount
| the target system?
|
| I imagine the journald logs are just files at the end of
| the day and you can just read them with some tool?
| fullstop wrote:
| This can be tricky because of journald's compression, but is
| it really that much different from a corrupted compressed
| log file in /var/log, like the ones logrotate produces?
|
| If that is your concern, you can disable compression in the
| journald configuration so that the contents can be read
| with "strings" or similar tools.
| 2ion wrote:
| If your logs are that important, they should get shipped off
| the node immediately anyway. That's what ELK, Graylog, Loki,
| etc. are for. If you don't do centralized, automated log
| management on a many-node server farm and that impedes your
| forensic process when a node fails, then it isn't compressed
| logs that are to blame but your setup. On any server I am
| responsible for, any log that is created on the node and
| required or useful for forensics gets shipped off the node
| immediately after it's written, including all system logs,
| container output, etc. I can highly recommend it. The amount
| of data to store centrally is very manageable as well, even
| if you ship 100% of your logs.
| hughrr wrote:
| Yes, 100% agree with this, and we do that, but there's
| normally a window of a few seconds between something going
| snap and the logs being flushed to the upstream log
| aggregator. At which point the smouldering remains of your
| node or container are fairly important to have at hand.
|
| It's like having a black box that forgets the last 30
| seconds of the flight otherwise.
|
| This is incredibly difficult to get right.
| fullstop wrote:
| Which has better odds: the smoldering remains having
| actually flushed the write to disk, or the write having
| made it to the upstream log aggregator?
| hughrr wrote:
| From experience of both, the smouldering remains seem to
| win out by a fair margin. The aggregator may have some
| smoke signals in it, but that's about it.
|
| Most of the issues I've dealt with shaft the network
| before the filesystem.
| fullstop wrote:
| Outside of the "cloud world" most of my failures have
| been the result of disk failure. In those cases, the logs
| definitely did not get written to disk, but they were
| flushed to graylog.
| 0xbadcafebee wrote:
| I can't remember the name of it anymore, but we used to
| use a user-space daemon to collect logs from applications
| over various methods (sockets, pipes, files, etc.), and it
| would store them in a ring buffer until they could be
| flushed to disk or the network. If neither happened, you
| could attach to the host somehow and get the logs out of
| memory, watch them on a terminal screen, etc.
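|
| A minimal sketch of that ring-buffer idea (Python, purely
| illustrative - this is not the unnamed daemon, and the class
| name is made up):
|
|     from collections import deque
|
|     class LogRing:
|         """Keep the last N lines in memory so they can still be
|         dumped when neither disk nor network is available."""
|         def __init__(self, max_lines=10000):
|             # deque with maxlen drops the oldest entry when full
|             self.lines = deque(maxlen=max_lines)
|
|         def append(self, line):
|             self.lines.append(line)
|
|         def dump(self):
|             # emergency path: return whatever is still in memory
|             return list(self.lines)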
|
| Opening an EC2 console in AWS is still a simple and
| reliable way to find out what's going wrong with your
| instance. That wouldn't be possible if we didn't already
| have the convention of having the kernel, syslog, etc.
| print to tty1.
| 0xbadcafebee wrote:
| So, if you care about logs, don't depend on journalctl...
| Well, that's fine if you _can_ ship logs off the box, and
| if the log shipping is working. But if you have to
| troubleshoot on the machine itself (hello, desktops?),
| then having tools that don't come with a lot of pain and a
| big learning curve is critical to addressing the issue
| quickly. What's the point in even having these tools if
| we're not supposed to use them?
|
| Normally, Unix-like tools are not as painful as
| journalctl. But Poettering's interfaces are a gauntlet of
| unusual concepts and hidden inter-relationships, with no
| common examples or intuitiveness. And things like dbus
| make it worse, now that there are many parts of a modern
| Linux system that have no console interface, because
| nobody has written one for the particular application
| whose dbus settings you need to view or change. And
| _/sys/_ is literally a wilderness of random undocumented
| settings that are often the _only interface_ to critical
| system functions.
|
| Linux distributions are now a tiresome no-man's-land of
| overcomplicated, mysterious crap. I'm willing to bet the
| major Linux distributions will be abandoned over the next
| decade for simpler systems that are cloud-native, mobile-
| friendly, and have less Kafkaesque interfaces.
| fullstop wrote:
| What kind of desktop do you have where your journal is
| constantly getting corrupted? Most distributions log both
| to the journal _and_ to /var/log/, so I'm seriously
| struggling to understand the complaint here. GP's message
| about centralized logging is saying that if logs are that
| important to you, you will keep them stored in a
| centralized location. You should be able to set the server
| on fire and still have the logs recorded centrally up until
| that point. This would be the case with or without
| journald.
|
| You started with a strawman about caring about logs, and
| ended complaining about sysfs.
| 0xbadcafebee wrote:
| Yeah I digressed quite a bit, sorry.
|
| > I'm seriously struggling to understand the complaint
| here
|
| The complaint was trying to explain a parent commenter's
| point about journald: _"They're really nice until
| everything on the node is completely broken. Then they
| are a massive obstruction to access and understanding
| what went wrong due to the opaqueness."_
|
| Point: Logs are really annoying to manage on systems with
| journald.
|
| Counter-point: Ship your logs somewhere else / you
| probably don't have these problems in real life
|
| Counter-Counter: If we're not supposed to use these tools
| on our hosts, exactly why are they installed?
| fullstop wrote:
| Logs are written to both the journal and /var/log/
|
| I would argue that logs are less annoying to manage on
| systems with journald, once you take the time to learn
| how to leverage the tools.
|
| I would also argue that shipping mission-critical logs
| off-server is a worthy endeavor, regardless of logging
| system used.
|
| I like journald because it lets me isolate logs for a
| particular unit without grepping and accidentally
| including output from unrelated services. It's faster to
| find the data I need within a time range than to manually
| compare timestamps.
|
| I can count on one finger the number of times the journal
| has been corrupted on the servers that I manage, and it
| was because of hardware failure.
| jeppesen-io wrote:
| Journalctl is fantastic. Never again do I have to parse
| different filenames, syslog "formats" or locations. It just
| works. Add journalbeat to ship the logs to Elasticsearch. Since
| it's all in journalctl, there's no need to even configure what
| to ship. It just works.
|
| The cherry on top is that it's much harder to fill the disk with
| logs.
|
| No matter the service, I'm using the same journalctl command to
| find what I want.
|
| This is why syslog is completely disabled on all our servers.
| theamk wrote:
| I dunno, it seems pretty fine to me? Plenty of metadata, a
| standardized text serialization format. Granted, it is a bit
| verbose, and I can no longer mark my files append-only - but
| those are pretty minor problems.
|
| And the delivery mechanism is so powerful! Online or batched,
| pull or push, with clear logic and great documentation. After
| looking at things like RELP, I just want rsyslog to go away and
| everyone to switch to journald. It is time we stop losing syslog
| entries just because the net was down for a bit!
| jascii wrote:
| Syslog "Hell"? Ever looked at SNMP?!
|
| When a "standard" sticks around this long and needs to support so
| many legacy devices things can get a bit messy. At least syslog
| is human readable, while things may not be as machine parsable as
| you'd like, the info you need is usually only a few greps away.
| jandrese wrote:
| SNMP is at least a real standard, well several real standards,
| that are sometimes even followed.
|
| The big problem with SNMP is that the MIBs have to be handled
| out of band. If there were some part of the standard that let
| you query the device for its MIB in a standard format, it would
| be so, so much better. The daemon could stay small because it
| wouldn't have to ship with hundreds of megabytes of data for
| devices built over the past 40 years. You wouldn't have to go on
| a hunt to track down where the vendor hid the MIBs for oddball
| and obsolete equipment, oftentimes only available with a support
| contract on a website that was decommissioned years ago.
|
| Alternatively it could have a query type that gives you a
| description of every field, so when you walk the tree you get
| all of the data that you would otherwise need the MIB for.
| coldacid wrote:
| Don't get me started on Windows Event Log. Sure, it might be
| better structured than syslog on paper, but it's even more of a
| mess. Not to mention that it's really hundreds of logs that just
| share a similar format and the same UI.
| jandrese wrote:
| I would complain more, except I have never ever gotten a useful
| bit of information out of the Event Log. Not once. It's so
| laughably useless I'm surprised anybody even bothers anymore.
| Microsoft certainly doesn't.
|
| When applications on Windows fail they never think to generate
| an event, not even for something as simple as a "permission
| denied attempting to open file 'c:\blah'". Instead it's chock
| full of useless noise from daemons that activate every 2
| seconds to poll something and then log that everything is still
| ok.
| pitay wrote:
| Not even Microsoft's core Windows applications seem to
| generate an event log entry when they hit a critical error.
| If you've disabled a service that part of the Settings app or
| the Microsoft Store requires and it then hits a critical
| error, you get a generic error message (which doesn't tell
| you that a required service isn't running), but nothing in
| the event logs. So far, event logs have been far more useful
| for things that aren't errors, such as login and startup
| times for Windows.
| floren wrote:
| Huge numbers of EventIDs, and you'll get different EventIDs
| from different versions of Windows, and there's no database of
| THESE ARE THE EVENT IDS WINDOWS USES, you just have to google
| each one individually.
| corty wrote:
| And mostly the structure isn't really used at all, it's just
| plain text for a lot of log messages.
| WorldMaker wrote:
| Plain text that you cannot even access without a complicated
| COM control dance in many cases. It's an interesting design
| for sure.
|
| I have seen, and hope never to see again, worst cases such as
| the COM control for all .NET Framework event log messages
| (everything from .NET system messages to the mostly plain-text
| storage used by applications written in .NET) accidentally
| badly unregistered, leaving all of the event log messages
| unreadable.
| jjkaczor wrote:
| "...applications written in .NET) accidentally badly
| unregistered leaving all of the event log messages
| unreadable..."
|
| Don't even get me started on this... Microsoft is actually
| bad for this even with their own .NET-based enterprise
| applications. It also makes gathering logs from production
| servers and then performing analysis on a different machine
| difficult, as that machine will likely have none of the
| required dependencies.
|
| Text... text and more text, that is universal.
| floren wrote:
| I feel the pain. Everybody decides their appliance will emit
| "syslog", but they don't bother to look at either RFC (not even
| the very lax RFC3164) and just emit "log messages preceded by
| some sort of timestamp", as the article calls out:
|
| > But what makes things hell is the fact that too many vendors
| decided not to care about what is in the RFCs, they decided that
| "hey, putting a year there is just fine" even though the RFC says
| "no", that they don't really need to set a host in the header,
| and that they didn't really need to implement anything new after
| their initial legacy stuff was created.
|
| It sounds like the author and I are doing similar work, so he
| knows my pain: if you make a product which can parse syslog, and
| somebody selects your product for parsing syslog, and they then
| feed it non-syslog logs from Company Y's product... it's now your
| problem, instead of Company Y's, even though you're perfectly
| capable of parsing _syslog_! Luckily, regular expressions and
| beer eventually get most things sorted out. :)
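|
| For illustration, a deliberately lenient pattern (Python; the
| sample line is made up) that tolerates the "year in the
| timestamp" and "missing hostname" quirks described above. Real
| deployments need many more special cases than this:
|
|     import re
|
|     LENIENT = re.compile(
|         r'^(?:<(?P<pri>\d{1,3})>)?'                  # optional <PRI>
|         r'(?P<mon>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) ' # Mmm dd
|         r'(?:(?P<year>\d{4}) )?'                     # non-standard year
|         r'(?P<time>\d{2}:\d{2}:\d{2}) '              # HH:MM:SS
|         r'(?:(?P<host>[\w.\-]+) )?'                  # host, often absent
|         r'(?P<msg>.*)$')
|
|     m = LENIENT.match("<34>Oct 11 2021 22:14:15 su: 'su root' failed")
|     print(m.group('year'), m.group('msg'))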
| imglorp wrote:
| Any love for JSON logging when aiming for central aggregation?
| All kinds of benefits.
| 0xbadcafebee wrote:
| It wouldn't really change anything; JSON is just a data format.
| The problem is how to get people to both send and receive
| specific fields correctly. Lazy programmers/vendors would also
| still think they could throw together some crappy code and
| pretend it's generating JSON, but it would end up producing
| invalid JSON. Same for parsing.
|
| There are a lot of things you need in order to prevent the lazy
| from fucking up:
|
| - A version. A lazy programmer might either ignore or hard-code
| a version number, but at least you have a _hint_ as to which
| standard you're trying to conform to, and can retain backwards
| compatibility. (The new RFC has a version, but the old one
| doesn't, preventing interoperability.)
|
| - A format that's easy enough for programmers to understand,
| but difficult (or "feature-filled") enough that they won't
| attempt to implement it all themselves and will reach for real
| libraries.
|
| - Extensions. Vendors will always want to do something
| different than everyone else, so if you don't add the option of
| extensions, they will either fork the protocol and make
| breaking changes, or try to sneak changes into other parts of
| the protocol/format.
|
| - A standard reference implementation + tests. Make it easy for
| vendors to test their versions against another one, so the
| developers don't have to do busy-work like "read a standard" or
| "write tests".
|
| - Think about the future. Does your standard include a specific
| width integer? Does it preclude a specific network payload
| size? Will addressing change in the future? Can your data
| payload support arbitrary binary data? Can your standard change
| later and still be backwards compatible?
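|
| Purely as an illustration of those points, what a versioned,
| extensible JSON event could look like (Python; the host name
| and extension key are invented for the example):
|
|     import json, time
|
|     event = {
|         "version": 1,                # explicit schema version
|         "ts": time.time(),           # numeric timestamp, no parsing
|         "host": "web01",
|         "severity": "err",
|         "msg": "permission denied opening /etc/app.conf",
|         "ext": {                     # namespaced vendor extensions
|             "example.vendor/rack": "A3",
|         },
|     }
|     print(json.dumps(event, separators=(",", ":")))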
| ashtonbaker wrote:
| I've parsed a lot of logs for our SIEM - adding a JSON parser
| is usually the nicest, barring overly-clever nesting, which
| does happen.
| xorcist wrote:
| There's something about logging that conjures up all sorts of
| strange situations.
|
| A very well-established and expensive product apparently
| thought the way to write JSON logs is something akin to:
| printf("{date: \"%s\", msg: \"%s\"}\n", date, msg);
|
| That turned out real good when msg contained quotes.
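|
| For contrast, a minimal sketch (Python, illustrative only) of
| the same kind of record going through an actual JSON
| serializer, which escapes the embedded quotes instead of
| emitting broken output:
|
|     import json
|
|     print(json.dumps({"date": "2021-05-10T10:27:00Z",
|                       "msg": 'user "bob" logged in'}))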
|
| JSON isn't a magic bullet here. Most log pipelines end up with
| custom logic for all sorts of reasons that mostly shouldn't be
| an issue.
| tannhaeuser wrote:
| Actually, the syslog protocol has its own structured format for
| log messages [1].
|
| [1]: https://datatracker.ietf.org/doc/html/rfc5424
| kevincox wrote:
| Which, as far as I can tell, is implemented by next to
| nothing. Logging single-line JSON events will get you far
| more support from downstream tools.
| latch wrote:
| A bit off topic, but we've been using Vector (1) for log
| ingestion, and I really like it. It's fast, light on resources,
| actively developed, and flexible.
|
| (1) https://vector.dev/
| e12e wrote:
| Ingestion into where? I'm considering Loki, but haven't quite
| run enough tests yet...
| FridgeSeal wrote:
| Seconding Vector here.
|
| Config was also way less painful than traipsing through
| whatever hellscape the FluentBit/FluentD configs are.
| dale_glass wrote:
| And that's why journald is such a cool thing.
|
| * Want to parse stuff? journalctl -o json
|
| * Lots of stuff going on, need more precise timestamps? -o short-
| precise
|
| * Want metadata, like the pid? It's in there.
|
| * Want to know where to continue parsing? It supports cursors.
|
| * Want to save disk space? It compresses logs transparently and
| can trim the journal to whatever size you want.
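|
| A minimal sketch of consuming that JSON output (Python; the
| field names are the standard journal fields, error handling
| omitted):
|
|     import json, subprocess
|
|     out = subprocess.run(
|         ["journalctl", "-o", "json", "-n", "100", "--no-pager"],
|         capture_output=True, text=True, check=True).stdout
|
|     for line in out.splitlines():
|         entry = json.loads(line)      # one JSON object per line
|         print(entry.get("_PID"),
|               entry.get("MESSAGE"),
|               entry.get("__CURSOR"))  # resume via --after-cursor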
| InvertedRhodium wrote:
| Want to filter logs at collection time? Well, too bad - your
| use case is impure and you should just fix everything
| immediately, even that over which you have no direct control.
|
| https://github.com/systemd/systemd/issues/6432
| Poiesis wrote:
| I realize that the linked thread explains that decreasing the
| log level wouldn't work because everything else would be
| affected--but couldn't someone combine decreasing the log
| level with using a different journal namespace?
| dale_glass wrote:
| I actually mostly agree with Poettering here that it's kind
| of a hack and that the ideal is to collect everything and
| filter afterwards. You can't get back stuff you exclude by
| mistake.
|
| Programs should be logging because there's some value in the
| information being sent to the log. If it annoys everyone and
| serves no purpose, the program needs fixing.
| nobleach wrote:
| When I started playing with Linux back around 1996, I started
| learning the real value of having logs. It took me into my first
| career in IT. The Unix/Linux ethos of logging everything and
| quickly getting to the bottom of any problem was something that
| was missing from my MS Windows experience. Yes, the Event Viewer
| existed, but it was rarely all that useful. Somehow, lines in a
| text file... rotated at a predetermined interval... are just so
| simple yet completely effective. Through the years I've tried
| the other tools that are supposed to supplant it, yet I still
| think it was the best. I'm getting used to the systemd/journald
| way, but I really did like having a directory full of text files
| (and gzipped friends from previous days).
| Stranger43 wrote:
| And yet syslog works, to the point where anything sold as a
| syslog replacement ends up adding complexity (along with
| features) rather than simplifying the core problem.
|
| It's a general trend that old Unix tools work better in reality
| than in theory, something that's rare for more modern tools.
|
| Sure, it's nice being able to use more modern query tools and
| have graphing libraries available, but syslog, grep, and awk do
| get the job done and don't require a lot of resources to set up
| and maintain.
| dale_glass wrote:
| Yeah, no. Maybe that worked in the 80s, back when hardware was
| weak, and an admin could keep an eye on everything by hand. I
| dealt with parsing log files more recently than that, and it's
| a never-ending list of annoying bullshit to deal with:
|
| * Some stuff logs with syslog and some doesn't.
|
| * Formats vary, oh the fun of implementing the different
| variations.
|
| * You have to parse text dates back into unix time, when
| whatever wrote the file converted unix time to text in the first
| place. The timestamp probably lacks milliseconds, which is all
| kinds of fun in a modern setting where a hundred things can
| easily happen within any given second (see the sketch after this
| list).
|
| * Various edge cases. Where exactly does every field in the log
| file end? Can there be a newline? (yes, guaranteed). Can there
| be random binary junk (yup, sometimes).
|
| * How do you keep track of where you stopped parsing? How do
| you deal with the fact that the old log might have been removed
| and a new one with the same name has appeared?
|
| * Dealing with compression, log rotation, race conditions.
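|
| A rough sketch of that date round-trip (Python, purely
| illustrative): a classic RFC3164 timestamp carries neither a
| year nor a sub-second part, so both have to be guessed.
|
|     from datetime import datetime
|
|     stamp = "May 10 10:27:03"             # no year, no milliseconds
|     t = datetime.strptime(stamp, "%b %d %H:%M:%S")
|     t = t.replace(year=datetime.now().year)  # guess: "this year"
|     print(t.timestamp())                  # sub-second info is gone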
|
| It's simple on the surface. Actually writing a program that
| deals with all that stuff is bloody annoying, because none of
| it is what you actually want to get done. You typically want
| to detect important events, or graph something. Instead, 95%
| of the time goes on mind-numbing parsing minutiae, because the
| system was built for an admin using `tail` and `grep` in the
| 80s. It wasn't planned for a modern admin maintaining a few
| dozen computers, each of which logs multiple megabytes of
| stuff every hour.
|
| Never understood people who complain about journald, because
| this stuff was a pain in my butt a good decade before journald
| existed. I certainly don't have any fond memories from dealing
| with it.
| touisteur wrote:
| I'm guessing I would have preferred standardising something
| in syslog-ng and rsyslog, like a database-backed log-file
| format (SQLite?), instead of yet another piece of the yuge
| systemd hydra...
| dale_glass wrote:
| SQLite doesn't sound like a very good choice to me.
| journald's log format is little more than \0-separated
| fields, while SQLite is a good deal more complex than that.
| Also, journald allows for append-only logs, which I don't
| think SQLite supports.
|
| And personally I see the "hydra" as a benefit -- everything
| integrates well with everything else, because it's all
| designed to go well together.
| nimbius wrote:
| Anyone who's been sold Splunk as a drop-in syslog replacement
| for observability or visibility is painfully aware of this.
| Larger companies that use Splunk as a central auditing tool are
| invariably left with a byzantine nightmare of dashboards and
| search strings to figure out. People leave and roles change, and
| eventually you're running a nine-year-old server that can hardly
| tell you the time, let alone the state of nginx.
|
| Just write a script, regularly email the CSV to the people in
| charge, and let the PHB make it look pretty.
| znpy wrote:
| > and eventually youre running a nine year old server that
| can hardly tell you the time, let alone the state of nginx.
|
| thank you for the good laugh :)
| cb321 wrote:
| I think this is more that "more features" = "more marketable",
| which impacts far more than just "unix tools" or even just
| software.
|
| Also, simplification does not _never_ happen. E.g., I have a
| near-trivial impl (under 300 lines of Nim) at
| https://github.com/c-blake/kslog { yeah, it may not be of very
| wide appeal... See the first point. :-) }
|
| I feel a better factoring/separation of concerns is to
| disentangle file/data distribution from getting in-socket data
| somewhere persistent.
| viraptor wrote:
| At the stage where you're looking at query tools and graphs,
| syslog stops being easy. How do you transfer logs? What's the
| naming convention for files? What's the lifecycle and what
| ensures it? How do multiple people access the logs? What
| happens to logs if the network loses packets?
|
| All of that requires resources to design and keep running in
| practice. After dealing with a few log systems, I'd rather
| choose one of them for a non-trivial setup now than take bare
| syslog and redo all of this from scratch.
| touisteur wrote:
| Ideally you stream logs and try not to rely on local storage.
| Rsyslog is quite the workhorse and has a multitude of useful
| features (e.g. TCP + round-robin multipath transfer mechanisms,
| disk-backed queues, rate limiting), and if you're feeling it,
| writing extensions is not that hard. Just reading all the docs
| is mind-boggling...
| mrintegrity wrote:
| Started using Grafana Loki recently. So far it seems very good
| at consuming all the various log formats you might encounter,
| and you can parse them into metrics etc. as needed using regular
| expressions. Much nicer than dumping everything on a central
| syslog. My only "gripe" is the high learning curve compared to,
| say, the ELK stack.
| bogota wrote:
| Does it really have a higher learning curve than running ELK?
| ELK is a nightmare to manage at any kind of scale, in my
| experience.
|
| I have been looking at Loki, so I guess I should ask: is the
| issue in configuration or in operation? If it's harder to
| operate than ELK, I'm not going to touch it, though.
| donio wrote:
| I've only started to mess around with Loki but there is no
| way that it's more complex than a scaled up ELK stack.
| mrintegrity wrote:
| To be clearer: installing it and getting logs into the system
| is a piece of cake; getting useful information beyond the
| basics is more complicated, and in my experience not as
| straightforward as with the ELK stack. Kind of the exact
| opposite of ELK, in fact.
| smetj wrote:
| > Parsing all of that mess is extremely "hacky", with tons of
| regexes trying to account for all vendor quirks.
|
| Whilst that's possibly true, at least you/we have the option
| of doing so.
| viraptor wrote:
| What do you mean? You have this possibility with virtually
| every log system (on Linux, anyway). Some of them just make it
| much easier.
| jandrese wrote:
| Try doing it with the Windows Event Viewer.
|
| systemd's built-in log thing at least has a text dump
| feature, but god help you if the database gets corrupted.
| viraptor wrote:
| That's why I said on Linux. journald does not really have
| a database as such - it has an index plus sequential log
| data. You can run the data through strings (unless you
| enabled compression) to get at your data in an emergency.
| You'd lose just as much data from plain-text log corruption.
| miga wrote:
| Did you try the pcapng specification? I believe it is used by
| the new syslog, and it is a packet format for logging network
| interfaces, syslog files, and many more things together in a
| single format. https://wiki.wireshark.org/SampleCaptures
| https://wiki.wireshark.org/Development/PcapNg
| touisteur wrote:
| I second this. Pcapng is just a very simple, extensible file
| format. I actually use it for anything that needs timestamping
| and a block-sequence format, for post-processing.
| alecco wrote:
| As the maintainer of a secure syslog implementation around 2001,
| I was asked to join the IETF syslog group. There were a couple
| of syslog-ng guys, a Cisco guy, and I think a Microsoft guy,
| plus a couple of randoms that didn't participate. The
| Cisco/Microsoft guys there looked like people whose sole job was
| to sit on standards committees. They were not developers, at
| least not as a day job.
|
| I was trying to reach a compromise in a simple syslog standard so
| it would be easier to authenticate and analyze. And trying to
| make it good enough for non-*nix systems. Nobody else cared about
| this.
|
| It was one of the worst time wasters of my life. It was all
| politics. The syslog-ng guys were adamant about their proposal,
| which was a very, very over-complicated idea based on another
| standard (BEEP). And I strongly suspect the Cisco/Microsoft guys
| were intentionally trying to make the group not work, in subtle
| ways. After months, I just left.
|
| They eventually published RFC 3195. And it's barely used, of
| course.
|
| It seems Cisco's implementation still uses DIGEST-MD5 for
| authentication.
| 0xbadcafebee wrote:
| Sorry you had to go through that, but standards bodies are
| literally political bodies. That's why the commercial
| representatives are not devs. They are there to advocate for
| their own constituents, not help others. The Cisco/Microsoft
| guys just wanted to make sure nobody would break their
| products, and that their products would work with other things.
| That generally means changing as little as possible, which can
| definitely look like "not working".
___________________________________________________________________
(page generated 2021-05-10 23:01 UTC)