[HN Gopher] Show HN: My Single-File Python Script I Used to Repl...
___________________________________________________________________
Show HN: My Single-File Python Script I Used to Replace Splunk in
My Startup
My immediate reaction to today's news that Splunk was being
acquired was to comment in the HN discussion for that story: "I
hated Splunk so much that a few months ago I spent a couple of
days writing a single 1,200-line Python script that does
absolutely everything I need in terms of automatic log
collection, ingestion, and analysis from a fleet of cloud
instances. It pulls in all the log lines, enriches them with
useful metadata like the IP address of the instance, the machine
name, the log source, the datetime, etc., and stores it all in
SQLite, which it then exposes through a very convenient web
interface using Datasette. I put it in a cronjob and it's
infinitely better (at least for my purposes) than Splunk, which
is just a total nightmare to use, and it can be customized super
easily and quickly. My coworkers all prefer it to Splunk as
well. And oh yeah, it's totally free instead of costing my
company thousands of dollars a year! If I owned CSCO stock I
would sell it-- this deal shows incredibly bad judgment."

I had been meaning to clean it up a bit and open-source it but
never got around to it. However, someone asked today in response
to my comment whether I had released it, so I figured now would
be a good time to go through it and clean it up, move the
constants to an .env file, and create a README.

This code is obviously tailored to my own requirements for my
project, but if you know Python, it's extremely straightforward
to customize it for your own logs. (Plus, some of the logs are
generic, like systemd logs and the output of netstat/ss/lsof,
which it combines to get a table of open connections by process
over time for each machine-- extremely useful for finding code
that is leaking connections!) I also included the actual sample
log files from my project that correspond to the parsing
functions in the code, so you can easily reason by analogy to
adapt it to your own log files.

As many people pointed out in responses to my comment, this is
obviously not a real replacement for Splunk for enterprise users
who are ingesting terabytes a day from thousands of machines and
hundreds of sources. If it were, hopefully someone would be
paying me $28 billion for it instead of me giving it away for
free! But if you don't have a huge number of machines and really
hate using Splunk while wasting thousands of dollars on it, this
might be for you.
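
Reduced to a sketch, the enrich-and-ingest core looks roughly like
this (the table and field names here are illustrative, not the
exact ones the script uses):

    import sqlite3
    from datetime import datetime, timezone

    def ingest_lines(db_path, instance_ip, machine_name, log_source, lines):
        # Enrich each raw log line with instance metadata and store
        # everything in one SQLite table for Datasette to serve.
        conn = sqlite3.connect(db_path)
        conn.execute(
            """CREATE TABLE IF NOT EXISTS log_lines (
                   instance_ip TEXT, machine_name TEXT, log_source TEXT,
                   ingested_at TEXT, line TEXT)"""
        )
        now = datetime.now(timezone.utc).isoformat()
        conn.executemany(
            "INSERT INTO log_lines VALUES (?, ?, ?, ?, ?)",
            [(instance_ip, machine_name, log_source, now, ln) for ln in lines],
        )
        conn.commit()
        conn.close()
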
Author : eigenvalue
Score : 249 points
Date : 2023-09-21 16:26 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| surfingdino wrote:
| "Hell hath no fury like a Python dev annoyed"
|
| :-)
|
| Thank you for sharing!
| ozfive wrote:
| Hell hath no fury like any dev annoyed by something they could
| build themselves...
| languagehacker wrote:
| Ah, the hubris of the single developer who believes they can
| replace a battle-tested product from a company with innumerable
| decades of combined human effort.
|
| Glad it works for you!
| cooper_ganglia wrote:
| I mean, it seems like it's working for them. Not every single
| startup needs the same solutions as a larger company,
| especially when the solution is as expensive as Splunk!
| xorcist wrote:
| Fetching logs regularly sounds hard? Wouldn't you need to keep
| track of the position of all files, with heuristics around file
| rotations? And if something catastrophic happens, the most
| interesting data would be in that last block which couldn't be
| polled?
|
| Normally you'd avoid all that complexity by shipping logs the
| other way, sending from each machine. That way you can keep state
| locally should you need to. All unix-like systems do this out of
| the box, and almost all software supports the syslog protocol to
| directly stream logs. But you can also use something like
| filebeat and a bunch of other modern alternatives.
|
| The analyzer can then run locally on the log server and a whole
| lot of complexity just disappears.
| eigenvalue wrote:
| I considered doing it the way you described, but then you need
| to deploy software on every single one of your machines and
| make sure it's running, that it's not accidentally using up 99%
| of your CPU (I've had bad experiences with the monitoring
| agents for Splunk and Netdata misbehaving and slowing down the
| machines and causing problems), etc. Whereas with the "pull"
| approach I used in my tool, you don't need to deploy ANY
| software to the machines you are monitoring-- you just connect
| with SSH and grab the files you need and do all the work on
| your control node.
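|
| To make the "pull" concrete, the core of it is roughly this (a
| simplified sketch; the host and log path are illustrative):
|
|     import subprocess
|
|     def pull_log(host, remote_path):
|         # Agentless pull: cat the file over ssh from the control
|         # node; nothing is installed on the monitored machine.
|         result = subprocess.run(
|             ["ssh", host, "cat", remote_path],
|             capture_output=True, text=True, check=True,
|         )
|         return result.stdout.splitlines()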
| xorcist wrote:
| One way or the other, your hosts are running your
| application, and you are already deploying software on every
| single host. But I hear you with some of the agents. That's
| why I mentioned syslog.
|
| It's already there, it's supported by most logging packages,
| and it's dead simple. No additional software required. All
| text. What it doesn't do is structured logging, but analyzing
| on the log host is often enough.
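|
| For example, shipping application logs to a central syslog
| server takes nothing beyond the Python stdlib (the hostname
| and port below are illustrative):
|
|     import logging
|     from logging.handlers import SysLogHandler
|
|     # Send log records to a central syslog server over UDP;
|     # no extra agent needed on the sending machine.
|     handler = SysLogHandler(address=("loghost.example.com", 514))
|     logging.getLogger().addHandler(handler)
|     logging.warning("connection pool exhausted")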
|
| Agents aren't all that bad, however, and you're likely already
| running some agent like icinga or zabbix for regular
| monitoring.
| runjake wrote:
| Neat! Definitely a better solution for single source logs. Splunk
| is ridiculous and Cisco acquiring it isn't going to make that
| better.
|
| For others with a bit more complex needs, take a look at the free
| (or paid) versions of Graylog Open[1].
|
| It's really improved over the years. I had messed with Graylog in
| its early days but was turned off by it. A few years back, I
| encountered someone doing some neat stuff with it. It looked much
| improved. I stood up a "pilot project" to test, and it's now been
| running for years and several different people use it for their
| areas of responsibility.
|
| It does log collection/transforming and graphing and dashboarding
| and we use the everloving crap out of it at work. I wish I could
| publicly post some of the stuff we're doing with it.
|
| It takes input from just about any source.
|
| 1. https://graylog.org/products/source-available/
| catlover76 wrote:
| That's cool!
|
| I disagree though that the deal shows any bad judgment on Cisco's
| part; the gravamen of whether the acquisition was good is not
| whether many software developers can quickly develop replacements
| for their own use-cases, or how ergonomic the software is, but
| whether Splunk is a profitable business with a bunch of paying
| subscriptions/contracts that aren't going to go away any time
| soon.
| rollcat wrote:
| "This simple tool solves X at my org" is probably the most
| underrated type of project. There's not enough room to
| overcomplicate something that isn't a core part of the business,
| it must be practical to maintain, simple&stupid enough so that
| onboarding is not a hurdle, etc.
|
| I encourage everyone to share your "splunk in 1kloc of Python"
| projects! Some of my own:
|
| - https://github.com/rollcat/judo is Ansible without Python or
| YAML
|
| - https://github.com/rollcat/zfs-autosnap manages rolling ZFS
| snapshots
| olafalo wrote:
| For me, it's configinator[0]. Write a spec file for a config
| like [1], get a Go file that loads a config from environment
| variables like [2]. Code-gen only, no reflection, fairly type-
| safe, supports enums, string, bool, and int64. I made it
| because it was gross to add new config vars in a project at
| work, and it's come in handy a lot!
|
| [0] https://github.com/olafal0/configinator
|
| [1]
| https://github.com/olafal0/configinator/blob/0576a53970bcb4d...
|
| [2]
| https://github.com/olafal0/configinator/blob/0576a53970bcb4d...
| emodendroket wrote:
| > There's not enough room to overcomplicate something that
| isn't a core part of the business, it must be practical to
| maintain, simple&stupid enough so that onboarding is not a
| hurdle, etc.
|
| You would think. But no, there is lots of room to make it
| overcomplicated without the corresponding efforts to manage the
| complexity.
| dynisor wrote:
| Don't worry, that's just tech debt and we will deeefinitely
| come back to it next sprint.
| [deleted]
| eigenvalue wrote:
| Thanks, based on the dismissive replies to my original comment
| in the Splunk acquisition discussion, I thought this would get
| a lot of hostile takes saying that it was dumb, that I
| reinvented the wheel because I didn't want to spend 2 weeks
| trying to figure out opentelemetry nonsense and tools X, Y, and
| Z, that it was trivial, that it wouldn't scale, etc.
|
| But people are actually being surprisingly nice and friendly! I
| guess people just really hate Splunk!
| egwor wrote:
| I suggest you sell it to Oracle, get some popcorn and watch
| the Cisco vs Oracle log war begin!
| RexM wrote:
| mired in antitrust lawsuits
| facorreia wrote:
| Your project sounds like something that definitely could come
| in handy. I forked it as a "bookmark". I particularly like
| the idea of storing the data in a local SQLite database. Not
| everything needs to be "web scale".
| silasb wrote:
| Not only that, but with https://litestream.io/ things
| become even more interesting.
|
| I'm currently using this for a small application to easily
| back up databases in Docker containers.
| eigenvalue wrote:
| The real power is that with datasette you can instantly
| turn that SQLite DB into a full-fledged responsive web app
| with arbitrarily complex filtering and sorting, which all
| gets serialized to the URL so you can share it and bookmark
| it for future use.
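|
| A minimal sketch of that, assuming datasette and uvicorn are
| installed (the file name is illustrative; the CLI equivalent is
| just `datasette logs.db`):
|
|     import uvicorn
|     from datasette.app import Datasette
|
|     # Serve an existing SQLite file as a browsable, filterable
|     # web UI; the current view is serialized into the URL.
|     ds = Datasette(files=["logs.db"])
|     uvicorn.run(ds.app(), host="127.0.0.1", port=8001)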
| _aavaa_ wrote:
| > reinvented the wheel
|
| I hate this meme. It's as if cars, trains, and airplanes all
| use the same wheels. Or that wheels under my stove, my tiny
| filing dresser, and my shopping cart are all the same.
|
| Oh yeah, re-inventing the wheel, what a stupid idea and
| something we obviously don't frequently do and for good
| reasons.
|
| This meme is almost as bad as the horrible misquoted
| "premature optimisation is the root of all evil".
| xorcist wrote:
| Your software is cool but the description is a bit unfair to
| Ansible. Ansible works by solving for desired state. This
| software runs scripts; it replaces "for host in ...; do ssh
| $host < script.sh; done".
| matrss wrote:
| Last time I checked, ansible playbooks were also essentially
| just sequential steps to execute via ssh, albeit in yaml
| format. There was certainly no way to describe a desired
| state nor did ansible accomplish consistently bringing a
| system into some desired state. Two executions of the same
| playbook could result in a very different system state for
| example, depending on what happened in between.
|
| The only systems I am aware of which are reliably capable of
| "solving for desired state" are nix and guix.
| xorcist wrote:
| You _can_ run sequential commands with ansible, but then
| you're just using it as a replacement for "ssh $host <
| script.sh". You would be missing out on most of the
| usefulness of the tool. It's meant to be used
| declaratively. The manual describes it quite well. In the
| same type of tools are puppet and indeed nix, but with one
| important difference for the latter: nix is also a package
| manager, which allows for a more fine-grained state
| specification.
| eigenvalue wrote:
| Ansible is definitely all about solving for a specified
| target state and ensuring that it is reached. That's built
| into Ansible's syntax itself, which is how it can be
| used totally declaratively. And if you stick to the native
| idiomatic Ansible way of doing everything (as opposed to
| doing hacky stuff with ad hoc shell commands), you get
| automatic idempotence and other nice stuff "for free".
| aaviator42 wrote:
| My org's apps heavily use this simple key-value interface built
| on sqlite: https://github.com/aaviator42/StorX
|
| There's also a bunch of other purpose-built tiny utilities on
| that GitHub account:
| https://github.com/aaviator42?tab=repositories
| simonw wrote:
| I love this!
|
| Log analysis isn't one of the core use-cases for Datasette, but
| I've done my own experiments with it that have worked pretty well
| - anything up to 10GB or so of data is likely to work just fine
| if you pipe it into SQLite, and you could go a lot larger than
| that with a bit of tuning.
|
| I added some features to my sqlite-utils CLI tool a while back to
| help with log ingestion as well:
| https://simonwillison.net/2022/Jan/11/sqlite-utils/
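|
| A minimal sketch of the ingestion side with the sqlite-utils
| Python API (the table and field names are illustrative):
|
|     import sqlite_utils
|
|     db = sqlite_utils.Database("logs.db")
|     rows = [{"host": "web-1", "source": "nginx", "line": "GET / 200"}]
|     # alter=True grows the table schema as new fields appear
|     db["log_lines"].insert_all(rows, alter=True)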
| nurettin wrote:
| > If I owned CSCO stock I would sell it-- this deal shows
| incredibly bad judgment."
|
| That may be so, but beware that acquisitions usually increase
| stock price rather than decrease it.
| candiddevmike wrote:
| Is that over time or immediately after the acquisition?
| https://www.google.com/finance/quote/CSCO:NASDAQ?&window=5D
| nurettin wrote:
| The effect should show up pretty quickly. Maybe the
| acquisition was already priced in, or it's small relative to
| the whole company and that move is just quarterly earnings.
| Not sure.
| CliffStoll wrote:
| Oh, how I wish I had your scripts (and insights!) when I was
| analyzing Unix logs in 1986, looking for the footprints of an
| intruder...
| eigenvalue wrote:
| Thanks for the comment! Going to check out your book now-- I
| somehow hadn't heard of it before despite it being right down
| my alley!
| neilk wrote:
| You should write about that sometime! /s
| h0p3 wrote:
| I'm kinda glad you didn't; it might have made the book I read
| as a kid (and again as an adult, and again with my offspring)
| less interesting somehow.
| hliyan wrote:
| A long long time ago, I used a series of tail -f's and unix pipes
| to aggregate logs, and grep, less and awk to analyse them. There
| were about 20 different services written in C++, each producing
| over 1GB of logs each day. Managed to debug some fairly complex
| algorithmic trading bugs. Twenty years later, I still can't
| fathom why we're spending so much money on Splunk, DataDog, and the
| like.
| dsXLII wrote:
| Volume. 1GB of data per day is a rounding error. If you have tens
| of thousands of servers, each generating hundreds of gigabytes
| of data per day, tail -f and grep don't scale especially well.
| vincnetas wrote:
| 100GB of logs per day? what kind of applications are that
| chatty?
| jononomo wrote:
| Yeah, the solution here is to get rid of 98% of the
| logging.
| getrealyall wrote:
| And I bet a hang glider can't fly from New York to Paris,
| either! The nerve!
|
| Recall that the poster said this was for a small startup. If
| you're Google, by all means, use Google logging tools. If you
| aren't, then solve the problem you have, not the problem your
| resume needs.
| phyrex wrote:
| The guy asked
|
| > Twenty years later, I still can't fathom why we're
| spending so much money on Splunk, DataDog, and the like.
|
| And the poster above answered that question
| 10000truths wrote:
| They scale perfectly fine, as long as you filter locally
| before aggregating. Lo and behold:
|
|     mkdir -p /tmp/ssh_output
|     while read ssh_host; do
|         ssh "$ssh_host" grep 'keyword' /var/my/app.log \
|             > "/tmp/ssh_output/${ssh_host}.log" &
|     done < ssh_hosts.txt
|     wait
|     cat /tmp/ssh_output/*.log
|     rm -rf /tmp/ssh_output
|
| Tweak as needed. Truncation of results and real-time tailing
| are left as an exercise to the reader.
| getrealyall wrote:
| Financialization and mediocre developers. I haven't worked with
| too many people I could actually trust to even emit logs
| correctly, let alone develop a tool to collect and aggregate
| them.
|
| I've also been told, time and again, in no uncertain terms, to
| "buy as much as possible". We've reached the logical conclusion
| of SaaS-everything: every company just cobbles together
| expensive, overcomplicated computers from other expensive,
| overcomplicated computer providers, resulting in expensive,
| bloated systems that barely work.
| diarrhea wrote:
| Buying everything and SaaSing the whole place up is a true
| killjoy. I giggle with joy whenever I am allowed to write
| code. And then a support request comes in that I get assigned
| to, "thing in SaaS doesn't work please fix". And all you have
| to debug that SaaS is their UI. The checkbox in question is
| on, you notice, so it can only be a bug on their side. Off to
| contacting support as the only available avenue. Incredibly
| boring.
| fragmede wrote:
| What did you use for visualization in that stack? The fact that
| I can "|" (pipe) my data and make bar and pie charts is what
| really does it for me. What's really money is trivially being
| able to see requests coming in overlaid on a world map. I was
| sold the first time I saw that because it let me fix an issue
| that would have taken me hours to suss out just grepping
| around.
|
| More power to you for using sed, awk, and grep; they're powerful
| tools and every computer person should know how to use them.
| But if you're hung up on _only_ using sed, awk, and grep for
| emotional reasons, that's self-limiting. We have better tools
| today, and you don't get hero points for using shittier tools
| when there are better ones available to you.
|
| https://www.splunk.com/en_us/blog/tips-and-tricks/mapping-wi...
| KaiserPro wrote:
| I used to work at a splunk shop. It was used for alerting,
| graphing & prediction. It was critical to how the company
| functioned.
|
| There was lots of stuff that relied on splunk, and we had splunk
| specialists who knew the magic splunkQL to get the graph/data
| they wanted.
|
| However, we managed to remove most of the need for splunk by
| using graphite/grafana. It took about 2-3 years but it meant that
| non-techs could create dashboards and alerts.
|
| As someone once told me, splunk is the most expensive way you can
| ignore your data.
| rgrieselhuber wrote:
| Thanks for sharing.
| Okkef wrote:
| "log files of several several gigabytes ... Process them in
| minutes"
|
| Thanks but no thanks.
| eigenvalue wrote:
| I misspoke there-- meant to say:
|
| "The application has been tested with log files several
| gigabytes in size from dozens of machines and can process all
| of it in minutes."
|
| That's the time it takes to connect to 20+ machines, download
| multiple gigs of log files from all of them, and parse/ingest
| all the data into SQLite. If you have a big machine with a
| lot of cores and a lot of RAM, it's incredibly performant for
| what it does.
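|
| The concurrent fetch is essentially a thread pool over the host
| list (a simplified sketch; the host names and log path are
| illustrative):
|
|     import subprocess
|     from concurrent.futures import ThreadPoolExecutor
|
|     def fetch(host):
|         # One ssh session per host, all running in parallel.
|         proc = subprocess.run(
|             ["ssh", host, "cat", "/var/log/syslog"],
|             capture_output=True, text=True,
|         )
|         return host, proc.stdout
|
|     hosts = [f"host-{i}" for i in range(1, 21)]
|     with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
|         results = dict(pool.map(fetch, hosts))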
| dingdong33 wrote:
| [flagged]
| oblvious-earth wrote:
| Quickly skimming, some points that would irritate me if I had to
| maintain this script:
|
| * Importing Paramiko but regularly calling `ssh` via subprocess
|
| * Unused functions like `execute_network_commands_func`
|
| * Sharing state via a global instead of creating a class
|
| Overall it's fit for purpose, but makes a lot of assumptions
| about the host and client machines. As you said in the thread
| you're running a very small number of servers (fewer than 30).
| I've written similar things over the years and they are great
| for what you need.
|
| When I heavily used Splunk (back in 2013) I was in an application
| production support team that managed over 100 production servers
| for over a dozen applications; there were dozens of other teams
| in similar situations across the company. The Splunk instance was
| managed by a central team, made minimal assumptions about the
| client environment, had well-defined permissions, understood
| common and esoteric logging formats, and could reinterpret the
| log structure at query time. A script like this is not competing
| in that kind of situation.
| eigenvalue wrote:
| Thanks for the feedback. I do use Paramiko for some things. I
| tried to use it for everything in the project but ran into some
| weird stuff that wouldn't work reliably for me, which is why I
| switched some of it over to using SSH directly via subprocess
| (it was a few months ago so I don't even remember now what it
| was; I believe it was also performance related, since I'm
| trying to SSH to tons of machines concurrently).
|
| I guess I did forget to use the execute_network_commands_func.
| I'm using the ruff linter extension in VSCode now which would
| have flagged that to me, but back when I made this I wasn't.
|
| I don't think globals are so awful for certain things. I prefer
| a more functional approach where you have simple composable
| standalone functions instead of classes. Obviously classes have
| a role, but I find they sometimes overly complicate things and
| make the logic harder to follow and debug.
|
| Anyway, I do appreciate that someone took the time to actually
| read through the code!
| msto wrote:
| * barely any comments and not a single docstring in the entire
| kiloline file
| eigenvalue wrote:
| I find comments annoying to read and write and distracting.
| I'd rather fit more code on the screen at once and instead
| focus on making the variable names and function names really
| descriptive and clear so you immediately grasp what it's
| doing from context alone. Nowadays, if you really need
| comments to tell you what code is doing, you can just throw
| it into ChatGPT and get it that way.
| fastasucan wrote:
| I really like the tool you made, and appreciate that it helps
| your company save money as well! I don't think it matters
| that this isn't a perfect fit for everyone else (as you
| said, this was something you made to solve your problem) --
| but boy do I disagree with making "the variable names and
| function names really descriptive and clear so you immediately
| grasp what it's doing from context alone". What counts as a
| descriptive function or variable name depends heavily on how
| familiar you are with the context the program operates in.
| Take `execute_network_commands_func` from above: this
| descriptive name says nothing about which network commands
| are executed. With docstrings it would be easy to document
| the inputs and outputs of this function.
| facorreia wrote:
| It's not that much code and it has sensible function names. I
| appreciate that OP took the time to share his tool with us.
| babuloseo wrote:
| Thanks for the share. I still find it hilarious how Python is
| installed by default on most distros. I was working on some
| compression tools and the OS didn't ship with zip/unzip tools
| by default, but the Python standard library's zipfile module
| did.
|
| https://docs.python.org/3/library/zipfile.html
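|
| For example, a stdlib-only extract (paths are illustrative):
|
|     import zipfile
|
|     # No external zip/unzip binaries needed.
|     with zipfile.ZipFile("archive.zip") as zf:
|         zf.extractall("target-dir/")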
| codetrotter wrote:
| > the OS didn't ship with zip/unzip tools by default
|
| Some versions of tar are able to extract zip files. Try:
|
|     tar xf somefile.zip
|
| It might or might not work with the version in your OS.
| BerislavLopac wrote:
| You don't even have to write a custom script around the
| library:
|
|     python -m zipfile -e monty.zip target-dir/
|
| https://docs.python.org/3/library/zipfile.html#command-line-...
___________________________________________________________________
(page generated 2023-09-21 23:03 UTC)