[HN Gopher] Show HN: Logparser - Alternative to GoAccess Written...
___________________________________________________________________
Show HN: Logparser - Alternative to GoAccess Written in Python
Author : lcnmrn
Score : 52 points
Date : 2021-09-23 12:05 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| simonw wrote:
| Suggestion: take the time to package this up for PyPI as
| something people can install using "pip install" (or "pipx
| install").
|
| This is hard the first time you do it, but worth learning because
| it's a really great way to distribute your Python software.
|
| I'm giving a talk about how to do this at PyGotham next month,
| but the notes from that talk are already available and may be
| useful to you: https://github.com/simonw/pygotham-packaging
|
| You may also find this cookiecutter template that I use to build
| and package Python CLI apps helpful:
| https://github.com/simonw/click-app
| dwohnitmok wrote:
| Is there a way to distribute proprietary software with PyPI?
| Based on the license text it appears the author wishes to keep
| it proprietary (maybe source-available, but not open source).
| uranusjr wrote:
| I suppose you actually mean closed source? Because it's
| trivial to distribute proprietary code on PyPI: Just say that
| in your license.
|
| There is no true "closed source" for pure Python programs, but
| if obfuscation is close enough, you can choose to only deploy
| wheels containing pre-compiled pyc files. This is good enough
| for most situations.
| vsajip wrote:
| The license is currently just "All rights reserved. Copyright
| (c) 2020 Lucian Marin".
|
| It was the MIT license at the time of the initial commit, and has
| since been updated to this. So it's not immediately clear whether
| anyone else can use Logparser - care to clarify, Lucian?
| css wrote:
| IMO this could benefit from using `collections.Counter` instead
| of `defaultdict(set)`.
| masklinn wrote:
| With a `Counter` you would be counting each access from a given
| IP as a hit against a category, rather than counting the IP
| itself.
|
| Currently if 4 clients hit one URL and 1 client hits 5
| logparser will register 5 records in each category (unless
| they're classified as bots for browsers and systems). With a
| Counter, it'd be 9.
|
| Both pieces of information could be made available using a
| `defaultdict(Counter)` but I don't know how useful that would
| be to the people actually using logparser.
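
A minimal sketch of the distinction described above, not logparser's actual code. The access records are made up, and it assumes the "1 client hits 5" case means one client making five requests. It contrasts unique-IP counting with `defaultdict(set)`, raw hit counting with `Counter`, and keeping both with `defaultdict(Counter)`:

```python
from collections import Counter, defaultdict

# Hypothetical (ip, url) records: four clients hit "/a" once each,
# and a single client hits "/b" five times (an assumed reading of the example).
accesses = [(f"10.0.0.{i}", "/a") for i in range(1, 5)] + [("10.0.0.9", "/b")] * 5

unique_ips = defaultdict(set)   # url -> set of distinct client IPs
hits = Counter()                # url -> number of requests
per_ip = defaultdict(Counter)   # url -> ip -> number of requests

for ip, url in accesses:
    unique_ips[url].add(ip)
    hits[url] += 1
    per_ip[url][ip] += 1

# Counting distinct IPs gives 5 records in total; counting requests gives 9.
unique_total = sum(len(ips) for ips in unique_ips.values())
hit_total = sum(hits.values())
print(unique_total, hit_total)
```

With `per_ip`, the unique-visitor count is `len(per_ip[url])` and the hit count is `sum(per_ip[url].values())`, so both figures come from one structure at the cost of more memory.
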
| eatonphil wrote:
| On a tangent, I've been looking into log parsing for an
| application I'm building recently.
|
| If you want to support pulling info out of common logs it's
| pretty simple to pull together a list of regexes for the default
| log format in each major system. Simple example here:
| https://github.com/multiprocessio/datastation/blob/master/sh....
|
| I use this in the app to be able to quickly pull info out of
| access logs for further analysis a la OP's app and GoAccess but
| in a GUI where you can also do further processing.
|
| Demo video of this here:
| https://www.youtube.com/watch?v=sCx2mF2jyUQ&t=9s.
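
As a rough illustration of the "list of regexes per log format" idea, here is a small sketch. It is not taken from the linked project. The pattern names, the `parse_line` helper, and the sample line are assumptions for illustration, and the patterns cover only the default Apache-style common and combined formats:

```python
import re

# Common Log Format: host ident user [time] "request" status size
COMMON = (
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)
# Combined format appends the referer and user agent as quoted fields.
COMBINED = COMMON + r' "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'

# Tried in order; the more specific pattern comes first.
PATTERNS = [
    ("combined", re.compile(COMBINED)),
    ("common", re.compile(COMMON)),
]

def parse_line(line):
    """Return (format_name, fields) for the first matching pattern, else None."""
    text = line.rstrip("\n")
    for name, pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return name, match.groupdict()
    return None

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_line(sample))
```

Requests containing escaped quotes would need a more careful pattern. This sketch deliberately ignores that case.
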
| linuxdude314 wrote:
| You can find a very comprehensive list of regex patterns by
| looking at Logstash's grok definitions:
|
| https://github.com/logstash-plugins/logstash-patterns-core/t...
| eu wrote:
| To be fair, GoAccess does a bit more (it has that WebSockets live
| view)
| edoceo wrote:
| That's not in the parse loop - where comparison is happening.
| joshyi wrote:
| Still, GoAccess outputs a lot more data, with support for custom
| logs.
| Svetlitski wrote:
| Are you certain your benchmarks are correct? The GoAccess FAQ
| states that it parses over 100,000 lines/second [1]. While this
| figure depends on the hardware used, this still is _massively_
| faster than the figure quoted in the README. Benchmarking is
| quite technical if you want consistent results, so some more
| information on the benchmarking methodology used here would be
| much appreciated.
|
| [1] https://goaccess.io/faq#performance
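
For reference, one simple way to get a lines/second figure that is less sensitive to noise is to time several passes over the same in-memory lines and keep the fastest. The sketch below assumes that approach. It is not the methodology used in the README or by GoAccess, and the `lines_per_second` helper and sample data are hypothetical:

```python
import time

def lines_per_second(parse, lines, repeats=5):
    """Time `parse` over all lines `repeats` times and report the best rate."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            parse(line)
        best = min(best, time.perf_counter() - start)
    return len(lines) / best

# Made-up input: the same access-log line repeated, parsed with a trivial splitter.
lines = ['127.0.0.1 - - [23/Sep/2021:12:05:00 +0000] "GET / HTTP/1.1" 200 512'] * 10000
rate = lines_per_second(str.split, lines)
print(f"{rate:,.0f} lines/s")
```

This measures parsing only. File I/O, report generation, and hardware differences would all need to be stated separately for a comparison between tools to be meaningful.
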
| makapuf wrote:
| I'm not sure it's an alternative yet; functionally it seems to be
| missing incremental parsing, live updates, interactive HTML and
| TUI interfaces, graphs,...
| patja wrote:
| Seems like a confusing name given that logparser for IIS log
| files has been around for a very long time.
| jerf wrote:
| I am skeptical of those benchmarks. This is written in Python,
| and, looking at the core loop, yes, it really is Python, not
| Python wrapped around C or some other acceleration technology.
| For pure Python to come out appearing to get four times the
| throughput of a C program is pretty dubious. That would have to
| be one crappy C program. GoAccess looks like it ought to be far
| enough along that somebody has at least taken a bit of a crack at
| optimization, but, perhaps not. C ought to be able to smoke pure
| Python at this task. (Possibly, you know, _unsafely_, where a
| crafted referrer may get to arbitrary code execution or
| something, but still it ought to be _way faster_.)
| [deleted]
| messe wrote:
| > This is written in Python, and, looking at the core loop,
| yes, it really is Python, not Python wrapped around C or some
| other acceleration technology
|
| It seems to use a library, clfparser, to parse Apache Common Log
| Format logs; internally that uses Python's regex engine, which
| is written in C.
|
| 6000 lines/s seems incredibly slow to me for a C program parsing
| a log file. I'm seeing a lot of strstr's, strlen's, strdup's,
| and strchr's in GoAccess's parse.c, all of which are O(n) per
| line and, while fine in isolation, could be causing GoAccess to
| do quite a bit more work per line than just using an optimized
| regex engine.
| brundolf wrote:
| I wonder what percentage of real-world C programs are
| exponentially slower than they could be because of the str
| functions
| jerf wrote:
| Thank you. That does sound like something that could resolve
| my skepticism into concrete facts.
___________________________________________________________________
(page generated 2021-09-23 23:01 UTC)