[HN Gopher] Minimally Invasive (and More Accurate) Analytics: Go...
___________________________________________________________________
Minimally Invasive (and More Accurate) Analytics: GoAccess and
Athena/SQL
Author : Chris911
Score : 73 points
Date : 2021-02-16 16:17 UTC (6 hours ago)
(HTM) web link (brandur.org)
(TXT) w3m dump (brandur.org)
| alias_neo wrote:
| I have always used GoAccess on my blog (https://2byt.es) which
| gets very modest traffic because I don't post much and don't
| advertise outside of my few twitter followers. Privacy has always
| been a core principle of mine.
|
| I've found that over time, crawlers drown out the numbers of
| actual visitors but I find GoAccess hard to use to get any
| meaningful data from when interesting things do happen.
|
| Can anyone suggest a way I can do something similar to this
| without relying on a service I don't host (and without having to
| write parsers into a SQL or similar DB by hand)?
| marvinblum wrote:
| https://pirsch.io/ I'm one of the founders :) Check my post
| above (or below, who knows on hn?) for more details, or ask me.
|
| [Edit] Sorry, I misread "which I don't host" as the other way
| around. You can check out the open-source core library, which
| might work for you if you put in some work.
| mrwnmonm wrote:
| Wow, Go support, thanks.
| herunan wrote:
| I use Cloudflare Web Analytics. Since I use Cf, I thought: why
| not utilize their analytics? Anonymous, no cookie, no
| fingerprinting and no localStorage.
|
| Edit: also, no JS if you have Pro (20usd/mo).
| mrwnmonm wrote:
| I read that DNS analytics is not accurate either.
| XCSme wrote:
| Semi-related question: Is any type of web analytics 100%
| accurate?
| mrwnmonm wrote:
| I meant not as accurate as client tracking if ad blockers
| were not used. As far as I understand it.
| joshyi wrote:
| I'm waiting as well for the ability to use filters directly
| from goaccess. Hope they get to it soon!
| https://github.com/allinurl/goaccess/issues/117
| tstack wrote:
| lnav[0] can take care of parsing the log files and displaying
| them in a TUI. There's also a SQLite interface for doing
| queries. However, you'll need to build the filters/queries
| yourself, there aren't any built-in ones at the moment.
|
| [0] - https://lnav.org
| alias_neo wrote:
| Perfect, thank you, this is right up my street!
|
| I host a static site for my blog, using Hugo so "no server"
| etc is exactly what I need, and writing the filters/queries
| myself leaves me in control of getting what I need out of
| them.
| conradfr wrote:
| There is the --ignore-crawlers argument. On my modest projects
| it seems effective, but I haven't looked at it too closely.
| e12e wrote:
| For a long time I've thought about scavenging the robot
| identifiers from Matomo, nee Piwik - so one might leverage the
| hive mind in updating robot identifiers, and use it to strip
| plain access logs for easier use with tools like goaccess..
| tobilg wrote:
| For a readymade stack similar to what's described in the article
| to self-host in your own AWS account have a look at
| https://ownstats.cloud
| marvinblum wrote:
| I've found the same issue. A lot of traffic will get blocked if
| you use a simple JavaScript integration. The solution is
| (obviously) to track from the backend and provide a simple
| dashboard for it. I started building a library [0] written in Go
| that I could integrate into my website, and by the end of last
| year it had become a product (in beta right now) called Pirsch
| [1]. We offer a JS integration to onboard customers more easily,
| but one of the main reasons we built it is that you can use it
| from your backend through our API [2]. We plan to add more SDKs
| and plugins (Wordpress, ...) to make the integration easier, but
| it should be fairly simple already.
|
| I would love to hear feedback, as we plan to fully release it
| soon :)
|
| [0] https://github.com/pirsch-analytics/pirsch
|
| [1] https://pirsch.io/
|
| [2] https://docs.pirsch.io/get-started/backend-integration/
|
| [Edit]
|
| I forgot to mention my website, which I initially created Pirsch
| for. The article I wrote about the issue and my solution is here:
| https://marvinblum.de/blog/server-side-tracking-without-cook...
| fatsdomino001 wrote:
| Is it possible to integrate pirsch into a heroku deployment?
| marvinblum wrote:
| I haven't worked with heroku yet, but if you can make an API
| request, yes. You can read about how that works here:
| https://docs.pirsch.io/api-sdks/api/
| fatsdomino001 wrote:
| Looks very interesting.
|
| Yeah I mean, I'm just running a django site, so I imagine I
| could add a custom middleware that makes an API request on
| every page load. I guess it would have to try and see if
| the access token is expired first? and if so grab a new one
| then make the hit. Is that the recommended setup?
|
| Would I be able to pass extra information to be included in
| the logs, like e.g. username?
|
| Also, I know you have good privacy policies, but still
| sending this information through a request makes me nervous
| nevertheless, even though it's of course miles better than
| js-based solutions. But what are your thoughts on how
| possible is it for these requests to be intercepted and
| this logged data siphoned off by someone else?
| marvinblum wrote:
| > I guess it would have to try and see if the access
| token is expired first? and if so grab a new one then
| make the hit. Is that the recommended setup?
|
| Exactly. The token expires after 15 minutes, so you need
| to check the response and issue a new token should it
| have expired. You can read our docs on how to do that or
| take a look at our Go SDK [0] and re-implement it in
| Python. Unfortunately, I don't have enough time to
| provide one right now.
|
| > Would I be able to pass extra information to be
| included in the logs, like e.g. username?
|
| That's not possible right now, but you will be able to
| send custom events in the future.
|
| > But what are your thoughts on how possible is it for
| these requests to be intercepted and this logged data
| siphoned off by someone else?
|
| Highly unlikely. All traffic is SSL encrypted, the
| internal communication of our server cluster is
| encrypted, the database, ... I mean, software can always
| be hacked, but I spend a lot of my time on infrastructure
| and security.
|
| [0] https://github.com/pirsch-analytics/pirsch-go-
| sdk/blob/maste...
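The refresh-and-retry flow described above can be sketched in Python. This is a minimal sketch, not the Pirsch SDK: `HitSender`, `TokenExpired`, `fetch_token`, and `send` are hypothetical names, and the two callables stand in for whatever HTTP calls the real API needs. The proactive expiry check is an added assumption; the retry on rejection is the part the comment describes.

```python
import time


class TokenExpired(Exception):
    """Raised by the (hypothetical) send callable when the API rejects the token."""


class HitSender:
    """Caches an access token and refreshes it when it expires or is rejected."""

    def __init__(self, fetch_token, send, ttl_seconds=15 * 60, clock=time.monotonic):
        self._fetch_token = fetch_token  # hypothetical: returns a fresh token string
        self._send = send                # hypothetical: send(token, payload)
        self._ttl = ttl_seconds          # 15 minutes, as stated in the thread
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def _refresh(self):
        self._token = self._fetch_token()
        self._expires_at = self._clock() + self._ttl

    def hit(self, payload):
        # Refresh up front if there is no token yet or the local timer ran out.
        if self._token is None or self._clock() >= self._expires_at:
            self._refresh()
        try:
            return self._send(self._token, payload)
        except TokenExpired:
            # The server rejected the token: refresh once and retry once.
            self._refresh()
            return self._send(self._token, payload)
```

In a Django middleware, one shared `HitSender` instance would be created at startup and `hit()` called per request, ideally off the request path so a slow analytics call does not delay the page.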
| joshyi wrote:
| We use goaccess against a pretty busy centralized log server and it
| has worked really well for years. We don't have to worry about JS
| and that's always a plus. I personally like how it follows the
| unix philosophy.
| megous wrote:
| Hmm, it occurred to me that you can probably get a nice list of
| robot user-agents by querying all UAs that accessed the
| robots.txt file. I don't think normal browsers touch that file.
|
| Another thing to do on the cheap, if you want more usable logs,
| is JSON logging[1] (one object per line). This is trivial to
| import into PostgreSQL and also trivial to query via tools like
| jq, as is.
|
| [1] Example: https://stackoverflow.com/questions/25049667/how-to-
| generate...
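Both ideas above combine naturally: with one JSON object per line, collecting the user agents that fetched robots.txt takes a few lines of Python. This is a rough sketch under assumptions: the field names `request_uri` and `http_user_agent` are hypothetical and depend on how the log format is defined, and the sample entries are made up.

```python
import json


def parse_entries(lines):
    """Yield one dict per valid JSON log line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry


def robot_user_agents(lines, path_key="request_uri", ua_key="http_user_agent"):
    """Return the set of user agents that requested /robots.txt."""
    robots = set()
    for entry in parse_entries(lines):
        path = str(entry.get(path_key, "")).split("?", 1)[0]  # drop query string
        if path == "/robots.txt":
            robots.add(entry.get(ua_key, ""))
    return robots


def human_entries(lines, robots, ua_key="http_user_agent"):
    """Return log entries whose user agent never asked for robots.txt."""
    return [e for e in parse_entries(lines) if e.get(ua_key, "") not in robots]


# Made-up sample data in the assumed format.
SAMPLE_LOG = [
    '{"request_uri": "/robots.txt", "http_user_agent": "ExampleBot/1.0"}',
    '{"request_uri": "/post/1", "http_user_agent": "ExampleBot/1.0"}',
    '{"request_uri": "/post/1", "http_user_agent": "Mozilla/5.0"}',
]

robots = robot_user_agents(SAMPLE_LOG)
humans = human_entries(SAMPLE_LOG, robots)
```

The filter only catches crawlers that bother to read robots.txt, so it is a complement to a maintained bot list rather than a replacement.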
| heipei wrote:
| Logging JSON directly from nginx is what I currently do, and
| then the log output is ingested straight into ElasticSearch.
| One neat thing you can do is also log return headers from an
| upstream HTTP server, such as a username for example or any
| application-specific piece of data. That way you can interleave
| your HTTP access logs with application data and have everything
| available for querying in one index.
| twotwotwo wrote:
| I want to second that plug for Athena for ad-hoc analysis. (If
| you're hosting your own stuff and at the scale where it'd be
| useful, there's Presto/Hive, which Athena is based on, and/or
| Trino, the Presto fork maintained by some of its initial
| developers.)
|
| It was useful for me when tweaking spam/bot detection rules a
| while ago; if I could roughly describe a rule in a query, I could
| back-test it on old traffic and follow up on questionable-looking
| results (e.g. what other requests did this IP make around the
| time of the suspicious ones?). We also used Athena on a project
| looking into performance, and on network flow logs. The lack of
| recurring charges for an always-on cluster makes it great for
| occasional use like that.
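The back-testing workflow in the paragraph above (write the rule as a query, then pull everything else a flagged IP did) can be tried locally. The sketch below swaps in Python's built-in sqlite3 for Athena, so it shows the shape of the workflow and not Athena itself. The table layout, the rule (three or more 404s from one IP), and the sample rows are all hypothetical.

```python
import sqlite3

# Made-up sample traffic using documentation-reserved IP ranges.
SAMPLE_ROWS = [
    ("2021-02-16T10:00:00", "192.0.2.10", "/wp-login.php", 404),
    ("2021-02-16T10:00:01", "192.0.2.10", "/admin", 404),
    ("2021-02-16T10:00:02", "192.0.2.10", "/.env", 404),
    ("2021-02-16T10:00:03", "192.0.2.10", "/", 200),
    ("2021-02-16T10:05:00", "198.51.100.7", "/", 200),
    ("2021-02-16T10:05:30", "198.51.100.7", "/missing", 404),
]

# Candidate rule (hypothetical): many 404s from a single IP looks like a scanner.
SUSPICIOUS_IPS = """
    SELECT client_ip, COUNT(*) AS misses
    FROM requests
    WHERE status = 404
    GROUP BY client_ip
    HAVING COUNT(*) >= ?
    ORDER BY misses DESC
"""

# Follow-up: everything a flagged IP requested, in time order.
ALL_FROM_IP = """
    SELECT ts, path, status FROM requests WHERE client_ip = ? ORDER BY ts
"""


def load_requests(rows):
    """Load parsed log rows into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests (ts TEXT, client_ip TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", rows)
    return conn


def backtest(conn, threshold=3):
    """Apply the rule, then return each flagged IP's full request history."""
    flagged = [ip for ip, _ in conn.execute(SUSPICIOUS_IPS, (threshold,))]
    return {ip: conn.execute(ALL_FROM_IP, (ip,)).fetchall() for ip in flagged}


result = backtest(load_requests(SAMPLE_ROWS))
```

Reading the full history of each flagged IP is what makes false positives visible: a visitor who hit one broken link stays below the threshold, while a scanner's request pattern is obvious at a glance.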
|
| You can use what the docs call "partition projection" to
| efficiently limit the date range of logs to look at
| (https://docs.aws.amazon.com/athena/latest/ug/partition-
| proje...), so it was free-ish to experiment with a query on the
| last couple days of data before looking further back.
|
| More generally, Athena/Presto/Hive support various data sources
| and formats (including applying regexps to text). Compressed
| plain-text formats like ALB logs can already be surprisingly
| cheap to store/scan. If you're producing/exporting data, it's
| worth looking into how these tools "like" to receive it--you may
| be able to use a more compact columnar format (Parquet or ORC) or
| take advantage of partitioning/bucketing
| (https://docs.aws.amazon.com/athena/latest/ug/partitions.html,
| https://trino.io/blog/2019/05/29/improved-hive-bucketing.htm...)
| for more efficient querying later.
|
| As the blog post notes, usability was...imperfect, especially
| during initial setup. Error messages sometimes point at one of
| the first few tokens of the SQL, nowhere near the mistake, and
| there are lots of knobs to tweak, some controlled by
| 'magical.dotted.names.in.strings'. CLIs were sometimes easier
| than the GUI. But you can get a lot out of it once you've got it
| working!
| cube2222 wrote:
| With OctoSQL[0], as I wanted to see how people are using it, I
| literally just set up an http endpoint which received a JSON
| request on each CLI invocation (you can see the data sent in the
| code, it's open source) and appended it to an on-disk JSON file.
|
| Then I used... OctoSQL to analyze it!
|
| Nit: The project may have seemed dead for a few months, but I'm
| just in the midst of a rewrite (on a branch) which gets rid of wrong
| decisions and makes it easier to embed in existing applications.
|
| [0]:https://github.com/cube2222/octosql
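The setup described above (an HTTP endpoint that appends each JSON request to a file on disk) fits in a few lines of standard-library Python. This is a minimal sketch and not OctoSQL's actual telemetry code: the file name, the handler name, and the choice of one JSON object per line are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "events.jsonl"  # hypothetical file name


def append_event(raw_body, path=LOG_PATH):
    """Validate a JSON request body and append it to the log as one line."""
    event = json.loads(raw_body)  # raises ValueError on invalid JSON
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event


class TelemetryHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON and stores it; replies 204 on success, 400 otherwise."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            append_event(self.rfile.read(length))
            self.send_response(204)
        except ValueError:
            self.send_response(400)
        self.end_headers()


def serve(host="127.0.0.1", port=8080):
    """Run the endpoint until interrupted (not called automatically)."""
    HTTPServer((host, port), TelemetryHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Writing one object per line keeps the file directly queryable by line-oriented tools such as jq.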
| blakesterz wrote:
| Interesting, there's quite a big number of people running ad
| blockers!
|
| "Both Google Analytics and Goatcounter agreed that I got ~13k
| unique visitors across the couple days where it spiked. GoAccess
| and my own custom Athena queries agreed that it was more like
| ~33k unique visitors, giving me a rough ratio of 2.5x more
| visitors than reported by analytics, and meaning that about 60%
| of my readers are using an adblocker."
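The arithmetic behind the quoted figures is easy to check. The sketch below uses the rounded numbers from the quote (~13k and ~33k) and assumes, as the quote does, that the entire gap comes from blocked analytics scripts; server-side counts can also include bots that slipped past filtering, so the blocked share is an upper-end estimate.

```python
def undercount_ratio(tracked, actual):
    """How many times larger the server-side count is than the JS-based count."""
    return actual / tracked


def blocked_share(tracked, actual):
    """Fraction of visitors the JS-based analytics never saw."""
    return 1 - tracked / actual


tracked, actual = 13_000, 33_000  # rounded figures from the quote

ratio = undercount_ratio(tracked, actual)  # about 2.54
share = blocked_share(tracked, actual)     # about 0.61

print(f"ratio: {ratio:.2f}x, blocked share: {share:.0%}")
```

Both results line up with the quote's "2.5x" and "about 60%".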
___________________________________________________________________
(page generated 2021-02-16 23:01 UTC)