[HN Gopher] Minimally Invasive (and More Accurate) Analytics: Go...
       ___________________________________________________________________
        
       Minimally Invasive (and More Accurate) Analytics: GoAccess and
       Athena/SQL
        
       Author : Chris911
       Score  : 73 points
       Date   : 2021-02-16 16:17 UTC (6 hours ago)
        
 (HTM) web link (brandur.org)
        
       | alias_neo wrote:
       | I have always used GoAccess on my blog (https://2byt.es) which
       | gets very modest traffic because I don't post much and don't
       | advertise outside of my few twitter followers. Privacy has always
       | been a core principle of mine.
       | 
       | I've found that over time, crawlers drown out the numbers of
       | actual visitors but I find GoAccess hard to use to get any
       | meaningful data from when interesting things do happen.
       | 
       | Can anyone suggest a way I can do something similar to this
       | without relying on a service I don't host (and without having to
       | write parsers into a SQL or similar DB by hand)?
        
         | marvinblum wrote:
         | https://pirsch.io/ I'm one of the founders :) Check my post
         | above (or below, who knows on hn?) for more details, or ask me.
         | 
         | [Edit] Sorry, I misread "which I don't host" as the other way
         | around. You can check out the open-source core library, which
         | might work for you if you put in some work.
        
           | mrwnmonm wrote:
           | Wow, Go support, thanks.
        
         | herunan wrote:
         | I use Cloudflare Web Analytics. Since I use Cf, I thought: why
         | not utilize their analytics? Anonymous, no cookie, no
         | fingerprinting and no localStorage.
         | 
         | Edit: also, no JS if you have Pro (20usd/mo).
        
           | mrwnmonm wrote:
           | I read that DNS analytics is not accurate either.
        
             | XCSme wrote:
             | Semi-related question: Is any type of web analytics 100%
             | accurate?
        
               | mrwnmonm wrote:
               | I meant not as accurate as client tracking if ad blockers
               | were not used. As far as I understand it.
        
         | joshyi wrote:
         | I'm waiting as well for the ability to use filters directly
         | from goaccess. Hope they get to it soon!
         | https://github.com/allinurl/goaccess/issues/117
        
         | tstack wrote:
         | lnav[0] can take care of parsing the log files and displaying
         | them in a TUI. There's also a SQLite interface for doing
         | queries. However, you'll need to build the filters/queries
         | yourself; there aren't any built-in ones at the moment.
         | 
         | [0] - https://lnav.org
        
           | alias_neo wrote:
           | Perfect, thank you, this is right up my street!
           | 
           | I host a static site for my blog, using Hugo so "no server"
           | etc is exactly what I need, and writing the filters/queries
           | myself leaves me in control of getting what I need out of
           | them.
        
         | conradfr wrote:
         | There is the --ignore-crawlers argument. On my modest projects
         | it seems effective, but I haven't looked at it too closely.
        
         | e12e wrote:
         | For a long time I've thought about scavenging the robot
         | identifiers from Matomo, née Piwik - so one might leverage the
         | hive mind in updating robot identifiers, and use it to strip
         | plain access logs for easier use with tools like GoAccess.
        
       | tobilg wrote:
       | For a readymade stack similar to what's described in the article
       | to self-host in your own AWS account have a look at
       | https://ownstats.cloud
        
       | marvinblum wrote:
       | I've found the same issue. A lot of traffic will get blocked if
       | you use a simple JavaScript integration. The solution is
       | (obviously) to track from the backend and provide a simple
       | dashboard for it. I started building a library [0] written in
       | Go, which I could integrate into my website, and by the end of
       | last year it had become a product (in beta right now) called
       | Pirsch [1]. We offer a JS integration to onboard customers more
       | easily, but one of the main reasons we built it is that you can
       | use it from your backend through our API [2]. We plan to add
       | more SDKs and plugins (WordPress, ...) to make the integration
       | easier, but it should be fairly simple already.
       | 
       | I would love to hear feedback, as we plan to fully release it
       | soon :)
       | 
       | [0] https://github.com/pirsch-analytics/pirsch
       | 
       | [1] https://pirsch.io/
       | 
       | [2] https://docs.pirsch.io/get-started/backend-integration/
       | 
       | [Edit]
       | 
       | I forgot to mention my website, which I initially created Pirsch
       | for. The article I wrote about the issue and my solution is here:
       | https://marvinblum.de/blog/server-side-tracking-without-cook...
        
         | fatsdomino001 wrote:
         | Is it possible to integrate pirsch into a heroku deployment?
        
           | marvinblum wrote:
           | I haven't worked with heroku yet, but if you can make an API
           | request, yes. You can read about how that works here:
           | https://docs.pirsch.io/api-sdks/api/
        
             | fatsdomino001 wrote:
             | Looks very interesting.
             | 
             | Yeah I mean, I'm just running a django site, so I imagine I
             | could add a custom middleware that makes an API request on
             | every page load. I guess it would have to try and see if
             | the access token is expired first? and if so grab a new one
             | then make the hit. Is that the recommended setup?
             | 
             | Would I be able to pass extra information to be included in
             | the logs, like e.g. username?
             | 
             | Also, I know you have good privacy policies, but still
             | sending this information through a request makes me nervous
             | nevertheless, even though it's of course miles better than
             | js-based solutions. But what are your thoughts on how
             | possible is it for these requests to be intercepted and
             | this logged data siphoned off by someone else?
        
               | marvinblum wrote:
               | > I guess it would have to try and see if the access
               | token is expired first? and if so grab a new one then
               | make the hit. Is that the recommended setup?
               | 
               | Exactly. The token expires after 15 minutes, so you need
               | to check the response and issue a new token should it
               | have expired. You can read our docs on how to do that or
               | take a look at our Go SDK [0] and re-implement it in
               | Python. Unfortunately, I don't have enough time to
               | provide one right now.
               | 
               | > Would I be able to pass extra information to be
               | included in the logs, like e.g. username?
               | 
               | That's not possible right now, but you will be able to
               | send custom events in the future.
               | 
               | > But what are your thoughts on how possible is it for
               | these requests to be intercepted and this logged data
               | siphoned off by someone else?
               | 
               | Highly unlikely. All traffic is SSL encrypted, the
               | internal communication of our server cluster is
               | encrypted, the database, ... I mean, software can always
               | be hacked, but I spend a lot of my time on infrastructure
               | and security.
               | 
               | [0] https://github.com/pirsch-analytics/pirsch-go-
               | sdk/blob/maste...
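
A minimal Python sketch of the check-and-renew flow described in this subthread. Only the 15-minute lifetime comes from the comment above. The names `TokenCache`, `send_hit`, `fetch_token` and `post_hit` are hypothetical, and treating an HTTP 401 as "token expired" is an assumption rather than documented Pirsch behaviour; the real endpoints and status codes are in the linked docs.

```python
import time
from typing import Callable, Optional

TOKEN_LIFETIME_SECONDS = 15 * 60  # the 15-minute expiry mentioned above


class TokenCache:
    """Caches an access token and renews it shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], str],
        clock: Callable[[], float] = time.monotonic,
        safety_margin: float = 30.0,
    ) -> None:
        # fetch_token is a hypothetical callable that requests a fresh token.
        self._fetch_token = fetch_token
        self._clock = clock
        self._safety_margin = safety_margin
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self, force_refresh: bool = False) -> str:
        """Return the cached token, fetching a new one if it is (nearly) stale."""
        now = self._clock()
        stale = now >= self._expires_at - self._safety_margin
        if force_refresh or self._token is None or stale:
            self._token = self._fetch_token()
            self._expires_at = now + TOKEN_LIFETIME_SECONDS
        return self._token


def send_hit(cache: TokenCache, post_hit: Callable[[str], int]) -> int:
    """Send one hit; on a 401 (assumed to mean "expired") renew and retry once.

    post_hit is a hypothetical callable that performs the HTTP request with
    the given token and returns the HTTP status code.
    """
    status = post_hit(cache.get())
    if status == 401:
        status = post_hit(cache.get(force_refresh=True))
    return status
```

In a Django middleware, one `TokenCache` would presumably be shared across requests so that a new token is fetched roughly every 15 minutes rather than on every page load.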
        
       | joshyi wrote:
       | We use GoAccess against a pretty busy centralized log server, and
       | it has worked really well for years. We don't have to worry about
       | JS, and that's always a plus. I personally like how it follows
       | the Unix philosophy.
        
       | megous wrote:
       | Hmm, it occurred to me that you can probably get a nice list of
       | robot user-agents by querying all UAs that accessed the
       | robots.txt file. I don't think normal browsers touch that file.
       | 
       | Another thing to do on the cheap, if you want more usable logs,
       | is JSON logging[1] (one object per line). This is trivial to
       | import into PostgreSQL and also trivial to query via tools like
       | jq, as is.
       | 
       | [1] Example: https://stackoverflow.com/questions/25049667/how-to-
       | generate...
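
A rough Python sketch combining the two ideas in this comment: read a one-JSON-object-per-line access log and collect every user agent that requested `/robots.txt`. The field names `request_uri` and `http_user_agent` are assumptions; they depend entirely on how the log format was defined, so adjust them to match your own logs.

```python
import json
from typing import Iterable, Set


def robot_user_agents(
    lines: Iterable[str],
    path_field: str = "request_uri",       # assumed field name
    agent_field: str = "http_user_agent",  # assumed field name
) -> Set[str]:
    """Return the set of user agents that fetched /robots.txt."""
    agents: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan
        if not isinstance(entry, dict):
            continue
        # Ignore any query string when comparing the path.
        path = str(entry.get(path_field, "")).split("?", 1)[0]
        agent = entry.get(agent_field)
        if path == "/robots.txt" and agent:
            agents.add(agent)
    return agents
```

With a real log this could be called as `robot_user_agents(open("access.json"))`, where the file name is again only a placeholder. The resulting set can then be used to filter the remaining log lines before feeding them to another tool.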
        
         | heipei wrote:
         | Logging JSON directly from nginx is what I currently do, and
         | then the log output is ingested straight into ElasticSearch.
         | One neat thing you can do is also log return headers from an
         | upstream HTTP server, such as a username for example or any
         | application-specific piece of data. That way you can interleave
         | your HTTP access logs with application data and have everything
         | available for querying in one index.
        
       | twotwotwo wrote:
       | I want to second that plug for Athena for ad-hoc analysis. (If
       | you're hosting your own stuff and at the scale where it'd be
       | useful, there's Presto/Hive, which Athena is based on, and/or
       | Trino, the Presto fork maintained by some of its initial
       | developers.)
       | 
       | It was useful for me when tweaking spam/bot detection rules a
       | while ago; if I could roughly describe a rule in a query, I could
       | back-test it on old traffic and follow up on questionable-looking
       | results (e.g. what other requests did this IP make around the
       | time of the suspicious ones?). We also used Athena on a project
       | looking into performance, and on network flow logs. The lack of
       | recurring charges for an always-on cluster makes it great for
       | occasional use like that.
       | 
       | You can use what the docs call "partition projection" to
       | efficiently limit the date range of logs to look at
       | (https://docs.aws.amazon.com/athena/latest/ug/partition-
       | proje...), so it was free-ish to experiment with a query on the
       | last couple days of data before looking further back.
       | 
       | More generally, Athena/Presto/Hive support various data sources
       | and formats (including applying regexps to text). Compressed
       | plain-text formats like ALB logs can already be surprisingly
       | cheap to store/scan. If you're producing/exporting data, it's
       | worth looking into how these tools "like" to receive it--you may
       | be able to use a more compact columnar format (Parquet or ORC) or
       | take advantage of partitioning/bucketing
       | (https://docs.aws.amazon.com/athena/latest/ug/partitions.html,
       | https://trino.io/blog/2019/05/29/improved-hive-bucketing.htm...)
       | for more efficient querying later.
       | 
       | As the blog post notes, usability was...imperfect, especially
       | during initial setup. Error messages sometimes point at one of
       | the first few tokens of the SQL, nowhere near the mistake, and
       | there are lots of knobs to tweak, some controlled by
       | 'magical.dotted.names.in.strings'. CLIs were sometimes easier
       | than the GUI. But you can get a lot out of it once you've got it
       | working!
        
       | cube2222 wrote:
       | With OctoSQL[0], as I wanted to see how people are using it, I
       | literally just set up an http endpoint which received a JSON
       | request on each CLI invocation (you can see the data sent in the
       | code, it's open source) and appended it to an on-disk JSON file.
       | 
       | Then I used... OctoSQL to analyze it!
       | 
       | Nit: The project may seem dead for a few months, but I'm just in
       | the midst of a rewrite (on a branch) which gets rid of wrong
       | decisions and makes it easier to embed in existing applications.
       | 
       | [0]:https://github.com/cube2222/octosql
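
A small Python sketch of the kind of collection endpoint described above: accept a JSON body over HTTP and append it as one line to a file on disk. This is not OctoSQL's actual telemetry code; the file name, port and all function names are made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "events.jsonl"  # hypothetical output file, one JSON object per line


def append_event(path: str, raw_body: bytes) -> bool:
    """Validate raw_body as JSON and append it to path as a single line."""
    try:
        event = json.loads(raw_body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    with open(path, "a", encoding="utf-8") as log:
        # json.dumps emits no newlines by default, so each event stays on one line.
        log.write(json.dumps(event) + "\n")
    return True


class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        ok = append_event(LOG_PATH, self.rfile.read(length))
        self.send_response(204 if ok else 400)
        self.end_headers()


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the collector until interrupted (host and port are placeholders)."""
    HTTPServer((host, port), EventHandler).serve_forever()
```

Calling `serve()` starts the collector. Because the file holds one object per line, it can be queried directly with line-oriented JSON tools.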
        
       | blakesterz wrote:
       | Interesting, there's quite a large number of people running ad
       | blockers!
       | 
       | "Both Google Analytics and Goatcounter agreed that I got ~13k
       | unique visitors across the couple days where it spiked. GoAccess
       | and my own custom Athena queries agreed that it was more like
       | ~33k unique visitors, giving me a rough ratio of 2.5x more
       | visitors than reported by analytics, and meaning that about 60%
       | of my readers are using an adblocker."
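
The arithmetic behind the quoted figures, as a short Python check. It uses the rounded ~13k and ~33k counts from the quote and assumes, as the quote does, that the whole gap is down to blocked analytics scripts; log-based counts can also include visitors of other kinds, so the share is an estimate at best. The function name is hypothetical.

```python
from typing import Tuple


def undercount(tracked: int, logged: int) -> Tuple[float, float]:
    """Return (logged/tracked ratio, share of logged visitors not tracked)."""
    ratio = logged / tracked
    missing_share = 1 - tracked / logged
    return ratio, missing_share


ratio, missing_share = undercount(tracked=13_000, logged=33_000)
print(f"{ratio:.2f}x more visitors in the logs")  # 2.54x
print(f"{missing_share:.0%} not seen by JS analytics")  # 61%
```

Both results match the "2.5x" and "about 60%" given in the quote.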
        
       ___________________________________________________________________
       (page generated 2021-02-16 23:01 UTC)