[HN Gopher] Show HN: An open source access logs analytics script...
___________________________________________________________________
Show HN: An open source access logs analytics script to block bot
attacks
This is a small PoC Python project that analyzes web server access
logs to classify and dynamically block bad bots, such as L7
(application-level) DDoS bots, web scrapers and so on. We'd be
happy to gather initial feedback on usability and features,
especially from people who have had good or bad experiences with
bots.

*Requirements* The analyzer relies on 3 Tempesta FW specific
features, which you can still get with other HTTP servers or
accelerators:
1. JA5 client fingerprinting
(https://tempesta-tech.com/knowledge-base/Traffic-Filtering-b...).
This is fingerprinting at the HTTP and TLS layers, similar to JA4
(https://blog.foxio.io/ja4%2B-network-fingerprinting) and JA3
fingerprints. The latter is also available in Envoy
(https://www.envoyproxy.io/docs/envoy/latest/api-v3/extension...)
or as an Nginx module (https://github.com/fooinha/nginx-ssl-ja3), so
check the documentation for your web server.
2. Access logs are written directly to the Clickhouse analytics
database, which can consume large data batches and quickly run
analytic queries. For web proxies other than Tempesta FW, you
typically need to build a custom pipeline to load access logs into
Clickhouse. Such pipelines aren't rare, though.
3. The ability to block web clients by IP or JA5 hashes. IP blocking
is probably available in any HTTP proxy.
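
For proxies that do not write to Clickhouse themselves, the first
stage of such a custom pipeline could look roughly like the sketch
below. It assumes the common "combined" access log format; the names
LINE_RE, parse_line and batches are hypothetical and not part of the
project. It only parses and batches rows, the actual insert into
Clickhouse is left out, and a fingerprint column would have to come
from whatever module provides it.

```python
import re
from datetime import datetime

# Assumption: logs use the "combined" format common to Nginx and Apache.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Turn one access log line into a row dict, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "ip": m["ip"],
        "time": datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": m["method"],
        "uri": m["uri"],
        "status": int(m["status"]),
        "bytes": 0 if m["bytes"] == "-" else int(m["bytes"]),
    }


def batches(lines, size=1000):
    """Group parsed rows into lists of at most `size` rows for bulk loading."""
    batch = []
    for line in lines:
        row = parse_line(line)
        if row is None:
            continue  # this sketch silently skips malformed lines
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Made-up sample line using a documentation-range IP address.
sample = ('203.0.113.7 - - [14/Oct/2025:19:15:00 +0000] '
          '"GET /login HTTP/1.1" 401 512 "-" "curl/8.0"')
row = parse_line(sample)
print(row["ip"], row["status"], row["bytes"])
```

Batching matters here because, as noted above, Clickhouse is at its
best when it consumes large data batches rather than single rows.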

*How does it work* This is a daemon, which
1. Learns normal traffic profiles: means and standard deviations for
client requests per second, error responses, bytes per second and so
on. It also remembers client IPs and fingerprints.
2. Enters data model search mode when it sees a spike in the z-score
(https://en.wikipedia.org/wiki/Standard_score) of the traffic
characteristics, or when it is triggered manually.
3. Tries a candidate model. For example, the first model could be
the top 100 JA5 HTTP hashes that produce the most error responses
per second (typical for password crackers). Or it could be the top
1000 IP addresses generating the most requests per second (L7 DDoS).
Next, this model is verified.
4. Repeats the query over a sufficiently long period of past history
to see whether a large fraction of the clients appears in both query
results. If yes, then the model is bad and we go back to the
previous step to try another one. If not, then we have (likely)
found a representative query.
5. Transfers the IP addresses or JA5 hashes from the query results
into the web proxy blocking configuration and reloads the proxy
configuration (on-the-fly).
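
The spike detection and model verification steps above can be
sketched in a few lines of Python. This is a minimal illustration
under stated assumptions, not the project's actual code: the function
names, the 3-sigma threshold, the max_overlap cutoff and the sample
data are all hypothetical.

```python
import math
from statistics import mean, stdev


def z_score(value, history):
    """Standard score of `value` against a learned baseline (needs >= 2 samples)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat baseline: any change is infinitely many sigmas away.
        return 0.0 if value == mu else math.copysign(math.inf, value - mu)
    return (value - mu) / sigma


def is_spike(value, history, threshold=3.0):
    """True when `value` exceeds the baseline by more than `threshold` sigmas."""
    return z_score(value, history) > threshold


def model_is_representative(current, historical, max_overlap=0.2):
    """Reject a model if too many of its clients also show up in past results."""
    current, historical = set(current), set(historical)
    if not current:
        return False
    overlap = len(current & historical) / len(current)
    return overlap <= max_overlap


# Hypothetical requests-per-second samples learned during normal traffic.
baseline_rps = [100, 110, 90, 105, 95]
print(is_spike(500, baseline_rps))  # True
print(is_spike(108, baseline_rps))  # False

# Hypothetical top-N query results: now vs. the same query over past history.
top_now = {"ja5_a", "ja5_b", "ja5_c", "ja5_d"}
top_past = {"ja5_a", "ja5_x"}
print(model_is_representative(top_now, top_past, max_overlap=0.5))  # True
```

The idea behind the overlap check is that clients which were already
among the top results during normal traffic are probably legitimate
heavy users, so a model that mostly returns them is a poor basis for
blocking.
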
Author : krizhanovsky
Score : 13 points
Date : 2025-10-14 19:15 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
___________________________________________________________________
(page generated 2025-10-14 23:00 UTC)