       ___________________________________________________________________
        
       Show HN: Trench - Open-source analytics infrastructure
        
       Hey HN! I want to share a new open-source project I've been
       working on called Trench (https://trench.dev). It's open-source
       analytics infrastructure for tracking events, page views, and
       identifying users, built on top of ClickHouse and Kafka:
       https://github.com/frigadehq/trench

       I built Trench because the Postgres table we used for tracking
       events at our startup (http://frigade.com/) was getting expensive
       and becoming a performance bottleneck as we scaled to millions of
       end users. Many companies run into the same problem (e.g. Stripe
       and Heroku: https://brandur.org/fragments/events). They often
       start by adding a basic events table to their relational
       database, which works at first but becomes an issue as the
       application scales: it's usually the biggest table in the
       database, the slowest one to query, and the longest one to back
       up.

       With Trench, we've put together a single Docker image that gives
       you a production-ready tracking event table built for scale and
       speed. When we migrated our tracking table from Postgres to
       Trench, we saw a 42% reduction in cost to serve on our primary
       Postgres cluster, and all lag spikes from autoscaling under high
       traffic were eliminated.

       Here are some of the core features:

       * Fully compliant with the Segment tracking spec, e.g. track(),
         identify(), group(), etc.
       * Can handle thousands of events per second on a single node
       * Query tracking data in real time with read-after-write
         guarantees
       * Send data anywhere with throttled and batched webhooks
       * Single production-ready Docker image; no need to manage and
         roll your own Kafka/ClickHouse/Node.js/etc.
       * Easily plugs into any cloud-hosted ClickHouse or Kafka
         solution, e.g. ClickHouse Cloud or Confluent

       Trench can be used for a range of use cases. Here are some
       possibilities:

       1. Real-Time Monitoring and Alerting: Set up real-time alerts
       and monitoring for your services by tracking custom events like
       errors, usage spikes, or specific user actions, and sending that
       data anywhere with Trench's webhooks

       2. Event Replay and Debugging: Capture all user interactions in
       real time for event replay

       3. A/B Testing Platform: Capture events from different users and
       groups in real time, segment users by querying in real time, and
       serve the right experiences to the right users

       4. Product Analytics for SaaS Applications: Embed Trench into
       your existing SaaS product to power user audit logs or tracking
       scripts on your end users' websites

       5. Build a custom RAG model: Easily query event data and give
       users answers in real time (LLMs are really good at writing SQL)

       The project is open source and MIT-licensed. If there's
       interest, we're thinking about adding support for Elasticsearch,
       direct data integrations (e.g. Redshift, S3, etc.), and an admin
       interface for creating queries, webhooks, and more.

       Have you experienced the same issues with your events tables?
       I'd love to hear what HN thinks about the project.
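
       To give a feel for the API, here's a rough sketch of a
       Segment-style track() call in TypeScript. The endpoint path,
       auth header, and exact payload shape below are illustrative
       guesses, not the definitive API; see https://docs.trench.dev for
       the real thing:

         const TRENCH_URL = "https://trench.example.com"; // hypothetical host

         async function track(
           userId: string,
           event: string,
           properties: Record<string, unknown>
         ): Promise<void> {
           const res = await fetch(`${TRENCH_URL}/events`, {
             method: "POST",
             headers: {
               "Content-Type": "application/json",
               Authorization: "Bearer YOUR_API_KEY", // hypothetical auth scheme
             },
             // The docs' sample payload is an array of event objects
             body: JSON.stringify([{ type: "track", userId, event, properties }]),
           });
           if (!res.ok) throw new Error(`Trench responded with ${res.status}`);
         }

         await track("user_123", "ConnectedAccount", {
           totalAccounts: 4,
           country: "Denmark",
         });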
        
       Author : pancomplex
       Score  : 143 points
       Date   : 2024-10-25 14:07 UTC (4 days ago)
        
        
       | Attummm wrote:
        | Looks great, but what's missing for me is the use cases.
        | 
        | Why should I use it? What are the unique selling points of
        | your project?
        
         | pancomplex wrote:
          | I looked around, but all the open-source analytics projects
          | I could find were bloated with all kinds of UI and
          | unnecessary code paths. They also all seemed to use a
          | row-based RDBMS as the data backbone (vs. columnar stores
          | like ClickHouse). I was looking for a backend-only solution
          | that could scale and that we could shape to our product use
          | case.
         | 
         | So TLDR, if you're at a smaller scale (<1M MAUs), you probably
         | will be fine just using a table in MySQL or Postgres. If you
         | have a lot of traffic and users, you will need something like
         | Trench that uses Kafka and ClickHouse.
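          | 
          | To make the columnar point concrete: an analytical query
          | like the one below only reads the handful of columns it
          | touches instead of whole rows, which is why ClickHouse
          | handles it so cheaply. A sketch (table and column names are
          | made up, not Trench's schema):
          | 
          |   import { createClient } from "@clickhouse/client";
          | 
          |   const client = createClient({ url: "http://localhost:8123" });
          | 
          |   // Daily event counts for one tenant; ClickHouse only has
          |   // to scan the instance_id and timestamp columns here.
          |   const result = await client.query({
          |     query: `
          |       SELECT toDate(timestamp) AS day, count() AS events
          |       FROM events
          |       WHERE instance_id = 'acme'
          |       GROUP BY day
          |       ORDER BY day
          |     `,
          |     format: "JSONEachRow",
          |   });
          |   console.log(await result.json());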
        
           | Attummm wrote:
            | You are selling the underlying technologies
            | (Kafka/ClickHouse).
            | 
            | I'm interested in what your project can do for me, my
            | project(s), and my team/company. There is a reason most of
            | the internet still runs on PHP and older technologies:
            | they focused not on the latest tech but on solving
            | problems for others.
            | 
            | The project looks cool, but tell us the use cases.
        
             | dfltr wrote:
             | Use case #1: You have a problem table (e.g. a high-volume
             | events table) that grows non-linearly as your business
             | starts to scale up. A queue + columnar store package like
             | Trench moves the problem table out to a system better
             | equipped to deal with it and lets your DB server handle its
             | relational business in relative peace and quiet.
        
               | Attummm wrote:
                | Maybe I wasn't clear enough, but my questions were
                | rhetorical; they were not for me. Leading with the
                | technologies is akin to describing a sandwich by its
                | individual ingredients.
                | 
                | The question remains: why choose Trench over just
                | using Kafka and ClickHouse, or any other message queue
                | and columnar/big-data database?
                | 
                | If the goal of the post and the landing page is to
                | entice people to use the tool, then answering these
                | questions is important. And if what is being discussed
                | seems obvious, then who is the target demographic?
                | Because they already know the space, use alternatives,
                | or have built their own.
        
             | mind-blight wrote:
              | It seems pretty clearly spelled out: if you have enough
              | traffic that an events table is slowing down your
              | Postgres instance, you can easily set this up as a
              | service to offload the events table. The author says at
              | <1 million MAUs you probably don't need this.
              | 
              | It's built on tech known for handling very large amounts
              | of traffic, which answers the how after the what.
        
             | Jgrubb wrote:
             | Sometimes the innovation is a new underlying technology
             | applied to an old problem?
        
       | bosky101 wrote:
        | 1) Appreciate the single image to get started, but I'm
        | particularly curious how you handle different events from a
        | new user going to different nodes.
        | 
        | 2) Any admin interface, or just the REST API?
        | 
        | 3) A little bit on the ClickHouse table and engine choices?
        | 
        | 4) Stats on ingesting and querying at the same time?
        | 
        | 5) Node doesn't support the ClickHouse TCP interface. This
        | was a major bottleneck for us even with batching of 50k
        | events (or 30 secs, whichever came first).
        | 
        | 6) CH indexes?
        | 
        | 7) How are events partitioned to a Kafka partition? By
        | userId? Any assumptions on minimum fields?
        | 
        | Will try porting our in-house marketing automation backend
        | (PostHog-frontend-compatible) to this and see how it goes
        | (150M+ events per day).
        | 
        | Kudos all around. Love all 3 of your technology choices.
        
         | pancomplex wrote:
         | Thank you!
         | 
         | 1) All data is partitioned based on the "instanceId" of events
         | (see `instanceId` here: https://docs.trench.dev/api-
         | reference/events-create). Instance IDs are typically a
         | logically meaningful way of separating users (such as by
         | company/team/etc.) that allows for sharding the data across
         | nodes.
         | 
          | 2) Yes, this is the number one thing on our roadmap right
          | now (if anyone is interested in helping build this, please
          | reach out!)
         | 
         | 3) We're using the Kafka engine in ClickHouse for throttling
         | the ingestion of events. It's partitioned by instanceId (see
         | #1) for scaling/fast queries over similar events.
         | 
          | 4) My benchmarks in production showed a single EC2 instance
          | (16 cores / 32 GB RAM) barely breaking a sweat at 1,000+
          | inserts per second with roughly the same number of queries
          | per second (load averages 0.91, 0.89, 0.90). This was in
          | stark contrast to our AWS Postgres cluster, which continued
          | to hit 90%+ CPU and low memory with 80 ACUs before we
          | finished the migration to Trench.
         | 
         | 5) We seemed to solve this by running individual Node processes
         | on every core (16 in parallel). Was the limit you saw caused by
         | ClickHouse's inbound HTTP interface?
         | 
          | 6) Right now the system uses just a default MergeTree
          | ordered by instanceId, userId, timestamp (rough sketch after
          | #7 below). This works really well for queries across the
          | same user or instance, especially when generating timeseries
          | graphs.
         | 
         | 7) I am still trying to figure out the best Kafka partitioning
         | scheme. userId seems to be the best for avoiding hot
         | partitions. Curious if you have any experience with this?
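          | 
          | For the curious, the table from #6 looks roughly like the
          | following. This is a sketch with assumed column names, not
          | our exact schema:
          | 
          |   import { createClient } from "@clickhouse/client";
          | 
          |   const client = createClient({ url: "http://localhost:8123" });
          | 
          |   // A default MergeTree keyed as described in #6. The ORDER
          |   // BY clause is what makes per-instance and per-user
          |   // timeseries queries fast.
          |   await client.command({
          |     query: `
          |       CREATE TABLE IF NOT EXISTS events (
          |         instance_id String,
          |         user_id     String,
          |         event       String,
          |         properties  String,
          |         timestamp   DateTime64(3)
          |       )
          |       ENGINE = MergeTree
          |       ORDER BY (instance_id, user_id, timestamp)
          |     `,
          |   });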
         | 
         | Let us know how the migration goes and feel free to connect
         | with me (christian@trench.dev).
        
           | bosky101 wrote:
            | Not sure about the CH Kafka engine, but generally I think
            | you should partition by userId.
            | 
            | The next step will be trying to run some cron or
            | event-based trigger per user based on the events. And the
            | only way to avoid multiple machines doing the same work /
            | sending the same comms is to push all of a user's events
            | to a single partition. That way, with multiple workers,
            | you don't run the risk of duplicate processing.
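            | 
            | Concretely, that just means keying the Kafka messages by
            | userId so the broker hashes all of a user's events onto
            | the same partition. A sketch using kafkajs (topic name and
            | payload are made up):
            | 
            |   import { Kafka } from "kafkajs";
            | 
            |   const kafka = new Kafka({ brokers: ["localhost:9092"] });
            |   const producer = kafka.producer();
            | 
            |   await producer.connect();
            |   // Messages with the same key always land on the same
            |   // partition, so one consumer in the group sees all of a
            |   // user's events, in order.
            |   await producer.send({
            |     topic: "events",
            |     messages: [
            |       { key: "user_123", value: JSON.stringify({ event: "PageView" }) },
            |     ],
            |   });
            |   await producer.disconnect();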
        
           | klaussilveira wrote:
            | How do you guarantee ACID with Kafka being responsible for
            | actually INSERT'ing into ClickHouse? Wouldn't it be less
            | error-prone to just use ClickHouse directly with their
            | async inserts?
           | 
           | https://clickhouse.com/blog/asynchronous-data-inserts-in-
           | cli...
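            | 
            | i.e. something along these lines, with ClickHouse batching
            | rows server-side instead of Kafka (table and columns are
            | made up for illustration):
            | 
            |   import { createClient } from "@clickhouse/client";
            | 
            |   const client = createClient({ url: "http://localhost:8123" });
            | 
            |   // async_insert buffers rows inside ClickHouse and
            |   // flushes them in batches; no queue in front is needed.
            |   await client.insert({
            |     table: "events",
            |     values: [
            |       { instance_id: "acme", user_id: "user_123", event: "PageView" },
            |     ],
            |     format: "JSONEachRow",
            |     clickhouse_settings: { async_insert: 1, wait_for_async_insert: 0 },
            |   });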
        
       | d_watt wrote:
       | Looks super interesting. Any positioning thoughts on this vs
       | https://jitsu.com ?
        
         | pancomplex wrote:
          | I think a major difference is that Jitsu depends on you
          | having a data warehouse, whereas Trench can be spun up as a
          | standalone system. Trench is also built for real-time
          | querying at high scale, which will be much slower when you
          | depend on ETL'ed data in a data warehouse.
        
       | hitradostava wrote:
        | Looks interesting. We solved this problem with Kinesis
        | Firehose, S3, and Athena. Pricing is cheap, you can run any
        | arbitrary SQL query, and there is zero infrastructure to
        | maintain.
        
         | bosky101 wrote:
          | Storing small events in S3 can explode costs quickly.
          | 
          | At 1M events/day that's $7.5/day. Decent.
          | 
          | At 15M, $75/day.
          | 
          | At 150 million S3 PUT requests per day (25KB each), the cost
          | would be $750/day just for the PUTs (S3 charges roughly
          | $0.005 per 1,000 PUT requests, and 150M / 1,000 x $0.005 =
          | $750), assuming no extra data transfer charges.
          | 
          | With ClickHouse you won't get charged per read/write.
        
           | hitradostava wrote:
            | Kinesis Firehose supports buffering (up to 900 seconds or
            | 128 MB), so you are way off on your cost estimates. Over
            | time, queries can start costing more due to S3 requests,
            | but regular Spark runs to combine small files solve that.
        
       | antman wrote:
        | How does it scale? Can you spin up multiple containers? For
        | upcoming features, auto-archiving old data to cloud storage
        | would be great.
        
         | pancomplex wrote:
         | Once you've outgrown a single physical server, you can continue
         | to scale the Trench cluster by spinning up more Trench
         | application servers and switching to dedicated Kafka and
         | ClickHouse (either self-hosted or via cloud offerings). You can
         | also shard Trench itself depending on the structure of your
         | data (e.g. 1 Trench instance per customer, use case, etc.)
         | 
          | Auto-archiving to cloud storage for Kafka (Confluent, AWS
          | MSK, etc.) / ClickHouse (ClickHouse Cloud, etc.) is
          | definitely high on the roadmap.
        
       | codegeek wrote:
        | Looks good. I'm in the market for something like this and
        | just ran it locally. How do I visualize data? Is Grafana not
        | included by default?
        | 
        | Also, a minor issue in your docs: there is an extra comma in
        | the sample JSON under the sample event. The fragment below:
        | 
        |     "properties": {
        |       "totalAccounts": 4,
        |       "country": "Denmark"
        |     },
        |   }]
        | 
        | I had to remove that comma at the end.
        
         | pancomplex wrote:
         | Thanks for flagging. Just fixed this. Grafana is intentionally
         | not included by default -- but it takes a few minutes to set it
         | up. We're still trying to figure out what to bundle by default
         | in terms of UI -- for now it's API only.
        
           | codegeek wrote:
            | No worries. I am going to test it, as we are looking for a
            | simple centralized tool for multiple customers to run
            | reporting on events. Most tools have been too complex to
            | set up, and yours is promising.
        
       | oulipo wrote:
        | Could this be used to log IoT object events? Or is it more
        | for app analytics?
        
         | pancomplex wrote:
          | Yes, for sure. We intentionally designed Trench to be very
          | unopinionated when it comes to the application, so you can
          | use it to stream and query anything from page views and log
          | traces to IoT object events.
        
       | oulipo wrote:
        | What is the advantage of this over using a Postgres plugin
        | for ClickHouse plus S3 storage of the data to build a kind of
        | data warehouse, which wouldn't require the bloat of Kafka?
        
         | pancomplex wrote:
         | In my experience, at scale (~2-3k QPS), you'd run into a
         | bottleneck ingesting so many events without Kafka. If you don't
         | have this level of throughput, you could totally do the above
         | and still get the advantages of ClickHouse's columnar
         | datastore.
        
       | asdev wrote:
          | How is this different from Posthog?
        
         | pancomplex wrote:
          | The stack is indeed very similar to Posthog. The biggest
          | difference is that we don't come with all the feature bloat
          | (Session Recordings, Feature Flags, Surveys, etc.) and
          | instead provide a very minimal, easy-to-use backend + API
          | that is applicable to a ton of use cases.
         | 
         | We (Frigade.com) actually use Posthog as well as Trench in
         | production. Posthog powers all our website analytics. Trench
         | powers our own SDK and tracking scripts we ship to our own
         | customers.
         | 
          | I actually tried to spin up Posthog originally before
          | building Trench, but there was just way too much overhead
          | and "junk" we didn't need. I would have needed to strip out
          | so many features of their Python app that it was eventually
          | faster to build a clean solution in TypeScript ourselves.
        
         | BohdanPetryshyn wrote:
          | In addition to what pancomplex mentioned, Posthog is not
          | fully open source. Their free self-hosted version has
          | limited functionality, and the paid self-hosted version is
          | no longer supported [1], which makes me feel like I'm being
          | pushed toward their cloud offering.
         | 
         | [1]: https://posthog.com/docs/self-host
        
       | brody_slade_ai wrote:
        | I've been exploring open-source data analytics software and
        | it's been a game-changer. I mean, the flexibility and cost
        | savings are huge perks. I've been looking into Apache Spark
        | and KNIME, and they both seem like great options.
        
       | biddendidden wrote:
        | I __totally__ associate 'trench' with 'analytics'. Oh, perhaps
        | the author associates it with 'infrastructure'? Just stupid.
        
       | remram wrote:
       | If you don't mind me asking, why the name "Trench"?
        
         | pancomplex wrote:
          | We were inspired by data lakes and thought the name of a
          | super-deep lake could be a cool domain. Turns out the 10
          | deepest spots on Earth are all trenches, and the domain was
          | cheap, so we went with trench.dev:
          | https://www.marineinsight.com/know-
          | more/10-deepest-parts-of-...
        
       ___________________________________________________________________