[HN Gopher] How we built ngrok's data platform
       ___________________________________________________________________
        
       How we built ngrok's data platform
        
       Author : samber
       Score  : 131 points
       Date   : 2024-09-30 07:35 UTC (15 hours ago)
        
 (HTM) web link (ngrok.com)
 (TXT) w3m dump (ngrok.com)
        
       | LoganDark wrote:
       | > Note that we do not store any data about the traffic content
       | flowing through your tunnels--we only ever look at metadata.
       | While you have the ability to enable full capture mode of all
       | your traffic and can opt in to this service, we never store or
       | analyze this data in our data platform. Instead, we use
       | Clickhouse with a short data retention period in a completely
       | separate platform and strong access controls to store this
       | information and make it available to customers.
       | 
       | Don't worry, your sensitive data isn't handled by our platform,
       | we ship it to a third-party instead. This is for your protection!
       | 
       | (I have no idea if Clickhouse is actually a third party, it
       | sounds like one though?)
        
         | leosanchez wrote:
          | Clickhouse is a database. It has a cloud offering.
        
           | faangguyindia wrote:
            | What's the point of ClickHouse Cloud when you can just
            | use BigQuery and run queries on billions of rows?
            | 
            | I am genuinely curious what use case ClickHouse serves
            | over BigQuery.
        
             | FridgeSeal wrote:
              | It's actually open source, you can self-host it easily
              | enough, and you can push a single instance pretty far
              | too.
             | 
             | It'll also happily read from disaggregated storage and is
             | compatible with parquet and friends and a stack of other
             | formats. I've not really used BigQuery in anger, but the
             | ClickHouse performance is really, really good.
             | 
             | I guess ultimately, all the same benefits, and a lot fewer
             | downsides.
        
             | tnolet wrote:
              | - non-proprietary
              | 
              | - open source
              | 
              | - run it locally
              | 
              | - SQL-like syntax
              | 
              | - tons of plugins
              | 
              | - not by Google
        
         | IanCal wrote:
         | A different platform doesn't mean third party. It can just mean
         | you have completely separated things so that none of the data
         | tooling discussed here has any ability to access it.
        
           | LoganDark wrote:
           | Not sure what you mean... Do you mean they run software
           | called Clickhouse on their own infra, just separated from the
           | other parts of their backend? To me it reads like they were
           | shipping the data off to a third-party named Clickhouse,
           | especially with "we never store or analyze this data in our
           | data platform" (does data platform refer to ngrok itself or
           | what?).
        
             | IanCal wrote:
              | There is a database called ClickHouse. While the company
              | offers services and hosting, many run ClickHouse on
              | their own infra.
        
         | sippeangelo wrote:
         | Clickhouse must be the worst named product in popular use. I
         | know it's a DB, but every time I read it, it sounds like a
         | marketing/Ads company for privacy invasive tracking.
        
       | zurfer wrote:
       | Kudos to the author who is responsible for the whole stack. A lot
       | of effort goes into ingesting data into Iceberg tables to be
       | queried via AWS Athena.
       | 
        | But I think it's great that analytics and data transformation
        | are distributed, so developers are also somewhat responsible
        | for correct analytical numbers.
       | 
        | In most companies there is a strong split between building
        | the product and maintaining analytics for it, which leads to
        | all sorts of inefficiencies and errors.
        
       | 1a527dd5 wrote:
       | Blimey, that is a lot of moving parts.
       | 
       | Our data team currently has something similar and its costs are
       | astronomical.
       | 
       | On the other hand our internal platform metrics are fired at
       | BigQuery [1] and then we used scheduled queries that run daily
       | (looking at the -24 hours) that aggregate/export to parquet. And
       | it's cheap as chips. From there it's just a flat file that is
       | stored on GCS that can be pulled for analysis.
       | 
       | Do you have more thoughts on Preset/Superset? We looked at both
       | (slightly leaning towards cloud hosted as we want to move away
       | from on-prem) - but ended up going with Metabase.
       | 
       | [1] https://cloud.google.com/bigquery/docs/write-api
        
         | otter-in-a-suit wrote:
         | I'm the author, but posting as a private individual here, these
         | being just my options and all that... but I can shed some more
         | light on why I did move us to Superset.
         | 
         | Preset is great, as are most of these tools' hosted versions!
         | Lots of great folks working on these.
         | 
          | But, tbh, this is somewhat the core business of ngrok as an
          | infrastructure company - hosting another DB + K8s service is
          | something that we have great tooling for and lots of
          | expertise in the infra space. And using ngrok makes it even
          | easier.
         | 
         | The whole dogfooding aspect is important too - if I don't run
         | an app in production with ngrok I have a hard time empathizing
         | with customers who want to do the same. My previous job
         | encouraged that too and I've always liked that.
         | 
         | Also, yes, lots of moving parts - but most of them are very
         | reusable and they share a lot of code, infra, and
         | logic/operations playbooks etc. Costs are manageable - Athena
         | charges $5/TB scanned iirc, which tends to be the biggest
         | factor.
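
As a rough illustration of how that $5/TB-scanned figure interacts with the ~650GB/day volume mentioned elsewhere in the thread - the scan multiplier below is a purely hypothetical assumption for illustration, not ngrok's actual query pattern:

```python
# Back-of-the-envelope Athena cost. The $5/TB-scanned price and the
# ~650 GB/day volume come from the thread; the scan multiplier is a
# made-up assumption for illustration.

PRICE_PER_TB_SCANNED = 5.00       # USD per TB scanned (per the comment)
DAILY_VOLUME_TB = 650 / 1024      # ~650 GB/day ingested

def monthly_scan_cost(scan_multiplier: float, days: int = 30) -> float:
    """Cost if queries scan scan_multiplier x one day's volume per day."""
    return DAILY_VOLUME_TB * scan_multiplier * days * PRICE_PER_TB_SCANNED

# e.g. if a day's queries together re-scan ~10x a day's worth of data:
print(f"${monthly_scan_cost(10):,.2f}/month")  # → $952.15/month
```

Partitioning and columnar formats like Parquet/Iceberg reduce the bytes scanned per query, which is why the scan multiplier, not the raw volume, tends to dominate the bill.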
        
           | spmurrayzzz wrote:
           | I appreciate the time you took to write this all out (both
           | the article and your response here). In particular, this line
           | from the article resonated with my own experience over the
           | last couple of decades:
           | 
           | > This particular setup--viewing DE as a very technical,
           | general-purpose distributed system SWE discipline and making
           | the people who know best what real-world scenarios they want
           | to model--makes our setup work in practice.
           | 
           | The common analyst-to-DE path has some benefits for sure with
           | respect to business-centric data modeling, but without the
           | deep technical infrastructure investments and related
           | support, the stack becomes a beast to deal with at scale (or
           | just ends up being a massive cost on the balance sheet from
           | outside vendor sourcing). You really need both verticals in
           | order to be optimal IMO.
           | 
           | Of course if internally an org doesn't already have the
           | platform/infrastructure to dogfood in the first place, this
           | admittedly makes the proposition a bit more of a gamble.
        
           | 1a527dd5 wrote:
           | Appreciate you taking the time to reply :)
           | 
            | I guess the underlying cynicism in my tone speaks to the
            | question that I didn't directly ask - how often does each
            | of the components/moving parts fail and require manual
            | intervention/fixing?
           | 
            | I often get pulled into complex distributed systems, and
            | the team responsible for that flow (data or not) often has
            | no idea where to begin.
           | 
            | Edit* On the point of Athena, I desperately wanted to use
            | it but found BigQuery to be much better in every way you
            | could think of. It's the black sheep in the company, as
            | every other cloud thing we have is AWS. But honestly,
            | nothing I've found in the AWS circle comes close to
            | BigQuery.
        
             | datadrivenangel wrote:
             | BigQuery + Metabase is such a powerful combination. Easy,
             | affordable, effective.
        
         | mritchie712 wrote:
          | Blimey indeed. This is a lot of work to set up.
         | 
          | I think you'll see more platforms that offer this setup as
          | a service:
         | 
         | * cheap storage / datalake
         | 
         | * pipelines to get data to storage
         | 
         | * BI / dashboards on top of the storage
         | 
         | We're doing this at Definite (https://www.definite.app/) with
         | Iceberg (same as in this post) + DuckDB as a query engine.
        
           | 1a527dd5 wrote:
           | We are waiting for Metabase to support DuckDB on their cloud
           | version. It's pretty neat.
        
       | valzam wrote:
        | I pity the developer who has to maintain tagless final
        | plumbing code after the "functional programming enthusiast"
        | moves on... in a Go-first org no less.
        
         | epgui wrote:
         | I would much rather inherit an FP data pipeline than anything
         | else. You do realize data pipelines (and distributed computing)
         | are an ideal use case for FP?
        
           | pjmlp wrote:
            | I guess the issue being pointed out is the choice in a Go
            | culture shop, and we all know their common point of view
            | regarding "fancy" languages.
        
             | epgui wrote:
              | It's not clear to me why having two different sets of
              | tooling for solving two different kinds of problems is
              | an issue.
             | 
             | In most well-resourced companies, you're probably not going
             | to have to ask your Go engineers to fix data pipelines in
             | Scala.
        
               | pjmlp wrote:
                | That is why they pointed out leaving the company as a
                | possible scenario.
                | 
                | As for well-resourced, I guess it depends; that
                | variable usually doesn't mean much, as we can see from
                | companies firing whole departments while swimming in
                | profits.
        
         | otter-in-a-suit wrote:
         | Author here. This decision went through all proper architecture
         | channels, including talks with our engineers, proof of concepts
         | and the like.
         | 
          | I've been doing this too long to shoehorn in my pet
          | languages if I didn't think they were a good fit. And I
          | think that Scala/FP + Flink _is_ a good fit for this use
          | case.
         | 
          | We did also explore the Go ecosystem fwiw - the options
          | there are limited (especially around data tooling like
          | Iceberg) and Go is simply not a language that's popular
          | enough in the data world.
         | 
          | Python's typing system (or lack thereof) is a huge
          | hindrance in this space in general (imo), and Java didn't
          | cause many happy faces on the Eng team either, but it's
          | certainly an option. I just find FP semantics a better fit
          | for data / streaming work (lots of map and flatMap
          | anyways), and Scala makes that easy.
         | 
          | Also no cats/zio - just some tagless final _inspired_
          | composition and type classes. Not too difficult to reason
          | about, not using any obscure patterns. I even mutate
          | references sometimes. :-)
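
For readers unfamiliar with why map/flatMap semantics suit streaming work, here is a toy sketch (not ngrok's actual code; the event shape and names are invented) of the pattern: each stage is a pure function, so stages compose cleanly and can be tested in isolation:

```python
# Toy illustration of map/flatMap-style stream stages. Event shape,
# field names, and the CSV format are hypothetical.

from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass(frozen=True)
class RawEvent:
    account: str
    bytes_in: int
    bytes_out: int

def parse(lines: Iterable[str]) -> Iterator[RawEvent]:
    # flatMap-like: a malformed line yields zero events instead of raising
    for line in lines:
        parts = line.split(",")
        if len(parts) == 3:
            yield RawEvent(parts[0], int(parts[1]), int(parts[2]))

def total_bytes(e: RawEvent) -> tuple[str, int]:
    # map-like: exactly one record out per event in
    return (e.account, e.bytes_in + e.bytes_out)

stream = ["acct1,10,20", "garbage", "acct2,5,5"]
print([total_bytes(e) for e in parse(stream)])
# → [('acct1', 30), ('acct2', 10)]
```

Flink's DataStream API exposes the same shapes (`map`, `flatMap`, `filter`), which is part of why FP-leaning languages feel natural there.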
        
           | boltzmann-brain wrote:
           | scala? why not haskell instead?
        
             | otter-in-a-suit wrote:
             | Not assuming you're serious, but in any case: the reason is
             | the JVM (+ Scala) ecosystem in the data space.
        
               | epgui wrote:
               | FWIW, I do believe there is a serious case to be made for
               | Haskell... But it's probably beyond the scope of this
               | context / would require changing many other decisions.
               | 
                | If integrating with Java tools was important, then
                | personally I'd ask "why not Clojure".
               | 
               | :)
        
             | atomicnumber3 wrote:
              | Spark is written in Scala and Scala is its first-class
              | language - other languages suffer from second-class APIs
              | (Java) or codec/serde overhead (pyspark), and pyspark is
              | also missing a few APIs that Scala has.
        
           | atomicnumber3 wrote:
           | I'm assuming the parent commenter hasn't worked in data/spark
           | before either. The functional rabbit hole goes WAY deeper
           | than even just cats et al, and Scala and spark themselves
           | both encourage a fair amount of functional-style code on
           | their own.
        
         | moandcompany wrote:
         | There was a prior effort to create a Golang SDK for Apache Beam
         | https://beam.apache.org/documentation/sdks/go/
         | 
          | The Beam Golang SDK work came from Googlers working on Beam
          | who were Golang fans; internally there were Golang-oriented
          | tools for batch data processing that needed a migration
          | path forward.
         | 
         | Historical Note: Apache Beam also originated from Google as
         | "Dataflow"
        
       | Fripplebubby wrote:
       | I found the technical details really interesting, but I think
       | this gem applies more broadly:
       | 
       | > I find this is often an artifact of the DE roles not being
       | equipped with the necessary knowledge of more generic SWE tools,
       | and general SWEs not being equipped with knowledge of data-
       | specific tools and workflows.
       | 
       | > Speaking of, especially in smaller companies, equipping all
       | engineers with the technical tooling and knowledge to work on all
       | parts of the platform (including data) is a big advantage, since
       | it allows people not usually on your team to help on projects as
       | needed. Standardized tooling is a part of that equation.
       | 
        | I have found this to be so true. SWE vs DE is one division
        | where this applies, and I think it also applies to SWE vs SRE
        | (if you have those in your company), data scientists,
        | "analysts" - basically anyone in a technical role should
        | ideally know what kinds of problems other teams work on and
        | what kinds of tooling they use to address them, so that you
        | can cross-pollinate.
        
         | anonzzzies wrote:
          | I too see this; I have a big hole in my DE knowledge, even
          | though I manage a _lot_ of data for our clients. I just
          | work from experience and have been using more or less the
          | same tech for decades (with upgrades and one major 'newer'
          | addition: Clickhouse). I try to learn DE stuff, but I find
          | it particularly hard because I'm NOT a DE but an SWE, so I
          | really quickly fall back on the tooling I already know and
          | love, and see very few reasons for anything else.
         | 
          | So is there something like 'DE for SWEs' someone would
          | recommend here?
        
       | moandcompany wrote:
       | At the end of the day, we're all pushing protobufs from place to
       | place
        
         | tonymet wrote:
         | why aws, azure & gcp are printing money
        
       | tonymet wrote:
        | A 15k/s event rate and 650GB volume / day is massive. Of
        | course that's confidential, but I'd guess they are below 10k
        | concurrent connections. So they are recording 1.5 events /
        | second / user. Does every packet need discrete & real-time
        | telemetry? I've seen games with millions of active users only
        | hit 30k concurrents, and this is a developer tool.
       | 
        | Most events can be aggregated over time with a statistic
        | (count, avg, max, etc). Even discrete events can be
        | aggregated with a 5 min latency. That should reduce their
        | event volume by 90%. Every layer in that diagram is CPU
        | wasted on encode-decode that costs money.
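
The pre-aggregation the comment suggests can be sketched in a few lines - keys, the event shape, and the tunnel names below are invented for illustration; a real deployment would do this inside the stream processor before events ever hit storage:

```python
# Minimal sketch: bucket raw (timestamp, key, value) events into
# 5-minute windows and keep only count/avg/max per key per window.
# Event shape and keys are hypothetical.

from collections import defaultdict

WINDOW = 5 * 60  # window size in seconds

def aggregate(events):
    """events: iterable of (timestamp_sec, key, value) tuples."""
    buckets = defaultdict(list)
    for ts, key, value in events:
        window_start = ts - (ts % WINDOW)  # align to window boundary
        buckets[(window_start, key)].append(value)
    return {
        k: {"count": len(v), "avg": sum(v) / len(v), "max": max(v)}
        for k, v in buckets.items()
    }

events = [(0, "tunnel_a", 10), (30, "tunnel_a", 20), (400, "tunnel_a", 5)]
print(aggregate(events))
# two windows: (0, 'tunnel_a') has count 2 / avg 15 / max 20,
# (300, 'tunnel_a') has count 1 / avg 5 / max 5
```

Three raw events collapse into two summary rows here; at 15k events/s the same idea is where the claimed ~90% volume reduction would come from, at the cost of up to 5 minutes of latency.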
       | 
       | The paragraph on integrity violation queries was helpful -- it
       | would be good to understand more of the query and latency
       | requirements.
       | 
       | The article is a great technical overview, but it's also helpful
       | to discuss whether this system is a viable business investment.
       | Sure they are making high margins, but why burn good cash on
       | something like this?
        
         | nemothekid wrote:
         | > _Of course that 's confidential, but I'd guess they are below
         | 10k concurrent connections_
         | 
         | I think 10k concurrent connections might be low? I've seen
         | ngrok used at places where you need a reverse proxy to some
         | device - each of those types of customers may have thousands of
         | agents alone.
        
           | tonymet wrote:
            | It's anyone's guess. I'm factoring in the fact that it's
            | a niche dev tool with lots of competition. Remember that
            | concurrent figures are 100x-1000x smaller than monthly
            | active users.
        
       | jmuguy wrote:
       | I wonder if this data collection is why Ngrok's tunnels are now
       | painfully slow to use. I've just gone back to localhost unless I
       | specifically need to test omniauth or something similar.
        
       ___________________________________________________________________
       (page generated 2024-09-30 23:01 UTC)