[HN Gopher] How we built ngrok's data platform
___________________________________________________________________
How we built ngrok's data platform
Author : samber
Score : 131 points
Date : 2024-09-30 07:35 UTC (15 hours ago)
(HTM) web link (ngrok.com)
(TXT) w3m dump (ngrok.com)
| LoganDark wrote:
| > Note that we do not store any data about the traffic content
| flowing through your tunnels--we only ever look at metadata.
| While you have the ability to enable full capture mode of all
| your traffic and can opt in to this service, we never store or
| analyze this data in our data platform. Instead, we use
| Clickhouse with a short data retention period in a completely
| separate platform and strong access controls to store this
| information and make it available to customers.
|
| Don't worry, your sensitive data isn't handled by our platform,
| we ship it to a third-party instead. This is for your protection!
|
| (I have no idea if Clickhouse is actually a third party, it
| sounds like one though?)
| leosanchez wrote:
| Clickhouse is a database. It has a cloud offering.
| faangguyindia wrote:
| What's the point of ClickHouse Cloud when you can just use
| BigQuery and run queries on billions of rows?
|
| I am genuinely curious what use case ClickHouse serves over
| BigQuery.
| FridgeSeal wrote:
| It's actually open source, you can self-host it easily
| enough, you can push a single instance pretty far too.
|
| It'll also happily read from disaggregated storage and is
| compatible with Parquet and friends, plus a stack of other
| formats. I've not really used BigQuery in anger, but
| ClickHouse performance is really, really good.
|
| I guess ultimately, all the same benefits, and a lot fewer
| downsides.
| tnolet wrote:
| - non-proprietary
|
| - open source
|
| - run it locally
|
| - SQL-like syntax
|
| - tons of plugins
|
| - not by Google
| IanCal wrote:
| A different platform doesn't mean third party. It can just mean
| you have completely separated things so that none of the data
| tooling discussed here has any ability to access it.
| LoganDark wrote:
| Not sure what you mean... Do you mean they run software
| called Clickhouse on their own infra, just separated from the
| other parts of their backend? To me it reads like they were
| shipping the data off to a third-party named Clickhouse,
| especially with "we never store or analyze this data in our
| data platform" (does data platform refer to ngrok itself or
| what?).
| IanCal wrote:
| There is a database called ClickHouse; while the company
| offers services and hosting, many run ClickHouse on their
| own infra.
| sippeangelo wrote:
| Clickhouse must be the worst named product in popular use. I
| know it's a DB, but every time I read it, it sounds like a
| marketing/Ads company for privacy invasive tracking.
| zurfer wrote:
| Kudos to the author who is responsible for the whole stack. A lot
| of effort goes into ingesting data into Iceberg tables to be
| queried via AWS Athena.
|
| But I think it's great that analytics and data transformation
| are distributed, so developers are also somewhat responsible
| for correct analytical numbers.
|
| In most companies there is a strong split between building
| product and maintaining analytics for the product, which
| leads to all sorts of inefficiencies and errors.
| 1a527dd5 wrote:
| Blimey, that is a lot of moving parts.
|
| Our data team currently has something similar and its costs are
| astronomical.
|
| On the other hand, our internal platform metrics are fired at
| BigQuery [1] and then we use scheduled queries that run daily
| (looking at the last 24 hours) to aggregate/export to parquet. And
| it's cheap as chips. From there it's just a flat file that is
| stored on GCS that can be pulled for analysis.
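The daily rollup described above can be sketched in plain Python. This is illustrative only: the event fields and the `rollup` helper are made up, and a real setup would do this inside a BigQuery scheduled query and export Parquet rather than run application code.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def rollup(events, now):
    """Aggregate raw events from the last 24 hours into per-metric
    count/sum/avg summaries -- the same shape a daily scheduled
    query would export. (Hypothetical fields, not BigQuery's API.)"""
    cutoff = now - timedelta(hours=24)
    agg = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for e in events:
        if e["ts"] >= cutoff:  # only the trailing 24-hour window
            m = agg[e["metric"]]
            m["count"] += 1
            m["sum"] += e["value"]
    return {k: {**v, "avg": v["sum"] / v["count"]} for k, v in agg.items()}

now = datetime(2024, 9, 30, tzinfo=timezone.utc)
events = [
    {"ts": now - timedelta(hours=1), "metric": "latency_ms", "value": 120.0},
    {"ts": now - timedelta(hours=2), "metric": "latency_ms", "value": 80.0},
    {"ts": now - timedelta(hours=30), "metric": "latency_ms", "value": 999.0},  # outside window
]
print(rollup(events, now))  # -> {'latency_ms': {'count': 2, 'sum': 200.0, 'avg': 100.0}}
```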
|
| Do you have more thoughts on Preset/Superset? We looked at both
| (slightly leaning towards cloud hosted as we want to move away
| from on-prem) - but ended up going with Metabase.
|
| [1] https://cloud.google.com/bigquery/docs/write-api
| otter-in-a-suit wrote:
| I'm the author, but posting as a private individual here, these
| being just my opinions and all that... but I can shed some more
| light on why I moved us to Superset.
|
| Preset is great, as are most of these tools' hosted versions!
| Lots of great folks working on these.
|
| But, tbh, as an infrastructure company, this is close to
| ngrok's core business - hosting another DB + K8s service is
| something we have great tooling for and lots of expertise in.
| And using ngrok makes it even easier.
|
| The whole dogfooding aspect is important too - if I don't run
| an app in production with ngrok I have a hard time empathizing
| with customers who want to do the same. My previous job
| encouraged that too and I've always liked that.
|
| Also, yes, lots of moving parts - but most of them are very
| reusable and they share a lot of code, infra, and
| logic/operations playbooks etc. Costs are manageable - Athena
| charges $5/TB scanned iirc, which tends to be the biggest
| factor.
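For a rough sense of scale, a back-of-envelope sketch of Athena's scan-based pricing, assuming the $5/TB figure mentioned above. The daily scan volume here is a made-up example, not ngrok's actual number.

```python
# Back-of-envelope Athena cost under scan-based pricing.
PRICE_PER_TB = 5.00           # $/TB scanned, as cited above
daily_scan_tb = 0.65          # hypothetical: ~650 GB scanned per day
monthly_cost = daily_scan_tb * PRICE_PER_TB * 30
print(f"${monthly_cost:.2f}/month")  # -> $97.50/month
```

Columnar formats like Parquet/Iceberg matter here precisely because pruning columns and partitions shrinks the bytes scanned, and therefore the bill.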
| spmurrayzzz wrote:
| I appreciate the time you took to write this all out (both
| the article and your response here). In particular, this line
| from the article resonated with my own experience over the
| last couple of decades:
|
| > This particular setup--viewing DE as a very technical,
| general-purpose distributed system SWE discipline and making
| the people who know best what real-world scenarios they want
| to model--makes our setup work in practice.
|
| The common analyst-to-DE path has some benefits for sure with
| respect to business-centric data modeling, but without the
| deep technical infrastructure investments and related
| support, the stack becomes a beast to deal with at scale (or
| just ends up being a massive cost on the balance sheet from
| outside vendor sourcing). You really need both verticals in
| order to be optimal IMO.
|
| Of course if internally an org doesn't already have the
| platform/infrastructure to dogfood in the first place, this
| admittedly makes the proposition a bit more of a gamble.
| 1a527dd5 wrote:
| Appreciate you taking the time to reply :)
|
| I guess the undertone of cynicism in my comment speaks to
| the question I didn't directly ask - how often does each
| of the components/moving parts fail and require manual
| intervention/fixing?
|
| I often get pulled into complex distributed systems and the
| team responsible for that flow (data or not) often have no
| idea where to begin.
|
| Edit: On the point of Athena, I desperately wanted to use it
| but found BigQuery to be much better in every way you could
| think of. It's the black sheep in the company, as every other
| cloud thing we have is AWS. But honestly, nothing I've found
| in the AWS circle comes close to BigQuery.
| datadrivenangel wrote:
| BigQuery + Metabase is such a powerful combination. Easy,
| affordable, effective.
| mritchie712 wrote:
| Blimey indeed. This is a lot of work to set up.
|
| I think you'll see more platforms that offer this setup as a
| service:
|
| * cheap storage / datalake
|
| * pipelines to get data to storage
|
| * BI / dashboards on top of the storage
|
| We're doing this at Definite (https://www.definite.app/) with
| Iceberg (same as in this post) + DuckDB as a query engine.
| 1a527dd5 wrote:
| We are waiting for Metabase to support DuckDB on their cloud
| version. It's pretty neat.
| valzam wrote:
| I pity the developer who has to maintain tagless final plumbing
| code after the "functional programming enthusiast" moves on... in
| a Go-first org, no less.
| epgui wrote:
| I would much rather inherit an FP data pipeline than anything
| else. You do realize data pipelines (and distributed computing)
| are an ideal use case for FP?
| pjmlp wrote:
| I guess the issue being pointed out is the choice in a Go
| culture shop, and we all know their common point of view
| regarding "fancy" languages.
| epgui wrote:
| It's not clear to me why having two different sets of
| tooling for solving two different kinds of problems, is an
| issue.
|
| In most well-resourced companies, you're probably not going
| to have to ask your Go engineers to fix data pipelines in
| Scala.
| pjmlp wrote:
| That is why they pointed out leaving the company as a
| possible scenario.
|
| As for well-resourced, I guess it depends; that variable
| usually doesn't mean much, as we can see from companies
| firing whole departments while swimming in profits.
| otter-in-a-suit wrote:
| Author here. This decision went through all proper architecture
| channels, including talks with our engineers, proof of concepts
| and the like.
|
| I've been doing this too long to shoehorn in my pet languages
| if I didn't think they were a good fit. And I think that Scala/FP
| + Flink _is_ a good fit for this use case.
|
| We did also explore the Go ecosystem fwiw - the options there
| are limited (especially around data tooling like Iceberg)
| and Go is simply not a language that's popular enough in the
| data world.
|
| Python's typing system (or lack thereof) is a huge hindrance
| in this space in general (imo), and Java didn't cause many
| happy faces on the Eng team either, but it's certainly an
| option. I just find FP semantics a better fit for data /
| streaming work (lots of map and flat map anyways), and Scala
| makes that easy.
|
| Also no cats/zio - just some tagless final _inspired_
| composition and type classes. Not too difficult to reason
| about, not using any obscure patterns. I even mutate references
| sometimes. :-)
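The map/flatMap style mentioned here can be illustrated with a tiny Python sketch. The log lines are hypothetical, and Flink's actual DataStream API is Java/Scala; this only mirrors the semantics.

```python
from typing import Callable, Iterable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def fmap(f: Callable[[A], B], xs: Iterable[A]) -> Iterator[B]:
    # one-to-one transform, analogous to a streaming map operator
    return (f(x) for x in xs)

def flat_map(f: Callable[[A], Iterable[B]], xs: Iterable[A]) -> Iterator[B]:
    # one-to-many transform: each input may yield zero or more outputs,
    # which makes dropping malformed records a natural special case
    return (y for x in xs for y in f(x))

# hypothetical raw events: comma-separated connection logs
raw = ["conn1,200", "bad-line", "conn2,500"]

parsed = flat_map(lambda s: [s.split(",")] if "," in s else [], raw)  # drop malformed
codes = fmap(lambda rec: int(rec[1]), parsed)
print(list(codes))  # -> [200, 500]
```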
| boltzmann-brain wrote:
| scala? why not haskell instead?
| otter-in-a-suit wrote:
| Not assuming you're serious, but in any case: the reason is
| the JVM (+ Scala) ecosystem in the data space.
| epgui wrote:
| FWIW, I do believe there is a serious case to be made for
| Haskell... But it's probably beyond the scope of this
| context / would require changing many other decisions.
|
| If integrating with java tools was important then
| personally I'd ask "why not Clojure".
|
| :)
| atomicnumber3 wrote:
| Spark is written in Scala, and Scala is its first-class
| language - other languages suffer from either second-class
| APIs (Java) or from codec/serde overhead (pyspark)
| (though pyspark is actually also missing a few APIs that
| Scala has, as well).
| atomicnumber3 wrote:
| I'm assuming the parent commenter hasn't worked in data/spark
| before either. The functional rabbit hole goes WAY deeper
| than even just cats et al, and Scala and Spark themselves
| both encourage a fair amount of functional-style code on
| their own.
| moandcompany wrote:
| There was a prior effort to create a Golang SDK for Apache Beam
| https://beam.apache.org/documentation/sdks/go/
|
| The Beam Golang SDK work came from Googlers working on Beam
| who were Golang fans, and internally there were
| Golang-oriented tools for batch data processing that needed a
| migration path forward.
|
| Historical Note: Apache Beam also originated from Google as
| "Dataflow"
| Fripplebubby wrote:
| I found the technical details really interesting, but I think
| this gem applies more broadly:
|
| > I find this is often an artifact of the DE roles not being
| equipped with the necessary knowledge of more generic SWE tools,
| and general SWEs not being equipped with knowledge of data-
| specific tools and workflows.
|
| > Speaking of, especially in smaller companies, equipping all
| engineers with the technical tooling and knowledge to work on all
| parts of the platform (including data) is a big advantage, since
| it allows people not usually on your team to help on projects as
| needed. Standardized tooling is a part of that equation.
|
| I have found this to be so true. SWE vs DE is one division where
| this applies, and I think it also applies for SWE vs SRE (if you
| have those in your company), data scientists, "analysts",
| basically anyone who is in a technical role should ideally know
| what kinds of problems other teams work on and what kinds of
| tooling they use to address those problems so that you can cross-
| pollinate.
| anonzzzies wrote:
| I too see this; I have a big hole in my DE knowledge, even
| though I manage a _lot_ of data for our clients. I just work
| from experience and have been using more or less the same tech
| for decades (with upgrades and one major 'newer' addition:
| Clickhouse). I try to learn DE stuff, but I do find it
| particularly hard because I'm NOT a DE but an SWE, so I really
| quickly fall back on the tooling I already know and love and
| see very little reason to use anything else.
|
| So is there something like 'DE for SWEs' someone would
| recommend here?
| moandcompany wrote:
| At the end of the day, we're all pushing protobufs from place to
| place
| tonymet wrote:
| Why AWS, Azure & GCP are printing money.
| tonymet wrote:
| A 15k/s event rate and 650GB of volume / day is massive. Of
| course that's confidential, but I'd guess they are below 10k
| concurrent connections. So they are recording 1.5 events /
| second / user. Does every packet need discrete & real-time
| telemetry? I've seen games with millions of active users only
| hit 30k concurrents, and this is a developer tool.
|
| Most events can be aggregated over time with a statistic (count,
| avg, max, etc). Even discrete events can be aggregated with a 5
| min latency. That should reduce their event volume by 90%. Every
| layer in that diagram is CPU wasted on encode-decode that costs
| money.
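The windowed-aggregation idea can be sketched in stdlib Python; the event shapes and numbers below are made up for illustration.

```python
from collections import defaultdict

WINDOW = 300  # 5-minute buckets, in seconds

def aggregate(events):
    """Collapse discrete (timestamp, metric, value) events into
    (window, metric) -> count/max buckets. With many events per
    window, emitted volume shrinks by roughly the events-per-window
    factor."""
    buckets = defaultdict(lambda: {"count": 0, "max": float("-inf")})
    for ts, metric, value in events:
        b = buckets[(ts // WINDOW, metric)]
        b["count"] += 1
        b["max"] = max(b["max"], value)
    return dict(buckets)

# 900 per-second events spanning three 5-minute windows
events = [(t, "bytes", float(t % 7)) for t in range(0, 900)]
out = aggregate(events)
print(len(events), "->", len(out))  # 900 -> 3
```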
|
| The paragraph on integrity violation queries was helpful -- it
| would be good to understand more of the query and latency
| requirements.
|
| The article is a great technical overview, but it's also helpful
| to discuss whether this system is a viable business investment.
| Sure they are making high margins, but why burn good cash on
| something like this?
| nemothekid wrote:
| > _Of course that's confidential, but I'd guess they are below
| 10k concurrent connections_
|
| I think 10k concurrent connections might be low? I've seen
| ngrok used at places where you need a reverse proxy to some
| device - each of those types of customers may have thousands of
| agents alone.
| tonymet wrote:
| it's anyone's guess. i'm factoring in the fact that it's a
| niche dev tool with lots of competition . remember that
| concurrent figures are 100x -1000x smaller than monthly
| active users
| jmuguy wrote:
| I wonder if this data collection is why Ngrok's tunnels are now
| painfully slow to use. I've just gone back to localhost unless I
| specifically need to test omniauth or something similar.
___________________________________________________________________
(page generated 2024-09-30 23:01 UTC)