https://ngrok.com/blog-post/how-we-built-ngroks-data-platform

How we built ngrok's data platform

Company, Python, Cool tools | September 26, 2024 | 15 min read | by Christian Hollinger

At ngrok, we manage an extensive data lake with an engineering team of one (me!). This article is a look at how we built it and what we learned, along with some selective deep dives I found interesting enough to be worth sharing in more detail, since they bridge the gap between what people usually understand by the term "data engineering" and how we run data here at ngrok. Some of this might even be useful (or, at the very least, interesting!) for your own data platform endeavors, whether your team is big or small.

Data we store

First of all, as an ngrok user, it's important to understand what data we store and what we use it for, especially when talking about building a data platform. I want to be as transparent and upfront as possible about this, since talking about storing and processing data will always raise valid privacy concerns. My colleague Ari recently wrote in depth about what personal customer data we store, and you can always find up-to-date compliance information in our Trust Center.

What we store

To give you a concise overview, we generally store and process:

* Data from our globally distributed Postgres instances, which contain customer data such as account IDs, user-to-account mappings, certificates, domains, and similar records.
* Data from our metering and usage cluster, which tracks resource usage.
* Subscription and payment information (excluding credit card information).
* Third-party data, such as support interactions from Zendesk.
* Metadata about the movement of events through the various components that make up ngrok's infrastructure.
* Purpose-built product signals, also captured as real-time events (more on those two later!).

Note that we do not store any data about the traffic content flowing through your tunnels; we only ever look at metadata. While you have the ability to enable full capture mode for all of your traffic and can opt in to this service, we never store or analyze this data in our data platform. Instead, we use Clickhouse with a short data retention period in a completely separate platform, with strong access controls, to store this information and make it available to customers.

I hope this demystifies our data processing efforts a bit, so we can talk about the engineering side of things.

Why data engineering is different at ngrok (and probably not what you think)

We hired the first full-time data person (it's me!) in the summer of 2023. Before we filled that position, the data platform was set up by our CTO, Peter. Because of that, our data engineering (DE) team (or rather, our data role) is part of the Office of the CTO and effectively works horizontally across our engineering organization.

One artifact of having such a small team is that our DE work is much more closely aligned with holistic backend engineering work than the term "data engineering" often implies. While I'm primarily responsible for the Data Platform and all associated services and tooling, I frequently find myself working on the actual ngrok products (as opposed to "just" the data lake). That includes architecture proposals and designs that span multiple teams and services and are often only tangentially related to the "core" data work. And yes, I do write a fair bit of Go because of it.

As such, the majority of the data modeling work (i.e., SQL) is done by subject matter experts, which is very different from DE roles in many other organizations. In other words, I write very little SQL on a day-to-day basis and won't usually be the person who writes a data model. Among those subject matter experts, some write reusable, well-structured dbt models, while others focus on ad hoc analysis (based on those models) via our BI tooling in Superset. It's worth noting that our entire organization has access to Superset and most data models and sources!

For instance, our growth team implements a huge amount of our actual data models and knows the business, product, and finance side much better than I do. They're much better equipped to build sensible, complex, and maintainable data models that properly answer business questions.

In addition to that, we have a small-but-mighty infrastructure team that owns and operates our core infrastructure, such as Kubernetes, as well as the developer tools that are instrumental in keeping every engineering team (and, of course, ngrok itself!) running smoothly.

This particular setup, which treats DE as a very technical, general-purpose distributed-systems SWE discipline and lets the people who know the real-world scenarios best build the models for them, is what makes our approach work in practice.

Our data platform architecture

Now that you know how our (data) engineering org is structured, let's talk about what we actually built and maintain, and how it evolved over time.

ngrok as a product is a large, distributed, worldwide networking system with very high uptime guarantees, huge traffic volumes, and a large set of inherently eventually consistent (and often ephemeral) data that can be difficult to interpret. Think about how a TCP connection works, and what steps are involved in making and routing it!
In other words, even our basic architecture was a bit more complex than "just copy your Postgres database somewhere to query it offline."

ngrok's data architecture in the past

Despite that, our original architecture was utilitarian and relied more heavily on AWS tools than our contemporary architecture, which is very open-source focused. Its primary goal was to get the most important data sets we needed (Postgres, important external data, and some events) into a place where we could effectively run finance reporting, abuse handling, and support.

On the batch ingestion side, we used Airbyte open source on Kubernetes to ingest third-party data via their respective APIs. We utilized the ngrok OAuth module for authentication (as we do for all our open-source services that require an ingress controller). Airbyte wrote JSON files, whose schema we determined with manual runs of a Glue parser and several Python scripts that created the target schemas, plus another Glue job that wrote the target schema out as Iceberg. At the time, we did not have an orchestrator available and relied on Glue's internal schedules. This meant we had no alerting or any integration with on-call tools.

We used AWS DMS to get our core Postgres data, writing parquet data to S3. This was a once-a-day batch job.

On the streaming side, we streamed event metadata via AWS Firehose, writing JSON data to two different S3 locations.

[Image]

For analytics, all our data was (and still is) eventually stored as Apache Iceberg and generally queried via AWS Athena, although the legacy architecture did have some datasets based on raw JSON in the mix. We used AWS Glue as a metastore. Our SQL models were actually SQL views created directly in Athena, in production, with no version control or lineage, and queried via Preset (the managed cloud version of Superset).

Expensive queries and unreasonable models

Our eventing system, which is core to understanding system behavior, relied on a very pricey AWS Firehose pipeline, as the way we split and organized events required us both to write JSON data (creating hundreds of TiB of data) and to maintain data-platform-specific Go code in otherwise purely customer-facing services (see the later section on Apache Flink, Scala, and Protobuf). Some of the data became straight-up impossible (or very expensive) to query, as queries would time out despite tuning with partitions and other tricks. The entire system was on borrowed time from the start.

It was also hard, if not impossible, to reason about our models, since we lacked all of dbt's (or a comparable tool's) creature comforts, such as lineage, documentation, version control, auditing, tests, and so on. Without expecting you to be able to grok the details here, imagine getting asked why a certain field looks suspicious (if not to say, wrong) at the very end of this lineage tree:

[Image: lineage tree]

...without having this lineage tree available, of course.

In a similar vein, not having a central orchestrator, alerting, and auditing for our data jobs was an operational challenge (you can learn more about how we solved those two issues here). Our data stack was also not integrated very deeply into our Go monorepo and tooling, missing things like Datadog monitors and metrics, good test coverage, and style guides enforced via CI (see the Go monorepo section below).

Lastly (at least for the scope of this article), Airbyte and Glue have been a challenge to get right; we'll tell you how we eventually did a few sections from now.
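To make the lack of orchestration concrete: the legacy ingestion path boiled down to kicking off a crawler and a Glue job and hoping for the best, with nothing alerting us if either failed. A minimal sketch of that flow, with hypothetical crawler, job, and argument names rather than our actual configuration:

```python
# Sketch of the legacy, unorchestrated ingestion flow (hypothetical names).
# There was no central scheduler, alerting, or on-call integration around this.
import time

import boto3

glue = boto3.client("glue")

# 1. Crawl the raw JSON that Airbyte dropped on S3 to (re-)detect schemas.
glue.start_crawler(Name="airbyte-raw-json-crawler")

# 2. Poll until the crawler is done, since nothing else was tracking it.
while glue.get_crawler(Name="airbyte-raw-json-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

# 3. Kick off the Glue job that rewrote the crawled tables as Iceberg.
run = glue.start_job_run(
    JobName="raw-json-to-iceberg",
    Arguments={"--source_database": "raw"},
)
print("started job run:", run["JobRunId"])

# If the crawler mis-detected a schema or the job failed, nobody got paged;
# you found out when a query broke or a dashboard looked wrong.
```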
ngrok's data architecture now

Our modern data platform is more heavily based around open-source tools that we self-host on Kubernetes, dogfooding ngrok, with some AWS-native tools in the mix. A simplified view of the architecture we built to solve these challenges looks like this:

[Image: current architecture]

All our batch ingestion is now run and orchestrated via Dagster, which I've written about previously. We still use Airbyte, still fronted by ngrok, but we now write directly to Glue and maintain our schemas as Terraform generated by querying the Glue API.

For streaming data (which is where most of our volume and complexity comes from), we now run Apache Flink to consume Protobuf messages directly from Kafka, rather than relying on Firehose and internal services. We'll also cover this in more detail in a bit.

Our database ingestion still uses DMS, but now mostly relies on streaming writes, which are faster and more efficient (when responding to a support request, you don't want yesterday's data!).

For analytics, we now rely heavily on dbt, and we self-host the open-source version of Apache Superset. We also added a hosted version of the dbt docs, of course also dogfooded behind an ngrok endpoint.

Technical deep-dives and problem-solving

While we can't get into the details of every challenge we solved in the past 12 or so months, here are some I found especially interesting as a software engineer.

Collaborating on data and infra in a Go monorepo

Most of ngrok's code base is written in Go and lives in a monorepo. We run Bazel for build tooling and Nix as a package manager. This gives us reproducible developer environments, as well as reasonably fast compile and build times and, by proxy, fast CI.

As most of our data infrastructure is written in Python and Scala, we had to adapt our workflow to this environment, because it's important to us to integrate the data stack with the rest of the engineering organization at ngrok. Speaking from experience, having a completely separate data engineering team or department will eventually result in a fragmented engineering environment, with many bespoke paths that are not applicable to all team members, usually causing huge support and maintenance burdens on individual teams (e.g., maintaining two or more iterations of a CI/CD system). Having one deployment system that all engineers use is much easier and can be maintained by one infrastructure team:

[Image]

I find this is often an artifact of DE roles not being equipped with the necessary knowledge of more generic SWE tools, and of general SWEs not being equipped with knowledge of data-specific tools and workflows. Speaking of which, especially in smaller companies, equipping all engineers with the tooling and knowledge to work on all parts of the platform (including data) is a big advantage, since it allows people not usually on your team to help on projects as needed.

Standardized tooling is a part of that equation. For instance, we have an internal, Go-based developer CLI called nd that does a lot of heavy lifting (think "abstract kubectl commands for half a dozen clusters"). We also use it to run diffs between a developer's branch and the expected state, for instance to enforce formatting and code styles. Our CI runners run NixOS.
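The real nd integration is written in Go and wired into Bazel and Nix, but the underlying idea of those formatting checks is simple: run the formatter, compare against what's checked in, and fail CI on any difference. Here's a minimal Python sketch of that diff-against-expected-state idea; the file layout and the formatter invocation (shandy-sqlfmt's sqlfmt --diff, more on that tool below) are illustrative assumptions, not our actual setup:

```python
# Minimal sketch of a "diff against expected state" formatting check,
# in the spirit of `nd sql fmt` (hypothetical layout and formatter flags).
import pathlib
import subprocess
import sys


def check_sql_formatting(root: str = ".") -> int:
    """Return non-zero if any dbt model under `models/` is not formatted."""
    failed = []
    for path in sorted(pathlib.Path(root).rglob("models/**/*.sql")):
        # `sqlfmt --diff` prints a diff and exits non-zero if the file
        # would be rewritten; we only record which files are out of date.
        result = subprocess.run(
            ["sqlfmt", "--diff", str(path)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failed.append(str(path))
    if failed:
        print("files not formatted:", *failed, sep="\n  ")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(check_sql_formatting())
```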
So, for our data work, enforcing standards around dbt models involved a custom Nix package for shandy-sqlfmt, which we use as the standard for formatting all our dbt models, as well as integration into our nd tool, so developers (as well as CI) have access to nd sql fmt, just as they have nd go fmt. While this does involve additional work for me, it ensures data tooling is never the "odd one out" and that ramping onto data work (or vice versa) is much less of a cognitive shift.

Other integrations we've added over time include:

* Bespoke, modern Python tooling (not only for our data tools), such as poetry and ruff, as well as enforcement of style and static analysis via CI.
* Smart sbt caches for Scala, since Scala + Bazel is not something we've explored in depth.
* Datadog monitors and metrics, including custom metric interfaces for all our data tools.
* Integration with our on-call tooling, such as Slack alerts, OpsGenie, and others.
* Various custom Nix derivations.

Wrestling schemas and structs between Airbyte and Glue

A more "data-specific" challenge we've dealt with is complex schemas in Airbyte that often don't match the actual data or are otherwise incompatible with our query engine, which is something I'm sure a lot of you are familiar with. With a team of one, I can't reasonably write individual jobs for individual tables or sources that handle all corner cases, as we simply have too large and diverse a set of data sources. I, and others, have to rely on code generation and automated processing. This holds true for all data tools, not just Airbyte.

Originally, we wrote JSON files to S3, which supported whatever arbitrary data and schema changes might happen, and ran AWS Glue crawlers on top of these files to detect the schema and create "raw" tables. JSON is conceptually nice for this, since it can deal with arbitrary schemas. Using a parquet writer to S3, for example, would rely on the source schema being 100% accurate and would have to deal with an array of limitations.

Glue crawlers, on paper, support table versions and table evolution. But we quickly realized that these crawlers were very unreliable, especially with changing schemas or differences between dev and prod. This resulted in schemas that either couldn't be queried at all or reported incorrect data. We experimented with custom schema detection logic, which gave us more control over parameters like sample size, look-back windows, and corner cases, but found that burdensome to manage, despite using existing libraries.

Part of this was AWS Glue's odd way of storing structs, which are (deceptively so) depicted as arbitrarily deep JSON objects in the Glue web UI:

```json
{
  "payment_method_configuration_details": {
    "id": "string",
    "parent": "string"
  }
}
```

Whereas the API describes these fields as:

```json
{
  "Name": "payment_method_configuration_details",
  "Type": "struct<id:string,parent:string>"
}
```

The same representation shows up in describe table and show create table statements via Athena:

```sql
CREATE EXTERNAL TABLE `...`(
  `status` string COMMENT 'from deserializer',
  `payment_method_configuration_details` struct<id:string,parent:string> COMMENT 'from deserializer',
```

This very custom, bespoke way of describing structs meant that standard "JSON to JSON schema" parsers would not work out of the box. While it is possible to work around some of this, it became a very convoluted problem, given that some AWS APIs are notoriously convoluted themselves; the Glue API expects the struct<...> syntax, for instance. It's also arguably not a problem we should need to solve.
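For reference, this is roughly what pulling those type strings straight from the Glue API with boto3 looks like; the database and table names below are placeholders rather than our actual catalog:

```python
# Print the column types Glue reports for a table (hypothetical names).
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="airbyte_raw", Name="stripe_charges")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    # Nested fields come back as Glue's bespoke type string,
    # e.g. struct<id:string,parent:string>, or struct<> in the broken cases.
    print(column["Name"], "->", column["Type"])
```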
So, we settled on using the Airbyte Glue destination connector, which creates the tables directly in Glue based on the reported schema of the source. This eliminates the additional hop (and point of failure) of running a crawler entirely and ensures we at least get a valid Glue table (albeit not necessarily a valid Athena table).

But it still does not solve the issue of fields being reported incorrectly at times, usually directly by the API. For instance, the table stripe.charges cannot be queried, because Athena returns a TYPE_NOT_FOUND: Unknown type: row error. Trying to get a DDL will yield a java.lang.IllegalArgumentException: Error: name expected at the position 204. Keep in mind that this was entirely set up by Airbyte, with no human involvement or custom code run yet. Position 204, for those who are curious, looks like this:

```
total_count:decimal(38)>:boolean:struct<>:
```

This struct<> field can't be queried in Athena.

To solve this, we now have a post-processing step that turns each struct field into a custom tree data structure, maps or removes invalid types at arbitrary depth (such as struct<>), and generates a Terraform representation of each table, so we get uniform environments between dev and prod. It also creates a flattened table that maps each nested field to a flat field with the appropriate type. We do this to maximize compatibility with Iceberg and to make queries more ergonomic for users. This works by querying the Glue API and applying some basic data structures and algorithms.

For instance, the following field might present as:

[Image]

But it really contains the following data, as reported to Glue (note the object under subscription_items):

```json
{
  "pending_update": {
    "trial_end": "int",
    "expires_at": "int",
    "trial_from_plan": "boolean",
    "subscription_items": [
      {
        "id": "string",
        "price": {
          "id": "string",
          "type": "string",
          // ...
        }
      }
    ]
  }
}
```

If any of these fields are invalid, we map them to valid names or remove them; for instance, a struct<>, or simply a field with a name that's invalid in Athena but valid in Glue. Invalid names like "street-name" or "1streetname" need to be escaped with " in Athena, but that escaping cannot be used in nested fields, which are very common in JSON. Airbyte also doesn't have a bigint type, but Athena does; if a schema reports an Athena integer type, we generally map it to a bigint to be safe, since a value >= 2^32 will cause Athena queries to fail. We also normalize other types, such as decimal(38).

All of this results in a parsed (stripped) tree similar to this, with types attached to the nodes:

[Image]

We can now traverse and mutate this tree, e.g. change names or types, or set deletion tombstones. This would yield a field in the "raw" table as:

```
columns {
  name = "pending_update"
  type = "struct
```