[HN Gopher] IOx: InfluxData's New Storage Engine
___________________________________________________________________
IOx: InfluxData's New Storage Engine
Author : resizeitplz
Score : 129 points
Date : 2022-10-26 14:51 UTC (8 hours ago)
(HTM) web link (www.influxdata.com)
(TXT) w3m dump (www.influxdata.com)
| toinbis wrote:
| Happy longtime Influxdb user here. I wanted to congratulate Paul
| and the team on reaching this milestone. Followed IOx development
| a bit - can't wait to finally test it out!
| michael_j_ward wrote:
| Just want to say congratulations to the team!
|
| 2 years and 9,500+ commits is a hell of a feat.
|
| https://github.com/influxdata/influxdb_iox
| mildbyte wrote:
| Just wanted to also give a shout out to Apache DataFusion[0] that
| IOx relies on a lot (and contributes to as well!).
|
| It's a framework for writing query engines in Rust that takes
| care of a lot of heavy lifting around parsing SQL, type casting,
| constructing and transforming query plans and optimizing them.
| It's pluggable, making it easy to write custom data sources,
| optimizer rules, query nodes etc.
|
| It has very good single-node performance (there's even a way to
| compile it with SIMD support), and Ballista [1] extends it into
| a distributed query engine.
|
| Plenty of other projects use it besides IOx, including
| VegaFusion, ROAPI, Cube.js's preaggregation store. We're heavily
| using it to build Seafowl [2], an analytical database that's
| optimized for running SQL queries directly from the user's
| browser (caching, CDNs, low latency, some WASM support, all that
| fun stuff).
|
| [0] https://github.com/apache/arrow-datafusion
|
| [1] https://github.com/apache/arrow-ballista
|
| [2] https://github.com/splitgraph/seafowl
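To make the "constructing and transforming query plans and optimizing them" part concrete, here is a minimal sketch of a logical plan tree and one optimizer rule (pushing a filter below a projection). It is a toy in plain Python, not DataFusion's API: the `Scan`, `Filter`, and `Project` classes and the `push_down_filters` function are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str  # leaf node: read every row of a table


@dataclass(frozen=True)
class Filter:
    column: str  # column the predicate reads
    op: str      # comparison operator, e.g. ">"
    value: float
    child: "Plan"


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]  # plain column selection, no renames
    child: "Plan"


Plan = Union[Scan, Filter, Project]


def push_down_filters(plan: Plan) -> Plan:
    """Rewrite Filter(Project(x)) into Project(Filter(x)) where safe."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Project):
        return Project(plan.columns, push_down_filters(plan.child))
    # plan is a Filter: optimize the subtree first, then try to swap.
    child = push_down_filters(plan.child)
    if isinstance(child, Project) and plan.column in child.columns:
        # The projection only selects columns, so filtering before it
        # yields the same rows while touching less data.
        pushed = push_down_filters(
            Filter(plan.column, plan.op, plan.value, child.child)
        )
        return Project(child.columns, pushed)
    return Filter(plan.column, plan.op, plan.value, child)
```

A real engine carries many such rules and a far richer plan representation. The point of the sketch is only that an optimizer rule is a function from one plan tree to an equivalent, cheaper one.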
| pauldix wrote:
| DataFusion is great, we're happy to be contributing to it. Also
| excited to see so many people around the world picking it up
| and contributing as well. With our development efforts on IOx,
| it's like a strong tailwind. But we put a ton of effort into
| helping manage community efforts (thanks, alamb! our developer
| on IOx who is also on the Arrow PMC).
| andygrove wrote:
| Original author of DataFusion/Ballista here. Having alamb and
| others from InfluxData involved has been a huge help in
| driving the project forward and helping build an active
| community behind the project. It is genuinely hard to keep up
| with the momentum these days!
| menaerus wrote:
| Hi, I just had a glance over the DataFusion project. Very
| interesting work that I will definitely be keeping track
| of, but I've got a genuine question. Do you sometimes find
| development in Rust a little bit challenging for large-scale
| and performance-sensitive work?
|
| I say this because I've noticed quite a few PRs fixing
| (large) performance regressions which, to my understanding,
| were mostly introduced due to unforeseen or unexpected Rust
| compiler subtleties that led to less than optimal code
| generation. One example was a naive, simple-looking
| abstraction that brought performance down by something like
| 50% in TPC-H benchmarks. This really struck me, especially
| because it seems quite hard to identify the root cause, and
| I would like to hear about your experiences firsthand.
| Thanks a bunch!
| nevi-me wrote:
| Your initial experiments and decision to build on arrow-rs
| have been great for the project. Thank you and everyone
| involved.
| ignoramous wrote:
| > _We're heavily using it to build Seafowl, an analytical
| database that's optimized for running SQL queries directly from
| the user's browser..._
|
| Interesting. Where does _seafowl_ fit in when I compare it
| with, say, a data-stack-in-a-box approach, e.g. meltano + dbt
| + duckdb + superset [0]? Is my thinking right that _seafowl_
| possibly replaces both _duckdb_ (with IOx) and superset (if
| there's a web front-end)?
|
| Incidentally, dagster had an article up just yesterday making a
| case for a poor man's data lake with dbt + dagster + duckdb [1].
| What does _splitgraph_ replace if I were to use _it_ in a
| similar setup?
|
| Thanks.
|
| [0] https://archive.is/DxU1e
|
| [1] https://archive.is/5ikU4
| mildbyte wrote:
| Great question! With Seafowl, the idea is different from what
| the modern data stack addresses. It's trying to simplify
| public-facing Web-based visualizations: apps that need to run
| analytical queries on large datasets and can be accessed by
| users all around the world. This is why we made the query API
| easily cacheable by CDNs and Seafowl itself easy to deploy at
| the edge, e.g. with Fly.io.
|
| It's a fairly different use case from DuckDB (query execution
| for Web applications vs fast embedded analytical database for
| notebooks) and the rest of the modern data stack (which
| mostly is about analytics internal to a company). Just to
| clarify, we're not related to IOx directly (only via us both
| using Apache DataFusion).
|
| If we had to place Seafowl _inside_ of the modern data stack,
| it'd be mostly a warehouse, but one that is optimized for
| being queried from the Internet, rather than by a limited set
| of internal users. Or, a potential use case could be
| extracting internal data from your warehouse to Seafowl in
| order to build public applications that use it.
|
| We don't currently ship a Web front-end and so can't serve as
| a replacement to Superset: it's exposed to the developer as
| an HTTP API that can be queried directly from the end user's
| Web browser. But we have some ideas around a frontend
| component: some kind of a middleware, where the Web app can
| pre-declare the queries it will need to run at build time and
| we can compute some pre-aggregations to speed those up at
| runtime. Currently we recommend querying it with Observable
| [0] for an end-to-end query + visualization experience (or
| use a different viz library like d3/Vega).
|
| Re: the second question about Splitgraph for a data lake, the
| intention behind Splitgraph is to orchestrate all those tools
| and there the use case is indeed the modern data stack in a
| box. It's kind of similar to dbt Labs's Sinter [1] which was
| supposed to be the end-to-end data platform before they
| focused on dbt and dbt Cloud instead: running Airbyte
| ingestion and dbt transformations, acting as a data warehouse
| (using PostgreSQL and a columnar store extension), and letting
| users organize and discover data, all at the same time. There's
| a lot of baggage in Splitgraph though, as we moved through a few
| iterations of the product (first Git/Docker for data, then a
| platform for the modern data stack). Currently we're thinking
| about how to best integrate Splitgraph and Seafowl in order to
| build a managed pay-as-you-go Seafowl, kind of like Fauna [2]
| for analytics.
|
| Hope this helps!
|
| [0] https://observablehq.com/@seafowl/interactive-visualization-...
|
| [1] https://www.getdbt.com/blog/whats-in-a-name/
|
| [2] https://fauna.com/
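One way to picture a query API that is "easily cacheable by CDNs": give every distinct query its own stable GET URL, so identical queries from different browsers hit the same cache entry. The sketch below is an assumption about how such a scheme could look, not a description of Seafowl's actual endpoints. The `/q/<digest>` path, the `data_version` counter, and all function names are hypothetical.

```python
import hashlib


def normalize_query(sql: str) -> str:
    """Trim surrounding whitespace only; the query text itself is untouched."""
    return sql.strip()


def cacheable_url(base_url: str, sql: str) -> str:
    """Map a query to a stable GET URL derived from its SHA-256 digest."""
    digest = hashlib.sha256(normalize_query(sql).encode("utf-8")).hexdigest()
    return f"{base_url.rstrip('/')}/q/{digest}"


def etag(sql: str, data_version: int) -> str:
    """Build an ETag that changes whenever the (hypothetical) data version does."""
    payload = f"{data_version}:{normalize_query(sql)}".encode("utf-8")
    return '"' + hashlib.sha256(payload).hexdigest()[:16] + '"'


if __name__ == "__main__":
    sql = "SELECT host, AVG(usage) FROM cpu GROUP BY host"
    print(cacheable_url("https://example.com", sql))
    print(etag(sql, data_version=1))
```

Because the URL depends only on the query text, a CDN can cache the response like any static asset. The ETag lets the origin answer revalidation requests cheaply until the underlying data changes.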
| PaulWaldman wrote:
| >Unbounded cardinality
|
| This has been the largest criticism of InfluxDB in the past.
| Kudos to the team for acknowledging and solving it!
|
| > IOx supports SQL natively and our cloud customers can connect
| using Postgres-compatible clients like psql, Grafana's Postgres
| data source, and BI tools like PowerBI and Tableau.
|
| Initially InfluxDB had InfluxQL, a SQL-like language for querying
| data. Then they transitioned to Flux, indicating it was superior
| to writing complex SQL queries over time series data. Now they
| are highlighting native SQL support. Since this was only
| announced today, hopefully there will be clear messaging on which
| query languages will be supported going forward.
|
| It's also worth noting that queries can be executed over an
| HTTP API that platforms like PowerBI can consume today.
|
| >First introduced in 2020 as the open source project InfluxDB
| IOx, the new storage engine is the product of sustained
| development by InfluxData and considerable contribution from the
| InfluxDB open source developer community. Today, the new engine
| based on IOx arrives first in InfluxData's multi-tenant InfluxDB
| Cloud service, available to developers worldwide.
|
| Will this later be available in an OSS package for self-hosting?
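For anyone wondering why cardinality was such a sore point: in a design that tracks every series separately, the number of series is the number of distinct tag combinations, which grows multiplicatively with each new tag. Here is a rough sketch of that arithmetic. The function names and the `measurement,tag=value` key layout are simplifications for illustration, not InfluxDB's internal format.

```python
from itertools import product
from typing import Dict, Iterator, List


def worst_case_cardinality(tag_values: Dict[str, List[str]]) -> int:
    """Upper bound on series count: the product of distinct values per tag."""
    total = 1
    for values in tag_values.values():
        total *= len(values)
    return total


def series_keys(measurement: str, tag_values: Dict[str, List[str]]) -> Iterator[str]:
    """Enumerate every possible series key for one measurement."""
    names = sorted(tag_values)
    for combo in product(*(tag_values[name] for name in names)):
        tags = ",".join(f"{name}={value}" for name, value in zip(names, combo))
        yield f"{measurement},{tags}"


if __name__ == "__main__":
    tags = {"host": ["a", "b", "c"], "region": ["eu", "us"]}
    print(worst_case_cardinality(tags))
    # A high-churn tag such as a container ID multiplies the total.
    tags["container_id"] = [f"c{i}" for i in range(1000)]
    print(worst_case_cardinality(tags))
```

A single tag with unbounded values (request IDs, container IDs, and so on) is enough to blow this number up, which is why it matters whether the engine's cost scales with series count.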
| pauldix wrote:
| Hi, post author and founder of InfluxDB here. We're supporting
| Flux (our scripting and query language), InfluxQL (our original
| SQL-like language), and SQL (specifically the Postgres dialect
| as that's what DataFusion supports). The query engine is
| DataFusion, which is part of the Apache Arrow project. We
| contribute to it significantly. So that's what's built in
| natively. We support Flux and InfluxQL through separate Go
| processes that use an API to connect to the core DB, although
| we're working on native InfluxQL support (a Rust-based InfluxQL
| parser that will yield DataFusion logical query plans).
|
| Right now we're focused on our cloud offering. We'll have
| official open source releases and documentation in the future.
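As a small illustration of what plain SQL over time series data can look like, the sketch below computes per-minute averages with a `GROUP BY` on a truncated timestamp. SQLite from the Python standard library stands in for the database here purely so the example is runnable. The `cpu` table, its columns, and the `bucket_averages` helper are hypothetical, and IOx's own SQL dialect and time functions may differ.

```python
import sqlite3
from typing import Iterable, List, Tuple


def bucket_averages(
    points: Iterable[Tuple[int, str, float]], bucket_seconds: int = 60
) -> List[Tuple[int, float]]:
    """Average `usage` per fixed-width time bucket using ordinary SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE cpu (ts INTEGER, host TEXT, usage REAL)")
        conn.executemany("INSERT INTO cpu VALUES (?, ?, ?)", points)
        # Integer division truncates each timestamp to the start of its bucket.
        query = """
            SELECT (ts / :w) * :w AS bucket, AVG(usage) AS avg_usage
            FROM cpu
            GROUP BY bucket
            ORDER BY bucket
        """
        return conn.execute(query, {"w": bucket_seconds}).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [(0, "a", 10.0), (30, "a", 20.0), (60, "a", 40.0), (90, "b", 60.0)]
    for bucket, avg_usage in bucket_averages(sample):
        print(bucket, avg_usage)
```

The same shape of query (truncate the timestamp, group, aggregate) is the bread and butter of time series dashboards, which is part of why SQL access from existing BI tools is attractive.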
| minhazm wrote:
| The SQL support is likely because they're using DataFusion
| which already has pretty good SQL support, so it's sort of
| "free".
|
| https://arrow.apache.org/datafusion/user-guide/sql/sql_statu...
| alamb wrote:
| Author here -- it is "free" in the sense that all the effort
| we put into DataFusion flows directly into IOx. But we do put
| a lot of effort into DataFusion.
| minhazm wrote:
| I didn't mean to imply it's free as in no effort goes into it.
| Just that the underlying library provides it so it's less
| effort on top of the already significant effort going into
| DataFusion itself.
| alamb wrote:
| Ah -- got it! This is the beauty of aligning ourselves
| with technologies like Arrow, Parquet and DataFusion. We
| can share as well as benefit from the efforts of the
| broader community.
| _peter_ wrote:
| Isn't InfluxDB rewriting their storage engine for the nth time?
| It makes me have a little less faith in their project to be
| honest.
| mhall119 wrote:
| The original TSM engine is still used by InfluxDB v2 OSS.
|
| The InfluxDB Cloud platform uses a variation of TSM that's
| tailored for a distributed SaaS rather than stand-alone nodes
| (this was originally intended to be used in InfluxDB v2 OSS as
| well, but alpha-testing showed that the old engine performed
| better there so it ultimately was reverted for the beta
| release).
|
| So IOx is really the first major new storage engine in
| InfluxDB.
| dgnorton wrote:
| Member of the engineering team here - I would break the history
| into 3 phases:
|
| 1) Alpha / Beta phase where we experimented with several off-
| the-shelf key-value stores (RocksDB, LevelDB, & BoltDB). During
| this early phase, we learned from observing a wide variety of
| workloads / use-cases that we needed a custom built engine to
| achieve our early performance goals. But, using these off-the-
| shelf key-value stores allowed our (at the time) very small
| team to focus on developing a useful beta product and gathering
| user feedback.
|
| 2) TSM storage engine for 1.0 - Developed from scratch based on
| our learnings from phase 1, this was the first production
| storage engine that shipped with 1.0 in 2016 and carried us
| through 2.0. It served as the workhorse for 3 - 4 years as both
| the number of users and size of their workloads skyrocketed,
| eventually bumping into architectural limits of TSM.
|
| 3) IOx - equipped with a larger engineering team and years of
| experience with a wide variety of workloads and use-cases, we
| developed IOx to handle our users' rapidly growing time series
| workloads.
| c4wrd wrote:
| I would argue the other way and praise them for the storage
| engine changes. Each iteration has had drawbacks, but based on
| the real-world reported usage they've made decisions to better
| support what customers are asking for and actually running
| into, as opposed to trying to iterate on the same engine over
| and over and making assumptions about real-world usage. Sure,
| there are drawbacks, but at the end of the day they're
| continuing to make good improvements for their customers.
| mrsun wrote:
| Will InfluxDB IOx eventually replace InfluxDB v2?
| mhall119 wrote:
| IOx is the data storage layer. It will replace the current TSM
| data storage system in InfluxDB, but it won't replace InfluxDB
| as a whole.
| digerata wrote:
| Personally, very excited to see this happening. Huge
| congrats!
|
| Some constructive criticism around naming... You don't have
| to have Flux in every single damn thing you create!
|
| InfluxDB IOx is not replacing InfluxDB v2 because... It's
| just a new storage engine.
|
| For querying we have Flux or InfluxQL...
| otoolep wrote:
| Congrats to the team at InfluxDB - great to see this released.
___________________________________________________________________
(page generated 2022-10-26 23:00 UTC)