[HN Gopher] Building a streaming SQL engine with Arrow and DataFusion
___________________________________________________________________
Building a streaming SQL engine with Arrow and DataFusion
Author : necubi
Score : 86 points
Date : 2024-03-18 08:45 UTC (14 hours ago)
(HTM) web link (www.arroyo.dev)
(TXT) w3m dump (www.arroyo.dev)
| rahulrs wrote:
| SQL streaming engines really seem to be having a moment.
|
| As someone who is less familiar with all the players in the
| space, how should I think about Arroyo vs. streaming databases
| like Materialize or caching tools like Readyset?
| necubi wrote:
| > SQL streaming engines really seem to be having a moment.
|
| I definitely agree! In the past few years, a bunch of folks
| (including myself) who had been working with Flink/Spark
| Streaming/KSQL/etc. at large companies decided that the time
| was right for a new generation of streaming systems and started
| companies to do that. For myself, seeing how much users
| struggled to build pipelines on Flink at Lyft inspired me to
| build Arroyo.
|
| I think it's really exciting after ~5 years of relative
| stagnation.
|
| > As someone who is less familiar with all the players in the
| space, how should I think about Arroyo vs. streaming databases
| like Materialize or caching tools like Readyset?
|
| There are no hard lines (and internally all of these systems
| look fairly similar) but the products and use cases are pretty
| different.
|
| To give my gloss:
|
| * Readyset is a very clever cache for your OLTP database that
| lets you push it into more analytical territory with reasonable
| performance, but still focused mostly on product use cases; the
| stream processing system is internal and not exposed to users
|
| * Materialize is designed to provide OLAP materialized views on
| top of your OLTP database by reading postgres/mysql
| changefeeds. It gives you up-to-date results for analytical
| queries without needing to replicate your postgres to snowflake
| and repeatedly query it.
|
| * Arroyo is a modern Flink, designed for more traditional
| stream processing use cases. This includes real-time analytics,
| but is more focused on operational and product use cases like
| alerting, real-time ML, automated remediation, and streaming
| ETL.
|
| Also, Arroyo is the only one of these that is fully open source
| (Apache 2.0) and designed for self-hosting.
| benrutter wrote:
| Especially factoring in the streaming capabilities, an Arrow-based
| SQL database is an exciting prospect!
|
| My assumption is that throughput could be increased quite a bit
| for loading data into Arrow-based libraries like polars or pandas,
| since data doesn't have to be converted. Any idea if that works
| out?
| necubi wrote:
| That's a future direction we're very excited about,
| particularly being able to run pyarrow-based UDFs on Arroyo
| state without any serialization overhead.
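|
| A minimal sketch of what that zero-copy handoff looks like with
| plain pyarrow today (nothing Arroyo-specific; the table below is
| made up):
|
|     import pyarrow as pa
|     import pandas as pd
|     import polars as pl
|
|     # An Arrow table, as it might arrive from an Arrow-native engine
|     table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})
|
|     # polars wraps the same Arrow buffers; no per-value conversion
|     df_pl = pl.from_arrow(table)
|
|     # pandas 2.x can keep the columns Arrow-backed instead of
|     # copying them into NumPy blocks
|     df_pd = table.to_pandas(types_mapper=pd.ArrowDtype)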
| esafak wrote:
| Looking forward to NATS support ;)
| necubi wrote:
| A NATS and JetStream connector is in development by an Arroyo
| user, and hopefully will be merged into master soon!
| esafak wrote:
| I don't see any PRs:
| https://github.com/ArroyoSystems/arroyo/pulls
|
| There is a _request_ :
| https://github.com/ArroyoSystems/arroyo/issues/162
|
| In any case, I look forward to it.
| necubi wrote:
| There's no PR yet (and I won't link to the branch in case
| the contributor doesn't want it public yet) but it exists
| and is being run in production :)
| memset wrote:
| Hi! Just reading the docs, this looks really slick. I had a few
| questions:
|
| - When you create tables, are they always connected to a source?
| How does that work for the cloud version? (ie, source =
| filesystem? It seems we would just use S3.)
|
| - Does Arroyo poll an S3 bucket for new files and automatically
| ingest them?
|
| - Are you able to do ALTER TABLE? (What if data, or data types,
| are mismatched?)
|
| - Similarly, am I able to change the primary key (ie, ClickHouse's
| ORDER BY or projections?) or change indexes?
|
| Any plans for HTTP as a source? (This is what we build and I'd be
| happy to prototype an integration!)
| jnewhouse wrote:
| For the SQL interface, both sources and sinks are treated as
| tables. Sources you SELECT FROM, while sinks you INSERT INTO.
| Right now it is incumbent on the user to correctly specify the
| types of a source for deserialization. How getting this wrong
| behaves is a little source-dependent, as some data formats are
| stricter. Parquet will fail hard at read time, while JSON will
| coerce as best it can, optionally dropping the data instead of
| failing the job depending on the bad_data parameter:
| https://doc.arroyo.dev/connectors/overview#bad-data.
|
| Currently we don't support much in the way of changing
| configuration in external systems, instead focusing on defining
| long-running pipelines.
|
| What did you have in mind for an HTTP source? We have a polling
| HTTP source, as well as a WebSocket source:
|
| https://doc.arroyo.dev/connectors/polling-http
| https://doc.arroyo.dev/connectors/websocket
| memset wrote:
| Let me take a look - thank you!
|
| So if I'm understanding, you actually read data directly from
| (say) S3? It isn't copied from S3 and stored locally (ie, a
| bunch of local .arrow files.)
|
| (Apologies if I'm ignorant of the underlying tech - I think
| this is really cool and just trying to wrap my head around
| what happens from "I upload some data to S3" and "we get
| query results")
| jnewhouse wrote:
| Yep, pretty much. Right now filesystem^ sources are finite,
| scanning the target path at operator startup time and
| processing all matching files. This processing is done by
| opening an asynchronous reader, courtesy of the
| object_store crate.
|
| ^We call these Filesystem Sources/Sinks to match
| terminology present in other streaming systems, but I'm not
| in love with it.
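|
| Not Arroyo's internals (those use the Rust object_store crate),
| but the same scan-at-startup pattern sketched with pyarrow's
| filesystem/dataset layer, for anyone picturing the flow; the
| bucket, prefix, and process() below are hypothetical:
|
|     import pyarrow.dataset as ds
|     from pyarrow import fs
|
|     # List everything under the configured prefix once, then read
|     # the matching files as record batches
|     s3 = fs.S3FileSystem(region="us-east-1")
|     dataset = ds.dataset("my-bucket/events/", filesystem=s3,
|                          format="parquet")
|
|     for batch in dataset.to_batches():
|         process(batch)  # stand-in for the downstream operators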
| Pucilowski wrote:
| How would I go about calling Python code as a step, say if I
| wanted to explore a grid of parameters and fit models
| accordingly?
| necubi wrote:
| Currently only Rust UDFs are supported
| (https://doc.arroyo.dev/sql/udfs) but one of the things that
| Arrow should enable is performant Python integration, as Arrow
| has a standardized in-memory format that's portable across
| languages. So it's possible to take in-memory data constructed in
| Arroyo's Rust implementation and pass it to a Python interpreter
| without serialization, and possibly without copying.
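|
| To make that concrete, a Python UDF in that world would receive
| whole Arrow columns (backed by the buffers the Rust side built)
| and return Arrow columns, rather than being called once per row.
| Purely a hypothetical sketch, not a shipped Arroyo API:
|
|     import pyarrow as pa
|     import pyarrow.compute as pc
|
|     def normalize(scores: pa.ChunkedArray) -> pa.ChunkedArray:
|         # Vectorized kernels operate on the existing Arrow buffers
|         # instead of boxing each value as a Python object
|         lo, hi = pc.min(scores), pc.max(scores)
|         return pc.divide(pc.subtract(scores, lo),
|                          pc.subtract(hi, lo))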
| qazxcvbnm wrote:
| Not exactly on-topic, but does anyone know of SQL-to-SQL
| optimisers or simplifiers (perhaps DataFusion would be able to do
| this)? I work with generated query systems and SQL macro systems
| that make fairly complex queries quite easy to generate, but they
| often come up with unnecessary joins, subqueries, etc.
|
| I find myself needing to mechanically transform and simplify SQL
| every now and then, and it hardly seems something out of reach of
| automation, yet somehow I've never been able to find software
| that simplifies and transforms SQL source-to-source. When I've
| last looked, I've only found optimisers for SQL execution plans.
| hantusk wrote:
| SQLGlot also brands itself as an SQL optimizer:
| https://github.com/tobymao/sqlglot?tab=readme-ov-file#sql-op...
| but I haven't tried that aspect of it personally
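|
| For anyone curious what that looks like in practice, this is
| standard sqlglot usage (nothing Arroyo-specific; the query and
| schema below are made up):
|
|     import sqlglot
|     from sqlglot.optimizer import optimize
|
|     sql = "SELECT a.id FROM (SELECT id, name FROM users) AS a"
|
|     # Applies sqlglot's rewrite rules (column qualification,
|     # predicate pushdown, subquery merging, ...) and prints the
|     # simplified statement back out as SQL
|     optimized = optimize(
|         sqlglot.parse_one(sql),
|         schema={"users": {"id": "INT", "name": "TEXT"}})
|     print(optimized.sql(pretty=True))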
| qazxcvbnm wrote:
| Thanks! I was not aware that SQLGlot has this functionality,
| let me give it a try.
| mgaunard wrote:
| How does it compare to DuckDB, which is an Arrow-compatible OLAP
| SQL database, easy to embed and just plain awesome?
| jakewins wrote:
| This is a different use case - DuckDB is large queries over
| your entire dataset; Arroyo is continuous queries executing
| over a live data stream.
|
| So if you have a stream of sensor data from a bunch of IoT
| devices, you might use Arroyo to run "live" queries over the
| flowing data, and you might use DuckDB to do analytical queries
| over archive data
|
| They are both hammers to whack nails with, I guess, so a lot of
| use cases could likely be solved with either system.
| mgaunard wrote:
| DuckDB processes data lazily, and can easily be integrated
| with custom synthetic relations.
|
| I guess it's still not really designed to work well with
| infinite relations, but then again neither is SQL really.
| jakewins wrote:
| Maybe I've not used DuckDB enough but: How would you set it
| up to consume, say, a Kafka stream, apply some
| transformations or windowed aggregates and continuously
| output the result?
|
| In my experience DuckDB is a batch tool, vs this thing
| which is a continuous stream tool
| mgaunard wrote:
| You would implement the relation type that maps to your
| data source.
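|
| In the Python client the lightweight version of that is just
| registering an Arrow object, which DuckDB then scans lazily. It
| doesn't make the input an unbounded stream, but it shows the
| hook (the table contents below are made up):
|
|     import duckdb
|     import pyarrow as pa
|
|     # Any Arrow table (or RecordBatchReader) can be exposed as a
|     # relation and queried with SQL
|     events = pa.table({"device": ["a", "b", "a"],
|                        "temp": [20.1, 21.5, 19.8]})
|
|     con = duckdb.connect()
|     con.register("events", events)
|     print(con.execute(
|         "SELECT device, avg(temp) FROM events GROUP BY device"
|     ).fetchall())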
| pantsforbirds wrote:
| Arrow has been the most exciting piece of technology I've seen in
| the last few years. The ecosystem being built around it is
| amazing, and it's standardizing a bunch of disparate data
| ecosystems.
|
| The Arrow ecosystem nets you a great compute implementation, a
| storage format (Parquet), and an RPC framework (Arrow Flight).
| tomrod wrote:
| I only really know Arrow from Feather and from Pandas 2.0+. Can
| you recommend where I could learn more?
| FridgeSeal wrote:
| The DataFusion repo has a lot of good write-ups, as does
| Andy Grove's blog - he started DataFusion before it was
| donated to Apache.
| amath wrote:
| Nice work on the performance boost :).
|
| How does it compare with things like:
|
| 1. https://github.com/bytewax/bytewax
|
| 2. https://github.com/pathwaycom/pathway
|
| I recently read this article about Flink (https://materializedview.io/p/from-samza-to-flink-a-decade-o...),
| and it commented on how Flink grew to fit several different use
| cases (applications, analytics, and ETL) with disjoint
| requirements - the ones Confluent built kafka-streams, ksql, and
| connectors for. Which of those would you say Arroyo is best
| suited for?
| zenbowman wrote:
| This is a great writeup. I work on batch/streaming stuff at
| Google and I'm very excited by some of the stuff I see in the
| Rust ecosystem, Arroyo included.
___________________________________________________________________
(page generated 2024-03-18 23:01 UTC)