[HN Gopher] Building a streaming SQL engine with Arrow and DataF...
       ___________________________________________________________________
        
       Building a streaming SQL engine with Arrow and DataFusion
        
       Author : necubi
       Score  : 86 points
       Date   : 2024-03-18 08:45 UTC (14 hours ago)
        
 (HTM) web link (www.arroyo.dev)
 (TXT) w3m dump (www.arroyo.dev)
        
       | rahulrs wrote:
       | SQL streaming engines really seem to be having a moment.
       | 
       | As someone who is less familiar with all the players in the
       | space, how should I think about Arroyo vs. streaming databases
       | like Materialize or caching tools like Readyset?
        
         | necubi wrote:
         | > SQL streaming engines really seem to be having a moment.
         | 
         | I definitely agree! In the past few years, a bunch of folks
         | (including myself) who had been working with Flink/Spark
         | Streaming/KSQL/etc. at large companies decided that the time
         | was right for a new generation of streaming systems and started
         | companies to do that. For myself, seeing how much users
         | struggled to build pipelines on Flink at Lyft inspired me to
         | build Arroyo.
         | 
         | I think it's really exciting after ~5 years of relative
         | stagnation.
         | 
         | > As someone who is less familiar with all the players in the
         | space, how should I think about Arroyo vs. streaming databases
         | like Materialize or caching tools like Readyset?
         | 
         | There are no hard lines (and internally all of these systems
         | look fairly similar) but the products and use cases are pretty
         | different.
         | 
         | To give my gloss:
         | 
         | * Readyset is a very clever cache for your OLTP database that
         | lets you push it into more analytical territory with reasonable
         | performance, but still focused mostly on product use cases; the
         | stream processing system is internal and not exposed to users
         | 
         | * Materialize is designed to provide OLAP materialized views on
         | top of your OLTP database by reading postgres/mysql
         | changefeeds. It gives you up-to-date results for analytical
         | queries without needing to replicate your postgres to snowflake
         | and repeatedly query it.
         | 
         | * Arroyo is a modern Flink, designed for more traditional
         | stream processing use cases. This includes real-time analytics,
         | but is more focused on operational and product use cases like
         | alerting, real-time ML, automated remediation, and streaming
         | ETL.
         | 
          | Also, Arroyo is the only one of these that is fully open
          | source (Apache 2.0) and designed for self-hosting.
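          | 
          | To make that concrete, an Arroyo pipeline is just a
          | long-running SQL query over a stream. A rough sketch of an
          | alerting query (the table and field names are invented for
          | the example):
          | 
          |     -- alert when a service logs more than 100 errors in
          |     -- any 1-minute window, sliding every 5 seconds
          |     SELECT service, count(*) AS errors
          |     FROM logs
          |     WHERE level = 'ERROR'
          |     GROUP BY service,
          |              hop(interval '5 seconds', interval '1 minute')
          |     HAVING count(*) > 100;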
        
       | benrutter wrote:
        | Especially factoring in the streaming capabilities, an
        | Arrow-based SQL database is an exciting prospect!
       | 
       | My assumption is that throughput could be increased quite a bit
        | for loading data into Arrow-based libraries like Polars or pandas
       | since data doesn't have to be converted. Any idea if that works
       | out?
        
         | necubi wrote:
         | That's a future direction we're very excited about,
         | particularly being able to run pyarrow-based UDFs on Arroyo
         | state without any serialization overhead.
        
       | esafak wrote:
       | Looking forward to NATS support ;)
        
         | necubi wrote:
          | A NATS and JetStream connector is in development by an Arroyo
          | user, and hopefully will be merged into master soon!
        
           | esafak wrote:
           | I don't see any PRs:
           | https://github.com/ArroyoSystems/arroyo/pulls
           | 
            | There is a _request_:
           | https://github.com/ArroyoSystems/arroyo/issues/162
           | 
           | In any case, I look forward to it.
        
             | necubi wrote:
             | There's no PR yet (and I won't link to the branch in case
             | the contributor doesn't want it public yet) but it exists
             | and is being run in production :)
        
       | memset wrote:
       | Hi! Just reading the docs, this looks really slick. I had a few
       | questions:
       | 
        | - When you create tables, are they always connected to a
        |   source? How does that work for the cloud version? (ie,
        |   source = filesystem? It seems we would just use s3.)
        | 
        | - Does Arroyo poll an s3 bucket for new files and
        |   automatically ingest?
        | 
        | - Are you able to do ALTER TABLE? (What if data, or data
        |   types, are mismatched?)
        | 
        | - Similarly, am I able to change the primary key (ie,
        |   clickhouse's ORDER BY or projections?) or change indexes?
       | 
       | Any plans for HTTP as a source? (This is what we build and I'd be
       | happy to prototype an integration!)
        
         | jnewhouse wrote:
         | For the SQL interface, both sources and sinks are treated as
         | tables. Sources you SELECT FROM, while sinks you INSERT INTO.
         | Right now it is incumbent on the user to correctly specify the
          | types of a source for deserialization. What happens when you
          | get this wrong is a little source-dependent, as some data
          | formats are stricter. Parquet will fail hard at read-time,
          | while JSON will coerce as best it can, optionally dropping
          | the data instead of failing the job depending on the
          | bad_data parameter:
         | https://doc.arroyo.dev/connectors/overview#bad-data.
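          | 
          | For example, a minimal end-to-end pipeline might look
          | roughly like this (a sketch only - the topic, fields, and
          | broker address are invented, and I've abbreviated the
          | connection options):
          | 
          |     CREATE TABLE orders (
          |         id BIGINT,
          |         amount DOUBLE
          |     ) WITH (
          |         connector = 'kafka',
          |         bootstrap_servers = 'broker:9092',
          |         topic = 'orders',
          |         type = 'source',
          |         format = 'json',
          |         -- drop records that fail to deserialize
          |         -- instead of failing the pipeline
          |         bad_data = 'drop'
          |     );
          | 
          |     -- assuming big_orders was created the same way,
          |     -- but with type = 'sink'
          |     INSERT INTO big_orders
          |     SELECT id, amount FROM orders WHERE amount > 1000;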
         | 
         | Currently we don't support much in the way of changing
         | configuration in external systems, instead focusing on defining
         | long-running pipelines.
         | 
         | What did you have in mind for an HTTP source? We have a polling
         | HTTP source, as well as a WebSocket source:
         | 
         | https://doc.arroyo.dev/connectors/polling-http
         | https://doc.arroyo.dev/connectors/websocket
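          | 
          | Declaring one is the same CREATE TABLE shape as any other
          | connector; sketching from memory (the endpoint is a
          | placeholder, and the exact option names are in the docs
          | above):
          | 
          |     CREATE TABLE api_events (
          |         value TEXT
          |     ) WITH (
          |         connector = 'polling_http',
          |         endpoint = 'https://api.example.com/events',
          |         -- how often to hit the endpoint
          |         poll_interval_ms = '1000',
          |         format = 'json'
          |     );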
        
           | memset wrote:
           | Let me take a look - thank you!
           | 
           | So if I'm understanding, you actually read data directly from
           | (say) S3? It isn't copied from S3 and stored locally (ie, a
           | bunch of local .arrow files.)
           | 
           | (Apologies if I'm ignorant of the underlying tech - I think
           | this is really cool and just trying to wrap my head around
           | what happens from "I upload some data to S3" and "we get
           | query results")
        
             | jnewhouse wrote:
             | Yep, pretty much. Right now filesystem^ sources are finite,
             | scanning the target path at operator startup time and
             | processing all matching files. This processing is done by
             | opening an asynchronous reader, courtesy of the
             | object_store crate.
             | 
             | ^We call these Filesystem Sources/Sinks to match
             | terminology present in other streaming systems, but I'm not
             | in love with it.
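              | 
              | Concretely, an S3-backed source looks something like
              | this (bucket and schema invented; see the filesystem
              | connector docs for the exact option names):
              | 
              |     CREATE TABLE archived_events (
              |         user_id BIGINT,
              |         event_time TIMESTAMP
              |     ) WITH (
              |         connector = 'filesystem',
              |         path = 's3://my-bucket/events/',
              |         format = 'parquet'
              |     );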
        
       | Pucilowski wrote:
        | How would I go about calling Python code as a step, say if I
        | wanted to explore a grid of parameters and fit models
        | accordingly?
        
         | necubi wrote:
         | Currently only Rust UDFs are supported
         | (https://doc.arroyo.dev/sql/udfs) but one of the things that
         | Arrow should enable is performant Python integration, as Arrow
         | has a standardized in-memory format that's portable across
          | languages. So it's possible to take in-memory data
          | constructed in Arroyo's Rust implementation and pass it to a
          | Python interpreter without serialization, and possibly
          | without even copying.
        
       | qazxcvbnm wrote:
       | Not exactly on-topic, but does anyone know of SQL-to-SQL
       | optimisers or simplifiers (perhaps DataFusion would be able to do
       | this)? I work with generated query systems and SQL macro systems
       | that make fairly complex queries quite easy to generate, but
        | oftentimes come up with unnecessary joins/subqueries, etc.
       | 
       | I find myself needing to mechanically transform and simplify SQL
       | every now and then, and it hardly seems something out of reach of
       | automation, yet somehow I've never been able to find software
        | that simplifies and transforms SQL source-to-source. When I
        | last looked, I only found optimisers for SQL execution plans.
        
         | hantusk wrote:
          | SQLGlot also brands itself as an SQL optimizer:
         | https://github.com/tobymao/sqlglot?tab=readme-ov-file#sql-op...
         | but I haven't tried that aspect of it personally
        
           | qazxcvbnm wrote:
           | Thanks! I was not aware that SQLGlot has this functionality,
           | let me give it a try.
        
       | mgaunard wrote:
       | How does it compare to DuckDB, which is an Arrow-compatible OLAP
       | SQL database, easy to embed and just plain awesome?
        
         | jakewins wrote:
         | This is a different use case - DuckDB is large queries over
         | your entire dataset; Arroyo is continuous queries executing
         | over a live data stream.
         | 
         | So if you have a stream of sensor data from a bunch of IoT
         | devices, you might use Arroyo to run "live" queries over the
         | flowing data, and you might use DuckDB to do analytical queries
         | over archive data
         | 
          | Although they are both hammers to whack nails with, I guess,
          | so a lot of use cases could likely be solved well with either
          | system
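          | 
          | For the sensor example, the streaming version would be a
          | standing query along these lines (table and column names
          | invented for illustration):
          | 
          |     -- per-device average temperature, emitted at the
          |     -- close of each 1-minute window, indefinitely
          |     SELECT device_id, avg(temperature) AS avg_temp
          |     FROM sensor_readings
          |     GROUP BY device_id, tumble(interval '1 minute');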
        
           | mgaunard wrote:
           | DuckDB processes data lazily, and can easily be integrated
           | with custom synthetic relations.
           | 
           | I guess it's still not really designed to work well with
           | infinite relations, but then again neither is SQL really.
        
             | jakewins wrote:
             | Maybe I've not used DuckDB enough but: How would you set it
             | up to consume, say, a Kafka stream, apply some
             | transformations or windowed aggregates and continuously
             | output the result?
             | 
             | In my experience DuckDB is a batch tool, vs this thing
             | which is a continuous stream tool
        
               | mgaunard wrote:
               | You would implement the relation type that maps to your
               | data source.
        
       | pantsforbirds wrote:
       | Arrow has been the most exciting piece of technology I've seen in
       | the last few years. The ecosystem being built around it is
       | amazing, and it's standardizing a bunch of disparate data
       | ecosystems.
       | 
        | The Arrow ecosystem nets you a great compute implementation,
        | storage (Parquet), and a great RPC framework (Arrow Flight).
        
         | tomrod wrote:
         | I only really know Arrow from Feather and from Pandas 2.0+. Can
         | you recommend where I could learn more?
        
           | FridgeSeal wrote:
            | The DataFusion repo has a lot of good write-ups, as does
            | Andy Grove's blog - he started DataFusion before it was
            | donated to Apache.
        
       | amath wrote:
       | Nice work on the performance boost :).
       | 
        | How does it compare with things like:
        | 
        | 1. https://github.com/bytewax/bytewax
        | 2. https://github.com/pathwaycom/pathway
       | 
        | I recently read this article (https://materializedview.io/p/from-
        | samza-to-flink-a-decade-o...) about Flink, and it commented on
        | how Flink grew to fit all of these different use cases
        | (applications, analytics, and ETL) with disjoint requirements -
        | the same ones that Confluent built kafka-streams, ksql, and
        | connectors for. Which of those would you say Arroyo is better
        | suited for?
        
       | zenbowman wrote:
        | This is a great writeup. I work on batch/streaming stuff at
        | Google, and I'm very excited by some of what I see in the Rust
        | ecosystem, Arroyo included.
        
       ___________________________________________________________________
       (page generated 2024-03-18 23:01 UTC)