[HN Gopher] Apache Arrow Datafusion 5.0.0 release
       ___________________________________________________________________
        
       Apache Arrow Datafusion 5.0.0 release
        
       Author : houqp
       Score  : 63 points
       Date   : 2021-08-24 16:02 UTC (6 hours ago)
        
 (HTM) web link (arrow.apache.org)
 (TXT) w3m dump (arrow.apache.org)
        
       | houqp wrote:
       | One of the Arrow Datafusion committers here. Happy to help answer
       | any question.
        
         | alexisread wrote:
         | Hi, how does this compare to Vaex?
         | https://vaex.io/docs/tutorial.html
         | 
         | Vaex can join larger than memory datasets, but it doesn't
         | support SQL per-se, it uses it's own lazy dataframe DSL.
        
         | pella wrote:
         | Question:
         | 
         | - Is it possible to handle data larger than fits into RAM?
         | 
         | - Any benchmark? like: https://h2oai.github.io/db-benchmark/ (
         | see 50GB + Join -> "timeout" | "out of memory" )
        
           | andygrove wrote:
           | There is experimental support for distributed query execution
           | with spill-to-disk between stages to support larger than
           | memory datasets. This is implemented in the Ballista crate,
           | which extends DataFusion.
           | 
           | https://github.com/apache/arrow-
           | datafusion/tree/master/balli...
        
           | houqp wrote:
           | > - Is it possible to handle data larger than fits into RAM?
           | 
           | Not at the moment, but the community has plans to add support
           | for disk spill.
           | 
           | > - Any benchmark? like: https://h2oai.github.io/db-
           | benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )
           | 
           | One of the committer Daniel is working on a h2oai db
           | benchmark PR for Datafusion :)
        
           | hashjoiner wrote:
           | There is a PR from me (Daniel, committer) with for db-
           | benchmark. For the group by benchmarks, on my machine, it is
           | currently somewhat slower than the fastest (Polars).
           | 
           | https://github.com/h2oai/db-benchmark/pull/182
           | 
           | Also we do support running TPC-H benchmarks. For the queries
           | we can run, those are already finishing faster than Spark. We
           | are planning to do more benchmarking and optimizations in the
           | future.
        
         | ohnoesjmr wrote:
         | As far as I understand, if this continues in the direction it's
         | going, the intention is to retire Apache Spark, correct?
         | 
         | If correct, how far off are we (ball park), in terms of
         | supporting the functionality, and ability to execute in a
         | distributed fashion?
        
           | peytoncasper wrote:
           | I might be wrong on this, but I don't believe this is a
           | replacement for Spark. Rather this is similar to the Spark
           | SQL execution engine.
           | 
           | I don't believe there is any focus on providing a distributed
           | execution environment, rather platforms like Spark and Flink
           | could integrate DataFusion as an implementation and expose
           | the API for Apache Arrow operations.
        
         | jarpineh wrote:
         | Hi. Is it possible to use Datafusion remotely, likea query
         | service? Perhaps using Arrow Flight? I would like to query data
         | with different clients. Python in Jupyter, straight to browser
         | and perhaps even something like Nu shell. This way each tool
         | won't need to open its own copy of Arrow/Parquet data.
        
           | andygrove wrote:
           | Yes. The Ballista crate (part of the arrow-datafusion repo)
           | provides distributed query execution and the scheduler has a
           | gRPC service. Flight is used internally as well but not
           | directly exposed to users. There is also work in progress to
           | add Python bindings for Ballista (they already exist for
           | DataFusion).
        
             | jarpineh wrote:
             | Thank you. I went through its GitHub repo for docs. It
             | seems I need to dig a bit deeper perhaps. How to get
             | started with my Parquet files isn't immediately obvious.
             | 
             | I assume Python bindings would talk through gRPC. I could
             | use gRPC directly perhaps?
        
               | andygrove wrote:
               | The best "Getting Started" documentation right now is
               | that on docs.rs -
               | https://docs.rs/ballista/0.5.0/ballista/
               | 
               | This demonstrates using the Rust client (BallistaContext
               | + DataFrame). There are already Python bindings for
               | DataFrame but not BallistaContext yet.
               | 
               | Documentation for Ballista is severely lacking right now
               | and this will be an area of focus for the next release.
        
               | jarpineh wrote:
               | Thanks. I'm experimenting with Rust currently. This might
               | fit the bill. I am curious though why does the client
               | need to use async Rust. I hadn't gotten that far in my
               | learnings. I would have guessed that synchronous way
               | should work as well.
        
         | yadamonk wrote:
         | Are there any good resources on using DataFusion in Python
         | beyond the README [1]?
         | 
         | [1] https://github.com/apache/arrow-
         | datafusion/tree/master/pytho...
        
         | pella wrote:
         | What is the "DataFusion"?
         | 
         | - not in the FAQ ( https://arrow.apache.org/faq/ )
         | 
         | - not in the Release page.
        
           | pella wrote:
           | OK: I have found: https://github.com/apache/arrow-datafusion
           | 
           |  _" DataFusion is an extensible query execution framework,
           | written in Rust, that uses Apache Arrow as its in-memory
           | format.DataFusion supports both an SQL and a DataFrame API
           | for building logical query plans as well as a query optimizer
           | and execution engine capable of parallel execution against
           | partitioned data sources (CSV and Parquet) using threads.
           | DataFusion also supports distributed query execution via the
           | Ballista crate."_
           | 
           |  _" Use Cases: DataFusion is used to create modern, fast and
           | efficient data pipelines, ETL processes, and database
           | systems, which need the performance of Rust and Apache Arrow
           | and want to provide their users the convenience of an SQL
           | interface or a DataFrame API."_
        
             | houqp wrote:
             | You beat me to it, was about to post the github link :)
             | Readme is a good starting place to learn more about the
             | project.
        
         | troelsSteegin wrote:
         | Is this the best current view of a roadmap for Datafusion?
         | https://www.mail-archive.com/dev@arrow.apache.org/msg23576.h...
        
           | alamb wrote:
           | <DataFusion committer here>
           | 
           | I do think that is the best current view of a RoadMap and
           | Vision -- it would be great to flesh it out a bit more.
           | 
           | In fact, I'll make a note to try and add some more higher
           | level context into the project on our goals.
        
             | Fordec wrote:
             | If the vision is to pseudo-copy the best bits of postgres,
             | I'd be very interested in seeing features that tackle
             | PostGIS type spatial problems. Native spatial work that
             | actually scales to handle global level data in a single
             | node still feels like a pipe dream a lot of the time.
             | Adding things like Dask or xarray feel like hacks on
             | imperfect base layers just to get a base system to be
             | barely operational.
        
         | eduren wrote:
         | I've been following Arrow and Datafusion dev for a little bit,
         | mostly because the architecture and goals look interesting.
         | 
         | What I'd be curious about is one of the possible use cases
         | mentioned in the Readme: ETL processes. I have yet to come
         | across any projects that are building ETL/ELT/pipeline tools
         | that leverage Datafusion. Might not be looking in the right
         | places.
         | 
         | Would anyone have insight into whether this is simply
         | unexplored territory, or just not as good of a fit as other use
         | cases?
        
           | seddonm1 wrote:
           | Disclosure: I am a contributor to Datafusion.
           | 
           | I have done a lot of work in the ETL space in Apache Spark to
           | build Arc (https://arc.tripl.ai/) and have ported a lot of
           | the basic functionality of Arc to Datafusion as a proof-of-
           | concept. The appeal to me of the Apache Spark and Datafusion
           | engines is the ability to a) seperate compute and storage b)
           | express transformation logic in SQL.
           | 
           | Performance: From those early experiments Datafusion would
           | frequently finish processing an entire job _before_ the
           | SparkContext could be started - even on a local Spark
           | instance. Obviously this is at smaller data sizes but in my
           | experience a lot of ETL is about repeatable processes not
           | necessarily huge datasets.
           | 
           | Compatibility: Those experiments were done a few months ago
           | and the SQL compatibility of the Datafusion engine has
           | improved extremely rapidly (WINDOW functions were recently
           | added). There is still some missing SQL functionality (for
           | example to run all the TPC-H queries
           | https://github.com/apache/arrow-
           | datafusion/tree/master/bench...) but it is moving quickly.
        
       | pauldix wrote:
       | The new core we're building for InfluxDB (named InfluxDB IOx)
       | uses Datafusion for query execution. We have multiple team
       | members contributing to this release and we're super excited to
       | be involved with it.
       | 
       | I think it's a really exciting time for new OLAP systems because
       | of Arrow, Rust, and the rise of object store + ephemeral compute
       | for analytical and time series data.
        
         | houqp wrote:
         | Indeed, big shout out to the InfluxDB team!
        
         | rektide wrote:
         | > _the rise of object store + ephemeral compute_
         | 
         | Question, does Datafusion itself integrate with Arrow's Plasma
         | object store? Or is it more agnostic?
        
           | andygrove wrote:
           | There is no support for Plasma. There is a TableProvider API
           | for custom file formats and there is built-in support for
           | CSV, Parquet, and JSON.
        
       | liminal wrote:
       | What's the relationship between Datafusion and Ballista? They
       | seem to have been merged into a single repo. Do they share a
       | release schedule? Are they a single product or still separate?
        
         | andygrove wrote:
         | Ballista started out as a separate project and was donated in
         | April 2021. They currently share a release schedule (but have
         | different versioning) and this was the first release of
         | DataFusion to include the Ballista crate.
         | 
         | My hope is that Ballista and DataFusion become more integrated
         | over time but remain separate, with DataFusion being an
         | embedded / single-process query engine and Ballista providing
         | distributed execution.
        
       ___________________________________________________________________
       (page generated 2021-08-24 23:01 UTC)