[HN Gopher] Apache Arrow Datafusion 5.0.0 release
___________________________________________________________________
Apache Arrow Datafusion 5.0.0 release
Author : houqp
Score : 63 points
Date : 2021-08-24 16:02 UTC (6 hours ago)
(HTM) web link (arrow.apache.org)
(TXT) w3m dump (arrow.apache.org)
| houqp wrote:
| One of the Arrow Datafusion committers here. Happy to help answer
| any question.
| alexisread wrote:
| Hi, how does this compare to Vaex?
| https://vaex.io/docs/tutorial.html
|
| Vaex can join larger than memory datasets, but it doesn't
| support SQL per-se, it uses it's own lazy dataframe DSL.
| pella wrote:
| Question:
|
| - Is it possible to handle data larger than fits into RAM?
|
| - Any benchmark? like: https://h2oai.github.io/db-benchmark/ (
| see 50GB + Join -> "timeout" | "out of memory" )
| andygrove wrote:
| There is experimental support for distributed query execution
| with spill-to-disk between stages to support larger than
| memory datasets. This is implemented in the Ballista crate,
| which extends DataFusion.
|
| https://github.com/apache/arrow-
| datafusion/tree/master/balli...
| houqp wrote:
| > - Is it possible to handle data larger than fits into RAM?
|
| Not at the moment, but the community has plans to add support
| for disk spill.
|
| > - Any benchmark? like: https://h2oai.github.io/db-
| benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )
|
| One of the committer Daniel is working on a h2oai db
| benchmark PR for Datafusion :)
| hashjoiner wrote:
| There is a PR from me (Daniel, committer) with for db-
| benchmark. For the group by benchmarks, on my machine, it is
| currently somewhat slower than the fastest (Polars).
|
| https://github.com/h2oai/db-benchmark/pull/182
|
| Also we do support running TPC-H benchmarks. For the queries
| we can run, those are already finishing faster than Spark. We
| are planning to do more benchmarking and optimizations in the
| future.
| ohnoesjmr wrote:
| As far as I understand, if this continues in the direction it's
| going, the intention is to retire Apache Spark, correct?
|
| If correct, how far off are we (ball park), in terms of
| supporting the functionality, and ability to execute in a
| distributed fashion?
| peytoncasper wrote:
| I might be wrong on this, but I don't believe this is a
| replacement for Spark. Rather this is similar to the Spark
| SQL execution engine.
|
| I don't believe there is any focus on providing a distributed
| execution environment, rather platforms like Spark and Flink
| could integrate DataFusion as an implementation and expose
| the API for Apache Arrow operations.
| jarpineh wrote:
| Hi. Is it possible to use Datafusion remotely, likea query
| service? Perhaps using Arrow Flight? I would like to query data
| with different clients. Python in Jupyter, straight to browser
| and perhaps even something like Nu shell. This way each tool
| won't need to open its own copy of Arrow/Parquet data.
| andygrove wrote:
| Yes. The Ballista crate (part of the arrow-datafusion repo)
| provides distributed query execution and the scheduler has a
| gRPC service. Flight is used internally as well but not
| directly exposed to users. There is also work in progress to
| add Python bindings for Ballista (they already exist for
| DataFusion).
| jarpineh wrote:
| Thank you. I went through its GitHub repo for docs. It
| seems I need to dig a bit deeper perhaps. How to get
| started with my Parquet files isn't immediately obvious.
|
| I assume Python bindings would talk through gRPC. I could
| use gRPC directly perhaps?
| andygrove wrote:
| The best "Getting Started" documentation right now is
| that on docs.rs -
| https://docs.rs/ballista/0.5.0/ballista/
|
| This demonstrates using the Rust client (BallistaContext
| + DataFrame). There are already Python bindings for
| DataFrame but not BallistaContext yet.
|
| Documentation for Ballista is severely lacking right now
| and this will be an area of focus for the next release.
| jarpineh wrote:
| Thanks. I'm experimenting with Rust currently. This might
| fit the bill. I am curious though why does the client
| need to use async Rust. I hadn't gotten that far in my
| learnings. I would have guessed that synchronous way
| should work as well.
| yadamonk wrote:
| Are there any good resources on using DataFusion in Python
| beyond the README [1]?
|
| [1] https://github.com/apache/arrow-
| datafusion/tree/master/pytho...
| pella wrote:
| What is the "DataFusion"?
|
| - not in the FAQ ( https://arrow.apache.org/faq/ )
|
| - not in the Release page.
| pella wrote:
| OK: I have found: https://github.com/apache/arrow-datafusion
|
| _" DataFusion is an extensible query execution framework,
| written in Rust, that uses Apache Arrow as its in-memory
| format.DataFusion supports both an SQL and a DataFrame API
| for building logical query plans as well as a query optimizer
| and execution engine capable of parallel execution against
| partitioned data sources (CSV and Parquet) using threads.
| DataFusion also supports distributed query execution via the
| Ballista crate."_
|
| _" Use Cases: DataFusion is used to create modern, fast and
| efficient data pipelines, ETL processes, and database
| systems, which need the performance of Rust and Apache Arrow
| and want to provide their users the convenience of an SQL
| interface or a DataFrame API."_
| houqp wrote:
| You beat me to it, was about to post the github link :)
| Readme is a good starting place to learn more about the
| project.
| troelsSteegin wrote:
| Is this the best current view of a roadmap for Datafusion?
| https://www.mail-archive.com/dev@arrow.apache.org/msg23576.h...
| alamb wrote:
| <DataFusion committer here>
|
| I do think that is the best current view of a RoadMap and
| Vision -- it would be great to flesh it out a bit more.
|
| In fact, I'll make a note to try and add some more higher
| level context into the project on our goals.
| Fordec wrote:
| If the vision is to pseudo-copy the best bits of postgres,
| I'd be very interested in seeing features that tackle
| PostGIS type spatial problems. Native spatial work that
| actually scales to handle global level data in a single
| node still feels like a pipe dream a lot of the time.
| Adding things like Dask or xarray feel like hacks on
| imperfect base layers just to get a base system to be
| barely operational.
| eduren wrote:
| I've been following Arrow and Datafusion dev for a little bit,
| mostly because the architecture and goals look interesting.
|
| What I'd be curious about is one of the possible use cases
| mentioned in the Readme: ETL processes. I have yet to come
| across any projects that are building ETL/ELT/pipeline tools
| that leverage Datafusion. Might not be looking in the right
| places.
|
| Would anyone have insight into whether this is simply
| unexplored territory, or just not as good of a fit as other use
| cases?
| seddonm1 wrote:
| Disclosure: I am a contributor to Datafusion.
|
| I have done a lot of work in the ETL space in Apache Spark to
| build Arc (https://arc.tripl.ai/) and have ported a lot of
| the basic functionality of Arc to Datafusion as a proof-of-
| concept. The appeal to me of the Apache Spark and Datafusion
| engines is the ability to a) seperate compute and storage b)
| express transformation logic in SQL.
|
| Performance: From those early experiments Datafusion would
| frequently finish processing an entire job _before_ the
| SparkContext could be started - even on a local Spark
| instance. Obviously this is at smaller data sizes but in my
| experience a lot of ETL is about repeatable processes not
| necessarily huge datasets.
|
| Compatibility: Those experiments were done a few months ago
| and the SQL compatibility of the Datafusion engine has
| improved extremely rapidly (WINDOW functions were recently
| added). There is still some missing SQL functionality (for
| example to run all the TPC-H queries
| https://github.com/apache/arrow-
| datafusion/tree/master/bench...) but it is moving quickly.
| pauldix wrote:
| The new core we're building for InfluxDB (named InfluxDB IOx)
| uses Datafusion for query execution. We have multiple team
| members contributing to this release and we're super excited to
| be involved with it.
|
| I think it's a really exciting time for new OLAP systems because
| of Arrow, Rust, and the rise of object store + ephemeral compute
| for analytical and time series data.
| houqp wrote:
| Indeed, big shout out to the InfluxDB team!
| rektide wrote:
| > _the rise of object store + ephemeral compute_
|
| Question, does Datafusion itself integrate with Arrow's Plasma
| object store? Or is it more agnostic?
| andygrove wrote:
| There is no support for Plasma. There is a TableProvider API
| for custom file formats and there is built-in support for
| CSV, Parquet, and JSON.
| liminal wrote:
| What's the relationship between Datafusion and Ballista? They
| seem to have been merged into a single repo. Do they share a
| release schedule? Are they a single product or still separate?
| andygrove wrote:
| Ballista started out as a separate project and was donated in
| April 2021. They currently share a release schedule (but have
| different versioning) and this was the first release of
| DataFusion to include the Ballista crate.
|
| My hope is that Ballista and DataFusion become more integrated
| over time but remain separate, with DataFusion being an
| embedded / single-process query engine and Ballista providing
| distributed execution.
___________________________________________________________________
(page generated 2021-08-24 23:01 UTC)