[HN Gopher] DataFusion Comet: Apache Spark Accelerator
___________________________________________________________________
DataFusion Comet: Apache Spark Accelerator
Author : andygrove
Score : 56 points
Date : 2024-05-31 16:59 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| OutOfHere wrote:
| But why? Unless you need low-level map/reduce, just ditch
| Spark and use https://github.com/apache/datafusion-ballista
| directly. It supports Python too.
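|
| For context, a minimal sketch of what querying data with DataFusion
| directly from Python can look like (this assumes the datafusion PyPI
| package's SessionContext API; Ballista layers a distributed scheduler
| on top of the same engine, and the file path here is hypothetical):
|
|     # Query a Parquet file with the DataFusion Python bindings.
|     from datafusion import SessionContext
|
|     ctx = SessionContext()
|     ctx.register_parquet("trips", "data/trips.parquet")
|     df = ctx.sql(
|         "SELECT passenger_count, count(*) AS n "
|         "FROM trips GROUP BY passenger_count"
|     )
|     df.show()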
| necubi wrote:
| Many companies have 100k+ lines of Spark code. It's not
| trivial to rewrite all of that in another query framework.
| OutOfHere wrote:
| Following that logic, we should have stuck with C/C++ for
| everything. /s
| MrPowers wrote:
| Lots of Spark workloads are executed with the C++ Photon
| engine on the Databricks platform, so we ironically have
| partially moved back to C++. Disclosure: I work for
| Databricks.
| OutOfHere wrote:
| The continued use of C++ is _not_ exactly something to be
| proud of, although in this case at least it presumably is
| for short-running jobs, not for long-running services
| that accumulate leaks.
| _bohm wrote:
| There is a ton of reliable load-bearing software out
| there written in C++. I don't think the fact that a piece
| of software is written in C++ is enough to presume that
| it has memory leaks.
| threeseed wrote:
| Python would be just another PHP-level language if it weren't
| for C++.
|
| It's what powers all of the DE/ML/AI libraries.
| MrPowers wrote:
| The OP is the original creator of Ballista, so he's well aware
| of the project.
|
| Ballista is much less mature than Spark and needs a lot of
| work. It's awesome they're making Spark faster with Comet.
| andygrove wrote:
| Yes, Ballista failed to gain traction. I think that one of
| the challenges was that it only supported a small subset of
| Spark, and there was too much work involved to try and get to
| parity with Spark.
|
| The Comet approach is much more pragmatic because we just add
| support for more operators and expressions over time and fall
| back to Spark for anything that is not supported yet.
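|
| A hedged sketch of what enabling Comet on a Spark session can look
| like (the config keys and jar name below are assumptions based on the
| Comet getting-started docs; unsupported operators and expressions
| simply stay on the regular Spark execution path):
|
|     from pyspark.sql import SparkSession
|
|     spark = (
|         SparkSession.builder
|         # placeholder jar; use the Comet build matching your Spark version
|         .config("spark.jars", "comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar")
|         .config("spark.sql.extensions",
|                 "org.apache.comet.CometSparkSessionExtensions")
|         .config("spark.comet.enabled", "true")
|         .config("spark.comet.exec.enabled", "true")
|         .getOrCreate()
|     )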
| OutOfHere wrote:
| For the longest time, searching for Ballista linked to its
| old archived repo that didn't even have a link to the new
| repo. There was no search result for the new repo. This
| misled people into thinking that Ballista was a dead project,
| but it wasn't. It wasted so much opportunity.
|
| I don't think it's a fair criticism of Ballista to say that
| it failed in any way. It just needs substantial
| effort to bring it on par with Spark. The performance
| benefits are meaningful. Ballista can then not only take
| the crown from Spark, but also revalidate Rust as a
| language.
| andygrove wrote:
| I wish I'd known about the search issue.
|
| I do see a new opportunity for Ballista. By leveraging
| all of the Spark-compatible operators and expressions
| being built in Comet, it would be able to support a wider
| range of queries much more quickly.
|
| Ballista already uses protobuf for sending plans to
| executors and Comet accepts protobuf plans (in a similar,
| but different format).
| OutOfHere wrote:
| Did Databricks sponsor Comet?
| threeseed wrote:
| One of the challenges is that most Spark users don't care
| if you double performance.
|
| We are in the enterprise with large cloud budgets and can
| simply change instance types. If you're 20x faster, that is
| a different story, but then (a) you need to have feature
| parity and (b) you need support from cloud vendors, which
| Spark has.
| spenczar5 wrote:
| There seems to be a history of data technologies requiring
| a serious corporate sponsor. Arrow gets so much dev and
| marketing effort from Voltron, Spark from Databricks, etc.
| Did Ballista have anything similar? I loved the project
| but it never seemed to move very fast on integrating with
| other tools and platforms.
| orthoxerox wrote:
| Because it's a drop-in replacement that lets you
| (theoretically) spend O(1) development effort on speeding up
| your Spark jobs instead of O(N).
|
| I say theoretically, because I have no idea how Comet works
| with the memory limits on Spark executors. If you have to
| rebalance the memory between regular memory and memory overhead
| or provision some off-heap memory for Comet, then the migration
| won't be so simple.
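|
| For reference, the knobs involved are standard Spark settings: a
| native engine allocates outside the JVM heap, so a drop-in migration
| may still mean shifting executor memory toward off-heap/overhead. An
| illustrative (not Comet-specific) configuration:
|
|     from pyspark.sql import SparkSession
|
|     spark = (
|         SparkSession.builder
|         .config("spark.executor.memory", "8g")           # JVM heap
|         .config("spark.executor.memoryOverhead", "2g")   # native headroom
|         .config("spark.memory.offHeap.enabled", "true")  # off-heap pool
|         .config("spark.memory.offHeap.size", "4g")
|         .getOrCreate()
|     )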
| simicd wrote:
| In short: compatible with existing Spark jobs but executing
| them much faster. Benchmarks in the README file and docs [1]
| show improvements of up to 3x even though not all operations
| are implemented yet (if an operation is not available in Comet
| it falls back to Spark), so there is room for further
| improvement. Across all TPC-H queries the total speedup is
| currently 1.5x; the docs state that, based on DataFusion's
| standalone performance, 2x-4x is a realistic goal [1].
|
| Haven't seen any memory consumption benchmarks, but I suspect
| it's lower than Spark for the same jobs since DataFusion is
| designed from the ground up to be columnar-first.
|
| For companies spending hundreds of thousands, if not millions,
| on compute, this would mean substantial savings with little
| effort.
|
| [1] https://datafusion.apache.org/comet/contributor-
| guide/benchm...
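|
| As a back-of-the-envelope illustration (the spend figure is
| hypothetical, the speedups are the ones quoted above): compute-bound
| cost scales roughly with runtime, so a 1.5x speedup trims about a
| third of the bill.
|
|     # Rough saving for a given speedup; numbers are illustrative only.
|     annual_spark_spend = 1_000_000  # hypothetical $/year
|     for speedup in (1.5, 2.0, 4.0):
|         saving = annual_spark_spend * (1 - 1 / speedup)
|         print(f"{speedup}x -> ~${saving:,.0f}/year saved")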
| threeseed wrote:
| There is an entire ecosystem of libraries for Spark built up
| over years.
|
| I want to be able to connect to and interact with the full range
| of services from GCS, Azure, AWS, OpenAI, etc., none of which
| DataFusion supports.
|
| As well as use libraries such as SynapseML, SparkNLP etc.
|
| And do all of this with full support from my cloud provider.
| scirob wrote:
| Imagine if Databricks switched and just started to contribute to
| this.
|
| I live in a dream world :)
| nevi-me wrote:
| They have their own implementation that is closed source (last
| time I checked), Photon [1], which is written with a C++
| engine.
|
| Databricks' terms prevent(ed?) publishing benchmarks; it would
| be interesting to see how Comet performs relative to it over
| time.
|
| Photon comes at a higher cost, so one big advantage of Comet is
| being able to deploy it on a standard Databricks cluster that
| doesn't have Photon, at a lower running cost.
|
| [1] https://www.databricks.com/product/photon
| slt2021 wrote:
| Nice wordplay for two competing Spark execution engines:
| Photon and Comet. C++ vs Rust, closed-source vs open-source.
___________________________________________________________________
(page generated 2024-05-31 23:00 UTC)