[HN Gopher] DataFusion Comet: Apache Spark Accelerator
       ___________________________________________________________________
        
       DataFusion Comet: Apache Spark Accelerator
        
       Author : andygrove
       Score  : 56 points
       Date   : 2024-05-31 16:59 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | OutOfHere wrote:
        | But why? Unless you need to use low-level map/reduce, just ditch
       | Spark and use https://github.com/apache/datafusion-ballista
       | directly. It supports Python too.
        
         | necubi wrote:
         | Many companies have 100k+ of lines of Spark code. It's not
         | trivial to rewrite all of that in another query framework.
        
           | OutOfHere wrote:
           | Following that logic, we should have stuck with C/C++ for
           | everything. /s
        
             | MrPowers wrote:
             | Lots of Spark workloads are executed with the C++ Photon
             | engine on the Databricks platform, so we ironically have
             | partially moved back to C++. Disclosure: I work for
             | Databricks.
        
               | OutOfHere wrote:
               | The continued use of C++ is _not_ exactly something to be
               | proud of, although in this case at least it presumably is
               | for short-running jobs, not for long-running services
               | that accumulate leaks.
        
               | _bohm wrote:
               | There is a ton of reliable load-bearing software out
               | there written in C++. I don't think the fact that a piece
               | of software is written in C++ is enough to presume that
               | it has memory leaks.
        
               | threeseed wrote:
                | Python would be just another PHP-level language if it
                | weren't for C++.
               | 
               | It's what powers all of the DE/ML/AI libraries.
        
         | MrPowers wrote:
         | The OP is the original creator of Ballista, so he's well aware
         | of the project.
         | 
         | Ballista is much less mature than Spark and needs a lot of
         | work. It's awesome they're making Spark faster with Comet.
        
           | andygrove wrote:
           | Yes, Ballista failed to gain traction. I think that one of
           | the challenges was that it only supported a small subset of
           | Spark, and there was too much work involved to try and get to
           | parity with Spark.
           | 
           | The Comet approach is much more pragmatic because we just add
           | support for more operators and expressions over time and fall
           | back to Spark for anything that is not supported yet.
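            | 
            | As a rough sketch (the flag names here are from memory and
            | may differ by version, so check the Comet docs; the jar
            | path is a placeholder), enabling Comet on an existing job
            | looks something like:

            ```shell
            # Hypothetical invocation: the jar path and config keys are
            # assumptions; consult the Comet installation guide for the
            # exact settings for your Spark version.
            spark-submit \
              --jars /path/to/comet-spark.jar \
              --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
              --conf spark.comet.enabled=true \
              --conf spark.comet.exec.enabled=true \
              my_job.py
            ```

            | Anything the plugin can't translate runs through Spark's
            | own operators as before, which is why it can be enabled on
            | an existing job without rewriting it.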
        
             | OutOfHere wrote:
              | For the longest time, searching for Ballista linked to its
              | old archived repo, which didn't even have a link to the
              | new repo. There was no search result for the new repo.
              | This misled people into thinking that Ballista was a dead
              | project when it wasn't. That wasted a lot of opportunity.
             | 
              | I don't think it's fair to say that Ballista failed in
              | any way. It just needs substantial effort to bring it on
              | par with Spark. The performance benefits are meaningful.
              | Ballista could then not only take the crown from Spark,
              | but also revalidate Rust as a language.
        
               | andygrove wrote:
               | I wish I'd known about the search issue.
               | 
               | I do see a new opportunity for Ballista. By leveraging
               | all of the Spark-compatible operators and expressions
               | being built in Comet, it would be able to support a wider
               | range of queries much more quickly.
               | 
                | Ballista already uses protobuf for sending plans to
                | executors, and Comet accepts protobuf plans (in a
                | similar but different format).
        
               | OutOfHere wrote:
               | Did Databricks sponsor Comet?
        
             | threeseed wrote:
             | One of the challenges is that most Spark users don't care
             | if you 2x performance.
             | 
              | We are in the enterprise with large cloud budgets and can
              | simply change instance types. If you're 20x faster, that's
              | a different story, but then (a) you need to have feature
              | parity and (b) you need support from cloud vendors, which
              | Spark has.
        
             | spenczar5 wrote:
             | There seems to be a history of data technologies requiring
             | a serious corporate sponsor. Arrow gets so much dev and
             | marketing effort from Voltron, Spark from Databricks, etc.
              | Did Ballista have anything similar? I loved the project
             | but it never seemed to move very fast on integrating with
             | other tools and platforms.
        
         | orthoxerox wrote:
         | Because it's a drop-in replacement that lets you
         | (theoretically) spend O(1) development effort on speeding up
         | your Spark jobs instead of O(N).
         | 
         | I say theoretically, because I have no idea how Comet works
         | with the memory limits on Spark executors. If you have to
         | rebalance the memory between regular memory and memory overhead
         | or provision some off-heap memory for Comet, then the migration
         | won't be so simple.
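          | 
          | For concreteness, these are the standard Spark knobs I'd
          | expect to be rebalancing (whether Comet actually draws from
          | executor overhead or from off-heap memory is exactly what I
          | don't know):

          ```shell
          # Hypothetical rebalance: shrink the JVM heap and grant
          # off-heap memory to the native engine instead. Values are
          # made up; the right split depends on the job and on how
          # Comet allocates.
          spark-submit \
            --conf spark.executor.memory=8g \
            --conf spark.executor.memoryOverhead=4g \
            --conf spark.memory.offHeap.enabled=true \
            --conf spark.memory.offHeap.size=4g \
            my_job.py
          ```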
        
         | simicd wrote:
          | In short: compatible with existing Spark jobs but executing
          | them much faster. Benchmarks in the README file and docs [1]
          | show improvements of up to 3x, and not all operations are
          | implemented yet (if an operation is not available in Comet,
          | it falls back to Spark), so there is room for further
          | improvement. Across all TPC-H queries the total speedup is
          | currently 1.5x; the docs state that, based on DataFusion's
          | standalone performance, 2x-4x is a realistic goal [1].
         | 
          | Haven't seen any memory consumption benchmarks, but I suspect
          | it's lower than Spark for the same jobs since DataFusion is
          | designed from the ground up to be columnar-first.
         | 
         | For companies spending 100s of thousands if not millions on
         | compute this would mean substantial savings with little effort.
         | 
         | [1] https://datafusion.apache.org/comet/contributor-
         | guide/benchm...
        
         | threeseed wrote:
         | There is an entire ecosystem of libraries for Spark built up
         | over years.
         | 
          | I want to be able to interact with the full set of services
          | from GCS, Azure, AWS, OpenAI, etc., none of which DataFusion
          | supports.
         | 
         | As well as use libraries such as SynapseML, SparkNLP etc.
         | 
         | And do all of this with full support from my cloud provider.
        
       | scirob wrote:
        | Imagine if Databricks switched and just started contributing
        | to this.
       | 
       | I live in a dream world :)
        
         | nevi-me wrote:
          | They have their own implementation that is closed source (last
          | time I checked), Photon [1], which is built on a C++ engine.
         | 
         | Databricks' terms prevent(ed?) publishing benchmarks, it would
         | be interesting to see how Comet performs relative to it over
         | time.
         | 
         | Photon comes at a higher cost, so one big advantage of Comet is
         | being able to deploy it on a standard Databricks cluster that
         | doesn't have Photon, at a lower running cost.
         | 
         | [1] https://www.databricks.com/product/photon
        
           | slt2021 wrote:
           | nice word play for two competing spark execution engines:
           | Photon and Comet. C++ vs Rust, closed-source vs open-source
        
       ___________________________________________________________________
       (page generated 2024-05-31 23:00 UTC)