[HN Gopher] Should you ditch Spark for DuckDB or Polars?
       ___________________________________________________________________
        
       Should you ditch Spark for DuckDB or Polars?
        
       Author : RobinL
       Score  : 72 points
       Date   : 2024-12-14 20:28 UTC (1 day ago)
        
 (HTM) web link (milescole.dev)
 (TXT) w3m dump (milescole.dev)
        
       | jimmyl02 wrote:
       | pretty new to the large scale data processing space so not sure
       | if this is a known question but isn't the purpose of spark that
       | it can be distributed across many workers and parallelized?
       | 
       | I guess the scale of data here ~100GB is manageable with
       | something like DuckDB but once data gets past a certain scale,
       | wouldn't single machine performance have no way of matching a
       | distributed spark cluster?
        
         | ojhughes wrote:
         | I think it's quite rare that a dataset is too large to be
         | processed on a single volume. EBS volumes max out at 64TB.
        
           | hobs wrote:
           | It's really the only reason to use Spark in the first place,
           | because you're doing non-local level processing - when I was
           | throwing 100TB at the problem it made more sense than most of
           | the data science tasks I see it used for.
        
             | mnsc wrote:
             | This magnitude of data fascinates me. Can you elaborate on
             | how that much data came to be and what kind of processing
             | was needed for all that data? And if not, maybe point to
             | some podcast, blog that goes into the nitty gritty of those
             | types of real big data challenges.
        
           | threeseed wrote:
           | If you're using EBS for shuffle/source data you're already
           | behind.
           | 
           | If you can afford it you should be using the host NVMe drives
           | or FSx and rely on Spark to handle outages. The difference in
           | performance will be orders of magnitude.
           | 
           | And in this case you won't have the ability to store 64TB.
           | The typical max is around 2TB.
        
           | anonymousDan wrote:
           | It's not just disk capacity though. Disk bandwidth, CPU and
           | memory capacity/bandwidth etc can also become bottlenecks on
           | a single machine.
        
         | pjot wrote:
         | So many think they (their company) have "Big Data" but don't.
        
           | bdcravens wrote:
           | My employer's product database is about 2TB currently and is
           | still perfectly manageable with a single machine, and to be
           | honest, it's not even all that optimized.
        
         | pletnes wrote:
         | My impression is that companies like Databricks because of the
         | framework it offers for organizing data workflows. Spark is
         | then used since that is how Databricks does its processing.
         | 
         | Microsoft's Synapse is their "ripoff product", which tries to
         | compete directly by offering the same thing (only worse) with
         | MS branding and better Azure integration.
         | 
         | I've yet to see Spark being used outside of these products but
         | would be happy to hear of such use.
        
         | extr wrote:
         | Yes, but such situations are incredibly rare IMO. I was just
         | recently in a job where a self-described expert was obsessed
         | with writing all data pipelines (in this context, 1-2GB of
         | data) in Spark, because that was the "proper" way to do things
         | since we had Databricks. It was beyond a waste of time.
        
           | tmountain wrote:
           | Yup, everyone thought they had "big data" for a moment in the
           | industry, and it turns out, they all just read the same
           | trendy blog posts. I made the mistake of "clustering" at my
           | previous company, and it introduced so much unnecessary
           | complexity. If you are really worried about future proofing,
           | just design a schema where the main entities are denormalized
           | and use UUID for your primary keys. If you need to shard
           | later, it will be much easier. Although, again, it's likely
           | that you will never need to.
        
         | threeseed wrote:
         | People make the mistake of thinking it's a single 100GB
         | dataset.
         | 
         | It's more common to be manipulating lots of smaller datasets
         | together in a way where you need more than 100GB of disk
         | space. And in that situation you really do need Spark.
        
           | tomthe wrote:
           | Can you elaborate? I had no problem with DuckDB and a few TB
           | of data. But it depends on what exactly you do with it, of
           | course.
        
             | bdangubic wrote:
             | few hundred TBs here - no issues :)
        
       | pm90 wrote:
       | > My name is Miles, I'm a Principal Program Manager at Microsoft.
       | While a Spark specialist by role
       | 
       | JFYI. I think the article itself is pretty unbiased but I feel
       | like it's worth including this disclaimer about the author.
        
         | EdwardDiego wrote:
         | I was a little confused as to why he couldn't run Spark
         | locally: because he insisted on the "Fabric Spark runtime".
        
         | cyberlurker wrote:
         | Why? I don't understand what working at Microsoft has to do
         | with this.
         | 
         | And I found the entire article very informative with little
         | room for bias.
        
           | bdcravens wrote:
           | You know there are always going to be comments from those who
           | bring the low-value criticism of sleuthing out someone's
           | employment and making implications. Being transparent can
           | assuage such discussion.
        
       | kianN wrote:
       | I went through this trade off at my last job. I started off
       | migrating my ad hoc queries to duckdb directly from delta tables.
       | Over time, I used duckdb enough to do some performance tuning. I
       | found that migrating from Delta to duckdb's native file format
       | provided substantial speed wins.
       | 
       | The author focuses on read/write performance on Delta (makes
       | sense for the scope of the comparison). I think if an engineer is
       | considering switching from spark to duckdb/polars for their data
       | warehouse, they would likely be open to data formats other than
       | Delta, which is tightly coupled to Spark (and even more so to
       | the closed-source Databricks implementation). In my use case, we
       | saw enough speed wins and cost savings that it made sense to
       | fully migrate our data warehouse to a self managed duckdb
       | warehouse using duckdb's native file format.
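       | 
       | Roughly, the conversion boils down to something like this (a
       | minimal sketch; paths and table names are made up, and a real
       | Delta table may need the delta extension rather than raw Parquet
       | reads):
       | 
       |     import duckdb
       | 
       |     # Open (or create) a persistent DuckDB database file
       |     con = duckdb.connect("warehouse.duckdb")
       | 
       |     # Materialize the data into DuckDB's native format
       |     con.execute("""
       |         CREATE OR REPLACE TABLE events AS
       |         SELECT * FROM read_parquet('/data/events/**/*.parquet')
       |     """)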
        
         | thenaturalist wrote:
         | Thanks for sharing, very interesting!
         | 
         | I'm thinking the same wrt dropping Parquet.
         | 
         | I don't need concurrent writes, which seems to me about the
         | only true caveat DuckDB would have.
         | 
         | Two other questions I am asking myself:
         | 
         | 1) Is there an upper file size limit in duckdb's native
         | format where performance might degrade?
         | 
         | 2) Are there significant performance degradations/ hard limits
         | if I want to consolidate from X DuckDB's into a single one by
         | programmatically attaching them all and pulling data in via a
         | large `UNION ALL` query?
         | 
         | Or would you use something like Polars to query over N
         | DuckDB's?
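         | 
         | Roughly what I have in mind for 2) is something like this
         | (untested sketch; file and table names are hypothetical):
         | 
         |     import duckdb
         | 
         |     con = duckdb.connect("consolidated.duckdb")
         |     parts = ["part_a.duckdb", "part_b.duckdb", "part_c.duckdb"]
         | 
         |     # Attach each source database read-only under an alias
         |     for i, path in enumerate(parts):
         |         con.execute(f"ATTACH '{path}' AS src_{i} (READ_ONLY)")
         | 
         |     # Pull everything into one table via a big UNION ALL
         |     union_sql = " UNION ALL ".join(
         |         f"SELECT * FROM src_{i}.events"
         |         for i in range(len(parts))
         |     )
         |     con.execute(f"CREATE TABLE events AS {union_sql}")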
        
           | kianN wrote:
           | I can offer my experience with respect to your two questions,
           | though my use case is likely atypical.
           | 
           | 1) I haven't personally run into upper size limits to the
           | point of non-linear performance degradation. However, some
           | caveats to that are: (a) most of my files are in the range of
           | 2-10GB with a few topping out near 100GB. (b) I am running a
           | single r6gd.metal as the primary interface for this, which
           | has 512 GB of RAM. So, essentially, any one of my files can
           | fit into RAM.
           | 
           | Even given that setup, I will mention that I find myself hand-
           | tuning queries a lot more than I did with Spark. Since duckdb
           | is meant to be more lightweight, its query optimization engine
           | is less robust.
           | 
           | 2) I can't speak too much to this use case. I haven't had any
           | occasion to query across duckdb files. However, I experimented
           | on top of delta lake with both duckdb and polars and never
           | really found a true performance case for polars in my (again,
           | atypical) set of tests. But it's definitely worth doing your
           | own benchmarking on your specific use case :)
        
             | thenaturalist wrote:
             | Thanks for sharing! :)
        
         | RobinL wrote:
         | Interesting. So what does that look like on disk? Possibly
         | slightly naively I'm imagining a single massive file?
        
       | memhole wrote:
       | Nice write-up. I don't think the comments about duckdb spilling
       | to disk are correct. I believe if you create a temp or persistent
       | db, duckdb will spill to disk.
       | 
       | I might have missed it, but the integration of duckdb and the
       | arrow library makes mixing and matching dataframes and sql syntax
       | fairly seamless.
       | 
       | I'm convinced the simplicity of duckdb is worth a performance
       | penalty compared to spark for most workloads. Ime, people
       | struggle with fully utilizing spark.
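       | 
       | On the arrow integration point, here's a minimal sketch of what
       | I mean (table and column names are just illustrative):
       | 
       |     import duckdb
       |     import pyarrow as pa
       | 
       |     orders = pa.table(
       |         {"id": [1, 2, 3], "amount": [10.0, 25.5, 7.2]}
       |     )
       | 
       |     # SQL directly over the in-scope Arrow table
       |     rel = duckdb.sql("SELECT * FROM orders WHERE amount > 10")
       | 
       |     # ...and back out as Arrow / pandas for dataframe-style work
       |     arrow_tbl = rel.arrow()
       |     pandas_df = rel.df()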
        
         | m_ke wrote:
         | Yeah, I'd never go for Spark again; all of its use cases are
         | better handled with either DuckDB or Ray (or a combination of
         | both).
        
           | anonymousDan wrote:
           | I thought Ray was a reinforcement learning platform? Can you
           | elaborate on how it is a replacement for Spark?
        
             | m_ke wrote:
             | Ray is a distributed computing framework that has a fast
             | scheduler, zero copy / no serialization between tasks
             | (shared memory), and stateful "actors", making it great for
             | RL but it's more general than that.
             | 
             | I'd recommend checking out their architecture whitepaper:
             | https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-l...
             | 
             | Imagine spark without the JVM baggage and with no need to
             | spill to disk / serialize between steps unless it's
             | necessary.
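             | 
             | For a flavour of the model, here's a minimal sketch of tasks
             | and a stateful actor (nothing Spark-specific here):
             | 
             |     import ray
             | 
             |     ray.init()
             | 
             |     @ray.remote
             |     def square(x):
             |         return x * x
             | 
             |     @ray.remote
             |     class Counter:
             |         def __init__(self):
             |             self.n = 0
             | 
             |         def incr(self):
             |             self.n += 1
             |             return self.n
             | 
             |     # Tasks run in parallel; results come back as futures
             |     print(ray.get([square.remote(i) for i in range(4)]))
             | 
             |     # Actors keep state between calls
             |     counter = Counter.remote()
             |     print(ray.get(counter.incr.remote()))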
        
         | marshmellman wrote:
         | About spilling to disk, in DuckDB's docs I see:
         | 
         | > Both persistent and in-memory databases use spilling to disk
         | > to facilitate larger-than-memory workloads (i.e., out-of-core
         | > processing).
         | 
         | I don't have personal experience with it though.
         | 
         | https://duckdb.org/docs/connect/overview.html
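         | 
         | In practice it seems to come down to pointing DuckDB at a
         | database file and optionally capping memory - a minimal sketch
         | (limits and paths are just examples):
         | 
         |     import duckdb
         | 
         |     # Persistent database: larger-than-memory operators can
         |     # spill to disk alongside it
         |     con = duckdb.connect("analytics.duckdb")
         | 
         |     # Optional: cap memory and choose where spill files go
         |     con.execute("SET memory_limit = '8GB'")
         |     con.execute("SET temp_directory = '/mnt/nvme/duckdb_tmp'")
         | 
         |     con.execute("""
         |         CREATE TABLE big_sort AS
         |         SELECT * FROM read_parquet('events/*.parquet')
         |         ORDER BY ts
         |     """)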
        
       | nicornk wrote:
       | The blog post echoes my experience that duckDB just works (due to
       | superior disk-spilling capabilities) and polars OOMs a lot.
        
         | jdgoesmarching wrote:
         | My biggest issue with Polars (don't shoot me) was the poor LLM
         | support. 4o at least seems to frequently get confused with
         | Pandas syntax or logic.
         | 
         | This pushed me to finally investigate the DuckDB hype, and it's
         | hard to believe I can just write SQL and it just works.
        
       | shcheklein wrote:
       | Another alternative to consider is https://www.getdaft.io/ .
       | AFAIU it is a more direct competitor to Spark (distributed mode).
        
       | moandcompany wrote:
       | My opinion: the high prevalence of implementations using Spark,
       | Pandas, etc. is mostly driven by (1) people's tendency to work
       | with tools that use APIs they are already familiar with, (2)
       | resume-driven development, and/or to a much lesser degree (3)
       | sustainability with regard to future maintainers, versus what may
       | be technically sensible with regard to performance. A decade ago
       | we saw similar articles referencing misapplications of
       | Hadoop/MapReduce, and today it is Spark as its successor.
       | 
       | Pandas' use of the dataframe concepts and APIs was informed by R
       | and a desire to provide something familiar and accessible to R
       | users (i.e. ease of user adoption).
       | 
       | Likewise, when the Spark development community somewhere around
       | the version 0.11 days began implementing the dataframe
       | abstraction over its original native RDD abstractions, it
       | understood the need to provide a robust Python API similar to the
       | Pandas APIs for accessibility (i.e. ease of user adoption).
       | 
       | At some point those familiar APIs also became a burden, or were
       | not great to begin with, in several ways, and we see new tools
       | like DuckDB and Polars emerge.
       | 
       | However, we now have a non-unique issue where people are learning
       | and applying specific tools versus general problem-solving skills
       | and tradecraft in the related domain (i.e. the common pattern of
       | people with hammers seeing everything as nails). Note all of the
       | "learn these -n- tools/packages to become a great ____ engineer
       | and make xyz dollars" type tutorials and starter-packs on the
       | internet today.
        
         | threeseed wrote:
         | It's always easy to ignorantly criticise technology choices.
         | 
         | But in my experience, in almost all cases it is misguided
         | requirements (e.g. "we want to support 100x the data in 5
         | years") that drive choices which look bad in hindsight. Not
         | resume-driven development.
         | 
         | And at least in the enterprise space, having a vendor who can
         | support the technology is just as important as the merits of
         | the technology itself. And vendors tend to spring up around
         | popular, trendy technologies.
        
           | bdcravens wrote:
           | It's not binary. Valid uses of a technology don't mean there
           | aren't others using that same technology in a resume-driven
           | manner.
        
             | com wrote:
             | Yeah, I've seen enough of both to know that this is
             | genuine.
             | 
             | The problem with resume-driven technology choices is a
             | kind of tech debt that typically costs a lot of cash to
             | operate and, perhaps worse, delivers terrible opportunity
             | costs, both of which really do sink businesses.
             | 
             | Premature scaling doesn't. The challenge is not leaving it
             | too late. Even that, though, has only led to a very few
             | high-profile business failures.
        
       | code51 wrote:
       | Why is Apache DataFusion not there as an alternative?
        
       | steveBK123 wrote:
       | I do wonder if some new tech adoption will actually be slowed due
       | to the prevalence of LLM-assisted coding?
       | 
       | That is - all these code assistants are going to be 10x as useful
       | on spark/pandas as they would be on duckdb/polars, due to the age
       | of the former and the continued rate of change in the latter.
        
       | RobinL wrote:
       | I submitted this because I thought it was a good, high effort
       | post, but I must admit I was surprised by the conclusion. In my
       | experience, admittedly on different workloads, duckdb is both
       | faster and easier to use than spark, and requires significantly
       | less tuning and less complex infrastructure. I've been trying to
       | transition as much as possible over to duckdb.
       | 
       | There are also some interesting points in the following podcast
       | about ease of use and transactional capabilities of duckdb which
       | are easy to overlook (you can skip the first 10 mins):
       | https://open.spotify.com/episode/7zBdJurLfWBilCi6DQ2eYb
       | 
       | Of course, if you have truly massive data, you probably still
       | need Spark.
        
         | diroussel wrote:
         | Thanks, I'll give that a listen. Here is the Apple Podcast link
         | to the same episode:
         | https://podcasts.apple.com/gb/podcast/the-joe-reis-show/id16...
         | 
         | I've also experimented with duckdb whilst on a Databricks
         | project, and did also think "we could do this whole thing with
         | duckdb and a large EC2 instance spun up for a few hours a
         | week".
         | 
         | But of course duckdb was new then, and you can't re-architect
         | on a hunch. Thanks for the article.
        
         | IshKebab wrote:
         | This guy says "I live and breathe Spark"... I would take the
         | conclusions with a grain of salt.
        
         | thenaturalist wrote:
         | Do you still use file formats at all in your work?
         | 
         | I'm currently thinking of ditching Parquet altogether, and
         | going all in on DuckDB files.
         | 
         | I don't need concurrent writes, my data would rarely exceed 1TB,
         | and if it did, I could still offload to Parquet.
         | 
         | Conceptually I can't see a reason for this not working, but
         | given the novelty of the tech I'm wondering if it'll hold up.
        
       | serjester wrote:
       | Polars is much more useful if you're doing complex
       | transformations instead of basic ETL.
       | 
       | Something underappreciated about polars is how easy it is to
       | build a plugin. I recently took a Rust crate that reimplemented
       | the h3 geospatial coordinate system, exposed it as a polars
       | plugin, and achieved performance 5x faster than the DuckDB
       | version.
       | 
       | Knowing zero Rust, and with some help from AI, it only took me
       | about two days - I can't imagine doing this in C++ (DuckDB).
       | 
       | [1] https://github.com/Filimoa/polars-h3
        
       | adsharma wrote:
       | I'm a bit confused by the claim that duckdb doesn't support
       | dataframes.
       | 
       | This blog post suggests that it has been supported since 2021,
       | which matches my experience.
       | 
       | https://duckdb.org/2021/05/14/sql-on-pandas.html
        
         | antman wrote:
         | Maybe they mean the other way around, i.e. querying duckdb using
         | dataframe queries instead of SQL, which can be achieved through
         | the ibis project.
        
         | simicd wrote:
         | From what I understood the article refers to the point that
         | DuckDB doesn't provide its own dataframe API, meaning a way to
         | express SQL queries in Python classes/functions.
         | 
         | The link you shared shows how DuckDB can run SQL queries on a
         | pandas dataframe (e.g. `duckdb.query("<SQL query>")`). The SQL
         | query in this case is a string. A dataframe API would allow you
         | to write it completely in Python. An example for this would be
         | polars dataframes
         | (`df.select(pl.col("...").alias("...")).filter(pl.col("...") >
         | x)`).
         | 
         | Dataframe APIs benefit from autocompletion, error handling,
         | syntax highlighting, etc. that the SQL strings wouldn't. Please
         | let me know if I missed something from the blog post you
         | linked!
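         | 
         | Side by side, the difference looks roughly like this (column
         | names are just illustrative, and the DuckDB side assumes a
         | recent version that can scan in-scope polars frames):
         | 
         |     import duckdb
         |     import polars as pl
         | 
         |     df = pl.DataFrame(
         |         {"city": ["Oslo", "Lima"], "temp": [3.0, 22.5]}
         |     )
         | 
         |     # DuckDB: the query itself is a SQL string scanning the
         |     # in-scope dataframe
         |     warm_sql = duckdb.sql(
         |         "SELECT city FROM df WHERE temp > 10"
         |     ).pl()
         | 
         |     # Polars: the same query expressed as Python expressions
         |     warm_expr = df.filter(pl.col("temp") > 10).select(
         |         pl.col("city").alias("warm_city")
         |     )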
        
       | downrightmike wrote:
       | Or use https://lancedb.com/ - LanceDB is a developer-friendly,
       | open source database for AI. From hyper scalable vector search
       | and advanced retrieval for RAG, to streaming training data and
       | interactive exploration of large scale AI datasets, LanceDB is
       | the best foundation for your AI application.
       | 
       | Recent podcast
       | https://talkpython.fm/episodes/show/488/multimodal-data-with...
        
         | farsa wrote:
         | Hey, you have some silly thing rendered on your product's
         | landing page chewing CPU.
        
       | pnut wrote:
       | Maybe this says more about the company I work for, but it is
       | just incomprehensible to me to casually contemplate dumping a
       | generally comparable, installed production capability.
       | 
       | All I think when I read this is, standing up new environments,
       | observability, dev/QA training, change control, data migration,
       | mitigating risks to business continuity, integrating with data
       | sources and sinks, and on and on...
       | 
       | I've got enough headaches already without another one of those
       | projects.
        
       | buremba wrote:
       | Great post, but it seems like you still rely on Fabric to run
       | Spark NEE. If you're on AWS or GCP, you should probably not ditch
       | Spark but combine both. DuckDB's gotcha is that it can't scale
       | horizontally (multi-node), unlike Databricks. A single node can
       | get you quite far - you can rent 2TB of memory + 20TB of NVMe in
       | AWS - and if you use PySpark, you can run DuckDB via its Spark
       | integration
       | (https://duckdb.org/docs/api/python/spark_api.html) until it
       | doesn't scale, then switch to Databricks if you need to scale
       | out. That way, you get the best of both worlds.
       | 
       | DuckDB on AWS EC2's price/performance is 10x that of Databricks
       | and Snowflake with its native file format, so it's a better deal
       | if you're not processing petabyte-level data. That's
       | unsurprising, given that DuckDB operates on a single node (no
       | need for distributed shuffles) and works primarily with NVMe (no
       | use of object stores such as S3 for intermediate data). Thus, it
       | can optimize the workloads much better than the other data
       | warehouses.
       | 
       | If you use SQL, another gotcha is that DuckDB doesn't have the
       | advanced catalog features of cloud data warehouses. Still, it's
       | possible to combine DuckDB compute and Snowflake Horizon /
       | Databricks Unity Catalog thanks to Apache Iceberg, which enables
       | multi-engine support in the same catalog. I'm experimenting with
       | this multi-stack idea with DuckDB <> Snowflake, and it works well
       | so far: https://github.com/buremba/universql
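       | 
       | For reference, the Spark integration mentioned above looks
       | roughly like this (sketch based on the linked docs; the API is
       | experimental and the import path may change):
       | 
       |     import pandas as pd
       |     from duckdb.experimental.spark.sql import SparkSession
       |     from duckdb.experimental.spark.sql.functions import col
       | 
       |     # PySpark-style session, but backed by DuckDB (no JVM)
       |     spark = SparkSession.builder.getOrCreate()
       | 
       |     df = spark.createDataFrame(pd.DataFrame({"x": [1, 2, 3]}))
       |     df = df.withColumn("x_doubled", col("x") * 2)
       |     df.show()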
        
       | ibgeek wrote:
       | Good write-up. The only real bias I can detect is that the author
       | seems to conflate their (lack of) familiarity with ease of use. I
       | bet if they spent a few months using DuckDB and Polars on a daily
       | basis, they might find some of the tasks just as easy or easier
       | to implement.
        
       ___________________________________________________________________
       (page generated 2024-12-15 23:00 UTC)