[HN Gopher] Should you ditch Spark for DuckDB or Polars?
___________________________________________________________________
Should you ditch Spark for DuckDB or Polars?
Author : RobinL
Score : 72 points
Date : 2024-12-14 20:28 UTC (1 day ago)
(HTM) web link (milescole.dev)
(TXT) w3m dump (milescole.dev)
| jimmyl02 wrote:
| pretty new to the large scale data processing space so not sure
| if this is a known question but isn't the purpose of spark that
| it can be distributed across many workers and parallelized?
|
| I guess the scale of data here (~100GB) is manageable with
| something like DuckDB but once data gets past a certain scale,
| wouldn't single-machine performance have no way of matching a
| distributed spark cluster?
| ojhughes wrote:
| I think it's quite rare that a dataset is too large to be
| processed on a single volume. EBS volumes max out at 64TB
| hobs wrote:
| It's really the only reason to use Spark in the first place,
| because you're doing non-local level processing - when I was
| throwing 100TB at the problem it made more sense than most of
| the data science tasks I see it used for.
| mnsc wrote:
| This magnitude of data fascinates me. Can you elaborate on
| how that much data came to be and what kind of processing
| was needed for all that data? And if not, maybe point to
| some podcast, blog that goes into the nitty gritty of those
| types of real big data challenges.
| threeseed wrote:
| If you're using EBS for shuffle/source data you're already
| behind.
|
| If you can afford it you should be using the host NVMe drives
| or FSx and relying on Spark to handle outages. The difference
| in performance will be orders of magnitude.
|
| And in this case you won't have the ability to store 64TB.
| The average max is 2TB.
| anonymousDan wrote:
| It's not just disk capacity though. Disk bandwidth, CPU and
| memory capacity/bandwidth etc can also become bottlenecks on
| a single machine.
| pjot wrote:
| So many think they (their company) have "Big Data" but don't.
| bdcravens wrote:
| My employer's product database is about 2TB currently and is
| still perfectly manageable with a single machine, and to be
| honest, it's not even all that optimized.
| pletnes wrote:
| My impression is that companies like Databricks because of the
| framework it offers for organizing data workflows. Spark is
| then used since it is how Databricks does its processing.
|
| Microsoft's Synapse is their "ripoff product" which tries to
| compete directly by offering the same thing (only worse) with
| MS branding and better Azure integration.
|
| I've yet to see Spark being used outside of these products but
| would be happy to hear of such use.
| extr wrote:
| Yes, but such situations are incredibly rare IMO. I was just
| recently in a job where a self-described expert was obsessed
| with writing all data pipelines (in this context, 1-2GB of
| data) in spark, because that was the "proper" way to do things
| since we had Databricks. It was beyond a waste of time.
| tmountain wrote:
| Yup, everyone thought they had "big data" for a moment in the
| industry, and it turns out, they all just read the same
| trendy blog posts. I made the mistake of "clustering" at my
| previous company, and it introduced so much unnecessary
| complexity. If you are really worried about future proofing,
| just design a schema where the main entities are denormalized
| and use UUID for your primary keys. If you need to shard
| later, it will be much easier. Although, again, it's likely
| that you will never need to.
| threeseed wrote:
| People make the mistake of thinking it's a single 100GB
| dataset.
|
| It's more common to be manipulating lots of smaller datasets
| together in a way where you need more than 100GB of disk
| space. And in that situation you really need Spark.
| tomthe wrote:
| Can you elaborate? I had no problem with DuckDB and a few TB
| of data. But it depends on what exactly you do with it, of
| course.
| bdangubic wrote:
| few hundred TBs here - no issues :)
| pm90 wrote:
| > My name is Miles, I'm a Principal Program Manager at Microsoft.
| While a Spark specialist by role
|
| JFYI. I think the article itself is pretty unbiased but I feel
| like it's worth putting this disclaimer for the author.
| EdwardDiego wrote:
| I was a little confused as to why he couldn't run Spark
| locally, given he insisted on the "Fabric Spark runtime".
| cyberlurker wrote:
| Why? I don't understand what working at Microsoft has to do
| with this.
|
| And I found the entire article very informative with little
| room for bias.
| bdcravens wrote:
| You know there's always going to be comments from those who
| bring the low-value criticism of sleuthing out someone's
| employment and making implications. Being transparent can
| assuage such discussion.
| kianN wrote:
| I went through this trade-off at my last job. I started off
| migrating my ad hoc queries to duckdb directly from delta tables.
| Over time, I used duckdb enough to do some performance tuning. I
| found that migrating from Delta to duckdb's native file format
| provided substantial speed wins.
|
| The author focuses on read/write performance on Delta (makes
| sense for the scope of the comparison). I think if an engineer is
| considering switching from spark to duckdb/polars for their data
| warehouse, they would likely be open to data formats other than
| Delta, which is tightly coupled to spark (and even more so to
| the closed-source Databricks implementation). In my use case, we
| saw enough speed wins and cost savings that it made sense to
| fully migrate our data warehouse to a self-managed duckdb
| warehouse using duckdb's native file format.
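|
| As a rough sketch of what that migration step looked like (table
| and file names here are hypothetical; this assumes the deltalake
| and duckdb Python packages):
|
|     import duckdb
|     from deltalake import DeltaTable
|
|     # Read the Delta table into an Arrow table
|     events_arrow = DeltaTable("lake/events").to_pyarrow_table()
|
|     # Persist it into DuckDB's native file format
|     con = duckdb.connect("warehouse.duckdb")
|     con.register("events_arrow", events_arrow)
|     con.execute("CREATE TABLE events AS SELECT * FROM events_arrow")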
| thenaturalist wrote:
| Thanks for sharing, very interesting!
|
| I'm thinking the same wrt dropping Parquet.
|
| I don't need concurrent writes, which seems to me about the
| only true caveat DuckDB would have.
|
| Two other questions I am asking myself:
|
| 1) Is there an upper file size limit in duckdb's native
| format where performance might degrade?
|
| 2) Are there significant performance degradations/ hard limits
| if I want to consolidate from X DuckDB's into a single one by
| programmatically attaching them all and pulling data in via a
| large `UNION ALL` query?
|
| Or would you use something like Polars to query over N
| DuckDB's?
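|
| Roughly what I have in mind for (2), as a sketch (file and table
| names hypothetical):
|
|     import duckdb
|
|     con = duckdb.connect("consolidated.duckdb")
|
|     # Attach each source database read-only, then pull its rows in
|     for i, path in enumerate(["part_a.duckdb", "part_b.duckdb"]):
|         con.execute(f"ATTACH '{path}' AS src_{i} (READ_ONLY)")
|
|     con.execute("""
|         CREATE TABLE events AS
|         SELECT * FROM src_0.events
|         UNION ALL
|         SELECT * FROM src_1.events
|     """)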
| kianN wrote:
| I can offer my experience with respect to your two questions,
| though my use case is likely atypical.
|
| 1) I haven't personally run into upper size limits to the
| point of non-linear performance degradation. However, some
| caveats to that are (a) most of my files are in the range of
| 2-10GB with a few topping out near 100GB. (b) I am running a
| single r6gd.metal as the primary interface with this, which
| has 512GB of RAM. So, essentially, any one of my files can
| fit into RAM.
|
| Even given that setup, I will mention that I find myself
| hand-tuning queries a lot more than I did with Spark. Since
| duckdb is meant to be more lightweight, its query optimizer
| is less robust.
|
| 2) I can't speak too much towards this use case. I haven't
| had any occasion to query across duckdb files. However, I
| experimented on top of delta lake between duckdb and polars
| and never really found a true performance case for polars in
| my (again, atypical use case) set of tests. But definitely
| worth doing your own benchmarking on your specific use case
| :)
| thenaturalist wrote:
| Thanks for sharing! :)
| RobinL wrote:
| Interesting. So what does that look like on disk? Possibly
| slightly naively I'm imagining a single massive file?
| memhole wrote:
| Nice write up. I don't think the comments about duckdb spilling
| to disk are correct. I believe if you create a temp or
| persistent db, duckdb will spill to disk.
|
| I might have missed it, but the integration of duckdb and the
| arrow library makes mixing and matching dataframes and sql syntax
| fairly seamless.
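|
| A minimal sketch of both points, assuming the duckdb and pandas
| Python packages (names hypothetical): a file-backed connection
| gives duckdb somewhere to spill, and SQL runs directly over an
| in-memory dataframe.
|
|     import duckdb
|     import pandas as pd
|
|     # File-backed database: larger-than-memory work can spill here
|     con = duckdb.connect("scratch.duckdb")
|
|     df = pd.DataFrame({"user_id": [1, 2, 2],
|                        "amount": [10.0, 5.0, 7.5]})
|
|     # SQL over the dataframe, result returned as a dataframe again
|     totals = con.execute(
|         "SELECT user_id, SUM(amount) AS total FROM df GROUP BY user_id"
|     ).df()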
|
| I'm convinced the simplicity of duckdb is worth a performance
| penalty compared to spark for most workloads. Ime, people
| struggle with fully utilizing spark.
| m_ke wrote:
| Yeah I'd never go for spark again, all of its use cases are
| better handled with either DuckDB or Ray (or a combination of
| both).
| anonymousDan wrote:
| I thought Ray was a reinforcement learning platform? Can you
| elaborate on how it is a replacement for Spark?
| m_ke wrote:
| Ray is a distributed computing framework that has a fast
| scheduler, zero copy/serialization between tasks (shared
| memory), and stateful "actors", making it great for RL, but
| it's more general than that.
|
| I'd recommend checking out their architecture whitepaper:
| https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-l...
|
| Imagine spark without the JVM baggage and with no need to
| spill to disk / serialize between steps unless it's
| necessary.
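|
| A minimal sketch of the task model (nothing RL-specific about it):
|
|     import ray
|
|     ray.init()  # local cluster here; points at a real cluster in prod
|
|     @ray.remote
|     def partial_sum(chunk):
|         return sum(chunk)
|
|     # Results sit in the shared-memory object store between tasks,
|     # so nothing is serialized to disk between these two stages.
|     futures = [partial_sum.remote(range(i * 1_000, (i + 1) * 1_000))
|                for i in range(8)]
|     print(sum(ray.get(futures)))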
| marshmellman wrote:
| About spilling to disk, in DuckDB's docs I see:
|
| > Both persistent and in-memory databases use spilling to disk
| to facilitate larger-than-memory workloads (i.e., out-of-core
| processing).
|
| I don't have personal experience with it though.
|
| https://duckdb.org/docs/connect/overview.html
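|
| The spill behaviour can also be steered with settings; a small
| sketch (limit and path are hypothetical):
|
|     import duckdb
|
|     con = duckdb.connect("bigjob.duckdb")
|     con.execute("SET memory_limit = '8GB'")         # spill past this
|     con.execute("SET temp_directory = '/mnt/nvme/tmp'")  # spill here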
| nicornk wrote:
| The blog post echoes my experience that DuckDB just works (due
| to superior disk spilling capabilities) and polars OOMs a lot.
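|
| (For what it's worth, polars' streaming engine is meant to help
| with larger-than-memory queries - a sketch, file pattern and
| columns hypothetical - though I still found it OOMs more readily
| than duckdb.)
|
|     import polars as pl
|
|     # Lazy scan + streaming collect instead of loading everything
|     out = (
|         pl.scan_parquet("events/*.parquet")
|           .group_by("user_id")
|           .agg(pl.col("amount").sum())
|           .collect(streaming=True)
|     )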
| jdgoesmarching wrote:
| My biggest issue with Polars (don't shoot me) was the poor LLM
| support. 4o at least seems to frequently get confused with
| Pandas syntax or logic.
|
| This pushed me to finally investigate the DuckDB hype, and it's
| hard to believe I can just write SQL and it just works.
| shcheklein wrote:
| Another alternative to consider is https://www.getdaft.io/ .
| AFAIU it is a more direct competitor to Spark (distributed mode).
| moandcompany wrote:
| My opinion: the high prevalence of implementations using Spark,
| Pandas, etc. is mostly driven by (1) people's tendency to work
| with tools that use APIs they are already familiar with, (2)
| resume-driven development, and/or to a much lesser degree (3)
| sustainability with regard to future maintainers, versus what may
| be technically sensible with regard to performance. A decade ago
| we saw similar articles referencing misapplications of
| Hadoop/Mapreduce, and today it is Spark as its successor.
|
| Pandas' use of the dataframe concepts and APIs was informed by R
| and a desire to provide something familiar and accessible to R
| users (i.e. ease of user adoption).
|
| Likewise, when the Spark development community somewhere around
| the version 0.11 days began implementing the dataframe
| abstraction over its original native RDD abstractions, it
| understood the need to provide a robust Python API similar to the
| Pandas APIs for accessibility (i.e. ease of user adoption).
|
| At some point those familiar APIs also became a burden, or were
| not-great to begin with, in several ways and we see new tools
| emerge like DuckDB and Polars.
|
| However, we now have a non-unique issue where people are learning
| and applying specific tools versus general problem-solving skills
| and tradecraft in the related domain (i.e. the common pattern of
| people with hammers seeing everything as nails). Note all of the
| "learn these -n- tools/packages to become a great ____ engineer
| and make xyz dollars" type tutorials and starter-packs on the
| internet today.
| threeseed wrote:
| It's always easy to ignorantly criticise technology choices.
|
| But in my experience, in almost all cases it is misguided
| requirements (e.g. "we want to support 100x data requirements
| in 5 years") that drive choices that look bad in hindsight,
| not resume-driven development.
|
| And at least in the enterprise space, having a vendor who can
| support the technology is just as important as the merits of
| the technology itself. And vendors tend to spring up from
| popular, trendy technologies.
| bdcravens wrote:
| It's not binary. Valid uses of a technology don't mean there
| aren't others using that same technology in a resume-driven
| manner.
| com wrote:
| Yeah, I've seen enough of both to know that this is
| genuine.
|
| The problem with resume-driven technology choices is a
| kind of tech debt that typically costs a lot of cash to
| operate and, perhaps worse, delivers terrible opportunity
| costs, both of which really do sink businesses.
|
| Premature scaling doesn't. The challenge is not leaving it
| too late. Even that, though, has only led to a very few
| high-profile business failures.
| code51 wrote:
| Why is Apache DataFusion not there as an alternative?
| steveBK123 wrote:
| I do wonder if some new tech adoption will actually be slowed due
| to the prevalence of LLM-assisted coding?
|
| That is - all these code assistants are going to be 10x as useful
| on spark/pandas as they would be on duckdb/polars, due to the age
| of the former and the continued rate of change in the latter.
| RobinL wrote:
| I submitted this because I thought it was a good, high-effort
| post, but I must admit I was surprised by the conclusion. In my
| experience, admittedly on different workloads, duckdb is both
| faster and easier to use than spark, and requires significantly
| less tuning and less complex infrastructure. I've been trying to
| transition as much as possible over to duckdb.
|
| There are also some interesting points in the following podcast
| about ease of use and transactional capabilities of duckdb which
| are easy to overlook (you can skip the first 10 mins):
| https://open.spotify.com/episode/7zBdJurLfWBilCi6DQ2eYb
|
| Of course, if you have truly massive data, you probably still
| need spark.
| diroussel wrote:
| Thanks, I'll give that a listen. Here is the Apple Podcast link
| to the same episode:
| https://podcasts.apple.com/gb/podcast/the-joe-reis-show/id16...
|
| I've also experimented with duckdb whilst on a databricks
| project, and did also think "we could do this whole thing with
| duckdb and a large EC2 instance spun up for a few hours a
| week".
|
| But of course duckdb was new then, and you can't re-architect
| on a hunch. Thanks for the article.
| IshKebab wrote:
| This guy says "I live and breathe Spark"... I would take the
| conclusions with a grain of salt.
| thenaturalist wrote:
| Do you still use file formats at all in your work?
|
| I'm currently thinking of ditching Parquet altogether and
| going all in on DuckDB files.
|
| I don't need concurrent writes, my data would rarely exceed
| 1TB, and if it did, I could still offload to Parquet.
|
| Conceptually I can't see a reason for this not working, but
| given the novelty of the tech I'm wondering if it'll hold up.
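|
| The offload escape hatch is basically a one-liner anyway
| (hypothetical table and path):
|
|     import duckdb
|
|     con = duckdb.connect("warehouse.duckdb")
|     # Export a table back to Parquet if it outgrows the .duckdb file
|     con.execute("COPY events TO 'events.parquet' (FORMAT PARQUET)")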
| serjester wrote:
| Polars is much more useful if you're doing complex
| transformations instead of basic ETL.
|
| Something underappreciated about polars is how easy it is to
| build a plugin. I recently took a Rust crate that reimplemented
| the h3 geospatial coordinate system, exposed it as a polars
| plugin, and achieved performance 5X faster than the DuckDB
| version.
|
| Knowing zero Rust and with some help from AI, it only took me
| 2-ish days - I can't imagine doing this in C++ (DuckDB).
|
| [1] https://github.com/Filimoa/polars-h3
| adsharma wrote:
| I'm a bit confused by the claim that duckdb doesn't support
| dataframes.
|
| This blog post suggests that it has been supported since 2021 and
| matches my experience.
|
| https://duckdb.org/2021/05/14/sql-on-pandas.html
| antman wrote:
| Maybe they mean the other way around, i.e. querying duckdb
| using dataframe-style queries instead of SQL, which can be
| achieved through the ibis project.
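|
| A sketch of that direction with ibis' duckdb backend (table and
| column names hypothetical):
|
|     import ibis
|
|     con = ibis.duckdb.connect("warehouse.duckdb")
|     events = con.table("events")
|
|     # Dataframe-style query, compiled to SQL and run by duckdb
|     top = (events.filter(events.amount > 100)
|                  .group_by("user_id")
|                  .aggregate(total=events.amount.sum())
|                  .execute())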
| simicd wrote:
| From what I understood the article refers to the point that
| DuckDB doesn't provide its own dataframe API, meaning a way to
| express SQL queries in Python classes/functions.
|
| The link you shared shows how DuckDB can run SQL queries on a
| pandas dataframe (e.g. `duckdb.query("<SQL query>")`). The SQL
| query in this case is a string. A dataframe API would allow you
| to write it completely in Python. An example for this would be
| polars dataframes
| (`df.select(pl.col("...").alias("...")).filter(pl.col("...") >
| x)`).
|
| Dataframe APIs benefit from autocompletion, error handling,
| syntax highlighting, etc. that the SQL strings wouldn't. Please
| let me know if I missed something from the blog post you
| linked!
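|
| A side-by-side sketch of the distinction (data and column names
| hypothetical; duckdb can scan the polars frame because it is in
| scope):
|
|     import duckdb
|     import polars as pl
|
|     df = pl.DataFrame({"name": ["a", "b", "c"], "score": [10, 42, 87]})
|
|     # DuckDB: the query itself is a SQL string
|     high_sql = duckdb.query(
|         "SELECT name, score FROM df WHERE score > 20"
|     ).pl()
|
|     # Polars: the same query expressed through the dataframe API
|     high_api = df.filter(pl.col("score") > 20).select("name", "score")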
| downrightmike wrote:
| Or use https://lancedb.com/ - LanceDB is a developer-friendly,
| open-source database for AI. From hyper-scalable vector search
| and advanced retrieval for RAG, to streaming training data and
| interactive exploration of large-scale AI datasets, LanceDB is
| the best foundation for your AI application.
|
| Recent podcast
| https://talkpython.fm/episodes/show/488/multimodal-data-with...
| farsa wrote:
| Hey, you have some silly thing rendered on your product's
| landing page chewing CPU.
| pnut wrote:
| Maybe this says more about the company I work at, but it is
| just incomprehensible to me to casually contemplate dumping a
| generally comparable, installed production capability.
|
| All I think when I read this is, standing up new environments,
| observability, dev/QA training, change control, data migration,
| mitigating risks to business continuity, integrating with data
| sources and sinks, and on and on...
|
| I've got enough headaches already without another one of those
| projects.
| buremba wrote:
| Great post but it seems like you still rely on Fabric to run
| Spark NEE. If you're on AWS or GCP, you should probably not ditch
| Spark but combine both. DuckDB's gotcha is that it can't scale
| horizontally (multi-node), unlike Databricks. A single node can
| get you quite far - you can rent 2TB of memory + 20TB of NVMe
| in AWS - and if you use PySpark, you can run DuckDB through its
| Spark integration
| (https://duckdb.org/docs/api/python/spark_api.html) until it no
| longer scales, then switch to Databricks if you need to scale
| out. That way, you get the best of both worlds.
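|
| A sketch of that migration path, per the linked docs (the API is
| experimental, so treat the module path as subject to change):
|
|     import pandas as pd
|     from duckdb.experimental.spark.sql import SparkSession
|     from duckdb.experimental.spark.sql.functions import col
|
|     # Same PySpark-style code, executed by DuckDB on a single node;
|     # scaling out later means pointing the imports back at pyspark.sql
|     spark = SparkSession.builder.getOrCreate()
|     df = spark.createDataFrame(pd.DataFrame({"id": [1, 2, 3],
|                                              "label": ["a", "b", "c"]}))
|     df.filter(col("id") > 1).show()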
|
| With its native file format, DuckDB on AWS EC2's
| price-performance is 10x that of Databricks and Snowflake, so
| it's a better deal if you're not processing petabyte-level
| data. That's unsurprising, given that DuckDB operates on a
| single node (no need for distributed shuffles) and works
| primarily with NVMe (no use of object stores such as S3 for
| intermediate data). Thus, it
| can optimize the workloads much better than the other data
| warehouses.
|
| If you use SQL, another gotcha is that DuckDB doesn't have the
| advanced catalog features found in cloud data warehouses. Still,
| it's
| possible to combine DuckDB compute and Snowflake Horizon /
| Databricks Unity Catalog thanks to Apache Iceberg, which enables
| multi-engine support in the same catalog. I'm experimenting with
| this multi-stack idea using DuckDB <> Snowflake, and it works
| well so far: https://github.com/buremba/universql
| ibgeek wrote:
| Good write up. The only real bias I can detect is that the author
| seems to conflate their (lack of) familiarity with ease of use. I
| bet if they spent a few months using DuckDB and Polars on a daily
| basis, they might find some of the tasks just as easy or easier
| to implement.
___________________________________________________________________
(page generated 2024-12-15 23:00 UTC)