[HN Gopher] Parallel Grouped Aggregation in DuckDB
___________________________________________________________________
Parallel Grouped Aggregation in DuckDB
Author : hfmuehleisen
Score : 69 points
Date : 2022-03-07 15:54 UTC (7 hours ago)
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
| keewee7 wrote:
| Polars+DuckDB beats Pandas.
|
| Tidyverse (R) is superior for data exploration, but R is not fun
| to deploy or to build complex multi-job data pipelines with.
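|
| The basic workflow is something like this (a rough Python
| sketch, untested; DuckDB's Python client can register a pandas
| DataFrame directly, and a Polars frame can be converted first
| with to_pandas()):
|
|   import duckdb
|   import pandas as pd
|
|   # toy dataframe standing in for real data
|   df = pd.DataFrame({
|       "group_key": ["a", "b", "a", "c", "b"],
|       "value": [1.0, 2.0, 3.0, 4.0, 5.0],
|   })
|
|   con = duckdb.connect()    # in-memory database
|   con.register("df", df)    # expose the DataFrame as a view
|   result = con.execute("""
|       SELECT group_key, SUM(value) AS total, COUNT(*) AS n
|       FROM df
|       GROUP BY group_key
|   """).fetchdf()            # back to a pandas DataFrame
|   print(result)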
| claytonjy wrote:
| Despite having switched to Python after many years of being a
| tidyverse acolyte, I don't really understand this argument
| against deployed R.
|
| Are folks who say this not using containers? R has been at
| least as easy to dockerize as Python since whenever Rocker
| started, and it has only gotten easier with more recent package
| management options. Once dockerized, my only R complaints are
| around logging inconsistencies.
|
| I used to think the culture around R meant that productionizing
| arbitrary code was harder on average than in Python...but years
| of suffering with the pandas API has me thinking the opposite
| these days.
|
| I can trust a junior R dev to write reusable pure functions but
| can't trust a senior Python dev to do the same!
| hfmuehleisen wrote:
| DuckDB also works fine with R data frames, so there is really
| no downside to using R in this case.
| Dayshine wrote:
| For one, in my experience, CRAN only stores binaries of the
| most recent release of each package. This means that either
| you have to accept you can never rebuild a docker image, or
| you have to make sure that you are always able to recompile
| all of your R packages from source.
|
| This means you don't just have to pin your R package
| versions, you have to pin all the build dependencies.
|
| And you have to have a different image for different sets of
| R packages because they might have different build
| dependencies.
| claytonjy wrote:
| Reproducibility is of course a valid concern, but I never
| had issues with that thanks to Rocker and MRAN.
|
| I do sympathize with long from-source build times; as a
| Linux user I don't think I had binaries available until I
| stopped using R, so I've spent days, perhaps weeks, waiting
| for dplyr to install over the course of my R usage.
|
| > different image for different sets of R packages because
| they might have different build dependencies
|
| Is this not true of all software in all languages?
| [deleted]
| RobinL wrote:
| This is great. In terms of real-world uses, I'm currently working
| on enabling DuckDB as a backend in Splink[1], software for record
| linkage at scale. Central to the software is an iterative
| algorithm (Expectation Maximisation) that performs a large number
| of group-by aggregations on large tables.
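|
| The shape of that inner loop is roughly the following (a toy
| sketch, not Splink's actual code; table and column names here
| are made up):
|
|   import duckdb
|
|   con = duckdb.connect()
|   # toy comparison table standing in for Splink's pairwise
|   # record comparisons
|   con.execute("""
|       CREATE TABLE comparisons AS
|       SELECT i % 3 AS gamma_name, i % 2 AS gamma_dob,
|              random() AS match_probability
|       FROM range(1000000) t(i)
|   """)
|
|   for _ in range(5):
|       # each EM iteration boils down to a handful of grouped
|       # aggregations over the whole comparison table
|       params = con.execute("""
|           SELECT gamma_name, gamma_dob,
|                  AVG(match_probability) AS m_probability,
|                  COUNT(*) AS n
|           FROM comparisons
|           GROUP BY gamma_name, gamma_dob
|       """).fetchdf()
|       # ...update match_probability from params and check for
|       # convergence (omitted)...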
|
| Until recently, it was PySpark only, but we've found DuckDB gives
| us great performance on medium-sized data. This will be enabled
| in a forthcoming release (we have an early pre-release demo of
| the DuckDB backend[2]). This new DuckDB backend will probably be
| fast enough for the majority of our users, who don't have massive
| datasets.
|
| With this in mind, I'm excited to hear that:
|
| > Another large area of future work is to make our aggregate
| hash table work with out-of-core operations, where an individual
| hash table no longer fits in memory, this is particularly
| problematic when merging.
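|
| For context, the post's in-memory approach is (as I understand
| it) a two-phase, partitioned scheme: each thread pre-aggregates
| into its own hash tables, split by a few bits of the group
| hash, and the partitions are then merged in parallel with no
| locking. A toy single-process sketch of that structure:
|
|   from collections import defaultdict
|   from concurrent.futures import ThreadPoolExecutor
|
|   N_PARTITIONS = 4  # radix-style split on the group hash
|
|   def local_aggregate(rows):
|       # phase 1: pre-aggregate one thread's rows into hash
|       # tables already split by partition of hash(key)
|       parts = [defaultdict(float) for _ in range(N_PARTITIONS)]
|       for key, value in rows:
|           parts[hash(key) % N_PARTITIONS][key] += value
|       return parts
|
|   def merge_partition(local_tables, p):
|       # phase 2: merge partition p of every thread-local table;
|       # partitions are disjoint, so no locks are needed
|       merged = defaultdict(float)
|       for parts in local_tables:
|           for key, value in parts[p].items():
|               merged[key] += value
|       return merged
|
|   chunks = [[("a", 1.0), ("b", 2.0)], [("a", 3.0), ("c", 4.0)]]
|   with ThreadPoolExecutor() as pool:
|       local_tables = list(pool.map(local_aggregate, chunks))
|       merged = list(pool.map(
|           lambda p: merge_partition(local_tables, p),
|           range(N_PARTITIONS)))
|   final = {k: v for part in merged for k, v in part.items()}
|   print(final)  # {'a': 4.0, 'b': 2.0, 'c': 4.0} (order varies)
|
| The out-of-core work quoted above would, presumably, let
| oversized partitions spill to disk rather than having to fit in
| memory during that merge step.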
|
| This would be an amazing addition. Our users typically need to
| process sensitive data, and spinning up Spark can be a challenge
| from an infrastructure perspective. I'm imagining as we go
| forwards, more and more will be possible on a single beefy
| machine that is easily spun up in the cloud.
|
| Anyway, really just wanted to say thanks to the DuckDB team for
| great work - you're enabling a lot of value downstream!
|
| [1] https://github.com/moj-analytical-services/splink [2]
| https://github.com/moj-analytical-services/splink_demos/tree...
| henrydark wrote:
| Splink over duckdb is the bomb.
|
| The duckdb wrapper I sent you in the GitHub issue a few weeks
| ago linked a pair of five-million-record datasets in about
| twenty minutes. Spark took about three hours to do the same job
| on a cluster with effectively unlimited resources.
| [deleted]
___________________________________________________________________
(page generated 2022-03-07 23:01 UTC)