[HN Gopher] Parallel Grouped Aggregation in DuckDB
       ___________________________________________________________________
        
       Parallel Grouped Aggregation in DuckDB
        
       Author : hfmuehleisen
       Score  : 69 points
       Date   : 2022-03-07 15:54 UTC (7 hours ago)
        
 (HTM) web link (duckdb.org)
 (TXT) w3m dump (duckdb.org)
        
       | keewee7 wrote:
       | Polars+DuckDB beats Pandas.
       | 
        | Tidyverse (R) is superior for data exploration, but R is not
        | fun to deploy or to build complex multi-job data pipelines
        | with.
        
         | claytonjy wrote:
         | Despite having switched to Python after many years of being a
         | tidyverse acolyte, I don't really understand this argument
         | against deployed R.
         | 
          | Are folks who say this not using containers? R has been at
          | least as easy to dockerize as Python ever since Rocker
          | started, and it has only gotten easier with more recent
          | package management options. Once dockerized, my only R
          | complaints are around logging inconsistencies.
         | 
         | I used to think the culture around R meant that productionizing
         | arbitrary code was harder on average than in Python...but years
         | of suffering with the pandas API has me thinking the opposite
         | these days.
         | 
         | I can trust a junior R dev to write reusable pure functions but
         | can't trust a senior Python dev to do the same!
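          | 
          | For what it's worth, a minimal sketch of the kind of image I
          | mean (hypothetical: rocker/r-ver as the base image, and
          | pipeline.R as a placeholder entrypoint script):
          | 
          |     # versioned Rocker base image pins R itself
          |     FROM rocker/r-ver:4.1.2
          |     RUN R -e "install.packages(c('dplyr', 'duckdb'))"
          |     COPY pipeline.R /app/pipeline.R
          |     CMD ["Rscript", "/app/pipeline.R"]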
        
           | hfmuehleisen wrote:
            | DuckDB also works fine with R data frames, so there is
            | really no downside to using R in this case.
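            | 
            | A minimal sketch with the duckdb R package (mtcars standing
            | in for real data):
            | 
            |     library(duckdb)  # also attaches DBI
            |     con <- dbConnect(duckdb::duckdb())
            |     # expose the data frame as a view; no copying involved
            |     duckdb_register(con, "cars", mtcars)
            |     dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg
            |                      FROM cars GROUP BY cyl")
            |     dbDisconnect(con, shutdown = TRUE)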
        
           | Dayshine wrote:
            | For one, in my experience, CRAN only stores binaries of the
            | most recent release of each package. This means that either
            | you have to accept that you can never rebuild a Docker
            | image, or you have to make sure that you are always able to
            | recompile all of your R packages from source.
           | 
            | This means you don't just have to pin your R package
            | versions; you also have to pin all of their build
            | dependencies.
            | 
            | And you need a different image for each distinct set of R
            | packages, because they might have different build
            | dependencies.
        
             | claytonjy wrote:
             | Reproducibility is of course a valid concern, but I never
             | had issues with that thanks to Rocker and MRAN.
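              | 
              | Something like this, with a hypothetical date (an MRAN
              | snapshot freezes CRAN as of a given day, so rebuilding
              | the image installs identical package versions):
              | 
              |     options(repos = c(
              |       CRAN = "https://mran.microsoft.com/snapshot/2022-03-01"))
              |     install.packages("dplyr")  # resolves against the snapshot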
             | 
             | I do sympathize with long from-source build times; as a
             | Linux user I don't think I had binaries available until I
             | stopped using R, so I've spent days, perhaps weeks, waiting
             | for dplyr to install over the course of my R usage.
             | 
             | > different image for different sets of R packages because
             | they might have different build dependencies
             | 
             | Is this not true of all software in all languages?
        
       | RobinL wrote:
       | This is great. In terms of real-world uses, I'm currently working
       | on enabling DuckDB as a backend in Splink[1], software for record
       | linkage at scale. Central to the software is an iterative
       | algorithm (Expectation Maximisation) that performs a large number
       | of group-by aggregations on large tables.
       | 
        | Until recently, it was PySpark only, but we've found DuckDB
        | gives us great performance on medium-sized data. This will be
        | enabled in a forthcoming release (we have an early pre-release
        | demo of the duckdb backend[2]). This new DuckDB backend will
        | probably be fast enough for the majority of our users, who
        | don't have massive datasets.
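        | 
        | (Not Splink's actual code, but the access pattern is roughly
        | the following, sketched here with the duckdb R client and
        | invented table/column names: every iteration re-runs grouped
        | aggregations over the full comparison table.)
        | 
        |     library(duckdb)  # also attaches DBI
        |     con <- dbConnect(duckdb::duckdb())
        |     # toy stand-in for a large table of record comparisons
        |     dbExecute(con, "CREATE TABLE comparisons AS
        |       SELECT range % 4 AS gamma, random() AS match_probability
        |       FROM range(1000000)")
        |     for (iter in 1:20) {
        |       m <- dbGetQuery(con, "SELECT gamma,
        |              AVG(match_probability) AS m_prob
        |              FROM comparisons GROUP BY gamma")
        |       # a real EM loop would update parameters from m here
        |     }
        |     dbDisconnect(con, shutdown = TRUE)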
       | 
        | With this in mind, excited to hear that:
        | 
        | > Another large area of future work is to make our aggregate
        | > hash table work with out-of-core operations, where an
        | > individual hash table no longer fits in memory, this is
        | > particularly problematic when merging.
       | 
       | This would be an amazing addition. Our users typically need to
       | process sensitive data, and spinning up Spark can be a challenge
        | from an infrastructure perspective. I imagine that, going
        | forward, more and more will be possible on a single beefy
        | machine that is easily spun up in the cloud.
       | 
       | Anyway, really just wanted to say thanks to the DuckDB team for
       | great work - you're enabling a lot of value downstream!
       | 
       | [1] https://github.com/moj-analytical-services/splink [2]
       | https://github.com/moj-analytical-services/splink_demos/tree...
        
         | henrydark wrote:
         | Splink over duckdb is the bomb.
         | 
          | The duckdb wrapper I sent you in the GitHub issue a few weeks
          | ago linked a pair of five-million-record datasets in about
          | twenty minutes. Spark took about three hours to do the same
          | job on a cluster with effectively unlimited resources.
        
       ___________________________________________________________________
       (page generated 2022-03-07 23:01 UTC)