[HN Gopher] Querying Parquet with Precision Using DuckDB
       ___________________________________________________________________
        
       Querying Parquet with Precision Using DuckDB
        
       Author : mytherin
       Score  : 74 points
       Date   : 2021-06-25 18:53 UTC (4 hours ago)
        
 (HTM) web link (duckdb.org)
 (TXT) w3m dump (duckdb.org)
        
       | ravanpao wrote:
       | Vertica provides great support to work with Parquet files as
       | external tables:
       | https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...
       | 
       | We use this feature to build a staging layer for incoming files
       | before loading to the data warehouse.
        
       | polskibus wrote:
       | Now, if only I could somehow take that parquet processing
       | optimizer out of DuckDB and into a Postgres fdw reading parquet
       | files... that would make my ETLs work great. Does anyone know of
       | another open source project that uses an optimizer to perform
       | queries on parquet files?
        
         | mytherin wrote:
         | DuckDB is designed to be used as an embedded library, so it
         | should be totally feasible to use it as part of a FDW.
         | 
         | Alternatively, you can run DuckDB as part of a plpython
          | function:
          | 
          |       CREATE FUNCTION pyduckdb() RETURNS integer
          |       AS $$
          |           import duckdb
          |           con = duckdb.connect()
          |           return con.execute('select 42').fetchall()[0][0]
          |       $$ LANGUAGE plpythonu;
          | 
          |       SELECT pyduckdb();
        
         | eatonphil wrote:
          | Why wouldn't you just copy your Parquet data into Postgres?
        
           | taffer wrote:
           | Three reasons I can think of:
           | 
           | - Postgres is a row store optimized for transactional
           | workloads, whereas Parquet is a column-oriented format
           | optimized for analytical workloads.
           | 
           | - Database storage is typically very expensive SSD storage
           | optimized for fast IO and high availability. Parquet files,
           | on the other hand, can be stored in inexpensive object
           | storage such as S3.
           | 
           | - Loading is an additional and possibly unnecessary step
        
             | eatonphil wrote:
             | Sure but what is the scenario where you wouldn't want to
             | use DuckDB and you'd want to use PostgreSQL instead but
             | wouldn't want to load into PostgreSQL? DuckDB is already a
             | SQL engine. The only additional benefit I'd imagine from
             | PostgreSQL would be stuff you get from it being on disk?
             | 
             | But that's why I'm asking, I don't know.
        
               | taffer wrote:
               | It depends on your situation: If you already have a
               | Postgres instance then the FDW is probably the easier
               | solution.
        
               | eatonphil wrote:
               | Oh yeah that makes sense. I assumed the commenter was
               | just looking for any system to run SQL over Parquet
               | files. If you don't want to use this system that makes it
               | easy, fair enough.
        
               | taffer wrote:
               | > If you don't want to use this system that makes it
               | easy, fair enough.
               | 
                |  _Be kind. Don't be snarky. Have curious conversation;
               | don't cross-examine. Please don't fulminate. Please don't
               | sneer, including at the rest of the community._
               | 
               | https://news.ycombinator.com/newsguidelines.html
        
         | taffer wrote:
         | There is a parquet fdw for postgres:
         | 
         | https://github.com/adjust/parquet_fdw
        
           | polskibus wrote:
           | Yes, I know, but is it as smart as duckdb?
        
             | taffer wrote:
              | I haven't tried it yet, but the parquet_fdw docs say you
             | have to run ANALYZE on the table for optimal performance.
        
         | legg0myegg0 wrote:
         | Here is a DuckDB FDW for Postgres! I have not used it, but it
         | sounds like what you need!
         | https://github.com/alitrack/duckdb_fdw
        
       | wenc wrote:
       | I just downloaded duckdb and used it to query parquet files. It's
       | amazing. Finally, an easy way to query Parquet data without
       | spinning up a local Spark instance or doing pd.read_parquet. It's
       | pretty fast too.
       | 
        | I've always felt that Parquet (or ORC) ought to be more popular
        | as mainstream formats than they are, but in practice the interactive
       | tooling for Parquet has always been weak. For years, there wasn't
       | a good GUI viewer (well, there were a few half-hearted attempts)
       | or a good SQL query tool for Parquet like there is for most
       | relational databases, and lazily examining the contents of a
       | large Parquet folder from the command line has always been a
       | chore. At least now, DuckDB's parquet support moves the needle on
       | the last point.
        
         | mkarlsch wrote:
         | We used Apache Drill a few years back and it could do exactly
         | that - run locally and provide a SQL interface to query parquet
         | files. Still great to see more and better options!
        
           | wenc wrote:
           | Drill is a fair choice (it supports tons of formats), but for
            | me it was a bit heavy -- it required Java, and it returned
            | results in a form that wasn't easily manipulated
           | (the result is a table in the console drawn with text-chars
           | [1]). If you had a really long result set, you couldn't do a
           | buffered scroll due to text UI limitations -- everything gets
           | printed out and you have to either do manual pagination using
           | LIMIT clauses or just have a really long scrollback buffer.
           | This kind of breaks the continuity of many data analysis
           | workflows.
           | 
           | With DuckDB, the return result is a Python data structure
           | (list of tuples), which is amenable to further manipulation
           | without exporting to another format.
           | 
           | Many GUI db tools often return results in a tabular form
           | which can be copy-pasted into other GUI apps (or copy-pasted
           | as CSV).
           | 
            | [1] https://drill.apache.org/docs/drill-in-10-minutes/#querying-...
        
         | scottmcdot wrote:
          | Which GUI do you use now? For me, I have to use Impala via
          | HUE, and only after creating an external table from the
          | Parquet file. I really don't like web-based GUIs, especially
          | when my employer has a crappy server handling the GUI
          | rendering.
        
           | wenc wrote:
           | I don't because I haven't found a good one for Parquet. I
           | would really like something like SSMS for SQL Server.
           | 
           | I would poke at local Parquet files either through a local
           | instance of Spark (or DataBricks when I was using that for
           | remote data folders) or use Pandas' read_parquet from
           | Jupyter.
        
         | mytherin wrote:
         | Glad that DuckDB is useful to you!
         | 
         | Some more notes that might be of use:
         | 
         | * We also have a command line interface (with syntax
         | highlighting) [1]
         | 
         | * You can use .df() instead of .fetchall() to retrieve the
         | results as a Pandas DataFrame. This is faster for larger result
         | sets.
         | 
         | * You can also use DuckDB to run SQL on Pandas DataFrames
         | directly [2]
         | 
         | [1] https://duckdb.org/docs/installation/
         | 
         | [2] https://duckdb.org/2021/05/14/sql-on-pandas.html
        
       | legg0myegg0 wrote:
        | The best part of this is just how easy it is! Just a pip install
        | and you're up and running with industry-standard Postgres SQL!
        
       | simonw wrote:
       | I tried out DuckDB before (previous comment:
       | https://news.ycombinator.com/item?id=24531732 ) and thought it
       | looked interesting, but this tutorial has really made me excited
       | about the project.
       | 
       | I've been meaning to figure out Parquet for ages, but now it
       | turns out that "pip install duckdb" gives me everything I need to
       | query Parquet files in addition to the other capabilities DuckDB
       | has!
        
       ___________________________________________________________________
       (page generated 2021-06-25 23:00 UTC)