[HN Gopher] Querying Parquet with Precision Using DuckDB
___________________________________________________________________
Querying Parquet with Precision Using DuckDB
Author : mytherin
Score : 74 points
Date : 2021-06-25 18:53 UTC (4 hours ago)
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
| ravanpao wrote:
| Vertica provides great support for working with Parquet files as
| external tables:
| https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...
|
| We use this feature to build a staging layer for incoming files
| before loading to the data warehouse.
| polskibus wrote:
| Now, if only I could somehow take that parquet processing
| optimizer out of DuckDB and into a Postgres fdw reading parquet
| files... that would make my ETLs work great. Does anyone know of
| another open source project that uses an optimizer to perform
| queries on parquet files?
| mytherin wrote:
| DuckDB is designed to be used as an embedded library, so it
| should be totally feasible to use it as part of an FDW.
|
| Alternatively, you can run DuckDB as part of a plpython
| function:
|
|     CREATE FUNCTION pyduckdb()
|     RETURNS integer AS $$
|         import duckdb
|         con = duckdb.connect()
|         return con.execute('select 42').fetchall()[0][0]
|     $$ LANGUAGE plpython3u;  -- DuckDB's Python package needs Python 3
|
|     SELECT pyduckdb();
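|
| For the Parquet use case specifically, the embedded query can
| point straight at a file. A standalone sketch of just the
| Python side, with a made-up file path:
|
|     import duckdb
|
|     # in-memory connection; the Parquet file is scanned in
|     # place rather than loaded or copied
|     con = duckdb.connect()
|     count = con.execute(
|         "SELECT count(*) FROM parquet_scan('/data/events.parquet')"
|     ).fetchall()[0][0]
|     print(count)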
| eatonphil wrote:
| Why wouldn't you just copy your parquet data into Postgres?
| taffer wrote:
| Three reasons I can think of:
|
| - Postgres is a row store optimized for transactional
| workloads, whereas Parquet is a column-oriented format
| optimized for analytical workloads.
|
| - Database storage is typically very expensive SSD storage
| optimized for fast IO and high availability. Parquet files,
| on the other hand, can be stored in inexpensive object
| storage such as S3.
|
| - Loading is an additional and possibly unnecessary step
| eatonphil wrote:
| Sure, but what is the scenario where you'd want to use
| PostgreSQL instead of DuckDB, yet wouldn't want to load the
| data into PostgreSQL? DuckDB is already a SQL engine. The only
| additional benefit I'd imagine from PostgreSQL would be what
| you get from it being on disk?
|
| But that's why I'm asking, I don't know.
| taffer wrote:
| It depends on your situation: if you already have a Postgres
| instance, then the FDW is probably the easier solution.
| eatonphil wrote:
| Oh yeah that makes sense. I assumed the commenter was
| just looking for any system to run SQL over Parquet
| files. If you don't want to use this system that makes it
| easy, fair enough.
| taffer wrote:
| > If you don't want to use this system that makes it
| easy, fair enough.
|
| _Be kind. Don't be snarky. Have curious conversation;
| don't cross-examine. Please don't fulminate. Please don't
| sneer, including at the rest of the community._
|
| https://news.ycombinator.com/newsguidelines.html
| taffer wrote:
| There is a parquet fdw for postgres:
|
| https://github.com/adjust/parquet_fdw
| polskibus wrote:
| Yes, I know, but is it as smart as duckdb?
| taffer wrote:
| I haven't tried it yet, but the parquet_fdw docs say you
| have to run ANALYZE on the table for optimal performance.
| legg0myegg0 wrote:
| Here is a DuckDB FDW for Postgres! I have not used it, but it
| sounds like what you need!
| https://github.com/alitrack/duckdb_fdw
| wenc wrote:
| I just downloaded duckdb and used it to query parquet files. It's
| amazing. Finally, an easy way to query Parquet data without
| spinning up a local Spark instance or doing pd.read_parquet. It's
| pretty fast too.
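|
| For reference, the whole workflow is a few lines (the file
| name here is illustrative):
|
|     import duckdb
|
|     # no Spark session and no full pd.read_parquet load --
|     # the file is queried in place
|     con = duckdb.connect()
|     rows = con.execute(
|         "SELECT * FROM parquet_scan('trips.parquet') LIMIT 10"
|     ).fetchall()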
|
| I've always felt that Parquet (or ORC) ought to be a more
| popular mainstream format than it is, but in practice the interactive
| tooling for Parquet has always been weak. For years, there wasn't
| a good GUI viewer (well, there were a few half-hearted attempts)
| or a good SQL query tool for Parquet like there is for most
| relational databases, and lazily examining the contents of a
| large Parquet folder from the command line has always been a
| chore. At least now, DuckDB's parquet support moves the needle on
| the last point.
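|
| For example, peeking at a folder of Parquet files from Python
| is now a couple of lines (directory name made up; glob support
| in parquet_scan may depend on your DuckDB version):
|
|     import duckdb
|
|     con = duckdb.connect()
|     # preview a handful of rows across all files in the folder
|     preview = con.execute(
|         "SELECT * FROM parquet_scan('logs/*.parquet') LIMIT 5"
|     ).fetchall()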
| mkarlsch wrote:
| We used Apache Drill a few years back and it could do exactly
| that - run locally and provide a SQL interface to query parquet
| files. Still great to see more and better options!
| wenc wrote:
| Drill is a fair choice (it supports tons of formats), but for
| me it was a bit heavy -- it required Java, and it returned
| results in a form that wasn't easily manipulated
| (the result is a table in the console drawn with text-chars
| [1]). If you had a really long result set, you couldn't do a
| buffered scroll due to text UI limitations -- everything gets
| printed out and you have to either do manual pagination using
| LIMIT clauses or just have a really long scrollback buffer.
| This kind of breaks the continuity of many data analysis
| workflows.
|
| With DuckDB, the return result is a Python data structure
| (list of tuples), which is amenable to further manipulation
| without exporting to another format.
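|
| A sketch of what that enables (the column name here is
| hypothetical):
|
|     import duckdb
|
|     con = duckdb.connect()
|     rows = con.execute(
|         "SELECT status, count(*) FROM parquet_scan('events.parquet') "
|         "GROUP BY status"
|     ).fetchall()
|
|     # plain list of tuples -- ordinary Python from here on
|     counts = {status: n for status, n in rows}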
|
| Many GUI db tools return results in a tabular form
| which can be copy-pasted into other GUI apps (or copy-pasted
| as CSV).
|
| [1] https://drill.apache.org/docs/drill-in-10-minutes/#querying-...
| scottmcdot wrote:
| Which GUI do you use now? In my case I have to use Impala via
| HUE, and only after I've created an external table from the
| parquet file. I really don't like web-based GUIs, especially
| when my employer has a crappy server running the GUI rendering.
| wenc wrote:
| I don't because I haven't found a good one for Parquet. I
| would really like something like SSMS for SQL Server.
|
| I would poke at local Parquet files either through a local
| instance of Spark (or Databricks, when I was using that for
| remote data folders) or with Pandas' read_parquet from
| Jupyter.
| mytherin wrote:
| Glad that DuckDB is useful to you!
|
| Some more notes that might be of use:
|
| * We also have a command line interface (with syntax
| highlighting) [1]
|
| * You can use .df() instead of .fetchall() to retrieve the
| results as a Pandas DataFrame. This is faster for larger result
| sets.
|
| * You can also use DuckDB to run SQL on Pandas DataFrames
| directly [2]
|
| [1] https://duckdb.org/docs/installation/
|
| [2] https://duckdb.org/2021/05/14/sql-on-pandas.html
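|
| A rough sketch combining the second and third notes (file
| name is made up):
|
|     import duckdb
|
|     con = duckdb.connect()
|
|     # .df() returns the query result as a Pandas DataFrame
|     trips = con.execute(
|         "SELECT * FROM parquet_scan('trips.parquet')"
|     ).df()
|
|     # and SQL can run directly over that DataFrame, as in [2]
|     print(duckdb.query("SELECT count(*) FROM trips").to_df())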
| legg0myegg0 wrote:
| The best part of this is just how easy it is! Just a pip
| install and you're up and running, using industry-standard
| Postgres SQL!
| simonw wrote:
| I tried out DuckDB before (previous comment:
| https://news.ycombinator.com/item?id=24531732 ) and thought it
| looked interesting, but this tutorial has really made me excited
| about the project.
|
| I've been meaning to figure out Parquet for ages, but now it
| turns out that "pip install duckdb" gives me everything I need to
| query Parquet files in addition to the other capabilities DuckDB
| has!
___________________________________________________________________
(page generated 2021-06-25 23:00 UTC)