[HN Gopher] Pg_parquet: An extension to connect Postgres and par...
___________________________________________________________________
Pg_parquet: An extension to connect Postgres and parquet
Author : craigkerstiens
Score : 48 points
Date : 2024-10-17 16:18 UTC (6 hours ago)
(HTM) web link (www.crunchydata.com)
(TXT) w3m dump (www.crunchydata.com)
| aamederen wrote:
| Congratulations! I'm happy to see the PostgreSQL license.
| linuxhansl wrote:
| Parquet itself is actually not that interesting. It should be
| able to read (and even write) Iceberg tables.
|
| Also, how does it compare to pg_duckdb (which adds DuckDB
| execution to Postgres, including reading Parquet and Iceberg), or
| duck_fdw (which wraps a DuckDB database that can be in-memory and
| simply pass through Iceberg/Parquet tables)?
| AdamProut wrote:
| Had a similar thought. Azure Postgres has something similar to
| pg_parquet (pg_azure_storage), but we're looking into replacing
| it with pg_duckdb, assuming the extension continues to mature.
|
| It would be great if the Postgres community could get behind
| one good open-source extension for the various columnstore data
| use cases (querying data stored in an open columnstore format -
| Delta, Iceberg, etc. being one of them). pg_duckdb seems to
| have the best chance of becoming the go-to extension for this.
| mslot wrote:
| Fun fact, I created pg_azure_storage :)
| mslot wrote:
| (Marco from Crunchy Data)
|
| With PostgreSQL extensions, we find it's most effective to have
| single-purpose modular extensions.
|
| For instance, I created pg_cron a few years ago, and it's on
| basically every PostgreSQL service because it does one thing
| and does it well.
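|
| As a minimal sketch of that single purpose (the job name,
| schedule, and command here are just illustrative):
|
|     -- run VACUUM ANALYZE every night at 03:00
|     SELECT cron.schedule('nightly-vacuum', '0 3 * * *',
|                          'VACUUM ANALYZE');
|
|     -- inspect scheduled jobs, or remove one by name
|     SELECT * FROM cron.job;
|     SELECT cron.unschedule('nightly-vacuum');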
|
| We wanted to create a lightweight implementation of Parquet
| that does not pull a multi-threaded library into every Postgres
| process.
|
| When you get to more complex features, a lot of questions
| around trade-offs, user experience, and deployment model start
| appearing. For instance, when querying an Iceberg table,
| caching becomes quite important, but that raises lots of other
| questions around cache management. Also, how do you deal with
| that memory-hungry, multi-threaded query engine running in
| every process without things constantly falling over?
|
| It's easier to answer those questions in the context of a
| managed service where you control the environment, so we have a
| product that can query Iceberg/Parquet/CSV/etc. in S3, does
| automatic caching, figures out the region of your bucket, can
| create tables directly from files, and uses DuckDB to
| accelerate queries in a reliable manner. This is partially
| powered by a set of custom extensions, partially by other
| things running on the managed service.
| https://docs.crunchybridge.com/analytics
|
| However, some components can be neatly extracted and shared
| broadly like COPY TO/FROM Parquet. We find it very useful for
| archiving old partitions, importing public and private data
| sets, preparing data for analytics, and moving data between
| PostgreSQL servers.
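|
| A quick sketch of what that looks like (the table and bucket
| names here are just examples):
|
|     -- export a table to Parquet on S3
|     COPY orders TO 's3://my-bucket/orders.parquet'
|       WITH (format 'parquet');
|
|     -- import a Parquet file back into a table
|     COPY orders_archive FROM 's3://my-bucket/orders.parquet'
|       WITH (format 'parquet');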
| oulipo wrote:
| Cool, would this be better than using a ClickHouse / DuckDB
| extension that reads Postgres and saves to Parquet?
|
| What would be the recommended way to regularly export old data
| to S3 as Parquet files? A cron job that launches a second
| Postgres process, connects to the database, and extracts the
| data, or using the regular database instance? Doesn't that slow
| down the instance too much?
| craigkerstiens wrote:
| This alone wouldn't be a full replacement. We do have a full
| product that does that, with customers seeing great performance
| in production. Crunchy Bridge for Analytics does something
| similar by embedding DuckDB inside Postgres, though for users
| that is largely an implementation detail. We support Iceberg as
| well and have a lot more coming, basically to allow for
| seamless analytics on Postgres: building on what Postgres is
| good at, Iceberg for storage, and DuckDB for vectorized
| execution.
|
| That isn't fully open source at this time, but it has been
| production grade for some time. This was one piece that makes
| getting there easier for folks and felt like a good standalone
| bit to open source and share with the broader community. We can
| also see where this by itself makes sense for certain use
| cases. As you sort of point out, if you had time-series
| partitioned data and leveraged pg_partman for new partitions
| and pg_cron (which this same set of people authored), you could
| automatically archive old partitions to Parquet but still have
| them available for analysis if needed.
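|
| A rough sketch of that pattern (the partition, bucket, and
| archive_oldest_partition procedure are hypothetical names):
|
|     -- one-shot: archive a specific old partition to S3 as
|     -- Parquet, then detach it from the parent table
|     COPY metrics_2024_09
|       TO 's3://archive-bucket/metrics_2024_09.parquet'
|       WITH (format 'parquet');
|     ALTER TABLE metrics DETACH PARTITION metrics_2024_09;
|
|     -- a recurring job would wrap logic like the above (with
|     -- the partition name computed dynamically) in pg_cron,
|     -- e.g. at 02:00 on the 1st of each month:
|     SELECT cron.schedule('archive-old-partitions', '0 2 1 * *',
|                          'CALL archive_oldest_partition()');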
| whalesalad wrote:
| I wish RDS made it easy to add custom extensions like this.
| wdb wrote:
| or Google Cloud
___________________________________________________________________
(page generated 2024-10-17 23:00 UTC)