[HN Gopher] Pg_parquet: An extension to connect Postgres and par...
       ___________________________________________________________________
        
       Pg_parquet: An extension to connect Postgres and parquet
        
       Author : craigkerstiens
       Score  : 48 points
       Date   : 2024-10-17 16:18 UTC (6 hours ago)
        
 (HTM) web link (www.crunchydata.com)
 (TXT) w3m dump (www.crunchydata.com)
        
       | aamederen wrote:
       | Congratulations! I'm happy to see the PostgreSQL license.
        
       | linuxhansl wrote:
        | Parquet by itself is actually not that interesting. The
        | extension should be able to read (and even write) Iceberg
        | tables.
       | 
        | Also, how does it compare to pg_duckdb (which adds DuckDB
        | execution to Postgres, including reading Parquet and Iceberg),
        | or duck_fdw (which wraps a DuckDB database that can be
        | in-memory and simply pass through Iceberg/Parquet tables)?
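         
        A rough sketch of the pg_duckdb route mentioned above (the
        read_parquet table function and the AS column list follow
        pg_duckdb's early README, so verify against current docs; the
        bucket, file, and column names here are hypothetical):
         
            -- Query a Parquet file in place; Postgres needs an
            -- explicit column list to type the result rows.
            CREATE EXTENSION pg_duckdb;
            SELECT count(*)
            FROM read_parquet('s3://example-bucket/events.parquet')
                 AS (event_id bigint, created_at timestamptz);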
        
         | AdamProut wrote:
         | Had a similar thought. Azure Postgres has something similar to
         | pg_parquet (pg_azure_storage), but we're looking into replacing
         | it with pg_duckdb assuming the extension continues to mature.
         | 
          | It would be great if the Postgres community could get behind
          | one good open-source extension for the various columnstore
          | use cases (querying data stored in an open columnstore
          | format - Delta, Iceberg, etc. - being one of them).
          | pg_duckdb seems to have the best chance of becoming the
          | go-to extension for this.
        
           | mslot wrote:
           | Fun fact, I created pg_azure_storage :)
        
         | mslot wrote:
         | (Marco from Crunchy Data)
         | 
          | With PostgreSQL extensions, we find it's most effective to
          | build single-purpose, modular ones.
         | 
         | For instance, I created pg_cron a few years ago, and it's on
         | basically every PostgreSQL service because it does one thing
         | and does it well.
         | 
          | We wanted to create a lightweight implementation of Parquet
          | support that does not pull a multi-threaded library into
          | every Postgres process.
         | 
         | When you get to more complex features, a lot of questions
         | around trade-offs, user experience, and deployment model start
         | appearing. For instance, when querying an Iceberg table,
         | caching becomes quite important, but that raises lots of other
         | questions around cache management. Also, how do you deal with
         | that memory hungry, multi-threaded query engine running in
         | every process without things constantly falling over?
         | 
         | It's easier to answer those questions in the context of a
         | managed service where you control the environment, so we have a
         | product that can query Iceberg/Parquet/CSV/etc. in S3, does
         | automatic caching, figures out the region of your bucket, can
         | create tables directly from files, and uses DuckDB to
         | accelerate queries in a reliable manner. This is partially
         | powered by a set of custom extensions, partially by other
         | things running on the managed service.
         | https://docs.crunchybridge.com/analytics
         | 
          | However, some components can be neatly extracted and shared
          | broadly, like COPY TO/FROM Parquet. We find it very useful
          | for archiving old partitions, importing public and private
          | data sets, preparing data for analytics, and moving data
          | between PostgreSQL servers.
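         
          For concreteness, a minimal sketch of that COPY TO/FROM
          Parquet workflow (syntax follows pg_parquet's announcement
          post; the table and bucket names are hypothetical):
         
              CREATE EXTENSION pg_parquet;
         
              -- Export a table to a Parquet file, locally or in S3.
              COPY measurements
                  TO 's3://example-bucket/measurements.parquet'
                  WITH (format 'parquet');
         
              -- Import it back into a table with a matching schema.
              COPY measurements
                  FROM 's3://example-bucket/measurements.parquet'
                  WITH (format 'parquet');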
        
       | oulipo wrote:
        | Cool, would this be better than using a ClickHouse / DuckDB
        | extension that reads Postgres and saves to Parquet?
        | 
        | What would be recommended for regularly exporting old data to
        | S3 as Parquet files? Using a cron job that launches a second
        | Postgres process to connect to the database and extract the
        | data, or using the regular database instance? Doesn't that
        | slow down the instance too much?
        
         | craigkerstiens wrote:
          | This alone wouldn't be a full replacement. We do have a full
          | product that does that, with customers seeing great
          | performance in production. Crunchy Bridge for Analytics does
          | something similar by embedding DuckDB inside Postgres,
          | though for users that's largely an implementation detail. We
          | support Iceberg as well and have a lot more coming,
          | essentially to allow seamless analytics on Postgres:
          | building on what Postgres is good at, with Iceberg for
          | storage and DuckDB for vectorized execution.
          | 
          | That isn't fully open source at this time, but it has been
          | production grade for a while. This was one piece that makes
          | getting there easier for folks, and it felt like a good
          | standalone bit to open source and share with the broader
          | community. We can also see this making sense by itself for
          | certain use cases. As you sort of point out, if you had
          | time-series partitioned data and used pg_partman for new
          | partitions along with pg_cron (which this same set of people
          | authored), you could automatically archive old partitions to
          | Parquet and still have them available for analysis if
          | needed.
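         
          A hedged sketch of that archival pattern (cron.schedule is
          pg_cron's documented API; the schedule, partition, and bucket
          names are hypothetical, and a real job would compute which
          partition to archive rather than hard-coding it):
         
              CREATE EXTENSION pg_cron;
              CREATE EXTENSION pg_parquet;
         
              -- At 03:00 on the 1st of each month, export last month's
              -- partition to S3 as Parquet. A production job would also
              -- detach/drop the partition after a verified export.
              SELECT cron.schedule(
                  'archive-old-partition',
                  '0 3 1 * *',
                  $$COPY measurements_2024_09
                        TO 's3://example-bucket/measurements_2024_09.parquet'
                        WITH (format 'parquet')$$
              );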
        
       | whalesalad wrote:
       | I wish RDS made it easy to add custom extensions like this.
        
         | wdb wrote:
         | or Google Cloud
        
       | boskatch wrote:
       | Parquet... more like par-gay tbh
        
       ___________________________________________________________________
       (page generated 2024-10-17 23:00 UTC)