[HN Gopher] Pg_lakehouse: Query Any Data Lake from Postgres
       ___________________________________________________________________
        
       Pg_lakehouse: Query Any Data Lake from Postgres
        
       Author : landingunless
       Score  : 102 points
       Date   : 2024-05-13 13:29 UTC (9 hours ago)
        
 (HTM) web link (github.com)
        
       | tehlike wrote:
       | ParadeDB is doing a lot of good work with Postgres. Pg_analytics,
       | and now pg_lakehouse...
        
       | yrashk wrote:
       | As somebody who writes a lot of Postgres extensions, I can say
       | this is quite interesting!
       | 
       | I think I can see some parallels to Supabase's wrappers project.
       | 
       | Keep up the good work!
        
       | kiwicopple wrote:
       | Neat that you plan to support both Delta Lake and Apache Iceberg
       | 
       | I'm curious about HN's position between these two formats? I'm
       | having a hard time deciphering which might be the industry winner
       | (or perhaps they both have a place, no "winner" necessary)
        
         | retakeming wrote:
         | This is anecdotal, but I feel that we (ParadeDB) have received
         | more requests for Iceberg integration vs. Delta Lake. We were
         | actually hesitant to launch pg_lakehouse without Iceberg
         | support, but pulled the trigger on it because the iceberg-rust
         | crate is still in its early days. We will probably be
         | contributing to iceberg-rust to make it work with pg_lakehouse.
        
           | kiwicopple wrote:
           | > _We will probably be contributing to iceberg-rust to make
           | it work with pg_lakehouse_
           | 
           | That's great news, thanks for your contributions to open
           | source (here, and all the other extensions)
        
           | lukekim wrote:
           | Also anecdotal, but we (Spice AI) see more requests for
           | Iceberg, but in practice more deployments of Delta Lake.
        
             | FridgeSeal wrote:
             | My theory is that everyone would _prefer_ to use Iceberg,
             | but it isn't as widely supported _yet_, so they're stuck
             | with Delta in the interim.
        
         | slap_shot wrote:
         | There isn't a winner and there likely won't be one (at least
         | not for a long time). Tabular will likely be acquired by
         | Snowflake and the two industry behemoths now back their own
         | formats, and each will treat their own as a first class
         | citizen.
        
           | philippemnoel wrote:
           | Agreed, this is why we want to support both. Maybe even
           | Apache Hudi down the line. But I hope the industry converges
           | to a main standard rather than Snowflake/Databricks fighting
           | for their own formats. They can differentiate on much more
           | meaningful features
        
         | kcirerick wrote:
         | I'm also building in the lakehouse space and anecdotally have
         | seen more excitement around Iceberg over Delta Lake just
         | because of its completely open source origins. Iceberg has
         | evolved faster and has had more contributions from a more
         | diverse set of contributors than Delta Lake. Not sure if this
         | will change with a Snowflake <> Tabular acquisition but I'd
         | easily bet on Iceberg if current trends continue.
        
           | philippemnoel wrote:
           | We agree. We plan to bring Iceberg support as a first-class
           | citizen as soon as we can, but unfortunately the support in
           | Rust these days is still limited. We and the community are
           | working on it
        
       | brunoqc wrote:
       | Nice. I wish TimescaleDB open-sourced their S3 storage thing.
        
         | philippemnoel wrote:
         | They've been moving more and more towards closed source over
         | the years, which is a shame but I understand why. We don't
         | offer time-series features today, but we're not ruling out
         | adding support for it eventually if it is desired by our users.
        
       | sdairs wrote:
       | Very cool!
       | 
       | Could you share the key difference between this and the previous
       | pg_analytics, and motivation of making it a separate plugin?
        
         | retakeming wrote:
         | Whereas pg_analytics stores the data in Postgres block storage,
         | pg_lakehouse does not use Postgres storage at all.
         | 
         | This makes it a much simpler (and in our opinion, more elegant)
         | extension. We learned that many of our users already stored
         | their Parquet files in S3, so it made sense to connect directly
         | to S3 rather than asking them to ingest those Parquet files
         | into Postgres.
         | 
         | It also accelerates the path to production readiness, since
         | we're not touching Postgres internals (no need to mess with
         | Postgres MVCC, write ahead logs, transactions, etc.)
        
       | mcdonje wrote:
       | Looks like pg as a replacement for Databricks SQL, which is
       | already a query engine for data lakes. It's not a lakehouse, but
       | it calls itself one. Seems like a cool and useful project, but
       | the name is problematic.
        
         | retakeming wrote:
         | pg_house just wasn't as catchy!
         | 
         | In all seriousness though, I see your point. While it's true
         | that we don't provide the storage or table format, our belief
         | is that companies actually want to own the data in their S3. We
         | called it pg_lakehouse because it's the missing glue for
         | companies already using Postgres + S3 + Delta Lake/Iceberg to
         | have a lakehouse without new infrastructure.
        
       | samber wrote:
       | It seems very promising!
       | 
       | 2 questions:
       | 
       | - do you distribute query processing over multiple pg nodes?
       | 
       | - do you store the metadata in PG, instead of a traditional
       | metastore?
        
         | retakeming wrote:
         | Thanks!
         | 
         | 1. It's single node, but DataFusion parallelizes query
         | execution across multiple cores. We do have plans for a
         | distributed architecture, but we've found that you can get
         | ~very~ far just by scaling up a single Postgres node.
         | 
         | 2. The only information stored in Postgres is the options
         | passed into the foreign data wrapper and the schema of the
         | foreign table (this is standard for all Postgres foreign data
         | wrappers).
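
To make point 2 concrete, here is a minimal sketch (not from the thread) of the two pieces of metadata a Postgres foreign data wrapper keeps in the catalog: the options and the foreign table's column schema. It only builds standard `CREATE SERVER` and `CREATE FOREIGN TABLE` statements as strings. The helper functions, the wrapper and server names, and the option names (`region`, `path`) are all hypothetical and are not taken from pg_lakehouse's documentation.

```python
def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def options_clause(options: dict) -> str:
    """Render a dict as a Postgres OPTIONS (...) clause, or '' if empty."""
    if not options:
        return ""
    items = ", ".join(f"{name} {quote_literal(val)}" for name, val in options.items())
    return f" OPTIONS ({items})"


def create_server_sql(server: str, wrapper: str, options: dict) -> str:
    """Build a CREATE SERVER statement; the options end up in the Postgres catalog."""
    return f"CREATE SERVER {server} FOREIGN DATA WRAPPER {wrapper}{options_clause(options)};"


def create_foreign_table_sql(table: str, columns: dict, server: str, options: dict) -> str:
    """Build a CREATE FOREIGN TABLE statement: only the schema and options are stored."""
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns.items())
    return f"CREATE FOREIGN TABLE {table} ({cols}) SERVER {server}{options_clause(options)};"


if __name__ == "__main__":
    # Hypothetical names and option keys, chosen only for illustration.
    print(create_server_sql("example_server", "example_wrapper", {"region": "us-east-1"}))
    print(
        create_foreign_table_sql(
            "trips",
            {"vendor_id": "int", "fare": "double precision"},
            "example_server",
            {"path": "s3://example-bucket/trips.parquet"},
        )
    )
```

Nothing in these statements copies rows into Postgres. The table data stays wherever the `path` option points, which matches the "no Postgres storage" design described earlier in the thread.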
        
       | nathanwallace wrote:
       | Readers may also enjoy Steampipe [1], an open source tool to live
       | query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes,
       | etc). It uses Postgres Foreign Data Wrappers under the hood and
       | supports joins etc with other tables. (Disclaimer - I'm a lead on
       | the project.)
       | 
       | 1 - https://github.com/turbot/steampipe
        
       | nikita wrote:
       | This is great work! Could you please comment on the choice of
       | your license. Most Postgres extensions that achieve wide adoption
       | use the Postgres, MIT, or Apache license.
        
         | philippemnoel wrote:
         | All ParadeDB extensions are released under AGPL-3.0. We've
         | found that it strikes the right balance between being open-
         | source and enabling the community to adopt for free, while also
         | protecting us from hyperscalers and enabling us to build a
         | sustainable business. Perhaps the topic of a blog post someday
         | :)
        
           | nikita wrote:
           | It looks like hyperscalers can still host it as long as they
           | are publishing changes to the source code? Am I reading the
           | license right?
        
             | tomhallett wrote:
             | For AWS to make it available on RDS Aurora, would it be
             | safe to assume there would have to be some changes to the
             | extension source to make it compatible with the Aurora
             | engine? If we assume AWS doesn't want to do that, then their
             | licensing provides some protection there.
        
               | philippemnoel wrote:
               | Yeah exactly. In practice, we've taken inspiration from
               | the likes of Citus and others who have adopted the
               | AGPL-3.0 license as a good compromise and have found
               | success. It's rather common nowadays for infra startups
               | to use AGPL-3.0. Other noteworthy examples: MinIO,
               | Quickwit, Lago, etc.
        
           | francoismassot wrote:
           | Well, MongoDB was under AGPL v3.0 :)
        
       | arduanika wrote:
       | The name seems to be an allusion to the author P.G. Wodehouse,
       | creator of the character Jeeves.
       | 
       | https://en.wikipedia.org/wiki/P._G._Wodehouse
       | 
       | Very clever naming!
        
         | pas wrote:
         | Sorry, what do you base that on? To me it just seems like a
         | straightforward inspiration from the "data lake" -> "lakehouse"
         | terminology that Databricks started (?) using.
         | 
         | https://www.databricks.com/product/data-lakehouse
         | 
         | edit: ah, but in a different comment someone noted that it's
         | not actually a lakehouse, so who knows!? :)
        
           | arduanika wrote:
           | Based on pure speculation. I may be reaching.
           | 
           | My best guess is that Databricks and Pg_lakehouse both
           | independently coined "lakehouse" from "data lake", and that
           | for the latter team, it was partly a pun on Wodehouse. But
           | the creators are welcome to chime in and confirm/deny!
           | 
           | (Or to say, like, "Sure...uh, we totally meant that...yes we
           | are very literary.")
        
             | philippemnoel wrote:
             | I wish we were that clever, but it's really just the
             | combination of "data lake" and "data warehouse", which
             | isn't even coined by us :)
        
       | jeadie wrote:
       | This looks functionally similar to using
       | http://github.com/spiceai/spiceai with a PostgreSQL data
       | accelerator.
        
       ___________________________________________________________________
       (page generated 2024-05-13 23:00 UTC)