[HN Gopher] Pg_lakehouse: Query Any Data Lake from Postgres
___________________________________________________________________
Pg_lakehouse: Query Any Data Lake from Postgres
Author : landingunless
Score : 102 points
Date : 2024-05-13 13:29 UTC (9 hours ago)
| tehlike wrote:
| ParadeDB is doing a lot of good work with Postgres. Pg_analytics,
| and now pg_lakehouse...
| yrashk wrote:
| As somebody who writes a lot of Postgres extensions, I can say
| this is quite interesting!
|
| I think I can see some parallels to Supabase's wrappers project.
|
| Keep up the good work!
| kiwicopple wrote:
| Neat that you plan to support both Delta Lake and Apache Iceberg
|
| I'm curious about HN's position on these two formats. I'm
| having a hard time deciphering which might be the industry winner
| (or perhaps they both have a place, no "winner" necessary)
| retakeming wrote:
| This is anecdotal, but I feel that we (ParadeDB) have received
| more requests for Iceberg integration vs. Delta Lake. We were
| actually hesitant to launch pg_lakehouse without Iceberg
| support, but pulled the trigger on it because the iceberg-rust
| crate is still in its early days. We will probably be
| contributing to iceberg-rust to make it work with pg_lakehouse.
| kiwicopple wrote:
| > _We will probably be contributing to iceberg-rust to make
| it work with pg_lakehouse_
|
| That's great news, thanks for your contributions to open
| source (here, and all the other extensions)
| lukekim wrote:
| Also anecdotal, but we (Spice AI) see more requests for
| Iceberg, but in practice more deployments of Delta Lake.
| FridgeSeal wrote:
| My theory is that everyone would _prefer_ to use Iceberg,
| but it isn't as widely supported _yet_, so they're stuck with
| Delta in the interim.
| slap_shot wrote:
| There isn't a winner and there likely won't be one (at least
| not for a long time). Tabular will likely be acquired by
| Snowflake and the two industry behemoths now back their own
| formats, and each will treat their own as a first class
| citizen.
| philippemnoel wrote:
| Agreed, this is why we want to support both. Maybe even
| Apache Hudi down the line. But I hope the industry converges
| to a main standard rather than Snowflake/Databricks fighting
| for their own formats. They can differentiate on much more
| meaningful features
| kcirerick wrote:
| I'm also building in the lakehouse space and anecdotally have
| seen more excitement around Iceberg than Delta Lake just
| because of its completely open source origins. Iceberg has
| evolved faster and has had more contributions from a more
| diverse set of contributors than Delta Lake. Not sure if this
| will change with a Snowflake <> Tabular acquisition but I'd
| easily bet on Iceberg if current trends continue.
| philippemnoel wrote:
| We agree. We plan to bring Iceberg support as a first-class
| citizen as soon as we can, but unfortunately the support in
| Rust these days is still limited. We and the community are
| working on it
| brunoqc wrote:
| Nice. I wish TimescaleDB open-sourced their S3 storage thing.
| philippemnoel wrote:
| They've been moving more and more towards closed source over
| the years, which is a shame but I understand why. We don't
| offer time-series features today, but we're not ruling out
| adding support for it eventually if it is desired by our users.
| sdairs wrote:
| Very cool!
|
| Could you share the key difference between this and the previous
| pg_analytics, and motivation of making it a separate plugin?
| retakeming wrote:
| Whereas pg_analytics stores the data in Postgres block storage,
| pg_lakehouse does not use Postgres storage at all.
|
| This makes it a much simpler (and in our opinion, more elegant)
| extension. We learned that many of our users already stored
| their Parquet files in S3, so it made sense to connect directly
| to S3 rather than asking them to ingest those Parquet files
| into Postgres.
|
| It also accelerates the path to production readiness, since
| we're not touching Postgres internals (no need to mess with
| Postgres MVCC, write ahead logs, transactions, etc.)
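
To make the "connect directly to S3" idea concrete: elsewhere in this thread pg_lakehouse is described as a Postgres foreign data wrapper, so pointing Postgres at existing Parquet files amounts to declaring a foreign table rather than loading rows. The following is a minimal sketch that only renders standard `CREATE FOREIGN TABLE` DDL as a string. The table name, server name, bucket path, and the `path`/`extension` option keys are assumptions for illustration, not confirmed pg_lakehouse settings.

```python
from typing import Mapping


def quote_literal(value: str) -> str:
    """Quote a value as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def render_options(options: Mapping[str, str]) -> str:
    """Render a generic Postgres FDW clause: OPTIONS (key 'value', ...)."""
    pairs = ", ".join(f"{key} {quote_literal(val)}" for key, val in options.items())
    return f"OPTIONS ({pairs})"


def foreign_table_ddl(
    table: str,
    columns: Mapping[str, str],
    server: str,
    options: Mapping[str, str],
) -> str:
    """Build a CREATE FOREIGN TABLE statement.

    Identifiers are inserted as-is, so this sketch expects trusted,
    already-valid identifiers; only option values are quoted.
    """
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE FOREIGN TABLE {table} ({cols}) "
        f"SERVER {server} {render_options(options)};"
    )


# Hypothetical names throughout: "trips", "s3_server", and the bucket path
# are placeholders, and the option keys are assumed rather than documented.
ddl = foreign_table_ddl(
    table="trips",
    columns={"vendor_id": "INT", "fare": "DOUBLE PRECISION"},
    server="s3_server",
    options={"path": "s3://example-bucket/trips.parquet", "extension": "parquet"},
)
print(ddl)
```

Nothing is copied into Postgres by a statement like this; the data stays in S3 and is read at query time.
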
| mcdonje wrote:
| Looks like pg as a replacement for Databricks SQL, which is
| already a query engine for data lakes. It's not a lakehouse, but
| it calls itself one. Seems like a cool and useful project, but
| the name is problematic.
| retakeming wrote:
| pg_house just wasn't as catchy!
|
| In all seriousness though, I see your point. While it's true
| that we don't provide the storage or table format, our belief
| is that companies actually want to own the data in their S3. We
| called it pg_lakehouse because it's the missing glue for
| companies already using Postgres + S3 + Delta Lake/Iceberg to
| have a lakehouse without new infrastructure.
| samber wrote:
| It seems very promising!
|
| 2 questions:
|
| - do you distribute query processing over multiple pg nodes?
|
| - do you store the metadata in PG, instead of a traditional
| metastore?
| retakeming wrote:
| Thanks!
|
| 1. It's single node, but DataFusion parallelizes query
| execution across multiple cores. We do have plans for a
| distributed architecture, but we've found that you can get
| ~very~ far just by scaling up a single Postgres node.
|
| 2. The only things stored in Postgres are the options
| passed into the foreign data wrapper and the schema of the
| foreign table (this is standard for all Postgres foreign data
| wrappers).
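
Point 1 above describes single-node parallelism: the work of one query is split across cores and the partial results are combined. A minimal sketch of that general pattern (partition, aggregate each partition, merge) follows. It is not DataFusion's implementation or API; the function names are hypothetical, and Python threads are used only to keep the sketch portable (CPython's GIL means this version will not actually speed up CPU-bound work the way a native engine does).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_sum(rows):
    """Aggregate one partition: sum the amount per key."""
    totals = Counter()
    for key, amount in rows:
        totals[key] += amount
    return totals


def parallel_group_sum(rows, workers=4):
    """Equivalent of SELECT key, SUM(amount) ... GROUP BY key, in three steps."""
    # 1. Partition: deal rows round-robin into one slice per worker.
    partitions = [rows[i::workers] for i in range(workers)]
    # 2. Partial aggregation: each worker sums its own slice independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    # 3. Merge: add the per-partition totals together.
    merged = Counter()
    for part in partials:
        merged.update(part)
    return dict(merged)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = parallel_group_sum(rows, workers=2)
print(result)
```

Because sums can be merged in any order, the partitions never need to coordinate while they run, which is why this kind of query scales well with core count on a single machine.
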
| nathanwallace wrote:
| Readers may also enjoy Steampipe [1], an open source tool to live
| query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes,
| etc). It uses Postgres Foreign Data Wrappers under the hood and
| supports joins etc with other tables. (Disclaimer - I'm a lead on
| the project.)
|
| 1 - https://github.com/turbot/steampipe
| nikita wrote:
| This is great work! Could you please comment on the choice of
| your license? Most Postgres extensions that achieve wide adoption
| use the Postgres, MIT or Apache license.
| philippemnoel wrote:
| All ParadeDB extensions are released under AGPL-3.0. We've
| found that it strikes the right balance between being open-
| source and enabling the community to adopt for free, while also
| protecting us from hyperscalers and enabling us to build a
| sustainable business. Perhaps the topic of a blog post someday
| :)
| nikita wrote:
| It looks like hyperscalers can still host it as long as they
| are publishing changes to the source code? Am I reading the
| license right?
| tomhallett wrote:
| For AWS to make it available on RDS Aurora, would it be
| safe to assume there would have to be some changes to the
| extension source to make it compatible with the Aurora
| engine? If we assume AWS doesn't want to do that, then their
| licensing provides some protection there.
| philippemnoel wrote:
| Yeah exactly. In practice, we've taken inspiration from
| the likes of Citus and others who have adopted the
| AGPL-3.0 license as a good compromise and have found
| success. It's rather common nowadays for infra startups
| to use AGPL-3.0. Other noteworthy examples: MinIO,
| Quickwit, Lago, etc.
| francoismassot wrote:
| Well, MongoDB was under AGPL v3.0 :)
| arduanika wrote:
| The name seems to be an allusion to the author P.G. Wodehouse,
| creator of the character Jeeves.
|
| https://en.wikipedia.org/wiki/P._G._Wodehouse
|
| Very clever naming!
| pas wrote:
| Sorry, what do you base that on? To me it just seems like a
| straightforward inspiration from the "data lake" -> "lakehouse"
| terminology that Databricks started (?) using.
|
| https://www.databricks.com/product/data-lakehouse
|
| edit: ah, but in a different comment someone noted that it's
| not actually a lakehouse, so who knows!? :)
| arduanika wrote:
| Based on pure speculation. I may be reaching.
|
| My best guess is that Databricks and Pg_lakehouse both
| independently coined "lakehouse" from "data lake", and that
| for the latter team, it was partly a pun on Wodehouse. But
| the creators are welcome to chime in and confirm/deny!
|
| (Or to say, like, "Sure...uh, we totally meant that...yes we
| are very literary.")
| philippemnoel wrote:
| I wish we were that clever, but it's really just the
| combination of "data lake" and "data warehouse", which
| isn't even coined by us :)
| jeadie wrote:
| This looks functionally similar to using
| http://github.com/spiceai/spiceai with a PostgreSQL data
| accelerator.
___________________________________________________________________
(page generated 2024-05-13 23:00 UTC)