[HN Gopher] Preview: Amazon S3 Tables and Lakehouse in DuckDB
___________________________________________________________________
Preview: Amazon S3 Tables and Lakehouse in DuckDB
Author : hn1986
Score : 102 points
Date : 2025-03-18 16:36 UTC (6 hours ago)
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
| ayhanfuat wrote:
| Has anybody tried S3 Tables? How was your experience? It seems
| more tempting now that DuckDB supports it.
| Kalanos wrote:
| Haven't tried it. S3 Tables sounds like a great idea. However,
| I am wary. For it to be useful, a suite of AWS services
| probably needs to integrate with it. These services are all
| managed by different teams that don't always work well together
| out of the box and often compete with redundant products. For
| example, configuring SageMaker Studio to use an EMR cluster for
| Spark was a multi-day hassle with a lot of custom (insecure?)
| configuration. How is this different from other existing table
| offerings? AWS is a mess.
| reedf1 wrote:
| As a data engineering dabbler; parquet in S3 is beautiful. So is
| DuckDB. What an incredible match.
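| If you haven't tried the pairing, here is a minimal sketch of
| querying Parquet on S3 straight from Python (bucket, prefix and
| column names are placeholders, and credentials are assumed to
| come from the normal AWS credential chain):
|
|   import duckdb
|
|   con = duckdb.connect()            # in-process, nothing to deploy
|   for ext in ("httpfs", "aws"):     # S3 access + credential chain
|       con.execute(f"INSTALL {ext}")
|       con.execute(f"LOAD {ext}")
|   con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")
|
|   # Query the Parquet files where they live; no copy, no cluster
|   con.sql("""
|       SELECT passenger_count, count(*) AS trips
|       FROM read_parquet('s3://my-bucket/taxi/*.parquet')
|       GROUP BY passenger_count
|       ORDER BY trips DESC
|   """).show()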
| alexott wrote:
| Plain Parquet has a lot of problems. That's why Iceberg and
| Delta arose.
| timenova wrote:
| Can you elaborate on what kinds of problems plain Parquet
| has?
| pacbard wrote:
| Apache Iceberg builds an additional layer on top of Parquet
| files that lets you do ACID transactions, rollbacks, and
| schema evolution.
|
| A Parquet file is a static file that holds the entire data
| for a table. You can't insert, update, delete, etc.; the
| file simply is what it is. That works OK for small tables,
| but it becomes unwieldy if you need to do whole-table
| replacements each time your data changes.
|
| Apache Iceberg fixes this problem by adding a metadata
| layer on top of smaller Parquet files (at a 300,000 ft
| overview).
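| To make that concrete, here is a small sketch of what an
| "update" to a plain Parquet file actually looks like (the local
| file names and the user_id column are invented for the example):
|
|   import pyarrow.compute as pc
|   import pyarrow.parquet as pq
|
|   # A plain Parquet file has no in-place update: you read the
|   # whole table, change it in memory, and write a brand-new file.
|   table = pq.read_table("users.parquet")
|
|   # "Delete" one row by filtering and rewriting every byte.
|   mask = pc.not_equal(table["user_id"], 42)
|   pq.write_table(table.filter(mask), "users_v2.parquet")
|
|   # Readers then have to be switched to users_v2.parquet
|   # atomically; that coordination (snapshots, manifests) is
|   # exactly what Iceberg's metadata layer provides.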
| yodon wrote:
| Can someone Eli5 the difference between AWS S3 Tables and AWS
| SimpleDB?
| nattaylor wrote:
| S3 Tables is designed for storing and optimizing tabular data
| in S3 using Apache Iceberg, offering features like automatic
| optimization and fast query performance. SimpleDB is a NoSQL
| database service focused on providing simple indexing and
| querying capabilities without requiring a schema.
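| The flow in the linked post is roughly "attach the table bucket
| by ARN, then query it like any other catalog". A hedged sketch
| in Python (the ARN, namespace and table names are placeholders,
| and the post uses preview/nightly builds of the extensions):
|
|   import duckdb
|
|   con = duckdb.connect()
|   for ext in ("httpfs", "aws", "iceberg"):
|       con.execute(f"INSTALL {ext}")
|       con.execute(f"LOAD {ext}")
|   con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")
|
|   # Attach an S3 table bucket by ARN (placeholder values)
|   con.execute("""
|       ATTACH 'arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket'
|       AS s3_tables_db (TYPE iceberg, ENDPOINT_TYPE s3_tables)
|   """)
|
|   # Tables appear as <namespace>.<table> inside the attached catalog
|   con.sql("SELECT count(*) FROM s3_tables_db.my_namespace.my_table").show()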
| alex_smart wrote:
| They are so completely different that it would be simpler if
| you explained what similarities you see between the two.
| isjustintime wrote:
| This is pretty exciting. DuckDB is already proving to be a
| powerful tool in the industry.
|
| Previously there was a strong trend of using simple S3-backed
| blob storage with Parquet and Athena for querying data lakes. It
| felt like things had gotten pretty complicated, but as
| integrations improve and Apache Iceberg gains maturity, I'm
| seeing a shift toward greater flexibility with less SaaS/tool
| sprawl in data lakes.
| RobinL wrote:
| Yes - agree! I actually wrote a blog about this just two days
| ago:
|
| May be of interest to people who:
|
| - Want to know what DuckDB is and why it's interesting
|
| - What's good about it
|
| - Why for orgs without huge data, we will hopefully see a lot
| more of 's3 + duckdb' rather than more complex architectures
| and services, and hopefully (IMHO) less Spark!
|
| https://www.robinlinacre.com/recommend_duckdb/
|
| I think most people in data science or data engineering should
| at least try it to get a sense of what it can do.
|
| Really for me, the most important thing is it makes it so much
| easier to design and test complex ETL because you're not
| constantly having to run queries against Athena/Spark to check
| they work - you can do it all locally, in CI, set up tests,
| etc.
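| For example, here is the kind of test this enables, sketched
| with invented table and column names (the point being that the
| same SQL that runs against the lake runs against an in-memory
| DuckDB in CI):
|
|   import datetime
|   import duckdb
|
|   # The transformation under test; in production it would read
|   # Parquet/Iceberg in S3, but it is just SQL.
|   LATEST_EVENT_SQL = """
|       SELECT customer_id, max(updated_at) AS updated_at
|       FROM events
|       GROUP BY customer_id
|   """
|
|   def test_latest_event_per_customer():
|       con = duckdb.connect()  # in-memory, fine for CI
|       con.execute("""
|           CREATE TABLE events AS
|           SELECT * FROM (VALUES
|               (1, DATE '2025-01-01'),
|               (1, DATE '2025-02-01'),
|               (2, DATE '2025-01-15')
|           ) AS t(customer_id, updated_at)
|       """)
|       rows = sorted(con.sql(LATEST_EVENT_SQL).fetchall())
|       assert rows == [(1, datetime.date(2025, 2, 1)),
|                       (2, datetime.date(2025, 1, 15))]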
| yakshaving_jgt wrote:
| Funny, I read TFA and came to the comments to share exactly
| this recent blog post of yours. Big fan of your work, Robin!
| RobinL wrote:
| Ah nice - reading that made me feel good! Appreciate the
| feedback!
| hn1986 wrote:
| from the blog: "This is a very interesting new development,
| making DuckDB potentially a suitable replacement for
| lakehouse formats such as Iceberg or Delta lake for medium
| scale data."
|
| I don't think we'll ever see this, honestly.
|
| Excellent podcast episode with Joe Reis - I've also never
| understood this whole idea of "just use Spark" or that you
| gotta get on Redshift.
| pletnes wrote:
| I have the same thoughts. However, my impression is also that
| most orgs would choose e.g. Databricks or something for the
| permission handling, web UI, etc. So what is the equivalent
| "full rig" with DuckDB and S3 / blob storage?
| RobinL wrote:
| Yeah I think that's fair, especially from the 'end consumer
| of the data' point of view, and doing things like row-level
| permissions.
|
| For the ETL side, where often whole-table access is good
| enough, I find Spark in particular very cumbersome -
| there's more that can go wrong vs. DuckDB, and it's harder
| to troubleshoot.
| mritchie712 wrote:
| if you're looking to try out duckdb + iceberg on AWS, we have a
| solid guide here: https://www.definite.app/blog/cloud-iceberg-
| duckdb-aws
| AlecBG wrote:
| Does this support time travel queries?
|
| Does it support reading everything from one snapshot to another?
| (This is missing in Athena)
|
| If yes to both, does it respect row level deletes when it does
| this?
| whinvik wrote:
| When is write support for iceberg coming?
| dm03514 wrote:
| pfsh who needs to write data??? ;p
|
| If you have streaming data as a source, I built a side project
| to write streaming data to s3 in iceberg format:
|
| https://sql-flow.com/docs/tutorials/iceberg-sink
|
| https://github.com/turbolytics/sql-flow
|
| I realize it's not quite what you asked for but wanted to
| mention it. I'm surprised at the lack of native Iceberg
| write support in these tools.
|
| PyIceberg, though, was quite easy to use, and the Arrow-based
| API was very helpful as well.
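| The append path is roughly the following (catalog name,
| namespace and schema are placeholders; the Arrow schema has to
| line up with the Iceberg table's):
|
|   import pyarrow as pa
|   from pyiceberg.catalog import load_catalog
|
|   # Catalog configured elsewhere (e.g. ~/.pyiceberg.yaml);
|   # "default" and "analytics.events" are placeholder names.
|   catalog = load_catalog("default")
|   table = catalog.load_table("analytics.events")
|
|   # Build an Arrow table matching the Iceberg schema...
|   batch = pa.table({
|       "event_id": pa.array([1, 2, 3], type=pa.int64()),
|       "payload": pa.array(["a", "b", "c"], type=pa.string()),
|   })
|
|   # ...and append it: PyIceberg writes the Parquet data files
|   # and commits a new snapshot to the table metadata.
|   table.append(batch)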
| whinvik wrote:
| Thanks. This looks cool.
|
| However, my issue is the need to introduce one more tool. I
| feel that without a single tool to read and write to Iceberg,
| I would not want to introduce it to our team.
|
| Spark is cool and all, but it requires quite a bit of effort
| to get working properly. And Spark seems to be the only thing
| right now that can read and write Iceberg natively with a
| SQL-like interface.
| dm03514 wrote:
| I've mentioned this whenever Iceberg comes up. It's wild how
| immature the ecosystem still is. DuckDB itself lacks the
| ability to write Iceberg....
|
| https://duckdb.org/docs/stable/extensions/iceberg/overview.h...
|
| Apache iceberg-go? Nope
|
| https://github.com/apache/iceberg-go?tab=readme-ov-file#read...
|
| Basically, Java Iceberg is the only mature way to do this;
| it's not a very accessible ecosystem.
|
| For a side project I'm using pyiceberg to sink streaming data to
| iceberg (using DuckDB as the stream processor):
|
| https://sql-flow.com/docs/tutorials/iceberg-sink
|
| It's basically a workaround for DuckDB's lack of native
| support. I am very happy with the PyIceberg library as a user.
| It was very easy, and the native Arrow support is a glimpse
| into the future. Arrow as an interchange format is quite
| amazing: just open up the Iceberg table and append Arrow
| dataframes to it!
|
| https://github.com/turbolytics/sql-flow
|
| Arrow is quite spectacular, and it's cool to see the industry
| moving to standardize on it as a dataframe format. For
| example, the ClickHouse Python client also supports
| Arrow-based insertion:
|
| https://sql-flow.com/docs/tutorials/clickhouse-sink
|
| This makes the glue code for sinking into these different
| systems trivial, as long as Arrow is used.
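| A sketch of that last point with clickhouse-connect (host,
| table and columns are placeholders): the same Arrow table that
| went into Iceberg can be handed to ClickHouse unchanged.
|
|   import clickhouse_connect
|   import pyarrow as pa
|
|   client = clickhouse_connect.get_client(host="localhost")
|
|   batch = pa.table({
|       "event_id": pa.array([1, 2, 3], type=pa.int64()),
|       "payload": pa.array(["a", "b", "c"], type=pa.string()),
|   })
|
|   # Arrow-native insert; no per-row conversion in the glue code
|   client.insert_arrow("events", batch)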
| hn1986 wrote:
| tracking write support here:
|
| https://github.com/duckdb/duckdb-iceberg/issues/37
| barrenko wrote:
| What the hell is iceberg now?
| sys13 wrote:
| Wonder why not Delta Lake instead, since Iceberg will merge with
| Delta
| alexott wrote:
| It's already supported for quite a while:
| https://duckdb.org/2024/06/10/delta.html
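| On the read side, the delta extension exposes a table function
| over the table's path; a quick sketch (the path is a
| placeholder):
|
|   import duckdb
|
|   con = duckdb.connect()
|   for ext in ("httpfs", "delta"):
|       con.execute(f"INSTALL {ext}")
|       con.execute(f"LOAD {ext}")
|
|   # delta_scan reads a Delta table in place (placeholder path)
|   con.sql("""
|       SELECT count(*)
|       FROM delta_scan('s3://my-bucket/my-delta-table')
|   """).show()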
| _atyler_ wrote:
| This is a great example of how simplicity often wins in practice.
| Too many systems overcomplicate storage and retrieval, assuming
| every use case needs full indexing or ultra-low latency. In
| reality, for many workloads, treating S3 like a raw table and
| letting the engine handle the heavy lifting makes a lot of sense.
| Curious to see how it performs under high concurrency--any
| benchmarks on that yet?
| TheGuyWhoCodes wrote:
| Does DuckDB just delegate the query to S3 Tables, or does it
| do anything in-engine with the data files?
|
| One thing that's missing in DuckDB is predicate pushdown for
| Iceberg - see https://github.com/duckdb/duckdb-iceberg/issues/2
|
| Which puts it way behind the competition, performance-wise.
___________________________________________________________________
(page generated 2025-03-18 23:00 UTC)