[HN Gopher] Preview: Amazon S3 Tables and Lakehouse in DuckDB
       ___________________________________________________________________
        
       Preview: Amazon S3 Tables and Lakehouse in DuckDB
        
       Author : hn1986
       Score  : 102 points
       Date   : 2025-03-18 16:36 UTC (6 hours ago)
        
 (HTM) web link (duckdb.org)
 (TXT) w3m dump (duckdb.org)
        
       | ayhanfuat wrote:
       | Anybody tried S3 tables? How is your experience? It seems more
       | tempting now that DuckDB supports it.
        
         | Kalanos wrote:
         | Haven't tried it. S3 Tables sounds like a great idea. However,
         | I am wary. For it to be useful, a suite of AWS services
         | probably needs to integrate with it. These services are all
         | managed by different teams that don't always work well together
         | out of the box and often compete with redundant products. For
         | example, configuring SageMaker Studio to use an EMR cluster for
         | Spark was a multi-day hassle with a lot of custom (insecure?)
         | configuration. How is this different from other existing table
         | offerings? AWS is a mess.
        
       | reedf1 wrote:
        | As a data engineering dabbler, Parquet in S3 is beautiful. So
        | is DuckDB. What an incredible match.
        
         | alexott wrote:
          | Plain Parquet has a lot of problems. That's why Iceberg and
          | Delta arose.
        
           | timenova wrote:
            | Can you elaborate on what kind of problems plain Parquet
            | has?
        
             | pacbard wrote:
              | Apache Iceberg builds an additional layer on top of
              | Parquet files that lets you do ACID transactions,
              | rollbacks, and schema evolution.
             | 
              | A Parquet file is a static file containing the whole
              | data for a table. You can't insert, update, delete,
              | etc. That works OK for small tables, but it becomes
              | unwieldy if you need to do whole-table replacements
              | each time your data changes.
             | 
             | Apache Iceberg fixes this problem by adding a metadata
             | layer on top of smaller Parquet files (at a 300,000 ft
             | overview).
        
       | yodon wrote:
        | Can someone ELI5 the difference between AWS S3 Tables and AWS
        | SimpleDB?
        
         | nattaylor wrote:
         | S3 Tables is designed for storing and optimizing tabular data
         | in S3 using Apache Iceberg, offering features like automatic
         | optimization and fast query performance. SimpleDB is a NoSQL
         | database service focused on providing simple indexing and
         | querying capabilities without requiring a schema.
        
         | alex_smart wrote:
         | They are so completely different that it would be simpler if
         | you explained what similarities you see between the two.
        
       | isjustintime wrote:
       | This is pretty exciting. DuckDB is already proving to be a
       | powerful tool in the industry.
       | 
       | Previously there was a strong trend of using simple S3-backed
       | blob storage with Parquet and Athena for querying data lakes. It
       | felt like things have gotten pretty complicated, but as
       | integrations improve and Apache Iceberg gains maturity, I'm
       | seeing a shift toward greater flexibility with less SaaS/tool
       | sprawl in data lakes.
        
         | RobinL wrote:
         | Yes - agree! I actually wrote a blog about this just two days
         | ago:
         | 
          | May be of interest to people who:
          | 
          | - Want to know what DuckDB is and why it's interesting
          | 
          | - Want to know what's good about it
          | 
          | - Wonder why, for orgs without huge data, we will hopefully
          | see a lot more of 's3 + duckdb' rather than more complex
          | architectures and services, and hopefully (IMHO) less Spark!
         | 
         | https://www.robinlinacre.com/recommend_duckdb/
         | 
          | I think most people in data science or data engineering
          | should at least try it to get a sense of what it can do.
         | 
         | Really for me, the most important thing is it makes it so much
         | easier to design and test complex ETL because you're not
         | constantly having to run queries against Athena/Spark to check
         | they work - you can do it all locally, in CI, set up tests,
         | etc.
        
           | yakshaving_jgt wrote:
           | Funny, I read TFA and came to the comments to share exactly
           | this recent blog post of yours. Big fan of your work, Robin!
        
             | RobinL wrote:
             | Ah nice - reading that made me feel good! Appreciate the
             | feedback!
        
           | hn1986 wrote:
           | from the blog: "This is a very interesting new development,
           | making DuckDB potentially a suitable replacement for
           | lakehouse formats such as Iceberg or Delta lake for medium
           | scale data."
           | 
           | I don't think we'll ever see this, honestly.
           | 
           | excellent podcast episode with Joe Reis - I've also never
           | understood this whole idea of "just use Spark" or you gotta
           | get on Redshift.
        
           | pletnes wrote:
           | I have the same thoughts. However my impression is also that
           | most orgs would choose eg databricks or something for the
           | permission handling, web ui, ++ so what is the equivalent
           | <<full rig>> with duckdb and S3 / blob storage?
        
             | RobinL wrote:
             | Yeah I think that's fair, especially from the 'end consumer
             | of the data' point of view, and doing things like row-level
             | permissions.
             | 
              | For the ETL side, where often whole-table access is good
              | enough, I find Spark in particular very cumbersome -
              | there's more that can go wrong vs. DuckDB and it's
              | harder to troubleshoot.
        
         | mritchie712 wrote:
         | if you're looking to try out duckdb + iceberg on AWS, we have a
         | solid guide here: https://www.definite.app/blog/cloud-iceberg-
         | duckdb-aws
        
       | AlecBG wrote:
       | Does this support time travel queries?
       | 
       | Does it support reading everything from one snapshot to another?
       | (This is missing in Athena)
       | 
       | If yes to both, does it respect row level deletes when it does
       | this?
        
       | whinvik wrote:
       | When is write support for iceberg coming?
        
         | dm03514 wrote:
         | pfsh who needs to write data??? ;p
         | 
         | If you have streaming data as a source, I built a side project
         | to write streaming data to s3 in iceberg format:
         | 
         | https://sql-flow.com/docs/tutorials/iceberg-sink
         | 
         | https://github.com/turbolytics/sql-flow
         | 
          | I realize it's not quite what you asked for but wanted to
          | mention it. I'm surprised at the lack of native Iceberg
          | write support in these tools.
         | 
          | PyIceberg, though, was quite easy to use; the Arrow-based
          | API was very helpful as well.
        
           | whinvik wrote:
           | Thanks. This looks cool.
           | 
           | However, my issue is the need to introduce one more tool. I
           | feel that without a single tool to read and write to Iceberg,
           | I would not want to introduce it to our team.
           | 
           | Spark is cool and all but it requires quite a bit of effort
           | to properly work. And Spark seems to be the only thing right
           | now that can read and write to Iceberg natively with a SQL
           | like interface.
        
       | dm03514 wrote:
        | I've mentioned this whenever Iceberg comes up. It's wild how
        | immature the ecosystem still is. DuckDB itself lacks the
        | ability to write Iceberg...
       | 
       | https://duckdb.org/docs/stable/extensions/iceberg/overview.h...
       | 
        | Apache Iceberg Go? Nope.
       | 
       | https://github.com/apache/iceberg-go?tab=readme-ov-file#read...
       | 
        | Basically, Java Iceberg is the only mature way to do this;
        | it's not a very accessible ecosystem.
       | 
       | For a side project I'm using pyiceberg to sink streaming data to
       | iceberg (using DuckDB as the stream processor):
       | 
       | https://sql-flow.com/docs/tutorials/iceberg-sink
       | 
        | It's basically a workaround for DuckDB's lack of native
        | support. I am very happy with the PyIceberg library as a
        | user: it was very easy, and the native Arrow support is a
        | glimpse into the future. Arrow as an interchange format is
        | quite amazing. Just open up the Iceberg table and append
        | Arrow dataframes to it!
       | 
       | https://github.com/turbolytics/sql-flow
       | 
        | Arrow is quite spectacular and it's cool to see the industry
        | moving to standardize on it as a dataframe format. For
        | example, the ClickHouse Python client also supports
        | Arrow-based insertion:
       | 
       | https://sql-flow.com/docs/tutorials/clickhouse-sink
       | 
        | This makes the glue code for sinking into these different
        | systems trivial, as long as Arrow is used.
        
         | hn1986 wrote:
         | tracking write support here:
         | 
         | https://github.com/duckdb/duckdb-iceberg/issues/37
        
         | barrenko wrote:
         | What the hell is iceberg now?
        
       | sys13 wrote:
       | Wonder why not Delta Lake instead, since Iceberg will merge with
       | Delta
        
         | alexott wrote:
         | It's already supported for quite a while:
         | https://duckdb.org/2024/06/10/delta.html
        
       | _atyler_ wrote:
       | This is a great example of how simplicity often wins in practice.
       | Too many systems overcomplicate storage and retrieval, assuming
       | every use case needs full indexing or ultra-low latency. In
       | reality, for many workloads, treating S3 like a raw table and
       | letting the engine handle the heavy lifting makes a lot of sense.
       | Curious to see how it performs under high concurrency--any
       | benchmarks on that yet?
        
       | TheGuyWhoCodes wrote:
        | Does DuckDB just delegate the query to S3 Tables, or does it
        | do anything in-engine with the data files?
       | 
        | One thing that's missing in DuckDB is predicate pushdown for
        | Iceberg - see https://github.com/duckdb/duckdb-iceberg/issues/2
       | 
       | Which puts it way behind the competition, performance wise.
        
       ___________________________________________________________________
       (page generated 2025-03-18 23:00 UTC)