hngopher.com

       [HN Gopher] Apache Iceberg V3 Spec new features for more efficie...
       ___________________________________________________________________
        
       Apache Iceberg V3 Spec new features for more efficient and flexible
       data lakes
        
       Author : talatuyarer
       Score  : 68 points
       Date   : 2025-08-11 17:07 UTC (5 hours ago)
        
        
       | talatuyarer wrote:
       | This new version has some great new features, including deletion
       | vectors for more efficient transactions and default column values
       | to make schema evolution a breeze. The full article has all the
       | details.
        
       | hodgesrm wrote:
       | This Google article was nice as a high-level overview of Iceberg
       | V3. I wish that the V3 spec (and Iceberg specs in general) were
       | more readable. For now the best approach seems to be to read the
       | Javadoc for the Iceberg Java API. [0]
       | 
       | [0] https://javadoc.io/doc/org.apache.iceberg/iceberg-api/latest...
        
         | twoodfin wrote:
         | The Iceberg spec is a model of clarity and simplicity compared
         | to the (constantly in flux via Databricks commits...) Delta
         | protocol spec:
         | 
         | https://github.com/delta-io/delta/blob/master/PROTOCOL.md
        
           | eatonphil wrote:
           | On the contrary, the Delta Lake paper is extremely easy to
           | read, and its basics are easy to implement (I did), while
           | Iceberg has nothing so concise and clear.
        
             | twoodfin wrote:
             | If I implement what's described in the Delta Lake paper,
             | will I be able to query and update arbitrary Delta Lake
             | tables as populated by Databricks in 2025?
             | 
             | (Would be genuinely excited if the answer is yes.)
        
               | eatonphil wrote:
               | Not sure (probably not). But it's definitely much easier
               | to immediately understand IMO.
        
               | twoodfin wrote:
               | OK, but at least from my perspective, the point of OTFs
               | (open table formats) is to allow ongoing interoperability
               | between query and update engines.
               | 
               | A "standard" getting semi-monthly updates via random
               | Databricks-affiliated GitHub accounts doesn't really fit
               | that bill.
               | 
               | Look at something like this:
               | 
               | https://github.com/delta-io/delta/blob/master/PROTOCOL.md#wr...
               | 
               | Ouch.
        
       | ahmetburhan wrote:
       | Cool to see Iceberg getting these kinds of upgrades. Deletion
       | vectors and default column values sound like real quality-of-life
       | improvements, especially for big, messy datasets. Curious to hear
       | if anyone's tried V3 in production yet and what the performance
       | looks like.
        
         | jamesblonde wrote:
         | Is it out yet?
        
       | amluto wrote:
       | > ALTER TABLE events ADD COLUMN version INT DEFAULT 1;
       | 
       | I've always disliked this approach. It conflates two things: the
       | value to put in preexisting rows and the default going forward. I
       | often want to add a column, backfill it, and not have a default.
       | 
       | Fortunately, the Iceberg spec at least got this right under the
       | hood. There's "initial-default", which is the value implicitly
       | inserted in rows that predate the addition of the column, and
       | there's "write-default", which is the default for new rows.
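
       A minimal sketch of the distinction described above, in Python. This
       is a toy model, not the Iceberg library API: the Field class and the
       read_row/write_row helpers are hypothetical names invented for
       illustration. Only the two concepts, "initial-default" and
       "write-default", come from the comment. The example assumes the case
       the commenter wants: old rows are backfilled with 1, and new rows get
       no default.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


# Hypothetical model for illustration only; not the Iceberg API.
@dataclass(frozen=True)
class Field:
    name: str
    # Value assumed for rows written before the column existed.
    initial_default: Optional[Any] = None
    # Value used when a writer omits the column in a new row.
    write_default: Optional[Any] = None


def read_row(stored: Dict[str, Any], schema: List[Field]) -> Dict[str, Any]:
    """Project a stored row onto the current schema.

    A column that is absent from the stored data (the data predates the
    column) reads as the field's initial-default. An explicitly stored
    None is kept as None.
    """
    return {
        f.name: stored[f.name] if f.name in stored else f.initial_default
        for f in schema
    }


def write_row(values: Dict[str, Any], schema: List[Field]) -> Dict[str, Any]:
    """Materialize a new row; omitted columns receive the write-default."""
    return {
        f.name: values[f.name] if f.name in values else f.write_default
        for f in schema
    }


# Add a "version" column: old rows read as 1, new rows get no default.
schema = [Field("id"), Field("version", initial_default=1, write_default=None)]

old_row = read_row({"id": 7}, schema)   # written before the column existed
new_row = write_row({"id": 8}, schema)  # written after, column omitted

print(old_row)
print(new_row)
```

       Keeping the two values separate means the value seen in old rows does
       not have to become the default for future writes, which a single
       DEFAULT clause cannot express.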
        
       | drivenextfunc wrote:
       | Many companies seem to be using Apache Iceberg, but the ecosystem
       | feels immature outside of Java. For instance, iceberg-rust
       | doesn't even support HDFS. (Though admittedly, Iceberg's tendency
       | to create many small files makes it a poor fit for HDFS anyway.)
        
         | hodgesrm wrote:
         | Seems like this is going to be a permanent issue, no? Library-
         | level storage APIs are complex and often quite leaky. That's
         | based on looking at the innards of MySQL and ClickHouse for a
         | while.
         | 
         | It seems quite possible that there will be three or four
         | libraries that can write to Iceberg (Java, Python, Rust, maybe
         | Golang), while the rest at best will offer read access only.
         | And those language choices will condition and be conditioned by
         | the languages that developers use to write applications that
         | manage Iceberg data.
        
         | ozgrakkurt wrote:
         | This was the same with the Arrow/Parquet libraries as well. It
         | takes a long time for all implementations to catch up.
        
       | jamesblonde wrote:
       | When will open source v3 come out? It's supposed to be in Apache
       | Iceberg 1.10, right?
        
         | talatuyarer wrote:
         | Yes, 1.10 will be the first version for the V3 spec. But not
         | all features are implemented on runners such as Spark or Flink.
        
           | fabatka wrote:
           | I thought 1.9.0 already had at least some of the v3 features,
           | like the variant type and row lineage?
           | https://iceberg.apache.org/releases/#190-release
           | 
           | Of course I haven't seen any implementations supporting these
           | yet.
        
             | talatuyarer wrote:
             | Yes, the specification will be finalized with version 1.10.
             | Previous versions also include specification changes.
             | Iceberg's implementation of V3 occurs in three stages:
             | Specification Change, Core Implementation, and Spark/Flink
             | Implementation.
             | 
             | So far only Variant is supported in Spark, and with 1.10
             | Spark will support nanosecond timestamps and the unknown
             | type, I believe.
        
       | robertlagrant wrote:
       | > default column values
       | 
       | The way they implemented this seems really useful for any
       | database.
        
       ___________________________________________________________________
       (page generated 2025-08-11 23:01 UTC)