[HN Gopher] Apache Iceberg V3 Spec new features for more efficie...
___________________________________________________________________
Apache Iceberg V3 Spec new features for more efficient and flexible
data lakes
Author : talatuyarer
Score : 68 points
Date : 2025-08-11 17:07 UTC (5 hours ago)
(HTM) web link (opensource.googleblog.com)
(TXT) w3m dump (opensource.googleblog.com)
| talatuyarer wrote:
| This new version has some great new features, including deletion
| vectors for more efficient transactions and default column values
| to make schema evolution a breeze. The full article has all the
| details.
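The deletion-vector idea mentioned above can be sketched outside Iceberg: a delete marks row positions in an existing data file instead of rewriting the file, and readers apply the mark at scan time (merge-on-read). A minimal toy illustration, not the Iceberg API (in V3 the vectors are actually roaring bitmaps stored in Puffin files, keyed by data file):

```python
# Toy sketch of a deletion vector: a per-data-file set of deleted row
# positions, applied when reading instead of rewriting the data file.

class DeletionVector:
    """Stand-in for the roaring bitmap Iceberg V3 keeps in Puffin files."""

    def __init__(self):
        self.deleted = set()  # positions of deleted rows in one data file

    def delete(self, pos: int) -> None:
        self.deleted.add(pos)

    def apply(self, rows):
        # Yield only rows whose position is not marked as deleted.
        return [r for i, r in enumerate(rows) if i not in self.deleted]


rows = ["a", "b", "c", "d"]
dv = DeletionVector()
dv.delete(1)  # logically delete position 1 without touching the file
print(dv.apply(rows))  # ['a', 'c', 'd']
```

The win over V2's position-delete files is that the bitmap is compact and applied per data file, so deletes stay cheap until a compaction rewrites the file.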
| hodgesrm wrote:
| This Google article was nice as a high level overview of Iceberg
| V3. I wish that the V3 spec (and Iceberg specs in general) were
| more readable. For now the best approach seems to be to read
| the Javadoc for the Iceberg Java API. [0]
|
| [0] https://javadoc.io/doc/org.apache.iceberg/iceberg-
| api/latest...
| twoodfin wrote:
| The Iceberg spec is a model of clarity and simplicity compared
| to the (constantly in flux via Databricks commits...) Delta
| protocol spec:
|
| https://github.com/delta-io/delta/blob/master/PROTOCOL.md
| eatonphil wrote:
| To the contrary, the Delta Lake paper is extremely easy to
| read and implement the basics of (I did), and Iceberg has
| nothing so concise and clear.
| twoodfin wrote:
| If I implement what's described in the Delta Lake paper,
| will I be able to query and update arbitrary Delta Lake
| tables as populated by Databricks in 2025?
|
| (Would be genuinely excited if the answer is yes.)
| eatonphil wrote:
| Not sure (probably not). But it's definitely much easier
| to immediately understand IMO.
| twoodfin wrote:
| OK, but at least from my perspective, the point of OTFs
| is to allow ongoing interoperability between query and
| update engines.
|
| A "standard" getting semi-monthly updates via random
| Databricks-affiliated GitHub accounts doesn't really fit
| that bill.
|
| Look at something like this:
|
| https://github.com/delta-
| io/delta/blob/master/PROTOCOL.md#wr...
|
| Ouch.
| ahmetburhan wrote:
| Cool to see Iceberg getting these kinds of upgrades. Deletion
| vectors and default column values sound like real quality-of-life
| improvements, especially for big, messy datasets. Curious to hear
| if anyone's tried V3 in production yet and what the performance
| looks like.
| jamesblonde wrote:
| Is it out yet?
| amluto wrote:
| > ALTER TABLE events ADD COLUMN version INT DEFAULT 1;
|
| I've always disliked this approach. It conflates two things: the
| value to put in preexisting rows and the default going forward. I
| often want to add a column, backfill it, and not have a default.
|
| Fortunately, the Iceberg spec at least got this right under the
| hood. There's "initial-default", which is the value implicitly
| inserted in rows that predate the addition of the column, and
| there's "write-default", which is the default for new rows.
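The split between the two defaults can be illustrated with a toy model, not the Iceberg API (names here are illustrative; in the spec, "initial-default" fills the column in rows written before it existed, while "write-default" applies to new rows that omit it):

```python
# Toy sketch of Iceberg V3's two-defaults model for an added column.

class Column:
    def __init__(self, name, initial_default=None, write_default=None):
        self.name = name
        self.initial_default = initial_default  # value for pre-existing rows
        self.write_default = write_default      # value for new rows omitting it


def read(row, col):
    # Old data files lack the column entirely; reads fill initial-default.
    return row.get(col.name, col.initial_default)


def insert(row, col):
    # New writes that omit the column get write-default (possibly none).
    row.setdefault(col.name, col.write_default)
    return row


# amluto's case: add "version", backfill old rows with 1, but set no
# going-forward default for new writes.
version = Column("version", initial_default=1, write_default=None)
old_row = {"id": 7}                # predates the column
print(read(old_row, version))      # 1  (backfilled on read)
print(insert({"id": 8}, version))  # {'id': 8, 'version': None}
```

Because the backfill is a metadata default rather than a data rewrite, adding the column stays an O(1) schema change regardless of table size.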
| drivenextfunc wrote:
| Many companies seem to be using Apache Iceberg, but the ecosystem
| feels immature outside of Java. For instance, iceberg-rust
| doesn't even support HDFS. (Though admittedly, Iceberg's tendency
| to create many small files makes it a poor fit for HDFS anyway.)
| hodgesrm wrote:
| Seems like this is going to be a permanent issue, no? Library
| level storage APIs are complex and often quite leaky. That's
| based on looking at the innards of MySQL and ClickHouse for a
| while.
|
| It seems quite possible that there will be maybe three
| libraries that can write to Iceberg (Java, Python, Rust, maybe
| Golang), while the rest at best will offer read access only.
| And those language choices will condition and be conditioned by
| the languages that developers use to write applications that
| manage Iceberg data.
| ozgrakkurt wrote:
| This was the same with arrow/parquet libraries as well. It
takes a long time for all implementations to catch up.
| jamesblonde wrote:
| When will open source v3 come out? It's supposed to be in Apache
| Iceberg 1.10, right?
| talatuyarer wrote:
| Yes, 1.10 will be the first version for the V3 spec. But not
| all features are implemented in engines such as Spark or
| Flink.
| fabatka wrote:
| I thought 1.9.0 already had at least some of the v3 features,
| like the variant type and column lineages?
| https://iceberg.apache.org/releases/#190-release
|
| Of course I haven't seen any implementations supporting these
| yet.
| talatuyarer wrote:
| Yes, the specification will be finalized with version 1.10.
| Previous versions also include specification changes.
| Iceberg's implementation of V3 occurs in three stages:
| Specification Change, Core Implementation, and Spark/Flink
| Implementation.
|
So far only Variant is supported in Spark, and with 1.10
Spark will support the nanosecond timestamp and unknown
types, I believe.
| robertlagrant wrote:
| > default column values
|
| The way they implemented this seems really useful for any
| database.
___________________________________________________________________
(page generated 2025-08-11 23:01 UTC)