[HN Gopher] Apache Iceberg
       ___________________________________________________________________
        
       Apache Iceberg
        
       Author : jacobmarble
       Score  : 176 points
       Date   : 2025-01-23 01:03 UTC (3 days ago)
        
 (HTM) web link (iceberg.apache.org)
 (TXT) w3m dump (iceberg.apache.org)
        
       | vonnik wrote:
       | Curious to what extent Iceberg enables data composability and
       | what the best complements and alternatives are.
        
         | lmm wrote:
         | Delta Lake is the main competitor. There's a lot of convergence
         | going on, because everyone wants a common format and it's
         | pretty clear what the desirable features are. Ultimately it
         | becomes just boring infrastructure IMO.
        
         | nxm wrote:
          | It allows you to be query-engine agnostic - query the same
          | data via Spark, Snowflake, or Trino. Granted, performance
          | may suffer somewhat vs. Snowflake internal tables, since
          | certain performance optimizations aren't available.
        
       | varsketiz wrote:
        | I'm somewhat surprised to see it here - Iceberg has been
        | around for quite some time already.
        
         | mrbluecoat wrote:
         | Yeah, I was confused as well. It was like seeing "postage
         | stamps" on the HN front page.
        
         | benjaminwootton wrote:
          | It's been on the up in recent years, though, as it appears
          | to have won the format wars. Every vendor is rallying
          | around it, and at the end of 2024 there were new open
          | source catalogs and support from AWS.
        
           | mritchie712 wrote:
           | yeah, I'll admit I was worried when Databricks acquired
           | Tabular[0] that it would hurt Iceberg's momentum (e.g.
           | databricks would push delta instead), but it seems the
           | opposite has happened.
           | 
           | 0 - https://www.definite.app/blog/databricks-tabular-
           | acquisition
        
             | twoodfin wrote:
             | I was more worried--and continue to be so--that Databricks
             | will bring the rat's nest of complexity and pseudo-open
             | source model that characterizes Delta to the future of
             | Iceberg.
        
       | gopalv wrote:
       | Hidden partitioning is the most interesting Iceberg feature,
       | because most of the very large datasets are timeseries fact
       | tables.
       | 
       | I don't remember seeing that in Delta Lake [1], which is probably
       | because the industry standard benchmarks use date as a column
       | (tpc-h) or join date as a dimension table (tpc-ds) and do not use
       | timestamp ranges instead of dates.
       | 
       | [1] - https://github.com/delta-io/delta/issues/490
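        | 
        | For anyone unfamiliar, a rough sketch of what hidden
        | partitioning looks like in Spark SQL (catalog, table, and
        | column names here are made up):
        | 
        |     from pyspark.sql import SparkSession
        | 
        |     # Assumes a session already configured with an Iceberg
        |     # catalog named "demo"; this is a sketch, not a recipe.
        |     spark = SparkSession.builder.getOrCreate()
        | 
        |     spark.sql("""
        |         CREATE TABLE demo.db.events (
        |             id BIGINT,
        |             ts TIMESTAMP,
        |             payload STRING)
        |         USING iceberg
        |         PARTITIONED BY (days(ts))""")
        | 
        |     # Readers filter on the raw timestamp; Iceberg maps the
        |     # predicate onto day partitions itself, with no derived
        |     # date column in the schema or the query.
        |     spark.sql("""
        |         SELECT count(*) FROM demo.db.events
        |         WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'
        |           AND ts <  TIMESTAMP '2025-01-02 00:00:00'""").show()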
        
         | fiddlerwoaroof wrote:
         | Delta Lake now has Hilbert-curve based clustering which solves
         | a lot of the downsides of hive partitioning
        
           | gopalv wrote:
           | > Hilbert-curve based clustering which solves a lot of the
           | downsides of hive partitioning
           | 
           | Yes, that solved the 2-column high NDV partitioning issue -
           | if you had your ip traffic sorted on destination or source,
           | you need Z-curves, which are a little easier with bit
           | twiddling for fixed types to do the same thing.
           | 
           | Hive would write a large number of small files when
           | partitioned like that or you lose efficiencies when scanning
           | on the non-partitioned column.
           | 
           | This does fix the high NDV issue, but in general Netflix
           | wrote hidden partitioning in specifically to avoid sorting on
           | high NDV columns and to reduce the sort complexity on writes
           | (most daily writes won't need any partitioned inserts at
           | all).
           | 
           | While clustering on timestamp will force a sort even if it is
           | a single day.
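            | 
            | For contrast, this kind of clustering is applied to the
            | data files after the fact; the Z-order variant, roughly
            | (a sketch; table and column names made up, a Delta-enabled
            | Spark session assumed):
            | 
            |     from pyspark.sql import SparkSession
            | 
            |     # Rewrites data files so rows that are close on
            |     # (src_ip, dst_ip) land in the same files.
            |     spark = SparkSession.builder.getOrCreate()
            |     spark.sql(
            |         "OPTIMIZE demo.db.ip_traffic ZORDER BY (src_ip, dst_ip)")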
        
             | autodidacticon wrote:
             | What is NDV partitioning?
        
               | artwr wrote:
               | NDV = Number of distinct values. Here partitioning on
               | high cardinality columns, essentially.
        
       | honestSysAdmin wrote:
       | Iceberg is a pretty cool guy, he consolidates the Parquet and
       | doesn't afraid of anything.
        
       | rubenvanwyk wrote:
       | And yet there's still no straightforward way to write directly to
       | Iceberg tables from Javascript as far as I know.
        
         | Rhubarrbb wrote:
          | Writing to catalogs is still pretty new. Databricks has
          | recently been pushing delta-kernel-rs, which DuckDB has a
          | connector for, and there's support for writing from Python
          | with the Polars package through delta-rs. As a small-time
          | developer I've found this pretty helpful, and it was
          | influential in picking Delta Lake over Iceberg.
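          | 
          | The Polars path is about as small as it gets (a sketch; the
          | path is hypothetical and object-store credentials are
          | assumed to come from the environment):
          | 
          |     import polars as pl
          | 
          |     # delta-rs (the deltalake package) does the write; no
          |     # Spark and no catalog service involved.
          |     df = pl.DataFrame({"id": [1, 2, 3], "v": ["a", "b", "c"]})
          |     df.write_delta("s3://my-bucket/tables/demo", mode="append")
          | 
          |     print(pl.read_delta("s3://my-bucket/tables/demo"))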
        
           | kermatt wrote:
           | > influential in picking delta lake over iceberg
           | 
           | Can you expand on those reasons a bit?
           | 
            | The dependency on a catalog made Iceberg more complicated
            | for simple cases than Delta, where a directory hierarchy
            | was sufficient - if I understood the PyIceberg docs
            | correctly.
        
         | enether wrote:
         | for some reason it's really cumbersome to access this tech
        
           | peschu wrote:
            | I agree. As a long-time Business Intelligence developer
            | I'm still confused and astounded by all the tooling and
            | bits and pieces seemingly necessary to create
            | analytics/dashboards with open source tools.
            | 
            | For years I used a proprietary solution like Qlik Sense
            | for the whole journey from data extraction to a finished
            | dashboard (mostly on-prem). Going from raw data to a
            | finished dashboard is a matter of days (not weeks or
            | months) with one single tool (and maybe some scripts for
            | supporting tasks). There is some "scripting" involved for
            | loading and transforming data, but if you already
            | understand data models (and maybe have some SQL
            | experience) it is very easy. The dashboard creation
            | itself does not need any coding at all - just drag and
            | drop and some formulas like sum(amount).
           | 
           | But this a standalone tool and it is hard to integrate it
           | into your own piece of software. From my experience, software
           | developers have a much more complicated view on data
           | handling. Often this is just the complexity of their use
           | cases, sometimes it is just a lack of knowledge of data
           | preparation for analytics use cases.
           | 
           | Another part which complicates stuff greatly is the focus on
           | use-cases involving cloud storage and doing all the
           | transformations on distributed systems.
           | 
            | And it is often not clear what amount of data we are
            | talking about, and whether it is realtime (streaming)
            | data or not. There is a big difference in the possible
            | approaches if you have six hours to prepare data versus
            | having to refresh it every second (or whenever new data
            | arrives, etc.).
           | 
           | Long story short: Yes it is complicated to grasp. There is
           | also a big difference if you use the data for normal
           | analytics use cases in a company (mostly read only data
           | models) or if you use the data in a (big tech) product.
           | 
            | I would suggest starting simple by looking into a "query
            | engine" to extract some data from somewhere and then
            | doing some transformations with pandas/polars/cubejs for
            | basic understanding, as in the sketch below. You will
            | need some schedulers and orchestration on the way
            | forward, but this will depend on the real use cases and
            | environment you are in.
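            | 
            | Something like this is enough to start (a sketch; file
            | and column names are made up):
            | 
            |     import duckdb
            |     import polars as pl
            | 
            |     # One in-process query engine for extraction, one
            |     # dataframe library for transformation - no cluster,
            |     # catalog, or scheduler needed yet.
            |     raw = duckdb.sql("SELECT * FROM 'events.parquet'").pl()
            |     daily = raw.group_by("event_date").agg(
            |         pl.col("amount").sum().alias("total"))
            |     print(daily)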
        
             | bdndndndbve wrote:
             | I would argue that stuff like Iceberg is really aimed at
             | Data Platform Engineers, not BI analysts. Companies I've
             | worked with in the past have 10-15 people on a Platform
             | team that work directly with stuff like this, to offer
             | analysts and data scientists a view into the company's
             | data.
        
         | nxm wrote:
         | What's your use case? Iceberg is meant for analytical workloads
        
       | teleforce wrote:
       | Apache Iceberg is one of the emerging Open Table Formats in
       | addition to Delta Lake and Apache Hudi [1].
       | 
       | [1] Open Table Formats:
       | 
       | https://www.starburst.io/data-glossary/open-table-formats/
        
         | jl6 wrote:
         | The table on that page makes it look like all three of these
         | are very similar, with schema evolution and partition evolution
         | being the key differences. Is that really it?
         | 
         | I'd also love to see a good comparison between "regular"
         | Iceberg and AWS's new S3 Tables.
        
           | benesch wrote:
           | Yes, the three major open table formats are all quite
           | similar.
           | 
           | When AWS launched S3 Tables last month I wrote a blog post
           | with my first impressions:
           | https://meltware.com/2024/12/04/s3-tables
           | 
            | There may be more in-depth comparisons available by now,
            | but it's at least a good starting point for understanding
            | how S3 Tables integrates with Iceberg.
        
             | jl6 wrote:
             | Cool, thank you. It feels like Athena + S3 Tables has the
             | potential to be a very attractive serverless data lakehouse
             | combo.
        
         | Icathian wrote:
          | I think this mischaracterizes the state of the space. As of
          | a few months ago, Iceberg is the winner of this
          | competition. All major vendors who didn't directly invent
          | one of the others now support Iceberg or have announced
          | plans to do so.
          | 
          | Building lakehouse products on any table format but Iceberg
          | at this point seems to me like a mistake.
        
           | bdndndndbve wrote:
           | Yeah working in the data space I see a ton of customers using
           | Iceberg and some using Delta Lake if they're already a
           | Databricks shop. Virtually no Hudi.
        
       | volderette wrote:
       | How do you query your iceberg tables? We are looking into moving
       | away from Bigquery and Starrocks [1] looks like a good option.
       | 
       | [1] https://www.starrocks.io/
        
         | macqm wrote:
         | Trino is pretty good (open source presto).
         | 
         | https://trino.io/
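          | 
          | Querying Iceberg through it from Python is a couple of
          | lines (a sketch; host, catalog, schema, and table names are
          | all hypothetical):
          | 
          |     from trino.dbapi import connect
          | 
          |     # Uses the trino client package; the cluster is assumed
          |     # to have an Iceberg catalog configured.
          |     conn = connect(host="trino.example.com", port=8080,
          |                    user="analyst", catalog="iceberg",
          |                    schema="analytics")
          |     cur = conn.cursor()
          |     cur.execute("SELECT event_date, count(*) FROM events "
          |                 "GROUP BY 1 ORDER BY 1")
          |     print(cur.fetchall())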
        
         | jl6 wrote:
         | Why away from bigquery? Just wondering if it's a cost thing.
        
           | volderette wrote:
           | Yes, mainly driven by cost. BigQuery is really unpredictable
           | when dashboards with filters are being used intensively by
           | users. We don't want to limit our users in their data
           | exploration.
        
         | mritchie712 wrote:
         | right now, starrocks or trino are likely your best options, but
         | all the major query engines (clickhouse, snowflake, databricks,
         | even duckdb) are improving their support too.
        
       | pradeepchhetri wrote:
       | ClickHouse has a solid Iceberg integration. It has an Iceberg
       | table function[0] and Iceberg table engine[1] for interacting
       | with Iceberg data stored in s3, gcs, azure, hadoop etc.
       | 
       | [0] https://clickhouse.com/docs/en/sql-reference/table-
       | functions...
       | 
       | [1] https://clickhouse.com/docs/en/engines/table-
       | engines/integra...
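        | 
        | The read path looks roughly like this (a sketch; server and
        | bucket are hypothetical, and credentials can be passed as
        | extra arguments to iceberg()):
        | 
        |     import clickhouse_connect
        | 
        |     # The iceberg() table function reads an Iceberg table
        |     # stored in object storage.
        |     client = clickhouse_connect.get_client(host="localhost")
        |     result = client.query(
        |         "SELECT count(*) FROM iceberg("
        |         "'https://my-bucket.s3.amazonaws.com/warehouse/db/events')")
        |     print(result.result_rows)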
        
         | tlarkworthy wrote:
          | I would say it doesn't yet, but they are actively working
          | on it:
          | 
          | https://github.com/ClickHouse/ClickHouse/issues/52054
        
           | mritchie712 wrote:
            | duckdb has the same issue[0]; I submitted a PR, but it's
            | stalled
           | 
           | 0 - https://github.com/duckdb/duckdb-iceberg/pull/78
        
       | mkl95 wrote:
       | Iceberg on S3 tables is going to be a hot topic in the next few
       | years.
        
       | nikolatt wrote:
        | I've been looking at Iceberg for a while, but in the end went
        | with Delta Lake because it doesn't have a dependency on a
        | catalog. It also has good support for reading and writing
        | without needing Spark.
       | 
       | Does anyone know if Iceberg has plans to support similar use
       | cases?
        
         | pammf wrote:
          | Iceberg has the Hadoop catalog, which also relies only on
          | directories and files.
         | 
          | That said, a catalog (which Delta can also have) helps a
          | lot to keep things tidy. For example, I can write a dataset
          | with Spark, transform it with dbt and a query engine (such
          | as Trino), and consume the resulting dataset with any
          | client that supports Iceberg. With a catalog, all of this
          | happens without having to register the dataset location in
          | each of these components.
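          | 
          | That convenience in a sketch (the catalog URI and names are
          | made up):
          | 
          |     from pyiceberg.catalog import load_catalog
          | 
          |     # Any client resolves the table by name through the
          |     # catalog; no storage paths appear in user code.
          |     catalog = load_catalog("default", uri="http://localhost:8181")
          |     table = catalog.load_table("analytics.events")
          |     print(table.scan().to_pandas())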
        
         | mritchie712 wrote:
         | Why don't you want a catalog? The SQL or REST catalogs are
         | pretty light to set up. I have my eye on lakekeeper[0], but
         | Polaris (from Snowflake) is a good option too.
         | 
         | PyIceberg is likely the easiest way to write without Spark.
         | 
         | 0 - https://github.com/lakekeeper/lakekeeper
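          | 
          | A Spark-free write with recent PyIceberg looks roughly like
          | this (a sketch; the REST catalog URI, warehouse, and names
          | are hypothetical):
          | 
          |     import pyarrow as pa
          |     from pyiceberg.catalog import load_catalog
          | 
          |     catalog = load_catalog("default",
          |                            uri="http://localhost:8181",
          |                            warehouse="s3://my-bucket/warehouse")
          | 
          |     data = pa.table({"id": [1, 2], "name": ["a", "b"]})
          |     tbl = catalog.create_table("db.demo", schema=data.schema)
          |     tbl.append(data)  # commits a new snapshot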
        
           | anktor wrote:
            | PyIceberg is nice, but we had to drop it because it lags
            | behind the Java API and it's unclear when it will catch
            | up, so I'd check feature support depending on what you
            | need.
        
             | mritchie712 wrote:
             | what are you using instead?
        
       | crorella wrote:
        | What I like about Iceberg is that table partitions are not
        | tightly coupled to the subfolder structure of the storage
        | layer - at least logically; at the end of the day the
        | partitions are still subfolders with files - but the metadata
        | is not tied to that, so you can change a table's partitioning
        | going forward and still query a mix of old and new partition
        | time ranges.
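        | 
        | With the Iceberg Spark SQL extensions, that evolution is just
        | (a sketch; names made up):
        | 
        |     from pyspark.sql import SparkSession
        | 
        |     # Old data files keep the old layout; new writes use the
        |     # new spec, and queries can span both.
        |     spark = SparkSession.builder.getOrCreate()
        |     spark.sql(
        |         "ALTER TABLE demo.db.events DROP PARTITION FIELD days(ts)")
        |     spark.sql(
        |         "ALTER TABLE demo.db.events ADD PARTITION FIELD hours(ts)")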
       | 
        | On the other hand, since one of the use cases it was created
        | for at Netflix was consuming directly from real-time systems,
        | managing file creation when the data is updated is less
        | trivial (the CoW vs. MoR problem, and how to compact small
        | files), which becomes important on multi-petabyte tables with
        | lots of users and frequent updates. This is something I
        | assume not a lot of companies pay much attention to (heck,
        | not even at Netflix), and it has big performance and cost
        | implications.
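        | 
        | The CoW vs. MoR choice itself is just per-table properties (a
        | sketch; names made up):
        | 
        |     from pyspark.sql import SparkSession
        | 
        |     # MoR defers rewrites to read time via delete files;
        |     # copy-on-write rewrites the data files at write time.
        |     spark = SparkSession.builder.getOrCreate()
        |     spark.sql("""
        |         ALTER TABLE demo.db.events SET TBLPROPERTIES (
        |             'write.update.mode' = 'merge-on-read',
        |             'write.delete.mode' = 'merge-on-read')""")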
        
       | mritchie712 wrote:
       | If you're looking to give Iceberg a spin, here's how to get it
       | running locally, on AWS[0] and on GCP[1]. The posts use DuckDB as
       | the query engine, but you could swap in Trino (or even chdb /
       | clickhouse).
       | 
       | 0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws
       | 
       | 1 - https://www.definite.app/blog/cloud-iceberg-duckdb
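        | 
        | The DuckDB side of such a setup is minimal (a sketch; the
        | table location is hypothetical, and S3 access additionally
        | needs the httpfs extension plus credentials):
        | 
        |     import duckdb
        | 
        |     con = duckdb.connect()
        |     con.install_extension("iceberg")
        |     con.load_extension("iceberg")
        |     print(con.sql(
        |         "SELECT count(*) FROM iceberg_scan("
        |         "'s3://my-bucket/warehouse/db/events')").fetchall())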
        
         | romperstomper wrote:
         | you can just use iceberg tables with AWS Glue/Athena
        
       | dm03514 wrote:
        | I think Iceberg solves a lot of big-data problems around
        | handling huge amounts of data on blob storage, including
        | partitioning, compaction, and ACID semantics.
        | 
        | I really like the way the catalog standard can decouple the
        | underlying storage as well.
        | 
        | My biggest concern is how inaccessible the implementations
        | are: Java/Spark has the only mature implementation right now.
        | Even DuckDB doesn't support writing yet.
       | 
       | I built out a tool to stream data to iceberg which uses the
       | python iceberg client:
       | 
       | https://www.linkedin.com/pulse/streaming-iceberg-using-sqlfl...
        
       | dangoodmanUT wrote:
        | iceberg is plagued by the problems it tries to solve, like
        | being too tied to Spark just to write data
        
         | apwell23 wrote:
         | huh what? We use iceberg extensively, never used spark.
        
       | apwell23 wrote:
        | I am a stockholder in Snowflake, and Iceberg's ascendance
        | seems to coincide with SNOW's downfall.
        | 
        | Does the query engine's value-add justify Snowflake's
        | valuation? Their data marketplace thing didn't seem to have
        | actually worked.
        
       | rdegges wrote:
       | OneHouse also has a fantastic iceberg implementation (they're the
       | team behind Apache Hudi) and does a ton of great interop work:
       | https://www.onehouse.ai/blog/comprehensive-data-catalog-comp...
       | && https://www.onehouse.ai/blog/open-data-foundations-with-
       | apac...
        
       | jeffhuys wrote:
       | Looks good, but come on... at least try to open your website on a
       | mobile device.
        
         | malnourish wrote:
         | It loads poorly and causes my 3080 to turn on its fan when I
         | load it in up-to-date Firefox on Windows.
        
       | jmakov wrote:
       | Why would one choose this instead of DeltaLake?
        
       | npalli wrote:
        | Are there robust non-JVM implementations of Iceberg
        | currently? Sorry to say, but recommending JVM ecosystems for
        | large data just feels like professional malpractice at this
        | point. Whether it's deployment complexity, resource overhead,
        | tool sprawl, or operational complexity, the ecosystem seems
        | to attract people who solve only 50% of the problem and have
        | another tool to solve the rest, which in turn only solves
        | 50%, etc., ad infinitum. The popularity of solutions like
        | Snowflake, ClickHouse, and DuckDB is not an accident, and is
        | the direction everything should go. I hear Snowflake will
        | adopt this in the future; that is good news.
        
         | juunpp wrote:
          | > who solve only 50% of the problem and have another tool
          | to solve the rest, which in turn only solves 50%, etc., ad
          | infinitum
         | 
         | This actually converges to 1:
         | 
         | 1/2 + 1/4 + 1/8 + 1/16 + ... = 1
         | 
         | You just need 30kloc of maven in your pom before you get there.
        
       | chehai wrote:
        | In order to get good query performance from Iceberg, we have
        | to run compaction frequently, and compaction turns out to be
        | very expensive. Any tips for minimizing compaction while
        | keeping queries fast?
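        | 
        | For reference, this is the kind of run I mean; scoping it
        | with a filter is one way to bound the cost per run (a sketch;
        | names made up):
        | 
        |     from pyspark.sql import SparkSession
        | 
        |     # Iceberg's Spark maintenance procedure; the where clause
        |     # limits the rewrite to recent files instead of the whole
        |     # table.
        |     spark = SparkSession.builder.getOrCreate()
        |     spark.sql("""
        |         CALL demo.system.rewrite_data_files(
        |             table => 'db.events',
        |             where => 'event_date >= DATE ''2025-01-20''')""")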
        
       ___________________________________________________________________
       (page generated 2025-01-26 23:01 UTC)