[HN Gopher] Apache Iceberg
___________________________________________________________________
Apache Iceberg
Author : jacobmarble
Score : 176 points
Date : 2025-01-23 01:03 UTC (3 days ago)
(HTM) web link (iceberg.apache.org)
(TXT) w3m dump (iceberg.apache.org)
| vonnik wrote:
| Curious to what extent Iceberg enables data composability and
| what the best complements and alternatives are.
| lmm wrote:
| Delta Lake is the main competitor. There's a lot of convergence
| going on, because everyone wants a common format and it's
| pretty clear what the desirable features are. Ultimately it
| becomes just boring infrastructure IMO.
| nxm wrote:
| It allows you to be query-engine agnostic - query the same data
| via Spark, Snowflake, or Trino. Granted, performance may suffer
| somewhat vs. Snowflake internal tables, because certain
| performance optimizations aren't available.
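|
| For illustration, a minimal sketch of that with PyIceberg (the
| catalog and table names are made up; catalog config is read from
| ~/.pyiceberg.yaml):
|
|     from pyiceberg.catalog import load_catalog
|
|     catalog = load_catalog("my_catalog")
|     table = catalog.load_table("analytics.events")
|
|     # Same files, one engine: materialize a filtered scan as pandas.
|     df = table.scan(row_filter="event_date >= '2025-01-01'").to_pandas()
|
|     # Same files, another engine: hand the scan to DuckDB for SQL.
|     con = table.scan().to_duckdb(table_name="events")
|     print(con.sql("SELECT count(*) FROM events").fetchall())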
| varsketiz wrote:
| I'm somewhat surprised to see it here - Iceberg has been around
| for some time already.
| mrbluecoat wrote:
| Yeah, I was confused as well. It was like seeing "postage
| stamps" on the HN front page.
| benjaminwootton wrote:
| It's been on the up in recent years, though, as it appears to
| have won the format wars. Every vendor is rallying around it,
| and the end of 2024 brought new open source catalogues and
| support from AWS.
| mritchie712 wrote:
| yeah, I'll admit I was worried when Databricks acquired
| Tabular[0] that it would hurt Iceberg's momentum (e.g.
| databricks would push delta instead), but it seems the
| opposite has happened.
|
| 0 - https://www.definite.app/blog/databricks-tabular-
| acquisition
| twoodfin wrote:
| I was more worried--and continue to be so--that Databricks
| will bring the rat's nest of complexity and pseudo-open
| source model that characterizes Delta to the future of
| Iceberg.
| gopalv wrote:
| Hidden partitioning is the most interesting Iceberg feature,
| because most of the very large datasets are timeseries fact
| tables.
|
| I don't remember seeing that in Delta Lake [1], probably because
| the industry-standard benchmarks use date as a plain column
| (TPC-H) or join to a date dimension table (TPC-DS), and query on
| dates rather than timestamp ranges.
|
| [1] - https://github.com/delta-io/delta/issues/490
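|
| As a rough sketch of what hidden partitioning looks like from
| PyIceberg (all names here are hypothetical):
|
|     from pyiceberg.catalog import load_catalog
|     from pyiceberg.partitioning import PartitionField, PartitionSpec
|     from pyiceberg.schema import Schema
|     from pyiceberg.transforms import DayTransform
|     from pyiceberg.types import NestedField, StringType, TimestamptzType
|
|     schema = Schema(
|         NestedField(1, "ts", TimestamptzType(), required=True),
|         NestedField(2, "src_ip", StringType(), required=False),
|     )
|     # Partition by day(ts): writers derive the partition value from
|     # ts, and readers just filter on ts - there is no separate,
|     # user-visible date column to get wrong.
|     spec = PartitionSpec(
|         PartitionField(source_id=1, field_id=1000,
|                        transform=DayTransform(), name="ts_day")
|     )
|     load_catalog("my_catalog").create_table(
|         "db.fact_events", schema=schema, partition_spec=spec
|     )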
| fiddlerwoaroof wrote:
| Delta Lake now has Hilbert-curve based clustering which solves
| a lot of the downsides of hive partitioning
| gopalv wrote:
| > Hilbert-curve based clustering which solves a lot of the
| downsides of hive partitioning
|
| Yes, that solves the two-column high-NDV partitioning issue - if
| you want your IP traffic queryable by both destination and
| source, you need a space-filling curve; Z-curves are a little
| easier to implement with bit twiddling for fixed-width types.
|
| Hive would either write a large number of small files when
| partitioned like that, or you would lose efficiency when
| scanning on the non-partitioned column.
|
| So clustering does fix the high-NDV issue, but Netflix built
| hidden partitioning specifically to avoid sorting on high-NDV
| columns and to reduce the sort complexity of writes (most daily
| writes won't need any partitioned inserts at all), whereas
| clustering on timestamp forces a sort even if the write only
| touches a single day.
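|
| For reference, delta-rs exposes Z-ordering (not Hilbert curves)
| from Python; a sketch with a made-up path and columns:
|
|     from deltalake import DeltaTable
|
|     dt = DeltaTable("s3://bucket/ip_traffic")
|     # Co-locate rows that are close in (src_ip, dst_ip) space so
|     # scans filtered on either column touch fewer files.
|     dt.optimize.z_order(["src_ip", "dst_ip"])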
| autodidacticon wrote:
| What is NDV partitioning?
| artwr wrote:
| NDV = number of distinct values. Here, partitioning on
| high-cardinality columns, essentially.
| honestSysAdmin wrote:
| Iceberg is a pretty cool guy, he consolidates the Parquet and
| doesn't afraid of anything.
| rubenvanwyk wrote:
| And yet there's still no straightforward way to write directly to
| Iceberg tables from Javascript as far as I know.
| Rhubarrbb wrote:
| Writing to catalogs is still pretty new. Databricks has recently
| been pushing delta-kernel-rs, which DuckDB has a connector for,
| and there's support for writing from Python with the Polars
| package through delta-rs. For small-time developers this has
| been pretty helpful for me, and influential in picking Delta
| Lake over Iceberg.
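|
| For example, a minimal write from Polars (path is a placeholder):
|
|     import polars as pl
|
|     df = pl.DataFrame({"id": [1, 2, 3], "amount": [9.5, 3.2, 7.1]})
|     df.write_delta("./my_delta_table")                  # create
|     df.write_delta("./my_delta_table", mode="append")   # append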
| kermatt wrote:
| > influential in picking delta lake over iceberg
|
| Can you expand on those reasons a bit?
|
| The dependency on a catalog in Iceberg made it more complicated
| for simple cases than Delta, where a directory hierarchy is
| sufficient - if I understood the PyIceberg docs correctly.
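|
| As far as I can tell, that "directory is the table" model looks
| roughly like this with delta-rs, no catalog involved (path is
| made up):
|
|     import pyarrow as pa
|     from deltalake import DeltaTable, write_deltalake
|
|     # The directory itself is the table: create it, then read it back.
|     write_deltalake("./events", pa.table({"id": [1, 2]}))
|     print(DeltaTable("./events").to_pandas())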
| enether wrote:
| for some reason it's really cumbersome to access this tech
| peschu wrote:
| I agree. As a long-time business intelligence developer, I'm
| still confused and astounded by all the tooling and bits and
| pieces seemingly necessary to create analytics/dashboards with
| open source tools.
|
| For years I used a proprietary solution like Qlik Sense for the
| whole journey from data extraction to a finished dashboard
| (mostly on-prem). Going from raw data to a finished dashboard is
| a matter of days (not weeks/months) with one single tool (and
| maybe some scripts for supporting tasks). There is some
| "scripting" involved for loading and transforming data, but if
| you already understand data models (and maybe have some SQL
| experience) it is very easy. The dashboard creation itself does
| not need any coding at all - just drag and drop and some
| formulas like sum(amount).
|
| But this is a standalone tool, and it is hard to integrate into
| your own piece of software. From my experience, software
| developers have a much more complicated view of data handling.
| Often this reflects the complexity of their use cases; sometimes
| it is just a lack of knowledge of data preparation for analytics
| use cases.
|
| Another part which complicates stuff greatly is the focus on
| use-cases involving cloud storage and doing all the
| transformations on distributed systems.
|
| And it is often not clear what amount of data we are talking
| about, and whether it is real-time (streaming) data or not.
| There is a big difference in the possible approaches if you have
| six hours to prepare data versus refreshing it every second (or
| whenever new data arrives, etc.).
|
| Long story short: yes, it is complicated to grasp. There is also
| a big difference between using the data for normal analytics use
| cases in a company (mostly read-only data models) and using the
| data in a (big tech) product.
|
| I would suggest starting simple: look into a "query engine" to
| extract some data from somewhere, then do some transformations
| with pandas/polars/cubejs for basic understanding (see the
| sketch below). You will need schedulers and orchestration
| further along, but that depends on the real use cases and
| environment you are in.
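|
| A minimal starting point along those lines (file and column
| names are made up):
|
|     import duckdb
|     import polars as pl
|
|     # "Query engine" step: pull raw data with SQL.
|     raw = duckdb.sql("SELECT * FROM read_csv_auto('sales.csv')").pl()
|
|     # Transformation step: aggregate for a dashboard.
|     summary = raw.group_by("region").agg(
|         pl.sum("amount").alias("total_amount")
|     )
|     print(summary)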
| bdndndndbve wrote:
| I would argue that stuff like Iceberg is really aimed at
| Data Platform Engineers, not BI analysts. Companies I've
| worked with in the past have 10-15 people on a Platform
| team that work directly with stuff like this, to offer
| analysts and data scientists a view into the company's
| data.
| nxm wrote:
| What's your use case? Iceberg is meant for analytical workloads.
| teleforce wrote:
| Apache Iceberg is one of the emerging Open Table Formats in
| addition to Delta Lake and Apache Hudi [1].
|
| [1] Open Table Formats:
|
| https://www.starburst.io/data-glossary/open-table-formats/
| jl6 wrote:
| The table on that page makes it look like all three of these
| are very similar, with schema evolution and partition evolution
| being the key differences. Is that really it?
|
| I'd also love to see a good comparison between "regular"
| Iceberg and AWS's new S3 Tables.
| benesch wrote:
| Yes, the three major open table formats are all quite
| similar.
|
| When AWS launched S3 Tables last month I wrote a blog post
| with my first impressions:
| https://meltware.com/2024/12/04/s3-tables
|
| There may be more in-depth comparisons available by now, but
| it's at least a good starting point for understanding how S3
| Tables integrates with Iceberg.
| jl6 wrote:
| Cool, thank you. It feels like Athena + S3 Tables has the
| potential to be a very attractive serverless data lakehouse
| combo.
| Icathian wrote:
| I think this mischaracterizes the state of the space. Iceberg
| has won this competition, as of a few months ago. All major
| vendors who didn't directly invent one of the others now support
| Iceberg or have announced plans to do so.
|
| Building lakehouse products on any table format but Iceberg at
| this point seems to me like it must be a mistake.
| bdndndndbve wrote:
| Yeah working in the data space I see a ton of customers using
| Iceberg and some using Delta Lake if they're already a
| Databricks shop. Virtually no Hudi.
| volderette wrote:
| How do you query your Iceberg tables? We are looking into moving
| away from BigQuery, and StarRocks [1] looks like a good option.
|
| [1] https://www.starrocks.io/
| macqm wrote:
| Trino is pretty good (open source Presto).
|
| https://trino.io/
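|
| For what it's worth, there's also a Python DBAPI client if you
| want to query from code (host and names below are placeholders):
|
|     import trino
|
|     conn = trino.dbapi.connect(
|         host="trino.example.com", port=8080, user="analyst",
|         catalog="iceberg", schema="analytics",
|     )
|     cur = conn.cursor()
|     cur.execute("SELECT count(*) FROM events WHERE ts >= DATE '2025-01-01'")
|     print(cur.fetchone())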
| jl6 wrote:
| Why away from bigquery? Just wondering if it's a cost thing.
| volderette wrote:
| Yes, mainly driven by cost. BigQuery is really unpredictable
| when dashboards with filters are being used intensively by
| users. We don't want to limit our users in their data
| exploration.
| mritchie712 wrote:
| right now, starrocks or trino are likely your best options, but
| all the major query engines (clickhouse, snowflake, databricks,
| even duckdb) are improving their support too.
| pradeepchhetri wrote:
| ClickHouse has a solid Iceberg integration. It has an Iceberg
| table function[0] and Iceberg table engine[1] for interacting
| with Iceberg data stored in s3, gcs, azure, hadoop etc.
|
| [0] https://clickhouse.com/docs/en/sql-reference/table-
| functions...
|
| [1] https://clickhouse.com/docs/en/engines/table-
| engines/integra...
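|
| For example, reading through the iceberg() table function via
| clickhouse-connect (the endpoint and path are placeholders, and
| the function may also need credentials passed in):
|
|     import clickhouse_connect
|
|     client = clickhouse_connect.get_client(host="localhost")
|     result = client.query(
|         "SELECT count(*) FROM "
|         "iceberg('https://bucket.s3.amazonaws.com/warehouse/events')"
|     )
|     print(result.result_rows)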
| tlarkworthy wrote:
| I would say it doesn't yet, but they are actively working on it.
|
| https://github.com/ClickHouse/ClickHouse/issues/52054
| mritchie712 wrote:
| DuckDB has the same issue[0]; I submitted a PR, but it's been
| stalled.
|
| 0 - https://github.com/duckdb/duckdb-iceberg/pull/78
| mkl95 wrote:
| Iceberg on S3 Tables is going to be a hot topic in the next few
| years.
| nikolatt wrote:
| I've been looking at Iceberg for a while, but in the end went
| with Delta Lake because it doesn't have a dependency on a
| catalog. It also has good support for reading and writing
| without needing Spark.
|
| Does anyone know if Iceberg has plans to support similar use
| cases?
| pammf wrote:
| Iceberg has the Hadoop catalog, which also relies only on
| directories and files.
|
| That said, a catalog (which Delta also can have) helps a lot to
| keep things tidy. For example, I can write a dataset with
| Spark, transform it with dbt and a query engine (such as Trino)
| and consume the resulting dataset with any client that supports
| Iceberg. If I use a catalog, this all happens without having to
| register the dataset location in each of these components.
| mritchie712 wrote:
| Why don't you want a catalog? The SQL or REST catalogs are
| pretty light to set up. I have my eye on lakekeeper[0], but
| Polaris (from Snowflake) is a good option too.
|
| PyIceberg is likely the easiest way to write without Spark.
|
| 0 - https://github.com/lakekeeper/lakekeeper
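|
| A sketch of a catalog-backed write without Spark, assuming a
| REST catalog at a placeholder URL and an existing table whose
| schema matches the data:
|
|     import pyarrow as pa
|     from pyiceberg.catalog import load_catalog
|
|     catalog = load_catalog(
|         "default", type="rest", uri="http://localhost:8181"
|     )
|     table = catalog.load_table("analytics.events")
|     # Appends an Arrow table; the schema must line up with the
|     # Iceberg table's schema.
|     table.append(pa.table({"id": pa.array([1, 2, 3], pa.int64())}))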
| anktor wrote:
| PyIceberg is nice, but we had to drop it because it lags behind
| the Java API and it's unclear when it will catch up. Depending
| on which features you need, I'd check first.
| mritchie712 wrote:
| what are you using instead?
| crorella wrote:
| What I like about Iceberg is that table partitions are not
| tightly coupled to the subfolder structure of the storage layer
| (at least logically; at the end of the day the partitions are
| still subfolders with files). Because the metadata is not tied
| to that layout, you can change a table's partitioning going
| forward and still query a mix of old and new partition time
| ranges.
|
| On the other hand, since one of the use cases it was built for
| at Netflix was consuming directly from real-time systems,
| managing file creation when the data is updated is less trivial
| (the CoW vs. MoR problem, and how to compact small files), which
| becomes important on multi-petabyte tables with lots of users
| and frequent updates. This is something I assume not a lot of
| companies pay much attention to (heck, not even Netflix), and it
| has big performance and cost implications.
| mritchie712 wrote:
| If you're looking to give Iceberg a spin, here's how to get it
| running locally, on AWS[0] and on GCP[1]. The posts use DuckDB as
| the query engine, but you could swap in Trino (or even chdb /
| clickhouse).
|
| 0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws
|
| 1 - https://www.definite.app/blog/cloud-iceberg-duckdb
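|
| The local flavor boils down to something like this (the DuckDB
| iceberg extension is read-only at the moment, as noted
| downthread; the path is a placeholder, and S3 access also needs
| httpfs credentials configured):
|
|     import duckdb
|
|     con = duckdb.connect()
|     con.execute("INSTALL iceberg")
|     con.execute("LOAD iceberg")
|     con.sql("""
|         SELECT count(*)
|         FROM iceberg_scan('s3://bucket/warehouse/db/events')
|     """).show()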
| romperstomper wrote:
| You can just use Iceberg tables with AWS Glue/Athena.
| dm03514 wrote:
| I think Iceberg solves a lot of big-data problems around
| handling huge amounts of data on blob storage, including
| partitioning, compaction, and ACID semantics.
|
| I really like the way the catalog standard can decouple the
| underlying storage as well.
|
| My biggest concern is how inaccessible the implementations are;
| Java/Spark has the only mature implementation right now.
|
| Even DuckDB doesn't support writing yet.
|
| I built out a tool to stream data to iceberg which uses the
| python iceberg client:
|
| https://www.linkedin.com/pulse/streaming-iceberg-using-sqlfl...
| dangoodmanUT wrote:
| Iceberg is plagued by the problems it tries to solve, like being
| too tied to Spark just to write data.
| apwell23 wrote:
| huh what? We use iceberg extensively, never used spark.
| apwell23 wrote:
| I am a stockholder in Snowflake, and Iceberg's ascendance seems
| to coincide with SNOW's decline.
|
| Does the query engine's value-add justify Snowflake's valuation?
| Their data marketplace thing didn't seem to have actually
| worked.
| rdegges wrote:
| OneHouse also has a fantastic iceberg implementation (they're the
| team behind Apache Hudi) and does a ton of great interop work:
| https://www.onehouse.ai/blog/comprehensive-data-catalog-comp...
| && https://www.onehouse.ai/blog/open-data-foundations-with-
| apac...
| jeffhuys wrote:
| Looks good, but come on... at least try to open your website on a
| mobile device.
| malnourish wrote:
| It loads poorly and causes my 3080 to turn on its fan when I
| load it in up-to-date Firefox on Windows.
| jmakov wrote:
| Why would one choose this instead of DeltaLake?
| npalli wrote:
| Are there robust non-JVM implementations of Iceberg currently?
| Sorry to say, but recommending JVM ecosystems for large data
| just feels like professional malpractice at this point. Whether
| it's deployment complexity, resource overhead, tool sprawl, or
| operational complexity, the ecosystem seems to attract people
| who solve only 50% of the problem and need another tool to solve
| the rest, which in turn only solves 50%, etc., ad infinitum. The
| popularity of solutions like Snowflake, ClickHouse, or DuckDB is
| not an accident, and it is the direction everything should go. I
| hear Snowflake will adopt this in the future; that is good news.
| juunpp wrote:
| > who solve only 50% of the problem and have another tool to
| solve the rest, which in turn only solves 50% etc.. ad
| infinitum
|
| This actually converges to 1:
|
| 1/2 + 1/4 + 1/8 + 1/16 + ... = 1
|
| You just need 30kloc of maven in your pom before you get there.
| chehai wrote:
| In order to get good query performance from Iceberg, we have to
| run compaction frequently, and compaction turns out to be very
| expensive. Any tips for minimizing compaction while keeping
| queries fast?
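|
| For context, by compaction I mean the stock rewrite_data_files
| Spark procedure; a scoped run looks roughly like this (catalog
| and table names are placeholders, and it assumes the Iceberg
| Spark runtime is configured), where the where clause limits each
| run to recently written partitions instead of the whole table:
|
|     from pyspark.sql import SparkSession
|
|     spark = SparkSession.builder.getOrCreate()
|     spark.sql("""
|         CALL my_catalog.system.rewrite_data_files(
|             table => "db.events",
|             where => "ts >= TIMESTAMP '2025-01-20 00:00:00'"
|         )
|     """)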
___________________________________________________________________
(page generated 2025-01-26 23:01 UTC)