[HN Gopher] Apache Iceberg: the Hadoop of the modern data stack?
___________________________________________________________________
Apache Iceberg: the Hadoop of the modern data stack?
Author : samrohn
Score : 96 points
Date : 2025-03-06 06:53 UTC (16 hours ago)
(HTM) web link (blog.det.life)
(TXT) w3m dump (blog.det.life)
| hendiatris wrote:
| This is a huge challenge with Iceberg. I have found that there is
| substantial bang for your buck in tuning how parquet files are
| written, particularly in terms of row group size and column-level
| bloom filters. In addition to that, I make heavy use of the
| encoding options (dictionary/RLE) while denormalizing data into
| as few files as possible. This has allowed me to rely on DuckDB
| for querying terabytes of data at low cost and acceptable
| performance.
|
| What we are lacking now is tooling that gives you insight into
| how you should configure Iceberg. Does something like this exist?
| I have been looking for something that would show me the query
| plan that is developed from Iceberg metadata, but didn't find
| anything. It would go a long way toward showing where the
| bottleneck is for queries.
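|
| To give a flavor, here is the kind of knob-turning I mean (a
| minimal, untested pyarrow sketch; the column names and sizes
| are made up, and bloom-filter support varies by writer and
| version, so I leave that knob out here):
|
|   import pyarrow as pa
|   import pyarrow.parquet as pq
|
|   table = pa.table({
|       "sensor_id": ["a", "a", "b"],   # hypothetical columns
|       "ts": [1, 2, 3],
|       "value": [0.1, 0.2, 0.3],
|   })
|
|   # Smaller row groups allow finer-grained pruning; larger ones
|   # mean less per-group overhead. Tune against real queries.
|   pq.write_table(
|       table,
|       "readings.parquet",
|       row_group_size=128_000,
|       use_dictionary=["sensor_id"],  # dictionary encoding for low-cardinality columns
|       compression="zstd",
|   )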
| jasonjmcghee wrote:
| Have you written about your parquet strategy anywhere? Or have
| suggested reading related to the tuning you've done? Super
| interested.
| indoordin0saur wrote:
| Also very interested in the parquet tuning. I have been
| building my data lake, and most of the optimization I do is
| just efficient partitioning.
| hendiatris wrote:
| I will write something up when the dust settles, I'm still
| testing things out. It's a project where the data is fairly
| standardized but there is about a petabyte to deal with, so
| I think it makes sense to make investments in efficiency at
| the lower level rather than throw tons of resources at
| it. That has meant a custom parser for the input data
| written in Rust, lots of analysis of the statistics of the
| data, etc. It has been a different approach to data
| engineering and one that I hope we see more of.
|
| Regarding reading materials, I found this DuckDB post to be
| especially helpful in realizing how parquet could be better
| leveraged for efficiency:
| https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...
| Gasp0de wrote:
| Does anyone have a good alternative for storing large numbers of
| very small files that need to be individually queryable? We are
| dealing with a large volume of sensor readings that we need to be
| able to query per sensor and timespan, and we are running into
| the problem mentioned in the article: storing millions of small
| files in S3 is expensive.
| paulsutter wrote:
| If you want to keep them in S3, consolidate into sorted parquet
| files. You get random access to row groups, and only the
| columns you need are read so it's very efficient. DuckDB can
| both build and access these files efficiently. You could
| compact files hourly/nightly/weekly, whatever fits.
|
| Of course, for a simpler solution, you could also use Aurora: a
| clean, scalable Postgres that can survive zone failures.
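|
| Roughly this kind of nightly job (a duckdb sketch; paths and
| column names are hypothetical, and it assumes the httpfs
| extension and S3 credentials are already configured):
|
|   import duckdb
|
|   con = duckdb.connect()
|   # Read the day's small files, sort so row groups cluster by
|   # sensor and time, and write one consolidated parquet file.
|   con.execute("""
|       COPY (
|           SELECT *
|           FROM read_parquet('s3://bucket/raw/2025-03-06/*.parquet')
|           ORDER BY sensor_id, ts
|       ) TO 's3://bucket/compacted/2025-03-06.parquet'
|         (FORMAT parquet, ROW_GROUP_SIZE 100000)
|   """)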
| Gasp0de wrote:
| The problem is that the initial writing is already so
| expensive that I guess we'd have to write multiple sensors into
| the same file instead of having one file per sensor per
| interval. I'll look into parquet access options; if we could
| write 10k sensors into one file but still read a single
| sensor from that file, that could work.
| spothedog1 wrote:
| New S3 Table Buckets [1] do automatic compaction
|
| [1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...
| necrobrit wrote:
| Table buckets are currently quite hard to use for a lot
| of use cases as they _only_ support primitive types. No
| nested types.
|
| Hopefully this will come at some point. Product looks
| very cool otherwise.
| bloomingkales wrote:
| Something like Redis instead? [sensorid-timerange] = value.
| Your key is [sensorid-timerange] to get the values for that
| sensor and that time range.
|
| No more files. You might be able to avoid per usage pricing
| just by hosting this on a regular vps.
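|
| Sketch (redis-py; the hourly bucketing scheme here is my own
| invention, adjust it to your query granularity):
|
|   import redis
|
|   r = redis.Redis()
|
|   # key = "<sensor_id>:<hour bucket>"
|   def store(sensor_id, hour, payload):
|       r.set(f"{sensor_id}:{hour}", payload)
|
|   def fetch_range(sensor_id, hours):
|       # one round trip for all buckets in the requested span
|       return r.mget([f"{sensor_id}:{h}" for h in hours])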
| Gasp0de wrote:
| We use Redis to buffer for a certain time period, and
| then we write data for one sensor for that period to S3.
| However, we fill up large Redis clusters pretty fast, so
| we can only buffer for a shortish period.
| hendiatris wrote:
| You may be able to get close with sufficiently small row
| groups, but you will have to do some tests. You can do this
| in a few hours of work by taking some sensor data, sorting
| it by the identifier, and then writing it to parquet with
| one row group per sensor. You can do this with the
| ParquetWriter class in PyArrow, or something else that
| allows you fine-grained control of how the file is written.
| I just checked and saw that you can have around 7 million
| row groups per file, so you should be fine.
|
| Then spin up duckdb and do some performance tests. I'm not
| sure this will work; there is some overhead with reading
| parquet, which is why small files and row groups are
| discouraged.
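|
| The experiment is only a few lines (pyarrow sketch; the schema
| and the sorted_batches iterable are assumptions, not real code):
|
|   import pyarrow as pa
|   import pyarrow.parquet as pq
|
|   schema = pa.schema([("sensor_id", pa.string()),
|                       ("ts", pa.int64()),
|                       ("value", pa.float64())])
|
|   # sorted_batches: one RecordBatch per sensor, pre-sorted by id
|   with pq.ParquetWriter("sensors.parquet", schema) as writer:
|       for batch in sorted_batches:
|           # each write call becomes its own row group (as long
|           # as it stays under the writer's max row group size)
|           writer.write_batch(batch)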
| 0cf8612b2e1e wrote:
| Why can you not rewrite the initial file into something
| partitioned by sensor+time? Would the one-time job really
| add that much cost vs the additional complexity of
| multiple sensors per file?
|
| Do you ever go back and reaggregate older data into bigger,
| sorted files? That is, maybe you originally partitioned by
| hour, but stale data is so infrequently accessed that you could
| roll up into partitions per week/month/whatever. Depending
| on the specifics, you might save some space from less file
| overhead and better compression statistics.
| ramses0 wrote:
| SeaweedFS? https://news.ycombinator.com/item?id=39235593
| tobias3 wrote:
| I guess we need more requirements from OP, such as whether it
| should be self-hosted or a cloud service.
| this_user wrote:
| Do you absolutely have to write the data to files directly? If
| not, then using a time series database might be the better
| option. Most of them are pretty much designed for workloads
| with large numbers of append operations. You could always
| export to individual files later on if you need it.
|
| Another option if you have enough local storage would be to use
| something like JuiceFS that creates a virtual file system where
| the files are initially written to the local cache before
| JuiceFS writes the data to your S3 provider as larger chunks.
|
| SeaweedFS can do something similar if you configure it the
| right way. But both options require that you have enough
| storage outside of your object storage.
| Gasp0de wrote:
| We tried some readymade options but they were way more
| expensive than our custom-built S3 solution (by roughly a
| factor of 10). I think we tried Timescale and AWS
| Timestream. I haven't heard of SeaweedFS.
| ramses0 wrote:
| https://github.com/seaweedfs/seaweedfs?tab=readme-ov-file#qu...
|
| https://github.com/seaweedfs/seaweedfs/wiki/Cloud-Drive-Bene...
|
| https://github.com/seaweedfs/seaweedfs/wiki/Cloud-Tier
|
| https://github.com/seaweedfs/seaweedfs/wiki/Benchmarks
|
| https://github.com/seaweedfs/seaweedfs/wiki/Words-from-Seawe...
|
| https://github.com/seaweedfs/seaweedfs/wiki/Amazon-S3-API
|
| ...your true issue seems to be that you're using
| the filesystem as the "only" storage layer in play, but you
| also need time and entity querying(!?!).
|
| >> we need to be able to query on a per sensor basis and a
| timespan
|
| ...look at the "Cloud-Tier" wiki page. If you're truly in
| an "everything's hot all the time" situation, you really
| should be using a database. If you're pulling "usually
| recent stuff, occasionally old stuff" then fronting with
| something like SeaweedFS seems like it might "just"
| transparently reduce your overall costs.
|
| Really, I'd nudge towards "write *.txt ; compact ... ;
| SELECT ... && cat *.txt".
|
| Basically, keep your inbound writes cached to (eg) seaweed
| as unit files. "Compact them" every hour by appending rows
| to some appropriate database (I mean: migrate to using
| litefs, turso, postgres, something like that). When you
| read, you may need to supplement "tip" data from your
| incoming files, but the majority should be hitting a "real"
| remote database; there's plenty to choose from!
|
| A nifty note: sqlite can attach multiple DBs at once:
| https://www.sqlite.org/lang_attach.html ...
| https://stackoverflow.com/posts/10020/revisions
|
| ...something like `select * from raw union (select * from
| one_hour) union (select * from today) union (select * from
| historical) ...`
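|
| In python that union looks something like this (sqlite3
| sketch; the db and table names are placeholders):
|
|   import sqlite3
|
|   con = sqlite3.connect("raw.db")
|   con.execute("ATTACH DATABASE 'one_hour.db' AS one_hour")
|   con.execute("ATTACH DATABASE 'historical.db' AS historical")
|
|   rows = con.execute("""
|       SELECT * FROM readings
|       UNION ALL SELECT * FROM one_hour.readings
|       UNION ALL SELECT * FROM historical.readings
|   """).fetchall()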
| alchemist1e9 wrote:
| https://github.com/mxmlnkn/ratarmount
|
| > To use all fsspec features, either install via pip install
| ratarmount[fsspec] or pip install ratarmount[fsspec]. It should
| also suffice to simply pip install fsspec if ratarmountcore is
| already installed.
| themgt wrote:
| I've only played with it a bit, but Nvidia's AIStore project
| seems underappreciated: "lightweight, built-from-scratch storage
| stack tailored for AI applications" + S3 compatible.
|
| https://github.com/NVIDIA/aistore
| paulsutter wrote:
| Does this feel about 3x too verbose, like it's generated?
| jasonjmcghee wrote:
| Idk if it's the verbosity but yes, reads as generated to me.
| Specifically sounds like ChatGPT's writing.
| mritchie712 wrote:
| 100%, might be gpt4.5
| alexmorley wrote:
| Most of these issues will ring true to lots of folks using
| Iceberg at the moment. But this does not:
|
| > Yet, competing table formats like Delta Lake and Hudi mirror
| > this fragmentation. [...] Just as Spark emerged as the
| > dominant engine in the Hadoop ecosystem, a dominant table
| > format and catalog may appear in the Iceberg era.
|
| I think extremely few people are making bets on any other open
| source table format now - that consolidation already happened
| in 2023-2024 (see e.g. Databricks, who have their own competing
| format, leaning heavily into Iceberg; or adoption by all of the
| major data warehouse providers).
| twoodfin wrote:
| Microsoft is right now making a huge bet on Delta by way of
| their "Microsoft Fabric" initiative (as always with Microsoft:
| Is it a product? Is it a branding scheme? Yes.)
|
| They seem to be the only vendor crazy enough to try to fast-
| follow Databricks, who is clearly driving the increasingly
| elaborate and sophisticated Delta ecosystem (check the GitHub
| traffic...)
|
| But Microsoft + Databricks is a lot of momentum for Delta.
|
| On the merits of open & simple, I agree, better for everyone if
| Iceberg wins out--as Iceberg and not as some Frankenstandard
| mashed together with Delta by the force of 1,000 Databricks
| engineers.
| datadrivenangel wrote:
| The only reason Microsoft is using Delta is to emphasize to
| CTOs and investors that Fabric is as good as Databricks, even
| when that is obviously false to anyone who has smelled the
| evaporative scent of vaporware before.
| esafak wrote:
| Microsoft's gonna Microsoft.
| twoodfin wrote:
| Very different business, of course, but Databricks v.
| Fabric reminds me a lot of Slack v. Teams.
|
| Regardless of the relative merits now, I think everyone
| agrees that a few years ago Slack was clearly superior.
| Microsoft could certainly have bought Slack instead of
| pumping probably billions into development, marketing, and
| discounts to destroy them.
|
| I think Microsoft could and would consider buying
| Databricks--$80-100B is a lot, but not record-shattering.
|
| If I were them, though, I'd spend a few billion competing
| as an experiment, first.
| foobiekr wrote:
| Anti-trust is the reason a lot of the kinds of deals
| you're talking about don't happen.
| twoodfin wrote:
| I agree. If the anti-trust regime had been different
| Microsoft would have bought Databricks years ago. Satya
| Nadella has surely been tapping his foot watching their
| valuation grow and grow.
|
| The Trump folks have given mixed messages on the Biden-
| era FTC; I'd put decent odds that, with the right tap
| dancing (sigh), Microsoft could make a blockbuster like
| this work in the B2B space.
| alienreborn wrote:
| Better article (imo) on similar topic:
| https://www.dataengineeringweekly.com/p/is-apache-iceberg-th...
| tomnicholas1 wrote:
| I think the posted article was generated from this one - the
| structure of the content is so similar.
| simlevesque wrote:
| I'm working on an alternative Iceberg client that works better
| for write-heavy use cases. Instead of many smaller files, it
| keeps appending to the same file until it reaches 1 MB, but each
| version gets a new name. Then I update the manifest to the new
| filename and checksum. I keep old files on disk for 60 seconds
| to allow pending queries to finish. I'm also working on auto
| compaction: when I have ten 1 MB files I compact them, same with
| ten 10 MB files, etc.
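|
| The tiering logic is roughly this (python sketch; merge_files
| is a hypothetical helper standing in for the real compaction):
|
|   FAN_IN = 10  # compact once a tier collects ten files
|
|   def maybe_compact(tiers):
|       # tiers[0] holds ~1 MB files, tiers[1] ~10 MB files, etc.
|       for level in range(len(tiers)):
|           if len(tiers[level]) >= FAN_IN:
|               merged = merge_files(tiers[level])  # hypothetical
|               tiers[level] = []
|               if level + 1 == len(tiers):
|                   tiers.append([])
|               tiers[level + 1].append(merged)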
|
| I feel like this could be a game changer for the ecosystem. It's
| more cpu and network heavy for writes but the reads are always
| fast. And the writes are still faster than pyiceberg.
|
| I want to hear opinions, or reasons why this could never work.
| thom wrote:
| Interesting. My personal feeling is that we're slowly headed to
| a world where we can have our cake and eat it: fast bulk
| ingestion, fast OLAP, fast OLTP, low latency, all together in
| the same datastore. I'm hoping we just get to collapse whole
| complex data platforms into a single consistent store with
| great developer experience, and never look back.
| simlevesque wrote:
| I think it's possible too and the Iceberg spec allows it but
| the implementations are not suited for every use case.
| ndm000 wrote:
| I've felt the same way. It's so inefficient to have two
| patterns - OLAP and OLTP - both using SQL interfaces but
| requiring syncing between systems. There are some physical
| limits at play though. OLAP will always take less processing
| and disk usage if the data it needs is all right next to each
| other (columnar storage), whereas OLTP's need for fast writes
| usually means row-based storage is more efficient. I think
| the solution would be one system that stores data
| consistently both ways and knows when to use which method for
| a given query.
| thom wrote:
| In a sense, OLAP is just a series of indexing strategies
| that takes OLTP data and formats it for particular use
| cases (sometimes with eventual consistency). Some of these
| indexing strategies in enterprises today involve building
| out entire bespoke platforms to extract and transform the
| data. Incremental view maintenance is a step in the right
| direction - tools like Materialize give you good
| performance to keep calculated data up to date, and also
| break out of the streaming world of only paying attention
| to recent data. But you need to close the loop and also be
| able to do massive crunchy queries on top of that. I have
| no doubt we'll get there, really exciting times.
| ndm000 wrote:
| Completely agree. All of the pieces are there and it's
| just waiting to be acted upon. I haven't seen any of the
| major players really doubling down on this, but it would be
| so compelling.
| mritchie712 wrote:
| nice! anywhere we can follow your progress?
| simlevesque wrote:
| Not right now, sadly. I have some work obligations taking my
| time, but I can't wait to share more.
|
| I'm using a basic implementation that's not backed by
| iceberg, just Parquet files in hive partitions that I can
| query using DuckDB.
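|
| e.g. (duckdb sketch; the layout is the usual key=value dirs,
| and the column names are placeholders):
|
|   import duckdb
|
|   rows = duckdb.sql("""
|       SELECT *
|       FROM read_parquet('data/*/*.parquet', hive_partitioning = true)
|       WHERE sensor_id = 'a' AND day = '2025-03-06'
|   """).fetchall()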
| FuriouslyAdrift wrote:
| so... sharding?
| chehai wrote:
| This approach reminds me of ClickHouse's MergeTree.
|
| Also, https://paimon.apache.org/ seems to be better for
| streaming use cases.
| mritchie712 wrote:
| This is a bit overblown.
|
| Is Iceberg "easy" to set up? No.
|
| Can you get set up in a week? Yes.
|
| If you really need a datalake, spending a week setting it up is
| not so bad. We have a guide[0] here that will get you started in
| under an hour.
|
| For smaller (e.g. under 10 TB) data where you don't need real-
| time, DuckDB is becoming a really solid option. Here's one
| setup[1] we've played around with using Arrow Flight.
|
| If you don't want to mess with any of this, we[2] spin it all up
| for you.
|
| 0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws
|
| 1 - https://www.definite.app/blog/duck-takes-flight
|
| 2 - https://www.definite.app/
| simlevesque wrote:
| I think Iceberg can work in real time but the current
| implementations make it impossible.
|
| I have a vision for a way to make it work. I made another
| comment here. Your blog posts were helpful; I dug a bit into
| the Duck Takes Flight code in Python and Rust.
| pid-1 wrote:
| If you're already in AWS, why wouldn't you use AWS Glue Catalog
| + AWS SDK for pandas + Athena?
|
| You can set up a data lake, save data, and start running queries
| in like 10 minutes with this setup.
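|
| The whole loop is a few calls (awswrangler sketch; the bucket,
| database, and table names are placeholders):
|
|   import awswrangler as wr
|   import pandas as pd
|
|   df = pd.DataFrame({"sensor_id": ["a"], "value": [0.1]})
|
|   # writes parquet to S3 and registers the table in Glue
|   wr.s3.to_parquet(df=df, path="s3://my-bucket/lake/readings/",
|                    dataset=True, database="lake", table="readings")
|
|   out = wr.athena.read_sql_query("SELECT * FROM readings",
|                                  database="lake")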
| thedougd wrote:
| These days you can 'just' create an S3 tables bucket.
| https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...
| mritchie712 wrote:
| Agreed.
|
| A lot of people would worry about "vendor lock-in"
| here, but it's certainly convenient.
| tsss wrote:
| Athena is really expensive though and you will often run into
| a hard limit on the size of your query.
| whalesalad wrote:
| heads up, the logo on your site needs to be 2x'd in pixel
| density; it comes across as blurry on hidpi displays. Or convert
| it to an svg/vector.
| mritchie712 wrote:
| fixed!
| zhousun wrote:
| The one part of the data stack Iceberg (or the lakehouse) will
| never replace is OLTP systems: for high-concurrency updates,
| optimistic concurrency control on an object store is simply a
| no-go.
|
| Iceberg out of the box is NOT good at streaming use cases;
| unlike formats like Hudi or Paimon, the table format has no
| concept of merge/index. However, the beauty of Iceberg is that
| it is very unopinionated, so it is indeed possible to design an
| engine that stream-writes to Iceberg. As far as I know this is
| how engines like Upsolver were implemented:
|
| 1. Keep an in-memory buffer to track incoming rows before
| flushing a version to Iceberg (every 10s to a few minutes).
|
| 2. Build an indexing structure to write position deletes /
| deletion vectors instead of equality deletes.
|
| 3. The writer will also try to merge small files and optimize
| the table.
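|
| In sketch form (python; all names here are mine, not how
| Upsolver actually implements it):
|
|   import time
|
|   FLUSH_INTERVAL = 10  # seconds
|   buffer = []
|
|   def on_row(row):
|       buffer.append(row)          # step 1: in-memory buffer
|
|   def flush_loop(commit_snapshot):
|       while True:
|           time.sleep(FLUSH_INTERVAL)
|           if buffer:
|               batch = buffer.copy()
|               buffer.clear()
|               # steps 2-3: write data files + position deletes,
|               # then commit a new Iceberg snapshot for the batch
|               commit_snapshot(batch)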
|
| And stay tuned, we at https://www.mooncake.dev/ are working on a
| solution to mirror a postgres table to iceberg, and keep them
| always up-to-date.
| robertkoss wrote:
| This article is just shameless advertising for Estuary Flow, a
| company that the author is working for. "Operational Maturity",
| as if Iceberg, Delta or Hudi are not mature. These are battle-
| tested frameworks that have been in production for years. The
| "small files problem" is not really a problem because every
| framework supports some way of compacting smaller files. Just run
| a nightly job that compacts the small files and you're good to go.
| prpl wrote:
| AWS can do it for you with S3 tables.
| theyinwhy wrote:
| What's a good alternative? Google BigQuery?
| datax2 wrote:
| "Hadoop's meteoric rise led many organizations to implement it
| without understanding its complexities, often resulting in
| underutilized clusters or over-engineered architectures. Iceberg
| is walking a similar path."
|
| This pain is too real, and too close to home. I've seen this
| outcome turn the entire business off of consuming their data via
| Hadoop because it turns into a wasteland of delayed deliveries,
| broken datasets, ops teams who cannot scale, and architects
| overselling overly robust designs.
|
| I've tried to scale down Hadoop to the business user with visual
| ETL tools like Alteryx, but there again compatibility between
| Alteryx and Hadoop sucks via ODBC connectors. I came from an
| AWS-based stack into a poorly leapfrogged data stack, and it's
| hard not to pull my hair out between the business struggling to
| use it and infra + ops not keeping up. Now these teams want to
| push to Iceberg or BigQuery while ignoring the mountains of tech
| debt they have created.
|
| Don't get me wrong, Hadoop isn't a bad idea; it's just complex and
| a time suck, and unless you have time to dedicate to properly
| deploying these solutions, which most businesses do not, your
| implementation will suffer and your business will suffer.
|
| "While the parallels to Hadoop are striking, we also have the
| opportunity to avoid its pitfalls." no one in IT learns from
| their failures unless they are writing the checks, most will flip
| before they feel the pain.
___________________________________________________________________
(page generated 2025-03-06 23:01 UTC)