[HN Gopher] Apache Iceberg: the Hadoop of the modern data stack?
       ___________________________________________________________________
        
        Apache Iceberg: the Hadoop of the modern data stack?
        
       Author : samrohn
       Score  : 96 points
       Date   : 2025-03-06 06:53 UTC (16 hours ago)
        
 (HTM) web link (blog.det.life)
 (TXT) w3m dump (blog.det.life)
        
       | hendiatris wrote:
       | This is a huge challenge with Iceberg. I have found that there is
       | substantial bang for your buck in tuning how parquet files are
       | written, particularly in terms of row group size and column-level
       | bloom filters. In addition to that, I make heavy use of the
       | encoding options (dictionary/RLE) while denormalizing data into
       | as few files as possible. This has allowed me to rely on DuckDB
       | for querying terabytes of data at low cost and acceptable
       | performance.
       | 
       | What we are lacking now is tooling that gives you insight into
       | how you should configure Iceberg. Does something like this exist?
        | I have been looking for something that would show me the query
        | plan that is developed from Iceberg metadata, but haven't found
        | anything. It would go a long way toward showing where the
        | bottlenecks are for queries.
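        | 
        | To make that concrete, here is a minimal PyArrow sketch of this
        | kind of tuning (the column names, row-group size and codec are
        | illustrative assumptions, not a recommendation):
        | 
        |     import duckdb
        |     import pyarrow as pa
        |     import pyarrow.parquet as pq
        | 
        |     # toy table: one low-cardinality id column plus readings
        |     table = pa.table({
        |         "sensor_id": ["a", "a", "b", "b"],
        |         "ts": [1, 2, 1, 2],
        |         "value": [0.1, 0.2, 0.3, 0.4],
        |     })
        | 
        |     pq.write_table(
        |         table.sort_by([("sensor_id", "ascending"),
        |                        ("ts", "ascending")]),
        |         "readings.parquet",
        |         row_group_size=128_000,        # tune per workload
        |         use_dictionary=["sensor_id"],  # dictionary/RLE here
        |         compression="zstd",
        |     )
        | 
        |     # DuckDB prunes row groups/columns via parquet metadata
        |     duckdb.sql("SELECT avg(value) FROM 'readings.parquet' "
        |                "WHERE sensor_id = 'b'").show()
        | 
        | Newer Parquet writers can also emit column-level bloom filters;
        | whether and how depends on the library version, so check your
        | writer's options.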
        
         | jasonjmcghee wrote:
          | Have you written about your parquet strategy anywhere? Or do
          | you have suggested reading related to the tuning you've done?
          | Super interested.
        
           | indoordin0saur wrote:
            | Also very interested in the parquet tuning. I have been
            | building my data lake, and most of the optimization I do is
            | just efficient partitioning.
        
             | hendiatris wrote:
              | I will write something up when the dust settles; I'm still
              | testing things out. It's a project where the data is fairly
             | standardized but there is about a petabyte to deal with, so
             | I think it makes sense to make investments in efficiency at
              | the lower level rather than throw tons of resources at
             | it. That has meant a custom parser for the input data
             | written in Rust, lots of analysis of the statistics of the
             | data, etc. It has been a different approach to data
             | engineering and one that I hope we see more of.
             | 
             | Regarding reading materials, I found this DuckDB post to be
             | especially helpful in realizing how parquet could be better
             | leveraged for efficiency:
             | https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-
             | the-...
        
       | Gasp0de wrote:
        | Does anyone have a good alternative for storing large numbers
        | of very small files that need to be individually queryable? We
        | are dealing with a large volume of sensor readings that we need
        | to query per sensor and over a timespan, and we are running
        | into the problem mentioned in the article: storing millions of
        | small files in S3 is expensive.
        
         | paulsutter wrote:
         | If you want to keep them in S3, consolidate into sorted parquet
         | files. You get random access to row groups, and only the
         | columns you need are read so it's very efficient. DuckDB can
         | both build and access these files efficiently. You could
          | compact files hourly/nightly/weekly, whatever fits.
          | 
          | Of course, for a simpler solution, you could also use Aurora
          | for a clean, scalable Postgres that can survive zone failures.
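          | 
          | A rough sketch of that read path with DuckDB (bucket, column
          | and sensor names are made up; S3 credential setup is omitted):
          | 
          |     import duckdb
          | 
          |     con = duckdb.connect()
          |     con.execute("INSTALL httpfs")
          |     con.execute("LOAD httpfs")
          |     # with sorted files, predicate + column pruning means
          |     # only the row groups and columns needed for this
          |     # sensor/time range are actually fetched
          |     df = con.execute("""
          |         SELECT ts, value
          |         FROM read_parquet('s3://bucket/readings/*.parquet')
          |         WHERE sensor_id = 'sensor-123'
          |           AND ts BETWEEN TIMESTAMP '2025-03-01'
          |                      AND TIMESTAMP '2025-03-02'
          |     """).df()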
        
           | Gasp0de wrote:
            | The problem is that the initial writing is already so
            | expensive that I guess we'd have to write multiple sensors
            | into the same file instead of having one file per sensor per
            | interval. I'll look into parquet access options; if we could
            | write 10k sensors into one file but still read a single
            | sensor from that file, that could work.
        
             | spothedog1 wrote:
             | New S3 Table Buckets [1] do automatic compaction
             | 
             | [1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/s
             | 3-tab...
        
               | necrobrit wrote:
               | Table buckets are currently quite hard to use for a lot
               | of use cases as they _only_ support primitive types. No
               | nested types.
               | 
               | Hopefully this will come at some point. Product looks
               | very cool otherwise.
        
             | bloomingkales wrote:
             | Something like Redis instead? [sensorid-timerange] = value.
             | Your key is [sensorid-timerange] to get the values for that
             | sensor and that time range.
             | 
              | No more files. You might be able to avoid per-usage
              | pricing just by hosting this on a regular VPS.
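              | 
              | A toy sketch of that keying scheme with redis-py (the
              | hour-sized buckets and names are assumptions):
              | 
              |     import redis
              | 
              |     r = redis.Redis()  # e.g. on a plain VPS
              | 
              |     def bucket(ts):  # hour-sized key buckets
              |         return ts - ts % 3600
              | 
              |     def record(sensor, ts, value):
              |         r.rpush(f"{sensor}:{bucket(ts)}",
              |                 f"{ts}:{value}")
              | 
              |     def read(sensor, start, end):
              |         rows = []
              |         for b in range(bucket(start),
              |                        bucket(end) + 1, 3600):
              |             rows += r.lrange(f"{sensor}:{b}", 0, -1)
              |         return rows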
        
               | Gasp0de wrote:
                | We use Redis to buffer for a certain time period, and
                | then we write each sensor's data for that period to S3.
                | However, we fill up large Redis clusters pretty fast, so
                | we can only buffer for a fairly short period.
        
             | hendiatris wrote:
             | You may be able to get close with sufficiently small row
             | groups, but you will have to do some tests. You can do this
             | in a few hours of work, by taking some sensor data, sorting
             | it by the identifier and then writing it to parquet with
             | one row group per sensor. You can do this with the
             | ParquetWriter class in PyArrow, or something else that
              | gives you fine-grained control over how the file is
              | written.
             | I just checked and saw that you can have around 7 million
             | row groups per file, so you should be fine.
             | 
              | Then spin up DuckDB and do some performance tests. I'm not
              | sure this will work; there is some overhead to reading
              | parquet, which is why small files and row groups are
              | discouraged.
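              | 
              | Roughly what that experiment could look like (toy data;
              | the point is that each write_table() call below becomes
              | its own row group):
              | 
              |     import duckdb
              |     import pyarrow as pa
              |     import pyarrow.compute as pc
              |     import pyarrow.parquet as pq
              | 
              |     readings = pa.table({
              |         "sensor_id": ["a", "a", "b"],
              |         "ts": [1, 2, 1],
              |         "value": [0.5, 0.6, 0.7],
              |     }).sort_by("sensor_id")
              | 
              |     with pq.ParquetWriter("sensors.parquet",
              |                           readings.schema) as w:
              |         for sid in pc.unique(readings["sensor_id"]):
              |             w.write_table(readings.filter(
              |                 pc.equal(readings["sensor_id"], sid)))
              | 
              |     duckdb.sql("SELECT count(*) FROM 'sensors.parquet' "
              |                "WHERE sensor_id = 'a'").show()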
        
             | 0cf8612b2e1e wrote:
             | Why can you not rewrite the initial file into something
              | partitioned by sensor+time? Would the one-time job really
              | cost that much more compared to the added complexity of
              | multiple sensors per file?
             | 
             | Do you ever go back and reaggregate older data into bigger,
             | sorted files? That is, maybe you originally partitioned by
             | hour, but stale data is so infrequently accessed, you could
             | roll up into partitions per week/month/whatever. Depending
             | on the specifics, you might save some space from less file
             | overhead and better compression statistics.
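              | 
              | For example, a DuckDB roll-up job along these lines
              | (paths, columns and the monthly grain are assumptions):
              | 
              |     import duckdb
              | 
              |     con = duckdb.connect()
              |     con.execute("INSTALL httpfs")
              |     con.execute("LOAD httpfs")
              |     # rewrite many small hourly files into files
              |     # partitioned by sensor and month, sorted so the
              |     # column statistics compress and prune well
              |     con.execute("""
              |         COPY (
              |             SELECT *, date_trunc('month', ts) AS month
              |             FROM read_parquet(
              |                 's3://bucket/hourly/*.parquet')
              |             ORDER BY sensor_id, ts
              |         ) TO 's3://bucket/monthly'
              |         (FORMAT PARQUET,
              |          PARTITION_BY (sensor_id, month))
              |     """)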
        
         | ramses0 wrote:
         | SeaweedFS? https://news.ycombinator.com/item?id=39235593
        
           | tobias3 wrote:
            | I guess we need more requirements from OP, such as whether
            | it should be self-hosted or a cloud service.
        
         | this_user wrote:
         | Do you absolutely have to write the data to files directly? If
         | not, then using a time series database might be the better
         | option. Most of them are pretty much designed for workloads
         | with large numbers of append operations. You could always
         | export to individual files later on if you need it.
         | 
         | Another option if you have enough local storage would be to use
         | something like JuiceFS that creates a virtual file system where
         | the files are initially written to the local cache before
         | JuiceFS writes the data to your S3 provider as larger chunks.
         | 
         | SeaweedFS can do something similar if you configure it the
         | right way. But both options require that you have enough
         | storage outside of your object storage.
        
           | Gasp0de wrote:
            | We tried some ready-made options but they were way more
            | expensive than our custom-built S3 solution (by roughly a
            | factor of 10). I think we tried Timescale and AWS
            | Timestream. I haven't heard of SeaweedFS.
        
             | ramses0 wrote:
             | https://github.com/seaweedfs/seaweedfs?tab=readme-ov-
             | file#qu...
             | 
             | https://github.com/seaweedfs/seaweedfs/wiki/Cloud-Drive-
             | Bene...
             | 
             | https://github.com/seaweedfs/seaweedfs/wiki/Cloud-Tier
             | 
             | https://github.com/seaweedfs/seaweedfs/wiki/Benchmarks
             | 
             | https://github.com/seaweedfs/seaweedfs/wiki/Words-from-
             | Seawe...
             | 
             | https://github.com/seaweedfs/seaweedfs/wiki/Amazon-S3-API
             | 
              | ...your true issue seems to be that you're using the
              | filesystem as the "only" storage layer in play, but you
              | also need time- and entity-based querying(!?!).
             | 
             | >> we need to be able to query on a per sensor basis and a
             | timespan
             | 
             | ...look at the "Cloud-Tier" wiki page. If you're truly in
             | an "everything's hot all the time" situation, you really
             | should be using a database. If you're pulling "usually
             | recent stuff, occasionally old stuff" then fronting with
             | something like SeaweedFS seems like it might "just"
             | transparently reduce your overall costs.
             | 
              | Really, I'd nudge towards "write *.txt ; compact ... ;
              | SELECT ... && cat *.txt".
             | 
             | Basically, keep your inbound writes cached to (eg) seaweed
             | as unit files. "Compact them" every hour by appending rows
             | to some appropriate database (I mean: migrate to using
             | litefs, turso, postgres, something like that). When you
             | read, you may need to supplement "tip" data from your
             | incoming files, but the majority should be hitting a "real"
             | remote database, there's plenty to choose from!
             | 
              | A nifty note: SQLite can attach multiple DBs at once:
              | https://www.sqlite.org/lang_attach.html ...
              | https://stackoverflow.com/posts/10020/revisions
             | 
             | ...something like `select * from raw union (select * from
             | one_hour) union (select * from today) union (select * from
             | historical) ...`
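              | 
              | A minimal sketch of the attach-and-union idea with
              | Python's sqlite3 (database and table names are invented):
              | 
              |     import sqlite3
              | 
              |     con = sqlite3.connect("today.db")
              |     con.execute("ATTACH DATABASE 'hist.db' AS hist")
              | 
              |     q = """
              |         SELECT sensor, ts, value FROM readings
              |          WHERE ts BETWEEN :lo AND :hi
              |         UNION ALL
              |         SELECT sensor, ts, value FROM hist.readings
              |          WHERE ts BETWEEN :lo AND :hi
              |     """
              |     rows = con.execute(q, {"lo": 0, "hi": 3600}).fetchall()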
        
         | alchemist1e9 wrote:
         | https://github.com/mxmlnkn/ratarmount
         | 
         | > To use all fsspec features, either install via pip install
         | ratarmount[fsspec] or pip install ratarmount[fsspec]. It should
         | also suffice to simply pip install fsspec if ratarmountcore is
         | already installed.
        
         | themgt wrote:
          | I've only played with it a bit, but Nvidia's AIStore project
          | seems underappreciated: "lightweight, built-from-scratch
          | storage stack tailored for AI applications" + S3 compatible
         | 
         | https://github.com/NVIDIA/aistore
        
       | paulsutter wrote:
       | Does this feel about 3x too verbose, like it's generated?
        
         | jasonjmcghee wrote:
         | Idk if it's the verbosity but yes, reads as generated to me.
         | Specifically sounds like ChatGPT's writing.
        
         | mritchie712 wrote:
         | 100%, might be gpt4.5
        
       | alexmorley wrote:
        | Most of these issues will ring true for lots of folks using
        | Iceberg at the moment. But this does not:
       | 
       | > Yet, competing table formats like Delta Lake and Hudi mirror
       | this fragmentation. [ ... ] > Just as Spark emerged as the
       | dominant engine in the Hadoop ecosystem, a dominant table format
       | and catalog may appear in the Iceberg era.
       | 
        | I think extremely few people are making bets on any other open
        | source table format now - that consolidation already happened
        | in 2023-2024 (see e.g. Databricks, who have their own competing
        | format, leaning heavily into Iceberg; or adoption by all of the
        | major data warehouse providers).
        
         | twoodfin wrote:
         | Microsoft is right now making a huge bet on Delta by way of
         | their "Microsoft Fabric" initiative (as always with Microsoft:
         | Is it a product? Is it a branding scheme? Yes.)
         | 
         | They seem to be the only vendor crazy enough to try to fast-
         | follow Databricks, who is clearly driving the increasingly
         | elaborate and sophisticated Delta ecosystem (check the GitHub
         | traffic...)
         | 
         | But Microsoft + Databricks is a lot of momentum for Delta.
         | 
         | On the merits of open & simple, I agree, better for everyone if
         | Iceberg wins out--as Iceberg and not as some Frankenstandard
         | mashed together with Delta by the force of 1,000 Databricks
         | engineers.
        
           | datadrivenangel wrote:
            | The only reason Microsoft is using Delta is to emphasize to
            | CTOs and investors that Fabric is as good as Databricks,
            | even when that is obviously false to anyone who has smelled
            | the evaporative scent of vaporware before.
        
             | esafak wrote:
             | Microsoft's gonna Microsoft.
        
             | twoodfin wrote:
             | Very different business, of course, but Databricks v.
             | Fabric reminds me a lot of Slack v. Teams.
             | 
             | Regardless of the relative merits now, I think everyone
             | agrees that a few years ago Slack was clearly superior.
             | Microsoft could have certainly bought Slack instead of
             | pumping probably billions into development, marketing,
             | discounts to destroy them.
             | 
             | I think Microsoft could and would consider buying
             | Databricks--$80-100B is a lot, but not record-shattering.
             | 
             | If I were them, though, I'd spend a few billion competing
             | as an experiment, first.
        
               | foobiekr wrote:
               | Anti-trust is the reason a lot of the kinds of deals
               | you're talking about don't happen.
        
               | twoodfin wrote:
               | I agree. If the anti-trust regime had been different
               | Microsoft would have bought Databricks years ago. Satya
               | Nadella has surely been tapping his foot watching their
               | valuation grow and grow.
               | 
               | The Trump folks have given mixed messages on the Biden-
               | era FTC; I'd put the odds that with the right tap dancing
               | (sigh) Microsoft could make a blockbuster like this in
               | the B2B space work.
        
       | alienreborn wrote:
       | Better article (imo) on similar topic:
       | https://www.dataengineeringweekly.com/p/is-apache-iceberg-th...
        
         | tomnicholas1 wrote:
         | I think the posted article was generated from this one - the
         | structure of the content is so similar.
        
       | simlevesque wrote:
        | I'm working on an alternative Iceberg client that works better
        | for write-heavy use cases. Instead of writing many smaller
        | files, it keeps writing to the same file until it reaches 1 MB
        | in size, but gives it a new name on each write. Then I update
        | the manifest with the new filename and checksum. I keep old
        | files on disk for 60 seconds to allow pending queries to
        | finish. I'm also working on auto compaction: when I have ten
        | 1 MB files I compact them, same with ten 10 MB files, etc...
       | 
       | I feel like this could be a game changer for the ecosystem. It's
        | more CPU- and network-heavy for writes, but the reads are
        | always fast. And the writes are still faster than pyiceberg.
       | 
        | I want to hear opinions, or reasons why this could never work.
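        | 
        | The compaction trigger described above reduces to a size-tier
        | rule, roughly like this (a toy sketch of the logic only; the
        | manifest and checksum updates are the hard part and are out of
        | scope here):
        | 
        |     TIER_BYTES = [1_000_000, 10_000_000, 100_000_000]
        |     FAN_IN = 10
        | 
        |     def plan_compactions(files):
        |         """files: list of (path, size); returns merge groups."""
        |         plans = []
        |         for tier in TIER_BYTES:
        |             same = [f for f in files
        |                     if tier <= f[1] < tier * FAN_IN]
        |             while len(same) >= FAN_IN:
        |                 plans.append(same[:FAN_IN])
        |                 same = same[FAN_IN:]
        |         return plans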
        
         | thom wrote:
         | Interesting. My personal feeling is that we're slowly headed to
         | a world where we can have our cake and eat it: fast bulk
         | ingestion, fast OLAP, fast OLTP, low latency, all together in
         | the same datastore. I'm hoping we just get to collapse whole
         | complex data platforms into a single consistent store with
         | great developer experience, and never look back.
        
           | simlevesque wrote:
            | I think it's possible too, and the Iceberg spec allows it,
            | but the implementations are not suited for every use case.
        
           | ndm000 wrote:
           | I've felt the same way. It's so inefficient to have two
           | patterns - OLAP and OLTP - both using SQL interfaces but
           | requiring syncing between systems. There are some physical
            | limits at play, though. OLAP will always use less
            | processing and disk if the data it needs is all right next
            | to each other (columnar storage), whereas OLTP's need for
            | fast writes usually means row-based storage is more
            | efficient. I think
           | the solution would be one system that stores data
           | consistently both ways and knows when to use which method for
           | a given query.
        
             | thom wrote:
             | In a sense, OLAP is just a series of indexing strategies
             | that takes OLTP data and formats it for particular use
             | cases (sometimes with eventual consistency). Some of these
             | indexing strategies in enterprises today involve building
             | out entire bespoke platforms to extract and transform the
             | data. Incremental view maintenance is a step in the right
             | direction - tools like Materialize give you good
             | performance to keep calculated data up to date, and also
             | break out of the streaming world of only paying attention
             | to recent data. But you need to close the loop and also be
             | able to do massive crunchy queries on top of that. I have
             | no doubt we'll get there, really exciting times.
        
               | ndm000 wrote:
               | Completely agree. All of the pieces are there and it's
               | just waiting to be acted upon. I haven't seen any of the
                | major players really doubling down on this, but it
                | would be so compelling.
        
         | mritchie712 wrote:
         | nice! anywhere we can follow your progress?
        
           | simlevesque wrote:
            | Not right now, sadly; I have some work obligations taking
            | my time, but I can't wait to share more.
           | 
           | I'm using a basic implementation that's not backed by
           | iceberg, just Parquet files in hive partitions that I can
           | query using DuckDB.
        
         | FuriouslyAdrift wrote:
         | so... sharding?
        
         | chehai wrote:
         | This approach reminds me of ClickHouse's MergeTree.
         | 
         | Also, https://paimon.apache.org/ seems to be better for
         | streaming use cases.
        
       | mritchie712 wrote:
       | This is a bit overblown.
       | 
       | Is Iceberg "easy" to set up? No.
       | 
       | Can you get set up in a week? Yes.
       | 
       | If you really need a datalake, spending a week setting it up is
       | not so bad. We have a guide[0] here that will get you started in
       | under an hour.
       | 
        | For smaller data (e.g. under 10 TB) where you don't need real-
        | time, DuckDB is becoming a really solid option. Here's one
        | setup[1] we've played around with using Arrow Flight.
       | 
       | If you don't want to mess with any of this, we[2] spin it all up
       | for you.
       | 
       | 0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws
       | 
       | 1 - https://www.definite.app/blog/duck-takes-flight
       | 
       | 2 - https://www.definite.app/
        
         | simlevesque wrote:
         | I think Iceberg can work in real time but the current
         | implementations make it impossible.
         | 
          | I have a vision for a way to make it work; I made another
          | comment here. Your blog posts were helpful - I dug a bit into
          | the Duck Takes Flight code in Python and Rust.
        
         | pid-1 wrote:
         | If you're already in AWS, why wouldn't you use AWS Glue Catalog
         | + AWS SDK for pandas + Athena?
         | 
          | You can set up a data lake, save data, and start running
          | queries in like 10 minutes with this setup.
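          | 
          | For instance, with awswrangler (the bucket, database and
          | table names below are placeholders, and the Glue database is
          | assumed to exist):
          | 
          |     import awswrangler as wr
          |     import pandas as pd
          | 
          |     df = pd.DataFrame(
          |         {"sensor": ["a"], "ts": [1], "value": [0.5]})
          | 
          |     # write parquet to S3 and register it in Glue
          |     wr.s3.to_parquet(
          |         df,
          |         path="s3://my-bucket/lake/readings/",
          |         dataset=True,
          |         database="sensors",
          |         table="readings",
          |     )
          | 
          |     # query it straight away with Athena
          |     out = wr.athena.read_sql_query(
          |         "SELECT * FROM readings WHERE sensor = 'a'",
          |         database="sensors",
          |     )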
        
           | thedougd wrote:
            | These days you can 'just' create an S3 Tables bucket.
            | https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...
        
           | mritchie712 wrote:
           | Agreed.
           | 
            | A lot of people would worry about "vendor lock-in" here,
            | but it's certainly convenient.
        
           | tsss wrote:
           | Athena is really expensive though and you will often run into
           | a hard limit on the size of your query.
        
         | whalesalad wrote:
          | Heads up: the logo on your site needs to be 2x'd in pixel
          | density; it comes across as blurry on HiDPI displays. Or
          | convert it to an SVG/vector.
        
           | mritchie712 wrote:
           | fixed!
        
       | zhousun wrote:
        | The only part of the data stack that Iceberg (or the lakehouse)
        | will never replace is OLTP systems: for high-concurrency
        | updates, optimistic concurrency control on an object store is
        | simply a no-go.
        | 
        | Iceberg out of the box is "NOT" good at streaming use cases;
        | unlike formats such as Hudi or Paimon, the table format has no
        | concept of merge/index. However, the beauty of Iceberg is that
        | it is very unopinionated, so it is indeed possible to design an
        | engine that streams writes to Iceberg. As far as I know this is
        | how engines like Upsolver were implemented:
        | 
        | 1. Keep an in-memory buffer to track incoming rows before
        | flushing a version to Iceberg (every 10s to a few minutes).
        | 
        | 2. Build an indexing structure to write position deletes /
        | deletion vectors instead of equality deletes.
        | 
        | 3. The writer will also try to merge small files and optimize
        | the table.
       | 
        | And stay tuned: we at https://www.mooncake.dev/ are working on
        | a solution to mirror a Postgres table to Iceberg and keep the
        | two always up to date.
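        | 
        | A minimal sketch of that buffer-and-flush idea with pyiceberg
        | (the catalog, table and schema are assumptions; real engines
        | also handle deletes, indexing and compaction as described
        | above):
        | 
        |     import time
        |     import pyarrow as pa
        |     from pyiceberg.catalog import load_catalog
        | 
        |     # assumes a configured catalog and an existing table whose
        |     # schema matches the buffered rows
        |     table = load_catalog("default").load_table(
        |         "sensors.readings")
        | 
        |     buffer, last_flush = [], time.time()
        | 
        |     def on_event(row):  # row: dict of column -> value
        |         global last_flush
        |         buffer.append(row)
        |         if time.time() - last_flush > 30:  # flush ~every 30s
        |             # each append() commits one new snapshot
        |             table.append(pa.Table.from_pylist(buffer))
        |             buffer.clear()
        |             last_flush = time.time()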
        
       | robertkoss wrote:
       | This article is just shameless advertising for Estuary Flow, a
       | company that the author is working for. "Operational Maturity",
       | as if Iceberg, Delta or Hudi are not mature. These are battle-
       | tested frameworks that have been in production for years. The
       | "small files problem" is not really a problem because every
       | framework supports some way of compacting smaller files. Just run
       | a nightly job that compacts the small files and you're good 2 go.
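        | 
        | For Iceberg specifically, that nightly job can be as small as a
        | call to the built-in Spark maintenance procedure (the catalog
        | and table names here are placeholders):
        | 
        |     from pyspark.sql import SparkSession
        | 
        |     # assumes a session already configured with the Iceberg
        |     # extensions and a catalog named "my_catalog"
        |     spark = SparkSession.builder.getOrCreate()
        | 
        |     spark.sql("""
        |         CALL my_catalog.system.rewrite_data_files(
        |             table => 'db.events',
        |             options => map('target-file-size-bytes',
        |                            '536870912')
        |         )
        |     """)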
        
         | prpl wrote:
         | AWS can do it for you with S3 tables.
        
       | theyinwhy wrote:
       | What's a good alternative? Google BigQuery?
        
       | datax2 wrote:
       | "Hadoop's meteoric rise led many organizations to implement it
       | without understanding its complexities, often resulting in
       | underutilized clusters or over-engineered architectures. Iceberg
       | is walking a similar path."
       | 
        | This pain is too real, and too close to home. I've seen this
        | outcome turn an entire business off of consuming their data via
        | Hadoop because it turns into a wasteland of delayed deliveries,
        | broken datasets, ops teams who cannot scale, and architects
        | overselling overly robust designs.
       | 
        | I've tried to scale Hadoop down to the business user with
        | visual ETL tools like Alteryx, but there again compatibility
        | between Alteryx and Hadoop sucks over ODBC connectors. I came
        | from an AWS-based stack into a poorly leapfrogged data stack,
        | and it's hard not to pull my hair out between the business
        | struggling to use it and infra + ops not keeping up. Now these
        | teams want to push to Iceberg or BigQuery while ignoring the
        | mountains of tech debt they have created.
       | 
        | Don't get me wrong, Hadoop isn't a bad idea; it's just complex
        | and a time suck, and unless you have time to dedicate to
        | properly deploying these solutions, which most businesses do
        | not, your implementation will suffer and your business will
        | suffer.
       | 
       | "While the parallels to Hadoop are striking, we also have the
        | opportunity to avoid its pitfalls." No one in IT learns from
        | their failures unless they are writing the checks; most will
        | flip before they feel the pain.
        
       ___________________________________________________________________
       (page generated 2025-03-06 23:01 UTC)