[HN Gopher] Apache Hudi vs. Delta Lake vs. Apache Iceberg Lakeho...
       ___________________________________________________________________
        
       Apache Hudi vs. Delta Lake vs. Apache Iceberg Lakehouse Feature
       Comparison
        
       Author : bhasudha
       Score  : 71 points
       Date   : 2023-01-11 18:19 UTC (4 hours ago)
        
 (HTM) web link (www.onehouse.ai)
 (TXT) w3m dump (www.onehouse.ai)
        
       | zX41ZdbW wrote:
       | These formats look like an attempt to get a halfway solution: you
       | want to get something like a real MPP analytic DBMS (e.g.,
       | ClickHouse) but have to use a data lake for some reason.
       | 
       | It resembles previous trendy technologies that are mostly
       | forgotten now, such as:
       | 
        | - Lambda architecture (based on a wrong assumption that you
        | cannot have real-time and historical layers in the same
        | system);
       | 
       | - Multidimensional OLAP (based on a wrong assumption that you
       | cannot do analytic queries directly on non-aggregated data);
       | 
        | - Big data (based on a wrong assumption that map-reduce is better
        | than a relational DBMS).
       | 
       | I'm exaggerating a little.
       | 
       | Disclaimer: I work on ClickHouse, and I'm a follower of every
       | technology in the data processing area.
        
         | vgt wrote:
         | It would be good if you labeled your posts so as to reveal your
         | bias.
         | 
          | I understand why folks want options. At the end of the day,
          | folks want an easy-to-use, ALWAYS CORRECT, stable database,
          | with minimal, well-documented, predictable knobs, a correct
          | distributed execution plan, no OOMs, separation of storage
          | and compute, and standard SQL, and Clickhouse struggles with
          | all of the above.
         | 
         | (co-founder of MotherDuck)
        
           | thomoco wrote:
           | Could you please elaborate on your comments and possible
           | misconceptions about ClickHouse? Proven stability, massive
           | scale, predictability, native SQL, and industry-best
           | performance are all well-recognized characteristics of
            | ClickHouse, so your comments here seem a bit biased.
           | 
           | I am interested to learn more about your point of view, as
           | well as tangentially the strategic vision of MotherDuck as a
           | company.
           | 
           | (VP Support at ClickHouse)
        
             | vgt wrote:
             | Speaking from nearly a decade working on BigQuery, and a
             | year working at Firebolt.
             | 
              | - Stability. It OOMs; your CTO mentioned that last week.
             | 
             | - It is not correct. I believe your team is aware of cases
             | in which your very own benchmarks revealed Clickhouse to be
             | incorrect.
             | 
             | - Scale. The distributed plan is broken and I'm not sure
             | Clickhouse even has shuffle.
             | 
             | - SQL. It is very non-standard.
             | 
             | - Knobs. Lots of knobs that are poorly documented. It's
             | unclear which are mandatory.
             | 
             | Don't get me wrong, I love open source, and I love what
             | Clickhouse has done. I am not a fan of overselling. There
             | are problems with Clickhouse. Trying to sell it as a
             | superset of the modern CDW is not doing users any favors.
        
         | glogla wrote:
          | The formats are kind of a halfway solution, because trying to
          | build something with MPP semantics on object stores is
          | difficult.
         | 
          | The difference between an MPP system and something like
          | Databricks or Trino working with an object store is that while
          | the MPP system can likely get much better performance, and
          | especially latency, from the same hardware, operating it is
          | much harder.
         | 
         | You don't "backup" Databricks - the data is stored in object
         | storage and that is it. You don't have to plan storage sizing
          | quarters upfront, and you never get in trouble because of an
          | unexpected data spike. Compute resizes are trivial, there is no
         | rebalancing. Upgrades are easy, because you're just upgrading
         | the compute and you can't break data that way. You can give
         | each user group (like batch one and interactive one, or each
         | team) their dedicated compute over common data and it works.
         | That compute can spin up and down and autoscale to save some
          | money. You don't have to think about how to replicate your table
         | across a cluster or anything. And so on, and so forth.
         | 
          | Running a big data and analytics platform - a place where tens of
         | teams, tens of applications and hundreds or thousands of
         | analysts come for data and where they build their solutions -
         | is already enough of a challenge without all this operations
         | work, and that is why Snowflake and Databricks are worth that
         | crazy money.
         | 
          | If someone could solve the challenge of having an MPP system
          | that is as easy to manage as Snowflake or a Lakehouse, that
          | would be quite the differentiator. And maybe you people already
          | did and I just didn't notice, I don't know :)
        
       | joeharris76 wrote:
       | "A grand don't come for free"
       | 
       | It's interesting that they link to (and show graphs from) their
       | earlier post but fail to mention that they had to disable a very
       | large chunk of the features in the checklist to get that
       | performance.
       | 
       | Their earlier post: https://www.onehouse.ai/blog/apache-hudi-vs-
       | delta-lake-trans... Responding to: https://databeans-
       | blogs.medium.com/delta-vs-iceberg-vs-hudi-...
        
       | aliqot wrote:
       | Naming this concept a 'data lake' bugs me. Turns out when you get
       | old, you read things sometimes and you're thinking "ah, so that's
       | where my line was. Interesting."
        
         | CharlesW wrote:
         | You need a data lake before you can build your data lakehouse,
         | of course.
        
           | bradleyankrom wrote:
           | And you need a data lakehouse so you can enjoy your data
           | jetskis.
        
       | rguillebert wrote:
       | In our use case (rebuilding tables from Postgres CDC logs), Delta
       | Lake was significantly faster than Hudi.
        
         | deltacontrib wrote:
         | Are you open to sharing more? We'd love to highlight this
         | success and contribute any learnings back to the community!
         | 
         | nick ( a t ) nickkarpov.com
        
       | ckdarby wrote:
       | Straight up an advertisement piece for the company.
        
       | marsupialtail_2 wrote:
       | I think the blog post should point out very early that Onehouse
       | is a Hudi company. There are some other recent benchmarks
       | published in CIDR by Databricks that might paint a different
       | picture: https://petereliaskraft.net/res/cidr_lakehouse.pdf
        
         | anonymousDan wrote:
         | Thanks for the link. I'd be interested to see a perf comparison
         | using a popular processing engine other than spark given the
         | obvious potential for delta lake to be better tuned for spark
         | workloads by default.
        
           | marsupialtail_2 wrote:
           | me too. Trino for one would be a good start. Adding support
           | for those data lakes is really hard though if you want good
           | performance.
        
         | glogla wrote:
          | In Databricks' published benchmark, of course, Delta is the
          | fastest. I have also seen an Iceberg-using company publish
          | benchmarks showing that Iceberg is the fastest.
         | 
         | Vendor published benchmarks are worthless.
        
           | mostdataisnice wrote:
            | fwiw - the lead authors on the linked paper are all grad
            | students, not employed at Databricks. That said, they're
            | advised by Databricks people.
        
           | MrPowers wrote:
           | I think vendor published benchmarks are fine if the dataset
           | is open / accessible, the benchmark code is published, all
           | software versions are disclosed, and the exact hardware is
           | specified. I definitely wouldn't consider an audited TPC
           | benchmark that's based on industry standard datasets /
           | queries worthless in the data space. Disclosure: I work for
           | Databricks.
        
       | MrPowers wrote:
       | Some high level context for those less familiar with the
       | Lakehouse storage system space. For various reasons, several
       | companies moved from data warehouses to data lakes starting
       | around 7-10 years ago.
       | 
       | Data lakes are better for ML / AI workloads, cheaper, more
       | flexible, and separate compute from storage. With a data
       | warehouse, you need to share compute with other users. With data
       | lakes you can attach an arbitrary number of computational
       | clusters to the data.
       | 
       | Data lakes were limited in many regards. They were easily
       | corrupted (no schema enforcement), required slow file listings
       | when reading data, and didn't support ACID transactions.
       | 
       | I'm on the Delta Lake team and will speak to some of the benefits
       | of Delta Lake compared to data lakes:
       | 
       | * Delta Lake supports ACID transactions, so Delta tables are
       | harder to corrupt. The transaction log makes it easy to time
       | travel, version datasets, and rollback to earlier versions of
       | your data.
       | 
       | * Delta Lake allows for schema enforcement & evolution
       | 
       | * Delta Lake makes it easy to compact small files (big data
       | systems don't like an excessive number of small files)
       | 
       | * Delta Lake lets readers get files and skip files via the
        | transaction log (much faster than a file listing). Z-ordering the
        | data makes reads even faster.
       | 
       | The Delta Lake protocol is implemented in a Scala library and
       | exposed via PySpark, Scala Spark, and Java Spark bindings. This
       | is the library most people think of when conceptualizing Delta
       | Lake.
       | 
       | There is also a Delta Lake Java Standalone library that's used to
       | build other readers like the Trino & Hive readers.
       | 
        | The Delta Rust project is another implementation of the Delta
        | Lake protocol, written in Rust. This library is accessible via
        | Rust or Python bindings. Polars just added a Delta
       | Lake reader with delta-rs and this library can also be used to
       | easily read Delta Lakes into other DataFrames like pandas or
       | Dask.
       | 
       | Lots of DataFrame users are struggling with data lakes / single
       | data files. They don't have any data skipping capabilities
       | (unless Parquet file footers are read), their datasets are easily
       | corruptible, and they don't have any schema enforcement / schema
       | evolution / data versioning / etc. I expect the data community to
       | accelerate the shift to Lakehouse storage systems as they learn
       | about all of these advantages.
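The transaction-log data skipping described above can be sketched in miniature. This is a toy illustration, not Delta Lake's actual implementation: real Delta tables keep JSON "add" actions with per-file statistics under `_delta_log/`, but the file names and stats layout here are invented for the example.

```python
# Toy sketch of transaction-log file skipping (illustrative only, not
# Delta Lake's real code). Each hypothetical log entry records one data
# file plus min/max column statistics, so a reader can pick files from
# the log instead of listing the storage bucket.
import json

# Hypothetical log entries, one "add" action per data file.
log = [
    json.dumps({"add": {"path": "part-0.parquet",
                        "stats": {"min_id": 0, "max_id": 99}}}),
    json.dumps({"add": {"path": "part-1.parquet",
                        "stats": {"min_id": 100, "max_id": 199}}}),
    json.dumps({"add": {"path": "part-2.parquet",
                        "stats": {"min_id": 200, "max_id": 299}}}),
]

def files_for_predicate(log_lines, lo, hi):
    """Return only files whose [min_id, max_id] range overlaps [lo, hi],
    so a query never has to open or even list the other files."""
    selected = []
    for line in log_lines:
        add = json.loads(line)["add"]
        stats = add["stats"]
        if stats["min_id"] <= hi and stats["max_id"] >= lo:
            selected.append(add["path"])
    return selected

# A query like `WHERE id BETWEEN 150 AND 160` touches a single file:
print(files_for_predicate(log, 150, 160))  # ['part-1.parquet']
```

Z-ordering makes this kind of skipping more effective by clustering related values together, so each file's min/max range stays narrow and more files fall outside any given predicate.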
        
         | vgt wrote:
         | You're making a lot of assertions I am not sure I agree with:
         | 
         | > Data lakes are better for ML / AI workloads, cheaper, more
         | flexible, and separate compute from storage. With a data
         | warehouse, you need to share compute with other users. With
         | data lakes you can attach an arbitrary number of computational
         | clusters to the data.
         | 
         | - I am not sure it's any cheaper than BQ or Snowflake storage.
         | 
         | - Modern CDW separates compute from storage.
         | 
         | - I am not sure what you mean by "you need to share compute
         | with others". Why?
         | 
         | - You can attach an arbitrary number of "clusters" in BQ and
         | Snowflake as well.
         | 
         | Additionally, modern CDW provides a very high level of
         | abstraction and a very high level of manageability. Their time
         | travel and compaction actually work, and their storage systems
         | are continuously optimized for optimal performance.
        
         | AtlasBarfed wrote:
         | Not that I know what anything means in "big data lake OLAP
         | database" anymore, but I always thought a data lake implied a
         | lot of hybrid sources/formats/structures for the data, but the
         | advocacy here implies that the data is all ingested and
         | reformatted, which to me is a data warehouse.
         | 
         | But then again, data lake may simply be what a data warehouse
         | is now called in marketspeak.
         | 
          | Also, I stopped paying attention when the treadmill of new
          | frameworks became unbearable to track. Is Spark now settled
          | as the standard for distributed "processing", as in
          | mapreduce / distributed query / distributed batch / etc.?
         | 
         | I get that performance can improve by unifying to a file format
         | like parquet, but again that seems like a data warehouse. A
          | data lake should be something over heterogeneous sources with
         | "drivers" or "adaptors" IMO, in particular because the
         | restoration of the data inputs stays in the knowledge domain of
         | the source production database maintainers.
        
           | MrPowers wrote:
           | You're understandably confused by the industry terminology
           | that's ambiguous and morphing over time.
           | 
           | Data lakes are typically CSV/JSON/ORC/Avro/Parquet files
           | stored in a storage system (cloud storage like AWS S3 or
           | HDFS). Data lakes are schema on read (the query engine gets
           | the schema when reading the data).
           | 
           | A data warehouse is something like Redshift that bundles
           | storage and compute. You have to buy the storage and compute
           | as a single package. Data warehouses are schema on write. The
           | schema is defined when the table is created.
           | 
           | And yes, I'd say that Spark is generally considered the
           | "standard" distributed data processing engine these days
           | although there are alternatives.
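The schema-on-read vs. schema-on-write distinction can be shown with a toy sketch (illustrative Python only, not any particular engine's implementation; the `WarehouseTable` class and schema layout are made up for the example):

```python
# Toy contrast of schema on read vs. schema on write (illustrative only).
import csv
import io

raw = "id,amount\n1,9.99\n2,19.50\n"

# Schema on read (data lake style): the stored file is just bytes;
# each reader supplies the types at query time.
def read_with_schema(text, schema):
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({col: cast(rec[col]) for col, cast in schema.items()})
    return rows

lake_rows = read_with_schema(raw, {"id": int, "amount": float})

# Schema on write (warehouse style): the schema is fixed when the table
# is created, and every insert is validated against it up front.
class WarehouseTable:
    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def insert(self, row):
        for col, typ in self.schema.items():
            if not isinstance(row[col], typ):
                raise TypeError(f"{col} must be {typ.__name__}")
        self.rows.append(row)

table = WarehouseTable({"id": int, "amount": float})
table.insert({"id": 1, "amount": 9.99})      # accepted
# table.insert({"id": "x", "amount": 9.99})  # would raise TypeError
```

The trade-off mirrored here is the one in the comment above: schema on read keeps storage cheap and flexible (any engine can read the files its own way), while schema on write catches bad data at ingest, before anyone queries it.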
        
       | swyx wrote:
       | The data lakehouse space is the most exciting evolution of the
       | data warehousing industry imo! If anyone wants a comparison guide
       | from a neutral party, we did one last year:
       | https://airbyte.com/blog/data-lake-lakehouse-guide-powered-b...
       | 
        | I can't paste images here, but imo this table comparing the 3
       | formats is the big takeaway https://assets-global.website-
       | files.com/6064b31ff49a2d31e049... (explained inline, we do cite
       | onehouse heavily but we are independent of them)
        
         | joeharris76 wrote:
         | I notice that they "borrowed" most of the table from you -
          | kudos I guess. My problem with your post is that it presents the
          | choice between Lakehouse formats as a feature checklist. Not all
          | features are created equal, and not all implementations are
          | equally easy to use.
         | 
         | Personally, I find Brooklyn Data's post to be more practically
         | useful since they used each format to do real work and show how
         | they perform in practice. And they provide all their code so a
         | reader can run the same testing themselves.
         | 
         | https://brooklyndata.co/blog/benchmarking-open-table-formats
        
           | glogla wrote:
           | > Personally, I find Brooklyn Data's post to be more
           | practically useful since they used each format to do real
           | work and show how they perform in practice.
           | 
           | First, you should probably add disclaimer that you work for
           | Databricks.
           | 
           | Second, they really did not.
           | 
           | They came up with a test case of 1) create a table 2) merge
           | into it 3) delete from it 4) delete from it 5) delete from it
           | 6) delete from it (but call it "GDPR request" in chart).
           | 
            | This is a completely absurd workload that should give anyone
            | pause. In the real world, Lakehouse tables are either appended
            | to, merged into with periodic compaction, or dropped and
           | recreated (using something like dbt). Running a delete four
           | times is just bizarre, and making a general statement out of
           | it is just suspiciously biased.
           | 
            | Of course, it is a very interesting coincidence that this
            | independent company came up with an unrealistic benchmark
            | showing how much better Delta is than its major competitor,
            | and that everyone from Databricks is sharing it everywhere.
        
       | berkle4455 wrote:
       | Data Lakehouse is where you store your data on object stores and
       | spin up a bunch of instances (cpu/memory) as needed to crunch the
       | data on the timeline you desire. It's incredible to me how
       | "solutions" are continually invented that give cloud providers
       | plenty of stages to charge for not only storage but movement and
       | processing of your data too.
        
         | glogla wrote:
         | As opposed to what setup exactly? What do you propose everyone
         | uses instead of Lakehouses?
        
           | berkle4455 wrote:
           | Disaggregated storage. You still have separation of compute
           | and storage with the bandwidth and latency benefits of
           | directly attached storage.
           | 
           | Object stores are a terribly inefficient way to access and
           | store changes of data.
        
             | nostrebored wrote:
             | Inefficient in what respect? It really depends on your
             | access patterns, performance requirements, and the way data
              | is structured and stored in your object store.
             | 
             | - someone who thinks it's often a bad idea
        
             | thinkharderdev wrote:
             | It's definitely inefficient w/r/t performance (reading from
             | Object Storage will always be several times slower than
             | reading from disk/SSD) but the point is usually to minimize
             | cost.
        
           | anonymousDan wrote:
            | I guess a setup such as HDFS where storage and compute are
           | colocated and not disaggregated. But that also offers similar
           | transactional semantics to lakehouses.
        
         | KptMarchewa wrote:
          | No, that's a data lake. A data lakehouse is a data lake where
          | your objects are wrapped in a "table format" like Iceberg that
          | allows you to query (and update, etc.) them as if they were
          | stored in a traditional data warehouse.
        
         | anonymousDan wrote:
          | Why would you be charged for movement in this case? I thought
          | intra-datacenter traffic was free. Or do you mean you get charged
         | to update/query the object store?
        
       ___________________________________________________________________
       (page generated 2023-01-11 23:01 UTC)