[HN Gopher] Apache Hudi vs. Delta Lake vs. Apache Iceberg Lakeho...
___________________________________________________________________
Apache Hudi vs. Delta Lake vs. Apache Iceberg Lakehouse Feature
Comparison
Author : bhasudha
Score : 71 points
Date : 2023-01-11 18:19 UTC (4 hours ago)
(HTM) web link (www.onehouse.ai)
(TXT) w3m dump (www.onehouse.ai)
| zX41ZdbW wrote:
| These formats look like an attempt to get a halfway solution: you
| want to get something like a real MPP analytic DBMS (e.g.,
| ClickHouse) but have to use a data lake for some reason.
|
| It resembles previous trendy technologies that are mostly
| forgotten now, such as:
|
| - Lambda architecture (based on a wrong assumption that you
| cannot have real-time and historical layers in the same
| system);
|
| - Multidimensional OLAP (based on a wrong assumption that you
| cannot do analytic queries directly on non-aggregated data);
|
| - Big data (based on a wrong assumption that map-reduce is better
| than relational DBMS).
|
| I'm exaggerating a little.
|
| Disclaimer: I work on ClickHouse, and I'm a follower of every
| technology in the data processing area.
| vgt wrote:
| It would be good if you labeled your posts so as to reveal your
| bias.
|
| I understand why folks want options. At the end of the day,
| folks want an easy-to-use, ALWAYS CORRECT, stable database,
| with minimal, well-documented, predictable knobs, a correct
| distributed execution plan, no OOMs, separation of storage and
| compute, and standard SQL, and Clickhouse struggles with all of
| the above.
|
| (co-founder of MotherDuck)
| thomoco wrote:
| Could you please elaborate on your comments and possible
| misconceptions about ClickHouse? Proven stability, massive
| scale, predictability, native SQL, and industry-best
| performance are all well-recognized characteristics of
| ClickHouse, so your comments here seem a bit biased.
|
| I am interested to learn more about your point of view, as
| well as tangentially the strategic vision of MotherDuck as a
| company.
|
| (VP Support at ClickHouse)
| vgt wrote:
| Speaking from nearly a decade working on BigQuery, and a
| year working at Firebolt.
|
| - Stability. It OOMS, your CTO mentioned that last week.
|
| - It is not correct. I believe your team is aware of cases
| in which your very own benchmarks revealed Clickhouse to be
| incorrect.
|
| - Scale. The distributed plan is broken and I'm not sure
| Clickhouse even has shuffle.
|
| - SQL. It is very non-standard.
|
| - Knobs. Lots of knobs that are poorly documented. It's
| unclear which are mandatory.
|
| Don't get me wrong, I love open source, and I love what
| Clickhouse has done. I am not a fan of overselling. There
| are problems with Clickhouse. Trying to sell it as a
| superset of the modern CDW is not doing users any favors.
| glogla wrote:
| The formats are kind of a halfway solution, because trying to
| build something with MPP semantics on objects stores is
| difficult.
|
| The difference between MPP and something like Databricks or
| Trino working with object store is that while MPP can likely
| get much better performance and especially latency from the
| same hardware, operating it is much harder.
|
| You don't "backup" Databricks - the data is stored in object
| storage and that is it. You don't have to plan storage sizing
| quarters upfront, and you never get in trouble because there is
| unexpected data spike. Compute resizes are trivial, there is no
| rebalancing. Upgrades are easy, because you're just upgrading
| the compute and you can't break data that way. You can give
| each user group (like batch one and interactive one, or each
| team) their dedicated compute over common data and it works.
| That compute can spin up and down and autoscale to save some
| money. You don't have to think about how to replicate your
| table across a cluster or anything. And so on, and so forth.
|
| Running a big data and analytics platform - a place where tens
| teams, tens of applications and hundreds or thousands of
| analysts come for data and where they build their solutions -
| is already enough of a challenge without all this operations
| work, and that is why Snowflake and Databricks are worth that
| crazy money.
|
| If someone could solve the challenge of having MPP that is as
| easy to manage as Snowflake or a Lakehouse, that would be quite
| the differentiator. And maybe you people already did and I just
| didn't notice, I don't know :)
| joeharris76 wrote:
| "A grand don't come for free"
|
| It's interesting that they link to (and show graphs from) their
| earlier post but fail to mention that they had to disable a very
| large chunk of the features in the checklist to get that
| performance.
|
| Their earlier post: https://www.onehouse.ai/blog/apache-hudi-vs-
| delta-lake-trans... Responding to: https://databeans-
| blogs.medium.com/delta-vs-iceberg-vs-hudi-...
| aliqot wrote:
| Naming this concept a 'data lake' bugs me. Turns out when you get
| old, you read things sometimes and you're thinking "ah, so that's
| where my line was. Interesting."
| CharlesW wrote:
| You need a data lake before you can build your data lakehouse,
| of course.
| bradleyankrom wrote:
| And you need a data lakehouse so you can enjoy your data
| jetskis.
| rguillebert wrote:
| In our use case (rebuilding tables from Postgres CDC logs), Delta
| Lake was significantly faster than Hudi.
| deltacontrib wrote:
| Are you open to sharing more? We'd love to highlight this
| success and contribute any learnings back to the community!
|
| nick ( a t ) nickkarpov.com
| ckdarby wrote:
| Straight up an advertisement piece for the company.
| marsupialtail_2 wrote:
| I think the blog post should point out very early that Onehouse
| is a Hudi company. There are some other recent benchmarks
| published in CIDR by Databricks that might paint a different
| picture: https://petereliaskraft.net/res/cidr_lakehouse.pdf
| anonymousDan wrote:
| Thanks for the link. I'd be interested to see a perf comparison
| using a popular processing engine other than spark given the
| obvious potential for delta lake to be better tuned for spark
| workloads by default.
| marsupialtail_2 wrote:
| me too. Trino for one would be a good start. Adding support
| for those data lakes is really hard though if you want good
| performance.
| glogla wrote:
| In Databricks' published benchmark, of course Delta is the
| fastest. I have also seen an Iceberg-using company publish
| benchmarks showing how Iceberg is the fastest.
|
| Vendor published benchmarks are worthless.
| mostdataisnice wrote:
| fwiw - the lead authors on that linked paper are all grad
| students, not employed at Databricks. That being said, they're
| advised by Databricks people.
| MrPowers wrote:
| I think vendor published benchmarks are fine if the dataset
| is open / accessible, the benchmark code is published, all
| software versions are disclosed, and the exact hardware is
| specified. I definitely wouldn't consider an audited TPC
| benchmark that's based on industry standard datasets /
| queries worthless in the data space. Disclosure: I work for
| Databricks.
| MrPowers wrote:
| Some high level context for those less familiar with the
| Lakehouse storage system space. For various reasons, several
| companies moved from data warehouses to data lakes starting
| around 7-10 years ago.
|
| Data lakes are better for ML / AI workloads, cheaper, more
| flexible, and separate compute from storage. With a data
| warehouse, you need to share compute with other users. With data
| lakes you can attach an arbitrary number of computational
| clusters to the data.
|
| Data lakes were limited in many regards. They were easily
| corrupted (no schema enforcement), required slow file listings
| when reading data, and didn't support ACID transactions.
|
| I'm on the Delta Lake team and will speak to some of the benefits
| of Delta Lake compared to data lakes:
|
| * Delta Lake supports ACID transactions, so Delta tables are
| harder to corrupt. The transaction log makes it easy to time
| travel, version datasets, and rollback to earlier versions of
| your data.
|
| * Delta Lake allows for schema enforcement & evolution
|
| * Delta Lake makes it easy to compact small files (big data
| systems don't like an excessive number of small files)
|
| * Delta Lake lets readers get files and skip files via the
| transaction log (much faster than a file listing). Z ORDERING the
| data makes reads even faster.
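|
| The log-replay idea behind time travel above can be sketched in
| plain Python (a toy model; the action format and file names are
| illustrative, not Delta's actual protocol):

```python
# Toy transaction log: each commit is an ordered list of add/remove
# actions. Delta Lake stores commits as numbered JSON files
# (00000.json, 00001.json, ...); here they are just list entries.
log = [
    [{"op": "add", "file": "part-0.parquet"}],                  # version 0
    [{"op": "add", "file": "part-1.parquet"}],                  # version 1
    [{"op": "remove", "file": "part-0.parquet"},                # version 2:
     {"op": "add", "file": "part-2.parquet"}],                  # a rewrite/compaction
]

def files_at_version(log, version):
    """Replay the log up to `version` to get the live file set."""
    live = set()
    for actions in log[: version + 1]:
        for a in actions:
            if a["op"] == "add":
                live.add(a["file"])
            else:
                live.discard(a["file"])
    return live

print(sorted(files_at_version(log, 1)))  # ['part-0.parquet', 'part-1.parquet']
print(sorted(files_at_version(log, 2)))  # ['part-1.parquet', 'part-2.parquet']
```

| Reading at an older version is just replaying a shorter prefix of
| the log, which is why rollback and dataset versioning come along
| for free once the log exists.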
|
| The Delta Lake protocol is implemented in a Scala library and
| exposed via PySpark, Scala Spark, and Java Spark bindings. This
| is the library most people think of when conceptualizing Delta
| Lake.
|
| There is also a Delta Lake Java Standalone library that's used to
| build other readers like the Trino & Hive readers.
|
| The Delta Rust project is another implementation of the Delta
| Lake protocol that is implemented in Rust. This library is
| accessible via Rust or Python bindings. Polars just added a Delta
| Lake reader with delta-rs and this library can also be used to
| easily read Delta Lakes into other DataFrames like pandas or
| Dask.
|
| Lots of DataFrame users are struggling with data lakes / single
| data files. They don't have any data skipping capabilities
| (unless Parquet file footers are read), their datasets are easily
| corruptible, and they don't have any schema enforcement / schema
| evolution / data versioning / etc. I expect the data community to
| accelerate the shift to Lakehouse storage systems as they learn
| about all of these advantages.
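|
| The data-skipping point above (consulting file-level stats rather
| than listing and scanning every file) can be illustrated with a
| toy min/max filter (illustrative only; real Delta stats live in
| the transaction log's JSON entries):

```python
# Each data file carries min/max statistics for a column. A reader
# opens only the files whose value range can overlap the query
# filter, instead of scanning everything.
files = [
    {"name": "part-0.parquet", "min_ts": 100, "max_ts": 199},
    {"name": "part-1.parquet", "min_ts": 200, "max_ts": 299},
    {"name": "part-2.parquet", "min_ts": 300, "max_ts": 399},
]

def files_for_range(files, lo, hi):
    """Keep only files whose [min, max] interval overlaps [lo, hi]."""
    return [f["name"] for f in files if f["min_ts"] <= hi and f["max_ts"] >= lo]

print(files_for_range(files, 250, 320))  # ['part-1.parquet', 'part-2.parquet']
```

| Z-ordering helps precisely here: clustering related values into
| the same files tightens each file's min/max ranges, so more files
| can be skipped.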
| vgt wrote:
| You're making a lot of assertions I am not sure I agree with:
|
| > Data lakes are better for ML / AI workloads, cheaper, more
| flexible, and separate compute from storage. With a data
| warehouse, you need to share compute with other users. With
| data lakes you can attach an arbitrary number of computational
| clusters to the data.
|
| - I am not sure it's any cheaper than BQ or Snowflake storage.
|
| - Modern CDW separates compute from storage.
|
| - I am not sure what you mean by "you need to share compute
| with others". Why?
|
| - You can attach an arbitrary number of "clusters" in BQ and
| Snowflake as well.
|
| Additionally, modern CDW provides a very high level of
| abstraction and a very high level of manageability. Their time
| travel and compaction actually work, and their storage systems
| are continuously optimized for optimal performance.
| AtlasBarfed wrote:
| Not that I know what anything means in "big data lake OLAP
| database" anymore, but I always thought a data lake implied a
| lot of hybrid sources/formats/structures for the data, but the
| advocacy here implies that the data is all ingested and
| reformatted, which to me is a data warehouse.
|
| But then again, data lake may simply be what a data warehouse
| is now called in marketspeak.
|
| Also, I stopped paying attention when the treadmill of new
| frameworks became unbearable to track. Is Spark now settled as
| the standard for distributed "processing", as in mapreduce /
| distributed query / distributed batch / etc?
|
| I get that performance can improve by unifying to a file format
| like parquet, but again that seems like a data warehouse. A
| data lake should be something over heterogenous sources with
| "drivers" or "adaptors" IMO, in particular because the
| restoration of the data inputs stays in the knowledge domain of
| the source production database maintainers.
| MrPowers wrote:
| You're understandably confused by the industry terminology
| that's ambiguous and morphing over time.
|
| Data lakes are typically CSV/JSON/ORC/Avro/Parquet files
| stored in a storage system (cloud storage like AWS S3 or
| HDFS). Data lakes are schema on read (the query engine gets
| the schema when reading the data).
|
| A data warehouse is something like Redshift that bundles
| storage and compute. You have to buy the storage and compute
| as a single package. Data warehouses are schema on write. The
| schema is defined when the table is created.
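|
| The schema-on-read vs. schema-on-write distinction can be
| sketched in a few lines of Python (a toy illustration; the
| type check stands in for the warehouse's enforcement):

```python
import csv
import io

# Schema on read: the raw file is just text; the reader decides the
# types at query time.
raw = "id,amount\n1,9.99\n2,19.50\n"
rows = list(csv.DictReader(io.StringIO(raw)))   # every value is a string here
typed = [{"id": int(r["id"]), "amount": float(r["amount"])} for r in rows]

# Schema on write: the table's types are fixed up front, and a bad
# row is rejected at insert time.
schema = {"id": int, "amount": float}

def insert_row(table, row):
    for col, typ in schema.items():
        if not isinstance(row[col], typ):
            raise TypeError(f"{col} must be {typ.__name__}")
    table.append(row)

table = []
insert_row(table, {"id": 3, "amount": 5.0})      # ok: matches the schema
# insert_row(table, {"id": "x", "amount": 5.0})  # would raise TypeError
```

| Lakehouse formats effectively bolt the second behavior onto the
| first kind of storage: the files stay in the lake, but writes are
| validated against a declared schema.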
|
| And yes, I'd say that Spark is generally considered the
| "standard" distributed data processing engine these days
| although there are alternatives.
| swyx wrote:
| The data lakehouse space is the most exciting evolution of the
| data warehousing industry imo! If anyone wants a comparison guide
| from a neutral party, we did one last year:
| https://airbyte.com/blog/data-lake-lakehouse-guide-powered-b...
|
| I cant paste images here but imo this table comparing the 3
| formats is the big takeaway https://assets-global.website-
| files.com/6064b31ff49a2d31e049... (explained inline, we do cite
| onehouse heavily but we are independent of them)
| joeharris76 wrote:
| I notice that they "borrowed" most of the table from you -
| kudos I guess. My problem with your post is that it presents
| the choice between Lakehouse formats as a feature checklist.
| All features are not created equally. All implementations are
| not equally easy to use.
|
| Personally, I find Brooklyn Data's post to be more practically
| useful since they used each format to do real work and show how
| they perform in practice. And they provide all their code so a
| reader can run the same testing themselves.
|
| https://brooklyndata.co/blog/benchmarking-open-table-formats
| glogla wrote:
| > Personally, I find Brooklyn Data's post to be more
| practically useful since they used each format to do real
| work and show how they perform in practice.
|
| First, you should probably add disclaimer that you work for
| Databricks.
|
| Second, they really did not.
|
| They came up with a test case of 1) create a table 2) merge
| into it 3) delete from it 4) delete from it 5) delete from it
| 6) delete from it (but call it a "GDPR request" in the chart).
|
| This is a completely absurd workload that should give anyone
| a pause. In real world, Lakehouse tables are either appended
| to, merged into with periodic compaction, or dropped and
| recreated (using something like dbt). Running a delete four
| times is just bizarre, and making a general statement out of
| it is just suspiciously biased.
|
| Of course it is a very interesting coincidence that this
| independent company came up with an unrealistic benchmark
| showing how much better Delta is than a major competitor, and
| that everyone from Databricks is sharing it everywhere.
| berkle4455 wrote:
| Data Lakehouse is where you store your data on object stores and
| spin up a bunch of instances (cpu/memory) as needed to crunch the
| data on the timeline you desire. It's incredible to me how
| "solutions" are continually invented that give cloud providers
| plenty of stages to charge for not only storage but movement and
| processing of your data too.
| glogla wrote:
| As opposed to what setup exactly? What do you propose everyone
| uses instead of Lakehouses?
| berkle4455 wrote:
| Disaggregated storage. You still have separation of compute
| and storage with the bandwidth and latency benefits of
| directly attached storage.
|
| Object stores are a terribly inefficient way to access and
| store changes of data.
| nostrebored wrote:
| Inefficient in what respect? It really depends on your
| access patterns, performance requirements, and the way data
| is structured and stored in your object store.
|
| - someone who thinks it's often a bad idea
| thinkharderdev wrote:
| It's definitely inefficient w/r/t performance (reading from
| Object Storage will always be several times slower than
| reading from disk/SSD) but the point is usually to minimize
| cost.
| anonymousDan wrote:
| I guess a setup such as hdfs where storage and compute are
| colocated and not disaggregated. But that also offers similar
| transactional semantics to lakehouses.
| KptMarchewa wrote:
| No, that's a data lake. A data lakehouse is a data lake where
| your objects are wrapped in a "table format" like Iceberg that
| allows you to query (and update, etc.) them as if they were
| stored in a traditional data warehouse.
| anonymousDan wrote:
| Why would you be charged for movement in this case, I thought
| intra datacenter traffic was free? Or you mean you get charged
| to update/query the object store?
___________________________________________________________________
(page generated 2023-01-11 23:01 UTC)