[HN Gopher] What Is the Data Lakehouse Pattern?
___________________________________________________________________
What Is the Data Lakehouse Pattern?
Author : benjaminwootton
Score : 16 points
Date : 2021-09-14 20:54 UTC (2 hours ago)
(HTM) web link (timeflow.systems)
(TXT) w3m dump (timeflow.systems)
| mason55 wrote:
| The (I believe) original Lakehouse paper is here:
| http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
| bob1029 wrote:
| Is this new lake house going to have its own pool too?
|
| We need to go a little bit deeper. I can sense that we are just a
| few steps away from circling all the way back around to fancy
| terminology for "PostgreSQL installed on a big server".
| Zababa wrote:
| You could go pretty far with lakes and islands
| https://en.wikipedia.org/wiki/Recursive_islands_and_lakes#Is...
| bob1029 wrote:
| Fantastic. I was concerned that this might be a thing.
| Hackbraten wrote:
| Too buzzwordy for me.
| bostonsre wrote:
| The term definitely makes me cringe but it describes a solid
| and useful design evolution.
| paulddraper wrote:
| Unstructured predictive big data algorithm unsupervised deep ML
| internet of things mining
| blacktriangle wrote:
| Is it wrong I want to invest in your pitch?
| rubiquity wrote:
| Personally, I think they should invest in their pitch.
| tragomaskhalos wrote:
| Add 'blockchain' in there and my cheque will be straight in
| the post
| benjaminwootton wrote:
| As I mention at the end of the article, it's definitely an
| almost laughable buzzword. I suspect whoever created it almost
| felt awkward to use it. I do however think there is concrete
| meaning behind it.
|
| Databricks and Snowflake are both pulling it off, giving a
| combination of an RDBMS-like experience and a data lake
| experience. If it can be pulled off, it strips away a lot of
| complexity in how big companies manage data.
| gigatexal wrote:
| datalake = spark or presto on top of s3.
|
| datalakehouse: I have no idea.
| bostonsre wrote:
| datalake + data warehouse = lakehouse
| glogla wrote:
| "Data Lakehouse" is sadly a term ruined by AWS.
|
| It used to mean "data lake extended to support data warehouse use
| cases".
|
| So something like HDFS or S3 with Delta (from DBX) or Apache
| Iceberg storage formats, utilizing Spark or Presto/Trino or
| something for compute. One unified platform built on scalable big
| data technologies, that can do transactions, SQL MERGE, smart
| partitioning and other bells and whistles.
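|
| For anyone unsure what MERGE buys you here, a minimal sketch of the
| row-level semantics in plain Python. This is not the Delta or
| Iceberg API; the merge function, the table dict and the sample rows
| are all hypothetical. The real formats apply the same match-then-
| update-or-insert logic as one atomic transaction over files in
| object storage, which this toy version does not attempt.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]


def merge(target: Dict[int, Row], source: Iterable[Tuple[int, Row]]) -> Tuple[int, int]:
    """Upsert source rows into target by key; return (updated, inserted)."""
    updated = inserted = 0
    for key, row in source:
        if key in target:
            # WHEN MATCHED: overwrite only the columns present in the source row.
            target[key] = {**target[key], **row}
            updated += 1
        else:
            # WHEN NOT MATCHED: insert the source row as a new record.
            target[key] = dict(row)
            inserted += 1
    return updated, inserted


# Hypothetical sample data: an existing table and a batch of changes.
table = {
    1: {"name": "ann", "plan": "free"},
    2: {"name": "bob", "plan": "free"},
}
changes = [
    (2, {"plan": "pro"}),                  # matches key 2, so it updates
    (3, {"name": "cy", "plan": "free"}),   # no match, so it inserts
]
counts = merge(table, changes)
```

| Without MERGE on the lake you end up rewriting whole partitions by
| hand to get the same effect, which is the gap the table formats
| close.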
|
| Then AWS decided to unveil "AWS LakeHouse" which meant you have
| both S3 and Redshift and use both at the same time - lake and
| warehouse next to each other.
|
| This is not what lakehouse meant until then. It is also terrible
| design - having data in two places means you now have to
| implement access control, logging, auditing, data access and so
| on twice. You also have to sync data between the two stores,
| keep track of what is where, and keep track of which is the
| single source of truth.
|
| Truly idiotic design / marketing that could only have come from
| AWS. But since any larger company has an army of "enterprise
| architects" who went from "nobody was ever fired for recommending
| IBM" to "nobody was ever fired for recommending Oracle" to
| "nobody was ever fired for recommending AWS" and who will just
| internally enforce whatever bullshit vendors push on them ...
| it is almost what "lakehouse" means nowadays.
|
| AWS truly is the Oracle of the 2020s. Fuck them.
|
| (Rant over, sorry, got carried away)
| atwebb wrote:
| > It is also terrible design - having data in two places means
| you now have to implement access control, logging, auditing,
| data access and so on twice.
|
| I've understood and implemented it differently. With Spectrum (or
| PolyBase for SQL Server / Synapse), you can extend into the data
| lake. Copy over aggregate/curated data or anything you need for
| special use cases. Leave the structured, columnar data within the
| cheap storage. You pay per scan but it is cheap (at least to a
| point).
|
| Also, Databricks took the Lakehouse moniker and sprinted with
| it. AWS was late to the game from what I saw (at least for
| marketing terminology adoption).
___________________________________________________________________
(page generated 2021-09-14 23:01 UTC)