[HN Gopher] What Is the Data Lakehouse Pattern?
       ___________________________________________________________________
        
       What Is the Data Lakehouse Pattern?
        
       Author : benjaminwootton
       Score  : 16 points
       Date   : 2021-09-14 20:54 UTC (2 hours ago)
        
 (HTM) web link (timeflow.systems)
 (TXT) w3m dump (timeflow.systems)
        
       | mason55 wrote:
       | The (I believe) original Lakehouse paper is here:
       | http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
        
       | bob1029 wrote:
       | Is this new lake house going to have its own pool too?
       | 
       | We need to go a little bit deeper. I can sense that we are just a
       | few steps away from circling all the way back around to fancy
       | terminology for "Postgresql installed on a big server".
        
         | Zababa wrote:
         | You could go pretty far with lakes and islands
         | https://en.wikipedia.org/wiki/Recursive_islands_and_lakes#Is...
        
           | bob1029 wrote:
           | Fantastic. I was concerned that this might be a thing.
        
       | Hackbraten wrote:
       | Too buzzwordy for me.
        
         | bostonsre wrote:
         | The term definitely makes me cringe but it describes a solid
         | and useful design evolution.
        
         | paulddraper wrote:
         | Unstructured predictive big data algorithm unsupervised deep ML
         | internet of things mining
        
           | blacktriangle wrote:
           | Is it wrong I want to invest in your pitch?
        
             | rubiquity wrote:
             | Personally, I think they should invest in their pitch.
        
           | tragomaskhalos wrote:
           | Add 'blockchain' in there and my cheque will be straight in
           | the post
        
         | benjaminwootton wrote:
         | As I mention at the end of the article, it's definitely an
         | almost laughable buzzword. I suspect whoever created it almost
         | felt awkward to use it. I do however think there is concrete
         | meaning behind it.
         | 
         | Databricks and Snowflake are both pulling it off, giving a
         | combination of an RDBMS-like experience and a data lake
         | experience. If it can be pulled off, it strips away a lot of
         | complexity in how big companies manage data.
        
       | gigatexal wrote:
       | datalake = spark or presto on top of s3.
       | 
       | datalakehouse: I have no idea.
        
         | bostonsre wrote:
         | datalake + data warehouse = lakehouse
        
       | glogla wrote:
       | "Data Lakehouse" is sadly a term ruined by AWS.
       | 
       | It used to mean "data lake extended to support data warehouse use
       | cases".
       | 
       | So something like HDFS or S3 with Delta (from DBX) or Apache
       | Iceberg storage formats, utilizing Spark or Presto/Trino or
       | something for compute. One unified platform built on scalable big
       | data technologies, that can do transactions, SQL MERGE, smart
       | partitioning and other bells and whistles.
       | 
       | Then AWS decided to unveil "AWS LakeHouse" which meant you have
       | both S3 and Redshift and use both at the same time - lake and
       | warehouse next to each other.
       | 
       | This is not what lakehouse meant until then. It is also terrible
       | design - having data in two places means you now have to
       | implement access control, logging, auditing, data access and so
       | on twice. You also have to sync data between the two storages,
       | keep track of what is where and keep track of what is the single
       | source of truth.
       | 
       | Truly idiotic design / marketing that could only have come from
       | AWS. But since any larger company has an army of "enterprise
       | architects" who went from "nobody was ever fired for recommending
       | IBM" to "nobody was ever fired for recommending Oracle" to
       | "nobody was ever fired for recommending AWS" who will just
       | internally enforce whatever bullshit vendors push on them ...
       | it is almost what "lakehouse" means nowadays.
       | 
       | AWS truly is the Oracle of the 2020s. Fuck them.
       | 
       | (Rant over, sorry, got carried away)
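
The original definition in the comment above (one platform on object storage that can do transactions and SQL MERGE) can be illustrated with a toy model. This is only a sketch under stated assumptions: `TinyTable` and its methods are hypothetical names, the "storage" is an in-memory list of immutable snapshots, and nothing here reflects how Delta or Iceberg are actually implemented. It shows the idea of building a new version off to the side and publishing it with a single commit step, so a failed MERGE leaves readers on the old version.

```python
from dataclasses import dataclass, field


@dataclass
class TinyTable:
    """Hypothetical toy table: a list of immutable snapshots (version 0 is empty)."""
    snapshots: list = field(default_factory=lambda: [{}])

    def current(self):
        """Return the latest committed snapshot, keyed by primary key."""
        return self.snapshots[-1]

    def merge(self, rows, key="id"):
        """Upsert rows: update when the key matches, insert when it does not.

        The new version is built as a copy and appended only at the end,
        so an error part-way through leaves the committed table unchanged.
        """
        new_version = {k: dict(v) for k, v in self.current().items()}
        for row in rows:
            if key not in row:
                raise ValueError(f"row is missing key column {key!r}")
            existing = new_version.get(row[key], {})
            new_version[row[key]] = {**existing, **row}
        self.snapshots.append(new_version)  # the single "commit" step
        return len(self.snapshots) - 1      # new version number


table = TinyTable()
table.merge([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
table.merge([{"id": 2, "name": "B"}, {"id": 3, "name": "c"}])
print(table.current())
```

Older snapshots stay readable after each commit, which is the property that lets a single copy of the data serve both lake-style and warehouse-style readers in this toy model.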
        
         | atwebb wrote:
         | > It is also terrible design - having data in two places means
         | you now have to implement access control, logging, auditing,
         | data access and so on twice.
         | 
         | I've understood and implemented it differently. With Spectrum (or
         | Polybase for SQL Server / Synapse), you can extend into the
         | data lake. Copy over aggregate/curated data or something you
         | need to run special use cases on. Leave the structured, columnar
         | data within the cheap storage. You pay per scan but it is cheap
         | (at least to a point).
         | 
         | Also, Databricks took the Lakehouse moniker and sprinted with
         | it. AWS was late to the game from what I saw (at least for
         | marketing terminology adoption).
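
The "pay per scan" point in the comment above can be made concrete with a little arithmetic. This is a rough sketch, not a pricing calculator: the `scan_cost` helper, the column sizes, and the `PRICE_PER_TB` rate are all assumptions chosen for illustration, and real engines bill with their own rules. It assumes a columnar layout where a query reads only the columns it references.

```python
TB = 1024 ** 4  # bytes per terabyte (binary convention, an assumption here)


def scan_cost(column_bytes, columns_read, price_per_tb):
    """Cost of one query that reads only `columns_read` from a columnar table."""
    scanned = sum(column_bytes[name] for name in columns_read)
    return scanned / TB * price_per_tb


# Hypothetical table: per-column on-disk sizes in bytes.
events = {"user_id": 2 * TB, "event_type": 1 * TB, "payload": 17 * TB}
PRICE_PER_TB = 5.0  # placeholder rate, not a quoted vendor price

narrow = scan_cost(events, ["user_id", "event_type"], PRICE_PER_TB)
full = scan_cost(events, list(events), PRICE_PER_TB)
print(f"narrow query: {narrow:.2f}, full scan: {full:.2f}")
# prints "narrow query: 15.00, full scan: 100.00"
```

Under these made-up numbers, an occasional narrow query is cheap, while repeated full scans add up quickly, which matches the "at least to a point" caveat.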
        
       ___________________________________________________________________
       (page generated 2021-09-14 23:01 UTC)