hngopher.com

       [HN Gopher] Databricks Unity Catalog
       ___________________________________________________________________
        
       Databricks Unity Catalog
        
       Author : jamesblonde
       Score  : 37 points
       Date   : 2021-05-27 08:22 UTC (1 days ago)
        
 (HTM) web link (databricks.com)
        
       | OldGoodNewBad wrote:
       | That's pretty cool. We were relying on beaver dams upstream from
       | our lake and a floodgate system going downstream, but the
       | boathouse on the lakehouse property was getting inundated by
       | rising water levels when there was a large inflow. After
       | consulting with a data hydrologist we were able to maintain the
       | data lake with enough security to enable year round data boating
       | and even dataskiing. I'm available for vapid buzzword creation
       | but you won't like my fees. Please talk to my off-roading
       | accountant and he'll skywrite a quote.
        
         | arminiusreturns wrote:
         | Thanks I needed this laugh. I think I got bingo!
        
         | aledalgrande wrote:
         | You are so good I initially believed these were true words used
         | in data engineering! XD
        
         | tomnipotent wrote:
         | Databricks does have a really bad habit of trying to use their
         | product names as general terms, but other than that nothing in
         | this post is any more esoteric to data engineers than
         | DDD/CQRS/WET/DRY/SOLID/trees/nodes/links/references are to
         | software engineers. Might as well pull out 20 year old jokes
         | about snowflake and star schemas.
        
         | mackatsol wrote:
         | "dataskiing" Awesome. Sounds like a cyberpunk job description.
        
           | markus_zhang wrote:
           | Then there is a snowcrash.
        
         | rexreed wrote:
         | What's sadder than companies spewing this marketing
         | gobbledygook are the customers and the consulting firms that
         | buy into them. Enterprise (and government) buying decision-
         | making is highly flawed and these smart enterprise software
         | firms as well as government contractors know just the right
         | words and just the right method to sell things that would
         | otherwise not even pass a real sniff test.
        
           | MikeDelta wrote:
           | Even enterprise architects cannot get enough of this stuff,
           | and you would expect them to know better.
        
             | TeMPOraL wrote:
             | I think that they're in a "when in Rome, talk like the
             | Romans talk" kind of situation.
        
           | tomnipotent wrote:
           | Just because the problems Databricks solves aren't your
           | problems, doesn't mean that they're not problems for other
           | people or organizations. What's sadder are ignorant comments
           | from non-domain experts surprised that tools exist for
           | experts in other domains.
        
             | jgalt212 wrote:
             | The problem is everything is in the cloud even if it does
             | not belong there.
        
               | tomnipotent wrote:
               | Sounds like it's your problem, not "the" problem. Many
               | people and organizations are happily cloud-based, and
               | Databricks offering customers in the cloud better options
               | is a win-win. What's the issue?
        
         | danmur wrote:
         | Ha ha ha, even funnier after I read the article. I hate that
         | kind of marketing.
        
         | engineerbruh wrote:
         | So what are companies supposed to call their products? Would
         | you propose they just call this "Data Security Solution"?
        
         | syats wrote:
         | After 5 minutes of reading through their website, I still don't
         | understand what their actual product is. I like to think of
         | myself as someone who is computer-savvy, working at a software
         | company, designing systems that access several databases from
         | several clients, using spark, aws, the works... and still,
         | their website makes no fkn sense! Could someone translate,
         | please?
         | 
         | oh.. but look at the amount of job openings these guys have
         | across the world.. is this another marketing ploy? I am deeply
         | bothered by these kinds of "products".
        
           | reggieband wrote:
           | I am absolutely no expert in any of these domains but the
           | level of confusion described in these comments seems a little
           | exaggerated. Is it so hard to see what is going on here?
           | 
           | A data lake is a centralized repository where all of a
           | company's data is aggregated. This allows analysts to perform
           | queries against a single data source (often masquerading as a
           | SQL database) rather than against 100s of distinct databases
           | (which may be a hodgepodge of no-sql, sql, custom-rest-api,
           | etc.). These "data lakes" often grow to a massive size since
           | they will often not only include your application data
           | (usually batch replicated from prod databases on some
           | schedule or in some cases streamed directly) but also data
           | from external sources (e.g. a feed from your payment
           | processor, compressed events from your app/website analytics,
           | server logs, marketing and advertising sources).
           | 
           | Storing and processing that volume of data efficiently is a
           | difficult task. Many companies decide to just dump that data
           | in a raw format into cloud storage services like AWS S3. Then
           | some third parties built SQL-like interfaces that run on top of
           | S3 (or connectors from S3 into other familiar tools like
           | Spark). This allows for low-cost storage while also letting
           | data analysts use tools they are already very
           | familiar with. This way of handling large volumes of data
           | stored for analysis has become very popular.
           | 
           | But now that you have so much data stored in S3 you might
           | start to wonder how you can control access to it. An analyst
           | doing queries on website performance might not require access
           | to the payment processing data. Your security team might
           | point out that your growing analyst team has more access to
           | sensitive company data than is required. As you negotiate big
           | corporate deals their security team might start to red-flag
           | unnecessary access to data (or ask you for your policies
           | governing access to that data and how those policies are
           | enforced).
           | 
           | This product seems to allow finer control over access to data
           | stored in these kinds of data lakes. In the same way a bunch
           | of tools appeared to create a SQL-like facade on top of the
           | data, this tool creates a facade on top of data access
           | control.
           | 
           | Not only is what they are doing completely understandable
           | from a quick skim of the article, it also seems totally
           | necessary. I have no doubt this is a massive market and this
           | product has every chance to serve a real need.
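
A minimal sketch of the access-control facade described in the comment above, assuming a hypothetical in-memory catalog. `Catalog`, `register`, `grant`, and `read` are illustrative names, not Databricks APIs; the only point is that every read goes through one permission check instead of hitting storage directly.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical facade: table data plus per-table role grants."""
    tables: dict = field(default_factory=dict)  # table name -> list of rows
    grants: dict = field(default_factory=dict)  # table name -> set of roles

    def register(self, table, rows):
        # Store the rows and start the table with no grants at all.
        self.tables[table] = rows
        self.grants.setdefault(table, set())

    def grant(self, table, role):
        # Allow any holder of `role` to read `table`.
        self.grants.setdefault(table, set()).add(role)

    def read(self, table, user_roles):
        # Deny unless the caller holds at least one role granted on the table.
        if not self.grants.get(table, set()) & set(user_roles):
            raise PermissionError(f"no grant on {table!r}")
        return self.tables[table]


# Assumed example data: one table per concern, one role per team.
catalog = Catalog()
catalog.register("web_perf", [{"page": "/", "ms": 120}])
catalog.register("payments", [{"order": 1, "amount": 9.99}])
catalog.grant("web_perf", "analyst")
catalog.grant("payments", "finance")
```

Here an analyst can call `catalog.read("web_perf", ["analyst"])`, while `catalog.read("payments", ["analyst"])` raises `PermissionError`, which mirrors the website-performance versus payment-data example in the comment.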
        
       | lmeyerov wrote:
       | Super cool, this is a major need. At the public protocol level,
       | doesn't feel like much has happened here in practice since
       | posix/http/odbc, s3, & hdfs/parquet/arrow.
       | 
       | I'm curious how this will play out as I don't think there's an
       | open source impl + UI for it, and most contributors are from one
       | company? As some relevant data points:
       | 
       | - the pydata oss community has been centralizing on uniform data
       | IO for the same aws/azure/etc. systems via pyarrow, fsspec, etc.,
       | including high performance parquet support. Databricks uses some
       | of that, so that's promising!
       | 
       | - ... but no pydata standardization on the control plane, esp.
       | wrt security. fsspec normalizes on "posix acls", if I remember
       | right: it's a real need the protocol+project can fill. Adding
       | modern ABAC/RLS/..., sync, ... .
       | 
       | - ... and I can see why databricks wants to take control of the
       | datalake protocol from aws/az/gcp (s3, ...), so this is super
       | clever if they can keep control, and, as a second-best outcome,
       | ensure no one else does!
       | 
       | From an OSS adoption perspective, donating the protocol + self-
       | hostable UI tiers to Apache + major non-databricks contributors
       | would make me excited to use it. Without that level of belief &
       | health, as an architect, it feels like switching one lock-in for
       | another, so still a proprietary offer in practice.
       | 
       | So I'm pretty excited as a user, interested in lightweight
       | integration as a partner, and wondering what's in store for
       | community-minded OSS governance for the confidence to go as deep
       | as we did for Arrow, such as contributing core code and betting
       | our own infra on it.
        
       | wokwokwok wrote:
       | There's fair reason to be skeptical here.
       | 
       | The way databricks deals with user level access control is by
       | passing the credentials of the user on a cluster through to the
       | underlying access control layer (historically, table and column
       | level permissions).
       | 
       | ...but, it's not that simple. Since anyone on a cluster can do
       | anything with that cluster, you need a cluster per user for this
       | to work. This obviously doesn't scale.
       | 
       | I've used this a lot, and it sucks; the best you can do is offer
       | "reporting clusters" that are for reading pre-canned reports and
       | restricted in what they can see, and not allow anyone to do any
       | actual work using notebooks.
       | 
       | ...reading all the things about unity, this seems like it's just
       | applying the same approach to more types of data objects.
        
         | rxin wrote:
         | We have taken a very different approach with the Unity Catalog.
         | It is designed opposite to the "cluster-centric" access control
         | model, and will be user and role centric.
         | 
         | Disclosure: I'm a Databricks cofounder and have contributed to
         | the unity catalog.
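
A toy contrast between the two models discussed in this subthread, under loud assumptions: the dictionaries and function names below are hypothetical and say nothing about how Databricks implements either approach. In the cluster-centric check the cluster's credentials decide, so everyone attached to it sees the same tables; in the user-centric check the caller's own grants decide, whichever cluster runs the query.

```python
# Toy model only: all names are hypothetical, not Databricks APIs.
CLUSTER_CREDENTIALS = {"shared-cluster": {"sales", "payments"}}
USER_GRANTS = {"alice": {"sales"}, "bob": {"sales", "payments"}}


def can_read_cluster_centric(cluster, user, table):
    # The cluster's credentials decide; `user` is deliberately ignored,
    # which is why isolation would need one cluster per user.
    return table in CLUSTER_CREDENTIALS.get(cluster, set())


def can_read_user_centric(cluster, user, table):
    # The user's own grants decide; `cluster` is deliberately ignored,
    # so a shared cluster no longer widens anyone's access.
    return table in USER_GRANTS.get(user, set())
```

With this data, alice can read `payments` under the cluster-centric check on `shared-cluster` but not under the user-centric one, while bob can read it under both.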
        
       ___________________________________________________________________
       (page generated 2021-05-28 23:02 UTC)