[HN Gopher] Databricks Unity Catalog
___________________________________________________________________
Databricks Unity Catalog
Author : jamesblonde
Score : 37 points
Date : 2021-05-27 08:22 UTC (1 days ago)
(HTM) web link (databricks.com)
(TXT) w3m dump (databricks.com)
| OldGoodNewBad wrote:
| That's pretty cool. We were relying on beaver dams upstream from
| our lake and a floodgate system going downstream, but the
| boathouse on the lakehouse property was getting inundated by
| rising water levels when there was a large inflow. After
| consulting with a data hydrologist we were able to maintain the
| data lake with enough security to enable year round data boating
| and even dataskiing. I'm available for vapid buzzword creation
| but you won't like my fees. Please talk to my off-roading
| accountant and he'll skywrite a quote.
| arminiusreturns wrote:
| Thanks I needed this laugh. I think I got bingo!
| aledalgrande wrote:
| You are so good I initially believed these were true words used
| in data engineering! XD
| tomnipotent wrote:
| Databricks does have a really bad habit of trying to use their
| product names as general terms, but other than that nothing in
| this post is particularly esoteric to data engineers, any more
| than DDD/CQRS/WET/DRY/SOLID/trees/nodes/links/references are to
| software engineers. Might as well pull out 20 year old jokes
| about snowflake and star schemas.
| mackatsol wrote:
| "dataskiing" Awesome. Sounds like a cyberpunk job description.
| markus_zhang wrote:
| Then there is a snowcrash.
| rexreed wrote:
| What's sadder than companies spewing this marketing
| gobbledygook are the customers and the consulting firms that
| buy into them. Enterprise (and government) buying decision-
| making is highly flawed and these smart enterprise software
| firms as well as government contractors know just the right
| words and just the right method to sell things that would
| otherwise not even pass a real sniff test.
| MikeDelta wrote:
| Even enterprise architects cannot get enough of this stuff,
| and you would expect them to know better.
| TeMPOraL wrote:
| I think that they're in a "when in Rome, talk like the
| Romans talk" kind of situation.
| tomnipotent wrote:
| Just because the problems Databricks solves aren't your
| problems, doesn't mean that they're not problems for other
| people or organizations. What's sadder are ignorant comments
| from non-domain experts surprised that tools exist for
| experts in other domains.
| jgalt212 wrote:
| The problem is everything is in the cloud even if it does
| not belong there.
| tomnipotent wrote:
| Sounds like it's your problem, not "the" problem. Many
| people and organizations are happily cloud-based, and
| Databricks offering customers in the cloud better options
| is a win-win. What's the issue?
| danmur wrote:
| Ha ha ha, even funnier after I read the article. I hate that
| kind of marketing.
| engineerbruh wrote:
| So what are companies supposed to call their products? Would
| you propose they just call this "Data Security Solution"?
| [deleted]
| syats wrote:
| After 5 minutes of reading through their website, I still don't
| understand what their actual product is. I like to think of
| myself as someone who is computer-savvy, working at a software
| company, designing systems that access several databases from
| several clients, using spark, aws, the works... and still,
| their website makes no fkn sense! Could someone translate,
| please?
|
| oh.. but look at the number of job openings these guys have
| across the world.. is this another marketing ploy? I am deeply
| bothered by these kinds of "products".
| reggieband wrote:
| I am absolutely no expert in any of these domains but the
| level of confusion described in these comments seems a little
| exaggerated. Is it so hard to see what is going on here?
|
| A data lake is a centralized repository where all of a
| company's data is aggregated. This allows analysts to perform
| queries against a single data source (often masquerading as a
| SQL database) rather than against 100s of distinct databases
| (which may be a hodgepodge of no-sql, sql, custom-rest-api,
| etc.). These "data lakes" often grow to a massive size since
| they will often not only include your application data
| (usually batch replicated from prod databases on some
| schedule or in some cases streamed directly) but also data
| from external sources (e.g. a feed from your payment
| processor, compressed events from your app/website analytics,
| server logs, marketing and advertising sources).
|
| Storing and processing that volume of data efficiently is a
| difficult task. Many companies decide to just dump that data
| in a raw format into cloud storage services like AWS S3. Third
| parties then built SQL-like interfaces that run on top of S3
| (or connectors from S3 into other familiar tools like
| Spark). This allows for low-cost storage while also allowing
| data analysts the ability to use tools they are already very
| familiar with. This way of handling large volumes of data
| stored for analysis has become very popular.
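|
| A toy Python sketch of the idea above: raw files dumped into
| storage, then queried through a SQL interface. Everything here is
| an assumption for illustration: a temp directory stands in for an
| S3 bucket, sqlite3 stands in for the query engine, and the file
| name, table name, and columns are made up. Real engines typically
| query the files in place rather than copying them into a database.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def write_raw_events(root: Path) -> None:
    """Dump events as JSON lines, the way raw data might land in storage."""
    events = [
        {"page": "/home", "ms": 120},
        {"page": "/home", "ms": 180},
        {"page": "/checkout", "ms": 400},
    ]
    text = "\n".join(json.dumps(e) for e in events)
    (root / "events.jsonl").write_text(text)


def load_as_table(root: Path) -> sqlite3.Connection:
    """Expose every raw .jsonl file under root as one SQL table, 'events'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (page TEXT, ms INTEGER)")
    for path in sorted(root.glob("*.jsonl")):
        rows = [json.loads(line) for line in path.read_text().splitlines() if line]
        # Named placeholders are filled from each dict's keys.
        conn.executemany("INSERT INTO events VALUES (:page, :ms)", rows)
    return conn


def average_latency_by_page() -> list:
    """Run an ordinary SQL aggregate over the raw files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_raw_events(root)
        conn = load_as_table(root)
        query = "SELECT page, AVG(ms) FROM events GROUP BY page ORDER BY page"
        result = conn.execute(query).fetchall()
        conn.close()
        return result


if __name__ == "__main__":
    print(average_latency_by_page())
```

| The analyst only ever writes the SELECT statement; where and how
| the raw files are stored stays hidden behind the table.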
|
| But now that you have so much data stored in S3 you might
| start to wonder how you can control access to it. An analyst
| doing queries on website performance might not require access
| to the payment processing data. Your security team might
| point out that your growing analyst team has more access to
| sensitive company data than is required. As you negotiate big
| corporate deals their security team might start to red-flag
| unnecessary access to data (or ask you for your policies
| governing access to that data and how those policies are
| enforced).
|
| This product seems to allow finer control over access to data
| stored in these kinds of data lakes. In the same way a bunch
| of tools appeared to create a SQL-like facade on top of the
| data, this tool creates a facade on top of data access
| control.
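|
| To make "a facade on top of data access control" concrete, here
| is a minimal Python sketch. It is not how the product works; the
| POLICY layout, the role names, the tables, and the read() helper
| are all hypothetical. It only shows the shape of the idea: every
| read goes through one function that checks a role's table and
| column grants first.

```python
# Hypothetical policy: role -> table -> columns that role may read.
POLICY = {
    "web_analyst": {"page_views": {"page", "ms"}},
    "finance": {
        "page_views": {"page", "ms"},
        "payments": {"order_id", "amount"},
    },
}

# Stand-in for tables stored in the data lake.
TABLES = {
    "page_views": [{"page": "/home", "ms": 120, "user_ip": "10.0.0.1"}],
    "payments": [{"order_id": 1, "amount": 9.99, "card_last4": "4242"}],
}


def read(role: str, table: str, columns: list) -> list:
    """Return the requested columns, or raise if the role lacks a grant."""
    allowed = POLICY.get(role, {}).get(table)
    if allowed is None:
        raise PermissionError(f"{role} may not read table {table}")
    denied = [c for c in columns if c not in allowed]
    if denied:
        raise PermissionError(f"{role} may not read {table} columns {denied}")
    return [{c: row[c] for c in columns} for row in TABLES[table]]


if __name__ == "__main__":
    print(read("web_analyst", "page_views", ["page", "ms"]))
    try:
        read("web_analyst", "payments", ["amount"])
    except PermissionError as err:
        print("denied:", err)
```

| The website analyst can query page timings but gets an error on
| the payments table, which is the scenario described above.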
|
| Not only is what they are doing completely understandable
| from a quick skim of the article, it also seems totally
| necessary. I have no doubt this is a massive market and this
| product has every chance to serve a real need.
| lmeyerov wrote:
| Super cool, this is a major need. At the public protocol level,
| doesn't feel like much has happened here in practice since
| posix/http/odbc, s3, & hdfs/parquet/arrow.
|
| I'm curious how this will play out as I don't think there's an
| open source impl + UI for it, and most contributors are from one
| company? As some relevant data points:
|
| - the pydata oss community has been centralizing on uniform data
| IO for the same aws/azure/etc. systems via pyarrow, fsspec, etc.,
| including high performance parquet support. Databricks uses some
| of that, so that's promising!
|
| - ... but no pydata standardization on the control plane, esp.
| wrt security. fsspec normalizes on "posix acls", if I remember
| right, so there is a real need the protocol+project can fill:
| adding modern ABAC/RLS/..., sync, etc.
|
| - ... and I can see why databricks wants to take control of the
| datalake protocol from aws/az/gcp (s3, ...), so this is super
| clever if they can keep control, or, as a second-best outcome,
| ensure no one else does!
|
| From an OSS adoption perspective, donating the protocol + self-
| hostable UI tiers to Apache + major non-databricks contributors
| would make me excited to use. Without that level of belief &
| health, as an architect, it feels like switching one lockin for
| another, so still a proprietary offer in practice.
|
| So I'm pretty excited as a user, interested in lightweight
| integration as a partner, and wondering what's in store for
| community-minded OSS governance for the confidence to go as deep
| as we did for Arrow, like contributing core code and betting our
| own infra on it.
| wokwokwok wrote:
| There's fair reason to be skeptical here.
|
| The way Databricks deals with user-level access control is by
| passing the credentials of the user who is using a cluster to
| the underlying access control layer (historically, table- and
| column-level permissions).
|
| ...but, it's not that simple. Since anyone on a cluster can do
| anything with that cluster, you need a cluster per user for this
| to work. This obviously doesn't scale.
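|
| A rough Python sketch of the problem as described above, not of
| any actual Databricks implementation. The GRANTS table and both
| classes are hypothetical. ClusterScoped models a shared cluster
| where only the cluster's own credential is checked; UserScoped
| models a check against the requesting user instead.

```python
from dataclasses import dataclass, field

# Hypothetical grants: principal -> tables that principal may read.
GRANTS = {
    "alice": {"payments", "page_views"},
    "bob": {"page_views"},
}


@dataclass
class ClusterScoped:
    """Every query runs under the credential the cluster was started with."""

    credential: str
    attached_users: set = field(default_factory=set)

    def can_read(self, user: str, table: str) -> bool:
        if user not in self.attached_users:
            return False
        # The requesting user is ignored; only the cluster credential counts.
        return table in GRANTS.get(self.credential, set())


@dataclass
class UserScoped:
    """Every query is checked against the requesting user's own grants."""

    attached_users: set = field(default_factory=set)

    def can_read(self, user: str, table: str) -> bool:
        return user in self.attached_users and table in GRANTS.get(user, set())


if __name__ == "__main__":
    shared = ClusterScoped(credential="alice", attached_users={"alice", "bob"})
    per_user = UserScoped(attached_users={"alice", "bob"})
    print(shared.can_read("bob", "payments"))
    print(per_user.can_read("bob", "payments"))
```

| In the cluster-scoped model, bob inherits alice's access as soon
| as he shares her cluster, so the only safe setup is one cluster
| per user.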
|
| I've used this a lot, and it sucks; the best you can do is offer
| "reporting clusters" that are for reading pre-canned reports and
| restricted in what they can see, and not allow anyone to do any
| actual work using notebooks.
|
| ...reading all the things about Unity, this seems like it's just
| applying the same approach to more types of data objects.
| rxin wrote:
| We have taken a very different approach with the Unity Catalog.
| It is designed opposite to the "cluster-centric" access control
| model, and will be user and role centric.
|
| Disclosure: I'm a Databricks cofounder and have contributed to
| the Unity Catalog.
___________________________________________________________________
(page generated 2021-05-28 23:02 UTC)