[HN Gopher] Databricks is an RDBMS
___________________________________________________________________
Databricks is an RDBMS
Author : georgewfraser
Score : 95 points
Date : 2021-02-01 19:21 UTC (3 hours ago)
(HTM) web link (fivetran.com)
(TXT) w3m dump (fivetran.com)
| anonu wrote:
| So is Databricks like AWS RDS with some Lambda functions built
| around it?
| pbourke wrote:
| No, Databricks is a distribution of Apache Spark with some
| value-added features such as the Delta Lake data format and a
| clean UI for hosting notebooks, doing cluster admin, etc
| fs111 wrote:
| delta lake is open source too btw
| rymurr wrote:
| only in name. check out the lack of traction for a lot of
| PRs on https://github.com/delta/delta-oss . Iceberg
| (https://iceberg.apache.org) is comparable but actually
| OSS.
| rymurr wrote:
| Check out
| https://searchdatamanagement.techtarget.com/news/252495619/A...
| the convergence between data lake and data warehouse idea is
| starting to spread rapidly.
| jpau wrote:
| I'm deeply disappointed in Databricks as an RDBMS.
|
| As a DS/DE, there's a lot to love (not all, but a lot). The easy
| provision of Spark clusters. The jobs API. DeltaLake (mostly).
| Easy notebooks (please don't create a prod system from these..).
| And Spark itself continues to improve, albeit in an increasingly
| crowded field.
|
| But I've worked closely with BigCo SQL analysts on Azure
| Databricks, and their experience was terrible. For example:
| - You cannot browse the data structure without an active cluster
| - Starting a cluster can take ~5 minutes and, since you missed
| that moment, you may not submit your first query until 10-15
| minutes.
| - The SQL error messages are often (perhaps usually?) nonsense,
| so you have to operate without them.
| - An unfortunate amount of downtime, followed by bizarre excuses.
| - It's so darn slow, relative to equivalent queries on BigQuery
| or Snowflake.
| - Even submitting a query can take a weird amount of time.
|
| If Databricks-as-an-RDBMS were competing against Teradata, sure,
| let's have a chat.
|
| But we're in 2021, and there's just no comparing the experience
| of the SQL analyst on Databricks-as-an-RDBMS vs.
| Snowflake/BigQuery.
|
| I'm excited for the potential of Snowflake's SnowPark (though
| know little about it). Calling UDFs from SQL means you can create
| great features for SQL analysts, provided that they can build the
| momentum to need it.
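|
| A minimal sketch of the "call a UDF from SQL" idea, using Python's
| built-in sqlite3 module rather than SnowPark (whose API is not
| shown in this thread). The normalise_email function and the users
| table are hypothetical, made up purely for illustration.

```python
import sqlite3


def normalise_email(raw):
    """Hypothetical UDF: trim whitespace and lower-case an email address."""
    return raw.strip().lower() if raw is not None else None


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as normalise_email(x).
conn.create_function("normalise_email", 1, normalise_email)

conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "  Alice@Example.com "), (2, "BOB@example.COM"), (3, None)],
)

# The analyst only writes SQL; the logic lives in the registered function.
rows = conn.execute(
    "SELECT id, normalise_email(email) FROM users ORDER BY id"
).fetchall()
print(rows)
```

| The point is the division of labour: someone writes the function
| once, and SQL analysts call it like any built-in.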
| mrbungie wrote:
| I've seen and have compared Databricks clusters to a 10-15yo
| Teradata cluster and no way in hell I would use Databricks.
|
| Teradata is a lot faster for interactive workloads than
| Databricks.
|
| PS: I agree there's no comparing Databricks vs
| Snowflake/BigQuery.
| nattaylor wrote:
| I got excellent performance in Databricks with well partitioned
| Parquet and Spark 2.4. What is making the queries slow? Data
| scanning?
| jpau wrote:
| They use DeltaLake + Spark 3.0, and are mostly careful to
| partition well.
|
| Their datasets are small. Most tables are ~50GB, the odd
| table up to ~2TB. The clusters typically are nothing shabby
| for this size, defaults to ~[4-12]x32GB.
|
| The queries that I have seen are typically not written well.
| Think view-on-view-on-view (there's a BigCo policy against
| them materialising data..), and where the filter is applied
| in the last step. The stuff of horrors, but something I've
| seen in more-than-one-BigCo.
|
| But we have compared some of those same queries on BigQuery
| vs. Databricks, and, I don't know if BigQuery's execution
| optimiser is better? Or if the BigQuery storage is better
| organising the data? Or if BigQuery is simply throwing more
| resource their way?
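|
| To illustrate why a filter applied in the last step hurts, here is
| a toy sketch in plain Python. It says nothing about how Spark or
| BigQuery actually plan queries; it only counts the work done when
| an engine does not move the filter earlier. All names and data
| (run_pipeline, the stages, the rows) are hypothetical.

```python
def run_pipeline(rows, stages, predicate, filter_first):
    """Apply each stage to every row; return (result, stage calls made)."""
    work = 0
    if filter_first:
        rows = [r for r in rows if predicate(r)]
    for stage in stages:
        out = []
        for r in rows:
            out.append(stage(r))
            work += 1
        rows = out
    if not filter_first:
        rows = [r for r in rows if predicate(r)]
    return rows, work


# 1000 rows, of which every tenth belongs to the region we care about.
data = [{"region": "EU" if i % 10 == 0 else "US", "amount": i} for i in range(1000)]

# Three "views" stacked on each other; none of them touches `region`,
# so filtering before or after gives the same answer.
stages = [
    lambda r: {**r, "amount": r["amount"] * 2},
    lambda r: {**r, "amount": r["amount"] + 1},
    lambda r: {**r, "large": r["amount"] > 500},
]


def is_eu(r):
    return r["region"] == "EU"


late_rows, late_work = run_pipeline(data, stages, is_eu, filter_first=False)
early_rows, early_work = run_pipeline(data, stages, is_eu, filter_first=True)

print(late_rows == early_rows, late_work, early_work)
```

| Same result either way, but the late filter runs every stage over
| all rows instead of only the matching tenth.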
| agambrahma wrote:
| Sigma Computing (https://www.sigmacomputing.com) might be a
| good fit here too.
|
| (disclaimer: plug)
| bkandel wrote:
| Yes, and I would add to this the (nearly) complete lack of IDE
| support makes working with Spark SQL quite painful.
| kfk wrote:
| These are great innovations but can we please take a moment to
| realize 99% of companies are still stuck with a blend of Excel,
| Access and SQL Server? Why is adoption of this new tech so poor?
| Maybe it has something to do with the amount of confusion all the
| sales pitches about data lakes, big data, ai and company are
| generating
| phoe-krk wrote:
| > Why is adoption of this new tech so poor?
|
| Because it's unnecessary to those 99% of companies. If company
| data fits in an Excel spreadsheet, Access database, or a single
| MySQL/Postgres instance, then introducing all this new tech
| with all of the associated costs and little return gain is a
| net loss.
| vmsp wrote:
| I know Excel runs the world but are there really that many
| people using Access?
| QuesnayJr wrote:
| I actually think the effect of all of this hype will move
| people past the Excel/Access era. There are simple analyses
| every company could do with R or scikit-learn that would save
| or make them money, and they just don't know how. Someone with
| AI expertise is over-qualified to do this, but they are at
| least qualified.
| t0mas88 wrote:
| Because most data isn't "big" data. If it fits in Excel on a
| laptop, why bother to roll out a distributed data lake system
| like Spark with all the associated ops work.
| hztar wrote:
| The ones I have been talking to stick to their Excel because
| their little part of the puzzle can be solved by Excel. These
| technologies usually demand data to be collected, but the
| amount of incentive to do so is low in any classic balkanized
| F500 organization.
| bpodgursky wrote:
| Hmm, I agree there's a market here, but I don't know why I
| wouldn't just use Snowflake or Bigquery if what I really wanted
| was a big-data RDBMS.
|
| Everyone I know who uses Databricks (and they all like it) uses it
| as hosted Spark with S3 integrations, or write... directly to
| Snowflake. I'm a little skeptical they're going to get traction
| as a true data lake model
| georgewfraser wrote:
| It's a lot simpler to use a single system as both your data
| lake, and your data warehouse. As Databricks gets better and
| better at the core data warehouse features, it becomes feasible
| to use it for both. Meanwhile, Snowflake and BQ are coming from
| the other direction, implementing data lake features. AWS's
| strategy seems to be to just make it easier to have 2 systems and
| move data back and forth.
| MikeDelta wrote:
| Their Delta also has SSD caching, which turns out to be logic
| that stores a local copy of the file you queried for faster re-
| query. Going to call my lru cache function like that as well...
|
| My company loves them. I think they only do a few things well,
| of which marketing is the best, and are not worth the money for
| data science teams with devops skills. Happy to hear from
| others if I am wrong.
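|
| For reference, the "local copy of the file you queried" behaviour
| described above can be sketched as a small LRU cache. This is not
| Databricks' implementation; LocalFileCache, the fetch callback and
| the file names are hypothetical, and a dict stands in for remote
| object storage.

```python
from collections import OrderedDict


class LocalFileCache:
    """Keep the most recently read files locally, evicting the oldest."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch          # callable: path -> bytes (the slow remote read)
        self._capacity = capacity    # maximum number of files kept locally
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._entries:
            # Cache hit: mark as most recently used and skip the remote read.
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]
        # Cache miss: fetch remotely, store locally, evict the least recent.
        self.misses += 1
        data = self._fetch(path)
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)
        return data


remote = {"a.parquet": b"A", "b.parquet": b"B", "c.parquet": b"C"}
cache = LocalFileCache(remote.__getitem__, capacity=2)

for path in ["a.parquet", "a.parquet", "b.parquet", "c.parquet", "a.parquet"]:
    cache.read(path)

print(cache.hits, cache.misses)
```

| The second read of a.parquet is served locally; by the time it is
| read a third time it has been evicted to make room for c.parquet.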
| pram wrote:
| Depends. If you only have a couple workspaces, then no. It's
| worth the money for a company with lots of data science
| teams, and one team who is janitoring all the workspace
| infra. Our company has 20 workspaces for 10 teams already and
| it would be a nightmare if we expected everyone to fix and
| manage their own AWS stuff.
| snidane wrote:
| Snowflake and Bigquery will bite you in the ass later on. You
| can do 80% of the things you will need - which is great for
| some newbie stuff or for sales presentations. Once you need
| something complicated, you're on your own, while being stuck in
| a proprietary environment that you cannot extend.
|
| You will have to develop some kind of data lake to store
| unstructured data anyway. You will end up with a Snowflake data
| warehouse and a data lake. Why not just go with data lake first
| then.
|
| Databricks/Spark are just good platforms to help you do
| something with structured data in your lake. With the recent
| additions to its execution engine and Delta (strange naming
| tbh) it will be pretty much the same as Snowflake for you.
| mrbungie wrote:
| BigQuery/Snowflake can process Parquet and multiple other
| formats in Object Storage. You can use them more "freely" if
| you keep your raw data in open formats.
|
| You need something more complicated than what can be done
| using BigQuery/Snowflake (that remaining 20%, though I would
| say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever
| and process it with anything, including
| Dataproc/HDInsight/EMR or even Databricks. That's actually a
| common pattern.
| willvarfar wrote:
| The reason Snowflake has such a high market cap is that its
| customers aren't paying much now, but they'll be paying and
| paying monthly forever. It's lock-in on a massive scale.
|
| Delta lake is something you can run on data you feel you still
| have some semblance of control over.
| ACow_Adonis wrote:
| data lake, delta lake, lake house. snowflake, snowpark. cloud,
| data warehouse. I just want to take a second to thank these
| companies for trying to turn my profession (data science +
| analytics) into the living hell of your every day enterprise and
| tech culture cluster-fuck (pun not intended).
|
| I'm still coming to terms with the fact that there's actually a
| technology called Kafka...
|
| /sorry, I'm just particularly bitter after having to do some
| Databricks training yesterday... it was basically 80% reciting
| the sales literature about why they're so great/enterprisey
| and parroting company and platform-specific jargon. I don't like
| seeing my profession turn into an obsession with tech and
| platforms when 99.9% of companies and people can't reason
| properly or operate their current tools/resources efficiently.
| obviously just my own opinion.
| arafa wrote:
| The training is quite heavy on the sales pitch, I agree (and
| expensive). There were some useful bits if you dig around
| though.
| MikeDelta wrote:
| I read a wonderful comment in another thread the other day. It
| was about the obsession of devs to collect a whole range of
| technologies on their resume. It went something like: "When I
| hire a carpenter I won't hire him for what he has in his
| toolbox. I want to know what he can do with only a hammer and
| chisel (= Linux machine). The rest he can learn."
|
| I will try to look up the link and obviously 'he' can be 'she'
| as well.
| willvarfar wrote:
| The key gap as I see it is that Databricks doesn't support multi-
| table transactions. And that is why you can't treat it as though
| it is a drop-in functional replacement for an RDBMS.
|
| (Why no vector clocks in the manifest files or something?)
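|
| For anyone unsure what a multi-table transaction buys you, here is
| a minimal sketch using Python's built-in sqlite3 (chosen only
| because it is easy to run; it is not Databricks or Delta). The
| orders/stock schema and place_order helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute(
    "CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0))"
)
conn.execute("INSERT INTO stock VALUES ('widget', 1)")
conn.commit()


def place_order(conn, item):
    """Write to both tables atomically: either both changes land or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
            conn.execute("UPDATE stock SET qty = qty - 1 WHERE item = ?", (item,))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the orders insert was rolled back too.
        return False


first = place_order(conn, "widget")   # stock goes from 1 to 0
second = place_order(conn, "widget")  # would go negative, so nothing is written

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_left = conn.execute("SELECT qty FROM stock WHERE item = 'widget'").fetchone()[0]
print(first, second, order_count, stock_left)
```

| Without a transaction spanning both tables, the failed second
| attempt could leave an order row with no matching stock change.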
| georgewfraser wrote:
| Interestingly, BigQuery is also missing multi-table
| transactions. There are ways to live without this feature, but
| I agree it's a gap.
| rymurr wrote:
| Check out https://projectnessie.org it adds multi-table
| transactions to Databricks Delta.
|
| disclaimer: an author of Nessie
| drej wrote:
| I have seen several deployments of Databricks (including first
| hand experience) and... most use cases could be better served by
| Postgres (or Redshift, Athena, Snowflake for larger scale). It
| has honestly been such an overkill for so many workloads, it was
| quite astonishing. I've seen people move from Excel to Spark...
| to handle the same volume of data. That's obviously not
| Databricks' fault, but their PR is pretty much "please do all
| your data work in our product, it's well suited for it".
|
| Yes, it's very good if you don't like setting up clusters (few
| do/can) and the UI is rather useful for getting up and running
| (not so much for writing code though). But you need to really
| understand the platform before adopting it. Please, please don't
| just adopt it because it's popular.
___________________________________________________________________
(page generated 2021-02-01 23:00 UTC)