[HN Gopher] Databricks response to Snowflake's accusation of lac...
___________________________________________________________________
Databricks response to Snowflake's accusation of lacking integrity
Author : rxin
Score : 166 points
Date : 2021-11-15 20:35 UTC (2 hours ago)
(HTM) web link (databricks.com)
(TXT) w3m dump (databricks.com)
| michaelhartm wrote:
| Data Wars: Snowflake vs Databricks (0 - 2)?
| falaki wrote:
| tl;dr: The data warehouse company used a pre-baked TPC-DS dataset
| and claimed they have similar performance to Databricks. Turns
| out if you use the official TPC-DS data generation scripts, you
| get much worse performance.
| arnon wrote:
| That's altering the methods - and generally considered a
| violation of the validity of the results.
| tyingq wrote:
| I read the original post, the Snowflake response, and this.
| From that I gather that both of them aren't being completely
| honest or fair when making comparisons. A fair amount of truth,
| but also some clever wording and omission on both their parts.
| Which is not surprising or particularly new in this space :)
| slownews45 wrote:
| Databricks results are available at tpc.org [1]
|
| Snowflake has shown NOTHING close to this.
|
| [1] http://tpc.org/results/fdr/tpcds/databricks~tpcds~100000~
| dat...
| david_allison wrote:
| Sorry to nitpick (document seems solid), on page 32:
|
| > Due to a TPC-internal error during the production of
| 3.2.0 of the TPC-DS kit, the benchmark execution had to use
| version 2.13 of the kit. It was confirmed by the TPC that
| the only changes between these two versions of the kit is
| the version number set in the tools/release.h parameter
| file.
|
| How can there be that much of a delta of major/minor
| versions without a change? The only way that I see this
| happening is if 'change' is being defined as the specific
| benchmark which was run, rather than the kit.
| ni_po wrote:
| The change is in the Spec, i.e., allowing cloud storage and
| the metric itself causing the major version update, not
| in the datagen binaries.
| tyingq wrote:
| Yes, I wasn't saying they were lying about their tpc.org
| posted results. I'm saying both companies made use of
| clever indirection, wording, presentations of stats, etc.
| Like price/performance, and which of your competitor's
| tiers to select when doing that, and which of your own. Or
| over-provisioning the competition's setup, for example.
| slownews45 wrote:
| Ahh, fair enough there. That said, Snowflake would help
| their case if they would actually do at least one actual
| tpc.org third-party result.
| tyingq wrote:
| My guess is that they know the result won't look
| terrific. And they also know Snowflake works well in
| production for people despite that. So, little upside.
| slownews45 wrote:
| For sure.
|
| All leaders in a space take this approach. Little to be
| gained, a fair bit to lose if you are ALREADY leading
| without having to debate / do a benchmark etc.
|
| Anyways, the benchmark is only one part of the overall
| story for these solutions.
| slownews45 wrote:
| Even worse, they claimed to have similar performance to
| Databricks AND claimed _databricks_ "lacked integrity". WOW,
| talk about chutzpah!
| benjaminwootton wrote:
| I've been following this and it's kind of embarrassing to watch.
|
| I love working with Databricks and Snowflake. They both knock it
| out of the park for their respective use case. They're amazing
| products.
|
| It makes no sense to fall out about this though.
|
| For a 100TB dataset with a funky calculation, Spark will trounce
| Snowflake. For a 1 row dataset, Snowflake will return before the
| spark job has been serialised.
| hello_moto wrote:
| Serious question: Databricks, Snowflake, Dremio. All these "Data"
| platform companies => which one do you have for your Data Lake
| and Data Warehouse solution?
|
| I'm sick and tired of these companies Snake Oiling the Data
| industry by offering "the easiest" platform to satisfy your Data
| Lake + Warehouse solution only to fall hard whenever you hook it
| up with your production data (big dataset).
|
| PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one
| platform) is on meth.
| kartoonhero wrote:
| Please read up on Lakehouse.
|
| Data Lake + Merge support + DW performance is now possible.
|
| That is the game changer.
| strongbond wrote:
| Do you work for Databricks?
| bpaneural wrote:
| They must do. But if you've been in this area for long
| enough, I'd put my money on Databricks, if anything,
| because of their open source integrity
| naattee wrote:
| snowflake should just pony up and do a TPC-DS audited benchmark
| maslam wrote:
| Everyone wins when data platforms submit audited benchmarks...
| bloodyplonker22 wrote:
| Databricks is trying to punch up at the market leader. Every
| decent marketer knows that you should never do the opposite and
| punch down.
| djbusby wrote:
| I'm crap at marketing and know the only-punch-up rule.
| aliswe wrote:
| what differences in size (or height) are we talking about?
| dautkhanov wrote:
| Thanks Snowflake for removing the DeWitt clause, makes
| performance comparison more transparent. Would be best for
| Snowflake to complete an official/audited TPC-DS benchmark so
| customers can compare apples to apples.
| avip wrote:
| I've used both products in production. Both are good++.
|
| The blog wars seem extremely ridiculous to me. I don't recall
| ever choosing one over another based on how fast it runs on some
| imaginary arbitrary dataset.
| kartoonhero wrote:
| It's not ridiculous at all. This is the coming of age for a
| brand new data architecture.
|
| One of the biggest FUDs for a data lake architecture is
| performance - and this benchmark should put that concern to
| rest.
| paxys wrote:
| Manufactured rivalries can be a great thing for business. We
| have been debating Coke vs Pepsi, Nike vs Reebok, McDonald's vs
| Burger King for decades now while these companies laugh all the
| way to the bank.
| javajosh wrote:
| Like the post but I would add "Ford v Ferrari" there. A
| synthetic 100T test is much like an F1 course - not something
| you deal with during your commute, but it's nice to know what
| the limit is, and that there are people pushing that limit.
| 1cvmask wrote:
| This reminds me of the old performance ads of Oracle where they
| would show you how everything ran better on Oracle. They used to
| put those ads at airports, business lounges and the back cover of
| newspapers and magazines read by non-technical executives like
| the FT and Economist.
|
| Everyone technical knew they would game every environment to come
| out with superior results. I suppose it worked, as the top
| executives buy big system software and ignore the IT crowd who
| could easily point out the flaws in the methodology of
| the "studies".
|
| Breakdown of one of those example ads:
|
| https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...
| initplus wrote:
| A key part of the Oracle strategy is making it a breach of
| license to publish any benchmarking data. No performance data
| about Oracle's database is allowed to be published without
| their approval, which means no negative results are published.
| jpalomaki wrote:
| I think this has been quite a common clause in license
| contracts. Databricks has a blog post about it:
| https://databricks.com/blog/2021/11/08/eliminating-the-
| dewit...
|
| This is kind of understandable. Benchmarking complex software
| is complicated. It's easy to give a totally wrong picture of
| things either accidentally or deliberately.
| laserlight wrote:
| Here's some background for those who are interested [0].
|
| [0] A solution to DeWitt clauses.
| https://danluu.com/anon-benchmark/
| 1cvmask wrote:
| They also sue you for so many other reasons. It's like the
| management hierarchy joke that Oracle is a litigious law firm
| with a sales team.
|
| https://palisadecompliance.com/oracle-org-chart/
| belter wrote:
| Who could forget the Unbreakable and Unhackable Campaign...
|
| The "Unbreakable" Marketing Campaign:
|
| https://www.oreilly.com/library/view/the-oracle-hackers/9780...
|
| https://www.zdnet.com/article/invincible-oracle-not-so-secur...
| supercanuck wrote:
| Similar to how SAP is still showing growth even though
| their core product (ERP Financials) hasn't changed much.
| Normal_gaussian wrote:
| so, alternatives?
|
| Aside from the Azure/GCP/AWS internal offerings, I know about
| Snowflake and Firebolt, Databricks is new to me.
| kofejnik wrote:
| maybe clickhouse?
| glogla wrote:
| Clickhouse is good if you're building an application. It has a lot
| of great features and incredible performance, but there's an
| expectation that people using it know what they're doing and
| can work around its limitations (like limited support for
| joins and sql in general).
|
| Something like Snowflake works much better when you're
| building a platform that you can give to two hundred data
| analysts of various skills spread over fifty teams, so they
| can build their own stuff. The nice UI, broad feature set
| (materialized views, time travel, automatic backups,
| superfast scaling up and down, ...) and general just-work-
| iness makes it nice for that, but you're going to pay for the
| privilege.
|
| Databricks is somewhere in the middle - things are way less
| polished, features don't always work and you still have to
| figure out things like backups and partitions on S3 on your
| own, but some people like that. Expect to also pay a pretty
| penny for hundreds of Spark clusters nobody knows who uses.
| solidangle wrote:
| When was the last time you used Databricks? You should
| definitely try it again. Their product offering has
| improved a lot in the past few years.
|
| > broad feature set
|
| My experience is that the feature sets of Snowflake and
| Databricks are very similar. Both have time travel support.
| Snowflake has materialized views, but Databricks has Delta
| Live Tables. Databricks has a distributed Pandas API, but
| Snowflake recently introduced Snowpark. Databricks also has
| autoscaling and they recently launched a serverless
| offering that makes autoscaling super fast as well.
| ethbr0 wrote:
| https://en.m.wikipedia.org/wiki/Databricks
|
| "Databricks is an enterprise software company founded by the
| creators of Apache Spark. [...] Databricks develops a web-based
| platform for working with Spark, that provides automated
| cluster management and IPython-style notebooks."
| tyingq wrote:
| Oracle and Teradata still have data warehouse pitches ;)
| glogla wrote:
| Redshift is pretty terrible, stay away. AWS is even worse at
| delivering promises than Databricks and that's saying
| something.
|
| I heard Google BigQuery is good. It is completely SaaS (like
| AWS Athena that works).
|
| Unicorns often run their own stack and you could replicate
| that, if you have the appetite. Netflix and Apple run Trino +
| Spark on k8s + Iceberg. Uber used their own Hudi thing, not
| sure if they still do.
| imslowbutnice wrote:
| (X-Posted) I still don't get how much optimization was done for
| the Databricks version of the Snowflake TPC-DS power run. This is
| what I am seeing so far (and I am foggy on it):
|
| DB1. Databricks generated the TPC-DS datasets from the TPC-DS kit
| before time started. Databricks started the timer, then generated
| all queries. Then Databricks loaded from CSV to Delta format
| (some Delta tables were also partitioned by date) and computed
| statistics. Then all of the queries 1-99 are executed for TPC-DS
| 100TB.
|
| SF1. Databricks generated the TPC-DS datasets from the TPC-DS kit
| before time started. Databricks started the timer, then generated
| all queries. Then load from S3 to Snowflake tables by (I'm not
| sure about these next parts) creating external stages and then
| "copy into" statements, I guess? Or maybe just using copy into
| from an S3 bucket; that part doesn't matter much. But it's not
| clear: did they also allow target tables to have
| partitioning/clustering keys at all? Then all of the queries 1-99
| are executed for TPC-DS 100TB.
|
| It's just hard to say exactly what "They were not allowed to apply
| any optimizations that would require deep understanding of the
| dataset or queries (as done in the Snowflake pre-baked dataset,
| with additional clustering columns)" means. Like, what
| does that exactly mean? At a glance, though, this looks very
| impressive for Databricks, but just want to be sure before I
| submit to an opinion.
| xiaodai wrote:
| Lol
| __MatrixMan__ wrote:
| Instead of blog posts written by experts in app A based on their
| experience with app B, I wish there were a platform for this kind
| of comparison.
|
| Some objective third party sets the goal and then each company
| submits automation (selenium?) that configures their own app to
| achieve the goal. Entrants are scored by:
|
| - time
|
| - storage
|
| - compute
|
| - config complexity
|
| No need to waste time making your opponent look bad, just focus
| on making yourself look good, and do it on a level playing
| field.
| rxin wrote:
| Isn't that what the official TPC does?
| falaki wrote:
| That is exactly the role of tpc.org.
| renewiltord wrote:
| If you want some information like this quick, you're gonna have
| to pay to run it.
| AdamProut wrote:
| I would say that TPC-DS and TPC-H are really table stakes
| benchmarks for data warehouses at this point in time (maybe they
| weren't 10 years ago). How to build a database that does well on
| them is well documented in the literature now[1][2][3][4] (maybe
| a few other papers). It's not easy to build such a database, but
| it's "just" hard work and many companies have the $$ necessary to
| do that work. There isn't any magic or technical moat in the
| results for databricks (or snowflake, or redshift, etc.).
|
| I think Databricks is overly enthusiastic about their results as
| they have been trying to be competitive with cloud DWs on these
| benchmarks for a number of years now. They have finally caught up
| (by building deltalake and their photon query engine which
| implement a number of standard DW features).
|
| [1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
|
| [2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
|
| [3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
|
| [4] http://sites.computer.org/debull/A12mar/vectorwise.pdf
| jchw wrote:
| Before the Snowflake blog post, I did not know what Snowflake or
| Databricks were. I can only imagine that this rivalry is great
| for both of them, even if Databricks is somewhat on the advantage
| end, at least from a tactical standpoint; I admit though that
| they seem to be a bit unnecessarily defensive considering the
| position they're in with the exchange.
|
| In general though, I'm still not complaining. It's interesting to
| see a dispute like this unfold.
| qaq wrote:
| Snowflake is the 120B market cap darling of cloud data warehouses.
| I doubt obscurity is a problem they are trying to solve.
| dreyfan wrote:
| Databricks is rapidly approaching an IPO. Trying to justify their
| valuation with their overpriced in-memory hadoop.
| kartoonhero wrote:
| Databricks is way more than hadoop or spark. A great analogy -
| Spark is a great engine but you need to design and build all of
| the other subsystems.
|
| Databricks is an F1 car - everything is built out. You get in
| and drive - FAST.
| dreyfan wrote:
| Databricks is a shit platform that encourages terrible data
| practices and accretion of technical debt.
| exsmelliarmus wrote:
| Seems pretty good to us! Can you give more information?
| matt123456789 wrote:
| As with other offerings in this space, the key to managing
| technical debt is to get functions out of notebooks ASAP,
| stage intermediate results where appropriate, and turn
| everything into jobs.
| fs111 wrote:
| Finally somebody that has used Databricks! I can't believe
| all the praise I read elsewhere in the comments here.
| Databricks is broken in so many ways, it is beyond me how
| anyone can like using this.
| hunterb123 wrote:
| Yeah it seems to be really beyond you.
|
| Please elaborate what is broken.
| fs111 wrote:
| > Databricks is an F1 car - everything is built out. You get
| in and drive - FAST.
|
| found the databricks employee
| glogla wrote:
| > Databricks is an F1 car
|
| F1 cars are really unreliable and need a lot of engineers to keep
| running, are very expensive, and completely impractical in
| normal use. They are fast but only on very specific roads,
| they couldn't survive on normal roads.
|
| What do you know, you might be right! :D
| gnabgib wrote:
| Related post (2 days ago, 95 comments): [Snowflake's response to
| Databricks' TPC-DS
| post](https://news.ycombinator.com/item?id=29206959)
| scapecast wrote:
| The irony here is that what Databricks is doing to Snowflake is
| exactly what Snowflake did to AWS and Redshift.
|
| Same playbook - show that you're better in a key metric that's
| easy to understand (performance) to get the attention, but then
| pitch the paradigm change.
|
| In Snowflake's case, that was separation of storage and compute.
|
| In Databricks' case, it's the Lakehouse Architecture.
|
| I think the reason why Snowflake is so nervous is because they know
| they can't win this game.
| ignoramous wrote:
| > _I think the reason why Snowflake is so nervous because they
| know they can't win this game._
|
| Isn't Databricks' delta.io, which their Data Lakehouse product
| builds on top of, open source? Snowflake could take the best
| parts from it and run with them?
| bpaneural wrote:
| They could in principle. GCP, for instance, does do that. So
| does HP. And Databricks don't mind that as they have a strong
| open source legacy. But that takes away the proprietary lock-
| in strategy of Snowflake.
| glogla wrote:
| In what way is lakehouse architecture beneficial over something
| like Snowflake or BigQuery?
|
| I understand the appeal over having lake and warehouse as
| separate components, but with those native cloud warehouses,
| you can already do everything a lake does.
| turk- wrote:
| With a data warehouse, you can only interface with your data
| in SQL. With BigQuery and Snowflake, your data is locked
| away in a proprietary format not accessible by other compute
| platforms. You need to export/copy your data to a different
| system to train an ML model in python or R.
|
| With the lakehouse, you can use python, R and Scala, (not
| just SQL) to interface with your data. You can use multiple
| compute engines (spark, Databricks, presto) so you are not
| locked into one compute engine.
|
| I recall being a junior programmer, and wishing I could talk
| to my MySQL database in python code to do some processing
| that was difficult to express in SQL. That day is finally
| here.
| falaki wrote:
| To be fair Apache Spark, which started long before either
| company existed, was built on the assumption that compute and
| storage should be separate. Unlike Hadoop, Spark did not come
| with any storage system and could read from any source.
| d-d-d wrote:
| > To be fair Apache Spark, which started long before either
| company existed
|
| Databricks was founded by Spark's creators before Spark 1.0 was
| released.
|
| Hadoop was created at a time when network and disk were much
| slower and RAM was less abundant. Bringing compute to the data
| made sense, but it typically doesn't anymore.
| redwood wrote:
| As much as I love seeing competition in the space and am enjoying
| my popcorn, I really don't understand what Databricks is doing
| here: this feels like a childish food fight rather than an
| obsession with the customer...
| saj1th wrote:
| :) That is a good question. Why spend eng cycles to submit
| results to the TPC council - why not just focus on customers?
|
| I believe the co-founders have addressed this in the blog.
|
| > Our goal was to dispel the myth that Data Lakehouse cannot
| have best-in-class price and performance. Rather than making
| our own benchmarks, we sought the truth and participated in the
| official TPC benchmark.
|
| I'm sure anybody seriously looking at evaluating data platforms
| would want to look at things holistically. There are different
| dimensions like open ecosystem, support for machine learning,
| performance etc. And different teams evaluating these platforms
| would stack rank them in different orders.
|
| These blogs, I believe, show that Databricks is a viable choice
| for customers when performance is a top priority (along with
| other dimensions). That IMO is customer obsession.
| jjoonathan wrote:
| All publicity is good publicity.
|
| Both participants in a fight can win by implicitly excluding
| their real competitors.
| s_barrow1 wrote:
| Databricks is not known for the SQL/DW space. The original blog
| was focused on breaking the TPC-DS performance record and
| providing validation of the Lakehouse architecture. DB didn't ask
| for a war of words with Snowflake - SF dedicated a whole
| response stating DB lacked integrity and filled it with false
| and misleading information. I commend DB for responding back
| (only because of the integrity accusations). Snowflake has
| asked for this response by acting petty from the outset
| glogla wrote:
| Yes, the tone of those blogposts, the likelihood of fake
| benchmarks submitted on someone else's behalf and especially
| the deluge of new accounts supporting them makes me want to
| trust Databricks even less than the PoC my company ran with
| them last year and spending time with their terrible, terrible
| salespeople.
|
| EDIT: I forgot lying about how open they are when all their
| interesting technologies (like the new sql engine and the good
| parts of delta) are proprietary.
| cai22r wrote:
| insert gif: he started it
| kf6nux wrote:
| I'd say helping customers spot fraud* is serving the customers'
| interests.
|
| * I haven't executed the test suite, but fraud seems likely.
| vgt wrote:
| I think Snowflake cultivates a very careful public image, but
| in private their sales people use... how do you say... aggressive
| techniques. Databricks is addressing the source of market
| confusion head-on.
| boringg wrote:
| And how soon is the S-1 for Databricks dropping?
| drej wrote:
| What I find hilarious is that companies argue over who can query 100
| TB faster and try to sell this to people. I've been on the
| receiving end of offers by both of the companies in question and
| used both platforms (and sadly migrated some data jobs to them).
|
| While they can crunch large datasets, they are laughably slow for
| the datasets most people have. So while I did propose we use
| these solutions for our big-ish data projects, management kept
| pushing for us to migrate our tiny datasets (tens of gigabytes or
| smaller) and the perf expectedly tanked compared to our other
| solutions (Postgres, Redshift, pandas etc.), never mind the
| immense costs to migrate everything and train everyone up.
|
| Yes, these are very good products. But PLEASE, for the love of
| god, don't migrate to them unless you know you need them (and by
| 'need' I don't mean pimping your resume).
| tshanmu wrote:
| Resume driven development FTW!
___________________________________________________________________
(page generated 2021-11-15 23:01 UTC)