[HN Gopher] Databricks response to Snowflake's accusation of lac...
       ___________________________________________________________________
        
       Databricks response to Snowflake's accusation of lacking integrity
        
       Author : rxin
       Score  : 166 points
       Date   : 2021-11-15 20:35 UTC (2 hours ago)
        
 (HTM) web link (databricks.com)
 (TXT) w3m dump (databricks.com)
        
       | michaelhartm wrote:
       | Data Wars: Snowflake vs Databricks (0 - 2)?
        
       | falaki wrote:
       | tl;dr: The data warehouse company used a pre-baked TPC-DS dataset
       | and claimed they have similar performance to Databricks. Turns
       | out if you use the official TPC-DS data generation scripts, you
       | get much worse performance.
        
         | arnon wrote:
         | That's altering the methods - and generally considered a
         | violation of the validity of the results.
        
         | tyingq wrote:
         | I read the original post, the Snowflake response, and this.
         | From that I gather that both of them aren't being completely
         | honest or fair when making comparisons. A fair amount of truth,
         | but also some clever wording and omission on both their parts.
         | Which is not surprising or particularly new in this space :)
        
           | slownews45 wrote:
           | Databricks results are available at tpc.org [1]
           | 
           | Snowflake has shown NOTHING close to this.
           | 
           | [1] http://tpc.org/results/fdr/tpcds/databricks~tpcds~100000~
           | dat...
        
             | david_allison wrote:
             | Sorry to nitpick (document seems solid), on page 32:
             | 
             | > Due to a TPC-internal error during the production of
             | 3.2.0 of the TPC-DS kit, the benchmark execution had to use
             | version 2.13 of the kit. It was confirmed by the TPC that
             | the only changes between these two versions of the kit is
             | the version number set in the tools/release.h parameter
             | file.
             | 
             | How can there be that much of a delta of major/minor
             | versions without a change? The only way that I see this
             | happening is if 'change' is defined as the specific
             | benchmark which was run, rather than the kit.
        
               | ni_po wrote:
               | The change is in the Spec, i.e., allowing cloud storage and
               | the metric itself causing the major version update, not
               | in the datagen binaries.
        
             | tyingq wrote:
             | Yes, I wasn't saying they were lying about their tpc.org
             | posted results. I'm saying both companies made use of
             | clever indirection, wording, presentations of stats, etc.
             | Like price/performance, and which of your competitor's
             | tiers to select when doing that, and which of your own. Or
             | over-provisioning the competition's setup, for example.
        
               | slownews45 wrote:
               | Ahh, fair enough there. That said, snowflake would help
               | their case if they would actually do at least one actual
               | tpc.org third party result
        
               | tyingq wrote:
               | My guess is that they know the result won't look
               | terrific. And they also know Snowflake works well in
               | production for people despite that. So, little upside.
        
               | slownews45 wrote:
               | For sure.
               | 
               | All leaders in a space take this approach. Little to be
               | gained, a fair bit to lose if you are ALREADY leading
               | without having to debate / do a benchmark etc.
               | 
               | Anyways, the benchmark is only one part of the overall
               | story for these solutions.
        
         | slownews45 wrote:
         | Even worse, they claimed to have similar performance to
         | Databricks AND claimed _databricks_ "lacked integrity". WOW,
         | talk about chutzpah!
        
       | benjaminwootton wrote:
       | I've been following this and it's kind of embarrassing to watch.
       | 
       | I love working with Databricks and Snowflake. They both knock it
       | out of the park for their respective use case. They're amazing
       | products.
       | 
       | It makes no sense to fall out about this though.
       | 
       | For a 100TB dataset with a funky calculation, Spark will trounce
       | Snowflake. For a 1 row dataset, Snowflake will return before the
       | spark job has been serialised.
        
       | hello_moto wrote:
       | Serious question: Databricks, Snowflake, Dremio. All these "Data"
       | platform companies => which one do you have for your Data Lake
       | and Data Warehouse solution?
       | 
       | I'm sick and tired of these companies Snake Oiling the Data
       | industry by offering "the easiest" platform to satisfy your Data
       | Lake + Warehouse solution only to fall hard whenever you hook it
       | up with your production data (big dataset).
       | 
       | PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one
       | platform) is on meth.
        
         | kartoonhero wrote:
         | Please read up on Lakehouse.
         | 
         | Data Lake + Merge support + DW performance is now possible.
         | 
         | That is the game changer.
        
           | strongbond wrote:
           | Do you work for Databricks?
        
             | bpaneural wrote:
             | They must do. But if you've been in this area for long
             | enough, I'd put my money on Databricks, if anything,
             | because of their open source integrity
        
       | naattee wrote:
       | snowflake should just pony up and do a TPC-DS audited benchmark
        
       | maslam wrote:
       | Everyone wins when data platforms submit audited benchmarks...
        
       | bloodyplonker22 wrote:
       | Databricks is trying to punch up at the market leader. Every
       | decent marketer knows that you should never do the opposite and
       | punch down.
        
         | djbusby wrote:
         | I'm crap at marketing and know the only-punch-up rule.
        
         | aliswe wrote:
         | what differences in size (or height) are we talking about?
        
       | dautkhanov wrote:
       | Thanks Snowflake for removing the DeWitt clause, makes
       | performance comparison more transparent. Would be best for
       | Snowflake to complete official/audited TPC-DS benchmark so
       | customers can compare apples to apples.
        
       | avip wrote:
       | I've used both products in production. Both are good++.
       | 
       | The blog wars seem extremely ridiculous to me. I don't recall
       | ever choosing one over another based on how fast it runs on some
       | imaginary arbitrary dataset.
        
         | kartoonhero wrote:
         | It's not ridiculous at all. This is the coming of age for a
         | brand new data architecture.
         | 
         | One of the biggest FUDs for a data lake architecture is
         | performance - and this benchmark should put that concern to
         | rest.
        
         | paxys wrote:
         | Manufactured rivalries can be a great thing for business. We
         | have been debating Coke vs Pepsi, Nike vs Reebok, McDonald's vs
         | Burger King for decades now while these companies laugh all the
         | way to the bank.
        
           | javajosh wrote:
           | Like the post but I would add "Ford v Ferrari" there. A
           | synthetic 100T test is much like an F1 course - not something
           | you deal with during your commute, but it's nice to know what
           | the limit is, and that there are people pushing that limit.
        
       | 1cvmask wrote:
       | This reminds me of the old performance ads of Oracle where they
       | would show you how everything ran better on Oracle. They used to
       | put those ads at airports, business lounges and the back cover of
       | newspapers and magazines read by non-technical executives like
       | the FT and Economist.
       | 
       | Everyone technical knew they would game every environment to come
       | out with superior results. I suppose it worked, as the top
       | executives buy big system software and ignore the IT crowd who
       | could easily point out the flaws in the methodology of
       | the "studies".
       | 
       | Breakdown of one of those example ads:
       | 
       | https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...
        
         | initplus wrote:
         | A key part of the Oracle strategy is making it a breach of
         | license to publish any benchmarking data. No performance data
         | about Oracle's database is allowed to be published without
         | their approval, which means no negative results are published.
        
           | jpalomaki wrote:
           | I think this has been quite a common clause in license
           | contracts. Databricks has a blog post about it:
           | https://databricks.com/blog/2021/11/08/eliminating-the-
           | dewit...
           | 
           | This is kind of understandable. Benchmarking complex software
           | is complicated. It's easy to give a totally wrong picture of
           | things either accidentally or deliberately.
        
           | laserlight wrote:
           | Here's some background for those who are interested [0].
           | 
           | [0] A solution to DeWitt clauses. https://danluu.com/anon-
           | benchmark/
        
           | 1cvmask wrote:
           | They also sue you for so many other reasons. It's like the
           | management hierarchy joke that Oracle is a litigious law firm
           | with a sales team.
           | 
           | https://palisadecompliance.com/oracle-org-chart/
        
         | belter wrote:
         | Who could forget the Unbreakable and Unhackable Campaign...
         | 
         | The "Unbreakable" Marketing Campaign:
         | 
         | https://www.oreilly.com/library/view/the-oracle-hackers/9780...
         | 
         | https://www.zdnet.com/article/invincible-oracle-not-so-secur...
        
         | supercanuck wrote:
         | Similar to how SAP is still showing growth even though
         | their core product (ERP Financials) hasn't changed much.
        
       | Normal_gaussian wrote:
       | so, alternatives?
       | 
       | Aside from the Azure/GCP/AWS internal offerings I know about
       | Snowflake and Firebolt, Databricks is new to me.
        
         | kofejnik wrote:
         | maybe clickhouse?
        
           | glogla wrote:
           | Clickhouse is good if you're building an application. It has
           | a lot of great features and incredible performance, but
           | there's an expectation that people using it know what they're
           | doing and can work around its limitations (like limited
           | support for joins and SQL in general).
           | 
           | Something like Snowflake works much better when you're
           | building a platform that you can give to two hundred data
           | analysts of various skills spread over fifty teams, so they
           | can build their own stuff. The nice UI, broad feature set
           | (materialized views, time travel, automatic backups,
           | superfast scaling up and down, ...) and general just-work-
           | iness makes it nice for that, but you're going to pay for the
           | privilege.
           | 
           | Databricks is somewhere in the middle - things are way less
           | polished, features don't always work and you still have to
           | figure out things like backups and partitions on S3 on your
           | own, but some people like that. Expect to also pay a pretty
           | penny for hundreds of Spark clusters nobody knows who uses.
        
             | solidangle wrote:
             | When was the last time you used Databricks? You should
             | definitely try it again. Their product offering has
             | improved a lot in the past few years.
             | 
             | > broad feature set
             | 
             | My experience is that the feature sets of Snowflake and
             | Databricks are very similar. Both have time travel support.
             | Snowflake has materialized views, but Databricks has Delta
             | Live Tables. Databricks has a distributed Pandas API, but
             | Snowflake recently introduced Snowpark. Databricks also has
             | autoscaling and they recently launched a serverless
             | offering that makes autoscaling super fast as well.
        
         | ethbr0 wrote:
         | https://en.m.wikipedia.org/wiki/Databricks
         | 
         | "Databricks is an enterprise software company founded by the
         | creators of Apache Spark. [...] Databricks develops a web-based
         | platform for working with Spark, that provides automated
         | cluster management and IPython-style notebooks."
        
         | tyingq wrote:
         | Oracle and Teradata still have data warehouse pitches ;)
        
         | glogla wrote:
         | Redshift is pretty terrible, stay away. AWS is even worse at
         | delivering promises than Databricks and that's saying
         | something.
         | 
         | I heard Google BigQuery is good. It is completely SaaS (like
         | AWS Athena that works).
         | 
         | Unicorns often run their own stack and you could replicate
         | that, if you have the appetite. Netflix and Apple run Trino +
         | Spark on k8s + Iceberg. Uber used their own Hudi thing, not
         | sure if they still do.
        
       | imslowbutnice wrote:
       | (X-Posted) I still don't get how much optimization was done for
       | the Databricks version of the Snowflake TPC-DS power run. This is
       | what I am seeing so far (and I am foggy on):
       | 
       | DB1. Databricks generated the TPC-DS datasets from the TPC-DS kit
       | before time started. Databricks starts time, then generated all
       | queries. Then Databricks loaded from CSV to Delta format (also
       | some Delta tables were partitioned by date) and also computed
       | statistics. Then all of the queries are executed 1-99 for TPC-DS
       | 100TB.
       | 
       | SF1. Databricks generated the TPC-DS datasets from the TPC-DS kit
       | before time started. Databricks starts time, then generated all
       | queries. Then load from S3 to Snowflake tables by (I'm not sure
       | about these next parts) creating external stages and then "copy
       | into" statements, I guess? Or maybe just using copy into from an
       | S3 bucket; that part doesn't matter much. But it's not clear: did
       | they also allow target tables to have partitioning/clustering
       | keys at all? Then all of the queries are executed 1-99 for TPC-DS
       | 100TB.
       | 
       | It's just hard to say exactly what "They were not allowed to apply
       | any optimizations that would require deep understanding of the
       | dataset or queries (as done in the Snowflake pre-baked dataset,
       | with additional clustering columns)" means exactly. At a glance,
       | though, this looks very impressive for Databricks, but I just
       | want to be sure before I
       | submit to an opinion.
        
       | xiaodai wrote:
       | Lol
        
       | __MatrixMan__ wrote:
       | Instead of blog posts written by experts in app A based on their
       | experience with app B, I wish there were a platform for this kind
       | of comparison.
       | 
       | Some objective third party sets the goal and then each company
       | submits automation (selenium?) that configures their own app to
       | achieve the goal. Entrants are scored by:
       | 
       | - time
       | 
       | - storage
       | 
       | - compute
       | 
       | - config complexity
       | 
       | No need to waste time making your opponent look bad, just focus
       | on making yourself look good, and do it on a level playing
       | field.
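
A minimal sketch of how such a scoreboard might rank entrants on the four criteria listed above. Everything here is an assumption for illustration: the `Entry` fields, the made-up numbers, and especially the rank-sum rule, which is just one simple way to combine metrics and is not how the TPC scores results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One hypothetical submission; every metric is lower-is-better."""
    vendor: str
    time_s: float          # wall-clock time to reach the goal
    storage_gb: float      # storage consumed
    compute_hours: float   # compute consumed
    config_steps: int      # crude stand-in for "config complexity"


METRICS = ("time_s", "storage_gb", "compute_hours", "config_steps")


def rank_entries(entries):
    """Order vendors by the sum of their per-metric ranks (lowest sum first).

    Ties within a metric are not shared: sorted() is stable, so tied
    entries keep their input order and get consecutive positions.
    """
    totals = {e.vendor: 0 for e in entries}
    for metric in METRICS:
        ordered = sorted(entries, key=lambda e: getattr(e, metric))
        for position, entry in enumerate(ordered, start=1):
            totals[entry.vendor] += position
    # Break ties in the total by vendor name so the output is deterministic.
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


# Made-up numbers for two hypothetical entrants.
entries = [
    Entry("vendor_a", time_s=120.0, storage_gb=50.0, compute_hours=4.0, config_steps=12),
    Entry("vendor_b", time_s=150.0, storage_gb=40.0, compute_hours=3.0, config_steps=8),
]
ranking = rank_entries(entries)
print(ranking)
```

With these numbers, vendor_a wins on time but vendor_b wins on the other three metrics, so vendor_b ranks first. A real scheme would have to decide how to weight the metrics against each other, which is exactly where vendors tend to disagree.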
        
         | rxin wrote:
         | Isn't that what the official TPC does?
        
         | falaki wrote:
         | That is exactly the role of tpc.org.
        
         | renewiltord wrote:
         | If you want some information like this quick, you're gonna have
         | to pay to run it.
        
       | AdamProut wrote:
       | I would say that TPC-DS and TPC-H are really table stakes
       | benchmarks for data warehouses at this point in time (maybe they
       | weren't 10 years ago). How to build a database that does well on
       | them is well documented in the literature now[1][2][3][4] (maybe
       | a few other papers). It's not easy to build such a database, but
       | it's "just" hard work and many companies have the $$ necessary to
       | do that work. There isn't any magic or technical moat in the
       | results for databricks (or snowflake, or redshift, etc.).
       | 
       | I think Databricks is overly enthusiastic about their results as
       | they have been trying to be competitive with cloud DWs on these
       | benchmarks for a number of years now. They have finally caught up
       | (by building deltalake and their photon query engine which
       | implement a number of standard DW features).
       | 
       | [1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
       | [2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
       | [3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
       | [4] http://sites.computer.org/debull/A12mar/vectorwise.pdf
        
       | jchw wrote:
       | Before the Snowflake blog post, I did not know what Snowflake or
       | Databricks were. I can only imagine that this rivalry is great
       | for both of them, even if Databricks has somewhat of an advantage,
       | end, at least from a tactical standpoint; I admit though that
       | they seem to be a bit unnecessarily defensive considering the
       | position they're in with the exchange.
       | 
       | In general though, I'm still not complaining. It's interesting to
       | see a dispute like this unfold.
        
         | qaq wrote:
         | Snowflake is the $120B market cap darling of cloud data
         | warehouses. I doubt obscurity is a problem they are trying to
         | solve.
        
       | dreyfan wrote:
       | Databricks is rapidly approaching an IPO. Trying to justify their
       | valuation with their overpriced in-memory hadoop.
        
         | kartoonhero wrote:
         | Databricks is way more than hadoop or spark. A great analogy -
         | Spark is a great engine but you need to design and build all of
         | the other subsystems.
         | 
         | Databricks is an F1 car - everything is built out. You get in
         | and drive - FAST.
        
           | dreyfan wrote:
           | Databricks is a shit platform that encourages terrible data
           | practices and accretion of technical debt.
        
             | exsmelliarmus wrote:
             | Seems pretty good to us! Can you give more information?
        
             | matt123456789 wrote:
             | As with other offerings in this space, the key to managing
             | technical debt is to get functions out of notebooks ASAP,
             | stage intermediate results where appropriate, and turn
             | everything into jobs.
        
             | fs111 wrote:
             | Finally somebody that has used Databricks! I can't believe
             | all the praise I read elsewhere in the comments here.
             | Databricks is broken in so many ways, it is beyond me how
             | anyone can like using this.
        
               | hunterb123 wrote:
               | Yeah it seems to be really beyond you.
               | 
               | Please elaborate what is broken.
        
           | fs111 wrote:
           | > Databricks is an F1 car - everything is built out. You get
           | in and drive - FAST.
           | 
           | found the databricks employee
        
           | glogla wrote:
           | > Databricks is an F1 car
           | 
           | F1 cars are really unreliable and need a lot of engineers to keep
           | running, are very expensive, and completely impractical in
           | normal use. They are fast but only on very specific roads,
           | they couldn't survive on normal roads.
           | 
           | What do you know, you might be right! :D
        
       | gnabgib wrote:
       | Related post (2 days ago, 95 comments): [Snowflake's response to
       | Databricks' TPC-DS
       | post](https://news.ycombinator.com/item?id=29206959)
        
       | scapecast wrote:
       | The irony here is that what Databricks is doing to Snowflake is
       | exactly what Snowflake did to AWS and Redshift.
       | 
       | Same playbook - show that you're better in a key metric that's
       | easy to understand (performance) to get the attention, but then
       | pitch the paradigm change.
       | 
       | In Snowflake's case, that was separation of storage and compute.
       | 
       | In Databricks' case, it's the Lakehouse Architecture.
       | 
       | I think the reason why Snowflake is so nervous is that they know
       | they can't win this game.
        
         | ignoramous wrote:
         | > _I think the reason why Snowflake is so nervous because they
         | know they can't win this game._
         | 
         | Isn't Databricks' delta.io, which their Data Lakehouse product
         | builds on top of, open source? Snowflake could take the best
         | parts from it and run with them?
        
           | bpaneural wrote:
           | They could in principle. GCP, for instance, does do that. So
           | does HP. And Databricks don't mind that as they have a strong
           | open source legacy. But that takes away the proprietary lock-
           | in strategy of Snowflake.
        
         | glogla wrote:
         | In what way is lakehouse architecture beneficial over something
         | like Snowflake or BigQuery?
         | 
         | I understand the appeal over having lake and warehouse as
         | separate components, but with those native cloud warehouses,
         | you can already do everything a lake does.
        
           | turk- wrote:
           | With a datawarehouse, you can only interface with your data
           | in SQL. With BigQuery and Snowflake, your data is locked
           | away in a proprietary format not accessible by other compute
           | platforms. You need to export/copy your data to a different
           | system to train an ML model in python or R.
           | 
           | With the lakehouse, you can use python, R and Scala, (not
           | just SQL) to interface with your data. You can use multiple
           | compute engines (spark, Databricks, presto) so you are not
           | locked into one compute engine.
           | 
           | I recall being a junior programmer, and wishing I could talk
           | to my MySQL database in python code to do some processing
           | that was difficult to express in SQL, that day is finally
           | here.
        
         | falaki wrote:
         | To be fair Apache Spark, which started long before either
         | company existed, was built on the assumption that compute and
         | storage should be separate. Unlike Hadoop, Spark did not come
         | with any storage system and could read from any source.
        
           | d-d-d wrote:
           | > To be fair Apache Spark, which started long before either
           | company existed
           | 
           | Databricks was founded by Spark's creators before Spark 1.0
           | was released.
           | 
           | Hadoop was created at a time when network and disk were much
           | slower and RAM was less abundant. Bringing compute to the data
           | made sense, but it typically doesn't anymore.
        
       | redwood wrote:
       | As much as I love seeing competition in the space and am enjoying
       | my popcorn, I really don't understand what Databricks is doing
       | here: this feels like a childish foodfight rather than an
       | obsession with the customer...
        
         | saj1th wrote:
         | :) That is a good question. Why spend eng cycles to submit
         | results to the TPC council - why not just focus on customers?
         | 
         | I believe the co-founders have addressed this in the blog.
         | 
         | > Our goal was to dispel the myth that Data Lakehouse cannot
         | have best-in-class price and performance. Rather than making
         | our own benchmarks, we sought the truth and participated in the
         | official TPC benchmark.
         | 
         | I'm sure anybody seriously looking at evaluating data platforms
         | would want to look at things holistically. There are different
         | dimensions like open ecosystem, support for machine learning,
         | performance etc. And different teams evaluating these platforms
         | would stack rank them in different orders.
         | 
         | These blogs, I believe, show that Databricks is a viable choice
         | for customers when performance is a top priority (along with
         | other dimensions). That IMO is customer obsession.
        
         | jjoonathan wrote:
         | All publicity is good publicity.
         | 
         | Both participants in a fight can win by implicitly excluding
         | their real competitors.
        
         | s_barrow1 wrote:
         | Databricks is not known for the SQL/DW space. The original blog
         | was focused on breaking the TPC-DS performance record and
         | providing validation of the Lakehouse architecture. DB didn't ask
         | for a war of words with Snowflake - SF dedicated a whole
         | response stating DB lacked integrity and filled it with false
         | and misleading information. I commend DB for responding back
         | (only because of the integrity accusations). Snowflake has
         | asked for this response by acting petty from the outset
        
         | glogla wrote:
         | Yes, the tone of those blogposts, the likelihood of fake
         | benchmarks submitted on someone else's behalf and especially
         | the deluge of new accounts supporting them makes me want to
         | trust Databricks even less than the PoC my company ran with
         | them last year and spending time with their terrible, terrible
         | salespeople.
         | 
         | EDIT: I forgot lying about how open they are when all their
         | interesting technologies (like the new sql engine and the good
         | parts of delta) are proprietary.
        
         | cai22r wrote:
         | insert gif: he started it
        
         | kf6nux wrote:
         | I'd say helping customers spot fraud* is serving the customers'
         | interests.
         | 
         | * I haven't executed the test suite, but fraud seems likely.
        
         | vgt wrote:
         | I think Snowflake cultivates a very careful public image, but
         | in private their sales people use.. how do you say.. aggressive
         | techniques.. databricks is addressing the source of market
         | confusion head-on
        
       | boringg wrote:
       | And how soon is the S-1 for Databricks dropping?
        
       | drej wrote:
       | What I find hilarious is that companies argue who can query 100
       | TB faster and try to sell this to people. I've been on the
       | receiving end of offers by both of the companies in question and
       | used both platforms (and sadly migrated some data jobs to them).
       | 
       | While they can crunch large datasets, they are laughably slow for
       | the datasets most people have. So while I did propose we use
       | these solutions for our big-ish data projects, management kept
       | pushing for us to migrate our tiny datasets (tens of gigabytes or
       | smaller) and the perf expectedly tanked compared to our other
       | solutions (Postgres, Redshift, pandas etc.), never mind the
       | immense costs to migrate everything and train everyone up.
       | 
       | Yes, these are very good products. But PLEASE, for the love of
       | god, don't migrate to them unless you know you need them (and by
       | 'need' I don't mean pimping your resume).
        
         | tshanmu wrote:
         | Resume driven development FTW!
        
       ___________________________________________________________________
       (page generated 2021-11-15 23:01 UTC)