[HN Gopher] Snowflake's response to Databricks' TPC-DS post
___________________________________________________________________
Snowflake's response to Databricks' TPC-DS post
Author : uvdn7
Score : 66 points
Date : 2021-11-13 02:52 UTC (20 hours ago)
(HTM) web link (www.snowflake.com)
(TXT) w3m dump (www.snowflake.com)
| PostThisTooFast wrote:
| Another douchey HN title.
|
| Obscurity isn't cool. Nobody gives a shit.
| glogla wrote:
| What do you know, here's an article[1] from 2017 about Databricks
| making an unfortunate mistake that showed Spark Streaming (which
  | they sell) as a better streaming platform than Flink (which they
| don't sell).
|
| I really hope this is not the case again.
|
| (yes, I understand my sarcasm is unneeded, I couldn't help
| myself)
|
| [1]: https://www.ververica.com/blog/curious-case-broken-
| benchmark...
| hiyer wrote:
| Performance is only one part of the story. The major advantage
| Snowflake (and to some extent Presto/Trino) brings to the table
  | is that it's pretty much plug and play. Spark, OTOH, usually
  | requires a lot of tweaking to work reliably for your workloads.
| EdwardDiego wrote:
    | Very much true. I saw a joke tweet recently, something along
    | the lines of: "It's amazing how many data engineering
    | scaling issues these days are being solved by just paying
    | Snowflake more money."
|
| Spark does take a lot of tuning, but then I'm guessing
| Databricks offer that service as part of your licensing fee?
| (I'd hope so if they're selling a product based on FOSS code,
| there has to be a value add to justify it)
| hiyer wrote:
| > I'd hope so if they're selling a product based on FOSS
| code, there has to be a value add to justify it
|
| They have some proprietary features like DBIO [1]. They also
| have some cloud-specific features like storage autoscaling
| [2] that would not be available in OSS Spark. Even Delta Lake
| [3] used to be proprietary, but I suspect the rise of open-
| source frameworks like Iceberg led them to open-source it.
|
      | Shameless plug: when I worked at a since-shut-down
      | competitor to Databricks, I came up with storage
      | autoscaling long before they did [4], so it's not unlikely
      | that they were "inspired" by us :-) .
|
| 1. https://docs.databricks.com/spark/latest/spark-sql/dbio-
| comm...
|
| 2. https://databricks.com/blog/2017/12/01/transparent-
| autoscali...
|
| 3. https://delta.io/
|
| 4. https://www.qubole.com/blog/auto-scaling-in-qubole-with-
| aws-...
| glogla wrote:
        | The open-source Delta is not a replacement for the real
        | thing - they did not include features like optimizing
        | small files (the small-file problem is well known in big
        | data, and much more of a problem once streaming gets
        | involved) and others. It is more of a demo of the real
        | thing. Which does not stop them from repeating
        | everywhere how open they are, of course.
|
        | EDIT: Delta also still keeps partitioning information
        | in the Hive metastore, while Iceberg keeps it in
        | storage, making Iceberg's design far superior. Adopting
        | Iceberg is harder because third-party tools like AWS
        | Redshift don't support it - you have to go 100% of the
        | way.
| saj1th wrote:
| >the delta also still keeps partitioning information in
| the hive metastore, while iceberg keeps it in storage,
| making it a far superior design.
|
| Check out https://github.com/delta-
| io/delta/blob/3ffb30d86c6acda9b59b9... when you get a
          | chance. You don't need the Hive metastore to query
          | Delta tables, since all the metadata for a Delta
          | table is stored alongside the data.
|
| >they did not include features like optimizing small
| files
|
| For optimizing small files, you could run
| https://docs.delta.io/latest/best-practices.html#compact-
| fil...
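The linked best-practices recipe does the actual compaction with Spark (read the table, repartition, overwrite). As a hedged illustration only, here is a hypothetical pure-Python sketch of the underlying idea - greedily grouping many small files into rewrite batches near a target size. `TARGET_BYTES` and `plan_compaction` are invented names, not part of Delta's API:

```python
# Hypothetical sketch of the idea behind small-file compaction:
# pack many small files into batches close to a target size, so that
# each batch can be rewritten as one large file. Delta's documented
# recipe does the real work with Spark; this only shows the planning.

TARGET_BYTES = 128 * 1024 * 1024  # aim for ~128 MB output files

def plan_compaction(file_sizes):
    """Greedily group file sizes into batches of at most TARGET_BYTES."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > TARGET_BYTES:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 1000 one-megabyte files collapse into 8 rewrite batches of ~128 MB
plan = plan_compaction([1024 * 1024] * 1000)
```

Each batch would then be rewritten as a single file; the hard part in a real table format is doing that rewrite while keeping the transaction log consistent, which this sketch ignores.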
| rdeboo wrote:
| I think the comparison was Snowflake vs Databricks SQL.
| Databricks SQL is a PaaS service just like Snowflake. Also, it
| uses their Photon engine, which is a proprietary engine written
| in C++. It is not Spark.
| hiyer wrote:
| I'm aware that Databricks is a PaaS service, but what
| Databricks runs under the hood is Spark (with a few
| proprietary extensions). So your jobs/queries do require some
      | tuning, just like with OSS Spark.
|
| Spark has had SQL engines (SparkSQL/Hive on Spark) for a long
| time. Photon is just a new, faster one. Photon tasks also run
| on Spark executors only, so it's not independent of Spark[1].
| Also, while it's proprietary now, I wouldn't be surprised if
| Databricks open-sources it in the future, like they did with
| Delta Lake.
|
| 1. https://databricks.com/blog/2021/06/17/announcing-photon-
| pub...
| blobbers wrote:
| This is the sort of FUD testing that gets thrown back and forth
| between companies of all kinds.
|
  | If you're in networking, it's throughput, latency, or fairness.
  | If you're in graphics, it's your shaders or polygons or hashes.
  | If you're in CPUs, it's your clock speed. If it's cameras, it's
  | megapixels (but nobody talks about lenses or real measures of
  | clarity). If you're in silicon, it's your die size (none of that
  | has mattered for years; those numbers are like versions, not the
  | largest block on your die). If you're in finance, it's about your
  | returns or your drawdowns or your Sharpe ratios.
|
  | I'm a little surprised at how seriously Databricks is taking
  | this, but maybe it's because one of the cofounders made this
  | claim. Ultimately what you find is that one company is not very
  | good at setting up the other company's system, and the resulting
  | benchmarks are less than ideal.
|
| So why not have a showdown? Both founders, streamed live, running
| their benchmarks on the data. NETFLIX SPECIAL!
| rxin wrote:
| Exactly. Not sure about Netflix special, but there are experts
| that have dedicated their professional careers to creating fair
| benchmarks. Snowflake should just participate in the official
| TPC benchmark.
|
| Disclaimer: Databricks cofounder who authored the original blog
| post.
| AtlasLion wrote:
      | The benchmark itself is kinda useless, so I don't see why
      | they should. If you look at TPC-H, for years you had
      | Exasol as the top dog, but in the real world that meant
      | nothing for them.
| throwaway984393 wrote:
| "Posting benchmark results is bad because it quickly becomes a
| race to the wrong solution. But somebody showed us sucking on a
| benchmark, so here's our benchmark results showing we're better."
| Rastonbury wrote:
    | I'm not familiar enough with this realm to comment on the
    | veracity of the claims, but it could very well be:
|
| "Posting benchmark results is bad because it quickly becomes a
| race to the wrong solution. Someone misrepresented our
| performance in a benchmark, here are the actual results."
| aptxkid wrote:
    | I disagree. It makes sense for Snowflake to respond to what
    | they think is an unreasonably bad result published by
    | Databricks. And they focused more on Snowflake's own results
    | and only compared dollar cost against Databricks. It's
    | consistent with their philosophy that a public benchmark war
    | is beside the point and mostly a distraction.
| AtlasLion wrote:
      | Their cofounder was behind Vectorwise, which kicked ass in
      | benchmarks but died with hardly anyone having heard of it.
      | You can run the benchmark queries fast, that's great, but
      | can you handle code migrated from Vertica? Will your
      | optimiser come up with a good plan for queries built on 15
      | layers of views? That's what companies in the real world
      | have, not some synthetic benchmark you can tune for
      | marketing purposes.
| pxc wrote:
| Can someone ELI5 what Snowflake and Databricks are? I spent a few
| minutes on the Databricks website once and couldn't really
| penetrate the marketing jargon.
|
| There are also some technical terms I don't know at all, and when
| I've searched for them, the top results are all more Azure stuff.
| Like wtf is a datalake?
| jeffreygoesto wrote:
| People who downvoted this, please take a minute and reflect
| that your world is not the whole world. There is a serious
| question in this comment and there are myriads of topics _you_
| have no clue about.
| dekhn wrote:
| sure, but if I see the term 'data lake' I'm gonna Bing it,
| with the first result being https://aws.amazon.com/big-
| data/datalakes-and-analytics/what... which explains it
| nicely.
|
      | ELI5 is for Reddit; generally, here we expect you to
      | google it and get the ELI5 explanation before giving us
      | your hot take in a comment.
| aptxkid wrote:
        | It's probably just me, but the distinction between data
        | lake and data warehouse seems like splitting hairs.
        | Unstructured data can always be stored in structured
        | databases. What's the main reason for both to coexist?
| glogla wrote:
| It used to be that way. Old data warehouses (built on
| relational dbs) couldn't handle large scale data, and old
| data lakes used to be hard to use (write a map-reduce job
| to query data).
|
| It is barely true nowadays.
| incomplete wrote:
| i worked at excite.com right after the IPO, and front and
| center in the HQ building was a MASSIVE glass wall
| showcasing the oracle data warehouse machine room.
|
| i didn't enjoy working w/either the datastore directly,
| or the DBA team that ran it either. an early, more old-
| white-dude "i just want to serve 5T"
| dekhn wrote:
| History matters here and I don't know how well this is
| documented, but: data warehouses have been around since
| the 70s or so, data lake is a newer term. Data warehouses
| came from an era where nearly all data was stored in the
| database itself (typically Oracle), owned and controlled
| by a single or few groups, and there were only a few
| databases, which were the source of truth (the two
| databases would normally be a transaction engine handling
| real time load (just what's required to authorize a
| credit card transaction, for example), and a "warehouse"
| which contained all the long-term data like every
| transaction that had ever occurred.
|
| Data lakes are more modern and came about as people
| realized they had 30 databases and the business wanted to
          | do queries against all of them simultaneously (i.e., join
| your credit card transaction history with historical
| rates of default in a zip code), quickly. The data
| warehouse solution was to use federated database queries
| (JOINs across databases), or force everybody to
| consolidate. A data lake is a single virtual entity that
| represents "all your data in one place".
|
| It's based on a weak analogy where a warehouse is a place
| where you put stuff in very well organized locations
| while a lake is a place where a bunch of different waters
| slosh together.
|
| Storing unstructured data in a database is dumb because
| databases cost about 10X storage space due to indexing,
| while unstructured data often can just sit around
| passively in a filesystem (and/or have a filesystem index
| built into it for fast queries).
|
| I view this through the lens of web tech, for example,
| see the wars between the mapreduce and database people
| and how Google evolved from MapReduce against GFS to
| Flumes against Spanner, showing we just live in an
| endless cycle of renaming old technology.
|
          | It's absolutely correct that the terminology doesn't
          | map perfectly.
| pxc wrote:
| This was really helpful, too. Thanks!
| pxc wrote:
| Yeah, that's exactly the kind of content I found unsuitable
| when I did a web search for the term. It spends a whole two
| sentences giving an explanation that tells me very little
| about how data lakes are anything more specific than a
| cloud-hosted database solution, and moves on to
|
| > Organizations that successfully generate business value
| from their data, will outperform their peers.
|
| at which point I'm like
|
| > ok, I'm reading a covert advertisement about Fancy Cloud
| Technology aimed at some kind of big-spending manager,
| which is unlikely to tell me meaningfully what this
| actually _is_
|
| and I'm out. I was looking for content that was in a more
| neutral, purely educational genre, and wondering what
| collection of non-cloud analogues it replaces/is composed
| of. Someone writing in the comments
|
| > I used it to transform several terabytes of JSON into
| nice relational data for analysts without too much effort
|
| is way, way more direct and helpful than mentioning that
| 'unlike data warehouses, data lakes support non-relational
| data'. Like great, it's a cloud thing that supports a
| variety of databases. But what is it?
|
| > before giving us your hot take in a comment
|
| I didn't give any take at all? I just really found all the
| sources that came up on the first page of search results to
| be almost in the wrong genre for me, and expected
| (correctly) that people on this site would be able to
| produce descriptions in 1-5 sentences that worked way
| better for me.
|
| Pretty much all of the answers I got here were really good,
| and I'm glad I asked.
| PostThisTooFast wrote:
| "ELI5"?
|
| Try English next time.
| IanCal wrote:
| Snowflake is (amongst other things but primarily to me) SQL
| database as a service, designed for analytical queries over
| large datasets.
|
| It separates compute and storage, so there's just a big ol'
| pile of data and tables, then it spins up large machines to
| crunch the data on demand.
|
| Data storage is cheap and the machines are expensive per hour
| but running for shorter times, and with little to no ops work
| required it can be a cheap overall system.
|
| Bunch of other features that are handy or vital depending on
| your use case (instant data sharing across accounts, for
| example).
|
| I've used it to transform terabytes of JSON into nice
| relational tables for analysts to use with very little effort.
|
| Hopefully that's a useful overview of what kind of thing it is
| and where it sits.
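The "JSON into relational tables" step described above is done in Snowflake itself with SQL over semi-structured (VARIANT) columns; as a rough, hypothetical Python analogue of the idea only - not Snowflake's API - here is a sketch that flattens a nested record into the kind of flat row an analyst could query:

```python
import json

# Hypothetical sketch of "nested JSON -> flat relational row".
# Snowflake does this server-side over VARIANT columns; this just
# illustrates the shape of the transformation.

def flatten(record, prefix=""):
    """Recursively flatten nested objects into column_name -> value."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "_"))
        else:
            row[name] = value
    return row

raw = '{"user": {"id": 7, "name": "ada"}, "event": "login"}'
row = flatten(json.loads(raw))
# row == {"user_id": 7, "user_name": "ada", "event": "login"}
```

The real work at terabyte scale is, of course, doing this in parallel close to the storage, which is what these platforms sell.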
| legerdemain wrote:
    | Snowflake is a hosted database that uses SQL. Two
    | distinctions it has are that (1) it lets users pay for data
    | storage and compute power separately and independently, and
    | (2) it takes decisions about data indexing out of your hands.
|
| Databricks is a vendor of hosted Spark (and is operated by the
| creators of Spark). Spark is software for coordinating data
| processing jobs on multiple machines. The jobs are written
| using a SQL-like API that allows fairly arbitrary
| transformations. Databricks also offers storage using their
| custom virtual cloud filesystem that exposes stored datasets as
| DB tables.
|
| Both vendors also offer interactive notebook functionality
| (although Databricks has spent more time on theirs). They're
| both getting into dashboarding (I think).
|
| Ultimately, they're both selling cloud data services, and their
| product offerings are gradually converging.
| kevindeasis wrote:
| They are a data warehouse with analytics? So data warehouse as
| a service in the cloud?
|
| So they can collect data from different places like sql,
| images, etc. I think a better question would be what type of
| data can't they ingest?
|
| Once you have your data i guess you can run some analytics to
| find out what your data tells you
| geoduck14 wrote:
      | I'd like to add some points: I've used Snowflake for
      | several years. Snowflake works with structured and
      | semi-structured data (think spreadsheets and JSON). I've
      | never tried working with pics or videos - and I'm not sure
      | it would make sense to do that.
|
| I've evaluated Databricks. It works with the above mentioned
| structured and semi-structured data. I also suspect it could
| process unstructured data. My understanding is that it runs
| Python (and some others), so you can do any "Python stuff,
| but in the cloud, and on 1000s of computers"
| mping wrote:
    | A data lake is a system designed for ingesting, and possibly
    | transforming, lots of data - a "lake" where you dump your
    | data. This is different from e.g. a Postgres DB (a single
    | source of truth for a CRUD app, for example), because it
    | captures more data (e.g. events) and is normally not
    | consistent with the single source of truth (the data may
    | arrive in batches, be imported from other databases, etc.).
    | Because the volume of data is normally huge, you need a
    | cluster to store it, and some way of querying it.
    |
    | Snowflake and Databricks are companies that operate in this
    | space, providing ways to ingest, transform, and analyze
    | large volumes of data.
| ngc248 wrote:
| A data lake is a company wide data repository. All the "data
| streams" from all of the different departments will flow into
| the data lake. Aim is to use this data to get both macro and
| micro insights.
| socaldata wrote:
| Take all the problems you have had with data warehousing and
| throw them in a proprietary cloud. That is Snowflake. They are
| the best today.
|
| Databricks started with the cloud datalake, sitting natively on
| parquet and using cloud native tools, fully open. Recently they
| added SQL to help democratize the data in the data lake versus
| moving it back and forth into a proprietary data warehouse.
|
  | The selling point of Databricks is: why move the data around
  | when you can just have it in one place, IF performance is the
  | same or better?
|
  | This is what led to the latest benchmark, which, as written,
  | appears to be unbiased.
  |
  | In Snowflake's response, however, they condemn it but then
  | submit their own findings. Sounds a lot like Trump telling
  | everyone he had billions of people attend his inauguration,
  | doesn't it?
|
| Anyhow, I trust independent studies more than I do coming from
| vendors. It cannot be argued or debated unless it was unfairly
| done. I think we are all smart enough to be careful with studies
| of any kind, but I can see why Databricks was excited about the
| findings.
| aptxkid wrote:
    | Whose result can be trusted is beside the point - I actually
    | believe both experiments were likely conducted in good faith
    | but with incomplete context. The point is that there's no
    | good reason to start a benchmark war to begin with.
| glogla wrote:
    | Delta Lake is not meaningfully more "open" than whatever
    | Snowflake (or BigQuery and Redshift) are doing, and it does
    | not require any less "moving data around."
|
| With all these, the data sits on cloud storage and compute is
| done by cloud machines - the difference between Databricks and
| the others is that with Databricks, you can take a look at that
| bucket. But you're not going to be able to do much with that
    | data without paying for Databricks compute, since the open-
    | source Delta library is not usable in the real world.
|
    | Since commercial data warehouses are an enterprise product
    | for enterprise companies (small companies can stick with
    | normal databases or SaaS, and unicorns nowadays seem to roll
    | their own with Presto/Trino, Iceberg, Spark and k8s), the
    | vendor and the product need above all to be a reliable
    | partner. And Databricks' behavior does not inspire confidence
    | that they are that.
|
    | If I'm outsourcing my analytical platform to a vendor, I
    | want them to be almost boring - not growth-hacking,
    | guerilla-marketing, sketchy-benchmark-posting techbros.
|
    | At the end of the day, anyone making multi-year,
    | million-dollar decisions in this space should run their own
    | evaluation.
| Our evaluation showed that there's a noticeable gap between
| what Databricks promises and what they deliver. I have not
| worked with Snowflake to compare.
| choppaface wrote:
  | The audience for these posts is enterprise managers who don't
  | actually understand their compute needs.
|
| For the more technically inclined, don't let any corporate blog
| post / comms piece live in your head rent-free. If you're a
| customer, make them show you value for their money. If you're
| not, make them provide you tools / services for free. Just don't
| help them fuel the pissing contest, you'll end up a bag holder
| (swag holder?).
| bjornsing wrote:
| > At the end of the script, the overall elapsed time and the
| geometric mean for all the queries is computed directly by
| querying the history view of all TPC-DS statements that have
| executed on the warehouse.
|
| The geometric mean? Really? Feels a lot easier to think in terms
| of arithmetic mean, and perhaps percentiles.
| rxin wrote:
    | Geometric mean is commonly used in benchmarks when the
    | workload consists of queries that have large (often orders
    | of magnitude) differences in runtime.
|
| Consider 4 queries. Two run for 1sec, and the other two
| 1000sec. If we look at arithmetic mean, then we are really only
| taking into account the large queries. But improving geometric
| mean would require improving all queries.
|
| Note that I'm on the opposite side (Databricks cofounder here),
| so when I say that Snowflake didn't make a mistake here, you
| should trust me :)
| bjornsing wrote:
| > But improving geometric mean would require improving all
| queries.
|
      | No. Improving the geometric mean only requires reducing
      | the product of the execution times. So if you can make the
      | two 1 s queries execute in 0.5 s at the expense of the two
      | 1000 s queries taking 1800 s each, then that's an
      | improvement in terms of geometric mean.
|
| So... kind of QED. The geometric mean is not easy to reason
| about.
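The disagreement above is easy to check numerically. A quick pure-Python sketch using the figures from these comments (two 1 s and two 1000 s queries, the fast ones sped up to 0.5 s at the cost of the slow ones taking 1800 s):

```python
import math

def arith_mean(xs):
    return sum(xs) / len(xs)

def geo_mean(xs):
    # exp of the mean of logs == n-th root of the product
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

before = [1, 1, 1000, 1000]      # two fast queries, two slow ones (s)
after  = [0.5, 0.5, 1800, 1800]  # fast halved, slow 1.8x worse

# Arithmetic mean gets much worse (500.5 -> 900.25), dominated by the
# slow queries, while the geometric mean improves (~31.6 -> 30.0):
# exactly the trade-off between query classes described above.
```

So both commenters are right about the mechanics: the geometric mean weights relative improvements equally across queries, which also means it can reward a regression on the slow queries if the fast ones improve enough.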
| kthejoker2 wrote:
  | Snowflake conceding they have a 700% markup between Standard
  | and Premium editions, which has _zero impact_ on query
  | performance, is... well, it's something. I'd start squeezing
  | my sales engineers about that; definitely not sustainable...
|
| Also proof that lakehouse and spot compute price performance
| economics are here to stay, that's good for customers.
|
| Otherwise, as a vendor blog post with nothing but self-reported
| performance, this is worthless.
|
| Disclaimer: I work at Databricks but I admire Snowflake's product
| for what it is - iron sharpens iron.
| ghostridr wrote:
| Hey 1990s, your TPC-DS results are in.
| aptxkid wrote:
  | I genuinely think the DeWitt clause is good for users (bad for
  | researchers). Without it, especially in the context of
  | corporate competition, the company with the most marketing
  | power will win. Users can always compare different products
  | _themselves_. I am likely wrong, but please help me understand.
| aptxkid wrote:
| Personally I think it's a great response and very well written. I
| didn't jump on the congrats-Databricks wagon when the result
| first came out because of the weird front page comparison against
| snowflake. Both companies are doing great work. Focusing on
| building a better product for your customer is much more
| meaningful than making your competitor look bad.
| tyingq wrote:
| It is well written, but there's some sleight of hand here and
| there too. Like using your lowest tier product to demonstrate
| price/performance against a competitor's highest tier. The
| Snowflake lowest tier doesn't have failover, for example...or
| compliance features.
| aptxkid wrote:
      | Exactly. That's why I think a public benchmark war is just
      | a waste of time. There will ALWAYS be some subtle
      | differences between the two platforms, so results will
      | never be apples to apples.
| [deleted]
| michaelhartm wrote:
| * Databricks is unethical
|
| * Nobody should benchmark anymore, just focus on customers
| instead
|
| * But hey, we just did some benchmarks and we look better than
| what Databricks claims
|
  | * Btw, please sign up and do some benchmarks on Snowflake; we
  | actually ship the TPC-DS dataset with Snowflake
|
| * Btw, we agree with Databricks, let's remove the DeWitt clause,
| vendors should be able to benchmark each other!
|
| * Consistency is more important than anything else!!!
| aptxkid wrote:
    | I don't think they are saying benchmarks are unimportant,
    | but rather that a public benchmark war is a distraction.
| kingkongv2 wrote:
    | If people have never heard of Databricks, now is the time,
    | because a $100-billion company just started a war against
    | them. Great marketing win, Databricks.
| glogla wrote:
      | Databricks is at a $28B valuation with 2800 employees;
      | Snowflake is at a $109B valuation with 2500 employees.
      |
      | They are both multi-billion-dollar companies; we're hardly
      | talking David and Goliath here.
| alexott wrote:
        | For Databricks that's an old number - the most recent
        | valuation is $38B.
| geoduck14 wrote:
      | To be fair, I've been evaluating Databricks for a month or
      | so. Databricks is coming after Snowflake. Snowflake
      | doesn't care.
| Snowflake has a pretty solid moat with:
|
| EASY SQL, data sharing (they have a marketplace), simple
| scaling
| bpaneural wrote:
| You'll need to revisit this again. In the last two years
| Databricks has built a lead and a bigger moat. They're
| essentially nice chaps with a huge community backing them.
        | And we all love their open-source tools, which
        | essentially power not only their own big data platform
        | but everyone else's too (AWS, GCP).
| cloudbonsai wrote:
| The interesting part is that Snowflake omits Databricks'
| performance scores in their graphs. Here is how they compare on
| TPC-DS benchmark, based on two companies' self-reports:
|
| * Elapsed time: 3108s (Databricks) vs 3760s (Snowflake)
|
    | * Price/Performance: $242 (Databricks) vs $267 (Snowflake)
    |
    | Needless to say, these numbers badly need verification by
    | independent 3rd parties, but it seems that Databricks is
    | still ~18% faster and ~10% cheaper than Snowflake?
| geoduck14 wrote:
      | The way I read this is: Databricks benchmarked against us,
      | and they messed it up. Here is how YOU should evaluate
      | Snowflake performance. And, by the way, it is pretty easy
      | to do it.
| AtlasLion wrote:
  | The main question I have for Databricks is: how good is their
  | query optimiser/compiler? It's fun that you can run some
  | predefined set of queries fast. More important is how well you
  | can run queries in the real world, with suboptimal data models,
  | layers upon layers of badly written views, CTEs, UDFs... That
  | is what matters in the end, not some synthetic benchmark based
  | on known queries that you can optimise for specifically.
| maslam wrote:
    | @AtlasLion you are right, real-world performance matters. We
    | test extensively with actual workloads, and the speedup
    | holds there too. For example, lots of real-world BI queries
    | are repeated over smallish data sets of 10 to 50 GB. We test
    | that size factor and pattern all the time.
| geoduck14 wrote:
| I've been a customer/user of Snowflake. They make it simple to
| run SQL. There is a bunch of performance stuff that I don't need
| to worry about.
|
| I'm interested in using Databricks, but I haven't done it yet.
| I've heard good things about their product.
| [deleted]
| maslam wrote:
  | Databricks broke the record by 2x and is 10x more cost-
  | effective, in an audited benchmark. Snowflake should
  | participate in the official, audited benchmark. Customers win
  | when businesses are open and transparent...
| jiggawatts wrote:
| Audited how? If you look at the Snowflake response the numbers
| being posted by Databricks look outright faked or otherwise
| false.
| maslam wrote:
| Hey jiggawatts - TPC is the official way to audit benchmarks
| in the database industry. They've been around for a bit, but
| let me know if you want more info, I'm happy to share more
| about them.
| redis_mlc wrote:
| > TPC is the official way to audit benchmarks in the
| database industry.
|
| TPC is a benchmark suite for a certain problem class. It
| says nothing about how the databases are configured or
| managed.
|
| In case you're thick, the above is a polite way of calling
| you a liar.
|
| This is why Oracle and other database vendors don't allow
| publishing of benchmarks. It's to protect them from
| incompetent or lazy authors primarily.
|
| Source: DBA.
| lmeyerov wrote:
| It sounds fundamentally busted if a competitor can submit
| benchmarks for someone else. TPC is great in general, but I
| didn't realize it had such a gaping flaw.
|
| TPC submissions take real time/$/energy/expertise, so I
| don't know anyone who has ever done it casually. Ex: It was
| a multi-company effort for the RAPIDS community to get
| enough API coverage & edge case optimization for an end-to-
| end GPU submission on the big data one (SQL, ...), and even
| there the TPC folks made them resubmit if I remember right.
|
| Also, note how the parent's response did not actually
| answer 'audited how'. Pushing the work to the questioner is
| on the shortlist of techniques studied by misinformation
| researchers. I'm a fan of both companies, so disappointing
| to see from a company rep.
| rxin wrote:
| Check my reply, Leo.
| lmeyerov wrote:
            | The audit question is about Databricks marketing
            | unaudited Snowflake TPC numbers. I do think
            | Snowflake is big enough to run TPC, but how you guys
            | choose to market is on you.
|
| But: I think it's cool _both_ companies got it to
| $200-300. Way better than years ago. Next stop: GPUs :)
| rxin wrote:
| Ah ok. Wasn't clear. I think some repro scripts will be
| available soon.
| [deleted]
| Spivak wrote:
    | The results are so wildly different that either Snowflake or
    | Databricks is wrong or outright lying.
| jiggawatts wrote:
| This is my point also, and I'm being downvoted for it.
|
      | If two people are in disagreement about the same _facts_,
      | then one of them is either misinformed or lying. It's that
      | simple.
|
| If the only recourse seems to be to sink to the level of
| mud-slinging, with no clear ability to point to the audit
| trail and say "this is where it all went wrong", then it
| calls into question the value of that auditing process.
|
| I'm personally unimpressed with the TPC process in general.
| I remember one "benchmark" that showed the performance of a
| 2RU server breaking some record, and it was a minor
| footnote that it was using a disk array with 7,500 drives
| in it -- _dedicated_ to that one server for the duration of
      | the test. That's an absurd setup that will never exist at
      | any customer, ever.
|
| I ran that same software myself on literally the exact same
| server, and it couldn't even begin to approach the posted
| TPC numbers on typical storage. It was at least two orders
| of magnitude slower.
|
      | The rub was that its inefficient usage of storage _was the
      | main problem_, and the vendor was pulling a smoke-and-
      | mirrors trick to hide this deficiency of their product.
      | The TPC numbers were an outright fraud in this case, at
      | least in my mind.
|
| So to me, TPC looks like a staged show where the auditors
| are more like the referees in a WWE wrestling competition.
| ttmahdy wrote:
| The TPC audit process tends to be thorough and strict.
|
| Possibly you missed a configuration that was included in
| the Full Disclosure Report or Supporting Files?
|
| The Databricks official, audited benchmark was executed
| against Databricks SQL which is a PaaS service that
| doesn't allow special tuning btw.
| jiggawatts wrote:
| I didn't miss it. That doesn't make it any less
| misleading.
| AtlasLion wrote:
| That doesn't allow end users any configuration, but this
| doesn't apply to the company itself which can apply
| settings from the background on behalf of end users.
| rxin wrote:
      | There's an official TPC process to audit and review
      | benchmark results. This debate can be most easily settled
      | by everybody participating in the official benchmark, like
      | we (Databricks) did.
|
| The official review process is significantly more complicated
| than just offering a static dataset that's been highly
| optimized for answering the exact set of queries. It includes
| data loading, data maintenance (insert and delete data),
| sequential query test, and concurrent query test.
|
| You can see the description of the official process in this
| 141 page document:
| http://tpc.org/tpc_documents_current_versions/pdf/tpc-
| ds_v3....
|
| Consider the following analogy: Professional athletes compete
| in the Olympics, and there are official judges and a lot of
| stringent rules and checks to ensure fairness. That's the
| real arena. That's what we (Databricks) have done with the
| official TPC-DS world record. For example, in data warehouse
| systems, data loading, ordering and updates can affect
| performance substantially, so it's most useful to compare
| both systems on the official benchmark.
|
| But what's really interesting to me is that even
| Snowflake's self-reported number ($267) is still more
| expensive than Databricks' numbers ($143 on spot, and $242
| on demand). This is despite Databricks' cost being
| calculated on our enterprise tier, while Snowflake used its
| cheapest tier without any enterprise features (e.g.,
| disaster recovery).
|
| Edit: added link to audit process doc
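As a rough sanity check on the figures quoted in the comment above, the relative cost gap can be computed directly; the dollar amounts come from the comment and the arithmetic below is purely illustrative:

```python
# Figures quoted in the comment above (USD, total benchmark cost).
snowflake_self_reported = 267
databricks_spot = 143
databricks_on_demand = 242

# Relative cost of Snowflake vs. each Databricks configuration.
vs_spot = snowflake_self_reported / databricks_spot
vs_on_demand = snowflake_self_reported / databricks_on_demand

print(f"vs spot: {vs_spot:.2f}x")             # ~1.87x more expensive
print(f"vs on demand: {vs_on_demand:.2f}x")   # ~1.10x more expensive
```

So the claimed gap is large against spot pricing but fairly narrow against on-demand pricing, which is part of why the tier comparison (enterprise vs. cheapest) matters.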
| aptxkid wrote:
| Snowflake claims the Snowflake result reported by
| Databricks was not audited. The argument isn't that
| Databricks' numbers were artificially good, but rather
| that Snowflake's number was unreasonably bad.
| jiggawatts wrote:
| Please also refer to my comment below on the value of the
| TPC audit process:
| https://news.ycombinator.com/item?id=29208172
| _dark_matter_ wrote:
| Thanks for the additional context here. As someone who
| works for a company that pays for both databricks and
| snowflake, I will say that these results don't surprise me.
|
| Spark has always been infinitely configurable, in my
| experience. There are probably tens of thousands of
| possible configurations: everything from Java heap size to
| Parquet block size.
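To illustrate that configurability, here is a minimal sketch of a few real Spark configuration keys; the values are arbitrary examples rather than recommendations, and since applying them requires PySpark, the application step is shown only in comments:

```python
# A handful of the many knobs a Spark job commonly needs tuned.
# The keys are real Spark configuration properties; the values
# here are arbitrary illustrative choices, not recommendations.
spark_conf = {
    "spark.executor.memory": "8g",          # JVM heap per executor
    "spark.executor.cores": "4",            # cores per executor
    "spark.sql.shuffle.partitions": "400",  # shuffle parallelism
    "spark.sql.files.maxPartitionBytes": str(128 * 1024 * 1024),  # input split size
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

# With PySpark installed, the knobs would be applied roughly as:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("tuned-job")
#   for key, value in spark_conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()

for key, value in sorted(spark_conf.items()):
    print(f"{key}={value}")
```

And this only scratches the surface; cluster sizing, JVM GC flags, and file-format settings add many more dimensions.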
|
| Snowflake is the opposite: you can't even specify
| partitions! There is only clustering.
|
| For a business, running Snowflake is easy because engineers
| don't have to babysit it, and we like it because now we're
| free to work on more interesting problems*. Everybody
| wins.
|
| * Unless those problems are DB optimization. Then Snowflake
| can actually get in your way.
| rxin wrote:
| Totally. Simplicity is critical. That's why we built
| Databricks SQL not based on Spark.
|
| As a matter of fact, we took the extreme approach of not
| allowing customers (or ourselves) to set any of the known
| knobs. We want to force ourselves to build a system that
| runs well out of the box and still beats data warehouses
| on price-performance. The official result involved no
| tuning: we partitioned by date, loaded the data in,
| provisioned a Databricks SQL endpoint, and that was it.
| No additional knobs or settings. (In fact, Snowflake's
| own sample TPC-DS dataset has more tuning than ours did:
| they clustered by multiple columns specifically to
| optimize for the exact set of queries.)
| geoduck14 wrote:
| >That's why we built Databricks SQL not based on Spark.
|
| Wait... really? The sales folks I've been talking to
| didn't mention this. I assumed that when I ran SQL inside
| my Python, it was decomposed into Spark SQL with weird
| join problems (and other nuances I'm not fully familiar
| with).
|
| Not that THAT would have changed my mind. But it would
| have changed the calculus of "who uses this tool at my
| company" and "who do I get on board with this thing"
|
| Edit: To add, I've been a customer of Snowflake for
| years. I've been evaluating Databricks for 2 months, and
| put the POC on hold.
| alexott wrote:
| It's different; rxin is talking about this:
| https://databricks.com/product/databricks-sql
|
| When you run Python, it's on Spark, although you can now
| use the Photon engine, which DB SQL uses by default.
| mst wrote:
| Databricks and Snowflake should pay an independent third
| party to re-run these benchmarks. With results this
| different, in-house numbers from either company don't
| count.
| cmhill wrote:
| Databricks didn't run the Snowflake comparison in-house.
| Their article says: "These results were corroborated by
| research from Barcelona Supercomputing Center, which
| frequently runs TPC-DS on popular data warehouses. Their
| latest research benchmarked Databricks and Snowflake, and
| found that Databricks was 2.7x faster and 12x better in
| terms of price performance."
| dekhn wrote:
| I don't trust a supercomputer center to do a good job
| running a TPC benchmark (I do trust them to run LINPACK
| benchmarks).
___________________________________________________________________
(page generated 2021-11-13 23:02 UTC)