[HN Gopher] Snowflake's response to Databricks' TPC-DS post
       ___________________________________________________________________
        
       Snowflake's response to Databricks' TPC-DS post
        
       Author : uvdn7
       Score  : 66 points
       Date   : 2021-11-13 02:52 UTC (20 hours ago)
        
 (HTM) web link (www.snowflake.com)
 (TXT) w3m dump (www.snowflake.com)
        
       | PostThisTooFast wrote:
       | Another douchey HN title.
       | 
       | Obscurity isn't cool. Nobody gives a shit.
        
       | glogla wrote:
        | What do you know, here's an article[1] from 2017 about Databricks
        | making an unfortunate mistake that showed Spark Streaming (which
        | they sell) as a better streaming platform than Flink (which they
        | don't sell).
       | 
       | I really hope this is not the case again.
       | 
       | (yes, I understand my sarcasm is unneeded, I couldn't help
       | myself)
       | 
       | [1]: https://www.ververica.com/blog/curious-case-broken-
       | benchmark...
        
       | hiyer wrote:
        | Performance is only one part of the story. The major advantage
        | Snowflake (and to some extent Presto/Trino) brings to the table
        | is that it's pretty much plug and play. Spark, OTOH, usually
        | requires a lot of tweaking to work reliably for your workloads.
        
         | EdwardDiego wrote:
          | Very much true. I saw a joke tweet recently, something along
          | the lines of: "It's amazing how many data engineering scaling
          | issues these days are being solved by just paying Snowflake
          | more money."
         | 
         | Spark does take a lot of tuning, but then I'm guessing
         | Databricks offer that service as part of your licensing fee?
         | (I'd hope so if they're selling a product based on FOSS code,
         | there has to be a value add to justify it)
        
           | hiyer wrote:
           | > I'd hope so if they're selling a product based on FOSS
           | code, there has to be a value add to justify it
           | 
           | They have some proprietary features like DBIO [1]. They also
           | have some cloud-specific features like storage autoscaling
           | [2] that would not be available in OSS Spark. Even Delta Lake
           | [3] used to be proprietary, but I suspect the rise of open-
           | source frameworks like Iceberg led them to open-source it.
           | 
            | Shameless plug - when working at a since-shuttered competitor
            | to Databricks, I'd come up with storage autoscaling long
           | before them [4], so it's not unlikely that they were
           | "inspired" by us :-) .
           | 
           | 1. https://docs.databricks.com/spark/latest/spark-sql/dbio-
           | comm...
           | 
           | 2. https://databricks.com/blog/2017/12/01/transparent-
           | autoscali...
           | 
           | 3. https://delta.io/
           | 
           | 4. https://www.qubole.com/blog/auto-scaling-in-qubole-with-
           | aws-...
        
             | glogla wrote:
             | The open source Delta is not a replacement for the real
             | thing - they did not include features like optimizing small
             | files (small file problem is well known in big data, and
             | much more of a problem once streaming gets involved) and
             | others. It is more of a demo of the real thing. Which does
             | not stop them from repeating everywhere how open they are,
             | of course.
             | 
              | EDIT: Delta also still keeps partitioning information
              | in the Hive metastore, while Iceberg keeps it in storage,
              | making it a far superior design. Adopting Iceberg is harder
              | due to third-party tools like AWS Redshift not supporting
              | it - you have to go 100% of the way.
        
               | saj1th wrote:
               | >the delta also still keeps partitioning information in
               | the hive metastore, while iceberg keeps it in storage,
               | making it a far superior design.
               | 
               | Check out https://github.com/delta-
               | io/delta/blob/3ffb30d86c6acda9b59b9... when you get a
                | chance. You don't need the Hive metastore to query Delta
                | tables, since all the metadata for a Delta table is
                | stored alongside the data.
               | 
               | >they did not include features like optimizing small
               | files
               | 
               | For optimizing small files, you could run
               | https://docs.delta.io/latest/best-practices.html#compact-
               | fil...
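The "compact files" best practice mentioned above is, at heart, a bin-packing step: rewrite many small files into a few files near a target size. A toy pure-Python sketch of the planning step (hypothetical sizes and target, not the actual Delta OPTIMIZE implementation):

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily pack small files into rewrite groups of at most target_mb total."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Start a new group once adding this file would overshoot the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 1000 one-megabyte files compact into 8 rewrite groups of <= 128 MB each.
plan = plan_compaction([1] * 1000)
print(len(plan), max(sum(g) for g in plan))  # 8 128
```

The real engines also weigh rewrite cost and data clustering, but the payoff is the same: far fewer files for the query planner to list and open.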
        
         | rdeboo wrote:
         | I think the comparison was Snowflake vs Databricks SQL.
         | Databricks SQL is a PaaS service just like Snowflake. Also, it
         | uses their Photon engine, which is a proprietary engine written
         | in C++. It is not Spark.
        
           | hiyer wrote:
           | I'm aware that Databricks is a PaaS service, but what
           | Databricks runs under the hood is Spark (with a few
            | proprietary extensions). So your jobs/queries do require some
            | tuning, just like with OSS Spark.
           | 
           | Spark has had SQL engines (SparkSQL/Hive on Spark) for a long
           | time. Photon is just a new, faster one. Photon tasks also run
           | on Spark executors only, so it's not independent of Spark[1].
           | Also, while it's proprietary now, I wouldn't be surprised if
           | Databricks open-sources it in the future, like they did with
           | Delta Lake.
           | 
           | 1. https://databricks.com/blog/2021/06/17/announcing-photon-
           | pub...
        
       | blobbers wrote:
       | This is the sort of FUD testing that gets thrown back and forth
       | between companies of all kinds.
       | 
        | If you're in networking, it's throughput, latency or fairness. If
        | you're in graphics, it's your shaders or polygons or hashes. If
        | you're in CPUs, it's your clock speed. If it's cameras, it's
        | megapixels (but nobody talks about lenses or real measures of
        | clarity). If you're in silicon, it's your die size (none of that
        | has mattered for years; those numbers are like versions, not the
        | largest block on your die). If you're in finance, it's about your
        | returns or your drawdowns or your Sharpe ratios.
       | 
       | I'm a little bit surprised how seriously databricks is taking
       | this, but maybe it's because one of the cofounders laid this
       | claim. Ultimately what you find is one company is not very good
       | at setting up the other company's system, and the result is the
       | benchmarks are less than ideal.
       | 
       | So why not have a showdown? Both founders, streamed live, running
       | their benchmarks on the data. NETFLIX SPECIAL!
        
         | rxin wrote:
         | Exactly. Not sure about Netflix special, but there are experts
         | that have dedicated their professional careers to creating fair
         | benchmarks. Snowflake should just participate in the official
         | TPC benchmark.
         | 
         | Disclaimer: Databricks cofounder who authored the original blog
         | post.
        
           | AtlasLion wrote:
            | The benchmark itself is kinda useless, so I don't see why
            | they should. If you look at TPC-H, for years you had Exasol
            | as the top dog, but in the real world that meant nothing for
            | them.
        
       | throwaway984393 wrote:
       | "Posting benchmark results is bad because it quickly becomes a
       | race to the wrong solution. But somebody showed us sucking on a
       | benchmark, so here's our benchmark results showing we're better."
        
         | Rastonbury wrote:
         | I'm not familiar with this realm to comment on veracity of
         | claims but it could very well be
         | 
         | "Posting benchmark results is bad because it quickly becomes a
         | race to the wrong solution. Someone misrepresented our
         | performance in a benchmark, here are the actual results."
        
         | aptxkid wrote:
          | I disagree. It makes sense for Snowflake to respond to what
          | they think is an unreasonably bad result published by
         | Databricks. And they focused more on Snowflake's result and
         | only compared dollar cost against Databricks. It's consistent
         | with their philosophy that public benchmark war is beside the
         | point and mostly a distraction.
        
         | AtlasLion wrote:
         | Their cofounder was behind vectorwise, which kicked ass in
         | benchmarks, but died as no one even heard of it. You can run
         | the benchmark queries fast, that's great, but can you handle
          | code migrated from Vertica? Will your optimiser come up with a
          | good plan for queries built on 15 layers of views? That's what
         | companies in the real world have, not some synthetic benchmark
         | that you can make sure you can run for marketing purposes.
        
       | pxc wrote:
       | Can someone ELI5 what Snowflake and Databricks are? I spent a few
       | minutes on the Databricks website once and couldn't really
       | penetrate the marketing jargon.
       | 
       | There are also some technical terms I don't know at all, and when
       | I've searched for them, the top results are all more Azure stuff.
       | Like wtf is a datalake?
        
         | jeffreygoesto wrote:
         | People who downvoted this, please take a minute and reflect
         | that your world is not the whole world. There is a serious
         | question in this comment and there are myriads of topics _you_
         | have no clue about.
        
           | dekhn wrote:
           | sure, but if I see the term 'data lake' I'm gonna Bing it,
           | with the first result being https://aws.amazon.com/big-
           | data/datalakes-and-analytics/what... which explains it
           | nicely.
           | 
           | ELI5 is for reddit, generally here we expect you can google
           | it to get the ELI5 explanation before giving us your hot take
           | in a comment
        
             | aptxkid wrote:
             | It's probably just me but the distinction between datalake
              | and data warehouse seems like splitting hairs. Unstructured
              | data can always be stored in structured databases. What's
             | the main reason for both to coexist?
        
               | glogla wrote:
               | It used to be that way. Old data warehouses (built on
               | relational dbs) couldn't handle large scale data, and old
               | data lakes used to be hard to use (write a map-reduce job
               | to query data).
               | 
               | It is barely true nowadays.
        
               | incomplete wrote:
               | i worked at excite.com right after the IPO, and front and
               | center in the HQ building was a MASSIVE glass wall
               | showcasing the oracle data warehouse machine room.
               | 
               | i didn't enjoy working w/either the datastore directly,
               | or the DBA team that ran it either. an early, more old-
               | white-dude "i just want to serve 5T"
        
               | dekhn wrote:
               | History matters here and I don't know how well this is
               | documented, but: data warehouses have been around since
                | the 70s or so; data lake is a newer term. Data warehouses
                | came from an era where nearly all data was stored in the
                | database itself (typically Oracle), owned and controlled
                | by a single or a few groups, and there were only a few
                | databases, which were the source of truth (the two
                | databases would normally be a transaction engine handling
                | real-time load (just what's required to authorize a
                | credit card transaction, for example) and a "warehouse"
                | which contained all the long-term data, like every
                | transaction that had ever occurred).
               | 
               | Data lakes are more modern and came about as people
               | realized they had 30 databases and the business wanted to
               | do queries against all of them simultaneously (IE, join
               | your credit card transaction history with historical
               | rates of default in a zip code), quickly. The data
               | warehouse solution was to use federated database queries
               | (JOINs across databases), or force everybody to
               | consolidate. A data lake is a single virtual entity that
               | represents "all your data in one place".
               | 
               | It's based on a weak analogy where a warehouse is a place
               | where you put stuff in very well organized locations
               | while a lake is a place where a bunch of different waters
               | slosh together.
               | 
               | Storing unstructured data in a database is dumb because
               | databases cost about 10X storage space due to indexing,
               | while unstructured data often can just sit around
               | passively in a filesystem (and/or have a filesystem index
               | built into it for fast queries).
               | 
               | I view this through the lens of web tech, for example,
               | see the wars between the mapreduce and database people
               | and how Google evolved from MapReduce against GFS to
               | Flumes against Spanner, showing we just live in an
               | endless cycle of renaming old technology.
               | 
                | It's absolutely correct that the terminology doesn't map
                | perfectly.
        
               | pxc wrote:
               | This was really helpful, too. Thanks!
        
             | pxc wrote:
             | Yeah, that's exactly the kind of content I found unsuitable
             | when I did a web search for the term. It spends a whole two
             | sentences giving an explanation that tells me very little
             | about how data lakes are anything more specific than a
             | cloud-hosted database solution, and moves on to
             | 
             | > Organizations that successfully generate business value
             | from their data, will outperform their peers.
             | 
             | at which point I'm like
             | 
             | > ok, I'm reading a covert advertisement about Fancy Cloud
             | Technology aimed at some kind of big-spending manager,
             | which is unlikely to tell me meaningfully what this
             | actually _is_
             | 
             | and I'm out. I was looking for content that was in a more
             | neutral, purely educational genre, and wondering what
             | collection of non-cloud analogues it replaces/is composed
             | of. Someone writing in the comments
             | 
             | > I used it to transform several terabytes of JSON into
             | nice relational data for analysts without too much effort
             | 
             | is way, way more direct and helpful than mentioning that
             | 'unlike data warehouses, data lakes support non-relational
             | data'. Like great, it's a cloud thing that supports a
             | variety of databases. But what is it?
             | 
             | > before giving us your hot take in a comment
             | 
             | I didn't give any take at all? I just really found all the
             | sources that came up on the first page of search results to
             | be almost in the wrong genre for me, and expected
             | (correctly) that people on this site would be able to
             | produce descriptions in 1-5 sentences that worked way
             | better for me.
             | 
             | Pretty much all of the answers I got here were really good,
             | and I'm glad I asked.
        
         | PostThisTooFast wrote:
         | "ELI5"?
         | 
         | Try English next time.
        
         | IanCal wrote:
         | Snowflake is (amongst other things but primarily to me) SQL
         | database as a service, designed for analytical queries over
         | large datasets.
         | 
         | It separates compute and storage, so there's just a big ol'
         | pile of data and tables, then it spins up large machines to
         | crunch the data on demand.
         | 
         | Data storage is cheap and the machines are expensive per hour
         | but running for shorter times, and with little to no ops work
         | required it can be a cheap overall system.
         | 
         | Bunch of other features that are handy or vital depending on
         | your use case (instant data sharing across accounts, for
         | example).
         | 
         | I've used it to transform terabytes of JSON into nice
         | relational tables for analysts to use with very little effort.
         | 
         | Hopefully that's a useful overview of what kind of thing it is
         | and where it sits.
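The "terabytes of JSON into relational tables" workflow described above boils down to flattening nested records into columns. In Snowflake itself this is typically done in SQL over a VARIANT column; the snippet below is just a toy pure-Python illustration of the shape of that transformation:

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single row keyed by dotted column names."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column-name prefix.
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row

event = {"user": {"id": 7, "plan": "pro"}, "action": "login"}
print(flatten(event))
# {'user.id': 7, 'user.plan': 'pro', 'action': 'login'}
```

Each flattened row maps naturally onto a relational table, which is why semi-structured ingestion plus SQL extraction covers so many analytics pipelines.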
        
         | legerdemain wrote:
          | Snowflake is a hosted database that uses SQL. Two distinctions
          | it has are that (1) it lets users pay for data storage and
          | compute power separately and independently, and (2) it takes
          | decisions about data indexing out of your hands.
         | 
         | Databricks is a vendor of hosted Spark (and is operated by the
         | creators of Spark). Spark is software for coordinating data
         | processing jobs on multiple machines. The jobs are written
         | using a SQL-like API that allows fairly arbitrary
         | transformations. Databricks also offers storage using their
         | custom virtual cloud filesystem that exposes stored datasets as
         | DB tables.
         | 
         | Both vendors also offer interactive notebook functionality
         | (although Databricks has spent more time on theirs). They're
         | both getting into dashboarding (I think).
         | 
         | Ultimately, they're both selling cloud data services, and their
         | product offerings are gradually converging.
        
         | kevindeasis wrote:
         | They are a data warehouse with analytics? So data warehouse as
         | a service in the cloud?
         | 
         | So they can collect data from different places like sql,
         | images, etc. I think a better question would be what type of
         | data can't they ingest?
         | 
          | Once you have your data, I guess you can run some analytics to
          | find out what your data tells you.
        
           | geoduck14 wrote:
           | I'd like to add some points: Ive used Snowflake for several
           | years. Snowflake works with structured and semi-structured
           | data (think spreadsheets and JSON). I've never tried working
           | with pics or videos - and I'm not sure it would make sense to
           | do that.
           | 
           | I've evaluated Databricks. It works with the above mentioned
           | structured and semi-structured data. I also suspect it could
           | process unstructured data. My understanding is that it runs
           | Python (and some others), so you can do any "Python stuff,
           | but in the cloud, and on 1000s of computers"
        
         | mping wrote:
          | A data lake is a system designed for ingesting, and possibly
          | transforming, lots of data - a "lake" where you dump your data.
          | This is different from, e.g., a Postgres DB (a single source of
          | truth for a CRUD app, for example), because it captures more
          | data (e.g. events) and it's normally not consistent with the
          | single source of truth (the data may arrive in batches, be
          | imported from other databases, etc.). Because the volume of
          | data is normally huge, you need a cluster to store it, and some
          | way of querying it.
         | 
          | Snowflake and Databricks are companies that operate in this
          | space, providing ways to ingest, transform and analyze large
          | volumes of data.
        
         | ngc248 wrote:
         | A data lake is a company wide data repository. All the "data
         | streams" from all of the different departments will flow into
         | the data lake. Aim is to use this data to get both macro and
         | micro insights.
        
       | socaldata wrote:
       | Take all the problems you have had with data warehousing and
       | throw them in a proprietary cloud. That is Snowflake. They are
       | the best today.
       | 
       | Databricks started with the cloud datalake, sitting natively on
       | parquet and using cloud native tools, fully open. Recently they
       | added SQL to help democratize the data in the data lake versus
       | moving it back and forth into a proprietary data warehouse.
       | 
       | The selling point in Databricks is why move the data around when
       | you can just have it in one place IF performance is the same or
       | better.
       | 
        | This is what led to the latest benchmark, which, as written,
        | appears to be unbiased.
       | 
        | In Snowflake's response, however, they condemn it but then submit
        | their own findings. Sounds a lot like Trump telling everyone he
        | had billions of people attend his inauguration, doesn't it?
       | 
       | Anyhow, I trust independent studies more than I do coming from
       | vendors. It cannot be argued or debated unless it was unfairly
       | done. I think we are all smart enough to be careful with studies
       | of any kind, but I can see why Databricks was excited about the
       | findings.
        
         | aptxkid wrote:
          | Whose results can be trusted is beside the point - I actually
          | believe both experiments were likely conducted in good faith,
          | but with incomplete context. The point is there's no good
          | reason to start a benchmark war to begin with.
        
         | glogla wrote:
          | Delta Lake is not meaningfully more "open" than whatever
          | Snowflake (or BigQuery and Redshift) are doing. It does not
          | require any less "moving data around".
         | 
         | With all these, the data sits on cloud storage and compute is
         | done by cloud machines - the difference between Databricks and
         | the others is that with Databricks, you can take a look at that
         | bucket. But you're not going to be able to do much with that
          | data without paying for Databricks compute, since the open
          | source Delta library is not usable in the real world.
         | 
          | Since commercial data warehouses are an enterprise product for
          | enterprise companies (small companies can stick with normal
          | databases or SaaS, and unicorns seem to roll their own with
          | Presto/Trino, Iceberg, Spark and k8s nowadays), the vendor and
          | the product need, most of all, to be a reliable partner. And
          | Databricks' behavior does not inspire confidence in them being
          | that.
         | 
          | If I'm outsourcing my analytical platform to a vendor, I want
          | them to be almost boring. Not some growth hacking, guerilla
         | marketing, sketchy benchmark posting techbros.
         | 
          | At the end of the day, anyone making multi-year, million-dollar
          | decisions in this space should run their own evaluation.
         | Our evaluation showed that there's a noticeable gap between
         | what Databricks promises and what they deliver. I have not
         | worked with Snowflake to compare.
        
       | choppaface wrote:
        | The audience for these posts is enterprise managers who don't
        | actually understand their compute needs.
       | 
       | For the more technically inclined, don't let any corporate blog
       | post / comms piece live in your head rent-free. If you're a
       | customer, make them show you value for their money. If you're
       | not, make them provide you tools / services for free. Just don't
       | help them fuel the pissing contest, you'll end up a bag holder
       | (swag holder?).
        
       | bjornsing wrote:
       | > At the end of the script, the overall elapsed time and the
       | geometric mean for all the queries is computed directly by
       | querying the history view of all TPC-DS statements that have
       | executed on the warehouse.
       | 
       | The geometric mean? Really? Feels a lot easier to think in terms
       | of arithmetic mean, and perhaps percentiles.
        
         | rxin wrote:
          | Geometric mean is commonly used in benchmarks when the
          | workload consists of queries that have large (often orders-of-
          | magnitude) differences in runtime.
         | 
         | Consider 4 queries. Two run for 1sec, and the other two
         | 1000sec. If we look at arithmetic mean, then we are really only
         | taking into account the large queries. But improving geometric
         | mean would require improving all queries.
         | 
         | Note that I'm on the opposite side (Databricks cofounder here),
         | so when I say that Snowflake didn't make a mistake here, you
         | should trust me :)
        
           | bjornsing wrote:
           | > But improving geometric mean would require improving all
           | queries.
           | 
            | No. Improving the geometric mean only requires reducing the
            | product of their execution times. So if you can make the two
            | 1 sec queries execute in 0.5 sec at the expense of the two
            | 1000 sec queries taking 1800 sec each, then that's an
            | improvement in terms of geometric mean.
           | 
           | So... kind of QED. The geometric mean is not easy to reason
           | about.
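The trade-off described in this comment can be verified directly (units don't matter, only the ratios):

```python
from math import prod

def gmean(xs):
    """Geometric mean: the n-th root of the product of n values."""
    return prod(xs) ** (1 / len(xs))

before = [1, 1, 1000, 1000]
after = [0.5, 0.5, 1800, 1800]  # fast queries halved; slow queries 80% slower

print(gmean(before))  # sqrt(1000), ~31.6
print(gmean(after))   # sqrt(900) = 30.0 -> the geometric mean "improved"
print(sum(before))    # 2002
print(sum(after))     # total runtime nearly doubled
```

So the geometric mean can improve while total (and arithmetic-mean) runtime gets much worse, which is exactly the objection being raised.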
        
       | kthejoker2 wrote:
        | Snowflake conceding they have a 700% markup between Standard and
        | Premium editions which has _zero impact_ on query performance is
        | ... well, it's something. I'd start squeezing my sales engineers
        | about that; definitely not sustainable...
       | 
       | Also proof that lakehouse and spot compute price performance
       | economics are here to stay, that's good for customers.
       | 
       | Otherwise, as a vendor blog post with nothing but self-reported
       | performance, this is worthless.
       | 
       | Disclaimer: I work at Databricks but I admire Snowflake's product
       | for what it is - iron sharpens iron.
        
       | ghostridr wrote:
       | Hey 1990s, your TPC-DS results are in.
        
       | aptxkid wrote:
        | I genuinely think the DeWitt clause is good for users (bad for
        | researchers). Without it, especially in the context of corporate
        | competition, the company with the most marketing power will win.
        | Users can always compare different products _themselves_. I am
        | likely wrong, but please help me understand.
        
       | aptxkid wrote:
       | Personally I think it's a great response and very well written. I
       | didn't jump on the congrats-Databricks wagon when the result
       | first came out because of the weird front page comparison against
       | snowflake. Both companies are doing great work. Focusing on
       | building a better product for your customer is much more
       | meaningful than making your competitor look bad.
        
         | tyingq wrote:
         | It is well written, but there's some sleight of hand here and
         | there too. Like using your lowest tier product to demonstrate
         | price/performance against a competitor's highest tier. The
         | Snowflake lowest tier doesn't have failover, for example...or
         | compliance features.
        
           | aptxkid wrote:
           | Exactly. That's why I think public benchmark war is just a
            | waste of time. There will ALWAYS be some subtle differences
            | between the two platforms, so results will never be apples
            | to apples.
        
       | [deleted]
        
       | michaelhartm wrote:
       | * Databricks is unethical
       | 
       | * Nobody should benchmark anymore, just focus on customers
       | instead
       | 
       | * But hey, we just did some benchmarks and we look better than
       | what Databricks claims
       | 
       | * Btw, please sign up and do some benchmarks on Snowflake, we
       | actually ship TPC-DS dataset with Snowflake
       | 
       | * Btw, we agree with Databricks, let's remove the DeWitt clause,
       | vendors should be able to benchmark each other!
       | 
       | * Consistency is more important than anything else!!!
        
         | aptxkid wrote:
          | I don't think they are saying benchmarks are not important,
          | but rather that a public benchmark war is a distraction.
        
         | kingkongv2 wrote:
            | If people have never heard of Databricks, now is the time,
            | because a $100 billion company just started a war against
            | them. Great marketing win, Databricks.
        
           | glogla wrote:
            | Databricks is a $28B valuation with 2800 employees; Snowflake
            | is a $109B valuation with 2500 employees.
            | 
            | They are both multi-billion-dollar companies; we're hardly
            | talking David and Goliath here.
        
             | alexott wrote:
            | for DB that's an old number - the most recent valuation is
            | $38B
        
           | geoduck14 wrote:
            | To be fair, I've been evaluating Databricks for a month or
            | so. Databricks is coming after Snowflake. Snowflake doesn't
            | care. Snowflake has a pretty solid moat with:
           | 
           | EASY SQL, data sharing (they have a marketplace), simple
           | scaling
        
             | bpaneural wrote:
             | You'll need to revisit this again. In the last two years
             | Databricks has built a lead and a bigger moat. They're
              | essentially nice chaps with a huge community backing them.
              | And we all love their open source tools, which essentially
              | power not only their big data platforms, but everyone
              | else's too (AWS, GCP).
        
         | cloudbonsai wrote:
         | The interesting part is that Snowflake omits Databricks'
         | performance scores in their graphs. Here is how they compare on
         | TPC-DS benchmark, based on two companies' self-reports:
         | 
         | * Elapsed time: 3108s (Databricks) vs 3760s (Snowflake)
         | 
          | * Price/Performance: $242 (Databricks) vs $267 (Snowflake)
         | 
          | Needless to say, these numbers seriously need verification by
          | independent third parties, but it seems that Databricks is
          | still 18% faster and 10% cheaper than Snowflake?
        
           | geoduck14 wrote:
            | The way I read this is: Databricks benchmarked against us,
            | and they messed it up. Here is how YOU should evaluate
            | Snowflake performance. And, by the way, it is pretty easy to
            | do it.
        
       | AtlasLion wrote:
       | The main question I have for DB is: how good is their query
       | optimiser/compiler? It's fun that you can run some predefined
       | set of queries fast. More important is how well you can run
       | queries in the real world, with suboptimal data models, layers
       | upon layers of badly written views, CTEs, UDFs... That is what
       | matters in the end, not some synthetic benchmark based on
       | known queries you can optimise specifically for.
        
         | maslam wrote:
         | @AtlasLion you are right, real-world performance matters. We
         | test extensively with actual workloads, and the speedup
         | holds there too. For example: lots of real-world BI queries
         | are repeated over smallish data sets of 10 to 50 GB. We test
         | that size and pattern all the time.
        
       | geoduck14 wrote:
       | I've been a customer/user of Snowflake. They make it simple to
       | run SQL. There is a bunch of performance stuff that I don't need
       | to worry about.
       | 
       | I'm interested in using Databricks, but I haven't done it yet.
       | I've heard good things about their product.
        
       | [deleted]
        
       | maslam wrote:
       | Databricks broke the record (by 2x) and is 10x more cost
       | effective, in an audited benchmark. Snowflake should
       | participate in the official, audited benchmark. Customers win
       | when businesses are open and transparent...
        
         | jiggawatts wrote:
         | Audited how? If you look at the Snowflake response the numbers
         | being posted by Databricks look outright faked or otherwise
         | false.
        
           | maslam wrote:
           | Hey jiggawatts - TPC is the official way to audit benchmarks
           | in the database industry. They've been around for a bit, but
           | let me know if you want more info, I'm happy to share more
           | about them.
        
             | redis_mlc wrote:
             | > TPC is the official way to audit benchmarks in the
             | database industry.
             | 
             | TPC is a benchmark suite for a certain problem class. It
             | says nothing about how the databases are configured or
             | managed.
             | 
             | In case you're thick, the above is a polite way of calling
             | you a liar.
             | 
             | This is why Oracle and other database vendors don't allow
             | publishing of benchmarks. It's to protect them from
             | incompetent or lazy authors primarily.
             | 
             | Source: DBA.
        
             | lmeyerov wrote:
             | It sounds fundamentally busted if a competitor can submit
             | benchmarks for someone else. TPC is great in general, but I
             | didn't realize it had such a gaping flaw.
             | 
             | TPC submissions take real time/$/energy/expertise, so I
             | don't know anyone who has ever done it casually. Ex: It was
             | a multi-company effort for the RAPIDS community to get
             | enough API coverage & edge case optimization for an end-to-
             | end GPU submission on the big data one (SQL, ...), and even
             | there the TPC folks made them resubmit if I remember right.
             | 
             | Also, note how the parent's response did not actually
             | answer 'audited how'. Pushing the work to the questioner is
             | on the shortlist of techniques studied by misinformation
             | researchers. I'm a fan of both companies, so disappointing
             | to see from a company rep.
        
               | rxin wrote:
               | Check my reply, Leo.
        
               | lmeyerov wrote:
               | The audit question is on Databricks marketing unaudited
               | Snowflake TPC numbers. I do think Snowflake is big enough
               | to run TPC, but how you guys choose to market is on you.
               | 
               | But: I think it's cool _both_ companies got it to
               | $200-300. Way better than years ago. Next stop: GPUs :)
        
               | rxin wrote:
               | Ah ok. Wasn't clear. I think some repro scripts will be
               | available soon.
        
               | [deleted]
        
           | Spivak wrote:
           | The results are so crazy different that either Snowflake or
           | Databricks are wrong or outright lying.
        
             | jiggawatts wrote:
             | This is my point also, and I'm being downvoted for it.
             | 
             | If two people are in disagreement about the same
             | _facts_, then one of them is either misinformed or
             | lying. It's that simple.
             | 
             | If the only recourse seems to be to sink to the level of
             | mud-slinging, with no clear ability to point to the audit
             | trail and say "this is where it all went wrong", then it
             | calls into question the value of that auditing process.
             | 
             | I'm personally unimpressed with the TPC process in general.
             | I remember one "benchmark" that showed the performance of a
             | 2RU server breaking some record, and it was a minor
             | footnote that it was using a disk array with 7,500 drives
             | in it -- _dedicated_ to that one server for the duration
             | of the test. That's an absurd setup that will never
             | exist at any customer, ever.
             | 
             | I ran that same software myself on literally the exact same
             | server, and it couldn't even begin to approach the posted
             | TPC numbers on typical storage. It was at least two orders
             | of magnitude slower.
             | 
             | The rub was that its inefficient usage of storage _was
             | the main problem_, and the vendor was pulling a smoke &
             | mirrors trick to hide this deficiency of their product.
             | The TPC numbers were an outright fraud in this case, at
             | least in my mind.
             | 
             | So to me, TPC looks like a staged show where the auditors
             | are more like the referees in a WWE wrestling competition.
        
               | ttmahdy wrote:
               | The TPC audit process tends to be thorough and strict.
               | 
               | Possibly you missed a configuration that was included in
               | the Full Disclosure Report or Supporting Files?
               | 
               | The Databricks official, audited benchmark was
               | executed against Databricks SQL, which is a PaaS
               | service that doesn't allow special tuning, btw.
        
               | jiggawatts wrote:
               | I didn't miss it. That doesn't make it any less
               | misleading.
        
               | AtlasLion wrote:
               | That doesn't allow end users any configuration, but this
               | doesn't apply to the company itself which can apply
               | settings from the background on behalf of end users.
        
           | rxin wrote:
           | There's an official TPC process to audit and review the
           | benchmark process. This debate can be easiest settled by
           | everybody participating in the official benchmark, like we
           | (Databricks) did.
           | 
           | The official review process is significantly more
           | complicated than just offering a static dataset that's
           | been highly optimized for answering the exact set of
           | queries. It includes data loading, data maintenance
           | (inserting and deleting data), a sequential query test,
           | and a concurrent query test.
           | 
           | You can see the description of the official process in this
           | 141 page document:
           | http://tpc.org/tpc_documents_current_versions/pdf/tpc-
           | ds_v3....
           | 
           | Consider the following analogy: Professional athletes compete
           | in the Olympics, and there are official judges and a lot of
           | stringent rules and checks to ensure fairness. That's the
           | real arena. That's what we (Databricks) have done with the
           | official TPC-DS world record. For example, in data warehouse
           | systems, data loading, ordering and updates can affect
           | performance substantially, so it's most useful to compare
           | both systems on the official benchmark.
           | 
           | But what's really interesting to me is that even the
           | Snowflake self-reported numbers ($267) are still more
           | expensive than Databricks' numbers ($143 on spot, and $242
           | on demand). This is despite Databricks' cost being
           | calculated on our enterprise tier, while Snowflake used
           | their cheapest tier without any enterprise features (e.g.
           | disaster recovery).
           | 
           | Edit: added link to audit process doc
        
             | aptxkid wrote:
             | Snowflake claims the snowflake result from Databricks was
             | not audited. It's not that Databricks numbers were
             | artificially good but rather Snowflake's number was
             | unreasonably bad.
        
             | jiggawatts wrote:
             | Please also refer to my comment below on the value of the
             | TPC audit process:
             | https://news.ycombinator.com/item?id=29208172
        
             | _dark_matter_ wrote:
             | Thanks for the additional context here. As someone who
             | works for a company that pays for both databricks and
             | snowflake, I will say that these results don't surprise me.
             | 
             | Spark has always been infinitely configurable, in my
             | experience. There are probably tens of thousands of
             | possible configurations; everything from Java heap size to
             | parquet block size.
             | 
             | Snowflake is the opposite: you can't even specify
             | partitions! There is only clustering.
             | 
             | For a business, running snowflake is easy because
             | engineers don't have to babysit it, and we like it
             | because now we're free to work on more interesting
             | problems*. Everybody wins.
             | 
             | * Unless those problems are DB optimization. Then
             | snowflake can actually get in your way.
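To illustrate the "tens of thousands of possible configurations" point above, here is a small sample of commonly tuned Spark settings. The keys are real Spark configuration names; the values are purely illustrative, not recommendations:

```python
# A handful of the many Spark knobs teams end up tuning by hand.
# Keys are genuine Spark configuration names; values are illustrative only.
spark_conf = {
    "spark.executor.memory": "8g",                     # JVM heap per executor
    "spark.executor.cores": "4",                       # task slots per executor
    "spark.sql.shuffle.partitions": "400",             # partition count after a shuffle
    "spark.sql.files.maxPartitionBytes": "134217728",  # 128 MB input split size
    "spark.memory.fraction": "0.6",                    # execution + storage share of heap
}

# In PySpark these would typically be applied at session build time, e.g.:
#   builder = SparkSession.builder
#   for key, value in spark_conf.items():
#       builder = builder.config(key, value)
print(f"{len(spark_conf)} knobs configured")
```

Snowflake, by contrast, exposes essentially none of this to the user, which is exactly the trade-off the comment above describes.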
        
               | rxin wrote:
               | Totally. Simplicity is critical. That's why we built
               | Databricks SQL not based on Spark.
               | 
               | As a matter of fact, we took the extreme approach of
               | not allowing customers (or ourselves) to set any of
               | the known knobs. We want to force ourselves to build a
               | system that runs well out of the box and yet still
               | beats data warehouses in price-performance. The
               | official result involved no tuning: we partitioned by
               | date, loaded the data in, provisioned a Databricks SQL
               | endpoint, and that's it. No additional knobs or
               | settings. (In fact, Snowflake's own sample TPC-DS
               | dataset has more tuning than ours did: they clustered
               | by multiple columns specifically to optimize for the
               | exact set of queries.)
        
               | geoduck14 wrote:
               | >That's why we built Databricks SQL not based on Spark.
               | 
               | Wait... really? The sales folks I've been talking to
               | didn't mention this. I assumed that when I ran SQL inside
               | my Python, it was decomposed into Spark SQL with weird
               | join problems (and other nuances I'm not fully familiar
               | with).
               | 
               | Not that THAT would have changed my mind. But it would
               | have changed the calculus of "who uses this tool at my
               | company" and "who do I get on board with this thing"
               | 
               | Edit: To add, I've been a customer of Snowflake for
               | years. I've been evaluating Databricks for 2 months, and
               | put the POC on hold.
        
               | alexott wrote:
               | it's different - rxin talks about this:
               | https://databricks.com/product/databricks-sql
               | 
               | when you run Python, it's on Spark, although you can
               | now use the Photon engine, which is used for DB SQL by
               | default
        
         | mst wrote:
         | Databricks and snowflake should pay an independent third party
         | to re-run these. In-house benchmarks by either company don't
         | count with results this different.
        
           | cmhill wrote:
           | Databricks didn't run the Snowflake comparison in-house. From
           | their article it says: "These results were corroborated by
           | research from Barcelona Supercomputing Center, which
           | frequently runs TPC-DS on popular data warehouses. Their
           | latest research benchmarked Databricks and Snowflake, and
           | found that Databricks was 2.7x faster and 12x better in terms
           | of price performance."
        
             | dekhn wrote:
             | I don't trust a supercomputer center to do a good job
             | running a TPC benchmark (I do trust them to run LINPACK
             | benchmarks).
        
       ___________________________________________________________________
       (page generated 2021-11-13 23:02 UTC)