Why is Snowflake so expensive?

Stas Sajin · Published in Dev Genius · Aug 16 · 9 min read

I was recently reviewing Snowflake's quarterly financial statements and came to the realization that their product is so well loved that they could stop adding new customers and still get double-digit revenue growth. This chart from Clouded Judgement really highlights how sticky the product is: a customer that joined a year ago and spent $1 is paying well over $1.70 a year later. Snowflake doesn't yet have an expansive catalog of products to upsell or cross-sell, so the majority of this new revenue is driven by increased usage, low churn, and because ... Snowflake's pricing model doesn't scale.

[0] Clouded Judgement: this figure shows that Snowflake has very high net revenue retention, indicating that once you start using the platform, you keep expanding your usage.

Imagine the outcome of this conversation between a performance engineer working at Snowflake and RevOps management.

* Engineer: "I figured out a way to improve query performance by 20%."
* RevOps: "Great, we're not going to use that..."

Snowflake has no incentive to ship a code change that makes things 20% faster, because that can correspond to a 10-20% drop in short-term revenue. In typical Innovator's Dilemma fashion, Snowflake instead prioritizes an ever larger menu of compute options, like Snowpark and data apps built on Streamlit, that will bleed your organization dry. This is the natural course of a usage-based SaaS model, and many other companies have their own non-scalable pricing structures. For example, what person at Google thought it was a good idea to let BigQuery perform a full table scan when running this query?

-- most databases read about 10 rows of data for this query,
-- while BigQuery will scan (and bill for) the full table
select * from table limit 10

The choice to scan all the data might seem dumb, but it's not when you realize that they charge per GB of data scanned, and it's in their interest to leave optimization gremlins in.

What could Snowflake do better?

As an investor, I expect Snowflake to show amazing profitability and record-breaking revenue numbers. As an engineer, if Snowflake continues on the current path of ignoring performance, I expect them to lose share to the open-source community or some other competitor, eventually walking down the path of Oracle and Teradata. Here are a few things I think they can do to stay relevant in five years.

Disclose Hardware Specs

Snowflake charges you based on your consumption. You're not buying any specific hardware; instead, you pay as you go using virtual warehouse credits that have no hardware definition. For folks with technical or engineering backgrounds, this is a red flag. Whether your query runs on a machine with an SSD or a spinning disk, low or high RAM, a slow or fast CPU, and high or low network bandwidth has a measurable impact on performance. Snowflake is very secretive about its hardware, and when I interacted with its sales team during a migration from Redshift, I could never get any SLAs on query performance, nor any information on hardware specs. This is distinctly different from Redshift, Firebolt, and Databricks, which are far more transparent and give you more flexibility to tune performance through hardware choices. The lack of transparency can also lead to bad incentives, where Snowflake could quietly revert to less optimal machinery under internal pressure to improve its margins.
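To make the abstraction concrete, here is roughly what provisioning looks like (a minimal sketch; the warehouse name and settings are made up). The only knob you get is a T-shirt size that maps to a published credit rate; nothing in the DDL tells you which CPU, memory, disk, or network profile backs it.

-- Hypothetical warehouse definition: the only "spec" is a size that maps
-- to a credit rate, not to any disclosed hardware profile.
create warehouse if not exists analytics_wh
  warehouse_size = 'LARGE'        -- bills 8 credits/hour; hardware undisclosed
  auto_suspend = 60               -- suspend after 60 idle seconds to save credits
  auto_resume = true
  initially_suspended = true;

Compare this with Redshift, where node types such as ra3.4xlarge come with published vCPU, memory, and storage figures.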
Not adopting benchmarks

Several months ago, Databricks published a study highlighting that they outperform Snowflake on the TPC-DS benchmark, to which Snowflake posted a rebuttal. Snowflake's statement on benchmarks was very clear:

"In the same way that we had clarity about many things we wanted to do, we also had conviction about what we didn't want to do. One such thing was engaging in benchmarking wars and making competitive performance claims divorced from real-world experiences. This practice is simply inconsistent with our core value of putting customers first."

This is a very shortsighted statement and contradicts why data warehouses became so popular and what customers actually care about. Data warehouses became popular precisely because they delivered on performance promises: you no longer had to wait days or weeks for reporting to be done. Benchmarks, even though they can be gamed, can be divorced from reality, and can suffer from reproducibility and generalizability problems, are starting points for driving discussions around performance. If I pick a technology that will cost my employer north of $1M/year, I will always insist on seeing some numbers. I hope Snowflake reverses its stance and starts publishing benchmarks as part of its release notes. An industry that will collectively be worth north of $1T desperately needs closure on benchmarking.

Optimizer gremlins

If you work long enough with Snowflake, you'll find several gremlins that lead to vast performance degradation. I'll give you an example that I experienced. Micro-partition pruning is disabled when the predicate comes from a sub-query, even if the sub-query returns a constant. A query like the one below will not benefit from pruning even if you partition (cluster) on created_hour, yet queries like this are all over the place if you build incremental models or use import CTEs.

-- this query will perform a full table scan even if you
-- partition (cluster) on created_hour
select count(*)
from table1
where created_hour > (select max(hour) from table2)

If you use dbt, you should really be mindful of this issue because this query pattern is very common and can represent north of 50% of query costs. I almost never see these types of gremlins in open-source databases (if they exist, they get fixed), yet they are ever present in enterprise data warehouses. This only makes me wonder whether there is a performance optimization team that has a voice within Snowflake.
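Once you know the gremlin is there, the workaround is simple. A minimal sketch, assuming Snowflake session variables and the same hypothetical tables as above: resolve the scalar in its own statement so the filter becomes a literal constant the pruner can use.

-- Resolve the watermark first, so the planner sees a constant...
set max_hour = (select max(hour) from table2);

-- ...then filter on the constant; micro-partition pruning can now kick in.
select count(*)
from table1
where created_hour > $max_hour;

In dbt, the same effect can be achieved by resolving the watermark in a separate statement (for example, via a run_query call in the model's Jinja) instead of inlining the sub-select.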
Improve the workload manager to increase throughput

The query workload manager in Snowflake is inefficient. Let's imagine you have one warehouse and two users. One user submits a compute-intensive query while the other submits a small query that scans very little data. The workload management software in Snowflake seems to allocate all of its resources in a FIFO (first in, first out) pattern. I really hope this is not the case, but after staring dozens of times at query queues, I think it is primarily FIFO. If the large query runs first, you leave the second user hanging in a queue, massively decreasing throughput. A better option would be to pause the big query and reallocate resources to the smaller one, or to split capacity so that 80% goes to the big query and 20% to the small query.

Snowflake recommends two other solutions instead:

* Scale your warehouse out or up, so the second user does not suffer from huge queue delays. This forces more spending.
* Or use the newly announced query accelerator, which allows you to scale on a per-query basis. This again forces more spending if you don't have good data governance in place.

From my days of administering Redshift, I can definitely say that Redshift's workload manager was a lot smarter at preventing massively inefficient queries and users from bringing your system down and forcing you to scale out. In an era where ML is so advanced and we have so much metadata on queries and tables, it's surprising to me that Snowflake has not built something that manages query workloads intelligently. If I can predict that a query will take a while to run based on the number of columns required, the joins present, and the size of the tables involved, why can't a workload manager do that with ML? I imagine this idea has floated around internally at Snowflake, but because it is not a revenue driver, it was not on anyone's roadmap.

Not providing observability to monitor and reduce costs

Snowflake provides very limited tooling to narrow down your biggest cost drivers or to set per-role/per-user query timeouts and budget constraints. For example, I would like to do the following:

* Have a list of users and systems that are responsible for most of my costs. Am I dealing with users that have little knowledge of SQL, perform cross joins, and leave their queries to time out? Do I have ETL jobs that folks forgot existed, yet that capture a large share of my costs? This level of cost attribution is not possible today unless you start tagging queries and build tools on top of Snowflake (a rough version of this attribution is sketched after this list). Out of the box you only know the total cost of a warehouse, which forces you to split workloads into multiple warehouses to get more granular attribution.
* Have the ability to specify per-user/per-role timeouts or constraints. Today this is still limited to a per-warehouse basis, limiting your ability to control costs, since it forces a scale-out/up pattern.
* Have tight integration with PagerDuty/Slack/email to alert users that they are over budget, or to help them understand that their query patterns are not sustainable.

I think Snowflake will likely add better observability, or it will be created through custom tooling; nonetheless, those features don't exist today, and many companies find that their Snowflake bills are unexpectedly large.
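As a stopgap, you can approximate per-user attribution yourself from the ACCOUNT_USAGE views. The sketch below apportions each warehouse's metered credits by each user's share of execution time in that warehouse over the last 30 days; this is a community-style approximation rather than an official cost model, since idle time and concurrency are ignored.

-- Rough per-user cost attribution for the last 30 days. Credits are
-- apportioned by each user's share of execution time per warehouse,
-- which ignores idle time and concurrency, so treat it as an estimate.
with usage as (
    select
        warehouse_name,
        user_name,
        sum(execution_time) / 1000 as exec_seconds
    from snowflake.account_usage.query_history
    where start_time >= dateadd('day', -30, current_timestamp())
    group by 1, 2
),
credits as (
    select
        warehouse_name,
        sum(credits_used) as credits_used
    from snowflake.account_usage.warehouse_metering_history
    where start_time >= dateadd('day', -30, current_timestamp())
    group by 1
)
select
    u.user_name,
    u.warehouse_name,
    round(
        c.credits_used * u.exec_seconds
        / nullif(sum(u.exec_seconds) over (partition by u.warehouse_name), 0),
        2
    ) as estimated_credits
from usage u
join credits c using (warehouse_name)
order by estimated_credits desc
limit 20;

Pair this with query tags (for example, alter session set query_tag = 'etl:some_model', where the tag value is up to you) if you want attribution at the job level rather than the user level.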
What could companies that use Snowflake do better?

Snowflake, like any SaaS or IaaS product, is often expensive because it's a tool that gets misused. Although I think Snowflake owns a share of the blame for performance and scalability, some of the blame lies with the consumer.

Not negotiating

We have 5-6 very good data warehouse alternatives: Redshift, Databricks, Firebolt, BigQuery, and likely a few other enterprise offerings. Yet it is surprising how little training most companies have in negotiating and renegotiating vendor contracts or in pushing for heavily discounted pricing. You should not pay the same rate as your usage increases.

Lack of education

On the topic of misuse, if you have the right monitoring in place on your infrastructure, you'll generally find an interesting pattern: 5% of your internal users drive 95% of your costs. Let me give you some examples:

* Almost every person I know learned SQL on the job and made their fair share of mistakes, including writing queries with cross joins, not filtering on the right predicates, selecting all columns, or pulling massive amounts of data into R and Python because they prefer those languages. Only through a trial-by-fire process do users learn how to work more efficiently. Today I would rather pay $25 and ask users to spend a few days on the online SQL classes from DataCamp, so they learn the craft in a more efficient and scalable way.
* When my previous company used Looker, I noticed that some folks would just keep adding dimensions and measures, on the order of 100+, without reflecting on the number of joins happening in the background. Then they would run a query that would inevitably hit our internal timeouts. This led us to change our Looker policies so that creators of dashboards had to belong to user groups that knew SQL. A BI tool is not a replacement for education unless you're a passive consumer of dashboards.

Generally speaking, finding that 5% of users and simply letting them know that what they are doing is not sustainable can make a meaningful impact on costs. For example, you could set up a Slack bot that periodically alerts users who consume a lot of resources, or simply provide teams with Slack reports on cost attribution. Or, if you have a large organization, you could dedicate a team of 2-3 people to optimize overall infrastructure costs using a Cloud FinOps framework.
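Until better first-party tooling shows up, Snowflake's resource monitors are the closest thing to a built-in budget guardrail. A minimal sketch (the warehouse name and quota are made up) that warns as a warehouse approaches its monthly credit budget and suspends it once the budget is exhausted:

-- Hypothetical monthly budget guardrail for a single warehouse.
create or replace resource monitor analytics_budget
  with credit_quota = 500              -- monthly credit budget (made-up number)
  frequency = monthly
  start_timestamp = immediately
  triggers
    on 80 percent do notify            -- warn account admins before the limit
    on 100 percent do suspend;         -- stop new queries once it is exhausted

alter warehouse analytics_wh set resource_monitor = analytics_budget;

Note that this still operates at the warehouse level rather than per user or role, and the notifications go to account admins rather than to the offending user, which is exactly the observability gap described above; a Slack bot fed by the ACCOUNT_USAGE views remains the practical workaround.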
Should you still use Snowflake?

If you're a small company with a budget of less than $500k/year for warehouse costs, I don't think you have a choice but to use an enterprise tool. The number of features a tool like Snowflake offers, the usability, and the opportunity cost of not using data to drive better product development push you into a corner where you have to go with a buy decision. If your yearly costs are above the $500k mark, it's useful to consider the benefits of an off-ramp.

I expect the ecosystem to change over the next five years, with both big and small companies going open source. We have DuckDB, ClickHouse, Pinot, Trino, Dremio, Druid, Iceberg, Doris, StarRocks, Delta Lake, and likely a few other open-source alternatives that are already used in companies large and small. The challenge is that some of these tools have not yet reached a point of high popularity or ease of developer experience. As things consolidate and the options narrow down, and as more data engineers become familiar with a core set of tools, it will be a lot cheaper to go with open source.

In the meantime, for Snowflake to stay relevant, it needs to ensure that its performance engineering team is completely independent of the teams responsible for revenue drivers, so it can truly reinvent itself.