[HN Gopher] Uses and abuses of cloud data warehouses
___________________________________________________________________
Uses and abuses of cloud data warehouses
Author : Malp
Score : 126 points
Date : 2023-08-16 13:10 UTC (9 hours ago)
(HTM) web link (materialize.com)
(TXT) w3m dump (materialize.com)
| spullara wrote:
| These reasons are why Snowflake is building hybrid tables (under
| the Unistore umbrella). Those tables keep recent data in an
| operational store and historical data in their typical data
| warehouse storage systems. Best of both worlds. Still in private
| preview but definitely the answer to how you build applications
| that need both without using multiple databases and syncing.
|
| https://www.snowflake.com/guides/htap-hybrid-transactional-a...
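|
| For a sense of the programming model, a minimal sketch in Python
| (hypothetical table and connection details; the feature is still
| in preview, so treat the DDL as illustrative):
|
|     import snowflake.connector  # pip install snowflake-connector-python
|
|     # Hypothetical connection parameters.
|     conn = snowflake.connector.connect(
|         account="my_account", user="me", password="...",
|         warehouse="APP_WH", database="APP_DB", schema="PUBLIC",
|     )
|     cur = conn.cursor()
|
|     # Hybrid tables require a primary key; recent rows are served
|     # from a row-oriented operational store, history from the
|     # usual columnar warehouse storage.
|     cur.execute("""
|         CREATE HYBRID TABLE orders (
|             order_id INT PRIMARY KEY,
|             customer_id INT,
|             status VARCHAR,
|             updated_at TIMESTAMP_NTZ
|         )
|     """)
|
|     # Point lookups and updates hit the operational store...
|     cur.execute("UPDATE orders SET status = 'shipped' "
|                 "WHERE order_id = %s", (42,))
|     # ...while analytical scans run against warehouse storage.
|     cur.execute("SELECT status, COUNT(*) FROM orders GROUP BY status")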
| datavirtue wrote:
| Conveniently leaving out the issue of cost. Snowflake is piling
| on features that encourage more compute. Customers abuse the
| system, and they (Snowflake) respond by helping cement them into
| continuing the abuse (spending more) by developing features that
| make bad habits and horrible engineering decisions look like
| something they should be doing. Typical.
| disgruntledphd2 wrote:
| Snowflake are the Oracle of the cloud.
| ed_elliott_asc wrote:
| Oh come on, Snowflake isn't cheap, but there's none of the
| license auditing nonsense.
|
| (Also, isn't Oracle the Oracle of the cloud?)
| tcoff91 wrote:
| Oracle is so far beyond any other major player in
| crookedness that it's not even funny.
|
| Oh, you happened to run your Oracle database in a VM and
| got audited? They'll try to shake you down to pay for
| Oracle licenses for every other box you are running
| hypervisors on, because they claim that you could have
| transferred the database to any of those other boxes. So
| if you have a datacenter with 1000 boxes running VMware,
| and you ran Oracle on one of them, they try to shake you
| down for not buying 1000 licenses. Then they say that if
| you just buy a bunch of cloud credits, they can make your
| 1000x license violation go away.
| mrbungie wrote:
| I remember one time I was working as a Data & Analytics Lead
| (almost a Chief Data Officer, but without the title) at a company
| where I no longer work, and I was "challenged" by our parent
| company's CDO about our data tech stack and operations. Just for
| context, my team at the time was me as the lead and main Data
| Engineer, plus 3 Data Analysts that I was coaching/teaching to
| convert into DEngs/DScientists.
|
| At the time we were mostly a batch data shop, based on Apache
| Airflow + K8S + BigQuery + GCS in Google Cloud Platform, with
| BigQuery + GCS as the central datalake techs for analytics and
| processing. We still had RT capabilities, thanks to some Flink
| processes running in the K8S cluster, plus time-critical (time,
| not latency) processes running in microbatches of minutes for
| NRT. It was pretty cheap and sufficiently reliable, with both
| Airflow and Flink having self-healing capabilities at least at
| the node/process level (and even at the cluster/region level,
| should we need it and be willing to increase the costs), while
| also allowing for changes down the road, like moving out of BQ
| if the costs scaled up too much.
|
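| For concreteness, a minimal sketch of what one of those NRT
| microbatch jobs could look like in Airflow 2.x (the DAG id,
| schedule and load script here are hypothetical, not the actual
| setup):
|
|     from datetime import datetime, timedelta
|     from airflow import DAG
|     from airflow.operators.bash import BashOperator
|
|     with DAG(
|         dag_id="nrt_microbatch",  # hypothetical name
|         start_date=datetime(2021, 1, 1),
|         schedule_interval=timedelta(minutes=5),  # minutes-level NRT
|         catchup=False,
|         # retries give the process-level self-healing mentioned
|         default_args={"retries": 3,
|                       "retry_delay": timedelta(minutes=1)},
|     ) as dag:
|         BashOperator(
|             task_id="load_microbatch",
|             # hypothetical loader processing a 5-minute window
|             bash_command="python load_batch.py --window 5m",
|         )
|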
| What they wanted us to implement was, according to them, industry
| "best practice" circa 2021: a Kafka-based datalake (KSQL and
| co.), at least 4 other engines (Trino, Pinot, Postgres and
| Flink), and an external object storage, with most of the stuff
| running inside Docker containers orchestrated by Ansible across N
| compute instances manually controlled from a bastion instance.
| For some reason, they insisted on having a real-time datalake
| based on Kafka. It was an insane mix of cargo cult, FOMO, high
| operational complexity and low reliability in one package.
|
| I resisted the idea until my last second in that place. I met up
| with some of my team members for drinks months after my
| departure, and they told me the new CDO was already convinced
| that said "RT-based" datalake was the way forward. I still
| shudder every time I remember the architectural diagram, and I
| hope they never followed that terrible advice.
|
| tl;dr: I will never understand the cargo cult around real time
| data and analytics but it is a thing that appeals to both
| decision makers and "data workers". Most businesses and
| operations (especially those whose main focus is not IT by
| itself) won't act or decide in hours, but rather in days. Build
| around your main use case and then make exceptions, not the other
| way around.
| chuckhend wrote:
| I agree that's a great approach - build around the main use
| cases and then make exceptions. I think a lot of companies have
| legitimate use cases for real-time analytics (outside of their
| internal decision making), but, as you mention, they preemptively
| optimize for the aspiration, which leads them towards unnecessary
| tool and tech sprawl. For example, a marketplace application
| that shows you the quantity of an item currently available --
| you as a consumer use that information to make a decision in
| seconds, so it's a great use case. Internally, the org probably
| uses that data for weekly or quarterly forecasting. I've seen
| use cases like that lead to "let's make everything real-time",
| but not every use case benefits the same from real-time.
| kitanata wrote:
| [flagged]
| mritchie712 wrote:
| I caught myself wondering how Google, Microsoft and Amazon let
| Snowflake win. You can argue they haven't won, but let's assume
| they have. Two things:
|
| 1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over $1T.
| Owning Snowflake would be a drop in the bucket for any of them
| (let alone if they were splitting the revenue).
|
| 2. Snowflake runs on AWS, GCP or Azure (customer's choice), so a
| good chunk of their revenue goes back to these services.
|
| Looking at these two points as the CEO of GOOGL, MSFT, or AMZN,
| I'd shrug away Snowflake "beating us". It's crazy that you can
| build a $50B company that your largest competitors barely care
| about.
| riku_iki wrote:
| > 1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over
| $1T. Owning Snowflake would be a drop in the bucket for any of
| them (let alone if they were splitting the revenue).
|
| FAANG can't use their market cap to buy SNOW; they would need to
| pay cash, and $50B is a very large amount for any of these
| companies (it's about Google's annual net income).
|
| Also, SNOW stock is very inflated right now: the company is
| heavily income-negative, revenue is not that high, and the stock
| price rests largely on growth expectations.
| mritchie712 wrote:
| My point is that they wouldn't want to buy it (or have
| focused much on building a competitive product) if it's only
| worth $50B.
| hnthrowaway0328 wrote:
| I agree. The cloud providers are basically the guys who sell
| shovels in a gold rush. Snowflake still needs to build on top of
| the clouds, so MAG never loses. I heard that SNOW is offering its
| own cloud services, but I could be wrong -- and even if I'm
| correct, they have a super long way to go to catch up.
| pm90 wrote:
| What I heard is that AWS got there first with Redshift, but then
| didn't invest as much as users required, so Snowflake found an
| opening and pounced on it.
|
| BigQuery in GCP is a pretty great alternative and I know that
| GCP invests/promotes it heavily, but they were slightly late to
| the market.
| datadrivenangel wrote:
| BigQuery is pretty great. The serverless by default setup
| works very well for most BI use cases. There are some weird
| issues when you're a heavy user and start hitting the
| normally hidden quotas.
| dsaavy wrote:
| There are some ways around the heavy user issues that
| aren't ideal but will work for BI-oriented heavy users.
| hodgesrm wrote:
| This article uses an either/or definition that leaves out a big
| set of use cases that combine operational _and_ analytic usage:
|
| > First, a working definition. An operational tool facilitates
| the day-to-day operation of your business. Think of it in
| contrast to analytical tools that facilitate historical analysis
| of your business to inform longer term resource allocation or
| strategy.
|
| Security information and event management (SIEM) is a typical
| example. You want fast notification on events _combined with_ the
| ability to sift through history extremely quickly to assess
| problems. This is precisely the niche occupied by real-time
| analytic databases like ClickHouse, Druid, and Pinot.
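|
| A minimal sketch of the pattern, assuming a hypothetical
| security_events table and the clickhouse-connect Python client:
|
|     import clickhouse_connect  # pip install clickhouse-connect
|
|     client = clickhouse_connect.get_client(host="localhost")
|
|     # Operational: flag accounts with a burst of failed logins
|     # in the last minute (fast notification).
|     alerts = client.query(
|         "SELECT user, count() AS n FROM security_events "
|         "WHERE event = 'login_failed' "
|         "AND ts > now() - INTERVAL 1 MINUTE "
|         "GROUP BY user HAVING n > 5"
|     )
|
|     # Analytical: sift through months of history for the same
|     # signal, on the same engine.
|     trend = client.query(
|         "SELECT toStartOfDay(ts) AS day, count() AS n "
|         "FROM security_events "
|         "WHERE event = 'login_failed' "
|         "AND ts > now() - INTERVAL 90 DAY "
|         "GROUP BY day ORDER BY day"
|     )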
| dontupvoteme wrote:
| The random bolding of words reeks of adtech.
|
| Is the usage of such an old HTML tag itself now a trigger to send
| something to /dev/null?
| atwong wrote:
| There are other databases today that do real time analytics
| (ClickHouse, Apache Druid, StarRocks along with Apache Pinot).
| I'd look at the ClickHouse Benchmark to see who the competitors
| in that space are and their relative performance.
| slotrans wrote:
| Yeah ClickHouse is definitely the way to go here. Its ability
| to serve queries with low latency and high concurrency is in an
| entirely different league from Snowflake, Redshift, BigQuery,
| etc.
| biggestdummy wrote:
| StarRocks handles latency and concurrency as well as ClickHouse,
| but also does joins. Less denormalization, and you can use the
| same platform for traditional BI/ad-hoc queries.
| riku_iki wrote:
| ClickHouse also does joins.
|
| Somehow StarRocks dudes appear in every relevant post with
| this false claim.
| biggestdummy wrote:
| There's a difference between "supports the syntax for
| joins" and "does joins efficiently enough that they are
| useful."
|
| My experience with ClickHouse is that its joins are not
| performant enough to be useful. So the best practice in most
| cases is to denormalize. I should have been more specific in my
| earlier comment.
| riku_iki wrote:
| Ack: an anonymous user on the internet said he couldn't make
| ClickHouse joins perform well in his case, which he didn't
| describe.
| albert_e wrote:
| Aren't a lot of businesses being sold on "real time analytics"
| these days?
|
| That mixes the use cases of analytics and operations, because
| everyone is led to believe that things that happened in the last
| 10 minutes must go through the analytics lens and yield
| actionable insights in real time, so their operational systems
| can react/adapt instantly.
|
| Most business processes probably don't need anywhere near such
| real-time analytics capability, but it is very easy to think (or
| be convinced) that we do. Especially if I am the owner of a given
| business process (with an IT budget), why wouldn't I want the
| ability to understand trends in real time and react to them, if
| not get ahead of them and predict/be prepared? Anything less than
| that is seen as being shamefully behind on the tech curve.
|
| In this context, the section in the article saying present data
| is of virtually zero importance to analytics is no longer true.
| We need a real solution, even if we apply those (presumably
| complex and costly) solutions to only the most deserving use
| cases (and don't abuse them).
|
| What is the current thinking in this space? I am sure there are
| technical solutions here, but what is the framework for
| evaluating which use cases actually deserve such a setup?
|
| Curious to hear.
| riordan wrote:
| > In this context, the section in the article saying present
| data is of virtually zero importance to analytics is no longer
| true. We need a real solution, even if we apply those (presumably
| complex and costly) solutions to only the most deserving use
| cases (and don't abuse them).
|
| Totally agreed, though real-time data being put through an
| analytics lens is where CDWs start to creak and get costly. In my
| experience, these real-time uses shift the burden from human
| decision-makers to automated decision-making, and it becomes more
| a part of the product. And that's cool, but it gets costly, fast.
|
| It also makes perfect sense to fake-it-til-you-make-it for
| real-time use cases on an existing Cloud Data Warehouse/dbt
| style _modern data stack_ if your data team's already using it
| for the rest of their data platform; after all they already
| know it and it's allowed that team to scale.
|
| But a huge part of the challenge is that once you've made it, the
| alternative for a data-intensive use case is a bespoke
| microservice or a streaming pipeline, often in a language or on a
| platform that's foreign to the existing data team that built the
| thing. If most of your code is dbt SQL and Airflow jobs, working
| with Kafka and streaming Spark is pretty foreign (not to mention
| entirely outside of the observability infrastructure your team
| already has in place). Now we've got rewrites across
| languages/platforms, and teams are left with the cognitive
| overhead of multiple architectures & toolchains (and split
| focus). The alternative would be having a separate team to hand
| off real-time systems to, and that's only if the company can
| afford to have that many engineers. Might as well just allocate
| that spend to your cloud budget and let the existing data team
| run up a crazy bill on Snowflake or BigQuery, as long as it's
| less than the cost of a new engineering team.
|
| ------
|
| There's something incredible about the ruthless efficiency of SQL
| data platforms that allows data teams to scale the number of
| components per engineer. Once you have a Modern-Data-Stack
| system in place, the marginal cost of new pipelines or
| transformations is negligible (and they build atop one another).
| That platform-enabled compounding effect doesn't really occur
| with data-intensive microservices/streaming pipelines, which
| means only the biggest business-critical applications (or
| skunkworks shadow projects) will get the data-intensive
| applications[1] treatment, and business stakeholders will be
| hesitant to greenlight them.
|
| I think Materialize is trying to build that Modern-Data-Stack
| type platform for real-time use cases: one that doesn't come
| with the cognitive cost of a completely separate architecture
| or the divide of completely separate teams and tools. If I
| already had a go-to system in place for streaming data that
| could be prototyped with the data warehouse, then shifted over
| to a streaming platform, the same teams could manage it and
| we'd actually get that cumulative compounding effect. Not to
| mention it becomes a lot easier to then justify using a real-
| time application the next time.
|
| [1]: https://martin.kleppmann.com/2014/10/16/real-time-data-
| produ...
| slotrans wrote:
| Just gonna keep linking this til the heat death of the
| universe: https://mcfunley.com/whom-the-gods-would-destroy-
| they-first-...
|
| Real-time analytics are _worse than useless_. At best they are
| a distracting resource sink, at worst they directly harm the
| quality of decision-making.
| jandrewrogers wrote:
| The term "real-time" is much abused in marketing copy. It is
| often treated like a technical metric but it is actually a
| business metric: am I making operational decisions with the
| most recent data available? For many businesses, "most recent
| data available" can be several days old and little operational
| efficiency would be gained by investing in reducing that
| latency.
|
| For some businesses, "real-time" can be properly defined as
| "within the last week". While there are many businesses where
| reducing that operational tempo to seconds would have an
| impact, it is by no means universal.
| Eumenes wrote:
| From my experience (mostly startups), real time analytics is
| generally overkill, esp. from a BI perspective. Unless your
| business is very focused on real time data and transactional
| processing, you can generally get away with ETL/batch jobs.
| Showing executives, product, and downstream teams some metrics
| that update a few times per day saves a ton of money over things
| like Snowflake/Databricks/Redshift. While cloud services can be
| pricey, tools like dbt are really useful and can be administered
| by savvy business people or analyst types. Those candidates are
| way easier to hire than data engineers, SQL experts, etc.
| lmkg wrote:
| I work as a web analyst (think Google Analytics).
|
| One time I ran an A/B test on the color of a button. After the
| conclusion of the test, with a clear winner in hand, it took
| _eleven months_ for all involved stakeholders to approve the
| change. The website in question got a few thousand visits a
| month and was not critical to any form of business.
|
| This organization does not benefit from real-time analytics.
|
| Now that's an extreme outlier, but my experience is that _most_
| organizations are in that position. The feedback loop from
| collecting data to making a decision is long, and real-time
| analytics shortens a part that's already not the bottleneck.
| The _technical_ part of real-time analytics provides no value
| unless the org also has the _operational_ capacity to use that
| data quickly.
|
| I have seen this! I have, for example, seen a news site that
| looked at web analytics data from the morning and was able to
| publish new opinion pieces that afternoon if something was
| trending. They had a _dedicated process_ built around that data
| pipeline. Critically, they had a _specific_ idea of _what they
| could do_ with that data when they received it.
|
| So if you want a framework, I would start from a single, simple
| question: What can you actually _do_ with real-time data? Name
| one (1) action your organization could take based on that data.
|
| I think it's also useful to separate _what data_ benefits from
| realtime and _which users_ can make use of it. Even if you have
| real-time data, some consumers don't benefit from immediacy.
| coredog64 wrote:
| Generally speaking "What questions do you hope to answer with
| this data?" is a good filter for all kinds of operational
| data.
| iamacyborg wrote:
| Hate to say it, but if your site was only getting a few thousand
| visitors a month, your test was likely vastly underpowered and
| therefore irrelevant anyway.
| mrbungie wrote:
| Power is not just about sample size, but also about the effect
| size (expected, or previously informed by some other evidence).
| You can't draw that conclusion without it.
| iamacyborg wrote:
| For sure, but you'd need one hell of a good CTA to get a
| sufficient effect size to warrant samples that small.
| ozim wrote:
| For me it's mostly that business people don't understand OLAP vs
| OLTP: if they add 5 items to the database and they are visible
| in the system, their "dashboard" will not update instantly, but
| only after the data pipelines run.
|
| That is hard to explain, because if it is not instant everywhere
| they think it is a bug and the system is crappy. Later on they
| will use the dashboard view once a week or once a month, so a
| 5-item update is not relevant at all.
| atwebb wrote:
| Real-time generally means near-real-time and even then I liken
| it to availability.
|
| If asked, people would say "I need to always be up" until they
| see the costs associated with it; then being out for a few hours
| a year tends to be OK.
| datadrivenangel wrote:
| This is a great way of looking at it. The cost starts going up
| rapidly from daily and approaches infinity as you get to
| ultra-low-latency realtime analytics.
|
| There is a minimum cost, though (systems, engineers, etc.), so
| for medium data there's often very little marginal cost up
| until you start getting to hourly refreshes. That stops being
| true for larger datasets.
| higeorge13 wrote:
| I work at a real-time subscription analytics company
| (chartmogul.com). We fetch, normalize and aggregate data from
| various billing systems and eventually visualize it in graphs
| and tables.
|
| I had this discussion with key people, and I would say it depends
| on multiple factors. Small companies really like and require
| real-time analytics: they want to see how a couple of invoices
| translate into updated SaaS metrics, or why they didn't get a
| Slack/email notification as soon as it happened. Larger ones
| will check their data less frequently per day or week, but again
| it depends on the people and their role. Most of them are happy
| with getting their data once per day in their mailboxes or
| warehouses.
|
| But we try to make everyone happy so we aim for real time
| analytics.
| mrbungie wrote:
| I think GP's point is that it's not about the perceived value of
| real-time data/analytics, but rather its actual value. Decision
| makers may ask for RT or NRT, but most of the time they won't
| make a decision or take action in a timeframe that actually
| justifies RT/NRT data/analytics.
|
| For most operations, RT/NRT data is normally about novelty/vanity
| rather than a real existing business need.
| andrenotgiant wrote:
| The article is separating "operational" and "analytical"
| use-cases.
|
| IIUC analytical = "what question are you trying to answer"
| and in analytics, RT/NRT is absolutely novelty/vanity.
| Operational = "what action are you trying to take" and it
| makes sense to want to have up-to-date data when, for
| example, running ML models, triggering notifications,
| etc...
| mrbungie wrote:
| Yeah, totally. I should've specified "analytical operations", as
| in updating dashboards and other non-time-critical data
| processing that eventually feeds into decision making. That's
| where devs or decision makers asking for RT/NRT makes no sense.
| debarshri wrote:
| 15 years ago, when I joined the workforce, business intelligence
| was all the rage. The data world was pretty much straightforward:
| you had transactional data in OLTP databases, which would be
| shipped to operational data stores and then rolled into the data
| warehouse. Data warehouses were actual specialized hardware
| appliances (Netezza et al.), and reporting tools were robust too.
|
| Every time I moved from one org to another, these concepts of the
| data warehouse somehow got muddled.
| andrenotgiant wrote:
| It seems like Snowflake is going all-in on building features and
| doing marketing that encourage their customers to build
| applications, serve operational workloads, etc... on them. Things
| like in-product analytics, usage-based billing, personalization,
| etc...
|
| Anyone here taking them up on it? I'm genuinely curious how it's
| going.
| weego wrote:
| After a series of calls, examples and explanations with them, we
| never managed to get close to a reasonable projection of what
| our monthly costs would be on Snowflake. I understand why
| companies in this field use abstract notions of 'processing'
| /'compute' units, but it's a no-go finance-wise.
|
| Without some close-to-real-world projections, we don't have time
| to consider an implementation to find out for ourselves.
| benjaminwootton wrote:
| Snowflake is one of the easier tools to estimate, because cost is
| a simple function of region, instance size, and uptime. If you
| can simulate some real loads and understand the usage, then you
| do have a shot at forecasting.
|
| Of course the number is going to be high, but you have to
| remember it rolls up compute and requires less manpower. This is
| also a win for finance, if they are comfortable with usage-based
| billing.
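|
| As a back-of-envelope sketch: warehouse sizes consume credits at
| a published doubling rate, but the dollar price per credit varies
| by edition and region, so the number below is an assumption:
|
|     # Rough Snowflake compute cost model -- check your contract.
|     CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}
|     PRICE_PER_CREDIT = 3.00  # USD; assumed, varies by edition/region
|
|     def monthly_cost(size: str, hours_per_day: float,
|                      days: int = 30) -> float:
|         """Estimated monthly compute cost for one warehouse."""
|         credits = CREDITS_PER_HOUR[size] * hours_per_day * days
|         return credits * PRICE_PER_CREDIT
|
|     # e.g. a Medium warehouse up 8 hours a day:
|     print(f"${monthly_cost('M', 8):,.0f}/month")  # -> $2,880/month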
| code_biologist wrote:
| Whose finance team likes usage-based billing? It makes sense for
| elastic use cases and is definitely "fair", but there are a lot
| of issues: forecasting is hard, and there are "dev team had an
| oops" situations.
|
| I had a frog-getting-boiled situation at one job that was exactly
| the process described in the posted article: usage of the cloud
| data warehouse grew as people trusted the infrastructure and
| used more fresh data for more and more use cases. They were all
| good, sane use cases. I repeatedly under-forecast our cost growth
| until we made large changes, and it really frustrated the
| finance people, rightly so.
| GeneralAntilles wrote:
| Yeah, they're providing a path-of-least-resistance for getting
| stuff done in your existing data environment.
|
| A common challenge in a lot of organizations is IT acting as a
| roadblock to deploying internal tools that come from data teams.
| Snowflake is answering this with Streamlit: you get an easy
| platform for data people to use and deploy on, and it can all be
| done within the business firewall, under data governance, inside
| Snowflake.
| politelemon wrote:
| I've noticed that too. I think the marketing is definitely
| working; I'm seeing a few organisations starting to shift more
| and more workloads onto them, and some are also publishing
| datasets on their marketplace.
|
| One of their most interesting upcoming offerings is Snowpark,
| which lets you run a Python function as a UDF within Snowflake.
| This way you don't have to transfer data around everywhere; you
| just run it as part of your normal SQL statements. It's also
| possible to pickle a function and send it over... so conceivably
| one could train a data science model and run it as part of a SQL
| statement. This could get very interesting.
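|
| A minimal sketch of registering such a UDF with the Snowpark
| Python API (connection details and the function itself are
| hypothetical):
|
|     from snowflake.snowpark import Session
|     from snowflake.snowpark.functions import udf
|     from snowflake.snowpark.types import FloatType
|
|     # Hypothetical connection parameters.
|     session = Session.builder.configs({
|         "account": "my_account", "user": "me", "password": "...",
|         "warehouse": "APP_WH", "database": "APP_DB",
|         "schema": "PUBLIC",
|     }).create()
|
|     # Registers a Python UDF that executes inside Snowflake,
|     # next to the data, callable from plain SQL.
|     @udf(name="approx_ltv", return_type=FloatType(),
|          input_types=[FloatType(), FloatType()],
|          replace=True, session=session)
|     def approx_ltv(arpu: float, churn: float) -> float:
|         return arpu / churn if churn else 0.0
|
|     session.sql("SELECT approx_ltv(30.0, 0.05)").show()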
| jamesblonde wrote:
| In theory, fine. Then you look at the walled garden that is
| Snowpark - only "approved" Python libraries are allowed there. It
| will be a very restrictive set of models you can train, and very
| restrictive feature engineering in Python. And, wait, aren't
| Python UDFs super slow (GIL)? What about Pandas UDFs (wait,
| that's PySpark...)?
| Pils wrote:
| Having worked with a team using Snowpark, there are a
| couple things that bother me about it as a platform. For
| example, it only supported Python 3.8 until 3.9/10 recently
| entered preview mode. It feels a bit like a rushed project
| designed to compete with Databricks/Spark at the bullet
| point level, but not quite at the same quality level.
|
| But that's fine! It has only existed for around a year in
| public preview, and appears to be improving quickly. My
| issue was with how aggressively Snowflake sales tried to
| push it as a production-ready ML platform. Whenever I asked
| questions about version control/CI, model versioning/ops,
| package managers, etc. the sales engineers and data
| scientists consistently oversold the product.
| disgruntledphd2 wrote:
| Yeah it's definitely not ready for modelling. It's pretty
| rocking for ETL though, and much easier to test and
| abstract than regular SQL. Granted it's a PySpark clone
| but our data is already in Snowflake.
| noazdad wrote:
| Disclaimer: Snowflake employee here. You can add any Python
| library you want - as long as its dependencies are also
| 100% Python. Takes about a minute: pip install the package,
| zip it up, upload it to an internal Snowflake stage, then
| reference it in the IMPORTS=() directive in your Python. I
| did this with pydicom just the other day - worked a treat.
| So yes, not the depth and breadth of the entire Python
| ecosystem, but 1500+ native packages/versions on the
| Anaconda repo, plus this technique? Hardly a "walled
| garden".
| jamesblonde wrote:
| Good luck trying to install any non-trivial Python library this
| way. And with AI moving so fast, do you think people will accept
| that they can't use the libraries they need, because you haven't
| approved them yet?!?
| lokar wrote:
| Also containers:
|
| https://www.snowflake.com/blog/snowpark-container-
| services-d...
| atwebb wrote:
| > run a Python function as a UDF
|
| Is that a differentiator? I'm unfamiliar with Snowpark's actual
| implementation, but I know SQL Server introduced in-engine
| Python/R around 2016, something like that.
| ramraj07 wrote:
| Snowflake is capturing a large share of the analytics market
| thanks to how it "just works". I'm a massive fan.
|
| But in the end, Snowflake stores the data in S3 as partitions. If
| you want to update a single value, you have to replace the
| entire S3 partition. Similarly, you need to read a reasonable
| amount of S3 data to retrieve even a single record. Thus you're
| never going to get responses shorter than half a second (at
| best). As long as you don't try to game around that limitation,
| it works great.
|
| Materialize, featured here, follows the same model in the end,
| FWIW.
| munchor wrote:
| Disclaimer: I work at SingleStoreDB.
|
| Building a database that can handle both analytics and
| operations is what we've been working on for the past 10+
| years. Our customers use us to build applications with a strong
| analytical component to them (all of the use cases you
| mentioned and many more).
|
| How's it going? It's going really well! And we're working on
| some really cool things that will expand our offering from
| being a pure data storage solution to much more of a
| platform[1].
|
| If you want to learn more about our architecture, we published
| this paper at SIGMOD in late 2022 about it[2].
|
| [1]: https://davidgomes.com/databases-cant-be-just-databases-
| anym...
|
| [2]: https://dl.acm.org/doi/pdf/10.1145/3514221.3526055
| datadrivenangel wrote:
| I assume they're angling for a Salesforce acquisition as they
| move towards being a micro-hosting service like Salesforce.
| rubiquity wrote:
| Snowflake is worth at least 25% of Salesforce, so such an
| acquisition is very unlikely unless Salesforce has $60 billion
| or more burning a hole in their pocket.
___________________________________________________________________
(page generated 2023-08-16 23:01 UTC)