[HN Gopher] Uses and abuses of cloud data warehouses
___________________________________________________________________
Uses and abuses of cloud data warehouses
Author : Malp
Score : 126 points
Date : 2023-08-16 13:10 UTC (9 hours ago)
(HTM) web link (materialize.com)
(TXT) w3m dump (materialize.com)
| spullara wrote:
| These reasons are why Snowflake is building hybrid tables (under
| the Unistore umbrella). Those tables keep recent data in an
| operational store and historical data in their typical data
| warehouse storage systems. Best of both worlds. Still in private
| preview but definitely the answer to how you build applications
| that need both without using multiple databases and syncing.
|
| https://www.snowflake.com/guides/htap-hybrid-transactional-a...
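|
| For a sense of the programming model, a minimal sketch in Python
| (hypothetical table and connection details; the feature is still
| in preview, so treat the DDL as illustrative):
|
|     import snowflake.connector  # pip install snowflake-connector-python
|
|     # Hypothetical connection parameters.
|     conn = snowflake.connector.connect(
|         account="my_account", user="me", password="...",
|         warehouse="APP_WH", database="APP_DB", schema="PUBLIC",
|     )
|     cur = conn.cursor()
|
|     # Hybrid tables require a primary key; recent rows are served
|     # from a row-oriented operational store, history from the
|     # usual columnar warehouse storage.
|     cur.execute("""
|         CREATE HYBRID TABLE orders (
|             order_id INT PRIMARY KEY,
|             customer_id INT,
|             status VARCHAR,
|             updated_at TIMESTAMP_NTZ
|         )
|     """)
|
|     # Point lookups and updates hit the operational store...
|     cur.execute("UPDATE orders SET status = 'shipped' "
|                 "WHERE order_id = %s", (42,))
|     # ...while analytical scans run against warehouse storage.
|     cur.execute("SELECT status, COUNT(*) FROM orders GROUP BY status")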
| datavirtue wrote:
| Conveniently leaving out the issue of cost. Snowflake is piling
| on features that encourage more compute. Customers abuse the
| system, and they (Snowflake) respond by helping cement them into
| continuing the abuse (spending more) by developing features that
| make bad habits and horrible engineering decisions look like
| something they should be doing. Typical.
| disgruntledphd2 wrote:
| Snowflake are the Oracle of the cloud.
| ed_elliott_asc wrote:
| Oh come on, Snowflake isn't cheap, but there's none of the
| license auditing nonsense.
|
| (Also, isn't Oracle the Oracle of the cloud?)
| tcoff91 wrote:
| Oracle is so far beyond any other major player in
| crookedness that it's not even funny.
|
| Oh, you happened to run your Oracle database in a VM and
| got audited? They'll try to shake you down to pay for
| Oracle licenses for every other box you are running
| hypervisors on, because they claim that you could have
| transferred the database to any of those other boxes. So
| if you have a datacenter with 1000 boxes running VMware,
| and you ran Oracle on one of them, they try to shake you
| down for not buying 1000 licenses. Then they say that if
| you just buy a bunch of cloud credits, they can make your
| 1000x license violation go away.
| mrbungie wrote:
| I remember one time I was working as a Data & Analytics Lead
| (almost a Chief Data Officer, but without the title) at a company
| where I no longer work, and I was "challenged" by our parent
| company's CDO about our data tech stack and operations. Just for
| context, my team at the time was me as the lead and main Data
| Engineer, plus 3 Data Analysts that I was coaching/teaching to
| convert into DEngs/DScientists.
|
| At the time we were mostly a batch data shop, based on Apache
| Airflow + K8S + BigQuery + GCS in Google Cloud Platform, with
| BigQuery + GCS as the central datalake techs for analytics and
| processing. We still had RT capabilities, thanks to some Flink
| processes running in the K8S cluster, plus time-critical (time,
| not latency) processes running in microbatches of minutes for
| NRT. It was pretty cheap and sufficiently reliable, with both
| Airflow and Flink having self-healing capabilities at least at
| the node/process level (and even at the cluster/region level,
| should we need it and be willing to increase the costs), while
| also allowing for changes down the road, like moving out of BQ
| if the costs scaled up too much.
|
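| For concreteness, a minimal sketch of what one of those NRT
| microbatch jobs could look like in Airflow 2.x (the DAG id,
| schedule and load script here are hypothetical, not the actual
| setup):
|
|     from datetime import datetime, timedelta
|     from airflow import DAG
|     from airflow.operators.bash import BashOperator
|
|     with DAG(
|         dag_id="nrt_microbatch",  # hypothetical name
|         start_date=datetime(2021, 1, 1),
|         schedule_interval=timedelta(minutes=5),  # minutes-level NRT
|         catchup=False,
|         # retries give the process-level self-healing mentioned
|         default_args={"retries": 3,
|                       "retry_delay": timedelta(minutes=1)},
|     ) as dag:
|         BashOperator(
|             task_id="load_microbatch",
|             # hypothetical loader processing a 5-minute window
|             bash_command="python load_batch.py --window 5m",
|         )
|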
| What they wanted us to implement was, according to them, industry
| "best practice" circa 2021: a Kafka-based datalake (KSQL and
| co.), at least 4 other engines (Trino, Pinot, Postgres and
| Flink), and an external object storage, with most of the stuff
| running inside Docker containers orchestrated by Ansible across N
| compute instances manually controlled from a bastion instance.
| For some reason, they insisted on having a real-time datalake
| based on Kafka. It was an insane mix of cargo cult, FOMO, high
| operational complexity and low reliability in one package.
|
| I resisted the idea until my last second in that place. I met up
| with some of my team members for drinks months after my
| departure, and they told me the new CDO was already convinced
| that said "RT-based" datalake was the way forward. I still
| shudder every time I remember the architectural diagram, and I
| hope they never followed that terrible advice.
|
| tl;dr: I will never understand the cargo cult around real time
| data and analytics but it is a thing that appeals to both
| decision makers and "data workers". Most businesses and
| operations (especially those whose main focus is not IT by
| itself) won't act or decide in hours, but rather in days. Build
| around your main use case and then make exceptions, not the other
| way around.
| chuckhend wrote:
| I agree that's a great approach - build around the main use
| cases and then make exceptions. I think a lot of companies have
| legitimate use cases for real-time analytics (outside of their
| internal decision making), but, as you mention, they preemptively
| optimize for the aspiration, which leads them towards unnecessary
| tool and tech sprawl. For example, a marketplace application
| that shows you the quantity of an item currently available --
| you as a consumer use that information to make a decision in
| seconds, so it's a great use case. Internally, the org probably
| uses that data for weekly or quarterly forecasting. I've seen
| use cases like that lead to "let's make everything real-time",
| but not every use case benefits the same from real-time.
| kitanata wrote:
| [flagged]
| mritchie712 wrote:
| I caught myself wondering how Google, Microsoft and Amazon let
| Snowflake win. You can argue they haven't won, but let's assume
| they have. Two things:
|
| 1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over $1T.
| Owning Snowflake would be a drop in the bucket for any of them
| (let alone if they were splitting the revenue).
|
| 2. Snowflake runs on AWS, GCP or Azure (customer's choice), so a
| good chunk of their revenue goes back to these services.
|
| Looking at these two points as the CEO of GOOGL, MSFT, or AMZN,
| I'd shrug away Snowflake "beating us". It's crazy that you can
| build a $50B company that your largest competitors barely care
| about.
| riku_iki wrote:
| > 1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over
| $1T. Owning Snowflake would be a drop in the bucket for any of
| them (let alone if they were splitting the revenue).
|
| FAANG can't use their market cap to buy SNOW; they would need to
| pay cash, and $50B is a very large amount for any of these
| companies (it's about Google's annual net income).
|
| Also, SNOW stock is very inflated right now: the company is
| heavily income-negative, revenue is not that high, and the stock
| price rests largely on growth expectations.
| mritchie712 wrote:
| My point is that they wouldn't want to buy it (or have
| focused much on building a competitive product) if it's only
| worth $50B.
| hnthrowaway0328 wrote:
| I agree. The cloud providers are basically the guys who sell
| shovels in a gold rush. Snowflake still needs to build on top of
| the clouds, so MAG never loses. I heard that SNOW is offering its
| own cloud services, but I could be wrong -- and even if I'm
| correct, they have a super long way to go to catch up.
| pm90 wrote:
| What I heard is that AWS got there first with Redshift, but then
| didn't invest as much as users required, so Snowflake found an
| opening and pounced on it.
|
| BigQuery in GCP is a pretty great alternative and I know that
| GCP invests/promotes it heavily, but they were slightly late to
| the market.
| datadrivenangel wrote:
| BigQuery is pretty great. The serverless by default setup
| works very well for most BI use cases. There are some weird
| issues when you're a heavy user and start hitting the
| normally hidden quotas.
| dsaavy wrote:
| There are some ways around the heavy user issues that
| aren't ideal but will work for BI-oriented heavy users.
| hodgesrm wrote:
| This article uses an either/or definition that leaves out a big
| set of use cases that combine operational _and_ analytic usage:
|
| > First, a working definition. An operational tool facilitates
| the day-to-day operation of your business. Think of it in
| contrast to analytical tools that facilitate historical analysis
| of your business to inform longer term resource allocation or
| strategy.
|
| Security information and event management (SIEM) is a typical
| example. You want fast notification on events _combined with_ the
| ability to sift through history extremely quickly to assess
| problems. This is precisely the niche occupied by real-time
| analytic databases like ClickHouse, Druid, and Pinot.
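|
| A minimal sketch of the pattern, assuming a hypothetical
| security_events table and the clickhouse-connect Python client:
|
|     import clickhouse_connect  # pip install clickhouse-connect
|
|     client = clickhouse_connect.get_client(host="localhost")
|
|     # Operational: flag accounts with a burst of failed logins
|     # in the last minute (fast notification).
|     alerts = client.query(
|         "SELECT user, count() AS n FROM security_events "
|         "WHERE event = 'login_failed' "
|         "AND ts > now() - INTERVAL 1 MINUTE "
|         "GROUP BY user HAVING n > 5"
|     )
|
|     # Analytical: sift through months of history for the same
|     # signal, on the same engine.
|     trend = client.query(
|         "SELECT toStartOfDay(ts) AS day, count() AS n "
|         "FROM security_events "
|         "WHERE event = 'login_failed' "
|         "AND ts > now() - INTERVAL 90 DAY "
|         "GROUP BY day ORDER BY day"
|     )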
| dontupvoteme wrote:
| The random bolding of words reeks of adtech.
|
| Is the usage of such an old HTML tag itself now a trigger to send
| something to /dev/null?
| atwong wrote:
| There are other databases today that do real time analytics
| (ClickHouse, Apache Druid, StarRocks along with Apache Pinot).
| I'd look at the ClickHouse Benchmark to see who the competitors
| in that space are and their relative performance.
| slotrans wrote:
| Yeah ClickHouse is definitely the way to go here. Its ability
| to serve queries with low latency and high concurrency is in an
| entirely different league from Snowflake, Redshift, BigQuery,
| etc.
| biggestdummy wrote:
| StarRocks handles latency and concurrency as well as ClickHouse,
| but also does joins. Less denormalization, and you can use the
| same platform for traditional BI/ad-hoc queries.
| riku_iki wrote:
| ClickHouse also does joins.
|
| Somehow StarRocks dudes appear in every relevant post with
| this false claim.
| biggestdummy wrote:
| There's a difference between "supports the syntax for
| joins" and "does joins efficiently enough that they are
| useful."
|
| My experience with ClickHouse is that its joins are not
| performant enough to be useful. So the best practice in most
| cases is to denormalize. I should have been more specific in my
| earlier comment.
| riku_iki wrote:
| Ack: an anonymous user on the internet said he couldn't make
| ClickHouse joins perform well in his case, which he didn't
| describe.
| albert_e wrote:
| Aren't a lot of businesses being sold on "real time analytics"
| these days?
|
| That mixes the use cases of analytics and operations, because
| everyone is led to believe that things that happened in the last
| 10 minutes must go through the analytics lens and yield
| actionable insights in real time, so their operational systems
| can react/adapt instantly.
|
| Most business processes probably don't need anywhere near such
| real-time analytics capability, but it is very easy to think (or
| be convinced) that we do. Especially if I am the owner of a given
| business process (with an IT budget), why wouldn't I want the
| ability to understand trends in real time and react to them, if
| not get ahead of them and predict/be prepared? Anything less than
| that is seen as being shamefully behind on the tech curve.
|
| In this context, the section in the article saying present data
| is of virtually zero importance to analytics is no longer true.
| We need a real solution, even if we apply those (presumably
| complex and costly) solutions to only the most deserving use
| cases (and don't abuse them).
|
| What is the current thinking in this space? I am sure there are
| technical solutions here, but what is the framework for
| evaluating which use cases actually deserve such a setup?
|
| Curious to hear.
| riordan wrote:
| > In this context, the section in the article saying present
| data is of virtually zero importance to analytics is no longer
| true. We need a real solution, even if we apply those (presumably
| complex and costly) solutions to only the most deserving use
| cases (and don't abuse them).
|
| Totally agreed, though real-time data being put through an
| analytics lens is where CDWs start to creak and get costly. In my
| experience, these real-time uses shift the burden from human
| decision-makers to automated decision-making, and it becomes more
| a part of the product. And that's cool, but it gets costly, fast.
|
| It also makes perfect sense to fake-it-til-you-make-it for
| real-time use cases on an existing Cloud Data Warehouse/dbt
| style _modern data stack_ if your data team's already using it
| for the rest of their data platform; after all they already
| know it and it's allowed that team to scale.
|
| But a huge part of the challenge is that once you've made it, the
| alternative for a data-intensive use case is a bespoke
| microservice or a streaming pipeline, often in a language or on a
| platform that's foreign to the existing data team that built the
| thing. If most of your code is dbt SQL and Airflow jobs, working
| with Kafka and streaming Spark is pretty foreign (not to mention
| entirely outside of the observability infrastructure your team
| already has in place). Now we've got rewrites across
| languages/platforms, and teams are left with the cognitive
| overhead of multiple architectures & toolchains (and split
| focus). The alternative would be having a separate team to hand
| off real-time systems to, and that's only if the company can
| afford to have that many engineers. Might as well just allocate
| that spend to your cloud budget and let the existing data team
| run up a crazy bill on Snowflake or BigQuery, as long as it's
| less than the cost of a new engineering team.
|
| ------
|
| There's something incredible about the ruthless efficiency of SQL
| data platforms that allows data teams to scale the number of
| components per engineer. Once you have a Modern-Data-Stack
| system in place, the marginal cost of new pipelines or
| transformations is negligible (and they build atop one another).
| That platform-enabled compounding effect doesn't really occur
| with data-intensive microservices/streaming pipelines, which
| means only the biggest business-critical applications (or
| skunkworks shadow projects) will get the data-intensive
| applications[1] treatment, and business stakeholders will be
| hesitant to greenlight them.
|
| I think Materialize is trying to build that Modern-Data-Stack
| type platform for real-time use cases: one that doesn't come
| with the cognitive cost of a completely separate architecture
| or the divide of completely separate teams and tools. If I
| already had a go-to system in place for streaming data that
| could be prototyped with the data warehouse, then shifted over
| to a streaming platform, the same teams could manage it and
| we'd actually get that cumulative compounding effect. Not to
| mention it becomes a lot easier to then justify using a real-
| time application the next time.
|
| [1]: https://martin.kleppmann.com/2014/10/16/real-time-data-
| produ...
| slotrans wrote:
| Just gonna keep linking this til the heat death of the
| universe: https://mcfunley.com/whom-the-gods-would-destroy-
| they-first-...
|
| Real-time analytics are _worse than useless_. At best they are
| a distracting resource sink, at worst they directly harm the
| quality of decision-making.
| jandrewrogers wrote:
| The term "real-time" is much abused in marketing copy. It is
| often treated like a technical metric but it is actually a
| business metric: am I making operational decisions with the
| most recent data available? For many businesses, "most recent
| data available" can be several days old and little operational
| efficiency would be gained by investing in reducing that
| latency.
|
| For some businesses, "real-time" can be properly defined as
| "within the last week". While there are many businesses where
| reducing that operational tempo to seconds would have an
| impact, it is by no means universal.
| Eumenes wrote:
| From my experience (mostly startups), real time analytics is
| generally overkill, esp. from a BI perspective. Unless your
| business is very focused on real time data and transactional
| processing, you can generally get away with ETL/batch jobs.
| Showing executives, product, and downstream teams some metrics
| that update a few times per day saves a ton of money over things
| like Snowflake/Databricks/Redshift. While cloud services can be
| pricey, tools like dbt are really useful and can be administered
| by savvy business people or analyst types. Those candidates are
| way easier to hire than data engineers, SQL experts, etc.
| lmkg wrote:
| I work as a web analyst (think Google Analytics).
|
| One time I ran an A/B test on the color of a button. After the
| conclusion of the test, with a clear winner in hand, it took
| _eleven months_ for all involved stakeholders to approve the
| change. The website in question got a few thousand visits a
| month and was not critical to any form of business.
|
| This organization does not benefit from real-time analytics.
|
| Now that's an extreme outlier, but my experience is that _most_
| organizations are in that position. The feedback loop from
| collecting data to making a decision is long, and real-time
| analytics shortens a part that's already not the bottleneck.
| The _technical_ part of real-time analytics provides no value
| unless the org also has the _operational_ capacity to use that
| data quickly.
|
| I have seen this! I have, for example, seen a news site that
| looked at web analytics data from the morning and was able to
| publish new opinion pieces that afternoon if something was
| trending. They had a _dedicated process_ built around that data
| pipeline. Critically, they had a _specific_ idea of _what they
| could do_ with that data when they received it.
|
| So if you want a framework, I would start from a single, simple
| question: What can you actually _do_ with real-time data? Name
| one (1) action your organization could take based on that data.
|
| I think it's also useful to separate _what data_ benefits from
| realtime and _which users_ can make use of it. Even if you have
| real-time data, some consumers don't benefit from immediacy.
| coredog64 wrote:
| Generally speaking "What questions do you hope to answer with
| this data?" is a good filter for all kinds of operational
| data.
| iamacyborg wrote:
| Hate to say it, but if your site was only getting a few thousand
| visitors a month, your test was likely vastly underpowered and
| therefore irrelevant anyway.
| mrbungie wrote:
| Power is not just about sample size, but also about the effect
| size (expected, or previously informed by some other evidence).
| You can't draw that conclusion without it.
| iamacyborg wrote:
| For sure, but you'd need one hell of a good CTA to get a
| sufficient effect size to warrant samples that small.
| ozim wrote:
| For me it's mostly that business people don't understand OLAP vs
| OLTP: if they add 5 items to the database and they are visible
| in the system, their "dashboard" will not update instantly, but
| only after the data pipelines run.
|
| That is hard to explain, because if it is not instant everywhere
| they think it is a bug and the system is crappy. Later on they
| will use the dashboard view once a week or once a month, so a
| 5-item update is not relevant at all.
| atwebb wrote:
| Real-time generally means near-real-time and even then I liken
| it to availability.
|
| If asked, people would say "I need to always be up" until they
| see the costs associated with it; then being out for a few hours
| a year tends to be OK.
| datadrivenangel wrote:
| This is a great way of looking at it. The cost starts going up
| rapidly from daily and approaches infinity as you get to
| ultra-low-latency realtime analytics.
|
| There is a minimum cost, though (systems, engineers, etc.), so
| for medium data there's often very little marginal cost up
| until you start getting to hourly refreshes. That stops being
| true for larger datasets.
| higeorge13 wrote:
| I work at a real-time subscription analytics company
| (chartmogul.com). We fetch, normalize and aggregate data from
| various billing systems and eventually visualize it in graphs
| and tables.
|
| I had this discussion with key people, and I would say it depends
| on multiple factors. Small companies really like and require
| real-time analytics: they want to see how a couple of invoices
| translate into updated SaaS metrics, or why they didn't get a
| Slack/email notification as soon as it happened. Larger ones
| will check their data less frequently per day or week, but again
| it depends on the people and their role. Most of them are happy
| with getting their data once per day in their mailboxes or
| warehouses.
|
| But we try to make everyone happy so we aim for real time
| analytics.
| mrbungie wrote:
| I think GP's point is that it's not about the perceived value of
| real-time data/analytics, but rather its actual value. Decision
| makers may ask for RT or NRT, but most of the time they won't
| make a decision or take action in a timeframe that actually
| justifies RT/NRT data/analytics.
|
| For most operations, RT/NRT data is normally about novelty/vanity
| rather than a real existing business need.
| andrenotgiant wrote:
| The article is separating "operational" and "analytical"
| use-cases.
|
| IIUC analytical = "what question are you trying to answer"
| and in analytics, RT/NRT is absolutely novelty/vanity.
| Operational = "what action are you trying to take" and it
| makes sense to want to have up-to-date data when, for
| example, running ML models, triggering notifications,
| etc...
| mrbungie wrote:
| Yeah, totally. I should've specified "analytical operations", as
| in updating dashboards and other non-time-critical data
| processing that eventually feeds into decision making. That's
| where devs or decision makers asking for RT/NRT makes no sense.
| debarshri wrote:
| 15 years ago, when I joined the workforce, business intelligence
| was all the rage. The data world was pretty much straightforward:
| you had transactional data in OLTP databases, which would be
| shipped to operational data stores and then rolled into the data
| warehouse. Data warehouses were actual specialized hardware
| appliances (Netezza et al.), and reporting tools were robust too.
|
| Every time I moved from one org to another, these concepts of the
| data warehouse somehow got muddled.
| andrenotgiant wrote:
| It seems like Snowflake is going all-in on building features and
| doing marketing that encourage their customers to build
| applications, serve operational workloads, etc... on them. Things
| like in-product analytics, usage-based billing, personalization,
| etc...
|
| Anyone here taking them up on it? I'm genuinely curious how it's
| going.
| weego wrote:
| After a series of calls, examples and explanations with them, we
| never managed to get close to a reasonable projection of what
| our monthly costs would be on Snowflake. I understand why
| companies in this field use abstract notions of 'processing'
| /'compute' units, but it's a no-go finance-wise.
|
| Without some close-to-real-world projections, we don't have time
| to consider an implementation to find out for ourselves.
| benjaminwootton wrote:
| Snowflake is one of the easier tools to estimate, because cost is
| a simple function of region, instance size, and uptime. If you
| can simulate some real loads and understand the usage, then you
| do have a shot at forecasting.
|
| Of course the number is going to be high, but you have to
| remember it rolls up compute and requires less manpower. This is
| also a win for finance, if they are comfortable with usage-based
| billing.
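|
| As a back-of-envelope sketch: warehouse sizes consume credits at
| a published doubling rate, but the dollar price per credit varies
| by edition and region, so the number below is an assumption:
|
|     # Rough Snowflake compute cost model -- check your contract.
|     CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}
|     PRICE_PER_CREDIT = 3.00  # USD; assumed, varies by edition/region
|
|     def monthly_cost(size: str, hours_per_day: float,
|                      days: int = 30) -> float:
|         """Estimated monthly compute cost for one warehouse."""
|         credits = CREDITS_PER_HOUR[size] * hours_per_day * days
|         return credits * PRICE_PER_CREDIT
|
|     # e.g. a Medium warehouse up 8 hours a day:
|     print(f"${monthly_cost('M', 8):,.0f}/month")  # -> $2,880/month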
| code_biologist wrote:
| Whose finance team likes usage-based billing? It makes sense for
| elastic use cases and is definitely "fair", but there are a lot
| of issues: forecasting is hard, and there are "dev team had an
| oops" situations.
|
| I had a frog-getting-boiled situation at one job that was exactly
| the process described in the posted article: usage of the cloud
| data warehouse grew as people trusted the infrastructure and
| used more fresh data for more and more use cases. They were all
| good, sane use cases. I repeatedly under-forecast our cost growth
| until we made large changes, and it really frustrated the
| finance people, rightly so.
| GeneralAntilles wrote:
| Yeah, they're providing a path-of-least-resistance for getting
| stuff done in your existing data environment.
|
| A common challenge in a lot of organizations is IT acting as a
| roadblock to deploying internal tools that come from data teams.
| Snowflake is answering this with Streamlit: you get an easy
| platform for data people to use and deploy on, and it can all be
| done within the business firewall, under data governance, inside
| Snowflake.
| politelemon wrote:
| I've noticed that too. I think the marketing is definitely
| working; I'm seeing a few organisations starting to shift more
| and more workloads onto them, and some are also publishing
| datasets on their marketplace.
|
| One of their most interesting upcoming offerings is Snowpark,
| which lets you run a Python function as a UDF within Snowflake.
| This way you don't have to transfer data around everywhere; you
| just run it as part of your normal SQL statements. It's also
| possible to pickle a function and send it over... so conceivably
| one could train a data science model and run it as part of a SQL
| statement. This could get very interesting.
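|
| A minimal sketch of registering such a UDF with the Snowpark
| Python API (connection details and the function itself are
| hypothetical):
|
|     from snowflake.snowpark import Session
|     from snowflake.snowpark.functions import udf
|     from snowflake.snowpark.types import FloatType
|
|     # Hypothetical connection parameters.
|     session = Session.builder.configs({
|         "account": "my_account", "user": "me", "password": "...",
|         "warehouse": "APP_WH", "database": "APP_DB",
|         "schema": "PUBLIC",
|     }).create()
|
|     # Registers a Python UDF that executes inside Snowflake,
|     # next to the data, callable from plain SQL.
|     @udf(name="approx_ltv", return_type=FloatType(),
|          input_types=[FloatType(), FloatType()],
|          replace=True, session=session)
|     def approx_ltv(arpu: float, churn: float) -> float:
|         return arpu / churn if churn else 0.0
|
|     session.sql("SELECT approx_ltv(30.0, 0.05)").show()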
| jamesblonde wrote:
| In theory, fine. Then you look at the walled garden that is
| Snowpark - only "approved" Python libraries are allowed there. It
| will be a very restrictive set of models you can train, and very
| restrictive feature engineering in Python. And, wait, aren't
| Python UDFs super slow (GIL)? What about Pandas UDFs (wait,
| that's PySpark...)?
| Pils wrote:
| Having worked with a team using Snowpark, there are a
| couple things that bother me about it as a platform. For
| example, it only supported Python 3.8 until 3.9/10 recently
| entered preview mode. It feels a bit like a rushed project
| designed to compete with Databricks/Spark at the bullet
| point level, but not quite at the same quality level.
|
| But that's fine! It has only existed for around a year in
| public preview, and appears to be improving quickly. My
| issue was with how aggressively Snowflake sales tried to
| push it as a production-ready ML platform. Whenever I asked
| questions about version control/CI, model versioning/ops,
| package managers, etc. the sales engineers and data
| scientists consistently oversold the product.
| disgruntledphd2 wrote:
| Yeah it's definitely not ready for modelling. It's pretty
| rocking for ETL though, and much easier to test and
| abstract than regular SQL. Granted it's a PySpark clone
| but our data is already in Snowflake.
| noazdad wrote:
| Disclaimer: Snowflake employee here. You can add any Python
| library you want - as long as its dependencies are also
| 100% Python. Takes about a minute: pip install the package,
| zip it up, upload it to an internal Snowflake stage, then
| reference it in the IMPORTS=() directive in your Python. I
| did this with pydicom just the other day - worked a treat.
| So yes, not the depth and breadth of the entire Python
| ecosystem, but 1500+ native packages/versions on the
| Anaconda repo, plus this technique? Hardly a "walled
| garden".
| jamesblonde wrote:
| Good luck trying to install any non-trivial Python library this
| way. And with AI moving so fast, do you think people will accept
| that they can't use the libraries they need, because you haven't
| approved them yet?!?
| lokar wrote:
| Also containers:
|
| https://www.snowflake.com/blog/snowpark-container-
| services-d...
| atwebb wrote:
| > run a Python function as a UDF
|
| Is that a differentiator? I'm unfamiliar with Snowpark's actual
| implementation, but I know SQL Server introduced in-engine
| Python/R around 2016, something like that.
| ramraj07 wrote:
| Snowflake is capturing a large share of the analytics market
| thanks to how it "just works". I'm a massive fan.
|
| But in the end, Snowflake stores the data in S3 as partitions. If
| you want to update a single value, you have to replace the
| entire S3 partition. Similarly, you need to read a reasonable
| amount of S3 data to retrieve even a single record. Thus you're
| never going to get responses shorter than half a second (at
| best). As long as you don't try to game around that limitation,
| it works great.
|
| Materialize, featured here, follows the same model in the end,
| FWIW.
| munchor wrote:
| Disclaimer: I work at SingleStoreDB.
|
| Building a database that can handle both analytics and
| operations is what we've been working on for the past 10+
| years. Our customers use us to build applications with a strong
| analytical component to them (all of the use cases you
| mentioned and many more).
|
| How's it going? It's going really well! And we're working on
| some really cool things that will expand our offering from
| being a pure data storage solution to much more of a
| platform[1].
|
| If you want to learn more about our architecture, we published
| this paper at SIGMOD in late 2022 about it[2].
|
| [1]: https://davidgomes.com/databases-cant-be-just-databases-
| anym...
|
| [2]: https://dl.acm.org/doi/pdf/10.1145/3514221.3526055
| datadrivenangel wrote:
| I assume they're angling for a Salesforce acquisition as they
| move towards being a micro-hosting service like Salesforce.
| rubiquity wrote:
| Snowflake is worth at least 25% of Salesforce, so such an
| acquisition is very unlikely unless Salesforce has $60 billion
| or more burning a hole in their pocket.
___________________________________________________________________
(page generated 2023-08-16 23:01 UTC)