Polars Cloud: The Distributed Cloud Architecture to Run Polars
Anywhere
Author : neilfrndes
Score : 82 points
Date : 2025-03-07 20:57 UTC (2 hours ago)
(HTM) web link (pola.rs)
| LaurensBER wrote:
| This is very impressive and definitely fills a huge hole in the
| whole data frame ecosystem.
|
| I've been quite impressed with the Polars team and after using
| Pandas for years, Polars feels like a much needed fresh wind.
| Very excited to give this a go sometime soon!
| 0cf8612b2e1e wrote:
| I'll bite- what's the pitch vs Dask/Spark/Ray/etc?
|
| I am admittedly a tough sell when the workstation under my desk
| has 192GB of RAM.
| benrutter wrote:
| Doesn't look like benchmarks are there yet, but knowing polars,
| I'd guess performance will be front and centre.
|
| I think the best selling point speaks to your workstation size-
| just start with polars vanilla. It'll work great for ages, and
| if you do need to scale, you can use polars cloud.
|
| That solves what I see as one of the big issues with a lot of
| these types of projects, which is the really poor performance
| at smaller sizes, meaning practically you end up using
| completely different frameworks based on size, which is a big
| hassle if you want to rewrite in one direction.
| __mharrison__ wrote:
| Yeah, you can process 99% of tabular workloads with that. I
| generally advise my clients to work on a single node before
| attempting to scale out.
| film42 wrote:
| I think this will be a hit with the big name audit companies. I
| know some use databricks for pyspark on the M&A side. As deals
| move forward and they get more data, they have to scale up
| their instances which isn't cheap. If polars enables serverless
| compute where you pay by the job, that could be a big win.
|
| And sure, databricks has an idle shutdown feature, but suppose
| it takes ~6 hours to process the deal report, and only the
| first hour needs the scaled up power to compute one table, and
| the rest of the jobs only need 1/10th the mem and cores. Polars
| could save these firms a lot of money.
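|
| A rough back-of-the-envelope sketch of that scenario, assuming
| price scales linearly with instance size and you are billed
| only for what each phase uses. Both are assumptions for
| illustration, not Polars Cloud's actual pricing, and
| `relative_cost` is a hypothetical helper:

```python
def relative_cost(phases):
    """Cost of right-sized phases relative to running every hour at full size.

    phases: list of (hours, fraction_of_full_size) tuples.
    Assumes price is linear in instance size (an illustrative assumption).
    """
    total_hours = sum(hours for hours, _ in phases)
    sized_hours = sum(hours * fraction for hours, fraction in phases)
    return sized_hours / total_hours


# The scenario above: 1 hour at full size, then 5 hours at 1/10th the size.
ratio = relative_cost([(1, 1.0), (5, 0.1)])
print(f"{ratio:.0%} of the always-scaled-up cost")
```

| Under those assumptions the 6-hour job costs about a quarter
| of keeping the large instance up for the whole run.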
| serced wrote:
| May I ask what part in M&A needs this much data processing? I
| am quite familiar with the field but did not yet see such
| tasks.
| tfehring wrote:
| The obvious one is that you can handle bigger workloads than
| you can fit in RAM on a single machine. The more important but
| less obvious one is that it right-sizes the resources needed
| for each workload, so you're not running an 8GB job on an 8TB
| machine, and your manually-allocated 8GB server doesn't OOM
| when that job grows to 10GB next year.
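|
| A minimal sketch of what right-sizing per job could look like.
| The instance sizes, the headroom factor and `pick_instance`
| are all made up for illustration and do not describe how
| Polars Cloud actually picks machines:

```python
# Hypothetical instance memory sizes in GB, not any provider's real catalogue.
INSTANCE_SIZES_GB = [8, 16, 32, 64, 128, 256, 512, 1024, 8192]


def pick_instance(job_gb, headroom=1.25):
    """Return the smallest instance whose memory covers the job plus headroom."""
    needed = job_gb * headroom
    for size in INSTANCE_SIZES_GB:
        if size >= needed:
            return size
    # Nothing fits on one machine: this is where scaling out would come in.
    raise ValueError(f"no single instance fits a {job_gb} GB job")


print(pick_instance(8))   # this year's 8 GB job
print(pick_instance(10))  # the same job after it grows to 10 GB
```

| The size is chosen again for every run, so growth in the job
| changes the machine instead of causing an OOM on a fixed one.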
| orlp wrote:
| Disclaimer: I work for Polars Inc, but my opinions are my own.
|
| If you have a very beefy desktop machine and no giant datasets,
| there isn't a strong reason to use Polars Cloud.
|
| Are you a data scientist running a Polars data pipeline against
| a subsampled dataset in a notebook on your laptop? With just
| changing a couple lines of code you can run that same pipeline
| against your full dataset on a beefy cloud machine which is
| automatically spun up and spun down for you. If you have so
| much data that one machine doesn't cut it, you can start
| running distributed.
|
| In a nutshell, the pitch is very similar to Dask/Ray/Spark,
| except that it's Polars. A lot of our users say that they came
| for the speed but stayed for the API, and with Polars Cloud
| they can use our API and semantics on the cloud. No need to
| translate it to Dask/Ray/Spark.
| __mharrison__ wrote:
| Really excited for the Polars team. I've always been impressed by
| their work and responsiveness to issues I've filed in the past.
| The world is lifted when there is good competition like this.
| TheAlchemist wrote:
| Having switched from Pandas to Polars recently, this is quite
| interesting and I guess performance-wise it will be excellent.
| whalesalad wrote:
| Never understood these kinds of cloud tools that deal with big
| data. You are paying enormous ingress/egress fees to do this.
| tfehring wrote:
| That's almost certainly the main reason they're offering this
| on all 3 major public clouds from day 1.
| tomnipotent wrote:
| > You are paying enormous ingress/egress fees to do this.
|
| It looks like their offering runs on the same cloud provider as
| the client, so no bandwidth fees. Right now it looks to be AWS,
| but mentions Azure/GCP/self-hosted.
| Starlord2048 wrote:
| I can appreciate the pain points you guys are addressing.
|
| The "diagonal scaling" approach seems particularly clever -
| dynamically choosing between horizontal and vertical scaling
| based on the query characteristics rather than forcing users into
| a one-size-fits-all model. Most real-world data workloads have
| mixed requirements, so this flexibility could be a major
| advantage.
|
| I'm curious how the new streaming engine with out-of-core
| processing will compare to Dask, which has been in this space for
| a while but hasn't quite achieved the adoption of pandas/PySpark
| despite its strengths.
|
| The unified API approach also tackles a real issue. The cognitive
| overhead of switching between pandas for local work and PySpark
| for distributed work is higher than most people acknowledge.
| Having a consistent mental model regardless of scale would be a
| productivity boost.
|
| Anyway, I would love to apply for the early access and try it
| out. I'd be particularly interested in seeing benchmark
| comparisons against Ray, Dask, and Spark for different workload
| profiles. Also curious about the pricing model and the cold start
| problem that plagues many distributed systems.
| tfehring wrote:
| This is really cool, not sure how I missed it. I assume catalog
| support will be added fairly quickly. But ironically I think the
| biggest barrier to adoption will be the lack of an off-ramp to a
| FOSS solution that companies can self-host. Obviously Polars
| itself is FOSS, but it understandably seems like there's no way
| to self-host a backend to point a `pc.ComputeContext` to. That
| will be an especially tough selling point for companies that are
| already on Spark. I wonder how much they'll focus on startups vs.
| trying to get bigger companies to switch, and whether they'll try
| a Spark compatibility layer like DataFusion
| (https://github.com/apache/datafusion-comet).
| orlp wrote:
| Disclaimer: I work for Polars Inc, but my opinions are my own.
|
| Polars itself is FOSS and will remain FOSS.
|
| Self-hosted/on-site Polars Cloud is something we intend on
| developing as there is quite a bit of demand, but it is
| unlikely to be FOSS. It most likely will involve licensing of
| some sort. Ultimately we do have to make money, and we intend
| on doing that through Polars Cloud, self-hosted or not (as well
| as other ventures such as offering training, commercial
| support, etc).
| whyho wrote:
| How does this integrate into existing services like aws glue? I
| fear that despite polars being good/better it will lack adoption
| since it cannot easily be integrated.
| th0ma5 wrote:
| I think this is the main problem with this, like H2O offers
| Spark integration as well as their own clustering solution, but
| most people with this problem have their own opinionated and
| bespoke needs.
| melvinroest wrote:
| I just got into data analysis recently (former software engineer)
| and tried out pandas vs polars. I like polars way more because it
| feels like SQL but then sane, and it's faster. It's clear in what
| it tries to do. I didn't really have that with pandas.
| epistasis wrote:
| I've been doing data analysis for decades, and stayed on R for
| a long time because Pandas was so bad.
|
| People complain about R, but compared to the multitude of
| import lice and unergonomic APIs in Pandas, R always felt like
| living in the future.
|
| Polars is a much much more sane API, but expressions are very
| clunky for doing basic computation. Or at least I can't find
| anything less clunky than pl.col("x") or pl.lit(2) where in
| R it's just x or 2.
|
| Still, I'm using Python a ton more now that polars has enough
| steam for others to be able to understand the code.
| minimaxir wrote:
| This may be a hot take, but there is now no reason to ever use
| pandas for new data analysis codebases. Polars is better in
| every way that matters.
| melvinroest wrote:
| Sure, just wanted to give the perspective of a new person
| walking into this field. I'd agree, but I think there are a
| lot of data analysts that have never heard of polars.
|
| Though, I guess they're not on this site :')
| comte7092 wrote:
| It's a bit of a hot take, but not wildly outlandish either.
|
| Pandas supports so many use cases and is still more feature
| rich than polars. But you always have the
| polars.DataFrame.to_pandas() function in your back pocket so
| realistically you can always at least start with polars.
| efxhoy wrote:
| Looks great! Can I run it on my own bare metal cluster? Will I
| need to buy a license?
| marxisttemp wrote:
| What does this project have to do with Serbia? They're based in
| the Netherlands. They must have made a mistake when registering
| their domain name.
___________________________________________________________________
(page generated 2025-03-07 23:00 UTC)