[HN Gopher] Polars Cloud: The Distributed Cloud Architecture to ...
       ___________________________________________________________________
        
       Polars Cloud: The Distributed Cloud Architecture to Run Polars
       Anywhere
        
       Author : neilfrndes
       Score  : 82 points
       Date   : 2025-03-07 20:57 UTC (2 hours ago)
        
 (HTM) web link (pola.rs)
 (TXT) w3m dump (pola.rs)
        
       | LaurensBER wrote:
       | This is very impressive and definitely fills a huge hole in the
       | whole data frame ecosystem.
       | 
       | I've been quite impressed with the Polars team and after using
       | Pandas for years, Polars feels like a much needed fresh wind.
       | Very excited to give this a go sometime soon!
        
       | 0cf8612b2e1e wrote:
       | I'll bite- what's the pitch vs Dask/Spark/Ray/etc?
       | 
       | I am admittedly a tough sell when the workstation under my desk
       | has 192GB of RAM.
        
         | benrutter wrote:
         | Doesn't look like benchmarks are there yet, but knowing polars,
         | I'd guess performance will be front and centre.
         | 
         | I think the best selling point speaks to your workstation size-
         | just start with polars vanilla. It'll work great for ages, and
         | if you do need to scale, you can use polars cloud.
         | 
         | That solves what I see as one of the big issues with a lot of
         | these types of projects, which is the really poor performance
         | at smaller sizes, meaning practically you end up using
         | completely different frameworks based on size, which is a big
         | hassle if you want to rewrite in one direction.
        
         | __mharrison__ wrote:
         | Yeah, you can process 99% of tabular workloads with that. I
         | generally advise my clients to work on a single node before
         | attempting to scale out.
        
         | film42 wrote:
         | I think this will be a hit with the big name audit companies. I
         | know some use databricks for pyspark on the M&A side. As deals
         | move forward and they get more data, they have to scale up
         | their instances which isn't cheap. If polars enables serverless
         | compute where you pay by the job, that could be a big win.
         | 
         | And sure, databricks has an idle shutdown feature, but suppose
         | it takes ~6 hours to process the deal report, and only the
         | first hour needs the scaled up power to compute one table, and
         | the rest of the jobs only need 1/10th the mem and cores. Polars
         | could save these firms a lot of money.
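
Editor's sketch of the arithmetic in the comment above. It assumes price scales linearly with provisioned resources and uses a made-up hourly rate (`FULL_RATE`); neither comes from the thread or from any vendor's price list.

```python
# Hypothetical numbers based on the comment above: a 6-hour job where only
# the first hour needs the full-size cluster and the rest needs 1/10th of it.
# Assumption: price scales linearly with provisioned resources.

def fixed_cost(hours: float, rate_per_hour: float) -> float:
    """Cost of keeping the full-size cluster up for the whole job."""
    return hours * rate_per_hour


def per_stage_cost(stages: list[tuple[float, float]], rate_per_hour: float) -> float:
    """Cost when each stage is billed at its own fraction of the full rate.

    Each stage is a (hours, fraction_of_full_size) pair.
    """
    return sum(hours * fraction * rate_per_hour for hours, fraction in stages)


FULL_RATE = 100.0  # hypothetical price per hour for the full-size cluster

always_on = fixed_cost(6, FULL_RATE)
right_sized = per_stage_cost([(1, 1.0), (5, 0.1)], FULL_RATE)
savings = 1 - right_sized / always_on

print(f"always-on: {always_on:.2f}, per-stage: {right_sized:.2f}, savings: {savings:.0%}")
```

Under these assumptions the per-stage bill is a quarter of the always-on bill; real savings would depend on actual pricing and startup overhead.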
        
           | serced wrote:
           | May I ask what part in M&A needs this much data processing? I
           | am quite familiar with the field but did not yet see such
           | tasks.
        
         | tfehring wrote:
         | The obvious one is that you can handle bigger workloads than
         | you can fit in RAM on a single machine. The more important but
         | less obvious one is that it right-sizes the resources needed
         | for each workload, so you're not running an 8GB job on an 8TB
         | machine, and your manually-allocated 8GB server doesn't OOM
         | when that job grows to 10GB next year.
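
Editor's sketch of the right-sizing idea described above: pick the smallest machine that fits each job instead of pinning a fixed size. The instance catalogue and the `pick_instance` helper are hypothetical and are not part of Polars or Polars Cloud.

```python
# Hypothetical instance catalogue (memory in GB); not real cloud offerings.
INSTANCE_SIZES_GB = [8, 16, 32, 64, 128, 256]


def pick_instance(estimated_job_gb: float, headroom: float = 1.0) -> int:
    """Return the smallest instance whose memory covers the job times headroom."""
    needed = estimated_job_gb * headroom
    for size in INSTANCE_SIZES_GB:
        if size >= needed:
            return size
    raise ValueError(f"job needs {needed:.0f} GB, more than any single instance")


print(pick_instance(8))   # this year's 8 GB job
print(pick_instance(10))  # the same job after it grows to 10 GB
```

A manually allocated 8 GB server would run out of memory on the second job; re-selecting per job moves it to the next size up without anyone editing the allocation.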
        
         | orlp wrote:
         | Disclaimer: I work for Polars Inc, but my opinions are my own.
         | 
         | If you have a very beefy desktop machine and no giant datasets,
         | there isn't a strong reason to use Polars Cloud.
         | 
         | Are you a data scientist running a Polars data pipeline against
         | a subsampled dataset in a notebook on your laptop? By changing
         | just a couple of lines of code you can run that same pipeline
         | against your full dataset on a beefy cloud machine which is
         | automatically spun up and spun down for you. If you have so
         | much data that one machine doesn't cut it, you can start
         | running distributed.
         | 
         | In a nutshell, the pitch is very similar to Dask/Ray/Spark,
         | except that it's Polars. A lot of our users say that they came
         | for the speed but stayed for the API, and with Polars Cloud
         | they can use our API and semantics on the cloud. No need to
         | translate it to Dask/Ray/Spark.
        
       | __mharrison__ wrote:
       | Really excited for the Polars team. I've always been impressed by
       | their work and responsiveness to issues I've filed in the past.
       | The world is lifted when there is good competition like this.
        
       | TheAlchemist wrote:
       | Having switched from Pandas to Polars recently, this is quite
       | interesting and I guess performance-wise it will be excellent.
        
       | whalesalad wrote:
       | Never understood these kinds of cloud tools that deal with big
       | data. You are paying enormous ingress/egress fees to do this.
        
         | tfehring wrote:
         | That's almost certainly the main reason they're offering this
         | on all 3 major public clouds from day 1.
        
         | tomnipotent wrote:
         | > You are paying enormous ingress/egress fees to do this.
         | 
         | It looks like their offering runs on the same cloud provider as
         | the client, so no bandwidth fees. Right now it looks to be AWS,
         | but mentions Azure/GCP/self-hosted.
        
       | Starlord2048 wrote:
       | I can appreciate the pain points you guys are addressing.
       | 
       | The "diagonal scaling" approach seems particularly clever -
       | dynamically choosing between horizontal and vertical scaling
       | based on the query characteristics rather than forcing users into
       | a one-size-fits-all model. Most real-world data workloads have
       | mixed requirements, so this flexibility could be a major
       | advantage.
       | 
       | I'm curious how the new streaming engine with out-of-core
       | processing will compare to Dask, which has been in this space for
       | a while but hasn't quite achieved the adoption of pandas/PySpark
       | despite its strengths.
       | 
       | The unified API approach also tackles a real issue. The cognitive
       | overhead of switching between pandas for local work and PySpark
       | for distributed work is higher than most people acknowledge.
       | Having a consistent mental model regardless of scale would be a
       | productivity boost.
       | 
       | Anyway, I would love to apply for the early access and try it
       | out. I'd be particularly interested in seeing benchmark
       | comparisons against Ray, Dask, and Spark for different workload
       | profiles. Also curious about the pricing model and the cold start
       | problem that plagues many distributed systems.
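
Editor's sketch of the "diagonal scaling" idea mentioned in the comment above, as a toy decision rule. The thread does not describe how Polars Cloud actually decides, so the `QueryProfile` fields, the threshold, and the returned labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    estimated_memory_gb: float  # hypothetical estimate of peak memory use
    is_partitionable: bool      # can the work be split across workers?


MAX_SINGLE_NODE_GB = 512  # hypothetical size of the largest single machine


def choose_scaling(query: QueryProfile) -> str:
    """Toy rule: prefer one bigger machine, fan out only when that cannot fit."""
    if query.estimated_memory_gb <= MAX_SINGLE_NODE_GB:
        return "vertical"
    if query.is_partitionable:
        return "horizontal"
    # Too big for one node and not splittable: stay on one node and spill to disk.
    return "vertical-out-of-core"


print(choose_scaling(QueryProfile(64, True)))
print(choose_scaling(QueryProfile(4096, True)))
print(choose_scaling(QueryProfile(4096, False)))
```

The point of the sketch is only that the choice is made per query rather than once per cluster; a real scheduler would weigh far more than two inputs.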
        
       | tfehring wrote:
       | This is really cool, not sure how I missed it. I assume catalog
       | support will be added fairly quickly. But ironically I think the
       | biggest barrier to adoption will be the lack of an off-ramp to a
       | FOSS solution that companies can self-host. Obviously Polars
       | itself is FOSS, but it understandably seems like there's no way
       | to self-host a backend to point a `pc.ComputeContext` to. That
       | will be an especially tough sell for companies that are
       | already on Spark. I wonder how much they'll focus on startups vs.
       | trying to get bigger companies to switch, and whether they'll try
       | a Spark compatibility layer like DataFusion
       | (https://github.com/apache/datafusion-comet).
        
         | orlp wrote:
         | Disclaimer: I work for Polars Inc, but my opinions are my own.
         | 
         | Polars itself is FOSS and will remain FOSS.
         | 
         | Self-hosted/on-site Polars Cloud is something we intend on
         | developing as there is quite a bit of demand, but it is
         | unlikely to be FOSS. It most likely will involve licensing of
         | some sort. Ultimately we do have to make money, and we intend
         | on doing that through Polars Cloud, self-hosted or not (as well
         | as other ventures such as offering training, commercial
         | support, etc).
        
       | whyho wrote:
       | How does this integrate into existing services like aws glue? I
       | fear that despite polars being good/better it will lack adoption
       | since it cannot easily be integrated.
        
         | th0ma5 wrote:
         | I think this is the main problem with this, like H2O offers
         | Spark integration as well as their own clustering solution, but
         | most people with this problem have their own opinionated and
         | bespoke needs.
        
       | melvinroest wrote:
       | I just got into data analysis recently (former software engineer)
       | and tried out pandas vs polars. I like polars way more because it
       | feels like SQL, but sane, and it's faster. It's clear in what
       | it tries to do. I didn't really have that with pandas.
        
         | epistasis wrote:
         | I've been doing data analysis for decades, and stayed on R for
         | a long time because Pandas was so bad.
         | 
         | People complain about R, but compared to the multitude of
         | import lice and unergonomic APIs in Pandas, R always felt like
         | living in the future.
         | 
         | Polars is a much much more sane API, but expressions are very
         | clunky for doing basic computation. Or at least I can't find
         | anything less clunky than pl.col("x") or pl.lit(2) where in
         | R it's just x or 2.
         | 
         | Still, I'm using Python a ton more now that polars has enough
         | steam for others to be able to understand the code.
        
         | minimaxir wrote:
         | This may be a hot take, but there is now no reason to ever use
         | pandas for new data analysis codebases. Polars is better in
         | every way that matters.
        
           | melvinroest wrote:
           | Sure, just wanted to give the perspective of a new person
           | walking into this field. I'd agree, but I think there are a
           | lot of data analysts that have never heard of polars.
           | 
           | Though, I guess they're not on this site :')
        
           | comte7092 wrote:
           | It's a bit of a hot take, but not wildly outlandish either.
           | 
           | Pandas supports so many use cases and is still more feature
           | rich than polars. But you always have the
           | polars.DataFrame.to_pandas() function in your back pocket so
           | realistically you can always at least start with polars.
        
       | efxhoy wrote:
       | Looks great! Can I run it on my own bare metal cluster? Will I
       | need to buy a license?
        
       | marxisttemp wrote:
       | What does this project have to do with Serbia? They're based in
       | the Netherlands. They must have made a mistake when registering
       | their domain name.
        
       ___________________________________________________________________
       (page generated 2025-03-07 23:00 UTC)