[HN Gopher] FireDucks: Pandas but Faster
       ___________________________________________________________________
        
       FireDucks: Pandas but Faster
        
       Author : sebg
       Score  : 343 points
       Date   : 2024-11-14 11:48 UTC (6 days ago)
        
 (HTM) web link (hwisnu.bearblog.dev)
 (TXT) w3m dump (hwisnu.bearblog.dev)
        
       | viraptor wrote:
       | > 100% compatibility with existing Pandas code: check.
       | 
       | Is it actually? Do people see that level of compatibility in
       | practice?
        
         | jeroenhd wrote:
         | I don't think it's exactly 100%: https://fireducks-
         | dev.github.io/docs/user-guide/04-compatibi...
         | 
         | It should be pretty close, though.
        
       | thecleaner wrote:
        | Sure, but this is single-node performance. That makes it not
        | very useful IMO, since quite a few data science folks work
        | with Hadoop clusters or Snowflake clusters or Databricks,
        | where data is distributed and querying is handled by Spark
        | executors.
        
         | chaxor wrote:
         | The comparison is to pandas, so single node performance is
         | understood in the scope. This is for people running small tasks
         | that may only take a couple days on a single node with a 32
         | core CPU or something, not tasks that take 3 months using
         | thousands of cores. My understanding for the latter is that
         | pyspark is a decent option, while ballista is the better option
         | for which to look forward. Perhaps using bastion-rs as a
         | backend can be useful for an upcoming system as well.
         | Databricks et al are cloud trash IMO, as is anything that isn't
         | meant to be run on a local single node system and a local HPC
         | cluster with zero code change and a single line of config
         | change.
         | 
          | While for most of my jobs I ended up being able to avoid
          | HPC by simply being smarter and finding better algorithms
          | to process information, I recall liking pyspark well
          | enough, but preferring the simplicity of ballista over
          | pyspark due to the simpler installation of Rust compared
          | with managing Java and JVM junk. The constant problems
          | caused by anything with a JVM backend, and the environment
          | config that came with it, were terrible to set up on a new
          | system every time I ran a new program.
         | 
          | In this regard, ballista is an enormous improvement. Anything
         | that is a one-line install via pip on any new system, runs
         | local-first without any cloud or telemetry, and requires no
         | change in code to run on a laptop vs HPC is the only option
         | worth even beginning to look into and use.
        
         | Kalanos wrote:
         | Hadoop hasn't been relevant for a long time, which is telling.
         | 
          | Unless I had thousands of files to work with, I would be loath
         | to use cluster computing. There's so much overhead, cost,
         | waiting for nodes to spin up, and cloud architecture nonsense.
         | 
         | My "single node" computer is a refurbished tower server with
         | 256GB RAM and 50 threads.
         | 
         | Most of these distributed computing solutions arose before data
         | processing tools started taking multi-threading seriously.
        
       | E_Bfx wrote:
       | Very impressive, the Python ecosystem is slowly getting very
       | good.
        
         | BiteCode_dev wrote:
         | Spent the last 20 years hearing that.
         | 
         | At some point I think it's more honest to say "the python
         | ecosystem keeps getting more awesome".
        
         | Kalanos wrote:
         | Continues to be the best by far
        
       | i_love_limes wrote:
       | I have never heard of FireDucks! I'm curious if anyone else here
        | has used it. Polars is nice, but it's not totally
        | compatible. It would be interesting to see how much faster
        | it is for more complex calculations.
        
       | bratao wrote:
        | Unfortunately it is not open source yet -
       | https://github.com/fireducks-dev/fireducks/issues/22
        
         | Y_Y wrote:
         | Wouldn't it be nice if GitHub was just for source code and you
        | couldn't just slap up a README that's an ad for some
         | proprietary shitware with a vague promise of source some day in
         | the glorious future?
        
           | thecopy wrote:
           | >proprietary shitware
           | 
           | Is this shitware? It seems to be very high quality code
        
             | yupyupyups wrote:
             | I think the anger comes from the fact that we expect Github
             | repositories to host the actual source code and not be a
             | dead-end with a single README.md file.
        
             | ori_b wrote:
             | How can you tell?
        
               | sbarre wrote:
               | I mean, based on the claims and the benchmarks, it seems
               | to provide massive speedups to a very popular tool.
               | 
               | How would you define "quality" in this context?
        
               | echoangle wrote:
               | High quality code isn't just code that performs well when
               | executed, but also is readable, understandable and
               | maintainable. You can't judge code quality by looking at
               | the compiled result, just because it works well.
        
               | sbarre wrote:
               | That's certainly one opinion about it.
               | 
               | One could also say that quality is related to the
               | functional output.
        
               | echoangle wrote:
               | > One could also say that quality is related to the
               | functional output.
               | 
               | Right, I said nothing that contradicts that ("High
               | quality code isn't _just_ code that performs well when
               | executed, but also ... "). High quality functional output
               | is a necessary requirement, but it isn't sufficient to
               | determine if code is high quality.
        
               | sbarre wrote:
               | Sure, I guess it depends on what matters to you or to
               | your evaluation criteria.
               | 
               | My point was that it's all subjective in the end.
        
               | echoangle wrote:
               | It's not really subjective if you're at all reasonable
               | about it.
               | 
               | Imagine writing a very good program, running it through
               | an obfuscator, and throwing away the original code. Is
               | the obfuscated code "high quality code" now, because the
               | output of the compilation still works as before?
        
               | ori_b wrote:
               | Written so that it's easy to maintain, well tested,
               | correct in its handling of edge cases, easy to debug, and
               | easy to iterate on.
        
           | rad_gruchalski wrote:
           | You'd slap that in a comment then?
        
           | diggan wrote:
           | > Wouldn't it be nice if GitHub was just for source code
           | 
            | GitHub has always been a platform for "we love to host
            | FOSS but we won't be 100% FOSS ourselves", so it makes
            | sense that they allow that kind of usage for others too.
            | 
            | I think what you want is something like Codeberg instead,
            | which is explicitly for FOSS and 100% FOSS itself.
        
         | gus_massa wrote:
         | > _FireDucks is not a open source library at this moment. You
         | can get it installed freely using pip and use under BSD-3
         | license and of course can look into the python part of the
         | source code._
         | 
         | I don't understand what it means. It looks like a
         | contradiction. Does it have a BSD-3 licence or not?
        
           | _flux wrote:
           | They provide BSD-3-licensed Python files but the interesting
           | bit happens in the shared object library, which is only
           | provided in binary form (but is also BSD-3-licensed it seems,
           | so you can distribute it freely).
        
             | joshuaissac wrote:
             | Since it is under the BSD 3 licence, users would also be
             | permitted to decompile and modify the shared object under
             | the licence terms.
        
               | jlokier wrote:
               | Nice insight!
        
           | abcalphabet wrote:
           | From the above link:
           | 
           | > While the wheel packages are available at
           | https://pypi.org/project/fireducks/#files, and while they do
           | contain Python files, most of the magic happens inside a
           | (BSD-3-licensed) shared object library, for which source code
           | is not provided.
        
           | sampo wrote:
           | BSD license gives you the permission to use and to
           | redistribute. In this case you may use and redistribute the
           | binaries.
           | 
           | Edit: To use, redistribute, and modify, and distribute
           | modified versions.
        
             | japhyr wrote:
             | "Redistribution and use in source and binary forms, _with
             | or without modification_ , are permitted provided that the
             | following conditions are met..."
             | 
             | https://opensource.org/license/bsd-3-clause
        
             | GardenLetter27 wrote:
             | Such a crazy distortion of the meaning of the license.
             | 
             | Imagine being like "the project is GPL - just the compiled
             | machine code".
        
               | PittleyDunkin wrote:
               | This is pretty common for binary blobs for where the
               | source code has been lost.
        
       | pplonski86 wrote:
       | How does it compare to Polars?
       | 
       | EDIT: I've found some benchmarks https://fireducks-
       | dev.github.io/docs/benchmarks/
       | 
       | Would be nice to know what are internals of FireDucks
        
       | rich_sasha wrote:
       | It's a bit sad for me. I find the biggest issue for me with
       | pandas is the API, not the speed.
       | 
       | So many foot guns, poorly thought through functions, 10s of
       | keyword arguments instead of good abstractions, 1d and 2d
       | structures being totally different objects (and no higher-order
       | structures). I'd take 50% of the speed for a better API.
       | 
       | I looked at Polars, which looks neat, but seems made for a
       | different purpose (data pipelines rather than building models
       | semi-interactively).
       | 
       | To be clear, this library might be great, it's just a shame for
       | me that there seems no effort to make a Pandas-like thing with
       | better API. Maybe time to roll up my sleeves...
        
         | martinsmit wrote:
         | Check out redframes[1] which provides a dplyr-like syntax and
         | is fully interoperable with pandas.
         | 
         | [1]: https://github.com/maxhumber/redframes
        
           | otsaloma wrote:
           | Building on top of Pandas feels like you're only escaping
           | part of the problems. In addition to the API, the datatypes
           | in Pandas are a mess, with multiple confusing (and none of
           | them good) options for e.g. dates/datetimes. Does redframes
           | do anything there?
        
         | ljosifov wrote:
          | +1 Seconding this. My limited experience with pandas
          | included a non-trivial number of moments of "?? Is it
          | really like this? Nah - I'm mistaken for sure, this cannot
          | be, no one would do something insane like that". And yet,
          | and yet... Fwiw I've found that numpy is a must (ofc), but
          | pandas is mostly optional. So I stick to numpy for my own
          | writing, and keep pandas read-only (just executing someone
          | else's).
        
         | omnicognate wrote:
         | What about the polars API doesn't work well for your use case?
        
           | short_sells_poo wrote:
           | Polars is missing a crucial feature for replacing pandas in
           | Finance: first class timeseries handling. Pandas allows me to
           | easily do algebra on timeseries. I can easily resample data
           | with the resample(...) method, I can reason about the index
           | frequency, I can do algebra between timeseries, etc.
           | 
           | You can do the same with Polars, but you have to start
           | messing about with datetimes and convert the simple problem
           | "I want to calculate a monthly sum anchored on the last
           | business day of the month" to SQL-like operations.
           | 
           | Pandas grew a large and obtuse API because it provides
           | specialized functions for 99% of the tasks one needs to do on
           | timeseries. If I want to calculate an exponential weighted
           | covariance between two time series, I can trivially do this
           | with pandas: series1.ewm(...).cov(series2). I welcome people
           | to try and do this with Polars. It'll be a horrible and
           | barely readable contraption.
           | 
           | YC is mostly populated by technologists, and technologists
           | are often completely ignorant about what makes pandas useful
           | and popular. It was built by quants/scientists, for doing
           | (interactive) research. In this respect it is similar to R,
           | which is not a language well liked by technologists, but it
           | is (surprise) deeply loved by many scientists.
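            | 
            | A minimal runnable sketch of that one-liner (synthetic
            | data; the span=20 parameter is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Two aligned daily time series with synthetic data
idx = pd.date_range("2024-01-01", periods=100, freq="D")
rng = np.random.default_rng(0)
s1 = pd.Series(rng.standard_normal(100), index=idx)
s2 = pd.Series(rng.standard_normal(100), index=idx)

# Exponentially weighted covariance between the two series,
# as the single pandas method chain described above
ewm_cov = s1.ewm(span=20).cov(s2)
```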
        
             | dkga wrote:
             | Exactly the single reason why I use pandas when I need to
             | use python. But coming from R, it still feels like "second
             | best".
        
             | n8henrie wrote:
             | I don't know what exponential weighted covariance is, but
             | I've had pretty good luck converting time series-based
             | analyses from pandas to polars (for patient presentations
             | to my emergency department -- patients per hour, per day,
             | per shift, etc.). Resample has a direct (and easier IMO)
             | replacement in polars, and there is group_by_dynamic.
             | 
             | I've had trouble determining whether one timestamp falls
             | between two others across tens of thousands of rows (with
              | the polars team suggesting I use a massive cross
              | product and filter -- which worked, but ignores the
              | memory requirement), whereas in pandas I was able to
             | timestamps and thereby only need to compare against the
             | preceding / following few based on the index of the last
             | match.
             | 
             | The other issue I've had with resampling is with polars
             | automatically dropping time periods with zero events,
             | giving me a null instead of zero for the count of events in
             | certain time periods (which then gets dropped from
             | aggregations). This has caught me a few times.
             | 
             | But other than that I've had good luck.
        
               | short_sells_poo wrote:
               | I'm curious how is polars group_by_dynamic easier than
               | resample in pandas. In pandas if I want to resample to a
               | monthly frequency anchored to the last business day of
               | the month, I'd write:
               | 
               | > my_df.resample("BME").apply(...)
               | 
               | Done. I don't think it gets any easier than this. Every
               | time I tried something similar with polars, I got bogged
               | down in calendar treatment hell and large and obscure SQL
               | like contraptions.
               | 
               | Edit: original tone was unintentionally combative -
               | apologies.
        
               | cmdlineluser wrote:
               | > cross product and filter
               | 
               | `.join_where()`[1] was also added recently.
               | 
               | [1]: https://docs.pola.rs/api/python/stable/reference/dat
               | aframe/a...
        
             | marcogorelli wrote:
             | Could you show how you write "calculate a monthly sum
             | anchored on the last business day of the month" in pandas
             | please?
        
               | sebg wrote:
               | Not OP.
               | 
               | But I'm guessing it's something like this:
               | 
                | import pandas as pd
                | 
                | def calculate_monthly_business_sum(df, date_column, value_column):
                |     """
                |     Calculate monthly sums anchored to the last
                |     business day of each month
                | 
                |     Parameters:
                |         df: DataFrame with dates and values
                |         date_column: name of date column
                |         value_column: name of value column to sum
                | 
                |     Returns:
                |         DataFrame with sums anchored to last business day
                |     """
                |     # Ensure date column is datetime
                |     df[date_column] = pd.to_datetime(df[date_column])
                | 
                |     # Group by business-month-end and sum
                |     monthly_sum = df.groupby(pd.Grouper(
                |         key=date_column,
                |         freq='BME'  # Business Month End frequency
                |     ))[value_column].sum().reset_index()
                | 
                |     return monthly_sum
                | 
                | # Example usage:
                | df = pd.DataFrame({
                |     'date': ['2024-01-01', '2024-01-31', '2024-02-29'],
                |     'amount': [100, 200, 300]
                | })
                | 
                | result = calculate_monthly_business_sum(df, 'date', 'amount')
                | print(result)
                | 
                | Which you can run here => https://python-
                | fiddle.com/examples/pandas?checkpoint=1732114...
        
               | short_sells_poo wrote:
               | It's actually much simpler than that. Assuming the index
               | of the dataframe DF is composed of timestamps (which is
               | normal for timeseries):
               | 
               | df.resample("BME").sum()
               | 
               | Done. One line of code and it is quite obvious what it is
               | doing - with perhaps the small exception of BME, but if
               | you want max readability you could do:
               | 
               | df.resample(pd.offsets.BusinessMonthEnd()).sum()
               | 
               | This is why people use pandas.
        
               | short_sells_poo wrote:
               | Answered the child comment but let me copy paste here
               | too. It's literally one (short) line:
               | 
               | > df.resample("BME").sum()
               | 
               | Assuming `df` is a dataframe (ie table) indexed by a
               | timestamp index, which is usual for timeseries analysis.
               | 
               | "BME" stands for BusinessMonthEnd, which you can type out
               | if you want the code to be easier to read by someone not
               | familiar with pandas.
        
               | tomrod wrote:
                | A bit from memory as in transit, but something like:
                | 
                |     df.groupby(df[date_col] + pd.offsets.MonthEnd(0))[agg_col].sum()
        
         | sega_sai wrote:
          | Great point that I completely share. I tend to avoid
          | pandas at all costs except for very simple things, as I
          | have been bitten by many issues related to indexing. For
          | anything complicated I tend to switch to duckdb instead.
        
           | bravura wrote:
           | Can you explain your use-case and why DuckDB is better?
           | 
           | Considering switching from pandas and want to understand what
           | is my best bet. I am just processing feature vectors that are
           | too large for memory, and need an initial simple JOIN to
           | aggregate them.
        
             | sega_sai wrote:
             | I am not necessarily saying duckdb is better. I personally
             | just found it easier, clearer to write a sql query for any
             | complicated set of joins/group by processing than to try to
             | do that in pandas.
        
             | rapatel0 wrote:
             | Look into [Ibis](https://ibis-project.org/). It's a
              | dataframe library built on duckdb. It supports lazy
              | execution, larger-than-memory data structures, remote
              | s3 data, and is insanely fast. It also works with
              | basically any backend (postgres, mysql, parquet/csv
              | files, etc.), though there are some implementation
              | gaps in places.
             | 
             | I previously had a pandas+sklearn transformation stack that
             | would take up to 8 hours. Converted it to ibis and it
             | executes in about 4 minutes now and doesn't fill up RAM.
             | 
             | It's not a perfect apples to apples pandas replacement but
             | really a nice layer on top of sql. after learning it, I'm
             | almost as fast as I was on pandas with expressions.
        
               | techwizrd wrote:
               | I made the switch to Ibis a few months ago and have been
               | really enjoying it. It works with all the plotting
               | libraries including seaborn and plotnine. And it makes
                | switching from testing on a CSV to running on
                | SQL/Spark a one-line change. It's just really handy
                | for analysis (similar to the tidyverse).
        
         | amelius wrote:
         | Yes. Pandas turns 10x developers into .1x developers.
        
           | berkes wrote:
           | It does to me. Well, a 1x developer into a .01x dev in my
           | case.
           | 
           | My conclusion was that pandas is not for developers. But for
           | one-offs by managers, data-scientists, scientists, and so on.
           | And maybe for "hackers" who cludge together stuff 'till it
           | works and then hopefully never touch it.
           | 
            | Which made me realize such thoughts can come across as
            | smug, patronizing or belittling. But they do show how
            | software can be optimized for different use-cases.
            | 
            | The danger then lies in not recognizing these use-cases
            | when you pull in something like pandas. "Maybe using
            | pandas to map and reduce the CSVs that our users upload
            | into insert batches isn't a good idea at all."
           | 
           | This is often worsened by the tools/platforms/lib devs or
           | communities not advertising these sweet spots and
           | limitations. Not in the case of Pandas though: that's really
           | clear about this not being a lib or framework for devs, but a
           | tool(kit) to do data analysis with. Kudo's for that.
        
             | analog31 wrote:
             | I'm one of those people myself, and have whittled my Pandas
             | use down to displaying pretty tables in Jupyter. Everything
             | else I do in straight Numpy.
        
               | theLiminator wrote:
               | Imo numpy is not better than pandas for the things you'd
               | use pandas for, though polars is far superior.
        
         | Kalanos wrote:
         | The pandas API makes a lot more sense if you are familiar with
         | numpy.
         | 
         | Writing pandas code is a bit redundant. So what?
         | 
         | Who is to say that fireducks won't make their own API?
        
         | faizshah wrote:
          | Pandas is a commonly known DSL at this point, so lots of
          | data scientists know pandas like the back of their hand,
          | and that's why a lot of "pandas but for X" libraries have
          | become popular.
         | 
         | I agree that pandas does not have the best designed api in
         | comparison to say dplyr but it also has a lot of functionality
         | like pivot, melt, unstack that are often not implemented by
         | other libraries. It's also existed for more than a decade at
         | this point so there's a plethora of resources and stackoverflow
         | questions.
         | 
         | On top of that, these days I just use ChatGPT to generate some
         | of my pandas tasks. ChatGPT and other coding assistants know
         | pandas really well so it's super easy.
         | 
         | But I think if you get to know Pandas after a while you just
         | learn all the weird quirks but gain huge benefits from all the
         | things it can do and all the other libraries you can use with
         | it.
        
           | rich_sasha wrote:
           | I've been living in the shadow of pandas for about a decade
           | now, and the only thing I learned is to avoid using it.
           | 
            | I 100% agree that pandas _addresses_ all the pain points
            | of data analysis in the wild, and this is precisely why
            | it is so popular. My point is, it doesn't address them
            | _well_. It seems like a conglomerate of special cases,
            | written for the specific problems its author was facing,
            | with little concern for consistency, generality or other
            | use cases that might arise.
           | 
           | In my usage, any time saved by its (very useful) methods
           | tends to be lost on fixing subtle bugs introduced by strange
           | pandas behaviours.
           | 
           | In my use cases, I reindex the data using pandas and get it
           | to numpy arrays as soon as I can, and work with those, with a
           | small library of utilities I wrote over the years. I'd gladly
           | use a "sane pandas" instead.
        
             | specproc wrote:
             | Aye, but we've learned it, we've got code bases written in
             | it, many of us are much more data kids than "real devs".
             | 
             | I get it doesn't follow best practices, but it does do what
             | it needs to. Speed has been an issue, and it's exciting
             | seeing that problem being solved.
             | 
             | Interesting to see so many people recently saying "polars
             | looks great, but no way I'll rewrite". This library seems
             | to give a lot of people, myself included, exactly what we
             | want. I look forward to trying it.
        
         | te_chris wrote:
          | Pandas' best feature for me is the df format being
          | readable by duckdb. The filtering API is a nightmare.
        
         | egecant wrote:
          | Completely agree. From the perspective of someone who
          | primarily uses R/tidyverse for data wrangling, there is a
          | great article on why the Pandas API feels clunky:
          | https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-...
        
         | movpasd wrote:
         | I started using Polars for the "rapid iteration" usecase you
         | describe, in notebooks and such, and haven't looked back --
         | there are a few ergonomic wrinkles that I mostly attribute to
         | the newness of the library, but I found that polars forces me
         | to structure my thought process and ask myself "what am I
         | actually trying to do here?".
         | 
         | I find I basically never write myself into a corner with
         | initially expedient but ultimately awkward data structures like
         | I often did with pandas, the expression API makes the semantics
         | a lot clearer, and I don't have to "guess" the API nearly as
         | much.
         | 
         | So even for this usecase, I would recommend trying out polars
         | for anyone reading this and seeing how it feels after the
         | initial learning phase is over.
        
         | h14h wrote:
         | If you wanna try a different API, take a look at Elixir
         | Explorer:
         | 
         | https://hexdocs.pm/explorer/exploring_explorer.html
         | 
          | It runs on top of Polars so you get those speed gains, but
          | uses the Elixir programming language. This gives the
          | benefit of a simple functional syntax w/ pipelines &
          | whatnot.
         | 
         | It also benefits from the excellent Livebook (a Jupyter
         | alternative specific to Elixir) ecosystem, which provides all
         | kinds of benefits.
        
         | paddy_m wrote:
         | Have you tried polars? It's a much more regular syntax. The
         | regular syntax fits well with the lazy execution. It's very
         | composable for programmatically building queries. And then it's
         | super fast
        
           | bionhoward wrote:
           | I found the biggest benefit of polars is ironically the loss
           | of the thing I thought I would miss most, the index; with
           | pandas there are columns, indices, and multi-indices, whereas
           | with polars, everything is a column, it's all the same so you
           | can delete a lot of conditionals.
           | 
            | However, I still find myself using pandas for timestamps,
            | timedeltas, and date offsets. Even then, I need a whole
            | extra column just to hold time zones: since polars maps
            | everything to a UTC storage zone, you lose the original /
            | local TZ, which screws up heterogeneous time zone
            | datasets. (And I learned you really need to enforce
            | careful, thoughtful manual consideration of time zone
            | replacement vs offsetting at the API level.)
           | 
           | Had to write a ton of code to deal with this, I wish polars
           | had explicit separation of local vs storage zones on the
           | Datetime data type
        
             | paddy_m wrote:
              | I think pandas was so ambitious syntax-wise and
              | concept-wise, but it got to be a bit of a jumble. The
              | index idea in particular is so cool, particularly
              | multi-indexes; watching people who really understand it
              | do multi-index operations is very cool.
             | 
             | IMO Polars sets a different goal of what's the most pandas
             | like thing that we can build that is fast (and leaves open
             | the possibility for more optimization), and clean.
             | 
             | Polars feels like you are obviously manipulating an
             | advanced query engine. Pandas feels like manipulating this
             | squishy datastructure that should be super useful and
             | friendly, but sometimes it does something dumb and slow
        
         | stared wrote:
          | Yes, every time I write df[df.sth == val], a tiny part of
          | me dies.
         | 
         | For a comparison, dplyr offers a lot of elegant functionality,
         | and the functional approach in Pandas often feels like an
         | afterthought. If R is cleaner than Python, that says a lot
         | (as a side note: the same story goes for ggplot2 and
         | matplotlib).
         | 
         | Another surprise for friends coming from non-Python backgrounds
         | is the lack of column-level type enforcement. You write
         | df.loc[:, "col1"] and hope it works, with all checks happening
         | at runtime. It would be amazing if Pandas integrated something
         | like Pydantic out of the box.
         | 
         | I still remember when Pandas first came out--it was fantastic
         | to have a tool that replaced hand-rolled data structures using
         | NumPy arrays and column metadata. But that was quite a while
         | ago, and the ecosystem has evolved rapidly since then,
         | including Python's gradual shift toward type checking.
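A minimal sketch of the kind of runtime column check being wished for here (`check_schema` is a hypothetical helper, not a Pandas or Pydantic API):

```python
import pandas as pd

# Hypothetical helper: validate column presence and dtypes at runtime,
# failing loudly instead of hoping df.loc[:, "col1"] works.
def check_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    for col, dtype in schema.items():
        assert col in df.columns, f"missing column: {col}"
        assert str(df[col].dtype) == dtype, (
            f"{col}: expected {dtype}, got {df[col].dtype}"
        )
    return df

df = check_schema(pd.DataFrame({"col1": [1.0, 2.0]}), {"col1": "float64"})
```

Static (editor-time) checking of column names would need changes to Python's type hint system, as noted below; this only moves the failure earlier at runtime.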
        
           | oreilles wrote:
           | > Yes, every time I write df[df.sth == val], a tiny part of me
           | dies.
           | 
           | That's because it's a bad way to use Pandas, even though it
           | is the most popular and often times recommended way. But the
           | thing is, you can just write "safe" immutable Pandas code
           | with method chaining and lambda expressions, resulting in
           | very Polars-like code. For example:
           | 
           |     df = (
           |         pd.read_csv("./file.csv")
           |         .rename(columns={"value": "x"})
           |         .assign(y=lambda d: d["x"] * 2)
           |         .loc[lambda d: d["y"] > 0.5]
           |     )
           | 
           | Plus nowadays with the latest Pandas versions supporting
           | Arrow datatypes, Polars performance improvements over Pandas
           | are considerably less impressive.
           | 
           | Column-level name checking would be awesome, but
           | unfortunately no python library supports that, and it will
           | likely never be possible unless some big changes are made in
           | the Python type hint system.
        
             | OutOfHere wrote:
             | Using `lambda` without care is dangerous because it risks
             | being not vectorized at all. It risks being super slow,
             | operating one row at a time. Is `d` a single row or the
             | entire series or the entire dataframe?
        
               | rogue7 wrote:
               | In this case `d` is the entire dataframe. It's just a way
               | of "piping" the object without having to rename it.
               | 
               | You are probably thinking about `df.apply(lambda row:
               | ..., axis=1)` which operates on each row at a time and is
               | indeed very slow since it's not vectorized. Here this is
               | different and vectorized.
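The difference can be sketched like this (both produce the same values; only the vectorized form operates on whole columns at once):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Vectorized: the lambda is called once with the whole (intermediate)
# DataFrame, so the multiplication runs as a single array operation.
y_vec = df.assign(y=lambda d: d["x"] * 2)["y"]

# Row-at-a-time: the lambda runs once per row -- same result, but much
# slower on large frames because nothing is vectorized.
y_row = df.apply(lambda row: row["x"] * 2, axis=1)

assert y_vec.tolist() == [2, 4, 6]
assert y_row.tolist() == [2, 4, 6]
```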
        
               | OutOfHere wrote:
               | That's excellent.
        
               | almostkorean wrote:
               | Appreciate the explanation, this is something I should
               | know by now but don't
        
             | rogue7 wrote:
             | Agreed 100%. I am using this method-chaining style all the
             | time and it works like a charm.
        
             | wodenokoto wrote:
             | I'm not really sure why you think
             | 
             |     .loc[lambda d: d["y"] > 0.5]
             | 
             | is stylistically superior to
             | 
             |     [df.y > 0.5]
             | 
             | I agree it comes in handy quite often, but that still
             | doesn't make it great to write compared to what sql or
             | dplyr offers in terms of choosing columns to filter on
             | (`where y > 0.5`, for sql and `filter(y > 0.5)`, for dplyr)
        
               | oreilles wrote:
               | It is superior because you don't need to assign your
               | dataframe to a variable ('df'), then update that variable
               | or create a new one every time you need to do that
               | operation. Which means it is both safer (you're
               | guaranteed to filter on the current version of the
               | dataframe) and more concise.
               | 
               | For the rest of your comment: it's the best you can do
               | _in python_. Sure, you could write SQL, but then you're
               | mixing text queries with python data manipulation and I
               | would dread that. And SQL-only scripting is really out
               | of the question.
        
               | chaps wrote:
               | Eh, SQL and python can still work together very well
               | where SQL takes the place of pandas. Doing things in
               | waves/batch helps.
               | 
               | Big problem with pandas is that you still have to load
               | the dataframe into memory to work with it. My data's too
               | big for that and postgres makes that problem go away
               | almost entirely.
        
               | __mharrison__ wrote:
               | It's superior because it is safer. Not because the API
               | (or requirement for using Lambda) looks better. The
               | lambda allows the operation to work on the current state
               | of the dataframe in the chained operation rather than the
               | original dataframe. Alternatively, you could use
               | .query("y > 0.5"). This also works on the current state
               | of the dataframe.
               | 
               | (I'm the first to complain about the many warts in
               | Pandas. Have written multiple books about it. This is
               | annoying, but it is much better than [df.y > 0.5].)
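A small sketch of the equivalence being described (both `.loc` with a lambda and `.query` filter on the current state of the chain, not the original frame):

```python
import pandas as pd

df = pd.DataFrame({"y": [0.2, 0.6, 0.9]})

# y2 only exists inside the chain; both forms can still filter on it.
a = df.assign(y2=lambda d: d["y"] * 2).loc[lambda d: d["y2"] > 0.5]
b = df.assign(y2=lambda d: d["y"] * 2).query("y2 > 0.5")

assert a.equals(b)
```

Filtering with `[df.y2 > 0.5]` here would fail outright, since `y2` never exists on the original `df`.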
        
             | moomin wrote:
             | I mean, yes there's arrow data types, but it's got a long
             | way to go before it's got full parity with the numpy
             | version.
        
           | doctorpangloss wrote:
           | All I want is for the IDE and Python to correctly infer types
           | and column names for all of these array objects. 99% of the
           | pain for me is navigating around SQL return values and CSVs
           | as pieces of text instead of code.
        
         | otsaloma wrote:
         | Agreed, never had a problem with the speed of anything NumPy or
         | Arrow based.
         | 
         | Here's my alternative: https://github.com/otsaloma/dataiter
         | https://dataiter.readthedocs.io/en/latest/_static/comparison...
         | 
         | Planning to switch to NumPy 2.0 strings soon. Other than that I
         | feel all the basic operations are fine and solid.
         | 
         | Note for anyone else rolling up their sleeves: You can get
         | quite far with pure Python when building on top of NumPy (or
         | maybe Arrow). The only thing I found needing more performance
         | was group-by-aggregate, where Numba seems to work OK, although
         | a bit difficult as a dependency.
        
         | adolph wrote:
         | _So many foot guns, poorly thought through functions, 10s of
         | keyword arguments instead of good abstractions_
         | 
         | Yeah, Pandas has that early PHP feel to it, probably out of
         | being a successful first mover.
        
         | wodenokoto wrote:
         | In that case I'd recommend dplyr in R. It also integrates with
         | a better plotting library, GGPlot, which not only gives you
         | better API than matplotlib but also prettier plots (unless you
         | really get to work at your matplot code)
        
         | epistasis wrote:
         | Have you examined siuba at all? It promises to be more similar
         | to the R tidyverse, which IMHO has a _much_ better API. And I
         | personally prefer dplyr/tidyverse to Polars for exploratory
         | analysis.
         | 
         | https://siuba.org
         | 
         | I have not yet used siuba, but would be interested in others'
         | opinions. The activation energy to learn a new set of tools is
         | so large that I rarely have the time to fully examine this
         | space...
        
           | Bootvis wrote:
           | The lack of non standard evaluation still forces you to write
           | `_.` so this might be a better Pandas but not a better
           | tidyverse.
           | 
           | A pity their comparisons don't include tidyverse or R's
           | data.table. I think R would look simpler, but as it stands
           | it remains unclear.
        
           | otsaloma wrote:
           | I think the choice of using functions instead of classes +
           | methods doesn't really fit well into Python. Either you need
           | to do a huge amount of imports or use the awful `from siuba
           | import *`. This feels like shoehorning the dplyr syntax into
           | Python when method chaining would be more natural and would
           | still retain the idea.
           | 
           | Also, having (already a while ago) looked at the
           | implementation of the magic `_` object, it seemed like an
           | awful hack that will serve only a part of use cases. Maybe
           | someone can correct me if I'm wrong, but I get the impression
           | you can do e.g. `summarize(x=_.x.mean())` but not
           | `summarize(x=median(_.x))`. I'm guessing you don't get
           | autocompletion in your editor or useful error messages and it
           | can then get painful using this kind of a magic.
        
         | kussenverboten wrote:
         | Agree with this. My favorite syntax is the elegance of
         | data.table API in R. This should be possible in Python too
         | someday.
        
         | nathan_compton wrote:
         | Yeah. Pandas is the worst. Polars is better in some ways but so
         | verbose!
        
         | fluorinerocket wrote:
         | Thank you. I don't know why people think it's so amazing. I
         | end up sometimes just extracting the numpy arrays from the
         | data frame and doing things like I know how to, because the
         | Pandas way is so difficult.
        
       | omnicognate wrote:
       | > Then came along Polars (written in Rust, btw!) which shook the
       | ground of Python ecosystem due to its speed and efficiency
       | 
       | Polars rocked my world by having a sane API, not by being fast. I
       | can see the value in this approach if, like the author, you have
       | a large amount of pandas code you don't want to rewrite, but
       | personally I'm extremely glad to be leaving the pandas API
       | behind.
        
         | ralegh wrote:
         | I personally found the polars API much clunkier, especially for
         | rapid prototyping. I use it only for cemented processes where I
         | could do with speed up/memory reduction.
         | 
         | Is there anything specific you prefer moving from the pandas
         | API to polars?
        
           | benrutter wrote:
           | Not OP but the ability to natively implement complex groupby
           | logic is a huge plus for me at least.
           | 
           | Say you want to take an aggregation like "the mean of all
           | values over the 75th percentile" alongside a few other
           | aggregations. In pandas, this means you're gonna be in for a
           | bunch of hoops and messing around with stuff because you
           | can't express it via the api. Polars' api lets you express
           | this directly without having to implement any kind of
           | workaround.
           | 
           | Nice article on it here:
           | https://labs.quansight.org/blog/dataframe-group-by
        
       | adrian17 wrote:
       | Any explanation what makes it faster than pandas and polars would
       | be nice (at least something more concrete than "leverage the C
       | engine").
       | 
       | My easy guess is that compared to pandas, it's multi-threaded by
       | default, which makes for an easy perf win. But even then,
       | 130-200x feels extreme for a simple sum/mean benchmark. I see
       | they are also doing lazy evaluation and some MLIR/LLVM based JIT
       | work, which is probably enough to get an edge over polars; though
       | its wins over DuckDB _and_ Clickhouse are also surprising out of
       | nowhere.
       | 
       | Also, I thought one of the reasons for Polars's API was that
       | Pandas API is way harder to retrofit lazy evaluation to, so I'm
       | curious how they did that.
        
       | ayhanfuat wrote:
       | In essence, it is a commercial product which has a free trial.
       | 
       | > Future Plans By providing the beta version of FireDucks free of
       | charge and enabling data scientists to actually use it, NEC will
       | work to improve its functionality while verifying its
       | effectiveness, with the aim of commercializing it within FY2024.
        
         | graemep wrote:
         | It's BSD licensed. They do not say what the plans are, but
         | most likely a proprietary version with added support or
         | features.
        
           | ayhanfuat wrote:
           | They say the source code for the part "where the magic
           | happens" is not available so I am not sure what BSD implies
           | there.
        
             | HelloNurse wrote:
             | It serves as a ninja's smoke bomb until the "BSD" binary
             | blob is suddenly obsoleted by a proprietary binary blob.
        
           | ori_b wrote:
           | It's a BSD licensed binary blob. There's no code provided.
        
             | graemep wrote:
             | Wow! That is so weird.
             | 
             | It's freeware under an open source license. Really
             | misleading.
             | 
             | It looks like something you should stay away from unless
             | you need it REALLY badly. It's a proprietary product with
             | unknown pricing and no indication of what their plans
             | are.
             | 
             | Does the fact that the binary is BSD licensed allow
             | reverse-engineering?
        
               | captn3m0 wrote:
               | > Redistribution and use in source and binary forms, with
               | or without modification, are permitted
               | 
               | Reversing and re-compiling should count as modification?
        
       | imranq wrote:
       | This presentation does a good job distilling why FireDucks is so
       | fast:
       | 
       | https://fireducks-dev.github.io/files/20241003_PyConZA.pdf
       | 
       | The main reasons are
       | 
       | * multithreading
       | 
       | * rewriting base pandas functions like dropna in c++
       | 
       | * in-built compiler to remove unused code
       | 
       | Pretty impressive, especially given that you just import
       | fireducks.pandas as pd instead of import pandas as pd, and you
       | are good to go.
       | 
       | However I think if you are using a pandas function that wasn't
       | rewritten, you might not see the speedups
        
         | faizshah wrote:
         | It's not clear to me why this would be faster than polars,
         | duckdb, vaex or clickhouse. They seem to be taking the same
         | approach of multithreading, optimizing the plan, using arrow,
         | optimizing the core functions like group by.
        
           | mettamage wrote:
           | Maybe it isn't? Maybe they just want a fast pandas api?
        
             | geysersam wrote:
             | According to their benchmarks they are faster. Not by a
             | lot, but still significantly.
        
           | maleldil wrote:
           | None of those drop-in replacements for Pandas. The main draw
           | is "faster without changing your code".
        
             | faizshah wrote:
             | I'm asking more about what techniques did they use to get
             | the performance improvements in the slides.
             | 
             | They are showing a 20-30% improvement over Polars,
             | Clickhouse and Duckdb. But those 3 tools are SOTA in this
             | area and generally rank near each other in every benchmark.
             | 
             | So 20-30% improvement over that cluster makes me interested
             | to know what techniques they are using to achieve that over
             | their peers.
        
       | Kalanos wrote:
       | Linux only right now:
       | https://github.com/fireducks-dev/fireducks/issues/27
        
       | Kalanos wrote:
       | Regarding compatibility, fireducks appears to be using the same
       | column dtypes:
       | 
       | ```
       | 
       | >>> df['year'].dtype == np.dtype('int32')
       | 
       | True
       | 
       | ```
        
       | DonHopkins wrote:
       | FireDucks FAQ:
       | 
       | Q: Why do ducks have big flat feet?
       | 
       | A: So they can stomp out forest fires.
       | 
       | Q: Why do elephants have big flat feet?
       | 
       | A: So they can stomp out flaming ducks.
        
       | short_sells_poo wrote:
       | Looks very cool, BUT: it's closed source? That's an immediate
       | deal breaker for me as a quant. I'm happy to pay for my tools,
       | but not being able to look and modify the source code of a
       | crucial library like this makes it a non-starter.
        
       | KameltoeLLM wrote:
       | Shouldn't that be FirePandas then?
        
       | benrutter wrote:
       | Anyone here tried using FireDucks?
       | 
       | The promise of a 100x speedup with 0 changes to your codebase is
       | pretty huge, but even a few correctness / incompatibility issues
       | would probably make it a no-go for a bunch of potential users.
        
       | safgasCVS wrote:
       | I'm sad that R's tidy syntax is not copied more widely in the
       | python world. Dplyr is incredibly intuitive: most don't ever
       | bother reading the instructions; you can look at a handful of
       | examples and you've got the gist of it. Polars, despite its
       | speed, is still verbose and inconsistent, while pandas is
       | seemingly a collection of random spells.
        
       | softwaredoug wrote:
       | The biggest advantage of pandas is its extensibility. If you care
       | about that, it's (relatively) easy to add your own extension
       | array type.
       | 
       | I haven't seen that in other system like Polars, but maybe I'm
       | wrong.
        
       | ssivark wrote:
       | Setting aside complaints about the Pandas API, it's frustrating
       | that we might see the community of a popular "standard" tool
       | fragment into two or _even three_ ecosystems (for libraries with
       | slightly incompatible APIs) -- seemingly all with the value
       | proposition of  "making it faster". Based on the machine learning
       | experience over the last decade, this kind of churn in tooling is
       | somewhat exhausting.
       | 
       | I wonder how much of this is fundamental to the common approach
       | of writing libraries in Python with the processing-heavy parts
       | delegated to C/C++ -- that the expressive parts cannot be fast
       | and the fast parts cannot be expressive. Also, whether Rust (for
       | polars, and other newer generation of libraries) changes this
       | tradeoff substantially enough.
        
         | tgtweak wrote:
         | I think it's a natural path of software life that compatibility
         | often stands in the way of improving the API.
         | 
         | This really does seem like a rare thing: everything speeds
         | up without breaking compatibility. If you want a fast revised
         | API for your new project (or to rework your existing one) then
         | you have a solution for that with Polars. If you just want your
         | existing code/workloads to work faster, you have a solution for
         | that now.
         | 
         | It's OK to have a slow, compatible, static codebase to build
         | things on then optimize as-needed.
         | 
         | Trying to "fix" the api would break a ton of existing code,
         | including existing plugins. Orphaning those projects and
         | codebases would be the wrong move, those things take a decade
         | to flesh out.
         | 
         | This really doesn't seem like the worst outcome, and doesn't
         | seem to be creating a huge fragmented mess.
        
         | SiempreViernes wrote:
         | > Based on the machine learning experience over the last
         | decade, this kind of churn in tooling is somewhat exhausting.
         | 
         | Don't come to old web-devs with those complaints; every
         | single one of them had to write at least one open source
         | javascript library just to create their linkedin account!
        
       | cmcconomy wrote:
       | Every time I see a new better pandas, I check to see if it has
       | geopandas compatibility
        
       | PhasmaFelis wrote:
       | "FireDucks: Pandas but Faster" sounds like it's about something
       | much more interesting than a Python library. I'd like to read
       | that article.
        
       | dkga wrote:
       | Reading all pandas vs polars reminded me of the tidyverse vs
       | data.table discussion some 10 years ago.
        
       | flakiness wrote:
       | > FireDucks is released on pypi.org under the 3-Clause BSD
       | License (the Modified BSD License).
       | 
       | Where can I find the code? I don't see it on GitHub.
       | 
       | > contact@fireducks.jp.nec.com
       | 
       | So it's from NEC (a major Japanese computer company), presumably
       | a research artifact?
       | 
       | > https://fireducks-dev.github.io/docs/about-us/ Looks like so.
        
       | xbar wrote:
       | Great work, but I will hold my adoption until c++ source is
       | available.
        
       | uptownfunk wrote:
       | If they could just make a dplyr for py it would be so awesome.
       | But sadly I don't think the python language semantics will
       | support such a tool. It all comes down to managing the namespace
       | I guess
        
       | gigatexal wrote:
       | On average only 1.5x faster than polars. That's kinda crazy.
        
         | geysersam wrote:
         | Why is that crazy? (I think the crazy thing is that they are
         | faster at all. Taking an existing api and making it fast is
         | harder than creating the api from scratch with performance in
         | mind)
        
       | OutOfHere wrote:
       | Don't use it:
       | 
       | > By providing the beta version of FireDucks free of charge and
       | enabling data scientists to actually use it, NEC will work to
       | improve its functionality while verifying its effectiveness, with
       | the aim of commercializing it within FY2024.
       | 
       | In other words, it's free only to trap you.
        
         | ladyanita22 wrote:
         | Important to upvote this. If there's room for improvement for
         | Polars (which I'm sure there is), go and support the project.
         | But don't fall for a commercial trap when there are competent
         | open source tools available.
        
           | maleldil wrote:
           | While I agree, it's worth noting that this project is a drop-
           | in replacement (they claim that, at least), but Polars has a
           | very different API. I much prefer Polars's API, but it's
           | still a non-trivial cost to switch to it, which is why
           | many people would rather explore Pandas alternatives.
        
           | binoct wrote:
           | No shade to the juggernaut of the open source software
           | movement and everything it has/will enabled, but why the hate
           | for a project that required people's time and knowledge to
           | create something useful to a segment of users and then expect
           | to charge for using it in the future? Commercial trap seems
           | to imply this is some sort of evil machination but it seems
           | like they are being quite upfront with that language.
        
             | floatrock wrote:
             | It's not hate for the project, it's hate for the deceptive
             | rollout.
             | 
             | Basically it's a debate about how many dark patterns can
             | you squeeze next to that "upfront language" before
             | "marketing" slides into "bait-n-switch."
        
             | papichulo2023 wrote:
             | Not sure if it's evil or not, but it is unprofessional to
             | use a tool when you don't know how much it will cost your
             | company in the future.
        
         | tombert wrote:
         | Thanks for the warning.
         | 
         | I nearly made the mistake of merging Akka into a codebase
         | recently; fortunately I double-checked the license and noticed
         | it was the bullshit BUSL and it would have potentially cost my
         | employer tens of thousands of dollars a year [1]. I ended up
         | switching everything to Vert.x, but I really hate how
         | normalized these ostensibly open source projects are sneaking
         | scary expensive licenses into things now.
         | 
         | [1] Yes I'm aware of Pekko now, and my stuff probably would
         | have worked with Pekko, but I didn't really want to deal with
         | something that by design is 3 years out of date.
        
           | cogman10 wrote:
           | IMO, you made a good decision ditching akka. We have an akka
           | app before the BUSL and it is a PITA to maintain.
           | 
           | Vert.x and other frameworks are far better and easier for
           | most devs to grok.
        
             | tombert wrote:
             | Yeah, Vert.x actually ended up being pretty great. I feel
             | like it gives me most of the cool features of Akka that I
             | actually care about, but it allows you to gradually move
             | into it; it _can_ be a full-on framework, but it can also
             | just be a decent library to handle concurrency.
             | 
             | Plus the license isn't stupid.
        
             | switchbak wrote:
             | > We have an akka app before the BUSL and it is a PITA to
             | maintain
             | 
             | I would imagine the non-Scala use case to be less than
             | ideal.
             | 
             | In Scala land, Pekko - the open source fork of Akka is the
             | way to go if you need compatibility. Personally, I'd avoid
             | new versions of Akka like the plague, and just use more
             | modern alternatives to Pekko/Akka anyway.
             | 
             | I'm not sure what Lightbend's target market is? Maybe they
             | think they have enough critical mass to merit the price tag
             | for companies like Sony/Netflix/Lyft, etc. But they've
             | burnt their bridge right into the water with everyone else,
             | so I see them fading into irrelevance over the next few
             | years.
        
               | tombert wrote:
               | I actually do have some decision-making power in regards
               | to what tech I use for my job [1] at a mid-size (by tech
               | standards) company, and my initial plan was to use Akka
               | for the thing I was working on, since it more or less fit
               | into the actor model perfectly.
               | 
               | I'm sure that Lightbend feels that their support contract
               | is the bee's knees and worth whatever they charge for it,
               | but it's a complete non-starter for me, and so I look
               | elsewhere.
               | 
               | Vert.x's actor-ish model is a bit different, but it's
               | not _that_ different, and considering that Vert.x tends
               | to perform extremely well in benchmarks, it doesn't
               | really feel like I'm _losing_ a lot by using it instead
               | of Akka, particularly since I'm not using Akka Streams.
               | 
               | [1] Normal disclaimer: I don't hide my employment
               | history, and it's not hard to find, but I politely ask
               | that you do not post it here.
        
             | wmfiv wrote:
             | I've found actors (Akka specifically) to be a great model
             | when you have concurrent access to fine grained shared
             | state. It provides such a simple mental model of how to
             | serialize that access. I'm not a fan as a general
             | programming model or even as a general purpose concurrent
             | programming model.
        
               | tombert wrote:
               | Vert.x has the "Verticle" abstraction, which more or less
               | corresponds to something like an Actor. It's close enough
               | to where I don't feel like I'm missing much by using it
               | instead of Akka.
        
               | Weryj wrote:
               | What are your criticisms of actors as a general purpose
               | concurrent programming model?
        
         | mushufasa wrote:
         | If it's good, then why not just fork it when (if) the license
         | changes? It is 3-clause BSD.
         | 
         | In fact, what's stopping the pandas library from incorporating
         | fireducks code into the mainline branch? pandas itself is BSD.
        
           | nicce wrote:
           | There is no code. The binary blob is licensed.
        
         | BostonEnginerd wrote:
         | I thought I saw on the documentation that it was released under
         | the modified BSD license. I guess they could take future
         | versions closed source, but the current version should be
         | available for folks to use and further develop.
        
           | OutOfHere wrote:
           | It's just the binary that's BSD, not the source code. The
           | source code is unavailable.
        
       | caycep wrote:
       | Just because I haven't jumped into the data ecosystem for a while
       | - is Polars basically the same as Pandas but accelerated? Is Wes
       | still involved in either?
        
       | breakds wrote:
       | I understand `pandas` is widely used in finance and quantitative
       | trading, but it does not seem to be the best fit especially when
       | you want your research code to be quickly ported to production.
       | 
       | We found `numpy` and `jax` to be a good trade-off between "too
       | high level to optimize" and "too low level to understand".
       | Therefore in our hedge fund we just build data structures and
       | helper functions on top of them. The downside of the above
       | combination is on sparse data, for which we call wrapped c++/rust
       | code in python.
        
       | liminal wrote:
       | Lots of people have mentioned Polars' sane API as the main reason
       | to favor it, but the other crucial reason for us is that it's
       | based on Apache Arrow. That allows us to use it where it's the
       | best tool and then switch to whatever else we need when it isn't.
        
       | rcarmo wrote:
       | The killer app for Polars in my day-to-day work is its direct
       | Parquet export. It's become indispensable for cleaning up stuff
       | that goes into Spark or similar engines.
        
       | hinkley wrote:
       | TIL that NEC still exists. Now there's a name I have not heard in
       | a long, long time.
        
       | insane_dreamer wrote:
       | surprised not to see any mention of numpy (our go-to) here
       | 
       | edit: I know pandas uses numpy under the hood, but "raw" numpy is
       | typically faster (and more flexible), so curious as to why it's
       | not mentioned
        
       | __mharrison__ wrote:
       | Many of the complaints about Pandas here (and around the
       | internet) are about the weird API. However, if you follow a few
       | best practices, you never run into the issue folks are
       | complaining about.
       | 
        | I wrote a nice article about chaining for Ponder. (Sadly, it
        | looks like the Snowflake acquisition has removed that.) My
        | book, Effective Pandas 2, goes deep into my best practices.
        
         | otsaloma wrote:
          | I don't quite agree, but if this were true, what would you
          | tell a junior colleague in a code review? You can't use this
          | function/argument/convention/etc. you found in the official
          | API documentation because...I don't like it? I think any
          | team-maintained Pandas codebase will unavoidably drift
          | toward inconsistency. If you're always working alone, it can
          | of course be a bit better.
        
           | __mharrison__ wrote:
           | I have strong opinions about Pandas. I've used it since it
           | came out and have coalesced on patterns that make it easy to
           | use.
           | 
           | (Disclaimer: I'm a corporate trainer and feed my family
           | teaching folks how to work with their data using Pandas.)
           | 
           | When I teach about "readable" code, I caveat that it should
           | be "readable for a specific audience". I hold that if you are
           | a professional, that audience is other professionals. You
           | should write code for professionals and not for newbies.
           | Newbies should be trained up to write professional code.
           | YMMV, but that is my bias based on experience seeing this
           | work at some of the biggest companies in the world.
        
       | __mharrison__ wrote:
       | Lots of Pandas hate in this thread. However, for folks with lots
       | of lines of Pandas in production, Fireducks can be a lifesaver.
       | 
        | I've had the chance to play with it on some of my code.
        | Queries that ran in 8+ minutes came down to 20 seconds.
       | Re-writing in Polars involves more code changes.
       | 
       | However, with Pandas 2.2+ and arrow, you can use .pipe to move
       | data to Polars, run the slow computation there, and then zero
        | copy back to Pandas. Like so, going from:
        | 
        |     (df          # slow part
        |      .groupby(...)
        |      .agg(...)
        |     )
        | 
        | to:
        | 
        |     def polars_agg(df):
        |         return (pl.from_pandas(df)
        |                 .group_by(...)
        |                 .agg(...)
        |                 .to_pandas()
        |                )
        | 
        |     (df
        |      .pipe(polars_agg)
        |     )
        
       | nooope6 wrote:
       | Pretty cool, but where's the source at?
        
       ___________________________________________________________________
       (page generated 2024-11-20 23:00 UTC)