[HN Gopher] FireDucks: Pandas but Faster
___________________________________________________________________
FireDucks: Pandas but Faster
Author : sebg
Score : 343 points
Date : 2024-11-14 11:48 UTC (6 days ago)
(HTM) web link (hwisnu.bearblog.dev)
(TXT) w3m dump (hwisnu.bearblog.dev)
| viraptor wrote:
| > 100% compatibility with existing Pandas code: check.
|
| Is it actually? Do people see that level of compatibility in
| practice?
| jeroenhd wrote:
| I don't think it's exactly 100%: https://fireducks-
| dev.github.io/docs/user-guide/04-compatibi...
|
| It should be pretty close, though.
| thecleaner wrote:
 | Sure, but that's single-node performance. This makes it not
 | very useful IMO, since quite a few data science folks work
 | with Hadoop clusters or Snowflake clusters or Databricks,
 | where data is distributed and querying is handled by Spark
 | executors.
| chaxor wrote:
 | The comparison is to pandas, so single node performance is
 | understood to be the scope. This is for people running small
 | tasks that may only take a couple of days on a single node
 | with a 32-core CPU or something, not tasks that take 3 months
 | using thousands of cores. My understanding is that for the
 | latter, pyspark is a decent option, while ballista is the
 | better option to look forward to. Perhaps using bastion-rs as
 | a backend could be useful for an upcoming system as well.
| Databricks et al are cloud trash IMO, as is anything that isn't
| meant to be run on a local single node system and a local HPC
| cluster with zero code change and a single line of config
| change.
|
 | While for most of my jobs I ended up being able to evade the
 | use of HPC by simply being smarter and discovering better
 | algorithms to process information, I recall liking pyspark
 | decently, but preferring the simplicity of ballista over
 | pyspark due to the simpler installation of Rust over managing
 | Java and JVM junk. The constant problems caused by anything
 | using a JVM backend, and the environment config that comes
 | with it, were terrible to deal with every time I set up a new
 | system.
|
 | In this regard, ballista is an enormous improvement. Anything
 | that is a one-line install via pip on any new system, runs
| local-first without any cloud or telemetry, and requires no
| change in code to run on a laptop vs HPC is the only option
| worth even beginning to look into and use.
| Kalanos wrote:
| Hadoop hasn't been relevant for a long time, which is telling.
|
 | Unless I had thousands of files to work with, I would be loath
 | to use cluster computing. There's so much overhead, cost,
 | waiting for nodes to spin up, and cloud architecture nonsense.
|
| My "single node" computer is a refurbished tower server with
| 256GB RAM and 50 threads.
|
| Most of these distributed computing solutions arose before data
| processing tools started taking multi-threading seriously.
| E_Bfx wrote:
| Very impressive, the Python ecosystem is slowly getting very
| good.
| BiteCode_dev wrote:
| Spent the last 20 years hearing that.
|
| At some point I think it's more honest to say "the python
| ecosystem keeps getting more awesome".
| Kalanos wrote:
| Continues to be the best by far
| i_love_limes wrote:
| I have never heard of FireDucks! I'm curious if anyone else here
| has used it. Polars is nice, but it's not totally compatible. It
| would be interesting how much faster it is for more complex
| calculations
| bratao wrote:
| Unfortunately it is not Opensource yet -
| https://github.com/fireducks-dev/fireducks/issues/22
| Y_Y wrote:
| Wouldn't it be nice if GitHub was just for source code and you
| couldn't just slap up a README that's an add for some
| proprietary shitware with a vague promise of source some day in
| the glorious future?
| thecopy wrote:
| >proprietary shitware
|
| Is this shitware? It seems to be very high quality code
| yupyupyups wrote:
| I think the anger comes from the fact that we expect Github
| repositories to host the actual source code and not be a
| dead-end with a single README.md file.
| ori_b wrote:
| How can you tell?
| sbarre wrote:
| I mean, based on the claims and the benchmarks, it seems
| to provide massive speedups to a very popular tool.
|
| How would you define "quality" in this context?
| echoangle wrote:
| High quality code isn't just code that performs well when
| executed, but also is readable, understandable and
| maintainable. You can't judge code quality by looking at
| the compiled result, just because it works well.
| sbarre wrote:
| That's certainly one opinion about it.
|
| One could also say that quality is related to the
| functional output.
| echoangle wrote:
| > One could also say that quality is related to the
| functional output.
|
| Right, I said nothing that contradicts that ("High
| quality code isn't _just_ code that performs well when
| executed, but also ... "). High quality functional output
| is a necessary requirement, but it isn't sufficient to
| determine if code is high quality.
| sbarre wrote:
| Sure, I guess it depends on what matters to you or to
| your evaluation criteria.
|
| My point was that it's all subjective in the end.
| echoangle wrote:
| It's not really subjective if you're at all reasonable
| about it.
|
| Imagine writing a very good program, running it through
| an obfuscator, and throwing away the original code. Is
| the obfuscated code "high quality code" now, because the
| output of the compilation still works as before?
| ori_b wrote:
| Written so that it's easy to maintain, well tested,
| correct in its handling of edge cases, easy to debug, and
| easy to iterate on.
| rad_gruchalski wrote:
| You'd slap that in a comment then?
| diggan wrote:
| > Wouldn't it be nice if GitHub was just for source code
|
 | GitHub has always been a platform of "we love to host FOSS,
 | but we won't be 100% FOSS ourselves", so it makes sense that
 | they allow that kind of usage for others too.
 |
 | I think what you want is something like Codeberg instead,
 | which is explicitly for FOSS and is 100% FOSS itself.
| gus_massa wrote:
| > _FireDucks is not a open source library at this moment. You
| can get it installed freely using pip and use under BSD-3
| license and of course can look into the python part of the
| source code._
|
| I don't understand what it means. It looks like a
| contradiction. Does it have a BSD-3 licence or not?
| _flux wrote:
| They provide BSD-3-licensed Python files but the interesting
| bit happens in the shared object library, which is only
| provided in binary form (but is also BSD-3-licensed it seems,
| so you can distribute it freely).
| joshuaissac wrote:
| Since it is under the BSD 3 licence, users would also be
| permitted to decompile and modify the shared object under
| the licence terms.
| jlokier wrote:
| Nice insight!
| abcalphabet wrote:
| From the above link:
|
| > While the wheel packages are available at
| https://pypi.org/project/fireducks/#files, and while they do
| contain Python files, most of the magic happens inside a
| (BSD-3-licensed) shared object library, for which source code
| is not provided.
| sampo wrote:
| BSD license gives you the permission to use and to
| redistribute. In this case you may use and redistribute the
| binaries.
|
| Edit: To use, redistribute, and modify, and distribute
| modified versions.
| japhyr wrote:
 | "Redistribution and use in source and binary forms, _with
 | or without modification_, are permitted provided that the
 | following conditions are met..."
|
| https://opensource.org/license/bsd-3-clause
| GardenLetter27 wrote:
| Such a crazy distortion of the meaning of the license.
|
| Imagine being like "the project is GPL - just the compiled
| machine code".
| PittleyDunkin wrote:
| This is pretty common for binary blobs for where the
| source code has been lost.
| pplonski86 wrote:
| How does it compare to Polars?
|
| EDIT: I've found some benchmarks https://fireducks-
| dev.github.io/docs/benchmarks/
|
| Would be nice to know what are internals of FireDucks
| rich_sasha wrote:
 | It's a bit sad for me. The biggest issue I have with pandas
 | is the API, not the speed.
|
| So many foot guns, poorly thought through functions, 10s of
| keyword arguments instead of good abstractions, 1d and 2d
| structures being totally different objects (and no higher-order
| structures). I'd take 50% of the speed for a better API.
|
| I looked at Polars, which looks neat, but seems made for a
| different purpose (data pipelines rather than building models
| semi-interactively).
|
| To be clear, this library might be great, it's just a shame for
| me that there seems no effort to make a Pandas-like thing with
| better API. Maybe time to roll up my sleeves...
| martinsmit wrote:
| Check out redframes[1] which provides a dplyr-like syntax and
| is fully interoperable with pandas.
|
| [1]: https://github.com/maxhumber/redframes
| otsaloma wrote:
| Building on top of Pandas feels like you're only escaping
| part of the problems. In addition to the API, the datatypes
| in Pandas are a mess, with multiple confusing (and none of
| them good) options for e.g. dates/datetimes. Does redframes
| do anything there?
| ljosifov wrote:
 | +1, seconding this. My limited experience with pandas had a
 | non-trivial number of moments of "?? Is it really like this?
 | Nah, I'm mistaken for sure, this can not be, no one would do
 | something insane like that". And yet, and yet... FWIW, I've
 | found that numpy is a must (of course), but pandas is mostly
 | optional. So I stick to numpy for my own writing, and keep
 | pandas read-only (just executing someone else's).
| omnicognate wrote:
| What about the polars API doesn't work well for your use case?
| short_sells_poo wrote:
| Polars is missing a crucial feature for replacing pandas in
| Finance: first class timeseries handling. Pandas allows me to
| easily do algebra on timeseries. I can easily resample data
| with the resample(...) method, I can reason about the index
| frequency, I can do algebra between timeseries, etc.
|
| You can do the same with Polars, but you have to start
| messing about with datetimes and convert the simple problem
| "I want to calculate a monthly sum anchored on the last
| business day of the month" to SQL-like operations.
|
| Pandas grew a large and obtuse API because it provides
| specialized functions for 99% of the tasks one needs to do on
| timeseries. If I want to calculate an exponential weighted
| covariance between two time series, I can trivially do this
| with pandas: series1.ewm(...).cov(series2). I welcome people
| to try and do this with Polars. It'll be a horrible and
| barely readable contraption.
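The one-liner mentioned above can be sketched with hypothetical data (the series names and the `halflife` value are illustrative choices, not from the original comment):

```python
import numpy as np
import pandas as pd

# Two hypothetical daily return series on a shared date index
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=100, freq="D")
s1 = pd.Series(rng.normal(size=100), index=idx)
s2 = pd.Series(rng.normal(size=100), index=idx)

# Exponentially weighted covariance between the two series
ewm_cov = s1.ewm(halflife=10).cov(s2)
```

The result is itself a time series, so it can go straight back into further algebra.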
|
| YC is mostly populated by technologists, and technologists
| are often completely ignorant about what makes pandas useful
| and popular. It was built by quants/scientists, for doing
| (interactive) research. In this respect it is similar to R,
| which is not a language well liked by technologists, but it
| is (surprise) deeply loved by many scientists.
| dkga wrote:
| Exactly the single reason why I use pandas when I need to
| use python. But coming from R, it still feels like "second
| best".
| n8henrie wrote:
| I don't know what exponential weighted covariance is, but
| I've had pretty good luck converting time series-based
| analyses from pandas to polars (for patient presentations
| to my emergency department -- patients per hour, per day,
| per shift, etc.). Resample has a direct (and easier IMO)
| replacement in polars, and there is group_by_dynamic.
|
 | I've had trouble determining whether one timestamp falls
 | between two others across tens of thousands of rows (with
 | the polars team suggesting I use a massive cross product
 | and filter -- which worked, aside from the memory
 | requirement), whereas in pandas I was able to sort the
 | timestamps and thereby only need to compare against the
 | preceding / following few based on the index of the last
 | match.
|
| The other issue I've had with resampling is with polars
| automatically dropping time periods with zero events,
| giving me a null instead of zero for the count of events in
| certain time periods (which then gets dropped from
| aggregations). This has caught me a few times.
|
| But other than that I've had good luck.
| short_sells_poo wrote:
| I'm curious how is polars group_by_dynamic easier than
| resample in pandas. In pandas if I want to resample to a
| monthly frequency anchored to the last business day of
| the month, I'd write:
|
| > my_df.resample("BME").apply(...)
|
| Done. I don't think it gets any easier than this. Every
| time I tried something similar with polars, I got bogged
| down in calendar treatment hell and large and obscure SQL
| like contraptions.
|
| Edit: original tone was unintentionally combative -
| apologies.
| cmdlineluser wrote:
| > cross product and filter
|
| `.join_where()`[1] was also added recently.
|
| [1]: https://docs.pola.rs/api/python/stable/reference/dat
| aframe/a...
| marcogorelli wrote:
| Could you show how you write "calculate a monthly sum
| anchored on the last business day of the month" in pandas
| please?
| sebg wrote:
| Not OP.
|
| But I'm guessing it's something like this:
|
 |     import pandas as pd
 |
 |     def calculate_monthly_business_sum(df, date_column, value_column):
 |         """
 |         Calculate monthly sums anchored to the last business
 |         day of each month
 |
 |         Parameters:
 |             df: DataFrame with dates and values
 |             date_column: name of date column
 |             value_column: name of value column to sum
 |
 |         Returns:
 |             DataFrame with sums anchored to last business day
 |         """
 |         # Ensure date column is datetime
 |         df[date_column] = pd.to_datetime(df[date_column])
 |
 |         # Group by end of business month and sum
 |         monthly_sum = df.groupby(pd.Grouper(
 |             key=date_column,
 |             freq='BME'  # Business Month End frequency
 |         ))[value_column].sum().reset_index()
 |
 |         return monthly_sum
 |
 |     # Example usage:
 |     df = pd.DataFrame({
 |         'date': ['2024-01-01', '2024-01-31', '2024-02-29'],
 |         'amount': [100, 200, 300]
 |     })
 |
 |     result = calculate_monthly_business_sum(df, 'date', 'amount')
 |
 |     print(result)
|
| Which you can run here => https://python-
| fiddle.com/examples/pandas?checkpoint=1732114...
| short_sells_poo wrote:
| It's actually much simpler than that. Assuming the index
| of the dataframe DF is composed of timestamps (which is
| normal for timeseries):
|
| df.resample("BME").sum()
|
| Done. One line of code and it is quite obvious what it is
| doing - with perhaps the small exception of BME, but if
| you want max readability you could do:
|
| df.resample(pd.offsets.BusinessMonthEnd()).sum()
|
| This is why people use pandas.
| short_sells_poo wrote:
| Answered the child comment but let me copy paste here
| too. It's literally one (short) line:
|
| > df.resample("BME").sum()
|
| Assuming `df` is a dataframe (ie table) indexed by a
| timestamp index, which is usual for timeseries analysis.
|
| "BME" stands for BusinessMonthEnd, which you can type out
| if you want the code to be easier to read by someone not
| familiar with pandas.
| tomrod wrote:
 | A bit from memory as in transit, but something like:
 |
 |     df.groupby(df[date_col] + pd.offsets.MonthEnd(0))[agg_col].sum()
| sega_sai wrote:
| Great point that I completely share. I tend to avoid pandas at
| all costs except for very simple things as I have bitten by
| many issues related to indexing. For anything complicated I
| tend to switch to duckdb instead.
| bravura wrote:
| Can you explain your use-case and why DuckDB is better?
|
| Considering switching from pandas and want to understand what
| is my best bet. I am just processing feature vectors that are
| too large for memory, and need an initial simple JOIN to
| aggregate them.
| sega_sai wrote:
| I am not necessarily saying duckdb is better. I personally
| just found it easier, clearer to write a sql query for any
| complicated set of joins/group by processing than to try to
| do that in pandas.
| rapatel0 wrote:
| Look into [Ibis](https://ibis-project.org/). It's a
| dataframe library built on duckdb. It supports lazy
| execution, greater than memory datastructures, remote s3
| data and is insanely fast. Also works with basically any
| backend (postgres, mysql, parquet/csv files, etc) though
| there are some implementation gaps in places.
|
| I previously had a pandas+sklearn transformation stack that
| would take up to 8 hours. Converted it to ibis and it
| executes in about 4 minutes now and doesn't fill up RAM.
|
| It's not a perfect apples to apples pandas replacement but
| really a nice layer on top of sql. after learning it, I'm
| almost as fast as I was on pandas with expressions.
| techwizrd wrote:
| I made the switch to Ibis a few months ago and have been
| really enjoying it. It works with all the plotting
| libraries including seaborn and plotnine. And it makes
| switching from testing on a CSV to running on a SQL/Spark
| a one-line change. It's just really handy for analysis
| (similar to the tidyverse).
| amelius wrote:
| Yes. Pandas turns 10x developers into .1x developers.
| berkes wrote:
| It does to me. Well, a 1x developer into a .01x dev in my
| case.
|
| My conclusion was that pandas is not for developers. But for
| one-offs by managers, data-scientists, scientists, and so on.
| And maybe for "hackers" who cludge together stuff 'till it
| works and then hopefully never touch it.
|
| Which made me realize such thoughts can come over as smug,
| patronizing or belittling. But they do show how software can
| be optimized for different use-cases.
|
 | The danger then lies in not recognizing these use-cases when
 | you pull in something like pandas. "Maybe using pandas to map
 | and reduce the CSVs that our users upload into insert batches
 | isn't a good idea at all".
 |
 | This is often worsened by the tools/platforms/lib devs or
 | communities not advertising these sweet spots and
 | limitations. Not in the case of Pandas though: that's really
 | clear about this not being a lib or framework for devs, but a
 | tool(kit) to do data analysis with. Kudos for that.
| analog31 wrote:
| I'm one of those people myself, and have whittled my Pandas
| use down to displaying pretty tables in Jupyter. Everything
| else I do in straight Numpy.
| theLiminator wrote:
| Imo numpy is not better than pandas for the things you'd
| use pandas for, though polars is far superior.
| Kalanos wrote:
| The pandas API makes a lot more sense if you are familiar with
| numpy.
|
| Writing pandas code is a bit redundant. So what?
|
| Who is to say that fireducks won't make their own API?
| faizshah wrote:
 | Pandas is a commonly known DSL at this point, so lots of data
 | scientists know pandas like the back of their hand, and that's
 | why a lot of "pandas but for X" libraries have become popular.
|
| I agree that pandas does not have the best designed api in
| comparison to say dplyr but it also has a lot of functionality
| like pivot, melt, unstack that are often not implemented by
| other libraries. It's also existed for more than a decade at
| this point so there's a plethora of resources and stackoverflow
| questions.
|
| On top of that, these days I just use ChatGPT to generate some
| of my pandas tasks. ChatGPT and other coding assistants know
| pandas really well so it's super easy.
|
| But I think if you get to know Pandas after a while you just
| learn all the weird quirks but gain huge benefits from all the
| things it can do and all the other libraries you can use with
| it.
| rich_sasha wrote:
| I've been living in the shadow of pandas for about a decade
| now, and the only thing I learned is to avoid using it.
|
 | I 100% agree that pandas _addresses_ all the pain points of
 | data analysis in the wild, and this is precisely why it is so
 | popular. My point is, it doesn't address them _well_. It
 | seems like a conglomerate of special cases, written for a
 | specific problem its author was facing, with little concern
 | for consistency, generality or other use cases that might
 | arise.
|
| In my usage, any time saved by its (very useful) methods
| tends to be lost on fixing subtle bugs introduced by strange
| pandas behaviours.
|
| In my use cases, I reindex the data using pandas and get it
| to numpy arrays as soon as I can, and work with those, with a
| small library of utilities I wrote over the years. I'd gladly
| use a "sane pandas" instead.
| specproc wrote:
| Aye, but we've learned it, we've got code bases written in
| it, many of us are much more data kids than "real devs".
|
| I get it doesn't follow best practices, but it does do what
| it needs to. Speed has been an issue, and it's exciting
| seeing that problem being solved.
|
| Interesting to see so many people recently saying "polars
| looks great, but no way I'll rewrite". This library seems
| to give a lot of people, myself included, exactly what we
| want. I look forward to trying it.
| te_chris wrote:
 | Pandas' best feature for me is the df format being readable by
 | duckdb. The filtering API is a nightmare.
| egecant wrote:
| Completely agree, from the perspective of someone that
| primarily uses R/tidyverse for data wrangling, there is this
| great article on why Pandas API feel clunky:
| https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-...
| movpasd wrote:
| I started using Polars for the "rapid iteration" usecase you
| describe, in notebooks and such, and haven't looked back --
| there are a few ergonomic wrinkles that I mostly attribute to
| the newness of the library, but I found that polars forces me
| to structure my thought process and ask myself "what am I
| actually trying to do here?".
|
| I find I basically never write myself into a corner with
| initially expedient but ultimately awkward data structures like
| I often did with pandas, the expression API makes the semantics
| a lot clearer, and I don't have to "guess" the API nearly as
| much.
|
| So even for this usecase, I would recommend trying out polars
| for anyone reading this and seeing how it feels after the
| initial learning phase is over.
| h14h wrote:
| If you wanna try a different API, take a look at Elixir
| Explorer:
|
| https://hexdocs.pm/explorer/exploring_explorer.html
|
 | It runs on top of Polars so you get those speed gains, but uses
 | the Elixir programming language. This gives the benefit of a
 | simple functional syntax w/ pipelines & whatnot.
|
| It also benefits from the excellent Livebook (a Jupyter
| alternative specific to Elixir) ecosystem, which provides all
| kinds of benefits.
| paddy_m wrote:
| Have you tried polars? It's a much more regular syntax. The
| regular syntax fits well with the lazy execution. It's very
| composable for programmatically building queries. And then it's
| super fast
| bionhoward wrote:
| I found the biggest benefit of polars is ironically the loss
| of the thing I thought I would miss most, the index; with
| pandas there are columns, indices, and multi-indices, whereas
| with polars, everything is a column, it's all the same so you
| can delete a lot of conditionals.
|
| However, I still find myself using pandas for the timestamps,
| timedeltas, and date offsets, and even still, I need a whole
| extra column just to hold time zones, since polars maps
| everything to UTC storage zone, you lose the origin / local
| TZ which screws up heterogeneous time zone datasets. (And I
| learned you really need to enforce careful manual thoughtful
| consideration of time zone replacement vs offsetting at the
| API level)
|
| Had to write a ton of code to deal with this, I wish polars
| had explicit separation of local vs storage zones on the
| Datetime data type
| paddy_m wrote:
 | I think pandas was so ambitious syntax-wise and concept-wise,
 | but it got to be a bit of a jumble. The index idea in
 | particular is so cool, particularly multi-indexes: watching
 | people who really understand it do multi-index operations is
 | very cool.
|
| IMO Polars sets a different goal of what's the most pandas
| like thing that we can build that is fast (and leaves open
| the possibility for more optimization), and clean.
|
 | Polars feels like you are obviously manipulating an advanced
 | query engine. Pandas feels like manipulating a squishy data
 | structure that should be super useful and friendly, but
 | sometimes does something dumb and slow.
| stared wrote:
 | Yes, every time I write df[df.sth == val], a tiny part of me
 | dies.
|
| For a comparison, dplyr offers a lot of elegant functionality,
| and the functional approach in Pandas often feels like an
| afterthought. If R is cleaner than Python, it tells a lot (as a
| side note: the same story for ggplot2 and matplotlib).
|
| Another surprise for friends coming from non-Python backgrounds
| is the lack of column-level type enforcement. You write
| df.loc[:, "col1"] and hope it works, with all checks happening
| at runtime. It would be amazing if Pandas integrated something
| like Pydantic out of the box.
|
| I still remember when Pandas first came out--it was fantastic
| to have a tool that replaced hand-rolled data structures using
| NumPy arrays and column metadata. But that was quite a while
| ago, and the ecosystem has evolved rapidly since then,
| including Python's gradual shift toward type checking.
| oreilles wrote:
 | > Yes, every time I write df[df.sth == val], a tiny part of
 | me dies.
|
| That's because it's a bad way to use Pandas, even though it
| is the most popular and often times recommended way. But the
| thing is, you can just write "safe" immutable Pandas code
| with method chaining and lambda expressions, resulting in
 | very Polars-like code. For example:
 |
 |     df = (
 |         pd.read_csv("./file.csv")
 |         .rename(columns={"value": "x"})
 |         .assign(y=lambda d: d["x"] * 2)
 |         .loc[lambda d: d["y"] > 0.5]
 |     )
|
| Plus nowadays with the latest Pandas versions supporting
| Arrow datatypes, Polars performance improvements over Pandas
| are considerably less impressive.
|
| Column-level name checking would be awesome, but
| unfortunately no python library supports that, and it will
| likely never be possible unless some big changes are made in
| the Python type hint system.
| OutOfHere wrote:
| Using `lambda` without care is dangerous because it risks
| being not vectorized at all. It risks being super slow,
| operating one row at a time. Is `d` a single row or the
| entire series or the entire dataframe?
| rogue7 wrote:
| In this case `d` is the entire dataframe. It's just a way
| of "piping" the object without having to rename it.
|
| You are probably thinking about `df.apply(lambda row:
| ..., axis=1)` which operates on each row at a time and is
| indeed very slow since it's not vectorized. Here this is
| different and vectorized.
| OutOfHere wrote:
| That's excellent.
| almostkorean wrote:
| Appreciate the explanation, this is something I should
| know by now but don't
| rogue7 wrote:
| Agreed 100%. I am using this method-chaining style all the
| time and it works like a charm.
| wodenokoto wrote:
 | I'm not really sure why you think
 |
 |     .loc[lambda d: d["y"] > 0.5]
 |
 | is stylistically superior to
 |
 |     [df.y > 0.5]
|
| I agree it comes in handy quite often, but that still
| doesn't make it great to write compared to what sql or
| dplyr offers in terms of choosing columns to filter on
| (`where y > 0.5`, for sql and `filter(y > 0.5)`, for dplyr)
| oreilles wrote:
| It is superior because you don't need to assign your
| dataframe to a variable ('df'), then update that variable
| or create a new one everytime you need to do that
| operation. Which means it is both safer (you're
| guaranteed to filter on the current version of the
| dataframe) and more concise.
|
 | For the rest of your comment: it's the best you can do _in
 | python_. Sure, you could write SQL, but then you're mixing
 | text queries with python data manipulation and I would dread
 | that. And SQL-only scripting is really out of the question.
| chaps wrote:
| Eh, SQL and python can still work together very well
| where SQL takes the place of pandas. Doing things in
| waves/batch helps.
|
| Big problem with pandas is that you still have to load
| the dataframe into memory to work with it. My data's too
| big for that and postgres makes that problem go away
| almost entirely.
| __mharrison__ wrote:
| It's superior because it is safer. Not because the API
| (or requirement for using Lambda) looks better. The
| lambda allows the operation to work on the current state
| of the dataframe in the chained operation rather than the
| original dataframe. Alternatively, you could use
| .query("y > 0.5"). This also works on the current state
| of the dataframe.
|
| (I'm the first to complain about the many warts in
| Pandas. Have written multiple books about it. This is
| annoying, but it is much better than [df.y > 0.5].)
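A minimal, hypothetical illustration of that safety point: inside a chain, the name `df` still refers to the pre-chain frame, so a filter written as `df[df.y > 0.5]` could not even see a column created earlier in the same chain, while the lambda receives the frame as it exists at that step.

```python
import pandas as pd

df = pd.DataFrame({"x": [0.1, 0.2, 0.4]})

out = (
    df.assign(y=lambda d: d["x"] * 2)  # y: 0.2, 0.4, 0.8
      # `d` here is the chained frame, which already has `y`;
      # `df` itself still has no `y` column at this point, so
      # df[df.y > 0.5] would raise an AttributeError instead.
      .loc[lambda d: d["y"] > 0.5]
)
```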
| moomin wrote:
| I mean, yes there's arrow data types, but it's got a long
| way to go before it's got full parity with the numpy
| version.
| doctorpangloss wrote:
| All I want is for the IDE and Python to correctly infer types
| and column names for all of these array objects. 99% of the
| pain for me is navigating around SQL return values and CSVs
| as pieces of text instead of code.
| otsaloma wrote:
| Agreed, never had a problem with the speed of anything NumPy or
| Arrow based.
|
| Here's my alternative: https://github.com/otsaloma/dataiter
| https://dataiter.readthedocs.io/en/latest/_static/comparison...
|
| Planning to switch to NumPy 2.0 strings soon. Other than that I
| feel all the basic operations are fine and solid.
|
| Note for anyone else rolling up their sleeves: You can get
| quite far with pure Python when building on top of NumPy (or
| maybe Arrow). The only thing I found needing more performance
| was group-by-aggregate, where Numba seems to work OK, although
| a bit difficult as a dependency.
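For instance, a group-by-sum over group labels can be done in pure NumPy with `reduceat`. This is a hedged sketch with made-up data, not dataiter's actual implementation:

```python
import numpy as np

groups = np.array([0, 0, 1, 1, 2])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Sort by group label, find the start index of each contiguous
# group, and sum each segment in one vectorized call.
order = np.argsort(groups, kind="stable")
sorted_groups = groups[order]
starts = np.flatnonzero(np.diff(sorted_groups, prepend=-1))
sums = np.add.reduceat(values[order], starts)
```

Per-group reductions like this stay vectorized, which is why pure Python on top of NumPy gets as far as it does before something like Numba becomes necessary.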
| adolph wrote:
| _So many foot guns, poorly thought through functions, 10s of
| keyword arguments instead of good abstractions_
|
| Yeah, Pandas has that early PHP feel to it, probably out of
| being a successful first mover.
| wodenokoto wrote:
| In that case I'd recommend dplyr in R. It also integrates with
| a better plotting library, GGPlot, which not only gives you
| better API than matplotlib but also prettier plots (unless you
| really get to work at your matplot code)
| epistasis wrote:
| Have you examined siuba at all? It promises to be more similar
| to the R tidyverse, which IMHO has a _much_ better API. And I
| personally prefer dplyr /tidyverse to Polars for exploratory
| analysis.
|
| https://siuba.org
|
| I have not yet used siuba, but would be interested in others'
| opinions. The activation energy to learn a new set of tools is
| so large that I rarely have the time to fully examine this
| space...
| Bootvis wrote:
| The lack of non standard evaluation still forces you to write
| `_.` so this might be a better Pandas but not a better
| tidyverse.
|
| A pity their compares don't have tidyverse or R's data.table.
| I think R would look simpler but now it remains unclear.
| otsaloma wrote:
| I think the choice of using functions instead of classes +
| methods doesn't really fit well into Python. Either you need
| to do a huge amount of imports or use the awful `from siuba
| import *`. This feels like shoehorning the dplyr syntax into
| Python when method chaining would be more natural and would
| still retain the idea.
|
| Also, having (already a while ago) looked at the
| implementation of the magic `_` object, it seemed like an
| awful hack that will serve only a part of use cases. Maybe
| someone can correct me if I'm wrong, but I get the impression
| you can do e.g. `summarize(x=_.x.mean())` but not
| `summarize(x=median(_.x))`. I'm guessing you don't get
| autocompletion in your editor or useful error messages and it
| can then get painful using this kind of a magic.
| kussenverboten wrote:
| Agree with this. My favorite syntax is the elegance of
| data.table API in R. This should be possible in Python too
| someday.
| nathan_compton wrote:
| Yeah. Pandas is the worst. Polars is better in some ways but so
| verbose!
| fluorinerocket wrote:
| Thank you. I don't know why people think it's so amazing. I
| sometimes end up just extracting the NumPy arrays from the data
| frame and doing things the way I know how, because the pandas
| way is so difficult.
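| That escape hatch is cheap in practice: `.to_numpy()` hands you the
| underlying arrays, and for purely numerical work plain NumPy is often
| simpler than chaining pandas methods. A minimal sketch of the pattern
| (the column names and the `hypot` computation are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# Drop down to NumPy arrays, compute there, then assign back.
x = df["x"].to_numpy()
y = df["y"].to_numpy()
df["dist"] = np.hypot(x, y)  # element-wise sqrt(x**2 + y**2)
```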
| omnicognate wrote:
| > Then came along Polars (written in Rust, btw!) which shook the
| ground of Python ecosystem due to its speed and efficiency
|
| Polars rocked my world by having a sane API, not by being fast. I
| can see the value in this approach if, like the author, you have
| a large amount of pandas code you don't want to rewrite, but
| personally I'm extremely glad to be leaving the pandas API
| behind.
| ralegh wrote:
| I personally found the polars API much clunkier, especially for
| rapid prototyping. I use it only for cemented processes where I
| could do with speed up/memory reduction.
|
| Is there anything specific you prefer moving from the pandas
| API to polars?
| benrutter wrote:
| Not OP but the ability to natively implement complex groupby
| logic is a huge plus for me at least.
|
| Say you want to compute an aggregation like "the mean of all
| values over the 75th percentile" alongside a few other
| aggregations. In pandas, this means you're in for a bunch of
| hoops and workarounds, because you can't express it via the
| API. Polars' API lets you express this directly, without
| having to implement any kind of workaround.
|
| Nice article on it here:
| https://labs.quansight.org/blog/dataframe-group-by
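| To make that concrete, here is the kind of custom-apply workaround
| pandas pushes you toward; the Polars single-expression version is
| sketched in a comment only (not executed here, so treat its exact
| spelling as an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "x": [1, 2, 3, 10, 1, 1, 1, 5],
})

# pandas has no single expression for "mean of values above the
# group's 75th percentile", so you fall back to a custom function.
def mean_over_p75(s):
    return s[s > s.quantile(0.75)].mean()

out = df.groupby("g")["x"].apply(mean_over_p75)

# Sketch of the Polars equivalent as one expression (untested here):
# df.group_by("g").agg(
#     pl.col("x").filter(pl.col("x") > pl.col("x").quantile(0.75)).mean()
# )
```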
| adrian17 wrote:
| Any explanation what makes it faster than pandas and polars would
| be nice (at least something more concrete than "leverage the C
| engine").
|
| My easy guess is that compared to pandas, it's multi-threaded by
| default, which makes for an easy perf win. But even then,
| 130-200x feels extreme for a simple sum/mean benchmark. I see
| they are also doing lazy evaluation and some MLIR/LLVM based JIT
| work, which is probably enough to get an edge over polars; though
| its wins over DuckDB _and_ Clickhouse are also surprising out of
| nowhere.
|
| Also, I thought one of the reasons for Polars's API was that
| Pandas API is way harder to retrofit lazy evaluation to, so I'm
| curious how they did that.
| ayhanfuat wrote:
| In its essence it is a commercial product which has a free trial.
|
| > Future Plans By providing the beta version of FireDucks free of
| charge and enabling data scientists to actually use it, NEC will
| work to improve its functionality while verifying its
| effectiveness, with the aim of commercializing it within FY2024.
| graemep wrote:
| It's BSD licensed. They do not say what the plans are, but most
| likely a proprietary version with added support or features.
| ayhanfuat wrote:
| They say the source code for the part "where the magic
| happens" is not available so I am not sure what BSD implies
| there.
| HelloNurse wrote:
| It serves as a ninja's smoke bomb until the "BSD" binary
| blob is suddenly obsoleted by a proprietary binary blob.
| ori_b wrote:
| It's a BSD licensed binary blob. There's no code provided.
| graemep wrote:
| Wow! That is so weird.
|
| It's freeware under an open source license. Really
| misleading.
|
| It looks like something you should stay away from unless
| you need it REALLY badly. It's a proprietary product with
| unknown pricing and no indication of what their plans are.
|
| Does the fact that the binary is BSD licensed allow
| reverse-engineering?
| captn3m0 wrote:
| > Redistribution and use in source and binary forms, with
| or without modification, are permitted
|
| Reversing and re-compiling should count as modification?
| imranq wrote:
| This presentation does a good job distilling why FireDucks is so
| fast:
|
| https://fireducks-dev.github.io/files/20241003_PyConZA.pdf
|
| The main reasons are
|
| * multithreading
|
| * rewriting base pandas functions like dropna in c++
|
| * in-built compiler to remove unused code
|
| Pretty impressive, especially given that you just import
| fireducks.pandas as pd instead of import pandas as pd, and you
| are good to go
|
| However I think if you are using a pandas function that wasn't
| rewritten, you might not see the speedups
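| The drop-in claim boils down to that one changed import line;
| everything below it is ordinary pandas code. A sketch of the pattern,
| run here against stock pandas (FireDucks itself is Linux-only and not
| assumed to be installed):

```python
# With FireDucks installed, the claimed drop-in usage is:
#   import fireducks.pandas as pd
# We use stock pandas here; the rest of the code is identical.
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, None, 4.0]})

# dropna is among the functions the slides say FireDucks rewrote in C++.
result = df["a"].dropna().mean()
```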
| faizshah wrote:
| It's not clear to me why this would be faster than polars,
| duckdb, vaex or clickhouse. They seem to be taking the same
| approach of multithreading, optimizing the plan, using arrow,
| optimizing the core functions like group by.
| mettamage wrote:
| Maybe it isn't? Maybe they just want a fast pandas api?
| geysersam wrote:
| According to their benchmarks they are faster. Not by a
| lot, but still significantly.
| maleldil wrote:
| None of those are drop-in replacements for Pandas. The main
| draw is "faster without changing your code".
| faizshah wrote:
| I'm asking more about what techniques did they use to get
| the performance improvements in the slides.
|
| They are showing a 20-30% improvement over Polars,
| ClickHouse and DuckDB. But those three tools are SOTA in this
| area and generally rank near each other in every benchmark.
|
| So 20-30% improvement over that cluster makes me interested
| to know what techniques they are using to achieve that over
| their peers.
| Kalanos wrote:
| Linux only right now https://github.com/fireducks-
| dev/fireducks/issues/27
| Kalanos wrote:
| Regarding compatibility, fireducks appears to be using the same
| column dtypes:
|
| ```
| >>> df['year'].dtype == np.dtype('int32')
| True
| ```
| DonHopkins wrote:
| FireDucks FAQ:
|
| Q: Why do ducks have big flat feet?
|
| A: So they can stomp out forest fires.
|
| Q: Why do elephants have big flat feet?
|
| A: So they can stomp out flaming ducks.
| short_sells_poo wrote:
| Looks very cool, BUT: it's closed source? That's an immediate
| deal breaker for me as a quant. I'm happy to pay for my tools,
| but not being able to look and modify the source code of a
| crucial library like this makes it a non-starter.
| KameltoeLLM wrote:
| Shouldn't that be FirePandas then?
| benrutter wrote:
| Anyone here tried using FireDucks?
|
| The promise of a 100x speedup with 0 changes to your codebase is
| pretty huge, but even a few correctness / incompatibility issues
| would probably make it a no-go for a bunch of potential users.
| safgasCVS wrote:
| I'm sad that R's tidy syntax is not copied more widely in the
| python world. Dplyr is incredibly intuitive most don't ever
| bother reading the instructions you can look at a handful of
| examples and you've got the gist of it. Polars despite its speed
| is still verbose and inconsistent while pandas is seemingly a
| collection of random spells.
| softwaredoug wrote:
| The biggest advantage of pandas is its extensibility. If you care
| about that, it's (relatively) easy to add your own extension
| array type.
|
| I haven't seen that in other systems like Polars, but maybe I'm
| wrong.
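| For a taste of that extensibility without writing a full
| ExtensionArray, pandas also lets you register custom accessors that
| hang a namespace off every DataFrame. A minimal sketch (the `geo`
| namespace and `centroid` method are made up for illustration):

```python
import pandas as pd

# Registers df.geo.* on every DataFrame in this process.
@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, df):
        self._df = df

    def centroid(self):
        # Mean of the lat/lon columns, as a (lat, lon) tuple.
        return (self._df["lat"].mean(), self._df["lon"].mean())

df = pd.DataFrame({"lat": [1.0, 3.0], "lon": [2.0, 4.0]})
center = df.geo.centroid()
```

Extension array types go further (custom dtypes, storage, and ops), but the accessor API shows the same design intent.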
| ssivark wrote:
| Setting aside complaints about the Pandas API, it's frustrating
| that we might see the community of a popular "standard" tool
| fragment into two or _even three_ ecosystems (for libraries with
| slightly incompatible APIs) -- seemingly all with the value
| proposition of "making it faster". Based on the machine learning
| experience over the last decade, this kind of churn in tooling is
| somewhat exhausting.
|
| I wonder how much of this is fundamental to the common approach
| of writing libraries in Python with the processing-heavy parts
| delegated to C/C++ -- that the expressive parts cannot be fast
| and the fast parts cannot be expressive. Also, whether Rust (for
| polars, and other newer generation of libraries) changes this
| tradeoff substantially enough.
| tgtweak wrote:
| I think it's a natural path of software life that compatibility
| often stands in the way of improving the API.
|
| This really does seem like a rare thing: everything speeds
| up without breaking compatibility. If you want a fast revised
| API for your new project (or to rework your existing one) then
| you have a solution for that with Polars. If you just want your
| existing code/workloads to work faster, you have a solution for
| that now.
|
| It's OK to have a slow, compatible, static codebase to build
| things on then optimize as-needed.
|
| Trying to "fix" the api would break a ton of existing code,
| including existing plugins. Orphaning those projects and
| codebases would be the wrong move, those things take a decade
| to flesh out.
|
| This really doesn't seem like the worst outcome, and doesn't
| seem to be creating a huge fragmented mess.
| SiempreViernes wrote:
| > Based on the machine learning experience over the last
| decade, this kind of churn in tooling is somewhat exhausting.
|
| Don't come to old web devs with those complaints: every single
| one of them had to write at least one open source JavaScript
| library just to create their LinkedIn account!
| cmcconomy wrote:
| Every time I see a new better pandas, I check to see if it has
| geopandas compatibility
| PhasmaFelis wrote:
| "FireDucks: Pandas but Faster" sounds like it's about something
| much more interesting than a Python library. I'd like to read
| that article.
| dkga wrote:
| Reading all pandas vs polars reminded me of the tidyverse vs
| data.table discussion some 10 years ago.
| flakiness wrote:
| > FireDucks is released on pypi.org under the 3-Clause BSD
| License (the Modified BSD License).
|
| Where can I find the code? I don't see it on GitHub.
|
| > contact@fireducks.jp.nec.com
|
| So it's from NEC (a major Japanese computer company), presumably
| a research artifact?
|
| > https://fireducks-dev.github.io/docs/about-us/ Looks like so.
| xbar wrote:
| Great work, but I will hold my adoption until c++ source is
| available.
| uptownfunk wrote:
| If they could just make a dplyr for py it would be so awesome.
| But sadly I don't think the python language semantics will
| support such a tool. It all comes down to managing the namespace
| I guess
| gigatexal wrote:
| On average only 1.5x faster than polars. That's kinda crazy.
| geysersam wrote:
| Why is that crazy? (I think the crazy thing is that they are
| faster at all. Taking an existing api and making it fast is
| harder than creating the api from scratch with performance in
| mind)
| OutOfHere wrote:
| Don't use it:
|
| > By providing the beta version of FireDucks free of charge and
| enabling data scientists to actually use it, NEC will work to
| improve its functionality while verifying its effectiveness, with
| the aim of commercializing it within FY2024.
|
| In other words, it's free only to trap you.
| ladyanita22 wrote:
| Important to upvote this. If there's room for improvement for
| Polars (which I'm sure there is), go and support the project.
| But don't fall for a commercial trap when there are competent
| open source tools available.
| maleldil wrote:
| While I agree, it's worth noting that this project is a drop-
| in replacement (they claim that, at least), but Polars has a
| very different API. I much prefer Polars's API, but it's
| still a non-trivial cost to switch to it, which is why many
| people would rather explore drop-in Pandas alternatives.
| binoct wrote:
| No shade to the juggernaut of the open source software
| movement and everything it has enabled and will enable, but why
| the hate
| for a project that required people's time and knowledge to
| create something useful to a segment of users and then expect
| to charge for using it in the future? Commercial trap seems
| to imply this is some sort of evil machination but it seems
| like they are being quite upfront with that language.
| floatrock wrote:
| It's not hate for the project, it's hate for the deceptive
| rollout.
|
| Basically it's a debate about how many dark patterns can
| you squeeze next to that "upfront language" before
| "marketing" slides into "bait-n-switch."
| papichulo2023 wrote:
| Not sure if evil or not, but it is unprofessional to use a
| tool when you don't know how much it will cost your
| company in the future.
| tombert wrote:
| Thanks for the warning.
|
| I nearly made the mistake of merging Akka into a codebase
| recently; fortunately I double-checked the license and noticed
| it was the bullshit BUSL and it would have potentially cost my
| employer tens of thousands of dollars a year [1]. I ended up
| switching everything to Vert.x, but I really hate how
| normalized these ostensibly open source projects are sneaking
| scary expensive licenses into things now.
|
| [1] Yes I'm aware of Pekko now, and my stuff probably would
| have worked with Pekko, but I didn't really want to deal with
| something that by design is 3 years out of date.
| cogman10 wrote:
| IMO, you made a good decision ditching Akka. We have an Akka
| app from before the BUSL change and it is a PITA to maintain.
|
| Vert.x and other frameworks are far better and easier for
| most devs to grok.
| tombert wrote:
| Yeah, Vert.x actually ended up being pretty great. I feel
| like it gives me most of the cool features of Akka that I
| actually care about, but it allows you to gradually move
| into it; it _can_ be a full-on framework, but it can also
| just be a decent library to handle concurrency.
|
| Plus the license isn't stupid.
| switchbak wrote:
| > We have an akka app before the BUSL and it is a PITA to
| maintain
|
| I would imagine the non-Scala use case to be less than
| ideal.
|
| In Scala land, Pekko - the open source fork of Akka is the
| way to go if you need compatibility. Personally, I'd avoid
| new versions of Akka like the plague, and just use more
| modern alternatives to Pekko/Akka anyway.
|
| I'm not sure what Lightbend's target market is? Maybe they
| think they have enough critical mass to merit the price tag
| for companies like Sony/Netflix/Lyft, etc. But they've
| burnt their bridge right into the water with everyone else,
| so I see them fading into irrelevance over the next few
| years.
| tombert wrote:
| I actually do have some decision-making power in regards
| to what tech I use for my job [1] at a mid-size (by tech
| standards) company, and my initial plan was to use Akka
| for the thing I was working on, since it more or less fit
| into the actor model perfectly.
|
| I'm sure that Lightbend feels that their support contract
| is the bee's knees and worth whatever they charge for it,
| but it's a complete non-starter for me, and so I look
| elsewhere.
|
| Vert.x's actor-ish model is a bit different, but it's not
| _that_ different, and considering that Vert.x tends
| to perform extremely well in benchmarks, it doesn't
| really feel like I'm _losing_ a lot by using it instead
| of Akka, particularly since I'm not using Akka Streams.
|
| [1] Normal disclaimer: I don't hide my employment
| history, and it's not hard to find, but I politely ask
| that you do not post it here.
| wmfiv wrote:
| I've found actors (Akka specifically) to be a great model
| when you have concurrent access to fine grained shared
| state. It provides such a simple mental model of how to
| serialize that access. I'm not a fan as a general
| programming model or even as a general purpose concurrent
| programming model.
| tombert wrote:
| Vert.x has the "Verticle" abstraction, which more or less
| corresponds to something like an Actor. It's close enough
| to where I don't feel like I'm missing much by using it
| instead of Akka.
| Weryj wrote:
| What are your criticisms of actors as a general purpose
| concurrent programming model?
| mushufasa wrote:
| If it's good, then why not just fork it when (if) the license
| changes? It is 3-clause BSD.
|
| In fact, what's stopping the pandas library from incorporating
| fireducks code into the mainline branch? pandas itself is BSD.
| nicce wrote:
| There is no code. The binary blob is licensed.
| BostonEnginerd wrote:
| I thought I saw on the documentation that it was released under
| the modified BSD license. I guess they could take future
| versions closed source, but the current version should be
| available for folks to use and further develop.
| OutOfHere wrote:
| It's just the binary that's BSD, not the source code. The
| source code is unavailable.
| caycep wrote:
| Just because I haven't jumped into the data ecosystem for a while
| - is Polars basically the same as Pandas but accelerated? Is Wes
| still involved in either?
| breakds wrote:
| I understand `pandas` is widely used in finance and quantitative
| trading, but it does not seem to be the best fit especially when
| you want your research code to be quickly ported to production.
|
| We found `numpy` and `jax` to be a good trade-off between "too
| high level to optimize" and "too low level to understand".
| Therefore in our hedge fund we just build data structures and
| helper functions on top of them. The downside of the above
| combination is on sparse data, for which we call wrapped c++/rust
| code in python.
| liminal wrote:
| Lots of people have mentioned Polars' sane API as the main reason
| to favor it, but the other crucial reason for us is that it's
| based on Apache Arrow. That allows us to use it where it's the
| best tool and then switch to whatever else we need when it isn't.
| rcarmo wrote:
| The killer app for Polars in my day-to-day work is its direct
| Parquet export. It's become indispensable for cleaning up stuff
| that goes into Spark or similar engines.
| hinkley wrote:
| TIL that NEC still exists. Now there's a name I have not heard in
| a long, long time.
| insane_dreamer wrote:
| surprised not to see any mention of numpy (our go-to) here
|
| edit: I know pandas uses numpy under the hood, but "raw" numpy is
| typically faster (and more flexible), so curious as to why it's
| not mentioned
| __mharrison__ wrote:
| Many of the complaints about Pandas here (and around the
| internet) are about the weird API. However, if you follow a few
| best practices, you never run into the issue folks are
| complaining about.
|
| I wrote a nice article about chaining for Ponder. (Sadly, it
| looks like the Snowflake acquisition has removed that. My book,
| Effective Pandas 2, goes deep into my best practices.)
| otsaloma wrote:
| I don't quite agree, but if this was true, what would you tell
| a junior colleague in a code review? You can't use this
| function/argument/convention/etc you found in the official API
| documentation because...I don't like it? I think any team-
| maintained Pandas codebase will unavoidably drift into
| inconsistency and bad patterns. If you're always working
| alone, then it can of course be a bit better.
| __mharrison__ wrote:
| I have strong opinions about Pandas. I've used it since it
| came out and have coalesced on patterns that make it easy to
| use.
|
| (Disclaimer: I'm a corporate trainer and feed my family
| teaching folks how to work with their data using Pandas.)
|
| When I teach about "readable" code, I caveat that it should
| be "readable for a specific audience". I hold that if you are
| a professional, that audience is other professionals. You
| should write code for professionals and not for newbies.
| Newbies should be trained up to write professional code.
| YMMV, but that is my bias based on experience seeing this
| work at some of the biggest companies in the world.
| __mharrison__ wrote:
| Lots of Pandas hate in this thread. However, for folks with lots
| of lines of Pandas in production, Fireducks can be a lifesaver.
|
| I've had the chance to play with it on some of my code;
| queries that ran in 8+ minutes came down to 20 seconds.
|
| Re-writing in Polars involves more code changes.
|
| However, with Pandas 2.2+ and Arrow, you can use .pipe to move
| data to Polars, run the slow computation there, and then zero-
| copy back to Pandas. Like so:
|
|     (df
|       # slow part
|       .groupby(...)
|       .agg(...)
|     )
|
| becomes:
|
|     def polars_agg(df):
|         return (pl.from_pandas(df)
|                 .group_by(...)
|                 .agg(...)
|                 .to_pandas())
|
|     (df
|       .pipe(polars_agg)
|     )
| nooope6 wrote:
| Pretty cool, but where's the source at?
___________________________________________________________________
(page generated 2024-11-20 23:00 UTC)