[HN Gopher] I wrote one of the fastest DataFrame libraries
___________________________________________________________________
I wrote one of the fastest DataFrame libraries
Author : polyrand
Score : 369 points
Date : 2021-03-14 01:27 UTC (21 hours ago)
(HTM) web link (www.ritchievink.com)
(TXT) w3m dump (www.ritchievink.com)
| nojito wrote:
| Still mind-boggling to me how amazing data.table is
| Keyframe wrote:
| Arrow and SIMD, how would it work on Arm? I've had quite a
| success with Gravitons recently.
| The_rationalist wrote:
| What is your question? ARM supports SIMD through NEON
| Keyframe wrote:
| Writing a separate path for NEON is what would be needed.
| It's not like there are these magical SIMD functions
| (intrinsics) that work across architectures.
| mhh__ wrote:
| You can implement the intel intrinsics on the arm end, but
| NEON and AVX aren't exactly the same thing, so there's
| usually performance to be found.
|
| I'm also not aware of any free SIMD profilers that work on
| ARM that hold a candle to vTune.
| jpgvm wrote:
| The problem in general is that no profilers really hold a
| candle to vTune. I love the new AMD chips but uProf isn't in the
| same class as vTune and that is sad. I'm certain with
| better tools the AMD chips could be demolishing Intel by an
| even greater margin.
| deadmutex wrote:
| Out of curiosity, what did you run on Graviton? How did you
| define success?
| staticautomatic wrote:
| What's up with all the Dask benchmarks saying "internal error"? I
| expected at least some explanation in the post.
| mumblemumble wrote:
| If you click through to the detailed benchmarks page
| (https://h2oai.github.io/db-benchmark/), a lot of them are cases
| where it ran out of memory, and a few of them are features that
| haven't been implemented yet.
|
| Inefficient use of memory is a problem I've seen with several
| projects that focus on scale out. All else being equal, they
| tend to use a _lot_ more memory. This happens for various
| reasons, but a lot of it is the simple fact that all the
| mechanisms you need to support distributed computing, and make
| it reliable, add a _lot_ of overhead.
| FridgeSeal wrote:
| I suppose there isn't as much focus on squeezing everything
| possible out of a single machine if the major focus is on
| distribution.
| mumblemumble wrote:
| There's that, but there's also just costs that are baked
| into the fact of distribution itself.
|
| For example, take Spark. Since it's built to be resilient,
| every executor is its own process. Because of that,
| executors can't just share immutable data the way threads
| can in a system that's designed for maximum single-machine
| performance. They've got to transfer the data using IPC. In
| the case of transferring data between two executors, that
| can result in up to four copies of the data being resident
| in the memory at once: The original data, a copy that's
| been serialized for transfer over a socket, the
| destination's copy of the serialized data, and the final
| deserialized copy.
| nojito wrote:
| These are single machine benchmarks.
| whoevercares wrote:
| Is there any potential this becomes the next Databricks?
| FridgeSeal wrote:
| That's a bit like asking when Python/Pandas will become the
| next SaaS product.
|
| This is analogous to Pandas; Databricks is a commercial
| offering of managed Spark; Ballista is a new project that is a
| Rust/"modern?" rewrite of Spark.
| sdmac99 wrote:
| I would like it if Vaex was added to the benchmarks.
| lordgroff wrote:
| I've been intrigued by this library, and specifically the
| possibility of a Python workflow with a fallback to Rust if
| needed. I mean, I haven't really looked at what the interop is,
| but it should work, right?
|
| It's not going to happen for now though because the project is
| still immature and there's zero documentation in Python from what
| I can see. But it's something I'm keeping a close eye on. I often
| work with R and C++ as a fallback when speed is paramount, but I
| think I'd rather replace C++ with Rust.
| wealthyyy wrote:
| Rust projects take longer. If memory safety is not a concern,
| I'd advise sticking to modern C++.
| qchris wrote:
| This feels like a gross generalization that's not applicable
| in many situations, and is immensely dependent on each
| individual person and situation.
|
| I can write non-trivial performant code in Rust, including
| bindings across a C FFI much faster than I can weave together
| the equivalent code and build scripts in C++. Memory safety
| isn't the only thing Rust brings to the table. I sometimes
| don't because C++'s ecosystem is far more developed for a certain
| application and it's not worth it for that particular
| situation. As with most things, it's about trade-offs.
| wealthyyy wrote:
| Are you not using CMake for building C++ apps?
|
| You can use C++ for everything; it's not limited to certain
| applications.
| mhh__ wrote:
| Probably worth pointing out that the sections on
| microarchitecture stopped being representative with the pentium
| pro in the mid 90s. The processor still has a pipeline, but it's
| much harder to stall it like that.
|
| http://www.lighterra.com/papers/modernmicroprocessors/ is a good
| guide (Agner Fog's reference isn't really a book so I don't
| recommend it for the uninitiated)
| danielecook wrote:
| Pretty impressed with the data.table benchmarks. The syntax is a
| little weird and takes getting used to but once you have the
| basics it's a great tool.
| orhmeh09 wrote:
| I use it a lot but it really breaks the tidyverse, which makes
| using R actually enjoyable. Why aren't these other libraries
| (not in R; I'm talking the others in the benchmark)
| consistently as fast as data.table? Are the programmers of
| data.table just that much better?
| zhdc1 wrote:
| > It really breaks the tidyverse
|
| You may want to look at tidyfst.
|
| > Are the programmers of data.table just that much better?
|
| Pixie dust, R's C API (and yes, they're just exceptionally
| good).
| clircle wrote:
| I dropped dplyr in favor of data.table and never looked back.
|
| https://github.com/eddelbuettel/gsir-te
| jrumbut wrote:
| Vanilla R got a bad name but once you understand the
| fundamentals it's quite good, fewer footguns than used to
| be there, and I find it easier to reason about than
| tidyverse.
| clircle wrote:
| There are dozens of us!
| warlog wrote:
| But the hexagons! Where are its hexagons?
| lordgroff wrote:
| While I like tidyverse, I honestly have trouble using it most
| of the time, knowing how much slower it is. It becomes
| addictive: I have trouble accepting minutes for operations
| that take seconds in DT.
|
| As for the speed, Matt Dowle definitely strikes me as a
| person who optimizes for speed. Then of course, there is the
| fact that everything is done in-place, and parallelization is
| at this point baked in. It's also mature, unlike a lot of
| other alternatives, and has never lost sight of speed. Note,
| for example, how in pandas, in-place operations have become
| very much discouraged over time, and are often not actually
| in-place anyway.
|
| Now back to tidyverse: why do you think tidyverse breaks
| with DT? If you enjoy the pipe, wrap DT in a function
| (e.g. dt) that takes a data frame, and ensure that any
| operations you need specific to DT return a reference to your
| data table object and off you go with something like this:
| df %>% dt(, x := y + z) %>% unique() %>%
| merge(z, by = "x") %>% dt(x < a)
|
| There, it looks like tidyverse, but way faster.
| [deleted]
| orhmeh09 wrote:
| There are almost 200 magrittr-related issues in GitHub and
| I have had a bad time pairing data.table with tidyverse
| packages (and others because of e.g. IDate). DT code is
| like line noise to me, but I prefer to write things in it
| directly -- the only reason I use it is because it's fast,
| and guessing how it's going to interact with tidy stuff and
| NSE (especially when using in place methods) is
| counterproductive to that goal.
| lordgroff wrote:
| 19 of those are open and most of them not terribly
| relevant. Considering the ubiquity of the package, I'd
| say the total number of issues is shockingly low.
|
| As for NSE, DT uses NSE as well, but differently of
| course. I guess it all comes down to what we "mean" by
| tidyverse. If we mean integration with the vast majority
| of packages, then yeah, it will work, but of course
| certain things are out of bounds. If you just want to use
| data table like dplyr, then tidytable is your ticket.
|
| I'd argue the best thing to do though is to just get
| used to the syntax. Data table looks like line noise
| until you're really comfortable with it, then the terse
| syntax comes across as really expressive and short. I've
| come to like writing data table in locally scoped blocks,
| pretty much without the pipe, and using mostly vanilla R
| (aside from data table). I think it looks pretty good
| actually, and I think less line noise than pandas with
| its endless lambda lambda lambda lambda.
| rcthompson wrote:
| dplyr and related packages use the existing R data frame
| class. (A "tibble" is just a regular R data frame under the
| hood.) This means that it inherits all the performance
| characteristics of regular R data frames. data.table is a
| completely separate implementation of a data structure that
| is functionally similar to a data frame but designed from the
| ground up for efficiency, though with some compromises, such
| as eschewing R's typical copy-on-modify paradigm. There are
| other more subtle reasons for the differences, but that's the
| absolute simplest explanation.
|
| Supposedly you can use data.tables with dplyr, but I haven't
| experimented with it in depth.
| lordgroff wrote:
| dtplyr, the dplyr backend for data table is still IMHO not
| great, and will often break in subtle and not so subtle
| ways. Tidytable is, I think, a much more interesting
| implementation, and gets close to the same speeds.
| rcthompson wrote:
| Hmm, this looks very interesting! I've ended up
| preferring dplyr for its expressiveness in spite of the
| speed difference, so this might be a nice compromise for
| when dplyr gets too slow.
| orhmeh09 wrote:
| Oh, I know that, I use it daily and I've read some of its
| source code. I'm just astonished that the best-performing
| data frame library in the world is developed in R and it
| outperforms engines written with million/billion dollar
| companies behind it.
| hugh-avherald wrote:
| data.table is written primarily in C. But R happens to
| have a very good package system and a very good interface
| to C code.
|
| And Matt Dowle has bled for that C code.
| dm319 wrote:
| I feel like some of it is to do with the way R's generics
| work - being lisp-based and making use of promises. It
| allows for nice syntax / code while interfacing the C
| backend.
| creddit wrote:
| > data.table is a completely separate implementation of a
| data structure that is functionally similar to a data frame
| but designed from the ground up for efficiency, though with
| some compromises, such as eschewing R's typical copy-on-
| modify paradigm.
|
| This is totally false. data.table inherits from data.frame.
| Sure, it has some extra attributes that a tibble doesn't
| but the way classing works in R is so absurdly lightweight,
| that's meaningless in comparison. Both tibble and
| data.table are data.frames at their core which are just
| lists of equal length vectors. You can pass a data.table
| wherever you pass a data.frame.
| rcthompson wrote:
| Thank you for the correction. I knew that tibbles were
| essentially just data frames with an extra class
| attribute, but for some reason I didn't realize this was
| also true of data.table. I think I assumed that
| data.table's reference semantics couldn't be implemented
| on top of the existing data frame class, but I guess I'm
| wrong about that. Unfortunately it's too late for me to
| edit my original comment.
| kkoncevicius wrote:
| Tibbles are not just data frames with an extra class
| attribute. For one, they don't have row names. Second,
| consider this example, demonstrating how treating tibbles
| as data frames can be dangerous:
|       df_iris <- iris
|       tb_iris <- tibble(iris)
|       nunique <- function(x, colname) length(unique(x[,colname]))
|       nunique(df_iris, "Species")
|       > 3
|       nunique(tb_iris, "Species")
|       > 1
|
| R-devel mailing list had a long discussion about this
| too:
| https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...
| rcthompson wrote:
| Ok, fine, to be more precise, tibbles and data frames and
| data tables are all implemented as R lists whose elements
| are vectors which form the columns of the table. And also
| `is.data.frame` currently returns TRUE for all of them,
| whether or not that is ultimately correct.
| ineedasername wrote:
| Me too: I've tended to let the database do a lot of heavy
| lifting before I bring data in. Maybe I don't actually need to
| do that.
| FridgeSeal wrote:
| There's really no harm in doing that, and it's still a pretty
| good idea.
|
| I generally try to get my data sources as far as possible
| with the database, then leave framework/language-specific
| things to the last step. That way, if nothing else, someone
| else picking up your dataset in a different
| language/framework toolset doesn't need to pick up yours as a
| dependency, and you're not spending time re-implementing what
| a database can already do (and can do more portably).
| ineedasername wrote:
| The only downside to letting the database do some of the
| pre-processing is that I don't have a full raw data set to
| work with within either R or Python. If I decide I need an
| existing measure aggregated up to a different level, or
| a new measure, I've got to go back to the database and then
| bring in an additional query. So I have less flexibility
| within the R or Python environment. But you make a good
| point: there's trade offs either way, and keeping the
| dataset as something like a materialized view on the
| database makes it a little more open to others' usage.
| snicker7 wrote:
| Looks great, but I would like to see how this compares against
| proprietary solutions (e.g. kdb+).
| natemcintosh wrote:
| I'm guessing Polars and Ballista
| (https://github.com/ballista-compute/ballista) have different
| goals, but I don't know enough
| about either to say what those might be. Does anyone know enough
| about either to explain the differences?
| timClicks wrote:
| Ballista is distributed. Its author, Andy Grove, is also the
| author of the Rust implementation of Arrow, so there will be
| similarities between the two projects.
| MrPowers wrote:
| Looks like a cool project.
|
| It's better to separate benchmarking results for big data
| technologies and small DataFrame technologies.
|
| Spark & Dask can perform computations on terabytes of data
| (thousands of Parquet files in parallel). Most of the other
| technologies in this article can only handle small datasets.
|
| This is especially important for join benchmarking. There are
| different types of cluster computing joins (broadcast vs shuffle)
| and they should be benchmarked separately.
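|
| For illustration, a rough PySpark sketch of the two join
| strategies (file, table and key names here are made up):
|
|       from pyspark.sql import SparkSession
|       from pyspark.sql.functions import broadcast
|
|       spark = SparkSession.builder.appName("join-demo").getOrCreate()
|       facts = spark.read.parquet("facts.parquet")  # large table
|       dims = spark.read.parquet("dims.parquet")    # small lookup
|
|       # Broadcast join: the small table is copied to every
|       # executor, so the large table is never shuffled over
|       # the network.
|       by_broadcast = facts.join(broadcast(dims), on="key")
|
|       # Shuffle (sort-merge) join: both sides are repartitioned
|       # by the join key before matching.
|       by_shuffle = facts.join(dims, on="key")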
| glogla wrote:
| Yeah, nobody uses Spark because they want to, they use it
| because, past a certain data size, there's nothing else.
| nxpnsv wrote:
| How is data.table so amazing? I didn't expect it to be that
| fast...
| 0xdeadfeed wrote:
| Optimized C can be very hard to match.
| DangitBobby wrote:
| If this will read a csv that has columns with mixed integers and
| nulls _without_ converting all of the numbers to float by
| default, it will replace pandas in my life. 99% of my problems
| with pandas arise from ints being coerced into floats when a null
| shows up.
| em500 wrote:
| Pass dtype = {"colX":"Int64"} for the columns that you want to
| read as a Nullable integer type:
| https://pandas.pydata.org/pandas-docs/stable/user_guide/inte...
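|
| A quick sketch of the difference (the column name is made up):
|
|       import io
|       import pandas as pd
|
|       csv = "id,count\n1,10\n2,\n3,30\n"
|
|       # Default: the missing value forces float64 (10.0, NaN, 30.0).
|       plain = pd.read_csv(io.StringIO(csv))
|       print(plain["count"].dtype)        # float64
|
|       # Nullable Int64 keeps integers and marks the gap with pd.NA.
|       nullable = pd.read_csv(io.StringIO(csv),
|                              dtype={"count": "Int64"})
|       print(nullable["count"].dtype)     # Int64
|       print(nullable["count"].tolist())  # [10, <NA>, 30]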
| [deleted]
| lmeyerov wrote:
| We have been rewriting our stack for multi/many-gpu scale out via
| python GPU dataframes, but it's clear that smaller workloads and
| some others would be fine on CPU (and thus free up the GPUs for
| our other tenants), so having a good CPU impl is exciting, esp if
| they achieve API compatibility w pandas/dask as RAPIDS and others
| do. I've been eyeing Vaex here (I think the other Rust Arrow
| project isn't DataFrames?), so good to have a contender!
|
| I'd love to see a comparison to RAPIDS dataframes for the single
| GPU case (ex: 2 GB), single GPU bigger-than-memory (ex: 100 GB),
| and then the same for multi-GPU. We have started to measure as
| things like "200 GB/s in-memory and 60 GB / s when bigger than
| memory", to give perspective.
| robin21 wrote:
| Is there a data frame project for Node.js?
| JPKab wrote:
| I would love to learn the details of building a Python wrapper on
| Rust code like you did with pypolars.
| DangitBobby wrote:
| I just did that this weekend with PyO3. The GitHub page
| has some working examples.
| nijave wrote:
| Not sure if anything exists but I wish something would do in
| memory compression + smart disk spillover. Sometimes I want to
| work with 5-10GB compressed data sets (usually log files) and
| decompressed they end up being 10x the size (plus data structure
| overhead). There's stuff like Apache Drill but it's more
| optimized for multi node than running locally
| cinntaile wrote:
| What do you mean by smart disk spillover?
| FridgeSeal wrote:
| When the library attempts to load something from disk that
| doesn't fit into memory, it transparently (and usually
| without extra intervention from the user) swaps to memory-
| mapping and chunking through the file(s).
|
| Particularly useful for when you've got a bunch of data that
| doesn't fit in memory, but setting up a whole cluster is not
| worth the overhead (operationally or otherwise) and/or if
| you've already got a processing pipeline written in a
| language/framework and you can't/don't-want to go through
| rewriting it for something distributed.
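|
| A rough sketch of the idea with numpy's memmap (the file name,
| dtype and chunk size are made up); the OS pages each slice in
| lazily and can drop it again when memory gets tight:
|
|       import numpy as np
|
|       # A big binary file of float64 values, opened without
|       # loading it into RAM.
|       data = np.memmap("huge_values.bin", dtype=np.float64,
|                        mode="r")
|
|       chunk = 10_000_000  # elements per pass
|       total = 0.0
|       for start in range(0, data.shape[0], chunk):
|           total += data[start:start + chunk].sum()
|       print(total)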
| hermitcrab wrote:
| How is that different to virtual memory?
| dTal wrote:
| Memory mapping lazily loads the file, completes
| immediately, and scales arbitrarily; using disk-backed
| virtual memory would require the entire file to be read
| from disk, written back out to disk, and then read in
| from disk again on access; it would also require swap to
| be set up at the OS level, and the amount of swap set up
| puts a hard limit on the size of the file.
| hermitcrab wrote:
| Thanks for clearing that up!
| wtallis wrote:
| You've described the difference between using mmap vs
| relying on the operating system's swap mechanism. But
| neither of those is quite the same as having an
| application that's aware of its memory usage and
| explicitly manages what it keeps in RAM. Using mmap may
| be useful for achieving that, but mmap on its own still
| leaves most of the management up to the OS.
| nighthawk454 wrote:
| maybe give Vaex a try. it's along those lines - out-of-core
| data frame library, but also works on a single machine
|
| https://github.com/vaexio/vaex
| rscho wrote:
| If you're not afraid of exotic languages, I encourage you to
| have a look at the APL ecosystem, and especially J and its
| integrated columnar store Jd.
|
| I have just embarked on an adventure to do just what you
| describe in... Racket. But it's nowhere to be seen yet.
|
| I'm an epidemiologist and I've been wanting to make my own
| tools for a while, now. It'll be interesting to see how far I
| can go with Racket, which already includes many pieces of the
| puzzle.
| cfourpio wrote:
| Would you mind pointing me at a resource that explains how
| J/Jd handle it by comparison?
| rscho wrote:
| https://code.jsoftware.com/wiki/Jd/Overview
|
| Jd only packs int vectors, though. So if you're hoping for
| string compression then I don't know of any free solution.
| Jd heavily leverages SIMD and mmap. Larger-than-RAM columns
| can be easily processed by ftable. I use Jd for data
| wrangling before making models in R. Of course, the J
| language is not for the faint of heart but it's really
| well-suited to the task.
| bobbylarrybobby wrote:
| I wonder if Arrow (https://arrow.apache.org/faq/) would work
| for this:
|
| "The Arrow IPC mechanism is based on the Arrow in-memory
| format, such that there is no translation necessary between the
| on-disk representation and the in-memory representation.
| Therefore, performing analytics on an Arrow IPC file can use
| memory-mapping, avoiding any deserialization cost and extra
| copies."
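|
| A rough pyarrow sketch of that round trip (the file name is
| made up):
|
|       import pyarrow as pa
|       import pyarrow.ipc as ipc
|
|       table = pa.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
|
|       # Write the table to an Arrow IPC file on disk.
|       with pa.OSFile("data.arrow", "wb") as sink:
|           with ipc.new_file(sink, table.schema) as writer:
|               writer.write_table(table)
|
|       # Read it back through a memory map: nothing is
|       # deserialized or copied until the data is touched.
|       with pa.memory_map("data.arrow", "r") as source:
|           loaded = ipc.open_file(source).read_all()
|       print(loaded.num_rows)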
| [deleted]
| f311a wrote:
| I think ClickHouse does that.
| Enginerrrd wrote:
| I feel like the HDF5 libraries could be helpful here if you
| could figure out how to get compatible compression.
| amatsukawa wrote:
| Dask is popular for this purpose. No memory compression, but
| supports disk spillage as well as distributed dataframes.
| superdimwit wrote:
| You can memory map Arrow
| kthejoker2 wrote:
| If you're doing OLAP-style queries you should look at DuckDB,
| it's hella fast and supports out-of-memory compute (it's not
| exactly "smart" but it handles spillover)
| sgtnoodle wrote:
| How is that different from, say, opening up a gzipped file with
| a reader object in python?
| Keyframe wrote:
| gzip, zcat, parallel... That sort of thing?
| Hendrikto wrote:
| > This directly shows a clear advantage over Pandas for instance,
| where there is no clear distinction between a float NaN and
| missing data, where they really should represent different
| things.
|
| Not true anymore:
|
| > Starting from pandas 1.0, an experimental pd.NA value
| (singleton) is available to represent scalar missing values. At
| this moment, it is used in the nullable integer, boolean and
| dedicated string data types as the missing value indicator.
|
| > The goal of pd.NA is to provide a "missing" indicator that can be
| used consistently across data types (instead of np.nan, None or
| pd.NaT depending on the data type).
|
| (https://pandas.pydata.org/pandas-docs/stable/user_guide/miss...)
| wtallis wrote:
| And much more recently (December 26, 2020):
| https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0...
|
| > Experimental nullable data types for float data
|
| > We've added Float32Dtype / Float64Dtype and FloatingArray.
| These are extension data types dedicated to floating point data
| that can hold the pd.NA missing value indicator (GH32265,
| GH34307).
|
| > While the default float data type already supports missing
| values using np.nan, these new data types use pd.NA (and its
| corresponding behavior) as the missing value indicator, in line
| with the already existing nullable integer and boolean data
| types.
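|
| A quick sketch of how pd.NA shows up across those dtypes:
|
|       import numpy as np
|       import pandas as pd
|
|       s_float = pd.Series([1.0, np.nan, 3.0])          # classic float64
|       s_int = pd.Series([1, pd.NA, 3], dtype="Int64")  # nullable int
|       s_f64 = pd.Series([1.0, pd.NA, 3.0], dtype="Float64")  # 1.2+
|
|       # pd.NA propagates through comparisons instead of
|       # behaving like a number the way NaN does.
|       print(pd.NA == 1)   # <NA>
|       print(np.nan == 1)  # False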
| kzrdude wrote:
| I wonder how pandas can both be at version 1.0 and have an
| experimental feature for something so central. Honest question
| Hendrikto wrote:
| It's at version 1.0 because it has a mature and stable
| interface. That does not mean that it cannot have
| experimental features which are not part of that stable
| interface.
| otsaloma wrote:
| Because Pandas is built on top of NumPy and NumPy has never
| had a proper NA value. I would call that a serious design
| problem in NumPy, but it seems to be difficult to fix. There
| have been multiple NEPs (NumPy Enhancement Proposals) over
| the years, but they haven't gone anywhere. Probably since
| things are not moving along in NumPy, a lot of development
| that should logically happen at the NumPy level is now
| happening in Pandas. But, I agree, I find it baffling how
| Python has gotten so big in data science and been around so
| long without having proper NA support.
|
| https://numpy.org/neps/#deferred-and-superseded-neps
| SatvikBeri wrote:
| This is very cool. I'm happy to see the decision to use Arrow,
| which should make it almost trivially easy to transfer data into
| e.g. Julia, and maybe even to create bindings to Polars.
| voceboy521 wrote:
| Am I supposed to know what a DataFrame is?
| davnn wrote:
| Does anyone have insights into why data.table is as fast as it
| is? PS: Great work with polars!
| sandGorgon wrote:
| "Polars is based on the Rust native implementation Apache Arrow.
| Arrow can be seen as middleware software for DBMS, query engines
| and DataFrame libraries. Arrow provides very cache-coherent data
| structures and proper missing data handling."
|
| This is super cool. Anyone know if Pandas is also planning to
| adopt Arrow?
| ellimilial wrote:
| It's on their roadmap
| https://pandas.pydata.org/docs/development/roadmap.html#apac...
| SatvikBeri wrote:
| I believe Pandas is incompatible with Arrow for a few reasons,
| such as their indexes and datetime types. But it's pretty easy
| to convert a pandas dataframe to Arrow and vice versa - I
| actually use this to pass data between Python & Julia.
|
| As a side note, Wes McKinney, the creator of Pandas, is heavily
| involved in Arrow.
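|
| The conversion itself is roughly a one-liner in each direction:
|
|       import pandas as pd
|       import pyarrow as pa
|
|       df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
|       table = pa.Table.from_pandas(df)  # pandas -> Arrow
|       df_back = table.to_pandas()       # Arrow -> pandas
|       print(table.schema)
|       print(df_back.equals(df))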
| sonthonax wrote:
| It'll probably never be fully compatible because Pandas can
| represent Python objects and nulls (badly). However, for the
| most part Arrow and Numpy are compatible. There's no overhead
| in converting an arrow data structure into a Numpy one.
| derriz wrote:
| I don't think this is the case. Particularly if you move past
| 1d numpy numeric arrays. And even in the simplest case of say
| a 1d float32 array, Arrow arrays are chunked which means
| there is significant overhead if you try to use an arrow
| table as your data structure when using Python's
| scientific/statistics/numerics ecosystem.
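|
| A small sketch of the chunking point: table columns are
| ChunkedArrays, so handing them to NumPy-based code may mean
| concatenating (copying) the chunks first.
|
|       import pyarrow as pa
|
|       col = pa.chunked_array([[1.0, 2.0], [3.0, 4.0]])
|       print(col.num_chunks)  # 2
|
|       # Flatten into one contiguous buffer (copies), then hand
|       # the result to NumPy-based code.
|       np_view = col.combine_chunks().to_numpy()
|       print(np_view)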
| jinmingjian wrote:
| The word "fastest" is always a bit doubtful. You often see one
| micro-benchmark list ten products and then say "look, I run in
| the shortest time".
|
| The problem is that people often compare apples to oranges. Do
| you know how to correctly use ClickHouse (there are 20-30 engines
| in ClickHouse to use; do you compare an in-memory engine to a
| database designed for disk persistence?), Spark, Arrow...? How can
| you guarantee a fair evaluation among ten or twelve products?
| ineedasername wrote:
| I'm surprised that data.table is so fast, and that pandas is so
| slow relative to it. It does explain why I've occasionally had
| memory issues on ~2GB data files when performing moderately
| complex functions. (to be fair, it's a relatively old Xeon w/
| 12GB ram) I'll have to learn the nuances of data.table syntax
| now.
| zhdc1 wrote:
| I'm convinced that data.table is wizardry.
|
| For anyone who's turned off by dt[i, j, by=k], Andrew Brooks
| has a good set of examples at
| http://brooksandrew.github.io/simpleblog/articles/advanced-d....
| DataCamp's Cheat Sheet is also a good resource:
| https://s3.amazonaws.com/assets.datacamp.com/blog_assets/dat....
| philshem wrote:
| Are you using Pandas >= 1.0? I noticed a big speedup without
| changing my code.
| ineedasername wrote:
| Is this faster than a "traditional" RDBMS like SQL Server,
| Postgres, Oracle (::uhhg:: but I have to work with it)?
| aaronharnly wrote:
| This is all in-memory afaik, so yes, this will be orders of
| magnitude faster than an RDBMS.
| de6u99er wrote:
| I tried running your code via docker-compose. After some building
| time, none of the notebooks in the examples folder worked.
|
| The notebook with the title "10 minutes to pypolars" was missing
| the pip command which I had to add to your Dockerfile (actually
| python-pip3). After rebuilding the whole thing and restarting the
| notebook, I had to change "!pip" to "!pip3" (was too lazy to add
| an alias) in the first code cell, which installed all dependencies
| after running. All the other cells resulted in errors.
|
| I suggest focusing on stability and reproducibility first and
| then on performance.
| ritchie46 wrote:
| Author here. These examples and docker-compose files are
| heavily outdated. Please take a look at the docs for up-to-date
| examples.
|
| P.S. I do what I can to keep things up to date, but only have
| the time I have.
| qeternity wrote:
| > At the time of writing this blog, Polars is the fastest
| DataFrame library in the benchmark second to R's data.table, and
| Polars is top 3 all tools considered
|
| This is a very strange way to write "Polars is the second
| fastest" but I guess that doesn't grab headlines
| ohazi wrote:
| I think it's mostly a nod to the fact that R's data.table blows
| everybody else out of the water by such a ridiculously wide
| margin. It's like a _factor of 2_ faster than the next
| fastest...
|
| So if you're writing a dataframe library as a hobby project,
| it's far less demotivating to use "all the other
| implementations" as your basis for comparison, at least
| initially.
| yamrzou wrote:
| Any idea what makes R's data.table so fast compared to the
| others?
| ZephyrBlu wrote:
| I read somewhere else (Another comment I think) that it was
| a ground-up implementation taking a very performance-oriented
| approach.
|
| Basically it seemed like they _really_ got in the weeds to
| make it super fast.
| ColFrancis wrote:
| R is from ~2000, while pandas started in 2011. Is it
| possible that the lack of compute power had an effect on
| the required performance characteristics?
| juancb wrote:
| R is much older than 2000, it's from 1993.
| andylynch wrote:
| And it's an implementation of S, originally from Bell
| Labs in 1976
| nojito wrote:
| data.table is basically a highly optimized C library
|
| https://github.com/Rdatatable/data.table
| noir_lord wrote:
| That's somewhat like libvips which was started when a 486
| was state of the art - fast forward and it's an image
| processing monster.
| qeternity wrote:
| I think a hobby project written in a general purpose language
| being the second fastest dataframe library is a hell of an
| accomplishment.
| clircle wrote:
| It seems like DataFrames.jl still has a ways to go before Julia
| can close the gap on R/data.table. I don't think these benchmarks
| include compilation time either?
| superdimwit wrote:
| It does look like the benchmarks include compilation time.
| ninjin wrote:
| Not that I am a heavy DataFrame user, but I have felt more at
| home with the comparatively lightweight TypedTables [1]. My
| understanding is that the rather complicated DataFrame
| ecosystem in Julia [2] mostly stems from whether tables should
| be immutable and/or typed. As far as I am aware there has not
| been any major push at the compiler level to speed up untyped
| code yet - although there should be plenty of room for
| improvements - which I suspect would benefit DataFrames
| greatly.
|
| [1]: https://github.com/JuliaData/TypedTables.jl
|
| [2]:
| https://typedtables.juliadata.org/stable/man/table/#datafram...
| pdeffebach wrote:
| > As far as I am aware there has not been any major push at
| the compiler level to speed up untyped code yet - although
| there should be plenty of room for improvements - which I
| suspect would benefit DataFrames greatly.
|
| That's not quite correct. The major `source => fun => dest`
| API as part of DataFrames.jl was designed specifically to get
| around the non-typed container problem. And it definitely
| works. That's not the cause of slow performance.
|
| I think the reason is that, as you mentioned, DataFrames has
| a big API and a lot of development effort is put towards
| finalizing the API in preparation for 1.0. After that there
| will be much more focus on performance.
|
| In particular, some changes to optimize grouping may have
| recently been merged but didn't make it into the release by
| the time this test suite was run. Multi-threaded
| operations, which haven't been finished yet, should also
| speed things up a lot.
|
| That said, this new Polars library looks seriously
| impressive. Congrats to the developer.
| SatvikBeri wrote:
| I started using Julia in December. DataFrames are in a sort of
| weird place because they're so much less necessary compared to
| e.g. Python. In Julia, you could just use a dict of arrays and
| get most of the benefits, thanks to libraries like Query.jl and
| Tables.jl. Thus the ecosystem is a lot more spread out. I
| actually use DataFrames much less than I used to in Python.
|
| This is mostly good, because you can apply the same operations
| on DataFrames, Streams, Time Series data, Differential
| Equations Results, etc., but it does mean that some of the
| specialized optimizations haven't made it into DataFrames.jl
| machinelabo wrote:
| Those arguments apply to Python as well. There is nothing
| special about Julia that warrants your arguments.
| SatvikBeri wrote:
| I've been using Python a lot longer than I've been using
| Julia, and this isn't really true. Python tends towards
| much larger packages where everything is bundled together,
| and there are fairly deep language-level reasons for that.
| Python doesn't have major alternatives to pandas the way
| Julia has half a dozen alternatives to DataFrames. There is
| nothing like Query.jl that applies to all table-like
| structures in Python.
|
| In pandas, you'll see things like exponentially weighted
| moving averages, while DataFrames.jl is pretty much just
| the data structure.
|
| The centralization of the Python ecosystem and extra
| attention that pandas has gotten has made it much better in
| several ways - for example, pandas's indexing makes
| filtering significantly faster. These optimizations might
| make it to DataFrames.jl eventually, but I doubt you'll
| ever see the same level of centralization.
| timClicks wrote:
| I disagree. Python's data science community is strongly
| clustered around pandas, even though it's possible to use
| other approaches
| theopsguy wrote:
| How does it compare to DataFusion, which is also a Rust project
| that has DataFrame support?
| FridgeSeal wrote:
| Polars is to Pandas as Ballista/datafusion is to Spark on a
| very broad basis.
| sradman wrote:
| My naive interpretation - The canonical Apache Arrow
| implementation is written in C++ with multiple language bindings
| like PyArrow. The Rust bindings for Apache Arrow re-implemented
| the Arrow specification so it can be used as a pure Rust
| implementation. Andy Grove [1] built two projects on top of Rust-
| Arrow: 1. DataFusion, a query engine for Arrow that can optimize
| SQL-like JOIN and GROUP BY queries, and 2. Ballista, clustered
| DataFusion-like queries (vs. Dask and Spark). DataFusion was
| integrated into the Apache Arrow Rust project.
|
| Ritchie Vink has introduced Polars that also builds upon Rust-
| Arrow. It offers an Eager API that is an alternative to PyArrow
| and a Lazy API that is a query engine and optimizer like
| DataFusion. The linked benchmark is focused on JOIN and GROUP BY
| queries on large datasets executed on a server/workstation-class
| machine (125 GB memory). This seems like a specialized use case
| that pushes the limits of a single developer machine and overlaps
| with the use case for a dedicated column store (like Redshift) or
| a distributed batch processing system like Spark/MapReduce.
|
| Why Polars over DataFusion? Why Python bindings to Rust-Arrow
| rather than canonical PyArrow/C++? Is there something wrong with
| PyArrow?
|
| [1] https://andygrove.io/projects/
| ritchie46 wrote:
| Hi Author here,
|
| Polars is not an alternative to PyArrow. Polars merely uses
| arrow as its in-memory representation of data. Similar to how
| pandas uses numpy.
|
| Arrow provides the efficient data structures and some compute
| kernels, like a SUM, a FILTER, a MAX etc. Arrow is not a query
| engine. Polars is a DataFrame library on top of arrow that has
| implemented efficient algorithms for JOINS, GROUPBY, PIVOTs,
| MELTs, QUERY OPTIMIZATION, etc. (the things you expect from a
| DF lib).
|
| Polars could be best described as an in-memory DataFrame
| library with a query optimizer.
|
| Because it uses Rust Arrow, it can easily swap pointers around
| to pyarrow and get zero-copy data interop.
|
| DataFusion is another query engine on top of arrow. They both
| use arrow as lower level memory layout, but both have a
| different implementation of their query engine and their API. I
| would say that DataFusion is more focused on a Query Engine and
| Polars is more focused on a DataFrame lib, but this is
| subjective.
|
| Maybe it's like comparing Rust Tokio vs Rust async-std: just
| different implementations striving for the same goal. (Only Polars
| and DataFusion can easily be mixed as they use the same memory
| structures).
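|
| Roughly, the eager vs. lazy split looks like this in Python
| (names follow a recent py-polars release and the exact spelling
| has changed over versions; older releases used groupby instead
| of group_by):
|
|       import polars as pl
|
|       df = pl.DataFrame({"g": ["a", "b", "a", "b"],
|                          "v": [1, 2, 3, 4]})
|
|       # Eager: runs immediately, like pandas.
|       eager = df.group_by("g").agg(pl.col("v").sum())
|
|       # Lazy: builds a plan the optimizer can rewrite
|       # (predicate/projection pushdown) before .collect().
|       lazy = (df.lazy()
|                 .filter(pl.col("v") > 1)
|                 .group_by("g")
|                 .agg(pl.col("v").sum())
|                 .collect())
|       print(eager)
|       print(lazy)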
| sradman wrote:
| Pandas supports JOIN and GROUP BY operators so you are saying
| that there is a gap between Apache Arrow and other mature
| dataframe libraries? If there is a gap, is there no plan to
| fix it in the standard Arrow API?
|
| I understand the case for a SQL-like DSL and an optimizer for
| distributed queries (in-memory column stores, not so much).
| I'm trying to understand the value add of Polars. I don't
| mean to come across as critical; perhaps DataFusion is a poor
| implementation and you are being too polite to say so.
|
| I also think that there is a C++/Arrow vs Rust/Arrow decision
| that has to be made. I associate PyArrow with the C++/Arrow
| library. Is Polars' Eager API a superset of the PyArrow API
| with the addition of JOIN/GROUPBY/other operators?
| ritchie46 wrote:
| There is definitely a gap, and I don't think that Arrow
| tries to fill that. But I don't think that it's wrong to
| have multiple implementations doing the same thing, right?
| We have PostgreSQL vs MySQL; both seem valid choices to me.
|
| A SQL like query engine has its place. An in memory
| DataFrame also has its place. I think the wide-spread use
| of pandas proves that. I only think we can do that more
| efficiently.
|
| With regard to C++ vs Rust arrow. The memory underneath is
| the same, so having an implementation in both languages
| only helps more widespread adoption IMO.
| lordgroff wrote:
| Thank you for your work! I've decided to kick the tires
| after reading your Python book. I think you underestimate
| the clarity of the API you have exposed, which, honestly,
| looks a fair bit more sane than the tangled web that
| pandas is.
| ritchie46 wrote:
| Thanks, I feel so too. There is still a lot of work to do.
| I hope that I can also bridge the gap regarding utility
| and documentation.
| tkoolen wrote:
| [note: see more nuanced comment below]
|
| The Julia benchmark two links deep at
| https://github.com/h2oai/db-benchmark doesn't follow even the
| most basic performance tips listed at
| https://docs.julialang.org/en/v1/manual/performance-tips/.
| SatvikBeri wrote:
| What specifically are you thinking of?
|
| The non-const global variables stand out to me, but I'm not
| experienced enough to tell whether that would make a large
| difference.
| tkoolen wrote:
| Non-const globals could be an issue, but it's possible it
| doesn't matter too much for this particular benchmark. I'm a
| little worried about taking compilation time (apart from
| precompilation) into account (would that also be done for C++
| code?). But I must confess I maybe posted my comment a bit
| too soon, partially because of the time of day, partially
| because of the semicolons at the end of each line in the
| code, which made me quickly think the benchmark writer was
| using Julia for the first time. While I have a good amount of
| experience with Julia, I don't have that much experience with
| DataFrames.jl itself, so I don't know for sure whether the
| reported benchmark times are reasonable or not.
| stabbles wrote:
| From the git history it seems like DataFrames.jl
| maintainers contributed at least some fixes to the scripts,
| so I guess that means they aren't opposed to it.
___________________________________________________________________
(page generated 2021-03-14 23:02 UTC)