[HN Gopher] I wrote one of the fastest DataFrame libraries
       ___________________________________________________________________
        
       I wrote one of the fastest DataFrame libraries
        
       Author : polyrand
       Score  : 369 points
       Date   : 2021-03-14 01:27 UTC (21 hours ago)
        
 (HTM) web link (www.ritchievink.com)
 (TXT) w3m dump (www.ritchievink.com)
        
       | nojito wrote:
        | Still mind-boggling to me how amazing data.table is
        
       | Keyframe wrote:
        | Arrow and SIMD, how would those work on Arm? I've had quite a
        | bit of success with Graviton instances recently.
        
         | The_rationalist wrote:
         | What is your question? ARM supports SIMD through NEON
        
           | Keyframe wrote:
            | Writing a separate path for NEON is what would be needed.
           | It's not like there are these magical SIMD functions
           | (intrinsics) that work across architectures.
        
           | mhh__ wrote:
              | You can implement the Intel intrinsics on the ARM end,
              | but NEON and AVX aren't exactly the same thing, so
              | there's usually performance left on the table.
           | 
           | I'm also not aware of any free SIMD profilers that work on
           | ARM that hold a candle to vTune.
        
             | jpgvm wrote:
              | No profiler really holds a candle to vTune; that's the
              | problem in general. I love the new AMD chips, but uProf
              | isn't in the same class as vTune, and that is sad. I'm
              | certain that with better tools the AMD chips could be
              | demolishing Intel by an even greater margin.
        
         | deadmutex wrote:
         | Out of curiosity, what did you run on Graviton? How did you
         | define success?
        
       | staticautomatic wrote:
       | What's up with all the Dask benchmarks saying "internal error"? I
       | expected at least some explanation in the post.
        
         | mumblemumble wrote:
          | If you click through to the detailed benchmarks page
          | (https://h2oai.github.io/db-benchmark/), you'll see that a
          | lot of them are runs that exhausted memory, and a few are
          | features that haven't been implemented yet.
         | 
         | Inefficient use of memory is a problem I've seen with several
         | projects that focus on scale out. All else being equal, they
         | tend to use a _lot_ more memory. This happens for various
         | reasons, but a lot of it is the simple fact that all the
         | mechanisms you need to support distributed computing, and make
         | it reliable, add a _lot_ of overhead.
        
           | FridgeSeal wrote:
           | I suppose there isn't as much focus on squeezing everything
           | possible out of a single machine if the major focus is on
           | distribution.
        
             | mumblemumble wrote:
             | There's that, but there's also just costs that are baked
             | into the fact of distribution itself.
             | 
             | For example, take Spark. Since it's built to be resilient,
             | every executor is its own process. Because of that,
             | executors can't just share immutable data the way threads
             | can in a system that's designed for maximum single-machine
             | performance. They've got to transfer the data using IPC. In
             | the case of transferring data between two executors, that
             | can result in up to four copies of the data being resident
             | in the memory at once: The original data, a copy that's
             | been serialized for transfer over a socket, the
             | destination's copy of the serialized data, and the final
             | deserialized copy.
        
         | nojito wrote:
         | These are single machine benchmarks.
        
       | whoevercares wrote:
        | Any potential for this to become the next Databricks?
        
         | FridgeSeal wrote:
         | That's a bit like asking when Python/Pandas will become the
         | next SaaS product.
         | 
          | This is analogous to Pandas; Databricks is a commercial
          | offering of managed Spark; Ballista is a new project that is
          | a Rust/"modern??" rewrite of Spark.
        
       | sdmac99 wrote:
       | I would like it if Vaex was added to the benchmarks.
        
       | lordgroff wrote:
        | I've been intrigued by this library, and specifically the
        | possibility of a Python workflow with a fallback to Rust if
        | needed. I mean, I haven't really looked at what the interop
        | is, but it should work, right?
        | 
        | It's not going to happen for now, though, because the project
        | is still immature and there's zero documentation in Python
        | from what I can see. But it's something I'm keeping a close
        | eye on. I often work with R and C++ as a fallback when speed
        | is paramount, but I think I'd rather replace C++ with Rust.
        
         | wealthyyy wrote:
          | Rust projects take longer. If memory safety is not a
          | concern, I'd advise sticking to modern C++.
        
           | qchris wrote:
           | This feels like a gross generalization that's not applicable
           | in many situations, and is immensely dependent on each
           | individual person and situation.
           | 
            | I can write non-trivial performant code in Rust, including
            | bindings across a C FFI, much faster than I can weave
            | together the equivalent code and build scripts in C++.
            | Memory safety isn't the only thing Rust brings to the
            | table. I sometimes don't use it because C++'s ecosystem is
            | far more developed for a certain application and it's not
            | worth it for that particular situation. As with most
            | things, it's about trade-offs.
        
             | wealthyyy wrote:
             | Are you not using CMake for building C++ apps?
             | 
              | You can use C++ for everything; it's not developed for
              | only certain applications.
        
       | mhh__ wrote:
        | Probably worth pointing out that the sections on
        | microarchitecture stopped being representative with the
        | Pentium Pro in the mid '90s. The processor still has a
        | pipeline, but it's much harder to stall it like that.
       | 
       | http://www.lighterra.com/papers/modernmicroprocessors/ is a good
       | guide (Agner Fog's reference isn't really a book so I don't
       | recommend it for the uninitiated)
        
       | danielecook wrote:
       | Pretty impressed with the data.table benchmarks. The syntax is a
       | little weird and takes getting used to but once you have the
       | basics it's a great tool.
        
         | orhmeh09 wrote:
         | I use it a lot but it really breaks the tidyverse, which makes
         | using R actually enjoyable. Why aren't these other libraries
         | (not in R; I'm talking the others in the benchmark)
         | consistently as fast as data.table? Are the programmers of
         | data.table just that much better?
        
           | zhdc1 wrote:
           | > It really breaks the tidyverse
           | 
           | You may want to look at tidyfst.
           | 
           | > Are the programmers of data.table just that much better?
           | 
           | Pixie dust, R's C API (and yes, they're just exceptionally
           | good).
        
           | clircle wrote:
           | I dropped dplyr in favor of data.table and never looked back.
           | 
           | https://github.com/eddelbuettel/gsir-te
        
             | jrumbut wrote:
             | Vanilla R got a bad name but once you understand the
             | fundamentals it's quite good, fewer footguns than used to
             | be there, and I find it easier to reason about than
             | tidyverse.
        
               | clircle wrote:
               | There are dozens of us!
        
               | warlog wrote:
                | But the hexagons! Where are its hexagons?
        
           | lordgroff wrote:
            | While I like the tidyverse, I honestly have trouble using
            | it most of the time, knowing how much slower it is. The
            | speed becomes addictive: I have trouble accepting minutes
            | for operations that take seconds in DT.
           | 
            | As for the speed, Matt Dowle definitely strikes me as a
            | person who optimizes for speed. Then of course there is the
            | fact that everything is done in place, and parallelization
            | is at this point baked in. It's also mature, unlike a lot
            | of the alternatives, and has never lost sight of speed.
            | Note, for example, how in pandas, in-place operations have
            | become very much discouraged over time, and are often not
            | actually in place anyway.
           | 
            | Now, back to the tidyverse: why do you think the tidyverse
            | breaks with DT? If you enjoy the pipe, wrap DT in a
            | function (e.g. dt) that takes a data frame, ensure that
            | any DT-specific operations you need return a reference to
            | your data.table object, and off you go with something like
            | this:
            | 
            |     df %>%
            |       dt(, x := y + z) %>%
            |       unique() %>%
            |       merge(z, by = "x") %>%
            |       dt(x < a)
           | 
           | There, it looks like tidyverse, but way faster.
        
             | [deleted]
        
             | orhmeh09 wrote:
                | There are almost 200 magrittr-related issues on GitHub,
                | and I have had a bad time pairing data.table with
                | tidyverse packages (and others, because of e.g. IDate).
                | DT code is like line noise to me, but I prefer to write
                | things in it directly -- the only reason I use it is
                | because it's fast, and guessing how it's going to
                | interact with tidy stuff and NSE (especially when using
                | in-place methods) is counterproductive to that goal.
        
               | lordgroff wrote:
                | 19 of those are open, and most of them are not terribly
                | relevant. Considering the ubiquity of the package, I'd
                | say the total number of issues is shockingly low.
                | 
                | As for NSE, DT uses NSE as well, but differently of
                | course. I guess it all comes down to what we "mean" by
                | tidyverse. If we mean integration with the vast
                | majority of packages, then yeah, it will work, but of
                | course certain things are out of bounds. If you just
                | want to use data table like dplyr, then tidytable is
                | your ticket.
               | 
                | I'd argue the best thing to do, though, is to just get
               | used to the syntax. Data table looks like line noise
               | until you're really comfortable with it, then the terse
               | syntax comes across as really expressive and short. I've
               | come to like writing data table in locally scoped blocks,
               | pretty much without the pipe, and using mostly vanilla R
               | (aside from data table). I think it looks pretty good
               | actually, and I think less line noise than pandas with
               | its endless lambda lambda lambda lambda.
        
           | rcthompson wrote:
           | dplyr and related packages use the existing R data frame
           | class. (A "tibble" is just a regular R data frame under the
           | hood.) This means that it inherits all the performance
           | characteristics of regular R data frames. data.table is a
           | completely separate implementation of a data structure that
           | is functionally similar to a data frame but designed from the
           | ground up for efficiency, though with some compromises, such
           | as eschewing R's typical copy-on-modify paradigm. There are
           | other more subtle reasons for the differences, but that's the
           | absolute simplest explanation.
           | 
           | Supposedly you can use data.tables with dplyr, but I haven't
           | experimented with it in depth.
        
             | lordgroff wrote:
              | dtplyr, the dplyr backend for data.table, is still IMHO
              | not great, and will often break in subtle and not-so-
              | subtle ways. Tidytable is, I think, a much more
              | interesting implementation, and gets close to the same
              | speeds.
        
               | rcthompson wrote:
                | Hmm, this looks very interesting! I've ended up
                | preferring dplyr for its expressiveness in spite of the
                | speed difference, so this might be a nice compromise
                | for when dplyr gets too slow.
        
             | orhmeh09 wrote:
              | Oh, I know that; I use it daily and I've read some of its
              | source code. I'm just astonished that the best-performing
              | data frame library in the world is developed in R and
              | outperforms engines with million/billion-dollar companies
              | behind them.
        
               | hugh-avherald wrote:
               | data.table is written primarily in C. But R happens to
               | have a very good package system and a very good interface
               | to C code.
               | 
               | And Matt Dowle has bled for that C code.
        
               | dm319 wrote:
                | I feel like some of it is to do with the way R's
                | generics work, being Lisp-based and making use of
                | promises. It allows for nice syntax / code while
                | interfacing with the C backend.
        
             | creddit wrote:
             | > data.table is a completely separate implementation of a
             | data structure that is functionally similar to a data frame
             | but designed from the ground up for efficiency, though with
             | some compromises, such as eschewing R's typical copy-on-
             | modify paradigm.
             | 
              | This is totally false. data.table inherits from
              | data.frame. Sure, it has some extra attributes that a
              | tibble doesn't, but the way classing works in R is so
              | absurdly lightweight that that's meaningless in
              | comparison. Both tibble and data.table are data.frames
              | at their core, which are just lists of equal-length
              | vectors. You can pass a data.table wherever you pass a
              | data.frame.
        
               | rcthompson wrote:
               | Thank you for the correction. I knew that tibbles were
               | essentially just data frames with an extra class
               | attribute, but for some reason I didn't realize this was
                | also true of data.table. I think I assumed that
                | data.table's reference semantics couldn't be
                | implemented
               | on top of the existing data frame class, but I guess I'm
               | wrong about that. Unfortunately it's too late for me to
               | edit my original comment.
        
               | kkoncevicius wrote:
                | Tibbles are not just data frames with an extra class
                | attribute. For one, they don't have row names. Second,
               | consider this example, demonstrating how treating tibbles
               | as data frames can be dangerous:
                |     df_iris <- iris
                |     tb_iris <- as_tibble(iris)
                |     nunique <- function(x, colname)
                |       length(unique(x[, colname]))
                | 
                |     nunique(df_iris, "Species")
                |     > 3
                |     nunique(tb_iris, "Species")
                |     > 1
               | 
               | R-devel mailing list had a long discussion about this
               | too: https://stat.ethz.ch/pipermail/r-package-
               | devel/2017q3/001896...
        
               | rcthompson wrote:
               | Ok, fine, to be more precise, tibbles and data frames and
               | data tables are all implemented as R lists whose elements
               | are vectors which form the columns of the table. And also
               | `is.data.frame` currently returns TRUE for all of them,
               | whether or not that is ultimately correct.
        
         | ineedasername wrote:
         | Me too: I've tended to let the database do a lot of heavy
         | lifting before I bring data in. Maybe I don't actually need to
         | do that.
        
           | FridgeSeal wrote:
           | There's really no harm in doing that, and it's still a pretty
           | good idea.
           | 
            | I generally try to get my data as far as possible with the
            | database, then leave framework/language-specific things to
            | the last step. That means that, if nothing else, someone
            | else picking up your dataset with a different
            | language/framework toolset doesn't need to pick up yours
            | as a dependency, and you're not spending time
            | re-implementing what a database can already do (and can do
            | more portably).
        
             | ineedasername wrote:
              | The only downside to letting the database do some of the
              | pre-processing is that I don't have the full raw data set
              | to work with within either R or Python. If I decide I
              | need an existing measure aggregated up to a different
              | level, or a new measure, I've got to go back to the
              | database and bring in an additional query. So I have less
              | flexibility within the R or Python environment. But you
              | make a good point: there are trade-offs either way, and
              | keeping the dataset as something like a materialized view
              | on the database makes it a little more open to others'
              | usage.
        
       | snicker7 wrote:
       | Looks great, but I would like to see how this compares against
       | proprietary solutions (e.g. kdb+).
        
       | natemcintosh wrote:
       | I'm guessing Polars and Ballista (https://github.com/ballista-
       | compute/ballista) have different goals, but I don't know enough
       | about either to say what those might be. Does anyone know enough
       | about either to explain the differences?
        
         | timClicks wrote:
          | Ballista is distributed. Its author, Andy Grove, is also the
          | author of the Rust implementation of Arrow, so there will be
          | similarities between the two projects.
        
       | MrPowers wrote:
       | Looks like a cool project.
       | 
       | It's better to separate benchmarking results for big data
       | technologies and small DataFrame technologies.
       | 
       | Spark & Dask can perform computations on terabytes of data
       | (thousands of Parquet files in parallel). Most of the other
       | technologies in this article can only handle small datasets.
       | 
       | This is especially important for join benchmarking. There are
       | different types of cluster computing joins (broadcast vs shuffle)
       | and they should be benchmarked separately.
        
         | glogla wrote:
          | Yeah, nobody uses Spark because they want to; they use it
          | because beyond a certain data size, there's nothing else.
        
       | nxpnsv wrote:
       | How is data.table so amazing? I didn't expect it to be that
       | fast...
        
         | 0xdeadfeed wrote:
         | Optimized C can be very hard to match.
        
       | DangitBobby wrote:
       | If this will read a csv that has columns with mixed integers and
       | nulls _without_ converting all of the numbers to float by
       | default, it will replace pandas in my life. 99% of my problems
        | with pandas arise from ints being coerced into floats when a
        | null shows up.
        
         | em500 wrote:
          | Pass dtype={"colX": "Int64"} for the columns that you want
          | to read as a nullable integer type:
         | https://pandas.pydata.org/pandas-docs/stable/user_guide/inte...
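          | 
          | For example (a minimal sketch; the file is inlined via
          | StringIO so it's self-contained):
          | 
          |     import io
          |     import pandas as pd
          | 
          |     csv = "colX,colY\n1,a\n,b\n3,c\n"
          | 
          |     # Default: the missing value becomes NaN, colX is
          |     # coerced to float64.
          |     print(pd.read_csv(io.StringIO(csv))["colX"].dtype)
          | 
          |     # Nullable integer dtype: ints stay ints, the gap is
          |     # <NA>.
          |     print(pd.read_csv(io.StringIO(csv),
          |                       dtype={"colX": "Int64"})["colX"].dtype)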
        
         | [deleted]
        
       | lmeyerov wrote:
        | We have been rewriting our stack for multi/many-GPU scale-out
        | via Python GPU dataframes, but it's clear that smaller
        | workloads and some others would be fine on CPU (and thus free
        | up the GPUs for our other tenants), so having a good CPU impl
        | is exciting, especially if they achieve API compatibility with
        | pandas/dask as RAPIDS and others do. I've been eyeing Vaex
        | here (I think the other Rust Arrow project isn't DataFrames?),
        | so good to have a contender!
       | 
       | I'd love to see a comparison to RAPIDS dataframes for the single
       | GPU case (ex: 2 GB), single GPU bigger-than-memory (ex: 100 GB),
        | and then the same for multi-GPU. We have started to measure
        | things like "200 GB/s in-memory and 60 GB/s when bigger than
        | memory", to give perspective.
        
       | robin21 wrote:
       | Is there a data frame project for Node.js?
        
       | JPKab wrote:
       | I would love to learn the details of building a Python wrapper on
       | Rust code like you did with pypolars.
        
         | DangitBobby wrote:
          | I just did that this weekend with PyO3. The GitHub page has
          | some working examples.
        
       | nijave wrote:
        | Not sure if anything exists, but I wish something would do
        | in-memory compression + smart disk spillover. Sometimes I want
        | to work with 5-10GB compressed data sets (usually log files),
        | and decompressed that ends up being 10x (plus data structure
        | overhead). There's stuff like Apache Drill, but it's more
        | optimized for multi-node than for running locally.
        
         | cinntaile wrote:
          | What do you mean by smart disk spillover?
        
           | FridgeSeal wrote:
            | When the library attempts to load something from disk that
            | doesn't fit into memory, it transparently, and (usually)
            | without extra intervention from the user, swaps to memory-
            | mapping and chunking through the file(s).
           | 
            | Particularly useful when you've got a bunch of data that
            | doesn't fit in memory, but setting up a whole cluster is
            | not worth the overhead (operationally or otherwise), and/or
            | if you've already got a processing pipeline written in a
            | language/framework and you can't or don't want to rewrite
            | it for something distributed.
        
             | hermitcrab wrote:
             | How is that different to virtual memory?
        
               | dTal wrote:
                | Memory mapping lazily loads the file, completes
                | immediately, and scales arbitrarily. Using disk-backed
                | virtual memory instead would require the entire file
                | to be read from disk, written back out to swap, and
                | then read in from disk again on access; it would also
                | require swap to be set up at the OS level, and the
                | amount of swap configured puts a hard limit on the
                | size of the file.
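                | 
                | A quick way to see the mmap behavior from Python (a
                | sketch; "big.f64" is a hypothetical file of raw
                | float64 values):
                | 
                |     import numpy as np
                | 
                |     # Opening the map returns immediately regardless
                |     # of file size; nothing is read from disk yet.
                |     arr = np.memmap("big.f64", dtype=np.float64,
                |                     mode="r")
                | 
                |     # Touching a slice faults in only the pages that
                |     # back it.
                |     print(arr[:10].mean())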
        
               | hermitcrab wrote:
               | Thanks for clearing that up!
        
               | wtallis wrote:
               | You've described the difference between using mmap vs
               | relying on the operating system's swap mechanism. But
               | neither of those is quite the same as having an
               | application that's aware of its memory usage and
               | explicitly manages what it keeps in RAM. Using mmap may
               | be useful for achieving that, but mmap on its own still
               | leaves most of the management up to the OS.
        
         | nighthawk454 wrote:
          | Maybe give Vaex a try. It's along those lines: an
          | out-of-core data frame library that works on a single
          | machine.
         | 
         | https://github.com/vaexio/vaex
        
         | rscho wrote:
         | If you're not afraid of exotic languages, I encourage you to
          | have a look at the APL ecosystem, and especially J and its
          | integrated columnar store Jd.
         | 
         | I have just embarked on an adventure to do just what you
         | describe in... Racket. But it's nowhere to be seen yet.
         | 
         | I'm an epidemiologist and I've been wanting to make my own
         | tools for a while, now. It'll be interesting to see how far I
         | can go with Racket, which already includes many pieces of the
         | puzzle.
        
           | cfourpio wrote:
           | Would you mind pointing me at a resource that explains how
           | J/Jd handle it by comparison?
        
             | rscho wrote:
             | https://code.jsoftware.com/wiki/Jd/Overview
             | 
             | Jd only packs int vectors, though. So if you're hoping for
             | string compression then I don't know of any free solution.
              | Jd heavily leverages SIMD and mmap. Larger-than-RAM
              | columns can be easily processed by ftable. I use Jd for
              | data
             | wrangling before making models in R. Of course, the J
             | language is not for the faint of heart but it's really
             | well-suited to the task.
        
         | bobbylarrybobby wrote:
         | I wonder if Arrow (https://arrow.apache.org/faq/) would work
         | for this:
         | 
         | "The Arrow IPC mechanism is based on the Arrow in-memory
         | format, such that there is no translation necessary between the
         | on-disk representation and the in-memory representation.
         | Therefore, performing analytics on an Arrow IPC file can use
         | memory-mapping, avoiding any deserialization cost and extra
         | copies."
        
           | [deleted]
        
         | f311a wrote:
         | I think ClickHouse does that.
        
         | Enginerrrd wrote:
         | I feel like the HDF5 libraries could be helpful here if you
         | could figure out how to get compatible compression.
        
         | amatsukawa wrote:
          | Dask is popular for this purpose. No memory compression, but
          | it supports disk spillover as well as distributed
          | dataframes.
        
         | superdimwit wrote:
         | You can memory map Arrow
        
         | kthejoker2 wrote:
          | If you're doing OLAP-style queries, you should look at
          | DuckDB; it's hella fast and supports out-of-memory compute
          | (it's not exactly "smart", but it handles spillover)
        
         | sgtnoodle wrote:
         | How is that different from, say, opening up a gzipped file with
         | a reader object in python?
        
         | Keyframe wrote:
         | gzip, zcat, parallel... That sort of thing?
        
       | Hendrikto wrote:
       | > This directly shows a clear advantage over Pandas for instance,
       | where there is no clear distinction between a float NaN and
       | missing data, where they really should represent different
       | things.
       | 
       | Not true anymore:
       | 
       | > Starting from pandas 1.0, an experimental pd.NA value
       | (singleton) is available to represent scalar missing values. At
       | this moment, it is used in the nullable integer, boolean and
       | dedicated string data types as the missing value indicator.
       | 
        | > The goal of pd.NA is to provide a "missing" indicator that
        | can be
       | used consistently across data types (instead of np.nan, None or
       | pd.NaT depending on the data type).
       | 
       | (https://pandas.pydata.org/pandas-docs/stable/user_guide/miss...)
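        | 
        | For example:
        | 
        |     import pandas as pd
        | 
        |     s = pd.Series([1, None, 3], dtype="Int64")
        |     print(s.dtype)        # Int64 -- stays integer
        |     print(s[1] is pd.NA)  # True: one marker across dtypes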
        
         | wtallis wrote:
         | And much more recently (December 26, 2020):
         | https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0...
         | 
         | > Experimental nullable data types for float data
         | 
         | > We've added Float32Dtype / Float64Dtype and FloatingArray.
         | These are extension data types dedicated to floating point data
         | that can hold the pd.NA missing value indicator (GH32265,
         | GH34307).
         | 
         | > While the default float data type already supports missing
         | values using np.nan, these new data types use pd.NA (and its
         | corresponding behavior) as the missing value indicator, in line
         | with the already existing nullable integer and boolean data
         | types.
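          | 
          | The same idea extended to floats (a small sketch):
          | 
          |     import pandas as pd
          | 
          |     s = pd.array([1.5, None], dtype="Float64")
          |     print(s)
          |     # <FloatingArray>
          |     # [1.5, <NA>]
          |     # Length: 2, dtype: Float64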
        
         | kzrdude wrote:
          | I wonder how pandas can both be at version 1.0 and have an
          | experimental feature for something so central. Honest
          | question.
           | Hendrikto wrote:
           | It's at version 1.0 because it has a mature and stable
           | interface. That does not mean that it cannot have
           | experimental features which are not part of that stable
           | interface.
        
           | otsaloma wrote:
           | Because Pandas is built on top of NumPy and NumPy has never
           | had a proper NA value. I would call that a serious design
           | problem in NumPy, but it seems to be difficult to fix. There
           | have been multiple NEPs (NumPy Enhancement Proposals) over
           | the years, but they haven't gone anywhere. Probably since
           | things are not moving along in NumPy, a lot of development
           | that should logically happen at the NumPy level is now
           | happening in Pandas. But, I agree, I find it baffling how
           | Python has gotten so big in data science and been around so
           | long without having proper NA support.
           | 
           | https://numpy.org/neps/#deferred-and-superseded-neps
        
       | SatvikBeri wrote:
       | This is very cool. I'm happy to see the decision to use Arrow,
        | which should make it almost trivially easy to transfer data
        | into e.g. Julia, and maybe even to create bindings to Polars.
        
       | voceboy521 wrote:
        | Am I supposed to know what a dataframe is?
        
       | davnn wrote:
       | Does anyone have insights into why data.table is as fast as it
       | is? PS: Great work with polars!
        
       | sandGorgon wrote:
       | "Polars is based on the Rust native implementation Apache Arrow.
       | Arrow can be seen as middleware software for DBMS, query engines
       | and DataFrame libraries. Arrow provides very cache-coherent data
       | structures and proper missing data handling."
       | 
       | This is super cool. Anyone know if Pandas is also planning to
       | adopt Arrow ?
        
         | ellimilial wrote:
         | It's on their roadmap
         | https://pandas.pydata.org/docs/development/roadmap.html#apac...
        
         | SatvikBeri wrote:
         | I believe Pandas is incompatible with Arrow for a few reasons,
         | such as their indexes and datetime types. But it's pretty easy
         | to convert a pandas dataframe to Arrow and vice versa - I
         | actually use this to pass data between Python & Julia.
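          | 
          | The round trip is a one-liner in each direction, e.g.:
          | 
          |     import pandas as pd
          |     import pyarrow as pa
          | 
          |     df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
          |     table = pa.Table.from_pandas(df)  # pandas -> Arrow
          |     df2 = table.to_pandas()           # Arrow -> pandas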
         | 
         | As a side note, Wes McKinney, the creator of Pandas, is heavily
         | involved in Arrow.
        
         | sonthonax wrote:
          | It'll probably never be fully compatible, because Pandas can
          | represent Python objects and nulls (badly). However, for the
          | most part Arrow and NumPy are compatible. There's no
          | overhead in converting an Arrow data structure into a NumPy
          | one.
        
           | derriz wrote:
           | I don't think this is the case. Particularly if you move past
           | 1d numpy numeric arrays. And even in the simplest case of say
            | a 1d float32 array, Arrow arrays are chunked, which means
            | there is significant overhead if you try to use an Arrow
            | table as your data structure when using Python's
            | scientific/statistics/numerics ecosystem.
        
       | jinmingjian wrote:
        | The word "fastest" always invites doubt. You often see a
        | micro-benchmark list ten products and then claim "look, I run
        | in the shortest time".
        | 
        | The problem is that people often compare apples to oranges. Do
        | you know how to use ClickHouse correctly (there are 20-30
        | engines in ClickHouse to choose from; are you comparing an
        | in-memory engine to a database designed for disk
        | persistence?), Spark, Arrow...? How can you guarantee a fair
        | evaluation across ten or twelve products?
        
       | ineedasername wrote:
       | I'm surprised that data.table is so fast, and that pandas is so
       | slow relative to it. It does explain why I've occasionally had
       | memory issues on ~2GB data files when performing moderately
       | complex functions. (to be fair, it's a relatively old Xeon w/
       | 12GB ram) I'll have to learn the nuances of data.table syntax
       | now.
        
         | zhdc1 wrote:
         | I'm convinced that data.table is wizardry.
         | 
         | For anyone who's turned off by dt[i, j, by=k], Andrew Brooks
         | has a good set of examples at http://brooksandrew.github.io/sim
         | pleblog/articles/advanced-d.... Data Camp's Cheat Sheet is also
         | a good resource https://s3.amazonaws.com/assets.datacamp.com/bl
         | og_assets/dat....
        
         | philshem wrote:
         | Are you using Pandas >= 1.0? I noticed a big speedup without
         | changing my code.
        
       | ineedasername wrote:
       | Is this faster than a "traditional" RDBMS like SQL Server,
       | Postgres, Oracle (::uhhg:: but I have to work with it)?
        
         | aaronharnly wrote:
         | This is all in-memory afaik, so yes, this will be orders of
         | magnitude faster than an RDBMS.
        
       | de6u99er wrote:
        | I tried running your code via docker-compose. After some
        | building time, none of the notebooks in the examples folder
        | worked.
        | 
        | The notebook titled "10 minutes to pypolars" was missing the
        | pip command, which I had to add to your Dockerfile (actually
        | python-pip3). After rebuilding the whole thing and restarting
        | the notebook, I had to change "!pip" to "!pip3" (was too lazy
        | to add an alias) in the first code cell, which installed all
        | dependencies after running. All the other cells resulted in
        | errors.
        | 
        | I suggest focusing on stability and reproducibility first and
        | then on performance.
        
         | ritchie46 wrote:
          | Author here. These examples and docker-compose files are
          | heavily outdated. Please take a look at the docs for
          | up-to-date examples.
          | 
          | P.S. I do what I can to keep things up to date, but I only
          | have the time I have.
        
       | qeternity wrote:
       | > At the time of writing this blog, Polars is the fastest
       | DataFrame library in the benchmark second to R's data.table, and
       | Polars is top 3 all tools considered
       | 
        | This is a very strange way of writing "Polars is the second
        | fastest", but I guess that doesn't grab headlines.
        
         | ohazi wrote:
         | I think it's mostly a nod to the fact that R's data.table blows
         | everybody else out of the water by such a ridiculously wide
         | margin. It's like a _factor of 2_ faster than the next
         | fastest...
         | 
         | So if you're writing a dataframe library as a hobby project,
         | it's far less demotivating to use "all the other
         | implementations" as your basis for comparison, at least
         | initially.
        
           | yamrzou wrote:
           | Any idea what makes R's data.table so fast compared to the
           | others?
        
             | ZephyrBlu wrote:
              | I read somewhere else (another comment, I think) that it
              | was a ground-up implementation taking a very
              | performance-oriented approach.
              | 
              | Basically, it seemed like they _really_ got into the
              | weeds to make it super fast.
        
               | ColFrancis wrote:
               | R is from ~2000, while pandas started in 2011. Is it
               | possible that the lack of compute power had an effect on
               | the required performance characteristics?
        
               | juancb wrote:
               | R is much older than 2000, it's from 1993.
        
               | andylynch wrote:
               | And it's an implementation of S, originally from Bell
               | Labs in 1976
        
               | nojito wrote:
               | data.table is basically a highly optimized C library
               | 
               | https://github.com/Rdatatable/data.table
        
               | noir_lord wrote:
                | That's somewhat like libvips, which was started when a
                | 486 was state of the art; fast-forward, and it's an
                | image processing monster.
        
           | qeternity wrote:
           | I think a hobby project written in a general purpose language
           | being the second fastest dataframe library is a hell of an
           | accomplishment.
        
       | clircle wrote:
       | It seems like DataFrames.jl still has a ways to go before Julia
       | can close the gap on R/data.table. I don't think these benchmarks
       | include compilation time either?
        
         | superdimwit wrote:
         | It does look like the benchmarks include compilation time.
        
         | ninjin wrote:
         | Not that I am a heavy DataFrame user, but I have felt more at
          | home with the comparatively lightweight TypedTables [1]. My
         | understanding is that the rather complicated DataFrame
         | ecosystem in Julia [2] mostly stems from whether tables should
         | be immutable and/or typed. As far as I am aware there has not
         | been any major push at the compiler level to speed up untyped
         | code yet - although there should be plenty of room for
         | improvements - which I suspect would benefit DataFrames
         | greatly.
         | 
         | [1]: https://github.com/JuliaData/TypedTables.jl
         | 
         | [2]:
         | https://typedtables.juliadata.org/stable/man/table/#datafram...
        
           | pdeffebach wrote:
           | > As far as I am aware there has not been any major push at
           | the compiler level to speed up untyped code yet - although
           | there should be plenty of room for improvements - which I
           | suspect would benefit DataFrames greatly.
           | 
           | That's not quite correct. The major `source => fun => dest`
           | API as part of DataFrames.jl was designed specifically to get
           | around the non-typed container problem. And it definitely
           | works. That's not the cause of slow performance.
           | 
           | I think the reason is that, as you mentioned, DataFrames has
           | a big API and a lot of development effort is put towards
           | finalizing the API in preparation for 1.0. After that there
           | will be much more focus on performance.
           | 
            | In particular, some changes to optimize grouping may have
            | recently been merged but didn't make it into the release
            | by the time this test suite was run; those, along with
            | multi-threaded operations (which haven't been finished
            | yet), should speed things up a lot.
           | 
           | That said, this new Polars library looks seriously
           | impressive. Congrats to the developer.
        
         | SatvikBeri wrote:
         | I started using Julia in December, DataFrames are in a sort of
         | weird place because they're so much less necessary compared to
         | e.g. Python. In Julia, you could just use a dict of arrays and
         | get most of the benefits, thanks to libraries like Query.jl and
         | Tables.jl. Thus the ecosystem is a lot more spread out. I
         | actually use DataFrames much less than I used to in Python.
         | 
         | This is mostly good, because you can apply the same operations
         | on DataFrames, Streams, Time Series data, Differential
         | Equations Results, etc., but it does mean that some of the
         | specialized optimizations haven't made it into DataFrames.jl
        
           | machinelabo wrote:
           | Those arguments apply to Python as well. There is nothing
           | special about Julia that warrants your arguments.
        
             | SatvikBeri wrote:
             | I've been using Python a lot longer than I've been using
             | Julia, and this isn't really true. Python tends towards
             | much larger packages where everything is bundled together,
             | and there are fairly deep language-level reasons for that.
             | Python doesn't have major alternatives to pandas the way
             | Julia has half a dozen alternatives to DataFrames. There is
             | nothing like Query.jl that applies to all table-like
             | structures in Python.
             | 
             | In pandas, you'll see things like exponentially weighted
             | moving averages, while DataFrames.jl is pretty much just
             | the data structure.
             | 
             | The centralization of the Python ecosystem and extra
             | attention that pandas has gotten has made it much better in
             | several ways - for example, pandas's indexing makes
             | filtering significantly faster. These optimizations might
             | make it to DataFrames.jl eventually, but I doubt you'll
             | ever see the same level of centralization.
        
             | timClicks wrote:
             | I disagree. Python's data science community is strongly
             | clustered around pandas, even though it's possible to use
             | other approaches
        
       | theopsguy wrote:
        | How does it compare to DataFusion, which is also a Rust
        | project that has dataframe support?
        
         | FridgeSeal wrote:
          | Very broadly speaking, Polars is to Pandas as
          | Ballista/DataFusion is to Spark.
        
       | sradman wrote:
       | My naive interpretation - The canonical Apache Arrow
       | implementation is written in C++ with multiple language bindings
       | like PyArrow. The Rust bindings for Apache Arrow re-implemented
       | the Arrow specification so it can be used as a pure Rust
       | implementation. Andy Grove [1] built two projects on top of Rust-
       | Arrow: 1. DataFusion, a query engine for Arrow that can optimize
       | SQL-like JOIN and GROUP BY queries, and 2. Ballista, clustered
       | DataFusion-like queries (vs. Dask and Spark). DataFusion was
       | integrated into the Apache Arrow Rust project.
       | 
        | Ritchie Vink has introduced Polars, which also builds upon
        | Rust-Arrow. It offers an Eager API that is an alternative to
        | PyArrow
       | and a Lazy API that is a query engine and optimizer like
       | DataFusion. The linked benchmark is focused on JOIN and GROUP BY
       | queries on large datasets executed on a server/workstation-class
       | machine (125 GB memory). This seems like a specialized use case
       | that pushes the limits of a single developer machine and overlaps
       | with the use case for a dedicated column store (like Redshift) or
       | a distributed batch processing system like Spark/MapReduce.
       | 
       | Why Polars over DataFusion? Why Python bindings to Rust-Arrow
       | rather than canonical PyArrow/C++? Is there something wrong with
       | PyArrow?
       | 
       | [1] https://andygrove.io/projects/
        
         | ritchie46 wrote:
         | Hi Author here,
         | 
          | Polars is not an alternative to PyArrow. Polars merely uses
          | Arrow as its in-memory representation of data, similar to
          | how pandas uses numpy.
         | 
         | Arrow provides the efficient data structures and some compute
         | kernels, like a SUM, a FILTER, a MAX etc. Arrow is not a query
         | engine. Polars is a DataFrame library on top of arrow that has
         | implemented efficient algorithms for JOINS, GROUPBY, PIVOTs,
         | MELTs, QUERY OPTIMIZATION, etc. (the things you expect from a
         | DF lib).
         | 
         | Polars could be best described as an in-memory DataFrame
         | library with a query optimizer.
         | 
         | Because it uses Rust Arrow, it can easily swap pointers around
         | to pyarrow and get zero-copy data interop.
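          | 
          | Roughly like this (a hypothetical sketch; the exact
          | pypolars method names may differ):
          | 
          |     import pypolars as pl
          | 
          |     df = pl.DataFrame({"a": [1, 2, 3]})
          |     # Hand the Arrow-backed columns to pyarrow without
          |     # copying the buffers (to_arrow is assumed here).
          |     table = df.to_arrow()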
         | 
         | DataFusion is another query engine on top of arrow. They both
         | use arrow as lower level memory layout, but both have a
         | different implementation of their query engine and their API. I
          | would say that DataFusion is more focused on being a query
          | engine and Polars is more focused on being a DataFrame lib,
          | but this is subjective.
         | 
          | Maybe it's like comparing Rust Tokio vs Rust async-std: just
          | different implementations striving for the same goal. (Only
          | Polars
         | and DataFusion can easily be mixed as they use the same memory
         | structures).
        
           | sradman wrote:
            | Pandas supports JOIN and GROUP BY operators, so you are
            | saying
           | that there is a gap between Apache Arrow and other mature
           | dataframe libraries? If there is a gap, is there no plan to
           | fix it in the standard Arrow API?
           | 
           | I understand the case for a SQL-like DSL and an optimizer for
           | distributed queries (in-memory column stores, not so much).
           | I'm trying to understand the value add of Polars. I don't
           | mean to come across as critical; perhaps DataFusion is a poor
           | implementation and you are being too polite to say so.
           | 
           | I also think that there is a C++/Arrow vs Rust/Arrow decision
           | that has to be made. I associate PyArrow with the C++/Arrow
           | library. Is Polars' Eager API a superset of the PyArrow API
           | with the addition of JOIN/GROUPBY/other operators?
        
             | ritchie46 wrote:
              | There is definitely a gap, and I don't think that Arrow
              | tries to fill it. But I don't think it's wrong to have
              | multiple implementations doing the same thing, right?
              | We have PostgreSQL vs MySQL; both seem like valid
              | choices to me.
             | 
              | A SQL-like query engine has its place. An in-memory
              | DataFrame also has its place; I think the widespread use
              | of pandas proves that. I only think we can do it more
              | efficiently.
             | 
              | With regard to C++ vs Rust Arrow: the memory underneath
              | is the same, so having an implementation in both
              | languages only helps more widespread adoption IMO.
        
               | lordgroff wrote:
                | Thank you for your work! I've decided to kick the
                | tires after reading your Python book. I think you
                | underestimate the clarity of the API you have exposed,
                | which, honestly, looks a fair bit saner than the
                | tangled web that pandas is.
        
               | ritchie46 wrote:
                | Thanks, I feel so too. There is still a lot of work to
                | do. I hope that I can also bridge the gap regarding
                | utility and documentation.
        
       | tkoolen wrote:
       | [note: see more nuanced comment below]
       | 
       | The Julia benchmark two links deep at
       | https://github.com/h2oai/db-benchmark doesn't follow even the
       | most basic performance tips listed at
       | https://docs.julialang.org/en/v1/manual/performance-tips/.
        
         | SatvikBeri wrote:
         | What specifically are you thinking of?
         | 
         | The non-const global variables stand out to me, but I'm not
         | experienced enough tell whether that would make a large
         | difference.
        
           | tkoolen wrote:
           | Non-const globals could be an issue, but it's possible it
           | doesn't matter too much for this particular benchmark. I'm a
           | little worried about taking compilation time (apart from
           | precompilation) into account (would that also be done for C++
           | code?). But I must confess I maybe posted my comment a bit
           | too soon, partially because of the time of day, partially
           | because of the semicolons at the end of each line in the
           | code, which made me quickly think the benchmark writer was
           | using Julia for the first time. While I have a good amount of
           | experience with Julia, I don't have that much experience with
           | DataFrames.jl itself, so I don't know for sure whether the
           | reported benchmark times are reasonable or not.
        
             | stabbles wrote:
             | From the git history it seems like DataFrames.jl
             | maintainers contributed at least some fixes to the scripts,
             | so I guess that means they aren't opposed to it.
        
       ___________________________________________________________________
       (page generated 2021-03-14 23:02 UTC)