[HN Gopher] Scikit-Learn Version 1.0
___________________________________________________________________
Scikit-Learn Version 1.0
Author : m3at
Score : 237 points
Date : 2021-09-14 08:50 UTC (14 hours ago)
(HTM) web link (scikit-learn.org)
(TXT) w3m dump (scikit-learn.org)
| conor_f wrote:
| https://0ver.org/ will need an update!
| NeutralForest wrote:
| Excellent library with stellar documentation, I hope it'll live
| on for a long time.
| jgilias wrote:
| Yes, glad they've decided it's finally out of the "don't use
| it, it's experimental" phase and gotten off the 0ver.org wall
| of shame!
| Uberphallus wrote:
| Really, for most other ML libraries the best documentation is
| how-tos spread across the web, but scikit-learn's own docs are
| good enough that they leave very little room for that kind of
| content.
| sveme wrote:
| Best-documented library around. It even provides examples,
| guidance, and best practices in the documentation. I have
| rarely learned so much as when I went through the scikit-learn
| documentation. An absolute delight.
| mistrial9 wrote:
| you mean the 4000 page cookbook thing?
| infimum wrote:
| scikit-learn (next to numpy) is the one library I use in every
| single project at work. Every time I consider switching away
| from Python I am faced with the fact that I'd lose access to
| this workhorse of a library. Of course it's not all sunshine
| and rainbows - I had my fair share of rummaging through its
| internals - but its API design is a de facto standard for a
| reason. My only recurring gripe is that the serialization story
| (basically just pickling everything) is not optimal.
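|
| For illustration, a minimal sketch of what that story currently
| looks like (joblib-flavoured pickling; compact-ish, but
| version-fragile and unsafe to load from untrusted sources):
|
|     import joblib
|     from sklearn.datasets import make_classification
|     from sklearn.ensemble import RandomForestClassifier
|
|     X, y = make_classification(n_samples=1000, random_state=0)
|     model = RandomForestClassifier(n_estimators=100).fit(X, y)
|
|     # dump/load are essentially pickle with optional compression
|     joblib.dump(model, "model.joblib", compress=3)
|     model = joblib.load("model.joblib")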
| zeec123 wrote:
| There is so much wrong with the API design of sklearn (how can
| one think "predict_proba" is a good function name?). I can
| understand this, since most of it was probably written by PhD
| students without the time and expertise to come up with a
| proper API, many of them without a CS background. [1]
|
| [1]
| https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...
| mrtranscendence wrote:
| I didn't want to bag on sklearn (I've already bagged on
| pandas enough here), but for what it's worth I agree with
| you. It's, ahh, not the API I would've come up with. It's
| what everybody has standardized on, though, and maybe there's
| some value in that.
| kzrdude wrote:
| What's a typical task you do with sklearn? Just trying to get
| inspired about what it can do.
| CapmCrackaWaka wrote:
| I recently ran into this issue as well. Serialization of
| sklearn random forests results in absolutely massive files. I
| had to switch to lightgbm, which is 100x faster to load from a
| save file and about 20x smaller.
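|
| For reference, a minimal sketch of the LightGBM round trip
| (plain-text model dump, no pickle involved):
|
|     import lightgbm as lgb
|     from sklearn.datasets import make_classification
|
|     X, y = make_classification(n_samples=1000, random_state=0)
|     booster = lgb.train({"objective": "binary"},
|                         lgb.Dataset(X, label=y))
|
|     booster.save_model("model.txt")  # compact text file
|     booster = lgb.Booster(model_file="model.txt")  # fast reload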
| XoS-490 wrote:
| What about sktime? https://github.com/alan-turing-
| institute/sktime
| lr1970 wrote:
| Early on, pandas made some unfortunate design decisions that are
| still biting hard. For example, the choice of datetime
| (pandas.Timestamp) represented by a 64-bit int with a fixed
| nanosecond resolution. This choice gives a dynamic range of
| +- 292 years around 1970-01-01 (the epoch). This range is too
| small to represent the works of William Shakespeare, never mind
| human history. Using pandas in these areas becomes a royal pain
| in the neck, for one constantly needs to work around pandas
| datetime limitations.
|
| OTOH, in numpy one can choose the time resolution unit (anything
| from attoseconds to years), tailoring time resolution to your
| task (from high energy physics all the way to astronomy).
| Pandas' choice is only good for high-frequency stock traders,
| though.
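|
| A minimal sketch of the contrast:
|
|     import numpy as np
|     import pandas as pd
|
|     # numpy: the unit is part of the dtype, trading resolution
|     # for range
|     print(np.datetime64("1601-01-01", "D"))  # Shakespeare-era
|                                              # date: fine
|
|     # pandas: fixed nanosecond resolution packed into an int64,
|     # covering roughly 1677-2262 only
|     try:
|         pd.Timestamp("1601-01-01")
|     except pd.errors.OutOfBoundsDatetime as e:
|         print(e)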
| Bostonian wrote:
| Pandas was started by a quant working for AQR Capital, so it's
| not surprising if "Pandas' choice is only good for
| high-frequency stock traders".
| lr1970 wrote:
| An illustrative example of how reasonable short-term and
| narrow-scope considerations can turn out really bad in the long
| term and/or at a larger scope.
| minsc__and__boo wrote:
| This assumes that all projects should be built with the
| larger scope in mind.
|
| Sometimes you just need a shovel, not a Bagger 288.
| [deleted]
| nojito wrote:
| Why should he care about other use-cases?
|
| It's not his responsibility to make sure his package is as
| widely applicable as possible before open-sourcing it.
| lr1970 wrote:
| The problem is not with Wes's original decision but with the
| fact that it was never revisited even when pandas took off at a
| much larger scope. It should have been fixed before the 1.0
| release.
| nojito wrote:
| This belief is quite common in the open-source space.
|
| It's far easier to criticize than it is to submit a pull
| request.
| kickopotomus wrote:
| It's like there is some strange belief now that software
| should be "finished" before a 1.0 version. When did that start?
| anigbrowl wrote:
| I'm glad you posted about this because I didn't know, but
| my reflexive response was 'well guess that won't work for
| [project idea], guess I'll roll my own or just use the
| NumPy version.'
|
| I personally don't mind the lack of one-size-fits-all. If
| Pandas were part of the Python standard library I think you'd
| have a stronger argument, since the unspoken premise of a
| standard library is that you can leave for a desert island with
| only it and your IDE and still get things done.
| nxpnsv wrote:
| Most data is not 300 years old or in the distant future; in
| fact, ranges within 1970 +- 292 years are very common. That is
| to say, pandas' choice is good for lots of people, including
| well outside high-frequency stock trading.
| lr1970 wrote:
| > Most data is not 300 years old or in the distant future; in
| fact, ranges within 1970 +- 292 years are very common.
|
| In what domains? Astronomy, geology, and history call for a
| larger time range. Laser and high-energy physics need
| femtosecond rather than nanosecond resolution. My point is that
| a fixed time resolution, whatever it is, is a bad choice. Numpy
| explicitly allows selecting the time resolution unit, and this
| is the right approach. BTW, numpy is a pandas dependency and
| predates it by several years.
| monkeybutton wrote:
| Scikit-learn is great, and reading the documentation for other
| third-party ML packages and seeing the words "scikit-learn API"
| is even better.
| zibzab wrote:
| Is anyone using scikit for NN?
|
| Why/why not?
| FlyingSaucer wrote:
| I have used the MLP classifier[1] before. It's very simple to
| use (like most of sklearn's models). It worked well for standard
| and reasonably small classification models, but it lacks some
| features needed to make it a flexible way of using NNs:
|
| - No saving of checkpoints (crucial for large models that need
| a lot of compute and time)
|
| - No way to assign different activation functions to different
| layers
|
| - No complex units like LSTM or GRU
|
| - No way to implement complex architectures like transformers,
| encoders, etc.
|
| I also do not know if it's even possible to use CUDA or any GPU
| with it.
|
| [1] : https://scikit-
| learn.org/stable/modules/generated/sklearn.ne...
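|
| For reference, a minimal sketch of the ease-of-use side (note
| the single `activation` argument covering every hidden layer,
| which is exactly the inflexibility above):
|
|     from sklearn.datasets import make_classification
|     from sklearn.neural_network import MLPClassifier
|
|     X, y = make_classification(n_samples=500, random_state=0)
|     clf = MLPClassifier(hidden_layer_sizes=(64, 32),  # 2 layers
|                         activation="relu",  # shared by all
|                         max_iter=500,
|                         random_state=0).fit(X, y)
|     print(clf.score(X, y))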
| kuu wrote:
| I would say the same as you. As long as you only need a
| simple model, yes, the MLP is good enough, but forget about
| doing any deep learning stuff.
|
| And AFAIK there isn't GPU support; CPU performance is poor
| compared to GPU execution.
| westurner wrote:
| There are scikit-learn (sklearn) API-compatible wrappers for
| e.g. PyTorch and TensorFlow.
|
| Skorch: https://github.com/skorch-dev/skorch
|
| tf.keras.wrappers.scikit_learn:
| https://www.tensorflow.org/api_docs/python/tf/keras/wrappers...
|
| AFAIU, there are no Yellowbrick visualizers for PyTorch or
| TensorFlow, though PyTorch and TensorFlow work with TensorBoard
| for visualizing CFG execution.
|
| > _Many machine learning libraries implement the scikit-learn
| `estimator API` to easily integrate alternative optimization or
| decision methods into a data science workflow. Because of this,
| it seems like it should be simple to drop in a non-scikit-learn
| estimator into a Yellowbrick visualizer, and in principle, it
| is. However, the reality is a bit more complicated._
|
| > _Yellowbrick visualizers often utilize more than just the
| method interface of estimators (e.g. `fit()` and `predict()`),
| relying on the learned attributes (object properties with a
| single underscore suffix, e.g. `coef_`). The issue is that when
| a third-party estimator does not expose these attributes, truly
| gnarly exceptions and tracebacks occur. Yellowbrick is meant to
| aid machine learning diagnostics reasoning, therefore instead
| of just allowing drop-in functionality that may cause
| confusion, we've created a wrapper functionality that is a bit
| kinder with its messaging._
|
| Looks like there are Yellowbrick wrappers for XGBoost,
| CatBoost, CuML, and Spark MLlib, but not for NNs yet.
| https://www.scikit-yb.org/en/latest/api/contrib/wrapper.html...
|
| From the RAPIDS.ai CuML team:
| https://docs.rapids.ai/api/cuml/stable/ :
|
| > _cuML is a suite of fast, GPU-accelerated machine learning
| algorithms designed for data science and analytical tasks. Our
| API mirrors Sklearn's, and we provide practitioners with the
| easy fit-predict-transform paradigm without ever having to
| program on a GPU._
|
| > _As data gets larger, algorithms running on a CPU become
| slow and cumbersome. RAPIDS provides users a streamlined
| approach where data is initially loaded in the GPU, and
| compute tasks can be performed on it directly._
|
| CuML is not an NN library, but there are likely performance
| optimizations from CuDF and CuML that would accelerate NN
| pipelines as well.
|
| Dask ML works with models with sklearn interfaces, XGBoost,
| LightGBM, PyTorch, and TensorFlow: https://ml.dask.org/ :
|
| > _Scikit-Learn API_
|
| > _In all cases Dask-ML endeavors to provide a single unified
| interface around the familiar NumPy, Pandas, and Scikit-Learn
| APIs. Users familiar with Scikit-Learn should feel at home with
| Dask-ML._
|
| dask-labextension for JupyterLab helps to visualize Dask ML
| CFGs which call predictors and classifiers with sklearn
| interfaces: https://github.com/dask/dask-labextension
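|
| To make the skorch wrapper concrete, a minimal sketch (per the
| skorch docs the module returns probabilities; skorch's default
| NLLLoss criterion applies the log internally):
|
|     import numpy as np
|     import torch.nn as nn
|     from skorch import NeuralNetClassifier
|     from sklearn.datasets import make_classification
|     from sklearn.model_selection import GridSearchCV
|
|     class MLP(nn.Module):
|         def __init__(self, hidden=32):
|             super().__init__()
|             self.net = nn.Sequential(
|                 nn.Linear(20, hidden), nn.ReLU(),
|                 nn.Linear(hidden, 2), nn.Softmax(dim=-1))
|
|         def forward(self, X):
|             return self.net(X)
|
|     X, y = make_classification(n_samples=500, n_features=20,
|                                random_state=0)
|     X, y = X.astype(np.float32), y.astype(np.int64)
|
|     net = NeuralNetClassifier(MLP, max_epochs=10, lr=0.1,
|                               verbose=0)
|     # fit/predict compatibility means plain sklearn tooling
|     # (pipelines, grid search) just works:
|     gs = GridSearchCV(net, {"module__hidden": [16, 32]}, cv=3)
|     gs.fit(X, y)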
| armcat wrote:
| NN as in "neural network", or NN as in "nearest neighbour"
| algorithm? No to the former, yes to the latter. The reason for
| a "no" to neural networks - in my case I've only ever
| implemented neural networks with many layers, and typically
| using kernels, pooling mechanisms, etc, and since scikit-learn
| doesn't have GPU support, I opt for frameworks that do
| (PyTorch, TensorFlow). However, if you're only building fully-
| connected neural nets (MLPs), with just a few layers, you don't
| need GPU support since any benefits of having parallel
| processing are offset by shuffling data between CPU and GPU. So
| in that case, scikit-learn would probably work quite well,
| although I never tested this myself.
| woko wrote:
| A GPU can be useful for nearest neighbours as well. If you
| have access to one, I would strongly recommend Facebook's FAISS
| [1,2]. For everything else, sklearn is amazing.
|
| [1] https://faiss.ai/ [2]
| https://github.com/facebookresearch/faiss
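|
| Roughly, a minimal FAISS sketch (exact L2 search; the GPU index
| wrappers follow the same pattern):
|
|     import numpy as np
|     import faiss
|
|     d = 64
|     xb = np.random.rand(100_000, d).astype("float32")  # database
|     xq = np.random.rand(5, d).astype("float32")        # queries
|
|     index = faiss.IndexFlatL2(d)  # exact (brute-force) L2 index
|     index.add(xb)
|     D, I = index.search(xq, 10)   # distances + ids of 10 NNs
|                                   # per query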
| armcat wrote:
| Faiss looks very nice, thanks for the tip!
| woko wrote:
| See the FAQ: https://scikit-learn.org/dev/faq.html#will-you-
| add-gpu-suppo...
| [deleted]
| jptech wrote:
| You should be asking if anyone is using it for ML, not NN.
| detaro wrote:
| why?
| lysecret wrote:
| Excellent library for train_test_split. Jokes aside, this,
| next to NumPy, Pandas, Jupyter, and Matplotlib plus the DL
| libraries, is the reason Python is the powerhouse it is for
| data science.
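|
| (For the uninitiated, the joke is that this one-liner opens
| basically every notebook:)
|
|     from sklearn.datasets import make_classification
|     from sklearn.model_selection import train_test_split
|
|     X, y = make_classification(n_samples=1000, random_state=0)
|     X_train, X_test, y_train, y_test = train_test_split(
|         X, y, test_size=0.2, random_state=42)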
| disgruntledphd2 wrote:
| I'm with you on sklearn, the DL libraries and Numpy, but Pandas
| and Matplotlib are poor, poor relations of the tools available
| in the R ecosystem (dplyr/ggplot etc).
| pc86 wrote:
| If you're doing data science aren't sklearn, DL, and numpy
| getting you 90% of the way there anyway? Even if R has better
| "versions" of pandas/matplotlib (not conceding that point)
| it's not exactly central to the job of data science.
| disgruntledphd2 wrote:
| > If you're doing data science aren't sklearn, DL, and
| numpy getting you 90% of the way there anyway?
|
| Not really, tbh. Most of my jobs (even when the primary
| output was models) require spending a _lot_ of time data
| wrangling and plotting. R is much, much better for this
| kind of exploratory work.
|
| But if I need to integrate with bigger systems (as I normally
| do), there's a stronger push for Python, to reduce complexity
| and make it easier for SEs to understand and maintain (some of)
| the code.
| civilized wrote:
| As a working data scientist I'd say it's completely the
| opposite: a good tabular data manipulation package is the
| single most valuable tool in my tool box. And R's packages
| (either data.table or dplyr) are definitely way better than
| pandas. There's no comparison.
|
| I would be hard-pressed to find a working data scientist
| whose definition of data science is "that thing you do with
| sklearn, Deep Learning and Numpy".
| jstx1 wrote:
| > "Data science is that thing where you do sklearn, Deep
| Learning and Numpy" is not a working data scientist's
| perspective.
|
| It could be. It's such a broad job title and it looks so
| different across different companies and teams that the
| main tool for one data scientist might be something that
| another data scientist never has to touch. Different data
| science jobs prioritise different tools, that's all.
| civilized wrote:
| Right, so defining data science as 90% sklearn+DL+numpy
| is just as silly as saying that it's 90% table
| manipulation. That's exactly my point.
|
| Still, if anyone here has managed to find a data science
| job in which tabular data management is not a sizable
| piece of what you do, I'd like to know some details!
| _Wintermute wrote:
| I worked as a data scientist for a couple of years and tabular
| data was a very small part of my job. I spent far more time on
| image analysis and JSON, both of which I found R sucks at.
| mrtranscendence wrote:
| I imagine there are data scientists who operate primarily
| on unstructured rather than tabular data. Part of my
| current job involves stuff like text classification, and
| it's not that difficult to imagine someone for whom
| that's a more sizable proportion of their day-to-day.
|
| Still, my suspicion -- at least from my corner of data
| science -- is that such individuals are rare, and that
| most data scientists do make use of tabular data more
| often than not.
| civilized wrote:
| I totally get what you mean - I would suspect that when
| you work with unstructured data, tabular data
| manipulation is maybe 20-40% of what you do, and when you
| work with structured data, it's more like 60-80%.
| pletnes wrote:
| Tabular data is great for many usecases, but saying that
| image, audio, and video analysis is not data science
| seems like a weird variant of gatekeeping to me.
| disgruntledphd2 wrote:
| > Tabular data is great for many usecases, but saying
| that image, audio, and video analysis is not data science
| seems like a weird variant of gatekeeping to me.
|
| Most problems are mostly tabular, IME.
|
| I completely agree that text, images and video are much,
| much better handled by Python (that's why I use and know
| both).
| civilized wrote:
| Of course, my post doesn't imply such a silly statement.
| bllguo wrote:
| maybe we are casualties of the vague definition of "data
| science," but in my experience numpy is too low-level for
| most of what I consider DS, and pandas/matplotlib are
| _much_ more central than sklearn or pytorch. Even if your
| definition only encompasses deep learning research, surely
| plotting is still indispensable?
|
| I'll also add my vote for the superiority of data.table and
| ggplot2 over any Python alternatives. The bloat and verbosity
| of pandas is a daily struggle.
| MichaelRazum wrote:
| Just curious: in which way is data.table superior to pandas?
| Really interested in it! From my personal experience, pandas is
| just sometimes a bit slow.
| nojito wrote:
| data.table is faster to write and faster to run.
|
| https://h2oai.github.io/db-benchmark/
| bllguo wrote:
| I just love how much more terse and fast it is, someone
| else linked a benchmark below. There's definitely a
| learning curve though.
|
| If you already think pandas is slow I think you'll be
| surprised how much more strongly you feel after using
| data.table!
| mrtranscendence wrote:
| I'm more a dplyr man myself, but data.table is much
| faster than pandas, most noticeably IMO when reading
| large files. It's also extremely succinct if you're into
| that sort of thing (though I find it a bit obfuscated).
| pandas is a lot of things, but "fast" and "concise" are
| not two of them.
| MichaelRazum wrote:
| Got it. Regarding speed, you have something like Vaex on the
| Python side (but I'm not sure how fast it really is). For me,
| the most issues I had with pandas were with its MultiIndex.
| mrtranscendence wrote:
| > For me, the most issues I had with pandas were with its
| MultiIndex.
|
| Yessss. I loathe indices, and have never been in a
| situation where I was better off with them than without
| them.
|
| > Regarding speed, you have something like Vaex on the Python
| side
|
| I've never used Vaex, but I've used datatable
| (https://github.com/h2oai/datatable) and polars
| (https://github.com/pola-rs/polars). Polars is my
| favorite API, but datatable was faster at reading data
| (Polars was faster in execution). I'll have to give Vaex
| a try at some point.
| civilized wrote:
| Pandas is the PHP of data science. Pretty badly designed,
| but immensely popular because it got there first and had
| no real competition (in Python) for years.
| optimalsolver wrote:
| Of possible interest, a C++ replacement for Pandas:
|
| https://github.com/hosseinmoein/DataFrame
| pantsforbirds wrote:
| I'm surprised you don't like pandas. I've found it to be a
| pretty easy-to-use and useful tool, and you can almost always
| use something like Dask (or, if you're lucky, cuDF from
| RAPIDS) if you need better performance.
|
| I will say that my very first "real" programming experience
| was Matlab at a research internship, so maybe I just got used
| to working in vectors and arrays for computational tasks.
| baron_harkonnen wrote:
| > maybe I just got used to working in vectors and arrays for
| computational tasks.
|
| Have you worked with R? R, like Matlab, natively supports
| vector-based operations; in fact, all values in R are vectors.
| Many of the problems with pandas ultimately boil down to the
| fact that you have to replicate this experience without truly
| being in a vector-based language.
| jfarina wrote:
| NumPy has a better API than pandas if we're strictly talking
| about vectors and arrays.
|
| Pandas indexing makes sense once you get it, but it does
| seem to require a lot more words than equivalent statements
| in R.
|
| My primary language is python, but I have been picking up
| some R.
| kzrdude wrote:
| They need seaborn too, which makes the Python side a lot
| stronger.
| disgruntledphd2 wrote:
| plotnine FTW! You'll pry ggplot2 from my cold, dead hands.
| boringg wrote:
| Wait, how many companies are actually using R in the wild? As
| I understand it, R was born of academia: great for
| statistics/analysis, but it breaks down on data manipulation
| and isn't used in production/data engineering. Maybe my
| understanding is dated though?
| baron_harkonnen wrote:
| Of the many companies I've done data science with, I can only
| think of a few rare exceptions where R wasn't used as much as,
| if not more than, Python.
|
| If you're mostly dealing with Neural Nets you won't see
| much R, but for anything really statistical in nature R is
| a much better tool than Python. For anything that ends up
| in a report R is much better than Python (a lot of very
| valuable data science work ends up being a report to non-
| technical people).
|
| > breaks down on data manipulation
|
| This is very outdated. The tidyverse ecosystem has bumped R
| back to being best in class for data manipulation. This
| becomes less true as you get further and further from having
| your data in a matrix/df (I can't imagine doing Spark queries
| in R), but if you already have a basic data frame,
| manipulation from there is very easy.
|
| Even for things that end up in production, whether you're
| in R or Python, whatever your first pass is should always
| be a prototype and will have to be reworked before you get
| close to moving it to production.
| [deleted]
| dagw wrote:
| _Wait how many companies are actually using R in the wild?_
|
| Depends on your definition. While not very often 'deployed' in
| 'production', I know lots of places in all kinds of industries
| where people reach for R as soon as they have to look at some
| new data.
| lysecret wrote:
| I used to work in insurance, and we used it heavily.
| disgruntledphd2 wrote:
| > Wait how many companies are actually using R in the
| wild? As I understand it, R is born of academia, great
| for statistics/analysis but breaks down on data
| manipulation and isn't used in production/data
| engineering.
|
| It depends, I've worked in some places where R was the
| core part of their data infrastructure. Data manipulation
| (of non text) is far, far better in R.
|
| Integrating with other systems can be tricky though, and
| you don't have the wide variety of Python libraries
| available for core SE tasks, so it can often make sense
| to use Python even though it's not as good for a lot of
| the core work.
|
| Additionally, R is a very, very flexible language (like
| Python), but without strong community-led norms (unlike
| Python), so it's pretty easy to make a mess with it.
|
| Finally, when you need to hand over stuff to software
| engineers, they overwhelmingly prefer Python, so Python often
| ends up being used just to make that handover easier.
|
| Like, in R there's a core tool called broom which will
| pull out the important features of a model and make it
| really easy to examine them with your data. There's
| nothing comparable in Python, and I miss it so so much
| when I use Python.
|
| That being said, working with strings is much much nicer
| in Python, and pytest is the bomb, so there's tradeoffs
| everywhere.
| mrtranscendence wrote:
| It's unrelated to your main point, but:
|
| > Additionally, R is a very, very flexible language (like
| Python)
|
| I'd argue that R is much more flexible than Python
| syntactically. There's a reason that every attempt at
| recreating dplyr in Python ends in a bit of a mess (IMO)
| -- Python just doesn't allow the sort of metaprogramming
| you'd require for a really nice port. Something as simple
| as a general pipe operator can't be defined in Python, to
| say nothing of how dplyr scopes column names within
| verbs.
|
| Arguably this does allow you to go crazy in a way that
| ends up being detrimental to readability, but I'd say
| overall it's a net benefit to R over Python. I really
| miss this stuff and have spent an undue amount of time
| thinking of the best way to emulate it (only to come up
| with ideas that just disappoint).
|
| > Finally, when you need to hand over stuff to software
| engineers, they vastly tend to prefer Python
|
| Indeed, this is maybe 50% of the reason my organization
| has pushed R to the sidelines over the past few years. We
| used to be very heavily into R but now it has "you can
| use it, but don't expect support" status.
| mrtranscendence wrote:
| (Replying to disgruntledphd2)
|
| > Well that's just lazy evaluation of function arguments,
| which can't be done in Python.
|
| "Just lazy evaluation"! :) It's a pretty big deal. This
| is three-fifths of the way to a macro system.
|
| > But if take a look at the Python data model, it does
| seem super, super flexible.
|
| Sure, you can have a lot of control over the behavior of
| Python objects (some techniques of which remain obscure
| to me even after using Python for many years). But you
| don't have anything like syntactic macros. You can define
| a pipe operator with macropy, though -- it's pretty easy.
| But macropy is basically dead now I think (and a total
| hack).
|
| > You'll still need strings for column names in any dplyr
| port though, because of the function argument issue.
|
| This is major, though, because you can't do this:
|
|     mutate(df, x="y" + "z")
|
| You have to do something like what dfply does, defining an
| object that overloads addition, subtraction, etc.:
|
|     mutate(df, x=X.y + X.z)
|
| But that hits corner cases quickly. What if you want to call a
| regular Python function that expects numeric arguments? This
| won't work:
|
|     mutate(df, x=f(X.y))
|
| Granted, this only really works in R because it's
| easy to define functions that accept and return vectors.
| So in that sense it's kind of a leaky abstraction. But
| you couldn't even get that far in Python, because X.y
| isn't a vector ... it's a kind of promise to substitute a
| vector.
|
| Give Python macros, I say! To hell with the consequences!
| dragonwriter wrote:
| > Sure, you can have a lot of control over the behavior
| of Python objects (some techniques of which remain
| obscure to me even after using Python for many years).
| But you don't have anything like syntactic macros.
|
| Not yet, but there's a PEP for that:
|
| https://www.python.org/dev/peps/pep-0638/
| mrtranscendence wrote:
| Nice, I'd love for this to see the light of day. I
| suspect it'll see some resistance (even pattern matching
| caused conflict, and I thought that was terribly
| innocuous).
|
| (Why can I reply at this level of nesting now, whereas
| before I couldn't?)
| disgruntledphd2 wrote:
| I'm totally with you on these points, and it's one of the
| places where R's genesis as a Scheme program has led to
| really, really good consequences.
|
| Fundamentally though, both DS Python and R are abstractions
| over well-tested Fortran linear algebra routines (I'm sort of
| kidding, but only sort of).
| disgruntledphd2 wrote:
| > I'd argue that R is much more flexible than Python
| syntactically. There's a reason that every attempt at
| recreating dplyr in Python ends in a bit of a mess (IMO)
| -- Python just doesn't allow the sort of metaprogramming
| you'd require for a really nice port. Something as simple
| as a general pipe operator can't be defined in Python, to
| say nothing of how dplyr scopes column names within
| verbs.
|
| Well, that's just lazy evaluation of function arguments, which
| can't be done in Python. But if you take a look at the Python
| data model, it does seem super, super flexible. You'll still
| need strings for column names in any dplyr port though, because
| of the function argument issue.
|
| Like, both Python/R derive from the CLOS approach (Art of
| the Metaobject Protocol), but R retains a lot more of the
| lispy goodness (but Python's implementation is easier to
| use).
| Tarq0n wrote:
| It's mostly the "in production" part that determines
| whether R is suitable for a business or not. It's much more
| complicated to avoid runtime errors or do proper testing in
| R, whereas it shines for interactive use, or generating
| reports.
|
| That said, having used both, the DSLs for plotting and data
| wrangling in the R package ecosystem are vastly superior to
| pandas and the Python plotting libraries. For modeling I
| actually like the better namespacing of Python, which helps
| keep things more legible when there are a ton of model options
| to choose from, assuming you don't need cutting-edge
| statistics.
| mrtranscendence wrote:
| > It's much more complicated to avoid runtime errors or
| do proper testing in R
|
| It's not that much harder. There's no pytest, but
| testthat works well enough. I've developed a few packages
| internally in R and wouldn't say it was that much harder
| to ensure correctness than for the corresponding Python
| packages. (We used to keep them in sync, before basically
| moving everything to Python.)
| disgruntledphd2 wrote:
| I actually quite like R's error handling. It's as good as
| Common Lisp's which is often held up as the epitome of
| this.
|
| You also have the dump.frames option, which will save
| your workspace on failure, which is incredibly useful
| when running R stuff remotely/in a distributed fashion.
| dachryn wrote:
| R is everywhere, especially when you need to visualize
| stuff. In my experience it is primarily used in teams who are
| trying to get rid of SAS.
|
| You are right in the sense that R is typically not used
| end-to-end as far as I can tell; it usually starts from a data
| connection to some sort of dump, data lake, or data warehouse.
|
| Many people on my team use Python for modelling but grab
| ggplot in whatever way they can to make their presentations
| and visuals (they all use different methods, usually something
| messy like mixing Python and R in a notebook or so). ggplot2
| also has a vast library of super-high-quality plugins.
|
| Python is far, far behind in the viz space.
| RobinL wrote:
| I don't think it's correct to say Python is far behind in
| the viz space at all. It's just different.
|
| I primarily use Altair within Python. ggplot is ahead of
| Altair in some respects, but behind in others.
|
| For example, here is a chart which can be made in Altair:
|
| https://altair-
| viz.github.io/gallery/seattle_weather_interac...
|
| Note:
|
| - You can brush over the date range to filter the bar
| chart
|
| - You can click on weather type to filter the scatter
| chart
|
| - It can be embedded in any webpage with these interactive
| elements intact. Since the chart is represented by JSON and
| rendered by JavaScript, the spec also embeds the data within
| the chart itself, which lets the user change the chart however
| they want.
|
| You can even build something like gapminder:
| https://vega.github.io/vega-
| lite/examples/interactive_global...
|
| More examples here: https://altair-viz.github.io/gallery/
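|
| To give a flavour, a minimal sketch of a linked brush in Altair
| (assuming the companion vega_datasets package for the example
| data):
|
|     import altair as alt
|     from vega_datasets import data
|
|     source = data.seattle_weather()
|     brush = alt.selection_interval(encodings=["x"])  # drag over
|                                                      # the dates
|
|     points = alt.Chart(source).mark_point().encode(
|         x="date:T", y="temp_max:Q", color="weather:N"
|     ).add_selection(brush)
|
|     bars = alt.Chart(source).mark_bar().encode(
|         x="count()", y="weather:N", color="weather:N"
|     ).transform_filter(brush)
|
|     # stacked vertically; saves a self-contained interactive
|     # HTML page
|     (points & bars).save("weather.html")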
| mrtranscendence wrote:
| There are Python ports of ggplot (e.g. plotnine
| (https://github.com/has2k1/plotnine)), but agreed, Python
| is behind here. I'm not the best at data viz, but I can
| usually piece together a way to make ggplot do what I
| want it to do without that much trouble or looking at
| documentation.
|
| Matplotlib, though ... that's a harder beast to
| internalize. I know it's possible to make high-quality
| matplotlib plots, but it's much harder for me. Like
| pandas, it's a library that I don't want to denigrate
| because I know people put lots of effort into it, but I
| can't lie -- I'm not a fan.
| mindv0rtex wrote:
| Speaking of what's possible in matplotlib, I am very much
| looking forward to reading this book:
| https://github.com/rougier/scientific-visualization-book
| nojito wrote:
| R with data.table and collapse blows away the competition in
| terms of tabular data wrangling, both in conciseness and in
| performance.
|
| https://h2oai.github.io/db-benchmark/
| baron_harkonnen wrote:
| I used to very strongly agree with you re: matplotlib, but
| I've recently switched from using almost exclusively ggplot2
| to almost exclusively Matplotlib, and my realization is that
| they are very different tools serving very different purposes.
|
| ggplot2 is obviously fantastic and makes beautiful plots, and
| very easily at that. However, it is definitely a "convention
| over configuration" tool. For 99% of the typical plots you
| might want to create, ggplot is going to be easier and look
| nicer.
|
| However, matplotlib really shines when you want to make very
| custom plots. If you have a plot in your mind that you want to
| see on paper, matplotlib will be the better tool for helping
| you create exactly what you are looking for.
|
| For certain projects I've done, where I want to do a bunch of
| non-standard visualizations, especially ones that tend to be
| fairly dense, I prefer matplotlib. For day-to-day analytics,
| ggplot2 is so much better it's ridiculous. The real issue is
| that Python doesn't really offer anything in the same league
| as ggplot2 for "convention over configuration" type plotting.
|
| Fully agree on Pandas. R's native data frame + tidyverse is
| worlds easier. Pandas' overly complex indexing system is a
| persistent source of annoyance no matter how much I use the
| library.
| bttger wrote:
| > Fully agree on Pandas. R's native data frame + tidyverse
| is world's easier. Pandas' overly complex indexing system
| is a persistent source of annoyance no matter how much I
| use that library.
|
| Is it just the syntax/readability that annoys you, or are
| there actually problems that take n more steps to do the same
| thing in pandas?
| chaps wrote:
| I spend more time working around pandas' strange isms than it
| takes me to write vanilla Python that does the same thing. The
| index problems are not just small annoyances; sometimes they
| waste hours because of awkward defaults. For example,
| df.to_csv defaults to writing the index (without a column
| name..)! It doesn't make any sense to me whatsoever that
| reading a CSV and then writing it back out would add a new
| column. I'm really tired of rerunning pandas code after I
| forget to turn that stupid default index setting off. Is that
| a small thing? Sure. But it has _tons_ of small things like
| that.
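|
| A minimal illustration of that default:
|
|     import pandas as pd
|
|     df = pd.DataFrame({"a": [1, 2]})
|     df.to_csv("out.csv")  # index written as an unnamed first
|                           # column
|     print(pd.read_csv("out.csv").columns)  # ['Unnamed: 0', 'a']
|     df.to_csv("out.csv", index=False)  # the flag you always
|                                        # forget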
| _Wintermute wrote:
| It's funny you complain about the index being saved in
| csv files, which is the default behaviour in R.
| tgb wrote:
| Matplotlib is my go-to despite being mediocre. I recently
| found the proplot library, built on top of it, which seems to
| solve a lot of the warts (particularly around figure layout
| with subplots and legends). I haven't had a chance to use it
| yet - does anyone know if it's worth it?
|
| I like to stick to basic, widely used tools when possible so
| I'm biased against it versus just wrangling it out with
| matplotlib. But proplot does look compelling, like it was
| written for exactly my complaints.
| lysecret wrote:
| Hehe, I used to do R. IMO you are right about ggplot, but I
| strongly disagree about pandas. I f*ing love it. Would love to
| understand your troubles with it though; after using it daily
| for 4 years maybe I can offer some perspective ;)
| andreareina wrote:
| I run into pandas edge cases all the time: pd.concat() failing
| on empty sequences (just let me specify a default for that
| case, please); .squeeze() not letting me say "squeeze down to
| a Series but not a scalar"; .groupby().apply() returning
| different types depending on _how many groups/rows per group
| there are_... It's fine when you know exactly what you have,
| but it's hard using it in a pipeline that needs to be agnostic
| about whether there are zero, one, or many data (datums?).
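|
| For instance, a minimal sketch of two of those:
|
|     import pandas as pd
|
|     try:
|         pd.concat([])    # no way to supply a default result
|     except ValueError as e:
|         print(e)         # "No objects to concatenate"
|
|     df = pd.DataFrame({"a": [1]})
|     print(df.squeeze())  # a 1x1 frame collapses all the way to
|                          # the scalar 1, with no option to stop
|                          # at a Series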
| mrtranscendence wrote:
| I don't mean to disparage pandas, which is a library that
| does a lot of things fairly well. But as an API for data
| manipulation I find it very verbose and it doesn't mesh
| with a "functional" way of thinking about applying
| transformations.
|
| Generally, I've even preferred Spark to pandas, though it's
| hardly less verbose. Coming from R, it's much slower than
| data.table and nowhere near as slick and discoverable as
| dplyr. Its system of indices is a pain that I'd rather not
| deal with at all (and, indeed, I can't think of another
| data frame library that relies on them). I hate finding
| CSVs that other data scientists have created from pandas,
| because they invariably include the index ...
|
| Handles time series really well, though.
|
| Recently I've been using polars (https://github.com/pola-
| rs/polars). As an API I much, much prefer it to pandas, and
| it's a lot faster. Comes at the cost of not using numpy
| under the hood, so you can't just toss a polars data frame
| into a sklearn model.
| disgruntledphd2 wrote:
| Agreed on your major points.
|
| That being said:
|
| > I hate finding CSVs that other data scientists have created
| from pandas, because they invariably include the index ...
|
| This is also the default in R, with row numbers (like I have
| ever needed them). To be fair, it's gotten better since people
| stopped putting important information in rownames.
|
| Polars looks interesting, thanks for the recommendation!
| deshpand wrote:
| > I hate finding CSVs that other data scientists
|
| Ideally you should be using the Parquet format, which is
| binary and preserves column types and indexes
| [df.to_parquet(<file>); df = pd.read_parquet(<file>)].
|
| You can get away from a lot of problems by simply avoiding
| text files.
| disgruntledphd2 wrote:
| It reminds me of base R from 2010, and I thought dplyr had
| driven a stake through the heart of those approaches.
|
| More generally, the API is large, all-consuming, and not
| consistent. sklearn is best in class here: I rarely need to
| look things up, whereas the pandas docs autocomplete in my
| browser after one or two characters.
| baron_harkonnen wrote:
| Pandas' indexing system is overly complex and I've never
| personally benefited from it. To start with, there are
| __getitem__, loc, and iloc approaches to accessing values. If
| your library constantly has to warn users "you might be doing
| something wrong, read the docs!", that should be a warning
| sign that you don't have the correct level of abstraction. R
| has a much saner API and saner assumptions about when you want
| to access a value by reference (which is almost always) versus
| by value.
|
| Then, when doing basic operations like "group by", you end up
| with excessively elaborate indexes that are in my experience
| useless and always need to be manually squashed into something
| coherent.
|
| It's a common joke for me that whenever even a seasoned pandas
| user cries out "gaarrr! why isn't this working!?" I just reply
| "have you tried reset_index?"... this works in a frighteningly
| large number of cases.
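|
| A minimal example of the ritual:
|
|     import pandas as pd
|
|     df = pd.DataFrame({"g": ["a", "a", "b"],
|                        "h": [1, 2, 2],
|                        "x": [1.0, 2.0, 3.0]})
|     out = df.groupby(["g", "h"]).mean()  # carries a (g, h)
|                                          # MultiIndex
|     flat = out.reset_index()             # back to ordinary
|                                          # columns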
| lr1970 wrote:
| Just to clarify, scikit-learn 1.0 has not been released yet.
| The latest tag in the GitHub repo is 1.0.rc2.
|
| https://github.com/scikit-learn/scikit-learn/releases/tag/1....
| laichzeit0 wrote:
| Great that they finally added quantile regression. This was
| sorely missed.
|
| I'm still hoping for a mixed-effects model implementation
| someday, like lme4 in R. The statsmodels implementation can only
| do predictions on fixed effects, which limits it greatly.
|
| I've always wondered why mixed-effects models are not more
| popular in the ML world.
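|
| On the quantile regression point, a minimal sketch (assuming
| the 1.0 QuantileRegressor API as documented):
|
|     import numpy as np
|     from sklearn.linear_model import QuantileRegressor
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(200, 1))
|     y = 2 * X.ravel() + rng.normal(size=200)
|
|     # fit one model per conditional quantile of interest
|     low = QuantileRegressor(quantile=0.1, alpha=0).fit(X, y)
|     high = QuantileRegressor(quantile=0.9, alpha=0).fit(X, y)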
| crimsoneer wrote:
| Preach. The statsmodels implementation sucks.
___________________________________________________________________
(page generated 2021-09-14 23:01 UTC)