[HN Gopher] Scikit-Learn Version 1.0
       ___________________________________________________________________
        
       Scikit-Learn Version 1.0
        
       Author : m3at
       Score  : 237 points
       Date   : 2021-09-14 08:50 UTC (14 hours ago)
        
 (HTM) web link (scikit-learn.org)
 (TXT) w3m dump (scikit-learn.org)
        
       | conor_f wrote:
       | https://0ver.org/ will need an update!
        
       | NeutralForest wrote:
       | Excellent library with stellar documentation, I hope it'll live
       | on for a long time.
        
         | jgilias wrote:
         | Yes, glad they've decided it's finally out of the "don't use
         | it, it's experimental" phase and gotten off the 0ver.org wall
         | of shame!
        
         | Uberphallus wrote:
         | Really, for any other ML library the best documentation is how-
         | tos spread through the web, but scikit-learn leaves very little
         | room for that kind of content.
        
         | sveme wrote:
          | Best documented library. It even provides examples, guidance
          | and best practices in the documentation. I have rarely
          | learned so much as when I went through the scikit-learn
          | documentation. Absolute delight.
        
           | mistrial9 wrote:
           | you mean the 4000 page cookbook thing?
        
       | infimum wrote:
       | scikit-learn (next to numpy) is the one library I use in every
       | single project at work. Every time I consider switching away from
       | python I am faced with the fact that I'd lose access to this
       | workhorse of a library. Of course it's not all sunshine and
       | rainbows - I had my fair share of rummaging through its internals
       | - but its API design is a de-facto standard for a reason. My only
       | recurring gripe is that the serialization story (basically just
       | pickling everything) is not optimal.
        
         | zeec123 wrote:
          | There is so much wrong with the API design of sklearn (how
          | can one think "predict_proba" is a good function name?). I
          | can understand this, since most of it was probably written
          | by PhD students without the time and expertise to come up
          | with a proper API; many of them without a CS background.[1]
         | 
         | [1]
         | https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...
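For context, `predict_proba` is scikit-learn's standard method for class-probability estimates. A minimal sketch of the interface being criticized, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy problem: two well-separated classes.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

labels = clf.predict([[0.05], [5.05]])       # hard class labels
probs = clf.predict_proba([[0.05], [5.05]])  # one probability column per class
```

Whatever one thinks of the name, the split between `predict` (labels) and `predict_proba` (probabilities) is the de-facto convention that third-party libraries now copy.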
        
           | mrtranscendence wrote:
           | I didn't want to bag on sklearn (I've already bagged on
           | pandas enough here), but for what it's worth I agree with
           | you. It's, ahh, not the API I would've come up with. It's
           | what everybody has standardized on, though, and maybe there's
           | some value in that.
        
         | kzrdude wrote:
         | What's a typical task you do with sklearn? Just trying to get
         | inspired about what it can do
        
         | CapmCrackaWaka wrote:
         | I recently ran into this issue as well. Serialization of
         | sklearn random forests results in absolutely massive files. I
         | had to switch to lightgbm, which is 100x faster to load from a
         | save file and about 20x smaller.
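The serialization story both comments refer to is plain pickling of the fitted estimator object, trees and all. A minimal sketch, assuming scikit-learn is installed:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The whole fitted object, every tree's node arrays included, is
# pickled as-is, which is why forest models produce large files.
blob = pickle.dumps(clf)
clf2 = pickle.loads(blob)  # round-trips to an identical predictor
```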
        
       | XoS-490 wrote:
       | What about sktime? https://github.com/alan-turing-
       | institute/sktime
        
       | lr1970 wrote:
       | Early on, pandas made some unfortunate design decisions that are
       | still biting hard. For example, the choice of datetime
       | (pandas.Timestamp) represented by a 64-bit int with a fixed
        | nanosecond resolution. This choice gives a dynamic range of
        | +-292 years around 1970-01-01 (the epoch). This range is too small to
       | represent the works of William Shakespeare, never mind human
       | history. Using pandas in these areas becomes a royal pain in the
       | neck, for one constantly needs to work around pandas datetime
       | limitations.
       | 
        | OTOH, in numpy one can choose time resolution units (anything
        | from attosecond to a year), tailoring the resolution to your
        | task (from high energy physics all the way to astronomy).
        | Pandas' choice is only good for high-frequency stock traders,
        | though.
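The trade-off described above can be seen directly in numpy, where the unit is part of the `datetime64` dtype. A minimal sketch:

```python
import numpy as np

# A signed 64-bit nanosecond counter (pandas' fixed choice) spans
# only about +/-292 years around the 1970 epoch:
ns_range_years = 2**63 / (1e9 * 60 * 60 * 24 * 365.25)

# numpy's datetime64 instead lets you trade resolution for range:
shakespeare = np.datetime64("1599-04-23", "D")  # day resolution: no problem
deep_past = np.datetime64("0001-01-01", "Y")    # year resolution reaches antiquity
```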
        
         | Bostonian wrote:
          | Pandas was started by a quant working for AQR Capital, so
          | it's not surprising if "Pandas' choice is only good for
          | high-frequency stock traders".
        
           | lr1970 wrote:
           | An illustrative example of how reasonable short-term and
           | narrow-scope considerations can be really bad in long-term
           | and/or at a larger scope.
        
             | minsc__and__boo wrote:
             | This assumes that all projects should be built with the
             | larger scope in mind.
             | 
             | Sometimes you just need a shovel, not a Bagger 288.
        
               | [deleted]
        
             | nojito wrote:
             | Why should he care about other use-cases?
             | 
             | It's not his responsibility to make sure his package is as
             | wide as possible before opensourcing.
        
               | lr1970 wrote:
                | The problem is not with Wes' original decision but
                | with the fact that it was never revisited even when
                | pandas took off at a much larger scope. It should
                | have been fixed before the 1.0 release.
        
               | nojito wrote:
               | This belief is quite common in the Opensource space.
               | 
               | It's far easier to criticize than it is to submit a pull
               | request.
        
               | kickopotomus wrote:
                | It's like there is some strange belief now that
                | software should be "finished" before a 1.0 version.
                | When did that start?
        
               | anigbrowl wrote:
               | I'm glad you posted about this because I didn't know, but
               | my reflexive response was 'well guess that won't work for
               | [project idea], guess I'll roll my own or just use the
               | NumPy version.'
               | 
               | I personally don't mind the lack of one-size-fits-all. If
               | Pandas were to be part of the Python Standard Library I
               | think you'd have a stronger argument, since the unspoken
               | premise of a SL is that you can leave for a desert island
               | with only that and your IDE and still get things done.
        
         | nxpnsv wrote:
          | Most data is not 300 years old or in the distant future; in
          | fact, ranges of 1970 +- 292 years are very common. That is
          | to say, pandas' choice is good for lots of people, including
          | outside high-frequency stock trading.
        
           | lr1970 wrote:
            | > Most data is not 300 years old or in the distant
            | future; in fact, ranges of 1970 +- 292 years are very
            | common.
           | 
           | In what domains? Astronomy, geology, history call for larger
           | time range. Laser and High Energy physics need femtosecond
           | rather than nanosecond resolution. My point is that a fixed
            | time resolution, whatever it is, is a bad choice. Numpy
            | explicitly allows selecting the time resolution unit, and
            | this is the right approach. BTW, numpy is a pandas
            | dependency and predates it by several years.
        
       | monkeybutton wrote:
       | Scikit-Learn is great, and, reading the documentation for other
       | 3rd party ML packages and seeing the words "Scikit-learn API" is
       | even better.
        
       | zibzab wrote:
       | Is anyone using scikit for NN?
       | 
       | Why/why not?
        
         | FlyingSaucer wrote:
          | I have used the MLP classifier[1] before. It's very simple
          | to use (like most of sklearn's models). It worked well for
          | standard and reasonably small classification models, but it
          | lacks some features needed for a flexible way of using NNs:
          | 
          | - No saving checkpoints (can be crucial for large models
          | that need a lot of compute and time)
          | 
          | - No way to assign different activation functions to
          | different layers
          | 
          | - No complex nodes like LSTM, GRU
          | 
          | - No way to implement complex architectures like
          | transformers, encoders etc
          | 
          | I also do not know if it's even possible to use CUDA or any
          | GPU with it.
         | 
         | [1] : https://scikit-
         | learn.org/stable/modules/generated/sklearn.ne...
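A minimal sketch of the MLP classifier discussed above, assuming scikit-learn is installed. Note that `activation` is a single setting shared by all hidden layers, which is exactly the inflexibility the comment describes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; one activation function applied to both.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```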
        
           | kuu wrote:
            | I would say the same as you. As long as you only need a
            | simple model, yes, the MLP is good enough, but forget
            | about doing any deep learning stuff.
            | 
            | And AFAIK there isn't GPU support, and CPU performance is
            | poor compared to GPU execution.
        
         | westurner wrote:
         | There are scikit-learn (sklearn) API-compatible wrappers for
         | e.g. PyTorch and TensorFlow.
         | 
         | Skorch: https://github.com/skorch-dev/skorch
         | 
         | tf.keras.wrappers.scikit_learn:
         | https://www.tensorflow.org/api_docs/python/tf/keras/wrappers...
         | 
          | AFAIU, there are no Yellowbrick visualizers for PyTorch or
          | TensorFlow; though PyTorch and TensorFlow work with
          | TensorBoard for visualizing CFG execution.
         | 
         | > _Many machine learning libraries implement the scikit-learn
         | `estimator API` to easily integrate alternative optimization or
         | decision methods into a data science workflow. Because of this,
         | it seems like it should be simple to drop in a non-scikit-learn
         | estimator into a Yellowbrick visualizer, and in principle, it
         | is. However, the reality is a bit more complicated._
         | 
         | > _Yellowbrick visualizers often utilize more than just the
         | method interface of estimators (e.g. `fit()` and `predict()`),
         | relying on the learned attributes (object properties with a
         | single underscore suffix, e.g. `coef_`). The issue is that when
         | a third-party estimator does not expose these attributes, truly
         | gnarly exceptions and tracebacks occur. Yellowbrick is meant to
         | aid machine learning diagnostics reasoning, therefore instead
         | of just allowing drop-in functionality that may cause
          | confusion, we've created a wrapper functionality that is a
          | bit kinder with its messaging._
         | 
         | Looks like there are Yellowbrick wrappers for XGBoost,
          | CatBoost, CuML, and Spark MLlib; but not for NNs yet.
         | https://www.scikit-yb.org/en/latest/api/contrib/wrapper.html...
         | 
         | From the RAPIDS.ai CuML team:
         | https://docs.rapids.ai/api/cuml/stable/ :
         | 
         | > _cuML is a suite of fast, GPU-accelerated machine learning
         | algorithms designed for data science and analytical tasks. Our
         | API mirrors Sklearn's, and we provide practitioners with the
         | easy fit-predict-transform paradigm without ever having to
         | program on a GPU._
         | 
         | > _As data gets larger, algorithms running on a CPU becomes
         | slow and cumbersome. RAPIDS provides users a streamlined
          | approach where data is initially loaded in the GPU, and compute
         | tasks can be performed on it directly._
         | 
         | CuML is not an NN library; but there are likely performance
         | optimizations from CuDF and CuML that would accelerate
         | performance of NNs as well.
         | 
         | Dask ML works with models with sklearn interfaces, XGBoost,
         | LightGBM, PyTorch, and TensorFlow: https://ml.dask.org/ :
         | 
         | > _Scikit-Learn API_
         | 
         | > _In all cases Dask-ML endeavors to provide a single unified
         | interface around the familiar NumPy, Pandas, and Scikit-Learn
         | APIs. Users familiar with Scikit-Learn should feel at home with
         | Dask-ML._
         | 
         | dask-labextension for JupyterLab helps to visualize Dask ML
         | CFGs which call predictors and classifiers with sklearn
         | interfaces: https://github.com/dask/dask-labextension
        
         | armcat wrote:
         | NN as in "neural network", or NN as in "nearest neighbour"
         | algorithm? No to the former, yes to the latter. The reason for
         | a "no" to neural networks - in my case I've only ever
         | implemented neural networks with many layers, and typically
         | using kernels, pooling mechanisms, etc, and since scikit-learn
         | doesn't have GPU support, I opt for frameworks that do
         | (PyTorch, TensorFlow). However, if you're only building fully-
         | connected neural nets (MLPs), with just a few layers, you don't
         | need GPU support since any benefits of having parallel
         | processing are offset by shuffling data between CPU and GPU. So
         | in that case, scikit-learn would probably work quite well,
         | although I never tested this myself.
        
           | woko wrote:
           | GPU can be useful for Nearest Neighbour as well. In case you
           | have access to a GPU, I would strongly recommend Facebook's
           | FAISS [1,2]. For everything else, sklearn is amazing.
           | 
           | [1] https://faiss.ai/ [2]
           | https://github.com/facebookresearch/faiss
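For comparison, the CPU-side nearest-neighbour query in scikit-learn itself looks like this (a minimal sketch; FAISS plays the same role on GPU at much larger scale):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))  # 100 points in 8 dimensions

nn = NearestNeighbors(n_neighbors=3).fit(X)
# The 3 nearest points to the first row (the nearest is itself).
distances, indices = nn.kneighbors(X[:1])
```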
        
             | armcat wrote:
             | Faiss looks very nice, thanks for the tip!
        
         | woko wrote:
         | See the FAQ: https://scikit-learn.org/dev/faq.html#will-you-
         | add-gpu-suppo...
        
           | [deleted]
        
         | jptech wrote:
         | You should be asking if anyone is using it for ML, not NN.
        
           | detaro wrote:
           | why?
        
       | lysecret wrote:
        | Excellent library for train_test_split. Jokes aside, this,
        | next to NumPy, Pandas, Jupyter, and Matplotlib (plus the DL
        | libraries), is the reason Python is the powerhouse it is for
        | Data Science.
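The joke lands because the helper really is ubiquitous; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle and split 70/30, reproducibly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```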
        
         | disgruntledphd2 wrote:
         | I'm with you on sklearn, the DL libraries and Numpy, but Pandas
         | and Matplotlib are poor, poor relations of the tools available
         | in the R ecosystem (dplyr/ggplot etc).
        
           | pc86 wrote:
           | If you're doing data science aren't sklearn, DL, and numpy
           | getting you 90% of the way there anyway? Even if R has better
           | "versions" of pandas/matplotlib (not conceding that point)
           | it's not exactly central to the job of data science.
        
             | disgruntledphd2 wrote:
             | > If you're doing data science aren't sklearn, DL, and
             | numpy getting you 90% of the way there anyway?
             | 
             | Not really, tbh. Most of my jobs (even when the primary
             | output was models) require spending a _lot_ of time data
             | wrangling and plotting. R is much, much better for this
             | kind of exploratory work.
             | 
             | But if I need to integrate with bigger systems (as I
             | normally do), there's a stronger push for Python to reduce
             | complexity and make it easier for SE's to understand and
             | maintain (some of) the code.
        
             | civilized wrote:
             | As a working data scientist I'd say it's completely the
             | opposite: a good tabular data manipulation package is the
             | single most valuable tool in my tool box. And R's packages
             | (either data.table or dplyr) are definitely way better than
             | pandas. There's no comparison.
             | 
             | I would be hard-pressed to find a working data scientist
             | whose definition of data science is "that thing you do with
             | sklearn, Deep Learning and Numpy".
        
               | jstx1 wrote:
               | > "Data science is that thing where you do sklearn, Deep
               | Learning and Numpy" is not a working data scientist's
               | perspective.
               | 
               | It could be. It's such a broad job title and it looks so
               | different across different companies and teams that the
               | main tool for one data scientist might be something that
               | another data scientist never has to touch. Different data
               | science jobs prioritise different tools, that's all.
        
               | civilized wrote:
               | Right, so defining data science as 90% sklearn+DL+numpy
               | is just as silly as saying that it's 90% table
               | manipulation. That's exactly my point.
               | 
               | Still, if anyone here has managed to find a data science
               | job in which tabular data management is not a sizable
               | piece of what you do, I'd like to know some details!
        
               | _Wintermute wrote:
                | I worked as a data scientist for a couple of years
                | and tabular data was a very small part of my job. I
                | spent far more time with image analysis and JSON,
                | both of which I found R sucks at.
        
               | mrtranscendence wrote:
               | I imagine there are data scientists who operate primarily
               | on unstructured rather than tabular data. Part of my
               | current job involves stuff like text classification, and
               | it's not that difficult to imagine someone for whom
               | that's a more sizable proportion of their day-to-day.
               | 
               | Still, my suspicion -- at least from my corner of data
               | science -- is that such individuals are rare, and that
               | most data scientists do make use of tabular data more
               | often than not.
        
               | civilized wrote:
               | I totally get what you mean - I would suspect that when
               | you work with unstructured data, tabular data
               | manipulation is maybe 20-40% of what you do, and when you
               | work with structured data, it's more like 60-80%.
        
               | pletnes wrote:
               | Tabular data is great for many usecases, but saying that
               | image, audio, and video analysis is not data science
               | seems like a weird variant of gatekeeping to me.
        
               | disgruntledphd2 wrote:
               | > Tabular data is great for many usecases, but saying
               | that image, audio, and video analysis is not data science
               | seems like a weird variant of gatekeeping to me.
               | 
               | Most problems are mostly tabular, IME.
               | 
               | I completely agree that text, images and video are much,
               | much better handled by Python (that's why I use and know
               | both).
        
               | civilized wrote:
               | Of course, my post doesn't imply such a silly statement.
        
             | bllguo wrote:
             | maybe we are casualties of the vague definition of "data
             | science," but in my experience numpy is too low-level for
             | most of what I consider DS, and pandas/matplotlib are
             | _much_ more central than sklearn or pytorch. Even if your
             | definition only encompasses deep learning research, surely
             | plotting is still indispensable?
             | 
                | I'll also add my vote for the superiority of
                | data.table and ggplot2 to any Python alternatives.
                | The bloat and verbosity of pandas is a daily struggle.
        
               | MichaelRazum wrote:
               | Just curious. In which way is data.table superior to
               | pandas? Really interested about it! From my personal
               | experience pandas is just sometimes a bit slow.
        
               | nojito wrote:
                | data.table is faster to write and faster to run
               | 
               | https://h2oai.github.io/db-benchmark/
        
               | bllguo wrote:
               | I just love how much more terse and fast it is, someone
               | else linked a benchmark below. There's definitely a
               | learning curve though.
               | 
               | If you already think pandas is slow I think you'll be
               | surprised how much more strongly you feel after using
               | data.table!
        
               | mrtranscendence wrote:
               | I'm more a dplyr man myself, but data.table is much
               | faster than pandas, most noticeably IMO when reading
               | large files. It's also extremely succinct if you're into
               | that sort of thing (though I find it a bit obfuscated).
               | pandas is a lot of things, but "fast" and "concise" are
               | not two of them.
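The kind of operation behind those benchmark comparisons is a grouped aggregation; a minimal pandas sketch (the data.table spelling in the comment is my own approximation):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})

# Grouped sum; data.table would spell this roughly as
# DT[, .(v = sum(v)), by = g]
out = df.groupby("g", as_index=False)["v"].sum()
```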
        
               | MichaelRazum wrote:
               | Got it. Regarding fast you have something like Vaex on
               | python side (but not sure how fast it realy is). For me I
               | had with pandas the most issues using it's multiindex.
        
               | mrtranscendence wrote:
                | > For me, the most issues I had with pandas were
                | with its MultiIndex.
               | 
               | Yessss. I loathe indices, and have never been in a
               | situation where I was better off with them than without
               | them.
               | 
                | > Regarding speed, you have something like Vaex on
                | the Python side
               | 
               | I've never used Vaex, but I've used datatable
               | (https://github.com/h2oai/datatable) and polars
               | (https://github.com/pola-rs/polars). Polars is my
               | favorite API, but datatable was faster at reading data
               | (Polars was faster in execution). I'll have to give Vaex
               | a try at some point.
        
               | civilized wrote:
               | Pandas is the PHP of data science. Pretty badly designed,
               | but immensely popular because it got there first and had
               | no real competition (in Python) for years.
        
           | optimalsolver wrote:
           | Of possible interest, a C++ replacement for Pandas:
           | 
           | https://github.com/hosseinmoein/DataFrame
        
           | pantsforbirds wrote:
            | I'm surprised you don't like pandas. I've found it to be
            | pretty easy to use and a useful tool, and you can almost
            | always use something like Dask (or, if you're lucky, cuDF
            | from RAPIDS) if you need better performance.
            | 
            | I will say that my very first "real" programming
            | experience was Matlab at a research internship, so maybe
            | I just got used to working in vectors and arrays for
            | computational tasks.
        
             | baron_harkonnen wrote:
              | > I just got used to working in vectors and arrays for
              | computational tasks.
              | 
              | Have you worked with R? R, like Matlab, natively
              | supports vector-based operations. In fact, all values
              | in R are vectors. Many of the problems with Pandas
              | ultimately boil down to the fact that you have to
              | replicate this experience without truly being in a
              | vector-based language.
        
             | jfarina wrote:
             | Numpy has a better api than pandas if we're strictly
             | talking about vectors and arrays.
             | 
             | Pandas indexing makes sense once you get it, but it does
             | seem to require a lot more words than equivalent statements
             | in R.
             | 
             | My primary language is python, but I have been picking up
             | some R.
        
           | kzrdude wrote:
            | They need seaborn too, which makes the Python side a lot
            | stronger
        
             | disgruntledphd2 wrote:
             | plotnine FTW! You'll pry ggplot2 from my cold, dead hands.
        
           | boringg wrote:
           | Wait how many companies are actually using R in the wild? As
           | I understand it, R is born of academia, great for
           | statistics/analysis but breaks down on data manipulation and
           | isn't used in production/data engineering. Maybe my
           | understanding is dated though?
        
             | baron_harkonnen wrote:
             | Of the many companies I've done data science with I can
             | only think of a few, rare exceptions where R wasn't used as
             | much as if not more than Python.
             | 
             | If you're mostly dealing with Neural Nets you won't see
             | much R, but for anything really statistical in nature R is
             | a much better tool than Python. For anything that ends up
             | in a report R is much better than Python (a lot of very
             | valuable data science work ends up being a report to non-
             | technical people).
             | 
             | > breaks down on data manipulation
             | 
              | This is very outdated. The tidyverse ecosystem has
              | bumped R back into being first in class for data
              | manipulation now.
             | This becomes less true as you get further and further from
             | having your data in a matrix/df (I can't imagine doing
             | Spark queries in R), but if you already have a basic data
             | frame, manipulation from there is very easy.
             | 
             | Even for things that end up in production, whether you're
             | in R or Python, whatever your first pass is should always
             | be a prototype and will have to be reworked before you get
             | close to moving it to production.
        
               | [deleted]
        
             | dagw wrote:
             | _Wait how many companies are actually using R in the wild?_
             | 
              | Depends on your definition. While not very often
              | 'deployed' in 'production', I know lots of places in
              | all kinds of industries where people reach for R as
              | soon as they have to look at some new data.
        
             | lysecret wrote:
              | I used to work in insurance and we used it heavily.
        
               | disgruntledphd2 wrote:
               | > Wait how many companies are actually using R in the
               | wild? As I understand it, R is born of academia, great
               | for statistics/analysis but breaks down on data
               | manipulation and isn't used in production/data
               | engineering.
               | 
               | It depends, I've worked in some places where R was the
               | core part of their data infrastructure. Data manipulation
                | (of non-text data) is far, far better in R.
               | 
               | Integrating with other systems can be tricky though, and
               | you don't have the wide variety of Python libraries
               | available for core SE tasks, so it can often make sense
               | to use Python even though it's not as good for a lot of
               | the core work.
               | 
               | Additionally, R is a very, very flexible language (like
               | Python), but without strong community lead norms (unlike
               | Python) so it's pretty easy to make a mess with it.
               | 
               | Finally, when you need to hand over stuff to software
               | engineers, they vastly tend to prefer Python, so it often
               | ends up being used to make this stuff easier.
               | 
               | Like, in R there's a core tool called broom which will
               | pull out the important features of a model and make it
               | really easy to examine them with your data. There's
               | nothing comparable in Python, and I miss it so so much
               | when I use Python.
               | 
               | That being said, working with strings is much much nicer
               | in Python, and pytest is the bomb, so there's tradeoffs
               | everywhere.
        
               | mrtranscendence wrote:
               | It's unrelated to your main point, but:
               | 
               | > Additionally, R is a very, very flexible language (like
               | Python)
               | 
               | I'd argue that R is much more flexible than Python
               | syntactically. There's a reason that every attempt at
               | recreating dplyr in Python ends in a bit of a mess (IMO)
               | -- Python just doesn't allow the sort of metaprogramming
               | you'd require for a really nice port. Something as simple
               | as a general pipe operator can't be defined in Python, to
               | say nothing of how dplyr scopes column names within
               | verbs.
               | 
               | Arguably this does allow you to go crazy in a way that
               | ends up being detrimental to readability, but I'd say
               | overall it's a net benefit to R over Python. I really
               | miss this stuff and have spent an undue amount of time
               | thinking of the best way to emulate it (only to come up
               | with ideas that just disappoint).
               | 
               | > Finally, when you need to hand over stuff to software
               | engineers, they vastly tend to prefer Python
               | 
               | Indeed, this is maybe 50% of the reason my organization
               | has pushed R to the sidelines over the past few years. We
               | used to be very heavily into R but now it has "you can
               | use it, but don't expect support" status.
        
               | mrtranscendence wrote:
               | (Replying to disgruntledphd2)
               | 
               | > Well that's just lazy evaluation of function arguments,
               | which can't be done in Python.
               | 
               | "Just lazy evaluation"! :) It's a pretty big deal. This
               | is three-fifths of the way to a macro system.
               | 
               | > But if take a look at the Python data model, it does
               | seem super, super flexible.
               | 
               | Sure, you can have a lot of control over the behavior of
               | Python objects (some techniques of which remain obscure
               | to me even after using Python for many years). But you
               | don't have anything like syntactic macros. You can define
               | a pipe operator with macropy, though -- it's pretty easy.
               | But macropy is basically dead now I think (and a total
               | hack).
               | 
               | > You'll still need strings for column names in any dplyr
               | port though, because of the function argument issue.
               | 
               | This is major, though, because you can't do this:
               | mutate(df, x="y" + "z")
               | 
               | You have to do something like what dfply does, defining
               | an object that defines addition, subtraction, etc.
               | mutate(df, x=X.y + X.z)
               | 
               | But that hits corner cases quickly. What if you want to
               | call a regular Python function that expects numeric
               | arguments? This won't work:                   mutate(df,
               | x=f(X.y))
               | 
               | etc. Granted, this only really works in R because it's
               | easy to define functions that accept and return vectors.
               | So in that sense it's kind of a leaky abstraction. But
               | you couldn't even get that far in Python, because X.y
               | isn't a vector ... it's a kind of promise to substitute a
               | vector.
               | 
               | Give Python macros, I say! To hell with the consequences!
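The deferred-expression mechanism described above (and the corner case with ordinary functions) can be sketched in a few lines; `Expr`, `_X`, and `mutate` here are hypothetical illustrations of the idea, not dfply's actual implementation:

```python
import pandas as pd

class Expr:
    """A promise to compute a column from a DataFrame later."""
    def __init__(self, fn):
        self.fn = fn  # DataFrame -> Series

    def __add__(self, other):
        # Build a bigger deferred expression instead of computing now.
        return Expr(lambda df: self.fn(df) + other.fn(df))

class _X:
    def __getattr__(self, name):
        return Expr(lambda df: df[name])

X = _X()

def mutate(df, **exprs):
    out = df.copy()
    for col, expr in exprs.items():
        out[col] = expr.fn(df)  # evaluate the promise against df
    return out

df = pd.DataFrame({"y": [1, 2], "z": [10, 20]})
result = mutate(df, x=X.y + X.z)  # result["x"] is [11, 22]

# The corner case from the comment: an ordinary function sees an Expr,
# not numbers, so e.g. math.sqrt(X.y) raises a TypeError immediately.
```

Overloading every remaining operator (and somehow intercepting plain function calls) is exactly where such ports start to get messy, which is the commenter's point.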
        
               | dragonwriter wrote:
               | > Sure, you can have a lot of control over the behavior
               | of Python objects (some techniques of which remain
               | obscure to me even after using Python for many years).
               | But you don't have anything like syntactic macros.
               | 
               | Not yet, but there's a PEP for that:
               | 
               | https://www.python.org/dev/peps/pep-0638/
        
               | mrtranscendence wrote:
               | Nice, I'd love for this to see the light of day. I
               | suspect it'll see some resistance (even pattern matching
               | caused conflict, and I thought that was terribly
               | innocuous).
               | 
               | (Why can I reply at this level of nesting now, whereas
               | before I couldn't?)
        
               | disgruntledphd2 wrote:
               | I'm totally with you on these points, and it's one of the
                | places where R's genesis as a scheme program has led to
               | really, really good consequences.
               | 
               | Fundamentally though, both DS Python and R are
               | abstractions over well-tested Fortran linear algebra
                | routines (I'm sort of kidding, but only sort of).
        
               | disgruntledphd2 wrote:
               | > I'd argue that R is much more flexible than Python
               | syntactically. There's a reason that every attempt at
               | recreating dplyr in Python ends in a bit of a mess (IMO)
               | -- Python just doesn't allow the sort of metaprogramming
               | you'd require for a really nice port. Something as simple
               | as a general pipe operator can't be defined in Python, to
               | say nothing of how dplyr scopes column names within
               | verbs.
               | 
               | Well that's just lazy evaluation of function arguments,
                | which can't be done in Python. But if you take a look at the
               | Python data model, it does seem super, super flexible.
               | You'll still need strings for column names in any dplyr
               | port though, because of the function argument issue.
               | 
               | Like, both Python/R derive from the CLOS approach (Art of
               | the Metaobject Protocol), but R retains a lot more of the
               | lispy goodness (but Python's implementation is easier to
               | use).
        
             | Tarq0n wrote:
             | It's mostly the "in production" part that determines
             | whether R is suitable for a business or not. It's much more
             | complicated to avoid runtime errors or do proper testing in
             | R, whereas it shines for interactive use, or generating
             | reports.
             | 
              | That said, having used both, the DSLs for plotting and
              | data wrangling in the R package ecosystem are vastly
              | superior to pandas and Python plotting libraries. For
              | modeling I
             | actually like the better namespacing of Python which helps
             | keep things more legible when there are a ton of model
             | options to choose from, assuming you don't need cutting
             | edge statistics.
        
               | mrtranscendence wrote:
               | > It's much more complicated to avoid runtime errors or
               | do proper testing in R
               | 
               | It's not that much harder. There's no pytest, but
               | testthat works well enough. I've developed a few packages
               | internally in R and wouldn't say it was that much harder
               | to ensure correctness than for the corresponding Python
               | packages. (We used to keep them in sync, before basically
               | moving everything to Python.)
        
               | disgruntledphd2 wrote:
               | I actually quite like R's error handling. It's as good as
               | Common Lisp's which is often held up as the epitome of
               | this.
               | 
               | You also have the dump.frames option, which will save
               | your workspace on failure, which is incredibly useful
               | when running R stuff remotely/in a distributed fashion.
        
             | dachryn wrote:
             | R is everywhere, especially when you need to visualize
             | stuff. It is primarily used in teams who are trying to get
             | rid of SAS in my experience.
             | 
             | You are right in the sense that R is typically not used
             | end-to-end as far as I can tell, but already tries to start
             | with a data connection to some sort of dump or datalake, or
             | datawarehouse.
             | 
             | Many people in my team use Python for modelling, but grab
             | ggplot in whatever way to make their presentations and
             | visuals (they all use different methods, usually something
             | messy like mixing python and R in a notebook or so).
              | ggplot2 also has a vast library of super high quality
             | plugins.
             | 
             | Python is far far behind in the viz space
        
               | RobinL wrote:
               | I don't think it's correct to say Python is far behind in
               | the viz space at all. It's just different.
               | 
               | I primarily use Altair within Python. ggplot is ahead of
               | Altair in some respects, but behind in others.
               | 
               | For example, here is a chart which can be made in Altair:
               | 
                | https://altair-viz.github.io/gallery/seattle_weather_interac...
               | 
               | Note:
               | 
               | - You can brush over the date range to filter the bar
               | chart
               | 
               | - You can click on weather type to filter the scatter
               | chart
               | 
               | - It can be embedded in any webpage with these
                | interactive elements intact. Since the chart is
               | represented by json and rendered by javascript, the spec
               | also embeds the data within the chart itself, and allows
               | the user to therefore change the chart however they want
               | 
               | You can even build something like gapminder:
                | https://vega.github.io/vega-lite/examples/interactive_global...
               | 
               | More examples here: https://altair-viz.github.io/gallery/
        
               | mrtranscendence wrote:
               | There are Python ports of ggplot (e.g. plotnine
               | (https://github.com/has2k1/plotnine)), but agreed, Python
               | is behind here. I'm not the best at data viz, but I can
               | usually piece together a way to make ggplot do what I
               | want it to do without that much trouble or looking at
               | documentation.
               | 
               | Matplotlib, though ... that's a harder beast to
               | internalize. I know it's possible to make high-quality
               | matplotlib plots, but it's much harder for me. Like
               | pandas, it's a library that I don't want to denigrate
               | because I know people put lots of effort into it, but I
               | can't lie -- I'm not a fan.
        
               | mindv0rtex wrote:
               | Speaking of what's possible in matplotlib, I am very much
               | looking forward to reading this book:
               | https://github.com/rougier/scientific-visualization-book
        
             | nojito wrote:
             | R with data.table and collapse blow away the competition in
             | terms of tabular data wrangling.
             | 
             | Both in terms of conciseness and performance
             | 
             | https://h2oai.github.io/db-benchmark/
        
           | baron_harkonnen wrote:
           | I used to very strongly agree with you re: matplotlib, but
           | I've recently switched from using almost exclusively ggplot2
            | to almost exclusively Matplotlib and my realization is that
           | they are very different tools serving very different
           | purposes.
           | 
           | ggplot2 is obviously fantastic and makes beautiful plots, and
           | very easily at that. However it is definitely a "convention
           | over configuration" tool. For 99% of the typical plot you
           | might want to create, ggplot is going to be easier and look
           | nicer.
           | 
            | However matplotlib really shines when you want to make
           | very custom plots. If you have a plot in your mind that you
           | want to see on paper, matplotlib will be the better tool for
           | helping you create exactly what you are looking for.
           | 
           | For certain projects I've done, where I want to do a bunch of
           | non-standard visualizations, especially ones that tend to be
           | fairly dense, I prefer matplotlib. For day to day analytics
           | ggplot2 is so much better it's ridiculous. The real issue is
           | that Python doesn't really offer anything in the same league
           | as ggplot2 for "convention over configuration" type plotting.
           | 
           | Fully agree on Pandas. R's native data frame + tidyverse is
            | worlds easier. Pandas' overly complex indexing system is a
           | persistent source of annoyance no matter how much I use that
           | library.
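A taste of the fine-grained layout control being described, using matplotlib's GridSpec (a minimal sketch with invented data; this is the kind of bespoke arrangement that ggplot2's convention-over-configuration approach makes harder):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts/CI
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))
# 2x2 grid with a tall top row and tight vertical spacing.
gs = fig.add_gridspec(2, 2, height_ratios=[3, 1], hspace=0.05)

main = fig.add_subplot(gs[0, :])               # wide main panel
left = fig.add_subplot(gs[1, 0], sharex=main)  # small panel sharing x-axis
right = fig.add_subplot(gs[1, 1])              # independent small panel

main.plot([0, 1, 2], [0, 1, 4])
left.bar([0, 1, 2], [1, 2, 3])
right.hist([1, 1, 2, 3, 3, 3])
```

Every axis, ratio, and spacing is explicit here, which is exactly the trade-off the comment describes: more work for a standard plot, full control for a non-standard one.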
        
             | bttger wrote:
             | > Fully agree on Pandas. R's native data frame + tidyverse
              | is worlds easier. Pandas' overly complex indexing system
             | is a persistent source of annoyance no matter how much I
             | use that library.
             | 
             | Is it just the syntax/readability that annoys you, or are
             | there actually problems that need like n steps more to do
             | the same with Pandas?
        
               | chaps wrote:
                | I spend more time working around pandas's strange isms
                | than it takes me to write vanilla python that does the
                | same thing. The index problems are not just small
                | annoyances, and can sometimes waste hours because of its
                | awkward defaults. For example, df.to_csv writes the
                | index by default (without a column name..)! It doesn't
               | make any sense to me whatsoever that reading a csv, then
               | writing the csv would add a new column. I'm really tired
               | of rerunning pandas code after I forget to turn that
               | stupid default index setting off. Is that a small thing?
               | Sure. But it had _tons_ of small things like that.
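The round trip being complained about, and the flags that avoid it, look like this (a sketch using in-memory buffers rather than real files):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default: to_csv writes the index as an unnamed first column,
# so a naive read-back grows an extra "Unnamed: 0" column.
buf = io.StringIO()
df.to_csv(buf)
naive = pd.read_csv(io.StringIO(buf.getvalue()))

# Fix it on write (index=False) -- or on read, with index_col=0.
buf = io.StringIO()
df.to_csv(buf, index=False)
clean = pd.read_csv(io.StringIO(buf.getvalue()))
```

`naive` has three columns after one write/read cycle while `clean` matches the original, which is the surprise the commenter keeps tripping over.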
        
               | _Wintermute wrote:
               | It's funny you complain about the index being saved in
               | csv files, which is the default behaviour in R.
        
           | tgb wrote:
           | Matplotlib is my go-to despite being mediocre. I recently
           | found proplot library built on it which seems to solve a lot
           | of the warts (particularly around figure layout with subplots
           | and legends). I haven't had a chance to use it yet - does
           | anyone know if it's worth it?
           | 
           | I like to stick to basic, widely used tools when possible so
           | I'm biased against it versus just wrangling it out with
           | matplotlib. But proplot does look compelling, like it was
           | written for exactly my complaints.
        
           | lysecret wrote:
            | Hehe, I used to do R. IMO you are right about ggplot but I
            | strongly disagree about pandas. I f'ing love it. Would love
            | to understand your troubles with it though; after using it
            | daily for 4 years maybe I can offer some perspective ;)
        
             | andreareina wrote:
             | I run into pandas edge cases all the time. pd.concat()
             | failing on empty sequences (just let me specify a default
             | for that case please); .squeeze() not letting me say,
             | "squeeze down to a series but not a scalar";
             | .groupby().apply() returning different types depending on
              | _how many groups/rows per group there are_... it's fine
              | when you know exactly what you have, but it's hard using
              | it in a pipeline that needs to be agnostic about whether
             | there's zero, one, or many data (datums?).
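The `pd.concat` edge case is real — it raises `ValueError` on an empty sequence — and the "let me specify a default" wish can be approximated with a small wrapper (`concat_with_default` is a hypothetical helper, not a pandas API):

```python
import pandas as pd

def concat_with_default(frames, columns):
    """Concatenate frames, returning an empty frame with known columns
    instead of raising when the sequence is empty."""
    frames = list(frames)
    if not frames:
        # pd.concat([]) raises ValueError: "No objects to concatenate"
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)

empty = concat_with_default([], columns=["a", "b"])
full = concat_with_default([pd.DataFrame({"a": [1], "b": [2]})],
                           columns=["a", "b"])
```

Pushing the zero/one/many handling into one place like this keeps the rest of a pipeline agnostic, which is what the commenter is asking the library to do natively.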
        
             | mrtranscendence wrote:
             | I don't mean to disparage pandas, which is a library that
             | does a lot of things fairly well. But as an API for data
             | manipulation I find it very verbose and it doesn't mesh
             | with a "functional" way of thinking about applying
             | transformations.
             | 
             | Generally, I've even preferred Spark to pandas, though it's
             | hardly less verbose. Coming from R, it's much slower than
             | data.table and nowhere near as slick and discoverable as
             | dplyr. Its system of indices is a pain that I'd rather not
             | deal with at all (and, indeed, I can't think of another
             | data frame library that relies on them). I hate finding
             | CSVs that other data scientists have created from pandas,
             | because they invariably include the index ...
             | 
             | Handles time series really well, though.
             | 
              | Recently I've been using polars
              | (https://github.com/pola-rs/polars). As an API I much,
              | much prefer it to pandas, and
             | it's a lot faster. Comes at the cost of not using numpy
             | under the hood, so you can't just toss a polars data frame
             | into a sklearn model.
        
               | disgruntledphd2 wrote:
               | Agreed on your major points.
               | 
                | That being said:
                | 
                | > I hate finding CSVs that other data
                | scientists have created from pandas, because they
                | invariably include the index ...
               | 
                | This is also the default in R, with row numbers (like I have
               | ever needed them). To be fair, it's gotten better since
               | people stopped putting important information in rownames.
               | 
               | Polars looks interesting, thanks for the recommendation!
        
               | deshpand wrote:
                | > I hate finding CSVs that other data scientists
                | 
                | Ideally you should be using the parquet format, which is
                | binary and preserves column types and indexes
                | [df.to_parquet(<file>); df = pd.read_parquet(<file>)]
               | 
               | You can get away from a lot of problems by simply
               | avoiding text files
        
             | disgruntledphd2 wrote:
              | It reminds me of base R from 2010, and I thought dplyr had
             | driven a stake through the heart of those approaches.
             | 
             | More generally, the API is large, all-consuming and not
             | consistent. sklearn is best in class here, I rarely need to
             | look things up whereas the pandas docs autocomplete in my
             | browser after one or two characters.
        
             | baron_harkonnen wrote:
             | Pandas indexing system is overly complex and I've never
             | personally benefited from that. To start with there are
             | __getitem__, loc and iloc approaches to accessing values.
             | If your library constantly has to warn users that "you
              | might be doing something wrong, read the docs!" that should be
             | a warning sign that you don't have the correct level of
             | abstraction. R has a much more sane api and assumptions
             | about when you want to access a value by reference (which
             | is almost always) and by value.
             | 
             | Then when doing basic operations like "group by" you end up
              | with excessively elaborate indexes that are in my experience
             | useless and always need to be manually squashed to
             | something coherent.
             | 
             | It's a common joke for me that whenever even a seasoned
             | Pandas user cries out "gaarrr! why isn't this working!?" I
             | just reply "have you tried reset_index?"... this works in a
             | frighteningly large number of cases.
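The "have you tried reset_index?" ritual in miniature — a two-key groupby comes back with a MultiIndex, and `reset_index()` squashes it back to ordinary columns (invented toy data):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"],
                   "h": ["x", "y", "x"],
                   "v": [1, 2, 3]})

# Grouping by two keys yields a Series indexed by a two-level MultiIndex.
agg = df.groupby(["g", "h"])["v"].sum()

# reset_index() flattens the index levels back into plain columns,
# giving an ordinary three-column frame again.
flat = agg.reset_index()
```

The elaborate index carries real information, but as the comment says, most downstream code just wants the flat-columns form.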
        
       | lr1970 wrote:
       | Just to clarify, scikit-learn 1.0 has not been released yet. The
       | latest tag in the github repo is 1.0.rc2
       | 
       | https://github.com/scikit-learn/scikit-learn/releases/tag/1....
        
       | laichzeit0 wrote:
       | Great that they finally added quantile regression. This was
       | sorely missed.
       | 
       | I'm still hoping for a mixed-effects model implementation
       | someday, like lme4 in R. The statsmodels implementation can only
       | do predictions on fixed effects, which limits it greatly.
       | 
       | I've always wondered why mixed effect type models are not more
       | popular in the ML world.
        
         | crimsoneer wrote:
          | Preach. The statsmodels implementation sucks.
        
       ___________________________________________________________________
       (page generated 2021-09-14 23:01 UTC)