hngopher.com

       [HN Gopher] R: Introduction to Data Science (2019)
       ___________________________________________________________________
        
       Author : tosh
       Score  : 171 points
       Date   : 2024-03-02 10:03 UTC (12 hours ago)
        
 (HTM) web link (rafalab.dfci.harvard.edu)
 (TXT) w3m dump (rafalab.dfci.harvard.edu)
        
       | asicsp wrote:
       | See also:
       | 
       | "R Programming for Data Science" https://leanpub.com/rprogramming
        
       | chollida1 wrote:
       | I'm an old R user, now migrated fully to Python.
       | 
       | For those of you who still use R, what is your use case?
       | 
       | We found R has a really hard time integrating into data pipelines
       | and was best used as a standalone tool by individuals, which
       | doesn't really work in our particular professional setup where
       | everyone works collaboratively together.
       | 
       | What we found was that R had a lot of packages but most haven't
       | been touched in years and when you contact the owner you find
       | they've often moved on to the python/pandas/scikit ecosystem.
        
         | medstrom wrote:
         | Basically with tidyverse, R can let you write less code and
         | keep it readable: https://www.sumsar.net/blog/pandas-feels-
         | clunky-when-coming-...
         | 
         | Can't speak to abandonment, but it seems a lot of recent devel
         | is occurring inside the tidyverse, which is deprecating a
         | whole bunch of other stuff.
        
           | chollida1 wrote:
           | I will agree that I left just as tidyverse was coming of age
           | and I'm sometimes jealous I never got to use it.
           | 
           | What Hadley Wickham has done is very impressive.
        
         | guccigav wrote:
         | Exactly what you said, R is easy to get started for individuals
         | in social science fields. Most people I know who want to dive
         | deeper end up learning Python anyway.
        
           | countrymile wrote:
           | This is true, I've found that teaching R to get things done
           | is very fast, and readable code is fully achievable in a
           | semester, compared to teaching pandas (and also having to
           | teach how to program in Python).
        
         | ildjarn wrote:
         | R is much better for REPL style development and functional
         | programming.
         | 
         | Python could be so much better with some minor syntax
         | extensions.
        
           | chollida1 wrote:
           | I find that with vscode and the immediate window I get a
           | decent repl.
           | 
           | What about R's language makes it better for Repl driven
           | development?
        
             | pama wrote:
             | If you haven't used Emacs ESS it may be hard to explain
             | what you're missing with a true REPL, but if you had used
             | it in the past and add tidyverse to it, you basically have
             | super smooth interactive editing with the ability to
             | quickly pull text from other REPLs, past notes or scripts.
             | Contrary to the similar Python repl, you can easily pull
             | and edit chains of multiple commands via R's piping.
        
             | lottin wrote:
             | For one thing, R code can be written more concisely, due to
             | the fact that the language is vector-based and
             | functionally-oriented.
        
         | pama wrote:
         | My main use case is making high quality visualizations for
         | quick data exploration and sharing with a team. It is easy to
         | guarantee that fonts are large enough, style is minimalist and
         | clean, and filtering, transforming views or facets iteratively
         | is only a couple characters change.
        
         | rpier001 wrote:
         | R is the better EDA language by far. Python has caught up a
         | lot. Notebook diffs are now readable in git with the right
         | tooling, that's huge.
         | 
         | The drum about not fitting into data pipelines... if you're
         | literally using a bash pipe it's true most R programmers have no
         | idea how to do that. Otherwise, that is where Docker and k8s
         | shine.
         | 
         | On packaging. R's package authority runs tests and ensures that
         | all packages work with the latest version of their peers. The
         | dependency heck is much less deep as a result.
         | 
         | We use R at my employer still because we put statistical data
         | science into production. Our experts come to us comfortable
         | with R. Reimplementation would be absurd.
        
         | yabbs wrote:
         | Bayesian stats
         | 
         | Traditional stats
         | 
         | Very fast iteration for data exploration in REPL (vs code or R
         | studio).
         | 
         | Prefer pipeline workflows (Tidyverse/magrittr).
         | 
         | Prefer functional
         | 
         | Prefer array based.
         | 
         | Prefer 1-indexed arrays (yes there are some of us).
        
         | usgroup wrote:
         | I think it's more intuitive for statistical applications where
         | Python is grossly under-represented. This includes things like
         | the design and analysis of experiments but also lots of domain
         | specific statistics and algorithms such as in bioinformatics,
         | chemistry, and so on.
         | 
         | Typically those applications are not the sort of line-of-
         | business enhancements ML in Python is more tuned to. I.e.
         | recommender systems, NN models, and so on.
        
           | doodledoodahs wrote:
           | The consistency of model specification across multiple
           | libraries is really helpful (base lm, lme4, brms etc). Even
           | though the syntax is sometimes extended, it seems consistent
           | enough to mostly be comprehensible/guessable.
        
         | transcriptase wrote:
         | Bioinformatics, particularly genetic mapping and population
         | genomics. There's an entire ecosystem of very mature tools
         | actively maintained by labs to add analyses pertaining to
         | advancements in the field, without breaking pipelines or
         | silently changing the results of a given analysis from version
         | to version.
         | 
         | Take something like adegenet, where the manual itself is
         | approaching 200 pages:
         | 
         | https://cran.r-project.org/web/packages/adegenet/adegenet.pd...
        
         | nomilk wrote:
         | About half our team can wrangle and plot as fast as we can
         | think of ideas. It creates an incredibly tight cycle time
         | between us having ideas and getting answers; sometimes many
         | (e.g. 10-20+) of those cycles in a single meeting. Before we
         | used R, it would require someone jotting down things to
         | investigate and reporting back in the next meeting. But we can
         | do ~80% of whatever people can think of on the spot (more
         | involved research questions can take more time).
         | 
         | The unique qualities of R that allow this are that it's so easy
         | to use, extremely reliable for package installation (problems
         | occur approximately never), and the tidyverse makes it
         | incredibly easy to translate ideas into code, not only in its
         | broad, easy to understand and powerful vocabulary, but in there
         | being little 'nesting' required; instead working left to right
         | and top to bottom (via the magrittr pipe) - i.e. your code, for
         | the most part, is like reading a page in a book.
        
         | jslakro wrote:
         | What Python learning material focused on data science do you
         | recommend?
        
           | downrightmike wrote:
           | I like this and they have a video course on o'reilly
           | https://www.amazon.com/Python-Programmers-Artificial-
           | Intelli...
        
         | zhdc1 wrote:
         | I've transitioned a lot of my work over to Julia, but R is
         | still the most intuitive language I've used for scripting out
         | data collection, cleaning, aggregation, and analysis cases.
         | 
         | The ecosystem is simply better. The folks who maintain CRAN do
         | a fantastic job. I can't remember the last time a library
         | incompatibility led to a show stopper. This is a weekly
         | occurrence in Python.
        
           | klmr wrote:
           | > _I can't remember the last time a library incompatibility
           | led to a show stopper._
           | 
           | Oh, it's _very_ common unless you basically only use < 5
           | packages that are completely stable and no longer actively
           | developed: packages break backwards compatibility all the
           | time, in small and in big ways, and version pinning in R
           | categorically does not work as well as in Python, despite all
           | the issues with the latter. People joke about the complex
           | packaging ecosystem in Python but at least there is such a
           | thing. R has no equivalent. In Python, if you have a
           | versioned lockfile, anybody can redeploy your code unless a
           | system dependency broke. In R, even with an 'renv' lockfile,
           | installing the correct package versions is a crapshoot, and
           | will frequently fail. Don't get me wrong, 'renv' has made
           | things _much_ better (and 'rig' and PPM also help in small
           | but important ways). But it's still dire. At work we are
           | facing these issues every other week on some code base.
        
             | apwheele wrote:
             | Agree with this, I am pretty agnostic to the pandas vs R
             | whatever stuff (I prefer base R to tidyverse, and I like
             | pandas, but realize I am old and probably not in majority
             | based on comments online). But many teams who are "R
             | adherent" folks I talk to are not deploying software in
             | varying environments so much as reporting shops doing ad-
             | hoc analytics.
             | 
             | For those who want to use both R/python, I have notes on
             | using conda for R environments,
             | https://andrewpwheeler.com/2022/04/08/managing-r-
             | environment....
        
             | wodenokoto wrote:
             | At my old job we snapshotted CRAN and pinned versions of
             | package dependencies _against_ CRAN.
        
               | hadley wrote:
               | We now provide snapshotted CRAN binaries (for many
               | platforms) at https://packagemanager.posit.co.
        
             | disgruntledphd2 wrote:
             | Can you not just build your own code as a package and
             | specify exact dependencies?
             | 
             | It's a bit of faff but that seems like it _should_ work
             | (but maybe I'm missing something).
        
             | getoffmycase wrote:
             | I basically don't use anything outside of tidyverse or base
             | R because of the package dependency issues.
        
             | hadley wrote:
             | I'd love to hear more about this because from my
             | perspective renv does seem to solve 95% of the challenges
             | that folks face in practice. I wonder what makes your
             | situation different? What are we missing in renv?
        
         | ekianjo wrote:
         | tidymodels is miles ahead of the toys you have in python for
         | traditional machine learning. of course Python is much better
         | in other areas but that is a big reason to use R, together with
         | the super powerful tidyverse syntax.
         | 
         | and package management is much, much more reliable in R than in
         | python.
        
           | disgruntledphd2 wrote:
           | Is tidymodels better than sklearn? As honestly sklearn is
           | one of the few things I was jealous of from the python
           | ecosystem, historically.
        
         | acc_297 wrote:
         | PK/PD work for pharmaceutical data analysis. I didn't like
         | using R at first but I've come to appreciate the speed that
         | comes with months of experience.
         | 
         | It's a language which feels like it has a lot of magical
         | incantations you need to remember - the default namespace is
         | much more crowded. Functions like sapply vs mapply are tricky
         | to reason about from the documentation alone. The values NA vs
         | NULL vs integer(0) are all used as stand-ins for real thrown
         | errors and knowing which one to check for after calling a
         | function can be tough.
         | 
         | But after using it for a few hundred hours to do data
         | processing and statistical regression it's hard to imagine
         | python or Julia being faster to use. But in all honesty for the
         | pharmaceutical industry it's mostly momentum that keeps R on
         | top, the same reason they use a lot of FORTRAN90.
        
           | klmr wrote:
           | > _But in all honesty for the pharmaceutical industry it's
           | mostly momentum that keeps R on top_
           | 
           | I can't agree with this: especially in PK/PD, R is only just
           | now taking over from the previous (closed-source) systems.
           | Momentum would keep R _out_ , not _in_.
        
           | nutshell42 wrote:
           | > Functions like sapply vs mapply are tricky to reason about
           | from the documentation alone.
           | 
           | Could you please expand on that? It's unclear what you're
           | referring to.
           | 
           | > The values NA vs NULL vs integer(0) are all used as
           | stand-ins for real thrown errors and knowing which one to
           | check for after calling a function can be tough.
           | 
           | `checkmate::assert_numeric()` (or similar)
           | 
           | with base R you want isTRUE():
           | 
           | `stopifnot(isTRUE(is.finite(x)))` (or is.na or anything else)
           | will error on empty values.
        
         | rcbdev wrote:
         | R's biggest moat in my opinion is its much saner package
         | management system and lower propensity to curb stomp existing
         | libraries and projects with breaking changes.
         | 
         | As a SWE I much rather inherit and maintain R services than
         | Python services.
        
         | hadley wrote:
         | If you tell me what makes R hard to integrate into data
         | pipelines I will do my best to fix it :)
        
           | dash2 wrote:
           | This guy is the man to ask ^^^^^^
        
           | wodenokoto wrote:
           | It's been more than a few years since I worked in an R shop.
           | While I loved wrangling and plotting data in the tidyverse I
           | did find the dependency management story in R to be even
           | worse than Python.
           | 
           | Maybe that's the problem?
        
           | chollida1 wrote:
           | Wow, I really appreciate the reply. As I said in another
           | comment here, I wish tidyverse was big when I was using R.
           | 
           | I was an R user from about 2003-2010.
           | 
           | We didn't have dplyr at the time, though ggplot2 was coming
           | around about then, I think. That helped a lot for easy to
           | develop visualizations.
           | 
           | But in our specific cases, the distributed libraries we used
           | were written in python and integrated well with native python
           | code. Pandas was just coming out around 2010, I think, and I
           | think multi threading was also an issue then, but I can't
           | really remember.
           | 
           | So our issue was partially that our infrastructure tooling was
           | going to python, but also we had a far easier time hiring
           | people who were proficient in python and harder to find the
           | same for R.
           | 
           | And once you start writing more code in python it starts to
           | become harder to justify two separate code bases that can do
           | the same thing so the R code got phased out and rewritten in
           | python so we could have a single code base and not have to
           | duplicate functionality in two languages.
           | 
           | Also a slight push for python came from the programmers who
           | thought python represented a better language to know for
           | their careers. Which looking back it does seem like python is
           | used more often these days in general.
           | 
           | So I guess there isn't much you could have done in this case.
           | 
           | And as a side note, thanks for all the work you've done with
           | R!!
        
           | mushufasa wrote:
           | A few of the main issues I see, as an R user who built his
           | company on python
           | 
           | - when we wanted to build a web app that processes data, it
           | was a lot more straightforward to build both in python, so we
           | can process data within the web servers instead of having to
           | manage multiple stages of infrastructure and different
           | languages. There's no Django for R.
           | 
           | - R will often do something instead of explicitly failing.
           | This is the wrong tradeoff when running a production system,
           | as if you're returning the wrong results to users you may not
           | realize it unless there's an error
           | 
           | - R reproducible builds are worse than python. That's saying
           | something because python is a pretty low bar. But running
           | production systems you can't have builds suddenly fail week
           | over week because one of a hundred packages was updated
        
             | ekianjo wrote:
             | > one of a hundred packages was updated
             | 
             | There's renv that addresses that point already:
             | https://rstudio.github.io/renv/articles/renv.html
             | 
             | > There's no Django for R.
             | 
             | Nowadays you can integrate R with WebR (WASM) in a web app:
             | https://docs.r-wasm.org/webr/latest/
        
               | hadley wrote:
               | A lighterweight alternative to renv is to use Posit
               | Public Package Manager (https://packagemanager.posit.co/)
               | with a pinned date. That doesn't help if you're
               | installing packages from a mix of places, but if you're
               | only using CRAN packages it lets you get everything as of
               | a fixed date.
               | 
               | And of course on the web side you have shiny
               | (https://shiny.posit.co), which now also comes in a
               | python flavour.
        
             | doodledoodahs wrote:
             | > R will often do something instead of explicitly failing.
             | 
             | I mentioned exception handling above, but this is more
             | specifically the problem.
             | 
             | I think it's a hard problem to solve, because the behaviour
             | of older libraries is so varied.
             | 
             | I have sometimes thought that something like a try catch
             | wrapper which pattern matched or tested the value returned
             | would be useful.
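
The idea in this comment, a wrapper that tests the returned value and fails loudly when it looks wrong, can be sketched in a few lines. This is a Python illustration rather than R, and every name in it (`checked`, `is_finite_number`, the toy `mean`) is hypothetical; it only shows the shape of the pattern, not an existing library.

```python
import functools
import math


def checked(is_valid, description="a valid result"):
    """Decorator: raise instead of returning a value that fails is_valid."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not is_valid(result):
                raise ValueError(
                    f"{func.__name__} did not return {description}: {result!r}"
                )
            return result
        return wrapper
    return decorate


def is_finite_number(value):
    """True for ints and floats that are neither NaN nor infinite."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


@checked(is_finite_number, "a finite number")
def mean(values):
    """Hypothetical legacy function: returns NaN on empty input instead of failing."""
    return sum(values) / len(values) if values else float("nan")
```

Calling `mean([1, 2, 3])` returns normally, while `mean([])` raises a `ValueError` rather than silently passing NaN downstream. The hard part the thread points at remains: each older function needs its own notion of what a bad return value looks like.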
        
               | hadley wrote:
               | I have noodled on this problem a bit in
               | https://github.com/hadley/strict, which I'm contemplating
               | bringing back to life over the coming year. It's
               | certainly very difficult to cover 100% of all possible
               | problems, but I suspect we can get good coverage of the
               | most common failure points (specifically around recycling
               | and coercion) with a decent amount of work.
        
           | doodledoodahs wrote:
           | OK, since you're here!
           | 
           | (this all prefaced with a massive thank you for tidyverse,
           | without which R is very crusty).
           | 
           | I love R for interactive work and quick analyses, but I'm
           | currently trying to integrate various bits of R code into a
           | large document-building pipeline and wishing I could use
           | Python for it:
           | 
           | - Exception handling and error processing seem a pain in R.
           | Maybe I'm doing it wrong, but it feels like a mess and not
           | nearly as ergonomic as python. tryCatch seems to have gotchas
           | related to scope because the error handling is in a function.
           | The distinction between warning, stop etc seems odd. The
           | option to stop on warnings isn't useful because older
           | packages seem to abuse warnings as messages. I have just
           | discovered `safely` which is helpful, but then you have to
           | unwrap lists in pipelines which feels clunky.
           | 
           | - Related, I _really_ wish we could just drop model objects
           | or other tibbles as single objects directly into a tibble
           | cell rather than as list(df). Unpacking lists and checking
           | objects inside them exist is much more of a pain (e.g. can't
           | just do `filter(!is.na(df_col))`)
           | 
           | - I really miss defaultdict from python, and dictionaries
           | generally.
           | 
           | - Passing variable names as strings to dynamically generate
           | things seems clunky compared with python. Again, it may be
           | because I'm doing it wrong but I end up having to wrap things
           | in !!sym the whole time and the nse semantics seem hard to
           | remember (I only use R about 20% of the time). I liked
           | cur_data() for passing a df row to a function but this now
           | seems deprecated.
           | 
           | - String formatting -- fstrings are just great. Glue is OK,
           | but escaping special characters seems more tricksy. Jinjar is
           | OK, not quite jinja.
           | 
           | - purrr is nice, but furrr just isn't a drop-in replacement.
           | Making http requests in parallel seems non-trivial compared
           | to doing it with python. Is there an easy way to do it
           | without creating multiple processes? Why can't I just do
           | something like `. %>% mutate_parallel(response=GET(url),
           | workers=10) %>% ...`?
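
For readers less familiar with the Python side of this comparison, here is a minimal sketch of three of the idioms named in the list above: `defaultdict`, f-strings, and a thread-pool map for I/O-bound calls within a single process. The `fetch` function is a hypothetical stand-in that performs no network access; a real version would call an HTTP client in its place.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_first_letter(words):
    """Group words by initial letter; missing keys start as empty lists."""
    groups = defaultdict(list)
    for word in words:
        groups[word[0]].append(word)
    return dict(groups)


def fetch(url):
    """Hypothetical stand-in for an HTTP GET; builds a fake status line."""
    return f"200 OK from {url}"


def fetch_all(urls, workers=10):
    """Run fetch over urls on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    print(group_by_first_letter(["dplyr", "purrr", "data.table"]))
    print(fetch_all(["https://example.com/a", "https://example.com/b"]))
```

Threads rather than processes are used here because the work is assumed to be I/O-bound, which is the case the comment asks about.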
        
             | laylower wrote:
             | Amen to that. Can I add the following:
             | 
             | - 5 different ways to do wide to long and long to wide over
             | the years even in the tidyverse.
             | 
             | - A lot of dependencies to connect to DBs and difficult
             | programs. Rstudio/Posit does have some premium libraries but
             | they should be made free and bundled with the tidyverse to
             | really promote the ecosystem.
             | 
             | - Shiny support to save interactive charts and tables. This
             | is a massive problem for me. If I have a heavily stylized
             | HTML table with a bunch of css, I need to rely on webshot,
             | webshot2 which are both alpha or beta versions and they are
             | poorly documented. How can I evangelize R if my deployments
             | cannot be used properly by my community?
        
               | hadley wrote:
               | What are the premium packages you're talking about? As
               | far as I know all of our R packages are 100% open source.
               | 
               | I'd love to hear more why you're using webshot etc to
               | take screenshots of your shiny app. A more typical
               | workflow would be to generate a separate HTML/PDF with
               | quarto/RMarkdown.
        
               | laylower wrote:
               | Thanks for responding and your amazing work with the
               | tidyverse. I am the "R-guy" in my finservices company and
               | we have a paid rconnect dev/qa/prod and rserver pro
               | licences for a few hundred users.
               | 
               | The packages I think are the dependencies of some DB
               | connectivity libraries.
               | https://www.rstudio.com/tags/databases/ - these are the
               | ones I was referring to.
               | 
               | Re webshot my use case is: I have a heavily modified DT
               | table in a shiny app. Users log in, play around with the
               | DT table, update ggplots etc and then download the
               | snapshot and send it to a WORD file. I can't move away
               | from word and use html or pdf because we need the word
               | file formatted by editors for publication and they need
               | to follow the corpo guidelines. So, I am having to use
               | webshot to grab a screenshot of the tagged html instead
               | of natively handling it. I tried using officedown and a
               | few other methods and it just didn't work.
               | 
               | ps: I hope the rebrand goes great and I am rooting for
               | you.
        
             | nutshell42 wrote:
             | > The distinction between warning, stop etc seems odd. The
             | option to stop on warnings isn't useful because older
             | packages seem to abuse warnings as messages.
             | 
             | Use suppressWarnings() to silence misbehaving functions or
             | withCallingHandlers() to stop or handle specific
             | conditions.
             | 
             | > Passing variable names as strings to dynamically generate
             | things seems clunky compared with python.
             | 
             | Can you give me an elegant example in Python? Because I
             | don't understand what you want to generate dynamically.
             | 
             | That said, I dislike the tidyverse solution as well. Too
             | much abstraction for not enough benefit over a base
             | solution with substitute()
        
           | mslip1 wrote:
           | Hey Hadley!! Personally only issues for me with integrating R
           | is making renv play nice in multistage docker builds. I found
           | that I need to have my other pipeline software built in the
           | same stage as my R env setup (building specific version from
           | archive, system dependencies, then r package dependencies via
           | renv)
        
           | fastaguy88 wrote:
           | (1) The big problem I have is transitioning from RStudio to a
           | pipeline (so I end up not using RStudio). A traditional
           | pipeline is going to be a script with some set of arguments
           | -- parameter values, fitting functions, and data file names,
           | that I put into a shell script and say:
           | 
           | my_plot_script.R --plot_col=g_max --output_type=pub_quality
           | data_file1 data_file2 data_file3
           | 
           | It's possible to use optparse/OptionParser() to get that
           | information (but you have an option for every argument, no
           | --param1 X --param2 Y file1 file2 file3) but it is much more
           | difficult to fit those arguments into the RStudio
           | environment. I want RStudio to be able to emulate reading
           | command line arguments (since they do not exist in RStudio).
           | Right now, I have to check to see if there are commandArgs(),
           | and, if not, do something else to get the information to the
           | RStudio script.
           | 
           | (2) There needs to be an option that says STOP if something
           | doesn't make sense. I have dozens of beautiful data plots
           | that look great, but do not in fact plot what I think
           | they do, because factors have not been properly assigned to
           | colors, shapes, or linetypes. (And it can be really hard to
           | recognize that the data has not been plotted properly.) Give
           | me an option that says, if I did not explicitly declare a
           | column a factor, and I did not specifically associate
           | colors/shapes/lines with factors, then the data will not be
           | plotted.
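
The pattern asked for in point (1), one entry point that accepts either the real command line or an explicit argument list supplied from an interactive session, can be sketched as follows. This is Python rather than R and does not address RStudio itself; the option names come from the comment, while the default values and function names are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Options plus a trailing list of data files, as in the comment's example."""
    parser = argparse.ArgumentParser(
        description="Plot one column from several data files."
    )
    parser.add_argument("--plot_col", default="g_max")       # assumed default
    parser.add_argument("--output_type", default="screen")   # assumed default
    parser.add_argument("data_files", nargs="+")
    return parser


def parse_args(argv=None):
    """argv=None reads the real command line; a list emulates one interactively."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    # Emulated command line, usable from a notebook or REPL as well.
    args = parse_args(
        ["--plot_col=g_max", "--output_type=pub_quality",
         "data_file1", "data_file2", "data_file3"]
    )
    print(args.plot_col, args.output_type, args.data_files)
```

Keeping the parser behind a function that takes an optional list means the same script body runs unchanged from a shell pipeline and from an interactive session.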
        
             | bomewish wrote:
             | On point two, can't you just use stopifnot(condition)? Then
             | log it etc?
        
         | civilized wrote:
         | > We found R has a really hard time integrating into data
         | pipelines and was best used as a standalone tool by individuals
         | 
         | In my org we have several 100% R teams (including mine) that
         | have been developing and maintaining business-critical, data-
         | intensive applications for a decade now. We don't find R
         | difficult to integrate into data pipelines. We write our data
         | pipelines in R, and we find it very efficient to do so. They
         | talk to databases, APIs, command line tools, etc without issue.
         | 
         | Doing what we do in Python is unimaginable, especially if
         | pandas is the tabular lingua franca in the team. I vehemently
         | agree with this article on the clunkiness of pandas from a
         | sister comment: https://www.sumsar.net/blog/pandas-feels-
         | clunky-when-coming-.... Compared to dplyr and the tidyverse,
         | pandas very noticeably gets in your way rather than being a
         | tool of thought. (For what it's worth, there are other teams in
         | my org that use Python for entirely justified reasons, and they
         | use polars these days, not pandas.)
         | 
         | If I had to complain about anything in R these days, it would
         | be the increasing complexity and illegibility of error
         | messages. Tidyverse tracebacks are often dozens or hundreds of
         | lines. This is made much worse if you have a web app in the
         | Shiny framework, as Shiny seems to mangle and garble what
         | little useful information you can get (my kingdom for an error
         | with a file name and line number). Even outside of advanced
         | packages like Shiny, the reporting of error messages suffers
         | from some clunkiness and irregularity.
         | 
         | As an expert user, I can usually squint at the error barrage
         | and infer what is really going on, but it's probably quite
         | confusing and off-putting to newer users.
         | 
         | Overall though, I'm not seeing any competition for R in our
         | space. My fondest hope is that in the coming decades there
         | arises a new, thoughtfully designed language with the Lispy
         | flexibility of R, but also optional type safety and static
         | analysis affordances. I'm not sure if that's even possible, but
         | I hope the computer science geniuses figure out a way.
        
           | samstave wrote:
           | Don't take this as flippant or lame 'AI BOI' etc...
           | 
           | But what if you had a GPT summarize the error messages and
           | tell IT to look for certain things...
           | 
           | Why not train an llm on all your error messages and fixes?
           | 
           | I know that sounds sophomoric, but it's an honest Q.
           | 
           | ---
           | 
           | I worked at one of the first VOIP startups in the late 90s -
           | before Cisco had entered the game...
           | 
           | We had all our server error logs printed on a LINE PRINTER in
           | the server room...
           | 
           | (This was on Mission Street above the Denny's - with PISS
           | alley behind the building)
           | 
           | My CTO (I was IT Mr. Mgr at the time) - required that I go
           | through the error logs every morning with a HIGHLIGHTER to
           | read each error message for hacking attempts...
           | 
           | You have no idea how life-sucking that part of my job was....
        
           | hadley wrote:
           | If you have specific issues around error messages and
           | tracebacks please feel free to let me know directly or to
           | file issues on Github. We really do care about the legibility
           | of errors and tracebacks and me and my team have put a lot of
           | effort into them in the last few years. But there's always
           | room to do better and I'd love to know where the pain points
           | are.
           | 
           | (The intersection of tidyverse and shiny tracebacks is a
           | known pain point that's hard to resolve. Unfortunately shiny
           | and tidyverse did a bunch of parallel work that took us in
           | slightly different directions and now it's hard to re-align.)
           | 
           | One thing we are missing is a guide to reading tracebacks for
           | newer users. Often experts can get a good sense of where the
           | problem is, but we've failed to teach newer users how to get
           | the most value from a traceback.
        
             | civilized wrote:
             | There's clearly been a ton of progress in this area; the
             | only issue is that feature development is even faster :)
             | I'll keep an eye out for specific issues that seem helpful
             | to raise.
             | 
             | The biggest one I have right now is a little niche, but
             | probably useful to address. Moderately complex dbplyr
             | pipelines on wide tables have a tendency to generate very
             | long queries, and if there's an error, the generated SQL
             | returned tends to overflow some text or line limit allotted
             | to show the error at the command prompt. My workaround is
             | to use sink() to dump the error to a file, which is a
             | little painful as the sink() API and documentation are not
             | the most straightforward or intuitive. (Hmm, I wonder if a
             | withr wrapper would help me make something simpler to
             | use...)
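
The workaround described above (send the full error text to a file when the console truncates it) can be sketched outside R as well. This is a minimal Python illustration of the pattern only, not of R's sink() API; `run_and_log_error`, `failing_query`, the column names, and the log file name are all hypothetical.

```python
import tempfile
from pathlib import Path


def run_and_log_error(action, log_path, preview_chars=200):
    """Run action(); on failure, write the full error text to log_path
    and return only a short preview suitable for a console."""
    try:
        action()
    except Exception as exc:
        full_text = f"{type(exc).__name__}: {exc}"
        Path(log_path).write_text(full_text, encoding="utf-8")
        return full_text[:preview_chars]
    return None


def failing_query():
    # Hypothetical stand-in for a generated query over a very wide table.
    columns = ", ".join(f"col_{i}" for i in range(500))
    long_sql = f"SELECT {columns} FROM wide_table"
    raise RuntimeError(f"syntax error in generated SQL: {long_sql}")


log_file = Path(tempfile.gettempdir()) / "query_error.log"
preview = run_and_log_error(failing_query, log_file)
full_error = log_file.read_text(encoding="utf-8")
```

The console sees only `preview`, while the complete generated SQL stays available in the log file for inspection.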
        
           | xapata wrote:
           | Why Polars and not Dask?
        
           | ramblenode wrote:
           | > My fondest hope is that in the coming decades there arises
           | a new, thoughtfully designed language with the Lispy
           | flexibility of R, but also optional type safety and static
           | analysis affordances.
           | 
           | I think many of us saw Julia as the successor to R.
           | Unfortunately, the package ecosystem---one of R's strongest
           | points---still has a long way to go.
        
             | civilized wrote:
             | I was excited about Julia too but it now seems to be a
             | relatively niche HPC language. It's about saving CPU time
             | more than user time.
             | 
             | My sniff test for a successor language to R is whether it
             | can replicate the tidyverse API with 100% fidelity. The API
             | is already optimal for tabular data analysis, especially
             | the dplyr core. It can be thought of as a specification for
             | other languages to implement.
             | 
             | There is a great deal about how R works that is negotiable.
             | But if the language can't implement dplyr to spec, or
             | somehow doesn't "want to", it's not the language for the
             | audience served by the tidyverse.
        
         | samstave wrote:
         | I'm somewhat of an armchair data scientist myself.
         | 
         | However - I'd love to learn your ways; specifically - what are
         | your best recommendations for python over R?
         | 
         | Specifically, even though my R skills are weak - I think that
         | RStudio is pretty darn amazing - what do you recommend over
         | RStudio?
         | 
         | I'd truly like to hear what a good toolbox looks like from your
         | perspective these days (especially now this little GPT toddler
         | is bonking into everything in my domain)
        
         | richrichie wrote:
         | I quit R a while ago - before data science became a thing - and
         | switched to Julia for such tasks. R has lots of stats packages,
         | but it is too esoteric and specialised a language to be useful
         | IMO.
        
         | wodenokoto wrote:
         | Did you work with RStudio Server and still found it not
         | collaborative enough?
        
         | tropical333 wrote:
         | > What we found was that R had alot of packages but most
         | haven't been touched in years and when you contact the owner
         | you find they've often moved onto the python/pandas/scikit eco
         | system
         | 
         | As a "bilingual" R & Python user, I've found this to be true
         | for the latter language as well :)
         | 
         | I don't have much to add on top of what other useRs have
         | mentioned, except another testimonial that our company has
         | successfully used R in production for 6+ years, from data
         | "pipeline" stuff you mentioned to dozens upon dozens of
         | predictive models of varying complexities.
         | 
         | When faced with a new data analysis ask, 99%+ of the time I
         | reach for R (although without the tidyverse, that number would
         | be much lower). Like another commenter said, the ease by which
         | you can plot in R blows Python away. Seaborn seems like a
         | decent compromise in my limited experience, but plotting in
         | "base" matplotlib makes me want to die.
        
           | disgruntledphd2 wrote:
           | Plotnine is a pretty rocking ggplot clone in Python. Just
           | import star and you're golden.
        
             | hadley wrote:
             | We (Posit) have hired Hassan (the maintainer of plotnine)
             | so this is great to hear :)
        
               | disgruntledphd2 wrote:
               | It definitely is! If you could hurry up and destroy
               | Jupyter notebooks that would be sweet ;)
        
         | nojito wrote:
         | >For those of you who us R still what is your use case?
         | 
         | Still the best replacement for EDA and reproducible analysis
         | that used to be done in Excel.
        
         | vharuck wrote:
         | Why I still use R for analysis at work:
         | 
         | - R Markdown is just great for static reports. We use PowerBI
         | or ArcGIS for interactive stuff.
         | 
         | - GIS is a breeze. My work provides licenses for ArcGIS, which
         | has a Python library for scripting. Despite that, it is so much
         | easier to do stuff in R, which can read and create ArcGIS
         | shapefiles.
         | 
         | - Exploratory data analysis is easy. Often, before meetings,
         | I'll connect to the database in R and make a few basic tables.
         | Then I can query, aggregate, or plot data sitting in the meeting.
         | I have custom ggplot themes in a package, so even my hastily
         | created plots look nice.
         | 
         | - RStudio is amazing. What it lacks in editing tricks, it more
         | than makes up for in simplifying R-specific tasks. Showing
         | plots is automatic, rendering and viewing markdown reports (of
         | any type) is two buttons, testing and building a package are
         | each two buttons.
         | 
         | - I spent a _lot_ of time evangelizing R (team-wide
         | presentations, being the "R guy" for troubleshooting,
         | organizing an R User Group with members from different teams,
         | creating an internal package repository). Some became happy
         | converts, the rest begrudgingly accepted it as a tool we would
         | use. I don't know if I could do it again with another language.
         | 
         | I'll admit my work doesn't get incorporated into pipelines. We
         | get the data, analyze it, create reports, and share the reports
         | by email or on our public website. The statisticians are
         | segregated from the developers here. State government resists
         | change, especially role changes that don't match grants' or
         | laws' wording.
        
           | ekianjo wrote:
           | > R Markdown is just great for static reports.
           | 
           | Quarto (also supporting R) is a good replacement for
           | rmarkdown (with a saner syntax) and I say this as someone who
           | has extensively used rmarkdown over the years.
        
         | sebastianavina wrote:
         | RStudio is a nice tool for making some quick graphs on data,
         | descriptive analysis and quickly exploring a dataset. Building
         | some reports, or manipulating small datasets for beautiful
         | graphs.
         | 
         | For anything else, we use Python.
        
         | axpy906 wrote:
         | I stopped using it in 2015, when I began to learn how to code.
         | 
         | At my FAANG company, there are teams that use it for
         | econometrics. I think that's R's sweet spot, still in 2024.
        
         | minimaxir wrote:
         | There is currently no Python equivalent for both the ease of
         | use and output quality of ggplot2 for data visualization. Many
         | have tried over the past _decade_, but none have gotten close.
         | (Plotnine was the closest: per Hadley in another comment his
         | company hired the maintainer)
        
         | trts wrote:
         | two reasons for me
         | 
         | 1) tidyverse makes prodding and plotting my data faster and
         | more enjoyable. when I am prototyping a model I'll sometimes do
         | the groundwork in R and then migrate the production version to
         | python
         | 
         | 2) I can't seem to write data wrangling code in py that is as
         | aesthetically pleasing and easy to reinterpret later. could
         | just be that I started in R, but while the methods in pandas
         | "work" I don't always totally understand why they work the way
         | they do. with tidy it works the way I expect and feels easier
         | to read back and iterate on
        
         | winwang wrote:
         | Was there any performance difference between R and Python in
         | your case?
        
         | clatan wrote:
         | I'm an old R user forced to mostly use python because that's
         | what the team uses.
         | 
         | R is so much better than python in many areas concerning data
         | pipelines: connecting with external database systems through a
         | unified API, superior data munging utilities, as well as
         | plotting, a more comprehensive (obviously) statistical analysis
         | toolset.
         | 
         | I even find rmarkdown vastly superior to jupyter.
         | 
         | But IMO the best reason to use R rather than Python is that its
         | tools will make you approach the problem as a statistician
         | rather than a programmer.
        
         | dxbydt wrote:
         | >what is your use case?
         | 
         | if you are doing Bayesian stats, fitting hierarchical models,
         | or using Stan in any serious capacity, R/Stan is so much more
         | ergonomic than Pystan. Here's a long list of pros-cons:
         | 
         | https://discourse.mc-stan.org/t/various-observations-on-rsta...
        
       | clircle wrote:
       | Unfortunately there are 56 other data science with R books, so
       | what is the differentiating factor here?
        
         | countrymile wrote:
         | It's the HarvardX course
        
       | benreesman wrote:
       | I'm looking at R seriously for the first time.
       | 
       | I've got a decade in with Python numeric computing, and I'm
       | interested in Julia and all of the cutting-edge stuff.
       | 
       | I've only dabbled with R until now, and I haven't researched it
       | enough to know if rumors of its inevitable demise have any
       | substance.
       | 
       | There are a lot of interesting math problems other than training
       | gigantic neural networks on NVIDIA gear, and I've got some
       | Computer Algebra System / ergonomic linear modeling needs on a
       | current project:
       | 
       | I need the best tool for someone who is messing with Black-
       | Scholes type stuff, who is still building the fidelity with
       | tricky antiderivatives by hand, but I have enough fundamentals to
       | check the computer's work.
       | 
       | What role should R play here?
        
         | laylower wrote:
         | I love R. You could do it in R. But a lot of the derivations and
         | Math Finance stuff you can and should be able to do in C/C++. R
         | packages mostly depend on those as well for heavy duty calcs.
         | 
         | So, if I wanted to dabble I'd easily use R and if I was in the
         | quant developer world I'd be doing C/C++
        
           | helsinki wrote:
           | I work with a trading team that manages $1B, exclusively with
           | R.
        
             | mamonster wrote:
             | Second this, seen lots of funds use whatever language their
             | lead QR/QT feels comfortable with. At the end of the day,
             | if you aren't running a strategy that requires colocation
             | on the exchange, whatever speed improvement you get from
             | the language will usually disappear from the network
             | latency.
             | 
             | Something like intraday momentum/sector rotations can
             | easily be done entirely in Python/R, from what I've seen.
        
               | benreesman wrote:
               | Likewise interested if a pro has any consulting hours to
               | spare :)
        
               | mamonster wrote:
               | Sorry, unfortunately do not do consulting.
               | 
               | In your other comment, you said you are looking to price
               | "weird derivatives". How weird are we talking? If it's OTC
               | I won't be able to help anyway, if it's standard then I
               | can at least try to point you in the right direction. The
               | fact you mention Black Scholes makes me think it might be
               | something closer to "vanilla" than the other way around.
        
             | benreesman wrote:
             | I have to price some weird derivatives.
             | 
             | You do any consulting on non-adjacent areas?
        
           | iainctduncan wrote:
           | I've done some work for scientists where they used C++
           | extensions to R for heavy number crunching. For their
           | workflow, R is really nice. Don't know how common this is
           | though.
        
             | nequo wrote:
             | Rcpp is pretty common in major performance sensitive
             | packages. The CppCast did an interview with Dirk
             | Eddelbuettel about it in 2022:
             | 
             | https://cppcast.com/rcpp/
        
         | dxbydt wrote:
         | quite easy to price derivatives with R. I have a degree in
         | finmath from uchicago, where derivative pricing was taught
         | using Matlab and R. But in the last semester we were told - oh
         | yeah when you go out there into the real world and start
         | working for the banks you can't use civilized tools like R and
         | Matlab. So you have to take this mandatory class on cpp. There
         | once was a guy named stroustrup and this shit here is called a
         | makefile... after graduation i worked for BofA and yes, the
         | quant world is completely C++. But there are small funds (few
         | billion dollars) that do their own shit in R, Haskell, Q/kdb,
         | others. Very doable in R.
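
For readers following the pricing subthread, here is a minimal sketch of the closed-form Black-Scholes price for a European call and put. It is written in Python purely for illustration (the thread itself is about R and C++), the parameter values are made up, and put-call parity serves as the kind of hand check on the computer's work mentioned upthread.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes(spot, strike, rate, vol, years):
    """Return (call, put) prices for a European option on a
    non-dividend-paying asset under the Black-Scholes model."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol**2) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    discount = exp(-rate * years)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


# Illustrative inputs only (assumed, not taken from the thread).
spot, strike, rate, vol, years = 100.0, 100.0, 0.05, 0.2, 1.0
call, put = black_scholes(spot, strike, rate, vol, years)

# Put-call parity: call - put should equal spot - strike * exp(-rate * years).
parity_gap = (call - put) - (spot - strike * exp(-rate * years))
```

A `parity_gap` close to zero is a quick sanity check that the two formulas agree; anything more exotic than a vanilla European option would need a different method.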
        
       | tagyro wrote:
       | I love the power of R, especially when used for "stupid" stuff
       | [0]
       | 
       | + extra points for using quarto
       | 
       | [0]: https://gist.github.com/mine-cetinkaya-rundel/03d7516dea1e5f...
        
       | AndyMcConachie wrote:
       | I dipped my feet into R a few years back, but eventually stopped
       | it because of the way it handles integers. At the time it treated
       | all integers internally as signed 32-bit and if the number is too
       | large for that it converted it to a float.
       | 
       | I don't know what R does now, but this was a deal breaker for me
       | at the time because I was dealing with really large integers that
       | regularly broke this limit.
        
         | armchairhacker wrote:
         | Integers are still only 32 bits. There's a class which
         | effectively represents 64-bit integers
         | (https://www.rdocumentation.org/packages/csvread/versions/1.2...)
         | as well as arbitrary-sized ones
         | (https://cran.r-project.org/web/packages/gmp/index.html,
         | https://www.rdocumentation.org/packages/gmp/versions/0.7-4/t...).
         | I will say there are a few pitfalls where the integer bits
         | are unexpectedly converted to something else, but it's
         | workable.
        
         | nutshell42 wrote:
         | Finally a real reason.
         | 
         | A lot of the stuff above was complaining about issues where
         | Python is a lot worse than R, about non-issues or with a
         | fundamental misunderstanding of the language. I'd given up hope
         | of seeing a real weakness named as such :)
         | 
         | There is bit64 and doubles being used as 53-bit pseudo-integers
         | - but if I needed 64-bit integers, R wouldn't be my first
         | choice, definitely.
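
The two limits discussed in this subthread can be made concrete with a short sketch. It uses Python only for illustration and shows generic signed 32-bit and IEEE 754 double behaviour, not R's actual implementation; the helper names are hypothetical.

```python
INT32_MIN = -(2**31)       # smallest signed 32-bit integer
INT32_MAX = 2**31 - 1      # largest signed 32-bit integer
DOUBLE_EXACT_LIMIT = 2**53  # every integer up to this magnitude fits a double exactly


def fits_int32(n):
    """True if n can be stored in a signed 32-bit integer."""
    return INT32_MIN <= n <= INT32_MAX


def survives_double_roundtrip(n):
    """True if converting n to a double and back loses nothing."""
    return int(float(n)) == n


# Just past the 32-bit limit a double still holds the value exactly...
just_over_int32 = INT32_MAX + 1
ok_as_double = survives_double_roundtrip(just_over_int32)

# ...but one past 2**53 the double silently rounds to a neighbouring value.
lossy = not survives_double_roundtrip(DOUBLE_EXACT_LIMIT + 1)
```

This is why falling back to doubles works as a "53-bit pseudo-integer" for counts and IDs of moderate size, but not for full 64-bit values.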
        
       | SomeoneFromCA wrote:
       | In my case, I found R a better tool for learning DS, as it is,
       | more or less, a DSL for statistics, and feels more low level and
       | forces you to learn more fundamentals than Python. For production
       | it is probably worse than Python, true.
        
         | Cosi1125 wrote:
         | It's not a DSL.
        
           | SomeoneFromCA wrote:
           | It is de facto.
        
             | Cosi1125 wrote:
             | Why do you think that? (I'm legitimately curious.)
        
       | uptownfunk wrote:
       | No better tool for EDA and data analysis than R and RStudio. Fell
       | in love in stat 133 at Cal and now while I am doing software
       | engineering I have very fond memories of writing R and tidyverse
        
       | SoftTalker wrote:
       | I took a two or three day on-site intro to R class that my
       | employer put together. Perhaps it was not a great class, but as a
       | seasoned software developer familiar with a number of imperative
       | and functional languages I was baffled by R. It felt like a bunch
       | of little functions that had been developed by different people
       | with no consistent framework, and thrown together in some kind of
       | big wrapper. I know it's popular among statisticians and
       | researchers, so I think a prerequisite must be a good fluency with
       | statistics (I don't have that). Maybe it makes more sense if you
       | think like a statistician. As a programmer I felt like nothing I
       | learned about R contributed to developing an intuitive
       | understanding of any of the rest of it.
        
         | pinewurst wrote:
         | I think of R as a programming language designed by people who'd
         | heard about programming languages but never actually used one
         | before. It's great for ad-hoc analysis without having to think
         | about production systems.
        
         | hadley wrote:
         | R definitely has its warts, but I strongly believe that
         | underneath them lies a beautiful and quite elegant language
         | that's extremely well suited to the challenges of data
         | analysis. If you're already a programmer, you might find
         | something like Advanced R (https://adv-r.hadley.nz) to be
         | useful to get a sense of what R really is as a programming
         | language.
        
         | mint2 wrote:
         | I get a similar impression but to contextualize, in terms of
         | statistical programming what you're saying is even more so true
         | of what came before R, but a thousand fold worse. In that
         | context R is fantastic.
         | 
         | For example SAS makes R look beautiful and consistent. And
         | that's more a comment on SAS than R. And this isn't to say
         | python is perfect either, but I prefer it.
        
       | AlbertCory wrote:
       | In Google "data science" circa 2009 (although we didn't call it
       | that), R was the weapon of choice.
       | 
       | I consider it a bad relic of the 70's. It doesn't have a
       | "learning curve" -- it has a "learning straight line." Even when
       | you're experienced and semi-competent at it, it's still difficult
       | and surprising.
        
       | haunter wrote:
       | CS50 will also be available with R starting this summer
       | https://www.edx.org/learn/r-programming/harvard-university-c...
        
       | tea-coffee wrote:
       | What makes this book different from R for Data Science by Hadley
       | Wickham, Mine Cetinkaya-Rundel, & Garrett Grolemund?
        
       | yaomingite wrote:
       | That's a comprehensive guide. If anyone wants a similar
       | introduction, with interactive exercises to try while they study
       | this is also a good resource:
       | https://www.codecademy.com/learn/learn-r
        
       | 29athrowaway wrote:
       | R is the PHP of data science. It is productive, it has a large
       | ecosystem, lots of functionality, but it grew fast and
       | organically and not in a well-planned manner, making it not
       | consistent and a bit messy to work with.
       | 
       | If you have to use R, use the tidyverse.
       | 
       | https://www.tidyverse.org/
       | 
       | I like R and use it often as I find it more concise to work with
       | than Python for simple statistical purposes. I forced myself to
       | use R instead of spreadsheets and don't regret it.
       | 
       | This is one of the reasons why (thanks, Zed Shaw)
       | https://web.archive.org/web/20110702162929/https://zedshaw.c...
        
       ___________________________________________________________________
       (page generated 2024-03-02 23:00 UTC)