[HN Gopher] R: Introduction to Data Science (2019)
___________________________________________________________________
R: Introduction to Data Science (2019)
Author : tosh
Score : 171 points
Date : 2024-03-02 10:03 UTC (12 hours ago)
(HTM) web link (rafalab.dfci.harvard.edu)
(TXT) w3m dump (rafalab.dfci.harvard.edu)
| asicsp wrote:
| See also:
|
| "R Programming for Data Science" https://leanpub.com/rprogramming
| chollida1 wrote:
| I'm an old R user, now fully migrated to Python.
|
| For those of you who still use R, what is your use case?
|
| We found R has a really hard time integrating into data pipelines
| and was best used as a standalone tool by individuals, which
| doesn't really work in our particular professional setup where
| everyone works collaboratively together.
|
| What we found was that R had a lot of packages but most haven't
| been touched in years and when you contact the owner you find
| they've often moved onto the python/pandas/scikit ecosystem
| medstrom wrote:
| Basically with tidyverse, R can let you write less code and
| keep it readable: https://www.sumsar.net/blog/pandas-feels-
| clunky-when-coming-...
|
| Can't speak to abandonment, but it seems a lot of recent
| development is occurring inside the tidyverse, which is
| deprecating a whole bunch of other stuff.
| chollida1 wrote:
| I will agree that I left just as tidyverse was coming of age
| and I'm sometimes jealous I never got to use it.
|
| What Hadley Wickham has done is very impressive.
| guccigav wrote:
| Exactly what you said, R is easy to get started for individuals
| in social science fields. Most people I know who want to dive
| deeper end up learning Python anyway.
| countrymile wrote:
| This is true. I've found teaching R to get things done is
| very fast and readable, and fully achievable in a semester
| compared to teaching pandas (and also having to teach how to
| program in Python).
| ildjarn wrote:
| R is much better for REPL style development and functional
| programming.
|
| Python could be so much better with some minor syntax
| extensions.
| chollida1 wrote:
| I find that with vscode and the immediate window I get a
| decent repl.
|
| What about R's language makes it better for REPL-driven
| development?
| pama wrote:
| If you haven't used Emacs ESS it may be hard to explain
| what you're missing with a true REPL, but if you had used
| it in the past and add tidyverse to it, you basically have
| super smooth interactive editing with the ability to
| quickly pull text from other REPLs, past notes or scripts.
| Contrary to the similar Python REPL, you can easily pull
| and edit chains of multiple commands via R's piping.
| lottin wrote:
| For one thing, R code can be written more concisely, due to
| the fact that the language is vector-based and
| functionally-oriented.
| pama wrote:
| My main use case is making high quality visualizations for
| quick data exploration and sharing with a team. It is easy to
| guarantee that fonts are large enough, style is minimalist and
| clean, and filtering, transforming views or facets iteratively
| is only a couple characters change.
| rpier001 wrote:
| R is the better EDA language by far. Python has caught up a
| lot. Notebook diffs are now readable in git with the right
| tooling, that's huge.
|
| The drum about not fitting into data pipelines... if you're
| literally using a bash pipe, it's true most R programmers have
| no idea how to do that. Otherwise, that is where Docker and k8s
| shine.
|
| On packaging. R's package authority runs tests and ensures that
| all packages work with the latest version of their peers. The
| dependency heck is much less deep as a result.
|
| We use R at my employer still because we put statistical data
| science into production. Our experts come to us comfortable
| with R. Reimplementation would be absurd.
| yabbs wrote:
| Bayesian stats
|
| Traditional stats
|
| Very fast iteration for data exploration in REPL (VS Code or
| RStudio).
|
| Prefer pipeline workflows (Tidyverse/magrittr).
|
| Prefer functional
|
| Prefer array based.
|
| Prefer 1-indexed arrays (yes there are some of us).
| usgroup wrote:
| I think it's more intuitive for statistical applications where
| Python is grossly under-represented. This includes things like
| the design and analysis of experiments but also lots of domain
| specific statistics and algorithms such as in bioinformatics,
| chemistry, and so on.
|
| Typically those applications are not the sort of line-of-
| business enhancements ML in Python is more tuned to. I.e.
| recommender systems, NN models, and so on.
| doodledoodahs wrote:
| The consistency of model specification across multiple
| libraries is really helpful (base lm, lme4, brms etc). Even
| though the syntax is sometimes extended, it seems consistent
| enough to mostly be comprehensible/guessable.
| transcriptase wrote:
| Bioinformatics, particularly genetic mapping and population
| genomics. There's an entire ecosystem of very mature tools
| actively maintained by labs to add analyses pertaining to
| advancements in the field, without breaking pipelines or
| silently changing the results of a given analysis from version
| to version.
|
| Take something like adegenet, where the manual itself is
| approaching 200 pages:
|
| https://cran.r-project.org/web/packages/adegenet/adegenet.pd...
| nomilk wrote:
| About half our team can wrangle and plot as fast as we can
| think of ideas. It creates an incredibly tight cycle time
| between us having ideas and getting answers; sometimes many
| (e.g. 10-20+) of those cycles in a single meeting. Before we
| used R, it would require someone jotting down things to
| investigate and reporting back in the next meeting. But we can
| do ~80% of whatever people can think of on the spot (more
| involved research questions can take more time).
|
| The unique qualities of R that allow this are that it's so easy
| to use, extremely reliable for package installation (problems
| occur approximately never), and the tidyverse makes it
| incredibly easy to translate ideas into code, not only in its
| broad, easy to understand and powerful vocabulary, but in there
| being little 'nesting' required; instead working left to right
| and top to bottom (via the magrittr pipe) - i.e. your code, for
| the most part, is like reading a page in a book.
| jslakro wrote:
| What Python learning material do you recommend focused on data
| science?
| downrightmike wrote:
| I like this and they have a video course on O'Reilly
| https://www.amazon.com/Python-Programmers-Artificial-
| Intelli...
| zhdc1 wrote:
| I've transitioned a lot of my work over to Julia, but R is
| still the most intuitive language I've used for scripting out
| data collection, cleaning, aggregation, and analysis cases.
|
| The ecosystem is simply better. The folks who maintain CRAN do
| a fantastic job. I can't remember the last time a library
| incompatibility led to a show stopper. This is a weekly
| occurrence in Python.
| klmr wrote:
| > _I can't remember the last time a library incompatibility
| led to a show stopper._
|
| Oh, it's _very_ common unless you basically only use < 5
| packages that are completely stable and no longer actively
| developed: packages break backwards compatibility all the
| time, in small and in big ways, and version pinning in R
| categorically does not work as well as in Python, despite all
| the issues with the latter. People joke about the complex
| packaging ecosystem in Python but at least there is such a
| thing. R has no equivalent. In Python, if you have a
| versioned lockfile, anybody can redeploy your code unless a
| system dependency broke. In R, even with an 'renv' lockfile,
| installing the correct package versions is a crapshoot, and
| will frequently fail. Don't get me wrong, 'renv' has made
| things _much_ better (and 'rig' and PPM also help in small
| but important ways). But it's still dire. At work we are
| facing these issues every other week on some code base.
| apwheele wrote:
| Agree with this, I am pretty agnostic to the pandas vs R
| whatever stuff (I prefer base R to tidyverse, and I like
| pandas, but realize I am old and probably not in majority
| based on comments online). But many teams who are "R
| adherent" folks I talk to are not deploying software in
| varying environments so much as reporting shops doing ad-
| hoc analytics.
|
| For those who want to use both R/python, I have notes on
| using conda for R environments,
| https://andrewpwheeler.com/2022/04/08/managing-r-
| environment....
| wodenokoto wrote:
| At my old job we snapshotted CRAN and pinned versions of
| package dependencies _against_ CRAN.
| hadley wrote:
| We now provide snapshotted CRAN binaries (for many
| platforms) at https://packagemanager.posit.co.
| disgruntledphd2 wrote:
| Can you not just build your own code as a package and
| specify exact dependencies?
|
| It's a bit of faff but that seems like it _should_ work
| (but maybe I'm missing something).
| getoffmycase wrote:
| I basically don't use anything outside of tidyverse or base
| R because of the package dependency issues.
| hadley wrote:
| I'd love to hear more about this because from my
| perspective renv does seem to solve 95% of the challenges
| that folks face in practice. I wonder what makes your
| situation different? What are we missing in renv?
| ekianjo wrote:
| tidymodels is miles ahead of the toys you have in python for
| traditional machine learning. of course Python is much better
| in other areas but that is a big reason to use R, together with
| the super powerful tidyverse syntax.
|
| and package management is much, much more reliable in R than in
| python.
| disgruntledphd2 wrote:
| Is tidymodels better than sklearn? As honestly sklearn is
| one of the few things I was jealous of from the python
| ecosystem, historically.
| acc_297 wrote:
| PK/PD work for pharmaceutical data analysis. I didn't like
| using R at first but I've come to appreciate the speed that
| comes with months of experience.
|
| It's a language which feels like it has a lot of magical
| incantations you need to remember - the default namespace is
| much more crowded. Functions like sapply vs mapply are tricky
| to reason about from the documentation alone. The values NA vs
| NULL vs integer(0) are all used as standins for real thrown
| errors and knowing which one to check for after calling a
| function can be tough.
|
| But after using it for a few hundred hours to do data
| processing and statistical regression it's hard to imagine
| python or Julia being faster to use. But in all honesty for the
| pharmaceutical industry it's mostly momentum that keeps R on
| top, the same reason they use a lot of FORTRAN90.
| klmr wrote:
| > _But in all honesty for the pharmaceutical industry it's
| mostly momentum that keeps R on top_
|
| I can't agree with this: especially in PK/PD, R is only just
| now taking over from the previous (closed-source) systems.
| Momentum would keep R _out_ , not _in_.
| nutshell42 wrote:
| > Functions like sapply vs mapply are tricky to reason about
| from the documentation alone.
|
| Could you please expand on that? It's unclear what you're
| referring to.
|
| > The values NA vs NULL vs integer(0) are all used as
| standins for real thrown errors and knowing which one to
| check for after calling a function can be tough.
|
| `checkmate::assert_numeric()` (or similar)
|
| with base R you want isTRUE():
|
| `stopifnot(isTRUE(is.finite(x)))` (or is.na or anything else)
| will error on empty values.
| rcbdev wrote:
| R's biggest moat in my opinion is its much saner package
| management system and lower propensity to curb stomp existing
| libraries and projects with breaking changes.
|
| As a SWE I'd much rather inherit and maintain R services than
| Python services.
| hadley wrote:
| If you tell me what makes R hard to integrate into data
| pipelines I will do my best to fix it :)
| dash2 wrote:
| This guy is the man to ask ^^^^^^
| wodenokoto wrote:
| It's been more than a few years since I worked in an R shop.
| While I loved wrangling and plotting data in the tidyverse, I
| did find the dependency management story in R to be even
| worse than Python.
|
| Maybe that's the problem?
| chollida1 wrote:
| Wow, I really appreciate the reply. As I said in another
| comment here, I wish tidyverse was big when I was using R.
|
| I was an R user from about 2003-2010.
|
| We didn't have dplyr at the time, though ggplot2 was coming
| around about that time I think. That helped a lot for easy to
| develop visualizations.
|
| But in our specific cases, the distributed libraries we used
| were written in python and integrated well with native python
| code. Pandas was just coming out around 2010, I think, and I
| think multi threading was also an issue then, but I can't
| really remember.
|
| So our issue was partially that our infrastructure tooling was
| going to python, but also we had a far easier time hiring
| people who were proficient in python and harder to find the
| same for R.
|
| And once you start writing more code in python it starts to
| become harder to justify two separate code bases that can do
| the same thing so the R code got phased out and rewritten in
| python so we could have a single code base and not have to
| duplicate functionality in two languages.
|
| Also a slight push for python came from the programmers who
| thought python represented a better language to know for
| their careers. Which looking back it does seem like python is
| used more often these days in general.
|
| So I guess there isn't much you could have done in this case.
|
| And as a side note, thanks for all the work you've done with
| R!!
| mushufasa wrote:
| A few of the main issues I see, as an R user who built his
| company on python
|
| - when we wanted to build a web app that processes data, it
| was a lot more straightforward to build both in python, so we
| can process data within the web servers instead of having to
| manage multiple stages of infrastructure and different
| languages. There's no Django for R.
|
| - R will often do something instead of explicitly failing.
| This is the wrong tradeoff when running a production system,
| as if you're returning the wrong results to users you may not
| realize it unless there's an error
|
| - R reproducible builds are worse than python. That's saying
| something because python is a pretty low bar. But running
| production systems you can't have builds suddenly fail week
| over week because one of a hundred packages was updated
| ekianjo wrote:
| > one of a hundred packages was updated
|
| There's renv that addresses that point already:
| https://rstudio.github.io/renv/articles/renv.html
|
| > There's no Django for R.
|
| Nowadays you can integrate R with WebR (WASM) in a web app:
| https://docs.r-wasm.org/webr/latest/
| hadley wrote:
| A lighterweight alternative to renv is to use Posit
| Public Package Manager (https://packagemanager.posit.co/)
| with a pinned date. That doesn't help if you're
| installing packages from a mix of places, but if you're
| only using CRAN packages it lets you get everything as of
| a fixed date.
|
| And of course on the web side you have shiny
| (https://shiny.posit.co), which now also comes in a
| python flavour.
| doodledoodahs wrote:
| > R will often do something instead of explicitly failing.
|
| I mentioned exception handling above, but this is more
| specifically the problem.
|
| I think it's a hard problem to solve, because the behaviour
| of older libraries is so varied.
|
| I have sometimes thought that something like a try catch
| wrapper which pattern matched or tested the value returned
| would be useful.
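|
| For illustration, the wrapper idea described above could look
| like this in Python. It is only a sketch: `fail_on`,
| `looks_missing` and `legacy_mean` are hypothetical names, and
| which return values count as a silent failure is an assumption
| that would differ per library.

```python
import math
from functools import wraps


def fail_on(is_bad, description):
    """Decorator factory: raise if the wrapped function returns a 'bad' value."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if is_bad(result):
                raise ValueError(
                    f"{func.__name__} returned {description}: {result!r}"
                )
            return result
        return wrapper
    return decorator


def looks_missing(value):
    """Assumed rule: None, NaN and empty containers count as silent failures."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, (list, tuple, dict, str)) and len(value) == 0:
        return True
    return False


@fail_on(looks_missing, "a missing or empty value")
def legacy_mean(values):
    # Hypothetical stand-in for an older routine that returns NaN
    # on empty input instead of raising.
    return sum(values) / len(values) if values else float("nan")
```

| Calling `legacy_mean([])` then raises instead of quietly
| passing a NaN down the pipeline, while valid input is returned
| unchanged.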
| hadley wrote:
| I have noodled on this problem a bit in
| https://github.com/hadley/strict, which I'm contemplating
| bringing back to life over the coming year. It's
| certainly very difficult to cover 100% of all possible
| problems, but I suspect we can get good coverage of the
| most common failure points (specifically around recycling
| and coercion) with a decent amount of work.
| doodledoodahs wrote:
| OK, since you're here!
|
| (this all prefaced with a massive thank you for tidyverse,
| without which R is very crusty).
|
| I love R for interactive work and quick analyses, but I'm
| currently trying to integrate various bits of R code into a
| large document-building pipeline and wishing I could use
| Python for it:
|
| - Exception handling and error processing seem a pain in R.
| Maybe I'm doing it wrong, but it feels like a mess and not
| nearly as ergonomic as python. tryCatch seems to have gotchas
| related to scope because the error handling is in a function.
| The distinction between warning, stop etc seems odd. The
| option to stop on warnings isn't useful because older
| packages seem to abuse warnings as messages. I have just
| discovered `safely` which is helpful, but then you have to
| unwrap lists in pipelines which feels clunky.
|
| - Related, I _really_ wish we could just drop model objects
| or other tibbles as single objects directly into a tibble
| cell rather than as list(df). Unpacking lists and checking
| objects inside them exist is much more of a pain (e.g. can't
| just do `filter(!is.na(df_col))`)
|
| - I really miss defaultdict from python, and dictionaries
| generally.
|
| - Passing variable names as strings to dynamically generate
| things seems clunky compared with python. Again, it may be
| because I'm doing it wrong but I end up having to wrap things
| in !!sym the whole time and the nse semantics seem hard to
| remember (I only use R about 20% of the time). I liked
| cur_data() for passing a df row to a function but this now
| seems deprecated.
|
| - String formatting -- fstrings are just great. Glue is OK,
| but escaping special characters seems more tricksy. Jinjar is
| OK, not quite jinja.
|
| - purrr is nice, but furrr just isn't a drop-in replacement.
| Making http requests in parallel seems non-trivial compared
| to doing it with python. Is there an easy way to do it
| without creating multiple processes? Why can't I just do
| something like `. %>% mutate_parallel(response=GET(url),
| workers=10) %>% ...`?
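|
| For comparison, the Python side of the last point (parallel
| requests without multiple processes) might be sketched as
| below. The comment gives no code, so this is an assumption
| about the pattern meant: a thread pool from the standard
| library. `fake_get` is a hypothetical stand-in for a real HTTP
| call and touches no network.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_get(url):
    # Hypothetical stand-in for a real HTTP GET; returns a canned response
    # so the sketch runs without network access.
    return {"url": url, "status": 200}


def fetch_all(urls, workers=10):
    """Apply fake_get to every URL on a pool of threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(fake_get, urls))


urls = [f"https://example.com/item/{i}" for i in range(5)]
responses = fetch_all(urls, workers=3)
```

| Threads share one process, which suits I/O-bound work such as
| waiting on HTTP responses.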
| laylower wrote:
| Amen to that. Can I add the following:
|
| - 5 different ways to do wide to long and long to wide over
| the years even in the tidyverse.
|
| - A lot of dependencies to connect to DBs and difficult
| programs. Rstudio/Posit does have some premium libraries but
| they should be made free and bundled with the tidyverse to
| really promote the ecosystem.
|
| - Shiny support to save interactive charts and tables. This is
| a massive problem for me. If I have a heavily stylized HTML
| table with a bunch of css, I need to rely on webshot,
| webshot2 which are both alpha or beta versions and they are
| poorly documented. How can I evangelize R if my deployments
| cannot be used properly by my community?
| hadley wrote:
| What are the premium packages you're talking about? As
| far as I know all of our R packages are 100% open source.
|
| I'd love to hear more why you're using webshot etc to
| take screenshots of your shiny app. A more typical
| workflow would be to generate a separate HTML/PDF with
| quarto/RMarkdown.
| laylower wrote:
| Thanks for responding and your amazing work with the
| tidyverse. I am the "R-guy" in my finservices company and
| we have a paid rconnect dev/qa/prod and rserver pro
| licences for a few hundred users.
|
| The packages I think are the dependencies of some DB
| connectivity libraries.
| https://www.rstudio.com/tags/databases/ - these are the
| ones I was referring to.
|
| Re webshot my use case is: I have a heavily modified DT
| table in a shiny app. Users log in, play around with the
| DT table, update ggplots etc and then download the
| snapshot and send it to a WORD file. I can't move away
| from word and use html or pdf because we need the word
| file formatted by editors for publication and they need
| to follow the corpo guidelines. So, I am having to use
| webshot to grab a screenshot of the tagged html instead
| of natively handling it. I tried using officedown and a
| few other methods and it just didn't work.
|
| ps: I hope the rebrand goes great and I am rooting for
| you.
| nutshell42 wrote:
| > The distinction between warning, stop etc seems odd. The
| option to stop on warnings isn't useful because older
| packages seem to abuse warnings as messages.
|
| Use suppressWarnings() to silence misbehaving functions or
| withCallingHandlers() to stop or handle specific
| conditions.
|
| > Passing variable names as strings to dynamically generate
| things seems clunky compared with python.
|
| Can you give me an elegant example in Python? Because I
| don't understand what you want to generate dynamically.
|
| That said, I dislike the tidyverse solution as well. Too
| much abstraction for not enough benefit over a base
| solution with substitute()
| mslip1 wrote:
| Hey Hadley!! Personally only issues for me with integrating R
| is making renv play nice in multistage docker builds. I found
| that I need to have my other pipeline software built in the
| same stage as my R env setup (building specific version from
| archive, system dependencies, then r package dependencies via
| renv)
| fastaguy88 wrote:
| (1) The big problem I have is transitioning from RStudio to a
| pipeline (so I end up not using RStudio). A traditional
| pipeline is going to be a script with some set of arguments
| -- parameter values, fitting functions, and data file names,
| that I put into a shell script and say:
|
| my_plot_script.R --plot_col=g_max --output_type=pub_quality
| data_file1 data_file2 data_file3
|
| It's possible to use optparse/OptionParser() to get that
| information (but you have an option for every argument, no
| --param1 X --param2 Y file1 file2 file3) but it is much more
| difficult to fit those arguments into the RStudio
| environment. I want RStudio to be able to emulate reading
| command line arguments (since they do not exist in RStudio).
| Right now, I have to check to see if there are commandArgs(),
| and, if not, do something else to get the information to the
| RStudio script.
|
| (2) There needs to be an option that says STOP if something
| doesn't make sense. I have dozens of beautiful data plots
| that look great, but do not in fact plot what I think
| they do, because factors have not been properly assigned to
| colors, shapes, or linetypes. (And it can be really hard to
| recognize that the data has not been plotted properly.) Give
| me an option that says, if I did not explicitly declare a
| column a factor, and I did not specifically associate
| colors/shapes/lines with factors, then the data will not be
| plotted.
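|
| On point (1), the same script-versus-interactive split exists
| in Python, and one common way around it is to let the parser
| accept an explicit argument list. A minimal sketch follows. The
| option names come from the example command above; the "draft"
| default and the `parse_args` helper are assumptions made for
| illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse pipeline-style arguments; argv=None means 'read sys.argv'."""
    parser = argparse.ArgumentParser(
        description="Plot one column from several data files."
    )
    parser.add_argument("--plot_col", default="g_max")
    # "draft" is an assumed second choice, not taken from the thread.
    parser.add_argument(
        "--output_type", default="draft", choices=["draft", "pub_quality"]
    )
    parser.add_argument("data_files", nargs="+")
    return parser.parse_args(argv)


# In an interactive session there is no command line, so the same
# arguments are passed as a list instead.
interactive_args = parse_args(
    ["--plot_col=g_max", "--output_type=pub_quality",
     "data_file1", "data_file2", "data_file3"]
)
```

| From a shell script the function is called with no arguments
| and reads the real command line, so one code path serves both.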
| bomewish wrote:
| On point two, can't you just use stopifnot(condition)? Then
| log it etc?
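|
| The kind of explicit check being suggested could be sketched as
| follows, here in Python rather than R. `check_explicit_mapping`
| is a hypothetical helper, and the rule it enforces (every
| category in the data needs a declared color) is one reading of
| point (2) above.

```python
def check_explicit_mapping(categories, colors):
    """Return one color per observation, or raise if any category is undeclared."""
    missing = sorted(set(categories) - set(colors))
    if missing:
        raise ValueError(f"no color declared for: {', '.join(missing)}")
    return [colors[c] for c in categories]


# Every category in the data has an explicit color, so this passes.
point_colors = check_explicit_mapping(
    ["control", "treated", "control"],
    {"control": "grey", "treated": "red"},
)
```

| Running the check before plotting turns a silently wrong
| figure into an error.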
| civilized wrote:
| > We found R has a really hard time integrating into data
| pipelines and was best used as a standalone tool by individuals
|
| In my org we have several 100% R teams (including mine) that
| have been developing and maintaining business-critical, data-
| intensive applications for a decade now. We don't find R
| difficult to integrate into data pipelines. We write our data
| pipelines in R, and we find it very efficient to do so. They
| talk to databases, APIs, command line tools, etc without issue.
|
| Doing what we do in Python is unimaginable, especially if
| pandas is the tabular lingua franca in the team. I vehemently
| agree with this article on the clunkiness of pandas from a
| sister comment: https://www.sumsar.net/blog/pandas-feels-
| clunky-when-coming-.... Compared to dplyr and the tidyverse,
| pandas very noticeably gets in your way rather than being a
| tool of thought. (For what it's worth, there are other teams in
| my org that use Python for entirely justified reasons, and they
| use polars these days, not pandas.)
|
| If I had to complain about anything in R these days, it would
| be the increasing complexity and illegibility of error
| messages. Tidyverse tracebacks are often dozens or hundreds of
| lines. This is made much worse if you have a web app in the
| Shiny framework, as Shiny seems to mangle and garble what
| little useful information you can get (my kingdom for an error
| with a file name and line number). Even outside of advanced
| packages like Shiny, the reporting of error messages suffers
| from some clunkiness and irregularity.
|
| As an expert user, I can usually squint at the error barrage
| and infer what is really going on, but it's probably quite
| confusing and off-putting to newer users.
|
| Overall though, I'm not seeing any competition for R in our
| space. My fondest hope is that in the coming decades there
| arises a new, thoughtfully designed language with the Lispy
| flexibility of R, but also optional type safety and static
| analysis affordances. I'm not sure if that's even possible, but
| I hope the computer science geniuses figure out a way.
| samstave wrote:
| Don't take this as flippant or lame 'AI BOI' etc...
|
| But what if you had a GPT summarize the error messages and
| tell IT to look for certain things...
|
| Why not train an llm on all your error messages and fixes?
|
| I know that sounds sophomoric, but it's an honest Q.
|
| ---
|
| I worked at one of the first VOIP startups in the late 90s -
| before cisco had entered the game...
|
| We had all our server error logs printed on a LINE PRINTER in
| the server room...
|
| (This was on Mission Street above the Dennys - with PISS
| alley behind the building)
|
| My CTO (I was IT Mr. Mgr at the time) - required that I went
| through the error logs every morning with a HIGHLIGHTER to
| read each error message for hacking attempts...
|
| You have no idea how life-sucking that part of my job was....
| hadley wrote:
| If you have specific issues around error messages and
| tracebacks please feel free to let me know directly or to
| file issues on Github. We really do care about the legibility
| of errors and tracebacks and me and my team have put a lot of
| effort into them in the last few years. But there's always
| room to do better and I'd love to know where the pain points
| are.
|
| (The intersection of tidyverse and shiny tracebacks is a
| known pain point that's hard to resolve. Unfortunately shiny
| and tidyverse did a bunch of parallel work that took us in
| slightly different directions and now it's hard to re-align.)
|
| One thing we are missing is a guide to reading tracebacks for
| newer users. Often experts can get a good sense of where the
| problem is, but we've failed to teach newer users how to get
| the most value from a traceback.
| civilized wrote:
| There's clearly been a ton of progress in this area; the
| only issue is that feature development is even faster :)
| I'll keep an eye out for specific issues that seem helpful
| to raise.
|
| The biggest one I have right now is a little niche, but
| probably useful to address. Moderately complex dbplyr
| pipelines on wide tables have a tendency to generate very
| long queries, and if there's an error, the generated SQL
| returned tends to overflow some text or line limit allotted
| to show the error at the command prompt. My workaround is
| to use sink() to dump the error to a file, which is a
| little painful as the sink() API and documentation are not
| the most straightforward or intuitive. (Hmm, I wonder if a
| withr wrapper would help me make something simpler to
| use...)
| xapata wrote:
| Why Polars and not Dask?
| ramblenode wrote:
| > My fondest hope is that in the coming decades there arises
| a new, thoughtfully designed language with the Lispy
| flexibility of R, but also optional type safety and static
| analysis affordances.
|
| I think many of us saw Julia as the successor to R.
| Unfortunately, the package ecosystem---one of R's strongest
| points---still has a long way to go.
| civilized wrote:
| I was excited about Julia too but it now seems to be a
| relatively niche HPC language. It's about saving CPU time
| more than user time.
|
| My sniff test for a successor language to R is whether it
| can replicate the tidyverse API with 100% fidelity. The API
| is already optimal for tabular data analysis, especially
| the dplyr core. It can be thought of as a specification for
| other languages to implement.
|
| There is a great deal about how R works that is negotiable.
| But if the language can't implement dplyr to spec, or
| somehow doesn't "want to", it's not the language for the
| audience served by the tidyverse.
| samstave wrote:
| I'm somewhat of an armchair data scientist myself.
|
| However - I'd love to learn your ways; specifically - what are
| your best recommendations for python over R?
|
| Specifically, even though my R skills are weak - I think that
| RStudio is pretty darn amazing - what do you recommend over
| Rstudio?
|
| I'd truly like to hear what a good toolbox looks like from your
| perspective these days (especially now this little GPT toddler
| is bonking into everything in my domain)
| richrichie wrote:
| I quit R a while ago - before data science became a thing - and
| switched to Julia for such tasks. R has lots of stats packages,
| but it is too esoteric and specialised a language to be useful
| IMO.
| wodenokoto wrote:
| Did you work with Rstudio Server and still found it not
| collaborative enough?
| tropical333 wrote:
| > What we found was that R had a lot of packages but most
| haven't been touched in years and when you contact the owner
| you find they've often moved onto the python/pandas/scikit
| ecosystem
|
| As a "bilingual" R & Python user, I've found this to be true
| for the latter language as well :)
|
| I don't have much to add on top of what other useRs have
| mentioned, except another testimonial that our company has
| successfully used R in production for 6+ years, from data
| "pipeline" stuff you mentioned to dozens upon dozens of
| predictive models of varying complexities.
|
| When faced with a new data analysis ask, 99%+ of the time I
| reach for R (although without the tidyverse, that number would
| be much lower). Like another commenter said, the ease by which
| you can plot in R blows Python away. Seaborn seems like a
| decent compromise in my limited experience, but plotting in
| "base" matplotlib makes me want to die.
| disgruntledphd2 wrote:
| Plotnine is a pretty rocking ggplot clone in Python. Just
| import star and you're golden.
| hadley wrote:
| We (Posit) have hired Hassan (the maintainer of plotnine)
| so this is great to hear :)
| disgruntledphd2 wrote:
| It definitely is! If you could hurry up and destroy
| Jupyter notebooks that would be sweet ;)
| nojito wrote:
| >For those of you who still use R, what is your use case?
|
| Still the best replacement for EDA and reproducible analysis
| that used to be done in Excel.
| vharuck wrote:
| Why I still use R for analysis at work:
|
| - R Markdown is just great for static reports. We use PowerBI
| or ArcGIS for interactive stuff.
|
| - GIS is a breeze. My work provides licenses for ArcGIS, which
| has a Python library for scripting. Despite that, it is so much
| easier to do stuff in R, which can read and create ArcGIS
| shapefiles.
|
| - Exploratory data analysis is easy. Often, before meetings,
| I'll connect to the database in R and make a few basic tables.
| Then I can query, aggregate, or plot data while sitting in the
| meeting. I have custom ggplot themes in a package, so even my
| hastily created plots look nice.
|
| - RStudio is amazing. What it lacks in editing tricks, it more
| than makes up for in simplifying R-specific tasks. Showing
| plots is automatic, rendering and viewing markdown reports (of
| any type) is two buttons, testing and building a package are
| each two buttons.
|
| - I spent a _lot_ of time evangelizing R (team-wide
| presentations, being the "R guy" for troubleshooting,
| organizing an R User Group with members from different teams,
| creating an internal package repository). Some became happy
| converts, the rest begrudgingly accepted it as a tool we would
| use. I don't know if I could do it again with another language.
|
| I'll admit my work doesn't get incorporated into pipelines. We
| get the data, analyze it, create reports, and share the reports
| by email or on our public website. The statisticians are
| segregated from the developers here. State government resists
| change, especially role changes that don't match grants' or
| laws' wording.
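
A minimal sketch of the connect, tabulate, and aggregate workflow described above, written with Python's standard library rather than R so it runs without extra packages. The `visits` table, its columns, and the county names are hypothetical stand-ins, not the commenter's actual data:

```python
import sqlite3

def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (county TEXT, year INTEGER, cases INTEGER)")
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?, ?)",
        [
            ("Adams", 2022, 10),
            ("Adams", 2023, 20),
            ("Berks", 2023, 12),
        ],
    )
    return conn

def totals_by_county(conn: sqlite3.Connection) -> list:
    """Aggregate cases per county, the kind of quick table made before a meeting."""
    query = (
        "SELECT county, SUM(cases) AS total "
        "FROM visits GROUP BY county ORDER BY county"
    )
    return conn.execute(query).fetchall()

conn = make_demo_db()
for county, total in totals_by_county(conn):
    print(f"{county}: {total}")
```

In practice the connection would point at a real database and the result would feed a plotting library; only the aggregation step is shown here.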
| ekianjo wrote:
| > R Markdown is just great for static reports.
|
| Quarto (also supporting R) is a good replacement for
| rmarkdown (with a saner syntax) and I say this as someone who
| has extensively used rmarkdown over the years.
| sebastianavina wrote:
| RStudio is a nice tool for making some quick graphs on data,
| descriptive analysis and quickly exploring a dataset. Building
| some reports, or manipulating small datasets for beautiful
| graphs.
|
| For anything else, we use Python.
| axpy906 wrote:
| I stopped using it in 2015, when I began to learn how to code.
|
| At my FAANG company, there are teams that use it for
| econometrics. I think that's R's sweet spot, still in 2024.
| minimaxir wrote:
| There is currently no Python equivalent for both the ease of
| use and output quality of ggplot2 for data visualization. Many
| have tried over the past _decade_, but none have gotten close.
| (Plotnine was the closest: per Hadley in another comment his
| company hired the maintainer)
| trts wrote:
| two reasons for me
|
| 1) tidyverse makes prodding and plotting my data faster and
| more enjoyable. when I am prototyping a model I'll sometimes do
| the groundwork in R and then migrate the production version to
| python
|
| 2) I can't seem to write data wrangling code in py that is as
| aesthetically pleasing and easy to reinterpret later. could
| just be that I started in R, but while the methods in pandas
| "work" I don't always totally understand why they work the way
| they do. with tidy it works the way I expect and feels easier
| to read back and iterate on
| winwang wrote:
| Was there any performance difference between R and Python in
| your case?
| clatan wrote:
| I'm an old R user forced to mostly use python because that's
| what the team uses.
|
| R is so much better than python in many areas concerning data
| pipelines: connecting with external database systems through a
| unified API, superior data munging utilities, as well as
| plotting, a more comprehensive (obviously) statistical analysis
| toolset.
|
| I even find rmarkdown vastly superior to jupyter.
|
| But IMO the best reason to use R rather than Python is that its
| tools will make you approach the problem as a statistician
| rather than a programmer.
| dxbydt wrote:
| >what is your use case?
|
| if you are doing Bayesian stats, fitting hierarchical models,
| or using Stan in any serious capacity, R/Stan is so much more
| ergonomic than Pystan. Here's a long list of pros-cons:
|
| https://discourse.mc-stan.org/t/various-observations-on-rsta...
| clircle wrote:
| Unfortunately there are 56 other data science with R books, so
| what is the differentiating factor here?
| countrymile wrote:
| It's the Harvardx course
| benreesman wrote:
| I'm looking at R seriously for the first time.
|
| I've got a decade in with Python numeric computing, and I'm
| interested in Julia and all of the cutting-edge stuff.
|
| I've only dabbled with R until now, and I haven't researched it
| enough to know if rumors of its inevitable demise have any
| substance.
|
| There are a lot of interesting math problems other than training
| gigantic neural networks on NVIDIA gear, and I've got some
| Computer Algebra System / ergonomic linear modeling needs on a
| current project:
|
| I need the best tool for someone who is messing with
| Black-Scholes type stuff, who is still building the fidelity with
| tricky antiderivatives by hand, but I have enough fundamentals to
| check the computer's work.
|
| What role should R play here?
| laylower wrote:
| I love R. You could do it in R. But a lot of the derivations and
| Math Finance stuff you can and should be able to do in C/C++. R
| packages mostly depend on those as well for heavy duty calcs.
|
| So, if I wanted to dabble I'd easily use R and if I was in the
| quant developer world I'd be doing C/C++
| helsinki wrote:
| I work with a trading team that manages $1B, exclusively with
| R.
| mamonster wrote:
| Second this, seen lots of funds use whatever language their
| lead QR/QT feels comfortable with. At the end of the day,
| if you aren't running a strategy that requires colocation
| on the exchange, whatever speed improvement you get from
| the language will usually disappear from the network
| latency.
|
| Something like intraday momentum/sector rotations can
| easily be done entirely in Python/R, from what I've seen.
| benreesman wrote:
| Likewise interested if a pro has any consulting hours to
| spare :)
| mamonster wrote:
| Sorry, unfortunately do not do consulting.
|
| In your other comment, you said you are looking to price
| "weird derivatives". How weird are we talking? If it's OTC
| I won't be able to help anyway, if it's standard then I
| can at least try to point you in the right direction. The
| fact you mention Black Scholes makes me think it might be
| something closer to "vanilla" than the other way around.
| benreesman wrote:
| I have to price some weird derivatives.
|
| You do any consulting on non-adjacent areas?
| iainctduncan wrote:
| I've done some work for scientists where they used C++
| extensions to R for heavy number crunching. For their
| workflow, R is really nice. Don't know how common this is
| though.
| nequo wrote:
| Rcpp is pretty common in major performance sensitive
| packages. The CppCast did an interview with Dirk
| Eddelbuettel about it in 2022:
|
| https://cppcast.com/rcpp/
| dxbydt wrote:
| quite easy to price derivatives with R. I have a degree in
| finmath from uchicago, where derivative pricing was taught
| using Matlab and R. But in the last semester we were told - oh
| yeah when you go out there into the real world and start
| working for the banks you can't use civilized tools like R and
| Matlab. So you have to take this mandatory class on cpp. There
| once was a guy named stroustrup and this shit here is called a
| makefile... after graduation i worked for BofA and yes, the
| quant world is completely C++. But there are small funds (few
| billion dollars) that do their own shit in R, Haskell, Q/kdb,
| others. Very doable in R.
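
For reference, a minimal sketch of the textbook Black-Scholes closed form for a European call and put on a non-dividend-paying asset, the "vanilla" case mentioned upthread. It is written in Python purely for illustration (the commenters use R, Matlab, and C++), and the function and parameter names are my own, not taken from any library:

```python
from math import erf, exp, log, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes(spot: float, strike: float, rate: float,
                  vol: float, maturity: float) -> tuple:
    """Return (call, put) prices for a European option.

    Assumes constant rate and volatility, no dividends,
    and maturity measured in years.
    """
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * sqrt(maturity)
    )
    d2 = d1 - vol * sqrt(maturity)
    discount = exp(-rate * maturity)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put

call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05,
                          vol=0.2, maturity=1.0)
print(f"call={call:.4f} put={put:.4f}")
```

A quick sanity check on any such implementation is put-call parity: call minus put should equal spot minus the discounted strike.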
| tagyro wrote:
| I love the power of R, especially when used for "stupid" stuff
| [^0]
|
| + extra points for using quarto
|
| [0]: https://gist.github.com/mine-cetinkaya-rundel/03d7516dea1e5f...
| AndyMcConachie wrote:
| I dipped my feet into R a few years back, but eventually stopped
| it because of the way it handles integers. At the time it treated
| all integers internally as signed 32-bit and if the number is too
| large for that it converted it to a float.
|
| I don't know what R does now, but this was a deal breaker for me
| at the time because I was dealing with really large integers that
| regularly broke this limit.
| armchairhacker wrote:
| Integers are still only 32 bits. There's a class which
| effectively represents 64-bit integers
| (https://www.rdocumentation.org/packages/csvread/versions/1.2...)
| as well as arbitrary-sized
| (https://cran.r-project.org/web/packages/gmp/index.html,
| https://www.rdocumentation.org/packages/gmp/versions/0.7-4/t...).
| I will say there are a few pitfalls where the integer bits
| are unexpectedly converted to something else, but it's
| workable.
| nutshell42 wrote:
| Finally a real reason.
|
| A lot of the stuff above was complaining about issues where
| Python is a lot worse than R, about non-issues or with a
| fundamental misunderstanding of the language. I'd given up hope
| of seeing a real weakness named as such :)
|
| There is bit64 and doubles being used as 53bit pseudo-integers
| - but if I needed 64bit integers, R wouldn't be my first
| choice, definitely.
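
The two limits discussed in this subthread can be checked numerically. The sketch below uses Python only to illustrate the arithmetic (it does not call R): a signed 32-bit integer tops out at 2^31 - 1, and a 64-bit double represents every integer exactly only up to 2^53. The helper names are hypothetical:

```python
import struct

INT32_MAX = 2 ** 31 - 1  # 2147483647, the largest signed 32-bit integer

def fits_int32(n: int) -> bool:
    """Return True if n can be stored as a signed 32-bit integer."""
    try:
        struct.pack("<i", n)  # "<i" is a little-endian signed 32-bit int
        return True
    except struct.error:
        return False

def exact_as_double(n: int) -> bool:
    """Return True if n survives a round trip through a 64-bit float."""
    return int(float(n)) == n

print(fits_int32(INT32_MAX))        # within the 32-bit range
print(fits_int32(INT32_MAX + 1))    # one past the 32-bit range
print(exact_as_double(2 ** 53))     # still exact as a double
print(exact_as_double(2 ** 53 + 1)) # first integer a double cannot hold
```

Values between those two limits fit in a double without loss, which is why doubles can serve as "53-bit pseudo-integers"; beyond 2^53 a dedicated 64-bit or arbitrary-precision type is needed.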
| SomeoneFromCA wrote:
| In my case, I found R a better tool for learning DS, as it is
| more or less, a DSL for statistics, and feels more low level and
| forces you to learn more fundamentals than python. For production
| it is probably worse than python, true.
| Cosi1125 wrote:
| It's not a DSL.
| SomeoneFromCA wrote:
| It is de facto.
| Cosi1125 wrote:
| Why do you think that? (I'm legitimately curious.)
| uptownfunk wrote:
| No better tool for EDA and data analysis than R and RStudio. Fell
| in love in stat 133 at Cal and now while I am doing software
| engineering I have very fond memories of writing R and tidyverse
| SoftTalker wrote:
| I took a two or three day on-site intro to R class that my
| employer put together. Perhaps it was not a great class, but as a
| seasoned software developer familiar with a number of imperative
| and functional languages I was baffled by R. It felt like a bunch
| of little functions that had been developed by different people
| with no consistent framework, and thrown together in some kind of
| big wrapper. I know it's popular among statisticians and
| researchers, so I think a prerequisite must be a good fluency with
| statistics (I don't have that). Maybe it makes more sense if you
| think like a statistician. As a programmer I felt like nothing I
| learned about R contributed to developing an intuitive
| understanding of any of the rest of it.
| pinewurst wrote:
| I think of R as a programming language designed by people who'd
| heard about programming languages but never actually used one
| before. It's great for ad-hoc analysis without having to think
| about production systems.
| hadley wrote:
| R definitely has its warts, but I strongly believe that
| underneath them lies a beautiful and quite elegant language
| that's extremely well suited to the challenges of data
| analysis. If you're already a programmer, you might find
| something like Advanced R (https://adv-r.hadley.nz) to be
| useful to get a sense of what R really is as a programming
| language.
| mint2 wrote:
| I get a similar impression but to contextualize, in terms of
| statistical programming what you're saying is even more so true
| of what came before R, but a thousand fold worse. In that
| context R is fantastic.
|
| For example SAS makes R look beautiful and consistent. And
| that's more a comment on SAS than R. And this isn't to say
| python is perfect either, but I prefer it.
| AlbertCory wrote:
| In Google "data science" circa 2009 (although we didn't call it
| that), R was the weapon of choice.
|
| I consider it a bad relic of the 70's. It doesn't have a
| "learning curve" -- it has a "learning straight line." Even when
| you're experienced and semi-competent at it, it's still difficult
| and surprising.
| haunter wrote:
| CS50 will be also available with R starting this summer
| https://www.edx.org/learn/r-programming/harvard-university-c...
| tea-coffee wrote:
| What makes this book different from R for Data Science by Hadley
| Wickham, Mine Cetinkaya-Rundel, & Garrett Grolemund?
| yaomingite wrote:
| That's a comprehensive guide. If anyone wants a similar
| introduction, with interactive exercises to try while they study
| this is also a good resource:
| https://www.codecademy.com/learn/learn-r
| 29athrowaway wrote:
| R is the PHP of data science. It is productive, it has a large
| ecosystem, lots of functionality, but it grew fast and
| organically and not in a well-planned manner, making it not
| consistent and a bit messy to work with.
|
| If you have to use R, use the tidyverse.
|
| https://www.tidyverse.org/
|
| I like R and use it often as I find it more concise to work with
| than Python for simple statistical purposes. I forced myself to
| use R instead of spreadsheets and don't regret it.
|
| This is one of the reasons why (thanks, Zed Shaw)
| https://web.archive.org/web/20110702162929/https://zedshaw.c...
___________________________________________________________________
(page generated 2024-03-02 23:00 UTC)