[HN Gopher] Groundhog: Addressing the Threat That R Poses to Rep...
       ___________________________________________________________________
        
       Groundhog: Addressing the Threat That R Poses to Reproducible
       Research
        
       Author : snakeboy
       Score  : 161 points
       Date   : 2021-01-05 13:24 UTC (2 days ago)
        
 (HTM) web link (datacolada.org)
 (TXT) w3m dump (datacolada.org)
        
       | Railsify wrote:
       | This is not a problem that R poses, this is a problem that people
       | pose. If you want to run the same code you need to bundle the
       | source code of all packages and archive it with the data. This is
       | not an R problem at all.
        
       | SoSoRoCoCo wrote:
       | > The problem is that packages are constantly being updated, and
       | sometimes those updates are not backwards compatible.
       | 
       | Uh oh, someone just discovered the modern programming landscape!
       | 
       | Python, Node, R, Rust, and other langs/OSes with package managers
       | are at the mercy of volunteers who keep important packages
       | healthy. Once issues stop being fixed, y'all better have local
       | copies. This used to be predominantly an OS issue, now it is a
       | language issue, too.
        
         | swsieber wrote:
         | > The problem is that packages are constantly being updated,
         | and sometimes those updates are not backwards compatible.
         | 
         | > Python, Node, R, Rust,
         | 
         | Correct me if I'm wrong, but for binary programs, a lock file
         | easily mitigates this issue. I know Node and Rust both support
         | lock files.
        
           | SoSoRoCoCo wrote:
           | Yes, you're right. That's what makes lock files so important.
           | I think we're past worrying about those wheels/npms/pkgs
           | disappearing from the internet.
           | 
           | My concern is more about packages that go stale and no longer
           | peer-match with other packages that evolve, or major versions
           | that change results: not so much for R pkgs but there have been
           | cases of major versions breaking existing projects, or
           | requiring significant effort to update. (One example that
           | zinged me is the FFI interface for Node. The "official"
           | package hasn't been touched in years, and the "replacement",
           | FFI-NAPI, still has lots of open issues. We were using
           | in-house fixes for some time.)
        
       | winrid wrote:
       | Does R not allow you to lock into versions?
       | 
       | Is this person suggesting we never improve anything? :)
        
       | AuthorizedCust wrote:
       | His example is poor. dplyr wasn't even at 1.0 in 2016. It was
       | only 0.5.0: https://blog.rstudio.com/2016/06/27/dplyr-0-5-0/. Of
       | course one might expect breaking changes in a maturing package.
        
       | tpoacher wrote:
       | A very nice and useful library indeed; though I don't think the
       | article really needed to sound so doomsdayish and apocalyptic
       | about it.
        
       | mespe wrote:
       | While I agree with many of the negative comments here about
       | issues with how this is implemented, the tone of some comments
       | is... not great. To the point that I would be reluctant to share
       | work I do in R on Hacker News, which is not helping anyone.
       | 
       | Just a reminder: https://news.ycombinator.com/newsguidelines.html
        
         | h2odragon wrote:
         | No kidding. Seems like an elegant solution to a potential
         | problem to me. I've only used R at the "poke stick at numbers"
         | level, but this would have been a useful addition to _that_
         | trivial use.
        
       | anderscarling wrote:
       | I've yet to use it personally, but renv [1] seems to try to solve
       | the reproducible builds problem in a way more similar to other
       | modern package managers (e.g. by generating a lockfile).
       | 
       | This approach enables stricter validations against tampering with
       | the package repositories as a hash of the package can be stored
       | in the lockfile, however it is obviously a bit more complex to
       | use than the groundhog approach.
       | 
       | [1]: https://github.com/rstudio/renv
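
A minimal sketch of the lockfile hash check described above, written in Python for illustration. The lockfile layout, the `verify_package` helper, and the use of SHA-256 over the whole archive are all assumptions; renv's real hashing scheme may differ.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex digest of a package archive's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_package(name: str, data: bytes, lockfile: dict) -> bool:
    """Return True if the archive matches the hash pinned in the lockfile.

    `lockfile` is a hypothetical in-memory structure:
    {package_name: {"version": ..., "hash": ...}}.
    """
    entry = lockfile.get(name)
    if entry is None:
        raise KeyError(f"{name} is not pinned in the lockfile")
    return sha256_of(data) == entry["hash"]


# Made-up archive contents standing in for a downloaded package tarball.
archive = b"pretend this is dplyr_1.0.2.tar.gz"
lockfile = {"dplyr": {"version": "1.0.2", "hash": sha256_of(archive)}}

print(verify_package("dplyr", archive, lockfile))         # untouched archive
print(verify_package("dplyr", archive + b"x", lockfile))  # tampered archive
```

A date-only pin has nothing comparable to check against, which is the extra validation the lockfile approach buys.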
        
         | notafraudster wrote:
         | Agreed that renv is a better solution here. Even the example
         | code for Groundhog is not written in idiomatic R which does not
         | inspire confidence. Simonsohn is a legend in transparent
         | research but not primarily a coder or software tool contributor
         | (take a look at the source for p-curve if you want to see what
         | I mean) and I think a secondary threat to reproducibility is
         | relying on tools that end up abandoned or deprecated or for
         | which bugs never get fixed.
        
           | vharuck wrote:
           | >Even the example code for Groundhog is not written in
           | idiomatic R which does not inspire confidence.
           | 
           | Not to mention the given example for irreproducibility in
           | base R looks at code that would be a bug in the script for
           | 3.6. It's only useful to keep this reproducible if I'm
           | debugging the script.
           | 
           | And, in this case, anyone who's proficient with R would
           | recognize this problem from personal experience or the many
           | warnings in tutorials. I usually wouldn't shoot down a given
           | example as though it disproved the existence of any example,
           | but I don't know if there is another example. Unless old code
           | relied on undocumented or contrary-to-documented behavior.
        
         | prepend wrote:
         | I came here to say this.
         | 
         | This seems like a non-issue given renv. And renv gives a more
         | reproducible, I think, solution as it pins to versions, not
         | dates.
        
       | timsneath wrote:
       | I'm curious whether this actually solves the problem. I
       | understand how this assists with reproducibility of packages, but
       | the R software itself is updated frequently, as is briefly noted
       | in the preamble to this document. Indeed, the release notes [0]
       | are fairly transparent about the relatively long list of changes.
       | 
       | Given this, it almost seems more dangerous to imply through this
       | package that a particular date's results are reproducible, since
       | unless the user has the same version of R, they may see different
       | results anyway.
       | 
       | [0]: https://stat.ethz.ch/pipermail/r-announce/2020/000653.html
        
       | st1x7 wrote:
       | Either I'm misunderstanding or this is a non-problem. You can
       | specify older versions of a package when you install it. You can
       | also manage them with packrat. As long as researchers share their
       | language and package versions, you can fully reproduce their
       | environment. (And the base language is _really_ stable, almost to
       | a fault.)
       | 
       | This is just a bad way for the author to promote their own
       | library for dealing with this. The way their library seems to
       | approach this (using dates instead of versions) seems horrible
       | too - on any given date I can have a random selection of packages
       | in my environment, some of them up-to-date, some of them not. So
       | unless all researchers start using the author's library (and
       | update to the latest versions of everything just before they
       | publish), it's only making things worse and not really solving
       | the problem it claims to solve.
        
         | kwertzzz wrote:
         | I fully agree. With version numbers you can get a good sense of
         | whether a package update breaks your code or not (as long as the
         | package authors follow semantic versioning).
         | 
         | I think in Julia this problem is solved quite nicely with the
         | Project.toml file (the list of packages that you directly depend
         | on) and the Manifest.toml file (the version numbers of the
         | complete dependency tree, which is automatically generated).
         | 
         | It seems that in groundhog you declare only direct
         | dependencies. Is there a way to store the full dependency tree
         | in R ?
        
           | Hasnep wrote:
           | Using renv (or packrat) generates a renv.lock file with the
           | version of every dependency.
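
A rough sketch of what "the version of every dependency" looks like when read back from such a lockfile. The JSON layout below (top-level `R` and `Packages` keys) is an assumption modeled on renv.lock, and the package versions are made up for illustration.

```python
import json

# Hypothetical lockfile contents; field names are assumed, not authoritative.
LOCKFILE_TEXT = """
{
  "R": {"Version": "4.0.3"},
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.0.2", "Source": "Repository"},
    "rlang": {"Package": "rlang", "Version": "0.4.9", "Source": "Repository"}
  }
}
"""


def pinned_versions(lockfile_text: str) -> dict:
    """Map every package in the lockfile to its pinned version."""
    lock = json.loads(lockfile_text)
    packages = lock.get("Packages", {})
    return {name: meta["Version"] for name, meta in sorted(packages.items())}


def language_version(lockfile_text: str) -> str:
    """Return the interpreter version recorded alongside the packages."""
    return json.loads(lockfile_text)["R"]["Version"]


print(language_version(LOCKFILE_TEXT))
for name, version in pinned_versions(LOCKFILE_TEXT).items():
    print(f"{name}=={version}")
```

Because indirect dependencies (here `rlang`) are listed next to direct ones, the whole tree is pinned rather than just the packages the script names.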
        
         | kumarsw wrote:
         | The impression I get is that this tool has a forensic bent to
         | it. You ask for the code for a paper and Joe grad student with
         | no programming knowledge emails you a zipped folder of R
         | scripts. He just finished his dissertation, is starting a new
         | job somewhere on the west coast, and no longer has access to
         | the computer in his old advisor's lab where he did the work.
         | The implementation may (or may not) be lousy, but the use
         | case sounds plenty valid.
        
         | dash2 wrote:
         | Yeah, this seems half-thought-through. renv works at project
         | level and isolates the dependencies of a project from your main
         | library. groundhog.library() tramples over your library
         | installing multiple versions. It also has the "cute" feature of
         | auto-installing libraries if they aren't on your system
         | already. Yuck. If you really wanted this script-only solution
         | then you could go with the `versions` library, which already
         | lets you specify an installation date.[1]
         | 
         | [1]:
         | https://cran.r-project.org/web/packages/versions/index.html
        
         | mslip wrote:
         | Not really relevant but just to note packrat has been
         | soft-deprecated and is superseded by renv, which comes standard
         | with RStudio.
        
         | meztez wrote:
         | Fully agreed. Add `sessionInfo()` output as an appendix to your
         | publication. Should not be too hard to rebuild from that.
        
         | ekianjo wrote:
         | Packrat is deprecated now. It is recommended to use renv
         | instead.
        
           | st1x7 wrote:
           | Thanks, I just checked renv out (I haven't had to work with R
           | in over a year). Renv looks much better than packrat at a
           | first glance.
        
         | _Wintermute wrote:
         | Correct me if I'm wrong but specifying an older package version
         | in R still pulls the newest packages from CRAN for any
         | dependencies, which is a quick way to run into a load of
         | incompatibilities.
         | 
         | I've not tried renv yet but packrat was a pretty poor solution.
        
           | jcheng wrote:
           | If you need to use an older version of a package and don't
           | have a packrat/renv lockfile already, then packrat/renv are
           | not going to help you. mran/checkpoint could though.
           | 
           | I agree packrat (which I created) was a poor solution for
           | most users. renv is far better and more usable.
        
           | mslip wrote:
           | You can specify version numbers in renv, you take snapshots
           | of your dependencies into a lock file and can always restore
           | from there or make a new snapshot.
        
       | torcete wrote:
       | I hope all these efforts for reproducibility and avoiding library
       | resolution conflicts consolidate into one solution.
        
       | ppod wrote:
       | Very clickbaity headline. The problem described is real, but just
       | as real or worse in other statistical software, so it's not 'R'
       | as a whole that poses a threat to reproducibility.
        
       | asperous wrote:
       | Reproducible? Or deterministic?
       | 
       | There are certainly benefits to being able to pull down research
       | source code, and bug checking it. That's how programmers check
       | code: tests and audits.
       | 
       | However I think reproducing research is more often than not done
       | "from scratch", taking a new sample, treating it, checking
       | results. "independent verification".
       | 
       | Re-using source code saves time, but I would argue not being able
       | to shouldn't threaten reproducibility.
        
         | geomark wrote:
         | Back when I studied this stuff there was a distinction between
         | reproducible (rerun analysis on the data from the original
         | experiment and see if you get the same results - if not then
         | there is an error in the analysis) and replicable (redo the
         | entire experiment by taking new data and running the analysis).
        
         | bonoboTP wrote:
         | This is why people are starting to make a difference between
         | terms: repeatability, reproducibility, replicability.
         | 
         | > You give me your code and enough information for me to
         | produce an identical environment or (even better) your code is
         | insensitive to the environment, then your research is Repeatable.
         | 
         | > If you describe your study sufficiently well that I can re-
         | implement your study from scratch, without looking at your code
         | and still get the same answer, then it is Reproducible
         | 
         | > If I can arrive at the same conclusions as you, just from a
         | description of its aims, then it is Replicable.
         | 
         | From https://academia.stackexchange.com/a/118518/15198
        
         | coliveira wrote:
         | That's something that eludes software people: reproducibility
         | in science is the ability to create independent tests. Making
         | software available, while useful, does very little for
         | reproducibility from the scientific point of view.
        
           | f6v wrote:
           | There're fields of science, like computational biology, where
           | it's all about the code. I wish the methods section was
           | always 100% unambiguous, but it's not the case. And nowadays
           | the computational pipelines have to support analysis of up to
           | terabytes of data. You can imagine how many dependencies such
           | pipelines have. Sometimes I have trouble installing the
           | software even when a package manager such as Anaconda is used.
        
           | petters wrote:
           | It's true that "reproduce" can mean different things in
           | software and science. But having the code available to
           | "reproduce" any plots in a paper should be a requirement for
           | publication, imo. It certainly is the case for my papers.
        
           | bonoboTP wrote:
           | It's a minimum standard, though. Of course the goal is
           | reproducibility from a broader point of view, but that's not
           | an excuse to do research in a one-off way where nobody is
           | able to show how to get those numbers again, after a year or
           | so from publication.
           | 
           | The coding standards are often abysmally, unexpectedly
           | terrible. Often not even the help of the original authors is
           | enough to be able to produce the same figures from a paper
           | because things and settings and commands get forgotten. Some
           | part of the analysis was done in one language, another part
           | in Excel. Some of the code has now disappeared. Some of the
           | libraries are no longer working. Some people left and their
           | academic storage space was wiped and therefore the
           | intermediate steps and results or notes are deleted. You
           | wouldn't believe it.
           | 
           | Once a paper is published researchers are not really
           | incentivized to document things or maintain the materials.
           | They got the publication, they put it on their CV. On to the
           | next project! No time to waste on work that's already
           | completed. New work leads to new publications, messing around
           | with the old code for the sake of a potential later person
           | interested in it is a waste from the point of view of a
           | researcher, career wise. Also most papers are never attempted
           | to be reproduced ever.
        
             | coliveira wrote:
             | The researcher's job is to properly do an experiment and
             | document it in a paper. If we require more than this, then
             | we will damage the scientific process for two
             | reasons: (1) companies are not willing to make available
             | software developed by their researchers, therefore they
             | will publish even less; and (2) universities don't have
             | money and staff to produce and maintain software at these
             | standards, so professors will be required to publish fewer
             | papers.
        
               | bonoboTP wrote:
               | "These standards" are pretty low. Currently it's a free-
               | for-all chaos. Theoretically papers are reproducible from
               | the documentation found in the paper but that is a lie.
               | It is never reproducible just from the paper. Lots of
               | stuff is done in the background that is not known to the
               | reader. For all we know, they can even tweak their
               | numbers to be 2% better and if someone can't get the
               | results of the paper from the released code, the authors
               | can just ignore it or say, the problem is not on their
               | side, or that the paper numbers were generated with a
               | slightly different code than the released version etc.
               | I've seen this many times on Github, issues getting
               | closed or deleted without comment etc. There is zero
               | accountability.
               | 
               | It's slowly changing though but many people are grinding
               | their teeth, because they can't torture the data as much
               | if things are out in the open.
        
               | coliveira wrote:
               | > "These standards" are pretty low. Currently it's a
               | free-for-all chaos.
               | 
               | I disagree. It is not perfect, but it is certainly a
               | process that enables scientific development, as it has
               | for centuries. If we start to create more and more rules
               | that researchers need to follow, it will become even
               | harder to do scientific research and most institutions
               | won't have resources to continue.
        
         | jdale27 wrote:
         | Ideally research does get reproduced from scratch; I think what
         | people usually mean when they talk about the
         | replication/reproducibility crisis in science is not being able
         | to reproduce an experiment with new samples, independent data
         | analysis, etc.
         | 
         | However, if you can't even reproduce an analysis with the
         | authors' own data and code, that's a red flag before you even
         | get to the starting line. Ensuring that level of
         | reproducibility is, I think, an essential ingredient to
         | enabling the stronger form of reproducibility.
         | 
         | Personally, I made the mistake during my graduate career of
         | trying to reimplement an analysis using a certain rather
         | complicated ML algorithm, from scratch, in a different language
         | than the original authors had used. After struggling mightily
         | to get it to work, I finally bothered to try to get their own
         | code working. (I had been hesitant to do so because I wasn't
         | proficient in the language they used, and it wasn't even clear
         | they had released all the necessary code, aside from the core
         | algorithm.) Once I did that, I discovered that I couldn't even
         | get their own code working on their own data, and gave up. This
         | was research published in Science by a group from a top-tier
         | research university. (I don't fully blame the authors, it may
         | well have been my own incompetence that was the issue. But it
         | just serves as yet another illustration of how pervasive and
         | disregarded the reproducibility issue was for a long while.)
        
         | davnn wrote:
         | > Re-using source code saves time, but I would argue not being
         | able to shouldn't threaten reproducibility.
         | 
         | More often than not it's not clear from a paper what exactly
         | the authors did to achieve a specific result. Being able to
         | exactly reproduce what previous authors did should improve
         | reproducibility; also for new samples.
        
           | asdff wrote:
           | It's also standard fare for typical lab work. A good paper's
           | methods section would contain enough detail for you to go
           | into your lab and repeat the experiment yourself, even down
           | to the catalog number for the reagents to order from the lab
           | supplier. Code should be no different, that's why it's
           | encouraged that authors submit all code used in analysis and
           | generation of figures.
        
             | davnn wrote:
             | The fun thing is that there are approaches that want to go
             | beyond this kind of methodological description of a
             | scientific process to _code_ [0, 1]. In general I would
             | say that the more we can remove the human aspect and
             | inherent ambiguity of science, the better for
             | reproducibility. See [2] for a couple of examples.
             | 
             | [0] https://www.emeraldcloudlab.com/
             | [1] https://nextjournal.com/
             | [2] https://www.youtube.com/watch?v=L1UgdoP2aeg
        
       | stewbrew wrote:
       | I wish the (default) utils::install.packages function could take
       | a version number of the requested library. I also wish library()
       | would automatically install libraries not available on the
       | system. (Both can be achieved with custom functions that shadow
       | the default ones but I would like to see this functionality in
       | the base packages.) Other than that, I think all alternatives to
       | this "threat called R" are worse. It's telling the author has to
       | cite a bug from 2016 for an example of a breaking change.
        
       | cauthon wrote:
       | This title is an exceedingly hot take for someone who wrote a new
       | package manager.
       | 
       | Also, it appears that Groundhog is itself a CRAN package and the
       | author recommends installing with install.packages(). So is the
       | author committing to never making any backwards incompatible
       | updates to their new package?
        
         | coolreader18 wrote:
         | > So is the author committing to never making any backwards
         | incompatible updates to their new package?
         | 
         | Well, yes, probably. It's not all that hard, and groundhog
         | seems to have a fairly simple API anyways.
         | 
         | And groundhog still uses CRAN packages, it just brings a method
         | of pinning them to a specific version.
        
         | bsza wrote:
         | I think it's more like a Wayback Machine for R programs, since
         | the author of a science paper isn't required to use groundhog.
         | You can just provide it the date the article was published,
         | which you already know, and it reconstructs how the program
         | worked on that day.
         | 
         | Also, because groundhog isn't made for the author to use,
         | whether or not the interface changes is irrelevant. You'll
         | never encounter library(groundhog) in a paper.
        
           | st1x7 wrote:
           | > and it reconstructs how the program worked on that day.
           | 
           | It reconstructs how the fully updated version of everything
           | worked that day which isn't necessarily the same as the
           | researcher's environment. It's a horrible idea to use dates
           | instead of package versions for this. The author's library
           | doesn't solve the problem it claims to solve.
        
             | SCLeo wrote:
             | If I am understanding this correctly, the problem is that
             | the paper authors do _not_ provide a specific version or a
             | package.json equivalent. In that case, using dates seems to
             | be the only choice.
        
               | st1x7 wrote:
               | Even if that's the case, using dates isn't a solution
               | because dates don't give you the build that the
               | researcher used. Date of publication is different from
               | the date when the code ran and there is no guarantee that
               | the researcher ran the latest version of every dependency
               | that was available to them anyway. In fact that's very
               | unlikely considering that some of their libraries might
               | require older versions. It might not even be possible to
               | take the latest version of every package and use them in
               | the same environment.
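
A small sketch of the mismatch described above: resolving "latest release on a given date" versus what a researcher actually had installed. The release history, package name, and helper names are all hypothetical; this is not how groundhog or any real resolver is implemented.

```python
from datetime import date

# Hypothetical release history: package -> [(version, release date), ...]
RELEASES = {
    "pkgA": [
        ("1.0.0", date(2019, 1, 10)),
        ("1.1.0", date(2020, 3, 5)),
        ("2.0.0", date(2020, 11, 20)),
    ],
}


def version_on(package, as_of, releases=RELEASES):
    """Latest version released on or before `as_of`, or None if none exists."""
    candidates = [
        (released, version)
        for version, released in releases[package]
        if released <= as_of
    ]
    if not candidates:
        return None
    return max(candidates)[1]


def mismatches(installed, as_of, releases=RELEASES):
    """Packages whose installed version differs from the date-resolved one."""
    result = {}
    for package, version in installed.items():
        resolved = version_on(package, as_of, releases)
        if resolved != version:
            result[package] = (version, resolved)
    return result


# The researcher never upgraded past 1.1.0, but the paper appeared in 2021.
installed = {"pkgA": "1.1.0"}
print(mismatches(installed, date(2021, 1, 5)))
```

Here the publication date resolves to 2.0.0 while the analysis actually ran on 1.1.0, so the date alone does not recover the original environment.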
        
               | SCLeo wrote:
               | So, what is your better alternative then? I honestly
               | believe using the version available at that date is better
               | than using the latest version.
        
               | [deleted]
        
         | jsmith99 wrote:
         | That's the problem. This package is very similar to Microsoft's
         | checkpoint package which is based on Microsoft's MRAN
         | snapshots, and this package also uses MRAN. The article
         | explains the difference is that this package allows you to
         | specify the date in the code itself, whereas checkpoint is used
         | to set a whole installation to a specific date. But this is no
         | advantage as it means code will stop working if the groundhog
         | package changes, whereas with checkpoint a paper could just say
         | 'use packages as of date x'.
        
         | resonantjacket5 wrote:
         | Your take seems a bit 'hot' too?
         | 
         | How else would you install the cran packages without using
         | install.packages? Unless you want them to recursively
         | install it using groundhog but that seems unnecessary.
         | 
         | As long as you have the timestamp it should work, though I
         | assume there will be some edge case.
         | 
         | What you're saying is like don't use pip because you don't
         | install it using pip? Or don't use package-lock.json because
         | you can't install npm through npm?
        
           | cat199 wrote:
           | > Your take seems a bit 'hot' too?
           | 
           | OP is not claiming that Groundhog is a threat to the R
           | language ecosystem, whereas the author is claiming that the
           | R language is a threat to Science itself...
        
           | scottmcdot wrote:
           | Someone correct me if I'm wrong, but can't you copy and paste
           | the package folder into your libpath directory and R can load
           | it that way without actually running install.packages()?
        
             | vharuck wrote:
             | Usually, yes. However, it is possible for a package to have
             | code that only runs when it is installed. If you just copy-
             | paste, it won't be run.
        
           | cauthon wrote:
           | No, I'm saying don't call CRAN a "threat to reproducible
           | science" and then make your solution a CRAN package
        
       | f6v wrote:
       | The only working solution I've seen is using a Docker container
       | with Jupyter Lab and all the dependencies installed. I hate
       | pulling those huge images on my 256GB MBP, but it works. Of
       | course, only bigger labs do that, since individual researchers
       | are often unfamiliar with Docker.
       | 
       | However, if I run my software on an HPC cluster, that's no longer
       | an option. The HPC at my university doesn't allow running Docker,
       | only Singularity containers (which isn't supported on Mac).
        
       | snicker7 wrote:
       | Adding dates to source code? No thanks. If you want
       | reproducibility, invest in guix. Everything else is a hack.
        
       | hermitcrab wrote:
       | Being able to assemble a solution from parts (as in R packages)
       | is super flexible. But complex and potentially brittle.
       | 
       | Reproducibility is a big problem all around. When I create
       | releases I put the binaries as well as the source in version
       | control, because changes in tools/libraries etc mean that I
       | probably won't be able to create the exact same binary several
       | years later from the same source.
       | 
       | There is always a tradeoff between flexibility and simplicity.
       | Clearly software needs to be able to change, or you are never
       | going to be able to improve it or fix bugs. And an assembly of
       | constantly changing parts is clearly going to come with its own
       | challenges.
       | 
       | My own software product, Easy Data Transform (which competes with
       | R to some extent) trades off some flexibility for simplicity by
       | having a single set of binaries for each platform. You can't add
       | any components (without hacking). So the same version of software
       | should always give the same result.
        
       | samch93 wrote:
       | Can recommend the paper "A Reproducible Data Analysis Workflow
       | with R Markdown, Git, Make, and Docker" by Peikert and Brandmaier
       | [1], which shows a much more robust approach to reproducibility.
       | 
       | [1] https://psyarxiv.com/8xzqy/
        
         | mslip wrote:
         | Thanks!
        
       | oli5679 wrote:
       | I find the miniconda docker image quite useful for making
       | reproducible R environments.
       | 
       | You can install specific package versions recorded in
       | environment.yml file.
       | 
       | There are probably many ways to do this but this is an approach I
       | like.
       | 
       | https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-...
       | 
       | https://hub.docker.com/r/continuumio/miniconda
        
       | roel_v wrote:
       | Apart from all the other considerations and problems with various
       | types of package management, consider this:
       | 
       | "Update January 6th, 2021 A reader alerted me to a bug with the
       | current groundhog (version 1.1.0) where you cannot set the
       | groundhog library to be a folder containing spaces in the name."
       | 
       | So we are talking about software here that somehow made it to
       | version 1.1 *without anyone ever using a directory with spaces in
       | it with it*. This can be interpreted in two ways: either very few
       | people have spaces in their paths, or very few people have
       | actually ever even tried (not even really used, I'm only talking
       | about the most basic trial use) this package. I'm not a betting
       | man, but if I were, I know where I'd put my money...
        
         | bayindirh wrote:
         | As I can see from the researchers in our cluster and my own
         | academic research, most people still avoid spaces in paths and
         | files like the plague.
         | 
         | YMMV of course.
        
           | dstick wrote:
           | If my own hobby python projects are anything to go by, there
           | aren't even folders ;-)
           | 
           | I have a friend who taught herself R for her research and it
           | was basically one big procedural codebase.
        
             | YeGoblynQueenne wrote:
             | Best way to know where every bit of code is: put it all in
             | one source file.
             | 
             | Sarcasm aside, I've worked with codebases like that:
             | thousand-line Java methods and classes and the like. The
             | problem is that there's nothing that really forces
             | modularity on a codebase. There isn't even any consensus,
             | objective way to modularise code. Otherwise, a machine
             | could do it and we wouldn't have this kind of problem. But,
             | a machine cannot, and so we do.
        
           | roel_v wrote:
           | Of course, and so do I. But nobody ever even encountering the
           | situation and/or bothering to report it, that's a whole
           | different matter.
        
             | bayindirh wrote:
             | My guess is people are encountering the situation, working
             | around it and calling it a day. Maybe they leave a little
             | note here and there, but I don't think anyone would report
             | it, for a couple of reasons.
             | 
             | First, I don't think people report this type of stuff
             | because they don't know how to report it. Second, they
             | think the software doesn't need to support this use case
             | anyway, since spaces are a latecomer to the naming and path
             | game.
        
           | fjcp wrote:
           | As a Linux user I can relate to that. I always avoid spaces
           | in folders and filenames as they make it more annoying to
           | manipulate them using command line tools. Years later I
           | carried this habit to whatever OS I am using.
        
         | kristaps wrote:
         | Don't remember the source and probably misquoting, but I like
         | this truism: there's software that people complain about and
         | software that nobody is using.
        
           | st1x7 wrote:
           | The original quote is from Bjarne Stroustrup, the creator of
           | C++. The quote also doesn't apply here. (You can't just use
           | it to excuse any problem with software that you come across).
           | The author of the article and the library in it just seems
           | out of their depth in many ways.
        
             | cat199 wrote:
             | > there's software that people complain about and software
             | that nobody is using.
             | 
             | > The original quote is from Bjarne Stroustrup, the creator
             | of C++
             | 
             | i find this ironic, given the 'popularity' (either way) of
             | C++
        
               | st1x7 wrote:
               | I don't think it's ironic, the quote directly addresses
               | the many criticisms towards C++.
        
               | cat199 wrote:
               | ah whoops- completely misread it
        
         | tpxl wrote:
         | Could also be that the package manager doesn't use spaces and
         | most people use package managers?
         | 
         | I.e., Maven will create a folder structure like
         | "/home/user/.m2/repository/com/example/example.jar" which will
         | never have spaces unless the username has spaces (can Linux
         | usernames have spaces?).
        
           | nerdponx wrote:
           | No, the R package manager can tolerate spaces in filenames.
        
           | roel_v wrote:
           | On Unixy systems, spaces are uncommon because so little
           | software can deal with them, so that people are trained from
           | the very beginning to treat spaces like the plague. I do it
           | too - I've been burned by treatment of spaces in shitty 0.x
           | level software so many times (25+ years ago) that I now have
           | an intuitive aversion to anything with spaces.
           | 
           | Spaces in filenames are a reality though, especially on
           | Windows (where the home directory itself used to have spaces
           | in it, and also where many home directories on corporate
           | networks are on network drives and start with \\\\), and any
           | software that can't deal with those kinds of paths has just
           | not been exposed to much (if any) real world use. That was
           | the point I was trying to make - software that can't handle
           | anything but the most bog-standard path names in its core
           | configuration is 'hey guys look at what I hacked up yesterday
           | evening' quality at best. (yes yes it is possible to imagine
           | exceptions, like software that is decades old and ported
           | across platforms; I'm talking about something new that is
           | meant to solve a general problem).
        
         | jbullock35 wrote:
         | A further concern: the repository for this R package [1]
         | doesn't include any test files. Am I right to think that we
         | should be wary of R packages that don't have any unit tests?
         | 
         | [1] https://github.com/CredibilityLab/groundhog
        
         | jcelerier wrote:
         | > This can be interpreted in two ways: either very few people
         | have spaces in their paths
         | 
         | it's been years since I've seen anyone doing that. A main
         | reason is that a very widely used dev tool, make, does not
         | handle spaces in paths:
         | 
         | http://savannah.gnu.org/bugs/?712
         | 
         | thus leading to inertia in the whole ecosystem - if make does
         | not support spaces in paths, why bother
        
         | [deleted]
        
         | IshKebab wrote:
         | > So we are talking about software here that somehow made it to
         | version 1.1 _without anyone ever using a directory with spaces
         | in it with it_.
         | 
         | This is extremely common, especially on Linux. Basically
         | anything that uses things like Bash or CMake will almost
         | certainly not work in directories containing spaces.
         | 
         | Developers don't use paths containing spaces because it causes
         | so many issues with badly written Bash scripts, and as a result
         | they don't test their code with paths containing spaces.
         | 
         | Bash and CMake and similar hacked together languages have very
         | error-prone quoting rules that make it very easy to
         | accidentally make something work with paths without spaces but
         | fail on paths with spaces.
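         | 
         | A minimal Python sketch of that failure mode (the path is a
         | hypothetical example, not from any real project): shell-style
         | word splitting breaks an unquoted path with a space into two
         | arguments, while a quoted path survives intact.

```python
import shlex

# Hypothetical path containing a space.
path = "/home/user/my project/data.csv"

# Unquoted: shell-style word splitting breaks the path into two arguments.
unquoted = shlex.split("cat " + path)

# Quoted: shlex.quote wraps the path so it survives splitting as one argument.
quoted = shlex.split("cat " + shlex.quote(path))

print(unquoted)
print(quoted)
```

         | The same idea applies when calling external tools: passing
         | arguments as a list instead of a single command string avoids
         | the splitting step entirely.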
        
           | Sebb767 wrote:
           | > Developers don't use paths containing spaces because it
           | causes so many issues with badly written Bash scripts, and as
           | a result they don't test their code with paths containing
           | spaces.
           | 
           | It is also a PITA to use when typing in a shell, as you need
           | two characters ( \ + space ) instead of one. So even though
           | my scripts can handle them, I still avoid them if possible.
        
             | benibela wrote:
             | Some programs also use URLs
             | 
             | Today I wanted to send a screenshot by mail.
             | 
             | Should be simple, but not with Gnome. I make the
             | screenshot, Gnome creates a file "Screenshot from ...", but
             | does not tell you where. Then I search it in the file
             | explorer, find it, copy the path. Then I paste the path in
             | the mail program, file:///....Screenshot%20from%20. Then
             | the mail program: "File not found"
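             | 
             | A rough Python sketch of what the receiving program would
             | need to do, assuming a POSIX-style file URL (the screenshot
             | name below is made up for illustration): decode the
             | percent-escapes before looking for the file on disk.

```python
from urllib.parse import unquote, urlparse


def file_url_to_path(url: str) -> str:
    """Convert a file:// URL into a plain POSIX filesystem path."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file URL: {url}")
    # Percent-escapes such as %20 must be decoded before touching the disk.
    return unquote(parsed.path)


# Hypothetical screenshot URL, for illustration only.
url = "file:///home/user/Pictures/Screenshot%20from%202021-01-05.png"
print(file_url_to_path(url))
```

             | A program that skips the decoding step looks for a file
             | literally named "Screenshot%20from%20..." and fails.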
        
         | mattmanser wrote:
         | It doesn't even seem to be on GitHub, in fact the source
         | doesn't seem to be listed anywhere on the project website.
         | 
         | Which in our world would scream 'complete amateur, avoid,
         | avoid, avoid', but perhaps it's different in the R world.
        
           | qwantim1 wrote:
           | No, I think you're correct. Incomplete source is bad in any
           | world.
           | 
           | Unfortunately, it's that world we live in for pretty much
           | everything.
           | 
           | Reproducibility? What if all of the source were to depend on
           | part of a CPU instruction set that we stop using? How long
           | must things be reproducible? We don't even make lab equipment
           | exactly like we used to with the experiments our current
           | sciences are based on.
           | 
           | However, I give a thumbs up to Groundhog for trying to do the
           | right thing.
        
             | corty wrote:
             | Reproducibility down to CPU bit differences is a sign that
             | you did something wrong. Usually calculation with
             | insufficient precision and no thought given to the range of
             | simulation error. Simulation must be treated like a
             | measurement, there is a maximum precision for your
             | instrument and you have to know and apply it.
             | 
             | And even if you might disagree for the single-threaded
             | case, most things running in parallel will eat that free
             | lunch of bit-identical results due to timing differences.
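             | 
             | A small Python illustration of the point, using ordinary
             | double-precision floats: the order of additions alone
             | changes the last bit, so results should be compared within
             | a tolerance rather than bit for bit. The tolerance value
             | here is an arbitrary choice for the sketch.

```python
import math

# The same three numbers, summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# Floating-point addition is not associative, so the results differ.
bit_identical = left == right

# Comparing within a tolerance treats the result like a measurement.
close_enough = math.isclose(left, right, rel_tol=1e-9)

print(bit_identical, close_enough)
```

             | A parallel reduction picks its summation order at run
             | time, which is one way bit-identical results get lost.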
        
           | cowsandmilk wrote:
           | Is it not on GitHub at
           | https://github.com/CredibilityLab/groundhog ?
        
           | Hansi wrote:
           | https://github.com/CredibilityLab/groundhog
        
           | roel_v wrote:
           | While this specific project does have a github page, the R
           | world is 'complete amateur, avoid avoid avoid'. It's not
           | really a 'programming language' in the way software engineers
           | would see it. It's more a loose collection of stats
           | functionality that is tied together with text interfaces in
           | a way that somewhat looks like programming to the
           | uninitiated. I mean, batch scripting is technically
           | 'programming', and Excel (even without VBA) is technically
           | Turing complete, but neither of those would be considered
           | 'programming' by software engineers, at least not under an
           | intuitive understanding of what 'programming' is. (by that I
           | mean, it's easy to be pedantic and argue that R and batch
           | files and Excel files are 'programming' because of [xyz]
           | where [xyz] will probably involve real 'definitions' and
           | selection criteria etc; but despite those tools being
           | _useful_, you can't do real _software engineering_ in them,
           | which you sometimes want/need).
        
             | epistasis wrote:
             | > you can't do real software engineering
             | 
             | This is completely, 100%, absolutely wrong.
             | 
             | Of course you can. There's packages, with excellent
             | software engineering structure, that are designed to
             | include documentation and tests.
             | 
             | R has so much good software engineering, that clever people
             | with no software engineering background can easily make
             | their own packages!
             | 
             | And come on, the R language is a masterpiece. It's not
             | cobbled together like JavaScript or bash. It's got
             | impeccable functional programming language pedigree, you
             | can even inspect the AST of a function directly from inside
             | code.
             | 
             | I'm not sure how you came to any of your conclusions, other
             | than not bothering to understand the language to start.
             | It's a beautiful language with a messy, user contributed
             | set of stats code.
        
               | huijzer wrote:
               | > Of course you can. There's packages, with excellent
               | software engineering structure, that are designed to
               | include documentation and tests.
               | 
               | For me, the problem with R is that the language is
               | inconsistent. Many packages arose to address many
               | problems, but they all feel like a hack on top of the
               | core language. Take the whole Tidyverse; it just does
               | dataframes from R core but then from the ground up. Now,
               | users can choose between the core language dataframes and
               | the Tidyverse dataframes. Same holds for plotting. The
               | core issue, I think, is that the core language misses
               | some essential features which other languages do have
               | nowadays. For example, a type system. In R, since types
               | are missing, everything is a table (dataframe) which I
               | find just weird.
               | 
               | > It's not cobbled together like JavaScript or bash.
               | 
               | But also not as good as my favorite: Julia. Comparing it
               | to Bash is like saying that it's better than COBOL. We all
               | know Bash is quite old, but for certain situations it
               | just works.
        
               | epistasis wrote:
               | The tidyverse is the benefit and the curse of
               | metaprogramming, something that R takes from lisp, and
               | something that has cursed (helped?) C++ since it was
               | added.
               | 
               | As far as type systems go, there are really two different
               | kinds of "types". The first is individual typed objects
               | that can have generic functions attached to them, etc.
               | This is not as well known, and there are actually several
               | object systems for typing:
               | 
               | http://adv-r.had.co.nz/OO-essentials.html
               | 
               | But these sorts of objects are not as commonly created by
               | programmers, because the second kind of "types" is much
               | more useful: data frames, which are a kind of
               | vectorization of structs. This is what would be used in
               | data-oriented design, which is apparently much more
               | common in modern game design.
        
             | vharuck wrote:
             | This argument seems elitist. R is more than just
             | technically Turing complete.
             | 
             | It's definitely a specialized language. It's not the go-to
             | for managing servers or anything with a lot of I/O, but it
             | has those capabilities because they're useful for managing
             | projects. And I'd be hard-pressed to justify using a
             | language for statistical analysis if it doesn't focus on
             | statistical analysis. It'd be like rolling my own
             | cryptography.
             | 
             | You need to differentiate between "base R" (everything that
             | comes with a new install) and community-contributed
             | packages. Base R is amazingly reliable. It has detailed
             | documentation[0].
             | 
             | User-package land is more of a Wild West, that's true. I
             | would personally not use anything that's not on CRAN unless
             | I can walk up to the maintainer's desk (in non-pandemic
             | times).
             | 
             | [0] https://cran.r-project.org/manuals.html
        
               | roel_v wrote:
               | _shrug_. It's largely opinion-based, I guess. My pet
               | peeve (which also illustrates my point, but again, in an
               | opinion-based way): there is no documented, 'officially
               | supported' way to get the path of the current script in
               | R. That is not a problem for amateur programmers who
               | don't think about things like robustness, distribution
               | etc, and it's needlessly complicated and bolted on in
               | SAS, too. But it's still silly and indicative of R's
               | typical use cases. Excel is reliable and well documented
               | too, and I still wouldn't call even complicated workbooks
               | 'software engineering'.
               | 
               | And CRAN... well... let's just say that people used to
               | point to CPAN as a strength of Perl, too... All archives
               | of that sort, after the first few years in which most
               | contributors have deep knowledge and can produce high
               | quality libraries, turn into dumping grounds for trivial
               | half-assed 'libraries' under the guise of 'community
               | contributions'. Example: try to do trivial compound
               | interest simulations in R. So basic that it's barely
               | worth calling 'finance'. There are (at least)
               | three packages on CRAN that claim to do this, except that
               | (depending on which variable in the equation you want to
               | solve for) they all provide only part of the solution, in
               | mostly incompatible ways. And this is because very few of
               | the people putting code into CRAN know how to... well...
               | write good code. This is not an indictment of those
               | people; many of them are much more intelligent than a
               | bunch of us combined. It's just that for them coding is a
               | byproduct, and with good intentions they share what has
               | been useful for them. It just leads to a situation of 'in
               | the land of the blind, the one-eyed man is king'.
        
             | [deleted]
        
         | CJefferson wrote:
         | If you start discarding software which has problems with a
         | space in a directory name, you should start with libtool, at
         | which point you can't build significant chunks of the Linux
         | ecosystem.
         | 
         | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=193163
         | 
         | I hit this when trying to test libgmp (as an example of an
         | important library you would lose).
         | 
         | This means in practice you can't really build most software
         | which uses configure scripts and libraries in a directory with
         | a space -- this may well be what they are hitting.
        
       | hprotagonist wrote:
       | seems like a fairly esoteric way to spell "lockfile with hashes",
       | but hey, R seems fairly esoteric to me anyway.
        
       | qrohlf wrote:
       | I've had some brief run-ins with R, and it doesn't surprise me
       | that it doesn't have a versioning story for packages, and that
       | the patched-in system described here is based on _dates_ rather
       | than something like a SHA or version number...
       | 
       | My favorite description of the language comes from
       | http://arrgh.tim-smith.us/:
       | 
       | > R is a shockingly dreadful language for an exceptionally useful
       | data analysis environment.
       | 
       | I feel like this is just one more data point to support that
       | statement.
        
         | jcheng wrote:
         | Packrat and its successor renv are the most popular package
         | management systems for R, and they are based on versions/SHAs
         | and lockfiles, like most other languages today.
         | 
         | https://rstudio.github.io/packrat/
         | 
         | https://rstudio.github.io/renv/articles/renv.html
        
       | GlennS wrote:
       | I know of two other existing solutions to this, although I don't
       | know enough to compare. I don't think either of these tick all
       | the author's boxes.
       | 
       | Microsoft MRAN https://mran.microsoft.com/
       | 
       | > For the purpose of reproducibility, MRAN hosts daily snapshots
       | of the CRAN R packages and R releases as far back as Sept. 17,
       | 2014.
       | 
       | MRAN doesn't seem to be very well known or used in the R
       | community, but I don't really know why?
       | 
       | Separately, Nix https://nixos.org/ also solves this problem for
       | lots of different languages, but is difficult to get started with
       | and still a bit rough around the edges. Probably not a good
       | recommendation for a typical analyst or academic at this point.
        
         | chalst wrote:
         | The article discusses MRAN in footnote 5, when arguing against
         | the MRAN-based 'checkpoint' approach.
         | 
         | Nixpkgs/NixOS is obviously a useful technology for
         | reproducibility, but note that the output of Nix scripts can
         | depend on the time the system was built, the contents of URLs
         | and the system architecture unless care is taken.
        
           | GlennS wrote:
           | So it does, I missed that!
        
           | myWindoonn wrote:
           | This is misleading; empirically, nixpkgs is about 99% [0]
           | reproducible already. We know that the main variance is
           | between language-specific behaviors; Python, Rust, and C all
           | are prone to reproducibility problems.
           | 
           | In general, we _want_ the output to depend on the system
           | architecture and the contents of URLs. Nix uses hashes to
           | require that URL contents don't change over time, which
           | protects from those contents changing arbitrarily.
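           | 
           | The general idea of hash pinning can be sketched in a few
           | lines of Python. This only illustrates the principle and is
           | not how Nix itself is implemented; the function name and the
           | byte strings are made up for the example.

```python
import hashlib


def verify_pinned(content: bytes, expected_sha256: str) -> bytes:
    """Return content only if its SHA-256 digest matches the pinned value."""
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
    return content


# Hypothetical download, with its hash recorded when the build was written.
original = b"package source, version 1.0"
pinned = hashlib.sha256(original).hexdigest()

# Same bytes later: accepted.
verify_pinned(original, pinned)

# Upstream silently changed the file: rejected instead of built.
try:
    verify_pinned(b"package source, version 1.0 (edited)", pinned)
    tampered_accepted = True
except ValueError:
    tampered_accepted = False

print(tampered_accepted)
```

           | A changed URL then shows up as a loud build failure rather
           | than as a silently different result.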
           | 
           | [0] https://r13y.com/
        
             | chalst wrote:
             | The current community around NixOS and Nixpkgs handles
             | these issues just fine, but if 'just use Nix' was regarded
             | as a magic bullet for reproducibility in science, I'm
             | guessing it wouldn't work out so well.
        
               | myWindoonn wrote:
               | Fortunately, "just use Nix" doesn't do much on its own.
               | People usually want GCC or another complete C toolchain,
               | a C standard library, etc. and this implies that they
               | will use nixpkgs or one of its forks. If people try to
               | "just use Nix" in anger, then they will almost certainly
               | be funneled into using nixpkgs as a matter of practice.
               | 
               | The main problem with reproducibility in science is that
               | most scientists are not actually interested in doing
               | science. Of course software will not fix this problem.
        
         | warlog wrote:
         | It looks like this is much more fine grained compared to mran,
         | i.e., with groundhog, you select the date vs with mran where
         | you use the last (often more than a year old) snapshot.
         | 
         | mran is a great idea and if RStudio (the de facto gate-keepers
         | of the faith -- with Hadley the high priest) pushed to use
         | mran, then the R community would follow suit (like they do for
         | everything else).
         | 
         | This would do a lot to bring MS into the fold, which would
         | actually be great for R.
        
           | kgwgk wrote:
           | They have their own package management library
           | 
           | https://rstudio.github.io/packrat/
           | 
           | and sell their own package management product
           | 
           | https://rstudio.com/products/package-manager/
        
           | jsmith99 wrote:
           | MRAN takes daily snapshots, and is the repository powering
           | this new package.
        
           | Hansi wrote:
           | Hadley works for RStudio, and RStudio now has its own
           | MRAN-type mirror: https://packagemanager.rstudio.com/client/#/
        
         | _Wintermute wrote:
         | MRAN has saved my bacon more than once when I need to replicate
         | some R environment written years ago. The package management in
         | R really is terrible.
        
       | wodenokoto wrote:
       | Wow, that's a lot of pessimism for a fairly elegant solution to
       | the fact that almost no R code has package versioning defined.
       | 
       | I think the major sales point here is:
       | 
       | > A nice feature of groundhog is that it makes 'retrofitting'
       | existing code quite easy. If you come across a script that no
       | longer works, you can change its library() statements for
       | groundhog.library() ones, using as the groundhog.day the date the
       | code was probably written (say when it was posted on the
       | internet), and it may work again.
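       | 
       | As a rough sketch of that retrofit, here is a small Python helper
       | that rewrites simple library() calls in an R script. It assumes
       | the groundhog.library(pkg, date) call form described in the
       | quote; the helper name, the regex, and the sample script are
       | hypothetical, and it deliberately ignores library() calls that
       | carry extra arguments.

```python
import re

# Matches library(pkg) or library("pkg"), but not groundhog.library(...).
LIBRARY_CALL = re.compile(r"""(?<![\w.])library\(\s*["']?([\w.]+)["']?\s*\)""")


def retrofit(r_source: str, groundhog_day: str) -> str:
    """Rewrite simple library() calls to pin packages to a given date."""
    return LIBRARY_CALL.sub(
        lambda m: f'groundhog.library("{m.group(1)}", "{groundhog_day}")',
        r_source,
    )


# Hypothetical old script; the date is a guess at when it was written.
script = 'library(dplyr)\nlibrary("ggplot2")\n'
retrofitted = retrofit(script, "2020-06-01")
print(retrofitted)
```

       | The rewritten script would still need groundhog itself loaded
       | first, which this sketch leaves to the reader.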
       | 
       | I don't know how good packrat is nowadays. I've never met an R
       | application that uses it, but at my old work, we would take a
       | dated snapshot of CRAN at the beginning of every new project. If
       | we needed to update a package we could then "update CRAN" for
       | that project. When productionising a project it would be frozen
       | to a date in CRAN.
        
         | nojito wrote:
         | >Wow, that's a lot of pessimism for a fairly elegant solution
         | to the fact that almost no R code has package versioning
         | defined.
         | 
         | This isn't true.
         | 
         | https://mran.microsoft.com/documents/rro/reproducibility
         | 
         | https://rstudio.github.io/packrat/
        
       | dracodoc wrote:
       | Title aside, the proposed solution just
       | 
       | - uses Microsoft MRAN, which did the heavy lifting of hosting
       | archives
       | 
       | - uses a date instead of a version
       | 
       | - installs packages automatically the first time (which
       | pacman::p_load has been doing for ages) and is easier to use at
       | the script level.
       | 
       | It's no coincidence that most package manager solutions use
       | versions instead of dates to control the environment:
       | 
       | - A paper published in 2017 may have used a date of 2017.10.01,
       | but there is a high possibility that some of the dependency
       | packages were installed at an earlier date, unless the author
       | updated packages every day/week, which is not a good habit anyway
       | because updating too frequently will break things more
       | frequently.
       | 
       | - Then how can you reproduce the environment using a date? The
       | underlying assumption that all packages will be the latest as of
       | that date simply doesn't hold.
       | 
       | That's why packrat/renv etc. use a lock file to record all
       | package versions, and why you need a project to manage
       | libraries: you have to maintain different library environments
       | and cannot install them to the same location.
       | 
       | Yet the author treats installing all packages to a single
       | location as a feature, since you don't need to install the same
       | package again, and tries to avoid projects and prefer scripts as
       | much as possible when doing reproducible research?
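       | 
       | To make the date-versus-lockfile gap concrete, here is a toy
       | Python sketch. The package names, release dates, and versions
       | are all invented for illustration; it is not how groundhog or
       | renv resolve packages internally.

```python
from datetime import date

# Hypothetical release history per package: (release date, version), oldest first.
RELEASES = {
    "pkgA": [(date(2016, 3, 1), "1.0"), (date(2017, 9, 15), "2.0")],
    "pkgB": [(date(2017, 1, 10), "0.5"), (date(2017, 8, 1), "0.6")],
}


def resolve_by_date(pkg: str, as_of: date) -> str:
    """Return the latest version of pkg released on or before as_of."""
    candidates = [version for released, version in RELEASES[pkg] if released <= as_of]
    if not candidates:
        raise LookupError(f"{pkg} had no release on or before {as_of}")
    return candidates[-1]


# What the author actually had installed: pkgA was never updated after 2016.
lockfile = {"pkgA": "1.0", "pkgB": "0.6"}

# What a date-based tool would install for the paper's date.
by_date = {pkg: resolve_by_date(pkg, date(2017, 10, 1)) for pkg in RELEASES}

# Packages where the date-based guess differs from the recorded environment.
mismatches = {
    pkg: (lockfile[pkg], by_date[pkg])
    for pkg in lockfile
    if lockfile[pkg] != by_date[pkg]
}
print(mismatches)
```

       | Only the lockfile records that the analysis ran against the
       | older pkgA; the date alone cannot recover that.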
        
       | paultopia wrote:
       | This language about "threat" seems a bit overblown. Especially
       | when we ask: compared to what? Some commercial package where
       | different versions might have different and poorly documented
       | data storage formats? (Have you ever tried to read an old SPSS or
       | SAS or STATA data file in any reasonable environment? It is a
       | nightmare.) Excel??
        
       | threeseed wrote:
       | Nothing about this is specific to R.
       | 
       | If you want to guarantee reproducible results you have to use a
       | container/image with libraries added at build time. Anytime you
       | are relying on floating versions or downloaded libraries you will
       | have issues.
        
         | jdc wrote:
         | Yeah or even just vendorize your dependencies.
        
         | tempay wrote:
         | Even this isn't enough to be reproducible for complex numeric
         | code as switching CPU can make a big difference with small
         | differences being amplified. Hopefully none of those cases
         | matter but it's hard to definitively prove that.
        
           | andi999 wrote:
           | If the research results depend on small differences being
           | amplified you have a much much bigger problem. (But of course
           | this could happen unnoticed, through sloppy work.)
        
             | bonoboTP wrote:
             | That's true but not an excuse! It's still extremely
             | important when assessing an anomaly. If you can say "okay
             | this is a known-good config that gets me the numbers from
             | the paper", it's an enormous help in uncovering what leads
             | to issues.
             | 
             | If you can't even get those numbers, then you can suspect
             | any number of things. Maybe you're not using the right
             | data, maybe there was a typo, maybe someone fraudulently
             | manually tweaked the numbers, maybe you forgot to do a step
             | in the processing chain etc etc. There's no way to know
             | what's going on if you can't even be sure how the original
             | numbers were created.
        
       | vhhn wrote:
       | There are two camps in the R world - tidyverse and base-R
       | (tinyverse).
       | 
       | It's not a coincidence that the author gives an example from the
       | tidyverse ecosystem. Authors and users of tidyverse value other
       | things like consistency and new features over API stability and
       | backward compatibility. The base-R ecosystem is actually very
       | stable and so the original package manager is very simple.
       | 
       | With R spreading out from the academic environment and with many
       | new authors breaking their packages' APIs we observe new attempts
       | to solve the issues with dependencies (such as renv or
       | https://rsuite.io)
        
       ___________________________________________________________________
       (page generated 2021-01-07 23:03 UTC)