[HN Gopher] Groundhog: Addressing the Threat That R Poses to Rep...
___________________________________________________________________
Groundhog: Addressing the Threat That R Poses to Reproducible
Research
Author : snakeboy
Score : 161 points
Date : 2021-01-05 13:24 UTC (2 days ago)
(HTM) web link (datacolada.org)
(TXT) w3m dump (datacolada.org)
| Railsify wrote:
| This is not a problem that R poses, this is a problem that people
| pose. If you want to run the same code you need to bundle the
| source code of all packages and archive it with the data. This is
| not an R problem at all.
| SoSoRoCoCo wrote:
| > The problem is that packages are constantly being updated, and
| sometimes those updates are not backwards compatible.
|
| Uh oh, someone just discovered the modern programming landscape!
|
| Python, Node, R, Rust, and other langs/OSes with package managers
| are at the mercy of volunteers who keep important packages
| healthy. Once issues stop being fixed, y'all better have local
| copies. This used to be predominantly an OS issue, now it is a
| language issue, too.
| swsieber wrote:
| > The problem is that packages are constantly being updated,
| and sometimes those updates are not backwards compatible.
|
| > Python, Node, R, Rust,
|
| Correct me if I'm wrong, but for binary programs, a lock file
| easily mitigates these issues. I know Node and Rust both support
| lock files.
| SoSoRoCoCo wrote:
| Yes, you're right. That's what makes lock files so important.
| I think we're past worrying about those wheels/npms/pkgs
| disappearing from the internet.
|
| My concern is more about packages that go stale and no longer
| peer-match with other packages that evolve, or major versions that
| change results: not so much for R pkgs but there have been
| cases of major versions breaking existing projects, or
| requiring significant effort to update. (One example that
| zinged me is the FFI interface for Node. The "official"
| package hasn't been touched in years, and the "replacement",
| FFI-NAPI, still has lots of open issues. We were using in-house
| fixes for some time.)
| winrid wrote:
| Does R not allow you to lock into versions?
|
| Is this person suggesting we never improve anything? :)
| AuthorizedCust wrote:
| His example is poor. dplyr wasn't even at 1.0 in 2016. It was
| only 0.5.0: https://blog.rstudio.com/2016/06/27/dplyr-0-5-0/. Of
| course one might expect breaking changes in a maturing package.
| tpoacher wrote:
| A very nice and useful library indeed; though I don't think the
| article really needed to sound so doomsdayish and apocalyptic
| about it.
| mespe wrote:
| While I agree with many of the negative comments here about
| issues with how this is implemented, the tone of some comments
| is... not great. To the point that I would be reluctant to share
| work I do in R on Hacker News, which is not helping anyone.
|
| Just a reminder: https://news.ycombinator.com/newsguidelines.html
| h2odragon wrote:
| No kidding. Seems like an elegant solution to a potential
| problem to me. I've only used R at the "poke stick at numbers"
| level, but this would have been a useful addition to _that_
| trivial use.
| anderscarling wrote:
| I've yet to use it personally, but renv [1] seems to try to solve
| the reproducible builds problem in a way more similar to other
| modern package managers (e.g. by generating a lockfile).
|
| This approach enables stricter validations against tampering with
| the package repositories as a hash of the package can be stored
| in the lockfile, however it is obviously a bit more complex to
| use than the groundhog approach.
|
| [1]: https://github.com/rstudio/renv
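The tamper check mentioned above (storing a package hash in the lockfile) can be sketched roughly as follows. This is a minimal Python illustration of the idea only, not renv's code or its real lockfile format: the lockfile layout, the `verify_package` helper, and the file names are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_package(archive: Path, lockfile: Path) -> bool:
    """Check a downloaded archive against the hash pinned in the lockfile.

    The layout {"packages": {name: {"version", "sha256"}}} is a made-up
    example, not any real tool's format.
    """
    pinned = json.loads(lockfile.read_text())["packages"]
    name = archive.name.split("_")[0]  # "examplepkg_0.5.0.tar.gz" -> "examplepkg"
    return pinned[name]["sha256"] == sha256_of(archive)


# Throwaway files standing in for a real package archive and lockfile.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "examplepkg_0.5.0.tar.gz"
archive.write_bytes(b"pretend this is the package source")

lockfile = workdir / "example.lock.json"
lockfile.write_text(json.dumps({
    "packages": {
        "examplepkg": {"version": "0.5.0", "sha256": sha256_of(archive)}
    }
}))

# The archive matches the hash recorded at snapshot time.
untampered = verify_package(archive, lockfile)

# Simulate the repository serving different bytes under the same name.
archive.write_bytes(b"someone swapped the archive on the mirror")
tampered_detected = not verify_package(archive, lockfile)

print(untampered, tampered_detected)
```

A version number or a date alone cannot catch the second case; only a recorded content hash can.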
| notafraudster wrote:
| Agreed that renv is a better solution here. Even the example
| code for Groundhog is not written in idiomatic R which does not
| inspire confidence. Simonsohn is a legend in transparent
| research but not primarily a coder or software tool contributor
| (take a look at the source for p-curve if you want to see what
| I mean) and I think a secondary threat to reproducibility is
| relying on tools that end up abandoned or deprecated or for
| which bugs never get fixed.
| vharuck wrote:
| >Even the example code for Groundhog is not written in
| idiomatic R which does not inspire confidence.
|
| Not to mention the given example for irreproducibility in
| base R looks at code that would be a bug in the script for
| 3.6. It's only useful to keep this reproducible if I'm
| debugging the script.
|
| And, in this case, anyone who's proficient with R would
| recognize this problem from personal experience or the many
| warnings in tutorials. I usually wouldn't shoot down a given
| example as though it disproved the existence of any example,
| but I don't know if there is another example. Unless old code
| relied on undocumented or contrary-to-documented behavior.
| prepend wrote:
| I came here to say this.
|
| This seems like a non-issue given renv. And renv gives a more
| reproducible, I think, solution as it pins to versions, not
| dates.
| timsneath wrote:
| I'm curious whether this actually solves the problem. I
| understand how this assists with reproducibility of packages, but
| the R software itself is updated frequently, as is briefly noted
| in the preamble to this document. Indeed, the release notes [0]
| are fairly transparent about the relatively long list of changes.
|
| Given this, it almost seems more dangerous to imply through this
| package that a particular date's results are reproducible, since
| unless the user has the same version of R, they may see different
| results anyway.
|
| [0]: https://stat.ethz.ch/pipermail/r-announce/2020/000653.html
| st1x7 wrote:
| Either I'm misunderstanding or this is a non-problem. You can
| specify older versions of a package when you install it. You can
| also manage them with packrat. As long as researchers share their
| language and package versions, you can fully reproduce their
| environment. (And the base language is _really_ stable, almost to
| a fault.)
|
| This is just a bad way for the author to promote their own
| library for dealing with this. The way their library seems to
| approach this (using dates instead of versions) seems horrible
| too - on any given date I can have a random selection of packages
| in my environment, some of them up-to-date, some of them not. So
| unless all researchers start using the author's library (and
| update to the latest versions of everything just before they
| publish), it's only making things worse and not really solving
| the problem it claims to solve.
| kwertzzz wrote:
| I fully agree. With version numbers you can have a good sense of
| whether a package update breaks your code or not (as long as the
| package authors follow semantic versioning).
|
| I think in Julia this problem is solved quite nicely with the
| Project.toml file (the list of packages that you directly depend
| on) and the Manifest.toml file (the version numbers of the complete
| dependency tree, which is automatically generated).
|
| It seems that in groundhog you declare only direct
| dependencies. Is there a way to store the full dependency tree
| in R ?
| Hasnep wrote:
| Using renv (or packrat) generates a renv.lock file with the
| version of every dependency.
| kumarsw wrote:
| The impression I get is that this tool has a forensic bent to
| it. You ask for the code for a paper and Joe grad student with
| no programming knowledge emails you a zipped folder of R
| scripts. He just finished his dissertation, is starting a new
| job somewhere on the west coast, and no longer has access to
| the computer in his old advisor's lab where he did the work.
| The implementation may (or may not) be lousy, but the use
| case sounds plenty valid.
| dash2 wrote:
| Yeah, this seems half-thought-through. renv works at project
| level and isolates the dependencies of a project from your main
| library. groundhog.library() tramples over your library
| installing multiple versions. It also has the "cute" feature of
| auto-installing libraries if they aren't on your system
| already. Yuck. If you really wanted this script-only solution
| then you could go with the `versions` library, which already
| lets you specify an installation date.[1]
|
| [1]:
| https://cran.r-project.org/web/packages/versions/index.html
| mslip wrote:
| Not really relevant, but just to note: packrat has been
| soft-deprecated and is superseded by renv, which comes standard
| with RStudio.
| meztez wrote:
| Fully agreed. Add `sessionInfo()` output as an appendix to your
| publication. Should not be too hard to rebuild from that.
| ekianjo wrote:
| Packrat is deprecated now. It is recommended to use renv
| instead.
| st1x7 wrote:
| Thanks, I just checked renv out (I haven't had to work with R
| in over a year). Renv looks much better than packrat at a
| first glance.
| _Wintermute wrote:
| Correct me if I'm wrong but specifying an older package version
| in R still pulls the newest packages from CRAN for any
| dependencies, which is a quick way to run into a load of
| incompatibilities.
|
| I've not tried renv yet but packrat was a pretty poor solution.
| jcheng wrote:
| If you need to use an older version of a package and don't
| have a packrat/renv lockfile already, then packrat/renv are
| not going to help you. mran/checkpoint could though.
|
| I agree packrat (which I created) was a poor solution for
| most users. renv is far better and more usable.
| mslip wrote:
| You can specify version numbers in renv, you take snapshots
| of your dependencies into a lock file and can always restore
| from there or make a new snapshot.
| torcete wrote:
| I hope all these efforts for reproducibility and avoiding library
| resolution conflicts consolidate into one solution.
| ppod wrote:
| Very clickbaity headline. The problem described is real, but just
| as real or worse in other statistical software, so it's not 'R'
| as a whole that poses a threat to reproducibility.
| asperous wrote:
| Reproducible? Or deterministic?
|
| There are certainly benefits to being able to pull down research
| source code, and bug checking it. That's how programmers check
| code: tests and audits.
|
| However I think reproducing research is more often than not done
| "from scratch", taking a new sample, treating it, checking
| results. "independent verification".
|
| Re-using source code saves time, but I would argue not being able
| to shouldn't threaten reproducibility.
| geomark wrote:
| Back when I studied this stuff there was a distinction between
| reproducible (rerun analysis on the data from the original
| experiment and see if you get the same results - if not then
| there is an error in the analysis) and replicable (redo the
| entire experiment by taking new data and running the analysis).
| bonoboTP wrote:
| This is why people are starting to distinguish between the
| terms repeatability, reproducibility, and replicability.
|
| > You give me your code and enough information for me to
| produce an identical environment or (even better) your code is
| insensitive to the environment, then your research is Repeatable.
|
| > If you describe your study sufficiently well that I can re-
| implement your study from scratch, without looking at your code
| and still get the same answer, then it is Reproducible
|
| > If I can arrive at the same conclusions as you, just from a
| description of its aims, then it is Replicable.
|
| From https://academia.stackexchange.com/a/118518/15198
| coliveira wrote:
| That's something that eludes software people: reproducibility
| in science is the ability to create independent tests. Making
| software available, while useful, does very little for
| reproducibility from the scientific point of view.
| f6v wrote:
| There're fields of science, like computational biology, where
| it's all about the code. I wish the methods section was
| always 100% unambiguous, but it's not the case. And nowadays
| the computational pipelines have to support analysis of up to
| terabytes of data. You can imagine how many dependencies such
| pipelines have. Sometimes I have trouble installing the
| software even when a package manager such as Anaconda is used.
| petters wrote:
| It's true that "reproduce" can mean different things in
| software and science. But having the code available to
| "reproduce" any plots in a paper should be a requirement for
| publication, imo. It certainly is the case for my papers.
| bonoboTP wrote:
| It's a minimum standard, though. Of course the goal is
| reproducibility from a broader point of view, but that's not
| an excuse to do research in a one-off way where nobody is
| able to show how to get those numbers again, after a year or
| so from publication.
|
| The coding standards are often abysmally, unexpectedly
| terrible. Often not even the help of the original authors is
| enough to be able to produce the same figures from a paper
| because things and settings and commands get forgotten. Some
| part of the analysis was done in one language, another part
| in Excel. Some of the code has now disappeared. Some of the
| libraries are no longer working. Some people left and their
| academic storage space was wiped and therefore the
| intermediate steps and results or notes are deleted. You
| wouldn't believe it.
|
| Once a paper is published researchers are not really
| incentivized to document things or maintain the materials.
| They got the publication, they put it on their CV. On to the
| next project! No time to waste on work that's already
| completed. New work leads to new publications, messing around
| with the old code for the sake of a potential later person
| interested in it is a waste from the point of view of a
| researcher, career-wise. Also, nobody ever attempts to reproduce
| most papers.
| coliveira wrote:
| The researcher's job is to properly do an experiment and
| document it in a paper. If we require more than this, then
| we will damage the scientific process for two
| reasons: (1) companies are not willing to make available
| software developed by their researchers, therefore they
| will publish even less; and (2) universities don't have
| money and staff to produce and maintain software at these
| standards, so professors will be required to publish fewer
| papers.
| bonoboTP wrote:
| "These standards" are pretty low. Currently it's a free-
| for-all chaos. Theoretically papers are reproducible from
| the documentation found in the paper but that is a lie.
| It is never reproducible just from the paper. Lots of
| stuff is done in the background that is not known to the
| reader. For all we know, they can even tweak their
| numbers to be 2% better and if someone can't get the
| results of the paper from the released code, the authors
| can just ignore it or say, the problem is not on their
| side, or that the paper numbers were generated with a
| slightly different code than the released version etc.
| I've seen this many times on Github, issues getting
| closed or deleted without comment etc. There is zero
| accountability.
|
| It's slowly changing though but many people are grinding
| their teeth, because they can't torture the data as much
| if things are out in the open.
| coliveira wrote:
| > "These standards" are pretty low. Currently it's a
| free-for-all chaos.
|
| I disagree. It is not perfect, but it is certainly a
| process that enables scientific development, as it has
| for centuries. If we start to create more and more rules
| that researchers need to follow, it will become even
| harder to do scientific research and most institutions
| won't have resources to continue.
| jdale27 wrote:
| Ideally research does get reproduced from scratch; I think what
| people usually mean when they talk about the
| replication/reproducibility crisis in science is not being able
| to reproduce an experiment with new samples, independent data
| analysis, etc.
|
| However, if you can't even reproduce an analysis with the
| authors' own data and code, that's a red flag before you even
| get to the starting line. Ensuring that level of
| reproducibility is, I think, an essential ingredient to
| enabling the stronger form of reproducibility.
|
| Personally, I made the mistake during my graduate career of
| trying to reimplement an analysis using a certain rather
| complicated ML algorithm, from scratch, in a different language
| than the original authors had used. After struggling mightily
| to get it to work, I finally bothered to try to get their own
| code working. (I had been hesitant to do so because I wasn't
| proficient in the language they used, and it wasn't even clear
| they had released all the necessary code, aside from the core
| algorithm.) Once I did that, I discovered that I couldn't even
| get their own code working on their own data, and gave up. This
| was research published in Science by a group from a top-tier
| research university. (I don't fully blame the authors, it may
| well have been my own incompetence that was the issue. But it
| just serves as yet another illustration of how pervasive and
| disregarded the reproducibility issue was for a long while.)
| davnn wrote:
| > Re-using source code saves time, but I would argue not being
| able to shouldn't threaten reproducibility.
|
| More often than not it's not clear from a paper what exactly
| the authors did to achieve a specific result. Being able to
| exactly reproduce what previous authors did should improve
| reproducibility; also for new samples.
| asdff wrote:
| It's also standard fare for typical lab work. A good paper's
| methods section would contain enough detail for you to go
| into your lab and repeat the experiment yourself, even down
| to the catalog number for the reagents to order from the lab
| supplier. Code should be no different, that's why it's
| encouraged that authors submit all code used in analysis and
| generation of figures.
| davnn wrote:
| The fun thing is that there are approaches that want to go
| beyond this kind of methodological description of a
| scientific process to _code_ [0, 1]. In general I would
| say that the more we can remove the human aspect and
| inherent ambiguity of science, the better for
| reproducibility. See [2] for a couple of examples.
|
| [0] https://www.emeraldcloudlab.com/
| [1] https://nextjournal.com/
| [2] https://www.youtube.com/watch?v=L1UgdoP2aeg
| stewbrew wrote:
| I wish the (default) utils::install.packages function could take
| a version number of the requested library. I also wish library()
| would automatically install libraries not available on the
| system. (Both can be achieved with custom functions that shadow
| the default ones but I would like to see this functionality in
| the base packages.) Other than that, I think all alternatives to
| this "threat called R" are worse. It's telling the author has to
| cite a bug from 2016 for an example of a breaking change.
| cauthon wrote:
| This title is an exceedingly hot take for someone who wrote a new
| package manager.
|
| Also, it appears that Groundhog is itself a CRAN package and the
| author recommends installing with install.packages(). So is the
| author committing to never making any backwards incompatible
| updates to their new package?
| coolreader18 wrote:
| > So is the author committing to never making any backwards
| incompatible updates to their new package?
|
| Well, yes, probably. It's not all that hard, and groundhog
| seems to have a fairly simple API anyways.
|
| And groundhog still uses CRAN packages, it just brings a method
| of pinning them to a specific version.
| bsza wrote:
| I think it's more like a Wayback Machine for R programs, since
| the author of a science paper isn't required to use groundhog.
| You can just provide it the date the article was published,
| which you already know, and it reconstructs how the program
| worked on that day.
|
| Also, because groundhog isn't made for the author to use,
| whether or not the interface changes is irrelevant. You'll
| never encounter library(groundhog) in a paper.
| st1x7 wrote:
| > and it reconstructs how the program worked on that day.
|
| It reconstructs how the fully updated version of everything
| worked that day which isn't necessarily the same as the
| researcher's environment. It's a horrible idea to use dates
| instead of package versions for this. The author's library
| doesn't solve the problem it claims to solve.
| SCLeo wrote:
| If I am understanding this correctly, the problem is that
| the paper authors do _not_ provide a specific version or a
| package.json equivalent. In that case, using dates seems to
| be the only choice.
| st1x7 wrote:
| Even if that's the case, using dates isn't a solution
| because dates don't give you the build that the
| researcher used. Date of publication is different from
| the date when the code ran and there is no guarantee that
| the researcher ran the latest version of every dependency
| that was available to them anyway. In fact that's very
| unlikely considering that some of their libraries might
| require older versions. It might not even be possible to
| take the latest version of every package and use them in
| the same environment.
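To make the date-versus-version disagreement concrete, here is a rough Python sketch of what resolving a package "as of" a date amounts to. It is an illustration under stated assumptions, not groundhog's implementation: the release history, the `version_as_of` helper, and the "installed" version are all invented.

```python
from datetime import date

# Hypothetical release history for one package: (release date, version).
RELEASES = [
    (date(2016, 6, 1), "0.5.0"),
    (date(2017, 6, 1), "0.7.0"),
    (date(2020, 6, 1), "1.0.0"),
]


def version_as_of(releases, when):
    """Return the newest version released on or before `when`."""
    available = [version for released, version in sorted(releases)
                 if released <= when]
    if not available:
        raise LookupError(f"no release existed on {when}")
    return available[-1]


# A reader only knows when the paper came out.
publication_date = date(2020, 9, 1)
resolved = version_as_of(RELEASES, publication_date)

# Invented for illustration: the author never upgraded before publishing.
installed_by_author = "0.7.0"
mismatch = resolved != installed_by_author

print(resolved, installed_by_author, mismatch)
```

The lookup always returns the newest release available on that date. That matches the author's environment only if they were fully up to date when the code ran, which is the objection raised above. When no version record exists at all, the date is still a closer guess than today's latest release.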
| SCLeo wrote:
| So, what is your better alternative then? I honestly
| believe using the version available at that date is better
| than using the latest version.
| [deleted]
| jsmith99 wrote:
| That's the problem. This package is very similar to Microsoft's
| checkpoint package which is based on Microsoft's MRAN
| snapshots, and this package also uses MRAN. The article
| explains the difference is that this package allows you to
| specify the date in the code itself, whereas checkpoint is used
| to set a whole installation to a specific date. But this is no
| advantage as it means code will stop working if the groundhog
| package changes, whereas with checkpoint a paper could just say
| 'use packages as of date x'.
| resonantjacket5 wrote:
| Your take seems a bit 'hot' too?
|
| How else would you install the cran packages without using
| install.packages? Unless you want them to recursively
| install it using groundhog but that seems unnecessary.
|
| As long as you have the timestamp it should work, though I
| assume there will be some edge case.
|
| What you're saying is like don't use pip because you don't
| install it using pip? Or don't use package-lock.json because
| you can't install npm through npm?
| cat199 wrote:
| > Your take seems a bit 'hot' too?
|
| OP is not claiming that Groundhog itself is a threat to the R
| language ecosystem itself, whereas the author is claiming
| that the R language is itself a threat to Science itself...
| scottmcdot wrote:
| Someone correct me if I'm wrong, but can't you copy and paste
| the package folder into your libpath directory and R can load
| it that way without actually running install.packages()?
| vharuck wrote:
| Usually, yes. However, it is possible for a package to have
| code that only runs when it is installed. If you just copy-
| paste, it won't be run.
| cauthon wrote:
| No, I'm saying don't call CRAN a "threat to reproducible
| science" and then make your solution a CRAN package
| f6v wrote:
| The only working solution I've seen is using a Docker container
| with Jupyter Lab and all the dependencies installed. I hate
| pulling those huge images on my 256GB MBP, but it works. Of
| course, only bigger labs do that, since individual researchers
| are often unfamiliar with Docker.
|
| However, if I run my software on an HPC cluster, that's no longer an
| option. The HPC at my university doesn't allow running Docker,
| only Singularity containers (which isn't supported on Mac).
| snicker7 wrote:
| Adding dates to source code? No thanks. If you want
| reproducibility, invest in guix. Everything else is a hack.
| hermitcrab wrote:
| Being able to assemble a solution from parts (as in R packages)
| is super flexible. But complex and potentially brittle.
|
| Reproducibility is a big problem all around. When I create
| releases I put the binaries as well as the source in version
| control, because changes in tools/libraries etc mean that I
| probably won't be able to create the exact same binary several
| years later from the same source.
|
| There is always a tradeoff between flexibility and simplicity.
| Clearly software needs to be able to change, or you are never
| going to be able to improve it or fix bugs. And an assembly of
| constantly changing parts is clearly going to come with its own
| challenges.
|
| My own software product, Easy Data Transform (which competes with
| R to some extent) trades off some flexibility for simplicity by
| having a single set of binaries for each platform. You can't add
| any components (without hacking). So the same version of software
| should always give the same result.
| samch93 wrote:
| Can recommend the paper "A Reproducible Data Analysis Workflow
| with R Markdown, Git, Make, and Docker" by Peikert and Brandmaier
| [1], which shows a much more robust approach to reproducibility.
|
| [1] https://psyarxiv.com/8xzqy/
| mslip wrote:
| Thanks!
| oli5679 wrote:
| I find the miniconda docker image quite useful for making
| reproducible R environments.
|
| You can install specific package versions recorded in an
| environment.yml file.
|
| There are probably many ways to do this but this is an approach I
| like.
|
| https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-...
|
| https://hub.docker.com/r/continuumio/miniconda
| roel_v wrote:
| Apart from all the other considerations and problems with various
| types of package management, consider this:
|
| "Update January 6th, 2021 A reader alerted me to a bug with the
| current groundhog (version 1.1.0) where you cannot set the
| groundhog library to be a folder containing spaces in the name."
|
| So we are talking about software here that somehow made it to
| version 1.1 *without anyone ever using a directory with spaces in
| it with it*. This can be interpreted in two ways: either very few
| people have spaces in their paths, or very few people have
| actually ever even tried (not even really used, I'm only talking
| about the most basic trial use) this package. I'm not a betting
| man, but if I were, I know where I'd put my money...
| bayindirh wrote:
| As I can see from the researchers in our cluster and my own
| academic research, most people still avoid spaces in paths and
| files like the plague.
|
| YMMV of course.
| dstick wrote:
| If my own hobby python projects are anything to go by, there
| aren't even folders ;-)
|
| I have a friend who taught herself R for her research and it
| was basically one big procedural codebase.
| YeGoblynQueenne wrote:
| Best way to know where every bit of code is: put it all in
| one source file.
|
| Sarcasm aside, I've worked with codebases like that:
| thousand-line Java methods and classes and the like. The
| problem is that there's nothing that really forces
| modularity on a codebase. There isn't even any consensus,
| objective way to modularise code. Otherwise, a machine
| could do it and we wouldn't have this kind of problem. But,
| a machine cannot, and so we do.
| roel_v wrote:
| Of course, and so do I. But nobody ever even encountering the
| situation and/or bothering to report it, that's a whole
| different matter.
| bayindirh wrote:
| My guess is people are encountering the situation, working
| around it and calling it a day. Maybe a little note here and
| there but, I don't think someone would report it due to a
| couple of reasons.
|
| First of all, I don't think people report this type of
| stuff because they don't know how to report it, and
| secondly they think it doesn't need to support this use case
| anyway, since spaces are latecomers to the naming and path game.
| fjcp wrote:
| As a Linux user I can relate to that. I always avoid spaces
| in folders and filenames as they make it more annoying to
| manipulate them using command line tools. Years later I
| carried this habit to whatever OS I am using.
| kristaps wrote:
| Don't remember the source and probably misquoting, but I like
| this truism: there's software that people complain about and
| software that nobody is using.
| st1x7 wrote:
| The original quote is from Bjarne Stroustrup, the creator of
| C++. The quote also doesn't apply here. (You can't just use
| it to excuse any problem with software that you come across).
| The author of the article and the library in it just seems
| out of their depth in many ways.
| cat199 wrote:
| > there's software that people complain about and software
| that nobody is using.
|
| > The original quote is from Bjarne Stroustrup, the creator
| of C++
|
| i find this ironic, given the 'popularity' (either way) of
| C++
| st1x7 wrote:
| I don't think it's ironic, the quote directly addresses
| the many criticisms towards C++.
| cat199 wrote:
| ah whoops- completely misread it
| tpxl wrote:
| Could also be that the package manager doesn't use spaces and
| most people use package managers?
|
| Ie maven will create a folder structure like
| "/home/user/.m2/repository/com/example/example.jar" which will
| never have spaces unless the username has spaces (Can linux
| usernames have spaces?).
| nerdponx wrote:
| No, the R package manager can tolerate spaces in filenames.
| roel_v wrote:
| On Unixy systems, spaces are uncommon because so little
| software can deal with them, so that people are trained from
| the very beginning to treat spaces like the plague. I do it
| too - I've been burned by treatment of spaces in shitty 0.x
| level software so many times (25+ years ago) that I now have
| an intuitive aversion to anything with spaces.
|
| Spaces in filenames are a reality though, especially on
| Windows (where the home directory itself used to have spaces
| in it, and also where many home directories on corporate
| networks are on network drives and start with \\), and any
| software that can't deal with those kinds of paths has just
| not been exposed to much (if any) real world use. That was
| the point I was trying to make - software that can't handle
| anything but the most bog-standard path names in its core
| configuration is 'hey guys look at what I hacked up yesterday
| evening' quality at best. (yes yes it is possible to imagine
| exceptions, like software that is decades old and ported
| across platforms; I'm talking about something new that is
| meant to solve a general problem).
| jbullock35 wrote:
| A further concern: the repository for this R package [1]
| doesn't include any test files. Am I right to think that we
| should be wary of R packages that don't have any unit tests?
|
| [1]: https://github.com/CredibilityLab/groundhog
| jcelerier wrote:
| > This can be interpreted in two ways: either very few people
| have spaces in their paths
|
| it's been years since I've seen anyone doing that - a main
| reason is that a very widely used dev tool, make, does not
| handle spaces in paths:
|
| http://savannah.gnu.org/bugs/?712
|
| thus leading to inertia in the whole ecosystem - if make does
| not support spaces in paths, why bother
| [deleted]
| IshKebab wrote:
| > So we are talking about software here that somehow made it to
| version 1.1 _without anyone ever using a directory with spaces
| in it with it_.
|
| This is extremely common, especially on Linux. Basically
| anything that uses things like Bash or CMake will almost
| certainly not work in directories containing spaces.
|
| Developers don't use paths containing spaces because it causes
| so many issues with badly written Bash scripts, and as a result
| they don't test their code with paths containing spaces.
|
| Bash and CMake and similar hacked together languages have very
| error-prone quoting rules that make it very easy to
| accidentally make something work with paths without spaces but
| fail on paths with spaces.
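The quoting failure described above is easy to reproduce without a shell. Here is a small sketch using Python's standard `shlex` module, which follows POSIX shell splitting rules; the path is a made-up example.

```python
import shlex

path = "/home/user/my project/data.csv"  # hypothetical path with a space

# Pasting the path into a command string unquoted: the shell-style
# splitter sees two arguments where one was intended.
unquoted = shlex.split(f"cat {path}")

# Quoting the path first keeps it together as a single argument.
quoted = shlex.split(f"cat {shlex.quote(path)}")

print(unquoted)  # ['cat', '/home/user/my', 'project/data.csv']
print(quoted)    # ['cat', '/home/user/my project/data.csv']
```

The unquoted form works for every path without a space, so a script tested only on such paths looks correct until someone runs it from a directory like this one.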
| Sebb767 wrote:
| > Developers don't use paths containing spaces because it
| causes so many issues with badly written Bash scripts, and as
| a result they don't test their code with paths containing
| spaces.
|
| It is also a PITA to use when typing in a shell, as you need
| two characters ( \ + space ) instead of one. So even though
| my scripts can handle them, I still avoid them if possible.
| benibela wrote:
| Some programs also use URLs
|
| Today I wanted to send a screenshot by mail.
|
| Should be simple, but not with Gnome. I make the
| screenshot, Gnome creates a file "Screenshot from ...", but
| does not tell you where. Then I search it in the file
| explorer, find it, copy the path. Then I paste the path in
| the mail program, file:///....Screenshot%20from%20. Then
| the mail program: "File not found"
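|
| A hedged sketch of the step the mail program presumably skipped:
| turning a percent-encoded file:// URL back into a plain path
| before opening it. The URL below is a hypothetical example (the
| one in the comment is truncated), the helper name is made up, and
| POSIX-style paths are assumed.

```python
from urllib.parse import unquote, urlsplit


def file_url_to_path(url):
    """Convert a file:// URL into a local POSIX-style path."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"not a file URL: {url!r}")
    # Percent-escapes such as %20 must be decoded before the path is used.
    return unquote(parts.path)


# Hypothetical screenshot location, for illustration only.
example_url = "file:///home/user/Pictures/Screenshot%20from%202021-01-05.png"
print(file_url_to_path(example_url))
```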
| mattmanser wrote:
| It doesn't even seem to be on GitHub, in fact the source
| doesn't seem to be listed anywhere on the project website.
|
| Which in our world would scream 'complete amateur, avoid,
| avoid, avoid', but perhaps it's different in the R world.
| qwantim1 wrote:
| No, I think you're correct. Incomplete source is bad in any
| world.
|
| Unfortunately, it's that world we live in for pretty much
| everything.
|
| Reproducibility? What if all of the source were to depend on
| part of a CPU instruction set that we stop using? How long
| must things be reproducible? We don't even make lab equipment
| exactly like we used to with the experiments our current
| sciences are based on.
|
| However, I give a thumbs up to Groundhog for trying to do the
| right thing.
| corty wrote:
| Reproducibility down to CPU bit differences is a sign that
| you did something wrong. Usually calculation with
| insufficient precision and no thought given to the range of
| simulation error. Simulation must be treated like a
| measurement, there is a maximum precision for your
| instrument and you have to know and apply it.
|
| And even if you might disagree for the single-threaded
| case, most things running in parallel will eat that free
| lunch of bit-identical results due to timing differences.
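|
| A small illustration of why bit-identical results are fragile:
| floating-point addition is not associative, so the order in which
| terms are combined (which can vary between parallel runs) changes
| the last bits. Comparing within a tolerance, as one would for a
| measurement, is the usual remedy. The tolerance below is an
| arbitrary choice for this sketch.

```python
import math

# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Exact comparison fails even though both are "the same sum".
bit_identical = left_to_right == right_to_left

# Comparison within a stated relative tolerance succeeds.
within_tolerance = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)

print(bit_identical, within_tolerance)
```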
| cowsandmilk wrote:
| Is it not on GitHub at
| https://github.com/CredibilityLab/groundhog ?
| Hansi wrote:
| https://github.com/CredibilityLab/groundhog
| roel_v wrote:
| While this specific project does have a github page, the R
| world is 'complete amateur, avoid avoid avoid'. It's not
| really a 'programming language' in the way software engineers
| would see it. It's more a loose collection of stats
| functionality that is tied together with text interfaces in
| a way that somewhat looks like programming to the
| uninitiated. I mean, batch scripting is technically
| 'programming', and Excel (even without VBA) is technically
| Turing complete, but neither of those would be considered
| 'programming' by software engineers, at least not under an
| intuitive understanding of what 'programming' is. (by that I
| mean, it's easy to be pedantic and argue that R and batch
| files and Excel files are 'programming' because of [xyz]
| where [xyz] will probably involve real 'definitions' and
| selection criteria etc; but despite those tools being
| _useful_ , you can't do real _software engineering_ in them,
| which you sometimes want /need).
| epistasis wrote:
| > you can't do real software engineering
|
| This is completely, 100%, absolutely wrong.
|
| Of course you can. There's packages, with excellent
| software engineering structure, that are designed to
| include documentation and tests.
|
| R has so much good software engineering, that clever people
| with no software engineering background can easily make
| their own packages!
|
| And come on, the R language is a masterpiece. It's not
| cobbled together like JavaScript or bash. It's got
| impeccable functional programming language pedigree, you
| can even look at the AST of a function directly inside
| code.
|
| I'm not sure how you came to any of your conclusions, other
| than not bothering to understand the language to start.
| It's a beautiful language with a messy, user contributed
| set of stats code.
| huijzer wrote:
| > Of course you can. There's packages, with excellent
| software engineering structure, that are designed to
| include documentation and tests.
|
| For me, the problem with R is that the language is
| inconsistent. Many packages arose to address many
| problems, but they all feel like a hack on top of the
| core language. Take the whole Tidyverse; it just does
| dataframes from R core but then from the ground up. Now,
| users can choose between the core language dataframes and
| the Tidyverse dataframes. Same holds for plotting. The
| core issue, I think, is that the core language misses
| some essential features which other languages do have
| nowadays. For example, a type system. In R, since types
| are missing, everything is a table (dataframe) which I
| find just weird.
|
| > It's not cobbled together like JavaScript or bash.
|
| But also not as good as my favorite: Julia. Comparing it
| to Bash is like saying that it's better than COBOL. We all
| know Bash is quite old, but for certain situations it
| just works.
| epistasis wrote:
| The tidyverse is the benefit and the curse of
| metaprogramming, something that R takes from lisp, and
| something that has cursed (helped?) C++ since it was
| added.
|
| As far as type systems go, there are really two different
| kinds of "types". The first is individually typed objects that
| can have generic functions attached to them, etc. This is not
| as well known, and there are actually several object systems
| for typing:
|
| http://adv-r.had.co.nz/OO-essentials.html
|
| But these sorts of objects are not quite as commonly
| created by programmers, because the second kind of "types"
| is much more useful: data frames, which are kind of a
| vectorization of structs. This is what would be used in data
| oriented design, which is apparently much more common in
| modern game design.
| vharuck wrote:
| This argument seems elitist. R is more than just
| technically Turing complete.
|
| It's definitely a specialized language. It's not the go-to
| for managing servers or anything with a lot of I/O, but it
| has those capabilities because they're useful for managing
| projects. And I'd be hard-pressed to justify using a
| language for statistical analysis if it doesn't focus on
| statistical analysis. It'd be like rolling my own
| cryptography.
|
| You need to differentiate between "base R" (everything that
| comes with a new install) and community-contributed
| packages. Base R is amazingly reliable. It has detailed
| documentation[0].
|
| User-package land is more of a Wild West, that's true. I
| would personally not use anything that's not on CRAN unless
| I can walk up to the maintainer's desk (in non-pandemic
| times).
|
| [0] https://cran.r-project.org/manuals.html
| roel_v wrote:
| _shrug_. It's largely opinion-based, I guess. My pet
| peeve (which also illustrates my point, but again, in an
| opinion-based way): there is no documented, 'officially
| supported' way to get the path of the current script in
| R. That is not a problem for amateur programmers who
| don't think about things like robustness, distribution
| etc, and it's needlessly complicated and bolted on in
| SAS, too. But it's still silly and indicative of R's
| typical use cases. Excel is reliable and well documented
| too, and I still wouldn't call even complicated workbooks
| 'software engineering'.
|
| And CRAN... well... let's just say that people used to
| point to CPAN as a strength of Perl, too... All archives of
| that sort, after the first few years, which consist mostly
| of contributors with deep knowledge who can produce high
| quality libraries, turn into dumping grounds
| for trivial half-assed 'libraries' under the guise of
| 'community contributions'. Example: try to do trivial
| compound interest simulations in R. So basic that it's
| barely worth calling 'finance'. There are (at least)
| three packages on CRAN that claim to do this, except that
| (depending on which variable in the equation you want to
| solve for) they all provide only part of the solution, in
| mostly incompatible ways. And this is because very few of
| the people putting code into CRAN know how to... well...
| write good code. This is not an indictment of those
| people; many of them are much more intelligent than a
| bunch of us combined. It's just that for them coding is a
| byproduct, and with good intentions they share what has
| been useful for them, it just leads to a situation of 'in
| the land of the blind one eye is king'.
| [deleted]
| CJefferson wrote:
| If you start discarding software which has problems with a
| space in a directory name, you should start with libtool, at
| which point you can't build significant chunks of the Linux
| ecosystem.
|
| https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=193163
|
| I hit this when trying to test libgmp (as an example of an
| important library you would lose).
|
| This means in practice you can't really build most software
| which uses configure scripts and libraries in a directory with
| a space -- this may well be what they are hitting.
| hprotagonist wrote:
| seems like a fairly esoteric way to spell "lockfile with hashes",
| but hey, R seems fairly esoteric to me anyway.
| qrohlf wrote:
| I've had some brief run-ins with R, and it doesn't surprise me
| that it doesn't have a versioning story for packages, and that
| the patched-in system described here is based on _dates_ rather
| than something like a SHA or version number...
|
| My favorite description of the language comes from
| http://arrgh.tim-smith.us/:
|
| > R is a shockingly dreadful language for an exceptionally useful
| data analysis environment.
|
| I feel like this is just one more data point to support that
| statement.
| jcheng wrote:
| Packrat and its successor renv are the most popular package
| management systems for R, and they are based on versions/SHAs
| and lockfiles, like most other languages today.
|
| https://rstudio.github.io/packrat/
|
| https://rstudio.github.io/renv/articles/renv.html
| GlennS wrote:
| I know of two other existing solutions to this, although I don't
| know enough to compare. I don't think either of these tick all
| the author's boxes.
|
| Microsoft MRAN https://mran.microsoft.com/
|
| > For the purpose of reproducibility, MRAN hosts daily snapshots
| of the CRAN R packages and R releases as far back as Sept. 17,
| 2014.
|
| MRAN doesn't seem to be very well known or used in the R
| community, but I don't really know why?
|
| Separately, Nix https://nixos.org/ also solves this problem for
| lots of different languages, but is difficult to get started with
| and still a bit rough around the edges. Probably not a good
| recommendation for a typical analyst or academic at this point.
| chalst wrote:
| The article discusses MRAN in footnote 5, when arguing against
| the MRAN-based 'checkpoint' approach.
|
| Nixpkg/Nixos is obviously a useful technology for
| reproducibility, but note that the output of Nix scripts can
| depend on the time the system was built, the contents of URLs
| and the system architecture unless care is taken.
| GlennS wrote:
| So it does, I missed that!
| myWindoonn wrote:
| This is misleading; empirically, nixpkgs is about 99% [0]
| reproducible already. We know that the main variance is
| between language-specific behaviors; Python, Rust, and C all
| are prone to reproducibility problems.
|
| In general, we _want_ the output to depend on the system
| architecture and the contents of URLs. Nix uses hashes to
| require that URL contents don't change over time, which
| protects from those contents changing arbitrarily.
|
| [0] https://r13y.com/
| chalst wrote:
| The current community around NixOS and Nixpkgs handles
| these issues just fine, but if 'just use Nix' was regarded
| as a magic bullet for reproducibility in science, I'm
| guessing it wouldn't work out so well.
| myWindoonn wrote:
| Fortunately, "just use Nix" doesn't do much on its own.
| People usually want GCC or another complete C toolchain,
| a C standard library, etc. and this implies that they
| will use nixpkgs or one of its forks. If people try to
| "just use Nix" in anger, then they will almost certainly
| be funneled into using nixpkgs as a matter of practice.
|
| The main problem with reproducibility in science is that
| most scientists are not actually interested in doing
| science. Of course software will not fix this problem.
| warlog wrote:
| It looks like this is much more fine grained compared to mran,
| i.e., with groundhog, you select the date vs with mran where
| you use the last (often > year old) snapshot.
|
| mran is a great idea and if Rstudio (the de facto gate-keepers
| of the faith -- with Hadley the high priest) pushed to use
| mran, then the R community would follow suit (like they do for
| everything else).
|
| This would do a lot to bring MS into the fold, which would
| actually be great for R.
| kgwgk wrote:
| They have their own package management library
|
| https://rstudio.github.io/packrat/
|
| and sell their own package management product
|
| https://rstudio.com/products/package-manager/
| jsmith99 wrote:
| MRAN takes daily snapshots, and is the repository powering
| this new package.
| Hansi wrote:
| Hadley works for RStudio, RStudio now have their own MRAN
| type mirror: https://packagemanager.rstudio.com/client/#/
| _Wintermute wrote:
| MRAN has saved my bacon more than once when I need to replicate
| some R environment written years ago. The package management in
| R really is terrible.
| wodenokoto wrote:
| Wow, that's a lot of pessimism for a fairly elegant solution to
| the fact that almost no R code has package versioning defined.
|
| I think the major sales point here is:
|
| > A nice feature of groundhog is that it makes 'retrofitting'
| existing code quite easy. If you come across a script that no
| longer works, you can change its library() statements for
| groundhog.library() ones, using as the groundhog.day the date the
| code was probably written (say when it was posted on the
| internet), and it may work again.
|
| I don't know how good packrat is nowadays. I've never met an R
| application that uses it, but at my old work, we would take a
| dated snapshot of CRAN at the beginning of every new project. If
| we needed to update a package we could then "update CRAN" for
| that project. When productionising a project it would be frozen
| to a date in CRAN.
| nojito wrote:
| >Wow, that's a lot of pessimism for a fairly elegant solution
| to the fact that almost no R code has package versioning
| defined.
|
| This isn't true.
|
| https://mran.microsoft.com/documents/rro/reproducibility
|
| https://rstudio.github.io/packrat/
| dracodoc wrote:
| Title aside, the proposed solution just
|
| - uses Microsoft MRAN, which did the heavy lifting of hosting
| archives
|
| - uses dates instead of versions
|
| - installs packages automatically the first time (which
| pacman::p_load has been doing for ages) and is easier to use
| at the script level.
|
| It's no coincidence that most package manager solutions use
| versions instead of dates to control the environment:
|
| - A paper published in 2017 may have used a date of 2017.10.01,
| but there is a high possibility that some of the dependency
| packages were of an earlier date, unless the author updated
| packages every day/week, which is not a good habit anyway
| because updating too frequently will break things more
| frequently.
|
| - Then how can you reproduce the environment using a date? The
| underlying assumption that all packages will be the latest as of
| that date simply doesn't hold.
|
| That's why packrat/renv etc will use a lock file to record all
| package versions, and why you will need a project to manage
| libraries, because you will need to maintain different library
| environments and cannot install to same location.
|
| Yet the author takes installing all packages to a single
| location as a feature, since you don't need to install the same
| package again, and tries to avoid projects and prefers scripts as
| much as possible when doing reproducible research?
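|
| The date-versus-lockfile argument above can be sketched with a toy
| resolver. Everything here is hypothetical: the package names,
| versions, release dates, and the resolve_by_date helper are made
| up for illustration and do not reflect how groundhog, packrat, or
| renv are actually implemented.

```python
from datetime import date

# Hypothetical release history: package -> [(version, release date), ...]
RELEASES = {
    "pkgA": [("1.0.0", date(2016, 3, 1)), ("2.0.0", date(2017, 9, 15))],
    "pkgB": [("0.4.0", date(2017, 1, 10)), ("0.5.0", date(2017, 8, 1))],
}


def resolve_by_date(releases, as_of):
    """Pick, for each package, the newest version released on or before as_of."""
    resolved = {}
    for name, history in releases.items():
        candidates = [(released, version) for version, released in history
                      if released <= as_of]
        if candidates:
            resolved[name] = max(candidates)[1]
    return resolved


# What the (hypothetical) author actually had installed when the paper ran:
# pkgA was never updated after its 1.0.0 release.
lockfile = {"pkgA": "1.0.0", "pkgB": "0.5.0"}

by_date = resolve_by_date(RELEASES, date(2017, 10, 1))
mismatches = {name: (lockfile[name], by_date.get(name))
              for name in lockfile if by_date.get(name) != lockfile[name]}
print(mismatches)
```

| In this toy case the date picks pkgA 2.0.0 while the author ran
| 1.0.0, which is exactly the gap a recorded lockfile closes.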
| paultopia wrote:
| This language about "threat" seems a bit overblown. Especially
| when we ask: compared to what? Some commercial package where
| different versions might have different and poorly documented
| data storage formats? (Have you ever tried to read an old SPSS or
| SAS or STATA data file in any reasonable environment? It is a
| nightmare.) Excel??
| threeseed wrote:
| Nothing about this is specific to R.
|
| If you want to guarantee reproducible results you have to use a
| container/image with libraries added at build time. Anytime you
| are relying on floating versions or downloaded libraries you will
| have issues.
| jdc wrote:
| Yeah or even just vendorize your dependencies.
| tempay wrote:
| Even this isn't enough to be reproducible for complex numeric
| code as switching CPU can make a big difference with small
| differences being amplified. Hopefully none of those cases
| matter but it's hard to definitively prove that.
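|
| A rough sketch of how a tiny numeric difference can be amplified.
| This does not reproduce an actual CPU difference; it simulates one
| by nudging the starting value of the logistic map (a standard
| chaotic toy system, chosen here purely as an assumption for
| illustration) by about 1e-15.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


baseline = logistic_orbit(0.2, 100)
perturbed = logistic_orbit(0.2 + 1e-15, 100)  # stand-in for a rounding difference

# Distance between the two trajectories at every step.
gaps = [abs(a - b) for a, b in zip(baseline, perturbed)]
initial_gap = gaps[0]
largest_gap = max(gaps)

print(initial_gap, largest_gap)
```

| Well-conditioned analyses should not behave like this, but the
| sketch shows why the possibility is hard to rule out by inspection.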
| andi999 wrote:
| If the research results depend on small differences being
| amplified you have a much much bigger problem. (but of course
| this could happen unnoticed/sloppy work)
| bonoboTP wrote:
| That's true but not an excuse! It's still extremely
| important when assessing an anomaly. If you can say "okay
| this is a known-good config that gets me the numbers from
| the paper", it's an enormous help in uncovering what leads
| to issues.
|
| If you can't even get those numbers, then you can suspect
| any number of things. Maybe you're not using the right
| data, maybe there was a typo, maybe someone fraudulently
| manually tweaked the numbers, maybe you forgot to do a step
| in the processing chain etc etc. There's no way to know
| what's going on if you can't even be sure how the original
| numbers were created.
| vhhn wrote:
| There are two camps in the R world - tidyverse and base-R
| (tinyverse).
|
| It's not a coincidence that the author gives an example from the
| tidyverse ecosystem. Authors and users of tidyverse value other
| things like consistency and new features over API stability and
| backward compatibility. The base-R ecosystem is actually very
| stable and so the original package manager is very simple.
|
| With R spreading out from the academic environment and with many
| new authors breaking their packages' APIs we observe new attempts
| to solve the issues with dependencies (such as renv or
| https://rsuite.io)
___________________________________________________________________
(page generated 2021-01-07 23:03 UTC)