[HN Gopher] Jacquard lab notebook: Version control and provenanc...
       ___________________________________________________________________
        
       Jacquard lab notebook: Version control and provenance for empirical
       research
        
       Author : surprisetalk
       Score  : 76 points
       Date   : 2024-09-04 17:54 UTC (3 days ago)
        
 (HTM) web link (www.inkandswitch.com)
 (TXT) w3m dump (www.inkandswitch.com)
        
       | data_maan wrote:
       | Behind all the technical lingo, what problem does this solve that
       | cannot be solved by sticking to a git repo that tracks your
       | research and using some simple actions on top of GitHub for
       | visualization etc.?
        
         | scherlock wrote:
         | The fact that software engineers are the only folks with the
         | skills to do what you just said.
         | 
          | When I was working on my PhD thesis 20 years ago, I had a
          | giant makefile that generated my graphs and tables and then
          | built the thesis from LaTeX.
          | 
          | All of it was in version control, which made it so much
          | easier, but there's no way anyone other than someone who
          | uses those tools would be able to figure it out.
        
           | exe34 wrote:
           | > The fact that software engineers are the only folks with
           | the skills to do what you just said.
           | 
           | I've always been impressed by the amount of effort that
           | people are willing to put in to avoid using version control.
            | I used Mercurial about 18 years ago, and then moved to git when
           | that took off, and I never write much text for work or
           | leisure without putting it in git. I don't even use branches
           | at all outside of work - it's just so that the previous
           | versions are always available. This applies to my reading
           | notes, travel plans, budgeting, etc.
        
             | ska wrote:
             | Oh, they all use version control.
             | 
             | It just looks like "conf_paper1.tex" "conf_paper3.tex"
             | "conf_paper_friday.tex" "conf_paper_20240907.tex"
             | "conf_paper_last_version.tex" "conf_paper_final.tex"
             | 
             | ...
             | 
             | "conf_paper_final2.tex"
             | 
              | Oh, and the figures reference files on a local dir structure.
              | 
              | And the actual, eventually published version only exists
              | in email back-and-forth with the publisher over style
              | files etc.
        
               | RhysabOweyn wrote:
               | I once worked with a professor and some graduate students
                | who insisted on using Box as a code repository since it
                | kept a log of changes to files under a folder. I tried
                | to convince them to switch to git by making a set of
                | tutorial videos explaining the basics, but even that
                | was not enough.
        
         | throwpoaster wrote:
         | Remember the famous HN comment:
         | 
         | "This 'Dropbox' project of yours looks neat, but why wouldn't
         | people just use ftp and rsync?"
        
       | bluenose69 wrote:
       | I'm sure this will be useful for some folks, but I'll stick with
       | 'git' to track changes and 'make' to organize my work.
       | 
       | It seems as though this project is a sort of system for creating
       | Makefiles, and that would be great for folks who are unfamiliar
       | with them.
       | 
       | I'm not sure of the audience, though. At least in my research
       | area, there are mainly two groups.
       | 
        | 1. People who use LaTeX (and scripting languages) are
        | comfortable with writing Makefiles.
        | 
        | 2. People who work in Excel (etc.) and incorporate results
        | into MS Word.
       | 
       | Neither group seems a likely candidate for this (admittedly
       | intriguing) software.
        
         | conformist wrote:
          | There are many people in group 1 in academia, e.g. in physics
          | and maths, who are comfortable with LaTeX and scripting
          | languages but mostly use email to share files. Anything that
          | helps them organise their collaborative work better without
          | having to deal with git helps (e.g. see the success of
          | Overleaf).
        
           | ska wrote:
           | Part of the problem is that git is a fairly poor fit for
           | these workflows.
           | 
           | I spent time getting some mathematicians working together via
            | version control rather than email; it was a bit of a mixed
            | bag even using something simpler (e.g. svn). Eventually we
            | moved back to email, except the rule was: email me your update
           | as a reply to the version you edited, and I scripted
           | something to put it all into a repo on my end to manage
           | merges etc. Worked ok. Better than the version where we
           | locked access for edit but people forgot to unlock and went
           | off to a conference...
           | 
           | If I was doing the same now, I'd probably set up on github,
           | give each person a branch off main, and give them scripts for
           | "send my changes" and "update other changes" - then manage
           | all the merges behind the scenes for anyone who didn't want
           | to bother.
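            | 
            | Those two helper scripts don't have to be anything fancy. A
            | rough sketch of what they might boil down to (the branch
            | name and commit message are placeholders, and this assumes
            | there is actually something to commit):
            | 
            |     import subprocess
            | 
            |     def run(*cmd):
            |         subprocess.run(cmd, check=True)
            | 
            |     def update_other_changes():
            |         # pull in whatever has been merged to main
            |         run("git", "fetch", "origin")
            |         run("git", "merge", "origin/main")
            | 
            |     def send_my_changes(branch):
            |         # stage everything, commit, push my branch
            |         run("git", "add", "-A")
            |         run("git", "commit", "-m", "latest edits")
            |         run("git", "push", "origin", branch)
            | 
            |     update_other_changes()
            |     send_my_changes("alice-edits")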
           | 
           | I think expecting everyone in a working group to acquire the
            | skills to deal with merge issues properly etc. is too much
            | to ask if they don't already do significant software work.
            | If they do, teach them.
        
         | __MatrixMan__ wrote:
         | It's easy to collect and verify metadata involving the hashes
         | of intermediate artifacts such that readers can observe it and
         | trust that the charts correspond with the data because they
         | trust whoever published the metadata. This could be automatic,
         | just part of the reader.
         | 
         | The trouble with make is that unless you're very disciplined or
         | very lucky, if you build the images and documents on your
         | machine and I do the same on mine, we're going to get artifacts
         | that look similar but hash differently, if for no other reason
         | than that there's a timestamp showing up somewhere and throwing
         | it off (though often for more concerning reasons involving the
         | versions of whatever your Makefile is calling).
         | 
         | That prevents any kind of automated consensus finding about the
         | functional dependency between the artifacts. Now reviewers must
         | rebuild the artifacts themselves and compare the outputs
         | visually before they can be assured that the data and
         | visualizations are indeed connected.
         | 
         | So if we want to get to a place where papers can be more
          | readily trusted, a place where the parts of research that can
          | be replicated automatically are replicated automatically, then
         | we're going to need something that provides a bit more
         | structure than make (something like nix, with a front end like
         | Jacquard lab notebook).
         | 
         | The idea that we could take some verifiable computational step
         | and represent it in a UI such that the status of that
         | verification is accessible, rather than treating the makefile
         | as an authoritative black box... I think it's rather exciting.
         | Even if I don't really care about the UI so much, having the
         | computational structure be accessible is important.
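          | 
          | A minimal sketch of the kind of metadata I have in mind (the
          | script and file names here are made up, and a real system
          | would also have to pin the environment to get around the
          | determinism problem above):
          | 
          |     import hashlib, json, pathlib, subprocess
          | 
          |     def sha256(path):
          |         data = pathlib.Path(path).read_bytes()
          |         return hashlib.sha256(data).hexdigest()
          | 
          |     def build_and_record(script, inputs, output,
          |                          manifest="provenance.json"):
          |         # run the plotting script, then record hashes of
          |         # what went in and what came out
          |         subprocess.run(["python", script], check=True)
          |         record = {
          |             "script": {script: sha256(script)},
          |             "inputs": {p: sha256(p) for p in inputs},
          |             "outputs": {output: sha256(output)},
          |         }
          |         pathlib.Path(manifest).write_text(
          |             json.dumps(record, indent=2))
          | 
          |     build_and_record("plot_mutants.py",
          |                      ["data/mutant_a.csv"],
          |                      "fig/mutant_a.png")
          | 
          | A reader's tooling could recompute those hashes and confirm
          | that the published figure really is the one produced from
          | the published data.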
        
           | XorNot wrote:
           | Here's the thing though: you're trying to solve a problem
           | here which doesn't exist.
           | 
           | In physical science, no one commits academic fraud by
            | manipulating the graphs they publish so that they differ from
            | the data they collected... they just enter bad data to start
           | with. Or apply extremely invalid statistical methods or the
           | like.
           | 
           | You can't fix this by trying to attest the data pipeline.
        
             | __MatrixMan__ wrote:
             | I'm not really trying to address fraud. Most of the time
             | when I try to recreate a computational result from a paper,
             | things go poorly. I want to fix that.
             | 
             | Recently I found one where the authors must've mislabeled
             | something because the data for mutant A actually
             | corresponded with the plot for mutant B.
             | 
             | Other times it'll take days of tinkering just to figure out
             | which versions of the dependencies are necessary to make it
             | work at all.
             | 
             | None of that sort of sleuthing should've required a human
             | in the loop at all. I should be able to hold one thing
             | constant (be it the data or the pipeline), change the
             | other, and rebuild the paper to determine whether the
             | conclusions are invariant to the change I made.
             | 
             | Human patience for applying skepticism to complex things is
             | scarce. I want to automate as much of replication as
             | possible so that what skepticism is available is applied
             | more effectively. It would just be a nicer world to live
             | in.
        
         | jpeloquin wrote:
         | Even in group 1, when I go back to a project that I haven't
         | worked on in years, it would be helpful to be able to query the
         | build system to list the dependencies of a particular artifact,
         | including data dependencies. I.e., reverse dependency lookup.
         | Also list which files could change as a consequence of changing
         | another artifact. And return results based on what the build
         | actually did, not just the rules as specified. I think make
         | can't do this because it has no ability to hash & cache
         | results. Newer build systems like Bazel, Please, and Pants
         | should be able to do this but I haven't used them much yet.
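          | 
          | As a sketch of the kind of query I mean, assuming the build
          | system had recorded which inputs each artifact was actually
          | built from (file names are invented):
          | 
          |     # artifact -> inputs the build actually read
          |     build_graph = {
          |         "fig/strain.pdf": ["data/raw.csv", "plot.py"],
          |         "thesis.pdf": ["fig/strain.pdf", "thesis.tex"],
          |     }
          | 
          |     def dependencies(artifact, graph):
          |         # everything the artifact was built from,
          |         # transitively
          |         deps = set()
          |         for inp in graph.get(artifact, []):
          |             deps.add(inp)
          |             deps |= dependencies(inp, graph)
          |         return deps
          | 
          |     def affected_by(path, graph):
          |         # everything that could change if `path` changes
          |         return {a for a in graph
          |                 if path in dependencies(a, graph)}
          | 
          |     print(dependencies("thesis.pdf", build_graph))
          |     print(affected_by("data/raw.csv", build_graph))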
        
       | kleton wrote:
       | Given the replication crisis in the sciences, objectively this is
       | probably a good thing, but the incumbents in the field would
       | strongly push back against it becoming a norm.
       | 
       | https://en.wikipedia.org/wiki/Replication_crisis
        
         | ska wrote:
         | This addresses a nearly orthogonal issue.
        
       | sega_sai wrote:
       | That is actually a very interesting idea. While I am not
        | necessarily interested in some sort of build system for a paper,
        | being able to figure out which plots need to be regenerated
        | when some data file or some equation is changed is useful. For
        | this, being able to encode the version of the script and all the
        | data files used in creating a plot would be valuable.
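        | 
        | One way to encode that is a small manifest per figure with the
        | hashes of the script and data files it was generated from;
        | regenerate whichever figures no longer match. (File names and
        | the manifest layout below are made up.)
        | 
        |     import hashlib, json, pathlib
        | 
        |     def sha256(path):
        |         data = pathlib.Path(path).read_bytes()
        |         return hashlib.sha256(data).hexdigest()
        | 
        |     # written when each plot was last generated, e.g.
        |     # {"fig1.png": {"plot_fig1.py": "<hash>",
        |     #               "data.csv": "<hash>"}}
        |     manifest = json.loads(
        |         pathlib.Path("plots.json").read_text())
        | 
        |     stale = [plot for plot, inputs in manifest.items()
        |              if any(sha256(p) != h
        |                     for p, h in inputs.items())]
        |     print("plots to regenerate:", stale)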
        
       | idiotlogical wrote:
        | Am I not smart, or is it just that the "Subscribe" page won't
        | allow me to get past the "Name" field? I tried a few combos and
        | even an email address and it doesn't validate:
       | 
       | https://buttondown.com/inkandswitch
        
       | LowkeyPlatypus wrote:
       | The idea sounds great! However, I see some potential issues.
       | First, IIUC using this tool means that researchers will have to
       | edit their code within it, which may be fine for small edits, but
       | for larger changes, most people would prefer to rely on their
       | favourite IDE. Moreover, if the scripts take a long time to run,
       | this could be problematic and slow down workflows. So, I think
       | this "notebook" could be excellent for projects with a small
       | amount of code, but it may be less suitable for larger projects.
       | 
       | Anyway, it's a really cool project, and I'm looking forward to
       | seeing how it grows.
        
       | cashsterling wrote:
       | I follow the work of the ink & switch folks... they have a lot of
       | interesting ideas around academic research management and
       | academic publishing.
       | 
        | I have a day job, but I spend a lot of time thinking about ways
        | to improve academic/technical publishing in the modern era.
        | There are a lot of problems with our current academic
        | publishing model: a
       | lot of pay-walled articles / limited public access to research,
       | many articles have no/limited access to the raw data or
       | analytical code, articles don't make use of modern technology to
       | enhance communication (interactive plots, animations, CAD files,
       | video, etc.).
       | 
       | Top level academic journals are trying to raise the bar on
       | research publication standards (partially to avoid the
       | embarrassment of publishing fraudulent research) but they are all
        | stuck not wanting to kill the golden goose. Academic publishing
        | is a multi-billion dollar affair, and making research open, etc.
        | would
       | damage their revenue model.
       | 
       | We need a GitHub for Science... not in the sense of Microsoft
       | owning a publishing platform but in the sense of what GitHub
       | provides for computer science; a platform for public
       | collaboration on code and computer science ideas. We need a
       | federated, open platform for managing experiments and data (i.e.
       | an electronic lab notebook) and communicating research to the
       | public (via code, animations, plots, written text in
        | Typst/LaTeX/Markdown, video, audio, presentations, etc.).
        | Ideally this platform would also have an associated forum for
        | discussion and feedback on research.
        
       | karencarits wrote:
        | Coming from R, I would recommend that researchers have a look
        | at Quarto [1] and packages such as workflowr [2], which also
        | aim to ensure a tight and reproducible pipeline from raw data
        | to the finished paper.
       | 
       | [1] https://quarto.org/docs/manuscripts/authoring/rstudio.html
       | 
       | [2] https://workflowr.io/
        
       ___________________________________________________________________
       (page generated 2024-09-07 23:00 UTC)