[HN Gopher] Jacquard lab notebook: Version control and provenanc...
___________________________________________________________________
Jacquard lab notebook: Version control and provenance for empirical
research
Author : surprisetalk
Score : 76 points
Date : 2024-09-04 17:54 UTC (3 days ago)
(HTM) web link (www.inkandswitch.com)
(TXT) w3m dump (www.inkandswitch.com)
| data_maan wrote:
| Behind all the technical lingo, what problem does this solve that
| cannot be solved by sticking to a git repo that tracks your
| research and using some simple actions on top of GitHub for
| visualization etc.?
| scherlock wrote:
| The fact that software engineers are the only folks with the
| skills to do what you just said.
|
| When I was working on my PhD thesis 20 years ago, I had a giant
| makefile that generated my graphs and tables, then generated the
| thesis from LaTeX.
|
| All of it was in version control, which made it so much easier,
| but there's no way anyone other than someone who uses those
| tools would be able to figure it out.
| exe34 wrote:
| > The fact that software engineers are the only folks with
| the skills to do what you just said.
|
| I've always been impressed by the amount of effort that
| people are willing to put in to avoid using version control.
| I used Mercurial about 18 years ago, and then moved to git when
| that took off, and I never write much text for work or
| leisure without putting it in git. I don't even use branches
| at all outside of work - it's just so that the previous
| versions are always available. This applies to my reading
| notes, travel plans, budgeting, etc.
| ska wrote:
| Oh, they all use version control.
|
| It just looks like "conf_paper1.tex" "conf_paper3.tex"
| "conf_paper_friday.tex" "conf_paper_20240907.tex"
| "conf_paper_last_version.tex" "conf_paper_final.tex"
|
| ...
|
| "conf_paper_final2.tex"
|
| Oh, and the figures reference files on local dir structure.
|
| And the actual, eventually published version, only exists
| in email back and forth with publisher for style files etc.
| RhysabOweyn wrote:
| I once worked with a professor and some graduate students
| who insisted on using Box as a code repository, since it
| kept a log of changes to files under a folder. I tried to
| convince them to switch to git by making a set of
| tutorial videos explaining the basics, but even that was
| not enough.
| throwpoaster wrote:
| Remember the famous HN comment:
|
| "This 'Dropbox' project of yours looks neat, but why wouldn't
| people just use ftp and rsync?"
| bluenose69 wrote:
| I'm sure this will be useful for some folks, but I'll stick with
| 'git' to track changes and 'make' to organize my work.
|
| It seems as though this project is a sort of system for creating
| Makefiles, and that would be great for folks who are unfamiliar
| with them.
|
| I'm not sure of the audience, though. At least in my research
| area, there are mainly two groups.
|
| 1. People who use LaTeX (and scripting languages) and are
| comfortable writing Makefiles.
|
| 2. People who work in Excel (etc.) and incorporate results
| into MS Word.
|
| Neither group seems a likely candidate for this (admittedly
| intriguing) software.
| conformist wrote:
| There are many people in group 1 in academia, e.g. in physics
| and maths, who are comfortable with LaTeX and scripting
| languages but mostly use email to share files. Anything that
| helps them organise their collaborative work better without
| having to deal with git helps (see e.g. the success of
| Overleaf).
| ska wrote:
| Part of the problem is that git is a fairly poor fit for
| these workflows.
|
| I spent time getting some mathematicians working together via
| version control rather than email; it was a bit of a mixed
| bag even using something simpler (e.g. svn). Eventually we
| moved back to email, except the rule was: email me your update
| as a reply to the version you edited, and I scripted
| something to put it all into a repo on my end to manage
| merges etc. Worked ok. Better than the version where we
| locked access for edit but people forgot to unlock and went
| off to a conference...
|
| If I was doing the same now, I'd probably set up on github,
| give each person a branch off main, and give them scripts for
| "send my changes" and "update other changes" - then manage
| all the merges behind the scenes for anyone who didn't want
| to bother.
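|
| ("Send my changes" could be a dumb wrapper around plumbing
| they never see; a hypothetical Python sketch, assuming each
| person pushes their own branch to a remote named "origin":)
|
|     #!/usr/bin/env python3
|     # send_my_changes.py: snapshot everything and publish it.
|     import subprocess
|
|     def run(*args):
|         # Run one git command, stopping on the first failure.
|         subprocess.run(["git", *args], check=True)
|
|     run("add", "--all")                  # stage every edit
|     run("commit", "-m", "update draft")  # snapshot it
|     run("push", "origin", "HEAD")        # publish my branch
|
| ("Update other changes" would similarly wrap a fetch plus a
| merge from main, with conflicts left to whoever manages the
| merges behind the scenes.)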
|
| I think expecting everyone in a working group to acquire the
| skills to deal with merge issues properly etc. is asking too
| much if they don't already do significant software work. If
| they do, teach them.
| __MatrixMan__ wrote:
| It's easy to collect and verify metadata involving the hashes
| of intermediate artifacts such that readers can observe it and
| trust that the charts correspond with the data because they
| trust whoever published the metadata. This could be automatic,
| just part of the reading software.
|
| The trouble with make is that unless you're very disciplined or
| very lucky, if you build the images and documents on your
| machine and I do the same on mine, we're going to get artifacts
| that look similar but hash differently, if for no other reason
| than that there's a timestamp showing up somewhere and throwing
| it off (though often for more concerning reasons involving the
| versions of whatever your Makefile is calling).
|
| That prevents any kind of automated consensus finding about the
| functional dependency between the artifacts. Now reviewers must
| rebuild the artifacts themselves and compare the outputs
| visually before they can be assured that the data and
| visualizations are indeed connected.
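|
| (A concrete sketch of the kind of metadata I mean, with
| hypothetical file names, not anything Jacquard actually emits:
| hash every input and artifact, publish the manifest alongside
| the paper, and let the reader's software do the comparison.)
|
|     import hashlib, json, pathlib
|
|     def sha256(path):
|         # Content hash: unaffected by filesystem mtimes, though
|         # not by timestamps embedded inside the file itself.
|         return hashlib.sha256(
|             pathlib.Path(path).read_bytes()).hexdigest()
|
|     manifest = {
|         "inputs": {"data.csv": sha256("data.csv"),
|                    "plot.py": sha256("plot.py")},
|         "outputs": {"figure1.png": sha256("figure1.png")},
|     }
|     print(json.dumps(manifest, indent=2))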
|
| So if we want to get to a place where papers can be more
| readily trusted (a place where the parts of research that can
| be replicated automatically are replicated automatically),
| we're going to need something that provides a bit more
| structure than make (something like nix, with a front end like
| Jacquard lab notebook).
|
| The idea that we could take some verifiable computational step
| and represent it in a UI such that the status of that
| verification is accessible, rather than treating the makefile
| as an authoritative black box... I think it's rather exciting.
| Even if I don't really care about the UI so much, having the
| computational structure be accessible is important.
| XorNot wrote:
| Here's the thing though: you're trying to solve a problem
| that doesn't exist.
|
| In physical science, no one commits academic fraud by
| introducing a discrepancy between the graphs they publish
| and the data they collected... they just enter bad data to
| start with. Or they apply grossly invalid statistical
| methods or the like.
|
| You can't fix this by trying to attest the data pipeline.
| __MatrixMan__ wrote:
| I'm not really trying to address fraud. Most of the time
| when I try to recreate a computational result from a paper,
| things go poorly. I want to fix that.
|
| Recently I found one where the authors must've mislabeled
| something because the data for mutant A actually
| corresponded with the plot for mutant B.
|
| Other times it'll take days of tinkering just to figure out
| which versions of the dependencies are necessary to make it
| work at all.
|
| None of that sort of sleuthing should've required a human
| in the loop at all. I should be able to hold one thing
| constant (be it the data or the pipeline), change the
| other, and rebuild the paper to determine whether the
| conclusions are invariant to the change I made.
|
| Human patience for applying skepticism to complex things is
| scarce. I want to automate as much of replication as
| possible so that what skepticism is available is applied
| more effectively. It would just be a nicer world to live
| in.
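|
| (A sketch of that invariance check, with a hypothetical
| one-command build and made-up file names:)
|
|     import hashlib, pathlib, shutil, subprocess
|
|     def build_and_hash(data_file):
|         # Rebuild the paper against a given data file and hash
|         # the part that carries the conclusions.
|         shutil.copy(data_file, "data.csv")
|         subprocess.run(["python", "pipeline.py"], check=True)
|         return hashlib.sha256(
|             pathlib.Path("conclusions.txt").read_bytes()).hexdigest()
|
|     # Hold the pipeline constant, swap the data, compare.
|     a = build_and_hash("data_as_published.csv")
|     b = build_and_hash("data_with_labels_fixed.csv")
|     print("invariant" if a == b else "conclusions changed")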
| jpeloquin wrote:
| Even in group 1, when I go back to a project that I haven't
| worked on in years, it would be helpful to be able to query the
| build system to list the dependencies of a particular artifact,
| including data dependencies. I.e., reverse dependency lookup.
| Also list which files could change as a consequence of changing
| another artifact. And return results based on what the build
| actually did, not just the rules as specified. I think make
| can't do this because it has no ability to hash & cache
| results. Newer build systems like Bazel, Please, and Pants
| should be able to do this but I haven't used them much yet.
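|
| Once the build system records what it actually did as data,
| the query is trivial. A minimal sketch (hypothetical edge-list
| format, not any particular tool's output):
|
|     import json
|
|     # Graph recorded at build time: artifact -> what it was
|     # derived from.
|     graph = json.loads("""{
|         "figure1.png": ["plot.py", "results.csv"],
|         "results.csv": ["analysis.py", "raw_data.csv"],
|         "thesis.pdf":  ["thesis.tex", "figure1.png"]
|     }""")
|
|     def dependencies(artifact):
|         # Transitive closure: everything an artifact needs.
|         for dep in graph.get(artifact, []):
|             yield dep
|             yield from dependencies(dep)
|
|     print(sorted(set(dependencies("thesis.pdf"))))
|     # ['analysis.py', 'figure1.png', 'plot.py',
|     #  'raw_data.csv', 'results.csv', 'thesis.tex']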
| kleton wrote:
| Given the replication crisis in the sciences, objectively this is
| probably a good thing, but the incumbents in the field would
| strongly push back against it becoming a norm.
|
| https://en.wikipedia.org/wiki/Replication_crisis
| ska wrote:
| This addresses a nearly orthogonal issue.
| sega_sai wrote:
| That is actually a very interesting idea. While I am not
| necessarily interested in some sort of build system for a
| paper, being able to figure out which plots need to be
| regenerated when some data file or some equation changes is
| useful. For this, being able to encode the version of the
| script and all the data files used in creating the plot would
| be valuable.
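|
| (A minimal sketch of that staleness check, assuming a made-up
| lock file that records the hashes of each plot's script and
| data at generation time:)
|
|     import hashlib, json, pathlib
|
|     def sha256(path):
|         return hashlib.sha256(
|             pathlib.Path(path).read_bytes()).hexdigest()
|
|     # e.g. {"fig1.png": {"plot.py": "ab12...",
|     #                    "data.csv": "cd34..."}}
|     recorded = json.loads(pathlib.Path("plots.lock").read_text())
|
|     stale = [plot for plot, inputs in recorded.items()
|              if any(sha256(f) != h for f, h in inputs.items())]
|     print("regenerate:", stale)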
| idiotlogical wrote:
| Am I not smart, or what about the "Subscribe" page won't allow me
| to get past the "Name" field? I tried a few combos and even an
| email address, and it doesn't validate:
|
| https://buttondown.com/inkandswitch
| LowkeyPlatypus wrote:
| The idea sounds great! However, I see some potential issues.
| First, IIUC using this tool means that researchers will have to
| edit their code within it, which may be fine for small edits, but
| for larger changes, most people would prefer to rely on their
| favourite IDE. Moreover, if the scripts take a long time to run,
| this could be problematic and slow down workflows. So, I think
| this "notebook" could be excellent for projects with a small
| amount of code, but it may be less suitable for larger projects.
|
| Anyway, it's a really cool project, and I'm looking forward to
| seeing how it grows.
| cashsterling wrote:
| I follow the work of the Ink & Switch folks... they have a lot of
| interesting ideas around academic research management and
| academic publishing.
|
| I have a day job, but I spend a lot of time thinking about ways
| to improve academic/technical publishing in the modern era.
| There are a lot of problems with our current academic publishing
| model: a lot of pay-walled articles and limited public access to
| research, many articles with no or limited access to the raw
| data or analytical code, and articles that don't make use of
| modern technology to enhance communication (interactive plots,
| animations, CAD files, video, etc.).
|
| Top-level academic journals are trying to raise the bar on
| research publication standards (partially to avoid the
| embarrassment of publishing fraudulent research) but they are
| all stuck not wanting to kill the golden goose. Academic
| publishing is a multi-billion dollar affair, and making
| research open, etc. would damage their revenue model.
|
| We need a GitHub for Science... not in the sense of Microsoft
| owning a publishing platform, but in the sense of what GitHub
| provides for computer science: a platform for public
| collaboration on code and computer science ideas. We need a
| federated, open platform for managing experiments and data
| (i.e. an electronic lab notebook) and communicating research
| to the public (via code, animations, plots, written text in
| Typst/LaTeX/Markdown, video, audio, presentations, etc.).
| Ideally this platform would also have an associated forum for
| discussion of and feedback on research.
| karencarits wrote:
| Coming from R, I would recommend that researchers have a look
| at Quarto [1] and packages such as workflowr [2], which also
| aim to ensure a tight and reproducible pipeline from raw data
| to the finished paper.
|
| [1] https://quarto.org/docs/manuscripts/authoring/rstudio.html
|
| [2] https://workflowr.io/
___________________________________________________________________
(page generated 2024-09-07 23:00 UTC)