[HN Gopher] How to make your scientific data accessible, discove...
___________________________________________________________________
How to make your scientific data accessible, discoverable and
useful
Author : JohnHammersley
Score : 44 points
Date : 2023-06-27 15:39 UTC (1 days ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| yawnxyz wrote:
| Heh, it takes a lot of work to convert illegible scribbles in lab
| notebooks to well-formatted numbers and descriptors that make
| sense to anyone else but the person who did the experiment.
|
| This includes other lab members on the same project...
| JohnHammersley wrote:
| It's also interesting to look at this in the context of the
| "State of Open Data" report [1].
|
| [1]
| https://digitalscience.figshare.com/articles/report/The_Stat...
| JohnHammersley wrote:
| and in case anyone's interested in the meandering thoughts of
| two start-up founders in the scholarly comms space, Mark and I
| had a nice chat following a recent figshare milestone :) [1]
|
| [1] https://www.digital-science.com/tldr/article/seven-
| million-o...
| carlossouza wrote:
| The article doesn't fully answer its title, especially the
| "discoverable."
| __MatrixMan__ wrote:
| I've been thinking about that problem.
|
| I think the publishers (or maybe the universities? or anybody
| at the center of a community of experts really) should host an
| API which maps sets of CTPH hashes to URLs (or ideally, CIDs
| for use in something like IPFS). The goal would be that anybody
| (author or otherwise) could attach metadata after publication.
|
| Maybe it's criticism, maybe it's instructions on how to get the
| included code to run, maybe it's links to related research that
| occurred after the initial publication...
|
| Suppose you have metadata to attach: you generate CTPHs for
| the article, pick a subset of them which corresponds with the
| location you want to anchor your metadata to, and upload the
| pair to the context aggregator (these would likely be topic-
| centered, so if it's a biology paper you'd find a biology
| aggregator).
|
| When people view the paper, they can generate the same CTPHs
| and query the appropriate aggregator, and they'll get the
| annotations back which link locations in the article's text to
| metadata that, for whatever reason, was not included in the
| original publication.
|
| I want to use CTPHs instead of DOIs or somesuch because they
| don't require a third party to index the items for you, and
| they still work even if you have only part of the article (like
| maybe the rest is hidden by pagination or a paywall). You could
| do a speech-to-text transcription, annotate it in this way,
| and somebody else who generated the same transcript could then
| find your annotations without ever creating an ID for the
| speech you're annotating.
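
The scheme described in the comment above can be sketched roughly as follows. This is a simplified illustration, not ssdeep or any real CTPH library: a rolling sum over the last few bytes triggers piece boundaries, and each piece gets its own short SHA-256 digest. The names `piecewise_hashes` and `Aggregator`, the window size, the trigger mask, and the example URL are all assumptions made for the sketch; the aggregator is an in-memory stand-in for the hosted API the comment proposes.

```python
import hashlib
from collections import defaultdict


def piecewise_hashes(text, window=16, mask=0x3F):
    """Split text at content-triggered boundaries and hash each piece.

    A boundary fires when the sum of the last `window` bytes has all
    `mask` bits set, so boundaries depend only on nearby content.
    (Simplified stand-in for a real CTPH rolling hash.)
    """
    # Normalise whitespace and case so trivial formatting changes
    # do not alter the hashes.
    data = " ".join(text.split()).lower().encode("utf-8")
    hashes, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if i >= window - 1 and (rolling & mask) == mask:
            hashes.append(hashlib.sha256(data[start:i + 1]).hexdigest()[:16])
            start = i + 1
    if start < len(data):  # trailing piece with no closing boundary
        hashes.append(hashlib.sha256(data[start:]).hexdigest()[:16])
    return hashes


class Aggregator:
    """Hypothetical in-memory stand-in for a topic-centered aggregator."""

    def __init__(self):
        self._notes = defaultdict(list)

    def annotate(self, anchor_hashes, note_url):
        # Attach the metadata link to every piece hash in the anchor.
        for h in anchor_hashes:
            self._notes[h].append(note_url)

    def lookup(self, hashes):
        # Return each matching note once, in first-seen order.
        found = []
        for h in hashes:
            for note in self._notes.get(h, []):
                if note not in found:
                    found.append(note)
        return found


if __name__ == "__main__":
    article = "Some article text that a reader and an annotator both have. " * 4
    agg = Aggregator()
    agg.annotate(piecewise_hashes(article)[:1], "https://example.org/note")
    print(agg.lookup(piecewise_hashes(article)))
```

Because boundaries depend only on the bytes inside the window, a reader who has only the first part of an article computes the same pieces as the annotator, except for the final, cut-off piece. For a fragment taken from the middle, pieces should line up again after the first boundary inside the fragment, which is the property that lets this work without a shared ID.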
| stonogo wrote:
| That'd be the 'metadata' section. Encouraging scientists to
| include metadata, as opposed to unlabeled binary dumps, is an
| ongoing effort.
| carlossouza wrote:
| Metadata is necessary but not sufficient.
|
| Imho there aren't enough tools to discover scientific data.
| robwwilliams wrote:
| Semantic web was supposed to help long ago, and may finally
| be doing so. In www.genenetwork.org we are now using RDF,
| SPARQL, GraphQL, and Xapian for speedy and flexible
| search that can represent much of our complex metadata.
| Surprising how long this has taken to catch on.
| chaxor wrote:
| IPFS or torrent are the best options for distributing data
| robwwilliams wrote:
| Not in our 5-year experience trying to use it with
| GeneNetwork.org to share large and small datasets. IPFS is
| marketed as simple but is complex, or even over-engineered from
| some perspectives. Hate to say it, but Dropbox is much easier
| and more stable.
|
| Hoping IPFS makes it someday because the idea is great.
| Blahah wrote:
| It seems you have confused 'distributed' with... Something
| else. Regardless of how easy it is for you, or how complex
| you found it, the data is distributed via ipfs.
| anamexis wrote:
| What data?
| Blahah wrote:
| Any data shared in the network.
| anamexis wrote:
| I don't understand what you're responding to. GP said
| they tried using IPFS with their project, but it ended up
| being too complicated and they opted for Dropbox instead.
| hsjqllzlfkf wrote:
| Every time I remember that torrents exist, that blows my mind.
| JBorrow wrote:
| That is incorrect for scientific data. The limitations are:
|
| a) Massive data volumes (~100 GB - 1 PB/project)
| ai) This means that data is typically stored on limited-access
| machines like HPC clusters
| aii) This also means that shipping this data around is
| financially expensive, and cannot be supported purely by small
| client machines
|
| b) A low number of seeders; scientific data is not exactly
| popular, and there may be network restrictions on uploads
| through the typically used networks;
|
| c) The requirement for a data legacy; torrents are fantastic
| for ephemeral data (e.g. operating system builds), but are
| terrible for data that must be archived and kept for
| potentially decades to centuries.
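
To put rough numbers on the volume point above, here is a back-of-the-envelope sketch. It takes the quoted range to mean roughly 100 gigabytes to 1 petabyte, and assumes a perfectly sustained link with no protocol overhead; the link speeds are illustrative assumptions, not figures from the thread.

```python
SECONDS_PER_DAY = 86_400
GB = 10**9   # decimal gigabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes


def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to move size_bytes over a fully saturated link."""
    return size_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY


# Assumed link speeds, chosen only for illustration.
for size_label, size in (("100 GB", 100 * GB), ("1 PB", PB)):
    for link_label, bps in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)):
        print(f"{size_label} over {link_label}: {transfer_days(size, bps):.2f} days")
```

Under these assumptions a petabyte takes about three months over a saturated 1 Gbit/s link, which is one way to see why a swarm of small client machines struggles to seed the largest datasets.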
| staunton wrote:
| Scientific datasets like that can be very easily hosted at
| one of the repositories such as zotero. The only reasons
| people don't do that are a vague sense of insecurity about
| having someone declare their analysis botched, vague legal
| worries, vague unwillingness to do the very small amount of
| work required to publish data, or the hope to milk a dataset
| for more papers before anyone else gets a chance.
| Blahah wrote:
| I guess you meant zenodo, not zotero.
| 0cf8612b2e1e wrote:
| Most scientific datasets are not that large. For every CERN
| type study there are 1000x biology papers with n=3 where the
| collected results sit in a single tab of an Excel document.
| robwwilliams wrote:
| Those do not belong in IPFS. They won't be replicated and
| may die.
| jrumbut wrote:
| Very true, but if you're planning on expanding the work of
| a small, pilot study like that and you don't have the
| people who were involved in the original you probably need
| to recreate the study (to shake out the kinks in the
| protocol, confirm results for yourself, etc).
|
| It would be challenging to find a solution robust enough
| for CERN type data but also simple enough for an n=3
| undergraduate research project (that may have yielded some
| interesting results).
|
| I don't know what the solution is there. My intuition is
| that university libraries could be involved, and that a
| data librarian could help you get your small study into
| shape or be embedded at a percentage effort on a large
| study.
| stainablesteel wrote:
| did they say to publish it in an overpriced black box so only
| their subscribers can view it?
| epgui wrote:
| That's a problem, but it's a different problem. Stay relevant.
___________________________________________________________________
(page generated 2023-06-28 23:00 UTC)