[HN Gopher] How to make your scientific data accessible, discove...
       ___________________________________________________________________
        
       How to make your scientific data accessible, discoverable and
       useful
        
       Author : JohnHammersley
       Score  : 44 points
       Date   : 2023-06-27 15:39 UTC (1 day ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | yawnxyz wrote:
        | Heh, it takes a lot of work to convert illegible scribbles in lab
        | notebooks into well-formatted numbers and descriptors that make
        | sense to anyone but the person who did the experiment.
       | 
       | This includes other lab members on the same project...
        
       | JohnHammersley wrote:
       | It's also interesting to look at this in the context of the
       | "State of Open Data" report [1].
       | 
       | [1]
       | https://digitalscience.figshare.com/articles/report/The_Stat...
        
         | JohnHammersley wrote:
         | and in case anyone's interested in the meandering thoughts of
         | two start-up founders in the scholarly comms space, Mark and I
         | had a nice chat following a recent figshare milestone :) [1]
         | 
         | [1] https://www.digital-science.com/tldr/article/seven-
         | million-o...
        
       | carlossouza wrote:
       | The article doesn't fully answer its title, especially the
       | "discoverable."
        
         | __MatrixMan__ wrote:
         | I've been thinking about that problem.
         | 
         | I think the publishers (or maybe the universities? or anybody
         | at the center of a community of experts, really) should host an
         | API which maps sets of CTPH (context-triggered piecewise
         | hashing, i.e. fuzzy) hashes to URLs (or ideally, CIDs for use
         | in something like IPFS). The goal would be that anybody (author
         | or otherwise) could attach metadata after publication.
         | 
         | Maybe it's criticism, maybe it's instructions on how to get the
         | included code to run, maybe it's links to related research that
         | occurred after the initial publication...
         | 
         | Suppose you have metadata to attach: you generate CTPHs for
         | the article, pick a subset of them which corresponds to the
         | location you want to anchor your metadata to, and upload the
         | pair to the context aggregator (these would likely be topic-
         | centered, so if it's a biology paper you'd find a biology
         | aggregator).
         | 
         | When people view the paper, they can generate the same CTPHs,
         | query the appropriate aggregator, and get back annotations
         | which link locations in the article's text to metadata that,
         | for whatever reason, was not included in the original
         | publication.
         | 
         | I want to use CTPHs instead of DOIs or somesuch because they
         | don't require a third party to index the items for you, and
         | they still work even if you have only part of the article (like
         | maybe the rest is hidden by pagination or a paywall). You could
         | do a speech-to-text transcription, annotate it in this way, and
         | somebody else who generated the same transcript could then find
         | your annotations without ever creating an ID for the speech
         | you're annotating.
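         | 
         | A minimal sketch of how that anchoring could look, assuming
         | the Python ssdeep bindings for the CTPH part and a purely
         | hypothetical aggregator endpoint (the URL, route and payload
         | shape below are made up for illustration):
         | 
         |   # Sketch only: the aggregator URL/routes are hypothetical;
         |   # ssdeep provides the fuzzy (CTPH) hashing.
         |   import ssdeep
         |   import requests
         | 
         |   AGGREGATOR = "https://bio-aggregator.example/annotations"
         | 
         |   def chunk_hashes(text, size=1024):
         |       """CTPH each overlapping chunk so partial copies of
         |       the article still produce matching hashes."""
         |       step = size // 2
         |       return [ssdeep.hash(text[i:i + size])
         |               for i in range(0, len(text), step)]
         | 
         |   def attach(anchor_chunk, note):
         |       """Publish a note keyed to the CTPH of the passage
         |       it refers to."""
         |       requests.post(AGGREGATOR, json={
         |           "anchor": ssdeep.hash(anchor_chunk),
         |           "note": note,
         |       })
         | 
         |   def lookup(text):
         |       """Ask the aggregator for annotations whose anchors
         |       fuzzily match our chunk hashes (the ssdeep.compare
         |       matching would happen server-side)."""
         |       payload = {"anchors": chunk_hashes(text)}
         |       return requests.post(AGGREGATOR + "/query",
         |                            json=payload).json()
         | 
         | The point being that the anchor is derived from the content
         | itself, so nobody has to mint an ID before annotations can be
         | attached or found.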
        
         | stonogo wrote:
         | That'd be the 'metadata' section. Encouraging scientists to
         | include metadata, as opposed to unlabeled binary dumps, is an
         | ongoing effort.
        
           | carlossouza wrote:
           | Metadata is necessary but not sufficient.
           | 
           | Imho there aren't enough tools to discover scientific data.
        
             | robwwilliams wrote:
             | The semantic web was supposed to help long ago, and may
             | finally be doing so. On www.genenetwork.org we are now
             | using RDF, SPARQL, GraphQL and Xapian for speedy, flexible
             | search that can represent much of our complex metadata.
             | Surprising how long this has taken to catch on.
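             | 
             | To give a flavour of what that enables, a metadata query
             | over SPARQL might look roughly like this (a sketch only:
             | the endpoint URL and predicates below are placeholders,
             | not GeneNetwork's actual schema):
             | 
             |   # Placeholder endpoint and vocabulary, for illustration.
             |   from SPARQLWrapper import SPARQLWrapper, JSON
             | 
             |   sparql = SPARQLWrapper("https://sparql.example.org/")
             |   sparql.setQuery("""
             |       SELECT ?dataset ?title WHERE {
             |         ?dataset a <http://schema.org/Dataset> ;
             |           <http://purl.org/dc/terms/title> ?title .
             |       } LIMIT 10
             |   """)
             |   sparql.setReturnFormat(JSON)
             |   rows = sparql.query().convert()["results"]["bindings"]
             |   for r in rows:
             |       print(r["dataset"]["value"], r["title"]["value"])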
        
       | chaxor wrote:
       | IPFS or torrents are the best options for distributing data.
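       | 
       | For what it's worth, publishing a file through a local IPFS
       | node is roughly this (a sketch assuming a kubo daemon with its
       | HTTP API on the default port; the file name is a placeholder):
       | 
       |   # Assumes a local kubo daemon listening on port 5001;
       |   # "dataset.csv" is a placeholder file name.
       |   import requests
       | 
       |   with open("dataset.csv", "rb") as f:
       |       resp = requests.post(
       |           "http://127.0.0.1:5001/api/v0/add",
       |           files={"file": f})
       |   cid = resp.json()["Hash"]
       |   print("content ID to share:", cid)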
        
         | robwwilliams wrote:
         | Not in our 5-year experience trying to use it with
         | GeneNetwork.org to share large and small datasets. IPFS is
         | marketed as simple but is complex--or even over-engineered,
         | from some perspectives. Hate to say it, but Dropbox is much
         | easier and more stable.
         | 
         | Hoping IPFS makes it someday because the idea is great.
        
           | Blahah wrote:
           | It seems you have confused 'distributed' with... something
           | else. Regardless of how easy it is for you, or how complex
           | you found it, the data is distributed via IPFS.
        
             | anamexis wrote:
             | What data?
        
               | Blahah wrote:
               | Any data shared in the network.
        
               | anamexis wrote:
               | I don't understand what you're responding to. GP said
               | they tried using IPFS with their project, but it ended up
               | being too complicated and they opted for Dropbox instead.
        
         | hsjqllzlfkf wrote:
         | Every time I remember that torrents exist, that blows my mind.
        
         | JBorrow wrote:
         | That is incorrect for scientific data. The limitations are:
         | 
         | a) Massive data volumes (~100 GB - 1 PB/project)
         |    i) This means that data is typically stored on limited-
         |       access machines like HPC clusters
         |    ii) This also means that shipping this data around is
         |        financially expensive, and cannot be supported purely
         |        by small client machines
         | 
         | b) A low number of seeders; scientific data is not exactly
         | popular, and there may be network restrictions on uploads
         | through the typically used networks;
         | 
         | c) The requirement for a data legacy; torrents are fantastic
         | for ephemeral data (e.g. operating system builds), but are
         | terrible for data that must be archived and kept for
         | potentially decades to centuries.
        
           | staunton wrote:
           | Scientific datasets like that can be very easily hosted at
           | one of the repositories such as zotero. The only reasons
           | people don't do that are a vague sense of insecurity about
           | having someone declare their analysis botched, vague legal
           | worries, vague unwillingness to do the very small amount of
           | work required to publish data, or the hope to milk a dataset
           | for more papers before anyone else gets a chance.
        
             | Blahah wrote:
             | I guess you meant zenodo, not zotero.
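             | 
             | For what it's worth, the Zenodo deposit flow is only a
             | few REST calls (a rough sketch of its public API; the
             | token, file name and metadata below are placeholders):
             | 
             |   import requests
             | 
             |   ZENODO = "https://zenodo.org/api/deposit/depositions"
             |   params = {"access_token": "YOUR_TOKEN"}
             | 
             |   # 1. create an empty deposition
             |   dep = requests.post(ZENODO, params=params,
             |                       json={}).json()
             | 
             |   # 2. upload the data file to its bucket
             |   with open("dataset.csv", "rb") as f:
             |       requests.put(dep["links"]["bucket"] + "/dataset.csv",
             |                    params=params, data=f)
             | 
             |   # 3. add minimal metadata, then publish
             |   meta = {"metadata": {
             |       "title": "Pilot study raw data",
             |       "upload_type": "dataset",
             |       "description": "n=3 pilot measurements",
             |       "creators": [{"name": "Doe, Jane"}]}}
             |   requests.put(f"{ZENODO}/{dep['id']}",
             |                params=params, json=meta)
             |   requests.post(f"{ZENODO}/{dep['id']}/actions/publish",
             |                 params=params)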
        
           | 0cf8612b2e1e wrote:
           | Most scientific datasets are not that large. For every CERN-
           | type study there are a thousand biology papers with n=3 where
           | the collected results sit in a single tab of an Excel
           | document.
        
             | robwwilliams wrote:
             | Those do not belong in IPFS. They won't be replicated and
             | may die.
        
             | jrumbut wrote:
             | Very true, but if you're planning on expanding the work of
             | a small pilot study like that and you don't have the people
             | who were involved in the original, you probably need to
             | recreate the study (to shake out the kinks in the protocol,
             | confirm results for yourself, etc.).
             | 
             | It would be challenging to find a solution robust enough
             | for CERN type data but also simple enough for an n=3
             | undergraduate research project (that may have yielded some
             | interesting results).
             | 
             | I don't know what the solution is there. My intuition is
             | that university libraries could be involved, and that a
             | data librarian could help you get your small study into
             | shape or be embedded at a percentage effort on a large
             | study.
        
       | stainablesteel wrote:
       | did they say to publish it in an overpriced black box so only
       | their subscribers can view it?
        
         | epgui wrote:
         | That's a problem, but it's a different problem. Stay relevant.
        
       ___________________________________________________________________
       (page generated 2023-06-28 23:00 UTC)