https://www.nature.com/articles/d41586-023-01929-7

How to make your scientific data accessible, discoverable and useful

TECHNOLOGY FEATURE
27 June 2023

Specialists offer seven tips for effectively sharing your data.

Jeffrey M. Perkel

[Illustration by The Project Twins: a cartoon of four heads with colourful tangrams growing in each; in the final head, the tangram forms a rocket and takes off.]

Miguel Acevedo typically gets two questions about his research on malaria in lizards. "Do lizards really get malaria?" (The answer is yes.) And, "Will I get malaria from a lizard?" (Not likely.)

Lizard malaria is a model for vector-borne disease ecology and evolution^1. A colleague had been pursuing the same problem, at the same site in Puerto Rico, since the 1990s, and Acevedo, a wildlife ecologist at the University of Florida in Gainesville, wanted to combine those older data with his own to perform a long-term analysis.

It was easier said than done. Whereas Acevedo's data were logged using a standardized data-entry template, the colleague's data were recorded in a mix of paper notebooks, Excel spreadsheets and hand-drawn maps. "It was some of the most organized data of that era, but we didn't have the standards then that we have today," he says. Columns weren't necessarily consistent from sheet to sheet, nor did they use the same units, and it wasn't always clear which sampling sites were being measured.

In the end, what could have been a morning's effort took "six or seven months", Acevedo says. "It's a lot of work, and it's not fun work, you know?"

Funder and publisher mandates, coupled with a growing emphasis on open science and reproducibility, mean that researchers are increasingly depositing data alongside their publications. Other scientists can use those data to drive new research.

But not every journal requires that authors make their data sets available, and some authors decline to do so, either for fear of getting scooped or for lack of time. (The research data policy for Springer Nature, which publishes Nature, "strongly encourage[s] that all datasets supporting the analysis and conclusions of the paper are made publicly available at the time of publication", and mandates "the sharing of community-endorsed data types".)

Nature asked data scientists about their best practices for publishing usable, high-quality data -- here's what they said.

Craft metadata

If there's one thing scientists can add to maximize their data's value, it's "metadata, metadata, metadata", says environmental scientist Patricia Soranno at Michigan State University in East Lansing. Metadata are data that describe data -- the timestamp and geolocation details that a smartphone camera stores with every image, for instance. Metadata basically explain what data mean, and are key to making data FAIR -- findable, accessible, interoperable and reusable^2. "Data without metadata", says Acevedo, "is like a Lego set without the instructions."

What those instructions should say varies from experiment to experiment -- microscopy data require different metadata than do gene sequences, say. But according to Sarah Supp, an ecologist at Denison University in Granville, Ohio, they can generally be put into a simple 'README' text file that lists when, where and how the data were collected, and by whom; the licence under which they are released; whether data collection is complete; and their status -- raw or processed, for instance.

It's worth including a 'codebook' that defines experimental variables, units, abbreviations, expected ranges and how missing data are denoted (using 'NA', for example). If there are many tables or files, then explain how they interrelate. And if software was used for data processing, detail the tools, version numbers and runtime parameters, says Anne Brown, a product-development scientist at Bayer US Crop Science in Chesterfield, Missouri.

Template README files, data dictionaries and project summaries have been shared on Twitter by Crystal Lewis, a research-data management consultant in St Louis, Missouri (see go.nature.com/43kvzt2).

For Acevedo, good metadata practice has made his lizard malaria project maintainable. "It's like learning from trauma," he says.

Over-share

What with raw numbers, exploratory dead ends and the final processed data set, "at the end of the project, there's actually thousands of versions of the data", says Ciera Martinez, a research data scientist at the Eric and Wendy Schmidt Center for Data Science and Environment in Berkeley, California. So which one should scientists publish?

"If you're able to share both the raw data and the derived data, do so," says Karthik Ram, a data scientist at the Berkeley Institute for Data Science. Processed data underlie the analysis, but raw data let other researchers test your assumptions and processing strategies. That said, raw data sets can be unwieldy and expensive to store.
In that case, says Martinez, a "good rule of thumb" is to publish the data that were used to generate your figures.

Ultimately, says Brown, publishing data shouldn't simply tick a box, but should serve the scientific community. So, ask yourself what others are likely to want from the data, and how they might use them. "Knowing that can help you understand, OK, if other researchers are going to use this data then I am going to make sure that they're able to understand it."

Embrace standards

Every project is different, as are the expectations of which data should be published and how that should be done. So, look to the broader community for guidance, Martinez says.

Many disciplines have dedicated data repositories, such as GenBank and the Protein Data Bank for DNA sequences and protein structures, respectively. But data can also be posted to general archives such as Zenodo, Figshare and Dryad. Ask whether your publisher (or funder) has a preferred storage location and file format, Brown advises. Or, consult your institutional resource librarian, suggests Jacqueline Campbell, a plant geneticist at the US Department of Agriculture (USDA) Agricultural Research Station in Ames, Iowa.

Small data sets can be deposited on the code-sharing site GitHub, but that doesn't guarantee persistence, warns Ethan White, an environmental data scientist at the University of Florida. Data can be deleted or modified at any time, so archive the data formally, as well.

Never post data to personal websites, says Tracy Chen, a scientific analyst at the NASA Exoplanet Science Institute in Pasadena, California, who co-authored a best-practices document for astrophysics data^3. If you change jobs or retire, links to personal websites can become obsolete.

Consider the format

Data should be in an open, non-proprietary file format, says Ellen Bledsoe, who teaches ecological data science at the University of Arizona in Tucson; otherwise, they could become unreadable. Bledsoe encountered that problem when she had to extract data from Lotus 1-2-3 -- a now-obsolete commercial spreadsheet program. "Trying to finagle that data added a whole other step," she says.

Text-based file formats, such as CSV (comma-separated values), can be read by many tools and programming languages, achieving the 'I' in FAIR data. And unlike with binary files, it's easy to track how text files change over time.

Above all, avoid using PDF files for tables, says Campbell, who is an assistant curator for the USDA's soybean genetics database SoyBase. Spreadsheets are easy to import, she says. But PDF tables must be manually keyed in -- a slow, painful and error-prone process.

Include code

If you used code for data analysis, post it alongside the data. Code reveals the many steps and decisions you made, "providing, in effect, a more detailed version of the methods section", says White.

Before publishing, test that the code runs in a clean computational environment -- that is, one with no objects in memory. Remove computer-specific elements, such as hard-coded file paths. Add comments to show what you're doing, and detail how to run the code, suggests John Guerra Gomez, a computer scientist at Northeastern University's campus in San Francisco, California. "Think as a time traveller," he says. "What would I want John in the future to know about this?"

Finally, suggests Kari Jordan, executive director of The Carpentries, find a coding partner. The Carpentries, based in Oakland, California, runs workshops on scientific computing and data analysis, and one point that it makes during instructor training is to "never teach alone", Jordan says. "Never teach alone, don't learn alone, don't do anything alone."
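
As a rough illustration of two points made above (storing tables as plain CSV, and removing hard-coded file paths from analysis code), here is a minimal Python sketch. It is not taken from any of the researchers quoted here. The "data" folder, the function names and the column names in the usage note are hypothetical, and the 'NA' marker simply follows the codebook example given earlier.

```python
import csv
from pathlib import Path


def resolve_data_path(filename, base_dir=None):
    """Build the path to a data file relative to a project folder.

    Assumption for this sketch: data files live in a "data" subfolder of
    the project. Nothing here depends on one person's computer.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return base / "data" / filename


def read_measurements(csv_path, missing_marker="NA"):
    """Read a CSV table into a list of dictionaries, one per row.

    Cells equal to the missing-data marker become None, so that later
    steps cannot mistake the marker for a real measurement.
    """
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            cleaned = {
                column: (None if value == missing_marker else value)
                for column, value in row.items()
            }
            rows.append(cleaned)
    return rows
```

A caller might write rows = read_measurements(resolve_data_path("lizards.csv")), where the file name is again only an example. Values are returned as text; converting them to numbers, and checking them against the expected ranges listed in a codebook, is left to later steps.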

For instance, says White, you could ask a more advanced programmer to provide high-level feedback: "What are a couple of big things that you can do to make this easier to understand?" White's typical response to this question is to suggest breaking up long blocks of code into discrete functions, eliminating repetitious code and ensuring that function and variable names are informative.

If a third party can understand and execute your code, Supp says, "you've probably done a pretty decent job at making your code readable".

Think accessibility

Big-data projects often expect a certain level of technical infrastructure on the part of prospective users. And they make assumptions about how people will consume, query and manipulate the data.

These assumptions often don't hold, says Sabina Leonelli, who teaches the philosophy and history of science at the University of Exeter, UK. "The idea that you're creating platforms that are for universal use, that can be infinitely repurposed, fails in practice because it doesn't take account of the fact that there may be groups around the world which are working under different conditions."

Leonelli's advice: consult organizations, such as the Research Data Alliance or the Committee on Data of the International Science Council, for feedback on your data standards and assumptions. And, where possible, consider "low-tech solutions", she says. Can you develop a low-bandwidth version of a database, for instance, or release both low- and high-resolution images?

Fail to consider a range of requirements, says Leonelli, and the result will be a resource that only you and others like you can use. "You run the risk of producing a resource that doesn't take any of those needs into account."

Take the plunge

Open science, says Bledsoe, "is not an all-or-nothing game"; anything you can do adds value. "Even if you don't know how to go all the way to zero-to-60 open science, zero-to-20 is also really good," she says.
So, release your data -- that gives data consumers more to analyse, and data providers more opportunities for collaboration.

It's also scary, Supp admits: sharing means opening oneself to scrutiny. "There's a certain level of vulnerability with that," she says. "But that's also how we get better."

Nature 618, 1098-1099 (2023)

doi: https://doi.org/10.1038/d41586-023-01929-7

References

1. Otero, L., Schall, J. J., Cruz, V., Aaltonen, K. & Acevedo, M. A. Parasitology 146, 453-461 (2019).
2. Wilkinson, M. D. et al. Sci. Data 3, 160018 (2016).
3. Chen, T. X. et al. Astrophys. J. Supp. Ser. 260, 5 (2022).