https://www.nature.com/articles/d41586-021-02211-4

Nature, NEWS, 13 August 2021

Autocorrect errors in Excel still creating genomics headache

Despite geneticists being warned about spreadsheet problems, 30% of published papers contain mangled gene names in supplementary data.

Dyani Lewis

Dyani Lewis is a freelance science journalist in Melbourne, Australia.

[Image: A molecular geneticist evaluates genetic data in a hospital lab in Germany's Rhineland region. Credit: Sascha Steinbach/EPA-EFE/Shutterstock]

Embarrassing autocorrect mistakes are common fodder for Internet listicles and Twitter threads. But they are also the bane of geneticists using spreadsheet programs such as Microsoft Excel. Five years after a study showed that autocorrect problems were widespread, the academic literature is still littered with error-riddled spreadsheets, according to an analysis of published gene lists. And the problem may be even worse than previously realized.
The long-standing issue often occurs when the abbreviated form of a gene's name, known as a gene symbol, is incorrectly recognized as a date and autocorrected as such by Excel or Google Sheets. For example, SEPT4 (septin 4) and MARCH1 (membrane associated ring-CH-type finger 1) will be automatically changed to 4-Sep and 1-Mar.

"It can have a significant impact on your research," says molecular biologist Auriol Purdie at the University of Sydney in Australia. Having worked with gene-microarray and gene-transcription data sets for two decades, Purdie is familiar with the inadvertent errors. But she says the problem frequently catches out beginners.

[Figure: 'A growing problem'. The proportion of papers with gene-name errors created by spreadsheet autocorrect functions is increasing. Source: ref. 3]

Distorting results

Purdie works to identify gene networks involved in the early stages of disease in sheep and cattle. If a spreadsheet alters the gene names, those genes are lost when the data are imported into gene-network-analysis software, and this can distort results. The program "will tell you you've lost a bunch of your genes", she says, but won't indicate which ones. And when dealing with data sets that contain 20,000 genes, manually comparing lists to identify which genes have been lost is an onerous task, Purdie adds.

The problem was first documented in 2004, when Barry Zeeberg, a molecular pharmacologist at the National Cancer Institute in Bethesda, Maryland, and his colleagues warned of changes to gene symbols when processing genomics data^1.

In 2016, Mark Ziemann and his colleagues at the Baker IDI Heart and Diabetes Institute in Melbourne, Australia, quantified the problem. They found that one-fifth of papers in top genomics journals contained gene-name conversion errors in Excel spreadsheets published as supplementary data^2.
These data sets are frequently accessed and used by other geneticists, so errors can perpetuate and distort further analyses.

However, despite the issue being brought to the attention of researchers, and steps being taken to fix it, the problem is still rife, according to an updated and larger analysis led by Ziemann, now at Deakin University in Geelong, Australia^3. His team found that almost one-third of more than 11,000 articles with supplementary Excel gene lists published between 2014 and 2020 contained gene-name errors (see 'A growing problem').

Simple checks can detect autocorrect errors, says Ziemann, who researches computational reproducibility in genetics. But without those checks, the errors can easily go unnoticed because of the volume of data in spreadsheets.

Changes to naming conventions

In 2017, the HUGO Gene Nomenclature Committee (HGNC), which standardizes human-gene names, announced that it would take the drastic measure of changing the gene symbols for commonly affected genes, because community-outreach efforts (including a 2019 video on YouTube) had failed to solve the problem. Since then, 27 gene symbols have been updated, including SEPT4 (now SEPTIN4) and MARCH1 (now MARCHF1).

The move was a departure from the committee's preference for keeping names stable, says Elspeth Bruford, who coordinates the HGNC from the European Bioinformatics Institute in Hinxton, UK. Last year, the committee published guidelines to reflect the new rule for modifying gene symbols in cases where data handling is affected^4. Other gene-naming bodies have followed suit.

But it might be too soon to see any change to the frequency of errors in the literature, says Bruford, because published data sets often contain outdated gene lists. "It's going to take years for this to percolate through," she says, which is why the HGNC recommends that researchers access the most recent data from public databases, and that journals ask authors to do so before publication.
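The 'simple checks' Ziemann mentions are not spelled out in this article, but a minimal sketch of one, in Python, might scan a column of gene symbols for entries shaped like the converted dates described above (4-Sep, 1-Mar) and swap outdated symbols for their current ones. The function names, the pattern and the sample column (which uses the well-known symbols TP53 and BRCA1 as filler) are assumptions made for illustration; the lookup table holds only the two renamings named here, not the full HGNC list.

```python
import re

# Month abbreviations that appear when a gene symbol is read as a date.
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Hypothetical pattern for day-month forms such as "4-Sep" or "1-Mar".
# Other locales or cell formats would need additional patterns.
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({_MONTHS})$", re.IGNORECASE)

# Only the two renamings named in the article; the real list is longer.
RENAMED = {"SEPT4": "SEPTIN4", "MARCH1": "MARCHF1"}


def find_date_like(symbols):
    """Return the entries that look like spreadsheet-converted dates."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


def update_symbols(symbols):
    """Replace outdated symbols with current ones where a mapping is known."""
    return [RENAMED.get(s, s) for s in symbols]


if __name__ == "__main__":
    column = ["TP53", "4-Sep", "BRCA1", "1-Mar"]
    print("Suspect entries:", find_date_like(column))
    print("Updated symbols:", update_symbols(["SEPT4", "TP53", "MARCH1"]))
```

A check like this only flags suspect entries. It cannot say which gene a mangled cell originally held, so the fix still has to come from the unconverted source data.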
Since the beginning of the year, Ziemann has published a monthly leader board of offending journals, which frequently features well-known titles such as Nature Communications, eLife, PLoS Genetics and Scientific Reports. Ziemann says this is probably because articles published in these journals contain more gene lists and larger data sets.

Avoid or adapt

One solution is to avoid using spreadsheets, he suggests. Although some, such as the open-source programs LibreOffice and Gnumeric, don't have the problem, spreadsheets are hard to audit. "If there's a problem, it's not readily apparent where the problem happened," because there's no record of what steps the software took, he says.

Some computational biologists use scripted computer languages, such as Python and R. These don't autocorrect gene symbols, says Ziemann, and researchers can trace the source of errors. However, they require users to learn the computer language so that they can write code to analyse data.

That's something Purdie says she doesn't have time for. She has adapted to Excel's quirks, adding apostrophes before commonly affected genes to prevent the conversion, or pre-formatting spreadsheet cells before importing data. "It's one of those things that I just accept," she says.

Bruford says the autocorrect issue in Excel is unlikely to be fixed any time soon. "We're a small user base, compared to all the users of Excel," she says, and Microsoft has never indicated that it will alter its software to accommodate the genetics community.

For those persisting with problematic software, Ziemann recommends a quick check before sharing or publishing data. Sorting data by gene symbol can bring date-conversion errors to the top, he says.

doi: https://doi.org/10.1038/d41586-021-02211-4

References

1. Zeeberg, B. R. et al. BMC Bioinformatics 5, 80 (2004).
2. Ziemann, M., Eren, Y. & El-Osta, A. Genome Biol. 17, 177 (2016).
3. Abeysooriya, M., Soria, M., Kasu, M. S. & Ziemann, M. PLoS Comput. Biol. 17, e1008984 (2021).
4. Bruford, E. A. et al. Nature Genet. 52, 754-758 (2020).
Nature ISSN 1476-4687 (online), ISSN 0028-0836 (print)
(c) 2021 Springer Nature Limited