[HN Gopher] One in five genetics papers contains errors thanks t...
       ___________________________________________________________________
        
       One in five genetics papers contains errors thanks to Microsoft
       Excel (2016)
        
       Author : redbell
       Score  : 81 points
       Date   : 2024-09-14 16:49 UTC (6 hours ago)
        
 (HTM) web link (www.science.org)
 (TXT) w3m dump (www.science.org)
        
       | mbreese wrote:
       | Even when this paper was published, errors due to Excel mangled
       | gene names were well known for at least a decade.
       | 
       | Somewhat recently, some of the more error prone genes were
       | renamed to accommodate Excel. (Ex: SEPT7 -> SEPTIN7)
        
       | bbarnett wrote:
       | It's not really Excel, is it? But users not reviewing what's
       | happening, not paying attention.
        
         | happytoexplain wrote:
         | This attitude is at the root of why software is so much worse
         | than it could be. "It's the users' fault" as a real root cause
         | is borderline non-existent, because it can _always_ be obviated
         | in the software, even though the kind of talent that can do
         | that for complex systems is exceedingly rare.
        
           | TeMPOraL wrote:
           | I disagree. The dominating attitude today is that users are
           | idiots and too dumb to use anything they can't master in 30
           | seconds from first exposure, which has a nice side effect of
           | cutting out _a lot_ of functionality from the scope. The
           | result is, for example, 20 startups going after any given
           | idea, each spending years refining a slightly different take
           | on a useless set of simple features.
           | 
           | (Actually, many of such products can be made better by
           | replacing them with an Excel sheet, which is a big part of
           | the reason why people who actually need to get shit done end
           | up using Excel.)
        
           | nitwit005 wrote:
           | This is users using software in a way the designers never
           | expected, decades after it was designed. Those users then
           | chose to ignore warnings about using it.
           | 
           | Newer programs like Google Sheets have better default
           | behaviors. They have free access to it.
        
         | marcosdumay wrote:
         | Yes, the problem is using Excel, not Excel by itself.
        
       | golergka wrote:
       | I studied bioinformatics in university back in 2005, and the
       | first subject we were trained on was not Python or Java, but
       | using Excel. Our teachers explicitly told us about excel's
       | conversions to dates and other pitfalls. I find it mind-boggling
       | that some researchers still made these mistakes 10 years later.
       | 
       | Excel is a wonderful tool. But you need to learn your tools and
       | find out about possible footguns.
        
         | rectang wrote:
         | Haranguing your users not to make mistakes is utterly
         | worthless. The only way to move the needle is to design systems
         | that are easier to use correctly.
        
           | nitwit005 wrote:
           | This isn't Microsoft lecturing them.
           | 
           | It's entirely valid for the school to tell students to avoid
           | easily avoidable pitfalls.
        
             | rectang wrote:
             | Of course they can lecture them -- it just isn't going to
             | work. The error rates will barely budge.
             | 
             | There is one effect: it allows smug gloating about how
             | stupid, lazy, and irresponsible these users are.
             | 
             | Blaming the user is the last refuge of the incompetent.
        
               | TeMPOraL wrote:
               | > _Blaming the user is the last refuge of the
               | incompetent._
               | 
               | Why then is it the dominating mindset in software design
               | today, and advertised as being the _opposite_ to the
               | mindset that gives you Excel?
        
           | prepend wrote:
           | Excel is not a genetics tool. It has millions if use cases,
           | many more important than genetics. Excel didn't care about
           | this.
           | 
           | It's like people complaining because sugar gets misused. Or
           | that murderers stab people with knives. The solutions isn't
           | to "fix" knives.
        
             | saagarjha wrote:
             | Excel is a tool for general-purpose data processing. What
             | the genetics people are doing is exactly that.
        
               | lukan wrote:
               | Also it is a somewhat working solution, to ban knives in
               | certain places, like clubs, so people don't become drunk
               | murderers in the first place ..
        
             | bobbylarrybobby wrote:
             | It's not like you have to be in genetics for excel to bite
             | you. I once had a "business" column in a CSV with one cell
             | set to July 11. Why? Because the business was 7/11. (I
             | assume in a different locale it would've become November
             | 7?)
             | 
             | The simple solution is to do what every CSV - DataFrame
             | library does, which is ensure columns are a homogenous
             | type. In this case a single non-date entry in a column
             | would be enough to treat the whole column as string.
        
           | TeMPOraL wrote:
           | > _The only way to move the needle is to design systems that
           | are easier to use correctly._
           | 
           | If you do it like most software vendors do, by simplifying
           | and removing functionality, you're moving the needle _in the
           | wrong direction_.
        
             | lukan wrote:
             | Depends I think.
             | 
             | Ideally on the screen UI only those things are shown, that
             | are relevant in the context.
             | 
             | And the context of beginners is very small, so they don't
             | need to see advanced tools they never will use anyway. But
             | for sure it is not the right way to also remove the tools
             | for the advanced users who do need them.
             | 
             | But it is possible to make UIs that can be customized ..
        
         | mobilio wrote:
         | Follow up:
         | 
         | https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
        
       | rectang wrote:
       | _Only_ one in five? Evidence of egregious overengineering by the
       | Excel team.
        
       | HenryBemis wrote:
       | Let me reword this. "One in five genetics papers contains errors
       | because the authors were careless when using Excel".
       | 
       | I dislike Excel for what it does (EUCs, nearly impossible to
       | track changes, etc.) But on the other hand it is an amazing tool.
        
         | monocasa wrote:
         | > One in five genetics papers contains errors because the
         | authors were careless when using Excel
         | 
         | Or Excel is a remarkably easy tool to mishandle because of
         | generally unexpected transformations it makes 'for you'
         | automatically in an easy to miss way.
        
           | prepend wrote:
           | You think Excel should make special accommodations because
           | genetics authors misuse it?
           | 
           | Excel has been acting this way since before bioinformatics
           | existed. Authors need to use their tools properly.
        
             | chucksmash wrote:
             | Excel (and GSheets) are tools where I think a stripped down
             | "trust my input, I know what I'm doing" mode that doesn't
             | try to intuit user intent would be useful.
             | 
             | As I type this comment and my phone miscorrects "intuit" to
             | "Intuit," I think also Google keyboard could benefit from
             | such a mode that only handles spelling mistakes but doesn't
             | replace uncommon words with common brands, etc.
        
               | tengwar2 wrote:
               | But if you actually do know what you are doing, you know
               | to set the cell type appropriately.
        
             | Prickle wrote:
             | They already did in 2023.
             | 
             | Excel did not have an option for turning automatic
             | conversions off.
             | 
             | You can now, FINALLY disable automatic conversion. Honestly
             | that "feature" has been a bane of my existence, and I don't
             | work with genes.
        
               | macintux wrote:
               | I had no idea. Found https://insider.microsoft365.com/en-
               | us/blog/control-data-con..., thanks.
               | 
               | I'd love to know the percentage of corrupted data across
               | all Excel workbooks.
        
         | rectang wrote:
         | Oh, come on. The Excel user interface encourages mistakes, as
         | surely as a bicycle with handlebars that steer backwards.
         | 
         | https://www.youtube.com/watch?v=MFzDaBzBlL0
         | 
         | It's not _all_ on the users.
        
           | dudus wrote:
           | I had a different interpretation of the video you linked. It
           | seems to me even a backwards bicycle is perfectly drivable as
           | long as you adjust you mental model.
           | 
           | In that sense the excel UI doesn't make sense only for those
           | not used to it. Which might be a nice analogy
        
           | TeMPOraL wrote:
           | At least it's a bicycle.
           | 
           | Remember the whole "computer as bicycle for the mind" thing?
           | That didn't happen, the world went in the opposite direction.
           | Software like Excel are the last surviving remnants of the
           | idea of empowering end users to improve their work and lives.
        
             | happytoexplain wrote:
             | > Software like Excel are the last surviving remnants of
             | the idea of empowering end users to improve their work and
             | lives.
             | 
             | I agree. I love Excel. But this attitude only makes sense
             | if we assume one can only fix easy-to-make mistakes by
             | dumbing the software down, which is not true.
        
         | happytoexplain wrote:
         | There are three options:
         | 
         | 1. Make the software better (very hard on complex systems with
         | GUIs)
         | 
         | 2. Ensure everybody knows all footguns (impossible)
         | 
         | 3. Don't care about those people (easy)
         | 
         | If we opt for #3, we might as well not even be in software as a
         | profession/hobby. Having such low standards indicates that we
         | don't really care.
        
         | TeMPOraL wrote:
         | > _EUCs_
         | 
         | Had to search for that one. It seems that EUC here means "end-
         | user computing", and I was shocked to discover it's a
         | _pejorative_ term used by vendors to describe what they
         | consider a _problem_ that needs solving,
        
         | rzzzt wrote:
         | Careless as in they type in "MARCH1" or "SEPT2" in a cell and
         | these get turned automagically into 1-Mar and 2-Sep dates after
         | pressing Enter?
        
       | erehweb wrote:
       | Original paper
       | https://genomebiology.biomedcentral.com/articles/10.1186/s13...
        
       | Prickle wrote:
       | Previous popular thread, Aug 2020:
       | https://news.ycombinator.com/item?id=24070385
       | 
       | > Scientists rename human genes to stop MS Excel from misreading
       | them as dates (theverge.com)
       | 
       | Related details:
       | 
       | 2023:
       | 
       | https://www.pcmag.com/news/microsoft-finally-fixes-excel-gli...
       | 
       | > Years after introducing Excel's automatic conversion features,
       | Microsoft rolls out an update to prevent it from changing gene
       | symbols to dates.
       | 
       | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9325790/
       | 
       | > Gene Updater: a web tool that autocorrects and updates for
       | Excel misidentified gene names
        
       | magicalhippo wrote:
       | I use Excel a fair bit at work, mostly for looking at data, but
       | also to fairly easily generate SQL statements based on such data,
       | often after some formulas have been applied.
       | 
       | Though I think what I'd really want is some tool which has the
       | same grid-like visualization, filtering and direct entering, but
       | was code-based under the hood and without magic conversions.
       | 
       | So, take Excel, and when you enter a formula in a cell, it
       | actually writes a line of code for you, which you can inspect and
       | edit. Including adding your own functions and such.
       | 
       | When importing delimited text data, you'd have to specify what
       | the data is in each column. It should still save the original
       | text data so you can change your mind, but yeah, no automagic
       | stuff.
        
         | lukan wrote:
         | Yes, something like this is on my mind since quite some years
         | as well.
         | 
         | It would be limited to programmers, though.
        
         | scrlk wrote:
         | The Data Wrangler extension for VS Code might be of interest.
         | As you apply operations to your dataset, it generates Pandas
         | code.
         | 
         | https://code.visualstudio.com/docs/datascience/data-wrangler
         | 
         | https://www.youtube.com/watch?v=5tWJVLF6PuA
        
       | redbell wrote:
       | Here's another interesting article on the same topic (2016)
       | entitled "Gene name errors are widespread in the scientific
       | literature" :
       | https://genomebiology.biomedcentral.com/articles/10.1186/s13...
       | 
       | From the article:
       | 
       | " _The problem of Excel software inadvertently converting gene
       | symbols to dates and floating-point numbers was originally
       | described in 2004. For example, gene symbols such as SEPT2
       | (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1,
       | E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep'
       | and '1-Mar', respectively._ "
        
       | hilbert42 wrote:
       | Don't we ever learn. We've known since VisiCalc in the '80s not
       | to trust spreadsheets for important data without crosschecking
       | the results.
       | 
       | The question is why do we still use substandard tools for
       | processing important data like this.
        
         | lukan wrote:
         | "The question is why do we still use substandard tools for
         | processing important data like this."
         | 
         | Because there is no better alternative (yet)?
         | 
         | A better alternative needs to be really better, to justify the
         | effort of people relearning how to do things in this better
         | tool then.
        
           | jasinjames wrote:
           | I think another factor is that the spreadsheet model has
           | basically zero barrier to entry from a users perspective. You
           | can have elementary school students punch data from a
           | classroom experiment into an excel table, which is great!
           | 
           | Any replacement system which, for example, enforced a strong
           | separation between operations, input reference data and
           | output result data would require users to learn the model
           | before attempting to use the software. This is a pretty big
           | ask, especially since lots of small-scale users wouldn't see
           | an immediate benefit. I think of it like the tradeoff between
           | dynamic and static typing when programming- it's the same
           | "upfront mental overhead versus long term maintainability"
           | question IMO.
        
       | freehorse wrote:
       | At work we have had problems with excel changing values of
       | numerical entries due to different locale formatting multiple
       | times in the past, and many times causing hours of lost time till
       | the culprit (excel) was found. Other types of values that are
       | completely inconvenient for excel include social security numbers
       | and phone numbers; which one would assume are more general
       | purpose and thus closer to the standard use cases of excel than
       | genes. Still excel is messing these things up. Opening a csv file
       | is also a non-trivial overly complicated endeavour for excel in
       | year 2024, which should be done with a double click (somehow
       | other spreadsheet programs manage to handle csv files much more
       | easily). The worst is that excel is usually used by non-technical
       | people, who misunderstand these idiosyncrasies and time is wasted
       | over and over.
        
         | yial wrote:
         | My work frustration with excel is not so much excel itself, but
         | that people will destroy / delete / not understand formulas (or
         | think they're improving them...) and then the end result of
         | their edits is garbage. However this is my fault usually for
         | not using the protect / lock features well enough. I do know
         | that notes explanations get overlooked. I've mitigated this
         | slightly by almost always having in our shared a blank start,
         | that's just a backup to start from, and then the actual one to
         | be used.
        
       | theodpHN wrote:
       | So, the question is: Would there be fewer or more errors if
       | something other than Excel was used? Has much rigorous research
       | been done on this?
        
       | Gimpei wrote:
       | Didn't the same thing happen in economics because of a bug in
       | Stata in the 90s?
        
       | bikenaga wrote:
       | An annoyance with spreadsheets that deterred me from ever using
       | them in teaching is that they've perpetuated a arithmetic order-
       | of-precedence bug. ("Bug" in the sense that it contradicts long-
       | standing mathematical convention.) If you type
       | -3^2
       | 
       | in a cell and press ENTER, the spreadsheet tell you it's "9". It
       | should be "-9"; in math, exponentiation has precedence over unary
       | minus, so you square 3, then negate the result. For instance, if
       | you tell students to graph "y = -x^2", they should draw a
       | parabola opening _downward_.
       | 
       | I don't have a recent copy of Excel to check this in, but this
       | was the case in the '97 version. I just tried it in the current
       | LibreOffice calc, and it returns "9". My guess is that one of the
       | early spreadsheets messed up the order of precedence, and
       | everybody after copied it for compatibility.
       | 
       | On the other hand, I just tried maxima and python and they both
       | give "-9".
       | 
       | I wonder if this particular problem afflicts people who copy
       | formulas from (say) math books into spreadsheets.
        
         | mobilio wrote:
         | But it's 9!
         | 
         | Check it: https://www.mathplanet.com/education/pre-
         | algebra/explore-and....
         | 
         | "You also have to pay attention to the signs when you multiply
         | and divide. There are two simple rules to remember: When you
         | multiply a negative number by a positive number then the
         | product is always negative. When you multiply two negative
         | numbers or two positive numbers then the product is always
         | positive."
         | 
         | So basically you have -3x-3 and result is 9.
        
           | n_plus_1_acc wrote:
           | Correct, but irrelevant to the question. The usual rules of
           | math require it to be parsed like -(32), so there are only
           | positive numbers being multiplied.
        
           | shagie wrote:
           | https://www.wolframalpha.com/input?i=-3%5E2
           | 
           | https://en.wikipedia.org/wiki/Order_of_operations
           | 
           | Parentheses, Exponentiation, Multiplication, Division,
           | Addition, Subtraction
           | 
           | -3^2 would then be correctly parsed as -(3^2) which is -9.
           | 
           | Parsing it as (-3)^2 would require the addition of
           | parentheses.
           | 
           | This gets to the special case of the unary minus sign...
           | which the Wikipedia article specifically calls out.
           | Special cases              Unary minus sign
           | There are differing conventions concerning the unary
           | operation '-' (usually pronounced "minus"). In written or
           | printed mathematics, the expression -32 is interpreted to
           | mean -(32) = -9.              In some applications and
           | programming languages, notably Microsoft Excel, PlanMaker
           | (and other spreadsheet applications) and the programming
           | language bc, unary operations have a higher priority than
           | binary operations, that is, the unary minus has higher
           | precedence than exponentiation, so in those languages -32
           | will be interpreted as (-3)2 = 9. This does not apply to the
           | binary minus operation '-'; for example in Microsoft Excel
           | while the formulas =-2^2, =-(2)^2 and =0+-2^2 return 4, the
           | formulas =0-2^2 and =-(2^2) return -4.
        
       | jasinjames wrote:
       | I had a similar issue just two weeks ago at $DAYJOB. I was
       | scraping a log file and I had written an awk script to convert
       | the HH:MM:SS.uuuuuu timestamps to nanoseconds since epoch for
       | ingestion by another tool. Little did I know that awk uses
       | floating point for numbers internally, and so my conversion
       | function was incorrectly rounding the result, which messed up my
       | later analysis in arcane ways. To add insult to injury, I had the
       | same problem again when paging through the data in Visidata, but
       | there was an easy workaround I don't exactly remember.
        
       | smartmic wrote:
       | The problem with Excel is that it tries to do almost everything
       | in one software tool: front-end (editing, presentation),
       | analysis, semantic modeling, data storage/database.
       | 
       | Almost all reasonable engineers see that there is something wrong
       | with such an approach. But almost all everyday computer users
       | think that this is the way computing has to be.
       | 
       | Sometimes I wonder why even I voluntarily open it for certain
       | tasks - anyway, despite all the criticism, Excel has reached the
       | Lindy[1] threshold for me and is here to stay.
       | 
       | [1] https://en.wikipedia.org/wiki/Lindy_effect
        
         | ravetcofx wrote:
         | Except that it's proprietary and will eventually be un-
         | maintained and stop working. But spreadsheets, in general, fall
         | under the Lindy effect and open source software will continue
         | it for centuries to come.
        
       | darkhorn wrote:
       | This is why Statisticians do not use Excel for scientific work.
       | They use SPSS, R, Minitab, MATLAB, SAS, etc.
        
       | Eddy_Viscosity2 wrote:
       | But why oh why can't you disable automatic date reading? Put
       | something in settings with a checkbox that I, the user, can tell
       | excel to F-off with that nonsense.
       | 
       | Is there anybody who can argue the 'for' case for having this on
       | all the time without recourse?
        
       ___________________________________________________________________
       (page generated 2024-09-14 23:00 UTC)