[HN Gopher] One in five genetics papers contains errors thanks t...
___________________________________________________________________
One in five genetics papers contains errors thanks to Microsoft
Excel (2016)
Author : redbell
Score : 81 points
Date : 2024-09-14 16:49 UTC (6 hours ago)
(HTM) web link (www.science.org)
(TXT) w3m dump (www.science.org)
| mbreese wrote:
| Even when this paper was published, errors due to Excel mangled
| gene names were well known for at least a decade.
|
| Somewhat recently, some of the more error prone genes were
| renamed to accommodate Excel. (Ex: SEPT7 -> SEPTIN7)
| bbarnett wrote:
| It's not really Excel, is it? But users not reviewing what's
| happening, not paying attention.
| happytoexplain wrote:
| This attitude is at the root of why software is so much worse
| than it could be. "It's the users' fault" as a real root cause
| is borderline non-existent, because it can _always_ be obviated
| in the software, even though the kind of talent that can do
| that for complex systems is exceedingly rare.
| TeMPOraL wrote:
| I disagree. The dominating attitude today is that users are
| idiots and too dumb to use anything they can't master in 30
| seconds from first exposure, which has a nice side effect of
| cutting out _a lot_ of functionality from the scope. The
| result is, for example, 20 startups going after any given
| idea, each spending years refining a slightly different take
| on a useless set of simple features.
|
| (Actually, many of such products can be made better by
| replacing them with an Excel sheet, which is a big part of
| the reason why people who actually need to get shit done end
| up using Excel.)
| nitwit005 wrote:
| This is users using software in a way the designers never
| expected, decades after it was designed. Those users then
| chose to ignore warnings about using it.
|
| Newer programs like Google Sheets have better default
| behaviors. They have free access to it.
| marcosdumay wrote:
| Yes, the problem is using Excel, not Excel by itself.
| golergka wrote:
| I studied bioinformatics in university back in 2005, and the
| first subject we were trained on was not Python or Java, but
| using Excel. Our teachers explicitly told us about excel's
| conversions to dates and other pitfalls. I find it mind-boggling
| that some researchers still made these mistakes 10 years later.
|
| Excel is a wonderful tool. But you need to learn your tools and
| find out about possible footguns.
| rectang wrote:
| Haranguing your users not to make mistakes is utterly
| worthless. The only way to move the needle is to design systems
| that are easier to use correctly.
| nitwit005 wrote:
| This isn't Microsoft lecturing them.
|
| It's entirely valid for the school to tell students to avoid
| easily avoidable pitfalls.
| rectang wrote:
| Of course they can lecture them -- it just isn't going to
| work. The error rates will barely budge.
|
| There is one effect: it allows smug gloating about how
| stupid, lazy, and irresponsible these users are.
|
| Blaming the user is the last refuge of the incompetent.
| TeMPOraL wrote:
| > _Blaming the user is the last refuge of the
| incompetent._
|
| Why then is it the dominating mindset in software design
| today, and advertised as being the _opposite_ to the
| mindset that gives you Excel?
| prepend wrote:
| Excel is not a genetics tool. It has millions if use cases,
| many more important than genetics. Excel didn't care about
| this.
|
| It's like people complaining because sugar gets misused. Or
| that murderers stab people with knives. The solutions isn't
| to "fix" knives.
| saagarjha wrote:
| Excel is a tool for general-purpose data processing. What
| the genetics people are doing is exactly that.
| lukan wrote:
| Also it is a somewhat working solution, to ban knives in
| certain places, like clubs, so people don't become drunk
| murderers in the first place ..
| bobbylarrybobby wrote:
| It's not like you have to be in genetics for excel to bite
| you. I once had a "business" column in a CSV with one cell
| set to July 11. Why? Because the business was 7/11. (I
| assume in a different locale it would've become November
| 7?)
|
| The simple solution is to do what every CSV - DataFrame
| library does, which is ensure columns are a homogenous
| type. In this case a single non-date entry in a column
| would be enough to treat the whole column as string.
| TeMPOraL wrote:
| > _The only way to move the needle is to design systems that
| are easier to use correctly._
|
| If you do it like most software vendors do, by simplifying
| and removing functionality, you're moving the needle _in the
| wrong direction_.
| lukan wrote:
| Depends I think.
|
| Ideally on the screen UI only those things are shown, that
| are relevant in the context.
|
| And the context of beginners is very small, so they don't
| need to see advanced tools they never will use anyway. But
| for sure it is not the right way to also remove the tools
| for the advanced users who do need them.
|
| But it is possible to make UIs that can be customized ..
| mobilio wrote:
| Follow up:
|
| https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
| rectang wrote:
| _Only_ one in five? Evidence of egregious overengineering by the
| Excel team.
| HenryBemis wrote:
| Let me reword this. "One in five genetics papers contains errors
| because the authors were careless when using Excel".
|
| I dislike Excel for what it does (EUCs, nearly impossible to
| track changes, etc.) But on the other hand it is an amazing tool.
| monocasa wrote:
| > One in five genetics papers contains errors because the
| authors were careless when using Excel
|
| Or Excel is a remarkably easy tool to mishandle because of
| generally unexpected transformations it makes 'for you'
| automatically in an easy to miss way.
| prepend wrote:
| You think Excel should make special accommodations because
| genetics authors misuse it?
|
| Excel has been acting this way since before bioinformatics
| existed. Authors need to use their tools properly.
| chucksmash wrote:
| Excel (and GSheets) are tools where I think a stripped down
| "trust my input, I know what I'm doing" mode that doesn't
| try to intuit user intent would be useful.
|
| As I type this comment and my phone miscorrects "intuit" to
| "Intuit," I think also Google keyboard could benefit from
| such a mode that only handles spelling mistakes but doesn't
| replace uncommon words with common brands, etc.
| tengwar2 wrote:
| But if you actually do know what you are doing, you know
| to set the cell type appropriately.
| Prickle wrote:
| They already did in 2023.
|
| Excel did not have an option for turning automatic
| conversions off.
|
| You can now, FINALLY disable automatic conversion. Honestly
| that "feature" has been a bane of my existence, and I don't
| work with genes.
| macintux wrote:
| I had no idea. Found https://insider.microsoft365.com/en-
| us/blog/control-data-con..., thanks.
|
| I'd love to know the percentage of corrupted data across
| all Excel workbooks.
| rectang wrote:
| Oh, come on. The Excel user interface encourages mistakes, as
| surely as a bicycle with handlebars that steer backwards.
|
| https://www.youtube.com/watch?v=MFzDaBzBlL0
|
| It's not _all_ on the users.
| dudus wrote:
| I had a different interpretation of the video you linked. It
| seems to me even a backwards bicycle is perfectly drivable as
| long as you adjust you mental model.
|
| In that sense the excel UI doesn't make sense only for those
| not used to it. Which might be a nice analogy
| TeMPOraL wrote:
| At least it's a bicycle.
|
| Remember the whole "computer as bicycle for the mind" thing?
| That didn't happen, the world went in the opposite direction.
| Software like Excel are the last surviving remnants of the
| idea of empowering end users to improve their work and lives.
| happytoexplain wrote:
| > Software like Excel are the last surviving remnants of
| the idea of empowering end users to improve their work and
| lives.
|
| I agree. I love Excel. But this attitude only makes sense
| if we assume one can only fix easy-to-make mistakes by
| dumbing the software down, which is not true.
| happytoexplain wrote:
| There are three options:
|
| 1. Make the software better (very hard on complex systems with
| GUIs)
|
| 2. Ensure everybody knows all footguns (impossible)
|
| 3. Don't care about those people (easy)
|
| If we opt for #3, we might as well not even be in software as a
| profession/hobby. Having such low standards indicates that we
| don't really care.
| TeMPOraL wrote:
| > _EUCs_
|
| Had to search for that one. It seems that EUC here means "end-
| user computing", and I was shocked to discover it's a
| _pejorative_ term used by vendors to describe what they
| consider a _problem_ that needs solving,
| rzzzt wrote:
| Careless as in they type in "MARCH1" or "SEPT2" in a cell and
| these get turned automagically into 1-Mar and 2-Sep dates after
| pressing Enter?
| erehweb wrote:
| Original paper
| https://genomebiology.biomedcentral.com/articles/10.1186/s13...
| Prickle wrote:
| Previous popular thread, Aug 2020:
| https://news.ycombinator.com/item?id=24070385
|
| > Scientists rename human genes to stop MS Excel from misreading
| them as dates (theverge.com)
|
| Related details:
|
| 2023:
|
| https://www.pcmag.com/news/microsoft-finally-fixes-excel-gli...
|
| > Years after introducing Excel's automatic conversion features,
| Microsoft rolls out an update to prevent it from changing gene
| symbols to dates.
|
| https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9325790/
|
| > Gene Updater: a web tool that autocorrects and updates for
| Excel misidentified gene names
| magicalhippo wrote:
| I use Excel a fair bit at work, mostly for looking at data, but
| also to fairly easily generate SQL statements based on such data,
| often after some formulas have been applied.
|
| Though I think what I'd really want is some tool which has the
| same grid-like visualization, filtering and direct entering, but
| was code-based under the hood and without magic conversions.
|
| So, take Excel, and when you enter a formula in a cell, it
| actually writes a line of code for you, which you can inspect and
| edit. Including adding your own functions and such.
|
| When importing delimited text data, you'd have to specify what
| the data is in each column. It should still save the original
| text data so you can change your mind, but yeah, no automagic
| stuff.
| lukan wrote:
| Yes, something like this is on my mind since quite some years
| as well.
|
| It would be limited to programmers, though.
| scrlk wrote:
| The Data Wrangler extension for VS Code might be of interest.
| As you apply operations to your dataset, it generates Pandas
| code.
|
| https://code.visualstudio.com/docs/datascience/data-wrangler
|
| https://www.youtube.com/watch?v=5tWJVLF6PuA
| redbell wrote:
| Here's another interesting article on the same topic (2016)
| entitled "Gene name errors are widespread in the scientific
| literature" :
| https://genomebiology.biomedcentral.com/articles/10.1186/s13...
|
| From the article:
|
| " _The problem of Excel software inadvertently converting gene
| symbols to dates and floating-point numbers was originally
| described in 2004. For example, gene symbols such as SEPT2
| (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1,
| E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep'
| and '1-Mar', respectively._ "
| hilbert42 wrote:
| Don't we ever learn. We've known since VisiCalc in the '80s not
| to trust spreadsheets for important data without crosschecking
| the results.
|
| The question is why do we still use substandard tools for
| processing important data like this.
| lukan wrote:
| "The question is why do we still use substandard tools for
| processing important data like this."
|
| Because there is no better alternative (yet)?
|
| A better alternative needs to be really better, to justify the
| effort of people relearning how to do things in this better
| tool then.
| jasinjames wrote:
| I think another factor is that the spreadsheet model has
| basically zero barrier to entry from a users perspective. You
| can have elementary school students punch data from a
| classroom experiment into an excel table, which is great!
|
| Any replacement system which, for example, enforced a strong
| separation between operations, input reference data and
| output result data would require users to learn the model
| before attempting to use the software. This is a pretty big
| ask, especially since lots of small-scale users wouldn't see
| an immediate benefit. I think of it like the tradeoff between
| dynamic and static typing when programming- it's the same
| "upfront mental overhead versus long term maintainability"
| question IMO.
| freehorse wrote:
| At work we have had problems with excel changing values of
| numerical entries due to different locale formatting multiple
| times in the past, and many times causing hours of lost time till
| the culprit (excel) was found. Other types of values that are
| completely inconvenient for excel include social security numbers
| and phone numbers; which one would assume are more general
| purpose and thus closer to the standard use cases of excel than
| genes. Still excel is messing these things up. Opening a csv file
| is also a non-trivial overly complicated endeavour for excel in
| year 2024, which should be done with a double click (somehow
| other spreadsheet programs manage to handle csv files much more
| easily). The worst is that excel is usually used by non-technical
| people, who misunderstand these idiosyncrasies and time is wasted
| over and over.
| yial wrote:
| My work frustration with excel is not so much excel itself, but
| that people will destroy / delete / not understand formulas (or
| think they're improving them...) and then the end result of
| their edits is garbage. However this is my fault usually for
| not using the protect / lock features well enough. I do know
| that notes explanations get overlooked. I've mitigated this
| slightly by almost always having in our shared a blank start,
| that's just a backup to start from, and then the actual one to
| be used.
| theodpHN wrote:
| So, the question is: Would there be fewer or more errors if
| something other than Excel was used? Has much rigorous research
| been done on this?
| Gimpei wrote:
| Didn't the same thing happen in economics because of a bug in
| Stata in the 90s?
| bikenaga wrote:
| An annoyance with spreadsheets that deterred me from ever using
| them in teaching is that they've perpetuated a arithmetic order-
| of-precedence bug. ("Bug" in the sense that it contradicts long-
| standing mathematical convention.) If you type
| -3^2
|
| in a cell and press ENTER, the spreadsheet tell you it's "9". It
| should be "-9"; in math, exponentiation has precedence over unary
| minus, so you square 3, then negate the result. For instance, if
| you tell students to graph "y = -x^2", they should draw a
| parabola opening _downward_.
|
| I don't have a recent copy of Excel to check this in, but this
| was the case in the '97 version. I just tried it in the current
| LibreOffice calc, and it returns "9". My guess is that one of the
| early spreadsheets messed up the order of precedence, and
| everybody after copied it for compatibility.
|
| On the other hand, I just tried maxima and python and they both
| give "-9".
|
| I wonder if this particular problem afflicts people who copy
| formulas from (say) math books into spreadsheets.
| mobilio wrote:
| But it's 9!
|
| Check it: https://www.mathplanet.com/education/pre-
| algebra/explore-and....
|
| "You also have to pay attention to the signs when you multiply
| and divide. There are two simple rules to remember: When you
| multiply a negative number by a positive number then the
| product is always negative. When you multiply two negative
| numbers or two positive numbers then the product is always
| positive."
|
| So basically you have -3x-3 and result is 9.
| n_plus_1_acc wrote:
| Correct, but irrelevant to the question. The usual rules of
| math require it to be parsed like -(32), so there are only
| positive numbers being multiplied.
| shagie wrote:
| https://www.wolframalpha.com/input?i=-3%5E2
|
| https://en.wikipedia.org/wiki/Order_of_operations
|
| Parentheses, Exponentiation, Multiplication, Division,
| Addition, Subtraction
|
| -3^2 would then be correctly parsed as -(3^2) which is -9.
|
| Parsing it as (-3)^2 would require the addition of
| parentheses.
|
| This gets to the special case of the unary minus sign...
| which the Wikipedia article specifically calls out.
| Special cases Unary minus sign
| There are differing conventions concerning the unary
| operation '-' (usually pronounced "minus"). In written or
| printed mathematics, the expression -32 is interpreted to
| mean -(32) = -9. In some applications and
| programming languages, notably Microsoft Excel, PlanMaker
| (and other spreadsheet applications) and the programming
| language bc, unary operations have a higher priority than
| binary operations, that is, the unary minus has higher
| precedence than exponentiation, so in those languages -32
| will be interpreted as (-3)2 = 9. This does not apply to the
| binary minus operation '-'; for example in Microsoft Excel
| while the formulas =-2^2, =-(2)^2 and =0+-2^2 return 4, the
| formulas =0-2^2 and =-(2^2) return -4.
| jasinjames wrote:
| I had a similar issue just two weeks ago at $DAYJOB. I was
| scraping a log file and I had written an awk script to convert
| the HH:MM:SS.uuuuuu timestamps to nanoseconds since epoch for
| ingestion by another tool. Little did I know that awk uses
| floating point for numbers internally, and so my conversion
| function was incorrectly rounding the result, which messed up my
| later analysis in arcane ways. To add insult to injury, I had the
| same problem again when paging through the data in Visidata, but
| there was an easy workaround I don't exactly remember.
| smartmic wrote:
| The problem with Excel is that it tries to do almost everything
| in one software tool: front-end (editing, presentation),
| analysis, semantic modeling, data storage/database.
|
| Almost all reasonable engineers see that there is something wrong
| with such an approach. But almost all everyday computer users
| think that this is the way computing has to be.
|
| Sometimes I wonder why even I voluntarily open it for certain
| tasks - anyway, despite all the criticism, Excel has reached the
| Lindy[1] threshold for me and is here to stay.
|
| [1] https://en.wikipedia.org/wiki/Lindy_effect
| ravetcofx wrote:
| Except that it's proprietary and will eventually be un-
| maintained and stop working. But spreadsheets, in general, fall
| under the Lindy effect and open source software will continue
| it for centuries to come.
| darkhorn wrote:
| This is why Statisticians do not use Excel for scientific work.
| They use SPSS, R, Minitab, MATLAB, SAS, etc.
| Eddy_Viscosity2 wrote:
| But why oh why can't you disable automatic date reading? Put
| something in settings with a checkbox that I, the user, can tell
| excel to F-off with that nonsense.
|
| Is there anybody who can argue the 'for' case for having this on
| all the time without recourse?
___________________________________________________________________
(page generated 2024-09-14 23:00 UTC)