[HN Gopher] Autocorrect errors in Excel still creating genomics ...
       ___________________________________________________________________
        
       Autocorrect errors in Excel still creating genomics headache
        
       Author : tchalla
       Score  : 69 points
       Date   : 2021-08-17 16:09 UTC (6 hours ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | gpapilion wrote:
       | Autocorrect(and predictive text) has reached an interesting
       | milestone, good enough that you mainly rely on it, but
       | occasionally it distorts the data you intended to send.
       | Disengaging the technology isn't easy, or even understanding
       | field types it shouldn't correct(everyday my work VPN auth window
       | tries to autocorrect my username). There is so much context that
       | is unavailable to the computer, and there is no good way to
       | provide it.
       | 
       | I think it has interesting implications for self-driving, since
       | as far as I can tell we're at an earlier but similar stage in its
       | development; works for simple situations, likely to drive you
       | into a wall for no reason. We build so much trust in these
       | systems, we stop paying attention to what they are doing, and
       | then pay the price when it behaves in an unexpected way to the
       | user.
        
         | soco wrote:
         | I get bitten by Excel's auto-correct all the time and I work
         | nothing related to genomics. It's in my opinion just a dumb
         | decision of Microsoft to not let you disable all those smarty
         | feats (like, thinking you have a date if your string kinda
         | looks like one)
        
       | guitarbill wrote:
       | It's easy to hate on Excel or Microsoft, a "dumb" program and a
       | faceless mega corporation. Don't get me wrong, I'm no fan. But
       | the real question is what are journals doing? It seems like the
       | obvious solution is for editors and reviewers to do a quick
       | search for these errors. To me, this is more of an indication of
       | how poor most journals really are.
       | 
       | (You could argue that supplementary data should be in an open
       | format, so you could automate this checking. I'm not against
       | this, either.)
        
         | [deleted]
        
         | nnmg wrote:
         | This is an important point. What have the journals done? Raised
         | their prices and business as usual.
         | 
         | Scientific editors do _nothing_ for data validation. There is
         | no accountability, even after retractions.
         | 
         | Scientific journal editors are glorified gatekeepers for "high
         | impact" work (read: flashy), and then use free reviewer labor
         | to cover themselves so they can call it 'peer reviewed'. In the
         | rare cases when journals do require supporting data, they
         | explicitly ask for excel spreadsheets :(
        
       | nnmg wrote:
       | Excel is used as a database/storage/interchange format,
       | especially after the initial analysis by someone who uses python
       | or R. Bioinformatician does the analysis, then the PI wants to
       | see it so they can Ctrl-F for genes they are interested in, so
       | out comes an excel document.
       | 
       | And really, even if you know python or R, are you really going to
       | fire up a jupyter notebook, load the data, and run pandas queries
       | every time someone in lab meeting or after a talk asks you about
       | this gene or that gene in your data?
       | 
       | I think the important question is why is date conversion a
       | default? Would it really break backwards compatibility for MS
       | Excel users if date conversions were explicit instead of
       | automatic? Turning that off by default would fix a lot of this.
        
         | dec0dedab0de wrote:
         | _And really, even if you know python or R, are you really going
         | to fire up a jupyter notebook, load the data, and run pandas
         | queries every time someone in lab meeting or after a talk asks
         | you about this gene or that gene in your data?_
         | 
         | I don't do any scientific research, but I have been using
         | jupyter as a replacement for excel since it was called the
         | ipython notebook. I don't really use pandas all that often, I
         | just find it easier to read and edit data in python. Though I
         | first learned ipython added the notebook from a talk Wes
         | McKinney gave about Pandas.
        
         | BugsJustFindMe wrote:
         | > _Excel is used as a database /storage/interchange format,
         | especially after the initial analysis by someone who uses
         | python or R. Bioinformatician does the analysis_
         | 
         | Sometimes, but the situation is in reality worse than that.
         | Excel is also used as the gold standard
         | database/storage/interchange format of record for random shit
         | that clinical researchers have typed in by hand whether
         | directly or transcribed from other notes, often when that data
         | isn't actually fundamentally tabular in nature because people
         | really like working in grids. Even when grids hurt more than
         | help.
         | 
         | A big secret in genetic research is that the MDs, grad
         | students, project managers, and coordinators running the
         | research programs are often not super focused on what well-
         | structured data looks like and don't know what things like
         | "key-value store" or "nested tree-like structure" mean, and
         | even if they did there aren't good GUI tools for entering them
         | anyway, and it leads to countless errors that maybe (here I
         | speculate) they just assume will wash out as noise.
         | 
         | > _I think the important question is why is date conversion a
         | default?_
         | 
         | Yes, why any kind of conversion is ever the default is a real
         | money question.
        
           | quantified wrote:
           | For the finance and business office worker, it seems to have
           | traction. Just like auto-creating an emoji when you type a :
           | character. Excel is for offices, not specializations of
           | scientists. Bummer.
        
         | SonicScrub wrote:
         | I don't work in bioinformatics, but what you are describing is
         | a completely accurate description of what I experienced working
         | in manufacturing quality control. Raw data came in from
         | suppliers in the form of spreadsheets, and management wanted to
         | see results in spreadsheets. Meaning all our quality data was
         | subjected to these issues. The date formatting issue was a
         | particularly annoying "gotcha", particularly when features were
         | defined with a XX-XX numeric code. The number of times I had to
         | deal with someone in a meeting saying "hey, why is this feature
         | called October-13?!" Super frustrating.
         | 
         | If I could choose the tools used by the whole process involving
         | multiple different companies and departments, hey I would! It
         | would be python all the way down. But I was but a cog in a
         | massive organization.
        
           | dragonwriter wrote:
           | > A lot of "safety culture" is composed of things like
           | checklists and hazard warnings which are more geared towards
           | shifting the blame for accidents onto somebody else than
           | actually preventing those accidents,
           | 
           | If you _stay in spreadsheets_ these problems mostly don't
           | occur (that is, once data entry is squared away so that the
           | initial spreadsheet has what you want it doesn 't tend to get
           | lost), its when you move in and out of spreadsheets via text
           | and take the path of least resistance [0] to do the
           | transition that the problem occurs.
           | 
           | [0] and to be fair, there is a _lot_ of resistance off that
           | path.
        
             | SonicScrub wrote:
             | The process I had to deal with was filling out spreadsheets
             | with data from a python-driven 3D inspection program that
             | exported out data files in CSV format. Needless to say,
             | these errors were inevitable for exactly the reasons you've
             | stated. Why we didn't bypass the large, poorly formatted
             | cumbersome spreadsheets and just directly export data via
             | pandas? All the inspection was done by Python anyways. You
             | tell me! Also, it did not help that the spreadsheets were
             | not created by me, or any colleagues in my department.
             | 
             | God I hated working in old-school
             | engineering/manufacturing. "That's not how we do things" is
             | the answer to everything. I
        
       | planet-and-halo wrote:
       | Serious question, is this a viable area for a startup product?
       | Obviously nothing will easily displace Excel for general
       | spreadsheet work, but I could see room for a domain-specific
       | spreadsheet that has core features and supports just a basic set
       | of functions relevant to genomics, + a "we won't fuck up your
       | data" feature.
        
         | Closi wrote:
         | I think the size and scope of Excel is hard to replicate,
         | however it sounds like it would make a great plugin.
        
         | masklinn wrote:
         | Probably not: people use excel because it's available, it's
         | what they know, and it's what their peers use.
         | 
         | It's been going for more than half a decade now, and
         | genomicists apparently would rather rename genes than stop
         | using excel...
        
         | mceoin wrote:
         | Hi - I'm building a networked spreadsheet product and had heard
         | of this auto-correct problem in genomics research through word
         | of mouth a couple of times, so have asked this question myself.
         | (I was astonished to learn just how much of genomics work
         | involves sending Excel files back and forth!) Here's the
         | conclusion that I have come to:
         | 
         | - It's definitely possible to build a custom spreadsheet
         | product with a small team, even one targeted at such a "niche"
         | user group. So it's an idea worth testing.
         | 
         | - Product can be "backwards compatible" - you can export to
         | xlxs and import from xlxs - so you don't have to change
         | behaviour of the entire industry on day one to get this to
         | work, only a single genomics researcher or lab.
         | 
         | - Pricing, unit economics, etc are unknown to me (I have no
         | background in genomics or scientific research). But presumably
         | you could leverage standard SaaS models and build a viable
         | model up from there using a few case studies. There's
         | definitely schleppy behaviour going on here that can be solved.
         | 
         | - Even a "lifestyle business" has significant upside beyond the
         | financial: improving genomics research improves genomic
         | research!
         | 
         | - The product advantage over time presumably involve building
         | more custom tooling into the genomics / data ecosystem. "Not
         | creating typos" is just the beachhead.
         | 
         | I've never actually interviewed genomics people about their
         | need here, but if anyone knows people with this problem I would
         | love to talk to them: @mceoin on twitter. (DMs open)
        
         | prionassembly wrote:
         | The easier solution is coming up with a 1:1 human-readable-
         | hashing scheme so that MARCH4 translates to "funny-blue-smell-
         | tuesday". Then you keep the safe key in a column next to the
         | unsafe key.
        
         | NullPrefix wrote:
         | >a "we won't fuck up your data" feature
         | 
         | That's a hard to implement feature. Hand waving and buzzword
         | lingo will not be enough for people to believe it.
        
           | masklinn wrote:
           | > That's a hard to implement feature.
           | 
           | It really isn't: just remove the data autodetection anything
           | and you're done.
        
         | cm2187 wrote:
         | That's an expensive workaround to just formatting your cells as
         | text before entering the values
        
           | dspillett wrote:
           | You'd be surprised how much effort can be saved by not
           | relying on human interaction to be reliable.
           | 
           | Though convincing people to switch to a new app for that
           | reason, even if it were free and Free, would still be an
           | uphill struggle.
           | 
           |  _> just formatting your cells as text before entering the
           | values_
           | 
           | It is a little more than that. I work with finance people,
           | and there a lot of data is _manipulated_ in Excel but passed
           | around as CSV files for compatibility elsewhere. This causes
           | no end of problems because fixes by setting cell properties
           | are obviously lost in transcription, and date errors creep in
           | as things are moved back  & forth between people in the US
           | and those in locales that do dates properly.
        
       | breakfastduck wrote:
       | So we blame the software instead of blaming the downright _lazy_
       | people who can 't be fucked learning an appropriate tool even
       | though that's their entire career.
       | 
       | Honestly these kind of things really wind me up.
        
         | Closi wrote:
         | > So we blame the software instead of blaming the downright
         | lazy people who can't be fucked learning an appropriate tool
         | even though that's their entire career.
         | 
         | Further, the software would be fine if you use it properly for
         | this sort of use case (i.e. bring the data in via PowerQuery
         | which has enforced types).
        
         | BugsJustFindMe wrote:
         | In reality the appropriate tool for research data actually
         | doesn't exist (make it and become very rich). Excel is only
         | sort of close if you squint.
        
         | IAmEveryone wrote:
         | There is one problem people have for their use case, so it's
         | inappropriate? Have you ever created an issue on some GitHub
         | project? I guess you were using the wrong tool?
         | 
         | People choose their tools for some reason(s), and there is no
         | law of man or nature that says that using a common tool for
         | some esoteric purpose is somehow wrong, just because you
         | consider it too easy.
        
         | ironmagma wrote:
         | If a problem is re-occurring and between lots of people, it's
         | probably the fault of the system that powers it. The system in
         | this case is the software, so blaming the software is
         | appropriate.
        
           | breakfastduck wrote:
           | No it isn't. This isn't a gotcha. This is probably one of the
           | most well known software behaviors on the planet.
           | 
           | You KNOW the problem exists, yet you STILL choose to use the
           | software, THEN expect it to change its behaviors because its
           | inconvenient for your very specific use case. Microsoft
           | aren't going to fix this behavior because there are likely
           | millions if not billions of spreadsheets that rely on it.
           | 
           | That is not a software problem.
        
             | ironmagma wrote:
             | People don't choose to use Microsoft products, the software
             | is chosen for them by their organization. For a problem
             | like this, the existence of the article may very well be
             | the first step to switching to another solution. The first
             | step in that process is raising awareness that there is a
             | problem.
        
               | breakfastduck wrote:
               | A convenient cop out there.
               | 
               | "I didn't install it, I only chose to use it for
               | something it isnt suited for instead of looking for
               | alternatives, not my fault"
        
               | ironmagma wrote:
               | People working in large organizations are frequently not
               | even allowed to install software that isn't pre-approved
               | by N layers of management. The bottom line though is that
               | Excel is made for data entry. If it's not suitable for
               | that use case, that's a big problem with the product-
               | market fit.
        
               | AmericanChopper wrote:
               | Excel isn't a data entry tool. It's a rather
               | sophisticated analysis tool for tabular data. It's
               | success is derived from the fact that it's very good at
               | this, and is so easy for users to pick up. Mind boggling
               | amounts of commerce and administration are driven my
               | Excel.
               | 
               | It is however most suited to accounting and logistics
               | tasks. For any more novel use case, you're going to have
               | to put some thought into how you use it, or use a more
               | suitable alternative. Even for the tasks that it's very
               | well suited to, there will always be more sophisticated
               | alternatives. But I'd consider the way it lowers the the
               | technical barriers to entry an overwhelming positive,
               | even if it doesn't completely absolve the users from
               | learning about how their data works. I have the same
               | frustrations with developers who have been trained by
               | ORMs to have no idea how RDBMSes work. But I consider it
               | their fault for not learning their trade properly, rather
               | than the fault of the people who make nice tools to help
               | them.
        
               | breakfastduck wrote:
               | I cannot imagine a single piece of software that is more
               | widespread and used without issue by huge swathes of
               | people than excel.
               | 
               | But one specific use case for genetics means its totally
               | unsuitable & and big problem with the product. Got it.
        
               | quantified wrote:
               | Right. I doubt genomics registered as much of a market
               | segment, and probably still doesn't register.
        
               | bserge wrote:
               | If only the organization was run by people.
        
         | psychometry wrote:
         | compbio is not a field you'd want to go in if you want to use
         | well-written tools exclusively. There's tons of absolutely
         | abominable Perl and R code powering critical pipelines
         | everywhere you look.
         | 
         | What makes matters worse is that scientists, who are already
         | not great coders, have to work with even less technical people
         | (i.e. the physicians). These collaborative processes will
         | inevitably involve software like Excel. Need your M.D. co-
         | collaborator to annotate a gene list or whatever? Your CSV's
         | getting re-saved in Excel whether you like it or not.
        
       | codetrotter wrote:
       | That's the price we pay.
       | 
       | A reasonable alternative would be for example statically typing
       | each column or row in the sheet.
       | 
       | So in one sheet column A contains only floats, col B contains
       | only text, col C contains only dates etc. And in another sheet
       | col A may be date type, col B date type also, col C and D text,
       | and col E monetary amounts.
       | 
       | But this would come at the cost of not being able to mix types in
       | column, and a lot of people want to be able to mix.
       | 
       | And that's why we're stuck
        
         | zozbot234 wrote:
         | > A reasonable alternative would be for example statically
         | typing each column or row in the sheet.
         | 
         | Excel actually supports this out of the box. But it's one more
         | option to set, and people get lazy.
        
           | dspillett wrote:
           | Also those properties are lost if you transfer the data to
           | another format then back, or don't properly survive
           | copy/cut/paste operations (often cutting will remove
           | everything about a cell including options such as these).
           | 
           | As well as being easy to be lazy and not use the options, it
           | is easy to accidentally undo them also.
        
         | Closi wrote:
         | > A reasonable alternative would be for example statically
         | typing each column or row in the sheet.
         | 
         | You can actually statically type the columns now for imported
         | data via the PowerQuery editor (which is built into Excel),
         | although not a lot of people know how to use it.
        
         | eitland wrote:
         | Nah, much of this is the same as the dumbness of many
         | autocorrecting keyboards:
         | 
         |  _automatically "correcting" something that was correct based
         | on a dumb hunch_
         | 
         | It is really simple to remove the most embarrassing
         | autocomplete errors on phones:
         | 
         | - just disable autocorrect,
         | 
         | - or preferably use another keyboard that just proposes fixes
         | but leaves it to you to select the suggestion (I use SwiftKey
         | and it does this, others probably exist too).
         | 
         | The same is not possible in Excel as far as I am aware:
         | 
         | You cannot turn of autoformatting.
         | 
         | The whole thing could be easily (on a ux level at least ;-)
         | solved by introducing a single checkbox: "stop embarrassing me"
         | with help text that says "turn off autoformatting".
         | 
         | Checking that box would leave values like you typed or pasted
         | them in.
         | 
         | No need to select on a column basis.
         | 
         | Edit: It is also almost at the level of modern search engines
         | that replace my very specific queries with generic queries
         | about something unrelated or vaguely related.
         | 
         | Edit 2: it is already more than a year ago I guess since
         | iPhones were caught changing the names of correctly spelled
         | medications to the name of another unrelated medication. Sooner
         | or later this incessant drive to be smarter than the user is
         | going to cost lives I'm afraid.
         | 
         | When it comes to search it is already costing me many minutes
         | lost a day, and that is just first order effects, before we get
         | into all the things that doesn't get done because we give up on
         | finding it.
        
           | jjk166 wrote:
           | Right click on top left corner
           | 
           | Select format cells
           | 
           | Select "Text"
           | 
           | This will disable excel's autoformatting. Given the rarity
           | that someone types "SEPT4" into excel and isn't typing a
           | date, putting the option one level deep in a menu seems more
           | appropriate than a top level button.
        
             | eitland wrote:
             | > Given the rarity that someone types "SEPT4" into excel
             | and isn't typing a date, putting the option one level deep
             | in a menu seems more appropriate than a top level button.
             | 
             | Given the rarity that someone has pasted 100 000 rows into
             | and want them to be autoborked disabling autoborking, at
             | least for pasted data seems appropriate enough for me ;-)
        
               | jjk166 wrote:
               | I would expect dates to show up in large datasets pasted
               | into excel orders of magnitude more often than genes. For
               | the people who use excel, autoformatting is
               | overwhelmingly the desired default behavior.
        
             | ironmagma wrote:
             | Note how nowhere in those instructions is the phrase
             | "autocorrect."
        
               | jjk166 wrote:
               | Why would it? Excel doesn't have autocorrect, it has
               | autoformat - a feature controlled in its formatting menu.
        
         | wtallis wrote:
         | > But this would come at the cost of not being able to mix
         | types in column, and a lot of people want to be able to mix.
         | 
         | I feel like a lot of the desire for this comes from Excel's
         | desire to fill the screen with a single infinite spreadsheet.
         | Apple's Numbers and some other programs make it more natural to
         | have multiple separate tables on screen, each of which can have
         | their own header rows and columns and separate type and format
         | rules for those columns/rows. Excel more or less forces people
         | to emulate this capability by just using a separate region of
         | the same table, which gets in the way of applying consistent
         | formatting and typing rules to whole columns or rows.
        
       | tssva wrote:
       | This isn't a Microsoft or Excel issue. It is a choosing a
       | spreadsheet issue. LibreOffice and Google Sheets for instance by
       | default also try to determine what type of data is being entered.
       | 
       | The issue is choosing the incorrect tool or if you have no choice
       | but to use a spreadsheet failing to learn how your tools work.
        
       | HPsquared wrote:
       | They might move to a relational database, but then the gene named
       | '; DROP TABLE genes; will start causing problems.
        
       | OptionX wrote:
       | Can anyone with first hand experience on this tell shed some
       | light on why stick with Excel versus using python, R or what have
       | you to treat the data?
        
         | gerdesj wrote:
         | "That's something Purdie says she doesn't have time for. She
         | has adapted to Excel's quirks, adding apostrophes before
         | commonly affected genes to prevent the conversion, or pre-
         | formatting spreadsheet cells before importing data. "It's one
         | of those things that I just accept," she says."
        
           | qayxc wrote:
           | Wait, if she knows how to avoid the issue, how is it still an
           | issue?
        
         | enaaem wrote:
         | It might be easier to teach people how to use excel correctly
         | with for example power query.
        
         | jpeloquin wrote:
         | As a biomedical researcher who prefers using python & R over
         | Excel:
         | 
         | (1) Graduate students (who are the bulk of the scientific labor
         | force) usually have some training in python, R, or Matlab, but
         | are seldom practiced in applying it to real-world work. So if a
         | lab standardizes on python, R, etc. the senior people have to
         | do a lot of extra work to support the programming, rather than
         | doing work with intrinsic scientific value. It's easier to
         | teach the Excel quirks than to teach effective programming
         | practices. Excel quirks are concrete, and "good practices" in
         | programming are vague and situational.
         | 
         | (2) Python & R have their own pitfalls. It's easy to apply the
         | wrong filters to a data set and compare subsets that aren't
         | what you think you're comparing. Although when they go wrong,
         | they go very wrong, and this is easier for a supervisor to
         | notice.
         | 
         | (3) Excel has rich text formatting, graph embedding, and data
         | types, so it's very useful as a human-readable data interchange
         | & summary format even if you never do computation in it.
         | 
         | IMO neither programming languages nor Excel are a great fit for
         | data analysis. Something purpose-made, like JMP or even
         | GraphPad, is probably the better choice in most situations.
         | Programming gets you automation but at the cost of high
         | complexity. Since you still need to look at and think about the
         | data there's a limit to how much time automation can save
         | (Amdahl's law).
        
         | happytoexplain wrote:
         | Is there any reason to suspect that the primary factor is
         | something other than the difference between an extremely
         | featured end user GUI and _programming_?
        
         | gpapilion wrote:
         | I took a programming for math majors course in college, and the
         | largest challenge for the class was understanding procedural
         | looping to process data sets. Understanding a for loop or a
         | while loop really was a hard concept for someone to understand.
         | We didn't cover any real data structures, and everything was a
         | 2 dimensional array.
         | 
         | Excel, and most spreadsheets for that, avoid some of the basic
         | control structures beginners find challenging. Many functions
         | are basically map and reduce functions so you typically are
         | getting the same number of cells back, or one. Often maps are
         | done visually in the spreadsheet with a formula being applied
         | to every cell in a column.
        
         | delosrogers wrote:
         | From my experience Python/R can be great when you have large
         | scale analysis to do but if you have a task that requires more
         | manual fiddling with the data then it's much nicer to use
         | Excel.
        
         | taeric wrote:
         | The same line of questioning can easily land in "why use csv?"
         | 
         | It is a terrible format that has very few redeeming qualities.
         | It happens to also be the most widely used one.
        
         | cm2187 wrote:
         | When you have done programming for a while and it is mostly
         | muscle memory, it is easy to forget how much it is an
         | impenetrable black box to someone who has never done it.
         | 
         | Try to tell someone who just needs to get a task done that he
         | needs to spend the next 3 months learning programming from
         | scratch vs using excel.
        
           | breakfastduck wrote:
           | Tough shit, though. Thats what they're being paid for.
           | 
           | "A bad workman always blames his tools" has never been truer
           | than this situation.
        
         | minikites wrote:
         | Given that these are relatively unskilled users, would
         | switching away from Excel increase or decrease errors? I don't
         | think it's a given that errors would decrease by switching away
         | from Excel.
        
           | instagraham wrote:
           | I think there is a difference between the errors. Excel
           | autocorrect errors are easy to miss as it is the software
           | making an arbritrary decision over your own.
           | 
           | Errors made from using a new , presumably neutral software,
           | would be the kind you are supposed to catch as a researcher.
           | 
           | There is obviously scope for error in everything but Excel's
           | user-hostile methods are not justified by this.
        
           | an_d_rew wrote:
           | Switch to what?
           | 
           | Excel is everywhere.
           | 
           | People are familiar with it. The same people are often not
           | familiar with, and have never used, and may even be confused
           | by "notepad.exe".
        
             | ltbarcly3 wrote:
             | These aren't someone's 85 year old grandparent getting a
             | computer for the first time to look at pictures of their
             | great grandchildren, these are people who have used a
             | computer every day since they were 14 and have advanced
             | STEM degrees. I think they are probably re-trainable.
        
               | Iwan-Zotow wrote:
               | XXX YYY ZZZ, PhD in genomics, probably re-trainable
        
           | blackbear_ wrote:
           | Anecdotically, I was talking to an acquaintance who decided
           | to pick up R as a replacement for excel for data analysis and
           | they were positively blown away by the possibilities opened
           | by the new mindset that came with programming.
           | 
           | So yes, maybe these "relatively unskilled users" will
           | struggle for a few weeks/months, but it will be a net
           | positive change as many of them will see immense benefit
           | after some tinkering.
        
         | an_d_rew wrote:
         | Because an absolutely enormous number of people involved in all
         | ends of genomics and biomedical science have absolutely no
         | inkling whatsoever about programming.
         | 
         | Furthermore, Microsoft office is installed everywhere
         | everywhere everywhere... like it leave it love it or hate it,
         | it doesn't matter... it's true (at least for huge swaths of the
         | demographic).
         | 
         | So the first thing most people learn is excel and the last
         | thing most people use is excel. It's a database, it's a
         | spreadsheet, you can even use it as a word processor if you
         | want to. And some have.
         | 
         | I'm not saying you could and I'm not saying you should, I'm
         | just answering as to "why".
        
           | OptionX wrote:
           | Even newcomers?
           | 
           | I've had my education in a engineering focused campus so that
           | may skew it, but even not IT related area had programming
           | courses to, at least, learn the basics.
           | 
           | Do the newer researcher still show reluctance to move from
           | Excel or is it an old guard kind of deal?
        
             | jasode wrote:
             | _> why stick with Excel versus using python, R or what have
             | you to treat the data? [...] Do the newer researcher still
             | show reluctance to move from Excel_
             | 
             | The feature of Excel that's underestimated/overlooked by
             | skilled programmers of Python/R is that spreadsheets
             | _immediately expose an editable visual spatial canvas
             | datagrid GUI_.
             | 
             | I have decades of programming C/C++/C#/Python/etc and yet
             | _I create new spreadsheets and use them every day._
             | 
             | Spreadsheets are much faster than wiring up a datagrid in
             | C# or C++ Qt gui widget editor or any other "true"
             | programming language. Spreadsheets are also faster than
             | using Python package Pandas to import a csv into a
             | dataframe and view it in a Jupyter notebook. And last time
             | I checked, displaying an _output_ grid of cells in Jupyter
             | is _read-only_ and not 2-way editable like Excel.
             | 
             | And yes, the complaint is that _" MS Excel calculations are
             | not auditable, repeatable, version controlled, etc"_. All
             | true, but it still doesn't change the fact that Python
             | doesn't have an instant datagrid GUI. UI affordances also
             | matter in viewing science data sets as well as financial
             | budgets.
        
               | int_19h wrote:
               | Besides, these are not mutually exclusive - you can use
               | Excel as the front-end to your Python code via something
               | like PyXLL.
        
           | fartcannon wrote:
           | Yeah. At some point though, people who want to do something
           | that requires python or R, but refuse to learn it, and then
           | use Excel (and make mistakes like the article mentions)
           | should reconsider their choices. This is really out of
           | character for me, but in this instance, relying on excel to
           | do everything is a mistake by the user, not Microsoft's
           | ridiculous data destroying import function.
           | 
           | Use the right tool.
        
             | qayxc wrote:
             | > Use the right tool.
             | 
             | Step one should be learn your tool, no matter the tool.
             | 
             | There's several ways to avoid the issue in Excel. If
             | someone working with Excel isn't able to learn that (heck,
             | a simple template would suffice), I have no hope the same
             | demographic would have any success with R or Python.
        
               | da_chicken wrote:
               | > _There 's several ways to avoid the issue in Excel._
               | 
               | No, not really. There are several ways to help reduce the
               | issue, but none of them eliminate what it does.
               | 
               | Example that comes to mind is the data file for College
               | Board's SAT test. The data formats for student reports
               | for schools come in two formats: PDF (one page per
               | student), CSV, and fixed-width. That is the comprehensive
               | list of your options. College Board doesn't care about
               | you as a customer. They're too big. Any request you
               | submit will be black holed.
               | 
               | Some of the columns in the file indicate a range, usually
               | in the format "X-Y". Excel will try to coerce that into a
               | date, if it's valid.
               | 
               | Other columns indicate a ratio, expressed as "X/Y". Excel
               | will coerce that into a date, if it's valid.
               | 
               | Other columns indicate an ID number, expressed as a
               | large, fixed-digit number, zero-padded. Excel will coerce
               | that into an integer, or, if it's too long, into
               | scientific notation discarding digits.
               | 
               | It doesn't matter how you format the CSV. Excel will do
               | the above.
               | 
               | Here's an example CSV:                 ID,Range,Ratio
               | 12345678901234567890,8-9,"7/15"
               | "00000000000000012345","9-10",21/35
               | 
               | I open that with Excel and immediately save it as a CSV.
               | I look at the file in a text editor and I see:
               | ID,Range,Ratio       1.23457E+19,9-Aug,15-Jul
               | 12345,10-Sep,21/35
               | 
               | Do you have any idea how fun it is to explain to teachers
               | and school administrators what happened here?
               | 
               | The correct way to work with this data file is: Do not,
               | under any circumstances, open it with Excel if you expect
               | to use it for anything else.
               | 
               | The problem is, there are very few applications that work
               | well with CSV files. I know of CsvEd and Delimit. There
               | are several text editors with a CSV column mode that
               | makes the file look like a table (with varying degrees of
               | success). All of these vary between "godawful" and "a
               | complete nightmare" in terms of performance and usability
               | compared to Excel.
        
               | dragonwriter wrote:
               | > It doesn't matter how you format the CSV. Excel will do
               | the above.
               | 
               | Only if you _open_ the file, so that Exel assumes
               | defaults for formatting. If you _import_ the CSV (Data -
               | > From Text in the current UI) you can specify the format
               | with the Import Text Wizard as described on the College
               | Board instructions for using the file. Except...
               | 
               | Unfortunately, the College Board has outdated
               | instructions on their website; Excel used to offer the
               | Import Text Wizard on opening a csv or text file, rather
               | than making default assumptions, and the College Board
               | instructions page, while it does tell you to use that
               | Import Text Wizard and provide all the details of what to
               | plug into that wizard, tells you to open the file to get
               | it, which bypasses the wizard.
               | 
               | Things like this is why clerical and other low-level
               | positions involving excel test candidates on proficiency
               | with specific versions of excel. Higher-level workers are
               | expected to be able to figure out these kind of changes
               | themselves, though (or consume lower-level staff time, if
               | they are in management.
        
               | da_chicken wrote:
               | Yeah, that's also obnoxious. The Data From Text wizard
               | lets you easily import the data... but if you do that
               | then the first row no longer defaults to headers. So if
               | you open it that way and want to filter or sort the data,
               | your headers will disappear. You have to go into full
               | blown Power Query to do both prevent reformatting and use
               | column headers, and you have to reassign the data types
               | on a much more complex interface.
               | 
               | We have now changed a 1 second automated process that
               | anybody could do, into a 2 minute manual process that
               | requires knowledge of data types. We've gone from a non-
               | technical task to a highly technical task. To open a file
               | without completely munging the data.
               | 
               | It also will sometimes do weird things like exclusive
               | lock the file on disk until you close every open window
               | of Excel because it creates data connections to the file.
        
               | dragonwriter wrote:
               | > The Data From Text wizard lets you easily import the
               | data... but if you do that then the first row no longer
               | defaults to headers. So if you open it that way and want
               | to filter or sort the data, your headers will disappear.
               | You have to go into full blown Power Query to do both
               | prevent reformatting and use column headers
               | 
               | You don't, because while the wizard doesn't _default_ to
               | headers, the first screen of the wizard has a checkbox
               | for it, so the option is there without Power Query.
        
               | occamrazor wrote:
               | If you open the csv file with Excel you get those errors.
               | If instead one uses Get&Transform (aka Power Query) then
               | it is possible to specify the type of each column.
        
               | pletnes wrote:
               | As someone who has programmed in python for a decade, and
               | no idea how to do anything right in Excel, I would beg to
               | differ.
        
               | qayxc wrote:
               | Do you use Excel on a daily basis for productive work as
               | well?
               | 
               | If not then your personal experience is irrelevant in
               | this context.
        
             | CobaltFire wrote:
             | Not specifically this topic but I ran into an issue that
             | would have been trivial in any real programming language
             | this week, but I'm not able to USE anything that's not on
             | my work computer due to data restrictions.
             | 
             | So I get to cobble together some really ugly spreadsheets
             | in excel when I'd MUCH rather use something more
             | appropriate. Sometimes it's not the person, it's the org.
             | In this case, government.
        
       | tcskeptic wrote:
       | We need a clippy type feature "It looks like you are working on
       | genomics data, should I turn off all the data destroying
       | auotcorrect features?"
        
         | UnFleshedOne wrote:
         | "It looks like you are working on genomics data, should I
         | direct you to a search engine to find appropriate tooling
         | instead?"
        
       | delosrogers wrote:
       | This exact problem burned me when I started doing
       | transcriptomics, one of the things I found that helps mitigate
       | the problem is to always keep both the gene symbol and a ID like
       | an Ensemble or Entrez ID for every data point because those don't
       | get mangled by Excel
        
       | PedroBatista wrote:
       | I generally am very critical of my own gang ( programmers ) for
       | being so into their own bubble they completely lose perspective
       | of the real World, and with that ignorance comes the usual
       | arrogance.
       | 
       | But this time I feel we should give the business to the
       | geneticists. The Excel "problem" is known and has been known for
       | a long time, finance guys and policy makers were maybe the first
       | big ones to "discover" this for slightly different reasons.
       | 
       | Then WHY on Earth these top professionals, elites, "creme de la
       | creme" people continue on this path? And it's not like they love
       | Excel, quite the contrary. Yes, the main reason is they are not
       | programmers so they need to find a way to hack something to solve
       | their problem. The problem with their "problem" solving is they
       | are being slobs in a work that requires the EXACT opposite.
       | 
       | "Oh now we need to hire a programmer too?" - Yes, you do.
       | 
       | "But we don't have the budget for it" - Yes, you do. You see,
       | money is never enough for anything, it's a question of priorities
       | and when you blow millions of dollars with fancy experiments to
       | ruin your career because of Excel, maybe that Python/R/Julia guy
       | was a bargain after all.
       | 
       | I know some of these people, and under the aura of lab coats and
       | distinguished professors it's just some guy/gal trying to crank
       | that paper ASAP to keep the hamster wheel spinning AND they are
       | being sloppy about the data and process AND most of them know it.
       | 
       | Sorry, but no excuses.
        
         | breakfastduck wrote:
         | Yeah. They use poor/ unsuitable tooling known for unusual data
         | quirks with absolutely no effort to look at alternatives or pay
         | for an expert.
         | 
         | No sympathy from me. Embarrassing. I certainly wouldn't get
         | away with such negligence at my workplace.
        
         | Closi wrote:
         | The underlying assumption here is that when transferred into R
         | or Julia, with an equivalent level of development effort and
         | testing, that there will be no/fewer bugs. I'm not entirely
         | sure that's true.
         | 
         | I think we are comparing something that is knocked out in an
         | hour in excel to someone spending 4-10x of the time doing it in
         | Python / Julia, but I don't think that's a like-for-like
         | comparison.
        
       ___________________________________________________________________
       (page generated 2021-08-17 23:02 UTC)