[HN Gopher] Autocorrect errors in Excel still creating genomics ...
___________________________________________________________________
Autocorrect errors in Excel still creating genomics headache
Author : tchalla
Score : 69 points
Date : 2021-08-17 16:09 UTC (6 hours ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| gpapilion wrote:
| Autocorrect(and predictive text) has reached an interesting
| milestone, good enough that you mainly rely on it, but
| occasionally it distorts the data you intended to send.
| Disengaging the technology isn't easy, or even understanding
| field types it shouldn't correct(everyday my work VPN auth window
| tries to autocorrect my username). There is so much context that
| is unavailable to the computer, and there is no good way to
| provide it.
|
| I think it has interesting implications for self-driving, since
| as far as I can tell we're at an earlier but similar stage in its
| development; works for simple situations, likely to drive you
| into a wall for no reason. We build so much trust in these
| systems, we stop paying attention to what they are doing, and
| then pay the price when it behaves in an unexpected way to the
| user.
| soco wrote:
| I get bitten by Excel's auto-correct all the time and I work
| nothing related to genomics. It's in my opinion just a dumb
| decision of Microsoft to not let you disable all those smarty
| feats (like, thinking you have a date if your string kinda
| looks like one)
| guitarbill wrote:
| It's easy to hate on Excel or Microsoft, a "dumb" program and a
| faceless mega corporation. Don't get me wrong, I'm no fan. But
| the real question is what are journals doing? It seems like the
| obvious solution is for editors and reviewers to do a quick
| search for these errors. To me, this is more of an indication of
| how poor most journals really are.
|
| (You could argue that supplementary data should be in an open
| format, so you could automate this checking. I'm not against
| this, either.)
| [deleted]
| nnmg wrote:
| This is an important point. What have the journals done? Raised
| their prices and business as usual.
|
| Scientific editors do _nothing_ for data validation. There is
| no accountability, even after retractions.
|
| Scientific journal editors are glorified gatekeepers for "high
| impact" work (read: flashy), and then use free reviewer labor
| to cover themselves so they can call it 'peer reviewed'. In the
| rare cases when journals do require supporting data, they
| explicitly ask for excel spreadsheets :(
| nnmg wrote:
| Excel is used as a database/storage/interchange format,
| especially after the initial analysis by someone who uses python
| or R. Bioinformatician does the analysis, then the PI wants to
| see it so they can Ctrl-F for genes they are interested in, so
| out comes an excel document.
|
| And really, even if you know python or R, are you really going to
| fire up a jupyter notebook, load the data, and run pandas queries
| every time someone in lab meeting or after a talk asks you about
| this gene or that gene in your data?
|
| I think the important question is why is date conversion a
| default? Would it really break backwards compatibility for MS
| Excel users if date conversions were explicit instead of
| automatic? Turning that off by default would fix a lot of this.
| dec0dedab0de wrote:
| _And really, even if you know python or R, are you really going
| to fire up a jupyter notebook, load the data, and run pandas
| queries every time someone in lab meeting or after a talk asks
| you about this gene or that gene in your data?_
|
| I don't do any scientific research, but I have been using
| jupyter as a replacement for excel since it was called the
| ipython notebook. I don't really use pandas all that often, I
| just find it easier to read and edit data in python. Though I
| first learned ipython added the notebook from a talk Wes
| McKinney gave about Pandas.
| BugsJustFindMe wrote:
| > _Excel is used as a database /storage/interchange format,
| especially after the initial analysis by someone who uses
| python or R. Bioinformatician does the analysis_
|
| Sometimes, but the situation is in reality worse than that.
| Excel is also used as the gold standard
| database/storage/interchange format of record for random shit
| that clinical researchers have typed in by hand whether
| directly or transcribed from other notes, often when that data
| isn't actually fundamentally tabular in nature because people
| really like working in grids. Even when grids hurt more than
| help.
|
| A big secret in genetic research is that the MDs, grad
| students, project managers, and coordinators running the
| research programs are often not super focused on what well-
| structured data looks like and don't know what things like
| "key-value store" or "nested tree-like structure" mean, and
| even if they did there aren't good GUI tools for entering them
| anyway, and it leads to countless errors that maybe (here I
| speculate) they just assume will wash out as noise.
|
| > _I think the important question is why is date conversion a
| default?_
|
| Yes, why any kind of conversion is ever the default is a real
| money question.
| quantified wrote:
| For the finance and business office worker, it seems to have
| traction. Just like auto-creating an emoji when you type a :
| character. Excel is for offices, not specializations of
| scientists. Bummer.
| SonicScrub wrote:
| I don't work in bioinformatics, but what you are describing is
| a completely accurate description of what I experienced working
| in manufacturing quality control. Raw data came in from
| suppliers in the form of spreadsheets, and management wanted to
| see results in spreadsheets. Meaning all our quality data was
| subjected to these issues. The date formatting issue was a
| particularly annoying "gotcha", particularly when features were
| defined with a XX-XX numeric code. The number of times I had to
| deal with someone in a meeting saying "hey, why is this feature
| called October-13?!" Super frustrating.
|
| If I could choose the tools used by the whole process involving
| multiple different companies and departments, hey I would! It
| would be python all the way down. But I was but a cog in a
| massive organization.
| dragonwriter wrote:
| > A lot of "safety culture" is composed of things like
| checklists and hazard warnings which are more geared towards
| shifting the blame for accidents onto somebody else than
| actually preventing those accidents,
|
| If you _stay in spreadsheets_ these problems mostly don't
| occur (that is, once data entry is squared away so that the
| initial spreadsheet has what you want it doesn 't tend to get
| lost), its when you move in and out of spreadsheets via text
| and take the path of least resistance [0] to do the
| transition that the problem occurs.
|
| [0] and to be fair, there is a _lot_ of resistance off that
| path.
| SonicScrub wrote:
| The process I had to deal with was filling out spreadsheets
| with data from a python-driven 3D inspection program that
| exported out data files in CSV format. Needless to say,
| these errors were inevitable for exactly the reasons you've
| stated. Why we didn't bypass the large, poorly formatted
| cumbersome spreadsheets and just directly export data via
| pandas? All the inspection was done by Python anyways. You
| tell me! Also, it did not help that the spreadsheets were
| not created by me, or any colleagues in my department.
|
| God I hated working in old-school
| engineering/manufacturing. "That's not how we do things" is
| the answer to everything. I
| planet-and-halo wrote:
| Serious question, is this a viable area for a startup product?
| Obviously nothing will easily displace Excel for general
| spreadsheet work, but I could see room for a domain-specific
| spreadsheet that has core features and supports just a basic set
| of functions relevant to genomics, + a "we won't fuck up your
| data" feature.
| Closi wrote:
| I think the size and scope of Excel is hard to replicate,
| however it sounds like it would make a great plugin.
| masklinn wrote:
| Probably not: people use excel because it's available, it's
| what they know, and it's what their peers use.
|
| It's been going for more than half a decade now, and
| genomicists apparently would rather rename genes than stop
| using excel...
| mceoin wrote:
| Hi - I'm building a networked spreadsheet product and had heard
| of this auto-correct problem in genomics research through word
| of mouth a couple of times, so have asked this question myself.
| (I was astonished to learn just how much of genomics work
| involves sending Excel files back and forth!) Here's the
| conclusion that I have come to:
|
| - It's definitely possible to build a custom spreadsheet
| product with a small team, even one targeted at such a "niche"
| user group. So it's an idea worth testing.
|
| - Product can be "backwards compatible" - you can export to
| xlxs and import from xlxs - so you don't have to change
| behaviour of the entire industry on day one to get this to
| work, only a single genomics researcher or lab.
|
| - Pricing, unit economics, etc are unknown to me (I have no
| background in genomics or scientific research). But presumably
| you could leverage standard SaaS models and build a viable
| model up from there using a few case studies. There's
| definitely schleppy behaviour going on here that can be solved.
|
| - Even a "lifestyle business" has significant upside beyond the
| financial: improving genomics research improves genomic
| research!
|
| - The product advantage over time presumably involve building
| more custom tooling into the genomics / data ecosystem. "Not
| creating typos" is just the beachhead.
|
| I've never actually interviewed genomics people about their
| need here, but if anyone knows people with this problem I would
| love to talk to them: @mceoin on twitter. (DMs open)
| prionassembly wrote:
| The easier solution is coming up with a 1:1 human-readable-
| hashing scheme so that MARCH4 translates to "funny-blue-smell-
| tuesday". Then you keep the safe key in a column next to the
| unsafe key.
| NullPrefix wrote:
| >a "we won't fuck up your data" feature
|
| That's a hard to implement feature. Hand waving and buzzword
| lingo will not be enough for people to believe it.
| masklinn wrote:
| > That's a hard to implement feature.
|
| It really isn't: just remove the data autodetection anything
| and you're done.
| cm2187 wrote:
| That's an expensive workaround to just formatting your cells as
| text before entering the values
| dspillett wrote:
| You'd be surprised how much effort can be saved by not
| relying on human interaction to be reliable.
|
| Though convincing people to switch to a new app for that
| reason, even if it were free and Free, would still be an
| uphill struggle.
|
| _> just formatting your cells as text before entering the
| values_
|
| It is a little more than that. I work with finance people,
| and there a lot of data is _manipulated_ in Excel but passed
| around as CSV files for compatibility elsewhere. This causes
| no end of problems because fixes by setting cell properties
| are obviously lost in transcription, and date errors creep in
| as things are moved back & forth between people in the US
| and those in locales that do dates properly.
| breakfastduck wrote:
| So we blame the software instead of blaming the downright _lazy_
| people who can 't be fucked learning an appropriate tool even
| though that's their entire career.
|
| Honestly these kind of things really wind me up.
| Closi wrote:
| > So we blame the software instead of blaming the downright
| lazy people who can't be fucked learning an appropriate tool
| even though that's their entire career.
|
| Further, the software would be fine if you use it properly for
| this sort of use case (i.e. bring the data in via PowerQuery
| which has enforced types).
| BugsJustFindMe wrote:
| In reality the appropriate tool for research data actually
| doesn't exist (make it and become very rich). Excel is only
| sort of close if you squint.
| IAmEveryone wrote:
| There is one problem people have for their use case, so it's
| inappropriate? Have you ever created an issue on some GitHub
| project? I guess you were using the wrong tool?
|
| People choose their tools for some reason(s), and there is no
| law of man or nature that says that using a common tool for
| some esoteric purpose is somehow wrong, just because you
| consider it too easy.
| ironmagma wrote:
| If a problem is re-occurring and between lots of people, it's
| probably the fault of the system that powers it. The system in
| this case is the software, so blaming the software is
| appropriate.
| breakfastduck wrote:
| No it isn't. This isn't a gotcha. This is probably one of the
| most well known software behaviors on the planet.
|
| You KNOW the problem exists, yet you STILL choose to use the
| software, THEN expect it to change its behaviors because its
| inconvenient for your very specific use case. Microsoft
| aren't going to fix this behavior because there are likely
| millions if not billions of spreadsheets that rely on it.
|
| That is not a software problem.
| ironmagma wrote:
| People don't choose to use Microsoft products, the software
| is chosen for them by their organization. For a problem
| like this, the existence of the article may very well be
| the first step to switching to another solution. The first
| step in that process is raising awareness that there is a
| problem.
| breakfastduck wrote:
| A convenient cop out there.
|
| "I didn't install it, I only chose to use it for
| something it isnt suited for instead of looking for
| alternatives, not my fault"
| ironmagma wrote:
| People working in large organizations are frequently not
| even allowed to install software that isn't pre-approved
| by N layers of management. The bottom line though is that
| Excel is made for data entry. If it's not suitable for
| that use case, that's a big problem with the product-
| market fit.
| AmericanChopper wrote:
| Excel isn't a data entry tool. It's a rather
| sophisticated analysis tool for tabular data. It's
| success is derived from the fact that it's very good at
| this, and is so easy for users to pick up. Mind boggling
| amounts of commerce and administration are driven my
| Excel.
|
| It is however most suited to accounting and logistics
| tasks. For any more novel use case, you're going to have
| to put some thought into how you use it, or use a more
| suitable alternative. Even for the tasks that it's very
| well suited to, there will always be more sophisticated
| alternatives. But I'd consider the way it lowers the the
| technical barriers to entry an overwhelming positive,
| even if it doesn't completely absolve the users from
| learning about how their data works. I have the same
| frustrations with developers who have been trained by
| ORMs to have no idea how RDBMSes work. But I consider it
| their fault for not learning their trade properly, rather
| than the fault of the people who make nice tools to help
| them.
| breakfastduck wrote:
| I cannot imagine a single piece of software that is more
| widespread and used without issue by huge swathes of
| people than excel.
|
| But one specific use case for genetics means its totally
| unsuitable & and big problem with the product. Got it.
| quantified wrote:
| Right. I doubt genomics registered as much of a market
| segment, and probably still doesn't register.
| bserge wrote:
| If only the organization was run by people.
| psychometry wrote:
| compbio is not a field you'd want to go in if you want to use
| well-written tools exclusively. There's tons of absolutely
| abominable Perl and R code powering critical pipelines
| everywhere you look.
|
| What makes matters worse is that scientists, who are already
| not great coders, have to work with even less technical people
| (i.e. the physicians). These collaborative processes will
| inevitably involve software like Excel. Need your M.D. co-
| collaborator to annotate a gene list or whatever? Your CSV's
| getting re-saved in Excel whether you like it or not.
| codetrotter wrote:
| That's the price we pay.
|
| A reasonable alternative would be for example statically typing
| each column or row in the sheet.
|
| So in one sheet column A contains only floats, col B contains
| only text, col C contains only dates etc. And in another sheet
| col A may be date type, col B date type also, col C and D text,
| and col E monetary amounts.
|
| But this would come at the cost of not being able to mix types in
| column, and a lot of people want to be able to mix.
|
| And that's why we're stuck
| zozbot234 wrote:
| > A reasonable alternative would be for example statically
| typing each column or row in the sheet.
|
| Excel actually supports this out of the box. But it's one more
| option to set, and people get lazy.
| dspillett wrote:
| Also those properties are lost if you transfer the data to
| another format then back, or don't properly survive
| copy/cut/paste operations (often cutting will remove
| everything about a cell including options such as these).
|
| As well as being easy to be lazy and not use the options, it
| is easy to accidentally undo them also.
| Closi wrote:
| > A reasonable alternative would be for example statically
| typing each column or row in the sheet.
|
| You can actually statically type the columns now for imported
| data via the PowerQuery editor (which is built into Excel),
| although not a lot of people know how to use it.
| eitland wrote:
| Nah, much of this is the same as the dumbness of many
| autocorrecting keyboards:
|
| _automatically "correcting" something that was correct based
| on a dumb hunch_
|
| It is really simple to remove the most embarrassing
| autocomplete errors on phones:
|
| - just disable autocorrect,
|
| - or preferably use another keyboard that just proposes fixes
| but leaves it to you to select the suggestion (I use SwiftKey
| and it does this, others probably exist too).
|
| The same is not possible in Excel as far as I am aware:
|
| You cannot turn of autoformatting.
|
| The whole thing could be easily (on a ux level at least ;-)
| solved by introducing a single checkbox: "stop embarrassing me"
| with help text that says "turn off autoformatting".
|
| Checking that box would leave values like you typed or pasted
| them in.
|
| No need to select on a column basis.
|
| Edit: It is also almost at the level of modern search engines
| that replace my very specific queries with generic queries
| about something unrelated or vaguely related.
|
| Edit 2: it is already more than a year ago I guess since
| iPhones were caught changing the names of correctly spelled
| medications to the name of another unrelated medication. Sooner
| or later this incessant drive to be smarter than the user is
| going to cost lives I'm afraid.
|
| When it comes to search it is already costing me many minutes
| lost a day, and that is just first order effects, before we get
| into all the things that doesn't get done because we give up on
| finding it.
| jjk166 wrote:
| Right click on top left corner
|
| Select format cells
|
| Select "Text"
|
| This will disable excel's autoformatting. Given the rarity
| that someone types "SEPT4" into excel and isn't typing a
| date, putting the option one level deep in a menu seems more
| appropriate than a top level button.
| eitland wrote:
| > Given the rarity that someone types "SEPT4" into excel
| and isn't typing a date, putting the option one level deep
| in a menu seems more appropriate than a top level button.
|
| Given the rarity that someone has pasted 100 000 rows into
| and want them to be autoborked disabling autoborking, at
| least for pasted data seems appropriate enough for me ;-)
| jjk166 wrote:
| I would expect dates to show up in large datasets pasted
| into excel orders of magnitude more often than genes. For
| the people who use excel, autoformatting is
| overwhelmingly the desired default behavior.
| ironmagma wrote:
| Note how nowhere in those instructions is the phrase
| "autocorrect."
| jjk166 wrote:
| Why would it? Excel doesn't have autocorrect, it has
| autoformat - a feature controlled in its formatting menu.
| wtallis wrote:
| > But this would come at the cost of not being able to mix
| types in column, and a lot of people want to be able to mix.
|
| I feel like a lot of the desire for this comes from Excel's
| desire to fill the screen with a single infinite spreadsheet.
| Apple's Numbers and some other programs make it more natural to
| have multiple separate tables on screen, each of which can have
| their own header rows and columns and separate type and format
| rules for those columns/rows. Excel more or less forces people
| to emulate this capability by just using a separate region of
| the same table, which gets in the way of applying consistent
| formatting and typing rules to whole columns or rows.
| tssva wrote:
| This isn't a Microsoft or Excel issue. It is a choosing a
| spreadsheet issue. LibreOffice and Google Sheets for instance by
| default also try to determine what type of data is being entered.
|
| The issue is choosing the incorrect tool or if you have no choice
| but to use a spreadsheet failing to learn how your tools work.
| HPsquared wrote:
| They might move to a relational database, but then the gene named
| '; DROP TABLE genes; will start causing problems.
| OptionX wrote:
| Can anyone with first hand experience on this tell shed some
| light on why stick with Excel versus using python, R or what have
| you to treat the data?
| gerdesj wrote:
| "That's something Purdie says she doesn't have time for. She
| has adapted to Excel's quirks, adding apostrophes before
| commonly affected genes to prevent the conversion, or pre-
| formatting spreadsheet cells before importing data. "It's one
| of those things that I just accept," she says."
| qayxc wrote:
| Wait, if she knows how to avoid the issue, how is it still an
| issue?
| enaaem wrote:
| It might be easier to teach people how to use excel correctly
| with for example power query.
| jpeloquin wrote:
| As a biomedical researcher who prefers using python & R over
| Excel:
|
| (1) Graduate students (who are the bulk of the scientific labor
| force) usually have some training in python, R, or Matlab, but
| are seldom practiced in applying it to real-world work. So if a
| lab standardizes on python, R, etc. the senior people have to
| do a lot of extra work to support the programming, rather than
| doing work with intrinsic scientific value. It's easier to
| teach the Excel quirks than to teach effective programming
| practices. Excel quirks are concrete, and "good practices" in
| programming are vague and situational.
|
| (2) Python & R have their own pitfalls. It's easy to apply the
| wrong filters to a data set and compare subsets that aren't
| what you think you're comparing. Although when they go wrong,
| they go very wrong, and this is easier for a supervisor to
| notice.
|
| (3) Excel has rich text formatting, graph embedding, and data
| types, so it's very useful as a human-readable data interchange
| & summary format even if you never do computation in it.
|
| IMO neither programming languages nor Excel are a great fit for
| data analysis. Something purpose-made, like JMP or even
| GraphPad, is probably the better choice in most situations.
| Programming gets you automation but at the cost of high
| complexity. Since you still need to look at and think about the
| data there's a limit to how much time automation can save
| (Amdahl's law).
| happytoexplain wrote:
| Is there any reason to suspect that the primary factor is
| something other than the difference between an extremely
| featured end user GUI and _programming_?
| gpapilion wrote:
| I took a programming for math majors course in college, and the
| largest challenge for the class was understanding procedural
| looping to process data sets. Understanding a for loop or a
| while loop really was a hard concept for someone to understand.
| We didn't cover any real data structures, and everything was a
| 2 dimensional array.
|
| Excel, and most spreadsheets for that, avoid some of the basic
| control structures beginners find challenging. Many functions
| are basically map and reduce functions so you typically are
| getting the same number of cells back, or one. Often maps are
| done visually in the spreadsheet with a formula being applied
| to every cell in a column.
| delosrogers wrote:
| From my experience Python/R can be great when you have large
| scale analysis to do but if you have a task that requires more
| manual fiddling with the data then it's much nicer to use
| Excel.
| taeric wrote:
| The same line of questioning can easily land in "why use csv?"
|
| It is a terrible format that has very few redeeming qualities.
| It happens to also be the most widely used one.
| cm2187 wrote:
| When you have done programming for a while and it is mostly
| muscle memory, it is easy to forget how much it is an
| impenetrable black box to someone who has never done it.
|
| Try to tell someone who just needs to get a task done that he
| needs to spend the next 3 months learning programming from
| scratch vs using excel.
| breakfastduck wrote:
| Tough shit, though. Thats what they're being paid for.
|
| "A bad workman always blames his tools" has never been truer
| than this situation.
| minikites wrote:
| Given that these are relatively unskilled users, would
| switching away from Excel increase or decrease errors? I don't
| think it's a given that errors would decrease by switching away
| from Excel.
| instagraham wrote:
| I think there is a difference between the errors. Excel
| autocorrect errors are easy to miss as it is the software
| making an arbritrary decision over your own.
|
| Errors made from using a new , presumably neutral software,
| would be the kind you are supposed to catch as a researcher.
|
| There is obviously scope for error in everything but Excel's
| user-hostile methods are not justified by this.
| an_d_rew wrote:
| Switch to what?
|
| Excel is everywhere.
|
| People are familiar with it. The same people are often not
| familiar with, and have never used, and may even be confused
| by "notepad.exe".
| ltbarcly3 wrote:
| These aren't someone's 85 year old grandparent getting a
| computer for the first time to look at pictures of their
| great grandchildren, these are people who have used a
| computer every day since they were 14 and have advanced
| STEM degrees. I think they are probably re-trainable.
| Iwan-Zotow wrote:
| XXX YYY ZZZ, PhD in genomics, probably re-trainable
| blackbear_ wrote:
| Anecdotically, I was talking to an acquaintance who decided
| to pick up R as a replacement for excel for data analysis and
| they were positively blown away by the possibilities opened
| by the new mindset that came with programming.
|
| So yes, maybe these "relatively unskilled users" will
| struggle for a few weeks/months, but it will be a net
| positive change as many of them will see immense benefit
| after some tinkering.
| an_d_rew wrote:
| Because an absolutely enormous number of people involved in all
| ends of genomics and biomedical science have absolutely no
| inkling whatsoever about programming.
|
| Furthermore, Microsoft office is installed everywhere
| everywhere everywhere... like it leave it love it or hate it,
| it doesn't matter... it's true (at least for huge swaths of the
| demographic).
|
| So the first thing most people learn is excel and the last
| thing most people use is excel. It's a database, it's a
| spreadsheet, you can even use it as a word processor if you
| want to. And some have.
|
| I'm not saying you could and I'm not saying you should, I'm
| just answering as to "why".
| OptionX wrote:
| Even newcomers?
|
| I've had my education in a engineering focused campus so that
| may skew it, but even not IT related area had programming
| courses to, at least, learn the basics.
|
| Do the newer researcher still show reluctance to move from
| Excel or is it an old guard kind of deal?
| jasode wrote:
| _> why stick with Excel versus using python, R or what have
| you to treat the data? [...] Do the newer researcher still
| show reluctance to move from Excel_
|
| The feature of Excel that's underestimated/overlooked by
| skilled programmers of Python/R is that spreadsheets
| _immediately expose an editable visual spatial canvas
| datagrid GUI_.
|
| I have decades of programming C/C++/C#/Python/etc and yet
| _I create new spreadsheets and use them every day._
|
| Spreadsheets are much faster than wiring up a datagrid in
| C# or C++ Qt gui widget editor or any other "true"
| programming language. Spreadsheets are also faster than
| using Python package Pandas to import a csv into a
| dataframe and view it in a Jupyter notebook. And last time
| I checked, displaying an _output_ grid of cells in Jupyter
| is _read-only_ and not 2-way editable like Excel.
|
| And yes, the complaint is that _" MS Excel calculations are
| not auditable, repeatable, version controlled, etc"_. All
| true, but it still doesn't change the fact that Python
| doesn't have an instant datagrid GUI. UI affordances also
| matter in viewing science data sets as well as financial
| budgets.
| int_19h wrote:
| Besides, these are not mutually exclusive - you can use
| Excel as the front-end to your Python code via something
| like PyXLL.
| fartcannon wrote:
| Yeah. At some point though, people who want to do something
| that requires python or R, but refuse to learn it, and then
| use Excel (and make mistakes like the article mentions)
| should reconsider their choices. This is really out of
| character for me, but in this instance, relying on excel to
| do everything is a mistake by the user, not Microsoft's
| ridiculous data destroying import function.
|
| Use the right tool.
| qayxc wrote:
| > Use the right tool.
|
| Step one should be learn your tool, no matter the tool.
|
| There's several ways to avoid the issue in Excel. If
| someone working with Excel isn't able to learn that (heck,
| a simple template would suffice), I have no hope the same
| demographic would have any success with R or Python.
| da_chicken wrote:
| > _There 's several ways to avoid the issue in Excel._
|
| No, not really. There are several ways to help reduce the
| issue, but none of them eliminate what it does.
|
| Example that comes to mind is the data file for College
| Board's SAT test. The data formats for student reports
| for schools come in two formats: PDF (one page per
| student), CSV, and fixed-width. That is the comprehensive
| list of your options. College Board doesn't care about
| you as a customer. They're too big. Any request you
| submit will be black holed.
|
| Some of the columns in the file indicate a range, usually
| in the format "X-Y". Excel will try to coerce that into a
| date, if it's valid.
|
| Other columns indicate a ratio, expressed as "X/Y". Excel
| will coerce that into a date, if it's valid.
|
| Other columns indicate an ID number, expressed as a
| large, fixed-digit number, zero-padded. Excel will coerce
| that into an integer, or, if it's too long, into
| scientific notation discarding digits.
|
| It doesn't matter how you format the CSV. Excel will do
| the above.
|
| Here's an example CSV: ID,Range,Ratio
| 12345678901234567890,8-9,"7/15"
| "00000000000000012345","9-10",21/35
|
| I open that with Excel and immediately save it as a CSV.
| I look at the file in a text editor and I see:
| ID,Range,Ratio 1.23457E+19,9-Aug,15-Jul
| 12345,10-Sep,21/35
|
| Do you have any idea how fun it is to explain to teachers
| and school administrators what happened here?
|
| The correct way to work with this data file is: Do not,
| under any circumstances, open it with Excel if you expect
| to use it for anything else.
|
| The problem is, there are very few applications that work
| well with CSV files. I know of CsvEd and Delimit. There
| are several text editors with a CSV column mode that
| makes the file look like a table (with varying degrees of
| success). All of these vary between "godawful" and "a
| complete nightmare" in terms of performance and usability
| compared to Excel.
| dragonwriter wrote:
| > It doesn't matter how you format the CSV. Excel will do
| the above.
|
| Only if you _open_ the file, so that Exel assumes
| defaults for formatting. If you _import_ the CSV (Data -
| > From Text in the current UI) you can specify the format
| with the Import Text Wizard as described on the College
| Board instructions for using the file. Except...
|
| Unfortunately, the College Board has outdated
| instructions on their website; Excel used to offer the
| Import Text Wizard on opening a csv or text file, rather
| than making default assumptions, and the College Board
| instructions page, while it does tell you to use that
| Import Text Wizard and provide all the details of what to
| plug into that wizard, tells you to open the file to get
| it, which bypasses the wizard.
|
| Things like this is why clerical and other low-level
| positions involving excel test candidates on proficiency
| with specific versions of excel. Higher-level workers are
| expected to be able to figure out these kind of changes
| themselves, though (or consume lower-level staff time, if
| they are in management.
| da_chicken wrote:
| Yeah, that's also obnoxious. The Data From Text wizard
| lets you easily import the data... but if you do that
| then the first row no longer defaults to headers. So if
| you open it that way and want to filter or sort the data,
| your headers will disappear. You have to go into full
| blown Power Query to do both prevent reformatting and use
| column headers, and you have to reassign the data types
| on a much more complex interface.
|
| We have now changed a 1 second automated process that
| anybody could do, into a 2 minute manual process that
| requires knowledge of data types. We've gone from a non-
| technical task to a highly technical task. To open a file
| without completely munging the data.
|
| It also will sometimes do weird things like exclusive
| lock the file on disk until you close every open window
| of Excel because it creates data connections to the file.
| dragonwriter wrote:
| > The Data From Text wizard lets you easily import the
| data... but if you do that then the first row no longer
| defaults to headers. So if you open it that way and want
| to filter or sort the data, your headers will disappear.
| You have to go into full blown Power Query to do both
| prevent reformatting and use column headers
|
| You don't, because while the wizard doesn't _default_ to
| headers, the first screen of the wizard has a checkbox
| for it, so the option is there without Power Query.
| occamrazor wrote:
| If you open the csv file with Excel you get those errors.
| If instead one uses Get&Transform (aka Power Query) then
| it is possible to specify the type of each column.
| pletnes wrote:
| As someone who has programmed in python for a decade, and
| no idea how to do anything right in Excel, I would beg to
| differ.
| qayxc wrote:
| Do you use Excel on a daily basis for productive work as
| well?
|
| If not then your personal experience is irrelevant in
| this context.
| CobaltFire wrote:
| Not specifically this topic but I ran into an issue that
| would have been trivial in any real programming language
| this week, but I'm not able to USE anything that's not on
| my work computer due to data restrictions.
|
| So I get to cobble together some really ugly spreadsheets
| in excel when I'd MUCH rather use something more
| appropriate. Sometimes it's not the person, it's the org.
| In this case, government.
| tcskeptic wrote:
| We need a clippy type feature "It looks like you are working on
| genomics data, should I turn off all the data destroying
| auotcorrect features?"
| UnFleshedOne wrote:
| "It looks like you are working on genomics data, should I
| direct you to a search engine to find appropriate tooling
| instead?"
| delosrogers wrote:
| This exact problem burned me when I started doing
| transcriptomics, one of the things I found that helps mitigate
| the problem is to always keep both the gene symbol and a ID like
| an Ensemble or Entrez ID for every data point because those don't
| get mangled by Excel
| PedroBatista wrote:
| I generally am very critical of my own gang ( programmers ) for
| being so into their own bubble they completely lose perspective
| of the real World, and with that ignorance comes the usual
| arrogance.
|
| But this time I feel we should give the business to the
| geneticists. The Excel "problem" is known and has been known for
| a long time, finance guys and policy makers were maybe the first
| big ones to "discover" this for slightly different reasons.
|
| Then WHY on Earth these top professionals, elites, "creme de la
| creme" people continue on this path? And it's not like they love
| Excel, quite the contrary. Yes, the main reason is they are not
| programmers so they need to find a way to hack something to solve
| their problem. The problem with their "problem" solving is they
| are being slobs in a work that requires the EXACT opposite.
|
| "Oh now we need to hire a programmer too?" - Yes, you do.
|
| "But we don't have the budget for it" - Yes, you do. You see,
| money is never enough for anything, it's a question of priorities
| and when you blow millions of dollars with fancy experiments to
| ruin your career because of Excel, maybe that Python/R/Julia guy
| was a bargain after all.
|
| I know some of these people, and under the aura of lab coats and
| distinguished professors it's just some guy/gal trying to crank
| that paper ASAP to keep the hamster wheel spinning AND they are
| being sloppy about the data and process AND most of them know it.
|
| Sorry, but no excuses.
| breakfastduck wrote:
| Yeah. They use poor/ unsuitable tooling known for unusual data
| quirks with absolutely no effort to look at alternatives or pay
| for an expert.
|
| No sympathy from me. Embarrassing. I certainly wouldn't get
| away with such negligence at my workplace.
| Closi wrote:
| The underlying assumption here is that when transferred into R
| or Julia, with an equivalent level of development effort and
| testing, that there will be no/fewer bugs. I'm not entirely
| sure that's true.
|
| I think we are comparing something that is knocked out in an
| hour in excel to someone spending 4-10x of the time doing it in
| Python / Julia, but I don't think that's a like-for-like
| comparison.
___________________________________________________________________
(page generated 2021-08-17 23:02 UTC)