[HN Gopher] Leakage and the reproducibility crisis in ML-based s...
___________________________________________________________________
Leakage and the reproducibility crisis in ML-based science
Author : randomwalker
Score : 37 points
Date : 2022-07-15 19:07 UTC (3 hours ago)
(HTM) web link (reproducible.cs.princeton.edu)
(TXT) w3m dump (reproducible.cs.princeton.edu)
| a-dub wrote:
| this is one of those weird problems that shows up at the
| intersection of science in the public interest and a market-
| driven system of production.
|
| pure science that is publicly funded in the public interest would
| publish all raw data along with re-runnable processing pipelines
| that will literally reproduce the figures of interest.
|
| but, the funding is often provided by governments with the aim of
| producing commercializable new technology that can make life
| better for society.
|
| the problem is that if you do the science in the open, then it
| can be literally picked off by large incumbents before smaller
| inventors have a chance to try to spin up commercialization of
| their life's work.
|
| so we have this system today where science is semi-closed in
| order to protect the inventors, but sometimes to the detriment of
| the science itself.
| adminprof wrote:
| I think you're missing two fatal problems with this "publish
| all raw data and code" mindset. I don't think the desire for
| commercialization is high on the list of fatal problems
| preventing people from publishing data+software.
|
| 1) How do you handle research in domains where the data is
| about people, so that releasing it harms their privacy?
| Healthcare, web activity, finances. Sure, you can try to
| anonymize it, but anonymization is imperfect, and even fully
| anonymized data can be joined to other data sources to de-
| identify people; k-anonymity only works in a closed ecosystem.
| If we live in a world where search engine companies don't
| publish their research because of this constraint, that seems
| worse than the current system.
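A minimal sketch of the k-anonymity check mentioned above, assuming Python; the field names and records are purely illustrative:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True if every combination of quasi-identifier values occurs
    at least k times in the released rows."""
    groups = Counter(tuple(row[c] for c in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

# Toy release: zip code and age band are quasi-identifiers that an
# attacker could join against outside data sources.
records = [
    {"zip": "08540", "age_band": "30-40", "diagnosis": "flu"},
    {"zip": "08540", "age_band": "30-40", "diagnosis": "cold"},
    {"zip": "08544", "age_band": "50-60", "diagnosis": "flu"},
]

# The third record is unique on (zip, age_band), so this release is not
# 2-anonymous -- and even a k-anonymous release can fail once joined
# with other datasets, which is the "closed ecosystem" caveat above.
```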
|
| 2) How does one define "re-runnable processing"? Software rots,
| dependencies disappear, operating systems become incompatible
| with software, permission models change. Does every researcher
| now need a docker expert to publish? Who verifies that
| something is re-runnable, and how are they paid for it?
| nicoco wrote:
| From my experience in the digital health sector, concern for
| privacy is always the reason given for not sharing anything
| valuable and/or useful to others. But it's just a convenient
| way of hiding the 'desire of commercialisation'.
| a-dub wrote:
| this is also true, and it also runs within science itself.
| if someone spends two years collecting some data that is
| very hard to collect and it has a few papers worth of
| insights within it, they're going to want to keep that data
| private until they can get those papers out themselves lest
| someone else come along, download their data and scoop them
| before they have a chance to see the fruits of their hard
| labor.
|
| while it's not great for science at large, i don't blame
| them either.
| a-dub wrote:
| > 1) How do you handle research in domains where the data is
| about people, so that releasing it harms their privacy?
|
| that's an interesting problem that i have not thought about.
|
| i think maybe that this is not a technical problem, but more
| an ethical one. under the open data approach, if you want to
| study humans you probably would need to get express informed
| consent that indicates that their data will be public and
| that it could be linked back to them.
|
| > 2) How does one define "re-runnable processing"? Software
| rots, dependencies disappear, operating systems become
| incompatible with software, permission models change. Does
| every researcher now need a docker expert to publish? Who
| verifies that something is re-runnable, and how are they paid
| for it?
|
| one defines it by building a specialized system for the
| purpose of reproducible research computing. i would envision
| this as a sort of distributed abstract virtual machine and
| source code packaging standard where the entire environment
| that was used to process the data is packaged and shipped
| with the paper. the success of this system would depend on
| the designers getting it right such that researchers
| _wouldn't_ have to worry about weird systems level kludges
| like docker. as it would behave as a hermetically sealed
| virtual machine (or cluster of virtual machines), there would
| be no concerns about bitrot unless one needed to make changes
| or build a new image based on an existing one.
|
| the good news is that most data processing and simulation
| code is pretty well suited to this sort of paradigm. often it
| just does cpu/gpu computations and file i/o. internet
| connectivity or outside dependencies are pretty much out of
| scope.
|
| i don't think it's hard... there just hasn't been the will or
| financial backing to build this out right and therefore it
| does not exist.
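A toy sketch of one small piece of what such a system would record: a manifest that pins the interpreter, platform, and input-data hashes of a run so it can be checked or rebuilt later. This is an illustration, not the envisioned virtual-machine standard itself:

```python
import hashlib
import json
import platform

def environment_manifest(input_files=()):
    """Record the interpreter, platform, and input-data hashes that
    produced a result, plus a single digest identifying the run."""
    manifest = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "inputs": {},
    }
    for path in input_files:
        with open(path, "rb") as f:
            manifest["inputs"][path] = hashlib.sha256(f.read()).hexdigest()
    # A digest over the whole manifest pins the run as one identifier
    # that can be shipped alongside the paper.
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["digest"] = hashlib.sha256(payload).hexdigest()
    return manifest
```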
| a-dub wrote:
| ...also, if a technique appears in a paper, an expert on that
| technique should be a reviewer and/or a standard rubric should
| be applied (i think nature and science have gotten much more
| rigorous about this in recent years in the wake of the
| psychology replication crisis).
| AtNightWeCode wrote:
| There is even tech that claims to solve the train-test split
| "under the hood". You also get surprised with the low amount of
| data points some of these ML people think is necessary. Far off
| from what you learn in basic statistics classes.
|
| To not provide accurate ways of reproducing something claimed in
| a paper means that the paper is invalid.
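A minimal sketch of one common form of the train-test leakage at issue: computing preprocessing statistics on the full dataset before splitting. Pure stdlib; the data is synthetic:

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]  # one toy feature column

split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Leaky pipeline: the normalization statistics are fit on ALL rows, so
# the test set has influenced preprocessing before any "held-out" eval.
leaky_mean, leaky_std = statistics.mean(data), statistics.stdev(data)

# Clean pipeline: statistics come from the training split only.
clean_mean, clean_std = statistics.mean(train), statistics.stdev(train)

# The two differ, so a model tuned under the leaky pipeline has
# implicitly seen the test set -- the contamination the paper flags.
```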
| dekhn wrote:
| Recently, I saw that people were tagging their input records
| (test records in git repos) specifically so that later data
| loaders would reject those records in appropriate conditions. I
| forget what the tech was called but it was interesting.
| jokoon wrote:
| Machine learning isn't really science, since it's only
| statistical methods. It doesn't provide insight into what
| intelligence is. It's only techniques, so it's just engineering.
| It's brute force hacking at best, and when it sort of works, it's
| impossible to figure out why it does because it's black boxes all
| the way down.
|
| So of course there are cool things like gpt, but it's not like
| it's scientific progress. It doesn't really help us understand
| how brains work, or what general intelligence really is.
| randomwalker wrote:
| You may have misunderstood the title of the post.
| It isn't about the science of ML, or GPT-3, or brains. Rather,
| it's about using ML as a tool to do actual science, like
| medicine or political science or chemistry or whatnot. The
| first sentence of the post explains this.
| [deleted]
| nestorD wrote:
| Machine learning is _not_ about getting insight into what
| intelligence is (it might do so as a byproduct but very few
| people are using it with that goal in mind).
|
| However, ML _is_ useful to generalist science as long as you
| are aware of its shortcomings and are not just replacing
| something with ML without thinking about it.
|
| To give you an example I worked on (to be published): I worked
| with some physicists who use an incredibly slow and expensive
| iterative solver to get information on particles. We
| introduced a machine learning algorithm that predicts the end
| result. It does _not_ replace the solver (you could not trust
| its results, contrary to a physics based numerical algorithm)
| but, using its guess as a starting point for the iterative
| solver, you can make the overall solving process orders of
| magnitude faster.
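The warm-start pattern nestorD describes can be sketched with a stand-in: Newton's method as the trusted iterative solver, and a hand-picked guess standing in for the ML predictor. All names and numbers are illustrative:

```python
def newton_sqrt(y, x0, rel_tol=1e-12):
    """Trusted iterative solver: Newton's method for x*x = y.
    Returns the root and the number of iterations taken."""
    x, steps = x0, 0
    while abs(x * x - y) > rel_tol * y:
        x = 0.5 * (x + y / x)
        steps += 1
    return x, steps

y = 1e6  # true root is 1000

# Cold start: a naive initial guess.
root_cold, cold_steps = newton_sqrt(y, 1.0)

# Warm start: a cheap surrogate (standing in for the ML predictor)
# guesses near the answer; the trusted solver still does the final
# convergence, so the result is as reliable as the cold-start one.
root_warm, warm_steps = newton_sqrt(y, 990.0)
```

The key design point matches the comment: the learned guess is never trusted as the answer, only used to cut the number of expensive solver iterations.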
| YeBanKo wrote:
| > It does not replace the solver (you could not trust its
| results, contrary to a physics based numerical algorithm)
|
| And I guess the outcome variable in the train set for the ML
| model was produced by the solver?
| notrealyme123 wrote:
| Statistics are the backbone of many natural sciences.
|
| It is also valid to make scientific progress just inside of a
| field and not in the grand scheme of things.
| deelowe wrote:
| I feel that the deterministic computing theologists are going
| to be in for a rude awakening over time. Computing need not
| be perfect to work, and the thing about recent advancements in
| ML is that they scale extremely well.
| antipaul wrote:
| Not a bad checklist ("model info sheet")
|
| But rather than stand-alone, it should be incorporated into
| publications.
|
| In my experience, only a minority of applied machine learning
| papers provide even a minority of the info requested by the info
| sheet.
|
| Meaning you really have no proper idea how cross-validation
| was done, what preprocessing was done, etc., in actually
| published papers.
___________________________________________________________________
(page generated 2022-07-15 23:01 UTC)