[HN Gopher] The Benchmark Lottery
___________________________________________________________________
The Benchmark Lottery
Author : amrrs
Score : 53 points
Date : 2021-07-17 13:04 UTC (9 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| shahinrostami wrote:
| We highlighted something similar in the multi-objective
| optimisation literature [1]. Unfortunately, it looks like
| comparing benchmark scores between papers can be unreliable.
|
| - _Algorithm A_ implemented by _Researcher A_ performs differently
| from _Algorithm A_ implemented by _Researcher B_.
|
| - _Algorithm A_ outperforms _Algorithm B_ in _Researcher A's_
| study.
|
| - _Algorithm B_ outperforms _Algorithm A_ in _Researcher B's_
| study.
|
| That's a simple case... and it can come down to many different
| factors which are often omitted in the publication. It can drive
| PhD students mad as they try to reproduce results and understand
| why theirs don't match!
|
| [1] https://link.springer.com/article/10.1007/s42979-020-00265-1
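|
| A toy illustration of the first bullet above (numbers invented
| here, not taken from [1] or from the paper): two algorithms of
| genuinely equal quality, each evaluated once per "study" with that
| study's own run-to-run noise, will each win about half the time.
|
|   # Illustrative sketch only: equal-quality algorithms compared in
|   # noisy single-run studies.
|   import random
|
|   def evaluate(true_quality, rng, noise=0.02):
|       # Stand-in for a full benchmark run: true quality plus noise
|       # from seeds, data ordering, implementation details, etc.
|       return true_quality + rng.gauss(0, noise)
|
|   wins = {"Algorithm A": 0, "Algorithm B": 0}
|   for study_seed in range(100):  # 100 hypothetical one-run studies
|       rng = random.Random(study_seed)
|       score_a = evaluate(0.90, rng)
|       score_b = evaluate(0.90, rng)
|       winner = "Algorithm A" if score_a > score_b else "Algorithm B"
|       wins[winner] += 1
|
|   print(wins)  # roughly 50/50; each study still declares a winner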
| NavinF wrote:
| This really sucks when some papers don't come with code that
| can reproduce the benchmark results. I wish there were a filter
| for "reproducible" in search results.
| phdelightful wrote:
| I like presenting the correlation of results between different
| benchmarks - I'd be interested to hear to what extent this
| problem exists in more traditional benchmarking. One difference
| is that ML has this accuracy/quality component, whereas in the
| past we've been more concerned with performance. Unfortunately
| this paper doesn't really address the long history of non-ML
| benchmarking, and I find it hard to believe no one has previously
| addressed the fragility of benchmark results.
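|
| As a rough sketch of what I mean by correlating benchmarks (the
| scores below are invented, listed per model in the same order on
| both benchmarks):
|
|   # Do two benchmarks rank the same set of models the same way?
|   from scipy.stats import spearmanr
|
|   benchmark_a = [71.2, 74.8, 76.1, 78.3, 80.0]  # accuracy, test A
|   benchmark_b = [69.5, 75.9, 74.0, 77.1, 81.2]  # accuracy, test B
|
|   rho, p = spearmanr(benchmark_a, benchmark_b)
|   print(f"Spearman rank correlation: {rho:.2f} (p={p:.3f})")
|   # High rho: the benchmarks agree on the ranking. Low rho: the
|   # "best" model depends on which benchmark you happened to pick.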
| version_five wrote:
| What are some non-ML benchmarks? Or put another way, what
| domain are you referring to? Some examples I can think of are
| bit error rate for communication and full width at half maximum
| of the point spread function for image resolution.
|
| BER is a much more objective benchmark (even if alone it might
| miss power or some other factor). FWHM has its own set of
| problems, because unless you're an astronomer, the thing you're
| imaging probably isn't a point source. So you get into more
| subjective resolution phantoms, e.g. in medical imaging, or
| comparisons of how some part of Lena looks under different
| treatments.
|
| But mostly, these benchmarks are all much simpler and more
| objective than something like ImageNet, which has so much
| randomness in it and no real underlying principle other than
| being diverse enough to be general. I'm curious whether there are
| other similarly random benchmarks used in other domains.
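|
| For concreteness, rough sketches of those two measures (invented
| numbers, not a real link or imaging system):
|
|   import numpy as np
|
|   # Bit error rate: fraction of received bits that differ from the
|   # bits sent, here with a simulated ~0.1% flip probability.
|   rng = np.random.default_rng(0)
|   sent = rng.integers(0, 2, size=100_000)
|   received = sent.copy()
|   flips = rng.random(sent.size) < 1e-3
|   received[flips] ^= 1
|   print(f"BER ~ {np.mean(sent != received):.1e}")
|
|   # FWHM of a Gaussian point spread function: the full width at
|   # half the peak, which for a Gaussian is 2*sqrt(2*ln 2)*sigma.
|   sigma = 1.5  # pixels
|   print(f"FWHM ~ {2 * np.sqrt(2 * np.log(2)) * sigma:.2f} pixels")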
| YeGoblynQueenne wrote:
| >> Thus when using a benchmark, we should also think about and
| clarify answers to several related questions: Do improvements on
| the benchmark correspond to progress on the _original problem_?
| (...) How far will we get by gaming the benchmark rather than
| making progress towards solving the original problem?
|
| But what is the "original problem" and how do we measure progress
| towards solving it? Obviously there's not just one such problem -
| each community has a few of its own.
|
| But in general, the reason that we waste so much time and effort
| on benchmarks in AI research (and that's AI in general, not just
| machine learning this time) is that nobody can really answer
| this fundamental question: how do we measure the progress of AI
| research?
|
| And that in turn is because AI research is not guided by a
| scientific theory: an epistemic object that can explain current
| and past observations according to current and past knowledge,
| and make predictions of future observations. We do not have such
| a theory of _artificial_ intelligence. Therefore, we do not know
| what we are doing, we do not know where we are going and we do
| not even know where we are.
|
| This is the sad, sad state of AI research. If AI research has
| been reduced, time and again, to a spectacle, a race to the
| bottom of pointless benchmarks, that's because AI research has
| never stopped to take its bearings, figure out its goals ( _there
| are no commonly accepted goals of AI research_ ) and establish
| itself as a _science_ , with a _theory_ - rather than a
| constantly shifting trip from demonstration to demonstration. 70
| years of demonstrations!
|
| I think the paper above manages to go on about benchmarks for 34
| pages and still miss the real limitation of empirical-only
| evaluations in a field without a theoretical basis: that no
| matter which benchmarks you choose, or how you choose them,
| you'll never know what you're doing.
| aruncis wrote:
| Well... none of the papers are up to the standards of the Money
| Laundering crowd.
| axiom92 wrote:
| For a minute I thought someone scooped our SIGBOVIK paper:
| https://madaan.github.io/res/papers/sigbovik_real_lottery.pd...!
| optimalsolver wrote:
| Is there a physical/mathematical/computational reason this
| couldn't actually be done?
| Groxx wrote:
| > _3. Theoretical Analysis_
|
| > _Sir, this is a Wendy's._
|
| I very much approve.
|
| Congrats on beating the baseline! At this rate, surely you're
| only a few improvements away from winning the jackpot!
___________________________________________________________________
(page generated 2021-07-17 23:01 UTC)