[HN Gopher] The Benchmark Lottery
       ___________________________________________________________________
        
       The Benchmark Lottery
        
       Author : amrrs
       Score  : 53 points
       Date   : 2021-07-17 13:04 UTC (9 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | shahinrostami wrote:
       | We highlighted something similar in the multi-objective
       | optimisation literature [1]. Unfortunately, it looks like
       | comparing benchmark scores between papers can be unreliable.
       | 
        | - _Algorithm A_ implemented by _Researcher A_ performs
        | differently from _Algorithm A_ implemented by _Researcher B_.
       | 
       | - _Algorithm A_ outperforms _Algorithm B_ in _Researcher A's_
       | study.
       | 
       | - _Algorithm B_ outperforms _Algorithm A_ in _Researcher B's_
       | study.
       | 
       | That's a simple case... and it can come down to many different
       | factors which are often omitted in the publication. It can drive
       | PhD students mad as they try to reproduce results and understand
       | why theirs don't match!
       | 
       | [1] https://link.springer.com/article/10.1007/s42979-020-00265-1
        
         | NavinF wrote:
          | This really sucks when some papers don't come with code that
          | can reproduce the benchmark results. I wish there were a
          | filter for "reproducible" in search results.
        
       | phdelightful wrote:
       | I like presenting the correlation of results between different
       | benchmarks - I'd be interested in hearing to what extent this
       | problem exists in more traditional benchmarking. One difference
       | is that ML has this accuracy/quality component where in the past
       | we've been more concerned with performance. Unfortunately this
       | paper doesn't really address the long history of non-ML
       | benchmarking, and I find it hard to believe no one has previously
       | addressed the fragility of benchmark results.
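The cross-benchmark correlation idea above can be sketched with a rank correlation such as Kendall's tau: if two benchmarks induce the same ordering over models, tau is 1.0; if they disagree completely, it is -1.0. A minimal stdlib-only sketch, using made-up accuracy scores for five hypothetical models:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score lists over the same
    models (1.0 = identical ordering, -1.0 = fully reversed)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # A pair is concordant if both benchmarks order models i and j
        # the same way, discordant if they disagree.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical accuracies of five models on two benchmarks.
bench1 = [71.2, 74.8, 76.1, 78.3, 80.0]
bench2 = [70.5, 76.0, 75.2, 77.9, 79.1]
print(kendall_tau(bench1, bench2))  # -> 0.8 (one swapped pair)
```

A low tau between two benchmarks that supposedly measure the same capability is exactly the "lottery" symptom: which benchmark you pick decides which model wins.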
        
         | version_five wrote:
         | What are some non-ML benchmarks? Or put another way, what
         | domain are you referring to? Some examples I can think of are
         | bit error rate for communication and full width at half maximum
         | of the point spread function for image resolution.
         | 
          | BER is a much more objective benchmark (even if alone it
          | might miss power or some other factor). FWHM has its own
          | set of problems, because unless you're an astronomer, the
          | thing you're imaging probably isn't a point source. So you
          | get into more subjective resolution phantoms, e.g. in
          | medical imaging, or comparisons of how some part of Lena
          | looks under different treatments.
         | 
          | But mostly, these benchmarks are all much simpler and more
          | objective than something like ImageNet, which has so much
          | randomness in it and no real underlying principle other
          | than being diverse enough to be general. I'm curious
          | whether similarly random benchmarks are used in other
          | domains.
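The BER example above really is as simple as it sounds: it is just the fraction of transmitted bits that arrive flipped. A minimal sketch with a made-up 12-bit transmission:

```python
def bit_error_rate(sent, received):
    """Fraction of bits that differ between the transmitted and
    received streams (must be equal length)."""
    assert len(sent) == len(received)
    errors = sum(s != r for s, r in zip(sent, received))
    return errors / len(sent)

# Hypothetical 12-bit transmission with two flipped bits.
sent     = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
received = [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
print(bit_error_rate(sent, received))  # 2 errors / 12 bits
```

The contrast with an ML benchmark is that every term here is unambiguous: there is no labeling noise, no dataset curation choice, and no disagreement about what counts as an error.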
        
       | YeGoblynQueenne wrote:
       | >> Thus when using a benchmark, we should also think about and
       | clarify answers to several related questions: Do improvements on
       | the benchmark correspond to progress on the _original problem_?
       | (...) How far will we get by gaming the benchmark rather than
       | making progress towards solving the original problem?
       | 
       | But what is the "original problem" and how do we measure progress
       | towards solving it? Obviously there's not just one such problem -
       | each community has a few of its own.
       | 
        | But in general, the reason we waste so much time and effort
        | on benchmarks in AI research (and that's AI in general, not
        | just machine learning this time) is that nobody can really
        | answer this fundamental question: how do we measure the
        | progress of AI research?
       | 
       | And that in turn is because AI research is not guided by a
       | scientific theory: an epistemic object that can explain current
       | and past observations according to current and past knowledge,
       | and make predictions of future observations. We do not have such
       | a theory of _artificial_ intelligence. Therefore, we do not know
       | what we are doing, we do not know where we are going and we do
       | not even know where we are.
       | 
       | This is the sad, sad state of AI research. If AI research has
       | been reduced, time and again, to a spectacle, a race to the
       | bottom of pointless benchmarks, that's because AI research has
       | never stopped to take its bearings, figure out its goals ( _there
       | are no commonly accepted goals of AI research_ ) and establish
       | itself as a _science_ , with a _theory_ - rather than a
       | constantly shifting trip from demonstration to demonstration. 70
       | years of demonstrations!
       | 
        | I think the paper above manages to go on about benchmarks for
        | 34 pages and still miss the real limitation of empirical-only
        | evaluations in a field without a theoretical basis: that no
        | matter what benchmarks you choose and how, without that basis
        | you'll never know what you're doing.
        
       | aruncis wrote:
       | Well... none of the papers are up to the standards of the Money
       | Laundering crowd.
        
       | axiom92 wrote:
       | For a minute I thought someone scooped our SIGBOVIK paper:
       | https://madaan.github.io/res/papers/sigbovik_real_lottery.pd...!
        
         | optimalsolver wrote:
         | Is there a physical/mathematical/computational reason this
         | couldn't actually be done?
        
         | Groxx wrote:
         | > _3. Theoretical Analysis_
         | 
         | > _Sir, this is a Wendy's._
         | 
         | I very much approve.
         | 
         | Congrats on beating the baseline! At this rate, surely you're
         | only a few improvements away from winning the jackpot!
        
       ___________________________________________________________________
       (page generated 2021-07-17 23:01 UTC)