[HN Gopher] New Computer Language Benchmarks Game metric: time +...
___________________________________________________________________
New Computer Language Benchmarks Game metric: time + source code
size
Author : benstrumental
Score : 37 points
Date : 2022-06-01 15:19 UTC (7 hours ago)
(HTM) web link (benchmarksgame-team.pages.debian.net)
(TXT) w3m dump (benchmarksgame-team.pages.debian.net)
| sidkshatriya wrote:
| Geometric mean of (time + gzipped source code size in bytes)
| seems statistically wrong.
|
| What if you shifted time to nanoseconds? Or expressed source
| code size in megabytes? The rankings could change. The culprit
| is the '+'.
|
| I would think the geometric mean of (time x gzipped source code
| size) is the correct way to compare languages. It would not
| matter what the units of time or size are in that case.
|
| [Here the geometric mean is the geometric mean of (time x
| gzipped size) over all benchmark programs of a particular
| language.]
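|
| A toy sketch of the unit problem (invented timings and sizes,
| not the site's data) -- summing makes the ranking depend on the
| units chosen, multiplying does not:
|
|     from math import prod
|
|     def gmean(xs):
|         return prod(xs) ** (1 / len(xs))
|
|     # (time, gzipped size in bytes) for three benchmark programs
|     langs = {"A": [(1.0, 500), (2.0, 800), (4.0, 300)],
|              "B": [(0.5, 2000), (3.0, 1500), (2.0, 1200)]}
|
|     # sum-based score: the ranking flips when seconds become ns
|     for unit, scale in [("s", 1), ("ns", 1e9)]:
|         sums = {k: gmean([t * scale + s for t, s in v])
|                 for k, v in langs.items()}
|         print(unit, sorted(sums, key=sums.get))  # ['A','B'] then ['B','A']
|
|     # product-based score: rescaling time multiplies every score
|     # by the same constant, so the ranking cannot change
|     prods = {k: gmean([t * s for t, s in v]) for k, v in langs.items()}
|     print(sorted(prods, key=prods.get))          # ['A','B'] either way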
| ntoskrnl wrote:
| Yep this is correct. Adding disparate units is almost always
| nonsensical. You can confirm with a scientific calculator like
| insect:
|
|     $ insect '5s + 10MB'
|     Conversion error: Cannot convert unit MB (base units: bit) to unit s
|
|     $ insect '5s * 10MB'
|     50 s*MB
| smegsicle wrote:
| units, frink, insect oh my
| igouy wrote:
| > The culprit is the '+'
|
| That annotation does seem to have caused much frothing and
| gnashing.
|
| Here's how the calculation is made -- "How not to lie with
| statistics: The correct way to summarize benchmark results."
|
| [pdf]
| http://www.cse.unsw.edu.au/~cs9242/11/papers/Fleming_Wallace...
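|
| The paper's recommendation, roughly: normalize each result
| against a baseline and take the geometric mean of the ratios. A
| minimal sketch with invented numbers (not the site's code):
|
|     from math import prod
|
|     def gmean(xs):
|         return prod(xs) ** (1 / len(xs))
|
|     times_x = [2.0, 10.0, 0.5]    # system X, seconds per benchmark
|     times_ref = [1.0, 5.0, 1.0]   # baseline system
|
|     ratios = [x / r for x, r in zip(times_x, times_ref)]
|     # the geometric mean of normalized ratios gives the same
|     # relative ranking no matter which system is the baseline
|     print(gmean(ratios))          # ~1.26: X is ~26% slower overall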
| yorwba wrote:
| That paper only covers the reasoning behind taking the
| geometric mean; it has nothing to say about the "time +
| gzipped source code size in bytes" part.
| dwattttt wrote:
| It's not necessarily wrong to add disparate units like this; it
| implicitly weights one unit against the other. Changing to
| nanoseconds just gives more weight to the time metric in the
| unified benchmark. You could instead weight them explicitly
| without changing units; if you cared more about the size, you
| could add a multiplier to it.
| sidkshatriya wrote:
| You really don't know what the right weight is to balance time
| and gzipped size. Multiplying them together sidesteps the whole
| issue and puts time and size on a par with each other
| regardless of the individual unit scaling.
|
| The whole point of benchmarks is to protect against
| accidental bias in your calculations. Adding them seems
| totally against my intuition. If you _did_ want to give time
| more weight, I would raise it to some power. Example: the
| geometric mean of (time x time x source size) would give time
| much more importance in an arguably more principled way.
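|
| That power-weighting idea as a sketch (exponent and data are
| invented):
|
|     from math import prod
|
|     def gmean(xs):
|         return prod(xs) ** (1 / len(xs))
|
|     # geometric mean of (time^2 x size): in log space this is
|     # exp(mean(2*ln t + ln s)), i.e. time carries twice the
|     # weight of size, and rescaling units still cannot change
|     # the ranking.
|     results = [(1.2, 500), (3.4, 800), (0.7, 300)]  # (s, gzipped bytes)
|     print(gmean([t * t * s for t, s in results]))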
| dwattttt wrote:
| Multiplying them is another way of expressing them as a
| unified value. It's not a question of accidental bias; you're
| explicitly choosing how important one second is compared to
| one byte.
|
| You could imagine there's a 1 sec/byte multiplier on the
| bytes value, saying in effect "for every byte of gzipped
| source, penalise the benchmark by one second".
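|
| A sketch of that penalty reading (the weight of 1 sec/byte is
| the implicit choice described above; the numbers are invented):
|
|     # score = time + w * size, with w in seconds per gzipped byte.
|     # w = 1.0 reproduces "time + size"; changing w is exactly
|     # equivalent to changing the units of size.
|     def score(time_s, size_bytes, w=1.0):
|         return time_s + w * size_bytes
|
|     print(score(3.2, 450))          # the implicit 1 sec/byte weighting
|     print(score(3.2, 450, w=0.01))  # "a byte costs 10 ms" instead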
| sidkshatriya wrote:
| > You could imagine there's a 1 sec/byte multiplier on
| the bytes value, saying in effect "for every byte of
| gzipped source, penalise the benchmark by one second".
|
| Your explanation makes sense. However, the main issue is that
| we don't know whether this "penalty" is fair or correct or has
| some justifiable basis. In the absence of any explanation it
| would make more sense to multiply them together as a "sane
| default". Later, having done some research, we could attach
| some weighting, perhaps appealing to some physical laws or
| information theory. Even then I doubt that + would be the
| operator I would use to combine them.
| igouy wrote:
| > Adding them...
|
| Read '+' as '&'.
| Thaxll wrote:
| The thing they should change is to forbid nonsense like:
|
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
|
| Actually, if you look at the top .NET Core submissions, the
| only fast ones are those using low-level intrinsics, etc.
| spullara wrote:
| All of the languages now have that trash in them. I'd like a
| "naive" benchmarks game where you write the code
| straightforwardly, in a normal style for the language.
| igouy wrote:
| "simple" (2nd link on the homepage.)
|
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| weberer wrote:
| >Java: 40 seconds
|
| >Python 3: 1h 09 minutes
|
| Well damn.
| igouy wrote:
| > Actually, if you look at the top .NET Core submissions, the
| only fast ones are those using low-level intrinsics, etc.
|
| Do you mean "fast" like a C program using low-level intrinsics?
| NeutralForest wrote:
| This presentation is pretty bad: there should be more context,
| some kind of color scheme or labels instead of text in the
| background, spacing between the languages represented, other
| benchmarks than the geometric mean, etc.
| gus_massa wrote:
| > _other benchmarks than the geometric mean_
|
| The text is not clear enough, but "geometric mean" is not the
| benchmark. The 11 problems are listed in
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
|
| The results of the 11 problems are combined using the
| "geometric mean" into a single number. Some people prefer the
| "geometric mean", other people prefer the "arithmetic mean" to
| combine the numbers, other people prefer the maximum, and there
| are many other methods (like the trimmed mean, which excludes
| both extremes).
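|
| The usual aggregators side by side (a sketch with invented
| per-benchmark scores):
|
|     from math import prod
|     from statistics import mean
|
|     scores = [1.1, 0.9, 3.0, 1.4, 0.8, 2.2, 1.0, 1.3, 0.7, 5.0, 1.2]
|
|     geometric = prod(scores) ** (1 / len(scores))
|     arithmetic = mean(scores)
|     worst = max(scores)
|     # "trimmed mean": drop the best and worst before averaging
|     trimmed = mean(sorted(scores)[1:-1])
|
|     print(geometric, arithmetic, worst, trimmed)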
| NeutralForest wrote:
| >The text is not clear enough, but "geometric mean" is not
| the benchmark.
|
| Thanks, that makes more sense; that's another context issue
| then. I don't have anything against geometric means, but there
| should be basic statistics like average, max, min, ...
| available as well.
| igouy wrote:
| > ... basic statistics like...
|
| median, quartiles
|
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| NeutralForest wrote:
| Could you guide me to the ones I mentioned? I'm not seeing
| them.
| igouy wrote:
| The bar in the middle of the box is an "average" - the
| median.
|
| https://www.merriam-webster.com/dictionary/average
|
| https://www.itl.nist.gov/div898/handbook/eda/section3/boxplo...
| gus_massa wrote:
| It would be nice to be able to see the numbers in a table
| (perhaps in an auxiliary page, instead of the main page).
| Sometimes people want to rearrange the data or use
| another representation. (log scale? sort by 75% quartile?
| ...)
| igouy wrote:
| For people who want to rearrange the data or use another
| representation, there are data files --
|
| https://salsa.debian.org/benchmarksgame-team/benchmarksgame/...
| stonemetal12 wrote:
| The linked page has box and whisker plots. On a box and
| whisker plot the lower bar is the min, the upper bar is
| the max. The box goes from 25th percentile to 75th
| percentile while the bar in the middle of the box is the
| 50th percentile.
|
| Therefore the stats you mentioned are all there: min, max, and
| average, with two different definitions of average given
| (geometric mean and 50th percentile).
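|
| Those five numbers fall straight out of the raw timings -- a
| sketch with invented data (the site has its own plotting
| pipeline):
|
|     from statistics import quantiles
|
|     timings = [0.8, 0.9, 1.1, 1.3, 1.4, 1.9, 2.5, 3.0, 4.2]  # seconds
|
|     q1, q2, q3 = quantiles(timings, n=4)  # 25th/50th/75th percentiles
|     print("min:", min(timings))  # lower whisker
|     print("25%:", q1)            # bottom of the box
|     print("50%:", q2)            # bar in the middle (the median)
|     print("75%:", q3)            # top of the box
|     print("max:", max(timings))  # upper whisker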
| IshKebab wrote:
| With a totally arbitrary conversion of 1 second = 1 gzipped byte.
|
| This is basically meaningless. I don't see why you'd even need to
| do this. You can easily show code size _and_ performance on the
| same graph.
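|
| A minimal matplotlib sketch of that (made-up numbers):
|
|     import matplotlib.pyplot as plt
|
|     # (gzipped source bytes, seconds) per language -- invented
|     data = {"C": (800, 1.0), "Rust": (900, 1.1), "Python": (400, 45.0)}
|
|     for lang, (size, time) in data.items():
|         plt.scatter(size, time)
|         plt.annotate(lang, (size, time))
|
|     plt.xlabel("gzipped source size (bytes)")
|     plt.ylabel("time (seconds)")
|     plt.yscale("log")  # runtimes span orders of magnitude
|     plt.show()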
| kibwen wrote:
| For comparing multiple implementations of a single benchmark in a
| single language, this sort of data would be interesting as a 2D
| plot, to see how many lines it takes to improve performance by
| how much. But for cross-language benchmarking this seems
| somewhat confounding, as the richness of standard libraries
| varies between languages. And counting the lines of external
| dependencies sounds extremely annoying: not only do you have
| to decide whether to include standard libraries (including
| libc), you also need to find a way not to penalize those with
| many lines devoted to tests.
| mrtranscendence wrote:
| I'm not sure I see the problem. What does it matter that
| program A is shorter than program B because language A has a
| richer standard library? Program A still required less code.
| kibwen wrote:
| Because that's not the only thing being measured here; you're
| also mixing in performance, and it's impossible to tell at a
| glance whether a score is attributable to one or the other or
| both.
| benstrumental wrote:
| Not exactly what you're looking for, but here are some 2D plots
| of code size vs. execution time with geometric means of fastest
| entries and smallest code size entries of each language:
|
| https://twitter.com/ChapelLanguage/status/152442889069266944...
| simion314 wrote:
| And when you want to make the code readable, you space things
| out, split things into small functions, and use longer,
| clearer variable names. I guess they are asking for running
| the code through a minifier so their implementation gains some
| points.
| benstrumental wrote:
| I can't find the documentation for it, but you can see here
| that they measure the size of the source file after gzip
| compression, which reduces the advantage of code-golf
| solutions:
|
| https://salsa.debian.org/benchmarksgame-team/benchmarksgame/...
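|
| Roughly what that measurement looks like -- a sketch that just
| gzips the raw source, which may differ in detail from the
| site's script ("nbody.py" is a hypothetical file name):
|
|     import gzip
|     from pathlib import Path
|
|     def gzipped_size(path):
|         """gzip-compressed size of a source file, in bytes."""
|         return len(gzip.compress(Path(path).read_bytes()))
|
|     print(gzipped_size("nbody.py"))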
| igouy wrote:
| "How source code size is measured"
|
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| kibwen wrote:
| On other benchmarks they measure the size of source code
| after it's been run through compression, as a way to
| normalize that. Not sure if that's been done here, but it
| should be.
| igouy wrote:
| Yes, they're the same measurements.
| [deleted]
| _b wrote:
| I'd be interested to see "C compiled with Clang" added as
| another language to the benchmarks game. In part because
| digging into Clang vs gcc benchmarks is always interesting,
| and in part because, as Rust & Clang share the same LLVM
| backend, it would shed light on how much of the C vs Rust
| difference comes from frontend language stuff vs backend code
| gen stuff.
| igouy wrote:
| Already done:
|
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
|
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| hexo wrote:
| I don't buy these results at all. Julia in second place looks
| like a plain lie and complete nonsense, to the point that I'm
| gonna look into this and run it myself.
|
| After trying hard to use Julia for about a year, I came to the
| conclusion that it's one of the slowest things around. Maybe
| things have changed? Maybe, but Julia code still remains
| incorrect.
|
| I hope they fix both things: speed (including start-up speed,
| it counts A LOT) and correctness.
| ChrisRackauckas wrote:
| Note that these benchmarks include compilation time for Julia,
| while they do not include compilation time for C, Rust, etc.
| igouy wrote:
| Julia is presented like this --
|
| "Julia features optional typing, multiple dispatch, and good
| performance, _achieved using type inference and just-in-time
| (JIT) compilation_ , implemented using LLVM."
|
| Julia 1.7 Documentation, Introduction
|
| https://docs.julialang.org/en/v1/
| cpurdy wrote:
| Predictably, "The Computer Language Benchmarks Game" once again
| proves the worthlessness of "The Computer Language Benchmarks
| Game".
|
| This thing has been a long-running joke in the software
| industry, exceeded only by the level of their defensiveness.
|
| SMH.
| agentgt wrote:
| I really wish they aggregated the metric of build time (+
| whatever).
|
| That is a huge metric I care about.
|
| You can figure it out somewhat by clicking on each language
| benchmark, but it is not aggregated.
|
| BTW, as a biased guy in the Java world, I can tell you this is
| one area where Java is actually mostly the winner, even
| beating out many scripting languages apparently.
| igouy wrote:
| Do Java "build time" measurements include class loading and JIT
| compilation? :-)
___________________________________________________________________
(page generated 2022-06-01 23:01 UTC)