[HN Gopher] P < 0.05 Considered Harmful
___________________________________________________________________
P < 0.05 Considered Harmful
Author : bookish
Score : 104 points
Date : 2023-04-10 16:00 UTC (7 hours ago)
(HTM) web link (simplicityissota.substack.com)
(TXT) w3m dump (simplicityissota.substack.com)
| gjm11 wrote:
| The article is all about why "0.05" might be a bad value to
| choose. But, more fundamentally, _p_ is often the wrong thing to
| be looking at in the first place.
|
| 1. Effect sizes.
|
| Suppose you are a doctor or a patient and you are interested in
| two drugs. Both are known to be safe (maybe they've been used for
| decades for some problem other than the one you're now facing).
| As for efficacy against the problem you have, one has been tried
| on 10000 people, and it gave an average benefit of 0.02 units
| with a standard deviation of 1 unit, on some 5-point scale. So
| the standard deviation of the average over 10k people is about
| 0.01 units, the average benefit is about 2 sigma, and p is about
| 0.05. Very nice.
|
| The other drug has only been tested on 100 people. It gave an
| average benefit of 0.1 unit with a standard deviation of 0.5
| units. Standard deviation of average is about 0.05, average
| benefit is about 2 sigma, p is again about 0.05.
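|
| A quick sketch of that arithmetic, assuming normal sampling
| distributions for the two averages (scipy used for the tail
| probability; the numbers are the ones above):
|
|     import math
|     from scipy import stats
|
|     def summarize(name, n, benefit, sd):
|         # Standard error of the mean, and a two-sided p-value
|         # against the null of zero average benefit.
|         se = sd / math.sqrt(n)
|         z = benefit / se
|         p = 2 * stats.norm.sf(abs(z))
|         print(name, "SE=%.3f z=%.1f p=%.3f" % (se, z, p))
|
|     summarize("drug 1", n=10_000, benefit=0.02, sd=1.0)
|     summarize("drug 2", n=100, benefit=0.10, sd=0.5)
|     # Both land near p ~ 0.05, but the estimated effects
|     # differ by about 5x.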
|
| Are these two interventions _equally promising_? Heck no. The
| first one almost certainly does very little on average, and does
| substantially more harm than good about half the time. The second
| one is probably about 5x better on average, and seems to be less
| likely to harm you. It's more _uncertain_ because the sample
| size is smaller, and for sure we should do a study with more
| patients to nail it down better, but I would definitely prefer
| the second drug.
|
| (With those very large standard deviations, if the second drug
| didn't help me I would want to try the first one, in case I'm one
| of the lucky people it gives > 1 unit of benefit to. But it might
| well be > 1 unit of harm instead.)
|
| Looking only at p-values means only caring about effect size in
| so far as it affects how confident you are that there's any
| effect at all. (Or, e.g., any _improvement_ on the previous
| best.) But usually you do, in fact, care about the effect size
| too.
|
| Here's another way to think about this. When computing a p-value,
| you are asking "if the null hypothesis is true, how likely are
| results like the ones we actually got?". That's a reasonable
| question. But you will notice that it _makes no reference at all
| to any not-null hypothesis_. p < 0.05 means something a bit like
| "the null hypothesis is probably wrong" (though that is _not_ in
| fact quite what it means) but usually you also care _how_ wrong
| it is, and the p-value won't tell you that.
|
| 2. Prior probability.
|
| The parenthetical remark in the last paragraph indicates
| _another_ way in which the p-value is fundamentally the Wrong
| Thing. Suppose your null hypothesis is "people cannot
| psychically foretell the future by looking at tea leaves". If you
| test this and get a p=0.05 "positive" result, then indeed you
| should probably think it a little more likely than you previously
| did that this sort of clairvoyance is possible. But if you are a
| reasonable person, your previous opinion was a much-much-less-
| than-5% chance that tasseomancy actually works[1], and when
| someone gets a 1-in-20 positive result you should be thinking
| "oh, they got lucky", not "oh, it seems tasseomancy works after
| all".
|
| [1] By psychic powers, anyway. Some people might be good at
| predicting the future and just pretend to be doing it by reading
| tea leaves, or imagine that that's how they're doing it.
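|
| A toy version of that update, assuming a two-point comparison in
| which the test has a 5% false-positive rate and (generously) 50%
| power against "tasseomancy is real"; the numbers are invented:
|
|     # Posterior odds = prior odds * likelihood ratio.
|     prior = 1e-6                # prior P(tasseomancy works)
|     alpha, power = 0.05, 0.50   # assumed test properties
|
|     prior_odds = prior / (1 - prior)
|     lr = power / alpha          # P(+ | works) / P(+ | doesn't)
|     post_odds = prior_odds * lr
|     posterior = post_odds / (1 + post_odds)
|     print("posterior P(works) ~ %.1e" % posterior)  # ~1e-05
|     # Still "they got lucky", not "tasseomancy works after all".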
|
| 3. Model errors.
|
| And, of course, if someone purporting to read the future in their
| tea leaves does _really_ well -- maybe they get p=0.0000001 --
| this still doesn't oblige a reasonable person to start believing
| in tasseomancy. That p-value comes from assuming a particular
| model of what's going on and, again, the test makes no reference
| to any specific alternative hypothesis. If you see p=0.0000001
| then you can be pretty confident that the null hypothesis's model
| is wrong, but it could be wrong in lots of ways. For instance,
| maybe the test subject cheated; maybe that probability comes from
| assuming a normal distribution but the actual distribution is
| much heavier-tailed; maybe you're measuring something and your
| measurement process is biased, and your model assumes all the
| errors are independent; maybe there's a way for the test subject
| to get good results that doesn't require either cheating or
| psychic powers.
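|
| A small simulation of the heavy-tails failure mode, assuming the
| analyst wrongly models individual scores as Normal(0, 1) when
| they are really Cauchy-distributed (no effect in either case):
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(0)
|     n, trials, extreme = 50, 10_000, 0
|     for _ in range(trials):
|         x = rng.standard_cauchy(n)   # heavy tails, centered at 0
|         z = x.mean() * np.sqrt(n)    # z-test assuming Normal(0,1)
|         if 2 * stats.norm.sf(abs(z)) < 1e-7:
|             extreme += 1
|     print("share of null runs with p < 1e-7:", extreme / trials)
|     # A tiny p indicts the Normal(0,1) model as a whole, not just
|     # the "no effect" part of it.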
|
| None of these things is helped much by replacing p=0.05 with
| p=0.001 or p=0.25. They're fundamental problems with the whole
| idea that p-values are what we should care about in the first
| place.
|
| (I am not claiming that p-values are worthless. It _is_ sometimes
| useful to know that your test got results that are unlikely-to-
| such-and-such-a-degree to be the result of such-and-such a
| particular sort of random chance. Just so long as you are capable
| of distinguishing that from "X is effective" or "Y is good" or
| "Z is real", which it seems many people are not.)
| whimsicalism wrote:
| I feel like a lot of these critiques are just straw-manning the
| consideration of p-values.
|
| Consider effect sizes - this seems to be a completely different
| (yes, _important_) question. Obviously the magnitude of the
| impact of the drug is important - but it isn't a replacement or
| "something to look at instead of the p-value", because the
| chance that the results you saw are due to random variation is
| still important! You can see a massive effect size, but if that
| is totally expected under your null model, then it is probably
| not all that exciting!
|
| Effect sizes are a _complement_ to some sort of hypothesis
| testing, but they are not a replacement.
|
| > Prior probability.
|
| Yes, when you can effectively encode your prior probability, I
| would say the posterior probability of seeing what you did is
| at least as good as a p-value.
| gjm11 wrote:
| I agree that you shouldn't look _only_ at effect sizes any
| more than you should look _only_ at p-values. (What I would
| actually prefer you to do, where you can figure out a good
| way to do it, is to compute a posterior probability
| distribution and look at the whole distribution. Then you can
| look at its mean or median or mode or something to get a
| point estimate of effect size, you can look at how much of
| the distribution is > 0 to get something a bit like a
| p-value but arguably more useful, etc.)
|
| If anything I wrote appeared to be saying "just look at
| effect size, it's the only thing that matters" then that was
| an error on my part. I definitely didn't _intend_ to say
| that.
|
| But I was responding to an article saying "p<0.05 considered
| harmful" that _never mentions effect sizes at all_. I think
| that's enough to demonstrate that, in context, "it's bad to
| look at p-values and ignore effect sizes" is not in fact a
| straw man.
|
| Incidentally, I am not convinced that the p-value as such is
| often a good way to assess how likely it is that your results
| are due to random chance. Suppose you see an effect size of 1
| unit with p=0.05. OK, so there's a 5% chance of getting these
| results if the true effect size is zero. But you should also
| care what the chance is of getting these results if the true
| effect size is +0.1. (Maybe the distribution of errors is
| really weird and these results are _very likely_ with a
| positive but much smaller effect size; then you have good
| evidence against the null hypothesis but very weak evidence
| for an effect size of the magnitude you measured.) In fact,
| what you really want to know is what the probability is for
| every possible effect size, because that gives you the
| likelihood ratios you can use to decide how likely you think
| any given effect size is after seeing the results. For sure,
| having the p-value is better than having nothing, but if you
| were going to pick one statistic to know in addition to (say)
| a point estimate of the effect size, it's not at all clear
| that the p-value is what you should choose.
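|
| For what it's worth, a minimal sketch of the posterior-
| distribution approach I have in mind, assuming a normal
| likelihood for the observed average and a wide normal prior on
| the true effect (reusing the second drug's numbers from my
| example upthread):
|
|     import math
|     from scipy import stats
|
|     obs_mean, obs_se = 0.10, 0.05     # observed average and SE
|     prior_mean, prior_sd = 0.0, 1.0   # wide prior on the effect
|
|     # Conjugate normal-normal update for the true effect size.
|     post_var = 1 / (1 / prior_sd**2 + 1 / obs_se**2)
|     post_mean = post_var * (prior_mean / prior_sd**2
|                             + obs_mean / obs_se**2)
|     effect = stats.norm(post_mean, math.sqrt(post_var))
|
|     print("posterior mean effect:", round(post_mean, 3))
|     print("P(effect > 0)   :", round(effect.sf(0.0), 3))
|     print("P(effect > 0.05):", round(effect.sf(0.05), 3))
|     # The whole distribution answers "how big?" as well as
|     # "is it there at all?".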
| whimsicalism wrote:
| Fair enough, I don't disagree with anything you just said.
|
| It would be cool if interfaces caught up so I could just
| draw a basic estimate of my prior and then see the
| posterior graph afterwards.
| nextos wrote:
| What I have discovered after working in medicine for pretty
| long is that many biologists and MDs think p-values are a
| measure of effect sizes. Even a reviewer from _Nature_
| thought that, which is incredibly disturbing.
|
| p-values were created to facilitate rigorous inference with
| minimal computation, which was the norm during the first
| half of the 20th century. For those who work in a
| frequentist framework, inference should be done using a
| likelihood-based approach plus model selection, e.g. AIC.
| It makes it much harder to lie.
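|
| A toy illustration of that kind of likelihood comparison,
| assuming normal errors and an invented dataset
| (AIC = 2k - 2 log L, lower is better):
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(1)
|     y = rng.normal(loc=0.3, scale=1.0, size=200)  # made-up data
|
|     def aic(loglik, k):
|         return 2 * k - 2 * loglik
|
|     # Model 0: mean fixed at zero. Model 1: mean estimated.
|     s0 = np.sqrt((y ** 2).mean())     # MLE of sigma, mean = 0
|     s1 = y.std(ddof=0)                # MLE of sigma, mean free
|     ll0 = stats.norm.logpdf(y, 0.0, s0).sum()
|     ll1 = stats.norm.logpdf(y, y.mean(), s1).sum()
|
|     print("AIC, zero-mean model  :", round(aic(ll0, 1), 1))
|     print("AIC, fitted-mean model:", round(aic(ll1, 2), 1))
|     # The size of the gap, not a hard threshold, carries the
|     # information.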
| spekcular wrote:
| AIC is an estimate of prediction error. I would caution
| against using it for selecting a model for the purpose of
| inference of e.g. population parameters from some dataset
| (without producing some additional justification that
| this is a sensible thing to do). Also, uncertainty
| quantification after data-dependent model selection can
| be tricky.
|
| Best practice (as I understand it) is to fix the model
| ahead of time, before seeing the data, if possible (as in
| a randomized controlled trial of a new medicine, etc.).
| epgui wrote:
| You can always encode the prior. If you take the frequentist
| approach and ignore bayesian concepts, it's the same as just
| going bayesian but with an "uninformative prior" (constant
| distribution).
|
| The only question is... would you rather be up front and
| explicit about your assumptions, or not?
|
| An uninformative prior _is_ an assumption, even if it's the
| one that doesn't bias the posterior (note that here "bias" is
| not a bad word).
| whimsicalism wrote:
| There is potentially bias (of the bad word variant)
| introduced by the mismatch between the prior in your own
| mind and the distribution and params you choose to try to
| approximate that, especially if you're trying to pick out a
| distribution with a nice posterior conjugate.
|
| I'm also not sure why everyone perceived my comment as
| anti-bayesian.
| uniqueuid wrote:
| Yeah, and it's super simple to roll significance and effect
| sizes into one with confidence (or credible) intervals.
|
| Plus the interpretation is super straightforward: if the CI
| contains zero, it's not significant. If nothing else, we should
| make CIs the primary default instead of p-values.
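|
| For example, a rough sketch of reading a result through its 95%
| CI rather than its p-value (plain normal approximation; the
| summary numbers are invented):
|
|     import math
|     from scipy import stats
|
|     mean_diff, sd, n = 0.10, 0.5, 100   # invented summary stats
|     se = sd / math.sqrt(n)
|     lo, hi = stats.norm.interval(0.95, loc=mean_diff, scale=se)
|
|     print("95%% CI: [%.3f, %.3f]" % (lo, hi))
|     print("significant at 5%?", not (lo <= 0.0 <= hi))
|     # One object shows direction, magnitude, and precision.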
| hgomersall wrote:
| If you're not encoding your prior probability, you're just
| making stuff up. You almost always have _some_ information,
| even if it's just sanity bounds.
| whimsicalism wrote:
| Trying to fit your preconception into a mathematically
| convenient conjugate distribution is not as far afield from
| making stuff up as people want to believe. Maybe it is
| better with numerical approaches.
|
| But yes, you usually do have some information and it is net
| better to encode when possible.
| uniqueuid wrote:
| I agree, and would add lack of statistical power to the list.
| Underpowered studies (e.g. small effects, noisy measurement,
| small N) decrease the chances of finding a true effect and,
| paradoxically, increase the share of positive findings that are
| false (see the simulation below).
|
| It's immensely frustrating that we haven't made a lot of
| progress since Cohen's (1962) paper.
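|
| A quick simulation of that effect, assuming 10% of the tested
| hypotheses are truly non-null and comparing a low-power design
| with a high-power one (all numbers invented):
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(2)
|
|     def false_share(n, effect=0.2, true_rate=0.10, sims=20_000):
|         # Among p < 0.05 results, how many come from true nulls?
|         real = rng.random(sims) < true_rate
|         mu = np.where(real, effect, 0.0)
|         x = rng.normal(mu[:, None], 1.0, size=(sims, n))
|         t, p = stats.ttest_1samp(x, 0.0, axis=1)
|         sig = p < 0.05
|         return (~real & sig).sum() / sig.sum()
|
|     print("n = 15 :", round(false_share(15), 2))
|     print("n = 200:", round(false_share(200), 2))
|     # With low power, a much larger share of the "significant"
|     # findings are false positives.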
| tpoacher wrote:
| > The article is all about why "0.05" might be a bad value to
| choose.
|
| No. It's more about why having a default value at all is a bad
| idea in the first place. The specific value chosen as the
| default (i.e., 0.05 in this case) is irrelevant.
|
| The idea is that the value you choose should reflect some prior
| knowledge about the problem. Therefore choosing 5% all the time
| would be somewhat analogous to a bayesian choosing a specific
| default gaussian as their prior: it defeats the point of
| choosing a prior, and it's actively harmful if it sabotages
| what you should actually be using as a prior, because it's not
| even uninformative, it's a highly opinionated prior instead.
|
| As for points 1 to 3, technically I agree, but there's a lot of
| misdirection involved.
|
| The point on effect sizes is true (and indeed something many
| people get wrong), but the example is contrived to make effect
| sizes look more useful than p-values. The obvious answer is to
| choose an example where reporting a p-value is more important
| than an effect size. One way to look at a p-value is as a
| ranking, which is useful for comparing effect sizes measured in
| incomparable units. Is a student with a 19/20 grade from a
| European school better than an American student with a 4.6 GPA?
| Reducing the compatibility scores to rankings can help you
| compare these two effect sizes immediately.
|
| Prior probability, similarly. "If" you interpret things as you
| did, then yes, p-values suck. But you're not supposed to. What
| the p-value tells you is "the data and model are 'this much'
| compatible". It's up to you then to say "and therefore" vs
| "must have been an atypical sample". In other words, there is
| still space for a prior here. And in theory, you are free to
| repurpose the compatibility score of a p-value to introduce
| this prior directly (though nobody does this in practice).
|
| Regarding p=0.05 vs p=0.001 not mattering: of course the
| difference matters, but only if they're used as compatibility
| rankings as opposed to decision thresholds. If you compare two
| models, and one has p=0.05 and the other has p=0.001, this
| tells you two things: a) they are both very incompatible with
| the data, and b) the latter is a lot more incompatible than the
| former. The problem is not that people use p-values; the
| problem is that people abuse them to make decisions that are
| not necessarily crisply supported by the p-values used to push
| them. But this could be said of any
| metric. I have actively seen people propose "deciding in favour
| of a model if the Bayes Factor is > 3". This is exactly the
| same faulty logic, and the fact that BF is somehow "bayesian"
| won't protect subsequent researchers who use this heuristic
| from entering a new reproducibility crisis.
| spekcular wrote:
| I think that if hypothesis testing is understood properly,
| these objections don't have much bite.
|
| 1. Typically we use p-values to construct confidence intervals,
| answering the concern about quantifying the effect size. (That
| is, the confidence interval is the collection of all values not
| rejected by the hypothesis test; see the sketch at the end of
| this comment.)
|
| 2. P-values control type I error. Well-powered designs control
| type I and type II error. Good control of these errors is a
| kind of minimal requirement for a statistical procedure. Your
| example shows that we should perhaps consider more than just
| these aspects, but we should certainly be suspicious of any
| procedure that doesn't have good type I and II error control.
|
| 3. This is a problem with any kind of statistical modeling, and
| is not specific to p-values. All statistical techniques make
| assumptions that generally render them invalid when violated.
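|
| To make the duality in point 1 concrete, a quick sketch,
| assuming a one-sample t-test and made-up data:
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(3)
|     x = rng.normal(0.3, 1.0, size=40)   # invented sample
|
|     # CI by inversion: keep every mu0 that the t-test fails to
|     # reject at the 5% level.
|     grid = np.linspace(-1, 1, 4001)
|     kept = [m for m in grid
|             if stats.ttest_1samp(x, m).pvalue >= 0.05]
|     print("inverted-test interval:",
|           round(min(kept), 3), round(max(kept), 3))
|
|     # The usual 95% t-interval, for comparison.
|     lo, hi = stats.t.interval(0.95, len(x) - 1,
|                               loc=x.mean(), scale=stats.sem(x))
|     print("standard 95% CI       :", round(lo, 3), round(hi, 3))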
| foxes wrote:
| That sounds like saying that if you write your C code correctly
| you won't make memory errors, when in reality it's very common
| not to write correct code.
|
| That's why Rust came along: it simply doesn't let you make that
| mistake. By analogy, the point is that maybe there's a better
| standard test to use than the p-value.
| spekcular wrote:
| How else do you propose to construct procedures that
| control type I error and evaluate their properties?
| uniqueuid wrote:
| Your points are theoretically correct, and probably the
| reason why many statisticians still regard p-values and NHST
| favorably.
|
| But looking at the practical application, in particular the
| replication crisis, specification curve analysis, de facto
| power of published studies and many more, we see that there
| is an _immense_ practical problem and p-values are not making
| it better.
|
| We need to criticize p-values and NHST hard, not because they
| cannot be used correctly, but because they _are_ not used
| correctly (and are arguably hard to use right, see the
| Gigerenzer paper I linked).
| spekcular wrote:
| The items you listed are certainly problems, but p-values
| don't have much to do with them, as far as I can see. Poor
| power is an experimental design problem, not a problem with
| the analysis technique. Not reporting all analyses is a
| data censoring problem (this is what I understand
| "specification curve analysis" to mean, based on some
| Googling - let me know if I misinterpreted). Again, this
| can't really be fixed at the analysis stage (at least
| without strong assumptions on the form of the censoring).
| The replication crisis is a combination of these two
| things, and other design issues.
| uniqueuid wrote:
| I can understand why you see it this way, but still
| disagree:
|
| (1) p-values make significance the target, and thus
| create incentives for underpowered studies, misspecified
| analyses, early stopping (monitoring significance while
| collecting data), and p-hacking.
|
| (2) p-values separate crucial pieces of information. A
| p-value represents a highly specific probability (of data at
| least as extreme as those observed, given that the null
| hypothesis is true), but does not include effect size or a
| comprehensive estimate of uncertainty. Thus, to be useful,
| p-values need to be combined with effect sizes and ideally
| simulations, specification curves, or meta-analyses.
|
| Thus my primary problem with p-values is that they are an
| incomplete solution that is too easy to use incorrectly.
| Ultimately, they just don't convey enough information in
| their single summary. CIs, for example, are just as
| simple to communicate, but much more informative.
| spekcular wrote:
| I don't understand. CIs are equivalent to computing a
| bunch of p-values, by test-interval duality. Should I
| interpret your points as critiques of simple analyses
| that only test a single point null of no effect (and go
| no further)? (I would agree that is bad.)
| uniqueuid wrote:
| Yes, I argue that individual p-values (as they are used
| almost exclusively in numerous disciplines) are bad, and
| that adding more information on effect size and errors is
| needed. CIs do that by conveying (1) significance (whether
| the interval excludes zero), (2) magnitude of effect (the
| center of the CI), and (3) error/noise (the width of the
| CI). That's _significantly_ better than a single p-value
| (excuse the pun).
| lumb63 wrote:
| It's worth pointing out that processes like meta-analyses ought
| to be able to catch #2. The problem is GIGO. A lot of poorly
| designed studies (no real control group, failure to control for
| other relevant variables, other methodological errors) make
| meta-analysis outcomes unreliable.
|
| An area I follow closely is exercise science, and I am amazed
| that researchers are able to get grants for some of the
| research they do. For instance, sometimes researchers aim to
| compare, say, the amount of hypertrophy on one training program
| versus another. They'll use a study of maybe 15 individuals.
| The group on intervention A will experience 50% higher
| hypertrophy than the group on intervention B, at p=0.08, and
| they'll conclude that there's no difference in hypertrophy
| between the two protocols rather than suggesting that the study
| lacked statistical power.
|
| Another great example is studies whose interventions fail to
| produce any muscle growth in new trainees. They'll compare two
| programs, one with, say, higher training volume and one with
| lower training volume. Both fail to produce results for some
| reason. They conclude that the variable is not important,
| rather than perhaps concluding that their interventions are
| poorly designed, since a population that is begging to gain
| muscle couldn't gain anything from them.
| sillysaurusx wrote:
| (Did HN finally increase the character limit for toplevel
| comments? Normally >2500 chars will get you booted to the very
| bottom over time. Happy to see this one at the top. It might be
| because of a 2008-era account though. Thanks for putting in a
| bunch of effort into your writing!)
| Joel_Mckay wrote:
| Still not significant:
|
| https://mchankins.wordpress.com/2013/04/21/still-not-signifi...
|
| Nothing like spending 10 minutes reading a paper to see results
| which are likely nonsense. However, it pales in comparison to
| spending 3 weeks trying to replicate popular works... only to
| find it doesn't generalize... you know that ROC was likely from
| cooked data-sets confounded with systematic compression artifact
| errors... likely not harmful, but certainly irritating. lol =)
| ttpphd wrote:
| What a bananas list! Thanks for sharing it.
| revision17 wrote:
| The American Statistical Association also released a statement
| on p-values:
|
| https://www.tandfonline.com/doi/epdf/10.1080/00031305.2016.1...
| uniqueuid wrote:
| So much has been said about p-values and null hypothesis
| significance testing (NHST) that one blog post probably won't
| change anybody's opinion.
|
| But I want to recommend the wonderful paper "the null ritual" [1]
| by Gerd Gigerenzer et al. It shows that a precise understanding
| of what a p-value means is extremely rare even among statistics
| lecturers, and, more interestingly, that there have been
| fundamentally different understandings ever since p-values were
| invented, e.g. between Fisher and Neyman-Pearson.
|
| Beyond that, there was a recent special issue of The American
| Statistician discussing at length the issues with p-values in
| general and .05 in particular [2].
|
| Personally, I of course feel that null hypothesis testing is
| often stupid and harmful, because null effects never exist in
| reality, effects without magnitude are useless, and because the
| "statelessness" of NHST creates or exacerbates problems such as
| publication bias and lack of power.
|
| [1] http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf
|
| [2] https://www.tandfonline.com/toc/utas20/73/sup1
| bookish wrote:
| Thanks for that. I'll give [1] a read. I'm familiar with [2],
| and cited one of those papers in the blog.
|
| About the stupid or harmful nature of null hypothesis testing
| in general: what do you recommend instead for decision making
| and for summarizing uncertainty, in the scenario of large (yet
| fast-moving) organizations where most people have little stats
| background?
| uniqueuid wrote:
| Thanks, I hope you find Gigerenzer useful. The paper is a bit
| academic, but he also wrote a couple of nice popular science
| books on the (mis-)perception of numbers and statistics;
| those might be useful in a business environment.
|
| For real-world applications outside engineering and academia,
| I would rely heavily on confidence intervals and/or
| confidence bands. For example, the packages from easystats
| [1] in R have quite a few very useful visualization
| functions, which make it very easy to interpret results of
| statistical tests. You can even get a precise textual
| description, but then again, that's intended for papers and
| not for a wider audience.
|
| Apart from that, I would mainly echo recommendations from
| people like Andrew Gelman, John Tukey, Edward Tufte etc.:
| Visuals are extremely useful and contain a lot of data. Use
| e.g. scatterplots with jittered points to show raw data and
| the goodness of fit. People will intuitively make more of it
| than of a single p-value.
|
| [1] https://easystats.github.io/easystats/
| bookish wrote:
| Totally agree about visualization, and that those authors
| are great advocates for it. Confidence intervals are
| definitely much more informative and intuitive than
| p-values.
|
| Would the policy be "look at our confidence intervals later
| and then decide what to do"? One remaining issue is how to
| have consistent decision criteria, and to convey them ahead
| of time. Imagine a context with 10-50 teams at a company
| that run experiments, where the teams are implicitly
| incentivized to find ways to report their experiments as
| successful. Quantified criteria can be helpful in
| minimizing that bad incentive.
| tpoacher wrote:
| Thanks for this, looking forward to reading it! I would also
| recommend in turn the paper by Greenland et al (2016) called
| "Statistical tests, P values, confidence intervals, and power:
| a guide to misinterpretations.". It's a great read.
|
| I recently had a chat with a colleague who has "abandoned
| p-values for bayes factors" in their research, on how, in
| principle, there's nothing stopping you from having a
| "bayesian" p-value (i.e. the definition of the p-value at its
| most general can easily accommodate bayesian inference, priors,
| posteriors, etc). The counter-retort was more or less "no it
| can't, educate yourself, bayes factors are better" and didn't
| want to hear about it. It made me sad.
|
| p-values are an incredibly insightful device. But because most
| people (ab)use them in the same way most people abuse normality
| assumptions or treat ordinal scales as continuous, they've
| gotten a bad rep and now mean something entirely different to
| most people by default.
| pasc1878 wrote:
| xkcd has commented on those differing p-values
| https://xkcd.com/1478/
|
| and shown an example of marketing using p-values
| https://xkcd.com/882/
| bookish wrote:
| Classics!
| clircle wrote:
| Maybe tech industry insiders can tell me this ... but do real
| people actually make product decisions based solely on p < 0.05?
| Seems like the author is writing about a contrived problem.
| candiddevmike wrote:
| Product decisions are made based on someone's gut instinct. p
| values are mostly used when (or abused until) they align with
| that instinct, if they are being considered at all.
| bookish wrote:
| The "stasis" and "arbitrarily adjustments" regimes that I wrote
| about are certainly ones that I've seen, which don't rely
| solely on p < 0.05 but are still pretty suboptimal.
| Furthermore, it's not only about whether 0.05 is the sole
| criteria, but also about whether it's a useful criteria for us
| to highlight at all, depending on whether the anchoring effect
| of it is damaging relative to alternatives.
|
| But let me turn that around and ask: what product decision
| regime do you see most often or think would be the most
| relevant to use as an example? I'd be happy to hear your
| perspective and make sure I keep it in mind for future blogs.
| ftxbro wrote:
| Hi, I wrote another response in this thread where I called
| you a techbro, so sorry about that; I probably count as one
| too, and I wasn't trying to insult you too much.
|
| Anyway when you say "Furthermore, it's not only about whether
| 0.05 is the sole criteria, but also about whether it's a
| useful criteria for us to highlight at all, depending on
| whether the anchoring effect of it is damaging relative to
| alternatives." I love this analogy of the null hypothesis vs.
| alternative hypotheses in frequentist statistics to the
| 'anchoring effect' cognitive bias that you try to work to
| your advantage in marketing and sales and negotiation or
| management.
| https://en.wikipedia.org/wiki/Anchoring_(cognitive_bias)
|
| If you don't want to be tied to p-values and you only care
| about downstream decisions rather than quantifying beliefs,
| you can use some ideas in decision theory
| https://en.wikipedia.org/wiki/Decision_theory
|
| For example, maybe you are deciding between two alternative
| ways of doing something. You don't know which one is better,
| and you are confronted with not only the decision of which
| one to use, but also with the decision of whether to
| experiment (with A/B testing for example) to be more sure of
| which one is right, versus whether to exploit the one that
| you currently think is better. This is the multi-armed bandit
| problem, and it doesn't necessarily use p-values, so your
| intuition is right!
| https://en.wikipedia.org/wiki/Multi-armed_bandit
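|
| If it helps, a minimal Thompson-sampling sketch of that
| explore/exploit tradeoff, assuming two variants with unknown
| conversion rates (all numbers invented):
|
|     import numpy as np
|
|     rng = np.random.default_rng(4)
|     true_rates = [0.10, 0.12]   # unknown to the decision maker
|     wins = np.ones(2)           # Beta(1, 1) prior on each arm
|     losses = np.ones(2)
|
|     for _ in range(5_000):
|         # Sample a plausible rate per arm, play the best-looking.
|         draws = rng.beta(wins, losses)
|         arm = int(np.argmax(draws))
|         reward = rng.random() < true_rates[arm]
|         wins[arm] += reward
|         losses[arm] += 1 - reward
|
|     print("plays per arm:", (wins + losses - 2).astype(int))
|     # Traffic drifts toward the better arm without any fixed
|     # p < 0.05 gate.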
|
| Maybe that's not your situation. Maybe your situation is that
| you have an existing business process and you want to know
| whether to switch to one that might be better. Someone might
| say it's a p-value problem, but again I agree with your
| intuition that it really isn't best to think about it that
| way, especially when there is a cost to switching. Instead,
| it's a more complicated decision that depends on what is the
| switching cost, how much better you think the new process
| would be (including uncertainty of it), and what kind of
| business horizon you care about. There might even be a
| multi-armed bandit effect in this situation too, where you
| _also_ have to weigh the costs of reducing your uncertainty
| about the switching improvement, or even of reducing the
| uncertainty about the switching cost itself.
|
| Anyway, these problems do involve concepts from probability
| and statistics but it's for sure true that the decisions
| don't always reduce to P < 0.05 at the end! Good luck and best
| wishes living your best techbro life!
| bookish wrote:
| I've been called worse things, and at this stage in life it
| is hard to be offended by people who don't know me well
| enough to give an insightful insult. But I hadn't responded
| to the earlier comment because it was more generally
| antagonistic and seemed to reflect a (possibly intentional?)
| misreading and negative take on the blog. "Don't feed the
| trolls", as the saying goes.
|
| I started writing that particular blog with different
| experimental methods in mind, but wrote so much as a
| prerequisite that I wanted to stop and make that first part
| a standalone post. My last paragraph was supposed to make
| it clear that this was a launching-off point, rather than a
| summation of everything I know about decision theory or
| experimentation.
|
| Thanks for writing more substance into this comment. These
| opinions are sensible, I agree with much that you wrote
| here. On one: I do use MABs and see them as a feasible
| method for many organizations, and at some point I'd like
| to write about some challenges with those too.
|
| Wishing you the best in your techbro or ftxbro or
| <whatever>bro life too!
| leni536 wrote:
| Bold of you to assume that product decisions are based on
| rigorous statistical tests.
|
| Jokes aside, for product decisions (or all kinds of decisions,
| really) you should differentiate between statistical
| significance and relevance. A small measured difference in some
| metric, even if statistically significant to p < 0.000001, may
| not be relevant for a decision.
| regularfry wrote:
| Yep. I've seen "rigorous" A/B testing regimes set up based on a
| p<0.05 requirement that then went straight into "we'll keep
| running this until we reach significance" and "this one's
| clearly trending towards significance, we should just implement
| it now" nonsense.
| belter wrote:
| There is a joke among mathematicians: the confidence interval
| is nothing more than the interval between when you first learn
| about it and think you understand it, and when you finally
| realize what it really means.
| hsjqllzlfkf wrote:
| P < 0.05 considered harmful 5% of the time.
| kneebonian wrote:
| The thing that made me realize how ineffective P < 0.05 was, was
| playing D&D, because a 5% chance of something happening is the
| same as the odds of rolling a 1 on a 20-sided die, which happens
| surprisingly frequently once you roll it more than a couple of
| times.
|
| Also XCOM taught me that 98% != 100%
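|
| For reference, the arithmetic behind "surprisingly frequently"
| (chance of at least one natural 1 in n rolls of a d20):
|
|     for n in (1, 5, 10, 20):
|         p = 1 - (19 / 20) ** n
|         print("%2d rolls: %.0f%%" % (n, 100 * p))
|     # 1 roll: 5%; 5 rolls: ~23%; 10 rolls: ~40%; 20 rolls: ~64%.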
| galkk wrote:
| Xcom isn't a good example because the game actively lies to you
| with displayed probabilities
|
| https://youtu.be/l0KEDYFWbVc
| kneebonian wrote:
| "That's XCOM baby!"
| kibwen wrote:
| I've seen no source that shows that Xcom fudges its displayed
| hit chances. You may be thinking of Fire Emblem, whose games
| use a variety of well-documented approaches to fudging their
| rolls: https://fireemblemwiki.org/wiki/True_hit
| reibitto wrote:
| I know at least XCOM 2 does on certain difficulties. The
| aim assist values are directly in the INI files (it fudges
| the numbers in your favor for lower difficulties). Here are
| instructions on how to remove the aim assists:
| https://steamcommunity.com/sharedfiles/filedetails/?id=61799...
| pphysch wrote:
| Most big games implement "randomness" with "pseudorandomness"
| in the name of controlling variance of outcome, chopping off
| the long tail
| C-x_C-f wrote:
| Do pseudorandom distributions have chopped tails? Wouldn't
| that go against the definition of pseudorandom?
| pphysch wrote:
| To be clear it's a mix of pseudorandomness and
| "procedural randomness" to increase "fairness"
| the_af wrote:
| Is there any game that does NOT use pseudorandom
| generators? And does this significantly change
| probabilities?
| pphysch wrote:
| Some games I follow talked about "switching to
| pseudorandomness" but that may be misuse of terminology.
|
| There's an argument for competitive games to use true
| randomness to eliminate any possibility of abuse, but I'm
| not aware of specific examples.
| burnished wrote:
| Think there is confusion here - if I understand correctly,
| you are asking about the number generator, while they are
| talking about the process of determining success. In
| League of Legends, for example, you have a listed crit
| chance, but the way they determine success isn't to
| generate a number and compare it to your chance; you start
| with a smaller base number that gets incremented each time
| you fail and reset when you succeed. The end result is that
| your overall chance remains the same but the likelihood of
| a streak (of fails or successes) goes down.
|
| Doesn't change the overall probability, drastically
| reduces variance.
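|
| A rough sketch of that incrementing-chance mechanism (the real
| game tunes the increment so the long-run rate matches the
| listed one; the constant here is only for illustration):
|
|     import random
|
|     def prd_rate(increment, n, seed=0):
|         # Chance starts at `increment`, grows by it on each
|         # miss, and resets on a hit.
|         rng = random.Random(seed)
|         chance, hits = increment, 0
|         for _ in range(n):
|             if rng.random() < chance:
|                 hits += 1
|                 chance = increment
|             else:
|                 chance += increment
|         return hits / n
|
|     print("long-run hit rate:", round(prd_rate(0.06, 200_000), 3))
|     # Long miss streaks become rare, so variance drops while the
|     # long-run rate stays near a value set by the increment.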
| PeterisP wrote:
| Some games (both computer games and physical board games)
| intentionally use "shuffled randomness" where e.g. for
| percentile fail/success rolls you'd take numbers from
| 1-100 and use true randomness to shuffle that list; in
| this way the overall probability is the same, but has a
| substantially different feel as it's impossible for
| someone to have bad/good luck throughout the whole game
| and things like "gambler's fallacy" which are false for
| actual randomness become true.
| teddyh wrote:
| > _the overall probability is the same_
|
| Only for the very first roll. After that, the outcome
| becomes more and more predictable for each roll.
| the_af wrote:
| Can you elaborate on how XCOM lies? I often suspected this
| (but you can never be sure, since human intuition is bad at
| probabilities). Is there hard evidence?
| galkk wrote:
| https://youtu.be/l0KEDYFWbVc
| fauxpause_ wrote:
| I believe it fudges the numbers to give you better than
| expected results on lower difficulties.
| yunruse wrote:
| 2 sigma is fine, but the benefit of modern technology is that
| we can tell exactly how many standard deviations away from the
| null hypothesis our results lie. Particle physics holds itself
| to an "industry standard" of 5 sigma, for example.
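|
| For reference, the sigma-to-p conversion is a one-liner,
| assuming a normal tail (one- and two-sided shown):
|
|     from scipy import stats
|
|     for sigma in (2, 3, 5):
|         one = stats.norm.sf(sigma)
|         print("%d sigma: one-sided p = %.1e, two-sided p = %.1e"
|               % (sigma, one, 2 * one))
|     # 5 sigma one-sided is roughly p = 2.9e-7.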
|
| The real conversation to be had is -- what sigma level will we
| tolerate? Is this something we'll keep doing, and thus
| A/B test ourselves into a (perhaps quite horrifying) local
| minimum based on random noise? Is this a single great experiment?
| Are lives on the line? Will these results be taken Quite
| Seriously? Is this a test to say "look, this is worth further
| investigation"?
|
| I'm no statistician but in my opinion this is the first
| conversation that needs to be had when doing an experiment: what
| p / sigma levels are satisfactory to claim confidence? p<0.05 is
| a
| decent heuristic for some experiments, but not all.
| peteradio wrote:
| It's 5 sigma for discovery, 2 sigma for null hypothesis
| "confirmation".
| cozzyd wrote:
| and 3\sigma for "evidence" (i.e. enough for publication...)
| ftxbro wrote:
| This post is amazing.
|
| When I see a post "P < 0.05 Considered Harmful" I think OK now
| they are going to talk about maybe Bonferroni vs. other multiple
| hypothesis corrections if they are a frequentist or otherwise
| they are going to try to explain Bayesian things.
|
| But no, this one isn't from a frequentist, or from a Bayesian.
| It's from a techbro whose solution isn't any kind of multiple
| hypothesis correction or getting Bayes-pilled, it's to say "Why
| not just admit you want something akin to p = 0.25 in the first
| place?" for 'ship criteria' in the only stats context he appears
| to know which is A/B testing, talking about Maslow hierarchy and
| namedropping Hula and Netflix. It's seriously like some Silicon
| Valley parody. Wait is it actually a satire blog?
| thebestgamers wrote:
| [flagged]
| whimsicalism wrote:
| > "Why not just admit you want something akin to p = 0.25 in
| the first place?" for 'ship criteria'
|
| That's culture shock for me - I guess this is why I don't work
| at startups.
| bookish wrote:
| It all depends on what you're experimenting on. I do think
| there are many teams out there who are (in effect) making
| decisions with less certainty than this, but they wouldn't
| want to actually quantify it.
| mlyle wrote:
| p=0.2 doesn't work too well for ship criteria for medicine.
|
| p=0.2 for "this reordering of the landing text improves the
| rate conversion events" is fine. People make changes based on
| less information all the time. Waiting for certainty has its
| own expenses.
| [deleted]
| staunton wrote:
| The fundamental error that causes misuse of p-values (and
| statistics in general) is misunderstanding what statistics is in
| the first place. Statistics is applied epistemology. There is no
| algorithm that fits all situations where you're trying to think
| about and learn new things. It's just hard and we have to deal
| with that.
|
| Arguably, the main point of p-values is trying to prevent people
| who really know better from "cheating" by reporting "interesting"
| results from very low sample sizes. Having a very rigid framework
| as a rejection criterion helps with this. However, the scientific
| community and system are not capable of dealing with "real
| cheating" that includes fabrication of data. Also, any such rigid
| metric is going to be gamed. Of course, there are some people who
| would also cheat themselves and maybe learning about p-values
| makes this less frequent. But such people using them don't
| understand what those values are telling them, beyond "yes, I can
| publish this". This is counterproductive because it prevents deep
| thinking and thorough investigation.
|
| Most scientists realize that in practice there is no simple list
| of objective criteria that tells you what experiments to perform
| and how exactly to interpret the results. This takes a lot of
| work, careful thinking, trying different things and very much
| benefits from collaboration. But who's got time for that? There's
| papers to publish and grant applications to write. So p-values it
| is. Or maybe some other thing eventually, that also won't solve
| the fundamental problem.
| begemotz wrote:
| Besides the Gigerenzer article mentioned below (there are others
| by the same author worth reading e.g. 'Mindless Statistics'), I
| would recommend:
|
| 'Moving to a world beyond "p <0.05" by Wasserstein et al in the
| American Statistician
| https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1...
|
| as well as the classic article by Jacob Cohen, "The Earth Is
| Round (p < .05)".
|
| YMMV, but in certain disciplines, "statistical analysis" is
| still little more than checking for p-values and applying a
| binary decision rule.
|
| And that is without even getting into the shaky theoretical
| ground of NHST as practiced.
___________________________________________________________________
(page generated 2023-04-10 23:01 UTC)