[HN Gopher] P < 0.05 Considered Harmful
       ___________________________________________________________________
        
       P < 0.05 Considered Harmful
        
       Author : bookish
       Score  : 104 points
       Date   : 2023-04-10 16:00 UTC (7 hours ago)
        
 (HTM) web link (simplicityissota.substack.com)
 (TXT) w3m dump (simplicityissota.substack.com)
        
       | gjm11 wrote:
       | The article is all about why "0.05" might be a bad value to
       | choose. But, more fundamentally, _p_ is often the wrong thing to
       | be looking at in the first place.
       | 
       | 1. Effect sizes.
       | 
       | Suppose you are a doctor or a patient and you are interested in
       | two drugs. Both are known to be safe (maybe they've been used for
       | decades for some problem other than the one you're now facing).
       | As for efficacy against the problem you have, one has been tried
       | on 10000 people, and it gave an average benefit of 0.02 units
       | with a standard deviation of 1 unit, on some 5-point scale. So
       | the standard deviation of the average over 10k people is about
       | 0.01 units, the average benefit is about 2 sigma, and p is about
       | 0.05. Very nice.
       | 
       | The other drug has only been tested on 100 people. It gave an
       | average benefit of 0.1 unit with a standard deviation of 0.5
       | units. Standard deviation of average is about 0.05, average
       | benefit is about 2 sigma, p is again about 0.05.
       | 
       | Are these two interventions _equally promising_? Heck no. The
       | first one almost certainly does very little on average, and does
       | substantially more harm than good about half the time. The second
       | one is probably about 5x better on average, and seems to be less
       | likely to harm you. It's more _uncertain_ because the sample
       | size is smaller, and for sure we should do a study with more
       | patients to nail it down better, but I would definitely prefer
       | the second drug.
       | 
       | (With those very large standard deviations, if the second drug
       | didn't help me I would want to try the first one, in case I'm one
       | of the lucky people it gives > 1 unit of benefit to. But it might
       | well be > 1 unit of harm instead.)
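       | 
       | (A minimal sketch in Python to check the arithmetic above;
       | scipy is my own choice of tool, not anything from the article.
       | Both drugs land at roughly z = 2 and p = 0.05 despite a 5x
       | difference in average benefit.)
       | 
       |     # Rough check of the numbers above (two-sided z-test).
       |     from math import sqrt
       |     from scipy.stats import norm
       | 
       |     def z_and_p(mean, sd, n):
       |         se = sd / sqrt(n)        # std. error of the mean
       |         z = mean / se            # how many SEs from zero
       |         p = 2 * norm.sf(abs(z))  # two-sided p-value
       |         return z, p
       | 
       |     print(z_and_p(0.02, 1.0, 10_000))  # ~ (2.0, 0.046)
       |     print(z_and_p(0.10, 0.5, 100))     # ~ (2.0, 0.046)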
       | 
       | Looking only at p-values means only caring about effect size in
       | so far as it affects how confident you are that there's any
       | effect at all. (Or, e.g., any _improvement_ on the previous
       | best.) But usually you do, in fact, care about the effect size
       | too.
       | 
       | Here's another way to think about this. When computing a p-value,
       | you are asking "if the null hypothesis is true, how likely are
       | results like the ones we actually got?". That's a reasonable
       | question. But you will notice that it _makes no reference at all
       | to any not-null hypothesis_. p < 0.05 means something a bit like
       | "the null hypothesis is probably wrong" (though that is _not_ in
       | fact quite what it means) but usually you also care _how_ wrong
       | it is, and the p-value won't tell you that.
       | 
       | 2. Prior probability.
       | 
       | The parenthetical remark in the last paragraph indicates
       | _another_ way in which the p-value is fundamentally the Wrong
       | Thing. Suppose your null hypothesis is "people cannot
       | psychically foretell the future by looking at tea leaves". If you
       | test this and get a p=0.05 "positive" result, then indeed you
       | should probably think it a little more likely than you previously
       | did that this sort of clairvoyance is possible. But if you are a
       | reasonable person, your previous opinion was a much-much-less-
       | than-5% chance that tasseomancy actually works[1], and when
       | someone gets a 1-in-20 positive result you should be thinking
       | "oh, they got lucky", not "oh, it seems tasseomancy works after
       | all".
       | 
       | [1] By psychic powers, anyway. Some people might be good at
       | predicting the future and just pretend to be doing it by reading
       | tea leaves, or imagine that that's how they're doing it.
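       | 
       | (To put rough numbers on that, here's a back-of-the-envelope
       | Bayes update in Python; the one-in-a-million prior and the 80%
       | power figure are invented for illustration, not from any real
       | study.)
       | 
       |     # Posterior that tasseomancy works after one
       |     # "significant" result.
       |     prior = 1e-6   # prior belief it works (assumed)
       |     power = 0.8    # chance of a positive if it works
       |     alpha = 0.05   # chance of a positive if it doesn't
       | 
       |     num = prior * power
       |     den = prior * power + (1 - prior) * alpha
       |     print(num / den)   # ~1.6e-5: still essentially zero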
       | 
       | 3. Model errors.
       | 
       | And, of course, if someone purporting to read the future in their
       | tea leaves does _really_ well -- maybe they get p=0.0000001 --
       | this still doesn't oblige a reasonable person to start believing
       | in tasseomancy. That p-value comes from assuming a particular
       | model of what's going on and, again, the test makes no reference
       | to any specific alternative hypothesis. If you see p=0.0000001
       | then you can be pretty confident that the null hypothesis's model
       | is wrong, but it could be wrong in lots of ways. For instance,
       | maybe the test subject cheated; maybe that probability comes from
       | assuming a normal distribution but the actual distribution is
       | much heavier-tailed; maybe you're measuring something and your
       | measurement process is biased, and your model assumes all the
       | errors are independent; maybe there's a way for the test subject
       | to get good results that doesn't require either cheating or
       | psychic powers.
       | 
       | None of these things is helped much by replacing p=0.05 with
       | p=0.001 or p=0.25. They're fundamental problems with the whole
       | idea that p-values are what we should care about in the first
       | place.
       | 
       | (I am not claiming that p-values are worthless. It _is_ sometimes
       | useful to know that your test got results that are unlikely-to-
       | such-and-such-a-degree to be the result of such-and-such a
       | particular sort of random chance. Just so long as you are capable
       | of distinguishing that from "X is effective" or "Y is good" or
       | "Z is real", which it seems many people are not.)
        
         | whimsicalism wrote:
         | I feel like a lot of these critiques are just straw-manning
         | p-values consideration.
         | 
         | Consider effect sizes - this seems to be a completely different
         | (yes _important_) question. Obviously the magnitude of the
         | impact of the drug is important - but it isn't a replacement or
         | "something to look at instead of p-value" because the chance
         | that the results you saw are due to random variation is still
         | important! You can see a massive effect size but if that is
         | totally expected within your null model, then it is probably
         | not all that exciting!
         | 
         | Effect sizes are a _complement_ to some sort of hypothesis
         | testing, but they are not a replacement.
         | 
         | > Prior probability.
         | 
         | yes, when you can effectively encode your prior probability I
         | would say the posterior probability of seeing what you did is
         | at least as good as p-value.
        
           | gjm11 wrote:
           | I agree that you shouldn't look _only_ at effect sizes any
           | more than you should look _only_ at p-values. (What I would
           | actually prefer you to do, where you can figure out a good
           | way to do it, is to compute a posterior probability
           | distribution and look at the whole distribution. Then you can
           | look at its mean or median or mode or something to get a
           | point estimate of effect size, you can look at how much of
           | the distribution is > 0 to get something a bit like a
           | p-value but arguably more useful, etc.)
           | 
           | If anything I wrote appeared to be saying "just look at
           | effect size, it's the only thing that matters" then that was
           | an error on my part. I definitely didn't _intend_ to say
           | that.
           | 
           | But I was responding to an article saying "p<0.05 considered
           | harmful" that _never mentions effect sizes at all_. I think
           | that's enough to demonstrate that, in context, "it's bad to
           | look at p-values and ignore effect sizes" is not in fact a
           | straw man.
           | 
           | Incidentally, I am not convinced that the p-value as such is
           | often a good way to assess how likely it is that your results
           | are due to random chance. Suppose you see an effect size of 1
           | unit with p=0.05. OK, so there's a 5% chance of getting
           | results at least this extreme if the true effect size is
           | zero. But you should also
           | care what the chance is of getting these results if the true
           | effect size is +0.1. (Maybe the distribution of errors is
           | really weird and these results are _very likely_ with a
           | positive but much smaller effect size; then you have good
           | evidence against the null hypothesis but very weak evidence
           | for an effect size of the magnitude you measured.) In fact,
           | what you really want to know is what the probability is for
           | every possible effect size, because that gives you the
           | likelihood ratios you can use to decide how likely you think
           | any given effect size is after seeing the results. For sure,
           | having the p-value is better than having nothing, but if you
           | were going to pick one statistic to know in addition to (say)
           | a point estimate of the effect size, it's not at all clear
           | that the p-value is what you should choose.
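           | 
           | (For concreteness, a tiny grid-approximation sketch in
           | Python of the kind of posterior I mean; the observed
           | value, the standard error, and the flat prior are all
           | placeholders.)
           | 
           |     # Grid posterior for the effect size, assuming the
           |     # estimate 1.0 with standard error 0.5 is normal.
           |     import numpy as np
           |     from scipy.stats import norm
           | 
           |     grid = np.linspace(-2, 4, 1201)   # candidate effects
           |     like = norm.pdf(1.0, loc=grid, scale=0.5)
           |     post = like / like.sum()          # flat prior
           | 
           |     mean = (grid * post).sum()        # point estimate, ~1.0
           |     p_pos = post[grid > 0].sum()      # P(effect > 0), ~0.98
           |     print(mean, p_pos)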
        
             | whimsicalism wrote:
             | Fair enough, I don't disagree with anything you just said.
             | 
             | It would be cool if interfaces caught up so I could just
             | draw a basic estimate of my prior and then see the
             | posterior graph afterwards.
        
             | nextos wrote:
             | What I have discovered after working in medicine for a
             | pretty long time is that many biologists and MDs think
             | p-values are a
             | measure of effect sizes. Even a reviewer from _Nature_
             | thought that, which is incredibly disturbing.
             | 
             | p-values were created to facilitate rigorous inference with
             | minimal computation, which was the norm during the first
             | half of the 20th century. For those who work on a
             | frequentist framework, inference should be done using a
             | likelihood-based approach plus model selection, e.g. AIC.
             | It makes it much harder to lie.
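             | 
             | (A toy illustration in Python of what AIC-based model
             | selection looks like mechanically; statsmodels is my
             | own choice of library and the data are simulated.)
             | 
             |     # Compare an intercept-only model against
             |     # intercept + slope by AIC on simulated data.
             |     import numpy as np
             |     import statsmodels.api as sm
             | 
             |     rng = np.random.default_rng(0)
             |     x = rng.normal(size=200)
             |     y = 0.3 * x + rng.normal(size=200)
             | 
             |     m1 = sm.OLS(y, np.ones((200, 1))).fit()
             |     m2 = sm.OLS(y, sm.add_constant(x)).fit()
             |     print(m1.aic, m2.aic)   # lower AIC preferred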
        
               | spekcular wrote:
               | AIC is an estimate of prediction error. I would caution
               | against using it for selecting a model for the purpose of
               | inference of e.g. population parameters from some dataset
               | (without producing some additional justification that
               | this is a sensible thing to do). Also, uncertainty
               | quantification after data-dependent model selection can
               | be tricky.
               | 
               | Best practice (as I understand it) is to fix the model
               | ahead of time, before seeing the data, if possible (as in
               | a randomized controlled trial of a new medicine, etc.).
        
           | epgui wrote:
           | You can always encode the prior. If you take the frequentist
           | approach and ignore bayesian concepts, it's the same as just
           | going bayesian but with an "uninformative prior" (constant
           | distribution).
           | 
           | The only question is... would you rather be up front and
           | explicit about your assumptions, or not?
           | 
           | An uninformative prior _is_ an assumption, even if it's the
           | one that doesn't bias the posterior (note that here "bias" is
           | not a bad word).
        
             | whimsicalism wrote:
             | There is potentially bias (of the bad word variant)
             | introduced by the mismatch between the prior in your own
             | mind and the distribution and params you choose to try to
             | approximate that, especially if you're trying to pick out a
             | distribution with a nice posterior conjugate.
             | 
             | I'm also not sure why everyone perceived my comment as
             | anti-bayesian.
        
           | uniqueuid wrote:
           | Yeah, and it's super simple to roll significance and effect
           | sizes into one with confidence (or credible) intervals.
           | 
           | Plus the interpretation is straightforward: if the CI
           | contains zero, the result is not significant. If nothing
           | else, we should make CIs the primary default instead of
           | p-values.
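           | 
           | (A minimal sketch in Python, reusing the made-up numbers
           | for the second drug upthread: one interval carries the
           | significance call, the magnitude, and the uncertainty.)
           | 
           |     # 95% CI for a mean benefit from a point estimate
           |     # and its standard error (normal approximation).
           |     from scipy.stats import norm
           | 
           |     est, se = 0.10, 0.05
           |     z = norm.ppf(0.975)            # ~1.96
           |     lo, hi = est - z * se, est + z * se
           |     print(lo, hi)   # ~0.002, ~0.198: excludes zero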
        
           | hgomersall wrote:
           | If you're not encoding your prior probability, you're just
           | making stuff up. You almost always have _some_ information,
           | even if it's just sanity bounds.
        
             | whimsicalism wrote:
             | Trying to fit your preconception into a mathematically
             | convenient conjugate distribution is not as far afield from
             | making stuff up as people want to believe. Maybe it is
             | better with numerical approaches.
             | 
             | But yes, you usually do have some information and it is net
             | better to encode when possible.
        
         | uniqueuid wrote:
         | I agree, and would add lack of statistical power to the list.
         | Underpowered studies (e.g. small effects, noisy measurement,
         | small N) decrease the chances of finding a true effect, and
         | paradoxically increase the share of significant findings
         | that are false positives.
         | 
         | It's immensely frustrating that we haven't made a lot of
         | progress since Cohen's (1962) paper.
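         | 
         | (A quick back-of-the-envelope in Python; the 10% base rate
         | and 20% power are invented, but they show how low power plus
         | a significance filter inflates the false-positive share.)
         | 
         |     # Of all "significant" results, how many are false?
         |     alpha, power, base_rate = 0.05, 0.2, 0.10
         | 
         |     true_hits = base_rate * power          # real, detected
         |     false_hits = (1 - base_rate) * alpha   # null, "detected"
         |     fdp = false_hits / (true_hits + false_hits)
         |     print(fdp)   # ~0.69 of significant findings are false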
        
         | tpoacher wrote:
         | > The article is all about why "0.05" might be a bad value to
         | choose.
         | 
         | No. It's more about why choosing a value to serve as the
         | default choice is a bad idea in the first place. The specific
         | value chosen as the default itself (i.e., 0.05 in this case) is
         | irrelevant.
         | 
         | The idea is that the value you choose should reflect some prior
         | knowledge about the problem. Therefore choosing 5% all the time
         | would be somewhat analogous to a bayesian choosing a specific
         | default gaussian as their prior: it defeats the point of
         | choosing a prior, and it's actively harmful if it sabotages
         | what you should actually be using as a prior, because it's not
         | even uninformative, it's a highly opinionated prior instead.
         | 
         | As for points 1 to 3, technically I agree, but there's a lot of
         | misdirection involved.
         | 
         | The point on effect sizes is true (and indeed something many
         | people get wrong), but it is contrived to make effect sizes
         | more useful than p-values. In which case, the obvious answer
         | is, you should be choosing an example where reporting a p-value
         | is more important than an effect size. One way to look at a
         | p-value is as a ranking, which would be useful for comparing
         | between effect sizes of incomparable units. Is a student with a
         | 19/20 grade from a European school better than an American
         | student with a 4.6 GPA? Reducing the compatibility scores to
         | rankings can help you compare these two effect sizes
         | immediately.
         | 
         | Prior probability, similarly. "If" you interpret things as you
         | did, then yes, p-values suck. But you're not supposed to. What
         | the p-value tells you is "the data and model are 'this much'
         | compatible". It's up to you then to say "and therefore" vs
         | "must have been an atypical sample". In other words, there is
         | still space for a prior here. And in theory, you are free to
         | repurpose the compatibility score of a p-value to introduce
         | this prior directly (though nobody does this in practice).
         | 
         | Regarding p=0.05 vs p=0.001 not mattering: of course they do.
         | But only if they're used as compatibility rankings as opposed
         | to decision thresholds. If you compare two models, and one has
         | p=0.05 and the other has p=0.001, this tells you two things: a)
         | they are both very incompatible with the data, b) the latter is
         | a lot more incompatible than the former. The problem is not
         | that people use p-values but that people abuse them,
         | to make decisions that are not necessarily crisply supported by
         | the p-values used to push them. But this could be said of any
         | metric. I have actively seen people propose "deciding in favour
         | of a model if the Bayes Factor is > 3". This is exactly the
         | same faulty logic, and the fact that BF is somehow "bayesian"
         | won't protect subsequent researchers who use this heuristic
         | from entering a new reproducibility crisis.
        
         | spekcular wrote:
         | I think if hypothesis testing is understood properly, these
         | objections don't have much bite.
         | 
         | 1. Typically we use p-values to construct confidence intervals,
         | answering the concern about quantifying the effect size. (That
         | is, the confidence interval is the collection of all values not
         | rejected by the hypothesis test.)
         | 
         | 2. P-values control type I error. Well-powered designs control
         | type I and type II error. Good control of these errors is a
         | kind of minimal requirement for a statistical procedure. Your
         | example shows that we should perhaps consider more than just
         | these aspects, but we should certainly be suspicious of any
         | procedure that doesn't have good type I and II error control.
         | 
         | 3. This is a problem with any kind of statistical modeling, and
         | is not specific to p-values. All statistical techniques make
         | assumptions that generally render them invalid when violated.
        
           | foxes wrote:
           | That sounds like saying that if you write C code correctly
           | you don't make memory errors, when in reality it's very
           | common not to write correct code.
           | 
           | That's why Rust came along, to stop that behaviour: you
           | simply can't make that mistake. Hence the point is that
           | maybe there's a better test to use than the p-value as a
           | standard.
        
             | spekcular wrote:
             | How else do you propose to construct procedures that
             | control type I error and evaluate their properties?
        
           | uniqueuid wrote:
           | Your points are theoretically correct, and probably the
           | reason why many statisticians still regard p-values and NHST
           | favorably.
           | 
           | But looking at the practical application, in particular the
           | replication crisis, specification curve analysis, de facto
           | power of published studies and many more, we see that there
           | is an _immense_ practical problem and p-values are not making
           | it better.
           | 
           | We need to criticize p-values and NHST hard, not because they
           | cannot be used correctly, but because they _are_ not used
           | correctly (and are arguably hard to use right, see the
           | Gigerenzer paper I linked).
        
             | spekcular wrote:
             | The items you listed are certainly problems, but p-values
             | don't have much to do with them, as far as I can see. Poor
             | power is an experimental design problem, not a problem with
             | the analysis technique. Not reporting all analyses is a
             | data censoring problem (this is what I understand
             | "specification curve analysis" to mean, based on some
             | Googling - let me know if I misinterpreted). Again, this
             | can't really be fixed at the analysis stage (at least
             | without strong assumptions on the form of the censoring).
             | The replication crisis is a combination of these two
             | things, and other design issues.
        
               | uniqueuid wrote:
               | I can understand why you see it this way, but still
               | disagree:
               | 
               | (1) p-values make significance the target, and thus
               | create incentives for underpowered studies, misspecified
               | analyses, early stopping (monitoring significance while
               | collecting data), and p-hacking.
               | 
               | (2) p-values separate crucial pieces of information. A
               | p-value represents a highly specific probability (of data
               | at least as extreme as that observed, given that the null
               | hypothesis is true), but does not include effect size or
               | a comprehensive estimate of
               | uncertainty. Thus, to be useful, p-values need to be
               | combined with effect sizes and ideally simulations,
               | specification curves, or meta-analyses.
               | 
               | Thus my primary problem with p-values is that they are an
               | incomplete solution that is too easy to use incorrectly.
               | Ultimately, they just don't convey enough information in
               | their single summary. CIs, for example, are just as
               | simple to communicate, but much more informative.
        
               | spekcular wrote:
               | I don't understand. CIs are equivalent to computing a
               | bunch of p-values, by test-interval duality. Should I
               | interpret your points as critiques of simple analyses
               | that only test a single point null of no effect (and go
               | no further)? (I would agree that is bad.)
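               | 
               | (A tiny numeric illustration in Python of that
               | duality; the estimate and standard error are
               | made up.)
               | 
               |     # The 95% CI is the set of null values mu0
               |     # that a two-sided z-test does NOT reject
               |     # at alpha = 0.05.
               |     import numpy as np
               |     from scipy.stats import norm
               | 
               |     est, se = 0.10, 0.05
               |     grid = np.linspace(-0.2, 0.4, 6001)
               |     p = 2 * norm.sf(np.abs(est - grid) / se)
               |     kept = grid[p > 0.05]
               |     print(kept.min(), kept.max())
               |     # ~0.002 to ~0.198, matching the usual CI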
        
               | uniqueuid wrote:
               | Yes, I argue that individual p-values (as they are used
               | almost exclusively in numerous disciplines) are bad, and
               | that more information on effect size and errors is
               | needed. CIs do that by conveying (1) significance (does
               | not include zero), (2) magnitude of effect (center of the
               | CI), and (3) errors/noise (width of CI). That's
               | _significantly_ better than a single p-value (excuse the
               | pun).
        
         | lumb63 wrote:
         | It's worth pointing out that processes like meta analyses ought
         | to be able to catch #2. The problem is, GIGO. A lot of poorly
         | designed studies (no real control group, failure to control for
         | other relevant variables, other methodological errors) make
         | meta analysis outcomes unreliable.
         | 
         | An area I follow closely is exercise science and I am amazed
         | that researchers are able to get grants for some of the
         | research they do. For instance, sometimes researchers aim to
         | compare, say, the amount of hypertrophy on one training program
         | versus another. They'll use a study of maybe 15 individuals.
         | The group on intervention A will experience 50% higher
         | hypertrophy than intervention B, at p=0.08, and they'll
         | conclude that there's no difference in hypertrophy between the
         | two protocols rather than suggesting an increase in statistical
         | power.
         | 
         | Another great example is studies whose interventions fail to
         | produce any muscle growth in new trainees. They'll compare two
         | programs, one with, say, higher training volume, and one with
         | lower training volume. Both fail to produce results for some
         | reason. They conclude that variable is not important, rather
         | than perhaps concluding that their interventions are poorly
         | designed since a population that is begging to gain muscle
         | couldn't gain anything from them.
        
         | sillysaurusx wrote:
         | (Did HN finally increase the character limit for toplevel
         | comments? Normally >2500 chars will get you booted to the very
         | bottom over time. Happy to see this one at the top. It might be
         | because of a 2008-era account though. Thanks for putting in a
         | bunch of effort into your writing!)
        
       | Joel_Mckay wrote:
       | Still not significant:
       | 
       | https://mchankins.wordpress.com/2013/04/21/still-not-signifi...
       | 
       | Nothing like spending 10 minutes reading a paper to see results
       | which are likely nonsense. However, it pales in comparison to
       | spending 3 weeks trying to replicate popular work... only to
       | find it doesn't generalize... you know that ROC was likely from
       | cooked datasets confounded with systematic compression artifact
       | errors... likely not harmful, but certainly irritating. lol =)
        
         | ttpphd wrote:
         | What a bananas list! Thanks for sharing it.
        
       | revision17 wrote:
       | The American Statistical Association also released a statement
       | on p-values:
       | 
       | https://www.tandfonline.com/doi/epdf/10.1080/00031305.2016.1...
        
       | uniqueuid wrote:
       | So much has been said about p-values and null hypothesis
       | significance testing (NHST) that one blog post probably won't
       | change anybody's opinion.
       | 
       | But I want to recommend the wonderful paper "the null ritual" [1]
       | by Gerd Gigerenzer et al. It shows that a precise understanding
       | of what a p-value means is extremely rare even among statistics
       | lecturers, and, more interestingly, that there have always been
       | fundamentally different understandings even when p-values were
       | invented, e.g. between Fisher and Neyman-Pearson.
       | 
       | Beyond that, there was a special issue recently in The American
       | Statistician discussing at length the issues of p-values in
       | general and .05 in particular [2].
       | 
       | Personally, I of course feel that null hypothesis testing is
       | often stupid and harmful, because null effects never exist in
       | reality, effects without magnitude are useless, and because the
       | "statelessness" of NHST creates or exacerbates problems such as
       | publication bias and lack of power.
       | 
       | [1] http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf
       | 
       | [2] https://www.tandfonline.com/toc/utas20/73/sup1
        
         | bookish wrote:
         | Thanks for that. I'll give [1] a read. I'm familiar with [2],
         | and cited one of those papers in the blog.
         | 
         | About the stupid or harmful nature of null hypothesis testing
         | in general, what do you recommend instead for decision making
         | and for summarization of uncertainty? In the scenario of large
         | (yet fast moving) organizations where most people will have
         | little stats background.
        
           | uniqueuid wrote:
           | Thanks, I hope you find Gigerenzer useful. The paper is a bit
           | academic, but he also wrote a couple of nice popular science
           | books on the (mis-)perception of numbers and statistics;
           | those might be useful in a business environment.
           | 
           | For real-world applications outside engineering and academia,
           | I would rely heavily on confidence intervals and/or
           | confidence bands. For example, the packages from easystats
           | [1] in R have quite a few very useful visualization
           | functions, which make it very easy to interpret results of
           | statistical tests. You can even get a precise textual
           | description, but then again, that's intended for papers and
           | not a wider audience.
           | 
           | Apart from that, I would mainly echo recommendations from
           | people like Andrew Gelman, John Tukey, Edward Tufte etc.:
           | Visuals are extremely useful and contain a lot of data. Use
           | e.g. scatterplots with jittered points to show raw data and
           | the goodness of fit. People will intuitively make more of it
           | than of a single p-value.
           | 
           | [1] https://easystats.github.io/easystats/
        
             | bookish wrote:
             | Totally agree about visualization, and that those authors
             | are great advocates for it. Confidence intervals are
             | definitely much more informative and intuitive than
             | p-values.
             | 
             | Would the policy be "look at our confidence intervals later
             | and then decide what to do"? One remaining issue is how to
             | have consistent decision criteria, and to convey it ahead
             | of time. Imagine a context with 10-50 teams at a company
             | that run experiments, where the teams are implicitly
             | incentivized to find ways to report their experiments as
             | successful. Quantified criteria can be helpful in
             | minimizing that bad incentive.
        
         | tpoacher wrote:
         | Thanks for this, looking forward to reading it! I would also
         | recommend in turn the paper by Greenland et al (2016) called
         | "Statistical tests, P values, confidence intervals, and power:
         | a guide to misinterpretations.". It's a great read.
         | 
         | I recently had a chat with a colleague who has "abandoned
         | p-values for bayes factors" in their research, on how, in
         | principle, there's nothing stopping you from having a
         | "bayesian" p-value (i.e. the definition of the p-value at its
         | most general can easily accommodate bayesian inference, priors,
         | posteriors, etc). The counter-retort was more or less "no it
         | can't, educate yourself, bayes factors are better", and they
         | didn't want to hear more about it. It made me sad.
         | 
         | p-values are an incredibly insightful device. But because most
         | people (ab)use them in the same way most people abuse
         | normality assumptions or ordinal scales as continuous, they've
         | gotten a bad rep and now mean something entirely different to
         | most people by default.
        
       | pasc1878 wrote:
       | xkcd has commented on different p-values:
       | https://xkcd.com/1478/
       | 
       | and shown an example of marketing using p-values:
       | https://xkcd.com/882/
        
         | bookish wrote:
         | Classics!
        
       | clircle wrote:
       | Maybe tech industry insiders can tell me this ... but do real
       | people actually make product decisions based solely on p < 0.05?
       | Seems like the author is writing about a contrived problem.
        
         | candiddevmike wrote:
         | Product decisions are made based on someone's gut instinct. p
         | values are mostly used when (or abused until) they align with
         | that instinct, if they are being considered at all.
        
         | bookish wrote:
         | The "stasis" and "arbitrarily adjustments" regimes that I wrote
         | about are certainly ones that I've seen, which don't rely
         | solely on p < 0.05 but are still pretty suboptimal.
         | Furthermore, it's not only about whether 0.05 is the sole
         | criterion, but also about whether it's a useful criterion
         | for us
         | to highlight at all, depending on whether the anchoring effect
         | of it is damaging relative to alternatives.
         | 
         | But let me turn that around and ask: what product decision
         | regime do you see most often or think would be the most
         | relevant to use as an example? I'd be happy to hear your
         | perspective and make sure I keep it in mind for future blogs.
        
           | ftxbro wrote:
           | Hi, I wrote another response in this thread where I called
           | you a techbro; sorry about that, I probably count as one
           | too, and I wasn't trying to insult you too much.
           | 
           | Anyway when you say "Furthermore, it's not only about whether
           | 0.05 is the sole criteria, but also about whether it's a
           | useful criteria for us to highlight at all, depending on
           | whether the anchoring effect of it is damaging relative to
           | alternatives." I love this analogy of the null hypothesis vs.
           | alternative hypotheses in frequentist statistics to the
           | 'anchoring effect' cognitive bias that you try to work to
           | your advantage in marketing and sales and negotiation or
           | management.
           | https://en.wikipedia.org/wiki/Anchoring_(cognitive_bias)
           | 
           | If you don't want to be tied to p-values and you only care
           | about downstream decisions rather than quantifying beliefs,
           | you can use some ideas in decision theory
           | https://en.wikipedia.org/wiki/Decision_theory
           | 
           | For example, maybe you are deciding between two alternative
           | ways of doing something. You don't know which one is better,
           | and you are confronted with not only the decision of which
           | one to use, but also with the decision of whether to
           | experiment (with A/B testing for example) to be more sure of
           | which one is right, versus whether to exploit the one that
           | you currently think is better. This is the multi-arm bandit
           | problem and it doesn't necessarily use p-values so your
           | intuition is right!
           | https://en.wikipedia.org/wiki/Multi-armed_bandit
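           | 
           | (For flavor, a minimal Thompson-sampling sketch in Python
           | for a two-arm Bernoulli bandit; the two conversion rates
           | are invented.)
           | 
           |     import numpy as np
           | 
           |     rng = np.random.default_rng(0)
           |     true_p = [0.05, 0.10]      # unknown to the agent
           |     wins = np.ones(2)          # Beta(1, 1) priors
           |     losses = np.ones(2)
           | 
           |     for _ in range(10_000):
           |         draws = rng.beta(wins, losses)  # sample beliefs
           |         arm = int(np.argmax(draws))
           |         reward = int(rng.random() < true_p[arm])
           |         wins[arm] += reward
           |         losses[arm] += 1 - reward
           | 
           |     # pulls per arm: allocation shifts toward arm 1
           |     print(wins + losses - 2)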
           | 
           | Maybe that's not your situation. Maybe your situation is that
           | you have an existing business process and you want to know
           | whether to switch to one that might be better. Someone might
           | say it's a p-value problem, but again I agree with your
           | intuition that it really isn't best to think about it that
           | way, especially when there is a cost to switching. Instead,
           | it's a more complicated decision that depends on what is the
           | switching cost, how much better you think the new process
           | would be (including uncertainty of it), and what kind of
           | business horizon you care about. There might even be a multi-
           | armed bandit effect again even in this situation, where you
           | _also_ have to weigh the costs of reducing your uncertainty
           | of the switching improvement or even of reducing the
           | uncertainty of the switching cost itself.
           | 
           | Anyway, these problems do involve concepts from probability
           | and statistics but it's for sure true that the decisions
           | don't always reduce to P < 0.05 at the end! Good luck best
           | wishes living your best techbro life!
        
             | bookish wrote:
             | I've been called worse things, and at this stage in life it
             | is hard to be offended by people who don't know me well
             | enough to give an insightful insult. But I hadn't responded
             | to the earlier comment because it was more generally
             | antagonistic and seemed to reflect a (possibly
               | intentional?) negative misreading of the
             | blog. "Don't feed the trolls", as the saying goes.
             | 
             | I started writing that particular blog with different
             | experimental methods in mind, but wrote so much as a
             | prerequisite that I wanted to stop and make that first part
             | a standalone post. My last paragraph was supposed to make
             | it clear that this was a launching off point, rather than a
             | summation of everything I know about decision theory or
             | experimentation.
             | 
             | Thanks for writing more substance into this comment. These
             | opinions are sensible, I agree with much that you wrote
             | here. On one: I do use MABs and see them as a feasible
             | method for many organizations, and at some point I'd like
             | to write about some challenges with those too.
             | 
             | Wishing you the best in your techbro or ftxbro or
             | <whatever>bro life too!
        
         | leni536 wrote:
         | Bold of you to assume that product decisions are based on
         | rigorous statistical tests.
         | 
         | Jokes aside, for product decisions (or all kinds of decisions,
         | really) you should differentiate between statistical
         | significance and relevance. A small measured difference in some
         | metric, even if statistically significant to p < 0.000001, may
         | not be relevant for a decision.
        
         | regularfry wrote:
         | Yep. I've seen "rigorous" A/B testing regimes set up based on a
         | p<0.05 requirement that then went straight into "we'll keep
         | running this until we reach significance" and "this one's
         | clearly trending towards significance, we should just implement
         | it now" nonsense.
        
       | belter wrote:
       | There is a joke among mathematicians: the confidence interval
       | is nothing more than the interval between first learning about
       | it, thinking you understood it, and the time it takes until you
       | realize what it really means.
        
       | hsjqllzlfkf wrote:
       | P < 0.05 considered harmful 5% of the time.
        
       | kneebonian wrote:
       | The thing that made me realize how ineffective P < 0.05 was, was
       | playing D&D, because the odds of something happening 5% of the
       | time are the same as rolling a 1 on a 20-sided die, which
       | happens surprisingly frequently once you roll it more than a
       | couple of times.
       | 
       | Also XCOM taught me that 98% != 100%
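       | 
       | (The d20 arithmetic, as a quick Python check:)
       | 
       |     # Chance of at least one natural 1 in n rolls of a d20.
       |     for n in (1, 5, 14, 20, 50):
       |         print(n, round(1 - 0.95 ** n, 2))
       |     # 1 0.05, 5 0.23, 14 0.51, 20 0.64, 50 0.92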
        
         | galkk wrote:
         | Xcom isn't a good example because the game actively lies to you
         | with displayed probabilities
         | 
         | https://youtu.be/l0KEDYFWbVc
        
           | kneebonian wrote:
           | "That's XCOM baby!"
        
           | kibwen wrote:
           | I've seen no source that shows that Xcom fudges its displayed
           | hit chances. You may be thinking of Fire Emblem, whose games
           | use a variety of well-documented approaches to fudging their
           | rolls: https://fireemblemwiki.org/wiki/True_hit
        
             | reibitto wrote:
             | I know at least XCOM 2 does on certain difficulties. The
             | aim assist values are directly in the INI files (it fudges
             | the numbers in your favor for lower difficulties). Here are
             | instructions on how to remove the aim assists:
             | https://steamcommunity.com/sharedfiles/filedetails/?id=61799...
        
           | pphysch wrote:
           | Most big games implement "randomness" with "pseudorandomness"
           | in the name of controlling variance of outcome, chopping off
           | the long tail
        
             | C-x_C-f wrote:
             | Do pseudorandom distributions have chopped tails? Wouldn't
             | that go against the definition of pseudorandom?
        
               | pphysch wrote:
               | To be clear it's a mix of pseudorandomness and
               | "procedural randomness" to increase "fairness"
        
             | the_af wrote:
             | Is there any game that does NOT use pseudorandom
             | generators? And does this significantly change
             | probabilities?
        
               | pphysch wrote:
               | Some games I follow talked about "switching to
               | pseudorandomness" but that may be misuse of terminology.
               | 
               | There's an argument for competitive games to use true
               | randomness to eliminate any possibility of abuse, but I'm
               | not aware of specific examples.
        
               | burnished wrote:
               | Think there is confusion here - if I understand correctly
               | you are asking about the number generator, they are
               | talking about the process of determining success. Like in
               | league of legends you have a listed crit chance but the
               | way they determine success isnt to generate a number and
               | compare it to your chance, you start with a smaller base
               | number that gets incremented each time you fail and reset
               | when you succeed - the end result is that your overall
               | chance remains the same but the likelihood oh a streak
               | (of fails or successes) goes down.
               | 
               | Doesn't change the overall probability, but drastically
               | reduces variance.
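               | 
               | (A toy sketch of that mechanic in Python; the
               | increment constant is invented, not League's
               | actual table.)
               | 
               |     import random
               | 
               |     def prd_rolls_until_hit(base=0.06):
               |         # chance ramps by `base` after each
               |         # miss and resets after a hit
               |         chance, tries = base, 0
               |         while True:
               |             tries += 1
               |             if random.random() < chance:
               |                 return tries
               |             chance += base
               | 
               |     runs = [prd_rolls_until_hit()
               |             for _ in range(100_000)]
               |     # effective hit rate; long miss streaks
               |     # are far rarer than with flat,
               |     # independent rolls
               |     print(len(runs) / sum(runs))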
        
               | PeterisP wrote:
               | Some games (both computer games and physical board games)
               | intentionally use "shuffled randomness" where e.g. for
               | percentile fail/success rolls you'd take numbers from
               | 1-100 and use true randomness to shuffle that list; in
               | this way the overall probability is the same, but has a
               | substantially different feel as it's impossible for
               | someone to have bad/good luck throughout the whole game
               | and things like "gambler's fallacy" which are false for
               | actual randomness become true.
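               | 
               | (Roughly like this sketch in Python; the 30
               | hits per 100 draws are arbitrary.)
               | 
               |     import random
               | 
               |     def shuffled_bag(hits=30, size=100):
               |         bag = [True] * hits
               |         bag += [False] * (size - hits)
               |         random.shuffle(bag)
               |         return bag
               | 
               |     rolls = shuffled_bag()
               |     print(sum(rolls))   # always exactly 30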
        
               | teddyh wrote:
               | > _the overall probability is the same_
               | 
               | Only for the very first roll. After that, the outcome
               | becomes more and more predictable for each roll.
        
           | the_af wrote:
           | Can you elaborate on how XCOM lies? I often suspected this
           | (but you can never be sure, since human intuition is bad at
           | probabilities). Is there hard evidence?
        
             | galkk wrote:
             | https://youtu.be/l0KEDYFWbVc
        
             | fauxpause_ wrote:
             | I believe it fudges the numbers to give you better than
             | expected results on lower difficulties.
        
       | yunruse wrote:
       | 2s is fine, but the benefit of modern technology is that we can
       | tell exactly how many standard deviations away our results are
       | from what the null hypothesis would randomly generate. Particle
       | physics
       | holds itself to an "industry standard" of 5 sigma, for example.
       | 
       | The real conversation to be had is -- what sigma level will we
       | require? Is this something we'll keep doing, and thus
       | A/B test ourselves into a (perhaps quite horrifying) local
       | minimum based on random noise? Is this a single great experiment?
       | Are lives on the line? Will these results be taken Quite
       | Seriously? Is this a test to say "look, this is worth further
       | investigation"?
       | 
       | I'm no statistician but in my opinion this is the first
       | conversation that needs to be had when doing an experiment: what
       | p / sigma levels are satisfactory to claim confidence? p<0.05
       | is a decent heuristic for some experiments, but not all.
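       | 
       | (The sigma-to-p bookkeeping is a couple of lines in Python with
       | scipy, for whatever threshold you settle on; 5 sigma here is
       | one-sided, as particle physics usually quotes it.)
       | 
       |     from scipy.stats import norm
       | 
       |     for sigma in (2, 3, 5):
       |         print(sigma, norm.sf(sigma))   # one-sided p
       |     # 2 ~0.023, 3 ~0.0013, 5 ~2.9e-7
       | 
       |     print(norm.isf(0.05))   # ~1.64 sigma for p = 0.05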
        
         | peteradio wrote:
         | It's 5s for discovery, 2s for null hypothesis "confirmation".
        
           | cozzyd wrote:
           | and 3 sigma for "evidence" (i.e. enough for publication...)
        
       | ftxbro wrote:
       | This post is amazing.
       | 
       | When I see a post "P < 0.05 Considered Harmful" I think OK now
       | they are going to talk about maybe Bonferroni vs. other multiple
       | hypothesis corrections if they are a frequentist or otherwise
       | they are going to try to explain Bayesian things.
       | 
       | But no, this one isn't from a frequentist, or from a Bayesian.
       | It's from a techbro whose solution isn't any kind of multiple
       | hypothesis correction or getting Bayes-pilled, it's to say "Why
       | not just admit you want something akin to p = 0.25 in the first
       | place?" for 'ship criteria' in the only stats context he appears
       | to know, which is A/B testing, talking about Maslow's
       | hierarchy and namedropping Hulu and Netflix. It's seriously
       | like some Silicon
       | Valley parody. Wait is it actually a satire blog?
        
         | thebestgamers wrote:
         | [flagged]
        
         | whimsicalism wrote:
         | > "Why not just admit you want something akin to p = 0.25 in
         | the first place?" for 'ship criteria'
         | 
         | That's culture shock for me - I guess this is why I don't work
         | at startups.
        
           | bookish wrote:
           | It all depends on what you're experimenting on. I do think
           | there are many teams out there who are (in effect) making
           | decisions with less certainty than this, but they wouldn't
           | want to actually quantify it.
        
           | mlyle wrote:
           | p=0.2 doesn't work too well for ship criteria for medicine.
           | 
           | p=0.2 for "this reordering of the landing text improves the
           | rate conversion events" is fine. People make changes based on
           | less information all the time. Waiting for certainty has its
           | own expenses.
        
       | [deleted]
        
       | staunton wrote:
       | The fundamental error that causes misuse of p-values (and
       | statistics in general) is misunderstanding what statistics is in
       | the first place. Statistics is applied epistemology. There is no
       | algorithm that fits all situations where you're trying to think
       | about and learn new things. It's just hard and we have to deal
       | with that.
       | 
       | Arguably, the main point of p-values is trying to prevent people
       | who really know better from "cheating" by reporting "interesting"
       | results from very low sample sizes. Having a very rigid framework
       | as a rejection criterion helps with this. However, the scientific
       | community and system are not capable of dealing with "real
       | cheating" that includes fabrication of data. Also, any such rigid
       | metric is going to be gamed. Of course, there are some people who
       | would also cheat themselves and maybe learning about p-values
       | makes this less frequent. But such people using them don't
       | understand what those values are telling them, beyond "yes, I can
       | publish this". This is counterproductive because it prevents deep
       | thinking and thorough investigation.
       | 
       | Most scientists realize that in practice there is no simple list
       | of objective criteria that tells you what experiments to perform
       | and how exactly to interpret the results. This takes a lot of
       | work, careful thinking, trying different things and very much
       | benefits from collaboration. But who's got time for that? There's
       | papers to publish and grant applications to write. So p-values it
       | is. Or maybe some other thing eventually, that also won't solve
       | the fundamental problem.
        
       | begemotz wrote:
       | Besides the Gigerenzer article mentioned below (there are others
       | by the same author worth reading e.g. 'Mindless Statistics'), I
       | would recommend:
       | 
       | 'Moving to a world beyond "p < 0.05"' by Wasserstein et al. in
       | The American Statistician
       | https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1...
       | 
       | as well as the classic article by Jacob Cohen, "The Earth Is
       | Round (p < .05)".
       | 
       | YMMV, but in certain disciplines still, "statistical analysis" is
       | little more than checking for p-values and applying a binary
       | decision rule.
       | 
       | That is without recognizing the shaky theoretical ground of NHST
       | as practiced.
        
       ___________________________________________________________________
       (page generated 2023-04-10 23:01 UTC)