[HN Gopher] Effect size is significantly more important than sta...
___________________________________________________________________
Effect size is significantly more important than statistical
significance
Author : stochastician
Score : 189 points
Date : 2021-09-14 16:10 UTC (6 hours ago)
(HTM) web link (www.argmin.net)
(TXT) w3m dump (www.argmin.net)
| ammon wrote:
| But how much more important? :) Sorry, could not help myself.
| [deleted]
| kbrtalan wrote:
| There's a whole book about this idea, Antifragile by Nassim
| Taleb, highly recommended
| abeppu wrote:
| I think the weird thing is that a bunch of people in tech
| understand this well _with respect to tech_, but often fall into
| the same p-value trap when reading about science.
|
| If you're working with very large datasets generated from e.g. a
| huge number of interactions between users and your system,
| whether as a correlation after the fact, or as an A/B experiment,
| getting a statistically significant result is easy. Getting a
| meaningful improvement is rarer, and gets harder after a system
| has received a fair amount of work.
|
| But then people who work in these big-data contexts can read
| about a result outside their field (e.g. nutrition, psychology,
| whatever), where n=200 undergrads or something, and p=0.03 (yay!)
| and there's some pretty modest effect, and be taken in by
| whatever claim is being made.
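|
| As a rough illustration (simulated data, not from any real
| system; numpy/scipy assumed available), a truly tiny effect
| becomes "significant" once n is large enough:
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(0)
|
|     # Two user groups; the true difference is 0.5% of a standard
|     # deviation -- practically negligible.
|     control = rng.normal(0.000, 1.0, size=2_000_000)
|     treatment = rng.normal(0.005, 1.0, size=2_000_000)
|
|     t, p = stats.ttest_ind(treatment, control)
|     print(f"p = {p:.2e}")   # comfortably below 0.05
|     print(f"effect = {treatment.mean() - control.mean():.4f} std devs")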
| agnosticmantis wrote:
| An investigator needs to rule out all conceivable ways their
| modeling can go wrong, among them the possibility of a
| statistical fluke, which statistical significance is supposed to
| take care of. So statistical significance may best be thought of
| as a necessary condition, but is typically taken to be a
| sufficient condition for publication. If I see a strange result
| (p-value < 0.05), could it be because my functional form is
| incorrect? or because I added/removed some data? Or I failed to
| include an important variable? These are hard questions and not
| amenable to algorithmic application and mass production.
| Typically these questions are ignored, and only the possibility
| of a statistical fluke is ruled out (which itself depends on the
| other assumptions being valid).
|
| Dave Freedman's Statistical Models and Shoe Leather is a good
| read on why such formulaic application of statistical modeling is
| bound to fail.[0]
|
| [0: https://psychology.okstate.edu/faculty/jgrice/psyc5314/Freed...]
| jerf wrote:
| Speaking not to this study in particular necessarily, I strongly
| agree with the general point. Science has really been held back
| by an over-focusing on "significance". But I'm not really
| interested in a pile of hundreds of thousands of studies that
| establish a tiny effect with suspiciously-just-barely-significant
| results. I'm interested in studies that reveal robust results
| that are reliable enough to be built on to produce other results.
| Results of 3% variations with p=0.046 aren't. They're dead ends,
| because you can't put very many of those into the foundations of
| future papers before the probability of one of your foundations
| being incorrect is too large.
|
| To the extent that those are hard to come by... Yeah! They are!
| Science is hard. Nobody promised this would be easy. Science
| _shouldn't_ be something where labs are cranking out easy
| 3%/p=0.046 papers all the time just to keep funding. It's just a
| waste of money and time of our smartest people. It _should_ be
| harder than it is now.
|
| Too many proposals are obviously only going to be capable of
| turning up that result (insufficient statistical power is often
| obvious right in the proposal, if you take the time to work the
| math). I'd rather see more wood behind fewer arrows, and see
| fewer proposals chasing much more statistical power, than the
| chaff of garbage we get now.
|
| If I were King of Science, or at least, editor of a prestigious
| journal, I'd want to put word out that I'm looking for papers
| with at least one of some sort of _significant_ effect, or a p
| value of something like p = 0.0001. Yeah. That's a high bar. I
| know. That's the point.
|
| "But jerf, isn't it still valuable to map out all the little
| things like that?" No, it really isn't. We already have every
| reason in the world to believe the world is _drenched_ in 1%
| /p=0.05 effects. "Everything's correlated to everything", so
| that's not some sort of amazing find, it's the totally expected
| output of living in our reality. Really, this sort of stuff is
| still just _below the noise floor_. Plus, the idea that we can
| remove such small, noisy confounding factors is just silly. We
| need to look for the things that stand out from that noise floor,
| not spending billions of dollars doing the equivalent of
| listening to our spirit guides communicate to us over white noise
| from the radio.
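|
| A back-of-the-envelope version of the foundations problem
| (assuming, generously, that each cited result is independently
| wrong only 5% of the time):
|
|     # Chance that at least one of k foundational results is wrong
|     for k in (1, 5, 10, 20):
|         print(f"k={k:>2}: {1 - 0.95 ** k:.0%} chance at least one is bad")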
| naasking wrote:
| > If I were King of Science, or at least, editor of a
| prestigious journal, I'd want to put word out that I'm looking
| for papers with at least one of some sort of significant
| effect, or a p value of something like p = 0.0001. Yeah. That's
| a high bar. I know. That's the point.
|
| And study preregistration to avoid p-hacking and incentivize
| publishing negative results. And full availability of data, aka
| "open science".
| DiabloD3 wrote:
| Preregistration, a _requirement_ to publish negative _or_ null
| results, and full data are, arguably, the three legs of modern
| science. If we collectively don't enforce this, nobody is
| doing science, they're just fucking around and writing it
| down.
| romwell wrote:
| Also replication studies for negative or null results _in
| addition to_ positive ones (we don't have either).
| analog31 wrote:
| I've thought about the idea of allowing people to separately
| publish data and analysis. Right now, data are only published
| if the analysis shows something interesting.
|
| Improving the quality of measurements and data could be a
| rewarding pursuit, and could encourage the development of
| better experimental technique. And a good data set, even if
| it doesn't lead to an immediate result, might be useful in
| the future when combined with data that looks at a problem
| from another angle.
|
| Granted, this is a little bit self serving: I opted out of an
| academic career, partially because I had no good research
| ideas. But I love creating experiments and generating data!
| Fortunately I found a niche at a company that makes
| measurement equipment. I deal with the quality of data, and
| the problem of replication, all day every day.
| Tycho wrote:
| What do you (or anyone else) think about the statistical
| conclusions in this paper? Particularly the adjusted r-squared
| values reported.
|
| https://www.cambridge.org/core/journals/american-political-s...
| shakezula wrote:
| I blame most of this on pop science. It's absolutely ruined the
| average public's respect for the behind the scenes work doing
| interesting stuff in every field. What's worse is the attitude
| it breeds. Anti-intellectualism runs rampant amongst even well
| educated members of my social circle. It's frustrating to say
| the least.
| shawn-butler wrote:
| Some say that it is not anti-intellectualism to realize the
| emperor has no clothes but enlightenment.
|
| Either way it's dangerous.
| hyperbovine wrote:
| Come into Bayesian land, the water is fine. The whole NHST
| edifice starts to seem really shaky once you stop and wonder if
| "True" and "False" are really the only two possible states of a
| scientific hypothesis. Andrew Gelman has written about this in
| many places, e.g. http://www.stat.columbia.edu/~gelman/research/published/aban....
| Retric wrote:
| Bayesian reasoning has even worse underpinnings. You don't
| actually know any of the things the equations want. For
| example suppose a robot is counting Red and Blue balls from a
| bin, the count is 400Red and 637Blue, it just classified a
| Red ball.
|
| Now what's the count, wait what's the likelihood it
| misclassified a ball? How accurate are those estimates, and
| those estimates of those ...
|
| For a real world example someone using Bayesian reasoning
| when counting cards should consider the possibility that the
| deck doesn't have the correct cards. And the possibility that
| the deck's cards have been changed over the course of the
| game.
| tux3 wrote:
| Suppose the likelihood it misclassified a ball is
| significantly different from zero, but not yet known
| precisely.
|
| If you use a model that doesn't ask you to think about this
| likelihood at all, you will get the same result as if you
| had used bayes and consciously chose to approximate the
| likelihood of misclassification as zero.
|
| You may get slightly better results if you have a
| reasonable estimate of that probability, but you will get
| no worse if you just tell Bayes zero.
|
| It feels like you're criticizing the model for _asking hard
| questions_.
|
| I feel like explicitly not knowing an answer is always a
| small step ahead of not considering the question.
| Retric wrote:
| The criticism is important because of how Bayes keeps
| using the probability between experiments. Garbage in,
| garbage out.
|
| As much as people complain about frequentist approaches,
| examining the experiment independently from the output of
| the experiment effectively limits contamination.
| Karrot_Kream wrote:
| Huh? You can derive all of those from Bayesian models. If
| you're counting balls from a bin with replacement, and your
| bot has counted 400Red with 637Blue, you have a
| Beta/Binomial model. That means p_blue | data ~
| Beta(638, 401) assuming a uniform prior. The probability of
| observing a red ball given the above p_blue | data is
| P(red_obs | p_blue) = 1 - P(blue_obs | p_blue), which is
| calculable from p_blue | data. In fact in this simple
| example you can even analytically derive all of these
| values, so you don't even need a simulation!
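|
| A minimal sketch of that analytic posterior in Python (scipy
| assumed; counts from the example above):
|
|     from scipy import stats
|
|     blue, red = 637, 400
|
|     # Posterior for p_blue under a uniform Beta(1, 1) prior
|     posterior = stats.beta(blue + 1, red + 1)   # Beta(638, 401)
|
|     # Posterior mean, and the posterior-predictive probability
|     # that the next draw is red = E[1 - p_blue]
|     p_blue_mean = posterior.mean()              # ~0.614
|     p_red_next = 1 - p_blue_mean                # ~0.386
|
|     # 95% credible interval for p_blue
|     lo, hi = posterior.ppf([0.025, 0.975])
|     print(p_blue_mean, p_red_next, lo, hi)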
| tfehring wrote:
| And if misclassification is a concern (as the parent
| mentioned) you can put a prior on that rate too!
| Retric wrote:
| Which rate? The rate you failed to mix the balls? The
| rate you failed to count a ball? The rate you
| misclassified the ball? The rate you repeatedly counted
| the same ball? The rate you started with an incorrect
| count? The rate you did the math wrong? etc
|
| "Here's the experiment and here's the data" is concrete; it
| may be bogus, but it's information. Updating probabilities
| based on recursive estimates of probabilities is largely
| restating your assumptions. Black swans can really throw
| a wrench into things.
|
| Plenty of downvotes and comments, but nothing addressing
| the point of the argument might suggest something.
| Karrot_Kream wrote:
| > Which rate? The rate you failed to mix the balls? The
| rate you failed to count a ball? The rate you
| misclassified the ball? The rate you repeatedly counted
| the same ball? The rate you started with an incorrect
| count? The rate you did the math wrong? etc
|
| This is called modelling error. Both Bayesian and
| frequentist approaches suffer from modelling error.
| That's what TFA talks about when mentioning the normality
| assumptions behind the paper's GLM. Moreover, if errors
| are additive, certain distributions combine together
| easily algebraically meaning it's easy to "marginalize"
| over them as a single error term. In most GLMs, there's a
| normally distributed error term meant to marginalize over
| multiple i.i.d normally distributed error terms.
|
| > Plenty of downvotes and comments, but nothing
| addressing the point of the argument might suggest
| something.
|
| I don't understand the point of your argument. Please
| clarify it.
|
| > "Here's the experiment and here's the data" is concrete;
| it may be bogus, but it's information. Updating
| probabilities based on recursive estimates of
| probabilities is largely restating your assumptions.
|
| What does this mean, concretely? Run me through an
| example of the problem you're bringing up. Are you saying
| that posterior-predictive distributions are "bogus"
| because they're based on prior distributions? Why?
| They're just based on the application of Bayes Law.
|
| > Black swans can really throw a wrench into things
|
| A "black swan" as Taleb states is a tail event, and this
| sort of analysis is definitely performed (see:
| https://en.wikipedia.org/wiki/Extreme_value_theory). In
| the case of Bayesian stats, you're specifically
| calculating the entire posterior distribution of the
| data. Tail events are visible in the tails of the
| posterior predictive distribution (and thus calculable)
| and should be able to tell you what the consequences are
| for a misprediction.
| kadoban wrote:
| Can't you just add that to your equation? Seems like for
| anything real, this will not go many levels deep at all
| before it's irrelevant.
| funklute wrote:
| > The whole NHST edifice starts to seem really shaky once you
| stop and wonder if "True" and "False" are really the only two
| possible states of a scientific hypothesis.
|
| The root problem here is that people tend to dichotomise what
| are fundamentally continuous hypothesis spaces. The correct
| question is not "is drug A better than drug B?", it's "how
| much better or worse is drug A compared to drug B?". And this
| is an error you can make in both Bayesian and frequentist
| lands, though culturally the Bayesians have a tendency to
| work directly with the underlying, continuous hypothesis
| space.
|
| That said, there are sometimes external reasons why you have
| to dichotomise your hypothesis space. E.g. ethical reasons in
| medicine, since otherwise you can easily end up concluding
| that you should give half your patients drug A and the other
| half drug B, to minimise volatility of outcomes (this
| situation would occur when you're very uncertain which drug
| is better).
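|
| As an illustration of working with the continuous question
| directly (hypothetical trial counts, purely for illustration),
| you can report an interval for the difference in success rates
| instead of a yes/no verdict:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     # Hypothetical outcomes: successes / patients per arm
|     a_success, a_n = 78, 100   # drug A
|     b_success, b_n = 70, 100   # drug B
|
|     # Posterior of each success rate under uniform Beta(1, 1)
|     # priors, then the posterior of the *difference* -- a
|     # continuous quantity, not a binary verdict.
|     p_a = rng.beta(a_success + 1, a_n - a_success + 1, 100_000)
|     p_b = rng.beta(b_success + 1, b_n - b_success + 1, 100_000)
|     diff = p_a - p_b
|
|     print("posterior mean difference:", diff.mean())
|     print("95% interval:", np.percentile(diff, [2.5, 97.5]))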
| Karrot_Kream wrote:
| Gelman et al's BDA3 has a fun exercise estimating heart-
| disease rates in one of the early chapters that demonstrates
| this issue with effect-sizes. BDA3 uses a simple frequentist
| model to determine heart-disease rates and shows that areas
| with small population sizes have heavily exaggerated heart-
| disease rates because of the small base population. Building
| a Bayesian model does not have the same issue, because the
| prior on prevalence shrinks the estimates from small base
| populations toward the overall rate.
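|
| A toy version of that phenomenon (simulated counties, same true
| rate everywhere; the Beta prior here is a crude stand-in for the
| hierarchical model BDA3 actually builds):
|
|     import numpy as np
|
|     rng = np.random.default_rng(1)
|
|     true_rate = 0.01
|     pops = np.array([100, 500, 5_000, 50_000, 500_000])
|     cases = rng.binomial(pops, true_rate)
|
|     # Raw rates are noisiest for the smallest populations
|     raw = cases / pops
|
|     # Shrunken estimates with a Beta(2, 198) prior (mean 0.01)
|     a, b = 2, 198
|     shrunk = (cases + a) / (pops + a + b)
|
|     for n, r, s in zip(pops, raw, shrunk):
|         print(f"pop={n:>7}  raw={r:.4f}  shrunk={s:.4f}")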
| bluGill wrote:
| > Plus, the idea that we can remove such small, noisy
| confounding factors is just silly. We need to look for the
| things that stand out from that noise floor
|
| We have found most of them, and all the easy ones. Today the
| interesting things are near the noise floor. 3000 years ago
| atoms were well below the noise floor, now we know a lot about
| them - most of it seems useless in daily life yet a large part
| of the things we use daily depend on our knowledge of the atom.
|
| Science needs to keep separating things from the noise floor.
| Some of them become important once we understand them.
| exporectomy wrote:
| Doesn't it make a difference if it's near the noise floor
| because it's hard to measure (atoms) or if it's near the
| noise floor because it's hardly there (masks)? Maybe if these
| "hardly there" results led to further research that isolated
| some underlying "very there" phenomena, they would be
| important, but until that happens, who cares if thinking
| about money makes you slightly less generous than thinking
| about flowers? If they're not building on previous research
| to discover more and more important things, then it doesn't
| seem like useful progress.
| kazinator wrote:
| Individual atoms, or small numbers of them, may be beneath
| some noise floor, but not combined atoms.
|
| A salt crystal (a lattice of Na and Cl ions) is nothing like a
| pure gold nugget (clump of Au atoms).
|
| That difference is a massive effect.
|
| So to begin with, we have this sort of massive effect which
| requires an explanation, such as atoms.
|
| Maybe the right language here is not that we need an effect
| rather than statistical significance, but that we need a
| clear, unmistakable _phenomenon_. There has to be a
| phenomenon, which is then explained by research. Research
| cannot be _inventing the phenomenon_ by whiffing at the faint
| fumes of statistical significance.
| jerf wrote:
| I don't think we have found most of them. I think we make it
| look like we've found most of them because we keep throwing
| money at these crap studies.
|
| Bear in mind that my criteria are two-dimensional, and I'll
| accept either. By all means, go back and establish your 3%
| effect to a p-value of 0.0001. Or 0.000000001. That makes
| that 3% much more interesting and useful.
|
| It'll especially be interesting and valuable when you fail to
| do so.
|
| But we do not, generally, do that. We just keep piling up
| small effects with small p-values and thinking we're getting
| somewhere.
|
| Further, if there is a branch of some "science" that we've
| exhausted so thoroughly that we can't find anything that isn't
| a 3%/p=0.047 effect anymore... pack it in, we're done here.
| Move on.
|
| However, part of the reason I so blithely say that is that I
| suspect if we did in fact raise the standards as I propose
| here, it would realign incentives such that more sciences
| would start finding more useful results. I suspect, for
| instance, that a great deal of the soft sciences probably
| could find some much more significant results if they studied
| larger groups of people. Or spent more time creating theories
| that aren't about whether priming people with some sensitive
| word makes them 3% more racist for the next twelve minutes,
| or some other thing that even if true really isn't that
| interesting or useful as a building block for future work.
| tommiegannert wrote:
| A few years ago, HN comments complained about the censorship
| that only leaves successful studies. We need to report on
| everything we've tried, so we don't walk around on donuts.
|
| What's missing in my mind is admitting that results were
| negative. I'm reading up on financial literacy, and many
| studies end with some metrics being "great" at the 5% level,
| but then other metrics are also "great" at the 10% level,
| without the authors ever explaining what they would have
| classified as bad. The metrics are just reported without any
| statement of what significance level they would expect (in
| their field).
| nix0n wrote:
| > ...so we don't walk around on donuts
|
| I agree with what you're saying, but I don't understand this
| phrase.
| Imnimo wrote:
| The phrase "walk around on donuts" has one Google result
| and it's this thread.
| rootusrootus wrote:
| I don't know where that turn of phrase comes from, but I
| imagine it's synonymous with 'walking around in circles'.
| potatoman22 wrote:
| You know how sometimes you'll accidentally step on a donut
| and you'll have to call your dog over to lick all the jelly
| off your toes? That.
| phreeza wrote:
| This is clearly a cost/benefit tradeoff, and the sweet spot
| will depend entirely on the field. If you are studying the
| behavior of heads of state, getting an additional N is
| extremely costly, and having a p=0.05 study is maybe more
| valuable than having no published study at all, because the
| stakes are very high and even a 1% chance of (for example)
| preventing nuclear war is worth a lot. On the other hand, if
| you are studying fruit flies, an additional N may be much
| cheaper, and the benefit of yet another low effect size study
| may be small, so I could see a good argument being made for
| more stringent standards. In fact I know that in particle
| physics the bar for discovery is much higher than p=0.05.
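|
| For reference, the particle-physics "5 sigma" discovery
| convention corresponds to a one-sided p-value of roughly 3e-7
| (quick check, scipy assumed):
|
|     from scipy import stats
|
|     # One-sided tail probability beyond 5 standard deviations
|     print(stats.norm.sf(5))   # ~2.9e-07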
| bongoman37 wrote:
| What if it's the other way round and a p<0.05 study says that
| the best way to make sure a rival country does not do a
| nuclear strike on you first is to do a massive nuclear strike
| on them first?
| BenoitEssiambre wrote:
| p = 0.0001 doesn't help much. You can get to an arbitrarily
| small p by just using more data. The problem is trying to
| reject a zero-width null hypothesis. Scientists should always
| try to reject something bigger than an infinitesimally small
| effect, so that they are not merely catching tiny systematic
| biases in their experiments.
| There are always small biases.
|
| Gwern's page "Everything Is Correlated" is worth reading:
| https://www.gwern.net/Everything
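|
| A minimal sketch of testing against a smallest effect of
| interest instead of a point null (illustrative numbers only;
| numpy/scipy assumed):
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(2)
|
|     # Huge sample whose true mean is a tiny 0.02 -- think of it
|     # as a small systematic bias rather than a real effect.
|     x = rng.normal(0.02, 1.0, size=1_000_000)
|
|     # Against a point null of exactly zero, this is "significant"
|     _, p_zero = stats.ttest_1samp(x, popmean=0.0)
|
|     # Against H0: mean <= 0.1 (a smallest effect we'd care
|     # about), tested one-sided, it is nowhere near significant.
|     _, p_min = stats.ttest_1samp(x, popmean=0.1,
|                                  alternative='greater')
|
|     print(f"point null p = {p_zero:.1e}")
|     print(f"minimum-effect null p = {p_min:.3f}")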
| bpodgursky wrote:
| It would at least filter out the social science experiments
| where a result on 30 college students is "significant" at
| p=.04 (and it's too expensive to recruit 3000 of them to
| force significance).
| Robotbeat wrote:
| The problem is that when you're on the cusp of a new thing,
| unless you're super lucky, the result will necessarily be near
| the noise floor. Real science is like that.
|
| But I definitely agree it'd be nice to go back and show
| something is true to p=.0001 or whatever. Overwhelmingly solid
| evidence is truly a wonderful thing, and as you say, it's
| really the only way to build a solid foundation.
|
| When you engineer stuff, it needs to work 99.99-99.999% of the
| time or more. Otherwise you're severely limited in how far your
| machine can go (in terms of complexity, levels of abstraction
| and organization) before it spends most of its time in a broken
| state.
|
| I've been thinking about this while playing Factorio: so much
| of our discussion and mental modeling of automation works under
| the assumption of perfect reliability. If you had SLIGHTLY
| below 100% reliability in Factorio, the game would be a
| terrible grind limited to small factories. Likewise with
| mathematical proofs or computer transistors or self driving
| cars or any other kind of automation. The reliability needs to
| be insanely good. You need to add a bunch of nines to whatever
| you're making.
|
| A counterpoint to this is when you're in an emergency and
| inaction means people die. In that case, you need to accept
| some uncertainty early on.
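|
| A rough illustration of why the extra nines matter as systems
| grow (purely illustrative component counts):
|
|     # Probability that a chain of n independent parts all work
|     for reliability in (0.99, 0.999, 0.99999):
|         for n in (10, 100, 1000):
|             print(f"r={reliability}, n={n}: "
|                   f"chain works {reliability ** n:.1%} of the time")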
| MaulingMonkey wrote:
| > If you had SLIGHTLY below 100% reliability in Factorio, the
| game would be a terrible grind limited to small factories.
|
| I'd argue you _do_ have <100% reliability in Factorio, and
| much of the game is in increasing the 9s.
|
| Biters can wreak havoc on your base. Miners contaminate your
| belts with the wrong types of ore, if you weren't paying
| enough attention near overlapping fields. Misplaced inserters
| may mis-feed your assemblers, reducing efficiency or leaving
| outright nonfunctional buildings. Misclicks can cripple large
| swaths of your previously working factory, ruining plenty of
| speedruns if they go uncaught. For later game megabase
| situations, you must deal with limited lifetimes as mining
| locations dry up, requiring you to overhaul existing systems
| with new routes of resources into them. As inputs are split
| and redirected, existing manufacturing can choke and sputter
| when they end up starved of resources. Letting your power
| plants starve of fuel can result in a small crisis! Electric
| miners mining coal, refineries turning oil into solid fuel,
| electric inserters fueling the boilers, water pumps providing
| the water to said boilers - these things all take power, and
| jump-starting these after a power outage takes time you might
| not have if you're under active attack and your laser turrets
| are all offline as well.
|
| But you have means of remediating much of this unreliability.
| Emergency fuel and water stockpiles, configuring priorities
| such that fuel for power is prioritized ahead of your fancy
| new iron smelting setup, programmable alerts for when input
| stockpiles run low, ammo-turrets that work without power,
| burner inserters on your power production's critical path
| that will bootstrap themselves after an outage, and roboports
| that replace biter-attacked defenses.
|
| Your first smelting setup in Factorio will likely be a hand-
| fed burner miner and furnace, taking at most 50 coal. This
| will run out of power in _minutes_. Then you might use
| inserters to add a coal buffer. Then a belt of coal, so you
| don't need to constantly refill the coal buffer. Then a rail
| station, so you don't need to constantly hand-route entirely
| new coal and ore mining patches. Then you'll use blueprints
| and bots to automate much of constructing your new inputs. If
| you're really crazy, you'll experiment with automating the
| usage of those blueprints to build self-expanding bases...
| reilly3000 wrote:
| I really considered getting into Factorio but your comment
| is exactly why I can't touch it. I have certain demands
| upon my time that would inevitably go unmet as I fuss with
| the factory.
| twoslide wrote:
| > when you're on the cusp of a new thing, unless you're super
| lucky, the result will necessarily be near the noise floor.
| Real science is like that.
|
| That's not necessarily true in social sciences. When you're
| working with large survey datasets, many variables are
| significantly related. That doesn't mean these relationships
| are meaningful or causal, they could be due to underlying
| common causes, etc. (Maybe social sciences weren't included
| in "real science" - but there's where a lot of stats
| discussions focus)
| mercurywells wrote:
| > I've been thinking about this while playing Factorio: so
| much of our discussion and mental modeling of automation
| works under the assumption of perfect reliability. If you had
| SLIGHTLY below 100% reliability in Factorio, the game would
| be a terrible grind limited to small factories.
|
| So I'm making a guess here that you play with few monsters or
| non-aggressive monsters?
| Robotbeat wrote:
| Currently playing a game to minimize pollution to try to
| totally avoid biter attention. Surrounded by trees, now
| almost entirely solar with efficiency modules.
| mumblemumble wrote:
| Fine. Do it like the experimental physicists do: if you think
| you're on to something, refine and repeat the experiment in
| order to get a more robust, repeatable result.
|
| The original sin of the medical and social sciences is
| failing to recognize a distinction between exploratory
| research and confirmatory research and to behave accordingly.
| Robotbeat wrote:
| The problem is that it's really hard to get good data,
| ethically, in medical sciences. Something that improves
| outcomes by 5-10% can be really important, but trying to
| get a study big enough to prove it can be super expensive
| already.
| TameAntelope wrote:
| Nobody likes being in the control group of the first
| working anti-aging serum...
| q-big wrote:
| > Nobody likes being in the control group of the first
| working anti-aging serum...
|
| You only know whether it works when the study has been
| completed. You also only know whether the drug has
| (potentially) disastrous consequences when the study has
| been completed. Thus, I am not completely sure whether
| your claim holds.
| bluGill wrote:
| You missed the "working" part. Success was a prerequisite
| to their after-the-fact feelings. At least some of the
| control group will be old but still alive when we know it
| works. They might not know if it grants indefinite life
| (and side effects may turn it into dying at 85, so some of
| the control group may outlive the intervention group after
| the study), but they will know that on average they did worse.
| [deleted]
| modeless wrote:
| Not only is it not valuable to publish tons of studies with
| p=.04999 and small effect size, in fact it's harmful. With so
| many questionable results published in supposedly reputable
| places it becomes possible to "prove" all sorts of crackpot
| theories by selectively citing real research. And if you try to
| dispute the studies you can get accused of being anti-science.
| exporectomy wrote:
| Only a problem for people who are trying hard not to think.
| You can just ignore those people. They're not doing any harm
| believing their beliefs.
| bee_rider wrote:
| We are literally in the middle of a global crisis that is
| founded on people misunderstanding science.
| exporectomy wrote:
| What on earth are you talking about? I guess climate
| change but that's certainly not founded on people
| misunderstanding science, it's caused by people
| understanding science which led to industrialization. Or
| maybe you mean covid-19? Neither that. You're just trying
| to make it seem like it's somehow very serious and bad if
| everyone doesn't agree with you. It's not.
| [deleted]
| robbedpeter wrote:
| The USDA food pyramid and nutrition education would suggest
| that there's an inherent danger in just letting people
| believe irrational things after a correction is known. It
| depends on the belief - flat earth people aren't likely to
| cause any harm. Bad nutrition information can wreak havoc
| at scale.
| vkou wrote:
| Flat earth beliefs don't cause harm, but flat earth
| believers have largely upgraded to believing more
| dangerous nonsense.
| exporectomy wrote:
| Data or it didn't happen. This really sounds like you're
| inventing a caricature of your enemy and assigning them
| "dangerous" qualities so you can hate them more.
| vkou wrote:
| Nobody needs to caricature the insane beliefs surrounding
| COVID (or flat earth), people holding them are doing a
| good enough job of that themselves.
|
| I do have a few favorites. "COVID tests give you COVID,
| so I won't go get tested" is certainly up there. I can't
| say I give two figs about your opinion on the Earth's
| topology, but this one is a public health problem, that's
| crippling hospitals around the country.
| sanxiyn wrote:
| I agree we shouldn't listen to noise, but small effect size is
| not necessarily noise. (I agree it is highly correlated.) I
| mean, QED's effect size on the electron g-factor is about
| 1.001 (relative to the Dirac value of 2). QED was very much
| worth finding out.
| RandomLensman wrote:
| These discussions are fun but rather pointless: e.g., sometimes a
| small effect is really interesting but it needs to be pretty
| strongly supported (for instance, claiming a 1% higher electron
| mass or a 2% survival rate in rabies).
|
| Also, most published research is inconsequential so it really
| does not matter other than money spent (and that is not only
| related to findings but also keeping people employed etc.). If
| confidence in results is truly an objective, we might need to
| link it directly to personal income or loss of income, i.e. force bets on
| it.
| robocat wrote:
| From the article:
|
| Ernest Rutherford is famously quoted proclaiming "If your
| experiment needs statistics, you ought to have done a better
| experiment."
|
| "Of course, there is an existential problem arguing for large
| effect sizes. If most effect sizes are small or zero, then most
| interventions are useless. And this forces us scientists to
| confront our cosmic impotence, which remains a humbling and
| frustrating experience."
| fmajid wrote:
| The studies are in villages, but the real concern is dense urban
| environments like New York (or Dhaka) where people are tightly
| packed together and at risk of contagion. I'm pretty sure masks
| make little difference in Wyoming either, where the population is
| 5 people per square mile.
| sanxiyn wrote:
| A mask's effect size on seroprevalence is probably zero, so no
| effect is the expected result.
|
| That's because masks act on R0, not on seroprevalence directly.
| After acting on R0: if R0 is >1 you get exponential growth, if
| <1 exponential decay. So there is no visible effect, unless
| masking is the thing that pushes R0 from >1 to <1.
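|
| A toy illustration of that threshold behaviour (purely
| illustrative numbers, constant reproduction number per
| generation):
|
|     # Cumulative infections over 30 generations from 100 cases,
|     # for reproduction numbers just above and just below 1
|     start, generations = 100, 30
|     for r in (1.1, 1.0, 0.9):
|         total = sum(start * r ** g for g in range(generations))
|         print(f"R = {r}: ~{total:,.0f} cumulative infections")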
| [deleted]
| lotu wrote:
| Also, they aren't testing masking's effect on seroprevalence (or
| R0), they are testing the effect of sending out free masks and
| encouraging masking. That is only going to move the percent of
| people masking up or down a few percent at best.
| sampo wrote:
| The study says:
|
| > The intervention increased proper mask-wearing from 13.3%
| in control villages (N=806,547 observations) to 42.3% in
| treatment villages (N=797,715 observations)
|
| https://www.poverty-action.org/sites/default/files/publicati...
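|
| For scale, a quick two-proportion z-test on those figures
| (computed from the quoted counts, ignoring the cluster design
| that the study's own analysis accounts for; numpy/scipy
| assumed):
|
|     import numpy as np
|     from scipy import stats
|
|     p1, n1 = 0.133, 806_547   # control-village observations
|     p2, n2 = 0.423, 797_715   # treatment-village observations
|
|     pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
|     se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
|     z = (p2 - p1) / se
|     p_value = 2 * stats.norm.sf(abs(z))
|
|     # With ~800k observations per arm the p-value underflows to
|     # zero; the interesting number is the 29-point difference.
|     print(f"difference = {p2 - p1:.3f}, z = {z:.0f}, p = {p_value:.1e}")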
| hammock wrote:
| >Effect Size Is Significantly More Important Than Statistical
| Significance
|
| Ok, but by how much?
| mrtranscendence wrote:
| > If most effect sizes are small or zero, then most interventions
| are useless.
|
| But this doesn't necessarily follow, does it? If there really
| were a 1.1-fold reduction in risk due to mask-wearing it could
| still be beneficial to encourage it. The salient issue (taking up
| most of the piece) seems to be not the size of the effect but
| rather the statistical methodology the authors employed to
| measure that size. The p-value isn't meaningful in the face of an
| incorrect model -- why isn't the answer a better model rather
| than just giving up?
|
| Small effects are everywhere. Sure, it's harder to disentangle
| them, but they're still often worth knowing.
| ummonk wrote:
| > If there really were a 1.1-fold reduction in risk due to
| mask-wearing it could still be beneficial to encourage it.
|
| That's understating it. The study doesn't measure the reduction
| in risk due to mask-wearing, but rather the reduction simply
| from encouraging mask-wearing (which only increases actual mask
| wearing by a limited amount). If the study's results hold up
| statistically, then they're really impressive. With the caveat
| of course that they apply to older variants with lower viral
| loads than Delta - it's likely Delta is better at getting past
| masks simply due to its higher viral load.
|
| > The salient issue (taking up most of the piece) seems to be
| not the size of the effect but rather the statistical
| methodology the authors employed to measure that size. The
| p-value isn't meaningful in the face of an incorrect model --
| why isn't the answer a better model rather than just giving up?
|
| Exactly. The irony of this article is that this is an example
| where effect size is actually not the issue - it's potential
| issues with statistical significance due to imperfect modeling,
| and the inability of other researchers to rerun the significance
| analysis, because the raw data weren't published.
| sanxiyn wrote:
| I agree the problem here is an incorrect model. Masks do not
| act directly on seroprevalence. Measuring a mask's effect on
| seroprevalence is just the wrong study design, although it may be
| easier to do.
| whatshisface wrote:
| Who cares if each effect is a factor of 2^(1/100) improvement,
| just give me 100 interventions and I'll double the value being
| measured.
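|
| (And indeed, one hundred such multiplicative improvements
| compound to exactly a doubling:)
|
|     factor = (2 ** (1 / 100)) ** 100
|     print(factor)   # 2.0, up to floating-point error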
| nabla9 wrote:
| If you have one BALB/c lab mouse, you give it something, and it
| glows in the dark a few months later, the effect size alone makes it
| significant.
| exporectomy wrote:
| I wonder if we should separate the roles of scientist and
| researcher. Universities would have generalist "scientists" whose
| job would be to consult for domain-specialized researchers to
| ensure they're doing the science and statistics correctly. That
| way, we don't need every researcher in every field to have a deep
| understanding of statistics, which they often don't.
|
| Either that or stop rewarding such bad behavior. Science jobs are
| highly competitive, so why not exclude people with weak
| statistics? Maybe because weak statistics leads to more spurious
| exciting publications which makes the researcher and institution
| look better?
| civilized wrote:
| The scientific establishment will never be convinced to stop
| doing bad statistics, so "the solution to bad speech is more
| speech". Statisticians should be rewarded for effective review
| and criticism of flawed studies, and critical statistical
| reviews of any article should be easy to find when they exist.
|
| This is sounding like a great startup idea for a new scientific
| journal, actually.
| robertlagrant wrote:
| Just adding an arXiv filter that allows me to set a maximum
| p-value or a minimum effect size (variation %) would do it!
| vavooom wrote:
| I do enjoy the idea of a journal focused entirely on the
| review of statistical methods and underlying methodologies
| applied in modern day research. Could act as a helpful signal
| for relevant and applicable research.
| Robotbeat wrote:
| We exclude people who don't publish. Journals tend not to publish
| stuff that isn't a positive result.
| ummonk wrote:
| Agree with the title, but not the contents. The study in question
| is actually an example of a huge effect size (10% reduction in
| cases just from instructing villages they should wear masks is
| amazing) possibly hampered by poor statistical significance (as
| the blog post outlines).
| _Nat_ wrote:
| The title's misinformation: effect-size _ISN'T_ more important
| than statistical significance.
|
| The article itself makes some better points, e.g.
|
| > I worry that because of statistical ambiguity, there's not much
| that can be deduced at all.
|
| , which would seem like a reasonable interpretation of the study
| that the article discusses.
|
| However, the title alone seems to assert a general claim about
| statistical interpretation that'd seem potentially harmful to the
| community. Specifically, it'd seem pretty bad for someone to see
| the title and internalize a notion of effect-size being more
| important than statistical significance.
| spywaregorilla wrote:
| Not so fast. If you win your first jackpot on the first ticket,
| you'll require 500,000 failures (at $1 per ticket) in order to
| fail to reject the null hypothesis at p < 0.05. Assuming you're
| just doing a t test (which isn't really appropriate tbh).
|
| If you bought just ten tickets you would have a p value below
| 0.0000001
|
| And that makes sense, because a p value that small says the
| probability of getting a sample this far from the null
| hypothesis is less than 1 in a million by random chance...
| which is what happened when you got the extremely unlikely but
| highly profitable answer.
|
| edit: post was edited making this seem out of context...
___________________________________________________________________
(page generated 2021-09-14 23:00 UTC)