[HN Gopher] Interpreting A/B test results: false positives and statistical significance
___________________________________________________________________
Interpreting A/B test results: false positives and statistical
significance
Author : ciprian_craciun
Score : 37 points
Date : 2021-10-29 19:25 UTC (3 hours ago)
(HTM) web link (netflixtechblog.com)
(TXT) w3m dump (netflixtechblog.com)
| palae wrote:
| It's probably a good idea to remind (or inform) people that at
| least in scientific research, null hypothesis statistical testing
| and "statistical significance" in particular have come under fire
| [1,2]. From the American Statistical Association (ASA) in 2019
| [2]:
|
| "We conclude, based on our review of the articles in this special
| issue and the broader literature, that it is time to stop using
| the term "statistically significant" entirely. Nor should
| variants such as "significantly different," "p < 0.05," and
| "nonsignificant" survive, whether expressed in words, by
| asterisks in a table, or in some other way.
|
| Regardless of whether it was ever useful, a declaration of
| "statistical significance" has today become meaningless."
|
| [1] The ASA Statement on p-Values: Context, Process, and Purpose
| - https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1...
|
| [2] Moving to a World Beyond "p < 0.05" -
| https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1...
| antihipocrat wrote:
| I read through the first link you posted and couldn't find any
| ideas about what we could use instead of p-Values.
|
| Statistical tests, when applied correctly, are a very useful
| and objective method of determining whether the outcomes of
| one thing/activity are more desirable than another.
|
| Some solutions could be to set a higher bar for statistical
| analysis education. Or perhaps a more thorough statistically
| focussed vetting and peer review process for published
| material?
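|
| As a concrete illustration, here's a minimal sketch (Python,
| standard library only; the conversion counts are invented) of
| the two-proportion z-test that typically produces the p-value
| in an A/B test:
|
|     from math import erf, sqrt
|
|     def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
|         """Two-sided z-test for a difference in proportions."""
|         p_pool = (conv_a + conv_b) / (n_a + n_b)  # rate under H0
|         se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
|         z = (conv_b / n_b - conv_a / n_a) / se
|         # two-sided p-value via the standard normal CDF
|         p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
|         return z, p
|
|     # hypothetical data: 1000/20000 conversions in A,
|     # 1100/20000 in B
|     z, p = two_proportion_z_test(1000, 20000, 1100, 20000)
|     print(f"z = {z:.2f}, p = {p:.4f}")  # z ~ 2.24, p ~ 0.025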
| brian_spiering wrote:
| Bayesian hypothesis testing, including Bayes factors, might
| be more useful.
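|
| For instance, a minimal sketch (assuming binomial conversion
| data, Beta(1, 1) priors, and that scipy is available; the
| counts are invented) of a Bayes factor comparing "one shared
| rate" against "two independent rates":
|
|     from math import exp
|     from scipy.special import betaln  # log Beta function
|
|     def log_bf01(s_a, n_a, s_b, n_b, a=1.0, b=1.0):
|         """Log Bayes factor of H0 (one shared rate) vs H1
|         (independent rates); binomial coefficients cancel."""
|         f_a, f_b = n_a - s_a, n_b - s_b
|         # log marginal likelihoods under each hypothesis
|         m0 = betaln(a + s_a + s_b, b + f_a + f_b) - betaln(a, b)
|         m1 = (betaln(a + s_a, b + f_a) - betaln(a, b)
|               + betaln(a + s_b, b + f_b) - betaln(a, b))
|         return m0 - m1
|
|     # hypothetical counts: 1000/20000 in A, 1100/20000 in B
|     bf01 = exp(log_bf01(1000, 20000, 1100, 20000))
|     print(f"BF01 = {bf01:.2f}")  # > 1 favors "no difference"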
| kristjansson wrote:
| It's worth pulling the principles from the ASA's statement [2]
| as well:
|
| 1. P-values can indicate how incompatible the data are with a
| specified statistical model.
| 2. P-values do not measure the probability that the studied
| hypothesis is true, or the probability that the data were
| produced by random chance alone.
| 3. Scientific conclusions and business or policy decisions
| should not be based only on whether a p-value passes a
| specific threshold.
| 4. Proper inference requires full reporting and transparency.
| 5. A p-value, or statistical significance, does not measure
| the size of an effect or the importance of a result.
| 6. By itself, a p-value does not provide a good measure of
| evidence regarding a model or hypothesis.
|
| The basic criticism is one of brittleness: unless very
| carefully planned, executed, and interpreted, p-values from
| hypothesis tests do not support the claims some would like to
| make about their results, and meeting that first condition is
| so difficult that the technique should not be recommended. One
| _should_ look for 'significant' results, but using measures
| that align better with colloquial understandings of
| significance, i.e. with how users are misinterpreting p-values
| now.
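|
| Principle 5 is easy to demonstrate numerically: a practically
| negligible lift reaches a "significant" p-value once the
| sample is large enough. A rough sketch (Python, standard
| library only; the numbers are made up):
|
|     from math import erf, sqrt
|
|     def p_value(conv_a, n_a, conv_b, n_b):
|         """Two-sided p-value, two-proportion z-test."""
|         p_pool = (conv_a + conv_b) / (n_a + n_b)
|         se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
|         z = (conv_b / n_b - conv_a / n_a) / se
|         return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
|
|     # identical 0.03-point lift (5.00% -> 5.03%), two sizes
|     print(p_value(500_000, 10_000_000, 503_000, 10_000_000))
|     # ~0.002: "significant", yet the effect is negligible
|     print(p_value(500, 10_000, 503, 10_000))
|     # ~0.92: same effect size, wildly different p-value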
| samch93 wrote:
| The ASA recently published a new statement which is more
| optimistic about the use of p-values [1]. I myself also think
| that correctly used p-values are in many situations a good tool
| for making sense out of data. Of course, a decision should
| never be based on a p-value alone, but the same could also
| be said about confidence/credible intervals, Bayes factors,
| relative belief ratios, and any other inferential tool
| available (and I'm saying this as someone who is doing research
| in Bayesian hypothesis testing methodology). Data analysts
| always need to use common sense and put the data at hand into
| broader context.
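|
| To make that concrete, a minimal sketch (assuming binomial
| data, Beta(1, 1) priors, and numpy; the counts are invented)
| of a posterior credible interval for the lift, which reports
| a range of plausible effect sizes rather than a single
| accept-or-reject decision:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     s_a, n_a = 1000, 20000  # hypothetical conversions/trials
|     s_b, n_b = 1100, 20000
|
|     # posterior draws under independent Beta(1, 1) priors
|     p_a = rng.beta(1 + s_a, 1 + n_a - s_a, size=100_000)
|     p_b = rng.beta(1 + s_b, 1 + n_b - s_b, size=100_000)
|
|     lift = p_b - p_a
|     lo, hi = np.percentile(lift, [2.5, 97.5])
|     print(f"95% credible interval: [{lo:.4f}, {hi:.4f}]")
|     print(f"P(B > A | data) = {(lift > 0).mean():.3f}")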
|
| [1] https://projecteuclid.org/journals/annals-of-applied-
| statist...
| jonathanbentz wrote:
| I am interested to see what they will be testing in some of the
| upcoming posts in this series. It would be fun to be scrolling
| Netflix and have the transparency to know that I'm seeing the 'B'
| test.
| kristjansson wrote:
| As in all controlled experiments, though, the experimenter
| wants to hide that information from the subject (the user in
| this instance) to measure how they respond to the change
| itself, rather than to the change plus being told about it.
| dmitriid wrote:
| Before interpreting A/B results, the main question that needs to
| be asked: "what is it that you're A/B testing?"
|
| For too many companies, it's testing "engagement", which
| leads to hiding functionality (more clicks means more
| engagement), reducing info density (more time spent means
| more engagement), etc.
|
| And coming from Netflix... I don't think there's a single person
| who likes that when you browse Netflix it autoplays random videos
| (not even trailers) with audio at full volume. But yeah, A/B
| tests something something. So I wish Netflix learned from their
| own teachings.
| dafelst wrote:
| People may not like that feature (I sure don't), but I would
| bet a decent sum that feature didn't drive increases in
| negative metrics like churn, and increased positive metrics
| like hours watched, perhaps by causing people to scroll through
| more of the library faster, or perhaps drawing people in with
| the previews. Or maybe they just saw an improvement in a non-
| core metric like "distance scrolled" with no other negative
| effects and said, "meh, ship it". Both seem likely.
|
| Of course this is the danger of any sort of behavioral metric
| driven optimization strategy - you may trade negative customer
| sentiment for positive business outcomes. That's where the real
| decision making comes about, i.e. are you willing to make that
| trade? It seems that in this case, Netflix was.
| mobjack wrote:
| I've A/B tested hiding functionality and reducing info
| density, and both changes increased the number of people
| spending money on the site.
|
| I was completely shocked by seeing those results initially and
| dove deep to look for any other negative effects from these
| changes but could not find any. I've repeated similar tests and
| the results are often similar.
|
| From that experience, I've learned that most people are not
| like me or the HN crowd. The things that you complain about
| could actually make things easier for the majority.
| type_enthusiast wrote:
| (disclaimer: I work for Netflix. Edit: I should clarify that I
| wasn't involved with this article in any way)
|
| You can disable the behavior you mentioned. Go to your profile
| settings, and under "Playback Settings" you can uncheck
| "Autoplay previews while browsing on all devices".
___________________________________________________________________
(page generated 2021-10-29 23:00 UTC)