[HN Gopher] Five ways to reduce variance in A/B testing
       ___________________________________________________________________
        
       Five ways to reduce variance in A/B testing
        
       Author : Maro
       Score  : 58 points
        Date   : 2024-09-28 12:42 UTC (1 day ago)
        
 (HTM) web link (bytepawn.com)
 (TXT) w3m dump (bytepawn.com)
        
       | withinboredom wrote:
        | Good advice! On an internal A/B testing platform I worked
        | on, we had built-in tooling to do some of this after the
        | fact. I don't know of any off-the-shelf A/B testing tool
        | that can do it.
        
         | alvarlagerlof wrote:
         | Pretty sure that http://statsig.com can
        
         | ulf-77723 wrote:
          | I worked at an A/B test SaaS company as a solutions
          | engineer, and to my knowledge every vendor is capable of
          | delivering solutions to these problems.
          | 
          | Some advertise these features, but the big ones take them
          | for granted. Usually, before a test is developed, the
          | project manager will help raise critical questions about
          | the test setup.
        
           | chashmataklu wrote:
            | Pretty sure most don't. Most A/B test SaaS vendors cater
            | to lightweight clickstream optimization, which is why
            | they don't have features like stratified sampling.
            | Internal systems are light-years ahead of most SaaS
            | vendors.
        
       | kqr wrote:
       | See also sample unit engineering:
       | https://entropicthoughts.com/sample-unit-engineering
       | 
        | Statisticians have a lot of useful tricks for getting
        | higher-quality data at the same cost (i.e. sample size).
       | 
       | Another topic I want to learn properly is running multiple
       | experiments in parallel in a systematic way to get faster results
       | and be able to control for confounding. Fisher advocated for this
       | as early as 1925, and I still think we're learning that lesson
       | today in our field: sometimes the right strategy is _not_ to try
       | one thing at a time and keep everything else constant.
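        | 
        | As a minimal, hypothetical sketch of that factorial idea
        | (Python; the experiment names are made up): hash each user
        | independently per experiment, so several experiments run on
        | the same traffic and their interaction can be estimated
        | instead of lurking as a confounder.
        | 
        |     import hashlib
        | 
        |     EXPERIMENTS = ("new_checkout", "new_ranking")  # made-up
        | 
        |     # Assign each user independently to every experiment
        |     # (a 2x2 factorial design when there are two of them).
        |     def assign(user_id, experiments=EXPERIMENTS):
        |         arms = {}
        |         for exp in experiments:
        |             h = hashlib.sha256(f"{exp}:{user_id}".encode())
        |             bucket = int.from_bytes(h.digest()[:8], "big") % 2
        |             arms[exp] = "treatment" if bucket else "control"
        |         return arms
        | 
        |     # e.g. {'new_checkout': 'control',
        |     #       'new_ranking': 'treatment'}
        |     print(assign("user-123"))
        | 
        | The analysis then fits both treatment indicators (and, with
        | enough data, their interaction) in one model rather than
        | comparing one change at a time.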
        
         | authorfly wrote:
          | Can you help me understand why we would use sample unit
          | engineering/bootstrapping? Imagine we don't care about
          | between-subjects variance (and thus P-values in t-tests/AB
          | tests); in that case, it doesn't help us, right?
         | 
         | I just feel intuitively that it's masking the variance by
         | converting it into within-subjects variance arbitrarily.
         | 
         | Here's my layman-ish interpretation:
         | 
          | Significant P-values are easier to obtain when the
          | variance is reduced. But we established P-values and the
          | 0.05 threshold _before_ these techniques. With the new
          | techniques reducing the SD, which P-values depend on
          | directly, you need to counteract the reduction in the
          | samples' SD with a harsher P-value threshold in order to
          | obtain the same number of true positive experiments as
          | when P-values were originally proposed. In other words,
          | letting more experiments have less variance in group tests
          | and reach statistical significance whenever there is an
          | effect is not necessarily advantageous, especially if we
          | consider the purpose of statistics and A/B testing to be
          | rejecting the null hypothesis rather than showing
          | significant effect sizes.
        
           | kqr wrote:
           | Let's use the classic example of "Lady tasting tea". Someone
           | claims to be able to tell, by taste alone, if milk was added
           | before or after boiling water.
           | 
           | We can imagine two versions of this test. In both, we serve
           | 12 cups of tea, six of which have had milk added first.
           | 
           | In one of the experiments, we keep everything else the same:
           | same quantities of milk and tea, same steeping time, same
           | type of tea, same source of water, etc.
           | 
           | In the other experiment, we randomly vary quantities of milk
           | and tea, steeping time, type of tea etc.
           | 
           | Both of these experiments are valid, both have the same 5 %
           | risk of false positives (given by the null hypothesis that
           | any judgment by the Lady is a coinflip). But you can probably
           | intuit that in one of the experiments, the Lady has a greater
           | chance of proving her acumen, because there are fewer
           | distractions. Maybe she _is_ able to discern milk-first-or-
           | last by taste, but this gets muddled up by all the variations
           | in the second experiment. In other words, the cleaner
            | experiment is more _sensitive_, but it is not at a
            | greater risk of false positives.
           | 
           | The same can be said of sample unit engineering: it makes
           | experiments more sensitive (i.e. we can detect a finer signal
           | for the same cost) without increasing the risk of false
            | positives (which is fixed by the type of test we run).
           | 
           | ----
           | 
           | Sometimes we only care about detecting a large effect, and a
           | small effect is clinically insignificant. Maybe we are only
           | impressed by the Lady if she can discern despite distractions
           | of many variations. Then removing distractions is a mistake.
           | But traditional hypothesis tests of that kind are designed
           | from the perspective of "any signal, however small, is
           | meaningful."
           | 
            | (I think this is even a requirement for using
            | frequentist methods. They need an exact null hypothesis
            | to compute probabilities from.)
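            | 
            | A minimal sketch of the arithmetic behind this (Python,
            | standard library only), using the 12-cup setup above.
            | The false-positive risk comes entirely from this
            | counting argument, no matter how much the cups otherwise
            | vary:
            | 
            |     from math import comb
            | 
            |     # Under the null hypothesis the Lady picks 6 of the
            |     # 12 cups at random, so her number of correct
            |     # "milk first" picks is hypergeometric.
            |     def p_value(correct, cups=12, milk=6, picks=6):
            |         """P(at least `correct` right) by pure guessing."""
            |         ways = sum(
            |             comb(milk, k) * comb(cups - milk, picks - k)
            |             for k in range(correct, min(milk, picks) + 1)
            |         )
            |         return ways / comb(cups, picks)
            | 
            |     print(p_value(6))  # all six right: ~0.0011
            |     print(p_value(5))  # five or more right: ~0.040
            | 
            | The variations in the second experiment change only her
            | chance of actually getting the picks right (the power),
            | not these numbers.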
        
       | sunir wrote:
        | One of the most frustrating results I found is that A/B
        | split tests often resolved into a winner within the sample
        | size range we set; however, if I left the split running over
        | a longer period of time (e.g. a year), the difference would
        | wash out.
       | 
        | I accidentally had retargeting in a 24-month split and found
        | that, after all the cost, it didn't matter in the long term.
        | We could bend the conversion curve but not change the people
        | who would convert.
       | 
        | And yes, we did capture more revenue in the short term, but
        | over the long term the cost of the ads netted it all out to
        | zero or less than zero. And yes, we turned off retargeting
        | after conversion. The result was that customers who weren't
        | retargeted eventually bought anyway.
       | 
       | Has anyone else experienced the same?
        
         | bdjsiqoocwk wrote:
          | > One of the most frustrating results I found is that A/B
          | split tests often resolved into a winner within the sample
          | size range we set; however, if I left the split running
          | over a longer period of time (e.g. a year), the difference
          | would wash out.
         | 
         | Doesn't that just mean there's no difference? Why is that
         | frustrating?
         | 
         | Does the frustration come from the expectation that any little
         | variable might make a difference? Should I use red buttons or
         | blue buttons? Maybe if the product is shit, the color of the
         | buttons doesn't matter.
        
           | admax88qqq wrote:
           | > Maybe if the product is shit, the color of the buttons
           | doesn't matter.
           | 
           | This should really be on a poster in many offices.
        
         | kqr wrote:
         | > We could bend the conversion curve but not change the people
         | who would convert.
         | 
          | I think this is very common. I talked to salespeople who
          | claimed that customers on 2.0 are happier than those on
          | 1.0, which they had determined by measuring satisfaction
          | in the two groups and getting a statistically significant
          | result.
         | 
         | What they didn't realise was that almost all of the customers
         | on 2.0 had been those that willingly upgraded from 1.0. What
         | sort of customer willingly upgrades? The most satisfied ones.
         | 
         | Again: they bent the curve, didn't change the people. I'm sure
         | this type of confounding-by-self-selection is incredibly
         | common.
        
         | Adverblessly wrote:
         | Obviously it depends on the exact test you are running, but a
         | factor that is frequently ignored in A/B testing is that often
         | one arm of the experiment is the existing state vs. another arm
         | that is some novel state, and such novelty can itself have an
         | effect. E.g. it doesn't really matter if this widget is blue or
         | green, but changing it from one color to the other temporarily
         | increases user attention to it, until they are again used to
         | the new color. Users don't actually prefer your new flow for X
         | over the old one, but because it is new they are trying it out,
         | etc.
        
       | pkoperek wrote:
        | Good read. Does anyone know if any of the experimentation
        | frameworks actually use these methods to make the results
        | more reliable (e.g. automatically apply winsorization or
        | attempt to make the split sizes even)?
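        | 
        | For reference, the winsorization step itself is tiny; a
        | minimal sketch (numpy assumed), capping each value at a
        | chosen percentile instead of dropping the row:
        | 
        |     import numpy as np
        | 
        |     # Cap values above the chosen percentile instead of
        |     # dropping them, so heavy spenders still count, just
        |     # less extremely.
        |     def winsorize(values, upper_pct=99):
        |         values = np.asarray(values, dtype=float)
        |         cap = np.percentile(values, upper_pct)
        |         return np.minimum(values, cap)
        | 
        |     revenue = [0, 0, 5, 8, 9, 12, 400]  # one whale
        |     # The 400 gets capped at the sample's 99th percentile.
        |     print(winsorize(revenue))
        | 
        | (scipy's stats.mstats.winsorize does much the same.) The
        | harder part is having a platform apply it consistently, with
        | the cut-off chosen before looking at the results.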
        
       | kqr wrote:
       | > Winsorizing, ie. cutting or normalizing outliers.
       | 
       | Note that outliers are often your most valuable data points[1].
       | I'd much rather stratify than cut them out.
       | 
       | By cutting them out you indeed get neater data, but it no longer
       | represents the reality you are trying to model and learn from,
       | and you run a large risk of drawing false conclusions.
       | 
       | [1]: https://entropicthoughts.com/outlier-detection
        
         | chashmataklu wrote:
         | TBH depends a lot on the business you're experimenting with and
         | who you're optimizing for. If you're Lime Bike, you don't want
         | to skew results because of a Doordasher who's on a bike for the
         | whole day because their car is broken.
         | 
         | If you're a retailer or a gaming company, you probably care
         | about your "whales" who'd get winsorized out. Depends on
         | whether you're trying to move topline - or trying to move the
         | "typical".
        
           | kqr wrote:
           | > Depends on whether you're trying to move topline - or
           | trying to move the "typical".
           | 
           | If this is an important difference, you should define the
           | "typical" population prior to running the experiment.
           | 
           | If you take "typical" to mean "the users who didn't
           | accidentally produce annoying data in this experiment" you
           | will learn things that don't generalise because they only
           | apply to an ill-defined fictional subsegment of your
           | population that is impossible to recreate.
           | 
           | If you don't know up front how to recognise a "typical" user
           | in the sense that matters to you, then that is the first
           | experiment to run!
        
       | vijayer wrote:
       | This is a good list that includes a lot of things most people
       | miss. I would also suggest:
       | 
        | 1. Tight targeting of your users in an A/B test. This can be
        | done through proper exposure logging, or by aiming at users
        | down-funnel if you're actually running a down-funnel
        | experiment. If your new iOS and Android feature is going to
        | be launched separately, then separate the experiments.
       | 
       | 2. Making sure your experiment runs in 7-day increments.
       | Averaging out weekly seasonality can be important in reducing
       | variance but also ensures your results accurately predict the
       | effect of a full rollout.
       | 
        | Everything mentioned in this article, including stratified
        | sampling and CUPED, is available out of the box on Statsig.
        | Disclaimer: I'm the founder, and this response was shared by
        | our DS Lead.
        
         | wodenokoto wrote:
         | > 2. Making sure your experiment runs in 7-day increments.
         | Averaging out weekly seasonality can be important in reducing
         | variance but also ensures your results accurately predict the
         | effect of a full rollout.
         | 
          | There are of course many seasonalities (day/night, weekly,
          | monthly, yearly), so it can be difficult to decide how
          | broadly you want to collect data. But I remember
          | interviewing at a very large online retailer, and they did
          | their A/B tests in an hour because they "would collect
          | enough data points to be statistically significant", and
          | that never sat right with me.
        
       | usgroup wrote:
        | Adding covariates to the post-experiment analysis can reduce
        | variance. One instance of this is CUPED, but there are lots
        | of covariates which are easier to add (e.g. request type,
        | response latency, day of week, user info, etc.).
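        | 
        | A minimal sketch of the CUPED adjustment (numpy assumed; the
        | covariate must be measured before exposure, e.g. the same
        | user's metric from the weeks before the test):
        | 
        |     import numpy as np
        | 
        |     # y: in-experiment metric, x: pre-experiment covariate
        |     # for the same users. theta = cov(x, y) / var(x)
        |     # minimises the variance of the adjusted metric without
        |     # changing its expected value.
        |     def cuped_adjust(y, x):
        |         y = np.asarray(y, dtype=float)
        |         x = np.asarray(x, dtype=float)
        |         theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
        |         return y - theta * (x - x.mean())
        | 
        | The same idea extends to several covariates by regressing
        | the metric on all of them and analysing the residuals.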
        
       | sanchezxs wrote:
       | Yes.
        
       | tmoertel wrote:
       | Just a note that "stratification" as described in this article is
       | not what is traditionally meant by taking a stratified sample.
       | The article states:
       | 
       | > Stratification lowers variance by making sure that each sub-
       | population is sampled according to its split in the overall
       | population.
       | 
       | In common practice, the main way that stratification lowers
       | variance is by computing a separate estimate for each sub-
       | population and then computing an overall population estimate from
       | the sub-population estimates. If the sub-populations are more
       | uniform ("homogeneous") than is the overall population, the sub-
       | populations will have smaller variances than the overall
       | population, and a combination of the smaller variances will be
       | smaller than the overall population's variance.
       | 
        | In short, you not only stratify the sample, but also
        | correspondingly stratify the calculation of the estimates
        | you want.
       | 
       | The article does not seem to take advantage of the second part.
       | 
       | (P.S. This idea, taken to the limit, is what leads to importance
       | sampling, where potentially every member of the population exists
       | in its own stratum. Art Owen has a good introduction:
       | https://artowen.su.domains/mc/Ch-var-is.pdf.)
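        | 
        | A minimal sketch of that second part (numpy assumed; the
        | strata and weights are made up): per-stratum means are
        | combined with known population weights, and the standard
        | error combines only the within-stratum variances.
        | 
        |     import numpy as np
        | 
        |     def stratified_mean(samples, weights):
        |         """samples: stratum -> observations;
        |         weights: stratum -> population share (sums to 1)."""
        |         est, var = 0.0, 0.0
        |         for stratum, xs in samples.items():
        |             xs = np.asarray(xs, dtype=float)
        |             w = weights[stratum]
        |             est += w * xs.mean()
        |             var += w**2 * xs.var(ddof=1) / len(xs)
        |         return est, var ** 0.5  # estimate, standard error
        | 
        |     samples = {"mobile": [0.10, 0.20, 0.15, 0.30],
        |                "desktop": [1.1, 0.9, 1.3, 1.0]}
        |     weights = {"mobile": 0.7, "desktop": 0.3}
        |     print(stratified_mean(samples, weights))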
        
       ___________________________________________________________________
       (page generated 2024-09-29 23:01 UTC)