[HN Gopher] Five ways to reduce variance in A/B testing
___________________________________________________________________
Five ways to reduce variance in A/B testing
Author : Maro
Score : 58 points
Date : 2024-09-28 12:42 UTC (1 day ago)
(HTM) web link (bytepawn.com)
(TXT) w3m dump (bytepawn.com)
| withinboredom wrote:
| Good advice! From working on an internal A/B testing platform: we
| had built-in tooling to do some of this after the fact. I don't
| know of any off-the-shelf A/B testing tool that can do it.
| alvarlagerlof wrote:
| Pretty sure that http://statsig.com can
| ulf-77723 wrote:
| Worked at an A/B testing SaaS company as a solutions engineer,
| and to my knowledge every vendor is capable of delivering
| solutions for those problems.
|
| Some advertise these capabilities, but the big ones take them
| for granted. Usually, before a test is developed, the project
| manager will help raise the critical questions about the test
| setup.
| chashmataklu wrote:
| Pretty sure most don't. Most A/B testing SaaS vendors cater to
| lightweight clickstream optimization, which is why they don't
| have features like stratified sampling. Internal systems are
| light-years ahead of most SaaS vendors.
| kqr wrote:
| See also sample unit engineering:
| https://entropicthoughts.com/sample-unit-engineering
|
| Statisticians have a lot of useful tricks for getting
| higher-quality data at the same cost (i.e. sample size).
|
| Another topic I want to learn properly is running multiple
| experiments in parallel in a systematic way to get faster results
| and be able to control for confounding. Fisher advocated for this
| as early as 1925, and I still think we're learning that lesson
| today in our field: sometimes the right strategy is _not_ to try
| one thing at a time and keep everything else constant.
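|
| A minimal sketch of the factorial idea (my own illustration,
| with made-up effect sizes, assuming numpy/pandas/statsmodels):
|
|     import numpy as np
|     import pandas as pd
|     import statsmodels.formula.api as smf
|
|     rng = np.random.default_rng(0)
|     n = 4000
|     # Randomise two changes independently instead of testing
|     # one thing at a time
|     df = pd.DataFrame({
|         "button": rng.integers(0, 2, n),  # 0 = old, 1 = new
|         "copy": rng.integers(0, 2, n),
|     })
|     # Simulated outcome: each factor has a small additive effect
|     df["y"] = (0.05 * df["button"] + 0.03 * df["copy"]
|                + rng.normal(0, 1, n))
|     # One model estimates both effects plus their interaction,
|     # controlling for the other factor by design
|     print(smf.ols("y ~ button * copy", data=df).fit().summary())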
| authorfly wrote:
| Can you help me understand why we would use sample unit
| engineering/bootstrapping? If we don't care about
| between-subjects variance (and thus p-values in t-tests/A/B
| tests), then it doesn't help us, right?
|
| I just feel intuitively that it masks the variance by
| arbitrarily converting it into within-subjects variance.
|
| Here's my layman-ish interpretation:
|
| Significant p-values are easier to obtain when the variance is
| reduced. But we established p-values and the 0.05 threshold
| _before_ these techniques existed. If new techniques shrink the
| SD that p-values are computed from, you would need to counteract
| that with a stricter threshold to keep the same rate of true
| positive experiments as when p-values were originally proposed.
| In other words, letting more experiments reach statistical
| significance by reducing group variance is not necessarily
| advantageous, especially if we consider the purpose of
| statistics and A/B testing to be rejecting the null hypothesis
| rather than showing significant effect sizes.
| kqr wrote:
| Let's use the classic example of "Lady tasting tea". Someone
| claims to be able to tell, by taste alone, if milk was added
| before or after boiling water.
|
| We can imagine two versions of this test. In both, we serve
| 12 cups of tea, six of which have had milk added first.
|
| In one of the experiments, we keep everything else the same:
| same quantities of milk and tea, same steeping time, same
| type of tea, same source of water, etc.
|
| In the other experiment, we randomly vary quantities of milk
| and tea, steeping time, type of tea etc.
|
| Both of these experiments are valid, and both have the same 5%
| risk of false positives (given by the null hypothesis that any
| judgment by the Lady is a coin flip). But you can probably
| intuit that in one of the experiments, the Lady has a greater
| chance of proving her acumen, because there are fewer
| distractions. Maybe she _is_ able to discern milk-first-or-last
| by taste, but this gets muddled by all the variations in the
| second experiment. In other words, the cleaner experiment is
| more _sensitive_, but it is not at greater risk of false
| positives.
|
| The same can be said of sample unit engineering: it makes
| experiments more sensitive (i.e. we can detect a finer signal
| for the same cost) without increasing the risk of false
| positives (which is fixed by the type of test we run).
|
| ----
|
| Sometimes we only care about detecting a large effect, and a
| small effect is clinically insignificant. Maybe we are only
| impressed by the Lady if she can discern despite the distraction
| of many variations. Then removing distractions is a mistake.
| But traditional hypothesis tests of that kind are designed
| from the perspective of "any signal, however small, is
| meaningful."
|
| (I think this is even a requirement for using frequentist
| methods: they need an exact null hypothesis to compute
| probabilities from.)
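|
| The sensitivity point is easy to check by simulation. A minimal
| sketch (my own, assuming numpy/scipy; noise_sd plays the role of
| the distracting variations):
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(0)
|
|     def rejection_rate(effect, noise_sd, n=200, trials=2000):
|         """Share of simulated experiments where a two-sample
|         t-test rejects the null at the 5% level."""
|         hits = 0
|         for _ in range(trials):
|             a = rng.normal(0.0, noise_sd, n)     # control
|             b = rng.normal(effect, noise_sd, n)  # treatment
|             hits += stats.ttest_ind(a, b).pvalue < 0.05
|         return hits / trials
|
|     # Under the null, false positives stay at ~5% either way...
|     print(rejection_rate(0.0, 1.0), rejection_rate(0.0, 0.3))
|     # ...but for a real effect, the cleaner test detects it
|     # far more often.
|     print(rejection_rate(0.2, 1.0), rejection_rate(0.2, 0.3))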
| sunir wrote:
| One of the most frustrating results I found is that A/B split
| tests often resolved into a winner within the sample size range
| we set; however, if I left the split running over a longer
| period of time (e.g. a year), the difference would wash out.
|
| I had retargeting in a 24-month split by accident and found
| that, after all the cost, it didn't matter in the long term. We
| could bend the conversion curve but not change the people who
| would convert.
|
| And yes, we did capture more revenue in the short term, but over
| the long term the cost of the ads netted it all out to zero or
| less than zero. And yes, we turned off retargeting after
| conversion. The result was that customers who weren't retargeted
| eventually bought anyway.
|
| Has anyone else experienced the same?
| bdjsiqoocwk wrote:
| > One of the most frustrating results I found is that A/B split
| tests often resolved into a winner within the sample size range
| we set; however, if I left the split running over a longer
| period of time (e.g. a year), the difference would wash out.
|
| Doesn't that just mean there's no difference? Why is that
| frustrating?
|
| Does the frustration come from the expectation that any little
| variable might make a difference? Should I use red buttons or
| blue buttons? Maybe if the product is shit, the color of the
| buttons doesn't matter.
| admax88qqq wrote:
| > Maybe if the product is shit, the color of the buttons
| doesn't matter.
|
| This should really be on a poster in many offices.
| kqr wrote:
| > We could bend the conversion curve but not change the people
| who would convert.
|
| I think this is very common. I talked to salespeople who
| claimed that customers on 2.0 were happier than those on 1.0,
| which they had determined by measuring satisfaction in the two
| groups and getting a statistically significant result.
|
| What they didn't realise was that almost all of the customers
| on 2.0 had been those that willingly upgraded from 1.0. What
| sort of customer willingly upgrades? The most satisfied ones.
|
| Again: they bent the curve, didn't change the people. I'm sure
| this type of confounding-by-self-selection is incredibly
| common.
| Adverblessly wrote:
| Obviously it depends on the exact test you are running, but a
| factor that is frequently ignored in A/B testing is that often
| one arm of the experiment is the existing state vs. another arm
| that is some novel state, and such novelty can itself have an
| effect. E.g. it doesn't really matter if this widget is blue or
| green, but changing it from one color to the other temporarily
| increases user attention to it, until they are again used to
| the new color. Users don't actually prefer your new flow for X
| over the old one, but because it is new they are trying it out,
| etc.
| pkoperek wrote:
| Good read. Does anyone know if any of the experimentation
| frameworks actually use these methods to make the results more
| reliable (e.g. automatically applying winsorization or
| attempting to make the split sizes even)?
| kqr wrote:
| > Winsorizing, ie. cutting or normalizing outliers.
|
| Note that outliers are often your most valuable data points[1].
| I'd much rather stratify than cut them out.
|
| By cutting them out you indeed get neater data, but it no longer
| represents the reality you are trying to model and learn from,
| and you run a large risk of drawing false conclusions.
|
| [1]: https://entropicthoughts.com/outlier-detection
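|
| For reference, winsorizing typically means clipping at a
| percentile. A minimal sketch (my own, assuming numpy/scipy) that
| also shows how the clipping moves the estimate itself when the
| tail is real:
|
|     import numpy as np
|     from scipy.stats.mstats import winsorize
|
|     rng = np.random.default_rng(0)
|     revenue = rng.lognormal(0, 2, 10_000)  # heavy-tailed metric
|
|     # Replace the top 1% of values with the 99th percentile
|     clipped = winsorize(revenue, limits=(0, 0.01))
|
|     # Much lower variance, but the mean has shifted too:
|     print(revenue.mean(), clipped.mean())
|     print(revenue.var(), clipped.var())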
| chashmataklu wrote:
| TBH depends a lot on the business you're experimenting with and
| who you're optimizing for. If you're Lime Bike, you don't want
| to skew results because of a Doordasher who's on a bike for the
| whole day because their car is broken.
|
| If you're a retailer or a gaming company, you probably care
| about your "whales" who'd get winsorized out. Depends on
| whether you're trying to move topline - or trying to move the
| "typical".
| kqr wrote:
| > Depends on whether you're trying to move topline - or
| trying to move the "typical".
|
| If this is an important difference, you should define the
| "typical" population prior to running the experiment.
|
| If you take "typical" to mean "the users who didn't
| accidentally produce annoying data in this experiment" you
| will learn things that don't generalise because they only
| apply to an ill-defined fictional subsegment of your
| population that is impossible to recreate.
|
| If you don't know up front how to recognise a "typical" user
| in the sense that matters to you, then that is the first
| experiment to run!
| vijayer wrote:
| This is a good list that includes a lot of things most people
| miss. I would also suggest:
|
| 1. Tight targeting of your users in an A/B test. This can be done
| through proper exposure logging, or by aiming at users down-funnel
| if you're actually running a down-funnel experiment. If your new
| iOS and Android feature is going to be launched separately, then
| separate the experiments.
|
| 2. Making sure your experiment runs in 7-day increments.
| Averaging out weekly seasonality not only reduces variance but
| also ensures your results accurately predict the effect of a full
| rollout.
|
| Everything mentioned in this article, including stratified
| sampling and CUPED, is available out of the box in Statsig.
| Disclaimer: I'm the founder, and this response was shared by our
| DS Lead.
| wodenokoto wrote:
| > 2. Making sure your experiment runs in 7-day increments.
| Averaging out weekly seasonality not only reduces variance but
| also ensures your results accurately predict the effect of a full
| rollout.
|
| There are of course many seasonalities (day/night, weekly,
| monthly, yearly), so it can be difficult to decide how broadly
| you want to collect data. But I remember interviewing at a very
| large online retailer, and they did their A/B tests in an hour
| because they "would collect enough data points to be
| statistically significant", and that never sat right with me.
| usgroup wrote:
| Adding covariates to the post-experiment analysis can reduce
| variance. One instance of this is CUPED, but there are lots of
| covariates that are easier to add (e.g. request type, response
| latency, day of week, user info, etc.).
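|
| CUPED itself is a one-liner once you have a pre-experiment
| covariate. A minimal sketch (my own, assuming numpy, with
| simulated data; "pre" is the same metric measured before the
| experiment):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n = 10_000
|     pre = rng.normal(10, 3, n)        # pre-experiment metric
|     post = pre + rng.normal(0, 1, n)  # metric during experiment
|
|     # Subtract the part of the metric the covariate predicts
|     theta = np.cov(post, pre)[0, 1] / np.var(pre, ddof=1)
|     adjusted = post - theta * (pre - pre.mean())
|
|     # Same mean, far smaller variance to run the test on
|     print(post.mean(), adjusted.mean())
|     print(post.var(), adjusted.var())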
| sanchezxs wrote:
| Yes.
| tmoertel wrote:
| Just a note that "stratification" as described in this article is
| not what is traditionally meant by taking a stratified sample.
| The article states:
|
| > Stratification lowers variance by making sure that each sub-
| population is sampled according to its split in the overall
| population.
|
| In common practice, the main way that stratification lowers
| variance is by computing a separate estimate for each sub-
| population and then computing an overall population estimate from
| the sub-population estimates. If the sub-populations are more
| uniform ("homogeneous") than the overall population, their
| variances will be smaller, and the weighted combination of those
| smaller variances will be smaller than the overall population's
| variance.
|
| In short, you not only stratify the sample, but also
| correspondingly stratify the calculation of your wanted
| estimates.
|
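| A minimal sketch of that second part (my own, assuming numpy
| and known stratum weights):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     # Two sub-populations with known shares of the population
|     weights = {"mobile": 0.7, "desktop": 0.3}
|     samples = {
|         "mobile": rng.normal(1.0, 0.2, 700),
|         "desktop": rng.normal(3.0, 0.2, 300),
|     }
|
|     # Estimate each stratum separately, then combine
|     mean = sum(w * samples[s].mean() for s, w in weights.items())
|     var = sum(w**2 * samples[s].var(ddof=1) / len(samples[s])
|               for s, w in weights.items())
|
|     # Versus the naive pooled estimate of the same mean
|     pooled = np.concatenate(list(samples.values()))
|     print(mean, var)
|     print(pooled.mean(), pooled.var(ddof=1) / len(pooled))
|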
| The article does not seem to take advantage of the second part.
|
| (P.S. This idea, taken to the limit, is what leads to importance
| sampling, where potentially every member of the population exists
| in its own stratum. Art Owen has a good introduction:
| https://artowen.su.domains/mc/Ch-var-is.pdf.)
___________________________________________________________________
(page generated 2024-09-29 23:01 UTC)