[HN Gopher] Ask HN: Resources about math behind A/B testing
___________________________________________________________________
Ask HN: Resources about math behind A/B testing
I've been learning more about A/B testing over the last few months.
I've read almost all the work by Evan Miller, and I've enjoyed it a
lot. However, I'd like a more structured approach to the topic,
since sometimes I feel I'm missing some basics. I have good math
knowledge and pretty decent stats foundations. What are your
favourite books/papers on this topic?
Author : alexmolas
Score : 172 points
Date : 2024-08-08 21:03 UTC (2 days ago)
| sebg wrote:
| Hi
|
| Have you looked into these two?
|
| - Trustworthy Online Controlled Experiments by Kohavi, Tang, and
| Xu
|
| - Statistical Methods in Online A/B Testing by Georgi Georgiev
|
| Recommended by stats stackexchange
| (https://stats.stackexchange.com/questions/546617/how-can-i-l...)
|
| There's a bunch of other books/courses/videos on O'Reilly.
|
| Another potential way to approach this learning goal is to look
| at Evan's tools (https://www.evanmiller.org/ab-testing/) and go
| into each one and then look at the JS code for running the tools
| online.
|
| See if you can go through and comment/write out your thoughts on
| why it's written that way. Of course, you'll have to know some JS
| for that, but it might be helpful to go through a file like
| (https://www.evanmiller.org/ab-testing/sample-size.js) and figure
| out what math is being done.
| sebg wrote:
| PS - if you are looking for more of the academic side (cutting
| edge, much harder statistics), you can start to look at recent
| work people are doing with A/B tests like this paper ->
| https://arxiv.org/abs/2002.05670
| sebg wrote:
| Even more!
|
| Have you seen this video -
| https://www.nber.org/lecture/2024-methods-lecture-susan-
| athe...
|
| Might be interesting to you.
| iamacyborg wrote:
| I'll second Trustworthy Online Controlled Experiments.
| Fantastic read, and Ron Kohavi is worth a follow on LinkedIn, as
| he's quite active there and usually shares interesting
| insights (or politely points out poor practices).
| vismit2000 wrote:
| https://everyday-data-science.tigyog.app/a-b-testing
| nanis wrote:
| Early in the A-B craze (optimal shade of blue nonsense), I was
| talking to someone high up with an online hotel reservation
| company who was telling me how great A-B testing had been for
| them. I asked him how they chose stopping point/sample size. He
| told me experiments continued until they observed a statistically
| significant difference between the two conditions.
|
| The arithmetic is simple and cheap. Understanding basic intro
| stats principles, priceless.
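|
| For anyone curious just how bad "run until significant" is, a
| rough Python simulation (illustrative numbers, nothing from any
| real company) shows it: two identical variants, a z-test peek
| after every batch, and the share of A/A tests that ever look
| "significant" lands well above the nominal 5%.
|
|   import numpy as np
|   from scipy import stats
|
|   rng = np.random.default_rng(0)
|   p_true = 0.05                  # same true rate in both arms: no real effect
|   batch, max_peeks = 1000, 100   # peek after every 1000 visitors per arm
|   n_sims, false_positives = 500, 0
|
|   for _ in range(n_sims):
|       n = conv_a = conv_b = 0
|       for _ in range(max_peeks):
|           conv_a += rng.binomial(batch, p_true)
|           conv_b += rng.binomial(batch, p_true)
|           n += batch
|           # two-proportion z-test at this peek
|           p_pool = (conv_a + conv_b) / (2 * n)
|           se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
|           z = (conv_b - conv_a) / (n * se)
|           if 2 * stats.norm.sf(abs(z)) < 0.05:
|               false_positives += 1
|               break
|
|   print(f"A/A tests that ever looked significant: {false_positives / n_sims:.0%}")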
| regularfry wrote:
| And yet this is the default. As commonly implemented, a/b
| testing is an excellent way to look busy, and people will
| _actively resist_ changing processes to make them more
| reliable.
|
| I think this is not unrelated to the fact that if you wait long
| enough you can get a positive signal from a neutral
| intervention, so you can _literally_ shuffle chairs on the
| Titanic and claim success. The incentives are _against_
| accuracy because nobody wants to be told that the feature
| they've just had the team building for 3 months had no effect
| whatsoever.
| axegon_ wrote:
| Many years ago I was working for a large gaming company, and I
| was the one who developed a cheap and effective way to split
| any cluster of users into A/B groups. The company was extremely
| happy with how well that worked. However, I did some
| investigation on my own a year later to see how the business
| development people were using it and... yeah, pretty much what
| you said. They were literally brute-forcing different
| configurations until they (more or less) got the desired results.
| kwillets wrote:
| Microsoft has a seed finder specifically aimed at avoiding a
| priori bias in experiment groups, but IMO the main effect is
| pushing whales (which are possibly bots) into different
| groups until the bias evens out.
|
| I find it hard to imagine obtaining much bias from a random
| hash seed in a large group of small-scale users, but I
| haven't looked at the problem closely.
| ec109685 wrote:
| We definitely saw bias, and it made experiments hard to
| launch until the system started pre-identifying unbiased
| population samples ahead of time, so the experiment could
| just pull pre-vetted users.
| glutamate wrote:
| Sounds like you already know this, but that's not great and
| will give a lot of false positives. In science this is called
| p-hacking. The rigorous way to use hypothesis testing is to
| calculate the sample size for the expected effect size and run
| only one test once this sample size is reached. But this
| requires knowing the effect size.
|
| If you are doing a lot of significance tests you need to adjust
| the p-level by dividing by the number of implicit comparisons
| (a Bonferroni correction), so e.g. only accept p<0.001 if you
| check the results once per day over 50 days.
|
| Alternatively, just do Thompson sampling until one variant
| dominates.
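|
| As a rough sketch of the "fix the sample size up front" step
| (made-up numbers; statsmodels is my pick of tool here, not
| something from the thread):
|
|   from statsmodels.stats.power import NormalIndPower
|   from statsmodels.stats.proportion import proportion_effectsize
|
|   # baseline 5% conversion, hoping to detect a lift to 5.5%
|   effect = proportion_effectsize(0.055, 0.05)   # Cohen's h for the two rates
|   n_per_group = NormalIndPower().solve_power(
|       effect_size=effect, alpha=0.05, power=0.8,
|       ratio=1.0, alternative="two-sided",
|   )
|   print(f"visitors needed per variant: {n_per_group:,.0f}")
|
| Run the test until each variant has that many visitors, then do
| the significance test once.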
| paulddraper wrote:
| To expand, the p-value tells you significance (more precisely
| the likelihood of the effect if there were no underlying
| difference). But if you observe it over and over again and
| pay attention to one value, you've subverted the measure.
|
| Thompson/multi-armed bandit optimizes for outcome over the
| duration of the test, by progressively altering the treatment
| %. The test runs longer, but yields better outcomes while
| doing it.
|
| It's objectively a better way to optimize, unless there is
| time-based overhead to the existence of the A/B test itself.
| (E.g. maintaining two code paths.)
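|
| A minimal sketch of Beta-Bernoulli Thompson sampling for two
| variants, in case it helps (rates and visitor counts are made
| up; a real system also needs delayed conversions, logging, etc.):
|
|   import numpy as np
|
|   rng = np.random.default_rng(1)
|   true_rates = {"A": 0.050, "B": 0.055}   # unknown in practice
|   wins = {"A": 1, "B": 1}                 # Beta(1, 1) priors
|   losses = {"A": 1, "B": 1}
|
|   for _ in range(50_000):                 # one iteration per visitor
|       # draw a plausible conversion rate for each arm from its posterior
|       draws = {arm: rng.beta(wins[arm], losses[arm]) for arm in true_rates}
|       arm = max(draws, key=draws.get)     # show the arm that looks best
|       converted = rng.random() < true_rates[arm]
|       wins[arm] += converted              # posterior update
|       losses[arm] += 1 - converted
|
|   share_b = (wins["B"] + losses["B"] - 2) / 50_000
|   print(f"traffic share sent to B: {share_b:.0%}")
|
| Over time almost all traffic drifts to the better arm, which is
| the "better outcomes while the test runs" property above.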
| youainti wrote:
| I just wanted to affirm what you are doing here.
|
| A key point here is that p-values optimize for detection of
| effects if you do everything right, which, as you point out, is
| not common.
|
| > Thompson/multi-armed bandit optimizes for outcome over
| the duration of the test.
|
| Exactly.
| kqr wrote:
| The p-value is the risk of getting an effect _specifically
| due to sampling error_, under the assumption of perfectly
| random sampling with no real effect. It says very little.
|
| In particular, if you aren't doing perfectly random
| sampling it is meaningless. If you are concerned about
| other types of error than sampling error it is meaningless.
|
| A significant p-value is nowhere near proof of effect. All
| it does is suggestively wiggle its eyebrows in the
| direction of further research.
| paulddraper wrote:
| > likelihood of the effect if there were no underlying
| difference
|
| By "effect" I mean "observed effect"; i.e. how likely are
| those results, assuming the null hypothesis.
| esafak wrote:
| Perhaps he was using a sequential test.
| Someone wrote:
| > He told me experiments continued until they observed a
| statistically significant difference between the two
| conditions.
|
| Apparently, if you do the observing the right way, that is a
| sound way to do that. https://en.wikipedia.org/wiki/E-values:
|
| _"We say that testing based on e-values remains safe (Type-I
| valid) under optional continuation."_
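|
| As a toy sketch of the idea (my own simplification: 50/50
| traffic split, H0 = "no difference", so under H0 each conversion
| is equally likely to have come from A or B; the running product
| below is a test martingale, and Ville's inequality is what makes
| stopping or continuing at any time safe):
|
|   import numpy as np
|
|   rng = np.random.default_rng(2)
|   alpha = 0.05
|   q = 0.55        # the alternative we bet on: 55% of conversions from B
|   e_value = 1.0
|
|   for i in range(10_000):              # one step per observed conversion
|       from_b = rng.random() < 0.5      # simulate H0: no real difference
|       # likelihood ratio of this observation under q vs. under 0.5
|       e_value *= (q if from_b else 1 - q) / 0.5
|       if e_value >= 1 / alpha:         # safe to stop whenever this happens
|           print(f"stop at conversion {i}: evidence against H0")
|           break
|   else:
|       print("never crossed 1/alpha (at least 95% of H0 runs end here)")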
| gatopingado wrote:
| This is correct. There's been a lot of interest in e-values
| and non-parametric confidence sequences in recent literature.
| It's usually referred to as anytime-valid inference [1]. Evan
| Miller explored a similar idea in [2]. For some practical
| examples, see my Python library [3] implementing multinomial
| and time-inhomogeneous Bernoulli / Poisson process tests
| based on [4]. See [5] for linear models / t-tests.
|
| [1] https://arxiv.org/abs/2210.0194
|
| [2] https://www.evanmiller.org/sequential-ab-testing.html
|
| [3] https://github.com/assuncaolfi/savvi/
|
| [4] https://openreview.net/forum?id=a4zg0jiuVi
|
| [5] https://arxiv.org/abs/2210.08589
| ryan-duve wrote:
| Did you link the thing that you intended to for [1]? I
| can't find anything about "anytime-valid inference" there.
| gatopingado wrote:
| Thanks for noting! This is the right link for [1]:
| https://arxiv.org/abs/2210.01948
| abhgh wrote:
| This is a form of "interim analysis" [1].
|
| [1] https://en.wikipedia.org/wiki/Interim_analysis
| IshKebab wrote:
| This is surely more efficient if you do the statistics right? I
| mean, I'm sure they didn't, but the intuition that you can stop
| once there's sufficient evidence is correct.
| scott_w wrote:
| Bear in mind many people aren't doing the statistics right.
|
| I'm not an expert but my understanding is that it's doable if
| you're calculating the correct MDE based on the observed
| sample size, though not ideal (because sometimes the observed
| sample is too small and there's no way round that).
|
| I suspect the problem comes when people don't adjust the MDE
| properly for the smaller sample. Tools help but you've gotta
| know about them and use them ;)
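|
| For a rough feel, a back-of-the-envelope MDE formula (normal
| approximation, made-up numbers below) shows how the detectable
| lift grows as the achieved sample shrinks:
|
|   from math import sqrt
|   from scipy.stats import norm
|
|   def mde(baseline_rate, n_per_group, alpha=0.05, power=0.8):
|       z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
|       return z * sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
|
|   # the smaller the sample you ended up with, the bigger the lift
|   # you'd need in order to detect it
|   for n in (5_000, 20_000, 80_000):
|       print(n, f"{mde(0.05, n):.3%}")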
|
| Personally I'd prefer to avoid this and be a bit more strict
| due to something a PM once said: "If you torture the data
| long enough, it'll show you what you want to see."
| sethd1211 wrote:
| Which company was this? Was it by chance SnapTravel?
| simulo wrote:
| For learning about the basics of statistics, my go-to resource is
| "Discovering Statistics Using [R/SPSS]" (Andy Field). "Improving
| Your Statistical Inferences" (Daniel Lakens) needs some basics,
| but covers a lot of interesting topics, including sequential
| testing and equivalence tests (sometimes you want to know if a
| new thing is equivalent to the old).
| daxaxelrod wrote:
| Growthbook wrote a short paper on how they evaluate test results
| continuously.
|
| https://docs.growthbook.io/GrowthBookStatsEngine.pdf
| phyalow wrote:
| Experimentation for Engineers: From A/B Testing to Bayesian
| Optimization, by David Sweet.
|
| This book is really great and I highly recommend it; it goes
| broader than A/B but covers everything quite well from a
| first-principles perspective.
|
| https://www.manning.com/books/experimentation-for-engineers
| youainti wrote:
| Just as some basic context, there are two related approaches to
| A/B testing. The first comes from statistics, and is going to
| look like standard hypothesis testing of differences of means or
| medians. The second is from Machine Learning and is going to
| discuss multi-armed bandit problems. They are both good and have
| different tradeoffs. I just wanted you to know that there are two
| different approaches that are both valid.
| rancar2 wrote:
| I once wanted a structured approach before I had access to large
| amounts of traffic. Once I had traffic available, the learning
| naturally happened (background in engineering with advanced
| math). If you are lucky enough to start learning through hands-on
| experience, I'd check out: https://goodui.org/
|
| I was lucky to get trained well by 100m+ users over the years. If
| you have a problem you are trying to solve, I'm happy to go over
| my approach to designing optimization winners repeatedly.
|
| Alex, I will shoot you an email shortly. Also, sebg's comment is
| good if you are looking for the more academic route to
| learning.
| gjstein wrote:
| I'd also like to mention the classic book "Reinforcement
| Learning" by Sutton & Barto, which goes into some relevant
| mathematical aspects for choosing the "best" among a set of
| options. The full PDF is available for free on their
| website [1]. Chapter 2 on "Multi-Armed Bandits" is where to
| start.
|
| [1] http://incompleteideas.net/book/the-book-2nd.html
| austin-cheney wrote:
| When I used to do A/B testing, all results per traffic funnel
| were averaged over time into cumulative results. The tests would
| run as long as they needed to attain statistical confidence
| between the funnels, where confidence was the ratio of
| differentiation between results over time after discounting for
| noise and variance.
|
| Only at test completion were financial projections attributed to
| test results. Don't sugar coat it. Let people know up front just
| how damaging their wonderful business ideas are.
|
| The biggest lesson from this is that the financial projections
| from the tests were always far too optimistic compared to what
| later shipped to production. The tests were always correct. The
| cause of the discrepancies was shitty development. If a new
| initiative shipped to production is defective or slow, it will
| not perform as well as the tests projected. Web development is
| full of shitty developers who cannot program for the web, and our
| tests were generally ideal in their execution.
| AlexeyMK wrote:
| If you'd rather go through some of this live, we have a section
| on Stats for Growth Engineers in the Growth Engineering Course on
| Reforge (course.alexeymk.com). We talk through stat sig, power
| analysis, common experimentation footguns and alternate
| methodologies such as Bayesian, Sequential, and Bandits (which
| are typically Bayesian). Running next in October.
|
| Other than that, Evan's stuff is great, and the Ron Kohavi book
| gets a +1, though it is definitely dense.
| aidos wrote:
| Do you have another link? That one's not working for me.
| nivertech wrote:
| A/B Testing
|
| An interactive look at Thompson sampling
|
| https://everyday-data-science.tigyog.app/a-b-testing
| vishnuvram wrote:
| I really liked Richard McElreath's Statistical Rethinking:
| https://youtube.com/playlist?list=PLDcUM9US4XdPz-KxHM4XHt7uU...
| Maro wrote:
| My blog has tons of articles about A/B testing, with math and
| Python code to illustrate. Good starting point:
|
| https://bytepawn.com/five-ways-to-reduce-variance-in-ab-test...
| tgtweak wrote:
| No specific literature to recommend, but understanding sample size
| and margin of error/confidence interval calculations will really
| help you understand A/B testing. Beyond A/B, this will help with
| multivariate testing as well, which has mostly replaced plain A/B
| in orgs that are serious about testing.
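|
| For instance, the margin of error on a measured lift is just the
| half-width of a confidence interval like this one (normal
| approximation, made-up counts):
|
|   from math import sqrt
|   from scipy.stats import norm
|
|   conv_a, n_a = 510, 10_000    # control:   5.10% observed
|   conv_b, n_b = 565, 10_000    # treatment: 5.65% observed
|
|   p_a, p_b = conv_a / n_a, conv_b / n_b
|   diff = p_b - p_a
|   se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
|   z = norm.ppf(0.975)          # 95% two-sided
|
|   low, high = diff - z * se, diff + z * se
|   print(f"lift: {diff:+.2%}, 95% CI: [{low:+.2%}, {high:+.2%}]")
|
| The interval width shrinks like 1/sqrt(n), which is where most
| sample-size intuition comes from.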
| kqr wrote:
| I don't think the mathematics is what gets most people into
| trouble. You can get by with relatively primitive maths, and the
| advanced stuff is really just a small-order-of-magnitude cost
| optimisation.
|
| What gets people are incorrect procedures. To get a sense of all
| the ways in which an experiment can go wrong, I'd recommend
| reading more traditional texts on experimental design, survey
| research, etc.
|
| - Donald Wheeler's _Understanding Variation_ should be mandatory
| reading for almost everyone working professionally.
|
| - Deming's _Some Theory of Sampling_ is really good and covers
| more ground than the title lets on.
|
| - Deming's _Sample Design in Business Research_ I remember being
| formative for me also, although it was a while since I read it.
|
| - Efron and Tibshirani's _Introduction to the Bootstrap_ gives an
| intuitive sense of some experimental errors from a different
| perspective.
|
| I know there's one book covering survey design I really liked but
| I forget which one it was. Sorry!
| psawaya wrote:
| I'm also looking for a good resource on survey design. If you
| remember the book, please let us know! :)
| kqr wrote:
| I know the Deming books are written to a large extent from
| the perspective of surveys, but they are mainly technical.
|
| I have also read Robinson's _Designing Quality Survey
| Questions_ which I remember as good, but perhaps not as deep
| as I had hoped. I don't think that's the one I'm thinking
| of, unfortunately.
|
| It's highly possible I'm confabulating a book from a variety
| of sources also...
| RyJones wrote:
| I worked for Ron Kohavi - he has a couple books. "Practical Guide
| to Controlled Experiments on the Web: Listen to Your Customers
| not to the HiPPO", and "Trustworthy Online Controlled
| Experiments: A Practical Guide to A/B Testing". I haven't read
| the second, but the first is easy to find and peruse.
| crdrost wrote:
| So the thing I always ctrl-F for, to see if a paper or course
| really knows what it's talking about, is called the "multi-armed
| bandit" problem. Just ctrl-F bandit, if an A/B tutorial is long
| enough it will usually mention them.
|
| This is not a foolproof method, I'd call it only +-5 dB of
| evidence, so it would shift a 50% likelihood that they know what
| they're talking about to roughly 75% if present or 25% if absent,
| but obviously look at the rest of it and see if that's borne out.
| And to be clear: even mentioning it, if only to dismiss it,
| counts!
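|
| (The dB numbers are roughly the Jaynes-style convention of
| 10*log10 of the likelihood ratio applied to the odds; as a toy
| helper, with the 75%/25% above being rounding:)
|
|   def update(prior_prob, db):
|       prior_odds = prior_prob / (1 - prior_prob)
|       posterior_odds = prior_odds * 10 ** (db / 10)
|       return posterior_odds / (1 + posterior_odds)
|
|   print(f"{update(0.5, +5):.0%}")   # ~76%
|   print(f"{update(0.5, -5):.0%}")   # ~24%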
|
| So e.g. I remember reading a whitepaper about "A/B Tests are
| Leading You Astray" and thinking "hey that's a fun idea, yeah,
| effect size is too often accidentally conditioned on whether the
| result was judged statistically significant, which would be a
| source of bias" ...and sure enough a sentence came up, just
| innocently, like, "you might even have a bandit algorithm! But
| you had to use your judgment to discern that that was appropriate
| in context." And it's like "OK, you know about bandits but you
| are explicitly interested in human discernment and human decision
| making, great." So, +5 dB to you.
|
| And on the flip-side if it makes reference to A/B testing but
| it's decently long and never mentions bandits then there's only
| maybe a 25% chance they know what they are talking about. It can
| still happen, you might see e.g. a chi-squared test instead of the t-test
| [because usually you don't have just "converted" vs "did not
| convert"... can your analytics grab "thought about it for more
| than 10s but did not convert" etc.?] or something else that
| piques interest. Or it's a very short article where it just
| didn't come up, but that's fine because we are, when reading,
| performing a secret cost-benefit analysis and short articles have
| very low cost.
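|
| For concreteness, that chi-squared idea with a third outcome
| bucket might look something like this (the counts are invented):
|
|   from scipy.stats import chi2_contingency
|
|   # rows: variant A, variant B
|   # columns: converted, considered >10s but bounced, bounced quickly
|   observed = [[480, 2100, 17420],
|               [530, 2350, 17120]]
|
|   chi2, p, dof, expected = chi2_contingency(observed)
|   print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3f}")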
|
| For a non-technical thing you can give to your coworkers,
| consider https://medium.com/jonathans-musings/ab-
| testing-101-5576de64...
|
| Researching this comment led to this video, which looks
| interesting and which I'll need to watch later, about how you
| have to pin down the time needed to properly make the choices in
| A/B testing: https://youtu.be/Fs8mTrkNpfM?si=ghsOgDEpp43yRmd8
|
| Some more academic looking discussions of bandit algorithms that
| I can't vouch for personally, but would be my first stops:
|
| - https://courses.cs.washington.edu/courses/cse599i/21wi/resou...
| - https://tor-lattimore.com/downloads/book/book.pdf
| - http://proceedings.mlr.press/v35/kaufmann14.pdf
| graycat wrote:
| Bradley Efron, _The Jackknife, the Bootstrap, and Other
| Resampling Plans_, ISBN 0-89871-179-7, SIAM, Philadelphia,
| 1982.
| SkyPuncher wrote:
| In my experience, there's just not much depth to the math behind
| A/B testing. It all comes down to whether A or B affects metric X
| without negatively affecting Y. This is all basic analysis stuff.
|
| The harsh reality is A/B testing is only an optimization
| technique. It's not going to fix fundamental problems with your
| product or app. In nearly everything I've done, it's been a far
| better investment to focus on delivering more features and more
| value. It's much easier to build a new feature that moves the
| needle by 1% than it is to polish a turd for 0.5% improvement.
|
| That being said, there are massive exceptions to this. When
| you're at scale, fractions of percents can mean multiple millions
| of dollars of improvements.
| rgbrgb wrote:
| One of my fav resources for binomial experiment evaluation + a
| lot of explanation:
| https://thumbtack.github.io/abba/demo/abba.html
| benreesman wrote:
| When seeking to both explore better treatments and exploit
| good ones, the mathematical formalism often used is a "bandit".
|
| https://en.m.wikipedia.org/wiki/Multi-armed_bandit
| epgui wrote:
| In my experience the most helpful and generalizable resources
| have been resources on "experimental design" in biology, and
| textbooks on linear regression in the social sciences. (Why these
| fields is actually an interesting question but I don't feel like
| getting into it.)
|
| A/B tests are just a narrow special case of these.
| cpeterso wrote:
| Anyone have fun examples of A/B tests you've run where the
| results were surprising or hugely lopsided?
___________________________________________________________________
(page generated 2024-08-10 23:00 UTC)