[HN Gopher] Ask HN: Resources about math behind A/B testing
       ___________________________________________________________________
        
       Ask HN: Resources about math behind A/B testing
        
       I've been learning more about A/B testing over the last few
       months. I've read almost all of Evan Miller's work and enjoyed it
       a lot. However, I'd like a more structured approach to the topic,
       since sometimes I feel I'm missing some basics. I have a good
       math background and pretty decent stats foundations. What are
       your favourite books/papers on this topic?
        
       Author : alexmolas
       Score  : 172 points
       Date   : 2024-08-08 21:03 UTC (2 days ago)
        
       | sebg wrote:
       | Hi
       | 
       | Have you looked into these two?
       | 
       | - Trustworthy Online Controlled Experiments by Kohavi, Tang, and
       | Xu
       | 
       | - Statistical Methods in Online A/B Testing by Georgi Georgiev
       | 
       | Recommended by stats stackexchange
       | (https://stats.stackexchange.com/questions/546617/how-can-i-l...)
       | 
       | There's a bunch of other books/courses/videos on O'Reilly.
       | 
       | Another potential way to approach this learning goal is to look
       | at Evan's tools (https://www.evanmiller.org/ab-testing/) and go
       | into each one and then look at the JS code for running the tools
       | online.
       | 
       | See if you can go through and comment/write out your thoughts on
       | why it's written that way. Of course, you'll have to know some JS
       | for that, but it might be helpful to go through a file like
       | (https://www.evanmiller.org/ab-testing/sample-size.js) and figure
       | out what math is being done.
        
         | sebg wrote:
         | PS - if you are looking for more of the academic side (cutting
         | edge, much harder statistics), you can start to look at recent
         | work people are doing with A/B tests like this paper ->
         | https://arxiv.org/abs/2002.05670
        
           | sebg wrote:
           | Even more!
           | 
           | Have you seen this video -
           | https://www.nber.org/lecture/2024-methods-lecture-susan-
           | athe...
           | 
           | Might be interesting to you.
        
         | iamacyborg wrote:
         | I'll second Trustworthy Online Controlled Experiments.
         | Fantastic read, and Ron Kohavi is worth a follow on LinkedIn:
         | he's quite active there, usually sharing interesting insights
         | (or politely pointing out poor practices).
        
       | vismit2000 wrote:
       | https://everyday-data-science.tigyog.app/a-b-testing
        
       | nanis wrote:
       | Early in the A-B craze (optimal shade of blue nonsense), I was
       | talking to someone high up with an online hotel reservation
       | company who was telling me how great A-B testing had been for
       | them. I asked him how they chose stopping point/sample size. He
       | told me experiments continued until they observed a statistically
       | significant difference between the two conditions.
       | 
       | The arithmetic is simple and cheap. Understanding basic intro
       | stats principles, priceless.
        
         | regularfry wrote:
         | And yet this is the default. As commonly implemented, a/b
         | testing is an excellent way to look busy, and people will
         | _actively resist_ changing processes to make them more
         | reliable.
         | 
         | I think this is not unrelated to the fact that if you wait long
         | enough you can get a positive signal from a neutral
         | intervention, so you can _literally_ shuffle chairs on the
         | Titanic and claim success. The incentives are _against_
         | accuracy because nobody wants to be told that the feature
         | they've just had the team building for 3 months had no effect
         | whatsoever.
        
         | axegon_ wrote:
         | Many years ago I was working for a large gaming company, and
         | I was the one who developed a cheap, efficient way to split
         | any cluster of users into A/B groups. The company was
         | extremely happy with how well it worked. However, I did some
         | investigating on my own a year later to see how the business
         | development people were using it and... yeah, pretty much
         | what you said. They were literally brute-forcing different
         | configurations until they (more or less) got the desired
         | results.
        
           | kwillets wrote:
           | Microsoft has a seed finder specifically aimed at avoiding a
           | priori bias in experiment groups, but IMO the main effect is
           | pushing whales (which are possibly bots) into different
           | groups until the bias evens out.
           | 
           | I find it hard to imagine obtaining much bias from a random
           | hash seed in a large group of small-scale users, but I
           | haven't looked at the problem closely.
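            | 
            | The assignment mechanism behind this is usually a
            | deterministic hash of the user id plus a salt/seed;
            | changing the seed reshuffles who lands in which group. A
            | sketch (the seed name and 50/50 split are arbitrary):
            | 
            |   # Stable 50/50 assignment keyed on (seed, user_id).
            |   import hashlib
            | 
            |   def variant(user_id: str, seed: str = "exp-42") -> str:
            |       digest = hashlib.md5(f"{seed}:{user_id}".encode())
            |       bucket = int(digest.hexdigest(), 16) % 100
            |       return "A" if bucket < 50 else "B"
            | 
            |   print(variant("user-123"))
            | 
            | A "seed finder" along those lines would try seeds until the
            | resulting groups look balanced on pre-experiment metrics.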
        
             | ec109685 wrote:
             | We definitely saw bias, and it made experiments hard to
             | launch until the system started pre-identifying unbiased
             | population samples ahead of time, so the experiment could
             | just pull pre-vetted users.
        
         | glutamate wrote:
         | Sounds like you already know this, but that's not great and
         | will give a lot of false positives. In science this is called
         | p-level hacking. The rigorous way to use hypothesis to testing
         | is to calculate the sample size for the expected effect size
         | and only one test when this sample size is achieved. But this
         | requires knowing the effect size.
         | 
         | If you are doing a lot of significance tests you need to adjust
         | the p-level to divide by the number of implicit comparisons, so
         | e.g. only accept p<0.001 if running ine test per day.
         | 
         | Alternatively just do thompson sampling until one variant
         | dominates.
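          | 
          | For illustration, a minimal Python sketch of the sample-size
          | step (normal approximation; the baseline rate and expected
          | lift below are made-up numbers):
          | 
          |   # Per-variant sample size, two-sided two-proportion z-test.
          |   from scipy.stats import norm
          | 
          |   p1, p2 = 0.05, 0.055      # baseline vs expected rate
          |   alpha, power = 0.05, 0.8
          | 
          |   z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
          |   var = p1 * (1 - p1) + p2 * (1 - p2)
          |   n = z ** 2 * var / (p1 - p2) ** 2
          |   print(round(n))           # users needed in each variant
          | 
          | With a Bonferroni-style correction, alpha above would be
          | divided by the number of planned looks before computing z.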
        
           | paulddraper wrote:
           | To expand, p value tells you significance (more precisely the
           | likelihood of the effect if there were no underlying
           | difference). But if you observe it over and over again and
           | pay attention to one value, you've subverted the measure.
           | 
           | Thompson/multi-armed bandit optimizes for outcome over the
           | duration of the test, by progressively altering the treatment
           | %. The test runs longer, but yields better outcomes while
           | doing it.
           | 
           | It's objectively a better way to optimize, unless there is
           | time-based overhead to the existence of the A/B test itself.
           | (E.g. maintaining two code paths.)
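            | 
            | A bare-bones Beta-Bernoulli sketch of that idea in Python
            | (the "true" rates are invented purely to simulate traffic):
            | 
            |   # Thompson sampling over two variants, Beta(1, 1) priors.
            |   import numpy as np
            | 
            |   rng = np.random.default_rng(0)
            |   true_rates = [0.05, 0.06]   # unknown in practice
            |   wins = np.ones(2)           # Beta "alpha" counts
            |   losses = np.ones(2)         # Beta "beta" counts
            | 
            |   for _ in range(10_000):
            |       draws = rng.beta(wins, losses)   # one draw per arm
            |       arm = int(np.argmax(draws))      # serve best-looking
            |       reward = rng.random() < true_rates[arm]
            |       wins[arm] += reward
            |       losses[arm] += 1 - reward
            | 
            |   print(wins / (wins + losses))   # posterior mean per arm
            | 
            | Traffic drifts toward the better variant as evidence
            | accumulates, which is what "progressively altering the
            | treatment %" looks like in practice.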
        
             | youainti wrote:
             | I just wanted to affirm what you are doing here.
             | 
              | A key point here is that p-values optimize for detection
              | of effects if you do everything right, which, as you
              | point out, is not common.
             | 
             | > Thompson/multi-armed bandit optimizes for outcome over
             | the duration of the test.
             | 
             | Exactly.
        
             | kqr wrote:
                | The p value is the risk of getting an effect
                | _specifically due to sampling error_, under the
                | assumption of perfectly random sampling with no real
                | effect. It says very little.
             | 
             | In particular, if you aren't doing perfectly random
             | sampling it is meaningless. If you are concerned about
             | other types of error than sampling error it is meaningless.
             | 
             | A significant p-value is nowhere near proof of effect. All
             | it does is suggestively wiggle its eyebrows in the
             | direction of further research.
        
               | paulddraper wrote:
               | > likelihood of the effect if there were no underlying
               | difference
               | 
               | By "effect" I mean "observed effect"; i.e. how likely are
               | those results, assuming the null hypothesis.
        
         | esafak wrote:
         | Perhaps he was using a sequential test.
        
         | Someone wrote:
         | > He told me experiments continued until they observed a
         | statistically significant difference between the two
         | conditions.
         | 
         | Apparently, if you do the observing the right way, that is a
         | sound way to do that. https://en.wikipedia.org/wiki/E-values:
         | 
         |  _"We say that testing based on e-values remains safe (Type-I
         | valid) under optional continuation."_
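          | 
          | A toy Python version of the idea: the running likelihood
          | ratio against a fixed alternative is an e-process under H0,
          | and Ville's inequality makes the 1/alpha threshold safe to
          | check after every observation (the rates below are invented):
          | 
          |   # Monitor H0: p = 0.05 against the alternative p = 0.06.
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(1)
          |   p0, p1, alpha = 0.05, 0.06, 0.05
          |   e_value = 1.0
          | 
          |   for _ in range(200_000):
          |       x = rng.random() < 0.06   # simulated conversion
          |       e_value *= (p1 if x else 1 - p1) / \
          |                  (p0 if x else 1 - p0)
          |       if e_value >= 1 / alpha:  # valid at any stopping time
          |           print("stop: evidence against H0")
          |           break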
        
           | gatopingado wrote:
            | This is correct. There's been a lot of interest in e-values
            | and non-parametric confidence sequences in the recent
            | literature. It's usually referred to as anytime-valid
            | inference [1]. Evan Miller explored a similar idea in [2].
            | For some practical examples, see my Python library [3]
            | implementing multinomial and time-inhomogeneous Bernoulli /
            | Poisson process tests based on [4]. See [5] for linear
            | models / t-tests.
           | 
           | [1] https://arxiv.org/abs/2210.0194
           | 
           | [2] https://www.evanmiller.org/sequential-ab-testing.html
           | 
           | [3] https://github.com/assuncaolfi/savvi/
           | 
           | [4] https://openreview.net/forum?id=a4zg0jiuVi
           | 
           | [5] https://arxiv.org/abs/2210.08589
        
             | ryan-duve wrote:
             | Did you link the thing that you intended to for [1]? I
             | can't find anything about "anytime-valid inference" there.
        
               | gatopingado wrote:
               | Thanks for noting! This is the right link for [1]:
               | https://arxiv.org/abs/2210.01948
        
         | abhgh wrote:
          | This is a form of "interim analysis" [1].
         | 
         | [1] https://en.wikipedia.org/wiki/Interim_analysis
        
         | IshKebab wrote:
          | Surely this is more efficient if you do the statistics right?
          | I mean, I'm sure they didn't, but the intuition that you can
          | stop once there's sufficient evidence is correct.
        
           | scott_w wrote:
           | Bear in mind many people aren't doing the statistics right.
           | 
           | I'm not an expert but my understanding is that it's doable if
           | you're calculating the correct MDE based on the observed
           | sample size, though not ideal (because sometimes the observed
           | sample is too small and there's no way round that).
           | 
           | I suspect the problem comes when people don't adjust the MDE
           | properly for the smaller sample. Tools help but you've gotta
           | know about them and use them ;)
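            | 
            | Roughly, adjusting the MDE means recomputing the smallest
            | lift the sample you actually got could have detected. A
            | sketch (normal approximation, made-up numbers):
            | 
            |   # Minimum detectable absolute lift for a fixed n.
            |   from math import sqrt
            |   from scipy.stats import norm
            | 
            |   n = 20_000                # users observed per variant
            |   p = 0.05                  # baseline conversion rate
            |   alpha, power = 0.05, 0.8
            | 
            |   z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
            |   mde = z * sqrt(2 * p * (1 - p) / n)
            |   print(mde)                # ~0.006, i.e. about 0.6 points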
           | 
           | Personally I'd prefer to avoid this and be a bit more strict
           | due to something a PM once said: "If you torture the data
           | long enough, it'll show you what you want to see."
        
         | sethd1211 wrote:
          | Which company was this? Was it by chance SnapTravel?
        
       | simulo wrote:
       | For learning about the basics of statistics, my go-to resource is
       | "Discovering Statistics using [R/SPSS]" (Andy Field). "Improving
       | Your Statistical Inferences" (Daniel Lakens) needs some basics,
       | but covers a lot of intesting topics, including sequencial
       | testing and equivalence tests (sometimes you want to know if a
       | new thing is equivalent to the old)
        
       | daxaxelrod wrote:
       | Growthbook wrote a short paper on how they evaluate test results
       | continuously.
       | 
       | https://docs.growthbook.io/GrowthBookStatsEngine.pdf
        
       | phyalow wrote:
        | Experimentation for Engineers: From A/B Testing to Bayesian
        | Optimization, by David Sweet.
        | 
        | This book is really great and I highly recommend it. It goes
        | broader than A/B, but covers everything quite well from a
        | first-principles perspective.
       | 
       | https://www.manning.com/books/experimentation-for-engineers
        
       | youainti wrote:
       | Just as some basic context, there are two related approaches to
       | A/B testing. The first comes from statistics, and is going to
       | look like standard hypothesis testing of differences of means or
       | medians. The second is from Machine Learning and is going to
       | discuss multi-armed bandit problems. They are both good and have
       | different tradeoffs. I just wanted you to know that there are two
       | different approaches that are both valid.
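        | 
        | The first flavour usually boils down to something like a two-
        | proportion z-test (a bare sketch with made-up counts):
        | 
        |   # Two-sided z-test for a difference in conversion rates.
        |   from math import sqrt
        |   from scipy.stats import norm
        | 
        |   conv_a, n_a = 500, 10_000   # conversions / users, variant A
        |   conv_b, n_b = 565, 10_000   # conversions / users, variant B
        | 
        |   p_a, p_b = conv_a / n_a, conv_b / n_b
        |   pool = (conv_a + conv_b) / (n_a + n_b)
        |   se = sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
        |   z = (p_b - p_a) / se
        |   print(z, 2 * norm.sf(abs(z)))   # z statistic, p-value
        | 
        | The bandit flavour instead trades some of that statistical
        | machinery for serving the better-looking variant more often
        | while the test is still running.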
        
       | rancar2 wrote:
       | I once wanted a structured approach before I had access to large
       | amounts of traffic. Once I had traffic available, the learning
       | naturally happened (background in engineering with advanced
       | math). If you are lucky enough to start learning through hands on
       | experience, I'd check out: https://goodui.org/
       | 
       | I was lucky to get trained well by 100m+ users over the years. If
       | you have a problem you are trying to solve, I'm happy to go over
       | my approach to designing optimization winners repeatedly.
       | 
       | Alex, I will shoot you an email shortly. Also, sebg's comment is
        | good if you are looking for the more academic route to
        | learning.
        
       | gjstein wrote:
       | I'd also like to mention the classic book "Reinforcement
       | Learning" by Sutton & Barto, which goes into some relevant
       | mathematical aspects for choosing the "best" among a set of
       | options. They have a full link of the PDF for free on their
       | website [1]. Chapter 2 on "Multi-Armed Bandits" is where to
       | start.
       | 
       | [1] http://incompleteideas.net/book/the-book-2nd.html
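        | 
        | Chapter 2's simplest method, epsilon-greedy, fits in a few
        | lines of Python (a sketch; the reward rates are invented):
        | 
        |   # Epsilon-greedy selection over k arms (variants).
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(2)
        |   true_rates = [0.04, 0.05, 0.06]   # hypothetical
        |   k, eps = len(true_rates), 0.1
        |   counts = np.zeros(k)
        |   values = np.zeros(k)              # running mean reward
        | 
        |   for _ in range(10_000):
        |       if rng.random() < eps:
        |           arm = int(rng.integers(k))     # explore
        |       else:
        |           arm = int(np.argmax(values))   # exploit
        |       reward = float(rng.random() < true_rates[arm])
        |       counts[arm] += 1
        |       values[arm] += (reward - values[arm]) / counts[arm]
        | 
        |   print(values)   # estimated reward per arm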
        
       | austin-cheney wrote:
        | When I used to do A/B testing, all results per traffic funnel
        | were averaged over time into cumulative results. The tests
        | would run as long as they needed to attain statistical
        | confidence between the funnels, where confidence was the ratio
        | of differentiation between results over time after discounting
        | for noise and variance.
       | 
       | Only at test completion were financial projections attributed to
       | test results. Don't sugar coat it. Let people know up front just
       | how damaging their wonderful business ideas are.
       | 
        | The biggest learning from this is that the financial
        | projections from the tests were always far too optimistic
        | compared to the eventual development in production. The tests
        | were always correct. The cause of the discrepancies was shitty
        | development. If a new initiative in production is defective or
        | slow, it will not perform as well as the tests projected. Web
        | development is full of shitty developers who cannot program
        | for the web, and our tests were generally ideal in their
        | execution.
        
       | AlexeyMK wrote:
       | If you'd rather go through some of this live, we have a section
       | on Stats for Growth Engineers in the Growth Engineering Course on
       | Reforge (course.alexeymk.com). We talk through stat sig, power
       | analysis, common experimentation footguns and alternate
       | methodologies such as Bayesian, Sequential, and Bandits (which
       | are typically Bayesian). Running next in October.
       | 
       | Other than that, Evan's stuff is great, and the Ron Kohavi book
       | gets a +1, though it is definitely dense.
        
         | aidos wrote:
          | Do you have another link? That one's not working for me.
        
       | nivertech wrote:
       | A/B Testing
       | 
       | An interactive look at Thompson sampling
       | 
       | https://everyday-data-science.tigyog.app/a-b-testing
        
       | vishnuvram wrote:
        | I really liked Richard McElreath's Statistical Rethinking
       | https://youtube.com/playlist?list=PLDcUM9US4XdPz-KxHM4XHt7uU...
        
       | Maro wrote:
       | My blog has tons of articles about A/B testing, with math and
       | Python code to illustrate. Good starting point:
       | 
       | https://bytepawn.com/five-ways-to-reduce-variance-in-ab-test...
        
       | tgtweak wrote:
        | No specific literature to recommend, but understanding sample
        | size and margin of error / confidence interval calculations
        | will really help you understand A/B testing. Beyond A/B, this
        | will help with multivariate testing as well, which has mostly
        | replaced A/B in orgs that are serious about testing.
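        | 
        | For instance, the margin of error on a measured lift is just
        | the half-width of a normal-approximation confidence interval
        | (a sketch with made-up counts):
        | 
        |   # 95% CI for the difference in conversion rates.
        |   from math import sqrt
        |   from scipy.stats import norm
        | 
        |   conv_a, n_a = 480, 10_000   # hypothetical counts
        |   conv_b, n_b = 540, 10_000
        | 
        |   p_a, p_b = conv_a / n_a, conv_b / n_b
        |   se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
        |   margin = norm.ppf(0.975) * se
        |   print(p_b - p_a, "+/-", margin)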
        
       | kqr wrote:
       | I don't think the mathematics is what gets most people into
       | trouble. You can get by with relatively primitive maths, and the
       | advanced stuff is really just a small-order-of-magnitude cost
       | optimisation.
       | 
       | What gets people are incorrect procedures. To get a sense of all
       | the ways in which an experiment can go wrong, I'd recommend
       | reading more traditional texts on experimental design, survey
       | research, etc.
       | 
       | - Donald Wheeler's _Understanding Variation_ should be mandatory
       | reading for almost everyone working professionally.
       | 
       | - Deming's _Some Theory of Sampling_ is really good and covers
       | more ground than the title lets on.
       | 
        | - Deming's _Sample Design in Business Research_ I remember
        | being formative for me as well, although it's been a while
        | since I read it.
       | 
       | - Efron and Tibshirani's _Introduction to the Bootstrap_ gives an
       | intuitive sense of some experimental errors from a different
       | perspective.
       | 
       | I know there's one book covering survey design I really liked but
       | I forget which one it was. Sorry!
        
         | psawaya wrote:
         | I'm also looking for a good resource on survey design. If you
         | remember the book, please let us know! :)
        
           | kqr wrote:
           | I know the Deming books are written to a large extent from
           | the perspective of surveys, but they are mainly technical.
           | 
            | I have also read Robinson's _Designing Quality Survey
            | Questions_ which I remember as good, but perhaps not as
            | deep as I had hoped. I don't think that's the one I'm
            | thinking of, unfortunately.
           | 
           | It's highly possible I'm confabulating a book from a variety
           | of sources also...
        
       | RyJones wrote:
       | I worked for Ron Kohavi - he has a couple books. "Practical Guide
       | to Controlled Experiments on the Web: Listen to Your Customers
       | not to the HiPPO", and "Trustworthy Online Controlled
       | Experiments: A Practical Guide to A/B Testing". I haven't read
       | the second, but the first is easy to find and peruse.
        
       | crdrost wrote:
       | So the thing I always ctrl-F for, to see if a paper or course
       | really knows what it's talking about, is called the "multi-armed
       | bandit" problem. Just ctrl-F bandit, if an A/B tutorial is long
       | enough it will usually mention them.
       | 
       | This is not a foolproof method, I'd call it only +-5 dB of
       | evidence, so it would shift a 50% likely that they know what
       | they're talking about to like 75% if present or 25% if absent,
       | but obviously look at the rest of it and see if that's borne out.
       | And to be clear: Even mentioning it if it's just to dismiss it,
       | counts!
       | 
       | So e.g. I remember reading a whitepaper about "A/B Tests are
       | Leading You Astray" and thinking "hey that's a fun idea, yeah,
       | effect size is too often accidentally conditioned on whether the
        | result was judged statistically significant, which would be a
       | source of bias" ...and sure enough a sentence came up, just
       | innocently, like, "you might even have a bandit algorithm! But
       | you had to use your judgment to discern that that was appropriate
       | in context." And it's like "OK, you know about bandits but you
       | are explicitly interested in human discernment and human decision
       | making, great." So, +5 dB to you.
       | 
       | And on the flip-side if it makes reference to A/B testing but
       | it's decently long and never mentions bandits then there's only
       | maybe a 25% chance they know what they are talking about. It can
        | still happen: you might see e.g. a chi-squared test instead of
        | the t-test [because usually you don't have just "converted" vs
        | "did not convert"... can your analytics grab "thought about it
        | for more than 10s but did not convert" etc.?] or something else
        | that piques interest. Or it's a very short article where it
        | just didn't come up, but that's fine because we are, when
        | reading, performing a secret cost-benefit analysis and short
        | articles have very low cost.
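        | 
        | On the chi-squared point: once the outcome has more than two
        | levels, a contingency-table test is the natural fit. A sketch
        | with invented funnel counts (assumes scipy):
        | 
        |   # Variant x outcome table: converted / hesitated / bounced.
        |   from scipy.stats import chi2_contingency
        | 
        |   table = [
        |       [500, 1200, 8300],   # variant A
        |       [565, 1150, 8285],   # variant B
        |   ]
        |   chi2, p_value, dof, expected = chi2_contingency(table)
        |   print(chi2, p_value)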
       | 
       | For a non-technical thing you can give to your coworkers,
       | consider https://medium.com/jonathans-musings/ab-
       | testing-101-5576de64...
       | 
        | Researching this comment led me to this video, which looks
        | interesting and which I'll need to watch later, about how you
        | have to pin down the time needed to properly make the choices
        | in A/B testing: https://youtu.be/Fs8mTrkNpfM?si=ghsOgDEpp43yRmd8
       | 
        | Some more academic-looking discussions of bandit algorithms
        | that I can't vouch for personally, but which would be my first
        | stops:
        | 
        | - https://courses.cs.washington.edu/courses/cse599i/21wi/resou...
        | - https://tor-lattimore.com/downloads/book/book.pdf
        | - http://proceedings.mlr.press/v35/kaufmann14.pdf
        
       | graycat wrote:
        | Bradley Efron, _The Jackknife, the Bootstrap, and Other
        | Resampling Plans_, ISBN 0-89871-179-7, SIAM, Philadelphia,
        | 1982.
        
       | SkyPuncher wrote:
       | In my experience, there's just not much depth to the math behind
        | A/B testing. It all comes down to whether A or B affects
        | parameter X without negatively affecting Y. This is all basic
        | analysis stuff.
       | 
       | The harsh reality is A/B testing is only an optimization
       | technique. It's not going to fix fundamental problems with your
       | product or app. In nearly everything I've done, it's been a far
       | better investment to focus on delivering more features and more
       | value. It's much easier to build a new feature that moves the
       | needle by 1% than it is to polish a turd for 0.5% improvement.
       | 
       | That being said, there are massive exceptions to this. When
       | you're at scale, fractions of percents can mean multiple millions
       | of dollars of improvements.
        
       | rgbrgb wrote:
       | One of my fav resources for binomial experiment evaluation + a
       | lot of explanation:
       | https://thumbtack.github.io/abba/demo/abba.html
        
       | benreesman wrote:
        | When seeking to both explore better treatments and exploit good
        | ones, the mathematical formalism often used is the "bandit".
       | 
       | https://en.m.wikipedia.org/wiki/Multi-armed_bandit
        
       | epgui wrote:
       | In my experience the most helpful and generalizable resources
       | have been resources on "experimental design" in biology, and
       | textbooks on linear regression in the social sciences. (Why these
       | fields is actually an interesting question but I don't feel like
       | getting into it.)
       | 
       | A/B tests are just a narrow special case of these.
        
       | cpeterso wrote:
       | Anyone have fun examples of A/B tests you've run where the
       | results were surprising or hugely lopsided?
        
       ___________________________________________________________________
       (page generated 2024-08-10 23:00 UTC)