[HN Gopher] Lines of code that will beat A/B testing every time ...
       ___________________________________________________________________
        
       Lines of code that will beat A/B testing every time (2012)
        
       Author : Kerrick
       Score  : 140 points
       Date   : 2025-01-09 23:34 UTC (3 days ago)
        
 (HTM) web link (stevehanov.ca)
 (TXT) w3m dump (stevehanov.ca)
        
       | asdasdsddd wrote:
        | Multi-armed bandits are fine, but they're limited to tests where
        | it's OK to switch users between arms frequently and to tests
        | that have more power.
        
         | tantalor wrote:
          | > where it's OK to switch users between arms frequently
         | 
         | It's not hard to keep track of which arm any given user was
         | exposed to in the first run, and then repeat it.
        
           | asdasdsddd wrote:
           | There are often product limitations
        
       | 85392_school wrote:
       | Previously discussed:
       | 
       | https://news.ycombinator.com/item?id=11437114
       | 
       | https://news.ycombinator.com/item?id=4040022
        
       | nottorp wrote:
       | "People distrust things that they do not understand, and they
       | especially distrust machine learning algorithms, even if they are
       | simple."
       | 
       | How times have changed :)
        
         | saintfire wrote:
         | Just had to anthropomorphize machine learning.
        
       | isoprophlex wrote:
       | As one of the comments below the article states, the
        | probabilistic alternative to epsilon-greedy is worth exploring as
        | well. Take the "Bayesian bandit", which is not much more complex
       | but a lot more powerful.
       | 
       | If you crave more bandits:
       | https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...
        
         | hruk wrote:
         | We've been happy using Thompson sampling in production with
         | this library https://github.com/bayesianbandits/bayesianbandits
        
         | timr wrote:
          | Just a warning to those considering implementing it: it doesn't
          | really matter. The blog author addresses this obliquely (says
          | that the simplest thing is best most of the time), but doesn't
          | make it explicit.
          | 
          | In my experience, obsessing over the best decision strategy is
          | the biggest honeypot for engineers implementing MAB. Epsilon-
          | greedy is _very easy to implement_ and you probably don't need
          | anything more. Thompson sampling is a pain in the butt, for not
          | much gain.
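          | 
          | To illustrate how little code that is, a minimal epsilon-greedy
          | sketch (assuming a binary conversion reward; the arm names are
          | made up):
          | 
          |     import random
          | 
          |     counts = {"orange": 1, "green": 1}   # start at 1 to avoid /0
          |     rewards = {"orange": 0, "green": 0}  # conversions per arm
          | 
          |     def choose_arm(epsilon=0.1):
          |         # Explore 10% of the time, otherwise exploit the arm
          |         # with the best observed conversion rate so far.
          |         if random.random() < epsilon:
          |             return random.choice(list(counts))
          |         return max(counts, key=lambda a: rewards[a] / counts[a])
          | 
          |     def record(arm, converted):
          |         counts[arm] += 1
          |         rewards[arm] += 1 if converted else 0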
        
           | blagie wrote:
           | "Easy to implement" is a good reason to use bubble sort too.
           | 
           | In a normal universe, you just import a different library, so
           | both are the same amount of work to implement.
           | 
           | Multiarmed bandit seems theoretically pretty, but it's rarely
           | worth it. The complexity isn't the numerical algorithm but
           | state management.
           | 
            | * Most AB tests can be as simple as a client-side random()
            | and a log file (minimal sketch at the end of this comment).
           | 
           | * Multiarmed bandit means you need an immediate feedback
           | loop, which involves things like adding database columns,
           | worrying about performance (since each render requires
           | another database read), etc. Keep in mind the database needs
           | to now store AB test outcomes and use those for decision-
           | making, and computing those is sometimes nontrivial (if it's
           | anything beyond a click-through).
           | 
           | * Long-term outcomes matter more than short-term. "Did we
           | retain a customer" is more important than "did we close one
           | sale."
           | 
            | In most systems, the benefits aren't worth the complexity.
            | Multiple AB tests also add testing complexity. You want to
            | test three layouts? And three user flows? Now you have nine
            | cases that need to be tested. Add two color schemes? 18
            | cases. Add 3 font options? 54 cases. The exponential growth
            | in testing is not fun. Fire-and-forget seems great, but in
            | practice it's fire-and-maintain-exponential-complexity.
           | 
           | And those conversion differences are usually small enough
           | that being on the wrong side of a single AB test isn't
           | expensive.
           | 
           | Run the test. Analyze the data. Pick the outcome. Kill the
           | other code path. Perhaps re-analyze the data a year later
           | with different, longer-term metrics. Repeat. That's the right
           | level of complexity most of the time.
           | 
           | If you step up to multiarm, importing a different library
           | ain't bad.
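            | 
            | To make the "client-side random() and a log file" bullet
            | concrete, a minimal sketch (file name and 50/50 split are
            | illustrative):
            | 
            |     import json, random, time
            | 
            |     def assign_and_log(user_id, logfile="ab_log.jsonl"):
            |         # Flip a coin, append the assignment to a log file,
            |         # and analyze the outcomes offline later.
            |         variant = "A" if random.random() < 0.5 else "B"
            |         with open(logfile, "a") as f:
            |             f.write(json.dumps({"ts": time.time(),
            |                                 "user": user_id,
            |                                 "variant": variant}) + "\n")
            |         return variant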
        
         | krackers wrote:
          | There's a good derivation of the EXP3 algorithm from standard
          | multiplicative weights which is fairly intuitive. The
          | transformation between the two is explained a bit in
          | https://nerva.cs.uni-
          | bonn.de/lib/exe/fetch.php/teaching/ws18.... Once you have the
          | intuition, the actual choice of parameters is just cranking
          | out the math.
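          | 
          | For reference, a compact EXP3 sketch (the gamma value and names
          | are illustrative; rewards are assumed to lie in [0, 1]):
          | 
          |     import math, random
          | 
          |     def exp3(num_arms, pull, rounds, gamma=0.1):
          |         # Multiplicative weights with importance-weighted
          |         # reward estimates for the adversarial bandit setting.
          |         weights = [1.0] * num_arms
          |         for _ in range(rounds):
          |             total = sum(weights)
          |             probs = [(1 - gamma) * w / total + gamma / num_arms
          |                      for w in weights]
          |             arm = random.choices(range(num_arms),
          |                                  weights=probs)[0]
          |             reward = pull(arm)              # in [0, 1]
          |             estimate = reward / probs[arm]  # importance weight
          |             weights[arm] *= math.exp(gamma * estimate
          |                                      / num_arms)
          |         return weights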
        
       | zahlman wrote:
       | Nothing shows on this page without JavaScript except for the
       | header and a grey background. A bit strange for a blog.
        
         | forgetfreeman wrote:
         | Utterly unremarkable for 2012 though.
        
         | lelandfe wrote:
         | Pre-CSS grid masonry layout. Author hides the content with CSS,
         | and JS reveals it, to avoid a flash.
         | 
         | CSS to make it noscript friendly: `.main { visibility: visible
         | !important; max-width: 710px; }`
        
       | tracerbulletx wrote:
        | A lot of sites don't have enough traffic to get statistical
        | significance with this in a reasonable amount of time, and
        | they're almost always testing a feature more complicated than
        | button color, where you aren't going to have more than the
        | control and one variant.
        
         | douglee650 wrote:
          | Yes, wondering what the confidence intervals are.
        
         | wiml wrote:
         | If the effect size x site traffic is so small it's
         | statistically insignificant, why are you doing all this work in
         | the first place? Just choose the option that makes the PHB
         | happy and move on.
         | 
          | (But it's more likely that you _don't know_ if there's a
          | significant effect size.)
        
           | koliber wrote:
            | The PHB wanted A/B testing! True story. I spent two months
           | convincing them that it made no sense with the volume of
           | conversion events we had.
        
         | kridsdale1 wrote:
         | I've only implemented A/B/C tests at Facebook and Google, with
         | hundreds of millions of DAU on the surfaces in question, and
         | three groups is still often enough to dilute the measurement in
         | question below stat-sig.
        
       | usgroup wrote:
        | If your aim is to estimate the effect size of your treatment
       | because you want to know whether it's significant, you can't do
       | what the article advises.
        
       | HeliumHydride wrote:
       | The "20" is missing from the title.
        
         | jerrygenser wrote:
          | I believe Hacker News automatically strips the number at the
          | beginning of titles.
        
       | crazygringo wrote:
       | No, multi-armed bandit doesn't "beat" A/B testing, nor does it
       | beat it "every time".
       | 
       | Statistical significance is statistical significance, end of
       | story. If you want to show that option B is better than A, then
       | you need to test B enough times.
       | 
       | It doesn't matter if you test it half the time (in the simplest
       | A/B) or 10% of the time (as suggested in the article). If you do
       | it 10% of the time, it's just going to take you five times
       | longer.
       | 
       | And A/B testing can handle multiple options just fine, contrary
       | to the post. The name "A/B" suggests two, but you're free to use
       | more, and this is extremely common. It's still called "A/B
       | testing".
       | 
       | Generally speaking, you want to find the best option and then
        | _remove the other ones_ because they're suboptimal and code
        | cruft. The author suggests _always_ keeping 10% exploring other
        | options. But if you already know they're worse, that's just
       | making your product worse for those 10% of users.
        
         | LPisGood wrote:
          | Multi-armed bandit does beat A/B testing in the sense that
          | standard A/B testing does not seek to maximize reward during
          | the testing period, while MAB does. MAB also generalizes to
          | testing many things better than A/B testing does.
        
           | crazygringo wrote:
           | No -- you can't have your cake and eat it too.
           | 
            | You get _zero_ benefits from MAB over A/B if you simply end
            | your A/B test once you've achieved statistical significance
            | and pick the best option. Which is what any efficient A/B
            | test does -- there's no reason to have any fixed "testing
            | period" beyond what is needed to achieve statistical
            | significance.
            | 
            | On the contrary, the MAB described in the article does
            | _not_ maximize reward -- as I explained in my previous
            | comment. Because the post's version runs indefinitely, it
            | has _worse_ long-term reward: it continues to test
            | inferior options long after they've been proven worse. If
            | you leave it running, you're harming yourself.
           | 
           | And I have no idea what you mean by MAB "generalizing" more.
           | But it doesn't matter if it's worse to begin with.
           | 
           | (Also, it's a huge red flag that the post doesn't even
           | _mention_ statistical significance.)
        
             | LPisGood wrote:
             | > you can't have your cake and eat it too
             | 
             | I disagree. There is a vast array of literature on solving
             | the MAB problem that may as well be grouped into a bin
             | called "how to optimally strike a balance between having
             | one's cake and eating it too."
             | 
              | The optimization techniques for the MAB problem seek to
             | optimize reward by giving the right balance of exploration
             | and exploitation. In other words, these techniques attempt
             | to determine the optimal way to strike a balance between
             | exploring if another option is better and exploiting the
             | option currently predicted to be best.
             | 
             | There is a strong reason this literature doesn't start and
             | end with: "just do A/B testing, there is no better
             | approach"
        
               | crazygringo wrote:
               | I'm not talking about the literature -- I'm talking about
               | the extremely simplistic and sub-optimal procedure
               | described in the post.
               | 
               | If you want to get sophisticated, MAB properly done is
               | essentially just A/B testing with optimal strategies for
               | deciding when to end individual A/B tests, or balancing
               | tests optimally for a limited number of trials. But
                | again, it doesn't "beat" A/B testing -- it _is_ A/B
               | testing in that sense.
               | 
               | And that's what I mean. You can't magically increase your
               | reward while simultaneously getting statistically
               | significant results. Either your results are significant
               | to a desired level or not, and there's no getting around
               | the number of samples you need to achieve that.
        
               | LPisGood wrote:
               | I am talking about the literature which solves MAB in a
               | variety of ways, including the one in the post.
               | 
               | > MAB properly done is essentially just A/B testing
               | 
               | Words are only useful insofar as their meanings invoke
               | ideas, and in my experience absolutely no one thinks of
               | other MAB strategies when someone talks about A/B
               | testing.
               | 
                | Sure, you can classify A/B testing as one extremely
                | suboptimal approach to solving the MAB problem. This
                | classification doesn't help much though, because the
                | other MAB techniques do "magically increase the rewards"
                | compared to this simple technique.
        
               | cauch wrote:
                | Another way of seeing the situation: let your MAB
                | solution run for a while. Orange has been tested 17
                | times and blue has been tested 12 times. This is exactly
                | equivalent to an A/B test where you show the orange
                | button once each to 17 people and the blue button once
                | each to 12 people.
                | 
                | The trick is to find the best number of tests for each
                | color so that you get good statistical significance. MAB
                | does not do that well: you cannot easily force testing
                | an option that looked bad before it got enough trials
                | for statistical significance (imagine you have 10 colors
                | and orange first scores 0/1. It will take a very long
                | time before this color is re-tested meaningfully: you
                | first need to fall into the 10%, and then you still have
                | only a ~10% chance of randomly picking this color rather
                | than one of the others). With A/B testing, you can do a
                | power analysis beforehand (or at any point during the
                | experiment) to know when to stop.
                | 
                | The literature does not start with "just do A/B testing"
                | because it is not the same problem. In MAB, your goal is
                | not to demonstrate that one option is bad; it's to make
                | your own decisions when faced with a fixed situation.
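                | 
                | To make the power-analysis point concrete, a rough sketch
                | for a two-proportion test (normal approximation; the
                | numbers and defaults below are only illustrative):
                | 
                |     import math
                |     from statistics import NormalDist
                | 
                |     def n_per_arm(p_a, p_b, alpha=0.05, power=0.8):
                |         z_a = NormalDist().inv_cdf(1 - alpha / 2)
                |         z_b = NormalDist().inv_cdf(power)
                |         var = p_a * (1 - p_a) + p_b * (1 - p_b)
                |         return math.ceil((z_a + z_b) ** 2 * var
                |                          / (p_a - p_b) ** 2)
                | 
                |     # Detecting a lift from 2% to 2.5% conversion needs
                |     # roughly 14,000 users per arm at these settings.
                |     print(n_per_arm(0.02, 0.025))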
        
               | LPisGood wrote:
                | > The trick is to find the best number of tests for each
                | > color so that you get good statistical significance
                | 
                | Yes, A/B testing will force through enough trials to get
                | statistical significance (it is definitely an
                | "exploration first" strategy), but in many cases you
                | care about maximizing reward as well, in particular
                | during testing. A/B testing does very poorly at
                | balancing exploration with exploitation in general.
                | 
                | This is especially true if the situation is dynamic.
                | Will you A/B test forever in case something has changed,
                | and accept that long-term loss in reward?
        
               | cauch wrote:
                | But the proposed MAB system does not even offer a method
                | to know when the system should be stopped (and all the
                | choices except the best one removed).
                | 
                | With A/B testing, you can do a power analysis whenever
                | you want, including in the middle of the experiment. It
                | will just be an iterative adjustment that converges.
                | 
                | In fact, you can even run through all possibilities in
                | advance (if A gets 1% and B gets 1%, how many A and B do
                | I need? If A gets 2% and B gets 1%? If A gets 3% and B
                | gets 1%? ...) and that gives you the exact stopping
                | boundaries for any configuration before even running the
                | experiment. You just stop trialing option A as soon as
                | it crosses the significance threshold you already
                | decided on.
                | 
                | So, no, A/B testing will never run forever. And A/B
                | testing will always be better than the MAB solution,
                | because you have a better way to stop trying a bad
                | option as soon as it crosses the threshold you decided
                | is enough to consider it a bad option.
        
           | cle wrote:
           | This is a double-edged sword. There are often cases in real-
           | world systems where the "reward" the MAB maximizes is biased
           | by eligibility issues, system caching, bugs, etc. If this
           | happens, your MAB has the potential to converge on the worst
           | possible experience for your users, something a static
           | treatment allocation won't do.
        
             | LPisGood wrote:
             | I haven't seen these particular shortcomings before, but I
             | certainly agree that if your data is bad, this ML approach
             | will also be bad.
             | 
             | Can you share some more details about your experiences with
             | those particular types of failures?
        
               | cle wrote:
                | Sure! A really simple (and common) example would be a
                | setup w/ treatment A and treatment B, where your code
                | does "if session_assignment == A ... else ... B". In the
                | else branch you do something that for whatever reason
                | causes misbehavior (perhaps it sometimes crashes or
                | throws an exception, or uses a buffer that drops records
                | under high load to protect availability). That's
                | surprisingly common. Or perhaps you were hashing on the
                | wrong key to generate session assignments -- e.g. you
                | accidentally used an ID that expires after 24 hours of
                | inactivity... now only highly active people get
                | correctly sampled.
               | 
               | Another common one I saw was due to different systems
               | handling different treatments, and there being caching
               | discrepancies between the two, like esp in a MAB where
               | allocations are constantly changing, if one system has a
               | much longer TTL than the other then you might see
               | allocation lags for one treatment and not the other,
               | biasing the data. Or perhaps one system deploys much more
               | frequently and the load balancer draining doesn't wait
               | for records to finish uploading before it kills the
               | process.
               | 
               | The most subtle ones were eligibility biases, where one
               | treatment might cause users to drop out of an experiment
               | entirely. Like if you have a signup form and you want to
               | measure long-term retention, and one treatment causes
               | some cohorts to not complete the signup entirely.
               | 
               | There are definitely mitigations for these issues, like
               | you can monitor the expected vs. actual allocations and
               | alert if they go out-of-whack. That has its own set of
               | problems and statistics though.
        
       | jbentley1 wrote:
       | Multi-armed bandits make a big assumption that effectiveness is
       | static over time. What can happen is that if they tip traffic
       | slightly towards option B at a time when effectiveness is higher
        | (maybe a sale just started), B will start to overwhelmingly look
       | like a winner and get locked in that state.
       | 
       | You can solve this with propensity scores, but it is more
       | complicated to implement and you need to log every interaction.
        
         | LPisGood wrote:
         | This objection is mentioned specifically in the post.
         | 
         | You can add a forgetting factor for older results.
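          | 
          | A sketch of one way to do that (gamma is an assumed tuning
          | knob; per-arm stats hold a decayed count and decayed reward
          | sum):
          | 
          |     def update(stats, arm, reward, gamma=0.99):
          |         # Decay every arm's history a little each step, so
          |         # recent results count more than old ones.
          |         for a in stats:
          |             stats[a]["count"] *= gamma
          |             stats[a]["value"] *= gamma
          |         stats[arm]["count"] += 1
          |         stats[arm]["value"] += reward
          | 
          |     def rate(stats, arm):
          |         s = stats[arm]
          |         return s["value"] / s["count"] if s["count"] else 0.0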
        
           | randomcatuser wrote:
            | This seems like a fudge factor though. Some things are
            | changed because you act on them! (e.g. recommendation systems
            | that are biased towards more popular content). So having
            | dynamic groups makes the data harder to analyze.
        
             | LPisGood wrote:
              | A standard formulation of the MAB problem assumes that
              | acting will impact the rewards, and this forgetting-factor
             | approach is one which allows for that and still attempts to
             | find the currently most exploitable lever.
        
       | __MatrixMan__ wrote:
       | After this:
       | 
       | > hundreds of the brightest minds of modern civilization have
       | been hard at work not curing cancer. Instead, they have been
       | refining techniques for getting you and me to click on banner ads
       | 
       | I was really hoping this would slowly develop into a statistical
       | technique couched in terms of ad optimization but actually
       | settling in on something you might call ATCG testing (e.g. the
       | biostatistics methods that one would indeed use to cure cancer).
        
       | JensRantil wrote:
       | Yep. I implemented this as a Java library a while ago (and other
       | stuff): https://github.com/JensRantil/java-canary-tools
       | Basically, I think feature flags and A/B tests aren't always
       | needed to roll out experiments.
        
       | royal-fig wrote:
        | If multi-armed bandits have piqued your curiosity, we recently
        | added support for them to our feature flagging and
        | experimentation platform, GrowthBook.
       | 
       | We talk about it here: https://blog.growthbook.io/introducing-
       | multi-armed-bandits-i...
        
       | munro wrote:
        | Here's an interesting write-up on various algorithms & different
        | epsilon-greedy % values.
        | 
        | https://github.com/raffg/multi_armed_bandit
        | 
        | It shows 10% exploration performing the best -- a great, simple
        | algorithm.
        | 
        | It also shows that Thompson sampling converges a bit faster --
        | the best arm is chosen by sampling from the Beta distribution,
        | which eliminates the explore phase. And you can use the built-in
        | random.betavariate!
       | 
       | https://github.com/raffg/multi_armed_bandit/blob/42b7377541c...
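        | 
        | A minimal Thompson sampling sketch using that builtin (the
        | counts below are made up; each arm keeps wins and losses):
        | 
        |     import random
        | 
        |     arms = {"orange": {"wins": 10, "losses": 90},
        |             "green":  {"wins": 15, "losses": 85}}
        | 
        |     def choose_arm():
        |         # Draw once from each arm's Beta posterior and play the
        |         # arm with the highest draw -- no explicit explore phase.
        |         draws = {name: random.betavariate(s["wins"] + 1,
        |                                           s["losses"] + 1)
        |                  for name, s in arms.items()}
        |         return max(draws, key=draws.get)
        | 
        |     def record(name, converted):
        |         arms[name]["wins" if converted else "losses"] += 1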
        
       | kazinator wrote:
       | > _Statistics are hard for most people to understand._
       | 
       | True, but that's exactly what statistics helps with, though also
       | hard to understand. :)
        
       | iforgot22 wrote:
       | I don't like how this dismisses the old approach as "statistics
       | are hard for most people to understand." This algo beats A/B
       | testing in terms of maximizing how many visitors get the best
       | feature. But is that really a big enough concern IRL that people
       | are interested in optimizing it every time? Every little dynamic
       | lever adds complexity to a system.
        
         | randomcatuser wrote:
          | Yeah basically. The idea is that somehow this is the data-
          | optimal way of determining which one is the best (rather than
          | splitting your data 50/50 and wasting a lot of samples when you
          | already know).
          | 
          | The caveats (perhaps not mentioned in the article) are:
          | 
          | - Perhaps you have many metrics you need to track/analyze
          | (CTR, conversion, rates on different metrics), so you can't
          | strictly do bandit!
          | 
          | - As someone mentioned below, sometimes the situation is
          | dynamic (so having evenly sized groups helps with capturing
          | this effect)
          | 
          | - Maybe some other ones I can't think of?
         | 
         | But you can imagine this kind of auto-testing being useful...
         | imagine AI continually pushes new variants, and it just
         | continually learns which one is the best
        
           | iforgot22 wrote:
           | Facebook or YouTube might already be using an algo like this
            | or AI to push variants, but for each billion-user product,
           | there are probably thousands of smaller products that don't
           | need something this automated.
        
           | cle wrote:
           | It still misses the biggest challenge though--defining
           | "best", and ensuring you're actually measuring it and not
           | something else.
           | 
           | It's useful as long as your definition is good enough and
           | your measurements and randomizations aren't biased. Are you
           | monitoring this over time to ensure that it continues to
           | hold? If you don't, you risk your MAB converging on something
           | very different from what you would consider "the best".
           | 
           | When it converges on the right thing, it's better. When it
           | converges on the wrong thing, it's worse. Which will it do?
           | What's the magnitude of the upside vs downside?
        
         | rerdavies wrote:
         | I think you missed the point. It's not about which visitors get
         | the best feature. It's about how to get people to PUSH THE
         | BUTTON!!!!! Which is kind of the opposite of the best feature.
         | The goal is to make people do something they don't want to do.
         | 
         | Figuring out best features is a completely different problem.
        
           | iforgot22 wrote:
           | I didn't say it was the best for the user. Really the article
           | misses this by comparing a new UI feature to a life-saving
           | drug, but it doesn't matter. The point is, whatever metric
           | you're targeting, do you use this algo or fixed group sizes?
        
       | awkward wrote:
        | Pure, disinterested A/B testing -- where the goal is just to
        | find the best way to do something, and there's enough leverage
        | and traffic to make funding the test worthwhile -- is rare.
       | 
       | More frequently, A/B testing is a political technology that
       | allows teams to move forward with changes to core, vital services
       | of a site or app. By putting a new change behind an A/B test, the
       | team technically derisks the change, by allowing it to be undone
        | rapidly, and politically derisks the change, by tying its
       | deployment to rigorous testing that proves it at least does no
       | harm to the existing process before applying it to all users. The
       | change was judged to be valuable when development effort went
       | into it, whether for technical, branding or other reasons.
       | 
       | In short, not many people want to funnel users through N code
       | paths with slightly different behaviors, because not many people
       | have a ton of users, a ton of engineering capacity, and a ton of
       | potential upside from marginal improvements. Two path tests solve
       | the more common problem of wanting to make major changes to
       | critical workflows without killing the platform.
        
         | ljm wrote:
          | That tracks: I've primarily seen A/B tests used as a mechanism
          | for gradual rollout rather than pure data-driven
          | experimentation. Basically, expose functionality to internal
          | users by default, then slowly expand it outwards to early
          | adopters, and then increment it to 100% for GA.
         | 
         | It's helpful in continuous delivery setups since you can test
         | and deploy the functionality and move the bottleneck for
         | releasing beyond that.
        
           | baxtr wrote:
           | I wouldn't call that A/B testing but rather a gradual roll-
           | out.
        
       | taion wrote:
       | The problem with this approach is that it requires the system
       | doing randomization to be aware of the rewards. That doesn't make
       | a lot of sense architecturally - the rewards you care about often
       | relate to how the user engages with your product, and you would
       | generally expect those to be collected via some offline analytics
       | system that is disjoint from your online serving system.
       | 
       | Additionally, doing randomization on a per-request basis heavily
       | limits the kinds of user behaviors you can observe. Often you
       | want to consistently assign the same user to the same condition
       | to observe long-term changes in user behavior.
       | 
       | This approach is pretty clever on paper but it's a poor fit for
       | how experimentation works in practice and from a system design
       | POV.
        
         | ivalm wrote:
          | You can assign multi-armed bandit trials on a lazy, per-user
          | basis.
          | 
          | So the first time a user touches feature A, they are assigned
          | to some trial arm T_A, and then all subsequent interactions
          | keep them in that trial arm until the trial finishes.
        
           | kridsdale1 wrote:
            | The systems I've used pre-allocate users to an arm,
            | effectively at random, by hashing their user id or
            | equivalent.
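            | 
            | A sketch of that kind of assignment (the experiment name and
            | arm list are illustrative; any stable hash works):
            | 
            |     import hashlib
            | 
            |     def assign_arm(user_id, experiment, arms):
            |         # Deterministic, effectively random bucketing from a
            |         # hash of the experiment name plus the user id.
            |         key = f"{experiment}:{user_id}".encode()
            |         bucket = int(hashlib.sha256(key).hexdigest(), 16)
            |         return arms[bucket % len(arms)]
            | 
            |     assign_arm("user-42", "button-color", ["orange", "green"])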
        
       | IshKebab wrote:
       | This is fine as long as your users don't mind your site randomly
       | changing all the time.
        
         | data-ottawa wrote:
         | That's also a problem for AB testing and solvable (to a degree)
         | by caching assignments
        
       | fitsumbelay wrote:
       | I've only read the first paragraph so bear with me but I'm not
       | understanding the reasoning behind "A/B testing drugs is bad
       | because only half of the sample can potentially benefit" when the
       | whole point is to delineate the gots and got-nots ...
        
         | atombender wrote:
         | If the drug is effective and safe, then one half of the
         | patients lost out on the benefit. You are intentionally
         | "sacrificing" the control arm.
         | 
         | (Of course, the whole point is that the benefit and safety are
         | not certain, so I think the term "sacrifice" used in the
         | article is misleading.)
        
           | kridsdale1 wrote:
           | And the control group is also sacrificed from potentially
           | deadly side effects.
        
       | m3kw9 wrote:
        | If you keep your entire site static while testing only one
        | variable change at a time, the result could be statistically
        | significant. Otherwise, if your flow changes somewhere while you
        | run this algo, it may mislead you into a color that then
        | underperforms, because you've made a change elsewhere before
        | users get to this page.
        
       ___________________________________________________________________
       (page generated 2025-01-13 23:00 UTC)