[HN Gopher] Lines of code that will beat A/B testing every time ...
___________________________________________________________________
Lines of code that will beat A/B testing every time (2012)
Author : Kerrick
Score : 140 points
Date : 2025-01-09 23:34 UTC (3 days ago)
(HTM) web link (stevehanov.ca)
(TXT) w3m dump (stevehanov.ca)
| asdasdsddd wrote:
| Multi-arm bandits are fine, but they're limited to tests where
| it's ok to switch users between arms frequently, and to tests
| that have more power
| tantalor wrote:
| > where it's ok to switch users between arms frequently
|
| It's not hard to keep track of which arm any given user was
| exposed to in the first run, and then repeat it.
| asdasdsddd wrote:
| There are often product limitations
| 85392_school wrote:
| Previously discussed:
|
| https://news.ycombinator.com/item?id=11437114
|
| https://news.ycombinator.com/item?id=4040022
| nottorp wrote:
| "People distrust things that they do not understand, and they
| especially distrust machine learning algorithms, even if they are
| simple."
|
| How times have changed :)
| saintfire wrote:
| Just had to anthropomorphize machine learning.
| isoprophlex wrote:
| As one of the comments below the article states, the
| probabilistic alternative to epsilon-greedy is worth exploring as
| well. Take the "bayesian bandit", which is not much more complex
| but a lot more powerful.
|
| If you crave more bandits:
| https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...
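|
| A minimal Beta-Bernoulli Thompson sampler ("bayesian bandit")
| might look roughly like the sketch below; the arm names and the
| reward handling are placeholder assumptions:
|
|     import random
|
|     # One Beta(successes+1, failures+1) posterior per arm
|     # (uniform prior). Arm names are hypothetical.
|     arms = {"orange": [1, 1], "green": [1, 1]}
|
|     def choose_arm():
|         # Sample a plausible conversion rate from each posterior,
|         # then play the arm whose sample is highest.
|         best, best_sample = None, -1.0
|         for arm, (wins, losses) in arms.items():
|             sample = random.betavariate(wins, losses)
|             if sample > best_sample:
|                 best, best_sample = arm, sample
|         return best
|
|     def record_reward(arm, converted):
|         # Update the chosen arm's posterior with the outcome.
|         arms[arm][0 if converted else 1] += 1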
| hruk wrote:
| We've been happy using Thompson sampling in production with
| this library https://github.com/bayesianbandits/bayesianbandits
| timr wrote:
| Just a warning to those people who are potentially implementing
| it: it doesn't really matter. The blog author addresses this,
| obliquely (says that the simplest thing is best most of the
| time), but doesn't make it explicit.
|
| In my experience, obsessing on the best decision strategy is
| the biggest honeypot for engineers implementing MAB. Epsilon-
| greedy is _very easy to implement_ and you probably don't need
| anything more. Thompson sampling is a pain in the butt, for not
| much gain.
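|
| For reference, a bare-bones epsilon-greedy sketch (the arm names
| and the in-memory counts are just for illustration):
|
|     import random
|
|     EPSILON = 0.10                      # explore 10% of the time
|     stats = {"A": [0, 0], "B": [0, 0]}  # arm -> [rewards, trials]
|
|     def choose_arm():
|         if random.random() < EPSILON:
|             return random.choice(list(stats))   # explore
|         # Exploit: pick the arm with the best observed rate.
|         return max(stats,
|                    key=lambda a: stats[a][0] / max(stats[a][1], 1))
|
|     def record(arm, reward):
|         stats[arm][0] += reward
|         stats[arm][1] += 1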
| blagie wrote:
| "Easy to implement" is a good reason to use bubble sort too.
|
| In a normal universe, you just import a different library, so
| both are the same amount of work to implement.
|
| Multiarmed bandit seems theoretically pretty, but it's rarely
| worth it. The complexity isn't the numerical algorithm but
| state management.
|
| * Most AB tests can be as simple as a client-side random()
| and a log file (a sketch follows this comment).
|
| * Multiarmed bandit means you need an immediate feedback
| loop, which involves things like adding database columns,
| worrying about performance (since each render requires
| another database read), etc. Keep in mind the database needs
| to now store AB test outcomes and use those for decision-
| making, and computing those is sometimes nontrivial (if it's
| anything beyond a click-through).
|
| * Long-term outcomes matter more than short-term. "Did we
| retain a customer" is more important than "did we close one
| sale."
|
| In most systems, the benefits aren't worth the complexity.
| Multiple AB tests also add testing complexity. You want to
| test three layouts? And three user flows? Now, you have nine
| cases which need to be tested. Add two color schemes? 18
| cases. Add 3 font options? 54 cases. The exponential growth
| in testing is not fun. Fire-and-forget seems great, but in
| practice, it's fire-and-maintain-exponential complexity.
|
| And those conversion differences are usually small enough
| that being on the wrong side of a single AB test isn't
| expensive.
|
| Run the test. Analyze the data. Pick the outcome. Kill the
| other code path. Perhaps re-analyze the data a year later
| with different, longer-term metrics. Repeat. That's the right
| level of complexity most of the time.
|
| If you step up to multiarm, importing a different library
| ain't bad.
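|
| A sketch of the "random() and a log file" version mentioned in
| the first bullet above; the file name, IDs and event names are
| made-up placeholders, and all analysis happens offline from the
| log:
|
|     import json, random, time
|
|     def assign_variant(user_id):
|         # Fixed 50/50 split; no feedback loop or extra DB state.
|         variant = random.choice(["A", "B"])
|         _log({"user": user_id, "variant": variant,
|               "event": "exposure"})
|         return variant
|
|     def log_conversion(user_id, variant):
|         _log({"user": user_id, "variant": variant,
|               "event": "conversion"})
|
|     def _log(record):
|         record["t"] = time.time()
|         with open("ab_test.log", "a") as f:
|             f.write(json.dumps(record) + "\n")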
| krackers wrote:
| There's a good derivation of the EXP3 algorithm from standard
| multiplicative weights which is fairly intuitive. The
| transformation between the two is explained a bit in
| https://nerva.cs.uni-
| bonn.de/lib/exe/fetch.php/teaching/ws18.... Once you have the
| intuition, then the actual choice of parameters is just
| cranking out the math
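|
| A minimal EXP3 loop following that standard formulation (the
| number of arms and gamma below are arbitrary; rewards are assumed
| to lie in [0, 1]):
|
|     import math, random
|
|     K, GAMMA = 3, 0.1
|     weights = [1.0] * K
|
|     def probabilities():
|         # Mix the multiplicative-weights distribution with a
|         # uniform exploration term.
|         total = sum(weights)
|         return [(1 - GAMMA) * w / total + GAMMA / K
|                 for w in weights]
|
|     def choose_arm():
|         return random.choices(range(K),
|                               weights=probabilities())[0]
|
|     def update(arm, reward):
|         p = probabilities()[arm]
|         estimate = reward / p      # importance-weighted estimate
|         weights[arm] *= math.exp(GAMMA * estimate / K)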
| zahlman wrote:
| Nothing shows on this page without JavaScript except for the
| header and a grey background. A bit strange for a blog.
| forgetfreeman wrote:
| Utterly unremarkable for 2012 though.
| lelandfe wrote:
| Pre-CSS grid masonry layout. Author hides the content with CSS,
| and JS reveals it, to avoid a flash.
|
| CSS to make it noscript friendly: `.main { visibility: visible
| !important; max-width: 710px; }`
| tracerbulletx wrote:
| A lot of sites don't have enough traffic to get statistical
| significance with this in a reasonable amount of time, and it's
| almost always testing a feature more complicated than button
| color where you aren't going to have more than the control and
| variant.
| douglee650 wrote:
| Yes wondering what the confidence intervals are.
| wiml wrote:
| If the effect size x site traffic is so small it's
| statistically insignificant, why are you doing all this work in
| the first place? Just choose the option that makes the PHB
| happy and move on.
|
| (But, it's more likely that you _don't know_ if there's a
| significant effect size)
| koliber wrote:
| The PHB wanted A/B testing! True story. I spent two months
| convincing them that it made no sense with the volume of
| conversion events we had.
| kridsdale1 wrote:
| I've only implemented A/B/C tests at Facebook and Google, with
| hundreds of millions of DAU on the surfaces in question, and
| three groups is still often enough to dilute the measurement in
| question below stat-sig.
| usgroup wrote:
| If your aim is to evaluate an effect size of your treatment
| because you want to know whether it's significant, you can't do
| what the article advises.
| HeliumHydride wrote:
| The "20" is missing from the title.
| jerrygenser wrote:
| I believe hacker news automatically truncates the number at the
| beginning of titles
| crazygringo wrote:
| No, multi-armed bandit doesn't "beat" A/B testing, nor does it
| beat it "every time".
|
| Statistical significance is statistical significance, end of
| story. If you want to show that option B is better than A, then
| you need to test B enough times.
|
| It doesn't matter if you test it half the time (in the simplest
| A/B) or 10% of the time (as suggested in the article). If you do
| it 10% of the time, it's just going to take you five times
| longer.
|
| And A/B testing can handle multiple options just fine, contrary
| to the post. The name "A/B" suggests two, but you're free to use
| more, and this is extremely common. It's still called "A/B
| testing".
|
| Generally speaking, you want to find the best option and then
| _remove the other ones_ because they're suboptimal and code
| cruft. The author suggests _always_ keeping 10% exploring other
| options. But if you already know they're worse, that's just
| making your product worse for those 10% of users.
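|
| To put numbers on "enough times": with the usual two-proportion
| approximation (alpha=0.05, power=0.8) and an invented baseline of
| 1.0% vs 1.2% conversion, each arm needs on the order of 40k
| exposures no matter how traffic is split between arms:
|
|     # Rough per-arm sample size for detecting p1 vs p2.
|     Z_ALPHA, Z_BETA = 1.96, 0.84   # alpha=0.05 two-sided, power=0.8
|
|     def samples_per_arm(p1, p2):
|         variance = p1 * (1 - p1) + p2 * (1 - p2)
|         return (Z_ALPHA + Z_BETA) ** 2 * variance / (p1 - p2) ** 2
|
|     print(round(samples_per_arm(0.010, 0.012)))  # ~42,600 per arm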
| LPisGood wrote:
| Multi-arm bandit does beat A/B testing in the sense that
| standard A/B testing does not seek to maximize reward during
| the testing period, while MAB does. MAB also generalizes better to
| testing many things than A/B testing.
| crazygringo wrote:
| No -- you can't have your cake and eat it too.
|
| You get _zero_ benefits from MAB over A/B if you simply end
| your A/B test once you've achieved statistical significance
| and pick the best option. Which is what any efficient A/B
| test does -- there's no reason to have any fixed "testing
| period" beyond what is needed to achieve statistical
| significance.
|
| While, to the contrary, the MAB described in the article does
| _not_ maximize reward -- as I explained in my previous
| comment. Because the post's version runs indefinitely, it
| has _worse_ long-term reward because it continues to test
| inferior options long after they've been proven worse. If
| you leave it running, you're harming yourself.
|
| And I have no idea what you mean by MAB "generalizing" more.
| But it doesn't matter if it's worse to begin with.
|
| (Also, it's a huge red flag that the post doesn't even
| _mention_ statistical significance.)
| LPisGood wrote:
| > you can't have your cake and eat it too
|
| I disagree. There is a vast array of literature on solving
| the MAB problem that may as well be grouped into a bin
| called "how to optimally strike a balance between having
| one's cake and eating it too."
|
| The optimization techniques to solve MAB problem seek to
| optimize reward by giving the right balance of exploration
| and exploitation. In other words, these techniques attempt
| to determine the optimal way to strike a balance between
| exploring if another option is better and exploiting the
| option currently predicted to be best.
|
| There is a strong reason this literature doesn't start and
| end with: "just do A/B testing, there is no better
| approach"
| crazygringo wrote:
| I'm not talking about the literature -- I'm talking about
| the extremely simplistic and sub-optimal procedure
| described in the post.
|
| If you want to get sophisticated, MAB properly done is
| essentially just A/B testing with optimal strategies for
| deciding when to end individual A/B tests, or balancing
| tests optimally for a limited number of trials. But
| again, it doesn't "beat" A/B testing -- it _is_ A/B
| testing in that sense.
|
| And that's what I mean. You can't magically increase your
| reward while simultaneously getting statistically
| significant results. Either your results are significant
| to a desired level or not, and there's no getting around
| the number of samples you need to achieve that.
| LPisGood wrote:
| I am talking about the literature which solves MAB in a
| variety of ways, including the one in the post.
|
| > MAB properly done is essentially just A/B testing
|
| Words are only useful insofar as their meanings invoke
| ideas, and in my experience absolutely no one thinks of
| other MAB strategies when someone talks about A/B
| testing.
|
| Sure, you can classify A/B testing as one extremely
| suboptimal approach to solving the MAB problem. This
| classification doesn't help much though, because the
| other MAB techniques do "magically increase the rewards"
| compared to this simple technique.
| cauch wrote:
| Another way of seeing the situation: let your MAB solution
| run for a while. Orange has been tested 17 times and blue
| has been tested 12 times. This is exactly equivalent to an
| A/B test where you show the orange button to 17 people and
| the blue button to 12 people.
|
| The trick is to find the best number of tests for each
| color so that we have good statistical significance. MAB
| does not do that well, as you cannot easily force testing
| an option that looked bad when it did not get enough trials
| to reach good statistical significance (imagine you have 10
| colors and orange first scores 0/1. It will take a very
| long while before this color is re-tested significantly:
| you first need to fall into the 10%, and then you still
| have only a ~10% chance of randomly picking this color and
| not one of the others). With A/B testing, you can do a
| power analysis beforehand (or at any point during the test)
| to know when to stop.
|
| The literature does not start with "just do A/B testing"
| because it is not the same problem. In MAB, your goal is
| not to demonstrate that one option is bad, it's to make
| your own decisions when faced with a fixed situation.
| LPisGood wrote:
| > The trick is to find the exact best number of test for
| each color so that we have good statistical significance
|
| Yes, A/B testing will force through enough trials to get
| statistical significance (it is definitely an "exploration
| first" strategy), but in many cases you care about
| maximizing reward as well, in particular during testing.
| A/B testing does very poorly at balancing exploration
| with exploitation in general.
|
| This is especially true if the situation is dynamic. Will
| you A/B test forever in case something has changed, and
| accept the long-term loss in reward that implies?
| cauch wrote:
| But the proposed MAB system does not even include a method
| for knowing when it should be stopped (and all the choices
| except the best one removed).
|
| With A/B testing, you can do a power analysis whenever you
| want, including in the middle of the experiment. It will
| just be an iterative adjustment that converges.
|
| In fact, you can even run through all the possibilities in
| advance (if A gets 1% and B gets 1%, how many A and B do I
| need; if A gets 2% and B gets 1%; if A gets 3% and B gets
| 1%; ...) and it will give you the exact stopping boundaries
| for any configuration before even running the experiment.
| You then just stop trialing option A as soon as it crosses
| the significance threshold you already decided on.
|
| So, no, A/B testing will never run forever. And A/B testing
| will always be better than the MAB solution, because you
| will have a better way to stop trying a bad option as soon
| as it has crossed the threshold you decided is enough to
| consider it bad.
| cle wrote:
| This is a double-edged sword. There are often cases in real-
| world systems where the "reward" the MAB maximizes is biased
| by eligibility issues, system caching, bugs, etc. If this
| happens, your MAB has the potential to converge on the worst
| possible experience for your users, something a static
| treatment allocation won't do.
| LPisGood wrote:
| I haven't seen these particular shortcomings before, but I
| certainly agree that if your data is bad, this ML approach
| will also be bad.
|
| Can you share some more details about your experiences with
| those particular types of failures?
| cle wrote:
| Sure! A really simple (and common) example would be a
| setup w/ treatment A and treatment B, your code does "if
| session_assignment == A .... else .... B" . In the else
| branch you do something that for whatever reason causes
| misbehavior (perhaps it sometimes crashes or throws an
| exception or uses a buffer that drops records under high
| load to protect availability). That's surprisingly common.
| Or perhaps you were hashing on the wrong key to generate
| session assignments--ex you accidentally used an ID that
| expires after 24 hours of inactivity...now only highly
| active people get correctly sampled.
|
| Another common one I saw was due to different systems
| handling different treatments, and there being caching
| discrepancies between the two, like esp in a MAB where
| allocations are constantly changing, if one system has a
| much longer TTL than the other then you might see
| allocation lags for one treatment and not the other,
| biasing the data. Or perhaps one system deploys much more
| frequently and the load balancer draining doesn't wait
| for records to finish uploading before it kills the
| process.
|
| The most subtle ones were eligibility biases, where one
| treatment might cause users to drop out of an experiment
| entirely. Like if you have a signup form and you want to
| measure long-term retention, and one treatment causes
| some cohorts to not complete the signup entirely.
|
| There are definitely mitigations for these issues, like
| you can monitor the expected vs. actual allocations and
| alert if they go out-of-whack. That has its own set of
| problems and statistics though.
| jbentley1 wrote:
| Multi-armed bandits make a big assumption that effectiveness is
| static over time. What can happen is that if they tip traffic
| slightly towards option B at a time when effectiveness is higher
| (maybe a sale just started) B will start to overwhelmingly look
| like a winner and get locked in that state.
|
| You can solve this with propensity scores, but it is more
| complicated to implement and you need to log every interaction.
| LPisGood wrote:
| This objection is mentioned specifically in the post.
|
| You can add a forgetting factor for older results.
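|
| A sketch of that kind of discounting (the decay constant is
| arbitrary): shrink the running counts a little before each update
| so old evidence gradually stops dominating.
|
|     DECAY = 0.999
|     stats = {"A": [0.0, 0.0], "B": [0.0, 0.0]}  # [rewards, trials]
|
|     def record(arm, reward):
|         # Exponentially forget history, then add the new result.
|         for counts in stats.values():
|             counts[0] *= DECAY
|             counts[1] *= DECAY
|         stats[arm][0] += reward
|         stats[arm][1] += 1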
| randomcatuser wrote:
| This seems like a fudge factor though. Some things are
| changed bc you act on them! (e.g. recommendation systems that
| are biased towards more popular content). So having dynamic
| groups makes the data harder to analyze
| LPisGood wrote:
| A standard formulation of MAB problem assumes that acting
| will impact the rewards, and this forgetting factor
| approach is one which allows for that and still attempts to
| find the currently most exploitable lever.
| __MatrixMan__ wrote:
| After this:
|
| > hundreds of the brightest minds of modern civilization have
| been hard at work not curing cancer. Instead, they have been
| refining techniques for getting you and me to click on banner ads
|
| I was really hoping this would slowly develop into a statistical
| technique couched in terms of ad optimization but actually
| settling in on something you might call ATCG testing (e.g. the
| biostatistics methods that one would indeed use to cure cancer).
| JensRantil wrote:
| Yep. I implemented this as a Java library a while ago (and other
| stuff): https://github.com/JensRantil/java-canary-tools
| Basically, I think feature flags and A/B tests aren't always
| needed to roll out experiments.
| royal-fig wrote:
| If multi-arm bandits have piqued your curiosity, we recently added
| support for it to our feature flagging and experimentation
| platform, GrowthBook.
|
| We talk about it here: https://blog.growthbook.io/introducing-
| multi-armed-bandits-i...
| munro wrote:
| Here's an interesting write-up on various algorithms & different
| epsilon-greedy % values.
|
| https://github.com/raffg/multi_armed_bandit
|
| It shows 10% exploration performs the best -- a very simple,
| effective algorithm.
|
| It also shows the Thompson Sampling algorithm converges a bit
| faster -- the best arm is chosen by sampling from the beta
| distribution, which eliminates the explore phase. And you can
| use the builtin random.betavariate!
|
| https://github.com/raffg/multi_armed_bandit/blob/42b7377541c...
| kazinator wrote:
| > _Statistics are hard for most people to understand._
|
| True, but that's exactly what statistics helps with, though also
| hard to understand. :)
| iforgot22 wrote:
| I don't like how this dismisses the old approach as "statistics
| are hard for most people to understand." This algo beats A/B
| testing in terms of maximizing how many visitors get the best
| feature. But is that really a big enough concern IRL that people
| are interested in optimizing it every time? Every little dynamic
| lever adds complexity to a system.
| randomcatuser wrote:
| Yeah basically. The idea is that somehow this is the data-
| optimal way of determining which one is the best (rather than
| splitting your data 50/50 and wasting a lot of samples when you
| already know)
|
| The caveats (perhaps not mentioned in the article) are:
|
| - Perhaps you have many metrics you need to track/analyze (CTR,
| conversion, rates on different metrics), so you can't strictly
| do bandit!
|
| - As someone mentioned below, sometimes the situation is
| dynamic (so having evenly sized groups helps with capturing
| this effect)
|
| - Maybe some other ones I can't think of?
|
| But you can imagine this kind of auto-testing being useful...
| imagine AI continually pushes new variants, and it just
| continually learns which one is the best
| iforgot22 wrote:
| Facebook or YouTube might already be using an algo like this
| or AI to push variants, but for each billion user product,
| there are probably thousands of smaller products that don't
| need something this automated.
| cle wrote:
| It still misses the biggest challenge though--defining
| "best", and ensuring you're actually measuring it and not
| something else.
|
| It's useful as long as your definition is good enough and
| your measurements and randomizations aren't biased. Are you
| monitoring this over time to ensure that it continues to
| hold? If you don't, you risk your MAB converging on something
| very different from what you would consider "the best".
|
| When it converges on the right thing, it's better. When it
| converges on the wrong thing, it's worse. Which will it do?
| What's the magnitude of the upside vs downside?
| rerdavies wrote:
| I think you missed the point. It's not about which visitors get
| the best feature. It's about how to get people to PUSH THE
| BUTTON!!!!! Which is kind of the opposite of the best feature.
| The goal is to make people do something they don't want to do.
|
| Figuring out best features is a completely different problem.
| iforgot22 wrote:
| I didn't say it was the best for the user. Really the article
| misses this by comparing a new UI feature to a life-saving
| drug, but it doesn't matter. The point is, whatever metric
| you're targeting, do you use this algo or fixed group sizes?
| awkward wrote:
| Pure, disinterested A/B testing -- where the goal is just to find
| the best way to do something, and there's enough leverage and
| traffic that funding the test is worthwhile -- is rare.
|
| More frequently, A/B testing is a political technology that
| allows teams to move forward with changes to core, vital services
| of a site or app. By putting a new change behind an A/B test, the
| team technically derisks the change, by allowing it to be undone
| rapidly, and politically derisks the change, by tying its
| deployment to rigorous testing that proves it at least does no
| harm to the existing process before applying it to all users. The
| change was judged to be valuable when development effort went
| into it, whether for technical, branding or other reasons.
|
| In short, not many people want to funnel users through N code
| paths with slightly different behaviors, because not many people
| have a ton of users, a ton of engineering capacity, and a ton of
| potential upside from marginal improvements. Two path tests solve
| the more common problem of wanting to make major changes to
| critical workflows without killing the platform.
| ljm wrote:
| Tracks that I've primarily seen A/B tests used as a mechanism
| for gradual rollout rather than pure data-driven
| experimentation. Basically expose functionality to internal
| users by default then slowly expand it outwards to early
| adopters and then increment it to 100% for GA.
|
| It's helpful in continuous delivery setups since you can test
| and deploy the functionality and move the bottleneck for
| releasing beyond that.
| baxtr wrote:
| I wouldn't call that A/B testing but rather a gradual roll-
| out.
| taion wrote:
| The problem with this approach is that it requires the system
| doing randomization to be aware of the rewards. That doesn't make
| a lot of sense architecturally - the rewards you care about often
| relate to how the user engages with your product, and you would
| generally expect those to be collected via some offline analytics
| system that is disjoint from your online serving system.
|
| Additionally, doing randomization on a per-request basis heavily
| limits the kinds of user behaviors you can observe. Often you
| want to consistently assign the same user to the same condition
| to observe long-term changes in user behavior.
|
| This approach is pretty clever on paper but it's a poor fit for
| how experimentation works in practice and from a system design
| POV.
| ivalm wrote:
| You can assign multiarm bandit trials on a lazy per user basis.
|
| So the first time a user touches feature A they are assigned to
| some trial arm T_A, and then all subsequent interactions keep
| them in that arm until the trial finishes.
| kridsdale1 wrote:
| The systems I've used pre-allocate users effectively randomly
| to an arm by hashing their user id or equivalent.
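|
| A sketch of that kind of stable assignment (the experiment name
| and arm list are assumptions): the same user always lands in the
| same arm, with no per-user state to store.
|
|     import hashlib
|
|     ARMS = ["control", "treatment"]
|
|     def arm_for(user_id, experiment="button_color"):
|         # Same user + experiment always hashes to the same bucket.
|         key = f"{experiment}:{user_id}".encode()
|         digest = hashlib.sha256(key).hexdigest()
|         return ARMS[int(digest, 16) % len(ARMS)]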
| IshKebab wrote:
| This is fine as long as your users don't mind your site randomly
| changing all the time.
| data-ottawa wrote:
| That's also a problem for AB testing and solvable (to a degree)
| by caching assignments
| fitsumbelay wrote:
| I've only read the first paragraph so bear with me but I'm not
| understanding the reasoning behind "A/B testing drugs is bad
| because only half of the sample can potentially benefit" when the
| whole point is to delineate the gots and got-nots ...
| atombender wrote:
| If the drug is effective and safe, then one half of the
| patients lost out on the benefit. You are intentionally
| "sacrificing" the control arm.
|
| (Of course, the whole point is that the benefit and safety are
| not certain, so I think the term "sacrifice" used in the
| article is misleading.)
| kridsdale1 wrote:
| And the control group is also sacrificed from potentially
| deadly side effects.
| m3kw9 wrote:
| If you keep your entire site static while testing one variable
| change at a time, the result can be statistically significant.
| Otherwise, if your flow changes somewhere while you run this
| algo, it may mislead you into a color that then underperforms
| because you've made a change elsewhere before users get to this
| page.
___________________________________________________________________
(page generated 2025-01-13 23:00 UTC)