[HN Gopher] Run fewer, better A/B tests
___________________________________________________________________
Run fewer, better A/B tests
Author : econti
Score : 67 points
Date : 2021-06-26 14:44 UTC (8 hours ago)
(HTM) web link (edoconti.medium.com)
(TXT) w3m dump (edoconti.medium.com)
| btilly wrote:
| The notifications examples make me wonder what fundamental
| mistakes they are making.
|
| People respond to change. If you A/B test, say, a new email
| headline, the change usually wins. Even if it isn't better. Just
| because it is different. Then you roll it out in production, look
| at it a few months later, and it is probably worse.
|
| If you don't understand downsides like this, then A/B testing is
| going to have a lot of pitfalls that you won't even know that you
| fell into.
| uyt wrote:
| I think it's known as the "novelty effect" in the industry.
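|
| One rough way to check for it is to slice the lift by week of
| the experiment instead of pooling everything; a winner whose
| lift keeps shrinking is a hint that novelty, not quality, drove
| the result. A minimal sketch in Python, using a hypothetical
| log with columns variant, week, and converted:
|
|   import pandas as pd
|
|   def weekly_lift(df: pd.DataFrame) -> pd.DataFrame:
|       # Conversion rate per variant, per week of the experiment.
|       rates = (df.groupby(["week", "variant"])["converted"]
|                  .mean()
|                  .unstack("variant"))
|       # Relative lift of B over A within each week.
|       rates["lift"] = (rates["B"] - rates["A"]) / rates["A"]
|       return rates
|
|   # Toy data: B wins big in week 0, then the lift fades away.
|   df = pd.DataFrame({
|       "week":      [0] * 4 + [3] * 4,
|       "variant":   ["A", "A", "B", "B"] * 2,
|       "converted": [0, 1, 1, 1, 0, 1, 0, 1],
|   })
|   print(weekly_lift(df))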
| jonathankoren wrote:
| I'm pretty skeptical of this. I've run a lot of ML based A/B
| tests over my career. I've talked to a lot of people that have
| also run ML A/B tests over their careers. And the one constant
| everyone has discovered is that offline evaluation metrics are
| only somewhat directionally correlated with online metrics.
|
| Seriously. A/B tests are kind of a crap shoot. The systems are
| constantly changing. The online inference data drifts from the
| historical training data. User behavior changes.
|
| I've seen positive offline models perform flat. I've seen
| _negative_ offline metrics perform positively. There's just a lot
| of variance between offline and online performance.
|
| Just run the test. Lower the friction for running the tests, and
| just run them. It's the only way to be sure.
| zwaps wrote:
| Maybe I am too cynical, but what we are really talking about here
| is causal inference for observational data based on more or
| less structural statistical models.
|
| Any researcher will tell you: this is really hard. It is more
| than an engineering problem. You need to know not only how to
| deal with problems, but also what problems may arise and what
| you can actually identify. Most importantly, you need to figure
| out what you cannot identify.
|
| There are, at least here in academia, only a limited set of
| people who are really good at this.
|
| Long story short: even if offline analysis is viable, I doubt
| every team has the right people for it, making it potentially
| not worthwhile.
|
| It is infinitely easier to produce a statistical analysis that
| looks good but isn't than one that is actually good. An
| overwhelming number of useless offline models would,
| statistically speaking, be expected ;)
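|
| For concreteness, the workhorse estimator behind most offline
| policy evaluation is inverse propensity scoring: reweight each
| logged reward by how much more (or less) likely the candidate
| policy is to take the logged action than the policy that
| produced the data. A minimal sketch, assuming logs of (context,
| action, logging probability, reward); the names are
| illustrative, not any particular library's API:
|
|   import numpy as np
|
|   def ips_estimate(logs, new_policy_prob):
|       """Average reward the candidate policy would have earned.
|       logs: dicts with keys context, action, prob_logged, reward.
|       new_policy_prob: callable (context, action) -> probability.
|       """
|       weights = np.array([
|           new_policy_prob(r["context"], r["action"])
|           / r["prob_logged"]
|           for r in logs
|       ])
|       rewards = np.array([r["reward"] for r in logs])
|       # Plain IPS: unbiased if the logging policy explored, but
|       # the variance explodes when it rarely took the actions the
|       # new policy prefers -- the identification problem above.
|       return float(np.mean(weights * rewards))
|
|   # Toy usage: logging policy sent pushes uniformly over two slots.
|   logs = [
|       {"context": {}, "action": 0, "prob_logged": 0.5, "reward": 0.0},
|       {"context": {}, "action": 1, "prob_logged": 0.5, "reward": 1.0},
|   ]
|   always_slot_1 = lambda ctx, a: 1.0 if a == 1 else 0.0
|   print(ips_estimate(logs, always_slot_1))  # -> 1.0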
| varsketiz wrote:
| I've recently heard booking.com given as an example of a
| company that runs a lot of A/B tests. Anyone from Booking reading
| this? How does it look from the inside? Is it worth it to run
| hundreds at a time?
| tootie wrote:
| I've seen a lot of really sophisticated data pipelines and
| testing frameworks at a lot of shops. I've seen precious few who
| were able to make well-considered product decisions based on the
| data.
| bruce343434 wrote:
| Between the emojis in the headings and the 2009-era memes, this
| was a bit of a cringy read. Also, the author seems to avoid at
| all costs going into depth about the actual implementation of
| OPE, and I still don't quite understand how I would go about
| implementing it. Machine learning based on past A/B tests that
| finds similarities between the UI changes???
| econti wrote:
| Author here! I implemented every method described in the post
| in the pip library the post uses.
|
| In case you missed it:
| https://github.com/banditml/offline-policy-evaluation
| xivzgrev wrote:
| Yea me too.
|
| My biggest question is where you get the user data to run the
| simulation. Take the simple push example: if to date you've
| only sent pushes on day 1, and you want to explore days 2, 3,
| 4, 5, etc., where does that user response data come from? It
| seems like you need to collect the data first, and only then
| can you simulate various permutations of a policy. But then why
| not just run a multi-armed bandit?
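|
| For reference, the bandit alternative would look roughly like
| the sketch below: one arm per candidate send day, Thompson
| sampling with Beta priors over click-through. The 5-arm setup
| and the click rates are made up for illustration.
|
|   import random
|
|   class ThompsonBandit:
|       def __init__(self, n_arms):
|           # Beta(1, 1) prior on each arm's success probability.
|           self.successes = [1] * n_arms
|           self.failures = [1] * n_arms
|
|       def choose(self):
|           # Sample a plausible CTR per arm, play the best sample.
|           samples = [random.betavariate(s, f)
|                      for s, f in zip(self.successes, self.failures)]
|           return max(range(len(samples)), key=samples.__getitem__)
|
|       def update(self, arm, clicked):
|           if clicked:
|               self.successes[arm] += 1
|           else:
|               self.failures[arm] += 1
|
|   # Toy loop: arm i = send the push i+1 days after signup.
|   bandit = ThompsonBandit(n_arms=5)
|   true_ctr = [0.05, 0.08, 0.12, 0.07, 0.04]  # made-up ground truth
|   for _ in range(10_000):
|       arm = bandit.choose()
|       bandit.update(arm, random.random() < true_ctr[arm])
|   print(bandit.successes)  # traffic piles onto the best arm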
| austincheney wrote:
| When I was the A/B test guy for Travelocity I was fortunate to
| have an excellent team. The largest bias we discovered was that
| our tests were executed with amazing precision and durability;
| my dedicated QA was the shining star who made that happen.
| Unfortunately, when the resulting feature entered production as
| a released feature there was always some defect, some conflict,
| or some oversight. The actual business results would then
| underperform the team's analyzed prediction.
| tartakovsky wrote:
| What is your advice, or more details on the types of challenges
| you came across and how you handled this discrepancy? I would
| imagine the data shifts a bit, and that standard assumptions
| don't hold up around how the difference you were measuring
| between your "A" and your "B" remain fixed after the testing
| period.
| austincheney wrote:
| The biggest technical challenge we came across was that when
| I had to hire my replacement we couldn't find competent
| developer talent. Everybody NEEDED everything to be jQuery
| and querySelectors, but those were far too slow and buggy
| (jQuery was buggy cross-browser). A good A/B test must not
| look like a test. It has to feel and look like the real
| thing. Some of our tests were extremely complex, spanning
| multiple pages and performing a variety of interactions. We
| couldn't be dicking around with basic code literacy and
| fumbling through entry-level defects.
|
| I was the team developer and not the team analyst so I cannot
| speak to business assumption variance. The business didn't
| seem to care about this since the release cycle is slow and
| defects were common. They were more concerned with the
| inverse proportions of cheap tests bringing stellar business
| wins.
| eximius wrote:
| I'm going to stick with multiarm bandit testing.
| gingerlime wrote:
| What tools/frameworks are you using for running and analysing
| results?
| nxpnsv wrote:
| This is a much better approach...
| sbierwagen wrote:
| >Now you might be thinking OPE is only useful if you have
| Facebook-level quantities of data. Luckily that's not true. If
| you have enough data to A/B test policies with statistical
| significance, you probably have more than enough data to evaluate
| them offline.
|
| Isn't there a multiple comparisons problem here? If you have
| enough data to do a single A/B test, how can you do a hundred
| historical comparisons and still have the same p-value?
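|
| Presumably you'd need something like Benjamini-Hochberg FDR
| control rather than the raw per-test threshold. A small sketch,
| with made-up p-values:
|
|   def benjamini_hochberg(p_values, alpha=0.05):
|       """Indices of hypotheses rejected at false discovery rate alpha."""
|       m = len(p_values)
|       order = sorted(range(m), key=lambda i: p_values[i])
|       # Largest k with p_(k) <= (k / m) * alpha; reject the k smallest.
|       k = 0
|       for rank, i in enumerate(order, start=1):
|           if p_values[i] <= rank / m * alpha:
|               k = rank
|       return sorted(order[:k])
|
|   p_values = [0.001, 0.009, 0.02, 0.04, 0.2, 0.5, 0.8]
|   print(benjamini_hochberg(p_values))  # [0, 1, 2]: one fewer win
|                                        # than naive "p < 0.05"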
| dr_dshiv wrote:
| The challenge I've seen is to have a combination of good, small-
| scale Human-Centered Design research (watching people work, for
| instance) and good, large-scale testing. It can be really hard to
| learn the "why" from A/B tests otherwise.
___________________________________________________________________
(page generated 2021-06-26 23:00 UTC)