[HN Gopher] Annoying A/B testing mistakes
___________________________________________________________________
Annoying A/B testing mistakes
Author : Twixes
Score : 251 points
Date : 2023-06-16 10:30 UTC (12 hours ago)
(HTM) web link (posthog.com)
(TXT) w3m dump (posthog.com)
| methou wrote:
| Probably off-topic, but how do you opt out of most A/B testing?
| mabbo wrote:
| > The solution is to use an A/B test running time calculator to
| determine if you have the required statistical power to run your
| experiment and for how long you should run your experiment.
|
| Wouldn't it be better to have an A/B testing system that just
| counts how many users have been in each assignment group and end
| when you have the required statistical power?
|
| Time just seems like a stand-in for "that should be enough", when
| in reality the number of users actually exposed might differ
| from your expectations.
| aliceryhl wrote:
| Running the experiment until you have a specific pre-determined
| number of observations is okay.
|
| However, the deceptively similar scheme of running it until the
| results are statistically significant is not okay!
| mreezie wrote:
| If you want statistical significance of 1/20 and you check 20
| times... you are likely to find it.
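| A rough simulation of the effect (not from the article; all
| numbers are made up): an A/A test with no real difference,
| peeked at after every batch and stopped as soon as p < 0.05.
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(0)
|     runs, batches, batch_size, rate = 2000, 20, 200, 0.10
|     false_positives = 0
|
|     for _ in range(runs):
|         a, b = np.empty(0), np.empty(0)
|         for _ in range(batches):
|             a = np.append(a, rng.random(batch_size) < rate)
|             b = np.append(b, rng.random(batch_size) < rate)
|             _, p = stats.ttest_ind(a, b)
|             if p < 0.05:    # looks "significant" -- stop early
|                 false_positives += 1
|                 break
|
|     # prints well above the nominal 5% false positive rate
|     print(false_positives / runs)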
| dbroockman wrote:
| Another one: don't program your own AB testing framework! Every
| time I've seen engineers try to build this on their own, it fails
| an AA test (where both versions are the same so there should be
| no difference). Common reasons are overly complicated
| randomization schemes (keep it simple!) and differences in load
| times between test and control.
| giraffe_lady wrote:
| I don't keep up with it that much, but it seems like the ecosystem
| has kind of collapsed over the last few years though? Like you have
| optimizely and its competitors that are fully focused on huge
| enterprise with "call us" pricing right out the gate. VWO has a
| clunky & aged tech stack that was already causing problems when
| I used it a couple years ago and seems unchanged since then.
|
| If you're a medium-small business I see why you'd be tempted to
| roll your own. Trustworthy options under $15k/year are not
| apparent.
| HWR_14 wrote:
| Shouldn't AA tests fail a certain percentage of the time?
| Typically, 5% of the time?
| rmetzler wrote:
| If I read the first mistake correctly, then getFeatureFlag() has
| the side effect of counting how often it was called, and this is
| used to calculate the outcome of the experiment? Wow. I don't know what
| to say....
| willsmith72 wrote:
| Yep gross...
| xp84 wrote:
| That's how every one of these tools works, that's the whole
| point of using them: you only call them when you're going to
| actually show the variation to the user. If you're running a
| test that modifies the homepage only, you shouldn't be calling
| that decision method in, say, your global navigation code that
| you show everyone. Or, for instance, if your test only affects
| how the header looks for subscribers, you have to put an outer
| if statement "if subscriber" before the "if test variation."
| How else would it correctly know exactly who saw the test?
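| A sketch of that ordering (the flag client and the render
| helpers here are hypothetical placeholders, not PostHog's
| actual API): check eligibility first, so only users who can
| actually see the variation get enrolled.
|
|     def render_header(user, flags):
|         # eligibility gate comes before the flag lookup
|         if not user.is_subscriber:
|             return render_default_header(user)
|         # enrollment (and the counting side effect) happens here
|         variant = flags.get_variant(user.id, "subscriber-header")
|         if variant == "test":
|             return render_new_header(user)
|         return render_default_header(user)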
| pjm331 wrote:
| This is indeed the case. Have run into a few surprising things
| like this when implementing posthog experiments recently
| sometimes_all wrote:
| Yeah I felt that way too. Initially I wasn't sure what I was
| missing, since the only difference is that the order of the
| checks is switched, and the function will still return
| the same true/false in both cases. Then I thought about side
| effects and it felt icky.
| alsiola wrote:
| Writing an article about developer mistakes is easier than
| redesigning your rubbish API though.
| dyeje wrote:
| When you call the feature flag, it's going to put the user into
| one of the groups. The article is saying you don't want to add
| irrelevant users (in the example, ones that had already done
| the action they were testing) because it's going to skew your
| results.
| willsmith72 wrote:
| The point is from an api design perspective something like
|
| "posthog.getFeatureFlag('experiment-key')"
|
| doesn't look like it's actually performing a mutation.
| kimukasetsu wrote:
| The biggest mistake engineers make is determining sample sizes.
| It is not trivial to determine the sample size for a trial
| without prior knowledge of effect sizes. Instead of waiting for a
| fixed sample size, I would recommend using a sequential testing
| framework: set a stopping condition and perform a test for each
| new batch of sample units.
|
| This is called optional stopping and it is not possible using a
| classic t-test, since Type I and II errors are only valid at a
| predetermined sample size. However, other tests make it possible:
| see safe anytime-valid statistics [1, 2] or, simply, bayesian
| testing [3, 4].
|
| [1] https://arxiv.org/abs/2210.01948
|
| [2] https://arxiv.org/abs/2011.03567
|
| [3] https://pubmed.ncbi.nlm.nih.gov/24659049/
|
| [4]
| http://doingbayesiandataanalysis.blogspot.com/2013/11/option...
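| For a flavor of the batch-by-batch idea, a minimal Bayesian
| (Beta-Binomial) sketch; the priors, thresholds, and simulated
| data below are illustrative, not taken from the papers above.
|
|     import numpy as np
|
|     rng = np.random.default_rng(1)
|
|     def prob_b_beats_a(ca, na, cb, nb, draws=100_000):
|         pa = rng.beta(1 + ca, 1 + na - ca, draws)
|         pb = rng.beta(1 + cb, 1 + nb - cb, draws)
|         return (pb > pa).mean()
|
|     ca = na = cb = nb = 0
|     for batch in range(1, 101):          # each new batch of users
|         a = rng.random(500) < 0.10       # simulated control
|         b = rng.random(500) < 0.11       # simulated treatment
|         ca += a.sum(); na += a.size
|         cb += b.sum(); nb += b.size
|         p = prob_b_beats_a(ca, na, cb, nb)
|         if p > 0.99 or p < 0.01:         # stopping condition
|             print(f"stopped at batch {batch}, P(B>A)={p:.3f}")
|             break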
| travisjungroth wrote:
| People often don't determine sample sizes at all! And doing
| power calculations without an idea of effect size isn't just
| hard but impossible. It's one of the inputs to the formula. But
| at least it's fast so you can sort of guess and check.
|
| Anytime valid inference helps with this situation, but it
| doesn't solve it. If you're trying to detect a small effect, it's
| much better to figure out up front that you need a million samples
| than to learn it the hard way when your test, at 1,000 samples a
| day, takes three years.
|
| Still, anytime is way better than fixed IMO. Fixed almost never
| really exists. Every A/B testing platform I've seen allows
| peeking.
|
| I work with the author of the second paper you listed. The math
| looks advanced, but it's very easy to implement.
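| For reference, a "guess and check" power calculation is only a
| few lines with statsmodels (the baseline and lifts here are
| assumptions; plug in your own):
|
|     from statsmodels.stats.proportion import proportion_effectsize
|     from statsmodels.stats.power import NormalIndPower
|
|     baseline = 0.10                    # current conversion rate
|     for lift in (0.005, 0.01, 0.02):   # absolute lifts to check
|         es = proportion_effectsize(baseline + lift, baseline)
|         n = NormalIndPower().solve_power(es, alpha=0.05, power=0.8)
|         print(f"lift {lift:.3f}: ~{n:,.0f} users per variant")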
| hackernewds wrote:
| The biggest mistake is engineers owning experimentation. It
| should be owned by data scientists.
|
| I realize that's a luxury, but I also see this trend in
| blue-chip companies.
| pbae wrote:
| Did a data scientist write this? You don't need to be a
| member of a priesthood to run experiments. You just need to
| know what you're doing.
| playingalong wrote:
| ... and by some definition you'd be a data scientist
| yourself. (Regardless of your job title)
| bonniemuffin wrote:
| I agree with both sides here. :) DS should own
| experimentation, AND engineers should be able to run a
| majority of experiments independently.
|
| As a data scientist at a "blue chip company", my team owns
| experimentation, but that doesn't mean we run all the
| experiments. Our role is to create guidelines, processes,
| and tooling so that engineers can run their own experiments
| independently most of the time. Part of that is also
| helping engineers recognize when they're dealing with a
| difficult/complex/unusual case where they should bring DS
| in for more bespoke hands-on support. We probably only look
| at <10% of experiments (either in the setup or results
| phase or both), because engineers/PMs are able to set up,
| run, and draw conclusions from most of the experiments
| without needing us.
| 2rsf wrote:
| Another challenge, related more to implementation than theory, is
| having too many experiments running in parallel.
|
| As a company grows there will be multiple experiments running in
| parallel executed by different teams. The underlying assumption
| is that they are independent, but it is not necessarily true or
| at least not entirely correct. For example a graphics change on
| the main page together with a change in the login logic.
|
| Obviously this can be solved by communication, for example
| documenting running experiments, but like many other aspects in
| AB testing there is a lot of guesswork and gut feeling involved.
| cantSpellSober wrote:
| A better solve is E2E or unit tests to make sure A/B segments
| aren't conflicting. At the enterprise level there's simply too
| many teams testing too much to keep track of it in, say, a
| spreadsheet.
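| As a sketch of what such a check could look like (the registry
| and its shape are hypothetical, purely to illustrate a unit
| test that fails when two live experiments touch the same
| surface):
|
|     from collections import defaultdict
|
|     ACTIVE_EXPERIMENTS = [              # would come from config
|         {"key": "homepage-hero-v2", "surfaces": {"homepage"}},
|         {"key": "login-social-buttons", "surfaces": {"login"}},
|     ]
|
|     def test_no_overlapping_surfaces():
|         seen = defaultdict(list)
|         for exp in ACTIVE_EXPERIMENTS:
|             for surface in exp["surfaces"]:
|                 seen[surface].append(exp["key"])
|         clashes = {s: k for s, k in seen.items() if len(k) > 1}
|         assert not clashes, f"conflicting experiments: {clashes}"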
| franze wrote:
| plus, mind the Honeymoon Effect
|
| something new performs better because it's new
|
| if you have a platform with lots of returning users this one will
| hit you again and again.
|
| so even if you have a winner after the test and make the change
| permanent, revisit it 2 months later and see if you are now
| really better off.
|
| the sum of all your a/b-tested changes has a good chance of just
| leaving you with an average platform.
| Sohcahtoa82 wrote:
| The one mistake I assume happens too much is trying to measure
| "engagement".
|
| Imagine a website is testing a redesign, and they want to decide
| if people like it by measuring how long they spend on the site to
| see if it's more "engaging". But the new site makes information
| harder to find, so they spend more time on the site browsing and
| trying to find what they're looking for.
|
| Management goes, "Oh, users are delighted with the new site! Look
| how much time they spend on it!" not realizing how frustrated the
| users are.
| Xenoamorphous wrote:
| LinkedIn is a good example, I think. One day I got a "you have
| a new message" email. I clicked it, thinking, well, someone has
| messaged me, right? It turned out to be just bullshit, someone
| in my network had just posted something.
|
| I'm sure the first few of those got a lot of clicks, but it
| prompted me to ignore absolutely everything that comes from
| LinkedIn except for actual connection requests from people I
| know. Lots of clicks but also lots of pissed off people. I
| guess the latter is harder to measure.
| ravenstine wrote:
| Engagement is my favorite form of metrics pseudoscience. A
| classic example is when engagement actually goes up, not
| because the design change is better, but because it frustrates
| and confuses the user, causing them to click around more and
| remain on the site longer. Without a focus group, there's
| really no way to determine whether the users are actually
| "delighted".
|
| EDIT: For some reason it didn't compute with me that you
| already referred to the same example. I've seen that exact
| scenario play out in real life, though.
| Sohcahtoa82 wrote:
| I bet the reddit redesign used a similar faulty measurement
| of engagement.
|
| "People spent more time scrolling the feed, people must enjoy
| it!"
|
| No, the feed takes up more space, so now I can only fit 1 or
| 2 items on my screen at once, rather than 10, so I have to
| scroll more to see more content.
| afro88 wrote:
| If that also resulted in little or no change in how often
| you (and everyone) opened reddit each day, then it is a
| "success" for them. They have your eyeballs for longer, so
| you likely see more ads.
|
| If only they were trying to maximise enjoyment and not
| addictiveness. They don't care at all about enjoyment, just
| like Facebook doesn't care about genuine connection to
| family and friends, or twitter to useful and constructive
| discussion that leads to positive social change.
| ravenstine wrote:
| That would not surprise me in the least! In fact, that's
| exactly what happened at a company I used to work for (that
| shall remain nameless). At the behest of the design team,
| we implemented a complete redesign of our site which
| included changing the home page so that at most only two
| media items could be on-screen at a time, and the ads which
| used to be simple banners now were woven between the feed
| of items. I remember sitting in a meeting where we had A/B
| tested this new homepage, and witnessing some data analyst
| guy giving a presentation which included how "engagement in
| the B-group was increased by N-percent!!!" The directors of
| web content were awestruck by this despite no context or
| explanation as to _why_ supposed "engagement" was higher
| with the new design. The test wasn't even carried out for a
| long duration of time. For all anyone knew, users were
| confused and spent more time clicking around because they
| were looking for something they were accustomed to in the
| original design. And no, it did not matter that I brought
| up my reasons for skepticism; anything that made a number
| increase made it into the final design. _Then_, we
| actually had focus groups, long after the point at which we
| should have been consulting them, and the feedback we
| received was overwhelmingly lukewarm or negative. Much of
| it vindicated my concerns the entire time; users didn't
| _actually_ like scrolling. Then again, I guess if they're
| viewing more ads, then who cares what the user thinks??
| Never have I felt more like I was living in a Dilbert comic
| than that time.
| time4tea wrote:
| Annoying illegal cookie consent banner?
| realjohng wrote:
| Thanks for posting this. It's to the point and easy to
| understand. And much needed - most companies seem to do testing
| without teaching the intricacies involved.
| alsiola wrote:
| On point 7 (Testing an unclear hypothesis), while agreeing with
| the overall point, I strongly disagree with the examples.
|
| > Bad Hypothesis: Changing the color of the "Proceed to checkout"
| button will increase purchases.
|
| This is succinct and clear, and it's obvious what the
| variable/measure will be.
|
| > Good hypothesis: User research showed that users are unsure of
| how to proceed to the checkout page. Changing the button's color
| will lead to more users noticing it and thus more people will
| proceed to the checkout page. This will then lead to more
| purchases.
|
| > User research showed that users are unsure of how to proceed to
| the checkout page.
|
| Not a hypothesis, but a problem statement. Cut the fluff.
|
| > Changing the button's color will lead to more users noticing it
| and thus more people will proceed to the checkout page.
|
| This is now two hypotheses.
|
| > This will then lead to more purchases.
|
| Sorry I meant three hypotheses.
| travisjungroth wrote:
| The biggest issue with those three hypotheses is that one of
| them, noticing the button, almost certainly isn't being tested.
| But, how the test goes will inform how people think about that
| hypothesis.
| ano-ther wrote:
| Good observation that the noticing doesn't get tested.
|
| Would there be any benefit from knowing the notice rate
| though? After all, the intended outcome is increased sales by
| clicking.
| ricardobeat wrote:
| Probably not, but then that hypothesis should not be part
| of the experiment.
| alsiola wrote:
| This is what I was driving at in my original comment - the
| intermediary steps are not of interest (from the POV of the
| hypothesis/overall experiment), so why mention them at all.
| hinkley wrote:
| Rate of traffic on the checkout page, _divided by overall
| traffic_.
|
| We see a lot of ghosts in A/B testing because we are loosey
| goosey about our denominators. Mathematicians apparently hate
| it when we do that.
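| A small sketch of keeping that denominator explicit (the event
| shape is hypothetical): the rate is checkout-page visitors over
| everyone assigned to the variant, not over a downstream subset.
|
|     from collections import namedtuple
|
|     Event = namedtuple("Event", "user_id variant name")
|
|     def checkout_rate(events, variant):
|         assigned = {e.user_id for e in events
|                     if e.variant == variant
|                     and e.name == "assigned"}
|         reached = {e.user_id for e in events
|                    if e.variant == variant
|                    and e.name == "viewed_checkout"}
|         # denominator: everyone assigned, not just funnel entrants
|         return len(reached & assigned) / max(len(assigned), 1)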
| plagiarist wrote:
| That doesn't test noticing the button, that tests clicking
| the button. If the color changes it is possible that fewer
| people notice it but are more likely to click in a way that
| increases total traffic. Or more people notice it but are
| less likely to click in a way that reduces traffic.
| ssharp wrote:
| I don't think these examples are bad. From a clarity
| standpoint, where you have multiple people looking at your
| experiments, the first one is quite bad and the second one is
| much more informative.
|
| Requiring a user problem, proposed solution, and expected
| outcome for any test is also good discipline.
|
| Maybe it's just getting into pedantry over the word "hypothesis"
| and you would expect the other information elsewhere in the
| test plan?
| sacrosancty wrote:
| [dead]
| darkerside wrote:
| Having a clearly stated hypothesis and supplying appropriate
| context separately isn't pedantry. It is semantics, but words
| result in actions that matter.
| avereveard wrote:
| the problem is the hand wavy "user research"
|
| if you have done that properly, why ab testing? if you did
| that improperly, why bother?
|
| ab testing starts from a hypothesis, because ab testing is
| done to inform a bayesian analysis to identify causes.
|
| if one knows already that the reason is 'button not visible
| enough' ab testing is almost pointless.
|
| not entirely pointless, because you can still do ab testing
| to validate that the change is in the right direction, but
| investing developer time in production quality code and
| risking the business just to validate something one already
| knows seems crazy compared to just asking a focus group.
|
| when you are unsure about the answer, that's when investing
| in ab testing for discovery makes the most sense.
| tomnipotent wrote:
| > ab testing is almost pointless
|
| Except you can never be certain that the changes made were
| impactful in the direction you're hoping unless you measure
| it. Otherwise it's just wishful thinking.
| avereveard wrote:
| I didn't say anything to the contrary, the quotation is
| losing all the context.
|
| but if you want to verify a hypothesis and control for
| confounding factors, the ab test needs to be part of a
| bayesian analysis, and if you're doing that, why also pay
| for the prior research?
|
| by going down the path of user research > production
| quality release > validation of the hypothesis you are
| basically paying for research twice and paying for
| development once regardless of whether the testing is
| successful or not.
|
| it's more efficient to either use bayesian hypotheses + ab
| testing for research (so pay development once per
| hypothesis, collect evidence and steer in the direction
| the evidence points to) or use user research over a set
| of POCs (pay research once per hypothesis, develop in the
| direction that research points to)
|
| if your research needs validation, you paid for research
| you might not need. if you start research knowing the
| prior (the user doesn't see the button) you're not
| actually doing research, you're just gold plating a
| hunch, so why pay for research, just skip to the
| testing phase. if you want to research from the users,
| you do ab testing, but again, not against a hunch, but
| against a set of hypotheses, so you can eliminate
| confounding factors and narrow down the confidence
| interval.
| kevinwang wrote:
| It is surely helpful to have a "mechanism of action" so that
| you're not just blindly AB testing and falling victim to
| coincidences like in https://xkcd.com/882/ .
|
| Not sure if people do this, but with a mechanism of action in
| place you can state a prior belief and turn your AB testing
| results into actual posteriors instead of frequentist metrics
| like p-values which are kind of useless.
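| A hedged sketch of that idea with a Beta-Binomial model (the
| prior strength and the observed counts are made-up numbers):
|
|     import numpy as np
|
|     rng = np.random.default_rng(2)
|     # prior belief: conversion around 10%, worth ~200 users of data
|     prior_a, prior_b = 20, 180
|
|     control = dict(conversions=310, users=3000)   # hypothetical
|     variant = dict(conversions=345, users=3000)   # hypothetical
|
|     def posterior(d):
|         return rng.beta(prior_a + d["conversions"],
|                         prior_b + d["users"] - d["conversions"],
|                         200_000)
|
|     lift = posterior(variant) - posterior(control)
|     print(f"P(variant beats control) = {(lift > 0).mean():.2f}")
|     print("95% credible interval for lift:",
|           np.percentile(lift, [2.5, 97.5]).round(4))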
| datastoat wrote:
| That xkcd comic highlights the problem with observational (as
| opposed to controlled) studies. TFA is about A/B testing,
| i.e. controlled studies. It's the fact that you (the
| investigator) are controlling the treatment assignment that
| allows you to draw causal conclusions. What you happen to
| believe about the mechanism of action doesn't matter, at
| least as far as the outcome of this particular experiment is
| concerned. Of course, your conjectured mechanism of action is
| likely to matter for what you decide to investigate next.
|
| Also, frequentism / Bayesianism is orthogonal to causal /
| correlational interpretations.
| eVoLInTHRo wrote:
| The xkcd comic seems more about the multiple comparisons
| problem (https://en.wikipedia.org/wiki/Multiple_comparisons
| _problem), which could arise in both an observational or
| controlled setting.
| majormajor wrote:
| AB tests are still vulnerable to p-hacking-esque things
| (though usually unintentional). Run enough of them and your
| p value is gonna come up by chance sometimes.
|
| Observational ones are particularly prone because you can
| slice and dice the world into near-infinite observation
| combinations, but people often do that with AB tests too.
| Shotgun approach, test a bunch of approaches until
| something works, but if you'd run each of those tests for
| different significance levels, or for twice as long, or
| half as long, you could very well see the "working" one
| fail and a "failing" one work.
| carlmr wrote:
| I think what kevinwang is getting at, is that if you A/B
| test with a static version A and enough versions of B, at
| some point you will get statistically significant results
| if you repeat it often enough.
|
| Having a control doesn't mean you can't fall victim to
| this.
| ricardobeat wrote:
| You control statistical power and the error rate, and
| choose to accept a % of false results.
| thingification wrote:
| As kevinwang has pointed out in slightly different terms: the
| hypothesis that seems wooly to you seems sharply pointed to
| others (and vice versa) because explanationless hypotheses
| ("changing the colour of the button will help") are easily
| variable (as are the colour of the xkcd jelly beans), while
| hypotheses that are tied strongly to an explanation are not.
| You can test an explanationless hypothesis, but that doesn't
| get you very far, at least in understanding.
|
| As usual here I'm channeling David Deutsch's language and ideas
| on this, I think mostly from The Beginning of Infinity, which
| he delightfully and memorably explains using a different
| context here: https://vid.puffyan.us/watch?v=folTvNDL08A (the
| yt link if you're impatient:
| https://youtu.be/watch?v=folTvNDL08A - the part I'm talking
| about starts at about 9:36, but it's a very tight talk and you
| should start from the beginning).
|
| Incidentally, TED's Chris Anderson said one of these Deutsch
| talks - not sure if this one or the earlier one - was his
| all-time favourite.
|
| plagiarist:
|
| > That doesn't test noticing the button, that tests clicking
| the button. If the color changes it is possible that fewer
| people notice it but are more likely to click in a way that
| increases total traffic.
|
| "Critical rationalists" would first of all say: it does test
| noticing the button, but tests are a shot at refuting the
| theory, here by showing no effect. But also, and less commonly
| understood: even if there is no change in your A/B - an
| apparently successful refutation of the "people will click more
| because they'll notice the colour" theory - experimental tests
| are also fallible, just like everything else.
| alsiola wrote:
| Will watch the TED talk, thanks for sharing. I come at this
| from a medical/epidemiological background prior to building
| software, and no doubt this shapes my view on the language we
| use around experimentation, so it is interesting to hear
| different reasoning.
| [deleted]
| mtlmtlmtlmtl wrote:
| Surprised no one said this yet, so I'll bite the bullet.
|
| I don't think A/B testing is a good idea at all for the long
| term.
|
| Seems like a recipe for having your software slowly evolved into
| a giant heap of dark patterns. When a metric becomes a target, it
| ceases to be a good metric.
| activiation wrote:
| > Seems like a recipe for having your software slowly evolved
| into a giant heap of dark patterns.
|
| Just don't test for dark patterns?
| mtlmtlmtlmtl wrote:
| Well, how does one "just not do" that though, specifically?
| activiation wrote:
| First determine if what you want to test for is a dark
| pattern?
| mtlmtlmtlmtl wrote:
| And how do you determine that? I'm not trying to be coy
| here, I genuinely don't understand.
|
| Because you're not testing for patterns, what you test is
| some measurable metric(s) you want to maximise (or
| minimise), right? So how can you determine which metrics
| lead to dark patterns, without just using them and seeing
| if dark patterns emerge? And how do you spot these dark
| patterns if by their very nature they're undetectable by
| the metrics you chose to test first?
| activiation wrote:
| [flagged]
| mtlmtlmtlmtl wrote:
| Well this discussion isn't helpful at all.
|
| Why reply at all if you're just gonna waste my time?
| activiation wrote:
| [flagged]
| hackernewds wrote:
| Let's ship the project of those that bang the table, and
| confirm our biases instead.
| mtlmtlmtlmtl wrote:
| Please try to be serious and don't put words in my mouth. I'm
| actually trying to learn and have a serious discussion here.
|
| Thanks.
| matheusmoreira wrote:
| I don't think it should even be legal. Why do these
| corporations think they can perform human experimentation on
| unwitting subjects for profit?
| withinboredom wrote:
| More or less, it tells you the "cost" of removing an accidental
| dark pattern. For example we had three plans and a free plan.
| The button for the free plan was under the plans, front-and-
| center ... unless you had a screen/resolution that most of our
| non-devs/designers had.
|
| So, at users' most common resolution, the free-plan button sat
| just below the fold.
|
| This was an accident though some of our users called us out for
| it -- suggesting we'd removed the free plan altogether.
|
| So, we a/b tested moving the button to the top.
|
| Moving the button up would REALLY hurt the bottom line, and the
| old placement explained some growth we'd experienced. Removing
| the "dark pattern" would mean laying off some people.
|
| I think you can guess which option was chosen - and it's still
| in place.
| whimsicalism wrote:
| When an organization has many people, I think that many of
| these are a continuum from accidental to intentional.
| withinboredom wrote:
| When I left that company it had grown massive and the
| product was full of dark patterns... I mean bugs,
| seriously, they were tracked as bugs that no one could fix
| without severe consequences. No one put them there on
| purpose. When you have hundreds of devs working on the same
| dozen files (onboarding/payments/etc) there are bound to be
| bad merges (when a git merge results in valid but incorrect
| code), misunderstanding of requirements, etc.
| cantSpellSober wrote:
| Good multivariate testing and (statistically significant) data
| don't do that. They show lots of ways to improve your UX, and
| whether your guesses at improving UX actually work. Example from
| TFA:
|
| > more people signed up using Google and Github, overall sign-
| ups didn't increase, and nor did activation
|
| Less friction on login for the user, 0 gains in conversions,
| they shipped it anyway. That's not a dark pattern.
|
| If you're _intentionally_ trying to make dark patterns it will
| help with that too I guess; the same way a hammer can build a
| house, or tear it down, depending on use.
| mtlmtlmtlmtl wrote:
| I often see this argument, and although I can happily accept
| the examples given in defence as making sense, I never see an
| argument that this multivariate approach solves the problem
| _in general_ and doesn 't merely ameliorate some of the worst
| cases(I suppose I'm open to the idea that it could at least
| get it from "worse than the disease" to "actually useful in
| moderation").
|
| Fundamentally, if you pick some number of metrics, you're
| always leaving some number of possible metrics "dark", right?
| Is there some objective method of deciding which metrics
| should be chosen, and which shouldn't?
| cantSpellSober wrote:
| "user trust" is a good one, albeit hard to measure
|
| Rolled out some tests to streamline cancelling
| subscriptions in response to user feedback, with
| Marketing's begrudging approval.
|
| Short term, predictably, we saw an increase in
| cancellations, then a decrease and eventual levelling out.
| Long term we continued to see an increase in subscriptions
| after rollout, and focused on more important questions like
| "how do we provide a good product that a user doesn't
| _want_ to cancel?"
| mtlmtlmtlmtl wrote:
| So, it's just a process of trial and error, in terms of
| what metrics to choose and how to weight them?
| 2OEH8eoCRo0 wrote:
| _Every_ engineer? Electrical engineers? Kernel developers?
| Defense workers?
|
| I hesitate to write this (because I don't want to be negative)
| but I get a sense that most software "engineers" have a very
| narrow view of the industry at large. Or this forum leans a
| particular way.
|
| I haven't A/B tested in my last three roles. Two of them were
| defense jobs, my current job deals with the Linux kernel.
| [deleted]
| o1y32 wrote:
| Was going to say the same thing. Lots of articles have
| clickbait titles, but this one is especially bad. Even among
| software engineers, only a small percentage will ever do any
| A/B testing, not to mention that often "scientists" or other
| roles are in charge of designing, running and analyzing A/B
| test experiments.
| chefandy wrote:
| I used to get knots in my hair about these distinctions, but in
| retrospect, I was just being pedantic. It's a headline-- not a
| synopsis or formal tagging system. Context makes it perfectly
| clear to most in a web-focused software industry crowd which
| "engineers" might be doing a/b testing. Also, my last three
| jobs haven't included a lot of stuff I read about here; why
| should that affect the headline?
| jldugger wrote:
| > Two of them were defense jobs, my current job deals with the
| Linux kernel.
|
| I don't work on the kernel, but one of the most professionally
| useful talks about the Linux kernel was an engineer talking
| about how to use statistical tests on perf related changes with
| small effects[1]. It's not an _online_ A/B technique but
| sometimes you pay attention to how other fields approach things
| in order to learn how to improve your own field.
|
| [1]: https://lca2021.linux.org.au/schedule/presentation/31/
| withinboredom wrote:
| I built an internal a/b testing platform with a team of 3-5 over
| the years. It needed to handle extreme load (hundreds of millions
| of participants in some cases). Our team also had a sister team
| responsible for teaching/educating teams about how to do proper
| a/b testing -- they also reviewed implementations/results on-
| demand.
|
| Most of the a/b tests they reviewed (note the survivorship bias
| here, they were reviewed because they were surprising results)
| were incorrectly implemented and had to be redone. Most companies
| I worked at before or since did NOT have a team like this, and
| blindly trusted the results without hunting for biases, incorrect
| implementations, bugs, or other issues.
| srveale wrote:
| Do you know if there were common mistakes for the incorrect
| implementations? Were they simple mistakes or more because
| someone misunderstood a nuance of stats?
| withinboredom wrote:
| I don't remember many specifics, but IIRC, most of the
| implementation related ones were due to an anti-pattern from
| the older a/b testing framework. Basically, the client would
| try and determine if the user was eligible to be in the A/B
| test (instead of relying on the framework), then in an API
| handler, get the user's assignment. This would mean the UI
| would think the user wasn't in the A/B test at all, while the
| API would see the user as in the A/B test. In this case, the
| user would be experiencing the 'control' while the framework
| thought they were experiencing something else.
|
| That was a big one for awhile, and it would skew results.
|
| Hmmm, another common one was doing geographic experiments
| when part of the experiment couldn't be geofenced for
| technological reasons. Or forgetting that a user could leave
| a geofence and removing access to the feature after they'd
| already been given access to it.
|
| Almost all cases boiled down to showing the user one thing
| while thinking we were showing them something else.
| srveale wrote:
| I wonder if that falls under mistake #4 from the article,
| or if there's another category of mistake: "Actually test
| what you think you're testing." Seems simple but with a big
| project I could see that being the hardest part.
| withinboredom wrote:
| I actually just read it (as best I could, the page is
| really janky on my device). I didn't see this mistake in
| there, and it was the most common one we saw by a wide
| margin in the beginning.
|
| Number 2 (1 in the article) was solved by the platform.
| We had two activation points for UI experiments. The
| first was getting the user's assignment (which could be
| cached for offline usage). At that point they became part
| of the test, but there was a secondary one that happened
| when the component under test became visible (whether it
| was a page view or a button). If you turned on this
| feature for the test, you could analyze it using the
| first or secondary points.
|
| One issue we saw with that (which is potentially specific
| to this implementation) was people forgetting to fire
| the secondary for the control. That was pretty common but
| you usually figured that out within a few hours when you
| got an alert that your distribution looked biased (if you
| specify a 10:20 split, you should get a 10:20 ratio of
| activity).
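| For reference, that kind of "distribution looks biased" alert
| can be a simple chi-square check for sample ratio mismatch
| (the counts and the 10:20 split here are illustrative):
|
|     from scipy.stats import chisquare
|
|     observed = [9_400, 20_600]      # users seen per variant
|     weights = [10, 20]              # configured split
|     total = sum(observed)
|     expected = [total * w / sum(weights) for w in weights]
|
|     stat, p = chisquare(observed, f_exp=expected)
|     if p < 0.001:                   # tiny p => likely a bad split
|         print(f"possible sample ratio mismatch (p = {p:.2g})")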
| indymike wrote:
| > It needed to handle extreme load (hundreds of millions of
| participants in some cases).
|
| I can see extreme loads being valuable for an A/B test of a
| pipeline change or something that needs that load... but for
| the kinds of A/B testing UX and marketing does, leveraging
| statistical significance seems to be a smart move. There is a
| point past which a larger sample is only trivially more accurate
| than a smaller one.
|
| https://en.wikipedia.org/wiki/Sample_size_determination
| withinboredom wrote:
| Even if you're testing 1% of 5 million visitors, you still
| need to handle the load for 5 million visitors. Most of the
| heavy experiments came from AI-driven assignments (vs.
| behavioral). In this case the AI would generate very fine-
| grained buckets and assign users into them as needed.
| rockostrich wrote:
| Same experience here for the most part. We're working on
| migrating away from an internal tool which has a lot of
| problems: flags can change in the middle of user sessions,
| limited targeting criteria, changes to flags require changes to
| code, no distinction between feature flags and experiments,
| experiments often target populations that vary greatly,
| experiments are "running" for months and in some cases years...
|
| Our approach to fixing these problems starts with having a
| golden path for running an experiment which essentially fits
| the OP. It's still going to take some work to educate everyone
| but the whole "golden path" culture makes it easier.
| withinboredom wrote:
| When we started working on the internal platform, this was
| exactly the problems we had. When we were finally deleting
| the old code, we found a couple of experiments that had been
| running for nearly half a decade.
|
| For giggles, we ran an analysis on those experiments: no
| difference between a & b.
|
| That's usually the best result you can get, honestly. It
| means you get to make a decision of whether to go with a or
| b. You can pick the one you like better.
| travisjungroth wrote:
| That's a great outcome for a do-no-harm test. Horrible
| outcome when you're expecting a positive effect.
| withinboredom wrote:
| It's an experiment, you shouldn't be "expecting"
| anything. You hypothesize an effect, but that doesn't
| mean it will be there and if you prove it wrong, you
| continue to iterate.
| travisjungroth wrote:
| > you shouldn't be "expecting" anything
|
| This is the biggest lie in experimentation. Of course you
| expect something. Why are you running this test over all
| other tests?
|
| What I'm challenging is that if a team has spent three
| months building a feature, you a/b test it and find no
| effect, that is not a good outcome. Having a tie where
| you get to choose anything is worse than having a winner
| that forces your hand. At least you have the option to
| improve your product.
| donretag wrote:
| If anyone from posthog is reading this, please fix your RSS feed.
| The link actually points back to the blog homepage.
| corywatilo wrote:
| Will take a look, thanks for the heads up!
| alberth wrote:
| Enough traffic.
|
| Isn't the biggest problem with A/B testing that very few web
| sites even have enough traffic to properly measure statistical
| differences?
|
| Essentially making A/B testing for 99.9% of websites useless.
| hanezz wrote:
| This. Ron Kohavi 1) has some excellent resources on this 2).
| There is a lot of noise in data that is very often
| misattributed to 'findings' in the context of A/B testing.
| Replication of A/B tests should be much more common in the CRO
| industry, it can lead to surprising yet sobering insights into
| real effects.
|
| 1) https://experimentguide.com/ 2)
| https://bit.ly/ABTestingIntuitionBusters
| xp84 wrote:
| I have worked for some pretty arrogant business types who
| fancied themselves "data driven" but actually knew nothing
| about statistics. What that actually meant was they forced us
| to run AB tests for every change, and when the tests nearly
| always showed no particular statistical significance, they
| would accept the insignificant results if it supported their
| agenda, or if the insignificant results were against their
| desired outcome, they would run the test longer until it
| happened to flop the other way. The whole thing was such a joke.
| You definitely need some very smart math people to do this in a
| way that isn't pointless.
| Retric wrote:
| A/B testing works fine even at a hundred users per day. More
| visitors means you can run more tests and notice smaller
| differences, but that's also a lot of work which smaller sites
| don't really justify.
| wasmitnetzen wrote:
| Posthog is on developerdans "Ads & Tracking" blocklist[1], if
| you're wondering why this doesn't load.
|
| [1]:
| https://github.com/lightswitch05/hosts/blob/master/docs/list...
| RobotToaster wrote:
| Just noticed that myself. It's also in the Adguard DNS list.
| masswerk wrote:
| Ad 7)
|
| > Good hypothesis: User research showed that users are unsure of
| how to proceed to the checkout page. Changing the button's color
| will lead to more users noticing it (...)
|
| Mind that you have to prove first that this premise is
| actually true. Your user research is probably exploratory,
| qualitative data based on a small sample. At this point, it's
| rather an assumption. You have to transform and test this (by
| quantitative means) for validity and significance. Only then can
| you proceed to the button-hypothesis. Otherwise, you are still
| testing multiple things at once, based on an unclear hypothesis,
| while merely assuming that part of this hypothesis is actually
| valid.
| arrrg wrote:
| In practice you often cannot test that in a quantitative way.
| Especially since it's about a state of mind.
|
| However, you should not dismiss qualitative results out of
| hand.
|
| If you do usability testing of the checkout flow with five
| participants and three actually verbalize the hypothesis during
| checkout ("hm, I'm not sure how to get to the next step here",
| "I don't see the button to continue", after searching for 30s:
| "ah, there it is!" - after all of which a good moderator would
| also ask follow up questions to better understand why they
| think it was hard for them to find the way to the next step and
| what their expectations were) then that's plenty of evidence
| for the first part of the hypothesis, allowing you to move on
| to testing the second part. It would be madness to
| quantitatively verify the first part. A total waste of
| resources.
|
| To be honest: with (hypothetical) evidence as clear as that
| from user research I would probably skip the A/B testing and go
| straight to implementing a solution if the problem is obvious
| enough and there are best practice examples. Only if designers
| are unsure about whether their proposed solution to the problem
| actually works would I consider testing that.
|
| Also: quantitative studies are not the savior you want them to
| be. Especially if it's about details in the perception of users
| ... and that's coming from me, a user researcher who loves to
| do quantitative product evaluation and isn't even all that firm
| in all qualitative methods.
| masswerk wrote:
| You really have to be able to build samples based on the
| first part of the hypothesis: you should test 4 groups for a
| crosstab. (Also, homogeneity may be an issue.) Transitioning
| from qualitative to quantitative methods is really the tricky
| part in social research.
|
| Mind that 3/5 doesn't meet the criteria of a binary test. In
| statistical terms, you know nothing; this is still random.
| Moreover, even if metrics are suggesting that some users are
| spending considerable time, you still don't know why: it's
| still an assumption based on a negligible sample. So, the
| first question should be really, how do I operationalize the
| variable "user is disoriented", and, what does this exactly
| mean. (Otherwise, you're in for spurious correlation of all
| sorts. I.e. you still don't know _why_ some users display
| disorientation and others don't. Instead of addressing the
| underlying issue, you rather fix this by an obtrusive button
| design, which may have negative impact on the other group.)
| arrrg wrote:
| I think you are really missing the forest for the trees.
|
| Everything you say is completely correct. But useful? Or
| worthwhile? Or even efficient?
|
| The goal is not to find out why some users are disoriented
| and some are not. Well, I guess indirectly it is. But
| getting there with rigor is a nightmare and to my mind not
| worthwhile in most cases. The hypothesis developed from the
| usability test would be "some users are disoriented during
| checkout". That to me would be enough evidence to actually
| tackle that problem of disorientation, especially since to
| me 3/5 would indicate a relatively strong signal (not in
| terms of telling me the percentage of users affected by
| this problem, just that it's likely the problem affects
| more than just a couple people).
|
| The more mysterious question to me would actually be
| whether that disorientation also leads to people not
| ordering. Which is a plausible assumption - but not
| trivially answerable for sure. (Usability testing can
| provide some hints toward answering that question - but
| task based usability testing is always a bit artificial in
| its setup.)
|
| Operationalizing "user is disoriented" is a nightmare and
| not something I would recommend at all (at least not as the
| first step) if you are reasonably sure that disorientation
| is a problem (because some users mention it during
| usability testing) and you can offer plausible solutions (a
| new design based on best practices and what users told you
| they think makes them feel disoriented).
|
| Operationalizing something like disorientation is much more
| fraught with danger (and just operationalizing it in the
| completely wrong way without even knowing) than identifying
| a problem and based on reasonableness arguments
| implementing a potential solution and seeing whether the
| desired metric improves.
|
| I agree that it would be an awesome research project to
| actually operationalize disorientation. But worthwhile when
| supporting actual product teams? Doubtful ...
| masswerk wrote:
| > The more mysterious question to me would actually be
| whether that disorientation also leads to people not
| ordering.
|
| This is actually the crucial question. The disorientation
| is an indication for a fundamental mismatch between the
| internal model established by the user and the
| presentation. It may be an issue of the flow of
| information (users are hesitant, because they realize at
| this point that this is not about what they thought it
| may be) or on the usability/design side of things (user
| has established an operational model, but this is not how
| it operates). Either way, there's a considerable
| dissonance, which will probably hurt the product: your
| presentation does not work for the user, you're not
| communicating on the same level, and this will be
| probably perceived as an issue of quality or even fitness
| for the purpose, maybe even intrigue. (Shouting at the
| user may provide a superficial fix, but will not address
| the potential damage.) - Which leads to the question:
| what is the actual problem and what caused it in the
| first place? (I'd argue, any serious attempt to
| operationalize this variable will inevitably lead you
| towards this much more serious issue. Operationalization
| is difficult for a reason. If you want to have a
| controlled experiment, you must control all your
| variables - and the attempt to do so may hint at
| deeper issues.)
|
| BTW, there's also a potential danger in just taking the
| articulations of user dislike at face value: a classic
| trope in TV media research was audiences criticizing the
| outfit of the presenter, while the real issue was a
| dissonance/mismatch in audio and visual presentation. Not
| that users could pinpoint this, hence, they would rather
| blame how the anchor was dressed...
| arrrg wrote:
| What you say is all true and I actually completely agree
| with you (and like how you articulate those points -
| great to read it distilled that way) but at the same time
| probably not a good idea at all to do in most
| circumstances.
|
| It is alright to decide that in certain cases you can act
| with imperfect information.
|
| But to be clear, I actually think there may be situations
| where pouring a lot of effort into really understanding
| confusion is confusion. It's just very context dependent.
| (And I think you consistently underrate the progress you
| can make in understanding confusion or any other thing
| impacting conversion and use by using qualitative
| methods.)
| masswerk wrote:
| Regarding underestimating qualitative methods: I'm
| actually all for them. It may turn out, it's all you
| need. (Maybe, a quantitative test will be required to
| prove your point, but it will probably not contribute
| much to a solution.) It's really that I think that A/B
| testing is somewhat overrated. (Especially, since you
| will probably not really know what you're actually
| measuring without appropriate preparation, which will
| provide the heavy lifting already. A/B testing should
| really be just about whether you can generalize on a
| solution and the assumptions behind this or not. Using
| this as a tool for optimization, on the other hand, may
| be rather dangerous, as it doesn't suggest any relations
| between your various variables, or the various layers of
| fixes you apply.)
| ssharp wrote:
| This is adding considerable effort and weight to the
| process.
|
| The alternative is just running the experiment, which would
| take 10 minutes to set up, and see the results. The A/B
| test will help measure the qualitative finding in
| quantitative terms. It's not perfect but it is practical.
| londons_explore wrote:
| I want an A/B test framework that automatically optimizes the
| size of the groups to maximize revenue.
|
| At first, it would pick say a 50/50 split. Then as data rolls in
| that shows group A is more likely to convert, shift more users
| over to group A. Keep a few users on B to keep gathering data.
| Eventually, when enough data has come in, it might turn out that
| flow A doesn't work at all for users in France - so the ideal
| would be for most users in France to end up in group B, whereas
| the rest of the world is in group A.
|
| I want the framework to do all this behind the scenes - and
| preferably with statistical rigorousness. And then to tell me
| which groups have diminished to near zero (allowing me to remove
| the associated code).
| quadrature wrote:
| Curious about your use case.
|
| Is the idea that you want to optimize the conversion and then
| remove the experiment code, keeping the winning variant?
|
| Or would you prefer to keep the code in and have it
| continuously optimize variants?
| joseda-hg wrote:
| There might be a particular situation where B is more
| effective than A, and therefore should be kept, if only for
| that specific situation. There might be a cutoff point where
| maintaining B would cost more than it's worth, but that's a
| parameter you will have to determine for each test.
| londons_explore wrote:
| I'd expect to be running tens of experiments at any one time.
| Some of those experiments might be variations in wording or
| colorschemes - others might be entirely different signup
| flows.
|
| I'd let the experiment framework decide (ie. optimize) who
| gets shown what.
|
| Over time, the maintenance burden of tens of experiments (and
| every possible user being in any combination of experiments)
| would exceed the benefits, so then I'd want to end some
| experiments, keeping just whatever variant performs best. And
| I'd be making new experiments with new ideas.
| [deleted]
| jdwyah wrote:
| Get yourself a multi arm bandit and some Thompson sampling
| https://engineering.ezcater.com/multi-armed-bandit-experimen...
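| A minimal Thompson sampling sketch for two variants
| (Beta-Bernoulli); the "true" rates below only exist to simulate
| traffic, real code would record actual conversions per arm:
|
|     import numpy as np
|
|     rng = np.random.default_rng(3)
|     true_rates = {"A": 0.10, "B": 0.12}   # unknown in real life
|     wins = {"A": 0, "B": 0}
|     trials = {"A": 0, "B": 0}
|
|     for _ in range(10_000):               # one loop per visitor
|         # sample a plausible rate per arm, show the best one
|         sampled = {k: rng.beta(1 + wins[k],
|                                1 + trials[k] - wins[k])
|                    for k in wins}
|         arm = max(sampled, key=sampled.get)
|         trials[arm] += 1
|         wins[arm] += rng.random() < true_rates[arm]
|
|     print(trials)   # traffic drifts toward the better arm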
| PheonixPharts wrote:
| As others have mentioned, you're referring to Thompson sampling
| and plenty of testing providers offer this (and if you have any
| DS on staff, they'll be more than happy to implement it).
|
| My experience is that there's a good reason why this hasn't
| taken off: the returns for this degree of optimization are far
| lower than you think.
|
| I once worked with a very eager but junior DS who thought that
| we should build out a massive internal framework for doing
| this. He didn't quite understand the math behind it, so I built
| him a demo to understand the basics. What we realized in
| running the demo under various conditions is that the total
| return on adding this complexity to optimization was negligible
| at the scale we were operating at and required much more
| complexity than our current set up.
|
| This pattern repeats in a lot of DS related optimization in my
| experience. The difference between a close guess and perfectly
| optimal is often surprisingly little. Many DS teams perform
| optimizations on business processes that yield a lower
| improvement in revenue than the salary of the DS that built it.
| brookst wrote:
| Small nit: it's a bad idea if NPV of future returns is less
| than the cost. If someone making $100k/yr can produce one
| $50k/yr optimization that NPV's out to $260k, it's worth it.
| I suspect you meant that, just a battle I have at work a lot
| with people who only look at single-year returns.
| tomfutur wrote:
| Besides complexity, a price you pay with multi-armed bandits
| is that you learn less about the non-optimal options (because
| as your confidence grows that an option is not the best, you
| run fewer samples through it). It turns out the people
| running these experiments are often not satisfied to learn "A
| is better than B." They want to know "A is 7% better than B,"
| but a MAB system will only run enough B samples to make the
| first statement.
| withinboredom wrote:
| Sadly, there is an issue with the Novelty Effect[1]. If you
| push traffic to the current winner, it probably won't validate
| that it's the actual winner. So you may trade more conversions
| now, for a higher churn than you can tolerate later.
|
| For example, you run two campaigns:
|
| 1. Get my widgets, one year only 19.99!
|
| 2. Get my widgets, first year only 19.99!
|
| The first one may win, but they all cancel at the second year
| because they thought it was only for one year. They all leave
| reviews complaining that you scammed them.
|
| So, I would venture that this idea is a bad one, but sounds
| good on paper.
|
| [1]: https://medium.com/geekculture/the-novelty-effect-an-
| importa...
|
| PS. A/B tests don't just provide you with evidence that one
| solution might be better than the other, they also provide some
| protection in that a number of participants will get the
| status-quo.
| travisjungroth wrote:
| > So, I would venture that this idea is a bad one, but sounds
| good on paper.
|
| It's a great idea, it's just vulnerable to non-stationary
| effects (novelty effect, seasonality, etc). But it's actually
| no worse than fixed time horizon testing for your example if
| you run the test less than a year. You A/B test that copy for
| a month, push everyone to A, and you're still not going to
| realize it's actually worse.
| withinboredom wrote:
| Yeah. If churn is part of the experiment, then even after
| you stop the a/b test for treatment, you may have to wait
| at least a year before you have the final results.
| iLoveOncall wrote:
| That sounds like you don't want A/B testing at all.
| londons_explore wrote:
| Indeed - I really want A/B testing combined with conversion
| optimization.
| hotstickyballs wrote:
| Exactly. That just sounds like a Bayesian update
| NavinF wrote:
| It's https://en.wikipedia.org/wiki/Multi-armed_bandit
| bobsmooth wrote:
| Why can't users just tell me what works!
| jedberg wrote:
| The biggest mistake engineers make about A/B testing is not
| recognizing local maxima. Your test may be super successful, but
| there may be an even better solution that's significantly
| different than what you've arrived at.
|
| It's important to not only A/B test minor changes, but
| occasionally throw in some major changes to see if it moves the
| same metric, possibly even more than your existing success.
| jldugger wrote:
| > 6. Not accounting for seasonality
|
| Doesn't the online nature of an A/B automatically account for
| this?
| throwaway084t95 wrote:
| That's not Simpson's Paradox. Simpson's Paradox is when the
| aggregate winner is different from the winner in each element of
| a partition, not just some of them
| hammock wrote:
| What it is is confounding
| robertlacok wrote:
| Exactly.
|
| On that topic - what do you do when you observe that in your
| test results? What's the right way to interpret the data?
| throwaway084t95 wrote:
| Let's consider an example that would be a case of Simpson's
| Paradox. Suppose you are A/B testing two different landing
| pages, and you want to know which will make more people
| become habitual users. You partition on whether the user adds
| at least one friend in their first 5 minutes on the platform.
| It might be that landing page A makes people who add a friend
| in the first 5 minutes more likely to become habitual users,
| and it also makes people who don't add a friend in the first
| 5 minutes more likely to become habitual users. But page A
| makes people less likely to add a friend in the first 5
| minutes, and people who add a friend in the first 5 minutes
| are overwhelmingly more likely to become habitual users than
| people who don't. So, in this case at least, it seems like
| the aggregate statistics are most relevant, but the fact that
| page A is bad mainly because it makes people less likely to
| add a friend in the first 5 minutes is also very interesting;
| maybe there is some way of combining A and B to get the good
| qualities of each and avoid the bad qualities of both
| contravariant wrote:
| It can only happen with unequal populations. If you decide to
| include people in the control or test group randomly you're
| fine (you can use statistical tests to rule out sample bias).
| ssharp wrote:
| With random bucketing happening at the global level for any
| test, the proper thing to do is to take any segments that
| show interesting (and hopefully statistically significant)
| results that differ from the global results and test those
| segments individually so the random bucketing happens at that
| segment level.
|
| There are two issues at play here -- one is that the sample
| sizes for the segments may not be high enough, the other is
| that the more segments you look at, the greater the
| probability for finding a false positive.
| hnhg wrote:
| Also, doesn't their suggested approach amount to multiple
| testing? In other words, a kind of p-hacking:
| https://en.wikipedia.org/wiki/Multiple_comparisons_problem
|
| Edit - and this:
| http://www.stat.columbia.edu/~gelman/research/unpublished/p_...
| throwaway202351 wrote:
| Yeah, a good AB testing framework would either refuse to let
| you break things down too much or have a large warning about
| the results not being significant, but that doesn't always
| stop the business-types from trying to wiggle in some way for
| them to show a win.
| Lior539 wrote:
| I'm the author of this blog. Thank you for calling this out!
| I'll update the example to fix this :)
| hammock wrote:
| The example is fine, you are calling out confounding
| variables. Just call it confounders, instead of Simpson's
| paradox.
| keithwinstein wrote:
| FWIW, the arithmetic in that example also has a glitch. For
| the "Mobile" "Control" case, 100/3000 is about 3% rather than
| 10%.
| jameshart wrote:
| Yes, I don't think it's possible to observe a simpson's paradox
| in a simple conversion test, either.
|
| Simpson's paradox is about spurious correlations between
| variables - conversion analysis is pure Bayesian probability.
|
| It shouldn't be possible to have a group as a whole increase
| its probability to convert, while having every subgroup
| decrease its probability to convert - the aggregate has to be
| an average of the subgroup changes.
| HWR_14 wrote:
| Simpson's paradox is sometimes about spurious correlations,
| but the original paradox Simpson wrote about was simply a
| binary question with 84 subgroups, where 3 or 4 subgroups
| with the outlying answer just had a significant enough amount
| of all samples, and a significant enough effect, to mutate
| the whole.
| BoppreH wrote:
| Are you sure?
|
| Consider the case where iOS users are more likely to convert
| than Android users, but you currently have very few iOS
| users. You then A/B test a new design that imitates iOS, but
| has awful copy. Both iOS and Android users are less likely to
| convert, but it attracts more iOS users.
|
| The group as a whole has higher conversion because of the
| demographic shift, but every subgroup has less.
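| Toy numbers for that scenario (all made up), where the aggregate
| rate rises while both subgroups fall, purely from the mix shift:
|
|     # (users, conversions) per platform, per design
|     old = {"ios": (50, 10), "android": (950, 57)}
|     new = {"ios": (400, 72), "android": (600, 33)}
|
|     def rate(users, conversions):
|         return conversions / users
|
|     for platform in ("ios", "android"):
|         print(platform,
|               rate(*old[platform]),    # old design
|               rate(*new[platform]))    # new design: lower on both
|
|     for name, d in (("old", old), ("new", new)):
|         users = sum(u for u, _ in d.values())
|         conversions = sum(c for _, c in d.values())
|         print("overall", name, rate(users, conversions))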
| whimsicalism wrote:
| I don't follow. If one bucket has many more iOS users, it
| seems like you have done a bad job randomizing your
| treatment?
| BoppreH wrote:
| It could be self-selection happening after you randomized
| the groups. For example a desktop landing page
| advertising an app, which might be installed on either
| mobile operating system.
___________________________________________________________________