[HN Gopher] Annoying A/B testing mistakes
       ___________________________________________________________________
        
       Annoying A/B testing mistakes
        
       Author : Twixes
       Score  : 251 points
       Date   : 2023-06-16 10:30 UTC (12 hours ago)
        
 (HTM) web link (posthog.com)
 (TXT) w3m dump (posthog.com)
        
       | methou wrote:
        | Probably off-topic, but how do you opt out of most A/B testing?
        
       | mabbo wrote:
       | > The solution is to use an A/B test running time calculator to
       | determine if you have the required statistical power to run your
       | experiment and for how long you should run your experiment.
       | 
       | Wouldn't it be better to have an A/B testing system that just
        | counts how many users have been in each assignment group and ends
       | when you have the required statistical power?
       | 
        | Time just seems like a stand-in for "that should be enough", when
        | in reality the number of users who get exposed might differ from
        | your expectations.
        
         | aliceryhl wrote:
         | Running the experiment until you have a specific pre-determined
         | number of observations is okay.
         | 
         | However, the deceptively similar scheme of running it until the
          | results are statistically significant is not okay!
        
           | mreezie wrote:
           | If you want statistical significance of 1/20 and you check 20
           | times... you are likely to find it.
        
       | dbroockman wrote:
       | Another one: don't program your own AB testing framework! Every
       | time I've seen engineers try to build this on their own, it fails
       | an AA test (where both versions are the same so there should be
       | no difference). Common reasons are overly complicated
       | randomization schemes (keep it simple!) and differences in load
       | times between test and control.
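        | 
        | For context on the "keep it simple" point: correct randomization
        | is just deterministic hashing. A minimal sketch (not any
        | particular vendor's implementation) of stateless bucketing:
        | 
        |   import hashlib
        |   
        |   def assign_variant(experiment_key, user_id,
        |                      variants=("control", "test"),
        |                      weights=(0.5, 0.5)):
        |       # Hash (experiment, user) to a uniform number in [0, 1),
        |       # then walk the cumulative weights. The same user always
        |       # lands in the same variant for a given experiment.
        |       digest = hashlib.sha256(
        |           f"{experiment_key}:{user_id}".encode()).hexdigest()
        |       bucket = int(digest[:15], 16) / 16**15
        |       cumulative = 0.0
        |       for variant, weight in zip(variants, weights):
        |           cumulative += weight
        |           if bucket < cumulative:
        |               return variant
        |       return variants[-1]
        |   
        |   print(assign_variant("new-checkout-button", "user-42"))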
        
         | giraffe_lady wrote:
          | I don't keep up with it that closely, but it seems like the
          | ecosystem has kind of collapsed in the last few years? You have
          | Optimizely and its competitors that are fully focused on huge
          | enterprise with "call us" pricing right out of the gate. VWO has
          | a clunky & aged tech stack that was already causing problems
          | when I used it a couple of years ago and seems unchanged since
          | then.
         | 
         | If you're a medium-small business I see why you'd be tempted to
         | roll your own. Trustworthy options under $15k/year are not
         | apparent.
        
         | HWR_14 wrote:
         | Shouldn't AA tests fail a certain percentage of the time?
         | Typically, 5% of the time?
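          | 
          | From a toy simulation (made-up traffic numbers, a single
          | pre-planned look, pooled z-test), that seems right - roughly
          | alpha of A/A runs come up "significant" by chance:
          | 
          |   import numpy as np
          |   from scipy.stats import norm
          |   
          |   rng = np.random.default_rng(1)
          |   
          |   def aa_false_positive_rate(sims=5000, n=20_000, p=0.10,
          |                              alpha=0.05):
          |       # One final look per A/A run; the flagged fraction
          |       # should land near alpha if randomization is sound.
          |       z_crit = norm.ppf(1 - alpha / 2)
          |       hits = 0
          |       for _ in range(sims):
          |           ca, cb = rng.binomial(n, p), rng.binomial(n, p)
          |           pooled = (ca + cb) / (2 * n)
          |           se = np.sqrt(pooled * (1 - pooled) * 2 / n)
          |           hits += abs(ca - cb) / n / se > z_crit
          |       return hits / sims
          |   
          |   print(aa_false_positive_rate())  # ~0.05, as expected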
        
       | rmetzler wrote:
        | If I read the first mistake correctly, then getFeatureFlag() has
        | the side effect of counting how often it was called, and uses
        | this to calculate the outcome of the experiment? Wow. I don't
        | know what to say...
        
         | willsmith72 wrote:
         | Yep gross...
        
         | xp84 wrote:
          | That's how every one of these tools works; that's the whole
          | point of using them: you only call them when you're going to
          | actually show the variation to the user. If you're running a
         | test that modifies the homepage only, you shouldn't be calling
         | that decision method in, say, your global navigation code that
         | you show everyone. Or, for instance, if your test only affects
         | how the header looks for subscribers, you have to put an outer
         | if statement "if subscriber" before the "if test variation."
         | How else would it correctly know exactly who saw the test?
        
         | pjm331 wrote:
         | This is indeed the case. Have run into a few surprising things
         | like this when implementing posthog experiments recently
        
         | sometimes_all wrote:
          | Yeah, I felt that way too. Initially I wasn't sure what I was
          | missing, since the only difference is that the order of the
          | checks is switched, and the function will still return the same
          | true/false in both cases. Then I thought about side effects and
          | it felt icky.
        
         | alsiola wrote:
         | Writing an article about developer mistakes is easier than
         | redesigning your rubbish API though.
        
         | dyeje wrote:
         | When you call the feature flag, it's going to put the user into
         | one of the groups. The article is saying you don't want to add
         | irrelevant users (in the example, ones that had already done
         | the action they were testing) because it's going to skew your
         | results.
        
           | willsmith72 wrote:
            | The point is that, from an API design perspective, something
            | like
           | 
           | "posthog.getFeatureFlag('experiment-key')"
           | 
           | doesn't look like it's actually performing a mutation.
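            | 
            | A sketch of the kind of naming that would make it obvious
            | (purely hypothetical wrapper and stub, not PostHog's actual
            | API) - put the side effect in the method name and offer a
            | side-effect-free peek:
            | 
            |   import random
            |   
            |   class StubFlagClient:
            |       # Stand-in for a real feature-flag backend.
            |       def __init__(self):
            |           self.exposures = []
            |       def evaluate(self, key, user_id):
            |           random.seed(f"{key}:{user_id}")  # stable per user
            |           return "test" if random.random() < 0.5 else "control"
            |       def record_exposure(self, key, user_id, variant):
            |           self.exposures.append((key, user_id, variant))
            |   
            |   class Experiments:
            |       def __init__(self, client):
            |           self.client = client
            |       def peek_variant(self, key, user_id):
            |           # Evaluate only; nothing is recorded.
            |           return self.client.evaluate(key, user_id)
            |       def enroll_and_get_variant(self, key, user_id):
            |           # Evaluate AND enroll the user in the analysis.
            |           variant = self.client.evaluate(key, user_id)
            |           self.client.record_exposure(key, user_id, variant)
            |           return variant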
        
       | kimukasetsu wrote:
       | The biggest mistake engineers make is determining sample sizes.
       | It is not trivial to determine the sample size for a trial
       | without prior knowledge of effect sizes. Instead of waiting for a
       | fixed sample size, I would recommend using a sequential testing
       | framework: set a stopping condition and perform a test for each
       | new batch of sample units.
       | 
        | This is called optional stopping, and it is not valid with a
        | classic t-test, since its Type I and II error guarantees only
        | hold at a predetermined sample size. However, other tests make it
        | possible:
       | see safe anytime-valid statistics [1, 2] or, simply, bayesian
       | testing [3, 4].
       | 
       | [1] https://arxiv.org/abs/2210.01948
       | 
       | [2] https://arxiv.org/abs/2011.03567
       | 
       | [3] https://pubmed.ncbi.nlm.nih.gov/24659049/
       | 
       | [4]
       | http://doingbayesiandataanalysis.blogspot.com/2013/11/option...
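        | 
        | For the Bayesian flavour [3, 4], a minimal sketch (toy numbers,
        | Beta(1,1) priors, a fixed decision threshold) of a batch-by-batch
        | stopping rule:
        | 
        |   import numpy as np
        |   
        |   rng = np.random.default_rng(2)
        |   
        |   def prob_b_beats_a(ca, na, cb, nb, draws=100_000):
        |       # Posterior P(rate_B > rate_A) under Beta(1, 1) priors.
        |       a = rng.beta(1 + ca, 1 + na - ca, draws)
        |       b = rng.beta(1 + cb, 1 + nb - cb, draws)
        |       return (b > a).mean()
        |   
        |   # Sequential use: after each batch, stop once the posterior is
        |   # decisive either way (true rates here are made up).
        |   ca = na = cb = nb = 0
        |   for batch in range(1, 201):
        |       na += 1000
        |       ca += rng.binomial(1000, 0.100)
        |       nb += 1000
        |       cb += rng.binomial(1000, 0.105)
        |       p = prob_b_beats_a(ca, na, cb, nb)
        |       if p > 0.99 or p < 0.01:
        |           break
        |   print(f"stopped after batch {batch}: P(B > A) = {p:.3f}")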
        
         | travisjungroth wrote:
         | People often don't determine sample sizes at all! And doing
         | power calculations without an idea of effect size isn't just
         | hard but impossible. It's one of the inputs to the formula. But
         | at least it's fast so you can sort of guess and check.
         | 
          | Anytime-valid inference helps with this situation, but it
          | doesn't solve it. If you're trying to detect a small effect, it
          | would be nicer to figure out up front that you need a million
          | samples, versus learning it because your test, at 1,000 samples
          | a day, took three years.
         | 
         | Still, anytime is way better than fixed IMO. Fixed almost never
         | really exists. Every A/B testing platform I've seen allows
         | peeking.
         | 
         | I work with the author of the second paper you listed. The math
         | looks advanced, but it's very easy to implement.
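          | 
          | For reference, the guess-and-check is cheap because the formula
          | is tiny. A sketch of the standard normal-approximation sample
          | size for a two-proportion test (the baseline rate and minimum
          | detectable effect are the guesses you have to supply):
          | 
          |   from math import ceil, sqrt
          |   from scipy.stats import norm
          |   
          |   def samples_per_variant(baseline, mde, alpha=0.05, power=0.8):
          |       # Users needed per arm to detect an absolute lift `mde`
          |       # over `baseline`, two-sided test.
          |       p1, p2 = baseline, baseline + mde
          |       pbar = (p1 + p2) / 2
          |       za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
          |       n = (za * sqrt(2 * pbar * (1 - pbar))
          |            + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
          |       return ceil(n / mde ** 2)
          |   
          |   print(samples_per_variant(0.05, 0.01))   # roughly 8,000/arm
          |   print(samples_per_variant(0.05, 0.001))  # roughly 750,000/arm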
        
         | hackernewds wrote:
          | The biggest mistake is engineers owning experimentation. It
          | should be owned by data scientists.
          | 
          | I realize that's a luxury, but I also see this trend in
          | blue-chip companies.
        
           | pbae wrote:
           | Did a data scientist write this? You don't need to be a
           | member of a priesthood to run experiments. You just need to
           | know what you're doing.
        
             | playingalong wrote:
             | ... and by some definition you'd be a data scientist
             | yourself. (Regardless of your job title)
        
             | bonniemuffin wrote:
             | I agree with both sides here. :) DS should own
             | experimentation, AND engineers should be able to run a
             | majority of experiments independently.
             | 
             | As a data scientist at a "blue chip company", my team owns
             | experimentation, but that doesn't mean we run all the
             | experiments. Our role is to create guidelines, processes,
             | and tooling so that engineers can run their own experiments
             | independently most of the time. Part of that is also
             | helping engineers recognize when they're dealing with a
             | difficult/complex/unusual case where they should bring DS
             | in for more bespoke hands-on support. We probably only look
             | at <10% of experiments (either in the setup or results
             | phase or both), because engineers/PMs are able to set up,
             | run, and draw conclusions from most of the experiments
             | without needing us.
        
       | 2rsf wrote:
       | Another challenge, related more to implementation than theory, is
       | having too many experiments running in parallel.
       | 
       | As a company grows there will be multiple experiments running in
       | parallel executed by different teams. The underlying assumption
        | is that they are independent, but that is not necessarily true,
        | or at least not entirely correct. For example, a graphics change
        | on the main page together with a change in the login logic.
       | 
       | Obviously this can be solved by communication, for example
       | documenting running experiments, but like many other aspects in
       | AB testing there is a lot of guesswork and gut feeling involved.
        
         | cantSpellSober wrote:
          | A better solve is E2E or unit tests to make sure A/B segments
          | aren't conflicting. At the enterprise level there are simply
          | too many teams testing too much to keep track of it in, say, a
          | spreadsheet.
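          | 
          | The shape of such a check can be tiny. A hypothetical sketch -
          | the registry, keys, and "surfaces" field are made up for
          | illustration:
          | 
          |   # Hypothetical registry of running experiments.
          |   ACTIVE_EXPERIMENTS = [
          |       {"key": "new-checkout-button", "surfaces": {"checkout"}},
          |       {"key": "simplified-login", "surfaces": {"login"}},
          |   ]
          |   
          |   def test_experiments_do_not_share_surfaces():
          |       owners = {}
          |       for exp in ACTIVE_EXPERIMENTS:
          |           for surface in exp["surfaces"]:
          |               assert surface not in owners, (
          |                   f"{exp['key']} and {owners[surface]} "
          |                   f"both target {surface}")
          |               owners[surface] = exp["key"]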
        
       | franze wrote:
        | plus, mind the Honeymoon Effect
        | 
        | something new performs better just because it's new
        | 
        | if you have a platform with lots of returning users, this one
        | will hit you again and again.
        | 
        | so even if you have a winner after the test and make the change
        | permanent, revisit it 2 months later and see if you are now
        | really better off.
        | 
        | in sum, all of your a/b-tested changes have a high chance of just
        | adding up to an average platform.
        
       | Sohcahtoa82 wrote:
       | The one mistake I assume happens too much is trying to measure
       | "engagement".
       | 
       | Imagine a website is testing a redesign, and they want to decide
       | if people like it by measuring how long they spend on the site to
       | see if it's more "engaging". But the new site makes information
       | harder to find, so they spend more time on the site browsing and
       | trying to find what they're looking for.
       | 
       | Management goes, "Oh, users are delighted with the new site! Look
       | how much time they spend on it!" not realizing how frustrated the
       | users are.
        
         | Xenoamorphous wrote:
         | LinkedIn is a good example, I think. One day I got a "you have
         | a new message" email. I clicked it, thinking, well, someone has
         | messaged me, right? It turned out to be just bullshit, someone
         | in my network had just posted something.
         | 
         | I'm sure the first few of those got a lot of clicks, but it
         | prompted me to ignore absolutely everything that comes from
         | LinkedIn except for actual connection requests from people I
         | know. Lots of clicks but also lots of pissed off people. I
          | guess the latter is harder to measure.
        
         | ravenstine wrote:
         | Engagement is my favorite form of metrics pseudoscience. A
         | classic example is when engagement actually goes up, not
         | because the design change is better, but because it frustrates
         | and confuses the user, causing them to click around more and
         | remain on the site longer. Without a focus group, there's
         | really no way to determine whether the users are actually
         | "delighted".
         | 
         | EDIT: For some reason it didn't compute with me that you
         | already referred to the same example. I've seen that exact
         | scenario play out in real life, though.
        
           | Sohcahtoa82 wrote:
           | I bet the reddit redesign used a similar faulty measurement
           | of engagement.
           | 
           | "People spent more time scrolling the feed, people must enjoy
           | it!"
           | 
           | No, the feed takes up more space, so now I can only fit 1 or
           | 2 items on my screen at once, rather than 10, so I have to
           | scroll more to see more content.
        
             | afro88 wrote:
             | If that also resulted in little or no change in how often
             | you (and everyone) opened reddit each day, then it is a
             | "success" for them. They have your eyeballs for longer, so
             | you likely see more ads.
             | 
             | If only they were trying to maximise enjoyment and not
             | addictiveness. They don't care at all about enjoyment, just
             | like Facebook doesn't care about genuine connection to
             | family and friends, or twitter to useful and constructive
             | discussion that leads to positive social change.
        
             | ravenstine wrote:
             | That would not surprise me in the least! In fact, that's
             | exactly what happened at a company I used to work for (that
             | shall remain nameless). At the behest of the design team,
             | we implemented a complete redesign of our site which
             | included changing the home page so that at most only two
             | media items could be on-screen at a time, and the ads which
             | used to be simple banners now were woven between the feed
             | of items. I remember sitting in a meeting where we had A/B
             | tested this new homepage, and witnessing some data analyst
             | guy giving a presentation which included how "engagement in
             | the B-group was increased by N-percent!!!" The directors of
             | web content were awestruck by this despite no context or
              | explanation as to _why_ supposed "engagement" was higher
             | with the new design. The test wasn't even carried out for a
             | long duration of time. For all anyone knew, users were
             | confused and spent more time clicking around because they
             | were looking for something they were accustomed to in the
             | original design. And no, it did not matter that I brought
             | up my reasons for skepticism; anything that made a number
              | increase made it into the final design. _Then_, we
             | actually had focus groups, long after the point at which we
             | should have been consulting them, and the feedback we
             | received was overwhelmingly lukewarm or negative. Much of
             | it vindicated my concerns the entire time; users didn't
              | _actually_ like scrolling. Then again, I guess if they're
             | viewing more ads, then who cares what the user thinks??
             | Never have I felt more like I was living in a Dilbert comic
             | than that time.
        
       | time4tea wrote:
       | Annoying illegal cookie consent banner?
        
       | realjohng wrote:
       | Thanks for posting this. It's to the point and easy to
        | understand. And much needed - most companies seem to do testing
       | without teaching the intricacies involved.
        
       | alsiola wrote:
        | On point 7 (Testing an unclear hypothesis), while agreeing with
       | the overall point, I strongly disagree with the examples.
       | 
       | > Bad Hypothesis: Changing the color of the "Proceed to checkout"
       | button will increase purchases.
       | 
        | This is succinct and clear, and it's very clear what the
        | variable/measure will be.
       | 
       | > Good hypothesis: User research showed that users are unsure of
       | how to proceed to the checkout page. Changing the button's color
       | will lead to more users noticing it and thus more people will
       | proceed to the checkout page. This will then lead to more
       | purchases.
       | 
       | > User research showed that users are unsure of how to proceed to
       | the checkout page.
       | 
       | Not a hypothesis, but a problem statement. Cut the fluff.
       | 
       | > Changing the button's color will lead to more users noticing it
       | and thus more people will proceed to the checkout page.
       | 
       | This is now two hypotheses.
       | 
       | > This will then lead to more purchases.
       | 
       | Sorry I meant three hypotheses.
        
         | travisjungroth wrote:
         | The biggest issue with those three hypotheses is one of them,
         | the noticing the button, almost certainly isn't being tested.
         | But, how the test goes will inform how people think about that
         | hypothesis.
        
           | ano-ther wrote:
           | Good observation that the noticing doesn't get tested.
           | 
           | Would there be any benefit from knowing the notice rate
           | though? After all, the intended outcome is increased sales by
           | clicking.
        
             | ricardobeat wrote:
             | Probably not, but then that hypothesis should not be part
             | of the experiment.
        
             | alsiola wrote:
             | This is what I was driving at in my original comment - the
             | intermediary steps are not of interest (from the POV of the
              | hypothesis/overall experiment), so why mention them at all?
        
           | hinkley wrote:
           | Rate of traffic on the checkout page, _divided by overall
           | traffic_.
           | 
           | We see a lot of ghosts in A/B testing because we are loosey
           | goosey about our denominators. Mathematicians apparently hate
           | it when we do that.
        
             | plagiarist wrote:
             | That doesn't test noticing the button, that tests clicking
             | the button. If the color changes it is possible that fewer
             | people notice it but are more likely to click in a way that
             | increases total traffic. Or more people notice it but are
             | less likely to click in a way that reduces traffic.
        
         | ssharp wrote:
         | I don't think these examples are bad. From a clarity
         | standpoint, where you have multiple people looking at your
         | experiments, the first one is quite bad and the second one is
         | much more informative.
         | 
         | Requiring a user problem, proposed solution, and expected
         | outcome for any test is also good discipline.
         | 
          | Maybe it's just getting into pedantry over the word
          | "hypothesis", and you would expect the other information
          | elsewhere in the test plan?
        
           | sacrosancty wrote:
           | [dead]
        
           | darkerside wrote:
           | Having a clearly stated hypothesis and supplying appropriate
           | context separately isn't pedantry. It is semantics, but words
           | result in actions that matter.
        
           | avereveard wrote:
            | the problem is the hand-wavy "user research"
            | 
            | if you have done that properly, why ab testing? if you did
            | it improperly, why bother?
            | 
            | ab testing starts from a hypothesis, because ab testing is
            | done to inform a bayesian analysis to identify causes.
            | 
            | if one already knows that the reason is 'button not visible
            | enough', ab testing is almost pointless.
            | 
            | not entirely pointless, because you can still do ab testing
            | to validate that the change is in the right direction, but
            | investing developer time in production-quality code and
            | risking the business just to validate something one already
            | knows seems crazy compared to just asking a focus group.
            | 
            | when you are unsure about the answer, that's when investing
            | in ab testing for discovery makes the most sense.
        
             | tomnipotent wrote:
             | > ab testing is almost pointless
             | 
             | Except you can never be certain that the changes made were
             | impactful in the direction you're hoping unless you measure
             | it. Otherwise it's just wishful thinking.
        
               | avereveard wrote:
                | I didn't say anything to the contrary, the quotation
                | is losing all the context.
                | 
                | but if you want to verify a hypothesis and control for
                | confounding factors, the ab test needs to be part of a
                | bayesian analysis. if you're doing that, why also pay
                | for the prior research?
                | 
                | by going down the path of user research > production-
                | quality release > validation of the hypothesis, you
                | are basically paying for research twice and paying for
                | development once, regardless of whether the testing is
                | successful or not.
                | 
                | it's more efficient to either use bayesian hypotheses
                | + ab testing for research (so pay for development once
                | per hypothesis, collect evidence, and steer in the
                | direction the evidence points to) or use user research
                | over a set of POCs (pay for research once per
                | hypothesis, develop in the direction the research
                | points to).
                | 
                | if your research needs validation, you paid for
                | research you might not need. if you start research
                | knowing the prior (the user doesn't see the button),
                | you're not actually doing research, you're just
                | gold-plating a hunch, so why pay for research - just
                | skip to the testing phase. if you want to research
                | from the users, you do ab testing, but again, not
                | against a hunch, but against a set of hypotheses, so
                | you can eliminate confounding factors and narrow down
                | the confidence interval.
        
         | kevinwang wrote:
         | It is surely helpful to have a "mechanism of action" so that
         | you're not just blindly AB testing and falling victim to
         | coincidences like in https://xkcd.com/882/ .
         | 
         | Not sure if people do this, but with a mechanism of action in
         | place you can state a prior belief and turn your AB testing
         | results into actual posteriors instead of frequentist metrics
         | like p-values which are kind of useless.
        
           | datastoat wrote:
           | That xkcd comic highlights the problem with observational (as
           | opposed to controlled) studies. TFA is about A/B testing,
           | i.e. controlled studies. It's the fact that you (the
           | investigator) is controlling the treatment assignment that
           | allows you to draw causal conclusions. What you happen to
           | believe about the mechanism of action doesn't matter, at
           | least as far as the outcome of this particular experiment is
           | concerned. Of course, your conjectured mechanism of action is
           | likely to matter for what you decide to investigate next.
           | 
           | Also, frequentism / Bayesianism is orthogonal to causal /
           | correlational interpretations.
        
             | eVoLInTHRo wrote:
             | The xkcd comic seems more about the multiple comparisons
             | problem (https://en.wikipedia.org/wiki/Multiple_comparisons
             | _problem), which could arise in both an observational or
             | controlled setting.
        
             | majormajor wrote:
             | AB tests are still vulnerable to p-hacking-esque things
             | (though usually unintentional). Run enough of them and your
             | p value is gonna come up by chance sometimes.
             | 
             | Observational ones are particularly prone because you can
             | slice and dice the world into near-infinite observation
             | combinations, but people often do that with AB tests too.
             | Shotgun approach, test a bunch of approaches until
             | something works, but if you'd run each of those tests for
             | different significance levels, or for twice as long, or
             | half as long, you could very well see the "working" one
             | fail and a "failing" one work.
        
             | carlmr wrote:
             | I think what kevinwang is getting at, is that if you A/B
             | test with a static version A and enough versions of B, at
             | some point you will get statistically significant results
             | if you repeat it often enough.
             | 
             | Having a control doesn't mean you can't fall victim to
             | this.
        
               | ricardobeat wrote:
               | You control statistical power and the error rate, and
               | choose to accept a % of false results.
        
         | thingification wrote:
         | As kevinwang has pointed out in slightly different terms: the
         | hypothesis that seems wooly to you seems sharply pointed to
         | others (and vice versa) because explanationless hypotheses
         | ("changing the colour of the button will help") are easily
         | variable (as are the colour of the xkcd jelly beans), while
         | hypotheses that are tied strongly to an explanation are not.
         | You can test an explanationless hypothesis, but that doesn't
         | get you very far, at least in understanding.
         | 
         | As usual here I'm channeling David Deutsch's language and ideas
         | on this, I think mostly from The Beginning of Infinity, which
         | he delightfully and memorably explains using a different
         | context here: https://vid.puffyan.us/watch?v=folTvNDL08A (the
         | yt link if you're impatient:
         | https://youtu.be/watch?v=folTvNDL08A - the part I'm talking
         | about starts at about 9:36, but it's a very tight talk and you
         | should start from the beginning).
         | 
         | Incidentally, one of these TED talks of Deutsch - not sure if
         | this or the earlier one - TED-head Chris Anderson said was his
         | all-time favourite.
         | 
         | plagiarist:
         | 
         | > That doesn't test noticing the button, that tests clicking
         | the button. If the color changes it is possible that fewer
         | people notice it but are more likely to click in a way that
         | increases total traffic.
         | 
         | "Critical rationalists" would first of all say: it does test
         | noticing the button, but tests are a shot at refuting the
         | theory, here by showing no effect. But also, and less commonly
         | understood: even if there is no change in your A/B - an
         | apparently successful refutation of the "people will click more
         | because they'll notice the colour" theory - experimental tests
         | are also fallible, just as everything else.
        
           | alsiola wrote:
           | Will watch the TED talk, thanks for sharing. I come at this
           | from a medical/epidemiological background prior to building
           | software, and no doubt this shapes my view on the language we
           | use around experimentation, so it is interesting to hear
           | different reasoning.
        
       | [deleted]
        
       | mtlmtlmtlmtl wrote:
       | Surprised no one said this yet, so I'll bite the bullet.
       | 
       | I don't think A/B testing is a good idea at all for the long
       | term.
       | 
        | Seems like a recipe for having your software slowly evolve into a
        | giant heap of dark patterns. When a metric becomes a target, it
        | ceases to be a good metric.
        
         | activiation wrote:
          | > Seems like a recipe for having your software slowly evolve
          | into a giant heap of dark patterns.
         | 
         | Just don't test for dark patterns?
        
           | mtlmtlmtlmtl wrote:
           | Well, how does one "just not do" that though, specifically?
        
             | activiation wrote:
             | First determine if what you want to test for is a dark
             | pattern?
        
               | mtlmtlmtlmtl wrote:
               | And how do you determine that? I'm not trying to be coy
               | here, I genuinely don't understand.
               | 
               | Because you're not testing for patterns, what you test is
               | some measurable metric(s) you want to maximise(or
               | minimise), right? So how can you determine which metrics
               | lead to dark patterns, without just using them and seeing
                | if dark patterns emerge? And how do you spot these dark
               | patterns if by their very nature they're undetectable by
               | the metrics you chose to test first?
        
               | activiation wrote:
               | [flagged]
        
               | mtlmtlmtlmtl wrote:
               | Well this discussion isn't helpful at all.
               | 
               | Why reply at all if you're just gonna waste my time?
        
               | activiation wrote:
               | [flagged]
        
         | hackernewds wrote:
         | Let's ship the project of those that bang the table, and
         | confirm our biases instead.
        
           | mtlmtlmtlmtl wrote:
           | Please try to be serious and don't put words in my mouth. I'm
           | actually trying to learn and have a serious discussion here.
           | 
           | Thanks.
        
         | matheusmoreira wrote:
         | I don't think it should even be legal. Why do these
         | corporations think they can perform human experimentation on
         | unwitting subjects for profit?
        
         | withinboredom wrote:
          | More or less, it tells you the "cost" of removing an accidental
          | dark pattern. For example, we had three plans and a free plan.
          | The button for the free plan was under the plans, front-and-
          | center ... unless you had the screen/resolution that most of
          | our non-dev/non-designer users had.
          | 
          | So, at the most common user resolution, the button sat just
          | below the fold.
         | 
         | This was an accident though some of our users called us out for
         | it -- suggesting we'd removed the free plan altogether.
         | 
         | So, we a/b tested moving the button to the top.
         | 
          | It turned out it would REALLY hurt the bottom line, and it
          | explained some of the growth we'd experienced. Removing the
          | "dark pattern" would have meant laying off some people.
         | 
         | I think you can guess which one was chosen and still
         | implemented.
        
           | whimsicalism wrote:
           | When an organization has many people, I think that many of
           | these are a continuum from accidental to intentional.
        
             | withinboredom wrote:
              | When I left that company it had grown massive, and the
             | product was full of dark patterns... I mean bugs,
             | seriously, they were tracked as bugs that no one could fix
             | without severe consequences. No one put them there on
             | purpose. When you have hundreds of devs working on the same
             | dozen files (onboarding/payments/etc) there are bound to be
             | bad merges (when a git merge results in valid but incorrect
             | code), misunderstanding of requirements, etc.
        
         | cantSpellSober wrote:
          | Good multivariate testing and (statistically significant)
          | data don't do that. They show lots of ways to improve your
          | UX, and whether your guesses actually work. Example from
         | TFA:
         | 
         | > more people signed up using Google and Github, overall sign-
         | ups didn't increase, and nor did activation
         | 
         | Less friction on login for the user, 0 gains in conversions,
         | they shipped it anyway. That's not a dark pattern.
         | 
         | If you're _intentionally_ trying to make dark patterns it will
         | help with that too I guess; the same way a hammer can build a
         | house, or tear it down, depending on use.
        
           | mtlmtlmtlmtl wrote:
           | I often see this argument, and although I can happily accept
           | the examples given in defence as making sense, I never see an
           | argument that this multivariate approach solves the problem
           | _in general_ and doesn 't merely ameliorate some of the worst
           | cases(I suppose I'm open to the idea that it could at least
           | get it from "worse than the disease" to "actually useful in
           | moderation").
           | 
           | Fundamentally, if you pick some number of metrics, you're
           | always leaving some number of possible metrics "dark", right?
           | Is there some objective method of deciding which metrics
           | should be chosen, and which shouldn't?
        
             | cantSpellSober wrote:
             | "user trust" is a good one, abeit hard to measure
             | 
             | Rolled out some tests to streamline cancelling
             | subscriptions in response to user feedback, with
             | Marketing's begrudging approval.
             | 
             | Short term, predictably, we saw an increase in
             | cancellations, then a decrease and eventual levelling out.
             | Long term we continued to see an increase in subscriptions
             | after rollout, and focused on more important questions like
             | "how do we provide a good product that a user doesn't
              | _want_ to cancel?"
        
               | mtlmtlmtlmtl wrote:
               | So, it's just a process of trial and error, in terms of
               | what metrics to choose and how to weight them?
        
       | 2OEH8eoCRo0 wrote:
       | _Every_ engineer? Electrical engineers? Kernel developers?
       | Defense workers?
       | 
       | I hesitate to write this (because I don't want to be negative)
       | but I get a sense that most software "engineers" have a very
       | narrow view of the industry at large. Or this forum leans a
       | particular way.
       | 
       | I haven't A/B tested in my last three roles. Two of them were
       | defense jobs, my current job deals with the Linux kernel.
        
         | [deleted]
        
         | o1y32 wrote:
         | Was going to say the same thing. Lots of articles have
         | clickbait titles, but this one is especially bad. Even among
         | software engineers, only a small percentage will ever do any
         | A/B testing, not to mention that often "scientists" or other
         | roles are in charge of designing, running and analyzing A/B
         | test experiments.
        
         | chefandy wrote:
         | I used to get knots in my hair about these distinctions, but in
         | retrospect, I was just being pedantic. It's a headline-- not a
         | synopsis or formal tagging system. Context makes it perfectly
         | clear to most in a web-focused software industry crowd which
         | "engineers" might be doing a/b testing. Also, my last three
         | jobs haven't included a lot of stuff I read about here; why
         | should that affect the headline?
        
         | jldugger wrote:
         | > Two of them were defense jobs, my current job deals with the
         | Linux kernel.
         | 
         | I don't work on the kernel, but one of the most professionally
         | useful talks about the Linux kernel was an engineer talking
         | about how to use statistical tests on perf related changes with
         | small effects[1]. It's not an _online_ A/B technique but
         | sometimes you pay attention to how other fields approach things
         | in order to learn how to improve your own field.
         | 
         | [1]: https://lca2021.linux.org.au/schedule/presentation/31/
        
       | withinboredom wrote:
       | I built an internal a/b testing platform with a team of 3-5 over
       | the years. It needed to handle extreme load (hundreds of millions
       | of participants in some cases). Our team also had a sister team
       | responsible for teaching/educating teams about how to do proper
       | a/b testing -- they also reviewed implementations/results on-
       | demand.
       | 
        | Most of the a/b tests they reviewed (note the survivorship bias
        | here: they were reviewed because their results were surprising)
        | were incorrectly implemented and had to be redone. Most companies
       | I worked at before or since did NOT have a team like this, and
       | blindly trusted the results without hunting for biases, incorrect
       | implementations, bugs, or other issues.
        
         | srveale wrote:
         | Do you know if there were common mistakes for the incorrect
         | implementations? Were they simple mistakes or more because
         | someone misunderstood a nuance of stats?
        
           | withinboredom wrote:
            | I don't remember many specifics, but IIRC, most of the
           | implementation related ones were due to an anti-pattern from
           | the older a/b testing framework. Basically, the client would
           | try and determine if the user was eligible to be in the A/B
           | test (instead of relying on the framework), then in an API
           | handler, get the user's assignment. This would mean the UI
           | would think the user wasn't in the A/B test at all, while the
           | API would see the user as in the A/B test. In this case, the
           | user would be experiencing the 'control' while the framework
           | thought they were experiencing something else.
           | 
           | That was a big one for awhile, and it would skew results.
           | 
           | Hmmm, another common one was doing geographic experiments
           | when part of the experiment couldn't be geofenced for
           | technological reasons. Or forgetting that a user could leave
            | a geofence, and removing access to the feature after they'd
            | already been given it.
           | 
           | Almost all cases boiled down to showing the user one thing
           | while thinking we were showing them something else.
        
             | srveale wrote:
             | I wonder if that falls under mistake #4 from the article,
             | or if there's another category of mistake: "Actually test
             | what you think you're testing." Seems simple but with a big
             | project I could see that being the hardest part.
        
               | withinboredom wrote:
                | I actually just read it (as best I could, the page is
                | really janky on my device). I didn't see this mistake
                | in there, and it was the most common one we saw by a
                | wide margin in the beginning.
               | 
               | Number 2 (1 in the article) was solved by the platform.
               | We had two activation points for UI experiments. The
                | first was getting the user's assignment (which could be
               | cached for offline usage). At that point they became part
               | of the test, but there was a secondary one that happened
               | when the component under test became visible (whether it
               | was a page view or a button). If you turned on this
               | feature for the test, you could analyze it using the
               | first or secondary points.
               | 
               | One issue we saw with that (which is potentially specific
                | to this implementation) was people forgetting to fire
               | the secondary for the control. That was pretty common but
               | you usually figured that out within a few hours when you
               | got an alert that your distribution looked biased (if you
               | specify a 10:20 split, you should get a 10:20 ratio of
               | activity).
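                | 
                | That alert was essentially a sample ratio mismatch
                | check; a rough sketch of the idea (not our actual
                | code) with a chi-square test:
                | 
                |   from scipy.stats import chisquare
                |   
                |   def looks_biased(observed, weights, alpha=0.001):
                |       # Tiny p-value => the observed split doesn't
                |       # match the configured one.
                |       total = sum(observed)
                |       expected = [total * w / sum(weights)
                |                   for w in weights]
                |       _, p = chisquare(observed, f_exp=expected)
                |       return p < alpha
                |   
                |   # Configured 10:20 split, but control under-fired
                |   # its activation event:
                |   print(looks_biased([9_000, 20_400], [10, 20]))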
        
         | indymike wrote:
         | > It needed to handle extreme load (hundreds of millions of
         | participants in some cases).
         | 
         | I can see extreme loads being valuable for an A/B test of a
         | pipeline change or something that needs that load... but for
         | the kinds of A/B testing UX and marketing does, leveraging
          | statistical significance seems to be a smart move. There is a
          | point beyond which a larger sample is only trivially more
          | accurate than a smaller one.
         | 
         | https://en.wikipedia.org/wiki/Sample_size_determination
        
           | withinboredom wrote:
           | Even if you're testing 1% of 5 million visitors, you still
           | need to handle the load for 5 million visitors. Most of the
           | heavy experiments came from AI-driven assignments (vs.
           | behavioral). In this case the AI would generate very fine-
           | grained buckets and assign users into them as needed.
        
         | rockostrich wrote:
         | Same experience here for the most part. We're working on
         | migrating away from an internal tool which has a lot of
         | problems: flags can change in the middle of user sessions,
         | limited targeting criteria, changes to flags require changes to
         | code, no distinction between feature flags and experiments,
         | experiments often target populations that vary greatly,
         | experiments are "running" for months and in some cases years...
         | 
         | Our approach to fixing these problems starts with having a
         | golden path for running an experiment which essentially fits
         | the OP. It's still going to take some work to educate everyone
         | but the whole "golden path" culture makes it easier.
        
           | withinboredom wrote:
            | When we started working on the internal platform, these were
            | exactly the problems we had. When we were finally deleting
           | the old code, we found a couple of experiments that had been
           | running for nearly half a decade.
           | 
           | For giggles, we ran an analysis on those experiments: no
           | difference between a & b.
           | 
           | That's usually the best result you can get, honestly. It
           | means you get to make a decision of whether to go with a or
           | b. You can pick the one you like better.
        
             | travisjungroth wrote:
             | That's a great outcome for a do-no-harm test. Horrible
             | outcome when you're expecting a positive effect.
        
               | withinboredom wrote:
               | It's an experiment, you shouldn't be "expecting"
               | anything. You hypothesize an effect, but that doesn't
               | mean it will be there and if you prove it wrong, you
               | continue to iterate.
        
               | travisjungroth wrote:
               | > you shouldn't be "expecting" anything
               | 
               | This is the biggest lie in experimentation. Of course you
               | expect something. Why are you running this test over all
               | other tests?
               | 
               | What I'm challenging is that if a team has spent three
               | months building a feature, you a/b test it and find no
               | effect, that is not a good outcome. Having a tie where
               | you get to choose anything is worse than having a winner
               | that forces your hand. At least you have the option to
               | improve your product.
        
       | donretag wrote:
       | If anyone from posthog is reading this, please fix your RSS feed.
       | The link actually points back to the blog homepage.
        
         | corywatilo wrote:
         | Will take a look, thanks for the heads up!
        
       | alberth wrote:
       | Enough traffic.
       | 
        | Isn't the biggest problem with A/B testing that very few websites
        | even have enough traffic to properly measure statistical
        | differences?
        | 
        | Essentially, that makes A/B testing useless for 99.9% of
        | websites.
        
         | hanezz wrote:
         | This. Ron Kohavi 1) has some excellent resources on this 2).
         | There is a lot of noise in data, that is very often
         | misattributed to 'findings' in the context of A/B testing.
         | Replication of A/B tests should be much more common in the CRO
         | industry, it can lead to surprising yet sobering insights into
         | real effects.
         | 
         | 1) https://experimentguide.com/ 2)
         | https://bit.ly/ABTestingIntuitionBusters
        
         | xp84 wrote:
          | I have worked for some pretty arrogant business types who fancy
          | themselves "data driven" but actually knew nothing about
          | statistics. What that meant in practice was they forced us
         | to run AB tests for every change, and when the tests nearly
         | always showed no particular statistical significance, they
         | would accept the insignificant results if it supported their
         | agenda, or if the insignificant results were against their
         | desired outcome, they would run the test longer until it
          | happened to flop the other way. The whole thing was such a joke.
         | You definitely need some very smart math people to do this in a
         | way that isn't pointless.
        
         | Retric wrote:
         | A/B testing works fine even at a hundred users per day. More
         | visitors means you can run more tests and notice smaller
         | differences, but that's also a lot of work which smaller sites
         | don't really justify.
        
       | wasmitnetzen wrote:
        | Posthog is on developerdan's "Ads & Tracking" blocklist[1], if
       | you're wondering why this doesn't load.
       | 
       | [1]:
       | https://github.com/lightswitch05/hosts/blob/master/docs/list...
        
         | RobotToaster wrote:
          | Just noticed that myself; it's also in the AdGuard DNS list.
        
       | masswerk wrote:
       | Ad 7)
       | 
       | > Good hypothesis: User research showed that users are unsure of
       | how to proceed to the checkout page. Changing the button's color
       | will lead to more users noticing it (...)
       | 
        | Mind that you have to prove first that this premise is actually
        | true. Your user research is probably exploratory,
       | qualitative data based on a small sample. At this point, it's
       | rather an assumption. You have to transform and test this (by
        | quantitative means) for validity and significance. Only then can
        | you proceed to the button hypothesis. Otherwise, you are still
       | testing multiple things at once, based on an unclear hypothesis,
       | while merely assuming that part of this hypothesis is actually
       | valid.
        
         | arrrg wrote:
         | In practice you often cannot test that in a quantitative way.
         | Especially since it's about a state of mind.
         | 
         | However, you should not dismiss qualitative results out of
         | hand.
         | 
         | If you do usability testing of the checkout flow with five
         | participants and three actually verbalize the hypothesis during
         | checkout ("hm, I'm not sure how to get to the next step here",
         | "I don't see the button to continue", after searching for 30s:
         | "ah, there it is!" - after all of which a good moderater would
         | also ask follow up questions to better understand why they
         | think it was hard for them to find the way to the next step and
         | what their expectations were) then that's plenty of evidence
         | for the first part of the hypothesis, allowing you to move on
         | to testing the second part. It would be madness to
         | quantitatively verify the first part. A total waste of
         | resources.
         | 
         | To be honest: with (hypothetical) evidence as clear as that
         | from user research I would probably skip the A/B testing and go
         | straight to implementing a solution if the problem is obvious
         | enough and there are best practice examples. Only if designers
         | are unsure about whether their proposed solution to the problem
         | actually works would I consider testing that.
         | 
         | Also: quantitative studies are not the savior you want them to
         | be. Especially if it's about details in the perception of users
         | ... and that's coming from me, a user researcher who loves to
         | do quantitative product evaluation and isn't even all that firm
         | in all qualitative methods.
        
           | masswerk wrote:
           | You really have to be able to build samples based on the
           | first part of the hypothesis: you should test 4 groups for a
           | crosstab. (Also, homogeneity may be an issue.) Transitioning
           | from qualitative to quantitative methods is really the tricky
           | part in social research.
           | 
            | Mind that 3/5 doesn't meet the criteria of a binary test. In
            | statistical terms, you know nothing; this could still be
            | random.
           | Moreover, even if metrics are suggesting that some users are
           | spending considerable time, you still don't know why: it's
           | still an assumption based on a negligible sample. So, the
           | first question should be really, how do I operationalize the
           | variable "user is disoriented", and, what does this exactly
           | mean. (Otherwise, you're in for spurious correlation of all
           | sorts. I.e. you still don't know _why_ some users display
            | disorientation and others don't. Instead of addressing the
           | underlying issue, you rather fix this by an obtrusive button
           | design, which may have negative impact on the other group.)
        
             | arrrg wrote:
             | I think you are really missing the forest for the trees.
             | 
             | Everything you say is completely correct. But useful? Or
             | worthwhile? Or even efficient?
             | 
             | The goal is not to find out why some users are disoriented
             | and some are not. Well, I guess indirectly it is. But
             | getting there with rigor is a nightmare and to my mind not
             | worthwhile in most cases. The hypothesis developed from the
             | usability test would be "some users are disoriented during
             | checkout". That to me would be enough evidence to actually
             | tackle that problem of disorientation, especially since to
             | me 3/5 would indicate a relatively strong signal (not in
             | terms of telling me the percentage of users affected by
             | this problem, just that it's likely the problem affects
             | more than just a couple people).
             | 
             | The more mysterious question to me would actually be
             | whether that disorientation also leads to people not
             | ordering. Which is a plausible assumption - but not
             | trivially answerable for sure. (Usability testing can
             | provide some hints toward answering that question - but
             | task based usability testing is always a bit artificial in
             | its setup.)
             | 
             | Operationalizing "user is disoriented" is a nightmare and
             | not something I would recommend at all (at least not as the
             | first step) if you are reasonably sure that disorientation
             | is a problem (because some users mention it during
             | usability testing) and you can offer plausible solutions (a
             | new design based on best practices and what users told you
             | they think makes them feel disoriented).
             | 
             | Operationalizing something like disorientation is much more
             | fraught with danger (and just operationalizing it in the
             | completely wrong way without even knowing) than identifying
             | a problem and based on reasonableness arguments
             | implementing a potential solution and seeing whether the
             | desired metric improves.
             | 
             | I agree that it would be an awesome research project to
             | actually operationalize disorientation. But worthwhile when
             | supporting actual product teams? Doubtful ...
        
               | masswerk wrote:
               | > The more mysterious question to me would actually be
               | whether that disorientation also leads to people not
               | ordering.
               | 
               | This is actually the crucial question. The disorientation
               | is an indication for a fundamental mismatch between the
               | internal model established by the user and the
               | presentation. It may be an issue of the flow of
               | information (users are hesitant, because they realize at
               | this point that this is not about what they thought it
               | may be) or on the usability/design side of things (user
               | has established an operational model, but this is not how
               | it operates). Either way, there's a considerable
               | dissonance, which will probably hurt the product: your
               | presentation does not work for the user, you're not
               | communicating on the same level, and this will be
               | probably perceived as an issue of quality or even fitness
               | for the purpose, maybe even intrigue. (Shouting at the
               | user may provide a superficial fix, but will not address
               | the potential damage.) - Which leads to the question:
               | what is the actual problem and what caused it in the
               | first place? (I'd argue, any serious attempt to
                | operationalize this variable will inevitably lead you
                | towards this much more serious issue. Operationalization
                | is difficult for a reason. If you want to have a
               | controlled experiment, you must control all your
               | variables - and the attempt to do so may hint you at
               | deeper issues.)
               | 
               | BTW, there's also a potential danger in just taking the
               | articulations of user dislike at face value: a classic
                | trope in TV media research was audiences criticizing the
               | outfit of the presenter, while the real issue was a
               | dissonance/mismatch in audio and visual presentation. Not
               | that users could pinpoint this, hence, they would rather
               | blame how the anchor was dressed...
        
               | arrrg wrote:
               | What you say is all true and I actually completely agree
               | with you (and like how you articulate those points -
                | great to read it distilled that way), but at the same
                | time it's probably not a good idea at all to do in most
                | circumstances.
               | 
               | It is alright to decide that in certain cases you can act
               | with imperfect information.
               | 
                | But to be clear, I actually think there may be situations
                | where pouring a lot of effort into really understanding
                | confusion is worthwhile. It's just very context dependent.
                | (And I think you consistently underrate the progress you
                | can make in understanding confusion, or any other thing
                | impacting conversion and use, by using qualitative
                | methods.)
        
               | masswerk wrote:
               | Regarding underestimating qualitative methods: I'm
               | actually all for them. It may turn out, it's all you
               | need. (Maybe, a quantitative test will be required to
               | prove your point, but it will probably not contribute
               | much to a solution.) It's really that I think that A/B
                | testing is somewhat overrated. (Especially since you
                | will probably not really know what you're actually
                | measuring without appropriate preparation, and that
                | preparation will already have done the heavy lifting.
                | A/B testing should
               | really be just about whether you can generalize on a
               | solution and the assumptions behind this or not. Using
               | this as a tool for optimization, on the other hand, may
               | be rather dangerous, as it doesn't suggest any relations
               | between your various variables, or the various layers of
               | fixes you apply.)
        
             | ssharp wrote:
             | This is adding considerable effort and weight to the
             | process.
             | 
             | The alternative is just running the experiment, which would
             | take 10 minutes to set up, and seeing the results. The A/B
             | test will help measure the qualitative finding in
             | quantitative terms. It's not perfect, but it is practical.
        
       | londons_explore wrote:
       | I want an A/B test framework that automatically optimizes the
       | size of the groups to maximize revenue.
       | 
       | At first, it would pick say a 50/50 split. Then as data rolls in
       | that shows group A is more likely to convert, shift more users
       | over to group A. Keep a few users on B to keep gathering data.
       | Eventually, when enough data has come in, it might turn out that
       | flow A doesn't work at all for users in France - so the ideal
       | would be for most users in France to end up in group B, whereas
       | the rest of the world is in group A.
       | 
       | I want the framework to do all this behind the scenes - and
       | preferably with statistical rigor. And then to tell me
       | which groups have diminished to near zero (allowing me to remove
       | the associated code).
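       | 
       | Roughly the behaviour I have in mind - a minimal Thompson-
       | sampling-style sketch, with made-up variant names and
       | conversion rates, just to illustrate the idea:
       | 
       |   import random
       | 
       |   # Each arm tracks observed conversions and misses.
       |   arms = {"A": {"conv": 0, "miss": 0},
       |           "B": {"conv": 0, "miss": 0}}
       | 
       |   def choose_arm():
       |       # Draw a plausible conversion rate for each arm from its
       |       # Beta posterior (uniform prior); serve the highest draw.
       |       draws = {}
       |       for name, s in arms.items():
       |           draws[name] = random.betavariate(s["conv"] + 1,
       |                                            s["miss"] + 1)
       |       return max(draws, key=draws.get)
       | 
       |   def record(arm, converted):
       |       arms[arm]["conv" if converted else "miss"] += 1
       | 
       |   # Simulated traffic: pretend A truly converts at 5%, B at 4%.
       |   true_rate = {"A": 0.05, "B": 0.04}
       |   for _ in range(10_000):
       |       arm = choose_arm()
       |       record(arm, random.random() < true_rate[arm])
       | 
       |   print(arms)  # most traffic should have drifted toward A
       | 
       | The hard parts (per-segment splits, rigorous stopping rules)
       | are exactly what this sketch leaves out.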
        
         | quadrature wrote:
         | Curious about your use case.
         | 
          | Is the idea that you want to optimize the conversion and then
          | remove the experiment code, keeping the winning variant?
          | 
          | Or would you prefer to keep the code in and have it
          | continuously optimize variants?
        
           | joseda-hg wrote:
            | There might be a particular situation where B is more
            | effective than A, and therefore should be kept, if only for
            | that specific situation. There might be a cutoff point where
            | maintaining B would cost more than it's worth, but that's a
            | parameter you will have to determine for each test.
        
           | londons_explore wrote:
           | I'd expect to be running tens of experiments at any one time.
           | Some of those experiments might be variations in wording or
            | color schemes - others might be entirely different signup
           | flows.
           | 
           | I'd let the experiment framework decide (ie. optimize) who
           | gets shown what.
           | 
           | Over time, the maintenance burden of tens of experiments (and
           | every possible user being in any combination of experiments)
           | would exceed the benefits, so then I'd want to end some
           | experiments, keeping just whatever variant performs best. And
           | I'd be making new experiments with new ideas.
        
         | [deleted]
        
         | jdwyah wrote:
          | Get yourself a multi-armed bandit and some Thompson sampling
         | https://engineering.ezcater.com/multi-armed-bandit-experimen...
        
         | PheonixPharts wrote:
         | As others have mentioned, you're referring to Thompson sampling
         | and plenty of testing providers offer this (and if you have any
         | DS on staff, they'll be more than happy to implement it).
         | 
         | My experience is that there's a good reason why this hasn't
         | taken off: the returns for this degree of optimization are far
         | lower than you think.
         | 
          | I once worked with a very eager but junior DS who thought that
          | we should build out a massive internal framework for doing
          | this. He didn't quite understand the math behind it, so I built
          | him a demo to illustrate the basics. What we realized in
          | running the demo under various conditions is that the total
          | return from this extra layer of optimization was negligible
          | at the scale we were operating at, and it required much more
          | complexity than our current setup.
         | 
         | This pattern repeats in a lot of DS related optimization in my
         | experience. The difference between a close guess and perfectly
         | optimal is often surprisingly little. Many DS teams perform
         | optimizations on business processes that yield a lower
         | improvement in revenue than the salary of the DS that built it.
        
           | brookst wrote:
           | Small nit: it's a bad idea if NPV of future returns is less
           | than the cost. If someone making $100k/yr can produce one
           | $50k/yr optimization that NPV's out to $260k, it's worth it.
           | I suspect you meant that, just a battle I have at work a lot
           | with people who only look at single-year returns.
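            | 
            | To make the arithmetic concrete: under one set of made-up
            | assumptions (7-year horizon, 8% discount rate), a $50k/yr
            | recurring gain NPVs out to roughly $260k.
            | 
            |   rate, years, annual_gain = 0.08, 7, 50_000
            |   npv = sum(annual_gain / (1 + rate) ** t
            |             for t in range(1, years + 1))
            |   print(round(npv))  # ~260,000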
        
           | tomfutur wrote:
           | Besides complexity, a price you pay with multi-armed bandits
           | is that you learn less about the non-optimal options (because
           | as your confidence grows that an option is not the best, you
           | run fewer samples through it). It turns out the people
           | running these experiments are often not satisfied to learn "A
           | is better than B." They want to know "A is 7% better than B,"
           | but a MAB system will only run enough B samples to make the
           | first statement.
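            | 
            | To put rough numbers on that (normal approximation, base
            | rate around 5%; purely illustrative):
            | 
            |   # Half-width of a 95% CI on a conversion rate vs. the
            |   # number of samples an arm receives.
            |   p = 0.05
            |   for n in (500, 10_000, 100_000):
            |       hw = 1.96 * (p * (1 - p) / n) ** 0.5
            |       print(n, f"+/- {hw:.2%}")
            |   # 500 -> 1.91%, 10,000 -> 0.43%, 100,000 -> 0.14%
            | 
            | A 7% relative lift is only ~0.35 percentage points, so with
            | the few hundred samples a bandit leaves on the losing arm,
            | that difference is buried in the noise.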
        
         | withinboredom wrote:
          | Sadly, there is an issue with the Novelty Effect[1]. If you
          | push traffic to the current winner, it probably won't validate
          | that it's the actual winner. So you may trade more conversions
          | now for higher churn than you can tolerate later.
         | 
         | For example, you run two campaigns:
         | 
         | 1. Get my widgets, one year only 19.99!
         | 
         | 2. Get my widgets, first year only 19.99!
         | 
         | The first one may win, but they all cancel at the second year
         | because they thought it was only for one year. They all leave
         | reviews complaining that you scammed them.
         | 
         | So, I would venture that this idea is a bad one, but sounds
         | good on paper.
         | 
         | [1]: https://medium.com/geekculture/the-novelty-effect-an-
         | importa...
         | 
          | PS. A/B tests don't just provide you with evidence that one
          | solution might be better than the other; they also provide
          | some protection in that a number of participants will get the
          | status quo.
        
           | travisjungroth wrote:
           | > So, I would venture that this idea is a bad one, but sounds
           | good on paper.
           | 
            | It's a great idea; it's just vulnerable to non-stationary
            | effects (novelty effect, seasonality, etc.). But it's
            | actually no worse than fixed-time-horizon testing for your
            | example if you run the test for less than a year. You A/B
            | test that copy for a month, push everyone to A, and you're
            | still not going to realize it's actually worse.
        
             | withinboredom wrote:
              | Yeah. If churn is part of the experiment, then even after
              | you stop serving the A/B test's treatment, you may have to
              | wait at least a year before you have the final results.
        
         | iLoveOncall wrote:
         | That sounds like you don't want A/B testing at all.
        
           | londons_explore wrote:
           | Indeed - I really want A/B testing combined with conversion
           | optimization.
        
           | hotstickyballs wrote:
           | Exactly. That just sounds like a Bayesian update
        
             | NavinF wrote:
             | It's https://en.wikipedia.org/wiki/Multi-armed_bandit
        
           | bobsmooth wrote:
           | Why can't users just tell me what works!
        
       | jedberg wrote:
       | The biggest mistake engineers make about A/B testing is not
       | recognizing local maxima. Your test may be super successful, but
       | there may be an even better solution that's significantly
       | different than what you've arrived at.
       | 
       | It's important to not only A/B test minor changes, but
       | occasionally throw in some major changes to see if it moves the
       | same metric, possibly even more than your existing success.
        
       | jldugger wrote:
       | > 6. Not accounting for seasonality
       | 
        | Doesn't the online nature of an A/B test automatically account
        | for this?
        
       | throwaway084t95 wrote:
       | That's not Simpson's Paradox. Simpson's Paradox is when the
       | aggregate winner is different from the winner in each element of
       | a partition, not just some of them
        
         | hammock wrote:
         | What it is is confounding
        
         | robertlacok wrote:
         | Exactly.
         | 
         | On that topic - what do you do when you observe that in your
         | test results? What's the right way to interpret the data?
        
           | throwaway084t95 wrote:
           | Let's consider an example that would be a case of Simpson's
           | Paradox. Suppose you are A/B testing two different landing
           | pages, and you want to know which will make more people
           | become habitual users. You partition on whether the user adds
           | at least one friend in their first 5 minutes on the platform.
           | It might be that landing page A makes people who add a friend
           | in the first 5 minutes more likely to become habitual users,
           | and it also makes people who don't add a friend in the first
           | 5 minutes more likely to become habitual users. But page A
           | makes people less likely to add a friend in the first 5
           | minutes, and people who add a friend in the first 5 minutes
           | are overwhelmingly more likely to become habitual users than
           | people who don't. So, in this case at least, it seems like
           | the aggregate statistics are most relevant, but the fact that
           | page A is bad mainly because it makes people less likely to
           | add a friend in the first 5 minutes is also very interesting;
           | maybe there is some way of combining A and B to get the good
            | qualities of each and avoid the bad qualities of both.
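            | 
            | With made-up counts (per landing page: users who added a
            | friend vs. not, and how many became habitual), the numbers
            | could look like this:
            | 
            |   data = {  # segment: (users, became habitual)
            |       "A": {"friend": (2000, 1240), "none": (8000, 800)},
            |       "B": {"friend": (5000, 3000), "none": (5000, 400)},
            |   }
            |   for page, groups in data.items():
            |       for seg, (n, hab) in groups.items():
            |           print(page, seg, f"{hab / n:.0%}")
            |       users = sum(n for n, _ in groups.values())
            |       habitual = sum(h for _, h in groups.values())
            |       print(page, "overall", f"{habitual / users:.1%}")
            | 
            | A wins inside both segments (62% vs 60%, 10% vs 8%), yet B
            | wins overall (34.0% vs 20.4%) because far more of B's users
            | add a friend in the first 5 minutes.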
        
           | contravariant wrote:
           | It can only happen with unequal populations. If you decide to
            | include people in the control or test group randomly, you're
            | fine (you can use statistical tests to rule out sample bias).
        
           | ssharp wrote:
           | With random bucketing happening at the global level for any
           | test, the proper thing to do is to take any segments that
           | show interesting (and hopefully statistically significant)
           | results that differ from the global results and test those
           | segments individually so the random bucketing happens at that
           | segment level.
           | 
           | There are two issues at play here -- one is that the sample
           | sizes for the segments may not be high enough, the other is
            | that the more segments you look at, the greater the
           | probability for finding a false positive.
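            | 
            | To put a rough number on that second issue: at a 5%
            | significance threshold, the chance of at least one false
            | positive across k segment comparisons grows quickly
            | (assuming independent looks, which is optimistic). A common
            | blunt fix is a Bonferroni correction, i.e. testing each
            | segment at 0.05 / k.
            | 
            |   alpha = 0.05
            |   for k in (1, 5, 10, 20):
            |       print(k, round(1 - (1 - alpha) ** k, 2))
            |   # 1 -> 0.05, 5 -> 0.23, 10 -> 0.4, 20 -> 0.64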
        
         | hnhg wrote:
         | Also, doesn't their suggested approach amount to multiple
         | testing? In other words, a kind of p-hacking:
         | https://en.wikipedia.org/wiki/Multiple_comparisons_problem
         | 
         | Edit - and this:
         | http://www.stat.columbia.edu/~gelman/research/unpublished/p_...
        
           | throwaway202351 wrote:
            | Yeah, a good A/B testing framework would either refuse to
            | let you break things down too much or have a large warning
            | about the results not being significant, but that doesn't
            | always stop the business types from wiggling in some way to
            | claim a win.
        
         | Lior539 wrote:
         | I'm the author of this blog. Thank you for calling this out!
         | I'll update the example to fix this :)
        
           | hammock wrote:
            | The example is fine; you are calling out confounding
            | variables. Just call them confounders instead of Simpson's
            | paradox.
        
           | keithwinstein wrote:
           | FWIW, the arithmetic in that example also has a glitch. For
           | the "Mobile" "Control" case, 100/3000 is about 3% rather than
           | 10%.
        
         | jameshart wrote:
          | Yes, I don't think it's possible to observe a Simpson's
          | paradox in a simple conversion test, either.
         | 
         | Simpson's paradox is about spurious correlations between
         | variables - conversion analysis is pure Bayesian probability.
         | 
         | It shouldn't be possible to have a group as a whole increase
         | its probability to convert, while having every subgroup
         | decrease its probability to convert - the aggregate has to be
         | an average of the subgroup changes.
        
           | HWR_14 wrote:
           | Simpson's paradox is sometimes about spurious correlations,
           | but the original paradox Simpson wrote about was simply a
            | binary question with 84 subgroups, where 3 or 4 subgroups
            | with the outlying answer had a large enough share of all
            | samples, and a large enough effect, to flip the aggregate.
        
           | BoppreH wrote:
           | Are you sure?
           | 
           | Consider the case where iOS users are more likely to convert
           | than Android users, but you currently have very few iOS
           | users. You then A/B test a new design that imitates iOS, but
           | has awful copy. Both iOS and Android users are less likely to
           | convert, but it attracts more iOS users.
           | 
           | The group as a whole has higher conversion because of the
           | demographic shift, but every subgroup has less.
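            | 
            | With made-up numbers, (share of measured users, conversion
            | rate) per OS:
            | 
            |   old = {"ios": (0.10, 0.20), "android": (0.90, 0.05)}
            |   new = {"ios": (0.60, 0.15), "android": (0.40, 0.04)}
            |   for label, mix in (("old", old), ("new", new)):
            |       rate = sum(s * c for s, c in mix.values())
            |       print(label, f"{rate:.1%}")  # old 6.5%, new 10.6%
            | 
            | Both per-OS rates drop, but the blended rate rises because
            | the measured population became much more iOS-heavy.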
        
             | whimsicalism wrote:
             | I don't follow. If one bucket has many more iOS users, it
             | seems like you have done a bad job randomizing your
             | treatment?
        
               | BoppreH wrote:
               | It could be self-selection happening after you randomized
               | the groups. For example a desktop landing page
               | advertising an app, which might be installed on either
               | mobile operating system.
        
       ___________________________________________________________________
       (page generated 2023-06-16 23:01 UTC)