[HN Gopher] Introduction to Multi-Armed Bandits
       ___________________________________________________________________
        
       Introduction to Multi-Armed Bandits
        
       Author : Anon84
       Score  : 28 points
        Date   : 2025-09-30 21:08 UTC (1 hour ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | esafak wrote:
        | One way to address the exploration-exploitation tradeoff:
        | https://en.wikipedia.org/wiki/Exploration%E2%80%93exploitati...
        
       | rented_mule wrote:
       | We employed bandits in a product I worked on. It was selecting
       | which piece of content to show in a certain context, optimizing
       | for clicks. It did a great job, but there were implications that
       | I wish we understood from the start.
       | 
       | There was a constant stream of new content (i.e., arms for the
       | bandits) to choose from. Instead of running manual experiments
       | (e.g., A/B tests or other designs), the bandits would sample the
       | new set of options and arrive at a new optimal mix much more
       | quickly.
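        | 
        | As a rough illustration (not our actual system), a Thompson
        | sampling loop handles new arms by simply giving them a fresh
        | prior, which is why new content gets folded in without a
        | separate manual experiment:
        | 
        |     import random
        | 
        |     # Beta posterior over click-through rate, one per piece of
        |     # content. New content gets a fresh (1, 1) prior and starts
        |     # being sampled immediately.
        |     stats = {}  # arm_id -> [alpha, beta]
        | 
        |     def choose(arm_ids):
        |         # Thompson sampling: draw from each arm's posterior and
        |         # show the arm with the highest draw.
        |         for arm in arm_ids:
        |             stats.setdefault(arm, [1, 1])
        |         draws = {a: random.betavariate(*stats[a]) for a in arm_ids}
        |         return max(draws, key=draws.get)
        | 
        |     def update(arm, clicked):
        |         a, b = stats[arm]
        |         stats[arm] = [a + 1, b] if clicked else [a, b + 1]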
       | 
       | But we did want to run experiments with other things around the
       | content that was managed by the bandits (e.g., UI flow, overall
       | layout, other algorithmic things, etc.). It turns out bandits
       | complicate these experiments significantly. Any changes to the
       | context in which the bandits operate lead them to shift things
       | more towards exploration to find a new optimal mix, hurting
       | performance for some period of time.
       | 
       | We had a choice we could make here... treat all traffic,
       | regardless of cohort, as a single universe that the bandits are
       | managing (so they would optimize for the mix of cohorts as a
        | whole). Or we could set up bandit stats for each cohort. If things
       | are combined, then we can't use an experiment design that assumes
       | independence between cohorts (e.g., A/B testing) because the
       | bandits break independence. But the optimal mix will likely look
       | different for one cohort vs. another vs. all of them combined. So
       | it's better for experiment validity to isolate the bandits for
       | each cohort. Now small cohorts can take quite a while to converge
       | before we can measure how well things work. All of this puts a
       | real limit on iteration speed.
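        | 
        | A toy sketch of the two options (the per-cohort keying here is an
        | assumption about one way to structure it, not a description of
        | what we shipped):
        | 
        |     from collections import defaultdict
        |     import random
        | 
        |     # Option A: pool all traffic into one set of posteriors.
        |     shared = defaultdict(lambda: [1, 1])
        | 
        |     # Option B: isolate bandit state per experiment cohort, so
        |     # each cohort converges to its own mix and cohorts stay
        |     # independent, at the cost of slower convergence for small
        |     # cohorts.
        |     per_cohort = defaultdict(lambda: defaultdict(lambda: [1, 1]))
        | 
        |     def choose(cohort, arm_ids, isolate=True):
        |         stats = per_cohort[cohort] if isolate else shared
        |         draws = {a: random.betavariate(*stats[a]) for a in arm_ids}
        |         return max(draws, key=draws.get)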
       | 
        | Things also become very difficult to reason about because there
        | is state in the bandit stats that is being used to optimize
       | things. You can often think of that as a black box, but sometimes
       | you need to look inside and it can be very difficult.
       | 
       | Much (all?) of this comes from bandits being feedback loops -
       | these same problems are present in other approaches where
       | feedback loops are used (e.g., control theory based approaches).
       | Feedback mechanisms are incredibly powerful, but they couple
       | things together in ways that can be difficult to tease apart.
        
         | kianN wrote:
         | I've actually run into the exact same issue. At the time we
         | similarly had to scrap bandits. Since then I've had the
         | opportunity to do a fair amount of research into hierarchical
          | Dirichlet processes in an unrelated field.
         | 
         | On a random day, a light went off in my head that hierarchy
         | perfectly addresses the stratification vs aggregation problems
         | that arise in bandits. Unfortunately I've never had a chance to
         | apply this (and thus see the issues) in a relevant setting
         | since.
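          | 
          | A toy version of that partial-pooling idea, using Beta-Binomial
          | shrinkage rather than a full hierarchical Dirichlet process
          | (the prior strength kappa is an assumed knob):
          | 
          |     import random
          | 
          |     def thompson_draw(global_clicks, global_views,
          |                       cohort_clicks, cohort_views, kappa=50):
          |         # Shrink the cohort's click rate toward the global
          |         # rate: small cohorts lean on the global prior, large
          |         # cohorts overwhelm it.
          |         p_global = (global_clicks + 1) / (global_views + 2)
          |         alpha = kappa * p_global + cohort_clicks
          |         beta = kappa * (1 - p_global) + (cohort_views - cohort_clicks)
          |         return random.betavariate(alpha, beta)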
        
         | dr_dshiv wrote:
         | " If things are combined, then we can't use an experiment
         | design that assumes independence between cohorts (e.g., A/B
         | testing) because the bandits break independence."
         | 
         | Can you explain, please?
        
       ___________________________________________________________________
       (page generated 2025-09-30 23:00 UTC)