[HN Gopher] Introduction to Multi-Armed Bandits
___________________________________________________________________
Introduction to Multi-Armed Bandits
Author : Anon84
Score : 28 points
  Date   : 2025-09-30 21:08 UTC (1 hour ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| esafak wrote:
| One way to address the
| https://en.wikipedia.org/wiki/Exploration%E2%80%93exploitati...
| rented_mule wrote:
| We employed bandits in a product I worked on. It was selecting
| which piece of content to show in a certain context, optimizing
| for clicks. It did a great job, but there were implications that
| I wish we understood from the start.
|
| There was a constant stream of new content (i.e., arms for the
| bandits) to choose from. Instead of running manual experiments
| (e.g., A/B tests or other designs), the bandits would sample the
| new set of options and arrive at a new optimal mix much more
| quickly.
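  | [Editorial note: a minimal sketch of the kind of setup described
  | above, using Bernoulli Thompson sampling. The class and method
  | names are illustrative, not from the commenter's product; the
  | point is that a newly added arm starts with an uninformative
  | prior and gets sampled into the mix automatically, with no
  | manually designed experiment.]

```python
import random

class ThompsonBandit:
    """Bernoulli Thompson sampling: one Beta posterior per arm."""

    def __init__(self):
        self.stats = {}  # arm -> [clicks, non-clicks]

    def add_arm(self, arm):
        # New content enters with a Beta(1, 1) (uniform) prior and is
        # explored immediately, since its posterior draws are spread wide.
        self.stats.setdefault(arm, [0, 0])

    def choose(self):
        # Draw one sample from each arm's Beta posterior; show the arm
        # with the highest draw. High-uncertainty arms win often enough
        # to be explored; proven arms win most of the time.
        return max(self.stats, key=lambda a: random.betavariate(
            self.stats[a][0] + 1, self.stats[a][1] + 1))

    def update(self, arm, clicked):
        # Fold the observed outcome back into that arm's posterior.
        self.stats[arm][0 if clicked else 1] += 1
```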
|
| But we did want to run experiments with other things around the
| content that was managed by the bandits (e.g., UI flow, overall
| layout, other algorithmic things, etc.). It turns out bandits
| complicate these experiments significantly. Any changes to the
| context in which the bandits operate lead them to shift things
| more towards exploration to find a new optimal mix, hurting
| performance for some period of time.
|
  | We had a choice we could make here... treat all traffic,
  | regardless of cohort, as a single universe that the bandits are
  | managing (so they would optimize for the mix of cohorts as a
  | whole). Or we could set up bandit stats for each cohort. If things
  | are combined, then we can't use an experiment design that assumes
  | independence between cohorts (e.g., A/B testing) because the
  | bandits break independence. But the optimal mix will likely look
  | different for one cohort vs. another vs. all of them combined. So
  | it's better for experiment validity to isolate the bandits for
  | each cohort. Now small cohorts can take quite a while to converge
  | before we can measure how well things work. All of this puts a
  | real limit on iteration speed.
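  | [Editorial note: the two options can be sketched as a choice of
  | where bandit state lives. All names here are illustrative,
  | assuming a simple epsilon-greedy bandit rather than whatever the
  | commenter's product actually used.]

```python
import random

def make_state(arms):
    # Independent per-arm counters: arm -> [clicks, impressions].
    return {a: [0, 0] for a in arms}

def choose(state, eps=0.1):
    # Epsilon-greedy: explore a random arm with probability eps,
    # otherwise exploit the best observed click rate. Unseen arms
    # score +inf so every arm is tried at least once.
    if random.random() < eps:
        return random.choice(list(state))
    return max(state, key=lambda a:
               state[a][0] / state[a][1] if state[a][1] else float("inf"))

def record(state, arm, clicked):
    state[arm][1] += 1
    state[arm][0] += int(clicked)

ARMS = ["content_1", "content_2"]

# Option A: one shared state. The bandit optimizes for the blended
# traffic, but its choices in one cohort depend on outcomes observed
# in the other, so cohorts are no longer independent samples.
shared = make_state(ARMS)

# Option B: isolated state per experiment cohort. Cohorts stay
# independent (valid A/B comparisons), but a small cohort accumulates
# impressions slowly and its bandit converges late.
per_cohort = {c: make_state(ARMS) for c in ["control", "variant"]}
```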
|
  | Things also become very difficult to reason about because there
  | is state in the bandit stats being used to optimize things. You
  | can often think of that as a black box, but sometimes you need to
  | look inside, and it can be very difficult.
|
| Much (all?) of this comes from bandits being feedback loops -
| these same problems are present in other approaches where
| feedback loops are used (e.g., control theory based approaches).
| Feedback mechanisms are incredibly powerful, but they couple
| things together in ways that can be difficult to tease apart.
  | kianN wrote:
  | I've actually run into the exact same issue. At the time we
  | similarly had to scrap bandits. Since then I've had the
  | opportunity to do a fair amount of research into hierarchical
  | Dirichlet processes in an unrelated field.
|
| On a random day, a light went off in my head that hierarchy
| perfectly addresses the stratification vs aggregation problems
| that arise in bandits. Unfortunately I've never had a chance to
| apply this (and thus see the issues) in a relevant setting
| since.
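  | [Editorial note: the commenter never implemented this, so the
  | following is only an illustration of the hierarchical intuition,
  | not their proposal. A crude stand-in is partial pooling: shrink
  | each cohort's click-rate estimate toward the pooled rate, so
  | small cohorts lean on aggregate data while large cohorts rely on
  | their own. A true hierarchical Dirichlet process would learn the
  | pooling structure itself; the fixed `strength` here is a
  | simplifying assumption.]

```python
def partial_pool(cohort_stats, global_stats, strength=10.0):
    """Shrink a cohort's click-rate estimate toward the global rate.

    cohort_stats / global_stats are (clicks, impressions) pairs.
    Equivalent to a Beta prior centered on the global rate with
    `strength` pseudo-observations.
    """
    c_clicks, c_n = cohort_stats
    g_clicks, g_n = global_stats
    g_rate = g_clicks / g_n if g_n else 0.5
    return (c_clicks + strength * g_rate) / (c_n + strength)
```

  | With this, a tiny cohort showing 1 click in 1 impression is
  | pulled well below its raw 100% rate toward the pooled rate, while
  | a large cohort's estimate barely moves.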
| dr_dshiv wrote:
| " If things are combined, then we can't use an experiment
| design that assumes independence between cohorts (e.g., A/B
| testing) because the bandits break independence."
|
| Can you explain, please?
___________________________________________________________________
(page generated 2025-09-30 23:00 UTC)