[HN Gopher] Reinforcement Learning at Facebook
       ___________________________________________________________________
        
       Reinforcement Learning at Facebook
        
       Author : agbell
       Score  : 92 points
       Date   : 2021-02-01 15:39 UTC (7 hours ago)
        
 (HTM) web link (corecursive.com)
 (TXT) w3m dump (corecursive.com)
        
       | rayuela wrote:
       | Lol, almost thought the URL was COERCION.com!
        
       | agbell wrote:
       | Interviewer here. Happy to answer any questions or take any
       | feedback about the episode.
       | 
        | Jason Gauci joined Facebook to try to solve some problems with
       | the newsfeed using reinforcement learning. He originally got into
       | ML via training bots to play capture the flag. What he ended up
       | creating is open source [1].
       | 
       | [1]: https://reagent.ai/
        
         | Ozzie_osman wrote:
         | This was a great read. It looks like the objective function
          | (which seems to be some measure of "did we increase value?" vs.
          | "did the user tap the notification?") is really important here.
         | Any idea how that was actually measured?
        
           | agbell wrote:
           | Thanks!
           | 
            | My understanding is they look at page management activity
            | and whether it rises above what they would expect had they
            | not sent the notification.
           | 
            | Some of the details are covered in the paper [1].
           | 
           | >The Markov Decision Process (MDP) is based on a sequence of
           | notification candidates for a particular person. The actions
           | here are sending and dropping the notification, and the state
           | describes a set of features about the person and the
           | notification candidate. There are rewards for interactions
           | and activity on Facebook, with a penalty for sending the
           | notification to control the volume of notifications sent. The
           | policy optimizes for the long term value and is able to
           | capture incremental effects of sending the notification by
           | comparing the Q-values of the send and drop action
           | 
           | > The training data spans multiple weeks to enable the RL
           | model to capture page admins' responses and interactions to
           | the notifications with their managed pages over a long term
           | horizon. The accumulated discounted rewards collected in the
           | training allow the model to identify page admins with
           | longterm intent to stay active with the help of being
           | notified.
           | 
           | [1] https://arxiv.org/pdf/1811.00260.pdf
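            | 
            | Very roughly, the send/drop comparison described there
            | comes down to something like this (a toy sketch with a
            | made-up linear "Q-network"; not ReAgent's actual API):
            | 
            |   def q_value(state_features, action):
            |       # Stand-in for a trained Q-network: one weight
            |       # vector per action ("send" or "drop").
            |       weights = {"send": [0.4, -0.2, 0.1],
            |                  "drop": [0.1, 0.0, 0.05]}
            |       return sum(w * x for w, x in
            |                  zip(weights[action], state_features))
            | 
            |   def should_send(state_features):
            |       # The volume penalty is folded into the reward
            |       # during training, so send only when the long-term
            |       # value of sending beats that of dropping.
            |       return (q_value(state_features, "send") >
            |               q_value(state_features, "drop"))
            | 
            |   print(should_send([1.0, 0.3, 0.8]))  # True here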
        
         | mlthoughts2018 wrote:
         | The "This is a reinforcement learning problem" section is very
         | unconvincing. It presupposes that approaching the problem like
         | a game or "capture the flag" is good or somehow better than
         | supervised learning based on attributes that are known to
         | correlate quite strongly to user preferences.
         | 
          | These recommender systems draw very widespread complaints:
          | modern YouTube and FB Newsfeed rankings are widely panned,
          | by ML experts and by ordinary users who find the experience
          | manipulative, for reinforcing biases and optimizing for pure
          | engagement, which rewards outrage. Given that, what is your
          | take on how we can steer the conversation in the other
          | direction - that this is NOT a reinforcement learning
          | problem, and that we shouldn't reward people looking to pad
          | their ML resumes with solutions in search of a problem,
          | solutions that let them brag about scale and complexity
          | while very demonstrably serving users poorly?
        
           | agbell wrote:
            | I know very little about ML, being just the host, but I
            | think the 'explore vs. exploit' trade-off that Jason
            | mentions sounds like an improvement on pure 'exploit'.
            | Exploring means finding new interests rather than just
            | exploiting existing ones.
           | 
           | I think you are correct to have concerns around optimizing
           | for pure engagement though. These algos are giant optimizing
           | machines. What should we be asking them to optimize? That
           | seems like an important question that was only vaguely
           | touched on in this discussion.
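            | 
            | (For reference, the textbook form of that explore/exploit
            | trade-off is tiny; an epsilon-greedy chooser mostly shows
            | what it currently rates highest but occasionally tries
            | something new. Nothing Facebook-specific here, just the
            | textbook idea:)
            | 
            |   import random
            | 
            |   def epsilon_greedy(estimated_values, epsilon=0.1):
            |       # Explore with probability epsilon, else exploit.
            |       if random.random() < epsilon:
            |           return random.randrange(len(estimated_values))
            |       return max(range(len(estimated_values)),
            |                  key=estimated_values.__getitem__)
            | 
            |   print(epsilon_greedy([0.2, 0.5, 0.1]))  # usually 1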
        
             | mlthoughts2018 wrote:
             | That's zero reason to reach for a bazooka like
             | reinforcement learning though. You could use simple
             | Thompson Sampling or many other multi-armed bandit methods.
             | 
             | Balancing learning vs serving the optimal result is fine -
             | many companies have approached ranking and recommendation
             | that way for decades - but reinforcement learning would
             | still not be justified unless you present some extremely
             | compelling evidence.
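              | 
              | For a sense of scale, Beta-Bernoulli Thompson Sampling
              | fits in about a dozen lines (a generic textbook sketch,
              | not tied to any Facebook system):
              | 
              |   import random
              | 
              |   class ThompsonSampler:
              |       def __init__(self, n_arms):
              |           # One Beta(wins+1, losses+1) posterior per arm.
              |           self.wins = [0] * n_arms
              |           self.losses = [0] * n_arms
              | 
              |       def pick_arm(self):
              |           # Sample a plausible click-rate from each
              |           # arm's posterior and play the best sample.
              |           draws = [random.betavariate(w + 1, l + 1)
              |                    for w, l in zip(self.wins,
              |                                    self.losses)]
              |           return max(range(len(draws)),
              |                      key=draws.__getitem__)
              | 
              |       def update(self, arm, clicked):
              |           if clicked:
              |               self.wins[arm] += 1
              |           else:
              |               self.losses[arm] += 1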
        
               | mlthoughts2018 wrote:
                | Also, to be clear - I think it's great to hear this
                | perspective, and the interview is well put together.
                | I'm not trying to criticize you or the merit of
                | talking about this topic - it is well worth it.
               | 
               | I was just asking, since you develop these kinds of
               | interviews, what do you think would be a good way to get
               | the other side of the story and talk to big tech
               | practitioners who do not agree with the leap to
               | reinforcement learning?
        
               | agbell wrote:
                | Isn't a multi-armed bandit a simple reinforcement
                | learning algo? It is used in ReAgent's introductory
                | tutorial:
               | https://reagent.ai/rasp_tutorial.html
               | 
               | Replying here to sibling: Thank you for the feedback. I
               | think it is fair to say this interview does not explore
               | in depth the issues that these techniques can cause and
               | it certainly only presents one side. I think the
               | recommendation to get more than one perspective is a good
               | one.
               | 
               | Let me know if there is anyone specific you recommend I
               | talk to.
        
               | mlthoughts2018 wrote:
               | That's a lot of semantic hairsplitting. They are both
               | "reinforcement learning" in the same way a Honda Civic
               | and an aircraft carrier are both "vehicles."
        
               | srean wrote:
               | > Isn't multi-armed bandit a simple reinforcement
               | learning algo
               | 
               | It is indeed.
               | 
                | It's a restricted form of it. In RL an action can cause
                | a state change, which in turn can make the same
                | 'arms'/'actions' behave differently, because their
                | behavior is tied to the state. The state one lands in
                | also exercises control over which state you end up in
                | next. It's for this reason that some extra bookkeeping
                | is necessary for full-fledged RL. But you are absolutely
                | right that bandits are considered a simplified version
                | of RL. By controlling the size of the state space one
                | can control how bandit-like the solution will behave.
                | 
                | There is also something called a contextual bandit (CB)
                | that sits between pure bandits and RL. CBs do not have
                | state changes, but they do have access to side
                | information that can affect the 'arms'. In RL one needs
                | to think not only about the reward but also about the
                | possibility of ending up in a 'dead-end /
                | hard-to-recover-from' state because the immediate reward
                | was high. CBs do not have such 'traps', but they have
                | more modeling power than plain bandits because the
                | reward of an arm can depend on this side information,
                | usually called the context.
               | 
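                | A rough sketch of the difference, using hypothetical
                | env/policy placeholders rather than real models:
                | 
                |   def contextual_bandit_loop(env, policy, steps):
                |       for _ in range(steps):
                |           ctx = env.sample_context()  # side info
                |           arm = policy.choose(ctx)
                |           # Pulling an arm yields a reward but does
                |           # not change which contexts come next.
                |           reward = env.pull(ctx, arm)
                |           policy.update(ctx, arm, reward)
                | 
                |   def rl_loop(env, policy, steps):
                |       state = env.reset()
                |       for _ in range(steps):
                |           action = policy.choose(state)
                |           # The action also moves the state, so a
                |           # high immediate reward can lead into a
                |           # hard-to-recover-from 'trap' state.
                |           nxt, reward = env.step(action)
                |           policy.update(state, action, reward, nxt)
                |           state = nxt
                | 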
               | The heat that you are getting from some comments is
               | unwarranted.
               | 
                | EDIT: Holy mother of monkey milk, you have a ton of
                | super interesting interviews! Glad I ran into your
                | content. Better late than never.
        
               | PartiallyTyped wrote:
               | I'd say that in CBs the action does not affect the
               | distribution of future states.
        
               | agbell wrote:
                | Bandits being reinforcement learning makes sense! Thank
                | you. If you are looking for a recommendation, "Software
                | That Doesn't Suck" is a personal fav:
               | https://corecursive.com/software-that-doesnt-suck-with-
               | jim-b...
        
               | srean wrote:
               | You had me at Brian Kernighan's interview. I don't think
               | I have met a more modest man.
               | 
               | Once upon a time I had my open cube just behind his open
               | cube. I had no idea who he was and his modesty certainly
                | did not make it any easier. Once he got locked out of
                | the floor and I had to let him in. It was only after
                | that that I noticed his name tag.
        
         | glutamate wrote:
         | Is reinforcement learning being used to stop the newsfeed
         | promoting genocide?
         | https://www.nytimes.com/2018/10/15/technology/myanmar-facebo...
        
         | Jugurtha wrote:
         | How is Facebook doing machine learning? I know they have their
         | internal platform (FBLearner Flow, "equivalent" to Uber's
         | Michelangelo), but I have talked with people who have worked
         | there and they didn't use it.
         | 
         | I spoke with them to test our own machine learning platform
         | (https://iko.ai). The workflow they described was really odd.
          | SSHing into boxes to use a cluster, etc. That is what we had
          | been doing as a tiny, immature company a few years ago, until
          | it became so frustrating that we built our own platform. I'm
          | talking about a tiny team, so I'm wondering how they get away
          | with it, or whether the people I talked with simply did not
          | adopt it.
         | 
         | Someone went as far as saying that "experiment tracking" was "I
         | told my manager which hyperparameters worked best".
        
           | anon_tor_12345 wrote:
           | >they didn't use it ... The workflow they described was
           | really odd. SSHing into boxes to use a cluster, etc.
           | 
           | no clue what you're talking about - most everyone on a
           | product team uses fblearner (the platform you're alluding to)
           | which is a job queue type tool i.e. submit fblearner jobs and
           | watch them run along with metrics tracking.
           | 
           | >Someone went as far as saying that "experiment tracking" was
           | "I told my manager which hyperparameters worked best"
           | 
           | hyperparameters are rarely fiddled with because of how much
           | data there is to train on but like i said fblearner has
           | plenty of views to help with "experiment tracking" when it
           | comes to hpo.
        
             | Jugurtha wrote:
             | This is why I found it odd. I wondered why they didn't use
             | FBLearner Flow and figured that not all teams were using it
             | even though they did machine learning.
             | 
             | We like these conversations where people share problems
             | they may be having in order to get a bigger picture. We
             | built our ML platform to solve our own problems that we
             | faced over the years, but it's always nice to be exposed
              | to problems we have not seen before, so we can solve a
              | _slightly_ more general problem.
        
               | anon_tor_12345 wrote:
               | >I wondered why they didn't use FBLearner Flow
               | 
               | were they in FAIR? conceivably FAIR might need more
               | flexibility (because they're trying to "innovate") and so
               | they fall back on lower level tools. but i know people at
               | FAIR and they too use fblearner. regardless FAIR (or
               | whatever other org you spoke to) is very small relative
               | to the total number of people using/doing ML at FB so
               | extrapolating from their needs is unwise (if you're
               | trying to build a business around some typical process).
        
               | Jugurtha wrote:
               | It makes sense. As I said, I'm building for our own needs
               | as we help organizations with machine learning and we
               | needed to deliver faster, _but_ I appreciate talking with
               | people in the field to cluster families of problems and
               | see a slightly bigger picture, and I generally like
                | talking with this kind of person. They remind me of my
               | colleagues, and I really like my colleagues.
        
         | snicksnak wrote:
          | What would the actual goal be for RL applied to the newsfeed?
          | I understand RL for Amazon aims at suggesting things you're
          | looking for or are likely to buy. One might think this
          | correlates directly with minimizing the time spent browsing
          | Amazon. For the newsfeed it would probably be the complete
          | opposite, right - maximizing the scroll of doom?
         | 
         | Also, from the transcript:
         | 
         | > when you go to Facebook, [...], you see all these posts from
         | your friends
         | 
         | Maybe I'm an outlier but my newsfeed probably contains ~5% of
         | posts related to my friends, birthday wishes included. I use
         | facebook primarily as a news aggregator nowadays.
        
           | agbell wrote:
           | Thanks for reading or listening to the episode!
           | 
           | I think that is the hardest question, what to optimize for.
            | Jason mentions that Facebook employs social scientists who
            | help define the value they are optimizing for.
           | 
           | > I don't work on the social science part of it. We try to
           | optimize and we do it on good faith that the goals we're
           | optimizing for are good faith goals. But I've been in enough
           | of the meetings to see the intent is really a good intent.
           | It's just a thing that's very difficult to quantify.
           | 
           | > But I do think that the intent is to provide that value.
           | And I do think that they would trade some of the margin for
           | the value in a heartbeat.
        
             | dbtc wrote:
             | Not profit? I thought it was for profit.
        
               | alexbeloi wrote:
               | Ads optimizes for profit, all other content is broadly
               | optimized for _meaningful social interaction_ and against
               | _problematic content_.
               | 
               | https://www.facebook.com/business/news/news-feed-fyi-
               | bringin...
               | 
               | https://about.fb.com/news/2019/04/remove-reduce-inform-
               | new-s...
        
           | buitreVirtual wrote:
           | > I use facebook primarily as a news aggregator nowadays.
           | 
           | Isn't this how people end up trapped in alternative-facts
           | bubbles?
        
             | goguy wrote:
              | Depends which news he's talking about.
             | 
             | Sports news is usually pretty factual.
        
       ___________________________________________________________________
       (page generated 2021-02-01 23:02 UTC)