[HN Gopher] Using reinforcement learning and $4.80 of GPU time t...
       ___________________________________________________________________
        
       Using reinforcement learning and $4.80 of GPU time to find the best
       HN post
        
       Author : kcorbitt
       Score  : 118 points
       Date   : 2024-10-28 17:17 UTC (5 hours ago)
        
 (HTM) web link (openpipe.ai)
 (TXT) w3m dump (openpipe.ai)
        
       | kcorbitt wrote:
       | Hey all, this project was a labor of love I worked on in my spare
       | time over the last couple of weeks. Happy to answer any
       | questions!
        
       | eugenekolo wrote:
       | What does the model say about this post?
        
         | kcorbitt wrote:
         | Haha great question. Since it's only trained on on-platform HN
         | content and not external links, this post is a little bit out
         | of distribution for it unfortunately. I'm thinking about
         | scraping a corpus of external links and running the same
         | analysis though, in which case I'd definitely run it on this
         | story because I'm also curious about that. :)
        
           | Rick76 wrote:
           | I would be very interested in the results of that as well
        
       | Havoc wrote:
       | Nice write up.
       | 
       | Did you ever figure out what happened in 2016?
        
         | kcorbitt wrote:
         | Nope. I was actually planning on asking dang if he has any
         | insights there. If he sees this thread hopefully he can chime
         | in!
        
           | twoodfin wrote:
           | I think text vs. link used to be XOR, but isn't any longer.
           | 
            | It's still outside the HN mainstream to use both in the same
           | submission, so that might be biasing the model in strange
           | ways.
        
             | jerjerjer wrote:
             | From the post:
             | 
             | > But to simplify, instead I'll just limit to stories that
             | have only text bodies, instead of links.
             | 
              | This line implies that both pre- and post-2016 stories
              | are text-only, so this change shouldn't affect the data
              | much.
        
           | kelnos wrote:
           | In case he doesn't, you might as well email him about it.
           | He's a very responsive guy and might find it interesting.
        
           | n2d4 wrote:
           | Given that Google Trends doesn't show that bump, I'd assume
           | it has to do with how the data was collected. Maybe all
            | stories with < X votes/comments older than 2015 were not
            | included, or were deleted from whatever index you used?
        
       | pclmulqdq wrote:
       | There is a timing factor that you need to consider, too.
       | Anecdotally, Sunday morning is the best time to get onto the
       | front page, while Tuesday or Wednesday morning gets you the most
       | views.
        
         | kcorbitt wrote:
         | Yep, that's why I included the post date in the information
         | available to the model; in theory (if it's smart enough) it
         | should be able to take that into account. That said I didn't
         | include time-of-day; it would be interesting to see whether
         | adding that information would be able to make the model more
         | accurate!
         | 
          | If the reward model is indeed smart enough to take that
          | into account, you could actually use it to plan the optimal
          | time of day to post a specific story! You could just use the
         | reward model to compute a predicted score for 8 different
         | versions of your content, holding the post title/text constant
         | across them all and just changing the date. Based on the
         | differences in scores, you can determine which posting time the
         | RM thinks is most likely to make your post successful!
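          | 
          | A minimal sketch of what that could look like (the
          | predict_score helper here is hypothetical, standing in
          | for a call to the reward model):
          | 
          |   from datetime import datetime, timedelta
          | 
          |   def best_posting_time(title, text, times, predict_score):
          |       # Hold title/text constant, vary only the date, and
          |       # let the reward model rank the variants.
          |       scored = [(t, predict_score(title, text, t))
          |                 for t in times]
          |       return max(scored, key=lambda pair: pair[1])[0]
          | 
          |   # e.g. one candidate time every 3 hours across a day
          |   base = datetime(2024, 10, 28)
          |   times = [base + timedelta(hours=3 * i) for i in range(8)]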
        
           | pixl97 wrote:
           | >you could actually use it to plan the optimal time of day to
           | post a specific story!
           | 
           | You see this on Reddit pretty commonly.
           | 
            | Someone posts original content at an off time and gets a
            | small/moderate number of upvotes. Then some time later
            | (could be hours, days, or weeks) a bot/karma account will
            | repost the content at an optimal time to farm upvotes.
        
       | oli5679 wrote:
        | If you withhold a small amount of data, or even retrain on a
        | sample of your training data, then isotonic regression is a
        | good way to solve many calibration problems.
       | 
       | https://scikit-learn.org/dev/modules/generated/sklearn.isoto...
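        | 
        | A minimal sketch of that calibration step with scikit-learn
        | (assuming preds_val/actuals_val come from a held-out split):
        | 
        |   from sklearn.isotonic import IsotonicRegression
        | 
        |   # Fit a monotone mapping from raw model predictions to
        |   # observed scores on held-out data...
        |   iso = IsotonicRegression(out_of_bounds="clip")
        |   iso.fit(preds_val, actuals_val)
        | 
        |   # ...then apply it to calibrate new predictions.
        |   calibrated = iso.predict(preds_test)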
       | 
       | I also agree with your intuition that if your output is censored
       | at 0, with a large mass there, it's good to create two models,
        | one for the likelihood of zero karma, and another for expected
        | karma conditional on it being non-zero.
        
         | Y_Y wrote:
          | Did you dictate this? It looks like you typo'd/brain-o'd
         | "centered" into "censored", but even allowing for phonetic
         | mistakes (of which I make many) and predictive text flubs, I
         | still can't understand how this happened.
        
           | CaptainFever wrote:
            | I'm not the parent commenter, but Whisper-based dictation
            | is getting pretty awesome nowadays. It's almost as good as
            | sci-fi.
           | 
           | (Fully dictated, no edits except for this)
        
           | oli5679 wrote:
            | I was thinking of censoring; maybe I should have said
            | another word, like "floored".
           | 
            | The reason I think of this as censoring is that there are
            | some classical statistical models that model a distribution
            | with a large mass at a minimum threshold, e.g. "tobit"
            | censored regression.
           | 
           | https://en.wikipedia.org/wiki/Censoring_(statistics)
        
             | Y_Y wrote:
             | Thanks for the explanation. I never paid much attention in
             | my stats lectures so I deserve to have missed out on that
             | term-of-art. I think the physics lingo would be to call it
             | "capped" or "bounded" or "constrained".
        
               | oli5679 wrote:
                | Thanks, it's very understandable that you thought I
                | was mistyping 'centred'.
        
           | 1024core wrote:
           | I also thought that the commenter spoke "centered" and the
           | speech recognition model output "censored".
        
         | kcorbitt wrote:
          | I hadn't heard of isotonic regression before but I like it!
         | 
          | > it's good to create two models, one for the likelihood of
          | zero karma, and another for expected karma conditional on it
          | being non-zero.
         | 
         | Another way to do this is to keep a single model but have it
         | predict two outputs: (1) likelihood of zero karma, and (2)
         | expected karma if non-zero. This would require writing a custom
         | loss function which sounds intimidating but actually isn't too
         | bad.
         | 
         | If I were actually putting a model like this into production at
         | HN I'd likely try modeling the problem in that way.
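          | 
          | A rough sketch of that loss in PyTorch (the two-output head
          | and all names here are hypothetical, not code from the
          | post):
          | 
          |   import torch.nn.functional as F
          | 
          |   def zero_inflated_loss(outputs, scores):
          |       # outputs[:, 0]: logit for P(score == 0)
          |       # outputs[:, 1]: predicted karma if non-zero
          |       is_zero = (scores == 0).float()
          |       cls_loss = F.binary_cross_entropy_with_logits(
          |           outputs[:, 0], is_zero)
          |       nonzero = scores > 0
          |       if nonzero.any():
          |           reg_loss = F.mse_loss(outputs[nonzero, 1],
          |                                 scores[nonzero].float())
          |       else:  # batch with no non-zero examples
          |           reg_loss = outputs.sum() * 0.0
          |       return cls_loss + reg_loss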
        
       | sdflhasjd wrote:
       | It's interesting that service complaints are so popular on HN. I
       | always feel a bit bad that my most popular HN contribution was me
        | complaining about a popular service.
        
         | Rick76 wrote:
          | I don't like it, but it seems the internet always reacts more
          | to inherently negative posts. That seems to be common across
          | the entire internet; I think that's why it doesn't feel as
          | fun as it did 10 years ago.
          | 
          | I'm sure it's just human psychology, but I'm trying to
          | overcome it and make my life more positive again.
        
         | andrewmcwatters wrote:
          | I suspect a large percentage of Dan's work moderating HN is
          | downweighting posts that incite engagement from frustration.
          | On at least one occasion I've had the top comment in a
          | thread, ahead by over 100 upvotes, that purely captured the
          | sentiment of several readers but did not contribute to the
          | curated voice of the community.
        
         | Karrot_Kream wrote:
         | A popular theory on techie parts of the web is that engagement-
         | optimizing sites create this negativity loop, but I disagree. I
         | think negativity is naturally something that people seek no
         | matter what the algorithm is. In an upvote based site, outrage
          | ranks to the top. I also think text-based platforms suffer
          | from negative engagement much more so than multimedia
          | platforms.
         | 
         | Model correlation is decent here but there's certainly more to
         | do to use its outputs predictively.
        
           | jerjerjer wrote:
           | Humans love having something to be righteously indignant
           | about.
        
           | Vampiero wrote:
            | If that theory were true, then what about every website on
            | the internet pre-2010? What about 4chan?
           | 
           | See also https://en.wikipedia.org/wiki/Negativity_bias
           | 
           | We're just built like that.
           | 
           | Regarding text platforms suffering more than non-text
           | platforms, I think it's because of the lack of social cues
           | that are otherwise there. You can infer a lot from the way
           | someone talks, or from their body language. You can't infer
           | much from text, which is partly why Poe's law exists --
           | sarcasm doesn't translate well.
        
             | Karrot_Kream wrote:
             | > what about every website on the internet pre-2010
             | 
             | It was definitely there. Plenty of forums had "rant
             | threads" that were efforts to quarantine shitty reactionary
             | behavior like this. Also a lot of the healthier forums were
             | smaller forums. I was on plenty of forums that had 10-20
             | folks on them that today would just be a Telegram group
             | chat or a small Discord "server". These small spaces tend
             | to be a lot lower on toxicity than larger fora. I was part
             | of a few large fora like Gaia Online and they were just as
             | toxic as today's large platforms. Managing large
             | communities with chronological posting is really difficult
             | and upvote based social networks were the first real
             | networks to be able to scale to larger userbases without
             | having hundreds of moderators (like Gaia or the large
             | MUDs.)
             | 
             | > What about 4chan?
             | 
             | 4chan is immune because the default emotional register
              | there is indignant dismissal. Because of this it's just a
              | matter of choosing what else to layer on top of the
              | indignant dismissal, like sarcasm or anger or whatnot.
             | 
             | > Regarding text platforms suffering more than non-text
             | platforms, I think it's because of the lack of social cues
             | that are otherwise there. You can infer a lot from the way
             | someone talks, or from their body language. You can't infer
             | much from text, which is partly why Poe's law exists.
             | 
             | That's an interesting theory actually. My theory was that
             | in the age of multimedia platforms, text platforms tend to
             | attract folks who specifically want to use text over
             | multimedia. Generally text forums will select for folks
             | with social or self-esteem issues. These folks are the
             | least likely to healthily deal with their emotions or
              | disengage positively. This leads to higher toxicity on
              | text-based platforms.
        
               | Vampiero wrote:
               | > My theory was that in the age of multimedia platforms,
               | text platforms tend to attract folks who specifically
               | want to use text over multimedia. Generally text forums
               | will select for folks with social or self-esteem issues.
                | These folks are the least likely to healthily deal
                | with their emotions or disengage positively. This
                | leads to higher toxicity on text-based platforms.
               | 
               | Yeah that's very plausible indeed
        
           | miki123211 wrote:
           | As a mastodon user, I can definitely confirm this.
           | 
            | Give people a way to repost / retweet / boost, and your
           | feed suddenly turns into mostly negativity, even if your
           | algorithm is "show posts from my followers only, newest to
           | oldest"
        
             | Karrot_Kream wrote:
              | Yeah, my Bluesky follows are carefully curated to keep my
              | feed from swelling into negativity. I've been playing
              | around with a labeller that filters followed posts into
              | those I find emotionally pleasant, which I've been
              | training on my own labeling of those posts. The goal is
              | to follow more people and have the labeller (or feed
              | generator, depending on how I go) hide the posts I don't
              | care for.
        
         | kelnos wrote:
         | I flag most complaint posts, unless the complaint actually
         | brings to light or discusses something surprising or unique
         | that can be generalized and discussed.
         | 
         | I generally find these posts pretty boring, and most comments
         | on them are people recounting their own stories about how that
         | (or a similar) service screwed them over. I suppose they can be
         | a decent way to warn people off of a particular product
         | (scammy, terrible customer support, whatever), but that's not
         | what I come to HN for.
        
       | swyx wrote:
        | > This query took 17 seconds to load the dataset into RAM and
       | then aggregating by type was almost instant. It is absolutely
       | incredible to me that I can load every HN post and comment ever
       | into RAM in a few seconds on my (admittedly beefy) dev laptop,
       | and analyze them at will. What an age of abundance!
       | 
       | https://motherduck.com/blog/big-data-is-dead/
        
       | suyash wrote:
        | Very interesting project. I'd love to read a more technical
        | write-up on how the model was architected and trained; any
        | pointers?
        
         | kcorbitt wrote:
         | I link to it from the post, but all the code is open source!
         | You can find the specific training script here:
         | https://github.com/OpenPipe/best-hn/blob/main/stories_train_...
         | 
         | And all the graphs for the blog are from this notebook:
         | https://github.com/OpenPipe/best-hn/blob/main/blog-figures.i...
         | 
         | Lots of other good stuff in that repo, although it's only
         | organized to a "working researcher" standard I'm afraid.
        
       | ChrisArchitect wrote:
        | The first problem with the submissions that supposedly 'would
        | do well on HN': other than the Ask HNs, they misuse the
        | submission form by putting the content in a text post instead
        | of sharing it directly as a link post. And sketchy new/inactive
        | accounts. C'mon. Not gonna keep reading grifty posts after that
        | opening.
        
       | youoy wrote:
       | Thanks for sharing! Very interesting.
       | 
       | > The correlation is actually not bad (0.53), but our model is
       | very consistently over-estimating the score at the low end, and
       | underestimating it at the high end. This is surprising; some
       | variation on any given data point is expected, but such a
       | consistent mis-estimation trend isn't what we'd expect.
       | 
        | This is a consequence of the model objective. If the model
        | can't tell what really drives the score, pulling predictions
        | toward the mean is a good way to reduce the overall error. If
        | you instead try to exactly predict the very highs and very
        | lows, you will get very high errors on those, resulting in a
        | bigger overall error.
       | 
        | Apart from that, I want to comment on AI alignment here. For
        | me, the objective of "most upvotes" is not fully correlated
        | with where I get the most value on HN. Most of the time I
        | would have found the most upvoted stories anyway on other
        | platforms; it's the middle range that I really like. So be
        | careful implementing this algorithm at scale; it could turn
        | the website into another platform with shitty AI
        | recommendations.
        
         | kcorbitt wrote:
          | > For me, the objective of "most upvotes" is not fully
          | correlated with where I get the most value on HN. Most of the
          | time I would have found the most upvoted stories anyway on
          | other platforms.
         | 
         | Yes, this is a fantastic point. I'm curious if there's some
         | other measurable proxy metric for "things I get the most value
         | out of on HN"? Upvotes seems like the most natural but
         | optimizing for it too strongly would definitely take HN down a
         | dark path.
        
           | losteric wrote:
           | Perhaps selecting for posts with the highest quality reply
           | engagement? If many different people were drawn to lengthy
           | discussions, that suggests the content sparks thoughts that
           | others then feel compelled to engage with. Or select for the
           | emotional content of replies, awe/empathy/anger, depending on
           | what one wants from HN?
        
             | kcorbitt wrote:
             | Ohh, I really like that as a potential proxy metric!
        
             | hatthew wrote:
              | Lots of platforms optimize for engagement, but all that
              | does is encourage ragebait.
        
       | jerjerjer wrote:
       | > In this case, I included the post title, author, date, and
       | content. All of those factors could be relevant to the chance a
       | story gets voted up.
       | 
       | > Even if the model gets extremely good at predicting
       | final_score_if_it_hits_front_page, there's still the inherent
       | randomness of probability_of_hitting_front_page that is
       | fundamentally unpredictable.
       | 
       | In addition to date, you might want to include three fields:
       | 
       | - day of week (categorical)
       | 
       | - is weekend/holiday (boolean)
       | 
       | - hour or time of the day (categorical, you can have 24 of them
       | or morning/afternoon/etc.).
       | 
        | The probability of a post hitting the front page is usually
        | affected by these things, so including them can really help
        | the model.
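        | 
        | A quick sketch of deriving those fields with pandas (the
        | "date" column name is an assumption):
        | 
        |   import pandas as pd
        | 
        |   df["date"] = pd.to_datetime(df["date"], utc=True)
        |   df["day_of_week"] = df["date"].dt.day_name()
        |   df["is_weekend"] = df["date"].dt.dayofweek >= 5
        |   df["hour_of_day"] = df["date"].dt.hour  # 0-23
        |   # (holidays would need a calendar lookup on top of this)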
        
         | kcorbitt wrote:
         | Yep that makes sense. Would be interesting to do a follow-up
         | that explicitly includes these variables and see if it
         | meaningfully improves the results.
        
         | sitkack wrote:
          | I find that the best stories get posted by folks in EU time
          | zones, as well as on the weekend (more of a hacker ethos).
          | The flamebait startup drama is M-F Pacific.
        
         | maaaaattttt wrote:
          | I wonder if hour of day would need to be combined with HN's
          | visitor location data to be truly relevant. The location is
          | embedded in the time somehow, as long as the visitors'
          | origins are stable over time. If 9am PT is a popular time
          | and most of the visitors are in the PT timezone, then even
          | if that 9am PT is encoded as UTC the model will pick it up
          | (I think). Now, if over time visitors get more diverse and a
          | big chunk starts coming from Europe, the original 9am will
          | make less sense to the model. Adding visitor-origin stats at
          | the time of the post would probably even help surface
          | regional trends. But I guess this historical data isn't
          | public.
        
         | jedberg wrote:
         | I haven't run the data, but anecdotally I can tell you that
         | those things probably don't affect hitting the front page. They
         | _do_ affect the total score, but that is not what is being
         | optimized here.
         | 
         | It's counterintuitive, but if you post at a really popular
         | time, you're competing with a lot of other submissions. If you
         | post at a really slow time, you'll get fewer votes, but it will
         | take fewer to reach the front page and you'll have less
         | competition.
         | 
         | In the end, it kinda evens out. The number of votes it takes to
         | get to the front page _and_ the number of competing submissions
         | are both correlated to your fields above.
        
       | Arctic_fly wrote:
       | > But in 2015 there is a stark discontinuity, where the number of
       | stories (with text) shoots up by >10x, and the average score
       | drops by 5x! Is this some kind of eternal September?
       | 
        | Based on the later analysis in the post (which I agree with),
        | the total score of a submission is disproportionately tied to
        | whether it hits the front page, and of course how long it
        | stays there. Regardless of the quality of the average post
        | starting in 2015, the sheer quantity would make it impossible
        | for all but a few to stay on the front page for very long.
        | Hacker News got more popular, so each story got less prime
        | time.
        
       | 6gvONxR4sf7o wrote:
       | Why use RL for this instead of plain old supervised learning?
        
         | dinobones wrote:
         | I am trying to understand this too.
         | 
          | In supervised learning you train on pairs of (x, y), where x
          | is your input (title/post text/metadata) and y is the output
          | score.
          | 
          | Naively, it's a linear regression model, Y = b0 + b1x1 +
          | b2x2 + b3x3, where b0 is the intercept ("a floor for score
          | points") and b1, b2, and b3 are coefficients on the actual
          | features of the post. You can solve this in closed form and
          | find the b1/b2/b3 that minimize the error of fitting to Y.
         | 
         | How do these equations change with RL? I always assumed RL was
         | a multi-step process where actions are taken to get to a
         | reward. If there is only 1 step/decision, to produce a "random"
         | score, it feels much like supervised learning.
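          | 
          | For what it's worth, the closed form is just the normal
          | equations; a minimal numpy sketch (x1/x2/x3 are
          | hypothetical feature vectors):
          | 
          |   import numpy as np
          | 
          |   # A leading column of ones gives the intercept b0.
          |   X = np.column_stack([np.ones(len(y)), x1, x2, x3])
          |   b, *_ = np.linalg.lstsq(X, y, rcond=None)
          |   # b[0] is b0; b[1:] are the coefficients b1..b3.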
        
           | jampekka wrote:
           | The post is not doing RL. It's just regression as you
           | thought.
        
             | billmalarky wrote:
             | This post is using regression to build a reward model. The
             | reward model will then be used (in a future post) to build
             | the overall RL system.
             | 
             | Here's the relevant text from the article:
             | 
             | >In this post we'll discuss how to build a reward model
             | that can predict the upvote count that a specific HN story
             | will get. And in follow-up posts in this series, we'll use
             | that reward model along with reinforcement learning to
             | create a model that can write high-value HN stories!
        
         | jampekka wrote:
         | It is just plain old supervised learning. A regression from the
         | post features to vote count. The RL discussion in TFA is a bit
         | confusing.
         | 
         | Such a model can be used as the "reward model" for the
         | "reinforcement learning from human feedback" (RLHF) method.
        
       | kelnos wrote:
       | I don't get the conclusion the author is trying to draw. If you
       | look at the data presented, it seems that the model was actually
       | pretty bad at guessing the real-world behavior of the posts
       | listed. Out of the top ten it picked:
       | 
       | * 1 had a score that was reasonably close (8.4%) to what the
       | model predicted
       | 
       | * 4 had scores wildly lower than the model predicted
       | 
       | * 2 had scores wildly higher than the model predicted
       | 
       | * the remaining 3 were not wildly off, but weren't really that
       | close either (25%-42% off)
       | 
       | Then there's a list of 10 submissions that the model predicted
       | would have scores ranging from 33 to 135, but they all only
       | received a score of 1 in reality.
       | 
       | The graph shown paints a bit of a better picture, I guess, but
       | it's still not all that compelling to me.
        
         | kcorbitt wrote:
          | This is a fair point. The reason I think "correlation" is a
          | better metric than "predicts the exact correct score" is how
          | I'll be using this model in the next post.
         | 
         | Broadly, the main use case for this model (in the RL context)
         | will be to take two different versions of the same post, and
         | predict which of the two is more likely to be upvoted. So what
         | matters isn't that it gets the exact number of upvotes
         | correctly, but that it correctly predicts the relative
         | difference in likely upvote count between two variants.
         | 
         | Now it still doesn't do a _great_ job at that (the correlation
         | is only 0.53 after all) but it still does a good enough job to
         | provide some useful signal.
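          | 
          | Concretely, the pairwise usage is something like this
          | (hypothetical predict_score helper wrapping the RM):
          | 
          |   def prefer(a, b, predict_score):
          |       # The RM only needs to get the ordering right,
          |       # not the absolute score.
          |       a_wins = predict_score(a) >= predict_score(b)
          |       return a if a_wins else b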
        
       | chx wrote:
        | > That's not much time for a model that (hopefully)
        | understands all of HN!
        | 
        | This is dangerous talk.
        | 
        | It doesn't understand anything at all.
        | 
        | Reminder: we are more prone to anthropomorphizing LLMs than to
        | humanizing suffering humans.
        
       | 1024core wrote:
        | Is my understanding correct that the reward model is also
        | similar to an LLM (with the difference being that it predicts
        | a score instead of the next token)?
        
         | kcorbitt wrote:
         | Yes! The architecture is almost identical. The only difference
         | is in the final layer. In an LLM used for text generation, the
         | final layer has a separate output for every potential token the
         | model could produce, and we decide which token to generate by
         | choosing the one with the highest likelihood at each generation
         | step (at least that's what the simplest sampling methods do).
         | In an LLM used as a reward model, we only have one output in
         | the final layer, and we interpret its value as the predicted
         | reward.
         | 
         | Everything else in the model before that final layer is exactly
         | identical, architecture-wise.
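          | 
          | With the HuggingFace transformers library, that head swap
          | is roughly the following (a sketch; the base model name is
          | just a placeholder):
          | 
          |   from transformers import AutoModelForSequenceClassification
          | 
          |   # num_labels=1 replaces the vocabulary-sized output layer
          |   # with a single scalar we interpret as the reward.
          |   model = AutoModelForSequenceClassification.from_pretrained(
          |       "meta-llama/Llama-3.1-8B", num_labels=1)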
        
           | 1024core wrote:
           | But a typical LLM has a feedback loop: it looks at the last
           | token it generated and then decides, given the N tokens
           | before that, which token to output next.
           | 
           | In the case of a reward model, are you streaming in the list
           | of tokens; if so, what is the output after each token? Or are
           | you feeding in all of the tokens in one shot, with the
           | predicted reward as the output?
        
             | maleldil wrote:
              | There are multiple ways to model reward. You can make it
              | fine-grained, such that every token gets its own reward,
              | but by far the most common approach is to feed in the
              | whole sequence and generate a single reward at the end.
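              | 
              | In code, "a single reward at the end" usually means
              | taking the hidden state at the final token and passing
              | it through the scalar head, something like this
              | (PyTorch-style sketch, names assumed):
              | 
              |   hidden = transformer(ids)  # (batch, seq, dim)
              |   last = hidden[:, -1, :]    # final token only
              |   reward = value_head(last)  # (batch, 1)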
        
               | 1024core wrote:
               | I guess I'm not sure how the "feed in the whole sequence"
               | works, if there's a single reward at the end.
        
       | floobertoober wrote:
        | Maybe it would help to use a Box-Cox transform on the score
        | distribution?
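        | 
        | For reference, scipy has this built in (HN scores start at 1,
        | so they're strictly positive as Box-Cox requires):
        | 
        |   from scipy.stats import boxcox
        | 
        |   # lambda is chosen by maximum likelihood when not given
        |   transformed, lam = boxcox(scores)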
        
       ___________________________________________________________________
       (page generated 2024-10-28 23:00 UTC)