[HN Gopher] Using reinforcement learning and $4.80 of GPU time t...
___________________________________________________________________
Using reinforcement learning and $4.80 of GPU time to find the best
HN post
Author : kcorbitt
Score : 118 points
Date : 2024-10-28 17:17 UTC (5 hours ago)
(HTM) web link (openpipe.ai)
(TXT) w3m dump (openpipe.ai)
| kcorbitt wrote:
| Hey all, this project was a labor of love I worked on in my spare
| time over the last couple of weeks. Happy to answer any
| questions!
| eugenekolo wrote:
| What does the model say about this post?
| kcorbitt wrote:
| Haha great question. Since it's only trained on on-platform HN
| content and not external links, this post is a little bit out
| of distribution for it unfortunately. I'm thinking about
| scraping a corpus of external links and running the same
| analysis though, in which case I'd definitely run it on this
| story because I'm also curious about that. :)
| Rick76 wrote:
| I would be very interested in the results of that as well
| Havoc wrote:
| Nice write up.
|
| Did you ever figure out what happened in 2016?
| kcorbitt wrote:
| Nope. I was actually planning on asking dang if he has any
| insights there. If he sees this thread hopefully he can chime
| in!
| twoodfin wrote:
| I think text vs. link used to be XOR, but isn't any longer.
|
 | It's still outside the HN mainstream to use both in the same
 | submission, so that might be biasing the model in strange
 | ways.
| jerjerjer wrote:
| From the post:
|
| > But to simplify, instead I'll just limit to stories that
| have only text bodies, instead of links.
|
 | This implies that both pre- and post-2016 stories are text-
 | only, so that change shouldn't affect the data much.
| kelnos wrote:
| In case he doesn't, you might as well email him about it.
| He's a very responsive guy and might find it interesting.
| n2d4 wrote:
| Given that Google Trends doesn't show that bump, I'd assume
| it has to do with how the data was collected. Maybe all
| stories with < X votes/comments older than 2015 are not
| included, or deleted from whatever index you used?
| pclmulqdq wrote:
| There is a timing factor that you need to consider, too.
| Anecdotally, Sunday morning is the best time to get onto the
| front page, while Tuesday or Wednesday morning gets you the most
| views.
| kcorbitt wrote:
| Yep, that's why I included the post date in the information
| available to the model; in theory (if it's smart enough) it
 | should be able to take that into account. That said, I didn't
 | include time-of-day; it would be interesting to see whether
 | adding that information would make the model more accurate!
|
 | If the reward model is indeed smart enough to take that into
 | account, you could actually use it to plan the optimal time of
 | day to post a specific story! You could just use the
| reward model to compute a predicted score for 8 different
| versions of your content, holding the post title/text constant
| across them all and just changing the date. Based on the
| differences in scores, you can determine which posting time the
| RM thinks is most likely to make your post successful!
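 |
 | A minimal sketch of what that could look like, assuming a
 | hypothetical predict_score(title, text, date) wrapper around
 | the reward model (not code from the post):
 |
 |     from datetime import datetime, timedelta
 |
 |     def best_posting_time(title, text, base_date, predict_score):
 |         # Score 8 copies of the same post, 3 hours apart, holding
 |         # title/text constant and varying only the date the RM sees.
 |         candidates = [base_date + timedelta(hours=3 * i) for i in range(8)]
 |         scored = [(predict_score(title, text, d), d) for d in candidates]
 |         return max(scored)[1]  # the posting time the RM likes best
 |
 |     # e.g. best_posting_time(title, text, datetime(2024, 10, 28),
 |     #                        my_reward_model_fn)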
| pixl97 wrote:
| >you could actually use it to plan the optimal time of day to
| post a specific story!
|
| You see this on Reddit pretty commonly.
|
 | Someone posts original content at an off time and gets a
 | small-to-moderate number of upvotes. Then some time later (could
| be hours, days, or weeks) a bot/karma account will post the
| content at an optimal time to farm upvotes.
| oli5679 wrote:
 | If you withhold a small amount of data, or even retrain on a
 | sample of your training data, then isotonic regression is a good
 | way to solve many calibration problems.
|
| https://scikit-learn.org/dev/modules/generated/sklearn.isoto...
|
 | I also agree with your intuition that if your output is censored
 | at 0, with a large mass there, it's good to create two models:
 | one for the likelihood of zero karma, and another for expected
 | karma, conditional on it being non-zero.
| Y_Y wrote:
 | Did you dictate this? It looks like you typo'd/brain-o'd
 | "centered" into "censored", but even allowing for phonetic
| mistakes (of which I make many) and predictive text flubs, I
| still can't understand how this happened.
| CaptainFever wrote:
 | I'm not the parent commenter, but Whisper-based dictation is
 | getting pretty awesome nowadays. It's almost as good as sci-fi.
|
| (Fully dictated, no edits except for this)
| oli5679 wrote:
 | I was thinking of censoring; maybe I should have said another
 | word, like floored.
 |
 | The reason I think of this as censoring is that there are
 | some classical statistical models that model a distribution
 | with a large mass at a minimum threshold, e.g. "tobit"
 | censored regression.
|
| https://en.wikipedia.org/wiki/Censoring_(statistics)
| Y_Y wrote:
| Thanks for the explanation. I never paid much attention in
| my stats lectures so I deserve to have missed out on that
| term-of-art. I think the physics lingo would be to call it
| "capped" or "bounded" or "constrained".
| oli5679 wrote:
 | Thanks, it's very understandable that you thought I was
 | mistyping 'centred'.
| 1024core wrote:
| I also thought that the commenter spoke "centered" and the
| speech recognition model output "censored".
| kcorbitt wrote:
 | I hadn't heard of isotonic regression before but I like it!
|
 | > it's good to create two models: one for the likelihood of
 | zero karma, and another for expected karma, conditional on it
 | being non-zero.
|
| Another way to do this is to keep a single model but have it
| predict two outputs: (1) likelihood of zero karma, and (2)
| expected karma if non-zero. This would require writing a custom
| loss function which sounds intimidating but actually isn't too
| bad.
|
| If I were actually putting a model like this into production at
| HN I'd likely try modeling the problem in that way.
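 |
 | As a rough illustration (PyTorch, assuming the model emits a
 | zero-karma logit and a karma estimate per example; not the
 | post's actual code):
 |
 |     import torch
 |     import torch.nn.functional as F
 |
 |     def two_part_loss(zero_logit, karma_pred, karma_true):
 |         # Head 1: did the post get zero karma? (binary cross-entropy)
 |         is_zero = (karma_true == 0).float()
 |         loss_zero = F.binary_cross_entropy_with_logits(zero_logit, is_zero)
 |         # Head 2: regress karma only on the non-zero examples.
 |         mask = karma_true > 0
 |         loss_karma = (
 |             F.mse_loss(karma_pred[mask], karma_true[mask])
 |             if mask.any() else torch.tensor(0.0)
 |         )
 |         return loss_zero + loss_karma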
| sdflhasjd wrote:
| It's interesting that service complaints are so popular on HN. I
| always feel a bit bad that my most popular HN contribution was me
 | complaining about a popular service.
| Rick76 wrote:
 | I don't like it, but it seems the internet always reacts more
 | strongly to inherently negative posts. That seems to be common
 | across the entire internet, and I think it's why it doesn't
 | seem as fun as it did 10 years ago.
|
| I'm sure it's just human psyche but I'm trying to overcome it
| and make my life more positive again
| andrewmcwatters wrote:
 | I suspect a large percentage of Dan's work moderating HN is
 | downweighting posts that incite engagement through frustration.
 | On at least one occasion I've had the top comment in a thread,
 | ahead by over 100 upvotes, that purely echoed the sentiment of
 | several readers but did not contribute to the curated voice of
 | the community.
| Karrot_Kream wrote:
| A popular theory on techie parts of the web is that engagement-
| optimizing sites create this negativity loop, but I disagree. I
| think negativity is naturally something that people seek no
 | matter what the algorithm is. In an upvote-based site, outrage
 | rises to the top. I also think text-based platforms suffer from
 | negative engagement much more so than multimedia platforms.
|
| Model correlation is decent here but there's certainly more to
| do to use its outputs predictively.
| jerjerjer wrote:
| Humans love having something to be righteously indignant
| about.
| Vampiero wrote:
 | If that theory were true, then what about every website on
 | the internet pre-2010? What about 4chan?
|
| See also https://en.wikipedia.org/wiki/Negativity_bias
|
| We're just built like that.
|
| Regarding text platforms suffering more than non-text
| platforms, I think it's because of the lack of social cues
| that are otherwise there. You can infer a lot from the way
| someone talks, or from their body language. You can't infer
| much from text, which is partly why Poe's law exists --
| sarcasm doesn't translate well.
| Karrot_Kream wrote:
| > what about every website on the internet pre-2010
|
| It was definitely there. Plenty of forums had "rant
| threads" that were efforts to quarantine shitty reactionary
| behavior like this. Also a lot of the healthier forums were
| smaller forums. I was on plenty of forums that had 10-20
| folks on them that today would just be a Telegram group
| chat or a small Discord "server". These small spaces tend
| to be a lot lower on toxicity than larger fora. I was part
| of a few large fora like Gaia Online and they were just as
| toxic as today's large platforms. Managing large
| communities with chronological posting is really difficult
| and upvote based social networks were the first real
| networks to be able to scale to larger userbases without
| having hundreds of moderators (like Gaia or the large
| MUDs.)
|
| > What about 4chan?
|
| 4chan is immune because the default emotional register
| there is indignant dismissal. Because of this it's just a
 | matter of choosing what else to layer on top of the
 | indignant dismissal, like sarcasm or anger or whatnot.
|
| > Regarding text platforms suffering more than non-text
| platforms, I think it's because of the lack of social cues
| that are otherwise there. You can infer a lot from the way
| someone talks, or from their body language. You can't infer
| much from text, which is partly why Poe's law exists.
|
| That's an interesting theory actually. My theory was that
| in the age of multimedia platforms, text platforms tend to
| attract folks who specifically want to use text over
| multimedia. Generally text forums will select for folks
| with social or self-esteem issues. These folks are the
| least likely to healthily deal with their emotions or
| disengage positively. This leads to higher toxicity on text
| based platforms.
| Vampiero wrote:
| > My theory was that in the age of multimedia platforms,
| text platforms tend to attract folks who specifically
| want to use text over multimedia. Generally text forums
| will select for folks with social or self-esteem issues.
 | These folks are the least likely to healthily deal with
| their emotions or disengage positively. This leads to
| higher toxicity on text based platforms.
|
| Yeah that's very plausible indeed
| miki123211 wrote:
| As a mastodon user, I can definitely confirm this.
|
 | Give people a way to repost / retweet / boost, and your
 | feed suddenly turns into mostly negativity, even if your
 | algorithm is "show posts from my followers only, newest to
 | oldest".
| Karrot_Kream wrote:
 | Yeah, the accounts I follow on Bluesky are carefully curated
 | to keep my feed from swelling with negativity. I've been
 | playing around with a labeller that filters followed posts
 | down to those I find emotionally pleasant, which I've been
 | training on my own labeling of followers' posts. The goal is
 | to follow more people and have the labeller (or feed generator,
 | depending on how I go) hide the posts I don't care for.
| kelnos wrote:
| I flag most complaint posts, unless the complaint actually
| brings to light or discusses something surprising or unique
| that can be generalized and discussed.
|
| I generally find these posts pretty boring, and most comments
| on them are people recounting their own stories about how that
| (or a similar) service screwed them over. I suppose they can be
| a decent way to warn people off of a particular product
| (scammy, terrible customer support, whatever), but that's not
| what I come to HN for.
| swyx wrote:
 | > This query took 17 seconds to load the dataset into RAM and
| then aggregating by type was almost instant. It is absolutely
| incredible to me that I can load every HN post and comment ever
| into RAM in a few seconds on my (admittedly beefy) dev laptop,
| and analyze them at will. What an age of abundance!
|
| https://motherduck.com/blog/big-data-is-dead/
| suyash wrote:
 | Very interesting project. Would love to read a more technical
 | write-up on how the model was architected and trained; any
 | pointers?
| kcorbitt wrote:
| I link to it from the post, but all the code is open source!
| You can find the specific training script here:
| https://github.com/OpenPipe/best-hn/blob/main/stories_train_...
|
| And all the graphs for the blog are from this notebook:
| https://github.com/OpenPipe/best-hn/blob/main/blog-figures.i...
|
| Lots of other good stuff in that repo, although it's only
| organized to a "working researcher" standard I'm afraid.
| ChrisArchitect wrote:
 | First problem with the submissions that supposedly 'would do well
 | on HN': other than the Ask HN, they misuse the submission format
 | by putting the content in a text post instead of sharing it as a
 | link post directly. And the accounts are sketchy and
 | new/inactive. C'mon. Not gonna keep reading a grifty post after
 | that opening.
| youoy wrote:
| Thanks for sharing! Very interesting.
|
| > The correlation is actually not bad (0.53), but our model is
| very consistently over-estimating the score at the low end, and
| underestimating it at the high end. This is surprising; some
| variation on any given data point is expected, but such a
| consistent mis-estimation trend isn't what we'd expect.
|
 | This is a consequence of the model objective. If you don't know
 | what is really happening, a good way of reducing the overall
 | error is to pull predictions toward the mean. If you instead try
 | to exactly predict the very highs and very lows, you will get
 | very high errors on those, resulting in a bigger overall error.
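 |
 | A toy numpy example of that effect (my illustration, not from
 | the post): if observed scores mix predictable quality with
 | noise the model can't see, the MSE-optimal prediction (the
 | conditional mean) has a much narrower spread than the scores
 | themselves.
 |
 |     import numpy as np
 |
 |     rng = np.random.default_rng(0)
 |     quality = rng.normal(size=100_000)  # what the model can see
 |     luck = rng.normal(size=100_000)     # front-page randomness
 |     score = quality + 2 * luck          # observed karma (toy units)
 |
 |     # Best MSE prediction is `quality`; its spread is ~1.0 while
 |     # the scores spread ~2.24, so the model looks like it
 |     # over-estimates the low end and under-estimates the high end.
 |     print(score.std(), quality.std())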
|
 | Apart from that, I want to comment on AI alignment here. For me
 | the objective of "most upvotes" is not fully correlated with
 | where I get the most value on HN. Most of the time, I would have
 | found the most upvoted stories anyway on other platforms. It's
 | the middle range that I really like. So be careful implementing
 | this algorithm at scale; it could turn the website into another
 | platform with shitty AI recommendations.
| kcorbitt wrote:
 | > For me the objective of "most upvotes" is not fully
 | correlated with where I get the most value on HN. Most of the
 | time, I would have found the most upvoted stories anyway on
 | other platforms.
|
| Yes, this is a fantastic point. I'm curious if there's some
| other measurable proxy metric for "things I get the most value
| out of on HN"? Upvotes seems like the most natural but
| optimizing for it too strongly would definitely take HN down a
| dark path.
| losteric wrote:
| Perhaps selecting for posts with the highest quality reply
| engagement? If many different people were drawn to lengthy
| discussions, that suggests the content sparks thoughts that
| others then feel compelled to engage with. Or select for the
| emotional content of replies, awe/empathy/anger, depending on
| what one wants from HN?
| kcorbitt wrote:
| Ohh, I really like that as a potential proxy metric!
| hatthew wrote:
 | Lots of platforms optimize for engagement, but all that
 | does is encourage ragebait.
| jerjerjer wrote:
| > In this case, I included the post title, author, date, and
| content. All of those factors could be relevant to the chance a
| story gets voted up.
|
| > Even if the model gets extremely good at predicting
| final_score_if_it_hits_front_page, there's still the inherent
| randomness of probability_of_hitting_front_page that is
| fundamentally unpredictable.
|
| In addition to date, you might want to include three fields:
|
| - day of week (categorical)
|
| - is weekend/holiday (boolean)
|
| - hour or time of the day (categorical, you can have 24 of them
| or morning/afternoon/etc.).
|
 | The probability of a post hitting the front page is usually
 | affected by these things, so including them can really help the
 | model.
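 |
 | A quick pandas sketch of deriving those fields (the holiday set
 | is a placeholder you'd fill in):
 |
 |     import pandas as pd
 |
 |     df = pd.DataFrame(
 |         {"created_at": pd.to_datetime(["2024-10-28 17:17"])}
 |     )
 |     holidays = set()  # assumption: dates that matter for HN traffic
 |     df["day_of_week"] = df["created_at"].dt.day_name()  # categorical
 |     df["is_weekend_or_holiday"] = (
 |         (df["created_at"].dt.dayofweek >= 5)
 |         | df["created_at"].dt.date.isin(holidays)
 |     )  # boolean
 |     df["hour_of_day"] = df["created_at"].dt.hour  # 24 categories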
| kcorbitt wrote:
| Yep that makes sense. Would be interesting to do a follow-up
| that explicitly includes these variables and see if it
| meaningfully improves the results.
| sitkack wrote:
 | I find that the best stories get posted by folks in EU time
 | zones, as well as on the weekend (more of a hacker ethos). The
 | flame-bait startup drama is M-F Pacific.
| maaaaattttt wrote:
 | I wonder if hour of day would need to be combined with HN's
 | visitor location data to be truly relevant? I think the
 | location is embedded in the time somehow, if the visitors'
 | origins are stable over time. If 9am PT is a popular time and
 | most of the visitors are in the PT timezone, then even if this
 | 9am PT is encoded as UTC the model will pick it up (I think).
 | Now, if over time visitors get more diverse and a big chunk
 | starts coming from Europe, this original 9am will make less
 | sense to the model. Adding visitor-origin stats at the time of
 | the post would probably even help surface regional trends. But
 | I guess this historical data isn't public.
| jedberg wrote:
| I haven't run the data, but anecdotally I can tell you that
| those things probably don't affect hitting the front page. They
| _do_ affect the total score, but that is not what is being
| optimized here.
|
| It's counterintuitive, but if you post at a really popular
| time, you're competing with a lot of other submissions. If you
| post at a really slow time, you'll get fewer votes, but it will
| take fewer to reach the front page and you'll have less
| competition.
|
| In the end, it kinda evens out. The number of votes it takes to
| get to the front page _and_ the number of competing submissions
| are both correlated to your fields above.
| Arctic_fly wrote:
| > But in 2015 there is a stark discontinuity, where the number of
| stories (with text) shoots up by >10x, and the average score
| drops by 5x! Is this some kind of eternal September?
|
 | Based on the later analysis in the post (which I agree with), the
 | total score of a post is disproportionately tied to whether it
 | hits the front page, and of course how long it stays there.
| Regardless of the quality of the average post starting in 2015,
| the sheer quantity would make it impossible for all but a few to
| stay on the front page for very long. Hacker News got more
| popular, so each story got less prime time.
| 6gvONxR4sf7o wrote:
| Why use RL for this instead of plain old supervised learning?
| dinobones wrote:
| I am trying to understand this too.
|
 | In supervised learning you train on pairs (x, y), where x is
 | your input (title/post text/metadata) and y is the output
 | score.
 |
 | Naively, it's a linear regression model, Y = b0 + b1x1 + b2x2 +
 | b3x3, where b0 is your bias/intercept ("a floor for score
 | points"), and b1, b2, and b3 are weights on the actual features
 | of the post. You can solve this in closed form and find the
 | b1/b2/b3 that minimize the error of fitting to Y.
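 |
 | For instance, a toy closed-form fit with numpy (random data
 | just to show the mechanics):
 |
 |     import numpy as np
 |
 |     X = np.random.rand(100, 3)  # x1..x3 per post
 |     y = np.random.rand(100)     # observed scores
 |     A = np.hstack([np.ones((100, 1)), X])  # prepend intercept col
 |     coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # [b0, b1, b2, b3]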
|
| How do these equations change with RL? I always assumed RL was
| a multi-step process where actions are taken to get to a
| reward. If there is only 1 step/decision, to produce a "random"
| score, it feels much like supervised learning.
| jampekka wrote:
| The post is not doing RL. It's just regression as you
| thought.
| billmalarky wrote:
| This post is using regression to build a reward model. The
| reward model will then be used (in a future post) to build
| the overall RL system.
|
| Here's the relevant text from the article:
|
| >In this post we'll discuss how to build a reward model
| that can predict the upvote count that a specific HN story
| will get. And in follow-up posts in this series, we'll use
| that reward model along with reinforcement learning to
| create a model that can write high-value HN stories!
| jampekka wrote:
| It is just plain old supervised learning. A regression from the
| post features to vote count. The RL discussion in TFA is a bit
| confusing.
|
| Such a model can be used as the "reward model" for the
| "reinforcement learning from human feedback" (RLHF) method.
| kelnos wrote:
| I don't get the conclusion the author is trying to draw. If you
| look at the data presented, it seems that the model was actually
| pretty bad at guessing the real-world behavior of the posts
| listed. Out of the top ten it picked:
|
| * 1 had a score that was reasonably close (8.4%) to what the
| model predicted
|
| * 4 had scores wildly lower than the model predicted
|
| * 2 had scores wildly higher than the model predicted
|
| * the remaining 3 were not wildly off, but weren't really that
| close either (25%-42% off)
|
| Then there's a list of 10 submissions that the model predicted
| would have scores ranging from 33 to 135, but they all only
| received a score of 1 in reality.
|
| The graph shown paints a bit of a better picture, I guess, but
| it's still not all that compelling to me.
| kcorbitt wrote:
 | This is a fair point. The reason I think "correlation" is a
 | better metric than "predicts the exact correct score" is how
 | I'll be using this model in the next post.
|
| Broadly, the main use case for this model (in the RL context)
| will be to take two different versions of the same post, and
| predict which of the two is more likely to be upvoted. So what
| matters isn't that it gets the exact number of upvotes
| correctly, but that it correctly predicts the relative
| difference in likely upvote count between two variants.
|
| Now it still doesn't do a _great_ job at that (the correlation
| is only 0.53 after all) but it still does a good enough job to
| provide some useful signal.
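 |
 | Concretely, the comparison is as simple as this (with
 | reward_model as a stand-in callable returning a predicted
 | score):
 |
 |     def pick_variant(variant_a, variant_b, reward_model):
 |         # Only the ordering matters here, not the absolute scores.
 |         if reward_model(variant_a) >= reward_model(variant_b):
 |             return variant_a
 |         return variant_b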
| chx wrote:
 | > That's not much time for a model that (hopefully) understands
 | all of HN!
 |
 | This is dangerous talk.
 |
 | It doesn't understand anything at all.
 |
 | Reminder: we are more prone to anthropomorphizing LLMs than to
 | humanizing suffering humans.
| 1024core wrote:
 | Am I right in understanding that the reward model is also
 | similar to an LLM (with the difference being it predicts a score
 | instead of the next token)?
| kcorbitt wrote:
| Yes! The architecture is almost identical. The only difference
| is in the final layer. In an LLM used for text generation, the
| final layer has a separate output for every potential token the
| model could produce, and we decide which token to generate by
| choosing the one with the highest likelihood at each generation
| step (at least that's what the simplest sampling methods do).
| In an LLM used as a reward model, we only have one output in
| the final layer, and we interpret its value as the predicted
| reward.
|
| Everything else in the model before that final layer is exactly
| identical, architecture-wise.
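 |
 | In Hugging Face terms this is roughly a sequence-classification
 | head with a single label (the model name here is a placeholder,
 | not the one from the post, and the new head starts untrained):
 |
 |     from transformers import (
 |         AutoModelForSequenceClassification, AutoTokenizer,
 |     )
 |
 |     tok = AutoTokenizer.from_pretrained("gpt2")
 |     model = AutoModelForSequenceClassification.from_pretrained(
 |         "gpt2", num_labels=1  # single scalar head = predicted reward
 |     )
 |     inputs = tok("Show HN: my weekend project", return_tensors="pt")
 |     reward = model(**inputs).logits[0, 0]  # one number, not tokens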
| 1024core wrote:
| But a typical LLM has a feedback loop: it looks at the last
| token it generated and then decides, given the N tokens
| before that, which token to output next.
|
| In the case of a reward model, are you streaming in the list
| of tokens; if so, what is the output after each token? Or are
| you feeding in all of the tokens in one shot, with the
| predicted reward as the output?
| maleldil wrote:
 | There are multiple ways to model reward. You can have it be
 | fine-grained, such that every token gets its own reward,
 | but by far the most common approach is to feed in the whole
 | sequence and generate a single reward at the end.
| 1024core wrote:
| I guess I'm not sure how the "feed in the whole sequence"
| works, if there's a single reward at the end.
| floobertoober wrote:
 | Maybe it would help to use a Box-Cox transform on the score
 | distribution?
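 |
 | For example, with scipy (toy data; Box-Cox needs strictly
 | positive inputs, so shift the scores first if zeros are
 | possible):
 |
 |     import numpy as np
 |     from scipy import stats
 |
 |     scores = np.array([1, 2, 3, 5, 12, 40, 118, 500])
 |     transformed, lam = stats.boxcox(scores)  # also returns lambda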
___________________________________________________________________
(page generated 2024-10-28 23:00 UTC)