[HN Gopher] How not to sort by average rating (2009)
___________________________________________________________________
How not to sort by average rating (2009)
Author : soheilpro
Score : 190 points
Date : 2021-11-12 15:23 UTC (7 hours ago)
(HTM) web link (www.evanmiller.org)
(TXT) w3m dump (www.evanmiller.org)
| rkuykendall-com wrote:
| This article inspired me so much that I based my shitty undergrad
| senior thesis on it. My idea was to predict the trend of the
| ratings using, I think, a trailing weighted average weighted
| toward the most recent window. It managed to generate more
| "predictive" ratings for the following 6 months on the Amazon
| dataset I used, but I doubt it would have held up to much
| scrutiny. I
| learned a ton though!
|
| Edit: Link to paper, which looks like it actually attempts to use
| a linear prediction algorithm.
| https://github.com/rkuykendall/rkuykendall.com/blob/e65147f6...
| kazinator wrote:
| This still has the problem that some item with 12 votes will be
| ranked higher than some item with 12,000 votes. Oh, and also has
| the problem that some item with 12 votes will be ranked lower
| than some item with 12,000 votes.
|
| I think you simply need separate search categories for this.
|
| Say I want to look for underrated or undiscovered gems:
|
| "Give me the best ranked items that have 500 votes or less."
|
| It is misleading to throw a 12 vote item together into the same
| list as a 12,000,000 vote item, and present them as being ranked
| relative to each other.
| taormina wrote:
| This is a blast from the past. It's also surprisingly simple to
| implement his "correct" sort. Seriously, this link should make
| the rounds every year or so here.
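| For anyone who wants to try it, here is a minimal sketch in
| Python of the lower bound of the Wilson score interval the
| article describes (z = 1.96 for 95% confidence; treat it as an
| illustration rather than the article's exact code):
|
|     import math
|
|     def wilson_lower_bound(positive, total, z=1.96):
|         # Lower bound of the Wilson score interval for the
|         # proportion of positive ratings, positive / total.
|         if total == 0:
|             return 0.0
|         phat = positive / total
|         return (
|             phat + z * z / (2 * total)
|             - z * math.sqrt(
|                 (phat * (1 - phat) + z * z / (4 * total)) / total)
|         ) / (1 + z * z / total)
|
|     # 5 up / 0 down sorts below 90 up / 10 down:
|     print(wilson_lower_bound(5, 5))     # ~0.57
|     print(wilson_lower_bound(90, 100))  # ~0.83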
| [deleted]
| hwbehrens wrote:
| While I agree with the author in principle, I think there is an
| implicit criterion they ignore, which is the intuitive
| correctness from the perspective of the user.
|
| Imagine a user chooses "Sort by rating", and they subsequently
| observe an item with an average 4.5 ranking above a score of 5.0
| because it has a higher Wilson score. Some portion of users will
| think "Ah, yes, this makes sense because the 4.5 rating is based
| on many more reviews, therefore its Wilson score is higher." and
| the vast, vast majority of users will think "What the heck? This
| site is rigging the system! How come this one is ranked higher
| than that one?" and erode confidence in the rankings.
|
| In fact, these kinds of black-box rankings* frequently get sites
| like Yelp into trouble, because it is natural to assume that the
| company has a finger on the scale, so to speak, when it is in its
| financial interest to do so. In particular, entries with a
| higher Wilson score are likely to be more expensive because their
| ostensibly-superior quality commands (or depends upon) their
| higher cost, exacerbating this effect due to perceived-higher
| margins.
|
| So the next logical step is to present the Wilson score directly,
| but this merely shifts the confusion elsewhere -- the user may
| find an item they're interested in buying, find it has one 5-star
| review, and yet its Wilson score is << 5, producing at least the
| same perception and possibly a worse one.
|
| Instead, providing the statistically-sound score but de-
| emphasizing or hiding it, such as by making it accessible in the
| DOM but not visible, allows for the creation of alternative
| sorting mechanisms via e.g. browser extensions for the
| statistically-minded, without sacrificing the intuition of the
| top-line score.
|
| * I assume that most companies would choose not to explain the
| statistical foundations of their ranking algorithm.
| jkaptur wrote:
| That's a really good point. I wonder if folks would intuitively
| get it if you provided a little data visualization (visible on
| hover or whatever). Like:
|
| Result 1: (4.5 )
|
| Result 2: (5.0 )
|
| edit: HN stripped out the unicode characters :(. I was using
| something like this: https://blog.jonudell.net/2021/08/05/the-
| tao-of-unicode-spar....
| SerLava wrote:
| You could probably get around this by
|
| A) labelling 1-2 review items with a "needs more reviews" message
|
| Or B) not giving an aggregate review score for low review
| items. Actually _replacing_ the review star bar with "needs
| more reviews". Then when the user goes from the listing page to
| the detail page, you can show the reviews next to a message
| saying "this item only has a few reviews, so we can't be sure
| they're accurate until more people chime in"
| fennecfoxen wrote:
| C) normalizing the display of stars to the score
| nkrisc wrote:
| I worked on an e-commerce site that attempted to solve the
| issue by simply not giving an average rating to an item until
| it had a certain number of reviews. We still showed the reviews
| and their scores, but there was no top-level average until it
| had enough reviews. We spent a lot of time in user testing and
| with surveys trying to figure out how to effectively communicate
| that.
| jahewson wrote:
| I think this can be solved with better UI: instead of stars,
| show a sparkline of the distribution of the scores. The user
| can then see the tiny dot representing the single 5-star review
| and the giant peak representing the many 4-star reviews.
| 1024core wrote:
| This is a UX problem, which can be solved by not showing the
| exact rating, but showing a "rating score" which is the Wilson
| score.
| alecbz wrote:
| OP addressed that:
|
| > So the next logical step is to present the Wilson score
| directly, but this merely shifts the confusion elsewhere --
| the user may find an item they're interested in buying, find
| it has one 5-star review, and yet its Wilson score is << 5,
| producing at least the same perception and possibly a worse
| one.
|
| Though I'm not convinced how big of a deal this is. Even if
| you're worried about this, a further optimization may be to
| simply not display the score until there are enough reviews
| that it's unlikely anyone will manually compute the average
| rating.
| dfabulich wrote:
| In another article, the author (Evan Miller) recommends not
| showing the average unless there are enough ratings. You would
| say "2 ratings" but not show the average, and just sort it
| wherever it falls algorithmically.
|
| https://www.evanmiller.org/ranking-items-with-star-ratings.h...
|
| In that article, he even includes a formula for how many
| ratings you'd need:
|
| > _If you display average ratings to the nearest half-star, you
| probably don't want to display an average rating unless the
| credible interval is a half-star wide or less_
|
| In my experience, the second article is more generally useful,
| because it's more common to sort by star rating than by thumb-
| up/thumb-down ranking, which is what the currently linked
| article is about.
|
| And the philosophical "weight on the scale" problem isn't as
| bad as you'd think when using these approaches. If you see an
| item with a perfect 5-star average and 10 reviews ranked below
| an item with a 4.8-star average and 1,000 reviews, and you call
| the sort ranking "sort by popularity," it's pretty clear that
| the item with 1,000 reviews is "more popular."
| sdwr wrote:
| Not having faith in the user is a giant step towards
| mediocrity. Does a weighted average provide better results?
| Then use a weighted average! The world isn't split into an
| elite group of power users and the unwashed masses. There are
| just people with enough time and attention to fiddle with
| browser extensions, and everyone else. And all of them want the
| best result to show up first.
|
| Yelp didn't get dinged because their algorithms were hidden.
| They lost credibility because they were extorting businesses.
| Intention matters.
| enlyth wrote:
| I don't think this is an easy problem to solve.
|
| The inherent problem, to me, is that we're trying to condense
| reviews into the tiny signal of an integer in the range of 1
| to 5.
|
| For many things, this simply doesn't cut it.
|
| 2 stars, what does that mean? Was the coffee table not the
| advertised shade of grey? Does the graphics card overheat on
| medium load because of a poor cooler design? Was the delivery
| late (not related to the product, but many people leave these
| kinds of reviews)? Did you leave a 2 star review because you
| don't like the price but you didn't actually order the
| product?
|
| I've seen all of these things in reviews, and I've learned to
| ignore star ratings because not only can they be gamed, they
| are essentially useless.
|
| Props to users who take the time to write out detailed
| reviews of products which give you an idea of what to expect
| without having to guess what a star rating means, although
| sometimes these can be gamed as well, since many sellers on
| Amazon and such will just give out free products in exchange
| for favourable reviews.
|
| Being a consumer is not easy these days; you have to be
| knowledgeable about what you're buying and assume every seller
| is an adversary.
| strken wrote:
| The problem with having faith in your users is you have to
| actually do it. If you're sorting by Wilson score when the
| user clicks a column that displays a ranking out of five,
| then you're mixing two scores together in a frustrating way
| because you think your users are too dumb to understand.
|
| There has to be a way to let users choose between "sort by
| rating, but put items without many reviews lower" and "sort
| by rating, even items with only one or two reviews" in a way
| that helps give control back to them.
| sdwr wrote:
| The way I've seen it done is a single column with avg stars
| + # reviews, which isn't clickable, because why would you
| want to sort by minimum ranking?
| IggleSniggle wrote:
| If you don't provide a "Sort by rating" option but instead
| include options like sort by "popularity," "relevance,"
| "confidence," or similar, then it is a more accurate
| description, more useful to the user, and not so misleading
| about what is being sorted.
|
| I agree that if I "sort by rating" then an average rating sort
| is expected. The solution is to simply not make sorting by
| rating an option, or to keep the bad sorting mechanism but de-
| emphasize it in favor of the more useful sort. Your users will
| quickly catch on that you're giving them a more useful tool
| than "sort by average rating."
| crooked-v wrote:
| I think you're overemphasizing the confusion that an alternate
| ranking schema would cause. We have Rotten Tomatoes as a very
| obvious example of one that a lot of people are perfectly happy
| with even though it's doing something very different from the
| usual meaning of X% ratings.
|
| I feel like all that's really needed is a clear indicator that
| it's some proprietary ranking system (for example,
| "Tomatometer" branding), plus a plain-language description of
| what it's doing for people who want to know more.
| tablespoon wrote:
| > Imagine a user chooses "Sort by rating", and they
| subsequently observe an item with an average 4.5 ranking above
| a score of 5.0 because it has a higher Wilson score. Some
| portion of users will think "Ah, yes, this makes sense because
| the 4.5 rating is based on many more reviews, therefore its
| Wilson score is higher." and the vast, vast majority of users
| will think "What the heck? This site is rigging the system! How
| come this one is ranked higher than that one?" and erode
| confidence in the rankings.
|
| It also erodes confidence in ratings when something with one
| fake 5 star review sorts above something else with 1000 reviews
| averaging 4.9.
|
| I think you're mainly focusing on the very start of a learning
| curve, but eventually people get the hang of the new system.
| Especially if it's named correctly (e.g. "sort by review-count
| weighted score").
| mandelbrotwurst wrote:
| I'd opt for a simpler and less precise name like "Sort by
| Rating", but then offer the more precise definition via a
| tooltip or something, to minimize complexity for the typical
| user but ensure that accurate information is available for
| those who are interested.
| nkrisc wrote:
| Better in my opinion to give an item a rating until it has
| some number of reviews. You can still show the reviews, but
| treat it as unrated.
| dfabulich wrote:
| I prefer to call it "Sort by Popularity."
| mc32 wrote:
| I don't like that measure because popularity doesn't
| translate into "good".
|
| What's the most popular office pen? Papermate, Bic? I may
| be looking for more quality.
|
| What's the most popular hotel in some city? Maybe I'm
| looking for location or other aspects other than popularity
| among college kids.
| dfabulich wrote:
| When you use the OP article's formula, you're sorting by
| popularity. You may choose not to sort by popularity, but
| when you use it, you should _call_ it sorting by
| "popularity."
| alecbz wrote:
| This is a fair point, but it's not as if knowing which items
| are actually good is something that should only be available to
| power users. The real goal ought to be: making sure your
| customers get access to actually good things. Not merely
| satisfying what might be some customers' naive intuition that
| things with higher average ratings are actually better.
|
| I think there are better approaches that can be taken here to
| address possible confusion. E.g., if the Wilson score rating
| ever places an item below ones with higher average rating, put
| a little tooltip next to that item's rating that says something
| like "This item has fewer reviews than ones higher up in the
| list." You don't need to understand the full statistical model
| to have the intuition that things with only a few ratings
| aren't as "safe".
| giovannibonetti wrote:
| In order to deal with that, I would offer two sorting options
| related to the average:
|
| - regular average
|
| - weighted average (recommended, default)
|
| Then the user can pick the regular average if they want,
| whereas the so-called weighted average (the algorithm described
| in the article) would be the default choice.
| ChrisArchitect wrote:
| Anything new here?
|
| Some previous discussions:
|
| _4 years ago_ https://news.ycombinator.com/item?id=15131611
|
| _6 years ago_ https://news.ycombinator.com/item?id=9855784
|
| _10 years ago_ https://news.ycombinator.com/item?id=3792627
|
| _13 years ago_ https://news.ycombinator.com/item?id=478632
|
| Reminder: you can enjoy the article without upvoting it
| dang wrote:
| Thanks! Macroexpanded:
|
| _How Not to Sort by Average Rating (2009)_ -
| https://news.ycombinator.com/item?id=15131611 - Aug 2017 (156
| comments)
|
| _How Not to Sort by Average Rating (2009)_ -
| https://news.ycombinator.com/item?id=9855784 - July 2015 (59
| comments)
|
| _How Not To Sort By Average Rating_ -
| https://news.ycombinator.com/item?id=3792627 - April 2012 (153
| comments)
|
| _How Not To Sort By Average Rating_ -
| https://news.ycombinator.com/item?id=1218951 - March 2010 (31
| comments)
|
| _How Not To Sort By Average Rating_ -
| https://news.ycombinator.com/item?id=478632 - Feb 2009 (56
| comments)
| oehpr wrote:
| Maybe what we need here is an extension where you can filter
| out articles?
|
| It would add a click event to each article link, and then,
| after a day has passed, start filtering that link out of the
| HN results. I'd give it a gap of a day because maybe you'd
| want to return and leave a comment.
|
| I might try my hand at a greasemonkey script if you're
| interested.
|
| Though, personally, I have no great issue seeing high quality
| posts again occasionally.
| rdlw wrote:
| This is a genuine question: is there an HN guideline that says
| not to upvote reposts?
|
| I don't know if I knew about HN four years ago and if I did, I
| almost certainly missed that post, and if I didn't, I certainly
| don't remember the interesting discussion in the comments.
|
| I enjoyed the article and I'm not sure I see a reason not to
| upvote it.
| svnpenn wrote:
| 4 years? I think that's fine for a repost.
| chias wrote:
| The new thing is another cohort of people getting to be today's
| lucky 10,000.
| edude03 wrote:
| I'm one of them and I appreciate the repost
| iyn wrote:
| https://xkcd.com/1053/
| monkeybutton wrote:
| I was just looking for some of his old blog posts about A/B
| testing the other day. Since I first read them, I'd lost my
| bookmarks and forgotten his name. Do you know how bad the
| google search results for A/B testing are now? They're
| atrocious! SEO services and low-content medium posts as far as
| the eye can see! I was only able to rediscover his blog after
| finding links to it in the readme of a random R project on
| GitHub.
| mbauman wrote:
| I'd love to see an update here that:
|
| * Included a graph of the resulting ordering of the two
| dimensional plane and some examples
|
| * Included consideration of 5- or 10-star scales.
| abetusk wrote:
| They have an article about K-star rating systems [0] which uses
| Bayesian approximation [1] [2] (something I know little to
| nothing about, I'm just regurgitating the article).
|
| There's a whole section on their website that has different
| statistics for programmers, including rating systems [3].
|
| [0] https://www.evanmiller.org/ranking-items-with-star-
| ratings.h...
|
| [1]
| https://en.wikipedia.org/wiki/Approximate_Bayesian_computati...
|
| [2] https://www.evanmiller.org/bayesian-average-ratings.html
|
| [3] https://www.evanmiller.org/ ("Mathematics of user ratings"
| section)
| ScoutOrgo wrote:
| The formula still works for scales of 5 or 10; you just have to
| divide by the max rating first and then multiply by it again at
| the end.
|
| For example, a 3/5-star rating turns into a 0.6 positive and
| 0.4 negative observation. Following the formula from there will
| give a lower bound estimate between 0 and 1, so then you just
| multiply by 5 again to get it between 0 and 5.
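|
| In code, roughly (a sketch of that fractional-count trick,
| reusing the same lower-bound formula; function and variable
| names are just illustrative):
|
|     import math
|
|     def wilson_lower_bound(positive, total, z=1.96):
|         if total == 0:
|             return 0.0
|         phat = positive / total
|         return (
|             phat + z * z / (2 * total)
|             - z * math.sqrt(
|                 (phat * (1 - phat) + z * z / (4 * total)) / total)
|         ) / (1 + z * z / total)
|
|     def star_score(ratings, max_stars=5):
|         # A 3/5-star rating counts as 0.6 positive, 0.4 negative.
|         pos = sum(r / max_stars for r in ratings)
|         return max_stars * wilson_lower_bound(pos, len(ratings))
|
|     print(star_score([5, 4, 4, 5, 3]))  # lower-bound estimate out of 5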
| WalterGR wrote:
| (2009)
| karaterobot wrote:
| Is there a better solution now?
| WalterGR wrote:
| No idea. It's customary to include the year in HN submission
| titles if it was published before the current year. When I
| made my comment, the title didn't include the year.
| driscoll42 wrote:
| One alternative is SteamDB's solution:
| https://steamdb.info/blog/steamdb-rating/
| 1970-01-01 wrote:
| My anecdotally accurate advice (AAA) is to always read 2-star
| reviews before purchase.
| actually_a_dog wrote:
| Why 2 star? I get the whole "forget about the 5 star reviews,
| because they're not going to tell you any of the downsides of
| the product," and "forget the 1 star reviews, because they're
| often unrelated complaints about shipping or delivery, and
| generally don't tell you much about the product." But, why not
| 3 star reviews?
|
| I generally pay the most attention to 3 star reviews, because
| they tend to be pretty balanced and actually tell you the
| plusses and minuses of the product. It seems like 2 star
| reviews would be somewhat like that, but leaning toward the
| negative/critical side. Is the negative/critical feedback what
| you're after?
| 1970-01-01 wrote:
| Because therein I find the best explanations for product
| failures. 3-star reviews tend to contain fewer failures and
| more "this could have been much better if they ___". Again,
| it's anecdotal. I have no data to back my words.
| gowld wrote:
| "3 stars" means "meh, it's fine. I don't want to commit to
| rating but I'm not a sucker who gives 5 to everything"
|
| "2 stars" means "I really don't like it, but I can control my
| emotions and explain myself".
| jedberg wrote:
| Fun fact: this article inspired the sysadmin at XKCD to submit a
| patch to open-source reddit to implement this sort on comments.
| It still lives today as the "best" sort.
|
| The blog post that explained it:
| https://web.archive.org/web/20091210120206/http://blog.reddi...
| bradbeattie wrote:
| There are a number of approaches to this with increasing
| complexity:
|
| - Sum of votes divided by total votes
|
| - More advanced statistical algorithms that take into account
| confidence (as this article suggests)
|
| - Recommendation engines that provide a rating based on your
| taste profile
|
| But I'm pretty sure you could take this further depending on what
| data you're looking to feed in and what the end-users'
| expectations of the system are.
| voldemort1968 wrote:
| Similarly, the problem of calculating "Trending"
| https://smosa.com/adam/code-and-technology
| chias wrote:
| I've been using this at work for the last year or so to great
| success.
|
| For example, we have an internal phishing simulation/assessment
| program, and want to track metrics like improvement and general
| uncertainty. Since implementing this about a year ago, we've been
| able to make great improvements such as:
|
| * for a given person, identify the Wilson lower bound on the
| probability that they would _not_ get phished if they were
| targeted
|
| * for the employee population as a whole, determine the 95%
| uncertainty on whether a sample employee would get phished if
| targeted
|
| It lets us make much more intelligent inferences about things,
| much more accurate risk assessments, and also lets us improve the
| program pretty significantly (e.g. your probability of being
| targeted being weighted by a combination of your Wilson lower
| bound and your Wilson uncertainty).
|
| There are SO MANY opportunities to improve things by using this
| method. Obviously it isn't applicable everywhere, but I'd suggest
| you look at any metrics you have that use an average and just
| take a moment to ask yourself if a Wilson bound would be more
| appropriate, or might enable you to make marked improvements.
| user5994461 wrote:
| Sounds like people who don't read their emails would get the
| best score because they don't get phished.
| chias wrote:
| Pretty much, yep :) They're also less likely to get phished
| in general.
|
| Though this property may be suboptimal for other reasons.
| anthony_r wrote:
| This is cool. But what I usually do is replace x/y with x/(y+5),
| and hope for the best :). The 5 can be replaced by 3 or 50,
| depending on what I'm dealing with.
|
| (In less important areas than sorting things by ratings to
| directly rank things for users, that is; I've mentally
| bookmarked this idea for the next time I need something better,
| as this clearly looks better.)
| mattb314 wrote:
| Heads up: this weights all your scores towards 0. If you want to
| avoid this, an equally simple approach is to use (x+3)/(y+5) to
| weight towards 3/5, or any (x+a)/(y+b) to weight towards a/b.
| It turns out that this seemingly simple method has some (sorta)
| basis in mathematical rigor: you can model x and y as successes
| and total attempts from a Bernoulli random variable, a and b as
| the parameters in a beta prior distribution, and the final
| score to be the mean of the updated posterior distribution:
| https://en.wikipedia.org/wiki/Beta_distribution#Bayesian_inf...
|
| (I first saw this covered in Murphy's Machine Learning: A
| Probabilistic Perspective, which I'd recommend if you're
| interested in this stuff.)
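|
| A tiny sketch of the pseudo-count version (the defaults here
| correspond to a Beta(3, 2) prior, i.e. a prior mean of 3/5;
| names are just illustrative):
|
|     def smoothed(x, y, a=3, b=5):
|         # Posterior mean under a Beta prior: behaves as if we had
|         # already seen a successes out of b trials before the data.
|         return (x + a) / (y + b)
|
|     print(smoothed(1, 1))      # one positive vote: 0.667
|     print(smoothed(90, 100))   # 0.886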
| zzzeek wrote:
| If you don't have PostgreSQL, it might be hard to create an
| index on that function. You can use a trigger that updates a
| fixed field on the row each time positive/negative changes, or
| otherwise run the calc and include it in your UPDATE statement
| when those numbers change.
| Waterluvian wrote:
| You can't rate 0 stars so the entire range is shifted by 1 star.
| This makes any star rating system fatally flawed to begin with.
|
| Humans will see 3 stars and not perceive that as 50%.
| feoren wrote:
| Is that really a _fatal_ flaw? It's humans reading the
| ratings, and humans doing the ratings, so our human factors
| might balance out a bit. I don't think people come in expecting
| the rating system to be perfectly linear because we have a
| mental model of how other humans rate things -- 1 star and 5
| stars are very common, even when there are obviously ways the
| thing could be worse/better. So even though 3 stars sounds like
| more than 50%, most people would consider 3.0 stars a very poor
| rating.
| Waterluvian wrote:
| I think you make a good point. But I don't think it
| completely defeats the bias. Especially given that the star
| system that existed before the Web had 0 and half stars.
|
| It seems like it's purely a result of widget design
| deficiency: how do you turn a null into a 0 with a star
| widget? (You could add an extra button but naturally
| designers will poo poo that)
| Macha wrote:
| Percentage systems aren't immune to this; various games media
| outlets were often accused of using a 70-100% rating scale.
| Anything below 70 was perceived as a terrible game, and they
| didn't want to harm their relationship with publishers. So 70
| became the "You might like it if there are some specifics that
| appeal to you" and 80 was a pretty average game.
| WithinReason wrote:
| IIRC, a simple approximation of that horrendous formula is:
|
| (positive)/(positive+negative+1)
|
| It rewards items with more ratings. Basically, you initialize the
| number of negative ratings to 1 instead of 0.
| akamoonknight wrote:
| Very interesting. Your recollection looks correct to me.
|
| x / (x+y+1) ::
| https://www.wolframalpha.com/input/?i=plot+x+%2F%28x+%2B+y+%...
|
| horrendous formula ::
| https://www.wolframalpha.com/input/?i=plot+%28%28x%2F%28x%2B...
|
| Much less prone to typos.
| gowld wrote:
| The main flaw in this formula is that when positive=0 the
| negative votes have no weight.
| rdlw wrote:
| A heuristic I use when looking at products with low numbers
| of reviews is to add one positive and one negative review, so
|
| (positive+1)/(positive+negative+2).
|
| This basically makes the 'default' rating 50% or 3 stars or
| whatever, and votes move the rating from that default.
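|
| For instance, plugging in some numbers (a quick sketch):
|
|     def adjusted(pos, neg):
|         # One phantom positive and one phantom negative vote.
|         return (pos + 1) / (pos + neg + 2)
|
|     print(adjusted(0, 1))     # 0.333
|     print(adjusted(0, 1000))  # ~0.001, so negatives still count
|     print(adjusted(0, 0))     # 0.5, the "default" rating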
| raldi wrote:
| This is a decent approximation. It handles all the common
| hazard cases:
|
| +10/-0 should rank higher than +1/-0
|
| +10/-5 should rank higher than +10/-7
|
| +100/-3 should rank higher than +3/-0
|
| +10/-1 should rank higher than +900/-200
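|
| A quick check of those with the approximation upthread
| (positive / (positive + negative + 1)); all four orderings
| hold:
|
|     def approx(pos, neg):
|         # One phantom negative vote from the start.
|         return pos / (pos + neg + 1)
|
|     cases = [((10, 0), (1, 0)), ((10, 5), (10, 7)),
|              ((100, 3), (3, 0)), ((10, 1), (900, 200))]
|     for better, worse in cases:
|         assert approx(*better) > approx(*worse), (better, worse)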
| DangerousPie wrote:
| One of my sites has been using a ranking algorithm based on this
| article for over 10 years now. Nobody ever complained, so it must
| be pretty good.
| truculent wrote:
| A simpler solution:
|
| Weighted score = (positive + alpha) / (total + beta)
|
| In which alpha and beta are the mean number of positive and total
| votes, respectively. You may wish to estimate optimal values of
| alpha and beta subject to some definition of optimal, but I find
| the mean tends to work well enough for most situations.
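|
| As a small sketch (names are illustrative; items is a list of
| (positive, total) vote counts):
|
|     def weighted_scores(items):
|         alpha = sum(p for p, _ in items) / len(items)  # mean positive
|         beta = sum(t for _, t in items) / len(items)   # mean total
|         return [(p + alpha) / (t + beta) for p, t in items]
|
|     print(weighted_scores([(1, 1), (90, 100), (400, 500)]))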
___________________________________________________________________
(page generated 2021-11-12 23:00 UTC)