[HN Gopher] Implementing the Elo Rating System (2020)
___________________________________________________________________
Implementing the Elo Rating System (2020)
Author : pmontra
Score : 88 points
Date : 2021-02-12 16:55 UTC (6 hours ago)
(HTM) web link (mattmazzola.medium.com)
(TXT) w3m dump (mattmazzola.medium.com)
| vmception wrote:
| Nice use for gaming
|
| Crowd-behavior ranking systems should be banned for human
| relationships, where the person doing the ranking didn't intend
| to send a message to the crowd
| noir_lord wrote:
| Glicko2 is a better system.
|
| https://en.wikipedia.org/wiki/Glicko_rating_system
| andrewzah wrote:
| There's also the TrueSkill rating system from Microsoft
| (developed for Halo), which claims to be better than Glicko:
|
| "Glicko was developed as an extension of ELO and was thus
| naturally limited to two player matches which end in either win
| or loss. Glicko cannot update skill levels of players if they
| compete in multi-player events or even in teams. The logistic
| model would make it computationally expensive to deal with team
| and multi-player games. Moreover, chess is usually played in
| pre-set tournaments and thus matching the right opponents was
| not considered a relevant problem in Glicko. In contrast, the
| TrueSkill ranking system offers a way to measure the quality of
| a match between any set of players."
|
| https://www.microsoft.com/en-us/research/project/trueskill-r...
| HeyImAlex wrote:
| I noticed trueskill is pretty effective at matching people
| together, but the stickiness of rankings makes it frustrating
| if it's user facing. People want to see the number next to
| their name change after 4 consecutive wins.
| postalrat wrote:
| I think most people would want to see a significant boost
| if you won 40 out of 50 games. Winning 4 out of 4 doesn't
| mean much.
| monkeybutton wrote:
| First, a good article about TrueSkill I found when looking
| for systems to rank pairs of foosball players:
| http://www.moserware.com/2010/03/computing-your-skill.html
|
| Second, apparently Tinder uses an ELO ranking system for
| their algorithm. I would add a link but everything I'm
| finding online right now is very clickbait-y.
| bmm6o wrote:
| I'd be interested to read more about Tinder's application.
| Off the top of my head, I don't see how it would apply to a
| dating app.
| monkeybutton wrote:
| >What was it, though? It was a part of our algorithm that
| considered how others engaged with your profile. ... this
| part of our algorithm compared Likes and Nopes, and was
| utilized to show you potential matches who may be a fit
| for you, based on similarities in the way others would
| engage with profiles. Based on those profile ratings you
| received, there was a "score" -- in the sense that it was
| represented with a numeric value in our systems so that
| it could factor into the other facets in our algorithm.
|
| From: https://blog.gotinder.com/powering-tinder-r-the-
| method-behin...
|
| It's not a deep description at all, but it sounds like
| each view of your profile is treated as a game: which way
| they swipe is a win or loss for you, and your change in
| score depends on their score as well? Basically if a
| desirable person swipes right on your average profile, it
| indicates you are desirable and moves you up more than,
| say, the inverse scenario where an average profile swipes
| right on a desirable profile (I guess this would be akin
| to a low-ranked player losing to a high-ranked player).
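|
| A minimal sketch of that reading (assuming a standard
| logistic Elo update; the K value and function names are
| made up for illustration, not Tinder's actual code):
|
|     def expected_score(r_a, r_b):
|         # standard logistic Elo: chance that A "beats" B
|         return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
|
|     def rate_swipe(profile, swiper, liked, k=32):
|         # one swipe is one game for the profile being viewed:
|         # a right-swipe (Like) is a win, a Nope is a loss
|         outcome = 1.0 if liked else 0.0
|         e = expected_score(profile, swiper)
|         return profile + k * (outcome - e)
|
|     # a 1600-rated user liking a 1400-rated profile moves it
|     # up more than a 1200-rated user liking it would
|     print(rate_swipe(1400, swiper=1600, liked=True))  # ~1424
|     print(rate_swipe(1400, swiper=1200, liked=True))  # ~1408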
|
| Edit: I guess the outcome would be people of equal (as
| perceived by others) attractiveness being matched.
| uxp100 wrote:
| Is the issue multiple players or is it teams? I didn't have
| any problem implementing ELO for multiple players in games
| where there can be multiple losers but one winner. I modeled
| it as simultaneous wins for the winner, with a K-factor
| divided by the number of players.
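|
| Roughly like this (a sketch using the standard logistic
| expected-score formula; the K value is just illustrative):
|
|     def expected(r_a, r_b):
|         return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
|
|     def update_ffa(ratings, winner, k=32):
|         # treat the result as the winner simultaneously
|         # beating every other player, with K divided by the
|         # number of players
|         k_eff = k / len(ratings)
|         new = dict(ratings)
|         for p, r in ratings.items():
|             if p == winner:
|                 continue
|             e = expected(ratings[winner], r)
|             new[winner] += k_eff * (1.0 - e)
|             new[p] -= k_eff * (1.0 - e)   # loss vs the winner
|         return new
|
|     pod = {"a": 1000, "b": 1000, "c": 1100, "d": 900}
|     print(update_ffa(pod, winner="a"))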
|
| But I wasn't concerned with being mathematically rigorous;
| maybe there is some glitch with this approach? It was for an
| EDH group. (BTW, don't do this. We no longer play EDH. It
| makes the spiky players even spikier, and then they get
| saltier when they're targeted as they have a far higher ELO)
| reificator wrote:
| > _It was for an EDH group. (BTW, don't do this. We no
| longer play EDH. It makes the spiky players even spikier,
| and then they get saltier when they're targeted as they
| have a far higher ELO)_
|
| Yes please keep tournaments and rankings away from EDH for
| the love of god. CEDH is a different story because it's a
| completely different mindset, but if I sat down for normal
| EDH and was told we were tracking Elo I'd run far, far
| away.
| uxp100 wrote:
| It was mainly to track the strength of the different decks we
| all played, since you could check the ELO of either. It was the
| strongest player who was upset by it, since once people
| realized he won 50% of 3 and 4 player games he was always
| targeted. But yeah, I wouldn't do it again.
| leetcrew wrote:
| I think the issue is that, outside of formal tournaments,
| there are rarely stable teams competing against each other
| in typical online multiplayer. Each team is a random
| assortment of players, where anywhere from zero to a full
| team are queuing together. People might queue with a friend
| of a very different skill level. You need to track the
| skill level of each individual player and then estimate the
| win probability for a team of players who may never have
| played with each other before. I play a lot of CSGO, and
| while far from perfect, I'm surprised how good the
| matchmaking is. I win pretty close to 50% of my matches.
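|
| A naive way to turn per-player ratings into a team forecast
| (a sketch only; this is just one option, certainly not what
| CSGO's matchmaker actually does):
|
|     def expected(r_a, r_b):
|         return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
|
|     def team_win_prob(team_a, team_b):
|         # compare average individual ratings and feed the
|         # difference through the usual logistic curve
|         avg_a = sum(team_a) / len(team_a)
|         avg_b = sum(team_b) / len(team_b)
|         return expected(avg_a, avg_b)
|
|     print(team_win_prob([1800, 1500, 1400, 1300, 1200],
|                         [1600, 1500, 1500, 1400, 1300]))  # ~0.47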
| porphyra wrote:
| For certain types of multiplayer games the recent Elo-MMR is
| even better (faster, provable guarantees) [1][2].
|
| Quote from the paper regarding TrueSkill:
|
| > The main disadvantage of TrueSkill is its complexity:
| originally developed by Microsoft for the popular Halo video
| game, TrueSkill performs approximate belief propagation on a
| factor graph, which is iterated until convergence. Aside from
| being less human-interpretable, this complexity means that,
| to our knowledge, there are no proofs of key properties such
| as runtime and incentive alignment. Even when these
| properties are discussed, no rigorous justification is
| provided. In addition, we are not aware of any work that
| extends TrueSkill to non-Gaussian performance models, which
| might be desirable to limit the influence of outlier
| performances.
|
| [1] https://github.com/EbTech/Elo-MMR
|
| [2] https://arxiv.org/abs/2101.00400
| andrewzah wrote:
| It is fascinating to see improvement in this space.
|
| People -really- care about good ranking systems. I
| personally saw this when I administrated rankings for a
| local fighting game scene. Back then we used Glicko since
| matches were 1-on-1.
| meheleventyone wrote:
| TrueSkill is also under patent, which is considerably more
| off-putting than its rigor.
| bluecalm wrote:
| Depends what you mean by better. I agree Glicko approximates
| your true skill level better and faster. It's also not as fun
| as ELO in my view. With ELO when I win a few games in a row I
| get an exciting chance to play against much better players than
| usual. This is a fun challenge. It's necessary to learn new
| things. It gives you highs and lows. It's just fun.
|
| With Glicko, once it pinpoints your true skill level, it feels
| like 2 meters of mud around you. You will never get out of it
| and you will be forever matched vs the same group of players
| stuck near you.
|
| So yeah, I don't like Glicko. I have quit services over it
| (even something like a chess problem solving website). It's just
| not fun imo.
| bitshiftfaced wrote:
| It sounds like it's not that you don't like Glicko but rather
| you'd prefer more variety in matchmaking. You can accomplish
| that easily by changing the way matchmaking picks your
| opponent.
| porphyra wrote:
| Glicko2 has a fatal flaw for massively multiplayer games such
| as Pokemon Go. Basically, by intentionally losing certain
| games, you can increase the volatility in your rating, and then
| you can win a few games in a row against low ranked players and
| leapfrog people who are actually consistently trying to win.
|
| See:
|
| [1] Farming Volatility: How a major flaw in a well-known rating
| system takes over the GBL leaderboard.
| https://www.reddit.com/r/TheSilphRoad/comments/hwff2d/farmin...
| bitshiftfaced wrote:
| This further strengthens my belief that Glicko 1 is far
| preferable to Glicko 2. It's simpler (no volatility
| parameter), nearly as predictive (if not just as predictive),
| and it wasn't specifically designed for a monthly chess
| schedule, so it fits online games much more easily.
| syspec wrote:
| Does this flaw only apply to massively multiplayer games?
| carbocation wrote:
| Not sure that I've seen this addressed, but I view Elo as an
| approximation to a PageRank-like stationary distribution. Have
| others thought about this more formally?
| hogFeast wrote:
| As someone else has mentioned, the time-dependent nature is
| actually pretty important because PageRank (and there are a few
| others similar to this, Massey ranking is one...I can't
| remember them all) depends on every team playing each other
| team at some point and/or playing each other a certain number
| of times.
|
| So, in practice, stuff like PageRank is fairly brittle, and ELO
| tends to work better.
|
| I also don't think PageRank is an approximation to ELO.
| PageRank scores are just eigenvector components, so (I
| believe) rankings are proportional to each other. This
| happens with ELO but to a lesser degree, because you aren't
| necessarily looking at all the results for all other teams
| at all times. A lot of information is embedded into an ELO
| rating, but you are updating match-by-match (I believe; I
| have done a lot of work with ELO and learned how to modify
| it, so I understand it well...I have far less experience
| with the matrix-based methods). So the practical advantage
| of ELO is actually a theoretical one, imo.
|
| The best way to think about ELO is Bayesian updating: you start
| with a prior about team skills, you use this to create a
| forecast, and then update your ratings based on the actual
| result. Comparisons are Kalman filters, Markov Chains or MCMC,
| etc.
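|
| In code, that prior/forecast/update loop is tiny (a sketch
| of vanilla ELO with an illustrative K-factor):
|
|     def forecast(r_home, r_away):
|         # prior ratings -> forecast of a home win
|         return 1.0 / (1.0 + 10 ** ((r_away - r_home) / 400.0))
|
|     def update(ratings, home, away, home_won, k=20):
|         # move each rating toward the observed result, in
|         # proportion to how surprising the result was
|         p = forecast(ratings[home], ratings[away])
|         err = (1.0 if home_won else 0.0) - p
|         ratings[home] += k * err
|         ratings[away] -= k * err
|
|     ratings = {"A": 1500, "B": 1500}
|     update(ratings, "A", "B", home_won=True)  # A +10, B -10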
|
| I will add that ELO is very powerful. Yes, Glicko is better
| (ELO is a special case of Glicko), but ELO is also far
| simpler. Simple ELO models beat complex regression models
| most of the time; it is remarkable.
|
| Imo, it is the Bayesian-esque updating that works so well. And,
| if you understand what ELO is at this level, you can split this
| part of the model off and use it with regression or whatever
| you want. It is truly amazing though.
| martin_balsam wrote:
| Super nit-picky comment, but it's Elo rating, not ELO. It's
| not an acronym; it's named after the physicist Arpad Elo.
| morelandjs wrote:
| Elo is essentially a time-dependent graph algorithm. In a
| static scenario, Elo probably converges to something like
| PageRank, but I'm sure there are differences. It is worth
| mentioning that PageRank has been applied to sports in a
| similar manner to Elo.
| nextos wrote:
| Microsoft Research developed TrueSkill [1], where they frame
| the problem as a Bayesian inference one.
|
| Incidentally, TrueSkill is implemented in Infer.NET, which IMHO
| doesn't get the attention it deserves. An open-source state-of-
| the-art factor graph engine with amazing performance. When
| dealing with many kinds of problems, including those involving
| discrete variables, I haven't found anything that compares in
| terms of speed and convergence.
|
| [1] https://www.microsoft.com/en-
| us/research/uploads/prod/2018/0...
| 1980phipsi wrote:
| Probably could use something like this for looking at sports
| teams.
| oscargrouch wrote:
| > TrueSkill is patented,[3] and the name is trademarked,[4]
| so it is limited to Microsoft projects and commercial
| projects that obtain a license to use the algorithm.
|
| Capitalism sometimes leads to stupid outcomes like this big
| issue with "what is property".
|
| Imagine if physicists patented their equations, forcing
| universities, books, movies, wikipedia, etc.. all to pay for
| the privilege of using their equations. The world would still
| be in the dark ages with just a privileged social class with
| access to advanced knowledge.
|
| Not to mention this bars other human beings from a totally
| valid shot at arriving at the same result while trying to
| solve this problem, given they are using a math toolbox that
| is generally available to anyone.
|
| Corporations should have the right to exploit something they
| worked on, of course, but look at this example, where
| Microsoft has already implemented and still profits from its
| invention/discovery.
|
| Using the legal system, and not creativity, to stop everyone
| else from progressing with not just their knowledge but
| knowledge that can come from anywhere is pure stupidity, as
| it does not serve any collective good.
|
| We are in the middle of the COVID outbreak right now;
| imagine if this sort of adversarial view were prominent in
| the world. How many more people would die if basic
| techniques for making vaccines were patented?
|
| This is just utterly stupid and it's another proof that the
| system is broken.
| rland wrote:
| We don't need to imagine:
|
| https://www.doctorswithoutborders.org/what-we-do/news-
| storie...
|
| > So far, pharmaceutical corporations and other
| manufacturers of products needed to combat COVID-19 have
| not shown any willingness to take a different approach
| during the pandemic to ensure the necessary broad access to
| needed products
| oscargrouch wrote:
| > Furthermore, there has been an astonishing number of
| patents filed for COVID-19 vaccines in development,
| including more than 100 for the mRNA platform technology
|
| I hope there's someone with enough intelligence out there
| to understand that if there's a time to rethink the
| rules, the time is now.
|
| Even for people who are only thinking in terms of
| profits and the economy: if too many people die, there
| will be no economy at all.
|
| People at the top of social structures (government,
| corporations, etc.) are working like Pavlovian dogs,
| unaware that the rules of the old normal don't apply to
| the new normality. Or it's just a matter of a corrupt
| system, organically formed around power and money, that
| is completely oblivious to the values it should stand
| for.
| [deleted]
| morelandjs wrote:
| There are a lot of similarities between classic Elo and
| Bayesian statistics. You essentially construct a prior for
| each comparison and update your prior by its deviation from
| the observation. Does TrueSkill just do this in a more
| rigorous fashion according to Bayes' theorem?
| chairmanwow1 wrote:
| One of my favorite Elo bugs I came across was a bug in a game
| called Age of Empires 2: DE.
|
| Whenever my friend and I played online together, we would
| sporadically get absolutely creamed. We started looking up the
| stats of the others players we were matched with and noticed that
| on those bad games the average Elo of our opponents would jump to
| >2k Elo vs our barely 1k.
|
| Both of us being software engineers, eventually the conversation
| turned to "What the hell do you think is going on here? How does
| a matchmaking bug like this crop up?"
|
| Eventually we came to the conclusion that they must be summing
| our Elo for the Elo of our matchmaking party rather than
| averaging it.
|
| Eventually the issue was patched and our worst suspicions were
| confirmed that indeed multiplayer party Elos were summed instead
| of averaged.
| jsight wrote:
| How would that make a difference? Or were the party sizes
| different between you and your opponents?
| chairmanwow1 wrote:
| Our opponents were queueing as individuals and were highly
| skilled, so their "party Elo" was genuinely 2k. Parties would
| be selected to take place in a match (2v2 / 3v3 / 4v4) of
| similar party Elos.
|
| We would either get paired with another couple 2k players
| against a full team of 2k (4v4 match) or in the worst case a
| 2v2 against 2k players.
|
| In the 2v2 case, because of the Elo difference the opposing
| side was expected to win >99.99% of the time.
| bitshiftfaced wrote:
| Since there are an equal number of players on both teams,
| why would it matter if the system used the average or the
| sum of the players' Elo? Wouldn't both methods have the
| same outcome?
| ivarv wrote:
| Two players of similar skill would probably get creamed by a
| team with a large skill disparity.
| jsight wrote:
| I guess, but (making up numbers here) two players with 750
| skill would be 1500 or an average of 750.
|
| A sum might match them up against two players with 1300
| skill & 200 skill due to the 1500 sum. OTOH, the average
| would match them up against... an average of 750, so the
| same people?
|
| I'm sure I'm just missing something really fundamental
| about how this all works.
| postalrat wrote:
| They shouldn't be summing or averaging. They should be using
| the max elo of the players in your party for every member of
| your party.
| bitshiftfaced wrote:
| > Eventually we came to the conclusion that they must be
| summing our Elo for the Elo of our matchmaking party rather
| than summing it.
|
| I think I'm missing something. Could you rephrase this?
| SamBam wrote:
| *rather than averaging it, I assume.
| chairmanwow1 wrote:
| Yes. Thank you. *Averaging. Thought I caught all the
| mistakes.
| hwjwill wrote:
| I also noticed it when I play AoE 2: DE, as I only play
| multiplayer games. I initially thought it was just because
| there aren't enough players, so they match us with someone
| else much higher. However, I only notice it in Quick Play,
| where ELO is not affected, rather than in Ranked games.
| TameAntelope wrote:
| The sentiment you're expressing is the idea that one 2k player
| is so much better than two 1k rated players that it doesn't
| make sense to simply sum up each team's rating and keep it
| even, which is the false assumption that the AoE 2 devs made.
| Elo is _not_ linear.
|
| However, I don't think they'd have fixed it just by averaging
| the Elo ratings; they must have implemented something more
| complex, like a proximity-based algorithm. Otherwise, if the
| teams were the same size, you'd hit a similar problem with an
| average as with a sum. A 2k player + a 0-rated player would sum
| to 2k and average to 1k. You and your friend would also sum to
| 2k and average to 1k, again assuming even team sizes.
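|
| To make that concrete (a toy sketch; the numbers are
| illustrative, not the actual AoE 2 matchmaker):
|
|     def rating_sum(party): return sum(party)
|     def rating_avg(party): return sum(party) / len(party)
|
|     duo_1k = [1000, 1000]   # two ~1k friends queuing together
|     solo_2k = [2000]        # one strong solo player
|     lopsided = [2000, 0]    # a 2k player with a fresh account
|
|     # summing: the 1k duo looks as strong as the 2k solo player
|     print(rating_sum(duo_1k), rating_sum(solo_2k))   # 2000 2000
|     # averaging (even team sizes): the 1k duo and the lopsided
|     # duo look identical, though they play very differently
|     print(rating_avg(duo_1k), rating_avg(lopsided))  # 1000.0 1000.0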
|
| That said, still very funny!
|
| Also, fun fact: Elo isn't an acronym/initialism, it's someone's
| name, Arpad Elo (the inventor).
|
| Also also, I don't _think_ Elo is really a good idea for
| multiplayer (more than 1v1 that is) games generally. Don't
| know enough to say that authoritatively, but using it for
| something like AoE 2 is likely prone to all kinds of strange
| unexpected outcomes.
| jerf wrote:
| "The sentiment you're expressing is the idea that one 2k
| player is so much better than two 1k rated players that it
| doesn't make sense to simply sum up each team's rating and
| keep it even, which is the false assumption that the AoE 2
| devs made. Elo is not linear."
|
| It's even better than that. Elo is purely relative. The
| difference in score between two opponents is what is supposed
| to predict the likelihood of victory for the two. This means
| that you can add any constant you like to the entire Elo pool
| without affecting the Elo rankings at all.
|
| As a matter of aesthetics, when initializing an Elo pool it
| is nice to try to start with values that will result in
| nobody having a negative value, because that gives humans the
| feelbadz, but that's not mathematically necessary; you can
| run the whole thing deep in the negative billions if that
| floats your boat.
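|
| A two-line illustration (standard logistic formula, numbers
| arbitrary):
|
|     def expected(r_a, r_b):
|         # only the difference r_a - r_b matters
|         return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
|
|     print(expected(1600, 1400))                  # ~0.76
|     print(expected(1600 - 10**9, 1400 - 10**9))  # identical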
|
| The Elo math is very much based on two players playing head-
| to-head. It would be completely broken for almost any other
| scenario, as multiplayer introduces new degrees of freedom
| that Elo has nowhere to put. By that I mean that there may be
| one game for which the team's skill is essentially the max of
| the members, another for which it is essentially the minimum,
| and a wide variety of things in between. Elo has nowhere to
| put that "wide variety", and it was never designed or
| intended to do so, so that's not a criticism of it. It just
| isn't what it was for.
| wongarsu wrote:
| Overwatch implements an Elo system for 6v6 competitive
| multiplayer. Over the years they've had to make a lot of
| tweaks to make it kind of work. For example groups are rated
| by average Elo, but the system doesn't allow you to cause
| large Elo differences within a team. Also it differentiates
| between "premade" groups and groups assembled by matchmaking,
| since the former are assumed to be more coordinated. It also
| tries to be really smart about how Elo points get distributed
| within the team on win/loss (which causes a heap of other
| problems).
|
| Overall you can make it somewhat work, but Elo really isn't
| well suited for anything larger than 1v1.
| lordnacho wrote:
| Seems to me that you need to use the max of the team rather
| than the average. The better player can tell the less good
| player what to do, up to a point.
| maybeOneDay wrote:
| This is what Rocket League uses. While it can be unpleasant
| as a high-ranked player to jump into a lobby with your much
| less skilled friend, overall the matchmaking system and
| equality of matches are in my opinion utterly fantastic.
| jerf wrote:
| I'd expect "geometric average" to be one of the closest
| simple statistics for most games, assuming that you have
| some ranking that can be treated linearly.
|
| It is generally unclear that rankings can be treated
| linearly. A subtle aspect of Elo is that as the disparity
| between two players increases, the Elo metric essentially
| becomes less and less certain that there is any
| relationship between the skill of the players at all, which
| is probably a deeper way of understanding "why".
| Eventually you'll reach a point where if you round to
| integers, the winner winning will result in an increase of
| 0 points. (In fact one thing to watch out for is how you
| round; it's easy to end up with a wandering Elo basis if
| you truncate both sides rather than rounding properly, for
| instance.)
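|
| A quick way to see the wandering basis (a sketch; the
| rounding choice and K value are illustrative):
|
|     import math, random
|
|     def expected(r_a, r_b):
|         return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
|
|     def play(ratings, a, b, a_won, k=32, rounder=math.floor):
|         # update both sides independently and push each new
|         # rating through `rounder`; with floor/truncation the
|         # two deltas no longer cancel exactly
|         e = expected(ratings[a], ratings[b])
|         s = 1.0 if a_won else 0.0
|         ratings[a] = rounder(ratings[a] + k * (s - e))
|         ratings[b] = rounder(ratings[b] + k * ((1 - s) - (1 - e)))
|
|     random.seed(0)
|     ratings = {i: 1500 for i in range(10)}
|     for _ in range(10000):
|         a, b = random.sample(list(ratings), 2)
|         play(ratings, a, b, a_won=random.random() < 0.5)
|     # with math.floor the total drifts well below 10 * 1500;
|     # with round() it stays close to the original sum
|     print(sum(ratings.values()))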
|
| It's fairly trivial to show that if you take a pool of the
| worst players and the best players, Elo will eventually
| converge to those two pools having about that amount of
| distance between them, but by inserting a fresh pool of
| middling players into the pool, the distance between the
| best & the worst will increase despite their skill not
| changing. If skill is linearly distributed... assuming you
| can even define _that_... Elo ratings might be somewhat
| reasonably treated linearly, but skill is very unlikely to
| be linearly distributed.
|
| Basically, I like Elo when used properly and I have used it
| to good effect a few times myself, but it is a bit
| dangerous in that it offers a _number_... but that number
| lacks many properties we associate with integers. (For
| instance, in another posting I point out you can shift the
| entire Elo pool by any constant you like without changing
| the system. Integers do not have this property.) They have
| a definite meaning, but it isn't what your intuition might
| suggest. You really shouldn't do any arithmetic on them.
| They only make sense in the context of the Elo computation
| itself.
| jayd16 wrote:
| Average just causes other issues. Teams should probably have
| their own rating not derived from the individual elo.
|
| But honestly, elo is a shit system for fun matchmaking anyhow.
| buzzerbetrayed wrote:
| Why is it shit and what system is better?
| elcomet wrote:
| Why do you say it's shit ?
| smogcutter wrote:
| One issue is that once you find your level, Elo should keep
| you around a 50/50 win rate. So it doesn't feel like you're
| making progress or improving even if in a real sense you
| are.
|
| It's also a bad fit for team games. But ranking individual
| contribution to a team is _hard_. Even for something like
| the NBA where there's obviously a ton of incentive to do
| this, statistical methods rely on a large sample size and
| often need to be taken with a grain of salt.
| Kranar wrote:
| Most multiplayer games I know of, such as StarCraft, use
| a different score for single player and multiplayer
| games. You might have an amazing single player Elo score
| but a crappy multiplayer score and vice-versa.
| jayd16 wrote:
| Once you reach the top there's much less reason to keep
| playing. It doesn't have a great way to handle multiplayer
| situations. It doesn't handle confidence as well as some
| other systems, i.e. it takes a while to skill test a player.
|
| https://en.wikipedia.org/wiki/Elo_rating_system#Practical_i
| s...
|
| Trueskill and many others are often used instead.
|
| https://en.wikipedia.org/wiki/TrueSkill
|
| The biggest issue is that fun is not the same as fair, and
| it's a lot more complicated to tune for fun.
|
| You could, for example, play 100 matches where you were
| given the closest opponent but at a disadvantage for every
| single match. There are a lot of other scenarios around how
| you want to fold in ping-based matchmaking and how that
| smaller pool affects the fairness of the matching.
| xmprt wrote:
| What's the open source alternative to TrueSkill? It seems
| really interesting but it's patented by Microsoft.
| wizzwizz4 wrote:
| They can't patent the _general idea_, so just base
| something off the information in the Wikipedia article.
| At time of writing, I don't _think_ doing that would
| infringe their patent.
| ummonk wrote:
| Glicko
| maybeOneDay wrote:
| I accept your premise and the fact that matchmaking is
| hard but:
|
| > You could, for example, play 100 matches where you were
| given the closest opponent but at a disadvantage for
| every single match.
|
| Not really; you'd derank aggressively and end up playing
| people at a lower level and presumably winning
| comfortably after some time.
| [deleted]
| jayd16 wrote:
| It all depends on who happens to be playing the game to
| match with but you're right. Ideally you would quickly
| hit a good match and win.
| tehjoker wrote:
| That's what they do in Starcraft 2 with MMR.
| lindig wrote:
| How dependent is the result on the order in which players play?
| Assuming who beats whom is fixed, how much do results vary over
| the order of the tournament?
| narag wrote:
| Aren't the calculations always done with the initial points?
| I'm very curious to know if that's not the case.
| hogFeast wrote:
| I believe order is important. The way to verify this is by
| running a simulation with players whose skills are all equal
| but whose results vary with some normal distribution; you
| will find that rating levels do move about quite a bit even
| though the underlying skill level is fixed (I am thinking
| about order as just random shuffling here, so if we take a
| dataset where there is just randomness and it converges
| quickly then order doesn't matter...this doesn't really
| happen with ELO).
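|
| Something like this (a sketch; pool size, K, and game count
| are arbitrary):
|
|     import random
|
|     def expected(r_a, r_b):
|         return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
|
|     def simulate(n_players=20, n_games=2000, k=24, seed=1):
|         # everyone has identical true skill, so each game is
|         # a coin flip; the ratings still spread out and wander
|         random.seed(seed)
|         ratings = [1500.0] * n_players
|         for _ in range(n_games):
|             a, b = random.sample(range(n_players), 2)
|             score = 1.0 if random.random() < 0.5 else 0.0
|             delta = k * (score - expected(ratings[a], ratings[b]))
|             ratings[a] += delta
|             ratings[b] -= delta
|         return min(ratings), max(ratings)
|
|     # spreads of 100+ points are typical despite equal skill
|     print(simulate())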
|
| This isn't to say that rankings don't converge. They do. But
| things like order definitely can get in the way. Glicko
| controls for this by modelling uncertainty in a skill estimate
| (iirc, proportional to the number of matches seen). In
| practice, you would tend to burn the first few matches using
| ELO to try and get rankings to converge to a limit.
|
| But, again, even then I have found issues. ELO will
| quickly identify the worst and best but there is a lot of
| volatility around the middle grouping of teams. And this does
| vary with the underlying activity you are modelling too:
| activities with a lot of randomness take longer to converge.
| You can also control convergence with the K-factor.
|
| I think this is one of the trade-offs that comes with the time
| dependent nature of ELO. You get more variance but less bias.
| Imo, it usually isn't a huge issue in practice because the
| upside of this is your method isn't static, and your ranking
| can respond to things like lineup changes (which tend to occur
| in many applications)...it works well in most applications.
___________________________________________________________________
(page generated 2021-02-12 23:01 UTC)