[HN Gopher] Show HN: Improving search ranking with chess Elo scores
___________________________________________________________________
Show HN: Improving search ranking with chess Elo scores
Hello HN, I'm Ghita, co-founder of ZeroEntropy (YC W25). We build
high-accuracy search infrastructure for RAG and AI agents. We just
released two new state-of-the-art rerankers, zerank-1 and
zerank-1-small. One of them is fully open-source under Apache 2.0.
We trained these models using a novel Elo-score-inspired pipeline,
which we describe in detail in the attached blog post. In a
nutshell, here is an outline of the training steps:

* Collect soft preferences between pairs of documents using an
ensemble of LLMs.

* Fit an Elo-style rating system (Bradley-Terry) to turn pairwise
comparisons into absolute per-document scores.

* Normalize relevance scores across queries using a bias-correction
step, modeled using cross-query comparisons and solved with MLE.

You can try the models either through our API
(https://docs.zeroentropy.dev/models) or via HuggingFace
(https://huggingface.co/zeroentropy/zerank-1-small). We would love
this community's feedback on the models and the training approach.
A full technical report is also going to be released soon. Thank
you!
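
For intuition, here is a minimal sketch of step 2: fitting
Bradley-Terry scores to soft pairwise preferences by gradient
ascent on the log-likelihood. (A hypothetical illustration, not our
production pipeline; the preference format and hyperparameters are
assumptions.)

    import numpy as np

    def fit_bradley_terry(n_docs, prefs, lr=0.1, iters=500):
        """prefs: list of (i, j, p), where p in (0, 1) is the soft
        preference that document i beats document j."""
        s = np.zeros(n_docs)
        for _ in range(iters):
            grad = np.zeros(n_docs)
            for i, j, p in prefs:
                # Model: P(i beats j) = sigmoid(s_i - s_j)
                pred = 1.0 / (1.0 + np.exp(-(s[i] - s[j])))
                grad[i] += p - pred  # log-likelihood gradient
                grad[j] -= p - pred
            s += lr * grad
            s -= s.mean()  # scores identifiable only up to a shift
        return s

    # Doc 0 is softly preferred over docs 1 and 2:
    print(fit_bradley_terry(3, [(0, 1, 0.9), (0, 2, 0.8), (1, 2, 0.6)]))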
Author : ghita_
Score : 182 points
Date : 2025-07-16 14:17 UTC (1 day ago)
(HTM) web link (www.zeroentropy.dev)
(TXT) w3m dump (www.zeroentropy.dev)
| sippeangelo wrote:
| Really cool stuff! Just want to let you know you forgot to link
| to the evals at the end.
| ghita_ wrote:
| oh wow, thanks for flagging, just fixed!
| esafak wrote:
| I would have titled it "Improving ranking..."
|
| I like that it works with `sentence_transformers`
| ghita_ wrote:
| yes we found it hard to find a good title for this, thanks for
| the feedback
| dang wrote:
| We could change the title to "Improving search ranking with
| chess Elo scores". Anybody object?
|
| Edit: ok, done. Submitted title was "Show HN: Improving RAG
| with chess Elo scores".
| slybot wrote:
| They don't use Elo scores. See my comment above; the loss
| function is adopted from Bradley-Terry.
| npip99 wrote:
| Bradley-Terry and Elo scores are equivalent mathematical
| models! The fundamental presumption is the same Thurstone
| model - that an individual's skill in a particular game is
| a normally distributed random variable around their
| fundamental skill.
|
| We did experiment with a Bradley-Terry loss function
| (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw), but we found it
| was even better to calculate Elo scores, do cross-query
| bias adjustment, and then use MSE loss to predict the Elo
| score itself.
| slybot wrote:
| -> Bradley-Terry and Elo scores are equivalent
| mathematical models!
|
| No, they are not equivalent mathematical models; they are
| equivalent only in the calculation of the (logistic) score
| function, given matching scale factors. That is, Bradley-
| Terry: 1/(1 + e^(x(r_B - r_A))) and Elo rating: 1/(1 +
| 10^((r_B - r_A)/y)); equivalence requires x = ln(10)/y.
| More importantly, the Elo rating is an _online_ scoring
| system, meaning it takes into account the sequence of
| events. From your blog post, I understand that you are not
| updating the scores after each event. In other words, the
| Elo rating can be interpreted as an incremental fitting of
| a Bradley-Terry model (using a similar logistic), but it is
| not the same!
|
| -> The fundamental presumption is the same Thurstone model
|
| The Thurstone model is similar, but as you said, it assumes
| a normal (as opposed to logistic) distribution, using a
| probit link function. It predates both models, and due to
| computational constraints, you could call Bradley-Terry and
| the Elo rating computationally convenient approximations of
| the Thurstone model.
|
| -> We did experiment with a Bradley-Terry loss function
| (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw)
|
| The math is correct, thanks for sharing. Indeed, if you do
| incremental updating, you lose differentiability, since the
| next winning probability depends on the previous updates.
| Call it what you want, but note that this is not truly an
| Elo rating, which leads to misunderstanding. It is Bradley-
| Terry, given that you do batch updates, which you then take
| extra steps to connect to an Elo score, as shown in the
| link.
|
| Lastly, the normal and logistic distributions will lead to
| log(0) in evaluations, which results in an infinite loss.
| As I can see from your comment above, you try to add
| Uniform(0.02) as an ad-hoc fix. A more elegant fix is to
| use a heavy-tailed distribution such as the Cauchy.
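|
| Returning to the equivalence above: a quick numeric check
| that the two score functions match when x = ln(10)/y
| (standalone sketch):
|
|     import math
|
|     r_a, r_b = 1600.0, 1500.0
|     y = 400.0             # Elo scale factor
|     x = math.log(10) / y  # equivalence condition
|
|     bt = 1 / (1 + math.exp(x * (r_b - r_a)))
|     elo = 1 / (1 + 10 ** ((r_b - r_a) / y))
|     print(bt, elo)  # identical up to floating point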
| ashwindharne wrote:
| Cool stuff! We use a similar process internally to rerank and
| filter our cold outbound lists. We just use an off-the-shelf
| model as the judge, give it a custom criterion, and let it run
| for some set number of iterations. It's helped narrow wide
| searches down to the maximally relevant set of people (a few
| thousand medium-bad matches down to a few hundred good matches).
|
| It's not cheap and it's not fast, but it definitely works pretty
| well!
| jayunit wrote:
| Very interesting! What are some examples of criteria that you
| can evaluate pairwise, but couldn't score individually?
| bravura wrote:
| Pairwise rank constraints involve fewer assumptions than per-
| item scoring about the underlying nature of the data, and are
| thus more robust.
| npip99 wrote:
| Yeah, that's exactly what we observed. Our goal was to
| create an absolute score that's completely independent of
| the corpus, which is difficult because all ELO
| distributions are inherently tied to the corpus itself!
|
| When we were exploring the mathematical foundations, we
| considered ELO scoring against a "Universal Corpus" based
| on the natural entropy of human language (obviously that's
| intractable, but sometimes this term cancels out, as in the
| DPO proof).
|
| But eventually we figured out a method using cross-query
| comparisons to assign an "ELO bias" to all document ELOs
| within a given query's candidate list. This normalizes
| things correctly, so that when a candidate list is all bad,
| the ELOs shift low, and when the candidate list is all
| good, the ELOs shift high, even when the relative ELOs are
| all the same.
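|
| For intuition, a simplified sketch of that kind of bias fit
| (illustrative only, not the exact pipeline; the preference
| format and hyperparameters are assumptions):
|
|     import numpy as np
|
|     def fit_query_biases(elos, cross_prefs, n_queries,
|                          lr=0.1, iters=500):
|         """elos[q][d]: within-query ELO of doc d in query q.
|         cross_prefs: (qa, da, qb, db, p) -- soft preference
|         that doc da (from query qa) beats db (from qb)."""
|         b = np.zeros(n_queries)
|         for _ in range(iters):
|             grad = np.zeros(n_queries)
|             for qa, da, qb, db, p in cross_prefs:
|                 diff = (elos[qa][da] + b[qa]) \
|                        - (elos[qb][db] + b[qb])
|                 pred = 1.0 / (1.0 + np.exp(-diff))
|                 grad[qa] += p - pred
|                 grad[qb] -= p - pred
|             b += lr * grad
|         return b  # additive ELO bias per query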
| ashwindharne wrote:
| It's all unstructured text (title, company, company size,
| experience, skills, raw text, etc.) and LLMs are pretty bad
| at assigning numerical scores in a vacuum. To make it work,
| we'd have to provide a representative set of examples, break
| scoring down by specific field, etc.
|
| Kind of a lot of work compared to just dumping the text of 2
| profiles into a context window along with a vague description
| of what I want, and having the LLM make the binary judgment.
| yalok wrote:
| What's the expected additional latency due to running this re-
| ranker?
| ghita_ wrote:
| It actually runs pretty fast; our benchmarks show ~149ms for
| 12,665 bytes. It's faster than many other models.
| esafak wrote:
| I would prominently display your benchmarks (against your
| competitors, of course). That's your selling point, right?
| ghita_ wrote:
| Yes! We did this here:
| https://www.zeroentropy.dev/blog/announcing-zeroentropys-
| fir... We wanted to share the approach with the community
| in this post. It does do better than competitors though!
| seanhunter wrote:
| Fun fact about ELO. It's natural to think that it is some kind of
| initialism, but in fact ELO doesn't stand for anything. It's the
| name of the guy who invented the system.
| https://en.wikipedia.org/wiki/Arpad_Elo
|
| So don't say it "E.L.O." (unless you're talking about the band, I
| guess), say "ee-low"
| ghita_ wrote:
| oh interesting, had no idea, thanks for sharing
| amelius wrote:
| What was his ELO rating?
| homarp wrote:
| https://chess.stackexchange.com/questions/35420/what-was-
| arp...
|
| 2065
| esafak wrote:
| It should be Elo rating!
| https://en.wikipedia.org/wiki/Elo_rating_system
| reactordev wrote:
| It's also popular for ranking online players in games...
| really any game where there's a win/loss ranking.
| kayge wrote:
| Thanks for this :) I had never heard of Elo until I noticed
| this morning that the new Chess course in Duolingo gives you an
| Elo ranking after a few rounds against Oscar. Probably would
| have skipped right over this story and comments otherwise, but
| now I have a fun bit of non-tech trivia to share if it ever
| comes up in small talk someday.
| rurban wrote:
| In table tennis we also use the Elo ranking, because it's
| pretty fair. If you lose to a good player you don't lose
| many points, but when you lose to a bad player you'll lose
| a lot. Likewise when you win.
| npip99 wrote:
| I often see it rendered as "Elo" but I've always found it more
| natural to capitalize as "ELO", but perhaps I should swap to
| "Elo" given this. Pronouncing "ee-low" is certainly the way
| it's done in chess/esports though!
| bbstats wrote:
| (also because it's a name, you don't capitalize all three
| letters)
| fvdessen wrote:
| Similar to the 'Gini coefficient', named after Corrado Gini,
| former president of the Italian Genetics and Eugenics Society
| and author of 'The Scientific Basis of Fascism'
|
| https://en.wikipedia.org/wiki/Corrado_Gini
| rahulnair23 wrote:
| Interesting work.
|
| For a slightly different take using a similar intuition, see
| our ACL 2024 paper (https://arxiv.org/abs/2402.14860) on
| ranking LLMs, which may be of interest.
|
| Our HuggingFace space has some examples:
| https://huggingface.co/spaces/ibm/llm-rank-themselves
| ghita_ wrote:
| thank you, will check out the paper, the hf space is very cool!
| mkaszkowiak wrote:
| Happy to see competition in rerankers! Good luck with your
| product.
|
| My questions: what languages do your models currently support?
| Did you perform multilingual benchmarks? Couldn't find an answer
| on the website
| ghita_ wrote:
| Thanks! We trained on most European languages (English,
| French, Spanish, Russian...), plus Arabic and Chinese, so it
| does well on those! We haven't tested much on other languages,
| but happy to do so if there is a use case.
| ethan_smith wrote:
| Language support is a crucial differentiator for rerankers -
| would love to see MTEB or other cross-lingual benchmark results
| if you have them.
| Neywiny wrote:
| I have a paper that got rejected, but it was about using 2AFC
| sorting to do this instead of Elo. It has a defined end,
| unlike Elo scores. The code is on my GitHub and focuses on
| humans sorting images, but basically, if you have a Python
| sort function, you put your comparison in as the key instead
| of assigning the comparison a numeric score. Then the
| algorithm does the rest.
|
| Code: https://github.com/Neywiny/merge-sort Conference/abstract
| presentation: https://www.spiedigitallibrary.org/conference-
| proceedings-of...
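|
| A toy sketch of the idea (`judge` stands in for whatever
| pairwise 2AFC comparison you have, human or LLM):
|
|     from functools import cmp_to_key
|
|     def judge(a, b):
|         # 2AFC judgment: -1 if a ranks above b, else 1.
|         # No numeric score is ever assigned.
|         return -1 if len(a) < len(b) else 1
|
|     items = ["longest item", "short", "mid item"]
|     print(sorted(items, key=cmp_to_key(judge)))
|     # ['short', 'mid item', 'longest item']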
| ghita_ wrote:
| would love to check out the code if you have it!
| Neywiny wrote:
| https://github.com/Neywiny/merge-sort
|
| It was actually done to counter Elo-based approaches, so
| there are some references in the readme on how to prove who's
| better. I haven't run this code in 5 years and haven't
| developed on it in maybe 6, but I can probably fix any issues
| that come up. My co-author looks to have diverged a bit;
| haven't checked out his code.
| https://github.com/FrankWSamuelson/merge-sort . There may
| also be a fork by the FDA itself, not sure. This work was
| done for the FDA's medical imaging device evaluation
| division.
| reactordev wrote:
| I was going to mention this approach as well. The problem with
| the OP is that it has assumption bias and the entire chain is
| based on that assumption. It's novel. But the original idea was
| to more evenly distribute scores so you can find real relevance
| and I think 2AFC is better. But I don't have time to verify and
| post a paper about it.
| Neywiny wrote:
| It's probably because that's what we used, but nAFC has been
| my go-to since I first learned about it. Literally any time
| there's a ranking, even for dumb stuff like tier list videos
| on YouTube, they're too arbitrary. Ok you ranked this snack
| an 8/10. Based on what? And then they go back and say
| "actually I'm going to move that to a 7". AFC fixes all of
| that.
| npip99 wrote:
| Yes our pairwise method is based entirely on 2AFC
| comparisons, for both intra-query and inter-query ELO
| calculations.
|
| It's definitely the best, if not the only, way to get
| extremely high signal and a score assignment that actually
| converges the more you sample.
|
| In terms of the "F" in 2AFC, we actually have this amusing
| snippet from our prompt:
|
| > Do NOT output a score of 0.0, ensure to focus on which
| document is superior, and provide a negative or positive
| float between -1.0 and 1.0.
| reactordev wrote:
| Nice, I use an epoch to prevent stalemate but this might be
| better.
| Alex3917 wrote:
| Out of curiosity, is there a reason why you are using ELO proper,
| rather than one of the ELO variants that doesn't make assumptions
| about the distribution of results? E.g.:
|
| https://github.com/pfmonville/whole_history_rating
| npip99 wrote:
| Hey! We actually did a lot of research into ELO consistency,
| i.e. to check whether or not the NxN pairwise matrix followed
| the ELO model. It was a long road that's probably grounds for
| an entirely separate blog post, but the TLDR is that we observe
| that:
|
| For each document, there is a secret hidden score "s" which is
| the "fundamental relevance according to the LLM". Then, when we
| sample (q, d1, d2) from the LLM, the LLM follows the
| statistical property that:
|
| - The "fundamental hidden preference" is `pref = s_{d1} -
| s_{d2}`, usually ranging between -4 and 4.
|
| - The LLM will sample a normal distribution around the `pref`
| with stddev ~0.2, which is some "inner noise" that the LLM
| experiences before coming to a judgement.
|
| - The preference will pass through the sigmoid to get a
| sampled_score \in [0, 1].
|
| - There is an additional 2% noise. i.e., 0.98 * sampled_score +
| 0.02 * random.random()
|
| When we use Maximum Likelihood Estimation to find the most
| likely predicted "hidden scores" \hat{s} associated with each
| document, then we go ahead and sample pairwise matrices
| according to `0.98 * sigmoid( \hat{s}_1 - \hat{s}_2 + N(0,
| 0.02) ) + Uniform(0.02)`, then we get a pairwise matrix with
| virtually identical statistical properties to the observed
| pairwise matrices.
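|
| In code, that generative model looks roughly like this (a
| simulation sketch following the description above; constants
| taken from the bullets):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     def sample_preference(s1, s2):
|         # hidden preference + the LLM's "inner noise"
|         pref = (s1 - s2) + rng.normal(0.0, 0.2)
|         score = 1.0 / (1.0 + np.exp(-pref))  # squash to [0, 1]
|         return 0.98 * score + 0.02 * rng.random()  # ~2% noise
|
|     s = np.array([1.5, 0.0, -1.0])  # hidden document scores
|     M = np.array([[sample_preference(a, b) for b in s]
|                   for a in s])
|     print(np.round(M, 2))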
| slybot wrote:
| More confused:
|
| 1) 0.02 * random.random() != N(0, 0.02)
|
| 2) "The LLM will sample a normal distribution": this only
| depends on your c parameter; the absolute scale matters in
| neither Bradley-Terry nor Elo. So quoting a range of +-4 and
| claiming the LLM reasons on a standard-normal scale is
| ridiculous.
|
| 3) > then we get a pairwise matrix with virtually identical
| statistical properties to the observed pairwise matrices.
|
| Then did you ask yourselves: if the fitted pairwise matrix is
| "statistically identical" to the observed pairwise matrix,
| why bother at all? You could simply use the observed pairwise
| matrix...
| etk934 wrote:
| Will a reranker trained with MSE be better calibrated than
| one trained with InfoNCE? Will thresholds on reranker scores
| be more useful in RAG applications?
| npip99 wrote:
| We tried a Bradley-Terry loss function, as calculated in
| https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/SJ8sRl1Zge
|
| We found that MSE after Elo adjustment worked equally well.
| And MSE lets you shuffle (q, d) pairs across the dataset,
| which has good statistical properties (versus contrastive
| training, which makes you sample the same query many times
| within a single minibatch).
|
| In this case "InfoNCE" isn't applicable because the
| reranker's output is a scalar, not a vector. So that's why we
| checked both Bradley-Terry and MSE.
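|
| As a sketch, the two losses we compared look like this
| (PyTorch-style illustration with made-up data, not our
| training code):
|
|     import torch
|     import torch.nn.functional as F
|
|     # scores: model output, one scalar per (query, doc) pair
|     scores = torch.randn(8, requires_grad=True)
|
|     # Bradley-Terry loss over (winner, loser) index pairs:
|     w, l = torch.tensor([0, 2, 5]), torch.tensor([1, 3, 4])
|     bt_loss = -F.logsigmoid(scores[w] - scores[l]).mean()
|
|     # MSE against bias-adjusted Elo targets -- lets (q, d)
|     # pairs be shuffled freely across minibatches:
|     elo_targets = torch.randn(8)
|     mse_loss = F.mse_loss(scores, elo_targets)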
| pbhjpbhj wrote:
| So this is for recruitment?
|
| I like the pairwise approach but in the field I'm interested in,
| at the document level there can be a lot of relevance (we
| historically use scoring based on TF-IDF) but we tend to get a
| corpus of documents that then need involved human analysis to
| give the relevant sections. It seems that paragraph-level vectors
| are probably at the right conceptual level for refinement.
|
| Ultimately, I guess, what is considered a document is somewhat
| arbitrary. But I wondered if you'd looked at - or if someone
| here knows about - ML approaches to retrieval that consider
| documents at a mix of conceptual levels to improve retrieval.
| So, pairwise paragraph-level after a broader retrieval would
| be a simple example.
|
| I guess for looking at CV/resumes that might relate to finding
| someone who was gardener at Google and then later used ML for
| graphic design, vs someone who did ML at Google ... which might
| be a similar document vector (poor example, but you get the
| picture).
|
| Currently I'm seeing document level references to source
| material, snippets based on keywords, but not paragraph level
| referencing as you'd have for legal decisions.
| bbstats wrote:
| Little reminder that Elo is a guy, not an acronym :)
| FredrikMeyer wrote:
| Came to comment this. As a consequence, writing it in capital
| letters "ELO" is wrong.
| scoresmoke wrote:
| You might also consider a fast implementation of Elo and Bradley-
| Terry that I have been developing for some time:
| https://github.com/dustalov/evalica (Rust core, Python bindings,
| 100% test coverage, and nice API).
| swyx wrote:
| would you consider JS bindings? should be easy to vibe code
| given what you have. bonus points if it runs in the browser (eg
| export the wasm binary). thank you!
| scoresmoke wrote:
| I have been thinking about this for a while, and I think I'll
| vibe-code them. Not sure about WASM, though, as the
| underlying libraries would all need to support it, and I am
| not sure all of them do.
| npip99 wrote:
| In our case, training and inferencing the models takes days,
| while calculating all of the ELOs takes ~1 min, haha. So we
| didn't need to optimize that calculation.
|
| But we _did_ need to work on numeric stability!
|
| I have our calculations here:
| https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/B15B4Rsleg
|
| tl;dr: Wikipedia iterates on e^elo, but that can go to zero
| or infinity. Iterating on elo directly stays between -4 and 4
| in all of our observed pairwise matrices, so it's very well-
| bounded.
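|
| A sketch of the stable variant (illustrative: gradient
| ascent on the ELOs directly, rather than the multiplicative
| update on e^elo; not our exact fixed-point scheme):
|
|     import numpy as np
|
|     def fit_elos(wins, lr=0.5, iters=1000):
|         """wins[i][j]: number of times doc i beat doc j."""
|         n = wins.shape[0]
|         elo = np.zeros(n)
|         games = wins + wins.T
|         for _ in range(iters):
|             # expected wins of i over j under current ELOs:
|             # games_ij * sigmoid(elo_i - elo_j)
|             exp_w = games / (1.0 + np.exp(elo[None, :]
|                                           - elo[:, None]))
|             grad = (wins - exp_w).sum(axis=1)
|             elo += lr * grad / np.maximum(games.sum(axis=1), 1)
|             elo -= elo.mean()  # fix the gauge; stays bounded
|         return elo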
| scoresmoke wrote:
| I am working on post-training and evaluation tasks mostly,
| and I built Evalica as a convenient tool for my own use
| cases. The computation is fast enough to not bother the user,
| but the library does not stand in my way during the analysis.
| david_shi wrote:
| This is awesome, reminds me of the kind of intuition behind
| PageRank.
| fulmicoton wrote:
| One trouble I could see with your approach is that you treat
| the information "doc at position i beats doc at position j"
| independently of i and j. Intuitively, it is not as critical
| when a bad doc lands at rank 9 instead of rank 10, compared
| to a bad doc landing at rank 1 instead of rank 10.
|
| LambdaMART's approach seems better in that respect.
|
| https://medium.com/@nikhilbd/pointwise-vs-pairwise-vs-listwi...
| patrickhogan1 wrote:
| Awesome! This is great!
|
| The link in the article to the full blog explaining rerankers
| is 404ing for me.
|
| A question for you as a search-ranking expert: with o3 and
| source-quality thresholds when performing web search, could
| we implement an Elo-style cutoff where systems default to "I
| don't know" rather than citing low-ranked sources?
|
| Currently o3's main weakness is mixing high-quality sources
| with poor ones when it uses web search in the same response.
| The answer sounds authoritative throughout, but parts are
| backed by unreliable sources. This makes it harder to trust
| even the well-sourced portions (e.g. believing the US
| election is next year - not a hallucination, but a poorly
| date-formatted source it used). It also makes the response a
| lot slower.
|
| Would a hard quality threshold be better than the current
| approach of seamlessly blending good and bad sources?
| ghita_ wrote:
| Hey! Thanks so much! I fixed the link, thanks for flagging.
| Yes, the same approach could be used for internet search. The
| fact that we now have an "absolute score" is very
| interesting, since we can also use a threshold value to
| determine when an answer simply doesn't exist in a corpus.
| The only issue is that if all scores are below the cutoff
| value, you end up discarding them all and getting many "I
| don't know"s. The best approach could be to flag the "trust"
| the model has in each retrieved source and surface it as
| such.
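|
| The naive cutoff would look like this (toy sketch; the
| threshold value is made up):
|
|     def answer_or_abstain(results, tau=0.5):
|         """results: list of (doc, absolute_relevance_score)."""
|         kept = [(d, s) for d, s in results if s >= tau]
|         # pass only trusted sources on, else abstain
|         return kept or "I don't know"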
| slybot wrote:
| > Fit an Elo-style rating system (Bradley-Terry) to turn
| pairwise comparisons into absolute per-document scores.
|
| There are some conceptual gaps, and this sentence is
| misleading in general. First, it implies that Bradley-Terry
| is some sort of Elo variant, which is not true. The Elo
| rating was introduced nearly 10 years later, in a completely
| different domain.
|
| They are two completely different ranking systems: Bradley-
| Terry uses a ratio-based score function, while Elo uses a
| logistic one. The scales of the scores are completely
| different, as is their sensitivity to score differences.
|
| Possibly, Bradley-Terry is preferred by the authors because
| the likelihood is simpler to evaluate and the update doesn't
| depend on the order of the pairwise evaluations.
|
| There are also variants of the Elo rating that use MLE
| (optimized Elo) and, recently, Bayesian Elo. For post-hoc
| time-invariant scores, there is the randomized Elo rating,
| and so on.
|
| People like Elo ratings because they are simple to
| understand. Most of the time, they forget that they were
| developed specifically for chess tournaments. All the
| variants above, and dozens more, try to improve (fix) one
| aspect of Elo ratings, because their application has no 100%
| clear determination of a winner, the update scale parameter
| is too small or too large, matches are played
| simultaneously, different numbers of matches are played, and
| so on.
|
| Also, say one document is always preferred by all LLMs, so
| it has only wins; then MLE will yield a flat marginal
| likelihood for it, and the update parameter (c) will go to
| infinity.
| timhh wrote:
| Explanation of Bradley-Terry here:
| https://stats.stackexchange.com/a/131270/60526
|
| It's such a great and simple algorithm. I feel like it deserves
| to be more widely known.
|
| I used it at Dyson to evaluate really subjective things like how
| straight a tress of hair is - pretty much impossible to say if
| you just look at a photo, but you can ask a bunch of people to
| compare two photos and say which looks straighter, then you can
| get an objective ranking.
| npip99 wrote:
| Yeah, absolutely. In your link, it iterates on the
| exponentiated ratings p_i = e^{elo_i} until it finds the
| fixed point.
|
| In our training pipeline, we had to convert the fixed-point
| iteration to be on elo_i directly, for numerical stability.
| I have a post on that here:
| https://hackmd.io/x3_EkXGKRdeq-rNHo_RpZA
|
| Bradley-Terry also very cleanly turns into a loss function that
| you can do gradient descent on, which will cause your model to
| efficiently learn Elo scores! Our calculations are at:
| https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw
| rjmunro wrote:
| I think Elo-style rankings would be good for rating e.g. Uber
| rides and restaurant reviews. Instead of asking for a rating
| out of 5 stars or similar, where everyone basically ends up
| giving 5 stars, just ask whether it was better or worse than
| your last experience.
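|
| E.g. the standard online Elo update after each better/worse
| answer (sketch; the K-factor is arbitrary):
|
|     def elo_update(winner, loser, k=32):
|         expected = 1 / (1 + 10 ** ((loser - winner) / 400))
|         delta = k * (1 - expected)
|         return winner + delta, loser - delta
|
|     # this ride beat the last (higher-rated) one:
|     this_ride, last_ride = elo_update(1500, 1520)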
| ricree wrote:
| Is this really viable for something like Uber, where most rides
| aren't really meaningfully better or worse?
| adamgusky wrote:
| super cool
___________________________________________________________________
(page generated 2025-07-17 23:01 UTC)