[HN Gopher] Show HN: Improving search ranking with chess Elo scores
       ___________________________________________________________________
        
       Show HN: Improving search ranking with chess Elo scores
        
        Hello HN, I'm Ghita, co-founder of ZeroEntropy (YC W25). We build
        high-accuracy search infrastructure for RAG and AI agents. We just
        released two new state-of-the-art rerankers, zerank-1 and
        zerank-1-small. One of them is fully open-source under Apache 2.0.
        We trained these models using a novel Elo-score-inspired pipeline,
        which we describe in detail in the attached blog post. In a
        nutshell, here is an outline of the training steps:
         
        * Collect soft preferences between pairs of documents using an
          ensemble of LLMs.
        * Fit an ELO-style rating system (Bradley-Terry) to turn pairwise
          comparisons into absolute per-document scores (a minimal sketch
          of this step is included below).
        * Normalize relevance scores across queries using a bias
          correction step, modeled with cross-query comparisons and
          solved with MLE.
         
        You can
       try the models either through our API
       (https://docs.zeroentropy.dev/models), or via HuggingFace
       (https://huggingface.co/zeroentropy/zerank-1-small).  We would love
       this community's feedback on the models, and the training approach.
       A full technical report is also going to be released soon.  Thank
       you!
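         
        A minimal, illustrative sketch of the rating-fit step above,
        assuming a per-query matrix prefs[i, j] holding the LLM ensemble's
        soft probability that document i beats document j (hypothetical
        names, not the production pipeline):
         
            import numpy as np
         
            def fit_bradley_terry(prefs, n_iters=500, lr=0.1):
                """Fit one Bradley-Terry / Elo-style score per document by
                gradient ascent on the log-likelihood of soft preferences."""
                n = prefs.shape[0]
                scores = np.zeros(n)
                for _ in range(n_iters):
                    # Model win probability under the current scores.
                    diff = scores[:, None] - scores[None, :]
                    p_hat = 1.0 / (1.0 + np.exp(-diff))
                    # Gradient of sum prefs[i, j] * log(sigmoid(s_i - s_j)).
                    g = prefs * (1.0 - p_hat)
                    grad = g.sum(axis=1) - g.sum(axis=0)
                    scores += lr * grad
                    scores -= scores.mean()  # defined only up to a shift
                return scores
         
        The cross-query bias-correction step would then shift these
        per-query scores before they are used as absolute targets.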
        
       Author : ghita_
       Score  : 182 points
       Date   : 2025-07-16 14:17 UTC (1 days ago)
        
 (HTM) web link (www.zeroentropy.dev)
 (TXT) w3m dump (www.zeroentropy.dev)
        
       | sippeangelo wrote:
       | Really cool stuff! Just want to let you know you forgot to link
       | to the evals at the end.
        
         | ghita_ wrote:
          | oh wow, thanks for flagging, just fixed, thanks!
        
       | esafak wrote:
       | I would have titled it "Improving ranking..."
       | 
       | I like that it works with `sentence_transformers`
        
         | ghita_ wrote:
         | yes we found it hard to find a good title for this, thanks for
         | the feedback
        
         | dang wrote:
         | We could change the title to "Improving search ranking with
         | chess Elo scores". Anybody object?
         | 
         | Edit: ok, done. Submitted title was "Show HN: Improving RAG
         | with chess Elo scores".
        
           | slybot wrote:
           | They don't use Elo scores. See my comment above, the loss
           | function is adopted from Bradley-Terry.
        
             | npip99 wrote:
             | Bradley-Terry and Elo scores are equivalent mathematical
             | models! The fundamental presumption is the same Thurstone
             | model - that an individual's skill in a particular game is
             | a normally distributed random variable around their
             | fundamental skill.
             | 
             | We did experiment with a Bradley-Terry loss function
             | (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw), but we found
             | that even better was to calculate Elo scores, do cross-
             | query bias adjustment, and then MSE loss to predict the Elo
             | score itself.
        
               | slybot wrote:
                | -> Bradley-Terry and Elo scores are equivalent
                | mathematical models!
                | 
                | No, they are not equivalent mathematical models; they
                | are equivalent in terms of the score function
                | (logistic) given matching scale factors. That is,
                | Bradley-Terry uses 1/(1 + e^(x(r_B - r_A))) and the
                | Elo rating uses 1/(1 + 10^((r_B - r_A)/y)), so
                | equivalence requires x = ln(10)/y. More importantly,
                | the Elo rating is an _online_ scoring system, meaning
                | it takes into account the sequence of events. From
                | your blog post, I understand that you are not updating
                | the scores after each event. In other words, an Elo
                | rating can be interpreted as an incremental fitting of
                | a Bradley-Terry model (using a similar logistic), but
                | it is not the same thing!
                | 
                | -> The fundamental presumption is the same Thurstone
                | model
                | 
                | The Thurstone model is similar, and as you said it
                | assumes a normal (as opposed to logistic) distribution
                | via a probit link function. It predates both models,
                | and due to computational constraints you can call
                | Bradley-Terry and Elo ratings computationally
                | convenient approximations of the Thurstone model.
                | 
                | -> We did experiment with a Bradley-Terry loss function
                | (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw)
                | 
                | The math is correct, thanks for sharing. Indeed, if
                | you do incremental updating, you lose
                | differentiability, since the next winning probability
                | depends on the previous updates. Call it what you
                | want, but note that this is not truly an Elo rating,
                | which leads to misunderstanding. It is Bradley-Terry,
                | given that you do batch updates, which you then take
                | extra steps to connect with an Elo score, as shown in
                | the link.
                | 
                | Lastly, the normal and logistic distributions will
                | lead to log(0) in evaluations, which results in inf in
                | the loss. As I can see from your comment above, you
                | try to add Uniform(0.02) as an ad-hoc fix. A more
                | elegant fix is to use a heavy-tailed distribution such
                | as the Cauchy.
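                | 
                | A quick numeric check of that scale relation (purely
                | illustrative):
                | 
                |     import math
                | 
                |     y = 400.0              # the usual chess Elo scale
                |     x = math.log(10) / y
                |     for diff in (-200.0, 0.0, 100.0, 400.0):  # r_B - r_A
                |         bt = 1.0 / (1.0 + math.exp(x * diff))
                |         elo = 1.0 / (1.0 + 10 ** (diff / y))
                |         assert abs(bt - elo) < 1e-12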
        
       | ashwindharne wrote:
       | Cool stuff! We use a similar process internally to rerank and
       | filter our cold outbound lists. We just use an off-the-shelf
       | model as the judge, give it a custom criteria, and let it run
       | until some set number of iterations. It's helped narrow down wide
       | searches to the maximally relevant set of people (few thousand
       | medium-bad matches to few hundred good matches)
       | 
       | It's not cheap and it's not fast, but it definitely works pretty
       | well!
        
         | jayunit wrote:
         | Very interesting! What are some examples of criteria that you
         | can evaluate pairwise, but couldn't score individually?
        
           | bravura wrote:
            | Pairwise rank constraints involve fewer assumptions than
            | per-item scoring about the underlying nature of the data,
            | thus they are more robust.
        
             | npip99 wrote:
             | Yeah that's exactly what we observed. Our goal was to
             | create an absolute score that's completely independent from
             | the Corpus, which is difficult because naturally all ELO
             | distributions are inherently tied to the corpus itself!
             | 
             | When we were exploring the mathematical foundations, we
             | considered ELO scoring against a "Universal Corpus" based
             | on the natural entropy of human language (Obviously that's
             | intractable, but sometimes this term cancels out like in
             | the DPO proof).
             | 
             | But eventually we figured out a method using cross-query
             | comparisons to assign an "ELO bias" to all document ELOs
             | within a given query's candidate list. This normalizes it
             | correctly such that when a candidate list is all bad, the
             | ELOs shift low. And when the candidate list is all good,
             | the ELOs shift high. Even when the relative ELOs are all
             | the same.
        
           | ashwindharne wrote:
           | It's all unstructured text (title, company, company size,
           | experience, skills, raw text, etc.) and LLMs are pretty bad
           | at assigning numerical scores in a vacuum. To make it work,
           | we'd have to provide a representative set of examples, break
           | scoring down by specific field, etc.
           | 
           | Kind of a lot of work compared to just dumping the text of 2
           | profiles into a context window along with a vague description
           | of what I want, and having the LLM make the binary judgment.
        
       | yalok wrote:
       | What's the expected additional latency due to running this re-
       | ranker?
        
         | ghita_ wrote:
         | It actually runs pretty fast, our benchmarks show ~149ms for
         | 12665 bytes. It's faster than many other models
        
           | esafak wrote:
           | I would prominently display your benchmarks (against your
           | competitors, of course). That's your selling point, right?
        
             | ghita_ wrote:
             | Yes! We did this here:
             | https://www.zeroentropy.dev/blog/announcing-zeroentropys-
             | fir... We wanted to share the approach with the community
             | in this post. It does do better than competitors though!
        
       | seanhunter wrote:
       | Fun fact about ELO. It's natural to think that it is some kind of
       | initialism, but in fact ELO doesn't stand for anything. It's the
       | name of the guy who invented the system.
       | https://en.wikipedia.org/wiki/Arpad_Elo
       | 
       | So don't say it "E.L.O." (unless you're talking about the band, I
       | guess), say "ee-low"
        
         | ghita_ wrote:
         | oh interesting, had no idea, thanks for sharing
        
         | amelius wrote:
         | What was his ELO rating?
        
           | homarp wrote:
           | https://chess.stackexchange.com/questions/35420/what-was-
           | arp...
           | 
           | 2065
        
         | esafak wrote:
         | It should be Elo rating!
         | https://en.wikipedia.org/wiki/Elo_rating_system
        
         | reactordev wrote:
          | It's also popular for ranking online players in games...
          | really any game where there's a win/loss ranking.
        
         | kayge wrote:
         | Thanks for this :) I had never heard of Elo until I noticed
         | this morning that the new Chess course in Duolingo gives you an
         | Elo ranking after a few rounds against Oscar. Probably would
         | have skipped right over this story and comments otherwise, but
         | now I have a fun bit of non-tech trivia to share if it ever
         | comes up in small talk someday.
        
           | rurban wrote:
            | In table tennis we also use the ELO ranking, because it's
            | pretty fair. If you lose to a good player you don't lose
            | many points, but when you lose to a bad player you'll lose
            | a lot. Likewise when you win.
        
         | npip99 wrote:
         | I often see it rendered as "Elo" but I've always found it more
         | natural to capitalize as "ELO", but perhaps I should swap to
         | "Elo" given this. Pronouncing "ee-low" is certainly the way
         | it's done in chess/esports though!
        
         | bbstats wrote:
         | (also because it's a name, you don't capitalize all three
         | letters)
        
         | fvdessen wrote:
         | Similar to the 'Gini coefficient', named after Corrado Gini,
         | former president of the Italian Genetics and Eugenics Society
          | and author of 'The Scientific Basis of Fascism'
         | 
         | https://en.wikipedia.org/wiki/Corrado_Gini
        
       | rahulnair23 wrote:
       | Interesting work.
       | 
       | For a slightly different take using a similar intuition, see our
       | paper [at ACL 2024](https://arxiv.org/abs/2402.14860) on ranking
       | LLMs which may be of interest.
       | 
       | Our HuggingFace space has some examples:
       | https://huggingface.co/spaces/ibm/llm-rank-themselves
        
         | ghita_ wrote:
         | thank you, will check out the paper, the hf space is very cool!
        
       | mkaszkowiak wrote:
       | Happy to see competition in rerankers! Good luck with your
       | product.
       | 
       | My questions: what languages do your models currently support?
       | Did you perform multilingual benchmarks? Couldn't find an answer
       | on the website
        
         | ghita_ wrote:
          | Thanks! We trained on most European languages (English,
          | French, Spanish, Russian...), Arabic, and Chinese, so it does
          | well on those! We haven't tested much on other languages, but
          | happy to do so if there is a use case.
        
         | ethan_smith wrote:
         | Language support is a crucial differentiator for rerankers -
         | would love to see MTEB or other cross-lingual benchmark results
         | if you have them.
        
       | Neywiny wrote:
        | I have a paper that got rejected, but it was about using 2AFC
        | sorting to do this instead of Elo. It has a defined end, unlike
        | Elo scores. The code is on my GitHub and focuses on humans
        | sorting images, but basically, if you have a Python sort
        | function, you pass your comparison in as the key instead of
        | assigning the comparison a numeric score. Then the algorithm
        | does the rest.
       | 
       | Code: https://github.com/Neywiny/merge-sort Conference/abstract
       | presentation: https://www.spiedigitallibrary.org/conference-
       | proceedings-of...
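        | 
        | As a toy illustration of the "comparison as the key" idea (the
        | judge below is a stand-in for the human or LLM forced choice):
        | 
        |     from functools import cmp_to_key
        | 
        |     def judge(a, b):
        |         # Placeholder 2AFC comparator: negative means `a`
        |         # should rank above `b`, positive means below.
        |         return -1 if len(a) < len(b) else 1
        | 
        |     docs = ["a long document", "tiny", "a medium doc"]
        |     ranked = sorted(docs, key=cmp_to_key(judge))
        |     print(ranked)  # the sort supplies the pairwise comparisons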
        
         | ghita_ wrote:
         | would love to check out the code if you have it!
        
           | Neywiny wrote:
           | https://github.com/Neywiny/merge-sort
           | 
           | It was actually done to counter Elo based approaches so
           | there's some references in the readme on how to prove who's
           | better. I haven't run this code in 5 years and haven't
           | developed on it in maybe 6, but I can probably fix any issues
           | that come up. My co-author looks to have diverged a bit.
           | Haven't checked out his code.
           | https://github.com/FrankWSamuelson/merge-sort . There may
           | also be a fork by the FDA itself, not sure. This work was
           | done for the FDA's medical imaging device evaluation division
        
         | reactordev wrote:
         | I was going to mention this approach as well. The problem with
         | the OP is that it has assumption bias and the entire chain is
         | based on that assumption. It's novel. But the original idea was
         | to more evenly distribute scores so you can find real relevance
         | and I think 2AFC is better. But I don't have time to verify and
         | post a paper about it.
        
           | Neywiny wrote:
           | It's probably because that's what we used, but nAFC has been
           | my go-to since I first learned about it. Literally any time
           | there's a ranking, even for dumb stuff like tier list videos
           | on YouTube, they're too arbitrary. Ok you ranked this snack
           | an 8/10. Based on what? And then they go back and say
           | "actually I'm going to move that to a 7". AFC fixes all of
           | that.
        
           | npip99 wrote:
           | Yes our pairwise method is based entirely on 2AFC
           | comparisons, for both intra-query and inter-query ELO
           | calculations.
           | 
           | It's definitely the best if not only way to get extremely
           | high signal, and a score assignment that actually converges
           | the more you sample.
           | 
           | In terms of the "F" in 2AFC, we actually have this amusing
           | snippet from our prompt:
           | 
           | > Do NOT output a score of 0.0, ensure to focus on which
           | document is superior, and provide a negative or positive
           | float between -1.0 and 1.0.
        
             | reactordev wrote:
             | Nice, I use an epoch to prevent stalemate but this might be
             | better.
        
       | Alex3917 wrote:
       | Out of curiosity, is there a reason why you are using ELO proper,
       | rather than one of the ELO variants that doesn't make assumptions
       | about the distribution of results? E.g.:
       | 
       | https://github.com/pfmonville/whole_history_rating
        
         | npip99 wrote:
         | Hey! We actually did a lot of research into ELO consistency,
         | i.e. to check whether or not the NxN pairwise matrix followed
         | the ELO model. It was a long road that's probably grounds for
         | an entirely separate blog post, but the TLDR is that we observe
         | that:
         | 
         | For each document, there is a secret hidden score "s" which is
         | the "fundamental relevance according to the LLM". Then, when we
         | sample (q, d1, d2) from the LLM, the LLM follows the
         | statistical property that:
         | 
         | - The "fundamental hidden preference" is `pref = s_{d1} -
         | s_{d2}`, usually ranging between -4 and 4.
         | 
         | - The LLM will sample a normal distribution around the `pref`
         | with stddev ~0.2, which is some "inner noise" that the LLM
         | experiences before coming to a judgement.
         | 
         | - The preference will pass through the sigmoid to get a
         | sampled_score \in [0, 1].
         | 
         | - There is an additional 2% noise. i.e., 0.98 * sampled_score +
         | 0.02 * random.random()
         | 
          | When we use Maximum Likelihood Estimation to find the most
          | likely predicted "hidden scores" \hat{s} associated with each
          | document, and then sample pairwise matrices according to
          | `0.98 * sigmoid( \hat{s}_1 - \hat{s}_2 + N(0, 0.02) ) +
          | Uniform(0.02)`, we get pairwise matrices with virtually
          | identical statistical properties to the observed ones.
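          | 
          | A minimal simulation of that generative story (hypothetical
          | code, with the parameters taken from the description above):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          | 
          |     def sample_judgement(s1, s2, inner_std=0.2, noise=0.02):
          |         """One simulated LLM comparison of d1 vs d2, given
          |         hidden relevance scores s1 and s2."""
          |         pref = s1 - s2                      # hidden preference
          |         pref += rng.normal(0.0, inner_std)  # LLM "inner noise"
          |         p = 1.0 / (1.0 + np.exp(-pref))     # sigmoid to [0, 1]
          |         return (1 - noise) * p + noise * rng.uniform()
          | 
          |     # A clearly better document wins most simulated comparisons.
          |     wins = np.mean([sample_judgement(1.5, -0.5) > 0.5
          |                     for _ in range(1000)])
          |     print(wins)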
        
           | slybot wrote:
            | More confused:
            | 
            | 1) 0.02 * random.random() != N(0, 0.02)
            | 
            | 2) "The LLM will sample a normal distribution": this only
            | depends on your scale parameter c; the absolute scale
            | doesn't matter in either Bradley-Terry or Elo. So quoting
            | +-4 and claiming the LLM reasons on a standard normal scale
            | is ridiculous.
            | 
            | 3) > then we get a pairwise matrix with virtually identical
            | statistical properties to the observed pairwise matrices
            | 
            | Then did you ask yourselves: if the simulated pairwise
            | matrix is "statistically identical" to the observed one,
            | why bother at all? You could simply use the observed
            | pairwise matrix...
        
       | etk934 wrote:
        | Will a reranker trained with MSE be better calibrated than one
        | trained with InfoNCE? Will thresholds on reranker scores be
        | more useful in RAG applications?
        
         | npip99 wrote:
          | We tried a Bradley-Terry loss function, as calculated with
          | https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/SJ8sRl1Zge
          | 
          | We found that MSE after Elo adjustment worked equally well.
          | And MSE lets you shuffle (q, d) pairs across the dataset,
          | which has good statistical properties (versus contrastive,
          | which makes you sample the same query many times within a
          | single minibatch).
          | 
          | In this case "InfoNCE" isn't applicable because the
          | reranker's output is a scalar, not a vector. So that's why we
          | checked both Bradley-Terry and MSE.
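          | 
          | Roughly, the pointwise setup looks like this (hypothetical
          | field and model names, shown only to illustrate the shape of
          | the MSE objective, not our actual training code):
          | 
          |     import torch.nn.functional as F
          | 
          |     def mse_on_elo_step(model, batch, optimizer):
          |         # `batch` holds tokenized (query, document) pairs and
          |         # their bias-adjusted Elo targets; pairs from
          |         # different queries can share a minibatch.
          |         scores = model(batch["input_ids"],
          |                        attention_mask=batch["attention_mask"])
          |         loss = F.mse_loss(scores.squeeze(-1),
          |                           batch["elo_target"])
          |         optimizer.zero_grad()
          |         loss.backward()
          |         optimizer.step()
          |         return loss.item()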
        
       | pbhjpbhj wrote:
       | So this is for recruitment?
       | 
       | I like the pairwise approach but in the field I'm interested in,
       | at the document level there can be a lot of relevance (we
       | historically use scoring based on TF-IDF) but we tend to get a
       | corpus of documents that then need involved human analysis to
       | give the relevant sections. It seems that paragraph-level vectors
       | are probably at the right conceptual level for refinement.
       | 
       | Ultimately I guess, what is considered a document is somewhat
       | arbitrary. But I wondered if you'd looked at - or if someone here
       | knows about - MLs for retrieval that consider documents at a mix
       | of conceptual levels to improve retrieval. So, pairwise
       | paragraph-level after a broader retrieval would be a simple
       | example.
       | 
       | I guess for looking at CV/resumes that might relate to finding
       | someone who was gardener at Google and then later used ML for
       | graphic design, vs someone who did ML at Google ... which might
       | be a similar document vector (poor example, but you get the
       | picture).
       | 
       | Currently I'm seeing document level references to source
       | material, snippets based on keywords, but not paragraph level
       | referencing as you'd have for legal decisions.
        
       | bbstats wrote:
       | Little reminder that Elo is a guy, not an acronym :)
        
         | FredrikMeyer wrote:
         | Came to comment this. As a consequence, writing it in capital
         | letters "ELO" is wrong.
        
       | scoresmoke wrote:
       | You might also consider a fast implementation of Elo and Bradley-
       | Terry that I have been developing for some time:
       | https://github.com/dustalov/evalica (Rust core, Python bindings,
       | 100% test coverage, and nice API).
        
         | swyx wrote:
         | would you consider JS bindings? should be easy to vibe code
         | given what you have. bonus points if it runs in the browser (eg
         | export the wasm binary). thank you!
        
           | scoresmoke wrote:
            | I have been thinking about this for a while, and I think
            | I'll vibe-code them. Not sure about WASM, though, as the
            | underlying libraries would need to support it too, and I am
            | not sure all of them do.
        
         | npip99 wrote:
          | In our case, training and running inference with the models
          | takes days, while calculating all of the ELOs takes about a
          | minute, haha. So we didn't need to optimize the calculation.
         | 
         | But, we _did_ need to work on numeric stability!
         | 
         | I have our calculations here: -
         | https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/B15B4Rsleg
         | 
         | tldr; wikipedia iterates on <e^elo>, but that can go to zero or
         | infinity. Iterating on <elo> stays between -4 and 4 in all of
         | our observed pairwise matrices, so it's very well-bounded.
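          | 
          | For a rough idea of what "iterating on elo" looks like, here
          | is a log-space version of the standard fixed-point update
          | (illustrative only; wins[i, j] counts i beating j):
          | 
          |     import numpy as np
          | 
          |     def fit_elos(wins, n_iters=100):
          |         games = wins + wins.T
          |         elo = np.zeros(wins.shape[0])
          |         for _ in range(n_iters):
          |             # P(i beats j) depends only on rating differences,
          |             # so nothing is exponentiated to huge values.
          |             p = 1.0 / (1.0 + np.exp(elo[None, :] - elo[:, None]))
          |             np.fill_diagonal(p, 0.0)
          |             expected = (games * p).sum(axis=1)
          |             elo += np.log(wins.sum(axis=1) + 1e-9)
          |             elo -= np.log(expected + 1e-9)
          |             elo -= elo.mean()  # fix the additive gauge
          |         return elo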
        
           | scoresmoke wrote:
           | I am working on post-training and evaluation tasks mostly,
           | and I built Evalica as a convenient tool for my own use
           | cases. The computation is fast enough to not bother the user,
           | but the library does not stand in my way during the analysis.
        
       | david_shi wrote:
       | This is awesome, reminds me of the kind of intuition behind
       | PageRank.
        
       | fulmicoton wrote:
        | One issue I could see with your approach is that you treat the
        | information "Doc at pos i" beats "Doc at pos j" independently
        | of i and j. Intuitively, it is not as critical when a bad doc
        | lands at rank 9 instead of rank 10, compared to a bad doc
        | landing at rank 1 instead of rank 10.
       | 
       | LambdaMART's approach seems better in that respect.
       | 
       | https://medium.com/@nikhilbd/pointwise-vs-pairwise-vs-listwi...
        
       | patrickhogan1 wrote:
       | Awesome! This is great!
       | 
       | The link in the article to the full blog explaining rerankers is
       | 404ing for me.
       | 
        | A question for you, as an expert on search ranking, about o3
        | and source-quality thresholds when performing web search: could
        | we implement an ELO-style cutoff where systems default to "I
        | don't know" rather than citing low-ranked sources?
        | 
        | Currently o3's main weakness is mixing high-quality sources
        | with poor ones when it uses web search in the same response.
        | The answer sounds authoritative throughout, but parts are
        | backed by unreliable sources. This makes it harder to trust
        | even the well-sourced portions (e.g. believing the US election
        | is next year - not a hallucination, but a poorly date-formatted
        | source it used). It also makes the response a lot slower.
       | 
       | Would a hard quality threshold be better than the current
       | approach of seamlessly blending good and bad sources?
        
         | ghita_ wrote:
         | Hey! Thanks so much! I fixed the link thanks for flagging. Yes
         | the same approach could be used for internet search. The fact
         | that we now have an "absolute score" is very interesting since
         | we can also use a threshold value to determine when an answer
         | simply doesn't exist in a corpus. The only issue is that if all
         | scores are below the cutoff value, you end up discarding them
         | all, and end up with many "I don't know"s. Best approach could
         | just be to flag the "trust" the model has in each source
         | retrieved and use it as such.
        
       | slybot wrote:
       | > Fit an ELO-style rating system (Bradley-Terry) to turn pairwise
       | comparisons into absolute per-document scores.
       | 
        | There are some conceptual gaps, and this sentence is misleading
        | in general. First, it implies that Bradley-Terry is some sort of
        | Elo variant, which is not true. The Elo rating was introduced
        | nearly 10 years later, in a completely different domain.
        | 
        | They are two completely different ranking systems. Bradley-Terry
        | uses a ratio-based score function, while Elo uses a logistic
        | one. The scales of the scores are completely different, as is
        | their sensitivity to score differences.
        | 
        | Possibly, Bradley-Terry is preferred by the authors because the
        | likelihood evaluation is simpler and the update doesn't depend
        | on the order of the pairwise evaluations.
        | 
        | There are also variants of the Elo rating that use MLE
        | (optimized Elo) and, more recently, Bayesian Elo. For post-hoc
        | time-invariant scores, there is the randomized Elo rating, and
        | so on.
        | 
        | People like Elo ratings because they are simple to understand.
        | Most of the time, they forget why they were developed
        | specifically for chess tournaments. All the variants above, and
        | dozens more, try to improve (fix) one aspect of the Elo rating:
        | their application has no 100% clear determination of a winner,
        | the update scale parameter is too small or too large, matches
        | are played simultaneously, different matches are played, and so
        | on.
        | 
        | Also, say one document is always preferred by all LLMs, so it
        | has only wins; then MLE will result in a flat marginal
        | likelihood for it, and the update parameter (c) will go to
        | infinity.
        
       | timhh wrote:
       | Explanation of Bradley-Terry here:
       | https://stats.stackexchange.com/a/131270/60526
       | 
       | It's such a great and simple algorithm. I feel like it deserves
       | to be more widely known.
       | 
       | I used it at Dyson to evaluate really subjective things like how
       | straight a tress of hair is - pretty much impossible to say if
       | you just look at a photo, but you can ask a bunch of people to
       | compare two photos and say which looks straighter, then you can
       | get an objective ranking.
        
         | npip99 wrote:
          | Yeah absolutely. In your link, it iterates on e^elo until it
          | finds the fixed point.
          | 
          | In our training pipeline, we had to convert the fixed-point
          | iteration to be on elo directly for numerical stability. I
          | have a post on that here: https://hackmd.io/x3_EkXGKRdeq-rNHo_RpZA
         | 
         | Bradley-Terry also very cleanly turns into a loss function that
         | you can do gradient descent on, which will cause your model to
         | efficiently learn Elo scores! Our calculations are at:
         | https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw
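          | 
          | As a sketch, that loss is just (assuming the reranker emits
          | one scalar per query-document pair; not the exact code from
          | the link):
          | 
          |     import torch.nn.functional as F
          | 
          |     def bradley_terry_loss(score_preferred, score_other):
          |         # -log sigmoid(s_w - s_l): minimized when the
          |         # preferred document gets the higher score.
          |         return -F.logsigmoid(score_preferred -
          |                              score_other).mean()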
        
       | rjmunro wrote:
       | I think Elo style rankings would be good for rating e.g. Uber
       | rides and restaurant reviews. Instead of asking to rate out of 5
       | stars or similar, where everyone basically ends up giving 5
       | stars, just ask was it better or worse than your last experience.
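        | 
        | (For the mechanics, a toy sketch: after each "better or worse
        | than last time" answer you would nudge the two ratings, e.g.
        | 
        |     def elo_update(r_winner, r_loser, k=32.0):
        |         expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
        |         delta = k * (1.0 - expected)
        |         return r_winner + delta, r_loser - delta
        | 
        | with k controlling how fast ratings move.)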
        
         | ricree wrote:
         | Is this really viable for something like Uber, where most rides
         | aren't really meaningfully better or worse?
        
       | adamgusky wrote:
       | super cool
        
       ___________________________________________________________________
       (page generated 2025-07-17 23:01 UTC)