[HN Gopher] Understanding the BM25 full text search algorithm
       ___________________________________________________________________
        
       Understanding the BM25 full text search algorithm
        
       Author : rrampage
       Score  : 268 points
       Date   : 2024-11-20 03:43 UTC (19 hours ago)
        
 (HTM) web link (emschwartz.me)
 (TXT) w3m dump (emschwartz.me)
        
       | jankovicsandras wrote:
       | Shameless plug:
       | 
       | https://github.com/jankovicsandras/plpgsql_bm25
       | 
       | https://github.com/jankovicsandras/bm25opt
        
         | mark_l_watson wrote:
          | Thanks, yesterday I was thinking of adding BM25 to a little
          | side project, so this is a well-timed plug!
         | 
         | Do you know of any pure Python wrapper projects for managing
         | large numbers of text and PDF documents? I thought of using
          | Solr or Elasticsearch, but that seems too heavyweight for what
          | I am doing. I am considering using SQLite with pysqlite3 and
          | PyPDF2, since SQLite's FTS5 supports BM25. Sorry to be off
          | topic, but I
         | imagine many people are looking at tools for building hybrid
         | BM25 / vector store / LLM applications.
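          | 
          | Something like this minimal FTS5 sketch is the kind of thing
          | I have in mind (untested; the schema is made up):
          | 
          |   import sqlite3
          | 
          |   db = sqlite3.connect("docs.db")
          |   db.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
          |   db.execute("INSERT INTO docs VALUES (?, ?)",
          |              ("intro", "BM25 ranks documents for a query"))
          |   # FTS5's bm25() is more negative for better matches, so
          |   # ascending order puts the best results first
          |   rows = db.execute(
          |       "SELECT title, bm25(docs) AS score FROM docs "
          |       "WHERE docs MATCH ? ORDER BY score LIMIT 5",
          |       ("bm25",)).fetchall()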
        
         | softwaredoug wrote:
          | If we're shamelessly plugging passion projects, SearchArray is
          | a pandas extension for fulltext (BM25) search, for dorking
          | around with things in Google Colab
         | 
         | https://github.com/softwaredoug/searcharray
         | 
         | I'll also plug Xing Han Lu's BM25S which is very popular with
         | similar goals:
         | 
         | https://github.com/xhluca/bm25s
        
       | jll29 wrote:
       | Nice write-up.
       | 
        | A few more details/background that are harder to find: "BM25"
        | stands for "Best Matching 25": "best matching" because it is a
        | formula for ranking and term weighting (the matching refers to
        | the term in the query versus the document), and the number 25 is
        | simply a running number (there were 24 earlier formula variants
        | and some later ones, but #25 turned out to work best, so it was
        | the one that was published).
       | 
       | It was conceived by Stephen Robertson and Karen Sparck Jones (the
       | latter of IDF fame) and first implemented in the former's OKAPI
       | information retrieval (research) system. The OKAPI system was
       | benchmarked at the annual US NIST TREC (Text Retrieval
       | Conference) for a number of years, the international "World
       | Champtionship" of search engine methods (although the event is
       | not about winning, but about compariing notes and learning from
       | each other, a highly recommended annual event held every November
       | in Gaithersburg, Maryland, attended by global academic and
       | industry teams that conduct research on improving search - see
       | trec.nist.gov).
       | 
       | Besides the "bag of words" Vector Space Model (sparse vectors of
       | terms), the Probabilistic Modles (that BM25 belongs to), there
       | are suprising and still growing number of other theoretical
       | frameworks how to rank a set of documents, given a query
       | ("Divergence from Randomness", "Statistical Language Modeling,
       | "Learning to Rank", "Quantum Information Retrieval", "Neural
       | Ranking" etc.). Conferences like ICTIR and SIGIR still publish
       | occasionaly entirely new paradigms for search. Note that the
       | "Statistical Language Modeling" paradigm is not about Large
       | Language Models that are on vogue now (that's covered under the
       | "Neural Retrieval" umbrella), and that "Quantum IR" is not going
       | to get you to a tutorial about Quantum Information Retrieval but
       | to methods of infrared spectroscopy or a company with the same
       | name that produces cement; such are the intricacies of search
       | technology, even in the 21st century.
       | 
       | If you want to play with BM25 and compare it with some of the
        | alternatives, I recommend the research platform Terrier, an
       | open-source search engine developed at the University of Glasgow
       | (today, perhaps the epicenter of search research).
       | 
       | BM25 is over a quarter century old, but has proven to be a hard
       | baseline to beat (it is still often used as a reference point for
        | comparing new methods against), and a more recent variant, BM24F,
       | can deal with multiple fields and hypertext (e.g. title, body of
       | documents, hyperlinks).
       | 
       | The recommended paper to read is: Sparck Jones, K.; Walker, S.;
       | Robertson, S. E. (2000). "A probabilistic model of information
       | retrieval: Development and comparative experiments: Part 1".
       | Information Processing & Management 36(6): 779-808, and its
       | successor, Part 2. (Sadly they are not open access.)
        
         | marcyb5st wrote:
         | Thanks for sharing!
         | 
          | Do you have more information about BM24F? Googling (also
          | Google Scholar) didn't yield anything related. Thanks in
          | advance!
        
           | bradleyjkemp wrote:
           | A typo I think, should be BM25F. From Wikipedia:
           | 
           | > BM25F (or the BM25 model with Extension to Multiple
           | Weighted Fields) is a modification of BM25 in which the
           | document is considered to be composed from several fields
           | (such as headlines, main text, anchor text)
           | https://en.wikipedia.org/wiki/Okapi_BM25
           | 
           | Some papers are linked in the references
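            | 
            | The core idea: instead of scoring each field separately,
            | BM25F folds per-field term frequencies into one weighted
            | pseudo-frequency before the usual BM25 saturation is
            | applied. A rough sketch of that step (variable names are
            | mine, not from the paper):
            | 
            |   def bm25f_tf(tfs, weights, lens, avg_lens, b=0.75):
            |       # tfs: term frequency per field,
            |       # e.g. {"title": 2, "body": 5}
            |       tf = 0.0
            |       for field, freq in tfs.items():
            |           # each field normalized by its own average length
            |           norm = 1 - b + b * (lens[field] / avg_lens[field])
            |           tf += weights[field] * freq / norm
            |       return tf  # plugs into the standard saturation term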
        
             | marcyb5st wrote:
             | Thanks, really appreciate it!
        
         | CSSer wrote:
         | Coincidentally, US NIST TREC is happening right now! It started
         | on the 18th and will conclude on the 22nd.
         | 
         | Link for more details: https://trec.nist.gov/
        
       | sidcool wrote:
       | Good article. I am genuinely interested to learn about how to
       | think of problems in such a mathematical form. And how to test
       | it. Any resources?
        
       | RA_Fisher wrote:
       | BM25 is an ancient algo developed in the 1970s. It's basically a
       | crappy statistical model and statisticians can do far better
       | today. Search is strictly dominated by learning (that yes, can
       | use search as an input). Not many folks realize that yet, and /
       | or are incentivized to keep the old tech going as long as
       | possible, but market pressures will change that.
        
         | netdur wrote:
         | While BM25 did emerge from earlier work in the 1970s and 1980s
         | (specifically building on the probabilistic ranking principle),
         | I'm curious about your perspective on a few things:
         | 
         | What specific modern statistical approaches are you seeing as
         | superior replacements for BM25 in practical applications? I'm
         | particularly interested in how they handle edge cases like rare
         | terms and document length normalization that BM25 was
         | explicitly designed to address.
         | 
         | While I agree learning-based approaches have shown impressive
         | results, could you elaborate on what you mean by search being
         | "strictly dominated" by learning methods? Are you referring to
         | specific benchmarks or real-world applications?
        
           | RA_Fisher wrote:
           | BM25 can be used as a starting point for a statistical
           | learning model and more readily built on. A key advantage is
            | that one gains a systematic way to reduce edge cases,
            | instead of handling only the couple that are so large as to
            | be noticeable.
        
         | simplecto wrote:
         | Those are some really spicy opinions. It would seem that many
         | search experts might not agree.
         | 
          | David Tippet (formerly OpenSearch and now at GitHub)
         | 
         | A great podcast with David Tippet and Nicolay Gerold entitled:
         | 
         | "BM25 is the workhorse of search; vectors are its visionary
         | cousin"
         | 
         | https://www.youtube.com/watch?v=ENFW1uHsrLM
        
           | dumb1224 wrote:
            | Agreed. In the 2000s it was all about BM25 in the NLP
            | community. I hardly saw any paper that did not mention it,
            | in my opinion.
        
             | RA_Fisher wrote:
             | For sure, it's very popular, just not the best anymore (and
             | actually far from it).
        
             | authorfly wrote:
             | And dependency chaining. But yes, lots of BM25.
             | 
             | The 2000s and even 2010s was a wonderful and fairly
             | theoretical time for linguistics and NLP. A time when NLP
             | seemed to harbor real anonymized general information to
             | make the right decisions with, without impinging on
             | privacy.
             | 
             | Oh to go back.
        
           | RA_Fisher wrote:
           | I'm sure Search experts would disagree, because it'd be their
           | technology they'd be admitting is inferior to another. BM25
           | is the workhorse, no doubt-- but it's also not the best
           | anymore. Vectors are a step toward learning models, but only
           | a small mid-range step vs. an explicit model.
           | 
           | Search is a useful approach for computing learning models,
           | but there's a difference between the computational means and
           | the model. For example, MIPS is a very useful search algo for
           | computing learning models (but first the learning model has
           | to be formulated).
        
             | simplecto wrote:
              | It seems that the current mode (eg fashion) is a hybrid
              | approach, with vector results on one side, BM25 on the
              | other, and then a re-rank algo to smooth things out.
             | 
             | I'm out of my depth here but genuinely interested and
             | curious to see over the horizon.
        
               | authorfly wrote:
               | Out of interest how come you use the word "mode" here?
        
               | simplecto wrote:
               | because the space moves fast, and from my learning this
               | is the current thing. Like fashion -- it changes from
               | season to season
        
               | RA_Fisher wrote:
                | The best approach is to use one statistical model and
                | encode the underlying aspects of the context that relate
                | to goal outcomes.
        
             | softwaredoug wrote:
             | I don't know a lot of search practitioners who don't want
             | to use the "new sexy" thing. Most of us do a fair amount of
             | "resume driven development" so can claim to be "AI
             | Engineers" :)
        
               | RA_Fisher wrote:
               | I don't think it's realistic to think that software
               | engineers can pick up advanced statistical modeling on
               | the job, unless they're pairing with statisticians.
               | There's just too much background involved.
        
               | binarymax wrote:
               | Your overall condescending attitude in this thread is
               | really disgusting.
        
               | RA_Fisher wrote:
               | Statisticians are famously disliked, especially by
               | engineers (there are open-minded folks, of course! maybe
               | they'd taken some econometrics or statistics, are
               | exceptionally humble, etc). There are some interesting
               | motives and incentives around that. Sometimes I think in
               | part it's because many people would prefer their existing
               | beliefs be upheld as opposed to challenged, even if
               | they're not well-supported (and likely to lead to bad
               | decisions and outcomes). Sticking with outdated
               | technology is one example.
        
               | softwaredoug wrote:
               | The "search practitioners" I'm referring to are pretty
               | uniformly ML Engineers . They also work on feeds,
               | recommendations, and adjacent Information Retrieval
               | spaces. Both to generate L0 retrieval candidates and to
               | do higher layers of reranking with learning to rank and
               | other systems to whatever the system's goal is...
               | 
                | You can decide whether you agree that most people in
                | that group are sufficiently statistically literate. But
                | some humility around statistics is probably far up there
                | in what I personally interview for.
        
               | RA_Fisher wrote:
               | For sure. There are ML folks with statistical learning
               | backgrounds, but it tends to be relatively rare. Physics
               | and CS are more common. They tend to view things like you
               | mention, more procedural eg- learning to rank, minimizing
               | distances, less statistical modeling. Humility around
               | statistics is good, but statistical knowledge is still
               | what's required to really level up these systems (I've
               | built them as well).
        
             | dtaivpp wrote:
             | I have been summoned. Hey it's David from the podcast. As
             | someone who builds search for users every day and shaped
             | the user experience for vector search at OpenSearch I
             | assure you no one is afraid of their technology becoming
             | inferior.
             | 
              | There are two components of search that are really
              | important for understanding why BM25 will likely not go
              | away for a long time. The first is precision and the
              | second is
             | recall. Precision is the measure of how many relevant
             | results were returned in light of all the results returned.
             | A completely precise search would return only the relevant
             | results and no irrelevant results.
             | 
             | Recall on the other hand measures how many of all the
             | relevant results were returned. For example, if our search
             | only returns 5 results but we know that there were 10
             | relevant search results that should have been returned we
             | would say the recall is 50%.
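              | 
              | In code, both measures are just set arithmetic (a toy
              | sketch, nothing OpenSearch-specific):
              | 
              |   def precision_recall(returned, relevant):
              |       returned, relevant = set(returned), set(relevant)
              |       hits = returned & relevant
              |       p = len(hits) / len(returned) if returned else 0.0
              |       r = len(hits) / len(relevant) if relevant else 0.0
              |       return p, r
              | 
              |   # the example above: 5 relevant results returned out of
              |   # 10 relevant overall
              |   precision_recall({1, 2, 3, 4, 5}, set(range(1, 11)))
              |   # -> (1.0, 0.5)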
             | 
             | These two components are always at odds with each other.
             | Vector search excels at increasing recall. It is able to
             | find documents that are semantically similar. The problem
             | with this is semantically similar documents might not
             | actually be what the user is looking for. This is because
             | vectors are only a representation of user intent.
             | 
              | Here's an example: A user looks up "AWS Config". Vector
              | search would read this and may rate it as similar to
              | ["amazon web services configuration", "cloud
              | configuration", "infrastructure as a service setup"]. In
              | this case the user was looking for a file called
              | "AWS.config". Vector search is inherently imprecise. It is
              | getting better, but it's not replacing BM25 as a scoring
              | mechanism any time soon.
             | 
             | You don't have to believe me though. Weaviate, Vespa,
             | Qdrant all support BM25 search for a reason. Here is an in
             | depth blog that dives more into hybrid search:
             | https://opensearch.org/blog/hybrid-search/
             | 
             | As an aside, vector search is also much more expensive than
             | BM25. It's very hard to scale and get precise results.
        
               | RA_Fisher wrote:
               | Hi David. Nice to meet you. Yes, precision and recall are
               | always in tension. However, both can be made
               | simultaneously better with a more informed model. Using
               | your example, this would be a model that encodes the
               | concept of files in the context of a user demand
               | surrounding AWS.
        
               | iosjunkie wrote:
               | "more informed model"
               | 
               | Can you be specific on what you recommend instead of
               | BM25?
        
         | mrbungie wrote:
         | Are those the same market pressures that made Google discard or
         | repurpose a lot of working old search tech for new shiny ML-
         | based search tech? The same tech that makes you add "+reddit"
         | in every search so you can evade the adversarial SEO war?
         | 
         | PS: Ancient != bad. I don't know what weird technologist take
         | worries about the age of an invention/discovery of a technique
         | rather than its usefulness.
        
           | RA_Fisher wrote:
           | Google's come a long way since PageRank + terms. Ancient
           | doesn't mean bad, but usually it means outdated and that's
            | the case here. Search algos are subsumed by learning
            | models; our species can do better now.
        
             | mbreese wrote:
             | So, I'm not entirely sure if I follow you here... How would
             | one use a language model to find a document out of a corpus
             | of existing documents? As opposed to finding an answer to a
             | question, trained on documents, which I can see. I mean
             | answering a query like "find the report containing X"?
             | 
             | I see search as encompassing at least two separate, but
             | related, domains: information gathering/seeking (answering
             | a question) and information retrieval (find the best
             | matching document). I'm curious how LLMs can help with the
             | later.
        
               | ordersofmag wrote:
               | That's the 'vector search' people are talking about in
               | this discussion. Use the LLM to generate an embedding
               | vector that represents the 'meaning' of your query. Do
               | the same for all the documents (or better with chunks of
               | all the documents). Find the document vector that's
               | closest to your query vector and you have a document that
               | has a 'meaning' similar to your query. Obviously that's
               | just a starting point. And lots of folks are doing hybrid
               | where they combine bm25 search with some sort of vector
               | search (e.g. run them in parallel and combine results, or
               | do a bm25 and then use vector search to rerank the top
               | results).
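                | 
                | The "closest vector" step is typically just cosine
                | similarity (a bare-bones sketch, assuming the
                | embeddings are already computed):
                | 
                |   import numpy as np
                | 
                |   def top_k(query_vec, doc_vecs, k=10):
                |       # normalize rows; the dot product then equals
                |       # cosine similarity
                |       q = query_vec / np.linalg.norm(query_vec)
                |       d = doc_vecs / np.linalg.norm(
                |           doc_vecs, axis=1, keepdims=True)
                |       return np.argsort(-(d @ q))[:k]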
        
         | softwaredoug wrote:
         | I think there are also incentives to "sell new things". That's
         | always been the case in search which has had a bazillion trends
         | and "AI related things" as long as I've worked in it. We have
         | massively VC funded vector search companies with armies of tech
         | evangelists pushing a specific point of view right now.
         | 
          | Meanwhile, the amount of manual curation and basic, boring
          | hand-curated taxonomy work that actually drives things like
          | "semantic search" at places like Google is simply staggering.
          | Just nobody talks about it much at conferences because it's
          | not very sexy.
        
       | hubraumhugo wrote:
       | Given the recent advances in vector-based semantic search, what's
       | the SOTA search stack that people are using for hybrid keyword +
       | semantic search these days?
        
         | emschwartz wrote:
         | Most of the commercial and open source offerings for hybrid
         | search seem to be using BM25 + vector similarity search based
         | on embeddings. The results are combined using Reciprocal Rank
         | Fusion (RRF).
         | 
         | The RRF paper is impressive in how incredibly simple it is (the
         | paper is only 2 pages):
         | https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
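          | 
          | The whole method is: each ranker contributes 1 / (k + rank)
          | per document, then you sort by the summed scores. A quick
          | sketch with the paper's k=60:
          | 
          |   def rrf(rankings, k=60):
          |       # rankings: lists of doc ids, best first
          |       scores = {}
          |       for ranking in rankings:
          |           for rank, doc in enumerate(ranking, start=1):
          |               scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank)
          |       return sorted(scores, key=scores.get, reverse=True)
          | 
          |   rrf([["a", "b", "c"], ["c", "a", "d"]])
          |   # -> ['a', 'c', 'b', 'd']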
        
           | softwaredoug wrote:
            | A warning that RRF is often not enough, as it can just drag
            | a good solution down towards the worse solution :)
           | 
           | https://softwaredoug.com/blog/2024/11/03/rrf-is-not-enough
        
             | emschwartz wrote:
             | Ah, that's great! Thanks for sharing that.
             | 
             | I had actually implemented full text search + vector search
             | using RRF but I kept it disabled by default because it
             | wasn't meaningfully improving my results. This seems like a
             | good hypothesis as to why.
        
         | d4rkp4ttern wrote:
         | In the Langroid[1] LLM library we have a clean, extensible RAG
         | implementation in the DocChatAgent[2] -- it uses several
         | retrieval techniques, including lexical (bm25, fuzzy search)
         | and semantic (embeddings), and re-ranking (using cross-encoder,
         | reciprocal-rank-fusion) and also re-ranking for diversity and
         | lost-in-the-middle mitigation:
         | 
         | [1] Langroid - a multi-agent LLM framework from CMU/UW-Madison
         | researchers https://github.com/langroid/langroid
         | 
         | [2] DocChatAgent Implementation -
         | https://github.com/langroid/langroid/blob/main/langroid/agen...
         | 
         | Start with the answer_from_docs method and follow the trail.
         | 
          | Incidentally, I see you're the founder of Kadoa -- Kadoa-snack
          | is one of my favorite daily tools for finding LLM-related HN
          | discussions!
        
         | noduerme wrote:
         | A generic search strategy is so different from something you
         | want to target. The task should probably determine the tool.
         | 
         | So I don't know the answer, but I was recently handed about 3
         | million surveys with 10 free-form writing fields each, and
         | tasked with surfacing the ones that might require action on the
         | part of the company. I chose to use a couple of different small
         | classifier models, manually strip out some common words based
         | on obvious noise in the first 10k results, and then weight the
         | model responses. It turned out to be almost flawless. I would
         | NOT call this sort of thing "programming", it's more just
         | tweaking the black-box output of various different tools until
         | you have a set of results that _looks good_ for your test
         | cases. (And your client ;)
         | 
         | All stitching together small Hugging Face models running on a
         | tiny server in nodejs, btw.
        
           | keeeba wrote:
           | Nice, also find small classifiers work best for things like
            | this. Out of interest, how many, if any, of the 3 million
            | were labelled?
           | 
           | Did you end up labelling any/more, or distilling from a
           | generative model?
        
         | treprinum wrote:
         | text-embedding-3-large + SPLADE + RRF
        
         | khaki54 wrote:
         | We're doing something like BM25 with a semantic ontology
         | enhanced query (naive example: search for truck hits on Ford
         | F-150, even if truck never appears in the doc) then vector
         | based reranking. In testing, we always get the best result in
         | the top 3.
        
         | dmezzetti wrote:
         | Excellent article on BM25!
         | 
          | Author of txtai [1] here. txtai implements a performant BM25
          | index in Python [2] via the arrays package, storing the term
          | frequency vectors in SQLite.
         | 
         | With txtai, the hybrid index approach [3] supports both convex
         | combination when BM25 scores are normalized and reciprocal rank
         | fusion (RRF) when they aren't [4].
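          | 
          | Convex combination itself is a one-liner (a simplified
          | sketch, not the actual txtai internals):
          | 
          |   def convex_combine(bm25, vectors, alpha=0.5):
          |       # both score dicts assumed normalized to [0, 1]
          |       docs = set(bm25) | set(vectors)
          |       return {d: alpha * bm25.get(d, 0.0)
          |                  + (1 - alpha) * vectors.get(d, 0.0)
          |               for d in docs}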
         | 
         | [1] https://github.com/neuml/txtai
         | 
         | [2] https://neuml.hashnode.dev/building-an-efficient-sparse-
         | keyw...
         | 
         | [3] https://neuml.hashnode.dev/benefits-of-hybrid-search
         | 
         | [4]
         | https://github.com/neuml/txtai/blob/master/src/python/txtai/...
        
         | softwaredoug wrote:
         | My opinion is people need to not focus on one stack. But be
         | prepared to use tools best for each job. Elasticsearch for BM25
         | type things. Turbopuffer for simple and fast vector retrieval.
          | Even Redis to precompute results for certain queries. Or
         | certain extremely dynamic attributes that change frequently
         | like price. Combine all these in a scatter/gather approach.
         | 
         | I say that because almost always you have a layer outside the
         | search stack(s) that ideally can just be a straightforward
         | inference service for reranking that looks most like other ML
         | infra.
         | 
         | You also almost always route queries to different backends
         | based on an understanding of the users query. Routing "lookup
         | by ID" to a different system than "fuzzy semantic search".
         | These are very different data structures. And search almost
         | always covers very broad/different use cases.
         | 
         | I think it's an anti pattern to just push all work to one
         | system. Each system is ideal for different workloads. And their
         | inference capabilities won't ever keep pace with the general ML
         | tooling that your ML engineers are used to. (I tried with
          | Elasticsearch Learning to Rank and it's a hopeless task.)
         | 
         | (That said, Vespa is probably the best 'single stack' that
         | tries to solve a broad range of use-cases.)
        
       | DavidPP wrote:
        | We use https://typesense.org/ for regular search, but it now has
        | support for doing hybrid search. Curious if anyone has tried it
        | yet?
        
         | kkielhofner wrote:
         | I've used it for hybrid search and it works quite well.
         | 
         | Overall I'm really happy to see Typesense mentioned here.
         | 
         | A lot of the smaller scale RAG projects, etc you see around
         | would be well served by Typesense but it seems to be relatively
         | unknown for whatever reasons. It's probably one of the easiest
         | solutions to deploy, has reasonable defaults, good docs, easy
          | clustering, etc while still being very capable, performant,
          | and powerful if you need to dig in further.
        
       | tselvaraj wrote:
        | Hybrid search addresses the long-standing challenge of search
        | relevance. We can use ranking fusion between keyword and vector
        | results to create a hybrid search that works in most scenarios.
        
       | MPSimmons wrote:
        | Does anyone know if the average document length mentioned in the
        | document length normalization is the median? It seems like it
        | would need to be to properly deweight excessively long
        | documents; otherwise the excessively long documents would
        | unfairly skew the average, right?
        
         | softwaredoug wrote:
         | It's the mean. At least in Lucene. Using median would be an
         | interesting experiment.
         | 
          | Do you know of a search dataset with very large document
          | length differences? MS MARCO, for example, is pretty
          | consistent in length.
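          | 
          | For reference, the mean enters through the length
          | normalization term. A sketch of the per-term score (using
          | Lucene's smoothed IDF):
          | 
          |   import math
          | 
          |   def bm25_term(tf, df, n_docs, dl, avgdl, k1=1.2, b=0.75):
          |       idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
          |       # dl / avgdl: docs longer than the *mean* length
          |       # get deweighted
          |       norm = 1 - b + b * (dl / avgdl)
          |       return idf * tf * (k1 + 1) / (tf + k1 * norm)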
        
       ___________________________________________________________________
       (page generated 2024-11-20 23:01 UTC)