[HN Gopher] Understanding the BM25 full text search algorithm
___________________________________________________________________
Understanding the BM25 full text search algorithm
Author : rrampage
Score : 268 points
Date : 2024-11-20 03:43 UTC (19 hours ago)
(HTM) web link (emschwartz.me)
(TXT) w3m dump (emschwartz.me)
| jankovicsandras wrote:
| Shameless plug:
|
| https://github.com/jankovicsandras/plpgsql_bm25
|
| https://github.com/jankovicsandras/bm25opt
| mark_l_watson wrote:
| Thanks, yesterday I was thinking of adding BM25 to a little
| side project, so a well-timed plug!
|
| Do you know of any pure Python wrapper projects for managing
| large numbers of text and PDF documents? I thought of using
| Solr or Elasticsearch, but that seems too heavyweight for what
| I am doing. I am considering using SQLite with pysqlite3 and
| PyPDF2 since SQLite uses BM25. Sorry to be off topic, but I
| imagine many people are looking at tools for building hybrid
| BM25 / vector store / LLM applications.
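|
| For reference, here is roughly what the SQLite route looks like
| (a minimal sketch, assuming an FTS5-enabled SQLite build; the
| schema and data are illustrative):
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     # FTS5's default "rank" is BM25, stored negated so that
|     # ascending order puts the best match first
|     con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
|     con.executemany(
|         "INSERT INTO docs VALUES (?, ?)",
|         [("BM25 notes", "bm25 weighs term frequency against idf"),
|          ("Unrelated", "nothing about search in here")],
|     )
|     rows = con.execute(
|         "SELECT title, rank FROM docs "
|         "WHERE docs MATCH ? ORDER BY rank",
|         ("bm25",),
|     )
|     print(rows.fetchall())  # [('BM25 notes', -...)]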
| softwaredoug wrote:
| If we're shamelessly plugging passion projects, SearchArray is a
| pandas extension for fulltext (BM25) search, for dorking around
| with things in Google Colab.
|
| https://github.com/softwaredoug/searcharray
|
| I'll also plug Xing Han Lu's BM25S which is very popular with
| similar goals:
|
| https://github.com/xhluca/bm25s
| jll29 wrote:
| Nice write-up.
|
| A few more details/background that are harder to find: "BM25"
| stands for "Best Matching 25": "best matching" because it is a
| formula for ranking and term weighting (the matching refers to
| the terms in the query versus the document), and the number 25
| is simply a running number (there were 24 earlier formula
| variants and some later ones, but #25 turned out to work best,
| so it was the one that was published).
|
| It was conceived by Stephen Robertson and Karen Sparck Jones (the
| latter of IDF fame) and first implemented in the former's OKAPI
| information retrieval (research) system. The OKAPI system was
| benchmarked at the annual US NIST TREC (Text Retrieval
| Conference) for a number of years, the international "World
| Champtionship" of search engine methods (although the event is
| not about winning, but about compariing notes and learning from
| each other, a highly recommended annual event held every November
| in Gaithersburg, Maryland, attended by global academic and
| industry teams that conduct research on improving search - see
| trec.nist.gov).
|
| Besides the "bag of words" Vector Space Model (sparse vectors of
| terms), the Probabilistic Modles (that BM25 belongs to), there
| are suprising and still growing number of other theoretical
| frameworks how to rank a set of documents, given a query
| ("Divergence from Randomness", "Statistical Language Modeling,
| "Learning to Rank", "Quantum Information Retrieval", "Neural
| Ranking" etc.). Conferences like ICTIR and SIGIR still publish
| occasionaly entirely new paradigms for search. Note that the
| "Statistical Language Modeling" paradigm is not about Large
| Language Models that are on vogue now (that's covered under the
| "Neural Retrieval" umbrella), and that "Quantum IR" is not going
| to get you to a tutorial about Quantum Information Retrieval but
| to methods of infrared spectroscopy or a company with the same
| name that produces cement; such are the intricacies of search
| technology, even in the 21st century.
|
| If you want to play with BM25 and compare it with some of the
| alternatives, I recommend the research platform Terrier, an
| open-source search engine developed at the University of Glasgow
| (today, perhaps the epicenter of search research).
|
| BM25 is over a quarter century old, but has proven to be a hard
| baseline to beat (it is still often used as a reference point for
| comparing new methods against), and a more recent variant, BM24F,
| can deal with multiple fields and hypertext (e.g. title, body of
| documents, hyperlinks).
|
| The recommended paper to read is: Sparck Jones, K.; Walker, S.;
| Robertson, S. E. (2000). "A probabilistic model of information
| retrieval: Development and comparative experiments: Part 1".
| Information Processing & Management 36(6): 779-808, and its
| successor, Part 2. (Sadly they are not open access.)
| marcyb5st wrote:
| Thanks for sharing!
|
| Do you have more information about BM24F? Googling (also Google
| Scholar) didn't yield anything related. Thanks in advance!
| bradleyjkemp wrote:
| A typo I think, should be BM25F. From Wikipedia:
|
| > BM25F (or the BM25 model with Extension to Multiple
| Weighted Fields) is a modification of BM25 in which the
| document is considered to be composed from several fields
| (such as headlines, main text, anchor text)
| https://en.wikipedia.org/wiki/Okapi_BM25
|
| Some papers are linked in the references
| marcyb5st wrote:
| Thanks, really appreciate it!
| CSSer wrote:
| Coincidentally, US NIST TREC is happening right now! It started
| on the 18th and will conclude on the 22nd.
|
| Link for more details: https://trec.nist.gov/
| sidcool wrote:
| Good article. I am genuinely interested in learning how to
| think about problems in such a mathematical form. And how to test
| it. Any resources?
| RA_Fisher wrote:
| BM25 is an ancient algo with roots in the 1970s. It's basically a
| crappy statistical model and statisticians can do far better
| today. Search is strictly dominated by learning (that yes, can
| use search as an input). Not many folks realize that yet, and /
| or are incentivized to keep the old tech going as long as
| possible, but market pressures will change that.
| netdur wrote:
| While BM25 did emerge from earlier work in the 1970s and 1980s
| (specifically building on the probabilistic ranking principle),
| I'm curious about your perspective on a few things:
|
| What specific modern statistical approaches are you seeing as
| superior replacements for BM25 in practical applications? I'm
| particularly interested in how they handle edge cases like rare
| terms and document length normalization that BM25 was
| explicitly designed to address.
|
| While I agree learning-based approaches have shown impressive
| results, could you elaborate on what you mean by search being
| "strictly dominated" by learning methods? Are you referring to
| specific benchmarks or real-world applications?
| RA_Fisher wrote:
| BM25 can be used as a starting point for a statistical
| learning model and more readily built on. A key advantage is
| that one gains a systematic way to reduce edge cases, instead
| of handling just the couple that are large enough to be
| noticeable.
| simplecto wrote:
| Those are some really spicy opinions. It would seem that many
| search experts might not agree.
|
| David Tippet (formerly OpenSearch and now at GitHub)
|
| A great podcast with David Tippet and Nicolay Gerold entitled:
|
| "BM25 is the workhorse of search; vectors are its visionary
| cousin"
|
| https://www.youtube.com/watch?v=ENFW1uHsrLM
| dumb1224 wrote:
| Agreed. In the 2000s it was all about BM25 in the NLP
| community. I can hardly think of a paper that did not mention
| it.
| RA_Fisher wrote:
| For sure, it's very popular, just not the best anymore (and
| actually far from it).
| authorfly wrote:
| And dependency chaining. But yes, lots of BM25.
|
| The 2000s and even 2010s was a wonderful and fairly
| theoretical time for linguistics and NLP. A time when NLP
| seemed to harbor real anonymized general information to
| make the right decisions with, without impinging on
| privacy.
|
| Oh to go back.
| RA_Fisher wrote:
| I'm sure Search experts would disagree, because it'd be their
| technology they'd be admitting is inferior to another. BM25
| is the workhorse, no doubt-- but it's also not the best
| anymore. Vectors are a step toward learning models, but only
| a small mid-range step vs. an explicit model.
|
| Search is a useful approach for computing learning models,
| but there's a difference between the computational means and
| the model. For example, MIPS is a very useful search algo for
| computing learning models (but first the learning model has
| to be formulated).
| simplecto wrote:
| It seems that the current mode (eg fashion) is a hybrid
| approach, with vector results on one side, BM25 on the
| other, and then a re-rank algo to smooth things out.
|
| I'm out of my depth here but genuinely interested and
| curious to see over the horizon.
| authorfly wrote:
| Out of interest how come you use the word "mode" here?
| simplecto wrote:
| because the space moves fast, and from my learning this
| is the current thing. Like fashion -- it changes from
| season to season
| RA_Fisher wrote:
| Best is to use one statistical model and encode the
| underlying aspects of the context that relate to goal
| outcomes.
| softwaredoug wrote:
| I don't know a lot of search practitioners who don't want
| to use the "new sexy" thing. Most of us do a fair amount of
| "resume driven development" so can claim to be "AI
| Engineers" :)
| RA_Fisher wrote:
| I don't think it's realistic to think that software
| engineers can pick up advanced statistical modeling on
| the job, unless they're pairing with statisticians.
| There's just too much background involved.
| binarymax wrote:
| Your overall condescending attitude in this thread is
| really disgusting.
| RA_Fisher wrote:
| Statisticians are famously disliked, especially by
| engineers (there are open-minded folks, of course! maybe
| they'd taken some econometrics or statistics, are
| exceptionally humble, etc). There are some interesting
| motives and incentives around that. Sometimes I think in
| part it's because many people would prefer their existing
| beliefs be upheld as opposed to challenged, even if
| they're not well-supported (and likely to lead to bad
| decisions and outcomes). Sticking with outdated
| technology is one example.
| softwaredoug wrote:
| The "search practitioners" I'm referring to are pretty
| uniformly ML Engineers. They also work on feeds,
| recommendations, and adjacent Information Retrieval
| spaces. Both to generate L0 retrieval candidates and to
| do higher layers of reranking with learning to rank and
| other systems to whatever the system's goal is...
|
| You can decide if you agree that most people are
| sufficiently statistically literate in that group of
| people. But some humility around statistics is probably
| far up there in what I personally interview for.
| RA_Fisher wrote:
| For sure. There are ML folks with statistical learning
| backgrounds, but they tend to be relatively rare. Physics
| and CS are more common. They tend to view things the way you
| mention, more procedurally (e.g. learning to rank, minimizing
| distances), less as statistical modeling. Humility around
| statistics is good, but statistical knowledge is still
| what's required to really level up these systems (I've
| built them as well).
| dtaivpp wrote:
| I have been summoned. Hey it's David from the podcast. As
| someone who builds search for users every day and shaped
| the user experience for vector search at OpenSearch I
| assure you no one is afraid of their technology becoming
| inferior.
|
| There are two components of search that are really
| important for understanding why BM25 will likely not go away
| for a long time. The first is precision and the second is
| recall. Precision is the measure of how many relevant
| results were returned in light of all the results returned.
| A completely precise search would return only the relevant
| results and no irrelevant results.
|
| Recall on the other hand measures how many of all the
| relevant results were returned. For example, if our search
| only returns 5 results but we know that there were 10
| relevant search results that should have been returned we
| would say the recall is 50%.
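|
| In code, the two measures are just the following (a toy
| illustration, not how an engine computes them at scale):
|
|     def precision_recall(returned, relevant):
|         returned, relevant = set(returned), set(relevant)
|         hits = returned & relevant
|         # precision: fraction of returned results that are
|         # relevant; recall: fraction of relevant docs returned
|         return len(hits) / len(returned), len(hits) / len(relevant)
|
|     # 5 results returned, all relevant, out of 10 relevant docs
|     print(precision_recall(range(5), range(10)))  # (1.0, 0.5)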
|
| These two components are always at odds with each other.
| Vector search excels at increasing recall. It is able to
| find documents that are semantically similar. The problem
| with this is semantically similar documents might not
| actually be what the user is looking for. This is because
| vectors are only a representation of user intent.
|
| Here's an example: a user looks up "AWS Config". Vector
| search would read this and may rate it as similar to
| ["amazon web services configuration", "cloud
| configuration", "infrastructure as a service setup"]. In
| this case the user was looking for a file called
| "AWS.config". Vector search is inherently imprecise. It is
| getting better but it's not replacing BM25 as a scoring
| mechanism any time soon.
|
| You don't have to believe me though. Weaviate, Vespa,
| Qdrant all support BM25 search for a reason. Here is an in
| depth blog that dives more into hybrid search:
| https://opensearch.org/blog/hybrid-search/
|
| As an aside, vector search is also much more expensive than
| BM25. It's very hard to scale and get precise results.
| RA_Fisher wrote:
| Hi David. Nice to meet you. Yes, precision and recall are
| always in tension. However, both can be made
| simultaneously better with a more informed model. Using
| your example, this would be a model that encodes the
| concept of files in the context of a user demand
| surrounding AWS.
| iosjunkie wrote:
| "more informed model"
|
| Can you be specific on what you recommend instead of
| BM25?
| mrbungie wrote:
| Are those the same market pressures that made Google discard or
| repurpose a lot of working old search tech for new shiny ML-
| based search tech? The same tech that makes you add "+reddit"
| in every search so you can evade the adversarial SEO war?
|
| PS: Ancient != bad. I don't know what weird technologist take
| worries about the age of an invention/discovery of a technique
| rather than its usefulness.
| RA_Fisher wrote:
| Google's come a long way since PageRank + terms. Ancient
| doesn't mean bad, but usually it means outdated and that's
| the case here. Search algos are subsumed by learning models;
| our species can do better now.
| mbreese wrote:
| So, I'm not entirely sure if I follow you here... How would
| one use a language model to find a document out of a corpus
| of existing documents? As opposed to finding an answer to a
| question, trained on documents, which I can see. I mean
| answering a query like "find the report containing X"?
|
| I see search as encompassing at least two separate, but
| related, domains: information gathering/seeking (answering
| a question) and information retrieval (find the best
| matching document). I'm curious how LLMs can help with the
| latter.
| ordersofmag wrote:
| That's the 'vector search' people are talking about in
| this discussion. Use the LLM to generate an embedding
| vector that represents the 'meaning' of your query. Do
| the same for all the documents (or better with chunks of
| all the documents). Find the document vector that's
| closest to your query vector and you have a document that
| has a 'meaning' similar to your query. Obviously that's
| just a starting point. And lots of folks are doing hybrid
| where they combine bm25 search with some sort of vector
| search (e.g. run them in parallel and combine results, or
| do a bm25 and then use vector search to rerank the top
| results).
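|
| Mechanically it's just nearest-neighbor search over those
| vectors. A minimal sketch (`embed` stands in for whatever
| embedding model you use; it's assumed here, not a real API):
|
|     import numpy as np
|
|     def cosine(a, b):
|         return float(np.dot(a, b) /
|                      (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     def nearest(query_vec, doc_vecs, k=3):
|         # rank documents by cosine similarity to the query
|         scored = [(cosine(query_vec, d), i)
|                   for i, d in enumerate(doc_vecs)]
|         return sorted(scored, reverse=True)[:k]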
| softwaredoug wrote:
| I think there are also incentives to "sell new things". That's
| always been the case in search which has had a bazillion trends
| and "AI related things" as long as I've worked in it. We have
| massively VC funded vector search companies with armies of tech
| evangelists pushing a specific point of view right now.
|
| Meanwhile, the amount of manual curation and basic, boring hand-
| curated taxonomy work that actually drives things like "semantic
| search" at places like Google is simply staggering. Just
| nobody talks about them much at conferences because they're not
| very sexy.
| hubraumhugo wrote:
| Given the recent advances in vector-based semantic search, what's
| the SOTA search stack that people are using for hybrid keyword +
| semantic search these days?
| emschwartz wrote:
| Most of the commercial and open source offerings for hybrid
| search seem to be using BM25 + vector similarity search based
| on embeddings. The results are combined using Reciprocal Rank
| Fusion (RRF).
|
| The RRF paper is impressive in how incredibly simple it is (the
| paper is only 2 pages):
| https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
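|
| The whole method fits in a few lines. A sketch (k=60 is the
| constant suggested in the paper):
|
|     def rrf(rankings, k=60):
|         # rankings: one ranked list of doc ids per retriever
|         scores = {}
|         for ranking in rankings:
|             for rank, doc in enumerate(ranking, start=1):
|                 scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
|         return sorted(scores, key=scores.get, reverse=True)
|
|     bm25_top = ["a", "b", "c"]
|     vector_top = ["b", "d", "a"]
|     print(rrf([bm25_top, vector_top]))  # ['b', 'a', 'd', 'c']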
| softwaredoug wrote:
| A warning that RRF is often not enough, as it can just drag a
| good solution down towards the worse solution :)
|
| https://softwaredoug.com/blog/2024/11/03/rrf-is-not-enough
| emschwartz wrote:
| Ah, that's great! Thanks for sharing that.
|
| I had actually implemented full text search + vector search
| using RRF but I kept it disabled by default because it
| wasn't meaningfully improving my results. This seems like a
| good hypothesis as to why.
| d4rkp4ttern wrote:
| In the Langroid[1] LLM library we have a clean, extensible RAG
| implementation in the DocChatAgent[2] -- it uses several
| retrieval techniques, including lexical (bm25, fuzzy search)
| and semantic (embeddings), and re-ranking (using cross-encoder,
| reciprocal-rank-fusion) and also re-ranking for diversity and
| lost-in-the-middle mitigation:
|
| [1] Langroid - a multi-agent LLM framework from CMU/UW-Madison
| researchers https://github.com/langroid/langroid
|
| [2] DocChatAgent Implementation -
| https://github.com/langroid/langroid/blob/main/langroid/agen...
|
| Start with the answer_from_docs method and follow the trail.
|
| Incidentally I see you're the founder of Kadoa -- Kadoa-snack
| is one of my favorite daily tools for finding LLM-related HN
| discussions!
| noduerme wrote:
| A generic search strategy is so different from something you
| want to target. The task should probably determine the tool.
|
| So I don't know the answer, but I was recently handed about 3
| million surveys with 10 free-form writing fields each, and
| tasked with surfacing the ones that might require action on the
| part of the company. I chose to use a couple of different small
| classifier models, manually strip out some common words based
| on obvious noise in the first 10k results, and then weight the
| model responses. It turned out to be almost flawless. I would
| NOT call this sort of thing "programming"; it's more just
| tweaking the black-box output of various different tools until
| you have a set of results that _looks good_ for your test
| cases. (And your client ;)
|
| All stitching together small Hugging Face models running on a
| tiny server in nodejs, btw.
| keeeba wrote:
| Nice, also find small classifiers work best for things like
| this. Out of interest, how many, if any, of the 3 million were
| labelled?
|
| Did you end up labelling any/more, or distilling from a
| generative model?
| treprinum wrote:
| text-embedding-3-large + SPLADE + RRF
| khaki54 wrote:
| We're doing something like BM25 with a semantic-ontology-
| enhanced query (naive example: a search for "truck" hits on
| "Ford F-150", even if "truck" never appears in the doc), then
| vector-based reranking. In testing, we always get the best
| result in the top 3.
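|
| The expansion step itself is conceptually simple (a toy sketch;
| the ontology here is invented for illustration):
|
|     ONTOLOGY = {"truck": ["truck", "ford f-150", "pickup"]}
|
|     def expand_query(query):
|         terms = []
|         for t in query.lower().split():
|             terms.extend(ONTOLOGY.get(t, [t]))
|         # hand the expanded query to the BM25 engine
|         return " OR ".join(terms)
|
|     print(expand_query("truck"))  # truck OR ford f-150 OR pickup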
| dmezzetti wrote:
| Excellent article on BM25!
|
| Author of txtai [1] here. txtai implements a performant BM25
| index in Python [2] via the arrays package and storing the term
| frequency vectors in SQLite.
|
| With txtai, the hybrid index approach [3] supports both convex
| combination when BM25 scores are normalized and reciprocal rank
| fusion (RRF) when they aren't [4].
|
| [1] https://github.com/neuml/txtai
|
| [2] https://neuml.hashnode.dev/building-an-efficient-sparse-
| keyw...
|
| [3] https://neuml.hashnode.dev/benefits-of-hybrid-search
|
| [4]
| https://github.com/neuml/txtai/blob/master/src/python/txtai/...
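|
| For illustration, convex combination is just a weighted blend of
| the two normalized score sets (a sketch of the general idea, not
| txtai's actual code):
|
|     def convex_combination(bm25, vector, alpha=0.5):
|         # both dicts map doc id -> score normalized to [0, 1]
|         docs = set(bm25) | set(vector)
|         return {d: alpha * bm25.get(d, 0.0)
|                    + (1 - alpha) * vector.get(d, 0.0)
|                 for d in docs}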
| softwaredoug wrote:
| My opinion is people need to not focus on one stack. But be
| prepared to use tools best for each job. Elasticsearch for BM25
| type things. Turbopuffer for simple and fast vector retrieval.
| Even Redis to precompute results for certain queries. Or
| certain extremely dynamic attributes that change frequently
| like price. Combine all these in a scatter/gather approach.
|
| I say that because almost always you have a layer outside the
| search stack(s) that ideally can just be a straightforward
| inference service for reranking that looks most like other ML
| infra.
|
| You also almost always route queries to different backends
| based on an understanding of the users query. Routing "lookup
| by ID" to a different system than "fuzzy semantic search".
| These are very different data structures. And search almost
| always covers very broad/different use cases.
|
| I think it's an anti pattern to just push all work to one
| system. Each system is ideal for different workloads. And their
| inference capabilities won't ever keep pace with the general ML
| tooling that your ML engineers are used to. (I tried with
| Elasticsearch Learning to Rank and it's a hopeless task.)
|
| (That said, Vespa is probably the best 'single stack' that
| tries to solve a broad range of use-cases.)
| DavidPP wrote:
| We use https://typesense.org/ for regular search, but it now has
| support for doing hybrid search; curious if anyone has tried it
| yet?
| kkielhofner wrote:
| I've used it for hybrid search and it works quite well.
|
| Overall I'm really happy to see Typesense mentioned here.
|
| A lot of the smaller-scale RAG projects, etc. you see around
| would be well served by Typesense but it seems to be relatively
| unknown for whatever reasons. It's probably one of the easiest
| solutions to deploy, has reasonable defaults, good docs, easy
| clustering, etc. while still being very capable, performant, and
| powerful if you need to dig in further.
| tselvaraj wrote:
| Hybrid search solves the long-standing challenge of relevance
| with search results. We can use ranking fusion between keyword
| and vector to create a hybrid search that works in most
| scenarios.
| MPSimmons wrote:
| Does anyone know if the average document length mentioned in the
| document length normalization is the median? It seems like it would
| need to be to properly deweight excessively long documents,
| otherwise the excessively long documents would unfairly weight
| the average, right?
| softwaredoug wrote:
| It's the mean. At least in Lucene. Using median would be an
| interesting experiment.
|
| Do you know of a search dataset with very large document length
| differences? MSMarco for example is pretty consistent in
| length.
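|
| For reference, this is where the average enters the standard
| per-term score (a sketch using Lucene-style IDF; swapping the
| mean for the median would only change how avgdl is computed):
|
|     import math
|
|     def bm25_term(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
|         # dl / avgdl is the length normalization in question
|         idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
|         norm = k1 * (1 - b + b * dl / avgdl)
|         return idf * (tf * (k1 + 1)) / (tf + norm)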
___________________________________________________________________
(page generated 2024-11-20 23:01 UTC)