Coding With Fun

What is the formula lucene uses for bm25?


Asked by Hadleigh Love on Dec 07, 2021 FAQ



The actual formula Lucene's BM25 uses for the inverse document frequency (IDF) part is:

$$\text{IDF}(q_i) = \ln\left(1 + \frac{\text{docCount} - f(q_i) + 0.5}{f(q_i) + 0.5}\right)$$

where docCount is the total number of documents that have a value for the field in the shard (or across shards, if you're using search_type=dfs_query_then_fetch) and f(q_i) is the number of documents that contain the i-th query term.
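To make the IDF part concrete, here is a minimal Python sketch. It assumes the log form ln(1 + (docCount - f(q_i) + 0.5) / (f(q_i) + 0.5)); the function name `lucene_bm25_idf` and its parameter names are illustrative only and are not part of Lucene's API.

```python
import math


def lucene_bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF sketch: ln(1 + (docCount - f(qi) + 0.5) / (f(qi) + 0.5)).

    doc_count: documents that have a value for the field (docCount).
    doc_freq:  documents that contain the query term (f(qi)).
    Both names are hypothetical, chosen for this example.
    """
    if doc_freq < 0 or doc_freq > doc_count:
        raise ValueError("doc_freq must be between 0 and doc_count")
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# A rare term gets a larger IDF than a common one in the same index.
rare = lucene_bm25_idf(doc_count=1000, doc_freq=3)
common = lucene_bm25_idf(doc_count=1000, doc_freq=800)
print(rare > common)
```

Because of the "1 +" inside the logarithm, the value stays positive even when a term appears in every document.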
BM25 stands for “Best Match 25”. Released in 1994, it’s the 25th iteration of tweaking the relevance computation. BM25 has its roots in probabilistic information retrieval. Probabilistic information retrieval is a fascinating field unto itself.
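The BM25 (Best Match 25) function scores each document in a corpus according to the document's relevance to a particular text query. For a query Q with terms q_1, …, q_n, the BM25 score for document D is:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where f(q_i, D) is the number of times q_i occurs in D, |D| is the length of D in words, avgdl is the average document length in the corpus, and k_1 and b are free parameters.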
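The per-document score can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Lucene's implementation: it uses the textbook BM25 form with the ln(1 + ...) IDF, treats each document as a plain list of tokens, and defaults to k1 = 1.2 and b = 0.75. The function name `bm25_score` and the toy corpus are made up for the example.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized document for a query.

    corpus is a non-empty list of tokenized documents (lists of strings).
    All names here are hypothetical, chosen for this sketch.
    """
    doc_count = len(corpus)
    avgdl = sum(len(d) for d in corpus) / doc_count  # average document length
    term_freqs = Counter(doc_tokens)
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)

    score = 0.0
    for term in query_terms:
        freq = term_freqs[term]  # f(qi, D)
        if freq == 0:
            continue  # a term absent from D contributes nothing
        doc_freq = sum(1 for d in corpus if term in d)  # docs containing qi
        idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
        score += idf * freq * (k1 + 1) / (freq + norm)
    return score


# Toy corpus (assumed data, for illustration only).
corpus = [
    ["quick", "brown", "fox"],
    ["lazy", "dog"],
    ["quick", "dog", "jumps"],
]
for doc in corpus:
    print(doc, bm25_score(["quick", "fox"], doc, corpus))
```

The first document matches both query terms, so it ranks highest; the second matches neither and scores zero.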
BM25 and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval.
BM25 and TF*IDF sit at the core of the ranking function. They comprise what Lucene calls the “field weight”, which measures how much the matched text is about a search term.