Sphinx
madcat1991, 2013-06-13 18:49:18

Ranking problem and bm25

The problem is as follows. The project uses Sphinx, and searches are performed with a trailing asterisk. To rank documents that match the query exactly higher, a pattern like this is used:

@title дом* | @title дом


sph04 is used for ranking. We noticed that for the query "home", the phrase "Home for the rich" gets a higher weight than the single word "Home". The ranking is skewed by the bm25 metric, which is part of sph04. According to "How Sphinx relevance ranking works", the following holds for bm25:

"...for performance reasons we account for all the keywords occurrences in the document, and not just the matched ones. For instance, (@title “hello world”) query that only matches a single instance of “hello world” phrase in the title will result in the same BM25 with a (hello world) query that matches all the instances of both keywords everywhere in the document."

Below are two queries ranked by bm25 alone. Both look for the word "home" (дом), but one searches title_star and the other description_star. Even so, the weights are identical:


mysql> SELECT id, weight() FROM catalogue
    -> WHERE MATCH('@(title_star) дом') AND subsite_ids IN (110) AND paid_type_index IN (0) AND id IN (859490, 842300)
    -> LIMIT 0, 20
    -> OPTION index_weights=(catalogue=1), max_matches=10000, ranker=expr('bm25');
+--------+----------+
| id     | weight() |
+--------+----------+
| 842300 |      700 |
| 859490 |      669 |
+--------+----------+
2 rows in set (0.00 sec)


mysql> SELECT id, weight() FROM catalogue
    -> WHERE MATCH('@(description_star) дом') AND subsite_ids IN (110) AND paid_type_index IN (0) AND id IN (859490, 842300)
    -> LIMIT 0, 20
    -> OPTION index_weights=(catalogue=1), max_matches=10000, ranker=expr('bm25');
+--------+----------+
| id     | weight() |
+--------+----------+
| 842300 |      700 |
| 859490 |      669 |
+--------+----------+
2 rows in set (0.01 sec)


Counting the occurrences of the word "home" in the full content of each document gives:
  • 842300 (Home for the rich) == 8
  • 859490 (Home) == 3


That is, the search seems to run over all fields available for full-text search within the same index. Does this mean that words present in the description also contribute to the weight of results for a title search?
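
To make the suspicion concrete, here is a minimal Python sketch of how I understand the bm25 factor from the article quoted above. It is not Sphinx code: the constant `K1`, the IDF value, the helper names, and the per-field split of occurrences are all assumptions for illustration; only the totals (8 and 3) come from the counts above. The idea is that the searched field decides whether a document matches, while the term frequency is taken over the whole document.

```python
K1 = 1.2          # assumed saturation constant
KEYWORD = "дом"   # the query keyword ("home")
IDF = {KEYWORD: 0.46}  # hypothetical IDF value, not taken from the real index

# Hypothetical per-field occurrence counts; only the totals (8 and 3) are from the post.
DOCS = {
    842300: {"title_star": 1, "description_star": 7},
    859490: {"title_star": 1, "description_star": 2},
}


def simplified_bm25(tf_by_keyword, idf_by_keyword, num_query_keywords):
    """Simplified BM25 without a document-length term, scaled to 0..999."""
    total = 0.0
    for keyword, tf in tf_by_keyword.items():
        total += tf * idf_by_keyword[keyword] / (tf + K1)
    normalized = 0.5 + total / (2 * num_query_keywords)
    return int(round(999 * normalized))


def score(doc_id, searched_field):
    """Return the bm25 weight, or None if the searched field has no match."""
    fields = DOCS[doc_id]
    if fields.get(searched_field, 0) == 0:
        return None  # the field limit only affects matching
    # Term frequency is summed over ALL fields, not just the searched one.
    tf_whole_document = sum(fields.values())
    return simplified_bm25({KEYWORD: tf_whole_document}, IDF, 1)
```

Under these assumptions `score(842300, "title_star")` equals `score(842300, "description_star")`, and document 842300 outscores 859490 in both cases, which is the same pattern as in the two queries above.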

I also noticed that, although the weights are the same, the number of results differs between the two queries. Does this mean that when we search by title, every result must contain "home" in the title?

P.S. The question was also asked on the sphinxsearch forum, but there has been no reply so far.


1 answer(s)
Alexander N++, 2014-07-22
@sanchezzzhak

Hello, did you find out why?
