Sphinx Ranking: How to Raise Exact Matches to the Top of SERPs?
Good afternoon,
This question has probably been asked 100500 times, but after running tests and searching Google I still haven't found a clear answer for myself.
Problem statement:
Given a search word, for example "test", the following word forms need to be found:
- test
- test *
- * test *
The results then need to be ranked:
- first by word form (i.e. documents with an exact match for test should rank higher than documents matching test*)
- if the word forms are the same, the document with more occurrences of that word form ranks higher (i.e. a "test test" document should rank higher than a "just test" document)
The position of the word in the document, its frequency, etc. are not important, i.e. the words "this" and "counter-lift"
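The requirements above can be written down as a small reference ranking, outside Sphinx, to make the expected order unambiguous. This is only a sketch of one reading of the problem statement: the function names and the sample documents are made up for illustration, tokens are split on whitespace, and "more word forms of this type" is taken to mean the number of hits in the best matching tier.

```python
def match_tier(word, token):
    """Return 0 for an exact match, 1 for word*, 2 for *word*, None otherwise."""
    if token == word:
        return 0
    if token.startswith(word):
        return 1
    if word in token:
        return 2
    return None


def rank_key(word, text):
    """Sort key: (best tier, negated hit count in that tier), or None if no match."""
    tiers = [match_tier(word, tok) for tok in text.lower().split()]
    tiers = [t for t in tiers if t is not None]
    if not tiers:
        return None
    best = min(tiers)
    return (best, -tiers.count(best))


def rank(word, docs):
    """Return matching document ids, best first; ties are broken by id."""
    keyed = [(rank_key(word, text), doc_id) for doc_id, text in docs.items()]
    return [doc_id for key, doc_id in sorted(k for k in keyed if k[0] is not None)]


# Hypothetical sample documents, not taken from the test index.
docs = {1: "just test", 2: "test test", 3: "tester tests", 4: "contest"}
print(rank("test", docs))  # [2, 1, 3, 4]
```

Any Sphinx ranker setup can then be compared against this order on a small data set.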
Test index:
index rttest2
{
type = rt
path = /var/sphinx/rttest2
rt_attr_string = phrase1
rt_field = phrase2
min_infix_len = 2
index_exact_words = 1
}
INSERT into rttest2(id,phrase1,phrase2) values(1,'очень простой тест','очень простой тест');
INSERT into rttest2(id,phrase1,phrase2) values(2,'хитрый тестик','хитрый тестик');
INSERT into rttest2(id,phrase1,phrase2) values(3,'супертестик','супертестик');
INSERT into rttest2(id,phrase1,phrase2) values(4,'тестировщик тестов','тестировщик тестов');
select *,weight() from rttest2 where match('тест|тест*|*тест*') and id <= 4 option ranker=sph04;
+------+-------------------------------------+----------+
| id | phrase1 | weight() |
+------+-------------------------------------+----------+
| 3 | супертестик | 6430 |
| 4 | тестировщик тестов | 6310 |
| 1 | очень простой тест | 4430 |
| 2 | хитрый тестик | 4362 |
+------+-------------------------------------+----------+
select *,weight() from rttest2 where match('тест|тест*|*тест*') option ranker=expr('sum(exact_order*10+hit_count)');
+------+-------------------------------------+----------+
| id | phrase1 | weight() |
+------+-------------------------------------+----------+
| 1 | очень простой тест | 13 |
| 4 | тестировщик тестов | 4 |
| 2 | хитрый тестик | 2 |
| 3 | супертестик | 1 |
+------+-------------------------------------+----------+
INSERT into rttest2(id,phrase1,phrase2) values(5,'тестик тестик простой','тестик тестик простой');
select *,weight() from rttest2 where match('(тест|тест*|*тест*)&(простой|простой*|*простой*)') option ranker=expr('sum(exact_order*10+hit_count)');
+------+------------------------------------------+----------+
| id | phrase1 | weight() |
+------+------------------------------------------+----------+
| 5 | тестик тестик простой | 7 |
| 1 | очень простой тест | 6 |
+------+------------------------------------------+----------+
select *,weight() from rttest2 where match('тест|тест*|*тест*') option ranker=expr('sum(hit_count)');
+------+------------------------------------------+----------+
| id | phrase1 | weight() |
+------+------------------------------------------+----------+
| 4 | тестировщик тестов | 4 |
| 5 | тестик тестик простой | 4 |
| 1 | очень простой тест | 3 |
| 2 | хитрый тестик | 2 |
| 3 | супертестик | 1 |
+------+------------------------------------------+----------+
select *,weight() from rttest2 where match('тест|тест|тест*|*тест*') option ranker=expr('sum(hit_count)');
+------+------------------------------------------+----------+
| id | phrase1 | weight() |
+------+------------------------------------------+----------+
| 4 | тестировщик тестов | 2 |
| 5 | тестик тестик простой | 2 |
| 1 | очень простой тест | 1 |
| 2 | хитрый тестик | 1 |
| 3 | супертестик | 1 |
+------+------------------------------------------+----------+
UPD. Answering my own question:
1. The extended query language has a so-called "IDF booster", which is effectively the weight of a search keyword. (Unfortunately, this weight cannot be used on its own in the ranker expression; it is only available combined with IDF. That does not matter in my case, but below I will explain when it can be a problem.) Weights are set after each keyword with the ^ sign, for example тест^100.
2. The ranker expression can use the tf_idf factor, which is the product of tf (the number of times the search word occurs in the document) and idf (which in turn is the product of the weight you specified for the word in the query and a certain coefficient). Note that for test* and *test* idf is for some reason negative (although the documentation says that idf without a weight takes values from 0 to 1), so in our case we take its absolute value, and the complete query looks like this:
select *,weight() from rttest2 where match('(тест^100|тест*^5|*тест*)') option ranker=expr('sum(abs(tf_idf)*1000)');
+------+------------------------------------------+----------+
| id | phrase1 | weight() |
+------+------------------------------------------+----------+
| 1 | очень простой тест | 14423 |
| 4 | тестировщик тестов | 1846 |
| 5 | тестик тестик простой | 1846 |
| 2 | хитрый тестик | 923 |
| 3 | супертестик | 155 |
+------+------------------------------------------+----------+
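For several search words the same pattern can be generated rather than typed by hand. The following is a minimal sketch of a query-string builder: the helper names are hypothetical, the default boosts 100 and 5 are simply the ones used in the query above, and joining the groups with & follows the earlier two-word test query. Whether the boosts still give the desired order for multi-word queries is an assumption that was not tested here.

```python
def boosted_group(word, exact_boost=100, prefix_boost=5):
    """Build '(word^100|word*^5|*word*)' for a single search word."""
    # Only plain alphanumeric words are accepted, because this sketch
    # does not escape query-language or SQL special characters.
    if not word.isalnum():
        raise ValueError(f"unsupported search word: {word!r}")
    return f"({word}^{exact_boost}|{word}*^{prefix_boost}|*{word}*)"


def build_query(index, words):
    """Build the full SphinxQL statement for one or more search words."""
    match_expr = "&".join(boosted_group(w) for w in words)
    return (
        f"select *,weight() from {index} where match('{match_expr}') "
        "option ranker=expr('sum(abs(tf_idf)*1000)');"
    )


print(build_query("rttest2", ["тест"]))
print(build_query("rttest2", ["тест", "простой"]))
```

For a single word this reproduces the query above character for character.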
Run the searches separately and then merge the found IDs in the ranking order you need.
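A minimal sketch of the merge step, assuming three separate searches (exact, prefix, infix) have already been run and each returned a list of document IDs sorted the way you want within that group. The function name and the sample ID lists are made up for illustration.

```python
def merge_ranked(*result_lists):
    """Concatenate ID lists in priority order, keeping each ID's first occurrence."""
    seen = set()
    merged = []
    for ids in result_lists:
        for doc_id in ids:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical results of the three searches, highest priority first.
exact_ids = [1]
prefix_ids = [4, 5, 1, 2]
infix_ids = [4, 5, 1, 2, 3]
print(merge_ranked(exact_ids, prefix_ids, infix_ids))  # [1, 4, 5, 2, 3]
```

Because a document found by a stricter search is never moved by a looser one, exact matches always stay on top regardless of the weights Sphinx assigns.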