Sphinx Ranking: How to Raise Exact Matches to the Top of SERPs?

Sphinx

vadamlyuk, 2015-10-17 03:16:17

Good afternoon,
I am probably asking a question that has already been asked 100500 times, but after running my own tests and searching Google I still haven't found a clear answer.
Problem statement:
Given a search word, for example "test", I need to find the following word forms:
- test
- test*
- *test*
The results must then be ranked:
- first by word form (i.e. documents with an exact match for test should rank higher than documents that only match test*)
- if the word forms are the same, the document with more matches of that form ranks higher (i.e. the document "test test" should rank higher than the document "just test")
The position of the word in the document, its frequency, etc. do not matter, i.e. a common word like "this" and a rare word like "counter-lift" should be treated the same.
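To pin down the required ordering independently of Sphinx, here is a minimal Python sketch. It is only an illustration of the two rules above: `match_tier` and `rank_key` are hypothetical names, tokens are simply split on whitespace, and no Sphinx API is involved.

```python
def match_tier(token: str, word: str) -> int:
    """Return 0 for an exact match, 1 for a prefix match (word*),
    2 for an infix match (*word*), 3 when the token does not match."""
    if token == word:
        return 0
    if token.startswith(word):
        return 1
    if word in token:
        return 2
    return 3


def rank_key(text: str, word: str) -> tuple:
    """Sort key: best word form first, then more matches of that form first."""
    tiers = [match_tier(token, word) for token in text.lower().split()]
    best = min(tiers, default=3)
    count = tiers.count(best) if best < 3 else 0
    return (best, -count)


# Same texts as the test documents in this question.
docs = {
    1: "очень простой тест",
    2: "хитрый тестик",
    3: "супертестик",
    4: "тестировщик тестов",
}
ranked = sorted(docs, key=lambda doc_id: rank_key(docs[doc_id], "тест"))
print(ranked)  # [1, 4, 2, 3]
```

Document 1 wins on the exact form, document 4 beats document 2 because it has two prefix matches, and document 3 comes last with only an infix match.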
Test index:

index rttest2
{
  type	= rt
  path	= /var/sphinx/rttest2
  rt_attr_string = phrase1
  rt_field = phrase2
  min_infix_len = 2
  index_exact_words = 1
}

INSERT into rttest2(id,phrase1,phrase2) values(1,'очень простой тест','очень простой тест');
INSERT into rttest2(id,phrase1,phrase2) values(2,'хитрый тестик','хитрый тестик');
INSERT into rttest2(id,phrase1,phrase2) values(3,'супертестик','супертестик');
INSERT into rttest2(id,phrase1,phrase2) values(4,'тестировщик тестов','тестировщик тестов');

Wrong solution 1: use OPTION ranker=sph04 (the typical answer to questions about exact matches)
... wrong because this ranker looks first of all at the position of the word in the document and at its frequency:

select *,weight() from rttest2 where match('тест|тест*|*тест*') and id <= 4 option ranker=sph04;
+------+-------------------------------------+----------+
| id   | phrase1                             | weight() |
+------+-------------------------------------+----------+
|    3 | супертестик                         |     6430 |
|    4 | тестировщик тестов                  |     6310 |
|    1 | очень простой тест                  |     4430 |
|    2 | хитрый тестик                       |     4362 |
+------+-------------------------------------+----------+

Not-so-good solution 2: write a custom ranker based on hit_count and exact_order
... not so good because it works for one search word but not for several:

select *,weight() from rttest2 where match('тест|тест*|*тест*') option ranker=expr('sum(exact_order*10+hit_count)');

+------+-------------------------------------+----------+
| id   | phrase1                             | weight() |
+------+-------------------------------------+----------+
|    1 | очень простой тест                  |       13 |
|    4 | тестировщик тестов                  |        4 |
|    2 | хитрый тестик                       |        2 |
|    3 | супертестик                         |        1 |
+------+-------------------------------------+----------+

However, add another document and a second search word and the ranking breaks (because exact_order stops working):

INSERT into rttest2(id,phrase1,phrase2) values(5,'тестик тестик простой','тестик тестик простой');
select *,weight() from rttest2 where match('(тест|тест*|*тест*)&(простой|простой*|*простой*)') option ranker=expr('sum(exact_order*10+hit_count)');

+------+------------------------------------------+----------+
| id   | phrase1                                  | weight() |
+------+------------------------------------------+----------+
|    5 | тестик тестик простой                    |        7 |
|    1 | очень простой тест                       |        6 |
+------+------------------------------------------+----------+

Solution 3, and it is not clear why it does not work: try to add weight to a word form by repeating it in the query (so as not to rely on exact_order). The first query below is the baseline; the second repeats the exact form. The ranking then stops working altogether, since hit_count always becomes equal to 1:

select *,weight() from rttest2 where match('тест|тест*|*тест*') option ranker=expr('sum(hit_count)');
+------+------------------------------------------+----------+
| id   | phrase1                                  | weight() |
+------+------------------------------------------+----------+
|    4 | тестировщик тестов                       |        4 |
|    5 | тестик тестик простой                    |        4 |
|    1 | очень простой тест                       |        3 |
|    2 | хитрый тестик                            |        2 |
|    3 | супертестик                              |        1 |
+------+------------------------------------------+----------+

select *,weight() from rttest2 where match('тест|тест|тест*|*тест*') option ranker=expr('sum(hit_count)');
+------+------------------------------------------+----------+
| id   | phrase1                                  | weight() |
+------+------------------------------------------+----------+
|    4 | тестировщик тестов                       |        2 |
|    5 | тестик тестик простой                    |        2 |
|    1 | очень простой тест                       |        1 |
|    2 | хитрый тестик                            |        1 |
|    3 | супертестик                              |        1 |
+------+------------------------------------------+----------+

I am not including the PACKEDFACTORS() output because it would make the post unreadable; anyone interested can easily add it to the queries. I suspect there is a bug or a Sphinx "feature" that prevents this trick from working.
To sum up: I have not found a proper solution. There is an acceptable one for a single search word, but none for several search words. Maybe someone can suggest something better.


2 answer(s)

vadamlyuk, 2015-10-17
@vadamlyuk

UPD. Answering my own question:
1. The extended query syntax has a so-called "IDF booster", which is effectively a weight for a search keyword. Unfortunately, this weight cannot be used on its own in a ranker expression; it only takes effect together with the IDF. That does not matter in my case, but below I explain when it can be a problem. Weights are set after each keyword with the ^ sign, for example тест^100.
The ranker expression then uses tf_idf, which is the product of tf (the number of times the keyword occurs in the document) and idf (which in turn is the keyword weight set in the query multiplied by some coefficient). Note that for test* and *test* the value is for some reason negative (although the documentation says that an unweighted idf lies between 0 and 1), so in our case we take its absolute value. The complete query looks like this:

select *,weight() from rttest2 where match('(тест^100|тест*^5|*тест*)') option ranker=expr('sum(abs(tf_idf)*1000)');

+------+------------------------------------------+----------+
| id   | phrase1                                  | weight() |
+------+------------------------------------------+----------+
|    1 | очень простой тест                       |    14423 |
|    4 | тестировщик тестов                       |     1846 |
|    5 | тестик тестик простой                    |     1846 |
|    2 | хитрый тестик                            |      923 |
|    3 | супертестик                              |      155 |
+------+------------------------------------------+----------+
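For several search words, the boosted MATCH() string can be generated rather than typed by hand. The helper below is a hypothetical sketch: it only reproduces the query shape used above and joins the per-word groups with `&`, the way the two-word query in the question does. It assumes the words contain no query-syntax characters (no escaping is done), and the multi-word ranking has not been verified against a running Sphinx.

```python
def boosted_match(words, exact_boost=100, prefix_boost=5):
    """Build an extended-syntax query string with one boosted group per word.

    The exact form gets exact_boost, the prefix form gets prefix_boost and
    the infix form is left unboosted, as in the single-word query above.
    """
    groups = [
        f"({word}^{exact_boost}|{word}*^{prefix_boost}|*{word}*)"
        for word in words
    ]
    return "&".join(groups)


print(boosted_match(["тест"]))
# (тест^100|тест*^5|*тест*)
print(boosted_match(["тест", "простой"]))
# (тест^100|тест*^5|*тест*)&(простой^100|простой*^5|*простой*)
```

The resulting string would go into match('...') together with option ranker=expr('sum(abs(tf_idf)*1000)').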

According to this explanation by the author, for forms of the same word with and without asterisks the IDF values will be equal, i.e. for the words test, test* and *test* the IDF will be almost directly proportional to the weight set in the query.
But if you try the same trick with the words "this" and "counter-lift" in an OR query, it will most likely fail, because the IDF will depend heavily on the set of documents in the index. So it would of course be nice to be able to set a weight for a specific keyword directly, without any tie to idf.
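The arithmetic behind the boosted ranking can be sketched in a few lines of Python. This is a rough model, not Sphinx's actual computation: it assumes all three forms share one base IDF (`BASE_IDF = 0.15` is a made-up value), ignores the negative sign mentioned above, and treats a wildcard as also matching the bare word.

```python
BASE_IDF = 0.15  # assumed value; real IDFs come from the index statistics
BOOSTS = {"exact": 100, "prefix": 5, "infix": 1}  # boosts from the query


def form_hits(text, word):
    """Count how many tokens each keyword form matches (a stand-in for tf)."""
    tokens = text.split()
    return {
        "exact": sum(token == word for token in tokens),
        "prefix": sum(token.startswith(word) for token in tokens),
        "infix": sum(word in token for token in tokens),
    }


def approx_weight(text, word):
    """Approximate sum(abs(tf_idf)*1000) as the sum of tf * boost * BASE_IDF."""
    hits = form_hits(text, word)
    total = sum(hits[form] * BOOSTS[form] * BASE_IDF for form in BOOSTS)
    return round(total * 1000)


docs = {
    1: "очень простой тест",
    2: "хитрый тестик",
    3: "супертестик",
    4: "тестировщик тестов",
    5: "тестик тестик простой",
}
weights = {doc_id: approx_weight(text, "тест") for doc_id, text in docs.items()}
ranked = sorted(docs, key=lambda doc_id: -weights[doc_id])
print(ranked)  # [1, 4, 5, 2, 3]
```

The absolute numbers differ from the real output because the real IDFs are not exactly 0.15, but the order is the same: one exact hit with boost 100 outweighs any realistic number of prefix or infix hits. With two unrelated words the base IDFs would differ, which is exactly the problem described above.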

Puma Thailand, 2015-10-17
@opium

Search separately and then merge the IDs found with the ranking you need.
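A minimal sketch of that merge step in Python, assuming three separate queries (exact, prefix, infix) have already returned lists of IDs, each sorted the way you want within its tier. The ID lists here are hypothetical and `merge_tiers` is an illustrative name, not part of any Sphinx client.

```python
def merge_tiers(*tiers):
    """Concatenate ID lists in priority order, keeping only the first
    occurrence of each ID so a document stays in its best tier."""
    seen = set()
    merged = []
    for tier in tiers:
        for doc_id in tier:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical results of the three separate queries.
exact_ids = [1]
prefix_ids = [4, 5, 1, 2]
infix_ids = [4, 5, 1, 2, 3]
print(merge_tiers(exact_ids, prefix_ids, infix_ids))  # [1, 4, 5, 2, 3]
```

The cost is three queries instead of one, but the tier order no longer depends on IDF at all.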