How to implement ranking of full-text search results by text and tags in Django + Sphinx (sphinxit library)?

Stanislav Gordienko, 2015-07-07 04:56:15

Django

Hello everyone,
I was given the task of implementing full-text search over records that have tags. I implemented the search with Sphinx and the sphinxit library, but now I'm unsure how to rank the results in descending order when the user enters search text and also selects several tags (for example: seo, django, google). The text may be found in a record that does not have all of the selected tags; how should such records be ranked? Should I first show all posts that have every tag, then posts that have 90% of the tags, and so on?
Thank you.

1 answer(s)

Dmitry Filimonov, 2015-07-11
@stagor

Your question is not quite complete. Is it about ranking, or about how to make tags work with Sphinx? How are the tags implemented? Since that is unclear, I'll write an overview answer that shows where to dig and what to do. I once used django-sphinx (now abandoned) for similar tasks, although I did not sort by tags, because in my case they were used to refine the query. I have not used sphinxit, so we'll first look at "bare" Sphinx and then at sphinxit as a layer on top of it. So I may be wrong in places.
Let's assume you have a classic implementation of tags: a table of tags plus a many-to-many relation to the table of items (django-taggit does this, for example).
Sphinx has full-text fields, on which it builds a special structure (an index) to quickly find matches by keyword. They are not stored in their original form, only as an index. There are also attributes: they are attached to the index and stored in it, and you can filter by them. Attributes can be numeric values (such as price or customer_id), strings, or even JSON, and they are convenient for filtering and refining a query. There is also MVA (multi-valued attributes): these are attributes too, but each one holds a set of numeric values. There is an example of how to declare them in Sphinx. In general, attributes, and MVA in particular, exist to take load off the database and put it on Sphinx. :)
MVA attributes are well suited to storing tags and similar many-to-many data; even the docs say so.
Ranking is a complex topic. Sphinx has special rankers for full-text fields, among them the BM25 algorithm and naive rankers such as SPH_RANK_WORDCOUNT, which simply counts keyword occurrences and takes the field weight into account. You can also filter and sort by attributes, including MVA.
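To make the naive ranker concrete, here is a minimal pure-Python sketch of the idea behind SPH_RANK_WORDCOUNT: count keyword occurrences per field, multiply by the field weight, and sum. This is only an illustration; the function name, the sample document, and the weights are made up, and real Sphinx tokenization and morphology are not modelled.

```python
def wordcount_weight(doc_fields, keywords, field_weights):
    """Sum keyword occurrences per field, scaled by that field's weight."""
    total = 0
    for field, text in doc_fields.items():
        # Naive whitespace tokenization (an assumption; Sphinx does more).
        words = text.lower().split()
        hits = sum(words.count(kw.lower()) for kw in keywords)
        # Fields without an explicit weight default to 1.
        total += hits * field_weights.get(field, 1)
    return total


# Hypothetical document and weights, for illustration only.
doc = {
    "title": "django seo tips",
    "body": "seo for django sites and more seo",
}
score = wordcount_weight(doc, ["seo", "django"], {"title": 10, "body": 1})
print(score)
```

Here the title contributes 2 hits at weight 10 and the body 3 hits at weight 1, so a match in the title dominates the score.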
From here on I'll talk in terms of the SphinxQL query language. You could also use the native API, but you are using a wrapper anyway, and the query language is simpler to explain with. The docs sometimes give examples only for the PHP API with no SphinxQL explanation, but that does not mean the same thing cannot be done through SphinxQL (you may just need to look for it).
For your task, do a SELECT with the LENGTH() function on the MVA attribute that holds the tags, sort by it with ORDER BY in the desired order, and filter with WHERE, specifying the required tags with the IN operator. The WHERE clause combines attribute filters and the full-text search. This way you can sort the results by the length of the MVA attribute (the number of tags). Note that LENGTH() counts all tags on a record, not only the selected ones.
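As a rough sketch of the query described above, the helper below just assembles a SphinxQL string in Python. The index name posts, the MVA attribute name tags, and the tag ids are assumptions for illustration, and the quoting is deliberately minimal (it does not escape Sphinx query operators). I have not run this against a live searchd.

```python
def build_tag_query(index, text, tag_ids, tags_attr="tags"):
    """Build a SphinxQL query: full-text match, tag filter, sort by tag count."""
    # Tag ids are forced to int so they can be inlined safely.
    ids = ", ".join(str(int(t)) for t in tag_ids)
    # Minimal SQL-string escaping only (an assumption, not full sanitizing).
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"SELECT id, LENGTH({tags_attr}) AS tag_count "
        f"FROM {index} "
        f"WHERE MATCH('{escaped}') AND {tags_attr} IN ({ids}) "
        f"ORDER BY tag_count DESC"
    )


# Hypothetical index name and tag ids.
query = build_tag_query("posts", "django search", [3, 7, 12])
print(query)
```

The expression is aliased in SELECT and the alias is used in ORDER BY, since sorting is done on columns and aliases rather than on arbitrary expressions.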
You can also add full-text ranking (enabled with SetRankingMode in the API or with OPTION in SphinxQL) on the full-text fields, for example with the default ranker, which combines phrase proximity with BM25, and sort with ORDER BY both by the length of the tags MVA attribute and by WEIGHT() from the ranker (see the docs for examples). Most likely you will also have to select WEIGHT() in SELECT (the weight used to be available to ORDER BY implicitly; I have not tested this either). The result is nice: if, say, 50 records in a row have the same number of tags, they will be ordered among themselves by the ranker's weight.
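A sketch of what the combined query could look like, again only as a string built in Python. The index name, attribute name, and the small whitelist of ranker names are assumptions for illustration; check the ranker list for your Sphinx version. Untested against a real server.

```python
# A few SphinxQL ranker names; this whitelist is illustrative, not complete.
KNOWN_RANKERS = {"proximity_bm25", "bm25", "wordcount", "none"}


def build_ranked_query(index, text, tag_ids, ranker="proximity_bm25"):
    """Sort by tag count first, then by full-text weight from the ranker."""
    if ranker not in KNOWN_RANKERS:
        raise ValueError(f"unexpected ranker: {ranker}")
    ids = ", ".join(str(int(t)) for t in tag_ids)
    # Minimal SQL-string escaping only (an assumption, not full sanitizing).
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return (
        "SELECT id, LENGTH(tags) AS tag_count, WEIGHT() AS relevance "
        f"FROM {index} "
        f"WHERE MATCH('{escaped}') AND tags IN ({ids}) "
        "ORDER BY tag_count DESC, relevance DESC "
        f"OPTION ranker={ranker}"
    )


# Hypothetical index name and tag ids.
ranked_query = build_ranked_query("posts", "django search", [3, 7, 12])
print(ranked_query)
```

WEIGHT() is selected explicitly under an alias so that the second sort key is unambiguous: records with equal tag_count fall back to relevance.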
In general, your task has a neat solution. Building on the approach above, with Sphinx alone you can also implement a complete tag match (compare LENGTH() with the number of tags in WHERE), and similar things.
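For reference, the ordering the question asks for (all selected tags first, then fewer, ties broken by full-text weight) can be modelled in plain Python. Note that this sketch counts how many of the selected tags a record has, which is not the same number as LENGTH() of the whole tag set. The records, the weight field, and the function name are invented for illustration. In SphinxQL a similar count might be expressible as a sum of IN(tags, id) expressions, but I have not tested that.

```python
def rank_by_matched_tags(records, selected_tags):
    """Order records by number of selected tags present, then by weight."""
    selected = set(selected_tags)

    def sort_key(rec):
        matched = len(selected & set(rec["tags"]))
        # Negate both values so that larger ones come first.
        return (-matched, -rec["weight"])

    return sorted(records, key=sort_key)


# Hypothetical search hits: each already matched the text and has a weight.
records = [
    {"id": 1, "tags": {"seo"}, "weight": 900},
    {"id": 2, "tags": {"seo", "django", "google"}, "weight": 300},
    {"id": 3, "tags": {"seo", "django"}, "weight": 500},
    {"id": 4, "tags": {"django", "google", "python"}, "weight": 700},
]
selected = ["seo", "django", "google"]

order = [r["id"] for r in rank_by_matched_tags(records, selected)]
# Complete tag match: every selected tag is present on the record.
full_matches = [r["id"] for r in records if set(selected) <= r["tags"]]
print(order, full_matches)
```

Record 4 has three tags in total but only two of the selected ones, so it sorts below record 2; this is the case where a plain tag-count comparison and a matched-tag count diverge.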
If the tags are stored in some non-classic way (highload and so on), they can still be exported either into an MVA or into a full-text field (in which case the SPH_RANK_WORDCOUNT ranker is a good fit). So the essence is the same.
I think that answers the "theoretical" questions about ranking and Sphinx. There is plenty of room for tuning the ranking and optimizing queries; you will need to experiment.
Now for sphinxit. According to its docs, it can enable rankers and it speaks SphinxQL. You just need to express all of the above through it. Pitfalls are possible: perhaps it cannot do something, or perhaps I got something wrong. In principle you need filtering and sorting, and there is even an example of how to enable a ranker, plus a description of the available rankers. It also seems that you can add an expression to the select if needed. Overall it all looks friendly. Good luck!
By the way, the options example in the sphinxit documentation enables a ranker but sorts only by the name attribute in ORDER BY. In theory this defeats the purpose of the ranker for sorting, because the weight has to be named explicitly in ORDER BY. Apparently it is just an example of how the call is converted to SphinxQL.
P.S. It's strange that more than 4 days have passed and no one has answered this question. Is Sphinx not popular? :(