How to configure search options in Apache Solr?

Text Processing Automation

Alexey, 2019-11-07 16:35:29

Good afternoon! I have started using the Solr full-text search engine, version 8.3.0, and I am exploring its capabilities. Out of the box everything works tolerably well, but the ranking of the results can clearly be improved, and I could not find what I need in the documentation. I am interested in the following:
1) Count only one occurrence of the search word per text. Currently, the more often the search word occurs in a text, the higher that text's relevance. I want to turn this off and give preference to texts where the search word occurs only once.
2) Give shorter texts priority.
3) How do I tune the search specifically for Russian? I know there is a text_ru type, but I do not really understand how to apply it.
4) In fuzzy search, give preference to the first part of the word. Roughly speaking, pay less attention to the ending and more to the root.
If there are Solr experts here, please tell me how the above can be tuned, or point me to the relevant docs.

1 answer(s)

Alexey, 2019-11-15
@Espritto

I realize the question is very specific, and nobody responded on Stack Overflow either. I was about to contact the Solr developers directly, but in the end I figured it out myself, so I am answering my own question.
1) You need to modify the scorer, the algorithm that evaluates and ranks the matches. The parameter of interest is term frequency (TF). It is usually used together with its sibling IDF, inverse document frequency, but IDF does not need to be touched here. TF reflects how many times the search term occurs in a document, and the higher it is, the higher the score. To ignore it, the function that computes it has to always return 1. Solr scores results with the Similarity class, or rather with one of its many subclasses, which implement different algorithms. In the config of my core I specified ClassicSimilarityFactory, and in the implementation of the ClassicSimilarity class I hardcoded the tf() function to always return 1.0f. Since Solr is an open-source project written in Java, changing the sources is not difficult. After that, build the project following the instructions in the README and launch it: everything works. You can check that the weights are calculated as expected by enabling debug mode in the request; debug info is then returned along with the results.
2) In fact, this already works out of the box; nothing needs to be done here.
3) As mentioned, there is a ready-made text_ru type in which a Russian stemmer is already configured. For a text field to be processed according to the rules of the Russian language, either give it a name that matches the *_txt_ru dynamic-field pattern (this is the pattern defined in the default configset), or explicitly create a new field in the Schema section of the admin panel and specify the text_ru type for it.
4) This happens automatically if you use the field type with the Russian stemmer: words are matched by their root and the endings are discarded.
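
To illustrate points 1 and 2, here is a minimal sketch in Python. It is a simplified model of classic TF-IDF scoring, not Lucene's actual code. It assumes the stock ClassicSimilarity behaviour of tf = sqrt(freq) and a length norm of 1/sqrt(number of terms), and it leaves IDF out because it is the same for every document when the query has a single term. The function names and the sample texts are made up for the example.

```python
import math


def classic_tf(freq):
    # Stock behaviour assumed here: score grows with the square root of the count.
    return math.sqrt(freq)


def constant_tf(freq):
    # The patched behaviour from point 1: every match counts the same.
    return 1.0


def length_norm(num_terms):
    # Shorter fields get a larger norm, which is why point 2 works out of the box.
    return 1.0 / math.sqrt(num_terms)


def toy_score(term, doc_tokens, tf):
    # Single-term score without IDF; a document that lacks the term scores 0.
    freq = doc_tokens.count(term)
    if freq == 0:
        return 0.0
    return tf(freq) * length_norm(len(doc_tokens))


short_once = "solr search tips".split()
long_repeated = "solr solr solr solr tips".split()

for name, tf in (("classic tf", classic_tf), ("constant tf", constant_tf)):
    a = toy_score("solr", short_once, tf)
    b = toy_score("solr", long_repeated, tf)
    print(f"{name}: short_once={a:.3f} long_repeated={b:.3f}")
```

With the classic tf the text that repeats the word wins; with the constant tf only the length norm is left, so the shorter text with a single occurrence ranks first.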
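
For point 3 and for the debug check from point 1, the requests can be sketched like this. The snippet only builds the requests and does not send them. The core name mycore and the field name body_ru are hypothetical, and the host and port assume a local Solr on its default port. The add-field command of the Schema API and the debugQuery parameter are standard Solr features, but check the exact options against the documentation for your version.

```python
import json
from urllib.parse import urlencode

# Assumption: local Solr on the default port, hypothetical core "mycore".
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"


def add_field_request(name, field_type="text_ru"):
    # Schema API command that creates an explicit field of the given type.
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
        }
    }
    # Returns the URL to POST to and the JSON body (Content-Type: application/json).
    return SOLR_CORE_URL + "/schema", json.dumps(payload)


def debug_query_url(field, text):
    # debugQuery=true asks Solr to return the score explanation with the results.
    params = {"q": f"{field}:{text}", "debugQuery": "true"}
    return SOLR_CORE_URL + "/select?" + urlencode(params)


url, body = add_field_request("body_ru")  # "body_ru" is a made-up field name
print(url)
print(body)
print(debug_query_url("body_ru", "поиск"))
```

The score explanation in the debug output is where you can see whether the tf factor still varies with the number of occurrences.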