Ranking a list by "quantity vs. significance" and the voting problem?


Sphinx

AndreyIvanoff, 2011-09-07 12:35:39

Hello Khabrovites.
An interesting problem has come up with ranking search results.
Let's say there is an abstract search engine for some objects that returns its results as a list of entries in the form "Number of matches - Average error of matches".
In practice, the following situations arise:
Situation 1:
Document 1: 19-0.32
Document 2: 1-0.59
Document 3: 2-0.69
The following situation:
Situation 2:
Document 1: 19-0.32
Document 2: 18-0.30
Document 3: 2-0.69
Here it is hard to choose between documents 1 and 2, so both can be returned in the search results with confidence.
Or like this:
Situation 3:
Document 1: 2-0.1
Document 2: 18-0.30
Document 3: 2-0.69
Here too it is hard to choose between documents 1 and 2, but it is more logical to return document 2, because it has more matches. At the same time, you cannot rank simply by the number of matches, since situation 4 is also possible:
Situation 4:
Document 1: 2-0.1
Document 2: 18-0.30
Document 3: 100-0.99
And an average error of 0.99 with 100 matches means there are practically no real matches at all.
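The failure of count-only ranking in situation 4 can be sketched in a few lines of Python. This is only an illustration: the tuple layout and the name `rank_by_count` are assumptions made for the example, not part of any real search engine.

```python
# Situation 4 from the post, as (document id, number of matches, average error).
situation_4 = [(1, 2, 0.10), (2, 18, 0.30), (3, 100, 0.99)]


def rank_by_count(results):
    """Sort by number of matches only, most matches first (hypothetical helper)."""
    return sorted(results, key=lambda r: r[1], reverse=True)


ranked = rank_by_count(situation_4)
top_ids = [doc_id for doc_id, _, _ in ranked]

# Document 3 ends up first even though its average error is 0.99.
print(top_ids)
```

Sorting by the count alone puts document 3 ahead of document 2, which is exactly the outcome the post wants to avoid.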
Question: Since the output of the search algorithm can be a very long list, how should it be ranked and presented to the user? Probably the two parameters, the number of matches and the accuracy, should be combined into a single one. How can this be done, and are there "best practices" for it?


1 answer(s)


Vlad911, 2011-09-07
@Vlad911

The number of false matches in a document is the product of the number of matches (N) and the error coefficient (e): Ne = N*e
Then the probability of an error-free choice of the document would seem to be: Pe = (N-Ne)/N
Choose the document with the maximum Pe.
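A minimal Python sketch of this formula, applied to situation 3 from the question, might look as follows. The function names `p_correct` and `expected_correct` are hypothetical. Note that (N-Ne)/N simplifies to 1 - e, so Pe by itself does not depend on N. For comparison, the sketch also ranks by the numerator N - Ne; using that as a ranking key is an assumption of this example, not something proposed above.

```python
def p_correct(n, e):
    """Pe from the formula above: (N - Ne) / N, with Ne = N * e."""
    ne = n * e
    return (n - ne) / n


def expected_correct(n, e):
    """Numerator N - Ne. Using it as a ranking key is an assumption of this sketch."""
    return n - n * e


# Situation 3: document id -> (number of matches, average error).
situation_3 = {1: (2, 0.10), 2: (18, 0.30), 3: (2, 0.69)}

best_by_pe = max(situation_3, key=lambda d: p_correct(*situation_3[d]))
best_by_numerator = max(situation_3, key=lambda d: expected_correct(*situation_3[d]))

print(best_by_pe, best_by_numerator)
```

On this data, max Pe picks document 1 (2 matches, error 0.1), while the numerator picks document 2, the one the question considers the more logical result. Whether either key is suitable for real data would need to be checked.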
In any case, you need to choose a function that minimizes the probability of choosing the wrong documents. Various classifiers solve this kind of problem.
The same task can also be viewed as building a classifier that assigns each document to one of two classes: erroneous or relevant. However, there is little input data for that.
It is possible that the ratio of the number of matches to the length of the document also matters. (3 misspellings in the word GUI are not the same as 3 misspellings in the word representative.) :)
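The length remark can be illustrated with the two words from the example. This is only a sketch: `error_density` is a hypothetical helper, and dividing by the length in characters is one assumed way of normalizing, not a recommendation from the thread.

```python
def error_density(word, misspellings):
    """Misspellings per character of the word (hypothetical normalization)."""
    return misspellings / len(word)


short_word = error_density("GUI", 3)             # every character is wrong
long_word = error_density("representative", 3)   # 3 of 14 characters are wrong

print(short_word, long_word)
```

The same count of 3 gives a much higher density for the short word, which is the point of the remark.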