How to find a sample of the closest numbers from an array?

Mathematics

mahbe, 2015-02-20 19:49:53

I am building search for a site. For each item in the results, a small clipping should be shown: the fragment of the text in which the keywords of the search query are mentioned most often.
To do this, I split the search query into an array of words and find their indices (the positions where they occur) in the text of the material. And then I don't know what to do with these numbers. It is clear where my keywords start in the full text, but how do I pull them out correctly? How do I find the segment where they are mentioned most often, that is, a selection of the numbers closest in value?
I thought of simply taking the arithmetic mean of the positions and looking for the keyword nearest to it, but that seems weak.

2 answer(s)

Andrew, 2015-02-21
@mahbe

How about a head-on solution: take a sliding window whose width equals the number of words in the clipping and move it across the entire text, counting at each position how many search words fall inside the window, or the sum of their weights if the words have different priorities. The sum for the current position is obtained by updating the sum for the previous position: one text position enters at the "head" of the window (increase the sum if that word is a search word) and one position leaves at the "tail" (decrease the sum if that word was a search word).
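
A minimal Python sketch of this sliding window, for illustration only. The function name `best_window`, the `\w+` tokenization, the default window of 30 words and the default weight of 1 per keyword are all assumptions, not part of the answer above.

```python
import re

def best_window(text, query, window_size=30, weights=None):
    """Return (start, end, score): the word range [start, end) whose
    keyword weight is highest. All names here are illustrative."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    keywords = {w.lower() for w in re.findall(r"\w+", query)}
    weights = weights or {}
    if not words:
        return 0, 0, 0

    # Weight of every text position: 0 for ordinary words,
    # 1 (or a custom weight) for search words.
    hits = [weights.get(w, 1) if w in keywords else 0 for w in words]

    size = min(window_size, len(words))
    score = sum(hits[:size])          # sum for the first window position
    best_start, best_score = 0, score

    # Slide the window: one position enters at the head, one leaves at the tail.
    for head in range(size, len(words)):
        score += hits[head] - hits[head - size]
        if score > best_score:
            best_start, best_score = head - size + 1, score

    return best_start, best_start + size, best_score
```

The clipping is then the words from `start` up to `end`. Each step costs O(1), so the whole pass is linear in the length of the text. On ties this sketch keeps the earliest window.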

Deerenaros, 2015-02-21
@Deerenaros

The problem can be formalized with the Hamming distance: we need to find a bounded region in which the distance between any two points is determined by a parameter r. The points are vectors in which a one marks the presence of a particular keyword and a zero marks its absence.
Obviously, this requires the full set of keywords. Besides, the task is not as trivial as it may seem at first glance, but the computation can be done once, when the material is created or changed, and newly created text can be added to the results already found so that nothing has to be recalculated.
In practice, though, it is easier to simply SELECT from the database and take the first few articles that share at least 1-2 tags. Such a heuristic is more than sufficient, simply because the notion of "similar material" is itself highly subjective.
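
A small Python sketch of the keyword vectors and the Hamming distance between them, for illustration only. The helper names `presence_vector` and `hamming`, and splitting on whitespace, are assumptions. How to choose the region and r is not specified in the answer and is not attempted here.

```python
def presence_vector(text, keywords):
    """One component per keyword: 1 if it occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if k.lower() in words else 0 for k in keywords]

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))

# The full keyword set fixes the vector layout for every text.
keywords = ["search", "window", "text"]
a = presence_vector("sliding window over text", keywords)
b = presence_vector("search the text", keywords)
distance = hamming(a, b)
```

Because the vectors depend only on the text and the fixed keyword set, they can be computed once when a material is saved and then reused.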