How to find the most common combinations of words in several texts?
There is a database containing a set of texts. I need to find the most common combinations of words in them, so that they can be used later to train some scripts.
What would be a logically sound way to organize the search so that nothing is missed?
I would do it as follows:
Take each text, split it into overlapping shingles of 1 to 5 words, and remove duplicates within that text.
Next, check each shingle against the database. If the shingle is not there yet, add it with a mention count of 1; if it already exists, just increment its count. Repeat this for every text. The result is the full set of shingles with their counts, which can then be sorted by count to find the most common combinations.
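The steps above could be sketched roughly like this in Python. This is only an illustration under some assumptions: an in-memory `Counter` stands in for the database table, words are extracted with a simple lowercase `\w+` regex, and the names `shingles` and `count_shingles` are made up for the example. Because duplicates are removed within each text, a shingle's count ends up being the number of texts it appears in.

```python
import re
from collections import Counter


def shingles(text, min_len=1, max_len=5):
    """Return the set of distinct overlapping word shingles in one text."""
    # Assumption: a "word" is any run of letters, digits, or underscores.
    words = re.findall(r"\w+", text.lower())
    found = set()  # a set removes duplicates within this text
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            found.add(" ".join(words[i:i + n]))
    return found


def count_shingles(texts, min_len=1, max_len=5):
    """Count, for every shingle, how many texts contain it."""
    counts = Counter()  # stand-in for the database table
    for text in texts:
        # Missing shingles start at 0 and become 1; existing ones are incremented.
        counts.update(shingles(text, min_len, max_len))
    return counts


texts = [
    "the quick brown fox",
    "the quick red fox",
    "a quick brown dog",
]
counts = count_shingles(texts)

# Sort by count, most common first, and show the top few.
for shingle, n in counts.most_common(5):
    print(n, shingle)
```

With a real database the same logic would usually become an insert-or-increment (upsert) per shingle, and single words or very rare shingles could be filtered out afterwards if only multi-word combinations are of interest.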
Here is information on recursive searching and on a search tree structure for storing text.