PHP
Anton Danilkin, 2017-02-03 23:56:15

How to find the most common combinations of words in several texts?

I have a database containing a set of texts. I need to find the most common combinations of words in them so that I can train some scripts later.
What is a logically sound way to organize the search so that nothing is missed?

4 answers
Anton Danilkin, 2017-02-04
@danilkin

I am thinking of doing it as follows:
Take each text, split it into overlapping shingles of 1 to 5 words, and remove duplicates within that text.
Then check each shingle against the database. If the shingle is not there yet, add it with a mention count of 1; if it is already there, increase its count by 1. Repeat this for every text. The result is the full set of shingles with their counts, which can then be sorted by count to get the most common combinations.
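A minimal Python sketch of the shingle-counting idea described above, with assumptions stated up front: an in-memory `Counter` stands in for the database table, tokenization is a simple `\w+` regex split on lowercased text, and the function names (`tokenize`, `shingles`, `count_shingles`) and sample texts are made up for illustration. Because duplicates are removed per text, the count for a shingle is the number of texts that contain it, not its total number of occurrences.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def shingles(words, min_len=1, max_len=5):
    """Return the set of distinct overlapping word shingles in one text."""
    found = set()
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            found.add(tuple(words[i:i + n]))
    return found


def count_shingles(texts, min_len=1, max_len=5):
    """Count, for each shingle, the number of texts that contain it."""
    counts = Counter()  # stands in for the database table (assumption)
    for text in texts:
        # The set already removed duplicates within this text,
        # so each text adds at most 1 to a shingle's count.
        counts.update(shingles(tokenize(text), min_len, max_len))
    return counts


# Hypothetical sample texts, for illustration only.
texts = [
    "the quick brown fox",
    "the quick red fox",
    "a quick brown dog",
]
counts = count_shingles(texts, min_len=2, max_len=3)

# Sort by count, most common first.
for shingle, n in counts.most_common(5):
    print(" ".join(shingle), n)
```

This sketch does not normalize word forms, so "fox" and "foxes" are counted as different words; a stemmer or lemmatizer would have to be added before shingling if that matters. With a real database, the `Counter` update would likely become an insert-or-increment (upsert) query per shingle.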

xmoonlight, 2017-02-04
@xmoonlight

Here is information on recursive searching and on a search tree structure for storing text.

Vlad_Fedorenko, 2017-02-04
@Vlad_Fedorenko

www.nltk.org/howto/collocations.html

kotofey, 2017-02-05
@kotofey

This has already been solved for us: carrot2. It also handles different word forms in Russian.
