What algorithm to use to find duplicate words in a string?

Aleksandr Govorukhin2016-03-05 11:39:58

Algorithms

Aleksandr Govorukhin, 2016-03-05 11:39:58

Hello, friends!
Suggest an algorithm for finding repeated words in a string.
PS for a small number of words, you can write a double loop for comparison, but this approach is inefficient for a large number of words - it is not suitable for one million words, etc.

Answer the question

In order to leave comments, you need to log in

6 answer(s)

Vitaly Pukhov, 2016-03-05
@SnapSh0t

The task is set in a strange way, firstly, there are no million words in any language, but on average from 3 to 10 thousand are used. That is, the task for a million is fundamentally wrong. In any case, at least 1 pass would be required anyway, I would do so.
1. Creating a dictionary of words and their number
2. Passing through all the words in turn
3. Searching for the current word in the list, if it is not there, add it to the list and set the number to 1, if there is +1 to the number
As a result, we get 1 pass through the entire text and indexing to a dictionary, which is relatively cheap, is much cheaper than iterating over the entire list.
A similar approach was used for text analysis, as a result, for 20 books ~ 500 pages each, it took about 3-5 seconds.
For C# it looks like this

mib, 2016-03-05
@mib

You can use the naive method: create a `table` in the database: `word`|`count_words` (primary key `word`). Then take all the words in order, and add to the table. If such a word already exists, increase its number of repetitions by 1, something like this:
INSERT INTO `table` (`word`) VALUES ('$new_word') ON DUPLICATE KEY UPDATE `count_words`=`count_words`+1;
There are not so many words in any language, well, for example, 50,000, and the text can be arbitrarily large.
And then make a selection sorted by the number of repetitions.
The same can be done without a database, using hashes: translate the word into transliteration, and increase the counter according to the hash.

Armenian Radio, 2016-03-05
@gbg

Clarifications are needed right away - does the string fit in RAM?
If it fits - in the first pass you can find the ends of all words (positions of all spaces in the line).
This will be a list of segments (beginnings and ends of words) . We
sort this list by the length of words.
We get a list in which words of equal length will go in a row.
Then we sort each cluster in this list in lexicographic order. And we count the sorted duplicates.
The task is well suited for parallel processing.

Dimonchik, 2016-03-05
@dimonchik2013

I'm sorry,
what's wrong?
words are out of order?
then string to array, sort and ..

uvelichitel, 2016-03-05
@uvelichitel

Bloom filter

SeptiM, 2016-03-05
@SeptiM

And the prefix tree will not work? I would first remove all punctuation marks, lower the registers, and then build it. It is possible to store the number of leaves in the subtree at each vertex and find in O(1/phi) all words that occur at least phi * n times, where n is the number of all words.