How do I get the top N most frequent words in a text file?

C++ / C#

Ivan Ershov, 2014-10-28 18:29:43

Comrades, here is the task. There is a text file. You need to get the N most frequent words, in descending order of frequency. The comparison is case-insensitive. You also need a stop dictionary, stored in a file. Word separators are spaces, tabs, newlines, and punctuation.
You can use anything (STL)! Comrades, please advise :D

3 answer(s)

brutal_lobster, 2014-10-28
@brutal_lobster

Check out the uniq code from coreutils ;)

Eddy_Em, 2014-10-28
@Eddy_Em

There is a naive brute-force option that is simple but slow: sort the words and count the repeats, building a kind of pseudo-tree along the way. A somewhat faster but harder option is to use trees. You can also consider a lot of faster and much more complex options.
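
For illustration, the brute-force variant might look roughly like this. It is only a sketch, assuming the words have already been split out of the text and lowercased; the name count_by_sorting is made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: counts repeats by sorting the words and scanning runs.
// Assumes the words are already lowercased and free of punctuation.
std::vector<std::pair<std::string, int> >
count_by_sorting(std::vector<std::string> words)
{
    std::sort(words.begin(), words.end()); // equal words become adjacent
    std::vector<std::pair<std::string, int> > counts;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (!counts.empty() && counts.back().first == words[i])
            ++counts.back().second;                        // same run: bump the counter
        else
            counts.push_back(std::make_pair(words[i], 1)); // a new run starts
    }
    return counts; // ordered alphabetically, not yet by frequency
}
```

The result still has to be sorted by the count to get the most frequent words first.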

Koss1024, 2014-10-29
@Koss1024

What is there to suggest? What exactly is the question?
Or should we write the code for you?
The algorithm is simple:
Read the words from the file stream and collect them in a map, like this:
ifstream fs("filename.txt");
map<string, int> freq; // word -> number of occurrences
string word;
while (read_next_word(fs, word)) // read the next word, skipping spaces, tabs, punctuation, etc. (the separator-skipping logic goes here)
{
    transform(word.begin(), word.end(), word.begin(), ::tolower); // convert to lowercase
    freq[word]++; // increment the counter for this word
}
Now the map holds the frequency of every word. Copy it into a vector and sort by frequency, highest first:
vector<pair<string, int> > vocabulary(freq.begin(), freq.end());
sort(vocabulary.begin(), vocabulary.end(), greater_second); // a lambda would be simpler if C++11 is available
The words in the vocabulary container are now sorted by descending frequency, so the first N elements are the answer, and you can do anything with them.
Here greater_second is:
bool greater_second(const pair<string, int>& a, const pair<string, int>& b)
{
    return a.second > b.second; // higher count comes first
}
That is really all the code (except for the logic of skipping characters, but that part is simple in my opinion).