C++ / C#
Ivan Ershov, 2014-10-28 18:29:43

How do I get the N most frequent words in a text file?

Comrades, here is the task. There is a text file, and I need to get the N most frequently repeated words, in descending order of frequency. The comparison is case-insensitive. A stop dictionary is also required, and it must be stored in a file. Word separators are spaces, tabs, newlines, and punctuation.
Anything can be used (STL included)! Comrades, any advice? :D


3 answer(s)
brutal_lobster, 2014-10-28
@brutal_lobster

Check out the uniq code from coreutils ;)

Eddy_Em, 2014-10-28
@Eddy_Em

There is a brute-force but slow option: sort the words and count the repetitions, building a kind of pseudo-tree. A somewhat harder and slightly faster option is to use trees. You can also consider options that are much faster and much more complex.
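
For illustration, the sort-and-count option might look roughly like the sketch below. It is not Eddy_Em's code: the function name count_by_sorting is made up for this example, the words are assumed to be already lowercased and stripped of punctuation, and C++11 is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Count repetitions by sorting: equal words become adjacent, so one linear
// pass over the sorted vector yields (word, count) pairs.
// Assumes the words are already lowercased and free of punctuation.
std::vector<std::pair<std::string, std::size_t>>
count_by_sorting(std::vector<std::string> words)  // taken by value: sorted locally
{
    std::sort(words.begin(), words.end());

    std::vector<std::pair<std::string, std::size_t>> counts;
    for (const std::string& w : words) {
        if (!counts.empty() && counts.back().first == w)
            ++counts.back().second;      // same run as the previous word
        else
            counts.emplace_back(w, 1);   // a new run starts here
    }

    // Order the runs by descending count; ties are broken alphabetically.
    std::sort(counts.begin(), counts.end(),
              [](const std::pair<std::string, std::size_t>& a,
                 const std::pair<std::string, std::size_t>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return counts;
}
```

The first N elements of the returned vector are then the N most frequent words.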

Koss1024, 2014-10-29
@Koss1024

What exactly is the question? Or do you want the code written for you?
The algorithm is simple.
Read the words from the file stream and collect them in a map, like this:
// needs <fstream>, <map>, <string>, <algorithm>, <cctype>; assumes using namespace std
ifstream fs("filename.txt");
map<string, int> freq; // word -> number of occurrences
string word;
while (read_next_word(fs, word)) // read_next_word reads one word and skips spaces, tabs, punctuation, etc. (you write it yourself)
{
    transform(word.begin(), word.end(), word.begin(), ::tolower); // convert to lowercase
    freq[word]++; // increment the counter for this word
}
Now the map holds the frequency of every word. Copy it into a vector and sort by frequency:
vector<pair<string, int> > vocabulary(freq.begin(), freq.end());
sort(vocabulary.begin(), vocabulary.end(), less_second); // with C++11 a lambda would be simpler
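The words in the vocabulary container are now sorted by ascending frequency, so the N most frequent ones are the last N elements (walk backwards from vocabulary.rbegin()), and you can do whatever you need with them.
Here less_second is: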
bool less_second(const pair<string, int>& a, const pair<string, int>& b)
{
    return a.second < b.second;
}
That really is all the code (except for the character-skipping logic, which I think is simple).
