Algorithmic question from a future C# .NET junior. Where to start research?

D

deleted-zenshot2014-12-10 13:22:31

Programming

deleted-zenshot, 2014-12-10 13:22:31

Greetings friends.
I'm learning C# from Andrew Troelsen's book. At the moment, I have reached the level of a complete understanding of what the author writes about. So I feel strong enough to start writing my first C# program.
I came up with the following task-research:
Explore 100 English books grouped into 10 different topics.
Task
Determine the most frequently used words:

In all 100 books.
In each of the 10 topics.

I plan to write a program that is universal, convenient, and so on.
I have no special questions about the implementation of common points (design, data display, data storage, etc.). Intuitively, I understand how to act. Troelsen handled such moments well.
But in terms of the implementation of the algorithm itself, I have a huge stupor. To understand where to start, how to act and in what direction to move, I decided to do the following:

Simplify the task.
Ask experts questions to get at least an approximate plan of action.

A simplified task
Count the frequency of occurrence of each word (and its variations) in a large text document.
Initial data:
Text document (book). The number of words in a document can be up to several hundred thousand. For example, the book "Pro C# 5.0 and the .NET 4.5 Framework (Andrew Troelsen)" has approximately 433,000 words.
What you need to do:
Count the number of occurrences in the document:

Every word.
Each group of single-root words.

An example of a group of words with the same root:
follow
followed
follower
followers
followership Questions
for
experts:
Where to start? How to approach the solution of this problem? In what directions to dig?
Can someone describe at least an approximate algorithm?

Reply

Answer the question

In order to leave comments, you need to log in

5 answer(s)

L

lam0x86, 2014-12-12
@deleted-zenshot

The sequence of actions is as follows:
1) splitting the text into lexical units (in your case, meaningful units are words). It is convenient to get an IEnumerable as an output, representing a lazy iterator over the words in the text.
2) bringing the word to normal form, i.e. to lowercase and, optionally, to the general word form (for example, for nouns - the first case, singular, etc.)
3) adding the word to the Dictionary, where the key is the word itself, and the value is the counter:

int count;
dictionary.TryGetValue(word, out count);
dictionary[word] = count + 1;

V

Viktor Vsk, 2014-12-10
@viktorvsk

Why single words? Do you think follow and followership are the same words? Then you can take something ready from fuzzy matching, for example. Or, if you want to use algorithms, you can implement yourself finding the Levenshtein distance, or something similar simple.
But, as for me, it would be more logical to first build on the immediate words (in book X, the word follow is used 123,234 times)
And so on for each word. And only then, based on these data, come up with a new problem. For example, find the most commonly used root.

P

pro100saniok, 2014-12-10
@pro100saniok

Respect to the author. I myself thought to implement something similar, I think such a program will greatly help those people who do not have very good English. For example, you want to read some book, but the vocabulary is still small, you learn the most unfamiliar most used words (for ease of learning, for example, you can add them to the lingualeo.com dictionary), and read ahead =) Another question for everyone, is there a ready-made solution to this problem ? Thanks in advance!

A

Arthur Gurinovich, 2014-12-10
@ArthurGurinovich

cppstudio.com/post/1318

E

Espleth, 2014-12-10
@Espleth

Since you're looking specifically for words, it's not exactly a substring search within a string. You won't need the rest of the word if it doesn't already match. And some characters, such as spaces and punctuation marks, do not participate in your comparison.
But in general, you can google search for a substring in a string, there are many algorithms. For example, the Knuth-Morris-Pratt algorithm, or the Boyer-Moore algorithm.