How to find frequently occurring text sequences?
There is a large text file. About 120 gigabytes of Russian-language text.
I need to find the 30-40 most frequently occurring character sequences that are longer than 4-5 characters.
What can be used to solve this problem?
If there are ready-made programs, excellent.
If there is source code in C/C++, Rust, or Nim, good.
At worst, tell me the algorithm (I really don't feel like writing it myself because I am very busy, but I will if there is no other way).
Thank you!
Please note that std::string uses SBO (small buffer optimization), that is, it does not allocate additional heap memory for short strings. Also, the standard maps in C++ are quite inefficient, so plug in a third-party hash map library. The idea is this:
"What can be used to solve this problem?"
std::unordered_map<std::string, size_t>
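A minimal sketch of that idea, under several assumptions that are not in the answer itself: the text is UTF-8 (so one Russian letter is two bytes), a window is exactly `n` code points long, and the piece being counted fits in memory. The names `code_point_starts`, `count_ngrams` and `top_k` are made up for illustration, and `std::unordered_map` stands in for whichever faster hash map library is plugged in.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

using Counts = std::unordered_map<std::string, std::size_t>;

// Byte offsets where each UTF-8 code point starts, plus text.size() as a sentinel.
std::vector<std::size_t> code_point_starts(std::string_view text) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((static_cast<unsigned char>(text[i]) & 0xC0) != 0x80) starts.push_back(i);
    }
    starts.push_back(text.size());
    return starts;
}

// Count every window of exactly n code points (assumption: fixed window length).
Counts count_ngrams(std::string_view text, std::size_t n) {
    Counts counts;
    if (n == 0) return counts;
    const auto starts = code_point_starts(text);
    for (std::size_t i = 0; i + n < starts.size(); ++i) {
        // A 5-letter Russian window is 10 bytes, short enough to stay inside
        // the SBO buffer of common std::string implementations.
        ++counts[std::string(text.substr(starts[i], starts[i + n] - starts[i]))];
    }
    return counts;
}

// The k most frequent windows; ties are broken alphabetically for stable output.
std::vector<std::pair<std::string, std::size_t>> top_k(const Counts& counts, std::size_t k) {
    std::vector<std::pair<std::string, std::size_t>> items(counts.begin(), counts.end());
    k = std::min(k, items.size());
    std::partial_sort(items.begin(), items.begin() + static_cast<std::ptrdiff_t>(k), items.end(),
                      [](const auto& a, const auto& b) {
                          return a.second != b.second ? a.second > b.second : a.first < b.first;
                      });
    items.resize(k);
    return items;
}
```

Something like `top_k(count_ngrams(text, 5), 40)` would then give the 40 most frequent 5-character sequences of one in-memory piece of text. For the full 120 GB the map itself may outgrow RAM, so this only covers the counting step, not the I/O strategy.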
120 gigabytes is not quite Big Data yet, but it is already close to exceeding the available RAM. If the source material is split into small files, I would suggest solving this problem with map-reduce.
If that can be done, then even an implementation written in Python can run many times faster than a single-threaded one thanks to parallelism. I'm not saying you shouldn't do it in C++. I'm just emphasizing that the task lends itself to parallelization. Roughly speaking, it gravitates towards big-data and parallel-processing patterns, where the language is not particularly important but the ability to parallelize is.
As for the algorithm: +1 to Anton.
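A rough sketch of the map-reduce shape, in C++ to match the rest of the thread. It assumes each chunk stands in for the contents of one small file, that windows are `n` bytes long (not characters, to keep the sketch short), and that sequences spanning two files do not matter. `map_chunk`, `reduce_into` and `map_reduce` are hypothetical names.

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

using Counts = std::unordered_map<std::string, std::size_t>;

// Map step: count every window of n bytes inside one chunk.
Counts map_chunk(std::string_view chunk, std::size_t n) {
    Counts counts;
    if (n == 0) return counts;
    for (std::size_t i = 0; i + n <= chunk.size(); ++i) {
        ++counts[std::string(chunk.substr(i, n))];
    }
    return counts;
}

// Reduce step: add the partial counts of one chunk to the running total.
void reduce_into(Counts& total, const Counts& part) {
    for (const auto& [gram, count] : part) total[gram] += count;
}

// Run the map step for every chunk concurrently, then reduce sequentially.
Counts map_reduce(const std::vector<std::string>& chunks, std::size_t n) {
    std::vector<std::future<Counts>> jobs;
    jobs.reserve(chunks.size());
    for (const auto& chunk : chunks) {
        // One task per chunk; a real run would cap this at the number of cores.
        jobs.push_back(std::async(std::launch::async, map_chunk,
                                  std::string_view(chunk), n));
    }
    Counts total;
    for (auto& job : jobs) reduce_into(total, job.get());
    return total;
}
```

The map step touches only its own chunk, so it parallelizes without locks, and the reduce step is a plain sum of counters. On some toolchains the program has to be built with `-pthread` for `std::async` to start threads.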