How to find frequently occurring text sequences?

A

anikavoi2020-04-26 01:27:39

C++ / C#

anikavoi, 2020-04-26 01:27:39

There is a large text file. About 120 gigabytes of Russian-language text.
You need to find 30-40 most frequently occurring character sequences, longer than 4-5 characters.
What can be used to solve this problem?
If there are standard programs - excellent.
If there are sources for c\c++, rust, nim - good.
At worst, tell me the algorithm (I really don’t feel like writing, the employment is strong, but where to go in a pinch)

Thank you!

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

A

Anton Zhilin, 2020-04-26
@Anton3

Please note that it std::stringuses SBO, that is, it does not allocate additional. heap memory for short strings. Also, standard maps in C ++ are extremely inefficient, include the library. The idea is this:

Hashmap "strings -> counters" for strings of length 3
Hashmap "strings -> counters" for strings of length 4, but add there only strings whose beginning of length 3 is included in the map from (1) at least 2 times
Hashmap "strings -> counters" for strings of length 5, but add there only strings whose beginning of length 4 is included in the map from (2) at least 2 times

J

jcmvbkbc, 2020-04-26
@jcmvbkbc

What can be used to solve this problem?

With an array , one for each length sequence? std::hashmap<std::string, size_t>

M

mayton2019, 2020-04-26
@mayton2019

120 gigabytes is not yet a Big Data but is already close to going beyond the RAM. If the source material is divided into files (of a small size), then I would suggest solving this problem through map-reduce.
If we manage to do this, then the implementation written in Python can work many times faster due to parallelism. I'm not saying that you shouldn't do it in C++. I just emphasize that the task has the specifics of parallelization. Roughly speaking, the task gravitates towards big-data and parallel processing patterns for which the language is not particularly important, but this option is important.
By algorithm. Well I +1 to Anton.