Can you suggest an algorithm for finding the most frequently occurring substrings in the text?

K

knott2012-06-18 04:37:28

Algorithms

knott, 2012-06-18 04:37:28

There is a text. It is necessary to find the most occurring substrings in it.
For example:
Karl stole corals from Clara, and Clara stole the clarinet from Karl.
Here, for example, it should output:

to (Karl, Clara, Stole, Coral, Clara, Karla, Stole, Clarinet)
a (karl, stole, corals, clara, karla, stole, clarinet),
...
klar (Clara, Clara, clarinet),
Karl (Karl, Karla),
stole (stole, stole)
stole (stole, stole)

With algorithms on you, but I assume that the O(n ² ) solution is undesirable.
I really don't want to invent my own bike.
I would be very grateful for your help!

Reply

Answer the question

In order to leave comments, you need to log in

6 answer(s)

M

MikhailEdoshin, 2012-06-18
@knott

Suffix tree ( also in English ). It is built in linear time, the search time is proportional to the length of the search string , but it takes a lot of memory. (It's funny that in the Discrete Analysis (2003) by I. V. Romanovsky in the chapter "Suffix tree" an example is given with just the same phrase about Karl and Clara.)

A

Alexey Firsov, 2012-06-18
@lesha_firs

It is also interesting if anyone knows it is desirable with the help of regular expressions.

A

agmt, 2012-06-18
@agmt

I'm not a guru, but it seems to me that O (n2) is quite worthy, at least in the formulation that I understood. After all, the result should be a sorted array of strings ranging in size from n (the original string "aa ... a") to (1 + n)*n/2 (the original string "abcd.."). With the introduction of a limit on the size of a substring, it may be possible to improve the efficiency of the algorithm.
In general, as Pike said:

Rule 3: Sophisticated algorithms are slow if n is small, and n is usually small. Sophisticated algorithms have large constants. Until you make sure that n gets big often, avoid being fancy. (Even if n gets big, use Rule 2 first.)

F

Fahrenheit, 2012-06-18
@Fahrenheit

“Usually” when setting such a problem, the question is raised about sufficiently large values of n, as well as about finding not all, but m most frequently occurring sequences (in the result example, there is no substring “Karl stole corals from Clara, and Clara stole the clarne from Karl” ( original minus one character) that occurs once).
Accordingly, the algorithm with n^2 will behave badly.

M

Mrrl, 2012-06-28
@Mrl

I solved a similar problem by replacing the most frequently occurring pair of symbols with a new symbol (if it occurred more than three times) - however, words played the role of “symbols” for me. The longest "frequently occurring" substring in "Dead Souls" was "the lady is pleasant in all respects."

M

motl, 2012-06-28
@motl

Try trie data structure
en.wikipedia.org/wiki/Trie