Finding identical substrings in a string?

T

tac2013-01-14 12:59:54

Algorithms

tac, 2013-01-14 12:59:54

Maybe someone came across algorithms for finding identical substrings in a string. For example, there is a string "1234567845690450", we determine that we are interested in substrings of 3 or more characters, as a result we need to find the substring "456" because it occurs more than once in a string.
How to do it in the forehead slowly is clear - but you need to quickly.

Reply

Answer the question

In order to leave comments, you need to log in

5 answer(s)

A

al13n321, 2013-01-15
@tac

There is a copy-paste detector in PMD: pmd.sourceforge.net/pmd-5.0.1/cpd.html (didn't find a normal description on their website right away; I once saw an article about how it works: there is a suffix array).
If you really need an algorithm, then in O(N log N) (or, if you try, O(N)) in the worst case, you can use a suffix array , a suffix tree , or a suffix automaton (careful, articles are focused on sports programming, code style can be unusual).
Perhaps the easiest to work with is a suffix array: it's just all the suffixes of a string, ordered in lexicographic order (of course, the suffixes themselves are not stored as strings, but as start indices). For all pairs of adjacent suffixes, the LCP (Largest Common Prefix) can be quickly found. Let the minimum length (let's call it L) of the required substrings be given. If there are K consecutive (in lexicographical order) suffixes in the suffix array such that the LCP of any two neighbors is at least L, then the LCP of all of them is a substring of the original string that appears in it at least K times. Using this idea, in O(N log N) you can, for example, find all substrings of length L that occur at least K times (although it is easier to do this with hashes, as suggested by mihaildemidoff). Probably, it is possible to similarly iterate over suitable strings in descending order of length or number of occurrences. But probably it is more convenient to do it with a suffix tree.
Probably, right when constructing a suffix tree, for each vertex, you can find the number of occurrences of the corresponding substring. Thus, we will find in general for each substring the number of its occurrences. But to get O(N) rather than O(N^2) data, the substrings will be split into groups (corresponding to tree vertices) with the same number of occurrences, and everything will happen in O(N). The same can be achieved (easier to implement and harder to understand) with a suffix automaton (in a sense, it is dual to a suffix tree). Such a construction is probably sufficient to solve the problem in any reasonable concrete formulation.

M

mihaildemidoff, 2013-01-14
@mihaildemidoff

Alternatively, you can calculate the hashes of substrings of the desired length, then sort them. In one more pass, we will find the maximum number of repetitions of one hash, in one more pass we will find the original substring in the string (although you can store it right away).

K

KEKSOV, 2013-01-14
@KEKSOV

One simple option is to build a frequency dictionary . A more complex variant is a binary ordered tree . Examples of application in the article about LZW or in the article about image compression

V

VBart, 2013-01-14
@VBart

See implementations of the LZ77 algorithm in any gzip.

E

elw00d, 2014-05-20
@elw00d

7-zip in LZMA uses the hash chain method. Apparently, this is the fastest method, because. In early versions of the LZMA SDK, there were 3 implementations of the search for matches: hash chains, binary search tree, patricia tree. And now there is only one left. Tree-based algorithms have been dropped. How each of these methods works can be found on Wikipedia .