C
C
Chvalov2016-04-11 07:10:36
Java
Chvalov, 2016-04-11 07:10:36

How to find the largest sequence of characters that is repeated in the text more than once?

It is necessary to take the text with a change of type String and find the largest sequence of characters that is repeated in the text more than once.
Spaces should also be taken into account.
For example, there is text:

Toster site , blah blah blah Toster site

As output I should get:
Site toster

Another example here:
aboi booolsa b booolsa df

At the exit:
booolsa

Something I'm at a dead end, otherwise I would have started an array with words (the word separator is a space ""), I would have found words that are repeated more than once and would have chosen the word that is the longest.
But I don't know what to do here :(

Answer the question

In order to leave comments, you need to log in

2 answer(s)
A
Alexander Dorofeev, 2016-04-11
@Chvalov

Solution :
The LongestSubstringFinder class that performs this operation:

import java.util.Arrays;

public final class LongestSubstringFinder {

    private LongestSubstringFinder() {}

    public static String findLongestSubstringAtString(String string) {

        string = string.replaceAll("\\s+", " ");

        int quantityOfSuffixes = string.length();
        String[] suffixes = new String[quantityOfSuffixes];

        for (int i = 0; i < quantityOfSuffixes; i++) {
            suffixes[i] = string.substring(i, quantityOfSuffixes);
        }

        Arrays.sort(suffixes);

        String longestRepeatedSubstring = "";
        for (int i = 0; i < quantityOfSuffixes - 1; i++) {
            String x = findLongestCommonPrefixOfTwoStrings(suffixes[i], suffixes[i + 1]);

            if (x.length() > longestRepeatedSubstring.length()) {
                longestRepeatedSubstring = x;
            }
        }
        return longestRepeatedSubstring;
    }

    private static String findLongestCommonPrefixOfTwoStrings(String first, String second) {
        int quantity = Math.min(first.length(), second.length());
        for (int i = 0; i < quantity; i++) {
            if (first.charAt(i) != second.charAt(i))
                return first.substring(0, i);
        }
        return first.substring(0, quantity);
    }
}

Tests ( passed ):
@Test
    public void testFindLongestSubstringAtStringFirst() throws Exception {
        String testString = "Сайт toster, бла бла бла Сайт toster";
        String neededResult = "Сайт toster";
        String longestSubstring = LongestSubstringFinder
                .findLongestSubstringAtString(testString);
        assertEquals(longestSubstring, neededResult);
    }

    @Test
    public void testFindLongestSubstringAtStringSecond() throws Exception {
        String testString = "aboibooolsabbooolsadf";
        String neededResult = "booolsa";
        String longestSubstring = LongestSubstringFinder
                .findLongestSubstringAtString(testString);
        assertEquals(longestSubstring, neededResult);
    }

About the algorithm: here and here .

A
aslan7470, 2016-04-17
@aslan7470

Idea:
We take N sequences of length 1, starting at elements 0..N-1, write out the position of their 1st character (0..N-1) to the array, sort by the character in this position. For groups with the same character, we call recursively, considering the 2nd character from the position recorded in the array, etc.
Pseudocode:

find(N,A[])
  p[N]={0..N-1} // позиции начала подстрок
  l=0,ml=0,s // длина, макс длина подстроки, позиция подстроки макс длины
  finda(0,N)

// искать среди позиций p[i..j-1]  
def finda(i,j)
  if l>ml && j-i>1
    ml=l
    s=p[i]
  ++l
  // оставим в p[i..j-1] позиции <= N-l
  m=j,k=j=i
  while k<m
    if p[k]<=N-l
      p[j++]=p[k]
    ++k
  sort p[i..j-1]):A[p[i]+l]
  k=m=i
  while ++k<j
    if(A[p[k]+l]!=A[p[k-1]+l]) // конец группы подстрок с равным символом на позиции l от начала
      finda(m,k)
      m=k
  finda(m,j) // последняя группа подстрок с равным символом на позиции l от начала
  --l

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question