Monday, June 27, 2011

Longest Repeated SubString e.g. Maximum Length SubString

For instance, the longest repeated string in ``Ask not what your
country can do for you, but what you can do for your country'' is `` can do for you'', with `` your country'' a close
second place.

Suppose, though, that you had a chance to preprocess the body of text before performing searches. You could make a hash table (or search tree) to index every distinct word of the document, and store a list of every occurrence of
each word. Such an ``inverted index'' allows a program to look up a given word quickly. One can look up phrases by intersecting the lists of the words they contain, but this is subtle to implement and potentially slow. (Some web
search engines do, however, take exactly this approach.) We'll turn now to a powerful data structure and apply it to a small problem: given an input file of text, find the longest duplicated substring of characters in it. For instance, the longest repeated string in ``Ask not what your country can do for you, but what you can do for your country'' is `` can do for you'', with `` your country'' a close
second place. How would you write a program to solve this problem?

Data Structure :Suffix Array

Algorithm

This problem is reminiscent of the anagram problem that we saw in Section 2.4. If the input string is stored in c[0..n-1], then we could start by comparing every pair of substrings using pseudocode like this

maxlen = -1
for i = [0, n)
for j = (i, n)
if (thislen = comlen(&c[i], &c[j])) > maxlen
maxlen = thislen
maxi = i
maxj = j

The comlen function returns the length that its two parameter strings have in common, starting with their first
characters:

int comlen(char *p, char *q)
i = 0
while *p && (*p++ == *q++)
i++
return i

Because this algorithm looks at all pairs of substrings, it takes time proportional to n2, at least. We might be able to
speed it up by using a hash table to search for words in the phrases, but we'll instead take an entirely new approach.

Our program will process at most MAXN characters, which it stores in the array c:

#define MAXN 5000000
char c[MAXN], *a[MAXN];

We'll use a simple data structure known as a ``suffix array''; the structure has been used at least since the 1970's,
though the term was introduced in the 1990's. The structure is an array a of pointers to characters. As we read the
input, we initialize a so that each element points to the corresponding character in the input string:

while (ch = getchar()) != EOF
a[n] = &c[n]
c[n++] = ch
c[n] = 0

The final element of c contains a null character, which terminates all strings.
The element a[0] points to the entire string; the next element points to the suffix of the array beginning with the
second character, and so on. On the input string ``banana'', the array will represent these suffixes:

a[0]: banana
a[1]: anana
a[2]: nana
a[3]: ana
a[4]: na
a[5]: a

The pointers in the array a together point to every suffix in the string, hence the name ``suffix array''.If a long string occurs twice in the array c, it appears in two different suffixes. We will therefore sort the array to bring together equal suffixes (just as sorting brought together anagrams in Section 2.4). The ``banana'' array sorts to

a[0]: a
a[1]: ana
a[2]: anana
a[3]: banana
a[4]: na
a[5]: nana

We can then scan through this array comparing adjacent elements to find the longest repeated string, which in this case is ``ana''. We'll sort the suffix array with the qsort function: qsort(a, n, sizeof(char *), pstrcmp) The pstrcmp comparison function adds one level of indirection to the library strcmp function. This scan through the array uses the comlen function to count the number of letters that two adjacent words have in common:

for i = [0, n)
if comlen(a[i], a[i+1]) > maxlen
maxlen = comlen(a[i], a[i+1])
maxi = i
printf("%.*s\n", maxlen, a[maxi])
The printf statement uses the ``*'' precision to print maxlen characters of the string.

I ran the resulting program to find the longest repeated string in the 807,503 characters in Samuel Butler's
translation of Homer's Iliad. The program took 4.8 seconds to locate this string:
whose sake so many of the Achaeans have died at Troy, far from their homes? Go about at once among the
host, and speak fairly to them, man by man, that they draw not their ships into the sea.
The text first occurs when Juno suggests it to Minerva as an argument that might keep the Greeks (Achaeans) from
departing from Troy; it occurs shortly thereafter when Minerva repeats the argument verbatim to

Working Code:

#include
#include
#include
int pstrcmp(char **p, char **q)
{ return strcmp(*p, *q); }
int comlen(char *p, char *q)
{ int i = 0;
while (*p && (*p++ == *q++))
i++;
return i;
}
#define M 1
#define MAXN 5000000
char c[MAXN]="banana", *a[MAXN];
int main()
{ int i, ch, n = 0, maxi, maxlen = -1;
while ((ch = getchar()) != EOF) {
a[n] = &c[n];
c[n++] = ch;
}
c[n] = 0;
qsort(a, n, sizeof(char *), pstrcmp);
for (i = 0; i < n-M; i++) if (comlen(a[i], a[i+M]) > maxlen) {
maxlen = comlen(a[i], a[i+M]);
maxi = i;
}
printf("%.*s\n", maxlen, a[maxi]);
return 0;
}

Time Complexity O(NLogN)
Space Comple4xity O(1)
Auxillary Space O(N)
Run Here https://ideone.com/IQK8g

Source Jhon Bentely Programming Pearls
http://www.cs.pitt.edu/~kirk/cs1501/Pruhs/Fall2006/Assignments/repeatedstrings/index.html


Follow Up:
How would you modify the program for finding duplicated strings to find the longest string that occurs more than M times?

#include
#include
#include
int pstrcmp(char **p, char **q)
{ return strcmp(*p, *q); }
int comlen(char *p, char *q)
{ int i = 0;
while (*p && (*p++ == *q++))
i++;
return i;
}
#define M 1
#define MAXN 5000000
char c[MAXN], *a[MAXN];
int main()
{ int i, ch, n = 0, maxi, maxlen = -1;
while ((ch = getchar()) != EOF) {
a[n] = &c[n];
c[n++] = ch;
}
c[n] = 0;
qsort(a, n, sizeof(char *), pstrcmp);
for (i = 0; i < n-M; i++) if (comlen(a[i], a[i+M]) > maxlen) {
maxlen = comlen(a[i], a[i+M]);
maxi = i;
}
printf("%.*s\n", maxlen, a[maxi]);
return 0;
}


Here is good lecture on suffix tree http://www.youtube.com/watch?v=jXAHLqQthKw
Here you can try some of applications of suffix tree http://algs4.cs.princeton.edu/63suffix/

5 comments :

Ravindra Patel said...

Hi Shashank,
I didn't understand how you arrived at time complexity O(nlogn).
comlen is O(n) and its called in a for loop (1-n), so wont the resultant complexity will be O(n^2).

You can mail me your reply at ravindra.itbhu@gmail.com

Unknown said...

@ravindra i think you are right i missed it . Thanks for Analysis i have updated the same :)


Happy Coding
Shashank

Anonymous said...

The way i solved this was to create a multimap of all characters
eg
for banana
mymap[a]->1,3,5
mymap[b]->0
mymap[n]->2,4

in pass 2
i walk over the string and see that 1,2,3 and 3,4,5 (length 3 max)are same characters hence ana is common substring longest however this is O(n2) algo.

Anonymous said...

Thanks for the post. In the current code, the complexity of constructing the suffix array is O(N^2) but there is a way to construct the suffix array in O(N log N) time.

Anonymous said...

Thanks for the post. The current construction of the suffix array takes O(N^2) time but there is a way to construct it in O(NlogN) so the search can be completed in O(NlogN).