Problem Description

Given a string P,T. We want to check whether P is a substring of T.

Algorithm

RabinKarp algorithm performs well in practice and also generalized to other algorithm for related problems, such as two-dimensional pattern matching. The worst case running time is O((n-m+1)*m).

Let's assume that the character used in string T is a set S { 0,1,2,...,9 }. We can view a string of k consecutive character as decimal numbers. For example, string "31415" corresponds to the number 31415.

Given a pattern P[1..m], let p denote its corresponding decimal value. Given a text T[1..n], we let t(s) denote the decimal value of length-m substring T[s+1..s+m] for s = 0,1,...,n-m. Obviously, t(s) = p if and only if T[s+1..s+m] = P[1..m].

We can p compute in O(m) time using Horner's Rule

p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10 (P[2] + 10 P[1])...)).

pseudo code for above assumption is

result = 0;

for i = 1 to m

result = result * 10;

result = result + P[i]

We can compute all t(s), for s = 0,1,...,n-m values in a total of O(n) time. The value of t(0) can be similarly computed from T[1..m] in O(m) time. To compute the remaining values t(1), t(2), ..., t (n-m), observe that t(s+1) can be computed can be computed from t(s) in constant time.

t(s+1) = 10*(t(s) - 10^(m-1) * T[s+1]) + T[s + m + 1].

For example,

T = "123456" and m = 3

t(0) = 123

t(1) = 10*(123 - 100*1) + 4 = 234

Step by Step explanation

First : remove the first digit : 123 - 100*1 = 23

Second: Multiply by 10 to shift it : 23 * 10 = 230

Third : Add last digit : 230 + 4 = 234

The algorithm runs by comparing, t(s) with p. When t(s) = p, then we have found the substring P in T, starting from position s.

Generalization

The only problem with above explanation is t(s) and p may be too large, so that no built-in data type can fit them. The solution is we required all t(s) and p be performed in modulo q.

In general, if we have d-ary alphabet { 0,1,...,d-1 }, we could have 26 alphabet { a, b, ..., z }, we choose q so that dq fits within a computer word. We can work out p modulo q by using this pseudo code

result = 0;

for i = 1 to m

result = (d*result + P[i] ) mod q

and computation of t(s+1) become

t(s+1) = d*(t(s) - T[s+1]*h) + T[s + m + 1]) mod q, where h = d^(m-1) (mod q)

The weakness of this method is that, ts = p (mod q) doesn't imply that ts = p. On the other hand, if ts != p (mod q), we definitely know that ts != p. So we can use ts != p (mod q) as a fast heuristic test to rule out invalid shifts s. But if we have ts = p we have have to check whether T[s+1...s+m] = P[1..m]

Implementation in C/C++

/* Rabin-Karp String Matching Algorithm

RabinKarpMatch(T,P,d,q)

- Assume T and P consist only a..z and A..Z

- radix d, prime q

- match whether P is a substring of T

- return the starting index of the first occurence of P in T

- n = length(T)

- m = length(P)

Worst Case : O((n-m+1)*m)

Best Case : O(n+m)

*/

#define tonum(c) (c >= 'A' && c <= 'Z' ? c - 'A' : c - 'a' + 26)

/* return a^p mod m */

int mod(int a,int p,int m)

{

if (p == 0) return 1;

int sqr = mod(a,p/2,m) % m;

sqr = (sqr * sqr) % m;

if (p & 1) return ((a % m) * sqr) % m;

else return sqr;

}

int RabinKarpMatch(char *T,char *P,int d,int q)

{

int i,j,p,t,n,m,h,found;

n = strlen(T);

m = strlen(P);

h = mod(d,m-1,q);

p = t = 0;

for (i=0; i

p = (d*p + tonum(P[i])) % q;

t = (d*t + tonum(T[i])) % q;

}

for (i=0; i<=n-m; i++)

{

if (p == t)

{

found = 1;

for (j=0; j

{

found = 0;

break;

}

if (found) return i;

} else {

t = (d*(t - ((tonum(T[i])*h) % q)) + tonum(T[i+m])) % q;

}

}

return -1;

}

## No comments :

Post a Comment