Wednesday, March 9, 2011

WAP for Rabin Karp String Matching Algorithm

Rabin-Karp Algorithm

Problem Description
Given a string P,T. We want to check whether P is a substring of T.


Algorithm
RabinKarp algorithm performs well in practice and also generalized to other algorithm for related problems, such as two-dimensional pattern matching. The worst case running time is O((n-m+1)*m).

Let's assume that the character used in string T is a set S { 0,1,2,...,9 }. We can view a string of k consecutive character as decimal numbers. For example, string "31415" corresponds to the number 31415.

Given a pattern P[1..m], let p denote its corresponding decimal value. Given a text T[1..n], we let t(s) denote the decimal value of length-m substring T[s+1..s+m] for s = 0,1,...,n-m. Obviously, t(s) = p if and only if T[s+1..s+m] = P[1..m].

We can p compute in O(m) time using Horner's Rule

p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10 (P[2] + 10 P[1])...)).

pseudo code for above assumption is

result = 0;
for i = 1 to m
result = result * 10;
result = result + P[i]


We can compute all t(s), for s = 0,1,...,n-m values in a total of O(n) time. The value of t(0) can be similarly computed from T[1..m] in O(m) time. To compute the remaining values t(1), t(2), ..., t (n-m), observe that t(s+1) can be computed can be computed from t(s) in constant time.

t(s+1) = 10*(t(s) - 10^(m-1) * T[s+1]) + T[s + m + 1].

For example,
T = "123456" and m = 3
t(0) = 123
t(1) = 10*(123 - 100*1) + 4 = 234

Step by Step explanation
First : remove the first digit : 123 - 100*1 = 23
Second: Multiply by 10 to shift it : 23 * 10 = 230
Third : Add last digit : 230 + 4 = 234

The algorithm runs by comparing, t(s) with p. When t(s) = p, then we have found the substring P in T, starting from position s.

Generalization
The only problem with above explanation is t(s) and p may be too large, so that no built-in data type can fit them. The solution is we required all t(s) and p be performed in modulo q.

In general, if we have d-ary alphabet { 0,1,...,d-1 }, we could have 26 alphabet { a, b, ..., z }, we choose q so that dq fits within a computer word. We can work out p modulo q by using this pseudo code

result = 0;
for i = 1 to m
result = (d*result + P[i] ) mod q


and computation of t(s+1) become

t(s+1) = d*(t(s) - T[s+1]*h) + T[s + m + 1]) mod q, where h = d^(m-1) (mod q)

The weakness of this method is that, ts = p (mod q) doesn't imply that ts = p. On the other hand, if ts != p (mod q), we definitely know that ts != p. So we can use ts != p (mod q) as a fast heuristic test to rule out invalid shifts s. But if we have ts = p we have have to check whether T[s+1...s+m] = P[1..m]


Implementation in C/C++

/* Rabin-Karp String Matching Algorithm
RabinKarpMatch(T,P,d,q)
- Assume T and P consist only a..z and A..Z
- radix d, prime q
- match whether P is a substring of T
- return the starting index of the first occurence of P in T
- n = length(T)
- m = length(P)

Worst Case : O((n-m+1)*m)
Best Case : O(n+m)
*/

#define tonum(c) (c >= 'A' && c <= 'Z' ? c - 'A' : c - 'a' + 26)

/* return a^p mod m */
int mod(int a,int p,int m)
{
if (p == 0) return 1;
int sqr = mod(a,p/2,m) % m;
sqr = (sqr * sqr) % m;
if (p & 1) return ((a % m) * sqr) % m;
else return sqr;
}

int RabinKarpMatch(char *T,char *P,int d,int q)
{
int i,j,p,t,n,m,h,found;
n = strlen(T);
m = strlen(P);
h = mod(d,m-1,q);
p = t = 0;

for (i=0; i {
p = (d*p + tonum(P[i])) % q;
t = (d*t + tonum(T[i])) % q;
}

for (i=0; i<=n-m; i++)
{
if (p == t)
{
found = 1;
for (j=0; j if (P[j] != T[i+j])
{
found = 0;
break;
}
if (found) return i;
} else {
t = (d*(t - ((tonum(T[i])*h) % q)) + tonum(T[i+m])) % q;
}
}
return -1;
}

No comments :