SANDEEP SINGH
(III B.TECH I.T)
• Brief Introduction
• Basic Notations
• Naive Algorithm
• Knuth – Morris – Pratt Algorithm
• Boyer – Moore Algorithm

Brief Introduction

• Given a text string ‘T’, the string matching problem asks whether a pattern ‘P’ occurs in ‘T’.

• If ‘P’ does occur, the position in ‘T’ where ‘P’ occurs is returned.

Basic Notations
• P          The pattern being searched for
• T          The text in which P is sought
• m          The length of P
• n          The length of T
• p_i, t_j   The i-th character of P and the j-th character of T
• j          Current position within T
• k          Current position within P
• endText(T, j): a boolean function which tells us when we are beyond the last character of the text.

• It returns true if j > index of the last character of T, and returns false otherwise.

Naive Algorithm

• Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty.

• Output: the return value is the index in T where a copy of P begins, or -1 if no match for P is found.

int naivescan(char[] P, char[] T, int m)
{
    int match;   // Value to return
    int i, j, k;
    // i is the current guess at which P begins in T;
    // j is the index of the current character in T;
    // k is the index of the current character in P.
    // All indexes are 0-based.
    match = -1;
    j = 0; k = 0;
    i = j;
    while (true)
    {
        if (k >= m)
        {
            match = i;   // Match found
            break;
        }
        if (endText(T, j))
            break;       // Text exhausted, no match
        if (T[j] == P[k])
        {
            j++; k++;
        }
        else
        {
            // Back up over matched characters.
            int backup = k;
            j = j - backup; k = k - backup;
            // Slide pattern forward, start over.
            j++;
            i = j;
        }
        // Continue loop.
    }
    return match;
}
Example: T = a b b a b a b a a , P = a b a a (positions in T are numbered from 0)

Shift 0:  a, b match; mismatch at T[2]         Comparisons = 3
Shift 1:  mismatch at T[1]                     Comparisons = 4
Shift 2:  mismatch at T[2]                     Comparisons = 5
Shift 3:  a, b, a match; mismatch at T[6]      Comparisons = 9
Shift 4:  mismatch at T[4]                     Comparisons = 10
Shift 5:  a, b, a, a all match                 Comparisons = 14

Found ! P begins at position 5 of T.
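
The trace above can be reproduced with a short Python sketch. It is an illustrative re-implementation, not the slide's own code: the name `naive_scan` and the returned comparison count are assumptions added for this example, and indexes are 0-based.

```python
def naive_scan(pattern, text):
    """Return (index, comparisons): index of the first match, or -1."""
    m, n = len(pattern), len(text)
    comparisons = 0
    for i in range(n - m + 1):          # i is the current guess (shift)
        k = 0
        while k < m:
            comparisons += 1            # one text/pattern character comparison
            if text[i + k] != pattern[k]:
                break                   # mismatch: slide the pattern by one
            k += 1
        if k == m:                      # all m characters matched
            return i, comparisons
    return -1, comparisons


print(naive_scan("abaa", "abbababaa"))  # (5, 14)
```

The result agrees with the trace: the match is found at position 5 after 14 comparisons.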

• Preprocessing time = 0.

• The naive algorithm takes time O((n - m + 1)m).

• The naive algorithm is inefficient because information gained about the text for one shift is entirely ignored in considering other shifts.
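
As a hedged illustration of the O((n - m + 1)m) bound, the sketch below counts comparisons on an input chosen to be bad for the naive scan: a text of n copies of 'a' and a pattern of m - 1 copies of 'a' followed by 'b'. The helper name `count_naive_comparisons` is hypothetical and the input sizes are arbitrary.

```python
def count_naive_comparisons(pattern, text):
    """Count character comparisons made by a naive left-to-right scan
    that stops at the first match (or at the end of the text)."""
    m, n = len(pattern), len(text)
    comparisons = 0
    for i in range(n - m + 1):              # try every shift i
        for k in range(m):
            comparisons += 1
            if text[i + k] != pattern[k]:
                break                       # mismatch: give up on this shift
        else:
            return comparisons              # all m characters matched
    return comparisons


n, m = 10, 4
worst = count_naive_comparisons("a" * (m - 1) + "b", "a" * n)
print(worst, (n - m + 1) * m)               # 28 28
```

Every one of the n - m + 1 shifts matches m - 1 characters before failing on the last one, so the count reaches the bound exactly.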
Knuth – Morris – Pratt Algorithm

• Input: P and T, the pattern and text strings; m, the length of P; fail, the array of failure links.

• Output: the return value is the index in T where a copy of P begins, or -1 if no match for P is found.

int kmpscan(char[] P, char[] T, int m, int[] fail)
{
    int match;
    int j, k;
    // j indexes text characters;
    // k indexes the pattern and the fail array.
    // Indexes are 0-based, with fail[0] = -1 meaning
    // "no link left: move on to the next text character".
    match = -1;
    j = 0; k = 0;
    while (true)
    {
        if (k >= m)
        {
            match = j - m;   // Match found
            break;
        }
        if (endText(T, j))
            break;           // Text exhausted, no match
        if (k == -1)
        {
            j++;
            k = 0;           // Start pattern over
        }
        else if (T[j] == P[k])
        {
            j++;
            k++;
        }
        else
            // Follow fail arrow.
            k = fail[k];
        // Continue loop.
    }
    return match;
}

• After a shift of the pattern, the naive algorithm has forgotten all information about previously matched symbols.

• So it is possible that it re-compares a text symbol with different pattern symbols again and again.

• The KMP algorithm makes use of the information gained by previous symbol comparisons.

• It never re-compares a text symbol that has matched a pattern symbol, i.e. backtracking on the text ‘T’ never occurs.
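
A minimal Python sketch of a scan that never moves backward in the text. It assumes 0-based indexes and a failure array with fail[0] = -1; the array for "ABCDABD" is written out by hand here, and the names `kmp_scan` and `visited` are illustrative only.

```python
def kmp_scan(pattern, text, fail):
    """Return (match_index, visited), where visited lists the text
    indexes in the order they were compared."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, []                # the slides assume a nonempty pattern
    j = k = 0
    visited = []
    while j < n:
        if k == -1:                 # no failure link left: next text character
            j += 1
            k = 0
        elif text[j] == pattern[k]:
            visited.append(j)
            j += 1
            k += 1
            if k == m:
                return j - m, visited
        else:
            visited.append(j)
            k = fail[k]             # follow the failure link; j is unchanged
    return -1, visited


fail = [-1, 0, 0, 0, 0, 1, 2]       # hand-built links for "ABCDABD" (assumed convention)
index, visited = kmp_scan("ABCDABD", "ABC ABCDAB ABCDABCDABDE", fail)
print(index)                        # 15
print(visited == sorted(visited))   # True: the text index never decreases
```

A text character may be compared more than once after a mismatch, but the index j itself only ever stays put or moves forward.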
Example: T = "ABC ABCDAB ABCDABCDABDE", P = "ABCDABD". M is the position in T where the current alignment of P begins; i is the current position within P.

(M=0, i=0)

T: ABC ABCDAB ABCDABCDABDE
P: ABCDABD

T[0..2] = "ABC" matches P[0..2]; T[3] (a space) and P[3] = 'D' differ. FAIL !
• In the fourth step, T[3] is a space and P[3] = 'D', a mismatch.

• Rather than beginning to search again at T[1], we note that no 'A' occurs between positions 0 and 3 in T except at 0.

• Hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again.

• Therefore we move on to the next character, setting M = 4 and i = 0.
(M=4, i=0)

T: ABC ABCDAB ABCDABCDABDE
P:     ABCDABD

T[4..9] = "ABCDAB" matches P[0..5]; T[10] (a space) and P[6] = 'D' differ. FAIL !
• We quickly obtain a nearly complete match "ABCDAB" when, at P[6] (T[10]), we again have a discrepancy.

• However, just prior to the end of the current partial match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration.
(M=8, i=2)

T: ABC ABCDAB ABCDABCDABDE
P:         ABCDABD

T[8..9] = "AB" is already known to match P[0..1], so the scan resumes at i = 2; T[10] (a space) and P[2] = 'C' differ. FAIL !
(M=11, i=0)

T: ABC ABCDAB ABCDABCDABDE
P:            ABCDABD

T[11..16] = "ABCDAB" matches P[0..5]; T[17] = 'C' and P[6] = 'D' differ. FAIL !
(M=15, i=2)

T: ABC ABCDAB ABCDABCDABDE
P:                ABCDABD

T[15..16] = "AB" is already known to match P[0..1]; T[17..21] = "CDABD" matches P[2..6]. MATCH ! P begins at position 15 of T.

• Preprocessing of the pattern is necessary in order to analyze its structure.

• The preprocessing phase has a complexity of O(m).

• The scan of the text takes O(n) time; since m ≤ n, the overall complexity of the KMP algorithm is O(n).
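
The O(m) preprocessing can be sketched as follows. This is one common way to build failure links, shown under the assumptions of 0-based indexes and fail[0] = -1; the slides do not give their own construction, so the name `build_fail` and the exact convention are illustrative.

```python
def build_fail(pattern):
    """fail[k] is the pattern index to resume from after a mismatch at
    pattern[k]: the length of the longest proper prefix of pattern[:k]
    that is also a suffix of it. fail[0] = -1 means 'advance the text'."""
    m = len(pattern)
    fail = [-1] * m
    for k in range(1, m):
        s = fail[k - 1]
        # Shorten the candidate prefix until it can be extended by pattern[k-1].
        while s >= 0 and pattern[s] != pattern[k - 1]:
            s = fail[s]
        fail[k] = s + 1
    return fail


print(build_fail("ABCDABD"))   # [-1, 0, 0, 0, 0, 1, 2]
```

The last entry, fail[6] = 2, is the link taken after the mismatch at P[6] in the walk-through: the scan resumes at i = 2 because "AB" has already been matched.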
 – 
 – 
• Input: P and T, the pattern and text strings; m, the length of P; charjump and matchjump. The pattern is assumed to be nonempty.

• matchjump[k] is the amount to increment j, the text position index, to begin the next right-to-left scan of the pattern after a mismatch has occurred at p_k.
• For 1 ≤ k ≤ m (pattern positions numbered from 1),

      matchjump[k] = slide[k] + m – k

• slide[k] is how far we can slide the pattern forward after a mismatch on p_k.

• m – k is how many characters were matched before the mismatch.
• The number of positions we can “jump” forward when there is a mismatch depends on the text character being read, say t_j.

• These numbers are stored in an array charjump indexed by the character set Σ.
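
A small Python sketch of how such a table could be filled in. It assumes 0-based pattern positions, uses a dictionary in place of an array indexed by Σ, and treats any character missing from the pattern as having a jump of m; the names `compute_charjump` and `jump_for` are hypothetical.

```python
def compute_charjump(pattern):
    """For each character of the pattern, store the distance from its
    rightmost occurrence to the last pattern position."""
    m = len(pattern)
    charjump = {}
    for k, ch in enumerate(pattern):
        charjump[ch] = m - 1 - k    # later occurrences overwrite earlier ones
    return charjump


def jump_for(charjump, ch, m):
    """Characters that do not occur in the pattern allow a full jump of m."""
    return charjump.get(ch, m)


table = compute_charjump("must")
print(table)                        # {'m': 3, 'u': 2, 's': 1, 't': 0}
print(jump_for(table, "y", 4))      # 4
```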
 – 

• Output: the return value is the index in T where a copy of P begins, or -1 if no match for P is found.
int bmscan(char[] P, char[] T, int m, int[] charjump, int[] matchjump)
{
    int match;
    int j, k;
    // j indexes text characters;
    // k indexes the pattern and the matchjump array.
    // Indexes are 0-based, so here
    // matchjump[k] = slide[k] + (m - 1 - k).
    match = -1;
    j = m - 1; k = m - 1;
    while (endText(T, j) == false)
    {
        if (k < 0)
        {
            match = j + 1;   // Match found
            break;
        }
        if (T[j] == P[k])
        {
            j--; k--;
        }
        else
        {
            // Slide P forward.
            j += Math.max(charjump[T[j]], matchjump[k]);
            k = m - 1;
        }
        // Continue loop.
    }
    return match;
}

Example: T = "if you wish to understand others you must", P = "must" (m = 4). j is the 0-based text position where each right-to-left scan starts. The walk-through uses charjump only: 3 for 'm', 2 for 'u', 1 for 's', 0 for 't', and 4 for any character not in P.

j = 3    T[3]  = 'y', not in P                    FAIL !   jump 4
j = 7    T[7]  = 'w', not in P                    FAIL !   jump 4
j = 11   T[11] = ' ', not in P                    FAIL !   jump 4
j = 15   T[15] = 'u'                              FAIL !   jump 2 ( Line up u’s )
j = 17   T[17] = 'd', not in P                    FAIL !   jump 4
j = 21   't', 's' match; T[19] = 'r' vs 'u'       FAIL !   jump 4 from T[19] ( Just pass the ‘r’ )
j = 23   T[23] = 'n', not in P                    FAIL !   jump 4
j = 27   't' matches; T[26] = 'o' vs 's'          FAIL !   jump 4 from T[26] ( Just pass the ‘o’ )
j = 30   T[30] = 'r', not in P                    FAIL !   jump 4
j = 34   T[34] = 'o', not in P                    FAIL !   jump 4
j = 38   T[38] = 'u'                              FAIL !   jump 2 ( Line up u’s )
j = 40   't', 's', 'u', 'm' all match             MATCH !  P begins at position 37 of T

Total: 18 character comparisons for a text of 41 characters.
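
The walk-through above can be checked with a simplified Python sketch. It is not the full algorithm: it uses only the charjump rule and, instead of matchjump, falls back on the smallest jump that still moves the pattern forward (m - k with 0-based k). The name `bm_scan_simple`, the dictionary-based table, and the comparison counter are assumptions made for this example.

```python
def bm_scan_simple(pattern, text):
    """Right-to-left scan using charjump only.
    Returns (match_index, comparisons); match_index is -1 if P is absent."""
    m, n = len(pattern), len(text)
    # Distance from the rightmost occurrence of each character to the pattern end.
    charjump = {ch: m - 1 - k for k, ch in enumerate(pattern)}
    j = k = m - 1
    comparisons = 0
    while j < n:
        if k < 0:
            return j + 1, comparisons       # scanned past the pattern's first character
        comparisons += 1
        if text[j] == pattern[k]:
            j -= 1
            k -= 1
        else:
            # m - k guarantees the pattern moves at least one place right.
            j += max(charjump.get(text[j], m), m - k)
            k = m - 1
    return -1, comparisons


print(bm_scan_simple("must", "if you wish to understand others you must"))  # (37, 18)
```

With 18 comparisons for a 41-character text, fewer than half of the text characters are ever examined.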

(T)  . . . . . . . d a t s          FAIL ! ( ‘d’ in T against ‘c’ in P )
(P)  b a t s a n d c a t s

• Letters in T to the right of the current position are ‘ats’, the same letters that form the suffix of P that was just scanned.

• If we know that P does not have another instance of ‘ats’, then we can slide P all the way past the ‘ats’ in T.

• If P does have an earlier instance of ‘ats’, we could slide P so that the earlier ‘ats’ lines up with the matched letters in T.
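
The sliding rule in these bullets can be sketched by brute force in Python. This is an illustration under stated assumptions, not an efficient matchjump preprocessing: it uses 0-based positions, checks every candidate shift directly, ignores the mismatched character itself, and the name `slide` is borrowed from the notation above while the function body is hypothetical.

```python
def slide(pattern, k):
    """Smallest shift s >= 1 such that, after moving the pattern s places
    to the right, every pattern character that lands under the matched
    suffix pattern[k+1:] agrees with it. Positions that fall off the
    left end of the pattern impose no constraint."""
    m = len(pattern)
    for s in range(1, m + 1):
        if all(i - s < 0 or pattern[i - s] == pattern[i]
               for i in range(k + 1, m)):
            return s
    return m


P = "batsandcats"
k = 7                                      # mismatch on 'c'; 'ats' already matched
print(slide(P, k))                         # 7: the earlier 'ats' lines up
print(slide(P, k) + (len(P) - 1 - k))      # 10: the corresponding jump of j
```

For "must" with a mismatch on 'u' (k = 1), no earlier "st" exists, so the pattern slides its full length of 4.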

(T)  d a t s . . . . .              Fail !
(P)  b a t s a n d c a t s          ( Previous occurrence of ‘ats’ in P )

Shift one place to the right of ‘d’, as usual

(T)  d a t s . . . . .
(P)  b a t s a n d c a t s
• The Boyer-Moore searching algorithm performs O(n) comparisons in the worst case.

• If the alphabet is large compared to the length of the pattern, the algorithm performs O(n/m) comparisons on the average.
• The Boyer-Moore algorithm is extremely fast on a large alphabet (relative to the length of the pattern).

• If the pattern is quite small (m ≤ 3), then the overhead of preprocessing the pattern is not worthwhile. BM does more comparisons than the naive approach.
• For the very shortest patterns, the naive algorithm may be better.

• Use BM for strings of average length (m ≥ 5).
• For binary strings, BM does not do quite as well.

• For binary strings, the Knuth-Morris-Pratt algorithm is recommended.
Thank You !
