Sandeep Singh (III B.Tech I.T)

• Brief Introduction
• Basic Notations
• Naive Algorithm
• Knuth-Morris-Pratt Algorithm
• Boyer-Moore Algorithm
Brief Introduction
• Given a text string ‘T’, the string matching problem asks whether a pattern ‘P’ occurs in ‘T’.
• If ‘P’ does occur, the position in ‘T’ where it occurs is returned.
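As a quick illustration of the problem statement, the sketch below leans on Java's built-in `String.indexOf`, which returns the index of the first occurrence of a pattern or -1 if there is none. The class name, method name and sample strings are chosen here for illustration only.

```java
class StringMatchingDemo {
    // Returns the first position in text where pattern occurs, or -1 if it does not occur.
    static int find(String text, String pattern) {
        return text.indexOf(pattern);
    }

    public static void main(String[] args) {
        System.out.println(find("abbababaa", "abaa")); // pattern occurs in the text
        System.out.println(find("abbababaa", "abc"));  // pattern does not occur
    }
}
```

The algorithms that follow solve the same problem by hand, which makes the number of character comparisons visible.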
Basic Notations
• P: the pattern being searched for
• T: the text in which P is sought
• m: the length of P
• n: the length of T
• p_i, t_j: the i-th character of P and the j-th character of T
• j: current position within T
• k: current position within P
• endText(T, j): a Boolean function that tells us when we are beyond the last character of the text.
• It returns true if j is greater than the index of the last character of T, and false otherwise.
Naive Algorithm
• Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty.
• Output: the index in T where a copy of P begins, or -1 if no match for P is found.

int naivescan(char[] P, char[] T, int m) {
    // Positions are 0-based.
    int match = -1;  // value to return
    int i = 0;       // current guess at where P begins in T
    int j = 0;       // index of the current character in T
    int k = 0;       // index of the current character in P
    while (j < T.length) {  // same test as endText(T, j) == false
        if (T[j] == P[k]) {
            j++;
            k++;
            if (k == m) {
                match = i;  // match found
                break;
            }
        } else {
            // Back up over the k matched characters,
            // then slide the pattern forward and start over.
            j = j - k + 1;
            k = 0;
            i = j;
        }
    }
    return match;
}
Trace of the naive algorithm for T = "abbababaa" and P = "abaa" (0-based positions, running comparison count):
Shift 0: 'a', 'b' match; mismatch at T[2] = 'b' against P[2] = 'a'. Comparisons = 3
Shift 1: mismatch at T[1] = 'b' against P[0] = 'a'. Comparisons = 4
Shift 2: mismatch at T[2] = 'b' against P[0] = 'a'. Comparisons = 5
Shift 3: 'a', 'b', 'a' match; mismatch at T[6] = 'b' against P[3] = 'a'. Comparisons = 9
Shift 4: mismatch at T[4] = 'b' against P[0] = 'a'. Comparisons = 10
Shift 5: 'a', 'b', 'a', 'a' all match. Comparisons = 14
Found! P occurs in T at position 5.

• Preprocessing time = 0.
• The naive algorithm takes O((n - m + 1)m) time in the worst case.
• The naive algorithm is inefficient because information gained about the text for one shift is entirely ignored when considering other shifts.
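To see where the O((n - m + 1)m) bound comes from, the following sketch counts character comparisons for a naive scan. It is an illustration under stated assumptions, not the slides' own code: the class, the `repeat` helper and the worst-case input (a text of all 'a's and a pattern that fails only on its last character) are made up for this example.

```java
class NaiveWorstCase {
    // Counts the character comparisons made by a naive scan that tries
    // shifts 0, 1, ..., n - m and stops at the first match.
    static long countComparisons(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        long comparisons = 0;
        for (int shift = 0; shift + m <= n; shift++) {
            int k = 0;
            while (k < m) {
                comparisons++;
                if (text.charAt(shift + k) != pattern.charAt(k)) {
                    break; // mismatch: give up on this shift
                }
                k++;
            }
            if (k == m) {
                break; // first match found, stop scanning
            }
        }
        return comparisons;
    }

    // Hypothetical helper: builds a string of `count` copies of c.
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = repeat('a', 20);          // n = 20
        String pattern = repeat('a', 4) + "b";  // m = 5, never matches
        // Every one of the n - m + 1 = 16 shifts costs m = 5 comparisons.
        System.out.println(countComparisons(text, pattern));
    }
}
```

On the example traced earlier (T = "abbababaa", P = "abaa") the same counter reports 14 comparisons.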
Knuth-Morris-Pratt Algorithm
• Input: P and T, the pattern and text strings; m, the length of P; fail, the array of failure links.
• Output: the index in T where a copy of P begins, or -1 if no match for P is found.
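int kmpscan(char[] P, char[] T, int m, int[] fail) {
    // Positions are 0-based. fail[0] is assumed to be -1; for k > 0, fail[k]
    // is the pattern position to try next after a mismatch at P[k].
    int match = -1;
    int j = 0;  // j indexes text characters
    int k = 0;  // k indexes the pattern and the fail array
    while (j < T.length) {
        if (k == -1) {
            // Fell off the front of the pattern: start the pattern over.
            j++;
            k = 0;
        } else if (T[j] == P[k]) {
            j++;
            k++;
            if (k == m) {
                match = j - m;  // match found
                break;
            }
        } else {
            k = fail[k];  // follow fail arrow
        }
    }
    return match;
}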
• After a shift of the pattern, the naive algorithm has forgotten all information about previously matched symbols.
• So it may re-compare a text symbol with different pattern symbols again and again.
• The KMP algorithm makes use of the information gained by previous symbol comparisons.
• It never re-compares a text symbol that has matched a pattern symbol; that is, backtracking on the text ‘T’ never occurs.
KMP trace for T = "ABC ABCDAB ABCDABCDABDE" and P = "ABCDABD" (_ marks a space in T)
(M = 0, i = 0)
T index:   0  1  2  3  4  5  6  7  8
T:         A  B  C  _  A  B  C  D  A
P:         A  B  C  D  A  B  D
P index:   0  1  2  3  4  5  6
'A', 'B' and 'C' match. FAIL ! at T[3] (a space) against P[3] = 'D'.
• In the fourth step, T[3] is a space and P[3] = 'D', a mismatch.
• Rather than beginning to search again at T[1], we note that no 'A' occurs in T between positions 1 and 3.
• Hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again.
• Therefore we move on to the next character, setting M = 4 and i = 0, where M is the position in T at which the current alignment of P begins and i is the current position within P.
(M = 4, i = 0)
T index:   4  5  6  7  8  9 10
T:         A  B  C  D  A  B  _
P:         A  B  C  D  A  B  D
P index:   0  1  2  3  4  5  6
"ABCDAB" matches. FAIL ! at T[10] (a space) against P[6] = 'D'.
• We quickly obtain a nearly complete match "ABCDAB" when, at P[6] (T[10]), we again have a discrepancy.
• However, just prior to the end of the current partial match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration.
• The search therefore resumes with M = 8 and i = 2: the "AB" at T[8..9] is known to match P[0..1], so it is not compared again.
(M = 8, i = 2)
T index:   8  9 10 11 12 13 14
T:         A  B  _  A  B  C  D
P:         A  B  C  D  A  B  D
P index:   0  1  2  3  4  5  6
P[0..1] = "AB" is already known to match T[8..9]. FAIL ! at T[10] (a space) against P[2] = 'C'.

(M = 11, i = 0)
T index:  11 12 13 14 15 16 17
T:         A  B  C  D  A  B  C
P:         A  B  C  D  A  B  D
P index:   0  1  2  3  4  5  6
"ABCDAB" matches. FAIL ! at T[17] = 'C' against P[6] = 'D'.

(M = 15, i = 2)
T index:  15 16 17 18 19 20 21
T:         A  B  C  D  A  B  D
P:         A  B  C  D  A  B  D
P index:   0  1  2  3  4  5  6
P[0..1] = "AB" is already known to match T[15..16]; the remaining characters match as well. MATCH ! P occurs in T at position 15.

• The pattern must be preprocessed in order to analyze its structure.
• The preprocessing phase has a complexity of O(m).
• The search phase has a complexity of O(n); since m ≤ n, the overall complexity of the KMP algorithm is O(n).
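The slides take the fail array as given. A minimal sketch of how it might be built in O(m) time, together with a 0-based scan that uses it, is shown below. The class and method names, and the convention fail[0] = -1, are assumptions made here for illustration and are not taken from the slides; the pattern is assumed to be nonempty.

```java
class KmpSketch {
    // For k > 0, fail[k] is the length of the longest proper prefix of
    // P[0..k-1] that is also a suffix of it. fail[0] = -1 means there is
    // nothing shorter to fall back to.
    static int[] buildFail(char[] P) {
        int m = P.length;
        int[] fail = new int[m];
        fail[0] = -1;
        int s = -1; // fail value of the previous position
        for (int k = 1; k < m; k++) {
            // Try to extend the previous prefix by P[k-1], falling back as needed.
            while (s >= 0 && P[s] != P[k - 1]) {
                s = fail[s];
            }
            s++;
            fail[k] = s;
        }
        return fail;
    }

    // Returns the index in T where a copy of P begins, or -1 if there is none.
    static int scan(char[] P, char[] T) {
        int[] fail = buildFail(P);
        int j = 0; // text index, never decreases
        int k = 0; // pattern index
        while (j < T.length) {
            if (k == -1 || T[j] == P[k]) {
                j++;
                k++;
                if (k == P.length) {
                    return j - P.length; // match found
                }
            } else {
                k = fail[k]; // follow fail arrow
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] P = "ABCDABD".toCharArray();
        char[] T = "ABC ABCDAB ABCDABCDABDE".toCharArray();
        System.out.println(java.util.Arrays.toString(buildFail(P)));
        System.out.println(scan(P, T));
    }
}
```

For the pattern "ABCDABD" this yields the table [-1, 0, 0, 0, 0, 1, 2], and the scan of the traced text reports position 15. The inner while loop can only undo increments of s made in earlier iterations, which is why the whole construction stays linear in m.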
Boyer-Moore Algorithm
• Input: P and T, the pattern and text strings; m, the length of P; charjump and matchjump. The pattern is assumed to be nonempty.
• matchjump[k] is the amount by which to increment j, the text position index, to begin the next right-to-left scan of the pattern after a mismatch has occurred at p_k.
• For 1 ≤ k ≤ m:
  ◦ matchjump[k] = slide[k] + m - k
  ◦ slide[k] is how far we can slide the pattern forward after a mismatch at p_k.
  ◦ m - k is the number of characters that were matched before the mismatch.
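One way to make the formula concrete is a brute-force table builder. The sketch below is written for clarity rather than speed, and it rests on assumptions: positions are 0-based (index k here corresponds to p_(k+1) in the notation above), and slide is taken to be the smallest shift that lines up an earlier occurrence of the matched suffix, or moves the pattern past it. Efficient implementations compute such a table in O(m) time with a more intricate algorithm. All names are hypothetical.

```java
class MatchJumpSketch {
    // matchjump[k] = slide + number of characters matched to the right of k.
    static int[] computeMatchJump(char[] P) {
        int m = P.length;
        int[] matchjump = new int[m];
        for (int k = 0; k < m; k++) {
            int matched = m - 1 - k; // characters matched before the mismatch at k
            int slide = 1;
            while (slide < m && !suffixFits(P, k, slide)) {
                slide++;
            }
            matchjump[k] = slide + matched;
        }
        return matchjump;
    }

    // True if, after sliding P right by `slide`, every already-matched text
    // character (the ones aligned with P[k+1..m-1]) either faces an equal
    // pattern character or lies to the left of the pattern's new start.
    static boolean suffixFits(char[] P, int k, int slide) {
        for (int i = k + 1; i < P.length; i++) {
            int shifted = i - slide;
            if (shifted >= 0 && P[shifted] != P[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(
                computeMatchJump("must".toCharArray())));
        System.out.println(computeMatchJump("batsandcats".toCharArray())[7]);
    }
}
```

For "must", where no suffix reoccurs, the table is [7, 6, 5, 1]. For "batsandcats", a mismatch at the 'c' after matching "ats" gives slide = 7 and a jump of 7 + 3 = 10.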
• The number of positions we can “jump” forward when there is a mismatch depends on the text character being read, say t_j.
• These numbers are stored in an array charjump, indexed by the character set Σ.
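A possible way to fill the charjump array is sketched below. It assumes 0-based positions and an alphabet of 256 character codes; both choices, like the class and method names, are made here for illustration only.

```java
class CharJumpSketch {
    static final int ALPHABET_SIZE = 256; // assumption: 8-bit character codes

    // charjump[c] is how far to advance the text index j when the text
    // character c is read at a mismatch: far enough to line up the rightmost
    // c in P with it, or a full pattern length if c does not occur in P.
    static int[] computeCharJump(char[] P) {
        int m = P.length;
        int[] charjump = new int[ALPHABET_SIZE];
        java.util.Arrays.fill(charjump, m);
        for (int k = 0; k < m; k++) {
            charjump[P[k]] = m - 1 - k; // later (rightmost) occurrences overwrite earlier ones
        }
        return charjump;
    }

    public static void main(String[] args) {
        int[] charjump = computeCharJump("must".toCharArray());
        System.out.println(charjump['m'] + " " + charjump['u'] + " "
                + charjump['s'] + " " + charjump['t'] + " " + charjump['x']);
    }
}
```

For the pattern "must" the entries are m: 3, u: 2, s: 1, t: 0, and 4 for every other character. The entry 0 for the last pattern character is one reason a scan cannot rely on charjump alone without some guarantee of forward progress.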
• Output: the index in T where a copy of P begins, or -1 if no match for P is found.
int bmscan(char[] P, char[] T, int m, int[] charjump, int[] matchjump) {
    // Positions are 0-based, so matchjump[k] here is the jump after a
    // mismatch at P[k], the (k + 1)-th pattern character.
    int match = -1;
    int j = m - 1;  // j indexes text characters
    int k = m - 1;  // k indexes the pattern and the matchjump array
    while (j < T.length) {
        if (T[j] == P[k]) {
            if (k == 0) {
                match = j;  // match found
                break;
            }
            j--;
            k--;
        } else {
            // Slide P forward.
            j += Math.max(charjump[T[j]], matchjump[k]);
            k = m - 1;
        }
    }
    return match;
}

Boyer-Moore trace for T = "if you wish to understand others you must" (positions 0 to 40) and P = "must" (positions 0 to 3). Each scan starts at the last pattern character, 't', and moves right to left; j is the text position where the scan starts. The slides shown are those given by charjump.
j = 3:  'y' against 't'. FAIL ! 'y' does not occur in P, so P slides 4 places.
j = 7:  'w' against 't'. FAIL ! P slides 4 places.
j = 11: space against 't'. FAIL ! P slides 4 places.
j = 15: 'u' against 't'. FAIL ! ( Line up u’s ): P slides 2 places.
j = 17: 'd' against 't'. FAIL ! P slides 4 places.
j = 21: 't' and 's' match, then 'r' at T[19] against 'u'. FAIL ! Just pass the ‘r’: P now begins at T[20].
j = 23: 'n' against 't'. FAIL ! P slides 4 places.
j = 27: 't' matches, then 'o' at T[26] against 's'. FAIL ! Just pass the ‘o’: P now begins at T[27].
j = 30: 'r' against 't'. FAIL ! P slides 4 places.
j = 34: 'o' against 't'. FAIL ! P slides 4 places.
j = 38: 'u' against 't'. FAIL ! ( Line up u’s ): P slides 2 places.
j = 40: 't', 's', 'u' and 'm' all match. MATCH ! P occurs in T at positions 37 to 40.


T:  . . . . . d a t s
P:  t s a n d c a t s      FAIL !

• The letters in T to the right of the current position are ‘ats’, the same letters that form the suffix of P that was just scanned.
• If we know that P does not have another instance of ‘ats’, then we can slide P all the way past the ‘ats’ in T.
• If P does have an earlier instance of ‘ats’, we could slide P so that the earlier ‘ats’ lines up with the matched letters in T.

T:  d a t s . . . . .      Fail !
P:  b a t s a n d c a      ( Previous occurrence of ‘ats’ in P )

Shift one place to the right of ‘d’, as usual
T:  d a t s . . . . .
P:  b a t s a n d c a t s
• The Boyer-Moore searching algorithm performs O(n) comparisons in the worst case.
• If the alphabet is large compared to the length of the pattern, the algorithm performs O(n/m) comparisons on average.
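The effect of skipping text characters can be observed with a small comparison counter. The sketch below is a simplified scan, not the full algorithm: it slides by charjump only, and the term m - k inside max(...) stands in for matchjump merely to guarantee forward progress. The class name, the static counter and the 256-entry table are assumptions made for this example.

```java
class SimpleBmDemo {
    static int comparisons; // character comparisons made by the last scan

    // Right-to-left scan using charjump alone. Returns the index in T where
    // a copy of P begins, or -1 if there is none.
    static int scan(char[] P, char[] T) {
        int m = P.length;
        int[] charjump = new int[256]; // assumption: 8-bit character codes
        java.util.Arrays.fill(charjump, m);
        for (int i = 0; i < m; i++) {
            charjump[P[i]] = m - 1 - i; // rightmost occurrence wins
        }
        comparisons = 0;
        int j = m - 1; // text index
        int k = m - 1; // pattern index
        while (j < T.length) {
            comparisons++;
            if (T[j] == P[k]) {
                if (k == 0) {
                    return j; // whole pattern matched
                }
                j--;
                k--;
            } else {
                // m - k moves the pattern at least one place to the right.
                j += Math.max(charjump[T[j]], m - k);
                k = m - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] T = "if you wish to understand others you must".toCharArray();
        int pos = scan("must".toCharArray(), T);
        System.out.println(pos + " found after " + comparisons
                + " comparisons on " + T.length + " text characters");
    }
}
```

On this text the pattern is found at position 37 after 18 comparisons, fewer than half of the 41 text characters, because most mismatches let the scan skip a full pattern length.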
• The Boyer-Moore algorithm is extremely fast on a large alphabet (relative to the length of the pattern).
• If the pattern is quite small (m ≤ 3), the overhead of preprocessing the pattern is not worthwhile; BM then does more comparisons than the naive approach.
• For the very shortest patterns, the naive algorithm may be better.
• Use BM for strings of average length (m ≥ 5).
• For binary strings, BM does not do quite as well.
• For binary strings, the Knuth-Morris-Pratt algorithm is recommended.
Thank You!
