You are on page 1of 7

Proceeding of the 3rd International Conference on Informatics and Technology, 2009

INFLUENCED FACTORS ON COMPUTATION AMONG QUICK SEARCH, TWO-WAY AND KARP-RABIN


ALGORITHMS
Atheer Akram Abdulrazzaq, Nur’Aini Abdul Rashid, Hazrina Binti Yusof Hamdani,
Rana M. Ghadban, Ammar Waysi Mahmood
1
School of Computer Science, Unversiti Sains Malaysia, 11800, Penang, Malaysia
athproof@yahoo.com, nuraini@cs.usm.my, hazrina.yusofhamdani@gmail.com,
rana_ghadban@yahoo.com, ammar_wysi@yahoo.com

Abstract
String matching problem has become highly important in the last two decades due to the advancement in technology.
It has also become necessary to solve this problem because of its application in many fields. One of the main factors
associated with this problem is the consumed time. The consumed time is influenced by type of algorithm, data type,
data size and length of pattern used. This study showed lowest consumed time after using Quick search algorithm
than the other two algorithms studied which are Two way and Karp-Rabin. Consumed time was being lowest in DNA
and Source code when 50 MB and 200 MB, respectively.
Keyword: Consumed time, exact string matching algorithm, data type, data size and length of pattern

Introduction

Consumed time of exact string matching is the computing time during the execution of the algorithm. String
matching is a searching operation carried out to check the optimal alignment by comparing the two finite-length
strings [1]. It has been considered as one of the most important problems related issue. This is because of the
advancement in technology in the last two decades [2]. It has become necessary to solve string matching problem
because of its application in many fields such as in the field of the computation biology [3]. In this field, the biological
molecules are approximated as sequences nucleotides or amino acids. In the literature research, string matching is
use in the inclusion of memorized data and dictionaries [4]. String matching is also use in web search engine, artificial
intelligence, retrieval of information [2] and intrusion detection system [5]. In all the applications, the consumed time
grows as data grows, therefore, it is better to use an efficient algorithm in order to complete the matching with higher
speed [6].

In this paper, we studied three exact string matching algorithm which are Quick search, Karp-Rabin and Two
way algorithms. Two way algorithm [Crochemore-Perrin] is used to improve Brute force algorithm technique. The
algorithm depends on the matching by specific order [4] which divides the pattern into two parts by factorization [7].

Karp-Rabin algorithm [Karp-Rabin] depends on the hashing function in matching. The behavior of this algorithm
depended on left to right order [4]. It measures the content of window and compares it with the content of the pattern.
This process showed high efficiency than matching of characters [8]. The purpose of using hash function is to avoid
the quadratic number during comparisons in order to obtain fast matching process [9]. This is because the usage of
integer numbers will reduce the computational time and simplify the procedure. The main properties of this algorithm
are efficient computation, high discrimination and hash which is characterized by easy way in matching [8].

Quick Search algorithm [Sunday] is similar to Boyer-Moore algorithm and Horspool algorithm which used the bad
character shift [10]. The behavior of this algorithm depended on left to right order [4]. It is characterized by higher
efficiency, easy implemented and capability to apply on short and large pattern [11].

Methodology

The general framework of this study is done in two phases as shown in Fig. 2. First phase is to implement three
exact string matching algorithms, while the second phase is to compare these algorithms based on the data type, the
length of pattern and size of data used. The computation time of each algorithm is measured in the first phase.

The involved algorithms were; Two way algorithm (from Reverse Factor family), Quick Search algorithm (from
Boyer Moore family) and Karp-Rabin algorithm (from Hashing approach). The techniques of these algorithms were
depended on the preprocessing and searching phase of each algorithm, these techniques were depicted in the Fig 1.

©Informatics '09, UM 2009 RDT4 - 100


Proceeding of the 3rd International Conference on Informatics and Technology, 2009

Division

Determination of local
period

Preprocessing phase
(Internal and external
choices )
Г (w, k) = {z | z on the k in w}
∂ (w,k) = min{|z| | z ∈ Г (w,k)}

Preprocessing phase
Hash computing
of pattern
Pattern period determination

x [i] = x [i + p]

Preprocessing phase
Hash computing for
Choosing the critical
text window
factorization Quick search bad
|L| = {max | l | , | l | }
character (qsBc)
Possibilities
|u| < period (x), |u| minimal table
Searching phase

Rehash computing
Searching phase

Comparison of characters in
pattern suffix and prefix with
text Searching phase
Number of attempts
Number of attempts and number of
and number of characters
characters comparison
Number of attempts and number comparison
of characters comparison

(a) Two way algorithm (b) Karp-Rabin algorithm (c) Quick search algorithm

Fig. 1 (a), (b) and (c) preprocessing and searching phases for the three algorithms (Two way, Karp-Rabin and

Quick search)

In the second phase, four types of data were downloaded from Benchmark Standard. They are DNA sequence,
Protein sequence, English text and Source program code. Different lengths of pattern were used in this study in order
to demonstrate the effect of pattern length on the consumed time for the three algorithms. This selection was
performed depending on previous study in which the length of pattern was chosen randomly [12]. In this study, the
pattern lengths ranged from 5 to 95 characters and the increment was done by 10 characters i.e. 10 readings were
©Informatics '09, UM 2009 RDT4 - 101
Proceeding of the 3rd International Conference on Informatics and Technology, 2009

applied. For data sizes, the common sizes used depending on Benchmark Standard were 50 MB and 200 MB. The
framework of this study depicted in Fig 2.

Phase 1
Databases

Quick Karp- Rabin


Two way
Search

Computation of consumed time

Comparison Phase 2

Data type Data size Pattern


length

Fig. 2: Research Methodology

Two way algorithm has two phases; first preprocessing phase, this phase depends on the factorization method.
Factorization method has several steps and extends to the critical factorization. These steps are division, characters
local periods determination, pattern period determination and choosing the critical factorization. Division is needed to
divide the pattern into two substrings. Suppose the pattern is denoted by (x) and both the prefix and suffix of x are
called u and v respectively. The word that need to be search in the pattern is denoted by (w), where the w ∈ x.
Determination of local periods need to find the local periods. The local period is defined as the length of smallest
central repetition. The central repetition is denoted by (z) and the position at this repetition is denoted by (k). First,
find the choices or the possibilities of central repetition that occur on k positions. These choices are either being
internal or external and on the right or left side. In Pattern period determination, the period of pattern was determined
by choosing the smallest period. Choosing the critical factorization in this step, the critical factorization for the pattern
was determined and this occurred after obtaining the all local periods and pattern period. This step has respective
choices until it reaches the critical factorization position that will cut the pattern into two parts. First determine and
then choose the higher local period. If there is only one local period as higher value then it will choose directly.
Second, the arrangements of the characters have to be noticed alphabetically i.e. have to check the direction and
choose which direction it have to follow. This direction depends on the forward and inverse directions. The forward
direction means the characters are arranged alphabetically. The inverse means the characters are arranged
alphabetically from the inverse direction. Third is the computation of the maximal suffix which is the higher length of
all chosen characters that arranged lexicographically.

Second searching phase, this phase included two choices in order to perform the comparisons during matching
operation. First, the comparisons will start with pattern suffix (v) which can compare between text and pattern
characters. The movement in v will start from left to right. If there is mismatch then the pattern will shift toward right by
©Informatics '09, UM 2009 RDT4 - 102
Proceeding of the 3rd International Conference on Informatics and Technology, 2009

the length of character position (k). If there is matching then the characters will continue comparisons until it complete
all v characters and if they are completed then will change to the second choice. Second choice, the comparisons will
start in prefix (u) characters. The movement here will occur from right to left. If there is mismatch then the pattern will
shift toward right by the pattern period (p). But if matching occurs then the comparisons will continue until it complete
all u characters. If all u characters are matching, then the pattern will shift by pattern period also [7].

Karp-Rabin algorithm has two phases, first preprocessing phase, this phase depends on the hash property. The
function of hash is to convert the characters into integer values which can deal with in mathematical operations.
Supposing the word (w) is found in both pattern (x) and text (y), and the length of window (during comparisons) is
denoted by m. In preprocessing the hashing, it is easy to calculate for all pattern characters. Second Searching
phase, in this phase, first the comparison will occur between the hash number of text window and the hash number of
pattern. The movement of pattern will be start from left to right. If there is mismatch then the window will shift toward
right by one character. Rehash property require to compute the new window hash after addition of new character on
right side and remove the hash value of the first left character in the previous step. The rehash property will continue
to occur after every shifting process. Supposing the first character was left from the previous step is denoted by (a)
and the new character on the right side of window is denoted by (b). Second, if the hash number of pattern is equal to
window hash number, it will compare all character (one by one) of pattern with the window characters. When the
matching occurs for all characters, the shifting of pattern will move by one character only [9].

Quick search algorithm has two phases, first preprocessing phase, this phase depends on the quick search bad
character (qsBc) table. This table represents the minimal resulted values from the difference between the position (i)
of the character (c) and the pattern length (m-1). Where, m refers to the position of character that comes after the
window during text matching. When a character is not presented in pattern, then the value must be m in qsBc table.
Second searching phase, in this phase, the matching process will occur between the pattern and text characters. The
movement of matching starts from left to right. If there is mismatch, then the shifting of the pattern depends on m
value in qsBc table. But if matching occurs then the comparisons will continue to another character. If all characters
are matching, then the shifting will depend on m also [10].

Implementation

The experiment was run on our local machine (HP Intel(R) Core(TM) 2 CPU (1.66 GHz), 2.49 GB RAM,
Windows XP service pack-2 32-bit Operating System, the compiler used was Intel C++.

Results

We execute the program using the experiments by using type of data 5 times for every data type at different times
and then takes the average.

©Informatics '09, UM 2009 RDT4 - 103


Proceeding of the 3rd International Conference on Informatics and Technology, 2009

3 30 440
Pattern

Consumed time (sec.)


Consumed time (sec.)

2.5 3.5
35
25 Length
330
2 20
2.5
25 95
1.5
15 220
1 10 1.5
15 85
110
0.55 75
0.55
0 0 0 0 65
Tw o Way Karp-Rabin Quick search Tw o Way Karp-Rabin Quick search
55
algorithm algorithm algorithm algorithm algorithm algorithm
45
(a) DNA sequence (b) Protein sequence
35
25
440 440
3.5
Consumed time (sec.) 3.5
35 15
Consumed time (sec.)

35
330 330 5
2.5
25 2.5
25
220 220
1.5
15 1.5
15
110 110
0.55 0.5
5
0 0 00
Tw o Way Karp-Rabin Quick search Tw o Way Karp-Rabin Quick search
algorithm algorithm algorithm algorithm algorithm algorithm

(c) English text (d) Source code

Fig. 3 (a), (b), (c) and (d) Evaluations for the consumed time when 50 MB types of data is used

As shown in Fig. 3, the Quick search algorithm has showed the lowest consumed time followed by Karp-Rabin
algorithm and then Two way algorithm when 50 MB size is used. The consumed time found in DNA sequence was
lowest compared to other types of data used. Naturally, the consumed time was increases when the length of pattern
increases i.e. 95 characters of pattern length has consume time higher than others. According to the results, Quick
search algorithm had lowest consumed time when DNA used in 50 MB size. Quick search was given lowest
consumed time in DNA sequence data type when minimum pattern length (5) character is used. The maximum
pattern length (95) character was given lowest consumed time for quick search algorithm in DNA sequence. Karp-
Rabin algorithm had lowest consumed time in DNA sequence when the minimum and maximum pattern length use.
Two way algorithm had lowest consumed time in DNA sequence when minimum and maximum pattern length use.

©Informatics '09, UM 2009 RDT4 - 104


Proceeding of the 3rd International Conference on Informatics and Technology, 2009

20
200 12
120
Pattern

Consumed time (sec.)


10
Consumed time (sec.)

100 Length
15
150
880
95
10
100 660
440 85
5 50
220 75
0 0 00 65
Tw o Way Karp-Rabin Quick search Tw o Way Karp-Rabin Quick search
55
algorithm algorithm algorithm algorithm algorithm algorithm
45
35
(a) DNA sequence (b) Protein sequence
25
990 90
9 15
880
Consumed time (sec.)

Consumed time (sec.)


80
8
770 5
70
7
660 60
6
550 50
5
440 40
4
330 30
3
220 20
2
110 10
1
00 00
Tw o Way Karp-Rabin Quick search Tw o Way Karp-Rabin Quick search
algorithm algorithm algorithm algorithm algorithm algorithm

(c) English text (d) Source code

Fig. 4 (a), (b), (c) and (d) Evaluations for the consumed time when 200 MB types of data is used

The consumed time showed lowest in Quick search algorithm than other two algorithms when 200 MB data size
used. Also, it was found that the consumed time was lowest when Source code used when compared with other data
types. The consumed time was increased when length of pattern increased too. However, the quick search was
showed lowest consumed time at Source code at 200 data size (as shown in Fig 4).

Quick search algorithm was given lowest consumed time in Source code when the minimum pattern length (5)
and maximum pattern length (95) use. Karp-Rabin was given lowest consumed time in English text for both maximum
(95) and minimum (5) pattern length use. Two way algorithm had lowest consumed time in Source code data type
when use the pattern length (5) and (95).

The quick search algorithm showed lowest time than others. This is because the number of attempts was lower if
compared with Karp-Rabin and Two way algorithms depending on previous studies. The consumed time was
effected by the data size because they are differ in the types of characters included in each data type i.e. the
consumed was showed lowest in DNA when use the pattern length 50 MB while when increased the types of
characters then the Source code showed lowest consumed time in 200 MB.

Conclusion

The consumed time is changeable finding and it influenced by several factors. These factors are type of algorithm,
data type, data size and length of pattern. The Quick search is given lowest consumed time for different data types,
data sizes and pattern lengths. This change majorly depends on the technique of the algorithm used.

©Informatics '09, UM 2009 RDT4 - 105


Proceeding of the 3rd International Conference on Informatics and Technology, 2009

Acknowledgment

This research is partly supported by USM short term grant 304/pkomp/ 639037

References

[1] J. Gregor and R.S. Harris "String matching with left-to-right networks". Pattern Recognition Letter, Vol 16,
1995, pp. 213-218.

[2] R.S.Viswanadha, B.A. Vinaya and M.Mrudula. "Backend Engine for Parallel String Matching Using Boolean
Matrix". International Symposium on Parallel Computing in Electrical Engineering, 2006, pp.281-283.

[3] L.Zheng., C.Xin., J.Borneman. and J. Tao (). “A fast algorithm for approximate string matching on gene
th
sequences,” in Symposium. 16 Annu. Combinatorial Pattern Matching. Springer-Verlag, Vol 3537,2005, pp.79-
90.

[4] A. A. Hassan. "Mixed Heuristic Algorithm for Intelligent String Matching for Information Retrieval".
Proceedings of the Sixth International Conference on Computational Intelligence and Multimedia Applications.
IEEE, 2005, pp. 11-16.

[5] K. Bi, N. Gu., K. Tu, X. Liu. and G. Liu. "A Practical Distributed String Matching Algorithm Architecture and
Implementation". Proceeding of world academy of science, engineering and technology, Vol 10, 2005, pp.1307-
6884.

[6] M. Miyzaki., A. Shinohara. and M. Takeda. "An improved pattern matching algorithm for string in terms of
straight line programs". Journal of Discrete Algorithms, Vol 1:1, 1997, pp.187-204.

[7] M. Crochemore. and D. Perrin. "Two way string matching". Journal of the Association for Computing
Machinery, vol 38, 1991, pp. 65I -675.

[8] C.Charras. and T. Lecroq. "Handbook of Exact String_Matching Algorithms". King's College Publications,
France, 2004.

[9] R.M Karp. and M.O.Rabin. "Efficient randomized pattern-matching algorithms", IBM Journal of Research and
Development, vol 31, 1987, pp. 249–260.

[10] D.M. Sunday. "A very fast substring search algorithm". Communications of the Association for Computing
Machinery, vol 33, 1990, pp. 132–142.

[11] A.A. Hnaif, M. Alhalaiqah, O. Abouabdalla, S. Ramadass. and M. K. Mohammed.. "Parallel Quick Search
Algorithm to Speed Packet Payload Filtering in NIDS", Paper accepted in Journal of Engineering Science and
Technology (JESTEC), 2008.

[12] P.D. Michailidis. and K.G Margaritis. "On-line String Matching Algorithms: Survey and Experimental
Results". International Journal of Computer and Mathematic, Vol 76, 2000, pp. 411-434.

©Informatics '09, UM 2009 RDT4 - 106

You might also like