String Matching

CS 3343: Analysis of Algorithms
Lecture 26: String Matching Algorithms
Definitions
Text: a longer string T
Pattern: a shorter string P
Exact matching: find all occurrences of P in T
[Figure: pattern P (length n) aligned under text T (length m) at each place where it occurs]
The naive algorithm

[Figure: P (length n) is slid along T (length m) one position at a time and compared with T at every alignment]
Time complexity
Worst case: O(mn)
Best case: O(m)
aaaaaaaaaaaaaa vs. baaaaaaa
Average case?
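Alphabet size = k
Assume every character is equally likely
How many chars do you need to compare before finding a mismatch?
On average: k / (k-1) (each comparison matches with probability 1/k, so the expected count is 1 / (1 - 1/k))
Therefore average-case complexity: mk / (k-1)
For a large alphabet, ~ m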
Not as bad as you thought, huh?
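The worst-case and best-case counts above can be checked with a minimal sketch of the naive algorithm. The function name `naive_search` and the comparison counter are illustrative additions, not part of the lecture; m and n follow the slide convention (m = length of T, n = length of P).

```python
def naive_search(text, pattern):
    """Return (0-based match positions, number of character comparisons)."""
    m, n = len(text), len(pattern)   # slide convention: m = |T|, n = |P|
    positions, comparisons = [], 0
    for start in range(m - n + 1):   # try every alignment of P under T
        j = 0
        while j < n:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break                # mismatch: give up on this alignment
            j += 1
        if j == n:
            positions.append(start)
    return positions, comparisons


print(naive_search("bbabaababa", "aba"))   # occurrences of aba
print(naive_search("a" * 14, "aaab"))      # worst case: 4 comparisons per alignment
print(naive_search("a" * 14, "baaa"))      # best case: 1 comparison per alignment
```

With 11 alignments, the second call makes 44 comparisons and the third makes 11, which is the O(mn) versus O(m) gap.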
Real strings are not random

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab
Plus: O(m) average case is still bad for long strings!
Smarter algorithms:
O(m + n) in worst case
sub-linear in practice
how is this possible?
How to speed up?
Pre-process T or P
Why can pre-processing save us time?
Uncovers the structure of T or P
Determines when we can skip ahead without missing anything
Determines when we can infer the result of character comparisons without actually doing them
ACGTAXACXTAXACGXAX ACGTACA
Cost for exact string matching

Total cost = cost(preprocessing) + cost(comparison) + cost(output)
cost(preprocessing): overhead
cost(comparison): minimize
cost(output): constant
Hope: gain > overhead
String matching scenarios

One T and one P
Search a word in a document
One T and many P all at once
Search a set of words in a document
Spell checking
One fixed T, many P
Search a completed genome for a short sequence
Two (or many) Ts for common patterns
Would you preprocess P or T?
Always pre-process the shorter seq, or the one that is repeatedly used
Pattern pre-processing algs

Karp-Rabin algorithm
Small alphabet and small pattern
Boyer-Moore algorithm
The choice of most cases
Typically sub-linear time
Knuth-Morris-Pratt algorithm (KMP)
Aho-Corasick algorithm
The algorithm for the unix utility fgrep
Suffix tree
One of the most useful preprocessing techniques
Many applications
Algorithm KMP
Not the fastest
Best known
Good for real-time matching
i.e. text comes one char at a time
No memory of previous chars
Idea
Left-to-right comparison
Shift P more than one char whenever possible
Intuitive example 1
T: abcxabc? (the char marked ? mismatches P[8])
P: abcxabcde
Naive approach: shift P one char at a time and re-compare from the start of P after each shift.
Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
Number of comparisons saved: 6
Intuitive example 2
T: abcxab? (the char marked ? mismatches P[7], so it cannot be a c)
P: abcxabcde
Naive approach: shift P one char at a time and re-compare from the start of P after each shift.
Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1], without missing any possible matches.
Number of comparisons saved: 7
KMP algorithm: pre-processing

Key: the reasoning is done without even knowing what string T is. Only the location of mismatch in P must be known.
[Figure: P aligned under T. P[1..i] has matched and the text char x mismatches y = P[i+1]. The suffix t = P[j..i] also occurs as a prefix t' of P, and t' is followed by z.]
Pre-processing: for any position i in P, find the longest proper suffix t = P[j..i] of P[1..i] such that t matches a prefix t' of P, and the char after t differs from the char after t' (i.e., y ≠ z).
For each i, let sp(i) = length(t)
KMP algorithm: shift rule

[Figure: P before and after the shift. After the shift, the prefix t' (positions 1..sp(i)) lies under the matched text t, and z lies under x.]
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z.
This shift rule can be implicitly represented by creating a failure link between y and z.
Meaning: when a mismatch occurs between x on T and P[i+1], resume comparison between x and P[sp(i)+1].
Failure Link Example

P: aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)
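i      1 2 3 4 5 6
P[i]   a a t a a c
sp(i)  0 1 0 0 2 0
Prefix of P against P[2..3]: aa vs. at, 1 char matches, so sp(2) = 1
Prefix of P against P[4..6]: aat vs. aac, 2 chars match, so sp(5) = 2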
Another example
P: abababc
If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)
i      1 2 3 4 5 6 7
P[i]   a b a b a b c
sp(i)  0 0 0 0 0 4 0
Prefix of P against P[3..]: ab = ab, abab = abab, ababa vs. ababc differ in the last char, so sp(6) = 4
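The two sp(i) tables can be reproduced with a short sketch. It assumes the definition given in the pre-processing slide (the char after the suffix must differ from the char after the prefix); the function name `strong_sp` is made up for this example, and the code first computes the ordinary prefix function and then strengthens it.

```python
def strong_sp(pattern):
    """Return [sp(1), ..., sp(n)] under the 'next chars differ' definition."""
    n = len(pattern)
    # Ordinary prefix function: plain[i] is the length of the longest proper
    # suffix of pattern[:i+1] that is also a prefix of pattern.
    plain = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = plain[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        plain[i] = k
    # Strengthen: walk down the chain of shorter candidates until the char
    # after the suffix (pattern[i+1]) differs from the char after the prefix
    # (pattern[k]). The last position has no next char, so it keeps plain[i].
    strong = [0] * n
    for i in range(n):
        k = plain[i]
        while k > 0 and i + 1 < n and pattern[i + 1] == pattern[k]:
            k = plain[k - 1]
        strong[i] = k
    return strong


print(strong_sp("aataac"))    # [0, 1, 0, 0, 2, 0]
print(strong_sp("abababc"))   # [0, 0, 0, 0, 0, 4, 0]
```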
KMP Example using Failure Link

[Figure: P = aataac with its failure links]
T: aacaataaaaataaccttacta
   aataac
   ^^*
    aataac
    .*
      aataac
      ^^^^^*
         aataac
         ..*
            aataac
            .^^^^^
(^ = match, * = mismatch, . = implicit comparison)
Time complexity analysis:
Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
More careful analysis: the comparisons can be broken into two phases:
Comparison phase: the first time a char in T is compared to P. Total is exactly m.
Shift phase: first comparisons made after a shift. Total is at most m.
Time complexity: O(2m) = O(m)
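A minimal sketch of the search with failure links, counting comparisons so the 2m bound can be observed. For brevity it uses the ordinary prefix function rather than the strengthened sp(i) of the slides (a simplification made for this example); the bound of at most 2m comparisons holds either way. Function names are illustrative.

```python
def prefix_function(pattern):
    """sp[i] = length of the longest proper suffix of pattern[:i+1] that is a prefix."""
    sp = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k
    return sp


def kmp_search(text, pattern):
    """Return (0-based match positions, number of character comparisons)."""
    sp = prefix_function(pattern)
    positions, comparisons = [], 0
    k = 0                            # number of pattern chars currently matched
    for pos, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[k]:
                k += 1
                break
            if k == 0:
                break                # no shorter candidate: move on in the text
            k = sp[k - 1]            # follow the failure link, re-compare ch
        if k == len(pattern):
            positions.append(pos - k + 1)
            k = sp[k - 1]            # keep going to find later occurrences
    return positions, comparisons


hits, count = kmp_search("aacaataaaaataaccttacta", "aataac")
print(hits, count)                   # hits == [9], i.e. position 10 counting from 1
```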
KMP algorithm using DFA (Deterministic Finite Automata)

P: aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3
[Figure: the failure-link representation of P = aataac next to the equivalent DFA with states 0 to 6]
If the next char in T is t after matching 5 chars, go to state 3
All other inputs go to state 0.
DFA Example
[Figure: DFA for P = aataac]
T:     aacaataataataaccttacta
State: 1201234534534560001001
Each char in T will be examined exactly once.
Therefore, exactly m comparisons are made.

But it takes longer to do pre-processing, and needs more space to store the FSA.
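The state sequence in the DFA example can be reproduced with a sketch like the one below. The construction shown (copy the transitions of a "restart" state, then overwrite the matching char) is one standard way to build a KMP automaton and is an assumption here, since the slides do not give the construction. The alphabet {a, c, t} and all names are chosen for the example.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][ch] = next state, where state = number of pattern chars matched."""
    n = len(pattern)
    dfa = [{ch: 0 for ch in alphabet} for _ in range(n + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0   # state the DFA would be in after reading pattern[1:state]
    for state in range(1, n + 1):
        for ch in alphabet:
            dfa[state][ch] = dfa[restart][ch]        # mismatch: behave like restart
        if state < n:
            dfa[state][pattern[state]] = state + 1   # match: advance
            restart = dfa[restart][pattern[state]]
    return dfa


def run_dfa(text, pattern, alphabet):
    """Return (state after each text char, 0-based match positions)."""
    dfa = build_dfa(pattern, alphabet)
    state, trace, matches = 0, [], []
    for pos, ch in enumerate(text):
        state = dfa[state].get(ch, 0)   # chars outside the alphabet reset to 0
        trace.append(state)
        if state == len(pattern):
            matches.append(pos - len(pattern) + 1)
    return trace, matches


trace, matches = run_dfa("aacaataataataaccttacta", "aataac", "act")
print("".join(map(str, trace)))   # 1201234534534560001001
print(matches)                    # [9]
```

Each text char triggers exactly one table lookup, at the price of a table with one row per state and one column per alphabet symbol.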
Difference between Failure Link and DFA

Failure link
Preprocessing time and space are O(n), regardless of alphabet size
Comparison time is at most 2m (at least m)
DFA
Preprocessing time and space are O(n |Σ|)
May be a problem for a very large alphabet
For example, each char is a big integer, or Chinese characters
Comparison time is always m.
The set matching problem

Find all occurrences of a set of patterns in T
First idea: run KMP or BM for each P
O(km + n)
k: number of patterns
m: length of text
n: total length of patterns
Better idea: combine all patterns together and search in one run
A simpler problem: spell-checking

A dictionary contains five words:
potato poetry pottery science school
Given a document, check if any word is (not) in the dictionary

Words in document are separated by special chars. Relatively easy.
Keyword tree for spell checking

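Document: This version of the potato gun was inspired by the Weird Science team out of Illinois
[Figure: keyword tree of the five dictionary words; words with a common prefix share a path from the root, and each leaf is numbered by its word]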
O(n) time to construct. n: total length of patterns.
Search time: O(m). m: length of text
Common prefixes only need to be compared once.
What if there is no space between words?
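A minimal sketch of the spell-checking keyword tree, using nested dictionaries as tree nodes. The `"$"` end-of-word marker, the lower-casing, and the regular expression used to split the document into words are choices made for this example.

```python
import re


def build_keyword_tree(words):
    """Nested dicts: one level per character, "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes reuse nodes
        node["$"] = True
    return root


def in_tree(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


dictionary = ["potato", "poetry", "pottery", "science", "school"]
tree = build_keyword_tree(dictionary)
document = ("This version of the potato gun was inspired by "
            "the Weird Science team out of Illinois")
words = re.findall(r"[a-z]+", document.lower())
print([w for w in words if in_tree(tree, w)])   # ['potato', 'science']
```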
Aho-Corasick algorithm
Basis of the fgrep algorithm
Generalizing KMP
Using failure links
Example: given the following 4 patterns:

potato tattoo theater other
Keyword tree
[Figure: keyword tree of the four patterns, rooted at node 0]
T: potherotathxythopotattooattoo
Searching T with the keyword tree alone, restarting from the root at every text position:
O(mn)
m: length of text. n: length of longest pattern
Keyword Tree with a failure link
[Figure: the keyword tree with a single failure link drawn in]
Keyword Tree with all failure links
[Figure: the keyword tree with every failure link drawn in]
Example
[Figure sequence: the search moves through the tree one text char at a time, following a failure link at each mismatch]
Aho-Corasick algorithm
O(n) preprocessing, and O(m+k) searching.
n: total length of patterns.
m: length of text
k: number of occurrences
Can create a DFA similar to the one in KMP.
Requires more space
Preprocessing time depends on alphabet size
Search time per text char is constant
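The keyword tree with failure links can be sketched as follows, run on the four example patterns and the example text. This is a compact textbook-style version written for illustration (nodes are integers, the tree is stored in three parallel lists); it is not claimed to match how fgrep is implemented.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links. Node 0 is the root."""
    children, fail, output = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in children[node]:
                children[node][ch] = len(children)
                children.append({})
                fail.append(0)
                output.append([])
            node = children[node][ch]
        output[node].append(pat)
    queue = deque(children[0].values())   # depth-1 nodes fail to the root
    while queue:                          # breadth-first, so shallower links exist
        node = queue.popleft()
        for ch, child in children[node].items():
            f = fail[node]
            while f and ch not in children[f]:
                f = fail[f]
            fail[child] = children[f].get(ch, 0)
            # patterns ending at the failure target also end here
            output[child] = output[child] + output[fail[child]]
            queue.append(child)
    return children, fail, output


def search(text, patterns):
    """Return (0-based start, pattern) for every occurrence, in text order of the end."""
    children, fail, output = build_automaton(patterns)
    node, hits = 0, []
    for pos, ch in enumerate(text):
        while node and ch not in children[node]:
            node = fail[node]             # mismatch: follow failure links
        node = children[node].get(ch, 0)
        for pat in output[node]:
            hits.append((pos - len(pat) + 1, pat))
    return hits


patterns = ["potato", "tattoo", "theater", "other"]
print(search("potherotathxythopotattooattoo", patterns))
# [(1, 'other'), (18, 'tattoo')]
```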
Suffix Tree
All algorithms we talked about so far preprocess pattern(s)
Karp-Rabin: small pattern, small alphabet
Boyer-Moore: fastest in practice. O(m) worst case.
KMP: O(m)
Aho-Corasick: O(m)
In some cases we may prefer to pre-process T

Fixed T, varying P
Suffix tree: basically a keyword tree of all suffixes
Suffix tree
T: xabxac
Suffixes:
1. xabxac
2. abxac
3. bxac
4. xac
5. ac
6. c
[Figure: suffix tree of xabxac with leaves numbered 1 to 6]
Naive construction: O(m^2) using Aho-Corasick.
Smarter: O(m). Very technical. Big constant factor.
Difference from a keyword tree: create an internal node only when there is a branch
Suffix tree implementation

Explicitly labeling seq end
T: xabxa
T: xabxa$
[Figure: suffix tree of xabxa, where some suffixes end in the middle of an edge, next to the suffix tree of xabxa$, where every suffix ends at its own leaf]
Suffix tree implementation

Implicitly labeling edges
T: xabxa$
[Figure: the same tree with each edge label stored as a pair of positions in T (e.g. 1:2, 3:$) instead of the substring itself]
Suffix links
Similar to failure link in a keyword tree
Only link internal nodes having branches
[Figure: suffix links between internal nodes of a suffix tree]
Suffix tree construction

[Figure sequence: the suffix tree of T = acatgacatt (positions 1234567890) is built by inserting suffixes 1 to 10 one at a time; edge labels are stored implicitly as start:$]
ST Application 1: pattern matching

Find all occurrences of P = xa in T
Find node v in the ST that matches P
Traverse the subtree rooted at v to get the locations
[Figure: suffix tree of T = xabxac]
T: xabxac
O(m) to construct ST (large constant factor)
O(n) to find v: linear in the length of P instead of T!
O(k) to get all leaves, k is the number of occurrences.
Asymptotic time is the same as KMP.
ST wins if T is fixed. KMP wins otherwise.
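The two steps (find node v, then collect the leaves below it) can be sketched on an uncompressed suffix trie. Note the substitution: this is the naive O(m^2) keyword tree of all suffixes, not the compressed linear-time suffix tree, and the `"leaf"` key and function names are choices made for this example.

```python
def build_suffix_trie(text):
    """Uncompressed keyword tree of all suffixes of text + '$' (O(m^2) size)."""
    text += "$"                          # explicit end marker
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1         # 1-based suffix number, as on the slides
    return root


def leaves(node):
    """Collect the suffix numbers stored in the subtree rooted at node."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def find_all(root, pattern):
    """Walk down along pattern (O(n)), then report every leaf below (O(k) in a real ST)."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return leaves(node)


trie = build_suffix_trie("xabxac")
print(find_all(trie, "xa"))   # [1, 4]
```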
ST Application 2: set matching

Find all occurrences of a set of patterns in T
Build a ST from T
Match each P to ST
[Figure: suffix tree of T = xabxac]
T: xabxac
P: xab
O(m) to construct ST (large constant factor)
O(n) to find v: linear in the total length of the Ps
O(k) to get all leaves, k is the number of occurrences.
Asymptotic time is the same as Aho-Corasick.
ST wins if T is fixed. AC wins if the Ps are fixed. Otherwise it depends on their relative sizes.
ST application 3: repeats finding

Genome contains many repeated DNA sequences
Repeat sequence length: varies from 1 nucleotide to millions
Genes may have multiple copies (50 to 10,000)
Highly repetitive DNA in some non-coding regions
6 to 10 bp x 100,000 to 1,000,000 times
Problem: find all repeats that are at least k residues long and appear at least p times in the genome
Repeats finding
Find all repeats that are at least k residues long and appear at least p times in the seq
Phase 1: top-down, compute the label length (L) from the root to each node
Phase 2: bottom-up, count the number of leaves (N) descended from each internal node
For each node with L >= k and N >= p, print all leaves
O(m) to traverse the tree and compute (L, N)
Maximal repeats finding

Example string: acatgacatt
1. Right-maximal repeat
S[i+1..i+k] = S[j+1..j+k], but S[i+k+1] != S[j+k+1]
Example: cat
2. Left-maximal repeat
S[i+1..i+k] = S[j+1..j+k], but S[i] != S[j]
Example: aca
3. Maximal repeat
S[i+1..i+k] = S[j+1..j+k], but S[i] != S[j] and S[i+k+1] != S[j+k+1]
Example: acat

[Figure: suffix tree of acatgacatt (positions 1234567890)]
Find repeats with at least 3 bases and 2 occurrences

right-maximal: cat
maximal: acat
left-maximal: aca

[Figure: the same suffix tree with each leaf annotated by the char to the left of its suffix]
How to find maximal repeat?

A maximal repeat is a right-maximal repeat whose occurrences have different left chars
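The three definitions can be checked directly on acatgacatt with a brute-force sketch. It scans the string instead of using a suffix tree, so it only illustrates the definitions, not the tree-based method. The boundary markers `"^"` and `"$"` (standing in for "no char" at either end of the string) are an assumption of this example, as is the function name.

```python
def classify_repeat(s, w):
    """Classify substring w of s as a repeat, using the left/right char tests."""
    starts = [i for i in range(len(s) - len(w) + 1) if s.startswith(w, i)]
    if len(starts) < 2:
        return None                       # not a repeat at all
    # Chars just left and just right of each occurrence; the string ends
    # count as unique boundary chars (assumed not to occur in s).
    left = {s[i - 1] if i > 0 else "^" for i in starts}
    right = {s[i + len(w)] if i + len(w) < len(s) else "$" for i in starts}
    left_max, right_max = len(left) > 1, len(right) > 1
    if left_max and right_max:
        return "maximal"
    if right_max:
        return "right-maximal"
    if left_max:
        return "left-maximal"
    return "repeat"


for w in ["cat", "aca", "acat"]:
    print(w, classify_repeat("acatgacatt", w))
# cat right-maximal
# aca left-maximal
# acat maximal
```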
ST application 4: word enumeration

Find all k-mers that occur at least p times
Compute (L, N) for each node
L: total label length from root to node
N: number of leaves below the node
[Figure: suffix tree with nodes marked L < k, L = k, and L >= k with N >= p]
Find nodes v with L >= k, L(parent) < k, and N >= p
Traverse the sub-tree rooted at v to get the locations
This can be used in many applications. For example, to find words that appeared frequently in a genome or a document
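A small sketch of word enumeration. To keep it short it builds a suffix trie cut off at depth k rather than a full suffix tree (a simplification made for this example), and stores at each node the start positions of the suffixes passing through it, so N is the length of that list.

```python
def frequent_kmers(text, k, p):
    """Return {k-mer: 1-based start positions} for k-mers occurring at least p times."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + k]:          # only the first k chars matter
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)      # suffix start+1 passes through
    result = {}

    def walk(node, label):
        if len(label) == k:                       # L = k: report if N >= p
            if len(node["starts"]) >= p:
                result[label] = node["starts"]
            return
        for ch, child in node["children"].items():
            walk(child, label + ch)

    walk(root, "")
    return result


print(frequent_kmers("acatgacatt", 3, 2))   # aca at 1 and 6, cat at 2 and 7
```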
Joint Suffix Tree

Build a ST for two or more strings
Two strings S1 and S2
S* = S1 & S2
Build a suffix tree for S* in time O(|S1| + |S2|)
The separator will only appear in the edge ending in a leaf
S1 = abcd
S2 = abca
S* = abcd&abca$
[Figure: suffix tree of S*, with each leaf labeled (string, position), e.g. 1,1 or 2,4; the leaf for the suffix that starts at the separator is marked useless]
To Simplify
[Figure: left, the tree above; right, the same tree with everything from the separator onward dropped from the edge labels]
We don't really need to do anything, since all edge labels were implicit. The right hand side is more convenient to look at.
Application of JST
Longest common substring
For each internal node v, keep a bit vector B
B[1] = 1 if the subtree of v contains a suffix of S1 (B[2] likewise for S2)
Find all internal nodes with B[1] = B[2] = 1
Report the one with the longest label
Can be extended to k sequences. Just use a longer bit vector.
Note: substring, not subsequence
[Figure: joint suffix tree of S1 = abcd and S2 = abca]
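The bit-vector idea can be sketched on an uncompressed joint suffix trie. Two simplifications are made for this example: the suffixes of each string are inserted separately, so no separator char is needed, and a Python set of string indices plays the role of the bit vector B. This takes quadratic time and space, unlike a real joint suffix tree.

```python
def longest_common_substring(strings):
    """Longest substring shared by all strings, via a joint suffix trie."""
    root = {"children": {}, "seen": set()}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "seen": set()})
                node["seen"].add(idx)       # B[idx] = 1 for this node
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            if len(child["seen"]) == len(strings):   # all bits of B are set
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best


print(longest_common_substring(["abcd", "abca"]))   # abc
```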
Application of JST
Given K strings, find all k-mers that appear in at least d strings
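[Figure: a node with L >= k whose parent has L < k; its bit vector is B = (1, 0, 1, 1) and its leaves are 1,x 3,x 3,x 4,x]
Report the node if cardinality(B) >= d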
Many other applications

Reproduce the behavior of Aho-Corasick
Recognizing computer viruses
A database of known computer viruses
Does a file contain a virus?
DNA fingerprinting
A database of people's DNA sequences
Given a short DNA, which person is it from?
Catch
Large constant factor for space requirement
Large constant factor for construction
Suffix array: trade off time for space
Summary
One T, one P
Boyer-Moore is the choice
KMP works but not the best
Alphabet independent
One T, many P
Aho-Corasick
Suffix tree
Alphabet dependent
One fixed T, many varying P
Suffix tree
Two or more Ts
Suffix tree, joint suffix tree, suffix array
Karp Rabin Algorithm

Let's say we are dealing with binary numbers
Text: 01010001011001010101001
Pattern: 101100
Convert pattern to integer

101100 = 2^5 + 2^3 + 2^2 = 44

Text: 01010001011001010101001
Pattern: 101100 = 44 decimal
Sliding a 6-bit window along 10111011001010101001:
101110 = 2^5 + 0 + 2^3 + 2^2 + 2^1 = 46
011101 = 46 * 2 - 64 + 1 = 29
111011 = 29 * 2 - 0 + 1 = 59
110110 = 59 * 2 - 64 + 0 = 54
101100 = 54 * 2 - 64 + 0 = 44 (match)
Θ(m+n)

What if the pattern is too long to fit into a single integer?
Pattern: 101100. What if each word in our computer has only 4 bits?
Basic idea: hashing. 44 % 13 = 5
101110 = 46 (% 13 = 7)
011101 = 46 * 2 - 64 + 1 = 29 (% 13 = 3)
111011 = 29 * 2 - 0 + 1 = 59 (% 13 = 7)
110110 = 59 * 2 - 64 + 0 = 54 (% 13 = 2)
101100 = 54 * 2 - 64 + 0 = 44 (% 13 = 5)
Θ(m+n) expected running time
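The rolling update and the mod-13 hashing can be sketched as below for binary strings. The function name, the returned list of window hashes, and the explicit string comparison on a hash hit (to rule out false positives) are choices made for this example; base 2 and modulus 13 are taken from the slides.

```python
def karp_rabin(text, pattern, base=2, modulus=13):
    """Return (0-based match positions, hash of every window) for digit strings."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return [], []
    high = pow(base, n - 1, modulus)      # weight of the window's leading digit
    target = window = 0
    for i in range(n):                    # hash of the pattern and the first window
        target = (target * base + int(pattern[i])) % modulus
        window = (window * base + int(text[i])) % modulus
    positions, hashes = [], []
    for start in range(m - n + 1):
        hashes.append(window)
        # Equal hashes may be a collision, so confirm with a real comparison.
        if window == target and text[start:start + n] == pattern:
            positions.append(start)
        if start + n < m:
            # Drop the leading digit, shift left, append the next digit.
            window = ((window - int(text[start]) * high) * base
                      + int(text[start + n])) % modulus
    return positions, hashes


positions, hashes = karp_rabin("10111011001010101001", "101100")
print(positions)      # [4]
print(hashes[:5])     # [7, 3, 7, 2, 5]
```

The first and third windows both hash to 7 although they hold different values (46 and 59), which is why a hash hit has to be verified.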

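Boyer-Moore algorithm
Three ideas:
Right-to-left comparison
Bad character rule
Good suffix rule

Right to left comparison
[Figure: P is compared with T starting from its right end; x in T mismatches y in P, and P is then shifted right]
Skip some chars without missing any occurrence.
But how?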
Bad character rule

   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
What would you do now?
Bad character rule

   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
P:     tpabxab
Bad character rule

   0        1
   123456789012345678
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab
Basic bad character rule
P: tpabxab
char                      a  b  p  t  x
Right-most position in P  6  7  2  1  5
Pre-processing: O(n)

T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
T(k) is the mismatched char in T (here t); i is the mismatch position in P (here i = 3).
When the rightmost T(k) in P is to the left of i, shift pattern P to align T(k) with the rightmost T(k) in P.
i = 3, shift 3 - 1 = 2
P:     tpabxab

T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
When T(k) is not in P, shift the left end of P to align with T(k+1).
i = 7, shift 7 - 0 = 7
P:            tpabxab

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
When the rightmost T(k) in P is to the right of i, shift pattern P one position.
i = 5, 5 - 6 < 0, so shift 1
P:          tpabxab
Extended bad character rule

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
Find the T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position.
i = 5, 5 - 3 = 2, so shift 2
P:           tpabxab
char            a     b     p  t  x
Positions in P  6, 3  7, 4  2  1  5
Preprocessing still O(n)
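The position tables and the shifts worked out above (2, 7, 1 for the basic rule, 2 for the extended rule) can be reproduced with a short sketch. Positions are 1-based to match the slides; all function names are made up for this example, and the search routine uses the extended bad character rule only, with no good suffix rule.

```python
def positions_table(pattern):
    """For each char, its 1-based positions in pattern, rightmost first."""
    table = {}
    for pos in range(len(pattern), 0, -1):
        table.setdefault(pattern[pos - 1], []).append(pos)
    return table


def bad_char_shift(table, i, ch, extended=True):
    """Shift after text char ch mismatches pattern position i (1-based)."""
    occurrences = table.get(ch, [])
    if extended:
        for pos in occurrences:           # rightmost occurrence left of i
            if pos < i:
                return i - pos
        return i                          # ch does not occur left of i
    rightmost = occurrences[0] if occurrences else 0
    return max(1, i - rightmost)          # basic rule: never shift backwards


def bm_bad_char_search(text, pattern):
    """Right-to-left comparison with the extended bad character rule."""
    table = positions_table(pattern)
    n, matches, start = len(pattern), [], 0
    while start + n <= len(text):
        i = n
        while i > 0 and pattern[i - 1] == text[start + i - 1]:
            i -= 1                        # compare from the right end of P
        if i == 0:
            matches.append(start)
            start += 1
        else:
            start += bad_char_shift(table, i, text[start + i - 1])
    return matches


table = positions_table("tpabxab")
print(table)                                        # a: [6, 3], b: [7, 4], ...
print(bad_char_shift(table, 3, "t", extended=False))   # 2
print(bad_char_shift(table, 7, "q", extended=False))   # 7
print(bad_char_shift(table, 5, "a", extended=False))   # 1
print(bad_char_shift(table, 5, "a"))                   # 2
```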
Extended bad character rule

Best possible: m / n comparisons
Works better for large alphabet size
In some cases the extended bad character rule is sufficiently good
Worst-case: O(mn)
What else can we do?
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^

According to extended bad character rule
P:    qcabdabdab
(weak) good suffix rule

P:      qcabdabdab
(Weak) good suffix rule
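[Figure: the suffix t of P has matched T, and x in T mismatches y in P; P is shifted so that t', the rightmost other copy of t in P, lies under t]
Preprocessing: for any suffix t of P, find the rightmost copy of t, denoted by t'.
How to find t' efficiently?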
(Strong) good suffix rule


   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:      qcabdabdab

P:         qcabdabdab

[Figure: as for the weak rule, except that the char z to the left of t' must differ from the char y to the left of t]
In preprocessing: for any suffix t of P, find the rightmost copy of t, t', such that the char to the left of t' ≠ the char to the left of t.
Pre-processing can be done in linear time
If P is in T, searching may take O(mn)
If P is not in T, searching in the worst case is O(m+n)
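The difference between the weak and the strong rule can be seen with a brute-force sketch that computes the shift for one mismatch position. It rescans the pattern instead of using the linear-time preprocessing mentioned above. The function name and the fallback used when no copy of t exists (align the longest prefix of P that is a suffix of t, otherwise shift by the whole pattern length) are assumptions of this example.

```python
def good_suffix_shift(pattern, mismatch_pos, strong=True):
    """Shift after a mismatch at 1-based position mismatch_pos,
    when everything to its right has matched."""
    n = len(pattern)
    start = mismatch_pos                  # 0-based index where the matched suffix t begins
    t = pattern[start:]
    if not t:
        return 1                          # nothing matched: the rule says nothing
    bad = pattern[mismatch_pos - 1]       # the char y to the left of t
    for s in range(start - 1, -1, -1):    # candidate copies t', rightmost first
        if pattern[s:s + len(t)] == t:
            # Strong rule: the char to the left of t' must differ from y.
            if not strong or s == 0 or pattern[s - 1] != bad:
                return start - s
    for length in range(len(t) - 1, 0, -1):
        if pattern[:length] == t[-length:]:
            return n - length             # a prefix of P matches a suffix of t
    return n


print(good_suffix_shift("qcabdabdab", 8, strong=False))   # 3
print(good_suffix_shift("qcabdabdab", 8, strong=True))    # 6
```

In the example with T = prstabstubabvqxrst, the weak rule moves P from position 3 to position 6 and the strong rule moves it to position 9.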
Example preprocessing
P: qcabdabdab
Bad char rule
char            a        b         c  d     q
Positions in P  9, 6, 3  10, 7, 4  2  8, 5  1
Good suffix rule
Position  1 2 3 4 5 6 7 8 9 10
P         q c a b d a b d a b
Table     0 0 0 0 0 0 0 2 0 0
(mismatch at position 8: dab vs. cab, so position 2 moves under the mismatched char)
Does not depend on T
Where to shift depends on T
