Definitions
Text: a longer string T (length m)
Pattern: a shorter string P (length n)
Exact matching: find all occurrences of P in T
(Figure: the naïve algorithm aligns P, length n, under T, length m, and slides it one position at a time; e.g. P = aba against T = abaabaabaabaabaaba....)
Time complexity
Worst case: O(mn)
Best case: O(m), e.g. T = aaaaaaaaaaaaaa vs. P = baaaaaaa (every alignment fails on the first comparison)
Average case: with alphabet size k and equal character probabilities, how many chars do we need to compare before finding a mismatch? On average k/(k-1). Therefore the average-case complexity is mk/(k-1); for a large alphabet this is ~m.
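For reference, the naïve matcher this analysis describes can be sketched as follows (the function name is mine, not from the slides):

```python
def naive_match(text, pattern):
    """Slide the pattern one position at a time and compare left to
    right; worst case O(mn), average case ~ m*k/(k-1)."""
    m, n = len(text), len(pattern)
    hits = []
    for i in range(m - n + 1):
        j = 0
        while j < n and text[i + j] == pattern[j]:
            j += 1
        if j == n:                 # whole pattern matched at shift i
            hits.append(i)
    return hits

print(naive_match("abaabaaba", "aba"))  # [0, 3, 6]
```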
How to speed up?
Pre-process T or P. Why can pre-processing save us time?
It uncovers the structure of T or P
It determines when we can skip ahead without missing anything
It determines when we can infer the result of character comparisons without actually doing them
Example: T = ACGTAXACXTAXACGXAX, P = ACGTACA (goal: minimize the number of comparisons)
Suffix tree
One of the most useful preprocessing techniques Many applications
Algorithm KMP
Not the fastest known, but good for real-time matching:
the text comes one char at a time, with no memory of previous chars
Idea
Left-to-right comparison; shift P by more than one char whenever possible
Intuitive example 1
T: ... a b c x a b c ? ...
P:     a b c x a b c d e   (mismatch at P[8])
Naïve approach: shift P by one and restart the comparison from scratch.
Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches. Number of comparisons saved: 6
Intuitive example 2
T: ... a b c x a b ? ...   (the mismatched char ? is known not to be c)
P:     a b c x a b c d e   (mismatch at P[7])
Naïve approach: shift P by one and restart, repeatedly.
Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened between P[7] and T[j], then T[j] is not c, so we can shift P by six chars and compare T[j] with P[1] without missing any possible matches. Number of comparisons saved: 7
(Figure: P aligned against T; the matched prefix P[1..i] ends with a suffix t = P[j..i] that also occurs as a prefix t' of P, with char z following t' and char y following t.)
Pre-processing: for any position i in P, find the longest proper suffix t = P[j..i] of P[1..i] that matches a prefix t' of P, such that the char following t differs from the char following t' (i.e., y ≠ z). For each i, let sp(i) = length(t).
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the
right by i − sp(i) chars and compare T[k] with P[sp(i)+1]. This shift rule can be implicitly represented by creating a failure link from P[i+1] back to P[sp(i)+1]. Meaning: when a mismatch occurs between a text char x and P[i+1], resume the comparison between x and P[sp(i)+1].
Another example
P: abababc
If a char in T fails to match at position 7, re-compare it with the char at position 5 (= sp(6) + 1 = 4 + 1)
i:      1  2  3  4  5  6  7
P[i]:   a  b  a  b  a  b  c
sp(i):  0  0  0  0  0  4  0

(sp(6) = 4: the suffix abab of ababab matches the prefix abab, and the chars that follow them differ: a vs. c. The shorter candidates ab/abab at earlier positions fail the "next chars differ" condition, so their sp values are 0.)
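The sp values above can be computed in linear time from the classic (weak) failure function. A minimal sketch with hypothetical helper names (`prefix_function`, `strong_sp`); the code is 0-based, so `sp[i-1]` in the code is sp(i) on the slide:

```python
def prefix_function(p):
    """Weak failure function: f[i] = length of the longest proper
    suffix of p[:i+1] that is also a prefix of p."""
    f = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f

def strong_sp(p):
    """sp(i): longest proper suffix of p[:i+1] that is also a prefix,
    with the extra condition that the char after the prefix copy
    differs from p[i+1] (the condition is skipped at the last i)."""
    f = prefix_function(p)
    sp = []
    for i in range(len(p)):
        k = f[i]
        if i + 1 < len(p):
            while k and p[k] == p[i + 1]:
                k = f[k - 1]
        sp.append(k)
    return sp

print(strong_sp("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```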
Example: T = aacaataaaaataaccttacta, P = aataac.
Time complexity analysis: each char in T may be compared up to n times, so a lousy analysis gives O(mn) time. A more careful analysis breaks the comparisons into two phases:
Comparison phase: the first time a char in T is compared to P. Total: exactly m.
Shift phase: the first comparison made after a shift. Total: at most m.
Time complexity: O(2m) = O(m).
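The O(m) search itself can be sketched as below, using the weak failure function (the strong sp values only improve the constant factor); function name is mine:

```python
def kmp_search(text, pat):
    """KMP search: returns the 0-based start positions of pat in text."""
    # Weak failure function of the pattern.
    f = [0] * len(pat)
    k = 0
    for i in range(1, len(pat)):
        while k and pat[i] != pat[k]:
            k = f[k - 1]
        if pat[i] == pat[k]:
            k += 1
        f[i] = k
    # Scan the text, following failure links on mismatch.
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pat[k]:
            k = f[k - 1]
        if c == pat[k]:
            k += 1
        if k == len(pat):          # full match ending at position i
            hits.append(i - k + 1)
            k = f[k - 1]
    return hits

print(kmp_search("aacaataaaaataaccttacta", "aataac"))  # [9]
```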
Failure link
(Figure: the failure links of P, viewed as a DFA with states 0, 1, 2, ....)
DFA Example
(Figure: the DFA for P = aataac.)
T: aacaataataataaccttacta
State sequence: 1201234534534560001001
Each char in T is examined exactly once.
DFA
Preprocessing time and space are O(n·|Σ|)
This may be a problem for a very large alphabet, for example when each char is a big integer, or for Chinese characters
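A sketch of the automaton view: the explicit transition table is what costs O(n·|Σ|) space. Function names are mine; state s means "s chars of P matched", and reaching state len(p) signals a match:

```python
def kmp_dfa(p, alphabet):
    """Explicit matching automaton for p: a table of size
    (len(p)+1) x |alphabet|, i.e. O(n*|Sigma|) space."""
    f = [0] * len(p)               # weak failure function
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k

    def step(s, c):
        if s == len(p):            # restart after a full match
            s = f[-1]
        while s and c != p[s]:
            s = f[s - 1]
        return s + 1 if c == p[s] else 0

    return [{c: step(s, c) for c in alphabet} for s in range(len(p) + 1)]

def run_dfa(text, p, alphabet):
    """Return the state after each text char; state len(p) = match."""
    dfa, s, states = kmp_dfa(p, alphabet), 0, []
    for c in text:
        s = dfa[s][c]
        states.append(s)
    return states

seq = run_dfa("aacaataataataaccttacta", "aataac", "act")
print("".join(map(str, seq)))  # 1201234534534560001001
```

This reproduces the state sequence on the slide above: state 6 appears exactly where aataac ends in T.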
Better idea: combine all patterns together and search in one run
(Figure: a keyword tree combining several patterns; each numbered leaf marks the end of one pattern, and patterns with a common prefix share a path from the root.)
O(n) time to construct, where n is the total length of the patterns. Search time: O(m), where m is the length of the text. Common prefixes only need to be compared once. But what if there is no space between words?
Aho-Corasick algorithm
Basis of the fgrep algorithm
Generalizes KMP, using failure links
Keyword tree
(Figure: keyword tree for the patterns potato, other, and tattoo; the root is labeled 0 and the leaves 1, 2, 3.)
Text: potherotathxythopotattooattoo
Searching the text with the keyword tree alone, restarting from the root after each attempt, takes O(mn) time.
Example
(Figure: searching potherotathxythopotattooattoo with the keyword tree, following failure links after each mismatch; leaves 1, 2, 3 report the pattern occurrences.)
Aho-Corasick algorithm
O(n) preprocessing and O(m + k) searching, where n is the total length of the patterns, m is the length of the text, and k is the number of occurrences.
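A compact sketch of the whole algorithm (keyword tree + failure links + single scan); the function name and data layout are mine:

```python
from collections import deque

def aho_corasick(text, patterns):
    """Return (position, pattern) pairs for all pattern occurrences."""
    # Build the keyword tree: goto[s] maps a char to the next state.
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        s = 0
        for c in pat:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(pat)
    # BFS to set failure links (depth-1 nodes fail to the root).
    q = deque(goto[0].values())
    while q:
        u = q.popleft()
        for c, v in goto[u].items():
            q.append(v)
            f = fail[u]
            while f and c not in goto[f]:
                f = fail[f]
            fail[v] = goto[f][c] if c in goto[f] else 0
            out[v] |= out[fail[v]]       # inherit shorter matches
    # Scan the text once, following failure links on mismatch.
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits

print(aho_corasick("potherotathxythopotattooattoo",
                   ["potato", "other", "tattoo"]))
```

On the slide's example this reports other and tattoo, each once, in a single pass over the text.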
Suffix Tree
All the algorithms we have talked about so far preprocess the pattern(s):
Karp-Rabin: small pattern, small alphabet
Boyer-Moore: fastest in practice; O(m) worst case
KMP: O(m)
Aho-Corasick: O(m)
Suffix tree
T: xabxac
Suffixes: 1. xabxac  2. abxac  3. bxac  4. xac  5. ac  6. c
(Figure: suffix tree of xabxac; each leaf is numbered by the starting position of its suffix.)
Naïve construction: O(m²) using Aho-Corasick. Smarter: O(m), but very technical, with a big constant factor. Difference from a keyword tree: create an internal node only when there is a branch.
(Figure: suffix tree of xabxac$; the terminator $ guarantees that every suffix ends at a leaf.)
Suffix links
Similar to failure links in a keyword tree, but only internal (branching) nodes are linked
(Figures: step-by-step suffix tree construction, with suffix links used to jump between internal nodes.)
Matching example: T = xabxac, P = xab. Follow the path labeled x-a-b from the root; the leaf numbers in the subtree below it give all occurrences of P in T.
Repeats finding
Problem: find all repeats that are at least k residues long and appear at least p times in the sequence.
Phase 1 (top-down): compute the label length L from the root to each node.
Phase 2 (bottom-up): count the number of leaves N descending from each internal node.
For each node with L >= k and N >= p, print all its leaves.
O(m) to traverse the tree and compute (L, N).
2. Left-maximal repeat
S[i+1..i+k] = S[j+1..j+k], but S[i] != S[j]
3. Maximal repeat
S[i+1..i+k] = S[j+1..j+k], but S[i] != S[j] and S[i+k+1] != S[j+k+1]
Find nodes v with L >= k, L(parent) < k, and N >= p, then traverse the subtree rooted at v to get the locations.
This can be used in many applications, for example to find words that appear frequently in a genome or a document.
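As a sketch of the same frequent-words question, a hash map over fixed-length k-mers gives the answer for one k (the suffix tree answers it for all lengths >= k at once); function name is mine:

```python
from collections import defaultdict

def frequent_kmers(s, k, p):
    """Return every length-k substring of s occurring at least p
    times, mapped to its 0-based occurrence positions."""
    pos = defaultdict(list)
    for i in range(len(s) - k + 1):
        pos[s[i:i + k]].append(i)
    return {w: ps for w, ps in pos.items() if len(ps) >= p}

print(frequent_kmers("xabxac", 2, 2))  # {'xa': [0, 3]}
```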
To simplify
(Figures: joint suffix tree of two sequences; each leaf is labeled (sequence, position), e.g. (1,1) or (2,3).)
We don't really need to do anything, since all edge labels are implicit; the right-hand side is just more convenient to look at.
Application of JST
Longest common substring
For each internal node v, keep a bit vector B: B[1] = 1 if some leaf below v is a suffix of S1, and likewise B[2] for S2. Find all internal nodes with B[1] = B[2] = 1 and report the one with the longest label. Can be extended to k sequences: just use a longer bit vector.
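For comparison, the same problem has a simple quadratic dynamic-programming sketch (the joint suffix tree solves it in linear time); function name is mine:

```python
def longest_common_substring(s1, s2):
    """O(|s1|*|s2|) DP: cur[j] = length of the common substring
    ending at s1[i-1] and s2[j-1]; track the best one seen."""
    best, best_end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return s1[best_end - best:best_end]

print(longest_common_substring("xabxac", "abcabxabc"))  # abxa
```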
Note: common substring, not subsequence.
(Figure: joint suffix tree with leaf labels (sequence, position).)
Application of JST
Given K strings, find all k-mers that appear in at least d of the strings.
Catch
Large constant factor for the space requirement
Large constant factor for construction
Suffix array: trades time for space
Summary
One T, one P
Boyer-Moore is the choice; KMP works but is not the best
Alphabet-independent
One T, many P
Aho-Corasick or a suffix tree
Alphabet-dependent
Two or more T's
Suffix tree, joint suffix tree, suffix array
Suffix tree
One of the most useful preprocessing techniques; many applications
Boyer-Moore: bad character rule
P: tpabxab
char:                     a  b  p  t  x
rightmost position in P:  6  7  2  1  5
Pre-processing: O(n)
Shift examples for P = tpabxab:
Mismatch at position 3 against a text char whose rightmost position in P is 1: shift = 3 − 1 = 2.
Mismatch at position 7 against a text char that does not occur in P (rightmost position 0): shift = 7 − 0 = 7.
Extended bad char rule: for each char, record all of its positions in P (a: 6, 3; b: 7, 4; p: 2; t: 1; x: 5). Preprocessing is still O(n).
P: qcabdabdab
(Figure: a mismatch between text char x and pattern char y; t is the suffix of P matched so far, and t' is its rightmost earlier copy inside P.)
Good suffix preprocessing: for any suffix t of P, find the rightmost copy of t, denoted t'. How do we find t' efficiently?
P: qcabdabdab
In preprocessing: for any suffix t of P, find the rightmost copy t' of t such that the char to the left of t differs from the char to the left of t'.
Pre-processing can be done in linear time. If P occurs in T, searching may take O(mn); if P does not occur in T, worst-case searching is O(m + n).
Example preprocessing
P: qcabdabdab
Bad char rule (does not depend on T):
char:            a        b         c  d     q
positions in P:  9, 6, 3  10, 7, 4  2  8, 5  1

Good suffix values:
position:  1 2 3 4 5 6 7 8 9 10
P:         q c a b d a b d a b
value:     0 0 0 0 0 0 0 2 0 0
(e.g., comparing the suffix dab with its earlier occurrence in cab...dab)
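A minimal sketch of Boyer-Moore using only the bad character rule above (no good suffix rule); the shift formula matches the slides' "mismatch position − rightmost position", and the function names are mine:

```python
def bad_char_table(p):
    """Rightmost 1-based position of each char in p; chars absent
    from p are treated as position 0 by the shift rule."""
    return {c: i + 1 for i, c in enumerate(p)}

def bm_bad_char_search(text, p):
    """Scan p right to left under each alignment; on a mismatch at
    1-based position j+1, shift by (j+1) - rightmost(text char),
    but always at least 1."""
    table = bad_char_table(p)
    hits, i, n = [], 0, len(p)
    while i + n <= len(text):
        j = n - 1
        while j >= 0 and text[i + j] == p[j]:
            j -= 1
        if j < 0:
            hits.append(i)         # full match at shift i
            i += 1
        else:
            shift = (j + 1) - table.get(text[i + j], 0)
            i += max(shift, 1)
    return hits

print(bm_bad_char_search("xtpabxabtpabxab", "tpabxab"))  # [1, 8]
```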