Professional Documents
Culture Documents
Overview
Markov models of sequence structures
Introduction to Hidden Markov Models (HMM)
HMM algorithms; Viterbi decoder
The Challenges
Biological sequences have modular structure
Genes exons, introns
Promoter regions modules, promoters
Proteins domains, folds, structural parts, active parts
Interfaces
ATP Binding
DomainE258
S265
M244
E236
L266 D276
E275
Y257 V270
F311
F317 A269
L301 L284
T315
E255
L248 K271
E281
E316 M278
Y253 E279E286
Q252
E282
L384
V339
Binding
Site
Q346 M351
Y342
E494 F486E352
Active components
H396
Activation
Domain
Y440 M472
L451
E450
.6
e(F,H)=0.5
e(F,T)=0.5
F
.4
.8
e(B,H)=0.9
.2
e(B,T)=0.1
B
8
Coin Example
Two states MC: {F,B} F=fair coin, B=biased
Emission probabilities
Described in state boxes
Or through emission boxes
.4
e(F,H)=0.5
e(F,T)=0.5
e(B,H)=0.9
.2
e(B,T)=0.1
.8
.6
.4
F
.8
.5 .5
.9
.2
.1
9
A=1.
A=.1
A=.5
A=.9
A=.0
G=.0
T=.0
G=.0
T=.8
G=.2
T=.3
G=.0
T=.0
0 10 1
0 0 1
2 0 0
8 0 8
5
0
2
3
9 0
1 0
0 0
0 10
G=.0
T=1.
E
TA TGA
TATAA
TATAA
TAATA
TATAA
TATTA
GATAA
GATAC
TACGA
TATTA
T
T
T
T
T
T
T
T
T
T
10
A=.0
C=.0
G=.2
T=.8
A=1.
C=.0
G=.0
T=.0
A=.1
C=.1
G=.0
T=.8
A=.5
C=.0
G=.2
T=.3
A=.6
C=.2
G=.2
T=.0
.7
.3
1.
1.
A=.0
C=.0
G=.0
T=1.
- =1.
Delete-State
S
A=.0
C=.0
G=.2
T=.8
A=1.
C=.0
G=.0
T=.0
A=.1
C=.1
G=.0
T=.8
A=.5
C=.0
G=.2
T=.3
Insert-State
A=.0
C=.0
G=.0
T=1.
.8
.2
1.
A=.25
C=.25
G=.25
T=.25
TA TGA T
TATACT
TATAAT
TATTAT
TATAGT
TATT - T
GATAAT
GATA - T
TACG- T
TATTCT
TA TGA T
TATACT
TATA - T
TATT - T
TATA - T
TATT - T
GATA - T
GATA - T
TACG- T
T A T T - 11T
Profile HMM
ACA
TCA
ACA
AGA
ACC
--ACT
C---G--
ATG
ATC
AGC
ATC
ATC insertion
Transition probabilities
Output Probabilities
Profile alignment
E.g., What is the most likely path to generate ACATATC ?
How likely is ACATATC to be generated by this profile?
12
13
Regular Sequence
A+
G+
A-
G-
C+
T+
C-
T-
+ A G C T
A 0.1800.2740.426 0.120
G 0.1710.3680.274 0.188
+ A G C T
C
T
A
G
0.1610.3390.375 0.125
0.0790.3550.384 0.182
+ A G C T
C
T
A
G
C
0.3220.2980.078 0.302
0.1770.2390.292 0.292
0.2480.2460.298 0.208
14
Splice site
15
A=.29
C=.31
G=.04
T=.36
16
e(F,H)=0.5
e(F,T)=0.5
Start
.4
.8
.5
e(B,H)=0.9
.2
e(B,T)=0.1
17
18
States
s1 s2 s3 s4 s5
Observed sequence
1 2 3 4 5 6 7 8 9 10 12 13 14 1516 17 18
State sequence
19
Viterbis Decoder
F(i,k) = probability of the most likely path to state i
generating S1Sk
Forward recursion:
F(i,k+1)=e(i,Sk+1)*max j{F(j,k)a(i,j)}
Best path to i
Sk+1
F(6,k)a(6,i)
i
1 2 3 4
k k+1
20
10
.2*.9=.18
.8*.5=.4
.6
.4*.9=.36
F
Start
.5
.5
.4
e(F,H)=0.5
e(F,T)=0.5
.6*.5=.3
H H H H H
1 .45 .09 .065.019.006.0028
.8
e(B,H)=0.9
.2
e(B,T)=0.1
w=.5*(.9)*(.4)2*(.3)*(.36)2
Start 0
1
21
A+
G+
A-
G-
C+
T+
C-
T-
Start
Start
0 1 2 3 4
22
11
Computational Note
Computing probability products propagates errors
Instead of multiplying probabilities add log-likelihood
Define f(i,k)=log F(i,k)
f(i,k+1)=log e(i,Sk+1) + max j{f(j,k)+log a(i,j)}
Or, define the weight w(i,j,k)=log e(i,Sk+1)+ log a(i,j)
To get the following standard DP formulation
f(i,k+1)=max j{f(j,k)+w(i,j,k)}
23
Example
what is the most likely sequence of transitions & emissions to
explain the observation: S=HHHHHH Start
(using base 2 log)
-1
-0.74
-2.32-0.15=-2.47
lge(F,H)=-1
-.32-1=-1.32
lge(F,T)=-1
-2.32
-1
-1.32
lge(B,H)=-0.15
lge(B,T)=-3.32
-0.32
-1.32-.15=-1.47
F
-.74-1=-1.74
W(.,.,H)
S
F
B
-2
-1.15
-1.74
-1.47
-1.32
-2.47
f(i,k+1)=max j{f(j,k)+w(i,j,k)}
H
-1.15 -3.47
-2 -2.47
Start
24
12
Concluding Notes
Viterbi decoding: hidden pathway of an observed sequence
Hidden pathway explains the underlying structure
E.g., identify CpG islands
E.g., align a sequence against a profile
E.g., determine gene structure
..
25
13