
Chapter 4: Hidden Markov Models

4.1 Introduction to HMM

Prof. Yechiam Yemini (YY)

Computer Science Department


Columbia University

Overview
Markov models of sequence structures
Introduction to Hidden Markov Models (HMM)
HMM algorithms; Viterbi decoder

Durbin chapters 3-5

The Challenges
Biological sequences have modular structure
Genes: exons, introns
Promoter regions: modules, promoters
Proteins: domains, folds, structural parts, active parts

How do we identify informative regions?


How do we find & map genes?
How do we find & map promoter regions?

Mapping Protein Regions

[Figure: protein structure with annotated functional regions: interfaces, ATP binding domain, binding site, activation domain, and active components; individual residues (e.g., E258, S265, K271) are labeled.]
Statistical Sequence Analysis


Example: CpG islands indicate important regions
CG (denoted CpG) is typically transformed by methylation into TG
Promoter/start regions of genes suppress methylation
This leads to higher CpG density
How do we find CpG islands?

Example: active protein regions are statistically similar


Evolution conserves structural motifs but varies sequences

Simple comparison techniques are insufficient


Global/local alignment
Consensus sequence

The challenge: analyzing statistical features of regions

Review of Markovian Modeling


Recall: a Markov chain is described by transition probabilities
π(n+1) = A·π(n), where π(i,n) = Prob{S(n)=i} is the state probability
A(i,j) = Prob[S(n+1)=j | S(n)=i] is the transition probability

Markov chains describe statistical evolution


In time: evolutionary change depends on previous state only
In space: change depends on neighboring sites only
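
As a small illustration (not from the slides): treating π(n) as a row vector, one step of the chain is a single matrix multiplication, π(n+1) = π(n)·A, with A(i,j) = Prob[S(n+1)=j | S(n)=i]. The two-state chain below is made up for the example.

```python
# Minimal sketch: propagate the state distribution of a Markov chain one step
# at a time.  The two-state chain and its numbers are illustrative only.
import numpy as np

A = np.array([[0.9, 0.1],     # A[i, j] = Prob{S(n+1) = j | S(n) = i}
              [0.3, 0.7]])

pi = np.array([1.0, 0.0])     # start in state 0 with probability 1

for n in range(5):
    pi = pi @ A               # row-vector update: pi(n+1) = pi(n) * A
    print(f"pi({n + 1}) = {pi}")
```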

From Markov To Hidden Markov Models (HMM)


Nature uses different statistics for evolving different regions
Gene regions: CpG, promoters, introns/exons
Protein regions: active, interfaces, hydrophobic/hydrophilic

How can we tell regions apart?


Sample sequences have different statistics
Model regions as Markovian states emitting observed sequences

Example: CpG islands


Model: two connected MCs, one for CpG islands and one for normal sequence
The MC is hidden; only sample sequences are seen
Detect transitions to/from the CpG MC
Similar to a dishonest casino: transitions between fair and biased dice

Hidden Markov Models


HMM Basics
A Markov Chain: states & transition probabilities A=[a(i,j)]
Observable symbols for each state O(i)
A probability e(i,X) of emitting the symbol X at state i

[Figure: two-state HMM over {F, B}. Transitions: a(F,F)=0.6, a(F,B)=0.4, a(B,F)=0.8, a(B,B)=0.2. Emissions: e(F,H)=0.5, e(F,T)=0.5; e(B,H)=0.9, e(B,T)=0.1.]

Coin Example
Two-state MC: {F, B}, F = fair coin, B = biased coin
Emission probabilities may be written inside the state boxes or in separate emission boxes
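
To make the "hidden" part concrete, here is a small sketch (not from the slides) that encodes this coin HMM and samples a hidden state path together with the symbols it emits; in practice only the symbols are observed.

```python
# Sketch: the two-state coin HMM from the figure, with a simple sampler.
# Observers see only the emitted H/T symbols, not the F/B state path.
import random

trans = {"F": {"F": 0.6, "B": 0.4},      # a(i, j)
         "B": {"F": 0.8, "B": 0.2}}
emit  = {"F": {"H": 0.5, "T": 0.5},      # e(i, X)
         "B": {"H": 0.9, "T": 0.1}}

def pick(dist):
    """Sample a key from a {key: probability} dict."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key

def sample(n, state="F"):
    states, symbols = [], []
    for _ in range(n):
        states.append(state)
        symbols.append(pick(emit[state]))
        state = pick(trans[state])
    return "".join(states), "".join(symbols)

states, symbols = sample(10)
print("hidden:  ", states)    # e.g. FFFBBFFFBB  (not observed)
print("observed:", symbols)   # e.g. HTHHHHTHHH
```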

Example: transmembrane proteins


Hydrophilic/hydrophobic regions
[Figure: a two-state HMM over hydrophilic/hydrophobic regions, drawn analogously to the coin model; the figure reuses the same transition probabilities (.6/.4 and .8/.2) and emission probabilities (.5/.5 and .9/.1).]

HMM Profile Example (Non-gapped)


A state per site: each site's state emits a symbol with site-specific probabilities and moves to the next site with probability 1., ending at state E.

Emission probabilities (symbol counts over the 10 training sequences):

Site:  1    2    3    4    5    6
A  =  .0   1.   .1   .5   .9   .0
C  =  .0   .0   .1   .0   .1   .0
G  =  .2   .0   .0   .2   .0   .0
T  =  .8   .0   .8   .3   .0   1.

Training sequences:
TATGAT  TATAAT  TATAAT  TAATAT  TATAAT
TATTAT  GATAAT  GATACT  TACGAT  TATTAT
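
The emission table above is just the per-site symbol frequencies. A small sketch (my assumption of how the slide's numbers were obtained) that recomputes it from the training alignment and scores a query against the resulting non-gapped profile:

```python
# Sketch: estimate per-site emission probabilities of a non-gapped profile
# from aligned sequences, then score a query sequence against the profile.
from collections import Counter

training = ["TATGAT", "TATAAT", "TATAAT", "TAATAT", "TATAAT",
            "TATTAT", "GATAAT", "GATACT", "TACGAT", "TATTAT"]

def estimate_profile(seqs):
    """Return a list of {symbol: probability} dicts, one per site (column)."""
    profile = []
    for column in zip(*seqs):                 # iterate column by column
        counts = Counter(column)
        total = sum(counts.values())
        profile.append({x: c / total for x, c in counts.items()})
    return profile

def profile_prob(profile, query):
    """Probability that the profile emits the query (all transitions are 1)."""
    p = 1.0
    for dist, symbol in zip(profile, query):
        p *= dist.get(symbol, 0.0)
    return p

profile = estimate_profile(training)
print(profile[0])                       # site 1: {'T': 0.8, 'G': 0.2}
print(profile_prob(profile, "TATAAT"))  # 0.8 * 1 * 0.8 * 0.5 * 0.9 * 1 = 0.288
```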

How Do We Model Gaps?


A gap can result from a deletion or an insertion
Deletion = hidden delete state
Insertion= hidden insert state
[Figure: two gapped variants of the profile above, each trained on 10 aligned sequences.

Delete-state model: a silent delete state, emitting '-' with probability 1., lets a path bypass a match state; the preceding state splits its outgoing transition between the match state and the delete state (.7/.3), and both continue with probability 1. to the final state (T=1.).

Insert-state model: an insert state emitting A, C, G, T uniformly (.25 each) may be visited between two match states; the preceding state splits its transition between the direct path and the insert state (.8/.2).

Training alignments with gaps:
TATGAT  TATACT  TATAAT  TATTAT  TATAGT  TATT-T  GATAAT  GATA-T  TACG-T  TATTCT
TATGAT  TATACT  TATA-T  TATT-T  TATA-T  TATT-T  GATA-T  GATA-T  TACG-T  TATT-T]

Profile HMM
[Figure: profile HMM built from the multiple alignment below; the middle columns are treated as an insertion, handled by an insert state. Transition probabilities and output (emission) probabilities are estimated from the alignment.]

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Profile alignment
E.g., what is the most likely path to generate ACATATC?
How likely is ACATATC to be generated by this profile?


In General: HMM Sequence Profile


HMM For CpG Islands


Two connected generators: a CpG-island generator ('+' states) and a regular-sequence generator ('-' states)
States: A+, C+, G+, T+ and A-, C-, G-, T-; each state emits its own nucleotide

Transition probabilities within CpG islands (+):

      A      C      G      T
A+  0.180  0.274  0.426  0.120
C+  0.171  0.368  0.274  0.188
G+  0.161  0.339  0.375  0.125
T+  0.079  0.355  0.384  0.182

Transition probabilities outside CpG islands (-):

      A      C      G      T
A-  0.300  0.205  0.285  0.210
C-  0.322  0.298  0.078  0.302
G-  0.248  0.246  0.298  0.208
T-  0.177  0.239  0.292  0.292
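
The slide does not give the switching probabilities between the '+' and '-' chains, so the sketch below (not from the slides; it follows the standard use of these tables in Durbin ch. 3) simply scores a sequence under each chain separately and reports the log-odds: a positive score suggests a CpG island.

```python
# Sketch: score a sequence under the '+' (CpG island) and '-' (background)
# transition tables above; positive log-odds suggests an island.
import math

plus = {  # P(next | current) inside CpG islands
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
minus = {  # P(next | current) outside CpG islands
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}

def log_odds(seq):
    """Sum over adjacent pairs of log2 P+(x_k|x_{k-1}) - log2 P-(x_k|x_{k-1})."""
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        score += math.log2(plus[prev][cur]) - math.log2(minus[prev][cur])
    return score

print(log_odds("CGCGCGCG"))   # clearly positive: looks like a CpG island
print(log_odds("ATATATAT"))   # negative: looks like background sequence
```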

Modeling Gene Structure With HMM


Genes are organized into sequential functional regions
Regions have distinct statistical behaviors

[Figure: gene structure with exons/introns; splice sites mark the region boundaries.]

HMM Gene Models


HMM state = region; Markov transitions between regions
Emissions over {A,C,G,T}; different regions have different emission probabilities

[Figure: HMM gene model; example emission probabilities for one region: A=.29, C=.31, G=.04, T=.36]

Computing Probabilities on HMM


Path = a sequence of states
E.g., X = FFBBBF
Path probability: P(X) = 0.5·0.6·0.4·(0.2)²·0.8 = 3.84·10⁻³ (Start→F, then F→F, F→B, B→B, B→B, B→F)

Probability of a sequence emitted by a path: p(S|X)


E.g., p(HHHHHH|FFBBBF) = p(H|F)p(H|F)p(H|B)p(H|B)p(H|B)p(H|F) = (0.5)³(0.9)³ ≈ 0.09

Note: one usually avoids multiplications and computes with logarithms to minimize error propagation
[Figure: coin HMM with a Start state: Start→F = 0.5, Start→B = 0.5; a(F,F)=.6, a(F,B)=.4, a(B,F)=.8, a(B,B)=.2; e(F,H)=e(F,T)=0.5; e(B,H)=0.9, e(B,T)=0.1]
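
A minimal sketch (not from the slides) that reproduces these two quantities for the coin model in the figure above:

```python
# Sketch: probability of a hidden path, and of a sequence given that path,
# for the coin HMM above (Start -> F and Start -> B each with probability 0.5).
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.6, "B": 0.4},
         "B": {"F": 0.8, "B": 0.2}}
emit  = {"F": {"H": 0.5, "T": 0.5},
         "B": {"H": 0.9, "T": 0.1}}

def path_prob(path):
    """P(X): start probability times the chain of transition probabilities."""
    p = start[path[0]]
    for i, j in zip(path, path[1:]):
        p *= trans[i][j]
    return p

def emission_prob(seq, path):
    """P(S|X): product of the per-position emission probabilities."""
    p = 1.0
    for state, symbol in zip(path, seq):
        p *= emit[state][symbol]
    return p

print(path_prob("FFBBBF"))                # 0.5*0.6*0.4*0.2*0.2*0.8 = 0.00384
print(emission_prob("HHHHHH", "FFBBBF"))  # (0.5)^3 * (0.9)^3 ≈ 0.0911
```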

The Three Computational Problems of HMM


Decoding: what is the most likely sequence of transitions &
emissions that generated a given observed sequence?
Likelihood: how likely is an observed sequence to have
been generated by a given HMM?
Learning: how should transition and emission probabilities be
learned from observed sequences?


The Decoding Problem: Viterbi's Decoder


Input: an observed sequence S
Output: a hidden path X maximizing the joint probability P(X,S) (equivalently, P(X|S))
Key Idea (Viterbi): map to a dynamic programming problem


Describe the problem as optimizing a path over a grid


DP search: (a) compute scores of forward paths, (b) backtrack
Complexity: O(m²n) (m = number of states, n = sequence length)
[Figure: decoding as a path optimization over a grid: rows are the states (A1..A6), columns are the observed sequence positions s1, s2, s3, ...; the decoded state sequence is a path through the grid.]

Viterbi's Decoder
F(i,k) = probability of the most likely path to state i generating S1…Sk
Forward recursion:

F(i,k+1) = e(i,Sk+1) · max_j { F(j,k)·a(j,i) }
(best path to state i at position k+1)

Backtracking: start with highest F(i,n) and backtrack


Initialization: F(0,0)=1, F(i,0)=0 for i≠0
[Figure: the forward recursion on the grid: the value F(i,k+1) at column k+1 is obtained from the best predecessor term F(j,k)·a(j,i), e.g., F(6,k)·a(6,i), times the emission e(i,Sk+1).]
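
Below is a minimal sketch (not from the slides) of this recursion in plain Python, using the dishonest-coin model from the earlier figures as a demo. It works in probability space for clarity; a practical version would add logarithms as described in the computational note below.

```python
# Sketch of the Viterbi recursion F(i,k+1) = e(i,S_{k+1}) * max_j F(j,k)*a(j,i),
# with backtracking pointers to recover the most likely hidden path.
def viterbi(seq, states, start, trans, emit):
    """Return (most likely state path, its probability) for an observed seq."""
    F = {i: start[i] * emit[i][seq[0]] for i in states}   # best score ending in i
    backptr = []                                          # backptr[k][i] = best predecessor of i
    for symbol in seq[1:]:
        newF, ptr = {}, {}
        for i in states:
            j = max(states, key=lambda s: F[s] * trans[s][i])
            ptr[i] = j
            newF[i] = F[j] * trans[j][i] * emit[i][symbol]
        F = newF
        backptr.append(ptr)
    # Backtrack from the state with the highest final score.
    last = max(states, key=lambda i: F[i])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), F[last]

# Demo on the dishonest-coin model (Start -> F/B each with probability 0.5).
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.6, "B": 0.4}, "B": {"F": 0.8, "B": 0.2}}
emit  = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}
print(viterbi("HHHHHH", "FB", start, trans, emit))
# -> ('BFBFBF', 0.0037...): one most likely path and its probability
```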


Example: Dishonest Coin Tossing


what is the most likely sequence of transitions &
emissions to explain the observation: S=HHHHHH
[Figure: Viterbi trellis for S = HHHHHH over states F and B, with Start→F = Start→B = 0.5. First column: F = 0.5·0.5 = 0.25 and B = 0.5·0.9 = 0.45; each later column is filled from the recursion using the edge weights a(j,i)·e(i,H): F→F contributes .6·.5 = .3, F→B contributes .4·.9 = .36, B→F contributes .8·.5 = .4, B→B contributes .2·.9 = .18. The most likely path and its weight w are read off by backtracking from the largest value in the last column.]

Example: CpG Islands


Given the observed sequence CGCG, what is the likely state sequence generating it?
[Figure: Viterbi trellis for S = CGCG over the eight states A+, C+, G+, T+, A-, C-, G-, T-, with columns 0-4 starting from Start; since each state emits only its own nucleotide, only the C± and G± states are reachable at each column.]


Computational Note
Computing long products of probabilities propagates numerical errors
Instead of multiplying probabilities, add log-likelihoods
Define f(i,k) = log F(i,k)
f(i,k+1) = log e(i,Sk+1) + max_j { f(j,k) + log a(j,i) }
Or, define the weight w(i,j,k) = log e(i,Sk+1) + log a(j,i)
to get the standard DP formulation
f(i,k+1) = max_j { f(j,k) + w(i,j,k) }

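
A small sketch (not from the slides) of these edge weights for the coin model; base-2 logarithms are used to match the numbers in the next example.

```python
# Sketch: log-space Viterbi weights w(i,j,k) = log2 e(i,S_{k+1}) + log2 a(j,i)
# for the coin model; the DP adds these weights instead of multiplying probabilities.
from math import log2

trans = {"F": {"F": 0.6, "B": 0.4}, "B": {"F": 0.8, "B": 0.2}}
emit  = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}

def w(i, j, symbol):
    """Weight of moving from state j to state i while emitting symbol."""
    return log2(emit[i][symbol]) + log2(trans[j][i])

for j in "FB":
    for i in "FB":
        print(f"w({i},{j},H) = {w(i, j, 'H'):+.2f}")
# e.g. w(F,F,H) = log2(.5) + log2(.6) = -1.74 and w(B,F,H) = log2(.9) + log2(.4) = -1.47
```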

Example
What is the most likely sequence of transitions & emissions to explain the observation S = HHHHHH? (using base-2 logarithms)

[Figure: the same trellis in log space. Emission logs: lg e(F,H) = -1, lg e(F,T) = -1, lg e(B,H) = -0.15, lg e(B,T) = -3.32; transition logs: lg .6 = -0.74, lg .4 = -1.32, lg .8 = -0.32, lg .2 = -2.32; start logs: lg .5 = -1 to each of F and B. Edge weights for emitting H: w(F,S) = -2, w(B,S) = -1.15, w(F,F) = -0.74-1 = -1.74, w(B,F) = -1.32-0.15 = -1.47, w(F,B) = -0.32-1 = -1.32, w(B,B) = -2.32-0.15 = -2.47. The DP f(i,k+1) = max_j{f(j,k) + w(i,j,k)} then fills the trellis by addition, e.g., f(F,1) = -2, f(B,1) = -1.15, f(F,2) = -2.47, f(B,2) = -3.47.]


Concluding Notes
Viterbi decoding: hidden pathway of an observed sequence
Hidden pathway explains the underlying structure
E.g., identify CpG islands
E.g., align a sequence against a profile
E.g., determine gene structure
..

This leaves the two other HMM computational problems


How do we extract an HMM model from observed sequences?
How do we compute the likelihood of a given sequence?


