
Computational Biology, Part 5

Hidden Markov Models
Robert F. Murphy
Copyright 2005-2009.
All rights reserved.

Markov chains
If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain.
A Markov chain is defined as a sequence of states in which each state depends only on the previous state.

Formalism for Markov chains

$M = (Q, \pi, P)$ is a Markov chain, where
$Q$ = vector $(1, .., n)$ is the list of states
    $Q(1) = A$, $Q(2) = C$, $Q(3) = G$, $Q(4) = T$ for DNA
$\pi$ = vector $(p_1, .., p_n)$ is the initial probability of each state
    $\pi(i) = p_{Q(i)}$ (e.g., $\pi(1) = p_A$ for DNA)
$P$ = $n \times n$ matrix where the entry in row $i$ and column $j$ is the probability of observing state $j$ if the previous state is $i$, and the sum of entries in each row is 1 (dinucleotide probabilities)
    $P(i,j) = p^*_{Q(i)Q(j)}$ (e.g., $P(1,2) = p^*_{AC}$ for DNA)

Generating Markov chains

Given $Q$, $\pi$, $P$ (and a random number generator), we can generate sequences that are members of the Markov chain $M$.
If $\pi$, $P$ are derived from a single sequence, the family of sequences generated by $M$ will include that sequence as well as many others.
If $\pi$, $P$ are derived from a sampled set of sequences, the family of sequences generated by $M$ will be the population from which that set has been sampled.

Interactive Demonstration

(A11 Markov chains)

Matlab code for generating Markov chains
chars = ['a' 'c' 'g' 't'];

% the dinucs array shows the frequency of observing the character in the
% row followed by the character in the column
% these values show strong preference for cc
dinucs = [2, 1, 2, 0; 0, 8, 0, 1; 2, 0, 2, 0; 1, 0, 0, 1];
% these values restrict transitions more
% dinucs = [2, 0, 2, 0; 0, 8, 0, 0; 2, 0, 2, 0; 1, 1, 0, 1];

% calculate mononucleotide frequencies only as the probability of
% starting with each nucleotide
monocounts = sum(dinucs, 2);
monofreqs = monocounts / sum(monocounts);
cmonofreqs = cumsum(monofreqs);

% calculate dinucleotide frequencies and cumulative dinuc freqs
freqs = dinucs ./ repmat(monocounts, 1, 4);
cfreqs = cumsum(freqs, 2);
disp('Dinucleotide frequencies (transition probabilities)');
fprintf('  %c        %c        %c        %c\n', chars)
for i = 1:4
    fprintf('%c %f %f %f %f\n', chars(i), freqs(i,:))
end

nseq = 10;
for ntries = 1:20
    rnums = rand(nseq, 1);
    % start sequence using mononucleotide frequencies
    seq(1) = min(find(cmonofreqs >= rnums(1)));
    for i = 2:nseq
        % extend it using the appropriate row from the dinuc freqs
        seq(i) = min(find(cfreqs(seq(i-1), :) >= rnums(i)));
    end

    output = chars(seq);
    disp(strvcat(output));
end

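Discriminating between two states with Markov chains

To determine which of two states a sequence is more likely to have resulted from, we calculate

$$S(x) = \log \frac{P(x \mid \text{model}+)}{P(x \mid \text{model}-)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}}$$

$$S(x) = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$$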

State probabilities for + and - models

Given example sequences that are from either the + model (CpG island) or the - model (not CpG island), we can calculate the probability that each nucleotide will occur given the previous nucleotide for each model (the $a$ values for each model).
Rows are the previous nucleotide, columns the next nucleotide.

+     A      C      G      T          -     A      C      G      T
A   0.180  0.274  0.426  0.120        A   0.300  0.205  0.285  0.210
C   0.171  0.368  0.274  0.188        C   0.322  0.298  0.078  0.302
G   0.161  0.339  0.375  0.125        G   0.248  0.246  0.298  0.208
T   0.079  0.355  0.384  0.182        T   0.177  0.239  0.292  0.292

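Transition probabilities converted to log likelihood ratios

Each entry is $\beta = \log_2(a^{+} / a^{-})$ for the row nucleotide followed by the column nucleotide.

β      A       C       G       T
A   -0.740   0.419   0.580  -0.803
C   -0.913   0.302   1.812  -0.685
G   -0.624   0.461   0.331  -0.730
T   -1.169   0.573   0.393  -0.679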

Example
What is the relative probability of C+G+C+ compared with C-G-C-?
First calculate the log-odds ratio:
$S(CGC) = \beta(CG) + \beta(GC) = 1.812 + 0.461 = 2.273$
Convert to relative probability:
$2^{2.273} = 4.833$
Relative probability is the ratio of (+) to (-):
$P(+) = 4.833 \, P(-)$

Convert to percentage:
$P(+) + P(-) = 1$
$4.833 \, P(-) + P(-) = 1$
$P(-) = 1/5.833 = 17\%$
Conclusion:
$P(+) = 83\%$, $P(-) = 17\%$
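
The arithmetic in this example can be checked with a short script. This is a minimal sketch, not part of the original slides: it recomputes the log likelihood ratios from the + and - transition tables and scores the sequence CGC. The variable names (aPlus, aMinus, beta, query) are chosen here for illustration, and the final step assumes, as the example does, that P(+) + P(-) = 1.

```matlab
% Transition probabilities copied from the + and - tables
% (rows: previous nucleotide, columns: next nucleotide, order A C G T)
aPlus  = [0.180 0.274 0.426 0.120;
          0.171 0.368 0.274 0.188;
          0.161 0.339 0.375 0.125;
          0.079 0.355 0.384 0.182];
aMinus = [0.300 0.205 0.285 0.210;
          0.322 0.298 0.078 0.302;
          0.248 0.246 0.298 0.208;
          0.177 0.239 0.292 0.292];
beta = log2(aPlus ./ aMinus);    % log likelihood ratios in bits

chars = 'ACGT';
query = 'CGC';                   % sequence to score
[~, idx] = ismember(query, chars);

% S(x): sum of beta over consecutive nucleotide pairs
S = 0;
for i = 2:numel(idx)
    S = S + beta(idx(i-1), idx(i));
end

ratio = 2^S;                     % P(+) / P(-)
pPlus = ratio / (1 + ratio);     % assumes P(+) + P(-) = 1
fprintf('S = %.3f, ratio = %.3f, P(+) = %.0f%%\n', S, ratio, 100 * pPlus);
```

Because the script recomputes beta from the three-decimal table entries, S and the ratio agree with the slide values only to about two decimal places.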

Hidden Markov models

"Hidden" connotes that the sequence is generated by two or more states that have different transition probability matrices.

More definitions
$\pi_i$ = state at position $i$ in a path
$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
    probability of going from one state to another
    "transition probability"
$e_k(b) = P(x_i = b \mid \pi_i = k)$
    probability of emitting a $b$ when in state $k$
    "emission probability"

Generating sequences (see previous example code)

% force emission to match state (normal Markov model, not hidden)
emit = diag(repmat(1, 4, 1));
[seq2, states] = hmmgenerate(10, freqs, emit)
output2 = chars(seq2);
disp(strvcat(output2));

Decoding
The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence.
This is called "decoding" in the jargon of speech recognition.

More definitions

We can calculate the joint probability of a sequence $x$ and a state sequence $\pi$

$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$$

requiring $\pi_{L+1} = 0$
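
The joint probability is a single product along the path, which the following sketch evaluates. The two-state DNA model here (state 1 AT-rich, state 2 GC-rich) and all of its numbers are hypothetical values chosen for illustration, not taken from the slides. The vector aEnd stands in for $a_{k0}$ and is set to 1 on the assumption that the end of the sequence is not modelled.

```matlab
% Hypothetical two-state model (assumed values, for illustration only)
a    = [0.9 0.1; 0.2 0.8];                  % a(k,l): transition k -> l
e    = [0.4 0.1 0.1 0.4; 0.1 0.4 0.4 0.1];  % e(k,b): emissions, order a c g t
a0   = [0.5; 0.5];                          % a_0k: begin -> state k
aEnd = [1; 1];                              % a_k0: assumed 1 (end not modelled)

x    = [3 3 2 1];   % observed sequence g g c a as indices into 'acgt'
path = [2 2 2 1];   % state sequence pi

L = numel(x);
p = a0(path(1));                            % a_{0 pi_1}
for i = 1:L
    p = p * e(path(i), x(i));               % e_{pi_i}(x_i)
    if i < L
        p = p * a(path(i), path(i+1));      % a_{pi_i pi_{i+1}}
    else
        p = p * aEnd(path(i));              % pi_{L+1} = 0 (end state)
    end
end
fprintf('P(x, pi) = %g\n', p);
```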

Determining the optimal path: the Viterbi algorithm
The Viterbi algorithm is a form of dynamic programming.
Definition: Let $v_k(i)$ be the probability of the most probable path ending in state $k$ with observation $i$.

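Initialisation ($i = 0$):
    $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
Recursion ($i = 1..L$):
    $v_l(i) = e_l(x_i) \max_k \left( v_k(i-1) \, a_{kl} \right)$
    $\mathrm{ptr}_i(l) = \arg\max_k \left( v_k(i-1) \, a_{kl} \right)$
Termination:
    $P(x, \pi^*) = \max_k \left( v_k(L) \, a_{k0} \right)$
    $\pi^*_L = \arg\max_k \left( v_k(L) \, a_{k0} \right)$
Traceback ($i = L..1$):
    $\pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i)$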
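
The recursion and traceback can be sketched in a few lines of Matlab. This is an illustrative sketch under stated assumptions rather than the course's own code: the two-state model and its numbers are hypothetical, aEnd plays the role of $a_{k0}$ and is assumed to be 1, and the begin state is folded into the first column so that $v_l(1) = a_{0l} e_l(x_1)$. Probabilities are multiplied directly, which would underflow on long sequences; working with logs is the usual remedy.

```matlab
% Hypothetical two-state model (assumed values, for illustration only)
a    = [0.9 0.1; 0.2 0.8];                  % a(k,l): transition k -> l
e    = [0.4 0.1 0.1 0.4; 0.1 0.4 0.4 0.1];  % e(k,b): emissions, order a c g t
a0   = [0.5; 0.5];                          % a_0k: begin -> state k
aEnd = [1; 1];                              % a_k0: assumed 1 (end not modelled)
x    = [3 3 2 1 4 1];                       % g g c a t a as indices into 'acgt'

K = size(a, 1);
L = numel(x);
v   = zeros(K, L);      % v(k,i): best path probability ending in k at i
ptr = zeros(K, L);      % ptr(l,i): best predecessor of state l at i

% First column: v_0(0) = 1, so v_l(1) = a_0l * e_l(x_1)
v(:, 1) = a0 .* e(:, x(1));

% Recursion
for i = 2:L
    for l = 1:K
        [best, ptr(l, i)] = max(v(:, i-1) .* a(:, l));
        v(l, i) = e(l, x(i)) * best;
    end
end

% Termination
[pBest, last] = max(v(:, L) .* aEnd);

% Traceback
path = zeros(1, L);
path(L) = last;
for i = L:-1:2
    path(i-1) = ptr(path(i), i);
end
disp(path);
fprintf('P(x, pi*) = %g\n', pBest);
```

With these assumed parameters the decoded path switches from state 2 to state 1 where the sequence changes from g/c to a/t.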

Block Diagram for Viterbi Algorithm
[Figure: block diagram. Inputs: alphabet, emission probabilities, transition probabilities, sequence. Process: Viterbi Algorithm. Output: most probable state sequence.]

Multiple paths can give the same sequence
The Viterbi algorithm finds the most likely path given a sequence.
Other paths could also give rise to the same sequence.
How do we calculate the probability of a sequence given an HMM?

Probability of a sequence
Sum the probabilities of all possible paths that give that sequence.
Let $P(x)$ be the probability of observing sequence $x$ given an HMM

$$P(x) = \sum_{\pi} P(x, \pi)$$

We can find $P(x)$ using a variation on the Viterbi algorithm that uses sum instead of max.
This is called the forward algorithm.
Replace $v_k(i)$ with $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$

Forward algorithm
Initialisation ($i = 0$):
    $f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$
Recursion ($i = 1..L$):

$$f_l(i) = e_l(x_i) \sum_k f_k(i-1) \, a_{kl}$$

Termination:

$$P(x) = \sum_k f_k(L) \, a_{k0}$$
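
A minimal sketch of the forward recursion, written as matrix-vector products. The two-state model and its numbers are hypothetical, aEnd stands in for $a_{k0}$ and is assumed to be 1, and the begin state is folded into the first column as $f_l(1) = a_{0l} e_l(x_1)$. No scaling is applied, so this form is only suitable for short sequences.

```matlab
% Hypothetical two-state model (assumed values, for illustration only)
a    = [0.9 0.1; 0.2 0.8];                  % a(k,l): transition k -> l
e    = [0.4 0.1 0.1 0.4; 0.1 0.4 0.4 0.1];  % e(k,b): emissions, order a c g t
a0   = [0.5; 0.5];                          % a_0k: begin -> state k
aEnd = [1; 1];                              % a_k0: assumed 1 (end not modelled)
x    = [3 3 2 1 4 1];                       % g g c a t a as indices into 'acgt'

K = size(a, 1);
L = numel(x);
f = zeros(K, L);                            % f(k,i) = P(x_1..x_i, pi_i = k)

% First column: f_0(0) = 1, so f_l(1) = a_0l * e_l(x_1)
f(:, 1) = a0 .* e(:, x(1));

% Recursion: (a' * f(:,i-1)) gives sum_k f_k(i-1) a_kl for every l
for i = 2:L
    f(:, i) = e(:, x(i)) .* (a' * f(:, i-1));
end

% Termination
Px = sum(f(:, L) .* aEnd);
fprintf('P(x) = %g\n', Px);
```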

Backward algorithm
We may need to know the probability that a particular observation $x_i$ came from a particular state $k$ given a sequence $x$, $P(\pi_i = k \mid x)$.
Use an algorithm analogous to the forward algorithm but starting from the end.

Initialisation ($i = L$):
    $b_k(L) = a_{k0}$ for all $k$
Recursion ($i = L-1, \ldots, 1$):

$$b_k(i) = \sum_l a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)$$

Termination:

$$P(x) = \sum_l a_{0l} \, e_l(x_1) \, b_l(1)$$
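
The backward recursion can be sketched in the same matrix form, filling the table from the last column to the first. As before, the two-state model and its numbers are hypothetical and aEnd (standing in for $a_{k0}$) is assumed to be 1. The value of $P(x)$ from the termination step should match the one given by the forward algorithm for the same model and sequence.

```matlab
% Hypothetical two-state model (assumed values, for illustration only)
a    = [0.9 0.1; 0.2 0.8];                  % a(k,l): transition k -> l
e    = [0.4 0.1 0.1 0.4; 0.1 0.4 0.4 0.1];  % e(k,b): emissions, order a c g t
a0   = [0.5; 0.5];                          % a_0k: begin -> state k
aEnd = [1; 1];                              % a_k0: assumed 1 (end not modelled)
x    = [3 3 2 1 4 1];                       % g g c a t a as indices into 'acgt'

K = size(a, 1);
L = numel(x);
b = zeros(K, L);                            % b(k,i) = P(x_{i+1}..x_L | pi_i = k)

% Initialisation at i = L
b(:, L) = aEnd;

% Recursion: a * (e .* b) gives sum_l a_kl e_l(x_{i+1}) b_l(i+1) for every k
for i = L-1:-1:1
    b(:, i) = a * (e(:, x(i+1)) .* b(:, i+1));
end

% Termination
Px = sum(a0 .* e(:, x(1)) .* b(:, 1));
fprintf('P(x) = %g\n', Px);
```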

Estimating probability of state at particular position

Combine the forward and backward probabilities to estimate the posterior probability of the sequence being in a particular state at a particular position

$$P(\pi_i = k \mid x) = \frac{f_k(i) \, b_k(i)}{P(x)}$$
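
Putting the two tables together gives the posterior for every state and position at once. The sketch below assumes the same hypothetical two-state model used for illustration elsewhere (all numbers are made up, and $a_{k0}$ is assumed to be 1). Each column of post should sum to 1, which is a convenient sanity check.

```matlab
% Hypothetical two-state model (assumed values, for illustration only)
a  = [0.9 0.1; 0.2 0.8];                    % a(k,l): transition k -> l
e  = [0.4 0.1 0.1 0.4; 0.1 0.4 0.4 0.1];    % e(k,b): emissions, order a c g t
a0 = [0.5; 0.5];                            % a_0k: begin -> state k
x  = [3 3 2 1 4 1];                         % g g c a t a as indices into 'acgt'

K = size(a, 1);
L = numel(x);

% Forward table
f = zeros(K, L);
f(:, 1) = a0 .* e(:, x(1));
for i = 2:L
    f(:, i) = e(:, x(i)) .* (a' * f(:, i-1));
end

% Backward table (a_k0 assumed 1, so b_k(L) = 1)
b = ones(K, L);
for i = L-1:-1:1
    b(:, i) = a * (e(:, x(i+1)) .* b(:, i+1));
end

Px   = sum(f(:, L));                        % P(x) with a_k0 = 1
post = f .* b / Px;                         % post(k,i) = P(pi_i = k | x)
disp(post);
```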

Parameter estimation for HMMs
Simple when the state sequence is known for training examples.
Can be very complex for unknown paths.

Estimation when state sequence known
Count the number of times each transition occurs, $A_{kl}$.
Count the number of times each emission occurs from each state, $E_k(b)$.
Convert to probabilities

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
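
The counting procedure can be sketched for a single labelled training sequence. The sequence and its state path below are made-up illustration data, and Acount and Ecount are names chosen here for $A_{kl}$ and $E_k(b)$. The normalisation divides each row by its sum, so a state that never appears in the training path would produce a division by zero; this sketch does not handle that case.

```matlab
% Made-up training example with a known state path (illustration only)
x    = [3 3 2 1 4 1];   % g g c a t a as indices into 'acgt'
path = [2 2 2 1 1 1];   % known state at each position
K = 2;                  % number of states
M = 4;                  % alphabet size

Acount = zeros(K, K);   % Acount(k,l): number of k -> l transitions
Ecount = zeros(K, M);   % Ecount(k,b): number of times state k emits b
for i = 1:numel(x)
    Ecount(path(i), x(i)) = Ecount(path(i), x(i)) + 1;
    if i > 1
        Acount(path(i-1), path(i)) = Acount(path(i-1), path(i)) + 1;
    end
end

% Convert counts to probabilities (each row sums to 1)
a = Acount ./ repmat(sum(Acount, 2), 1, K);
e = Ecount ./ repmat(sum(Ecount, 2), 1, M);
disp(a);
disp(e);
```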

Baum-Welch
Make initial parameter estimates.
Use the forward algorithm and backward algorithm to calculate the probability of each sequence according to the model.
Calculate new model parameters.
Repeat until the termination criterion is met (change in log likelihood < threshold).

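Estimating transition frequencies
Probability that $a_{kl}$ is used at position $i$ in sequence $x$

$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i) \, a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)}{P(x)}$$

Sum over all positions ($i$) and all sequences ($j$) to get the expected number of times $a_{kl}$ is used

$$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f^j_k(i) \, a_{kl} \, e_l(x^j_{i+1}) \, b^j_l(i+1)$$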

Estimating emission frequencies

Sum over all positions for which the emitted character is $b$ and all sequences

$$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x^j_i = b\}} f^j_k(i) \, b^j_k(i)$$

Updating model parameters

Convert expected numbers to probabilities as if the expected numbers were actual counts

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

Test for termination

Calculate the log likelihood of the model for all of the sequences using the new parameters

$$\sum_{j=1}^{n} \log P(x^j \mid \theta)$$

If the change in log likelihood exceeds some threshold, go back and make new estimates of $a$ and $e$.
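
The whole Baum-Welch loop can be sketched for the simplest case of one training sequence. This is an illustration under several assumptions that are not in the slides: the starting parameters and the training sequence are made up, the begin probabilities a0 are held fixed, $a_{k0}$ is taken to be 1, and no scaling is used, so it is only suitable for short sequences. The names threshold, maxIter and llHistory are chosen here for illustration.

```matlab
% Made-up initial estimates and training sequence (illustration only)
a  = [0.9 0.1; 0.2 0.8];                    % initial a(k,l)
e  = [0.4 0.1 0.1 0.4; 0.1 0.4 0.4 0.1];    % initial e(k,b), order a c g t
a0 = [0.5; 0.5];                            % begin probabilities, held fixed
x  = [3 3 2 1 4 1 1 4 3 2 3 3];             % indices into 'acgt'

K = size(a, 1);
M = size(e, 2);
L = numel(x);
threshold = 1e-6;                           % stop when log likelihood gain < this
maxIter   = 200;
prevLL    = -Inf;
llHistory = [];

for iter = 1:maxIter
    % Forward and backward tables under the current parameters
    f = zeros(K, L);
    f(:, 1) = a0 .* e(:, x(1));
    for i = 2:L
        f(:, i) = e(:, x(i)) .* (a' * f(:, i-1));
    end
    b = ones(K, L);                         % b_k(L) = a_k0 = 1 (assumed)
    for i = L-1:-1:1
        b(:, i) = a * (e(:, x(i+1)) .* b(:, i+1));
    end
    Px = sum(f(:, L));

    % Test for termination on the change in log likelihood
    ll = log(Px);
    llHistory(end+1) = ll;
    if ll - prevLL < threshold
        break;
    end
    prevLL = ll;

    % Expected transition counts A_kl and emission counts E_k(b)
    Acount = zeros(K, K);
    Ecount = zeros(K, M);
    for i = 1:L-1
        Acount = Acount + (f(:, i) * (e(:, x(i+1)) .* b(:, i+1))') .* a / Px;
    end
    for i = 1:L
        Ecount(:, x(i)) = Ecount(:, x(i)) + f(:, i) .* b(:, i) / Px;
    end

    % Update parameters as if the expected counts were actual counts
    a = Acount ./ repmat(sum(Acount, 2), 1, K);
    e = Ecount ./ repmat(sum(Ecount, 2), 1, M);
end
fprintf('stopped after %d iterations, log likelihood = %f\n', iter, ll);
```

Each recorded log likelihood is computed before the corresponding update, so llHistory should never decrease from one iteration to the next. With several training sequences, the expected counts would be summed over sequences before the update, as in the formulas above.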
