Lectures

Computational Biology, Part 5
Hidden Markov Models
Robert F. Murphy
Copyright 2005-2009.
All rights reserved.
Markov chains
If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain
A Markov chain is defined as a sequence of states in which each state depends only on the previous state
Formalism for Markov chains
$M = (Q, \pi, P)$ is a Markov chain, where
$Q$ = vector $(1, \ldots, n)$ is the list of states
Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA
$\pi$ = vector $(p_1, \ldots, p_n)$ is the initial probability of each state
$\pi(i) = p_{Q(i)}$ (e.g., $\pi(1) = p_A$ for DNA)
$P$ = $n \times n$ matrix where the entry in row $i$ and column $j$ is the probability of observing state $j$ if the previous state is $i$, and the sum of entries in each row is 1 (dinucleotide probabilities)
$P(i,j) = p^*_{Q(i)Q(j)}$ (e.g., $P(1,2) = p^*_{AC}$ for DNA)
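To make the formalism concrete, the following is a minimal MATLAB sketch that builds $\pi$ and $P$ from a count matrix and scores one short sequence as $\pi(x_1)\prod_i P(x_{i-1}, x_i)$. The counts, the variable names, and the test sequence `'acc'` are illustrative assumptions, not values prescribed by the lecture.

```matlab
% Illustrative dinucleotide counts (assumed values):
% row = previous base, column = next base, in the order a c g t
counts = [2 1 2 0; 0 8 0 1; 2 0 2 0; 1 0 0 1];
Q = 'acgt';                                  % list of states
rowTotals = sum(counts, 2);
piVec = rowTotals / sum(rowTotals);          % initial probability of each state
P = counts ./ repmat(rowTotals, 1, 4);       % transition matrix, rows sum to 1

% Probability of one sequence under M = (Q, pi, P)
x = 'acc';
[~, idx] = ismember(x, Q);                   % map characters to state indices
prob = piVec(idx(1));
for i = 2:numel(idx)
    prob = prob * P(idx(i-1), idx(i));
end
fprintf('P(%s) = %.4f\n', x, prob);
```

Because each state depends only on the previous one, the sequence probability factors into one initial term and one transition term per subsequent position.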
Generating Markov chains
Given $Q$, $\pi$, $P$ (and a random number generator), we can generate sequences that are members of the Markov chain $M$
If $\pi$, $P$ are derived from a single sequence, the family of sequences generated by $M$ will include that sequence as well as many others
If $\pi$, $P$ are derived from a sampled set of sequences, the family of sequences generated by $M$ will be the population from which that set has been sampled
Interactive Demonstration
(A11 Markov chains)
Matlab code for generating Markov chains
chars = ['a' 'c' 'g' 't'];
% the dinucs array shows the frequency of observing the character in the
% row followed by the character in the column
% these values show strong preference for cc
dinucs = [2, 1, 2, 0; 0, 8, 0, 1; 2, 0, 2, 0; 1, 0, 0, 1];
% these values restrict transitions more
% dinucs = [2, 0, 2, 0; 0, 8, 0, 0; 2, 0, 2, 0; 1, 1, 0, 1];
% calculate mononucleotide frequencies only as the probability of
% starting with each nucleotide
monocounts = sum(dinucs, 2);
monofreqs = monocounts / sum(monocounts);
cmonofreqs = cumsum(monofreqs);
% calculate dinucleotide frequencies and cumulative dinuc freqs
freqs = dinucs ./ repmat(monocounts, 1, 4);
cfreqs = cumsum(freqs, 2);
disp('Dinucleotide frequencies (transition probabilities)');
fprintf('  %c        %c        %c        %c\n', chars)
for i = 1:4
    fprintf('%c %f %f %f %f\n', chars(i), freqs(i,:))
end
nseq = 10;
seq = zeros(1, nseq);   % preallocate the state indices
for ntries = 1:20
    rnums = rand(nseq, 1);
    % start sequence using mononucleotide frequencies
    seq(1) = find(cmonofreqs >= rnums(1), 1);
    for i = 2:nseq
        % extend it using the appropriate row from the dinuc freqs
        seq(i) = find(cfreqs(seq(i-1),:) >= rnums(i), 1);
    end
    output = chars(seq);
    disp(output);
end
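Discriminating between two states with Markov chains
To determine which of two states a sequence is more likely to have resulted from, we calculate
$$S(x) = \log \frac{P(x \mid \text{model}^+)}{P(x \mid \text{model}^-)} = \sum_{i=1}^{L} \log \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}}$$
$$S(x) = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$$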
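State probabilities for + and - models
Given example sequences that are from either the + model (CpG island) or the - model (not CpG island), we can calculate the probability that each nucleotide will occur for each model (the $a$ values for each model)
Rows give the previous nucleotide, columns give the next nucleotide; each row sums to 1
+ model
       A      C      G      T
A  0.180  0.274  0.426  0.120
C  0.171  0.368  0.274  0.188
G  0.161  0.339  0.375  0.125
T  0.079  0.355  0.384  0.182
- model
       A      C      G      T
A  0.300  0.205  0.285  0.210
C  0.322  0.298  0.078  0.302
G  0.248  0.246  0.298  0.208
T  0.177  0.239  0.292  0.292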
Transition probabilities converted to log likelihood ratios (log base 2)
        A       C       G       T
A  -0.740   0.419   0.580  -0.803
C  -0.913   0.302   1.812  -0.685
G  -0.624   0.461   0.331  -0.730
T  -1.169   0.573   0.393  -0.679
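The conversion from the two transition tables to log likelihood ratios can be sketched in MATLAB as below. The matrix values are copied from the + and - tables; the variable names `aPlus`, `aMinus`, and `beta` are assumptions made for this sketch. Because the tables are already rounded to three decimals, the recomputed ratios may differ from the tabulated ones in the third decimal place.

```matlab
% Transition probabilities for the + and - models
% (row = previous base, column = next base, in the order A C G T)
aPlus  = [0.180 0.274 0.426 0.120;
          0.171 0.368 0.274 0.188;
          0.161 0.339 0.375 0.125;
          0.079 0.355 0.384 0.182];
aMinus = [0.300 0.205 0.285 0.210;
          0.322 0.298 0.078 0.302;
          0.248 0.246 0.298 0.208;
          0.177 0.239 0.292 0.292];

% Log likelihood ratio for every transition, in bits (log base 2)
beta = log2(aPlus ./ aMinus);
disp(round(beta * 1000) / 1000);
```

Positive entries mark transitions that are more likely inside a CpG island, negative entries mark transitions that are more likely outside one.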
Example
What is the relative probability of C+G+C+ compared with C-G-C-?
First calculate the log-odds ratio:
$S(CGC) = \beta(CG) + \beta(GC) = 1.812 + 0.461 = 2.273$
Convert to relative probability:
$2^{2.273} = 4.833$
Relative probability is the ratio of (+) to (-):
$P(+) = 4.833\,P(-)$
Example
Convert to percentage
$P(+) + P(-) = 1$
$4.833\,P(-) + P(-) = 1$
$P(-) = 1/5.833 = 17\%$
Conclusion
$P(+) = 83\%$, $P(-) = 17\%$
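The arithmetic of this example can be checked with a few lines of MATLAB. This is only a sketch: the two log ratios are taken from the table of log likelihood ratios, and the variable names are chosen for illustration.

```matlab
betaCG = 1.812;                 % log2 ratio for the transition C -> G
betaGC = 0.461;                 % log2 ratio for the transition G -> C
S = betaCG + betaGC;            % log-odds score of CGC, in bits
ratio = 2^S;                    % relative probability P(+)/P(-)
pMinus = 1 / (1 + ratio);       % from P(+) + P(-) = 1
pPlus = ratio * pMinus;
fprintf('S = %.3f, ratio = %.3f, P(+) = %.0f%%, P(-) = %.0f%%\n', ...
        S, ratio, 100 * pPlus, 100 * pMinus);
```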
Hidden Markov models
"Hidden" connotes that the sequence is generated by two or more states that have different transition probability matrices
More definitions
$\pi_i$ = state at position $i$ in a path
$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
probability of going from one state to another: "transition probability"
$e_k(b) = P(x_i = b \mid \pi_i = k)$
probability of emitting a $b$ when in state $k$: "emission probability"
Generating sequences (see previous example code)
% force emission to match state (normal Markov model, not hidden)
% hmmgenerate is part of the Statistics and Machine Learning Toolbox
emit = eye(4);
[seq2, states] = hmmgenerate(10, freqs, emit)
output2 = chars(seq2);
disp(output2);
Decoding
The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence
This is called "decoding" in the jargon of speech recognition
More definitions
We can calculate the joint probability of a sequence $x$ and a state sequence $\pi$
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$
requiring $\pi_{L+1} = 0$
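The joint probability can be evaluated directly by walking along one path. The sketch below uses a hypothetical two-state HMM over the alphabet a, c, g, t; every parameter value, the observed sequence, and the candidate path are assumptions made for illustration. It also assumes that no explicit end state is modelled, so $a_{k0}$ is set to 1.

```matlab
% Hypothetical two-state HMM (all values are illustrative assumptions)
a0   = [0.5; 0.5];              % a_{0k}: begin state -> state k
a    = [0.9 0.1; 0.2 0.8];      % a_{kl}: state k -> state l
aEnd = [1; 1];                  % a_{k0}: assumed 1 (no explicit end state)
e    = [0.1 0.4 0.4 0.1;        % e_1(b): state 1 favours c and g
        0.3 0.2 0.2 0.3];       % e_2(b): state 2 favours a and t
x    = [2 3 2 1 4];             % sequence 'cgcat' as indices into 'acgt'
path = [1 1 1 2 2];             % one candidate state sequence pi

L = numel(x);
p = a0(path(1));                % a_{0 pi_1}
for i = 1:L
    p = p * e(path(i), x(i));   % emission at position i
    if i < L
        p = p * a(path(i), path(i+1));   % transition to the next state
    else
        p = p * aEnd(path(i));           % closing transition, pi_{L+1} = 0
    end
end
fprintf('P(x, pi) = %.6g\n', p);
```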
Determining the optimal path: the Viterbi algorithm
The Viterbi algorithm is a form of dynamic programming
Definition: Let $v_k(i)$ be the probability of the most probable path ending in state $k$ with observation $i$
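Determining the optimal path: the Viterbi algorithm
Initialisation (i=0):
$v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
Recursion (i=1..L):
$v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$
$\mathrm{ptr}_i(l) = \arg\max_k \left( v_k(i-1)\, a_{kl} \right)$
Termination:
$P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$
$\pi_L^* = \arg\max_k \left( v_k(L)\, a_{k0} \right)$
Traceback (i=L..1):
$\pi_{i-1}^* = \mathrm{ptr}_i(\pi_i^*)$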
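The recursion, termination, and traceback steps can be sketched in MATLAB as follows. The two-state model and the observed sequence are hypothetical, and the sketch assumes $a_{k0} = 1$ (no explicit end state). The begin state is folded into the first column of `v`, which corresponds to applying the recursion once to $v_0(0) = 1$. For long sequences one would normally work with log probabilities to avoid underflow; plain probabilities are kept here to mirror the equations.

```matlab
% Hypothetical two-state HMM (all values are illustrative assumptions)
a0   = [0.5; 0.5];              % a_{0k}: begin state -> state k
a    = [0.9 0.1; 0.2 0.8];      % a_{kl}
aEnd = [1; 1];                  % a_{k0}: assumed 1 (no explicit end state)
e    = [0.1 0.4 0.4 0.1;        % e_1(b): state 1 favours c and g
        0.3 0.2 0.2 0.3];       % e_2(b): state 2 favours a and t
x    = [2 3 2 3 1 4 1 4];       % sequence 'cgcgatat' as indices into 'acgt'

K = size(a, 1);
L = numel(x);
v = zeros(K, L);                % v(k,i): best path probability ending in k at i
ptr = zeros(K, L);              % ptr(l,i): best predecessor of state l at i

v(:, 1) = a0 .* e(:, x(1));     % first step out of the begin state
for i = 2:L
    for l = 1:K
        [best, ptr(l, i)] = max(v(:, i-1) .* a(:, l));
        v(l, i) = e(l, x(i)) * best;
    end
end

[pBest, last] = max(v(:, L) .* aEnd);   % termination
path = zeros(1, L);
path(L) = last;
for i = L:-1:2                          % traceback
    path(i-1) = ptr(path(i), i);
end
disp(path);
fprintf('P(x, pi*) = %.6g\n', pBest);
```

Storing the pointer matrix costs one integer per state and position, which is what lets the optimal path be recovered without re-examining every candidate path.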
Block Diagram for Viterbi Algorithm
[Figure: block diagram. Inputs: alphabet, emission probabilities, transition probabilities, sequence. Process: Viterbi Algorithm. Output: most probable state sequence.]
Multiple paths can give the same sequence
The Viterbi algorithm finds the most likely path given a sequence
Other paths could also give rise to the same sequence
How do we calculate the probability of a sequence given an HMM?
Probability of a sequence
Sum the probabilities of all possible paths that give that sequence
Let $P(x)$ be the probability of observing sequence $x$ given an HMM
$$P(x) = \sum_{\pi} P(x, \pi)$$
Probability of a sequence
We can find $P(x)$ using a variation on the Viterbi algorithm that uses sum instead of max
This is called the forward algorithm
Replace $v_k(i)$ with $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$
Forward algorithm
Initialisation (i=0):
$f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$
Recursion (i=1..L):
$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$
Termination:
$$P(x) = \sum_k f_k(L)\, a_{k0}$$
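A minimal MATLAB sketch of the forward recursion is shown below, written so that the sum over previous states becomes a matrix-vector product. The two-state model and the sequence are hypothetical, and $a_{k0} = 1$ is assumed (no explicit end state).

```matlab
% Hypothetical two-state HMM (all values are illustrative assumptions)
a0   = [0.5; 0.5];              % a_{0k}: begin state -> state k
a    = [0.9 0.1; 0.2 0.8];      % a_{kl}
aEnd = [1; 1];                  % a_{k0}: assumed 1 (no explicit end state)
e    = [0.1 0.4 0.4 0.1;
        0.3 0.2 0.2 0.3];       % e_k(b), columns in the order a c g t
x    = [2 3 2 1 4];             % sequence 'cgcat' as indices into 'acgt'

K = size(a, 1);
L = numel(x);
f = zeros(K, L);                % f(k,i) = P(x_1..x_i, pi_i = k)
f(:, 1) = a0 .* e(:, x(1));     % first step out of the begin state
for i = 2:L
    % (a' * f(:,i-1)) holds sum_k f_k(i-1) a_{kl} for every state l
    f(:, i) = e(:, x(i)) .* (a' * f(:, i-1));
end
Px = sum(f(:, L) .* aEnd);      % termination
fprintf('P(x) = %.6g\n', Px);
```

The only change from the Viterbi recursion is that the maximum over predecessor states is replaced by a sum, so no pointer matrix is needed.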
Backward algorithm
We may need to know the probability that a particular observation $x_i$ came from a particular state $k$ given a sequence $x$, $P(\pi_i = k \mid x)$
Use an algorithm analogous to the forward algorithm but starting from the end
Backward algorithm
Initialisation (i=L):
$b_k(L) = a_{k0}$ for all $k$
Recursion (i=L-1,...,1):
$$b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$
Termination:
$$P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$$
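The backward recursion can be sketched the same way, filling the table from the last position to the first. As before, the model and the sequence are hypothetical and $a_{k0} = 1$ is assumed. The termination step should give the same $P(x)$ as the forward algorithm, which is a convenient consistency check.

```matlab
% Hypothetical two-state HMM (all values are illustrative assumptions)
a0   = [0.5; 0.5];              % a_{0k}: begin state -> state k
a    = [0.9 0.1; 0.2 0.8];      % a_{kl}
aEnd = [1; 1];                  % a_{k0}: assumed 1 (no explicit end state)
e    = [0.1 0.4 0.4 0.1;
        0.3 0.2 0.2 0.3];       % e_k(b), columns in the order a c g t
x    = [2 3 2 1 4];             % sequence 'cgcat' as indices into 'acgt'

K = size(a, 1);
L = numel(x);
b = zeros(K, L);                % b(k,i) = P(x_{i+1}..x_L | pi_i = k)
b(:, L) = aEnd;                 % initialisation: b_k(L) = a_{k0}
for i = L-1:-1:1
    % a * (...) holds sum_l a_{kl} e_l(x_{i+1}) b_l(i+1) for every state k
    b(:, i) = a * (e(:, x(i+1)) .* b(:, i+1));
end
Px = sum(a0 .* e(:, x(1)) .* b(:, 1));   % termination
fprintf('P(x) = %.6g\n', Px);
```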
Estimating probability of state at particular position
Combine the forward and backward probabilities to estimate the posterior probability of the sequence being in a particular state at a particular position
$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$$
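Combining the two tables gives the posterior state probabilities in one element-wise operation. The sketch below recomputes both tables so that it runs on its own; the model, the sequence, and the assumption $a_{k0} = 1$ are illustrative, as in any toy example.

```matlab
% Hypothetical two-state HMM (all values are illustrative assumptions)
a0   = [0.5; 0.5];
a    = [0.9 0.1; 0.2 0.8];
aEnd = [1; 1];                  % a_{k0}: assumed 1 (no explicit end state)
e    = [0.1 0.4 0.4 0.1;
        0.3 0.2 0.2 0.3];       % columns in the order a c g t
x    = [2 3 2 3 1 4 1 4];       % sequence 'cgcgatat' as indices into 'acgt'

K = size(a, 1);
L = numel(x);

f = zeros(K, L);                % forward table
f(:, 1) = a0 .* e(:, x(1));
for i = 2:L
    f(:, i) = e(:, x(i)) .* (a' * f(:, i-1));
end

b = zeros(K, L);                % backward table
b(:, L) = aEnd;
for i = L-1:-1:1
    b(:, i) = a * (e(:, x(i+1)) .* b(:, i+1));
end

Px = sum(f(:, L) .* aEnd);
post = (f .* b) / Px;           % post(k,i) = P(pi_i = k | x)
disp(round(post * 1000) / 1000);
```

Each column of `post` sums to 1, since the sequence must be in exactly one state at every position.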
Parameter estimation for HMMs
Simple when the state sequence is known for training examples
Can be very complex for unknown paths
Estimation when state sequence known
Count the number of times each transition occurs, $A_{kl}$
Count the number of times each emission occurs from each state, $E_k(b)$
Convert to probabilities
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
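When the path is known, estimation reduces to counting and normalising. The MATLAB sketch below uses a single hypothetical labelled training example; the sequence, its state labels, and the variable names are assumptions for illustration. It assumes that every state occurs in the training data, since a state with no observations would produce a division by zero.

```matlab
% Hypothetical labelled training example (illustrative values)
x    = [2 3 2 1 4 1];           % sequence 'cgcata' as indices into 'acgt'
path = [1 1 1 2 2 2];           % known state at each position
K = 2;                          % number of states
M = 4;                          % alphabet size

A = zeros(K, K);                % A(k,l): number of transitions k -> l
E = zeros(K, M);                % E(k,b): number of times state k emits b
for i = 1:numel(x)
    E(path(i), x(i)) = E(path(i), x(i)) + 1;
    if i > 1
        A(path(i-1), path(i)) = A(path(i-1), path(i)) + 1;
    end
end

% Convert counts to probabilities by normalising each row
a = A ./ repmat(sum(A, 2), 1, K);
e = E ./ repmat(sum(E, 2), 1, M);
disp(a);
disp(e);
```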
Baum-Welch
Make initial parameter estimates
Use the forward algorithm and backward algorithm to calculate the probability of each sequence according to the model
Calculate new model parameters
Repeat until termination criteria met (change in log likelihood < threshold)
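Estimating transition frequencies
Probability that $a_{kl}$ is used at position $i$ in sequence $x$
$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$$
Sum over all positions ($i$) and all sequences ($j$) to get the expected number of times $a_{kl}$ is used
$$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1)$$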
Estimating emission frequencies
Sum over all positions for which the emitted character is $b$ and all sequences
$$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$$
Updating model parameters
Convert expected numbers to probabilities as if expected numbers were actual counts
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
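Test for termination
Calculate the log likelihood of the model for all of the sequences using the new parameters (summing over all $n$ training sequences)
$$\sum_{j=1}^{n} \log P(x^j \mid \theta)$$
If the change in log likelihood exceeds some threshold, go back and make new estimates of $a$ and $e$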
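The whole Baum-Welch loop can be put together in a compact MATLAB sketch. This is a simplified illustration under several assumptions, not a production implementation: it trains on a single hypothetical sequence, keeps the begin-state probabilities `a0` fixed, models no explicit end state ($a_{k0} = 1$), and uses unscaled probabilities, which is only safe for short sequences. The starting guesses, the threshold, and the iteration cap are arbitrary choices.

```matlab
% Hypothetical training sequence and starting guesses (illustrative values)
x  = [2 3 2 3 3 2 1 4 1 4 4 1 2 3 2 3 1 4 1 1];   % indices into 'acgt'
a0 = [0.5; 0.5];                     % begin-state probabilities, held fixed
a  = [0.7 0.3; 0.3 0.7];             % initial guess for a_{kl}
e  = [0.2 0.3 0.3 0.2;
      0.3 0.2 0.2 0.3];              % initial guess for e_k(b)
K = 2; M = 4; L = numel(x);
threshold = 1e-6;                    % stop when the log likelihood gain is smaller
oldLL = -Inf;
logLik = [];                         % log likelihood at each iteration

for iter = 1:200
    % Forward and backward passes with the current parameters
    f = zeros(K, L);
    b = ones(K, L);                  % b_k(L) = a_{k0} = 1 (assumed)
    f(:, 1) = a0 .* e(:, x(1));
    for i = 2:L
        f(:, i) = e(:, x(i)) .* (a' * f(:, i-1));
    end
    for i = L-1:-1:1
        b(:, i) = a * (e(:, x(i+1)) .* b(:, i+1));
    end
    Px = sum(f(:, L));

    % Test for termination on the change in log likelihood
    LL = log(Px);
    logLik(end+1) = LL;
    if LL - oldLL < threshold
        break;
    end
    oldLL = LL;

    % Expected transition counts A_{kl} and emission counts E_k(b)
    A = zeros(K, K);
    E = zeros(K, M);
    for i = 1:L-1
        A = A + (f(:, i) * (e(:, x(i+1)) .* b(:, i+1))') .* a / Px;
    end
    for i = 1:L
        E(:, x(i)) = E(:, x(i)) + f(:, i) .* b(:, i) / Px;
    end

    % Update parameters as if the expected counts were actual counts
    a = A ./ repmat(sum(A, 2), 1, K);
    e = E ./ repmat(sum(E, 2), 1, M);
end
fprintf('Stopped after %d iterations, log likelihood = %.4f\n', iter, LL);
```

The log likelihood recorded in `logLik` should never decrease from one iteration to the next, which makes it a useful check when adapting the sketch to several sequences or to scaled probabilities.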
