

Markov Chains
Equipped with the basic tools of probability theory, we can now revisit the stochastic models we considered starting on page 47 of these notes. The recurrence (26) for the stochastic version of the sand-hill crane model is an instance of the following template:

$$y(n) = \text{a sample from } p_{Y_n \mid Y_{n-1}, \ldots, Y_{n-m}}(y \mid y(n-1), \ldots, y(n-m)) . \tag{37}$$

The stochastic sand-hill crane model is an example of the special case $m = 1$:

$$y(n) = \text{a sample from } p_{Y_n \mid Y_{n-1}}(y \mid y(n-1)) . \tag{38}$$

Recall that, given the value $y(n-1)$ for the population $Y_{n-1}$ (a random variable) of cranes in year $n-1$, we formulated probability distributions for the numbers of births (Poisson) and deaths (binomial) between year $n-1$ and year $n$. Because these probabilities depend on $y(n-1)$, they are conditioned upon the fact that $Y_{n-1} = y(n-1)$. If there are $b$ births and $d$ deaths, then $Y_n = Y_{n-1} + b - d$, so knowing the value $y(n-1)$ of $Y_{n-1}$ and the conditional probability distributions $p_{b \mid Y_{n-1}}$ and $p_{d \mid Y_{n-1}}$ of $b$ and $d$ given $Y_{n-1}$ is equivalent to knowing the conditional distribution¹⁸ $p_{Y_n \mid Y_{n-1}}$ of $Y_n$ given $Y_{n-1}$.

A sequence of random variables $Y_0, Y_1, \ldots$ whose values $y(0), y(1), \ldots$ are produced by a stochastic recurrence of the form (37) is called a discrete Markov process of order $m$. The initial value $y(0)$ is itself drawn at random from the distribution of $Y_0$, rather than being a fixed number.

In the sand-hill crane model, the value that the population $Y_n$ can assume in any given year $n$ is unbounded. This presents technical difficulties that it is wise to avoid in a first pass through the topic of stochastic modeling. We therefore make the additional assumption that the random variables $Y_n$ are drawn out of a finite alphabet $\mathcal{Y}$ with $K$ values. In the sand-hill crane example, we would have to assume a maximum population of $K - 1$ birds (including zero, this yields $K$ possible values). In other examples, including the one examined in the next Section, the restriction to a finite alphabet is more natural. A discrete Markov process defined on a finite alphabet is called a Markov chain. Thus:
¹⁸ To actually determine the form of this conditional distribution would require a bit of work. It turns out that the probability distribution of the sum (or difference) of two independent random variables is the convolution (or correlation) of the two distributions. The assumption that births and deaths are independent is an approximation, and is more or less valid when the death rate is small relative to the population. Should this assumption be unacceptable, one would have to provide the joint distribution of $b$ and $d$.



To specify a Markov chain of order $m$ (equation (37)) requires specifying the initial probability distribution $p_{Y_0}(y(0))$ and the transition probabilities

$$p_{Y_n \mid Y_{n-1}, \ldots, Y_{n-m}}(y(n) \mid y(n-1), \ldots, y(n-m))$$

where all variables $y(n)$ range over a finite alphabet $\mathcal{Y}$ of $K$ elements. Note that in this expression $y(0)$ and $y(n), \ldots, y(n-m)$ are variables, not fixed numbers. For instance, we need to know the distribution $p_{Y_0}(y(0))$ for all possible values of $y(0)$. The Markov chain (37) is said to be stationary if the transition probabilities are the same for all $n$. In that case, their expression can be simplified as follows:

$$p_{Y \mid Y_{-1}, \ldots, Y_{-m}}(y \mid y_{-1}, \ldots, y_{-m}) .$$
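To make this specification concrete, here is a minimal Matlab sketch, not from the original notes, of how a stationary first-order chain over a small alphabet could be stored and simulated: an initial distribution vector and a square transition matrix whose rows sum to one. The numbers and the helper sampleFrom are made up for illustration; the functions draw, ngrams, and randomSentence developed later in this Section provide the same machinery for the language model.

% A minimal sketch (illustrative values): a stationary first-order Markov
% chain over K symbols is specified by an initial distribution p0 (1 x K)
% and a stochastic transition matrix T (K x K); row i of T holds the
% distribution of the next symbol given that the current symbol is i.
K  = 3;
p0 = [0.5, 0.3, 0.2];                    % initial distribution for Y_0
T  = [0.7, 0.2, 0.1;                     % each row sums to one
      0.3, 0.4, 0.3;
      0.1, 0.1, 0.8];

% Draw one index with probability given by the entries of p, by inverting
% the cumulative sum (the same idea used by the draw function later on).
sampleFrom = @(p) find(rand <= cumsum(p) / sum(p), 1, 'first');

N = 10;                                  % number of steps to simulate
y = zeros(1, N);
y(1) = sampleFrom(p0);                   % y(1) drawn from the initial distribution
for n = 2:N
    y(n) = sampleFrom(T(y(n-1), :));     % y(n) drawn from row y(n-1) of T
end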

A Language Model
To illustrate the concept of a Markov chain we consider the problem of modeling the English language at the low level of word utterances (that is, we do not attempt to model the structure of sentences). More specifically, we attempt to write a computer program that generates random strings of letters that in some sense look like English words.

A first, crude attempt would draw letters and blank spaces at random out of a uniform probability distribution over the alphabet (augmented with a blank character, for a total of 27 symbols). This would be a poor statistical model of the English language: all it specifies is what characters are allowed. Here are a few samples of 65-character sentences (one per line):

earryjnv anr jakroyvnbqkrxtgashqtzifzstqaqwgktlfgidmxxaxmmhzmgbya
mjgxnlyattvc rwpsszwfhimovkvgknlgddou nmytnxpvdescbg k syfdhwqdrj
jmcovoyodzkcofmlycehpcqpuflje xkcykcwbdaifculiluyqerxfwlmpvtlyqkv

This is not quite compelling: words (that is, strings between blank spaces) have implausible lengths, the individual letters seem to come from some foreign language, and letter combinations are unpronounceable. The algorithm that generated these sequences is a zeroth-order stationary Markov chain in which the initial probability distribution and the transition probabilities are all equal to the uniform distribution. "Transition" here is even a misnomer, because each letter is independent of the previous one:

$$p_{Y_0}(y) = p_{Y \mid Y_{-1}, \ldots, Y_{-m}}(y \mid y_{-1}, \ldots, y_{-m}) = p_Y(y) = p_{U_{27}}(y) \tag{39}$$

($p_{U_{27}}$ is the uniform distribution over 27 points).
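For concreteness, the zeroth-order uniform model amounts to a couple of lines of Matlab. This snippet is an illustrative sketch, not the code used to produce the samples above:

% A minimal sketch: draw a 65-character "sentence" uniformly at random
% from the 27-symbol alphabet (a blank plus the letters a through z).
alphabet = [' ', 'a':'z'];
sentence = alphabet(randi(length(alphabet), 1, 65));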

A moderate improvement comes from replacing the uniform distribution with a sample distribution from real text. To this end, let us take as input all of James Joyce's Ulysses, a string of 1,564,672 characters (including a few lines of header).¹⁹ Some cleanup first converts all characters to lowercase, replaces non-alphabetic characters (commas, periods, and so forth) with blanks, and collapses any resulting sequences of consecutive blanks into single blanks.
As an exercise in vector-style text processing, here is the Matlab code for the cleanup function:

function out = cleanup(in, alphabet)
% Convert to lower case
in = lower(in);
% Change unknown symbols to blank spaces
known = false(size(in));
for letter = alphabet
    known(in == letter) = true;
end
in(~known) = ' ';
% Find first of each sequence of nonblanks
first = in ~= ' ' & [' ', in(1:(end-1))] == ' ';
% Find last of each sequence of nonblanks
last = in ~= ' ' & [in(2:end), ' '] == ' ';
% Convert from logical flags to indices
first = find(first);
last = find(last);
% Replace runs of blanks with single blanks
out = ' ';
for k = 1:length(first)
    out = [out, in(first(k):last(k)), ' '];
end

The input argument in is a string of characters (the text of Ulysses, a very long string indeed), and alphabet is a string of the 26 lowercase letters of the English alphabet, plus a blank added as the first character. The output string out contains the cleaned-up result. Loops were avoided as far as possible: the loop on letter is very small (27 iterations), and the last loop on k seemed unavoidable.²⁰

After this cleanup operation, it is a simple matter to tally the frequency of each of the letters in the alphabet (including the blank character) in Ulysses. A plot of this frequency distribution would look very jagged. For a more pleasing display, Figure 17(a) shows the frequency distribution of the characters sorted by their frequency of occurrence. This function is used as an approximation of the probability distribution of letters and spaces in the English language.
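As an aside, here is one way this tally could be computed. This is a sketch that assumes the cleaned-up text is in the string text and that alphabet is the same 27-character string passed to cleanup; it is not necessarily the code used to produce the figure:

% A minimal sketch (assumed variable names): tally character frequencies in
% the cleaned-up text and sort them in decreasing order, as in Figure 17(a).
counts = zeros(1, length(alphabet));
for k = 1:length(alphabet)
    counts(k) = sum(text == alphabet(k));
end
p = counts / sum(counts);                 % empirical distribution p_Y(y)
[pSorted, order] = sort(p, 'descend');    % frequencies in decreasing order
sortedAlphabet = alphabet(order);         % characters in the same order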
¹⁹ This text is available for instance at http://www.gutenberg.org/ebooks/4300.
²⁰ Please let me know if you find a way to avoid this loop.



[Figure 17: (a) Frequency distribution $p_Y(y)$ of the 26 letters of the alphabet and the blank character in Joyce's Ulysses, with the characters sorted in order of decreasing frequency; the first value is for the blank character. (b) The corresponding cumulative distribution $c_Y(y)$. The probability that a number (diamond) drawn uniformly between 0 and 1 falls in the interval shown on the ordinate is equal to the difference $c_Y(o) - c_Y(a)$. By definition of $c_Y$, this difference is equal to the probability $p_Y(o)$, so we select the letter o whenever the diamond falls in that interval.]

The resulting language model is still the zeroth-order Markov chain of equation (39), but with a more plausible choice of individual letters. Rather than drawing from the uniform distribution over the alphabet, we now draw from a distribution $p_Y(y)$ that we estimated from a piece of text.

How do we draw from a given distribution? If the set $\mathcal{Y}$ on which the distribution $p_Y(y)$ is defined is finite, there is a simple method for converting a uniform (pseudo)random sample generator into a generator that draws from $p_Y(y)$ instead. First, compute the cumulative distribution of $Y$:

$$c_Y(y) = P[Y \le y] = \sum_{k \le y} p_Y(k)$$

where $P[\cdot]$ stands for the phrase "the probability that...". This is shown in Figure 17(b). If we now append a zero to the left of the sequence of values $c_Y(y)$, the resulting sequence spans monotonically the entire range between 0 and 1, since by construction $c_Y(y_{k-1}) \le c_Y(y_k)$ and $c_Y(y_K) = 1$ ($y_K$ is the last element in the alphabet $\mathcal{Y} = \{y_1, \ldots, y_K\}$). In addition, the differences between consecutive values of $c_Y(y)$ are equal to the probabilities $p_Y(y)$. We can then use a uniform (pseudo)random number generator to draw a number $u$ between 0 and 1, map that to the ordinate of the plot in Figure 17(b), and find the first entry $y_u$ in the domain $\mathcal{Y}$ of the function $c_Y$ such that $c_Y(y_u)$ equals or exceeds $u$. This construction is illustrated in Figure 17(b). The probability of hitting a particular value $y_u$ (the letter o in the example in the Figure) is equal to the probability that $u$ (the diamond on the ordinate) falls between $c_Y(y_{u-1})$ and $c_Y(y_u)$, and this is, by definition of $c_Y$, the probability $p_Y(y_u)$. Thus, this procedure draws from $p_Y$, as desired.
Here is how to do this in Matlab:

function v = draw(y, p, n)
if nargin < 3 || isempty(n)
    n = 1;    % Draw a single number
end
c = cumsum(p);
c = c(:);
if c(end) == 0
    v = [];
else
    c = c / c(end);
    c = [0; c];
    K = length(c);
    r = rand(n, 1);
    v = zeros(n, 1);
    for i = 2:K
        v(find(r <= c(i) & r > c(i - 1))) = y(i - 1);
    end
end



The argument y lists the values in the domain $\mathcal{Y}$. In the example, these would be numbers between 1 and 27, which represent the 26 letters of the alphabet plus the blank character. The argument p lists the probabilities of each element in the domain, and the argument n specifies how many numbers to draw (1 if n is left unspecified). The Matlab built-in function cumsum computes the cumulative sum of the elements of the input vector.

A note on efficiency: the function draw spends most of its time looking for the first index v where c(v) equals or exceeds the random number(s) r. This search is implemented here with the Matlab built-in function find, which scans the entire array that is passed to it as an argument. In principle, this is an inefficient method, because one could use binary search: if there are K elements in the vector c, first try c(round(K/2)), an element roughly in the middle of the vector. If this element is too large, the target must be in the first half of c; otherwise it must be in the second half. Repeat this procedure on the correct half until the desired element is found. Since the interval is halved at every step, binary search requires only about $\log_2(K)$ comparisons rather than $K$. However, unless K is very large, this theoretical speedup is more than canceled by the fact that the built-in function find is pre-compiled, and therefore very fast, whereas a binary search would have to be written in Matlab, an interpreted language, which is much slower.
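For completeness, here is what that binary search could look like. This is a sketch of the idea just described, not code from the original notes; the function name firstAtLeast is made up, and it handles a single number u at a time.

% A minimal sketch (hypothetical helper): return the first index i such
% that c(i) >= u, assuming c is sorted in nondecreasing order.
function i = firstAtLeast(c, u)
lo = 1;
hi = length(c);
while lo < hi
    mid = floor((lo + hi) / 2);
    if c(mid) >= u
        hi = mid;         % the answer is at mid or earlier
    else
        lo = mid + 1;     % the answer is strictly after mid
    end
end
i = lo;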

Here are some sample sentences obtained by drawing from a realistic frequency distribution for the letters in English:

ooyusdii eltgotoroo tih ohnnattti gyagditghreay nm roefnnasos r naa
euuecocrrfca ayas el s yba anoropnn laeo piileo hssiod idlif beeghec
ebnnioouhuehinely neiis cnitcwasohs ooglpyocp h trog l

This still does not look anywhere near English. However, both letters and blanks are now drawn with a frequency that matches that in Ulysses. The letters look more familiar than with the uniform distribution because they appear with their actual frequencies in English. Even so, words still have implausible lengths: a correct frequency of blanks only ensures that the mean number of blanks in any substantial stretch of text is correct, not that the lengths of the runs between blanks (words) are correct. For instance, you will notice several multiple blanks in the text above. Three four-letter words separated by two blanks (a plausible sequence in English) are equivalent to one 12-letter word followed by two consecutive blanks (a much less plausible sequence in English) in the sense that the frequency of blanks is the same in both cases (2/14). Similar considerations hold for letters: frequencies are correct, but the order in which letters follow each other is entirely unrelated to English. Adjacent letters are statistically independent in the model, but not in reality.

To address this shortcoming, we can collect statistics about the conditional probability of one letter given the previous one,

$$p_{Y \mid Y_{-1}}(y \mid y_{-1}) = P[Y = y \mid Y_{-1} = y_{-1}] .$$

These transition probabilities are displayed graphically in Figure 18. Do not confuse these with the joint probabilities. For instance, the conditional probability that the current letter $Y$ is equal to u given that the previous letter $Y_{-1}$ is equal to q is one, because u always follows q.

However, the joint probability of the pair qu is equal to the probability of q. From Figure 17(a) we see that this probability is very small (it turns out to be about $9 \times 10^{-3}$). More generally, from the definition (34) of conditional probability we see that the joint probability can be found from the conditional probabilities (Figure 18) and the probabilities of the individual letters (Figure 17(a)) as follows:

$$p_{Y_{-1}, Y}(y_{-1}, y) = p_{Y \mid Y_{-1}}(y \mid y_{-1})\, p_{Y_{-1}}(y_{-1}) = p_{Y \mid Y_{-1}}(y \mid y_{-1})\, p_Y(y_{-1}) .$$

The last equality is justified by our assumption that language statistics are stationary. Values in each row of the transition matrix $p_{Y \mid Y_{-1}}$ add up to one, because the probability that a letter is followed by some character is 1 (except for the very last letter in the text):

$$\sum_{y \in \mathcal{Y}} p_{Y \mid Y_{-1}}(y \mid y_{-1}) = 1$$

where $\mathcal{Y}$ is the alphabet, plus the blank character. A matrix with this property is said to be stochastic.

The matrix of transition probabilities can be estimated from a given piece of text (Ulysses in our case) by initializing a 27 × 27 matrix to all zeros, and incrementing the entry corresponding to previous letter $y_{-1}$ and current letter $y$ every time we encounter such a pair. Once the whole text has been scanned, we divide each row by the sum of its entries.²¹ (A short code sketch of this bookkeeping is given below, after the sample sentences.) We can now generate a new set of sentences by drawing out of the transition probabilities, thereby generating samples out of a (stationary) first-order Markov chain as in equation (38). More specifically, the first letter is drawn out of $p_Y(y)$, just as we did in our earlier attempt. After that, we look at the specific value $y_{-1}$ of the previous letter, and draw from the transition probability $p_{Y \mid Y_{-1}}(y \mid y_{-1})$. Since $y_{-1}$ is now a known value, this amounts to drawing from the distribution represented by the row corresponding to $y_{-1}$ in Figure 18. Here are a few sample sentences:

icke inginatenc blof ade and jalorghe y at helmin by hem owery fa
st sin r d n cke s t w anks hinioro e orin en s ar whes ore jot j
whede chrve blan ted sesourethegebe inaberens s ichath fle watt o

On occasion, this is almost pronounceable. Word lengths are now slightly more plausible. Note however that the distribution of word lengths in the second line is very different from that in the third: statistics are just statistics, and one must expect variability in an experiment this small. Now, however, we can see the pattern: incorporating more information on how letters follow each other leads to more plausible results. Instead of a first-order Markov chain, we can try a second-order one:

$$y(n) = \text{a sample from } p_{Y \mid Y_{-1}, Y_{-2}}(y \mid y_{-1}, y_{-2}) .$$
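Before compiling those higher-order tables, here is the promised sketch of the first-order bookkeeping: count letter pairs in a 27 × 27 matrix and then normalize each row. This is illustrative code under the assumption that the cleaned-up text is stored in the string text and that alphabet is the 27-character string used earlier; it is not the notes' own code, and the ngrams function given later in this Section does the same job for arbitrary orders.

% A minimal sketch (assumed variable names): estimate the 27 x 27 first-order
% transition matrix from a cleaned-up text over the 27-symbol alphabet.
idx = zeros(size(text));
for k = 1:length(alphabet)
    idx(text == alphabet(k)) = k;        % map each character to an index 1..27
end
T = zeros(length(alphabet));             % pair counts: rows index the previous letter
for k = 2:length(idx)
    T(idx(k-1), idx(k)) = T(idx(k-1), idx(k)) + 1;
end
for r = 1:size(T, 1)                     % divide each row by its sum
    s = sum(T(r, :));
    if s > 0
        T(r, :) = T(r, :) / s;
    end
end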

This of course requires compiling the necessary frequency tables from Ulysses (or from your favorite text). The distribution $p_{Y \mid Y_{-1}, Y_{-2}}$ is a function of three variables rather than two, and each of them can take one of 27 values, so the new table has $27^3 = 19683$ entries.
²¹ If we were to divide by the sum of all the entries in the matrix, we would obtain the joint probabilities instead of the conditional ones.



[Figure 18: Conditional frequencies of the 26 letters of the alphabet and the blank character in Joyce's Ulysses, given the previous character (previous letter on one axis, current letter on the other). The first row and first column are for the blank character. The area of each square is proportional to the conditional probability of the current letter given the previous letter. For instance, the largest square in the picture corresponds to the probability, equal to one, that the letter q is followed by the letter u.]

Other than this, the procedures for data collection and sequence generation are essentially the same, and the idea can then be repeated for higher-order chains as well.
The cost of exponentially large tables ($27^{n+1}$ entries for an $n$-th order chain) can be curbed substantially by observing that most conditional probabilities are zero. For instance, a sequence of three letters drawn entirely (that is, uniformly) at random is unlikely to be a plausible English sequence. If it is not, it will never show up in Ulysses, and the corresponding conditional probability in the table is zero. Matlab can deal very well with sparse matrices like these. If a matrix A has, say, 10,000 entries only 200 of which are nonzero, the instruction

A = sparse(A);

will create a new data structure that only stores the nonzero values of A, together with the information necessary to reconstruct where in the original matrix each entry belongs. As a consequence, the storage space (and processing time) for sparse matrices is proportional to the number of nonzero entries, rather than to the formal size of the matrix. Here is the code that collects text statistics up to a specified order:

function [ng, alphabet] = ngrams(text, maxOrder)
nmax = maxOrder + 1;
if nmax > 4
    error('Unwise to do more than quadrigrams: too much storage, computation')
end
alphabet = ' abcdefghijklmnopqrstuvwxyz';
na = length(alphabet);
da = double(alphabet);
ng = {};
for n = 1:nmax
    ng{n} = sparse(zeros(na^(n-1), na));
end
text = cleanup(text, alphabet);
% Wrap around to avoid zero probabilities
text = [text((end - maxOrder):end), text];
for k = 1:length(text)
    ps = place(text(k), da);
    last = k - 1;
    ne = min(nmax, k);
    for n = 1:ne
        j = k - n + 1;
        prefix = text(j:last);
        pp = place(prefix, da);
        ng{n}(pp, ps) = ng{n}(pp, ps) + 1;

    end
end


% Normalize conditional distributions
o = ones(na, 1);
for n = 1:nmax
    s = ng{n} * o;
    for p = 1:size(ng{n}, 1)
        if s(p) ~= 0
            ng{n}(p, :) = ng{n}(p, :) / s(p);
        end
    end
end

This function is called ngrams because it computes statistics for digrams (pairs of letters), trigrams (triples of letters), and so forth. The output ng is a cell array with maxOrder + 1 matrices, each describing the statistics of a different order. Logically, the statistics of order n should be stored in an array with n + 1 dimensions. Instead, the code above flattens these arrays into two-dimensional matrices to make access and later computation faster. This requires mapping, say, a four-dimensional vector of indices (i, j, k, l) to a two-dimensional vector (v, l), where v(i, j, k) is some invertible function of (i, j, k). How this is done is not important, but it must be done consistently when collecting statistics and when using them. To this end, the mapping has been encapsulated into a function place that takes a string of letters prefix and (a numerical representation da of) the alphabet and returns the place of that string within a matrix of appropriate size. Here is how place works:

function pos = place(string, da)
na = length(da);
len = length(string);
pos = 0;
for k = 1:len
    i = find(double(string(k)) == da);
    if isempty(i)
        % Convert unknown characters to blanks (position 1 in da)
        i = 1;
    end
    pos = pos * na + i - 1;
end
% Convert to Matlab-style array indexing (so the minimum pos is 1, not 0)
pos = pos + 1;

Going back to ngrams, the function cleanup has already been discussed. The instruction with the comment "Wrap around" prevents a little quirk that concerns the end of the text: if the string, say, ix appears at the end of the text and nowhere else, all the transition probabilities from ix to a third character are zero. If ix is ever generated in the Markov chain, there is then no next character to go to. To avoid this, the tail end of the text is also copied to the beginning, so that every sequence of letters is followed by some letter. The rest of the code is straightforward: initialize storage for ng, compute the statistics by scanning text and incrementing the proper entries of ng, and normalize the entries to obtain conditional probabilities.

The following are examples of gibberish generated by a second-order Markov chain:

he ton th a s my caroodif flows an the er ity thayertione wil ha
m othenre re creara quichow mushing whe so mosing bloack abeenem
used she sighembs inglis day p wer wharon the graiddid wor thad k

Some of these look like actual words: common sequences of letters and blanks. A third-order model does even better:

es in angull o shoppinjust stees ther a kercourats allech is hote
ternal liked be weavy because in coy mrs hand room him rolio und
ceran in that he mound a dishine when what to bitcho way forgot p

Almost looks like real text...
All the gibberish²² in this Section has been generated with the following Matlab code.

function s = randomSentence(ng, alphabet, len, order)
if nargin < 4 || isempty(order)
    order = length(ng) - 1;   % Use maximum order possible
end
if order >= length(ng)
    error('Only statistics up to order %d are available', length(ng) - 1);
end
if order < -1
    error('order must be at least -1')
end
da = double(alphabet);
na = length(alphabet);
if order == -1
    % Don't even consider letter statistics:
    % draw uniformly from the alphabet
    dr = draw(1:na, ones(1, na) / na, len);
    s = alphabet(dr);
else
    s = char(double(' ') * ones(1, len));
    for k = 1:len
        j = max(1, k - order);
        pp = place(s(j:(k-1)), da);
        g = min(k, order + 1);
        s(k) = alphabet(draw(1:na, ng{g}(pp, :)));
    end
end
²² Well, at least the gibberish in fixed-width font...



The real meat of this code is the for loop at the bottom, which computes the place of the previous letters within the appropriate frequency table in ng, draws a new letter index from the corresponding row, and writes the alphabet letter for that index into the output string s.
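As a usage illustration, the whole pipeline could be driven as follows. This driver is not part of the original notes, and the file name ulysses.txt is an assumption:

% A minimal usage sketch (assumed file name): collect statistics up to third
% order from the text and print a 65-character sample from each model order.
txt = fileread('ulysses.txt');
[ng, alphabet] = ngrams(txt, 3);
for order = 0:3
    disp(randomSentence(ng, alphabet, 65, order));
end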

The same principle of modeling sequences with Markov chains can of course be extended to sequences of words instead of characters: collect word statistics for a dictionary rather than character statistics for an alphabet, and generate pseudo-sentences out of real words (a small sketch of this word-level bookkeeping is given at the end of this Section).

By now you are probably wondering why in the world anyone would attempt to mimic the English language with statistically correct gibberish, other than as an exercise for a modeling class. An important application of this principle, at more levels than just letters and words, is speech recognition. In a nutshell, a speech recognition software system typically takes a stream of speech signals coming from a microphone, parses this stream first into phonemes,²³ then phonemes into words, and words into sentences. Parsing means cutting up the input into the proper units, and recognizing each unit (as a specific phoneme, word, or sentence). To understand the difficulty of this computation, think of listening to an unfamiliar language and trying to determine the boundaries between words, let alone understand the words. Apparently, the two tasks must be done together. Markov statistical models of speech have encountered a great deal of success in the past decade or two. Rather than generating gibberish, the statistical model is used to measure the likelihood of different candidate parsing results for the same input, and to choose the most likely interpretation. The Markov models capture the likelihoods of individual links (of varying orders, and at different levels) between units. Interesting computational techniques can then accrue these values to compute the likelihoods of long chains of units. The methods for actually doing so are beyond the scope of this course. See for instance F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
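Returning to the word-level extension mentioned above, here is a minimal sketch of how word-pair (bigram) counts could be collected. The variable names and the use of strsplit and containers.Map are assumptions for illustration; this is not code from the original notes.

% A minimal sketch (assumed approach): count word bigrams in a cleaned-up
% text in which words are separated by single blanks.
words = strsplit(strtrim(text), ' ');
counts = containers.Map('KeyType', 'char', 'ValueType', 'double');
for k = 2:length(words)
    key = [words{k-1}, ' ', words{k}];    % the pair "previous current"
    if isKey(counts, key)
        counts(key) = counts(key) + 1;
    else
        counts(key) = 1;
    end
end
% Conditional probabilities p(word | previous word) would then follow by
% normalizing these counts over each previous word, exactly as for letters.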

²³ A phoneme is similar to a syllable, but less ambiguous in its pronunciation.
