Managing Editor
MAX A. VIERGEVER
Utrecht University, Utrecht, The Netherlands
Editorial Board
Volume 24
Ten Lectures on Statistical
and Structural Pattern Recognition
by
Michail I. Schlesinger
Ukrainian Academy of Sciences,
Kiev, Ukraine
and
Václav Hlaváč
Czech Technical University,
Prague, Czech Republic
Preface xi
Preface to the English edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
A letter from the doctoral student Jiří Pecha prior to publication of
the lectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
A letter from the authors to the doctoral student Jiří Pecha . . . . . . . xiii
Basic concepts and notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Acknowledgements xix
Bibliography 507
Index 514
Preface
Therefore it is high time for the users and authors of pattern recognition
methods to become capable of separating the beautiful dreams about the
miraculous powers of pattern recognition from reality. Present day pattern
recognition offers enough knowledge for its authority to be based on real,
not imaginary, values.
The lectures you are hunting for contain just what you miss, to an extent.
But the professor's notes, which he prepared before each lecture, will hardly be
of any use to you. They contain formulations of the major concepts and theorems,
and this is only a small part of what was being explained at the lectures.
Furthermore, the lectures contained a motivating part: a critical (and at times
sharp) analysis of methods currently applied in pattern recognition, and warnings
against various pitfalls, such as solutions that seem common sense but are in
reality erroneous ones. Today we no longer know whether these parts were
prepared beforehand or whether they originated right in the lecture hall.
And it was precisely these commentaries that made us begin to comprehend
how little we knew about the things we were assumed to know a great deal
about. The lectures also had an opposite, but still positive, aspect. We saw
that a certain group of pattern recognition methods, which we had regarded as
isolated islets dispersed in the ocean of our ignorance, formed at least an
archipelago which could be surveyed at a glance. We kept asking ourselves
whether we should consider publishing these interesting but not very academic
considerations. Doing so seemed to us impossible.
We increasingly regret it because others appear to be as interested in the
lectures as you are. That is why we decided to put them in order and publish
them. It will not be the sort of book you have imagined, but there will be
a little more to it. If we understood you properly you would prefer to have a
textbook like a reference book, in which the different mathematical disciplines
used in pattern recognition would each be described. We had thought of
publishing a reference book like this, but we changed our opinion, and it was
mostly your letter that suggested the change to us. Even though you rate your
skill in the mathematical apparatus of modern pattern recognition rather
modestly, you have found that it is usually only a matter of applying some
fundamental and simple mathematical concepts. Your troubles do not lie in these
concepts being too complicated, but in there being too many of them, coming from
different mathematical disciplines. A novice in pattern recognition may also
be confused by the fact that some mathematical concepts have acquired a
rather different meaning within the pale of pattern recognition than their
original one in mathematics. Present day pattern recognition has not only taken
over concepts from different domains of mathematics, but has brought them into
mutual relations, which has resulted in the emergence of new concepts, new
task formulations and new issues that belong to pattern recognition and no
longer to the mother disciplines from which they originated.
The lectures we are revising have been focused on pattern recognition tasks
and not on the tasks of linear programming, mathematical statistics, or
graph theory. The required mathematical means have not been presented in
mutual isolation, but in the context necessary for solving a particular pattern
recognition task. Mathematical results have taken their part in one mechanism,
which could be referred to as the mathematical apparatus of pattern recognition.
In our opinion it is this all-round view that you lack.
Unfortunately, we are only at the beginning of setting the lectures into a
publishable form. About two years of work lie before us. We understand
that you are not willing to wait for such a long time. We have already
started writing the lectures down. It will naturally take a while to supplement
the bare comments with all that is necessary for the text to be readable and
understandable. Not to leave you idle during all this time, we would like to ask
for your collaboration. We could send you the text of separate lectures, as we
assume that you need not have all the lectures at once. It would be perfect if
you, after having gone through a part of the subject matter, could write down
your critical findings, your ideas on the subject matter, and questions, if any.
We would consider such feedback a necessary condition of our collaboration.
You will obtain the next lecture only when we get a thorough analysis of the
previous lecture from you.

We are now sending the first lecture to you and are looking forward to our
collaboration.
Michail I. Schlesinger, Václav Hlaváč
    p_{X|Y}(x | y) = p_{XY}(x, y) / Σ_{x∈X} p_{XY}(x, y)

on the set Y, because the sum Σ_{y∈Y} p_{X|Y}(x | y) is not necessarily equal to
one. Therefore the number p_{X|Y}(x | y) is called the likelihood of the value y.

The function p_X : X → ℝ is called an a priori probability distribution on
the set X for a given joint probability distribution p_{XY} : X × Y → ℝ, and it
is defined as

    p_X(x) = Σ_{y∈Y} p_{XY}(x, y) .
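These definitions are easy to check numerically. The sketch below uses a made-up joint distribution p_XY on small sets X and Y; it computes the marginal p_X and the likelihoods, confirming that the likelihoods of y for a fixed x need not sum to one.

```python
# Hypothetical joint distribution p_XY on X = {0, 1}, Y = {0, 1, 2}.
p_xy = {
    (0, 0): 0.10, (0, 1): 0.25, (0, 2): 0.15,
    (1, 0): 0.20, (1, 1): 0.05, (1, 2): 0.25,
}
X, Y = {0, 1}, {0, 1, 2}

# A priori probability distribution p_X(x) = sum over y of p_XY(x, y).
p_x = {x: sum(p_xy[x, y] for y in Y) for x in X}

# Likelihood of y for a fixed observation x: p_X|Y(x | y) = p_XY(x, y) / p_Y(y).
p_y = {y: sum(p_xy[x, y] for x in X) for y in Y}
likelihood = {(x, y): p_xy[x, y] / p_y[y] for x in X for y in Y}

# p_X is a probability distribution on X ...
assert abs(sum(p_x.values()) - 1.0) < 1e-12
# ... but for fixed x the likelihoods of y need not sum to one over y.
assert abs(sum(likelihood[0, y] for y in Y) - 1.0) > 0.1
```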
The book consists of ten lectures, which are numbered decimally. The lectures
are divided into sections (numbered, e.g., 2.3), and the sections consist of
subsections (e.g., 2.3.1). The last section of each lecture is the discussion
with the student Jiří Pecha.
Acknowledgements
This book should express our deep respect and appreciation to our teachers
Prof. V. A. Kovalevski and Prof. Z. Kotek. We would consider it the highest
honour if our teachers were to place this monograph among their own achieve-
ments, because without their contribution it would not have come into existence.
The work on the Czech version of the manuscript lasted much longer than we
initially expected. T. N. Barabanuk (Ms.) helped us in typing the manuscript.
We thank several of our co-workers and students for stimulating discussions over
the two years in which the Czech manuscript was in preparation. These discussions
continued for almost another two years while the edition in English was in
preparation. Assoc. Prof. M. Tlalkova (Ms.) helped us with the translation of
the Czech version into English.

The manuscript in preparation was read and commented on by Dr. J. Grim
(the reviewer of the Czech edition), colleagues Assoc. Prof. M. Navara, Dr.
J. Matas, Prof. B. Melichar, Assoc. Prof. C. Matyska, Prof. H. Bischof, doc-
toral students P. Becvar, A. Fitch, L. Janku (Ms.), B. Kurkoski, D. Průša,
D. Beresford. Students V. Franc and J. Dupac wrote, as diploma theses, the
Matlab toolboxes implementing many algorithms given in our monograph. The
doctoral student V. Zýka helped us a lot with the typography of the monograph in
LaTeX.

We thank the director of the Czech Technical University Publishing House,
Dr. I. Smolíková (Ms.), who helped the Czech edition to be published and also
dealt with the issue of transferring the copyright for the English and intended
Russian editions of the monograph. We acknowledge Prof. M. Viergever, the
Managing Editor of the Kluwer Academic Publishers series Computational Imaging
and Vision, who very positively accepted the idea of publishing the monograph
in English in Spring 1999 and kept stimulating us to do so. The support of the
series editors Dr. P. Roos, Mr. J. Finlay, and their assistant Inge Rardon (Ms.)
was of great help too.

The co-authors are grateful to each other for a cooperation that allowed the
work to be finished in spite of age, geographical, language, and other differences.
Lecture 1
Bayesian statistical decision making
Let us stress the immense generality of the Bayesian task formulation outlined.
Nothing has been said so far about how the set of observations X, the states K,
and the decisions D ought to be understood. In other words, there has not been
any constraint on what mathematical shape the elements of those sets are to
have. The observation x ∈ X, depending on the application, can be a number
or a non-numerical mathematical object. An example can be a symbol in an
abstract alphabet; it can be a vector or an ensemble of characters; it can be
a function of a single variable (a process) or of two variables (an image). We
can further think of a function with a more complicated domain than a set of
values of one or two numerical variables, i.e., of a graph or another algebraic
structure. The sets of states K and decisions D can be similarly diverse.
Various concretisations of Bayesian tasks which have proved to be useful
in pattern recognition will be carefully analysed in this course. Before we get
down to it, we shall list several properties which are valid, in entire generality,
for the whole class of Bayesian tasks. There are not many such properties
whose validity does not depend on a specific application. These properties can
be quite easily formulated and, moreover, they are important, as they allow us
to avoid severe errors.
    R_rand = Σ_{x∈X} Σ_{k∈K} p_{XK}(x, k) Σ_{d∈D} q_r(d | x) W(k, d) .    (1.1)

In such a case there exists the deterministic strategy q: X → D with the risk

    R_det = Σ_{x∈X} Σ_{k∈K} p_{XK}(x, k) W(k, q(x)) .

The equality Σ_{d∈D} q_r(d | x) = 1 holds for any x ∈ X, and q_r(d | x) ≥ 0 holds
for any d ∈ D and x ∈ X. Thanks to this, the inequality

    Σ_{d∈D} q_r(d | x) Σ_{k∈K} p_{XK}(x, k) W(k, d) ≥ min_{d∈D} Σ_{k∈K} p_{XK}(x, k) W(k, d)    (1.2)

holds for every x ∈ X. If we choose

    q(x) = argmin_{d∈D} Σ_{k∈K} p_{XK}(x, k) W(k, d)    (1.3)

and substitute Equation (1.3) into the inequality (1.2), then we obtain the inequality

    R_rand ≥ Σ_{x∈X} Σ_{k∈K} p_{XK}(x, k) W(k, q(x)) .

The risk of the deterministic strategy q can be found on the right-hand side of
the preceding inequality. It can be seen that R_det ≤ R_rand holds. ∎
We have seen that the introduction of a stochastic strategy (also called a
randomisation) cannot improve the Bayesian strategy from the point of view of
the mathematical expectation of the penalty.
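This statement can be checked numerically on a toy problem. In the sketch below the joint distribution and penalty are made-up numbers; the best deterministic strategy is found by enumeration, and no randomly drawn stochastic strategy achieves a smaller risk.

```python
import itertools
import random

# Toy problem (hypothetical numbers): two observations, two states, two decisions.
X, K, D = [0, 1], [0, 1], [0, 1]
p_xk = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}   # joint p_XK
W = {(k, d): float(k != d) for k in K for d in D}              # 0/1 penalty

def risk_det(q):
    # Risk of a deterministic strategy q: X -> D.
    return sum(p_xk[x, k] * W[k, q[x]] for x in X for k in K)

best_det = min(risk_det(dict(zip(X, ds))) for ds in itertools.product(D, repeat=len(X)))

def risk_rand(qr):
    # Risk of a stochastic strategy qr[x][d] = probability of deciding d given x.
    return sum(p_xk[x, k] * qr[x][d] * W[k, d] for x in X for k in K for d in D)

rng = random.Random(0)
for _ in range(1000):
    qr = {}
    for x in X:
        a = rng.random()
        qr[x] = {0: a, 1: 1 - a}
    assert risk_rand(qr) >= best_det - 1e-12  # randomisation never improves the risk
```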
Let us first explain the second important property of the Bayesian strategy with
an example. Let the hidden parameter assume two values only, K = {1, 2}. Let us
assume that, of all the data needed to create the Bayesian strategy q: X → D,
only the conditional probabilities p_{X|1}(x) and p_{X|2}(x) are known. The a
priori probabilities p_K(1) and p_K(2) and the penalties W(k, d), k ∈ {1, 2},
d ∈ D, are not known. In this situation the Bayesian strategy cannot be created.
However, it can be shown that the strategy cannot be an arbitrary one any more.
The strategy should belong to a certain class of strategies, i.e., it should have
certain properties.
If the a priori probabilities p_K(k) and the penalty W(k, d) were known then
the decision q(x) about the observation x ought to be

    q(x) = argmin_{d∈D} ( p_{X|1}(x) p_K(1) W(1, d) + p_{X|2}(x) p_K(2) W(2, d) )

         = argmin_{d∈D} ( (p_{X|1}(x) / p_{X|2}(x)) p_K(1) W(1, d) + p_K(2) W(2, d) )    (1.4)

         = argmin_{d∈D} ( γ(x) c_1(d) + c_2(d) ) .

The notation c_1(d) = p_K(1) W(1, d), c_2(d) = p_K(2) W(2, d) was used in the
last line of Equation (1.4), and the likelihood ratio γ(x) = p_{X|1}(x) / p_{X|2}(x)
was introduced, which is a well known and important concept. It can be seen from
Equation (1.4) that the subset of observations X(d*), for which the decision d*
should be made, is the solution of the system of inequalities

    γ(x) c_1(d*) + c_2(d*) ≤ γ(x) c_1(d) + c_2(d) ,    d ∈ D .

Each inequality in the system is linear with respect to the likelihood ratio
γ(x), and therefore the subset X(d*) corresponds to a convex subset of the
values of the likelihood ratio γ(x). As the values γ(x) are real numbers, their
convex subsets correspond to numerical intervals. We have arrived at an important
property of the Bayesian strategy, valid in the particular case in which the hidden
parameter can assume only two values. (However, there can be more than two
decisions.)
Any Bayesian strategy divides the real axis from 0 to ∞ into |D| intervals
I(d), d ∈ D. The decision d is made for an observation x ∈ X when the likelihood
ratio γ = p_{X|1}(x) / p_{X|2}(x) belongs to the interval I(d).

In the more particular case in which only two decisions D = {1, 2} are possible,
the generally known result is obtained. In this case the Bayesian strategy is
characterised by a single threshold value θ. For an observation x the decision
depends only on whether the likelihood ratio γ(x) = p_{X|1}(x) / p_{X|2}(x) is
larger or smaller than θ.
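For |K| = 2 this threshold form is easy to verify numerically. In the sketch below the priors and penalties are made-up numbers; direct minimisation of the partial risk and the likelihood-ratio threshold give the same decision on a grid of likelihood values.

```python
# Two states, two decisions; hypothetical priors and penalties.
p_k = {1: 0.6, 2: 0.4}
W = {(1, 1): 0.0, (1, 2): 7.0, (2, 1): 1.0, (2, 2): 0.0}

def bayes_decision(p_x_given_1, p_x_given_2):
    # Direct minimisation of the partial risk over d in {1, 2}.
    def partial_risk(d):
        return (p_x_given_1 * p_k[1] * W[1, d]
                + p_x_given_2 * p_k[2] * W[2, d])
    return min((1, 2), key=partial_risk)

def threshold_decision(p_x_given_1, p_x_given_2):
    # Equivalent form: compare the likelihood ratio gamma with a single
    # threshold theta derived from the priors and penalties.
    gamma = p_x_given_1 / p_x_given_2
    theta = (p_k[2] * (W[2, 1] - W[2, 2])) / (p_k[1] * (W[1, 2] - W[1, 1]))
    return 1 if gamma > theta else 2

# The two formulations agree on a grid of (hypothetical) likelihood values.
for a in range(1, 20):
    for b in range(1, 20):
        assert bayes_decision(a / 20, b / 20) == threshold_decision(a / 20, b / 20)
```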
Let us express this property of the Bayesian strategy in a more general case in
which the hidden parameter k can assume more than two values. The likelihood
ratio does not make any sense in such a case. Let us recall the previous case for
this purpose and give an equivalent formulation of Bayesian strategies for the
case in which |K| = 2 and |D| ≥ 2. Each observation x ∈ X will be represented
by a point on a plane with coordinates p_{X|1}(x) and p_{X|2}(x). In this way the
set X is mapped into the upper right quadrant of the plane. Each set X(d),
d ∈ D, is mapped into a sector bounded by two lines passing through the origin
of the coordinate system. The sectors are convex, of course.
Let us proceed now to a more general case in which |K| > 2. Let Π be a
|K|-dimensional linear space. A subset Π' of the space Π is called a cone if
α π ∈ Π' holds for an arbitrary π ∈ Π' and an arbitrary real number α > 0.
If the subset is a cone and, in addition, it is convex then it is called a convex
cone, see Fig. 1.1.
Let us map the set of observations X into the positive hyperquadrant of the
space Π, i.e., into the set of points with non-negative coordinates. The point
π(x) with coordinates p_{X|K}(x | k), k ∈ K, corresponds to the observation
x ∈ X. Any Bayesian strategy can be formed by a decomposition of the positive
hyperquadrant of the space Π into |D| convex cones Π(d), d ∈ D, in such a way
that the decision d is taken for the observation x when π(x) ∈ Π(d). Some of
the cones can be empty.
Let us express this general property of Bayesian strategies in the following
theorem.
Proof. Let us create the cones which are referred to in the theorem being proved.
Let us enumerate the decisions from the set D in such a way that n(d) is the
number of the particular decision d. Let us state one of the possible strategies
that minimises the risk. It is going to be the strategy that makes a decision d*
when such an x is observed that

    Σ_{k∈K} p_{X|K}(x | k) p_K(k) W(k, d*) ≤ Σ_{k∈K} p_{X|K}(x | k) p_K(k) W(k, d) ,    n(d) < n(d*),

    Σ_{k∈K} p_{X|K}(x | k) p_K(k) W(k, d*) < Σ_{k∈K} p_{X|K}(x | k) p_K(k) W(k, d) ,    n(d) > n(d*).

The system of inequalities given above can be expressed by means of the
coordinates of the point π(x) ∈ Π, i.e., the numbers π_k = p_{X|K}(x | k). The
point π with coordinates π_k, k ∈ K, has to be mapped into the set Π(d*) if

    Σ_{k∈K} π_k p_K(k) W(k, d*) ≤ Σ_{k∈K} π_k p_K(k) W(k, d) ,    n(d) < n(d*),

    Σ_{k∈K} π_k p_K(k) W(k, d*) < Σ_{k∈K} π_k p_K(k) W(k, d) ,    n(d) > n(d*).
(a, π) such that (a, π) < 0 for one cone and (a, π) ≥ 0 for the second one.
Such sets are called linearly separable; one also says that there exists a linear
discriminant function separating these two sets. This property has been quite
popular in pattern recognition so far. One of the later lectures will be devoted
to it. The theorem provides a certain basis for, and explanation of, this
popularity, as it states that the Bayesian strategy surely decomposes the space
of probabilities into classes that are linearly separable.
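The content of the theorem can be illustrated with a small numerical sketch (the priors and penalties below are invented): the decision regions in the space of points π = (p_{X|K}(x | k), k ∈ K) are cut out by functions linear in π, and scaling π by any α > 0 never changes the decision, which is exactly the cone property.

```python
# Three states, three decisions; hypothetical priors, 0/1 penalty.
K = [0, 1, 2]
D = [0, 1, 2]
p_k = {0: 0.5, 1: 0.3, 2: 0.2}
W = {(k, d): float(k != d) for k in K for d in D}

def f(pi, d):
    # Linear discriminant function of the point pi for decision d.
    return sum(pi[k] * p_k[k] * W[k, d] for k in K)

def decide(pi):
    return min(D, key=lambda d: f(pi, d))

# Cone property: scaling pi by any alpha > 0 does not change the decision,
# because every f(., d) is linear in pi.
pi = {0: 0.2, 1: 0.5, 2: 0.1}
for alpha in (0.5, 1.0, 7.3):
    assert decide({k: alpha * pi[k] for k in K}) == decide(pi)
```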
is thus the probability of the situation q(x) ≠ k*. The Bayesian task consists,
just as in the general case, in determining the strategy q: X → K which
minimises the mathematical expectation given by Equation (1.6), i.e.,

    q(x) = argmin_{k∈K} Σ_{k*∈K} p_{XK}(x, k*) W(k*, k)    (1.7)

         = argmin_{k∈K} p_X(x) Σ_{k*∈K} p_{K|X}(k* | x) W(k*, k)

         = argmin_{k∈K} Σ_{k*∈K\{k}} p_{K|X}(k* | x)

         = argmin_{k∈K} ( Σ_{k*∈K} p_{K|X}(k* | x) − p_{K|X}(k | x) )

         = argmin_{k∈K} ( 1 − p_{K|X}(k | x) ) = argmax_{k∈K} p_{K|X}(k | x) .
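The equivalence derived above is easy to confirm numerically. The posterior values below are hypothetical; for each of them the minimiser of the 0/1-penalty partial risk coincides with the maximum a posteriori decision.

```python
# With the 0/1 penalty W(k*, k) = [k* != k], the Bayesian decision
# argmin_k sum_k* p(k* | x) W(k*, k) equals argmax_k p(k | x).
K = [0, 1, 2]
posteriors = [
    {0: 0.2, 1: 0.5, 2: 0.3},
    {0: 0.7, 1: 0.1, 2: 0.2},
    {0: 0.1, 1: 0.1, 2: 0.8},
]

for p_post in posteriors:
    bayes = min(K, key=lambda k: sum(p_post[k2] for k2 in K if k2 != k))
    map_k = max(K, key=lambda k: p_post[k])
    assert bayes == map_k
```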
Let us mention one more pseudo-solution which, regrettably, occurs even more
often than the previous case. Assume that a device or a computer program is
created that implements a strategy q: X → D which, for a given observation
x, decides about a state k that can assume one of four values: 1, 2, 3 and 4.
Assume that this strategy is optimal from the standpoint of the probability of
a wrong decision. Let us now imagine that it turns out that such detailed
information about the state is not needed; it is sufficient to decide whether
the state is smaller than 3 (or not). Obviously the task is thereby modified.
In the first case, the set D consists of four decisions k = 1, k = 2, k = 3
and k = 4. In the second, new, task the set D' contains two decisions only,
k ∈ {1, 2} and k ∈ {3, 4}. It is thus necessary to replace the previous strategy
q: X → D by a new strategy q': X → D'. It could appear that the new task is
simpler than the previous one and that (watch out, the error follows!) the
existing strategy q can be used when designing the new strategy q'. Then it
would be decided that the state is smaller than 3 if q(x) = 1 or q(x) = 2, and
that the state is not smaller than 3 if q(x) = 3 or q(x) = 4. Theorem 1.2 about
the convex shape of the classes in the space of probabilities Π provides good
reasons to doubt the proposed solution of the described task.
When the first task is solved, the space of probabilities is separated into four
classes Π(1), Π(2), Π(3) and Π(4). The strategy q' is constructed for the new
task in such a way that the space Π is divided into two classes Π(1) ∪ Π(2) and
Π(3) ∪ Π(4). But Theorem 1.2 states that each of these six sets has to be a
convex cone. When the strategy q' is created in such a simple way it may easily
happen that the classes corresponding to the strategy are not convex, because a
union of convex sets can be a non-convex set, see Fig. 1.2. When this happens it
means that the new strategy q' not only fails to reach the minimal probability
of a wrong decision but, moreover, does not solve any Bayesian task at all.
Let us show by an example how it can happen that the strategy q' created in
the abovementioned way is not the best one. Assume that for some observation
x the a posteriori probabilities of the states 1, 2, 3 and 4 are 0.3, 0.3, 0.4
and 0.0, respectively. The strategy q decides in this case that the state k = 3
has occurred. It is the best decision from the point of view of the minimal
probability of a wrong state. The strategy q' exploits this decision and
determines that the state is not smaller than 3. This is, indeed, not the best
decision: the probability of error in this case is equal to 0.6. If the opposite
answer were given then the probability of error would be 0.4, i.e., the smaller one.

Figure 1.2: The union of two convex cones does not necessarily have to be a
convex cone.
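The arithmetic of this example can be replayed directly, contrasting the grouped pseudo-solution with the correct Bayesian solution of the two-decision task:

```python
# A posteriori probabilities of states 1..4 from the example above.
post = {1: 0.3, 2: 0.3, 3: 0.4, 4: 0.0}

# Original task: pick the most probable state; q(x) = 3.
q = max(post, key=post.get)
assert q == 3

# Pseudo-solution: reuse q to answer "is the state smaller than 3?".
pseudo_answer = q < 3                       # False: "not smaller than 3"
err_pseudo = post[1] + post[2] if not pseudo_answer else post[3] + post[4]

# Correct Bayesian solution of the new task compares the grouped posteriors.
p_small = post[1] + post[2]                 # 0.6
p_large = post[3] + post[4]                 # 0.4
bayes_answer = p_small > p_large            # True: "smaller than 3"
err_bayes = min(p_small, p_large)

assert abs(err_pseudo - 0.6) < 1e-12 and abs(err_bayes - 0.4) < 1e-12
```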
The value R(x, d) will be called the partial risk. Naturally, a decision d has
to be chosen for each observation x in such a way that the partial risk is
minimal. However, it can happen for some observations that this minimum is
quite large. It would be appropriate if the set of decisions also contained a
particular decision corresponding to the answer not known. The decision not
known is given in the case in which the observation x does not contain enough
information for a decision with a small risk. Let us formulate this task within
the Bayesian approach.

Let X and K be sets of observations and states, p_{XK}: X × K → ℝ be a
probability distribution, and D = K ∪ {not known} be a set of decisions. Let
us determine the penalties W(k, d), k ∈ K, d ∈ D, according to the following rule:

    W(k, d) = 0,  if d = k ,
              1,  if d ≠ k and d ≠ not known ,    (1.9)
              ε,  if d = not known .
Let us find the Bayesian strategy q: X → D for this case. The decision q(x)
corresponding to the observation x has to minimise the partial risk. This means
that

    q(x) = argmin_{d∈D} Σ_{k*∈K} p_{K|X}(k* | x) W(k*, d) .    (1.10)
    = min_{k∈K} Σ_{k*∈K\{k}} p_{K|X}(k* | x)

    = min_{k∈K} ( Σ_{k*∈K} p_{K|X}(k* | x) − p_{K|X}(k | x) )

    = 1 − max_{k∈K} p_{K|X}(k | x) .
The description of the strategy q(x) may be put in words as follows. The state
k with the largest a posteriori probability has to be found first. If this
probability is larger than 1 − ε then it is decided in favour of the state k.
If this probability is not larger than 1 − ε then the decision not known is
given.

Such a strategy can also be understood informally. If ε = 0 then the best
strategy is never to decide. This is understandable, since the penalty
corresponding to the answer not known is zero, whereas if it is decided in
favour of any state then an error is not excluded, and thus a non-zero penalty
is not excluded either. Conversely, for ε > 1 the partial risk of the decision
not known will be greater than the partial risk of any other decision. In this
case the answer not known will never be used.
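In code, the strategy reads as follows; the posterior values are made up, and ε plays the role of the penalty for the answer not known:

```python
def decide_with_reject(posterior, eps):
    # Decide the most probable state, unless its posterior is too small:
    # answer "not known" when max_k p(k | x) <= 1 - eps.
    k_best = max(posterior, key=posterior.get)
    if posterior[k_best] > 1 - eps:
        return k_best
    return "not known"

posterior = {1: 0.55, 2: 0.25, 3: 0.20}
assert decide_with_reject(posterior, eps=0.0) == "not known"   # eps = 0: never decide
assert decide_with_reject(posterior, eps=0.5) == 1             # 0.55 > 1 - 0.5
assert decide_with_reject(posterior, eps=1.5) == 1             # eps > 1: always decide
```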
This correct strategy is quite natural, but pseudo-solutions are suggested
even in this case. For example, the following one: it is suggested to use the
answer not known for an observation that has a small probability for every
state k, i.e., if the probability p_{X|K}(x | k) < δ for all k ∈ K. The number
δ indicates a threshold that says what is understood as an observation of small
probability. It follows from Theorem 1.2 on the convex shape of classes in the
space of probabilities that this strategy is not a solution of our task and,
moreover, that it is not a solution of any Bayesian task. The set of
|K|-dimensional vectors, i.e., the set of |K|-tuples (p_{X|K}(x | k), k ∈ K),
which constitutes the class not known in the space of probabilities, is convex
but it is not a cone. The case is illustrated in Fig. 1.3. The grey shaded
square illustrates observations with small probabilities. Though the square is
convex, it is not a cone, due to its spatial limitations.

Figure 1.3: The decision not known for observations with small probability is
not a cone and thus it does not correspond to any Bayesian strategy.
1.5 Discussion
I have come across the Bayesian theory several times, and each time it seemed to
me that the theory was explained too generally. It appears to me that only
the particular case is useful, i.e., the minimisation of the probability of a
wrong decision. I do not see the applicability of another approach yet; perhaps
my horizon is not wide enough. Is the general theory not a mere mental exercise?
Does the lecture not teach the lesson that only its last part, dealing with the
particular case, has practical significance?
We do agree with you that the probability of a wrong decision has to be
minimised in many practical tasks. On the other hand, let us not forget that the
set of practical tasks has an inexhaustible variability. It is hopeless to
attempt to squeeze all possible tasks into any theoretical construction based
on the optimisation of a single criterion, even one as respected as the
probability of a wrong decision. We shall see in Lecture 2 that many practical
tasks cannot be compressed even into such a rich theoretical construct as the
Bayesian approach.

If the Bayesian approach, general as it is, does not suffice for some
practical tasks then one cannot think that it is too rich. Imagine that you
come across an applied problem which cannot be squeezed into that part of the
Bayesian approach which selects the most probable state. In the better case it
will raise your doubts and you will try to find a better solution. In the worse
case you will distort your task, so that no one could recognise it any more,
trying to modify it to match the part of the theory you are familiar with.
Finally, you will solve a totally different task than was needed.
The recognition task is often understood in a constrained manner, as the
maximisation of the a posteriori probability p_{K|X}(k | x). Indeed, Bayesian
recognition was understood in your question in this way too. Such an approach
follows from the simplified assumption that all errors have the same
significance, even where it is natural to penalise them differently.

The Bayesian approach covers the more realistic case which takes into account
that errors are of different significance. This is expressed by the dependence
of the penalty on the deviation from the correct answer. This is the reason why
it is often necessary that not only the a posteriori probability p_{K|X}(k | x)
be part of the maximised criterion, but the penalty function as well, which
takes into account the significance of the error.
As an exercise we would like you to solve an example which can serve as a
model of many practical situations. Assume that the set X is a set of images of
digits ranging from 0 to 9 which were written on a sheet of paper. The set of
possible states is consequently K = {0, 1, 2, ..., 9}. Let us assume that the a
priori probabilities on the set K are known; for example, p_K(k) = 0.1 for all
k ∈ K. In addition, let us presume that the conditional probabilities
p_{X|K}(x | k) are known as well, even though they are enormously complicated.
Even so, let us imagine that we already have a program at hand that yields the
10 probabilities p_{X|K}(x | k) for each image x ∈ X. This program maps the
analysed image into the space of probabilities. Everything that is needed to
solve any Bayesian task is at one's disposal at this moment. The popular task of
estimating the digit k which secures the smallest probability of error belongs
to these tasks. A digit k with the largest a posteriori probability is searched
for. The a posteriori probabilities are calculated according to Bayes' formula

    p_{K|X}(k | x) = p_K(k) p_{X|K}(x | k) / Σ_{k'∈K} p_K(k') p_{X|K}(x | k')

and the sum S = Σ_{i=1}^{20} k_i is estimated. If the question is asked why it
is done exactly in this way, then there is an answer prepared for all cases:
'it follows from the Bayesian theory that such an algorithm yields the best
results'. Nothing of this kind follows from the Bayesian theory, of course. We
are looking forward to you proposing a correct solution of the specified task
which actually follows from the Bayesian theory.
    X × X × ... × X    (20 times),

which is the set of sequences consisting of 20 images each. The set of states
K̄ is

    K × K × ... × K    (20 times),

i.e., the set of sequences consisting of 20 digits, where each digit can be
0, 1, ..., 9. Let us denote by x̄ the sequence x_1, x_2, ..., x_20 that is
submitted to recognition. In addition, let us denote by k̄ the sequence
k_1, k_2, ..., k_20 of digits that are really shown on the recognised images.
The probabilities p_{X̄K̄}: X̄ × K̄ → ℝ are clearly

    p_{X̄K̄}(x̄, k̄) = ∏_{i=1}^{20} p_K(k_i) p_{X|K}(x_i | k_i) ,
because the images in the analysed sequence are mutually independent. The
penalty function W: K̄ × D → ℝ has the value either zero or one: it is zero if
Σ_{i=1}^{20} k_i = d, and one otherwise. The Bayesian strategy is

    q(x̄) = argmin_{d∈D} Σ_{k̄∈K̄} p_{X̄K̄}(x̄, k̄) W(k̄, d)

         = argmin_{d∈D} p_{X̄}(x̄) Σ_{k̄∈K̄} p_{K̄|X̄}(k̄ | x̄) W(k̄, d)

         = argmin_{d∈D} Σ_{k̄∉K̄(d)} p_{K̄|X̄}(k̄ | x̄)

         = argmin_{d∈D} ( 1 − Σ_{k̄∈K̄(d)} p_{K̄|X̄}(k̄ | x̄) )

         = argmax_{d∈D} Σ_{k̄∈K̄(d)} p_{K̄|X̄}(k̄ | x̄) .
In the last three steps of the derivation, K̄(d) denotes the set of sequences
k_1, k_2, ..., k_20 whose sum Σ_{i=1}^{20} k_i = d. Furthermore, the expression
for the
derived Bayesian strategy can be made more specific because images in the
observed sequence are independent. Thus
    q(x̄) = argmax_{d∈D} Σ_{k_1∈K} Σ_{k_2∈K} ... Σ_{k_19∈K} Σ_{k_20∈K} ∏_{i=1}^{20} p_{K|X}(k_i | x_i) ,    (1.15)

where the summation runs over all sequences (k_1, k_2, ..., k_20) whose sum
Σ_{i=1}^{20} k_i = d.
I believe that I honestly did all that was recommended in the lecture and
obtained the Bayesian strategy suitable to our task. Nevertheless, I have doubts
about the value of the result. The expression obtained is not likely to be
usable in practice. The maximisation is not a problem any more, because it is
necessary to find the largest number out of only 181 numbers (the value 181 is
the number of possible values of the sum Σ_i k_i). What matters is the
fantastically difficult calculation of those 181 numbers from which the maximal
one is to be selected. Indeed, it is required to calculate a sum of roughly as
many summands as there are possible sequences k̄, i.e., 10^20 summands.
It can be clearly concluded from this example why the theoretical
recommendation typically is not followed. It seems to me that the reason does
not lie in not understanding the theory, but in the difficulties of implementing
the theoretical recommendations in practice. I prefer to solve the task using
the method you laughed at.
    F_1(d) = Σ_{k_1∈K : k_1 = d} ∏_{i=1}^{1} p_{K|X}(k_i | x_i) ,

    F_2(d) = Σ_{k_1∈K} Σ_{k_2∈K : k_1+k_2 = d} ∏_{i=1}^{2} p_{K|X}(k_i | x_i) ,

    ...

    F_{j+1}(d) = Σ_{k_1∈K} ... Σ_{k_{j+1}∈K : k_1+...+k_{j+1} = d} ∏_{i=1}^{j+1} p_{K|X}(k_i | x_i) ,

    ...

    F_20(d) = Σ_{k_1∈K} Σ_{k_2∈K} ... Σ_{k_19∈K} Σ_{k_20∈K : k_1+...+k_20 = d} ∏_{i=1}^{20} p_{K|X}(k_i | x_i) .
It is obvious that the function F_20 is the same as the function F. It is clear
too that the values F_1(d) are easy to calculate. Actually, they need not even
be computed: the product ∏_{i=1}^{1} p_{K|X}(k_i | x_i) consists of the single
factor p_{K|X}(k_1 | x_1), and the sum over k_1 ∈ K with k_1 = d consists of one
single summand, so that F_1(d) = p_{K|X}(d | x_1). We will show how to calculate
the values F_{j+1}(d) provided that the values F_j(d) are already available:

    F_{j+1}(d) = Σ_{k∈K} F_j(d − k) p_{K|X}(k | x_{j+1}) .    (1.18)

When calculating the value of the function F_{j+1}(d) for one d, we have to
perform 10 multiplications and 9 additions. There are not more than 181 such
values. The transformation of the function F_j into F_{j+1} has to be done 19
times before we obtain the function F_20. Consequently, we do not need 10^20
calculations but only 10 × 181 × 19 multiplications at most, and nearly the same
number of additions. This is not worth mentioning from the computational
complexity point of view.
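The recursion can be sketched in a few lines. The per-image posteriors p_{K|X}(k | x_i) below are random stand-ins for the output of the digit classifier, and 5 images are used instead of 20 so that a brute-force check over all sequences stays feasible; the two computations agree.

```python
import itertools
import random

# Random stand-ins for the classifier outputs p_K|X(k | x_i), k = 0..9.
n = 5
K = range(10)
rng = random.Random(1)
post = []
for _ in range(n):
    w = [rng.random() for _ in K]
    total = sum(w)
    post.append([v / total for v in w])

# F_1(d) = p_K|X(d | x_1); then F_{j+1}(d) = sum_k F_j(d - k) p_K|X(k | x_{j+1}).
F = {d: post[0][d] for d in K}
for j in range(1, n):
    G = {}
    for d, f_d in F.items():
        for k in K:                        # 10 multiplications per value of d
            G[d + k] = G.get(d + k, 0.0) + f_d * post[j][k]
    F = G

# Brute force over all 10^n sequences, as in (1.15).
brute = {}
for ks in itertools.product(K, repeat=n):
    p = 1.0
    for i, k in enumerate(ks):
        p *= post[i][k]
    s = sum(ks)
    brute[s] = brute.get(s, 0.0) + p

assert all(abs(F[s] - brute[s]) < 1e-9 for s in brute)
d_best = max(F, key=F.get)   # Bayesian decision for the 0/1 penalty on the sum
```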
Expression (1.17) surprised me. I believe that it is not a mere trick; there is
probably some depth in it, since you said that similar modifications would be
needed several times.

Let me come back to my conviction that nothing but maximisation of the a
posteriori probability is needed. The previous example did not disprove this
conviction; perhaps just the converse is true. If I look at the calculated
functions carefully, it becomes clear that the function F(d) is nothing else but
the a posteriori probability that the sum of the random numbers k_1, k_2, ...,
k_20 is equal to d. By the way, the recursive expression (1.18) is nothing but
the known formula, which we learned at college, for calculating the probability
distribution of the sum of two independent variables: one is k_{j+1} and the
other has the probability distribution F_j. It seems that the best strategy
again means the search for the most probable value, which is calculated in a
non-trivial way, I have to admit. Maximisation of some probabilities was
avoided, but we reached probability maximisation via a detour through more
general Bayesian considerations. What differed were the events.
It is good that you realised you have been familiar with formula (1.18) for a
long time. If you had recalled it earlier, you would have overcome by yourself
the troubles that blocked your way. We did not get to the bottom of the
maximisation of the a posteriori probability together; you did it yourself when
you formulated the penalty function W(k̄, d) in the way that seemed to you the
only possible one. We do not object to this function, but neither do we consider
it the only possible one or the most natural one. It is quite unnatural to pay
the same penalty when, instead of the actual value 90, the sum is estimated to
be 89 as when it is estimated to be 25. It would be more natural if the penalty
were larger in the second case than in the first one. What about analysing
various penalty functions that may suit the formulated task?
Let d*(k*) denote the true result \sum_{i=1}^{20} k_i. The first penalty function could
be: W(k, d) is zero if |d*(k*) - d| is not greater than an acceptable error
(tolerance) Δ. The unit penalty is used when the difference |d*(k*) - d| is
greater than the tolerance Δ. Minimisation of the risk in this case means
minimisation of the probability that the difference of the estimated d from the correct
value d*(k*) will be greater than Δ.
1.5 Discussion 17
Let the second penalty function be the difference |d*(k*) - d| and the third
one be proportional to the squared difference (d*(k*) - d)^2.
Although the second penalty function worried me before, I hope that I have
mastered it now. The algorithms for the three formulations of Bayesian tasks
are rather similar. All three comprise a calculation of a posteriori probabilities
F(d) for each value of the sum. When the function F(d) is available then
the decision is made for each formulation of the penalty function differently.
The simplest situation occurs in the case of the quadratic penalty, for which the
Bayesian strategy is derived in the following manner. The decision d has to
minimise the partial risk

  R(d) = \sum_{k^* \in K} p_{K|X}(k^* \mid x) \bigl(d - d^*(k^*)\bigr)^2 .

This means that the decision d is a solution of the equation that requires the
derivative of the function R(d) to be equal to zero, i.e.,

  \frac{dR(d)}{dd} = 2 \sum_{k^* \in K} p_{K|X}(k^* \mid x) \bigl(d - d^*(k^*)\bigr)
    = 2d - 2 \sum_{d^* \in D} \sum_{k^* \in K(d^*)} p_{K|X}(k^* \mid x)\, d^*
    = 2d - 2 \sum_{d^* \in D} d^* \sum_{k^* \in K(d^*)} p_{K|X}(k^* \mid x)
    = 2d - 2 \sum_{d^* \in D} d^* F(d^*) = 0 .
It follows that d = \sum_{d^* \in D} d^* F(d^*), as one could expect. The decision d will be
in favour of the a posteriori mathematical expectation of the correct sum d*.
Let us return to the first penalty function with a tolerance Δ. The penalty
W(k*, d) is now either one or zero. The first option occurs when the error
|d*(k*) - d| is greater than Δ. The second option applies otherwise. Let us
denote by the symbol g(d*, d) a function of two variables. Its value is equal to 0 if
|d* - d| ≤ Δ, and it is equal to 1 if |d* - d| > Δ. In such a case the decision has
to be

  d = \operatorname*{argmin}_{d \in D} \sum_{k^* \in K} p_{K|X}(k^* \mid x)\, W(k^*, d)
    = \operatorname*{argmin}_{d \in D} \sum_{k^* \in K} p_{K|X}(k^* \mid x)\, g\bigl(d^*(k^*), d\bigr)
    = \operatorname*{argmin}_{d \in D} \sum_{d^* \in D} g(d^*, d) \sum_{k^* \in K(d^*)} p_{K|X}(k^* \mid x)
    = \operatorname*{argmin}_{d \in D} \sum_{d^* \in D} g(d^*, d)\, F(d^*)
    = \operatorname*{argmin}_{d \in D} \Bigl(1 - \sum_{|d^* - d| \le \Delta} F(d^*)\Bigr)
    = \operatorname*{argmax}_{d \in D} \sum_{d^* = d - \Delta}^{d + \Delta} F(d^*) .

So minimisation of the risk has been reduced to the maximisation of the sum

  R'(d) = \sum_{d^* = d - \Delta}^{d + \Delta} F(d^*) .   (1.19)
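As a small sketch of this decision rule (not the book's code), the maximisation of (1.19) slides a window of width 2Δ+1 over the posterior F; the distribution used here is invented for illustration.

```python
def tolerance_decision(F, delta):
    """Decision for the 0/1 penalty with tolerance delta: maximise
    R'(d) = sum of F(d*) over |d* - d| <= delta, cf. (1.19)."""
    return max(sorted(F),
               key=lambda d: sum(F.get(d + s, 0.0)
                                 for s in range(-delta, delta + 1)))

# hypothetical posterior of the sum, concentrated around 90
F = {88: 0.1, 89: 0.2, 90: 0.25, 91: 0.2, 92: 0.1, 100: 0.15}
print(tolerance_decision(F, 0))   # → 90 (Δ = 0 reduces to the MAP decision)
print(tolerance_decision(F, 2))   # → 90 (window [88, 92] carries mass 0.85)
```

For Δ = 0 the rule coincides with choosing the most probable sum, as the discussion above suggests.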
I will decompose the set D = {0, ..., 180} into three subsets:

  D^+ = {d* | (d* ∈ D) ∧ (d* > d)} ,
  D^- = {d* | (d* ∈ D) ∧ (d* < d)} ,
  D^= = {d* | (d* ∈ D) ∧ (d* = d)} ,
which says that the minimal risk is achieved when the total probability of
the event that the random variable d* is smaller than d is equal to the total
probability that a random variable d* is larger than d. The value d is the
median of the random variable d*.
Although this result is correct (which I will show later), the procedure that
led me to it is incorrect. First, the function R(d) in expression (1.21) only
seems to be linear. The value R(d) depends on d not only explicitly but also
implicitly by means of the sets D+ and D- because the sets themselves depend
on d. Second, it may happen that the condition (1.22) will not be satisfied for
any d. That would mean that the function (1.21) does not have the minimum,
which is not true. It is obvious that some more complicated considerations
are needed for the minimisation of R(d). I have to confess that I have not yet
solved the task. Nevertheless, I am convinced that the solution corresponds to
the median of the random variable d*, and I back it up with the following mechanical
model.
Let us imagine that a thin wooden board is at hand, a straight line is drawn on
it, and 181 holes regularly spaced, 1 cm apart, are drilled in it. The holes will
match the values d* = 0, 1, ..., 180. Let us raise the board to a horizontal position
180 cm above the ground. Let us get 181 strings, each 180 cm long. The
ends of the strings on one side will be tied together into a single knot. The other
ends of the strings remain free. Each string will be led through a hole so
that the knot remains above the wooden board and the free ends hang below.
A weight will be tied on to the free end of each string. From the string passing
through the d*-th hole a weight of the mass F(d*) will be hanging.
The strings will stretch out owing to their weights and will reach a steady
position. We are now interested in what position above the wooden board the
knot settles. The knot cannot get under the board. It could do so only
by passing through a hole in the board, in which case one of the weights would have
to lie on the ground, since the length of the string exactly corresponds to the
distance between the board and the ground, and the other weights would pull the
knot back to the top side of the board. So we know that the knot must remain
above the wooden board. The knot has to lie on the straight line connecting
the holes in the board: if the knot got off the straight line then the resultant
force of all the weights would pull it back. The knot is steady. Consequently
the sum of the weights pulling leftwards cannot be greater than 0.5, i.e., cannot
be greater than the sum of the weights pulling rightwards.
This means that the knot positions d for which \sum_{d^* < d} F(d^*) > 0.5 cannot
be stable positions. For the same reason, the positions for which \sum_{d^* > d} F(d^*) > 0.5
have to be excluded too. So only one position d remains, namely that for which
\sum_{d^* < d} F(d^*) ≤ 0.5 and \sum_{d^* > d} F(d^*) ≤ 0.5 simultaneously hold. This corre-
sponds to the median of a random variable with the probability distribution F(d*).
I have to make sure that the sum \sum_{d^* \in D} F(d^*) |d^* - d| achieves its minimal
value at the steady position of the knot. A mechanical system such as ours cannot
rest in any state other than that in which the potential energy \sum_{d^* \in D} F(d^*) h(d^*)
is minimal. The value h(d*) corresponds to the height of the d*-th weight above the
20 Lecture 1: Bayesian statistical decision making
ground. The total length 180 cm of each string can be divided into two parts:
the length l_1(d*) lying on the board, and the length l_2(d*) below the board.
The distance 180 cm from the ground to the d*-th hole also consists of two parts:
the length l_2(d*) and the height h(d*) of the d*-th weight above the ground.
This means that 180 cm = l_1(d*) + l_2(d*) = l_2(d*) + h(d*). From this it follows
that h(d*) = l_1(d*). But the length l_1(d*) is equal to the distance of the d*-th hole
from the knot, which is l_1(d*) = |d* - d|. This means that the potential energy
of our mechanical system is \sum_{d^* \in D} F(d^*) |d^* - d|, and it is this value that
is minimised.
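The equilibrium argument can also be checked numerically. The sketch below (with an invented distribution F) confirms that the "knot position" found by the stability rule coincides with the minimiser of \sum_{d^*} F(d^*) |d^* - d|.

```python
def expected_abs_error(F, d):
    """Expected absolute deviation, i.e., the potential energy of the model."""
    return sum(p * abs(ds - d) for ds, p in F.items())

def median_decision(F):
    """The knot is stable where at most half the mass pulls to either side."""
    for d in sorted(F):
        left = sum(p for ds, p in F.items() if ds < d)
        right = sum(p for ds, p in F.items() if ds > d)
        if left <= 0.5 and right <= 0.5:
            return d

F = {0: 0.3, 1: 0.25, 5: 0.25, 20: 0.2}   # hypothetical skewed posterior
d_med = median_decision(F)
# brute-force check that the median minimises the expected absolute error
d_best = min(range(0, 21), key=lambda d: expected_abs_error(F, d))
print(d_med, d_best)   # both give 1
```

The median (here 1) is far from the mean of F, illustrating how the choice of penalty function changes the decision.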
The idea of a mechanical interpretation of the task is neat. What pleased us
most was when you analysed the quadratic penalty function and unwittingly
uttered 'one can expect that the decision will be in favour of the a posteriori
mathematical expectation of the correct sum'. You did not recall that it was not
so long ago that you had not wanted to approve of anything but the decision
in favour of the most probable sum. You were not even puzzled that the
mathematical expectation need not be an integer, although only integer sums
have non-zero a posteriori probabilities. This means that, as a rule, the decision
is made in favour of a state with zero a posteriori probability, not the one with
the maximal probability.
I think that all three tasks became solvable only because the sum \sum_{i=1}^{20} k_i
assumed a finite and, in particular, a small number of values on the set K.
Thus the probability distribution can be calculated for every value of the sum.
Then it is easy to determine the best value from 181 probability values.
Assume that we want to solve a slightly more complicated task. The aim of
the previous, simpler task was to determine a sum of twenty individual digits
under uncertainty in recognising the digits from images. Let us imagine a more
complicated case in which the twenty digits express a single decimal number. The
rightmost digit corresponds to units, the digit one position to the left matches
tens, etc. The digits are k_1 k_2 k_3 ... k_20. We are supposed to determine the value
of the decimal number, i.e., the sum

  \sum_{i=1}^{20} a_i k_i = \sum_{i=1}^{20} 10^{i-1} k_i .
The aim of the new, more general task is to estimate the value of a decimal
number with the smallest possible quadratic error. The complication of the
task is caused by the coefficients a_i. Owing to them, the sum \sum_{i=1}^{20} a_i k_i can take
not 181 but tremendously many values on the set K.
Let us analyse the newly formulated task from the very beginning, i.e., starting
already from the expression for the risk (1.14). If the function F cannot be
computed, let us try to avoid its computation. The decision d about the sum
\sum_{j=1}^{20} a_j k_j has to minimise the expected quadratic error (d - \sum_{j=1}^{20} a_j k_j)^2.
Thus

  \hat{d} = \sum_{k \in K} \prod_{i=1}^{20} p_{K|X}(k_i \mid x_i) \sum_{j=1}^{20} a_j k_j
    = \sum_{j=1}^{20} a_j \sum_{k \in K} k_j \prod_{i=1}^{20} p_{K|X}(k_i \mid x_i)
    = \sum_{j=1}^{20} a_j \sum_{k_j \in K} p_{K|X}(k_j \mid x_j)\, k_j
      \sum_{k_1 \in K} \cdots \sum_{k_{j-1} \in K} \sum_{k_{j+1} \in K} \cdots \sum_{k_{20} \in K}
      \prod_{i=1,\, i \ne j}^{20} p_{K|X}(k_i \mid x_i)
    = \sum_{j=1}^{20} \sum_{k_j \in K} a_j k_j\, p_{K|X}(k_j \mid x_j)
    = \sum_{j=1}^{20} a_j \bar{k}_j ,

where \bar{k}_j denotes the a posteriori mathematical expectation of the j-th digit.
This proves the generally known result that the expected value of a sum of
random variables is equal to the sum of the expected values of the individual sum-
mands. We can see that the algorithms estimating the linear function \sum_{i=1}^{20} a_i k_i,
which are the best in the quadratic sense, do not depend to a great extent on the
coefficients a_i. The 20 a posteriori mathematical expectations \bar{k}_i, i = 1, ..., 20,
have to be calculated first. This is the most difficult part of the task because
the function p_{K|X}(k \mid x) can be quite complicated. This most difficult calcu-
lation does not depend on the coefficients a_i. Only after that are the coefficients
a_i used to compute the best estimate according to the extremely simple
expression \hat{d} = \sum_{i=1}^{20} a_i \bar{k}_i.
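A sketch of this two-stage computation follows; the posteriors are fabricated stand-ins for a real p_{K|X}. First the per-digit posterior expectations \bar{k}_i are computed, and only then do the coefficients a_i enter.

```python
def quadratic_estimate(posteriors, coeffs):
    """Best estimate of sum_i a_i k_i under the quadratic penalty:
    the weighted sum of the per-digit posterior expectations."""
    # stage 1: a posteriori expectation of each digit (the hard part,
    # independent of the coefficients)
    means = [sum(k * p[k] for k in range(10)) for p in posteriors]
    # stage 2: the trivial weighted sum, the only place the a_i enter
    return sum(a * m for a, m in zip(coeffs, means))

p7 = [0.0] * 10
p7[7] = 1.0                      # hypothetical: each digit is surely a 7
posteriors = [p7] * 20
print(quadratic_estimate(posteriors, [1] * 20))   # plain sum of digits: 140.0
# decimal-number weights 10^(i-1); value 77...7 up to float rounding
print(quadratic_estimate(posteriors, [10**i for i in range(20)]))
```

Swapping the coefficient vector changes only the cheap second stage, exactly as claimed above.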
  W(k^*, k) = \Bigl( \sum_{i=1}^{20} a_i k_i^* - \sum_{i=1}^{20} a_i k_i \Bigr)^2 .   (1.23)
We have already learned that the optimal estimate for the sequence k =
(k_1, k_2, ..., k_20) is the sequence of the mathematical expectations of the values k_i,
which, as can be seen, does not depend on the coefficients a_i, i = 1, 2, ..., 20,
at all. In this case the risk can be minimised even under the condition that
the penalty function is not known. It only has to be certain that the penalty
function is quadratic, i.e., that it has the form of (1.23). If this result is used,
another step forward can be made. The strategy minimising the mathematical
expectation of a penalty of the form (1.23) independently of the coefficients a_i, i =
1, 2, ..., 20, is suitable for any penalty function that is defined as a sum of
quadratic penalty functions of the form (1.23), i.e., also for a penalty function
of the form

  W(k^*, k) = \sum_{j=1}^{20} \Bigl( \sum_{i=1}^{20} a_{ij} (k_i^* - k_i) \Bigr)^2 .   (1.24)
Under the condition that the matrix of the coefficients a_{ij} is positive semi-
definite, the same strategy is suitable for any function of the form

  W(k^*, k) = \sum_{j=1}^{20} \sum_{i=1}^{20} a_{ij} (k_i^* - k_i)(k_j^* - k_j) .   (1.25)

It is because any positive semi-definite form (1.25) can be expressed in the
form (1.24).
We want to emphasise an important result. You need not care about the
specific coefficients a_{ij} for a task with a positive semi-definite quadratic penalty
function (1.25). Without knowing them you can create a strategy that minimises
the mathematical expectation of such a penalty, which is not fully known.
Let us return to the task of estimating the sum \sum_{i=1}^{20} k_i that we started
with. We can see now that for certain penalty functions the task can be
solved more easily than you solved it earlier. Namely, for the optimal estimate of the
sum d = \sum_{i=1}^{20} k_i it is not necessary to calculate the probability F(d) for all
values of the sum.
You are not the only one of this opinion. But we would like to warn you against
being taken in by its beauty and using the quadratic penalty function where it
is not appropriate. There are many such cases as well.
We can finish the analysis of the Bayesian task and proceed to the next
lecture which will be devoted to non-Bayesian statistical decision making.
December 1996.
3. The object state is not random, and that is why the a priori probabilities
   p_K(k), k ∈ K, do not exist; thus it is impossible to discover them by an
   arbitrarily detailed exploration of the object. Non-Bayesian methods must
   be used in this situation. They are treated in this lecture. Let us illustrate
   such a situation with an example.
Example 2.1 Task not belonging to the Bayesian class. Let us assume that
x is a signal originating from an observed airplane. On the basis of the signal
x it is to be discovered whether the airplane is an allied one (k = 1) or an enemy one
(k = 2). The conditional probability p_{X|K}(x | k) can depend on the observation
x in a complicated manner. Nevertheless, it is natural to assume at least that
there exists a function p_{X|K}(x | k) which correctly describes the dependence of
the observation x on the situation k. As far as the a priori probabilities p_K(k)
are concerned, these are not known and cannot even be known in principle, because
it is impossible to say about any number a, 0 ≤ a ≤ 1, that a is the probability
of the occurrence of an enemy plane. In such a case the probabilities p_K(k) do not
exist, since the frequency of the experiment's result does not converge to any
number which we would be allowed to call a probability. In other words, k is not a
random event. △
One cannot speak about the probability of an event which is not random just
as one cannot speak either about the temperature of a sound or about the
sourness or bitterness of light. A property such as probability is simply not
defined on the set of non-random events. Application tasks in which it is
needed to estimate the value of a non-random variable do not belong to the
Bayesian tasks. Their formalisation needs a theoretical construction in which
the concept of the a priori probability does not arise at all.
Let us show a widespread pseudo-solution of an applied task that is similar to the
one mentioned in Example 2.1. If the a priori probabilities are unknown, the situation
is avoided by supposing that the a priori probabilities are the same for all possible
situations. In our case, this would mean that the occurrence of an enemy plane has
the same probability as the occurrence of an allied one. It is clear that this does
not correspond to reality even if we assume that the occurrence of a plane
is a random event. Logical reasons which would support such an assumption
are difficult to find. As a rule, logical arguments are quickly substituted by
a pseudo-argument making a reference to some renowned person. In the
given case this would be, e.g., to C. Shannon, thanks to the generally known
property that the uniform probability distribution has the highest entropy. This
happens even though this result does not concern the studied problem in any way.
influences what the letter looks like, and it affects its recognition too. As can
be seen, the third parameter z ∈ {1, 2, 3}, a so-called unobservable intervention,
was added to the observable parameter x ∈ X and the hidden parameter k ∈ K.
The goal of the task is to answer the following question for each picture x:
which letter is written in the picture? It is possible to speak about the penalty
function W(k, d) and about the a priori probabilities p_K(k) of the individual
letters, but it is not possible to talk about the conditional probabilities p_{X|K}(x | k)
in this application. The reason is that the appearance of the specific letter x
depends not only on the letter label but also on a non-random intervention,
i.e., on who wrote the letter. We can speak only about the conditional
probabilities p_{X|K,Z}(x | k, z), i.e., about how a specific character looks if it
was written by a certain person. If the intervention z were random and the
probability p_Z(z) were known for each z then it would be possible to speak
also about the probabilities p_{X|K}(x | k), because they could be calculated using the
formula
  p_{X|K}(x \mid k) = \sum_{z=1}^{3} p_Z(z)\, p_{X|K,Z}(x \mid k, z) .
But the preconditions for applying the algorithm do not provide any evidence for
assuming how often it will be necessary to recognise pictures written by this or
that person. It is not even excluded that, during the whole period of the algorithm's
application, only pictures written by one single writer will occur, and
it will be unknown by whom. Under such uncertain statistical conditions an
algorithm ought to be created that secures the required recognition quality
of the pictures independently of who wrote the letter. This means that the
task should be formulated in such a way that the concept of the a priori probabilities
p_Z(z) of the variable z is not used, because this variable is not random and
such a feature as probability is not defined for it.
Let us introduce here the most famous formulations of non-Bayesian tasks and
their solutions. In addition, we introduce new modifications of these known
tasks. We shall see that, in spite of their variety, the whole class of non-Bayesian
tasks has common features. These allow us to analyse and solve them by the
same procedure. Later on, we shall see that there is no crucial gap between
the class of Bayesian tasks and all the non-Bayesian ones. We shall show that
the strategy solving any non-Bayesian task can be realised, similarly to that
for the Bayesian tasks, in the space of probabilities. The strategy divides the
space of probabilities into convex cones in the same manner as in the Bayesian
tasks. This means that their solution, in spite of all the basic differences between
Bayesian and non-Bayesian tasks, is found within the same set of strategies.
k, to which the object belongs. There are two possible states: the normal
one, k = 1, and the dangerous one, k = 2. The set of states K is thus {1, 2}.
The probability distributions are known and defined by the set of conditional
probabilities p_{X|K}(x | k), x ∈ X, k ∈ K.
The goal of recognition is to decide, according to the observed feature x, whether
the object is in the normal or the dangerous state. The set X is to be divided into
two subsets X_1 and X_2 such that for an observation x ∈ X_1 the normal state
is decided and for an observation x ∈ X_2 the dangerous state.
In view of the fact that some values of the feature x can occur both in the
normal and in the dangerous state of the object, there is no faultless strategy,
and every strategy is characterised by two numbers. The first number is the probability of
the event that the normal state will be recognised as a dangerous one. Such
an event is called a false alarm or a false positive. The second number is the
probability of the event that the dangerous state will be recognised as a normal
one, and it is called an overlooked danger or a false negative. These two faults
are sometimes called the errors of the first and second type, respectively. The
conditional probability of the false alarm is given by the sum \sum_{x \in X_2} p_{X|K}(x \mid 1)
and the conditional probability of the overlooked danger is \sum_{x \in X_1} p_{X|K}(x \mid 2).
Such a strategy is sought in the Neyman–Pearson task [Neyman and Pearson,
1928; Neyman and Pearson, 1933] (we shall call it simply the Neyman task here-
after), i.e., a decomposition of the set X into two subsets X_1 ⊂ X and X_2 ⊂ X,
X_1 ∩ X_2 = ∅, such that, firstly, the conditional probability of the overlooked danger
is not larger than a predefined value ε,

  \sum_{x \in X_1} p_{X|K}(x \mid 2) \le \varepsilon .   (2.4)
The fundamental result of Neyman–Pearson states that for the sets X_1 and X_2
which solve the given optimisation task there exists such a threshold value θ
that each observation x ∈ X, for which the likelihood ratio

  \frac{p_{X|K}(x \mid 1)}{p_{X|K}(x \mid 2)}

is smaller than θ, belongs to the set X_2. And also, vice versa, the assignment
x ∈ X_1 is made for each x ∈ X with a likelihood ratio larger than θ. Let us
30 Lecture 2: Non-Bayesian statistical decision making
put the case of equality aside for pragmatic reasons. This case occurs so rarely
in practical situations that we do not need to deal with it. A theoretical
analysis might be interesting, but it is complicated and not necessary for our
purpose.
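For a finite observation set the threshold construction can be sketched in a few lines. The distributions below are invented for illustration, and the equality case is set aside, as agreed; admitting observations into X_1 in the order of decreasing likelihood ratio realises the comparison with a threshold θ.

```python
def neyman_strategy(p1, p2, eps):
    """Sketch of the Neyman-Pearson construction on a finite X:
    admit observations into X1 (decide 'normal') in the order of
    decreasing likelihood ratio p1/p2 while the overlooked danger
    sum_{x in X1} p2(x) stays within eps; the rest forms X2."""
    order = sorted(p1, key=lambda x: p1[x] / p2[x], reverse=True)
    X1, danger = set(), 0.0
    for x in order:
        if danger + p2[x] <= eps + 1e-12:
            X1.add(x)
            danger += p2[x]
    return X1, set(p1) - X1

p1 = {0: 0.7, 1: 0.2, 2: 0.08, 3: 0.02}   # hypothetical p_{X|K}(x | 1)
p2 = {0: 0.05, 1: 0.15, 2: 0.3, 3: 0.5}   # hypothetical p_{X|K}(x | 2)
X1, X2 = neyman_strategy(p1, p2, eps=0.2)
false_alarm = sum(p1[x] for x in X2)
print(sorted(X1), sorted(X2), false_alarm)
```

Here the overlooked danger is held at 0.2 while the false alarm probability, the quantity the Neyman task minimises, comes out as 0.10.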
The known solution of the Neyman task is not easy to prove; that is, it is not
easy to show that it follows from the formulation given by relations (2.2)–(2.4).
Therefore, presumably, it is not knowledge but rather a belief based on
Neyman's and Pearson's authority. This belief suffices when exactly the Neyman
task is to be solved. As soon as a task is met which differs from the Neyman
task in some trifle, a mere belief is no longer sufficient.
For instance, let us have a look at the following tiny modification of the Neyman
task. Let the number of states of the recognised object be not two but three. For
each state k ∈ {1, 2, 3} and for each observation x the conditional probability
p_{X|K}(x | k) is determined. Only one state, k = 1, is normal, whereas the other
two states are dangerous. In the same way as in the original Neyman task,
the aim is to find a reasonable strategy which has to determine for each
observation whether the state is the normal one or one of the two dangerous ones.
Several suggestions how to solve this task occur at a folklore level. For
instance, the two likelihood ratios

  \gamma_{12}(x) = \frac{p_{X|K}(x \mid 1)}{p_{X|K}(x \mid 2)} , \quad
  \gamma_{13}(x) = \frac{p_{X|K}(x \mid 1)}{p_{X|K}(x \mid 3)}

are computed and two threshold values θ_{12} and θ_{13} are set. The situation is
considered normal if γ_{12} > θ_{12} and γ_{13} > θ_{13}. Other suggestions are based on
the effort to invent such a generalisation of the likelihood ratio concept as would
suit even the case in which it concerns the 'ratio' not of two but of three
quantities, or other similar figments. Then it is decided for the normal or the
dangerous state by comparing the mythical 'ratio' with a certain threshold value.
Suggestions of this sort demonstrate the effort to find the algorithm
of the solution at once, without formulating the task which that algorithm is to
solve. That is why such proposals are not convincing enough. Of course, such
suggestions are not supported by Neyman's authority, as he was not interested
in this modification at all.
In the effort to manage a task, even if it is merely a slight generalisation
of the Neyman task, it is not possible to start with a direct endeavour to adapt
Neyman's strategy to the slightly modified task. It is correct to begin with the
formulation of the corresponding generalised task and then pass the whole way
from the task formulation to the algorithm, similarly as Neyman did
with his task.
2.2 Formulation of the known and new non-Bayesian tasks 31
under the conditions

  X_1 \cap X_{23} = \emptyset ,   (2.8)
  X_1 \cup X_{23} = X .   (2.9)
The formulated optimisation task will be thoroughly analysed later, when the
other non-Bayesian tasks have been formulated too. In addition, it will be seen
that the whole series of non-Bayesian tasks can be solved in a single constructive
framework.
would be evaluated by two tests: the preliminary test and the final one. The
customer himself would perform the preliminary test and would check what the
probability of a wrong decision w(k) is for each state k. The customer selects
the worst state k* = argmax_{k ∈ K} w(k). In the final test, only those objects are
checked that are in the worst state.
The result of the final test will be written in the protocol, and the final
evaluation depends on the protocol's content. The algorithm designer aims to
achieve the best result in the final test.
It is known for the task with two states, as for the Neyman task, that
the strategy solving the minimax problem is based on the comparison of the
likelihood ratio with some threshold value. Similarly as in the Neyman task, this
is more a belief than knowledge, because hardly anybody is able to derive the
solution of this minimax problem. That is why the solution of the problem has
not been widely known for the more general case, i.e., for an arbitrary number
of object states.
For such strategies the requirements w(1) ≤ ε and w(2) ≤ ε are not contradic-
tory for an arbitrary non-negative value ε, because the strategy X_0 = X, X_1 = ∅,
X_2 = ∅ belongs to the class of allowed strategies too. Each strategy meeting
the requirements w(1) ≤ ε and w(2) ≤ ε is, moreover, characterised by how
often the strategy is reluctant to decide, i.e., by the number max(χ(1), χ(2)).
The Wald task seeks, among the strategies satisfying the requirements w(1) ≤ ε,
w(2) ≤ ε, a strategy which minimises the value max(χ(1), χ(2)). It is known
that the solution of this task is based on the calculation of the likelihood ratio

  \gamma(x) = \frac{p_{X|K}(x \mid 1)}{p_{X|K}(x \mid 2)} .
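A strategy of this kind compares γ(x) with two thresholds and refuses to decide in between. The sketch below uses invented distributions and hand-picked thresholds; in the actual Wald task the thresholds would be derived from the constraint ε.

```python
def wald_strategy(p1, p2, theta_low, theta_high):
    """Likelihood-ratio strategy with an abstention region: decide
    state 1 for a large ratio, state 2 for a small one, and refuse
    to decide (x goes to X0) in between."""
    X1, X2, X0 = set(), set(), set()
    for x in p1:
        gamma = p1[x] / p2[x]
        if gamma >= theta_high:
            X1.add(x)
        elif gamma <= theta_low:
            X2.add(x)
        else:
            X0.add(x)
    return X1, X2, X0

p1 = {0: 0.7, 1: 0.2, 2: 0.08, 3: 0.02}   # hypothetical p_{X|K}(x | 1)
p2 = {0: 0.05, 1: 0.15, 2: 0.3, 3: 0.5}   # hypothetical p_{X|K}(x | 2)
X1, X2, X0 = wald_strategy(p1, p2, theta_low=0.5, theta_high=2.0)
w1 = sum(p1[x] for x in X2)    # wrong decision probability in state 1
chi1 = sum(p1[x] for x in X0)  # refusal-to-decide probability in state 1
print(sorted(X1), sorted(X0), sorted(X2))
print(w1, chi1)
```

Widening the abstention region lowers the wrong-decision probabilities w(k) at the price of larger refusal rates χ(k), which is exactly the trade-off the Wald task optimises.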
  (X^*(k),\, k \in K) = \operatorname*{argmin}_{(X(k),\, k \in K)} \max_{z \in Z}
  \sum_{k \in K} p_K(k) \sum_{x \notin X(k)} p_{X|K,Z}(x \mid k, z) ,
and the task will be formulated as a search for the best strategy in this sense,
i.e., as a search for the decomposition

  (X^*(k),\, k \in K) = \operatorname*{argmin}_{(X(k),\, k \in K)} \max_{k \in K} \max_{z \in Z}
  \sum_{x \notin X(k)} p_{X|K,Z}(x \mid k, z) .   (2.10)
  f(x, y) = \sum_{i \in I} a_i x_i + \sum_{j \in J} y_j \Bigl( c_j - \sum_{i \in I} b_{ij} x_i \Bigr) .
The point x' ∈ X will be called an acceptable one with respect to the function f
if the function f(x', y) of one variable y is bounded from above on the set Y.
  \psi(y) = \min_{x \in X} f(x, y) .

The dual task checks whether Ȳ ≠ ∅ is satisfied. If the function ψ is bounded from
above on Y then such a point y* ∈ Y is looked for that maximises ψ. In this case
the dual linear programming task is called solvable.
The relation between the primal and dual linear programming tasks is given
in the following theorem.
Theorem 2.1 The first duality theorem, also the Kuhn–Tucker theorem. If
the primal linear programming task is solvable then the dual linear programming
task is solvable as well. Moreover, the following equation holds:

  \min_{x \in \bar{X}} \max_{y \in Y} f(x, y) = \max_{y \in \bar{Y}} \min_{x \in X} f(x, y) .

Proof. The proof is not short and can be found in books on mathematical
programming in different formulations. The proof in the form matching our
explanation is in [Zuchovickij and Avdejeva, 1967]. ∎
The set X̄ and the function φ: X̄ → ℝ can be expressed explicitly, as shown in
the following lemma.

Lemma 2.1 Canonical form of the linear programming task. The set X̄ is
the set of solutions of the system of equations and inequalities

  \sum_{i \in I} b_{ij} x_i \ge c_j , \quad j \in J^+ ,   (2.12)

  \sum_{i \in I} b_{ij} x_i = c_j , \quad j \in J^0 ,   (2.13)

  x_i \ge 0 , \quad i \in I^+ ,

and on this set there holds

  \varphi(x) = \max_{y \in Y} f(x, y) = \sum_{i \in I} a_i x_i .   (2.14)
Let us prove first that if some x' does not satisfy the system of inequalities (2.12)
and equations (2.13) then x' ∉ X̄; in other words, the function f(x', y) of one
variable y is not bounded from above.
Let us assume that for such an x' one of the inequalities from the sys-
tem (2.12), say the j'-th one, does not hold. The function f can then attain an
arbitrarily large value. For this to happen it suffices that the coordinate y_{j'} is
large enough. There is nothing that can prevent the growth of the coordinate
y_{j'}, since this coordinate is limited only by the property that it cannot
be negative.
Let us assume now that for some j'' ∈ J^0 the equation from the system (2.13)
is not satisfied. In such a case the function f can attain an arbitrarily large value
again. It suffices if the absolute value of y_{j''} is large enough. The coordinate y_{j''}
itself can be either positive or negative, depending on the difference
c_{j''} - \sum_{i \in I} b_{ij''} x_i' being positive or negative. We can thus see that any
x ∈ X̄ complies with the conditions (2.12) and (2.13).
Let us demonstrate furthermore that for each x' satisfying the relations
(2.12) and (2.13), the function f(x', y) is bounded from above on the set Y, i.e.,
x' ∈ X̄.
The function value f(x', y) comprises three additive terms, cf. (2.15). The
first term is independent of y. The second term is independent of y too, since
the difference c_j - \sum_{i \in I} b_{ij} x_i' is zero for all j ∈ J^0, as follows from the
condition (2.13). The third term is not positive, as none of the additive terms
constituting it is positive. Indeed, there holds y_j ≥ 0 and c_j - \sum_{i \in I} b_{ij} x_i' ≤ 0
for any j ∈ J^+, which follows from the condition (2.12). Hence the
third term in the equation (2.15) is bounded from above and this upper bound is zero.
It follows, too, that the whole expression (2.15) is bounded from above as well, and
its upper bound is \sum_{i \in I} a_i x_i'. In this way it is proved that the set X̄ is identical
with the set of solutions of the system of inequalities (2.12) and equalities (2.13).
As far as the proof of the equation (2.14) is concerned, it is obvious that the upper
bound \sum_{i \in I} a_i x_i' of the function f(x', y) is attained on the set Y, because
f(x', 0) = \sum_{i \in I} a_i x_i' and 0 ∈ Y. Then \max_{y \in Y} f(x', y) = \sum_{i \in I} a_i x_i' and the
equation (2.14) is satisfied too. ∎
With the proved lemma in mind, the primal linear programming task can be
expressed in the following canonical form:

  \min_{x} \sum_{i \in I} a_i x_i   (2.16)

under the conditions

  \sum_{i \in I} b_{ij} x_i \ge c_j , \quad j \in J^+ ,   (2.17)

  \sum_{i \in I} b_{ij} x_i = c_j , \quad j \in J^0 ,   (2.18)

  x_i \ge 0 , \quad i \in I^+ .   (2.19)
The properties of the dual task can be proved in a similar manner, repeating
the thoughts that were used when proving Lemma 2.1. It can be shown that
the dual linear programming task is

  \max_{y} \sum_{j \in J} c_j y_j   (2.20)

under the conditions

  \sum_{j \in J} b_{ij} y_j \le a_i , \quad i \in I^+ ,   (2.21)

  \sum_{j \in J} b_{ij} y_j = a_i , \quad i \in I \setminus I^+ ,   (2.22)

  y_j \ge 0 , \quad j \in J^+ .   (2.23)
2. Each constraint in the primal task corresponds to a variable in the dual
   task, and each variable in the primal task matches a constraint in the dual
   task.
3. The constraints in the primal task are either linear equations or linear inequal-
   ities of the form ≥. The constraints in the dual task are either linear equations
   or linear inequalities of the form ≤.
4. The values of some variables in both the primal and the dual task can be positive
   or negative. These variables are called free variables. The other
   variables in the primal and dual tasks are not allowed to be negative.
   Such variables are called non-negative variables.
5. To each equality among the primal task constraints there corresponds a free
   variable in the dual task. To each inequality among the primal task constraints
   there corresponds a non-negative variable in the dual task. To each free variable
   in the primal task there corresponds a constraint of the dual task in the equality
   form.
6. The coefficients a_i, which express the minimised function in the primal task, are
   present as threshold values on the right-hand side of the equalities or inequalities
   in the dual task. The thresholds c_j appearing on the right-hand side of the equalities
   or inequalities of the primal task appear as coefficients of the linear function
   being maximised in the dual task.
7. The coefficient matrix in the system of equalities or inequalities of the primal
task corresponds to the transposed matrix of coefficients of the system of
equalities and inequalities in the dual task.
In our exposition we can proceed to the next theorem which is particularly
important when analysing the pair of dual linear programming tasks. Namely,
this theorem will help us when analysing non-Bayesian decision tasks that are
formulated in the form of linear programming tasks.
Theorem 2.2 Second duality theorem, also called the theorem on mutual non-
movability. Let the solution of the primal linear programming task be x* = (x_i^*,
i ∈ I) and let the solution of the dual task be y* = (y_j^*, j ∈ J).
If some coordinate x_i^*, i ∈ I^+, of the point x* is not equal to zero, then the
corresponding constraint of the dual task is satisfied as the equation
\sum_{j \in J} b_{ij} y_j^* = a_i (although it was an inequality in the task formulation).
If the j-th constraint in the primal task is satisfied at the point x* as a strict
inequality, i.e., if \sum_{i \in I} b_{ij} x_i^* > c_j holds, then the corresponding value y_j^* in the
dual task is equal to zero. ∎
Proof. The theorem actually says that for all i ∈ I there holds

  x_i^* \Bigl( a_i - \sum_{j \in J} b_{ij} y_j^* \Bigr) = 0 ,   (2.25)

and for all j ∈ J there holds

  y_j^* \Bigl( \sum_{i \in I} b_{ij} x_i^* - c_j \Bigr) = 0 .   (2.26)
The equation

  \sum_{i \in I} x_i^* \sum_{j \in J} b_{ij} y_j^* = \sum_{j \in J} y_j^* \sum_{i \in I} b_{ij} x_i^*

is apparently valid.
The first duality theorem states that \sum_{i \in I} a_i x_i^* = \sum_{j \in J} c_j y_j^*, which implies

  \sum_{i \in I} x_i^* \Bigl( a_i - \sum_{j \in J} b_{ij} y_j^* \Bigr)
  = \sum_{j \in J} y_j^* \Bigl( c_j - \sum_{i \in I} b_{ij} x_i^* \Bigr) .   (2.28)

As x*, respectively y*, are solutions of the primal, respectively the dual task, the
constraints (2.17)–(2.19) and (2.21)–(2.23) are satisfied for them, and it implies
that every additive term on the left-hand side of (2.28) is non-negative, while
every additive term on the right-hand side is non-positive.
Equation (2.28) thus states that a sum of non-negative additive terms is the same
as a sum of non-positive additive terms. This is possible only in the case in
which all the additive terms equal zero. The validity of the equations (2.25) and (2.26)
follows from that. ∎
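Both duality theorems can be seen at work on a tiny, hand-solved pair of tasks (our own toy example, not one from the lecture): minimise 3x_1 + 2x_2 subject to x_1 + x_2 ≥ 2, x_1 ≥ 0, x_2 ≥ 0, whose dual is to maximise 2y subject to y ≤ 3, y ≤ 2, y ≥ 0.

```python
# hand-computed optima of the toy primal/dual pair
x_star = (0.0, 2.0)   # primal: all weight on the cheaper variable x2
y_star = 2.0          # dual: y is capped by the tighter constraint y <= 2

primal_value = 3 * x_star[0] + 2 * x_star[1]
dual_value = 2 * y_star

# first duality theorem: the optimal values coincide
assert primal_value == dual_value == 4.0

# second duality theorem, cf. (2.25) and (2.26):
# x2* > 0 forces its dual constraint to equality, y* = 2,
# and y* > 0 forces the primal constraint to equality, x1* + x2* = 2
assert x_star[1] * (2 - y_star) == 0.0
assert y_star * (x_star[0] + x_star[1] - 2) == 0.0
print(primal_value, dual_value)
```

Note also that x_1^* = 0 pairs with the dual constraint y ≤ 3, which indeed stays slack, exactly the pattern the theorem describes.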
When non-integer functions are allowed, the set of all possible functions
α becomes more extensive. Any function α can then be understood as a ran-
domised strategy. The value α(x, k) is the probability of the event that, having
observed x, it is decided in favour of the state k. After this generalisation all the
non-Bayesian tasks formulated earlier can be expressed as particular cases of
linear programming and can be analysed in a single formal framework.
We will analyse common properties of non-Bayesian tasks keeping in mind
the properties of the linear optimisation tasks. The most important prop-
erty claims that the solution of any non-Bayesian task differs only fractionally
from the Bayesian strategy. Saying it more precisely, the strategy solving the
arbitrary non-Bayesian and Bayesian tasks is implementable in the space of
probabilities. Each decision corresponds to a convex cone in the space of prob-
abilities. Deterministic decisions match the inner points of the cone. The
random decisions need not appear always, and if they occur then they occur in
points which lie at the boundary of the convex cone.
α(x, 1) ≥ 0 ,  x ∈ X , (d)
In the optimisation task (2.33) the variables are the values α(x, 1), α(x, 2) for
all x ∈ X. The constants are the values ε and p_X|K(x | k) for all x ∈ X, k = 1, 2.
Let us rewrite the task to convert the inequality (2.33b) into a standard form
with the relation ≥, as is required in the primal linear programming task.
Let us take into account that the line (2.33c) represents not just one but |X|
constraints. There is a dual variable t(x) corresponding to each of these |X|
constraints. We obtain the primal task

min ∑_{x∈X} α(x, 2) p_X|K(x | 1) ,   (a)

τ:  −∑_{x∈X} α(x, 1) p_X|K(x | 2) ≥ −ε ,   (b)
                                             (2.34)
t(x):  α(x, 1) + α(x, 2) = 1 ,  x ∈ X ,   (c)

α(x, 1) ≥ 0 ,  x ∈ X ,   (d)
α(x, 2) ≥ 0 ,  x ∈ X .   (e)
The corresponding dual task is

max ( ∑_{x∈X} t(x) − ε τ ) ,   (a)

α(x, 1):  t(x) − τ p_X|K(x | 2) ≤ 0 ,  x ∈ X ,   (b)
                                                   (2.35)
α(x, 2):  t(x) ≤ p_X|K(x | 1) ,  x ∈ X ,   (c)

τ ≥ 0 .   (d)
We will explain the mechanism of deriving the dual task more thoroughly
for this first non-Bayesian task analysed; the following tasks will be described
more briefly. The line (2.35a) exhibits a linear function depending on the dual
variables τ and t(x). Each variable t(x), x ∈ X, is multiplied by a unit coefficient,
as the number 1 appears on the right-hand side of the constraint (2.34c)
of the primal task to which the dual variable t(x) corresponds. The variable τ
is multiplied by −ε because the threshold −ε occurs on the right-hand side of
the inequality (2.34b).
The line (2.35b) specifies |X| constraints. Each of them corresponds to a
variable α(x, 1) in the primal task. The constraints are expressed as inequalities
because α(x, 1) is non-negative, cf. the constraint (2.34d). The
value 0 is on the right-hand side of the constraint because the variable α(x, 1) is not
present in the function (2.34a) which is to be minimised in the primal task;
it is the same as if the variable were present and multiplied by the coefficient
0. The left-hand side of the constraints is composed of two additive terms
since the variable α(x, 1) occurs in only two constraints of the primal task,
i.e., in one constraint from the group (2.34c) and in the constraint (2.34b).
The variable t(x) in the constraint (2.35b) is multiplied by 1 because
α(x, 1) in the constraint (2.34c) is also multiplied by 1. The variable τ in (2.35b) is
multiplied by −p_X|K(x | 2) because the variable α(x, 1) is multiplied by −p_X|K(x | 2)
in the constraint (2.34b).
The line (2.35c) specifies |X| constraints corresponding to the group of variables
α(x, 2). The probability p_X|K(x | 1) is on the right-hand side of the constraints
since this coefficient multiplies the variable α(x, 2) in the linear function
(2.34a) which is minimised in the primal task. There is a single variable t(x) on
the left-hand side of the constraint (2.35c) since the variable α(x, 2) occurs in only
a single constraint (2.34c) of the primal task. The variable t(x) in (2.35c)
is multiplied by 1 since the variable α(x, 2) is multiplied by 1 in the constraint
(2.34c) too. The constraint (2.35c) is an inequality because the variable α(x, 2)
in the primal task is defined as a non-negative variable, cf. the constraint
(2.34e).
The constraint (2.35d) requires the variable τ to be non-negative, as it corresponds
in the primal task to the constraint (2.34b) which is expressed as
an inequality. The dual variables t(x), x ∈ X, can be both positive and negative,
because the constraints matching them in the primal task are expressed as
equalities.
44 Lecture 2: Non-Bayesian statistical decision making
Only a reader not confident in transforming primal tasks of linear programming
into dual ones is likely to need this explanation; these transformations
are more or less automatic. Here, the primal task (2.34) was transformed into
the dual task (2.35). The explanations are superfluous with respect to the
proof of the Neyman task: in fact, the pair of tasks (2.34) and (2.35) is simply
a pair of dual linear programming tasks.
Thanks to that, the Neyman strategy can be found from the Second
Duality Theorem (Theorem 2.2) by the following simple consideration.
The task (2.35) cannot be solved by values of τ and t(x), x ∈ X, for
which both constraints (2.35b) and (2.35c) are satisfied as strict inequalities.
Indeed, by the Second Duality Theorem, α(x, 1) = α(x, 2) = 0 would then have to
hold, which contradicts the constraint (2.34c). Thus for each x ∈ X, at least one
of the inequalities (2.35b) and (2.35c) must be satisfied as an equality.
This means that
then t(x) < p_X|K(x | 1). As the inequality (2.35c) is satisfied strictly, α(x, 2) =
0, α(x, 1) = 1, and the state k is assessed as a normal one. The conditions
(2.36) and (2.37) can be expressed in the well known form in which the likelihood
ratio

γ(x) = p_X|K(x | 1) / p_X|K(x | 2)   (2.38)

is calculated and compared with the non-negative threshold value τ.
We have shown that the Neyman task, in the form of the pair of dual tasks (2.34)
and (2.35), can be expressed quite briefly and solved in a transparent way. This
brevity rests on the theory of dual linear programming tasks, in the given
case on the Second Duality Theorem, which helps to solve not only the
Neyman task in an easier way, but other non-Bayesian tasks as well.
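The resulting Neyman strategy is simple to implement once the threshold is known. A minimal sketch follows; the two conditional distributions and the threshold τ are invented illustrative numbers, not values from the lecture.

```python
# Sketch of the Neyman strategy (2.36)-(2.38): compute the likelihood
# ratio gamma(x) = p(x|1)/p(x|2) and compare it with the threshold tau.
p1 = {'a': 0.7, 'b': 0.2, 'c': 0.1}   # p_X|K(x|1), normal state (invented)
p2 = {'a': 0.1, 'b': 0.3, 'c': 0.6}   # p_X|K(x|2), dangerous state (invented)
tau = 1.5                             # non-negative threshold (invented)

def decide(x):
    """Return 1 (normal) if gamma(x) > tau, else 2 (dangerous)."""
    gamma = p1[x] / p2[x]
    return 1 if gamma > tau else 2

decisions = {x: decide(x) for x in p1}
```

For these numbers the observation 'a' (ratio 7) is assessed as normal, while 'b' and 'c' fall below the threshold and are assessed as dangerous.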
The expression (2.39) represents the probability of a false alarm. The condition
(2.40) means that the probability of overlooking the dangerous state k = 2 has
to be small, and the condition (2.41) requires the same for the state k = 3.
The constraints (2.42)-(2.43) are the standard conditions every strategy has to meet.
We now rewrite the conditions as a pair of dual tasks.
Primal task:

min ∑_{x∈X} α23(x) p_X|K(x | 1) ,

τ2:  −∑_{x∈X} α1(x) p_X|K(x | 2) ≥ −ε ,

τ3:  −∑_{x∈X} α1(x) p_X|K(x | 3) ≥ −ε ,
                                          (2.44)
t(x):  α1(x) + α23(x) = 1 ,  x ∈ X ,

α1(x) ≥ 0 ,  α23(x) ≥ 0 ,  x ∈ X .

Dual task:

max ( ∑_{x∈X} t(x) − ε (τ2 + τ3) ) ,

α1(x):  t(x) − τ2 p_X|K(x | 2) − τ3 p_X|K(x | 3) ≤ 0 ,  x ∈ X ,  (a)
                                                                   (2.45)
α23(x):  t(x) ≤ p_X|K(x | 1) ,  x ∈ X ,  (b)

τ2 ≥ 0 ,  τ3 ≥ 0 .
From the conditions (2.45a), (2.45b) and the fact that α1(x) and α23(x) cannot
equal zero simultaneously, it follows that for each x ∈ X the variable t(x)
must satisfy

t(x) = min ( p_X|K(x | 1) ,  τ2 p_X|K(x | 2) + τ3 p_X|K(x | 3) ) .   (2.46)

If

p_X|K(x | 1) < τ2 p_X|K(x | 2) + τ3 p_X|K(x | 3)   (2.47)

then α1(x) must equal 0, α23(x) must equal 1, and x signifies a dangerous state.
If

p_X|K(x | 1) > τ2 p_X|K(x | 2) + τ3 p_X|K(x | 3)   (2.48)

then x is a sign of the normal state. The strategy solving this task thus has the
following form: for certain non-negative numbers τ2 and τ3 the likelihood ratio

γ(x) = p_X|K(x | 1) / ( τ2 p_X|K(x | 2) + τ3 p_X|K(x | 3) )   (2.49)

is computed, and it is decided for the normal or the dangerous state according to
whether the likelihood ratio is greater or smaller than 1, respectively.
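The strategy (2.49) can be sketched in a few lines; the distributions and the weights τ2, τ3 below are invented for illustration (in a real task they follow from the dual task (2.45)).

```python
# Sketch of the strategy (2.49) for one normal and two dangerous states:
# gamma(x) = p(x|1) / (tau2*p(x|2) + tau3*p(x|3)) is compared with 1.
p = {1: {'a': 0.8, 'b': 0.1},   # p_X|K(x|k), k = 1 (normal)   -- invented
     2: {'a': 0.1, 'b': 0.5},   # k = 2 (dangerous)            -- invented
     3: {'a': 0.1, 'b': 0.4}}   # k = 3 (dangerous)            -- invented
tau2, tau3 = 0.5, 0.5           # invented non-negative weights

def decide(x):
    gamma = p[1][x] / (tau2 * p[2][x] + tau3 * p[3][x])
    return 'normal' if gamma > 1 else 'dangerous'

labels = {x: decide(x) for x in p[1]}
```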
min c ,

c − f_j(x) ≥ 0 ,  j ∈ J .   (2.51)

This procedure is demonstrated in Fig. 2.1 for the case J = {1, 2, 3}. The
shaded area in the figure shows the set of pairs (x, c) which satisfy the conditions
c − f1(x) ≥ 0, c − f2(x) ≥ 0, c − f3(x) ≥ 0. The task (2.51) requires
that a point with the minimal coordinate c be found in this area.

Figure 2.1 Minimax task for three functions f.
2.4 The solution of non-Bayesian tasks using duality theorems 47
Apparently, it is the point denoted by a filled circle in Fig. 2.1. The function
max (f1(x), f2(x), f3(x)) is shown as a bold curve. The task (2.50) requires
that the point with the smallest coordinate c be found on the bold curve.
It is the same point, denoted by the filled circle.
Because of the equivalence of the tasks (2.50) and (2.51), the minimax task
formulated in Subsection 2.2.3 can be expressed as the following linear programming
task. The variables are α(x, k) and the auxiliary variable c,

min c ,

Thanks to (2.52b), the sum ∑_{k*≠k} α(x, k*) is equal to 1 − α(x, k), and therefore
the inequality (2.52a) can be rewritten as c + ∑_{x∈X} α(x, k) p_X|K(x | k) ≥ 1.
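The equivalence of (2.50) and (2.51) is easy to illustrate by brute force: minimising the pointwise maximum of a few functions gives the same point as the lowest admissible coordinate c. The three affine functions and the grid below are arbitrary choices for illustration.

```python
# Brute-force check of the reduction (2.50) -> (2.51): the minimiser of
# max_j f_j(x) is the lowest point c with c >= f_j(x) for all j.
fs = [lambda x: 2 - x,          # f_1  (invented example functions)
      lambda x: 0.5 * x,        # f_2
      lambda x: x - 1]          # f_3

grid = [i / 1000 for i in range(4001)]                 # x in [0, 4]
best_x = min(grid, key=lambda x: max(f(x) for f in fs))
c_star = max(f(best_x) for f in fs)                    # minimal feasible c
```

For these functions the optimum sits where f_1 and f_2 cross, at x = 4/3 with c* = 2/3.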
Our task will be expressed as a pair of dual tasks.
Primal task:

min c ,

Dual task:

τ(k) ≥ 0 ,  k ∈ K .
It follows from the condition (2.54a) and the requirement to obtain the largest
t(x) that for any x the variable t(x) is equal to the smallest of the corresponding
bounds; for every other k* the variable
α(x, k*) must equal zero. Furthermore, it follows that α(x, k*) can be non-zero
only when

k* = argmax_{k∈K} ( τ(k) p_X|K(x | k) ) .
Primal task:

min c ,

Dual task:

c:  q1 + q2 = 1 ,

From the conditions (2.56a), (2.56b), (2.56c) and the form of the function being
maximised in the dual task, it follows that t(x) has to be the smallest of the
three values τ1 p_X|K(x | 1), τ2 p_X|K(x | 2) and q1 p_X|K(x | 1) + q2 p_X|K(x | 2). Let
us study the three corresponding cases according to which of these three values
is the smallest one.
Case 1. If

τ1 p_X|K(x | 1) < τ2 p_X|K(x | 2) ,
τ1 p_X|K(x | 1) < q1 p_X|K(x | 1) + q2 p_X|K(x | 2)   (2.57)

then the strict inequalities (2.56a) and (2.56c) hold. It implies that
α1(x) = 0, α0(x) = 0 and consequently α2(x) = 1.

Case 2. If

τ2 p_X|K(x | 2) < τ1 p_X|K(x | 1) ,
τ2 p_X|K(x | 2) < q1 p_X|K(x | 1) + q2 p_X|K(x | 2)   (2.58)

then α1(x) = 1, because the strict inequalities (2.56b) and (2.56c) have
to hold, and therefore α2(x) = 0 and α0(x) = 0.
Case 3. Finally, if

q1 p_X|K(x | 1) + q2 p_X|K(x | 2) < τ1 p_X|K(x | 1) ,
q1 p_X|K(x | 1) + q2 p_X|K(x | 2) < τ2 p_X|K(x | 2)   (2.59)

then α0(x) = 1, i.e., no decision is made. After dividing by p_X|K(x | 2), each of
the conditions (2.57)-(2.59) states which of the three values

τ1 γ(x) ,   τ2 ,   q1 γ(x) + q2   (2.60)

is the smallest, where γ(x) is the likelihood ratio p_X|K(x | 1)/p_X|K(x | 2). Let
us draw in Fig. 2.2 how these three functions (2.60) depend on the likelihood
ratio γ(x). Two thresholds θ1, θ2 are represented on the horizontal axis. It
can be seen that the condition γ(x) < θ1 is equivalent to the condition (2.57).
The condition γ(x) > θ2 is the same as the condition (2.58), and finally, the
condition θ1 < γ(x) < θ2 corresponds to the condition (2.59). This means that
it is decided for the second state in the first case, for the first state in the second
case, and no decision is made in the third case. This is, namely, the solution
of the Wald task. Fig. 2.2 demonstrates an additional interesting property. It
can be seen that the subset X0 can be empty in some tasks. This means that in
some tasks the optimal strategy never answers not known, though such a response
is allowed.
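The two-threshold form of the Wald strategy for two states is a few lines of code; the thresholds θ1, θ2 and the test ratios below are invented values for illustration.

```python
# Sketch of the Wald strategy for two states: the likelihood ratio
# gamma(x) is compared with two thresholds theta1 < theta2; between
# them the answer is "not known" (decision 0).
theta1, theta2 = 0.5, 2.0       # invented thresholds

def wald_decide(gamma):
    if gamma < theta1:
        return 2                # decide for the second state
    if gamma > theta2:
        return 1                # decide for the first state
    return 0                    # not known

answers = [wald_decide(g) for g in (0.2, 1.0, 5.0)]
```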
The solution of the generalised Wald task in the case in which the number of
states is greater than two is not so illustrative. That might be the reason
why so many pseudo-solutions occur here, similar to those which spin around
the Neyman strategy. We will show a reasonable formulation and solution of this
generalised task in the following paragraph. We will see that even when it is
not possible to use the likelihood ratio, it is possible to master the task easily.
Primal task:

min c ,

Dual task:

max ( ∑_{x∈X} t(x) + (1 − ε) ∑_{k∈K} τ(k) ) ,

α(x, 0):  t(x) + ∑_{k∈K} τ(k) p_X|K(x | k) − ∑_{k∈K} q(k) p_X|K(x | k) ≤ 0 ,  x ∈ X ,
                                                                                (2.62)
α(x, k):  t(x) + τ(k) p_X|K(x | k) ≤ 0 ,  x ∈ X ,  k ∈ K ,

c:  ∑_{k∈K} q(k) = 1 ,

q(k) ≥ 0 ,  τ(k) ≥ 0 ,  k ∈ K .
It is obvious from the dual task that the quantity t(x) has to be equal to the
smallest of the following |K| + 1 values: the first |K| values are
−τ(k) p_X|K(x | k), and the (|K| + 1)-th value is ∑_{k∈K} (q(k) − τ(k)) p_X|K(x | k).
The smallest value determines which decision is chosen: α(x, k) = 1
for some k ∈ K, or α(x, 0) = 1. More precisely, this rule is as follows. The
following quantity has to be calculated,

max_{k∈K} ( τ(k) p_X|K(x | k) ) .

If

max_{k∈K} ( τ(k) p_X|K(x | k) ) < ∑_{k∈K} ( τ(k) − q(k) ) p_X|K(x | k)

then the decision is k* = 0. In the opposite case it is decided for
k* = argmax_{k∈K} ( τ(k) p_X|K(x | k) ).
The strategy solving the generalised Wald task is not as simple as in
the previous tasks. It would hardly be possible to guess the strategy
on the basis of mere intuition, without a thorough formulation of the task and
without its expression in the form of the pair of dual linear programming tasks
(2.61), (2.62), followed by a formal deduction.
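The decision rule just derived can be sketched as follows. The numbers τ(k), q(k) and the conditional probabilities are invented for illustration; in a real task they come from solving the dual task (2.62).

```python
# Sketch of the generalised Wald decision rule: answer "not known"
# (k* = 0) when max_k tau(k) p(x|k) < sum_k (tau(k) - q(k)) p(x|k),
# otherwise take the maximising k.
K = [1, 2]
tau = {1: 1.0, 2: 1.0}          # invented dual weights
p = {1: 0.45, 2: 0.55}          # p_X|K(x|k) for one observation (invented)

def decide(q):
    scores = {k: tau[k] * p[k] for k in K}
    rhs = sum((tau[k] - q[k]) * p[k] for k in K)
    if max(scores.values()) < rhs:
        return 0                # not known
    return max(scores, key=scores.get)

assert decide({1: 0.05, 2: 0.05}) == 0   # small q(k): "not known" wins
assert decide({1: 0.5, 2: 0.5}) == 2     # larger q(k): decide for k = 2
```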
= ∑_{k∈K} p_K(k) ( 1 − ∑_{x∈X} α(x, k) p_X|K,Z(x | k, z) )

= 1 − ∑_{k∈K} p_K(k) ∑_{x∈X} α(x, k) p_X|K,Z(x | k, z) .   (2.63)
min c ,

Dual task:

It results from the form of the maximised function in the dual task and from
the upper limit of the quantity t(x) that the optimum occurs when t(x) is the
minimal value out of the following |K| possible numbers,

argmax_{k∈K} ∑_{z∈Z} p_K(k) τ(z) p_X|K,Z(x | k, z) .
min c ,

α(x, k) ≥ 0 ,  k ∈ K ,  x ∈ X .

Dual task:

max ( ∑_{x∈X} t(x) + ∑_{z∈Z} ∑_{k∈K} τ(z, k) ) ,

α(x, k):  t(x) + ∑_{z∈Z} τ(z, k) p_X|K,Z(x | k, z) ≤ 0 ,  x ∈ X ,  k ∈ K ,
                                                                      (2.67)
c:  ∑_{z∈Z} ∑_{k∈K} τ(z, k) = 1 ,

τ(z, k) ≥ 0 ,  z ∈ Z ,  k ∈ K .

It follows from the above pair of tasks that if for some x ∈ X, some k* ∈ K and
all k ∈ K \ {k*} the following inequality is satisfied,

then α(x, k*) = 1.
2.5 Comments on non-Bayesian tasks
We have seen a variety of non-Bayesian tasks. We have introduced several
examples of these tasks, and it would be possible to continue. In spite of all
this variety, it can be seen that these tasks no longer appear as isolated
2.6 Discussion
Even though I studied all the formulated non-Bayesian tasks and their formal
solutions thoroughly, I cannot say that I mastered the explained material well
and understood the presented results in the same way as if I had discovered
them myself. I doubt that I would be able to solve a new practical task by
myself.
Nearly all the tasks formulated in the lecture are new to me. I have known only
about the Neyman-Pearson task so far, and I did not think much about the proof
of the strategy and its solution. An extensive and unfamiliar landscape of
non-Bayesian tasks unfolded in front of me. It was shown in the lecture that
all these tasks can be analysed and solved using the linear programming
machinery. But, I regret, I do not know this machinery.
Yes, of course I did. But it was in quite a different spirit from your lecture.
The main emphasis was placed on computational procedures optimising linear
functions under linear constraints. The focus was mainly on the simplex method,
which I understood quite well. The linear programming machinery now plays a
quite different role. It is not just a tool for programming optimisation
tasks, but a tool for theoretical analysis of a task with pencil and paper.
Such a use of linear programming is new to me.
Both options, but the second one a little more. You should be familiar with
the theoretical basis of mathematical programming (namely the theory, and not
merely a computational procedure) whether you are engaged in pattern recognition
or not. This knowledge is useful in any area of applied
informatics. We do not want to leave you standing in front of these problems alone.
We will continue in the direction of your remark. You have said that you would
understand the principle of dual tasks much better if something that follows
from the formal theorems were also something you knew without the duality
theorems. We can go through one example of that kind with you, but it will require
a bit of patience on your side. O.K.?
Figure 2.3 Representation of a quantity by a current: (a) the quantity x corresponds
to the current x sin t flowing from the point 1 to the point 2 of the circuit in the
case of sin t ≥ 0 as well as the current flowing from the point 2 to the point 1
for sin t ≤ 0. (b) The quantity x = −3 will correspond to the current −3 sin t
which means the current from the point 2 to the point 1 for sin t ≥ 0, and
conversely from the point 1 to the point 2 for sin t ≤ 0.
Similarly, we will represent some quantities with the help of a voltage, which is
the difference of electrical potentials between two points. E.g., the quantity y
between the points 1 and 2 of the circuit (cf. Fig. 2.3b), representing the potential
difference φ1 − φ2 between the points 1 and 2, is the voltage y sin t. The arrow
between the points 1 and 2 shows the voltage direction which is taken as positive,
i.e., y sin t is the difference φ1 − φ2 and not the other way round.
Figure 2.5 (a) Phase rectifier. (b) Possible implementation of a phase rectifier.
I must interrupt you here. Do I understand properly that the quantities x and
y are not represented with the help of alternating voltages and currents, but in
some other way? When I connect the rectifier (I assume it is an ordinary diode)
to the circuit in Fig. 2.5a, then the current in this branch cannot alternate. The
same must hold for the voltage across the diode.
You perhaps did not understand the explanation properly; it is good that you
interrupted us. You cannot realise the phase rectifier in Fig. 2.5a as an ordinary
diode. That is why we called it a phase rectifier. It is an idealised device such
that the current passing through it can only be x sin t, where x ≥ 0. The phase
rectifier can be implemented, e.g., by the circuit in Fig. 2.5b. A controlled
switch is connected to the point 1 which alternately connects the point 1 to
the point 3 when sin t ≥ 0 holds, and to the point 4 when sin t < 0 holds.
Is the phase rectifier in Fig. 2.5b implemented by means of ordinary diodes and
some switch?
Yes, it is.
Figure 2.6 A transformer diagram for varying numbers of turns in the coils (3, 3, 0,
−4 and 1) and directions of winding.
The number −4 of the fourth winding says that the winding direction is the opposite
of the first, second and fifth windings, whose direction is considered as positive.
Thanks to the interaction of the transformer windings with the shared magnetic
field in the ferromagnetic core, the currents x1, x2, ..., xm in the m coils must
satisfy the equation

∑_{i=1}^m x_i b_i = 0 ,

where b_i is the number of turns of the i-th winding and the sign + or −
agrees with the rule given above. The currents x1, x2, ..., x5 in the transformer
windings in Fig. 2.6 can thus reach only values satisfying the constraint

3x1 + 3x2 − 4x4 + x5 = 0 .
The second property of the transformer is that the voltages on the windings are
proportional to the numbers of turns. Do not forget that the number of turns
is considered either positive or negative according to the winding direction. In
Fig. 2.6, the voltages y1, y2, y3, y4 are uniquely constrained by the voltage y5,
i.e., y1 = 3y5, y2 = 3y5, y3 = 0, y4 = −4y5.
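Both transformer properties can be checked for the turn counts of Fig. 2.6; the particular current and voltage values below are one invented solution, not taken from the book.

```python
# The two transformer properties for the windings of Fig. 2.6: currents
# satisfy sum_i x_i b_i = 0, and voltages are proportional to the
# (signed) numbers of turns.
b = [3, 3, 0, -4, 1]            # turns; the sign encodes winding direction
x = [1.0, 1.0, 7.0, 2.0, 2.0]   # invented currents: 3 + 3 + 0 - 8 + 2 = 0
assert sum(xi * bi for xi, bi in zip(x, b)) == 0

y5 = 0.5                        # invented voltage on the one-turn winding
y = [bi * y5 for bi in b]       # y1 = 3*y5, y2 = 3*y5, y3 = 0, y4 = -4*y5
```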
We will need to know what is represented by the current source, the voltage
source, the phase rectifier and the transformer when considering the electrical
analogy of the dual linear programming tasks.

I think that I understand the components of the electrical circuit quite well
now.
Create a circuit from the given components with the currents x1 sin t, x2 sin t,
x3 sin t, x4 sin t, x5 sin t in its branches which comply with the following system
of equalities and inequalities,

x1 ≥ 0 ,  x2 ≥ 0 ,  x3 ≥ 0 ,   (2.68)

2x1 − 3x2 + 4x3 + 5x4 − 2x5 ≥ c1 ,
x1 + 2x2 + x3 + x5 ≥ c2 ,
−x1 + 3x3 ≥ c3 ,
x1 + x2 + x3 + x4 + x5 = c4 ,   (2.69)
−x1 + 2x2 + 3x3 + x4 − x5 = c5 ,
3x1 + 2x2 − 4x3 − 2x4 + 3x5 = c6 .
Figure 2.7 The equivalent electrical circuit for the linear programming task.
It was not easy for me to draw the required electrical circuit even though I am
familiar with electrical concepts. The most difficult part was to satisfy the three
conditions in the system (2.69) which have the form of inequalities. I succeeded
eventually, and the resulting circuit is drawn in Fig. 2.7. It is ensured in the
circuit that the currents x1, ..., x5 satisfy the conditions (2.68) and (2.69). I hope
that you expected a similar result, because the circuit is quite straightforward.
The circuit shape resembles the shape of the matrix of the constraints (2.69).
My circuit consists of six transformers which match the six constraints of the
system (2.69), and five branches whose currents correspond to the variables
x1, ..., x5. Each i-th branch, i = 1, ..., 5, winds around the core of the j-th
transformer, j = 1, 2, ..., 6, and the number of turns can be positive, negative
or zero. The number of turns corresponds to the coefficient b_ij by which the
variable x_i is multiplied in the j-th constraint of the system (2.69).
I ensured the conditions (2.68) easily by connecting phase rectifiers to
the first, second and third branches. Thanks to this, negative currents cannot occur
in these branches. I satisfied the three last conditions, those in the form of
equations in the system (2.69), by inserting on each j-th transformer, j = 4, 5, 6,
a coil which makes −1 turn around the particular core and into which the current c_j
flows from an ideal current source. Thanks to this and to the transformer
properties, the following equation is satisfied for j = 4, 5, 6,

∑_{i=1}^5 b_ij x_i − c_j = 0 ,

which corresponds to the last three constraints in the system (2.69). That is,
perhaps, all.
You did everything right, and you even got a little ahead. In Fig. 2.7 we
can see the voltage sources a1, ..., a5, which are in fact not needed to satisfy the
constraints (2.68) and (2.69). Why did you introduce them?
It was mere intuition which I cannot prove, but I can support it with some
general thoughts. It seems to me that when I introduced the voltage sources,
my diagram became not only a tool to satisfy the constraints (2.68) and (2.69), but
also a tool to minimise the sum ∑_{i=1}^5 a_i x_i on the set given by the systems
(2.68) and (2.69). Indeed, the sum ∑_{i=1}^5 a_i x_i corresponds to the power that
the voltage sources a1, ..., a5 obtain from the surroundings. But the voltage sources
a1, ..., a5 are sources of energy, not consumers of it. That is the reason why
these sources interact with the surroundings in such a way that they get rid of
energy as quickly as possible, i.e., they pass it to the surroundings with the highest
possible power. This means, so far only intuitively, that the currents that become
stable in the diagram in Fig. 2.7 are those that maximise the total power
−∑_{i=1}^5 a_i x_i with which the voltage sources are deprived of energy. It is clear
enough that the suggested diagram constitutes a physical model of the linear
programming task. Actually, an arbitrary linear programming task can be modelled
using this analogy. I do not claim that I can prove these thoughts.
I would like to proceed now to the main issue for which I needed the electrical
interpretation of linear programming tasks. First of all, I wanted to clarify
the relation between the pair of dual tasks. I have not observed anything like
this in the proposed diagram yet, even though it seems to me that I understand
the diagram quite well.
You will see it in a while. You analysed the properties of the currents x1, ..., x5
which the diagram can generate, and have proved (and it was not difficult) that
the currents satisfy the constraints (2.68) and (2.69). But besides the currents,
voltages are generated in the diagram too, and these are not only the voltages
a1, ..., a5 at the voltage sources but also the voltages at the current sources
c1, ..., c6.
I have got it! I am surprised that I did not notice it earlier. Indeed, these
voltages correspond to the variables in the task which is dual with respect to
the task minimising the sum ∑_{i=1}^5 a_i x_i under the constraints (2.68) and
(2.69). They are, namely, the dual variables!
Because the phase rectifiers are connected to the first, second and third
branches, the sum ∑_{j=1}^6 b_ij y_j of the voltages on the coils in the i-th branch,
i = 1, 2, 3, is not larger than the voltage a_i. For the first branch there holds

2y1 + y2 − y3 + y4 − y5 + 3y6 ≤ a1 .
Earlier, I applied additional coils to the first, second and third cores and
connected phase rectifiers to them. I did it to satisfy the first three conditions
(inequalities) in the system (2.69). I did not think of it at that time, but now I
see that the additional coils on the first, second and third transformers ensure
that the voltages y1, y2, y3 on the corresponding current sources cannot be negative.
x1 ≥ 0 ,  x2 ≥ 0 ,  x3 ≥ 0 ,

2x1 − 3x2 + 4x3 + 5x4 − 2x5 ≥ c1 ,
x1 + 2x2 + x3 + x5 ≥ c2 ,
−x1 + 3x3 ≥ c3 ,                          (2.70)
x1 + x2 + x3 + x4 + x5 = c4 ,
−x1 + 2x2 + 3x3 + x4 − x5 = c5 ,
3x1 + 2x2 − 4x3 − 2x4 + 3x5 = c6 ,

y1 ≥ 0 ,  y2 ≥ 0 ,  y3 ≥ 0 ,

2y1 + y2 − y3 + y4 − y5 + 3y6 ≤ a1 ,
−3y1 + 2y2 + y4 + 2y5 + 2y6 ≤ a2 ,
4y1 + y2 + 3y3 + y4 + 3y5 − 4y6 ≤ a3 ,    (2.71)
5y1 + y4 + y5 − 2y6 = a4 ,
−2y1 + y2 + y4 − y5 + 3y6 = a5 .
In this way I came to the idea that any physical system which is a model of
a linear programming task is also inevitably a model of the dual task. It seems
that I am slowly starting to understand dual tasks, but in spite of that there
are still more unclear than clear facts. I expressed my hypothesis earlier that
in the diagram in Fig. 2.7 I cannot implement arbitrary solutions of the system
(2.70), but only those that maximise the total power −∑_{i=1}^5 a_i x_i with which
the voltage sources a1, ..., a5 dissipate energy. I presume also that the voltages
y1, ..., y6 cannot correspond to arbitrary solutions of the system (2.71),
but only to those that maximise the total power ∑_{j=1}^6 c_j y_j of the current
sources c1, ..., c6. I cannot prove my hypothesis properly. Perhaps you can help
me with it?
Yes, with pleasure, of course. First of all we will prove an auxiliary statement.
We will do it in an abstract manner, not referring to electrical analogies, and
after that we will have a look at what this statement means for our electrical
diagram.
Let x = (x1, x2, ..., xm) be a vector satisfying the conditions

∑_{i=1}^m b_ij x_i ≥ c_j ,  j = 1, ..., n* ,   (2.72)

∑_{i=1}^m b_ij x_i = c_j ,  j = n*+1, ..., n ,   (2.73)

x_i ≥ 0 ,  i = 1, ..., m* ,   (2.74)

and let y = (y1, y2, ..., yn) be a vector satisfying the conditions

∑_{j=1}^n b_ij y_j ≤ a_i ,  i = 1, ..., m* ,   (2.75)

∑_{j=1}^n b_ij y_j = a_i ,  i = m*+1, ..., m ,   (2.76)

y_j ≥ 0 ,  j = 1, ..., n* .   (2.77)

Then there holds the inequality

∑_{j=1}^n c_j y_j ≤ ∑_{i=1}^m a_i x_i .   (2.78)
The proof of the previous statement is rather simple. It follows from the
inequalities (2.72) and (2.77) that

∑_{j=1}^{n*} ( ∑_{i=1}^m b_ij x_i − c_j ) y_j ≥ 0 ,

and from the equalities (2.73) that

∑_{j=n*+1}^{n} ( ∑_{i=1}^m b_ij x_i − c_j ) y_j = 0 .

Together,

∑_{j=1}^{n} ( ∑_{i=1}^m b_ij x_i − c_j ) y_j ≥ 0 .   (2.79)

Similarly, it follows from the equalities (2.76) that

∑_{i=m*+1}^{m} ( a_i − ∑_{j=1}^n b_ij y_j ) x_i = 0 ,

and together with the conditions (2.74) and (2.75),

∑_{i=1}^{m} ( a_i − ∑_{j=1}^n b_ij y_j ) x_i ≥ 0 ,   (2.80)

which, in combination with (2.79), is only another form of the inequality (2.78)
that was being proved. The proof is finished.
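The weak-duality inequality (2.78) can also be checked numerically. The sketch below uses a made-up system with one inequality and one equality constraint on each side (n* = 1, m* = 1), and one feasible, not necessarily optimal, point for each task.

```python
# Numerical check of the weak-duality inequality (2.78) on a small
# invented system: constraint 1 is an inequality, constraint 2 an
# equality; x1 and y1 are sign-constrained, x2 and y2 are free.
b = [[1.0, 1.0],     # b[i][j]: coefficient of x_i in constraint j
     [2.0, 1.0]]
a = [3.0, 2.0]
c = [1.0, 2.0]
x = [1.0, 1.0]       # feasible: 1*1 + 2*1 >= 1,  1*1 + 1*1 == 2,  x1 >= 0
y = [0.5, 1.0]       # feasible: y1 >= 0,  y1 + y2 <= 3,  2*y1 + y2 == 2

lhs = sum(c[j] * y[j] for j in range(2))     # sum_j c_j y_j
rhs = sum(a[i] * x[i] for i in range(2))     # sum_i a_i x_i
assert lhs <= rhs                            # inequality (2.78)
```

For these numbers the dual side yields 2.5 and the primal side 5.0, so (2.78) holds with slack; the gap closes only at the optima of both tasks, as equation (2.81) below asserts.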
As all the constraints (2.72)-(2.77), which you created in Fig. 2.7, are satisfied,
the inequality (2.78) is satisfied too.
Inequality (2.78) can be easily understood without a proof when one keeps in mind
our electrical diagram. The total energy ∑_{j=1}^n c_j y_j, which the current sources
c_j, j = 1, ..., n, dissipate at any instant, cannot be larger than the total energy
∑_{i=1}^m a_i x_i, which is received from the surroundings by the voltage sources a_i,
i = 1, ..., m. Otherwise, the energy conservation law would be violated.
I can see now that an even stricter relation than (2.78) can be proved on the
basis of purely physical considerations, namely the equation

∑_{i=1}^m a_i x_i = ∑_{j=1}^n c_j y_j ,   (2.81)

which is equivalent to the first duality theorem (cf. Theorem 2.1), whose analytical
proof is rather difficult. Equation (2.81) holds simply because the total
energy ∑_{i=1}^m a_i x_i, which occurs on the voltage sources, cannot be larger than
the total energy ∑_{j=1}^n c_j y_j produced by the current sources. Otherwise, it
would not be clear from where the energy could originate. Equation (2.81) confirms
my earlier hypothesis that only such currents x_i appear in the diagram as are
solutions of the primal linear programming task. Similarly, only such voltages y_j
are generated as are solutions of the dual task.
Only a slight effort is needed to give an electrical interpretation to the second
duality theorem too. When that is done, we will understand it entirely and
informally. Are you able to do that on your own?
That might be all we can help you with today to clarify the dual tasks of linear
programming and to allow you to manipulate them freely when analysing practical
tasks.
The electrical analogy of the abstract duality theorems is quite obvious. There
are some questions left which concern non-Bayesian tasks.
Go ahead, ask!
I have understood from the lecture that in many practical tasks a penalty
function, as occurs in Bayesian tasks, does not make any sense in reality. I
noticed that the concept of a penalty function has not been used at all in the
listed non-Bayesian tasks.
Assume that the statistical model of an object is entirely known in an application,
but the penalty function is not defined. This means that both the a priori
probabilities p_K(k), k ∈ K, and the conditional probabilities p_X|K(x | k) make
sense and are known. I understand that the penalty function, which does not
make sense in the given application, is missing for the formulation of the Bayesian
task. But why should I have to formulate the task as a non-Bayesian one in
this case? Is it not better to remain within the Bayesian framework, and
formulate the task not as risk minimisation but as minimisation of
the probability of a wrong decision? Such a criterion is fully understandable
to the user, even more so than the risk. It seems to me that the criterion
minimising the probability of a wrong decision has the full right to exist, and
not only when it is derived from a specific penalty function.
decision algorithm by two probabilities of a wrong decision and not merely
by a single one. These two probabilities are formally expressed by means of
numbers which can be added, averaged, etc. However, these probabilities describe
entirely different events, and nothing that could be interpreted within the
framework of your application corresponds to their sum. You have to formulate
the task in such a way that this non-interpretable operation is not used. Only
in this way can you achieve a formally proved strategy that will correspond to
the real demands of the application.
Instead of formulating these demands, you formulated another task in the hope
that the optimal solution of this supplementary task would not be too bad for
the actual task. Such an approach would be comprehensible if you knew
only one statistical decision task. But you know more tasks of this kind, and
you need not cling to the single one so convulsively. The most appropriate
in your case would be to formulate the task as a Wald or Neyman-Pearson task.
I was convinced in the lecture that the solution of a properly formulated task
can differ significantly from the solutions which someone could guess on
the basis of mere intuition. In this respect the Wald task seems to me the
most distinctive one for |K| > 2, where the outcome is quite unexpected. And
I resent it, since I understand the task only purely formally. If I am to use it,
I have to trust formal methods more than is to my taste. I expected that
the solution of Wald tasks would be much simpler. Let me explain the algorithm
which I considered to be the solution of the task. And then I would ask you to
help me analyse it and tell me why it is wrong.
The formally proved algorithm can be expressed in the following form. The
strategy depends on 2|K| numbers τ(k), q(k), k ∈ K, and on their basis the
following values should be calculated,

γ_k(x) = τ(k) p_X|K(x | k) / ∑_{k'∈K} τ(k') p_X|K(x | k') ,

which very much resemble the a posteriori probabilities of the state k under
the condition x. Next, the largest of these numbers is sought and
compared with the threshold θ(x). The threshold value is

θ(x) = 1 − ( ∑_{k∈K} q(k) p_X|K(x | k) ) / ( ∑_{k∈K} τ(k) p_X|K(x | k) ) ,

and this value is different for each observation.
and this value is different for each observation. It is very difficult to under-
stand this step informally, because the threslwld value was the same for all
observations in the Bayesian •.:ariant of this task, i.e., independent of x. These
considerations are not the proof, but I assumed that here in lVald task it should
be the same case. This means that the strategy would have a form
argmax Atdx), if max 'Ydx) > (1- <5),
kEK kEK
0, if max 'Ydx) < (1 - <5) .
kEK
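Whatever task this proposed strategy actually solves, it is easy to write down. The weights τ, the constant δ and the distribution below are invented for illustration.

```python
# Sketch of the strategy proposed in the discussion: normalised
# quantities gamma_k(x) are compared with a fixed threshold 1 - delta
# that does not depend on the observation x.
tau = {1: 1.0, 2: 1.0}          # invented weights
p = {1: 0.9, 2: 0.1}            # p_X|K(x|k) for one observation (invented)
delta = 0.2

total = sum(tau[k] * p[k] for k in tau)
gamma = {k: tau[k] * p[k] / total for k in tau}
best = max(gamma, key=gamma.get)
k_star = best if gamma[best] > 1 - delta else 0   # 0 means "not known"
```

Here gamma[1] = 0.9 exceeds the threshold 0.8, so the state k* = 1 is chosen rather than the answer "not known".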
This strategy seems to me far more natural than the strategy derived in
the lecture. That is why I assumed that even if the strategy is not the solution
of the formulated task, it could be a useful strategy for some other task.
Let us try to make a retrospective analysis of your strategy and find the
tasks for which this strategy is optimal. The strategy suggested by you would
be obtained if the variables τ(k), q(k), k ∈ K, were constrained by the property
that the ratio
These linear constraints will be embodied into the formulation of the dual
task (2.62). This constitutes a new dual task.

Dual task:

    \sum_{k \in K} q(k) = 1 , \qquad (b)

    q(k) \ge 0 , \quad \tau(k) \ge 0 .
This task differs from the previous task (2.62), which was composed in the
lecture, by the |X| additional constraints (2.82a). The original task
was thus somewhat deformed. We will build up, rather formally, the primal task
to which the task (2.82) is dual, in order to find out what happened after the
deformation. Before that we will perform several equivalent transformations of
the task (2.82). First, on the basis of (2.82a) the constraint (2.82c) can be
given the form

    t(x) + (1 - \delta) \sum_{k \in K} \tau(k)\, p_{X|K}(x \mid k) \le 0 , \quad x \in X . \qquad (2.83)
When this expression is substituted into the maximised function, we can write it as

    \sum_{x \in X} t(x) + \frac{1 - \varepsilon}{\delta} ,

which differs from \sum_{x \in X} t(x) only by a constant. From what was said above,
the new shape of the task (2.82) follows:

    \max \sum_{x \in X} t(x)

subject to

    t(x) + (1 - \delta) \sum_{k \in K} \tau(k)\, p_{X|K}(x \mid k) \le 0 , \quad x \in X ,

    t(x) + \tau(k)\, p_{X|K}(x \mid k) \le 0 , \quad x \in X , \ k \in K , \qquad (2.85)

    \delta \sum_{k \in K} \tau(k)\, p_{X|K}(x \mid k) - \sum_{k \in K} q(k)\, p_{X|K}(x \mid k) = 0 , \quad x \in X , \qquad (a)

    \sum_{k \in K} \tau(k) = \frac{1}{\delta} ,

    \tau(k) \ge 0 , \quad q(k) \ge 0 , \quad k \in K .
From such a form of the task it can be seen that the constraints (2.85a) are
redundant. Indeed, if the variables t(x), x \in X, q(k), k \in K, and \tau(k), k \in K,
conform to all constraints except the constraints (2.85a), then the variables
q(k), k \in K, alone can be changed so that the constraints (2.85a) become satisfied;
for example, the variable q(k) can be selected equal to \delta\, \tau(k). Thus the
constraints (2.85a), and with them the variables q(k), can be left out, and the
task assumes the form

    \max \sum_{x \in X} t(x)

subject to

    t(x) + (1 - \delta) \sum_{k \in K} \tau(k)\, p_{X|K}(x \mid k) \le 0 , \quad x \in X ,

    t(x) + \tau(k)\, p_{X|K}(x \mid k) \le 0 , \quad x \in X , \ k \in K , \qquad (2.86)

    \delta \sum_{k \in K} \tau(k) = 1 ,

    \tau(k) \ge 0 , \quad k \in K .
It should not surprise you that the variables q(k), k \in K, have disappeared from the task.
After all, you yourself wanted the threshold, with which the 'a posteriori' quantities
\gamma(k \mid x), k \in K, are compared, not to depend on the observation x.
This dependence was realised in the former task through the coefficients q(k), k \in K.
You will probably be surprised that, as a result of the deformation of the task, the
parameter \varepsilon has disappeared as well, which had determined the acceptable
limit on the probability of error. So we can state that the algorithm which seems
natural to you no longer guarantees that the error will not be larger than \varepsilon. This is
because the algorithm simply ignores the value \varepsilon. We will proceed further to
find out in which sense the algorithm proposed by you is optimal.
The task (2.86) is dual to the following task

    \min_{\alpha} \max_{k \in K} \bigl( \omega(k) + \delta\, \chi(k) \bigr) ,

where

    \omega(k) = \sum_{x \in X} \sum_{k^* \in K \setminus \{k\}} \alpha(x, k^*)\, p_{X|K}(x \mid k)

is the probability of the wrong decision provided that the object is in the state
k, and

    \chi(k) = \sum_{x \in X} \alpha(x, 0)\, p_{X|K}(x \mid k)

is the probability of the answer not known under the condition that the object is
in the state k. The parameter \delta in your algorithm can thus be interpreted as the
penalty for the answer not known, while the wrong decision is penalised
by one. Your task formulation is admissible as well. You can now answer
I cannot definitely agree with your last remark although I acknowledge that it
is distinctive and expressed in rich colours.
I have one more question which is not very short. It was said in Lectures 1
and 2 that the strategy solving both Bayesian and non-Bayesian tasks could
be expressed in the space of probabilities using convex cones. For all that, a
significant difference remains between Bayesian and non-Bayesian strategies.
The Bayesian strategy divides the whole space of probabilities into convex
cones in such a way that each point from the space belongs just to a single
cone, including the points lying on the cone borders. This was in fact stated by
the basic theorem on the deterministic nature of Bayesian strategies, cf.
Theorem 1.1.
It is somewhat different in non-Bayesian strategies. It was proved in the
lecture that all points inside the cones have to be classified into a certain class
in a deterministic way. Nothing was said, on purpose or maybe accidentally,
about what is to be done with the points which lie exactly on the cone's borders.
Non-Bayesian strategies are deterministic in almost all points of the probability
space but not entirely everywhere. It is clear to me that decisions corresponding
to observations fitting exactly to the borders of convex cones not only can but
even must be random. This random decision can be better than any determin-
istic one. Naturally, I do not believe that miraculous results can be achieved
with the help of this randomisation. Random strategies cannot be much better
than deterministic strategies because the randomisation is useful only in a very
small subset of points. I am more concerned that the deterministic character
of strategies is no longer a peremptory imperative for non-Bayesian tasks as it
was for Bayesian tasks. On this basis I am starting to believe that there might
exist another broader area of statistical tasks in which randomised strategies
will have decisive predominance over deterministic ones perhaps similarly as
what happens in antagonistic games in the game theory.
It seems to me that such a situation could arise if the basic concepts were
enriched by a penalty function that is not fixed but depends on the value of
a certain non-random parameter, i.e., an intervention. In contrast to the
interventions which we met in testing complex hypotheses (random or
non-random), these interventions do not influence the observed object but
merely how the appropriate decision will be penalised. The task could
be defined as seeking a strategy which is good enough for any interven-
tion. In other words, I would like to study also such strategies for which the
penalty function is not defined uniquely but, on the other hand, is not entirely
unknown either. Such partial knowledge and partial ignorance is expressed with
the help of a class of penalty functions, and a strategy has to be found which
is admissible for each penalty function from this class.
The given questions are not easy to answer. These issues will be treated neither
at this moment nor during our entire course since we do not know the answer.
It might be an open problem which is worth being examined. You can see that
the answer to your question is far shorter than the question itself.
January 1997.
2.7 Bibliographical notes
The main tool in this lecture was the pair of dual linear programming tasks,
which has been carefully studied in mathematics [Kuhn and Tucker, 1950;
Zuchovickij and Avdejeva, 1967].
We have not found such a general view on non-Bayesian tasks anywhere.
Actually, this was the main motivation for us to write the lecture. A mathe-
matician can observe tasks from the height of great generality; e.g., Wald [Wald,
1950] stated that the finite nature of the observation space is such a severe con-
straint that it is seldom satisfied in statistical decision tasks. Nevertheless,
in statistical decision making theory, situations in which the estimated parameter
is random are sharply distinguished from those in which it is non-random [Neyman, 1962].
A practitioner solves non-Bayesian tasks often subconsciously when she or
he starts tuning parameters of the decision rule that was derived in a Bayesian
manner with the aim of recognising all classes roughly in the same way. By
doing this she or he actually solves the non-Bayesian task, in the given case a
minimax one. Two articles [Schlesinger, 1979b; Schlesinger, 1979a] are devoted
to formalisation of the practitioner's approach and they served as a starting
material for writing the lecture.
The references to original sources will be given for the individual non-
Bayesian tasks. For the Neyman–Pearson task the references [Neyman and
Pearson, 1928; Neyman and Pearson, 1933] are relevant, and for deeper understanding
the textbook of statistics [Lehmann, 1959] can be recommended. A minimax task
is described in [Wald, 1950]. The Wald task, as it was understood in the lecture,
is a special case of Wald sequential analysis [Wald, 1947; Wald and Wolfowitz,
1948]. Statistical decision tasks with non-random interventions, also called
tasks testing complex hypotheses, were formulated by Linnik [Linnik, 1966].
Lecture 3
Two statistical models
of the recognised object
A distribution p_{X|K}: X \times K \to \mathbb{R} of conditional probabilities of observations
x \in X, under the condition that the object is in a state k \in K, is the cen-
tral concept on which various tasks in pattern recognition are based. Now is
an appropriate time to introduce examples of conditional probabilities of ob-
servations with whose help we can elucidate the previous as well as the
following theoretical constructions. In this lecture we will stop progressing in the
main direction of our course for a while to introduce the two simplest functions
p_{X|K}, which are the most often used models of the recognised object.
    \theta_{\min} < \frac{p_{X|K}(x \mid k = 1)}{p_{X|K}(x \mid k = 2)} \le \theta_{\max} , \qquad (3.3)

where \theta_{\min} and \theta_{\max} are threshold values. The expression (3.3) is evidently
equivalent to the relation

    \theta'_{\min} < \log \frac{p_{X|K}(x \mid k = 1)}{p_{X|K}(x \mid k = 2)} \le \theta'_{\max} , \qquad (3.4)

where the threshold values are different from those in the relation (3.3). If each
feature x_i, i \in I, assumes only two values 0 or 1, then the following derivation
in the form of several equations brings us to an interesting property of the
logarithm of the likelihood ratio:

    \log \frac{p_{X|K}(x \mid k = 1)}{p_{X|K}(x \mid k = 2)}
      = \sum_{i=1}^{n} \log \frac{p_{X_i|K}(x_i \mid k = 1)}{p_{X_i|K}(x_i \mid k = 2)}

      = \sum_{i=1}^{n} x_i \log \frac{p_{X_i|K}(1 \mid k = 1)\, p_{X_i|K}(0 \mid k = 2)}{p_{X_i|K}(1 \mid k = 2)\, p_{X_i|K}(0 \mid k = 1)}
        + \sum_{i=1}^{n} \log \frac{p_{X_i|K}(0 \mid k = 1)}{p_{X_i|K}(0 \mid k = 2)} .
The transition from the next-to-last line to the last line in the previous deriva-
tion can be verified by considering the two possible cases x_i = 0 and x_i = 1
separately. It can be seen that the logarithm of the likelihood ratio is a linear
function of the variables x_i. Because of this we can rewrite the expression (3.4) in
the following way:

    \theta'_{\min} < \sum_{i=1}^{n} a_i x_i \le \theta'_{\max} . \qquad (3.5)
If the tasks are expressed by a firmly chosen function p_{X|K} then the various strate-
gies (3.5) differ from each other only in the threshold values. If, in addition,
the function p_{X|K} varies, then the coefficients a_i start varying as well. Under all
these changes it remains valid that all decision regions are regions in which the
values of a linear function belong to a contiguous interval.
In the special case in which the set of decisions consists of two decisions only, i.e.,
when the observation set X is to be divided into two subsets X_1 and X_2, the
decision function assumes the form

    x \in \begin{cases}
      X_1 , & \text{if } \sum_{i=1}^{n} a_i x_i \le \theta , \\
      X_2 , & \text{if } \sum_{i=1}^{n} a_i x_i > \theta .
    \end{cases} \qquad (3.6)
This means that for objects characterised by binary and conditionally inde-
pendent features, the search for the needed strategy is equivalent to the search for
the coefficients a_i and the threshold value \theta. The entire Lecture 5 on linear de-
cision rules will be devoted to the manner in which these coefficients and
thresholds are tuned properly.
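This reduction can be checked numerically; in the sketch below the per-feature probabilities are invented, not the lecture's data, and the constant term is the part of the sum that (3.5) absorbs into the thresholds:

```python
import itertools
import math

# Invented per-feature probabilities for three binary features.
p1 = [0.9, 0.6, 0.7]   # p_{X_i|K}(x_i = 1 | k = 1)
p2 = [0.2, 0.5, 0.4]   # p_{X_i|K}(x_i = 1 | k = 2)

# Coefficients a_i of the linear form, as in the derivation above; the
# constant term is what the thresholds of (3.5) absorb.
a = [math.log(q1 * (1 - q2) / ((1 - q1) * q2)) for q1, q2 in zip(p1, p2)]
const = sum(math.log((1 - q1) / (1 - q2)) for q1, q2 in zip(p1, p2))

def loglr(x):
    """Direct value of log [ p(x | k=1) / p(x | k=2) ] under independence."""
    num = math.prod(q if xi else 1 - q for xi, q in zip(x, p1))
    den = math.prod(q if xi else 1 - q for xi, q in zip(x, p2))
    return math.log(num / den)

# The linear expression reproduces the log-likelihood ratio exactly
# on every binary observation.
for x in itertools.product([0, 1], repeat=3):
    assert abs(loglr(x) - (sum(ai * xi for ai, xi in zip(a, x)) + const)) < 1e-9
```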
The coefficients \alpha_{ij}, \beta_i, i, j = 1, 2, \ldots, m, and the threshold value \gamma depend on the
statistical model of the object, i.e., on the matrices A_1, A_2, the vectors \mu_1, \mu_2, and also
on which Bayesian or non-Bayesian decision task is to be solved. Even
in the two-dimensional case, the variability of the geometrical forms which the sets
X_1 and X_2 assume is quite large. We will show some of them.
1. The border between the sets X_1 and X_2 can be a straight line which is
situated in such a way that the set X_1 lies on one side of the line and the set
X_2 on the other side.
    ( x_1, \ldots, x_i, \ldots, x_{n-1}, x_n,
      x_1 x_1, x_1 x_2, \ldots, x_1 x_i, \ldots, x_1 x_{n-1}, x_1 x_n,
      \ldots,
      x_i x_i, \ldots, x_i x_{n-1}, x_i x_n,
      \ldots,
      x_{n-1} x_{n-1}, x_{n-1} x_n,
      x_n x_n ) . \qquad (3.10)

If the observation x is transformed by means of (3.10), then the quadratic strategy (3.8)
can be written in the form (3.11) of a strategy that compares a function linear in the
transformed observation with a threshold.
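A minimal sketch of this lifting, with made-up coefficients: mapping x to its coordinates together with all pairwise products turns any quadratic function of x into a function linear in the lifted vector:

```python
# Sketch of the lifting (3.10): map x to the vector of its coordinates and
# all pairwise products x_i x_j, i <= j. A quadratic discriminant then
# becomes a linear function in the lifted space. The coefficients below
# are toy numbers, not a fitted Gaussian model.

def lift(x):
    n = len(x)
    return x + [x[i] * x[j] for i in range(n) for j in range(i, n)]

def quadratic(x, alpha, beta):
    """sum_ij alpha[i][j] x_i x_j + sum_i beta[i] x_i."""
    n = len(x)
    return (sum(alpha[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
            + sum(beta[i] * x[i] for i in range(n)))

def lifted_weights(alpha, beta):
    """Weights w such that quadratic(x) == w . lift(x)."""
    n = len(alpha)
    w = list(beta)
    for i in range(n):
        for j in range(i, n):
            # off-diagonal terms x_i x_j and x_j x_i collapse into one slot
            w.append(alpha[i][j] + (alpha[j][i] if j != i else 0))
    return w

alpha = [[1.0, 0.5], [0.3, -2.0]]
beta = [0.7, -1.2]
x = [0.4, 1.5]
w, z = lifted_weights(alpha, beta), lift(x)
assert abs(quadratic(x, alpha, beta) - sum(wi * zi for wi, zi in zip(w, z))) < 1e-12
```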
3.3 Discussion
Every time the model with independent features is used in publications, it seems
incredible to me that it is possible to describe a real object in such a simple
way. Why is the model used so often? My answer is that in practical tasks
there is a lack of knowledge that would allow using a more complex model.
The available experimental data are sufficient to evaluate how particular features
depend on the object state, but not sufficient to evaluate the dependence of
groups of features on the state. However, the lack of knowledge about the mutual
dependence of features does not entitle anybody to conclude that
the features are independent. A more thorough approach should be used here,
one which explicitly takes into account the insufficient knowledge of the statistical
model of the object. In common practice the insufficient knowledge is wilfully
replaced by a specific model whose only advantage is that its analysis is
simple. An analogy comes to mind of a person looking for a lost item not at
the spot where it was lost but under the street lamp. I have tried to settle accounts with
the indicated difficulties. I have reached partial results but it is difficult for me
to continue. I will explain my results to you first and then I will ask you to help me
make a step ahead.
Let X be the set of observations x = (x_1, x_2, \ldots, x_n), that is, X = X_1 \times X_2 \times
\ldots \times X_n, where X_i is the set of values of the feature x_i. Let the set of states
K consist of two states 1 and 2, i.e., K = \{1, 2\}. Let p_{X|k}(x) be the conditional
probability of the observation x under the condition of the state k. The functions
p_{X|k}, k \in K, are unknown, but the marginal probabilities p_{X_i|k}(x_i) are known
and are expressed by the relations

    p_{X_i|k}(x_i) = \sum_{x \in X(i, x_i)} p_{X|k}(x) , \quad i = 1, \ldots, n , \quad k = 1, 2 . \qquad (3.12)

In the previous formula the notation X(i, x_i) stands for the set X_1 \times X_2 \times \ldots \times
X_{i-1} \times \{x_i\} \times X_{i+1} \times \ldots \times X_n, i.e., the set of those sequences x = (x_1, x_2, \ldots, x_n)
in which the i-th position is occupied by the fixed value x_i.

If it is known that the features under the condition of a fixed state k consti-
tute an ensemble of independent random variables, this in fact means that the
function p_{X|k} is known too, because

    p_{X|k}(x) = \prod_{i=1}^{n} p_{X_i|k}(x_i) , \quad k = 1, 2 \qquad (3.13)

holds in this case.
Let us admit frankly that such a model is quite simple. If it actually occurs,
difficult questions can hardly arise. But I am interested in how I should
create a recognition strategy when I am not sure that the assumption (3.13)
about the conditional independence of the features is satisfied, and I know only the
marginal probabilities p_{X_i|k}, i = 1, \ldots, n, k = 1, 2. In other words, how should
I recognise the state k when I only know that the functions p_{X|k}, k = 1, 2,
satisfy the relations (3.12) with known left-hand sides, and nothing else?
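The point that the marginals (3.12) do not determine the joint distribution can be made concrete with a tiny sketch (invented marginals): the independent model (3.13) and a dependent model can share exactly the same marginals:

```python
from itertools import product

# Invented marginals of two binary features for one fixed state k.
marg1 = {0: 0.4, 1: 0.6}   # p_{X_1|k}(x_1)
marg2 = {0: 0.3, 1: 0.7}   # p_{X_2|k}(x_2)

# The independent joint, as in (3.13).
indep = {(a, b): marg1[a] * marg2[b] for a, b in product([0, 1], repeat=2)}

# A dependent joint with the same marginals: shift mass eps between the
# diagonal cells (0,0),(1,1) and the off-diagonal cells (0,1),(1,0);
# all row and column sums stay unchanged.
eps = 0.1
dep = dict(indep)
dep[(0, 0)] += eps; dep[(1, 1)] += eps
dep[(0, 1)] -= eps; dep[(1, 0)] -= eps

for joint in (indep, dep):
    for a in (0, 1):   # marginal over x1
        assert abs(sum(joint[(a, b)] for b in (0, 1)) - marg1[a]) < 1e-12
    for b in (0, 1):   # marginal over x2
        assert abs(sum(joint[(a, b)] for a in (0, 1)) - marg2[b]) < 1e-12
assert indep != dep    # the marginals alone do not determine the joint
```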
I am not amazed that the question is difficult for me, because many other
questions seem difficult to me too. On the other hand, I am surprised that
no one has been attracted by this question so far. As soon as I ask someone
how I should recognise an object about which only the marginal probabilities p_{X_i|k}
are known, I usually get a witty answer based on the implicit assumption that
the features are mutually independent. If I ask why the recognition strategy
should be oriented just to the case of independent features, I obtain a less
comprehensible explanation. This strategy is said to be suitable also for the
case in which the features are mutually dependent.
When I pursue the additional explanation, I am told that the entropy of the
observation is greatest for independent features. This should mean that
observations can appear at the classifier input which would not occur if there
existed a dependence between the features. That is why a strategy which is
successful on a certain set of observations cannot be less successful on a subset
of these observations.
On the one hand, all these considerations seem admissible to me because
they are in harmony with the informal behaviour of a human in a statistically
uncertain situation. I consider acceptable the behaviour in which a human
under uncertain conditions prepares for the worst situation and behaves
in such a way that the losses will not be too high even in that case. When
reality proves to be better than the worst case, so much the better. On the
other hand, these considerations are based on implicit assumptions which seem
self-evident, but in fact need not always be satisfied.
The main question is whether, for a given set of statistical models (in our case the
models which satisfy the relation (3.12)), there exists an exceptional,
worst model to which the recognition strategy should be tuned. At the same
time, the strategy tuned to this exceptional model should also suit any other
model: the recognition results for any other model should not be worse than
those for the worst one. I know that such an exceptional model need not exist
in every set. I can illustrate this by the following simple situation.
Let X be a two-dimensional set (a plane), K = \{1, 2\}, and p_{X|1}(x) be a two-
dimensional Gaussian probability distribution with independent components,
with variance 1 and mathematical expectation \mu_1 = \mu_2 = 0. Let p_{X|2}(x) be
unknown, but let it be known that it is only one of two possible distributions,
either p'_{X|2}(x) or p''_{X|2}(x), each of which differs from p_{X|1}(x) only in the
mathematical expectation: it is the point \mu'_1 = 2, \mu'_2 = 0 in the first case, and
the point \mu''_1 = 0, \mu''_2 = 4 in the second case, see Fig. 3.1. Thus we have a
set consisting of two statistical models. The first model is determined by the
pair of functions (p_{X|1}(x), p'_{X|2}(x)) and the second by the pair
(p_{X|1}(x), p''_{X|2}(x)).

Figure 3.1 Difficulties in seeking the 'worst statistical model' when attempting
to decompose the plane into subsets.

One of the two models may seem worse at first glance. However, when
the strategy tuned to this model is used, i.e., the strategy that decomposes
    \min c

    c - \sum_{x \in X} \alpha_1(x)\, p_{X|2}(x) \ge 0 , \quad p_{X|2} \in P(2) ;

    c - \sum_{x \in X} \alpha_2(x)\, p_{X|1}(x) \ge 0 , \quad p_{X|1} \in P(1) ; \qquad (3.17)

    \alpha_1(x) + \alpha_2(x) = 1 , \quad x \in X ;

    \alpha_1(x) \ge 0 , \quad \alpha_2(x) \ge 0 , \quad x \in X .
This task has infinitely many constraints, namely |P(1)| + |P(2)| + |X| of them. I
got rid of the unpleasant infinities in the following way: the set P(1) is the set of
solutions of the system of linear equations (3.12). Having in mind that a solution
p_{X|1} of the system (3.12) has to satisfy the natural constraint 0 \le p_{X|1}(x) \le 1
at any point x \in X, I come to the conclusion that P(1) is a bounded convex set.
Since it is the solution set of a finite system of linear equations and inequalities, the
set is a multi-dimensional polyhedron. The number of polyhedron vertices is quite
large, but finite. I will denote the vertices by p_1^j, j \in J(1), where J(1) is a finite set
of indices. It is obvious that when the inequality c - \sum_{x \in X} \alpha_2(x)\, p_{X|1}(x) \ge 0
holds for an arbitrary function p_{X|1} from the set P(1), then the same inequality
holds in particular for every vertex p_1^j, j \in J(1), i.e.,

    c - \sum_{x \in X} \alpha_2(x)\, p_1^j(x) \ge 0 , \quad j \in J(1) , \qquad (3.18)

because every vertex p_1^j, j \in J(1), belongs to the set P(1). The opposite
statement is correct too. This means that from the inequalities (3.18) also the
inequality

    c - \sum_{x \in X} \alpha_2(x)\, p_{X|1}(x) \ge 0 , \quad p_{X|1} \in P(1) , \qquad (3.19)

follows. It is so because it is possible to express any function p_{X|1} from the set P(1) as

    p_{X|1} = \sum_{j \in J(1)} \gamma_j\, p_1^j ,

where \gamma_j, j \in J(1), are non-negative coefficients for which \sum_{j \in J(1)} \gamma_j = 1
holds. Thus, the conditions (3.18) and (3.19) are equivalent. The same is true
for the conditions

    c - \sum_{x \in X} \alpha_1(x)\, p_2^j(x) \ge 0 , \quad j \in J(2) ,

where \{p_2^j \mid j \in J(2)\} is the set of all vertices of the polyhedron P(2). The
task (3.17) assumes the form

    \min c

    c - \sum_{x \in X} \alpha_1(x)\, p_2^j(x) \ge 0 , \quad j \in J(2) ; \qquad \text{(dual variables } \tau_{2j}\text{)}

    c - \sum_{x \in X} \alpha_2(x)\, p_1^j(x) \ge 0 , \quad j \in J(1) ; \qquad (3.20) \quad \text{(dual variables } \tau_{1j}\text{)}

    \alpha_1(x) + \alpha_2(x) = 1 , \quad x \in X ; \qquad \text{(dual variables } t(x)\text{)}

    \alpha_1(x) \ge 0 , \quad \alpha_2(x) \ge 0 , \quad x \in X .
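To make the finite task concrete, here is a small sketch that poses a problem of the form (3.20) to SciPy's linear programming routine; the vertex distributions p_1^j, p_2^j below are invented toy numbers, not derived from any real marginals (3.12):

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of the task (3.20) as an LP. alpha1(x) is the probability of
# deciding k = 1 at observation x; c is the guaranteed error bound.
X = 3
P1 = [np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.4, 0.1])]  # vertices of P(1)
P2 = [np.array([0.1, 0.3, 0.6]), np.array([0.2, 0.2, 0.6])]  # vertices of P(2)

# Variables: alpha1(0..2), alpha2(0..2), c.  Objective: minimise c.
n = 2 * X + 1
obj = np.zeros(n)
obj[-1] = 1.0

A_ub, b_ub = [], []
for p in P2:                      # sum_x alpha1(x) p_2^j(x) - c <= 0
    row = np.zeros(n); row[:X] = p; row[-1] = -1.0
    A_ub.append(row); b_ub.append(0.0)
for p in P1:                      # sum_x alpha2(x) p_1^j(x) - c <= 0
    row = np.zeros(n); row[X:2 * X] = p; row[-1] = -1.0
    A_ub.append(row); b_ub.append(0.0)

A_eq = []
for x in range(X):                # alpha1(x) + alpha2(x) = 1
    row = np.zeros(n); row[x] = 1.0; row[X + x] = 1.0
    A_eq.append(row)
b_eq = np.ones(X)

bounds = [(0, None)] * (2 * X) + [(None, None)]   # alphas >= 0, c free
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(round(res.fun, 6))   # the guaranteed error bound c* (0.3 for this toy data)
```

The dual variables that the solver attaches to the two groups of inequality constraints play exactly the role of the Lagrange coefficients discussed next.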
where \tau^*_{1j}, j \in J(1), and \tau^*_{2j}, j \in J(2), are the Lagrange coefficients that solve the
dual task

    \max \sum_{x \in X} t(x)

    t(x) - \sum_{j \in J(2)} \tau_{2j}\, p_2^j(x) \le 0 , \quad x \in X ;

    t(x) - \sum_{j \in J(1)} \tau_{1j}\, p_1^j(x) \le 0 , \quad x \in X ; \qquad (3.22)

    \sum_{j \in J(1)} \tau_{1j} + \sum_{j \in J(2)} \tau_{2j} = 1 ;

    \tau_{1j} \ge 0 , \quad \tau_{2j} \ge 0 ,

for which the a priori probabilities of the states k = 1, 2 are

    p^*_K(1) = \sum_{j \in J(1)} \tau^*_{1j} , \qquad p^*_K(2) = \sum_{j \in J(2)} \tau^*_{2j} , \qquad (3.23)

and for which the conditional probabilities of the observation x under the con-
dition of the states k = 1, 2 are

    p^*_{X|1}(x) = \sum_{j \in J(1)} \frac{\tau^*_{1j}}{\sum_{i \in J(1)} \tau^*_{1i}}\, p_1^j(x) , \qquad (3.24)

    p^*_{X|2}(x) = \sum_{j \in J(2)} \frac{\tau^*_{2j}}{\sum_{i \in J(2)} \tau^*_{2i}}\, p_2^j(x) , \quad x \in X . \qquad (3.25)
The statistical model (3.23), (3.24), (3.25) will be denoted by m*. It is obvious
that this model satisfies the condition (3.12) because both functions p^*_{X|1} and
p^*_{X|2} represent convex combinations of the functions p_1^j, j \in J(1), and p_2^j, j \in
J(2), which satisfy the condition (3.12). The strategy q* is the solution of the Linnik
task that is formulated as the minimisation of the function \max_{m \in M} R(q, m);
the task is expressed by (3.17). I can write

    \min_{q} \max_{m \in M} R(q, m) = \min c .

The coefficients \tau^*_{1j}, j \in J(1), and \tau^*_{2j}, j \in J(2), are the solution of the dual
task (3.22) and thus

    \max \sum_{x \in X} t(x) = R(q^*, m^*) , \qquad (3.26)

where the expression on the right-hand side of (3.26) denotes the risk R(q*, m*).
By the first duality theorem, Theorem 2.1, I have \min c = \max \sum_{x \in X} t(x), and conse-
quently

    R(q^*, m^*) = \max_{m \in M} R(q^*, m) . \qquad (3.27)
I have proved that the set of models satisfying the conditions (3.12) also comprises
the worst model m*, for which the following holds:

    R(q, m^*) \ \ge\ R(q^*, m^*) \ \ge\ R(q^*, m) \qquad (3.28)

for an arbitrary strategy q and an arbitrary model m \in M. The first inequality
in the expression (3.28) is correct because q* is the Bayesian strategy for the
model m*. The second inequality is only the equation (3.27) rewritten in a
different manner.
The answer to my Question 1 is therefore positive. Now an additional ques-
tion can be formulated correctly, too. This question asks: What is the worst
model in the set of models that satisfy (3.12)? How must the strategy
be chosen for the worst model? What should the recognition look like in the case
in which only the marginal probabilities on the left-hand sides of (3.12) are known
and nothing else?
I am helpless in my attempts to answer the above questions. But having
become convinced, in such a complicated way, that the questions are
correct, I cannot help but get the impression that there are answers to
them.
We are pleasantly surprised by the enormous work you have done. It seems
that we can hardly distinguish who teaches whom here. It took a while before
we found the correct answer to your question. It was worth the effort because
the answer is entirely unexpected. We were quite pleased by your solution of
the problems, as well as by the fact that you have found virgin soil in a field
which seemed to have been investigated in a criss-cross manner.
The worst model m*, the existence of which you proved in such an excellent
way, cannot be described so transparently. The most interesting issue in your
question is that the Bayesian strategy q* for the worst model m* can be found
without the need to find the worst model itself. It is possible because it can be
proved that the strategy q* makes the decision about the state k only on the
basis of a single feature and must ignore all the others.
It was not easy to prove this property, and our explanation may not
be easily comprehensible for you. So let us first discuss, using simple
examples, what this property means. Assume you have two features x_1 and x_2.
You are to find out which of these two features leads to the smaller probability of
the wrong decision. You will then decide about the state k only on the basis
of the better feature, and you will not use the value of the other feature. Let us
have a more detailed look at this situation and try to understand why
it must be dealt with in just this way.
Say that the feature x_1 assumes only two values 0 and 1 and its dependence on
the state k \in \{0, 1\} is determined by four conditional probabilities:

    p(x_1 = 1 \mid k = 1) = 0.75 , \quad p(x_1 = 0 \mid k = 1) = 0.25 ,
    p(x_1 = 1 \mid k = 0) = 0.25 , \quad p(x_1 = 0 \mid k = 0) = 0.75 .
It is evident that you can create a strategy q*, based on this feature alone, which
estimates the state k with the probability of the wrong decision 0.25. It
is the strategy which chooses the state k = 1 when x_1 = 1 and the state
k = 0 when x_1 = 0. The probability of the wrong decision apparently does not
depend on the a priori probabilities of the states. Whatever they are, the
probability of the wrong decision will be the same, namely 0.25.
Let us assume that this probability seems to you too large and that is why
you would like to lower it by using one more feature. Let us assume that
you have such a feature at your disposal. In our simplified example let it be
the feature x2 which also assumes two values only. Conditional probabilities
corresponding to these values under the constraint of the fixed state k are
    p(x_2 = 1 \mid k = 1) = 0.7 , \quad p(x_2 = 0 \mid k = 1) = 0.3 ,
    p(x_2 = 1 \mid k = 0) = 0.3 , \quad p(x_2 = 0 \mid k = 0) = 0.7 .
      p(k = 1) = ?                          p(k = 0) = ?
                       x_1 = 0   x_1 = 1                    x_1 = 0   x_1 = 1
      p(x_1 | k = 1)     0.25      0.75     p(x_1 | k = 0)    0.75      0.25
      x_2  p(x_2 | k=1)                     x_2  p(x_2 | k=0)
      0        0.3         ?         ?      0        0.7        ?         ?
      1        0.7         ?         ?      1        0.3        ?         ?

Table 3.1 The data of the example determining the probability of the wrong decision. The
values in the six table entries denoted by a question mark correspond to the unknown param-
eters of the statistical model; the cells of each panel hold the joint probabilities p(x_1, x_2 | k).
All the data related to our example are given in a comprehensive way
in Table 3.1. Besides the known values presented in the table, space
is reserved for the unknown data. These are the a priori probabilities p(k = 1)
and p(k = 0) and the joint conditional probabilities p(x_1, x_2 \mid k). There are ques-
tion marks in the entries corresponding to the unknown values in Table 3.1. The
question marks can be replaced by arbitrary numbers that only have to satisfy
the obvious condition

    p(k = 1) + p(k = 0) = 1

and the condition (3.12) on the marginal probabilities. This means that the
sum of the probabilities p(x_1 = 1, x_2 = 1 \mid k = 1) and p(x_1 = 1, x_2 = 0 \mid k = 1) has
to be 0.75, etc. The particular way in which the question marks in Table 3.1 are
replaced by numbers influences, in the general case, the probability of the
wrong decision achieved by a strategy. It will not be difficult for you
to convince yourself that the strategy q* introduced above, which decides
about the state considering only the feature x_1, secures the probability 0.25
of the wrong decision for an arbitrary substitution of the question marks by
actual numbers.
Let us now have a look at whether the probability of the wrong decision can
be lowered when both features x_1 and x_2 are used instead of the single feature x_1.
It would be natural to use one of the two following strategies for this purpose.
The first strategy decides for the state k = 1 if and only if x_1 = 1 and x_2 = 1.
The second strategy selects k = 0 if and only if x_1 = 0 and x_2 = 0.

Let us analyse the first strategy. Here the probability of the wrong
decision, unlike in the case of the strategy q*, depends on which
numbers substitute the question marks in Table 3.1. The numbers can be such
that the probability of the wrong decision becomes 0.55. Thus it will be worse
than it would be if only the worse feature x_2 was used; in that case it would
be 0.3. These numbers are displayed in Table 3.2(a), which shows values only
for k = 1. It is obvious that with p(k = 0) = 0 the probabilities
p(x_1, x_2 \mid k = 0) no longer have any influence.

When applying the second strategy, such numbers can substitute the ques-
tion marks in Table 3.1 that the probability of the wrong decision will again
be 0.55. These values are shown in Table 3.2(b).
      p(k = 1) = 1                          p(k = 0) = 1
                       x_1 = 0   x_1 = 1                    x_1 = 0   x_1 = 1
      p(x_1 | k = 1)     0.25      0.75     p(x_1 | k = 0)    0.75      0.25
      x_2  p(x_2 | k=1)                     x_2  p(x_2 | k=0)
      0        0.3         0       0.3      0        0.7       0.45     0.25
      1        0.7        0.25     0.45     1        0.3       0.3       0
             (a)                                   (b)

Table 3.2 The decision with two features. The two most unfavourable cases, with the probability
of the wrong decision 0.55, are depicted; they correspond to the two different strategies.
If you liked, you could make sure that any other strategy will not be
better than the strategy q* presented earlier. It is the strategy q* which decides
about the state k wrongly in 25% of the cases for an arbitrary substitution of the
question marks by actual numbers. For any other strategy there exists a
substitution of the question marks by numbers for which the probability of the
wrong decision is greater than 25%. We do not think that it would be
difficult for you to try the remaining possibilities; there are only 16 possible
strategies in our case and 3 of them have already been analysed. In spite of that,
we think that you will not enjoy trying it because this is only a simple example.
We would like to study together with you the properties of the strategy found
in the general case. We mean the case in which the marginal probabilities are
arbitrary, and the number of features and the number of values of each feature
need not be just two but can be arbitrary.
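The exhaustive check suggested above can be sketched in a few lines. Since the error probability is linear in the unknown entries, it suffices to examine the endpoints of their feasible ranges: each joint p(x_1, x_2 | k) is determined by the single free parameter p(x_1 = 1, x_2 = 1 | k), and the prior by p(k = 1):

```python
from itertools import product

# Fixed marginals of the example: p(x1=1|k=1)=0.75, p(x2=1|k=1)=0.7,
# p(x1=1|k=0)=0.25, p(x2=1|k=0)=0.3.  Each joint p(x1,x2|k) is then
# determined by one free parameter t = p(x1=1, x2=1 | k); the worst case
# over t and the prior is attained at the interval endpoints because the
# error probability is linear in each of them.

def joint(p1, p2, t):
    """Joint over (x1, x2) with P(x1=1)=p1, P(x2=1)=p2, P(x1=1,x2=1)=t."""
    return {(1, 1): t, (1, 0): p1 - t, (0, 1): p2 - t, (0, 0): 1 - p1 - p2 + t}

T1 = [0.45, 0.70]   # feasible endpoints of t under k = 1
T0 = [0.00, 0.25]   # feasible endpoints of t under k = 0
PRIORS = [0.0, 1.0]
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]

def worst_case_error(decide):
    """decide maps (x1, x2) to the estimated state; return the worst-case
    probability of the wrong decision over all admissible completions."""
    worst = 0.0
    for t1, t0, prior in product(T1, T0, PRIORS):
        j1, j0 = joint(0.75, 0.70, t1), joint(0.25, 0.30, t0)
        err = (prior * sum(p for c, p in j1.items() if decide[c] != 1)
               + (1 - prior) * sum(p for c, p in j0.items() if decide[c] != 0))
        worst = max(worst, err)
    return worst

# Enumerate all 16 deterministic strategies.
results = {bits: worst_case_error(dict(zip(cells, bits)))
           for bits in product([0, 1], repeat=4)}

q_star = (0, 0, 1, 1)                    # decide k = x1, ignore x2
print(round(results[q_star], 2))         # 0.25
print(round(results[(0, 0, 0, 1)], 2))   # the first ('and') strategy: 0.55
print(round(min(results.values()), 2))   # 0.25, attained only by q_star
```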
But before we start analysing your task, let us try to get used to a paradoxical
fact on which we will now concentrate. The fact is that the usage of a greater
number of features may not prove better, and is usually worse, compared to the
use of one single feature. When we manage to prove it (and we shall do
so), we will obtain a quite important constraint on procedures that
distinguish the object state on the basis of information available from various
sources. Let us leave these quite serious problems aside for the meanwhile and
deal with a less important but more instructive example, which will lead
us to a clear idea of how to solve your task formulated by Question 2.
Imagine that you are a company director, or a department head, or someone
else who has to make a decision 'yes' or 'no'. For such a case you establish a board of
advisors consisting of ten people. You submit the question to be decided
to the advisors and you get ten answers x_1, x_2, \ldots, x_{10}, each 'yes'
or 'no'. After a certain time of cooperation with the board of advisors,
you learn the quality of each expert, which is expressed with the help
of the probabilities p_i(x_i \mid k), where k is the correct answer. You are now faced with
answering the question 'yes' or 'no' on the basis of the ten answers of the advisors
x_1, x_2, \ldots, x_{10}. The question would be easy if you were convinced that the
experts on your advisory board are mutually independent. But you know
that such independence is not possible, due to the complicated personal interre-
lations among the experts. And you do not know what the dependence is like.
Let M denote the set of statistical models of the form m = (p_K(1), p_K(2),
p_{XY|1}, p_{XY|2}), where p_{XY|1} and p_{XY|2} satisfy the conditions (3.29) and (3.30).
You have already proved that there is such a strategy q*: X \times Y \to \{1, 2\} and
such a model m* \in M for which there holds

    \left( XY^+(1) \right)^{\triangle} \cap \left( XY^+(2) \right)^{\triangle} = \emptyset . \qquad (3.32)

Assume that the equation (3.32) does not hold. Then a point (x*, y*)
must exist which belongs both to (XY^+(1))^{\triangle} and to (XY^+(2))^{\triangle}. The value
q*(x*, y*) at that point is either 1 or 2. Let us choose q*(x*, y*) = 1 for cer-
tainty. If (x*, y*) \in (XY^+(2))^{\triangle} then there are two points (x*, \bar{y}) and (\bar{x}, y*)
for which there holds

    p_{XY|2}(x^*, \bar{y}) > 0 , \qquad p_{XY|2}(\bar{x}, y^*) > 0 . \qquad (3.33)

We will choose a positive quantity \Delta which is not larger than p_{XY|2}(x^*, \bar{y}) and
not larger than p_{XY|2}(\bar{x}, y^*). It can be, for instance, the value

    \Delta = \min\bigl( p_{XY|2}(x^*, \bar{y}),\ p_{XY|2}(\bar{x}, y^*) \bigr) .

We create a new model \bar{m} = (p^*_K(1), p^*_K(2), p^*_{XY|1}, \bar{p}_{XY|2}) in which only the
function \bar{p}_{XY|2} has changed when compared with the function p^*_{XY|2} in the model
m*. Furthermore, the function \bar{p}_{XY|2} differs from the function p^*_{XY|2} only at the four
points (x*, y*), (\bar{x}, y*), (x*, \bar{y}) and (\bar{x}, \bar{y}), according to the following equations:

    \bar{p}_{XY|2}(x^*, y^*) = p^*_{XY|2}(x^*, y^*) + \Delta ,
    \bar{p}_{XY|2}(x^*, \bar{y}) = p^*_{XY|2}(x^*, \bar{y}) - \Delta ,
    \bar{p}_{XY|2}(\bar{x}, y^*) = p^*_{XY|2}(\bar{x}, y^*) - \Delta ,
    \bar{p}_{XY|2}(\bar{x}, \bar{y}) = p^*_{XY|2}(\bar{x}, \bar{y}) + \Delta . \qquad (3.34)
2. Such a decomposition of the set Y into two subsets Y(1) and Y(2) exists
that (3.35) holds.

The proof that at least one of the two assertions mentioned is valid follows
from purely geometrical considerations. For two non-intersecting
rectangles with vertical and horizontal edges lying in a single plane, a vertical
or a horizontal line exists that separates the plane into two parts containing just
one rectangle each. Let us remark that we are working here with generalised
rectangles. Nevertheless, the generalisation of the given principle is easy and
leads to the formulation of the previous assertion.
Assume for certainty that, for instance, the first part of Assertion 3.1 holds. Let us denote by q' the strategy which decomposes the set X × Y into the classes X(1) × Y and X(2) × Y. We can see that the strategy q' does not depend on the feature y. That is why the risk R(q', m) does not depend on the model m either, i.e.,

R(q', m*) = R(q', m) ,   m ∈ M .      (3.36)

Let us prove now that when the strategy q* for the model m* is a Bayesian strategy then q' is also a Bayesian strategy for m*. Let it hold for some point (x, y) that q'(x, y) ≠ q*(x, y). If such a point does not exist then the strategies q' and q* are equal, and therefore the strategy q' is also Bayesian. If q'(x, y) ≠ q*(x, y) then there holds
We proved that in the case of two features the strategy sought depends on one of
them only. We will now explore the more general case in which the number of features is arbitrary. The previous considerations are also valid for the general case
90 Lecture 3: Two statistical models of the recognised object
except for the relation (3.32) which states that Cartesian hulls of certain sets
do not intersect. We have to prove this property for a multi-dimensional case.
Let x1, x2, …, xn be n features that assume values from the sets X1, X2, …, Xn. Let x be the ensemble (x1, x2, …, xn), X = X1 × X2 × … × Xn, and let m* be the statistical model (p*_K(1), p*_K(2), p*_{X|1}, p*_{X|2}), where the function p*_{X|1} satisfies the system of equations (3.37). Define the sets (X⁺(1))^⊓ and (X⁺(2))^⊓ as the Cartesian hulls of the sets X⁺(1) and X⁺(2). We will prove that from the inequalities (3.37) it follows that these Cartesian hulls do not intersect,

(X⁺(1))^⊓ ∩ (X⁺(2))^⊓ = ∅ .
Assume that the previous expression does not hold, i.e., there is a point x* which belongs both to (X⁺(1))^⊓ and to (X⁺(2))^⊓. Assume for certainty that q*(x*) = 1. Let the features x1, x2, …, xn assume the value 0 at the point x*. The point x* is thus an ensemble of n zeros, i.e., (0, 0, 0, …, 0). If x* belongs to (X⁺(2))^⊓ then such a set S of points x¹, x², …, xᵗ, t ≤ n, exists that for each i = 1, 2, …, n there is a point xʲ in the set S with xʲ_i = 0. Furthermore, each point xʲ ∈ S belongs to X⁺(2), i.e., for each of them q*(xʲ) = 2 and p_{X|2}(xʲ) > 0 holds.
We will show that in this case the probabilities p_{X|2}(x) can be decreased in the points x¹, x², …, xᵗ and increased in other points, including the point x*, all that without changing the marginal probabilities p_{X_i|2}(x_i). In such a case only the probabilities of the points x¹, x², …, xᵗ are decreased, and these are the points assigned by the strategy q* to the second class. There is at least one point (it is the point x*) which is assigned by the strategy q* to the first class and the probability of which increases. As a consequence of such a change of probabilities in deliberately selected points, the probability of the wrong decision about the object in the second state increases. As this contradicts the requirement (3.37), it proves the required relation too.
We will prove that the possibility of the desired change of probabilities follows from the inequality (X⁺(1))^⊓ ∩ (X⁺(2))^⊓ ≠ ∅. The proof will be based on two assertions.
Assertion 3.2 If the set S contains two points x¹ and x² only then a model m̄ exists for which the inequality R(q*, m*) < R(q*, m̄), m̄ ∈ M, holds. △
Proof. Select some point x'. Its coordinates x'_i, i = 1, …, n, are determined by the following rule. If x¹_i = 0 then x'_i = x²_i. If x²_i = 0 then x'_i = x¹_i. In other words, the i-th coordinate of the point x' is equal to the non-zero coordinate, whether it is x¹_i or x²_i. If both these coordinates are equal to zero then the i-th coordinate of the point x' is equal to zero too. For the point x' determined in this way and for the points x¹, x² and x* it holds: however many times a certain value of the i-th coordinate occurs in the pair of points x¹ and x², exactly that many times this value occurs in the pair of points x' and x*. Let us recall that all coordinates of the point x* are zeros.
Example 3.2 Let x¹ = (0, 0, 0, 5, 6, 3) and x² = (5, −2, 0, 0, 0, 0). As was said earlier, the point x* = (0, 0, 0, 0, 0, 0). In this case x' = (5, −2, 0, 5, 6, 3). △
zero coordinate in the point x. Therefore I(x*) = ∪_{x∈S} I(x) = I. The set S apparently contains at least two points. Let us denote by x¹ and x² two points for which I(x¹) ≠ I(x²). Let us create two new points x', x'' according to the points x¹, x², so that I(x') = I(x¹) ∪ I(x²). This means that the i-th coordinate of the point x' is equal to zero if this coordinate is zero in at least one of the two points x¹ or x². All other coordinates will be determined in such a way that the same property holds for the quadruplet of points (x¹, x², x', x'') which we mentioned in the proof of Assertion 3.2. It is also valid here that however many times a certain value of the i-th coordinate appears in the pair of points x¹ and x², it appears the same number of times in the pair of points x' and x''. It is easy to show that it is possible to find a pair of points with these properties.
Example 3.3

x¹ = (0, 0, 0, 0, 1, 2, 3, 4, 5) ,
x² = (5, 4, 0, 0, 0, 0, 3, 2, 1) .

The pair of points x' and x'' can have, for example, the form

x' = (0, 0, 0, 0, 0, 0, 3, 4, 5) ,
x'' = (5, 4, 0, 0, 1, 2, 3, 2, 1) .
Let us denote Δ = min ( p_{X|2}(x¹), p_{X|2}(x²) ), which is positive because x¹ ∈ X⁺(2) and x² ∈ X⁺(2). The new model is determined exactly according to the relation (3.38), where x* is replaced by x''. Let the strategy q* assign at least one of the points x', x'' to the first class. The probability of the wrong decision about the object in the second state will increase owing to the change of the model, and thus the inequality R(q*, m*) < R(q*, m̄) will be satisfied. If q*(x') = q*(x'') = 2 then the point x' will belong to the set X⁺(2) in the new model because its probability is now positive. As I(x') = I(x¹) ∪ I(x²) holds, the equality ∪_{x∈S'} I(x) = I holds too for the set S' = {x', x³, …, xᵗ}, which has one point less than the set S. The set S' is obtained so that the points x¹, x² are excluded from the set S and the point x' is included into it. ▪
And so we have proved in the more general case that the inequality R(q*, m*) ≥ R(q*, m) does not hold for all models m when the Cartesian hulls of the classes intersect. It follows from this property that a strategy q* for which the corresponding Cartesian hulls intersect is not the strategy that we look for. But because you have already proved that the strategy sought really exists, it can only be a strategy whose Cartesian hulls do not intersect. All other considerations are the same as in the case in which the number of features is 2. Therefore we can say that we have viribus unitis managed your task. Well, strictly speaking we have not finished yet, because it is not quite clear to me which of the considerations mentioned can be generalised easily for the case in which the number of states is larger than 2.
You have certainly noticed that your proof of the existence of the worst model
can be generalised almost without any change also for the case of an arbitrary
where r(x, μ) is the Euclidean distance of the points x and μ. Then, x is assigned into the first or the second class if d₁ ≤ d₂ or d₁ > d₂, respectively.
2. The classification according to the integral of the probability is based on the assumption that μ₁ and μ₂ are random variables with the uniform probability distribution on the sets M₁ and M₂. Two quantities s₁ and s₂ are calculated, where f(x, μ) is the probability density of the Gaussian variable at the point x, the mathematical expectation being μ. The state k is then assessed according to the ratio s₁/s₂.

Figure 3.2 Two convex sets corresponding to K = {1, 2}. Figure 3.3 The decision as a separation of a plane by a straight line.
Your task is to find the strategy which solves the Linnik task correctly in the given case.
The next step is to determine the Bayesian strategy for the model m*, in which the probabilities p_K(1) and p_K(2) are the same, and p_{X|1}(x) = f(x, μ₁) and p_{X|2}(x) = f(x, μ₂). This strategy decomposes the plane X into two classes by means of a straight line according to Fig. 3.3. I will denote this strategy as q*, and the probability of the error which the strategy q* achieves on the model m* will be denoted as ε*. In view of q* being the Bayesian strategy for m*, R(q, m*) ≥ ε* holds for any strategy q. Furthermore, it is obvious from Fig. 3.3 that for any model m = (p_K(1), p_K(2), f(x, μ₁), f(x, μ₂)), μ₁ ∈ M₁, μ₂ ∈ M₂, p_K(1) + p_K(2) = 1, p_K(1) ≥ 0, p_K(2) ≥ 0, the inequality R(q*, m) ≤ ε* holds. This means that the model m* is the worst one and the strategy q* is the solution of the task.
The strategy described is much simpler than the strategies usual in the literature, from which you mentioned only the nearest neighbour classification and the classification according to the integral of the probability. In the first proposal, in which the observation x is assigned to a class according to the smallest distance from the exemplar, it is necessary to solve two quadratic programming tasks. This is not so difficult in the two-dimensional case, but problems can occur in the multi-dimensional case. In the second classification proposal, according to the integral of the probability, it is necessary to calculate an integral of the Gaussian probability distribution on a polyhedron when recognising each observation. I do not know, even in the two-dimensional case, where to begin such a calculation. The simplicity of the exact solution of the Linnik task is really surprising in comparison with the two strategies mentioned.
You must add to your evaluation that the strategy q* which you have found is the solution of a well defined task. For any a priori probabilities of the states and any mathematical expectations μ₁ and μ₂, you can be sure that the probability of the wrong decision will not be greater than ε*. The value ε* serves as the guaranteed quality that is independent of the statistical model of the object. There is no other strategy about which the same can be said. It is not possible to express any similar assertion, one that would sound as a guarantee, about the recognition based on the nearest neighbour or on the integral of the probability.
We would like to offer you a small implementation exercise related to Gaussian random variables. It does not solve any basic question and belongs more or less to mathematical or programming folklore. You will not lose anything by looking at such an exercise.
We have already mentioned in the lecture that in the case of two states, when the observation x under the condition of each state is a multi-dimensional Gaussian random variable, the search for a decision requires calculating the value of a certain quadratic function at the observed point x and comparing the obtained value with a threshold. When there are only two features, denoted by the symbols x, y, the following discriminant function has to be calculated,

f(x, y) = ax² + bxy + cy² + dx + ey + g ,      (3.39)

f(x, y) = f(x − 1, y) + 2ax + by + d − a ,      (3.40)
f(0, y) = f(0, y − 1) + 2cy + e − c .      (3.41)
The program, which I introduce later, will use the constants A = 2a, B = b, C = 2c, D = d − a, E = e − c, G = g, about which I assume that they were calculated in advance. By (3.40) and (3.41) I obtain the following formulae,

f(x, y) = f(x − 1, y) + Ax + By + D ,      (3.42)
f(0, y) = f(0, y − 1) + Cy + E ,      (3.43)
f(0, 0) = G .      (3.44)
fCur = G; DeltaCur = D;
for (x = 1; x < 1000; x++)
    f[x][0] = fCur += Delta[x] = DeltaCur += A;    /* L2 */
After each execution of the command L2, the variable DeltaCur holds the value Ax + D. This value is stored in the element Delta[x] and added to the variable fCur, whose content is f(x − 1, 0) before the command L2 is executed; after the command is executed, f(x, 0) is stored in the element f[x][0], which is correct with respect to (3.42).
Finally, let us show the last program section, which fills up the rest of the array f(x, y) for the variables x = 1, 2, …, 999 and y = 1, 2, …, 999.
2(ny − 1) in the second program, and 2(nx − 1)(ny − 1) in the third program, where nx and ny are the numbers of values of the variables x and y, respectively. The total number of additions is thus 2(nx·ny − 1), which means that the average number of additions for every value f(x, y) is less than 2. It is definitely less than the 8 multiplications and 5 additions that would have to be performed if each value of the function f(x, y) were calculated directly according to the formula (3.39).
Assume for the moment that you have to tabulate a quadratic function that depends on three variables x1, x2, x3 and not just on the two variables x, y as in the previous case,

f(x1, x2, x3) = a11 x1² + a22 x2² + a33 x3² + a12 x1 x2 + a23 x2 x3 + a13 x1 x3 + b1 x1 + b2 x2 + b3 x3 + c .
In this case 15 multiplications and 9 additions are needed to calculate the value of the given function at one point (x1, x2, x3). Let us see how much the complexity of the best possible tabulation increases in comparison with the two-dimensional case.
It seems incredible, but the tabulation of a function of three variables again needs only 2 additions and no multiplication for each entry of the table. This property does not even depend on the number of variables being tabulated. When I encountered this property I said to myself again that quadratic functions are really wonderful.
You are right, but do not think that other functions are worse. When you think a bit about how you would tabulate a cubic function, you will find out quite quickly that 3 additions and no multiplication are needed to tabulate it. Again, this property does not depend on the number of variables. In the general case, when tabulating a polynomial of degree k in an arbitrarily large number of variables, only k additions and no multiplications are needed for each table entry.
It has been said in the lecture that if all features are binary, then the strategy (3.3) is implementable by a hyperplane. I have seen and heard this result many times, for example in the book by Duda and Hart [Duda and Hart, 1973]. It is a pleasant property, of course, which eases the analysis of these strategies and makes it more illustrative. That is why I am surprised that hardly anybody has noticed that a similarly pleasant property is valid not only in the case of binary features. I will try to make such a generalisation.
Let k(i) be the number of values of the feature x_i. Without loss of generality I can assume that the set of values of the feature x_i is X_i = {0, 1, 2, …, k(i) − 1}. I shall express the observation x = (x1, x2, …, xn) as a different observation y with binary features in the following way. I will number the elements of the observation y using two indices i = 1, 2, …, n and j = 0, 1, 2, …, k(i) − 1. I shall define the feature y_ij in such a way that y_ij = 1 if the feature x_i assumes its j-th value, and y_ij = 0 if the feature x_i does not assume its j-th value. The discriminant function Σ_{i=1}^{n} log ( p_{X_i|1}(x_i) / p_{X_i|2}(x_i) ) can be written in the form Σ_{i=1}^{n} Σ_{j=0}^{k(i)−1} y_ij α_ij, where the coefficient α_ij is log ( p_{X_i|1}(j) / p_{X_i|2}(j) ), and the strategy

x ∈ X1 , if Σ_{i=1}^{n} log ( p_{X_i|1}(x_i) / p_{X_i|2}(x_i) ) ≥ θ ,
x ∈ X2 , if Σ_{i=1}^{n} log ( p_{X_i|1}(x_i) / p_{X_i|2}(x_i) ) < θ ,      (3.46)

obtains the form

x ∈ X1 , if Σ_{i=1}^{n} Σ_{j=0}^{k(i)−1} α_ij y_ij ≥ θ ,
x ∈ X2 , if Σ_{i=1}^{n} Σ_{j=0}^{k(i)−1} α_ij y_ij < θ .      (3.45)

Every strategy of the form (3.46) can thus be expressed using a linear discriminant function. But it is not yet
the generalisation of the result given in the lecture. If all the features x_i are binary, i.e., if k(i) = 2, then from the relation (3.45) it follows that the strategy (3.46) can be implemented by a hyperplane in a 2n-dimensional space. On the other hand, the result from the lecture asserts that the strategy (3.46) is implementable in this case by means of a linear discriminant function in an n-dimensional space. The form of the strategy (3.45) can be improved so that the number of binary features will not be Σ_{i=1}^{n} k(i) but Σ_{i=1}^{n} k(i) − n. Since y_i0 = 1 − Σ_{j=1}^{k(i)−1} y_ij, I can introduce the new coefficients α'_ij = α_ij − α_i0, i = 1, 2, …, n, j = 1, 2, …, k(i) − 1, and the new threshold θ' = θ − Σ_{i=1}^{n} α_i0. It is obvious that the strategy (3.45) is equivalent to the strategy

x ∈ X1 , if Σ_{i=1}^{n} Σ_{j=1}^{k(i)−1} α'_ij y_ij ≥ θ' ,
x ∈ X2 , if Σ_{i=1}^{n} Σ_{j=1}^{k(i)−1} α'_ij y_ij < θ' .
This result generalises the result given in the lecture to the case in which the features x_i are not binary. Each strategy of the form (3.46) can be expressed using a linear discriminant function in a (Σ_{i=1}^{n} k(i) − n)-dimensional space. In the particular case in which k(i) = 2 for every i, the dimension of this space is n.
I am sure that this more general result can simplify the analysis of strategies
of the form of (3.46) in various theoretical considerations.
obstacle on its own since not everyone possesses this knowledge. We will be convinced many times in the following lectures that there are, unfortunately, far more obstacles of this kind than we would wish. The usual reaction to such a situation is a dream about a miraculous tool, e.g., in the 'Lay table, lay!' form, with whose help it would be possible to avoid all the difficulties at once. This fairy tale usually has the following wording in pattern recognition: 'There is a system (genetic, evolutionary, neural, or exotic in another way) which works in the following manner. The system learns first, i.e., the training multi-set x1, x2, …, xl of observational examples is brought to its input. Simultaneously, each observation x_i from the training multi-set is accompanied by the information k_i representing the reaction to the observation which is considered correct. When the learning finishes after l steps, the normal exploitation stage of the system begins, during which the system reacts with the correct answer k to each observation x, even to one which did not appear in the learning stage. Even though the way to derive the correct answer has never been provided explicitly, the system is able to solve any pattern recognition task.'
In such cases it is usually hopeless to try to find an understandable answer to the question of what task is formulated, for which the solution is intended, and to learn more specifically how the system works. The expected results seem, at least to the authors of the proposals mentioned, so wonderful and so easily accessible that they regret losing time on trifles such as an unambiguous task formulation and the exact derivation of the particular algorithm which should solve the task. The fairy tale is simply so wonderful that it would merely be spoiled by down-to-earth questions.
The more realistic view of this fairy tale leads to the current models of learning and their formulations, which are going to be brought forward in the following section.
on the state k. The lack of knowledge can be expressed in such a way that the function p_{X|k} is known to belong to a class P of functions, but it is not known which specific function from the class P actually describes the object. Expressed differently, the knowledge can be determined by the ensemble of sets P(k), k ∈ K. Each of the sets comprises the actual function p_{X|k}; however, which one is not known. The set P or, what is the same, the ensemble of sets P(k), k ∈ K, can quite often (roughly speaking, almost always) be parameterised in such a way that a function f(x, a) of two variables x, a is known which determines the function f(a): X → ℝ of one single variable for each fixed value of the parameter a. At present it is not necessary to specify more exactly what is meant by the parameter a and to constrain the task prematurely and unnecessarily. The parameter a can be a number, a vector, a graph, etc. The set P(k) is thus {f(a) | a ∈ A}, where A is the set of values of the unknown parameter a. Our knowledge about the probabilities p_{X|K}(x | k), which is given by the relation p_{X|K} ∈ P, means that a value a* of the parameter a is known to exist for which p_{X|k} = f(a*).
Example 4.1 Parametrisation of P(k). Let P be a set consisting of probability distributions of n-dimensional Gaussian random variables with mutually independent components and unit variances. Then the set P(k) in a parameterised form is the set {f(μ) | μ ∈ ℝⁿ} of the functions f(μ): X → ℝ of the form

f(μ)(x) = Π_{i=1}^{n} (1/√(2π)) exp ( −(x_i − μ_i)² / 2 ) .
Based on the knowledge of the functions p_{X|k}, k ∈ K, defined up to the values of the unknown parameters a1, a2, …, an, the function q(x, a1, a2, …, an) can be created which will be understood as a strategy given up to the values of the unknown parameters. The function q(x, a1, a2, …, an) describes how the observation x would be assessed if the parameters a_k, k = 1, 2, …, n, determining the distributions p_{X|k} were known. In other words, the parametric set of strategies can be created
which would be obtained when using the strategy, the wrong decisions being quantified by the penalty W. But the criterion cannot be computed because the function p_{XK}(x, k) is not known. The lack of knowledge about the function p_{XK}(x, k) is substituted to a certain degree by the training set or multi-set.
Various formulations of the learning task differ in how the most natural criterion is replaced by substitute criteria which can be calculated on the basis of the information obtained during the learning. Nevertheless, a gap always remains between the criterion that should, but cannot, be calculated and the substitute criterion which can be computed. Bridging this gap can be left to the conscientiousness (intuition or experience) of the learning algorithm's designer, or the gap can be estimated in some way. We will first introduce the most famous substitute criteria on which the approaches to learning popular today are based. Later we will introduce the basic concepts of statistical learning theory, whose main task is just the evaluation of how large the gap we spoke of can be.
In this case the probability of the training multi-set T can be computed for each ensemble of unknown parameters a = (a_k, k ∈ K) as

L(T, a) = Π_{i=1}^{l} p_K(k_i) p_{X|K}(x_i | k_i, a_{k_i}) .      (4.2)
Then the ensemble a* of the values (a*_k, k ∈ K) found is treated in the same way as if the values were real. This means that the ensemble (a*_k, k ∈ K) is substituted into the general expression q(x, a1, a2, …, an) and the recognition is performed according to the strategy q(x, a*_1, a*_2, …, a*_n).
The expression (4.3) can be expressed in a different but equivalent form which will be useful in the coming analysis. Let α(x, k) be the number that indicates how many times the pair (x, k) occurred in the training multi-set. Under the condition of non-zero probabilities p_{X|K}(x | k, a_k) we can write

a* = argmax_{(a_k, k∈K)} Π_{k∈K} Π_{x∈X} ( p_K(k) p_{X|K}(x | k, a_k) )^{α(x,k)}
   = argmax_{(a_k, k∈K)} Σ_{k∈K} Σ_{x∈X} α(x, k) log ( p_K(k) p_{X|K}(x | k, a_k) )
   = argmax_{(a_k, k∈K)} Σ_{k∈K} Σ_{x∈X} α(x, k) log p_{X|K}(x | k, a_k) .      (4.4)
The previous equation (4.4) shows that the a priori probabilities p_K(k) need not be known when determining a*_k.
then in the first formulation (learning according to the maximal likelihood) the mean μ*_k is estimated as the average value (1/l) Σ_{i=1}^{l} x_i of the observations of the object in the k-th state. If the learning task is solved in its second formulation (based on the non-random training set) then μ*_k is estimated as the centre of the smallest circle containing all vectors which were selected by the teacher as rather good representatives of the objects in the k-th state. ▪
R̄(a) = (1/l) Σ_{i=1}^{l} W(k_i, q(a)(x_i)) ,      (4.7)
which can be measured and seems to be a close substitute of the actual risk
(4.6).
The third approach to learning in pattern recognition tries to create a parametric set of strategies on the basis of partial knowledge about the statistical model of the object. From this parametric set, such a strategy is then chosen which secures the minimal empirical risk (4.7) on the submitted training multi-set.
Example 4.3 Learning by minimisation of the empirical risk for multi-dimensional Gaussian distributions. Let us have a look at what the third approach just discussed means in a special case, the same one in which we recently illustrated the dissimilarity between learning according to the maximal likelihood and learning according to the non-random training set, see Example 4.2.
If the number of states and the number of decisions are both equal to two and the observation is a multi-dimensional Gaussian random variable with mutually independent components and unit variances, then the set of strategies contains the strategies separating the classes by a hyperplane. The third approach to learning aims at finding the hyperplane which secures the minimal value of the empirical risk (i.e., the minimal number of errors in this particular case) on the training multi-set. △
class of tasks. The learning is nothing else than the recognition of which task has to be solved and, subsequently, the choice of the right algorithm for this task. To be able to make such a choice, the designer of a learning algorithm has himself to solve all the tasks that can occur. In other words, he has to find a general solution for the whole class of tasks and present this general solution as a parametric set of strategies. When this is done, the general solution is incorporated into the body of the learning algorithm. Such a deformed fairy tale about pattern recognition with learning has totally lost its gracefulness and charm, no doubt, but it has gained a prosaic solidity and reliability because it has stopped being a miracle.
such a substitution can be justified (pay attention, the error follows!) by the law of large numbers which, roughly speaking, claims that in a large number of experiments the relative frequency of an event differs only a little from its probability.
The question to which the answer is sought is in reality far too complicated to be smoothed away by a mere, and not quite well thought out, reference to the law of large numbers. Let us express more exactly what this complexity consists in.
Let Q be a set of strategies and q a strategy from this set, i.e., q ∈ Q. Let the ensemble T be a training multi-set (x1, k1), (x2, k2), …, (xl, kl) and T* be the set of all possible training multi-sets. Let R̄(T, q) denote the relative frequency of wrong decisions that the strategy q makes on the multi-set T. Let us denote by R(q) the probability of the wrong decision that is achieved when the strategy q is used. And, finally, let us denote by V: T* → Q the learning algorithm, i.e., the algorithm which for each selected multi-set T ∈ T* determines the strategy V(T) ∈ Q. The number R̄(T, V(T)) thus represents the quality achieved on the training multi-set T using the strategy which was created based on the same multi-set T. By the law of large numbers it is possible to state, in a slightly vulgarised manner for the time being, that for any strategy q the random number R̄(T, q) converges to the probability R(q) provided the length of the multi-set approaches infinity. The length of the training multi-set is understood as the number of its elements. This not very exact, but basically correct, statement does not say anything about the relation between two random numbers; the first of them is the number R̄(T, V(T)) and the second is the number R(V(T)).
If we assumed, with reference to the law of large numbers, that these two numbers coincide for large lengths of the multi-set T, it would testify that the concept of the 'law of large numbers' is used as a mere magic formula without a clear understanding of what it relates to. The law does not say anything about the relation between the two numbers mentioned.
In reality the convergence of the random numbers R̄(T, V(T)) and R(V(T)) to the same limit is not secured. In some cases this pair of random numbers converges to the same limit, and in other cases the numbers R̄(T, V(T)) and R(V(T)) remain different whatever the length of the training multi-set T is. We will show an example of the second situation mentioned.
Example 4.4 The estimate of the risk and the actual risk can differ even for infinitely long training multi-sets T. Let the set X of observations be a one-dimensional continuum, for example, an interval of real numbers. Let two functions p_{X|1}(x) and p_{X|2}(x) define two probability distributions of the random variable x on the set X under the condition that the object is in the first or in the second state. Let it be known that the densities p_{X|1}(x) and p_{X|2}(x) are not infinitely large at any point x, which means that the probability of each value x, as well as of any finite number of values, is equal to zero. Let V(T) be the following strategy: for each x ∈ X it is analysed whether the value x occurs in the training multi-set T. If it does, i.e., if some x_i is equal to x, then the decision
110 Lecture 4: Learning in pattern recognition
k_i is given for the observation x. If the observation x does not occur in the training multi-set T then k = 1 is decided.
Two assertions are valid for this learning algorithm V. The probability of the wrong decision R(V(T)) is the a priori probability of the second state p_K(2), because the strategy V(T) assigns practically all observations into the first class independently of the training multi-set T. Indeed, the probability that a random x appears in the finite multi-set T is equal to zero. This means that the probability of the answer k = 2 for a random observation is equal to zero too.
On the other hand, the number R̄(T, V(T)) is, of course, equal to zero with probability 1. Indeed, R̄(T, V(T)) = 0 for an arbitrary multi-set T in which no element (x, k) occurs more than once, and the total probability of all other multi-sets is equal to 0.
Consequently we have two random variables. The first is equal to p_K(2) with probability 1 and the second is equal to zero with probability 1. This fact holds for an arbitrary length of the training set T. Therefore it cannot happen for any length of the multi-set T that the random variables R̄(T, V(T)) and R(V(T)) approach each other. It does not contradict the law of large numbers since it does not have anything in common with it. △
(4.8)
can be achieved. Actually, the length l of the experiment can be quite large,
that is,
l ≥ (−ln η) / ε .      (4.10)
Example 4.7 Accuracy, reliability, and length of the experiment shown with specific numbers. Having in mind the previous relation (4.10), the customer can determine more exactly, at least for himself, what result of an experiment will be considered positive. The customer realises that any experiment has a restricted accuracy ε, given by the highest admissible probability of the wrong decision, and a reliability η. In his particular case he chooses ε = 2% and η = 0.1% and formulates the rule according to which he accepts or rejects the proposed strategy q. If the strategy q recognises about 9000 observations without error, then he accepts it and deduces that the probability of the wrong decision for the accepted strategy does not exceed 2%. Such a rule can be justified on the basis of the correctly understood law of large numbers. He substitutes ε = 2% and η = 0.1% into the inequality (4.10) and writes
The customer equipped with this knowledge enters a shop selling programs for
pattern recognition and chooses a program which does not make any single
mistake on the testing multi-set prepared in advance. He is convinced that he
already has what he needs this time. Indeed, the purchased program has been
created without considering the experimental material. That is why he concludes
that the possibility of a direct swindle has been excluded. The customer makes
a mistake here again.
Despite all its illusory cogency, the rule used does not protect the customer from
choosing a wrong pattern recognition strategy, because it does not take into
account how extensive the set is from which the strategy is chosen. The extent of
this set has a substantial significance. Indeed, the customer makes a wrong choice
if a single strategy out of the bad ones passes the test. Imagine the counterexample
in which the choice is made from a set of wrong strategies only, even in the case
of an extremely strict test being used. The probability that some of the wrong
strategies will pass the test can actually be quite large if the set of examined
strategies is quite extensive.
The customer, having acquired this important but not very pleasant experience,
comes to the conclusion that for a reliable choice of the strategy it is
not enough that the length of the experimental material satisfies the condition

P { |ν_l(q) − p(q)| > ε } < η .   (4.11)

A stronger condition has to be satisfied, namely

P { max_{q∈Q} |ν_l(q) − p(q)| > ε } < η ,

which is: (a) much more strict compared to the condition (4.11); (b) it requires
that the probability that a single wrong algorithm from the set Q passes the
test is low; and (c) this condition depends significantly on the cardinality of the
set of recognition algorithms Q, from which the choice is made. Our imaginary
customer starts understanding the merit of the questions that the statistical
theory of learning (to be explained in the next subsection) tries to answer.
W(k, k*) = { 0 if k = k* ;  1 if k ≠ k* }   (4.12)

holds and the symbol k* denotes an estimate of the actual state k using the
strategy q.
Let ν_l(q) be a random variable represented by the frequency of wrong
decisions which the strategy q makes on the random training multi-set

T = ( (x_1, k_1), (x_2, k_2), . . . , (x_l, k_l) ) of length l.

It is known by the law of large numbers that the relation between the value
p(q) and the random variable ν_l(q) can be expressed by the following inequality
for an arbitrary ε > 0,

P { |ν_l(q) − p(q)| > ε } ≤ exp(−2ε²l) .   (4.13)
The strategies can be divided into correct and wrong ones. The strategy q is
considered correct if p(q) ≤ p* and wrong if p(q) > p*. It is not possible to
decide immediately about the correctness of a strategy. However, on the basis
of the relation (4.13), the following test can be performed. If ν_l(q) ≤ p* − ε
then the strategy has passed the test (it is likely to be correct); otherwise it
has not passed the test (it is likely to be wrong). It is possible to calculate
the reliability of the test, which means the probability of the event that a
wrong strategy passes the test. By (4.13) this probability is not larger than
exp(−2ε²l).
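The reliability of this test can be illustrated by a small simulation; it is only a sketch, with the true error probability 0.30, the threshold p* = 0.25 and ε = 0.05 chosen here arbitrarily for illustration.

```python
import math
import random

random.seed(0)

def passes_test(p_wrong, l, p_star, eps):
    # Empirical error frequency nu_l(q) of a strategy whose true error
    # probability is p_wrong, observed on a random multi-set of length l.
    errors = sum(random.random() < p_wrong for _ in range(l))
    return errors / l <= p_star - eps

# A wrong strategy: p(q) = 0.30 > p* = 0.25.  By (4.13), the probability
# that it nevertheless passes the test is at most exp(-2 * eps**2 * l).
l, trials = 400, 2000
rate = sum(passes_test(0.30, l, 0.25, 0.05) for _ in range(trials)) / trials
print(rate, "<=", math.exp(-2 * 0.05 ** 2 * l))
```

The observed false-pass rate stays far below the Hoeffding-type bound, which is loose but universal.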
114 Lecture 4: Learning in pattern recognition
In the case in which learning is used for finding this strategy, not only a
single strategy q but a set Q of strategies are put through the test. A strategy
is then selected that has passed the test. The result of learning is reliable only
if the probability

P { max_{q∈Q} |ν_l(q) − p(q)| > ε }   (4.14)

is small.
The reliability of the whole learning process is influenced by the probability
(4.14), not by the probability (4.13). The probabilities (4.13) and (4.14) are
substantially different. The main property of the probability (4.13) is that it
can assume an arbitrarily small value for an arbitrary ε > 0 when a proper
length l is chosen. Owing to this property the relation (4.13) is one of the basic
formulae of classical mathematical statistics.
A similar property is not guaranteed for the probability (4.14). The probability
(4.14) cannot always be arbitrarily decreased by increasing the length
l. In other words, the probability (4.13) always converges to zero for l → ∞,
whereas the probability (4.14) may or may not converge to zero for l → ∞,
depending on the set of strategies Q. This fact expresses the central issue of
learning in pattern recognition, which cannot be solved by mere reference to
the law of large numbers.
Let us show the most important properties of the probability (4.14). We will
start from the simplest case in which the set Q consists of a finite number N
of strategies. In this case

P { max_{q∈Q} |ν_l(q) − p(q)| > ε } ≤ N exp(−2ε²l) .   (4.15)

We will show how this simply derived relation can be interpreted in learning
within pattern recognition.
1. Q is a set which consists of N strategies of the form X → K;
2. T is a random multi-set (x_1, k_1), (x_2, k_2), ..., (x_l, k_l) of the length l with
the probability ∏_{i=1}^{l} p_XK(x_i, k_i); p_XK(x, k) is the joint probability of the
observation x ∈ X and the state k ∈ K;
3. We will determine two subsets of strategies for two certain numbers p* and ε.
A strategy belongs to the set of wrong strategies if p(q) > p*. A strategy
belongs to the subset of strategies that passed the test if ν_l(q) ≤ p* − ε.
4. An arbitrary strategy that passed the test is selected from the set Q.
5. It follows from the relation (4.15) that the probability of a wrong strategy
being selected is not larger than N exp(−2ε²l).
4.3 Basic concepts and questions of the statistical theory of learning 115
Theorem 4.1 Chervonenkis and Vapnik. The estimate of the training multi-set
length. If from a set consisting of N strategies a strategy is chosen that
has the smallest relative frequency ν of errors on the training multi-set of length
l, then with a probability 1 − η it can be stated that, applying this strategy, the
probability of the wrong decision will be smaller than ν + ε provided that

l = (ln N − ln η) / (2ε²) .   (4.16)   ▲
This theorem correctly expresses the most substantial property of learning: the
broader the class of strategies is, i.e., the less the specific pattern recognition
task was investigated in advance, the longer the learning must last to become
reliable enough (which is always needed).
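The dependence expressed by (4.16) is easy to evaluate numerically; the sketch below (function name ours) shows that the number of strategies enters only through its logarithm.

```python
import math

def training_length(n_strategies, eps, eta):
    """Length l from (4.16): l = (ln N - ln eta) / (2 * eps**2), rounded up."""
    return math.ceil((math.log(n_strategies) - math.log(eta)) / (2 * eps ** 2))

# Squaring the number of strategies merely doubles the ln N term,
# so the required length grows very slowly with N.
print(training_length(1000, 0.02, 0.001))       # → 17270
print(training_length(1000 ** 2, 0.02, 0.001))  # → 25905
```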
From a practical point of view, Equation (4.16) defines the demands on the
length of learning too roughly, that is, with too large a reserve. For instance,
when the set Q is infinite, and as a matter of fact only such cases occur in
practice, the recommendation (4.16) does not yield anything because it
requires endlessly long learning. This contradicts our intuition and, as we will
see later, the intuition is correct. This means that the relation (4.16) can be
substantially improved. The length of learning will then depend not on such a
rough characteristic of the set Q as the number of its elements, but on other,
subtler properties of this set. These properties are the entropy, the growth
function, and the capacity of the class. Let us introduce definitions of these
concepts.
Let Q be a set of strategies and x_1, x_2, ..., x_l be a sequence of observations.
Two strategies q_1 ∈ Q and q_2 ∈ Q are called equivalent with respect to the
sequence x_1, x_2, ..., x_l if for any i the equality q_1(x_i) = q_2(x_i) holds. Thus
each sequence of observations induces an equivalence relation on the set Q.
Let us denote the number of equivalence classes corresponding to this relation by
Δ(Q, x_1, x_2, ..., x_l). In other words, Δ(Q, x_1, x_2, ..., x_l) corresponds to
the number of different decompositions of the sequence x_1, x_2, ..., x_l by means
of strategies from the set Q.
Example 4.8 Decomposition of real numbers through a threshold. Let
x_1, x_2, ..., x_l be real numbers and Q be a set of strategies of the following form: each
strategy is characterised by the threshold value θ and maps an observation x
into the first class if x < θ, and into the second class if x ≥ θ. It is obvious
that the number Δ(Q, x_1, x_2, ..., x_l) is greater by one than the number of
different numbers in the sequence x_1, x_2, ..., x_l. Thus Δ(Q, x_1, x_2, ..., x_l) = l + 1
holds almost always. ▲
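The count Δ(Q, x_1, ..., x_l) for the threshold strategies can be verified by brute force; the sketch below (function name ours) enumerates one representative threshold per interval between distinct values.

```python
def decompositions(xs):
    """All distinct decompositions of xs induced by threshold strategies
    x -> class 1 if x < theta else class 2."""
    # One representative theta per interval: theta = v realises
    # "all points below v in class 1"; one theta above the maximum
    # realises "everything in class 1".
    thetas = sorted(set(xs)) + [max(xs) + 1.0]
    return {tuple(1 if x < t else 2 for x in xs) for t in thetas}

print(len(decompositions([0.3, 1.7, 1.7, 2.5])))  # → 4 = distinct values + 1
print(len(decompositions(list(range(10)))))       # → 11, i.e., l + 1
```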
Because the sequence of observations is random, the number Δ(Q, x_1,
x_2, ..., x_l) is also random. The mathematical expectation of the logarithm
of this number,

E log Δ(Q, x_1, x_2, ..., x_l) ,

will be denoted H_l(Q) and called the entropy of the set of strategies Q on
the sequences of the length l.
Our main goal is to show how large the length of learning l should be in
order to obtain a fairly accurate and reliable result of learning. This means
that

P { max_{q∈Q} |ν_l(q) − p(q)| > ε }
should be fairly small for a quite small ε. Before doing so we must exclude from
consideration all situations in which this probability does not converge to zero
at all. In such a case learning does not make any sense because the frequency of
errors on the learning sequence does not have anything in common with the
probability of error even for a learning sequence of arbitrarily large length. The
complete description of all such hopeless situations is given by the following
theorem.
Theorem 4.2 Chervonenkis and Vapnik. The necessary and sufficient condition
of the uniform convergence of the empirical risk to the actual risk. The probability

P { max_{q∈Q} |ν_l(q) − p(q)| > ε }   (4.18)

converges to zero for l → ∞ and for any ε > 0, if and only if the relative
entropy H_l(Q)/l converges to zero for l → ∞. ▲
Proof. The proof of Theorem 4.2 is rather complicated and long [Vapnik
and Chervonenkis, 1974]. ■
Theorem 4.2 provides an exhaustive answer to that difficult question. Of course,
this theorem, like any exact and general statement, can be used only with
difficulties in particular cases. The theorem says that the difficult question
about the convergence of the probability

P { max_{q∈Q} |ν_l(q) − p(q)| > ε }

can be reduced to the analysis of the relative entropy H_l(Q)/l. This analysis
can be simplified in two steps.
The first step is based on the concept of the growth function. Let Δ(Q, x_1,
x_2, ..., x_l) be, as before, the number of possible decompositions of the sequence
x_1, x_2, ..., x_l by strategies of the set Q. Let us introduce the number m_l(Q) by

m_l(Q) = max_{x_1, x_2, ..., x_l} Δ(Q, x_1, x_2, ..., x_l) .

The sequence of numbers m_l(Q), l = 1, 2, ..., ∞, is called the growth function.
The number log m_l(Q) is tied to the entropy H_l(Q) by the simple
expression log m_l(Q) ≥ H_l(Q). Thus if

lim_{l→∞} (log m_l(Q)) / l = 0 ,   (4.19)

then lim_{l→∞} H_l(Q)/l = 0 and the expression (4.19) can be used as a sufficient
condition (but not a necessary one) to assure convergence of the probability
(4.18) to zero. Equation (4.19) can be checked in an easier manner because the
probability distribution PxK(x, k) need not be known in order to calculate the
growth function. With the help of the growth function it is possible not only
to prove the convergence of the probability (4.18) to zero but also to find the
upper bound of the empirical risk deviation from the actual risk.
Theorem 4.3 On the upper bound of the empirical risk deviation from the
actual risk.

P { max_{q∈Q} |ν_l(q) − p(q)| > ε } ≤ 3 m_{2l}(Q) e^{−ε²(l−1)/4} .   (4.20)   ▲
It can be seen that Equation (4.20) assessing the reliability of learning is similar
to Equation (4.15) that holds for the case of the finite set Q. The growth
function plays in Equation (4.20) the same role as does the number of strategies
in Equation (4.15). This means that the growth function can be considered the
measure of the complexity of the set Q which is analogous to the number of
strategies for the case of the finite sets Q. Certainly, if the growth function
can be calculated for the finite set Q then exactly the growth function should
be used, and not the mere number N of strategies in the set Q. The growth
function describes the structure of the set of strategies more expressively and
more precisely than the simple number of strategies in that set because it
considers the diversity of strategies. A mere number of strategies simply ignores
this diversity.
The second step towards the simplified assessment of learning reliability is
based on the concept of the capacity of the set of strategies. The concept of
the capacity of the set of strategies, informally speaking, is the smallest possible
number of observations which cannot be classified in an arbitrary way by the
strategies from the appropriate set. The name VC dimension is also used
in the literature for the capacity of the set of strategies, according to the first
letters of the surnames of the original authors. We use the name introduced by
Chervonenkis and Vapnik in their original publications.
First, let us have a look at how the capacity of the set of strategies is defined
in simple examples. The exact definition for the general case will be introduced
later.
Example 4.9 Capacity of the set of strategies for classification of real numbers
according to the threshold. Let X be the set of real numbers and Q
be the set of strategies of the abovementioned form for classification into two
classes only: any strategy is characterised by the threshold θ and assigns the
number x ∈ X to the first class if x < θ, and to the second class if x ≥ θ.
For any x ∈ X there is a strategy q′ in the set Q which assigns the
observation x to the first class, and another strategy q″ which assigns x to the
second class. Let x_1 and x_2 be two different points on the coordinate axis X.
For these two points either x_1 < x_2 or x_2 < x_1 holds. Let us choose x_1 < x_2
to assure certainty. This pair of points can no longer be classified arbitrarily
into two classes by means of strategies from the set Q. Indeed, there is no
strategy in the set Q that assigns x_2 into the first class and x_1 into the second
class, because x_2 > x_1. Thus the given set Q of strategies is such that any
single point on the straight line X can be classified in an arbitrary manner
using different strategies from the given set, but for any pair of points
there exists a classification of these points which can be made with no strategy
from the given set. In this case the capacity of the set of strategies Q is equal
to two. ▲
Example 4.10 Capacity of the richer set of strategies. Let us extend the set
Q and illustrate how the capacity of the set of strategies is defined in this richer
case. Let Q be the set of strategies each of which is determined by a pair
of numbers a, θ. The observation x is assigned to the first class if ax < θ, and
to the second class if ax ≥ θ. Let x_1, x_2 be two distinct points on the straight
line X such that x_1 ≠ x_2. There are 2² possible decompositions of this pair of
points into two classes and each of them can be implemented with some strategy
from the set Q. Let us analyse any triplet of distinct points x_1, x_2, x_3 and let
us assume that x_1 < x_2 < x_3. There are 2³ decompositions of this triplet into
two classes but not all of them are implementable by means of the strategies
from the set Q. No strategy from Q can assign x_1 and x_3 to the first class and
x_2 to the second class. The given set Q is such that some pair of points
x_1, x_2 can be decomposed in an arbitrary way into two classes, whereas no triplet
of points can be decomposed in an arbitrary manner into two classes.
Such a set of strategies has the capacity 3. ▲
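Both capacity statements can be verified mechanically. For the richer class of Example 4.10 (class 1 iff ax < θ), a labelling of points is achievable exactly when, after sorting the points, it is an initial or a final segment of 1s; the sketch below (function names ours) checks shattering on this basis.

```python
from itertools import product

def achievable(points, labels):
    # For a > 0 the strategy "class 1 iff a*x < theta" labels an initial
    # segment of the sorted points with 1; for a < 0, a final segment.
    # Hence a labelling is achievable iff it is monotone after sorting.
    order = sorted(range(len(points)), key=lambda i: points[i])
    s = [labels[i] for i in order]
    return s == sorted(s) or s == sorted(s, reverse=True)

def shattered(points):
    """True iff every labelling of `points` into classes {1, 2} is achievable."""
    return all(achievable(points, list(lab))
               for lab in product([1, 2], repeat=len(points)))

print(shattered([0.0, 1.0]))       # every pair of distinct points is shattered ...
print(shattered([0.0, 1.0, 2.0]))  # ... but no triplet is, hence capacity 3
```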
Having presented two simple particular cases, we can proceed to the general
definition of the capacity of the set Q of strategies q : X → {1, 2}. Let x_1, x_2, ..., x_l
be a sequence of observations, c_l : {1, 2, ..., l} → {1, 2} be a decomposition
(classification) of this sequence into two classes, and C_l be the set of all possible
decompositions of the form {1, 2, ..., l} → {1, 2}, which is a set consisting of 2^l
decompositions.
The number r defines the capacity of the set Q of strategies of the form
q : X → {1, 2} iff
1. There exists a sequence x_1, x_2, ..., x_{r−1} of the length r − 1 such that for any
classification c_{r−1} ∈ C_{r−1} a strategy q ∈ Q exists such that q(x_i) = c_{r−1}(i),
i = 1, 2, ..., r − 1;
2. For any sequence x_1, x_2, ..., x_r of the length r there exists a classification
c_r ∈ C_r such that no strategy q ∈ Q satisfies q(x_i) = c_r(i), i = 1, 2, ..., r.
Let
m_1(Q), m_2(Q), ..., m_{r−1}(Q), m_r(Q), ..., m_l(Q), ...   (4.21)
be the growth function for a set Q. In this sequence m_1(Q) is not larger than
2¹, m_2(Q) is not larger than 2², and in the general case the l-th element m_l(Q)
is not larger than 2^l. If an element, say the l-th element m_l(Q), has the value 2^l
then the preceding (l−1)-th element is 2^{l−1} too. The reason is that if some
sequence of the length l can be decomposed in all possible manners, then the same
naturally holds for any of its subsequences of the length l − 1. It follows from
the abovesaid that the sequence (4.21) consists of two contiguous parts. The
initial part, whose length can even be zero, is composed of the elements
which have the value 2^l, where l is the ordinal number of the element in the
sequence (4.21). The elements in the second part of the sequence are less
than 2^l.
The capacity of the set Q is the ordinal number at which the second part
starts, i.e., the minimal l for which m_l(Q) < 2^l holds.
It is immediately obvious from the definition that if r is the capacity of the
set of strategies Q then m_l(Q) = 2^l holds for any l < r. Much less expected is
that the values m_l(Q) for l ≥ r are also influenced by the capacity and cannot
assume arbitrary values. This follows from the next theorem.
Theorem 4.4 Upper limit of the growth function. If r is the capacity of the
set Q then for all lengths of sequences l ≥ r

m_l(Q) ≤ 1.5 · l^{r−1} / (r − 1)!   (4.22)

holds.
Proof. We refer the interested reader to [Vapnik and Chervonenkis, 1974]. •
By Theorem 4.4, the inequality (4.20) can be rewritten in the form

η < 4.5 · (2l)^{r−1} / (r − 1)! · e^{−ε²(l−1)/4} ,   (4.23)

which expresses the explicit relation between all three parameters describing
learning: the accuracy ε, the reliability η and the sufficient length l of the
training multi-set. The set Q is represented in the formula (4.23) by a single
parameter only, its capacity r.
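The interplay of the capacity r, the accuracy ε and the length l can be explored numerically. The sketch below (function names ours) evaluates the right-hand side 4.5 · (2l)^{r−1}/(r−1)! · e^{−ε²(l−1)/4} of the bound and scans for a sufficient length by doubling.

```python
import math

def reliability_bound(r, eps, l):
    """Right-hand side of (4.23): an upper bound on the probability eta
    that learning with capacity r, accuracy eps and length l fails."""
    return (4.5 * (2 * l) ** (r - 1) / math.factorial(r - 1)
            * math.exp(-eps ** 2 * (l - 1) / 4))

def sufficient_length(r, eps, eta):
    """A length l (not necessarily minimal) that pushes the bound below
    eta; the exponential factor always overtakes the polynomial one."""
    l = 1
    while reliability_bound(r, eps, l) >= eta:
        l *= 2
    return l

print(sufficient_length(3, 0.1, 0.05))
```

Even for the modest capacity r = 3, an accuracy of ε = 0.1 with reliability η = 5 % demands a training multi-set of more than ten thousand elements, which illustrates how rough this universal bound is.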
Let us summarise the main supporting points that lead to the sought result.
1. For the analysis of reliability of learning in pattern recognition, the knowl-
edge how the probability
is satisfied in the real task and not merely in its artificial substitute. The problem
of the relation between theoretical models and reality is difficult in general
because it cannot be solved in the framework of any theoretical construction.
But particularly difficult are those problems where the relation between
statistics and a real problem is sought.
This vexed nature, even in an exaggerated form, is expressed by the following
tale, which is known in the pattern recognition folklore.
Example 4.11 The customer is a geologist. [Zagorujko, 1999] Imagine a
geologist being a customer and coming to a supplier demanding a solution of a
pattern recognition task. It is to be found out, by measuring physical properties
of a piece of rock, whether it contains iron or whether it is dead. The customer
can take the responsibility that the decision strategy solving this task is
implementable with the help of linear functions of the chosen measurable physical
properties of the rock. The supplier is required to find out professionally the
parameters of the appropriate linear decision function. It appears that this is a
situation ideally suited to pattern recognition based on learning.
When the supplier declares that he needs a training multi-set to fulfil the job,
it does not embarrass the customer. The customer is already prepared for this
situation and takes two pieces of rock out of his rucksack. He is sure that one of
them contains the iron and the other is dead. Such a training multi-set appears
too short to the supplier who calculates, very quickly applying known
formulae, that he needs at least 200 samples of the rock containing iron and
at least 200 dead pieces of rock to assure quite reliable and accurate learning.
Nor does this demand embarrass the customer: he takes a geological hammer
out of his rucksack and crushes each piece of rock into 200 pieces. The supplier
clearly understands that he has obtained something quite different from what he
needs, but he is not able to express in an understandable way the recommendation
which the customer should follow when he prepares the training multi-set. ▲
The first serious objection to the practical applicability of recommendations
that follow from the statistical learning theory is this: the recommendations follow
from assumptions that cannot be constructively proved. In other words, in
a recommendation of the form 'if A is valid then B is valid', the statement
A is formulated in a way about which it is not possible to say whether it is
satisfied. But this is not the only imperfection. Let us have a look, and
we will be extremely critical again, at whether the statement B can be
constructively proved.
Let us turn our attention to the crucial difference between the two following
statements. The first is 'the probability of the wrong decision is not larger than
ε'. The second statement reads 'the probability of the fact that the probability
of the wrong decision will be larger than ε is not larger than η'. The first
statement characterises a specific strategy. Each specific strategy, including
the one that will be obtained as a result of learning, can be analysed and it can
be found out whether the first statement is true.
The second statement does not characterise a specific strategy but a popula-
tion of strategies. We do not even consider that it can be quite difficult to find
exclude such strategies from the a priori possible strategies that are in contra-
diction to it. As a result of this narrowing, a narrower set of possible strategies is
obtained. In the general case it is still a set containing more than a single strat-
egy. In such a case the demand to find such a strategy about which it is known
only that it belongs to a certain set is similar to the abovementioned nonsense.
In order not to come across such nonsense, it is necessary to explicitly say
goodbye to the idea that the result of learning is a strategy. The strategy
q* : X → K cannot be considered as the goal of learning because q* cannot be
determined uniquely. The goal can only be to find out what result is provided
by the correct but unknown strategy q*(x) for some given observation x which is
to be recognised. The construction that we call taught in recognition¹ is based
on this principal idea: even in spite of the ambiguity of the strategy q*, its
value for some observations x ∈ X can be determined uniquely. It is natural
that the unambiguity cannot be reached for all observations x ∈ X. In the case
in which such an ambiguous observation is recognised, the learning algorithm
does not provide any answer and merely says that learning was insufficient to
assess the observation correctly. We will show more precisely how such a taught
in recognition can work.
Let X be a set of observations and q* : X → D be a strategy that will be
called the correct strategy. The strategy q* is unknown but the set of strategies
Q, to which q* belongs, is known.
Let us illustrate the described construction by an example.
Example 4.12 Decomposition of a plane by a straight line. Let X be
a two-dimensional space (a plane) and Q be the set of strategies, each of which
separates the plane X into two parts by means of a straight line. ▲
Assume that the information obtained from the teacher has the form of the
sequence

X D = ( (x_1, d_1), (x_2, d_2), . . . , (x_l, d_l) ) ,   (4.26)

where d_i = q*(x_i) is the decision, i = 1, 2, ..., l. The sequence (4.26) differs
considerably from the training multi-set T on which statistical learning is based,
and which has the form
T = ( (x_1, k_1), (x_2, k_2), . . . , (x_l, k_l) ) ,   (4.27)
where k_i is the state in which the object was when the observation x_i was
observed.
Special conditions are necessary for obtaining the sequence (training multi-
set) T under which the state of the object becomes directly observable. Some-
times it is not possible to provide such conditions. We will not insist that one
of these two approaches is always preferred to the other. But sometimes, in
the case of image recognition, it is much easier to obtain the sequence (training
set) X D than the sequence T.
¹ V.H. The term 'taught in recognition' was selected for the translation from Czech into
English even though 'supervised recognition' seems to match the Czech or Russian equivalent
better. The latter term is likely to be confused with 'supervised learning'.
The information which the taught in classifier has about the correct strategy
q* can be expressed by the relation

q* ∈ Q ,   q*(x_i) = d_i ,   i = 1, 2, . . . , l .   (4.28)

In the previous relation, the set of strategies Q, the observations x_i and the
decisions d_i, i = 1, ..., l, are considered as known. The strategy q* is unknown.
It has to be determined in the taught in recognition task, for each pair x ∈ X,
d ∈ D, whether it follows from Equation (4.28) that q*(x) = d. Let us denote
by Q(X D) the set of strategies which satisfy Equation (4.28). It is to be
determined in the task whether the set Q(X D) is non-empty, and in addition,
for a chosen x ∈ X, to verify the statement

q′(x) = q″(x) for all q′, q″ ∈ Q(X D) .   (4.29)

The previous formula says that all the strategies which satisfy Equation (4.28)
(consequently, the correct strategy q* too) assign x to the same class d. If this
statement is correct then the correct value q*(x) can be determined unambiguously.
If the statement (4.29) is not satisfied then the only possible answer is not
known, because the information obtained from the teacher does not suffice for
a justified conclusion about the value q*(x).
of those points x ∈ X for which the statement (4.29) holds. For each x ∈ X_1
the equality q*(x) = 1 holds, and for each x ∈ X_2 the equality q*(x) = 2 holds. ▲
To check the validity of the statement (4.29) it is not necessary to represent in
some particular manner the set Q(X D) of all possible strategies which satisfy
Equation (4.28). In fact, it can be seen that the original Equation (4.28) is
already the most suitable form for a direct verification of the statement (4.29). We
will show how such a verification can be done.
For each decision d ∈ D and the observation x ∈ X which has to be recognised,
we will write a relation similar to Equation (4.28),

q* ∈ Q ,
q*(x_i) = d_i ,   i = 1, 2, . . . , l ,      d ∈ D .   (4.30)
q*(x) = d ,

Equation (4.30) does not express a single relation but an ensemble consisting
of |D| relations, each of which corresponds to one decision d ∈ D. It
has to be checked for each relation from the ensemble (4.30), i.e., for each value
d, whether it is contradictory. Thus the statement (4.29) is equivalent to the
statement that there is just a single relation in the ensemble (4.30) which is
not contradictory.
We will show at the end of our example how the equivalence of these two
statements can be used to recognise a specific observation.
Example 4.14 Recognition of the single observation based on the training
set. Two auxiliary sets X D_1 = X D ∪ {(x, 1)} and X D_2 = X D ∪ {(x, 2)}
have to be created based on the training set X D. It is to be checked for both
sets whether there is a straight line which classifies them correctly. The result
of the analysis can be just one of the four following possibilities.
1. If the set X D_1 can be correctly classified with the help of a straight line
and the set X D_2 cannot be classified in a similar way, this means that the
answer q*(x) = 1 is sure to be correct.
2. If the set X D_1 cannot be classified with the help of a straight line and it
can be done for the set X D_2, then the answer q*(x) = 2 is sure to be correct.
3. If each of the sets can be classified with the help of a straight line then it
means that the device has not made enough progress in its teaching to be able
to recognise correctly the submitted observation. That is why it must give
the answer not known. It is important that the taught in classifier detects
its inadequacy by itself in the learning process. In this case it can address
its teacher with the question of how the given observation is to be classified
correctly. When it receives the answer it can incorporate it into the
training set and use it later to recognise further observations better.
4. If none of the sets can be classified with the help of a straight line then
it means that the initial information which the device had learned from the
teacher was contradictory. The taught in classifier with a good sense of
humour would be able to give the answer 'you do not know' in this case, and
present to the teacher the smallest possible part of the training set provided
earlier by the teacher that contains the discovered contradiction. In this case
the teacher could modify the answers which he had earlier considered to be
correct, or change the set Q. As is seen, it is not easy to distinguish in this
case who actually learns from whom. ▲
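The four cases of Example 4.14 can be implemented directly: consistency of each auxiliary set with a separating straight line is a linear feasibility problem. The sketch below solves it with `scipy.optimize.linprog`; the margin-1 formulation and the sample data are our illustrative choices, not part of the lecture.

```python
import numpy as np
from scipy.optimize import linprog

def separable(points1, points2):
    """Feasibility LP: is there (w, b) with w.x + b <= -1 on points1
    and w.x + b >= +1 on points2?  (Equivalent to strict separability.)"""
    A, rhs = [], []
    for x in points1:
        A.append([x[0], x[1], 1.0]); rhs.append(-1.0)
    for x in points2:
        A.append([-x[0], -x[1], -1.0]); rhs.append(-1.0)
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=np.array(A), b_ub=np.array(rhs),
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # status 0 means a feasible solution was found

def classify(train, x):
    """Taught in recognition: answer 1, 2, 'not known' or 'contradiction'."""
    c1 = [p for p, d in train if d == 1]
    c2 = [p for p, d in train if d == 2]
    ok1 = separable(c1 + [x], c2)   # is XD1 consistent with a straight line?
    ok2 = separable(c1, c2 + [x])   # is XD2 consistent with a straight line?
    if ok1 and not ok2: return 1
    if ok2 and not ok1: return 2
    return "not known" if ok1 and ok2 else "contradiction"

train = [((0.0, 0.0), 1), ((0.0, 1.0), 1), ((3.0, 0.0), 2), ((3.0, 1.0), 2)]
print(classify(train, (-1.0, 0.5)))  # clearly on the class-1 side
print(classify(train, (4.0, 0.5)))   # clearly on the class-2 side
print(classify(train, (1.5, 0.5)))   # between the classes: not known
```

Note that no learning phase precedes recognition: every query re-examines the training set directly, exactly as described above.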
In the next lecture, which will be devoted to the linear discriminant functions,
we will see that the contradiction in sequences can be discovered constructively.
In the general case the analysis of the contradictions in Equation (4.30)
provides more useful information, from a pragmatic point of view, than a mere
statement about the validity of Equation (4.29). Let D(x) ⊂ D be the set of
those decisions d ∈ D whose incorporation into Equation (4.30) does not lead
to a contradiction. Even if |D(x)| ≠ 1 but D(x) ≠ ∅ and D(x) ≠ D, i.e.,
when it is not possible to determine the correct answer uniquely, those decisions
d ∉ D(x) can be determined which cannot be correct for the observation
x. The practically useful result of recognition may thus be not only the single
correct answer q*(x) but the whole set D(x) of decisions which do not contradict
the observations.
In the proposed taught in classifier hardly anything has remained from what
has earlier been considered as learning in pattern recognition. What has
remained from the earlier case is only that the result of recognition is influenced
by the training set. The phase of actual learning disappeared entirely in the
taught in classifier created. However, the taught in classifier has not lost the
features of intelligent behaviour; moreover, it seems to have even improved them.
Indeed, an approach in which teaching precedes recognition, and later the
recognition proceeds without any further teaching, is very far from the mutually
fruitful relations between the teacher and the student. Such an approach much more
resembles a drill of correct behaviour than education. As we have seen, hardly
solvable problems occur as a result of separating learning and recognition in
time. Indeed, it is difficult, or even impossible, to convey during the teaching
phase everything that will have to suffice in any future situation in which
learning is no longer possible.
The arrangement is entirely different with the suggested procedure. The
taught in classifier is ready to recognise at each stage of its activity. The
knowledge obtained so far from the teacher suffices to solve some tasks and does not
suffice for others. The first group of tasks is solved by the taught in classifier
correctly, no doubt. For the other group of tasks the classifier approaches the
teacher and enlarges its knowledge. At each stage of such behaviour, i.e., in
the taught in recognition, the taught in classifier has the possibility of
processing the information obtained from the teacher. It can detect contradictions
in it or, on the other hand, redundancy. Redundancy means that some elements
in the training set follow from other elements. Finally, the taught in classifier
need not wait until an observation occurs which it is not yet able to
recognise correctly. It can create such an observation artificially and approach the
teacher with it. By doing so it can influence in an active way the content of
the training set, i.e., the knowledge received from the teacher.
It is only now that we can see how much is missing in the current statistical
theory of learning for it to be called a theory of intelligent behaviour.
With a certain degree of exaggeration it can be said that an ingenious analysis
was applied to algorithms resembling drill more than intellectual behaviour.
4.6 Discussion
I would like to ask several questions concerning learning. The first of them,
I am afraid, is not very concrete. I feel deference toward the subtle mathematical
considerations with the help of which the main asymptotic properties of learning
algorithms were formulated. I can only presume how back-breakingly difficult
the proofs of the theorems presented in the lecture must be. It may be natural
that after an enthusiastic declaration a sentence starting with 'but' usually
follows. I became rather firmly convinced that the scientific significance of
the whole theory of learning discussed lies hidden within the theory itself, and that
there are considerably fewer recommendations resulting from the theory and
addressing the world outside it. When reading the part of the lecture about
the relation between the length of the training multi-set and the accuracy and
reliability of recognition, I spontaneously recalled various oriental
souvenirs. For instance, a bottle with a narrow neck and a ship built inside
it, or several Chinese balls carved from a single piece of ivory and hidden one
inside another. When I see such objects I admire the craftsmen's mastery, and
mainly the patience needed to create them. At the same time, I cannot get rid
of an unpleasant impression from the unanswered question: for what, in fact,
could the sailing boat inside a bottle or the Chinese balls be useful? I would not
like to formulate my questions more precisely because they would not sound
polite enough.
understand why it did not work. Well, it seemed so obvious that a perpetuum
mobile should work, and moreover, it would be so excellent if it worked.
The scientific and practical value of the Chervonenkis-Vapnik theory is that it
clearly prohibits some trends in designing learning algorithms in pattern recog-
nition. Let us have a look at the fundamental Theorem 4.2 about the necessary
and sufficient condition for the recognition device to be able to learn. The
condition says that the relative entropy H(l)/l has to converge to zero as the
length l of the training multi-set increases to infinity. Even though the entropy
H(l) can almost never be calculated, the theorem has, in spite of its lack of
constructiveness, a strong prohibiting power. The entropy H(l) can easily be
calculated for a universal recognition device which can implement an arbitrary
decomposition of the space of observations. In this case the entropy H(l) is
equal to the length l of the learning sequence, and the relative entropy H(l)/l
is thus equal to one for an arbitrary l. The necessary condition for the device
to be able to learn is therefore not satisfied. A strict restriction consequently
follows from the statistical theory of learning: learning in a universal pattern
recognition device is impossible.
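The prohibiting power of the condition H(l)/l → 0 can be felt numerically. The sketch below is our own illustration (the names `growth_linear` and `rel_entropy` are ours): it takes H(l) as the logarithm of the number of dichotomies realisable on l points. For the universal device this number is 2^l, so H(l)/l = 1 for every l, while for affine hyperplanes in the plane Cover's counting function for points in general position makes the ratio sink towards zero.

```python
from math import comb, log2

def growth_linear(l, d):
    """Cover's count of dichotomies of l points in general position in R^d
    realisable by an affine hyperplane: 2 * sum_{i <= d} C(l-1, i)."""
    return 2 * sum(comb(l - 1, i) for i in range(d + 1))

def rel_entropy(l, d):
    """H(l)/l when H(l) is taken as log2 of the number of dichotomies."""
    return log2(growth_linear(l, d)) / l

# universal device: 2**l dichotomies, hence H(l)/l == log2(2**l)/l == 1
assert all(log2(2 ** l) / l == 1.0 for l in (1, 10, 50))

# lines in the plane (d = 2): 3 points are shattered, 4 are not
assert growth_linear(3, 2) == 2 ** 3
assert growth_linear(4, 2) == 14

# the relative entropy H(l)/l tends to zero as l grows
assert rel_entropy(10, 2) > rel_entropy(100, 2) > rel_entropy(1000, 2)
```

So a learning device restricted to hyperplanes satisfies the necessary condition, while the universal device never does.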
Knowledge of this restriction can save you a lot of time and stress. You have
surely heard lectures, or read articles, in which it is proved in the first
paragraph that the proposed recognition algorithm is universal, and in the
second paragraph a learning procedure for the same algorithm is described.
It is usually quite difficult to find a counterexample proving that the
algorithm is not universal. It is even more difficult to prove that there are
classifications which cannot be achieved by learning. Without any complicated
analysis, you can be quite sure that the author has made a blunder in at least
one of the two paragraphs mentioned. You can also ask the author to explain
how his results agree with the Chervonenkis-Vapnik theorem. If he does not
know anything about the theorem you can stop the discussion without any
hesitation, because his professional level is comparable to that of mechanics
300 years ago. He does not yet know what everyone should know these days.
Figure 4.2 Four possible decompositions of three points in the plane with the help of straight lines.
Figure 4.3 The convex hull of four points can be a triangle.
with the help of a straight line. The second case resembles the first one but
X is a three-dimensional linear space in which each strategy decomposes the
space X into two parts with the help of a plane.
I tried to get used to the concept of the capacity of a set of strategies in the
two-dimensional example. I found that the capacity of the set is CAP = 4 on
the basis of the following purely geometrical considerations. There is a
triplet of points which can be decomposed by a straight line into two classes
in an arbitrary way. For example, it can be the triplet of points x1, x2 and
x3 shown in Fig. 4.2, where all four possible decompositions using four
straight lines are illustrated. It is not difficult to see that no quadruplet
of points x1, x2, x3, x4 can be decomposed by a straight line into two classes
in an arbitrary manner. If three of the four points lie on a single straight
line then not all decompositions are realisable: the point in the middle cannot
be separated by any straight line from the other two points. If no triplet of
points is collinear then the convex hull of the quadruplet x1, x2, x3, x4
constitutes one of the two following configurations: either a triangle
(Fig. 4.3) or a quadrilateral (Fig. 4.4).
In the first case it is not possible to separate the point which is located
inside the triangle from the other three vertices of the triangle. In the
second case it is not possible to separate one pair of opposite vertices of
the quadrilateral by means of a straight line. That is why the capacity of
the class mentioned is equal to four.

Figure 4.4 The convex hull of four points can be a quadrilateral.

Purely geometrical considerations do not suffice in the three-dimensional
case. I have analysed the case analytically and the result suits not only the
three-dimensional case but also the general k-dimensional case.
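The two-dimensional reasoning can be confirmed by brute force. The sketch below is our own illustration (`separable` and `shattered` are hypothetical helper names): it tests strict separability of two small planar point sets by a straight line. If a separating line exists, the normal of the maximum-margin line is either a difference of points from the two sets or perpendicular to a segment between two of the points, so trying those finitely many candidate normals is an exact test.

```python
from itertools import product

def separable(A, B):
    """Exact test whether two small planar point sets can be strictly
    separated by a straight line, via the candidate maximum-margin normals."""
    if not A or not B:
        return True
    pts = A + B
    # point-to-point differences between the two sets ...
    cands = [(bx - ax, by - ay) for ax, ay in A for bx, by in B]
    # ... and perpendiculars to every segment between two of the points
    cands += [(py - qy, qx - px)
              for i, (px, py) in enumerate(pts) for qx, qy in pts[i + 1:]]
    for nx, ny in cands:
        if nx == 0 and ny == 0:
            continue
        pa = [nx * x + ny * y for x, y in A]
        pb = [nx * x + ny * y for x, y in B]
        if max(pa) < min(pb) or max(pb) < min(pa):
            return True
    return False

def shattered(points):
    """True iff every dichotomy of the points is realisable by a line."""
    return all(separable([p for p, s in zip(points, sig) if s],
                         [p for p, s in zip(points, sig) if not s])
               for sig in product((True, False), repeat=len(points)))
```

Three points in general position pass (all 8 dichotomies), while the four corners of a square fail on the "diagonal" dichotomy and collinear triplets fail as well, in agreement with CAP = 4 for straight lines.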
Let X be a k-dimensional space and x_i, i = 0, 1, ..., k+1, be an ensemble
of k+2 points in this space. Certainly the vectors x_i − x_0, i = 1, ..., k+1,
are linearly dependent. This means that coefficients a_i, i = 1, ..., k+1,
exist, not all of them equal to zero, which satisfy the equation

  Σ_{i=1}^{k+1} a_i (x_i − x_0) = 0 .

If a_0 is defined as −Σ_{i=1}^{k+1} a_i, the equation can be written as
Σ_{i=0}^{k+1} a_i x_i = 0. Let I+ denote the set of indices i with a_i > 0 and
I− the set of the remaining indices. As Σ_{i=0}^{k+1} a_i = 0 holds, the
equation

  Σ_{i∈I−} a_i = − Σ_{i∈I+} a_i ≠ 0

holds too. Let new variables β_i, i = 0, 1, ..., k+1, be introduced such that

  β_i = −a_i / Σ_{j∈I+} a_j , if i ∈ I− ,
  β_i =  a_i / Σ_{j∈I+} a_j , if i ∈ I+ .

Dividing the equation Σ_{i=0}^{k+1} a_i x_i = 0 by Σ_{j∈I+} a_j then yields

  Σ_{i∈I−} β_i x_i = Σ_{i∈I+} β_i x_i ,

where

  Σ_{i∈I−} β_i = Σ_{i∈I+} β_i = 1 .   (4.32)
I can now assert that there is no hyperplane for which the ensemble of points
x_i, i ∈ I−, lies on one side of it and the ensemble of points x_i, i ∈ I+,
lies on the other. Indeed, if such a hyperplane existed then a vector α ∈ X
and a number θ would also exist satisfying the system of inequalities

  (α, x_i) ≥ θ , i ∈ I+ ,
  (α, x_i) < θ , i ∈ I− ,   (4.33)

where (α, x_i) denotes the scalar product of the vectors α and x_i. Because
the sum of all the coefficients β_i, i ∈ I+, equals 1 and all of them are
non-negative, it follows from the first group of inequalities in the system
(4.33) that

  Σ_{i∈I+} β_i (α, x_i) ≥ θ .   (4.34)

In the same way the second group of inequalities yields
Σ_{i∈I−} β_i (α, x_i) < θ. Since Σ_{i∈I−} β_i x_i = Σ_{i∈I+} β_i x_i, the
inequality

  Σ_{i∈I+} β_i (α, x_i) < θ

follows, which contradicts the inequality (4.34). I have proved that in the
k-dimensional space no ensemble of k+2 points can be decomposed in an
arbitrary manner into two classes by means of a hyperplane.
It is not difficult to show that an ensemble of k+1 points exists that can be
decomposed into two classes in an arbitrary manner using a hyperplane. For
instance, it can be the following ensemble: the point x_0 has all coordinates
equal to zero, and the point x_i, i = 1, ..., k, has all coordinates equal to
zero except the i-th coordinate, which is non-zero. I can thus consider the
following statement as proved.

Let X be a k-dimensional space and Q a set of strategies of the form
X → {1, 2}, each of which decomposes the space X into two classes using a
hyperplane. The capacity of the set Q is k + 2.
Indeed, the capacities of sets of strategies are not as terrible as they might
look at first glance.

We are glad to hear that. Have a look at the following set of strategies in a
two-dimensional space to become more certain about capacities: each strategy
is defined by a circle which decomposes the plane into two parts, the inner
and the outer part of the circle. We will prompt you a little: apply the
straightening of the feature space.
With your hint I managed the circle case quite quickly. First, there is a
quadruplet of points in the plane which can be decomposed by means of circles
into two classes in an arbitrary manner. For instance, it can be the quadruplet
illustrated in Fig. 4.5, where all possible decompositions using eight circles
are also shown. Consequently the capacity of the given set of strategies is
not less than 5.

Figure 4.5 Eight possible decompositions of four points in the plane by means of eight circles.

Second, I will prove that no ensemble of five points can be decomposed into
two arbitrary classes with the help of circles. Thus the capacity of the
introduced set of strategies is equal to 5.
Let x and y be the coordinates of an arbitrary point in the plane. The given
set of strategies contains strategies q of the form

  q(x, y) = 1 , if (x − x0)² + (y − y0)² ≤ r² ,
            2 , if (x − x0)² + (y − y0)² > r² ,   (4.35)

or

  q(x, y) = 1 , if (x − x0)² + (y − y0)² > r² ,
            2 , if (x − x0)² + (y − y0)² ≤ r² .   (4.36)

Each strategy of the form (4.35) or (4.36) can be expressed in the form

  q(x, y, z) = 1 , if α x + β y + γ z > θ ,
               2 , if α x + β y + γ z < θ ,   (4.37)

where z = x² + y². The converse holds too, i.e., any strategy of the form
(4.37) on the set of points satisfying the constraint z = x² + y² can be
expressed either in the form (4.35) or in the form (4.36). The direct
statement suffices for me, since from it it follows that the capacity of the
class (4.35), (4.36) cannot be greater than the capacity of the class (4.37).
The class (4.37) is the class of linear decision functions in the
three-dimensional space. I proved before that the capacity of the class (4.37)
is 5. In this way I have proved that the capacity of the class (4.35), (4.36)
is equal to 5.
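The straightening of the feature space can be checked numerically. In the sketch below the circle parameters x0, y0, r are arbitrary illustrative values: expanding (x − x0)² + (y − y0)² ≤ r² gives a linear rule of the form (4.37) with α = 2·x0, β = 2·y0, γ = −1 and θ = x0² + y0² − r² on the surface z = x² + y².

```python
# One circle strategy of the form (4.35) and its linear counterpart (4.37)
# on the lifted coordinates (x, y, z = x**2 + y**2). Parameters are
# arbitrary illustrative values, not taken from the lecture.
x0, y0, r = 1.0, -0.5, 2.0

def q_circle(x, y):
    return 1 if (x - x0) ** 2 + (y - y0) ** 2 <= r * r else 2

alpha, beta, gamma = 2 * x0, 2 * y0, -1.0
theta = x0 * x0 + y0 * y0 - r * r

def q_linear(x, y):
    z = x * x + y * y
    return 1 if alpha * x + beta * y + gamma * z >= theta else 2

# both strategies agree on a grid of test points
grid = [(i * 0.5, j * 0.5) for i in range(-10, 11) for j in range(-10, 11)]
assert all(q_circle(x, y) == q_linear(x, y) for x, y in grid)
```

Because the equivalence is an algebraic identity, the circle strategies inherit the capacity bound of linear decision functions in three dimensions.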
Now I can see that the capacity can also be determined exactly for the set of
strategies given by quadratic discriminant functions in the general form. The
strategies in question are those which are optimal in the case in which the
observation x under the condition of each fixed state k is an n-dimensional
Gaussian random variable of a general form. A strategy of this kind has the
form

  q(x) = 1 , if f(x) ≥ θ ,
  q(x) = 2 , if f(x) < θ ,   (4.38)

for a threshold value θ and for a quadratic function f of the form

  f(x) = Σ_{i=1}^n a_i x_i² + Σ_{i=1}^n Σ_{j=i+1}^n β_{ij} x_i x_j + Σ_{i=1}^n γ_i x_i .   (4.39)

The symbol x_i in Equation (4.39) denotes the i-th coordinate of the point x.
I will show that a set X* exists which consists of 2n + n(n−1)/2 + 1 points
and can be decomposed into two classes in an arbitrary manner. From this it
will immediately follow that for the capacity CAP of the studied set of
strategies

  CAP > 2n + n(n−1)/2 + 1   (4.40)
holds. The ensemble X* of points is defined in the following way. It will
consist of the sets X1− and X1+ (each of them consisting of n points), of the
set X2 (consisting of n(n−1)/2 points), and of the set X0 containing a single
point. The points in the set X1− will be numbered by the index
i = 1, 2, ..., n. The point with the index i will be denoted by (x^i)− and
defined as the point all of whose coordinates equal zero except the i-th one,
whose value is −1. In a similar way the i-th point in the set X1+ will be
denoted by (x^i)+ and defined so that all its coordinates equal zero except
the i-th one, which assumes the value +1. The points in the set X2 will be
numbered by two indices i = 1, 2, ..., n and j = i+1, i+2, ..., n. The
(ij)-th point will be denoted by x^{ij} and defined as the point all of whose
coordinates equal zero except the i-th and the j-th, whose values are 1. The
single point of which the set X0 consists will be denoted by x^0 and defined
as the origin of the coordinate system.

Let X*1 and X*2 be an arbitrary decomposition of the ensemble
X* = X1− ∪ X1+ ∪ X2 ∪ X0 into two classes. Let us prove that for the given
decomposition there exist coefficients a_i, β_{ij}, γ_i and a threshold value
θ in the strategy (4.38), (4.39) such that

  f(x) = 1 , x ∈ X*1 ,
  f(x) = 0 , x ∈ X*2 ,   (4.42)

and thus the constraints (4.41) are satisfied too. We introduce auxiliary
notation, namely the numbers (k^i)−, (k^i)+, k^{ij}, i = 1, 2, ..., n,
j = i+1, i+2, ..., n, such that

  (k^i)− = 0 , if (x^i)− ∈ X*2 ,   (k^i)− = 1 , if (x^i)− ∈ X*1 ,
  (k^i)+ = 0 , if (x^i)+ ∈ X*2 ,   (k^i)+ = 1 , if (x^i)+ ∈ X*1 ,
  k^{ij} = 0 , if x^{ij} ∈ X*2 ,   k^{ij} = 1 , if x^{ij} ∈ X*1 .
If this notation is used, the system of equations (4.42) assumes the form

  f((x^i)−) = (k^i)− ,
  f((x^i)+) = (k^i)+ ,   i = 1, 2, ..., n ,   (4.43)
  f(x^{ij}) = k^{ij} ,   j = i+1, i+2, ..., n ,

or, equivalently,

  a_i − γ_i = (k^i)− ,
  a_i + γ_i = (k^i)+ ,   i = 1, 2, ..., n ,   (4.44)
  a_i + a_j + γ_i + γ_j + β_{ij} = k^{ij} ,   j = i+1, i+2, ..., n .

The first two groups of equations determine a_i and γ_i, and the third group
then determines β_{ij}, so the system is solvable for an arbitrary
decomposition. From this, with regard to the inequality (4.40) proved earlier,
the equation

  CAP = 2n + n(n−1)/2 + 2

follows.
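The value 2n + n(n−1)/2 is simply the number of monomials x_i², x_i x_j (i < j) and x_i occurring in (4.39); the strategy (4.38) is a hyperplane in the space spanned by these monomials, so its capacity is that dimension plus two. A short sketch (our own illustration, with hypothetical helper names) counts them:

```python
def quadratic_features(n):
    """Monomials occurring in (4.39): squares, mixed products, linear terms."""
    squares = [(i, i) for i in range(1, n + 1)]
    mixed = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
    linear = [(i,) for i in range(1, n + 1)]
    return squares + mixed + linear

def capacity_quadratic(n):
    # the capacity of hyperplanes in an m-dimensional space is m + 2
    return len(quadratic_features(n)) + 2

assert len(quadratic_features(3)) == 2 * 3 + 3 * 2 // 2   # 9 monomials
assert capacity_quadratic(2) == 2 * 2 + 1 + 2             # n = 2: CAP = 7
```

The feature list is exactly the lifting used for the circles, applied once per monomial rather than only to x² + y².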
When you have so thoroughly investigated the model with Gaussian random
variables, could you, for completeness, do the same for the model with
conditional independence? Try to determine the capacity of the set of
strategies of the form

  x ∈ X1 , if Σ_{i=1}^n log( p_{x_i|1}(x_i) / p_{x_i|2}(x_i) ) ≥ θ ,
  x ∈ X2 , if Σ_{i=1}^n log( p_{x_i|1}(x_i) / p_{x_i|2}(x_i) ) < θ ,   (4.45)

which are optimal in the case in which the observation x under the condition
of a fixed state is a random variable x = (x_1, x_2, ..., x_n) with
independent components.
In the discussion after Lecture 3 I proved that any strategy of the form (4.45)
can be expressed as a linear discriminant function in a space of dimension

  Σ_{i=1}^n k(i) − n ,

where k(i) is the number of values of the variable x_i. From this result there
immediately follows

  CAP ≤ Σ_{i=1}^n k(i) − n + 2 ,   (4.46)

because I have just proved that the capacity of the set of hyperplanes in an
m-dimensional space is m + 2.
I will now show that the relation ≤ in the inequality (4.46) can be replaced
by an equality. I need to prove that an ensemble of Σ_{i=1}^n k(i) − n + 1
points in the space X exists which can be decomposed into two classes in an
arbitrary way by means of strategies of the form (4.45).

The ensemble sought may be constructed in the following way. Let me choose an
arbitrary point x^0 = (x_1^0, x_2^0, ..., x_n^0) and include it in the
ensemble. Then every point x' differing from x^0 in exactly one component is
included in the ensemble too. The number of points in the ensemble is then
Σ_{i=1}^n k(i) − n + 1.
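The construction can be written out directly. In the sketch below the value counts k(i) and the base point are arbitrary illustrative choices; the ensemble consists of the base point together with every point differing from it in exactly one component, which gives Σ_i k(i) − n + 1 points.

```python
def ensemble(k, x0):
    """Base point x0 plus all points differing from it in exactly one
    component; component i ranges over the values 0 .. k[i]-1."""
    pts = {tuple(x0)}
    for i, ki in enumerate(k):
        for v in range(ki):
            if v != x0[i]:
                p = list(x0)
                p[i] = v
                pts.add(tuple(p))
    return pts

k = [2, 3, 4]          # hypothetical numbers of values k(i)
x0 = [0, 0, 0]
pts = ensemble(k, x0)
assert len(pts) == sum(k) - len(k) + 1   # 9 - 3 + 1 = 7 points
```

Each non-base point contributes k(i) − 1 choices for its changed component, and summing these over i and adding the base point gives the claimed count.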
Do I have to prove that such a set of observations can be decomposed in an
arbitrary way?
It is not necessary. We think that it is quite clear. We thank you for your
cooperation; this will be enough for today.
April 1997.
4.7 Bibliographical notes
Three formulations of learning tasks have been introduced in this lecture. The
first formulation, with respect to maximum likelihood, is a direct transfer
of known statistical methods into pattern recognition [Nilsson, 1965]. Let us
mention only Gauss and Fisher [Fisher, 1936] from the statistical sources
related to maximum likelihood estimates. If someone is interested in the
matter we recommend the textbook by Waerden [Waerden, 1957]. A theoretical
analysis of the properties of learning algorithms in pattern recognition
according to the first formulation is represented by [Raudys and Pikelis, 1980].
The second, minimax formulation of learning with respect to a non-random
ensemble was suggested in [Schlesinger, 1989], inspired by practical tasks
[Schlesinger and Svjatogor, 1967]. A theoretical analysis of the approach will
be given in Lecture 8 of this monograph.
The third formulation seeks a strategy which correctly recognises the training
sequence [Rosenblatt, 1962; Ajzerman et al., 1970]. Many other publications
stem from these works. The third formulation was analysed by Chervonenkis and
Vapnik [Vapnik and Chervonenkis, 1974; Vapnik, 1995; Vapnik, 1998] and has
been developed into a deep theory. The first work mentioned forms the basis
of our explanation in this lecture.
Let us compare Raudys' theory analysing learning in the first formulation,
with respect to maximum likelihood [Raudys and Pikelis, 1980], with Vapnik's
conclusions concerning the third formulation. The first approach makes less
general assumptions and thus estimates that a shorter training sequence
suffices. The second approach is more general and thus more pessimistic in
its estimates.
Another interesting view of statistical learning theory is given in [Vidyasagar,
1996].
We have adapted Zagorujko's example with the geologist [Zagorujko, 1999]. This
book is of interest in its own right, as it gives insight into the research of
one Russian group strong in clustering and lists several practically solved
tasks.
Lecture 5
Linear discriminant function

  f(x) = Σ_{j∈J} α_j f_j(x)

to the finite capacity of the class of linear discriminant functions. In the
preceding lecture an explicit relation between the necessary length of the
training multi-set, the accuracy, and the reliability of learning was
mentioned for this case, expressed by a practically applicable formula.

All these advantages would naturally not be of great importance if procedures
for finding linear discriminant functions were not at our disposal. They are
the main topic of this lecture. We will see that different tasks concerning
linear discriminant functions, which at first glance do not seem similar to
each other, in fact act together (collaborate, one could say). We will see
that the properties of one type of task are helpful for solving other tasks
whose properties are not so evident.
are only their weights. If the state of the object k is to be found from the
mentioned incomplete a priori knowledge of the statistical model and the known
observation x, then the task is to be formulated as a task of statistical
decision making with non-random interventions (it has been described in a more
general manner in Subsection 2.2.5). If we used the results of that analysis
in our case, we would be seeking the strategy q: X → {1, 2} which minimises
the value (5.1), where ε(j, μ_j, σ_j, q) is the probability that the Gaussian
random vector x with mathematical expectation μ_j and covariance matrix σ_j
satisfies the relation q(x) = 1 for j ∈ J2, or the relation q(x) = 2 for
j ∈ J1.

In other words, the minimal value ε and the strategy q* are sought that
satisfy two conditions:
1. The probability of a wrong estimate of the state under the condition that
the actual state is 1 is not greater than ε. This holds independently of the
values of the mathematical expectation μ1 and the covariance matrix σ1,
provided only that (μ1, σ1) ∈ {(μ_j, σ_j), j ∈ J1}.
2. The probability of a wrong estimate of the state under the condition that
the actual state is 2 is likewise not greater than ε, independently of the
values of the mathematical expectation μ2 and the covariance matrix σ2,
provided only that (μ2, σ2) ∈ {(μ_j, σ_j), j ∈ J2}.

From the results presented in Subsection 2.2.5 it follows that the statistical
decision making task with non-random interventions reduces to searching for a
minimax solution in the weight space of the mixture components.
We are interested in the task (5.1) with an additional constraint on the
strategy q. We require the discriminant function to be linear, i.e., to be
given by a hyperplane (α, x) = θ and

  q(x) = 1 , if (α, x) > θ ,
         2 , if (α, x) < θ ,   (5.2)

for a certain vector α ∈ X and threshold value θ. Recall that (α, x) denotes
the scalar product of the vectors α and x. In the two-dimensional case
illustrated above in Fig. 5.1, the observation plane is to be divided into two
half-planes so that the first half-plane contains the majority of random
realisations from the first, second, and third probability distributions, and
the second half-plane those from the fourth and fifth probability
distributions. This division is represented in Fig. 5.1 by the separating
straight line q.

The task (5.1) with the requirement (5.2) is a generalisation of the known
task by Anderson and Bahadur [Anderson and Bahadur, 1962], who formulated and
solved it for the case |J1| = |J2| = 1. Our more general case will be called
the generalised Anderson task.
The formulation of the generalised Anderson task given above includes another
particular case worth attention, namely the case in which all covariance
matrices σ_j, j ∈ J1 ∪ J2, are unit matrices. This task is used even in the
case of pattern recognition algorithms which are determined by a training set.
It is referred to as the optimal separation of finite sets of points. Let X̄ be
a finite set of points x1, x2, ..., xn from the space X which is decomposed
into two subsets X̄1 and X̄2. A separating hyperplane is sought which leaves
the subset X̄1 in one half-space and the subset X̄2 in the other half-space,
and which, moreover, is as distant from both separated subsets as possible.
More precisely, a vector α and a threshold value θ are sought such that:
1. all x ∈ X̄1 satisfy the inequality
At the close of the lecture we will present results which cover all the tasks
studied here within a single framework. This will reveal less obvious
relations between the generalised Anderson task, the optimal separation of
sets of points, and the simplest separation of sets of points. Understanding
these relations, we can properly modify the perceptron and Kozinec algorithms
so that they become applicable even to solving the generalised Anderson task.
To make further analysis more convenient, we will present the task (5.6) in a
slightly different equivalent formulation. Let us introduce vectors μ'_j of
the form

  μ'_j =  μ_j , for j ∈ J1 ,
  μ'_j = −μ_j , for j ∈ J2 .

Figure 5.2 illustrates the transformation for the case in which J1 = {1, 2}
and J2 = {3, 4}. For any vector α the probability of the inequality
(α, x) ≤ 0 for the random Gaussian vector x with the mathematical expectation
μ_j and the covariance matrix σ_j is the same as the probability of the
inequality (α, x) ≥ 0 for the random Gaussian vector x with the mathematical
expectation −μ_j and the covariance matrix σ_j. Thus the generalised Anderson
task (5.6) can be expressed in the following equivalent formulation.
For the ensemble ((μ_j, σ_j), j ∈ J), a non-zero vector α has to be sought
which minimises the number max_j ε_j(α), i.e.,

  α = argmin_α max_j ε_j(α) ,   (5.7)

where ε_j(α) is the probability that the random Gaussian vector x with the
mathematical expectation μ_j and the covariance matrix σ_j will satisfy the
inequality (α, x) ≤ 0.

Figure 5.2 The straight line separating the ellipses 1, 2 from the ellipses 3, 4 is equivalent to the straight line leaving the ellipses 1, 2 and 3', 4' along one side.

For better illustration let us go back to geometrical considerations. In the
original task there were two sets cor-
5.3 Anderson tasks
Figure 5.3 Straight line passing through the origin, leaving the points μ1, μ2 and μ3 along one side.
Figure 5.4 Contact of one ellipse with the straight line.
where · denotes a matrix product, in our case the product of a matrix and a
vector.

The set of points defined by the preceding inequality will be denoted
E(r, μ, σ) and referred to as an ellipse of size r. The concept of an ellipse
will be used even in the multi-dimensional case, where from the geometrical
point of view the body would be called an ellipsoid.
Let us express the generalised Anderson task in the following equivalent form
(on the basis of common sense and without any proof, for the time being). For
the given ensemble of pairs ((μ_j, σ_j), j ∈ J), a vector α is to be found
such that the half-space {x ∈ X | (α, x) ≥ 0} contains the union of the
ellipses ∪_{j∈J} E(r, μ_j, σ_j) with the largest possible size r.

If the formulation presented is really equivalent to the requirement (5.7)
(and we will see later that it really is so) then the hyperplane which solves
the task (5.7) can be sought by means of a procedure which we will present
first in the simplest two-dimensional case.
Let μ1, μ2 and μ3 be the mathematical expectations of three random variables,
as shown in Fig. 5.3. First, a straight line is to be drawn that passes
through the origin of the coordinates and leaves all three points μ1, μ2 and
μ3 in the same half-plane.

If such a straight line did not exist, it would mean that for each linear
discriminant function at least one random variable existed for which the
probability of a wrong decision was greater than 0.5. In such a case it would
not be necessary to solve the task because even the best result would not be
practically applicable. In Fig. 5.3 we can see that such a straight line does
exist, e.g., a horizontal straight line. Around the points μ1, μ2 and μ3
ellipses begin to grow whose sizes are the same at each instant and whose
orientations depend on the matrices σ1, σ2 and σ3. At the same time, with the
growing sizes of the ellipses, the position of the straight line changes so
that all three ellipses remain, all the time, in one half-plane defined by the
straight line. The growth of the ellipse sizes continues until some ellipses
(it may even be a single ellipse) force the straight line into the only
possible position. At that point further growth of the ellipse sizes is no
longer possible, since no straight line would allow all three ellipses to lie
in one half-plane.
Let us see what the growth of the ellipse sizes looks like in the case of
Fig. 5.3. At the beginning the ellipse sizes grow without forcing the straight
line to be rotated. The initial growth lasts only until one of the ellipses
touches the straight line. In our case (when the matrices σ1, σ2 and σ3 are
the same) it is ellipse 1, see Fig. 5.4. If the contact point fell exactly on
the coordinate origin, further growth of the ellipse sizes would not be
possible. This particular case, in which the straight line is determined by
one single ellipse, is presented in Fig. 5.5. But in our case the contact
point is not the coordinate origin, and thus the growth of the ellipse sizes
continues and the straight line keeps turning clockwise until it touches
another ellipse. In our case it is ellipse 2, see Fig. 5.6. If the contact
points of the first and second ellipses lay on different sides of the
coordinate origin, no further growth of the ellipse sizes would be possible
and the growth of the ellipses would end. In our case such a situation has
not occurred; the ellipse sizes can grow further while the straight line keeps
turning clockwise. The first ellipse stops touching the straight line and the
turning now depends on the second ellipse only, see Fig. 5.7.

The growth of the ellipses continues either until a contact point reaches the
origin of the coordinates or until the straight line touches some other
ellipse. In our case it is ellipse 3, see Fig. 5.8. The contact points of the
second and third ellipses lie on opposite half-lines with respect to the
coordinate origin, and therefore with the growing size of the second ellipse
the straight line would have to turn in one direction, while the growing size
of the third ellipse would force it to turn in the opposite direction. Further
simultaneous growth of both ellipses is no longer possible, and thus the
position found represents the solution of our task.

Figure 5.5 A particular case in which the straight line is closely contacted by one ellipse.
Figure 5.6 The straight line contacts another ellipse.
Figure 5.7 Turning the straight line further clockwise.
Figure 5.8 The straight line has contacted another ellipse. The growth of the ellipses ends.
With a certain amount of imagination we can get an idea of what the growth of
the ellipse sizes might look like in a three-dimensional space (in terms of
geometry they would be ellipsoids). Here, too, the ellipse sizes grow until
some ellipses force the separating plane into one possible position. This can
happen either when a contact point reaches the coordinate origin, or when two
contact points and the coordinate origin come to lie on one straight line, or,
finally, when the triangle formed by three contact points contains the
coordinate origin.

On the basis of this informally understood task we can state the following
necessary and sufficient condition for the optimal position of the hyperplane
in the task (5.7).
Let H be a hyperplane passing through the coordinate origin and let μ_j, σ_j,
j ∈ J, be the parameters of |J| random Gaussian vectors. Let r_j be the
positive real number such that the ellipse E(r_j, μ_j, σ_j) touches the
hyperplane H, and let x_j, j ∈ J, denote the point at which the ellipse
touches it. Further, let J0 = {j ∈ J | r_j = min_{j'∈J} r_{j'}}.

For the optimal position of the hyperplane H with respect to the task (5.7)
it is necessary and sufficient that the coordinate origin lies inside the
polyhedron whose vertices are the contact points x_j, j ∈ J0.
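The two-dimensional, unit-covariance case of the task (5.7) can be sketched numerically. With σ_j = I and a hyperplane through the origin, ε_j(α) = 1 − Φ((α, μ_j)/|α|), so minimising max_j ε_j amounts to maximising the smallest projection of the expectations onto the unit vector α. The sketch below is a crude grid search with illustrative expectations (the names `mus`, `worst_eps` are ours), not the book's algorithm:

```python
from math import cos, sin, pi, erf, sqrt

# illustrative expectations of two Gaussian vectors with unit covariance
mus = [(1.0, 2.0), (2.0, 1.0)]

def Phi(t):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def worst_eps(alpha):
    """max_j eps_j(alpha) for a unit vector alpha and sigma_j = I."""
    ax, ay = alpha
    return max(1.0 - Phi(ax * mx + ay * my) for mx, my in mus)

# crude grid search over the directions of the unit vector alpha
best = min(worst_eps((cos(2 * pi * k / 3600), sin(2 * pi * k / 3600)))
           for k in range(3600))
```

For the symmetric pair above the optimum is the direction (1, 1)/√2 with the smallest projection 3/√2 ≈ 2.12, i.e., a worst-case error of about 1 − Φ(2.12) ≈ 0.017; the grid search attains it because both expectations then touch the separating line at equal ellipse sizes, as the condition above demands.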
This statement will be formulated more elaborately and proved later. For a
more accurate analysis of the Anderson tasks, both the original and the
generalised one, we will first make precise the concepts of the ellipse and
the contact point, which we introduced by referring to intuitive
understanding.

X0 and the ellipse E(r*, μ, σ). It will be denoted by x0 and referred to as
the contact point. It is obvious that at the contact point the minimal value
of the function F on the hyperplane X0 is reached, and that the value of the
function F at the contact point is (r*)², i.e., the square of the distance of
(μ, σ) from X0. Explicit expressions for the distance and the contact point
can be derived.
To the respective optimisation task there corresponds the Lagrange function

  Φ(x, λ) = ((x − μ), σ⁻¹ · (x − μ)) + λ · (α, x) ,

and the sought point x0 is the solution of the pair of equations

  grad Φ(x, λ) = 0 ,
  (α, x) = θ ,   (5.10)

with respect to the variables x and λ. In particular, the first equation in
the system (5.10) is

  2 σ⁻¹ · (x − μ) + λ · α = 0 ,

from which it follows that

  x0 = μ − (λ/2) σ · α ,   (5.11)

where the Lagrange coefficient λ is to assume the value at which the second
equation of the system (5.10) is satisfied, i.e., the equation

  (α, μ) − (λ/2) (α, σ · α) = θ ,

so that λ = 2 ((α, μ) − θ) / (α, σ · α). When the expression for x0 is
substituted into the formula (5.9) we obtain the size (r*)² of the ellipse
((x0 − μ), σ⁻¹ · (x0 − μ)) = (r*)² at which the ellipse touches the
hyperplane (α, x) = θ. Let us do it.
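The derivation can be checked numerically in two dimensions. In the sketch below μ, σ, α and θ are arbitrary illustrative values: λ and the contact point x0 are computed from (5.11), and we verify both that x0 lies on the hyperplane and that its ellipse size equals ((α, μ) − θ)² / (α, σ·α).

```python
# 2-D check of the contact-point formulas; mu, sigma, alpha, theta are
# illustrative values, not taken from the lecture.
mu = (3.0, 1.0)
sigma = ((2.0, 0.0), (0.0, 1.0))     # covariance matrix
alpha = (1.0, 1.0)
theta = 1.0

def mat_vec(m, v):
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

sa = mat_vec(sigma, alpha)
lam = 2.0 * (dot(alpha, mu) - theta) / dot(alpha, sa)        # Lagrange coefficient
x0 = (mu[0] - 0.5 * lam * sa[0], mu[1] - 0.5 * lam * sa[1])  # equation (5.11)

assert abs(dot(alpha, x0) - theta) < 1e-12   # x0 lies on the hyperplane

# ellipse size at the contact point: ((x0 - mu), sigma^{-1} (x0 - mu))
det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
inv = ((sigma[1][1] / det, -sigma[0][1] / det),
       (-sigma[1][0] / det, sigma[0][0] / det))
d = (x0[0] - mu[0], x0[1] - mu[1])
r2 = dot(d, mat_vec(inv, d))
assert abs(r2 - (dot(alpha, mu) - theta) ** 2 / dot(alpha, sa)) < 1e-12
```

With these values λ = 2, the contact point is x0 = (1, 0), and (r*)² = 3, matching the closed-form expression derived below.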
This means that the size r* of the ellipse contacting the hyperplane
(α, x) = θ is

  r* = |(α, μ) − θ| / √((α, σ · α)) .   (5.14)

If we take into consideration that the vector μ belongs to the positive
half-space X+, we obtain an expression without the absolute value sign,

  r* = ((α, μ) − θ) / √((α, σ · α)) .   (5.15)

If the vector μ belonged to the negative half-space X−, the corresponding
expression would be

  r* = (θ − (α, μ)) / √((α, σ · α)) .   (5.16)
Let us continue by concentrating our attention on the formulae (5.15) and
(5.16) obtained above. The numerator in (5.15) is the mathematical expectation
of the random variable (α, x) − θ for the random vector x with mathematical
expectation μ. The denominator is the standard deviation of the random
variable (α, x) − θ for the random vector x with covariance matrix σ. From
this it directly follows that the size of the ellipse contacting the
hyperplane is a strictly monotonically decreasing function of the probability
that the random vector x falls into the half-space which is defined by the
hyperplane X0 and which does not contain the mathematical expectation μ. In
this way we have proved the following lemma.
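The monotone relation can be made concrete numerically: for a Gaussian x the quantity (α, x) − θ has mean r*·√((α, σ·α)) and standard deviation √((α, σ·α)), so the probability of landing on the wrong side of the hyperplane is 1 − Φ(r*). A small sketch with illustrative values (Φ computed via math.erf):

```python
from math import erf, sqrt

def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

# illustrative data: expectation mu, unit covariance, hyperplane (alpha, x) = theta
mu = (2.0, 0.0)
alpha = (1.0, 0.0)
theta = 0.0

s = sqrt(alpha[0] ** 2 + alpha[1] ** 2)   # sqrt((alpha, sigma alpha)) for sigma = I
r_star = (alpha[0] * mu[0] + alpha[1] * mu[1] - theta) / s   # formula (5.15)
wrong_side = 1.0 - Phi(r_star)   # probability that (alpha, x) <= theta

assert abs(r_star - 2.0) < 1e-12
```

Here r* = 2 gives a wrong-side probability of about 0.023; shrinking r* drives the probability towards 0.5, which is the monotone dependence the lemma states.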
Lemma 5.1 Let x be a multi-dimensional Gaussian random variable with the
mathematical expectation μ and the covariance matrix σ, which assumes values
in a linear space X. Let the vector α ∈ X and the number θ decompose the
space X into three subsets:
which by means of the newly introduced variables λ′₁ = ½λ₁, λ′₂ = −½λ₂ will
be presented in the form
The second equation in the relation (5.17) has been rewritten to the form (5.18)
with respect to the requirement μ₁ ∈ X⁺, μ₂ ∈ X⁻. Both coefficients λ₁ and
λ₂ are positive, and therefore their sum λ₁ + λ₂ is also positive. Note that
the vector α need not be determined precisely, but only up to a multiple by
a positive coefficient. The solution of the task, therefore, does not depend
on the precise values of the coefficients λ₁ and λ₂, but only on their ratio. Both
coefficients can be, e.g., tied together by the following relation
(5.19)
5.3 Anderson tasks 149
since any ratio between λ₁ and λ₂ can be achieved even on the condition given
by the equation (5.19). Their ratio is
Thanks to the second equality in the system (5.18), the product of the first two
coefficients on the right-hand side of (5.20) equals one, and thus
Let us denote the minimised function max_{j∈J} ε(α, μ_j, σ_j) by the symbol f(α).
The given data suffice for proving the theorem which states that the function
f(α) is unimodal and thus its minimisation can be achieved.

Theorem 5.1 Convexity of the set of vectors α. The set of vectors α satis-
fying the inequality f(α) ≤ b is convex for each number b < 0.5.
Proof. Lemma 5.1 states that the probability ε(α, μ_j, σ_j) strictly decreases
with the growth of

    (α, μ_j) / √((α, σ_j · α)) .

The condition f(α) ≤ b is equivalent to the following system of inequalities

    ε(α, μ_j, σ_j) ≤ b ,  j ∈ J ,    (5.22)

where, owing to Lemma 5.1, each inequality bounds the above ratio from below
by a number c which depends on the number b. The system (5.22) can thus be
written in the form

    (α, μ_j) − c · √((α, σ_j · α)) ≥ 0 ,  j ∈ J .    (5.23)
The functions on the left-hand side of each inequality of the system (5.23)
consist of two summands. The first of them is a linear function of the vector α,
and as such it is concave. The function √((α, σ_j · α)) is a convex function of the
vector α. Thus the function −c · √((α, σ_j · α)) is concave, since the number c is
strictly positive by the assumption b < 0.5. The left-hand side of each inequality
in the system (5.23) is a sum of two concave functions, and thus it is also
concave. Therefore for each j the set of vectors satisfying the j-th inequality is
a convex set. The set of vectors satisfying the system (5.23) is an intersection
of convex sets, and thus it is also convex. •
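The convexity claimed by Theorem 5.1 can also be illustrated numerically. The
sketch below uses two hypothetical Gaussian distributions in the plane, computes
ε through the Gaussian tail ε = 0.5·erfc(r/√2) with r = (α, μ)/√((α, σ·α)), and
checks that midpoints of pairs taken from the level set {α : f(α) ≤ b}, b < 0.5,
stay inside the level set.

```python
import math, random

def eps(alpha, mu, sigma):
    # probability that N(mu, sigma) falls into the negative half-space of alpha
    s = (sigma[0][0]*alpha[0]*alpha[0] + 2*sigma[0][1]*alpha[0]*alpha[1]
         + sigma[1][1]*alpha[1]*alpha[1])
    r = (alpha[0]*mu[0] + alpha[1]*mu[1]) / math.sqrt(s)
    return 0.5 * math.erfc(r / math.sqrt(2))

# hypothetical pair of distributions (mu_j, sigma_j)
dists = [((2.0, 1.0), ((1.0, 0.2), (0.2, 1.0))),
         ((1.5, -2.0), ((0.5, 0.0), (0.0, 2.0)))]

def f(alpha):
    return max(eps(alpha, mu, sigma) for mu, sigma in dists)

b = 0.3
random.seed(0)
inside = []
while len(inside) < 100:          # sample vectors from the level set f <= b
    a = (random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0))
    if f(a) <= b:
        inside.append(a)

for a1, a2 in zip(inside[::2], inside[1::2]):
    mid = ((a1[0] + a2[0]) / 2, (a1[1] + a2[1]) / 2)
    assert f(mid) <= b            # midpoints stay in the (convex) level set
```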
From the theorem proved it directly follows that in the domain where f(α) <
0.5 no strict local minimum can exist which would not be identical to the
global minimum. A strict local minimum is here a point α₀ for which a
δ-neighbourhood of the point α₀ exists in which for each α ≠ α₀ the strict
inequality f(α₀) < f(α) is satisfied. Let us assume the opposite: let α′ and
α″ be two strict local minima. Without loss of generality, let us assume that
f(α″) ≥ f(α′) = c < 0.5. We will connect the points α″ and α′ with a
straight line segment. Since the point α′ is a local minimum, on this line
segment there is a point α (and it can be quite near the point α′) for which
f(α) > f(α′) = c must hold. This would, however, mean that the set of vectors
α for which f(α) ≤ c < 0.5 holds is not convex. From Theorem 5.1 it follows
that if a point α is reached in which the value f(α) < 0.5 then from this
point we can get to the global minimum α* directly along the straight line
which connects the current point with the point α*. When moving along this
straight line, the function f(α) will not rise at any position.
Actually, even stronger statements are valid than those presented here. From
each point α for which f(α) < 0.5 holds it is possible to pass along a straight
line to the point α* in which the global minimum of the function f is reached.
When moving along this straight line from the point α to the point α* the
function f will be continuously decreasing.
This property of the function f could be used for organising the procedure
of its minimisation. But this procedure cannot be based on the motion in the
direction of the gradient towards a zero-gradient point, as usually happens,
since the function being minimised is neither convex nor differentiable.
Necessary and sufficient conditions for the minimum are to be stated that are
not based on the concept of the gradient.
Further on, a lemma is proved which deals with the necessary and sufficient
conditions for the minimum of the number ε(α, μ, σ). The formulation of the
lemma is based on the concept of the contact point of an ellipse
{x ∈ X | ((x − μ), σ⁻¹ · (x − μ)) ≤ r²} and a hyperplane {x ∈ X | (α, x) = 0}.
The contact point is denoted x₀(α, μ, σ). On the basis of (5.13), referring to
θ = 0, the expression for x₀(α, μ, σ) is

    x₀(α, μ, σ) = μ − ((α, μ) / (α, σ · α)) · σ · α .    (5.24)

The proof is based on the concept of the distance of the pair (μ, σ) from the
hyperplane {x ∈ X | (α, x) = 0}. The distance is denoted r*(α, μ, σ). Based on (5.15)
152 Lecture 5: Linear discriminant function
and remembering that θ = 0, the expression for r*(α, μ, σ) can be written as
follows,

    r*(α, μ, σ) = (α, μ) / √((α, σ · α)) .    (5.25)
Lemma 5.2 Necessary and sufficient condition for optimality of α for one
distribution. Let for a triplet (α, μ, σ) there hold (α, μ) > 0. Let x₀(α, μ, σ) be
the contact point and Δα any vector which is not collinear with the vector α.
For this case two implications are valid:
1. If the following condition is satisfied
On the basis of the expression (5.24) for x₀(α, μ, σ) it is clear that in the
fraction on the right-hand side of (5.30) the numerator is the scalar product
(x₀(α, μ, σ), Δα), which is positive, as assumed. The examined derivative is,
therefore, also positive. It follows that there exists such a positive T
that for each t, 0 < t ≤ T, the following inequality is satisfied,
from which, thanks to Lemma 5.1, the inequality (5.27) follows which was to
be proved. In this way the first statement of Lemma 5.2 is proved.
We will now prove the second statement of Lemma 5.2, where the following
condition is assumed,

    (Δα, x₀(α, μ, σ)) ≤ 0 .    (5.28)
Here the behaviour of the function r*(α, μ, σ) is to be examined not only in
the neighbourhood of the vector α, as in the previous case, but in a global
sense, and therefore for an analysis of such behaviour the knowledge of the
derivatives of this function at the point α will not suffice. Thus additional,
though not very complicated, considerations are needed.
To be brief, we will denote α + tΔα as α′ and x₀(α, μ, σ) as x₀. For the vector
α′ three cases can appear:

    (α′, μ) ≤ 0 ;    (5.31)

The case in which (α′, x₀) > 0 is excluded, since (α′, x₀) = ((α + tΔα), x₀)
= (α, x₀) + t · (Δα, x₀) ≤ 0. Indeed, the summand (α, x₀) is zero since,
according to the definition, the contact point x₀ belongs to the hyperplane
X⁰(α) = {x ∈ X | (α, x) = 0}, and (Δα, x₀) is not positive according to the
assumption (5.28).
When the condition (5.31) is satisfied then the statement of Lemma 5.2 is
obviously valid, since α satisfies the inequality (α, μ) > 0, and thus
ε(α, μ, σ) < 0.5, whereas the inequality (5.31) means that ε(α′, μ, σ) ≥ 0.5.
Let us examine the cases (5.32) and (5.33). The symbol F(x) will denote the
quadratic function ((x − μ), σ⁻¹ · (x − μ)) and we will prove that in both cases
(5.32) and (5.33) there exists such a point x* in the hyperplane
X⁰(α′) = {x ∈ X | (α′, x) = 0} that

    F(x*) < F(x₀) .    (5.34)
Thus the inequality (5.29) will be proved, and so will the entire Lemma 5.2.
If the inequality (5.34) is valid (and we will prove its validity) then there holds
If we use the expression (5.11) for x₀ then we obtain σ⁻¹ · (x₀ − μ) = −½λα,
which simplifies the formula (5.37) to
Because x′ ∉ X⁰(α) and x₀ ∈ X⁰(α), the relations (α, x′) ≠ 0 and (α, x₀) = 0
are valid. Consequently the scalar product (α, (x′ − x₀)) is not zero. Accord-
ing to (5.12), λ is a nonzero number, and so is ½λ(α, (x′ − x₀)). Thus
If the abovementioned convex hull does not include the coordinate origin then
a vector Δα and a positive number T exist so that for any t, 0 < t ≤ T, the
following inequality is satisfied:

    max_{j∈J} ε(α + Δα · t, μ_j, σ_j) < max_{j∈J} ε(α, μ_j, σ_j) .
Proof. First, we will prove the first statement of Theorem 5.2. It is assumed
that such numbers γ_j, j ∈ J⁰, exist which satisfy the conditions

    Σ_{j∈J⁰} γ_j · x₀^j = 0 ,  Σ_{j∈J⁰} γ_j = 1 ,

and consequently

    Σ_{j∈J⁰} γ_j · (α′, x₀^j) = 0 .
The equality is certainly valid for some nonzero vector α′ which is not collinear
with α. This sum can be zero only when at least for one j* ∈ J⁰ the following
inequality is satisfied,

    (α′, x₀^{j*}) ≤ 0 .

The equation (α, x₀^j) = 0 is satisfied for each j ∈ J and thus also for j*. This
means that the vector Δα = α′ − α satisfies the inequality (Δα, x₀^{j*}) ≤ 0. With
respect to Lemma 5.2 we write ε(α′, μ_{j*}, σ_{j*}) > ε(α, μ_{j*}, σ_{j*}). The number
max_{j∈J} ε(α, μ_j, σ_j) is evidently ε(α, μ_{j*}, σ_{j*}), since j* ∈ J⁰, and the number
max_{j∈J} ε(α′, μ_j, σ_j) is not less than ε(α′, μ_{j*}, σ_{j*}). Thus max_{j∈J} ε(α′, μ_j, σ_j) >
max_{j∈J} ε(α, μ_j, σ_j). In this way the first statement of Theorem 5.2 is proved.
We will denote by X⁰ the convex hull of the set of contact points x₀^j and
prove the second statement of Theorem 5.2, in which 0 ∉ X⁰ is assumed. Then
a vector Δα exists for which the inequality (Δα, x₀^j) > 0 holds for each j ∈ J⁰.
It can be, e.g., the point argmin_{x∈X⁰} |x|. From the first statement of
Lemma 5.2 it follows that there exist such positive numbers T_j, j ∈ J⁰, that

    ∀(j ∈ J⁰) ∀(t | 0 < t ≤ T_j):  ε(α + Δα · t, μ_j, σ_j) < ε(α, μ_j, σ_j) .

The preceding statement remains valid when all numbers T_j are substituted by
the number T′ = min_{j∈J⁰} T_j, and after this substitution the order of quantifiers
is changed. In this way we obtain the relation

    ∀(t | 0 < t ≤ T′) ∀(j ∈ J⁰):  ε(α + Δα · t, μ_j, σ_j) < ε(α, μ_j, σ_j) .    (5.38)

According to the definition of the set J⁰, each value ε(α, μ_j, σ_j), j ∈ J⁰, is
equal to max_{j∈J} ε(α, μ_j, σ_j), and the expression (5.38) can be modified as
follows:

    ∀(t | 0 < t ≤ T′):  max_{j∈J⁰} ε(α + Δα · t, μ_j, σ_j) < max_{j∈J} ε(α, μ_j, σ_j) .    (5.39)
The dependence of ε(α, μ_j, σ_j) on the vector α is continuous. Therefore when
for an index j′ the inequality
Thus a positive number T exists (it may be less than T′) for which the
inequality
is valid for any t, 0 < t ≤ T. Based on this we will rewrite the statement (5.39)
in the form
(5.41)
We come again to the task of the simple separation of finite sets of points.
    Δα = argmax_{Δα} min_{j∈J} ( − ∂ε(α + t · Δα, μ_j, σ_j) / ∂t ) ,

or, which is the same,

    Δα = argmax_{Δα} min_{j∈J} (Δα, y^j) / |Δα| .    (5.42)

We can see that searching for such a vector is identical with the task of the
best separation of the sets of points.
3. After finding the vector Δα, which is characterised by the condition (5.42),
we must find

    t = argmin_t max_{j∈J} ε(α + t · Δα, μ_j, σ_j)    (5.43)

and a new vector α, as α := α + t · Δα. This vector is also sure to
satisfy the condition (5.40).
4. Go to the step 2 of the procedure.
Let us see how far the outlined procedure can serve as a basis for writing
a practically applicable program. From the theoretical standpoint, the most
important drawback is that the iterative cycle need not end. From the practical
standpoint, this is not so serious. The value max_{j∈J} ε(α + t · Δα, μ_j, σ_j) is, at
the least, a unimodal function of the single variable t. There are a number of
methods for its optimisation, and for realising the formula (5.43) they are all
roughly equally suitable. In spite of that, the formula (5.43) is treacherous for
programmers: with careless programming the program can run about ten to
twenty times longer than needed.
Let us examine the auxiliary tasks (5.40), (5.41) and (5.42). The tasks (5.40)
and (5.41) are the same, and the task (5.42) includes both of them. First, the
task (5.42) arises as an auxiliary task within the whole procedure sought, which
is intended for solving the generalised Anderson task. Second, the task (5.42)
is itself a particular case of the generalised Anderson task, namely the case in
which all matrices σ_j, j ∈ J, are unit matrices. We might seem to be stuck in
a logical loop: to solve the Anderson task it is necessary to know how to solve
the task (5.42), which can be solved only through the algorithm for solving the
generalised Anderson task. Actually, there is no logical loop, since the
particular case (5.42) has additional positive features thanks to which its
solution is much easier than that of the general task.
Apart from the property that the solution of the particular case (5.42)
contributes to the solution of the generalised Anderson task, this task has a
further importance of its own for the separation of finite sets of points by
linear discriminant functions. Such a task is a favourite in pattern recognition
as one of the methods of learning. At the beginning of the lecture we noted
that it deserves to be a favourite.
Now we will part with the Anderson task for some time. First we will study
the task of the linear separation of finite sets of points, and then, within the
scope of this lecture, we will return to the Anderson task again.
    (α, x^j) > 0 ,  j ∈ J₁ ,
                                 (5.44)
    (α, x^j) < 0 ,  j ∈ J₂ ,

    (α, x^j) > 0 ,  j ∈ J₁ ,
                                 (5.45)
    (α, x^j) < 0 ,  j ∈ J₂ ,

has a solution. If the system (5.45) has one solution then the same system has
an infinite number of solutions. Therefore, let us make (5.45) stricter through
the requirement to seek a vector α and the greatest positive value of r satisfying
the system

    (α, x^j) / |α| ≥ r ,  j ∈ J₁ ,
                                        (5.46)
    (α, x^j) / |α| ≤ −r ,  j ∈ J₂ ,

which leads to the task

    α = argmax_α min ( min_{j∈J₁} (α, x^j) / |α| ,  min_{j∈J₂} ( −(α, x^j) / |α| ) ) .    (5.47)

The task (5.47) is referred to as the optimal separation of finite sets of points.
This task is a particular case of the generalised Anderson task in which for all
j ∈ J the matrices σ_j are unit matrices.
In this particular case the task has an illustrative geometrical interpretation,
on which we will rely several times later in formal as well as informal
considerations. The task (5.45) requires a hyperplane to be found separating
the set of points {x^j, j ∈ J₁} from the set of points {x^j, j ∈ J₂}. The left-
hand sides of the inequalities in (5.46) represent the distances of the points
from the hyperplane. The tasks (5.46) and (5.47) require us to find, among all
the hyperplanes satisfying (5.45), a hyperplane which is most distant from the
given points.
From the analysis of the generalised Anderson task we can see that the tasks
(5.46) and (5.47) also admit a different geometrical interpretation. An arbitrary
vector satisfying the system (5.45) separates not only the points x^j, j ∈ J₁,
from the points x^j, j ∈ J₂, but it also separates certain r-neighbourhoods of
these points. The size of the neighbourhood, i.e., the number r, depends on
the vector α. The tasks (5.46) and (5.47) require us to find a vector α which,
together with separating one set of points from the other, separates the largest
possible neighbourhoods of these points.
We will denote the vectors x′^j, j ∈ J, so that x′^j = x^j, j ∈ J₁, and x′^j = −x^j,
j ∈ J₂. The objective of (5.45) is to find a vector α for which there holds

    (α, x′^j) > 0 ,  j ∈ J .    (5.48)

Our tasks will be analysed in both formulations. The first formulation (5.48)
has its origin in the task of simple separation of sets of points: a hyperplane
is to be found that gets all the points into one half-space. The second task,

    α = argmax_α min_{j∈J} (α, x′^j) / |α| ,    (5.49)

originates in the task of optimal separation of the set of points: a hyperplane
is to be found that, in addition to satisfying the conditions (5.48) of the first
formulation, is as distant as possible from the set of points.
5.4 Linear separation of finite sets of points 161
The distance is meant in the usual Euclidean sense. The formulation of the
necessary and sufficient conditions for the optimal position of the hyperplane,
which in the general case is provided by Theorem 5.2, can be expressed in
the following nearly geometrical form. Assume we already operate with the
transformed vectors x′^j, and write them simply as x^j.

Theorem 5.3 Geometrical interpretation of conditions for the optimal hyper-
plane. Let X̄ be the convex hull of the set of points {x^j, j ∈ J} and α* be the
point of X̄ which lies nearest to the coordinate origin,

    α* = argmin_{x∈X̄} |x| .
This means that on the side [α*, x^j] of the triangle it is the vertex α* which
is the nearest point to the coordinate origin. Thus the angle at the vertex α*
cannot be acute. The result is that the scalar product of the vectors −α* and
x^j − α* cannot be positive,

    (−α*, x^j − α*) ≤ 0 .

The same can be expressed as an equivalent statement,

    |α*|² ≤ (α*, x^j) .    (5.51)

    α* = Σ_{j∈J} γ_j · x^j .    (5.52)

From what has been said the equality (α*, α*) = Σ_{j∈J} γ_j · (α*, x^j) follows,
which will be written in a somewhat different form,

    Σ_{j∈J} γ_j · ((α*, x^j) − |α*|²) = 0 .
We can see that a sum of non-negative numbers (see (5.51)) is zero. This,
however, can occur only when all summands equal zero, i.e.,
Not all coefficients γ_j, j ∈ J, can be zero, since their sum must be 1. We will
denote by J⁰ the set of indices j for which γ_j ≠ 0 holds. For each such j ∈ J⁰
the equation (α*, x^j) = |α*|² must hold, which together with the inequality
(5.51) means that for each j ∈ J⁰ and for an arbitrary j′ ∈ J there holds

    (α*, x^j) ≤ (α*, x^{j′}) .

We have proved that in the expression (5.52) the non-zero coefficients are only
the coefficients γ_j by which the vectors nearest to the hyperplane (α, x) = 0
are multiplied. So the expression (5.52) assumes the form
We will start from the previous expression and prove that the convex hull
of the contact points includes the coordinate origin, which, with respect to
Theorem 5.2, will prove that α* is the solution of the formulated task. And
in fact, if we use the formula (5.24) for the contact point and take into
account that the matrices σ_j are unit matrices, we can write

    x₀^j = x^j − α* · (α*, x^j) / |α*|² ,

and

    Σ_{j∈J⁰} γ_j · x₀^j = Σ_{j∈J⁰} γ_j · x^j − α* · (α*, Σ_{j∈J⁰} γ_j · x^j) / |α*|²
                        = α* − α* · (α*, α*) / |α*|² = α* − α* = 0 .

This means that the convex hull of the contact points includes the coordinate
origin. •
The theorem proved already gives the solution of the task of optimal separation
of finite sets of points, since it reduces the task to the minimisation of a
quadratic function whose domain of definition is a multi-dimensional convex
polyhedron. Special features of the task allow us to use even simpler and
more illustrative algorithms, which will be presented later. The theorem proved
and the ideas of the proof can be summarised in relations which are valid for
an arbitrary vector α and an arbitrary vector x ∈ X̄, i.e.,

    min_{j∈J} (α / |α|, x^j) ≤ min_{j∈J} (α* / |α*|, x^j) = |α*| ≤ |x| .    (5.53)

The previous relations will be helpful in analysing algorithms solving the tasks
of simple and optimal separation of finite sets of points.
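Theorem 5.3 can be tried out numerically. The sketch below uses hypothetical
2-D points whose convex hull does not contain the origin; in two dimensions
the nearest point α* of the hull is then the projection of the origin onto some
segment between two of the points, so a brute-force search over all pairs is
exact, and the relations (5.53) can be checked directly.

```python
import random

random.seed(1)
pts = [(random.uniform(1.0, 3.0), random.uniform(0.5, 2.5)) for _ in range(20)]

def dot(u, v): return u[0]*v[0] + u[1]*v[1]

def seg_nearest(p, q):
    # point of the segment [p, q] nearest to the coordinate origin
    d = (q[0] - p[0], q[1] - p[1])
    dd = dot(d, d)
    k = 0.0 if dd == 0 else max(0.0, min(1.0, -dot(p, d) / dd))
    return (p[0] + k*d[0], p[1] + k*d[1])

# alpha*: minimum-norm point over all segments between pairs of points
a_star = min((seg_nearest(p, q) for p in pts for q in pts),
             key=lambda v: dot(v, v))
norm = dot(a_star, a_star) ** 0.5

# relations (5.53): min_j (alpha*/|alpha*|, x^j) = |alpha*| <= |x^j|
m = min(dot(a_star, p) / norm for p in pts)
assert abs(m - norm) < 1e-9
assert all(dot(p, p) ** 0.5 >= norm - 1e-9 for p in pts)
```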
We will show a procedure known as the Kozinec algorithm, which can find a
vector α that satisfies the condition (5.54), even though it may, naturally, be
different from α*.
We will create a sequence of vectors α₁, α₂, ..., α_t, α_{t+1}, ... according to
the following algorithm.
(5.55)
2. If such a vector x^j does not exist then the solution of the task has
already been found and α_t is the vector sought.
3. If the vector x^j exists then we will denote it x_t. The vector α_{t+1} is
determined in such a way that on the straight line connecting the points α_t
and x_t the point nearest to the coordinate origin is sought. This means that

    α_{t+1} = (1 − k) · α_t + k · x_t ,    (5.56)

where

    k = argmin_k |(1 − k) · α_t + k · x_t| .    (5.57)
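The steps above can be sketched in code. A minimal sketch, assuming
hypothetical points that are linearly separable with a positive margin (every
point has a first coordinate of at least 0.3, so the vector (1, 0) already
satisfies (α, x^j) > 0 for all j, and the theorem below guarantees termination):

```python
import random

random.seed(2)
pts = [(random.uniform(0.3, 2.0), random.uniform(-2.0, 2.0)) for _ in range(30)]

def dot(u, v): return u[0]*v[0] + u[1]*v[1]

a = pts[0]                          # alpha_1 may be any point of the set
for _ in range(100000):             # guard; the theory guarantees a finite run
    bad = next((x for x in pts if dot(a, x) <= 0), None)
    if bad is None:
        break                       # (alpha_t, x^j) > 0 for all j: found
    # alpha_{t+1}: point of the segment [alpha_t, x_t] nearest the origin,
    # rules (5.56)-(5.57); the minimising k automatically falls into (0, 1]
    d = (a[0] - bad[0], a[1] - bad[1])
    k = max(0.0, min(1.0, dot(a, d) / dot(d, d)))
    a = ((1 - k) * a[0] + k * bad[0], (1 - k) * a[1] + k * bad[1])

assert all(dot(a, x) > 0 for x in pts)
```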
For Algorithm 5.1 it is proved that at some step a vector α_t satisfying (5.54)
is sure to occur. This is stated in the following theorem.
Figure 5.10 Geometrical interpretation of properties of the points α_t, α_{t+1} and x_t.
is valid. It follows from the strict inequality in the condition (5.54) that the
set X̄ does not include the coordinate origin, which means that the length of
the vector α_t cannot converge to zero.
On the basis of the geometrical interpretation of the relations between the
vectors α_t, α_{t+1} and x_t we will evaluate the ratio |α_{t+1}| / |α_t| for
α_{t+1} ≠ α_t. The point b is the intersection of the straight line passing
through the points α_t, x_t with the hyperplane (α_t, x) = 0. We can see from
Fig. 5.10 that

    |α_t|² / |b|² ≥ ε² / D² ,
and

    |α_{t+1}| / |α_t| ≤ 1 / √(1 + ε²/D²) < 1 .

It can be seen that the sequence of values |α₁|, ..., |α_t|, ... decreases faster
than a decreasing geometrical sequence. If the sequence α₁, ..., α_t, ... were
infinite, the number |α_t| would become less than any arbitrary positive number.
Thanks to (5.58) the number |α_t| cannot be less than ε. Therefore for some t*
the vector α_{t*} must cease changing. The theorem has been proved. •
For completeness we will indicate that the number t* can be estimated by
means of the inequality

    |α_{t*+1}| / |α₁| ≤ ( 1 / √(1 + ε²/D²) )^{t*} .

Since ε ≤ |α_{t*+1}| and |α₁| ≤ D, taking logarithms gives

    −(t*/2) · ln(1 + ε²/D²) ≥ ln(ε/D) ,  i.e.,  t* ≤ ln(D²/ε²) / ln(1 + ε²/D²) .
We will create the sequence of vectors α₁, α₂, ..., α_t, α_{t+1}, ... in the
following way.
for each j ∈ J.
Proof. Let us see what follows from the assumption that for some t the con-
ditions (α_t, x^j) > 0, j ∈ J, are not satisfied and α_{t+1} ≠ α_t occurs. First,
it follows that for each t′ ≤ t also α_{t′+1} ≠ α_{t′} occurred. In addition, for
each t′ ≤ t an x_{t′} was found such that (α_{t′}, x_{t′}) ≤ 0. Therefore, since
every x^j satisfies (α*/|α*|, x^j) ≥ ε, there holds

    |α_{t+1}| ≥ t · ε .

If we divide the inequality (5.59) by the inequality |α_{t+1}|² ≥ t² · ε² then we
obtain t ≤ D²/ε². From this it follows that in the perceptron the vector α
can be changed only if t ≤ D²/ε². Thus, not later than at the step number
(D²/ε²) + 1, the inequality (α_t, x^j) > 0 is satisfied for each j ∈ J. •
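The perceptron procedure analysed in this proof, together with the bound
t ≤ D²/ε², can be sketched as follows. The data are hypothetical; the direction
α* = (1, 0) is known by construction to separate them, and ε and D are computed
from it.

```python
import random

random.seed(3)
pts = [(random.uniform(0.4, 2.0), random.uniform(-2.0, 2.0)) for _ in range(40)]

def dot(u, v): return u[0]*v[0] + u[1]*v[1]

# alpha* = (1, 0) separates the data; eps = min_j (alpha*/|alpha*|, x^j),
# D = max_j |x^j|
eps = min(x[0] for x in pts)
D = max(dot(x, x) ** 0.5 for x in pts)

a = (0.0, 0.0)
updates = 0
changed = True
while changed:
    changed = False
    for x in pts:
        if dot(a, x) <= 0:                   # x is wrongly classified
            a = (a[0] + x[0], a[1] + x[1])   # alpha_{t+1} = alpha_t + x_t
            updates += 1
            changed = True

assert all(dot(a, x) > 0 for x in pts)
assert updates <= D ** 2 / eps ** 2          # the bound from the proof
```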
We have already proved before that the vector α* maximises the number
min_{j∈J} (α / |α|, x^j). A vector α_t is an ε-optimal solution if

    min_{j∈J} (α* / |α*|, x^j) − min_{j∈J} (α_t / |α_t|, x^j) ≤ ε .
The Kozinec algorithm for the simple separation of finite sets of points has a
favourable feature: after a slight modification it becomes an algorithm for
the ε-optimal solution of the task. The algorithm creates a sequence of vectors
α₁, α₂, ..., α_t, α_{t+1}, ... in the following way.

    min_{j∈J} (α_t / |α_t|, x^j) ≥ |α_t| − ε .    (5.60)
It can be seen that this algorithm hardly differs from the procedure quoted
above for the simple separation given by the relations (5.55)-(5.56). The dif-
ference is only in the stopping condition. In the algorithm for the simple sep-
aration, the condition (5.55) finishes the algorithm when all scalar products
(α_t/|α_t|, x^j) are positive. In the algorithm for the ε-optimal solution another
condition, (5.60), is used, which is stricter for small ε. According to this con-
dition, the algorithm ends its operation only when all the scalar products are
not less than |α_t| − ε.
For such a modified algorithm the condition (5.60) is surely satisfied at a
certain step, since the length |α_t| decreases faster than a geometrical sequence
with a quotient less than 1. If the algorithm continued infinitely, the length
|α_t| would converge to zero. This is, however, not possible because the vector
α_t at no step t leaves the convex set X̄, and thus its length cannot converge
to zero.
If the algorithm ended after creating the vector α_t then this vector is the
solution of the ε-optimal task. From the condition (5.60) for the algorithm
stop and from the inequality

    min_{j∈J} (α* / |α*|, x^j) ≤ |α_t| ,

which was referred to many times (see (5.53)), we obtain the inequality

    min_{j∈J} (α* / |α*|, x^j) − min_{j∈J} (α_t / |α_t|, x^j) ≤ ε ,
and therefore the analysis of convergence for such a sequence requires much
finer considerations. Let us leave this analysis to future generations, since the
theoretical incompleteness of the present analysis is no obstacle to practical
application of the given 'algorithm' at ε = 0. The user lets the algorithm run
and observes at each iterative step the number |α_t| and the number
is satisfied.
The decomposition of the space X into convex cones with the above properties
is the Fisher classifier.
Let X̄ be a finite set of points in the space X which is decomposed into K
subsets X̄₁, X̄₂, ..., X̄_K. In the task, which we will refer to as the Fisher task,
such vectors α₁, α₂, ..., α_K are to be found for the inequality

    (α_k, x) > (α_j, x)    (5.61)
to be valid for any triplet (x, k, j) satisfying the condition x ∈ X̄_k, j ≠ k. The
system (5.61) thus consists of a finite number of inequalities; there are just
|X̄| · (K − 1) of them. Our objective is to solve the system (5.61) under the
condition that a solution of the task is known beforehand to exist. It is obvious
that the task of the linear separation of two finite sets of points is a special
case of the Fisher task with K = 2; the linear separation is achieved by means
of the vector α₂ − α₁. An unexpected result is that any Fisher task can be
reduced to this particular case. We will now show how to do it.
Let Y be a space of dimension nK. We will map into it the set X̄ and
the ensemble of vectors α₁, α₂, ..., α_K. The set of coordinates of the space Y
will be decomposed into K subsets, each of which consists of n coordinates.
Thus we can use the expressions the 'first n-tuplet of coordinates', 'second
n-tuplet of coordinates', 'n-tuplet of coordinates with the ordinal number k'.
The ensemble of vectors α₁, α₂, ..., α_K will be represented as an (nK)-dimen-
sional vector α, the k-th n-tuplet of coordinates of which is the vector α_k.
Simply speaking, the sequence of nK coordinates of the vector α is created so
that the coordinates of the vectors α₁, α₂, ..., α_K are written into one sequence
one after another.
For each x ∈ X̄ a set Y(x) ⊂ Y will be created which contains K − 1 vectors.
It will be done in the following way. Let k be the ordinal number of the subset
X̄_k to which x belongs. We will enumerate the vectors from the set Y(x) with
the numbers j = 1, 2, ..., K, j ≠ k. The symbol y(j, x) will denote the j-th
vector from the set Y(x). It will be created so that its j-th n-tuplet of
coordinates is −x, its k-th n-tuplet is x, and all its other coordinates are
equal to zero. We will introduce Ȳ as the set

    Ȳ = ∪_{x∈X̄} Y(x) .
Let k and j be different numbers, and let x be a point from the subset X̄_k. By
the presented manner of creating the vector α and the set Ȳ, there holds
(α_k, x) − (α_j, x) = (α, y(j, x)), and therefore the inequality (α_k, x) > (α_j, x)
is equivalent to the inequality (α, y(j, x)) > 0. The system of inequalities (5.61)
thus becomes equivalent to the system

    (α, y) > 0 ,  y ∈ Ȳ .    (5.62)
(5.62) originates from the system (5.61). We will demonstrate that, e.g., the
modification of the perceptron algorithm appears incredibly simple.
Let, at the step t of the algorithm, the vectors α_k^t, k = 1, ..., K, be calcu-
lated. These vectors are to be verified: the existence of a point x in the
set X̄ is to be checked which would be wrongly recognised by these vectors. The
vectors x ∈ X̄ are examined one after another so that for each vector the
number

    b = max_k (α_k^t, x)

is calculated, and then it is checked whether the equality (α_j^t, x) = b is
satisfied for some j ≠ k. Let us mention that k is the number of the subset X̄_k
to which the point x belongs. As soon as such a point x occurs, it means that it
will not be correctly classified. The vectors α_j and α_k for the next iteration
will be changed so that

    α_k^{t+1} = α_k^t + x ,   α_j^{t+1} = α_j^t − x .

When no such point occurs, it means that the task has been solved.
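The perceptron modification just described can be sketched as follows. The K
clusters below are hypothetical data for which a solution of the system (5.61)
is known to exist, so the procedure is guaranteed to stop.

```python
import random

random.seed(4)
K = 3
centers = [(4.0, 0.0), (0.0, 4.0), (-4.0, -4.0)]
data = [(k, (c[0] + random.uniform(-1, 1), c[1] + random.uniform(-1, 1)))
        for k, c in enumerate(centers) for _ in range(20)]

def dot(u, v): return u[0]*v[0] + u[1]*v[1]

alphas = [(0.0, 0.0)] * K
for sweep in range(1000):
    mistakes = 0
    for k, x in data:
        b = max(dot(a, x) for a in alphas)
        # x is wrongly recognised if some j != k also attains the maximum b
        j = next((j for j in range(K)
                  if j != k and dot(alphas[j], x) >= b), None)
        if j is not None:
            alphas[k] = (alphas[k][0] + x[0], alphas[k][1] + x[1])
            alphas[j] = (alphas[j][0] - x[0], alphas[j][1] - x[1])
            mistakes += 1
    if mistakes == 0:
        break

# every point is now strictly assigned to its own class
for k, x in data:
    assert all(dot(alphas[k], x) > dot(alphas[j], x)
               for j in range(K) if j != k)
```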
We could hardly find an algorithm whose programming is simpler.
The modification of the Kozinec algorithm leading to the Fisher classifier is
slightly more complicated, but it is also rather simple.
Now we formulate the tasks of the optimal and ε-optimal set separation and
introduce the Kozinec solution of these tasks without their equivalent trans-
formations. We will see that such a direct solution has certain advantages.
Let α and θ be a solution of the system (5.63). The distance of a point
x ∈ X̄₁ from the hyperplane X⁰(α, θ) is ((α, x) − θ)/|α| and the distance of
a point x ∈ X̄₂ from the hyperplane X⁰(α, θ) is (θ − (α, x))/|α|. The task of
the optimal separation of the sets X̄₁ and X̄₂ is defined as finding such a
solution of the system (5.63) that maximises the number

    f(α, θ) = min ( min_{x∈X̄₁} ((α, x) − θ) / |α| ,  min_{x∈X̄₂} (θ − (α, x)) / |α| ) .

For r* = max_{α,θ} f(α, θ), the task of the ε-optimal separation of the sets X̄₁
and X̄₂ is defined as finding the vector α and the number θ for which

    r* − f(α, θ) ≤ ε .
Let us have a brief look at the main considerations leading to the solution of
these tasks. The key idea is that the sets X̄₁ and X̄₂ can be optimally sepa-
rated by a hyperplane which is perpendicular to the vector α₁* − α₂* and passes
through the centre of the line segment connecting the points α₁* and α₂*. The
points α₁* and α₂* belong to the convex hulls X̄₁ and X̄₂, respectively, and
determine the shortest distance between the two convex hulls. The algorithms
for the optimal and ε-optimal separation, as well as the algorithm for creating
a sequence that converges to the optimal separation, are based on the
minimisation of the distance between points α₁ and α₂ on the condition that
α₁ ∈ X̄₁ and α₂ ∈ X̄₂. If we wanted to derive these algorithms we would repeat,
with slight changes, the way of deriving the Kozinec algorithms for the previous
case, when it was assumed that θ = 0. We will present the algorithm for solving
the tasks formulated here without deriving or proving it. For the reader who is
eager to learn more, this is left as a kind of individual exercise.
We will create a sequence of points α₁¹, α₁², ..., α₁ᵗ, α₁ᵗ⁺¹, ... and a sequence
of points α₂¹, α₂², ..., α₂ᵗ, α₂ᵗ⁺¹, ... according to the following algorithm.
(5.64)
(5.65)
Figure 5.11 Geometrical interpretation of the conditions as a relation of two points
and five possible straight lines.
Figure 5.12 Two possible geometrical interpretations of the change of the point α₁
position.
3. If neither of these points exists then the algorithm stops the operation and
provides the vectors α₁ᵗ and α₂ᵗ at the output.
4. If a vector x_t ∈ X̄₁ exists which satisfies the relation (5.64) then the vector
α₂ does not change, i.e., α₂ᵗ⁺¹ = α₂ᵗ, and the vector α₁ is changed according
to the rule

    α₁ᵗ⁺¹ = α₁ᵗ · (1 − k) + x_t · k ,  where  k = min ( 1 , (α₁ᵗ − α₂ᵗ, α₁ᵗ − x_t) / |α₁ᵗ − x_t|² ) .    (5.66)

The above rule means that the vector α₁ᵗ⁺¹ is determined as the point on the
segment connecting the points α₁ᵗ and x_t that is nearest to the point α₂ᵗ.
5. If a vector x_t ∈ X̄₂ exists satisfying (5.65) then α₁ᵗ⁺¹ = α₁ᵗ and

    α₂ᵗ⁺¹ = α₂ᵗ · (1 − k) + x_t · k ,  where  k = min ( 1 , (α₂ᵗ − α₁ᵗ, α₂ᵗ − x_t) / |α₂ᵗ − x_t|² ) .    (5.67)

The above rule says that the vector α₂ᵗ⁺¹ is determined as the point on the
segment connecting the points α₂ᵗ and x_t that is nearest to the point α₁ᵗ.
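The rules (5.66)-(5.67) can be sketched in code. The stopping conditions (5.64)
and (5.65) are not reproduced in the text above; the sketch uses their
geometrical reading described in the text below (a point of X̄₁ below Line 3, a
point of X̄₂ above Line 4), which is an assumption of this sketch. The data are
hypothetical.

```python
import random

random.seed(5)
X1 = [(random.uniform(2.0, 4.0), random.uniform(0.0, 3.0)) for _ in range(25)]
X2 = [(random.uniform(-4.0, -2.0), random.uniform(0.0, 3.0)) for _ in range(25)]

def dot(u, v): return u[0]*v[0] + u[1]*v[1]
def sub(u, v): return (u[0]-v[0], u[1]-v[1])

eps = 0.1
a1, a2 = X1[0], X2[0]
for _ in range(100000):
    d = sub(a1, a2)
    n = dot(d, d) ** 0.5
    u = (d[0]/n, d[1]/n)                 # unit vector from a2 toward a1
    # assumed reading of (5.64)/(5.65): a point of X1 below Line 3,
    # or a point of X2 above Line 4
    x1 = next((x for x in X1 if dot(u, sub(x, a1)) < -eps/2), None)
    x2 = next((x for x in X2 if dot(u, sub(x, a2)) > eps/2), None)
    if x1 is None and x2 is None:
        break                            # a1, a2 give an eps-optimal hyperplane
    if x1 is not None:                   # rule (5.66): move a1 toward x1
        w = sub(a1, x1)
        k = min(1.0, dot(sub(a1, a2), w) / dot(w, w))
        a1 = ((1-k)*a1[0] + k*x1[0], (1-k)*a1[1] + k*x1[1])
    else:                                # rule (5.67): move a2 toward x2
        w = sub(a2, x2)
        k = min(1.0, dot(sub(a2, a1), w) / dot(w, w))
        a2 = ((1-k)*a2[0] + k*x2[0], (1-k)*a2[1] + k*x2[1])

# Line 5: the hyperplane through the midpoint, perpendicular to a1 - a2
d, mid = sub(a1, a2), ((a1[0]+a2[0])/2, (a1[1]+a2[1])/2)
theta = dot(d, mid)
assert all(dot(d, x) > theta for x in X1)
assert all(dot(d, x) < theta for x in X2)
```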
According to the quantity ε in the expressions (5.64) and (5.65), the expressions
(5.64)-(5.67) provide three different algorithms.
1. When ε is a positive constant then this is an algorithm for the ε-optimal
separation of the sets.
2. When ε is the variable quantity |α₁ᵗ − α₂ᵗ| then this is an algorithm for the
simple separation of the sets.
3. When ε = 0 then this is an algorithm for creating an infinite sequence of
vectors α₁ᵗ and α₂ᵗ which converges to the vectors

    (α₁*, α₂*) = argmin_{α₁∈X̄₁, α₂∈X̄₂} |α₁ − α₂| .
The conditions can be explained by means of Fig. 5.11, where two points α₁ᵗ, α₂ᵗ and five straight lines are shown which are perpendicular to the line segment connecting the points α₁ᵗ and α₂ᵗ. Line 1 passes through the point α₁ᵗ. Line 2 passes through the point α₂ᵗ. Line 3 and Line 4 lie between Line 1 and Line 2 so that the distance between Line 3 and Line 1 is ½ε. Similarly, Line 4 is at the distance ½ε from Line 2. Finally, Line 5 lies halfway between the points α₁ᵗ and α₂ᵗ.
The conditions (5.64) and (5.65) have the following geometrical interpretation. When ε = |α₁ᵗ − α₂ᵗ| then the conditions (5.64) or (5.65) state that either a point from the set X̄₁ gets below Line 5 or a point from the set X̄₂ gets above this straight line, i.e., one of the points is not correctly classified. In this case one of the points α₁ or α₂ changes its position. If such a point does not exist it means that Line 5 separates the sets X̄₁ and X̄₂. When ε is a positive constant then the condition (5.64) means that a point from the set X̄₁ occurs below Line 3. The condition (5.65) states that a point from the set X̄₂ comes above Line 4. In that case one of the points α₁ or α₂ changes its position, even though Line 5 may already classify the set X̄₁ ∪ X̄₂ correctly. When no such point exists then Line 5 separates the set X̄₁ ∪ X̄₂ ε-optimally. And finally, when no point from X̄₁ occurs below Line 1 and no point from X̄₂ lies above Line 2 then Line 5 optimally separates the sets X̄₁ and X̄₂.
The algorithm changing the positions of the points α₁ and α₂, which is expressed by the relations (5.66) and (5.67), is also easy to comprehend in its geometrical interpretation. The case in which the vector α₁ changes and the vector α₂ does not change is illustrated in Fig. 5.12.

The point α₁ᵗ⁺¹ is the point on the line segment connecting the point α₁ᵗ to the point xᵗ which is closest to the point α₂ᵗ. It is either the foot of the perpendicular drawn from the point α₂ᵗ to the straight line passing through the points α₁ᵗ and xᵗ (Fig. 5.12a), or the point xᵗ itself (Fig. 5.12b). The first case occurs when the foot of the perpendicular falls inside the segment, and the second when it lies outside.
We have made sure in an informal, and hopefully easy to understand, way that the described algorithm in the steady state solves one of three tasks, depending on the value of ε:
1. the task of simple separation of the point sets if the algorithm uses the changing variable ε = |α₁ᵗ − α₂ᵗ|;
2. the task of ε-optimal separation of the point sets if the algorithm uses a constant value ε > 0;
3. the task of optimal separation of the point sets if ε = 0.
What remains is to check whether the described algorithm is sure to converge to the stable state. Using simple reasoning, similar to that applied in the analysis of the Kozinec algorithm in Subsection 5.4.3, it is possible to prove that the algorithm definitely ends up in the stable state provided that ε ≠ 0. The reason is that the sequence of lengths |α₁ᵗ − α₂ᵗ| monotonically decreases at a rate not slower than the quotient given below.
5.5 Solution of the generalised Anderson task 175
1 / √( 1 + ε²/D² )  <  1 .

Here the quantity D is not max_{x∈X} |x|, as was the case in the Kozinec algorithm separating finite sets of points, but the value

D = max( max_{x,y∈X̄₁} |x − y| ,  max_{x,y∈X̄₂} |x − y| ) ,

which can be much smaller. The algorithm can therefore converge faster, and typically this is the case.
Now let us break away from the purely mathematical content of the generalised Anderson task, even though it is quite rich in itself, and look at the task from the practical point of view. In the great majority of practical applications the main interest of the designer is not to have an optimal recognition procedure; very often it is sufficient to choose just a good one. If the optimal discrimination rule is found and errors occur in, say, 30% of cases, the task has not been solved from the practical point of view; the fact that the procedure is optimal does not help. But in another situation, in which recognition is correct in 99.9% of cases, one can hardly imagine in practice that a procedure would be rejected only because it is not optimal. Simply speaking, optimal and well applicable are two different concepts. An optimal procedure may not be applicable and, in another situation, a non-optimal procedure can be acceptable.
The previous informal, but still reasonable, considerations make us replace the task (5.68) by seeking a vector α and a number θ which satisfy the inequality

max_{j∈J} er(j, α, θ) < ε .   (5.69)
The task (5.69) can be reduced to a task of the simple separation of infinite sets X̄₁(r) and X̄₂(r) at a certain value r. It results from the following theorem.

Theorem 5.6 On the ε-solution of the generalised Anderson task. Let the number r be, for the given positive number ε < 0.5, the solution of the equation

ε = ( 1/√(2π) ) ∫_r^∞ e^(−t²/2) dt .

Then the pair (α, θ) satisfies the condition

er(j, α, θ) < ε ,  j ∈ J ,   (5.70)

if and only if it satisfies the infinite system of inequalities

⟨α, x⟩ > θ ,  x ∈ E(r, μʲ, σʲ) ,  j ∈ J₁ ,
⟨α, x⟩ < θ ,  x ∈ E(r, μʲ, σʲ) ,  j ∈ J₂ .   (5.71)
•
Proof. In the proof an explicit expression is needed for the maximal and the minimal value of the scalar product f(x) = ⟨α, x⟩ on the set of points

{ x | F(x) = ⟨x − μ, σ⁻¹·(x − μ)⟩ ≤ r² } .   (5.73)

The points sought, which are the solution of this task, satisfy the equation

α + λ·σ⁻¹·(μ − x) = 0 ,

from which it follows that

x = μ + (1/λ)·σ·α .

It is obvious that the extreme of a linear function on the convex set (5.73) is achieved on its boundary, i.e., when F(x) = r². We substitute the expression derived for x into the equation F(x) = r² and so obtain the value λ,

λ = √⟨α, σ·α⟩ / r ,

since the number r is positive for ε < 0.5. Similarly, in seeking the maximum,

λ = −√⟨α, σ·α⟩ / r .

The extreme values sought are

min f(x) = ⟨α, μ⟩ − r·√⟨α, σ·α⟩ ,
max f(x) = ⟨α, μ⟩ + r·√⟨α, σ·α⟩ .
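The closed-form extremes just derived can be checked numerically by sampling the boundary F(x) = r²; the concrete values of μ, σ, α and r below are arbitrary examples.

```python
import numpy as np

# arbitrary example data: a positive-definite sigma, a mean mu, a direction alpha
mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
alpha = np.array([0.7, -0.3])
r = 1.5

# boundary of the ellipse F(x) = r^2, parameterised as x = mu + r * L u, |u| = 1,
# where sigma = L L^T is the Cholesky factorisation
L = np.linalg.cholesky(sigma)
phi = np.linspace(0.0, 2.0 * np.pi, 100_000)
X = mu[:, None] + r * (L @ np.stack([np.cos(phi), np.sin(phi)]))

sampled_min = (alpha @ X).min()
sampled_max = (alpha @ X).max()
closed_min = alpha @ mu - r * np.sqrt(alpha @ sigma @ alpha)
closed_max = alpha @ mu + r * np.sqrt(alpha @ sigma @ alpha)
```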
178 Lecture 5: Linear discriminant function
We will now prove that any solution of the system (5.71) satisfies the condition (5.70). When (5.71) is satisfied then the inequality ⟨α, x⟩ > θ holds for every point of the ellipse E(r, μʲ, σʲ), j ∈ J₁. Since this ellipse is a closed set, the minimum of ⟨α, x⟩ over the ellipse is greater than θ as well, and with the expression for the minimum derived above we obtain the system of inequalities

⟨α, μʲ⟩ − r·√⟨α, σʲ·α⟩ > θ ,  j ∈ J₁ ,

and

( ⟨α, μʲ⟩ − θ ) / √⟨α, σʲ·α⟩ > r ,  j ∈ J₁ .   (5.76)

Similarly, thanks to the second inequality in (5.71), we have

( θ − ⟨α, μʲ⟩ ) / √⟨α, σʲ·α⟩ > r ,  j ∈ J₂ .   (5.77)

This means that er(j, α, θ) < ε for j ∈ J₁. Similarly, from the inequality (5.77) there follows er(j, α, θ) < ε for j ∈ J₂. In this way we have proved that an arbitrary solution of the system (5.71) satisfies (5.70) as well.
Now we will prove that when the pair (α, θ) does not satisfy the system (5.71) then it does not satisfy the inequality (5.70) either. Assume that the inequality expressed in the first line of (5.71) is not satisfied. Then j ∈ J₁ and x ∈ E(r, μʲ, σʲ) exist such that ⟨α, x⟩ ≤ θ. From this it immediately follows that

min_{x ∈ E(r, μʲ, σʲ)} ⟨α, x⟩ ≤ θ ,

i.e.,

⟨α, μʲ⟩ − r·√⟨α, σʲ·α⟩ ≤ θ .
Eventually er(j, α, θ) ≥ ε holds. Similarly, when an inequality of the second line of the system (5.71) is not satisfied, then j ∈ J₂ exists such that er(j, α, θ) ≥ ε. •
not to be infinitely large and for their convex hulls to be disjoint. We will show that the diameters of the sets X̄₁ and X̄₂ are always finite. For X̄₁ and X̄₂ the relations (5.78) hold. The matrices σʲ in the relations (5.78) are positive-definite, and thus the sets X̄₁ and X̄₂ are bounded. The disjointness of X̄₁ and X̄₂ cannot be ensured in the general case, and so the algorithm quoted later can solve the ε-task only if such a solution exists.
Let us study the problem of how to look for an inequality in the system (5.78) which is not satisfied for the vector α and the number θ. Thus an index j ∈ J and a vector x are to be found which satisfy the inequalities ⟨x − μʲ, (σʲ)⁻¹·(x − μʲ)⟩ ≤ r² and ⟨α, x⟩ ≤ θ for j ∈ J₁, or the inequalities ⟨x − μʲ, (σʲ)⁻¹·(x − μʲ)⟩ ≤ r² and ⟨α, x⟩ ≥ θ for j ∈ J₂. On the basis of the considerations used in the proof of Theorem 5.6, we claim that the first condition is equivalent to the statement that the minimal value of the scalar product ⟨α, x⟩ on the set {x | ⟨x − μʲ, (σʲ)⁻¹·(x − μʲ)⟩ ≤ r²} is ≤ θ for some j ∈ J₁. The second condition is equivalent to the statement that for some j ∈ J₂ the maximal value of the scalar product ⟨α, x⟩ on a similar set is ≥ θ, i.e.,

( ∃ j ∈ J₁ :  min_{x ∈ E(r, μʲ, σʲ)} ⟨α, x⟩ ≤ θ )  ∨  ( ∃ j ∈ J₂ :  max_{x ∈ E(r, μʲ, σʲ)} ⟨α, x⟩ ≥ θ ) .
Recall that the symbol ∨ expresses the disjunction of the two conditions.
Using the expressions (5.76) and (5.77) we transpose the statement that the pair (α, θ) does not satisfy the infinite system of inequalities (5.78) into the equivalent statement that the same pair (α, θ) satisfies some of the inequalities

( ⟨α, μʲ⟩ − θ ) / √⟨α, σʲ·α⟩ ≤ r ,  j ∈ J₁ ,
( θ − ⟨α, μʲ⟩ ) / √⟨α, σʲ·α⟩ ≤ r ,  j ∈ J₂ .   (5.79)

When the pair (α, θ) satisfies some inequality from the first line of the system (5.79), i.e., for j ∈ J₁, then the inequality from the system (5.78) which is not satisfied corresponds to the point

x = μʲ − ( r / √⟨α, σʲ·α⟩ ) · σʲ·α .   (5.80)

When the pair (α, θ) satisfies some inequality from the second line of the system (5.79), i.e., for j ∈ J₂, then the inequality from the system (5.78) which is not satisfied corresponds to the point

x = μʲ + ( r / √⟨α, σʲ·α⟩ ) · σʲ·α .   (5.81)

Both expressions (5.80) and (5.81) can be calculated fast and easily, since neither requires the inversion of the covariance matrix σʲ. We see that the
validity of the system (5.78), which contains an infinite number of linear inequalities, can be constructively verified. In addition, when the system (5.78) is not valid, a concrete inequality can be found that is not satisfied. This property is summarised in the following algorithm, which seeks the vector α and the number θ satisfying the conditions (5.78) on the assumption that the values sought actually exist.
2. The algorithm creates two sequences of vectors α₁¹, α₁², ..., α₁ᵗ, ... and α₂¹, α₂², ..., α₂ᵗ, ..., to which the sought vector αᵗ and number θᵗ correspond. The vector α₁¹ is an arbitrarily selected vector from the set X̄₁, for example, one of the vectors μʲ, j ∈ J₁, and the vector α₂¹ is, for example, one of the vectors μʲ, j ∈ J₂.

3. Assume the vectors α₁ᵗ and α₂ᵗ have already been created. For them the vector αᵗ and the number θᵗ are calculated and the following conditions are checked,

( ⟨αᵗ, μʲ⟩ − θᵗ ) / √⟨αᵗ, σʲ·αᵗ⟩ > r ,  j ∈ J₁ ,   (5.82)

as well as the conditions

( θᵗ − ⟨αᵗ, μʲ⟩ ) / √⟨αᵗ, σʲ·αᵗ⟩ > r ,  j ∈ J₂ .   (5.83)

4. When all these conditions are satisfied then the vector αᵗ and the number θᵗ are the ε-solution of the task.

5. When for some j ∈ J₁ one of the inequalities (5.82) is not satisfied then a vector xᵗ violating the corresponding inequality of (5.78) is found according to (5.80), and the vector α₁ is changed according to the rule

α₁ᵗ⁺¹ = α₁ᵗ·(1 − k) + xᵗ·k ,  where  k = min( 1 , ⟨α₁ᵗ − α₂ᵗ, α₁ᵗ − xᵗ⟩ / |α₁ᵗ − xᵗ|² ) .

The vector α₂ does not change in this case, i.e., α₂ᵗ⁺¹ = α₂ᵗ.

6. If for some j ∈ J₂ one of the inequalities (5.83) is not satisfied then a vector xᵗ is found according to (5.81) and, symmetrically, α₂ᵗ⁺¹ = α₂ᵗ·(1 − k) + xᵗ·k with k = min( 1 , ⟨α₂ᵗ − α₁ᵗ, α₂ᵗ − xᵗ⟩ / |α₂ᵗ − xᵗ|² ), while α₁ᵗ⁺¹ = α₁ᵗ.

If the Anderson task has an ε-solution then the above algorithm is sure to arrive at a state in which both the conditions (5.82) and (5.83) are satisfied, and therefore the algorithm stops.
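The whole algorithm can be sketched in code. Two ingredients are assumptions of this sketch, since they are not fully specified above: the rule producing αᵗ and θᵗ from α₁ᵗ and α₂ᵗ (here the perpendicular bisector of the segment α₁ᵗα₂ᵗ, i.e., αᵗ = α₁ᵗ − α₂ᵗ and θᵗ = (|α₁ᵗ|² − |α₂ᵗ|²)/2), and the contact points taken from (5.80) and (5.81).

```python
import numpy as np

def anderson_eps_solution(mus1, sigmas1, mus2, sigmas2, r, max_iter=100_000):
    """Sketch of the eps-solution algorithm built around (5.82) and (5.83).

    mus1/sigmas1 describe the Gaussians with j in J1, mus2/sigmas2 those
    with j in J2; r is the quantile corresponding to the error bound eps.
    """
    a1, a2 = mus1[0].astype(float), mus2[0].astype(float)   # step 2
    for _ in range(max_iter):
        alpha = a1 - a2                        # assumed rule for alpha^t
        theta = (a1 @ a1 - a2 @ a2) / 2.0      # assumed rule for theta^t
        moved = False
        for mu, sig in zip(mus1, sigmas1):     # step 3, conditions (5.82)
            d = np.sqrt(alpha @ sig @ alpha)
            if (alpha @ mu - theta) / d <= r:
                x = mu - (r / d) * (sig @ alpha)       # contact point (5.80)
                k = min(1.0, (a1 - a2) @ (a1 - x) / ((a1 - x) @ (a1 - x)))
                a1 = a1 * (1 - k) + x * k              # step 5
                moved = True
                break
        if not moved:
            for mu, sig in zip(mus2, sigmas2): # step 3, conditions (5.83)
                d = np.sqrt(alpha @ sig @ alpha)
                if (theta - alpha @ mu) / d <= r:
                    x = mu + (r / d) * (sig @ alpha)   # contact point (5.81)
                    k = min(1.0, (a2 - a1) @ (a2 - x) / ((a2 - x) @ (a2 - x)))
                    a2 = a2 * (1 - k) + x * k          # step 6
                    moved = True
                    break
        if not moved:                          # step 4: eps-solution found
            return alpha, theta
    raise RuntimeError("no eps-solution found within max_iter")
```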
5.6 Discussion
I have noticed an important discrepancy which erodes the analysis of the Anderson task. On the one hand, you kept assuming during the analysis that the covariance matrices σʲ were positive-definite. You used this assumption several times in the proofs. On the other hand, you claimed from the very beginning that, without loss of generality, the separating hyperplane sought passes through the origin of the coordinates. But you achieved that by introducing an additional constant coordinate. The variance of the additional coordinate is zero, and for this very reason the covariance matrix cannot be positive-definite, but only positive semi-definite.
I have waited to see how this discrepancy would be settled in the lecture, but in vain. Now I am sure that you made a blunder in the lecture. I believe that you played this teacher's trick on purpose to check whether I had read the lecture properly. There was no need to do that. Your lectures are very interesting to me, even if they are not easy reading matter.
The teacher's trick you are speaking about is used by us from time to time, but not this time. The discrepancy mentioned was made neither on purpose nor through an oversight. It really is a discrepancy, but it has no negative effect on the final results. Have another look at the complete procedure for the Anderson task and its proof and you will see that everywhere the assumption of positive-definiteness of the matrices is made use of, we could do without it. We would only have to mention the case in which the covariance matrix is degenerate. It would not be difficult, but it would interrupt the continuity of the argument. Let us now examine the most important points of deriving the procedure.
In deriving the procedure for solving the Anderson task, covariance matrices are used in two situations. The first is in calculating the values

fʲ(α) = ⟨α, μʲ⟩ / √⟨α, σʲ·α⟩ ,   (5.84)

and the second is in searching for the contact points

x₀ʲ(α) = μʲ − ( ⟨μʲ, α⟩ / ⟨α, σʲ·α⟩ ) · σʲ·α .   (5.85)
In the algorithm for the ε-solution of the Anderson task the covariance matrices are again needed in calculating the points

xʲ = μʲ − ( r / √⟨α, σʲ·α⟩ ) · (σʲ·α) ,   (5.86)

which minimise the scalar product ⟨α, x⟩ on the set of vectors x which satisfy the inequality

⟨x − μʲ, (σʲ)⁻¹·(x − μʲ)⟩ ≤ r² .   (5.87)
The algorithms use only the formulae (5.84), (5.85) and (5.86). The formula (5.87) only illustrates the meaning of the relation (5.86) and is practically not used by the algorithm. Formally speaking, for the calculation of the formulae (5.84), (5.85) and (5.86) the matrices σʲ, j ∈ J, have to be positive-definite. The formulae involve the quadratic function ⟨α, σʲ·α⟩, whose values occur in the denominators of fractions, and thus the value ⟨α, σʲ·α⟩ must be greater than zero for any α ≠ 0. However, based on the way in which the algorithm uses the values fʲ(α) in further calculations (see the formula (5.84)), and according to the meaning of the vectors xʲ (see the formula (5.86)), the algorithm can be defined even for the case of a zero value of the quadratic function ⟨α, σʲ·α⟩. The values fʲ(α) are calculated because the least of them is to be chosen. The value fʲ(α) for ⟨α, σʲ·α⟩ = 0 can be understood as a very large number which is certainly not less than fʲ(α) for the indices j with ⟨α, σʲ·α⟩ ≠ 0. The contact points x₀ʲ are thus to be calculated only for the indices j for which ⟨α, σʲ·α⟩ ≠ 0 holds. When no such indices exist it means that the probability of wrong classification of the j-th random Gaussian vector is zero for every index j. In this case the algorithm can stop, since no better recognition quality can be achieved. We can see that such an augmented algorithm works even for the case of degenerate matrices σʲ.
Let us see how the vector xʲ is calculated according to the formula (5.86) when ⟨α, σʲ·α⟩ = 0. Recall that xʲ is a point of an ellipse which extremises the scalar product ⟨α, x⟩. Formally speaking, the formula (5.87) defines the given ellipse only in the case in which the matrix σʲ is positive-definite, i.e., if all eigenvalues of the matrix are positive. Only then can the matrix σʲ be inverted. The ellipse, however, can be defined even in the case in which some eigenvalues are zero. For any size r it will be an ellipse whose points lie in an affine subspace whose dimension is equal to the number of non-zero eigenvalues. In all cases, irrespective of whether the matrix σʲ can be inverted, the point xʲ sought on the ellipse created in this way is given by the formula (5.86). An exception is the case in which ⟨α, σʲ·α⟩ assumes a zero value. In this case the scalar products of all the points of the ellipse with the vector α are the same, because the whole ellipse lies in a hyperplane which is parallel to the hyperplane ⟨α, x⟩ = 0. Thus an arbitrary point of the ellipse can be chosen for the point xʲ. The simplest choice is the point μʲ.
As you can see, the final results can be stated even for the case of degenerate, i.e., positive semi-definite, matrices. It is not difficult, only painstaking. If you feel like it, examine all the places in the analysis where the inversion of matrices is assumed and make sure that the assumption is not necessary for proving the above statements. We do not think, however, that you would come across results significant from the pattern recognition standpoint during this examination. At the very least it would not be a bad exercise in linear algebra and the theory of matrices for you.
Could I, perhaps, ask you now to examine together with me a subtask which is part of the solution of the Anderson task, i.e., seeking the vector α which maximises the function

f(α) = min_{j∈J} ⟨α, μʲ⟩ / √⟨α, σʲ·α⟩ ?

Assume we have found the direction Δα in which the function f(α) grows. Now we have to find a specific point α + t·Δα in this direction for which f(α + t·Δα) > f(α) holds. In a different and better formulation,

t* = argmax_t min_{j∈J} ⟨α + t·Δα, μʲ⟩ / √⟨α + t·Δα, σʲ·(α + t·Δα)⟩ .   (5.88)
The solution of this task was not mentioned at all in the lecture, and so I do not know what to think of it. On the one hand, it is an optimisation of a one-dimensional function which, moreover, has a single extremum. It might seem that this is a simple task. On the other hand, I know that even such tasks are objects of serious research.

Let us try it. The task of one-dimensional optimisation is only seemingly simple. If sufficient attention is not paid to it, it can even become troublesome. Moreover, the task (5.88) is a suitable testing ground for optimisation and programming subtleties. None of them is very significant in itself, but if an application programmer knows about a hundred of them, it indicates that he or she is an expert. Even for this reason these subtleties are worth knowing. Now, we are at a loss as to which of them to explain first. It would be better if you explained to us how you would handle the task (5.88) yourself.
First, I am to select a finite number of points on the straight line on which the point α + t·Δα lies. To the points a finite number of values of the parameter t₀, t₁, ..., t_L correspond. I will get a little ahead of myself and say that the number of points L + 1 is to be odd and equal to 2ʰ + 1, so that I can find the best point by dividing the sequence into two parts having the same number of points. If the point lay in a previously determined interval T then I would not bother much about the selection of the values t₀, ..., t_L: I would simply divide the interval into L equal segments. But we know only that 0 < t < ∞ in our task. The sequence t₀, t₁, ..., t_L can thus be selected in several different ways which seem nearly the same to me. The basic motivation for the selection is the property of the function f that the relation f(α) = f(k·α) is valid for an arbitrary positive k ∈ ℝ. Therefore, instead of at the points α + t·Δα, 0 < t < ∞, I can examine this function at the points

( 1/(1+t) )·α + ( t/(1+t) )·Δα ,  or, which is the same, at the points  α·(1 − τ) + Δα·τ ,

where τ = t/(1+t) already lies in the finite interval 0 < τ < 1. The finite sequence of points can be selected in a natural way by dividing this interval into L segments of equal length.
The abovementioned way means mapping the set of points of the form α + t·Δα, 0 < t < ∞, onto a set of points of the form α·(1 − τ) + Δα·τ. The half-line is mapped onto the finite line segment which connects the points α and Δα, as shown in Fig. 5.13.

The half-line can be mapped onto a finite set in many reasonable ways. In addition to the mapping already given, the following mappings also seem natural. In Fig. 5.14, a mapping of the half-line onto the line segment connecting the points α/|α| and Δα/|Δα| is shown. It seems better than the previous one because the vectors α and Δα acquire 'equal rights'. Another natural mapping takes the half-line onto the set of points of the form

α·cos φ + ( |α| / |Δα| )·Δα·sin φ ,

i.e., onto a quarter-circle, as can be seen in Fig. 5.15, or onto the sides of a square, as shown in Fig. 5.16. I do not think that it would be of much importance to explore which of these ways, and perhaps of other ways too, is the best. The point is that they are nearly the same. In any of these ways a finite number of points is obtained.
You have omitted a number of interesting items, since you took some approaches to solving the task for granted and considered them the only possible ones. From the very beginning you decided that you must inevitably map the half-line onto a finite number of points and thus replace the maximisation of a function of one real variable by searching for the greatest number in a finite set of numbers. Even though you noticed that there were many possibilities for such a replacement and none of them seemed convincing enough to you, it did not occur to you that such a replacement could be avoided.

In the general case it cannot, but in our particular case it can. But we will leave that for later. Now let us have a more profound look at the procedure which you, rather carelessly, called the method of interval halving for searching a strictly unimodal sequence. You said that by means of this method you would reach the best point in log₂ L steps. We suspect that you do not see clearly enough the difference between searching for the zero point in a strictly decreasing number sequence and searching for the greatest number in a strictly unimodal sequence. Understand this difference well, and then we will continue.
They are two different tasks, indeed. I looked at both of them and saw that even though their respective calculations are similar, they are still different. First, I will present the procedure for the simpler task of searching for the zero point, even though we evidently do not need it in our task.
1. An index l′ is determined that divides the greater of the two intervals (l_beg, l_mid) and (l_mid, l_end) into two subintervals of the same length. In concrete terms, if (l_mid − l_beg) < (l_end − l_mid) then l′ = ½(l_mid + l_end). If (l_mid − l_beg) ≥ (l_end − l_mid) holds then l′ = ½(l_mid + l_beg).

2. The quadruplet l_beg, l′, l_mid, l_end obtained is ordered in an ascending way as l₁, l₂, l₃, l₄. This means that l₁ = l_beg, l₂ = min(l′, l_mid), l₃ = max(l′, l_mid), l₄ = l_end.
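For a sequence stored in an array, the procedure can be sketched in code. The final comparison step, which discards one of the outer intervals, is not spelled out above, so it is reconstructed here from the unimodality of the sequence (the maximum must lie adjacent to the larger of the two inner values).

```python
def halving_max(seq):
    """Index of the greatest element of a strictly unimodal sequence,
    found by repeatedly halving the longer of two intervals."""
    l_beg, l_end = 0, len(seq) - 1
    l_mid = (l_beg + l_end) // 2
    while l_end - l_beg > 2:
        # step 1: split the longer of (l_beg, l_mid) and (l_mid, l_end) in half
        if l_mid - l_beg < l_end - l_mid:
            l_new = (l_mid + l_end) // 2
        else:
            l_new = (l_mid + l_beg) // 2
        # step 2: order the quadruplet in an ascending way
        l1, l2, l3, l4 = l_beg, min(l_new, l_mid), max(l_new, l_mid), l_end
        # reconstructed step 3: keep the triple around the larger inner value
        if seq[l2] > seq[l3]:
            l_beg, l_mid, l_end = l1, l2, l3
        else:
            l_beg, l_mid, l_end = l2, l3, l4
    return max(range(l_beg, l_end + 1), key=seq.__getitem__)
```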
We thank you for your rather transparent explanation. But tell us now why you stated, without any further thought, that the longer of the two intervals should be divided in every step into two equal parts. Why should the interval not be divided in another ratio?

Well, this is quite clear! Only in this way can the greatest number in a unimodal sequence be found with the smallest number of queries for the value of a number. I do not know how to justify it except by saying that everybody does it this way.
We do not think that everybody does so. Perhaps all your acquaintances do
so, and even that is doubtful. We would prefer you not to do so any longer
and use the method that was proposed in the early 13th century by the Italian
mathematician Leonardo Fibonacci. Fibonacci may have learned about the
method in Central Asia, where science, including mathematics, was flourishing
at that time. Fibonacci himself got to that part of the world as a merchant
commissioned by his father. He surely, as you and we do, devoted his time also
to interests not at all connected with his main commercial duties.
Let l(i), i = 1, 2, ..., be a sequence of numbers for which l(1) = l(2) = 1 holds, and for each i > 2 the equality l(i) = l(i − 1) + l(i − 2) holds. The numbers of this sequence are called Fibonacci numbers. Every Fibonacci number can be expressed as the sum of two other Fibonacci numbers, and this can be done in only one way.
Two intervals are changed in each step of the procedure you created to find the greatest number. The lengths of the intervals are l_mid − l_beg and l_end − l_mid. The lengths are either equal, in which case they are both 2ʰ for an integer h, or they are different, in which case they are 2ʰ and 2ʰ⁻¹. Create a new algorithm which differs from the one you proposed earlier in the following way. The length l_end − l_beg must be a Fibonacci number in each step, say l(i). The index l_mid is to be such that the lengths l_end − l_mid and l_mid − l_beg are Fibonacci numbers as well. Thus, one of the lengths is l(i − 1) and the other is l(i − 2). We will call this modification the Fibonacci algorithm. Do you see it?
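The modification just described can be sketched in code. The concrete probe points and the discarding rule are a reconstruction: with the interval length l(i), the two inner points sit at distances l(i − 2) and l(i − 1) from the left end, and one comparison shrinks the length from l(i) to l(i − 1); indices beyond the sequence are treated as −∞.

```python
def fibonacci_max(seq):
    """Index of the greatest element of a strictly unimodal sequence,
    found with Fibonacci splits of the search interval."""
    n = len(seq) - 1
    fib = [1, 1]
    while fib[-1] < n:            # smallest Fibonacci number covering the range
        fib.append(fib[-1] + fib[-2])
    k = len(fib) - 1
    lo = 0
    # positions past the end of the sequence count as minus infinity
    val = lambda i: seq[i] if 0 <= i <= n else float("-inf")
    while k >= 3:
        i1 = lo + fib[k - 2]      # inner probe at distance l(k-2) from lo
        if val(i1) < val(lo + fib[k - 1]):
            lo = i1               # maximum lies to the right of i1
        k -= 1                    # interval length shrinks l(k) -> l(k-1)
    # at most three candidates remain
    return max(range(lo, min(lo + 3, n + 1)), key=val)
```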
Yes, I do.
Then try both algorithms on the same strictly unimodal sequence and verify that the greatest number can be found by means of the Fibonacci algorithm faster than by means of the algorithm you proposed before and which you, without any reason, claimed to be the fastest.
Well, this is surprising! The method of halving an interval is really not the
fastest. The Fibonacci algorithm works faster, but only a little bit.
Why then does everybody say that the method of halving an interval is the
fastest?
We are repeating once more that it is not said by everybody. Fibonacci did not
say it.
But now a question arises. I know now that the method of halving an interval is not optimal because the Fibonacci method is better. But I cannot yet say that the Fibonacci method is optimal either. What, then, is the ratio for dividing an interval that allows the greatest number to be found at the highest speed?
We are sure that this is not a difficult question for you. The worst thing for you would be to think that you already knew the optimal algorithm and not to ask such questions. Now, when you have come across that question, you will quickly find the right answer.
It was not very quick, but I have found an answer. Though I have not formulated the optimisation task very precisely, I understand it like this.

I have a certain class of algorithms, two representatives of which are already known to me. One is the algorithm seeking the greatest number in a strictly unimodal sequence which I quoted before and which is based on halving the longer of two intervals. The second algorithm for the same purpose has the same form as the former, except that the longer interval is divided into unequal parts corresponding to Fibonacci numbers. A general algorithm is formulated by me as an algorithm in which the longer interval is divided into parts proportional to the values a and 1 − a, where 0 < a < 1. The number a is not known and it is to be determined in a certain sense optimally. I am not going to formulate the criterion of optimality precisely now. I will make a not very complex analysis of the algorithm for a fixed value a, and the analysis will show in what sense the value should be optimal.
I assume that before some iteration of the algorithm I have the numbers l_beg, l_mid and l_end at my disposal. Without loss of generality I can assume that they are the numbers l_beg = 0, l_mid = 1 − a and l_end = 1, and I accept that they are no longer integers. Furthermore, I assume that a ≤ 0.5. In accordance with the algorithm I am expected to divide the greater interval, i.e., (l_beg, l_mid), into two parts which, thanks to the fixed value a, will have the lengths (1 − a)² and (1 − a)·a. This means that the index l′ will be (1 − a)². I again ignore the fact that it need not be an integer.

The new values of the numbers l_beg and l_end will be either 0 and (1 − a), or (1 − a)² and 1, depending on the values f(l′) and f(l_mid). The length of the new interval l_end − l_beg will thus be either 1 − a or 1 − (1 − a)². Only now can I formulate the requirement that the parameter a of the algorithm is to be chosen so that the length of the longer of the two possible intervals is as short as possible. The result is

a* = (3 − √5) / 2 .
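The derivation can be checked numerically: 1 − a is decreasing in a and 1 − (1 − a)² is increasing, so the maximum of the two is minimised where they are equal, which happens at a = (3 − √5)/2.

```python
import numpy as np

# grid over the admissible ratios 0 < a <= 0.5
a = np.linspace(1e-6, 0.5, 200_001)
worst = np.maximum(1 - a, 1 - (1 - a) ** 2)   # longer of the two new lengths
a_star = a[np.argmin(worst)]                  # minimising ratio
quotient = worst.min()                        # guaranteed shrink factor per step
```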
With the computed value of the parameter a, the sequence of the lengths of the intervals l_end − l_beg occurring during the algorithm is bounded from above by a decreasing geometric series with the quotient (√5 − 1)/2, which is approximately 0.618. The optimality of this particular value of the parameter a lies in the fact that for any other value of a the sequence of the lengths of the intervals is bounded by a geometric series that decreases at a slower rate. For example, in the algorithm I proposed before, only one thing was guaranteed, namely that in every pair of steps the length of the interval is halved. This means, roughly speaking, that the sequence of lengths decreases at the rate of a geometric series with the quotient √2/2 ≈ 0.707.
I was greatly pleased that I managed to carry the solution of the task to such a concrete result, down to a number, as mathematicians say.

You are not the first to be pleased by this task. For several thousand years, since the time of ancient Greece, the ratio (√5 − 1)/2 has been attracting attention and is referred to as the golden section.
The number did remind me of something. The analysis was very interesting, but from a pragmatic standpoint it does not offer very much. You said in your lecture that a carelessly written program for this optimisation worked 100 times slower than one which was well thought out. Did you have in mind anything other than the analysis just completed?
You are right. We had in mind something quite different, but we are glad that
you have dragged us into the problem already dealt with.
Let us go back to our task in which, for a given group of vectors μʲ and matrices σʲ, j ∈ J, and for given vectors α and Δα, the parameter t is to be found which maximises the function f(t),

f(t) = min_{j∈J} ⟨α + t·Δα, μʲ⟩ / √⟨α + t·Δα, σʲ·(α + t·Δα)⟩ .   (5.89)
I hope I have understood your idea. I will denote by f(j, α) the function ⟨α, μʲ⟩ / √⟨α, σʲ·α⟩. For the given sequence of numbers t₀, t₁, ..., t_L, and for each index j ∈ J, the following six numbers are calculated,

sʲ = ⟨α, μʲ⟩ ,
s′ʲ = ⟨Δα, μʲ⟩ ,
Δsʲ = s′ʲ − sʲ ,
σ₀ʲ = ⟨α, σʲ·α⟩ ,
σ_Δʲ = ⟨Δα, σʲ·Δα⟩ ,
σ₀Δʲ = ⟨Δα, σʲ·α⟩ .   (5.90)

Having done that, the numbers f(j, α·(1 − t_l) + Δα·t_l) can be calculated as

f(j, α·(1 − t_l) + Δα·t_l) = ( sʲ + t_l·Δsʲ ) / √( (1 − t_l)²·σ₀ʲ + 2·t_l·(1 − t_l)·σ₀Δʲ + t_l²·σ_Δʲ ) .   (5.91)
Looking at it, we can see that all multi-dimensional operations are performed only in the calculation of the 6·|J| numbers according to the formula (5.90). This calculation does not depend on the index l, and thus it is performed outside the cycle in which the different indices l are tested. In testing the indices l, everything is calculated according to the formula (5.91), which no longer contains any multi-dimensional operations.
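The scheme can be sketched as follows; the variable names mirror (5.90) and are only illustrative.

```python
import numpy as np

def line_profile(alpha, dalpha, mus, sigmas, taus):
    """Evaluate f(t) = min_j f(j, alpha*(1-t) + dalpha*t) for the values in
    taus. Only the six scalars of (5.90) touch the multi-dimensional vectors;
    the loop over t uses the cheap formula (5.91)."""
    rows = []
    for mu, sig in zip(mus, sigmas):
        s = alpha @ mu                       # s^j
        ds = dalpha @ mu - s                 # delta s^j = s'^j - s^j
        s0 = alpha @ sig @ alpha             # sigma_0^j
        sdd = dalpha @ sig @ dalpha          # sigma_delta^j
        s0d = dalpha @ sig @ alpha           # sigma_0delta^j
        rows.append([(s + t * ds)
                     / np.sqrt((1 - t) ** 2 * s0
                               + 2 * t * (1 - t) * s0d
                               + t ** 2 * sdd)
                     for t in taus])
    return np.min(np.array(rows), axis=0)    # f(t) = minimum over j
```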
Is this where the hundredfold acceleration is achieved?
At the very beginning of our discussion you stated that a function of one variable could be numerically maximised without substituting a finite number of values for an infinite straight line. But I do not know how that could be done.
than ε. The advantage of dual methods compared to direct methods lies in the fact that the maximum value is estimated with a given accuracy. The advantage of direct methods is that it is the position of the maximum that is estimated with a given accuracy. The most important question in applying dual methods is certainly how easily one can find out whether the inequality f(t) ≥ u has a solution.

The function we intend to maximise is known well enough for us to solve the inequality f(t) ≥ u rather easily. Let us ask whether the inequality

min_{j∈J} ⟨α·(1 − t) + t·Δα, μʲ⟩ / √⟨α·(1 − t) + t·Δα, σʲ·(α·(1 − t) + t·Δα)⟩ ≥ c   (5.92)

has a solution, which is the same as asking about the solvability of the system

( sʲ + t·Δsʲ ) / √( (1 − t)²·σ₀ʲ + 2·t·(1 − t)·σ₀Δʲ + t²·σ_Δʲ ) ≥ c ,  j ∈ J ,  0 ≤ t ≤ 1 ,   (5.93)
where sʲ ≥ 0, Δsʲ, σ₀ʲ, σ₀Δʲ, σ_Δʲ are known numbers. We are interested in the system of inequalities (5.93) only for positive c, i.e., only for such values t at which the inequalities

sʲ + t·Δsʲ > 0   (5.94)

are valid for all j ∈ J. The inequalities from the system (5.94) for which Δsʲ ≥ 0 need not be taken into consideration because on the interval 0 ≤ t ≤ 1 they are always satisfied. Let us denote by J⁰ the set of indices j for which Δsʲ < 0 holds. The system (5.94) is equivalent to the system

sʲ + t·Δsʲ ≥ 0 ,  j ∈ J⁰ ,

or, finally,

t ≤ min_{j∈J⁰} ( −sʲ / Δsʲ ) .

We will denote

T = min( 1 , min_{j∈J⁰} ( −sʲ / Δsʲ ) ) .
We will take into consideration that now on the condition 0 :::; t :::; T in the
inequalities (5.93) no numerator is negative and rewrite (5.93) into the form
In this way the question whether the inequality (5.92) has a solution has been reduced to the solvability of the system (5.95), which contains quadratic inequalities of one variable $t$ only, under the sufficiently simple condition $0 \le t \le T$. The solvability of (5.95) can be easily verified. Each inequality in the first line of the system (5.95) can be rewritten into the form

$$A_j\cdot t^2 + B_j\cdot t + C_j \ \ge\ 0\,,$$

where the coefficients $A_j$, $B_j$, $C_j$ are calculated from the numbers $s_j$, $\Delta s_j$, $c$, $\sigma_j^2$, $\sigma_{j\Delta}$, $\sigma_\Delta^2$, which are already known. The system (5.95) will then assume the form of a system of such quadratic inequalities together with the condition $0 \le t \le T$.
And now it is the turn of another question inspired by the lecture. It was
said that in modern applied mathematics the methods of optimisation of non-
smooth or non-differentiable functions had been thoroughly examined and that
the fundamental concept was that of the generalised gradient. I completely
agree with you that these methods are not sufficiently known in the pattern
recognition sphere. I would like to learn more, at least about the most funda-
mental concepts, and see what the complete analysis of Anderson task would
look like if the results of the non-smooth optimisation were applied.
The main core of the theory of the non-differentiable optimisation can be rather
briefly explained, as it is with any significant knowledge. But this brief expla-
nation opens a wide space for thinking.
Let $X$ be a finite-dimensional linear space on which a concave function $f \colon X \to \mathbb{R}$ is defined. Let $x_0$ be a selected point and $g(x_0)$ such a vector that the function

$$f(x_0) + \langle g(x_0),\ x - x_0\rangle\,,$$

dependent on the vector $x$, is not less than $f(x)$ for any arbitrary $x \in X$. Then we can write

$$f(x_0) + \langle g(x_0),\ x - x_0\rangle \ \ge\ f(x)\,, \quad (5.97)$$
5.6 Discussion 195
where $g(x_i)$ is the generalised gradient of the function $f$ at the point $x_i$. In this case

$$\lim_{i \to \infty} f(x_i) \ =\ f^*\,,$$

if and only if there exists a 'horizontal' tangent hyperplane at that point. Note that even the formal proof of the theorem is rather brief. •
Let us now make use of the acquired knowledge for an analysis of our task.
Theorem 5.10 On the generalised gradient in a selected point. Let $\{f^j(x),\ j \in J\}$ be a set of concave differentiable functions, $x_0$ be a selected point, and $g^j(x_0)$, $j \in J$, be the gradient of the function $f^j(x)$ at the point $x_0$. Let $J_0$ be the set of indices $j$ for which there holds

$$f^j(x_0) \ =\ \min_{j' \in J} f^{j'}(x_0)\,, \quad (5.99)$$

and let $\gamma_j$, $j \in J_0$, be non-negative numbers the sum of which is 1. Then the convex combination $g_0 = \sum_{j \in J_0} \gamma_j \cdot g^j(x_0)$ is a generalised gradient of the function

$$f(x) \ =\ \min_{j \in J} f^j(x)$$

at the point $x_0$. And vice versa, if a vector $g_0$ is a generalised gradient of the function $f(x)$ at the point $x_0$ then the vector $g_0$ belongs to the convex hull of the vectors $g^j(x_0)$, $j \in J_0$. ▲
Proof. First, let us prove the first statement of the theorem. The assertion that the vectors $g^j(x_0)$, $j \in J_0$, are gradients of the functions $f^j$ means

or in another form

$$\ge\ \min_{j \in J} f^j(x) - \min_{j \in J} f^j(x_0) \ =\ f(x) - f(x_0)\,,$$
198 Lecture 5: Linear discriminant function
From the assumption (5.99) there follows that for $j \in J_0$ the value $f^j(x_0)$ does not depend on $j$ and is equal to $f(x_0)$. Thus

$$\min_{j \in J_0} f^j(x_0 + t\,x') - f(x_0) \ >\ \langle g_0,\ x_0 + t\,x'\rangle - \langle g_0,\ x_0\rangle\,. \quad (5.104)$$

All functions $f^j(x)$, $j \in J_0$, are continuous; therefore for an index $j \in J$, $j \notin J_0$, for which the equality $f^j(x_0) = \min_{j' \in J} f^{j'}(x_0)$ is not satisfied, neither will the equality $f^j(x_0 + t\,x') = \min_{j' \in J} f^{j'}(x_0 + t\,x')$ be satisfied, at least for small values of $t$. Thus
and g0 is not a generalised gradient of f(x). This proves the second statement
of Theorem 5.10. •
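Theorem 5.10 can be illustrated numerically. In the sketch below (a hypothetical toy example of our own choosing: two concave functions of one variable) every convex combination of the gradients of the active functions satisfies the subgradient inequality (5.97) at the point where both functions are active, while a vector outside their convex hull fails it:

```python
import numpy as np

# Two concave functions whose pointwise minimum f(x) = min(f1, f2)
# has both functions "active" at x0 = 1 (both equal -1 there).
f1 = lambda x: -x**2            # gradient at x0 = 1: -2
f2 = lambda x: -(x - 2)**2      # gradient at x0 = 1: +2
f = lambda x: np.minimum(f1(x), f2(x))

x0 = 1.0
g1, g2 = -2.0, 2.0              # gradients of the active functions
xs = np.linspace(-3.0, 5.0, 2001)

for gamma in (0.0, 0.3, 0.5, 1.0):
    g0 = gamma * g1 + (1 - gamma) * g2      # convex combination
    # subgradient inequality (5.97): f(x0) + g0*(x - x0) >= f(x) for all x
    ok = np.all(f(x0) + g0 * (xs - x0) >= f(xs) - 1e-9)
    print(gamma, bool(ok))                   # holds for every gamma in [0, 1]

# a vector outside the convex hull [-2, 2] violates (5.97) somewhere
print(bool(np.all(f(x0) + 3.0 * (xs - x0) >= f(xs) - 1e-9)))  # -> False
```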
If we now shut our eyes to the contradiction that the above theorems express the properties of concave functions, while the maximised function in the Anderson task is not concave, even though it is unimodal, we will see that Theorem 5.2 on the necessary and sufficient conditions of the maximum in the Anderson task, proved in the lecture, quite obviously follows from the theorems on generalised gradients presented now. Indeed, Theorem 5.10 claims that any generalised gradient of the function
gradient of the function
Theorem 5.9 claims that it is necessary and sufficient for the maximisation that some of the generalised gradients should be zero. This means that such positive values $\gamma_j$ are to exist that

$$\sum_{j \in J_0} \gamma_j \cdot g^j \ =\ 0\,.$$
In the lecture, the gradients $g^j$ were proved to be collinear with the position vectors of the contact points. Therefore the statement that the convex hull of the gradients $g^j$, $j \in J_0$, includes the coordinate origin is equivalent to the statement that the convex hull of the contact points contains the coordinate origin. The condition on the maximum, which we introduced at the lecture informally at first, and which was also proved there, could thus be derived as a consequence of Theorem 5.7 known from the theory of non-smooth optimisation.
Further on it can easily be seen that the actual maximisation procedure presented at the lecture is one of the possible variants of the generalised gradient ascent stated in Theorem 5.7. We require a direction to be sought in which the function $f(x) = \min_j f^j(x)$ grows, i.e., in which each function $f^j(x)$, $j \in J_0$, grows. This direction will be one of the possible generalised gradients. But not every generalised gradient has the property that motion in its direction guarantees the growth of all functions $f^j(x)$, $j \in J_0$. And this is why the recommendation resulting from the lecture is stricter than the recommendation to move in the direction of the generalised gradient. The general theory of non-smooth optimisation claims that such a recommendation is needlessly strict. The direction in which a point is to move in its next step when seeking the maximum can be given by any point from the convex hull of the gradients $g^j$, $j \in J_0$, and not only by the one which secures the growth of the function $f(x) = \min_{j \in J} f^j(x)$. Simply, it can be any of the gradients $g^j$, $j \in J_0$. The algorithm for solving the Anderson task could then have even the following very simple form.
The algorithm creates a sequence of vectors $\alpha_1, \alpha_2, \ldots, \alpha_t, \ldots$. If the vector $\alpha_t$ has already been created then any (!!!) $j$ is sought for which there holds
The contact point $x_0^t$ is calculated as well as the new position $\alpha_{t+1}$ of the vector $\alpha$ as

$$\alpha_{t+1} \ =\ \alpha_t + \gamma_t \cdot \frac{x_0^t}{|x_0^t|}\,,$$

where $\gamma_t$, $t = 1, 2, \ldots, \infty$, is a predetermined sequence of coefficients that satisfy the conditions
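The simple algorithm just described can be sketched on a toy problem (the functions below are our own illustration, not the Anderson task itself): maximise $f(x) = \min(x,\ 2 - x)$ by moving along the gradient of any one currently worst function with diminishing steps $\gamma_t = 1/t$. The growth is not monotonic, yet the iterates settle at the maximiser:

```python
def subgradient_ascent(x, steps=5000):
    """Maximise f(x) = min(x, 2 - x) by stepping along the gradient of
    whichever single function is currently the worst, with diminishing
    step sizes gamma_t = 1/t (toy version of the simple algorithm)."""
    for t in range(1, steps + 1):
        gamma = 1.0 / t
        if x <= 2 - x:          # f1(x) = x attains the minimum: gradient +1
            x += gamma
        else:                   # f2(x) = 2 - x attains the minimum: gradient -1
            x -= gamma
    return x

x_star = subgradient_ascent(0.0)
print(abs(x_star - 1.0) < 1e-2)   # -> True: converges to the maximiser x = 1
```

Notice that each step ignores one of the two functions entirely, exactly as the discussion below complains, and the iterates still converge because the step sizes shrink.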
As far as general concepts and theorems were discussed, everything seemed to me natural and understandable. Up to the moment when it was stated how a simple algorithm for solving the Anderson task resulted from the whole theory. This seems to me particularly incredible, including Theorem 5.7, which certainly is of fundamental significance.
I will show now why it seems incredible to me. Assume we want to maximise the function

$$f(x) \ =\ \min_{j \in J} f^j(x)$$

and we have reached a point where two functions, say $f^1$ and $f^2$, assume an equal value which is less than the values of all other functions. The aim is to change $x$ so that the function $f(x)$ is increased, and this means increasing both the functions $f^1$ and $f^2$. The algorithm following from the general theory, instead of taking both functions into consideration, deals with only one of them, trying to make it larger, and simply ignores the other. In such a case, however, the other function can even decrease, which means that the function $f$ can decrease as well. It is not only possible for such a situation to occur, it certainly will occur, since in approaching the maximum the number of functions which the algorithm should take into consideration grows. But the algorithm looks at one function only.
You are right, but not because Theorem 5.7 may be wrong. You claim that the algorithm of maximisation does not secure a monotonic growth, but Theorem 5.7 does not state so. It only says that the algorithm converges to a set of points in which the maximum is reached. For this convergence a monotonic growth is not necessary. We will examine, though not very strictly, the counterexample you quoted. When it has happened that in one of the steps, together with the growth of one function, say the function $f_1$, the other function $f_2$ has decreased (which can happen since the algorithm did not regard the function $f_2$ at all), then in the next step the function $f_1$ will not be taken into consideration; it may be just the function $f_2$ which will have the worst value, and the algorithm will have to consider it.
A precise proof of Theorem 5.7 must be damned complicated!
Well, now that you have persuaded me that Theorem 5.7 holds, I will dare to ask an impertinent question. Why did the theorem not become the basis of the lecture? Did you not explain the subject matter as if the theorem did not exist?
We would like to remind you that only a while ago you said that you did not believe what Theorem 5.7 states. Now you may already believe it, but that makes no difference. We did not intend to base our lecture on a theorem which we together did not completely understand. It would be a lecture based on trust in a theorem whose proof we did not present, and not based on knowledge.
Furthermore, earlier we shut our eyes to the fact that our task deals with the maximisation of a function which is not concave. Now we can open our eyes again and see that Theorem 5.7 by itself cannot be the basis for solving our task. The theorem may be worth generalising so that the sphere of its validity covers even the functions which occur in our task. But that would be quite another lecture.
You have already asked several questions, and some of them are fantastically extensive. Let us start with whether pattern recognition is art, science, or yet something else. This question is very extensive, and however seriously intended the answer might be, it will inevitably not be precise. The easiest way might be to do away with the question by saying that pattern recognition is what we are, together with you, dealing with just now and will be dealing with in our lectures, and not to go back to that question any longer. It is not a very clever answer, but as far as we know, other answers are not much cleverer.
And now as to your second question: what attitude are we to take to the fact that the fundamentals of pattern recognition involve knowledge that has been generally and long known in other spheres of applied mathematics? We should be pleased at it. It is only a pity that there are still far fewer concepts and procedures taken over and adopted by pattern recognition from other fields than we would have wished for.
And now to your last question. Let us again go over the result of the lecture you refer to. Let $X_1$ and $X_2$ be two finite sets of points in a linear space. A hyperplane is sought which separates the sets $X_1$ and $X_2$ and whose distance from the nearest points of both sets is the greatest. In searching for the hyperplane, the two least distant points $x_1$ and $x_2$ are to be found, the former belonging to the convex hull of the set $X_1$ and the latter belonging to the convex hull of the set $X_2$. The hyperplane sought is perpendicular to the straight line passing through the points $x_1$ and $x_2$ and lies halfway between those points.
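A rough numeric sketch of this construction, assuming tiny point sets so that a brute-force grid over convex combinations of vertex pairs is affordable (the sets and the grid resolution below are our own illustration, not an algorithm from the lecture):

```python
import itertools
import numpy as np

# Two small point sets whose convex hulls we want to separate.
X1 = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])
X2 = np.array([[2.0, 0.0], [2.0, 1.0], [3.0, 0.5]])

lam = np.linspace(0.0, 1.0, 51)
def hull_points(X):
    # convex combinations of pairs of vertices: covers all edges,
    # which is where the nearest points of 2-D hulls lie
    return np.array([a * p + (1 - a) * q
                     for p, q in itertools.product(X, X) for a in lam])

H1, H2 = hull_points(X1), hull_points(X2)
d = np.linalg.norm(H1[:, None, :] - H2[None, :, :], axis=2)
i, j = np.unravel_index(np.argmin(d), d.shape)
x1, x2 = H1[i], H2[j]                    # nearest points of the two hulls

w = x1 - x2                              # normal of the separating hyperplane
b = (x1 @ x1 - x2 @ x2) / 2              # so that it passes through the midpoint
print(float(d[i, j]))                    # -> 2.0 (distance between the hulls)
print(all(w @ x - b > 0 for x in X1),
      all(w @ x - b < 0 for x in X2))    # -> True True
```

The hyperplane $\langle w, x\rangle = b$ is exactly the perpendicular bisector of the segment $x_1 x_2$ described in the text.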
This result is known in convex analysis as the separability theorem. It is important that the theorem, after being transferred to pattern recognition, answered questions which had their origin in pattern recognition and in their original reading had nothing in common with the known theorem on convex linear separability. Thus there were no convincing answers to these questions at hand either.
The questions concern, e.g., recognition by virtue of the minimal distance from an exemplar. They will be stated in the way they were originally brought forth in pattern recognition.
Let $X_1$ and $X_2$ be two finite sets of points in a metric space $X$ with the Euclidean metric $d \colon X \times X \to \mathbb{R}$. Let the classifier operate in the following manner. It has in memory two exemplars (exemplar points) $a_1$ and $a_2$ which are points of the space $X$. The recognised point $x$ is placed in the first class if the distance between $x$ and $a_1$ is less than the distance between $x$ and $a_2$. In the opposite case the point is placed in the second class. The question is what the exemplars $a_1$ and $a_2$ are to be like to allow all objects from the set $X_1$ to be placed in the first class and all objects from the set $X_2$ in the second class.
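It is worth noting explicitly that the exemplar rule is a linear discriminant rule in disguise: squaring both distances and cancelling $|x|^2$ turns $d(x, a_1) < d(x, a_2)$ into $\langle a_1 - a_2,\ x\rangle > (|a_1|^2 - |a_2|^2)/2$. A quick numeric confirmation (the random exemplars and test points are our own illustration):

```python
import numpy as np

# The exemplar rule d(x, a1) < d(x, a2) equals a linear rule:
#   <a1 - a2, x>  >  (|a1|^2 - |a2|^2) / 2 .
rng = np.random.default_rng(0)
a1, a2 = rng.normal(size=3), rng.normal(size=3)
w = a1 - a2                           # normal of the equivalent hyperplane
theta = (a1 @ a1 - a2 @ a2) / 2       # its threshold

xs = rng.normal(size=(1000, 3))
by_distance = np.linalg.norm(xs - a1, axis=1) < np.linalg.norm(xs - a2, axis=1)
by_hyperplane = xs @ w > theta
agree = bool(np.all(by_distance == by_hyperplane))
print(agree)                          # -> True
```

This is why choosing the exemplars is the same problem as choosing a separating hyperplane.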
Even now, when pattern recognition has reached a certain degree of advancement, you can find different and not very convincing recommendations on how to choose the exemplars. For example, the exemplars $a_1$ and $a_2$ should be the 'least damaged', or in a sense the best, elements of the sets $X_1$ and $X_2$. Another time, even artificial exemplars are designed that suit the designer's conception of ideal representatives of the sets $X_1$ and $X_2$. Another inaccurate consideration leads to the requirement that the exemplar should lie, on average, in the position nearest to all elements of the set it represents. The exemplar is then determined as the arithmetic average of the elements, which corresponds to the centre of gravity of the set. We have seen several other unconvincing answers.
The answer has become convincing only after a precise formulation of the task and after applying Theorem 5.3 on the linear separability of sets. You may admit that the result is rather unexpected. For an exemplar, neither the best nor an average representative of a set is to be chosen. Just the opposite: for the exemplar of one set, such a point in the convex hull of the set is to be chosen that is nearest to the convex hull of the second set. If, for example, we wanted to recognise the two letters A and B then such a specimen of the letter A that best resembles the letter B should become the exemplar. In a sense it is the worst specimen. It also holds vice versa: the worst representative of the letters B, the one that best resembles the letter A, is chosen as the exemplar of the letter B.
This knowledge is new; from the point of view of pattern recognition it is nontrivial, and thus it would not have been revealed without applying the theorem on the linear separability of sets. In convex analysis, which is the home of the theorem, this knowledge was not revealed because the concepts which express it are quite alien to convex analysis. And thus only after the meeting of one field with the other did the new knowledge originate.
We would like you to notice that even in pattern recognition this adopted theorem borders on other questions, not only on those of the linear separability of sets. Through being applied in pattern recognition, the theorem has been enriched and has contributed to yet a third field. In pattern recognition, different non-linear transformations of the observation space are continually used, which are known as straightening of the feature space. The task of separating sets by means of a straight line then stops being different from separating sets by means of circles, parabolas, ellipses, etc. By taking over the theorem on the linear separability of sets, pattern recognition has broadened the sphere of its possible applications. For example, there are algorithms in computational geometry for seeking distances between convex hulls as well as for the linear separation of points in a plane. At the same time you can relatively often notice that even an experienced programmer using computational geometry does not come at once across the algorithm for separating sets by means of circles. You will rarely see a programmer who would not be at a loss if he was expected to separate two sets of points by means of an ellipse. But you will not be at a loss because, as we hope, you do not see any substantial difference between those tasks. And you do not see the difference just thanks to having used the theorem on separability in pattern recognition. And so the theorem on separability, after being used in pattern recognition, was enriched, and so enriched it returned to the scientific environment from which it had once come to pattern recognition.
We still do not dare to define seriously whether pattern recognition is a science, an art, or a collection of technical and mathematical tricks. But we dare to claim that hardly any field will be found in which knowledge from different spheres of applied mathematics finds common application as frequently as in pattern recognition. And therefore pattern recognition could become attractive not only for young researchers who are engaged in it, as you yourself are, but for everybody who wants to learn, quickly and from his or her own experience, all the 'charming features' necessary for work in applied informatics.
$$\sum_{x} p(x)\cdot x \ =\ \mu_j\,, \qquad \sum_{x} p(x)\cdot x\cdot x^T \ =\ \sigma_j\,, \qquad p(x) \ \ge\ 0\,,\ x \in X\,,$$

which means the probability of the event that a random vector $x$ with the probability distribution $p(x)$ will satisfy the inequality $\langle\alpha, x\rangle \le \theta$.
The task according to my wish tries to avoid the assumption of the Gaussian character of the random vector. A vector $\alpha$ and a number $\theta$ are to be found which minimise the value

$$\max_{j \in J}\ \max_{p \in P^j}\ c(\alpha, \theta, p)\,. \quad (5.105)$$
It occurred to me that this task was very similar to the task that had interested me after Lecture 3. There I tried to find a reasonable strategy that, contrary to the common procedure, is not based on the assumption of the independence of features, while at the same time nothing was known about the form of the dependence. With your help I saw at that time that the recognition strategy under unknown mutual dependence was different from the strategy which was correct on the condition of feature independence. After this lecture a similar question worries me: is the strategy solving the task (5.105) the same as the strategy solving the same task under the stricter assumption that the corresponding random vectors are Gaussian? Let me say it in different words: can I also use the solution of the Anderson task in a situation when I am not sure that the random vectors are Gaussian, and, moreover, when I know hardly anything about their distribution besides its mathematical expectation and covariance matrix?
Boy! We respect you for asking, from time to time, such deeply thought out and precisely formulated questions. To your question we have found a convincing and unambiguous answer. Your question was so well thought out that it does not seem to us that you did not already have an answer to it. All right, you must examine the function $\max_{p \in P^j} c(\alpha, \theta, p)$. When you see that the function decreases in a monotonic way as the ratio

$$\frac{\langle\alpha, \mu_j\rangle - \theta}{\sqrt{\langle\alpha,\ \sigma_j\cdot\alpha\rangle}}$$

increases, then the solution of the task (5.105) is identical with the solution of the Anderson task. When the function does not decrease in a monotonic way then further examination is necessary.
I see that, but I do not know how, and I have not made any progress in my research either. It seems too complicated to me.
Do not worry, and start analysing the question, for example, in the following formulation. Let the mathematical expectation of an $n$-dimensional random vector $x = (x_1, x_2, \ldots, x_n)$ be zero, i.e.,

$$\sum_{x \in X} p(x)\cdot x \ =\ 0\,, \quad (5.106)$$

and let

$$\sum_{x \in X} p(x)\cdot x\cdot x^T \ =\ \sigma\,. \quad (5.107)$$

You would like to know what the numbers $p(x)$, $x \in X$, are to be that satisfy the conditions (5.106), (5.107) and further the conditions

$$\sum_{x \in X} p(x) \ =\ 1\,, \qquad p(x) \ \ge\ 0\,,\ x \in X\,,$$
and maximise the number $c(\alpha, \theta, p)$ expressed by the equation (5.108). To get used to this task, look first at a one-dimensional case. What are the numbers $p(x)$ which for a one-dimensional variable $x$ maximise the number

$$\sum_{x \le -\theta} p(x) \quad (5.109)$$

under the conditions

$$\sum_{x \in X} p(x) \ =\ 1\,, \qquad \sum_{x \in X} p(x)\cdot x \ =\ 0\,, \qquad \sum_{x \in X} p(x)\cdot x^2 \ =\ \sigma\,, \qquad p(x) \ \ge\ 0\,,\ x \in X\,? \quad (5.111)$$
That is a linear programming task. Although the task has infinitely many variables, it is still solvable. It resembles the well-known Chebyshev problem, which differs from the case you have in mind only in that instead of (5.109) the sum

$$\sum_{|x| \ge \theta} p(x) \quad (5.110)$$

is maximised. For any $p$ satisfying the constraints (5.111) there holds

$$\sum_{|x| \ge \theta} p(x) \ =\ \frac{1}{\theta^2} \sum_{|x| \ge \theta} p(x)\cdot\theta^2 \ \le\ \frac{1}{\theta^2} \sum_{|x| \ge \theta} p(x)\cdot x^2 \ \le\ \frac{1}{\theta^2} \sum_{x \in X} p(x)\cdot x^2 \ =\ \frac{\sigma}{\theta^2}\,.$$
Let $\sigma/\theta^2 \le 1$ hold. Let us have a look at the following function $p^*(x)$:

$$p^*(x) \ =\ \begin{cases} \sigma/(2\theta^2)\,, & x = -\theta\,,\\ 1 - \sigma/\theta^2\,, & x = 0\,,\\ \sigma/(2\theta^2)\,, & x = \theta\,. \end{cases}$$

For the function $p^*(x)$ the sum (5.110) is $\sigma/\theta^2$, and for all other functions, thanks to (5.111), it is not greater. Therefore the maximal value of the sum (5.110) under the constraints (5.111) is $\sigma/\theta^2$. You can see then that the task of maximising the function (5.110) need not be difficult, even when it depends on an infinite number of arguments. Do you not think that Chebyshev mastered the task perfectly? Why could you not just as well first master the maximisation of (5.109) under the conditions (5.111), and then solve a multi-dimensional task?
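A tiny numeric check (the values of $\sigma$ and $\theta$ are our own choice) that the extremal distribution for the Chebyshev bound, with mass $\sigma/(2\theta^2)$ at $\pm\theta$ and the rest at 0, satisfies the moment constraints (5.111) and makes the sum (5.110) equal to $\sigma/\theta^2$:

```python
# sigma denotes the variance, as in the text; values are our own choice.
sigma, theta = 0.5, 1.5
m = sigma / (2 * theta**2)               # mass at each of -theta and +theta
p = {-theta: m, 0.0: 1 - 2 * m, theta: m}

mean = sum(q * x for x, q in p.items())
var = sum(q * x * x for x, q in p.items())
tail = sum(q for x, q in p.items() if abs(x) >= theta)   # the sum (5.110)
print(abs(mean) < 1e-12,
      abs(var - sigma) < 1e-12,
      abs(tail - sigma / theta**2) < 1e-12)   # -> True True True
```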
$$\begin{array}{rl} & \displaystyle \max \sum_{x \in X(\alpha,\theta)} p(x)\,, \\[1ex] \lambda_0: & \displaystyle \sum_{x \in X} p(x) \ =\ 1\,, \\[1ex] \lambda_1: & \displaystyle \sum_{x \in X} p(x)\cdot x \ =\ 0\,, \\[1ex] \lambda_2: & \displaystyle \sum_{x \in X} p(x)\cdot x\cdot x^T \ =\ \sigma\,, \\[1ex] & p(x) \ \ge\ 0\,, \quad x \in X\,. \end{array} \quad (5.112)$$

In the first line of (5.112) the function to be maximised is written. The designation $X(\alpha,\theta)$ in this objective function of the above linear task formulation means $X^-(\alpha,\theta) \cup X^0(\alpha,\theta)$. In the second line the constraint is stated to which the dual variable $\lambda_0$ corresponds. The third line briefly shows $n$ constraints related to $n$ dual variables which are denoted by the vector $\lambda_1$. The fourth line yields $n \times n$ constraints and the corresponding ensemble of dual variables is represented by the matrix $\lambda_2$.
The variables in the task (5.112) are the numbers $p(x)$, $x \in X$. To each such variable a constraint corresponds which the dual variable $\lambda_0$, the vector $\lambda_1$ and the matrix $\lambda_2$ must satisfy. To the variable $p(x)$, $x \in X(\alpha,\theta)$, the following constraint corresponds

(5.113)

and to the variable $p(x)$, $x \in X^+(\alpha,\theta)$, the following constraint corresponds

(5.114)

The constraints (5.113) and (5.114) are to be understood in such a way that the ensemble of dual variables defines on the space $X$ a quadratic function, which will be denoted by $F$, which must not be less than 1 on the set $X(\alpha,\theta)$ and must not be negative on the set $X^+(\alpha,\theta)$. As a whole, the function $F(x)$ is positive semi-definite.
I will analyse the problem (5.112) only for the situations in which the mathematical expectation of the random variable $x$ belongs to the set $X^+(\alpha,\theta)$. This means that $\theta < 0$.
Let $p^*$ be the solution of the task. In at least one point $x \in X^+(\alpha,\theta)$ there must be $p^*(x) \ne 0$, since in the opposite case

$$\Bigl\langle \alpha,\ \sum_{x \in X(\alpha,\theta)} p^*(x)\cdot x \Bigr\rangle \ =\ \sum_{x \in X(\alpha,\theta)} p^*(x)\cdot\langle\alpha, x\rangle \ \le\ \sum_{x \in X(\alpha,\theta)} p^*(x)\cdot\theta \ =\ \theta \ <\ 0 \quad (5.115)$$

would hold, which would contradict the constraint in the third line of (5.112). The point $x \in X^+(\alpha,\theta)$ for which $p^*(x) \ne 0$ holds will be denoted by $x_0$, and the scalar product $\langle\alpha, x_0\rangle$ will be denoted by the symbol $\xi$.
Now I will prove that the scalar product $\langle\alpha, x_1\rangle$ at an arbitrary point $x_1 \in X^+(\alpha,\theta)$, for which $p^*(x_1) \ne 0$ holds, is also $\xi$. In other words, all points $x_1 \in X^+(\alpha,\theta)$ for which $p^*(x_1) \ne 0$ holds lie in a hyperplane parallel to the hyperplane $\langle\alpha, x\rangle = \theta$, denoted $X^0(\alpha,\theta)$. I assume the contrary and consider a pair consisting of the hitherto point $x_0$ and another point $x_1$ such that there holds

$$p^*(x_0) \ne 0\,, \qquad p^*(x_1) \ne 0\,, \qquad \langle\alpha, x_0\rangle \ne \langle\alpha, x_1\rangle\,. \quad (5.116)$$

I will examine how the function $F(x)$ behaves on the straight line which passes through the points $x_0$ and $x_1$. This straight line is not parallel to the hyperplane $X^0(\alpha,\theta)$. Therefore there certainly exists a point $x_2$ which lies on that straight line and at the same time lies in the hyperplane $X^0(\alpha,\theta)$.
Thanks to the constraint (5.113), at this point there holds

$$F(x_2) \ \ge\ 1\,.$$
By the second duality theorem, the first and the second relations in (5.116) imply correspondingly

$$F(x_0) \ =\ 0\,, \qquad F(x_1) \ =\ 0\,.$$

Any convex combination of the points $x_0$ and $x_1$ belongs to the set $X^+(\alpha,\theta)$, and thus for each point $x^*$ of this segment

$$F(x^*) \ \ge\ 0$$

holds by (5.114). Thus on the segment passing through the points $x_0$ and $x_1$ the quadratic function $F$ must behave in the following way. Between the points $x_0$ and $x_1$ the function $F$ must not be negative, at the end points $x_0$ and $x_1$ it must be zero, and at a certain point $x_2$ outside the segment it must not be less than 1. Since such a quadratic function does not exist, the assumption (5.116) is not satisfied. Thus we have proved by contradiction that at all points $x$ in the set $X^+(\alpha,\theta)$ for which $p^*(x) \ne 0$ holds, the scalar product $\langle\alpha, x\rangle$ is the same and equal to the number which I denoted by $\xi$.
Let us assume that the statement (5.118) is wrong. Let for a point $x_1$ hold

(5.119)

I will examine how the function $F$ behaves on the straight line which passes through the points $x_1$ and $x_0$. For the point $x_0$, the existence of which I have already proved, there holds

(5.120)

This straight line must intersect the hyperplane $X^0(\alpha,\theta)$ at a point which will be denoted $x^*$. Resulting from (5.113) there holds

$$F(x^*) \ \ge\ 1\,.$$
Thanks to the second theorem on duality (Theorem 2.2), from the first relation in (5.119)

$$F(x_1) \ =\ 1$$

holds, and from the first relation in (5.120) there holds

$$F(x_0) \ =\ 0\,.$$

The function $F$ must behave on the examined straight line as follows. At the selected point $x_1$ the function $F$ assumes the value 1, at another point $x_0$ it assumes the value 0, and at the interjacent point $x^*$ it assumes a value that is not less than 1. Since no positive semi-definite quadratic form can behave in this way, the assumption (5.119) is not valid. Thus I have proved (5.117).
The set of points $x$ for which $\langle\alpha, x\rangle = \xi$ holds is a hyperplane $X^0(\alpha,\xi)$. Furthermore, I will denote

$$p_0 \ =\ \sum_{x \in X^0(\alpha,\theta)} p^*(x)\,, \qquad p_\xi \ =\ \sum_{x \in X^0(\alpha,\xi)} p^*(x)\,,$$

where $p^*$ represents, as before, the function which solves the task (5.112). The function $p^*$, therefore, must also satisfy the conditions of this task, from which there follows

$$\left.\begin{array}{rcl} p_0 + p_\xi &=& 1\,,\\ p_0\cdot\theta + p_\xi\cdot\xi &=& 0\,,\\ p_0\cdot\theta^2 + p_\xi\cdot\xi^2 &=& \langle\alpha,\ \sigma\cdot\alpha\rangle\,. \end{array}\right\} \quad (5.121)$$
I will show how (5.121) follows from the conditions (5.112). The first condition from (5.121) is quite evident because all points $x$ with $p^*(x) \ne 0$ lie either in $X^0(\alpha,\xi)$ or in $X^0(\alpha,\theta)$.
The equation $\sum_{x \in X} p(x)\cdot x = 0$ in (5.112) is only a brief representation of the $n$ equations

$$\sum_{x \in X} p^*(x)\cdot x_i \ =\ 0\,, \quad i = 1, 2, \ldots, n\,.$$

From the previous $n$ equations there follows that

$$0 \ =\ \sum_{x \in X} p^*(x)\cdot\langle\alpha, x\rangle$$

and further

$$0 \ =\ p_0\cdot\theta + p_\xi\cdot\xi\,.$$
and further

$$\langle\alpha,\ \sigma\cdot\alpha\rangle \ =\ \sum_{x \in X} p^*(x)\cdot\Bigl(\sum_i \alpha_i\cdot x_i\Bigr)^2 \ =\ p_0\cdot\theta^2 + p_\xi\cdot\xi^2\,.$$
The system (5.121) consists of three scalar equations only, in which the unknowns are the numbers $p_0$, $p_\xi$ and $\xi$. The system has a single solution, for which the following equality holds:

$$p_0 \ =\ \frac{\langle\alpha,\ \sigma\cdot\alpha\rangle}{\theta^2 + \langle\alpha,\ \sigma\cdot\alpha\rangle}\,.$$
The number $p_0$ is just the value of the sum $\sum_{x \in X(\alpha,\theta)} p(x)$ after substituting in it the function $p^*(x)$ which maximises the sum in our task, and therefore it is an explicit expression for the solution of the task (5.112). The result is the number

$$\frac{\langle\alpha,\ \sigma\cdot\alpha\rangle}{\theta^2 + \langle\alpha,\ \sigma\cdot\alpha\rangle}\,,$$

which decreases monotonically when the quantity

$$\frac{\theta}{\sqrt{\langle\alpha,\ \sigma\cdot\alpha\rangle}}$$

grows.
I then ask a half-hearted question. Have I enhanced Chebyshev inequality?
Of course you have not. The Chebyshev inequality cannot be enhanced because, as we have already seen, it defines the exact upper limit of a certain probability. This means that there even exists a random variable for which the Chebyshev inequality becomes an equality. The Chebyshev inequality estimates the probability of the two-sided inequality $|x - \mu| \ge \theta$, whereas your inequality estimates the probability of the one-sided inequality $x - \mu \le -\theta$. You have managed to prove that your estimate is also exact. You have not enhanced the Chebyshev inequality, but you have avoided its often wrong or inaccurate application. The first application of this kind is based on the correct estimate

$$P(x - \mu \le -\theta) \ \le\ P(|x - \mu| \ge \theta) \ \le\ \frac{\sigma}{\theta^2}\,,$$

which is not exact and instead of which you can now use an exact estimate on the basis of your inequality. The second common application is based on the following consideration,

$$P(x - \mu \le -\theta) \ \le\ \tfrac{1}{2}\,P(|x - \mu| \ge \theta) \ \le\ \frac{\sigma}{2\theta^2}\,,$$
which is not correct. The estimate which results from your considerations is

$$P(x - \mu \le -\theta) \ \le\ \frac{\sigma}{\theta^2 + \sigma}\,. \quad (5.122)$$

Since you have proved that in the relation (5.122) the equality can be attained, your estimate, as well as that of Chebyshev, cannot be enhanced.
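The tightness of this one-sided estimate can also be checked directly: a two-point distribution with mass $\sigma/(\theta^2 + \sigma)$ at $-\theta$ and the rest at $\sigma/\theta$ has mean 0 and variance $\sigma$, and puts exactly the bound's worth of probability at $-\theta$. A small sketch with values of our own choosing:

```python
# sigma denotes the variance, as in the text; values are our own choice.
sigma, theta = 1.0, 2.0
p_lo = sigma / (theta**2 + sigma)     # mass at -theta: the bound itself
x_hi = sigma / theta                  # the second support point
p = {-theta: p_lo, x_hi: 1 - p_lo}

mean = sum(q * x for x, q in p.items())
var = sum(q * x * x for x, q in p.items())
print(abs(mean) < 1e-12, abs(var - sigma) < 1e-12, p_lo)  # -> True True 0.2
```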
So we have together proved that all algorithms which were derived for the Anderson task on the assumption that the random variables were Gaussian can be used even in a case where the Gaussian assumption is not satisfied. But be careful and use this recommendation only in the sense that has been stated here. Use it particularly when you are not sure that the random variables are Gaussian and when you have doubts about other assumptions as well. If you know for certain that a random variable is not Gaussian and if, moreover, you are sure that it belongs to another class of random variables, and you know the class, even more effective algorithms can be created.
I thank you for your important advice, but I must say that other algorithms do not concern me much at the moment. I am interested in actual recommendations of what to do when I have two finite sets $X_1$ and $X_2$ and want to separate them by means of a hyperplane in a reasonable way. I see that here I can proceed in at least two directions.
In the first case I can try to separate the two sets by means of the Kozinec algorithm. If the number of points in the sets $X_1$, $X_2$ is very large then I can go in the other direction, i.e., calculate the vectors $\mu_1$, $\mu_2$ and matrices $\sigma_1$, $\sigma_2$. For searching for the separating hyperplane I can use the algorithms for solving the Anderson task. I can afford to do so because I have proved that for the correctness of such a procedure it is not necessary to assume the Gaussian character of the random variables. The Anderson task, which I use here to replace the original task, is quite simple because one operates with two classes only, $|J| = 2$.
Here, in practice, we do not recommend you anything. We would only like to remark that your being at a loss because you know several approaches to solving an application task is of a quite different character than if you knew none. The worst situation is when someone knows only one approach, and therefore uses it without any hesitation or doubt.
But here several approaches could be considered. I need not, for example, represent the whole set $X_1$ by means of a single pair $\mu_1$, $\sigma_1$, but I can divide it, in some reasonable way, into subsets and so express $X_1$ by means of more vectors and matrices. But I do not know how this division of a set into subsets can be done.
You are already interfering with the subject matter of our next lecture, where
these problems will be dealt with. We are sure that after the lecture you will
again have interesting comments.
July 1997.
5.7 Link to a toolbox
The public domain Statistical Pattern Recognition Toolbox was written by V. Franc as a diploma thesis in Spring 2000. It can be downloaded from the website http://cmp.felk.cvut.cz/cmp/cmp_software.html. The toolbox is built on top of Matlab version 5.3 and higher. The source code of the algorithms is available. The development of the toolbox continues.
The part of the toolbox which is related to this lecture implements linear discriminant functions, e.g., separation of finite point sets, the perceptron learning rule, the Kozinec algorithm, the ε-solution by the Kozinec algorithm, Support Vector Machines (linearly separable case), the Fisher classifier, a modified perceptron rule, a modified Kozinec algorithm, the generalised Anderson task, the original Anderson-Bahadur solution, and the ε-solution. Quadratic discriminant functions are implemented, too, through non-linear data mapping.
Another, but quite close, view of the decomposition of finite sets is repre-
sented by potential functions [Ajzerman et al., 1970]. The approach deals with
the decomposition of finite or infinite sets and uses nonlinear discriminant
functions. Essentially, the idea of feature space straightening, introduced in
Lecture 3, is generalised. The main idea of the potential function method is
that straightening can be used even if the straightened feature space has
infinite dimension, not only finite dimension. The discovery of the potential
function method is that the main question is whether the scalar product in the
straightened space can be computed constructively, not whether the straightened
space has finite or infinite dimension. The scalar product in the straightened
space is a function of two variables defined on the original space, and it is
just the potential function. In many applications the potential function can be
derived directly from the content of the problem being solved. The convergence
of the potential function method is proved similarly to the proof of the
Novikoff theorem. Other algorithms separating finite and infinite sets of
points were suggested by Kozinec [Kozinec, 1973] and Jakubovich [Jakubovich,
1966; Jakubovich, 1969].
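The kernel idea behind potential functions can be illustrated by a minimal sketch of a kernel perceptron: the separating function lives in the straightened space, but only values of the potential (kernel) function are ever computed. This is not the algorithm of [Ajzerman et al., 1970]; the Gaussian potential function, the error-driven update, and the toy data are our own choices.

```python
# Kernel (potential function) perceptron: a minimal sketch.
# The decision function f(x) = sum_i c_i K(x_i, x) lives in the
# straightened space; only the potential function K is evaluated.
import math

def gaussian_potential(x, y, sigma=1.0):
    """A common choice of potential function (Gaussian kernel)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def kernel_perceptron(points, labels, K, epochs=100):
    """Learn the coefficients c_i of f(x) = sum_i c_i K(points[i], x)."""
    c = [0.0] * len(points)
    for _ in range(epochs):
        errors = 0
        for j, (x, y) in enumerate(zip(points, labels)):
            f = sum(ci * K(xi, x) for ci, xi in zip(c, points))
            if y * f <= 0:          # misclassified: perceptron-style update
                c[j] += y
                errors += 1
        if errors == 0:
            break
    return c

# XOR-like data: not linearly separable in the original plane,
# but separable once straightened by the Gaussian potential function.
pts = [(0, 0), (1, 1), (0, 1), (1, 0)]
lab = [1, 1, -1, -1]
c = kernel_perceptron(pts, lab, gaussian_potential)
predict = lambda x: 1 if sum(ci * gaussian_potential(xi, x)
                             for ci, xi in zip(c, pts)) > 0 else -1
print([predict(p) for p in pts])   # → [1, 1, -1, -1]
```

The point of the sketch is that neither the straightened space nor its dimension appears anywhere in the code; only the potential function does.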
The transformation of the task of separating a set of points linearly and
optimally into a quadratic programming task was explained in the lecture. The
approach has roots in known results of Chervonenkis and Vapnik. Their method
of generalised portraits was published in [Vapnik and Chervonenkis, 1974].
The method is now well known in Western publications as the Support Vector
Machine [Boser et al., 1992; Vapnik, 1995; Vapnik, 1998].
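For reference, the quadratic programming formulation mentioned above can be sketched as follows for the linearly separable case (a standard textbook form, not a quotation from the lecture): find the weight vector w and threshold b solving

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\langle w, w\rangle
\quad\text{subject to}\quad
y_i\left(\langle w, x_i\rangle + b\right) \ge 1,
\qquad i = 1,\dots,m,
```

where y_i ∈ {+1, -1} marks the class of the point x_i; maximising the separating margin is equivalent to this minimisation of ⟨w, w⟩.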
The class of Fisher strategies, the synthesis of which can be reduced to the
synthesis of linear discriminant functions, was originally introduced in
[Fisher, 1936].
A modification of the perceptron and Kozinec algorithms for dividing infinite
sets and their use for the ε-solution of tasks is given in [Schlesinger et al.,
1981]. Let us note that the Kozinec algorithm can be extended to non-separable
data [Franc and Hlavac, 2001].
The mathematical basis for the non-smooth optimisation used in the discussion
has been taken over from [Shor, 1979; Shor, 1998]. Considerations on Fibonacci
numbers can be found in [Renyi, 1972].
Lecture 6
Unsupervised learning
device come into being, such as that in a living organism, and only then does
the researcher try to understand the purpose of this processing.
Rosenblatt accounts for the lack of understanding of the perceptron by stat-
ing that every supporter of a certain model, be it of the genotype or the
monotype kind, acts as if his/her model were the only possible one. According
to Rosenblatt, the perceptron is a significant example of genotype research.
The main reason for the lack of understanding of it is the effort to evaluate
it from the monotype point of view. The perceptron is neither a device for
image analysis nor a device for speech recognition. It is not even a device
that could be defined in the way a monotypist would expect, because the common
feature of all perceptrons (we consider a class of neural networks) is
expressed not by formulating the tasks being solved, but by describing their
construction. We will give the description in the original form in which the
perceptron was defined in early publications, with only as much detail as is
necessary for our lecture.
The perceptron is a device which consists of a certain set of elements re-
ferred to as neurons. Each neuron can be in either of two states: the
excited state or the inhibited state. The state of a neuron is unambigu-
ously determined by the image at the input of the perceptron. Let
the set of images which can occur at the input of the perceptron consist of
l images. This set can be classified into two classes in 2^l ways. Images of
the first class will be called positive, and images of the second class will
be called negative. There exists an authority, the teacher, who selects from
these 2^l classifications a single one which is then regarded as unchanging
further on. This classification is called the teacher's classification, or
the correct classification. In addition to the teacher's classification,
there are also other input image classifications which in the general case
need not be identical to the correct one. One of these classifications, which
will be called the perceptron classification, is implemented in the following
way. Each neuron is characterised by a numerical parameter called the weight
of the neuron. The weight can be any real number. The image at the perceptron
input is evaluated as positive or negative according to whether the sum of
the weights of the neurons excited by the image is positive or negative,
respectively.
The classification by the perceptron depends on the neuron weights. The
weights vary by means of reinforcement and inhibition of a neuron, which
represent the respective increase or decrease of a neuron weight by a certain
constant quantity Δ. On observing each image, a particular neuron is
reinforced or inhibited according to
• whether the given neuron was excited by the observed image;
• to which class the image was assigned by the teacher;
• to which class the image was assigned by the perceptron itself.
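The rule above is described only verbally in the early publications. A minimal sketch of one plausible reading (Algorithm 6.1 style, driven by the teacher's classification, with Δ = 1) might look as follows; the function names and the unconditional update are our own assumptions, not the original notation.

```python
# Perceptron weight correction driven by a classification signal.
# excited[i] is True if neuron i is excited by the current image;
# the image is evaluated as positive iff the sum of excited weights is positive.
DELTA = 1.0  # the constant correction quantity (Delta in the text)

def perceptron_class(weights, excited):
    """Evaluate the image: +1 (positive) iff the excited-weight sum is positive."""
    s = sum(w for w, e in zip(weights, excited) if e)
    return +1 if s > 0 else -1

def teacher_update(weights, excited, teacher_class):
    """Shift each excited neuron's weight by +Delta on a positive image
    and by -Delta on a negative one (teacher_class is +1 or -1)."""
    return [w + DELTA * teacher_class if e else w
            for w, e in zip(weights, excited)]
```

Replacing `teacher_class` by the perceptron's own output `perceptron_class(weights, excited)` turns this supervised rule into the self-driven variant.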
It can be seen that Algorithm 6.3 differs from Algorithm 6.1 only in that the
role of the teacher is performed by the perceptron itself. This difference is
substantial. To implement Algorithm 6.1 or Algorithm 6.2 the perceptron must
have two inputs: one for observing the input image and the other for the
information from the teacher. Algorithm 6.3 can be implemented without any
contact with the teacher. Therefore, if the first two algorithms, 6.1 and 6.2,
have been called perceptron learning, then the third, Algorithm 6.3, is named
unsupervised learning.
Rosenblatt's main premise was that a perceptron controlled by any of
the three presented algorithms reaches the state in which it classifies all
input images correctly.
It is hardly possible to describe the universal enthusiasm evoked by
Rosenblatt's first publications concerning these premises. With the perceptron
everything seemed marvellous: the charming simplicity of the algorithm,
the application of terms unusual in the computer technology of those times,
such as 'neuron' and 'learning', and the word 'perceptron' itself. A romantic
atmosphere was created, as when one seems to stand in front of an unlocked,
but still not open, door. Moreover, one is convinced that behind the door
something is waiting that has long been expected, even though one does
not yet know what, particularly, it will be. Should Rosenblatt's premise appear
correct, it would mean that the painful and tedious work of constructing
pattern recognition devices could easily be avoided. It would suffice to build
a perceptron and show it some examples of how to operate, and it would
proceed automatically. In addition, the assumed ability of the perceptron for
unsupervised learning would make correct performance possible even without
showing any examples to the perceptron: it would suffice for the perceptron
to examine the images it should recognise, and it would find out by itself
how to classify them.
As nearly always happens, only an insubstantial part of the outlined beau-
tiful fairy tale came true. The realisable part is expressed by the Novikoff
theorem, which was quoted and proved in the previous lecture. The Novikoff
theorem confirmed Rosenblatt's assumption for Algorithm 6.2, controlled both
by the teacher's and by the perceptron's classification. Let us note that the
theorem holds under the condition that an ensemble of neuron weights exists
in which the correct classification is realised.
For an algorithm controlled only according to the teacher's classification we
can easily prove that Rosenblatt's assumption is erroneous. Though it is not
the main purpose of this lecture, let us look at a simple counterexample which
proves Rosenblatt's assumption wrong.
Example 6.1 Rosenblatt Algorithm 6.1 need not converge to correct clas-
sification. Let the perceptron consist of three neurons. Each positive image
will excite all three neurons, and each negative image will excite either the
second or the third neuron only. This situation is favourable for the
perceptron, since there exists an ensemble of neuron weights w1, w2, w3 with
which the perceptron faultlessly classifies all the input images. Such weights
can be, for example, w1 = 3, w2 = -1, w3 = -1. The sum of the weights of the
neurons excited by a positive image will be +1, and that by a negative image
will be -1.
Assume that perceptron learning starts with exactly those weight values
with which the perceptron correctly recognises all input images. Let us see
what happens when the weights are changed after learning on n positive and n
negative images. During learning each neuron will be reinforced n times, while
the second neuron will be inhibited n2 times and the third neuron n3 times,
where n = n2 + n3. The neuron weights will be

w1 = 3 + n Δ ,
w2 = -1 + Δ (n - n2) ,
w3 = -1 + Δ (n - n3) .
The sum of the neuron weights excited by a positive image will be
1 + 2nΔ, which is a positive number for any n. This means that the recognition
of positive images does not deteriorate due to the corrections of the neuron
weights. But the sum of the neuron weights excited by a negative image is
either -1 + Δ(n - n2) or -1 + Δ(n - n3). The sum of these two numbers is
-2 + nΔ, which is a positive number for sufficiently great n. From this it
follows that at least one of the summands -1 + Δ(n - n2) or -1 + Δ(n - n3) is
positive. We can see that the perceptron, after learning, passed from the state
in which it had been correctly recognising all the images to a state in which
it does not recognise all of the images correctly. △
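The arithmetic of the example is easy to verify numerically; the following sketch encodes the weight evolution formulas directly (Δ = 1 and the concrete counts are our own illustrative choices):

```python
# Numeric check of Example 6.1: start from correct weights (3, -1, -1)
# and apply the teacher-driven corrections for n positive images and
# n2 + n3 = n negative images.
def weights_after(n, n2, n3, delta=1.0):
    assert n2 + n3 == n
    w1 = 3 + n * delta              # neuron 1: reinforced n times
    w2 = -1 + delta * (n - n2)      # reinforced n times, inhibited n2 times
    w3 = -1 + delta * (n - n3)      # reinforced n times, inhibited n3 times
    return w1, w2, w3

w1, w2, w3 = weights_after(n=10, n2=5, n3=5)
print(w1 + w2 + w3)   # sum excited by a positive image: 1 + 2*n*delta = 21.0
print(w2, w3)         # 4.0 4.0: both positive, so every negative image
                      # (exciting neuron 2 or 3) is now misclassified
```

As the lecture argues, positive images stay correctly classified while at least one of w2, w3 eventually turns positive, breaking the classification of negative images.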
Figure 6.1: (a) Varying classifier. (b) Learning as an estimate of the parameter a.
nothing. And as frequently happens, only when the experiments had led to
this negative result did it become clear that the result could not have been
different. When the perceptron has reached the state in which it places all
the images in one class, say the positive class, then no subsequent
unsupervised learning can alter the situation. After processing a new image,
the weights of the excited neurons only rise, and thus the sum of the weights
of the excited neurons rises as well. An image so far evaluated as positive
will be classified as positive henceforth.
That was the first part of our discourse on how the procedure known in
pattern recognition as unsupervised learning was gradually created and
then ended in such disillusionment. In spite of this, ideas were present there
which have become an important constituent of the present-day understanding
of unsupervised learning. These ideas, however, were unnecessarily encumbered
with neural, physiological, or rather pseudo-physiological considerations.
From the present-day point of view it may seem to have been done on purpose,
to hopelessly block the way to reasonable outcomes. We will clear away from
the idea all that seems useless in the context of this lecture and present the
definition of a certain class of algorithms, a definition which is not very
strict. We will call the class Rosenblatt algorithms out of deep respect for
the author of the perceptron.
Let X and K be two sets whose members are the observations x and the
recognition results k. The function q: X → K is considered to be the
recognition strategy. Further, let Q be a set of strategies. Let a device
(see Fig. 6.1a) be assigned to implement each strategy from the set Q. The
particular strategy the device is performing at a given moment is determined
by the value of a parameter a which is fed to the input assigned to it. The
observations are fed to another input.
of algorithms to which the term unsupervised learning applies, since the data
which would come from the teacher in the process of learning is created by the
classifier itself. Rosenblatt's contribution to the theory of unsupervised
learning can be evaluated, from the present point of view, as the design of a
class of algorithms which, in his opinion, solves some intuitively understood
tasks. These tasks have not been precisely formulated so far.
In the meantime, in the scientific disciplines neighbouring pattern
recognition, tasks were formulated that had remained without solution for a
long time. Only when these tasks were adopted by pattern recognition theory
was a solution hatched within its framework, in the form of algorithms that
are quite close to Rosenblatt's algorithms. But let us postpone this until
Section 6.6.
R(q, α) = α ∫_{x: q(x)=2} 1/√(2π) · e^{-(x-1)²/2} dx
        + (1-α) ∫_{x: q(x)=1} 1/√(2π) · e^{-(x+1)²/2} dx .
By the symbol q(α) a Bayesian strategy will be denoted which could be created
if the probability α were known and the probability of the wrong decision
were minimised. This means that

q(α) = argmin_{q ∈ Q} R(q, α) ,

where Q is the set of all possible strategies. Let q* denote the strategy
which decides that the object is in the first state if x ≥ 0, and in the
opposite case decides for the second state. Even though it is rather obvious
that the strategy q* is a minimax one, and in this sense an optimal one, we
will briefly prove this statement.
The probability of the wrong decision R(q*, α) is

R(q*, α) = α ∫_{-∞}^{0} 1/√(2π) · e^{-(x-1)²/2} dx
         + (1-α) ∫_{0}^{∞} 1/√(2π) · e^{-(x+1)²/2} dx
         = α ∫_{-∞}^{-1} 1/√(2π) · e^{-x²/2} dx
         + (1-α) ∫_{1}^{∞} 1/√(2π) · e^{-x²/2} dx .

Since the integrals ∫_{-∞}^{-1} 1/√(2π) · e^{-x²/2} dx and
∫_{1}^{∞} 1/√(2π) · e^{-x²/2} dx
are the same (they are approximately equal to 0.16), the number R(q*, α) does
not depend on the probability α, and thus it holds
which is strictly satisfied in the case examined. By joining the equality (6.1)
and the inequality (6.2) together, we obtain

max_α R(q, α) > max_α R(q*, α) ≈ 0.16 ,
which means that the strategy q* has the following two features.
1. The strategy q* makes a wrong decision about the state k with probability
0.16, independently of what the a priori probabilities of the states are.
2. No other strategy has this property; moreover, for every other strategy
there exist a priori probabilities of the states for which the probability
of the wrong decision is greater than 0.16.
In the above sense the strategy q* is optimal in a situation in which the a
priori probabilities of the object states are arbitrary. However, if the a
priori probabilities were known, i.e., the probability α was known, then the
strategy q(α) could be built in the following form. The strategy decides for
the first state when

(α/√(2π)) · e^{-(x-1)²/2} ≥ ((1-α)/√(2π)) · e^{-(x+1)²/2} ,
x, that would not use the explicit knowledge of the a priori probabilities,
but yield wrong answers with probability 0.16 only in the case in which the a
priori probabilities are the worst ones? Is it possible to attain better
quality recognition when reality is better than the worst case? Or, asked
quite sharply now: is there a possible strategy which would be no worse than
the Bayesian strategy, but would not be based on complete knowledge of the
a priori probabilities of the states?
The answer to this question, in either form, is undoubtedly negative. The
Bayesian strategy q(α) substantially depends on the a priori probabilities
(as is evident, e.g., from expression (6.3)). Therefore a strategy which is
independent of the a priori probabilities cannot be identical with all
Bayesian strategies, which do depend on the a priori probabilities.
Notice that the reasoning by virtue of which we arrived at the negative
answer is nearly the same as the reasoning used in proving that the perceptron
by itself was not able to arrive at a correct classification. In both cases
the negative answer is based on the evident fact that a constant object can
never be identical with some other object that is varying. The significance
of Robbins' approach, which Neyman [Neyman, 1962] ardently valued as a
breakthrough in the Bayesian front, is that he did not try to give a positive
answer to a question for which a negative answer was nearly evident. Instead,
he changed the question, from the practical point of view rather slightly, but
in such a way that the reply to the modified question ceased to be so
obviously negative.
Imagine that the task of estimating the state of an object is to be solved
not only once, but many times, at some n moments i = 1, 2, ..., n. Assume as
well that the state of the object is random at these moments. The sequence
k1, k2, ..., kn consists of mutually independent random elements ki. The
probability of the event ki = 1 is equal to the probability α, which we do
not know. But the a priori probability α is known to be the same for all
moments of observation. Let us also imagine that the classifier need not
decide about the state k1 at once at the first moment, in which only the
first observation x1 is known. The decision can be delayed and made only when
the entire sequence x1, x2, ..., xn is available. In this case the state k1 is
evaluated not on the basis of the single observation x1, but on the basis of
the entire sequence x1, x2, ..., xn. In the same way, if the entire
observation sequence is known, then a decision can be made about the state k2
at the second moment, and then about the states k3, k4, etc. In this case we
are not concerned with strategies of the form X → K, but with strategies of a
more general form Xⁿ → Kⁿ.
Let us now repeat the old question, modified after H. Robbins: Is there
a strategy of the form Xⁿ → {1, 2}ⁿ which does not use the a priori
probabilities of the states, and is no worse than the Bayesian strategy of
the form X → {1, 2} which uses such information? Now it is not at all evident
that the answer to the question in this formulation must be negative.
Robbins proposed a specific strategy which proves that, at least for large
values of n, the answer to this question is positive. The strategy decides for the
state 1 if
x_i ≥ (1/2) ln( (n - Σ_{i=1}^{n} x_i) / (n + Σ_{i=1}^{n} x_i) ) ,   (6.4)
and in the opposite case it decides for the state 2. In our case, for n → ∞
the strategy (6.4) becomes arbitrarily close to the strategy (6.3), in the
sense that the random quantity on the right-hand side of (6.4) converges to
the constant on the right-hand side of (6.3). Indeed, it holds that
lim_{n→∞} ln( (n - Σ_{i=1}^{n} x_i) / (n + Σ_{i=1}^{n} x_i) )
  = lim_{n→∞} ln( (1 - (1/n) Σ_{i=1}^{n} x_i) / (1 + (1/n) Σ_{i=1}^{n} x_i) )
  = ln( (1 - lim_{n→∞} (1/n) Σ_{i=1}^{n} x_i)
      / (1 + lim_{n→∞} (1/n) Σ_{i=1}^{n} x_i) ) .   (6.5)
Robbins explains that the strategy (6.4) is the result of double processing of
the input observation sequence, which yields information of two different
types. On the one hand, each observation x_i is a random variable which
depends on the state k_i of the object at the i-th moment and provides
certain information on that state. On the other hand, the sequence
x1, x2, ..., xn as a whole is a sequence of random samples of the population.
The probability distribution p_X on this population depends on the a priori
probabilities p_K(1) = α and p_K(2) = 1 - α, since
p_X(x) = p_K(1)/√(2π) · e^{-(x-1)²/2} + p_K(2)/√(2π) · e^{-(x+1)²/2} .
Therefore the sequence x1, x2, ..., xn provides certain information on the
unknown a priori probabilities too. The sequence is to be processed in two
passes:
1. The a priori probabilities, which are not known in advance, are estimated.
2. The result of this more or less approximate estimate is used for the
decision on the states k1, k2, ..., kn with the help of the Bayesian strategy,
as if the estimated values of the a priori probabilities were the true ones.
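The two-pass idea can be simulated directly. The sketch below draws the x_i from the mixture of N(1, 1) and N(-1, 1) with an unknown prior α and applies the strategy (6.4); the sample size, the seed, and α = 0.8 are our own illustrative choices.

```python
# Simulation of the Robbins strategy (6.4) on the Gaussian example:
# x_i ~ N(+1, 1) with probability alpha (state 1), else N(-1, 1) (state 2).
import math
import random

random.seed(0)
alpha, n = 0.8, 20000
states = [1 if random.random() < alpha else 2 for _ in range(n)]
xs = [random.gauss(+1 if k == 1 else -1, 1.0) for k in states]

# First pass: the whole sequence determines the decision threshold of (6.4).
s = sum(xs)
threshold = 0.5 * math.log((n - s) / (n + s))

# Second pass: every x_i is classified against that threshold.
decisions = [1 if x >= threshold else 2 for x in xs]

# For comparison: the limit value of the threshold, cf. (6.5) and (6.3),
# which would require knowing alpha.
bayes_threshold = 0.5 * math.log((1 - alpha) / alpha)
errors = sum(d != k for d, k in zip(decisions, states)) / n
print(threshold, bayes_threshold, errors)
```

For large n the data-driven threshold is close to the Bayesian one, even though α was never given to the strategy.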
The strategy (6.4) was formed on the basis of an indirect evaluation of the
a priori probabilities. It starts from the fact that the mathematical
expectation E(x) of the observation x uniquely determines the unknown
probabilities.
The Robbins task (6.6) remained unsolved for rather a long time. Its exact
solution was found only in pattern recognition, by means of clever algorithms
which rather strongly resemble Rosenblatt's algorithms and which were proposed
in pattern recognition as a model of unsupervised learning.
Before introducing and proving algorithms which solve the Robbins task in
this lecture, we would like to say that Robbins' approach should by no means
be considered in a limited manner. Our explanation set forth just a
recommendation for a concrete situation with one-dimensional Gaussian random
quantities, which Robbins described only as an illustration of his approach.
The formulation (6.6) itself is far more general. Robbins' approach,
understood in a far more general way, can be used even in situations which,
formally speaking, do not belong to the framework of the formulation (6.6).
We will present an example of such a situation. For this purpose we will
repeat all of Robbins' considerations that led to the formulation (6.6), and
then formulate the task which formalises Robbins' approach in its full
generality.
Let us go back to the example with one-dimensional Gaussian variables. This
time let us assume that the a priori probability of each state is known and
has the value 0.5 for each state. On the other hand, the conditional
mathematical expectations under the condition that the object is in one or
the other state are not known. It is only known that:
• if the object is in the first state then the mathematical expectation lies
in the interval from 1 to 10;
• if the object is in the second state then the mathematical expectation is
a number in the interval from -1 to -10.
In this example, as well as in the previous one, the best strategy in the
minimax sense is the one which decides q*(x) = 1 if x ≥ 0, and q*(x) = 2 in
the opposite case. The strategy q* is oriented to the worst case, in which
the first mathematical expectation is 1 and the second is -1. The actual
situation, however, can be better than the worst one. The first mathematical
expectation can be, for example, 1 and the second -10. For this case a far
better strategy than q* can be devised. This possibility can be used if we do
not start recognising the observations x immediately, but wait for some time
until a sufficient number of observations has been accumulated. The
accumulated observations x are first used for a more exact estimate of the
mathematical expectations, and only then are the observations classified,
all at once.
It is possible to create further, more complex examples, since the 'break-
through in the Bayesian front' has already set in. The tasks that occur in all
situations of this kind are expressed in the following generalised form.
Let X and K be two sets. Their Cartesian product forms the set of values of
a random pair (x, k). The probability of the pair (x, k) is determined by the
function p_XK: X × K → ℝ. The function p_XK is not known, but a set P which
contains the function p_XK is known. In addition, a sequence x1, x2, ..., xn
of random, mutually independent observations x is known. Its probability
distribution is

p_X(x) = Σ_{k∈K} p_XK(x, k) .
The objective is to find the distribution p*_XK which belongs to the set
P and for which the occurrence probability of the abovementioned multi-set
x1, x2, ..., xn is the greatest, i.e.,

p*_XK = argmax_{p_XK ∈ P} Σ_{i=1}^{n} log Σ_{k∈K} p_XK(x_i, k) .
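For the one-dimensional Gaussian example used throughout the lecture, the maximised criterion can be evaluated directly. In the sketch below, the family P is (as our own illustrative assumption) parametrised by the unknown prior α alone, and the maximisation is a crude grid search.

```python
# Log-likelihood of an observed sequence under the mixture
# p_X(x) = sum_k p_XK(x, k), here the one-dimensional Gaussian example
# with unknown prior alpha (illustrative choice of the family P).
import math

def log_likelihood(xs, alpha):
    total = 0.0
    for x in xs:
        p1 = alpha * math.exp(-0.5 * (x - 1) ** 2) / math.sqrt(2 * math.pi)
        p2 = (1 - alpha) * math.exp(-0.5 * (x + 1) ** 2) / math.sqrt(2 * math.pi)
        total += math.log(p1 + p2)   # log of the marginal p_X(x_i)
    return total

# The sought distribution maximises this criterion over the set P;
# here the search over alpha is a simple grid.
xs = [1.2, 0.8, 1.5, -0.9, 1.1]
best = max((a / 100 for a in range(1, 100)),
           key=lambda a: log_likelihood(xs, a))
print(best)
```

Note that the states k_i never enter the criterion: only the marginal distribution of the observations is fitted, which is exactly what makes the task an unsupervised one.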
had not been well set up. Recall that the perceptron was not able to operate
in the case in which the two mentioned algorithms were applied in supervised
learning. Classification and learning algorithms of that sort continued to be
objects of experiments until an applicable classifier was successfully
created. As will be seen, the experiments were worth making, because they
successfully resulted in a new algorithm for unsupervised learning which is
known under the name ISODATA.
Let us look at the following statistical model of an object, which is perhaps
one of the simplest. Let the object be in one of two possible states with the
same probabilities, i.e., K = {1, 2}, p_K(1) = p_K(2) = 0.5. Let x be an
n-dimensional vector of features. As long as the object is in the first
state, x is a random Gaussian vector with mathematical expectation μ1. The
components of the vector x are mutually independent, the variance of each
component being 1. The vector x has the same properties under the condition
that the object is in the second state, with the single difference that its
mathematical expectation is μ2 in this case.
The Bayesian strategy, which on the basis of the observation x decides on
the state k with the least probability of the wrong decision, selects the
first state if

(x - μ1)² ≤ (x - μ2)² ,   (6.8)

and in the opposite case it selects the second state. This strategy will be
denoted by the symbol q. If the mathematical expectations μ1 and μ2 are not
known, then the strategy is considered a varying strategy q(μ1, μ2). It is
fully determined once the vectors μ1 and μ2 are defined.
The learning algorithm has to find the maximum likelihood estimates μ1*,
μ2* of the mathematical expectations μ1, μ2 on the basis of the multi-set
(x1, k1), (x2, k2), ..., (xm, km) of random pairs (x, k) with the probability
density

p_XK(x, k) = p_K(k) · (1/(√(2π))ⁿ) · e^{-(x-μ_k)²/2} .

The estimate is

(μ1*, μ2*) = argmax_{(μ1,μ2)} Σ_{i=1}^{m} -(x_i - μ_{k_i})²
           = argmin_{(μ1,μ2)} ( Σ_{i∈I1} (x_i - μ1)² + Σ_{i∈I2} (x_i - μ2)² ) .   (6.9)
The authors of the ISODATA algorithm, Ball and Hall [Ball and Hall, 1967],
joined the strategy (6.8) and the learning algorithm (6.9) in such a way that
they created the following unsupervised learning algorithm. The algorithm
examines the multi-set of input observations x1, x2, ..., xn many times. The
parameters μ1 and μ2 change after each pass through the data in the general
case, and thus the results of the classification change as well.
The values μ1⁰ and μ2⁰ can be nearly arbitrary before the first pass (the
right-hand superscript denotes the number of the iteration). The only
requirement is that the strategy q(μ1⁰, μ2⁰) must not place all observations
of the input sequence into one class. The initial values can, for example, be
μ1⁰ = x1 and μ2⁰ = x2. If after examination step number (t-1) the vectors μ1
and μ2 assumed the values μ1^(t-1) and μ2^(t-1), then in step t of the
analysis two procedures are performed.
1. Classification. The observation x_i, i = 1, ..., n, is placed into the
first class if

(x_i - μ1^(t-1))² ≤ (x_i - μ2^(t-1))² ,

and into the second class in the opposite case. The result of the procedure
is the decomposition of the set of indices i = 1, 2, ..., n into two subsets
I1^(t) and I2^(t).
2. Learning. New values μ1^(t) and μ2^(t) are calculated as the averages of
the vectors included in the first and second classes, that is,

μ_k^(t) = (1/|I_k^(t)|) · Σ_{i∈I_k^(t)} x_i ,   k = 1, 2.
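The alternation of the two procedures can be sketched compactly for the scalar two-class case; the data, the initialisation, and the stopping rule below are our own illustrative choices, not taken from the lecture.

```python
# ISODATA-style unsupervised learning for two classes (scalar observations):
# alternate classification by the nearer mean with re-estimation of the means.
def isodata(xs, mu1, mu2, max_iter=100):
    for _ in range(max_iter):
        # Classification step: each x goes to the class with the nearer mean.
        I1 = [x for x in xs if (x - mu1) ** 2 <= (x - mu2) ** 2]
        I2 = [x for x in xs if (x - mu1) ** 2 > (x - mu2) ** 2]
        if not I1 or not I2:      # the strategy must not empty one class
            break
        # Learning step: new means are the class averages.
        new1, new2 = sum(I1) / len(I1), sum(I2) / len(I2)
        if (new1, new2) == (mu1, mu2):   # fixed point: a solution of (6.10)
            break
        mu1, mu2 = new1, new2
    return mu1, mu2

xs = [0.9, 1.1, 1.3, 0.8, -1.0, -1.2, -0.7, -1.1]
mu1, mu2 = isodata(xs, xs[0], xs[4])   # nearly arbitrary initial values
print(mu1, mu2)
```

On this well-separated toy sample the iteration stops after the partition stabilises, with the two means close to the two cluster centres.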
To create and experimentally examine this algorithm required courage, since
after the unsuccessful experiments with unsupervised learning in the
perceptron, the problems of unsupervised learning seemed never to recover.
Courage was rewarded at last by the following experimentally attained
positive results.
1. The algorithm converges after a finite number of steps, independently of
the initial values μ1⁰ and μ2⁰, to a state in which the values μ1 and μ2 do
not change. The values μ1 and μ2 and the result of the classification I1 and
I2 are the solution of the system of equations
I1 = { i | (x_i - μ1)² ≤ (x_i - μ2)² } ,   I2 = { i | (x_i - μ1)² > (x_i - μ2)² } ,
μ_k = (1/|I_k|) · Σ_{i∈I_k} x_i ,   k = 1, 2.   (6.10)
2. The system of equations (6.10) can have more than one solution. For one of
them the classification I1, I2 of the input sequence x1, x2, ..., xn is
rather close to the one which would be attained in supervised learning. We
will call it a good solution. Unfortunately, the algorithm does not
necessarily converge to the good solution. But if the length n of the
sequence is large enough, as well as the ratio of the distance |μ1 - μ2| to
the variance, then convergence of the algorithm to the good solution becomes
quite probable.
The algorithm quoted finds, at least in the abovementioned particular case,
the correct classification of the input observations solely on the basis of
an analysis of those observations, without any additional information from
the teacher. Unfortunately, the algorithm could not be enhanced or
generalised to such an extent as to yield similarly desirable results in
other cases, for example, when the a priori probabilities of the states are
not equal, or when the variances are not the same, etc. But it has been found
that even without enhancement the system of equations (6.10) is usable. It
expresses a certain understandable task which is, however, completely
different from the task of recognising the unknown state k of the object on
the basis of the observations x. The formulated task belongs to a class of
tasks of the following form.
Let x be a random object from the set of objects X with the probability
distribution p_X: X → ℝ. Let D be a set whose elements will be referred to
as decisions, such that to each object x ∈ X a certain decision d ∈ D is
assigned. Let W: X × D → ℝ be a penalty function whose value W(x, d)
represents the loss in the case in which the decision d is chosen for the
object x. Note that this penalty function is of quite a different form than
the penalty function in Bayesian recognition tasks. With a firmly chosen
decision d, the penalty here depends on the known observation x and not on an
unobservable state. The term unobservable state does not occur in this
construction at all.
Assume preliminarily that the aim of the task is to construct the strategy
q: X → D which minimises the value

Σ_{x∈X} p_X(x) · W(x, q(x)) ,

i.e., the mathematical expectation of the losses. It can be seen that the
task in this formulation is solved by the strategy

q(x) = argmin_{d∈D} W(x, d) .
The strategy q(x) does not depend on the probability distribution p_X, and
for some forms of the penalty function W it can be found easily. But the
task becomes substantially more complicated if the strategy q must satisfy a
constraint of the following form. The strategy may assume only an assigned
number of values on the set X, say, only two values d1 and d2. These values
are, unfortunately, not known beforehand, and are to be chosen in an optimal
way. The strategy q should not, therefore, have the form q: X → D, but is
given by a mapping q: X → {d1, d2}. If the values d1, d2 were known then the
strategy q* would decide q*(x) = d1 if W(x, d1) ≤ W(x, d2), and q*(x) = d2 in
the opposite case. The penalty would evidently be equal to

Σ_{x∈X} p_X(x) · min( W(x, d1), W(x, d2) ) .
If the values d1, d2 are not known, then the aim is to find the values d1*
and d2* which minimise the value

Σ_{x∈X} p_X(x) · min( W(x, d1), W(x, d2) ) .   (6.11)

The solution of this task lies not only in the found decisions d1* and d2*,
but also in the decomposition of the set X into the subsets X1 and X2. The
decomposition means that afterwards all objects of the same set Xi will be
handled in the same way, and thus a certain diversity of the objects can be
ignored. Let us see two examples of such tasks.
Example 6.2 Locating water tanks. Let us imagine a village in which the
water supply has been damaged and must be compensated for by several
drinking-water tanks. Let x be the location of a house in the village, X be
the set of points at which the houses are placed, and p_X(x) be the number of
inhabitants of the house placed at the point x. Let W(x, d) evaluate the loss
of an inhabitant of the house at point x who fetches water from the tank at
point d.
The strategy q: X → D which, without additional constraints, should set down
the position q(x) of the water tank for the inhabitants of the house at point
x would be simple: for each x the position of the tank q(x) would be
identical with x. This is quite natural, since it would be best for each
inhabitant to have the tank next to his/her house.
If, however, the constraint had to be taken into account that the whole
locality has only two water tanks at its disposal, then a far more difficult
task of the sort of (6.11) would arise. First a decision must be made at
which points d1 and d2 the tanks are to stand. Then an easier task is to be
solved which classifies the locality into two subclasses, so that each house
is assigned to one of the two tanks. △
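The two-tank task of the example can be solved by brute force if, as a simplification of ours, the candidate tank positions are restricted to the house locations and W is taken to be the squared distance:

```python
# Two-tank placement (Example 6.2 style), solved by brute force over pairs
# of candidate positions restricted, for illustration, to house locations.
from itertools import combinations

def total_loss(houses, population, d1, d2, W):
    # Every inhabitant fetches water from the nearer tank:
    # the per-house penalty is min(W(x, d1), W(x, d2)).
    return sum(population[x] * min(W(x, d1), W(x, d2)) for x in houses)

def place_two_tanks(houses, population, W):
    return min(combinations(houses, 2),
               key=lambda pair: total_loss(houses, population, *pair, W))

W = lambda x, d: (x - d) ** 2        # illustrative penalty: squared distance
houses = [0.0, 1.0, 2.0, 10.0, 11.0]  # house positions along one street
population = {0.0: 3, 1.0: 2, 2.0: 1, 10.0: 4, 11.0: 4}
d1, d2 = place_two_tanks(houses, population, W)
print(d1, d2)
```

The minimisation of (6.11) and the induced split of X into the two subsets served by d1 and d2 happen simultaneously, which is exactly the character of the clustering task described above.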
In these examples, as well as in the general formulation (6.11), we can see
that the task is composed in such a way that its solution leads to a useful
and reasonable classification of the observation set X into subsets. In this
respect the task resembles the tasks of statistical pattern recognition,
where the final result is also a classification of the observation set. But
the nature of the mentioned tasks is substantially different.
In the task of statistical pattern recognition the existence of a certain factor
k is assumed which affects the observed parameters x. The factor is not, how-
ever, immediately observable, but by means of a demanding provision one can,
sooner or later, obtain some knowledge about it. The classification which is the solution of this kind of task can be regarded as good or bad according to the extent to which a correct estimate is obtained of what is not known but actually exists.
In the task presented now in the form (6.11) and illustrated by the two
examples no actually existing factor is assumed. The aim of the classification
is completely different. For the user's convenience a rather large set of objects
is classified into subsets so that afterwards the objects from the same subset
are managed in the same way. As a result, the differences between some objects are ignored and the quality of managing the objects becomes worse. The problem is to classify the objects so that the loss in quality is as small as possible.
Informally, tasks of this form have long been known as clustering tasks, taxonomy tasks, classification tasks, etc. The formulation (6.11) can be regarded as a formal expression of such tasks. The ISODATA algorithm solves a particular case of this task in which W(x, d) = (x − d)².
The experimentally discovered fact that clustering algorithms approach the Bayesian classification of input observations under some conditions (e.g., in the case of Gaussian shapes of conditional probability distributions) is a mere coincidence. The situation would be the same if someone tried to create an algorithm for calculating the function log₂ x, but by chance had an algorithm implementing the function √x at his/her disposal. The first experiments, in which the algorithm was used for values not much different from 4, where √4 = 2 = log₂ 4, would encourage him/her. But later it would be found that this algorithm could be regarded neither as an approximate nor as an exact calculation of the function log₂ x, but that it was a precise algorithm for calculating a quite different function, even if that function was also necessary and perhaps beautiful.
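Read as code, the special case W(x, d) = (x − d)² with two decisions is just a weighted two-centre k-means iteration. The following sketch is our own illustration of the water-tank example; the function name, the toy data and the fixed iteration count are not from the lecture:

```python
def two_tank_clustering(points, weights, iters=50):
    """Alternately (1) assign each house to the nearer tank and
    (2) move each tank to the weighted mean of its houses.
    This is the ISODATA-style special case W(x, d) = (x - d)**2."""
    d1, d2 = points[0], points[1]  # initial tank positions
    for _ in range(iters):
        # decomposition step: split X into the subsets X1 and X2
        x1 = [(x, w) for x, w in zip(points, weights) if (x - d1) ** 2 <= (x - d2) ** 2]
        x2 = [(x, w) for x, w in zip(points, weights) if (x - d1) ** 2 > (x - d2) ** 2]
        # decision step: the weighted mean minimises sum of w * (x - d)**2
        if x1:
            d1 = sum(w * x for x, w in x1) / sum(w for _, w in x1)
        if x2:
            d2 = sum(w * x for x, w in x2) / sum(w for _, w in x2)
    return d1, d2

houses = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]  # positions x of the houses
people = [3, 2, 4, 5, 1, 2]                  # p_X(x), inhabitants per house
print(two_tank_clustering(houses, people))
```

Each pass alternates the two decisions of the task: the decomposition of X into X1 and X2, and the choice of d1 and d2 within each subset.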
If we denote by the symbol L(m) the logarithm of the number l(m), we obtain

L(m) = \sum_{i=1}^{n} \log p_K(k_i) + \sum_{i=1}^{n} \log p(x_i, a_{k_i}) .
The information on the states k1, k2, …, kn received from the teacher can be expressed by an ensemble of numbers α(i, k), i = 1, …, n, k ∈ K. Here α(i, k) is equal to 1 if k = k_i and is equal to 0 in any other case. Using the notation introduced here, the function L(m) assumes the form

L(m) = \sum_{i=1}^{n} \sum_{k \in K} \alpha(i,k) \log p_K(k) + \sum_{i=1}^{n} \sum_{k \in K} \alpha(i,k) \log p(x_i, a_k) . \qquad (6.13)
The learning task is understood as the search for a model m = ((p_K(k), a_k), k ∈ K) in which the maximum of the logarithm of probability L(m) is achieved. The values p_K(k), k ∈ K, must of course satisfy the condition that their sum \sum_{k \in K} p_K(k) is 1. It is not difficult to see (we will return to it anyhow) that the best estimate of the probability p_K(k) is

p_K(k) = \frac{\sum_{i=1}^{n} \alpha(i,k)}{n} , \quad k \in K , \qquad (6.14)

and the task of maximising L(m) with respect to the ensemble (a_k, k ∈ K) is reduced to |K| independent tasks. The following value is to be maximised in each of them,

\sum_{i=1}^{n} \alpha(i,k) \log p(x_i, a_k) .
a_k = \mathop{\mathrm{argmax}}_{a} \sum_{i=1}^{n} \alpha(i,k) \log p(x_i, a) . \qquad (6.15)
Note that the learning task formulated in this way is defined not only for the case of a fully informed teacher (supervisor) who at each observation x_i correctly indicates the actual state of the object. The values α(i, k) need not always be ones or zeroes; they can be any real numbers within the interval 0 ≤ α(i, k) ≤ 1 whose sum \sum_{k \in K} \alpha(i,k) is 1 for each i. In this manner one can express the information from an incompletely informed teacher (supervisor) who does not know the actual state of the object precisely, but who knows, from some source, the probabilities of each state. And this is information of the same form as that provided by a recognition device as defined in the previous paragraph.
m^* = \bigl( p_K^*(k), a_k^* \mid k \in K \bigr)

which maximises the expression (6.17). This means that the probability of occurrence of the observed multi-set x1, x2, …, xn is maximised.

In the particular case in which only maximum likelihood estimates of the a priori probabilities p_K(k) are to be found, the task being formulated becomes Robbins task.
6.6 Unsupervised learning algorithms and their analysis
\alpha^t(i,k) = \frac{p_K^t(k) \, p(x_i, a_k^t)}{\sum_{k' \in K} p_K^t(k') \, p(x_i, a_{k'}^t)} \qquad (6.18)

for each i = 1, 2, …, n and each k ∈ K. These are the numbers which would be calculated in the recognition stage, provided the values p_K^t(k) and a_k^t were the actual values. The numbers α^t(i, k) resemble the a posteriori probabilities of the state k under the condition of observing the signal x_i from the presented multi-set. However, they are not the a posteriori probabilities, because one cannot claim with certainty that the values p_K^t(k) and a_k^t are the actual a priori probabilities and values of the parameter a. Nevertheless, in the second stage of the (t+1)-th algorithm iteration a new model m^{t+1} is to be calculated by means of the already quoted learning algorithm (6.14) and (6.15), in such a way as if the numbers α^t(i, k) were actual probabilities α(i, k) provided by the teacher, i.e.,

p_K^{t+1}(k) = \frac{1}{n} \sum_{i=1}^{n} \alpha^t(i,k) , \quad k \in K , \qquad (6.19)

a_k^{t+1} = \mathop{\mathrm{argmax}}_{a} \sum_{i=1}^{n} \alpha^t(i,k) \log p(x_i, a) , \quad k \in K . \qquad (6.20)
It can easily be seen that the described algorithm is very similar to Rosenblatt's algorithms, though it also markedly differs from them. Both Rosenblatt's algorithms and the algorithms described here are arranged as a multiple repetition of recognition and learning. For learning, it is not the data obtained from the teacher that is used, but the results of one's own recognition. The difference lies in that in the described algorithm both recognition and learning are considered in a somewhat wider sense than in Rosenblatt's algorithms. Recognition is not strictly considered as a unique inclusion of the observation into just one class. The algorithm behaves as if it breaks each observation into parts proportional to the numbers α^t(i, k) and then includes the observation x_i partly into one class and partly into another. In a similar way, the concept of learning, i.e., the maximum likelihood estimation of unknown parameters of a random variable, is modified. Unlike learning in Rosenblatt's algorithms, it is not necessary to know exactly in which state the object was during the observation x_i. It is sufficient to know only the probability of this or that state. These differences are substantial for the success of the unsupervised learning algorithm described and will be presented in the following explanation.
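The alternation just described can be written as a short loop in which the learning step is a pluggable routine solving (6.20). Everything below (the function names, the model as a pair of dictionaries, the toy Gaussian data) is our own illustrative scaffolding, not part of the lecture:

```python
import math

def recognition(model, xs, p_joint):
    """Recognition step (6.18): compute the weights alpha(i, k)
    from the current model m^t = (p_K, params)."""
    pK, params = model
    alphas = []
    for x in xs:
        scores = {k: pK[k] * p_joint(x, params[k]) for k in pK}
        z = sum(scores.values())
        alphas.append({k: s / z for k, s in scores.items()})
    return alphas

def unsupervised_learning(model, xs, p_joint, learn_param, iters=30):
    """Alternate the recognition step (6.18) with the learning steps
    (6.19)-(6.20); learn_param(xs, weights) must solve the auxiliary task
    argmax_a sum_i weights[i] * log p(x_i, a)."""
    pK, params = model
    n = len(xs)
    for _ in range(iters):
        alphas = recognition((pK, params), xs, p_joint)
        pK = {k: sum(a[k] for a in alphas) / n for k in pK}                     # (6.19)
        params = {k: learn_param(xs, [a[k] for a in alphas]) for k in params}   # (6.20)
    return pK, params

# toy usage: two unit-variance Gaussians; the constant 1/sqrt(2*pi) cancels in (6.18)
p_joint = lambda x, mu: math.exp(-(x - mu) ** 2 / 2)
wmean = lambda xs, w: sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)  # solves (6.20) here
xs = [-1.2, -1.0, -0.8, 3.8, 3.9, 4.0, 4.1, 4.2]
pK, params = unsupervised_learning(({1: 0.5, 2: 0.5}, {1: xs[0], 2: xs[3]}), xs,
                                   p_joint, wmean)
print(pK, params)
```

The point of writing `learn_param` as a parameter is exactly the generality stressed above: the outer loop is fixed once and for all, while the auxiliary learning task changes with the statistical model.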
We think it necessary to point out the immense generality of both the formulation of the task and the algorithm for its solution. We will see that the properties of the algorithm are proved either under not very restrictive premises, or even without any additional premises. This concerns the basic relation between three extensive classes of tasks: recognition itself, supervised learning and unsupervised learning. Thanks to the fact that this relation is expressed in an illustrative form, which is easily fixed in one's memory, its discovery belongs to the most significant outcomes not only in pattern recognition, but also in the modern analysis of statistical data. The theory of pattern recognition thus shows a certain maturity when it no longer merely absorbs the outcomes of neighbouring scientific fields, but is able to enrich its neighbouring disciplines with results of its own.
\sum_{i=1}^{n} \alpha_i \log x_i \;\le\; \sum_{i=1}^{n} \alpha_i \log \frac{\alpha_i}{\sum_{j=1}^{n} \alpha_j} , \qquad (6.21)

and the equality holds only when x_i = \alpha_i / \sum_{j=1}^{n} \alpha_j for all i. ▲
Proof. We denote F(x_1, x_2, …, x_n) = \sum_{i=1}^{n} \alpha_i \log x_i and find for which values x_i the function F reaches its maximum under the condition \sum_i x_i = 1. Since the function F is concave, the point x_1, x_2, …, x_n in which the maximum is reached is the solution of the system of equations

\frac{\partial F}{\partial x_i} + \lambda = \frac{\alpha_i}{x_i} + \lambda = 0 , \quad i = 1, \ldots, n , \qquad \sum_{i=1}^{n} x_i = 1 . \qquad (6.22)

From the first n equations there follows x_i = -\alpha_i / \lambda, and after substitution into the last equation,

\sum_{j=1}^{n} x_j = - \frac{\sum_{j=1}^{n} \alpha_j}{\lambda} = 1 , \qquad \lambda = - \sum_{j=1}^{n} \alpha_j .

The result is

x_i = \frac{\alpha_i}{\sum_{j=1}^{n} \alpha_j} . \qquad (6.23)

It can be seen that the system of equations (6.22) has only one solution. The resulting point x_i given by the equation (6.23) is the single point in the hyperplane \sum_{i=1}^{n} x_i = 1 where the maximum is reached. ■
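Lemma 6.1 is easy to check numerically. The following few lines are our own sanity test, not part of the lecture; they compare the claimed maximiser (6.23) with random points of the simplex:

```python
import math
import random

random.seed(0)
alpha = [2.0, 1.0, 3.0]                    # the given non-negative weights
best = [a / sum(alpha) for a in alpha]     # the claimed maximiser (6.23)
F = lambda x: sum(a * math.log(xi) for a, xi in zip(alpha, x))

for _ in range(1000):
    raw = [random.random() + 1e-12 for _ in alpha]
    x = [r / sum(raw) for r in raw]        # a random point of the simplex
    assert F(x) <= F(best) + 1e-12         # (6.23) is never beaten

print("maximum of F attained at", best)
```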
The decomposition (6.27) of the function L(m) into three summands is valid for any numbers α(i, k) which satisfy the constraint (6.26), and thus also for the numbers α^t(i, k). We write this decomposition for the numbers L(m^t) and L(m^{t+1}). In both cases the same coefficients α^t(i, k) will be used. We change the order of addition \sum_{i=1}^{n} and \sum_{k \in K} in the first and second summands and obtain
L(m^t) = \sum_{k \in K} \sum_{i=1}^{n} \alpha^t(i,k) \log p_K^t(k) + \sum_{k \in K} \sum_{i=1}^{n} \alpha^t(i,k) \log p(x_i, a_k^t) - \sum_{i=1}^{n} \sum_{k \in K} \alpha^t(i,k) \log \alpha^t(i,k) , \qquad (6.28)

L(m^{t+1}) = \sum_{k \in K} \sum_{i=1}^{n} \alpha^t(i,k) \log p_K^{t+1}(k) + \sum_{k \in K} \sum_{i=1}^{n} \alpha^t(i,k) \log p(x_i, a_k^{t+1}) - \sum_{i=1}^{n} \sum_{k \in K} \alpha^t(i,k) \log \alpha^{t+1}(i,k) . \qquad (6.29)
Owing to (6.19) and Lemma 6.1 there holds

\sum_{k \in K} \sum_{i=1}^{n} \alpha^t(i,k) \log p_K^t(k) \;\le\; \sum_{k \in K} \sum_{i=1}^{n} \alpha^t(i,k) \log p_K^{t+1}(k) . \qquad (6.30)

This means that the first summand on the right-hand side of (6.28) is not greater than the first summand on the right-hand side of (6.29).
According to definition (6.20) there holds

\sum_{i=1}^{n} \alpha^t(i,k) \log p(x_i, a_k^t) \;\le\; \sum_{i=1}^{n} \alpha^t(i,k) \log p(x_i, a_k^{t+1}) , \quad k \in K .
If we sum these inequalities over all values k ∈ K we obtain

\sum_{k \in K} \sum_{i=1}^{n} \alpha^t(i,k) \log p(x_i, a_k^t) \;\le\; \sum_{k \in K} \sum_{i=1}^{n} \alpha^t(i,k) \log p(x_i, a_k^{t+1}) , \qquad (6.31)

which means that the second summand on the right-hand side of (6.28) is not greater than the second summand on the right-hand side of (6.29).
Because \sum_{k \in K} \alpha^t(i,k) = 1 and owing to Lemma 6.1, the following inequality holds,

\sum_{i=1}^{n} \sum_{k \in K} \alpha^t(i,k) \log \alpha^t(i,k) \;\ge\; \sum_{i=1}^{n} \sum_{k \in K} \alpha^t(i,k) \log \frac{p_K^{t+1}(k) \, p(x_i, a_k^{t+1})}{\sum_{k' \in K} p_K^{t+1}(k') \, p(x_i, a_{k'}^{t+1})} . \qquad (6.33)

This means that the summand which enters the right-hand side of (6.28) with a negative sign is not less than the corresponding summand of (6.29), and thus after the change of sign the third summand of (6.28) is not greater than the third summand of (6.29).
The inequality L(m^t) ≤ L(m^{t+1}) is a quite evident consequence of the inequalities (6.30), (6.31) and (6.33). ■
We can see that during unsupervised learning the logarithm of probability L(m) grows monotonically. The consequence is that repeated recognition accompanied by unsupervised learning on the basis of one's own recognition will not deteriorate the knowledge expressed by the initial model m^0. Just the opposite, the knowledge is enhanced in a sense and leads to the enhancement of recognition. It follows from the chain L(m^0) < L(m^1) < … < L(m^t) < L(m^{t+1}), valid as long as m^{t+1} ≠ m^t, that the sequence L(m^t) converges as t → ∞, since L(m) cannot take positive values. But this does not imply that the sequence of models m^t converges as t → ∞. Moreover, at the level of generality of our analysis, the very statement that the sequence of models m^t converges is not sufficiently understandable.
Until now no metric properties of the parametric set of models have been assumed. Therefore it has not yet been defined what convergence of models means. A parameter of the model can be, for example, a graph, and thus it

|\alpha|^2 = \sum_{i} \sum_{k} \bigl( \alpha(i,k) \bigr)^2 .
the unsupervised learning. Let α* be a limit point of this sequence. Then the sequence S(α^i), i = 1, 2, …, also converges, namely towards S(α*).

The ensemble α which satisfies the condition α = S(α) will be referred to as a fixed point of unsupervised learning. The validity of the further analysis is constrained by another important premise, that the number of fixed points of unsupervised learning is finite. These two premises suffice for the sequence α^1, α^2, …, α^t, … to converge towards a fixed point during unsupervised learning. To prove this statement we will need a number of auxiliary statements.
Lemma 6.2 (Kullback) Let α_i and x_i, i = 1, 2, …, n, be positive numbers for which there holds \sum_{i=1}^{n} \alpha_i = \sum_{i=1}^{n} x_i = 1. In this case

\sum_{i=1}^{n} \alpha_i \ln \frac{\alpha_i}{x_i} \;\ge\; \frac{1}{2} \sum_{i=1}^{n} ( x_i - \alpha_i )^2 . \qquad (6.34)

▲
Proof. Let δ_i = x_i − α_i, i = 1, …, n. Let us define two functions φ and ψ dependent on the variable γ which assumes the values 0 ≤ γ ≤ 1,

\varphi(\gamma) = \sum_{i=1}^{n} \alpha_i \ln \frac{\alpha_i}{\alpha_i + \gamma \delta_i} , \qquad \psi(\gamma) = \frac{1}{2} \sum_{i=1}^{n} \delta_i^2 \gamma^2 .

It is obvious that φ(0) = ψ(0) = 0. For all i and γ the inequalities 0 < α_i + γδ_i < 1 hold, except for the case n = 1, which is, however, trivial. Let us state the derivatives of the functions φ and ψ with respect to the variable γ. Since \sum_{i=1}^{n} \delta_i = 0, the following relations are valid for the derivative dφ/dγ,

\frac{d\varphi(\gamma)}{d\gamma} = - \sum_{i=1}^{n} \frac{\alpha_i \, \delta_i}{\alpha_i + \gamma \delta_i} + \sum_{i=1}^{n} \delta_i = \sum_{i=1}^{n} \frac{\gamma \, \delta_i^2}{\alpha_i + \gamma \delta_i} \;\ge\; \sum_{i=1}^{n} \gamma \, \delta_i^2 = \frac{d\psi(\gamma)}{d\gamma} ,

where the inequality follows from α_i + γδ_i < 1. Together with φ(0) = ψ(0) = 0 this yields φ(1) ≥ ψ(1), which is the claimed inequality (6.34). ■
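The inequality (6.34) can also be checked numerically. The snippet below is our own sanity test, not part of the lecture; it draws random pairs of distributions:

```python
import math
import random

random.seed(1)
for _ in range(1000):
    raw_a = [random.random() + 1e-9 for _ in range(4)]
    raw_x = [random.random() + 1e-9 for _ in range(4)]
    a = [r / sum(raw_a) for r in raw_a]    # the numbers alpha_i, summing to 1
    x = [r / sum(raw_x) for r in raw_x]    # the numbers x_i, summing to 1
    lhs = sum(ai * math.log(ai / xi) for ai, xi in zip(a, x))
    rhs = 0.5 * sum((xi - ai) ** 2 for ai, xi in zip(a, x))
    assert lhs >= rhs - 1e-12              # inequality (6.34)

print("inequality (6.34) held in 1000 random trials")
```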
In Kullback's Lemma 6.2 a logarithm with natural base was used because the proof is more concise. Further on, we will use logarithms of other bases, which will not bring about difficulties, since for an arbitrary logarithm base b there holds

\log_b x = c \ln x , \quad c = \frac{1}{\ln b} > 0 .

The chosen base of the logarithm affects only the constant c. In this sense we will refer to Kullback's lemma in proving the following lemma.
Lemma 6.3 Let α^t and α^{t+1} be two ensembles obtained in the steps t and (t+1) of the unsupervised learning algorithm, respectively. In this case the number

\sum_{i} \sum_{k} \bigl( \alpha^t(i,k) - \alpha^{t+1}(i,k) \bigr)^2 \qquad (6.35)

converges towards zero as t → ∞. ▲
Proof. The sequence L(m^t) rises monotonically and does not assume positive values. Therefore it is sure to converge, which means that the difference L(m^{t+1}) − L(m^t) converges towards zero,

\lim_{t \to \infty} \bigl( L(m^{t+1}) - L(m^t) \bigr) = 0 . \qquad (6.36)

Let us write the expression for L(m^{t+1}) − L(m^t) in more detail, using the somewhat modified decompositions (6.28) and (6.29) into which we have substituted from (6.18),

L(m^{t+1}) - L(m^t) = \Bigl[ \sum_{i} \sum_{k} \alpha^t(i,k) \log \frac{p_K^{t+1}(k) \, p(x_i, a_k^{t+1})}{p_K^t(k) \, p(x_i, a_k^t)} \Bigr] + \Bigl[ \sum_{i} \sum_{k} \alpha^t(i,k) \log \frac{\alpha^t(i,k)}{\alpha^{t+1}(i,k)} \Bigr] .

We can see that the difference L(m^{t+1}) − L(m^t) consists of the two summands closed in large brackets in the last step of the preceding derivation. Both summands are non-negative. The non-negativeness of the former follows immediately from the definitions (6.19) and (6.20), and the non-negativeness of the latter is proved by Lemma 6.2. Since the sum of two non-negative summands converges towards zero, each of these two summands converges towards zero as well. It is important from our point of view that the second summand converges towards zero,

\lim_{t \to \infty} \sum_{i} \sum_{k} \alpha^t(i,k) \log \frac{\alpha^t(i,k)}{\alpha^{t+1}(i,k)} = 0 .

Owing to Lemma 6.2 (applied for each i to the numbers α^t(i, k) and α^{t+1}(i, k), k ∈ K), this sum is, up to a positive constant, an upper bound of the number (6.35), and thus (6.35) converges towards zero as well.
■
Lemma 6.4 Let the function S(α) be a continuous function of the ensemble α, and let α^{t(1)}, α^{t(2)}, …, α^{t(j)}, … be a convergent subsequence of the sequence α^1, α^2, …, α^t, … created during unsupervised learning,

\lim_{j \to \infty} \alpha^{t(j)} = \alpha^* .

Then α* is a fixed point of unsupervised learning, i.e., S(α*) = α*. ▲

Proof. Let the sequence S(α^{t(1)}), S(α^{t(2)}), …, S(α^{t(j)}), … be set up from those points that in the sequence α^1, α^2, …, α^t, … immediately follow after the elements α^{t(1)}, α^{t(2)}, …, α^{t(j)}, …. The limit of the selected sequence S(α^{t(j)}), j = 1, 2, …, is also α*,

\lim_{j \to \infty} S\bigl( \alpha^{t(j)} \bigr) = \alpha^* , \qquad (6.37)

since Lemma 6.3 claims that \lim_{t \to \infty} | \alpha^t - S(\alpha^t) | = 0. The premise of the lemma being proved is the continuity of the mapping S, and thus it follows from (6.37) that S(α*) = α*. ■
Lemma 6.5 Let Ω be the set of fixed points of the algorithm, and let \min_{\alpha^* \in \Omega} | \alpha - \alpha^* |^2 be the distance of the point α from the set Ω. If the function S(α) is continuous then

\lim_{t \to \infty} \; \min_{\alpha^* \in \Omega} | \alpha^t - \alpha^* |^2 = 0 . \qquad (6.38)

▲
Proof. Assume that the relation (6.38) is not correct. Let us write the formal meaning of the relation (6.38), then the meaning of its negation, and what results from it. The relation (6.38) is a concisely written statement that for every ε > 0 there exists a t₀ such that \min_{\alpha^* \in \Omega} | \alpha^t - \alpha^* |^2 < \varepsilon holds for all t > t₀. The negation (6.39) of this statement means that there exist an ε > 0 and an infinite subsequence α^{t(1)}, α^{t(2)}, …, α^{t(j)}, … such that

\min_{\alpha^* \in \Omega} \bigl| \alpha^{t(j)} - \alpha^* \bigr|^2 \ge \varepsilon \qquad (6.40)

holds for each element t(j). Since this subsequence is an infinite sequence on a closed and bounded set, it contains a convergent subsequence. Owing to (6.40), the limit of this new sequence will not belong to Ω, and thus it will not be a fixed point of unsupervised learning. We have arrived at a result which contradicts Lemma 6.4. Thus the assumption (6.39) is wrong. ■
Theorem 6.2 (On convergence of unsupervised learning) If the function S which denotes one iteration of the unsupervised learning algorithm is continuous, and the set of fixed points is finite, then the sequence α^1, α^2, …, α^t, … converges towards a fixed point. ▲

Proof. Let Δ be the least distance between two different fixed points; it is positive because the number of fixed points is finite. Let us admit that it would happen an infinite number of times that the distance of α^t from the nearest fixed point α* would be less than a certain δ, and the distance of α^{t+1} from the nearest, but now from another, fixed point would also be less than δ. Thanks to Lemma 6.5 this situation would occur at any positive value δ, thus even at a rather small one. As a result, it would occur an infinite number of times that the distance between α^t and α^{t+1} would be greater than Δ − 2δ. But this is not possible, because Lemma 6.3 states that the distance between α^t and α^{t+1} converges towards zero.

We have thus proved that after some finite t the fixed point

\mathop{\mathrm{argmin}}_{\alpha^* \in \Omega} | \alpha^t - \alpha^* |

which is the closest to the point α^t ceases to change. Such a fixed point will be denoted α**, and the proved relation (6.38) assumes the form \lim_{t \to \infty} | \alpha^t - \alpha^{**} |^2 = 0, or equivalently \lim_{t \to \infty} \alpha^t = \alpha^{**}. ■
With some additional assumptions it could be shown that the fixed points of unsupervised learning have certain properties from the standpoint of the logarithm of the likelihood

\sum_{i} \log \sum_{k} p_K(k) \, p(x_i \mid a_k) .
In some cases it can be proved that the values p_K(k), a_k, k ∈ K, through which a fixed point is characterised, are in a sense the best values in their neighbourhood. A general account of these properties for a rather extensive class of models is not difficult, but it is not very interesting either. Therefore we recommend analysing, in each particular case, the properties of the fixed points using all the specific features of that case. As an example of such an analysis, a particular unsupervised learning task will be discussed in which the conditional probabilities p_{X|K}(x | k) are completely known, and only the a priori probabilities p_K(k) of the states are unknown. This example is valuable in itself. It corresponds, indeed, to Robbins task in its complete generality as well as in its original formulation. We will prove that under quite self-evident assumptions the unsupervised learning algorithm converges towards the globally most likely estimates of the a priori probabilities p_K(k).
\alpha^t(i,k) = \frac{p_K^t(k) \, p_{X|K}(x_i \mid k)}{\sum_{k' \in K} p_K^t(k') \, p_{X|K}(x_i \mid k')} , \quad i = 1, \ldots, n , \; k \in K , \qquad (6.41)

p_K^{t+1}(k) = \frac{1}{n} \sum_{i=1}^{n} \alpha^t(i,k) , \quad k \in K . \qquad (6.42)
We can see that the algorithm for solving Robbins task is expressed quite explicitly. This is the difference compared with the unsupervised learning algorithm in the general case, where the learning task in its optimising formulation (6.20) has to be solved in every particular case of its construction. We can also see that the algorithm itself, described by the relations (6.41), (6.42), is incredibly simple. The calculation of the values p_{X|K}(x_i | k) is expected to be algorithmically supported anyway, because it is necessary for recognition even if learning or unsupervised learning is not used.
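Because the conditional probabilities are completely known, the pair (6.41)-(6.42) is a complete, explicit algorithm. The sketch below is our own illustration; the function name, the toy conditionals and the iteration count are not from the lecture:

```python
def robbins_priors(xs, cond, states, iters=100):
    """Estimate the a priori probabilities p_K(k) when the conditional
    probabilities cond(x, k) = p_{X|K}(x | k) are completely known."""
    pK = {k: 1.0 / len(states) for k in states}  # uniform start
    n = len(xs)
    for _ in range(iters):
        # (6.41): alpha(i, k) proportional to p_K(k) * p_{X|K}(x_i | k)
        alphas = []
        for x in xs:
            num = {k: pK[k] * cond(x, k) for k in states}
            z = sum(num.values())
            alphas.append({k: v / z for k, v in num.items()})
        # (6.42): the new prior is the average of the weights alpha(i, k)
        pK = {k: sum(a[k] for a in alphas) / n for k in states}
    return pK

# toy check: a hidden binary state observed through known conditionals
cond = lambda x, k: {1: [0.9, 0.1], 2: [0.2, 0.8]}[k][x]
xs = [0] * 70 + [1] * 30
print(robbins_priors(xs, cond, states=(1, 2)))
```

In this toy case the mixture satisfies p(x = 0) = 0.2 + 0.7 p_K(1), so with 70 observed zeros out of 100 the maximum likelihood prior is p_K(1) = 5/7, towards which the iterations converge.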
It seems plausible that the algorithm given by the relations (6.41), (6.42) converges towards the point in which the global maximum of the likelihood function

\sum_{i} \log \sum_{k} p_K(k) \, p_{X|K}(x_i \mid k)

is attained, since this function is concave. An algorithm that converges towards a local maximum of a concave function provides its maximisation also in the global sense, since a concave function has (roughly speaking) only one local maximum. These considerations are, of course, only preliminary
and cannot replace the following exact formulation and the proof with which
we will end this lecture.
Theorem 6.3 (On solving Robbins task for the general case) Let the values p_K^*(k), k ∈ K, be the parameters of a fixed point of the algorithm (6.41), (6.42). Furthermore, let none of these values be 0, i.e., p_K^*(k) ≠ 0 for all k ∈ K. Then the following inequality is satisfied for any ensemble of a priori probabilities (p_K(k), k ∈ K),

\sum_{i=1}^{n} \log \sum_{k \in K} p_K^*(k) \, p_{X|K}(x_i \mid k) \;\ge\; \sum_{i=1}^{n} \log \sum_{k \in K} p_K(k) \, p_{X|K}(x_i \mid k) . \qquad ▲
Proof. Let the ensemble (p_K^*(k), k ∈ K) represent the fixed point of the algorithm, so that p_K^*(k) = p_K^{t+1}(k) = p_K^t(k), k ∈ K. Substituting (6.41) into (6.42) and using the property that no probability p_K^*(k) is zero, we obtain

\sum_{i=1}^{n} \frac{p_{X|K}(x_i \mid k)}{\sum_{k' \in K} p_K^*(k') \, p_{X|K}(x_i \mid k')} = n , \quad k \in K . \qquad (6.43)

Let us multiply the k-th equation in (6.43) by p_K(k), where (p_K(k), k ∈ K) is an arbitrary ensemble of a priori probabilities, and sum the equations over k ∈ K; similarly with the multipliers p_K^*(k). The right-hand sides remain n in both cases,

\sum_{i=1}^{n} \frac{\sum_{k \in K} p_K(k) \, p_{X|K}(x_i \mid k)}{\sum_{k' \in K} p_K^*(k') \, p_{X|K}(x_i \mid k')} = n = \sum_{i=1}^{n} \frac{\sum_{k \in K} p_K^*(k) \, p_{X|K}(x_i \mid k)}{\sum_{k' \in K} p_K^*(k') \, p_{X|K}(x_i \mid k')} , \qquad (6.44)

since both the sum \sum_{k \in K} p_K(k) and the sum \sum_{k \in K} p_K^*(k) are 1.

To be brief we will denote the ensemble (p_K(k), k ∈ K) by the symbol p_K and introduce the denotation f_i(p_K),

f_i(p_K) = \sum_{k \in K} p_K(k) \, p_{X|K}(x_i \mid k) . \qquad (6.45)

With this denotation the equality (6.44) becomes

\sum_{i=1}^{n} \frac{f_i(p_K) - f_i(p_K^*)}{f_i(p_K^*)} = 0 . \qquad (6.46)

We will create the following function dependent on the scalar variable γ,

Q(\gamma) = \sum_{i=1}^{n} \frac{f_i(p_K) - f_i(p_K^*)}{f_i(p_K^*) + \gamma \bigl( f_i(p_K) - f_i(p_K^*) \bigr)} . \qquad (6.47)

It is clear that Q(0) is the left-hand part of the expression (6.46), and thus

Q(0) = 0 . \qquad (6.48)

It is also evident that at any value γ the derivative dQ(γ)/dγ is not positive,

\frac{dQ(\gamma)}{d\gamma} = - \sum_{i=1}^{n} \frac{\bigl( f_i(p_K) - f_i(p_K^*) \bigr)^2}{\bigl( f_i(p_K^*) + \gamma ( f_i(p_K) - f_i(p_K^*) ) \bigr)^2} \le 0 ,

and consequently

Q(\gamma) \le 0

at any non-negative value γ ≥ 0. From that it follows further that the integral \int_0^1 Q(\gamma) \, d\gamma is not positive. Let us write it in greater detail, see (6.47),

\int_0^1 Q(\gamma) \, d\gamma = \sum_{i=1}^{n} \Bigl[ \log \bigl( f_i(p_K^*) + \gamma ( f_i(p_K) - f_i(p_K^*) ) \bigr) \Bigr]_{\gamma=0}^{\gamma=1} = \sum_{i=1}^{n} \log f_i(p_K) - \sum_{i=1}^{n} \log f_i(p_K^*) \le 0 .

We will write the inequality in even greater detail using the definition (6.45),

\sum_{i=1}^{n} \log \sum_{k \in K} p_K(k) \, p_{X|K}(x_i \mid k) \;\le\; \sum_{i=1}^{n} \log \sum_{k \in K} p_K^*(k) \, p_{X|K}(x_i \mid k) . \qquad ■
who are interested more in the outcomes than in the historical pathways along which one had to go to attain them. It seems to me that the important subject matter that I will need in the future is confined to the formulation of the unsupervised learning task in Subsection 6.6.3, to the unsupervised learning algorithm in Subsection 6.6.4 and to Theorem 6.1 in Subsection 6.6.5. This is a relatively small part of the lecture, and that is why I dare to ask directly and plainly what more from the lecture should, in your opinion, be necessary for me, and whether such an extended introduction to these essential results is not simply valueless.
We are answering your question directly and plainly. In this course we will still use the unsupervised learning algorithm in the form in which it was presented in Subsection 6.6.4. It will be of benefit to you to understand it well enough and not to forget about its existence, at least until we develop on its basis new algorithms for solving certain specific tasks. Regardless of whether you will need these results, you should know them. They are the kind of results which, because of their generality, belong to the gamut of fundamental knowledge in random data processing. Therefore everyone who claims to be professionally active in this field should know them. It is, simply, a part of one's education. Naturally, from the demand that everyone should know these results it does not follow that you in particular should know them. There are quite enough people who do not know the most necessary things.
As to the rest of the lecture, we agree with you that it is a rather long
introduction to the main results. But we are not so resolute and we would
not like to completely agree that a detailed introduction is useless. Every-
thing depends on from which side you intend to view the results that both
you and we regard as necessary. You know that any product can be evaluated
from two sides: from the standpoint of him who will use the product, and
from the standpoint of him who makes the product. Even a simple product,
such as beef steak, looks from the eater's standpoint completely different than
it does from the standpoint of a cook. You must decide yourself where you
stand in the kitchen called pattern recognition, whether amongst the eaters or
the cooks. In the former case it is quite needless for you to know the entire
pathway that unsupervised learning had gone through before it was formed
into its present day shape. In the latter case you will realise sooner or later,
how very low the efficiency of scientific research can be. You will also realise how small the ratio is of results which manage to get established in science for some time to the erroneous or insignificant results which a researcher must rummage through before he/she comes across a result worth anything. Furthermore,
you will see that it needs the patience of Job to rear an idea, powerless at
its birth, liable to being hurt or destroyed by anybody, from its swaddling
clothes, and to lead it to maturity, when it starts living a life of its own. The
sooner you also realise the diversity of negative sides of scientific research, the
better for you, even though the process of realising it is no pleasant thing
in itself. We have used this lecture to warn you what is definitely in store
for you.
6.7 Discussion
I did not wait so long; I read the passage about Columbus once more straight away. Now I am absolutely sure that the whole historical introduction is useless for me, because I will somehow avoid a development like this. And moreover, for some persons this topic is not only useless but simply detrimental. A picture drawn by you, which I begin to understand only now, has an unnecessarily
We are not afraid of that. First, if it happened so, we would be pleased for
his/her sake that he/she had quickly learned that he/she liked something else
more than science. Second, many people do not take very seriously that the
career of a scientist is so harsh and are convinced that they will be able to
avoid all unpleasant circumstances in some way. Third, and this is the saddest of all, an actual dramatic situation occurs only in cases when really significant scientific discoveries are at stake, and this happens rather rarely.
Thus, the majority of us is rather safely protected from the worst unpleasant
cases of this kind.
I would like to look at the positive outcomes of the lecture from the eater's view.
The algorithm for solving Robbins task is expressed quite unambiguously. It
is, therefore, a product that is prepared for practical application. But I would
not say so about the general unsupervised learning algorithm. I would rather
say that the general unsupervised learning algorithm is more a semi-finished
product than a product ready for use. It is an algorithm that is expressed up
to some other algorithm, and this represents immense ambiguity. An algorithm
that is to be constructed for a certain optimisation task will be unambiguously
expressed only when there is another algorithm for a further optimisation task
at our disposal. This auxiliary algorithm is to be inserted into the algorithm
that is being constructed. I cannot clearly see what I will practically gain from
such a recommendation when there is nothing to account for the statement
that the auxiliary task is simpler than the original one.
The best thing for you will be to thoroughly analyse several quite simple ex-
amples.
Assume that k is either 1 or 2, p_K(k) are a priori probabilities, and x is a one-dimensional Gaussian random variable the conditional probability distribution p_{X|k}(x), k = 1, 2, of which is (with unit variance, which is assumed to be known)

p_{X|k}(x) = \frac{1}{\sqrt{2\pi}} \, e^{ - ( x - \mu_k )^2 / 2 } , \quad k = 1, 2 .

Assume that the values p_K(k) and μ_k, k = 1, 2, are unknown and it is necessary to estimate these values on the basis of a sequence x_1, …, x_n, where each x_i is an instance of the random variable x having the probability distribution p_K(1) p_{X|1}(x) + p_K(2) p_{X|2}(x). This means that the numbers p_K(1), p_K(2), μ_1 and μ_2 are to be found for which the value

\sum_{i=1}^{n} \log \bigl( p_K(1) \, p_{X|1}(x_i) + p_K(2) \, p_{X|2}(x_i) \bigr) \qquad (6.49)

is maximal.
In order to maximise the function (6.49) I must solve the rather simple auxiliary maximisation task

\mu_k^* = \mathop{\mathrm{argmax}}_{\mu} \sum_{i=1}^{n} \alpha(i,k) \log \frac{1}{\sqrt{2\pi}} e^{ - ( x_i - \mu )^2 / 2 } = \mathop{\mathrm{argmin}}_{\mu} \sum_{i=1}^{n} \alpha(i,k) ( x_i - \mu )^2 . \qquad (6.50)

Since the function \sum_{i=1}^{n} \alpha(i,k) ( x_i - \mu )^2 is convex with respect to μ, the minimising position μ_k^* is obtained by solving the equation in three steps:

\frac{d}{d\mu} \sum_{i=1}^{n} \alpha(i,k) ( x_i - \mu )^2 = 0 , \qquad \sum_{i=1}^{n} \alpha(i,k) ( x_i - \mu ) = 0 , \qquad \mu_k^* = \frac{ \sum_{i=1}^{n} \alpha(i,k) \, x_i }{ \sum_{i=1}^{n} \alpha(i,k) } .

Thus the algorithm for maximising the function (6.49) is to have the following form. Let the initial values be, for example, p_K^0(1) = p_K^0(2) = 0.5, μ_1^0 = x_1, μ_2^0 = x_2. The algorithm is to iteratively enhance these four numbers. Assume that after the iteration t the numbers p_K^t(1), p_K^t(2), μ_1^t, μ_2^t have been attained. The new values p_K^{t+1}(1), p_K^{t+1}(2), μ_1^{t+1}, μ_2^{t+1} are to be calculated on the basis of the following explicit formulae:

\alpha(i,1) = \frac{ p_K^t(1) \, p_{X|1}(x_i) }{ p_K^t(1) \, p_{X|1}(x_i) + p_K^t(2) \, p_{X|2}(x_i) } , \quad i = 1, 2, \ldots, n ; \qquad (6.51)

p_K^{t+1}(1) = \frac{ \sum_{i=1}^{n} \alpha(i,1) }{ n } ; \qquad \mu_1^{t+1} = \frac{ \sum_{i=1}^{n} \alpha(i,1) \, x_i }{ \sum_{i=1}^{n} \alpha(i,1) } ;

and similarly for p_K^{t+1}(2) and μ_2^{t+1}.
I have not written the superscripts t with the variables α(i, k). I believe it is obvious that they vary at every iteration of the algorithm.
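Written out in code, one iteration of (6.51) is only a few lines. The sketch below is our own illustration (the function name is ours); it uses the unit variances and the initialisation p_K^0(1) = p_K^0(2) = 0.5, μ_1^0 = x_1, μ_2^0 = x_2 from the example, and the toy data is chosen so that the first two observations fall into different clusters:

```python
import math

def em_two_gaussians(xs, iters=100):
    """Iterate (6.51): weights alpha(i, 1), then new p_K(1), p_K(2) and the
    means mu_1, mu_2 (unit variances are assumed to be known)."""
    phi = lambda x, mu: math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)
    p1, mu1, mu2 = 0.5, xs[0], xs[1]  # p_K(1) = p_K(2) = 0.5, mu_1 = x_1, mu_2 = x_2
    n = len(xs)
    for _ in range(iters):
        # alpha(i, 1); alpha(i, 2) = 1 - alpha(i, 1)
        a1 = [p1 * phi(x, mu1) / (p1 * phi(x, mu1) + (1 - p1) * phi(x, mu2))
              for x in xs]
        p1 = sum(a1) / n
        mu1 = sum(a * x for a, x in zip(a1, xs)) / sum(a1)
        mu2 = sum((1 - a) * x for a, x in zip(a1, xs)) / sum(1 - a for a in a1)
    return p1, 1 - p1, mu1, mu2

xs = [-1.1, 3.9, -0.9, 4.1, -1.0, 4.0, 4.2, 3.8]
print(em_two_gaussians(xs))
```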
So you can see that you have mastered the optimisation task (6.50) quite quickly. You needed just several lines of formulae, whereas the original task (6.49) may have scared you.
The optimisation function (6.49) is really not a very pleasant function, but I
believe that with a certain effort I would also manage its optimisation.
We do not doubt it. But with the aid of general recommendations you have
written the algorithm (6.51) without any effort, and this is one of the outcomes
of the lecture, if we wanted to view it from the eater's standpoint. We think
that in a similar manner you would master even algorithms for more general
cases. For example, for cases in which also the conditional variances of random
variables were unknown or for multi-dimensional cases, and the like. But in all
these cases, as well as in many others, you will clearly notice that the auxiliary
task is substantially simpler than the original one.
Let us go once more through another rather easy example, which is rich in consequences. Let the state k again be a random variable which assumes two values: k = 1 with the probability p_K(1) and k = 2 with the probability
p_K(2). Let x be a random variable which assumes values from the set X with the probabilities p_{X|1}(x), provided the object is in the first state, and with the probabilities p_{X|2}(x), provided the object is in the second state. Let y be another random variable which assumes values from the set Y with the probabilities p_{Y|1}(y), provided the object is in the first state, and with the probabilities p_{Y|2}(y) in the opposite case. The numbers p_K(1), p_K(2), p_{X|1}(x), p_{X|2}(x), x ∈ X, and the numbers p_{Y|1}(y), p_{Y|2}(y), y ∈ Y, are unknown. This means that there is no knowledge about the dependence of any of these features on the state k. But it is known that, under the condition that the object is in the first state, as well as under the condition that it is in the second state, the features x and y do not depend on one another, i.e., the equality

p_{XY|k}(x, y) = p_{X|k}(x) \, p_{Y|k}(y) , \quad k = 1, 2 ,

holds.
The auxiliary task consists in that the numbers p_{X|k}(x), x ∈ X, and p_{Y|k}(y), y ∈ Y, k = 1, 2, are to be found which maximise the function

\sum_{i=1}^{n} \alpha(i,k) \log \bigl( p_{X|k}(x_i) \, p_{Y|k}(y_i) \bigr) \qquad (6.52)

at the known numbers α(i, k), i = 1, …, n, k = 1, 2, and the assigned sequence (x_1, y_1), (x_2, y_2), …, (x_n, y_n). This auxiliary task can be solved quite simply,

\max \sum_{i=1}^{n} \alpha(i,k) \log \bigl( p_{X|k}(x_i) \, p_{Y|k}(y_i) \bigr)
= \max \sum_{i=1}^{n} \alpha(i,k) \bigl( \log p_{X|k}(x_i) + \log p_{Y|k}(y_i) \bigr)
= \max \sum_{i=1}^{n} \alpha(i,k) \log p_{X|k}(x_i) + \max \sum_{i=1}^{n} \alpha(i,k) \log p_{Y|k}(y_i)
= \max \sum_{x \in X} \sum_{i \in I_X(x)} \alpha(i,k) \log p_{X|k}(x_i) + \max \sum_{y \in Y} \sum_{i \in I_Y(y)} \alpha(i,k) \log p_{Y|k}(y_i)
= \max \sum_{x \in X} \Bigl( \sum_{i \in I_X(x)} \alpha(i,k) \Bigr) \log p_{X|k}(x) + \max \sum_{y \in Y} \Bigl( \sum_{i \in I_Y(y)} \alpha(i,k) \Bigr) \log p_{Y|k}(y) .
The first equality in the preceding derivation merely repeats the formulation (6.52) of the auxiliary task. The second equality takes advantage of the rule that the logarithm of a product is the sum of the logarithms. The third equality is valid because a sum of two summands is to be maximised, where each summand depends on a group of variables of its own, and therefore the sum can be maximised by the independent maximisation of each particular summand separately. The fourth equality uses the denotation I_X(x) for the set of those indices i for which x_i has assumed the value x. A similar denotation is used for the set I_Y(y). The summands α(i,k) log p_{X|k}(x_i) can thus be grouped in such a way that the addition is first done over the indices i at which the observed feature x_i assumed a certain value, and then it is done over all values x. The sum \sum_{i=1}^{n} can be changed to \sum_{x \in X} \sum_{i \in I_X(x)} or to \sum_{y \in Y} \sum_{i \in I_Y(y)}.
And finally, in the last equality advantage was taken of the property that in the sums \sum_{i \in I_X(x)} \alpha(i,k) \log p_{X|k}(x) and \sum_{i \in I_Y(y)} \alpha(i,k) \log p_{Y|k}(y) the values p_{X|k}(x) and p_{Y|k}(y) do not depend on the index i over which the addition is done, and thus they can be factored out in front of the summation symbol \sum_i.

As a consequence of Lemma 6.1, the numbers p_{X|k}^*(x) and p_{Y|k}^*(y) which maximise these sums are

p_{X|k}^*(x) = \frac{ \sum_{i \in I_X(x)} \alpha(i,k) }{ \sum_{x' \in X} \sum_{i \in I_X(x')} \alpha(i,k) } , \qquad p_{Y|k}^*(y) = \frac{ \sum_{i \in I_Y(y)} \alpha(i,k) }{ \sum_{y' \in Y} \sum_{i \in I_Y(y')} \alpha(i,k) } .
The algorithm for solving the original maximisation task then has the following
explicit expression. Let p_K^t(k), p_{X|k}^t(x), p_{Y|k}^t(y), k = 1, 2, x ∈ X, y ∈ Y, be the
values of the unknown probabilities after the iteration t of unsupervised learning.
The new values of these probabilities are calculated according to the formulae

α^t(i, k) = p_K^t(k) p_{X|k}^t(x_i) p_{Y|k}^t(y_i) / Σ_{k'=1}^{2} p_K^t(k') p_{X|k'}^t(x_i) p_{Y|k'}^t(y_i) ,

p_K^{t+1}(k) = (1/n) Σ_{i=1}^{n} α^t(i, k) ,

p_{X|k}^{t+1}(x) = Σ_{i∈I_x(x)} α^t(i, k) / Σ_{i=1}^{n} α^t(i, k) ,    p_{Y|k}^{t+1}(y) = Σ_{i∈I_y(y)} α^t(i, k) / Σ_{i=1}^{n} α^t(i, k) .
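The iteration just described can be sketched in code. The following NumPy implementation is a minimal illustration, not the toolbox implementation mentioned later in the lecture; the function name, the random initialisation, and the encoding of the values of X and Y as integers 0, 1, … are assumptions of this sketch.

```python
import numpy as np

def em_unsupervised(xs, ys, n_x, n_y, iters=100, seed=0):
    """Unsupervised learning iteration for two states k = 1, 2 with
    conditionally independent discrete features x and y (encoded as
    integers 0..n_x-1 and 0..n_y-1). Names are illustrative."""
    rng = np.random.default_rng(seed)
    n = len(xs)
    # random normalised initial guesses for p_K, p_{X|k}, p_{Y|k}
    p_k = np.full(2, 0.5)
    p_x_k = rng.random((2, n_x))
    p_x_k /= p_x_k.sum(axis=1, keepdims=True)
    p_y_k = rng.random((2, n_y))
    p_y_k /= p_y_k.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # alpha(i, k): a posteriori probability of state k for observation i
        joint = p_k[:, None] * p_x_k[:, xs] * p_y_k[:, ys]   # shape (2, n)
        alpha = joint / joint.sum(axis=0, keepdims=True)
        # new values: weighted relative frequencies (Lemma 6.1)
        p_k = alpha.sum(axis=1) / n
        for k in range(2):
            p_x_k[k] = np.bincount(xs, weights=alpha[k], minlength=n_x)
            p_x_k[k] /= p_x_k[k].sum()
            p_y_k[k] = np.bincount(ys, weights=alpha[k], minlength=n_y)
            p_y_k[k] /= p_y_k[k].sum()
    return p_k, p_x_k, p_y_k
```

By the general convergence result quoted in the lecture, each iteration cannot decrease the likelihood; as the discussion notes, convergence to the global maximum is not theoretically guaranteed.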
You see, it actually seems to be nonsense, but only at first and second glance.
But at third and fourth glance, after scrutinising the problem with great at-
tention, which it justly deserves, even here certain intrinsic regularities can be
revealed. You are not quite right when saying that nothing is a priori known
about the dependence of the features x and y on the state k. Although noth-
ing is known about how either of the features x and y depends on the state
k, we still know something very substantial about their joint dependence on
the state. It is namely known that the features x and y depend on the state
k independently of each other. In other words, if the object is in one fixed
state k then the features themselves cease to depend on each other. If you are
interested in it then we can discuss it later in greater detail.
Now we will only roughly answer your question. Neither you nor we should
see any great nonsense in the fact that one can learn about something
which has never been observed. The entire intellectual activity of individuals,
as well as that of large human communities, has long been directed at
parameters which are inaccessible to direct observation. We will not be speaking
about such grandiose parameters as good and evil. We will choose something
much simpler at first glance, for example the temperature of a body which is
regarded as an average rate of motion of the body's molecules. Even though the
average rate of motion of molecules has not ever been observed, it is now quite
262 Lecture 6: Unsupervised learning
precisely known how the volume of a body, its state of aggregation, and its
radiation depend on the average rate of molecule motion. It is also known how the
temperature of the body itself depends on the temperature of surrounding
bodies, which is, by the way, also unobservable. Many other properties of
the body temperature are well known though they never have been directly
observed.
The path leading to knowledge about directly unobservable phenomena is
nothing else than an analysis of the parameters which can be observed, and a search
for a mechanism (model) explaining the relations between them. One explores
the relations between the observed parameters and finds that they cannot be
explained in another (or simpler) way than by the existence of a certain
unobservable factor that affects all the visible parameters
and thus is the cause of their mutual dependence. Recall the astronomers
who, ever since Kepler's laws have been known, have predicted still unobserved
planets from discrepancies between observations and the assumed elliptical orbits
of the observable planets. Such an approach is a normal procedure for analysing
unknown phenomena. The capability of such exploration has long
been considered a measure of intelligence.
The word decorrelation would suit the purpose well if it had not already been
used for tasks of quite a different sort, namely those in which the eigenvectors
of covariance matrices serve as a new orthogonal basis of a linear space.
The method is also referred to as the Karhunen-Loève expansion. By this method,
random quantities can be transformed to a form in which their correlation equals
zero.
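The Karhunen-Loève style decorrelation mentioned here can be illustrated in a few lines of NumPy; the data and all the numbers are purely illustrative.

```python
import numpy as np

# Decorrelation: express data in the orthogonal basis of the eigenvectors
# of its covariance matrix; the new coordinates are uncorrelated.
rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))
data = z @ np.array([[2.0, 0.0], [1.5, 0.5]])    # introduce correlation
data -= data.mean(axis=0)
cov = data.T @ data / len(data)
eigvals, eigvecs = np.linalg.eigh(cov)           # orthonormal eigenbasis
decorrelated = data @ eigvecs                    # new coordinates
new_cov = decorrelated.T @ decorrelated / len(data)
print(new_cov[0, 1])                             # off-diagonal term vanishes
```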
If decorrelation is understood as the search for an invisible influence of a
phenomenon whose presence causes a dependence of visible parameters
which would be independent if the invisible parameter did not change, then
what we see in our example is that very decorrelation.
Your reference to all that a human can manage cannot be an argument in
judging whether the formulated task is, or is not, nonsense. I am afraid I have
not yet got the answer to my question. Now, at least, I am able to formulate
my question more precisely.
Let x, y, k = 1, 2, be three random variables whose probability distribution
p_{XYK}(x, y, k) has the form of the product

p_{XYK}(x, y, k) = p_K(k) p_{X|K}(x | k) p_{Y|K}(y | k)

with unknown probabilities p_K(k), p_{X|K}(x | k), p_{Y|K}(y | k), which are of interest to
us but cannot be estimated directly because the parameter k is not observable.
Assume we have chosen appropriate values p'_K(k), p'_{X|K}(x | k), p'_{Y|K}(y | k) which
satisfy the condition

p_{XY}(x, y) = Σ_{k=1}^{2} p'_K(k) p'_{X|K}(x | k) p'_{Y|K}(y | k),    (6.54)

and thus explain the empirically obtained data that are expressed by
means of the numbers p_{XY}(x, y). And now comes my question. Can
I be sure that an arbitrary explanation p'_K, p'_{X|K}, p'_{Y|K} which satisfies the relation
(6.54) will be identical with the reality p_K, p_{X|K}, p_{Y|K}? Or, to ask a milder
question: in what relation will the explanation stand to the reality?
Your fears are sufficiently justified. They appear in the literature in a general form
referred to as the 'problem of compound mixture identifiability'. In our case the
equation (6.53) is, in fact, not always sufficient for the numbers p_{XY}(x, y) to
define unambiguously the functions p_{X|K} and p_{Y|K} and the numbers p_K(1) and
p_K(2). Everything depends on what these functions and a priori probabilities
are actually like. In some cases hardly anything can be said about them based
only on the knowledge of the statistics pxy(x, y). These cases, as can be seen
later, are so exotic that we need not take them into consideration. Even if we
somehow managed to obtain the functions necessary for us, we could see that
nothing can be recognised on the basis of them. In other situations, which occur
more frequently, the necessary statistical dependencies can be found up to
some ambiguity which makes no difference in the practical solution of
some problems. And finally, under certain, not very restrictive, conditions
the statistical model of an object can even be determined uniquely.
Let us describe one method for determining the functions p'_K(k), p'_{X|K}, p'_{Y|K}
on the assumption that the probabilities p_{XY}(x, y) are known. You must not
think by any means that it is a method to be applied in practice. The purpose of
this method is only to understand to what extent the relation (6.53) defines the
functions sought. For practical application, the most appropriate algorithm is
the one you have already developed.
On the basis of the numbers p_{XY}(x, y), x ∈ X, y ∈ Y, the following numbers
can be calculated:

p_{X|y}(x) = p_{XY}(x, y) / Σ_{x∈X} p_{XY}(x, y),    x ∈ X, y ∈ Y,
which are nothing else than the conditional probabilities p_{X|Y}(x | y) that the value
x of the first feature occurred in the experiment under the condition that the
second feature assumed the value y. As before, we will express the function
p_{X|Y} of the two variables x and y as an ensemble of functions of one variable,
p_{X|y}, y ∈ Y. If we regard each function p_{X|y} of this ensemble as a point in
an |X|-dimensional linear space then we can immediately notice that all the
functions p_{X|y}, y ∈ Y, lie on one straight line passing through the points
corresponding to the unknown functions p_{X|1} and p_{X|2}. It is so because

p_{X|y}(x) = p_{K|y}(1) p_{X|1}(x) + p_{K|y}(2) p_{X|2}(x),    (6.55)

where p_{K|y}(k) is the a posteriori probability of the state k at the observation y.
Let us denote this straight line by the symbol Γ. The straight line Γ represents
the shape of the dependence between the visible parameters x and y which is
caused by the invisible parameter k. Think its meaning over well, and then we
will proceed further.
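The collinearity just claimed can be checked numerically on a small model; all the numbers below are illustrative.

```python
import numpy as np

# A model with |X| = 4, |Y| = 3 and two states. We verify that every
# conditional p_{X|y} lies on the straight line through p_{X|1} and
# p_{X|2}, with coefficient equal to the a posteriori probability of
# the first state, cf. (6.55).
p_k = np.array([0.4, 0.6])                     # a priori probabilities
p_x_k = np.array([[0.7, 0.1, 0.1, 0.1],        # p_{X|1}
                  [0.1, 0.2, 0.3, 0.4]])       # p_{X|2}
p_y_k = np.array([[0.5, 0.3, 0.2],             # p_{Y|1}
                  [0.1, 0.3, 0.6]])            # p_{Y|2}
p_xy = np.einsum('k,kx,ky->xy', p_k, p_x_k, p_y_k)    # joint p_{XY}
p_x_given_y = p_xy / p_xy.sum(axis=0, keepdims=True)  # columns are p_{X|y}
direction = p_x_k[0] - p_x_k[1]
posteriors = p_k[0] * p_y_k[0] / (p_k @ p_y_k)        # p_{K|y}(1) for each y
max_residual = 0.0
for y in range(3):
    diff = p_x_given_y[:, y] - p_x_k[1]
    theta = (diff @ direction) / (direction @ direction)
    # collinearity: diff is exactly theta * direction, and theta = p_{K|y}(1)
    max_residual = max(max_residual,
                       np.abs(diff - theta * direction).max(),
                       abs(theta - posteriors[y]))
print(max_residual)
```

The computed coefficient θ for each y coincides with the a posteriori probability p_{K|y}(1), which is exactly the content of (6.55).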
In certain cases the straight line Γ can be uniquely determined without the
functions p_{X|1} and p_{X|2} being known, namely on the basis of the empirical data
p_{XY}(x, y). If the set {p_{X|y} | y ∈ Y} contains more than one function then any
pair of distinct functions uniquely determines the straight line Γ. But it can
happen that the set {p_{X|y} | y ∈ Y} contains only one single function. It happens
when all the functions p_{X|y}, y ∈ Y, are the same. In this case the straight line Γ is
not determined in a unique way, and that is the insoluble case mentioned above.
It concerns the first of the abovementioned situations, in which a reconstruction
of the statistical model of an object based on the empirical data p_{XY}(x, y) is not
feasible. Let us look at this situation in greater detail.
The function p_{X|y} is the same for all values y in three cases (cf. (6.55)):
1. The functions p_{X|1} and p_{X|2} are the same; in this case the function p_{X|y} does
   not depend on the probability p_{K|y}(2), and consequently the set {p_{X|y} | y ∈ Y}
   consists of only one single function.
2. The functions p_{Y|1} and p_{Y|2} are the same; in this case the a posteriori proba-
   bilities p_{K|y}(k) do not depend on y, and the set {p_{X|y} | y ∈ Y} again contains
   only one single function.
3. One of the a priori probabilities p_K(1) or p_K(2) is zero; in this case the
   probabilities p_{K|y}(k) do not depend on y and, moreover, one of them is
   always zero.
All three cases are degenerate. No information on the state can be extracted
from an observation, not even if the statistical model of the object were known.
It is thus clear that no great harm is done when the functions p_K, p_{X|K}, p_{Y|K}
cannot be reconstructed in such a case. Even if they could be reconstructed, they
would not be helpful for recognition.
Let us now discuss the normal situation, in which the function p_{X|y} depends
on the value y, which means that the set {p_{X|y} | y ∈ Y} includes more than one
function. The straight line Γ can thus be uniquely determined, and the unknown
functions p_{X|1} and p_{X|2} can no longer be of an arbitrary character. They
must correspond to points lying on the straight line Γ. Assume for a while
that the position of these points on the straight line Γ is known. We will intro-
duce a coordinate system on the straight line (a single coordinate) so that
the unit coordinate is assigned to the point at which the
function p_{X|1} is located, and the zero coordinate to the point
which corresponds to the function p_{X|2}. If the coordinate of the
point corresponding to the function p_{X|y} is denoted θ(y) then on the basis of
6.7 Discussion 265
the relation (6.55) we can claim that the coordinate θ(y) is the a posteriori
probability p_{K|y}(1) of the first state under the condition of the observation y.
In this way we have made sure that the set Y of observations y can be
naturally ordered in agreement with the position of the functions p_{X|y} on the
straight line Γ. At the same time this order is identical with the order according
to the a posteriori probability p_{K|y}(1) of the first state. From this it then
follows that any Bayesian strategy (for any penalty function) will
have just one single parameter, which is the coordinate of a point θ on
the straight line Γ. All points on one side of the point θ are to
be included in one class, and all points on the other side are to be included
in the other class. And most important of all, for this ordering the
functions p_{X|1} and p_{X|2} need not be known. The order which is made only on
the basis of how the functions p_{X|y} are placed on the straight line, i.e., on the
knowledge of the empirical data p_{XY}(x, y), is sure to be identical either with the
order according to the a posteriori probability p_{K|y}(1) of the first state, or with
the order according to the a posteriori probability of the second state.
Now we are able to express quantitatively the information on the classifi-
cation of the set Y which can be extracted from mere empirical data. Let n
be the number of values of the variable y. The set Y can be separated into
two classes in 2^n ways. To express the correct classification it is necessary to
have n bits. These n bits can be considered as n binary replies of a certain
teacher to the question in which of the two classes each of the n observations is to
be included.
After an appropriate examination of the empirical data, the overwhelming
majority of these 2^n classifications can be rejected, since the correct classifi-
cation is one of about 2n classifications which are already known. To pick out the
correct classification only 1 + log_2 n bits are needed. This additional piece of
information can be considered as a reply of a certain teacher to the question in
which class not all, but only properly selected, observations are to be included.
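The counting argument can be made concrete with a small worked example; n = 8 is an arbitrary illustrative choice, and 2n is the rough count of threshold labellings used above.

```python
import math

# Worked count for n = 8 observation values: 2**n labellings in general,
# but only threshold labellings along the line Gamma survive, roughly
# 2*n of them, so about 1 + log2(n) bits of the teacher's help suffice
# instead of n bits.
n = 8
all_labellings = 2 ** n          # 256 two-class labellings of Y
threshold_labellings = 2 * n     # cut position times orientation (rough count)
bits_without_data = n
bits_with_data = 1 + math.log2(n)
print(all_labellings, threshold_labellings, bits_without_data, bits_with_data)
```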
Note that even in the case in which the functions p_{Y|1}, p_{Y|2} and the numbers
p_K(1), p_K(2) are completely known, the classification of the set Y into two
classes is not uniquely determined. Only a group of 2n classifications
would be determined, each of which, for some penalty function,
can claim to be the correct one. So even though we have seen
that the statistical model of an object cannot always be uniquely determined
on the basis of empirical data, the empirical data contain the
same information about the required classification as the complete knowledge
of a statistical model.
Now let us assume that the statistical model of an object has to be deter-
mined not for a subsequent classification but for other purposes, for which
just the actual model must be determined. When is such a unique
determination possible? We will find out under what additional conditions the
system of equations

p_{XY}(x, y) = Σ_{k=1}^{2} p_K(k) p_{X|k}(x) p_{Y|k}(y),    x ∈ X, y ∈ Y,
has only one solution with respect to the functions p_K, p_{X|k}, p_{Y|k}. It is quite
natural that two models which differ only by a permutation of the names of the
states k will be considered identical. More precisely, the two models
p_K, p_{X|k}, p_{Y|k} and p'_K, p'_{X|k}, p'_{Y|k} will be considered identical even in the case
in which

p'_K(1) = p_K(2),  p'_{X|1} = p_{X|2},  p'_{Y|1} = p_{Y|2},  and similarly with the indices 1 and 2 exchanged.
Assume that in addition to the conditions p_K(1) ≠ 0, p_K(2) ≠ 0, p_{X|1} ≠ p_{X|2}, which
were assumed in the preceding analysis, another condition is satis-
fied. Let it be called the condition of the existence of ideal representatives. A
value y_1 of the feature y is assumed to exist that can occur only when
the object is in the first state. Furthermore, a value y_2 exists which has a
non-zero probability only when the object is in the second state. This means
that

p_{Y|1}(y_1) ≠ 0,  p_{Y|2}(y_1) = 0,  p_{Y|1}(y_2) = 0,  p_{Y|2}(y_2) ≠ 0.    (6.56)

From the assumption (6.56) it follows that

p_{K|y_1}(1) = 1,  p_{K|y_1}(2) = 0,  p_{K|y_2}(1) = 0,  p_{K|y_2}(2) = 1,

and thus on the basis of (6.55) there holds

p_{X|y_1} = p_{X|1},  p_{X|y_2} = p_{X|2}.
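The effect of ideal representatives can be verified numerically. In the sketch below (all numbers illustrative), y = 0 plays the role of y_1 and y = 2 the role of y_2, and the empirical conditionals at these observations reproduce the component distributions exactly.

```python
import numpy as np

# Ideal representatives: y = 0 can occur only in state 1, y = 2 only in
# state 2, cf. (6.56); then p_{X|y_1} = p_{X|1} and p_{X|y_2} = p_{X|2}.
p_k = np.array([0.35, 0.65])
p_x_k = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7]])
p_y_k = np.array([[0.4, 0.6, 0.0],    # p_{Y|1}(y_2) = 0
                  [0.0, 0.5, 0.5]])   # p_{Y|2}(y_1) = 0
p_xy = np.einsum('k,kx,ky->xy', p_k, p_x_k, p_y_k)
p_x_given_y = p_xy / p_xy.sum(axis=0, keepdims=True)
# the conditionals at the ideal representatives equal the components
print(np.abs(p_x_given_y[:, 0] - p_x_k[0]).max())
print(np.abs(p_x_given_y[:, 2] - p_x_k[1]).max())
```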
I hope I have understood the core of your considerations. It is that the
set of functions {p_{X|y} | y ∈ Y} cannot be arbitrary, but only such a set that
lies on a one-dimensional straight line. I could easily generalise this property
even to the case in which the number of states does not equal two but is
an arbitrary integer. If the number of states is n then the set of functions
{p_{X|y} | y ∈ Y} fits completely into an (n−1)-dimensional hyperplane in which
also all the sought functions p_{X|k}, k ∈ K, which are not known beforehand, are
contained.
One can still think a great deal about it. But I would not bother you with
that. I hope I will be able to ferret out all possible consequences of this result
myself. Now I would rather make use of the time I have at my disposal for a
discussion with you to make clear for myself a more important question.
I understand that any considerations about how to create the functions p_K,
p_{X|K}, and p_{Y|K} by no means suit practical application, but serve only for ex-
plaining how the information on these functions is hidden within the empirical
data p_{XY}(x, y), x ∈ X, y ∈ Y. In spite of that, I would still like to pass from
these idealised thoughts to real situations. Therefore I am asking, and answer-
ing myself, a key question: why cannot these considerations be a foundation
for solving a task in a real situation? It is certainly because in the
ideal case the straight line Γ can be sought on the basis of an arbitrary pair
of different points which lie on the straight line being sought, since in the end
all the points lie on this straight line. But if the sequence of observations is
finite, even though quite large, then the probabilities p_{X|y}(x) cannot be consid-
ered as known. Only some other numbers p'_{X|y}(x) are known, which state how
often the value x occurred under the condition that the observation
y occurred. The set {p'_{X|y} | y ∈ Y} of functions formed in this way naturally
need not lie on one straight line. Thus even the straight lines which pass through
different pairs of functions need not be the same. The search for the straight
line Γ, and the very definition of this straight line, is in this case already not
so easy. It is necessary to find a straight line which appropriately
approaches the set {p'_{X|y} | y ∈ Y} and appropriately approximates it. Therefore
the practical solution of the task should start from formulating a criterion
which quantitatively determines how well a straight line Γ replaces
the empirically observed set {p'_{X|y} | y ∈ Y}. Afterwards the best straight line
should be sought.
The result you would like to attain has actually already been presented in the
lecture, when the unsupervised learning task was formulated as seeking a model
which is, in a certain sense, the best approximation of the empirical data, i.e., of the
finite sequence of observations. The unsupervised learning algorithm is just the
procedure for the best approximation of empirical data in a situation in which
the actual probabilities are not available, since for their calculation indefinitely many
observations of the object would be needed. We have only a finite sequence at
our disposal, on the basis of which these probabilities can be calculated with a
certain inevitable error. We are not certain this time what else you would like
to know, because we seem to have already done what you desire.
I will try to describe my idea once more. On the one hand, we have the task
to find such numbers p_K(k), p_{X|K}(x | k), p_{Y|K}(y | k), k = 1, 2, x ∈ X,
y ∈ Y, which maximise the number

Σ_{i=1}^{n} log Σ_{k=1}^{2} p_K(k) p_{X|K}(x_i | k) p_{Y|K}(y_i | k) ;

in other words, the point

argmax_{(p_K, p_{X|K}, p_{Y|K})} Σ_{i=1}^{n} log Σ_{k=1}^{2} p_K(k) p_{X|K}(x_i | k) p_{Y|K}(y_i | k)    (6.58)
has to be found. I wrote the algorithm for this calculation quite formally as
a particular case of a more general algorithm which was presented and proved
(formally as well) in the lecture. In the labyrinth of formalism I have com-
pletely lost clear understanding of how it can happen that one finds statistical
parameters of a variable which has never been observed. Thanks to you, things
have cleared up for me, but only for the ideal case in which the observation
sequence is indefinitely long. This elucidation is supported by the property
that the system of equations

p_{XY}(x, y) = Σ_{k=1}^{2} p_K(k) p_{X|K}(x | k) p_{Y|K}(y | k),    x ∈ X, y ∈ Y,    (6.59)

cannot have too diverse solutions with respect to the functions p_K, p_{X|K}, p_{Y|K}
at the known numbers p_{XY}(x, y). The main factor of this elucidation is the
straight line Γ which is built up in a certain manner. But a straight line cannot
be seen at all in the formulation (6.58). Therefore I am not able to transfer
my way of thinking, attained with your help for the ideal case (6.59), to the
real case expressed by the requirement (6.58). And now I would like to beat
a path from the ideal requirement (6.59), which I understand well, to the real
task, which I understand only formally. In working my way from the task (6.59)
to the task (6.58), I would not like to lose the straight line Γ out of my view.
It is for me the single clue in the problem.
I seem to see the first step in working out this path. In the real case
the system of equations (6.59) has no solution. The straight line Γ, which is
uniquely determined in the ideal case, simply does not exist here. The task (6.59)
is to be re-formulated in such a way that the straight line Γ is defined
even in the case in which the ensemble of functions p_{X|y}, y ∈ Y, does not lie
on one straight line. Could you, please, help me make this step, but in such a
way that I do not lose the straight line Γ from my considerations?
We think we could. A preliminary formulation of the task can be, for exam-
ple, as follows. Let (p'_{X|y} | y ∈ Y) be an ensemble of points which do not lie on
one straight line. Another ensemble of points (p_{X|y} | y ∈ Y) is to be found which
lies on one straight line and rather strongly resembles the ensemble (p'_{X|y} | y ∈ Y).
It would be natural to define the resemblance of the ensembles as a sum
of somehow defined resemblances of their elements, i.e., by means of a function
of the following form

Σ_{y∈Y} p'_Y(y) L(p'_{X|y}, p_{X|y}),    (6.60)

where L(p'_{X|y}, p_{X|y}) is the 'similarity' of the functions p'_{X|y} and p_{X|y}, and the
number p'_Y(y) states how often the value y occurred in the finite sequence
(x_i, y_i), i = 1, …, n, i.e.,

p'_Y(y) = Σ_{x∈X} p'_{XY}(x, y).
The number p'_{XY}(x, y) states how often the pair (x, y) occurred in the
sequence of observations. Let us also recall that there holds

p'_{X|y}(x) = p'_{XY}(x, y) / p'_Y(y).

Now let us consider what could be regarded as the similarity L of the functions
p'_{X|y} and p_{X|y}. The function p'_{X|y} is the result of a finite observation of the
object, and p_{X|y} is a function assumed to be the result of an infinite observation.
It seems natural that the resemblance measure of the result of a finite
experiment should be the logarithm of the probability of this result, therefore

L(p'_{X|y}, p_{X|y}) = Σ_{x∈X} p'_{X|y}(x) log p_{X|y}(x).    (6.61)
The straight line Γ, which you would not like to lose from your considerations,
can be preserved through the following formulation of the task.
Functions p_{X|y}, y ∈ Y, are to be found which lie on one (not known before-
hand) straight line and at the same time maximise the number (6.60), which
is, with respect to the definition (6.61), the value

Σ_{y∈Y} p'_Y(y) Σ_{x∈X} p'_{X|y}(x) log p_{X|y}(x),    (6.62)

where p'_Y and p'_{X|y} are empirical data obtained from the finite observation
sequence. The straight line Γ, on which the best functions p_{X|y} in the sense of
(6.62) are expected to lie, is exactly that straight line you like so much.
You can see that the number (6.62) already resembles the number (6.58) to
which we intend to work our way. We will make another step in this direction.
The straight line Γ will be determined by means of two functions p_{X|1} and p_{X|2}
which are assumed to lie on this straight line. The position of the function p_{X|y},
y ∈ Y, which is also expected to lie on the straight line Γ, will be given by
two numbers p_{K|y}(1) and p_{K|y}(2). Using these two numbers we can
replace the expression p_{X|y}(x) in (6.62) by the equivalent expression
Σ_{k=1}^{2} p_{K|y}(k) p_{X|k}(x), which turns (6.62) into

Σ_{y∈Y} p'_Y(y) Σ_{x∈X} p'_{X|y}(x) log Σ_{k=1}^{2} p_{K|y}(k) p_{X|k}(x).    (6.63)
Let us transform the number (6.58). Since p_K(k) p_{X|K}(x | k) p_{Y|K}(y | k) =
p_Y(y) p_{K|y}(k) p_{X|k}(x), there holds

Σ_{i=1}^{n} log Σ_{k=1}^{2} p_K(k) p_{X|K}(x_i | k) p_{Y|K}(y_i | k)
  = n Σ_{y∈Y} p'_Y(y) log p_Y(y) + n Σ_{y∈Y} p'_Y(y) Σ_{x∈X} p'_{X|y}(x) log Σ_{k=1}^{2} p_{K|y}(k) p_{X|k}(x).

We can see that the maximisation of (6.58) with respect to the functions p_K, p_{X|K},
p_{Y|K} is equivalent to the maximisation of the number

Σ_{y∈Y} p'_Y(y) log p_Y(y) + Σ_{y∈Y} p'_Y(y) Σ_{x∈X} p'_{X|y}(x) log Σ_{k=1}^{2} p_{K|y}(k) p_{X|k}(x)    (6.64)
with respect to the functions p_Y, p_{K|y}, p_{X|k}. Since each of the two summands
in the expression (6.64) depends on its own group of variables, the maximisation of
their sum can be achieved by maximising each of the two summands separately.
The first summand is maximised with respect to p_Y and the second with respect to
p_{K|y} and p_{X|k}. We can also see that the second summand is identical with the
number which is to be maximised in seeking the optimal straight line Γ. So
we have beaten the path from the task (6.59) to the task (6.58). Are you now
happy?
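The equivalence of the two summand decomposition can be checked numerically on a small synthetic model; all the numbers below are illustrative, and the check verifies the identity for a finite sample of length n.

```python
import numpy as np

rng = np.random.default_rng(0)
# an arbitrary model and a finite sample drawn from it
p_k = np.array([0.3, 0.7])
p_x_k = np.array([[0.5, 0.2, 0.3], [0.1, 0.6, 0.3]])
p_y_k = np.array([[0.2, 0.8], [0.7, 0.3]])
n = 2000
ks = rng.choice(2, size=n, p=p_k)
xs = np.array([rng.choice(3, p=p_x_k[k]) for k in ks])
ys = np.array([rng.choice(2, p=p_y_k[k]) for k in ks])

# left-hand side: (1/n) * log-likelihood (6.58)
lhs = np.mean(np.log((p_k[:, None] * p_x_k[:, xs] * p_y_k[:, ys]).sum(axis=0)))

# right-hand side: the two summands of (6.64)
p_y = p_y_k.T @ p_k                                   # marginal p_Y
p_k_given_y = (p_k[:, None] * p_y_k) / p_y            # posteriors p_{K|y}
emp_y = np.bincount(ys, minlength=2) / n              # empirical p'_Y
emp_x_given_y = np.zeros((3, 2))
for x, y in zip(xs, ys):
    emp_x_given_y[x, y] += 1
emp_x_given_y /= emp_x_given_y.sum(axis=0, keepdims=True)  # empirical p'_{X|y}
mix = p_x_k.T @ p_k_given_y                           # sum_k p_{K|y}(k) p_{X|k}(x)
rhs = emp_y @ np.log(p_y) + (emp_y * (emp_x_given_y * np.log(mix)).sum(axis=0)).sum()
print(abs(lhs - rhs))                                 # identical up to rounding
```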
Quite happy. Perhaps except that, when we had beaten the path to the task
(6.58) transformed into the task (6.64), we saw that it incorporated, besides
seeking the straight line Γ in the task (6.58), even something more. What
could that be?
That is evident. Naturally, besides seeking the straight line that approximates
the ensemble of functions {p_{X|y} | y ∈ Y}, another straight line must be sought
which properly approximates the ensemble of functions {p_{Y|x} | x ∈ X}. Without
this, the whole procedure would use asymmetrically the information
which each of the features x and y bears.
Now at last I feel that I understand the presented tasks and algorithms as if I
had found them out myself. I see that these tasks deserve that I think them
over well and find the algorithms for their solution. I am not sure that the
algorithm I have designed is really the right one. It is only an adaptation of a
general algorithm to a particular case. The general algorithm at the lecture
was proved only to converge monotonically to some fixed point. Can I be sure
that in our particular case the global maximum of the number (6.58) is reached
at the fixed point, similarly as it was in the Robbins task?
You can be certain of that, but this certainty is based only on repeated experi-
mental checking of the algorithm. A theoretical certainty has not
yet been achieved. It might be a nice task for you.
I would definitely not want to do that. I would rather formulate the task in
such a way that it is solvable with certainty, even if it did not appear
as well reputed as the tasks based on the maximum likelihood estimate. What
would you say to the following procedure?
For the empirical data p'_{XY}(x, y), x ∈ X, y ∈ Y, the system of equations

p'_{XY}(x, y) = Σ_{k=1}^{2} p_K(k) p_{X|k}(x) p_{Y|k}(y),    x ∈ X, y ∈ Y,

has, in general, no solution. What can be more natural in this case than seeking such
numbers p_K(k), p_{X|k}(x) and p_{Y|k}(y), x ∈ X, y ∈ Y, k = 1, 2, which minimise
the sum

Σ_{x∈X} Σ_{y∈Y} ( p'_{XY}(x, y) - Σ_{k=1}^{2} p_K(k) p_{X|k}(x) p_{Y|k}(y) )² ?    (6.65)
The task formulated in this way is not based on any statistical considerations, but
its advantage is that it has been thoroughly examined and its solution is known.
It is again a task about the Karhunen-Loève decomposition. In my context,
however, this sounds a bit unusual.
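If the probability constraints are dropped, the minimisation of (6.65) is exactly the best rank-two approximation of the |X| × |Y| matrix p'_{XY} in the Frobenius norm, which the truncated singular value decomposition provides (the Eckart-Young theorem). The sketch below illustrates only this unconstrained version; the constrained problem of the discussion is not solved by a plain SVD, and the numbers are illustrative.

```python
import numpy as np

# Best rank-2 Frobenius-norm approximation of an |X| x |Y| table via SVD.
rng = np.random.default_rng(0)
p_xy = rng.random((4, 3))
p_xy /= p_xy.sum()                       # a normalised empirical joint table
u, s, vt = np.linalg.svd(p_xy)
rank2 = u[:, :2] @ np.diag(s[:2]) @ vt[:2]
err2 = ((p_xy - rank2) ** 2).sum()
# Eckart-Young: the error equals the sum of squared discarded singular values
print(err2, (s[2:] ** 2).sum())
```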
This idea is new, and therefore we do not intend to restrain it at its birth. Still,
such a formulation does not seem very natural to us, since it is difficult to explain
the meaning of the squared differences of probabilities. Furthermore,
in further formal manipulations with the expression (6.65) quantities appear
such as

the 'length' Σ_x (p_{X|k}(x))²,
the 'scalar product' Σ_x p_{X|1}(x) p_{X|2}(x),
the 'matrix' of dimension |X| × |Y| whose elements are the probabilities p'_{XY}(x, y),
the 'covariance matrix' of dimension |X| × |X| with the elements Σ_y p'_{XY}(x', y) p'_{XY}(x'', y),

and other mathematical objects that are hard to interpret in terms
of our original task. You will need a certain amount of patience to rear this
idea and put it adrift into the world.
I have exploited nearly all I could from this lecture. Now I would like to place
the subject matter of this lecture in the total framework of statistical pattern
recognition, to which the previous lectures were devoted. You pointed out
that Robbins' methods, of which the presented unsupervised learning is a
generalisation, originated as an effort to fill the gap between the Bayesian and
non-Bayesian methods. It seems to me that this effort has succeeded only
to a small extent. Already from the formulations of the tasks we can see
the great difference between the Bayesian and non-Bayesian methods on one
side, and the empirical Bayesian approach and unsupervised learning on the
other side. In spite of all their distinctness, the Bayesian and non-Bayesian
methods have a common property. The purpose of each of them is to seek
a certain recognition strategy. Unsupervised learning tasks (and, in fact, even
supervised learning tasks) in the formulation presented in the lecture do
not lead to any recognition strategy, but require only the most likely estimate
of a priori unknown statistical parameters of an object. The mutual relation
between supervised learning and unsupervised learning tasks, formulated in this
way, and particularly the mutual relation between algorithms for their solution
is undoubtedly elegant, so that I may never forget it.
But (this unpleasant 'but' must always occur) a quite visible gap remains be-
tween the maximum likelihood estimate of unknown parameters and the search
for an optimal strategy. It does not follow from anywhere that in the case of
an incompletely known statistical model of an object the recognition strategy
is to be built exactly as a Bayesian strategy into which, instead of the actual
values of unknown parameters, their most likely values are substituted. Such
a procedure has, therefore, the form of a postulate which is accepted without
any justification. But a postulate should have a far simpler formulation, so
that a question of the type 'why exactly in this way?' does not arise. Here this
question is justified. The reply to it is usually based only on rather general and
imprecise considerations, for example that for a sufficiently large training multi-set
the most likely values differ only slightly from the actual ones, and that then
the strategy which uses these values also differs only slightly from the best one.
6.8 Link to a toolbox 273
There are now three of us who have these ambitious desires. We may sometime
manage to build up such a theory. We will now follow together the well known
and wise advice that one has to seek the truth, even though sometimes one has
to search for it nearly in the dark, but one should run away fast from
those who have already found that truth.
January 1998.
built on top of Matlab version 5.3 and higher. The source code of algorithms
is available. The development of the toolbox has been continued.
The part of the toolbox which is related to this lecture implements the
unsupervised learning algorithm for normally distributed statistical models (the
now speaking about the diversity of the statistical recognition theory, we mean
mainly the diversity of formal properties of sets that play a part in this theory.
In pattern recognition tasks it often occurs that an observation x does
not consist of one, but of several measurements x_1, x_2, …, x_n. We can speak not
only about the structure of sets of values pertaining to individual features, but
also about the structure of relations between the features, which is different
with different applications. We will examine two cases which illustrate the
different character of the structure of features (just of the set of features and
not of the set of their values).
In the first example, the features x_1, x_2, …, x_n are answers in a medical
questionnaire which is filled in at a patient's first visit to a doctor's surgery.
It concerns data about the patient's body temperature, blood pressure, pulse,
sex, say, n answers altogether. The second example is a case in which, after
medical treatment, one particular datum about a patient, say the body tem-
perature, is measured n times at regular time intervals. The outcome of such
an observation is again an ensemble of indexed values x_i, where the index i
assumes values from the set {1, 2, …, n}, as it was in the first case. However,
the dissimilarity between the structure of this ensemble and the structure in the
previous case must be evident. It was not essential in the first case for, e.g., the
age to be represented exactly by the third feature, because no essential change
would occur in the task if the features were numbered in another way. This
means that in the first case the set of features is simply void of any structure,
and its members 1, 2, …, n are considered not as numbers, but only as symbols
of an abstract alphabet.
In the second case the matter is quite different. Here the feature index has
just the meaning of a number. The ensemble of the measured values of the
feature forms a sequence, and the numbering of the sequence elements cannot be
arbitrary. The set of indices now has a clearly visible structure which the set
of indices in the first case lacked.
We will go back to these problems more than once and speak about them
in a more concrete way. At the moment, we only want to give a cursory taste
of the immense diversity expressed in the words 'let the sets of observations X
and states K be two finite sets', which used to be quoted refrain-like at the
beginning of formal reasoning in previous lectures. The results obtained are
not supported by any concretisation of the form of the sets X and K. A very
positive consequence is that the results are valid even in the case in which,
owing to a concrete context of an application, the mathematical form of the
sets X and K must be expressed more precisely. The negative consequence
follows from the mentioned generality too, because the sets X and K have to
be expressed as specifically as possible when the statistical methods are to be
used in a useful way. This means that from the vast set of possibilities a case
must be chosen that corresponds to the original application task. This is by no
means easy to do.
Fortunately, some applications can be expressed through a formalism which
has already been thoroughly examined in applied mathematical statistics. Its
most developed part is the statistics of random numbers. The overwhelming
7.2 Why is structural recognition necessary for image recognition? 277
pretend to be the only possible and universally usable one for any applica-
tion. Some researchers, and there were not few of them, could not avoid this
temptation.
For example, it was assumed that an image could also be represented by a
point in a linear space in a natural way. An image was regarded as a function
f(x, y), where x, y are coordinates of a point in the domain of definition of the
function f, which is a square with dimensions D × D. The value of the
function f(x, y) was the brightness (intensity) at the corresponding point. The
square is covered by N² smaller squares of dimensions Δ × Δ, Δ = D/N.
The observation of the image f corresponds to the measured value of the
average brightness in each smaller square. The outcome of the observation is an
ensemble of N² numbers considered as a point in an N²-dimensional linear space.
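The block-averaging formalisation just described can be sketched in a few lines. This is our own illustration, not taken from the lectures; the function name and the use of NumPy are our own choices:

```python
import numpy as np

def image_to_vector(f, N):
    """Turn a D x D image into a point in an N^2-dimensional linear space
    by averaging the brightness over an N x N grid of smaller squares of
    side delta = D / N (D divisible by N is assumed for simplicity)."""
    D = f.shape[0]
    delta = D // N
    blocks = f.reshape(N, delta, N, delta)
    # mean brightness of each delta x delta square, flattened to N^2 numbers
    return blocks.mean(axis=(1, 3)).reshape(N * N)

# a 4 x 4 image reduced to a point in a 2^2 = 4-dimensional space
f = np.arange(16, dtype=float).reshape(4, 4)
v = image_to_vector(f, 2)   # array([ 2.5,  4.5, 10.5, 12.5])
```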
A long time and a lot of effort were needed to understand that such a
formalisation of images is highly deceptive. This representation actually diverted
the analysis from its correct course. The identification of an image with a point in a
multi-dimensional linear space involuntarily invited one to use such sets, trans-
formations and functions that are well examined and verified for linear spaces,
i.e., convex sets, hyperplanes, half-spaces, linear transformations, linear func-
tions, etc. And it is these mathematical means that are least suitable for
images. That is also why the concept of linear space in the processing and
recognition of images has not started such an avalanche of fruitful results as
was the case, for example, in linear programming. Pure geometrical relations,
such as the possibility of passing along the edges of a polyhedron from an ar-
bitrary vertex of the polyhedron to any other vertex of it, greatly supported
the researcher's intuition, i.e., they made it possible to be certain that this or
that statement is right before it was formally proved. In the case of image
recognition it was just the opposite. An avalanche of results following from un-
derstanding the image as a point in a linear space appeared destructive rather
than becoming a contribution. This situation was evaluated quite sharply by
M. Minsky and S. Papert [Minsky and Papert, 1969], when saying that nothing
has brought about more damage to the machine analysis of images than the
multi-dimensional geometrical analogies.
Let us try to understand what the fundamental difference between an image
and a point in a multi-dimensional space consists in. To this end, let us first
formulate both concepts so as to see what they have in common. Both the
vector and the image can be considered as a function of the form T → V, the
domain of definition of which is a finite set T. The function itself assumes
its values from the set V. If this function is an n-dimensional vector, the
coordinates of which are the numbers x_i, i = 1, 2, ..., n, then T is a set of
indices, and V is a set of real numbers. If the function T → V is an image
then the set T is a rectangle in a two-dimensional integer lattice, i.e., T = {(i, j) |
1 ≤ i ≤ n; 1 ≤ j ≤ n}, and V is a kind of set of observations, as a rule a
finite one.
If we consider the function T -+ V as a multi-dimensional vector then we
can see that it assumes its values from a well structured set for which addition
and multiplication make sense, where special members 0 and 1 exist, and many
other things which provide a certain mathematical structure to the set. The
domain of definition of this function, however, can be understood as an abstract
alphabet only, without any structure or ordering.
With the function T → V representing an image, the case is just the oppo-
site. There are, of course, also applications in which the domain of the values
V must be considered just as a set of real numbers. Examples are tasks
where the brightness at a point of the image represents the result of a direct
measurement of a physical quantity, such as the temperature
of a product on a rolling mill train, or the overall density of a body in a particular
direction in computer tomography. But these are tasks of a kind that we
would not regard as typical of image analysis by humans. There exist many tasks
in which an observation at a certain point need not be considered so strictly.
For example, for the image x: T → V and a certain monotonically increasing
function f: V → V it commonly occurs that a change of the brightness x(t) at the
point t ∈ T to the brightness f(x(t)) at the same point does not alter anything
from the point of view of the information which one is observing within an image,
on the assumption that the transformation f is the same for all points t ∈ T.
This means that the domain of values V need not be considered as a set of real
numbers, but can be interpreted as a set with a far weaker structure, e.g., as
a totally ordered set. Situations in which the domain of values V has an
even weaker structure, or is void of any structure, are not rare. Consider, for
example, the analysis of colour graphical documents.
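The claim that only the ordering of V matters can be illustrated by a small sketch (our own illustration; all names are hypothetical): applying a strictly increasing f: V → V changes every brightness value, but leaves all brighter/darker relations between pixels intact.

```python
def order_relations(image):
    """All brighter/darker relations between pairs of pixels; only the
    total order on the value set V is used, not the numbers themselves."""
    pts = sorted(image)
    return [(s, t, image[s] < image[t]) for s in pts for t in pts]

# a tiny image x: T -> V on a 2 x 2 field T
x = {(0, 0): 10, (0, 1): 40, (1, 0): 40, (1, 1): 200}

# a monotonically increasing transformation f: V -> V (any strictly
# increasing function would do)
f = lambda v: v * v + 3
y = {t: f(v) for t, v in x.items()}

# the brightness values change, the ordinal relations do not
assert order_relations(x) == order_relations(y)
```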
At a cursory glance it might seem that applying a richer formalism than
is needed for the examined reality does not do any harm. The application of
a richer formalism, however, adds to the original task properties it does not
actually have. Algorithms are then involuntarily created that rely on these
additional properties, such as minimisation of the mean square deviation of
brightness, or linear transformations of an image. Thus operations are applied which make
sense within the accepted formalism, but which do not provide a reasonable
explanation in terms of the initial application task.
Now let us have a look at what differentiates the domains of definition of the
function T → V in the case in which the function is considered as a vector and
when it represents an image. The transition from an image to a vector is con-
nected with a loss of immensely important properties of the image, particularly
those which account for the specificity of the image as an information medium.
For the vector T → V the domain of definition is an abstract alphabet of in-
dices without any structure. For the image T → V the domain of definition T
is a rectangle in a two-dimensional integer lattice, which is a set with a clear
structure. Outside this structure one can hardly imagine properties such as
the connectivity of objects in an image, the symmetry of an image, and other
concepts important for image analysis. We can only admire the optimism with
which pattern recognition in its young days hoped that the machine analysis of
images would be successfully solved without taking into account the structure
of the domain of definition of the functions characteristic for the images.
In the constructive application of the general theory of pattern recognition
for image analysis, in the first place a set of observations X must be concretised
280 Lecture 7: Mutual relationship of statistical and structural recognition
someone said, 'That quite small part which I understand well in the statistical
pattern recognition theory has nothing in common with the applications that
I am engaged in'.
Let us summarise: for a practical application of the statistical pattern recognition
theory in analysing images it is necessary to thoroughly concretise the set K of
values of hidden parameters which are to be found in the analysis.
(i, j) is zero. Simply speaking, the image x is created in such a way that several
horizontal and vertical lines are coloured in black, but not all possible horizontal
and not all possible vertical lines.
Let us formulate the pattern recognition task so that, based on the knowledge
of the image x, it is necessary to find k, i.e., to tell what lines are drawn in the
image.
The task is actually a trivial one. Since the ensemble k does not include all
horizontal lines, there exists a row i* through which no horizontal line passes. In
addition, there exists a column j* through which no vertical line passes. Thus in
the image x there is a point (i*, j*) which is white. Therefore the solution of the
task is the following: the ensemble k contains the horizontal line h_i if and only
if x(i, j*) = 1, and it contains the vertical line v_j if and only if x(i*, j) = 1.
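The trivial noise-free solution can be written down directly (our own sketch; the function name is hypothetical): find any white point (i*, j*), then read off the lines from column j* and row i*.

```python
def recover_lines(x, m, n):
    """Recover the ensemble k of drawn lines from a noise-free image.

    x : dict mapping (i, j), 1 <= i <= m, 1 <= j <= n, to 0 (white)
        or 1 (black).  As in the text, it is assumed that not all
        horizontal and not all vertical lines are drawn, so a white
        point (i*, j*) exists.
    """
    i_star, j_star = next((i, j) for (i, j), v in x.items() if v == 0)
    horizontal = [i for i in range(1, m + 1) if x[(i, j_star)] == 1]
    vertical = [j for j in range(1, n + 1) if x[(i_star, j)] == 1]
    return horizontal, vertical

# a 3 x 3 image with the horizontal line h_2 and the vertical line v_3
m = n = 3
x = {(i, j): int(i == 2 or j == 3) for i in range(1, m + 1)
     for j in range(1, n + 1)}
assert recover_lines(x, m, n) == ([2], [3])
```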
We will now take into consideration the inevitable fact that the image x
cannot be observed without noise. We will see that even with the simplest model
of noise the task not only ceases to be trivial, but it is not solvable in polynomial
time.
Let the observed image be changed as a result of the noise to become the image
x′: T → {0, 1} so that in each point (i, j) ∈ T the equation x(i, j) = x′(i, j)
is satisfied with the probability 1 − ε, and the inequality x(i, j) ≠ x′(i, j) is
satisfied with the probability ε. Not to diverge from the very simplest model
of the noise, we will assume that the noise affects different points of the image
independently. When we assume that all ensembles k in the set K are equally
probable, the statistical model p(x′, k), i.e., the joint probability of the ensemble
k and the observation x′, is uniquely determined, and for each k and x′ it can
be easily calculated according to the formulae
which is to be solved under the condition that the variables α(i) and β(j) as-
sume only the values 0 or 1. It is not very difficult to write an algorithm for
this minimisation problem. However, this algorithm will require a computing
time proportional to 2^min(m,n), and thus cannot be practically applied.
No one in the world today seems to know substantially better algorithms.
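For completeness, here is a brute-force sketch of such an algorithm (our own illustration; since the minimisation formula itself is not reproduced above, we assume the quantity minimised is the number of pixels where the observed image x′ disagrees with the ideal image α(i) ∨ β(j), which is the maximum likelihood choice for ε < 1/2 under the noise model described). For each of the 2^m assignments of α the optimal β(j) can be chosen column by column, which gives the 2^min(m,n)-type running time mentioned:

```python
from itertools import product

def best_lines(xp, m, n):
    """Exhaustive search over the 2^m row assignments alpha (assume
    m <= n); for a fixed alpha each beta(j) is optimised independently,
    so the total cost is on the order of 2^m * m * n."""
    best = None
    for alpha in product((0, 1), repeat=m):
        cost, beta = 0, []
        for j in range(1, n + 1):
            # disagreements in column j for beta(j) = 0 and for beta(j) = 1
            c0 = sum(xp[(i, j)] != alpha[i - 1] for i in range(1, m + 1))
            c1 = sum(xp[(i, j)] != 1 for i in range(1, m + 1))
            beta.append(int(c1 < c0))
            cost += min(c0, c1)
        if best is None or cost < best[0]:
            best = (cost, alpha, tuple(beta))
    return best

# sanity check on a noise-free 3 x 3 image with lines h_2 and v_3
m = n = 3
x = {(i, j): int(i == 2 or j == 3) for i in range(1, m + 1)
     for j in range(1, n + 1)}
cost, alpha, beta = best_lines(x, m, n)
assert (cost, alpha, beta) == (0, (0, 1, 0), (0, 0, 1))
```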
In the previous example it can be seen how a seemingly easy task appears to be
practically unsolvable. We are speaking about an exact solution of an exactly
formulated task, and not about so-called practically acceptable suggestions
which, though they do not guarantee a solution, in the majority of practical
cases they . . . etc. Neither do we mention here the so-called practically
acceptable algorithms, since they can be discussed only in the case in which the
algorithm is intended for solving a task, which is of real practical significance.
Our example is mere child's play in which it is to be found whether there are ideal
lines in the image. Practical tasks are substantially more difficult, because in
the image far more intricate objects on a more complicated background are
to be sought. The situation is further complicated by not fully known errors
which affect the observed image. When coming across such tasks the incomplete
theoretical understanding of the applied algorithm can be excused, since the
requirements for theoretical perfection give way to far harder requirements
of a practical character. In contrast with practical tasks, theoretical clarity
should be achieved in the extremely simplified case quoted, be the result positive or
negative.
It is typical of the task mentioned that the set K of hidden parameter values
is so immensely large that one cannot apply algorithms which would require
finding and examining each member of the set K, even though the
examination of each single member is extremely simple. The immense extent of
the set K is nothing extraordinary in image recognition. In image recognition,
the set K is, as a rule, so extensive that its exhaustive enumeration is impossible
in practice. The complexity of image recognition tasks, however, does not only
consist in this extent. In the example mentioned, we can formulate the aim of
the recognition, e.g., in such a way that it is only to find whether a horizontal
line passes through the fifth row of the image. The task in this case would
just be the classification of the image set into two classes: into images in which
the particular line occurs, and into the rest of the images. Unfortunately, the
property that the hidden parameter assumes only two values does not bring
about any simplification of the problem. The choice of the most probable
value of the hidden parameter would not give rise to any difficulties in this
case, since the parameter assumes two values only. However, insurmountable
obstacles would appear in computing the a posteriori probabilities of these two
values.
The example mentioned demonstrates the usual difficulties in image recog-
nition, which do not consist in a lack of knowledge about the necessary relations,
sets, or probabilities. This knowledge might well be available, but it does
not belong to the well respected classes of mathematical objects which, owing
to centuries of study and research, have assumed a perfect and elegant form.
They are, simply, something new.
We will now present the main concepts which will be used in the following
explanation.
Let an object be characterised by a certain set of parameters. Even though
it need not always be quite natural, let us imagine that the parameter values
are written in memory which belongs to the object itself. Let us denote the
memory by the symbol T and call it the object field (the field of the object). We
would like to point out that T means only the memory, and not what is written
into the memory. The memory as a set of parameters T consists of memory
cells t which correspond to individual parameters. For the time being we will
assume that all parameters take their values from one single set S which is the
same for all parameters.
An object is completely described when for each parameter t ∈ T its value is
known; the value will be denoted s(t), s(t) ∈ S. Formally speaking, the description
of an object is the function s: T → S, the domain of which is the set of parame-
ters T (which is the same as the memory T), and its value domain is the set S.
The recognition task here and in other parts of the lecture will be understood
in this way: based on the knowledge of the memory's contents pertaining to its
part T′ ⊂ T, something meaningful is to be said about the contents of the rest
of the memory. Or in another formulation: based on the knowledge of the values
pertaining to some parameters, the values of others are to be found. Even
from this informal definition we can feel its affinity with statistical pattern
recognition, but here, from the very beginning, both the observed and the
hidden parameters are treated as sets of parameters.
For a more precise formulation of the concepts needed we will accept the
following notation. Let X and Y be sets. The set of all possible functions
of the form X → Y is denoted by Y^X. It is clear that |Y^X| = |Y|^|X|. Let
f ∈ Y^X be a function of the form X → Y and let X′ ⊂ X be a subset of X. The
restriction of the function f: X → Y to the subset X′ ⊂ X is defined as the
function f′: X′ → Y, where for all x ∈ X′ the relation f′(x) = f(x) holds.
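In a programming setting a function with a finite domain is just a mapping, so both the count |Y^X| = |Y|^|X| and the restriction are one-liners (our own sketch; the names are hypothetical):

```python
from itertools import product

# |Y^X| = |Y|^|X|: all functions X -> Y, one per tuple of values
X, Y = ('a', 'b', 'c'), (0, 1)
all_functions = [dict(zip(X, values)) for values in product(Y, repeat=len(X))]
assert len(all_functions) == len(Y) ** len(X)   # 2^3 = 8

def restriction(f, X_prime):
    """Restriction f': X' -> Y of f: X -> Y: f'(x) = f(x) for x in X'."""
    return {x: f[x] for x in X_prime}

f = {'a': 1, 'b': 2, 'c': 3}
assert restriction(f, {'a', 'c'}) == {'a': 1, 'c': 3}
```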
7.3 Main concepts necessary for structural analysis 285
Let us consider a decomposition of the set T (i.e., the field of the object)
into two subsets: the set Tx of observable parameters (the observable field)
and the remaining set Tk of hidden parameters (the hidden field). The symbols
x and k in the introduced notation are understood as part of the symbol,
not as an index.

[Figure 7.1 Restriction of the function f to the subset X′.]

The recognition task, still understood informally, assumes the following
form: there is a function s: T → S which is defined on the known set T and
assumes values from the known set S. The function s: T → S is not known,
but the restriction x: Tx → S of the function s to the known observed field
Tx ⊂ T is known. The task is to determine the restriction k: Tk → S of the
function s to the hidden field Tk ⊂ T. The function x: Tx → S represents the
same notion that was understood as an observation in the previous lectures.
The function k: Tk → S corresponds to the previous hidden state of an object.
The function s: T → S, which is nothing else but the pair created from the
functions x: Tx → S and k: Tk → S, will be called the complete description
of the object.
It is obvious that the formulation mentioned is not yet a task definition. The
hidden function k: Tk → S could be found on the basis of the observed function
x: Tx → S only in the case in which the relation between the functions k: Tk → S
and x: Tx → S were known a priori, i.e., if a constraint on the complete
description s: T → S of the object were known. We will define this relation
in two different but still similar ways. In the first case a subset L ⊂ S^T
of admissible ensembles of parameter values will be determined, i.e., the subset
of functions s: T → S that may occur. In the second case a function
p_T: S^T → ℝ will be determined. This function decides, for any function
s: T → S, i.e., for each ensemble of parameter values, the probability p_T(s)
of the occurrence of the function (as well as of the ensemble).
The foreshadowed formulation of the relation between observable and hid-
den parameters is exactly the link connecting our forthcoming explanation of
Example 7.2 Probabilities on fragments and on the entity. Let the ob-
ject field be T = {1, 2, 3} and the structure 𝒯 contain the pairs {1, 2}, {1, 3},
{2, 3}. From the assignment it follows that we consider three random variables
s1, s2, s3. The distribution of their joint probabilities is described by the func-
tion p_T(s1, s2, s3). This function is not known, but we know three distributions
of joint probabilities p_{1,2}(s1, s2), p_{1,3}(s1, s3) and p_{2,3}(s2, s3). These three
functions constrain the possible function p_T(s1, s2, s3), since it must satisfy
three equations,

    p_{1,2}(s1, s2) = Σ_{s3} p_T(s1, s2, s3),
    p_{1,3}(s1, s3) = Σ_{s2} p_T(s1, s2, s3),
    p_{2,3}(s2, s3) = Σ_{s1} p_T(s1, s2, s3).
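These marginalisation equations are easy to state in code (our own sketch with a toy joint distribution; the names are hypothetical):

```python
from itertools import product

S = (0, 1)                                      # a toy value set
# a joint distribution p_T(s1, s2, s3) on S^T, T = {1, 2, 3}
# (uniform, purely for illustration)
p_T = {s: 1 / 8 for s in product(S, repeat=3)}

def marginal(p, keep):
    """p_{T'} for a fragment T': sum p_T over the variables not kept."""
    out = {}
    for s, prob in p.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out

p_12 = marginal(p_T, (0, 1))   # p_{1,2}(s1, s2) = sum over s3
p_13 = marginal(p_T, (0, 2))   # p_{1,3}(s1, s3) = sum over s2
p_23 = marginal(p_T, (1, 2))   # p_{2,3}(s2, s3) = sum over s1
assert abs(p_23[(0, 1)] - 0.25) < 1e-12
```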
The assumption postulated in the paragraph preceding Example 7.2
does not define the probability distribution p_T uniquely. Uniqueness is
achieved using additional assumptions which are of the Markovian form. A
precise formulation of these assumptions will be left for future lectures. Here
we point out just once more the characteristic feature of structural pattern
recognition: a complex function with a wide domain S^T is defined by means of
a number of simpler functions with restricted domains of definition. As
before, the reduction of a complex concept to a number of simpler ones is not
always possible. And it is the possibility of such a reduction that delimits the
scope of structural methods, which is very extensive even despite this
limitation.
We will illustrate the concepts mentioned by making use of the task outlined in
Example 7.1 of horizontal and vertical lines.
Example 7.3 Horizontal and vertical lines, illustration of concepts. To store
the full description of an object it is necessary to have (mn + m + n) memory
cells in which information on the image (mn cells), on the set of horizontal
lines (m cells), and on the set of vertical lines (n cells) is stored. The set of cells
T, i.e., the object field, is the set {(i, j) | 0 ≤ i ≤ m, 0 ≤ j ≤ n, i + j ≠ 0}.
Only a part of the cells in this set is observable, namely the part containing the
information on the image. This part is the set {(i, j) | 1 ≤ i ≤ m, 1 ≤ j ≤ n}
and it represents the observed field Tx. The other, hidden part of the field T, in
which the information on horizontal and vertical lines is stored, contains the
cells (i, 0), i = 1, ..., m, and the cells (0, j), j = 1, ..., n. Thus the hidden field
Tk is the set {(i, 0) | 1 ≤ i ≤ m} ∪ {(0, j) | 1 ≤ j ≤ n}. The set of values which
can be stored in the cells is evidently {0, 1}, where the numbers 0 and 1, written
in the hidden cells, carry the information on whether a certain line occurs in the
image. The same numbers, written in the observed cells, inform about the values
of brightness (here black and white only) at certain positions of the image.

The relationship between all parameters of the object, be the noise taken
into account or not, is expressed by virtue of a structure of the third order,
and thus through a structure which contains triplets of cells of the form
((i, 0), (0, j), (i, j)), 1 ≤ i ≤ m, 1 ≤ j ≤ n. So the structure of the object
is the set 𝒯 = {((i, 0), (0, j), (i, j)) | 1 ≤ i ≤ m, 1 ≤ j ≤ n}.
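The field, the two subfields and the structure of this example are small enough to build explicitly (our own sketch; the function name is hypothetical):

```python
def line_task_fields(m, n):
    """Object field T, observed field Tx, hidden field Tk, and the
    third-order structure of triplets for the lines example."""
    T = {(i, j) for i in range(m + 1) for j in range(n + 1) if i + j != 0}
    Tx = {(i, j) for i in range(1, m + 1) for j in range(1, n + 1)}
    Tk = T - Tx
    structure = {((i, 0), (0, j), (i, j))
                 for i in range(1, m + 1) for j in range(1, n + 1)}
    return T, Tx, Tk, structure

T, Tx, Tk, structure = line_task_fields(3, 4)
assert len(T) == 3 * 4 + 3 + 4    # mn + m + n memory cells
assert len(Tk) == 3 + 4           # one hidden cell per possible line
assert len(structure) == 3 * 4    # one triplet per image pixel
```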
In the case in which the noise is not taken into consideration, the constraint
on the description s: T → S is defined by the set of permitted triplets
(s(i, 0), s(0, j), s(i, j)) of values that can occur in the fragment ((i, 0), (0, j),
(i, j)). This set is the same for all fragments, namely {(1, 1, 1), (1, 0, 1),
(0, 1, 1), (0, 0, 0)}. This set formally expresses the relationship between hidden
and observable parameters which was informally stated before. It indicates
that the cell (i, j) can be black only when the i-th horizontal line or the j-th
vertical line is represented in the image.
When the noise is taken into account, a probability must be given for
each triplet (s(i, 0), s(0, j), s(i, j)). In our case we assume that the probabilities
do not depend on the coordinates (i, j). It is therefore necessary to have |S|³ = 8
numbers, which are presented in Table 7.1.
Table 7.1 Probabilities of the triplets (s(i,0), s(0,j), s(i,j)).

                               s(i,j) = 1      s(i,j) = 0
  s(i,0) = 0    s(0,j) = 0     (1/4)ε          (1/4)(1-ε)
  s(i,0) = 0    s(0,j) = 1     (1/4)(1-ε)      (1/4)ε
  s(i,0) = 1    s(0,j) = 0     (1/4)(1-ε)      (1/4)ε
  s(i,0) = 1    s(0,j) = 1     (1/4)(1-ε)      (1/4)ε
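The eight numbers of the table come from a single rule, which can be sketched as follows (our own illustration of the model described above: a uniform 1/4 prior on the pair of line indicators, and a pixel that matches the ideal value s(i,0) ∨ s(0,j) with probability 1 − ε):

```python
def triplet_probability(h, v, pixel, eps):
    """p(s(i,0)=h, s(0,j)=v, s(i,j)=pixel): a uniform 1/4 prior on the
    pair of line indicators times the noise term -- the observed pixel
    equals the ideal value (h OR v) with probability 1 - eps and
    differs from it with probability eps."""
    ideal = h or v
    noise = (1 - eps) if pixel == ideal else eps
    return 0.25 * noise

eps = 0.1
# the eight numbers of Table 7.1 sum to one
total = sum(triplet_probability(h, v, p, eps)
            for h in (0, 1) for v in (0, 1) for p in (0, 1))
assert abs(total - 1.0) < 1e-12
```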
The concepts illustrated in the previous example do not yet define any pattern
recognition task, but make us acquainted with the 'characters' which will be
acting in the task. These are the concepts the formulations of which will precede
the formulation of structural recognition problems, similarly as the sentence
'let X and K be two finite sets, the function p_{XK}: X × K → ℝ being defined
on their Cartesian product' preceded the formulation of a task in the general
statistical theory of pattern recognition.
In the further explanation, the previous brief sentence will be replaced by the
following more detailed introductory sentence. Let T and S be two finite sets,
where T is the set of parameters and S is the set of values which each parameter
assumes. Let Tx ⊂ T be the subset of observed parameters, the parameters
from the set Tk = T \ Tx being hidden. Let 𝒯 be a finite set of subsets of T.
For each set T′ ∈ 𝒯 let the following be determined:
either a subset L_{T′} ⊂ S^{T′};
or a probability distribution p_{T′}: S^{T′} → ℝ.
Various tasks will be formulated after these introductory sentences. Gener-
ally, the tasks will require that, according to the known function x: Tx → S
defined on the observed field, the function k: Tk → S defined on the hidden
field should somehow be found. The following lectures will be devoted to the
analysis of tasks of this kind for different classes of structures, starting from the
simplest cases and proceeding to the analysis of the problem in its complete
generality.
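The introductory sentence above translates directly into a data structure (a minimal sketch with hypothetical names; nothing here is prescribed by the lectures):

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

@dataclass
class StructuralTask:
    T: FrozenSet          # object field: the finite set of parameters
    S: Tuple              # the finite set of values each parameter assumes
    Tx: FrozenSet         # observed field, a subset of T
    structure: FrozenSet  # finite set of subsets T' of T
    L: Dict = field(default_factory=dict)  # T' -> admissible ensembles L_T'
    p: Dict = field(default_factory=dict)  # T' -> distribution p_T' on S^T'

    @property
    def Tk(self):
        return self.T - self.Tx            # hidden field Tk = T \ Tx

task = StructuralTask(T=frozenset({1, 2, 3}), S=(0, 1),
                      Tx=frozenset({1, 2}), structure=frozenset())
assert task.Tk == frozenset({3})
```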
7.4 Discussion
Lecture 6 is a kind of dividing line in that we completed a section of pattern
recognition theory. I would now like to ask you a favour. Upon my supervisor's
recommendation, I acted as a tutor for a seminar of the optional subject Pattern
Recognition for students of Applied Informatics. As a basis I used your lectures.
Some questions I have asked you in our discussions were questions asked by my
students in the seminars.
It has occurred to me that it would be helpful if I organised a seminar to
summarise and verify what the students had learned. Could you possibly make
a list of questions for me to use in the seminar?
The previous lectures were to ensure that those who studied them would be
well oriented in the following concepts and tasks.
1. Observable and hidden parameters of an object, strategy, penalty functions,
risk, Bayesian tasks of statistical decision making.
2. Probability of the wrong recognition (decision) as a special case of Bayesian
risk; the strategy which minimises the probability of the wrong decision.
3. Bayesian strategy with allowed non-decision.
4. The deterministic character of Bayesian strategies.
5. The Neyman-Pearson task and its generalisation to a case in which the
number of states is greater than two.
6. Minimax task.
7. Wald task of the form presented in Lecture 2.
8. Testing of complex hypotheses, statistical decision making after Linnik.
(Let us note that the form of strategy for solving the tasks under items 5-8
is to be derived by using two theorems of duality).
9. Risk, empirical risk, Chervonenkis-Vapnik theorem on the necessary and
sufficient condition of convergence of the empirical risk to the risk itself.
10. The growth function and the capacity of a set of strategies, sufficient con-
ditions for convergence of the empirical risk to the risk itself.
11. Anderson task in generalised form, necessary and sufficient conditions for
optimality of strategies in Anderson task.
12. Linear strategy, algorithm of perceptron and Novikoff theorem.
13. Linear strategy, Kozinec algorithm and its validation.
14. Fisher strategy by means of linear separation algorithm.
15. ε-solution of the Anderson task by means of a linear separation algorithm.
16. Formulation of clustering tasks, taking the ISODATA algorithm as an ex-
ample.
17. Empirical Bayesian approach by Robbins, Robbins task and its generalisa-
tion.
18. Mutual relationship of learning and unsupervised learning in pattern recog-
nition, the unsupervised learning algorithm, a theorem about its monotonic
convergence.
That may be all. We wonder how successful your students will be in seminars
and at examinations.
I expect that future lectures will be substantially different from the previous
ones. It may be a question of an entirely different subject matter which will be
based on new concepts. That is why I would like to know if the next subject
matter can be understood without knowing the previous one. Some students
who did not attend my seminars would like to join them. Is not now the right
moment for them to join the course?
a conception only a small part of application tasks can be mastered. The im-
age recognition problems, in which complex structural relations between image
fragments need not be taken into account, are quite rare. Similarly, rare are
tasks where the random character of images need not be considered. The
randomness is, at the very least, inevitably introduced by noise in the observed
image. And thus, in solving practical tasks, both of the images' properties must
be considered: their complex structure as well as their random character.
I am afraid you do not seem to have understood my question in the right way. I
did not ask whether both the previous and the future subject matter are necessary
for practical activity in the pattern recognition field. I have some doubts
about that as well, but I am going to ask that question later. I am now above all
interested in whether the subject matter of the future lectures can be understood
without the knowledge of what was explained in the preceding ones.
We have understood your question in the right way and we answer once more:
The following lectures cannot be delivered to those who did not properly master
the previous subject matter. In solving practical tasks, the two aspects of the
images' nature must be considered in a more ingenious way than by merely
applying purely statistical methods at a certain stage of processing, and then
applying purely structural methods at a later stage. The applied algorithms are
to make use of both the statistical and the structural features of the image. Therefore
one cannot say if the probability of the wrong decision is being minimised at
some step of the algorithm, or the structural relations between fragments of the
image are being analysed. Both activities are performed together at each mo-
ment. In our future lectures we will direct our attention towards designing such
algorithms and therefore a competence in statistical methods is a prerequisite.
You are saying that for the future explanation everything that was previously
dealt with is needed. Could you, perhaps, select the key concepts? I would like
to recommend them to the students who are going to join us now so they can
study the topics by themselves.
Boy, you are rather strongly insistent today. You want, at all costs, to find
something that was unnecessary in previous lectures. Well, as the saying goes,
discretion is the better part of valour.
To understand the subsequent subject matter it would be required to know
the Bayesian formulation of the pattern recognition task, the formulation of
learning and unsupervised learning tasks as the maximum likelihood estima-
tion of unknown statistical parameters. The designing of unsupervised learning
algorithms should be understood almost automatically. Moreover, it is neces-
sary to master the procedure seeking the linear discriminant function by means
of Kozinec or perceptron algorithms which is nearly the entire contents of Lec-
ture 5.
I would rather be less insistent. But the subject matter is far too extensive.
I am not afraid of it since I have carefully studied and understood all of the
subject matter. At the same time I wonder if we will still need algorithms
linearly separating subsets in a linear space. Methods based on representing
the set of observations in a linear space were subjected to such sweeping
criticism from your side that I believed I should forget them as something
that leads the student astray. Since you are saying that we will still need
these methods in structural analysis, I cannot come to terms with it.
It might sound strange even to us, if we did not get used to peculiar destinies of
some scientific ideas that develop in a quite unpredictable and unexpected way
and start to live lives of their own. Let us again recall Minsky and Papert [Min-
sky and Papert, 1969], who compared pattern recognition to a mathematical
novel in which the characters can disappear for some time and reappear at the
right moment. Only after a lapse of time could it be seen that the contribu-
tion of those ideas is far richer than it seemed at first glance. Even when the
criticism against the linear approach is justified we will, despite this, actively
apply methods seeking separating hyperplanes in linear spaces as soon as we
deal with problems of learning in structural pattern recognition. Do not try to
cope with this contradiction now. When we arrive at this subject matter, the
knowledge will settle down in a desirable way.
It seems to me that I have understood the main concepts of the lecture which
will become the basis for future formal constructions. The objects of research
will be functions of the form T → S, where T is a finite set for which a structure
𝒯 ⊂ 2^T is defined, where 2^T denotes the system of all subsets of T. Why was
nothing said about the structure of the set S?
The structure of the set S will not be taken into consideration anywhere in
our lectures. The course was built up, in a sense, in contrast with the theory
which formalises the observed object by means of points in a linear space, i.e.,
by means of a function which assumes numerical values, but is defined on a set
void of any structure. We will examine the opposite of such an assumption and
will see what results follow from a case, where the domain of definition of the
function is a well structured set, and despite this, the structure of the value set
is not taken into consideration.
It is natural that this diametrically opposed view has its drawbacks. But
you must not forget that the anticipated view of the tasks is rather general.
From the fact that the structure of the value set of a function is not considered,
it does not follow that the results of such an examination hold only for sets
void of any structure. On the contrary, there will be results that are valid
for sets with whatever structure they may have. Another matter is that, under
certain assumptions about this or that set of values which the function under
examination assumes, further results could be obtained. These
open questions, however, will not be dealt with in our lectures.
We will reply in the style of the Clever highlander girl (a character from the
popular Czech fairy tale). Yes as well as no. Do not ask for a more precise
answer from us now. We would like to draw your attention to the fact that the
question, whether the restriction to a certain set of variables can be expressed
by means of relations in their subsets, is not so simple. We will examine this
question in detail later.
We will now only demonstrate how, by the means mentioned in the lecture,
a requirement is to be expressed that the ensemble of numbers {s(1,0),
s(2,0), ..., s(m,0)} must contain at least one zero. You are right that this
constraint cannot be reduced to partial constraints. But auxiliary variables
(z(0), z(1), ..., z(m)) can be introduced so as to make the reduction possible.
A local constraint will be introduced for the variables z(0), z(m) and
for the triplets of variables of the form (z(i−1), s(i,0), z(i)), i = 1, 2, ..., m.
The constraints will result in the following meaning of the variables z(i): the quantity
z(i), i = 1, 2, ..., m, is zero if and only if at least one of the quantities
{s(1,0), s(2,0), ..., s(i,0)} is zero. The local constraints therefore state that
z(0) must be 1, z(m) must be 0, and the triplet of values (z(i−1), s(i,0), z(i))
must be one of the following four triplets: (1,1,1), (1,0,0), (0,1,0), and (0,0,0).
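For readers who want to convince themselves of this construction, here is a small brute-force check in Python (our own sketch, not part of the lectures): it verifies that the chain of local constraints with z(0) = 1, z(m) = 0 and the four allowed triplets is satisfiable exactly when at least one s(i, 0) equals zero.

```python
from itertools import product

# Allowed triplets (z(i-1), s(i), z(i)) and the boundary conditions
# z(0) = 1, z(m) = 0, as stated in the text.
ALLOWED = {(1, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 0)}

def chain_satisfiable(s):
    """True iff values z(0), ..., z(m) exist with z(0) = 1, z(m) = 0 and
    every triplet (z(i-1), s[i], z(i)) belonging to ALLOWED."""
    reachable = {1}                       # possible values of z(0)
    for si in s:
        reachable = {z for (zp, sv, z) in ALLOWED
                     if zp in reachable and sv == si}
    return 0 in reachable                 # z(m) must equal 0

# The chain is satisfiable exactly when at least one s(i) is zero.
for m in (1, 2, 5):
    for s in product((0, 1), repeat=m):
        assert chain_satisfiable(s) == (0 in s)
```

The set `reachable` plays the role of a forward pass along the chain: a global constraint on m variables is checked by m local three-variable constraints plus the auxiliary chain.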
I see that I have opened a box with rather complicated questions. I realised
that some complex 'multi-dimensional' relations can be expressed in a sim-
plified form. This simplification consists of reducing 'multi-dimensional' rela-
tions to a larger number of 'less-dimensional' ones. It is also clear to me that
there exist multi-dimensional relations for which the aforementioned reduction
is not possible. For the present, can I assume that some multi-dimensional
relations can be simplified only when additional variables are added to the
original ones?
If we overlook the not quite satisfactory precision of your statements, you are
right.
From what has been said, a lot of new questions arise. But I am afraid that I
am in too much of a hurry when I wish to know just now what will be spoken
on at the lectures. Still before I start with the questions, I would like to make
clear for myself, roughly at least, the relation between the previous and future
theories. It may be untimely to make an attempt to do it just before the
explanation gets started. But the formulation of concepts itself, such as the set
of observations S, the object field T, the observed field T_X ⊂ T, the hidden
field T_K ⊂ T, T = T_X ∪ T_K, the structure of the field 𝒯 ⊂ 2^T and the system
of subsets L_{T'} ⊂ S^{T'}, T' ∈ 𝒯, or of functions p_{T'}: S^{T'} → ℝ, T' ∈ 𝒯, already
294 Lecture 7: Mutual relationship of statistical and structural recognition
creates the framework of the future theory, and gives an idea of what will be
dealt with at the lectures. I have a certain idea at least. For the time being,
my idea does not suffice for me to answer the question whether the structural
theory is a concretisation of the statistical pattern recognition theory, or its
generalisation.
On the one hand it was said in the lecture that the structural theory is a
particular case of the general statistical theory, which concretises the form of
the set X × K, whereas the general theory is not based on any assumption
concerning the form of the sets X and K.
On the other hand, the general theory as a whole can be regarded as a
trivial case of the structural theory in which the observed field Tx consists
of one single cell tx, the hidden field consists of a single cell tk, the set S is
X ∪ K. The structure 𝒯 is {{t_x, t_k}}, i.e., it consists of one single subset. The
system of functions p_{T'}: S^{T'} → ℝ, T' ∈ 𝒯, also consists of a single function
p_{X×K}: X × K → ℝ. In this case the structural pattern recognition can be
regarded as an immense generalisation of the previous statistical theory of
pattern recognition which in itself is impractically general. Is it really so?
Certainly, it is. Do not take it too seriously. The statements of the type 'X is a
special case of Y' or 'X is a generalisation of Y' express something unambigu-
ously understandable only when the objects X and Y are formally defined in an
unambiguous way. In our case these are two rather general theories. An exact
expression of the relationship between them would require the construction of
a kind of metatheory beforehand, i.e., a theory about theories. But we are not
inclined to do it, and so forget the last question, please.
Even though I did not make my previous question with much seriousness, I still
cannot dismiss it completely from my mind because I am interested in it for
quite earthly, and therefore serious reasons. Now may be the most opportune
moment for me to remember why, in fact, I started to study your lectures. If
you still remember, my interest in pattern recognition was stimulated by the
fact that I had written a program for character recognition. I made it on the
basis of considerations that seemed to me quite reasonable, but in spite of that,
it did not work satisfactorily.
Now I can tell you that it was a program for recognizing standard printed
characters. There are a lot of similar programs commercially available. None
work perfectly, but one can make a good profit from such programs. The
program I wrote was not based on the results of pattern recognition theory for
the very reason that most of them were not known to me. Now, thanks to your
lectures, I know the fundamental results of the theory, but I do not have even
a vague notion of how to go back, by virtue of these outcomes, to my earthly
objectives. The outcomes of the pattern recognition theory are still abstract
for me, even when I do not cover up the reason for my interest. I hoped that
in the second part we would pass, at last, from the abstract level to a more
earthly one. But now I see that instead of returning from the sky to our Earth,
we are flying still higher, and I do not know when my quite concrete tasks get
7.4 Discussion 295
their turn. I am almost afraid that I am heading further and further away from
my original objective for the sake of which I had set off, together with you,
on a laborious journey. Could I, possibly, ask you the question what out of
the previous subject matter, and perhaps also out of the future one, may be
directed at my not very noble interests?
Why could you not ask? But it may be clear to you that your question is far
more difficult than all the previous ones. First, it requires from us the acquisi-
tion of your application task, and second, to know what you have actually done.
In any case the answer to your question can be neither short, nor complete. If
you did not find by yourself what to use from the previous subject matter then
no short answer will persuade you. Let us try to reconstruct what troubles
you have already undergone and what is still ahead of you. You might feel
disappointed seeing that we did not assess or explain your steps adequately.
Therefore, you had better imagine that we are not speaking about you, but
about someone else.
You certainly tried, at first, to cope with the task on the simplifying assumption
that only one single character is present in the image. As a further
simplification, the character can have only one of two labels, say, A or B. The
character can be located in any place on the image, it has a certain rotation
and size. You assume that when there are more characters arranged in several
lines certain complications will arise, but you will leave their solution to a later
time. At first, you will cope with primary problems, which lie on the surface
and do not seem to be difficult for you.
You will scan several hundred images, each of them consisting of 150 · 100
pixels. The character in the image covers a rectangle which consists of 30 · 20
pixels. You made the size of the image five times larger than the size of
the character because you anticipate that you will soon intend to recognise
simple texts which consist of 25 characters written in five lines by five char-
acters each. For the time being there is only one character in each of the
scanned images which can be placed anywhere. A character can assume any
out of (100 - 20) · (150 - 30) = 9600 positions. Thanks to binary scan-
ning, the image will be represented in the memory of your computer by a
two-dimensional array x the element x(i,j) of which is 1 or 0 according to
whether the pixel with coordinates (i,j) is white or black. When you display
the ensemble on the screen, and you view your characters, you will see how
perfectly you can recognise them for your part, and in optimistic spirits you
will write a program, which, as you are sure, will also cope with character
recognition.
You will design the program on the basis of reasonable considerations. You will
prepare two small images of dimensions 20 · 30 by which you represent two
recognised characters, A and B, which are, in your opinion, ideal. You call
them exemplars. Then you will copy either of the two exemplars 9600 times
so that you will obtain 9600 images of dimensions 100 · 150, where in each
image there is a character A in all possible positions. In a similar way, you
will obtain 9600 images for a character B. The images will be denoted so that
k^* = \arg\min_k \Bigl( \min_t \sum_{i=1}^{150} \sum_{j=1}^{100} \bigl| x(i,j) - v_{t,k}(i,j) \bigr| \Bigr) \qquad (7.2)
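As an illustration, the exhaustive strategy (7.2) can be sketched in Python as a sliding-window search (a hedge: the lecture's v_{t,k} are full-size shifted copies of the exemplar, which amounts to the same comparison when the background outside the window is blank; all names and shapes here are ours):

```python
import numpy as np

def match_character(x, exemplars):
    """Strategy of the form (7.2), sketched as a sliding window: over every
    class k and every placement of its exemplar inside image x, minimise the
    sum of absolute pixel differences (the Hamming distance for binary data)."""
    best_dist, best_k = None, None
    for k, v in exemplars.items():
        h, w = v.shape
        for i in range(x.shape[0] - h + 1):
            for j in range(x.shape[1] - w + 1):
                d = int(np.abs(x[i:i + h, j:j + w] - v).sum())
                if best_dist is None or d < best_dist:
                    best_dist, best_k = d, k
    return best_k
```

With 30·20 exemplars inside a 150·100 image this enumerates exactly the 9600 positions per class counted above, which also hints at why such a program turns out to be slow.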
You will use the written program for practical experiments and their results
will astonish you. They are not only bad. They are unexplainable.
On the one hand, the program makes frequent errors, on the other hand, the
errors appear in cases other than those you would expect. You would be able,
for example, to explain if an error occurred for the character A, which would
be so much distorted by noise that it could be hard to find, in an objective
way, if it concerned the character A or B. But such errors are hardly noticed
by you. The algorithm quite frequently and with certainty decides for the
character A, even in the cases in which you can clearly see that a rather correct
character B was presented to it. At first, you assume that an error has crept
into the program. After carefully checking the program, you will arrive at a
conclusion that you will have to explain the wrong behaviour of the program
on the basis of additional experiments. There you will come across difficulties
since your program is too slow to be apt for extensive experiments. And so you
will arrive at the knowledge that you will have to deal with questions such as
the recognition speed which you intended to put off until the time when the
algorithm worked properly.
In the effort to make the program faster you will notice, at first, that ac-
cording to the algorithm, i.e., the relation (7.2), the dissimilarity (7.1) is to
be calculated for every possible position t of the character in the image. You
assume that it could be possible to find the actual position t_0 of the character
by a less expensive means. If the actual position t_0 were known, then the
answer k* sought could be determined by the simpler algorithm
k^* = \arg\min_k \sum_{i=1}^{150} \sum_{j=1}^{100} \bigl| x(i,j) - v_{t_0,k}(i,j) \bigr| \qquad (7.3)
instead of the calculation (7.2). We will not deal with the question which of
the known algorithms has to be used. The reason is that all of them are rather
bad. Actually you used one of them.
which can be interpreted in a double sense. If both the observed image x and
the exemplars v_A and v_B are binary, then the coefficient a^k_{ij} is to be adjusted
to the value

\log \frac{p^k_{ij}}{1 - p^k_{ij}} \, ,

where p^k_{ij} is the probability that in the image x, which actually displays the
character k, the observation x(i,j) will not have the value v_k(i,j). That means
that there will be no observation in the pixel (i,j) which would be expected in
an ideal image, i.e., one undistorted by noise.
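A minimal sketch of this independent-pixel model (the function names and the way the deviation probabilities p^k_{ij} are supplied are our assumptions, not the book's): up to a per-class constant, maximising this log-likelihood weighs each mismatching pixel by log(p^k_{ij}/(1 − p^k_{ij})), as described above.

```python
import numpy as np

def log_likelihood(x, v, p_dev):
    """Log-likelihood of binary image x under exemplar v, assuming each
    pixel deviates from v independently with probability p_dev[i, j]."""
    mismatch = (x != v)
    return float(np.sum(np.where(mismatch, np.log(p_dev), np.log(1 - p_dev))))

def recognise(x, exemplars, deviations):
    """Pick the class k maximising the independent-pixel log-likelihood."""
    return max(exemplars,
               key=lambda k: log_likelihood(x, exemplars[k], deviations[k]))
```

Note that when all p_dev are equal, this reduces to the plain Hamming-distance rule; the per-pixel probabilities are what makes it more refined.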
But it can happen that you will find at this stage of analysis that many erroneous
decisions were due to the wrong binarisation of the original image performed
by the scanner. You can choose a more exact way of data acquisition
by scanning more brightness levels. The brightness is then represented not by
a binary number but by an integer in an interval, say, from 0 through 255. In this
case you can express the numbers v_k(i,j) in the formula (7.4) as the mathematical
expectation of the numbers x(i,j) in the actual image which represents the
character k, and the coefficient a^k_{ij} as (σ^k_{ij})^{-2}, where (σ^k_{ij})^2 is the variance of that
number.
It is an awfully primitive method of recognition, but it is still more advanced
than the calculation of Hamming distance which seemed to you the only pos-
sible one at the beginning. These concrete recommendations are not the most
important. More significant is that you have learned in the third lecture from
which assumptions about the probability distribution p_{X|K} these recommendations
follow. If these assumptions are satisfied, then the algorithm recognising
characters must (!!!) operate not only in a correct, but also in an optimal way.
Now, when your algorithm has continued yielding unsatisfactory results (and
the results will be really bad all the time), you need not examine that algorithm
of yours, but you can view the images themselves without recognising them,
and try to find which of the assumptions about the probabilities p_{X|K}, from which the
application of the formula (7.4) follows, are not actually satisfied.
Stop it, please. I apologise for being impolite, but I feel I must interrupt
you. Up to now you have managed to give a true picture of my past troubles.
It was clear to me that we still keep, in our field of view, our aim which is
to create an algorithm for recognising images containing texts. But now you
actually recommend me to quit refining the algorithm and to begin a kind of
new research. Just now, I started having a feeling that we are beginning to
go away from my original task and get stuck in new research from which there
would be no way out. I am afraid that in your further explanation my feeling of
going away from the original task will be reinforced. When you are at the end
of your explanation I will be flummoxed by it. I would not know how to adapt
to my task the interesting information which I will certainly learn from you.
You are right to have interrupted us. We are not going away from the solution
of your original task, but only reveal that we are still some distance away from
the solution. Please realise that when two people are at the same distance
from the target, but only one of them knows the distance, it is he that has an
advantage over the other. Moreover the second person starts from the wrong
assumption that he is quite near to his target.
The path to solving the pattern recognition task inevitably leads through
examination of the model of the object that is to be recognised. Sometimes
the path is long, sometimes short, it is not always pleasant, but it is a passage.
He who would prefer to avoid it completely can arrive at a marshland. Let us
examine the formula (7.4), and try to answer why it does not work properly
even if it works a bit better than the relation (7.3). The reply to this question
will be sought by examining which premises validating the procedure (7.4)
are not actually satisfied.
If you examine even a small number of images then you will rather quickly
discover that at least one premise is not satisfied on the basis of which the
formula (7.4) was derived. It is the premise concerning the independence of
brightness parameters x(i, j) in different pixels of the observed image. You will
see that there are groups of pixels in which the same, or very similar brightness
is observed, but which changes from one image to the other. In addition you will
notice that in the image such pairs of regions exist within which the brightness
is the same, or nearly the same, but the brightness parameters in pixels of
different regions are usually different. A dependence can even be observed in
which the brightness parameters in one region rise while those in another fall, etc.
Primarily you have the idea that dependencies of this kind can be expressed
by the general Gaussian model of multi-dimensional random variables. The
particular case of this idea is the independence of the components of the multi-
dimensional variable which was the basis for deriving the formula (7.4). You are
quite entitled to be afraid of expressing the revealed dependence by covariance
matrices which would assume huge dimensions in this case, i.e., the number
of pixels in the image raised to the power of 2. Already for tiny images of
the dimension 30 · 30 pixels we can speak about a covariance matrix of the
dimension 900 · 900. Of course, if you did not have any knowledge of the nature
of brightness dependence in pixels then you would have no other simpler choice
than to express the dependence by means of covariance matrices.
But you have quite thoroughly examined and understood your characters.
You have arrived at a conclusion that if a character really occurred in the same
position in the image then the dependence of brightness in pixels would be
far less. The dependence between pixels mostly results from the property that
your method of finding the position of a pixel in an image, be it of any kind,
is rather inaccurate. The actual position of a character may differ from the
position found by your algorithm, say, by one, or two pixels to the left, to the
right, downwards, or upwards. That is just why the brightness parameters in
the pixels of an actual image do not change independently of each other but all
at once for the whole pixel group. It happens because a character as a whole
has moved, say, by two pixels to the right. If you know not only the mere fact
that brightness parameters in pixels are mutually dependent, but, moreover,
you know the mechanism of this dependence then you can state the dependence
not by covariance matrices, where your knowledge might dwindle away, but by
another model in which the revealed mechanism of the dependence will be ex-
plicitly expressed. For this case it would be best to consider the probability
distribution p_{X|K} as a mixture of twenty-five partial distributions. The value
25 is the number of all possible positions in which a character can actually
occur when your algorithm has defined a certain position for it, i.e.,
p_{X|k} = \sum_{t=1}^{25} p^k(t) \, p^t_{X|k} \, , \qquad (7.5)
and this means that the strategy for recognition should be as follows: it has to
decide that the image represents the character A if
\sum_{t=1}^{25} p^A(t) \, p^t_{X|A}(x) \ge \sum_{t=1}^{25} p^B(t) \, p^t_{X|B}(x) \, , \qquad (7.6)
and the character B in the opposite case. In the formulae (7.5) and (7.6), the
number p^A(t) states how often the displacement, numbered by t, of the actual
position of the character A from the position found by your algorithm has occurred.
A similar meaning is that of the numbers p^B(t) which, of course, are
not the same as p^A(t), because the accuracy with which your algorithm finds the
position of the character A need not be the same as the accuracy of finding the
position of the character B. The function p^t_{X|k} is the probability distribution of
the image which displays the character k in the position t. It is a distribution
about which one can now assume with greater certainty that the brightness
parameters in individual pixels are mutually independent. The influence due
to the change of the position of the character, which was the main cause of this
dependence, is out of the question.
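The rule (7.6) amounts to comparing two mixture values; a small sketch (ours, with the position-conditioned densities p^t_{X|k} supplied as callables):

```python
def decide(x, p_t, components):
    """Decision rule of the form (7.6): for each class k, mix the
    position-conditioned densities p^t_{X|k}(x) with weights p^k(t) and
    pick the class with the larger mixture value. p_t[k] is a list of
    weights (25 of them in the text's setting), components[k] a list of
    callables of the same length."""
    def mixture(k):
        return sum(w * f(x) for w, f in zip(p_t[k], components[k]))
    return 'A' if mixture('A') >= mixture('B') else 'B'
```

The weights p^k(t) encode how accurately the position-finding algorithm works, which is exactly why they must be known before (7.6) can be applied.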
In applying the formulae (7.5) and (7.6), you must not assume that the
probabilities of each displacement t are the same, because your way of finding
the position of a character is not altogether bad. Quite often, the position of
the character is found correctly, and the error in displacement by two
pixels does not occur as often as that by one pixel. The recognition algorithm
has to take into account these properties of your algorithm which searches for
the position of a character, and therefore 50 numbers p^k(t), t = 1, 2, ..., 25,
k = A, B, must be known to apply the formula (7.6). This is, however, a
far more specific task than the task of searching for the recognition algorithm,
which is not always easy to survey. We draw your attention to the
fact that we deal only with assessing the algorithm used to find the
true position of the character, not with improving it. The algorithm for
locating a position need not be exact, but you must know precisely the measure
of its inaccuracy.
You must not be angry with us, but we misled you into the moorland on purpose.
We wanted you to notice by yourself that the task waiting to be solved is an
unsupervised learning task. Moreover, it is exactly that concrete case you so
thoroughly dealt with in the discussion after Lecture 6. We did not bring it to
your attention that it was the unsupervised learning task which was waiting for
you. It was just for you to see by yourself that unsupervised learning was not a
mere intellectual game solved with one's head in the clouds. It is a procedure
for solving quite earthly statistical tasks which occur at the first steps of the
statistical examination of a complex object, such as an image, if it is examined
quite seriously, of course. Well, do not be afraid of getting into the marshland
because you know how to get out of it.
To estimate the parameters of the statistical model (7.5), apply the algorithm
which you so brilliantly found in the discussion after Lecture 6. You even
wondered at its simplicity. We are already eager to know what it will result in.
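A minimal sketch of such an unsupervised learning procedure, written as EM for a mixture of product-Bernoulli components (the array shapes, fixed iteration count, and clipping constant are our assumptions, not the lecture's algorithm verbatim):

```python
import numpy as np

def em_bernoulli_mixture(X, T, n_iter=50, seed=0):
    """EM estimation of mixture weights p(t) and per-component pixel
    probabilities mu[t, d] for binary images X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    p = np.full(T, 1.0 / T)
    mu = rng.uniform(0.25, 0.75, size=(T, D))
    for _ in range(n_iter):
        # E-step: posterior responsibility of component t for each image
        log_lik = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T   # (N, T)
        log_post = np.log(p) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and pixel probabilities
        Nt = post.sum(axis=0) + 1e-12
        p = Nt / N
        mu = np.clip((post.T @ X) / Nt[:, None], 1e-3, 1 - 1e-3)
    return p, mu
```

Run once per class on the images of that class: the returned p plays the role of p^k(t) and the rows of mu describe the components p^t_{X|k} under the independent-pixel assumption.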
I declare victory. Though, I am afraid of speaking too soon. The results of the
recognition algorithm into which I included the parameters calculated by means
of unsupervised learning appeared to be very satisfactory. I could already
regard them as practically applicable. During the relatively long experiments
I noticed not a single (!!!) error.
We do not intend to belittle your results. But do not forget that the capacity of
the set of strategies of the form (7.6) is quite large in your case. You have
to take into consideration all the outcomes of the statistical theory of learning that
we discussed with you in Lecture 4. They claim that the results of recognition
obtained during learning can in some cases differ rather substantially from the
results you will observe when applying the fixed algorithm.
Do you not, possibly, think that I have forgotten about the results in Lecture 4?
It is clear that I checked my algorithm on data other than those I used for
estimating the parameters of the model (7.5).
In advance, and with some caution for the time being, we can already con-
gratulate you. Note that now you have not only the algorithm for recognising
characters, but also their statistical model. It is by far more than a mere
algorithm since you can continue examining the statistical model for further
purposes, and not only for constructing the recognition algorithm.
I have already started doing so. At first I was interested in what the partial
probability distributions p^t_{X|k} looked like. I expected that they would look as
if the algorithm for unsupervised learning itself had come to the conclusion that
the image is affected by such a parameter as the position of a character. This
would be so if the functions p^t_{X|k} at different parameters t mutually differed
by a displacement. But I did not observe anything like that. I was not able
to interpret in a comprehensible manner the group of functions p^t_{X|k}, k = A, B,
t = 1, 2, ..., 25, by which the algorithm approximated the set of observed images
of one class.
Well, as you can see, it was the position of characters that was the influence
which first occurred to us when we wanted to explain why the brightness pa-
rameters in different pixels were mutually dependent. Actually there are even
more parameters of this kind. For example, we can consider yet another pa-
rameter which is the thickness of character-lines. If we examined the characters
even further, we would come across other parameters as well which are hidden
at first glance. The unsupervised learning algorithm devoted more time to this
examining than we did, and therefore, within the scope of potentialities pre-
sented to it (which were limited by the number 25), it found an appropriate
approximation for all parameters which cause the dependence between pixels.
I dare assume that even the statistical model on which recognition is based
corresponds to reality at the first approximation. I tried to examine this model
even further. I recognised images that were not scanned as such, but which
were generated by means of that particular statistical model. For some pur-
poses, these experiments are preferred to the experiments with real images. If
I generate an artificial image then I know everything about it. Not only do
I know what character is displayed, but I also know at what parameter value
t the character was created. I cannot obtain such information in any way by
real experiments because I even do not know the meaning of that parameter.
Naturally I even cannot determine its real value.
When I had completed the experiments with artificial images I found some-
thing which arouses some fear in me. I will try to explain this. I found that the
probability of the wrong decision substantially depends on the parameter t. At
some values oft, and those are the more probable, the recognition is of a far
better quality than at those which are less probable. It does not surprise me
much because the strategy created is aimed at minimising the average proba-
bility of the wrong decision. Primarily, the strategy tries to achieve a correct
recognition of images that were generated for values of the parameter t with
greater probability.
Though all these considerations are reasonable, I would not like to follow
them hastily. After all, I would not like to regard the parameter t as a random
one because I do not understand its meaning completely. I would like to create
another strategy which would secure, instead of the minimal average probability
of the wrong decision, an acceptable magnitude of this error at any value of the
parameter t. This means that I must formulate a non-Bayesian task for the
already created statistical model and solve it from the very beginning. Shall I
make such a decision? If I choose this path then it will mean that I will have
to throw the whole of my program away. I regret it. I have already put much
effort into it, and finally it is not so bad.
We start to like you again, because you have found what we wanted to call your
attention to. But you are not right when you fear that you have to throw all
your work into a waste-paper basket. If you go through Lecture 2 once more
then you will see that the task in the new non-Bayesian formulation is solved
by a strategy of the same form as the strategy (7.6) which you have already
programmed. You just need to include in it other values of the quantities pk(t)
than those you obtained by means of the unsupervised learning algorithm. You
must calculate them in a different way. We believe that you will come across
it without our help.
But this means I have programmed the unsupervised learning algorithm in vain
when I am expected to throw its outcome away.
By no means! The unsupervised learning has provided you not only with the a
priori probabilities of the values of the parameter t, but in addition with the conditional
probabilities p^t_{X|k}, which you will make use of now.
We will answer one more question even though you have not yet asked it. We
will not preclude that you will manage to design an algorithm which will secure
quite good results of recognition at any value of the parameter t. You can try to
simplify the recognition strategy itself, and instead of the strategy of the form
(7.6), you can use, say, a linear discriminant function. This can be created as a
solution of a generalised Anderson problem. You have the mathematical model
of your images at your disposal, and you can apply it for different purposes.
So, you can see that even in your rather simplified task nearly all tasks and
their solutions were used that were referred to at a theoretical level in our
lectures.
O.K. But in spite of that, the identification of abstract concepts with a concrete
application task seems to me to be rather painful and not straightforward.
That is not so. When one has learnt one's application thoroughly and is quite at
home in the theory then the connection is clear and it should immediately catch
one's eye. We are, therefore, greatly surprised that you needed our explanation
concerning the connection of the subject matter from our lecture with your task.
We noticed that you are quite at home in the theory. The only explanation
may be that you did not carefully examine the images which you were trying
to recognise. You may have believed that knowledge of the theory could make
up for the lack of knowledge about the application task. Well, you were wrong.
Once more we will remind you that the theory of pattern recognition is no
magical means for a lazybones who relies on the 'Magic Table' method!
But if one knows one's application rather thoroughly, he or she may solve it
even without the theory. Am I not right?
That is true. The only question is how much time it would take him or her.
Assume that in an application an area below a graph of a polynomial function of
one variable in the interval from 0 to 1 is to be calculated. Your question could
be transposed to this example in the following way. Can anybody solve this
task without knowing the integral calculus? It is, of course, possible, because
nothing stands in the way of creating the concept of the derivative, finding
formulae for differentiating the polynomial function, creating the concept of the
integral, and proving the Newton-Leibniz theorem. But someone who had known
these concepts and the relations between them beforehand would have solved
the task faster.
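The transposed example is easily made concrete (a trivial sketch using the antiderivative):

```python
def poly_area_01(coeffs):
    """Area under c[0] + c[1]*x + ... + c[n]*x**n on [0, 1], obtained
    from the antiderivative (the Newton-Leibniz theorem): sum c[k]/(k+1)."""
    return sum(c / (k + 1) for k, c in enumerate(coeffs))

assert poly_area_01([0, 2]) == 1.0   # area under 2x on [0, 1]
```

One line, once the calculus behind it is known; rediscovering that calculus from scratch is what takes the time.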
The task we have just analysed can be regarded as an application task only
with a large amount of politeness. We have only analysed a case in which the
recognised image consists of one single character. You certainly admit that it
is child's play when compared to a real application task.
We have solved the most difficult part of it, I think. To solve an interesting
practical task, I have nothing more to do than divide a large image into smaller
rectangles circumscribing every character. Then, it could be possible to solve
a task of real application significance, could it not?
No, not as a whole, but quite a large part of it. Do not think, however, that
you will manage to break up a text into individual characters by means of
some simple aids. We again remind you that you should look carefully at the
texts you intend to recognise. Even a cursory glance at real images
is an effective means for sobering up. In a real image you can notice that
neighbouring characters in one line touch one another now and then, and so
the space between them is lost; due to noise, individual characters come apart
so that the set of black pixels stops being connected; rectangles circumscribing
the characters differ in heights and widths; the neighbouring position of two
characters in a line forms a configuration which consists of two halves of two
different characters, and this configuration cannot be distinguished from some
other character which actually does not occur in that place (such as the pairs
oo ~ x, xx ~ o, cl ~ d, ic ~ k, lc ~ k); rectangles that circumscribe individual
characters are not disjoint, etc.
When you get a notion of all these tiny and not so tiny treacheries of an
actual image, you will come to the conclusion that recognition of character
lines cannot be so easily reduced to recognition of individual characters of
which the line consists. Then you will deduce the correct conclusion which is
that you have to formulate your task as recognition of a whole line at once.
This may drive you mad since the number of possible sequences in such a line
is astronomically large. That will be the right moment for you to read the
following two lectures and see that there already exist methods which are quite
well elaborated for recognising a whole sequence at once. On the basis of these
methods, you would be ready to create an algorithm for recognising images
with texts that could be respected even in the world of applications, but only
on the assumption that there are reasons for believing that you are capable of
dividing the image into rectangles circumscribing one and just one line. Only
in this case you will manage to reduce the recognition of a whole page to the
recognition of individual lines.
In cases in which the lines are closely adjacent to one another, you will prob-
ably not manage to break up the text by means of simple tricks into individual
lines. You will have to formulate the task as the recognition of a whole page at
once. In this case you will have to read further lectures of ours where algorithms
for recognising such two-dimensional structures are designed.
I am already looking forward to it. But perhaps one little remark. I do not
know how I should express my fear. I would much regret if the entire integral
calculus was applicable only for the calculation of the area below a graph of
polynomial functions. It might not be worth creating it only for the sake of
such a narrowly oriented task.
We understand your fear. The operating radius of structural methods is exten-
sive and covers much more than mere recognition of images with text. But do
not ask us to analyse another application area with you. It is now your work,
and we would not wish to do it for you.
February 1998.
Lecture 8: Recognition of Markovian sequences
The symbol x_1^n then means x, the symbol x_i^j represents the subsequence
(x_i, x_{i+1}, ..., x_j), the symbol k_0^n means k, and k_i^j represents
(k_i, k_{i+1}, ..., k_j).
We will speak of joint and conditional probabilities of the given parameters
and of different groups of these parameters. All the probabilities will be denoted
by a single symbol p. For example, the notation p(x_i, k_i, k_{i-1}) will be used for
the joint probability that the i-th observable parameter has assumed the value
x_i, the (i-1)-th hidden parameter has assumed the value k_{i-1}, and the i-th
hidden parameter has assumed the value k_i. Along with it, the same symbol p
will also be used for the conditional probability p(x_i, k_i | k_{i-1}) of the event that
the i-th observable parameter has assumed the value x_i and the i-th hidden
parameter has assumed the value k_i under the condition that the (i-1)-th
hidden parameter assumed the value k_{i-1}. So the same symbol p denotes two
different functions in the two expressions p(x_i, k_i, k_{i-1}) and p(x_i, k_i | k_{i-1}),
which is not quite correct. Nevertheless, we will allow ourselves this incorrectness
to simplify the expressions. The inaccuracy should not cause misunderstanding, since in
this lecture we will never use the identifier p without subsequently writing parentheses
containing the parameters, and the parameters unambiguously determine
which function is referred to. In cases in which the incorrectness could lead
to ambiguous understanding, we will distinguish the probabilities p
by means of indices, which will inevitably make the expressions somewhat clumsy.
The statistical model is determined by a function X^n × K^{n+1} → ℝ which
for each sequence x and each sequence k expresses the probability p(x, k). With
this probability we will assume that for each i = 1, 2, ..., n-1, for each sequence
k = (k_0^{i-1}, k_i, k_{i+1}^n), and for each sequence x = (x_1^i, x_{i+1}^n) the following holds

p(x, k) = p(k_i) p(x_1^i, k_0^{i-1} | k_i) p(x_{i+1}^n, k_{i+1}^n | k_i) .   (8.1)

This follows from the assumption that the probability p(k) can be expressed in
the form

p(k) = p(k_i) p(k_0^{i-1} | k_i) p(k_{i+1}^n | k_i) .   (8.2)

The expression is valid for each i = 1, 2, ..., n-1 and each sequence k ∈ K^{n+1},
where k = (k_0^{i-1}, k_i, k_{i+1}^n). Equation (8.2) is obtained by summing Equation
(8.1) over all sequences x.
A random sequence k the probability distribution p(k) of which satisfies the
condition (8.2) is referred to as a Markovian sequence, or a Markovian chain.
In this lecture we will be exclusively concerned with cases in which the random
sequence is of the Markovian type.
If the summation in Equation (8.1) is performed over all sequences k_{i+2}^n
and then over all sequences x_{i+2}^n, we will obtain

p(x_1^{i+1}, k_0^{i+1}) = p(k_i) p(x_1^i, k_0^{i-1} | k_i) p(x_{i+1}, k_{i+1} | k_i) .
We can see that owing to the Markovian assumption (8.1), the definition and
calculation of a complex function depending on the 2n+1 variables k_i, i = 0, ..., n,
and x_i, i = 1, ..., n, is simplified to the definition of n functions p(x_i, k_i | k_{i-1})
of three variables, and one function p(k_0) of a single variable. The assumption
(8.1) therefore specifies a very narrow but important class of statistical models
which we are going to examine. We consider it useful to understand this specification
in an informal manner as well. We will present some considerations supporting
the informal understanding of the further ideas.
The property (8.1) can be, for example, understood in the following way.
Let, in the universal population of pairs (x, k) = (x_1, ..., x_n, k_0, k_1, ..., k_n),
the probability distribution p(x, k) have the Markovian property (8.1). We will
specify an arbitrary number i, 0 < i < n, and an arbitrary value a of the
hidden parameter k_i. Let us fix the selected values i, a and take from the
universal population all pairs (x, k) in which k_i = a. Being Markovian then
means that the group of parameters (x_1, x_2, ..., x_i), (k_0, k_1, ..., k_{i-1}) in the
chosen ensemble is statistically independent of the group of parameters (x_{i+1},
x_{i+2}, ..., x_n), (k_{i+1}, k_{i+2}, ..., k_n).
This correct interpretation is often expressed in a vulgarised form, i.e., a
Markovian sequence is such a sequence in which the future does not depend
on the past, but only on the present. The vulgarised form is treacherous, since
while being incorrect it is very similar to the correct one.
The following mechanical model of a Markovian sequence provides a good
intuitive idea. Let the sequences (k_0, k_1, k_2, k_3, k_4) and (x_1, x_2, x_3, x_4)
be represented by the positions of points in a plane, Fig. 8.1.

Figure 8.1 A mechanical model of a Markovian sequence.

Assume that some pairs of points are connected by a spring, indicated by a line
segment between the points. Assume that one of the points, say the point
x_3, starts to oscillate for some random reason. By virtue of the mechanical links,
all (!) the other points of the mechanical system start to oscillate as well, not
only the points k_2 and k_3 which are connected to the point x_3 by a spring.
In this system each point depends on every other point, and the system as
a whole does not break up into independent components. Furthermore, if the
positions of the points x_1, x_2, x_3, x_4 have been fixed, then the positions of the points
k_0, k_1, k_2, k_3, k_4 are determined as well. The position of each point k_i is
affected not only by the positions of the points x_i and x_{i+1} to which the
point k_i is immediately connected, but also by the positions of all the points x_1,
x_2, x_3, x_4.
But if we now imagine that a point, say the point k_3, is fixed immobile in
the plane, then the whole mechanical system breaks up into two independent
parts. One consists of the points k_0, k_1, k_2, x_1, x_2, x_3, and the other of the
points x_4 and k_4. Now the oscillation of a point, say the point x_4, does not in
any way affect the positions of the points k_1, x_2, or of the other points to the left of
the point k_3.
The recursive calculation of the probability p(x) can be written as

f_n(k_{n-1}) = Σ_{k_n ∈ K} p(x_n, k_n | k_{n-1}) ,
f_i(k_{i-1}) = Σ_{k_i ∈ K} p(x_i, k_i | k_{i-1}) f_{i+1}(k_i) ,   i = n-1, n-2, ..., 1,   (8.6)
p(x) = Σ_{k_0 ∈ K} p(k_0) f_1(k_0) .
It can be seen that the number of operations for calculating p(x) is of the
order |K|^2 n. First, the numbers f_n(k_{n-1}) are calculated according to the
first row in (8.6), then gradually the numbers f_{n-1}(k_{n-2}), ..., f_i(k_{i-1}),
..., f_1(k_0) according to the second row in (8.6), and finally the number p(x)
according to the third row. In this way the task of recognising the stochastic
automaton has been solved. According to the procedure (8.6), the probabilities
p_a(x) and p_b(x) for the automata a and b are calculated, and then with
respect to the ratio p_a(x)/p_b(x) a decision is made in favour of one of the
automata, a or b. As a rule the decision is made by comparing the likelihood
ratio with a certain threshold value, though in some tasks the decision making
strategy may be more sophisticated.
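For illustration, the recursion (8.6) can be transcribed into a short program. This is a minimal sketch, not the authors' implementation; the representation of the automaton as an initial distribution `p0` and a function `p_trans(x, k, k_prev)` returning p(x_i, k_i | k_{i-1}), as well as the example numbers, are our own assumptions.

```python
def sequence_probability(x, p0, p_trans):
    """p(x) by the backward recursion (8.6):
    f_i(k') = sum over k of p(x_i, k | k') * f_{i+1}(k), with f_{n+1} = 1."""
    states = list(p0)
    f = {k: 1.0 for k in states}                      # f_{n+1}(k) = 1
    for xi in reversed(x):                            # second row of (8.6)
        f = {kp: sum(p_trans(xi, k, kp) * f[k] for k in states)
             for kp in states}
    return sum(p0[k0] * f[k0] for k0 in states)       # third row of (8.6)


# A tiny two-state automaton; the numbers are chosen so that the
# probabilities p(x, k | k') sum to 1 for each conditioning state k'.
T = {('a', 0, 0): 0.4, ('a', 1, 0): 0.2, ('b', 0, 0): 0.1, ('b', 1, 0): 0.3,
     ('a', 0, 1): 0.1, ('a', 1, 1): 0.1, ('b', 0, 1): 0.3, ('b', 1, 1): 0.5}
p0 = {0: 0.5, 1: 0.5}
p_trans = lambda x, k, kp: T[(x, k, kp)]
```

Each of the n steps combines |K| values for each of the |K| states, which is exactly the |K|^2 n operation count stated above.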
f_n = P_n f ,
f_i = P_i f_{i+1} ,   i = n-1, n-2, ..., 2, 1,
p(x) = φ f_1 ,

or, after excluding the auxiliary vectors f_1, f_2, ..., f_n, in the form

p(x) = φ P_1 P_2 ··· P_{n-1} P_n f .   (8.7)

The notation (8.7) can be made even more concise,

p(x) = φ ( ∏_{i=1}^{n} P_i ) f ,   (8.8)
and the numbers p(x_n, k_n | k_{n-1}) are the known probabilities that represent the
stochastic automaton. For the probability p(x_{n-1}, x_n | k_{n-2}) in the general case
the following equation holds.
Based on the Markovian property (8.1) (with the intuitive support of the mechanical
model of being Markovian in Fig. 8.1), we have
that the probabilities p(x_i^n | k_{i-1}), which we should like to calculate for any i,
can be calculated, at least for i = n and i = n-1, by means of the sums
(8.11) and (8.13). Now we will show how to calculate these probabilities for
i-1, assuming that the probabilities p(x_i^n | k_{i-1}) are already calculated for the
value i.
For the probability p(x_{i-1}^n | k_{i-2}) there holds in the general case

p(x_{i-1}^n | k_{i-2}) = Σ_{k_{i-1} ∈ K} p(x_{i-1}, k_{i-1} | k_{i-2}) p(x_i^n | k_{i-2}, k_{i-1}, x_{i-1}) .   (8.14)

Owing to the Markovian property, the last factor in (8.14) can be changed into
p(x_i^n | k_{i-1}), and the expression (8.14) will then be changed to

p(x_{i-1}^n | k_{i-2}) = Σ_{k_{i-1} ∈ K} p(x_{i-1}, k_{i-1} | k_{i-2}) p(x_i^n | k_{i-1}) .   (8.15)

By means of the formula (8.11) and the repeatedly applied formula (8.15) we
can calculate the probability p(x_1^n | k_0) for each state k_0 and then calculate the
probability p(x) sought according to the relation
means that its coordinates are the |K| numbers p(x_i, x_{i+1}, ..., x_n | k_{i-1}), k_{i-1} ∈ K,
which are the probabilities that the automaton will generate the sequence of
symbols x_i, x_{i+1}, ..., x_n under the condition that the generation started in the
state k_{i-1}.
Let us now look at the statistical considerations which lead to the calculation
of the probability p(x) according to the procedure (8.10). They are nearly
the same as the ideas mentioned above. We will state now how the joint probability
p(x_1^i, k_i) would be calculated for the event that the automaton generates
the sequence x_1, x_2, ..., x_i, and after the end of the generation the automaton
transits into the state k_i. For i = 1 the probability is obviously

p(x_1, k_1) = Σ_{k_0 ∈ K} p(k_0) p(x_1, k_1 | k_0) .   (8.17)

Assume we have already calculated the probabilities p(x_1^{i-1}, k_{i-1}) for some i,
and with respect to them we would like to calculate the probabilities p(x_1^i, k_i).
For the probability p(x_1^i, k_i) in the general case there holds

p(x_1^i, k_i) = Σ_{k_{i-1} ∈ K} p(x_1^{i-1}, k_{i-1}) p(x_i, k_i | x_1^{i-1}, k_{i-1}) .   (8.18)

Based on the Markovian property, p(x_i, k_i | x_1^{i-1}, k_{i-1}) = p(x_i, k_i | k_{i-1}).
If we include the previous expression into the sum (8.18), then we obtain the
following recursive expression for the calculation of p(x_1^i, k_i),

p(x_1^i, k_i) = Σ_{k_{i-1} ∈ K} p(x_1^{i-1}, k_{i-1}) p(x_i, k_i | k_{i-1}) .   (8.19)

If we have calculated according to the formulæ (8.17) and (8.19) the probabilities
p(x_1^n, k_n), then we can calculate the probability p(x) being sought according
to the formula

p(x) = Σ_{k_n ∈ K} p(x_1^n, k_n) .   (8.20)
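The forward recursion (8.17), (8.19), (8.20) can likewise be sketched in code; the representation of the model (an initial distribution `p0` and a function `p_trans(x, k, k_prev)` for p(x_i, k_i | k_{i-1})) is our own assumption, not the book's notation.

```python
def sequence_probability_forward(x, p0, p_trans):
    """p(x) by the forward recursion: alpha_i(k) = p(x_1, ..., x_i, k_i = k)."""
    states = list(p0)
    # (8.17): alpha_1(k1) = sum over k0 of p(k0) p(x_1, k1 | k0)
    alpha = {k: sum(p0[kp] * p_trans(x[0], k, kp) for kp in states)
             for k in states}
    # (8.19): alpha_i(k) = sum over k' of alpha_{i-1}(k') p(x_i, k | k')
    for xi in x[1:]:
        alpha = {k: sum(alpha[kp] * p_trans(xi, k, kp) for kp in states)
                 for k in states}
    # (8.20): p(x) = sum over k_n of p(x, k_n)
    return sum(alpha.values())
```

For any model given in this representation, the result coincides with the number p(x) obtained by the backward procedure of the preceding subsection.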
When the features x_i are measured one after another, and not in the order of
the index i, the following situation can occur, which requires a different ordering
of the calculation than the two procedures already mentioned.

Assume that the object is described by twenty features x_1, x_2, ..., x_20 and
twenty-one hidden parameters k_0, k_1, ..., k_20. Let us also assume that the
features (x_5, x_6, ..., x_10) and (x_12, x_13, ..., x_17) were known at some moment.
Waiting for the results of the measurement of the rest of the features takes
considerable time. However, when the remaining features become known, the
object must be recognised as fast as possible. In such situations a purely
technical question arises: how should the features already known be processed
before the rest are measured, so that the computation time is not wasted in waiting?
The matrix representation (8.7) of the probability p(x) provides a clear answer to
this question. The expression (8.7) is equivalent to

p(x) = φ ( ∏_{i=1}^{4} P_i ) P* P_11 P** ( ∏_{i=18}^{20} P_i ) f ,   (8.21)

where

P* = ∏_{i=5}^{10} P_i ,   (8.22)

P** = ∏_{i=12}^{17} P_i .   (8.23)

From the previous relations it can be seen that with the known sequences
x_5^{10} and x_12^{17} the matrix products P* and P** can be calculated by means of the
formulæ (8.22) and (8.23). By the time the information about all the other features
is available, part of the probability p(x) of the formula (8.21) will already have been
calculated. The total number of operations in this case will be greater compared with the
calculation according to the formula (8.9) or (8.10), but on the other hand the
number of operations needed for completing the calculation (8.21) with the matrices P* and
P** already known will decrease.
Now let us imagine that in the given example no information about the other
features is provided, and it is necessary to recognise the object with
respect to the already known features only. In this case the probability p(x_5^{10}, x_12^{17})
should be calculated. We will briefly show how this probability has to be
calculated. Let I be the set {1, 2, ..., n} through which the index i ranges in the notation
x_i, and let I' be the subset of indices such that for each i ∈ I' the value x_i is known. The
ensemble of known values will be denoted (x_i, i ∈ I') and the ensemble of
not yet known values will be denoted (x_i, i ∉ I'). Let k be a sequence
k_0, k_1, ..., k_n. The probability of the ensemble (x_i, i ∈ I') is then

p((x_i, i ∈ I')) = φ ( ∏_{i=1}^{n} P_i' ) f ,

where P_i' is the matrix with the entries p(x_i, k_i | k_{i-1}) if i ∈ I', and
Σ_{x ∈ X} p(x, k_i | k_{i-1}) if i ∉ I'.
When compared with the previous matrix products, the matrix P_i' differs in
that it depends on whether the value of the feature x_i is known or not.
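The construction just described, where the matrix for position i depends on whether x_i is known, can be sketched as follows. The representation of the model (initial distribution `p0`, transition probabilities `p_trans` over a finite alphabet) and all names are our assumptions for illustration.

```python
def known_features_probability(known, n, p0, p_trans, states, alphabet):
    """Probability of the ensemble of known features (x_i, i in I'),
    computed backwards; unknown features are marginalised out."""
    f = {k: 1.0 for k in states}
    for i in range(n, 0, -1):
        if i in known:       # i in I': use the entries p(x_i, k | k')
            f = {kp: sum(p_trans(known[i], k, kp) * f[k] for k in states)
                 for kp in states}
        else:                # i not in I': sum the entries over all x in X
            f = {kp: sum(p_trans(x, k, kp) * f[k]
                         for x in alphabet for k in states)
                 for kp in states}
    return sum(p0[k0] * f[k0] for k0 in states)
```

With no feature known the routine returns 1, and with every feature known it returns the full joint probability p(x), as one would expect of a marginal.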
Let us now notice the great diversity of object recognition tasks which occur
within the framework of the Markovian model. Usually, this class of recognition
problems is closely connected with optimisation methods based on dynamic
programming. However, in the quite meaningful recognition problems considered so
far, dynamic programming has not yet appeared. In this respect we would
like to cast doubt on the naive, but well rooted, view which regards dynamic
programming as a universal key opening every door.
Later we would like to draw attention to the importance of representing
recognition tasks concerning the Markovian-describable objects by matrix prod-
ucts. This representation keeps all modifications of the task together and does
not allow them to be broken into isolated and mutually separated problems. It
is not surprising since the stochastic matrix is one of the basic concepts in the
general theory of Markovian processes. It is rather strange that the representa-
tion through the matrix product has not become thoroughly settled in pattern
recognition tasks. Later we will see that matrix products appear even in some
well known tasks where hardly anybody would expect them.
Figure 8.2 The optimisation task concerning the transition through the states is represented
as seeking a path in an oriented graph. Edges of the graph are oriented from left to right.
By φ we will denote the function of the form K → ℝ whose values φ(k_0), k_0 ∈ K,
are the numbers -log p(k_0). Thus, the optimisation task (8.25) can be
written in the form

min_{k ∈ K^{n+1}} ( φ(k_0) + Σ_{i=1}^{n} q_i(k_{i-1}, k_i) ) .   (8.26)
The graph created in this way defines a set of paths from the vertex α to the
vertex β, and each path acquires a length given by the sum of the lengths of the edges
of which the path consists. To each path from α to β in the graph there corresponds a
sequence of states, given by the labels of the vertices through which
the path goes. Conversely, to each sequence k_0, k_1, ..., k_n there corresponds a path in the
graph which passes through the vertices

α, (k_0, 0), (k_1, 1), (k_2, 2), ..., (k_n, n), β .

The value φ(k_0) + Σ_{i=1}^{n} q_i(k_{i-1}, k_i) is the length of the path corresponding to
the sequence k_0, k_1, ..., k_n. Thus, the optimisation task (8.25) is reduced to
seeking the shortest path between a pair of vertices in the graph. This task
has been successfully solved by algorithms based on Bellman's dynamic
programming. In the coming subsection we will present, for completeness, the
algorithm finding the shortest path by means of dynamic programming.
necessary for the transfer of the message from the vertex (u', i-1) to the vertex
(u, i). The shortest time f_i(u) in which the message can be delivered to the
vertex (u, i) is

f_i(u) = min_{u' ∈ K} ( f_{i-1}(u') + q_i(u', u) ) .   (8.28)

At the same time it is recorded from which vertex (u', i-1) the message
was delivered to the vertex (u, i) in the fastest possible way. This preceding
vertex will be denoted by the symbol ind_i(u),

ind_i(u) = argmin_{u' ∈ K} ( f_{i-1}(u') + q_i(u', u) ) .   (8.29)

The variable ind_i(u) states that the shortest path from the vertex α to the
vertex (u, i) passes through the vertex (ind_i(u), i-1). Notice that there can
be more than one such possibility; one of them can be selected arbitrarily for
simplicity.
The moment at which the message is delivered to the target vertex β, i.e.,
the length of the shortest path from α to β, is given by the value min_{u ∈ K} f_n(u).
The vertex in the group (u, n) from which the message was first delivered to
the end vertex is

k_n = argmin_{u ∈ K} f_n(u) .   (8.30)
The formulæ (8.27), (8.28), (8.29), and (8.30) are the core of the algorithm
seeking the shortest path from the vertex α to the vertex β. Let us quote them
together:

f_0(u) = φ(u) ,   u ∈ K ;   (8.31)

f_i(u) = min_{u' ∈ K} ( f_{i-1}(u') + q_i(u', u) ) ,   i = 1, 2, ..., n ,   u ∈ K ;   (8.32)

ind_i(u) = argmin_{u' ∈ K} ( f_{i-1}(u') + q_i(u', u) ) ,   i = 1, 2, ..., n ,   u ∈ K ;   (8.33)

k_n = argmin_{u ∈ K} f_n(u) .   (8.34)
At first, according to the formula (8.31), the distances from the vertex α to the
vertices of the group (u, 0) are calculated. This is not a matter of calculating,
but merely of transcribing the values of the function φ(u) from one memory cell to
another. Then gradually, by means of the formula (8.32), the distances from
the vertex α to the vertices of the ensemble (u, 1), u ∈ K, are calculated, then
those of the ensemble (u, 2), u ∈ K, and so on, until the distances for the
vertices of the group (u, n), u ∈ K, are calculated. Along with the
distance f_i(u), the value ind_i(u) for each vertex is calculated according
to the formula (8.33). This value determines the label of the vertex (ind_i(u), i-1)
which immediately precedes the vertex (u, i) along the shortest path from the
vertex α to the vertex (u, i).
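The formulae (8.31)-(8.34), together with the backward trace along the arrows ind_i, can be transcribed directly into code. This is a sketch only; the graph representation (edge lengths stored as dictionaries keyed by vertex pairs) and the example numbers are ours.

```python
def shortest_path(phi, q, states):
    """Formulae (8.31)-(8.34): f_0 = phi, then minimisation forward,
    storing ind_i(u) for tracing the shortest path back."""
    f = dict(phi)                                                    # (8.31)
    ind = []
    for qi in q:                              # qi holds the lengths q_i(u', u)
        f_new, ind_i = {}, {}
        for u in states:
            best = min(states, key=lambda up: f[up] + qi[(up, u)])   # (8.33)
            f_new[u] = f[best] + qi[(best, u)]                       # (8.32)
            ind_i[u] = best
        f = f_new
        ind.append(ind_i)
    k_n = min(states, key=lambda u: f[u])                            # (8.34)
    path = [k_n]
    for ind_i in reversed(ind):               # follow the arrows back to the start
        path.append(ind_i[path[-1]])
    return f[k_n], path[::-1]


# A two-state example with n = 2 (all numbers are illustrative).
phi = {'A': 0, 'B': 1}
q1 = {('A', 'A'): 1, ('B', 'A'): 2, ('A', 'B'): 5, ('B', 'B'): 0}
q2 = {('A', 'A'): 0, ('B', 'A'): 0, ('A', 'B'): 3, ('B', 'B'): 3}
```

When several predecessors attain the minimum, `min` keeps the first one found, which matches the remark that one of the arrows can be selected arbitrarily.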
Example 8.4 Seeking the shortest path in a graph with three possible states.
Fig. 8.3 depicts the same situation as in Example 8.3, in which the set of
states consists of three states A, B, C. The index i assumes the values 0, 1, 2, 3.
The numbers labelling the edges of the graph give their lengths, i.e., the
quantities q_i(σ', σ). Vertices are denoted by circles. The numbers
labelling the vertices represent the quantities f_i(σ). The values ind_i(σ) are depicted
as arrows from the vertex (σ, i) to the vertex (ind_i(σ), i-1). It is possible to
traverse from the target vertex β following the arrows, and the shortest path is thus
traced in the inverse direction, from the target vertex towards the start vertex.
Notice that in some cases there can be more than one arrow from a vertex. In
such a case just one of the arrows can be selected arbitrarily. Let the selected path
traverse through the vertices (C, 3), (A, 2), (A, 1), (A, 0); it is shown in bold in
Fig. 8.3. The sequence that minimises (8.26) and maximises (8.25) is AAAC.
In this particular case the optimal sequences can be AABC and AACC too.
d = min_{k ∈ K^{n+1}} ( φ(k_0) + Σ_{i=1}^{n} q_i(k_{i-1}, k_i) ) .   (8.35)
To avoid unnecessary complications which would obscure the main idea, we will
concern ourselves only with calculating the value of the minimum, and we will
not seek the sequence (k_0^n)* at which the minimum is achieved.
The optimisation task (8.35) has a form similar to the expression

p(x) = Σ_{k_0} Σ_{k_1} ··· Σ_{k_n} p(k_0) ∏_{i=1}^{n} p(x_i, k_i | k_{i-1}) ,   (8.36)

which we studied when solving the task of recognising the automaton. In both
expressions, (8.35) as well as (8.36), a number is calculated, namely the number
d in (8.35) and the number p(x) in (8.36). The number is calculated with
respect to a function of the form K → ℝ (in (8.35) it is φ(k_0) and in (8.36)
it is p(k_0), k_0 ∈ K) and n functions of the form K × K → ℝ (in (8.35) they are
the functions q_i, i = 1, 2, ..., n, and in (8.36) they are p(x_i, k_i | k_{i-1}), k_i ∈ K,
k_{i-1} ∈ K, i = 1, ..., n). The quantities (8.35) and (8.36) are calculated by
means of different, but similar, programs. The difference is only in that the
program calculating (8.35) is obtained from the program calculating (8.36) in such a
way that wherever a sum of two numbers occurs in the latter, the minimum
of the two numbers is taken in the former; moreover, the
multiplication of two numbers is replaced by the addition of
the same numbers.
In examining the procedure for calculating the number p(x), see (8.36),
we arrived at the conclusion that when the starting numbers p(k_0) and
p(x_i, k_i | k_{i-1}) are understood as the components of a row vector φ and of matrices
P_i, i = 1, ..., n, then the calculation according to the formula (8.36) is equivalent
to the calculation of the matrix product

p(x) = φ ( ∏_{i=1}^{n} P_i ) f .   (8.37)

This matrix product explicitly states the number being computed, and in this sense
the problem of creating an algorithm that computes the number is already solved
by it. The problem is obviously equivalent to the problem (8.36), only given
in a different form. The number to be computed is explicitly stated by the
expression (8.37), and in this sense the matrix expression (8.37) immediately
provides the algorithm for its calculation. So the transformation of the problem
from the form (8.36) to the form (8.37) is virtually the solution of the problem,
because expressing the task in the form (8.37) makes the task trivial.
The equivalence of the expression (8.36), which is the original formulation
of the task, and of the matrix product (8.37) is based on properties of the addition
and multiplication of real numbers. These properties are so obvious that they
usually go without saying: the associativity of both operations and the distributivity of
multiplication with respect to addition. For any three real numbers x, y and z
there hold

x + (y + z) = (x + y) + z ,
x (y z) = (x y) z ,   (8.38)
x (y + z) = x y + x z .

In other words, the set of real numbers with the operations of addition and
multiplication forms an algebraic structure known as a semi-ring. This structure
satisfies other requirements as well, but at the moment they are not important for us.
The essential observation is that addition and multiplication are not the only
pair of operations that satisfy the requirements (8.38). It is of key importance
for operating with sequences that the set of non-negative real numbers with the
operations min and + also forms a semi-ring, since there hold

min(x, min(y, z)) = min(min(x, y), z) ,
x + (y + z) = (x + y) + z ,
x + min(y, z) = min(x + y, x + z) .

In the semi-ring (min, +) the product q = q' ⊗ q'' of two matrices q' and q'' of the
dimension |K| × |K| is therefore defined as the matrix whose element in the k-th row
and k'-th column is

q(k, k') = min_{l ∈ K} ( q'(k, l) + q''(l, k') ) .   (8.39)
The product φ ⊗ q of the row vector φ of the dimension |K| and the matrix
q of the dimension |K| × |K| is the row vector φ' the k'-th coordinate of
which is

φ'(k') = min_{k ∈ K} ( φ(k) + q(k, k') ) .   (8.40)

And finally, the product q ⊗ f of a matrix q of the dimension |K| × |K|
and a |K|-dimensional column vector f is understood as the column vector f'
the k-th coordinate of which is

f'(k) = min_{k' ∈ K} ( q(k, k') + f(k') ) .   (8.41)

Written with generic operations ⊕ and ⊗, the three definitions become

q(k, k') = ⊕_{l ∈ K} ( q'(k, l) ⊗ q''(l, k') ) ,
φ'(k') = ⊕_{k ∈ K} ( φ(k) ⊗ q(k, k') ) ,   (8.42)
f'(k) = ⊕_{k' ∈ K} ( q(k, k') ⊗ f(k') ) ,
which altogether formally agrees with the conventional definition of the matrix
product. So far (⊕, ⊗) has been considered as the pair (+, product) built up from
addition and multiplication, where the formulæ (8.42) define the matrix products
in the usual sense, i.e., the matrix products in the semi-ring (+, product).
However, if (⊕, ⊗) is understood as the pair (min, +), the same formulæ correspond
to the matrix product in the sense we have just introduced, i.e., to matrix
products in the semi-ring (min, +).
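The point of the formulae (8.42) is that one and the same routine serves both semi-rings. A minimal sketch, with matrices stored as dictionaries keyed by (row, column); the state set and the example numbers are ours, not from the lecture.

```python
def mat_mul(q1, q2, states, oplus, otimes):
    """The first formula of (8.42): the element (k, k') of the product is
    oplus over l of q1(k, l) otimes q2(l, k'), for any semi-ring (oplus, otimes)."""
    return {(k, kp): oplus(otimes(q1[(k, l)], q2[(l, kp)]) for l in states)
            for k in states for kp in states}


S = [0, 1]                                             # example state set
a = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
b = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}
ordinary = mat_mul(a, b, S, sum, lambda u, v: u * v)   # semi-ring (+, x)
tropical = mat_mul(a, b, S, min, lambda u, v: u + v)   # semi-ring (min, +)
```

Passing `sum` and ordinary multiplication reproduces the usual matrix product; passing `min` and `+` reproduces the product in the semi-ring (min, +), without any other change to the routine.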
The original expression (8.35) defining the original optimisation task can be
expressed using the notation ⊕ and ⊗ as

d = ⊕_{k_0} ⊕_{k_1} ··· ⊕_{k_n} ( φ(k_0) ⊗ q_1(k_0, k_1) ⊗ ··· ⊗ q_n(k_{n-1}, k_n) ) ,   (8.43)

which is equivalent to the matrix product

d = φ ⊗ q_1 ⊗ q_2 ⊗ ··· ⊗ q_n ⊗ f ,   (8.44)

or

d = φ ⊗ ( ⊗_{i=1}^{l} q_i ) ⊗ ( ⊗_{i=l+1}^{k} q_i ) ⊗ ( ⊗_{i=k+1}^{n} q_i ) ⊗ f ,

and we can choose either of these orderings according to the purely technical
conditions we know from Section 8.3.
p(x, k) = p(k_0) ∏_{i=1}^{n} p(x_i, k_i | k_{i-1}) ,   (8.45)
This quantity depends only on the values (k_i, i ∈ I_k) which are to be determined.
It does not depend on the quantities (x_i, i ∉ I_x) and (k_i, i ∉ I_k), since
over them the addition is performed. It does not depend on the values
(x_i, i ∈ I_x) because they are the fixed results of the experiment, and so they are
constants within one task. The number (8.46), which depends on (k_i, i ∈ I_k),
will be denoted d((k_i, i ∈ I_k)). The summation with respect to the values
(x_i, i ∉ I_x) is performed in the following manner,

d((k_i, i ∈ I_k)) = Σ_{(k_i, i ∉ I_k)} Σ_{(x_i, i ∉ I_x)} p(k_0) ∏_{i=1}^{n} p(x_i, k_i | k_{i-1})
  = Σ_{(k_i, i ∉ I_k)} φ(k_0) ∏_{i=1}^{n} q_i(k_{i-1}, k_i) ,

where

q_i(k_{i-1}, k_i) = p(x_i, k_i | k_{i-1}) if i ∈ I_x , and q_i(k_{i-1}, k_i) = Σ_{x ∈ X} p(x, k_i | k_{i-1}) if i ∉ I_x ,   (8.47)

and the number φ(k_0) is the probability p(k_0). The objective is to find the
maximum value of d((k_i, i ∈ I_k)), i.e.,

d = max_{(k_i, i ∈ I_k)} Σ_{(k_i, i ∉ I_k)} φ(k_0) ∏_{i=1}^{n} q_i(k_{i-1}, k_i) ,   (8.48)
and the ensemble (k_i, i ∈ I_k)* at which the maximum value is attained, i.e.,

(k_i, i ∈ I_k)* = argmax_{(k_i, i ∈ I_k)} Σ_{(k_i, i ∉ I_k)} φ(k_0) ∏_{i=1}^{n} q_i(k_{i-1}, k_i) .   (8.49)
We will be concerned with the task (8.48) only; from its solution the
solution of the task (8.49) will become clear. As before, the symbol q_i will
denote the matrix of the dimension |K| × |K| in whose (k_{i-1})-th row
and (k_i)-th column we find the number q_i(k_{i-1}, k_i) calculated according to
(8.47). We will denote by φ the row vector composed of the coordinates
φ(k_0) = p(k_0), k_0 ∈ K. We will denote by ⊗_i the matrix multiplication in the
semi-ring (+, product) if i ∉ I_k, and in the semi-ring (max, product) if i ∈ I_k.
With this notation the number (8.48) is the matrix product

d = φ ⊗_1 q_1 ⊗_2 q_2 ⊗_3 ··· ⊗_n q_n f ,   (8.50)

where f is a |K|-dimensional column vector all coordinates of which are 1.
The expression (8.50) presents the two tasks studied in a unified way. It concerns
the recognition of a Markovian object as a whole, as well as the recognition
of the values of its hidden parameters, including different modifications of the task.
An important advantage of expressing the tasks in this way is not only that their
affinity is revealed, but also that the tasks themselves become easy,
since they are formulated as a matrix product which just has to be calculated.

In calculating matrix products of the form (8.50), it must be taken into
consideration that in the expression (8.50) the matrix products occur in different
semi-rings. This makes them different from the previous two tasks, in which
the multiplications within the product (8.7) were understood as being in the
semi-ring (+, product), and the multiplications within (8.44) were considered as
being in the semi-ring (min, +). In both cases, thanks to the associativity of
the matrix multiplication, the calculations according to the formulæ (8.7) or (8.44)
could be performed in an arbitrary order: from left to right, from right to left,
from the centre, etc. It is a different matter in the expression (8.50). There
the diversity of the potential orders of calculation is smaller, since the
products in the semi-ring (+, product) have priority over the multiplication
in the semi-ring (max, product), and therefore they have to be processed first.
This is due to the fact that the product A ⊗ (B C) is not the same as the product
(A ⊗ B) C, even though A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C and A (B C) = (A B) C. The
product A ⊗ (B C) means that the ordinary product B C in the semi-ring
(+, product) is calculated first, and only then the product in the semi-ring
(max, product) is applied.
Owing to the required order in calculating the product (8.50), the complexity
can increase compared with the complexity of calculating the expressions (8.7)
and (8.44), which is O(|K|^2 n). In the case in which the sequence of parameters
to be determined consists of a large number of mutually non-interconnected
segments, i.e., if the set I_k is strongly intermixed with the set {0, 1, 2, ..., n} \ I_k,
then the calculation of the product (8.50) has the complexity O(|K|^3 n). If,
however, a certain connected subsequence of hidden parameters, i.e.,
k_l^m, is to be determined, then the complexity of the calculation according to (8.50)
remains O(|K|^2 n), i.e., it does not increase with respect to the complexity
of the calculation according to the formulæ (8.7) and (8.44). The product (8.50)
assumes the form

d = φ ( ∏_{i=1}^{l-1} q_i ) ⊗ ( ⊗_{i=l}^{m} q_i ) ⊗ ( ∏_{i=m+1}^{n} q_i ) f

in this case, and this form is to be understood as a brief notation for the following
calculations.
1. The calculation of the row vector φ_1 = φ ∏_{i=1}^{l-1} q_i in the semi-ring (+, product).
k_2 \ k_1     A       B       C
A            0.30    0       0
B            0.20    0       0
C            0       0.25    0.25
If a sequence (k_1*, k_2*) has to be found such that the probability of the event
(k_1, k_2) ≠ (k_1*, k_2*) is minimal, which corresponds to the penalty function
(8.52), then the decision must be that the sequence (k_1*, k_2*) is (A, A). With
such a decision the probability of the inequality (k_1, k_2) ≠ (k_1*, k_2*) will be 0.7,
and with any other decision this probability will be greater. Let us see what
risk such a decision incurs with respect to the penalty function (8.51),
i.e., in other words, what the mathematical expectation of the number of incorrectly
recognised sequence elements amounts to. The actual sequence (k_1, k_2) can be
one of the four possibilities (A, A), (A, B), (B, C) and (C, C), the probabilities
of which are 0.3, 0.2, 0.25, 0.25, correspondingly. The number of incorrectly
recognised elements will be 0, 1, 2, 2, correspondingly, and the mathematical
expectation of this number will be equal to 1.2.

Now let us see what the mathematical expectation would be if the decision
were made that (k_1*, k_2*) = (A, C). Let us note that this sequence has a zero
a posteriori probability, but in spite of that, at the decision (k_1*, k_2*) = (A, C)
the mathematical expectation of incorrectly recognised elements will have the
value 1. The actual sequence can only be (A, A), (A, B), (B, C), or
(C, C), and with each of these sequences the number of incorrectly recognised
elements will be 1.

We can see that the solution of the Bayesian task with the penalty function (8.51)
is not even approximately identical to the solution with the penalty function
(8.52). Therefore, if the application requires a penalty function of the
form (8.51), then the algorithm seeking the most probable sequence of hidden
parameters cannot be used. It is suited to the other penalty function, of the form
(8.52).
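The arithmetic of this example can be re-checked mechanically. The helper names are ours; the table of joint probabilities p(k_1, k_2) is the one given above.

```python
# The joint probabilities p(k1, k2) from the table above.
p = {('A', 'A'): 0.30, ('A', 'B'): 0.20, ('B', 'C'): 0.25, ('C', 'C'): 0.25}

def expected_errors(decision):
    """Mathematical expectation of the number of incorrectly recognised
    elements, i.e., the risk under the penalty function (8.51)."""
    return sum(pr * sum(t != d for t, d in zip(truth, decision))
               for truth, pr in p.items())

def error_probability(decision):
    """Probability that the decided pair differs from the actual one,
    i.e., the risk under the penalty function (8.52)."""
    return sum(pr for truth, pr in p.items() if truth != decision)
```

The decision (A, A) gives error probability 0.7 and expected number of wrong elements 1.2, while the zero-probability decision (A, C) gives the smaller expectation 1, exactly as the example states.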
If the penalty function of the form (8.51) occurs then the Bayesian task has to
be solved from the very beginning, i.e., starting from the Bayesian formulation.
Let X and K be two finite sets, and let x and k be two sequences of lengths
n and n+1, respectively, composed of elements of X and K: $x =
(x_1, x_2, \ldots, x_n)$, $k = (k_0, k_1, \ldots, k_n)$. The pair (x, k) is random and assumes
values from the set $X^n \times K^{n+1}$, the probability of the pair (x, k) being
given by the expression
$$p(x, k) = p(k_0) \prod_{i=1}^{n} p(x_i, k_i \mid k_{i-1}) \,. \eqno(8.53)$$
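A model of the form (8.53) is straightforward to simulate. The sketch below is our own illustration; the two-state, two-observation model and all its numbers are invented for the purpose, and a pair (x, k) is drawn by first sampling $k_0$ from $p(k_0)$ and then, for each i, the pair $(x_i, k_i)$ from $p(x_i, k_i \mid k_{i-1})$:

```python
import random

K = ["s0", "s1"]                 # hidden states (illustrative)
p0 = {"s0": 0.6, "s1": 0.4}      # p(k_0)
# trans[kp][(x, k)] stands for p(x_i = x, k_i = k | k_{i-1} = kp);
# the numbers are made up and sum to one for each kp.
trans = {
    "s0": {("a", "s0"): 0.5, ("a", "s1"): 0.1, ("b", "s0"): 0.1, ("b", "s1"): 0.3},
    "s1": {("a", "s0"): 0.2, ("a", "s1"): 0.2, ("b", "s0"): 0.1, ("b", "s1"): 0.5},
}

def sample(n, rng=random):
    """Draw a pair (x, k), x of length n and k of length n + 1, following (8.53)."""
    states = list(p0)
    k = [rng.choices(states, weights=[p0[s] for s in states])[0]]
    x = []
    for _ in range(n):
        table = trans[k[-1]]
        pairs = list(table)
        xi, ki = rng.choices(pairs, weights=[table[pr] for pr in pairs])[0]
        x.append(xi)
        k.append(ki)
    return x, k

x, k = sample(5)
```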
$$(k_0^*, \ldots, k_n^*) = \operatorname*{argmin}_{(k_0', \ldots, k_n')} \sum_{k \in K^{n+1}} \Bigl( p(k_0) \prod_{i=1}^{n} p(x_i, k_i \mid k_{i-1}) \Bigr) \sum_{i=0}^{n} w(k_i, k_i') \,. \eqno(8.58)$$
An important feature of this task already results from the mere assumption
that the penalty function has the form (8.54), without taking into account the
Markovian property (8.53), and its concretisation (8.55) and (8.56). Let us
demonstrate this feature.
The risk $\sum_k p(x,k) \sum_{i=0}^{n} w(k_i, k_i^*)$, which the sought sequence $k_0^*, k_1^*, \ldots, k_n^*$
has to minimise, will be denoted by R. We can write
$$R = \sum_{k} p(x,k) \sum_{i=0}^{n} w(k_i, k_i^*) = \sum_{i=0}^{n} \sum_{k} p(x,k)\, w(k_i, k_i^*)$$
$$= \sum_{i=0}^{n} \sum_{k_i \in K} w(k_i, k_i^*) \sum_{(k_{i'},\ i' \neq i)} p(x, k_0, \ldots, k_i, \ldots, k_n)$$
$$= \sum_{i=0}^{n} \sum_{k_i \in K} w(k_i, k_i^*)\, p(x, k_i) \,.$$
We can see that the function of the n+1 variables $(k_0^*, k_1^*, \ldots, k_n^*)$ which is to
be minimised decomposes into a sum of n+1 functions, each of them depending
on only one variable. The optimisation task (8.57) thus breaks into n+1
independent optimisation tasks in a single variable,
$$k_i^* = \operatorname*{argmin}_{k' \in K} \sum_{k_i \in K} w(k_i, k')\, p(x, k_i) \,, \qquad i = 0, 1, \ldots, n \,. \eqno(8.59)$$
When the specific forms (8.55), (8.56) of the partial function w are taken into account,
we arrive at the conclusion that
The conclusion claims that even though the task was originally formulated as
an optimisation one (see (8.57) and (8.58)), its complexity is not caused by
optimisation at all, since it reduces to the trivial tasks (8.59) and (8.60). The
core of its complexity lies in the calculation of the $(n+1)\,|K|$ values $p(x, k_i)$,
$i = 0, 1, \ldots, n$, $k_i \in K$, according to the general formula
and the matrix product $\bigl( \prod_{j=i+1}^{n} P_j \bigr) f$ expresses the probabilities $p(x_{i+1}, \ldots,
x_n \mid k_i)$. The numbers $p(x, k_i)$ defined by the expression (8.61) can be calculated
using the formula,
the correctness of which results both from the expression in the relation (8.61) and
directly from the Markovian property of the model.
The complexity of calculating the values $p(x, k_i)$ for one particular i and for
all $k_i \in K$ is the complexity of calculating the matrix products
$\varphi \bigl( \prod_{j=1}^{i} P_j \bigr)$ and $\bigl( \prod_{j=i+1}^{n} P_j \bigr) f$, which is $O(|K|^2\, n)$. The complexity of computing
the numbers $p(x, k_i)$ for all $i = 0, 1, \ldots, n$ and for all $k_i \in K$ is by no means
$O(|K|^2\, n^2)$, but remains $O(|K|^2\, n)$. This is the same situation as in
the previous task of automaton recognition and in the task of the most likely
estimation of the sequence of hidden parameters.
We have analysed three recognition tasks which can be formulated within the
Markovian model of the recognised object. These are the task of recognising the
object as a whole, seeking the most probable sequence of hidden states of the
object, and seeking a sequence of the most probable hidden states of the object.
Even though we have analysed diverse varieties of these tasks, we do not intend to
create the impression that the tasks analysed cover the vast variety of possible
applications. On the contrary, one of the aims of this lecture is to rouse the
feeling that we know only a small part of the relevant tasks. In this way we wish
to impair the widespread and pleasantly self-delusive idea that it is sufficient
338 Lecture 8: Recognition of Markovian sequences
to know only one method of Markovian sequence recognition, and this is seeking
the shortest path in a graph by the methods of dynamic programming.
A significant breakthrough in structural recognition appeared when the so-
lution of problems, insurmountable before, proved successful with the aid of
dynamic programming. This deserves credit even after several decades. In this
context, however, we wished to point out that significant as the knowledge may
be, it need not be an actual contribution, when, without forethought, it begins
to be considered as generally valid.
Markovian model (8.1), consists of the assumption that for any vertex $0 \in I$,
which will be called the 0-th vertex, the probability p(x, k) has the form
$$p(x, k) = p(k_0) \prod_{i \in I \setminus \{0\}} p\bigl(x_{\{i, g(i)\}}, k_i \mid k_{g(i)}\bigr) \,, \eqno(8.62)$$
where g(i) is the vertex connected with the vertex i by the edge which lies on
the path from the 0-th vertex to the i-th. The property (8.3) formulated
before is a particular case of the property (8.62) for $I = \{0, 1, 2, \ldots, n\}$ and
$g(i) = i - 1$.
The property of the model (8.62) can be informally represented by the model
in Fig. 8.4 which generalises the mechanical model in Fig. 8.1 used for informal
representation of the Markovian sequence properties.
In Fig. 8.4 the values $x_h$, $h \in H$, and $k_i$, $i \in I$, are represented by
points in a plane which are connected by line segments. If we visualise each
line segment as a flexible rod then it can be seen that the position of each
point affects the positions of all the other points. If we fix a point that
corresponds to some quantity $k_i$, say the point $k_f$, then the mechanical model in
the given case breaks into three independent parts: one consists of the points
$k_a$, $k_b$, $k_c$, $k_d$, $x_{ab}$, $x_{cb}$, $x_{dc}$ and $x_{bf}$, the second consists of the points $k_l$ and $x_{fl}$, and
the third of the points $k_h$ and $x_{fh}$. It is this property of conditional independence
of individual parts of a complex object that is formally expressed by the
assumption (8.62). This assumption has become the basis for the formulation
and exact solution of Bayesian recognition tasks similar to those which we
analysed for the case of sequences. We will briefly examine only two tasks:
the task of calculating the probability p(x) for a given observation x, and the
task of calculating the number $\max_{k \in K^I} p(x, k)$ for a given observation x.
After completing this analysis, the solution of other tasks and their modifications,
which we dealt with in detail in the case of sequences, will become quite clear.
Since the quantities $x_h$, $h \in H$, are fixed, the expression (8.63) will be
written so that these quantities are not present in it. The probability
$p(x_{\{i,g(i)\}}, k_i \mid k_{g(i)})$ will be denoted $f_i(k_i, k_{g(i)})$. To achieve symmetry of
the expression (8.63) with respect to the indices i, and also for further reasons
which will become clear later, we introduce the notation $\varphi_i(k_i)$. This
means $\varphi_0(k_0) = p(k_0)$, and $\varphi_i(k_i) = 1$ for all $i \neq 0$ and $k_i \in K$. The expression
(8.63) thus assumes the form
$$p(x, k) = \prod_{i \in I} \varphi_i(k_i) \prod_{i \in I \setminus \{0\}} f_i(k_i, k_{g(i)}) \,. \eqno(8.64)$$
In this expression the variable $k_i$ for each i is present in a single factor $\varphi_i(k_i)$.
As to the factors $f_i(k_i, k_{g(i)})$, the variable $k_{i^*}$ can be present, depending on the
index $i^*$, in one, two, or more factors. The variable $k_{i^*}$ is naturally present in the
factor $f_{i^*}(k_{i^*}, k_{g(i^*)})$, but also in those factors $f_i(k_i, k_{g(i)})$ for which $g(i) = i^*$.
There certainly exists an index $i^*$ such that the variable $k_{i^*}$ is present in only
one factor of the form $f_i(k_i, k_{g(i)})$: it is an index $i^*$ for which $i^* = g(i)$ holds
for no index i. The existence of such an index results from the property
that in an acyclic graph there exists a vertex from which only one edge goes out. For
the index $i^*$ defined in this way the formula (8.64) will be rewritten in the form
$$\varphi_{i'}(k_{i'}) := \varphi_{i'}(k_{i'}) \sum_{k_{i^*}} \varphi_{i^*}(k_{i^*})\, f_{i^*}(k_{i^*}, k_{i'}) \,, \qquad i' = g(i^*) \,. \eqno(8.66)$$
The denotation $\varphi_{i'}(k_{i'})$ on the right-hand side of the expression (8.66) is understood
as the value of the number $\varphi_{i'}(k_{i'})$ before the operator has been
applied; the same denotation on the left-hand side is understood as the
new value obtained through this operator. The computational complexity of the
operator (8.66) is $O(|K|^2)$.
The obtained numbers $\varphi_{i'}(k_{i'})$ can be substituted into the expression (8.65),
which can then be written in the form
where $I_1 = I \setminus \{i^*\}$. The expression (8.67) has the same form as the original
expression; only the number of variables $k_i$ over which the summation is
performed has decreased by one. Among the reduced number of variables there is
at least one that is present in only one factor of the form $f_i(k_i, k_{g(i)})$, and it can
be eliminated in the way already described, i.e., by the operator (8.66). After
the $(|I|-1)$-th elimination of the variables the expression (8.65) will assume
8.6 Markovian objects with acyclic structure 341
(Fig. 8.5, panels (a)–(g): the graph after successive vertex eliminations.)
the form
$$d = \sum_{k_0} \varphi_0(k_0) \,,$$
which can be easily calculated since the summation runs over the values
of a single variable. Therefore we have proved that the computational complexity of
the number p(x) is $O(|K|^2\, |I|)$, i.e., the same as the complexity in the case in
which the recognised object had the structure of a sequence.
Example 8.7 Calculating the probability of observation for an acyclic graph.
We will demonstrate the procedure presented above on the example of the graph
from Fig. 8.4. The structure of the input data for calculating the number p(x)
corresponds to the initial configuration of the graph, see Fig. 8.5(a). For each
vertex i, memory for $|K|$ numbers $\varphi_i(k_i)$, $k_i \in K$, is reserved, and for each
edge $h = (i, i')$, memory for $|K|^2$ numbers $f_i(k_i, k_{i'})$, $k_i \in K$, $k_{i'} \in K$, is
reserved.
The calculation processing these data can be expressed in the following
form. In Fig. 8.5(a) the graph is enclosed by a curve (an outline).
We choose a starting point on the outline, represented by a filled
square, and pass along the outline anticlockwise. During the passage we
create the sequence of graph vertices around which the path is led. The sequence
of vertices will be (b, a, b, f, h, f, l, f, b, c, d, c, b). The pass visits
one vertex after another, and the filled square in Fig. 8.5
indicates the momentary position. Some vertices are passed by without any
change of the data, i.e., nothing is calculated; in the other vertices the data
are changed. The changes occur at those vertices from which, in the momentary
graph, only one edge goes out and which are not the starting vertex. In our case
the vertex b is the starting one. The sequence of vertices which change the data
is (a, h, l, f, d, c). We will show what changes occur in the data at each of
these vertices. After each change the graph is modified in the corresponding way.
Vertex a. New values of the numbers $\varphi_b(k_b)$ are calculated according to the
formula
$$\varphi_b(k_b) := \varphi_b(k_b) \sum_{k_a} \varphi_a(k_a)\, f_a(k_a, k_b) \,.$$
The vertex a is eliminated from the graph and the algorithm continues
with the graph in Fig. 8.5(b).
Vertex h. New values of the numbers $\varphi_f(k_f)$ are calculated according to the
formula
$$\varphi_f(k_f) := \varphi_f(k_f) \sum_{k_h} \varphi_h(k_h)\, f_h(k_h, k_f) \,.$$
The vertex h is eliminated from the graph and the algorithm continues
with the graph in Fig. 8.5(c).
Vertex l. New values of the numbers $\varphi_f(k_f)$ are calculated according to the
formula
$$\varphi_f(k_f) := \varphi_f(k_f) \sum_{k_l} \varphi_l(k_l)\, f_l(k_l, k_f) \,.$$
The vertex l is eliminated from the graph and the algorithm continues
with the graph in Fig. 8.5(d).
Vertex f. New values of the numbers $\varphi_b(k_b)$ are calculated according to the
formula
$$\varphi_b(k_b) := \varphi_b(k_b) \sum_{k_f} \varphi_f(k_f)\, f_f(k_f, k_b) \,.$$
The vertex f is eliminated from the graph and the algorithm continues
with the graph in Fig. 8.5(e).
Vertex d. New values of the numbers $\varphi_c(k_c)$ are calculated according to the
formula
$$\varphi_c(k_c) := \varphi_c(k_c) \sum_{k_d} \varphi_d(k_d)\, f_d(k_d, k_c) \,.$$
The vertex d is eliminated from the graph and the algorithm continues
with the graph in Fig. 8.5(f).
Vertex c. New values of the numbers $\varphi_b(k_b)$ are calculated according to the
formula
$$\varphi_b(k_b) := \varphi_b(k_b) \sum_{k_c} \varphi_c(k_c)\, f_c(k_c, k_b) \,.$$
The vertex c is eliminated from the graph and the algorithm arrives at the
elementary graph in Fig. 8.5(g), which contains the single vertex b.
When the graph has been simplified in this way, the probability of the observation x is
calculated as the sum $\sum_{k_b} \varphi_b(k_b)$. △
The algorithm for calculating the number
$$d = \max_{(k_i,\ i \in I)} \prod_{i \in I} \varphi_i(k_i) \prod_{i \in I \setminus \{0\}} f_i(k_i, k_{g(i)}) \eqno(8.68)$$
has the same structure and is based on the same considerations as the algorithm
for calculating
$$p(x) = \sum_{(k_i,\ i \in I)} \prod_{i \in I} \varphi_i(k_i) \prod_{i \in I \setminus \{0\}} f_i(k_i, k_{g(i)}) \,. \eqno(8.69)$$
We will denote $g(i^*)$ by $i'$ and calculate new values of the numbers $\varphi_{i'}(k_{i'})$ by
the operator
$$\varphi_{i'}(k_{i'}) := \varphi_{i'}(k_{i'}) \max_{k_{i^*}} \varphi_{i^*}(k_{i^*})\, f_{i^*}(k_{i^*}, k_{i'}) \,.$$
We make use of the calculated values and write the number d in the form
$$d = \max_{k_0} \varphi_0(k_0) \,, \eqno(8.71)$$
which is trivial since the maximisation runs over a single variable. The total
number of operations needed for solving the task (8.68) has the complexity
$O(|K|^2\, |I|)$, which is the same as that for a sequence.
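The sum-to-max replacement can be spelled out on the smallest possible case, a chain, which is a special case of an acyclic structure. All numbers below are made up for illustration; the two folding steps are instances of the max-variant of the elimination operator:

```python
K = [0, 1]
# A chain a -> b -> c with invented numbers standing for phi_i(k_i)
# and f_i(k_i, k_g(i)).
phi = {"a": [1.0, 1.0], "b": [1.0, 1.0], "c": [0.7, 0.3]}
f = {("a", "b"): [[0.2, 0.3], [0.4, 0.1]],
     ("b", "c"): [[0.5, 0.2], [0.1, 0.6]]}

# fold "a" into "b", with max in place of the sum ...
phi["b"] = [phi["b"][kb] * max(phi["a"][ka] * f[("a", "b")][ka][kb] for ka in K)
            for kb in K]
# ... and then "b" into "c"
phi["c"] = [phi["c"][kc] * max(phi["b"][kb] * f[("b", "c")][kb][kc] for kb in K)
            for kc in K]
d = max(phi["c"])      # d = the maximum over all labellings, cf. (8.71)
```

A brute-force maximisation over all $|K|^3$ labellings gives the same number d, which is the point of the exercise: the elimination order, not the optimisation itself, carries the complexity.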
8.7 Formulation of supervised and unsupervised learning tasks
In Lecture 4, three learning tasks have been formulated in the general form,
through which a reasonable estimation of a statistical model of the object un-
der examination can be found. In Lecture 6, we formulated the unsupervised
learning task and solved it in the general form. We will express analogous
tasks for the class of Markovian models which we are now dealing with. We
will discuss a case in which an ensemble of parameters has the structure of a
sequence, and not the general structure of an acyclic graph because here the
most essential properties of the analysed tasks can be revealed without being
overshadowed by unnecessary details. The results we obtain for sequences can
be easily generalised for the general case of acyclic structures.
In the same way as before we assume that a complete description of a recognised
object is formed by two sequences: the sequence of observable features
$x = (x_1, x_2, \ldots, x_n)$ of length n and the sequence of hidden parameters
$k = (k_0, k_1, \ldots, k_n)$ of length n+1. The pair (x, k) is random and is described
by the probability distribution p(x, k), which is not arbitrary but has
the form
$$p(x, k) = p(k_0) \prod_{i=1}^{n} p_i(k_i, x_i \mid k_{i-1}) \,.$$
recognised object. Since the ensemble P uniquely determines the function
p(x, k), i.e., the probability distribution of the pairs $x = (x_1, \ldots, x_n)$ and $k =
(k_0, \ldots, k_n)$, the function p will also be referred to as the statistical model of
the object.
If the statistical model of an object is known then various pattern recognition
tasks can be formulated and solved the examples of which were analysed in the
previous parts of the lecture. If the statistical model of an object is not known
then it must be found either by experimentally examining the object, or on the
basis of the user's information in which he states his or her ideas either about the
recognised object, or about the desired behaviour of the recognition algorithm.
The creation of the statistical model is mostly called learning, supervised
or unsupervised. The information on the basis of which the model is created
is termed training information. A precise formulation of the learning tasks
depends on the properties of the training information. These formulations will
be presented now.
$$P^* = \operatorname*{argmax}_{p_1} \cdots \max_{p_n} \prod_{j=1}^{l} \left( p_1(k_0^j, x_1^j, k_1^j) \prod_{i=2}^{n} \frac{p_i(k_{i-1}^j, x_i^j, k_i^j)}{\sum_{k \in K} \sum_{x \in X} p_i(k_{i-1}^j, x, k)} \right) . \eqno(8.72)$$
Let us note that the previous expression for the joint probability p(x, k) has
a slightly different form than that used before. The previous expression involves
the joint probabilities $p_i(k_{i-1}^j, x_i^j, k_i^j)$, not the conditional probabilities
$p_i(k_i^j, x_i^j \mid k_{i-1}^j)$.
$$P^* = (p_1^*, p_2^*, \ldots, p_n^*) = \operatorname*{argmax}_{p_1} \max_{p_2} \cdots \max_{p_n} \min_{j} \left( p_1(k_0^j, x_1^j, k_1^j) \prod_{i=2}^{n} \frac{p_i(k_{i-1}^j, x_i^j, k_i^j)}{\sum_{k \in K} \sum_{x \in X} p_i(k_{i-1}^j, x, k)} \right)$$
$$P^* = \operatorname*{argmax}_{p_1} \cdots \max_{p_n} \prod_{j=1}^{l} \sum_{k_0} \sum_{k_1} \cdots \sum_{k_n} p_1(k_0, x_1^j, k_1) \prod_{i=2}^{n} \frac{p_i(k_{i-1}, x_i^j, k_i)}{\sum_{x \in X} \sum_{k \in K} p_i(k_{i-1}, x, k)} \,.$$
way of creating the model is reasonable, but on the property that the created
model maximises (8.72).
For the sake of formal analysis the task will be expressed in the following
equivalent form
(8.74)
The task given by the previous relation does not change if the integer function
g is substituted by a function α for which there holds $\alpha(x, k) = g(x, k)/l$, i.e.,
(8.75)
$K_i(k', k'')$, $i = 1, 2, \ldots, n$, $k' \in K$, $k'' \in K$, denotes the set of those sequences $k =
(k_0, k_1, \ldots, k_n) \in K^{n+1}$ for which $k_{i-1} = k'$ and $k_i = k''$ hold. By $\alpha_i(k', x', k'')$
we will denote the sum
$$\alpha_i(k', x', k'') = \sum_{x \in X_i(x')} \sum_{k \in K_i(k', k'')} \alpha(x, k) \,,$$
where the function α satisfies
$$\sum_{x \in X^n} \sum_{k \in K^{n+1}} \alpha(x, k) = 1 \,.$$
Let $p_i$, $i = 1, 2, \ldots, n$, be functions of the form $K \times X \times K \to \mathbb{R}$ for which there
hold
$$p_i(k', x', k'') = \sum_{x \in X_i(x')} \sum_{k \in K_i(k', k'')} \alpha(x, k) \,, \qquad \sum_{k' \in K} \sum_{x' \in X} \sum_{k'' \in K} p_i(k', x', k'') = 1 \,.$$
•
Proof. The basis for the proof is Lemma 6.1. First we will make clear the
relationship between the sums
(8.77)
$$\sum_{x \in X^n} \sum_{k \in K^{n+1}} \alpha(x, k) \log p_1'(k_0, x_1, k_1)$$
$$= \sum_{x_1 \in X} \sum_{x \in X_1(x_1)} \sum_{k_0 \in K} \sum_{k_1 \in K} \sum_{k \in K_1(k_0, k_1)} \alpha(x, k) \log p_1'(k_0, x_1, k_1)$$
$$= \sum_{k_0 \in K} \sum_{x_1 \in X} \sum_{k_1 \in K} \Bigl( \sum_{x \in X_1(x_1)} \sum_{k \in K_1(k_0, k_1)} \alpha(x, k) \Bigr) \log p_1'(k_0, x_1, k_1)$$
$$= \sum_{k_0 \in K} \sum_{x_1 \in X} \sum_{k_1 \in K} p_1'(k_0, x_1, k_1) \log p_1'(k_0, x_1, k_1) \,, \eqno(8.78)$$
and, in the same way,
$$\sum_{k_0 \in K} \sum_{x_1 \in X} \sum_{k_1 \in K} p_1'(k_0, x_1, k_1) \log p_1(k_0, x_1, k_1) \,. \eqno(8.79)$$
Since both the sum $\sum_{k_0 \in K} \sum_{x_1 \in X} \sum_{k_1 \in K} p_1'(k_0, x_1, k_1)$ and the sum
$\sum_{k_0 \in K} \sum_{x_1 \in X} \sum_{k_1 \in K} p_1(k_0, x_1, k_1)$ are equal to 1, we find on the basis of
Lemma 6.1 that (8.78) is not less than (8.79), and thus
(8.80)
Now we will make clear the relationship between the sums
$$\sum_{x \in X^n} \sum_{k \in K^{n+1}} \alpha(x, k) \log \frac{p_i'(k_{i-1}, x_i, k_i)}{\sum_{x' \in X} \sum_{k' \in K} p_i'(k_{i-1}, x', k')}$$
and
$$\sum_{x \in X^n} \sum_{k \in K^{n+1}} \alpha(x, k) \log \frac{p_i(k_{i-1}, x_i, k_i)}{\sum_{x' \in X} \sum_{k' \in K} p_i(k_{i-1}, x', k')} \,. \eqno(8.81)$$
The first of the sums is
$$\sum_{k_{i-1} \in K} \sum_{x_i \in X} \sum_{k_i \in K} \Bigl( \sum_{x \in X_i(x_i)} \sum_{k \in K_i(k_{i-1}, k_i)} \alpha(x, k) \Bigr) \log \frac{\alpha_i(k_{i-1}, x_i, k_i)}{\sum_{x' \in X} \sum_{k' \in K} \alpha_i(k_{i-1}, x', k')} \,. \eqno(8.82)$$
With respect to similar considerations we claim that the second sum in (8.81)
is
(8.83)
Since the sum
$$\sum_{x_i \in X} \sum_{k_i \in K} \frac{p_i(k_{i-1}, x_i, k_i)}{\sum_{x' \in X} \sum_{k' \in K} p_i(k_{i-1}, x', k')}$$
is equal to 1 at any value
$k_{i-1}$, we claim on the basis of Lemma 6.1 that the inequality
is correct at any value $k_{i-1}$. If we sum this inequality over all $k_{i-1}$ then we
obtain the inequality
and thus, owing to (8.82) and (8.83), we also obtain the inequality
which is satisfied for any $i = 2, 3, \ldots, n$. If we sum this inequality over all i
(8.84)
are to be calculated, which in the experiment means the relative frequency of the
value x' in the position i of the sequence x together with the values k' and k'' in
the (i−1)-th and i-th positions of the sequence k.
3. The ensemble of numbers $p_i(k', x', k'')$ expresses the Markovian model of the object
examined in the sense that for each pair of sequences $x \in X^n$ and $k \in K^{n+1}$
it determines their joint probability according to the formula
The proved Theorem 8.1 claims that the obtained Markovian model of the object is
the most likely one in the sense that the probability of the ensemble $\bigl((x^j, k^j),$
$j = 1, 2, \ldots, l\bigr)$ experimentally observed is, in this model, not less than the
probability of the same experimental outcomes in any other Markovian model
of the object.
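The estimate amounts to frequency counting. The sketch below, with a small invented training set, computes the relative frequencies of the triplets $(k_{i-1}, x_i, k_i)$ at each position and then assembles the joint probability of a pair (x, k) from them:

```python
from collections import Counter

# An invented supervised training set: pairs (x, k), x of length n, k of n + 1.
L = [(("a", "b", "a"), (0, 1, 1, 0)),
     (("a", "a", "b"), (0, 0, 1, 1)),
     (("b", "b", "a"), (1, 1, 0, 0))]
n = 3
l = len(L)

# relative frequencies of the triplets (k_{i-1}, x_i, k_i) per position i
counts = [Counter() for _ in range(n + 1)]
for x, k in L:
    for i in range(1, n + 1):
        counts[i][(k[i - 1], x[i - 1], k[i])] += 1
p = [None] + [{t: c / l for t, c in counts[i].items()} for i in range(1, n + 1)]

def joint(x, k):
    """p(x, k) assembled from the estimated numbers p_i(k', x', k''):
    the first joint factor, then conditional factors obtained by
    dividing each joint triplet probability by its k_{i-1}-marginal."""
    pr = p[1].get((k[0], x[0], k[1]), 0.0)
    for i in range(2, n + 1):
        denom = sum(v for (kp, _, _), v in p[i].items() if kp == k[i - 1])
        pr *= p[i].get((k[i - 1], x[i - 1], k[i]), 0.0) / denom if denom else 0.0
    return pr
```

Pairs occurring in the training set get positive probability; unseen triplets get the probability zero, which is a well-known weakness of the raw relative-frequency estimate when the training set is small.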
For further explanation it is helpful to consider Algorithm 8.1, i.e., the formulae
(8.85) and (8.86), as a transformation of a function $\alpha\colon X^n \times K^{n+1} \to \mathbb{R}$
into a function $p\colon X^n \times K^{n+1} \to \mathbb{R}$. The transformation converts the probability
distribution α, which need not be Markovian and can be of any kind,
into the probability distribution p, which must be Markovian. The probability
distribution p which is formed on the basis of the probability distribution α
will be referred to as the Markovian approximation of the function α and will
be denoted $\alpha^M$. The index M in the denotation $\alpha^M$ is understood as an
operator which acts on the function α and transforms it into the function $p = \alpha^M$.
For example, the denotation $(\alpha + \beta)^M$ means the Markovian approximation of
the sum of the functions α and β. The denotation $\alpha^M(x)$ means the value of the
function that is the Markovian approximation of the function α at the point x,
etc.
From the definition of the Markovian approximation there immediately follows
$$\alpha^M = \operatorname*{argmax}_{p} \sum_{x, k} \alpha(x, k) \log p(x, k) \,.$$
The task to be solved is
$$\operatorname*{argmax}_{p_1, \ldots, p_n} \min_{(x, k) \in L} \left( \log p_1(k_0, x_1, k_1) + \sum_{i=2}^{n} \log \frac{p_i(k_{i-1}, x_i, k_i)}{\sum_{x' \in X} \sum_{k' \in K} p_i(k_{i-1}, x', k')} \right) .$$
We will introduce an algorithm which solves this task with any predefined
accuracy $\varepsilon > 0$.
The algorithm alters stepwise the integer numbers n(x, k) for each pair $x =
(x_1, \ldots, x_n)$ and $k = (k_0, k_1, \ldots, k_n)$ which occurs in the training set L, and
the integer numbers $n_i(k', x', k'')$ for each $i = 1, 2, \ldots, n$ and for each triplet
$k' \in K$, $x' \in X$, $k'' \in K$. The numbers n(x, k) and $n_i(k', x', k'')$ serve for
calculating the current values of the probabilities α(x, k) and $p_i(k', x', k'')$ in the
following way:
$$\alpha(x, k) = \frac{n(x, k)}{\sum_{(x, k) \in L} n(x, k)} \,, \qquad p_i(k', x', k'') = \frac{n_i(k', x', k'')}{\sum_{k' \in K} \sum_{x' \in X} \sum_{k'' \in K} n_i(k', x', k'')} \,.$$
was x' at the instant i. Assume that prior to the step t the numbers $n^t(x, k)$,
$(x, k) \in L$, and the numbers $n_i^t(k', x', k'')$, $i = 1, 2, \ldots, n$, were computed. The
next values of these numbers are calculated according to the rules:
$$\alpha^t(x, k) = \frac{n^t(x, k)}{\sum_{(x, k) \in L} n^t(x, k)} \,, \quad (x, k) \in L \,,$$
$$p_i^t(k', x', k'') = \frac{n_i^t(k', x', k'')}{\sum_{k' \in K} \sum_{x' \in X} \sum_{k'' \in K} n_i^t(k', x', k'')} \,.$$
2. The probabilities $p^t(x, k)$ are calculated according to the formula (8.86), i.e.,
$$p^t(x, k) = p_1^t(k_0, x_1, k_1) \prod_{i=2}^{n} \frac{p_i^t(k_{i-1}, x_i, k_i)}{\sum_{x' \in X} \sum_{k' \in K} p_i^t(k_{i-1}, x', k')} \,, \quad (x, k) \in L \,. \eqno(8.87)$$
4. If the inequality (8.88) is satisfied then the algorithm ends, and the current values
$p_i^t(k', x', k'')$ form the ε-solution of the task.
5. If the inequality (8.88) is not satisfied then the following calculations are performed.
(a) A pair $(x^*, k^*) \in L$ is found for which there holds
(b) New values of the numbers n(x, k) and $n_i(k_{i-1}, x_i, k_i)$ are calculated.
6. The algorithm proceeds to the (t+1)-th iteration, starting from the step 1.
For the algorithm formulated in this way the following two theorems are valid.
Theorem 8.2 On convergence of an algorithm in a finite number of steps.
For any predefined positive value ε, Algorithm 8.2 reaches, after a finite number of
steps, the state in which the inequality (8.88) is satisfied, and the algorithm
ends. •
8.9 Minimax estimate of a statistical model 355
$$F(P) = \min_{(x, k) \in L} \left( \log p_1(k_0, x_1, k_1) + \sum_{i=2}^{n} \log \frac{p_i(k_{i-1}, x_i, k_i)}{\sum_{x' \in X} \sum_{k' \in K} p_i(k_{i-1}, x', k')} \right)$$
3. The function f is convex if for each point $x_0 \in X$ there exists a linear
function $L_{x_0}\colon X \to \mathbb{R}$ such that the inequality
in any point of a convex function. If the function f and the definition (8.89)
unambiguously determine the linear function $L_{x_0}$, and thus also the generalised
gradient $g(x_0)$, then the function f is called differentiable (or smooth) and the
generalised gradient $g(x_0)$ is simply called the gradient.
Any convex function on a finite-dimensional set is continuous. This means
that the difference $f(x) - f(x_0)$ approaches zero if x approaches $x_0$. A linear
function is a particular case of a convex function and therefore
$$\lim_{x \to x_0} g(x_0)(x - x_0) = 0 \,.$$
If, however, the function f is smooth and $g(x_0) \neq 0$ then
Since the solution of the task, i.e., the function $p^*$, does not depend on how
many times an element has occurred in the multi-set $(x^j,\ j = 1, \ldots, n)$, but
only on whether it occurs in the multi-set at least once, the input information
can be regarded as a finite subset $L \subset X$, not as a multi-set $(x^j,\ j = 1, \ldots, n)$.
The task (8.90) assumes the form
$$p^* = \operatorname*{argmax}_{p \in P} \min_{x \in L} p(x) \,, \eqno(8.91)$$
Let us recall the maximum likelihood estimate of the model p, i.e., seeking
$$p^* = \operatorname*{argmax}_{p \in P} \sum_{x \in L} \alpha(x) \log p(x) \,. \eqno(8.92)$$
This means that a function $p^*$ is sought which maximises the probability of
a multi-set in which α(x) is the relative occurrence of the
element x. The function $p^*$ according to the relation (8.92) depends on the
coefficients α(x), and therefore the corresponding algorithm can be regarded as
an operator which transforms a function $\alpha\colon L \to \mathbb{R}$ into a function $\alpha^M\colon X \to \mathbb{R}$
in such a way that
$$\alpha^M = \operatorname*{argmax}_{p \in P} \sum_{x \in L} \alpha(x) \log p(x) \,.$$
which depends on the function $\alpha\colon L \to \mathbb{R}$ and which will be denoted Q(α). For
the number Q(α), according to its definition, there holds the inequality
glance it may seem much more difficult. It is therefore important that the minimax
estimate reduces to the maximum likelihood estimate in the following
sense.
If a program for the maximum likelihood estimate (8.92) is at our
disposal then we can quite formally, that is, in a standard way, also create
a program for the minimax estimate (8.91). The part solving (8.92) will be
included in it as a subroutine. This trick is made possible because we are
able to prove that, independently of the sets X and P, the solution of the
task (8.91) must have the form $\alpha^M$ for some function α. In other words,
the minimax estimate is identical with the maximum likelihood estimate for
certain coefficients α(x), $x \in L$. The coefficients α(x) with which the two estimates
become identical are extremal in the sense that they minimise the number
$Q(\alpha) = \sum_{x \in L} \alpha(x) \log \alpha^M(x)$. Another important result is that for arbitrary
sets X, L and P, the function Q(α) is always convex. It can therefore be
minimised in various well known ways.
The so far informally expressed statements will now be exactly formulated and
proved.
Lemma 8.1 On the upper bound of the function $\min_{x \in L} \log p(x)$. Let p
be any function $X \to \mathbb{R}$ from the class P, and let $\alpha\colon L \to \mathbb{R}$ be any function for which
there holds
$$\sum_{x \in L} \alpha(x) = 1 \,, \qquad \alpha(x) \geq 0 \,, \ x \in L \,. \eqno(8.93)$$
In this case there holds
Proof. The inequality (8.94) results from the evident inequalities
$$\min_{x \in L} \log p(x) \leq \sum_{x \in L} \alpha(x) \log p(x) \,, \eqno(8.95)$$
$$\sum_{x \in L} \alpha(x) \log p(x) \leq \sum_{x \in L} \alpha(x) \log \alpha^M(x) \,. \eqno(8.96)$$
The inequality (8.95) holds owing to the condition (8.93). The value on the
right-hand side of (8.95) is the weighted arithmetic average of the numbers
$\log p(x)$; the number on the left-hand side of (8.95) is the least of them. Of
course, the least number is not greater than the average.
The inequality (8.96) immediately results from the definition of the function
$\alpha^M$. From the inequalities (8.95) and (8.96) we obtain (8.94). •
The symbol A will denote the set of functions $\alpha\colon L \to \mathbb{R}$ which satisfy (8.93).
Proof. The symbol lin A will denote the linear closure of the set A. For
an arbitrary point $\alpha_0 \in A$, i.e., for an arbitrary function $\alpha_0\colon L \to \mathbb{R}$, we will
determine the linear function $G_{\alpha_0}\colon \mathrm{lin}\, A \to \mathbb{R}$.
The function $G_{\alpha_0}$ is consequently just the function the existence of which, owing
to (8.89), establishes the convexity of the function Q(α). The following
inequality is satisfied on the set A
and
$$\alpha^* = \operatorname*{argmin}_{\alpha \in A} \sum_{x \in L} \alpha(x) \log \alpha^M(x) \,. \eqno(8.99)$$
2. If the function $Q(\alpha) = \sum_{x \in L} \alpha(x) \log \alpha^M(x)$ is smooth and it is satisfied that
Proof. The first part of Theorem 8.4 can be proved quite easily. Thanks to
Lemma 8.1 the inequality (8.94) holds for any $p \in P$ and $\alpha \in A$, and thus for
any $p \in P$ and in particular for the $\alpha^*$ that satisfies (8.97). So we can write
This inequality together with the condition (8.97) leads to the inequality
which is the relation (8.99) expressed in another way. Therefore the first statement
of Theorem 8.4 is proved.
Now we will prove the second part of Theorem 8.4 by contradiction. The
inequality
is trivial. Assume that the result (8.101) does not hold, i.e., that the strict
inequality occurs
We will prove that in this case there would exist a function $\alpha' \in A$ such that
$$\min_{x \in L} \log \alpha^{*M}(x) = \sum_{x \in L} \alpha'(x) \log \alpha^{*M}(x) \,,$$
We will examine how the function $Q(\alpha) = \sum_{x \in L} \alpha(x) \log \alpha^M(x)$ behaves on
the segment that connects the points $\alpha^*$ and $\alpha'$, i.e., the dependence of the
number $Q\bigl(\alpha^*(1-\gamma) + \alpha'\gamma\bigr)$ on the coefficient γ. Let us regard the points $\alpha^*$
and $\alpha'$ as fixed, and the function $Q\bigl(\alpha^*(1-\gamma) + \alpha'\gamma\bigr)$ as a function of
the single variable γ. The derivative of this function with respect to the variable γ at
the point $\gamma = 0$ is
The derivative is negative because of (8.103). This means that at least for small values
of γ the number $Q\bigl(\alpha^*(1-\gamma) + \alpha'\gamma\bigr)$ is less than $Q(\alpha^*)$, and thus
which is in contradiction with the assumption (8.100). In this way the second
part of Theorem 8.4 is proved. •
The proved Theorem 8.4 shows the direction in which the solution
of the minimax estimation task for the model $p \in P$ should be sought.
It is necessary to find the weights $\alpha^*(x)$, $x \in L$, which minimise the convex
function $Q(\alpha) = \sum_{x \in L} \alpha(x) \log \alpha^M(x)$. From the second part of Theorem 8.4
it results that the obtained weights $\alpha^*(x)$ satisfy the equation (8.101). At the
same time, from the first part of Theorem 8.4 it results that the solution of
the minimax task (8.91) is the approximation $p^* = \alpha^{*M}$, i.e., the maximum
likelihood estimate.
Since the function Q(α) is convex, various procedures are at hand for its
minimisation. But standard procedures need not be used for this minimisation,
since the very proof of Theorem 8.4 yields the following recommendations for
creating minimisation algorithms.
If a pair $\alpha \in A$ and $p \in P$ has been found satisfying the relation
$$\min_{x \in L} \log p(x) = \sum_{x \in L} \alpha(x) \log p(x) \,, \qquad p = \alpha^M \,, \eqno(8.104)$$
then the task is solved. If the relation (8.104) is not satisfied then we can see in
which way the weights α(x) should be altered: we have to increase the weight
α(x) of a pattern $x \in L$ from the training set whose instantaneous probability
p(x) is the lowest, or one of the lowest.
This intuition can be exactly expressed by means of the following algorithm.
4. If the inequality
$$\sum_{x \in L} \alpha^t(x) \log p^t(x) - \min_{x \in L} \log p^t(x) < \varepsilon \eqno(8.106)$$
is satisfied then the algorithm ends and $p^t$ is the solution of the task.
5. If the inequality (8.106) is not satisfied:
(a) It is denoted
$$x' = \operatorname*{argmin}_{x \in L} p^t(x) \,.$$
(b) New values of the numbers n(x), $x \in L$, are calculated such that
$$n^{t+1}(x') = n^t(x') + 1 \,, \qquad n^{t+1}(x) = n^t(x) \,, \ x \in L \,, \ x \neq x' \,.$$
6. The algorithm proceeds to the next, (t+1)-th, iteration by going to the step 2.
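The loop can be sketched for the simplest conceivable class P, the class of all distributions on a finite set; for this class the weighted maximum likelihood estimate $\alpha^M$ is α itself, so the subroutine is trivial, and the minimax solution is known in advance to be the uniform distribution on L. Everything in the sketch below, including the training set and the initial counts, is an invented illustration of ours; for a real class P the stand-in subroutine `aM` would be replaced by the corresponding maximum likelihood estimator:

```python
import math

L = ["x1", "x2", "x3"]           # training set (illustrative)
eps = 1e-3

def aM(alpha):
    """Stand-in for the maximum likelihood subroutine: for the class of all
    distributions on X, the weighted ML estimate is alpha itself."""
    return dict(alpha)

n = {"x1": 4, "x2": 1, "x3": 1}  # deliberately non-uniform initial counts
while True:
    total = sum(n.values())
    alpha = {x: n[x] / total for x in L}
    p = aM(alpha)
    # stopping rule in the spirit of (8.106)
    gap = (sum(alpha[x] * math.log(p[x]) for x in L)
           - min(math.log(p[x]) for x in L))
    if gap < eps:
        break
    worst = min(L, key=lambda x: p[x])   # the pattern of lowest probability
    n[worst] += 1                        # increase its weight by one

# for this class the minimax solution is the uniform distribution on L
```

The loop repeatedly raises the weight of the currently least probable pattern, and with the chosen numbers it stops once the counts have evened out, i.e. once p is uniform over L, which is exactly $\operatorname{argmax}_p \min_{x \in L} p(x)$ for this class.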
Algorithm 8.3 differs from the preceding considerations in that the condition
for ending the iterations is not the equation (8.104), which thanks to Theorem 8.4
would guarantee that the task has been solved; the weaker condition (8.106) is used instead.
Common sense suggests that since the condition (8.106)
is an approximate alternative of the condition (8.104), it could be considered as
the condition of an approximate solution of the task. This correct assumption
is confirmed by the following theorem.
Theorem 8.5 On the approximate solution of a minimax estimation task. If $\alpha \in
A$ and $p \in P$ satisfy the inequality
where
$$p^* = \operatorname*{argmax}_{p \in P} \min_{x \in L} p(x) \,, \eqno(8.108)$$
•
Proof. The inequality (8.94), which is proved by Lemma 8.1, is valid for any
$\alpha \in A$ and $p \in P$. Therefore it is valid also for the α which satisfies the condition
(8.107), and for the $p^*$ which satisfies (8.108). We can write
(8.112)
The number on the left-hand side of the last equality is nothing else
than $\lim_{t \to \infty} \bigl( Q(\alpha^t) - Q(\alpha^1) \bigr)$. Therefore the relation (8.113) would
mean that, in spite of the function Q(α) being bounded from below, the number
$Q(\alpha^t)$ could fall below any negative number at sufficiently great t. In
this way, the theorem would be proved through contradiction.
Thus, it is necessary to prove that the assumption (8.110) implies the
relations (8.111) and (8.112); then Theorem 8.6 will be proved.
3. We will denote $n^t = \sum_{x \in L} n^t(x)$, $x^t = \operatorname*{argmin}_{x \in L} p^t(x)$, and by $\alpha'^t$ the
function $L \to \mathbb{R}$ for which there holds $\alpha'^t(x) = 1$ if $x = x^t$ and $\alpha'^t(x) = 0$ if
$x \neq x^t$. With these notations the inequality (8.110) assumes the form
$$\sum_{x \in L} \bigl( \alpha^t(x) - \alpha'^t(x) \bigr) \log p^t(x) \geq \varepsilon \,, \eqno(8.114)$$
$$\alpha^{t+1}(x) = \alpha^t(x)\, \frac{n^t}{n^t + 1} + \alpha'^t(x)\, \frac{1}{n^t + 1} \,.$$
4. The number $n^t$ in the first iteration is $|L|$ and it is increased by one in each
iteration. Therefore there holds $n^t = |L| + t - 1$, and therefore
(8.115)
The numerator is bounded and therefore we can see that for all $x \in L$
5. At last we can see, owing to (8.114), that the difference is negative and,
moreover,
$$G_t(\alpha^{t+1}) - G_t(\alpha^t) \leq -\frac{\varepsilon}{|L| + t} \,.$$
6. The sum $\sum_{t=1}^{T} G_t(\alpha^{t+1} - \alpha^t)$ is not greater than $-\varepsilon \sum_{t=1}^{T} \frac{1}{|L| + t}$, and therefore
with increasing T the sum can fall below any negative number. The
relation (8.112) is proved. •
Thus we have proved that a rather simple algorithm performs the minimax estimation of the statistical model of the object. Of course, this algorithm can be regarded as simple only under the condition that a simple algorithm for the maximum likelihood estimate is available. If we already have such a program then the program for the minimax estimate is derived from it in quite a mechanical way. The existing program is extended by the simple operation of adding one to a count, and by enclosing it in a loop which is hardly worth mentioning. This superstructure does not depend on the concrete task, i.e., on the form of the observation set X or on the class of models P. In this way a close relationship is revealed between two extensive estimation tasks which would seem, at a cursory glance, to be different.
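The mechanical superstructure just described can be sketched as follows. The sketch assumes a black-box `ml_estimate` procedure; for the usage example it takes the toy class P of all categorical distributions on the observations, so the names `minimax_estimate` and `categorical_ml` are illustrative, not from the lecture.

```python
from collections import Counter

def minimax_estimate(L, ml_estimate, iterations=1000):
    """Minimax estimate obtained by the mechanical extension of a maximum
    likelihood program: keep a training multi-set of counts, and in each
    pass add one more copy of the currently worst modelled example.
    `ml_estimate` maps a multi-set of counts to a model p, callable as p(x)."""
    counts = Counter({x: 1 for x in L})   # every example once at the start
    for _ in range(iterations):
        p = ml_estimate(counts)           # maximum likelihood step
        worst = min(L, key=p)             # x^t = argmin_x p^t(x)
        counts[worst] += 1                # "adding one", then loop
    return ml_estimate(counts), counts

# Toy model class P: all probability distributions on the observations.
# For this class the minimax estimate is uniform on L, so the loop should
# drive the relative frequencies toward 1/|L|.
def categorical_ml(counts):
    total = sum(counts.values())
    probs = {x: c / total for x, c in counts.items()}
    return lambda x: probs[x]
```

The loop does not depend on what `ml_estimate` computes internally, which is exactly the universality claimed in the text.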
The algorithm mentioned above is, owing to its universal character, suitable for use even when the specific properties of the applied problem under consideration are not yet known.
366 Lecture 8: Recognition of Markovian sequences
As knowledge of the task grows, other algorithms may become better suited, for example gradient descent, the conjugate gradient method, and many others. The justification for using all these methods rests on the following two previously proved and universally valid properties.
• The minimax estimate of the model for some training set L is identical with the maximum likelihood estimate of the model for some training multi-set in which every member x ∈ L occurs with relative frequency α(x).
• The coefficients α(x), x ∈ L, which provide the equivalence of these two estimates, minimise a well-defined convex function.
The program implementing this operation was assumed to be available. For the case in which P is a set of Markovian models, we defined this operation by means of the formulae (8.85) and (8.86) and called it a Markovian approximation.

It can be noted that including the calculations (8.85), (8.86) from Section 8.8 in the general Algorithm 8.3 leads to the particular Algorithm 8.2. It is not difficult to verify that the set of Markovian models P satisfies the conditions under which Theorems 8.4, 8.5 and 8.6 are valid. These are only two conditions: consistency of the training set L with respect to the set P, and smoothness of the function max_{p∈P} ∑_{(x,k)∈L} α(x,k) log p(x,k). The first property is satisfied because for any training set ((x^j, k^j), j = 1, …, l) a Markovian model exists in which each pair (x^j, k^j) occurs with non-zero probability. It can also be noticed that for the set of Markovian models P the dependence of the value max_{p∈P} ∑_{(x,k)∈L} α(x,k) log p(x,k) on the coefficients α(x,k) is not only convex but also smooth, i.e., differentiable. Theorems 8.2 and 8.3 are thus special cases of the proved Theorems 8.5 and 8.6, respectively. We need not, therefore, prove them.
Written in another way, the function values f_i(k_{i−1}, x_i, k_i) have to satisfy the system of inequalities

∑_{i=1}^{n} f_i(k^j_{i−1}, x^j_i, k^j_i) > ∑_{i=1}^{n} f_i(k_{i−1}, x^j_i, k_i) ,   k ≠ k^j ,   j = 1, 2, …, l .   (8.117)
The sequence (8.118) is obtained with the help of the constructive algorithms seeking the shortest path in a graph, which were introduced in Subsection 8.4.4. If the equation (8.119) is not satisfied for some j then this also determines an inequality from the system (8.117) which is not satisfied. Namely, it is the inequality whose left-hand side corresponds to the sequence k^j_0, k^j_1, …, k^j_n and whose right-hand side corresponds to the sequence k*_0, k*_1, …, k*_n.
The advantage of the perceptron and Kozinec algorithms is that it is not required to know all the unsatisfied inequalities of the system in order to change the values f_i(k′, x′, k″). It suffices to know just a single one of them. If for some j the equation (8.119) is not satisfied then this leads immediately to a modification of the values f_i(k′, x′, k″), for which the selected algorithm is used. If the perceptron algorithm is used, and it is the easiest one to formulate, then the rule performing the modification has a very simple form. The numbers f_i(k′, x′, k″) are to be increased by one if k^j_{i−1} = k′, x^j_i = x′, k^j_i = k″, and to be decreased by one if k*_{i−1} = k′, x^j_i = x′, k*_i = k″. Otherwise the values remain unchanged. In the case when k*_{i−1} = k^j_{i−1} = k′ and simultaneously k*_i = k^j_i = k″, the mentioned modification rule has to be understood in such a way that the value f_i(k′, x′, k″) remains unchanged.
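The modification rule can be sketched as follows. Here `best_sequence` stands in for the shortest-path search of Subsection 8.4.4 (maximising additive scores), and the table f[(i, k′, x, k)] plays the role of the values f_i(k′, x′, k″); all names are illustrative.

```python
from collections import defaultdict

def best_sequence(f, x_seq, states):
    """Most highly scored state sequence k_0, ..., k_n under additive scores
    f[(i, k_prev, x, k)], found by dynamic programming over the trellis."""
    score = {k: 0.0 for k in states}
    back = []
    for i, x in enumerate(x_seq):
        new_score, ptr = {}, {}
        for k in states:
            prev = max(states, key=lambda kp: score[kp] + f[(i, kp, x, k)])
            new_score[k] = score[prev] + f[(i, prev, x, k)]
            ptr[k] = prev
        back.append(ptr)
        score = new_score
    k = max(states, key=lambda q: score[q])
    seq = [k]
    for ptr in reversed(back):      # backtrack the optimal sequence
        k = ptr[k]
        seq.append(k)
    return list(reversed(seq))

def perceptron_step(f, x_seq, k_true, states):
    """One application of the perceptron rule from the text: +1 on the
    transitions of the correct sequence, -1 on those of the currently best
    competing sequence.  Returns True if a modification was made."""
    k_star = best_sequence(f, x_seq, states)
    if k_star == list(k_true):
        return False                # inequality (8.117) holds for this example
    for i, x in enumerate(x_seq):
        f[(i, k_true[i], x, k_true[i + 1])] += 1
        f[(i, k_star[i], x, k_star[i + 1])] -= 1
    return True
```

Note that when the true and the competing sequence share a transition, the +1 and −1 cancel, which is exactly the cancellation discussed above.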
p* = argmax_{p∈P} ∏_{j=1}^{l} ∑_{k∈K^{n+1}} p(x^j, k) = argmax_{p∈P} ∑_{j=1}^{l} log ∑_{k∈K^{n+1}} p(x^j, k) .   (8.121)
The similarity of this task to the task of unsupervised learning, which was discussed in Lecture 6, is quite evident, even if, strictly speaking, the tasks are different. Formerly we analysed a case of supervised learning in which seeking the model p(x, k) decomposes into |K| + 1 independent tasks. The first of them seeks the a priori probabilities p_K(k) for each value of the hidden parameter k. The other |K| tasks seek the distribution of the conditional probabilities p_{X|k}(x) under the condition k, for each value k ∈ K. It was assumed that the choice of the functions p_{X|k′} from the known set P does not, in any way, affect the
8.11 The maximum likelihood estimate of statistical model . . . 369
and

p_{X̄|K̄}(x̄ | k̄) = ∏_{i=1}^{n} [ p_i(k_{i−1}, x_i, k_i) / ∑_{x′∈X} p_i(k_{i−1}, x′, k_i) ] .   (8.122)
We denote by J_i(x′) the subset of those indices j for which x^j_i = x′ holds, and by K_i(k′, k″) the set of those sequences k for which k_{i−1} = k′, k_i = k″ holds. Assume that we know the ensemble of numbers p^t_i(k′, x′, k″). A new ensemble of numbers p^{t+1}_i(k′, x′, k″) is built in the following two steps.

Recognition. For each sequence x^j and each sequence k ∈ K^{n+1} we calculate the number

α^t(x^j, k) = p^t(x^j, k) / ∑_{k′∈K^{n+1}} p^t(x^j, k′) ,   (8.123)

where the numbers p^t(x^j, k) are calculated according to the formula (8.122).
The given algorithm is, of course, not suitable for practical use since the set K^{n+1} is extremely large, and therefore the numbers α(x^j, k) cannot be calculated for each sequence k ∈ K^{n+1}. Similarly, the sum over all sequences k ∈ K_i(k′, k″) in the formula (8.124) cannot be computed. Later we will show how an algorithm equivalent to the one quoted above can be built in a constructive way. But for now we will use the definition of the algorithm in the given non-constructive form to prove the following theorem.
Theorem 8.7 On unsupervised learning for Markovian sequences. Let p^t_i(k′, x′, k″) and p^{t+1}_i(k′, x′, k″), i = 1, 2, …, n, k′ ∈ K, x′ ∈ X, k″ ∈ K, be two ensembles calculated according to the relations (8.123) and (8.124). Let p^t, p^{t+1} be the two successive models, i.e., the two functions of the form X^n × K^{n+1} → ℝ defined by the formula (8.122). In this case there holds

∏_{j=1}^{l} ∑_{k∈K^{n+1}} p^{t+1}(x^j, k) ≥ ∏_{j=1}^{l} ∑_{k∈K^{n+1}} p^t(x^j, k) .
Proof. According to (8.123),

∑_{k∈K^{n+1}} α^t(x^j, k) = 1

holds for each j = 1, 2, …, l. Consequently, owing to Lemma 6.1, the inequality
∑_{k∈K^{n+1}} α^t(x^j, k) log ( p^t(x^j, k) / ∑_{k∈K^{n+1}} p^t(x^j, k) ) ≥ ∑_{k∈K^{n+1}} α^t(x^j, k) log ( p^{t+1}(x^j, k) / ∑_{k∈K^{n+1}} p^{t+1}(x^j, k) )   (8.125)

holds.
The numbers p^{t+1}_i(k′, x′, k″) calculated according to (8.124) satisfy the conditions of Theorem 8.1, from which it follows that

∑_{j=1}^{l} ∑_{k∈K^{n+1}} α^t(x^j, k) log p^{t+1}(x^j, k) ≥ ∑_{j=1}^{l} ∑_{k∈K^{n+1}} α^t(x^j, k) log p^t(x^j, k) .   (8.126)

The following inequality follows from the inequalities (8.125) and (8.126):

∑_{j=1}^{l} log ∑_{k∈K^{n+1}} p^{t+1}(x^j, k) ≥ ∑_{j=1}^{l} log ∑_{k∈K^{n+1}} p^t(x^j, k) ,
the calculation of which does not pose any unsurmountable obstacles because it requires only addition over the sequences which are present in the training multi-set.

The calculation according to the formula (8.123) is immediately replaced by the calculation of the sums α^t_i(k′, x^j, k″) on the basis of the numbers p^t_i(k′, x′, k″). The mechanism of this calculation was analysed in detail in Section 8.5, since the number α^t_i(k′, x^j, k″) is nothing else than the joint a posteriori probability of the event k_{i−1} = k′, k_i = k″ for an observed sequence x^j.
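These joint a posteriori probabilities can be sketched with a forward-backward computation, which is the constructive replacement of the unmanageable sum over all k ∈ K^{n+1}. For brevity the sketch assumes a time-homogeneous model in which trans[x][k′][k] plays the role of p_i(x_i, k | k′), and all names are illustrative.

```python
def transition_posteriors(p0, trans, x_seq):
    """Joint a posteriori probabilities P(k_{i-1}=k', k_i=k'' | x) by the
    forward-backward scheme.  p0[k] is the initial state distribution and
    trans[x][k'][k] stands in for p_i(x_i, k | k'), assumed the same for
    every i."""
    K = range(len(p0))
    n = len(x_seq)
    # forward pass: alpha[i][k] = p(x_1..x_i, k_i = k)
    alpha = [list(p0)]
    for x in x_seq:
        prev = alpha[-1]
        alpha.append([sum(prev[kp] * trans[x][kp][k] for kp in K) for k in K])
    # backward pass: beta[i][k] = p(x_{i+1}..x_n | k_i = k)
    beta = [None] * (n + 1)
    beta[n] = [1.0] * len(p0)
    for i in range(n - 1, -1, -1):
        beta[i] = [sum(trans[x_seq[i]][k][kn] * beta[i + 1][kn] for kn in K)
                   for k in K]
    evidence = sum(alpha[n])   # p(x_1..x_n)
    # posterior over each transition (k_{i-1} = k', k_i = k'')
    return [[[alpha[i - 1][kp] * trans[x_seq[i - 1]][kp][k] * beta[i][k] / evidence
              for k in K] for kp in K]
            for i in range(1, n + 1)]
```

The cost is O(n|K|²) additions and multiplications, i.e., only addition over what is actually present in the training multi-set, as stated above.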
8.12 Discussion
I have had an ambiguous impression from your lecture. On the one hand, I noticed that the Markovian model of an object, which you examined in detail in the lecture, allows a number of pattern recognition tasks to be solved precisely. I seem to understand the lecture to such an extent that I could solve the tasks by myself, and so I gain a self-confident feeling that in the sphere of structural recognition of sequences I can master many things. At the same time I assume that there exist treacherous pitfalls, into which I can fall all the more easily when I do not know about their existence. I would rather know them beforehand. I well remember Example 7.1 on recognising vertical and horizontal lines. The task seemed to be quite easy, but in fact it turned out to be fantastically complicated.
I would like to find a similar task even among Markovian models.
You would naturally come across such a task if you had enough time for it.
You would master one task after another and we estimate that after the tenth
task at the latest you would come across what you are looking for. We will try
to make the job easier for you by seeking it together with you. But anyhow,
let us start from a simple task.
Let us assume, as we did in the lecture, that x ∈ X^n and k ∈ K^{n+1} are two random sequences whose joint probability distribution p(x, k) is Markovian. We already know that a sequence k* = argmax_{k∈K^{n+1}} p(x, k) can be constructively found for any sequence x. How would you design an algorithm if you did not need to know the whole sequence k*, but only to find out how many times a certain value σ ∈ K occurred in it?
Here some trouble must be hidden which I have not spotted. When I already have the sequence k* it is easy to count how many times σ occurs in it. I wonder what I have overlooked, since I cannot see why this should not be a correct solution.
Your suggestion is not incorrect. But a trouble is there in spite of all that. You
think it self-evident that for solving the task it is necessary to find the whole
sequence k*. But in formulating the task we pointed out that a whole sequence
was not needed in your application. In your algorithm the whole sequence
is only an auxiliary piece of information, from which you will select the final
result. You have not suggested the best way.
Yes, it was said that I did not have to create the whole sequence k*, but it was not said that I was not allowed to create it. Why should I not be able to find it as an auxiliary sequence?
Since you would unnecessarily waste memory, and you may later be short of it. In creating this auxiliary information you must have enough memory for the quantities ind_i(k) for each i = 1, 2, …, n and each k ∈ K, i.e., a memory of n |K| log |K| bits. Remember the procedure presented in Subsection 8.4.4. Do not forget that the length n of the sequence can be so large that you may no longer have the necessary memory.
8.12 Discussion 373
In that case you would not be able to realise your procedure. It does not mean, however, that the task cannot be solved by another algorithm, one which finds out how many times the value σ occurred in the most probable sequence without building up that sequence itself. What should such an algorithm look like?
I understand it now. Let the numbers ind_i(k) have the same sense as defined in the lecture. I will introduce other numbers h_i(k), i = 0, 1, …, n, k ∈ K, which state how many times the value σ occurred in the sequence k′_0, k′_1, …, k′_i which maximises the probability p(x^i_1, k^i_0) and in which k′_i = k. The numbers h_i(k) are calculated according to the following procedure:

h_0(k) = { 1 , if k = σ ,
           0 , if k ≠ σ ;
                                          (8.127)
h_i(k) = { h_{i−1}(ind_i(k)) + 1 , if k = σ ,
           h_{i−1}(ind_i(k)) ,     if k ≠ σ .
If k* = (k*_0, k*_1, …, k*_n) is the most probable sequence then the number h_n(k*_n) is the solution of the task. To find it we need not know the whole sequence k*; it is sufficient to know only its last element. To find that last element I need not store the numbers ind_i(k), and therefore no memory for them is needed. The number ind_i(k), which occurs in the relation (8.127), is used for each i and k only once, immediately after it has been calculated, and therefore a single log |K|-bit cell is sufficient to remember it. The numbers h_i(k), k ∈ K, which the algorithm has calculated, are also used only once, namely in calculating the numbers h_{i+1}(k). To store the numbers h_i(k), k ∈ K, a memory of 2|K| log n bits is sufficient, which is far less than the n |K| log |K| bits that would be necessary if I wanted to reconstruct the whole sequence k*.
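The procedure just described can be sketched as follows, under the same illustrative conventions as before: trans[x][k′][k] stands in for p_i(x_i, k | k′), assumed time-homogeneous, and the names are not from the lecture.

```python
def count_state_on_best_path(p0, trans, x_seq, states, sigma):
    """Number of occurrences of `sigma` in the most probable state sequence,
    carried along by the counters h_i(k) of relation (8.127), so that only
    O(|K|) memory is needed instead of the full back-pointer table ind_i(k).
    p0[k] is the initial distribution; trans[x][k'][k] ~ p_i(x_i, k | k')."""
    score = {k: p0[k] for k in states}
    h = {k: (1 if k == sigma else 0) for k in states}     # h_0(k)
    for x in x_seq:
        new_score, new_h = {}, {}
        for k in states:
            # ind_i(k): the best predecessor, used once and then forgotten
            ind = max(states, key=lambda kp: score[kp] * trans[x][kp][k])
            new_score[k] = score[ind] * trans[x][ind][k]
            new_h[k] = h[ind] + (1 if k == sigma else 0)  # relation (8.127)
        score, h = new_score, new_h
    return h[max(states, key=lambda k: score[k])]         # h_n(k*_n)
```

Only the current columns `score` and `h` are ever stored, which is the memory saving argued for above.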
Now try to create an algorithm which, for a known stochastic automaton, finds out how many times the automaton passed through a particular state σ while it was generating the known output sequence (x_1, x_2, …, x_n).
Of course not. The most probable sequence need not be the real one.
But I do not know any real sequence. I cannot say that a particular sequence k is or is not the real one. At most, I can calculate the a posteriori probability of it being the real one. And so the question you are asking resembles the following: there exists a random quantity with a known probability distribution, and on this basis one is required to find out what this quantity is equal to. Well, that is nonsense. A random quantity is not identical with any fixed quantity.
for a given state σ and a given number l it is necessary to find the probability that the automaton passed through the state σ exactly l times when generating the observed sequence of symbols (x_1, x_2, …, x_n).
I have understood the assignment. I denote the sequence (x_1, x_2, …, x_i) by x^i_1 and assume that x^0_1 is an empty sequence. I denote by the symbols g_i(x^i_1, l, k), i = 0, 1, …, n, l = 0, 1, 2, …, i + 1, k ∈ K, the joint probability of the following three events:
1. The first i elements in the output sequence which the automaton generates are the values x^i_1.
2. In the sequence of states k_0, k_1, …, k_i through which the automaton passes, the state σ occurs l times.
3. The state k_i is k.

g_0(x^0_1, l, k_0) = { p_0(k_0) , if l = 1 and k_0 = σ ,
                       p_0(k_0) , if l = 0 and k_0 ≠ σ ,   (8.128)
                       0 ,        in other cases .
I denote by the symbol g′_i(x^i_1, l, k), i = 1, 2, …, n, l = 0, 1, …, i, k ∈ K, the joint probability of somewhat different events than those whose probability is denoted by g:
1. The first i elements of the output sequence which are generated by the automaton are x^i_1.
2. In the sequence of states (k_0, k_1, …, k_{i−1}) through which the automaton passes, the state σ occurs l times.
3. The state k_i is k.

The probabilities g and g′ are related by

g_i(x^i_1, l, k) = { g′_i(x^i_1, l, k) ,     if k ≠ σ ,   (8.129)
                     g′_i(x^i_1, l − 1, k) , if k = σ .
The probabilities g′ and g″ must satisfy the following relation, which holds for any probabilities:

g′_i(x^i_1, l, k) = ∑_{k′∈K} g″_i(x^i_1, l, k′, k) ,   (8.130)

g″_i(x^i_1, l, k′, k) = g″_i((x^{i−1}_1, x_i), l, k′, k) = g_{i−1}(x^{i−1}_1, l, k′) p_i(x_i, k | k′) .   (8.131)
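The recursions (8.128)-(8.131) can be sketched in the following way. The dictionary g gathers the probabilities g_i(x^i_1, l, k) indexed by the pair (l, k); as in the earlier sketches, trans[x][k′][k] stands in for p_i(x_i, k | k′), assumed time-homogeneous, and the names are illustrative.

```python
def visit_count_distribution(p0, trans, x_seq, states, sigma):
    """A posteriori distribution of the number l of times the automaton
    passed through `sigma` while generating x_seq, computed by the
    dynamic programming of relations (8.128)-(8.131)."""
    # base case (8.128): l counts sigma among k_0 alone
    g = {((1 if k == sigma else 0), k): p0[k] for k in states}
    for x in x_seq:
        new_g = {}
        for (l, kp), prob in g.items():
            for k in states:
                # extending the sequence by k raises l exactly when k == sigma
                key = (l + (1 if k == sigma else 0), k)
                new_g[key] = new_g.get(key, 0.0) + prob * trans[x][kp][k]
        g = new_g
    evidence = sum(g.values())            # p(x_1 .. x_n)
    dist = {}
    for (l, k), prob in g.items():        # marginalise over the last state
        dist[l] = dist.get(l, 0.0) + prob / evidence
    return dist
```

The table has at most (n + 2)|K| entries at any moment, so the calculation stays polynomial in n and |K|.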
You have managed the task quite well. It does not seem to us that the algorithm could be made substantially faster. Notice, however, that if you did not have to calculate the whole probability distribution of the quantity l, but only some of its characteristics, such as the mathematical expectation or the variance, then the calculation could be made faster. If you were to calculate the probability of each value of l when only the mathematical expectation is needed, you would make many superfluous calculations.
Let us direct your attention to a more difficult task because the limit of your
resources does not seem to have been attained yet.
Let us assume that you are interested not only in the total number of occurrences of the state σ in the sequence (k_0, k_1, …, k_n), but also in their positions. This means that you are interested in the set I′ of all instants i at which k_i = σ has occurred. This task may be one of the simplest of the class of tasks referred to as segmentation. In the given case it is the segmentation of the time interval (0, 1, 2, …, n) into subintervals which are separated from each other by the state σ and inside which the state σ does not occur. It is a task which, because of its treacherous character, is quite close to the task concerning vertical and horizontal lines which you recalled. At first glance it seems to be a simple and common task, such as finding the locations in a text document at which a certain letter occurs. If the character sought is a space then this amounts to segmenting the text into individual words.
We will formulate the task exactly and you will see that it can be solved, but its solution will require some mental effort. A set I′ ⊂ {0, 1, …, n} will be called a segmentation. For each segmentation I′ let us define a set K(I′) of sequences k = (k_0, k_1, …, k_n). The sequence k = (k_0, k_1, …, k_n) belongs to K(I′) if k_i = σ for all i ∈ I′ and, at the same time, k_i ≠ σ for all i ∉ I′. The probability that the actual segmentation is I′ is equal to the probability that the actual sequence k belongs to the set K(I′). Under the condition that the sequence x is known, the probability of the segmentation I′ is given by the sum ∑_{k∈K(I′)} p(k | x). The most probable segmentation is thus

I* = argmax_{I′⊂{0,1,…,n}} ∑_{k∈K(I′)} p(k | x) .   (8.133)
Try to design an algorithm which for each given sequence x yields the segmen-
tation (8.133).
I have mastered the task, but it may be the most difficult task I can still manage. I have found that the task (8.133) can be transformed to a form in which it can be solved by dynamic programming. But in setting up this particular dynamic programming task a quite complicated algorithm is needed. This algorithm calculates the data which form the input of the dynamic programming task itself.

First, I made clear what the function

F(I′) = ∑_{k∈K(I′)} p(x, k)

looks like.
For some given segmentation I′ = (i_0, i_1, …, i_Q) the pair (x, k) can be expressed as a concatenation of a certain number of subsequences in the form

(x, k) = ( k_0, x^{i_1}_1, k^{i_1−1}_1, k_{i_1}, x^{i_2}_{i_1+1}, k^{i_2−1}_{i_1+1}, k_{i_2}, …, x^{i_Q}_{i_{Q−1}+1}, k^{i_Q−1}_{i_{Q−1}+1}, k_{i_Q} ) ,

so that

p(x, k) = p_0(k_0) ∏_{q=1}^{Q} p′_q ( x^{i_q}_{i_{q−1}+1}, k^{i_q−1}_{i_{q−1}+1}, k_{i_q} | k_{i_{q−1}} ) .   (8.134)

In this product the number p_0(k), k ∈ K, means the probability that the initial state of the automaton is k. The number p′_q( x^{i_q}_{i_{q−1}+1}, k^{i_q−1}_{i_{q−1}+1}, k_{i_q} | k_{i_{q−1}} ) means the conditional probability of the event that, under the condition that the automaton was in the state k_{i_{q−1}} at the i_{q−1}-th instant, the next i_q − i_{q−1} output symbols will be x^{i_q}_{i_{q−1}+1} and the automaton will pass through the i_q − i_{q−1} states k^{i_q}_{i_{q−1}+1}.
In the sum ∑_{k∈K(I′)} p(x, k), which depends on the segmentation I′ = (i_0, i_1, …, i_Q), only such sequences k occur in which k_{i_q} = σ, q = 0, 1, …, Q, and k_i ≠ σ for the other indices i. Therefore the product (8.134) assumes the form

p(x, k) = p_0(σ) ∏_{q=1}^{Q} p′_q ( x^{i_q}_{i_{q−1}+1}, k^{i_q−1}_{i_{q−1}+1}, σ | σ ) .   (8.135)

This product is to be summed over all sequences of the set K(I′). This means that the summation ∑_{k∈K(I′)} must be performed, i.e., the multi-dimensional sum

∑_{k^{i_1−1}_1} ∑_{k^{i_2−1}_{i_1+1}} ⋯ ∑_{k^{i_q−1}_{i_{q−1}+1}} ⋯ ∑_{k^{n−1}_{i_{Q−1}+1}} .   (8.136)
This summation of the products (8.135) has the same form as, e.g., the sum

∑_{z_1} ∑_{z_2} ⋯ ∑_{z_m} ∏_{i=1}^{m} φ_i(z_i) = ∏_{i=1}^{m} ∑_{z_i} φ_i(z_i) ,

which is universally correct for any functions φ_i, i = 1, 2, …, m. Therefore the function F(I′), which is expressed as the sum (8.136) of the products (8.135), assumes the form

F(I′) = ∑_{k∈K(I′)} p(x, k) = p_0(σ) ∏_{q=1}^{Q} ∑_{k^{i_q−1}_{i_{q−1}+1}} p′_q ( x^{i_q}_{i_{q−1}+1}, k^{i_q−1}_{i_{q−1}+1}, σ | σ ) .   (8.137)
We will introduce a more general denotation p″_{ij}( x^j_{i+1}, k^{j−1}_{i+1}, σ | σ ) for the probability that, under the condition k_i = σ, the automaton will generate the sequence x^j_{i+1} and will pass through the sequence of states (k^{j−1}_{i+1}, σ). In this denotation, the p′_q used so far becomes p″_{i_{q−1} i_q}. Each factor in the product (8.137) then has the form

∑_{k^{j−1}_{i+1}} p″_{ij} ( x^j_{i+1}, k^{j−1}_{i+1}, σ | σ )   (8.138)

and depends only on the indices i and j. The sum (8.138) evidently does not depend on the subsequence k^{j−1}_{i+1} because the sum is taken over the set of these subsequences. Neither does it depend on the subsequence x^j_{i+1} as an unknown, because in each calculation this subsequence is fixed. The denotation Φ(i, j) will be introduced for (8.138), i.e.,

Φ(i, j) = ∑_{k^{j−1}_{i+1}} p″_{ij} ( x^j_{i+1}, k^{j−1}_{i+1}, σ | σ ) .   (8.139)
In this way I have succeeded in decomposing the original task into two separate tasks. In the former, the value Φ(i, j) is calculated for each pair of indices (i, j), i = 0, 1, …, n−1, j = 1, 2, …, n, i < j, according to the definition (8.139). In the latter task the segmentation I* is found, i.e., the sequence i*_0, i*_1, …, i*_Q (with Q not known beforehand) which minimises (8.140),

I* = argmax_{i_1, i_2, …, i_{Q−1}} ∏_{q=1}^{Q} Φ(i_{q−1}, i_q) .   (8.141)
The values Φ(i, j) sought are Φ′(i, j, σ). The values Φ′(i, j, k) are calculated according to the following recursive formula:

Φ′(i, j, k) = ∑_{k′∈K∖{σ}} Φ′(i, j−1, k′) p_j(x_j, k | k′) .   (8.142)

The calculation begins with the values Φ′(i−1, i, k), which for each i are the known probabilities p_i(x_i, k | σ) characterising the automaton. The calculation of the collection of numbers Φ′(i, j, k) for all triplets (i, j, k) according to the formula (8.142) has a complexity of O(n² |K|²). Once the values Φ′(i, j, k) have been calculated, we know the numbers Φ(i, j) = Φ′(i, j, σ).
Now I will search for the segmentation I* satisfying the requirement (8.141). Let i* > 0 be a chosen index and J(i*) be the set that contains all sequences of the form

i_0, i_1, i_2, …, i_Q ,   0 = i_0 < i_1 < i_2 < ⋯ < i_{Q−1} < i_Q = i* .

Each sequence of this kind is characterised by the number ∏_{q=1}^{Q} Φ(i_{q−1}, i_q). I will denote the largest of these numbers by F*(i*),

F*(i*) = max_{(i_0, i_1, …, i_Q)∈J(i*)} ∏_{q=1}^{Q} Φ(i_{q−1}, i_q) .   (8.143)

Let i*_0, i*_1, …, i*_{Q−1}, i*_Q be a sequence which maximises (8.143). I will denote the last but one index i*_{Q−1} in the sequence by the symbol ind(i*). The last index i*_Q is, as was said, i*. I define F*(i*) = 1 for i* = 0. The following recursion then holds:

F*(i*) = max_{0 ≤ i < i*} F*(i) Φ(i, i*) .
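The recursion for F*(i*) yields the following sketch. It assumes that the boundary indices i_0 = 0 and i_Q = n belong to every segmentation considered, and that the table phi[(i, j)] = Φ(i, j) has already been computed according to (8.142); the function name is illustrative.

```python
def best_segmentation(phi, n):
    """Most probable segmentation recovered from the table
    phi[(i, j)] = Phi(i, j) by the recursion
    F*(i*) = max over 0 <= i < i* of F*(i) * Phi(i, i*),
    with the pointers ind(i*) used to backtrack the optimal indices."""
    F = [1.0] + [0.0] * n            # F*(0) = 1 by definition
    ind = [0] * (n + 1)
    for j in range(1, n + 1):
        best = max(range(j), key=lambda i: F[i] * phi[(i, j)])
        F[j] = F[best] * phi[(best, j)]
        ind[j] = best
    segmentation = [n]               # backtrack: i*_Q = n, i*_{Q-1} = ind(n), ...
    while segmentation[-1] > 0:
        segmentation.append(ind[segmentation[-1]])
    return list(reversed(segmentation)), F[n]
```

Given the Φ table, this second stage costs only O(n²) operations, so the overall effort is dominated by the O(n²|K|²) computation of Φ′.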
Well, you wished so yourself. We wonder if you are able to formulate a task
that you probably will not be able to master.
I realise quite clearly now that we dealt with the most primitive variant of the segmentation task. For actual applications the tasks should be formulated with far greater care. From these lectures I have learned that seeking the most probable value of a hidden parameter of an object seems natural only at first glance. At the first serious plunge into an application task the roughness of such an approach already becomes obvious.

Seeking the most probable segmentation would rest on an unstated assumption that all deviations of the estimated segmentation from the actual one are equally significant. This is, however, an unforgivable simplification of the task. For example, if the algorithm has wrongly decided that the automaton, at a certain instant, passed through the state σ, this error has smaller or greater significance according to what the actual state of the automaton was, or whether the automaton was in the state σ at a not too distant time, etc.
So the solution of an applied problem must begin with a careful definition of the penalties d(I′, I″), which express how grave the situation is when the segmentation I″ is assumed instead of the actual segmentation I′. I can guess that even the mere calculation of the penalty for a given pair of segmentations I′, I″ will be devilishly difficult. The mathematical expectation of such reasonably defined penalties, i.e., the risk, must then be thoroughly analysed. Even this may require considerable effort. When that is done, the optimisation problem that seeks the segmentation minimising the risk can be solved. An algorithm obtained in this way could be considered quite a good achievement.
At last we see you as we are used to seeing you. For a while it seemed to us that somebody else was discussing in your place. You may, sometime, manage to master even the tasks you now see.
I doubt it. Not because I would underestimate myself. There is a more serious reason, which lies in my respect for the community engaged in pattern recognition (especially in image segmentation) who do not go into these highly interesting tasks. This may not be just by chance.

At the core of all the difficult tasks you quoted in the lectures, as well as of those we came across together in our discussions, there is the irrefutable fact that the most probable value of a random variable is not identical with its actual value. Should a certain feature of a random object be estimated, not only the most probable object is to be taken into consideration but also other objects having smaller probabilities. Even though you have been repeating this idea since the first lecture, I have actually understood it only now.
Why, in the great majority of research work and application tasks, is nothing else done than seeking the most probable sequences of hidden parameters? The sequences found are then manipulated as if they were the real ones. If this starting point were correct, all the tasks we have analysed, including the difficult segmentation task, would reduce to a single task seeking the most probable sequence. But such a procedure is erroneous. Yet when nearly everybody does so, there must be a more serious explanation than merely stating that it is a wrong procedure. Lacking a proper explanation, I do not dare to leave the smooth path along which everybody walks and take up other paths along which nobody has walked so far. You may think me too conservative, but conservatism is not always the worst virtue. In the present case it is respect for the well-known ways in pattern recognition and a fear of destroying what has already been achieved.
Your respect for the established views impresses us, and therefore we thought about your question for quite a long time, but we have not arrived at anything convincing. But we recalled an old joke from the Ukrainian city of Odessa. In addition to its many beautiful sights, this city is known for the famous brilliance of spirit of its inhabitants. Among the many stories coming from Odessa you could find this one.
A person had a pair of trousers made at a tailor's. The tailor finished the ordered work only after a week. This seemed too long to the customer, and he reproached the tailor, saying that one week had been enough for Almighty God to create the whole world. The tailor could only defend himself by saying: 'But look at the world, and look at these trousers.'
Well, now look at the lot of algorithms for segmentation and ...
Thank you. I would also like to discuss with you the second part of the lecture, which was devoted to learning. I am fascinated by the level of universality and abstraction at which the tasks were examined, and by how the relations between extensive classes of tasks were revealed even without creating algorithms for solving them. My attention was attracted by the relation between the minimax and the maximum likelihood estimates of the statistical model. Later, I was captivated by the quite unexpected relation between tuning the algorithm recognising Markovian sequences, on the one hand, and the perceptron or Kozinec algorithms on the other.

I do not even mention the relation between supervised and unsupervised learning, which I had already understood in Lecture 6. Now I have again made sure of its fruitfulness when I saw how the task of estimating the Markovian object in unsupervised learning can be reduced from astonishing complexity to a task of supervised learning whose complexity is not worth mentioning. When I see how extensive classes of tasks start cooperating and fusing into one river, I start imagining that, at last, we hold a pneumatic drill in our hands to cope with the rock representing pattern recognition. Certainly, it is not dynamite yet, but it is no longer the nail file with which we jabbed at the rock before.
Yes, you are right. It seems to me that you have thoroughly discussed a certain aspect of building up the statistical model of an object, but from my point of view it is not the most important one. The question of how to find the structure of a complex object has remained aside. It seems to me that this is the most significant question in structural pattern recognition. Let me explain what I mean. When we know that an object is composed of parts and we want to apply a recognition method explained in the lecture, we have to order the known set of parts so that it becomes a sequence. Such an ordering is sometimes not known beforehand.
The ordering may be unknown even in the case when the sequence k_0, k_1, k_2, …, k_i, … is a process which develops in time. The index i then represents time. For example, let k_i denote the behaviour of a person on the i-th day. Only at a rough glance can one assume that the behaviour of a person today depends entirely on his behaviour of yesterday. Let us imagine that a person has two spheres of interest. On weekdays he is at his place of work, and on Saturdays and Sundays he is at his holiday home. In this case his behaviour at his holiday home on Saturday will be affected less by what he did at his work on Friday than by what he had done on the previous Sunday. In
We are afraid that the situation is quite hopeless. We will not even ask you how you formulated the task by which you arrived at the travelling salesman problem. We have tried ourselves to solve the task many times, but not a single time did we manage to avoid either the travelling salesman problem or the task of seeking a Hamiltonian cycle in a graph, which is, from the point of view of its solution, also hopeless.
Forgive me, I do not know what a Hamiltonian cycle is. Could you explain it to me?
The task is usually formulated by a simple example. Assume that you are inviting a set I of people to a banquet. You know about each pair of guests whether they are acquainted with one another. This knowledge can be expressed by an unoriented graph whose vertices are formed by the set I. Two vertices in the graph are connected by an edge if and only if they represent a pair of guests who are acquainted with each other. You are to seat the guests around a round table so that each guest is acquainted with his or her left-hand as well as right-hand neighbour. In terms of graphs, you are to find out whether the graph contains a cycle which passes through each vertex exactly once, and along each edge at most once. Graphs containing such a cycle are called Hamiltonian graphs.
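For a small banquet the question can be settled by brute force, simply trying all seatings; the sketch below (with illustrative names) does exactly that, and its factorial running time is the whole point of the example.

```python
from itertools import permutations

def has_hamiltonian_cycle(vertices, edges):
    """Brute-force check whether the acquaintance graph admits a round-table
    seating, i.e., a Hamiltonian cycle.  Feasible for a handful of guests,
    hopeless in general: no polynomial-time algorithm is known."""
    edge_set = {frozenset(e) for e in edges}
    vertices = list(vertices)
    if len(vertices) == 1:
        return True
    first, rest = vertices[0], vertices[1:]
    # fix one guest and try every arrangement of the others around the table
    for order in permutations(rest):
        cycle = [first, *order, first]
        if all(frozenset(p) in edge_set for p in zip(cycle, cycle[1:])):
            return True
    return False
```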
It is even worse. So far, nobody has solved the task in polynomial time, but nobody has proved that it is not solvable in polynomial time either. But we frankly advise you: do not try to solve it. It is an abyss similar to Fermat's last theorem.
With the important difference that Fermat's last theorem has already been
solved.
We did not even know that. Well, let us wait about three hundred years until
the situation with NP-complete problems is cleared up.
And what connects the travelling salesman problem with Hamiltonian cycles?
But I have come across a quite different task. My task reduces not to seeking a cycle but to seeking a chain which passes through all the vertices of the graph. Maybe in this case the task is not hopeless from the point of view of complexity.
I regret it extremely. The reduction of the original task concerning the structure of a complex object to the travelling salesman problem was not simple. The bubble I had blown has at last burst.
You would regret it even more if you learned that you had been within reach of a very beautiful, and solvable, task concerning an estimate of the structure of a complex object. From the beginning you have been tied to the idea that the structure sought must be a chain. Recall that in the lecture a whole section was purposely devoted to a more general structure …

I have got it now! All the results of the lecture remain useful even in the case in which the object sought has the structure of an acyclic graph. Now I should generalise the tasks even further, so that not only the numerical parameters of the statistical model of the object but also its structure, i.e., the mutual relations of its parts, may be estimated with respect to the training multi-set. If I do not require that the structure be a chain then I may probably arrive at a solvable task.
Yes, that is right. Now, do not hurry, because you have come across a task
whose solution was made possible thanks to a nearly hundred years' effort.
In 1968 the American Chow [Chow, 1965; Chow and Liu, 1968] formulated
a task which we now understand as an estimate of the structure of mutual
dependences between parts of a complex object. He also demonstrated that the
8.12 Discussion 385
Let I be a finite set of indices by which the individual parts of the complex
object examined are indexed. The index i ∈ I is understood as the number of a
particular part. Each part is described by two parameters: k_i is the hidden
parameter and x_i is the observable parameter of the i-th part of the complex
object. The ensemble k = (k_i, i ∈ I) is the unobservable state of the object
and the ensemble x = (x_i, i ∈ I) is the result of its observation. As before, we
assume that the parameters k_i, i ∈ I, take their values from the finite set K, and
the parameters x_i, i ∈ I, from the finite set X.

The set I is understood as a set of vertices on which an unoriented acyclic
connected graph is created. The set of the graph edges will be denoted G. The
notation (i, j) ∈ G means that the vertices i and j, i ∈ I, j ∈ I, are connected by
an edge from G. We assume that for a fixed parameter k_i, corresponding to
the i-th part, the observable parameter x_i does not depend on any other parameter
of the object. This means that x_i is conditionally independent (at a fixed value
k_i) of any other parameter of the object, so that p(x | k) = ∏_{i∈I} p_i(x_i | k_i). The
probability distribution p(k) is assumed to be Markovian with respect to the
graph whose edges are formed by the set G,

    p(k) = ∏_{(i,j)∈G} g_ij(k_i, k_j) / ∏_{i∈I} (g_i(k_i))^{h_i − 1} ,    (8.146)

where h_i denotes the number of edges that pass through the vertex i. The joint
probability p(x, k) is

    p(x, k) = ∏_{(i,j)∈G} g_ij(k_i, k_j) ∏_{i∈I} p_i(x_i, k_i) / ∏_{i∈I} (g_i(k_i))^{h_i} .    (8.147)
In the formulae (8.146) and (8.147) the value g_ij(k, k'), k ∈ K, k' ∈ K, means
the joint probability that the i-th hidden parameter will assume the value k
and the j-th parameter the value k'. The probability p_i(x, k), x ∈ X, k ∈ K,
represents the joint probability of the value x of the i-th observable parameter
and the value k of the i-th hidden parameter. And finally, g_i(k), k ∈ K,
is the probability that the i-th hidden parameter will assume the value k. This
means that

    g_i(k) = Σ_{k'∈K} g_ij(k, k') .
    α(x, k) = n(x, k) / l ,

where n(x, k) is the number of occurrences of the pair (x, k) in the training
multi-set of length l.
The task to be solved requires me to find a set G of graph edges forming
an acyclic structure, and the ensembles of functions (g_ij, (i, j) ∈ G), (p_i, i ∈ I),
which maximise the probability of the results of the experiment given by the
multi-set ((x^j, k^j), j = 1, 2, ..., l), i.e.,

    G* = argmax_G max_{(p_i, i∈I)} max_{(g_ij, (i,j)∈G)} Σ_{x,k} α(x, k) log p(x, k)

       = argmax_G max_{(p_i, i∈I)} max_{(g_ij, (i,j)∈G)} Σ_{x,k} α(x, k) log [ ∏_{(i,j)∈G} g_ij(k_i, k_j) ∏_{i∈I} p_i(x_i, k_i) / ∏_{i∈I} (g_i(k_i))^{h_i} ] .    (8.148)
    Σ_{x,k} α(x, k) log p(x, k)
      = Σ_{(i,j)∈G} Σ_{k_i∈K} Σ_{k_j∈K} α_ij(k_i, k_j) log g_ij(k_i, k_j)
      + Σ_{i∈I} Σ_{x_i∈X} Σ_{k_i∈K} β_i(x_i, k_i) log p_i(x_i, k_i)
      − Σ_{i∈I} h_i Σ_{k_i∈K} α_i(k_i) log g_i(k_i) ,    (8.149)

where

    α_ij(k_i, k_j) = Σ_{(x_{i'}, i'∈I)} Σ_{(k_{i'}, i'∈I\{i,j})} α(x, k) ,

    β_i(x_i, k_i) = Σ_{(x_{i'}, i'∈I\{i})} Σ_{(k_{i'}, i'∈I\{i})} α(x, k) ,

    α_i(k_i) = Σ_{(x_{i'}, i'∈I)} Σ_{(k_{i'}, i'∈I\{i})} α(x, k) .
The last three formulae are presented only to demonstrate the rightfulness of
the last step in deriving (8.149). In fact, the numbers α_ij, β_i and α_i are
not calculated according to the formulae quoted, but in a far simpler way.
The number α_ij(k, k') is merely the relative frequency of the situation
occurring in the experiment in which the i-th hidden parameter assumed the value
k and the j-th one the value k'. A similar meaning is assigned also to the numbers
α_i and β_i.

On the basis of considerations similar to those in Theorem 8.1 from the
lecture, it can be proved that the numbers g_ij(k_i, k_j), p_i(x_i, k_i) and g_i(k_i), which
maximise (8.149), are to be equal to the corresponding numbers α_ij(k_i, k_j),
β_i(x_i, k_i) and α_i(k_i). This means that the expression (8.148) assumes the form

    G* = argmax_G ( Σ_{(i,j)∈G} Σ_{k_i∈K} Σ_{k_j∈K} α_ij(k_i, k_j) log α_ij(k_i, k_j)
                  + Σ_{i∈I} Σ_{x_i∈X} Σ_{k_i∈K} β_i(x_i, k_i) log β_i(x_i, k_i)
                  − Σ_{i∈I} h_i Σ_{k_i∈K} α_i(k_i) log α_i(k_i) ) .    (8.150)
In the preceding expression the first summand depends on the set G, since the
summation runs over those pairs (i, j) which belong to G. So does the
third summand, since the numbers h_i depend on the set G. The second
summand does not depend on the set G, and thus the expression (8.150) can
be rewritten as

    G* = argmax_G ( Σ_{(i,j)∈G} Σ_{k_i∈K} Σ_{k_j∈K} α_ij(k_i, k_j) log α_ij(k_i, k_j)
                  − Σ_{i∈I} h_i Σ_{k_i∈K} α_i(k_i) log α_i(k_i) ) .

If I denote

    H_ij = − Σ_{k_i∈K} Σ_{k_j∈K} α_ij(k_i, k_j) log α_ij(k_i, k_j)    (8.151)

and

    H_i = − Σ_{k_i∈K} α_i(k_i) log α_i(k_i) ,    (8.152)

then I obtain

    G* = argmax_G ( Σ_{i∈I} h_i H_i − Σ_{(i,j)∈G} H_ij ) .
At last I have arrived at the following procedure by which the set G is created,
i.e., the structure of mutual dependences between the parts of a complex object.

1. With respect to the training multi-set (k^j, j = 1, ..., l) (the data x^j are not
   used at all) the numbers α_ij(k_i, k_j) and α_i(k_i), i ∈ I, j ∈ I, i ≠ j, k_i ∈ K,
   k_j ∈ K, are calculated.
2. For each pair (i, j), i ∈ I, j ∈ I, i ≠ j, the entropy H_ij is calculated
   according to the formula (8.151), and for each i ∈ I the entropy H_i is
   calculated according to the formula (8.152).
3. The set I is understood as a set of graph vertices on which a complete
   graph is created. In this graph every vertex is connected to every other vertex
   by an edge. The length of the edge connecting the vertices i and j is given as
   H_i + H_j − H_ij.
4. In the graph obtained, a connected acyclic subgraph (i.e., a tree) is to be
   found which contains all the vertices of the original graph and which
   maximises the sum of the lengths of the edges that belong to the subgraph.
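The four steps above can be sketched in code. The sketch below is ours, not the authors': the function name `chow_liu_structure` and the data layout (a list of equal-length tuples of hidden states, one tuple per element of the training multi-set) are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log

def chow_liu_structure(samples):
    """Steps 1-4 above.  `samples` is a list of equal-length tuples of
    hidden states; position i plays the role of the vertex i in I.
    Returns the set of edges of the tree G*."""
    n = len(samples)
    m = len(samples[0])

    def H(counter):
        # Empirical entropy: the ratios c/n are the relative
        # frequencies alpha of step 1.
        return -sum(c / n * log(c / n) for c in counter.values())

    # Step 2: the entropies H_i and H_ij.
    Hi = [H(Counter(s[i] for s in samples)) for i in range(m)]
    Hij = {(i, j): H(Counter((s[i], s[j]) for s in samples))
           for i, j in combinations(range(m), 2)}

    # Step 3: complete graph; the edge length H_i + H_j - H_ij is the
    # empirical mutual information of the vertices i and j.
    edges = sorted(Hij, key=lambda e: Hi[e[0]] + Hi[e[1]] - Hij[e],
                   reverse=True)

    # Step 4: maximum spanning tree, greedily with union-find.
    parent = list(range(m))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = set()
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # no path between i and j yet
            parent[ri] = rj
            tree.add((i, j))
            if len(tree) == m - 1:
                break
    return tree
```

On data where part 1 simply copies part 0 while part 2 varies independently, the procedure links 0 and 1 by an edge, since their mutual information is the largest.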
And now I would like to know your opinion on whether this is the Cayley task or
not.
Yes, it is.
Could you, please, explain to me the Boruvka algorithm for solving the Cayley
task?
It is so simple that we can only admire its cleverness, and we are astonished that
people were not able to hit upon that simple solution for such a long time.
In addition, we are surprised at how it may have happened that even after
Boruvka's article it remained unknown for quite a long time that the Cayley task
had been solved.
Let M be a set of graph edges, i.e., pairs (i, j) of the form i ∈ I, j ∈ I, i ≠ j.
Let us order the set M according to the edge lengths: if the edge (i', j') occurs
in the array M prior to the edge (i'', j''), then the length of the edge (i', j') is
greater than or equal to the length of the edge (i'', j''). We seek a subset of
edges which forms a connected subgraph G* containing all the vertices I and
maximises the sum of the edge lengths. The subgraph G* is formed in the
following way. One edge after another is examined in the order given by the
array M, and for each current edge a decision is made whether it belongs to
the subgraph G* or not.

Let the edge (i', j') from the array M be examined at a certain instant, and
let G' be the subset of edges constructed up to the preceding instant.
These are the edges about which the decision has already been made that they
belong to G*. If there exists a path in the subset G' from the vertex i' to
the vertex j', then the decision is made that the edge does not belong to G*,
i.e., the subset G' does not change. If the subset G' contains no path from
the vertex i' to the vertex j', then the decision is made that the edge (i', j')
belongs to the set G*, and so the set G' is changed into the set G' ∪ {(i', j')}.
The array M is examined as long as there are fewer than |I| − 1 edges in the
subset G'.
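The edge-scanning procedure just described can be sketched as follows. The helper names and the representation of the edge lengths as a dictionary are our assumptions; the test whether G' already contains a path between the endpoints is done literally, by breadth-first search inside G' only, as in the text.

```python
from collections import defaultdict, deque

def maximum_spanning_tree(vertices, lengths):
    """Examine the edges in order of decreasing length and accept an
    edge unless the already accepted edges G' contain a path between
    its endpoints.  `lengths` maps an edge (i, j) to its length."""
    M = sorted(lengths, key=lengths.get, reverse=True)
    adj = defaultdict(set)          # adjacency of the growing subgraph G'
    G = set()

    def path_exists(s, t):
        # Breadth-first search restricted to the accepted edges.
        seen, queue = {s}, deque([s])
        while queue:
            v = queue.popleft()
            if v == t:
                return True
            for w in adj[v] - seen:
                seen.add(w)
                queue.append(w)
        return False

    for i, j in M:
        if not path_exists(i, j):   # accepting (i, j) creates no cycle
            G.add((i, j))
            adj[i].add(j)
            adj[j].add(i)
        if len(G) == len(vertices) - 1:
            break                   # a spanning tree has |I| - 1 edges
    return G
```

With a union-find structure in place of the breadth-first search, the same scan runs in nearly linear time in the number of edges, but the version above follows the text verbatim.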
The simplicity of the algorithm is incredible. Could you not, please, explain
the most important ideas of Boruvka's proof to me?
We could, but we do not want to. You should master the proof on your own.
Of course, it does not mean that we can equal Boruvka. Imagine it so: we are
dwarfs who have climbed onto a giant's shoulders. This statement was neither
invented by us nor do we remember who said it, but he must have been a clever
man.
For simplicity, I dealt with the task only for the case in which the lengths of
the graph edges differ from each other. The proof of the Boruvka algorithm is
based on only two rather obvious statements.

Assertion 8.1 Let (i_0, j_0) be the edge of greatest length and G* be the sought set
of edges which is the solution of the Cayley task. Then (i_0, j_0) ∈ G* holds. •
Proof. I will prove the assertion by contradiction, i.e., I will show that if
(i_0, j_0) ∉ G* then the set G* is not a solution of the Cayley task. Since the set
of edges G* forms a connected graph, there exists a path from the vertex i_0 to the
vertex j_0. Each edge within this path is shorter than the edge (i_0, j_0). Let
(i', j') be an arbitrary edge within this path. The path from i_0 to j_0 together
with the edge (i_0, j_0) creates a cycle. If I remove the edge (i', j') from the
set G* and include the edge (i_0, j_0) in it, then the new graph remains
connected and, as before, contains all the vertices. But the total length
of its edges increases, since the length of the edge (i', j') is less than the
length of the edge (i_0, j_0). From that it follows that the set G* was not the
solution of the Cayley task. •
Assertion 8.2 Let G* be a set of edges that solves the Cayley task and G' be its
subset. Let (i_0, j_0) be an edge that satisfies two conditions:

1. There is no path in the set G' from the vertex i_0 to the vertex j_0.
2. Among all edges that satisfy the first condition, the edge (i_0, j_0) has the
   greatest length.

In this case (i_0, j_0) ∈ G* holds. •
But still I have not yet made clear for myself a question that is important for
me. I am not sure whether I have understood the importance of the matrix notation
which is the main thread of the whole first part of the lecture. Judging by
how you stress the application of the matrix notation, I am afraid that you
see something in it that I have not noticed. On account of the lecture I have not
been convinced that the matrix notation is indispensable for solving the tasks.
I understand, and am capable of solving, all the tasks analysed in the lecture
without knowing anything about the matrix notation. Am I right that it is
only a matter of concisely expressing something that is clear anyhow? And
is it so essential what language or what mathematical symbolism is used for
expressing knowledge? It is the knowledge itself that is important, i.e., that
invariant something which does not depend on the form in which it is written.
What do I need something new for when I understand all I need even without it?
knowledge may be more important than acquiring new knowledge. Only when
the old knowledge is expressed in a new terminology and people get used to
the new terms do they begin wondering why such simple things had seemed so
complicated before.
Your negative attitude to matrix notation reminds us of another circum-
stance which necessarily accompanies the creation of new tools for interpreting
old knowledge. The mere existence of the language might affect the range of
knowledge which becomes generally known. Before new language tools are cre-
ated the greatest popularity is attained by that knowledge which had proved
successful in being expressed in the old language. All of that shows a certain
infirmity of new language tools, since what has been known best does not sound
very familiar in the new language. And, in addition, we have to take into con-
sideration the effort which is needed for mastering the new language. You have
nicely expressed this situation in your words 'why do I need anything new when
I understand everything I need even without that'.
We have also thought about how it could happen that only not very long
ago, perhaps about a thousand years ago, people were not able to add
and multiply arbitrary integer numbers. It was not because the overall level
of education would have been low and therefore the majority of population
would not have known how to multiply. The situation was more complex. No-
body could operate with arbitrary integer numbers. The capability of adding
some large numbers was regarded as an attribute of the highest intellect. The
tasks concerning the addition and multiplication of individual numbers or of
some number classes became respected scientific research. From the present
day point of view it can hardly be understood how it could happen that people
were capable of adding and multiplying some numbers but others not. Though,
in fact, the procedure through which these operations are performed is the same
for all numbers. The explanation that our ancestors were less intelligent than
we are would be wrong. Even though the human society is, as a whole, a bit
more educated than a thousand years ago, there is no evidence that every in-
dividual today would be more capable of mental activity than were his grand
or great grandparents. Let us recall the brilliant outcomes of European
ancient mathematics, from which such concepts as the prime number, the highest
common divisor, the lowest common denominator, etc. have their origin. All
this testifies to a deep understanding of the nature of the integer number.
And in spite of all that, people did not know how to add and multiply for long
after the time when the concept of the number had been understood rather
clearly.
It was because people did not represent numbers in the unified form to
which we are accustomed now and know as the Arabic notation of numbers. The
mode of notation differed from nation to nation, and even between different
classes of numbers. Of the earlier forms, Roman numerals are in use up to the
present day.
In such a mess and disorder in the number representation itself, hardly anybody
got the idea that there might be a universal way of manipulating numbers. Well,
all the numbers seemed to bear no resemblance to one another. Does not this
situation remind you of the present day state of affairs in pattern recognition?
It does, a little, but I wonder what all these historical considerations lead to,
all the more so that they are not quite correct. As early as many thousands of
years ago, in ancient Mesopotamia, people could manipulate integer numbers
quite correctly. It was far earlier than you say.
Do not be very strict this time. The matter is that you and we are not doing
historical research, but something quite different. It is important that only
not very long ago, about one thousand years ago, people could not manipulate
integer numbers in quite extensive territories (when once you are so strict to
us).
The mess concerning integer numbers prevailed even in the Central Asian
science centre, Samarkand, until about a thousand years ago Muhammad from
the neighbouring Khwarizm came to Samarkand and notified them that a new
way of number representation was used in Khwarizm. Thanks to it, the
capability of performing mathematical operations ceased to be regarded as an
exceptional gift of Nature for some intellectuals and became accessible to any
young boy of the street. Muhammad ibn Musa al-Khwarizmi explained the
way of representing numbers and calculating with them which has been used
in the world up to now. In addition to the facility to calculate, which af-
fected the development of science all over the world for centuries, there was
another outcome. The manner started by Muhammad al-Khwarizmi, by which
ingenious intellectual inventions can be replaced by disciplined executing of un-
ambiguously formulated regulations, was quite new. In honour of al-Khwarizmi
(perhaps also in honour of the country where he came from) the formulated
rules began to be called the Khwarizmian way, or the al-Khwarizmi method.
Owing to later distortions of the expression the word algorithm originated. You
can see in what way significant outcomes in the history of science can originate
even because the objects known before were newly expressed or given a new
name. And therefore we cannot agree with you that it is of no significance in
which formalism the new, as well as the old, knowledge is expressed.
And there is something else we would like to add. The personality of Muham-
mad al-Khwarizmi is so great that a mere ambition to make him one's example
could be regarded as an unforgivable immodesty. In spite of that, try to imag-
ine yourself in his place. If you put yourself in his situation in a quite realistic
way, and in all inevitable details then you will find that from the standpoint of
al-Khwarizmi, his situation appeared more than ugly. The poor al-Khwarizmi
must have listened to a pretty large amount of foolishness in his life.
Someone may have disliked the representation of numbers. For example,
the representation of the number 247 seemed to be far less illustrative than
CCXLVII. Well, even from the notation CCXLVII one can see that it is the sum
C + C + (L − X) + V + I + I, but the number 247 says nothing of the kind. To
find out what it means, it is necessary to calculate 2·10² + 4·10¹ + 7·10⁰. Well
now, our colleague al-Khwarizmi, instead of multiplying only when we need it,
we will now have to multiply every time we merely want to know what the
number means! To some other person, the new way of adding seemed to be far
more complicated than the previous one. The fact that the sum of the numbers
V and II is equal to VII is far more understandable than stating that the sum
of 5 and 2 is 7. The numeral 7 itself does in no way include the numerals 5
and 2. Yet another person in turn criticised the new way because it needed 10
numerals, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, whereas for the old way only VII numerals I,
V, X, L, C, D, M had been sufficient. It is true, though, that by means of Roman
numerals the numbers greater than, e.g., MMDCCCLXXXVIII are difficult to
express, but for a great majority of practical applications it is sufficient.
Furthermore, nobody can remember the rules of multiplying numbers. To
multiply, one has to know by heart about 100 rules, i.e., the multiplication
tables. When it is necessary to know 100 rules then any additional rules seem
to be useless since the products of numbers which occur in practice can be
easily calculated by common sense. If poor al-Khwarizmi asked in a shy
manner how much to pay for CCLXXIV rams when each costs XLIX ducats,
then everybody would wonder how it was that al-Khwarizmi was so silly and out
of touch with real life. Has anybody seen CCLXXIV rams for sale? In practice
there are always either CC or CCC, or at worst CCL rams. I wonder where
you saw such a silly price, XLIX ducats for one ram? In practice it is always
L ducats, and therefore these fine and rather complicated considerations are
unworkable. In practice I am to multiply CC times L, and without all that wisdom
of yours, my colleague al-Khwarizmi, I know that it will be C times C ducats.
At last, colleague al-Khwarizmi, come to see that we cannot reckon on being
able to retrain all the merchants in Samarkand to use the new way of writing
numbers only for the purpose of making their addition and multiplication easier
for you. The whole scientific community of ours is kept by virtue of taxes and
charity benefits from the merchants. They are here not for the sake of us, but
we are here for the sake of them ...
We could continue ad infinitum. We need not even invent silly stories like
that. Each of us can hear a lot of them around.
I admit that I was wrong saying that it was of no importance in which formalism
knowledge was expressed.
But, in a similar way as you, I would like to add something. The first part
of your answer sounded like a beautiful poem. But I understood the second
part of it as irony directed towards me. My attitude is, in fact, nearer to
those imaginary blockheads who did not understand al-Khwarizmi than to
his views. I am not very proud of it, but I do not feel like opposing it either.
Therefore I am saying again that all the tasks whose solution you expressed
by matrix products can be solved by me even without using them.
Could you quote an example of a task that is solved within the formalism
mentioned, but is difficult to solve without it?
We know one such task. You yourself may guess that it must be a pretty
difficult task. We are planning it for our next lecture.
April 1998
8.13 Link to a toolbox
If the automaton is in the state k_{i−1} at the (i−1)-th instant, i = 1, 2, ..., then a
random pair (x_i, k_i) is generated in agreement with the probability distribution

    p(x_i, k_i | k_{i−1}) = p(k_{i−1}, x_i, k_i) / Σ_{x∈X} Σ_{k∈K} p(k_{i−1}, x, k) ,

the symbol x_i appears at the output, and the automaton passes to the state
k_i. The function p characterises the stochastic automaton, and it defines the
probability distribution on the set of the output sequences of the automaton.
Among all possible sequences of the length n, the probability of the sequence
x_1, x_2, ..., x_n is given by the sum

    Σ_{k_0∈K} Σ_{k_1∈K} ··· Σ_{k_n∈K} p(k_0) ∏_{i=1}^{n} p(x_i, k_i | k_{i−1}) ,

where p(k_0) denotes the probability of the initial state k_0.
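Evaluated naively, the sum above runs over all |K|^{n+1} state sequences. The standard forward recursion computes the same quantity in O(|K|² n) time; the sketch below is ours, and the dictionary encodings of the initial distribution `p0` and of the transition probabilities `p` are assumptions.

```python
def sequence_probability(p0, p, x_seq, states):
    """Evaluates the sum over state sequences by the forward recursion.

    p0[k]           -- probability of the initial state k,
    p[(k1, x, k2)]  -- probability of emitting x and passing to the
                       state k2, given the current state k1."""
    forward = dict(p0)          # forward[k] = P(x_1 .. x_i, k_i = k)
    for x in x_seq:
        # One step of the recursion: marginalise over the previous state.
        forward = {k2: sum(forward[k1] * p.get((k1, x, k2), 0.0)
                           for k1 in states)
                   for k2 in states}
    return sum(forward.values())
```

A tiny deterministic automaton (state 0 emits a and moves to 1, state 1 emits b and moves back) assigns probability 1 to the sequence ab and probability 0 to aa.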
At the same time, the automaton described defines a rougher characteristic
of the set of sequences which may occur at its output, i.e., the set of
sequences with non-zero probability. If we are interested only in whether the
sequence x_1, x_2, ..., x_n may occur at the automaton output, and the probability
of that sequence is of no interest, then such a detailed characteristic as that
described by the function p: K × X × K → ℝ is not needed. It is sufficient to
know whether the probability p(k', x, k'') of the triplet k' ∈ K, x ∈ X, k'' ∈ K is zero.
The automaton can be described in less detail by a function P of the form
K × X × K → {0, 1}, the value P(k', x, k'') of which is 0 if p(k', x, k'') = 0, and
1 if p(k', x, k'') ≠ 0. Thus a subset of sequences is determined by a simple
binary function of three variables. Naturally, not every sequence subset can be
defined in this way; only subsets of a certain form, known as regular
languages, can. We will introduce this concept more precisely.
Let X be a finite set which will be called an alphabet. Its elements are
symbols. A finite sequence of symbols from the alphabet X is referred to as
a sentence in the alphabet X. The set of all possible sentences in the alphabet X
will be denoted X*. A subset of sentences L ⊂ X* is called a language in
the alphabet X.
Let K be a finite set which will be called the alphabet of the automaton
states. Let the function φ: K → {0, 1} define the subset of states which are
regarded as the initial states of the automaton. If φ(k) = 1 then the state k
is one of the initial states (it can be the single one). Let the function
ψ: K → {0, 1} define a set of target states in a similar way. Let the function
P: K × X × K → {0, 1} be the state transition function of the automaton, which
has the following sense: if the automaton was in the state k_{i−1} at the instant
(i − 1), then at the succeeding i-th instant it can produce only such a symbol
x_i at its output, and pass only to such a state k_i, for which P(k_{i−1}, x_i, k_i) = 1
holds.
The automaton is given by the five-tuplet A = (X, K, φ, P, ψ). This five-tuplet
unambiguously determines the set of sequences which may occur at
the automaton output. This sequence set will be called the language of the
automaton A and denoted L(A). A sentence x_1, x_2, ..., x_n belongs to the
language L((X, K, φ, P, ψ)) if

1. x_i ∈ X, i = 1, 2, ..., n;
2. a sequence k_0, k_1, ..., k_n exists for which the following holds:
   (a) k_i ∈ K, i = 0, 1, ..., n;
   (b) φ(k_0) = 1;
   (c) P(k_{i−1}, x_i, k_i) = 1, i = 1, 2, ..., n;
   (d) ψ(k_n) = 1.

The following equation defines the membership of a sequence in the language
in a brief form,

    F(x_1, x_2, ..., x_n) = ⋁_{k_0∈K} ⋁_{k_1∈K} ··· ⋁_{k_n∈K} ( φ(k_0) ∧ ⋀_{i=1}^{n} P(k_{i−1}, x_i, k_i) ∧ ψ(k_n) ) ,    (9.1)

where F(x_1, x_2, ..., x_n) is the statement 'the sequence x = (x_1, x_2, ..., x_n)
belongs to the language L((X, K, φ, P, ψ))'.

The definition (9.1) can be written in a briefer way if we accept the following
interpretation of the functions φ, P and ψ. The function φ will be understood
as a row vector of dimension |K| the k-th component of which is φ(k), k ∈ K.
The function ψ is a column vector of dimension |K| the k-th component of which
is ψ(k). For a given sequence x_1, x_2, ..., x_n, the matrices P_i, i = 1, 2, ..., n, will
be introduced; the element in the k'-th row and k''-th column of P_i is P(k', x_i, k'').
Using this notation we can express the definition (9.1) as the matrix product

    F(x_1, x_2, ..., x_n) = φ P_1 P_2 ··· P_n ψ ,    (9.2)

where multiplication and addition of elements are understood as the logical
operations ∧ and ∨.
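A minimal sketch of evaluating the product (9.2) over the Boolean semiring, where ∨ plays the role of addition and ∧ of multiplication. The function name and the list-of-lists matrix encoding are our assumptions.

```python
def accepts(phi, matrices, psi):
    """Evaluates phi . P_1 . P_2 ... P_n . psi over the Boolean
    semiring.  phi is a row vector, psi a column vector, and each P_i
    a |K| x |K| matrix with 0/1 entries."""
    row = phi
    for P in matrices:
        # One vector-matrix multiplication: 'or' of 'and's.
        row = [any(row[k1] and P[k1][k2] for k1 in range(len(row)))
               for k2 in range(len(P[0]))]
    return any(r and s for r, s in zip(row, psi))
```

For the two-state automaton with P(0, a, 0) = P(0, b, 1) = P(1, b, 1) = 1, initial state 0 and target state 1 (i.e., the language a*bb*), the product accepts aab and rejects ba.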
has now been expressed by means of a regular grammar given by the quadruplet
(X, K, K_0, R), where

X is a terminal alphabet,
K is a non-terminal alphabet,
K_0 is a set of axioms,
R is a set of rules.

The regular grammar determines the language, i.e., the subset L ⊂ X*, in the
following manner.

1. A sentence consisting of one single symbol which corresponds to one of
   the axioms is considered to be proved in the given grammar.
2. If some sentence xk', x ∈ X*, k' ∈ K, is proved in the grammar and the set
   of rules contains the rule k' → x'k'', then the sentence xx'k'' is considered
   to be proved.
3. If a sentence xk', x ∈ X*, k' ∈ K, is proved in the grammar and the set of
   rules contains the rule k' → x', then the sentence xx' belongs to the language
   defined by the given grammar.
Once we have obtained the grammar (X, K, K_0, R) from the automaton (X,
K, φ, P, ψ) in the above manner, the language defined by the grammar is
just the set of sentences that can occur at the output of the automaton. We will
not prove this assertion because it is almost obvious. The converse transition
is possible too, i.e., from a regular grammar to an automaton which generates
all sentences belonging to the language of the given grammar, and only them.
The autonomous finite automaton and regular grammars are two equivalent
means for defining sets of a certain kind, namely regular languages.
Figure 9.2 Deterministic automaton recognising the language from the example.
graph by means of edges and symbols assigned to the edges. If an edge starts
from the vertex k' and points toward the vertex k'', and a symbol x labels the
edge, then it means that P(k', x, k'') = 1. For all other triplets the function P
assumes the value 0. This means that the function P assumes the value 1 for
the following triplets:

(0, a, 1), (0, b, 1), (0, c, 1),   (3, a, 3), (3, b, 3), (3, c, 3),
(1, a, 1), (1, b, 1), (1, c, 1),   (3, +, 2), (3, ×, 2),
(1, =, 2),                         (2, a, 4), (2, b, 4), (2, c, 4),
(2, a, 3), (2, b, 3), (2, c, 3),   (4, a, 4), (4, b, 4), (4, c, 4).
The table and graph can be understood as a grammar of the following form. Its
terminal alphabet is {a, b, c, =, +, ×}, the non-terminal alphabet is {0, 1, 2, 3, 4},
the axiom is 0, and the set of rules is as follows:

0 → a1, 0 → b1, 0 → c1,   3 → +2, 3 → ×2,
1 → a1, 1 → b1, 1 → c1,   2 → a4, 2 → b4, 2 → c4,
1 → =2,                   4 → a4, 4 → b4, 4 → c4,
2 → a3, 2 → b3, 2 → c3,   4 → a, 4 → b, 4 → c,
3 → a3, 3 → b3, 3 → c3.
And finally, the automaton which recognises sentences of the given language
is shown by the graph in Fig. 9.2. The input symbol alphabet is {a, b, c, =, +, ×}.
The set of states K is {0, 1, 2, 3, 4, 5}. The initial state is 0 and the
target state is 3. The automaton recognises whether a sentence is correctly
formed. Each correct sentence takes the automaton to the state 3; each incorrect
sentence takes it to some other state. This is achieved by a proper choice of the
transition function q which controls the transitions of the automaton from state
to state according to the input symbol. The transitions between the states
are marked by arrows in Fig. 9.2. If an arrow starts from the vertex k'
and points toward the vertex k'', and the symbol x is attached to it, then it
expresses the transition q(k', x) = k''.
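Since Fig. 9.2 itself is not reproduced here, the transition table below is only a reconstruction consistent with the description and with the grammar above: state 3 is the target state, one state serves as a rejecting sink, and the symbol × is written as * in code.

```python
LETTERS = "abc"

def q(state, x):
    """Assumed transition function of the deterministic automaton:
    a letter string, then '=', then letter strings joined by + or *.
    State 5 is the rejecting sink; state 3 is the only target state."""
    if state == 0:
        return 1 if x in LETTERS else 5
    if state == 1:
        if x in LETTERS:
            return 1
        return 2 if x == "=" else 5
    if state == 2:
        return 3 if x in LETTERS else 5
    if state == 3:
        if x in LETTERS:
            return 3
        return 2 if x in "+*" else 5
    return 5                        # once wrong, stay wrong

def recognise(sentence):
    state = 0
    for x in sentence:
        state = q(state, x)
    return state == 3               # each correct sentence ends in state 3
```

For instance, ab=c+ba traverses the states 0, 1, 1, 2, 3, 2, 3, 3 and is accepted, while ab= stops in state 2 and is rejected.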
404 Lecture 9: Regular languages and corresponding pattern recognition tasks
Such tasks are called best matching problems. The tasks checking the validity
of the relation x ∈ L are called exact matching problems.

The following subsection is devoted to pattern recognition tasks in which
the fundamental concept is not the language as a subset of sequences, but the
language as a function defined on a set of sequences.
The fuzzy automaton (X, K, φ, P, ψ) defines a fuzzy language as a function
X* → ℝ which is determined by the left-hand side of Equation (9.3), with
the operation ∨ understood as the union of fuzzy sets and the operation ∧
as the intersection of fuzzy sets. Thus the language of the fuzzy automaton
(X, K, φ, P, ψ) is defined as a fuzzy subset defined by the function

    φ P_1 P_2 ··· P_n ψ ,    (9.5)

where φ is a row vector, P_i, i = 1, ..., n, are matrices, and ψ is a column vector.
The above quantities are constructed in a similar way to those in the analysis
of the earlier problems, e.g., in writing Equation (9.2). Matrix multiplication in
the formula (9.5) is to be performed in the relevant semi-ring, i.e., with the
operation max as addition and the operation min as multiplication. In calculating
the product in (9.5), keep in mind that matrix multiplication is not commutative.

We can see that fuzzification of the concepts of the automaton and the
regular language has kept the pattern recognition task on a trivial level. The
expression (9.5) formulates the task as calculating to what extent the sentence
x_1, x_2, ..., x_n belongs to the given fuzzy language. At the same time, an
algorithm for this calculation is defined by the same expression (9.5). The
complexity of this calculation is evidently O(|K|² n).
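Formula (9.5) can be evaluated directly; the sketch below uses the same list encoding as before and is our assumption, with max in the role of addition and min in the role of multiplication.

```python
def fuzzy_membership(phi, matrices, psi):
    """Evaluates (9.5) in the (max, min) semi-ring.  The entries of
    phi, psi and the matrices P_i are degrees of membership in [0, 1].
    Complexity is O(|K|^2 n), as stated in the text."""
    row = phi
    for P in matrices:
        # One (max, min) vector-matrix multiplication.
        row = [max(min(row[k1], P[k1][k2]) for k1 in range(len(row)))
               for k2 in range(len(P[0]))]
    return max(min(r, s) for r, s in zip(row, psi))
```

With 0/1 vectors and matrices the same code reduces to the Boolean product (9.2), which illustrates why the fuzzified task stays on the same trivial level.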
and comparing it with the number c. The calculation (9.6) can be expressed
as a matrix product. We can see that even in this formulation the pattern
recognition task is trivial, because its formulation directly results in the
algorithm of its calculation. The complexity of the algorithm remains O(|K|² n).
In a special case in which the variables x_i assume only two values, and with
a relevant choice of the function d, the given sum is the Hamming distance
known in the theory of coding. Let L ⊂ X* be a regular language (neither
fuzzy nor penalised) which corresponds to the automaton (X, K, φ, P, ψ). In
the simple best matching problem the given sequence x_1, x_2, ..., x_n is to be
substituted by the sequence y_1, y_2, ..., y_n from the language L so as to obtain the
minimal distance Σ_{i=1}^{n} d(y_i, x_i). This means that it is necessary to solve the
minimisation task

    min_{y_1} min_{y_2} ··· min_{y_n} Σ_{i=1}^{n} d(y_i, x_i)

under the conditions

    φ(k_0) = 1,    (9.7)
    P(k_{i−1}, y_i, k_i) = 1, i = 1, ..., n,
    ψ(k_n) = 1,

with the known sequence x_1, x_2, ..., x_n and given functions d, φ, P and ψ.
This problem can be transformed on the basis of the following considerations.
Let us introduce new functions φ', P' and ψ' which are defined as follows:

    φ'(k) = 0 if φ(k) = 1, and φ'(k) = ∞ if φ(k) = 0;
    ψ'(k) = 0 if ψ(k) = 1, and ψ'(k) = ∞ if ψ(k) = 0;
    P'(k', y, k'') = 0 if P(k', y, k'') = 1, and P'(k', y, k'') = ∞ if P(k', y, k'') = 0.

Using the new functions φ', P' and ψ' we can rewrite the task (9.7) as

    min_{k_0} min_{k_1} ··· min_{k_n} ( φ'(k_0) + Σ_{i=1}^{n} min_{y_i} ( P'(k_{i−1}, y_i, k_i) + d(y_i, x_i) ) + ψ'(k_n) ) .

If the notation

    P''_i(k', k'') = min_{y∈X} ( P'(k', y, k'') + d(y, x_i) )    (9.9)

is introduced, then the preceding expression is of the same form as the expression
(9.6) which had to be calculated in the recognition task based on the penalised
language. It follows that even this simple best matching problem can be reduced
to a calculation of the matrix product

    φ' P''_1 P''_2 ··· P''_n ψ' ,    (9.10)

with min playing the role of addition and + the role of multiplication.
In the previous tasks the relevant matrices and vectors were taken directly
from the input data. The examined best matching problem differs from the
preceding tasks in that the matrices P''_i are created by solving several, not very
difficult, optimisation tasks (9.9). The computational complexity of calculating
the matrices P''_i is O(|K|² |X| n). After the matrices have been computed, the
product (9.10) is to be calculated, the computational complexity of which is
O(|K|² n). Notice that the expression (9.9) need not be computed anew once
the sequence x_1, x_2, ..., x_n becomes known. The numbers P''(k', x, k'') can be
computed in advance for all k' ∈ K, x ∈ X, k'' ∈ K according to the formula

    P''(k', x, k'') = min_{y∈X} ( P'(k', y, k'') + d(y, x) ) .    (9.11)

In this way the analysed best matching problem (9.7) is directly reduced to
a task of recognising a language defined by the penalised automaton (X, K,
φ', P'', ψ'). Then for each sequence x_1, x_2, ..., x_n the preliminary calculations of
(9.9) need not be performed, since they are substituted by other preliminary
calculations of (9.11) with computational complexity O(|K|² |X|²).
If both tasks can be solved by the same algorithm, then the second task does not appear to be in any way so fantastically complicated either.
Let us give another example concerning a pair of problems of seemingly different complexity, one which is closely related to our explanation. Seeking a path between two vertices of a graph does not substantially differ from the seemingly more complicated task of seeking the shortest path between these two vertices.
We have seen that the transformation of the original problems into a matrix form has not, so far, required complicated calculations. Only in the best matching problem does the transition from the original formulation to a matrix form require a certain, but not very complicated, transformation of the input data. In all other tasks no transformation of input data has been needed: the matrices to be multiplied are already stored in the initial data. Extracting the matrices from the initial data requires no transformation; the data only have to be interpreted in another way. In chess terminology, the problem is in a winning position from the very beginning.
We also know how to solve the simplest best matching problem, in which a simple transformation of the initial data converts the problem into a matrix form which makes the solution evident. If we continue with the chess analogy, the winning position can be achieved in a single move. In our explanation we have now worked our way to a task in which multiple modifications will be needed to transpose the task to the form of a matrix product. The winning position will be achieved in several moves, and a certain auxiliary task is solved with every move. This more complicated problem is Levenstein's well known approximation of a given sentence by a sentence of a given regular language. The analysis of this task will be dealt with in the remaining part of Lecture 9.
The task is to calculate the number

D(x) = min_{ȳ ∈ L} d(ȳ, x)    (9.12)

for each sentence x ∈ X* and for each regular language L ⊂ X*. The number D(x) will be called the dissimilarity between the sentence x and the language L. We will deal with the problem (9.12) for the case in which the function d is defined in a special way known as the Levenstein function.
9.5 Levenstein approximation of a sentence 411
Figure 9.3 Graph of transitions illustrating the performance of an algorithm for calculating
Levenstein dissimilarity.
An algorithm which has proved itself in one task tends to be used even for tasks for which its correct performance is not guaranteed. The well known and favoured algorithm for calculating Levenstein dissimilarity is a glowing example of such a pseudo-solution of tasks. We mention it in our explanation for completeness and will show where it is in error.
Let each edge be weighted by a length which corresponds to the penalty for the respective edit operation on the sentence. The length of each vertical edge in the i-th row represents the penalty in(x_i) for inserting the symbol x_i. The length of the horizontal edge in the j-th column is the penalty de(y_j) for deleting the symbol y_j. Finally, the length of the diagonal edge in the i-th row and j-th column is the penalty ch(y_j, x_i) for changing the symbol y_j to the symbol x_i. With edge lengths so defined, the length of each path in the graph is equal to the total penalty for using the respective edit operations. Therefore (watch out, an error will follow!) the search for the best sequence of edit operations which transforms ȳ into the sentence x is reduced to the known task of seeking the shortest path in the graph from the top left vertex to the bottom right vertex. The error in this reasoning will be seen from the following counterexample.
Example 9.1 Undermining the often used algorithm for calculating Levenstein dissimilarity. Let the alphabet X be {a, b, c, d}, let the sentence ȳ consist of the single symbol a and let the sentence x consist of the single symbol b. The Levenstein dissimilarity between the sentence b and the sentence a is to be calculated. The corresponding graph is shown in Fig. 9.4. Only three paths exist in the graph from the top left vertex to the bottom right vertex. These three paths correspond to three possible transformations of the sentence a to the sentence b.

1. possibility.
• Changing a to b.
2. possibility.
• Deleting a.
• Inserting b.
3. possibility.
• Inserting b.
• Deleting a.

Figure 9.4 Counterexample. Computing Levenstein dissimilarity.
In seeking the shortest path, i.e., in seeking the best of these three alternatives, we arrive at the result that the Levenstein dissimilarity between b and a is min( ch(a, b), de(a) + in(b) ). However, the factual mismatch of b and a can be far less, because the graph of Fig. 9.4 has shown only three possible ways of transforming the sentence a to the sentence b. The factual number of possibilities is much larger. For example, consider the following.

• Changing a to c.
• Deleting c.
• Inserting d.
• Changing d to b.

There are many other possibilities. The graph in Fig. 9.4 shows only a small portion of the possible sequences of edit operations on a sentence. Therefore the quoted procedure solves the task only if there is an a priori certainty that the cheapest sequence of edit operations belongs to just that small portion. There is, of course, no such certainty in the general case. ▲
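The failure can also be checked numerically. In the Python sketch below the concrete penalty values are made up for illustration and the function names are ours: the change penalties violate the triangle inequality, so the chain a → c, delete c, insert d, d → b is cheaper than any of the three paths of Fig. 9.4. Closing the penalties under chaining, here by a Floyd-Warshall shortest-path relaxation, repairs the result.

```python
INF = float("inf")

def ch_star(symbols, ch, de, ins):
    """Shortest-path closure of the symbol-to-symbol penalties: besides a
    direct change u -> v one may delete u and insert v, or chain any number
    of such steps through intermediate symbols."""
    dist = {(u, v): 0 if u == v else min(ch.get((u, v), INF), de[u] + ins[v])
            for u in symbols for v in symbols}
    for w in symbols:                       # Floyd-Warshall relaxation
        for u in symbols:
            for v in symbols:
                if dist[(u, w)] + dist[(w, v)] < dist[(u, v)]:
                    dist[(u, v)] = dist[(u, w)] + dist[(w, v)]
    return dist

symbols = "abcd"
ch = {(u, v): 10 for u in symbols for v in symbols if u != v}
ch[("a", "c")] = 1                            # cheap change a -> c
ch[("d", "b")] = 1                            # cheap change d -> b
de = {s: 10 for s in symbols}; de["c"] = 1    # deleting c is cheap
ins = {s: 10 for s in symbols}; ins["d"] = 1  # inserting d is cheap

naive = min(ch[("a", "b")], de["a"] + ins["b"])   # the three paths of Fig. 9.4
correct = ch_star(symbols, ch, de, ins)[("a", "b")]
```

With these penalties the three-path formula gives 10, while the chain through c and d costs only 4.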
414 Lecture 9: Regular languages and corresponding pattern recognition tasks
Our finding that the generally applied procedure for seeking Levenstein dissimilarity solves the task only in particular cases would not necessarily have disastrous consequences. As soon as the hitch has been discovered and understood, an algorithm for the correct calculation of Levenstein dissimilarity can be created immediately. We have dwelt on this example only to show one of the many treacherous traps into which a person handling the task without due care can easily fall. We note at the same time that we are not interested in the calculation of Levenstein dissimilarity between a given sentence and some other sentence which is also given. We are interested in the dissimilarity between a given sentence and an extensive, actually infinite, set of sentences, i.e., between a sentence and a regular language. Even at first glance this is a much more difficult task. There are far more such hidden traps, as we will see later.
(it can even be empty) of arrows of the type in. It is followed by a sequence (it can even be empty) of arrows of the type ch. Finally, it ends with a sequence (it can even be empty) of arrows of the type de. ▲
Proof. Let ȳ, x^1, x^2, ..., x^n, x be the shortest path from the vertex ȳ to the vertex x which does not possess the property stated by the lemma being proved. The failure of this property can be revealed in one of the triplets of vertices x^{i-1}, x^i, x^{i+1}, i.e., in a pair of arrows (x^{i-1}, x^i) and (x^i, x^{i+1}). There are only three cases in which the failure of the property being proved can occur. Each of them will be discussed separately, and the causes of the failure will be examined.
1. The arrow (x^{i-1}, x^i) is of the type de and the arrow (x^i, x^{i+1}) is of the type in. This means that the sentences x^{i-1}, x^i, x^{i+1} can assume one of the following forms. In the first case the sentence x^i = x' x x'' x''' will be changed to the sentence x̂^i = x' y x'' z x'''. In the second case the sentence x^i will be changed to the sentence x̂^i = x' z x'' y x'''. In both cases we obtain a new path from the sentence ȳ to the sentence x which has exactly the same length as the original path. On the new path the arrow (x^{i-1}, x̂^i) will be of the type in and the arrow (x̂^i, x^{i+1}) will be of the type ch.
3. The arrow (x^{i-1}, x^i) is of the type de, and the arrow (x^i, x^{i+1}) is of the type ch. Two alternatives can occur again. In the first case the sentence x^i = x' x'' y x''' will be changed to the sentence x̂^i = x' x x'' z x'''. In the second case the sentence x^i = x' y x'' x''' will be changed to the sentence x̂^i = x' z x'' x x'''. The new path obtained from ȳ to x will have exactly the same length as the original path. The arrow (x^{i-1}, x̂^i) will be of the type ch and the arrow (x̂^i, x^{i+1}) will be of the type de.
We have seen that by gradually changing the original path from ȳ to x we will find a path in which none of the three quoted situations occurs. For the resulting path, the property stated in the lemma being proved is satisfied. ■
Thanks to the proved lemma we can see that one of the shortest paths from the sentence ȳ to the sentence x consists of three sections. Note that each of them can be empty. The first section of the path passes through the arrows in, the second section is formed by the arrows ch, and finally the third section consists of the arrows de.

If we take this property into consideration then we can define Levenstein dissimilarity in one more way. We will introduce three partial Levenstein dissimilarities, which will be denoted d_in, d_ch and d_de. The number d_in(ȳ, x) is defined as the length (the shortest one) of the path from the vertex ȳ to the vertex x which passes through the arrows in. In this definition the word 'shortest' is redundant: the path from ȳ to x exists only if the sequence ȳ is a subsequence of the sequence x, and if such a path exists then the lengths of all the paths from ȳ to x are the same. If such a path does not exist then we set d_in(ȳ, x) = ∞.
The number d_de(ȳ, x) determines the length (the shortest one) of the path from the vertex ȳ to the vertex x which passes through arrows de. If such a path does not exist then we declare d_de(ȳ, x) = ∞. The adjective 'shortest' is redundant here too, as the lengths of all possible paths from ȳ to x are the same.

Similarly, d_ch(ȳ, x) is the length of the shortest (the adjective is needed here) path from ȳ to x that passes through arrows ch. If such a path does not exist then we define d_ch(ȳ, x) = ∞; this case occurs if the lengths of the sentences ȳ and x are not the same. If the sentences are identical, i.e., ȳ = x, then we define d_in(ȳ, x) = d_ch(ȳ, x) = d_de(ȳ, x) = 0.
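For two concrete sentences the three partial dissimilarities can be computed directly. The following Python sketch uses our own naming; d_ch expects the change penalties already closed under chaining (the ch* issue discussed at Fig. 9.4), and it relies on the observation that for insertions (deletions) the inserted (deleted) symbols form the multiset difference of the two sentences.

```python
from collections import Counter

INF = float("inf")

def is_subsequence(y, x):
    it = iter(x)
    return all(any(c == s for s in it) for c in y)

def d_in(y, x, ins):
    """Insertions only.  If y is a subsequence of x, every insertion order
    costs the same: the inserted symbols are the multiset difference x - y."""
    if not is_subsequence(y, x):
        return INF
    return sum(ins[s] * n for s, n in (Counter(x) - Counter(y)).items())

def d_de(y, x, de):
    """Deletions only: x must be a subsequence of y."""
    if not is_subsequence(x, y):
        return INF
    return sum(de[s] * n for s, n in (Counter(y) - Counter(x)).items())

def d_ch(y, x, ch_closed):
    """Changes only: the sentences must have equal length; ch_closed is the
    shortest-path closure of the elementary change penalties."""
    if len(y) != len(x):
        return INF
    return sum(ch_closed[(a, b)] for a, b in zip(y, x))
```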
The concept of the arrow has been used so far for representing admissible edit operations. Now we will supply additional arrows which denote a repeated application of the same edit operation; they will be called long arrows, to distinguish them from the hitherto used arrows, which will be called short arrows. Let ȳ and x be two sentences such that the sentence x can be obtained from the sentence ȳ by inserting symbols. We will introduce a long arrow in which starts from the vertex ȳ and leads toward the vertex x; its length will be d_in(ȳ, x). Let ȳ and x be two sentences of the same length. We will introduce a long arrow ch which leads from ȳ toward x and has the length d_ch(ȳ, x). Let ȳ and x be two sentences such that the sentence x can be obtained from the sentence ȳ by deleting symbols from it. We will introduce a long arrow de which leads from ȳ toward x and has the length d_de(ȳ, x).
After these concepts have been introduced, it is clear that the length of the shortest path from ȳ to x along the short arrows, i.e., the Levenstein dissimilarity between the sentences x and ȳ, is given by the length of the shortest path from ȳ to x along the long arrows. This path consists of at most three long arrows in, ch and de, in this order; some arrows can also be missing. The mathematical representation of the previous assertions is the relation

d(ȳ, x) = min_{y^1 ∈ X*} min_{y^2 ∈ X*} ( d_in(ȳ, y^1) + d_ch(y^1, y^2) + d_de(y^2, x) ).    (9.13)
This task substantially differs from all the previous sequence recognition tasks. The previous tasks were stated as seeking the optimum on a set of sequences; roughly speaking, the best sequence was to be found, and from the nature of the task the length of the sequence was known beforehand. Even if the set so defined was extremely extensive, it was still only a finite set. The task (9.14) which has now been formulated requires seeking the best sentence in the whole regular language. The length of the sequence sought is not limited beforehand and the respective regular language is infinite. Therefore it is a type of optimisation over an infinite domain.
This specific feature of our task can be expressed in another way. In the previous cases we were dealing with multi-dimensional optimisation tasks. The dimension of the optimisation space was usually quite large, but it was known
The function F is simple in the sense that its calculation has a complexity O(|K|² n), where n is the length of the sequence x̄. In the same sense all tasks analysed so far have been simple ones.
The most important outcome will be that for any function D: X* → ℝ of the form (9.14) there exists an equivalent expression in the form (9.15). The calculation of the function D therefore has the same complexity as all the previous tasks. Let us put aside for the time being the important question of the complexity of the transition from the expression (9.14) to the expression (9.15). Let this outcome be formulated in a precise way.
Theorem 9.1 Equivalent expression of Levenstein dissimilarity. Let X and K be two finite sets, and let φ: K → {0, 1}, P: K × X × K → {0, 1}, ψ: K → {0, 1} be three functions which define a regular language L containing the sentences x̄ = (x_1, x_2, ..., x_n) for which there holds: there exists a sequence of states k_0, k_1, ..., k_n such that φ(k_0) = 1, P(k_{i-1}, x_i, k_i) = 1 for i = 1, ..., n, and ψ(k_n) = 1.

Let us consider three non-negative functions in: X → ℝ, ch: X × X → ℝ, de: X → ℝ which determine the Levenstein dissimilarity d: X* × X* → ℝ and the Levenstein dissimilarity D: X* → ℝ.

For each six-tuple (φ, P, ψ, in, ch, de) which determines the function D with respect to (9.17) there is a pair of functions P', ψ' such that the equality

D(x_1, x_2, ..., x_n) = min_{k_0, k_1, ..., k_n} ( φ(k_0) + Σ_{i=1}^{n} P'(k_{i-1}, x_i, k_i) + ψ'(k_n) )    (9.18)

holds.
The properties quoted in Theorem 9.1 are not intuitively obvious, and therefore their proof should satisfy the most demanding requirements of formal correctness.

Theorem 9.1 will be proved within the scope of a formalism which is similar to the formalism of generalised matrices used in this and the previous lecture. Since the calculus of generalised matrices will now be used not only for a more concise expression of relations already known, but also for the proof of a hitherto unproved theorem, we formulate this calculus more precisely than before.
x + ∞ = ∞,
min(x, ∞) = x.

After extending the set of non-negative numbers by ∞, all the above quoted properties of the semi-ring are satisfied, and therefore the set ℝ ∪ {∞}, along with the operation min as addition and + as multiplication, forms a semi-ring. Indeed, there holds:

1. min(x, y) = min(y, x).
2. min( min(x, y), z ) = min( x, min(y, z) ).
3. x + y = y + x.
4. (x + y) + z = x + (y + z).
5. x + min(y, z) = min(x + y, x + z).
6. The set ℝ ∪ {∞} contains the 'number' ∞, so that for each x the equalities min(x, ∞) = x and x + ∞ = ∞ hold. The 'number' ∞ is then the zero element with respect to the operation ⊕.
7. The set ℝ ∪ {∞} contains the number 0, so that for each x the equality x + 0 = x is valid. The number 0 is then the unit element with respect to the operation ⊗.
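The (min, +) semi-ring and its generalised matrix product can be sketched in a few lines of Python; the function and variable names below are ours, not the book's.

```python
from math import inf

# In the (min, +) semi-ring: ⊕ is min, ⊗ is +,
# inf is the zero element, 0 is the unit element.

def conv(f1, f2, X, Y, Z):
    """Convolution f[x, z] = ⊕_y f1[x, y] ⊗ f2[y, z] over the (min, +)
    semi-ring, i.e. the generalised matrix product of the lecture.
    f1, f2 are dicts indexed by pairs of elements."""
    return {(x, z): min((f1[(x, y)] + f2[(y, z)] for y in Y), default=inf)
            for x in X for z in Z}
```

Ordinary matrix multiplication is recovered by replacing min with summation and + with multiplication.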
We will deal with functions that are defined on finite sets and assume their values in a semi-ring with commutative multiplication. The denotation f[x, y] will be interpreted as the function as a whole, i.e., a mapping from one set to the other. Inside the square brackets the identifiers of the variables on which the function depends are written. The denotation f(x, y) with round brackets will not mean a function, but the value which the function f[x, y] assumes at certain values of the argument. For example, f(a, b, c) is the value which the function f[x, y, z] assumes when x = a, y = b, z = c are substituted.

Let X be the set {x_1, x_2, ..., x_n}. The expression ⊕_{x ∈ X} f(x) will be used as a brief expression of the sum f(x_1) ⊕ f(x_2) ⊕ ⋯ ⊕ f(x_n).
Let X, Y, Z be three finite sets, (W, ⊕, ⊗) be a commutative semi-ring, and f_1[x, y]: X × Y → W, f_2[y, z]: Y × Z → W be two functions. The expression

f_1[x, y] ⊗_y f_2[y, z]

will be used as a brief denotation of the function f[x, z]: X × Z → W, the values of which are defined by the expression

f(x, z) = ⊕_{y ∈ Y} f_1(x, y) ⊗ f_2(y, z).

This corresponds to multiplying the matrix f_1 of dimension |X| × |Y| by the matrix f_2 of dimension |Y| × |Z|. The result of the product is the matrix f of dimension |X| × |Z|. Note, however, that convolution expressions have a certain advantage over a matrix product, because convolution expressions in a commutative semi-ring are commutative, i.e., the following equation is valid

f_1[x, y] ⊗_y f_2[y, z] = f_2[y, z] ⊗_y f_1[x, y].    (9.20)
• distributivity,
• and one more property, which has no name and results from commutativity and associativity.

Let X be some set and δ[x, y] be a function of the form X × X → W for which δ(x, y) = 1_⊗ if x = y, and δ(x, y) = 0_⊕ if x ≠ y. This function will be referred to as a Kronecker function. For a Kronecker function

f[x, z] ⊗_z δ[z, y] = f[x, y]

also holds, i.e., convolution with a Kronecker function does not transform the function itself and changes only the denotation of its argument.
If the semi-ring is formed by the operation min in the sense of addition and + in the sense of multiplication, then the convolution expressions assume additional features resulting from the idempotency of addition, which means f ⊕ f = f. Let f[x, y] be a function of the form X × X → (ℝ ∪ {∞}). For any function f of this form we define the function f^0[x, y] as the Kronecker function, and the function f^i[x, y] as the convolution f[x, z] ⊗_z f^{i-1}[z, y].
f* = lim_{n→∞} ⊕_{i=0}^{n} f^i = (δ ⊕ f)^{|X|-1}.
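Because ⊕ = min is idempotent, the infinite sum of powers collapses to the finite power (δ ⊕ f)^{|K|-1}. A small Python sketch of this closure (our naming and data layout, not the book's) computes it by repeated generalised matrix products, which is exactly an all-pairs shortest path computation.

```python
from math import inf

def closure(f, K):
    """f* = ⊕_{i>=0} f^i, computed as (δ ⊕ f)^(|K|-1) in the (min, +)
    semi-ring: f*[u, v] is the cheapest cost of going from u to v in any
    number of steps.  f is a dict over K x K; missing entries mean inf."""
    # g = δ ⊕ f : staying in place costs the unit element 0
    g = {(u, v): min(f.get((u, v), inf), 0 if u == v else inf)
         for u in K for v in K}
    r = dict(g)                          # g to the power 1
    for _ in range(max(len(K) - 2, 0)):  # |K|-2 further products give g^(|K|-1)
        r = {(u, v): min(r[(u, w)] + g[(w, v)] for w in K)
             for u in K for v in K}
    return r
```

Idempotency is what makes the truncation valid: a shortest path never needs more than |K| − 1 arrows, so higher powers add nothing new.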
Φ(#) = 1_⊗.
F[k', x̄, k''] ⊗_{x̄} Φ[x̄]    (9.25)

is

( δ[k', k''] ⊕ ( f[k', x, k''] ⊗_x Φ[x] ) )^{|K|-1}.
Proof. Immediately from the definition of convolution it follows that the convolution (9.25) is only a brief designation for the function of two variables k' and k'', the value of which for a given pair (k', k'') ∈ K² is the sum

⊕_{n=0}^{∞} ⊕_{x̄ ∈ X^n} F(k', x̄, k'') ⊗ Φ(x̄),    (9.26)

where X^n is the set of sequences x_1, x_2, ..., x_n, x_i ∈ X, of length n. This assertion will be repeated in the proof of this lemma several times. That is why we express it in the form of the equality

F[k', x̄, k''] ⊗_{x̄} Φ[x̄] = ⊕_{n=0}^{∞} ⊕_{x̄ ∈ X^n} F(k', x̄, k'') ⊗ Φ(x̄).    (9.27)

Strictly speaking the preceding relation is not correct, because on the left-hand side of the equality a function K × K → W is stated, while on the right-hand side there appears the value which this function assumes for the pair k', k''. In spite of this incorrectness we will use the denotation of (9.27) so that after this explanation no misunderstanding should occur.
The sum

⊕_{x̄ ∈ X^0} F(k', x̄, k'') ⊗ Φ(x̄)

is evidently δ(k', k'').

For n = 1 the equality (9.29) is valid because it is identical with the equality (9.28). We will prove that if the equality (9.29) is valid for some n then it is also valid for n + 1. This statement is proved by the following derivation,

⊕_{x̄x ∈ X^{n+1}} F(k', x̄x, k'') ⊗ Φ(x̄x) = ⊕_{x̄ ∈ X^n} ⊕_{x ∈ X} F(k', x̄x, k'') ⊗ Φ(x̄x) = ( f[k', x, k''] ⊗_x Φ[x] )^{n+1}.

We insert (9.29) into (9.26) and find that the convolution (9.26) is equal to ⊕_{n=0}^{∞} ( f[k', x, k''] ⊗_x Φ[x] )^n.
We can see that a mere formal analysis of convolution expressions provides rules for their equivalent transformation. ■

It is important that two of the rules, quoted in Lemma 9.2 and Lemma 9.3, enable the transformation of infinite convolution expressions into their finite equivalents. In the task of Levenstein approximation one of the major difficulties is the requirement of minimisation over the set of all possible sequences of arbitrary length. (Roughly speaking, a function is to be minimised which depends on an infinitely large number of variables.) Such sets cannot be coped with by any computational procedure in a finite number of steps. Therefore in the solution of that task we will make use of the last two results, which reduce infinite convolution expressions to finite ones. For this purpose the task has to be expressed in a convolution form. Its previous non-convolution formulation was given in Subsection 9.5.5.
which is equivalent to the property that there exists a sequence k_0, k_1, ..., k_n for which there holds

φ(k_0) = 0,
P(k_{i-1}, x_i, k_i) = 0,  i = 1, 2, ..., n,    (9.31)
ψ(k_n) = 0,

because the sum φ(k_0) + Σ_{i=1}^{n} P(k_{i-1}, x_i, k_i) + ψ(k_n) for an arbitrary sequence k_0, k_1, ..., k_n can be either 0 or ∞. This sum is 0 if the system of conditions (9.31) is satisfied, and it is ∞ if at least one condition from (9.31) is not satisfied.

It is evident that the language L created in this way belongs among the regular languages. And vice versa, each regular language can be expressed in the form of (9.31).
Let us have the Levenstein dissimilarity d: X* × X* → ℝ. The task (9.14) which has to be solved requires for each given sequence x ∈ X* the calculation of the number

D(x) = min_{ȳ ∈ L} d(ȳ, x),    (9.32)

which is the Levenstein dissimilarity between the sentence x and the language L. The number (9.30), which depends on the sequence ȳ ∈ X*, will be denoted F(ȳ). Since F(ȳ) can be either 0 or ∞, the number D(x) defined by the relation (9.32) can be expressed as

D(x) = min_{ȳ ∈ X*} ( F(ȳ) + d(ȳ, x) ).
The main expected outcome of this part of the explanation, which was stated before as Theorem 9.1, can now be expressed in the convolution form.

Let X and K be two finite sets and let us have three functions φ: K → {0, ∞}, P: K × X × K → {0, ∞}, ψ: K → {0, ∞}. Let the function f: K × X* × K → ℝ be defined in the following manner. If x̄ is the empty sentence # then f(k, #, k') = δ(k, k'), k ∈ K, k' ∈ K, and further

f[k', x̄x, k''] = f[k', x̄, k] ⊗_k P[k, x, k''].

Let us have three functions in: X → ℝ, ch: X × X → ℝ, de: X → ℝ which determine the Levenstein dissimilarity d[ȳ, x]: X* × X* → ℝ between the sentences x and ȳ, and the Levenstein dissimilarity D[x]: X* → ℝ between the sentence x and the set of sentences ȳ ∈ X* for which F(ȳ) = 0 holds. Then there exist functions P' and ψ' such that

D[x] = φ[k'] ⊗_{k'} f'[k', x, k''] ⊗_{k''} ψ'[k''],

where

f'(k, #, k') = δ(k, k'),  k ∈ K, k' ∈ K,
f'[k', x̄x, k''] = f'[k', x̄, k] ⊗_k P'[k, x, k''].
where the vectors φ and ψ' express the functions φ and ψ' of one variable k ∈ K, and the matrices P'_i, i = 1, ..., n, represent the functions P'[k, x_i, k'] of two variables k ∈ K, k' ∈ K. It follows from Theorem 9.2 that the calculation of the Levenstein dissimilarity between the sentence x and the language L has a complexity of O(|K|² n), where n is the length of the sentence x. In this way the Levenstein dissimilarity problem has been reduced to the simpler problems analysed before.
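Once the functions P' and ψ' are available, the O(|K|² n) part is a single left-to-right sweep of vector-matrix products in the (min, +) semi-ring. A Python sketch of this sweep, with our own naming and data layout:

```python
INF = float("inf")

def dissimilarity(x, K, phi, P1, psi):
    """Evaluates the product  phi ⊗ P_{x_1} ⊗ ... ⊗ P_{x_n} ⊗ psi  in the
    (min, +) semi-ring.  phi, psi: dicts over K; P1: dict over
    (k, symbol, k'), missing entries meaning infinity.
    Cost is O(|K|^2) per symbol of x."""
    v = dict(phi)                          # current row vector
    for xi in x:
        v = {k2: min(v[k1] + P1.get((k1, xi, k2), INF) for k1 in K)
             for k2 in K}
    return min(v[k] + psi[k] for k in K)
```

The sweep never materialises the whole product of matrices; it only carries one |K|-dimensional vector from symbol to symbol.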
The validity of the declared main result will be proved in the following subsection.

According to (9.13) the dissimilarity d[ȳ, x] can be written as the convolution

d[ȳ, x] = d_in[ȳ, ȳ_1] ⊗_{ȳ_1} d_ch[ȳ_1, ȳ_2] ⊗_{ȳ_2} d_de[ȳ_2, x].

Let us incorporate the function d[ȳ, x] in this form into the definition (9.35) of the Levenstein dissimilarity D[x] between the sentence x and the language L,
The proof of Theorem 9.2 can be reduced to a proof of the next three lemmata by decomposing the problem of optimal sentence transformation into three independent simpler problems of optimal insertion, changing and deletion of symbols.

Lemma 9.4 Optimal transformation of a sentence by inserting symbols. Let us have two finite sets X, K and three functions φ: K → ℝ, P: K × X × K → ℝ, ψ: K → ℝ determining the function F: X* → ℝ so that

F[ȳ] = φ[k'] ⊗_{k'} f[k', ȳ, k''] ⊗_{k''} ψ[k''],    (9.38)

where

f[k', #, k''] = δ[k', k''],
f[k', ȳy, k''] = f[k', ȳ, k] ⊗_k P[k, y, k''].    (9.39)
Let the function in: X → ℝ define the function d_in: X* × X* → ℝ, the value d_in(ȳ, ȳ_1) of which is the minimal penalty for transforming the sentence ȳ into the sentence ȳ_1 by repeatedly inserting symbols from X into the sentence ȳ. In this case there exists a function P_1: K × X × K → ℝ such that the function

F_1[ȳ_1] = F[ȳ] ⊗_{ȳ} d_in[ȳ, ȳ_1]

is of the form

φ[k'] ⊗_{k'} f_1[k', ȳ_1, k''] ⊗_{k''} ψ[k''],

where

f_1[k', #, k''] = δ[k', k''],    (9.40)
f_1[k', ȳ_1 y_1, k''] = f_1[k', ȳ_1, k] ⊗_k P_1[k, y_1, k''].    (9.41)    ▲
where

f_1[k', #, k''] = δ[k', k''],
f_1[k', ȳ_1 y_1, k''] = f_1[k', ȳ_1, k] ⊗_k P_1[k, y_1, k''].    (9.43)
where

f_2[k', #, k''] = δ[k', k''],    (9.46)
f_2[k', ȳ_2 y_2, k''] = f_2[k', ȳ_2, k] ⊗_k P_2[k, y_2, k''].    (9.47)    ▲
where

f_2[k', #, k''] = δ[k', k''],
f_2[k', ȳ_2 y_2, k''] = f_2[k', ȳ_2, k] ⊗_k P_2[k, y_2, k''].    (9.49)
Let the function de: X → ℝ define the function d_de: X* × X* → ℝ, the value d_de[ȳ_2, x] of which is the minimal penalty for transforming the sentence ȳ_2 into the sentence x by repeatedly deleting symbols from the sequence ȳ_2. In this case there exist functions P': K × X × K → ℝ and ψ': K → ℝ such that the function

D[x] = F_2[ȳ_2] ⊗_{ȳ_2} d_de[ȳ_2, x]    (9.50)

is identical with the function

φ[k'] ⊗_{k'} f'[k', x, k''] ⊗_{k''} ψ'[k'']. ▲

The value d_in(#, ȳ_1) is the penalty for creating the sentence ȳ_1 from the empty sentence by insertions, so that

d_in(#, ȳ_1 y_1) = d_in(#, ȳ_1) ⊗ in(y_1).    (9.53)
If a sequence ȳ consists of a single symbol y and a sequence ȳ_1 consists of a single symbol y_1, then d_in(ȳ, ȳ_1) = ∞ if y ≠ y_1, and d_in(ȳ, ȳ_1) = 0 in the opposite case. In the convolution form this means

d_in(y, y_1) = δ(y, y_1).    (9.54)

If neither ȳ nor ȳ_1 is empty then the value d_in(ȳ, ȳ_1) results from the following considerations. In the optimal transformation of the sequence ȳ = ȳ'y to the sequence ȳ_1 = ȳ_1' y_1, the last symbol y_1 either is inserted, in which case the penalty will be

d_in(ȳ'y, ȳ_1') + d_in(#, y_1).

In the second case the penalty will be

d_in(ȳ', ȳ_1') + d_in(y, y_1),

so that altogether

d_in(ȳ'y, ȳ_1' y_1) = ( d_in(ȳ'y, ȳ_1') ⊗ d_in(#, y_1) ) ⊕ ( d_in(ȳ', ȳ_1') ⊗ d_in(y, y_1) ).    (9.55)
Let us consider the function

F_1[ȳ_1] = F[ȳ] ⊗_{ȳ} d_in[ȳ, ȳ_1],

the existence of which Lemma 9.4 states. Let us use the expression (9.38) for F[ȳ] and obtain

F_1[ȳ_1] = φ[k'] ⊗_{k'} ( f[k', ȳ, k''] ⊗_{ȳ} d_in[ȳ, ȳ_1] ) ⊗_{k''} ψ[k''].

In this way we have proved that the function F_1[ȳ_1] is of the form

φ[k'] ⊗_{k'} f_1[k', ȳ_1, k''] ⊗_{k''} ψ[k''],

where

f_1[k', ȳ_1, k''] = f[k', ȳ, k''] ⊗_{ȳ} d_in[ȳ, ȳ_1].    (9.56)

Now it remains to be proved that the function f_1[k', ȳ_1, k''] defined by Equation (9.56) satisfies the conditions (9.40) and (9.41).
3. The property (9.40) is quite obvious. If ȳ_1 = # then d_in(ȳ, ȳ_1) = 0 = 1_⊗ only when ȳ = #, because a nonempty sequence cannot be changed into an empty one by inserting symbols. Equation (9.56) in this case obtains the form

f_1[k', #, k''] = f[k', #, k''].

In Lemma 9.4 the condition f[k', #, k''] = δ[k', k''] is given, and thus f_1[k', #, k''] = δ[k', k''] holds as well.
4. Let us write a more detailed expression for Equation (9.56) when the sequence ȳ_1 is nonempty and consequently of the form ȳ_1 y_1,

f_1(k', ȳ_1 y_1, k'') = ⊕_{ȳ ∈ X*} f(k', ȳ, k'') ⊗ d_in(ȳ, ȳ_1 y_1)
  = f(k', #, k'') ⊗ d_in(#, ȳ_1 y_1) ⊕ ⊕_{ȳ ∈ X*} ⊕_{y ∈ X} f(k', ȳy, k'') ⊗ d_in(ȳy, ȳ_1 y_1).

Using the expressions (9.53) for d_in(#, ȳ_1 y_1), (9.39) for f(k', ȳy, k''), and (9.55) for d_in(ȳy, ȳ_1 y_1), we obtain

f_1(k', ȳ_1 y_1, k'') = f(k', #, k'') ⊗ d_in(#, ȳ_1) ⊗ d_in(#, y_1)    (9.57)
  ⊕ ⊕_{ȳ ∈ X*\{#}} f(k', ȳ, k'') ⊗ d_in(ȳ, ȳ_1) ⊗ d_in(#, y_1)
  ⊕ ⊕_{ȳ ∈ X*} ⊕_{k ∈ K} ( f(k', ȳ, k) ⊗ d_in(ȳ, ȳ_1) ⊗ ( ⊕_{y ∈ X} P(k, y, k'') ⊗ d_in(y, y_1) ) ).

The first and the second lines in the preceding expression can be written as one line using summation over all possible sentences ȳ, including the empty sentence #. So Equation (9.57) can be rewritten in the form

f_1(k', ȳ_1 y_1, k'') = ⊕_{ȳ ∈ X*} f(k', ȳ, k'') ⊗ d_in(ȳ, ȳ_1) ⊗ d_in(#, y_1)    (9.58)
  ⊕ ⊕_{ȳ ∈ X*} ⊕_{k ∈ K} ( f(k', ȳ, k) ⊗ d_in(ȳ, ȳ_1) ⊗ ( ⊕_{y ∈ X} P(k, y, k'') ⊗ d_in(y, y_1) ) ).
5. Owing to the distributive property of multiplication, the first line in (9.58) can be written as

( ⊕_{ȳ ∈ X*} f(k', ȳ, k'') ⊗ d_in(ȳ, ȳ_1) ) ⊗ d_in(#, y_1),

in which the sum in round brackets is f_1(k', ȳ_1, k'') according to the definition (9.56). The first line in Equation (9.58) is then

f_1(k', ȳ_1, k'') ⊗ d_in(#, y_1).    (9.59)

Based on the property of the Kronecker function δ[k, k''] we can write the number (9.59) as

⊕_{k ∈ K} f_1(k', ȳ_1, k) ⊗ δ(k, k'') ⊗ d_in(#, y_1).    (9.60)
6. We will examine the second line in (9.58). Owing to Equation (9.54) we can write

⊕_{y ∈ X} P(k, y, k'') ⊗ d_in(y, y_1) = ⊕_{y ∈ X} P(k, y, k'') ⊗ δ(y, y_1) = P(k, y_1, k'').

On the basis of the definition (9.56), the sum in round parentheses is equal to f_1(k', ȳ_1, k), so that the second line of (9.58) becomes

⊕_{k ∈ K} f_1(k', ȳ_1, k) ⊗ P(k, y_1, k'').    (9.61)
7. If we substitute (9.60) for the first line of (9.58) and (9.61) for the second line, then we obtain

f_1[k', ȳ_1 y_1, k''] = f_1[k', ȳ_1, k] ⊗_k P_1[k, y_1, k''],

where

P_1[k, y_1, k''] = P[k, y_1, k''] ⊕ ( δ[k, k''] ⊗ d_in[#, y_1] ).

So it has been proved that the function f_1 defined by the relation (9.56) satisfies the condition (9.41). ■

For completeness we note that d_in[#, y_1] is in[y_1], and thus the function P_1 referred to in the lemma can be expressed in the following simple way,

P_1[k', y, k''] = P[k', y, k''] ⊕ ( δ[k', k''] ⊗ in[y] ).

The preceding relation explicitly provides a constructive way of creating the function P_1 from the known functions P and in.
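This construction can be carried out mechanically; a Python sketch building P_1 from P and in, with our own data representation (dicts keyed by (state, symbol, state), missing entries meaning infinity):

```python
INF = float("inf")

def make_P1(P, K, X, ins):
    """P1[k', y, k''] = P[k', y, k''] ⊕ (δ[k', k''] ⊗ in[y]) in the (min, +)
    semi-ring: besides the regular transition penalty, the automaton may
    stay in the same state and pay the insertion penalty for y."""
    return {(k1, y, k2): min(P.get((k1, y, k2), INF),
                             ins[y] if k1 == k2 else INF)
            for k1 in K for y in X for k2 in K}
```

For k' ≠ k'' the δ-term is the zero element ∞ and P_1 coincides with P; on the diagonal the cheaper of the two options wins.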
F_2[ȳ_2] = φ[k'] ⊗_{k'} ( f_1[k', ȳ_1, k''] ⊗_{ȳ_1} d_ch[ȳ_1, ȳ_2] ) ⊗_{k''} ψ[k''].

The previous expression implies that the function F_2 is of the required form (9.45). The function f_2[k', ȳ_2, k''] is the convolution

f_2[k', ȳ_2, k''] = f_1[k', ȳ_1, k''] ⊗_{ȳ_1} d_ch[ȳ_1, ȳ_2].    (9.62)

It is to be proved now that the function defined in this way satisfies the conditions (9.46) and (9.47).
2. The property (9.46) is evidently correct. If ȳ_2 = # then the convolution (9.62) assumes the form

f_2[k', #, k''] = f_1[k', #, k''] = δ[k', k''],

because d_ch(ȳ_1, #) differs from ∞ only for ȳ_1 = #.

3. In the sum (9.62) only sentences ȳ_1 whose lengths equal the length of the sentence ȳ_2 y_2 need be considered, since no sequence of symbol changes can change the length of a sentence. For the sentences of the form ȳ_1 y_1 whose lengths equal the length of ȳ_2 y_2, according to the assumption (9.43), the following is valid

f_2[k', ȳ_2 y_2, k''] = f_2[k', ȳ_2, k] ⊗_k P_2[k, y_2, k''].

This means that the function f_2 satisfies the condition (9.47). The function P_2 is defined by the expression (9.66). ■
For completeness we will express P_2 directly by means of functions that are known. The number d_ch(y_1, y_2) is the penalty for the cheapest sequence of symbol changes which transforms the symbol y_1 into the symbol y_2. This number can be calculated as the length of the shortest path between two vertices of the graph consisting of |X| vertices that correspond to the symbols x ∈ X, in which the length of the arrow from the vertex y_1 to the vertex y_2 is ch(y_1, y_2). This length d_ch: X × X → ℝ is equal to the sum ⊕_{i=0}^{∞} ch^i. The sum was proved to equal the function (δ ⊕ ch)^{|X|-1}, which was denoted ch*.

The function P_2, which was stated before by means of (9.66), can thus be expressed in the form

P_2[k', y_2, k''] = P_1[k', y_1, k''] ⊗_{y_1} ch*[y_1, y_2].
where d_de[ȳ_2, x] is now understood as a function of one variable ȳ_2, since in all further considerations x will be a fixed sequence. In this sum the addition need not be performed over all possible sequences ȳ_2, but only over those that have the form

ȳ_2 = x̃^1 x_1 x̃^2 x_2 ⋯ x̃^n x_n x̃^{n+1},    (9.68)

where x̃^1, x̃^2, ..., x̃^{n+1} are sequences. In other words, the addition is to be performed only over those sentences ȳ_2 that can be transformed into the sentence x by merely deleting some symbols, i.e., sentences that include the sentence x as a subsequence. For a sentence of the form (9.68) the function F_2[ȳ_2] has, in agreement with (9.49), the following form

(9.69)

If we substitute (9.70) into (9.67) and take into consideration that the convolution is calculated only for sentences ȳ_2 of the form (9.68), then we will obtain
Explicit expressions for calculating the functions P' and ψ' thus are

P'[k', x, k''] = ( δ[k', k] ⊕ ( P_2[k', x, k] ⊗_x de[x] ) )^{|K|-1} ⊗_k P_2[k, x, k''],

which shows a constructive way in which they can be obtained on the basis of the known functions P_2, ψ and de.
The three proved lemmata 9.4, 9.5 and 9.6 prove Theorem 9.2, and thus also the equivalent Theorem 9.1. This is because these three lemmata show how to find the functions P' and ψ' whose existence is asserted in the above two theorems. The functions P' and ψ' are created on the basis of the five functions P, ψ, in, ch, de by means of the following seven steps.
q[k', k''] = δ[k', k''] ⊕ ( P_2[k', x, k''] ⊗_x de[x] ).    (9.76)
None of the calculations quoted here depends on the actual sequence x for which the Levenstein dissimilarity with a known regular language is calculated. If the dissimilarities are sought for various sentences x with the same automaton φ, P, ψ and the same penalty functions in, ch, de, then the above relations can be calculated in advance, only once, for all sentences that will be analysed in the future.

If the functions P' and ψ' are already at our disposal then the Levenstein dissimilarity D(x) is calculated for each sentence x = (x_1, x_2, ..., x_n) as the convolution expression
440 Lecture 9: Regular languages and corresponding pattern recognition tasks
ψ(kₙ) = 0.
Furthermore, let us have three functions in: X → ℝ, ch: X × X → ℝ, de: X → ℝ which determine the function d: X* × X* → ℝ whose value d(ȳ, x) denotes the Levenstein dissimilarity of the sentences x and ȳ. The task is to find an algorithm which for each sentence x ∈ X* finds the number

D(x) = min_{ȳ∈L} d(ȳ, x)  (9.81)

which is called the Levenstein dissimilarity between the sentence x and the language L.
The algorithm for calculating the number D(x) consists of two parts. The first part consists of preliminary calculations whose complexity is not greater than the largest of O(|K|³ log|K|), O(|X|³ log|X|), O(|K|³|X|) and O(|K|²|X|). These calculations do not depend on the input sentence x and are carried out only once for the given language L and Levenstein dissimilarity. The second part consists of calculations which depend on the given sentence and whose complexity is O(|K|² n), where n is the length of the sentence x.
Now we will present the calculations of the first part, which were expressed before by means of the convolution formulae (9.73) through (9.79). We will explain
the necessary computations. The explanations will rest on the representation
of the language L by a finite automaton.
1. A function P₁: K × X × K → ℝ will be created (see formula (9.73)). The number P₁(k', y, k'') represents the minimal penalty for adding the symbol y to the end of the sentence under the condition that the automaton was in the state k' and reached the state k'' after the addition. For k' ≠ k'' there exists only one way of adding: the symbol y must be generated by the automaton, and the penalty in this case is P(k', y, k''). There are two options for adding a symbol y if k' = k'' = k. The first option is that the automaton generates the symbol y and is penalised by P(k, y, k). In the second option the operation of the automaton is interrupted, the symbol y is inserted at the end of the already generated sequence, and the penalty in(y) is paid. Naturally, the cheaper alternative is selected.
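This first step translates directly into code. A minimal sketch in Python; the function names and the dictionary representation are our own choices, not the book's, and P is assumed to return ∞ for transitions the automaton does not have:

```python
import math

def build_P1(K, X, P, pen_in):
    """Step 1 (formula (9.73)): P1(k', y, k'') is the cheapest way to append
    symbol y while moving the automaton from state k' to state k''.
    P(k', y, k'') is the generation penalty (math.inf if the transition is
    absent); pen_in(y) is the insertion penalty in(y)."""
    P1 = {}
    for k1 in K:
        for y in X:
            for k2 in K:
                if k1 != k2:
                    # the only option: the automaton itself generates y
                    P1[k1, y, k2] = P(k1, y, k2)
                else:
                    # either generate y, or interrupt the automaton and insert y
                    P1[k1, y, k2] = min(P(k1, y, k1), pen_in(y))
    return P1
```
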
2. A function ch*: X × X → ℝ will be created (see formula (9.74)), for instance in this way. First the numbers ch*(x, y) are created such that ch*(x, y) = 0 for x = y and ch*(x, y) = ch(x, y) for x ≠ y. Then the numbers are transformed repeatedly by the operator
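The operator itself is given by a formula not reproduced here; in effect it is min-plus squaring, ch*(x, y) ← min_z (ch*(x, z) + ch*(z, y)), repeated until a fixed point is reached. A sketch under that assumption (names and representation are ours):

```python
import math

def build_ch_star(X, ch):
    """Step 2 (formula (9.74)): ch*(x, y) is the cheapest way to turn symbol x
    into symbol y by a chain of elementary changes.  Start with ch*(x, y) = 0
    for x == y and ch(x, y) otherwise, then apply min-plus squaring
        ch*(x, y) <- min(ch*(x, y), min_z (ch*(x, z) + ch*(z, y)))
    until nothing changes any more."""
    X = list(X)
    cs = {(x, y): (0.0 if x == y else ch(x, y)) for x in X for y in X}
    changed = True
    while changed:
        changed = False
        for x in X:
            for y in X:
                best = min(cs[x, z] + cs[z, y] for z in X)
                if best < cs[x, y]:
                    cs[x, y] = best
                    changed = True
    return cs
```
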
3. A function P₂: K × X × K → ℝ will be created (see formula (9.75)). The number P₂(k', x, k'') represents the minimal penalty for adding the symbol x to the end of the sentence under the condition that the automaton was in the state k' before the addition and got to the state k'' after it. It is a number resembling the number P₁(k', x, k''), but there is an essential difference between them. The added symbol can be either generated by the automaton or inserted at the end of the sequence, and afterwards the added symbol can be changed by an arbitrarily long sequence of changes, or not changed at all. Thus the number P₂(k', x, k'') is the result of optimisation over a considerably larger set.
4. An auxiliary function q: K × K → ℝ is calculated,

   q(k', k'') = 0                                    if k' = k'',
   q(k', k'') = min_{x∈X} (P₂(k', x, k'') + de(x))   if k' ≠ k''.
The number q(k', k'') is the price of the cheapest process by which the automaton passes from the state k' to the state k'' such that the generated sentence is unchanged as a result of the whole process, though it was changed during the process. Any process of this class consists of:
5. A function q*: K × K → ℝ will be calculated (see formula (9.77)). The number q*(k', k'') is similar to the number q(k', k''). The difference between the two is that when creating the number q(k', k'') it was assumed that the automaton generated only one symbol, or that only one symbol was inserted at the end of the sequence, and this symbol was then further manipulated. When the number q*(k', k'') is being created, a situation is taken into account in which the automaton can generate any sequence of symbols (an empty sequence as well as a rather long one). The automaton begins to generate symbols in the state k' and finally gets to the state k''. Apart from that, any symbols can be inserted into the obtained sequence. The generated sequence is subjected to a number of changes until the process ends with the deletion of all generated symbols. The price of the least expensive procedure of this class is q*(k', k'').
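Assuming q* is the (|K|−1)-th power of q in the (min, +) semiring, it can be computed by repeated squaring of the matrix q from step 4. The representation below is our own:

```python
import math

def minplus_matmul(A, B, K):
    """One min-plus matrix product: (A x B)[k', k''] = min_k (A[k', k] + B[k, k''])."""
    return {(k1, k2): min(A[k1, k] + B[k, k2] for k in K)
            for k1 in K for k2 in K}

def build_q_star(K, q):
    """Step 5 (formula (9.77)): q* as the (|K|-1)-th min-plus power of q,
    i.e. the cheapest way the automaton can wander from k' to k'' while
    every symbol generated or inserted on the way is eventually deleted.
    The diagonal is set to 0, so extra squarings are harmless."""
    K = list(K)
    qs = {(k1, k2): (0.0 if k1 == k2 else q[k1, k2]) for k1 in K for k2 in K}
    for _ in range(max(len(K) - 1, 1)):
        qs = minplus_matmul(qs, qs, K)   # repeated squaring reaches q^(|K|-1)
    return qs
```
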
6. A function P': K × X × K → ℝ will be calculated (see formula (9.78)). The function P' resembles the functions P₁ and P₂. The number P'(k', x, k'') is the price of the least expensive procedure of the following class. The automaton, being in the state k', generates a sentence; then each symbol of the sentence is subjected to changes and is deleted. Then the automaton either generates a symbol x', or the symbol x' is inserted at the end of the sentence. Being the last symbol, it is subjected to repeated changes until it is changed into the symbol x, which is no longer deleted. Through the above procedure the automaton gets to the state k''.
7. A number ψ'(k) will be calculated (see formula (9.79)),

   ψ'(k) = min_{k'∈K} (q*(k, k') + ψ(k')).
The number ψ'(k) represents the price of the least expensive procedure of the following class. The automaton, being in the state k, generates a sentence and gets to the state k', where it stops. Then each symbol in the generated sentence is changed until it is finally deleted.
The computational procedure (9.80) has the following form when expressed in natural language. For each sequence x = (x₁, x₂, ..., xₙ) the numbers fᵢ(k), k ∈ K, i = 0, 1, ..., n, and the number D(x) are to be calculated according to the formulae

f₀(k₀) = φ(k₀),  k₀ ∈ K,
fᵢ(kᵢ) = min_{kᵢ₋₁∈K} (fᵢ₋₁(kᵢ₋₁) + P'(kᵢ₋₁, xᵢ, kᵢ)),  kᵢ ∈ K,  i = 1, 2, ..., n,
D(x) = min_{kₙ∈K} (fₙ(kₙ) + ψ'(kₙ)).
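The three formulae above translate almost verbatim into code. A sketch of the O(|K|²n) second part, assuming P', φ and ψ' have been precomputed; the argument conventions are our own:

```python
import math

def levenstein_to_language(x, K, phi, P_prime, psi_prime):
    """Evaluate the recurrences
        f_0(k) = phi(k),
        f_i(k) = min over k' of (f_{i-1}(k') + P'(k', x_i, k)),
        D(x)   = min over k  of (f_n(k) + psi'(k)).
    x is the input sentence, K the state set; phi, P_prime and psi_prime
    stand for the precomputed functions of the text."""
    f = {k: phi(k) for k in K}
    for xi in x:
        f = {k2: min(f[k1] + P_prime(k1, xi, k2) for k1 in K) for k2 in K}
    return min(f[k] + psi_prime(k) for k in K)
```
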
We can see that a mere explanation of the already validated algorithm in natural language is rather difficult. The explanation becomes inevitably lengthy and consequently neither transparent nor convincing. It could be even worse if the algorithm were not yet available and had to be created and validated by so-called reasonable consideration. In such a case the risk becomes rather great that some hardly noticeable peculiarities of the problem would be overlooked and consequently an erroneous outcome obtained.
We chose the formal way in this lecture to construct the algorithm for the Levenstein approximation. The formal convolution expressing the problem is equivalently transformed step by step until the convolution expression is obtained that represents the algorithm. Every step in this deduction is guided by formal rules that transform convolutions equivalently, not by vague reasoning. Such a way is not very amusing, but it excludes unfortunate inadvertence.
9.6 Discussion
I have noticed a substantial difference between Lecture 8 and Lecture 9, even though their topics are close, both dealing with the recognition of sequences. The preceding Lecture 8 actively uses results from the general statistical pattern recognition theory. The present Lecture 9 is quite different. It seems to me as if the problem started being examined from another side and from the very beginning. Indeed, in the substantial part of the lecture the term 'probability' does not occur even once, and the outcomes of the preceding lectures are not made use of. On the whole this lecture could be placed at the beginning of the course and nothing would obstruct understanding it. It seems to me that through this lecture the explanation loses its clearly ordered structure, in which its individual parts clung closely to each other. I see Lecture 9 as if hanging in the air, and so a number of questions arise. Was not the explanation of structural recognition started from some other, nonstatistical standpoint? Or have I, perhaps, overlooked an important relationship between the current and the previously explained matter?
You have hardly overlooked anything very important. It may rather have escaped our notice. But if you feel a little bit confused, it is only because you expect more from the theory of pattern recognition than it can offer you in its present-day state of the art. For the time being the theory does not present
method, and also vice versa. The two classes described have a relatively large intersection. On the other hand, for some nontrivial tasks such a reduction has not been found. This also applies to the task of Levenstein matching of a sentence to a regular language, which was discussed in Lecture 9.
We would like to add an important comment. For the time being we do not know a universal formulation that would generalise the tasks of Lecture 8 and Lecture 9. But along with you we have become convinced that the algorithms for solving them are more than merely related. The algorithms are, actually, identical and form the link between the subject matter of those two lectures.
I expect that, beginning with this lecture, the nearest neighbour methods will gradually acquire more weight. But I am surprised that you did not devote at least one lecture to these methods in the first part of the course.
... or at the end of the book? We cannot help turning your question into a joke.
When we go further in the direction of your consideration we can find that the
Figure 9.5 Structural analysis of a sequence with not all lengths of graph edges known.
content of one book depends on what has been written in another book; consequently you have to read all the books first, and only then decide what the meaning of the first symbol in the first book was. One could write poetry on that!
And the content of all such poems can be expressed formally. Let x₁, x₂, ..., xₙ be a sequence and let k₀*, k₁*, ..., kₙ* be a sequence defined as follows,
This question was asked by V.A. Kovalevski of himself immediately after he had designed the first known algorithm for the structural analysis of a sequence [Kovalevski, 1967]. We will illustrate your question by means of the graph in Fig. 9.5, which we have already come across earlier in Fig. 8.2. In the same way as before, the graph is formed by the vertices α and β and a group of |K|·(n+1) vertices. Each of these vertices is represented as a point (k, i), k ∈ K, i = 0, ..., n, with the horizontal coordinate i and the vertical coordinate k. The graph presented illustrates a situation where the set K consists of three states A, B, C.
Your question can now be asked in the following way. Assume you have examined the given graph only as far as the horizontal coordinate l, where l < n. This means that you know only the lengths of those edges lying to the left of the coordinate l, and the other lengths are still unknown to you. In such a situation, of course, you cannot find the shortest path from the vertex α to the vertex β, because the path depends on data still unknown to you. But
you are not interested in the shortest path as a whole. You are interested only in the leftmost vertex through which the path passes. In Fig. 9.5 these three vertices are marked by white circles.
You could answer this question after the following considerations. Even if you do not know what the shortest path from α to β will be, one thing is certain: it passes through one of the vertices whose horizontal coordinate is l. In Fig. 9.5 these vertices are marked by the black circles (A, l), (B, l) and (C, l). You can find the shortest path from the vertex α to each of the vertices marked by black circles. Thus you will find three paths: the shortest path from the vertex α to the vertex (A, l), and similarly those to the vertex (B, l) and the vertex (C, l). Now, you already know that whatever the path from the vertex α to the vertex β may be, its initial section must be one of the three paths which you already know. For each of the three paths you will find through which vertex with horizontal coordinate i = 0, i.e., through which vertex marked by a white circle in Fig. 9.5, it passes. Finally, you arrive at the following conclusion: if all three found paths pass through the same vertex marked by a white circle, e.g., through the vertex (A, 0), then the vertex (A, 0) belongs to the shortest path from α to β.
This means that partial knowledge of the graph is not yet sufficient for an unambiguous decision on the state k₀, and you have to continue examining it.
Is this informal explanation clear to you?
Yes, it is!
F_l(k_l) = max_{k₀} max_{k₁} ··· max_{k_{l−1}} Σ_{i=1}^{l} f_i(k_{i−1}, x_i, k_i)  (9.83)

and I will denote as k₀l(k_l) the first element of the sequence k₀, k₁, ..., k_{l−1} at which the sought maximum (9.83) is reached. The numbers F_l(k_l) and k₀l(k_l) are calculated according to the following recurrent relations

F_l(k'') = max_{k∈K} (F_{l−1}(k) + f_l(k, x_l, k'')),
If for some value l all the values k₀l(k''), k'' ∈ K, are the same, and this value is k*, for example, it means that the initial part of the observation x₁, x₂, ..., x_l is sufficient to determine unambiguously the element k₀* in the sequence k₀*, k₁*, ..., kₙ* which maximises the sum

Σ_{i=1}^{n} f_i(k_{i−1}, x_i, k_i).
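The recurrent computation of F_l with the first-element bookkeeping k₀l, together with the stopping test "all values k₀l(k'') coincide", can be sketched as follows. The function name, return convention and tie-breaking rule are our own choices:

```python
def forced_first_state(x, K, f):
    """After each prefix x_1..x_l compute F_l(k) -- the best score of a state
    prefix ending in k -- and remember which first state k_0 starts that best
    prefix.  As soon as every k agrees on the first state, k_0 is decided and
    is returned together with l; otherwise (None, len(x)) is returned.
    f(k1, s, k2) is the additive edge score of the text."""
    F, first = None, None
    for l, s in enumerate(x, start=1):
        if F is None:
            # l = 1: choose the best k_0 for every k_1 separately
            F, first = {}, {}
            for k1 in K:
                best_k0 = max(K, key=lambda k0: f(k0, s, k1))
                F[k1] = f(best_k0, s, k1)
                first[k1] = best_k0
        else:
            newF, newfirst = {}, {}
            for k2 in K:
                best_k = max(K, key=lambda k: F[k] + f(k, s, k2))
                newF[k2] = F[best_k] + f(best_k, s, k2)
                newfirst[k2] = first[best_k]
            F, first = newF, newfirst
        if len(set(first.values())) == 1:
            return next(iter(first.values())), l   # k_0 is already determined
    return None, len(x)
```
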
Try to imagine now the situation in which you did not start observing the sequence from the very beginning, but from somewhere in the middle. You can assume that the observed object started working at the moment −T in the past and will keep working until the moment T in the future. During its functioning it generated a sequence of symbols
If the preceding sequence were known then you could find the most probable sequence of the states
(9.85)
through which the object had passed. You did not note, however, the moment at which the object started working. You only observe a section of the sequence (9.84), from x_{−l+1} till x_{l−1}. You are expected to answer the question whether this information is sufficient for uniquely determining the element k₀ in the sequence (9.85). In the case of a positive answer, you are also to determine what the element k₀ is equal to.
No, you need not. It may be evident enough for anybody who has thoroughly
studied the two previous lectures. And you have mastered them in an excellent
way.
I would like to discuss one more modification of the examined task with you. We have studied a case in which we wanted to determine the first element k₀ in the sequence k₀, k₁, ..., kₙ which was the most probable with respect to the observed sequence x₁, x₂, ..., xₙ. We wanted to determine it when we did not have the complete sequence x₁, x₂, ..., xₙ at our disposal, but only its initial part.
But if we are interested only in the first element and would like to determine it with the least probability of a wrong decision, we should take quite another approach to the task. You have mentioned this several times and I absolutely agree with you. In this case, if we had the complete sequence x₁, x₂, ..., xₙ at our disposal, we should calculate the |K| numbers p(k₀, x₁, x₂, ..., xₙ), k₀ ∈ K, and select the value k₀ corresponding to the largest of these numbers. I wonder what you will advise me to do if I do not have the entire sequence x₁, x₂, ..., xₙ at my disposal but only its part x₁, x₂, ..., x_l. On the one hand, I can calculate the |K| numbers p(k₀, x₁, x₂, ..., x_l), k₀ ∈ K, and estimate k₀ on the basis of the information I already have. Though the information will be used in an optimal way, the quality of the decision will in the general case be worse than that attainable when observing the complete sequence x₁, x₂, ..., xₙ.
I intend not only to make optimal use of the available information x₁, x₂, ..., x_l, but, moreover, to have a criterion which would guarantee that the quality of a decision made on the basis of this information is not a bit worse than the quality of decisions I could attain through the further observations x_{l+1}, x_{l+2}, .... In other words, I am looking for an answer to the question whether the infinite part of the sequence of observations x_{l+1}, x_{l+2}, ..., x_i, ... is negligible from the standpoint of the information it yields about the state k₀. Once the criterion was met I could interrupt my observations and estimate the state k₀, and I could be certain that the quality of the estimate cannot be enhanced by any further (even infinite) observation of the object. If the criterion was not met then the observation should continue.
The question formulated like this evokes the assumption that I should look for an answer within the framework of Wald sequential analysis. I have not thought about it thoroughly enough, but for the time being I have arrived at the conclusion that I will not find the answer there. I suspect that the answer to my question is either negative or very complicated.
If I denote the probability p(k₀, x₁, x₂, ..., x_l) by p_l(k₀) then I can formulate the question as follows. Does a number l exist such that for any i > l the probabilities p_i(k₀) are identical to the probabilities p_l(k₀)? This question can be answered positively only in the trivial case when the observation x_i, i > l, and the state k₀ are statistically independent. But here I deal with Markovian models, where all parameters depend on each other. This case was illustrated in Lecture 8 with a mechanical model. It follows that even for a rather large i each observation x_i yields some, probably small, information about the state k₀. The result is quite unfortunate: I cannot decide on the first symbol in the book before I have read through the whole book. Is there any other possibility for me than to call for help?
The preceding relation states that the vector qₙ (i.e., the ensemble p(k₀, x₁, x₂, ..., xₙ), k₀ ∈ K) belongs to a convex cone whose boundary is formed by the vectors q_l(k_l) (i.e., the ensembles (p(k₀, x₁, x₂, ..., x_l, k_l), k₀ ∈ K), k_l ∈ K). We will denote the cone by Q_l. The question can then be formulated as follows. When can we unambiguously find what argmax_{k₀∈K} p(k₀, x₁, x₂, ..., xₙ) is equal to if we only know that the ensemble p(k₀, x₁, x₂, ..., xₙ) belongs to the cone Q_l? And the answer to the question is quite evident.
If for every ensemble p'(k₀), k₀ ∈ K, belonging to the cone Q_l the argmax_{k₀∈K} p'(k₀) results in the same value k₀*, then argmax_{k₀∈K} p(k₀, x₁, x₂, ..., xₙ) is also k₀*. It also holds vice versa: if the cone Q_l contains two ensembles (p'(k₀) | k₀ ∈ K) and (p''(k₀) | k₀ ∈ K) such that argmax_{k₀∈K} p'(k₀) ≠ argmax_{k₀∈K} p''(k₀), then from the statement that the ensemble p(k₀, x₁, x₂, ..., xₙ) belongs to the cone no conclusion can be drawn as to what argmax_{k₀∈K} p(k₀, x₁, x₂, ..., xₙ) is equal to.
To find out whether the conditions of the previous statements are satisfied, not all points of the cone Q_l have to be examined (and there are infinitely many of them). It is sufficient to examine only the cone boundaries, i.e., the ensembles (p(k₀, x₁, x₂, ..., x_l, k_l), k₀ ∈ K), k_l ∈ K. You are sure to prove the following two assertions, the latter of which is quite trivial, the former not being very complicated either.
Let a value k₀* ∈ K exist such that for each k₀ ∈ K and k_l ∈ K the following inequality holds
In this case, for some probabilities p(x_{l+1}, x_{l+2}, ..., xₙ | k_l), the value argmax_{k₀∈K} p(k₀, x₁, x₂, ..., xₙ) will vary.
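The boundary test just described can be sketched directly: for each k_l take the boundary ensemble (p(k₀, x₁, ..., x_l, k_l), k₀ ∈ K) and compare their argmaxima. The data layout below is our own choice:

```python
def decided_by_prefix(p_joint, K):
    """p_joint[k_l][k_0] holds the joint probability p(k_0, x_1..x_l, k_l).
    If every boundary ensemble of the cone Q_l points to the same argmax,
    the decision on k_0 is already determined and that k_0 is returned;
    otherwise None is returned (some continuation could still flip it)."""
    winners = set()
    for k_l in K:
        column = p_joint[k_l]
        winners.add(max(K, key=lambda k0: column[k0]))
    if len(winners) == 1:
        return winners.pop()   # k_0 is decided, no further observation needed
    return None
```
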
Your explanation is valid for the case in which we intend to determine the most probable value of the state k₀. I would like to know if it is possible to generalise your considerations to the more general case in which a Bayesian risk for an arbitrary, but a priori known, penalty function is to be minimised.
Of course it is. Have a look at the general case now without our help. Recall the theorem on the convex shape of classes in the space of probabilities which was introduced as early as in the first lecture. We can see that the rifle loaded in the first act of our explanation has fired at last. We had nearly thought that it would not be needed in our course.
Even though I have understood your explanation, I still cannot find where I made a mistake in arriving at the conclusion that each observation x_i yields information about the state k₀.
But you made nearly no mistake! You simply passed from one question to another, considering them to be equivalent. You are right in that any observation x_i, even at large i, makes the information about the state k₀ more exact. But we are asking about something else. We are interested in whether the overall information resulting from infinitely extended observation is sufficient for flipping the decision on the state k₀ from one class to the other. This question, at first glance more difficult, can, as you can see, be quite easily answered.
You are far better off than those who believe that they know Wald sequential analysis quite well. Your advantage over them is, at least, that you see not only the ingenious simplicity of Wald's procedures but also the complexity of a proof that just these procedures solve certain tasks in an optimal way. Wald sequential analysis should not be referred to in a cursory way. That is why we answer your questions, just to allay your quite justifiable doubt. Our brief answer does not claim to be exact and comprehensive.
The questions which we discussed with you do not belong to the 'continent' discovered by Wald. Very roughly speaking, Wald's procedures answer the question of how long an object has to be observed so that its state may be estimated with a previously given quality. Under certain assumptions about the statistical observation model, the observation sequence has been successfully proved to converge in a certain sense. This is to be understood as saying that any quality of estimate is attainable, even if it can sometimes be attained only through a long observation. And we particularly bring to your attention that this convergence occurs only under certain assumptions concerning the object.
Sometimes, with an inaccurate reference to Wald, it is stated in a vulgar way that by increasing the number of observed features of the object an arbitrarily high recognition quality can be attained. This is, of course, only a negligent manipulation of Wald's results. Just now we have, together with you, become convinced that in observing Markovian objects and in analysing Markovian sequences it may easily happen that the information obtained so far is not enough for a sufficient quality of recognition, and yet further observation of the object ceases to be decisive, because further observation cannot affect the decision, and so cannot enhance it either.
Together with you we have also examined a task in which the observation of an object can be interrupted. But as the condition for the interruption we did not take the attainment of a previously given recognition quality (as was the case with Wald), but a situation in which the quality attained, whatever it may be, cannot improve any further, even through an infinitely long observation. There are, therefore, two different conditions for interrupting the observation of an object, and we cannot guarantee that either of them will be satisfied. For some Markovian objects the necessary quality of the state estimate can be unattainable however long the observation may be, and then Wald's condition for interrupting the observation will never be satisfied. For other Markovian objects a situation may occur, even after quite a long observation, in which further observation can still flip the decision on the state made on the basis of the already known observations.
And only now can we ask a really interesting question. We are asking how these two conditions for interrupting the observation interact. Can one state that the observation of any Markovian object is sure to be interrupted at some time, either because the observations already obtained allow us to recognise the state with the required quality, or because further observation cannot improve the quality already attained?
Well, go ahead!
of the best one hundred sequences does not satisfy the additional conditions which were not taken into consideration when creating the set L.
To be able to use such a technique I need to have an algorithm that is capable of finding the d best paths in a graph. I wish to get an efficient algorithm whose computational complexity does not rise too much with increasing d.
All that you have said is right. We only do not understand what the core of the question is.
I do not know what the algorithm for finding the d best paths in a graph should look like.
We do not believe that. We can guess that the algorithm for solving that task has a complexity which rises linearly with increasing d. This means that seeking the d best paths is at most d times more complex than seeking one single best path, and thus its computational complexity is O(|K|² d n).
Now it is I who does not believe it. Could you please explain the algorithm to me in more detail?
Let us try it together. But first, tell us the train of your considerations so that we do not set out the wrong way for a second time.
Assume I have a graph G which determines the set of paths L from the vertex α to the vertex β. I assume I have already found the path k* ∈ L which is the shortest in the set L. My job now is to find the path k¹ which is the shortest in the set L \ {k*}. The difficulty is that the subset L \ {k*} cannot be represented as the set of paths of a subgraph of G. No edge of the graph G can be excluded, since some path from the set L \ {k*} passes through each edge. The set L \ {k*} contains not only paths which do not intersect the path k*, but all the paths which diverge from the path k* in at least some section of it. In spite of all these difficulties, a new graph G¹ can be created such that the set of its paths represents just the set L \ {k*}. But the new graph will have twice as many vertices as the original graph. This means that seeking the two best paths will be three times more difficult than seeking a single best path. That would still do, but I do not think I can continue in this way, since the number of vertices of a graph whose paths match the set L \ {k*, k¹, k², ..., k^d} (where k*, k¹, ..., k^d are the paths found earlier, and thus firmly determined) is 2^d times larger than the number of vertices of the graph G whose paths match the set L.
We have understood your difficulties and we will show how to get over them.
But it will be the last opportunity in our lectures for you to see the fruitful way
of stating structural analysis tasks as algebraic expressions in appropriate semi-
rings. We believed that we had sufficiently explained the subject matter when
I admit I have not yet included the algebraic methods explained among the tools I actively use. The Levenstein matching did not seem so convincing, since you had deliberately, and as I see it rather unnecessarily, complicated the task by letting the edit functions in, de and ch be without any restriction. Even some weak and quite natural assumptions on the edit functions are sufficient to make the task at least reasonably solvable, if not quite simple; I mean, for example, a constraint in the form of the triangle inequality. I have heard about Wagner's algorithms [Wagner and Fischer, 1974; Wagner and Seiferas, 1978], which solve the task not only for regular languages, but for context-free ones as well.
And do not the difficulties you see in the task of the d best paths in a graph seem to you to be accumulated in an artificial way?
No, they do not. But I already suspect that you know a solution of the task which will be quite unexpected for me.
Well. Now look how the task of the d best paths is formulated in the form of generalised convolution expressions. You will see that a task formulated like this is hardly worth mentioning. But be patient, since for brevity's sake the subject matter will be explained in a similar way as can be found in the most indigestible pseudo-mathematical articles, where something is referred to without saying in advance what it is good for.
Let us first write down, in the form of an enumeration, the main notions which are necessary for seeking the d best paths in a graph.
1. Let R be the set of nonnegative real numbers extended by a special 'number' ∞, and assume that for an arbitrary a ∈ R both ∞ + a = ∞ and min(∞, a) = a hold.
2. Let R^d be the set of ordered ensembles of the form (a₁, a₂, ..., a_d), where a_i ∈ R, i = 1, 2, ..., d, and a₁ ≤ a₂ ≤ ... ≤ a_d.
3. On the set R^d × R^d a function of two variables is defined which assumes its values in R^d. The function will be called the addition of ensembles. For each pair a ∈ R^d and b ∈ R^d the function determines the sum c = a ⊕ b in the following way. If a = (a₁, a₂, ..., a_d) and b = (b₁, b₂, ..., b_d) then for the calculation of their sum it is needed
• to create an ensemble (a₁, a₂, ..., a_d, b₁, b₂, ..., b_d) of length 2d;
• to order the ensemble in ascending order;
• to regard the first d numbers of the ensemble as the sum a ⊕ b.
Addition defined in this way is an associative and commutative operation with the null element 0_⊕, which is the ensemble (∞, ∞, ..., ∞).
4. On the set R^d × R^d a function of two variables is defined which assumes its values in R^d. The function will be called the multiplication of ensembles. For each pair a = (a₁, a₂, ..., a_d) and b = (b₁, b₂, ..., b_d) the function determines their product a ⊗ b such that
• an ensemble (a_i + b_j, i = 1, 2, ..., d; j = 1, 2, ..., d) of length d² is created;
• the ensemble is ordered in ascending order;
• the first d numbers of the ensemble are regarded as the product a ⊗ b.
Multiplication defined in this way is an associative and commutative operation with the unit element 1_⊗, which is the ensemble (0, ∞, ..., ∞).
5. The multiplication introduced is distributive with respect to the addition introduced earlier, i.e., a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c). By the distributivity, and also because the product of any ensemble with the introduced null is again the null, the given operations of addition and multiplication form a semi-ring on the set R^d.
6. Let X be a finite set and f: X → R a nonnegative real function. We will denote by f' the following function which assumes its values in the set R^d. For x ∈ X the value f'(x) is an ensemble that consists of d numbers, where the first element is f(x) and the further d−1 elements are ∞. It follows from this definition that the product ⊗_{x∈X} f'(x) is an ensemble of d numbers in which the first number is Σ_{x∈X} f(x) and the further d−1 elements are ∞. The sum ⊕_{x∈X} f'(x) is an ordered ensemble which contains the d smallest numbers of the ensemble (f(x), x ∈ X).
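The ensemble operations of the enumeration above can be sketched directly in code; kept-sorted tuples stand for the ensembles, and the representation is our own:

```python
import heapq

INF = float("inf")

def oplus(a, b):
    # item 3: merge the two sorted ensembles and keep the d smallest entries
    d = len(a)
    return tuple(heapq.merge(a, b))[:d]

def otimes(a, b):
    # item 4: form all d*d pairwise sums and keep the d smallest
    d = len(a)
    return tuple(heapq.nsmallest(d, (ai + bj for ai in a for bj in b)))

def zero(d):
    # null element of oplus: (inf, ..., inf)
    return (INF,) * d

def one(d):
    # unit element of otimes: (0, inf, ..., inf)
    return (0.0,) + (INF,) * (d - 1)
```
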
Now we can state the original formulation of the task of seeking the d best paths in a graph.
Let X and K be finite sets and p_i, i = 1, 2, ..., n, be n functions of the form K × X × K → ℝ. For any sequence x = (x₁, x₂, ..., xₙ), x_i ∈ X, and any sequence k = (k₀, k₁, ..., kₙ), k_i ∈ K, the number

F(x, k) = Σ_{i=1}^{n} p_i(k_{i−1}, x_i, k_i)  (9.88)

is defined. The aim of the task is to find an algorithm which for each sequence x ∈ Xⁿ determines the d smallest numbers of the ensemble (F(x, k), k ∈ K^{n+1}) that consists of |K|^{n+1} elements.
Finally, we can proceed to the algebraic formulation of the task of seeking the d best paths in a graph and to its solution.
By means of the introduced operations that add and multiply ensembles of length d, the formulated task is reduced to seeking the ensemble Q given by the formula

Q(x) = ⊕_{k∈K^{n+1}} F'(x, k) = ⊕_{k∈K^{n+1}} ⊗_{i=1}^{n} p'_i(k_{i−1}, x_i, k_i).  (9.89)
9.6 Discussion 457
In the preceding formula the notations F' and p'_i are used instead of F and p_i from the formula (9.88). The values of the functions F' and p'_i (primed) are ensembles of length d, and not numbers, for which the denotations F and p_i (without primes) were used.
In the definition (9.89) nothing will be changed if each summand is multiplied by the ensembles φ'(k_0) and ψ'(k_n) which are units, i.e., ensembles (0, ∞, ..., ∞). Thus we have
I can say I already understand it. I am now expected to gradually calculate the sum in the innermost parentheses. The result of each such addition will be |K| ensembles of length d. I will first calculate the ensembles f_{n-1}(k_{n-1}) of length d, where k_{n-1} ∈ K. It will be according to the formula

and then I will calculate the ensembles f_{n-2}(k_{n-2}) for each k_{n-2} ∈ K,

The ensemble Q(x) I am seeking will then be found according to the formula
You stopped before the last step, which would have much pleased us. The calculation according to the formulæ (9.92), (9.93) and (9.94) can be regarded as a calculation of the matrix product

(9.95)
The solution of the task presented here is the most convincing illustration of
the fruitfulness of the algebraic expressions. I do not mean by that the final
result (9.95), but I admire how brief and transparent the path from a formal
definition of the task (9.89) to the computational procedure in (9.92), (9.93),
(9.94) can be. This intelligibility becomes even more impressive when I compare
the algebraic formulation of the task with its original graph formulation. The
algebraic formulation immediately indicates the direction in which to look for
the solution. The object sought (in our case an ensemble of numbers) is defined by the formula (9.95). The analysis of the task lies in formal transformations of this formula which are guaranteed to leave the defined object unchanged.
It was quite different with the original task. The task was not defined by a formula; the definition was verbal. It was stated that the d shortest paths in the graph are to be found. This formulation seems illustrative only at first glance, because it is not supported by any apparatus by which a verbal formulation can be transposed to another verbal formulation. The researcher who tries to solve the task in its verbal formulation can rely only on his/her rational considerations. These considerations can be very extensive and, besides leading in the right direction, they can put the researcher on many other paths. Just this kind of 'intelligibility' routed me at first toward task-solving algorithms which were not practically implementable. At that time the 'intelligibility' simply disoriented me.
Now I have understood at last how the algebraic representation of the task of seeking the d best paths revealed its simplicity, which was treacherously hidden in the graph representation. The graph representation does represent something, but it is not what is to be found in the formulated task. The object sought, i.e., a group of d best paths, cannot be represented in the graph form at all. When we superimpose the paths of some group on one another we obtain a graph which contains not only the paths of the group but also a lot of irrelevant paths.
Now that I have seen the actual simplicity of the task examined, I feel I could master even more general tasks. I have in mind, for example, the case
We are sure that you will easily generalise the computational procedure (9.95)
even for the case of acyclic graph structures. But be careful in generalising
Levenstein matching. One of the most significant properties, thanks to which
Levenstein approximation was successfully managed, is the idempotent prop-
erty of addition which was minimisation in our particular case. Only thanks
to the idempotent property did we manage to prove that some infinite con-
volution expressions have finite equivalents. Remember Lemmata 9.2 and 9.3
of the lecture. Adding d-dimensional ensembles, as we have defined it for the task of the d best sequences, is not idempotent. Therefore you will have to devise something similar to the above mentioned Lemmata 9.2 and 9.3 for the given case.
I would like to come back to the computational procedure (9.95) and see what
form it will assume when I do not express it by means of macro operations of
multiplying and adding of ensembles, but through elementary level operations
dealing with individual numbers, i.e., with elements of ensembles.
(9.96)
A sequence of |K|-dimensional row vectors f_1, f_2, ..., f_n is to be calculated, where f_1 = φ · P_1 and f_i = f_{i-1} · P_i, i = 2, ..., n. Then a 'scalar product' f_n · ψ is to be calculated. The computational complexity of (9.96) is n times greater than the computational complexity of multiplying the |K|-dimensional row vector by a matrix of dimension |K| × |K|. I will examine the complexity of this multiplication. I will take into consideration that the components of the vector f_{i-1} and the matrix P_i are d-dimensional ensembles, not numbers. The component f_i(k) of the vector f_i = f_{i-1} · P_i is an ensemble of the length d that is determined by the sum
To create the ensembles f_i(k) for all k ∈ K means to create |K|² auxiliary ensembles

c(k', k) = f_{i-1}(k') ⊗ P_i(k', k),  k' ∈ K,    (9.98)
460 Lecture 9: Regular languages and corresponding pattern recognition tasks
You have missed some important properties of ensemble addition. You are right that the addition a_1 ⊕ a_2 has the complexity O(d). But it does not follow from this that the complexity of the addition a_1 ⊕ a_2 ⊕ ... ⊕ a_m is O((m - 1) d). Actually, it is substantially less, being just O((m - 1) + (d - 1) log m). When the calculation procedure is clear to you, you will arrive at the conclusion that the calculation of the product f_{i-1} · P_i is not of complexity O(|K|² d), but of a more favourable value O(|K|² + d |K| log |K|). Think over the program for computing the product of the vector f_{i-1} with the matrix P_i more thoroughly.
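The complexity claim above can be made concrete by a k-way merge: to take the d smallest elements of the union of m already ordered ensembles, it suffices to build a heap over the m current heads, in O(m), and then perform d pop/push steps of cost O(log m) each. A Python sketch follows; the function name `ens_sum` is my own, and a binary heap stands in for the tournament tree discussed later, since both give the same asymptotics:

```python
import heapq

def ens_sum(ensembles, d):
    """d smallest elements of the union of m sorted ensembles,
    touching only O(m + d log m) heap entries instead of all m*d numbers."""
    heap = [(row[0], i, 0) for i, row in enumerate(ensembles) if row]
    heapq.heapify(heap)                      # O(m)
    out = []
    while heap and len(out) < d:
        v, i, j = heapq.heappop(heap)        # O(log m) per extracted element
        out.append(v)
        if j + 1 < len(ensembles[i]):        # advance within the same ensemble
            heapq.heappush(heap, (ensembles[i][j + 1], i, j + 1))
    return out

print(ens_sum([[1, 4], [2, 3], [0, 5]], 2))  # [0, 1]
```

The key observation is that each ensemble is already sorted, so only its current head can ever be the next overall minimum.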
Not to deal with unnecessary details, I will introduce a simplified notation for calculating the ensemble f_i(k) for one certain k ∈ K, in agreement with the formula (9.97). If the value k is fixed then it can be omitted, and the formula (9.97) assumes the form

f' = ⊕_{k ∈ K} f(k) ⊗ P'(k) .    (9.100)

[Figure 9.6: Arrangement of data for effective computation of the product of the vector f_{i-1} with the matrix P_i. The squares contain the numbers f(k, q).]
The first number selected is evidently min_k (f(k, 0) + P(k)). To determine this number is not complicated, and it is obvious that its computational complexity is O(|K|). However, I do not intend to seek the minimum in the common fashion, but by means of a data structure which is represented by the tree in the right-hand side of Fig. 9.6. The vertices of the tree are labelled by the values b(j) in Fig. 9.6. Their number is 2|K| - 1. The numbers b(j) are determined by the following program fragment
for (k = 0; k < |K|; k++)
   { b(k) = f(k,0) + P(k);  q*(k) = 1; }
k = 0;  j = |K|;
while (j ≠ 2|K| - 1)  {                                  (9.101)
   if (b(k) ≤ b(k+1))
      { b(j) = b(k);  ind(j) = k; }
   else
      { b(j) = b(k+1);  ind(j) = k+1; }
   k = k + 2;  j = j + 1;
}
When the program finishes, at each internal tree vertex a number is written which is the smaller of the two numbers written at the vertices connected with it by edges and lying to the left of it. For example, b(9) = min(b(2), b(3)) and b(12) = min(b(8), b(9)). The numbers ind(j), j = |K|, ..., 2(|K| - 1), are indices showing which of the two numbers was copied from the left to the right. The indices are represented by arrows in Fig. 9.6.
For example, the arrow from the ninth vertex points towards the third vertex, which means that ind(9) = 3, and thus b(3) ≤ b(2). It is evident that with such a data arrangement the number b(2(|K| - 1)) at the root of the tree is the smallest of the numbers b(j), j = 0, ..., |K| - 1. The chain of arrows shows the number j of the tree leaf where the least number lies. The arrows in Fig. 9.6 indicate, for example, that b(14) = b(3). The program (9.101) also determines the indices q*(k), which indicate how many numbers have been taken from the row f(k, q), q = 0, ..., d - 1, and written into the tree leaves.
The above mentioned program for finding the least number of the ensemble (f(k, 0) + P(k), k = 0, ..., |K| - 1) is exceedingly complicated only at first glance. In fact, its complexity is O(|K| - 1), i.e., of the same order as that of the common procedure seeking the least number by a simple examination of the numbers f(k, 0) + P(k), k = 0, ..., |K| - 1. An important advantage of the algorithm (9.101) is that in addition to seeking the least number it creates supplementary data which allow us to find the succeeding least number not in (|K| - 1) operations, but in substantially fewer, log |K|, operations. I will show how this occurs.
Assume that the number written first into the output ensemble f' was a number from the k*-th row of the ensemble f(k, q). This means that the number f(k*, 0) + P(k*) = f'(0) was the least number in the ensemble
Algorithm 9.2 Seeking a further least number in the remaining part of the ensemble.
1. It is to be found from which tree leaf the last number f'(q - 1) was taken,

   k* = ind(2(|K| - 1));
   while (k* ≥ |K|)  k* = ind(k*);      (9.102)
After executing the previous commands the number k* informs us that the last number f'(q - 1) written to the output ensemble f' was the number f(k*, q*(k*) - 1) + P(k*).
2. The succeeding number of the k*-th row of the ensemble f(k, q) is taken out and the number in the k*-th tree leaf is changed.
3. With respect to the change of the number in the k*-th tree leaf, new data in the tree are computed,
4. The number from the tree root is written at the position q of the output ensemble f' and the number q is incremented by one.
In Algorithm 9.2 the operation k* xor 1 represents the inversion of the least significant bit in the binary representation of the non-negative integer k*. The operation k*/2 is an integer division, which discards the information in the least significant bit of the binary representation of the number k*. The index j2 points to the vertex on the path from the vertex k* to the root that is connected with the vertex k* by an edge. The index j1 is the second vertex that is connected with the vertex j2, its distance from the tree root being greater than that of j2. The meaning of the indices k*, j1 and j2 can be understood from Fig. 9.7.
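The tournament tree of program (9.101) and the leaf-to-root repair of Algorithm 9.2 can be put together in one self-contained Python sketch. This is my own transcription, not the book's program: it assumes for simplicity that |K| is a power of two, the function name `d_smallest` is mine, and the role of the leaf indices is played here by the row counters q*:

```python
import math

def d_smallest(rows, d):
    """d smallest numbers of the union of |K| sorted rows, via a tournament
    tree: vertices 0..2|K|-2, b[j] = smaller child, ind[j] = winning child."""
    K = len(rows)                      # assumed a power of two
    q = [0] * K                        # q*(k): numbers already taken from row k
    b = [r[0] for r in rows] + [math.inf] * (K - 1)
    ind = [0] * (2 * K - 1)
    k, j = 0, K                        # build phase, as in program (9.101)
    while j != 2 * K - 1:
        ind[j] = k if b[k] <= b[k + 1] else k + 1
        b[j] = b[ind[j]]
        k, j = k + 2, j + 1
    root, out = 2 * K - 2, []
    for _ in range(d):
        out.append(b[root])            # root holds the current minimum
        leaf = ind[root]               # descend the arrows to the winning leaf
        while leaf >= K:
            leaf = ind[leaf]
        q[leaf] += 1                   # take the next number of that row
        b[leaf] = rows[leaf][q[leaf]] if q[leaf] < len(rows[leaf]) else math.inf
        j = leaf                       # repair only the leaf-to-root path
        while j != root:               # O(log |K|) repetitions
            sib, parent = j ^ 1, K + j // 2   # the xor/div indexing of the text
            ind[parent] = j if b[j] <= b[sib] else sib
            b[parent] = b[ind[parent]]
            j = parent
    return out

print(d_smallest([[3, 9], [1, 4], [6, 7], [2, 5]], 4))  # [1, 2, 3, 4]
```

After the O(|K|) build, each further minimum costs only the O(log |K|) repair of one root-to-leaf path, exactly the gain claimed for Algorithm 9.2.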
After Algorithm 9.2 has ended, the number of positions already written in the output ensemble has increased by one, and the data needed for seeking the succeeding number to be written into the output ensemble have been prepared. The complexity of Algorithm 9.2 is given by the number of cycle repetitions in its fragments (9.102) and (9.104). This number is always O(log |K|), and it is the complexity of finding one element (but not the first one) of the output ensemble f'.

[Figure 9.7: Meaning of the indices k*, j1 and j2 in Algorithm 9.2.]

The creation of the whole ensemble of the length d, i.e., the calculation (9.97), consists of the calculation of the first element by means of the program (9.101), having the complexity |K|, and of the calculation of the (d - 1) other elements. The total complexity thus is O(|K| + (d - 1) log |K|). The ensemble (9.97) is to be calculated for each k ∈ K, and therefore the complexity of the operation f_i = f_{i-1} · P_i is O(|K|(|K| + (d - 1) log |K|)). It follows from this that the complexity of calculating the ensemble (9.96) of the length d is
This is a damned interesting result. I expected that the search for the d best sequences would be substantially more complicated than the search for one single best sequence. But in reality the algorithm shows quite the opposite property. It is the first best sequence that requires the most calculation. When we look for the first sequence with deliberation, seeking any further sequence is less demanding.
Since you understand so well the advantage attained by a carefully thought out computational procedure, we can advise you to continue analysing our task further. In a more detailed examination of the task's properties you will find that the computational complexity is even less than (9.105). You will see that the complexity is only

O(|K|² n + (log |K|) n (d - 1)).    (9.106)
Note that the number (log |K|) n (d - 1) is simply the number of bits needed merely to store the (d - 1) shortest paths found. This means that if you create an algorithm of the complexity (9.106) then finding any succeeding shortest path will have a complexity of the same order as simply writing this path to the output memory. It will be an algorithm which seeks the shortest paths in such a deliberate way that each succeeding path will look as if the paths were only rewritten from one memory to another.
If I had no experience of discussions with you, I would believe that further reducing the computational complexity was not possible. Well now, the matrix product

(9.107)

cannot be calculated in any other way than by an n-times repeated product of the vector f with the matrix P, can it? The complexity of such a calculation cannot be less than O(n C), where C is the complexity of calculating f P. But the result of multiplying f and P is an ensemble consisting of |K| ensembles of length d. Thus |K| d numbers have to be sought and stored. Because of that, the calculation of the product f P must have a component whose complexity is O(|K| d). Consequently, the n-fold product of the vector with the matrix, which is needed for the calculation of (9.107), must have a component of complexity O(|K| d n). But a contribution corresponding to this component I cannot find in your estimate (9.106) of the complexity.
Before showing where you made your mistake let us go back to the computa-
tional procedures which we have already dealt with.
You and we clearly understand that adding two ensembles a_1 and a_2 has a complexity O(d). There is no hope for improvement here, since an ensemble of length d is being created. It would seem to follow that the complexity of adding n ensembles a_1, a_2, ..., a_n cannot be less than O((n - 1) d), since the addition can be done (attention, an error will follow!) only in such a way that a sum of two summands is computed, then of two others, and then to the already computed sums further summands are added in turn, and so forth. It would follow that we cannot do without n - 1 additions.
But just a while ago you proposed an excellent algorithm for calculating the sum ⊕_{k=1}^{n} a_k whose complexity is not O((n - 1) d) at all, but substantially less, namely O((n - 1) + (d - 1) log n). Even when you scrutinise your algorithm as closely as possible, you will not find anything which would indicate that the auxiliary data could be interpreted as partial sums of subsets of summands, in our case of subsets of ensembles. This means that you have managed to create an algorithm for adding a set of ensembles while avoiding the addition of their subsets.
Now let us return to calculating an ensemble of length d

(9.108)

f_0 = φ ,
f_i = f_{i-1} P_i ,  i = 1, 2, ..., n,    (9.109)

where f_n(k) and ψ(k) at any k ∈ K are ensembles of the length d. You examined that calculation and arrived at the correct result that its complexity is O(|K| + (d - 1) log |K|). Go thoroughly through your algorithm once more and notice that for the calculation (9.110) you need not have complete information on all |K| ensembles f_n(k), k ∈ K. The complete information about them consists of |K| d numbers, but only d of them are necessary for calculating (9.110). The unnecessary numbers need not be calculated.
You can continue with considerations like that. For calculating the vector f_n, complete information about the vector f_{n-1} is not needed either. To obtain partial information about the vector f_n, even fewer data on the vector f_{n-1} suffice. The procedure of computing (9.109) is in a sense a bit wasteful. In each step i = 1, 2, ..., n, the procedure creates spare data, and a large part of these data will never be used. When you design an algorithm such that only what is used in the calculation is included in it, you will see that you can calculate the whole product (9.108) while avoiding the calculation of the products φ P_1, φ P_1 P_2, φ P_1 P_2 P_3, etc.
I will go back to the graph interpretation of the task. I am going to speak about a graph whose vertices are α and β (see Fig. 9.5) and a further K (n + 1) vertices of the form (k, i), k = 0, 1, ..., K - 1, i = 0, 1, ..., n. By the symbol K I denote not the set of values of the variable k but their number, since further on in the algorithm I will use the value k as an integer index.
The graph contains the following oriented edges. K edges point from the vertex α to the vertices of the form (k, 0); the length of the edge (α, (k, 0)) is φ(k). From each vertex (k', i - 1), K edges point to the vertices (k'', i); the length of each of these edges is P_i(k', k''). From each vertex (k, n) a single edge points to the vertex β and is of the length ψ(k).
I am interested in the complexity of the algorithm that will find the d shortest paths from the vertex α to the vertex β.
For the analysis of the task and of the algorithm of its solution I will introduce the following notation. For each vertex γ of the graph which corresponds either to a pair (k, i) or to β, I will denote by C(γ) the set of paths from the vertex α to the vertex γ. I will order this set in ascending order of the path length and denote by c(γ, q) the q-th path in this ordered set. The paths in the ordered set C(γ) are enumerated starting from zero. The shortest path in the ordered set C(γ) is therefore c(γ, 0). In the program presented later the sets C(γ) and the paths c(γ, q) are not explicitly represented. They do not belong to the objects which the program has to manipulate; they are concepts explaining the sense of the data that the program creates and transforms. These data are subdivided into groups corresponding to the graph vertices. The sense of the data is the same for all vertices, as is their processing. But the formal descriptions of their meaning differ for the vertices of the form (k, i) and for β. I will explain the data and the rules for their processing in a rather detailed way only for the vertices of the form (k, i). The processing of the vertex β will be explained in less detail, and I hope that the explanation will be clear even without detailed comments.
The most important group of data in the algorithm is the ensemble of numbers f(k, i, q), which stand for the length of the path c(k, i, q), i.e., the q-th path from the vertex α to the vertex (k, i). The number f(β, q) stands for the length of the q-th best path from α to β. For a fixed i the ensemble f(k, i, q) is just the row vector f_i in the algebraic representation of the task, and the ensemble of numbers f(β, 0), f(β, 1), ..., f(β, d - 1) is the ensemble expressed by the product in (9.109). If the situation q ≥ |C(k, i)| occurs then I consider f(k, i, q) = ∞.
For each vertex (k, i) and each number q the indices k'(k, i, q) and q'(k, i, q) are defined. These indices have the following meaning: the path c(k, i, q) consists of the path c(k'(k, i, q), i - 1, q'(k, i, q)), to the end of which the vertex (k, i) is appended. By means of the indices k'(k, i, q) and q'(k, i, q) any path c(k, i, q) can be reconstructed. For example, the q-th path from α to β is determined by

k_n = k'(β, q),  q_n = q'(β, q),
k_{i-1} = k'(k_i, i, q_i),  q_{i-1} = q'(k_i, i, q_i).
The algorithm, which will be presented later, calculates the quoted data successively, in such a way that at every step the data f(k, i, q), k'(k, i, q), q'(k, i, q) are available only for some triplets (k, i, q), not for all. Which part of the data is already known is determined by the numbers q*(k, i). They indicate, for each vertex (k, i), that the data f(k, i, q), k'(k, i, q), q'(k, i, q) are available for the q*(k, i) best paths leading to the vertex (k, i).
In addition to the quoted data, further data relate to each vertex (k, i); they will be referred to as the tree of the vertex (k, i). These data have the same structure as that presented in Fig. 9.6. The vertices of the tree (k, i) correspond to triplets (k, i, j), j = 0, 1, ..., 2(K - 1), the numbers b(k, i, j) and the indices ind(k, i, j) being defined for every vertex. The root of the tree (k, i) corresponds to the index (k, i, 2(K - 1)). The number b(k, i, 2(K - 1)) in the root is the number f(k, i, q*(k, i)), i.e., the length of the longest path of all the already determined paths, which is the q*(k, i)-th shortest path of the set C(k, i) of paths from α to (k, i). The leaves of the tree (k, i) correspond to the indices (k, i, j), j = 0, ..., K - 1. The numbers b(k, i, 0), b(k, i, 1), ..., b(k, i, K - 1) and the indices ind(k, i, 0), ind(k, i, 1), ..., ind(k, i, K - 1) are stored in the leaves. The number b(k, i, k') is equal to the number f(k', i - 1, q) + P_i(k', k) for some q, the value of which is stored as ind(k, i, k'). The numbers b(k, i, j) and the indices ind(k, i, j) in all the other vertices of the tree (k, i) have the same sense which I mentioned before when explaining Fig. 9.6. So the indices ind(k, i, j) in the leaves of the tree have a slightly different meaning than in the other vertices of the tree.
At first glance the data manipulated by the algorithm seem to be quite numerous. But fortunately only at first glance. The data are subdivided into two groups. The former contains the quantities f(k, i, q), k'(k, i, q) and q'(k, i, q). The size of memory required for storing these data depends on the numbers q*(k, i), because it is proportional to Σ_k Σ_i q*(k, i). Further on we will see that Σ_{k=0}^{K-1} q*(k, i) is not greater than the number q*(β). This number q*(β) indicates how many shortest paths from the vertex α to the vertex β have been created. The total size of the memory for f(k, i, q), k'(k, i, q) and q'(k, i, q) is then not greater than Σ_{i=0}^{n} q*(β) = (n + 1) q*(β). This is memory of the same order of size as the memory needed for storing the final results.
The other data group is formed by the trees for each graph vertex. The total size of the memory for storing the trees is O(K² n), which corresponds to the size of memory needed for storing the input data of the task, i.e., the information on the edge lengths P_i(k', k). It can be seen that the demands for memory are not exaggerated either. The required memory is of the same order as the memory needed for storing the input data of the task and the results of its solution. If these data are exceedingly numerous nothing can be done; it means that the task under solution is really too extensive.

Quite a different question is that the data are quite diverse and cannot be surveyed at one glance to see all the mutual relationships among them. That is the programmer's trouble, whose job is to write the program so that the required relations between the data are not violated when transforming or supplementing them.
Finally, our patience has been rewarded and you well understand the advantages
of the algebraic expression of tasks we have been dealing with.
I will present an auxiliary algorithm which will be the most substantial part of the algorithm solving the problem. I will call the algorithm NEXT(k, i) and define its function as follows. The algorithm changes the given data in such a way that the number q*(k, i) is increased by one and all the other data are adapted to this new value. Before starting the algorithm NEXT(k, i) the data yielded information on the q best paths from the vertex α to the vertex (k, i). When the algorithm NEXT(k, i) stops, the data have been transformed so that they yield information on the (q + 1) best paths.
As a prototype for the algorithm NEXT(k, i), the algorithms (9.102), (9.103) and (9.104) for calculating the ensemble ⊕_{k ∈ K} f(k) ⊗ P(k), where f(k) and P(k) are ensembles, will serve. The prototype has to be modified with respect to the fact that the algorithm NEXT(k, i) has to transform different data according to the input arguments (k, i). The algorithm will also include a command which states what has to be done if the data the algorithm is to use are not yet available.
1. First it has to be found which number was taken out of the ensemble f(k', i - 1, q) when the number f(k, i, q*(k, i) - 1) was being computed, i.e., the length of the q*(k, i)-th path from the vertex α to the vertex (k, i). In the prototype algorithm the command (9.102) was used for this purpose. Now it is no longer needed, as the necessary information is stored in the data k'(k, i, q*(k, i) - 1) and q'(k, i, q*(k, i) - 1). Instead of the command (9.102) the following command is performed

   if (q*(k*, i-1) = q* + 1)  NEXT(k*, i-1);
   b(k, i, k*) = f(k*, i-1, q* + 1) + P_i(k*, k);      (9.111)
The preceding operation resembles the command (9.103) from the prototype program, with the important difference that the command (9.111) also contains the first line, which the command (9.103) did not. The number f(k*, i - 1, q* + 1), which is necessary for executing the second line of the command (9.111), may not yet be available. This happens when q* + 2 > q*(k*, i - 1). But before the command (9.111) was carried out, the number f(k*, i - 1, q*) was already available. This means that q*(k*, i - 1) ≥ q* + 1. It follows that if the number f(k*, i - 1, q* + 1) is not available then q*(k*, i - 1) is exactly q* + 1. To create the number f(k*, i - 1, q* + 1) it is sufficient to let the program NEXT(k*, i - 1) run just once, which occurs in the first line of the command (9.111).
3. The data in the tree (k, i) are transformed with respect to the change of the
number b(k, i, k*) in the tree leaf, which is done by means of a program which
only slightly differs from the prototype program (9.104).
of which is O(log K), and perhaps also of the computation of the function NEXT(k'', i - 2) for one certain k''. Since the function NEXT(k, 0) will not be called even once, the overall complexity of the algorithm NEXT(k, i) is O(i log K).
It can easily be understood that the algorithm NEXT(k, i) presented can, through a slight modification, be extended so that its domain of definition includes not only the vertices of the form (k, i) but also the target vertex β. For this modification it is sufficient to write in the programs (9.111), (9.112) and (9.113) β instead of every pair (k, i), as well as n instead of i - 1. Executing the algorithm NEXT(β) will have a complexity of at most O(n log K).
It is self-evident that before the first start of the program NEXT(β) the data must be initialised to become consistent.
The initialisation presented contains, in fact, all the computations needed to seek the shortest path from α to β. The results of the search and further auxiliary data are stored in such a form that the program NEXT may be used. The complexity of the initialisation is O(K² n).
In this way I have arrived at a result in which seeking the d shortest paths from α to the vertex β consists of the initialisation, having the complexity O(K² n), and of the (d - 1)-fold running of the program NEXT(β), having the complexity O(n log K). The overall complexity of the task solution is not greater than

O(K² n + (d - 1) n log K).
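The whole lazy scheme, NEXT included, fits into a short Python sketch. This is only my transcription of the idea, not the book's program: I replace the tournament trees by heaps (which give the same O(log K) per step) and compute each f_i(k) on demand only, in the spirit of the recursive call in (9.111). All names (`d_best`, `get`, the list encodings of φ, P, ψ) are mine:

```python
import heapq
import math

def d_best(phi, P, psi, d):
    """d smallest path lengths from alpha to beta in the layered graph:
    phi[k] first-edge lengths, P[i][k'][k] layer edges, psi[k] last edges.
    The values f[i][k] are produced lazily, only as far as actually needed."""
    n, K = len(P), len(phi)
    f = [[[] for _ in range(K)] for _ in range(n + 1)]   # known path lengths
    cand = [[None] * K for _ in range(n + 1)]            # candidate heaps
    for k in range(K):
        f[0][k].append(phi[k])                           # single path to (k, 0)

    def get(i, k, q):
        """q-th best length from alpha to vertex (k, i), computed on demand."""
        if cand[i][k] is None and i > 0:                 # lazy heap creation
            cand[i][k] = [(get(i - 1, kp, 0) + P[i - 1][kp][k], kp, 0)
                          for kp in range(K)]
            heapq.heapify(cand[i][k])
        while len(f[i][k]) <= q:
            if i == 0 or not cand[i][k]:
                return math.inf                          # fewer than q+1 paths
            v, kp, qp = heapq.heappop(cand[i][k])
            f[i][k].append(v)
            nxt = get(i - 1, kp, qp + 1)                 # the NEXT-style call
            if nxt < math.inf:
                heapq.heappush(cand[i][k], (nxt + P[i - 1][kp][k], kp, qp + 1))
        return f[i][k][q]

    final = [(get(n, k, 0) + psi[k], k, 0) for k in range(K)]
    heapq.heapify(final)
    out = []
    while final and len(out) < d:
        v, k, q = heapq.heappop(final)
        out.append(v)
        nxt = get(n, k, q + 1)
        if nxt < math.inf:
            heapq.heappush(final, (nxt + psi[k], k, q + 1))
    return out

# two states, one inner layer; the four path lengths are 0, 2, 2, 1
print(d_best([0, 1], [[[0, 2], [1, 0]]], [0, 0], 3))   # [0, 1, 2]
```

Each call of `get` beyond the already computed prefix does O(log K) heap work plus at most one recursive call per layer, mirroring the O(n log K) bound for NEXT(β).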
I cannot but close a question that I mentioned only cursorily before. The matter is that the memory needed for storing the data f(k, i, q), k'(k, i, q) and q'(k, i, q) is actually much smaller than the size Σ_k Σ_i (q*(k, i) + 1), which it would seem to be at first glance. Based on the knowledge of the concrete algorithm we can claim that immediately after the initialisation the sum Σ_k Σ_i (q*(k, i) + 1) differs only slightly from the number K n. Further on, with each application of the program NEXT(β) the sum does not increase by more than n, because for each i no more than one of the numbers q*(k, i) is changed, and when it changes it increases only by one. It follows that after a d-fold application of the program NEXT(β) the total memory size will not be greater than O(K n + n d).
Well, this may be all with respect to the d shortest paths in the graph and, consequently, with respect to the search for the d best approximations of a given sentence by sentences of a given regular language. Certainly, the procedure described can easily be generalised to the situation in which the object under approximation has a more complicated acyclic structure than a sequence.
The algorithm for solving the task of the d shortest paths in a graph is gradually assuming an appropriate form. You might publish your results, since they have a significance of their own, and not only in the pattern recognition domain. On the whole we can see that you have mastered the subject matter of the last two lectures excellently.
The analysis was quite instructive for me because I became convinced once more how substantially the efficiency of a carefully thought out procedure may differ from that of the one which occurs to me at first and seems self-evident.
Allow me to say frankly that in one item of your lecture I saw quite some negligence in the estimate of computational complexity. It concerns the complexity of the matrix polynomial ⊕_{i=0}^{∞} A^i in a semi-ring with idempotent addition. Without any hesitation you wrote that the complexity of that operation was O(k³ log k) for a matrix A of dimension k × k. But that is the complexity of the most primitive algorithm, which occurs to anybody in the first place. For this task there has long been a known algorithm due to Floyd [Floyd, 1962] whose complexity is O(k³). How shall I come to terms with that?
Simply by forgiving the negligence. We did not intend to interrupt the firm
and purposeful storming of the main aim of the lecture.
Well, will you get on with it and formulate the question in a more concrete
fashion?
that it yields bad results. But the theory does not tell me anything about
what a segmentation algorithm should look like. But image segmentation does
belong to pattern recognition, does it not?
The situation as a whole seems to me as if a person promised to serve me any kind of meal, and only when I really wanted to get a meal did I find that I first had to prepare it myself. Then, independently of what kind of meal it was, that person could bring it and serve it to me. He had kept his promise, but I saw that it had been a pure hoax.
Here I can see a clear gap between the theory explained and my actual practical task. The theory deals with making use of the dependence between symbols, which in my practical task cannot substantially improve the results of recognition. The major difficulty of my task is how the originally compact observation x is to be transformed into a sequence of observations x_1, x_2, ..., x_n, in which each observation corresponds to one character. The theory is silent about that. In the theory the sequence x_1, x_2, ..., x_n is already assumed as given. To worry about how to get it is my job.
Do I see the gap in the right place? Could you not, perhaps, help me in analysing my task of recognising an image with a line of text in which segmentation problems appear? These problems seem to me the most significant. The relationship between neighbouring characters could be ignored, as it seems less substantial to me.
First you created the gap yourself and now you can clearly see it. You have
assumed from the very beginning that the alphabet K of states is identical
with the alphabet of characters. This conception of the set of states immedi-
ately results in the idea that the sequence of observations must have the form
x₁, x₂, ..., xₙ, in which xᵢ is the part of the image containing the i-th character
and no part of any other character. But for using the theory it is not at all
necessary for the sequences k₁, k₂, ..., kₙ and x₁, x₂, ..., xₙ to have exactly
this meaning.
We will examine your task starting from a form of data which can be assumed
to undoubtedly correspond to the input data. In any case we can assume that
the input data have the form of a two-dimensional array (x(i,j), 1 ≤ i ≤ n,
1 ≤ j ≤ m) which consists of n columns and m rows. As usual, when referring
to images, the value x(i,j) is considered to be the brightness of the image at the
point with integer coordinates (i,j). These data can be interpreted even as a
sequence x₁, x₂, ..., xₙ, where xᵢ is a one-dimensional ensemble of length
m; simply speaking, it is the i-th column of the original two-dimensional array.
But in such a case the set X, from which the quantities xᵢ assume their values,
is extremely extensive.
Yes, it is. But for the time being do not worry about it. Now it is more
important to reveal what this multi-dimensional, and therefore so complicated,
observation xᵢ depends on. Recall that i indicates the number of the column
being processed within the whole line of text.
9.6 Discussion 475
The observation xᵢ depends only on two quantities. One is the number q, i.e.,
the counter of columns from the beginning of a character, so that the first column
of the character being processed is the column i − q. The other quantity is the
name k of the character being processed, which is a name from the alphabet K.
For each k ∈ K the number Q(k) will be introduced, which means the width of
the symbol k, i.e., it indicates how many columns the image representing
the character consists of. The pair (k, q), on which a column of the observed
two-dimensional array (x(i,j), 1 ≤ i ≤ n, 1 ≤ j ≤ m) depends, belongs to the set
{(k, q) | k ∈ K, q = 0, 1, ..., Q(k) − 1}. This set will be considered as the set of
states of an automaton which generates images with lines of text.
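As a minimal sketch, the extended state set can be enumerated directly; the alphabet and the widths Q(k) below are made-up illustrative values, not from the text:

```python
# Hypothetical character widths Q(k) for an illustrative alphabet K.
Q = {'a': 3, 'b': 4, 'c': 2}

# The set of automaton states {(k, q) | k in K, q = 0, 1, ..., Q(k) - 1}.
states = {(k, q) for k in Q for q in range(Q[k])}
# The state (k, q) means: column q (counted from 0) of an occurrence
# of the character k is currently being generated.
```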
I have caught it! The main thing is that the automaton states need not cor-
respond just to what is to be recognised, but they may be something more
detailed. I remember that we were discussing something similar to it after Lec-
ture 7. Then you directed my attention to a situation in which in creating a
model of a recognised object some artificially created parameters are sometimes
to be added to the hidden parameters of a natural kind. By extending the set
of hidden parameters the task does not become more complicated. Just the
opposite, it becomes simpler.
Here just such a situation has occurred. We are surprised that you did not
arrive at it sooner. Well, continue by yourself.
The automaton generates a sequence of states (k₁, q₁), (k₂, q₂), ..., (kₙ, qₙ), and
a sequence e₁, e₂, ..., eₙ of columns, each of which consists of m elements, ap-
pears at the output. The sequence e₁, e₂, ..., eₙ then forms a two-dimensional
ensemble (e(i,j), i = 1, 2, ..., n, j = 1, 2, ..., m) which can be considered as
an image. For the image to correspond to an ideal, undamaged text line, cer-
tain constraints are to be satisfied as to what the state (kᵢ, qᵢ) can be at the
i-th moment in dependence on what the state (kᵢ₋₁, qᵢ₋₁) was in the preceding
moment, and the dependence of the output eᵢ on the state (kᵢ, qᵢ) is to be
determined. For simplicity I will not take into consideration that the labels of
characters in the text are mutually dependent. The automaton generating an
image of a text line in which mutually independent symbols occur can be, for
example, defined as follows.
1. The set of initial automaton states is the set {(k, 0) | k ∈ K}, by which the
clearly understandable property is stated that the generation of an image
with text begins with generating the initial (zero) column of one of the
symbols.
2. The set of target states is {(k, Q(k) − 1) | k ∈ K}. This means that the gen-
eration of an image with text can end only in a state in which the generation
of some symbol is finished.
3. If the automaton is in some state (k, q), k ∈ K, q ≠ Q(k) − 1 (i.e., if the
character labelled k has not yet been generated as a whole), then the succeeding
state must be (k, q + 1) (i.e., it must continue generating the same character
k).
476 Lecture 9: Regular languages and corresponding pattern recognition tasks
The sequence obtained (kᵢ, qᵢ), i = 1, ..., n, determines the sequence i₁, i₂,
..., iM of those indices i for which qᵢ = 0. The index iₘ provides the horizontal coor-
dinate of the point at which the m-th character begins, the state name at iₘ denotes
the name of that character, and M is the number of characters in the text line which is
being recognised.
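The extraction of the indices i₁, ..., iM from the recognised state sequence amounts to one scan; a minimal sketch:

```python
def character_starts(state_sequence):
    """Given the recognised state sequence [(k_1, q_1), ..., (k_n, q_n)],
    return the pairs (i_m, k): a character starts exactly where its
    column counter q drops back to 0.  Columns are numbered from 1,
    as in the text."""
    return [(i, k) for i, (k, q) in enumerate(state_sequence, start=1)
            if q == 0]
```

For a line containing a character 'a' of width 2 followed by 'b' of width 3, the answer is [(1, 'a'), (3, 'b')].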
What you have designed is one of the simplest approaches. As we know you,
some more accomplished algorithms will occur to you after a period of thinking
it over.
July 1998
9.7 Link to a toolbox
The public domain demonstration software related to recognition in regular
languages was written in C by P. Soukup as a diploma thesis in Summer 2001.
It can be downloaded from the website http://cmp.felk.cvut.cz/cmp/cmp_software.html.
The library implementing generalised convolution and several tasks solving the
best matching problem are available, including the source code.
From time to time scientific terminology seems to make fun of a trusting reader,
deliberately wanting to confuse him or her. It happens that scientific concepts
are used which are common in everyday life but denote something quite dif-
ferent. For example, the theory of catastrophes does not deal with what we
normally consider a catastrophe, a disaster. Similarly, games theory has noth-
ing to do with what is happening on a football ground or on a chess board.
The concept 'context-free language' is one of the examples of such a perfidi-
ous concept. According to the name it could be supposed to mean manipulating
a language in such a way that one wants only to switch from one topic to an-
other until the sentence resembles a chaotic chain of mutually independent
fragments. In fact, a context-free language is determined by precise definitions.
If the definitions were not known then one could not guess what the particular
term might mean. This is the case in which the application of a familiar and
expressive concept for a concrete idea brings about only disorientation.
This lecture is devoted to a formalism with the aid of which sets of images
and a probability distribution on them are constructively defined. Based on
the definitions, different pattern recognition tasks are solved resembling those
analysed in the previous two lectures 8 and 9. This lecture differs from the
previous two lectures in that the objects recognised will not only have the
form of one-dimensional sequences but primarily the form of two-dimensional
and multi-dimensional arrays. We will see that the formalism proposed is a
natural generalisation of context-free grammars and languages according to N.
Chomsky's hierarchy. In their turn, Chomsky's context-free grammars and
languages are generalisations of regular grammars and languages.
M. I. Schlesinger et al., Ten Lectures on Statistical and Structural Pattern Recognition
© Springer Science+Business Media Dordrecht 2002
480 Lecture 10: Context-free languages, their 2-D generalisation, related tasks
any of the images labelled Sd. The images drawn are to be arranged to form
one image so that the image labelled Su forms its upper part and the image la-
belled Sd forms its lower part. The second and third metarules can be similarly
interpreted.
Let us show now how the class of the images with label SH can be defined
using the quoted metarules.
By means of the third and second metarules the following definition of the
set of images labelled SH can be stated: The image may have the label SH if it
is an image labelled SH1, or if it is composed of two parts divided by a vertical
line, the left part being labelled SH1 and the right one WR (white rectangle,
i.e., an image all pixels of which are white). The first part of the definition is
represented by the left picture in Fig. 10.1 and the second part by the right
picture in Fig. 10.1.
If we deleted all useless words from the definition, and kept only what makes
the definitions differ from each other then the definitions could be written in
the following brief form:
SH1 ::= SH2 ;   (10.2)
U ::= L | L ;   (10.6)     L ::= BR ;   (10.7)     I ::= BR | WR .   (10.8)
Figure 10.1 Generating letter SH according to rule (10.1), i.e., separating the image margin at the right side of the letter.
Figure 10.2 Illustration of the rule (10.2), i.e., separation of the image margin above the letter.
The rules (10.1) through (10.10) form the definition of the concept 'the image
is labelled SH', i.e., a definition of a set of images. The information obtained by
the computer from the human during the imaginary dialogue can be arranged
in the shape of a six-tuple
G = (X, K, k₀, Ph, Pv, Pr)   (10.11)
Figure 10.3 Illustration of the rule (10.3), i.e., separation of the margin at the right side of the letter.
Figure 10.4 Illustration of the rule (10.4), i.e., separation of the margin below the letter.
Figure 10.5 Illustration of the rule (10.5), i.e., separation of the black rectangle (BR) at the right side.
Figure 10.6 Illustration of the rule (10.6), i.e., decomposition of the remaining part of the letter into two shapes resembling letter L.
Figure 10.7 Illustration of the rule (10.7), i.e., decomposition of the shape resembling letter L into two parts: the black rectangle at the bottom and a shape resembling letter I.
Figure 10.8 Illustration of the rule (10.8), i.e., decomposition of the shape resembling letter I into a black rectangle and a white rectangle, where the black rectangle is at the left and the white rectangle is at the right.
Figure 10.9 Illustration of the rule (10.9), i.e., the black rectangle can be created by concatenating black rectangles only.
Figure 10.10 Illustration of the rule (10.10), i.e., the white rectangle can be composed of white rectangles.
Figure 10.11 (a) An example of the image with a letter resembling SH which can be created by the introduced rules. (b) An example of the image not resembling SH which cannot be created by the introduced rules.
Figure 10.12 (a) An example of the image with a letter not resembling SH which can be created by the introduced rules. (b) An example of the letter resembling SH which cannot be created by the introduced rules.
The set of all possible images will be denoted X* and each subset L ⊂ X* will
be called a two-dimensional language in the alphabet X.
We will introduce operations of horizontal and vertical concatenation of
images. Let x₁ = (m, n₁, T(m, n₁) → X) and x₂ = (m, n₂, T(m, n₂) → X) be
two images which have the same number of rows m. The horizontal concatena-
tion of the images x₁ and x₂ means the image x = (m, n₁ + n₂, T(m, n₁ + n₂) →
X) which will be denoted x = x₁ | x₂ and for which the following holds,
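In code, the two concatenation operations can be sketched as follows, representing an image simply as a list of m rows (an illustrative assumption, not the book's formalism):

```python
def hconcat(x1, x2):
    """Horizontal concatenation x = x1 | x2: both images must have the
    same number of rows m; the result has n1 + n2 columns."""
    assert len(x1) == len(x2)
    return [r1 + r2 for r1, r2 in zip(x1, x2)]

def vconcat(x1, x2):
    """Vertical concatenation: both images must have the same number of
    columns n; x1 forms the upper part, x2 the lower part."""
    assert len(x1[0]) == len(x2[0])
    return x1 + x2
```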
G = (X, K, k₀, Ph, Pv, Pr)
Here and later on we will use the denotations Ph, Pv, Pr in two senses. Some-
times these letters will be considered, as before, to denote relations, i.e., subsets
Ph ⊂ K × K × K, Pv ⊂ K × K × K and Pr ⊂ K × (K ∪ X). At other times
the denotations will be understood as functions Ph: K × K × K → {0, 1},
Pv: K × K × K → {0, 1}, Pr: K × (K ∪ X) → {0, 1}. The statements (k₁, k₂, k₃) ∈
Ph and Ph(k₁, k₂, k₃) = 1, (k₁, k₂, k₃) ∈ Pv and Pv(k₁, k₂, k₃) = 1, and so on,
will be considered equivalent. With the newly introduced denotations we can
write the following recursive relations for the values f(it, ib, jl, jr, k), 1 ≤ it ≤
ib ≤ m, 1 ≤ jl ≤ jr ≤ n, k ∈ K,
[ ⋁_{j=jl}^{jr−1} ⋁_{kl} ⋁_{kr} ( f(it, ib, jl, j, kl) ∧ Ph(k, kl, kr) ∧ f(it, ib, j+1, jr, kr) ) ]
The previous relation precisely expresses the assertions 1–4 from the earlier
introduced Definition 10.1 which define the images belonging to the language Lk.
The last term (in square brackets) in the relation (10.13) corresponds to the first
and second assertions of the definition. The second term corresponds to the third
assertion, and the first term corresponds to the fourth assertion of the definition.
The relation (10.13), and thus also the Definition 10.1 of the language Lk,
k ∈ K, forms a basis for the following algorithm recognising whether the image
x = (m, n, T(m, n) → X) belongs to the language L(G).
⋁_j ⋁_{kl} ⋁_{kr} ( f(i₁, i₂, j₁, j, kl) ∧ Ph(k, kl, kr) ∧ f(i₁, i₂, j+1, j₂, kr) ).
(10.14)
(10.15)
If the condition (10.15) is satisfied then f(i₁, i₂, j₁, j₂, k) := 1 is substituted.
(c) For the values of k for which f(i₁, i₂, j₁, j₂, k) continues to be zero the following
condition is verified
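The recursive relation (10.13) can be sketched as a memoised boolean recogniser; the encoding below (rules as sets of triplets, Pr as a mapping from a label to its admissible pixel values) is an illustrative assumption, not the book's data structures:

```python
from functools import lru_cache

def make_recogniser(x, Ph, Pv, Pr):
    """f(it, ib, jl, jr, k) is true iff the subimage with rows it..ib and
    columns jl..jr (0-based, inclusive) belongs to the language L_k.
    Ph and Pv hold horizontal/vertical rules (k, k_left, k_right) resp.
    (k, k_upper, k_lower); Pr maps a label k to the pixel values it may
    label directly."""
    m, n = len(x), len(x[0])

    @lru_cache(maxsize=None)
    def f(it, ib, jl, jr, k):
        if it == ib and jl == jr and x[it][jl] in Pr.get(k, ()):
            return True
        for (kk, ku, kd) in Pv:          # vertical decompositions
            if kk == k:
                for i in range(it, ib):
                    if f(it, i, jl, jr, ku) and f(i + 1, ib, jl, jr, kd):
                        return True
        for (kk, kl, kr) in Ph:          # horizontal decompositions
            if kk == k:
                for j in range(jl, jr):
                    if f(it, ib, jl, j, kl) and f(it, ib, j + 1, jr, kr):
                        return True
        return False

    return lambda k: f(0, m - 1, 0, n - 1, k)
```

With rules BR → BR | BR and BR → BR over BR, and Pr(BR) = {1}, an all-black image is accepted as BR while a mixed image is rejected.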
1 ≤ m₁ ≤ m₂, 1 ≤ n₁ ≤ n₂, 1 ≤ t₁ ≤ t₂. With V = {0, 1} the observation can
be considered as a time-varying binary image which is observed in the interval
from t₁ to t₂. △
Example 10.4 Observed field with a more complicated structure. The ob-
served field T can have a more complicated structure. It can be, for example,
a set of vertices of an acyclic graph or a Cartesian product of such
graphs. △
Remark 10.1 The alphabet of symbols V corresponds to the terminal and non-
terminal alphabets in formal grammars. For our later purposes these two types
of symbols need not be differentiated. △
The set T x V will be called a set of structural elements and will be denoted S.
Individual structural elements will generally be denoted by a lower-case letter
s distinguished by indices. A structural element is a certain fragment from the
structure T marked by a symbol from V.
Example 10.5 Structural element. If a pixel is said to be black then it deter-
mines a structural element. If a set of pixels is said to form an abscissa (a line
segment) then it determines a structural element too. If a rectangle in the observed
field is said to contain an image representing the letter A then it is also referred to
as a structural element. △
Four additional important concepts are the segmentation of a fragment, hierar-
chical segmentation, the map, and the hierarchical map of a structural element.
The segmentation of the fragment T₀ ∈ T is a subset R ⊂ T of fragments
which contains the fragment T₀ and some other fragments R \ {T₀} which form
a decomposition of the fragment T₀. In other words, if T₀ = ⋃_{i=1}^{m} Tᵢ, Tᵢ ∈ T,
and Tᵢ ∩ Tⱼ = ∅ for any i > 0, j > 0, i ≠ j, then R = {Tᵢ | i = 0, 1, ..., m} is
a segmentation of the fragment T₀.
Any function m: R → V in which R is a segmentation of the fragment T₀
defines a map of the structural element (T₀, m(T₀)). Each map is a subset of
labelled fragments, i.e., a subset of structural elements.
We will introduce two important particular cases of maps. A map defined
on a segmentation which consists of three fragments is called a rule. For this
particular case the denotation π will be used, and different sets of rules will be
denoted by Π with different indices and arguments.
The other particular case of a map is the labelled image. The information
contained in this map consists of the definition of the fragment T₀, the label v₀
which characterises the image as a whole, and an ensemble of labels which
characterise the individual pixels in the fragment T₀. Formally speaking, it is the
segmentation R = {T₀} ∪ (⋃_{t∈T₀} {{t}}) of the fragment T₀ into individual pixels
and the function x: R → V. Thus this pair is a map as well. The labelled
image is also a subset of labelled fragments of a certain form, i.e., the set
{(T₀, v₀)} ∪ {({t}, x(t)) | t ∈ T₀}.
Hierarchical segmentation is defined in the following recursive way.
1. The segmentation R which consists of three fragments is the hierarchical
segmentation.
generated, not arbitrary ones. The quantity (10.18) states whether the particular
image is admissible. △
Example 10.8 Let W be a completely ordered set. Let ⊕ mean max and
⊗ mean min. In this case the function P defines for each rule π an element
from W which indicates the degree of confidence in the rule. At the same
time the degree of confidence in generating the image is assumed to be not
less than θ if the degree of confidence in each rule applied in the generation is not
less than θ either. The expression (10.18) in this case determines the safest
way of generating the particular image, i.e., roughly speaking, finding the most
convincing reason that the image presented is labelled v₀. △
Example 10.9 Let W be a set of nonnegative numbers. Let ⊕ mean min and
⊗ mean addition in the common sense. In this case the function P can be
understood as stating a penalty for the application of a rule. The number
(10.17) represents the total penalty for the actual procedure of generating the
image, and the number (10.18) means the least possible total penalty with which
it is still possible to generate the image presented. △
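The two examples can be sketched with the pair (⊕, ⊗) passed in explicitly; representing each way of generating the image as a plain list of its rule scores P(π) is an illustrative simplification:

```python
from functools import reduce

# (oplus, otimes, zero, one) for Examples 10.8 and 10.9.
FUZZY = (max, min, 0.0, 1.0)                              # confidence degrees
PENALTY = (min, lambda a, b: a + b, float('inf'), 0.0)    # total penalties

def best_generation(semiring, generations):
    """Evaluate the analogue of (10.18): oplus over all ways of generating
    the image of the otimes of the rule scores used in each generation."""
    oplus, otimes, zero, one = semiring
    best = zero
    for rule_scores in generations:
        best = oplus(best, reduce(otimes, rule_scores, one))
    return best
```

For two generations with confidences [0.9, 0.6] and [0.7, 0.8] the fuzzy semi-ring yields 0.7 (the most convincing way); for penalties [2, 3] and [1, 5] the penalty semi-ring yields 5 (the least possible total penalty).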
which are to be calculated for each structural element (T′, v′), T′ ⊂ T₀, T′ ∈ T,
v′ ∈ V. In the formula (10.20), x(T′) means the contraction of the analysed image
x to the fragment T′.
If we compute all the quantities (10.20) then we will solve the task (10.19)
as well, because the quantity (10.19) is one of the quantities given by (10.20).
In each actual task, the image x: T₀ → V, as well as its contractions to
different fragments, is constant. Therefore, to make the later formulæ brief, we
will refer neither to the denotation of the image x nor to the denotation of its
contraction x(T′). Recall the previously introduced denotation s for structural
elements, i.e., pairs of the form (T′, v′), T′ ⊂ T₀, T′ ∈ T, v′ ∈ V, and the
denotation π for rules, i.e., triplets (s₁, s₂, s₃) of a certain form. Using these
denotations the formula (10.20) will have the shorter form
⊕_{H ∈ 𝓗(s,s₁,s₂)} ⊗_{π ∈ Π(H)} P(π)   (10.23)
10.5 General structural construction 495
which will be denoted G(s, s₁, s₂). The rule (s, s₁, s₂) is present in each hierarchi-
cal map H from the set 𝓗(s, s₁, s₂), and thus the factor P(s, s₁, s₂) is present
in each product ⊗_{π∈Π(H)} P(π). It can be factored out of the sum.
For each hierarchical map H ∈ 𝓗(s, s₁, s₂) the set Π(H) \ {(s, s₁, s₂)} is decom-
posed into two subsets. One is a hierarchical map H₁ from the set 𝓗(s₁), and
the other is a hierarchical map H₂ from the set 𝓗(s₂). Therefore the product
⊗_{π ∈ Π(H)\{(s,s₁,s₂)}} P(π) means (⊗_{π∈Π(H₁)} P(π)) ⊗ (⊗_{π∈Π(H₂)} P(π)), and
summation over all the maps from the set 𝓗(s, s₁, s₂) means summation over all
the maps from the Cartesian product 𝓗(s₁) × 𝓗(s₂). Thus the formula (10.24)
assumes the form
G(s, s₁, s₂) = P(s, s₁, s₂) ⊗ ⊕_{H₁ ∈ 𝓗(s₁)} ⊕_{H₂ ∈ 𝓗(s₂)} ( ⊗_{π∈Π(H₁)} P(π) ) ⊗ ( ⊗_{π∈Π(H₂)} P(π) )   (10.25)
On the basis of the evident equality ⊕_{i∈I} ⊕_{j∈J} (βᵢ ⊗ γⱼ) = (⊕_{i∈I} βᵢ) ⊗
(⊕_{j∈J} γⱼ) we further have
which is the tool for solving the basic problem. We are now going to demon-
strate it.
The quantity G(s), as we defined it before and as can be seen from the
formula (10.27), is relevant only for such structural elements s = (T′, v′) which
are decomposed into other elements, i.e., when |T′| ≥ 2. We will extend this
definition even to the cases in which the structural element is an individually
labelled pixel. For the presented image x: T₀ → V we define G(s), s = ({t}, v),
t ∈ T₀, v ∈ V, in such a way that

G({t}, v) = 0⊕, if x(t) ≠ v,
G({t}, v) = 1⊗, if x(t) = v.    (10.28)
We will order all the structural elements into a one-dimensional sequence (it
does not depend on the dimensions of the observed field T, which are by no
means taken into consideration) so that the element s′ = (T′, v′) precedes the
element s″ = (T″, v″) if T′ ≠ T″ and T′ ⊂ T″. If T′ = T″ then the elements
are arranged in the sequence in an arbitrary order which will later be considered
fixed. For the given observation x: T₀ → V the quantities G({t}, v), t ∈ T₀,
v ∈ V, will be defined in accordance with (10.28). Then we stepwise examine
all the elements in the previously settled order and for each of them, say for the
element s, we calculate the value G(s) according to the formula (10.27).
All the data are already at hand for this calculation because only the values
G(s′) are needed for those elements s′ which occurred earlier in the ordered
sequence of elements. In this procedure the values G(s) for the fragment
(T₀, v), v ∈ V, are calculated as well. In this way it can be stated to what
extent the assertion is valid that the presented image can be labelled v. The
total number of operations in the calculation is proportional to the number of
rules (s, s₁, s₂), s = (T′, v′), s₁ = (T₁′, v₁′), s₂ = (T₂′, v₂′), T′ ⊂ T₀, for which
P(s, s₁, s₂) ≠ 0⊕.
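The ordering requirement — every proper subfragment precedes its superfragments — can be satisfied simply by sorting on fragment size, since T′ ⊂ T″ with T′ ≠ T″ implies |T′| < |T″|. A minimal sketch, with fragments represented as frozensets of pixels (an illustrative choice):

```python
def evaluation_order(elements):
    """Order structural elements (fragment, label) so that each element
    is preceded by all elements it can depend on in (10.27); elements
    with equal-size fragments keep an arbitrary but fixed order."""
    return sorted(elements, key=lambda s: len(s[0]))
```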
Example 10.11 Regular language and structural construction. In the case
of regular languages and their stochastic and fuzzy generalisations, the set T
is a set of positive integers. The structure T contains fragments of
the form {t | 1 ≤ t ≤ n}, n ∈ T, and fragments of the form {t}, t ∈ T.
Let a sequence of length n be given which is to be processed, i.e., let the set
T₀ = {1, 2, ..., n} and the function x: T₀ → V be given. For the set T₀ only
n − 1 triplets of fragments (T₁, T₂, T₃), Tᵢ ⊂ T₀, i = 1, 2, 3, exist in which T₂
and T₃ constitute a decomposition of the fragment T₁. The reason is that the
triplet (T₁, T₂, T₃) can have only the form (T₁, T₁ \ {t}, {t}) in which t is the
last pixel in the fragment T₁. Because the structural element s₃ can only have
the form ({t}, x(t)) (where x(t) is the t-th symbol in the presented sequence),
to each triplet of fragments at most |V|² rules π = (s₁, s₂, s₃) correspond for
which P(π) ≠ 0. Thus the complexity of the analysis of a sequence with the
created construction is O(|V|²(n − 1)), which is of the same order as that of
algorithms analysing sequences which refer solely to regular languages. △
Example 10.12 Context-free language and structural construction. In defin-
ing a context-free language by means of the created structural construction, the
set T is a set of integers as it was in the preceding example. The struc-
ture T contains all intervals of the form {t | i ≤ t ≤ j}, i ∈ T, j ∈ T, i ≤ j.
If the recognised sequence T₀ = {1, 2, ..., n} is of length n then the number
of segmentations of the form {T₁, T₂, T₃}, Tᵢ ⊂ T₀, is of the order n³, which is
just the complexity of sequence recognition by means of the known algorithms
which have been created solely for the case of context-free languages. △
Now we can more clearly understand what causes such a steep increase in the
complexity of the pattern recognition task in passing from regular to context-
free languages, if no intermediate levels are considered. It is due to the
difference between the structures used, which in the theory of formal grammars
are not treated as a separate notion at all. Nevertheless, the definition of
a class of languages is based on the application of a certain structure, even if it
is not explicitly described. The structure of regular languages contains only such
fragments as can be decomposed into other fragments in a single way. The
result is that a single hierarchical segmentation, which is easy to find, corresponds
to each sequence presented for recognition. It is natural that the complexity
of the syntactic analysis falls rapidly in this case.
The situation is different in the case of context-free languages. Here each
fragment of length n can be decomposed in all the (n − 1) possible ways into
two other fragments. Thanks to this greater freedom in decomposing a fragment
into two parts, a greater number of hierarchical segmentations corresponds
to each sequence.
Now that the fundamental factor determining the complexity of the syn-
tactic analysis of a sentence has been revealed, languages can be constructed
which, strictly speaking, are not regular, but for which the complexity of the
pattern recognition task is not greater than that for regular languages.
Example 10.13 A language between the regular and the context-free lan-
guages. Let the structure T be defined in the same way as it was in Exam-
ple 10.12 for context-free grammars. The function P, however, is chosen
so that for each fragment T₁ the quantity P((T₁, v₁), (T₂, v₂), (T₃, v₃)) is non-
zero only in the case of one single decomposition of the fragment T₁ into fragments
T₂ and T₃. For example, it can be done in such a way that only those pairs of
fragments are taken into consideration whose lengths do not differ
from each other by more than 1. The language defined by such a structure and
by the function P will no longer be a regular one. Nevertheless, the complex-
ity of recognising a sequence in this language remains linearly dependent on the
sentence length, as was the case with regular languages.
Thus classes of languages can be constructed which rank, in a sense, between
regular and context-free languages. Eventually, languages can be constructed
which, strictly speaking, do not belong to the class of context-free languages,
and at the same time the complexity of sequence analysis for these languages
rises more slowly than the third degree polynomial (cubic) of the sentence length.
Examples of such languages were given at the beginning of the lecture, where
they were called two-dimensional context-free languages. △
The presented structural construction for observed sets is, therefore, a gener-
alisation of the known formalisms, such as the formal regular or context-free
grammars including their stochastic and fuzzy modifications. The generalisa-
tion consists in that no 'one-dimensional' property of observed data is assumed.
The construction is based on other, more general means, with the aid of which
sets of other forms than those of sequences are defined.
10.6 Discussion
In the lecture you introduced a structural construction the particular cases of
which are regular and context-free grammars. Compared with the grammars,
the construction has additional tools at hand, by which not only regular and
context-free languages, but also other sets of various forms can be defined. In
the first place I would like to ask you what tools have made such generalisation
possible. Then I will ask other questions.
You have already presented this generalisation and used it in previous lectures.
I am interested in further steps in generalisation which appeared in this lecture.
I would like to be more at home with the problem. I do not understand properly
why it is necessary to formalise an observation in any other way than as a sequence.
Of course, I understand that an image is something different from a sentence.
However, I do not know why one should formalise these two representations
of information in a different manner. Well, even when in terms of the general
structural construction I say that an observation is an ensemble of structural
elements s₁, s₂, ..., sₙ, I still write it down as a sequence. Why could the
expression I have just written not be called a sequence?
You have come across the same difficulties that pattern recognition encoun-
tered in the 1960s, when formalisation of an observation by a point in a multi-
dimensional space seemed to be universally applicable. It was found even at
that time that it was necessary to break with that charming idea.
Actually, it is not important what the observation s₁, s₂, ..., sₙ is called. An
observation can be called a vector, a sequence, a set, etc. What is essential is which
operations on this object are considered understandable in a particular applica-
tion. If years ago it was found that the formalisation of an observation by means of
a multi-dimensional vector was not convenient for the purpose, then it resulted
in something more serious than replacing the word vector by another word.
Not everything. There are sets T having a completely clear structure T, which
loses its lucidity with any mapping of the set T to a set of integers.
For example?
I understand that the correct definition of the set T and its corresponding
structure T is the most important step in representing an applied problem by
means of a structural construction.
Let me put the alphabet aside for a while. Let me even put aside the structure
T, because I agree with you that the words by which I determine what set
T is referred to will immediately delimit the structure T natural for that set.
Now I am coming to the main question. The entire structural construction
is a tool for defining certain sets. But at least one element in the construction is
again a set. It is the set T, about which nothing at all is stated, neither
how it should be defined nor what its form could be. Briefly speaking, nothing
is said about it and therefore I can consider it as an arbitrary set. Thus the
entire structural construction stops being constructive. To define a set (here a
set of admissible observations) I must have defined another set. And this set is
just the set T.
Indeed, you have revealed the weakest point in the proposed structural con-
struction. The definition of a set of admissible observations cannot start with
the words 'let T be a set' because that is too general a sentence. At least we
should say 'let T be a finite set', or otherwise strongly limit the set T, so that
the further construction might become correct. We did not do so for various
reasons, and so understand the sentence in the following informal sense:
'let T be a set which is easy to define and quite obviously results
from the applied problem being solved'.
But still I do not understand why the form of the set T could not be reasonably
limited so that the whole construction becomes correct and in spite of it contains
all the sets that can practically occur. Well, the variety of sets T which
are of practical interest is not very large. It can be a completely ordered set,
e.g., a set of integers, or it can be a Cartesian product of a finite number of
such sets. What else do I dare to ask?
Do it, but only after a while. We have not answered your first question yet. You
asked us by means of what additional tools it was possible to achieve the extended
potential of the structural construction when compared with formal grammars.
You have certainly noticed that rules in the structural construction have
another form than those in formal grammars. In the structural construction
a rule is a triplet of structural elements (s₁, s₂, s₃), where each element is
a labelled fragment, i.e., a pair of the form (T′, v′), T′ ∈ T, v′ ∈ V. Thus a
rule is a six-tuple (T₁, v₁, T₂, v₂, T₃, v₃). A rule in formal grammars has a
simpler form. It is a triplet of labels, and each grammar is characterised by
a subset of triplets, determined by a function V × V × V → {0, 1}
which will be denoted PV. If the language of a classical formal grammar
is expressed by means of the general structural construction then the function
P: T × V × T × V × T × V → {0, 1} will have the form
(10.29)
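A small sketch may make this correspondence concrete. Since the displayed form (10.29) is not reproduced here, the following is only one plausible reading, under the assumption that fragments t are integer intervals (i, j): P returns PV(v1, v2, v3) exactly when t2 and t3 are adjacent intervals whose union is t1, and 0 otherwise. The function name `make_p` is illustrative.

```python
def make_p(pv):
    """Lift PV: V x V x V -> {0, 1} to P: (T x V)^3 -> {0, 1}.

    Assumes fragments are integer intervals (i, j); a rule fires only
    when t2 and t3 are adjacent and together cover t1 (one plausible
    reading of (10.29), not the book's displayed formula).
    """
    def p(t1, v1, t2, v2, t3, v3):
        i1, j1 = t1
        i2, j2 = t2
        i3, j3 = t3
        adjacent = (i1, j1) == (i2, j3) and j2 == i3  # t1 = t2 followed by t3
        return pv(v1, v2, v3) if adjacent else 0
    return p
```

For instance, with a PV that admits only the triplet (S, A, B), the lifted P accepts the pair of adjacent fragments (0, 1) and (1, 2) composing (0, 2), and rejects any non-adjacent pair regardless of the labels.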
I will show how I understand the computational procedure solving the basic problem. I will present it for the case in which the function P assumes only two values, 0 and 1, and these values are subject to logical addition and multiplication. During your lectures I got used to all the other, seemingly more complicated cases being in fact exactly as simple as this lucid one.
I understand the structural recognition task in the following way: a set of objects s1, s2, ..., sn (I nearly said a sequence!) is given, and it is to be decided whether this set is admissible. In other words, it is to be checked whether the presented objects can be understood as parts of a composed object. I imagine the following procedure for solving the task. A sequence of sets S_i, i = 0, 1, 2, ..., of objects regarded as examined is created step by step. At the beginning, the set S_0 is the set {s1, s2, ..., sn} presented for analysis. Let the set S_{i-1} have been created after step (i-1). Then a triplet of objects (s', s'', s''') is sought for which P(s', s'', s''') = 1, s' ∉ S_{i-1}, s'' ∈ S_{i-1}, s''' ∈ S_{i-1} hold, and the set S_i = S_{i-1} ∪ {s'} is created. Simply speaking, at each step the set of already found partial objects is enlarged by one more object whose existence was proved in that step. This procedure of enlarging the set S continues for as long as the function P permits.
I regard this gradual growing of the presented set as the most essential part of structural recognition; I would even say it is its defining property. After the set S has been created in the manner described, some details still have to be completed for its interpretation, which I do not consider important. If I have not made a mistake somewhere, the simplicity is quite remarkable. Moreover, I would say that all the pattern recognition algorithms we have discussed since Lecture 8 on Markovian sequences have been successfully packed into one simple procedure.
Do not be surprised. A general view of a class of tasks (where such a view is possible at all) reveals properties that are difficult to observe when the tasks are analysed separately. If you want to take in, say, a large building, you had better move away from it a little rather than come nearer to it. It is similar to a situation we have already spoken about: as long as people counted one kind of object in pairs, another in dozens, a third in tens, and a fourth in scores, manipulating quantities seemed very complicated. Only since a unified representation of quantities was introduced has counting been accessible to every child.
You have grasped well that part of structural recognition which hardly changes from one application to another. You have used the correct expressions, except that you used the concept of an object instead of the concept of a structural element.
It seems to me that in this way the main idea was brought out more vividly.
That may be so, but to avoid confusion let us return to the terminology introduced in the lecture.
The procedure you presented can be formulated more concretely, so as to add to what you have already grasped an elucidation of the computational complexity of the algorithm and of the factors that influence it. Moreover, we will state more precisely which part of the algorithm changes when one passes from one application to another.
We will create a universal algorithm. This means that its inputs are the observations {s1, s2, ..., sn} and the function P determining which structural analysis is to be performed on the particular observation. The function takes its values from a set W. Two operations ⊕ and ⊗ of the form W × W → W are given which form a semi-ring on the set W. These operations are also part of the input data.
We will order the structural elements in such a way that the structural element s' = (t', v') precedes the structural element s'' = (t'', v'') if t' ⊂ t'' and t' ≠ t''. The ordering will be denoted s' ≺ s''. If t' = t'' then the elements s' and s'' are ordered arbitrarily, either as s' ≺ s'' or s'' ≺ s'. The order so defined is regarded as fixed. We will order the rules π, i.e., the triplets of structural elements (s1, s2, s3), so that the rule π' = (s'1, s'2, s'3) precedes the rule π'' = (s''1, s''2, s''3), written π' ≺ π'', if s'1 ≺ s''1. The algorithm consists of the following operations,
where the operations ⊕ and ⊗ are determined from the input data.
From the expression for the algorithm it immediately follows that its computation time depends linearly on the number of rules π for which P(π) is not the null element of the operation ⊕, since each such rule is applied only once.
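The same pass can be written down for an arbitrary semi-ring (W, ⊕, ⊗). The sketch below assumes the rules have already been sorted in the order ≺ described above, and that the presented observations are initialised with the unit element of ⊗; both are assumptions made for concreteness, since the displayed algorithm itself is not reproduced here.

```python
from collections import defaultdict

def universal(observations, rules, p, zero, one, plus, times):
    """One pass over the rules of a structural construction, evaluated
    over a semi-ring (W, plus, times) with null element 'zero' and unit
    'one'.  'rules' must already be ordered so that a rule producing s1
    comes after every rule producing s2 or s3."""
    q = defaultdict(lambda: zero)         # Q(s) for every structural element
    for s in observations:
        q[s] = one                        # presented elements get the unit
    for s1, s2, s3 in rules:              # each rule is applied exactly once,
        w = times(p[(s1, s2, s3)], times(q[s2], q[s3]))
        q[s1] = plus(q[s1], w)            # hence the time is linear in #rules
    return dict(q)
```

With plus = max and times = min on {0, 1} the pass reduces to the logical procedure sketched earlier; only the semi-ring supplied in the input data changes between applications, not the pass itself.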
I can now see from the description that in every actual application of the structural construction I have a lot of work to do. Its result can be considered a program which creates a sequence of triplets of structural elements (s1, s2, s3). These triplets, together with the quantities P(s1, s2, s3), are then provided to the program which is invariant, i.e., which does not depend on the selected application. It is painful and often annoying work. It seems to me that structural analysis of data is simple only under the condition that somebody has already done all the unpleasant work beforehand. And this 'somebody' may be me myself!
This is usual in applied informatics. Let us remind you once more that no formalisation, including the formal methods of pattern recognition, is a magical means for lazybones like the Magic Table in fairy tales. No matter how well elaborated and lucid a formalism may be, it does not relieve the researcher of the pains of expressing an informally conceived task within that particular formalism.
Yes, of course. But please realise that we all (meaning not only the three of us but the whole pattern recognition community) are only at the beginning of a long path.
And what might the first steps along this path look like?
Keeping in line with our course, it is quite natural that first of all the learning task should be formulated correctly. This means that in formulating the task the insolvability of some tasks and the uselessness of others should be taken into account. The analysis of the learning task for regular languages which we dealt with in Lecture 8 can be regarded as the zeroth step in the right direction.
And now take another careful look at the structural construction we proposed in this lecture. The various generalisations of regular and context-free languages definable by means of the construction also include a stochastic generalisation of context-free languages. This means that by means of the structural construction not only can a certain context-free language be defined, but also a corresponding probability distribution on that language. Give this some thought, because it is by no means trivial. The stochastic modification of context-free languages is not as simple as in the case of regular languages. A probability distribution on the set of rules applied in a grammar hardly ever determines a probability distribution on the set of sequences. It is a known problem, which within the structural construction can be overcome thanks to a rule being considered not as a triplet of labels, as in grammars, but as a triplet of labelled fragments. Think it through yourself, because it is worth considering. For now, however, the essential point is that by means of the structural construction varied probability distributions on a set of observations can be defined, a particular case being a probability distribution of a certain form on a context-free language. We regard this as the basis for formulating the task of statistically estimating a stochastic context-free grammar from the observation of a finite set of random sequences. The first step in solving a learning task for structural recognition could be to repeat for context-free grammars all the results that were demonstrated for regular grammars in Lecture 8.
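The remark that a distribution on rules hardly ever yields a distribution on sequences can be seen on a toy grammar (our own illustration, not an example from the lecture): take a single non-terminal S with rules S → S S of probability p and S → a of probability 1 − p. The total probability q of all finite derivations satisfies q = (1 − p) + p·q², whose smallest non-negative root is min(1, (1 − p)/p); for p > 1/2 part of the probability mass leaks into non-terminating derivations, so the rule probabilities do not define a probability distribution on the generated strings. The fixed point can be found by simple iteration:

```python
def finite_mass(p, iterations=200):
    """Total probability of all finite derivations of the toy grammar
    S -> S S (probability p) | a (probability 1 - p), obtained as the
    limit of q_{n+1} = (1 - p) + p * q_n ** 2 starting from q_0 = 0."""
    q = 0.0
    for _ in range(iterations):
        q = (1 - p) + p * q * q
    return q
```

For p = 0.3 the iteration converges to 1 (no leak), while for p = 0.7 it converges to 3/7, i.e., more than half of the mass is lost to infinite derivations.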
I am nearly sure that I have understood you correctly, but still I would like to be certain about it. The structural construction is based on one kind of concept, and the formulation of statistical learning tasks on others. In my opinion their mutual correspondence is as follows: the observed parameter of an object is an image, and the hidden parameter is a hierarchical map. The unknown parameter that determines the joint probability of the image and the hierarchical map is the probability distribution on the set of rules, where each rule is understood as a triplet of labelled structural elements and not as a triplet of labels.
We have had no doubts about it, but in spite of that we are glad to hear it
from you. We would like to thank you for your patience and the ideas you have
contributed.
Computational Imaging and Vision
24. M.I. Schlesinger and V. Hlavac: Ten Lectures on Statistical and Structural Pattern Recognition. 2002 ISBN 1-4020-0642-X