
1.1. Introduction and Overview

1.1.1. Overview and Motivation


Hello and welcome to the class on Probabilistic Graphical Models. Probabilistic Graphical
Models are a bit of a mouthful, so before we define them, let's first figure out what they might
be used for.

So, one example application, which in fact is the one where probabilistic graphical models,
or PGMs as they're called, first made their way into computer science and artificial intelligence,
is that of medical diagnosis. Consider a doctor who's faced with a patient. The doctor has a
fair amount of information at her disposal when she looks at a patient: predisposing
factors, symptoms, the results of various tests. And from that, she's supposed to figure out
what diseases the patient might have and what the response to different treatments might
be. A very different application where PGMs have also been used is that of image
segmentation. Here, we might have an image such as this, which has thousands or even
hundreds of thousands of pixels, and what we'd like to do is figure out what each
pixel corresponds to. For example, if we break up the image into these somewhat larger regions, to
have less stuff to reason about, we want to figure out which of these corresponds to
grass, sky, cow, or horse. What do these two problems have in common? First, they have a
very large number of variables that we have to reason about. In the context of a doctor, it's
all these predisposing factors, test results, possible diseases, and so on. And in the context
of the image segmentation, it's the labels for these different pixels or these larger regions
called superpixels. The second thing that these applications have in common is that
fundamentally there is going to be significant uncertainty about the right answer, no matter
how clever the algorithms that we design.
So, probabilistic graphical models are a framework for dealing with this kind of application.
So, let's first understand what each of these words mean in the context of this framework.

So first, let's consider the word models. So what is a model? The model is a declarative
representation of our understanding of the world. So it is a representation within the
computer that captures our understanding of what these variables are and how they interact
with each other. And the fact that it's declarative means that the representation stands on its
own, which means that we can look into it and make sense of it aside from any algorithm
that we might choose to apply to it. So, why is that important? It's important because that
same representation, that same model, can then be used in the context of one algorithm that
answers one kind of question, or other algorithms that might answer different kinds of
questions or the same question in more efficient ways, or that make different trade-offs
between accuracy and computational cost. The other advantage of having a stand-alone
model is that we can separate out the construction of the model from the algorithms that are
used to reason over it. So, we can construct methodologies that elicit these models from a
human expert, or ones that learn them from historical data using statistical machine learning
techniques, or a combination of the two. And once again, the separation between the
model, the algorithms, and the learning allows us to tackle each of these
problems separately.
So, that was the word model, what about probabilistic? The word probabilistic is in there
because these models are designed to help us deal with large amounts of uncertainty. So
uncertainty comes in many forms and for many different reasons. So, first it comes because
we just have partial knowledge of the state of the world, for example the doctor doesn't get to
measure every symptom or every test result and she's certainly uncertain about the diseases
that the patient has. Uncertainty comes because of noisy observations. So even when we
get to observe certain things like blood pressure, those observations are often subject to
significant amounts of noise. Uncertainty also comes in because of modeling limitations, so
we're going to have phenomena that are just not covered by our model. All sorts of different
obscure diseases, for example, that might cause the same set of symptoms. It's impossible
for us to write down a model that is so detailed that it includes every possible contingency and
every possible factor. And so you're going to have uncertainty and variability that is simply
due to modeling limitations. And finally, some people would argue that the world is
inherently stochastic. Certainly, if you go down to the quantum level, that's true. But even at
a higher level, the modeling limitations of complicated systems are such that one might as
well view the world as inherently stochastic.
Probability theory is a framework that allows us to deal with uncertainty in ways that are
principled and that bring to bear important and valuable tools. So first, probabilistic models
provide us, again this word declarative, with a declarative, stand-alone representation,
where you can look at a probability distribution and it has clear semantics that represent
our uncertainty about the different states that the world might be in. It also provides us with a
toolbox comprising powerful reasoning patterns that include, for example, conditioning on
new forms of evidence or decision making under uncertainty. And because of the intricate
connection between probability theory and statistics, you can bring to bear a range of
powerful learning methodologies from statistical learning, allowing us to learn these models
effectively from historical data, avoiding the need for a human to specify every single aspect
of the model by hand.

Finally, the word graphical. The word graphical is here from the perspective of computer
science, because probabilistic graphical models are a synthesis between ideas from
probability theory and statistics and ideas from computer science. And the idea here is to use
concepts from computer science, specifically graphs, to allow us to represent
systems that are very complicated and that involve large numbers of variables. We've
already seen those large numbers of variables in both of the applications that we used as
examples: the medical example as well as the image segmentation example. And
so in order to capture probability distributions over spaces involving such a large number of
factors, we need to have probability distributions over what are called random variables. And
so the focus of this class and what we'll do for most of it is to think about the world as
represented by a set of random variables, X1 up to Xn, each of which captures some facet of
the world. So, one symptom that may be present or absent, or a test result that might have a
continuous set of possible values or a pixel that might have one of several labels. So each of
these is a random variable and our goal is to capture our uncertainty about the possible
states of the world in terms of their probability distribution or what's called a joint distribution
over the possible assignments to the set of random variables. Now, the important thing to
realize when looking at this is that even in the simplest case where each of these is, say,
binary valued, which is usually not the case, but say just for the sake of argument, if you have n
binary-valued variables then this is a distribution over 2^n possible states of the world,
one for each possible assignment. And so we have to deal with objects that are intrinsically
exponentially large. And our only way to do that is by exploiting ideas from computer science,
in this case data structures that exploit the structure in the distribution to
represent and manipulate it in an effective way.
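Just to make the blow-up concrete, here is a minimal Python sketch (the choice n = 20 is purely illustrative, not from the lecture) that enumerates the joint assignments of n binary random variables and confirms there are 2^n of them.

```python
from itertools import product

n = 20  # illustrative number of binary random variables X1 .. Xn

# Each joint assignment is a tuple of n values in {0, 1}.
assignments = list(product([0, 1], repeat=n))

# A full joint distribution must give a probability to every assignment,
# so its size grows exponentially in n.
print(len(assignments))  # 1048576
print(2 ** n)            # 1048576, i.e. 2^n
```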

So what are graphical models? Let's look at a couple of very simple examples, so here's a
toy Bayesian network, one that will accompany us through much of the first part of this
course. A Bayesian network is one of the two main classes of probabilistic graphical models,
and it uses a directed graph as the intrinsic representation. In this case, remember we had a
set of random variables X1 up to Xn. The random variables are represented by nodes in the
graph. So, to take a look at this very simple example which we'll discuss again later, here we
have a situation where we have a student who takes a course and gets a grade in the
course, and so that's one of our random variables. We have other random variables that are
also related to that. For example, the intelligence of the student taking the course and the difficulty
of the course. And others that might also be of interest, for example the quality of the
recommendation letter that the student gets in the course, which depends on things like,
perhaps, the student's grade, and the score that the student might receive on the SAT.
So, this is a representation of a probability distribution, in this case over these five random
variables. And the edges in this graph represent the probabilistic connections between those
random variables, in a way that is very formal, as we'll define later on. The other main class of
probabilistic graphical models is what's called the Markov network, and that uses an
undirected graph. In this case, we have an undirected graph over four random variables A,
B, C, D, and we'll give an example of this type of network a little bit later on.
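As a minimal sketch of what the directed representation looks like in code, the structure of this toy student network can be written as a map from each node to its parents. The exact parent sets below (Grade depending on Difficulty and Intelligence, SAT on Intelligence, Letter on Grade) follow the standard version of this example and should be treated as an assumption here; the network is defined formally later in the course.

```python
# Toy student Bayesian network: each node mapped to its parents.
# Parent sets follow the standard version of this example (an assumption here).
student_network = {
    "Difficulty":   [],
    "Intelligence": [],
    "Grade":        ["Difficulty", "Intelligence"],
    "SAT":          ["Intelligence"],
    "Letter":       ["Grade"],
}

# The directed edges of the graph are simply (parent, child) pairs.
edges = [(parent, child)
         for child, parents in student_network.items()
         for parent in parents]
print(edges)  # e.g. [('Difficulty', 'Grade'), ('Intelligence', 'Grade'), ...]
```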
So these were toy examples, here are some real examples of the same type of framework.
So, this is a real Bayesian network. It's a network called CPCS, a real
medical diagnosis network. It was designed at Stanford University for the purpose of
diagnosis of internal diseases, and it has 480-some nodes and a little over 900 edges.
And it was used for diagnosing internal diseases by physicians here. Another real graphical
model, in this case on the Markov network side, is one that's used for the image
segmentation tasks that we talked about before. Here, the random variables represent the
labels of pixels or superpixels, say one per superpixel. And the edges represent,
again, probabilistic relationships between the label of a pixel and the label of an adjacent
pixel since these are likely to be related to each other.

So to summarize, the graphical representation gives us an intuitive and compact data
structure for capturing these high-dimensional probability distributions. It provides us at the
same time, as we'll see later in the course, a suite of methods for efficient reasoning, using
general purpose algorithms that exploit the graphical structure. And because of the way in
which the graph structure encodes the parameterization of the probability distribution, we
can represent these high-dimensional probability distributions efficiently using a very small
number of parameters, which allows for feasible elicitation by hand from an expert as
well as automatic learning from data. And in both cases a reduction in the number of
parameters is very valuable.
This framework is a very general one and has been applied to a very broad range of
applications. I'm not going to list all of the ones on the slide, and there are many others
that I could have put on the slide had there been more space. Let me just very briefly
mention a few of them on subsequent slides.

So, we've already discussed the image segmentation. So, just to motivate the benefits of the
PGM framework in this context, let's look at these two images as an example. Here are the
original images; this is the division of these images into what I mentioned are called
superpixels, which are these sort of slightly larger coherent regions. And this is what you get
if you apply a state-of-the-art machine learning method to individual superpixels separately.
So, just trying to classify each superpixel separately, you can see that you get, especially
in the case of the cow, a total mess, with different superpixels having drastically different
labels. You can't even see the cow in this segmented image. Whereas if you construct a
probabilistic graphical model that captures the global structure of the scene and the
correlations, the probabilistic relationships, between these superpixels, you end up with a
much more coherent segmentation that captures the structure of the scene.
We've already discussed medical diagnosis as an example, so here's a real world
application. This was something that was fielded on the Microsoft network for helping
parents figure out what was wrong with a sick child and the site was called OnParenting.
And parents could enter the primary complaint and then were led through a series of
questions that allowed the system to provide a probability distribution over the most likely
diagnoses ailing the child.

A very different application is one of textual information extraction, where we might have an
article from, say, a newspaper, and we'd like to take this unstructured data and convert it into
a structured form, where we have some representation of the people, locations, organizations,
and perhaps relationships. So one such task might be to take this kind of sentence and
recognize that these two words together form a person, which might not be that hard, given
the presence of the word "Mrs." But this is a little bit harder, because "Green" is also a word,
and yet we want to identify it as a person. We might then want to infer that this is a location,
and perhaps that this is an organization. It turns out that the state-of-the-art methodology for
solving this problem is a probabilistic graphical model, where we have a variable for each
word that encodes the label for that word. For example, here we have the beginning of a
person unit and an intermediate label for the person unit. And here is another person unit,
whereas here this variable is one we would like to label as a location. But we would like to
capture, importantly, the correlations between both adjacent words as well as between
non-adjacent words, by using this occurrence of the word "Green" to infer that this occurrence of
the word "Green" is also a name.

A very different example altogether is one that integrates data from multiple sensors. This
occurs in many applications; one such is integrating data related to traffic from
sensors that we might have in the road or on top of bridges, weather information, and
incident reports that we might get in some form. We'd like to take all this together and use a
model that, as it happens, was learned from data. And that model is then used to predict both
current and future road speed, not only on roads where we have sensors that
measure traffic, but even more interestingly, on roads where traffic wasn't measured. And it
turns out that this was a very successful application that was fielded in several large cities
with very good results.
From a totally different application domain, probabilistic graphical models have been used
very successfully for discovering new knowledge that we didn't have before. In this case, the
application was biological network reconstruction, where biologists measured the levels
of a diverse set of proteins in different conditions, under different perturbations. And from that,
they learned the relationships between those proteins and discovered interactions where
one protein was controlling another, including ones that were not known
before.

So, let's conclude this introduction with an overview of what we'll learn in the class. So, we
will cover three main pieces related to probabilistic graphical models. We'll cover
representation of PGMs, including both directed and undirected representations. We'll also
talk about higher-level structures that allow us to encode more complicated scenarios,
including ones that involve temporal structure as well as ones that involve repeated or
relational structure, specifically a class of models called plate models. We'll talk about
inference, or reasoning, using these models, covering both exact reasoning, where exact
probabilities are the outcome, as well as approximate methods that provide different
trade-offs between accuracy and computation. And we'll also talk about how this class of
models can be used for decision making under uncertainty. And finally, we'll talk about how
you might learn these models from historical statistical data. We'll talk about learning
both the parameters as well as the structure of these models automatically, dealing with both
the somewhat simpler scenario where we have complete data, that is, all of the variables are
always observed, as well as the more difficult case where some of the variables might not be
observed all the time, or perhaps not at all. And that introduces an interesting set of
complications but also a wealth of very exciting applications as we'll see.
1.1.2. Distributions
So before we get into the details of probabilistic graphical models, we need to talk a little bit
about what a probability distribution is, just so we have a shared vocabulary.

So, let's start with a very simple example of a joint distribution, one that is going to be
extended in examples later on, in other parts of the course. Let's start with an
example that involves just three random variables. This is what I call the student example:
you have a student who can be described, in this case, by a variable representing
his intelligence, which could be high or low. The student is taking a class, and the class might
be difficult or not, so there is a random variable D for the difficulty. So, the random variable I has two values,
the difficulty variable D also has two values, and then there is the grade that the student gets in
the course, which has three values; in this case, we're going to assume A, B, and C. Now
here's an example of a joint distribution over this set of three random variables. So this
is an example of P of I, D, G. It's a joint distribution. And let's think about how many entries
are in such a joint distribution. Well, since we have three variables, we need
to represent the probability of every combination of values for each of these three variables,
and so we have 2 * 2 * 3 possible combinations, for a total of twelve possible values that we
need to assign a probability to. So there are twelve total parameters in this probability distribution, and I'm
going to introduce the notion of independent parameters, which we're going to talk about
later, as well. Independent parameters are parameters whose value is not completely
determined by the value of other parameters. So in this case, because this thing is a
probability distribution, we know that all of the numbers here on the right have to sum to one.
And therefore if you tell me eleven out of the twelve, I know what the twelfth is, and so the
number of independent parameters is eleven. And we'll see that that is a useful notion later
on when we start evaluating the relative expressive power of different probability
distributions.
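As a concrete sketch, here is one way to store such a joint distribution in Python, as a dictionary from assignments to probabilities. The probability values below are made-up placeholders rather than the numbers on the lecture slide; only the shape of the table matches the discussion: 2 * 2 * 3 = 12 entries, of which 11 are independent.

```python
from itertools import product

# Value sets: I and D are binary, G takes three values (grades A, B, C).
I_vals = ["i0", "i1"]
D_vals = ["d0", "d1"]
G_vals = ["g1", "g2", "g3"]

# Placeholder probabilities (NOT the lecture's numbers), chosen to sum to one.
probs = [0.10, 0.08, 0.07, 0.15, 0.05, 0.05,
         0.12, 0.06, 0.07, 0.10, 0.08, 0.07]

joint = dict(zip(product(I_vals, D_vals, G_vals), probs))

print(len(joint))                             # 12 entries in P(I, D, G)
print(abs(sum(joint.values()) - 1.0) < 1e-9)  # True: the entries sum to one
# Because the entries must sum to one, only 12 - 1 = 11 parameters are independent.
```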

What are things that we can do with probability distributions? Well, one important thing that
we can do is condition the probability distribution on a particular observation.
So, for example, assume that we observe that the student got an A. We now have an
assignment to the variable G, which is G1, and that immediately eliminates all possible
assignments that are not consistent with my observation, so everything but the G1 entries.
And so that gives me a reduced probability distribution; this is an operation that's called
reduction. I've taken the probability distribution and reduced away the entries that are not
consistent with what I've observed. Now, that by itself doesn't give me a probability
distribution, because notice that these numbers no longer sum to one; they summed to one
before I threw out a bunch of stuff.
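Continuing with the `joint` dictionary from the sketch above (the helper name `reduce_joint` is mine, not the course's), reduction is just filtering out the entries that disagree with the observation G = g1:

```python
def reduce_joint(joint, g_obs):
    """Keep only the entries of P(I, D, G) consistent with the observation G = g_obs.

    The result is an unnormalized measure over (I, D): its values no longer sum to one.
    """
    return {(i, d): p for (i, d, g), p in joint.items() if g == g_obs}

reduced = reduce_joint(joint, "g1")  # observe that the student got an A
print(reduced)                       # four entries, one per (I, D) combination
print(sum(reduced.values()))         # less than one: not yet a distribution
```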

And so, in order to get a probability distribution, what I do is take this
unnormalized measure. The word measure indicates that it's a form of distribution,
but the fact that it's unnormalized means that it doesn't sum to one.
If we want to turn this unnormalized measure into a probability distribution, the
obvious thing to do is to normalize it. And so what we're going to do is take all of these
entries and sum them up, and that's going to give us a number, which in this
case is 0.447. We can now divide each of these entries by 0.447, and that's going to give us a
normalized distribution, which in this case corresponds to the probability of I, D given G1. So
that's a way of taking an unnormalized measure and turning it into a
normalized probability distribution. We'll see that this operation is one of the more important
ones that we will be using throughout the course.
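Normalizing the reduced measure just divides every entry by its total mass. A minimal sketch, continuing from the reduction above (with the placeholder numbers, the normalizing constant will of course differ from the 0.447 on the slide):

```python
def normalize(measure):
    """Turn an unnormalized measure into a probability distribution."""
    z = sum(measure.values())  # the normalizing constant
    return {assignment: p / z for assignment, p in measure.items()}

cond = normalize(reduced)      # corresponds to P(I, D | G = g1)
print(sum(cond.values()))      # 1.0 (up to floating-point rounding)
```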

Okay, the final operation I'm going to talk about regarding probability distributions is the
operation of marginalization. That is an operation that takes a probability distribution over
a larger set of variables and produces a probability distribution over a subset of those.
So in this case we have a probability distribution over I and D, and say that we want to
marginalize out I, which means we're going to basically sum it up and throw it away,
restricting attention to D. So what that does is, for example, if I want to compute
the probability of d0, I'm going to add up both of the entries that have d0 associated with
them: the one corresponding to i0 and the one corresponding to i1. And that's
the marginalization of this probability distribution.
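Marginalization is then just summing every entry that agrees on the variables we keep. A minimal sketch, with a hypothetical helper name, that sums out I from any distribution over (I, D):

```python
from collections import defaultdict

def marginalize_out_I(dist_over_i_d):
    """Sum out I from a distribution over (I, D), leaving a distribution over D."""
    marginal = defaultdict(float)
    for (i, d), p in dist_over_i_d.items():
        marginal[d] += p  # add the i0 and i1 entries for each value of d
    return dict(marginal)

# For P(d0) this adds the entry for (i0, d0) and the entry for (i1, d0).
# Applied to `cond` from the previous sketch, it gives the marginal over D given G = g1.
print(marginalize_out_I(cond))
```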

---------------

What is ∑_I P(I, D)?

A. P(I|D)
B. P(I)
C. P(D|I)
D. P(D) <----

∑_I P(I, D) = P(D). (As a concrete example of marginalization, say you throw two 6-sided
dice, D1 and D2. This defines a joint probability distribution P(D1, D2). The probability that
D2 = 1 is equal to ∑_{i=1}^{6} P(D1 = i, D2 = 1), since D1 can only take on the values 1 to 6.)

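To check this numerically, here is a minimal sketch: the joint distribution of two fair six-sided dice is uniform over the 36 outcomes, and summing over D1 recovers P(D2 = 1) = 1/6.

```python
from itertools import product

# Joint distribution of two fair six-sided dice: uniform over the 36 outcomes.
joint_dice = {(d1, d2): 1 / 36 for d1, d2 in product(range(1, 7), repeat=2)}

# Marginalization: P(D2 = 1) = sum over i of P(D1 = i, D2 = 1).
p_d2_is_1 = sum(joint_dice[(i, 1)] for i in range(1, 7))
print(p_d2_is_1)  # 0.1666... = 1/6
```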
---------------

1.1.3. Factors
