
Machine Learning

Assignment 1
Basic Concepts

Due: 27 March 2015, 15:00

Please hand in your solutions in class, and upload a PDF version together with your code in a
single .zip file to the Moodle system before the deadline. Submission delays are rounded
up to whole days, i.e. a one-minute delay counts as one day.
You must submit your code for the questions that are marked with [+CODE].
After each section, the interface you should provide for your code is specified. Your code
will be tested by calling the specified function.
For computations and plotting graphs, you are free to use any software/language of
your choice. The recommended tools are MatLab and Octave, since they will be by far the easiest
tools for the next assignments. You had better get used to them sooner rather than later.
MatLab: http://www.mathworks.com/
Octave : http://octave.sourceforge.net/
Fast Octave installation in Ubuntu: apt-get install octave octave-signal
Fast Octave installation in MacOS: port install octave octave-signal
Inside Octave, load a package (e.g. signal) by: pkg load signal

Question 1 (Function of Random Variables): Imagine two independent random
variables X and Y with probability distributions $f_X(x)$ and $f_Y(y)$, and the merged random
variable $Z = X + Y$ with probability distribution $f_Z(z)$.
(a): Prove
$$f_Z(z) = [f_X(x) * f_Y(y)]\big|_z, \tag{1}$$
where $f(\cdot)|_z$ is the value of the function $f$ at location $z$, i.e. $f(z)$, and $*$ is the convolution
operator:
$$[f(x) * g(y)]\big|_z = \int f(x)\, g(z - x)\, dx. \tag{2}$$
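(b): Suppose the two random variables X and Y are Gaussian:
$$X \sim \mathcal{N}(\mu_X, \sigma_X^2), \qquad Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2). \tag{3}$$
Prove that the new random variable Z is also a Gaussian, with distribution:
$$Z \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2). \tag{4}$$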

(HINT: To calculate the convolution, you can split $Z = X + Y$ into $X = aZ + t$ and
$Y = (1 - a)Z - t$, instead of $X = x$ and $Y = Z - x$; i.e.:
$$f_Z(z) = \int f_X(x)\, f_Y(z - x)\, dx = \int f_X(az + t)\, f_Y((1 - a)z - t)\, dt. \tag{5}$$
The equations become straightforward for one specific choice of $a$.)
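
The claim about the mean and variance of Z can be checked numerically before attempting the proof. The following is a minimal Python sketch, not a proof and not part of the required interface. It assumes X and Y are independent, and the function name `simulate_sum` and the example parameters are made up for illustration.

```python
import random
import statistics


def simulate_sum(mu_x, var_x, mu_y, var_y, n, rng):
    """Draw n samples of Z = X + Y for independent Gaussian X and Y."""
    # rng.gauss expects a standard deviation, so convert from variance.
    sd_x, sd_y = var_x ** 0.5, var_y ** 0.5
    return [rng.gauss(mu_x, sd_x) + rng.gauss(mu_y, sd_y) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Illustrative parameters only: X ~ N(1, 4), Y ~ N(-3, 9).
z = simulate_sum(1.0, 4.0, -3.0, 9.0, 100_000, rng)
# The sample mean and variance should be close to 1 + (-3) = -2 and 4 + 9 = 13.
print(statistics.fmean(z), statistics.pvariance(z))
```

A simulation like this only checks the first two moments; showing that Z is actually Gaussian still requires the convolution argument.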

Question 2 (Data Simulation [+CODE]): The random generator function
randn in MatLab generates one sample from a random variable $X \sim \mathcal{N}(0, 1)$. Using
samples from X, how can one generate samples of another Gaussian
random variable $Y \sim \mathcal{N}(10, 5)$? Why?
CODE: You should provide a function that generates n samples from the requested
distribution: a = generateSamples(n)
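
One possible approach is to shift and scale standard normal samples. The sketch below is written in Python rather than MatLab/Octave, so `generate_samples` is a hypothetical counterpart of generateSamples, and `rng.gauss(0.0, 1.0)` stands in for randn. It assumes that the 5 in N(10, 5) is a variance, following the notation of Eq. (3); if it is meant as a standard deviation, the scale factor changes accordingly.

```python
import random
import statistics


def generate_samples(n, mean=10.0, variance=5.0, rng=None):
    """Return n samples of Y = mean + sqrt(variance) * X with X ~ N(0, 1)."""
    rng = rng or random.Random()
    scale = variance ** 0.5  # ASSUMPTION: the 5 in N(10, 5) is a variance
    # rng.gauss(0.0, 1.0) plays the role of MatLab's randn here.
    return [mean + scale * rng.gauss(0.0, 1.0) for _ in range(n)]


samples = generate_samples(200_000, rng=random.Random(1))
# Sample mean and variance, for comparison with the requested parameters.
print(statistics.fmean(samples), statistics.pvariance(samples))
```

The transformation works because an affine function of a Gaussian variable is again Gaussian, with the mean shifted by the offset and the variance multiplied by the square of the scale.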

Question 3 (Decision Theory [+CODE]): In this question, you need to load
our imaginary medical dataset:
http://vda.univie.ac.at/Teaching/ML/15s/assignments/asgn01-data.zip.
The data is available in MAT and TXT format. You can load either one in MatLab/Octave,
using load data.mat or load data.txt. The columns of the data are (BT, WBC, DS,
I).
We would like to decide whether a person has an infection based on three attributes: their White
Blood Cell count (WBC), their Body Temperature (BT), and their Daily Sleep (DS). The
algorithm should classify the patient as Healthy ($C_H$) or Infected ($C_I$). For simplicity,
you can imagine an attribute I, which is +1 if the person is infected and -1 if the person
is healthy.
We have 2000 imaginary people in the dataset file with their WBC, BT, and DS. The
person p is healthy if $I_p = -1$, and infected if $I_p = +1$. Using the training data, we would
like to design an algorithm to find out the infection status of new patients.
(a): Draw the data points in the 2D projections BT-WBC, WBC-DS, and DS-BT, and
color-code the infection state: infected patients in red and healthy patients
in blue. (HINT: use the scatter(X,Y,color) function in MatLab/Octave)
CODE: function runA(data)
(b): Draw the following distribution pairs, each pair in one graph:

$p(WBC|C_I)$ and $p(WBC|C_H)$;

$p(BT|C_I)$ and $p(BT|C_H)$;

$p(DS|C_I)$ and $p(DS|C_H)$.

(HINT: You can use a histogram with a proper bin size. You can use the [a,b]=hist(X) function
in MatLab and plot the histogram with plot(b,a,color). For the distribution, be careful
about the normalization factor.)
CODE: function runB(data)
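
The normalization mentioned in the hint can be sketched as follows: raw bin counts become a density estimate when divided by the number of samples times the bin width, so that the area under the histogram is 1. This is a minimal Python sketch under that assumption. The helper name `density_histogram` and the toy values are made up, and the code assumes the input values are not all identical.

```python
def density_histogram(values, n_bins=10):
    """Return (bin_centers, densities) whose total area equals 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    # Dividing by (N * width) turns counts into a density estimate.
    densities = [c / (len(values) * width) for c in counts]
    return centers, densities


# Made-up toy values, not taken from the assignment dataset.
centers, dens = density_histogram([36.5, 36.8, 37.0, 37.1, 37.4, 38.2], n_bins=4)
area = sum(d * (centers[1] - centers[0]) for d in dens)
print(round(area, 6))  # 1.0
```

To compare two class-conditional distributions in one graph, both histograms would be normalized separately, each by the number of samples in its own class.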

(c): Based on the visual appearance of the graphs, which attribute is the best for determining the
patient's infection status? Why? (NOTE: There might be more than one correct answer.
Your answer to the "why" question is what matters.)
(d): Find out how much each data attribute can tell us about the infection status by
calculating their correlation with the infection status. Which correlation has the highest
value? Is it consistent with your reasoning in the previous section?
CODE: function runD(data)
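
The handout does not say which correlation measure is meant. A common choice is the Pearson correlation coefficient, sketched below in Python as an assumption. The function name `pearson` and the toy values are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n factors of covariance and variances cancel in the ratio.
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values, not taken from the assignment dataset.
attribute = [4.0, 5.0, 6.0, 9.0, 10.0, 11.0]
infection = [-1, -1, -1, +1, +1, +1]
r = pearson(attribute, infection)
print(r)
```

A value near +1 or -1 indicates a strong linear relationship between the attribute and the label, while a value near 0 indicates that a linear relationship is weak or absent.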
(e): Plot Infection-vs.-WBC, Infection-vs.-BT, and Infection-vs.-DS in three different
graphs (for the infection, assume I=-1 if the person is healthy, and I=+1 if infected). How
are the correlations you calculated before reflected in these graphs? What can these graphs tell us
about our single-attribute decision algorithm?
CODE: function runE(data)
Sufficient Statistic: Given a set X of i.i.d. data with probability $p(X|\theta)$ for an unknown
parameter $\theta$, a sufficient statistic is a function $T(X)$, which contains all the information that
X provides to estimate $\theta$. In other words:
$$P(\theta \mid T(X), X) = P(\theta \mid T(X)). \tag{6}$$
From this point, assume $p(WBC, BT, DS|C_I)$ and $p(WBC, BT, DS|C_H)$ have Gaussian
distributions. As a result, all their slices and projections are also Gaussian.
(f): What are the sufficient statistics ($T_I$ and $T_H$) for the ML estimator of the parameters
of the $p(WBC|C_H)$ and $p(WBC|C_I)$ distributions? If $T_M = [T_I, T_H, K]$, what parameter K can
make $T_M$ a sufficient statistic for estimating the parameters of $p(WBC)$?
(g): What are $p(C_H)$ and $p(C_I)$ in the training dataset? Calculate $p(WBC)$ from the
training data.
CODE: function runG(data)
(h): Imagine the data was collected from people who came to the hospital for a checkup,
whereas in the real world only 5% of people are infected. What is the estimated $p(WBC)$
for the real world? Why?
CODE: function runH(data)
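
For parts (g) and (h), one way to think about $p(WBC)$ is as a mixture of the two class-conditional densities weighted by the class priors (the law of total probability). The Python sketch below assumes Gaussian class-conditionals, as stated above. The names `gaussian_pdf` and `marginal_density` and the (mean, variance) pairs are hypothetical placeholders, not values estimated from the dataset.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of N(mean, variance) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def marginal_density(x, prior_i, params_i, params_h):
    """p(x) = p(x|C_I) p(C_I) + p(x|C_H) p(C_H), with p(C_H) = 1 - p(C_I)."""
    return (prior_i * gaussian_pdf(x, *params_i)
            + (1.0 - prior_i) * gaussian_pdf(x, *params_h))


# Hypothetical (mean, variance) pairs; real ones would be estimated from the data.
params_i, params_h = (11.0, 4.0), (7.0, 1.0)
print(marginal_density(8.0, 0.5, params_i, params_h))   # equal class priors
print(marginal_density(8.0, 0.05, params_i, params_h))  # 5% infected
```

Only the prior changes between the two calls; the class-conditional densities stay the same, which is the key assumption when transferring the model from the hospital sample to the general population.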
(i): Design the decision algorithm, based only on WBC, with:
- Maximum Likelihood approach
- Minimum Cost approach, assuming that misclassifying an infected person costs
10 times as much as misclassifying a healthy person.
- Maximum A Posteriori approach, considering that only 5% of people in the real world
are infected.
CODE: functions runI_ML(data), runI_COST(data), runI_MAP(data)
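
The three approaches can be viewed as one comparison with different weights: choose "infected" when the cost-weighted, prior-weighted likelihood of $C_I$ exceeds that of $C_H$. The Python sketch below illustrates this view under the Gaussian assumption. It is one possible formulation, not necessarily the intended solution. The function `decide`, its parameter names, and the (mean, variance) pairs are hypothetical, and the minimum-cost call assumes equal priors.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of N(mean, variance) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def decide(x, params_i, params_h, prior_i=0.5, cost_miss_i=1.0, cost_miss_h=1.0):
    """Return +1 (infected) or -1 (healthy) for a single attribute value x.

    Chooses infected when cost_miss_i * p(x|C_I) p(C_I) exceeds
    cost_miss_h * p(x|C_H) p(C_H). Equal priors and costs give the ML rule.
    """
    score_i = cost_miss_i * prior_i * gaussian_pdf(x, *params_i)
    score_h = cost_miss_h * (1.0 - prior_i) * gaussian_pdf(x, *params_h)
    return +1 if score_i > score_h else -1


# Hypothetical (mean, variance) pairs, not estimated from the dataset.
params_i, params_h = (11.0, 4.0), (7.0, 1.0)
x = 9.0
print(decide(x, params_i, params_h))                    # Maximum Likelihood
print(decide(x, params_i, params_h, cost_miss_i=10.0))  # Minimum Cost
print(decide(x, params_i, params_h, prior_i=0.05))      # Maximum A Posteriori
```

Raising the cost of missing an infection moves the decision boundary toward the healthy class, while lowering the prior of infection moves it the other way.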
(j): Consider the Maximum Likelihood approach. We would like to select only one
attribute for our decision making. Using the data in the dataset, find out which attribute
is the best for our decision making, using 10-fold cross-validation. Is your result consistent
with your analysis in the previous sections? Why?
CODE: function runJ(data)
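
The data splitting for 10-fold cross-validation can be sketched as follows: shuffle the sample indices once, divide them into k folds, and use each fold once as the test set while training on the rest. This Python sketch covers only the splitting step. The generator name `k_fold_indices` and the fixed seed are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold j takes every k-th shuffled index starting at position j.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        test_idx = folds[j]
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(2000, k=10):
    print(len(train_idx), len(test_idx))  # 1800 200 for each of the 10 folds
```

For each fold, the classifier parameters would be estimated on the training indices only, and the error rate measured on the test indices; averaging the ten error rates per attribute gives a basis for comparing the attributes.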
