
Problem 3)

a) Minimum and maximum daily temperatures were collected for every day of January, April and August from 2005 to 2011. The MATLAB function xlsread was used to read the data; the details are given in the program in the appendix.

b) The following plot shows the first 60% of the data, which is used as training data.

[Figure: training data for Jan, April and August. x-axis: Min Daily Temp (F); y-axis: Max Daily Temp (F)]
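The Gaussian model in this problem assigns each point to the class whose prior-weighted normal density is largest (see the appendix code). A minimal sketch of that rule is given below. The data points, the equal priors and the helper name gauss2 are made up for illustration, and the density is written out in base MATLAB instead of calling mvnpdf, so this is not the report's actual program.

```matlab
% Hypothetical toy data (NOT the report's weather data): rows are [min max].
janPts = [40 62; 45 68; 38 60; 50 70; 42 66];
augPts = [70 90; 72 93; 69 88; 74 95; 71 91];

% Fit one 2-D Gaussian per class: sample mean and sample covariance.
muJan = mean(janPts);  sJan = cov(janPts);
muAug = mean(augPts);  sAug = cov(augPts);

% 2-D Gaussian density, one value per row of X.
% gauss2 is a hypothetical base-MATLAB stand-in for mvnpdf.
gauss2 = @(X, mu, S) ...
    exp(-0.5 * sum((bsxfun(@minus, X, mu) / S) .* bsxfun(@minus, X, mu), 2)) ...
    / (2*pi*sqrt(det(S)));

% Score = prior * likelihood; the predicted class is the column
% with the largest score.
priors = [0.5 0.5];            % assumed equal priors for this sketch
query  = [44 65; 72 92];       % two points to classify
scores = [priors(1)*gauss2(query, muJan, sJan), ...
          priors(2)*gauss2(query, muAug, sAug)];
[~, predicted] = max(scores, [], 2);
disp(predicted')
```

Here class 1 stands for the January-like cluster and class 2 for the August-like cluster. Changing the priors shifts the decision boundary toward the class with the smaller prior.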

Some points are misclassified: some January data are classified as April, and some April data are classified as January. The percentages are:

pctWrong = 19.5929
pctRight = 80.4071

c) The model is now applied to the test data (the remaining 40% of the input data), which gives the following figure.

[Figure: test data for Jan, April and August. x-axis: Min Daily Temp (F); y-axis: Max Daily Temp (F)]

As seen in the figure above, fewer points are misclassified and the percentages improve:

pctWrong = 6.6667
pctRight = 93.3333

The model therefore classifies the test data better than the training data. When the whole 100% of the data is used, the outcome is as follows.

[Figure: all data for Jan, April and August. x-axis: Min Daily Temp (F); y-axis: Max Daily Temp (F)]
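The pctWrong and pctRight values quoted in this problem come from comparing the true class of each point with the class assigned by the model. The sketch below shows a non-overlapping chronological 60/40 split and the percentage calculation. All values and labels are made up for illustration; none of them are results from the weather data.

```matlab
% Hypothetical series of 10 daily values for one month class (not real data).
vals = (1:10)';

% Chronological 60/40 split: first 60% for training, the rest for testing.
nTrain    = ceil(0.6 * numel(vals));
trainVals = vals(1:nTrain);
testVals  = vals(nTrain+1:end);   % start at nTrain+1 so the sets do not overlap

% Percent wrong / right from true and predicted labels (made-up labels).
trueLabels = [1 1 2 2 3 3]';
predLabels = [1 2 2 2 3 1]';
pctWrong = mean(trueLabels ~= predLabels) * 100;
pctRight = 100 - pctWrong;

fprintf('train=%d test=%d pctWrong=%.4f pctRight=%.4f\n', ...
        numel(trainVals), numel(testVals), pctWrong, pctRight);
```

Starting the test slice at nTrain+1 keeps every day in exactly one of the two sets, so the test percentage is not influenced by points the model was fitted on.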

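The following plot shows the percentage of misclassified points as the number of training points varies.

[Figure: Training Percentage. Percent misclassified for training and test data versus number of training points]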

d) Here I use MATLAB's built-in function knnclassify for k-nearest-neighbor classification.

[Figure: k-nearest-neighbor classification for Jan, April and August. x-axis: Min Daily Temp (F); y-axis: Max Daily Temp (F)]

The k-nearest-neighbor classifier performed better than the earlier Gaussian model.
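For readers without access to knnclassify (a toolbox function that may not be available in every installation), the idea behind k-nearest-neighbor classification can be sketched in base MATLAB. The training points, labels and the choice k = 3 below are assumptions made for illustration; this is not the call used in the report.

```matlab
% Hypothetical training points [min max] with class labels
% (1 = Jan, 3 = August); NOT the report's weather data.
trainPts    = [40 62; 45 68; 38 60; 70 90; 72 93; 69 88];
trainLabels = [1; 1; 1; 3; 3; 3];
query       = [44 65; 71 91];     % points to classify
k = 3;                            % assumed neighbor count for this sketch

predicted = zeros(size(query,1), 1);
for q = 1:size(query,1)
    % Squared Euclidean distance from the query point to every training point.
    d2 = sum(bsxfun(@minus, trainPts, query(q,:)).^2, 2);
    [~, order] = sort(d2);
    % Majority vote among the k closest training labels.
    predicted(q) = mode(trainLabels(order(1:k)));
end
disp(predicted')
```

Unlike the Gaussian model, this rule makes no assumption about the shape of each class, which is one plausible reason it can follow the data more closely.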

Appendix: MATLAB Code


%% Hw3Prob3.m
%
% Calculating the Gaussian model and the k-nearest-neighbor model.

%% Clean up
clear all
close all

varr = 1;
%[ndata, text, alldata] = xlsread('GNVWX.csv','F2:Q6');
[ndata, text, alldata] = xlsread('GNVWX_1960_2012.csv','F11180:Q13735');

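% Keep three columns: date (yyyymmdd), max temperature, min temperature.
wxDataRaw = [ndata(:,1) ndata(:,7) ndata(:,12)];
% Free the raw import variables (command syntax takes names separated by spaces).
clear ndata text alldata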


% Select the rows for January, April and August of 2005-2011.
% Dates are stored as yyyymmdd numbers, so year and month can be decoded.
dates = wxDataRaw(:,1);
yr = floor(dates/10000);
mo = mod(floor(dates/100), 100);
inRange = (yr >= 2005) & (yr <= 2011);

JanData    = find(inRange & (mo == 1));
AprilData  = find(inRange & (mo == 4));
AugustData = find(inRange & (mo == 8));

JanTemp    = wxDataRaw(JanData, :);
AprilTemp  = wxDataRaw(AprilData, :);
AugustTemp = wxDataRaw(AugustData, :);

% Max and min temperatures from 2005-2011.
JanMaxTemp    = JanTemp(:,2);
AprilMaxTemp  = AprilTemp(:,2);
AugustMaxTemp = AugustTemp(:,2);
JanMinTemp    = JanTemp(:,3);
AprilMinTemp  = AprilTemp(:,3);
AugustMinTemp = AugustTemp(:,3);

% Split point: the first 60% of the data is training data.
len = ceil(0.6*length(JanMaxTemp));

% The remaining 40% of the data is test data. It is taken before the
% training vectors are truncated so that the two sets do not overlap.
JanMaxTemp1    = JanMaxTemp(len+1:end);
AprilMaxTemp1  = AprilMaxTemp(len+1:end);
AugustMaxTemp1 = AugustMaxTemp(len+1:end);
JanMinTemp1    = JanMinTemp(len+1:end);
AprilMinTemp1  = AprilMinTemp(len+1:end);
AugustMinTemp1 = AugustMinTemp(len+1:end);

% Now keep only the first 60% of the data (training data).
JanMaxTemp    = JanMaxTemp(1:len);
AprilMaxTemp  = AprilMaxTemp(1:len);
AugustMaxTemp = AugustMaxTemp(1:len);
JanMinTemp    = JanMinTemp(1:len);
AprilMinTemp  = AprilMinTemp(1:len);
AugustMinTemp = AugustMinTemp(1:len);


% Let's try k-nearest neighbor using the MATLAB function itself.
%unifrnd(-5, 5, 10, 2)
%training = [JanMaxTemp ;AprilMaxTemp; AugustMaxTemp; JanMinTemp; AprilMinTemp; AugustMinTemp];
% training = [JanMaxTemp ; JanMinTemp];
% lenTrain = length(JanMaxTemp);
% %group = [repmat(1,lenTrain,1); repmat(2,lenTrain,1) ; repmat(3,lenTrain,1) ; repmat(4,lenTrain,1) ; repmat(5,lenTrain,1) ; repmat(6,lenTrain,1)];
% group = [repmat(1,lenTrain,1); repmat(2,lenTrain,1)];
%
% %sample = [JanMaxTemp1; AprilMaxTemp1; AugustMaxTemp1; JanMinTemp1; AprilMinTemp1; AugustMinTemp1];
% sample = [JanMaxTemp1; JanMinTemp1];
% lenTrain = length(JanMaxTemp1);
%
% c = knnclassify(sample, training, group);
% gscatter(JanMinTemp,JanMaxTemp,group,'rb','+x');
% hold on;
% gscatter(JanMinTemp1,JanMaxTemp1,c,'mc'); hold on;
% legend('Training group 1','Training group 2', 'Data in group 1','Data in group 2');
% hold off;

%% Plot data and Normal decision regions

% close all
st = 0.1; % step for grid plotting

% Set up the area to work in
x1 = min([JanMinTemp; AprilMinTemp; AugustMinTemp]):st: ...
     max([JanMaxTemp; AprilMaxTemp; AugustMaxTemp]);
x2 = x1;
[X1,X2] = meshgrid(x1,x2);

% Calculate the Normal models
muJanMinMax = mean([JanMinTemp JanMaxTemp]);
sJanMinMax = cov([JanMinTemp JanMaxTemp]);
muAprilMinMax = mean([AprilMinTemp AprilMaxTemp]);
sAprilMinMax = cov([AprilMinTemp AprilMaxTemp]);
muAugustMinMax = mean([AugustMinTemp AugustMaxTemp]);
sAugustMinMax = cov([AugustMinTemp AugustMaxTemp]);

% Plot the decision regions
xy = [X1(:) X2(:)];
image_size = size(X1);
JanPrior = 0.10;
AprilPrior = 0.20;
AugustPrior = 0.70;

% Make a matrix of the probabilities of each point
% for all classes
probs = [JanPrior*mvnpdf(xy, muJanMinMax, sJanMinMax) ...
         AprilPrior*mvnpdf(xy, muAprilMinMax, sAprilMinMax) ...
         AugustPrior*mvnpdf(xy, muAugustMinMax, sAugustMinMax)];

% Find the class with the max probability for each point
[m,idx] = max(probs, [], 2); % dim=2: take the max along each row

% Reshape idx (which contains the class label)
% into an image.
decisionmap = reshape(idx, image_size);

figure('name','Min and Max Temperatures')

% Show the image
imagesc([min(x1) max(x1)], [min(x2) max(x2)], decisionmap);
hold on;
set(gca,'ydir','normal');

% Colormap for the classes:
cmap = [0.6 0.6 1; 0.6 1 0.6; 1 0.6 0.6];
colormap(cmap);

hold all
plot(JanMinTemp, JanMaxTemp, 'bx', 'linewidth', 2, 'markersize', 10)
plot(AprilMinTemp, AprilMaxTemp, 'gx', 'linewidth', 2, 'markersize', 10)
plot(AugustMinTemp, AugustMaxTemp, 'rx', 'linewidth', 2, 'markersize', 10)
legend('Jan', 'April', 'August', 'location', 'nw')
legend boxoff
%set(gca, 'fontsize', 24)
xlabel('Min Daily Temp (F)')
ylabel('Max Daily Temp (F)')

% Plot Gaussian distributions based on the data.
% Each contour is drawn at the density level one standard deviation
% from the mean: peak * normpdf(-1,0,1)/normpdf(0,0,1).
FJan = reshape(mvnpdf([X1(:) X2(:)],muJanMinMax,sJanMinMax),...
    length(x2),length(x1));
mxFJan = max(max(FJan));
repLevelJan = mxFJan*normpdf(-1,0,1)/normpdf(0,0,1);
%contour(x1,x2,FJan, [repLevelJan repLevelJan], 'b', 'linewidth', 2)
contour(x1,x2,FJan, [repLevelJan repLevelJan], 'k', 'linewidth', 2)
plot(muJanMinMax(1), muJanMinMax(2), '*b', 'markersize', 20, 'linewidth', 2)

FApril = reshape(mvnpdf([X1(:) X2(:)],muAprilMinMax,sAprilMinMax),...
    length(x2),length(x1));
mxFApril = max(max(FApril));
repLevelApril = mxFApril*normpdf(-1,0,1)/normpdf(0,0,1);
contour(x1,x2,FApril, [repLevelApril repLevelApril], 'k', 'linewidth', 2)
plot(muAprilMinMax(1), muAprilMinMax(2), '*g', 'markersize', 20, 'linewidth', 2)

FAugust = reshape(mvnpdf([X1(:) X2(:)],muAugustMinMax,sAugustMinMax),...
    length(x2),length(x1));
mxFAugust = max(max(FAugust));
repLevelAugust = mxFAugust*normpdf(-1,0,1)/normpdf(0,0,1);
%contour(x1,x2,FAugust, [repLevelAugust repLevelAugust], 'b', 'linewidth', 2)
contour(x1,x2,FAugust, [repLevelAugust repLevelAugust], 'k', 'linewidth', 2)
plot(muAugustMinMax(1), muAugustMinMax(2), '*r', 'markersize', 20, 'linewidth', 2)

%% Classify the data points
% This is very similar to what was done to create the
% decision boundaries.
% This initial version uses all the data as training
% and as testing, so this is an example of checking
% to see how well the models classify the training data.
% The models should also be tried against some test data.

% Trim every month to the same number of points.
minSize = min([length(JanMinTemp),length(AprilMinTemp),length(AugustMinTemp)],[], 2);
JanMinTemp = JanMinTemp(1:minSize,:);
JanMaxTemp = JanMaxTemp(1:minSize,:);
AprilMinTemp = AprilMinTemp(1:minSize,:);
AprilMaxTemp = AprilMaxTemp(1:minSize,:);
AugustMinTemp = AugustMinTemp(1:minSize,:);
AugustMaxTemp = AugustMaxTemp(1:minSize,:);

% Make a matrix of all the points, the third column is
% the class each point belongs to.
pts = [JanMinTemp JanMaxTemp ones(size(JanMinTemp,1),1); ...
       AprilMinTemp AprilMaxTemp 2*ones(size(JanMinTemp,1),1); ...
       AugustMinTemp AugustMaxTemp 3*ones(size(JanMinTemp,1),1)];

% Make a matrix of the probabilities of each point
% for all classes
probs = [JanPrior*mvnpdf(pts(:,1:2), muJanMinMax, sJanMinMax) ...
         AprilPrior*mvnpdf(pts(:,1:2), muAprilMinMax, sAprilMinMax) ...
         AugustPrior*mvnpdf(pts(:,1:2), muAugustMinMax, sAugustMinMax)];

% Find the class with the max probability for each point
[m,pts(:,4)] = max(probs, [], 2);

% Now the first two columns of pts are the data,
% the third column is the class it belongs to and the
% fourth column is the class it was classified to be.
% Error is the percentage of wrong classifications.
nWrong = sum(pts(:,3)~=pts(:,4))
wrongInd = find(pts(:,3) ~= pts(:,4));
pts(wrongInd,:)
pctWrong = nWrong/size(pts,1)*100
pctRight = 100-pctWrong

%% Now to try a loop with varying the number of training
% points

% This is the maximum number of each month to train from
% due to the different lengths of months
maxN = min([size(JanMinTemp,1) size(AprilMinTemp,1) size(AugustMinTemp,1)]);
trainPctWrong = zeros(maxN,1); % preallocate to speed up
testPctWrong = zeros(maxN,1);  % preallocate to speed up
trainWrong = zeros(maxN,1);    % preallocate to speed up
testWrong = zeros(maxN,1);     % preallocate to speed up

% Is choosing the first N points the best way here?
jj = 1;
Ns = 6:(maxN-1);
for ii = Ns
    % Calculate the Normal models
    muJanMinMax = mean([JanMinTemp(1:ii) JanMaxTemp(1:ii)]);
    sJanMinMax = cov([JanMinTemp(1:ii) JanMaxTemp(1:ii)]);
    muAprilMinMax = mean([AprilMinTemp(1:ii) AprilMaxTemp(1:ii)]);
    sAprilMinMax = cov([AprilMinTemp(1:ii) AprilMaxTemp(1:ii)]);
    muAugustMinMax = mean([AugustMinTemp(1:ii) AugustMaxTemp(1:ii)]);
    sAugustMinMax = cov([AugustMinTemp(1:ii) AugustMaxTemp(1:ii)]);

    % Make a matrix of all the points, the third column is
    % the class each point belongs to.
    pts = [JanMinTemp JanMaxTemp ones(size(JanMinTemp,1),1); ...
           AprilMinTemp AprilMaxTemp 2*ones(size(JanMinTemp,1),1); ...
           AugustMinTemp AugustMaxTemp 3*ones(size(JanMinTemp,1),1)];

    % Make a matrix of the probabilities of each point
    % for all classes
    probs = [JanPrior*mvnpdf(pts(:,1:2), muJanMinMax, sJanMinMax) ...
             AprilPrior*mvnpdf(pts(:,1:2), muAprilMinMax, sAprilMinMax) ...
             AugustPrior*mvnpdf(pts(:,1:2), muAugustMinMax, sAugustMinMax)];

    % Find the class with the max probability for each point
    [m,pts(:,4)] = max(probs, [], 2);

    % Now the first two columns of pts are the data,
    % the third column is the class it belongs to and the
    % fourth column is the class it was classified to be.
    % Error is the percentage of wrong classifications.

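    % Rows of pts are stacked by month (Jan, April, August), so the
    % training rows are the first ii rows of each month's block.
    nPerMonth = size(JanMinTemp,1);
    isTrain = false(size(pts,1),1);
    isTrain([1:ii, nPerMonth+(1:ii), 2*nPerMonth+(1:ii)]) = true;
    isWrong = pts(:,3) ~= pts(:,4);

    trainWrong(jj) = sum(isWrong(isTrain));
    testWrong(jj) = sum(isWrong(~isTrain));
    trainPctWrong(jj) = trainWrong(jj)/sum(isTrain)*100;
    testPctWrong(jj) = testWrong(jj)/sum(~isTrain)*100;
    jj = jj + 1;
end

figure('name','Training Percentage')
plot(Ns(1:(jj-1)), trainPctWrong(1:(jj-1)), 'x-', 'linewidth', 2)
hold all
plot(Ns(1:(jj-1)), testPctWrong(1:(jj-1)), 'x-', 'linewidth', 2)
legend('Training', 'Test')
xlabel('Number of training points per month')
ylabel('Percent misclassified')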
