
Data Cleansing: Filling Missing Values in Data

Class Presentation CIS 764


Instructor: Dr. William Hankley
Presented by: Gaurav Chauhan

Overview

- Problems Caused
- Methods for retrieving missing values
- Predicting values
    - The average way
    - The probabilistic way
    - By leveraging the relational network structure
- Conclusions
CIS 764-Gaurav Chauhan

Problems Caused
Missing values in a data set cause problems in the following data analysis tasks:

- Summarizing variables
- Computing new variables
- Comparing variables
- Combining variables
- Time series analysis
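
As a small illustration of the first two problems, here is a minimal Python sketch. It assumes missing readings are encoded as NaN (an encoding chosen for this example, not taken from the slides), and the short rainfall list is illustrative only.

```python
import math

# Rainfall readings in cm; math.nan marks a missing value (assumed encoding).
rainfall = [30.0, 32.0, math.nan, 25.0]

# Summarizing: a single NaN makes the mean of the whole series NaN.
mean_all = sum(rainfall) / len(rainfall)

# Computing a new variable: the missing value propagates into the result.
rainfall_mm = [r * 10 for r in rainfall]

# Using only the available values avoids NaN but shrinks the sample.
available = [r for r in rainfall if not math.isnan(r)]
mean_available = sum(available) / len(available)

print(math.isnan(mean_all))        # True
print(math.isnan(rainfall_mm[2]))  # True
print(mean_available)              # 29.0
```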


Methods for retrieving missing values

- Considering the average of the available values for prediction
- Using a probabilistic approach for value prediction
- Leveraging the relational network structure of the data to predict values


Predicting Values - the average way

Year    Rainfall (avg, cm)          Temperature (avg, °F)
1936    30                          60
1937    32                          66
1938    N.A. (predicted = 28.5)     62
1939    25                          64
1940    23                          69
1941    30                          59
1942    N.A. (predicted = 29.0)     60
1943    28                          59
1944    22                          65

Finding the values for the years 1938 and 1942

The rainfall for each missing year is estimated as the average of the rainfall in the years immediately before and after it:

- Average of 1937 and 1939: Rainfall in 1938 = (32 + 25) / 2 cm = 28.5 cm
- Average of 1941 and 1943: Rainfall in 1942 = (30 + 28) / 2 cm = 29.0 cm
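
The same calculation can be sketched in Python. The function name `fill_with_neighbour_average` and the use of `None` for a missing year are assumptions made for this example. It only fills a gap when both neighbouring years are known.

```python
from typing import List, Optional


def fill_with_neighbour_average(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each isolated None with the mean of the values just before and after it."""
    filled = list(values)
    for i, v in enumerate(values):
        # First and last entries have only one neighbour, so they are left alone.
        if v is None and 0 < i < len(values) - 1:
            prev, nxt = values[i - 1], values[i + 1]
            if prev is not None and nxt is not None:
                filled[i] = (prev + nxt) / 2
    return filled


# Rainfall (cm) for 1936..1944 from the table; None marks the missing years.
rainfall = [30, 32, None, 25, 23, 30, None, 28, 22]
filled = fill_with_neighbour_average(rainfall)
print(filled[2], filled[6])  # 28.5 29.0
```

Gaps at either end of the series, or runs of consecutive missing years, stay unfilled in this sketch. They would need a different rule.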


Predicting Values - the probabilistic way

- Assume that we have n values and must predict the (n+1)th value, v_{n+1}.
- For every i from 1 to n, the probability that a data instance has the value v_i is p(v_i).
- Each of these probabilities is calculated on the basis of the frequency with which v_i occurs in the data.
- v_{n+1} is then picked at random such that

    p(v_{n+1} = v_i) > p(v_{n+1} = v_j)   if   p(v_i) > p(v_j)
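
One way to realize this rule is to draw the missing value at random with weights equal to the observed relative frequencies. The Python sketch below makes that assumption. The function name `predict_by_frequency` is hypothetical, and the observed values are the seven known rainfall readings from the earlier table, in which 30 occurs twice.

```python
import random
from collections import Counter


def predict_by_frequency(observed, rng=random):
    """Pick one value at random, weighted by how often each value occurs."""
    counts = Counter(observed)
    values = list(counts)
    # p(v_i) is the relative frequency of v_i among the observed values.
    weights = [counts[v] / len(observed) for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


observed = [30, 32, 25, 23, 30, 28, 22]
rng = random.Random(0)  # fixed seed so repeated runs give the same draw
prediction = predict_by_frequency(observed, rng)
print(prediction)
```

Because 30 has the highest frequency, it is the most likely prediction, but any observed value can be drawn.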


Predicting Values by leveraging the relational network

- This technique applies only to relational data.
- A missing value is predicted as the mode of the corresponding values of its peers, that is, the instances that fit the same relational network and have no missing values.


Predicting Values by leveraging the relational network

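Example 1

[Figure: relational network of books and their categories. Labels: Book A (Category A), Book B (Category B), Book C (Category C). In the second network, the category of one book is missing ("?") and is predicted as A.]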


Predicting Values by leveraging the relational network

Example 2

Teacher
    Student 1: Age 19
    Student 2: Age ? (predicted 19)
    Student 3: Age 18
    Student 4: Age 19
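
The prediction in Example 2 can be sketched as taking the mode of the known peer values. In the Python sketch below, the function name `predict_from_peers` and the dictionary layout are assumptions made for illustration. The peers are the students who share the same teacher.

```python
from collections import Counter


def predict_from_peers(peer_values):
    """Return the most common known value among the peers, or None if none is known."""
    known = [v for v in peer_values if v is not None]
    if not known:
        return None
    return Counter(known).most_common(1)[0][0]


# Ages of the students linked to the same teacher; None marks the missing age.
ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}
predicted = predict_from_peers(ages.values())
print(predicted)  # 19
```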



Conclusion

- Missing values are harmful when the data is used for analysis, learning, or mining.
- Various techniques aim to predict missing values, but none has reached 100% accuracy.
- An average prediction accuracy of 90% is still acceptable.


Questions, anyone?

"I am shivering not because of nervousness but because of the cold room temperature." - one nervous student
