Intelligent Data Analysis and Probabilistic Inference

Data Mining Tutorial 1: Overview & Data Cleaning

1. Basic Concepts
a. Give a brief definition of the term “Data Mining”.
b. Briefly explain the difference between “Data Mining”, “OLAP” and
traditional “Database Querying”.
c. Explain the difference between “Explorative Data Mining” and
“Predictive Data Mining” and give one example of each.
d. State three different applications for which data mining techniques seem
appropriate. Informally explain each application.
e. Explain what is meant by “Data Integration” and describe why it is an
important pre-processing step for data mining.

2. Data Mining Techniques and Applications


a. Explain briefly the difference between “Classification” and “Clustering” and give
an informal example of an application that would benefit from each technique.
b. Explain briefly the difference between “Regression” and “Classification”.
c. Explain briefly what is meant by “Association Rule Analysis” and describe the
difference between it and “Sequence Rule Analysis”.

3. Clustering:
You are given the task to cluster (i.e. divide into similar groups) the students
attending this tutorial based on their physical appearance.
a. Devise a feature representation scheme that allows describing each student in the
class as a record, make sure that you have at least 5 features to describe each
student. For each feature, describe the type of variable it denotes (Numerical,
Categorical, etc) and state the valid range of values for that variable.
b. Fill in the feature table for six students, i.e. build a table containing 6 rows and 5
columns and provide the values for each cell in the table.
c. Describe why you believe your feature representation scheme will produce good
results when applied to grouping the students in the tutorial.
d. Explain what is meant by an “outlier”. Add a new record to the table that you
believe would be an outlier compared to the whole data set and also to the
different clusters, and explain why it is indeed an outlier.

4. Classification:
You are now given the task to derive a model that can predict whether a student will
pass the data mining course or not (PASS/FAIL decision).
a. Devise a feature representation scheme with five features that can help in
deriving such a model. Make sure you choose features you believe may be good
predictors of a student’s grade, and describe why you believe they are better
predictors than the features you chose in question 3.
b. Fill in the table with six different records for six hypothetical students
from the class of 2000. This table should contain six columns (one column for

yg@doc.ic.ac.uk, mmg@doc.ic.ac.uk 25th Nov 2003


each of your chosen features, and one column for Pass/Fail result) and six rows
(one for each student). Which columns (variables) of this table are “independent”
(“input”) variables and which are “dependent” (“output” or “class”) variables?
c. A decision rule is in the form “If FeatureA = FeatureValue1 then
ClassValue = ClassValue1”. Informally derive at least 4 “Decision Rules” that can
be inferred from your data table. Is there any inconsistency between your rules?
d. Explain informally how you can test the accuracy of your decision rules
based on the data set you have provided. What is the accuracy of each rule?
What is the accuracy of the overall model (i.e. the 4 rules together)?
e. Testing the accuracy of the rules on your data set may be biased: they
probably over-fit your data since they were derived and tested only using this data
set. What would be a better way to assess the accuracy of your rules?
f. Explain how your decision rules can be applied to predict whether you
yourself will PASS or FAIL the data mining course in 2003.
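
As one possible way to approach parts c and d, the sketch below measures each rule's accuracy as the fraction of the records it covers that it classifies correctly. Everything in it is an assumption made for illustration: the feature names (`attendance`, `coursework`), the six records, the four rules, and the "first matching rule wins" strategy used to combine the rules into one model are all hypothetical and are not prescribed by this tutorial.

```python
# Hypothetical data table: feature names and values are invented for illustration.
records = [
    {"attendance": "high", "coursework": "good", "result": "PASS"},
    {"attendance": "high", "coursework": "poor", "result": "PASS"},
    {"attendance": "low", "coursework": "good", "result": "PASS"},
    {"attendance": "low", "coursework": "poor", "result": "FAIL"},
    {"attendance": "high", "coursework": "good", "result": "PASS"},
    {"attendance": "low", "coursework": "poor", "result": "FAIL"},
]

# Hypothetical rules of the form (feature, feature value, predicted class).
rules = [
    ("attendance", "high", "PASS"),
    ("attendance", "low", "FAIL"),
    ("coursework", "good", "PASS"),
    ("coursework", "poor", "FAIL"),
]


def rule_accuracy(rule, rows):
    """Fraction of the records covered by the rule that it classifies correctly."""
    feature, value, predicted = rule
    covered = [r for r in rows if r[feature] == value]
    if not covered:
        return None  # the rule never fires on this data set
    correct = sum(1 for r in covered if r["result"] == predicted)
    return correct / len(covered)


def predict(row, rule_list):
    """Assumed conflict resolution: the first rule that matches decides the class."""
    for feature, value, predicted in rule_list:
        if row[feature] == value:
            return predicted
    return None  # no rule covers this record


per_rule = [rule_accuracy(rule, records) for rule in rules]
model_accuracy = sum(
    1 for r in records if predict(r, rules) == r["result"]
) / len(records)

for rule, acc in zip(rules, per_rule):
    print(rule, acc)
print("overall:", model_accuracy)
```

On this invented table the rules on `attendance` and `coursework` disagree for a student with high attendance but poor coursework, which is the kind of inconsistency part c asks about. The accuracies are computed on the same records the rules were derived from, so they are subject to the bias described in part e.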

5. Classification/Prediction/Feature Selection:
There are many applications of data mining in finance. Explain why and how it can
be dangerous to naively use predictive data mining techniques to predict stock price
movements.
Hint: Consider what features you would choose to describe each stock and also
consider what really makes stock prices move. Can you find a good feature set that
can be presented to a data mining algorithm?

6. Data Cleaning:
a. Explain what is meant by “Data Cleaning” and why it may be required before
mining a large data set.
b. Describe three commonly used data cleaning operations.
c. Explain three methods for handling missing data in a dataset.

7. Data Cleaning:
Given the following data set [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
a. Divide the data set into 3 equi-depth bins.
b. Divide the data set into 3 bins that are smoothed by their means.
c. Normalize the data set based on a min-max normalization.
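
The three operations in this question can be checked with a short script. The sketch below is one possible reading, under these assumptions: equi-depth bins hold an equal number of sorted values (12 values in 3 bins of 4), smoothing by means replaces every value in a bin with that bin's mean, and min-max normalization maps onto the range [0, 1], since the question does not state a target range. The function names are hypothetical.

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equi_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins holding the same number of items.

    Assumes len(values) is divisible by n_bins, as it is for this data set.
    """
    ordered = sorted(values)
    depth = len(ordered) // n_bins
    return [ordered[i * depth:(i + 1) * depth] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in each bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


bins = equi_depth_bins(data, 3)
smoothed = smooth_by_means(bins)
normalized = min_max(data)

print(bins)      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed)  # bin means are 9.0, 22.75 and 29.25
print([round(v, 3) for v in normalized])
```

With a minimum of 4 and a maximum of 34, each normalized value is (v - 4) / 30, so 4 maps to 0 and 34 maps to 1.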

