1. Basic Concepts
a. Give a brief definition of the term “Data Mining”.
b. Briefly explain the difference between “Data Mining”, “OLAP” and
traditional “Database Querying”.
c. Explain the difference between “Explorative Data Mining” and
“Predictive Data Mining” and give one example of each.
d. State three different applications for which data mining techniques seem
appropriate. Informally explain each application.
e. Explain what is meant by “Data Integration” and describe why it is an
important pre-processing step for data mining.
3. Clustering:
You are given the task to cluster (i.e. divide into similar groups) the students
attending this tutorial based on their physical appearance.
a. Devise a feature representation scheme that allows describing each student in the
class as a record, make sure that you have at least 5 features to describe each
student. For each feature, describe the type of variable it denotes (Numerical,
Categorical, etc.) and state the valid range of values for that variable.
b. Fill in the feature table for six students, i.e. build a table containing 6 rows and 5
columns and provide the values for each cell in the table.
c. Describe why you believe your feature representation scheme will produce good
results when applied to grouping the students in the tutorial.
d. Explain what is meant by an “outlier”. Add a new record to the table that you
believe would be an outlier compared to the whole data set and also to the
different clusters, and explain why it is indeed an outlier.
4. Classification:
You are now given the task to derive a model that can predict whether a student will
pass the data mining course or not (PASS/FAIL decision).
a. Devise a feature representation scheme with five features that can help
in deriving such a model. Make sure you choose features you believe may be good
predictors of a student’s grade, and describe why you believe they are better
predictors than the features you chose in question 3.
b. Fill in the table with six different records for six hypothetical students
from the class of 2000. This table should contain six columns (one column for
5. Classification/Prediction/Feature Selection:
There are many applications of data mining in finance. Explain why and how it can
be dangerous to naively use predictive data mining techniques to predict stock price
movements.
Hint: Consider what features you would choose to describe each stock and also
consider what really makes stock prices move. Can you find a good feature set that
can be presented to a data mining algorithm?
6. Data Cleaning:
a. Explain what is meant by “Data Cleaning” and why it may be required before
mining a large data set.
b. Describe three commonly used data cleaning operations.
c. Explain three methods for handling missing data in a data set.
7. Data Cleaning:
Given the following data set: [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
a. Divide the data set into 3 equi-depth bins.
b. Divide the data set into 3 bins that are smoothed by their means.
c. Normalize the data set based on a min-max normalization.
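
The three operations in question 7 can be checked with a short Python sketch. This is one possible reading of the question, not an official solution. The function names are illustrative, the binning assumes the number of values divides evenly by the number of bins, and the target range [0, 1] for min-max normalization is an assumption, since the question does not state one.

```python
# Data set from question 7.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equi_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins holding the same number of values."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        # Assumption of this sketch: the values divide evenly across the bins.
        raise ValueError("number of values must be divisible by n_bins")
    depth = len(ordered) // n_bins
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min maps to new_min and max maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


bins = equi_depth_bins(data, 3)
smoothed = smooth_by_means(bins)
normalized = min_max_normalize(data)

print("bins:      ", bins)
print("smoothed:  ", smoothed)
print("normalized:", [round(v, 3) for v in normalized])
```

With 12 values and 3 bins, each bin holds 4 values. Passing different `new_min` and `new_max` arguments rescales the data to another target range.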