
CSE 4412 & CSE 6412 3.0 Data Mining
Tuesdays, Thursdays 13:00-14:20, LAS 3033
Fall Semester, 2014


_____________________________________________________________________________________________
THE BIG ASSIGNMENT
_____________________________________________________________________________________________

Name: _________________________________

1. How is a data warehouse different from a database? How are they similar?
Answer:
Differences between a data warehouse and a database: A data warehouse is a repository of information
collected from multiple sources over a history of time, stored under a unified schema, and used for data
analysis and decision support, whereas a database is a collection of interrelated data that represents the
current status of the stored data. There could be multiple heterogeneous databases where the schema of
one database may not agree with the schema of another. A database system supports ad-hoc query and
on-line transaction processing. Additional differences are detailed in the Han & Kamber textbook
Section 3.1.1, Differences between Operational Database Systems and Data Warehouses.
Similarities between a data warehouse and a database: Both are repositories of information, storing
huge amounts of persistent data.

2. Define each of the following data mining functionalities: characterization, discrimination, association and correlation analysis, classification, prediction, clustering, and evolution analysis. Give examples of each data mining functionality, using a real-life database that you are familiar with.
Answer:
Characterization is a summarization of the general characteristics or features of a target class of data.
For example, the characteristics of students can be produced, generating a profile of all the University
first year computing science students, which may include such information as a high GPA and large
number of courses taken.
Discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. For example, the general features of
students with high GPA's may be compared with the general features of students with low GPA's. The
resulting description could be a general comparative profile of the students such as 75% of the students
with high GPA's are fourth-year computing science students while 65% of the students with low GPA's
are not.
Association is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. For example, a data mining system may find association
rules like
major(X, "computing science") ⇒ owns(X, "personal computer") [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that, of the students under study, 12%
(support) major in computing science and own a personal computer, and there is a 98% probability
(confidence, or certainty) that a student in this group owns a personal computer.
Classification differs from prediction in that the former constructs a set of models (or functions) that
describe and distinguish data classes or concepts, whereas the latter builds a model to predict some
missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for
prediction: classification is used for predicting the class label of data objects, and prediction is
typically used for predicting missing numerical data values.
Clustering analyzes data objects without consulting a known class label. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass
similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate
taxonomy formation, that is, the organization of observations into a hierarchy of classes that group
similar events together.
Data evolution analysis describes and models regularities or trends for objects whose behavior changes
over time. Although this may include characterization, discrimination, association, classification, or
clustering of time-related data, distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.

3. Describe why concept hierarchies are useful in data mining.


Answer:
Concept hierarchies define a sequence of mappings from a set of lower-level concepts to higher-level, more
general concepts and can be represented as a set of nodes organized in a tree, in the form of a lattice, or as a
partial order. They are useful in data mining because they allow the discovery of knowledge at multiple
levels of abstraction and provide the structure on which data can be generalized (rolled-up) or specialized
(drilled-down). Together, these operations allow users to view the data from different perspectives, gaining
further insight into relationships hidden in the data. Generalizing has the advantage of compressing the data
set, and mining on a compressed data set will require fewer I/O operations. This will be more efficient than
mining on a large, uncompressed data set.
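As a minimal sketch of the idea (the attribute values here are illustrative, not from the relation in the text), a concept hierarchy can be stored as child-to-parent mappings, with roll-up following parent links toward the root:

```python
# A toy concept hierarchy for a 'major' attribute, stored as
# child -> parent mappings; 'ANY' is the implicit root.
hierarchy = {
    'physics': 'science', 'chemistry': 'science', 'math': 'science',
    'cs': 'appl. sciences', 'engineering': 'appl. sciences',
    'science': 'ANY', 'appl. sciences': 'ANY',
}

def roll_up(value):
    """Generalize a value one level up the hierarchy."""
    return hierarchy.get(value, 'ANY')

def path_to_root(value):
    """All abstraction levels of a value, leaf first."""
    path = [value]
    while path[-1] in hierarchy:
        path.append(hierarchy[path[-1]])
    return path

print(roll_up('physics'))     # science
print(path_to_root('cs'))     # ['cs', 'appl. sciences', 'ANY']
```

Drilling down is simply the inverse walk: from a parent to the set of children that map to it.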

4. What are the major challenges of mining a huge amount of data (such as billions of tuples) in comparison with mining a small amount of data (such as a data set of a few hundred tuples)?
Answer:
One challenge to data mining regarding performance issues is the efficiency and scalability of data mining
algorithms. Data mining algorithms must be efficient and scalable in order to effectively extract
information from large amounts of data in databases within predictable and acceptable running times.
Another challenge is the parallel, distributed, and incremental processing of data mining algorithms. The
need for parallel and distributed data mining algorithms has been brought about by the huge size of many
databases, the wide distribution of data, and the computational complexity of some data mining methods.
Due to the high cost of some data mining processes, incremental data mining algorithms incorporate
database updates without the need to mine the entire data again from scratch.

5. Suppose that the values for a given set of data are grouped into intervals. The intervals and corresponding frequencies are as follows.

    Age      Frequency
    1-5          200
    5-15         450
    15-20        300
    20-50       1500
    50-80        700
    80-110        44

Compute an approximate median value for the data.


Answer:
Using Equation (2.3) from the Han & Kamber text, we have L1 = 20, N = 3194, (Σ freq)l = 950,
freq_median = 1500, and width = 30, so

    median = L1 + ((N/2 - (Σ freq)l) / freq_median) × width
           = 20 + ((1597 - 950) / 1500) × 30
           ≈ 32.94 years.
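The interpolation can be checked with a short script (a sketch of the grouped-data median formula, not part of the required answer):

```python
# Grouped-data median: L1 + (N/2 - cum_freq_below) / freq_median * width
intervals = [(1, 5), (5, 15), (15, 20), (20, 50), (50, 80), (80, 110)]
freqs = [200, 450, 300, 1500, 700, 44]

N = sum(freqs)            # 3194
half = N / 2              # 1597.0
cum = 0                   # cumulative frequency below the current interval
for (lo, hi), f in zip(intervals, freqs):
    if cum + f >= half:                           # median interval found
        median = lo + (half - cum) / f * (hi - lo)
        break
    cum += f

print(round(median, 2))   # 32.94
```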
6. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.
Answer:
The various methods for handling the problem of missing values in data tuples include:


(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining task
involves classification or description). This method is not very effective unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of missing values per attribute
varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may not be a
reasonable task for large data sets with many missing values, especially when the value to be filled in is not
easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like "Unknown". If missing values are replaced by, say, "Unknown", then
the mining program may mistakenly think that they form an interesting concept, since they all have a value
in common, that of "Unknown". Hence, although this method is simple, it is not recommended.
(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal)
values: For example, suppose that the average income of AllElectronics customers is $28,000. Use this
value to replace any missing values for income.
(e) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal)
values, for all samples belonging to the same class as the given tuple: For example, if classifying
customers according to credit risk, replace the missing value with the average income value for customers
in the same credit risk category as that of the given tuple.
(f) Using the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other
customer attributes in the data set, we can construct a decision tree to predict the missing values for
income.
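Methods (d) and (e) can be sketched in a few lines. The toy records below are hypothetical (None marks a missing income); the per-class means are computed before any filling so that imputed values do not feed back into the estimates:

```python
# Hypothetical toy records; None marks a missing income value.
rows = [
    {'risk': 'low',  'income': 30000},
    {'risk': 'low',  'income': None},
    {'risk': 'high', 'income': 18000},
    {'risk': 'high', 'income': 22000},
    {'risk': 'high', 'income': None},
]

known = [r['income'] for r in rows if r['income'] is not None]
overall_mean = sum(known) / len(known)        # method (d): one global mean

# method (e): per-class means, computed before filling anything in
class_means = {}
for cls in {r['risk'] for r in rows}:
    vals = [r['income'] for r in rows
            if r['risk'] == cls and r['income'] is not None]
    class_means[cls] = sum(vals) / len(vals)

for r in rows:
    if r['income'] is None:
        r['income'] = class_means[r['risk']]  # fill by same-class mean

print(rows[1]['income'], rows[4]['income'])   # 30000.0 20000.0
```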
7. Suppose a group of 12 sales price records has been sorted as follows:

    5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Partition them into three bins by each of the following methods.
(a) equal-frequency partitioning
(b) equal-width partitioning
(c) clustering
Answer:
(a) equal-frequency partitioning
bin 1: 5, 10, 11, 13
bin 2: 15, 35, 50, 55
bin 3: 72, 92, 204, 215
(b) equal-width partitioning
The width of each interval is (215 - 5) / 3 = 70, giving boundaries at 75 and 145.
bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72
bin 2: 92
bin 3: 204, 215
(c) clustering
We will use a simple clustering technique: partition the data along the 2 biggest gaps in the data.
bin 1: 5, 10, 11, 13, 15
bin 2: 35, 50, 55, 72, 92
bin 3: 204, 215
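All three partitionings can be reproduced with a short script. Note that in (c) the 15→35 and 72→92 gaps tie at 20, so the tie is broken toward the earlier gap, matching the partition given above:

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
k = 3

# (a) equal-frequency: 12 sorted values -> 4 per bin
n = len(prices) // k
eq_freq = [prices[i * n:(i + 1) * n] for i in range(k)]

# (b) equal-width: width (215 - 5) / 3 = 70, boundaries at 75 and 145
width = (max(prices) - min(prices)) / k
edges = [min(prices) + width * (i + 1) for i in range(k - 1)]
eq_width = [[] for _ in range(k)]
for p in prices:
    eq_width[sum(p > e for e in edges)].append(p)

# (c) gap clustering: cut at the two largest gaps; break ties toward
# the earlier gap (hence the -i in the sort key)
def gap(i):
    return (prices[i + 1] - prices[i], -i)

cuts = sorted(i + 1 for i in sorted(range(len(prices) - 1), key=gap)[-2:])
clusters = [prices[:cuts[0]], prices[cuts[0]:cuts[1]], prices[cuts[1]:]]

print(eq_freq)    # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
print(eq_width)   # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
print(clusters)   # [[5, 10, 11, 13, 15], [35, 50, 55, 72, 92], [204, 215]]
```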

8. Assume a base cuboid of 10 dimensions contains only three base cells: (1) (a1, d2, d3, d4, …, d9, d10), (2) (d1, b2, d3, d4, …, d9, d10), and (3) (d1, d2, c3, d4, …, d9, d10), where a1 ≠ d1, b2 ≠ d2, and c3 ≠ d3. The measure of the cube is count.
(a) How many nonempty cuboids will a full data cube contain?
(b) How many nonempty aggregate (i.e., nonbase) cells will a full cube contain?
(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube is count ≥ 2?


(d) A cell, c, is a closed cell if there exists no cell, d , such that d is a specialization of cell c (i.e., d is
obtained by replacing a * in c by a non-* value) and d has the same measure value as c . A closed cube is
a data cube consisting of only closed cells. How many closed cells are in the full cube?
Answer:
(a) How many nonempty cuboids will a complete data cube contain?
2^10 = 1024.
(b) How many nonempty aggregate (i.e., nonbase) cells will a complete cube contain?
(1) Each base cell generates 2^10 - 1 nonempty aggregate cells, so before removing overlaps we have
3 × 2^10 - 3 cells.
(2) There are 3 × 2^7 cells overlapped once (thus of count 2) and 1 × 2^7 cells, of the form
(*, *, *, d4, …, d10), overlapped twice (thus of count 3). We must therefore remove
3 × 2^7 + 2 × 2^7 = 5 × 2^7 duplicate counts in total.
(3) Thus we have 3 × 2^10 - 5 × 2^7 - 3 = 19 × 2^7 - 3 = 2429 nonempty aggregate cells.
(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube
is count ≥ 2?
Analysis: (1) (*, *, d3, d4, …, d9, d10) has count 2 since it is generated by both cell 1 and cell 2;
similarly, we have (2) (*, d2, *, d4, …, d9, d10): 2 and (3) (d1, *, *, d4, …, d9, d10): 2; and
(4) (*, *, *, d4, …, d9, d10): 3. Each of these four patterns contributes one cell for every choice of
d_i or * in positions 4 through 10.
Therefore we have 4 × 2^7 = 2^9 = 512 cells.
(d) A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization of cell c (i.e., d is
obtained by replacing a * in c by a non-* value) and d has the same measure value as c. A closed cube
is a data cube consisting of only closed cells. How many closed cells are in the full cube?
There are seven closed cells, as follows:
(1) (a1, d2, d3, d4, …, d9, d10): 1,
(2) (d1, b2, d3, d4, …, d9, d10): 1,
(3) (d1, d2, c3, d4, …, d9, d10): 1,
(4) (*, *, d3, d4, …, d9, d10): 2,
(5) (*, d2, *, d4, …, d9, d10): 2,
(6) (d1, *, *, d4, …, d9, d10): 2, and
(7) (*, *, *, d4, …, d9, d10): 3.
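All three counts can be verified by brute-force enumeration of the cube, which is small enough to materialize in full (a verification sketch, not part of the required answer):

```python
from itertools import product
from collections import Counter

D = 10
d = tuple('d%d' % i for i in range(1, D + 1))
base = [
    ('a1',) + d[1:],             # (a1, d2, ..., d10)
    (d[0], 'b2') + d[2:],        # (d1, b2, d3, ..., d10)
    d[:2] + ('c3',) + d[3:],     # (d1, d2, c3, d4, ..., d10)
]

# count of a cell = number of base cells that aggregate to it
counts = Counter()
for cell in base:
    for mask in product((False, True), repeat=D):   # True -> replace by *
        counts[tuple('*' if m else v for v, m in zip(cell, mask))] += 1

aggregate = {c: n for c, n in counts.items() if c not in base}
print(len(aggregate))                            # (b) 19 * 2**7 - 3 = 2429
print(sum(n >= 2 for n in aggregate.values()))   # (c) 4 * 2**7 = 512

# (d) c is closed iff no one-step specialization (one * made concrete,
# using a value taken from some base cell) has the same count; counts
# are monotone under specialization, so one-step checks suffice.
def is_closed(c):
    n = counts[c]
    for i, v in enumerate(c):
        if v != '*':
            continue
        for b in base:
            s = c[:i] + (b[i],) + c[i + 1:]
            if counts.get(s, 0) == n:
                return False
    return True

closed = [c for c in counts if is_closed(c)]
print(len(closed))                               # (d) 7
```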
9. Often, the aggregate measure value of many cells in a large data cuboid is zero, resulting in a huge, yet sparse, multidimensional matrix.
(a) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that
you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve
data from your structures.
(b) Modify your design in (a) to handle incremental data updates. Give the reasoning behind your new
design.
Answer:
(a) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that
you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve
data from your structures.
A way to overcome the sparse matrix problem is to use multiway array aggregation. (Note: this answer is
based on the paper by Zhao, Deshpande, and Naughton entitled "An array-based algorithm for simultaneous
multidimensional aggregates" in Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 159-170,
Tucson, Arizona, May 1997 [ZDN97].)
The first step consists of partitioning the array-based cube into chunks or subcubes that are small enough to
fit into the memory available for cube computation. Each of these chunks is first compressed to remove
cells that do not contain any valid data, and is then stored as an object on disk. For storage and retrieval
purposes, the chunkID + offset can be used as the cell address. The second step involves computing the
aggregates by visiting cube cells in an order that minimizes the number of times that each cell must be


revisited, thereby reducing memory access and storage costs. By first sorting and computing the planes of
the data cube according to their size in ascending order, a smaller plane can be kept in main memory while
fetching and computing only one chunk at a time for a larger plane.
(b) Modify your design in (a) to handle incremental data updates. Give the reasoning behind your new
design.
In order to handle incremental data updates, the data cube is first computed as described in (a).
Subsequently, only the chunk that contains the cells with the new data is recomputed, without needing to
recompute the entire cube. This is because, with incremental updates, only one chunk at a time can be
affected. The recomputed value needs to be propagated to its corresponding higher-level cuboids. Thus,
incremental data updates can be performed efficiently.
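The chunkID + offset addressing described in (a) can be sketched as follows. The 3-D cube and the chunk shape are illustrative assumptions; only non-empty cells are ever materialized, so storage is proportional to the number of valid cells rather than to the full matrix:

```python
# Sparse chunked storage: cells addressed by (chunk_id, offset).
CHUNK = (4, 4, 4)                 # cells per chunk along each dimension

store = {}                        # (chunk_id, offset) -> measure value

def address(cell):
    """Map a global cell coordinate to its (chunk_id, offset)."""
    chunk_id = tuple(c // s for c, s in zip(cell, CHUNK))
    offset = 0
    for c, s in zip(cell, CHUNK):          # row-major offset in the chunk
        offset = offset * s + c % s
    return chunk_id, offset

def put(cell, value):
    store[address(cell)] = value           # empty cells are never stored

def get(cell):
    return store.get(address(cell), 0)     # absent entry -> measure is 0

put((5, 0, 2), 7)
print(get((5, 0, 2)), get((0, 0, 0)))      # 7 0
```

An incremental update under this design touches only the chunk whose (chunk_id, offset) the updated cell maps to, as described in (b).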

10. Suppose that a data relation describing students at Big University has been generalized to the generalized relation R in the table below.
(Table of the generalized relation R omitted.)

Let the concept hierarchies be as follows:

status:      {freshman, sophomore, junior, senior} ⊂ undergraduate
             {M.Sc., M.A., Ph.D.} ⊂ graduate
major:       {physics, chemistry, math} ⊂ science
             {cs, engineering} ⊂ appl. sciences
             {French, philosophy} ⊂ arts
age:         {16…20, 21…25} ⊂ young
             {26…30, over 30} ⊂ old
nationality: {Asia, Europe, Latin America} ⊂ foreign
             {U.S.A., Canada} ⊂ North America


Let the minimum support threshold be 20% and the minimum confidence threshold be 50% (at each of the
levels).
(a) Draw the concept hierarchies for status, major, age, and nationality.
(b) Write a program to find the set of strong multilevel association rules in R using uniform support for all
levels, for the following rule template:

    ∀S ∈ R, P(S, x) ∧ Q(S, y) ⇒ gpa(S, z)  [s, c]

where P, Q ∈ {status, major, age, nationality}.


(c) Use the program to find the set of strong multilevel association rules in R using level-cross filtering by
single items. In this strategy, an item at the ith level is examined if and only if its parent node at the (i-1)th
level in the concept hierarchy is frequent. That is, if a node is frequent, its children will be examined;
otherwise, its descendants are pruned from the search. Use a reduced support of 10% for the lowest
abstraction level, for the preceding rule template.
Answer:
(a) Draw the concept hierarchies for status, major, age, and nationality.
Students can easily sketch the corresponding concept hierarchies.
(b) Find the set of strong multilevel association rules in R using uniform support for all levels.
status(X, undergraduate) ∧ major(X, science) ⇒ gpa(X, 3.6…4.0) [20%, 100%]
status(X, undergraduate) ∧ major(X, appl. sciences) ⇒ gpa(X, 3.2…3.6) [43%, 100%]
status(X, undergraduate) ∧ age(X, young) ⇒ gpa(X, 3.2…3.6) [55%, 71%]
status(X, undergraduate) ∧ nationality(X, North America) ⇒ gpa(X, 3.2…3.6) [42%, 62%]
major(X, science) ∧ nationality(X, North America) ⇒ gpa(X, 3.6…4.0) [20%, 100%]
major(X, appl. sciences) ∧ nationality(X, North America) ⇒ gpa(X, 3.2…3.6) [31%, 100%]
major(X, science) ∧ age(X, young) ⇒ gpa(X, 3.6…4.0) [20%, 100%]
major(X, appl. sciences) ∧ age(X, young) ⇒ gpa(X, 3.2…3.6) [43%, 100%]
age(X, young) ∧ nationality(X, North America) ⇒ gpa(X, 3.2…3.6) [42%, 65%]
status(X, junior) ∧ major(X, engineering) ⇒ gpa(X, 3.2…3.6) [21%, 100%]
status(X, junior) ∧ age(X, 21…25) ⇒ gpa(X, 3.2…3.6) [21%, 83.5%]
status(X, junior) ∧ nationality(X, Canada) ⇒ gpa(X, 3.2…3.6) [28%, 77%]
major(X, engineering) ∧ age(X, 21…25) ⇒ gpa(X, 3.2…3.6) [21%, 100%]
age(X, 16…20) ∧ nationality(X, Canada) ⇒ gpa(X, 3.2…3.6) [30%, 80%]
(c) Find the set of strong multilevel association rules in R using level-cross filtering by single items, where
a reduced support of 10% is used for the lowest abstraction level.
Note: The following set of rules is mined in addition to those mined above.
status(X, junior) ∧ age(X, 16…20) ⇒ gpa(X, 3.2…3.6) [20%, 58%]
status(X, senior) ∧ age(X, 16…20) ⇒ gpa(X, 3.2…4.0) [14%, 77%]
status(X, Ph.D.) ∧ age(X, 26…30) ⇒ gpa(X, 3.6…4.0) [12%, 100%]
status(X, junior) ∧ nationality(X, Europe) ⇒ gpa(X, 3.2…3.6) [13%, 100%]
status(X, senior) ∧ nationality(X, Canada) ⇒ gpa(X, 3.2…3.6) [14%, 85%]
major(X, math) ∧ age(X, 16…20) ⇒ gpa(X, 3.6…4.0) [11%, 100%]
major(X, French) ∧ age(X, 16…20) ⇒ gpa(X, 3.2…3.6) [12%, 92%]
major(X, cs) ∧ nationality(X, Canada) ⇒ gpa(X, 3.2…3.6) [18%, 100%]
major(X, engineering) ∧ nationality(X, Canada) ⇒ gpa(X, 3.2…3.6) [12%, 100%]
major(X, French) ∧ nationality(X, Canada) ⇒ gpa(X, 3.2…3.6) [12%, 96%]
age(X, 21…25) ∧ nationality(X, Canada) ⇒ gpa(X, 3.2…3.6) [12%, 100%]


11. The following table consists of training data from an employee database. The data have been generalized. For example, "31…35" for age represents the age range of 31 to 35. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.
department   status   age      salary     count
sales        senior   31…35    46K…50K    30
sales        junior   26…30    26K…30K    40
sales        junior   31…35    31K…35K    40
systems      junior   21…25    46K…50K    20
systems      senior   31…35    66K…70K    5
systems      junior   26…30    46K…50K    3
systems      senior   41…45    66K…70K    3
marketing    senior   36…40    46K…50K    10
marketing    junior   31…35    41K…45K    4
secretary    senior   46…50    36K…40K    4
secretary    junior   26…30    26K…30K    6
Let status be the class label attribute.


(a) How would you modify the basic decision tree algorithm to take into consideration the count of each
generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values "systems", "26…30", and "46K…50K" for the attributes department,
age, and salary, respectively, what would a naïve Bayesian classification of the status for the tuple be?
(d) Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and
output layers.
(e) Using the multilayer feed-forward neural network obtained above, show the weight values after one
iteration of the backpropagation algorithm, given the training instance (sales, senior, 31…35, 46K…50K).
Indicate your initial weight values and biases, and the learning rate used.
Answer:
(a) How would you modify the basic decision tree algorithm to take into consideration the count of each
generalized data tuple (i.e., of each row entry)?
The basic decision tree algorithm should be modified as follows to take into consideration the count of each
generalized data tuple.

The count of each tuple must be integrated into the calculation of the attribute selection measure (such
as information gain).
Take the count into consideration to determine the most common class among the tuples.
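The first modification can be sketched as a count-weighted information gain. The department counts below are aggregated from the table; the counts for the rows whose values were lost in extraction (5, 3, 3, 4, 4, 6) are assumed from the standard version of this exercise:

```python
import math

def entropy(dist):
    """Entropy of a class distribution {label: weighted count}."""
    total = sum(dist.values())
    return -sum(c / total * math.log2(c / total)
                for c in dist.values() if c)

# (attribute value, class label, count): the count weights the tuple
tuples = [('sales', 'senior', 30), ('sales', 'junior', 80),
          ('systems', 'senior', 8), ('systems', 'junior', 23),
          ('marketing', 'senior', 10), ('marketing', 'junior', 4),
          ('secretary', 'senior', 4), ('secretary', 'junior', 6)]

def info_gain(tuples):
    whole, by_val = {}, {}
    for val, label, n in tuples:           # accumulate weighted counts
        whole[label] = whole.get(label, 0) + n
        by_val.setdefault(val, {})
        by_val[val][label] = by_val[val].get(label, 0) + n
    total = sum(whole.values())
    expected = sum(sum(d.values()) / total * entropy(d)
                   for d in by_val.values())
    return entropy(whole) - expected       # gain of splitting on the attribute

print(info_gain(tuples))
```

The second modification is the same idea applied to majority voting: the most common class at a leaf is the class with the largest total count, not the largest number of rows.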

(b) Use your algorithm to construct a decision tree from the given data.
The resulting tree is:

salary = 26K…30K: junior
salary = 31K…35K: junior
salary = 36K…40K: senior
salary = 41K…45K: junior
salary = 46K…50K:
    department = secretary: junior
    department = sales: senior
    department = systems: junior
    department = marketing: senior
salary = 66K…70K: senior
(c) Given a data tuple having the values "systems", "26…30", and "46K…50K" for the attributes department,
age, and salary, respectively, what would a naïve Bayesian classification of the status for the tuple be?
P(X|senior) = 0; P(X|junior) ≈ 0.018. Thus, a naïve Bayesian classification predicts junior.
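The two likelihoods can be verified directly from the count-weighted table. The counts missing from the extracted table (5, 3, 3, 4, 4, 6) are assumed from the standard version of this exercise; they are consistent with the 0.018 value quoted above:

```python
# (department, status, age, salary, count)
data = [
    ('sales',     'senior', '31-35', '46K-50K', 30),
    ('sales',     'junior', '26-30', '26K-30K', 40),
    ('sales',     'junior', '31-35', '31K-35K', 40),
    ('systems',   'junior', '21-25', '46K-50K', 20),
    ('systems',   'senior', '31-35', '66K-70K', 5),
    ('systems',   'junior', '26-30', '46K-50K', 3),
    ('systems',   'senior', '41-45', '66K-70K', 3),
    ('marketing', 'senior', '36-40', '46K-50K', 10),
    ('marketing', 'junior', '31-35', '41K-45K', 4),
    ('secretary', 'senior', '46-50', '36K-40K', 4),
    ('secretary', 'junior', '26-30', '26K-30K', 6),
]

def likelihood(status, dept, age, salary):
    """P(X | status) under the naive conditional-independence assumption."""
    total = sum(r[4] for r in data if r[1] == status)
    def p(idx, val):
        return sum(r[4] for r in data
                   if r[1] == status and r[idx] == val) / total
    return p(0, dept) * p(2, age) * p(3, salary)

print(round(likelihood('junior', 'systems', '26-30', '46K-50K'), 3))  # 0.018
print(likelihood('senior', 'systems', '26-30', '46K-50K'))            # 0.0
```

P(X|senior) is zero because no senior tuple has age 26-30, so the age factor vanishes.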
(d) Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and
output layers.
No standard answer. Every feasible solution is correct. As stated in the Han & Kamber book, discrete-valued
attributes may be encoded such that there is one input unit per domain value. The number of hidden-layer
units should be smaller than the number of input units but larger than the number of output units.
(e) Using the multilayer feed-forward neural network obtained above, show the weight values after one
iteration of the backpropagation algorithm, given the training instance (sales, senior, 31…35, 46K…50K).
Indicate your initial weight values and biases, and the learning rate used.
No standard answer. Every feasible solution is correct.
12. Quantize the (R, G, B) cube with 8 prototypical colours (figure of the RGB colour cube omitted). That is, each pixel's RGB vector gets replaced by that of the closest prototypical colour.
Take a six-pixel image with RGB values (0.22, 0.37, 0.8), (0.19, 0.8, 0.19), (0.6, 0.1, 0.05), (0.8, 0.3, 0.22), (0.7, 0.32, 0.8), (1, 0.4, 0.34). What is the bag-of-colours representation of the image? What is the representation after norming by Euclidean length?
Answer:
The colour labels of the points are, in order: blue, green, red, red, magenta, and red. The bag-of-colours
representation is therefore

    colour   blue  green  magenta  red
    count       1      1        1    3

before normalization. The Euclidean length of this vector is sqrt(1^2 + 1^2 + 1^2 + 3^2) = sqrt(12) ≈ 3.46,
giving

    colour   blue  green  magenta  red
    weight   0.29   0.29     0.29  0.87

after normalization.
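The quantization can be checked in a few lines. The 8 prototypical colours are assumed to be the corners of the RGB cube (the omitted figure); the labels and weights then follow from nearest-neighbour assignment:

```python
import math

# Assumed prototypes: the 8 corners of the RGB cube.
prototypes = {
    'black':   (0, 0, 0), 'red':   (1, 0, 0), 'green': (0, 1, 0),
    'blue':    (0, 0, 1), 'yellow': (1, 1, 0), 'cyan':  (0, 1, 1),
    'magenta': (1, 0, 1), 'white': (1, 1, 1),
}

pixels = [(0.22, 0.37, 0.8), (0.19, 0.8, 0.19), (0.6, 0.1, 0.05),
          (0.8, 0.3, 0.22), (0.7, 0.32, 0.8), (1, 0.4, 0.34)]

def nearest(p):
    """Label of the prototype closest to p in Euclidean distance."""
    return min(prototypes,
               key=lambda k: sum((a - b) ** 2
                                 for a, b in zip(p, prototypes[k])))

bag = {}
for p in pixels:                      # histogram of quantized colours
    c = nearest(p)
    bag[c] = bag.get(c, 0) + 1
print(bag)      # {'blue': 1, 'green': 1, 'red': 3, 'magenta': 1}

length = math.sqrt(sum(v * v for v in bag.values()))   # sqrt(12)
normed = {c: round(v / length, 2) for c, v in bag.items()}
print(normed)   # {'blue': 0.29, 'green': 0.29, 'red': 0.87, 'magenta': 0.29}
```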


13. The UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/) currently maintains 298 data
sets as a service to the machine learning community. You may view all data sets through their searchable
interface. Their old web site (http://mlearn.ics.uci.edu/MLRepository.html) is still available. For a general
overview of the Repository, please visit http://archive.ics.uci.edu/ml/about.html.
For each of the following four data sets, use WEKA to induce classification rules (show all induced rules)
under three different discretization methods, and compare the results (mean accuracy and standard
deviation of accuracy).
http://archive.ics.uci.edu/ml/datasets/Iris
Iris Data Set: for flower classification.

http://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Diabetes 130-US hospitals for years 1999-2008 Data Set: This data has been prepared to analyze factors
related to readmission as well as other outcomes pertaining to patients with diabetes.

http://archive.ics.uci.edu/ml/datasets/Wine+Quality
Wine Quality Data Set: using chemical analysis, determine the origin of wines.

http://archive.ics.uci.edu/ml/datasets/Tennis+Major+Tournament+Match+Statistics
Tennis Major Tournament Match Statistics Data Set: This is a collection of 8 files containing the match
statistics for both women and men at the four major tennis tournaments of the year 2013. Each file has 42
columns and a minimum of 76 rows.

How can rule quality measures aid classification?


Answers:
Output from WEKA plus explanation of results.
Rule quality measures can help to determine when to stop generalization or specification of rules in a rule
induction system. Rule quality measures can also help to resolve conflicts among rules in a rule
classification system.

