Dept: ISE
UNIT-I
1. What is Data Mining? Explain the process of Knowledge Discovery in Databases (KDD) with a
diagram
2. What are the different motivating challenges faced by Data Mining Algorithms? Explain each of them
7. In case of record data, what is transaction or market basket data, a Data Matrix and a Sparse Data Matrix?
Explain with examples.
8. In case of ordered data, Explain Sequential Data, Sequence Data, Time Series Data and Spatial Data
with examples
9. What do you mean by Data Preprocessing? Explain Aggregation and Sampling in this respect
11. What are the different variations of Graph Data? Explain with diagrams
12. What is Feature Subset Selection? What are the different approaches for doing this? Explain the
architecture of Feature subset selection with a diagram
i) Feature Extraction
NMIT, Bangalore Data Mining Question Bank Dept of ISE
iii) Feature Construction
14. What do you mean by Binarization? Explain the conversion of a Categorical Attribute to 3 binary
attributes. What is its drawback? How is it overcome?
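The two encodings this question contrasts can be sketched in a few lines (the attribute values below are hypothetical examples, not from the question bank):

```python
from math import ceil, log2

# Hypothetical categorical attribute with 3 values
values = ["awful", "OK", "great"]

# Compact binarization: ceil(log2(3)) = 2 binary attributes encoding the index
bits = ceil(log2(len(values)))
compact = {v: format(i, f"0{bits}b") for i, v in enumerate(values)}
print(compact)   # {'awful': '00', 'OK': '01', 'great': '10'}

# Drawback: values that share a bit look spuriously related. The usual remedy
# is one asymmetric binary attribute per value, i.e. a one-hot encoding.
one_hot = {v: tuple(int(i == j) for j in range(len(values)))
           for i, v in enumerate(values)}
print(one_hot)   # {'awful': (1, 0, 0), 'OK': (0, 1, 0), 'great': (0, 0, 1)}
```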
15. How is Discretization of Continuous Attributes done? In this regard, Explain unsupervised and
supervised Discretization.
ii) Normalization/Standardization
20. What are Discrete and Continuous Attributes? Explain the term resolution.
21. What is the curse of Dimensionality? Explain Data Quality issues related to applications
UNIT-II
1. Give the formal definition of classification. What is classification model? Explain with diagram
2. With a diagram, explain the general approach for building a classification model
3. For the Nodes N1 & N2 given below, calculate the Gini Index, Entropy and Classification Error.
Node N1   Count
Class=0     0
Class=1     6

Node N2   Count
Class=0     1
Class=1     5
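The three impurity measures for N1 and N2 can be cross-checked with a short sketch (helper names are ours):

```python
from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    return 1 - max(counts) / sum(counts)

n1 = [0, 6]   # Node N1: Class=0 -> 0, Class=1 -> 6
n2 = [1, 5]   # Node N2: Class=0 -> 1, Class=1 -> 5

print(gini(n1), entropy(n1), classification_error(n1))  # N1 is pure: all 0.0
print(gini(n2), entropy(n2), classification_error(n2))  # ~0.278, ~0.650, ~0.167
```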
4. What is confusion matrix? Explain the confusion matrix for a 2-class problem with an example. In
this regard, explain Accuracy and error rate of prediction with appropriate formula
11. Calculate the Gini Index for Attributes A and B given below and specify which attribute is better for
splitting.
A B
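The count tables for attributes A and B do not survive in this copy, so the sketch below uses hypothetical class counts purely to show the weighted-Gini comparison the question asks for:

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def split_gini(partitions):
    """Weighted Gini of a candidate split; partitions = class counts per branch."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Hypothetical counts: each inner list is [positives, negatives] for one branch
A = [[4, 3], [0, 3]]
B = [[3, 1], [1, 5]]
print(split_gini(A))   # ~0.343
print(split_gini(B))   # ~0.317 -> the lower weighted Gini marks the better split
```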
12. What is rule based classifier? Explain how it works with an example. In this regard, also define
accuracy and coverage
13. Consider a training set that contains 60 positive examples and 100 negative examples. Suppose two
rules are given:
For the above two rules, calculate Laplace, accuracy, coverage and likelihood ratio.
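The two rules themselves are not reproduced in this extract, so the sketch below defines the four measures for a rule with assumed coverage counts (50 positives, 5 negatives), against the stated 60/100 training set:

```python
from math import log2

P, N = 60, 100   # positive / negative examples in the training set
K = 2            # number of classes (for the Laplace measure)

def rule_measures(p, n):
    """Measures for a rule covering p positives and n negatives (hypothetical counts)."""
    covered = p + n
    coverage = covered / (P + N)
    accuracy = p / covered
    laplace = (p + 1) / (covered + K)
    # Likelihood ratio statistic: 2 * sum f_i * log2(f_i / e_i), where e_i is
    # the coverage expected under the training-set class proportions.
    ep = covered * P / (P + N)
    en = covered * N / (P + N)
    lr = 2 * sum(f * log2(f / e) for f, e in ((p, ep), (n, en)) if f > 0)
    return coverage, accuracy, laplace, lr

cov, acc, lap, lr = rule_measures(50, 5)
print(cov, acc, lap, lr)   # 0.34375, ~0.909, ~0.895, ~99.9
```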
14. Explain characteristics of Rule-Based classifier
15. How can a decision tree be converted into classification rules? Explain with example.
19. Consider a training set that contains 100 positive examples and 400 negative examples. For each of
the following candidate rules
Determine which are the best and worst candidate rules according to:
20. Consider a training set that contains 29 positive examples and 21 negative examples. For each of the
following candidate rules, determine which are the best and worst according to:
21. For the following Confusion matrix , calculate the Accuracy and Error rate:
                   Predicted Class=1   Predicted Class=0
Actual Class=1            15                  10
Actual Class=0            20                  11
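Accuracy and error rate follow directly from the matrix above; a minimal sketch:

```python
# Confusion-matrix entries from the table above (rows = actual, cols = predicted)
tp, fn = 15, 10   # actual Class=1
fp, tn = 20, 11   # actual Class=0

total = tp + fn + fp + tn
accuracy = (tp + tn) / total      # correct predictions / all predictions
error_rate = (fp + fn) / total    # equivalently 1 - accuracy

print(accuracy)    # 26/56 ~ 0.464
print(error_rate)  # 30/56 ~ 0.536
```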
22. Consider the following table with attributes A, B, C and two class labels +, -
A   B   C   |  +  |  -
T   T   T   |  5  |  0
F   T   T   |  0  | 20
T   F   T   | 20  |  0
F   F   T   |  0  |  5
T   T   F   |  0  |  0
F   T   F   | 25  |  0
T   F   F   |  0  |  0
F   F   F   |  0  | 25
According to the classification error rate, which attribute would be chosen as the best splitting attribute?
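One way to verify the answer is to compute the weighted classification error of each candidate split (helper names are ours):

```python
from collections import defaultdict

# Rows of the table above: (A, B, C, count of '+', count of '-')
rows = [
    ("T", "T", "T", 5, 0), ("F", "T", "T", 0, 20),
    ("T", "F", "T", 20, 0), ("F", "F", "T", 0, 5),
    ("T", "T", "F", 0, 0), ("F", "T", "F", 25, 0),
    ("T", "F", "F", 0, 0), ("F", "F", "F", 0, 25),
]

def weighted_error(attr):
    """Weighted classification error of a binary split on one attribute."""
    idx = {"A": 0, "B": 1, "C": 2}[attr]
    parts = defaultdict(lambda: [0, 0])
    for r in rows:
        parts[r[idx]][0] += r[3]
        parts[r[idx]][1] += r[4]
    total = sum(p + n for p, n in parts.values())
    return sum((p + n) / total * (1 - max(p, n) / (p + n))
               for p, n in parts.values() if p + n > 0)

for attr in "ABC":
    print(attr, weighted_error(attr))
# A gives the lowest weighted error (0.25), so A is the best splitting attribute.
```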
UNIT-III
1. How is market basket data represented in a binary format? Explain with example. In this case explain
the terms itemset, association rule, support count, support and confidence
4. What is frequent itemset generation? Generate candidate 3-itemsets for the following data by applying
the Apriori principle, taking a minimum support threshold of 60%
TID Items
1 {Bread, Milk}
2 {Bread, A, B, C}
3 {Milk, A, B, D}
4 {Bread, Milk, A, B}
5 {Bread, Milk, A, D}
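With 5 transactions, a 60% threshold means a support count of at least 3. The level-wise generation for this data can be cross-checked with a short sketch (a rough implementation of the Apriori principle, not the textbook pseudocode):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "A", "B", "C"},
    {"Milk", "A", "B", "D"},
    {"Bread", "Milk", "A", "B"},
    {"Bread", "Milk", "A", "D"},
]
minsup = 3  # 60% of 5 transactions

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
f1 = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
f2 = [frozenset(c)
      for c in combinations(sorted(i for s in f1 for i in s), 2)
      if support(frozenset(c)) >= minsup]

# Candidate 3-itemsets: unions of two frequent 2-itemsets, kept only if
# every 2-subset is frequent (the Apriori principle).
f2set = set(f2)
c3 = {a | b for a in f2 for b in f2 if len(a | b) == 3
      and all(frozenset(s) in f2set for s in combinations(a | b, 2))}
print(sorted(sorted(c) for c in c3))   # [['A', 'Bread', 'Milk']]
```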
5. Write the algorithm for Frequent itemset generation of the Apriori algorithm
6. How is support counting done using a Hash tree? Explain with example
7. How are candidates generated using Lexicographic ordering ? Explain with example.
8. What is candidate generation and pruning? Explain Fk-1 x F1 and Fk-1 x Fk-1 methods of candidate
generation with examples.
9. What are the factors that affect the computation complexity of the Apriori algorithm? Explain
14. What are the alternative methods for generating frequent items? Explain
15. Explain relationships among frequent, maximal frequent and closed frequent itemsets with diagram
16. Explain the DFS and BFS methods of generating frequent itemsets with examples.
17. What are the two ways in which a transaction data set can be represented? Explain with example
TID   Items
0001  {a,d,e}
0024 {a,b,c,e}
0012 {a,b,d,e}
0031 {a,c,d,e}
0015 {b,c,e}
0022 {b,d,e}
0029 {c,d}
0040 {a,b,c}
0030 {a,d,e}
0038 {a,b,e}
i) Compute the support count for itemsets {e}, {b,d} and {b,d,e}
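The requested support counts can be verified with a direct subset test over the transactions above (a sketch, with TIDs kept as strings):

```python
transactions = {
    "0001": {"a", "d", "e"}, "0024": {"a", "b", "c", "e"},
    "0012": {"a", "b", "d", "e"}, "0031": {"a", "c", "d", "e"},
    "0015": {"b", "c", "e"}, "0022": {"b", "d", "e"},
    "0029": {"c", "d"}, "0040": {"a", "b", "c"},
    "0030": {"a", "d", "e"}, "0038": {"a", "b", "e"},
}

def support_count(itemset):
    # Number of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions.values())

print(support_count({"e"}))          # 8
print(support_count({"b", "d"}))     # 2
print(support_count({"b", "d", "e"}))  # 2
```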
TID  Items
1    {a,b,d,e}
2 {b,c,d}
3 {a,b,d,e}
4 {a,c,d,e}
5 {b,c,d,e}
6 {b,d,e}
7 {c,d}
8 {a,b,c}
9 {a,d,e}
10 {b,d}
i) What is the maximum number of association rules that can be extracted from this data?
ii) What is the maximum number of frequent itemsets that can be extracted (including null set)?
iii) Generate candidate 1-itemset, 2-itemsets and 3-itemsets assuming a support threshold of 60%
using Apriori algorithm
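Parts i)-iii) can be cross-checked in a few lines. With d distinct items, the maximum number of rules is 3^d - 2^(d+1) + 1 and the maximum number of itemsets (including the null set) is 2^d; a 60% threshold over 10 transactions means a support count of at least 6 (the helper below is our sketch, not the textbook pseudocode):

```python
from itertools import combinations

transactions = [
    {"a", "b", "d", "e"}, {"b", "c", "d"}, {"a", "b", "d", "e"},
    {"a", "c", "d", "e"}, {"b", "c", "d", "e"}, {"b", "d", "e"},
    {"c", "d"}, {"a", "b", "c"}, {"a", "d", "e"}, {"b", "d"},
]
items = sorted(set().union(*transactions))
d = len(items)                             # 5 distinct items: a..e
max_rules = 3**d - 2**(d + 1) + 1          # 180
max_itemsets = 2**d                        # 32, including the empty set

minsup = 6  # 60% of 10 transactions
def count(itemset):
    return sum(set(itemset) <= t for t in transactions)

f1 = [c for c in combinations(items, 1) if count(c) >= minsup]
f2 = [c for c in combinations(sorted(i for (i,) in f1), 2) if count(c) >= minsup]
# The only candidate 3-itemset, {b,d,e}, is pruned because {b,e} is infrequent.
print(max_rules, max_itemsets, f1, f2)
```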
i. Equivalence classes
UNIT-IV
2. How are frequent itemsets generated using FP-Tree Algorithm? Explain with example.
5. For the following tables, calculate the Interest Factor, ø-correlation coefficient and IS Measure

           p     p̄   Total
 q       880    50     930
 q̄        50    20      70
 Total   930    70    1000

           r     r̄   Total
 s        20    50      70
 s̄        50   880     930
 Total    70   930    1000

6. How can Objective Measures be extended beyond pairs of Binary Variables? Explain with contingency table
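The three measures for question 5 can be cross-checked with one helper over the 2x2 cell counts (function names are ours); note that both tables yield the same ø value even though their supports differ sharply, which is the contrast the question is built around:

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Interest factor, phi-coefficient and IS measure from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01      # row / column marginals
    f0p, fp0 = f01 + f00, f10 + f00
    interest = n * f11 / (f1p * fp1)
    phi = (f11 * f00 - f10 * f01) / sqrt(f1p * f0p * fp1 * fp0)
    is_measure = f11 / sqrt(f1p * fp1)
    return interest, phi, is_measure

print(measures(880, 50, 50, 20))   # {p,q} table
print(measures(20, 50, 50, 880))   # {r,s} table: same phi, very different IS
```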
8. Calculate ø-correlation coefficient, IS Measure, Interest Factor and Confidence for the rule
{Tea} -> {Coffee} for the following table
        Coffee   Coffee̅   Total
Tea       50       30       80
11. For the following contingency tables, compute the support, interest measure and ø-correlation
coefficient for the association pattern {A,B}. Also compute the confidence of the rules A -> B and
B -> A. Is confidence a symmetric measure?
        B    B̄                  B    B̄
A       9    1            A    89    1
Ā       1   89            Ā     1    9
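The values for both tables can be cross-checked with a short sketch (helper names are ours). Note that confidence comes out symmetric here only because each table has f10 = f01; confidence is not a symmetric measure in general:

```python
def table_stats(f11, f10, f01, f00):
    """Support, interest, phi and both rule confidences from a 2x2 table for {A,B}."""
    n = f11 + f10 + f01 + f00
    support = f11 / n
    interest = n * f11 / ((f11 + f10) * (f11 + f01))
    phi = (f11 * f00 - f10 * f01) / (
        ((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00)) ** 0.5)
    conf_a_b = f11 / (f11 + f10)   # confidence of A -> B
    conf_b_a = f11 / (f11 + f01)   # confidence of B -> A
    return support, interest, phi, conf_a_b, conf_b_a

print(table_stats(9, 1, 1, 89))    # first table
print(table_stats(89, 1, 1, 9))    # second table
```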
                 Buy Exercise Machine
Buy HDTV        Yes     No    Total
Yes              99     81     180
No               54     66     120

Compute:
ii) ø-correlation coefficient, IS Measure, Interest Factor when Customer Group=Working Adult
iii) Confidence for the rules {HDTV=Yes} -> {Exercise Machine=Yes} and
{HDTV=No} -> {Exercise Machine=Yes} when Customer Group=College Students and when
Customer Group=Working Adults
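The per-customer-group tables referenced by parts ii) and iii) are not reproduced in this extract; the sketch below computes the rule confidences for the pooled table above only:

```python
# Pooled contingency table: (HDTV purchase, Exercise Machine purchase) -> count
f = {("Yes", "Yes"): 99, ("Yes", "No"): 81, ("No", "Yes"): 54, ("No", "No"): 66}

def confidence(hdtv):
    """Confidence of {HDTV=hdtv} -> {Exercise Machine=Yes}."""
    covered = f[(hdtv, "Yes")] + f[(hdtv, "No")]
    return f[(hdtv, "Yes")] / covered

print(confidence("Yes"))   # 99/180 = 0.55
print(confidence("No"))    # 54/120 = 0.45
```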
              A=1   A=0
C=0   B=1      0    15
      B=0     15    30
C=1   B=1      5     0
      B=0      0    15
20. What do you mean by Timing Constraints with regard to Sequential Patterns?
21. Draw contingency tables for the rules {b} -> {c} and {a} -> {d} using the transactions shown
below:
UNIT-V
5. With respect to the K-Means algorithm, explain how points are assigned to the closest centroid and
how centroids are recomputed to minimize SSE
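The two steps this question asks about can be sketched as plain Lloyd iterations (a minimal sketch assuming Euclidean distance and caller-supplied initial centroids; not the textbook pseudocode):

```python
def kmeans(points, centroids, iters=20):
    """Assign each point to its closest centroid (squared Euclidean distance),
    then move each centroid to the mean of its cluster, the position that
    minimizes that cluster's SSE. Repeat for a fixed number of iterations."""
    k = len(centroids)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        centroids = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    sse = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
              for j, cl in enumerate(clusters) for p in cl)
    return centroids, sse

# Two well-separated 2-D clusters, seeded with one point from each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, sse = kmeans(pts, [(0, 0), (10, 10)])
print(cents, sse)   # cluster means (1/3, 1/3) and (31/3, 31/3); SSE = 8/3
```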
6. In K-Means Algorithm, How are initial centroids chosen? Explain with diagram
7. Give a table listing common choices for Proximity, Centroids and Objective Functions with
respect to K-Means Algorithm
8. Comment on Time and Space Complexity of K-Means Algorithm
12. Write and explain Basic Agglomerative Hierarchical Clustering Algorithm. How is proximity
between clusters defined?
13. Comment on the Time and Space Complexity of Agglomerative Hierarchical Clustering
algorithm.
16. Explain the Single Link or MIN method of Hierarchical Clustering with example
17. Explain Complete Link or MAX method of Hierarchical Clustering with example
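The MIN and MAX inter-cluster proximities behind questions 16 and 17 reduce to a one-line difference, sketched here for two small hypothetical clusters:

```python
def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2):
    # MIN: distance between the two closest points of the clusters
    return min(euclid(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    # MAX: distance between the two farthest points of the clusters
    return max(euclid(p, q) for p in c1 for q in c2)

c1 = [(0, 0), (0, 2)]
c2 = [(3, 0), (5, 0)]
print(single_link(c1, c2))    # closest pair (0,0)-(3,0) -> 3.0
print(complete_link(c1, c2))  # farthest pair (0,2)-(5,0) -> sqrt(29)
```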
19. How are points classified as core, border and noise points according to the center-based density
approach in the DBSCAN algorithm? Explain with diagrams and example