Ritajit Majumdar
Mtech 1st semester
Computer Science and Engineering
Class Roll: 1
Exam Roll: 97/CSM/140001
Registration No: 0029169 of 20082009
March 26, 2015
Contents

1 Problem 1
  1.1 Theoretical Support
  1.2 Algorithm
  1.3 Result

2 Problem 2
  2.1 Theoretical Support
  2.2 Algorithm
  2.3 Result

3 Problem 3
  3.1 Theoretical Support
  3.2 Algorithm
  3.3 Result

4 Problem 4
  4.1 Theoretical Support
  4.2 K-means Algorithm
  4.3 Result

5 Discussion
Problem 1
Consider the following table, based on the experimental results of the Roth cancer research lab. The table consists of 8 genes with 3 attributes (viz., GO attributes, Expression level and Pseudo gene found) and one class label: cancer mediating. Write a program which can perform the following tasks:
a) Find out the test attribute and draw the decision tree.
b) Find out the class label of the gene with GO attribute > 40, Expression level = medium and Pseudo gene found = No.
c) Construct the classifier that can predict the class label of the unknown genes.
GeneID   GO attributes   Cancer Mediating
g1       <= 30           No
g2       <= 30           No
g3       31...40         Yes
g4       > 40            Yes
g5       > 40            Yes
g6       > 40            No
g7       31...40         Yes
g8       <= 30           No

1.1 Theoretical Support
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Here the decision tree is built using the information-gain criterion, and the classifier is a naive Bayes classifier; both are described below.
1.2 Algorithm

Let S be a set of s data samples. Suppose the class label attribute has m distinct values, where Ci denotes the ith class label. The classes are C1 to Cm, and si is the number of samples of S in class Ci.
Hence, the expected information needed to classify a given sample is given by

I(s1, s2, ..., sm) = -Σi pi log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci, estimated as pi = si/s.
Let attribute A have v distinct values {a1, a2, ..., av}. This attribute divides the entire data S into v subsets S1, S2, ..., Sv, where Sj contains those samples in S that have value aj. Let sij be the number of samples of class Ci in subset Sj. Hence the entropy based on the partitioning into subsets by A is

E(A) = Σj ((s1j + ... + smj) / s) · I(s1j, ..., smj)

where

I(s1j, ..., smj) = -Σi pij log2(pij),  with pij = sij / |Sj|.

Hence the gain of A is

Gain(A) = I(s1, s2, ..., sm) - E(A)
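As a concrete illustration, the gain computation above can be sketched in Python (the function names are illustrative). The class counts below come from the Cancer Mediating column of the table (4 Yes, 4 No), partitioned by the three GO-attributes values:

```python
from math import log2

def entropy(counts):
    """I(s1, ..., sm) = -sum(pi * log2(pi)) over the non-empty classes."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(class_counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A); partitions holds the per-value
    class-count lists [s1j, ..., smj] induced by attribute A."""
    total = sum(class_counts)
    e_a = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(class_counts) - e_a

# Cancer Mediating column: 4 Yes, 4 No; GO attributes splits the samples as
# "<=30" -> (0 Yes, 3 No), "31...40" -> (2 Yes, 0 No), ">40" -> (2 Yes, 1 No)
print(gain([4, 4], [[0, 3], [2, 0], [2, 1]]))  # about 0.656 bits
```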
Bayes' Theorem

Let X be a data sample whose class label (C) is unknown. Let H be the hypothesis that the data sample X belongs to class Ci. For the classification problem we want to determine P(H|X), i.e. the probability that the hypothesis holds given the data sample X. P(H|X) is known as the posterior probability of H conditioned on X.

P(H|X) = P(X|H) · P(H) / P(X)

1. Each data sample is represented by an n-dimensional feature vector X = (x1, x2, ..., xn) with n attributes.
2. Suppose there are m classes C1, C2, ..., Cm. According to the theorem, an unknown sample X belongs to class Ci iff

P(Ci|X) > P(Cj|X) for all j ≠ i, 1 ≤ j ≤ m

where P(Ci|X) = P(X|Ci) · P(Ci) / P(X).
3. P(X) is constant for all classes. Hence only P(X|Ci) · P(Ci) needs to be maximised, with P(Ci) = si/s.
4. Under the naive independence assumption, P(X|Ci) = Πk P(xk|Ci). Each factor of the product is estimated as follows: if Ak is categorical, then P(xk|Ci) = sik/si, where sik is the number of samples of class Ci having value xk for Ak; if Ak is continuous-valued, it is typically assumed to be Gaussian, so that

P(xk|Ci) = (1 / (√(2π) σCi)) · exp(−(xk − μCi)² / (2σCi²))

where μCi and σCi are the mean and standard deviation of attribute Ak over the samples of class Ci.
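A minimal naive Bayes classifier along these lines can be sketched as follows (function names are illustrative; only the GO-attributes column of the table is used here, as a single categorical feature):

```python
from collections import Counter, defaultdict

def nb_train(samples, labels):
    """Estimate P(Ci) = si/s and, per categorical attribute, P(xk|Ci) = sik/si."""
    s = len(labels)
    prior = Counter(labels)               # si per class
    cond = defaultdict(Counter)           # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for k, v in enumerate(x):
            cond[(c, k)][v] += 1

    def classify(x):
        # maximise P(X|Ci) * P(Ci); P(X) is the same constant for every class
        scores = {}
        for c, si in prior.items():
            p = si / s
            for k, v in enumerate(x):
                p *= cond[(c, k)][v] / si
            scores[c] = p
        return max(scores, key=scores.get)
    return classify

# train on the GO-attributes column of the table above
classify = nb_train(
    [["<=30"], ["<=30"], ["31...40"], [">40"], [">40"], [">40"], ["31...40"], ["<=30"]],
    ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"])
print(classify([">40"]))  # -> Yes: P(>40|Yes)P(Yes) = 0.25 beats P(>40|No)P(No) = 0.125
```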
1.3 Result
Problem 2
TID   Items
1     g1, g2, g3, g5
2     g2, g4
3     g2, g3
4     g1, g2, g4
5     g1, g3
6     g2, g3
7     g1, g3
8     g1, g2, g3
9     g1, g2, g5
Find out the frequent item sets for support count 2. Also find out the set of significant rules from the frequent item sets (a confidence of at least 70% signifies a significant rule).
2.1 Theoretical Support

The Apriori algorithm is the original algorithm for mining frequent item sets for Boolean association rules, proposed by R. Agrawal and R. Srikant in 1994. Its core principles are that every subset of a frequent item set is itself frequent, and every superset of an infrequent item set is infrequent. It is regarded as the classic data mining algorithm for this task [6].
2.2 Algorithm

The algorithm is given in pseudocode; this formulation was obtained from Wikipedia.

L1 ← {large 1-itemsets}
k ← 2
while Lk−1 ≠ ∅ do
    Ck ← {a ∪ {b} | a ∈ Lk−1 ∧ b ∈ ∪Lk−1 ∧ b ∉ a}
    for transactions t ∈ T do
        Ct ← {c | c ∈ Ck ∧ c ⊆ t}
        for candidates c ∈ Ct do
            count[c] ← count[c] + 1
        end
    end
    Lk ← {c | c ∈ Ck ∧ count[c] ≥ ε}
    k ← k + 1
end
return ∪k Lk

Algorithm 1: Apriori(T, ε)
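A small Python sketch of this level-wise procedure follows. The transaction list T below is reconstructed from the support counts reported in the Result section, so it should be treated as an assumption:

```python
def apriori(transactions, min_support):
    """Level-wise search: build L1 from single-item counts, then join the
    previous level with itself to form candidates, keeping those with
    enough support (the Apriori principle prunes everything else)."""
    items = sorted({i for t in transactions for i in t})
    support, level = {}, []
    for i in items:
        s = sum(1 for t in transactions if i in t)
        if s >= min_support:
            support[(i,)] = s
            level.append((i,))
    while level:
        # join step: unions of two frequent itemsets that differ in one item
        candidates = sorted({tuple(sorted(set(a) | set(b)))
                             for a in level for b in level
                             if len(set(a) | set(b)) == len(a) + 1})
        level = []
        for c in candidates:
            s = sum(1 for t in transactions if set(c) <= set(t))
            if s >= min_support:
                support[c] = s
                level.append(c)
    return support

# transactions reconstructed from the support counts in the Result section
T = [["g1", "g2", "g3", "g5"], ["g2", "g4"], ["g2", "g3"], ["g1", "g2", "g4"],
     ["g1", "g3"], ["g2", "g3"], ["g1", "g3"], ["g1", "g2", "g3"], ["g1", "g2", "g5"]]
freq = apriori(T, 2)
print(freq[("g1", "g2", "g3")], freq[("g1", "g2", "g5")])  # both 2
# confidence of {g5} -> {g1, g2} = support(g1,g2,g5) / support(g5)
print(freq[("g1", "g2", "g5")] / freq[("g5",)])            # 1.0, a significant rule
```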
2.3 Result

The table L1:
g1 : 6
g2 : 7
g3 : 6
g4 : 2
g5 : 2
C2:
g1, g2 : 4
g1, g3 : 4
g1, g4 : 1
g1, g5 : 2
g2, g3 : 4
g2, g4 : 2
g2, g5 : 2
g3, g4 : 0
g3, g5 : 1
g4, g5 : 0
The table L2:
g1, g2 : 4
g1, g3 : 4
g1, g5 : 2
g2, g3 : 4
g2, g4 : 2
g2, g5 : 2
C3:
g1, g2, g3 : 2
g1, g2, g5 : 2
g1, g2, g4 : 1
g1, g3, g5 : 1
g1, g2, g3, g4 : 0
g1, g2, g3, g5 : 1
g1, g2, g4, g5 : 0
g2, g3, g4 : 0
g2, g3, g5 : 1
g2, g4, g5 : 0
The table L3:
g1, g2, g3 : 2
g1, g2, g5 : 2

C4:
g1, g2, g3, g5 : 1

The table L4:

The frequent item set:
The table L3:
g1, g2, g3 : 2
g1, g2, g5 : 2
Derived rule set:
{g1} -> {g2, g3} : 0.33
{g2} -> {g1, g3} : 0.29
{g3} -> {g1, g2} : 0.33
{g2, g3} -> {g1} : 0.5
{g1, g3} -> {g2} : 0.5
{g1, g2} -> {g3} : 0.5
{g1} -> {g2, g5} : 0.33
{g2} -> {g1, g5} : 0.29
{g5} -> {g1, g2} : 1.0
{g2, g5} -> {g1} : 1.0
{g1, g5} -> {g2} : 1.0
{g1, g2} -> {g5} : 0.5
Problem 3
Design a backpropagation learning algorithm for a 3-2-1 feed-forward neural network. The given training set is (1, 0, 1) → 1. Discover the class label of all remaining patterns.
3.1 Theoretical Support

The backpropagation algorithm [5] made learning in multi-layer networks work far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.
3.2 Algorithm

Initialize all weights with small random numbers (in the program, random numbers between 0 and 1 are used).
repeat
    for every pattern in the training set do
        Present the pattern to the network
        for each layer in the network do
            for every node in the layer do
                1. Calculate the weighted sum of the inputs to the node.
                2. Add the threshold to the sum. The net input is Ij = Σi wij Oi + θj.
                3. Calculate the activation of the node; typically the sigmoid function is used, giving the output of each node as Oj = 1 / (1 + e^(−Ij)).
            end
        end
        for every node in the output layer do
            Calculate the error signal as Errj = Oj (1 − Oj)(Tj − Oj), where Tj is the true output.
        end
        for all hidden layers do
            for every node in the layer do
                1. Calculate the node's error signal as Errj = Oj (1 − Oj) Σk Errk wjk, where wjk is the weight of the connection from unit j to unit k in the next higher layer and Errk is the error of unit k.
                2. Update each node's weights in the network.
            end
        end
        Calculate the global error function.
    end
until (the number of iterations reaches the specified maximum) OR (the error function falls below the specified threshold);

Algorithm 2: Backpropagation
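The steps above can be sketched for the 3-2-1 network with NumPy; the learning rate and iteration count below are illustrative choices, not the values used in the original program:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 3-2-1 network: weights and thresholds initialised with random numbers in [0, 1)
w1, b1 = rng.random((3, 2)), rng.random(2)   # input layer -> hidden layer
w2, b2 = rng.random((2, 1)), rng.random(1)   # hidden layer -> output node
x, t = np.array([1.0, 0.0, 1.0]), 1.0        # the given training pattern and target
lr = 0.5                                     # illustrative learning rate

for _ in range(1000):
    # forward pass: Ij = sum_i(wij * Oi) + theta_j, Oj = sigmoid(Ij)
    h = sigmoid(x @ w1 + b1)
    o = sigmoid(h @ w2 + b2)
    # backward pass: output error Errj = Oj(1 - Oj)(Tj - Oj),
    # hidden error Errj = Oj(1 - Oj) * sum_k(Errk * wjk)
    err_o = o * (1 - o) * (t - o)
    err_h = h * (1 - h) * (w2[:, 0] * err_o)
    # update weights and thresholds by the error signal times the activation
    w2 += lr * np.outer(h, err_o); b2 += lr * err_o
    w1 += lr * np.outer(x, err_h); b1 += lr * err_h

print(float(o[0]))  # the output converges towards the target 1
```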
3.3 Result
Problem 4
Given the following table of gene sequences, implement k-means, k-medoids and fuzzy c-means clustering algorithms to generate the clusters. Tune the K or C value from 2 to 5. Consider index = (average intra-cluster distance) / (1 + average inter-cluster distance). Find out the best result.
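One reasonable reading of this index, using average pairwise distances within and between clusters, can be sketched as follows (the cluster layout in the example is an illustrative assumption):

```python
from itertools import combinations

def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def validity_index(clusters):
    """index = (average intra-cluster distance) / (1 + average inter-cluster distance).
    Intra: distances between points within the same cluster;
    inter: distances between points in different clusters."""
    intra = [euclid(p, q) for c in clusters for p, q in combinations(c, 2)]
    inter = [euclid(p, q) for c1, c2 in combinations(clusters, 2)
             for p in c1 for q in c2]
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(intra) / (1 + avg(inter))

# two tight, well-separated clusters give a small index (lower is better)
print(validity_index([[(0.0, 0.0), (0.0, 1.0)], [(9.0, 9.0), (9.0, 10.0)]]))
```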
4.1 Theoretical Support

Clustering can be considered the most important unsupervised learning problem; it deals with finding a structure in a collection of unlabelled data. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters [1].
K-means [2] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The next step is to take each point belonging to the given data set and associate it to the nearest centroid. Then, for each cluster, we compute the mean coordinate and take it as the new cluster centre. This process continues until the cluster centres no longer change.
4.2 K-means Algorithm
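The procedure described in Section 4.1 can be sketched in Python, assuming Euclidean distance and randomly chosen initial centroids (the sample points below are illustrative):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    recompute each centroid as the mean of its cluster, repeat until stable."""
    random.seed(seed)
    centroids = random.sample(points, k)      # k initial centroids, chosen randomly
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum((a - b) ** 2
                          for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                  # cluster centres change no more
            break
        centroids = new
    return centroids, clusters

# illustrative 2-D points forming two well-separated groups
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
cents, cls = kmeans(pts, 2)
print(sorted(len(c) for c in cls))  # the two groups of three points are recovered
```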
4.3 Result

The initial points are chosen randomly. The number of clusters is then varied from k = 2 to k = 5. Snapshots are shown for k = 2 and k = 3.
Discussion
All the programs were developed using Python 2.7.3 on Macintosh OS. The programs should run smoothly on Linux and Windows OS too, though they were not checked in those environments. However, some programs will fail to run in Python 3, since some of the commands used have a different syntax in Python 3.
References
[1] A Tutorial on Clustering Algorithms, http://home.deib.polimi.it/matteucc/Clustering/tutorial html/
[2] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967.
[3] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, 3:32-57, 1973.
[4] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[5] Rumelhart, Hinton and Williams, "Learning Representations by Back-propagating Errors", Nature, 323:533-536, 9 October 1986.
[6] Jiao Yabing, "Research of an Improved Apriori Algorithm in Data Mining Association Rules", International Journal of Computer and Communication Engineering, Vol. 2, No. 1, January 2013.