
Bioinformatics Assignment

Ritajit Majumdar
Mtech 1st semester
Computer Science and Engineering
Class Roll: 1
Exam Roll: 97/CSM/140001
Registration No: 0029169 of 2008-2009
March 26, 2015

Contents

1 Problem 1
  1.1 Theoretical Support
  1.2 Algorithm
  1.3 Result

2 Problem 2
  2.1 Theoretical Support
  2.2 Algorithm
  2.3 Result

3 Problem 3
  3.1 Theoretical Support
  3.2 Algorithm
  3.3 Result

4 Problem 4
  4.1 Theoretical Support
  4.2 K-means Algorithm
  4.3 Result

5 Discussion

Problem 1

Consider the following table, based on the experimental results of the Roth cancer-research lab. The table
consists of 8 genes with 3 attributes (viz., GO attributes, Expression level and Pseudo gene found)
and one class label: cancer mediating. Write a program which can perform the following tasks:
a) Find out the test attribute and draw the decision tree.
b) Find out the class label of the gene with GO attribute > 40, Expression level = medium and
Pseudo gene found = No.
c) Construct a classifier that can predict the class label of unknown genes.
Gene-ID | GO attributes | Expression Level | Pseudo Gene found | Cancer Mediating
g1      | <=30          | High             | No                | No
g2      | <=30          | High             | No                | No
g3      | 31...40       | High             | No                | Yes
g4      | >40           | Medium           | No                | Yes
g5      | >40           | Low              | Yes               | Yes
g6      | >40           | Low              | Yes               | No
g7      | 31...40       | Low              | Yes               | Yes
g8      | <=30          | Medium           | No                | No

1.1 Theoretical Support

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on
applying Bayes' theorem with strong (naive) independence assumptions between the features. For this
problem, the decision tree of part (a) is built using the information-gain measure, and the classifier
of part (c) is a naive Bayes classifier; both are described below.

1.2 Algorithm

Let S be a set of s data samples. Suppose the class-label attribute has m distinct values, so the
classes are C1 to Cm, and let si be the number of samples of S in class Ci.
Hence, the expected information needed to classify a given sample is given by

    I(s1, s2, ..., sm) = - Σ_i p_i log2(p_i)

where p_i is the probability that an arbitrary sample belongs to class Ci and is estimated as
p_i = s_i / s.
Let attribute A have v distinct values {a1, a2, ..., av}. This attribute divides the entire data S into v
subsets S1, S2, ..., Sv, where Sj contains those samples of S that have value aj for A. Let sij be the
number of samples of class Ci in subset Sj. Hence the entropy based on the partitioning into subsets by
A is

    E(A) = Σ_j [(s1j + ... + smj) / s] · I(s1j, ..., smj)

where

    I(s1j, ..., smj) = - Σ_i p_ij log2(p_ij),   p_ij = s_ij / |Sj|

Hence the gain of A is

    Gain(A) = I(s1, s2, ..., sm) - E(A)
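As a check on the derivation, the gain computation can be sketched directly in Python on the table of
Problem 1 (the helper names `info` and `gain` are ours, not taken from the assignment program):

```python
from math import log2

# The eight training genes from the Problem 1 table:
# (GO attribute, Expression level, Pseudo gene found, Cancer mediating)
data = [
    ("<=30",    "High",   "No",  "No"),
    ("<=30",    "High",   "No",  "No"),
    ("31...40", "High",   "No",  "Yes"),
    (">40",     "Medium", "No",  "Yes"),
    (">40",     "Low",    "Yes", "Yes"),
    (">40",     "Low",    "Yes", "No"),
    ("31...40", "Low",    "Yes", "Yes"),
    ("<=30",    "Medium", "No",  "No"),
]

def info(rows):
    """I(s1, ..., sm): expected information of the class-label column."""
    labels = [r[-1] for r in rows]
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain(rows, attr):
    """Gain(A) = I(s1, ..., sm) - E(A) for the attribute at column index attr."""
    subsets = [[r for r in rows if r[attr] == v]
               for v in set(r[attr] for r in rows)]
    e = sum(len(s) / len(rows) * info(s) for s in subsets)  # E(A)
    return info(rows) - e

gains = {name: gain(data, i) for i, name in
         enumerate(["GO attributes", "Expression Level", "Pseudo Gene found"])}
```

With this table, GO attributes yields the largest gain (about 0.66, against roughly 0.06 and 0.05 for
the other two attributes), which is consistent with choosing it as the test attribute in part (a).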

Bayes' Theorem
Let X be a data sample whose class label is unknown. Let H be the hypothesis that the data sample
X belongs to class Ci. For a classification problem we want to determine P(H | X), i.e. the probability
that the hypothesis holds given the data sample X. P(H | X) is known as the posterior probability of
H conditioned on X:

    P(H | X) = P(X | H) · P(H) / P(X)

1. Each data sample is represented by an n-dimensional feature vector X = (x1, x2, ..., xn) with n
attributes.
2. Suppose there are m classes C1, C2, ..., Cm. According to the theorem, an unknown sample X
belongs to class Ci iff
    P(Ci | X) > P(Cj | X)  for all j ≠ i, 1 ≤ j ≤ m, where
    P(Ci | X) = P(X | Ci) · P(Ci) / P(X)
3. P(X) is constant for all classes. Hence only P(X | Ci) · P(Ci) needs to be maximised, with
P(Ci) = s_i / s.
4. Under the naive independence assumption, P(X | Ci) = Π_k P(x_k | Ci). Each factor of the product
can be estimated in the following way:
    if Ak is categorical, then compute P(x_k | Ci) = s_ik / s_i, where s_ik is the number of
    samples of class Ci having value x_k for Ak;
    if Ak is continuous, then apply a Gaussian distribution:
        P(x_k | Ci) = (1 / (√(2π) σ_Ci)) exp( -(x_k - μ_Ci)² / (2 σ_Ci²) )
5. Thus the unknown sample X is assigned the class label Ci iff
    P(Ci | X) > P(Cj | X)  for all j ≠ i, 1 ≤ j ≤ m
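The five steps above can be applied to the query gene of part (b): GO attribute > 40, Expression level
= Medium, Pseudo gene found = No. A minimal sketch, with the training table hard-coded and our own
helper name `nb_classify` (all attributes here are categorical, so step 4's categorical case applies):

```python
# Training table from Problem 1: (GO, Expression, Pseudo, class label)
data = [
    ("<=30",    "High",   "No",  "No"),
    ("<=30",    "High",   "No",  "No"),
    ("31...40", "High",   "No",  "Yes"),
    (">40",     "Medium", "No",  "Yes"),
    (">40",     "Low",    "Yes", "Yes"),
    (">40",     "Low",    "Yes", "No"),
    ("31...40", "Low",    "Yes", "Yes"),
    ("<=30",    "Medium", "No",  "No"),
]

def nb_classify(sample):
    """Return the class Ci maximising P(X|Ci)·P(Ci)."""
    best, best_score = None, -1.0
    for c in set(r[-1] for r in data):
        rows = [r for r in data if r[-1] == c]
        score = len(rows) / len(data)            # P(Ci) = si / s
        for k, xk in enumerate(sample):          # P(X|Ci) = prod_k P(xk|Ci)
            score *= sum(1 for r in rows if r[k] == xk) / len(rows)
        if score > best_score:
            best, best_score = c, score
    return best

label = nb_classify((">40", "Medium", "No"))     # -> "Yes"
```

Here P(X|Yes)·P(Yes) = (2/4)(1/4)(2/4)·(4/8) ≈ 0.031 beats P(X|No)·P(No) = (1/4)(1/4)(3/4)·(4/8)
≈ 0.023, so the query gene is classified as cancer mediating.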

1.3 Result

First the snapshot of decision tree generation is provided.

The snapshot of the result has been provided below.

Problem 2

Consider the following table


Set 1 | g1, g2, g5
Set 2 | g2, g4
Set 3 | g2, g3
Set 4 | g1, g2, g4
Set 5 | g1, g3
Set 6 | g2, g3
Set 7 | g1, g3
Set 8 | g1, g2, g3, g5
Set 9 | g1, g2, g3

Find out the frequent item sets for support count 2. Also find out the set of significant rules from the
frequent item sets (a confidence level of at least 70% signifies a significant rule).

2.1 Theoretical Support

The Apriori algorithm is the original algorithm for mining frequent item sets for Boolean association
rules, proposed by R. Agrawal and R. Srikant in 1994. Its core principles are that every subset of a
frequent item set is itself frequent, and every superset of an infrequent item set is infrequent. It is
regarded as the most typical data mining algorithm[6].

2.2 Algorithm

The algorithm is provided in pseudo-code format, following the version given on Wikipedia.

L1 ← {large 1-itemsets}
k ← 2
while L(k-1) ≠ ∅ do
    Ck ← {a ∪ {b} | a ∈ L(k-1) ∧ b ∈ ∪ L(k-1) ∧ b ∉ a}
    for transactions t ∈ T do
        Ct ← {c | c ∈ Ck ∧ c ⊆ t}
        for candidates c ∈ Ct do
            count[c] ← count[c] + 1
        end
    end
    Lk ← {c | c ∈ Ck ∧ count[c] ≥ ε}
    k ← k + 1
end
return ∪k Lk
Algorithm 1: Apriori(T, ε)
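A compact Python rendering of the pseudo-code, run on the nine transaction sets of this problem. As a
simplification of the join shown above, candidates are generated by extending each (k-1)-itemset with
one frequent single item; the function and variable names are our own:

```python
from itertools import combinations

# The nine gene sets of Problem 2; the minimum support count is 2.
transactions = [
    {"g1", "g2", "g5"}, {"g2", "g4"}, {"g2", "g3"},
    {"g1", "g2", "g4"}, {"g1", "g3"}, {"g2", "g3"},
    {"g1", "g3"}, {"g1", "g2", "g3", "g5"}, {"g1", "g2", "g3"},
]

def apriori(T, min_support):
    """Return a dict mapping each frequent itemset to its support count."""
    items = sorted(set().union(*T))
    L = [frozenset([i]) for i in items
         if sum(1 for t in T if i in t) >= min_support]
    frequent = {s: sum(1 for t in T if s <= t) for s in L}
    k = 2
    while L:
        # Candidate k-itemsets: extend each (k-1)-itemset by one other item
        C = {a | {b} for a in L for b in items if b not in a}
        counts = {c: sum(1 for t in T if c <= t) for c in C if len(c) == k}
        L = [c for c, n in counts.items() if n >= min_support]
        frequent.update((c, counts[c]) for c in L)
        k += 1
    return frequent

freq = apriori(transactions, 2)

# Association rules with confidence >= 0.7 derived from the 3-itemsets
rules = [(lhs, s - lhs, freq[s] / freq[lhs])
         for s in freq if len(s) == 3
         for r in (1, 2)
         for lhs in map(frozenset, combinations(s, r))
         if freq[s] / freq[lhs] >= 0.7]
```

This reproduces the tables L1-L3 of the result below and the three significant rules {g5} → {g1, g2},
{g2, g5} → {g1} and {g1, g5} → {g2}, each with confidence 1.0.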

2.3 Result

The table L1:
g1 | 6
g2 | 7
g3 | 6
g4 | 2
g5 | 2

C2:

g1, g2 : 4
g1, g3 : 4
g1, g4 : 1
g1, g5 : 2
g2, g3 : 4
g2, g4 : 2
g2, g5 : 2
g3, g4 : 0
g3, g5 : 1
g4, g5 : 0

The table L2:
g1, g2 | 4
g1, g3 | 4
g1, g5 | 2
g2, g3 | 4
g2, g4 | 2
g2, g5 | 2

C3:

g1, g2, g3 : 2
g1, g2, g5 : 2
g1, g2, g4 : 1
g1, g3, g5 : 1
g1, g2, g3, g4 : 0
g1, g2, g3, g5 : 1
g1, g2, g4, g5 : 0
g2, g3, g4 : 0
g2, g3, g5 : 1
g2, g4, g5 : 0

The table L3:
g1, g2, g3 | 2
g1, g2, g5 | 2

C4:

g1, g2, g3, g5 : 1

The table L4 is empty (no 4-itemset reaches the support count).
The frequent item sets (the table L3):
g1, g2, g3 | 2
g1, g2, g5 | 2

Derived rule set:

{g1} → {g2, g3} : 0.33
{g2} → {g1, g3} : 0.29
{g3} → {g1, g2} : 0.33
{g2, g3} → {g1} : 0.5
{g1, g3} → {g2} : 0.5
{g1, g2} → {g3} : 0.5
{g1} → {g2, g5} : 0.33
{g2} → {g1, g5} : 0.29
{g5} → {g1, g2} : 1.0
{g2, g5} → {g1} : 1.0
{g1, g5} → {g2} : 1.0
{g1, g2} → {g5} : 0.5

Significant rule set:

{g5} → {g1, g2}
{g2, g5} → {g1}
{g1, g5} → {g2}

Problem 3

Design a backpropagation learning algorithm for a 3-2-1 feedforward neural network. The given training pattern is (1,0,1) → 1. Discover the class labels of all remaining patterns.

3.1 Theoretical Support

Backpropagation, an abbreviation for "backward propagation of errors", is a common method of
training artificial neural networks, used in conjunction with an optimization method such as gradient
descent. The backpropagation algorithm was originally introduced in the 1970s, but its importance
wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald
Williams[5]. That paper describes several neural networks where backpropagation works far faster
than earlier approaches to learning, making it possible to use neural nets to solve problems which
had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in
neural networks.

3.2 Algorithm

Initialize all weights with small random numbers (in the program, random numbers between 0
and 1 are used).
repeat
    for every pattern in the training set do
        Present the pattern to the network
        for each layer in the network do
            for every node in the layer do
                1. Calculate the weighted sum of the inputs to the node.
                2. Add the threshold to the sum: the net input is Ij = Σi wij Oi + θj.
                3. Calculate the activation for the node; typically the sigmoid function is
                   used to get the output of each node: Oj = 1 / (1 + e^(-Ij)).
            end
        end
        for every node in the output layer do
            Calculate the error signal as Errj = Oj (1 - Oj)(Tj - Oj), where Tj is the true
            output.
        end
        for all hidden layers do
            for every node in the layer do
                1. Calculate the node's error signal as Errj = Oj (1 - Oj) Σk Errk wjk, where
                   wjk is the weight of the connection from unit j to unit k (in the next
                   higher layer) and Errk is the error of unit k.
                2. Update each node's weights in the network.
            end
        end
        Calculate the global error function.
    end
until (number of iterations ≥ specified maximum) OR (error function ≤ specified threshold);
Algorithm 2: Backpropagation
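A small numeric sketch of these steps for the 3-2-1 network, trained only on the given pattern
(1,0,1) → 1. The learning rate, random seed and iteration count are our own choices, not taken from
the assignment program:

```python
import math
import random

random.seed(1)

SIZES = [3, 2, 1]   # 3-2-1 feedforward network
ETA = 0.5           # learning rate (an assumption; not specified in the text)

# w[l][j][i]: weight from unit i of layer l to unit j of layer l+1;
# b[l][j]: threshold of unit j of layer l+1. All start as random numbers in [0, 1).
w = [[[random.random() for _ in range(SIZES[l])] for _ in range(SIZES[l + 1])]
     for l in range(2)]
b = [[random.random() for _ in range(SIZES[l + 1])] for l in range(2)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x):
    """Return the activations of every layer for input pattern x."""
    acts = [list(x)]
    for l in range(2):
        acts.append([sigmoid(sum(w[l][j][i] * acts[l][i]
                                 for i in range(SIZES[l])) + b[l][j])
                     for j in range(SIZES[l + 1])])
    return acts

def train(x, target, iters=2000):
    for _ in range(iters):
        acts = forward(x)
        o = acts[2][0]
        err_out = [o * (1 - o) * (target - o)]          # output-layer error signal
        err_hid = [acts[1][j] * (1 - acts[1][j]) * err_out[0] * w[1][0][j]
                   for j in range(2)]                   # hidden-layer error signal
        for l, err in ((1, err_out), (0, err_hid)):     # weight and threshold update
            for j in range(SIZES[l + 1]):
                for i in range(SIZES[l]):
                    w[l][j][i] += ETA * err[j] * acts[l][i]
                b[l][j] += ETA * err[j]

train((1, 0, 1), 1)
output = forward((1, 0, 1))[2][0]   # the output approaches the target 1
```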

3.3 Result

The snapshot of the result has been provided.

Problem 4

Given the following table of gene sequences, implement k-means, k-medoid and fuzzy c-means clustering
algorithms to generate the clusters. Tune the K or C values from 2 to 5. Consider index = (average
intra-cluster distance) / (1 + average inter-cluster distance). Find out the best result.

4.1 Theoretical Support

Clustering can be considered the most important unsupervised learning problem; it deals with finding
a structure in a collection of unlabelled data. A cluster is therefore a collection of objects which are
similar to one another and dissimilar to the objects belonging to other clusters[1].
K-means[2] is one of the simplest unsupervised learning algorithms that solve the well-known
clustering problem. The procedure follows a simple and easy way to classify a given data set through
a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids,
one for each cluster. The next step is to take each point belonging to the data set and associate
it with the nearest centroid. Then for each cluster we compute the mean coordinate and take it as
the new cluster centre. This process continues until the cluster centres no longer change.

4.2 K-means Algorithm

Data: E = {e1, e2, ..., en} (set of entities)
      k (number of clusters)
      MaxIters (limit of iterations)
Result: C = {c1, c2, ..., ck} (set of cluster centroids)
        L = {l(e) | e = 1, 2, ..., n} (set of cluster labels of E)

foreach ci ∈ C do
    ci ← ej ∈ E (random selection)
end
foreach ei ∈ E do
    l(ei) ← argmin_j Distance(ei, cj), j ∈ {1, ..., k}
end
iter ← 0
repeat
    changed ← false
    foreach ci ∈ C do
        UpdateCluster(ci)
    end
    foreach ei ∈ E do
        minDist ← argmin_j Distance(ei, cj), j ∈ {1, ..., k}
        if minDist ≠ l(ei) then
            l(ei) ← minDist
            changed ← true
        end
    end
    iter ← iter + 1
until changed = false or iter ≥ MaxIters;
Algorithm 3: K-Means Algorithm
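The pseudo-code translates almost line for line into Python. Since the assignment's gene-sequence
table is not reproduced here, this sketch runs on hypothetical 2-D points of our own, and also
evaluates the problem's index = (average intra-cluster distance) / (1 + average inter-cluster
distance):

```python
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(E, k, max_iters=100, seed=0):
    """Plain k-means following Algorithm 3: returns (centroids, labels)."""
    rng = random.Random(seed)
    C = rng.sample(E, k)                          # random initial centroids
    labels = [min(range(k), key=lambda j: dist(e, C[j])) for e in E]
    for _ in range(max_iters):
        for j in range(k):                        # UpdateCluster: mean of members
            members = [e for e, l in zip(E, labels) if l == j]
            if members:
                C[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
        changed = False
        for i, e in enumerate(E):                 # re-assign to nearest centroid
            nearest = min(range(k), key=lambda j: dist(e, C[j]))
            if nearest != labels[i]:
                labels[i], changed = nearest, True
        if not changed:
            break
    return C, labels

def index(E, C, labels):
    """avg intra-cluster distance / (1 + avg inter-cluster distance); lower is better."""
    intra = sum(dist(e, C[l]) for e, l in zip(E, labels)) / len(E)
    pairs = [dist(C[i], C[j]) for i in range(len(C)) for j in range(i + 1, len(C))]
    return intra / (1 + sum(pairs) / len(pairs))

# Hypothetical data: two well-separated groups of points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
C, labels = kmeans(points, 2)
```

Varying k from 2 to 5 and keeping the k with the smallest index value gives the "best result" asked
for in the problem statement.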

4.3 Result

The points are chosen randomly. Then the number of clusters is varied from k = 2 to k = 5. Snapshots
are shown for k = 2 and k = 3.

Discussion

All the programs were developed using Python 2.7.3 on Macintosh OS. The programs should run
smoothly on Linux and Windows as well, though they were not tested in those environments. However,
some programs will fail to run under Python 3, since some constructs differ in the later version of
Python.

References
[1] A Tutorial on Clustering Algorithms, http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/
[2] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations",
Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley,
University of California Press, 1:281-297, 1967.
[3] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact
Well-Separated Clusters", Journal of Cybernetics, 3:32-57, 1973.
[4] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press,
New York, 1981.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Representations by Back-propagating
Errors", Nature, 323:533-536, 9 October 1986.
[6] Jiao Yabing, "Research of an Improved Apriori Algorithm in Data Mining Association Rules",
International Journal of Computer and Communication Engineering, Vol. 2, No. 1, January
2013.
