Warsaw University of Technology Warsaw University of Technology Warsaw University of Technology

Warsaw University of Technology
Feature selection and mapping of data from short amino

acid sequences
Stanisław Jankowski, Marek Dwulit, Zbigniew Szymański

Szyma
ICS Research Report 2/10
Institute of Computer Science
Nowowiejska 15 / 19, 00-665

00 665 Warsaw, Poland
2
Feature selection and mapping of data from short amino acid

sequences.
Jankowski S., Dwulit M., Szymański Z.

(sjank@ise.pw.edu.pl, mdwulit@ii.pw.edu.pl, z.szymanski@ii.pw.edu.pl)
Abstract
The report describes two processing methods of short amino acid sequences in order to
enable their classification by the LS-SVM classifier. The amino acid sequences are
represented by the strings, which contain 17 characters (each character denotes one amino
acid). To enable classification of such data by the LS-SVM classifier it is necessary to
perform mapping of character data to real numbers domain and to perform feature selection.
In the first method the features (positions in the amino acid sequence) relevant for
classification are selected by the application of classification trees. Then the mapping
procedure (based on GINI index]) is applied in order to convert character data to real
numbers. Next, the obtained data set is used as input to the LS-SVM classifier.
The second method utilizes the AAindex (amino acid index), which contains values
representing different physicochemical and biological properties of amino acids. Each symbol
from the amino acid sequence is substituted by the corresponding values from the AAindex.
Thereafter the feature selection procedure is applied, which uses simple ranking formula and the
Gram-Schmidt orthogonalization. Next, the obtained data set is used as input to the LS-SVM
classifier.
Key words: feature selection, LS-SVM, AAindex, amino acid sequence

3
Feature selection and mapping of data from short amino acid sequences.
Stanisław Jankowski, Marek Dwulit, Zbigniew Szymański
TABLE OF CONTENTS
1. INTRODUCTION ........................................................................................................................................ 4
2. INPUT DATA ............................................................................................................................................... 5
3. METHOD 1 – CLASSIFICATION TREES, DYNAMIC MAPPING, LS-SVM .................................... 6
CLASSIFICATION TREES ....................................................................................................................................... 6
DYNAMIC MAPPING. ............................................................................................................................................ 8
LEAST-SQUARES SUPPORT VECTOR MACHINE (LS-SVM) .................................................................................. 9
4. RESULTS – METHOD 1........................................................................................................................... 10
NOTATION DESCRIPTION .................................................................................................................................... 10
RESULTS SUMMARY ........................................................................................................................................... 11
PKA ENZYME .................................................................................................................................................... 12
PKB ENZYME .................................................................................................................................................... 13
PKC ENZYME .................................................................................................................................................... 14
CDK ENZYME ................................................................................................................................................... 16
CK2 ENZYME .................................................................................................................................................... 17
MAPK ENZYME ................................................................................................................................................ 19
5. METHOD 2 – AAINDEX MAPPING, RANKING OF MODEL VARIABLES, LS-SVM ................. 21
AAINDEX BASED MAPPING OF SYMBOLS ............................................................................................................ 21
RANKING METHOD OF THE MODEL VARIABLES .................................................................................................. 22
6. RESULTS - METHOD 2 ........................................................................................................................... 23
PKA ENZYME .................................................................................................................................................... 24
PKB ENZYME .................................................................................................................................................... 25
PKC ENZYME .................................................................................................................................................... 25
CDK ENZYME ................................................................................................................................................... 26
CK2 ENZYME .................................................................................................................................................... 26
MAPK ENZYME ................................................................................................................................................ 27
7. CONCLUSIONS......................................................................................................................................... 28
APPENDIX ........................................................................................................................................................... 30
APPENDIX A. CLASSIFICATION TREE BUILD FOR THE WHOLE SET . ..................................................................... 30
4
1. Introduction
The report describes two processing methods of short amino acid sequences in order to enable
their classification by the LS-SVM classifier. The amino acid sequences are represented by the
strings, which contain 17 characters (each character denotes one amino acid). To enable
classification of such data by the LS-SVM classifier it is necessary to perform mapping of
character data to real numbers domain. Statistical classifiers (e.g. LS-SVM) have to meet basic
mathematical requirements. From the theorem of T. Cover [10] we know, that number of
elements N of the learning data set has to be equal or greater than 2(d+1), where d is the number
of features. It is necessary to perform feature selection and restrict the whole data set only to the
subset based on most relevant variables. When the learning data set does not satisfy Cover
theorem, the obtained classifier is equivalent to randomly defined classifier.
The first method is described in chapter 3. The features (positions in the amino acid sequence)
relevant for classification are selected by the application of classification trees [3]. Then the
mapping procedure (based on GINI index [4]) is applied in order to convert character data to real
numbers. Next, the obtained data set is used as input to the LS-SVM classifier. The results of
feature selection procedure and classification are presented in chapter 4.
The second method (described in chapter 5) utilizes the AA index (amino acid index) [8],
which contains values representing different physicochemical and biological properties of amino
acids. Each symbol from the amino acid sequence is substituted by the corresponding values from
the AAindex. Thereafter the feature selection procedure is applied, which uses simple ranking
formula and the Gram-Schmidt orthogonalization [5,6]. Next, the obtained data set is used as input to
the LS-SVM classifier. The results of feature selection procedure and classification are presented
in chapter 6.
The methods described in this report can be applied in the research on structure of enzymes.
The knowledge which amino acids sequences react (and on which position) with selected
enzymes can be useful for building of three dimensional enzyme models. The long term goal of
our research is preparation of a classifier which will be able to predict if an amino acid sequence
will or will not react with an enzyme.
5
2. Input data
The data set contains the 17-symbols amino acids sequences grouped with respect to their
reactions with 6 selected enzymes. The data set was defined at the Center of Oncology in
Warsaw. The data file is in the text form. Figure 2.1 presents an example of the input file format.
A single line represents either a sequence of amino acids or an enzyme symbol. The line starting
with the “#” sign denotes the line describing an enzyme symbol. The line starting with a letter
contains a sequence of 17 amino acids.
Each line of enzyme symbol opens a new series of amino acids sequences reacting with this
particular enzyme. For example the sequence SKSSPKDPSQRRRSLEP reacts with PKC
enzyme. Respectively, the amino acids sequence RRRSRRVSRRRRARRR reacts with CK2
enzyme.
# P K C
S K S S P K D P S Q R R R S L E P
R R S R R Y R R S T V A R W R R R
R R R R S R R S T V A W R R R R V
# C K 2
R R R R S R R V S R R R R A R R R
R R R R P R S V S R R W R A R R R
R R S R R Y R R S T V A R W R R R
R T S A V P T L S T F R T T R V T
Figure 2.1 An input file example.
The data file contains 6 enzymes: PKA, PKB, PKC, CDK, CK2 and MAPK. Hence, the goal
of this project is to obtain the statistical classifier that is able to divide the amino acids sequences
into 6 classes. It is important to notice that one sequence can belong to more than one class. For
example the sequence RRSRRYRRSTVARWRRR belongs to either PKC or CK2 class, as this
sequence reacts with both enzymes.
6
3. Method 1 – classification trees, dynamic mapping, LS-

SVM
The proposed approach consists of three stages. The task of the first stage is twofold: a) to
identify the positions in the amino acids sequences, which decides if the particular sequence
reacts with a given enzyme; b) for selected positions, to extract symbols, which decide if this
sequence reacts with the considered enzyme (this goal is solved by using the classification trees
algorithm described in the next section).
The goal of the second stage is to define the method of mapping symbols into real numbers in
order to apply the statistical classifiers, as neural networks, kernel machines etc. In this project
the least-squares support vector machines (LS-SVM) were applied as classifiers.
The third stage is to design the LS-SVM classifier and to validate its quality by the cross
validation approach. The results are summarized in subsection “Classification Charts”.
Classification Trees
In the first stage of the our approach we attempt to answer the question: “At what positions in
the amino acids sequence are placed the influential symbols?”, i.e. to discover from the data the
positions of symbols in the amino acid sequence that reflect the chemical relationship with the
enzyme of the class.
This problem is solved by building the maximum classification tree, which may be regarded
as recursive partition of the instance space. Characteristic feature of maximal classification tree is
that all terminal nodes in the tree contain observations only of one class. Furthermore too build
classification tree all class labels of all samples must be known in advance.
The classification tree is built following the splitting rule - the rule that performs splitting of
learning sample into smaller subsets. CART algorithm [1,3] in each node divides data set
recursively into two subsets preserving the rule of maximum homogeneity. Homogeneity of a
single node t in the classification tree T is define by an impurity function denoted as i(t). Because
in practice there are many functions which may be used as impurity function first we will
describe splitting procedure using i(t).
Consider the learning set X=[x1,..xN] where xi=[xi1…xiM] and the class vector Y=[y1..yN], yi
∈{1,...,K} describes the assignment of data samples to K classes. Let tp be a parent node and tl, tr -
respectively the left and the right child nodes of the parent node tp. Furthermore let Sj be a subset
off all possible nominal values which j-th variable denoted as xj may have and let xSj be a splitting
criterion such that if xij ∈ Sj the i-th sample is stored in tl node.
We know that for a parent node tp for any of the possible splits xj∈xSj, j=1,...,M, the impurity
is constant. Because we intend to minimize the impurity for both child nodes we need to
maximize the change of impurity function ∆i(t) between the parent node and child nodes. The
impurity change ∆i(t) between the parent node and child nodes is defined as follows:
7
∆i(t) = i(tp) − Pli(tl) − Pri(tr) (3.1)

(3.2)

Pl = ; Pr =
Where:
n - Number of training set samples in the parent node
n - Number of training set samples in the left child node
n
- Number of training set samples in the right child node
Therefore, at each node CART solves the following maximization problem
arg max − ! "! # − $ "$ #% (3.3)

∈ ,,...,
CART algorithm recommends Gini splitting rule to calculate the impurity i(t)
"# = ' (")|#("+|# (3.4)

,-!
where:
k, l = 1,...,K - the class label;
p(k|t) – relative frequency of class k provided we are in node t.
When we use Gini index as impurity function the difference ∆i(t) between parent node and child
nodes is expressed by equation:
/
1 1 1
∆ "# = − ' ( )0 + /

' ( )|! # +
/"
' (/ ")|$ #
(3.5)
! $
, , ,
Therefore, in order to find the optimal split value the following optimization problem has to be
solved
1 1 1
arg max 3− ' ( )0 + /

! ' ( )|! # +
/"
$ ' (/ ")|$ #4
∈ ,,...,
(3.6)
, , ,
The CART algorithm performs the search in the learning set for the largest class and isolates
it from the rest of the data. The recursive application of the splitting procedure leads to the
maximum classification tree, which by definition is overfitted. There are various approaches to
handle that problem. However as we are interested only in estimating of variable relevance we
disregard it.
8
Dynamic Mapping.
The second stage of the proposed method is searching for a function that maps symbols
(nominal values) into the real numbers. Let 5 denotes a matrix of all amino acids sequences,
where the row represents a single amino acid sequence. In that case we may consider each
position in amino acid sequence as a feature describing that sequence. We know that there are 20
possible amino acids. Furthermore let 6 be a vector of class labels (enzymes) which assigns each
single amino acid sequence to the certain class.
Let 7 = 89, :, ;, <, =, >, ?, @, A, B, C, D, , E, F, 7, G, H, I, 6J be a set of all possible amino
acids and let : = 8 A9, AK, A:, :A;, :A2, C9 A J be a set of all enzymes. In that case we
can describe input data as follow:
N ⋯ NP U
5=M ⋮ ⋱ ⋮ T; 6 = M ⋮ T
NS ⋯ NSP US (3.7)
NV ∈ 7 | 1 ≤ ≤ Y,Z 1 ≤ [ ≤ \
where:
UV ∈ : | 1 ≤ ≤ YZ
The support vector machines (SVM) are by definition the binary classifiers. We consider six
binary SVM classifiers: one against all. We define the mapping function for each particular
classifier. Hence, a given amino acid NV , at a given position, usually has different values in F P
with respect to considered class (enzyme).
The main feature of mapping function should be the appropriate distribution of vectors in F P .
We attempt the members of at least one class are clustered in F P in the way profitable for
efficient classification. The Gini index [4] function denoted here as GINI(5, ], , , ^) has this
property. In our case, for each position in sequence d and each symbol ], ∈ 7 on that position
and for the class UV we define Gini index as follows:
>@D@ "5, ], , , ^# = 1 − (,,V,_ / − `,,V,_ / (3.8)
where:
(,,V,_ - represents relative frequency of k-th symbol ], on d-th position in the amino acids
sequences belonging to class UV in the;
`,,V,_ - represents relative frequency of k-th symbol ], on d-th position in amino acids
sequences not belonging to the UV class;
5 - the vector space defined by a training set.
In order to achieve better clustering in F P , for the considered class UV we distinguish between
symbols which belong and not belong to that class. For all symbols in 5 we add the sign. With
the sign included, vectors will be clustered around zero: the vectors which belong to class UV are
distributed on the positive site in each dimension and the vectors not belonging to UV are
distributed on the negative site. The final mapping f function is defined below:
a "5, ], , ^ # = ]b\(,,V,_ − `,,V,_ ∗ "1 − (,,V,_ / − `,,V,_ / # (3.9)

9
Least-Squares Support Vector Machine (LS-SVM)
LS-SVM originates by changing the inequality constraints in the SVM formulation to

equality constraints with objective function in the least squares sense [9]. Data set D is defined
as:
D = {( x i , t i )} x i ∈ X ⊂ R d , t i ∈ {−1,+1} (3.10)
The LS-SVM classifier performs the function:
f (x) = wφ (x) + b (3.11)
This function is obtained by solving the following optimization problem:
l
1
L= || w || 2 +γ ∑ [t i − w φ (x i ) − b] 2 (3.12)
2 i =1
Hence, the solution can be expressed as the linear combination of kernels weighted by the
Lagrange multipliers αi:
l
f ( x) = ∑ α i K ( x i , x) + b (3.13)
i =1
The global minimizer is obtained in LS-SVM by solving the set of linear equations
K + γ −!I 1  α  t 
 T   =   (3.14)
1 0 b  0
In this work the RBF kernel is applied:
K (x, x' ) = exp{−η || x − x'||2 }, σ = 1 / γ (3.15)
This system is easier to solve as compared to SVM. However the sparseness of the support
vectors is lost. In SVM, most of the Lagrangian multipliers αi are equal 0 while in LS-SVM the
Lagrangian multipliers αi are proportional to the errors ei.
In this research the RBF kernel has been used. The parameters σ and γ are adjusted upon the class
and the number of input variables.
10
4. Results – method 1
Notation description
A single node description - text format.
Table 4.1 Classification tree – node description
2) V12=A,C,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,Y 1294 75 BAC (0.94204019 0.05795981)

2) Node number
The child nodes of node x are always numbered 2x (left) and 2x+1 (right)
V12 Variable on which the split was made.
A,C,F,G,H,I,K,L,M, Decision set
N,P,Q,R,S,T,V,Y List of values of the split variable, which assign the data sample to specified
majority class.
1294 The number of subjects at the node (classified as BACKGROUND)
75 Number of misclassified subjects
BAC Majority class
(0.94204019 Vector of class probability estimators (classes are listed in alphabetical order)
0.05795981)
* indicates that the node is terminal
A single node description - graphical format.
Figure 4.1 Classification tree – node description.
Hint how to interpret the tree diagram (see fig. 4.2): We perform binary classification (one class
against all others - called background in the figures). If all data samples are assigned to the
background then 322 of them will be misclassified. When on the 7th position (V7) of the amino
acid sequence occurs K or R then the sequence is classified as class PKA and 204 samples will be
misclassified. If an additional variable is considered (V6 – representing 6th position in the
sequence) and if the variable equals C, H, I, K, L, N, P or R then the sequence is classified as
class PKA and 102 samples will be misclassified.
11
Results summary
In order to validate the proposed method the following parameters were calculated for each
constructed classifier: false positive (FP), false negative (FN), true negative (TN) true positive
(TP), precision, recall and score. Precision is defined as TP/(TP+FP). The recall is defined as
TP/(TP+FN). Finally score is defined as (TP+TN)/(TP+TN+FP+FN). Precision score of 100%
for binary classification means that every item labeled as belonging to class X does indeed belong
to class X. Recall of 100% means that every item from class X was labeled as belonging to class
X. Score represents level of correctly labeled items in the whole set.
Additionally to computing the classification parameters in each try, standard deviation of
classification in one hundred tests was calculated. Next three charts compare precision, recall and
score parameters between all classes respectively. The standard deviation is also marked for each
parameter and each class. The class CK2 is the class for which classification results are best. On
the other hand class MAPK poses problems during classification. Interestingly class PKB which
is the smallest class, only 5% of the whole set, classifies with precision equal 61% and recall
equal 49%. However the standard deviation is high, it is equal 11% and 12.5% respectively.
12
PKA Enzyme
Classification tree – text format

format.
1) root 1641 322 BAC (0.8037782 0.1962218)

2) V7=A,C,D,E,F,G,H,I,L,M,N,P,Q,S,T,V,W,Y 1214 99 BAC (0.9184514 0.0815486) *
3) V7=K,R
R 427 204 PKA (0.4777518 0.5222482)
6) V6=A,D,E,F,G,M,Q,S,T,V,W,Y 129 27 BAC (0.7906977 0.2093023) *
7) V6=C,H,I,K,L,N,P,R 298 102 PKA (0.3422819 0.6577181)
14) V11=H,K,M,R 55 11 BAC (0.8000000 0.2000000) *
15) V11=A,C,D,E,F,G,I,L,N,P,Q,S,T,V,W,Y 243 58 PKA (0.2386831 0.7613169) *
Classification tree – graphical representation.
V7
{A,C,D,E,F,G,H,I,L,
M,N,P,Q,S,T,V,W,Y }
(1214, 99)
BACKGROUND V6
(1641, 322)
{A,D,E,F,G,M,
BACKGROUND
Q,S,T,V,W,Y}
V7 (129, 27)
BACKGROUND V11
{K, R}
(427, 204 ) {H,K,M,R}
PKA (55, 11)
BACKGROUND
V6
{C,H,I,K,L,N,P,R}
(298, 102) V11
PKA {A,C,D,E,F,G,I,L,
N,P,Q,S,T,V,W,Y}
(243, 58)
PKA
Figure 4.2 PKA Enzyme classification tree.
LS-SVM Classifier – testing results.
Table 4.2 Class PKA classification based on V7 and V6 (LS-SVM parameters: γ=1.3, σ=10, RBF kernel).
TP TN FP FN Precision Recall Score
Mean 67,61 459,92 37,41 57,06 64,45% 54,23% 84,81%
Standard Deviation 6,37 10,07 6,02 5,86 4,29% 3,46% 1,29%
Table 4.3 Class PKA classification based on V7, V6 and V11 (LS-SVM parameters: γ=1.3, σ=10, RBF kernel).
Mean 67,54 459,72 37,61 57,13 64,37% 54,19% 84,77%
13
PKB Enzyme
Classification tree – text format.
1) root 1641 83 BAC (0.94942108 0.05057892)

2) V4=A,C,D,E,F,G,H,I,K,L,M,N,P,Q,S,T,V,W,Y 1442 16 BAC (0.98890430 0.01109570) *
3) V4=R 199 67 BAC (0.66331658 0.33668342)
6) V6=A,D,G,H,I,K,L,M,N,P,Q,S,T,V,W,Y 85 3 BAC (0.96470588 0.03529412) *
7) V6=R 114 50 PKB (0.43859649 0.56140351)
0.561403
14) V7=E,H,I,K,L,Q,R,T 73 27 BAC (0.63013699 0.36986301) *
15) V7=A,D,F,G,M,N,P,S,V 41 4 PKB (0.09756098 0.90243902) *
V4
{{A,C,D,E,F,G,H,I,K, L,
M,N,P,Q,S,T,V,W,Y}
M,N,P,Q,S,T,V,W,Y
(1442, 16)
BACKGROUND
V6
(1641, 83)
{A,D,G,H,I,K,L,M,
BACKGROUND
N,P,Q,S,T,V,W,Y }
V4 (85, 3)
BACKGROUND V7
{R}
(199, 67 ) {E,H,I,K,L,Q,R,T }
BACKGROUND (73, 27)
BACKGROUND
V6
{R}
(114, 50)
PKB V7
{A,D,F,G,M,N,P,S,V }
(41, 4)
PKB
Figure 4.3 PKB Enzyme classification tree.
=1, RBF kernel).

Table 4.4 Class PKB classification based on V4, and V6 (LS-SVM parameters: γ=100, σ=1

Mean 21,99 571,98 17,8 10,23 50,75% 68,84% 95,49%
=100, σ=1, RBF kernel).

Table 4.5 Class PKB classification based on V6, V7 and V4 (LS-SVM parameters: γ=100

Mean 15,44 579,32 10,46 16,78 61,39% 48,40% 95,62%
Standard Deviation 3,9
3,92 5,99 4,96 5,11 10,99% 12,55% 0,78%
14
PKC Enzyme
1) root 1641 382 BACKG (0.76721511 0.23278489)

2) V11=A,C,D,E,F,G,H,I,L,M,N,P,Q,S,T,V,W,Y 1338 185 BACKG (0.86173393 0.13826607)
4) V10=C,D,E,L,N,P 810 29 BACKG (0.96419753 0.03580247) *
5) V10=A,F,G,H,I,K,M,Q,R,S,T,V,W,Y 528 156 BACKG (0.70454545 0.29545455)
10) V12=A,C,D,E,F,G,H,L,M,N,P,Q,S,T,W
G,H,L,M,N,P,Q,S,T,W 411 87 BACKG (0.78832117 0.21167883) *
11) V12=I,K,R,V,Y 117 48 PKC (0.41025641 0.58974359)
22) V15=A,D,F,G,L,M,N,Y 36 8 BACKG (0.77777778 0.22222222) *
23) V15=E,H,I,K,P,Q,R,S,T,V 81 20 PKC (0.24691358 0.75308642)
0.75308642) *
3) V11=K,R 303 106 PKC (0.34983498 0.65016502)
6) V10=H,P 71 6 BACKG (0.91549296 0.08450704) *
7) V10=A,D,E,F,G,I,K,L,M,N,Q,R,S,T,V,W,Y 232 41 PKC (0.17672414 0.82327586) *
V10
{C,D,E,L,N,P}
V11 (810, 29)
V12
{A,C,D,E,F,G,H,I,L,M, BACKGROUND
{A,C,D,E,F,G,H,L,
N,P,Q,S,T,V,W,Y} M,N,P,Q,S,T,W }
(1338, 185) (411, 87)
V10
BACKGROUND BACKGROUND
{A,F,G,H,I,K,M,Q,R,S,T,V,W,Y}
(528, 156)
1641, 382 BACKGROUND
BACKGROUND
V12
V10 {I,K,R,V,Y}
{H,P]} (117, 48)
(71, 6) PKC
V11 BACKGROUND
{K, R}
(303, 106)
PKC V10
{A,D,E,F,G,I,K,L,M,N,Q,R,S,T,V,W,Y}
(232, 41)
PKC
Figure 4.4 PKC Enzyme classification tree.

15
Table 4.6 Class PKC classification based on V11 and V10 (LS-SVM parameters: γ=1.3, σ=10, RBF kernel).

Mean 67,76 455,18 27,88 71,18 71,10% 48,76% 84,07%
Table 4.7 Class PKC classification based on V11, V10 and V12 (LS-SVM parameters: γ=1.3, σ=10, RBF kernel).
Mean 68,13 454,08 28,98 70,81 70,67% 49,07% 83,96%
16
CDK Enzyme
1) root 1641 325 BAC (0.801950030 0.198049970)

2) V10=A,C,D,E,F,G,I,K,L,M,N,Q,R,S,T,V,W,Y 1050 9 BAC (0.991428571 0.008571429) *
3) V10=H,P 591 275 CDK (0.465313029 0.534686971)
6) V12=A,C,D,E,F,G,H,L,M,N,P,Q,S,T,V,Y 439 188 BAC (0.571753986 0.428246014)
12) V14=A,D,E,F,I,P,Q,S,V,W 260 81 BAC (0.688461538 0.311538462) *
13) V14=C,G,H,K,L,M,N,R,T,Y 179 72 CDK (0.402234637 0.597765363)
26) V8=E,I,Q,R 27
7 5 BAC (0.814814815 0.185185185) *
27) V8=A,C,D,F,G,H,K,L,M,N,P,S,T,V,Y 152 50 CDK (0.328947368 0.671052632) *
7) V12=I,K,R,W 152 24 CDK (0.157894737 0.842105263) *
V10 V14
{A,C,D,E,F,G,I,K,L,M, {A,D,E,F,I,P,Q,S,V,W}
N,Q,R,S,T,V,W,Y} ( 260, 81)
(1050, 9) V12
BACKGROUND V8
1641, 325 BACKGROUND {A,C,D,E,F,G,H,L,M,N
,P,Q,S,T,V,Y} {E,I,Q,R}
BACKGROUND (27, 5)
V10 (439, 188)
BACKGROUND V14 BACKGROUND
{H, P} {C,G,H,K,L,M,N,R,T,Y}
(591, 275) (179, 72)
BACKGROUND CDK V8
{A,C,D,F,G,H,K,L,M,N,
V12 P,S,T,V,Y}
{I,K,R,W} (152, 50)
(152, 24) CDK
CDK
Figure 4.5 CDK Enzyme classification tree.
Table 4.8 Class CDK classification based on V10 and V12 (LS-SVM parameters: γ=1,
=1, σ=10, RBF kernel).
kernel)
Mean 87,02 444,73 50,53 38,72 63,51% 69,25% 85,63%
Table 4.9 Class CDK classification based on V10, V12 and 14 (LS-SVM
(LS parameters: γ=1, σ=10, RBF kernel).

Mean 82,32 446,88 48,38 43,42 63,42% 65,69% 85,22%
17
CK2 Enzyme
1) root 1641 280 BAC (0.82937233 0.17062767)

2) V12=A,C,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,Y 1294 75 BAC (0.94204019 0.05795981)
4) V10=A,F,G,H,I,K,L,M,N,P,Q,R,T,V,W,Y 1141 33 BAC (0.97107800 0.02892200) *
5) V10=C,D,E,S 153 42 BAC (0.72549020 0.27450980)
10) V6=H,K,L,M,P,R,S,T,Y 110 10 BAC (0.90909091 0.09090909) *
11) V6=A,D,E,F,G,I,N,Q,V 43 11 CK2 (0.25581395 0.74418605) *
3) V12=D,E,W 347 142 CK2 (0.40922190 0.59077810)
6)
) V10=F,I,K,L,M,P,R,T,V,W 141 36 BAC (0.74468085 0.25531915)
12) V6=A,C,H,I,K,L,M,N,P,Q,R,T,V 107 11 BAC (0.89719626 0.10280374) *
13) V6=D,E,F,G,S,Y 34 9 CK2 (0.26470588 0.73529412) *
7) V10=A,C,D,E,G,N,Q,S,Y 206 37 CK2 (0.17961165 0.82038835)
14) V6=K,R,T 46 16 BAC (0.65217391 0.34782609) *
15) V6=A,D,E,F,G,H,I,L,M,N,P,Q,S,V,W,Y 160 7 CK2 (0.04375000 0.95625000) *
V10
{A,F,G,H,I,K,L,M,N,P,Q,R,T,V
,W,Y}
V12 (1141, 33)
{A,C,F,G,H,I,K,L,M, BAKGROUND
N,P,Q,R,S,T,V,Y} V6
(1294, 75) {H,K,L,M,P,R,S,T,Y}
BACKGROUND (110, 10)
V10
BACKGROUND
{C,D,E,S}
(153, 42)
BAKGROUND
V6
{A,D,E,F,G,I,N,Q,V}
(43, 11)
CK2
1641, 280
BACKGROUND
V6
{A,C,H,I,K,L,M,N,P,Q,R,T,V}
(107, 11)
V10
BACKGROUND
{F,I,K,L,M,P,R,T,V,W}
(141, 36)
BACKGROUND V6
{D,E,F,G,S,Y}
V12 (34, 9)
{D,E,W} CK2
(347, 142)
CK2 V6
{K,R,T}
(46 16)
V10 BAKGROUND
{A,C,D,E,G,N,Q,S,Y}
(206, 37) V6
CK2 {A,D,E,F,G,H,I,L,
M,N,P,Q,S,V,W,Y}
(160 , 7)
CK2
Figure 4.6 CK2 Enzyme classification tree.

18
Table 4.10 Class CK2 classification based on V12, V10 and V6 (LS-SVM parameters: γ=100, σ=1, RBF kernel).

Mean 76,36 496,08 14,88 32,68 83,78% 70,12% 92,33%
19
MAPK Enzyme
1) root 1641 249 BAC (0.84826325 0.15173675)

2) V10=A,C,D,E,F,G,H,I,K,L,M,N,Q,R,S,T,V,W,Y
V10=A,C,D,E,F,G,H,I,K,L 1060 31 BAC (0.97075472 0.02924528) *
3) V10=P 581 218 BAC (0.62478485 0.37521515)
6) V12=G,I,K,N,R,W 188 31 BAC (0.83510638 0.16489362) *
7) V12=A,C,D,E,F,H,L,M,P,Q,S,T,V,Y 393 187 BAC (0.52417303
(0.52417303 0.47582697)
14) V7=A,D,E,F,G,H,I,K,N,Q,R,S,V,Y 210 72 BAC (0.65714286 0.34285714)
28) V14=D,E,H,K,M,R,T,W 69 7 BAC (0.89855072 0.10144928) *
29) V14=A,C,F,G,I,L,N,P,Q,S,V,Y 141 65 BAC (0.53900709 0.46099291)
58) V15=A,D,H,I,K,R,T,V,W,Y 60 14 BAC (0.76666667 0.23333333) *
59) V15=C,F,G,L,M,N,P,Q,S 81 30 MAPK (0.37037037 0.62962963)
118) V17=D,G,K,S,V,Y 30 9 BAC (0.70000000 0.30000000) *
119) V17=A,E,F,H,L,M,N,P,Q,R,T
V17=A,E,F,H,L,M,N,P,Q,R,T 51 9 MAPK (0.17647059 0.82352941) *
15) V7=C,L,M,P,T 183 68 MAPK (0.37158470 0.62841530)
30) V6=A,I,M,N,Q,R,S 61 23 BAC (0.62295082 0.37704918)
60) V1=A,D,E,F,G,K,N,P,R,V,Y 32 3 BAC (0.90625000 0.09375000) *
61) V1=C,L,M,Q,S,T 29 9 MAPK (0.31034483 0.68965517) *
31) V6=C,D,E,F,G,H,K,L,P,T,V,Y 122 30 MAPK (0.24590164 0.75409836) *
V14
{D,E,H,K,M,R,T,
D,E,H,K,M,R,T,
V10 W}
{A,C,D,E,F,G,H,I,
A,C,D,E,F,G,H,I, (69, 7)
K,L, BACKGROUND
M,N,Q,R,S,T,V, V7
W,Y } {A,D,E,F,G,H,I,K, V15
(1060, 31) N,Q,R,S,V,Y} {A,D,H,I,K,R,T,V,
BACKGROUND (210, 72) W,Y}
1641, 249 BACKGROUND (60, 14)
V12 V14 BACKGROUND
BACKGROUND
{G,I,K,N,R,W} {A,C,F,G,I,L,N,P,
A,C,F,G,I,L,N,P,
( 188, 31 ) Q,S,V,Y}
BACKGROUND (141, 65) V15
V10
BACKGROUND {C,F,G,L,M,N,P,
{P}
(581, 218) Q,S}
BACKGROUND V12 ( 81, 30)
{A,C,D,E,F,H,L,M, MAPK
P,Q,S,T,V,Y}
(393, 187)
BACKGROUND V1
{A,D,E,F,G,K,N,P,
R,V,Y}
(32, 3)
BACKGROUND
V6
{A,I,M,N,Q,R,S
A,I,M,N,Q,R,S}
(61, 23)
V7 BACKGROUND
V1
{C,L,M,P,T } {C,L,M,Q,S,T}
(183, 68) (29 , 9)
MAPK V6
{C,D,E,F,G,H,K,L,
C,D,E,F,G,H,K,L, MAPK
P,T,V,Y}
( 122, 30)
MAPK
Figure 4.7 MAPK Enzyme classification tree.

20
Table 4.11 Class MAPK classification based on V10, V12, V7, V14 and V6 (LS-SVM parameters: γ=1.1, σ=2,
RBF kernel).

Mean 36,66 494,26 31,17 57,91 54,52% 38,91% 85,63%
Table 4.12 Class MAPK classification based on V10, V12, V7, V14, V6, V15, V17 and V1 (LS-SVM parameters:
γ=1.1, σ=2, RBF kernel).

Mean 31,8 495,29 30,14 62,77 51,74% 33,75% 85,01%
21
5. Method 2 – aaindex mapping, ranking of model variables,

LS-SVM
The proposed approach consists of two stages. Mapping of amino acid symbols into real
numbers is performed in the first stage. Each symbol is substituted by corresponding values from
the AAindex data set. To decrease number of features only 193 uncorrelated indices were chosen
for the substitution (from all 544 indices) – see table 6.1 in chapter 6. After the first stage each
amino acid sequence is described by 3281 features – most of them are irrelevant for classification
purposes.
It is clear that the learning data set, which contains 984 data samples (and 3281 variables)
does not meet requirements of T. Cover theorem. The goal of the second stage is selection of
relevant features (variables). Simple ranking formula and the Gram-Schmidt orthogonalization is used
to solve the task. Results of the second stage are applied to the statistical classifiers - the least-
squares support vector machine (LS-SVM).
AAindex based mapping of symbols
An amino acid index [7,8] is a set of 20 numerical values representing any of the different
physicochemical and biological properties of amino acids. The AAindex1 section of the Amino
Acid Index Database is a collection of published indices together with the result of cluster
analysis using the correlation coefficient as the distance between two indices. This section
contains 544 indices.
Table 5.1 Example entry from AAindex1 data set

H ARGP820102
D Signal sequence helical potential (Argos et al., 1982)
R LIT:0901079b PMID:7151796
A Argos, P., Rao, J.K.M. and Hargrave, P.A.
T Structural prediction of membrane-bound proteins
J Eur. J. Biochem. 128, 565-575 (1982)
C ARGP820103 0.961 KYTJ820101 0.803 JURD980101 0.802
I A/L R/K N/M D/F C/P Q/S E/T G/W H/Y I/V
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
The meaning of the fields in an AAindex1 entry [7]:

H - accession number
D - data description
R - LITDB entry number
A - author(s)
T - title of the article
J - journal reference
22
C - accession numbers of similar entries with the correlation coefficients of 0.8 (-0.8) or
more (less).
I - amino acid index data in the following order:
Ala Arg Asn Asp Cys Gln Glu Gly His Ile
Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
Ranking method of the model variables
The orthogonalization procedures enable us examination of the influence of every input feature on the
output vector. The presented method uses simple ranking formula and the well known Gram-Schmidt
orthogonalization procedure for pointing out the model most salient variables [5,6].
N input-output pairs (measurements of the output of the process to be modeled, and of the candidate
features) is available.
We denote by:
Q – number of candidate features,
N – number of measurements of the process to be modeled,
xi=[xi1, xi2, ...xiN,] – the vector of the i-th feature values of N measurements,
yp – the N-dimensional vector of the classifier target values.
We consider the N by Q matrix X=[x1, x2, ..., xQ,]. The ranking procedure starts with calculating
correlation coefficient
cos2(xk,yp)=(xkyp)2/(||xk||2 ||yp||2) (5.1)
The larger it is, the better the k-th feature vector explains the yp variation.
As the first base vector we pick the one with the largest value of correlation coefficient. All the
remaining candidate features and the output vector are projected onto the null subspace (of dimension N-
1) of the selected feature. Next, we calculate correlation coefficient for the projected vectors and again
pick the one with the largest value of this quantity. The remaining feature vectors are projected onto the
null subspace of the first two ranked vectors by the classical Gram-Schmidt orthogonalization. This
procedure is continued until all the vectors xk are ranked.
To reject the irrelevant inputs we compare its rank value correlation coefficient with the rank of a
random probe. The remaining features are considered relevant to the model.
23
6. Results - method 2
The tests were performed on 20 data sets randomly generated from the data set containing all
sequences. The 60% of data samples were used for training of the classifier. Remaining data
samples were used for validation of obtained LS-SVM model.
Table 6.1 Subset of uncorrelated features .

'ANDN920101' 'FINA770101' 'QIAN880112' 'VASM830101'
'ARGP820101' 'FINA910101' 'QIAN880114' 'VASM830102'
'ARGP820102' 'FINA910103' 'QIAN880116' 'VASM830103'
'BEGF750101' 'FINA910104' 'QIAN880117' 'VELV850101'
'BEGF750102' 'GARJ730101' 'QIAN880118' 'VHEG790101'
'BEGF750103' 'GEIM800102' 'QIAN880122' 'WERD780102'
'BHAR880101' 'GEIM800103' 'QIAN880123' 'WERD780103'
'BIGC670101' 'GRAR740101' 'QIAN880124' 'WERD780104'
'BIOV880101' 'HOPA770101' 'QIAN880125' 'WOLS870102'
'BROC820101' 'HUTJ700101' 'QIAN880128' 'WOLS870103'
'BULH740102' 'ISOY800102' 'QIAN880129' 'YUTK870101'
'BUNA790101' 'ISOY800106' 'QIAN880130' 'ZIMJ680101'
'BUNA790103' 'ISOY800107' 'QIAN880137' 'AURR980101'
'BURA740101' 'ISOY800108' 'QIAN880138' 'AURR980102'
'BURA740102' 'KARP850103' 'QIAN880139' 'AURR980103'
'CHAM810101' 'KHAG800101' 'RACS820101' 'AURR980105'
'CHAM820102' 'KLEP840101' 'RACS820102' 'AURR980106'
'CHAM830102' 'LAWE840101' 'RACS820103' 'AURR980107'
'CHAM830103' 'LEVM760103' 'RACS820104' 'AURR980110'
'CHAM830104' 'LEWP710101' 'RACS820105' 'AURR980116'
'CHAM830105' 'LIFS790102' 'RACS820107' 'AURR980118'
'CHAM830107' 'MAXF760103' 'RACS820110' 'AURR980119'
'CHAM830108' 'MEEJ800101' 'RACS820111' 'AURR980120'
'CHOC760102' 'NAKH900103' 'RACS820112' 'VINM940104'
'CHOP780202' 'NAKH900104' 'RACS820113' 'MONM990101'
'CHOP780203' 'NAKH900109' 'RACS820114' 'PARS000102'
'CHOP780204' 'NAKH900113' 'RADA880105' 'KUMS000103'
'CHOP780205' 'OOBM770102' 'RICJ880101' 'FODM020101'
'CHOP780206' 'OOBM770104' 'RICJ880103' 'NADH010106'
'CHOP780207' 'OOBM850101' 'RICJ880104' 'KOEP990101'
'CHOP780211' 'OOBM850103' 'RICJ880105' 'KOEP990102'
'CHOP780212' 'OOBM850104' 'RICJ880107' 'FUKS010101'
'CHOP780214' 'OOBM850105' 'RICJ880108' 'AVBF000109'
'CHOP780215' 'PALJ810108' 'RICJ880109' 'YANJ020101'
'CRAJ730102' 'PALJ810109' 'RICJ880110' 'MITS020101'
'DAYM780101' 'PALJ810111' 'RICJ880112' 'WILM950101'
'DAYM780201' 'PALJ810113' 'RICJ880113' 'WILM950103'
'EISD860102' 'PALJ810114' 'RICJ880114' 'WILM950104'
'FASG760102' 'PALJ810115' 'RICJ880116' 'SUYM030101'
'FASG760103' 'PALJ810116' 'RICJ880117' 'GEOR030101'
'FASG760104' 'PONP800104' 'ROBB760107' 'GEOR030105'
'FASG760105' 'PONP800105' 'ROBB760109' 'GEOR030107'
'FAUJ880101' 'PONP800106' 'ROSM880103' 'DIGM050101'
'FAUJ880104' 'PRAM820101' 'SNEP660101'
'FAUJ880105' 'PRAM820102' 'SNEP660103'
'FAUJ880107' 'QIAN880101' 'SNEP660104'
'FAUJ880108' 'QIAN880102' 'SUEM840102'
'FAUJ880110' 'QIAN880103' 'TANS770102'
'FAUJ880111' 'QIAN880104' 'TANS770106'
'FAUJ880112' 'QIAN880108' 'TANS770108'
24
PKA Enzyme
Table 6.2 Features selected for classification of PKA class.

Feature AAindex1
no. Sequence accession
position number
1 10 QIAN880117
2 11 GEIM800102
3 10 QIAN880103
4 10 BEGF750101
5 10 QIAN880104
6 10 CHAM830104
7 10 CHAM810101
8 11 QIAN880122
9 11 RACS820114
10 11 GEOR030107
11 10 RACS820113
12 11 QIAN880138
13 10 PALJ810108
14 10 RICJ880112
15 10 RACS820114
16 10 QIAN880112
Table 6.3 Classification results of PKA class (LS-SVM parameters: γ=130, σ=10, RBF kernel).

Mean 62,20 493,15 35,20 66,45 64,03% 48,41% 84,53%
25
PKB Enzyme
Table 6.4 Features selected for classification of PKB class.

Feature AAindex1
position number
1 4 RACS820114
2 4 RACS820112
3 4 QIAN880108
4 4 ISOY800106
5 4 GEOR030101
Table 6.5 Classification results of PKB class (LS-SVM parameters: γ=200, σ=120, RBF kernel).

Mean 30,23 559,85 61,00 5,92 33,97% 83,66% 89,81%
PKC Enzyme
Table 6.6 Features selected for classification of PKC class.

Feature AAindex1
position number
1 10 QIAN880117
2 11 GEIM800102
3 10 QIAN880103
4 10 BEGF750101
5 10 QIAN880104
6 10 CHAM830104
7 10 CHAM810101
8 11 QIAN880122
9 11 RACS820114
10 11 GEOR030107
11 10 RACS820113
12 11 QIAN880138
Table 6.7 Classification results of PKC class (LS-SVM parameters: γ=13, σ=2, RBF kernel).

Mean 92,50 459,25 45,35 59,90 67,27% 60,71% 83,98%
26
CDK Enzyme
Table 6.8 Features selected for classification of CDK class.

Feature AAindex1
position number
1 10 ARGP820102
2 10 CHAM830104
3 10 QIAN880116
Table 6.9 Classification results of CDK class (LS-SVM parameters: γ=1.3, σ=1, RBF kernel).

Mean 125,35 420,40 105,50 5,75 54,32% 95,61% 83,07%
CK2 Enzyme
Table 6.10 Features selected for classification of CK2 class.

Feature AAindex1
position number
1 12 FAUJ880112
2 12 KLEP840101
3 12 HOPA770101
4 10 HOPA770101
5 10 FAUJ880112
6 12 FUKS010101
7 12 FAUJ880110
8 10 AURR980107
9 12 AURR980107
10 10 MEEJ800101
11 12 CHOP780204
12 12 RICJ880105
13 10 FAUJ880108
14 12 CHAM830107
15 12 MEEJ800101
16 10 RICJ880105
17 12 PARS000102
Table 6.11 Classification results of CK2 class (LS-SVM parameters: γ=13, σ=10, RBF kernel).

Mean 67,05 522,40 22,60 44,95 75,01% 59,93% 89,72%
27
MAPK Enzyme
Table 6.12 Features selected for classification of MAPK class.

Feature AAindex1
position number
1 10 ARGP820102
2 10 CHAM830104
3 10 PONP800105
4 10 QIAN880116
5 10 QIAN880117
6 10 BROC820101
7 10 YUTK870101
8 10 WILM950101
9 10 RICJ880107
10 10 FASG760102
11 10 QIAN880112
12 10 QIAN880104
13 10 BEGF750101
14 10 SNEP660101
15 10 BULH740102
16 10 RICJ880112
17 10 QIAN880124
Table 6.13 Classification results of MAPK class (LS-SVM parameters: γ=1300, σ=100, RBF kernel).

Mean 70,00 399,79 158,00 29,21 26,21% 71,39% 71,51%
28
7. Conclusions
The summary of performed research is presented on figure 7.1. The results obtained by both
methods are comparable, however method 1 (based on classification trees and GINI index)
performs slightly better. This fact may be caused by mapping procedure used in method 2
(AAindex mapping, ranking of model variables). After substitution different amino acid
sequences may be represented by the same feature vector. This is one of major drawbacks of the
second method.
The precision values are higher in method 1. Recall values are higher in method 2. However
the balance between precision and recall may be slightly modified by different selection of
hyperparameters of the LS-SVM classifier and number of variables.
The major advantage of method 1 is small number of features (see Table 7.1) obtained in
feature selection stage based on classification trees. The classification task could be completed
by classifier utilizing approximately three variables.
Table 7.1 Number of features used for classification.

Class name Method 1 Method 2
PKA 2 16
PKB 3 5
PKC 2 12
CDK 2 3
CK2 3 17
MAPK 5 17
REFERENCES
1. MaimonO., Rokach L. Data Mining and Knowledge Discovery Handbook. Springer, 2005
2. Xu R., Wunsch II D.C. Clustering. Wiley, 2009
3. Breiman L., Friedman J. H., Olshen R. A., and Stone C. J.. Classification and Regression Trees.
Wadsworth, Belmont, 1984.
4. Gini C., Variabilita e Mutabilita, Journal of the Royal Statistical Society, Vol. 76, No. 3, February,
1913, pp. 326-327
5. Stoppiglia H., G. Dreyfus, R. Dubois, Y. Oussar, Ranking a Random Feature for Variable and
Feature Selection, Journal of Machine Learning Research 3 (2003), 1399-1414
6. Jankowski S., Szymański Z., Raczyk M., Piatkowska-Janko E., Oreziak A. Pertinent signal-averaged
ECG parameters selection for recognition of sustained ventrical tachycardia. XXXVth International
Congress on Electrocardiology, 18-21 September, 2008, St. Petersburg, Russia, pp. 43 (abstract).
7. Kawashima S., AAindex: Amino Acid Index Database Release 9.1, Aug 2006,
ftp://ftp.genome.jp/pub/db/community/aaindex/aaindex.doc
8. Kawashima S. and Kanehisa M.; AAindex: amino acid index database. Nucleic Acids Res. 28, 374
(2000)
9. Suykens J.A.K and Vandewalle J.: Least squares support vector machine classifier, Neural
Processing Letters, 9 (1999), 293-300
10. Cover T.M.: Geometrical and statistical properties of systems of linear inequalities with applications
in pattern recognition, IEEE Trans. on Electr. Comp., 1965 EC14, 326-334
29
a) d)
100,00% 100,00
90,00% 90,00
80,00% 80,00
70,00% 70,00
60,00% 60,00
50,00% 50,00
40,00% 40,00
30,00% 30,00
20,00% 20,00
10,00% 10,00
0,00% 0,00
PKA PKB PKC CDK CK2 MAPK PKA PKB PKC CDK CK2 MAPK
b) e)
100,00% 100,00
90,00% 90,00
80,00% 80,00
70,00% 70,00
60,00% 60,00
50,00% 50,00
40,00% 40,00
30,00% 30,00
20,00% 20,00
10,00% 10,00
0,00% 0,00
c) f)
100,00% 100,00
95,00% 95,00
90,00% 90,00
85,00% 85,00
80,00% 80,00
75,00% 75,00
70,00% 70,00
65,00% 65,00
60,00% 60,00
55,00% 55,00
50,00% 50,00
Figure 7.1 Summary of classification

lassification results. Method 1: a) precision,, b) recall, c) total score. Method 2:
d) precision, e) recall, f)) total score.
30
Appendix
Appendix A. Classification tree build for the whole set .
1) root 1641 1259 PKC (0.2 0.17 0.15 0.2 0.051 0.23)
2) V10=P 581 269 CDK (0.54 0.024 0.38 0.036 0.0086 0.019)
4) V12=G,I,K,R,W 181 34 CDK (0.81 0 0.16 0.011 0 0.017) *
5) V12=A,C,D,E,F,H,L,M,N,P,Q,S,T,V,Y 400 211 MAPK (0.41 0.035 0.47 0.048 0.012 0.02)
10) V8=C,E,H,K,N,S,T,V,Y 165 69 CDK (0.58 0.024 0.35 0.024 0.024 0) *
11) V8=A,D,F,G,I,L,M,P,Q,R,W 235 103 MAPK (0.29 0.043 0.56 0.064 0.0043 0.034) *
3) V10=A,C,D,E,F,G,H,I,K,L,M,N,Q,R,S,T,V,W,Y 1060 689 PKC (0.012 0.25 0.029 0.28 0.074 0.35)
6) V12=D,E,W 312 113 CK2 (0 0.64 0.0096 0.19 0.038 0.12)
12) V6=A,D,E,F,G,H,I,K,L,M,N,P,Q,S,T,V,W,Y 239 49 CK2 (0 0.79 0.013 0.079 0.0084 0.1) *
13) V6=C,R 73 33 PKA (0 0.12 0 0.55 0.14 0.19) *
7) V12=A,C,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,Y 748 416 PKC (0.017 0.09 0.037 0.32 0.088 0.44)
14) V11=A,C,D,E,F,G,H,I,L,M,N,P,Q,S,T,V,W,Y 550 319 PKA (0.016 0.12 0.047 0.42 0.11 0.29)
28) V7=A,C,D,E,F,G,H,I,L,M,N,P,S,T,V,W,Y 278 192 PKC (0.032 0.22 0.076 0.19 0.17 0.31)
56) V4=A,C,D,E,F,G,H,I,K,L,M,N,P,Q,S,T,V,W,Y 210 139
PKC(0.043 0.28 0.095 0.22 0.029 0.34)
112) V12=A,C,G,H,I,L,M,N,P,Q,S,T 147 92 CK2 (0.02 0.37 0.12 0.26 0.041 0.18)
224) V6=A,D,E,F,G,H,L,N,P,Q,S,T,V,Y 94 42 CK2 (0.021 0.55 0.16 0.11 0.021 0.14) *
225) V6=C,K,R 53 25 PKA (0.019 0.057 0.057 0.53 0.075 0.26) *
113) V12=F,K,R,V,Y 63 19 PKC (0.095 0.048 0.032 0.13 0 0.7) *
57) V4=R 68 26 PKB (0 0.029 0.015 0.12 0.62 0.22) *
29) V7=K,Q,R 272 95 PKA (0 0.015 0.018 0.65 0.037 0.28)
58) V6=H,I,K,L,N,P,R 201 45 PKA (0 0 0.005 0.78 0.045 0.17) *
59) V6=A,D,E,F,G,M,Q,S,T,V,W,Y 71 30 PKC (0 0.056 0.056 0.3 0.014 0.58) *
15) V11=K,R 198 28 PKC (0.02 0.015 0.01 0.056 0.04 0.86) *

Warsaw University of Technology Warsaw University of Technology Warsaw University of Technology

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Warsaw University of Technology Warsaw University of Technology Warsaw University of Technology

Uploaded by

Copyright:

Available Formats

Warsaw University of Technology

Feature selection and mapping of data from short amino

Stanisław Jankowski, Marek Dwulit, Zbigniew Szymański

ICS Research Report 2/10

Institute of Computer Science

Nowowiejska 15 / 19, 00-665

Feature selection and mapping of data from short amino acid

Jankowski S., Dwulit M., Szymański Z.

Key words: feature selection, LS-SVM, AAindex, amino acid sequence

Figure 2.1 An input file example.

3. Method 1 – classification trees, dynamic mapping, LS-

∆i(t) = i(tp) − Pli(tl) − Pri(tr) (3.1)

Therefore, at each node CART solves the following maximization problem

arg max   − !  "! # − $  "$ #% (3.3)

"# = ' (")|#("+|# (3.4)

∆ "# = − ' ( )0  + /

arg max 3− ' ( )0  + /

a "5, ], , ^ # = ]b\(,,V,_ − `,,V,_  ∗ "1 − (,,V,_ / − `,,V,_ / # (3.9)

Least-Squares Support Vector Machine (LS-SVM)

LS-SVM originates by changing the inequality constraints in the SVM formulation to

The LS-SVM classifier performs the function:

f (x) = wφ (x) + b (3.11)

This function is obtained by solving the following optimization problem:

In this work the RBF kernel is applied:

K (x, x' ) = exp{−η || x − x'||2 }, σ = 1 / γ (3.15)

A single node description - text format.

Table 4.1 Classification tree – node description

2) V12=A,C,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,Y 1294 75 BAC (0.94204019 0.05795981)

A single node description - graphical format.

Figure 4.1 Classification tree – node description.

Classification tree – text format

1) root 1641 322 BAC (0.8037782 0.1962218)

Classification tree – graphical representation.

Figure 4.2 PKA Enzyme classification tree.

LS-SVM Classifier – testing results.

Classification tree – text format.

1) root 1641 83 BAC (0.94942108 0.05057892)

Classification tree – graphical representation.

Figure 4.3 PKB Enzyme classification tree.

LS-SVM Classifier – testing results.

=1, RBF kernel).

TP TN FP FN Precision Recall Score

=100, σ=1, RBF kernel).

TP TN FP FN Precision Recall Score

Classification tree – text format.

1) root 1641 382 BACKG (0.76721511 0.23278489)

Classification tree – graphical representation.

Figure 4.4 PKC Enzyme classification tree.

LS-SVM Classifier – testing results.

TP TN FP FN Precision Recall Score

Classification tree – text format.

1) root 1641 325 BAC (0.801950030 0.198049970)

Classification tree – graphical representation.

Figure 4.5 CDK Enzyme classification tree.

LS-SVM Classifier – testing results.

TP TN FP FN Precision Recall Score

Classification tree – text format.

1) root 1641 280 BAC (0.82937233 0.17062767)

Classification tree – graphical representation.

Figure 4.6 CK2 Enzyme classification tree.

LS-SVM Classifier – testing results.

TP TN FP FN Precision Recall Score

arg max − ! "! # − $ "$ #% (3.3)

"# = ' (")|#("+|# (3.4)

∆ "# = − ' ( )0 + /

arg max 3− ' ( )0 + /

a "5, ], , ^ # = ]b\(,,V,_ − `,,V,_ ∗ "1 − (,,V,_ / − `,,V,_ / # (3.9)