Journal of Biomedical Informatics 56 (2015) 300–306
Contents lists available at ScienceDirect
Journal of Biomedical Informatics

journal homepage: www.elsevier.com/locate/yjbin
A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification
Abdulaziz Yousef, Nasrollah Moghadam Charkari
Faculty of Electrical & Computer Engineering, Tarbiat Modares University, Tehran, Iran
Article info

Article history:
Received 13 February 2015
Revised 4 June 2015
Accepted 26 June 2015
Available online 2 July 2015

Keywords:
Disease gene identification
Physicochemical properties of amino acid
One class classification
Support Vector Data Description

Abstract

Identifying the genes that cause disease is one of the most challenging issues in establishing diagnosis and treatment quickly. Several interesting methods for disease gene identification have been introduced over the past decade. In general, these methods differ mainly in the type of data used as prior knowledge and in the machine learning (ML) methods used for identification. ML methods have commonly treated disease gene identification as a binary classification problem (whether a gene is a disease gene or not). However, the nature of the data, in which no negative data are available for training the learners, creates a major problem that affects the results. In this paper, a sequence-based, one class classification method is introduced to assign genes to the disease class (yes, no). First, to generate the feature vector, the protein (gene) sequences are transformed into numerical vectors using physicochemical properties of amino acids. Second, as there is no definite approach to defining non-disease genes (negative data), we model only the disease genes (positive data) and make predictions with the Support Vector Data Description algorithm. The experimental results confirm the efficiency of the method, with precision, recall and F-measure of 79.3%, 82.6% and 80.9%, respectively.

© 2015 Elsevier Inc. All rights reserved.
1. Introduction

Identifying disease genes is important both for improving our knowledge of disease mechanisms and for improving clinical methods.

Traditional linkage and association studies have identified a large number of candidate genes that are probably related to diseases [1]. Because identifying disease-associated genes experimentally among such vast numbers of candidates is expensive, computational approaches are required. In this regard, many machine learning methods have been introduced to find features shared between unknown genes (candidate genes) and known disease genes. These methods differ in two ways. The first is the type of genomic data used to generate the feature vector, such as protein–protein interactions (PPIs) [2–6], gene expression profiles [7], and gene ontology (GO) [8]. Other methods integrate multiple data sources to prioritize candidate disease genes [9–11].

The second is the type of algorithm used to train the prediction model. The majority of studies treat the task as a two class classification problem. Some studies define the known disease genes as the positive set and the unknown genes as the negative set [12,13]. Since the unknown gene set often contains some disease genes, other methods try to reduce the confusion in the classification process by selecting a small fraction of the unknown genes as the negative set [14,15]. Nevertheless, these methods are not robust and reliable enough, because a negative set drawn from unknown genes still suffers from noisy data.

All of the above methods may perform poorly because they rely on protein information obtained from prior knowledge (PPI networks, gene ontology, and protein domains), which contains errors and is usually incomplete. A universal source of prior knowledge is therefore required to tackle this problem. The only data that are available for all proteins, and that have played an influential role in solving many problems such as predicting subcellular locations [16,17], protein–protein interactions [18–20], and protein structural and functional classes [21–23], are the protein sequences [24]. On the other hand, there is no information about the negative data (non-disease genes), and there is no guarantee that the unknown genes, or a fraction of them, can serve as negative data. Hence, a two class classification algorithm may not be appropriate.

Corresponding author. Tel.: +98 21 82883301; fax: +98 21 82884325.
E-mail addresses: yousef@modares.ac.ir, Azizyousef1@yahoo.com (A. Yousef), charkari@modares.ac.ir (N.M. Charkari).
http://dx.doi.org/10.1016/j.jbi.2015.06.018
1532-0464/© 2015 Elsevier Inc. All rights reserved.
In this paper, we present a novel sequence-based one class classification method for disease gene identification. Since the earliest protein sequences and structures were determined, it has been clear that the positions and properties of amino acids are key to inferring many biological processes. For example, the first protein structure, haemoglobin, provided a molecular explanation for the genetic disease sickle cell anaemia [25]. Therefore, a sequence translation method based on physicochemical properties of amino acids is employed to construct a feature vector for each protein. To improve the performance of the proposed method, the most useful features are selected using Principal Component Analysis (PCA). Since gene mutation is always possible, and a mutation may lead to disease [26], nothing establishes that a given protein is not involved in disease (that is, that it is a non-disease gene). Hence, we train only on the positive data (disease genes) using the Support Vector Data Description (SVDD) algorithm. We use two types of negative data to evaluate the model. The first type is selected randomly from the unknown dataset, while the second is selected from the likely negative data used in the positive-unlabeled technique [15].

The proposed method has been compared with the ProDiGe method [14], Smalter's method [12], and Yang's method [15]. The experimental results are 79.3%, 82.6% and 80.9% for precision, recall and F-measure, respectively. These results indicate that our method surpasses the current state-of-the-art methods in precision.

2. Material and method

In this section, the proposed method for identifying disease genes is described. The method consists of three steps: (1) the gene products (proteins) are translated into numerical feature vectors using physicochemical descriptors; (2) the Principal Component Analysis (PCA) algorithm is applied to extract appropriate features; (3) the positive data are used to train the SVDD algorithm, which makes the identification. The schema of the proposed method is illustrated in Fig. 1.

Fig. 1. The schematic of the proposed method. As shown, the proposed method includes three layers: the representation layer, the feature extraction layer, and the predictor layer.

2.1. Protein sequence translation

Many representation methods have been introduced to extract the information encoded in an amino acid sequence, including Geary autocorrelation (GA) [27], auto covariance (AC) [28], normalized Moreau–Broto autocorrelation (NA) [29], and Moran autocorrelation (MA) [30]. These methods are based on different physicochemical properties. In this paper, six physicochemical properties of amino acids that have been used in many applications (e.g. predicting protein structural and functional classes, protein–protein interactions, and sub-cellular locations) have been selected. Since each additional physicochemical property adds thirty features to the feature vector, we selected the most effective properties while keeping their number to a minimum, to limit time and computational complexity. These properties are polarity (POL) [31], residue accessible surface area in tripeptide (RAS) [32], hydrophilicity (HY-PHIL) [33], polarizability (POL2) [34], hydrophobicity (HY-PHOB) [35], and solvation free energy (SFE) [36]. The original values of these physicochemical properties for each amino acid were normalized using Min–Max normalization, as shown in Eq. (1):

$$P'_{i,j} = \frac{P_{i,j} - P_{j,\min}}{P_{j,\max} - P_{j,\min}} \qquad (1)$$

where $P_{i,j}$ is the j-th descriptor value for the i-th amino acid, $P_{j,\min}$ is the minimum value of the j-th descriptor over the 20 amino acids, and $P_{j,\max}$ is the maximum value of the j-th descriptor over the 20 amino acids. Table 1 shows the normalized physicochemical properties. Since the GA method has achieved good results in another application [20], we apply GA as the representation method in this work.

2.2. Principal Component Analysis (PCA)

To increase the overall performance of the proposed method, we extract the most relevant and useful features from the high-dimensional feature vectors (180 features) generated by the GA method. PCA is used as the dimensionality reduction method. PCA is a statistical technique for identifying patterns in data and for expressing the data in a way that highlights their similarities and differences. It is therefore used to determine the significant features, preserving most of the information and removing the redundant components [37]. PCA is a linear transformation that maps one set of variables in $R^m$ into another set in $R^n$ that contains the maximum amount of variance in the data, where n is smaller than m. The transformation is obtained in the four steps listed after Table 1.
Fig. 2. Support Vector Data Description sketch-map.

Table 1
The normalized value of six physicochemical properties for each amino acid (A–Y).

Amino acid  HY-PHOB  HY-PHIL  POL    POL2   SFE    RAS
A           0.281    0.453    0.395  0.112  0.589  0.222
C           0.458    0.375    0.074  0.312  0.527  0.333
D           0        1        1      0.256  0.191  0.416
E           0.027    1        0.913  0.369  0.285  0.638
F           1        0.14     0.037  0.709  0.936  0.75
G           0.198    0.531    0.506  0      0.446  0
H           0.207    0.453    0.679  0.562  0.582  0.666
I           0.792    0.25     0.037  0.454  0.851  0.555
K           0.198    1        0.79   0.535  0.325  0.694
L           0.783    0.25     0      0.454  0.851  0.527
M           0.721    0.328    0.098  0.54   0.957  0.611
N           0.12     0.562    0.827  0.327  0.319  0.472
P           0.253    0.531    0.382  0.32   0.702  0.388
Q           0.123    0.562    0.691  0.44   0.4    0.583
R           0.222    1        0.691  0.711  0      0.833
S           0.235    0.578    0.53   0.151  0.448  0.222
T           0.318    0.468    0.456  0.264  0.557  0.361
V           0.687    0.296    0.123  0.342  0.765  0.444
W           0.56     0        0.061  1      1      1
Y           0.922    0.171    0.16   0.728  0.787  0.861
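The values in Table 1 are the inputs to the sequence translation step of Section 2.1. The paper does not write out the GA formula, so the sketch below assumes the commonly used form of the Geary autocorrelation descriptor, C(d) = [(1/(2(N-d))) Σ (P_i - P_{i+d})²] / [(1/(N-1)) Σ (P_i - P̄)²], with lags d = 1..30 inferred from the statement that each property contributes thirty features. The function names `min_max` and `geary` are illustrative, and only the HY-PHOB column is shown.

```python
# Normalized hydrophobicity (HY-PHOB) column copied from Table 1.
HY_PHOB = {
    "A": 0.281, "C": 0.458, "D": 0.0, "E": 0.027, "F": 1.0,
    "G": 0.198, "H": 0.207, "I": 0.792, "K": 0.198, "L": 0.783,
    "M": 0.721, "N": 0.12, "P": 0.253, "Q": 0.123, "R": 0.222,
    "S": 0.235, "T": 0.318, "V": 0.687, "W": 0.56, "Y": 0.922,
}


def min_max(raw):
    """Eq. (1): rescale the raw values of one property to the range [0, 1]."""
    lo, hi = min(raw.values()), max(raw.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in raw.items()}


def geary(sequence, prop, max_lag=30):
    """Geary autocorrelation of one property for lags 1..max_lag.

    Assumed (standard) definition; max_lag=30 is inferred from the text,
    not stated as a formula in the paper.
    """
    values = [prop[aa] for aa in sequence]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    if variance == 0:
        raise ValueError("property is constant along the sequence")
    features = []
    for lag in range(1, max_lag + 1):
        # Mean squared difference between residues that are `lag` apart.
        squared = sum((values[i] - values[i + lag]) ** 2 for i in range(n - lag))
        features.append(squared / (2 * (n - lag)) / variance)
    return features
```

Under these assumptions, concatenating the 30 values returned for each of the six properties would give the 180-dimensional vector mentioned in Section 2.2.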
The PCA transformation of Section 2.2 is obtained in the following steps:

1. Compute the covariance matrix:

$$S = \frac{1}{N}\sum_{k=1}^{N}(x_k - m)(x_k - m)^T,$$

where $m = \frac{1}{N}\sum_{k=1}^{N} x_k$ is the mean of the original feature vectors.
2. Compute the eigenvectors and eigenvalues of the covariance matrix by solving the decomposition

$$\lambda_i e_i = S e_i,$$

where $\lambda_i$ is the eigenvalue associated with the eigenvector $e_i$.
3. Sort the results in decreasing order of eigenvalue.
4. Choose the most important components (features).

2.3. Support Vector Data Description (SVDD)

Since there is no information about the negative data (non-disease genes) and only the positive data (target data) are available, a one class classification (OCC) method is the proper approach to the disease gene identification problem. An OCC method builds a description of the disease gene set and detects new genes that resemble this set. The Support Vector Data Description (SVDD) method is inspired by the support vector machine and has been used for novelty and outlier detection. It is a one-class classifier that constructs a hyper-sphere enclosing the target instances and uses it to decide whether a test instance belongs to the positive class [38]. The objective of SVDD is to find a minimum hyper-sphere, with center a and radius R, that includes all the target instances. Fig. 2 shows a sketch-map of the SVDD method. Like the SVM algorithm, SVDD provides a more flexible and stable description of the boundary by applying kernel functions (e.g. the Gaussian kernel). For disease gene identification, SVDD has the following benefits [38]:

1. Sparsity: fewer training samples are needed. As mentioned above, only a small fraction of genes would be predicted as disease genes.
2. Good generalization: the SVDD method avoids overfitting and yields good generalization results compared with other conventional methods.
3. Use of kernels: by exploiting the kernel trick, the SVDD method is able to accurately model the support of non-trivial, multi-modal distributions.

3. Results

First, we evaluated the effect of different sequence representation methods on the performance of the proposed method, and investigated the best number of features extracted by the PCA algorithm for each of these methods. After selecting the most effective sequence representation method, we assessed the impact of each physicochemical property of amino acids on the identification process. Finally, to confirm the efficiency of the method, we compared our method with the state-of-the-art techniques for general disease gene identification.

3.1. Experimental data

The data used by [15] have been employed in our experiments. This dataset consists of 5405 known disease genes (positive data) spanning 2751 disease phenotypes, where all the genes were extracted by combining GENECARD [39] and OMIM [40] disease gene data. In addition, 16 k genes from Ensembl [41] were selected as the unknown gene set.

3.2. Evaluation metrics

To measure the performance of the proposed identification method, Recall, Precision, and F-measure have been used. They are defined as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\text{F-measure} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$

where FP is the number of false positives (non-disease genes, i.e. negative data, that have been identified as disease genes), TP is the number of true positives (disease genes that have been correctly identified), and FN is the number of false negatives (disease genes that have been incorrectly identified as non-disease genes). Recall is the ratio of the number of disease genes retrieved to the total number of disease genes in the dataset. Precision is the ratio of the number of relevant disease genes retrieved to the total number of genes retrieved, relevant and irrelevant. Finally, the F-measure is the harmonic mean of recall and precision.
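As a small worked check of the definitions in Section 3.2, the sketch below computes the three metrics from confusion counts. The function names are illustrative, and returning 0.0 for an empty denominator is a convention chosen here, not something the paper specifies.

```python
def recall(tp, fn):
    """Fraction of true disease genes that were retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of retrieved genes that are true disease genes."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

Plugging in the reported precision of 0.793 and recall of 0.826 gives an F-measure of about 0.809, consistent with the 80.9% quoted for the proposed method.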
4. Discussion

To minimize overfitting of the identification model, 5-fold cross-validation was carried out in all of the experiments, using 5000 positive and 5000 unlabeled instances. Since only disease genes are available, we used SVDD, which requires only positive data for training. Testing and estimating the error rates, however, require some negative data. We therefore employed two evaluation strategies. In the first strategy, some of the unknown data were selected randomly as negative data and used in the testing process. In the second strategy, we used the positive-unlabeled method [15], which takes the unknown instances farthest from the mean vector of all the positive data as negative instances.

Table 3
Comparison between sequence representation methods using PCA extracted features.

Methods   # of PCA features  Precision (%)  Recall (%)  F-measure (%)
AC-SVDD   55                 78.6           74.2        76.3
GA-SVDD   60                 79.3           82.6        80.9
NA-SVDD   80                 76.1           81.2        78.5
MA-SVDD   55                 72             80.4        75.9
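The second evaluation strategy of Section 4, which takes the unlabeled instances farthest from the positive mean as likely negatives, can be sketched as below. This is a minimal illustration, assuming Euclidean distance and a caller-chosen number `k` of negatives; neither detail is stated in the text, and the function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def likely_negatives(positives, unlabeled, k):
    """Return the k unlabeled vectors farthest from the positive mean.

    Euclidean distance is an assumption made for this sketch.
    """
    center = mean_vector(positives)
    ranked = sorted(unlabeled, key=lambda x: math.dist(x, center), reverse=True)
    return ranked[:k]
```

The first strategy (random negatives) would simply replace the ranking with a random sample of the unlabeled set.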
4.1. Performance of the sequence representation methods

To evaluate the robustness of the representation methods, we trained and tested the model using the SVDD algorithm with the AC, GA, MA, and NA representation methods separately. The results of each predictor are reported in two ways: first, when all the features are employed; second, when the PCA feature reduction method is applied.

Table 2 compares the identification performance of the sequence representation methods using all features. Table 3 gives the results of each representation method after applying the PCA feature reduction method. The performance of the GA method is better than that of the other methods. Therefore, we select GA as the main sequence representation method for the next experiments. Moreover, Table 3 shows the positive effect of PCA on the experimental results.

Table 2
Comparison between sequence representation methods using all features.

Methods   # of features  Precision (%)  Recall (%)  F-measure (%)
AC-SVDD   180            74             76.2        75
GA-SVDD   180            77.3           79.5        78.3
NA-SVDD   180            70.9           82.4        76.2
MA-SVDD   180            69.8           79.7        74.4

4.2. Assessment of physicochemical properties

Since the sequence representation method is based on the physicochemical properties, our objective in this experiment is to understand which physicochemical property has the most impact on disease gene identification. In this regard, six different GA feature vectors have been built. The first one is generated using all physicochemical properties, and the five other ones are generated by deleting one of the physicochemical properties in turn. In this experiment, we observed that deleting the HY-PHIL property reduces the F-measure dramatically, while deleting the others has a small negative effect on the performance. Fig. 3 illustrates the priority of the physicochemical properties in disease gene identification. In every case, deleting a physicochemical property decreases the performance.

Fig. 3. The effectiveness of physicochemical properties on the disease gene identification.

4.3. Evaluation of different One Class Classification methods

To evaluate the performance of the SVDD method for disease gene identification, two other well known one class classification methods, i.e. the Gaussian Mixture Model (GMM) and Parzen density estimation, have been implemented in our study.

All the employed classifiers originate from the Matlab toolbox ddTOOLS [42]. Each of these classifiers has its own parameters. The Parzen classifier has a smoothing parameter s, and the GMM has the number of mixture components k of the positive class. SVDD has two parameters, C and the kernel parameter a, where C is the regularization parameter that controls the trade-off between the volume of the hypersphere and the number of accepted training samples. We used trial and error to optimize the Parzen and GMM parameters, and the values s = 0.38 and k = 29 were selected. Grid search was carried out to optimize the SVDD parameters. The resulting optimal kernel parameter and regularization parameter are a = 0.128 and C = 16, respectively.

Table 4
Comparative evaluation between different one class classification methods.

Methods              Precision (%)  Recall (%)  F-measure (%)
SVDD                 79.3           82.6        80.9
Parzen               79.6           80.3        79.9
Mixture of Gaussian  79.2           81.5        80.3

The precision, recall, and F-measure scores are shown in Table 4. The precision results indicate that there are no significant differences between the methods, while the recall scores show that the SVDD method performs better than the other methods. The good results of SVDD can be explained by the fact that this classifier prevents overfitting by using only a few instances to support the boundary between disease genes and other genes. Other methods, such as the Parzen method, take the information of all training samples into account, which leads to more overfitting as well as a higher false negative rate.
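To make the contrast drawn in Section 4.3 concrete, the sketch below shows a Parzen-window one class classifier, in which every training sample contributes to the density estimate. It is not the ddTOOLS implementation used in the paper: the isotropic Gaussian kernel, the `reject_fraction` thresholding rule, and all function names are assumptions made for illustration, with `s` playing the role of the smoothing parameter.

```python
import math


def parzen_density(x, train, s):
    """Average of isotropic Gaussian kernels of width s centred on each training point."""
    dim = len(x)
    norm = (2 * math.pi * s * s) ** (dim / 2)
    total = sum(math.exp(-math.dist(x, t) ** 2 / (2 * s * s)) for t in train)
    return total / (len(train) * norm)


def fit_threshold(train, s, reject_fraction=0.1):
    """Density threshold below which roughly reject_fraction of the training set falls.

    Densities are evaluated on the training points themselves (no leave-one-out),
    which is a simplification.
    """
    densities = sorted(parzen_density(t, train, s) for t in train)
    index = min(int(reject_fraction * len(densities)), len(densities) - 1)
    return densities[index]


def is_target(x, train, s, threshold):
    """Accept x as a target (disease-like) instance if its density reaches the threshold."""
    return parzen_density(x, train, s) >= threshold
```

Because the density sums over all training points, the decision boundary depends on the whole training set, unlike SVDD, whose boundary is supported by a few instances.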
Table 5
Comparison between disease gene identification methods.

Methods                                     Precision (%)  Recall (%)  F-measure (%)
Xu's method (1) [6]                         68.4           54.2        60.4
Xu's method (5) [6]                         66.8           56.3        61.1
Smalter's method [12]                       67.9           61.5        64.5
ProDiGe [14]                                70.4           76.2        73.1
Proposed method (randomly negative data)    72.9           78.8        75.7
PUDI [15]                                   72.7           82.4        77.2
Proposed method (likely negative data)      79.3           82.6        80.9

Fig. 4. The ROC curve of the test set as analyzed by PUDI, ProDiGe, Xu's Method, and Proposed Method.
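The ROC comparison of Fig. 4 is summarized in the text by the area under the curve (AUC). As a hedged illustration of that quantity, and not the authors' evaluation code, the AUC can be computed directly from classifier scores as the probability that a positive instance scores higher than a negative one, with ties counted as one half. The function name `auc` and the score lists are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

The pairwise form is quadratic in the number of instances, which is acceptable for a sketch; a rank-based computation would scale better on thousands of genes.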
4.4. Comparison with other works

We have compared the proposed method with the state-of-the-art methods, including Smalter's method [12], Xu's method [6], the ProDiGe method [14], and Yang's method [15]. Table 5 shows the comparison between the proposed method and the other related methods. The main differences between the proposed method and the related ones are:

1. The prior knowledge used to extract the feature vector. In this work, the protein sequence is used as a more universal source of prior knowledge, while in the previous works the prior knowledge suffered from noise.
2. The classification method used for the disease gene identification problem. All the previous methods treat this problem as a two class classification problem, while in this work it is treated as a one class classification problem. The reason is that there is no guarantee about the negative instances extracted from unknown genes. Therefore, it is preferable to train the model using only the positive instances.

To test our model, we applied two types of negative data. For the first type, to reduce noise, we selected more likely negative data from the unknown data by taking the unknown instances farthest from the mean vector of all the positive data. The second type of negative data was selected randomly from all the unknown data. We used the first type for the comparison with the PUDI algorithm, and the second type for the comparison with the remaining algorithms. The results indicate that the prediction performance of the proposed method is better than that of the other methods.

To investigate how the proposed method improves the performance of disease gene identification compared with the state-of-the-art methods, we implemented the leave-one-out cross-validation test and drew the ROC curves of the test set. As shown in Fig. 4, the curve of the SVDD predictor shows higher prediction performance over the whole range of false positive rates, and produces an AUC score of 95.7%, while the AUCs achieved by PUDI, ProDiGe, and Xu's method are 93%, 88.7%, and 83.2%, respectively. In other words, the SVDD classifier can decrease both false negatives and false positives compared with the other methods.

4.5. Novel gene identification

We made an effort to discover novel disease genes that are not available in the disease gene data set. In this regard, we employed the positive data (disease genes) to train the model and then tested the model on other unknown genes. We selected the top 10 genes based on their model likelihoods (SVDD probabilities). Through a literature search, we found that 8 out of the 10 predicted disease genes are related to at least one disease. Table 6 lists the novel disease genes that are predicted [43–53].

Table 6
Novel disease genes predicted using the SVDD algorithm.

Gene      Score  Phenotype
TRPV1     98.5   [42]
GP5       98.2   Thrombocytopenia [43]
                 Platelet disorder [44]
ANGPTL1   96.2   Melanoma [45]
                 Tumors [46]
BDNF      93.6   Huntington [47]
WSB1      92.8   Neuroblastoma [48]
TRPAL     88
PHLDA1    80.1   Tumors [49]
ODAM      79.6
ITGB1BP2  76.5   Hypertrophy [50]
EIF2AK2   69.3   Influenza [51]
                 Hepatitis C [52]

4.6. Identification of a specific disease class

To investigate the reliability of the proposed method in detecting disease genes for a specific disease class, Parkinson's disease (PD) has been selected. We obtained 32 PD-related proteins from the OMIM disease gene data and 18 PD-related proteins from the UniProt database [54]. We built a dataset containing 100 samples. The 50 PD-related proteins make up the positive data, while 50 unlabeled genes, selected randomly from the likely negative set, make up the negative data. To evaluate the identification model, the leave-one-out cross-validation test was performed. As shown in Fig. 5, the proposed method achieved an area under the ROC curve of 97%, while the results of Xu's method, ProDiGe, and PUDI are 91.4%, 93%, and 95.2%, respectively. These results indicate that the proposed method performs well, since it can identify PD-related genes with high probability.

Fig. 5. The ROC curve of Parkinson's disease set based on PUDI, ProDiGe, Xu's Method, and Proposed Method.

5. Conclusions

In this article, we proposed a one class classification method based on protein sequences for disease gene identification. The GA representation method, based on the physicochemical properties of amino acids, was used to transform the amino acid sequences into numerical feature vectors. Then, the most significant features were selected using the PCA algorithm. Finally, we trained the disease gene model using the SVDD algorithm. The results demonstrate the importance of treating disease gene identification as a one class classification problem. In addition, it is worth mentioning that the physicochemical properties of amino acids are highly beneficial. In comparison with other methods, we found that the proposed method achieved better results than the previous studies. For future work, to achieve better classification performance, more physicochemical properties will be taken into consideration, such as amino acid composition, CC in regression analysis, and graph shape index. To get better performance, we will also apply a combination of different types of one class classification methods.

Conflict of interest

The authors declare that they have no conflict of interest.

References

[1] A.M. Glazier, J.H. Nadeau, T.J. Aitman, Finding genes that underlie complex traits, Science 298 (5602) (2002) 2345–2349.
[2] S. Kohler et al., Walking the interactome for prioritization of candidate disease genes, Am. J. Hum. Genet. 82 (4) (2008) 949–958.
[3] P. Yang et al., Inferring gene-phenotype associations via global protein complex network propagation, PLoS ONE 6 (7) (2011) e21502.
[4] W. Zhang, F. Sun, R. Jiang, Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach, BMC Bioinformatics 12 (Suppl. 1) (2011) S11.
[5] S. Navlakha, C. Kingsford, The power of protein interaction networks for associating genes with diseases, Bioinformatics 26 (8) (2010) 1057–1063.
[6] J. Xu, Y. Li, Discovering disease-genes by topological features in human protein-protein interaction network, Bioinformatics 22 (22) (2006) 2800–2805.
[7] U. Ala et al., Prediction of human disease genes by human-mouse conserved coexpression analysis, PLoS Comput. Biol. 4 (3) (2008) e1000043.
[8] J. Freudenberg, P. Propping, A similarity-based method for genome-wide prediction of disease-relevant human genes, Bioinformatics 18 (Suppl. 2) (2002) S110–S115.
[9] S. Aerts et al., Gene prioritization through genomic data fusion, Nat. Biotechnol. 24 (5) (2006) 537–544.
[10] T. De Bie et al., Kernel-based data fusion for gene prioritization, Bioinformatics 23 (13) (2007) i125–i132.
[11] Y. Li, J.C. Patra, Integration of multiple data sources to prioritize candidate genes using discounted rating system, BMC Bioinformatics 11 (Suppl. 1) (2010) S20.
[12] A. Smalter, S.F. Lei, X.-W. Chen, Human disease-gene classification with integrative sequence-based and topological features of protein-protein interaction networks, IEEE.
[13] P. Radivojac et al., An integrated approach to inferring gene-disease associations in humans, Proteins 72 (3) (2008) 1030–1037.
[14] F. Mordelet, J.P. Vert, ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples, BMC Bioinformatics 12 (1) (2011) 389.
[15] P. Yang et al., Positive-unlabeled learning for disease gene identification, Bioinformatics 28 (20) (2012) 2640–2647.
[16] K.C. Chou, Y.D. Cai, Prediction of protein subcellular locations by GO-FunD-PseAA predictor, Biochem. Biophys. Res. Commun. 320 (4) (2004) 1236–1239.
[17] Y. Fukasawa et al., Plus ça change – evolutionary sequence divergence predicts protein subcellular localization signals, BMC Genomics 15 (1) (2014) 46.
[18] S.L. Lo et al., Effect of training datasets on support vector machine prediction of protein-protein interactions, Proteomics 5 (4) (2005) 876–884.
[19] C.-Y. Yu, L.-C. Chou, D.T.H. Chang, Predicting protein-protein interactions in unbalanced data using the primary structure of proteins, BMC Bioinformatics 11 (1) (2010) 167.
[20] A. Yousef, N. Moghadam Charkari, A novel method based on new adaptive LVQ neural network for predicting protein-protein interactions from protein sequences, J. Theor. Biol. 336 (2013) 231–239.
[21] C.Z. Cai et al., SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence, Nucleic Acids Res. 31 (13) (2003) 3692–3697.
[22] C.Z. Cai et al., Enzyme family classification by support vector machines, Proteins 55 (1) (2004) 66–76.
[23] L.Y. Han et al., Prediction of RNA-binding proteins from primary sequence by a support vector machine approach, RNA 10 (3) (2004) 355–368.
[24] Z.R. Li et al., PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence, Nucleic Acids Res. 34 (Web Server issue) (2006) W32–W37.
[25] M.J. Betts, R.B. Russell, Amino-acid properties and consequences of substitutions, Bioinformatics for Geneticists 2 (2007) 311–342.
[26] X. Zhu et al., One gene, many neuropsychiatric disorders: lessons from Mendelian diseases, Nat. Neurosci. 17 (6) (2014) 773–781.
[27] R.R. Sokal, B.A. Thomson, Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population, Am. J. Phys. Anthropol. 129 (1) (2006) 121–131.
[28] Y. Guo et al., Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences, Nucleic Acids Res. 36 (9) (2008) 3025–3030.
[29] Z.P. Feng, C.T. Zhang, Prediction of membrane protein types based on the hydrophobic index of amino acids, J. Protein Chem. 19 (4) (2000) 269–275.
[30] J.F. Xia, K. Han, D.S. Huang, Sequence-based prediction of protein–protein interactions by means of rotation forest and autocorrelation descriptor, Protein Pept. Lett. 17 (1) (2010) 137–145.
[31] R. Grantham, Amino acid difference formula to help explain protein evolution, Science 185 (4154) (1974) 862–864.
[32] C. Chothia, The nature of the accessible and buried surfaces in proteins, J. Mol. Biol. 105 (1) (1976) 1–12.
[33] T.P. Hopp, K.R. Woods, Prediction of protein antigenic determinants from amino acid sequences, Proc. Natl. Acad. Sci. USA 78 (6) (1981) 3824–3828.
[34] M. Charton, B.I. Charton, The structural dependence of amino acid hydrophobicity parameters, J. Theor. Biol. 99 (4) (1982) 629–644.
[35] R.M. Sweet, D. Eisenberg, Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure, J. Mol. Biol. 171 (4) (1983) 479–488.
[36] D. Eisenberg, A.D. McLachlan, Solvation energy in protein folding and binding, Nature 319 (6050) (1986) 199–203.
[37] I. Jolliffe, Principal Component Analysis, in Encyclopedia of Statistics in Behavioral Science, John Wiley & Sons Ltd., 2005.
[38] D.M.J. Tax, R.P.W. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.
[39] M. Safran et al., GeneCards Version 3: The Human Gene Integrator, Database (Oxford), 2010, p. baq020.
[40] V.A. McKusick, Mendelian Inheritance in Man and its online version, OMIM, Am. J. Hum. Genet. 80 (4) (2007) 588–604.
[41] P. Flicek et al., Ensembl 2011, Nucleic Acids Res. 39 (Database Issue) (2011) D800–D806.
[42] D.M.J. Tax, DDtools, the Data Description Toolbox for Matlab, 2007.
[43] N.J. Marshall et al., A role for TRPV1 in influencing the onset of cardiovascular disease in obesity, Hypertension 61 (1) (2013) 246–252.
[44] K. Acar et al., Soluble platelet glycoprotein V in distinct disease states of pathological thrombopoiesis, J. Natl Med. Assoc. 100 (1) (2008) 86–90.
[45] Q. Shi et al., Targeting platelet GPIbalpha transgene expression to human megakaryocytes and forming a complete complex with endogenous GPIbbeta and GPIX, J. Thromb. Haemost. 2 (11) (2004) 1989–1997.
[46] A. Smagur, J. Szary, S. Szala, Recombinant angioarrestin secreted from mouse melanoma cells inhibits growth of primary tumours, Acta Biochim. Pol. 52 (4) (2005) 875–879.
[47] Y. Xu, Y.J. Liu, Q. Yu, Angiopoietin-3 inhibits pulmonary metastasis by inhibiting tumor angiogenesis, Cancer Res. 64 (17) (2004) 6119–6126.
[48] C. Zuccato et al., Loss of huntingtin-mediated BDNF gene transcription in Huntington's disease, Science 293 (5529) (2001) 493–498.
[49] Q.R. Chen et al., Increased WSB1 copy number correlates with its over-expression which associates with increased survival in neuroblastoma, Genes Chromosom. Cancer 45 (9) (2006) 856–862.
[50] M.A. Nagai et al., Down-regulation of PHLDA1 gene expression is associated with breast cancer progression, Breast Cancer Res. Treat. 106 (1) (2007) 49–56.
[51] V. Palumbo et al., Melusin gene (ITGB1BP2) nucleotide variations study in hypertensive and cardiopathic patients, BMC Med. Genet. 10 (1) (2009) 140.
[52] J.-Y. Min et al., A site on the influenza A virus NS1 protein mediates both inhibition of PKR activation and temporal regulation of viral RNA synthesis, Virology 363 (1) (2007) 236–243.
[53] Y. Hiasa et al., Protein kinase R is increased and is functional in hepatitis C virus–related hepatocellular carcinoma, Am. J. Gastroenterol. 98 (11) (2003) 2528–2534.
[54] Z.R. Li et al., PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence, Nucleic Acids Res. 34 (Web Server issue) (2006) W32–W37.
