Mol Divers (2011) 15:269–289
DOI 10.1007/s11030-010-9234-9

COMPREHENSIVE REVIEW

Genetic algorithm optimization in drug design QSAR: Bayesian-regularized genetic neural networks (BRGNN) and genetic algorithm-optimized support vector machines (GA-SVM)
Michael Fernandez · Julio Caballero · Leyden Fernandez · Akinori Sarai

Received: 14 May 2009 / Accepted: 25 January 2010 / Published online: 20 March 2010
© Springer Science+Business Media B.V. 2010

Abstract Many articles in in silico drug design have implemented genetic algorithms (GA) for feature selection, model optimization, conformational search, or docking studies. Some of these articles described GA applications to quantitative structure–activity relationship (QSAR) modeling in combination with regression and/or classification techniques. We reviewed the implementation of GA in drug design QSAR and specifically its performance in the optimization of robust mathematical models such as Bayesian-regularized artificial neural networks (BRANNs) and support vector machines (SVMs) on different drug design problems. Modeled data sets encompassed ADMET and solubility properties, cancer target inhibitors, acetylcholinesterase inhibitors, HIV-1 protease inhibitors, ion-channel and calcium entry blockers, and antiprotozoan compounds, as well as protein classes, functional, and conformational stability data. The GA-optimized predictors were often more accurate and robust than previously published models on the same data sets and explained more than 65% of data variances in validation experiments. In addition, feature selection over large pools of molecular descriptors provided insights into the structural and atomic properties ruling ligand–target interactions.
M. Fernandez (✉) · A. Sarai
Department of Bioscience and Bioinformatics, Kyushu Institute of Technology (KIT), 680-4 Kawazu, Iizuka 820-8502, Japan
e-mail: michael_llamosa@yahoo.com

J. Caballero
Centro de Bioinformática y Simulación Molecular, Universidad de Talca, 2 Norte 685, Casilla 721, Talca, Chile
e-mail: jcaballero@utalca.cl

L. Fernandez
Barcelona Supercomputing Center–Centro Nacional de Supercomputación, Nexus II Building, c/ Jordi Girona 29, 08034 Barcelona, Spain

Keywords Drug design · Enzyme inhibition · Feature selection · In silico modeling · QSAR · Review · SAR · Structure–activity relationships

List of abbreviations
ADMET Absorption, distribution, metabolism, excretion and toxicity
AD Alzheimer's disease
log S Aqueous solubility
ANNs Artificial neural networks
BRANNs Bayesian-regularized artificial neural networks
BRGNNs Bayesian-regularized genetic neural networks
BBB Blood–brain barrier
CoMFA Comparative molecular field analysis
CG Conjugated Gradient
GA Genetic algorithm
GA-PLS Genetic algorithm-based partial least squares
GA-SVM Genetic algorithm-optimized support vector machines
GNN Genetic neural networks
GSR Genetic stochastic resonance
HIA Human intestinal absorption
PPBR Human plasma protein binding rate
Log P Lipophilicity
LHRH Luteinizing hormone-releasing hormone
MMP Matrix metalloproteinase
MT Mitochondrial toxicity
MLR Multiple linear regression
MT− Negative mitochondrial toxicity
NNEs Neural network ensembles
EVA Normal coordinate eigenvalue
BIO Oral bioavailability


PLS Partial least squares
P-gp P-glycoprotein
PCC Physicochemical composition
MT+ Positive mitochondrial toxicity
PC-GA-ANN Principal component–genetic algorithm–artificial neural network
PCs Principal components
PPR Projection pursuit regression
QSAR Quantitative structure–activity relationship
QSPR Quantitative structure–property relationship
RBF Radial basis function
SOMs Self-organized maps
SR Stochastic resonance
SVMs Support vector machines
TRβ1 Thyroid hormone receptor β1
Tdp Torsades de pointes
VKCs Voltage-gated potassium channels

Introduction

One of the main challenges in today's drug design is the discovery of new biologically active compounds on the basis of previously synthesized molecules. Quantitative structure–activity relationship (QSAR) is an indirect ligand-based approach which models the effect of structural features on biological activity. This knowledge is then employed to propose new compounds with an enhanced activity and selectivity profile for a specific therapeutic target [1]. QSAR methods are based entirely on experimental structure–activity relationships for enzyme inhibitors or receptor ligands. In comparison to direct receptor-based methods, which include molecular docking and advanced molecular dynamics simulations, QSAR methods do not strictly require the 3D structure of a target enzyme or even a receptor–effector complex. They are not computationally demanding and allow establishing an in silico tool from which the biological activity of newly synthesized molecules can be predicted [1]. Three-dimensional QSAR (3D-QSAR) methods, especially comparative molecular field analysis (CoMFA) [2] and comparative molecular similarity indices analysis (CoMSIA) [3], are nowadays widely used in drug design. The main advantages of these methods are that they are applicable to heterogeneous data sets and that they provide a 3D-mapped description of favorable and unfavorable interactions according to physicochemical properties. In this sense, they provide a solid platform for retrospective hypotheses by means of the interpretation of significant interaction regions. However, some disadvantages of these methods are related to the 3D information and alignment of the molecular structures, since there are uncertainties about the different binding modes of the ligands and about their bioactive conformations [4].

CoMFA and CoMSIA have emerged as the 3D-QSAR methods most embraced by the scientific community today; however, current articles on QSAR encompass many forms of molecular information and statistical correlation methods. The structures can be described by physicochemical parameters [5], topological descriptors [6], quantum chemical descriptors [7], etc. The correlation can be obtained by linear methods or by nonlinear predictors such as artificial neural networks (ANNs) [8] and nonlinear support vector machines (SVMs) [9]. Unlike linear methods (CoMFA, CoMSIA, etc.), ANNs and SVMs are able to describe nonlinear relationships, which should provide a more realistic approximation of the structure–activity relationship paradigm, since interactions between the ligand and its biological target must be nonlinear. Two major problems arise when the functional dependence between biological activities and the computed molecular descriptor matrix is nonlinear and when the number of calculated variables exceeds the number of compounds in the data set. The nonlinearity problem can be tackled inside a nonlinear modeling framework, while the over-dimensionality issue can be handled by implementing a feature selection routine that determines which of the descriptors have a significant influence on the activity of a set of compounds. Genetic algorithms (GA), rather than forward or backward elimination procedures, have been successfully applied for feature selection in QSAR studies when the dimensionality of the data set is high and/or the interrelations between variables are convoluted [10]. The present review focuses on the application of two very flexible and robust approaches, Bayesian-regularized genetic neural networks (BRGNNs) and GA-optimized SVMs (GA-SVM), to QSAR modeling in drug design. Biological activities of low molecular weight compounds and protein function, class, and stability data were modeled to derive reliable classifiers with potential use in virtual library screening. First, we present a general survey of GA implementation and application in QSAR drug design. Second, we describe the BRGNN and GA-SVM approaches. Finally, we discuss their applications to model different target–ligand data sets relevant for drug discovery and also protein function and stability prediction.

General survey of genetic algorithm implementations in drug design QSAR

Genetic algorithms are stochastic optimization methods inspired by the principles of biological evolution [11]. A GA investigates many possible solutions simultaneously, each one exploring a different region of the parameter space [12].


First, a population of N individuals is created in which each individual encodes a randomly chosen subset of the modeling space, and the fitness or cost of each individual in the present generation is determined. Second, parents selected on the basis of their scaled fitness scores yield a fraction of the children of the next generation by crossover (crossover children) and the rest by mutation (mutation children). In this way, the new offspring contains characteristics from its parents. Usually, the routine is run until a satisfactory, rather than the global optimum, solution is achieved. Advantages such as quickly scanning a vast solution set, the fact that bad proposals do not affect the end solution, and the absence of any need for problem-specific rules make GAs very attractive for model optimization in drug discovery, where every problem is highly particular because of the lack of previous knowledge of the functional relationship and where generalization is very difficult.

Chromosome representation

Solving shortcomings of QSAR analysis such as the selection of optimum feature subsets, the optimization of model parameters, and data set manipulation has been the main goal of GA-based QSAR. The optimization space can include variables and model parameters. However, since variable selection is the most common task, populations have mainly been encoded by binary or integer chromosomes. Binary representation is very popular due to its easy and straightforward implementation, in which the chromosome is a binary vector of the same length as the number of descriptors in the data matrix. The values 1 and 0 represent the inclusion or exclusion of a feature in the individual chromosome, respectively. Models of different dimensionality can evolve throughout the search process at the same time. In this case, the algorithm is highly automatic, since no extra parameters must be set, and the optimum solution is achieved when a predefined stopping condition is reached. On the other hand, integer representation encodes a string of integers representing the positions of the features in the whole data matrix. Usually, the number of features encoded in the chromosome is controlled according to criteria derived from previous knowledge of the modeled problem. Despite this constraint, the algorithm gains efficiency because inefficient large-dimension models are avoided by controlling the number of variables during the search process. This aspect is especially important when training complex predictors, given their high tendency toward overparametrization/overfitting and their expensive computing time [10]. Model size can also be controlled in binary GA, but this simple routine is usually implemented in a rather unsupervised way. In many GA implementations in QSAR studies, the individuals in the populations are predictors, and training, validation, and/or crossvalidation errors are the individual fitness or cost functions. Different functions have been reported to rank the individuals in a population depending on the mathematical model implemented inside the GA framework.

The authors have proposed a variety of fitness functions which are proportional to the residual error of the training set [10,13–25], the validation set [26], or crossvalidation [27–30], or a combination of them [31–33]. Overfitting has been decreased by complementing the cost function with terms accounting for the trade-off between the number of variables and the number of training cases [34] and/or by keeping model complexity as simple as possible during the search process [10].

Population generation and ranking of individuals

The first step is to create a gene pool (population of models) of N individuals. Chromosome values are randomly initiated, and the fitness of each individual in this generation is determined by the fitness function of the model and scaled by the scaling function. Fitness scaling converts the raw fitness scores returned by the fitness function to values in a range that is suitable for the selection function. The selection function uses the scaled fitness values to select the parents of the next generation, assigning a higher probability of selection to individuals with higher scaled values. Controlling the range of the scaled values is very important because it affects the performance of the GA. Scaled values that vary too widely cause the individuals with the highest scaled values to reproduce too rapidly; they take over the population gene pool too quickly and prevent the GA from exploring other areas of the solution space. On the other hand, scaled values that vary too narrowly give all individuals nearly the same chance of reproduction, and the optimization progresses very slowly. One of the most widely used types of fitness scaling is rank-based scaling. The position of an individual in the sorted score list is its rank. In rank-based scaling, individuals are scored according to their rank instead of their raw score. This fitness scaling removes the effect of the spread of the raw scores [11,12].

Evolution and stopping criteria

During evolution, a fraction of the children of the next generation is produced by crossover (crossover children) and the rest by mutation (mutation children) from the parents. Sexual and asexual reproduction take place, so that the new offspring contains characteristics from both or one of its parents. In sexual reproduction, a selection function probabilistically selects two individuals on the basis of their ranking to serve as parents. An individual can be selected more than once as a parent, in which case it contributes its genes to more than one child. Stochastic selection functions lay out a line in which each parent corresponds to a section of the line of length proportional to its scaled value [11,12]. Similarly, roulette selection chooses parents by simulating a roulette wheel, in which the area of the section of the wheel corresponding to an individual is proportional to the individual's expectation.


The algorithm uses a random number to select one of the sections with a probability equal to its area [11,12]. On the other hand, tournament selection chooses each parent by selecting a set of players (individuals) at random and then choosing the best individual out of that set to be a parent [32]. Crossover then randomly selects a fraction of each parent's descriptor set, and a child is constructed by combining these fragments of genetic code. Finally, the rest of the individuals in the new generation are obtained by asexual reproduction, in which randomly selected parents are subjected to random mutation of their genes. Reproduction often includes elitism, which protects the fittest individual in any given generation from crossover or mutation [27]. Finally, stopping criteria determine what causes the algorithm to terminate. The most common parameters used to control the algorithm flow are the maximum number of iterations the GA will perform and the maximum time the algorithm runs before stopping. Some implementations stop the GA if the best fitness score is less than or equal to a threshold value; others evaluate the performance over a preset number of generations or time interval, and the algorithm stops if there is no improvement in the best fitness value.

Some applications

GA has been successfully applied in drug design QSAR to optimize linear and nonlinear predictors. Cho and Hermsmeier [13] introduced a simple encoding scheme for chemical features and the allocation of compounds in a data set. They applied GA to simultaneously optimize descriptors and the composition of training and test sets. The method generates multiple models on subsets of compounds representing clusters with different chemotypes, and a molecular similarity method determined the best model for a given compound in the test set. The performance on the Selwood data set [35] was comparable to other published methods. Hemmateenejad and co-workers [31–33] reported seminal studies on GA-based QSAR in drug design. They modeled the calcium channel antagonist activity of a set of nifedipine analogs by GA-optimized multiple linear regression (MLR) and partial least squares (PLS) regression [31]. Adequate models with low standard errors and high correlation coefficients were derived from topology, hydrophobicity, and surface area, but PLS had better prediction ability than MLR. The authors applied a principal component–genetic algorithm–artificial neural network (PC-GA-ANN) procedure to model the activity of another series of nifedipine analogs [32]. Each molecule was encoded by 10 sets of descriptors, and principal component analysis (PCA) was used to compress the descriptor groups into principal components (PCs). GA selected the best set of PCs to train a feed-forward ANN. The PC-GA-ANN routine outperformed ANNs trained with top-ranked PCs (PC-ANN) by yielding better predictive ability.
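As an illustration of the machinery just described, the following minimal sketch (Python, with a hypothetical ordinary least-squares training error standing in for the MLR/PLS/ANN/SVM fitness functions used in the cited studies) implements binary-chromosome feature selection with rank-based fitness scaling, roulette-wheel parent selection, single-point crossover, mutation, and elitism:

```python
import numpy as np

def ga_feature_selection(X, y, pop_size=50, n_gen=100, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]

    def fitness(mask):
        # Training MSE of a least-squares fit on the selected columns; a stand-in
        # for the MLR/PLS/ANN/SVM cost functions used in the reviewed articles.
        if mask.sum() == 0:
            return np.inf
        A = np.column_stack([np.ones(len(y)), X[:, mask.astype(bool)]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.mean((y - A @ coef) ** 2))

    pop = rng.integers(0, 2, size=(pop_size, n_feat))      # binary chromosomes: 1 = include descriptor
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)                          # lowest MSE (best) first
        ranks = np.empty(pop_size)
        ranks[order] = np.arange(pop_size, 0, -1)           # rank-based scaling: best gets the largest value
        probs = ranks / ranks.sum()
        children = [pop[order[0]].copy()]                   # elitism: the fittest individual survives intact
        while len(children) < pop_size:
            i, j = rng.choice(pop_size, size=2, p=probs)    # roulette-wheel selection of two parents
            cut = rng.integers(1, n_feat)                   # single-point crossover
            child = np.concatenate([pop[i][:cut], pop[j][cut:]])
            flips = rng.random(n_feat) < p_mut              # mutation: flip a few bits
            child[flips] = 1 - child[flips]
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)].astype(bool)              # best descriptor subset found

# selected = ga_feature_selection(X, y); X[:, selected] gives the reduced descriptor matrix
```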

Hemmateenejad et al. [33] reported the application of PC regression to model structure–carcinogenic activity relationships of drugs. PC correlation ranking and a GA were compared for selecting the best set of PCs for a large data set containing 735 carcinogenic activities and 1,355 descriptors. A crossvalidation procedure showed that introduction of PCs by conventional eigenvalue ranking was outperformed by correlation ranking and by the GA, both of similarly good quality with about 80% accuracy. Thyroid hormone receptor β1 (TRβ1) antagonists are of special interest because of their potential role in safe therapies for nonthyroid disorders while avoiding cardiac side effects. Optimum molecular descriptors selected by GA served as inputs for a projection pursuit regression (PPR) study yielding accurate models [36]. GA has also been reported to optimize routines of descriptor generation. Normal coordinate eigenvalue (EVA) structural descriptors, based on calculated fundamental molecular vibrational frequencies, are sensitive to 3D structure, and structural superposition is not required [28]. The original technique involves a standardization method wherein uniform Gaussians of fixed standard deviation (σ) are used to smear out frequencies projected onto a linear scale. GA was used to search for optimal localized σ values by optimizing crossvalidated PLS regression scores. Although GA-based EVA did not improve performance for a benchmark steroid data set, crossvalidation statistics were 0.25 units higher than with the simple EVA approach in the case of a more heterogeneous data set of five structural classes. A GA-optimized ANN, named GNW, that simultaneously optimizes feature selection and node weights was reported by Xue and Bajorath [37] for supervised feature ranking. Interconnected weights were binary encoded as a 16-bit string chromosome. A primary feature ranking index, defined as the sum of self-depleted weights and the corresponding weight adjustments, selected the relevant features for several artificial data sets with known feature rankings. GNW outperformed an SVM method on three artificial data sets and a matrix metalloproteinase-1 inhibitor data set [37]. A two-dimensional (2D) representation was chosen to classify about 500 molecules into seven biological activity classes using a method based on principal component analysis combined with GA [38]. Scoring functions, which accounted for the number of compounds in pure classes (i.e., compounds with the same biological activity), singletons, and mixed classes, identified effective descriptor sets. The results indicated that combinations of a few critical descriptors related to aromatic character, hydrogen bond acceptors, estimated polar van der Waals surface area, and a single structural key were preferred to classify compounds according to their biological activities. Kamphausen et al. [39] reported a simplified GA based on small training sets that runs for a small number of generations. Folding energies of RNA molecules and spin-glass models of biopolymers built from multiletter alphabets, such as peptides, were optimized.


Noteworthy, the de novo construction of peptidic thrombin inhibitors, computationally guided by this approach, required the experimental fitness determination of only 600 different compounds from a virtual library of more than 10^17 molecules [39]. Caco-2 cell monolayers are widely used systems for predicting human intestinal absorption, and quantitative structure–property relationship (QSPR) models of Caco-2 permeability have been widely reported. Yamashita et al. [34] used a GA-based partial least squares (GA-PLS) method to predict Caco-2 permeability data using topological descriptors. The final PLS model described more than 80% of the crossvalidation variance. In alternative applications, a GA routine based on the theory of stochastic resonance (SR) was reported in which the variables related to the bioactivity of a molecule series were considered as signal and the other, non-related features as noise [40]. The signal was amplified by SR in a nonlinear system with GA-optimized parameters. The algorithm was successfully evaluated with the well-known Selwood data set [35]. The relevant variables were enhanced, and their power spectra were significantly changed and became similar to that of the bioactivity after genetic SR (GSR). The descriptor matrix continuously became more informative, and collinearity was suppressed. Feature selection thus became easier and more efficient and, consequently, the QSAR models obtained for the data set performed better than previously reported approaches [40]. Teixido et al. [41] presented another nonconventional GA to search for peptides that can cross the blood–brain barrier (BBB). A genetic meta-algorithm optimized the GA parameters, and the approach was validated by virtual screening of a peptide library of more than 1,000 molecules. Chromosomes were populated with physicochemical properties of the peptides instead of amino acid sequences, and the fitness function was derived from statistical analysis of the experimental data available on peptide–BBB permeability. The authors stated that a GA tuned for a specific problem can steer the design and drug discovery process and set the stage for evolutionary combinatorial chemistry. The coupling of ANNs and GA in drug QSAR studies was introduced by So and Karplus [27], who proposed GA-based ANNs called genetic neural networks (GNNs). After calculating molecular descriptors using different commercially available software, predictive models were generated by coupling GA feature selection and neural network function approximation. The optimum neural networks outperformed PLS and GA-based MLR models. The authors extended GNN to 3D-QSAR modeling by exploring similarity matrix space [42,43]. An early review on this approach [44] reports its evaluation on several problems such as the Selwood data set, benzodiazepine affinity for benzodiazepine/GABAA receptors, progesterone receptor-binding steroids, and human intestinal absorption.

Patankar and Jurs have also reported several QSAR models built with hybrid GNN frameworks that outperformed other predictors for the inhibition of acyl-CoA:cholesterol O-acyltransferase [45], the sodium ion–proton antiporter [46], cyclooxygenase-2 [47], carbonic anhydrase [48], human type 1 5α-reductase [49], and the glycine/NMDA receptor [50]. Another variant of the same hybrid approach was recently reported by Di Fenza et al. [26] as the first attempt to combine GA and ANNs for modeling Caco-2 cell apparent permeability. The optimum model had an adequate crossvalidation accuracy of 57%, and the selected descriptors were related to physicochemical characteristics such as hydrophilicity, hydrogen bonding propensity, hydrophobicity, and molecular size, which are involved in the cellular membrane permeation phenomenon. Ab initio theory was used to calculate several quantum chemical descriptors, including electrostatic potentials and local charges at each atom, HOMO and LUMO energies, etc., which were used to model the solubility of thiazolidine-4-carboxylic acid derivatives by means of GA-PLS, yielding relative errors of prediction lower than 4%.

Bayesian-regularized genetic neural networks

In the context of hybrid GA-ANN modeling of biological interactions, we introduced BRGNNs as a robust nonlinear modeling technique that combines GA and Bayesian regularization for neural network input selection and supervised network training, respectively (Fig. 1). This approach attempts to solve the main weaknesses of neural network modeling: the selection of optimum input variables and the adjustment of network weights and biases to the optimum values that yield regularized neural network predictors [50–52]. By combining the concepts of BRANNs and GAs, BRGNNs were implemented in such a way that BRANN inputs are selected inside a GA framework. The BRGNN approach is a version of the So and Karplus method [27] incorporating Bayesian regularization and has been successfully introduced by our group in drug design QSAR. BRGNN was programmed within the Matlab environment [53] using the GA Toolbox [54] and the Neural Networks Toolbox [55].

Bayesian regularized artificial neural networks

Back-propagation ANNs are data-driven models in the sense that their adjustable parameters are selected in such a way as to minimize some network performance function F:

F = MSE = (1/N) Σ_{i=1}^{N} (y_i − t_i)²    (1)


In the above equation, MSE is the mean of the sum of squares of the network errors, N is the number of compounds, y_i is the predicted biological activity of compound i, and t_i is the experimental biological activity of compound i. Often, a predictor can memorize the training examples but fail to learn to generalize to new situations. The Bayesian framework for ANNs is based on a probabilistic interpretation of network training that improves the generalization capability of classical networks. In contrast to conventional network training, where an optimal set of weights is chosen by minimizing an error function, the Bayesian approach involves a probability distribution of network weights. In BRANNs, the Bayesian approach yields a posterior distribution of network parameters, conditional on the training data, and predictions are expressed in terms of expectations with respect to this posterior distribution [56,57]. Assuming a set of pairs D = {x_i, t_i}, where i = 1, ..., N is a label running over the pairs, the data set can be modeled as deviating from this mapping under some additive noise process ν_i:

t_i = y_i + ν_i    (2)

If ν is modeled as zero-mean Gaussian noise with standard deviation σ_ν, then the probability of the data given the parameters w is:

P(D|w, β, M) = (1/Z_D(β)) exp(−β MSE)    (3)

where M is the particular neural network model used, β = 1/(2σ_ν²), and the normalization constant is given by Z_D(β) = (π/β)^{N/2}. P(D|w, β, M) is called the likelihood. The maximum likelihood parameters w_ML (the w that minimizes MSE) depend sensitively on the details of the noise in the data [56,57]. To complete the interpolation model, a prior probability distribution must be defined which embodies our prior knowledge of the sort of mappings that are reasonable. Typically, this is quite a broad distribution, reflecting the fact that we only have a vague belief in a range of possible parameter values. Once we have observed the data, Bayes' theorem can be used to update our beliefs, and we obtain the posterior probability density. As a result, the posterior distribution is concentrated on a smaller range of values than the prior distribution. Since a neural network with large weights will usually give rise to a mapping with large curvature, we favor small values for the network weights. At this point, a prior is defined that expresses the sort of smoothness the interpolant is expected to have. The model has a prior of the form:

P(w|α, M) = (1/Z_W(α)) exp(−α MSW)    (4)

where α represents the inverse variance of the distribution and the normalization constant is given by Z_W(α) = (π/α)^{N/2}. MSW is the mean of the sum of the squares of the network weights and is commonly referred to as a regularizing function [56,57]. Considering the first level of inference, if α and β are known, the posterior probability of the parameters w is:

P(w|D, α, β, M) = P(D|w, β, M) P(w|α, M) / P(D|α, β, M)    (5)

where P(w|D, α, β, M) is the posterior probability, that is, the plausibility of a weight distribution given the information of the data set and the model used, P(w|α, M) is the prior density, which represents our knowledge of the weights before any data are collected, P(D|w, β, M) is the likelihood function, which is the probability of the data occurring given the weights, and P(D|α, β, M) is a normalization factor, which guarantees that the total probability is 1.

Fig. 1 Flowchart of the BRGNN framework in QSAR studies (flowchart boxes: molecular descriptors pool; GA model optimization; models with R > threshold value; crossvalidation; best model (best Q²); random splits; assembling test sets; ensemble averaging (optional); averaged predictions)


Considering that the noise in the training data is Gaussian and that the prior distribution for the weights is Gaussian, the posterior probability fulfills the relation:

P(w|D, α, β, M) = (1/Z_F) exp(−F)    (6)

where Z_F depends on the objective function parameters. Under this framework, minimization of F is therefore identical to finding the (locally) most probable parameters. In short, Bayesian regularization involves modifying the performance function F defined in Eq. 1, possibly improving generalization, by adding a term that regularizes the weights by penalizing overly large magnitudes:

F = β MSE + α MSW    (7)

The relative size of the objective function parameters α and β dictates the emphasis on obtaining a smoother network response. MacKay's Bayesian framework automatically adapts the regularization parameters to maximize the evidence of the training data [56,57]. BRANNs were first and broadly applied to model biological activities by Burden and Winkler [51,52].

Genetic algorithm implementation in BRANN feature selection

A string of integers encodes the numbering of the rows in the all-descriptors matrix that will be tested as BRANN inputs (Fig. 2). Each individual encodes the same number of descriptors; the descriptors are randomly chosen from a common data matrix in such a way that (1) no two individuals can have exactly the same set of descriptors and (2) all descriptors in a given individual must be different. The fitness of each individual in a generation is determined by the training mean square error (MSE) of the model, together with a top scaling function that scales a top fraction of the individuals in a population equally; these individuals have the same probability of being reproduced, while the rest are assigned the value 0. As depicted in Fig. 2, children are created sexually by single-point crossover from the father chromosomes and asexually by mutating one gene in the chromosome of a single father. Similar to So and Karplus [27], we also included elitism, so that the genetic content of the best-fitted individual moves on to the next generation intact. The reproductive cycle is continued until 90% of the generations show the same target fitness score (Fig. 3).
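A compressed sketch of this scheme is given below (Python, with hypothetical helper names). A genuine BRANN re-estimates α and β by maximizing the evidence within MacKay's framework; here fixed α and β in the cost F = β·MSE + α·MSW stand in for that machinery, so the example only illustrates the integer-chromosome encoding and the regularized training objective, not the full Bayesian treatment:

```python
import numpy as np
from scipy.optimize import minimize

def fit_regularized_net(X, y, n_hidden=3, alpha=0.01, beta=1.0, seed=0):
    """Train a one-hidden-layer network by minimizing F = beta*MSE + alpha*MSW (Eq. 7).
    Fixed alpha/beta are a crude stand-in for MacKay's evidence-based re-estimation."""
    n_in = X.shape[1]
    n_w = n_hidden * (n_in + 2) + 1                       # all weights and biases, flattened

    def unpack(w):
        W1 = w[:n_hidden * n_in].reshape(n_hidden, n_in)
        b1 = w[n_hidden * n_in:n_hidden * (n_in + 1)]
        W2 = w[n_hidden * (n_in + 1):-1]
        return W1, b1, W2, w[-1]

    def predict(w, Xq):
        W1, b1, W2, b2 = unpack(w)
        return np.tanh(Xq @ W1.T + b1) @ W2 + b2

    def cost(w):
        mse = np.mean((y - predict(w, X)) ** 2)           # data misfit term
        msw = np.mean(w ** 2)                             # weight-decay (regularizing) term
        return beta * mse + alpha * msw

    w0 = np.random.default_rng(seed).normal(scale=0.1, size=n_w)
    w_opt = minimize(cost, w0, method="BFGS").x
    return w_opt, (lambda Xq: predict(w_opt, Xq))

def random_chromosome(n_descriptors, size, rng):
    """Fixed-length integer chromosome of distinct descriptor column indices."""
    return np.sort(rng.choice(n_descriptors, size=size, replace=False))

def chromosome_fitness(chrom, X, y):
    """GA cost: training MSE of the regularized network on the selected descriptors."""
    _, model = fit_regularized_net(X[:, chrom], y)
    return float(np.mean((y - model(X[:, chrom])) ** 2))
```

The GA loop itself follows the sketch given in the previous section, with the binary mask replaced by these integer chromosomes and the least-squares fitness replaced by chromosome_fitness.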

Contrary to other GA-based approaches, the objective of the algorithm is not to obtain a sole optimum model but a reduced population of well-fitted models, with MSE lower than a threshold value, which the Bayesian regularization guarantees to possess good generalization capabilities (Fig. 3). This is because the MSE of the training-set fit, rather than crossvalidation or test-set MSE values, is used as the cost function; therefore, the optimum model cannot be derived directly from the best-fitted model yielded by the genetic search. However, from crossvalidation experiments over the subpopulation of well-fitted models, the best generalizable network with the highest predictive power can be derived. This process also avoids chance correlations. This approach has been shown to be highly efficient in comparison with crossvalidation-based GA approaches, since only the optimum models, according to the Bayesian regularization, are crossvalidated at the end of the routine, and not all the models generated throughout the search process.

Genetic algorithm-optimized support vector machines (GA-SVM)

The support vector machine (SVM) is a machine learning method which has been used for many kinds of pattern recognition problems [58]. In contrast to the BRANN framework, which is not in such widespread use, the SVM has become a very popular pattern recognition technique. Since there are excellent introductions to SVMs [58,59], only the main idea of SVMs applied to pattern classification problems is stated here. First, the input vectors are mapped into a feature space (possibly of higher dimension). Second, a hyperplane which can separate two classes is constructed within this feature space. Only relatively low-dimensional vectors in the input space and matrix products in the feature space are involved in the mapping function. The SVM was designed to minimize structural risk, whereas previous techniques were usually based on minimization of empirical risk. The SVM is usually less vulnerable to the overfitting problem, and it can deal with a large number of features. The mapping into the feature space is performed by a kernel function. There are several parameters in the SVM, including the kernel function and the regularization parameter. The kernel function and its specific parameters, together with the regularization parameter, cannot be set from the optimization problem but have to be tuned by the user. These can be optimized by the use of Vapnik–Chervonenkis bounds, crossvalidation, an independent optimization set, or Bayesian learning. In the articles from our group, the radial basis function (RBF) was used as the kernel function. For nonlinear SVM models, we also used GA-based optimization of the kernel regularization parameter C and the width σ² of the RBF kernel, as suggested by Fröhlich et al. [60].


Fig. 2 Flow diagram of the strategy for the genetic algorithm implemented in the BRGNNs


We simply concatenated a representation of the parameter to our existing chromosome. This means that we try to select an optimal feature subset and an optimal C at the same time, which is reasonable because the choice of the parameter is influenced by the feature subset taken into account, and vice versa. Usually, it is not necessary to consider arbitrary values of C, but only certain discrete values of the form n × 10^k, where n = 1, ..., 9 and k = −3, ..., 4. Therefore, these values can be generated by randomly drawing n and k as integers in (1, ..., 9) and (−3, ..., 4), respectively.

In a similar way, we used GA to optimize the width of the RBF kernel, but in this case the n and k values were integers in (1, ..., 9) and (−2, ..., 1). Thus, our chromosome was concatenated with a gene taking discrete values in the interval (0.001–90,000) to encode the C parameter, and similarly the width of the RBF kernel was encoded in a gene containing discrete values in the interval (0.01–90). In other articles, feature and hyperparameter genes were concatenated in the chromosomes and encoded as bit strings; however, evolution was driven using similar crossover, mutation, and selection operators according to fitness functions accounting for crossvalidation accuracies [61–63].


Data subsets are generated in the crossvalidation process for training the SVM, and another subset is then predicted; this process is repeated until all subsets have been predicted. A venetian-blind method was used for creating the data subsets: first, the data set is sorted according to the dependent variable, and then the cases are added consecutively to each subset, in such a way that the subsets become representative samples of the whole data set. The GA routine minimized the regression MSE and the misclassification percentage of the crossvalidation experiment. The GA-SVM implemented in our articles is a version of the GA by Caballero and Fernandez [10] incorporating SVM hyperparameter optimization; it was programmed within the Matlab environment [53] using the libSVM library for Matlab by Chang and Lin [64]. A few other authors [61–63] represented the features of the chromosomes as bit strings, but the SVM parameters were optimized by the Conjugated Gradient (CG) method during model fitness evaluation. The crossover and mutation rates were set to adequate values according to preliminary experiments, and evolution was stopped when the number of generations reached a preset maximum value, or when the fitness value remained constant or nearly constant for a maximum number of generations [61–63].
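A minimal sketch of this hyperparameter encoding is shown below (Python, using scikit-learn's SVR as a stand-in for the libSVM-for-Matlab interface used in the original articles; the mapping of the encoded kernel width onto scikit-learn's gamma parameter is an assumption):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def random_chromosome(n_descriptors, n_feat, rng):
    feats = rng.choice(n_descriptors, size=n_feat, replace=False)
    n_c, k_c = rng.integers(1, 10), rng.integers(-3, 5)    # C = n*10**k in [0.001, 90,000]
    n_s, k_s = rng.integers(1, 10), rng.integers(-2, 2)    # kernel width = n*10**k in [0.01, 90]
    return np.concatenate([feats, [n_c, k_c, n_s, k_s]])

def decode(chrom):
    feats = chrom[:-4].astype(int)
    n_c, k_c, n_s, k_s = chrom[-4:]
    return feats, n_c * 10.0 ** k_c, n_s * 10.0 ** k_s

def chromosome_cost(chrom, X, y, folds=5):
    """Mean crossvalidated MSE of an RBF-kernel SVM built on the encoded feature
    subset and hyperparameters; the GA minimizes this value."""
    feats, C, width = decode(chrom)
    model = SVR(kernel="rbf", C=C, gamma=1.0 / width)      # gamma taken here as 1/width (assumption)
    scores = cross_val_score(model, X[:, feats], y, cv=folds,
                             scoring="neg_mean_squared_error")
    return -scores.mean()
```

Crossover and mutation then operate on the concatenated chromosome exactly as for plain feature selection, so the feature subset and the hyperparameters evolve together.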

Fig. 3 Reproduction procedure in the BRGNN implementation

Models validation

Traditionally, a meaningful assessment of the statistical fit of a QSAR model consists of predicting some removed proportion of the data set. The whole data set is randomly split into a number of disjoint crossvalidation subsets. Each of these subsets is left out in turn, and the remaining complement of the data is used to build a partial model. The samples in the left-out data are then used to perform predictions. At the end of this process, there are predictions for all data in the training set, made up from the predictions originating from the resulting partial models. All partial models are then assessed against the same performance criteria, and decisions are made on the basis of the consistency of the assessment results. The most often used crossvalidation method is leave-one-out crossvalidation, in which every crossvalidation subset consists of only one data point. In addition to the assessment of statistical fit by crossvalidation, randomization of the modeled property (also known as Y-randomization) has also been used to evaluate model robustness [21,24,27,65,66]. Undesirable chance correlations can arise as a result of exhaustive GA searches. So and Karplus [27] proposed the evaluation of crossvalidation performance on several scrambled data sets: the positions of the dependent variable (the modeled property) along the data set are randomized several times, and Q² is calculated each time. The absence of chance correlation is proved when no Q² > 0.5 appears during the test [27]. The accuracy of crossvalidation results is widely judged in the literature by the Q² value, and a high value of this statistic (Q² > 0.5) is considered proof of the high predictive ability of the model. However, a high value of Q² appears to be a necessary but not sufficient condition for a model to have high predictive power, and the predictive ability of a QSAR model can only be estimated using a sufficiently large collection of compounds that was not used for building the model [65,66]. In this sense, the data set can be divided into training and validation (or test) partitions. For a given partitioning, a model is constructed only from the samples of the training set. At this point, an important step is the generation of these partitions. Quite a few methods have been used, such as random selection, activity-ranked binning, and sphere-exclusion algorithms [65,66]. Various forms of neural networks have also been employed in the selection of training sets, including Kohonen neural networks [19]. Undoubtedly, external validation is a way to establish the reliability of a QSAR model. However, the majority of studies that are validated by external predictions are based on a single validation set; this may cause the predictors to perform well on a particular external set, but there is no guarantee that the same results will be achieved on another. For example, it can happen that several outliers, by pure coincidence, are left out of the test set, in which case the validation error will be small even though the training error was high.


The ensemble solution has been proposed for generating multiple validation sets [67]. An ensemble is a collection of predictors that, as a whole, provides a prediction which is a combination of the individual ones. If there is disagreement among those predictors, then very reliable models can be obtained, since a further decrease in generalization error can be achieved. Another trait to take into account for the ensemble application is the average error of the ensemble members; by decreasing the error of each individual member, the ensemble attains a smaller generalization error [67]. In BRGNN-related studies, the predictive power was measured by the R² and root MSE values of the averaged test sets of BRGNN ensembles having an optimum number of members [15,18,19,21,24,68,69]. To generate the predictors to be averaged, the whole data set was partitioned into several training and test sets. The assembled predictors aggregate their outputs to produce a single prediction. In this way, instead of predicting a sole randomly selected external set, the result of averaging several external sets was

predicted. Each case was predicted several times as it appeared in different training and test sets, and the average of these values was reported.
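The two checks described in this subsection, leave-one-out Q² and Y-randomization, can be sketched as follows (Python, assuming any estimator with scikit-learn's fit/predict interface):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import LeaveOneOut

def q2_loo(model, X, y):
    """Leave-one-out crossvalidated Q2 = 1 - PRESS / total sum of squares."""
    preds = np.empty(len(y), dtype=float)
    for train, test in LeaveOneOut().split(X):
        preds[test] = clone(model).fit(X[train], y[train]).predict(X[test])
    press = np.sum((y - preds) ** 2)
    return 1.0 - press / np.sum((y - np.mean(y)) ** 2)

def y_randomization(model, X, y, n_rounds=20, seed=0):
    """Q2 values obtained after scrambling the activities; for a sound model
    none of them should exceed ~0.5."""
    rng = np.random.default_rng(seed)
    return [q2_loo(model, X, rng.permutation(y)) for _ in range(n_rounds)]
```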

Data sets: sources and general prior preparation

Biological activity measurements were taken as affinity constants (Ki) or ligand concentrations for 50% (IC50) or 90% (IC90) inhibition of the targets (Table 1). For modeling, IC50 and IC90 values were converted to logarithmic activities; pIC50 and pIC90 are measurements of drug effectiveness, i.e., the functional strength of the ligand toward the target. For classification problems, the data were labeled according to some convenient threshold. In our articles, prior to molecular descriptor calculations, the 3D structures of the studied compounds (Fig. 4) were geometrically optimized using the semi-empirical quantum-chemical methods implemented in the MOPAC 6.0 computer software by the Frank J. Seiler Research Laboratory [70].

Table 1 Data set details and statistics of the optimum models reported by BRGNN modeling

Dataset category | Target or biological activity/function | Descriptor type | Data size | Optimum variables | Validation accuracy (%) | Ref.
Cancer | Farnesyl protein transferase | 3D | 78 | 7 | 70 | [25]
Cancer | Matrix metalloproteinase | 2D | 30^a | 6 | 70^a | [23]
Cancer | Matrix metalloproteinase | 2D | 63–68^b | 7 | 80^b | [72]
Cancer | Cyclin-dependent kinase | 2D | 98 | 6 | 65 | [19]
Cancer | LHRH (non-peptide) | 2D | 128 | 8 | 75 | [20]
Cancer | LHRH (erythromycin A analogs) | Quantum chemical | 38 | 4 | 70 | [71]
HIV | HIV-1 protease | 2D | 55 | 4 | 70 | [22]
Cardiac dysfunction | Potassium channel | 2D | 29 | 3 | 91 | [16]
Cardiac dysfunction | Calcium channel | 2D | 60 | 5 | 65 | [17]
Alzheimer's disease | Acetylcholinesterase inhibition (tacrine analogs) | 3D | 136 | 7 | 74 | [21]
Alzheimer's disease | Acetylcholinesterase inhibition (huprine analogs) | 3D | 41 | – | 84 | [24]
Antifungal | Candida albicans | 3D | 96 | 16 | 87 | [10]
Antiprotozoan | Cruzain | 2D | 46 | 5 | 75 | [18]
Protein conformational stability | Human lysozyme | 2D | 123 | 10 | 68 | [68]
Protein conformational stability | Gene V protein | 2D | 123 | 10 | 66 | [69]
Protein conformational stability | Chymotrypsin inhibitor 2 | 3D | 95 | 10 | 72 | [15]

a Average values of five models for the MMP-1, MMP-2, MMP-3, MMP-9 and MMP-13 matrix metalloproteinases
b Average values of five models for the MMP-1, MMP-9 and MMP-13 matrix metalloproteinases


Fig. 4 Sketches of reviewed chemical scaffolds (structures 1–27; structure 21 is diltiazem)


The articles in Table 1 included QSAR modeling of cancer therapy targets [19,20,23,25,71–73], an HIV target [22], Alzheimer's disease targets [21,24], ion channel blockers [16,17], antifungals [10], an antiprotozoan target [18], ion channel proteins [29], the ghrelin receptor [30], and protein conformational stability [15,68,69]. The Dragon computer software [74] was used for generating the majority of the feature vectors for the low-molecular-weight compounds. Four types of molecular descriptors (according to the Dragon software classification) were used: zero-dimensional (0D), one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D). When 2D topological representations of molecules were used, the spatial lag was varied from 1 to 8. Four atomic properties (atomic masses, atomic van der Waals volumes, atomic Sanderson electronegativities, and atomic polarizabilities) were used to weight both the 2D and the 3D molecular graphs. In some biological systems, it was suitable to use quantum-chemical descriptors, which were calculated from the output files of the semi-empirical geometry optimizations. In the studies of pharmacokinetic and pharmacodynamic properties, including absorption, distribution, metabolism, excretion, and toxicity (ADMET), using GA-optimized SVMs, several properties were modeled, such as: identification of P-glycoprotein substrates and nonsubstrates (P-gp) [61], prediction of human intestinal absorption (HIA) [61], prediction of compounds inducing torsades de pointes (Tdp) [61], prediction of BBB penetration [61], human plasma protein binding rate (PPBR) [62], oral bioavailability (BIO) [62], and induced mitochondrial toxicity (MT) [63]. All the structures of the compounds were generated and then optimized using the Cerius2 program package (Cerius2, version 4.10) [75]. The authors manually inspected the 3D structure of each compound to ensure that each molecule was properly represented, and molecular descriptors were computed using the online application PCLIENT [76]. Feature spaces for the peptides and proteins in [68] and [69] were computed using the in-house software PROTMETRICS [77]. Different sets of protein feature vectors were computed on the sequences [68,69] and crystal structures [15], weighted by 48 amino acid/residue properties from the AAindex database [78]. In general, descriptors that were constant or almost constant were eliminated, and pairs of variables with a squared correlation coefficient greater than 0.9 were classified as intercorrelated, with only one of each pair included for building the model. Finally, high-dimension data matrices were obtained. Feature subspaces in such matrices were explored in search of lower-dimensional combinations of vectors that yield the optimum nonlinear model through the BRGNN or GA-SVM techniques. Afterward, in some applications, the optimum feature vectors were used for unsupervised training of competitive neurons to build self-organized maps (SOMs) [79] for the qualitative analysis of the optimum chemical subspace distributions at different activity levels.
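The descriptor pruning described above (removal of constant columns and of one member of each pair with squared correlation above 0.9) can be sketched as follows (Python/NumPy; the variance tolerance is an assumption, the 0.9 cutoff comes from the text):

```python
import numpy as np

def prune_descriptors(X, var_tol=1e-8, r2_cut=0.9):
    """Drop (nearly) constant descriptors, then keep one descriptor per intercorrelated pair."""
    keep = np.where(np.var(X, axis=0) > var_tol)[0]        # remove constant or almost constant columns
    r2 = np.corrcoef(X[:, keep], rowvar=False) ** 2        # squared pairwise correlation coefficients
    selected = []
    for j in range(len(keep)):
        if all(r2[j, i] <= r2_cut for i in selected):      # keep only if not redundant with a kept column
            selected.append(j)
    return keep[np.array(selected, dtype=int)]             # indices of the retained descriptors
```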

Application of BRGNN and GA-SVM to ligand–target data sets

ADMET modeling

GA-optimized SVMs have been applied at the early stage of drug discovery to predict pharmacokinetic and pharmacodynamic properties, including ADMET [61–63]. An interesting SVM method that combined GA for feature selection with the CG method for parameter optimization (GA-CG-SVM) was reported to predict PPBR and BIO [62]. A general implementation of this framework is as follows: for each individual, the feature chromosome was represented as a bit string, but the SVM parameters were optimized by the CG method during model fitness evaluation. The crossover and mutation rates were set to 0.8 and 0.05, respectively. Evolution was stopped when the number of generations reached 500 or when the fitness value remained constant or nearly constant for the last 50 generations. This approach yielded an optimum 29-variable model for the PPBR of 692 compounds, with prediction accuracies of 86 and 81% for five-fold crossvalidation and the independent test set (161 compounds), respectively. At the same time, an optimum 25-variable model for the BIO data set, which included 690 compounds in the training set and 76 compounds in an independent validation set, had prediction accuracies of 80 and 86% for five-fold crossvalidation of the training set and the independent test set, respectively [62]. The descriptors selected by the GA-CG method covered a large range of molecular properties, which implies that the PPBR and BIO of a drug might be affected by many complicated factors. The authors claimed that the PPBR and BIO predictors outperformed previous models in the literature [62]. Drug-induced MT has been one of the key reasons for drugs failing to enter, or being withdrawn from, the market [80]; MT has therefore become an important test in ADMET studies. The hybrid GA-CG-SVM approach was also applied to predict MT using a collected data set of 288 compounds, including 171 MT+ and 117 MT− [63]. The data set was randomly divided into a training set (253 compounds) and a test set (35 compounds). A bit-string representation of the feature chromosome was used. Populations were evolved according to crossover and mutation rates of 0.5 and 0.1, respectively. The algorithm was stopped when the generation number reached 200 or the fitness value did not improve during the last 10 generations [63]. Accuracies for five-fold crossvalidation and the test set were about 85 and 77%, respectively. A total of 27 optimum molecular descriptors were selected, which were roughly grouped into five categories: molecular weight-related descriptors, van der Waals volume-related descriptors, electronegativities, molecular structural information and shape, and other physicochemical property-related descriptors.


Table 2 Data set details and statistics of the optimum models reported by GA-SVM modeling

Dataset category | Target or biological activity/property | Descriptor type | Data size | Optimum variables | Validation accuracy (%) | Ref.
ADMET | Human plasma protein binding rate (PPBR) | 0D, 1D, 2D and 3D^a | 853 | 29 | 81 | [62]
ADMET | Oral bioavailability (BIO) | 0D, 1D, 2D and 3D | 766 | 25 | 86 | [62]
ADMET | Mitochondrial toxicity (MT) | 0D, 1D, 2D and 3D | 288 | 27 | 77 | [63]
ADMET | P-glycoprotein substrates and nonsubstrates (P-gp) | 0D, 1D, 2D and 3D | 201 | 8 | 85 | [61]
ADMET | Human intestinal absorption (HIA) | 0D, 1D, 2D and 3D | 196 | 25 | 87 | [61]
ADMET | Induction of torsades de pointes (Tdp) | 0D, 1D, 2D and 3D | 361 | 17 | 86 | [61]
ADMET | Blood–brain barrier (BBB) penetration | 0D, 1D, 2D and 3D | 3,941 / 593 | 169 / 24 | 91 / 97 | [61]
Cancer | Apoptosis | 0D, 1D, 2D and 3D | 43 | 7 | – | –
Aqueous solubility | Log S | Structural, atom type, electrotopological | 1,342 | 9 | – | –
– | Log P | Structural, atom type, electrotopological | 10,782 | 14 | – | –
Protein function/class | Folding class | Sequence features and order | – | – | – | –
Protein function/class | Subcellular location | Physicochemical composition | – | – | – | –
Protein function/class | Protein–protein complexes | Physicochemical atomic properties | 172,345 | – | – | –
– | Voltage-gated K+ channel^b | 2D | 100 | 3 | 85 | [29]
– | Ghrelin receptor | 2D | 23 | 2 | 93 | [30]

a Descriptor classification according to Dragon software [74]
b Average over three physiological variable models
This descriptor diversity pointed out the high complexity of the MT mechanism [63]. The same methodology was successfully applied to other ADMET-related properties [61]. Identification of P-gp substrates and nonsubstrates yielded an eight-input model explaining 85% of the crossvalidation variance. Prediction of HIA yielded a 25-input model explaining 87% of the crossvalidation variance. Prediction of compounds inducing Tdp yielded a 17-input model explaining 86% of the crossvalidation variance. Prediction of BBB penetration yielded two models, with 169 and 24 inputs, explaining more than 91 and 94% of the crossvalidation variance, respectively [61] (Table 2). The authors cited above claimed that the optimum models significantly improve overall prediction accuracy and have fewer input features in comparison with previously reported models [61].

Anticancer targets

Cancer is characterized by uncontrolled proliferative growth and the spread of aberrant cells from their site of origin. Most anticancer agents exert their therapeutic action by damaging DNA, blocking DNA synthesis, altering tubulin polymerization–depolymerization, or disrupting the hormonal stimulation of cell growth [81]. Recent findings on the underlying genetic changes related to the cancerous state have aroused interest in novel mechanistic targets. Computer-aided development of cancer therapeutics has taken on new dimensions, since modern biological techniques have opened the way to a mechanistic and structural understanding of key cellular processes at the protein level.


In the context of cancer therapy targets, BRGNNs have been employed to predict the inhibition of farnesyl protein transferase [25], matrix metalloproteinases (MMPs) [23,70], cyclin-dependent kinase [19], and antagonist activity at the luteinizing hormone-releasing hormone (LHRH) receptor [20,69]. Results from BRGNN modeling of four cancer-target data sets appear in Table 1. The numbers of selected features varied according to the size and variability of each data set, and the selected features correspond to the molecular descriptors which best described the affinity of the ligands toward the targets. The models were validated by crossvalidation and/or test set prediction, and validation accuracies were higher than 65% for all data sets. Two-dimensional molecular descriptors were used for BRGNN modeling of the activity toward cancer targets of several chemotypes in Fig. 4, such as 1H-pyrazolo[3,4-d]pyrimidine derivatives (1 and 2) as cyclin-dependent kinase inhibitors; heterocyclic compounds as LHRH agonists; and thieno[2,3-b]pyridine-4-ones (3), thieno[2,3-d]pyrimidine-2,4-diones (4), imidazo[1,2-a]pyrimidin-5-ones (5), benzimidazole derivatives (6 and 7), N-hydroxy-2-[(phenylsulfonyl)amino]acetamide derivatives (8 and 9), and N-hydroxy-α-phenylsulfonylacetamide derivatives (10 and 11) as inhibitors of the MMP family. On the other hand, thiol (12) and non-thiol (13) inhibitors of farnesyl protein transferase in Fig. 4 were modeled by 3D descriptors, which encoded the distributions of atomic properties in three-dimensional molecular space [25]. Knowledge of the binding mode was available for this target; thus, the ligand molecules were conveniently aligned to the crystal structure of an inhibitor in the binding site. 3D encoding of molecules is more realistic than the 2D approximation, but conformational variability can introduce undesirable noise into the data. Consequently, 2D descriptors tend to achieve better performance when the system lacks binding mode information and/or when the target is promiscuous and the ligands bind in different conformations. It is worth noting that BRGNNs trained with quantum chemical descriptors of 11,12-cyclic carbamate derivatives of 6-O-methylerythromycin A (14) in Fig. 4 predicted LHRH antagonist activity with 70% accuracy [69]. Quantum chemical descriptors only encode information on the electronic states of the molecules rather than the distribution of chemical groups on the structure. The structural homogeneity of the macrolides in this data set suggests a well-defined electronic pattern that was successfully recognized by the networks after supervised training. Unwanted, defective, or damaged cells are rapidly and selectively eliminated from the body by the innate mechanism called apoptosis, or programmed cell death. Resistant tumor cells evade the action of anticancer agents by increasing their apoptotic threshold [82,83]. This has triggered interest in novel chemical compounds capable of inducing apoptosis in chemo/immunoresistant tumor cells.

Therefore, apoptosis has received huge attention in recent years [82,83]. The induction of apoptosis by a total of 43 4-aryl-4H-chromenes (15) in Fig. 4 was predicted by chemometric methods using molecular descriptors calculated from the molecular structure [71]. GA and stepwise multiple linear regression were applied to feature selection for SVM, ANN, and MLR training; the GA was implemented inside the linear framework, and the selected descriptors were then used for SVM and ANN training. The optimum 7-variable SVM predictor outperformed the ANN and MLR models as well as previously reported models, showing correlation coefficients of 0.950 and 0.924 for the training and test sets, respectively, with a crossvalidation accuracy of about 70% [71].

Acetylcholinesterase inhibition

Alzheimer's disease (AD) is a neurodegenerative disorder characterized by a progressive impairment of cognitive function, which seems to be associated with deposition of amyloid protein and neuronal loss, as well as with altered neurotransmission in the brain. Neurodegeneration in AD patients is mainly attributed to the loss of the basal forebrain cholinergic system, which is thought to play a central role in producing the cognitive impairments [84]. Therefore, enhancement of cholinergic transmission has been regarded as one of the most promising approaches for treating AD patients. BRGNN models of acetylcholinesterase inhibition by huprine- and tacrine-like inhibitors have been reported. For analogs of tacrine (16) [21] and huprine (17) [24] in Fig. 4, the GA explored a wide pool of 3D descriptors. The predictive capacity of the selected model was evaluated by averaging multiple validation sets generated as members of neural network ensembles (NNEs). The tacrine model showed an adequate test accuracy of about 71% [21] (Table 1). Likewise, the huprine analog data set was also evaluated by NNE averaging, showing an optimum accuracy of 85% when 40 networks were assembled [24]. The higher accuracy obtained for the huprine analogs in comparison with the tacrine analogs is probably related to the higher structural variability of the tacrine data set, which contributed to the roughly 30% prediction uncertainty in the affinity of the tacrine analogs. In this connection, tacrine-like inhibitors have been found experimentally to bind acetylcholinesterase in different binding modes at the active site and also at peripheral sites [85,86].

HIV-1 protease inhibition

The retrovirus life cycle provides a number of targets for potential chemotherapeutic intervention against HIV-1. The protease-mediated transformation from the immature, non-infectious virion to the mature, infective virus is a crucial stage in the HIV-1 life cycle.


HIV-1 protease inhibition

The retroviral life cycle provides a number of targets for potential chemotherapeutic intervention against HIV-1. The protease-mediated transformation of the immature, non-infectious virion into the mature, infective virus is a crucial stage in the HIV-1 life cycle. HIV-1 protease has thus become a major target for anti-AIDS drug design, and its inhibition has been shown to extend the length and improve the quality of life of AIDS patients [87]. A large number of inhibitors have been designed, synthesized, and assayed, and several HIV-1 protease inhibitors are now used in the treatment of AIDS [87–90]. Cyclic urea derivatives (18) in Fig. 4 are among the most successful candidates for AIDS targeting, and BRGNN was successfully applied to model the activities of a set of such compounds toward HIV-1 protease [22]. 2D encoding was used to avoid conformational noise in the feature chemical space, and the optimum BRGNN model predicted IC50 values for 55 cyclic urea derivatives with 70% accuracy in validation tests (Table 1). Although the feature space was purely 2D, the problem was accurately solved by the nonlinear approach: activity variations caused by differential chemical substitution on the cyclic urea scaffold were learned by the networks, and the activities of new compounds were adequately predicted.

Potassium-channel and calcium entry blocker activities

K+ channels constitute a remarkably diverse family of membrane-spanning proteins with a wide range of functions in electrically excitable and non-excitable cells. One important class opens in response to an increase in the calcium concentration within the cytosol. Pharmacological and electrophysiological evidence and, more recently, structural evidence from cloning studies have established that several kinds of Ca2+-activated K+ channels exist [91,92]. Several compounds have been shown to block the IKCa-mediated, Ca2+-activated K+ permeability in red blood cells [93]. A model of the selective inhibition of the intermediate-conductance Ca2+-activated K+ channel by clotrimazole analogs (19, 20) in Fig. 4 was developed with BRGNNs [16]. Substitutions around the triarylmethane scaffold yielded differential inhibition of the K+ channel by the triarylmethane analogs, which was encoded in 2D descriptors. The BRGNN approach yielded a remarkably accurate model describing more than 90% of the data variance in validation experiments. Interactions with the ion channel were encoded in topological charge variables, and the homogeneity of the data set supported the very high prediction accuracy. A SOM map of the blockers showed that the optimum features behaved very well for the unsupervised differentiation of the inhibitors into activity levels [16].

Similarly, a BRGNN model of calcium entry blockers with myocardial activity (negative inotropic activity) was reported [17]. Given the lack of information about the active conformations and mechanism of action of the diltiazem analogs (21–23) in Fig. 4 as cardiac malfunction drugs, structural information was encoded in 2D topological autocorrelation vectors. Remarkably, the optimum BRGNN model exhibited an adequate accuracy of about 65% [17]. The complexity of the cellular cardiac response, a multifactorial event in which several interactions such as membrane crossing and receptor binding take place, accounts for this modest but adequate performance.
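The 2D autocorrelation encoding used for the calcium entry blockers summarizes how an atomic property is distributed over the molecular graph: for each topological distance (lag), the products of the property values of all atom pairs separated by that number of bonds are accumulated. The sketch below shows this calculation for a toy molecule; the atomic property vector, bond list, and maximum lag are illustrative inputs, and the exact weighting used in the published descriptors may differ.

```python
import numpy as np
from itertools import combinations

def topological_distances(n_atoms, bonds):
    """All-pairs shortest-path (bond-count) distances via Floyd-Warshall."""
    d = np.full((n_atoms, n_atoms), np.inf)
    np.fill_diagonal(d, 0)
    for i, j in bonds:
        d[i, j] = d[j, i] = 1
    for k in range(n_atoms):
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d

def autocorrelation_vector(prop, bonds, max_lag=8):
    """2D autocorrelation: sum of p_i * p_j over atom pairs separated by
    exactly `lag` bonds, for lag = 1 .. max_lag."""
    prop = np.asarray(prop, dtype=float)
    d = topological_distances(len(prop), bonds)
    return np.array([
        sum(prop[i] * prop[j]
            for i, j in combinations(range(len(prop)), 2) if d[i, j] == lag)
        for lag in range(1, max_lag + 1)
    ])

# Toy example: a 4-atom chain with an arbitrary atomic property (e.g. partial charge).
print(autocorrelation_vector([0.1, -0.2, 0.3, -0.1], bonds=[(0, 1), (1, 2), (2, 3)]))
```

Stacking such vectors, computed for several atomic properties, gives the fixed-length 2D descriptor matrix that the GA then searches for the most informative lags.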

Antifungal activity

None of the existing systemic antifungals satisfies the medical need completely; there are weaknesses in spectrum, potency, safety, and pharmacokinetic properties [10]. Few substances have been discovered that exert an inhibitory effect on the fungi pathogenic for humans, and most of these are relatively toxic. The BRGNN methodology was applied to a data set of antifungal heterocyclic ring derivatives in Fig. 4 (2,5,6-trisubstituted benzoxazoles, 2,5-disubstituted benzimidazoles, 2-substituted benzothiazoles, and 2-substituted oxazolo(4,5-b)pyridines; 24 and 25) [10]. A comparative analysis using MLR and BRGNNs was carried out to correlate the inhibitory activity against Candida albicans (log(1/C)) with 3D descriptors encoding the chemical structures of the heterocyclic compounds [10]. Beyond improving the training set fit, BRGNN outperformed multiple linear regression, describing 87% of the test set variance. The nonlinear antifungal models showed that the distributions of van der Waals atomic volumes and atomic masses have a large influence on the antifungal activities of the compounds studied. The BRGNN model also included the influence of atomic polarizability, which could be associated with the capacity of the antifungal compounds to be deformed when interacting with biological macromolecules [10].

Antiprotozoan activity

Trypanosoma cruzi, a parasitic protozoan, is the causative agent of Chagas disease or American trypanosomiasis, one of the most threatening endemic diseases in Central and South America. The primary cysteine protease of Trypanosoma cruzi, cruzain, is expressed throughout the life cycle and is essential for the survival of the parasite within host cells [94]. Inhibiting cruzain has therefore become attractive for the development of potential therapeutics for the treatment of Chagas disease. The Ki values of a set of 46 ketone-based cruzain inhibitors (26 and 27) in Fig. 4 were successfully modeled by means of data-diverse ensembles of BRGNNs using 2D molecular descriptors, with an accuracy of about 75% [18]. The BRGNNs outperformed a GA-optimized PLS model, suggesting that the functional dependence between affinity and the topological structure of the inhibitors has a strong nonlinear component. The unsupervised training of SOM maps with the optimum feature vectors depicted high and low inhibitory activity levels that matched the activity profile of the data set well.
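The SOM check in the cruzain study asks whether the GA-selected descriptors alone, without activity labels, already separate potent from weak inhibitors on a two-dimensional map. The code below is a minimal from-scratch self-organizing map in that spirit; the grid size, learning-rate and neighborhood decay schedules, and iteration count are arbitrary illustrative choices, not the settings of the published work.

```python
import numpy as np

def train_som(X, grid=(6, 6), n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal online SOM: returns a weight grid of shape (rows, cols, n_features).
    All training parameters are illustrative, not the published settings."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows, cols, X.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5   # simple linear decay
        x = X[rng.integers(len(X))]
        bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(axis=2)), (rows, cols))
        dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)       # grid distance to BMU
        h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]          # neighborhood function
        W += lr * h * (x - W)                                     # pull neighbors toward x
    return W

def map_position(W, x):
    """Best-matching unit (map cell) for a descriptor vector x."""
    return np.unravel_index(np.argmin(((W - x) ** 2).sum(axis=2)), W.shape[:2])
```

After training, each compound can be assigned to its best-matching cell with map_position(W, x) and the cells colored by measured activity, to see whether potency levels occupy distinct regions of the map.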


Aqueous solubility

Aqueous solubility (logS) and lipophilicity (logP) are very important properties to be evaluated in the drug design process. Zhang et al. [95] reported SVM classifiers based on a three-class scheme for these two properties. They applied GA for feature selection and the conjugate gradient (CG) method for parameter optimization. Two data sets with 1,342 and 10,782 compounds were used to generate the logS and logP models. The chromosome was represented as a bit string, and simple mutation and crossover operators were used to create the individuals of each new generation. Five-fold cross-validation accuracy was used as the fitness function to decide which individuals were allowed to reproduce or survive into the next generation. A roulette wheel algorithm selected the chromosomes for crossover to produce offspring, and the swapping positions were chosen randomly, with crossover and mutation rates of 0.5 and 0.1, respectively [95]. The overall prediction accuracies for logS were 87 and 90% for the training and test sets, respectively; similarly, the overall prediction accuracies for logP were 81.0 and 82.0% for the training and test sets, respectively. The prediction accuracies of the two-class models of logS and logP were higher than those of the three-class models, and GA feature selection had a significant impact on the quality of the classification [95].
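The building blocks of that GA loop can be written compactly. The sketch below shows the three pieces described for the logS/logP classifiers: a fitness function based on five-fold cross-validated SVM accuracy, roulette-wheel parent selection, and crossover and mutation applied at rates of 0.5 and 0.1. Treating the mutation rate as a per-bit flip probability is an assumption, and the SVM parameters are left at scikit-learn defaults rather than being tuned by conjugate gradient as in the original work.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

def svm_cv_fitness(mask, X, y):
    """Fitness of a bit-string chromosome: 5-fold CV accuracy of an RBF-SVM
    trained on the descriptors whose bits are set (SVM parameters at defaults)."""
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(kernel="rbf"), X[:, mask], y, cv=5,
                           scoring="accuracy").mean()

def roulette_pick(pop, fitness):
    """Roulette-wheel selection: a parent is drawn with probability
    proportional to its fitness."""
    total = fitness.sum()
    p = fitness / total if total > 0 else np.full(len(fitness), 1.0 / len(fitness))
    return pop[rng.choice(len(pop), p=p)]

def breed(parent_a, parent_b, p_cross=0.5, p_mut=0.1):
    """One-point crossover applied with rate 0.5, then bit-flip mutation;
    the per-bit interpretation of the 0.1 mutation rate is an assumption."""
    child = parent_a.copy()
    if rng.random() < p_cross:
        cut = rng.integers(1, len(parent_a))
        child = np.concatenate([parent_a[:cut], parent_b[cut:]])
    return np.logical_xor(child, rng.random(len(child)) < p_mut)
```

These operators slot into a generational loop like the one sketched earlier for descriptor selection.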

Protein function/class–structure relationships

Functional variations induced by mutations are among the main causes of several genetic pathologies and syndromes. Given the availability of functional variation data for mutants of several proteins, as well as other protein functional and structural data, supervised learning can be used to model protein function/property relationships [29,30,70,71,96–105]. GA-SVM regression and binary classification were carried out to predict functional properties of ghrelin receptor mutants [30] and voltage-gated K+ channel proteins [29]. Structural information was encoded in 2D descriptors calculated from the protein sequences. The regression and classification tasks were properly attained, with accuracies of about 93 and 85%, respectively (Table 2). The optimum model of the constitutive activity of the ghrelin receptor was remarkably accurate while depending on only two descriptors. A novel 3D pseudo-folding graph representation of protein sequences inside a magic dodecahedron was used to classify voltage-gated potassium channels (VKCs) according to the signs of three electrophysiological variables: activation threshold voltage, half-activation voltage, and half-inactivation voltage [29]. We found relevant contributions of the pseudo-core and pseudo-surface of the 3D pseudo-folded proteins to the discrimination between VKCs according to the three electrophysiological variables. Moreover, the accuracies of the voltage-gated K+ channel models obtained by GA-SVM were higher than those of nine other GA-wrapper linear and nonlinear classifiers [29].

Since many disease-causing mutations exert their effects by altering protein folding, the prediction of protein structures and of stability changes upon mutation is a fundamental aim in molecular biology. The BRGNN technique has also been applied to model the conformational stability of mutants of human lysozyme [68], gene V protein [69], and chymotrypsin inhibitor 2 [15]. The unfolding Gibbs free energy changes (ΔG) of human lysozyme and gene V protein mutants were successfully modeled using amino acid sequence autocorrelation vectors calculated from the autocorrelations of 48 amino acid/residue properties [68,69] selected from the AAindex database [78]. In turn, ΔG of chymotrypsin inhibitor 2 mutants was predicted using protein radial distribution scores calculated over the 3D structure with the same 48 amino acid/residue properties. Ensembles of BRGNNs yielded optimum nonlinear models for the conformational stabilities of the human lysozyme, gene V protein, and chymotrypsin inhibitor 2 mutants, which described about 68, 66, and 72% of the ensemble test set variances, respectively (Table 1). The neural network models provided information about the most relevant properties ruling the conformational stability of the studied proteins, as the authors determined how each input descriptor correlated with the output predicted by the network [15,68,69]. Entropy changes and the power to be at the N-terminal of an α-helix had the strongest contributions to the stability pattern of human lysozyme. In the case of the gene V protein mutants, the sequence autocorrelations of thermodynamic transfer hydrophobicity and of the power to be in the middle of an α-helix had the highest impact on ΔG. Meanwhile, the spherical distribution of the side-chain entropy change over the 3D structure of the chymotrypsin inhibitor 2 mutants exhibited the highest relevance in comparison with the other descriptors.

Prediction of the structural class of a protein, which characterizes its overall folding type or that of its domains, had previously been based on feature groups that each carry only one kind of discriminative information; other types of discriminative information associated with the primary sequence were thus missed, reducing the prediction accuracy [102]. Li et al. [102] reported a novel method for the prediction of protein structural class by coupling GA and SVMs. Proteins were represented by six feature groups composed of 10 structural and physicochemical features of proteins and peptides, yielding a total of 1,447 features. GA was applied to select an optimum feature subset and to optimize the SVM parameters. The authors used a hybrid binary–decimal representation of the chromosomes, with five-fold cross-validation accuracy as the fitness function: the features were represented in the chromosome as 1,447 binary genes and the parameters as two decimal genes.
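A hybrid chromosome of this kind can be decoded into two parts: the binary genes give the feature mask, and the decimal genes are mapped onto the SVM hyperparameters. The sketch below illustrates that decoding and the five-fold cross-validation fitness; the parameter ranges for C and gamma are illustrative assumptions, since the exact encoding used in [102] is not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

N_FEATURES = 1447  # one binary gene per candidate feature

def decode(chromosome):
    """Split a hybrid chromosome into a feature mask and SVM parameters.
    The last two genes are decimal values in [0, 1) mapped onto
    illustrative log-scaled ranges for C and gamma (assumed ranges)."""
    mask = chromosome[:N_FEATURES].astype(bool)
    c_gene, g_gene = chromosome[N_FEATURES:N_FEATURES + 2]
    C = 10.0 ** (c_gene * 4 - 1)        # C in [0.1, 1000)
    gamma = 10.0 ** (g_gene * 4 - 4)    # gamma in [1e-4, 1)
    return mask, C, gamma

def fitness(chromosome, X, y):
    """Five-fold cross-validated accuracy of the decoded SVM."""
    mask, C, gamma = decode(chromosome)
    if not mask.any():
        return 0.0
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    return cross_val_score(clf, X[:, mask], y, cv=5, scoring="accuracy").mean()

# A random chromosome: 1,447 feature bits followed by two parameter genes.
rng = np.random.default_rng(2)
chrom = np.concatenate([(rng.random(N_FEATURES) < 0.05).astype(float), rng.random(2)])
```

Evolving such chromosomes lets a single GA run perform feature selection and hyperparameter tuning simultaneously, which is the central idea of the GA-SVM schemes reviewed here.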


Jack-knife tests on the working data sets yielded outstanding classification accuracies, higher than 97%, with an overall accuracy of 99.5% [102] (Table 2).

SVM learning methods have also proven effective for the prediction of protein subcellular and subnuclear localization, which demands cooperation between informative features and classifier design. For this purpose, Huang et al. [103] reported an accurate system for predicting protein subnuclear localization, named ProLoc, based on an evolutionary SVM (ESVM) classifier with automatic feature selection from a large set of physicochemical composition (PCC) descriptors. An inheritable GA combined with SVM automatically selected the best number of PCC features using two data sets, one with 504 proteins localized in six subnuclear compartments and another with 370 proteins localized in nine subnuclear compartments. The features and SVM parameters were encoded concatenated in binary chromosomes, which evolved according to mutation and crossover operators, and the training accuracy of ten-fold cross-validation was used as the fitness function. ProLoc, with 33 and 28 PCC features, reached leave-one-out accuracies over 56 and 72% for the two data sets, respectively [103]. Both predictors outperformed an SVM model using k-peptide composition features and an optimized evidence-theoretic k-nearest neighbor classifier using pseudo amino acid composition.

The nature of different protein–protein complexes was analyzed with a computational framework that handles the preparation, processing, and analysis of protein–protein complexes with machine learning algorithms [104]. Among the different machine learning algorithms, SVM was applied in combination with various feature selection techniques, including GA. Physicochemical characteristics of protein–protein complex interfaces were represented in four different ways, using two different atomic contact vectors, DrugScore pair potential vectors, and SFC score descriptor vectors. Two different data sets were used: one with contacts enforced by the crystallographic packing environment (crystal contacts) and biologically functional homodimer complexes, and another with permanent complexes and transient protein–protein complexes [104]. The authors implemented a simple GA with a population size of 30, a crossover rate of 75%, and a mutation rate of 5%. Two-point crossover and single-bit mutation were applied, and the population evolved until convergence, defined as no further change over 10 generations or 100% prediction quality. Although SVM did not yield the highest accuracy, the optimum models obtained by GA selection reached more than 90% accuracy for both the packing-enforced/functional and the permanent/transient complexes. GA also identified the discriminating ability of the three most relevant features, given in descending order as follows: contacts of hydrophobic and/or aromatic atoms located in the protein–protein interfaces, pure hydrophobic/hydrophobic atom contacts, and polar/hydrophobic atom contacts [104].
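The GA settings reported for this interface-classification study translate directly into two small operators and a stopping test. The sketch below is a generic implementation with the stated rates (75% crossover, 5% mutation) and the stated convergence criterion; the chromosome layout and fitness evaluation are assumed to be supplied elsewhere.

```python
import numpy as np

rng = np.random.default_rng(3)

def two_point_crossover(a, b, p_cross=0.75):
    """With probability 0.75, copy the segment between two random cut points
    of parent b into a copy of parent a."""
    child = a.copy()
    if rng.random() < p_cross:
        i, j = sorted(rng.integers(0, len(a), size=2))
        child[i:j] = b[i:j]
    return child

def single_bit_mutation(chrom, p_mut=0.05):
    """With probability 0.05, flip exactly one randomly chosen bit."""
    child = chrom.copy()
    if rng.random() < p_mut:
        pos = rng.integers(len(chrom))
        child[pos] = not child[pos]
    return child

def converged(best_history, window=10):
    """Stop when the best fitness has not changed over the last 10 generations,
    or earlier if perfect prediction quality has been reached."""
    if best_history and best_history[-1] >= 1.0:
        return True
    return len(best_history) > window and len(set(best_history[-window:])) == 1
```

Here best_history is assumed to be the list of best fitness values recorded at each generation of the surrounding GA loop.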

Kernytsky and Rost [105] reported a framework that first defines global sequence features and then widely expands the feature space by generically encoding the co-existence of residue-based features in proteins. A global protein feature scheme was generated for function and structure prediction studies. They proposed combinations of individual features that span the feature space from global feature inputs to features capturing local evidence, such as the individual residues of a catalytic triad. GA-optimized ANNs and SVMs were used to explore the vast feature space created. Inside the GA, the initial population of solutions was built from multiple combinations of all the global features, which also contained the maximal intersection of all the feature classes, with 360 input features [105]. New offspring were created by inserting or deleting nodes in existing individuals, where nodes were defined as feature classes or as operators combining two global feature classes. The mutation probability was set to 0.4 per node per generation, and the crossover probability was set to 0.2 per solution per generation. After new offspring solutions were generated via crossover and/or mutation (insertion/deletion) of the parent solutions, the worst solutions were discarded to restore the population's original size, ensuring that the best-performing solutions are not removed from the next generation by chance; such elitist schemes tend to converge faster at the cost of losing diversity among the solutions more quickly. This contrasts with the typical roulette wheel selection scheme, in which fitter solutions have a higher chance than less fit ones of passing to the next generation but have no guaranteed survival. The area under the receiver operating characteristic curve was monitored as the fitness/cost function. The population size was set to 100 solutions, with 50 potential offspring created in each generation, and the GA ran for 1,000 generations. The GA was critical to effectively manage a feature space far too large for exhaustive enumeration, and it allowed the detection of feature combinations that were neither too general, with poor performance, nor too specific, leading to overtraining. This GA variant was successfully applied to the prediction of protein enzymatic activity [105].
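Two ingredients of this scheme are easy to isolate: an area-under-the-ROC-curve fitness and the elitist survivor step that deterministically keeps the best solutions from the pooled parents and offspring. The sketch below illustrates both; representing a candidate solution as a plain feature mask scored with an SVM is a simplification of the feature-class combinations evolved in [105], and the hold-out split and classifier settings are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def auc_fitness(feature_mask, X, y, seed=0):
    """Fitness: ROC AUC of a classifier trained on the encoded feature
    combination (binary labels assumed; an SVM stands in for the ANN/SVM pair)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X[:, feature_mask], y,
                                              test_size=0.3, random_state=seed)
    clf = SVC(kernel="rbf", probability=True, random_state=seed).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

def elitist_survivors(parents, offspring, scores, n_keep=100):
    """Survivor selection: pool parents and offspring, rank by fitness, and
    deterministically keep the best n_keep, so top solutions cannot be lost
    by chance (unlike pure roulette-wheel survival). `scores` must be
    aligned with the pooled candidates, parents first."""
    pool = parents + offspring
    order = np.argsort(scores)[::-1]
    return [pool[i] for i in order[:n_keep]]
```

Keeping the elite deterministically while still generating offspring stochastically is the trade-off the authors describe: faster convergence in exchange for a quicker loss of population diversity.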

Conclusions

The reviewed articles comprise GA-optimized predictors implemented to quantitatively or qualitatively describe structure–activity relationships in data relevant for drug discovery. BRGNN and GA-SVM are presented and discussed as powerful data modeling tools arising from the combination of GA with efficient nonlinear mapping techniques such as BRANN and SVM.



Convoluted relationships can be successfully modeled, and relevant explanatory variables can be identified among large pools of descriptors. Interestingly, accurate predictors were achieved from 2D topological representations of ligands and targets. These approaches outperformed other linear and nonlinear mapping techniques combined with different feature selection methods. BRGNNs showed satisfactory performance, converging quickly toward the optimum and avoiding overfitting to a large extent. Similarly, GA optimization of SVMs yielded robust and well-generalizing models. However, considering the complexity of network architectures and weight optimization routines, BRGNN was more suitable for function approximation on convoluted but low-dimensional data, whereas GA-SVM performed better in classification tasks on high-dimensional data. These methodologies are regarded as useful tools for drug design.
Acknowledgements Julio Caballero acknowledges with thanks the support received through Programa Bicentenario de Ciencia y Tecnología, ACT/24.

References
1. Gasteiger J (2006) Chemoinformatics: a new field with a long tradition. Anal Bioanal Chem 384:57–64. doi:10.1007/s00216-005-0065-y
2. Cramer RD, Patterson DE, Bunce JD (1988) Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 110:5959–5967. doi:10.1021/ja00226a005
3. Klebe G, Abraham U, Mietzner T (1994) Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity. J Med Chem 37:4130–4146. doi:10.1021/jm00050a010
4. Folkers G, Merz A, Rognan D (1993) CoMFA: scope and limitations. In: Kubinyi H (ed) 3D-QSAR in drug design. Theory, methods and applications. ESCOM Science Publishers BV, Leiden, pp 583–618
5. Hansch C, Kurup A, Garg R, Gao H (2001) Chem-bioinformatics and QSAR: a review of QSAR lacking positive hydrophobic terms. Chem Rev 101:619–672. doi:10.1021/cr0000067
6. Sabljic A (1990) Topological indices and environmental chemistry. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure–activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht, pp 61–82
7. Karelson M, Lobanov VS, Katritzky AR (1996) Quantum-chemical descriptors in QSAR/QSPR studies. Chem Rev 96:1027–1043. doi:10.1021/cr950202r
8. Livingstone DJ, Manallack DT, Tetko IV (1997) Data modelling with neural networks: advantages and limitations. J Comput Aid Mol Des 11:135–142. doi:10.1023/A:1008074223811
9. Burbidge R, Trotter M, Buxton B, Holden S (2001) Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem 26:5–14. doi:10.1016/S0097-8485(01)00094-8
10. Caballero J, Fernández M (2006) Linear and non-linear modeling of antifungal activity of some heterocyclic ring derivatives using multiple linear regression and Bayesian-regularized neural networks. J Mol Model 12:168–181. doi:10.1007/s00894-005-0014-x
11. Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor, MI
12. Cartwright HM (1993) Applications of artificial intelligence in chemistry. Oxford University Press, Oxford
13. Cho SJ, Hermsmeier MA (2002) Genetic algorithm guided selection: variable selection and subset selection. J Chem Inf Comput Sci 42:927–936. doi:10.1021/ci010247v
14. Duchowicz PR, Vitale MG, Castro EA, Fernandez M, Caballero J (2007) QSAR analysis for heterocyclic antifungals. Bioorg Med Chem 15:2680–2689. doi:10.1016/j.bmc.2007.01.039
15. Fernández M, Caballero J, Fernández L, Abreu JI, Garriga M (2007) Protein radial distribution function (P-RDF) and Bayesian-regularized genetic neural networks for modeling protein conformational stability: chymotrypsin inhibitor 2 mutants. J Mol Graph Model 26:748–759. doi:10.1016/j.jmgm.2007.04.011
16. Caballero J, Garriga M, Fernández M (2005) Genetic neural network modeling of the selective inhibition of the intermediate-conductance Ca2+-activated K+ channel by some triarylmethanes using topological charge indexes descriptors. J Comput Aid Mol Des 19:771–789. doi:10.1007/s10822-005-9025-z
17. Caballero J, Garriga M, Fernández M (2006) 2D autocorrelation modeling of the negative inotropic activity of calcium entry blockers using Bayesian-regularized genetic neural networks. Bioorg Med Chem 14:3330–3340. doi:10.1016/j.bmc.2005.12.048
18. Caballero J, Tundidor-Camba A, Fernández M (2007) Modeling of the inhibition constant (Ki) of some cruzain ketone-based inhibitors using 2D spatial autocorrelation vectors and data-diverse ensembles of Bayesian-regularized genetic neural networks. QSAR Comb Sci 26:27–40. doi:10.1002/qsar.200610001
19. Fernández M, Tundidor-Camba A, Caballero J (2005) Modeling of cyclin-dependent kinase inhibition by 1H-pyrazolo[3,4-d]pyrimidine derivatives using artificial neural network ensembles. J Chem Inf Model 45:1884–1895. doi:10.1021/ci050263i
20. Fernández M, Caballero J (2006) Bayesian-regularized genetic neural networks applied to the modeling of non-peptide antagonists for the human luteinizing hormone-releasing hormone receptor. J Mol Graph Model 25:410–422. doi:10.1016/j.jmgm.2006.02.005
21. Fernandez M, Carreiras MC, Marco JL, Caballero J (2006) Modeling of acetylcholinesterase inhibition by tacrine analogues using Bayesian-regularized genetic neural networks and ensemble averaging. J Enzym Inhib Med Chem 21:647–661. doi:10.1080/14756360600862366
22. Fernández M, Caballero J (2006) Modeling of activity of cyclic urea HIV-1 protease inhibitors using regularized-artificial neural networks. Bioorg Med Chem 14:280–294. doi:10.1016/j.bmc.2005.08.022
23. Fernández M, Caballero J, Tundidor-Camba A (2006) Linear and nonlinear QSAR study of N-hydroxy-2-[(phenylsulfonyl)amino]acetamide derivatives as matrix metalloproteinase inhibitors. Bioorg Med Chem 14:4137–4150. doi:10.1016/j.bmc.2006.01.072
24. Fernández M, Caballero J (2006) Ensembles of Bayesian-regularized genetic neural networks for modeling of acetylcholinesterase inhibition by huprines. Chem Biol Drug Des 68:201–212. doi:10.1111/j.1747-0285.2006.00435.x
25. González MP, Caballero J, Tundidor-Camba A, Helguera AM, Fernández M (2006) Modeling of farnesyltransferase inhibition by some thiol and non-thiol peptidomimetic inhibitors using genetic neural networks and RDF approaches. Bioorg Med Chem 14:200–213. doi:10.1016/j.bmc.2005.08.009
26. Di Fenza A, Alagona G, Ghio C, Leonardi R, Giolitti A, Madami A (2007) Caco-2 cell permeability modelling: a neural network coupled genetic algorithm approach. J Comput Aid Mol Des 21:207–221. doi:10.1007/s10822-006-9098-3


27. So S, Karplus M (1996) Evolutionary optimization in quantitative structure–activity relationship: an application of genetic neural networks. J Med Chem 39:1521–1530. doi:10.1021/jm9507035
28. Gao H (2001) Application of BCUT metrics and genetic algorithm in binary QSAR analysis. J Chem Inf Comput Sci 41:402–407. doi:10.1021/ci000306p
29. Fernández M, Fernández L, Abreu JI, Garriga M (2008) Classification of voltage-gated K(+) ion channels from 3D pseudo-folding graph representation of protein sequences using genetic algorithm-optimized support vector machines. J Mol Graph Model 26:1306–1314. doi:10.1016/j.jmgm.2008.01.001
30. Caballero J, Fernández L, Garriga M, Abreu JI, Collina S, Fernández M (2007) Proteometric study of ghrelin receptor function variations upon mutations using amino acid sequence autocorrelation vectors and genetic algorithm-based least square support vector machines. J Mol Graph Model 26:166–178. doi:10.1016/j.jmgm.2006.11.002
31. Hemmateenejad B, Miri R, Akhond M, Shamsipur M (2002) QSAR study of the calcium channel antagonist activity of some recently synthesized dihydropyridine derivatives. An application of genetic algorithm for variable selection in MLR and PLS methods. Chemom Intell Lab 64:91–99. doi:10.1016/S0169-7439(02)00068-0
32. Hemmateenejad B, Akhond M, Miri R, Shamsipur M (2003) Genetic algorithm applied to the selection of factors in principal component-artificial neural networks: application to QSAR study of calcium channel antagonist activity of 1,4-dihydropyridines (nifedipine analogous). J Chem Inf Comput Sci 43:1328–1334. doi:10.1021/ci025661p
33. Hemmateenejad B (2004) Optimal QSAR analysis of the carcinogenic activity of drugs by correlation ranking and genetic algorithm-based PCR. J Chemom 18:475–485. doi:10.1002/cem.891
34. Yamashita F, Wanchana S, Hashida M (2002) Quantitative structure/property relationship analysis of Caco-2 permeability using a genetic algorithm-based partial least squares method. J Pharm Sci 91:2230–2238. doi:10.1002/jps.10214
35. Selwood DL, Livingstone DJ, Comley JCW, O'Dowd AB, Hudson AT, Jackson P, Jandu KS, Rose VS, Stables JN (1990) Structure–activity relationships of antifilarial antimycin analogues: a multivariate pattern recognition study. J Med Chem 33:136–142. doi:10.1021/jm00163a023
36. Ren Y, Liu H, Li S, Yao X, Liu M (2007) Prediction of binding affinities to b1 isoform of human thyroid hormone receptor by genetic algorithm and projection pursuit regression. Bioorg Med Chem Lett 17:2474–2482. doi:10.1016/j.bmcl.2007.02.025
37. Turner DB, Willett P (2000) Evaluation of the EVA descriptor for QSAR studies: 3. The use of a genetic algorithm to search for models with enhanced predictive properties (EVA_GA). J Comput Aid Mol Des 14:1–21. doi:10.1023/A:1008180020974
38. Xue L, Bajorath J (2000) Molecular descriptors for effective classification of biologically active compounds based on principal component analysis identified by a genetic algorithm. J Chem Inf Comput Sci 40:801–809. doi:10.1021/ci000322m
39. Kamphausen S, Höltge N, Wirsching F, Morys-Wortmann C, Riester D, Goetz R, Thürk M, Schwienhorst A (2002) Genetic algorithm for the design of molecules with desired properties. J Comput Aid Mol Des 16:551–567. doi:10.1023/A:1021928016359
40. Guo W, Cai W, Shao X, Pan Z (2005) Application of genetic stochastic resonance algorithm to quantitative structure–activity relationship study. Chemom Intell Lab 75:181–188. doi:10.1016/j.chemolab.2004.07.004
41. Teixido M, Belda I, Rosello X, Gonzalez S, Fabre M, Llora X, Bacardit J, Garrell JM, Vilaro S, Albericio F, Giralt E (2003) Development of a genetic algorithm to design and identify peptides that can cross the blood–brain barrier. 1. Design and validation in silico. QSAR Comb Sci 22:745–753. doi:10.1002/qsar.200320004
42. So SS, Karplus M (1997) Three-dimensional quantitative structure–activity relationships from molecular similarity matrices and genetic neural networks: 1. Method and validations. J Med Chem 40:4347–4359. doi:10.1021/jm970487v
43. So SS, Karplus M (1997) Three-dimensional quantitative structure–activity relationships from molecular similarity matrices and genetic neural networks: 2. Applications. J Med Chem 40:4360–4371. doi:10.1021/jm970488n
44. Chiu TL, So SS (2003) Genetic neural networks for functional approximation. QSAR Comb Sci 22:519–526. doi:10.1002/qsar.200310004
45. Patankar SJ, Jurs PC (2000) Prediction of IC50 values for ACAT inhibitors from molecular structure. J Chem Inf Comput Sci 40:706–723. doi:10.1021/ci990125r
46. Kauffman GW, Jurs PC (2000) Prediction of inhibition of the sodium ion-proton antiporter by benzoylguanidine derivatives from molecular structure. J Chem Inf Comput Sci 40:753–761. doi:10.1021/ci9901237
47. Kauffman GW, Jurs PC (2001) QSAR and k-nearest neighbor classification analysis of selective cyclooxygenase-2 inhibitors using topologically-based numerical descriptors. J Chem Inf Comput Sci 41:1553–1560. doi:10.1021/ci010073h
48. Mattioni BE, Jurs PC (2002) Development of quantitative structure–activity relationship and classification models for a set of carbonic anhydrase inhibitors. J Chem Inf Comput Sci 42:94–102. doi:10.1021/ci0100696
49. Bakken GA, Jurs PC (2001) QSARs for 6-azasteroids as inhibitors of human type 1 5alpha-reductase: prediction of binding affinity and selectivity relative to 3-BHSD. J Chem Inf Comput Sci 41:1255–1265. doi:10.1021/ci010036q
50. Patankar SJ, Jurs PC (2002) Prediction of glycine/NMDA receptor antagonist inhibition from molecular structure. J Chem Inf Comput Sci 42:1053–1068. doi:10.1021/ci010114+
51. Burden FR, Winkler DA (1999) Robust QSAR models using Bayesian regularized neural networks. J Med Chem 42:3183–3187. doi:10.1021/jm980697n
52. Winkler DA, Burden R (2004) Bayesian neural nets for modeling in drug discovery. Biosilico 2:104–111. doi:10.1016/S1741-8364(04)02393-5
53. MATLAB 7.0 Program (2004) MathWorks Inc., Natick. http://www.mathworks.com
54. The MathWorks Inc (2004) Genetic algorithm and direct search toolbox user's guide for use with MATLAB. The MathWorks Inc., Natick
55. The MathWorks Inc (2004) Neural network toolbox user's guide for use with MATLAB. The MathWorks Inc., Natick
56. Mackay DJC (1992) A practical Bayesian framework for backpropagation networks. Neural Comput 4:448–472. doi:10.1162/neco.1992.4.3.448
57. Mackay DJC (1992) Bayesian interpolation. Neural Comput 4:415–447
58. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
59. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2:1–47. doi:10.1023/A:1009715923555
60. Fröhlich H, Chapelle O, Schölkopf B (2003) Feature selection for support vector machines by means of genetic algorithms. In: Proceedings of the 15th IEEE international conference on tools with AI, Sacramento, CA, USA, pp 142–148. doi:10.1109/TAI.2003.1250182
61. Yang SY, Huang Q, Li LL, Ma CY, Zhang H, Bai R, Teng QZ, Xiang ML, Wei YQ (2009) An integrated scheme for feature selection and parameter setting in the support vector machine modeling and its application to the prediction of pharmacokinetic properties of drugs. Artif Intell Med 46:155–163. doi:10.1016/j.artmed.2008.07.001
62. Ma CY, Yang SY, Zhang H, Xiang ML, Huang Q, Wei YQ (2008) Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA-CG-SVM method. J Pharmaceut Biomed 47:677–682. doi:10.1016/j.jpba.2008.03.023
63. Zhang H, Chen QY, Xiang ML, Ma CY, Huang Q, Yang SY (2009) In silico prediction of mitochondrial toxicity by using GA-CG-SVM approach. Toxicol In Vitro 23:134–140. doi:10.1016/j.tiv.2008.09.017
64. Chih-Chung C, Chih-Jen L (2001) LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm
65. Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graph Model 20:269–276. doi:10.1016/S1093-3263(01)00123-1
66. Afantitis A, Melagraki G, Sarimveis H, Igglessi-Markopoulou O, Kollias G (2009) A novel QSAR model for predicting the inhibition of CXCR3 receptor by 4-N-aryl-[1,4]diazepane ureas. Eur J Med Chem 44:877–884. doi:10.1016/j.ejmech.2008.05.028
67. Agrafiotis DK, Cedeño W, Lobanov VS (2002) On the use of neural network ensembles in QSAR and QSPR. J Chem Inf Comput Sci 42:903–911. doi:10.1021/ci0203702
68. Caballero J, Fernández L, Abreu JI, Fernández M (2006) Amino acid sequence autocorrelation vectors and ensembles of Bayesian-regularized genetic neural networks for prediction of conformational stability of human lysozyme mutants. J Chem Inf Model 46:1255–1268. doi:10.1021/ci050507z
69. Fernández L, Caballero J, Abreu JI, Fernández M (2007) Amino acid sequence autocorrelation vectors and Bayesian-regularized genetic neural networks for modeling protein conformational stability: gene V protein mutants. Proteins 67:834–852. doi:10.1002/prot.21349
70. MOPAC 6.0 (1993) Frank J. Seiler Research Laboratory, US Air Force Academy, Colorado Springs, CO
71. Fernández M, Caballero J (2007) QSAR models for predicting the activity of non-peptide luteinizing hormone-releasing hormone (LHRH) antagonists derived from erythromycin A using quantum chemical properties. J Mol Model 13:465–476. doi:10.1007/s00894-006-0163-6
72. Fernández M, Caballero J (2007) QSAR modeling of matrix metalloproteinase inhibition by N-hydroxy-α-phenylsulfonylacetamide derivatives. Bioorg Med Chem 15:6298–6310. doi:10.1016/j.bmc.2007.06.014
73. Fatemi MH, Gharaghani S (2007) A novel QSAR model for prediction of apoptosis-inducing activity of 4-aryl-4-H-chromenes based on support vector machine. Bioorg Med Chem 15:7746–7754. doi:10.1016/j.bmc.2007.08.057
74. Todeschini R, Consonni V, Pavan M (2002) DRAGON, version 2.1. Talete SRL, Milan, Italy
75. Cerius2, Version 4.11. http://www.accelrys.com
76. VCCLAB, Virtual Computational Chemistry Laboratory (2005) http://www.vcclab.org
77. Fernandez M, Abreu JI (2006) PROTMETRICS, version 1.0. Molecular Modeling Group, University of Matanzas, Matanzas, Cuba
78. Kawashima S, Kanehisa M (2000) AAindex: amino acid index database. Nucleic Acids Res 28:374. doi:10.1093/nar/28.1.374
79. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59–69. doi:10.1007/BF00337288
80. Dykens JA, Will Y (2007) The significance of mitochondrial toxicity testing in drug development. Drug Discov Today 12:777–785. doi:10.1016/j.drudis.2007.07.013
81. Foye WO (1995) Cancer chemotherapeutic agents. American Chemical Society, Washington, DC
82. Ashkenazi A, Dixit VM (1998) Death receptors: signaling and modulation. Science 281:1305–1308. doi:10.1126/science.281.5381.1305
83. Nagata S (1997) Apoptosis by death factor. Cell 88:355–365. doi:10.1016/S0092-8674(00)81874-7
84. Bartus RT, Dean RL, Beer B, Lippa AS (1982) The cholinergic hypothesis of geriatric memory dysfunction. Science 217:408–417. doi:10.1126/science.7046051
85. Radic Z, Reiner E, Taylor P (1991) Role of the peripheral anionic site on acetylcholinesterase: inhibition by substrates and coumarin derivatives. Mol Pharmacol 39:98–104
86. Pang YP, Quiram P, Jelacic T, Hong F, Brimijoin S (1996) Highly potent, selective, and low cost bis-tetrahydroaminacrine inhibitors of acetylcholinesterase: steps toward novel drugs for treating Alzheimer's disease. J Biol Chem 271:23646–23649. doi:10.1074/jbc.271.39.23646
87. Katz RA, Skalka AM (1994) The retroviral enzymes. Annu Rev Biochem 63:133–173. doi:10.1146/annurev.bi.63.070194.001025
88. Kempf DJ, Marsh KC, Denissen JF, McDonald E, Vasavanonda S, Flentge CA, Green BE, Fino L, Park CH, Kong XP, Wideburg NE, Saldivar A, Ruiz L, Kati WM, Sham HL, Robins T, Stewart KD, Hsu A, Plattner JJ, Leonard JM, Norbeck DW (1995) ABT-538 is a potent inhibitor of human immunodeficiency virus protease and has high oral bioavailability in humans. Proc Natl Acad Sci USA 92:2484–2488. doi:10.1073/pnas.92.7.2484
89. Reddy P, Ross J (1999) Amprenavir: a protease inhibitor for the treatment of patients with HIV-1 infection. Formulary 34:567–577
90. Vacca JP, Dorsey BD, Schleif WA, Levin RB, McDaniel SL, Darke PL, Zugay J, Quintero JC, Blahy OM, Roth E, Sardana VV, Schlabach AJ, Graham PI, Condra JH, Gotlib L, Holloway MK, Lin J, Chen IW, Vastag K, Ostovic D, Anderson PS, Emini EA, Huff JR (1994) L-735,524: an orally bioavailable human immunodeficiency virus type 1 protease inhibitor. Proc Natl Acad Sci USA 91:4096–4100. doi:10.1073/pnas.91.9.4096
91. Castle NA (1999) Recent advances in the biology of small conductance calcium-activated potassium channels. Perspect Drug Discov Des 15:131–154. doi:10.1023/A:1017095519863
92. Vergara C, LaTorre R, Marrion NV, Adelman JP (1998) Calcium-activated potassium channels. Curr Opin Neurobiol 8:321–329. doi:10.1016/S0959-4388(98)80056-1
93. Wulff H, Miller MJ, Hänsel W, Grissmer S, Cahalan MD, Chandy KG (2000) Design of a potent and selective inhibitor of the intermediate-conductance Ca2+-activated K+ channel, IKCa1: a potential immunosuppressant. Proc Natl Acad Sci USA 97:8151–8156. doi:10.1073/pnas.97.14.8151
94. Engel JC, Doyle PS, Palmer J, Hsieh I, Bainton DF, McKerrow JH (1998) Growth arrest of T. cruzi by cysteine protease inhibitors is accompanied by alterations in Golgi complex and ER ultrastructure. J Cell Sci 111:597–606
95. Zhang H, Xiang ML, Ma CY, Huang Q, Li W, Xie Y, Wei YQ, Yang SY (2009) Three-class classification models of logS and logP derived by using GA-CG-SVM approach. Mol Divers 13:261–268. doi:10.1007/s11030-009-9108-1
96. Ramos de Armas R, González-Díaz H, Molina R, Uriarte E (2004) Markovian backbone negentropies: molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants. Proteins 56:715–723. doi:10.1002/prot.20159
97. González-Díaz H, Molina R, Uriarte E (2005) Recognition of stable protein mutants with 3D stochastic average electrostatic potentials. FEBS Lett 579:4297–4301. doi:10.1016/j.febslet.2005.06.065
98. González-Díaz H, Vilar S, Santana L, Uriarte E (2007) Medicinal chemistry and bioinformatics-current trends in drugs discovery with networks topological indices. Curr Top Med Chem 7:1015–1029. doi:10.2174/156802607780906771
99. Vilar S, González-Díaz H, Santana L, Uriarte E (2008) QSAR model for alignment-free prediction of human breast cancer biomarkers based on electrostatic potentials of protein pseudofolding HP-lattice networks. J Comput Chem 29:2613–2622. doi:10.1002/jcc.21016
100. Munteanu CR, González-Díaz H, Magalhães AL (2008) Enzymes/non-enzymes classification model complexity based on composition, sequence, 3D and topological indices. J Theor Biol 254:476–482. doi:10.1016/j.jtbi.2008.06.003
101. Fernández M, Caballero J, Fernández L, Abreu JI, Acosta G (2008) Classification of conformational stability of protein mutants from 3D pseudo-folding graph representation of protein sequences using support vector machines. Proteins 70:167–175. doi:10.1002/prot.21524
102. Li ZC, Zhou XB, Lin YR, Zou XY (2008) Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 35:581–590. doi:10.1007/s00726-008-0084-z
103. Huang WL, Tung CW, Huang HL, Hwang SF, Ho SY (2007) ProLoc: prediction of protein subnuclear localization using SVM with automatic selection from physicochemical composition features. BioSystems 90:573–581. doi:10.1016/j.biosystems.2007.01.001
104. Block P, Paern J, Hüllermeier E, Sanschagrin P, Sotriffer CA, Klebe G (2006) Physicochemical descriptors to discriminate protein–protein interactions in permanent and transient complexes selected by means of machine learning algorithms. Proteins 65:607–622. doi:10.1002/prot.21104
105. Kernytsky A, Rost B (2009) Using genetic algorithms to select most predictive protein features. Proteins 75:75–88. doi:10.1002/prot.22211

