possessing a relatively large structure, which are impractical to model using more accurate
methods. In practice, large systems are treated with either MM or semiempirical methods for
structural optimization as well as for the computation of conformational energies.
Molecular mechanical methods are comparatively fast and generate accurate (or nearly accurate)
molecular geometries and conformational energies. In some cases, thermochemical data for stable
chemical species can be predicted with appreciable reliability using the MM3 or MM4 method. On
the other hand, semiempirical analysis becomes valuable for systems such as reactive
intermediates or transition states, for which no suitable force field is available. For small
molecules, the choice between semiempirical and ab initio methods is a compromise between
computational cost and accuracy. The selection of a suitable technique also depends on the
modeler's knowledge of the chemicals involved; for example, semiempirical methods give good
results for chemical systems that are similar to those used in the parameterization set.
Similarly, care should be taken in cases where semiempirical analysis is prone to failure, such
as the prediction of activation barriers. Finally, it is better to study the pros, cons, and
applicability of a particular energy optimization scheme than to place blind faith in it; for
example, a semiempirical method may not be applicable to a particular large chemical system even
though it is known to produce good results for many other such systems. In this context, we
would like to mention that, among the various molecular modeling platforms, Gaussian software
(http://www.gaussian.com/) is one of the oldest. It supports calculations with ab initio
methods (HF, MP2, etc.), density functional theory (HFB, PW91, PBE, G96, LYP, VWN5, etc.),
semiempirical techniques (AM1, MNDO, PM3, PM6, etc.), MM (Amber, Dreiding, UFF), and other
hybrid methods (G1, G2, G2MP2, G3, G3B3, G4, G4MP2, MPW1PW91, B2PLYP, B3LYP, etc.). Gaussian
software also allows the use of various sets of functions in the form of basis sets (namely,
STO-3G, 3-21G, 6-21G, 4-31G, 6-31G, etc.) to characterize the wave function at different levels
of approximation. The development of this computational chemistry program is credited to John
Pople and his research group, which released the first version, Gaussian 70, in 1970. The
software has undergone continual updates since then, and Gaussian 09 is the latest one [41].
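
To make the pairing of a method with a basis set more concrete, the following minimal Python
sketch assembles the text of a simple Gaussian input file: a route line beginning with "#", a
title, the charge and spin multiplicity, and Cartesian coordinates, with blank lines separating
the sections. The helper name build_gaussian_input, the water geometry, and the route strings
are illustrative assumptions and do not come from the text; the sketch only writes text and
does not run Gaussian.

```python
def build_gaussian_input(route, title, charge, multiplicity, atoms):
    """Return the text of a minimal Gaussian input file.

    Hypothetical helper for illustration only; it is not part of Gaussian.
    atoms is a list of (element symbol, x, y, z) tuples in angstroms.
    """
    lines = [
        f"# {route}",                 # route section: method/basis and job type
        "",                           # blank line ends the route section
        title,                        # free-text title section
        "",                           # blank line ends the title section
        f"{charge} {multiplicity}",   # net charge and spin multiplicity
    ]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:<2s} {x:12.6f} {y:12.6f} {z:12.6f}")
    lines.append("")                  # blank line ends the molecule specification
    return "\n".join(lines) + "\n"


# Approximate water geometry, used purely as an illustration.
water = [
    ("O", 0.000000, 0.000000, 0.117300),
    ("H", 0.000000, 0.757200, -0.469200),
    ("H", 0.000000, -0.757200, -0.469200),
]

# The same molecule set up at three levels of theory named in the text.
hf_input = build_gaussian_input("HF/STO-3G Opt", "Water, ab initio HF", 0, 1, water)
dft_input = build_gaussian_input("B3LYP/6-31G Opt", "Water, hybrid functional", 0, 1, water)
pm3_input = build_gaussian_input("PM3 Opt", "Water, semiempirical", 0, 1, water)

print(hf_input)
```

Only the route line changes between the three inputs, which mirrors the point made above: the
geometry stays fixed while the modeler trades cost against accuracy by choosing the method and
basis set. Semiempirical methods such as PM3 carry their own built-in parameterization, so no
basis set is given in that route.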
One last issue to be discussed here is that the study of molecules using suitable computer
platforms and programs provides the user with only a hypothetical overview of the actual
phenomenon involved. Hence, we prefer the term simulation, since the computation is not the real
situation and may not represent reality if all the constraints are not addressed properly.
Intuitively, as the complexity of the system increases, less accurate approximations become
unavoidable. Considering all real-world constraints, or even most of them, has not yet been
feasible for large systems. However, computers make considerably less drastic approximations
possible, thereby allowing experimental and theoretical insights to be combined into a
reasonable, realistic representation of problems.

CHAPTER 6
Selected Statistical Methods in QSAR
Contents
6.1 Introduction
6.2 Regression-Based Approaches
6.2.1 Multiple linear regression
6.2.1.1 Model development
6.2.1.2 Statistical metrics to examine the quality of the developed model
6.2.1.3 Example of an MLR model development
6.2.1.4 Data pretreatment and variable selection
6.2.2 Partial least squares
6.2.2.1 The method
6.2.2.2 An example
6.2.3 Principal component regression analysis
6.2.4 Ridge regression
6.3 Classification-Based QSAR
6.3.1 Linear discriminant analysis
6.3.2 Logistic regression
6.3.3 Cluster analysis
6.3.3.1 Hierarchical cluster analysis
6.3.3.2 k-Means clustering
6.4 Machine Learning Techniques
6.4.1 Artificial neural network
6.4.2 Bayesian neural network
6.4.3 Decision tree and random forest
6.4.4 Support vector machine
6.5 Conclusion
References
