Distance measures

A widely used measure is Euclidean distance, defined as

    d_E(i, j) = [ Σ_{k=1}^{p} (x(i,k) − x(j,k))² ]^{1/2}

where p = number of measurements, n = number of variables.
Another distance measure could be a metric measure, defined with the following properties:
(1) d(i,j) ≥ 0, and d(i,j) = 0 if and only if i = j
(2) d(i,j) = d(j,i) (symmetry)
(3) d(i,j) ≤ d(i,k) + d(k,j) (triangle inequality)
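As a small illustration, the Euclidean distance and the three metric properties can be checked directly in Python (the helper name `euclidean` is illustrative, not from the source):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length measurement vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

a, b, c = [0.0, 0.0], [3.0, 4.0], [6.0, 0.0]
print(euclidean(a, b))                                        # 5.0
# The metric properties hold: non-negativity, symmetry, triangle inequality.
print(euclidean(a, b) == euclidean(b, a))                     # True
print(euclidean(a, c) <= euclidean(a, b) + euclidean(b, c))   # True
```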
Search methods

Minimize or maximize the score function by changing model parameters as well as model structures.
Strategies of data management

Organize the data set; define the data hierarchy, etc.
Summarizing data

Mean
The mean simply summarizes a collection of values (a data set):

    x̄ = (1/n) Σ_{i=1}^{n} x(i)
The sample mean has the property that it is the value that is "central" in the sense that it minimizes the sum of squared differences between it and the data values. Thus, if there are n data values, the mean is the value such that the sum of n copies of it equals the sum of the data values. The mean is a measure of location. Another such measure is the median (a value with an equal number of values below and above it). The most repeated value is the mode.
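The three measures of location can be computed with the standard library (the sample data set is illustrative):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = sum(data) / len(data)      # 30 / 6 = 5.0
median = statistics.median(data)  # midpoint of 3 and 5 -> 4.0
mode = statistics.mode(data)      # most repeated value -> 3
print(mean, median, mode)

# The mean minimizes the sum of squared differences to the data values,
# so any other candidate (e.g. the median) gives a sum at least as large.
assert sum((x - mean) ** 2 for x in data) <= sum((x - median) ** 2 for x in data)
```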
Standard Deviation
The dispersion or variability of data is measured with the standard deviation or variance.
The sample variance is the sum of squared differences between the mean and the individual data values, divided by (n − 1):
    s² = (1/(n − 1)) Σ_{i=1}^{n} (x(i) − x̄)²

The square root of the variance is the standard deviation:

    s = [ (1/(n − 1)) Σ_{i=1}^{n} (x(i) − x̄)² ]^{1/2}
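A minimal sketch of these two formulas, assuming the (n − 1) sample form shown above (variable names are illustrative):

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n                                    # 40 / 8 = 5.0

# Sample variance: sum of squared deviations from the mean, divided by (n - 1)
variance = sum((x - mean) ** 2 for x in data) / (n - 1)  # 32 / 7
std_dev = math.sqrt(variance)
print(variance, std_dev)
```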
Tools displaying a single data set, or univariate (one-dimensional) data: e.g. histogram.
Tools displaying the relationship between two variables (bi-dimensional data): e.g. scatter plot, a locus (curve), contour lines, etc.
Tools displaying multivariate (higher-dimensional) data sets: e.g. Principal Component Analysis.
Principal Component Analysis (PCA) & Partial Least Squares (PLS)
PCA and PLS are based on the concept of factor analysis. Factor analysis techniques are used
to reduce the number of variables, and
to detect structure in the relationships between variables (i.e. to classify variables).
The purpose of PCA is to reduce the original variables into fewer composite variables, called principal
components. In PCA, the objective is to account for the maximum portion of the variance present in the original
set of variables with a minimum number of composite variables, the principal components.
PCA is a useful tool capable of compressing data and reducing its dimensionality so that the essential information
is retained and is easier to analyze than the original huge data set. The theory behind PCA is that the covariance
matrix of the process variables is decomposed orthogonally along the directions that explain the maximum variation
of the data. While PCA is used only for a single data matrix, PLS models the relationship between two blocks of data
while compressing them simultaneously. The major limitation of PCA-based monitoring is that PCA models are
time-invariant, while most real processes are time-variant. It is therefore important for a PCA model to be updated
recursively.
PCA Algorithm
Step 1: Get some data.
Step 2: Subtract the mean: get the mean-subtracted data set (DataAdjust).
Step 3: Calculate the covariance matrix.
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step 5: Choose components and form a feature vector:
FeatureVector = [eig1, eig2, ..., eigN]
Step 6: Derive the new data set:
FinalData = RowFeatureVector × RowDataAdjust
Each eigenvector represents a principal component. PC1 (Principal Component 1) is defined as the eigenvector
with the highest corresponding eigenvalue. The individual eigenvalues are numerically related to the variance
captured by the PCs: the higher the eigenvalue, the more variance the component has captured.
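The six algorithm steps above can be sketched with NumPy; the function name `pca` and the small two-variable data set are illustrative assumptions, not part of the source:

```python
import numpy as np

def pca(data, n_components):
    """Sketch of the PCA steps listed above (illustrative helper, not a library API)."""
    # Step 2: subtract the mean of each variable (DataAdjust)
    data_adjust = data - data.mean(axis=0)
    # Step 3: covariance matrix of the variables
    cov = np.cov(data_adjust, rowvar=False)
    # Step 4: eigenvectors and eigenvalues (eigh: the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 5: order by decreasing eigenvalue and keep the top components (FeatureVector)
    order = np.argsort(eigvals)[::-1]
    feature_vector = eigvecs[:, order[:n_components]]
    # Step 6: project the mean-adjusted data onto the feature vector (FinalData)
    final_data = data_adjust @ feature_vector
    return final_data, eigvals[order]

# Step 1: get some data (two strongly correlated variables)
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
                 [1.5, 1.6], [1.1, 0.9]])
scores, eigvals = pca(data, n_components=1)
print(eigvals)  # PC1 has the largest eigenvalue, i.e. captures the most variance
```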
Partial Least Squares (PLS) regression is based on a linear transition from a large number of original descriptors
to a new variable space based on a small number of orthogonal factors (latent variables). In other words, the factors
are mutually independent (orthogonal) linear combinations of the original descriptors. Unlike some similar
approaches (e.g. principal component regression, PCR), the latent variables are chosen in such a way as to provide
maximum correlation with the dependent variable; thus, a PLS model contains the smallest necessary number of
factors (Høskuldsson, 1988).
This concept is illustrated by Fig. 1, representing a hypothetical data set with two independent variables x1 and
x2 and one dependent variable y. It can easily be seen that the original variables x1 and x2 here are strongly
correlated. From them, we change to two orthogonal factors (latent variables) t1 and t2 that are linear
combinations of the original descriptors. As a result, a single-factor model can be obtained that relates the
activity y to the first latent variable t1.
Basic algorithm of the PLS method [Martens & Naes, 1989] for the step of building the i-th factor,
where N = number of compounds (samples),
M = number of descriptors (variables),
X[N×M] = descriptor matrix,
y[N] = activity vector,
w[M] = auxiliary weight vector,
t[N] = factor coefficient (score) vector,
p[M] = loading vector,
q = scalar coefficient of the relationship between factor and activity.
All vectors are columns; entities without an index are for the current (i-th) factor.
Latent variables are the linear combinations of the original descriptors (with coefficients represented by the loading vector p).
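In the notation above (X[N×M], y[N], w[M], t[N], p[M], q), the construction of one PLS factor can be sketched as follows. This is a minimal sketch assuming the standard NIPALS-style formulation for a single activity vector; the helper `pls_factor` and the synthetic data are illustrative, not the authors' exact algorithm:

```python
import numpy as np

def pls_factor(X, y):
    """Build one PLS factor; return w, t, p, q and the deflated X and y."""
    w = X.T @ y
    w /= np.linalg.norm(w)        # auxiliary weight vector w[M], unit length
    t = X @ w                     # factor (score) vector t[N]
    tt = t @ t
    p = X.T @ t / tt              # loading vector p[M]
    q = (t @ y) / tt              # scalar coefficient between factor and activity
    X_new = X - np.outer(t, p)    # deflate: remove the part explained by this factor
    y_new = y - q * t
    return w, t, p, q, X_new, y_new

# Synthetic example: y depends linearly on three descriptors plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
w, t, p, q, X1, y1 = pls_factor(X - X.mean(axis=0), y - y.mean())
print(np.linalg.norm(y1) < np.linalg.norm(y - y.mean()))  # True: the residual shrinks
```

Each subsequent factor is built the same way from the deflated X1 and y1, which is why the factors come out mutually orthogonal.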
One approach is to perform a principal component analysis of the X matrix and then use the principal
components of X as regressors on Y. The orthogonality of the principal components
eliminates the multicollinearity problem. Here, however, nothing guarantees that the principal
components which explain X are relevant for Y. By contrast, PLS regression finds
components from X that are also relevant for Y. Specifically, PLS regression searches for
a set of components (called latent vectors) that performs a simultaneous decomposition
of X and Y with the constraint that these components explain as much as possible of the
covariance between X and Y. This step generalizes PCR. It is followed by a regression
step where the decomposition of X is used to predict Y.
Fig. 1: Transformation of original descriptors to latent variables (a) and construction of an activity model
containing one PLS factor (b).
Martens H., Naes T. Multivariate Calibration. Chichester etc.: Wiley, 1989.
Høskuldsson A. PLS regression methods. J. Chemometrics, 1988, 2(3), 211–228.