Up to now, we have been concerned with methods that: Display complex information. Detect patterns or trends. Now we will introduce methods that can be used to classify samples based on models that we develop.
Classification problems
Level I: Simple classification into predefined categories. Level II: Level I + detection of outliers. Level III: Level II + prediction of an external property. Level IV: Level III + prediction of more than one property.
Classification Methods
Many methods have been developed, with new ones being published all of the time. We'll look at some representative approaches. Linear Learning Machine
Supported by XLStat
Classification Methods All of these methods are considered supervised learning. Initial assumptions regarding membership or properties are made when developing a model. An initial evaluation of the data using exploratory data analysis is useful.
The available methods and approaches may vary based on the package used.
Data sets
Needed to develop and evaluate a classification model. Training set: representative samples used to build the model; the modeling software uses the class information. Evaluation set: samples of known class, used to test the model; the modeling software does not know the classes. Test set: true unknowns.
Data pre-processing
With any of these methods, you may choose to do some sort of data preprocessing. Raw: fastest. Scaled: gives equal weight to the variables. PCA: can be used to reduce noise and insignificant variables.
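Scaling (autoscaling: mean-center each variable and divide by its standard deviation) is easy to sketch; a minimal NumPy version, with made-up values, might look like this:

```python
import numpy as np

def autoscale(X):
    """Give each variable equal weight: subtract the column mean and
    divide by the column standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Two variables on very different scales (hypothetical values).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Xs = autoscale(X)
# After scaling, every column has mean 0 and standard deviation 1,
# so neither variable dominates a distance-based classifier.
```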
Data pre-processing
With some data sets, you may also want to apply some other types of pre-processing. Example: spectral or chromatographic traces. Options may include smoothing, baseline correction, signal averaging, and using the first or second derivative.
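For trace data, smoothing and derivatives are one-liners with SciPy's Savitzky-Golay filter; the synthetic noisy peak below is purely illustrative, and the baseline correction shown is the crudest possible (constant offset):

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 201)
trace = np.exp(-((x - 5.0) ** 2)) + rng.normal(0.0, 0.02, x.size)  # noisy peak

smoothed = savgol_filter(trace, window_length=11, polyorder=3)            # smoothing
first_deriv = savgol_filter(trace, window_length=11, polyorder=3, deriv=1)  # 1st derivative
corrected = trace - trace.min()   # crude constant-offset baseline correction
```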
Creating an evaluation set The evaluation set is typically a sub-set of the training set that was omitted when building the model. Randomly pick a subset of the data. Randomly pick members from each class. Any approach that selectively removes a portion of the data could cause bias.
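One way to avoid bias is to draw the same fraction from each class at random (a stratified split). A sketch, with a hypothetical helper function and made-up labels:

```python
import numpy as np

def stratified_split(y, frac=0.25, seed=0):
    """Randomly pick `frac` of the samples from EACH class for the
    evaluation set, so no class is selectively over- or under-sampled."""
    rng = np.random.default_rng(seed)
    eval_idx = []
    for cls in np.unique(y):
        members = np.flatnonzero(y == cls)
        n_eval = max(1, int(round(frac * members.size)))
        eval_idx.extend(rng.choice(members, size=n_eval, replace=False))
    eval_idx = np.array(sorted(eval_idx))
    train_idx = np.setdiff1d(np.arange(len(y)), eval_idx)
    return train_idx, eval_idx

y = np.array(["A"] * 8 + ["B"] * 8)
train, evl = stratified_split(y, frac=0.25)
# 2 samples from each class end up in the evaluation set.
```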
Leave-one-out validation A standardized approach for validation of a model where each sample serves as an evaluation set. 1. Omit a single sample from the set 2. Build the model 3. Test the omitted sample 4. Repeat the above steps until each sample has been omitted and tested once.
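The four steps above map directly onto scikit-learn's LeaveOneOut splitter; here is a sketch with a KNN classifier standing in for the model (any classifier would do) and the iris data standing in for your training set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=3)   # build the model...
    model.fit(X[train_idx], y[train_idx])         # ...with one sample omitted
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

loo_accuracy = correct / len(y)   # fraction of omitted samples classified correctly
```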
Your data While leave-one-out testing is the best approach, it can be slow for large sets. An alternative is to leave two or more samples out with each pass. Samples should be randomly ordered in the matrix, and the same two (or more) samples should never be omitted together more than once.
Iris example
We'll return to the iris example dataset, using XLStat's built-in DA function. We're going to use autoscaled data.
DA with XLStat.
[DA score/loading plot: F1 (99.01 %) vs. F2 (0.99 %); variables include sepal width and petal length.]
Coffee example
[Score plot with samples labeled by class (1-3).]
This consisted of 6 types of coffee, identified based on MS data. To avoid collinearity and null-variable problems, PCA scores (first 5 components) were used.
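Feeding PCA scores into DA is a standard pipeline; a scikit-learn sketch of the same idea, with the iris data standing in for the coffee MS data and the component count chosen arbitrarily:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = make_pipeline(
    StandardScaler(),                # autoscale
    PCA(n_components=3),             # scores replace the correlated raw variables
    LinearDiscriminantAnalysis(),    # DA runs on the scores
)
model.fit(X, y)
train_accuracy = model.score(X, y)
```

Because the PCA scores are orthogonal, the DA step never sees the collinear raw variables.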
[DA score plot: F1 (56.19 %) vs. F2 (22.79 %); the six coffee types are labeled K, E, S, U, R, C.]
Classification trees
Predicts class membership by sequential application of rules based on predictor variables. With DA and LLM, you create a set of mathematical models that are all applied at once. With classification trees, the predictor variables are evaluated as ordinal rules, one at a time.
Classification trees
[Example tree: a density > 1 rule first splits solid from liquid; a later rule splits red from green.]
Iris example (yet again!) XLStat supports the use of classification and regression trees. Classification if the Y variable (class) is qualitative; regression if the Y variable is quantitative. The iris example is a classification example.
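The same kind of tree can be sketched with scikit-learn (standing in for XLStat here); a shallow tree on the iris data makes the one-variable-at-a-time ordinal rules visible:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the fitted rules: each split tests a single variable against a cutoff.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```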
Iris example
[Classification tree for the iris data, showing the class counts at each node, compared with the DA results.]
Wine example
Riesling vs. Chardonnay. Ohio vs. California. Assayed 5 organic and 4 trace-metal components. Yes, you'll do the same with your homework.
Rules:
If Ca in [17.5, 60.75[ then Class = CaC in 58.6% of cases.
If Ca in [60.75, 94.75[ then Class = CaR in 58.3% of cases.
If 2,3-butanediol in [0, 0.065[ and Ca in [17.5, 60.75[ then Class = CaR in 60% of cases.
If 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 70.8% of cases.
If Mn in [0.82, 1.625[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 100% of cases.
If Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 70% of cases.
If K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 60% of cases.
If K in [881.75, 1147.5[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 100% of cases.
If 1-hexanol in [0.638, 0.723[ and K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 100% of cases.
If 1-hexanol in [0.723, 1.056[ and K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 100% of cases.
If 1-hexanol in [0.409, 0.673[ and Ca in [60.75, 94.75[ then Class = OhR in 83.3% of cases.
If 1-hexanol in [0.673, 1.218[ and Ca in [60.75, 94.75[ then Class = CaR in 100% of cases.

[Tree diagram with per-node class counts (CaC, CaR, OhC, OhR) omitted.]
KNN attempts to assign categories to unknown samples based on multivariate proximity to other samples. It works best with discrete classification types and is tolerant of poor data sets. K = the number of closest neighbors being compared. Consider this the supervised version of HCA.
K nearest neighbor classification In its simplest form, KNN is conducted by: First, a training set is collected that contains examples of each class. Intersample distances are then calculated.
KNN
The distance matrix is sorted and the distance of the unknown sample can be compared to: 1. The K nearest neighbors 2. The nearest class cluster. Option 2 requires that K = 1.
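Option 1 (compare the K nearest neighbors) takes only a few lines of NumPy; the two clusters below are made up for illustration:

```python
import numpy as np

def knn_classify(X_train, y_train, x_unknown, k=3):
    """Classify one unknown by majority vote among its k nearest
    training samples (Euclidean distance)."""
    d = np.sqrt(((X_train - x_unknown) ** 2).sum(axis=1))  # distance to each sample
    nearest = np.argsort(d)[:k]                            # indices of the k closest
    classes, votes = np.unique(y_train[nearest], return_counts=True)
    return classes[np.argmax(votes)]

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # class "A" cluster
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.1]])   # class "B" cluster
y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(X, y, np.array([0.15, 0.1])))  # all 3 neighbors are "A"
```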
d_ab = sqrt( Σ_j (x_aj − x_bj)² ), summed over all N variables j.
KNN
When using the distance to a class, you can use the same link options that were discussed earlier. The distance can be based on: Single link - closest member of class. Complete link - farthest member of class. Centroid - center of class cluster.
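The three link options can be sketched in one small function; the class members and test point below are hypothetical:

```python
import numpy as np

def class_distance(X_class, x, link="centroid"):
    """Distance from sample x to a class, using the link options above."""
    d = np.sqrt(((X_class - x) ** 2).sum(axis=1))   # distance to each member
    if link == "single":        # closest member of the class
        return d.min()
    if link == "complete":      # farthest member of the class
        return d.max()
    # centroid: distance to the center of the class cluster
    return float(np.sqrt(((X_class.mean(axis=0) - x) ** 2).sum()))

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # hypothetical class members
x = np.array([2.0, 0.0])
d_single = class_distance(A, x, "single")      # to the nearest member [1, 0]
d_complete = class_distance(A, x, "complete")  # to the farthest member [0, 1]
d_centroid = class_distance(A, x, "centroid")  # to the class mean [1/3, 1/3]
```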
K=3
KNN
Ideally, if a test sample falls well within a known class, its closest neighbors should all be of one class.
With this approach, the distance to the center of a class cluster is determined and compared.
Here, all of the blue samples would be closer to the unknown than any of the green.
Mycobacteria - HCA
[Dendrogram of the mycobacteria samples, labeled by class (42-49).]
Mycobacteria - k means
A quick review of ALL of the ways that this data set was difficult to get useful information from.
Mycobacteria - PCA
[PCA score plot; legend: classes 42, 43, 44, 45, 46, 47, 49.]
Mycobacteria - DA
[DA score plot: F1 (56.45 %) vs. F2 (29.63 %), samples labeled by class.]
What if a sample's distances are such that it could be in more than one class? When there is more than one possible class, we can take a vote. The class with the most votes wins.
K=5
Here you would end up with 3 votes for B and 2 for A. B would win.
Here you would end up with 2 votes for A and one for B; A would win, and the distances would be smaller.
KNN validation
The optimum number for K can be found by trial and error, but for a close match it should make no difference. The classifying power of your data can be evaluated by leave-one-out validation of your training set. This should be done before any sort of real classification begins.
Here, A and B would tie. The tie-breaker would be that A averages a smaller distance, so it would be declared the winner.
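The voting rule, including the average-distance tie-breaker, can be sketched directly (hypothetical helper and distances):

```python
import numpy as np

def knn_vote(dists, labels, k):
    """Majority vote among the k nearest neighbors; ties go to the
    class with the smaller average distance."""
    order = np.argsort(dists)[:k]
    near_d, near_y = dists[order], labels[order]
    best_cls, best_key = None, None
    for cls in np.unique(near_y):
        mask = near_y == cls
        key = (-int(mask.sum()), float(near_d[mask].mean()))  # votes first, then distance
        if best_key is None or key < best_key:
            best_cls, best_key = cls, key
    return best_cls

d = np.array([0.5, 0.6, 1.0, 1.1])    # hypothetical neighbor distances
y = np.array(["A", "A", "B", "B"])
winner = knn_vote(d, y, k=4)          # 2-2 tie: A averages 0.55 vs. B's 1.05
```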
KNN validation
Validation You can sequentially leave out each of your samples and test it for votes at several K values. You end up with a vote matrix that will tell you the optimum K value for each class. You will also get a misclassification matrix; this tells you how often one of your knowns is incorrectly classified.
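With scikit-learn, that whole procedure (leave-one-out predictions at several K values plus a misclassification matrix) is a short loop; the iris data again stands in for your training set:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
accuracies = {}
for k in (1, 3, 5, 7):
    pred = cross_val_predict(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=LeaveOneOut())
    accuracies[k] = float(np.mean(pred == y))

# Off-diagonal entries count knowns that were incorrectly classified (for K=7).
cm = confusion_matrix(y, pred)
```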
Iris example
Cola example
What? NOT the iris data set? Headspace MS of 4 cola classes: two cola brands, diet and regular; m/e 44 - 149. May need to preprocess to eliminate any nonvariant data.
Class 1: Brand 1
Class 2: Diet brand 1
Class 3: Brand 2
Class 4: Diet brand 2
PCA scores
PCA scores
PCA loadings
KNN classification
Not a bad job!
KNN classifications
SIMCA
Soft Independent Modeling of Class Analogy
A method of classification that provides: Detection of outliers. Estimates of confidence for a classification. Determination of potential membership in more than a single class.
SIMCA
Basic approach. For each class of samples, a PCA model is constructed. This model is based on the optimum number of components that best clusters an individual class. The optimum number of components can vary from class to class and can be determined by cross-validation.
SIMCA models
Since the number of components used can vary, each class will be best described by its own hypervolume.
SIMCA models
Limiting a class hypervolume. You can limit the size of a hypervolume by setting a standard deviation cutoff. This results in better defined classes.
SD = 3
SD = 2
SIMCA models
Once a model has been created for each class, you are ready to classify unknowns. For each model/sample combination: + The sample is transformed into PC space and compared to see if it is a likely class member. + If it is within the hypervolume of a single class, you have a match.
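A bare-bones sketch of the idea (not Pirouette's implementation): fit one PCA model per class, derive a residual cutoff from the training residuals as a stand-in for the SD-based hypervolume limit, and admit an unknown to every class whose cutoff it falls inside. All names, data, and the mean + 2 SD cutoff here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_simca(X_by_class, n_components):
    """One PCA model per class; residual cutoff = mean + 2 SD of the
    training residuals (a stand-in for the SD hypervolume limit)."""
    models = {}
    for cls, X in X_by_class.items():
        pca = PCA(n_components=n_components[cls]).fit(X)
        resid = np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1)
        models[cls] = (pca, resid.mean() + 2.0 * resid.std())
    return models

def simca_classify(models, x):
    """Return every class whose cutoff admits x: may be empty (outlier)
    or contain more than one class."""
    hits = []
    for cls, (pca, cutoff) in models.items():
        recon = pca.inverse_transform(pca.transform(x[None, :]))[0]
        if np.linalg.norm(x - recon) <= cutoff:
            hits.append(cls)
    return hits

rng = np.random.default_rng(1)
t = rng.normal(size=(200, 1))
A = t * np.array([1.0, 1.0, 0.0]) + rng.normal(0, 0.05, (200, 3))  # class A cluster
B = t * np.array([0.0, 1.0, 1.0]) + rng.normal(0, 0.05, (200, 3))  # class B cluster
models = fit_simca({"A": A, "B": B}, {"A": 1, "B": 1})
result = simca_classify(models, np.array([1.0, 1.0, 0.0]))  # lies along A's cluster
```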
SIMCA classification
The potential still exists for a sample to be classified as a member of more than one class.
SIMCA classification
SIMCA will give you an estimate of the probability of class membership. Example, two possible classes:
Class A: probability 0.90
Class B: probability 0.45
Here, the sample is more likely to be a member of Class A.
SIMCA summary
Of the methods covered, SIMCA offers the most options for developing a classification model when the classes are well known. It also requires the most development time as you must determine the optimum model conditions for each class. If used, plan on spending quite a bit of time working with all of the available options.
Note: We have a separate model for each class in the data set - in this case three.
Pirouette will provide an estimate of the class hypervolumes based on the first three PCs.
Cola example
With the cola example (two brands, diet and regular), we have 4 classes. Here you can see that the classes are pretty well resolved.
Cola example
Mycobacteria again
This data set is included with the Pirouette demo. File = Mycosing.wks It is a subset of the version I've been using (only 72 samples).
Mycobacteria SIMCA
Perfect classifications - a first for this dataset.
Mycobacteria SIMCA
Mycobacteria SIMCA
Example shows that a different number of components were used in developing the individual SIMCA hypervolumes.
Discriminating Power is a measure of which variables show the biggest class differences.
Mycobacteria SIMCA
Modeling power indicates the relative importance of each variable for classification.
Mycobacteria SIMCA
PC plots are pretty boring since you only have one class. However, they can be used to see if you have any sub-classes.
Loadings, as always, show the relative significance of each variable in constructing each PC. Here they are relatively unimportant.
Outliers are tested for by plotting sample residuals (the difference between a sample and the center of the hypervolume) vs. their Mahalanobis distance from the center of the cluster. The Mahalanobis distance is similar to a Euclidean distance but takes into account correlations in the data and is scale invariant.
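The Mahalanobis distance itself is simple to compute; a NumPy sketch on a made-up correlated cluster:

```python
import numpy as np

def mahalanobis(x, X_class):
    """Distance from x to the class center, scaled by the class covariance:
    scale-invariant and aware of variable correlations."""
    mu = X_class.mean(axis=0)
    diff = x - mu
    cov_inv = np.linalg.inv(np.cov(X_class, rowvar=False))
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(0)
# A correlated two-variable class cluster (illustrative data).
X = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
d_center = mahalanobis(np.array([0.0, 0.0]), X)   # near the center: small
d_far = mahalanobis(np.array([10.0, 10.0]), X)    # far outside: large
```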
Mycobacteria SIMCA