
ISEN 613

Statistical Modeling for Steel Plate Fault Prediction Using Various Classification Methods

Instructor: Prof. Li Zeng

Ameya R Paranjpe (824006004)

Rohit Konusu (524004902)

Soumya Roy (124009008)

Uday Dhembare (624008668)



ACKNOWLEDGEMENT

Before we get into the thick of things, we would like to thank our supervisor, Prof. Li Zeng, for her valuable guidance and advice; she greatly inspired us to work on this project. It gives us immense pleasure to present this project report on Statistical Modeling for Steel Plate Fault Prediction Using Various Classification Methods. We would also like to thank her for showing us examples related to the topic of our project, and we thank Texas A&M University for providing a good environment and facilities to complete this work. Finally, we would like to thank the Department of Industrial and Systems Engineering for offering the course Engineering Data Analysis, which gave us the opportunity to learn about the application of statistics and data analysis to real-world situations.


INDEX

1. Introduction
2. Data
3. Methodology
4. Without Dimension Reduction
   a. Naïve Bayes using KNIME
   b. LDA
   c. KNN
   d. Classification Tree
   e. Bagging
   f. Random Forests
   g. SVM
   h. C5.0
5. With PCA
   a. Naïve Bayes using KNIME
   b. LDA
   c. KNN
   d. Classification Tree
   e. Bagging
   f. Random Forests
   g. SVM
   h. C5.0
6. Results
7. Conclusion
8. Future Scope


INTRODUCTION

The purpose of our project is to develop a model that analyzes the input characteristics of a manufactured steel plate and identifies the type of fault or manufacturing defect in it. Based on the type of fault, we can deduce which characteristics need to be worked on.
A fault may be defined as a deviation from a standard procedure or characteristic of a system. The main purpose of any fault diagnosis system is to determine the location and occurrence time of possible faults on the basis of accessible data and knowledge about the performance of the diagnosed processes. Modern manufacturing industries rely on fault diagnosis to maintain their edge by providing high-quality products and services, and intelligent data mining and statistical techniques such as decision trees, support vector machines, K-nearest neighbours and C5.0 allow failures and faults to be detected with high accuracy for various kinds of products.
In the steel industry, and specifically in alloy steel, producing defective products imposes a high cost on the manufacturer. Such defects may result from inferior materials, improper production processes, and similar causes, and correcting them at a later stage involves considerable rework and wastage of time and capital. We therefore see a need for a model that can predict the type of fault by analyzing the various attributes of the product. This would also let industries learn which parameters to control, with high accuracy, to build products with minimum defects.
We have incorporated a modern classification technique, the C5.0 algorithm, to develop and test our model. C5.0 builds decision trees using the concept of information gain: at each node of the tree, it chooses the attribute that most effectively splits the set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (reduction in entropy), and the attribute with the highest normalized information gain is chosen to make the decision; the algorithm then recurses on the smaller sublists. Unlike its predecessor C4.5, C5.0 is memory-efficient and faster, and it supports boosting, weighting and winnowing.
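
For concreteness, the criterion in the previous paragraph can be written out; this is the standard gain-ratio formulation of C4.5/C5.0-style trees (our notation, not taken verbatim from the C5.0 documentation). For a sample set $S$ at a node with class proportions $p_k$, and an attribute $A$ splitting $S$ into subsets $S_v$:

$$H(S) = -\sum_{k} p_k \log_2 p_k$$

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)$$

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{-\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}$$

The denominator (split information) normalizes the gain so that attributes with many distinct values are not automatically favored.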
Our overall workflow is: collect and prepare the data, identify and remove outliers, split the data into training and test sets, fit each classification method both without and with dimension reduction (PCA), and compare test misclassification rates.


DATA

The data for our project was obtained from the UCI Machine Learning Repository (Steel Plates Faults dataset).
The main aim of our project is to develop a model that predicts the fault in a given steel plate sample with specified attributes and classifies it into one of seven fault types. We have therefore split the dataset into training data (80%) and test data (20%).

There are 27 attributes, 7 fault types, and a total of 1941 observations (instances); the attribute names are self-explanatory. In the raw data, each fault type is indicated by a binary 1-0 column (True/False), and we convert these columns into a single column indicating the respective fault type.

Note: columns X, X.1 and X.2 are neither attributes nor fault classes; they are dummy variables generated by R while importing the dataset.

METHODOLOGY
Data collection and preparation:
We first create a column indicating the type of fault so that the data can be used with classification models. The next step is identifying outliers, which we define as observations, possibly caused by error, whose values differ significantly from standard observations.
Process for identifying outliers:
1) Find the mean, minimum, maximum and median value of each attribute.
2) Find the standard deviation of each attribute.
3) Calculate Mu+3*Sigma and Mu-3*Sigma, which covers roughly 99.7% of a normal distribution.
4) Calculate (Mu+3*Sigma)-Maximum and Minimum-(Mu-3*Sigma). For some attributes the value of (Mu+3*Sigma)-Maximum was highly negative, i.e. the maximum lies far above the three-sigma band, so there is a high possibility of outliers and we proceed to locate them.
5) We then analyze the distribution from the 90th through the 100th percentile and from the 0th through the 10th percentile. We observed a sudden spike in the values of certain attributes between the 99.95th and 99.99th percentiles.
6) After applying a filter to the data, we found that observation (row) number 393 is an outlier and removed it. (A minimal R sketch of steps 1-4 is given below.)
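
The following is a minimal R sketch of steps 1-4, assuming the raw data frame is named plate_raw with the 27 numeric attributes in the first 27 columns (the object and column indices here are illustrative, not from our scripts):

> num <- plate_raw[,1:27]                                # the 27 numeric attributes
> mu <- sapply(num, mean); sigma <- sapply(num, sd)      # steps 1-2: per-attribute summaries
> upper <- mu + 3*sigma; lower <- mu - 3*sigma           # step 3: three-sigma band
> data.frame(over=upper - sapply(num, max), under=sapply(num, min) - lower)  # step 4: a negative "over" flags possible outliers
> which(apply(num, 1, function(r) any(r > upper | r < lower)))               # candidate outlier rows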

Applying Classification Methods


Packages of R used in the project: ISLR, MASS, car, class, gtools, pls, tree, randomForest, e1071, C50.

The outlier (Row ID 393) has already been removed from the dataset file loaded below.

> plate = read.table("S:/ISEN 613/project/Fault_steel_plates_1_col_no_outlier.txt", header=TRUE)
> unique(is.na(plate))
The output contains only FALSE entries, i.e. there are no missing values in any column of the dataset.
> names(plate)

> dim(plate)
[1] 1940 28
> index1=sample(1:nrow(plate),size=0.2*nrow(plate))
The dataset is split into 80% training and 20% test. (For reproducibility, a random seed, e.g. set.seed(1), should be fixed before sampling.)
> test1 <- plate[index1,]
> dim(test1)
[1] 388 28
> training1 <- plate[-index1,]
> dim(training1)
[1] 1552 28

WITHOUT DIMENSION REDUCTION

Naïve Bayes Using KNIME


This is a Naïve Bayes model created from the training data in KNIME to predict the fault type in the test data. In the KNIME workflow, a File Reader node reads the steel plates fault data and a Partitioning node splits it into training and test sets (80% and 20%). The training data is used to learn the Naïve Bayes model, which then produces predictions on the test data, and we finally check the confusion matrix reported by the Scorer node.

The accuracy of the prediction is (12+15+55+9+5+32+121)/388 = 64.18% (misclassification rate 35.82%).
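
For readers without KNIME, an equivalent model can be sketched in R with the e1071 package already used elsewhere in this project (a sketch only; the numbers above come from KNIME):

> nb.fit <- naiveBayes(Fault_Types~., data=training1)
> nb.pred <- predict(nb.fit, test1)
> table(nb.pred, test1$Fault_Types)          # confusion matrix
> mean(nb.pred != test1$Fault_Types)         # test misclassification rate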

LDA

> lda.fit <- lda(Fault_Types~., data=training1)


> lda.test <- predict(lda.fit, test1)
> table(lda.test$class,test1$Fault_Types)
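
The test misclassification rate reported in the Results section can be read off this confusion matrix, or computed directly (a one-line sketch):

> mean(lda.test$class != test1$Fault_Types)  # test misclassification rate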

KNN

> training_knn <- training1[,1:27]


> test_knn <- test1[,1:27]
> training_resp_knn <- training1[,28]
> knn.pred=knn(training_knn, test_knn, training_resp_knn, k=7)
> table(knn.pred,test1$Fault_Types)
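
One caveat: KNN is distance-based and the 27 attributes are on very different scales, which likely contributes to its poor performance here (see Results). A hedged variant, standardizing the predictors with the training means and standard deviations, would look like:

> training_knn_s <- scale(training_knn)
> test_knn_s <- scale(test_knn, center=attr(training_knn_s,"scaled:center"), scale=attr(training_knn_s,"scaled:scale"))
> knn.pred.s <- knn(training_knn_s, test_knn_s, training_resp_knn, k=7)
> mean(knn.pred.s != test1$Fault_Types)      # compare with the unscaled fit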

CLASSIFICATION TREE

> plate_tree <- tree(Fault_Types ~ ., data = training1)


> summary(plate_tree)

Classification tree:
tree(formula = Fault_Types ~ ., data = training1)
Variables actually used in tree construction:
[1] "Log_X_Index." "TypeOfSteel_A300." "Pixels_Areas."
"Orientation_Index."
[5] "Luminosity_Index." "Steel_Plate_Thickness." "Length_of_Conveyer."
"Y_Minimum."
[9] "Maximum_of_Luminosity." "Square_Index." "X_Minimum."

Number of terminal nodes: 14


Residual mean deviance: 1.556 = 2394 / 1538
Misclassification error rate: 0.3202 = 497 / 1552
> plot(plate_tree)
> text(plate_tree, pretty = 0)
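
The summary above reports the training error (32.02%); the test error quoted in Results comes from predicting on the held-out set, along these lines (a sketch):

> tree.pred <- predict(plate_tree, test1, type="class")
> table(tree.pred, test1$Fault_Types)
> mean(tree.pred != test1$Fault_Types)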

PRUNING THE TREE

> cv.plate <- cv.tree(plate_tree)


> names(cv.plate)
[1] "size" "dev" "k" "method"
> plot(cv.plate$size,cv.plate$dev, type = "b")
(Note: cv.tree here cross-validates on deviance; for classification, cv.tree(plate_tree, FUN=prune.misclass) would cross-validate on the misclassification count instead.)

> prune.plate <- prune.tree(plate_tree, best = 13)


> summary(prune.plate)

Classification tree:
snip.tree(tree = plate_tree, nodes = 20L)
Variables actually used in tree construction:
 [1] "Log_X_Index."           "TypeOfSteel_A300."      "Pixels_Areas."          "Orientation_Index."
 [5] "Luminosity_Index."      "Steel_Plate_Thickness." "Length_of_Conveyer."    "Y_Minimum."
 [9] "Square_Index."          "X_Minimum."
Number of terminal nodes: 13
Residual mean deviance: 1.59 = 2446 / 1539
Misclassification error rate: 0.3241 = 503 / 1552
> plot(prune.plate)
> text(prune.plate, pretty = 0)


BAGGING

> bag.plate = randomForest(Fault_Types~., data = training1, mtry=27, importance=TRUE)
Setting mtry to all 27 predictors makes randomForest consider every variable at each split, i.e. it performs bagging.
> bag.plate

> importance(bag.plate)
> varImpPlot(bag.plate)
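
As with the tree, the test error in Results follows from predicting on test1 (a sketch; the same pattern applies to the random forest below):

> yhat.bag <- predict(bag.plate, newdata=test1)
> mean(yhat.bag != test1$Fault_Types)        # test misclassification rate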


RANDOM FOREST

> forest.plate = randomForest(Fault_Types~., data=training1, mtry=5, importance=TRUE)
Here mtry=5 is approximately sqrt(27), the usual choice for classification forests.
> forest.plate

SVM LINEAR

> set.seed(1)
> svmfitdeg1=svm(Fault_Types~.,data=training1,kernel="linear")
> summary(svmfitdeg1)
Number of Support Vectors: 937
( 108 69 52 16 44 267 381 )

SVM POLYNOMIAL DEGREE 2

> svmdeg2=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=2, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg2)
Parameter tuning of svm:
- sampling method: 10-fold cross validation


- best parameters: cost 8
- best performance: 0.2558644
> svmfitdeg2=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=2,cost=8)
> summary(svmfitdeg2)
Number of Support Vectors: 959
( 104 82 81 23 35 250 384 )

SVM POLYNOMIAL DEGREE 3

> svmdeg3=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=3, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg3)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 8
- best performance: 0.2403143
> svmfitdeg3=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=3,cost=8)
> summary(svmfitdeg3)
Number of Support Vectors: 957
( 107 83 52 25 32 255 403 )

SVM POLYNOMIAL DEGREE 4

> svmdeg4=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=4, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg4)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 15
- best performance: 0.2628867
> svmfitdeg4=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=4,cost=15)
> summary(svmfitdeg4)
Number of Support Vectors: 1014
( 106 93 64 23 30 266 432 )

SVM RADIAL

> svmradial=tune(svm,Fault_Types~.,data=training1,kernel="radial",ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200),gamma=c(0.5,1,2,3,4)))
> summary(svmradial)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 3, gamma 0.5
- best performance: 0.3015136
> svmfitradial=svm(Fault_Types~.,data=training1,kernel="radial",gamma=0.5,cost=3)
> summary(svmfitradial)
Number of Support Vectors: 1333
( 126 135 139 44 40 312 537 )
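
The test errors in the Results table come from applying each fitted SVM to test1; for the radial fit, for instance (a sketch, with the same pattern for the other kernels):

> svm.pred <- predict(svmfitradial, test1)
> table(svm.pred, test1$Fault_Types)
> mean(svm.pred != test1$Fault_Types)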

C5.0
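
The C5.0 output itself is not reproduced in this text version, but the fit uses the C50 package and follows this pattern (a sketch; the trials value shown is illustrative, not the exact setting we used):

> library(C50)
> c50.fit <- C5.0(Fault_Types~., data=training1, trials=10)  # trials>1 enables boosting
> c50.pred <- predict(c50.fit, test1)
> mean(c50.pred != test1$Fault_Types)        # test misclassification rate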


WITH PCA

> pca_predictors <- plate[,1:27]


> pca_response <- plate[,28]
> pr.out <- prcomp(pca_predictors, scale = TRUE)
> names(pr.out)
[1] "sdev" "rotation" "center" "scale" "x"
> pr.var=pr.out$sdev^2
> pve=pr.var/sum(pr.var)
> plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type='b')

> plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b')

> cumsum(pve)
 [1] 0.3505864 0.4775133 0.5779272 0.6512180 0.7169409 0.7736171 0.8239674 0.8583018 0.8873924 0.9148872 0.9379750
[12] 0.9549611 0.9684947 0.9781024 0.9852856 0.9903415 0.9931043 0.9954240 0.9973215 0.9989504 0.9996252 0.9999104
[23] 0.9999815 0.9999995 1.0000000 1.0000000 1.0000000
The first 10 principal components explain about 91.5% of the variance (cumulative PVE 0.9148872), so we retain 10 components.
> pca_data <- data.frame(pr.out$x)[1:10]
> pca_response <- data.frame(pca_response)
> pca_data <- cbind(pca_data, pca_response)
> names(pca_data)[11] <- "Fault_Types"
> write.table(pca_data, "S:/ISEN 613/project/Fault_steel_plates_AferPCA.txt", sep="\t")
> index1 <- sample(1:nrow(pca_data),size=0.2*nrow(pca_data))


> training1 <- pca_data[-index1, ]


> test1 <- pca_data[index1,]
> dim(training1)
[1] 1552 11
> dim(test1)
[1] 388 11

Naïve Bayes Using KNIME

This model is the same as the Naïve Bayes model used above on the original 27 predictors, but its input is the 10 principal components plus the response column. The accuracy of the prediction is (4+4+66+1+19+120+0)/388 = 55.15% (misclassification rate 44.85%).

LDA


KNN

CLASSIFICATION TREE
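
The LDA, KNN and classification tree outputs for the PCA data are not reproduced in this text version; the fits mirror the earlier commands, now on the 10-component training1 and test1 (a sketch, assuming the component columns sit in positions 1-10 and Fault_Types in position 11, as constructed above):

> lda.fit <- lda(Fault_Types~., data=training1)
> knn.pred <- knn(training1[,1:10], test1[,1:10], training1[,11], k=7)
> plate_tree <- tree(Fault_Types ~ ., data=training1)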

PRUNING

We did not prune the tree after dimension reduction, since splits on principal components, which are linear combinations of all 27 original attributes, cannot be interpreted in terms of the original variables.

BAGGING

> bag.plate = randomForest(Fault_Types~., data = training1, mtry=10, importance=TRUE)
With mtry equal to all 10 principal components, this again performs bagging.
> bag.plate


RANDOM FORESTS

> forest.plate = randomForest(Fault_Types~., data=training1, mtry=3, importance=TRUE)
Here mtry=3 is approximately sqrt(10).
> forest.plate

SVM LINEAR

> set.seed(1)
> svmfitdeg1=svm(Fault_Types~.,data=training1,kernel="linear")
> summary(svmfitdeg1)
Number of Support Vectors: 1018
( 117 86 65 30 41 283 396 )


SVM POLYNOMIAL DEGREE 2

> svmdeg2=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=2, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg2)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 200
- best performance: 0.286799
> svmfitdeg2=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=2,cost=200)
> summary(svmfitdeg2)
Number of Support Vectors: 882
( 103 63 61 23 22 249 361 )

SVM POLYNOMIAL DEGREE 3

> svmdeg3=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=3, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg3)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 20
- best performance: 0.2738007
> svmfitdeg3=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=3,cost=20)
> summary(svmfitdeg3)
Number of Support Vectors: 883
( 98 61 53 24 24 250 373 )


SVM POLYNOMIAL DEGREE 4

> svmdeg4=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=4, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg4)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 15
- best performance: 0.2622374
> svmfitdeg4=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=4,cost=15)
> summary(svmfitdeg4)
Number of Support Vectors: 942
( 97 75 68 29 24 250 399 )

SVM RADIAL

> svmradial=tune(svm,Fault_Types~.,data=training1,kernel="radial",ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200),gamma=c(0.5,1,2,3,4)))
> summary(svmradial)

Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 3, gamma 0.5
- best performance: 0.2409843
> svmfitradial=svm(Fault_Types~.,data=training1,kernel="radial",gamma=0.5,cost=3)
> summary(svmfitradial)
Number of Support Vectors: 1135
( 120 92 96 36 23 281 487 )


C5.0
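
As in the without-PCA case, the C5.0 fit follows the C50 sketch shown earlier, now applied to the 10-component training1 and test1; its test error appears in the Results table below.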

RESULTS

TEST MISCLASSIFICATION ERROR RATE


METHOD                          Without PCA   With PCA
LDA                                 28%           35%
KNN                                 54%           26%
CLASSIFICATION TREE (PRUNED)        30%            -
BAGGING                             22%           23%
RANDOM FOREST                       21%           23%
SVM POLYNOMIAL (DEGREE 2)           24%           26%
SVM POLYNOMIAL (DEGREE 3)           24%           26%
SVM POLYNOMIAL (DEGREE 4)           24%           24%
SVM LINEAR                          28%           23%
SVM RADIAL                          25%           22%
C5.0                                19%           24%
NAÏVE BAYES (USING KNIME)           36%           45%


CONCLUSION

Without dimensionality reduction

1. The best results were obtained using the advanced classification technique C5.0 (81% accuracy).
2. The most significant predictors, in order, are Log X Index, Type of Steel, X Minimum, Pixels Areas, Orientation Index, Length of Conveyer, Square Index and Luminosity Index.
3. Naïve Bayes classification, although a powerful technique, was not highly accurate due to the large number of classes.

With dimensionality reduction

1. The best results were obtained using an SVM with a radial kernel (78% accuracy).
2. Using the first 10 principal components, which explain 91.5% of the variance, we observed a drop in accuracy for most methods, and the classes were better separated by a radial boundary.

FUTURE SCOPE

As we are provided only with a dataset of steel plates that have defects, the current scope of our project is restricted to developing a model that determines the fault type from the given characteristics of a steel plate. If a comprehensive dataset of steel plates both with and without faults were made available, we could extend the scope to first deciding whether the given characteristics indicate the existence of any defect and then identifying its class. Given the defect, we could further extend the scope to suggesting changes in the design characteristics of a steel plate that would prevent future occurrences of the same fault.

