
ISEN 613

Statistical Modeling for Steel Plate Fault Prediction Using Various Classification Methods

Instructor: Prof. Li Zeng

Ameya R Paranjpe (824006004)

Rohit Konusu (524004902)

Soumya Roy (124009008)

Uday Dhembare (624008668)



ACKNOWLEDGEMENT

Before we get into the thick of things, we would like to thank our supervisor, Prof. Li Zeng, for her valuable guidance and advice; she greatly inspired us to work on this project. It gives us immense pleasure to present this project report on Statistical Modeling for Steel Plate Fault Prediction Using Various Classification Methods. We would also like to thank her for showing us examples related to the topic of our project, and we thank Texas A&M University for providing a good environment and facilities to complete this work. Finally, we would like to thank the Department of Industrial and Systems Engineering for offering the course Engineering Data Analysis, which gave us the opportunity to learn about the application of statistics and data analysis to real-world situations.


INDEX

1. Introduction
2. Data
3. Methodology
4. Without Dimension Reduction
   a. Naïve Bayes using KNIME
   b. LDA
   c. KNN
   d. Classification Tree
   e. Bagging
   f. Random Forests
   g. SVM
   h. C5.0
5. With PCA
   a. Naïve Bayes using KNIME
   b. LDA
   c. KNN
   d. Classification Tree
   e. Bagging
   f. Random Forests
   g. SVM
   h. C5.0
6. Results
7. Conclusion
8. Future Scope


INTRODUCTION

The purpose of our project is to develop a model that analyzes the input characteristics of a manufactured steel plate and identifies the type of fault or manufacturing defect in it. Based on the type of fault, we can deduce which characteristics need to be worked on.
A fault may be defined as a deviation from a standard procedure or characteristic of a system. The main purpose of any fault diagnosis system is to determine the location and occurrence time of possible faults on the basis of accessible data and knowledge about the performance of the diagnosed processes. Modern manufacturing industries rely on fault diagnosis to maintain their edge by providing high-quality products and services, and intelligent data mining and statistical techniques such as decision trees, support vector machines, K-nearest neighbours and C5.0 allow failures and faults to be detected with high accuracy for various kinds of products.
In the steel industry, and specifically in alloy steel, producing defective products imposes a high cost on the manufacturer. Such defects may result from inferior materials, improper production processes, and similar causes, and correcting them at a later stage involves considerable rework and wastage of time and capital. We therefore see a need for a model that can predict the type of fault by analyzing the various attributes of the product. This would also let industries learn which parameters to control, with high accuracy, to build products with minimum defects.
We have incorporated a modern classification technique, the C5.0 algorithm, to develop and test our model. C5.0 builds decision trees using the concept of information gain: at each node of the tree, it chooses the attribute that most effectively splits the set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (reduction in entropy), and the attribute with the highest normalized information gain is chosen to make the decision; the algorithm then recurses on the smaller sublists. Unlike its predecessor C4.5, C5.0 is memory-efficient and faster, and it supports boosting, weighting and winnowing.
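
For concreteness, the criterion in the previous paragraph can be written out; this is the standard gain-ratio formulation of C4.5/C5.0-style trees (our notation, not taken verbatim from the C5.0 documentation). For a sample set $S$ at a node with class proportions $p_k$, and an attribute $A$ splitting $S$ into subsets $S_v$:

$$H(S) = -\sum_{k} p_k \log_2 p_k$$

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)$$

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{-\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}$$

The denominator (split information) normalizes the gain so that attributes with many distinct values are not automatically favored.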
Our overall workflow is: collect and prepare the data, identify and remove outliers, split the data into training and test sets, fit each classification method both without and with dimension reduction (PCA), and compare test misclassification rates.


DATA

The data for our project was obtained from the UCI Machine Learning Repository (Steel Plates Faults dataset).
The main aim of our project is to develop a model that predicts the fault in a given steel plate sample with specified attributes and classifies it into one of seven fault types. We have therefore split the dataset into training data (80%) and test data (20%).

There are 27 attributes, 7 fault types, and a total of 1941 observations (instances); the attribute names are self-explanatory. In the raw data, each fault type is indicated by a binary 1-0 column (True/False), and we convert these columns into a single column indicating the respective fault type.

Note: columns X, X.1 and X.2 are neither attributes nor fault classes; they are dummy variables generated by R while importing the dataset.

METHODOLOGY
Data collection and preparation:
We first create a column indicating the type of fault so that the data can be used with classification models. The next step is identifying outliers, which we define as observations, possibly caused by error, whose values differ significantly from standard observations.
Process for identifying outliers:
1) Find the mean, minimum, maximum and median value of each attribute.
2) Find the standard deviation of each attribute.
3) Calculate Mu+3*Sigma and Mu-3*Sigma, which covers roughly 99.7% of a normal distribution.
4) Calculate (Mu+3*Sigma)-Maximum and Minimum-(Mu-3*Sigma). For some attributes the value of (Mu+3*Sigma)-Maximum was highly negative, i.e. the maximum lies far above the three-sigma band, so there is a high possibility of outliers and we proceed to locate them.
5) We then analyze the distribution from the 90th through the 100th percentile and from the 0th through the 10th percentile. We observed a sudden spike in the values of certain attributes between the 99.95th and 99.99th percentiles.
6) After applying a filter to the data, we found that observation (row) number 393 is an outlier and removed it. (A minimal R sketch of steps 1-4 is given below.)
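
The following is a minimal R sketch of steps 1-4, assuming the raw data frame is named plate_raw with the 27 numeric attributes in the first 27 columns (the object and column indices here are illustrative, not from our scripts):

> num <- plate_raw[,1:27]                                # the 27 numeric attributes
> mu <- sapply(num, mean); sigma <- sapply(num, sd)      # steps 1-2: per-attribute summaries
> upper <- mu + 3*sigma; lower <- mu - 3*sigma           # step 3: three-sigma band
> data.frame(over=upper - sapply(num, max), under=sapply(num, min) - lower)  # step 4: a negative "over" flags possible outliers
> which(apply(num, 1, function(r) any(r > upper | r < lower)))               # candidate outlier rows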

Applying Classification Methods


Packages of R used in the project: ISLR, MASS, car, class, gtools, pls, tree, randomForest, e1071, C50.

The outlier (Row ID 393) has already been removed from the dataset file loaded below.

> plate = read.table("S:/ISEN 613/project/Fault_steel_plates_1_col_no_outlier.txt", header=TRUE)
> unique(is.na(plate))
The output contains only FALSE entries, i.e. there are no missing values in any column of the dataset.
> names(plate)

> dim(plate)
[1] 1940 28
> index1=sample(1:nrow(plate),size=0.2*nrow(plate))
The dataset is split into 80% training and 20% test. (For reproducibility, a random seed, e.g. set.seed(1), should be fixed before sampling.)
> test1 <- plate[index1,]
> dim(test1)
[1] 388 28
> training1 <- plate[-index1,]
> dim(training1)
[1] 1552 28

WITHOUT DIMENSION REDUCTION

Naïve Bayes Using KNIME


This is a Naïve Bayes model created from the training data in KNIME to predict the fault type in the test data. In the KNIME workflow, a File Reader node reads the steel plates fault data and a Partitioning node splits it into training and test sets (80% and 20%). The training data is used to learn the Naïve Bayes model, which then produces predictions on the test data, and we finally check the confusion matrix reported by the Scorer node.

The accuracy of the prediction is (12+15+55+9+5+32+121)/388 = 64.18% (misclassification rate 35.82%).
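
For readers without KNIME, an equivalent model can be sketched in R with the e1071 package already used elsewhere in this project (a sketch only; the numbers above come from KNIME):

> nb.fit <- naiveBayes(Fault_Types~., data=training1)
> nb.pred <- predict(nb.fit, test1)
> table(nb.pred, test1$Fault_Types)          # confusion matrix
> mean(nb.pred != test1$Fault_Types)         # test misclassification rate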

LDA

> lda.fit <- lda(Fault_Types~., data=training1)


> lda.test <- predict(lda.fit, test1)
> table(lda.test$class,test1$Fault_Types)
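
The test misclassification rate reported in the Results section can be read off this confusion matrix, or computed directly (a one-line sketch):

> mean(lda.test$class != test1$Fault_Types)  # test misclassification rate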

KNN

> training_knn <- training1[,1:27]


> test_knn <- test1[,1:27]
> training_resp_knn <- training1[,28]
> knn.pred=knn(training_knn, test_knn, training_resp_knn, k=7)
> table(knn.pred,test1$Fault_Types)
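
One caveat: KNN is distance-based and the 27 attributes are on very different scales, which likely contributes to its poor performance here (see Results). A hedged variant, standardizing the predictors with the training means and standard deviations, would look like:

> training_knn_s <- scale(training_knn)
> test_knn_s <- scale(test_knn, center=attr(training_knn_s,"scaled:center"), scale=attr(training_knn_s,"scaled:scale"))
> knn.pred.s <- knn(training_knn_s, test_knn_s, training_resp_knn, k=7)
> mean(knn.pred.s != test1$Fault_Types)      # compare with the unscaled fit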

CLASSIFICATION TREE

> plate_tree <- tree(Fault_Types ~ ., data = training1)


> summary(plate_tree)

Classification tree:
tree(formula = Fault_Types ~ ., data = training1)
Variables actually used in tree construction:
[1] "Log_X_Index." "TypeOfSteel_A300." "Pixels_Areas."
"Orientation_Index."
[5] "Luminosity_Index." "Steel_Plate_Thickness." "Length_of_Conveyer."
"Y_Minimum."
[9] "Maximum_of_Luminosity." "Square_Index." "X_Minimum."

Number of terminal nodes: 14


Residual mean deviance: 1.556 = 2394 / 1538
Misclassification error rate: 0.3202 = 497 / 1552
> plot(plate_tree)
> text(plate_tree, pretty = 0)
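
The summary above reports the training error (32.02%); the test error quoted in Results comes from predicting on the held-out set, along these lines (a sketch):

> tree.pred <- predict(plate_tree, test1, type="class")
> table(tree.pred, test1$Fault_Types)
> mean(tree.pred != test1$Fault_Types)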

PRUNING THE TREE

> cv.plate <- cv.tree(plate_tree)


> names(cv.plate)
[1] "size" "dev" "k" "method"
> plot(cv.plate$size,cv.plate$dev, type = "b")
(Note: cv.tree here cross-validates on deviance; for classification, cv.tree(plate_tree, FUN=prune.misclass) would cross-validate on the misclassification count instead.)

> prune.plate <- prune.tree(plate_tree, best = 13)


> summary(prune.plate)

Classification tree:
snip.tree(tree = plate_tree, nodes = 20L)
Variables actually used in tree construction:
 [1] "Log_X_Index."           "TypeOfSteel_A300."      "Pixels_Areas."          "Orientation_Index."
 [5] "Luminosity_Index."      "Steel_Plate_Thickness." "Length_of_Conveyer."    "Y_Minimum."
 [9] "Square_Index."          "X_Minimum."
Number of terminal nodes: 13
Residual mean deviance: 1.59 = 2446 / 1539
Misclassification error rate: 0.3241 = 503 / 1552
> plot(prune.plate)
> text(prune.plate, pretty = 0)


BAGGING

> bag.plate = randomForest(Fault_Types~., data = training1, mtry=27, importance=TRUE)
Setting mtry to all 27 predictors makes randomForest consider every variable at each split, i.e. it performs bagging.
> bag.plate

> importance(bag.plate)
> varImpPlot(bag.plate)
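
As with the tree, the test error in Results follows from predicting on test1 (a sketch; the same pattern applies to the random forest below):

> yhat.bag <- predict(bag.plate, newdata=test1)
> mean(yhat.bag != test1$Fault_Types)        # test misclassification rate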


RANDOM FOREST

> forest.plate = randomForest(Fault_Types~., data=training1, mtry=5, importance=TRUE)
Here mtry=5 is approximately sqrt(27), the usual choice for classification forests.
> forest.plate

SVM LINEAR

> set.seed(1)
> svmfitdeg1=svm(Fault_Types~.,data=training1,kernel="linear")
> summary(svmfitdeg1)
Number of Support Vectors: 937
( 108 69 52 16 44 267 381 )

SVM POLYNOMIAL DEGREE 2

> svmdeg2=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=2, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg2)
Parameter tuning of svm:
- sampling method: 10-fold cross validation


- best parameters: cost 8
- best performance: 0.2558644
> svmfitdeg2=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=2,cost=8)
> summary(svmfitdeg2)
Number of Support Vectors: 959
( 104 82 81 23 35 250 384 )

SVM POLYNOMIAL DEGREE 3

> svmdeg3=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=3, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg3)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 8
- best performance: 0.2403143
> svmfitdeg3=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=3,cost=8)
> summary(svmfitdeg3)
Number of Support Vectors: 957
( 107 83 52 25 32 255 403 )

SVM POLYNOMIAL DEGREE 4

> svmdeg4=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=4, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg4)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 15
- best performance: 0.2628867
> svmfitdeg4=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=4,cost=15)
> summary(svmfitdeg4)
Number of Support Vectors: 1014
( 106 93 64 23 30 266 432 )

SVM RADIAL

> svmradial=tune(svm,Fault_Types~.,data=training1,kernel="radial",ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200),gamma=c(0.5,1,2,3,4)))
> summary(svmradial)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 3, gamma 0.5
- best performance: 0.3015136
> svmfitradial=svm(Fault_Types~.,data=training1,kernel="radial",gamma=0.5,cost=3)
> summary(svmfitradial)
Number of Support Vectors: 1333
( 126 135 139 44 40 312 537 )
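
The test errors in the Results table come from applying each fitted SVM to test1; for the radial fit, for instance (a sketch, with the same pattern for the other kernels):

> svm.pred <- predict(svmfitradial, test1)
> table(svm.pred, test1$Fault_Types)
> mean(svm.pred != test1$Fault_Types)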

C5.0
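
The C5.0 output itself is not reproduced in this text version, but the fit uses the C50 package and follows this pattern (a sketch; the trials value shown is illustrative, not the exact setting we used):

> library(C50)
> c50.fit <- C5.0(Fault_Types~., data=training1, trials=10)  # trials>1 enables boosting
> c50.pred <- predict(c50.fit, test1)
> mean(c50.pred != test1$Fault_Types)        # test misclassification rate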


WITH PCA

> pca_predictors <- plate[,1:27]


> pca_response <- plate[,28]
> pr.out <- prcomp(pca_predictors, scale = TRUE)
> names(pr.out)
[1] "sdev" "rotation" "center" "scale" "x"
> pr.var=pr.out$sdev^2
> pve=pr.var/sum(pr.var)
> plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type='b')

> plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b')

> cumsum(pve)
 [1] 0.3505864 0.4775133 0.5779272 0.6512180 0.7169409 0.7736171 0.8239674 0.8583018 0.8873924 0.9148872 0.9379750
[12] 0.9549611 0.9684947 0.9781024 0.9852856 0.9903415 0.9931043 0.9954240 0.9973215 0.9989504 0.9996252 0.9999104
[23] 0.9999815 0.9999995 1.0000000 1.0000000 1.0000000
The first 10 principal components explain about 91.5% of the variance (cumulative PVE 0.9148872), so we retain 10 components.
> pca_data <- data.frame(pr.out$x)[1:10]
> pca_response <- data.frame(pca_response)
> pca_data <- cbind(pca_data, pca_response)
> names(pca_data)[11] <- "Fault_Types"
> write.table(pca_data, "S:/ISEN 613/project/Fault_steel_plates_AferPCA.txt", sep="\t")
> index1 <- sample(1:nrow(pca_data),size=0.2*nrow(pca_data))


> training1 <- pca_data[-index1, ]


> test1 <- pca_data[index1,]
> dim(training1)
[1] 1552 11
> dim(test1)
[1] 388 11

Naïve Bayes Using KNIME

This model is the same as the Naïve Bayes model used above on the original 27 predictors, but its input is the 10 principal components plus the response column. The accuracy of the prediction is (4+4+66+1+19+120+0)/388 = 55.15% (misclassification rate 44.85%).

LDA


KNN

CLASSIFICATION TREE
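
The LDA, KNN and classification tree outputs for the PCA data are not reproduced in this text version; the fits mirror the earlier commands, now on the 10-component training1 and test1 (a sketch, assuming the component columns sit in positions 1-10 and Fault_Types in position 11, as constructed above):

> lda.fit <- lda(Fault_Types~., data=training1)
> knn.pred <- knn(training1[,1:10], test1[,1:10], training1[,11], k=7)
> plate_tree <- tree(Fault_Types ~ ., data=training1)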

PRUNING

We did not prune the tree after dimension reduction, since splits on principal components, which are linear combinations of all 27 original attributes, cannot be interpreted in terms of the original variables.

BAGGING

> bag.plate = randomForest(Fault_Types~., data = training1, mtry=10, importance=TRUE)
With mtry equal to all 10 principal components, this again performs bagging.
> bag.plate


RANDOM FORESTS

> forest.plate = randomForest(Fault_Types~., data=training1, mtry=3, importance=TRUE)
Here mtry=3 is approximately sqrt(10).
> forest.plate

SVM LINEAR

> set.seed(1)
> svmfitdeg1=svm(Fault_Types~.,data=training1,kernel="linear")
> summary(svmfitdeg1)
Number of Support Vectors: 1018
( 117 86 65 30 41 283 396 )


SVM POLYNOMIAL DEGREE 2

> svmdeg2=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=2, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg2)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 200
- best performance: 0.286799
> svmfitdeg2=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=2,cost=200)
> summary(svmfitdeg2)
Number of Support Vectors: 882
( 103 63 61 23 22 249 361 )

SVM POLYNOMIAL DEGREE 3

> svmdeg3=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=3, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg3)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 20
- best performance: 0.2738007
> svmfitdeg3=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=3,cost=20)
> summary(svmfitdeg3)
Number of Support Vectors: 883
( 98 61 53 24 24 250 373 )


SVM POLYNOMIAL DEGREE 4

> svmdeg4=tune(svm,Fault_Types~.,data=training1,kernel="polynomial",degree=4, ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200)))
> summary(svmdeg4)
Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 15
- best performance: 0.2622374
> svmfitdeg4=svm(Fault_Types~.,data=training1,kernel="polynomial",degree=4,cost=15)
> summary(svmfitdeg4)
Number of Support Vectors: 942
( 97 75 68 29 24 250 399 )

SVM RADIAL

> svmradial=tune(svm,Fault_Types~.,data=training1,kernel="radial",ranges=list(cost=c(0.1,0.5,1,3,5,8,10,15,20,100,200),gamma=c(0.5,1,2,3,4)))
> summary(svmradial)

Parameter tuning of svm:
- sampling method: 10-fold cross validation
- best parameters: cost 3, gamma 0.5
- best performance: 0.2409843
> svmfitradial=svm(Fault_Types~.,data=training1,kernel="radial",gamma=0.5,cost=3)
> summary(svmfitradial)
Number of Support Vectors: 1135
( 120 92 96 36 23 281 487 )


C5.0
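
As in the without-PCA case, the C5.0 fit follows the C50 sketch shown earlier, now applied to the 10-component training1 and test1; its test error appears in the Results table below.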

RESULTS

TEST MISCLASSIFICATION ERROR RATE


METHOD                          Without PCA   With PCA
LDA                                 28%           35%
KNN                                 54%           26%
CLASSIFICATION TREE (PRUNED)        30%            -
BAGGING                             22%           23%
RANDOM FOREST                       21%           23%
SVM POLYNOMIAL (DEGREE 2)           24%           26%
SVM POLYNOMIAL (DEGREE 3)           24%           26%
SVM POLYNOMIAL (DEGREE 4)           24%           24%
SVM LINEAR                          28%           23%
SVM RADIAL                          25%           22%
C5.0                                19%           24%
NAÏVE BAYES (USING KNIME)           36%           45%


CONCLUSION

Without dimensionality reduction

1. The best results were obtained using the advanced classification technique C5.0 (81% accuracy).
2. The most significant predictors, in order, are Log X Index, Type of Steel, X Minimum, Pixels Areas, Orientation Index, Length of Conveyer, Square Index and Luminosity Index.
3. Naïve Bayes classification, although a powerful technique, was not highly accurate due to the large number of classes.

With dimensionality reduction

1. The best results were obtained using an SVM with a radial kernel (78% accuracy).
2. Using the first 10 principal components, which explain 91.5% of the variance, we observed a drop in accuracy for most methods, and the classes were better separated by a radial boundary.

FUTURE SCOPE

As we are provided only with a dataset of steel plates that have defects, the current scope of our project is restricted to developing a model that determines the fault type from the given characteristics of a steel plate. If a comprehensive dataset of steel plates both with and without faults were made available, we could extend the scope to first deciding whether the given characteristics indicate the existence of any defect and then identifying its class. Given the defect, we could further extend the scope to suggesting changes in the design characteristics of a steel plate that would prevent future occurrences of the same fault.

