Fifth International Conference on Fuzzy Systems and Knowledge Discovery

Fuzzy Theory Based Support Vector Machine Classifier∗

Xuehua Li
School of Applied Mathematics, University of Electronic Science and Technology of China, 610054, P.R. China
leesoftcom@gmail.com

Lan Shu
School of Applied Mathematics, University of Electronic Science and Technology of China, 610054, P.R. China
shul@uestc.edu.cn

∗ Supported by the National Natural Science Foundation (No. 10671030).

Abstract

Support vector machines (SVMs) have become a popular tool in pattern recognition, and combining support vector machines with other theories has been proposed as a new direction for improving classification performance. This paper applies fuzzy theory to support vector machines for classification. In the first phase, a fuzzy support vector machine is proposed for classifying real-world data contaminated by noise: a fuzzy membership is assigned to each data point, and the SVM is reformulated so that different input points make different contributions to each class. In the second phase, the parameters of the SVM's kernel are calculated by a kernel parameters evaluation function. To investigate the effectiveness of the proposed fuzzy support vector machine classifier, it is applied to a given dataset; the experimental results confirm the superiority of the presented method over the traditional SVM classifier.

Keywords: Support vector machine, Kernel method, Fuzzy theory, Kernel parameters evaluation function.

1. Introduction

The support vector machine is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space[1]; it has recently been introduced for solving pattern recognition and function estimation problems. The theory of SVM is based on the idea of structural risk minimization, which shows that the generalization error is bounded by the sum of the training error and a term depending on the Vapnik-Chervonenkis dimension[2]. By minimizing this bound, high generalization performance can be achieved. Moreover, unlike other machine learning methods, the SVM's generalization error is not related to the input dimensionality of the problem, which explains why the SVM can perform well even on high-dimensional problems. The SVM maps the input data into a high-dimensional feature space in which it constructs an optimal separating hyperplane. In many applications it provides high generalization ability and overcomes the overfitting problem experienced by other learning techniques such as artificial neural networks.

However, practical input data are usually contaminated by noise arising from disturbances and measurement, and the data close to the class boundary are the most affected; the standard SVM formulation is not robust to such noise, and important qualitative information about the input data cannot be exploited well by the standard SVM. To make the method practical, we propose to incorporate fuzzy theory to improve the SVM's classification ability. Fuzzy logic is a powerful tool for dealing with uncertain, nonlinear, and ill-posed problems; it is well suited to representing qualitative and ambiguous knowledge, and its inference model is similar to human reasoning. Combining the SVM with fuzzy logic is therefore a promising approach[3]. Recently, fuzzy SVMs have attracted a lot of interest, and many papers have explored fuzzy SVMs for classification[3,4,5], showing them to be very successful; however, few papers discuss the fuzzy SVM together with its kernel, without which the method is not efficacious on nonlinear classification problems.

The kernel method is the most important technique by which the SVM deals with nonlinear problems: it allows a linear machine to learn nonlinear relations[1]. The attraction of the kernel method is that the learning algorithms and theory can largely be decoupled from the specifics of the application area, which need only be encoded in the design of an appropriate kernel function. Hence, the problem of choosing a suitable kernel for the SVM is all the more important. Several kernels are well known, such as the Radial Basis Function (RBF) kernel used in many papers, but two further problems arise: what values should the kernel parameters take, and how should we evaluate the effect on the SVM of a kernel with a given choice of parameters?

In this paper, to overcome these problems, we propose a fuzzy support vector machine (FSVM) together with a kernel parameters evaluation function. First, we propose the FSVM, which improves on the traditional SVM by attaching a fuzzy membership to each input value to indicate its degree of membership in the different classes, and which uses a kernel to handle nonlinear classification; this is presented in Section 3. Second, a kernel parameters evaluation function built from the fuzzy memberships, together with its learning algorithm, is given in Section 4. Third, Section 5 reports experiments, discussion, and comparisons for the SVM combined with fuzzy theory. Finally, conclusions are drawn in Section 6.

2. Summary of support vector machines

The original support vector machine can be characterized as a powerful learning algorithm based on recent advances in statistical learning theory[6,8]. The SVM is a learning system that uses a hypothesis space of linear functions in a high-dimensional space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. The SVM uses a linear model to implement nonlinear class boundaries by mapping input vectors nonlinearly into a high-dimensional feature space using kernels. It has recently become one of the most popular tools for machine learning and data mining, and it can perform both classification and regression.

Let $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ be a training set, where $x_i \in \Re^n$ are the training data and $y_i \in \{-1, 1\}$ the corresponding binary class labels. Let the weight and the bias of the separating hyperplane be $w$ and $b$, respectively. The SVM classifier is

$$f(x_i) = \operatorname{sgn}(w\,\varphi(x_i) + b) \qquad (1)$$

where $\varphi$ is a nonlinear function that maps the input space into a feature space. To separate the data linearly in the feature space, the decision function must satisfy the following conditions, giving the optimization problem

$$\begin{aligned} \text{minimize} \quad & \|w\|^2 = \langle w, w \rangle \\ \text{subject to} \quad & y_i[w\,\varphi(x_i) + b] \ge 1, \quad i = 1, 2, \ldots, n \end{aligned} \qquad (2)$$

The objective of the SVM is to maximize the margin of separation while minimizing the training errors. The problem can then be transformed into the following Lagrangian dual formulation

$$\begin{aligned} \text{maximize} \quad & L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \\ \text{subject to} \quad & \sum_{i=1}^{n} y_i \alpha_i = 0, \quad \alpha_i \ge 0, \quad i = 1, 2, \ldots, n \end{aligned} \qquad (3)$$

where $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$ is a kernel function satisfying Mercer's theorem. The Karush-Kuhn-Tucker (KKT) complementarity conditions provide useful information about the structure of the solution: the optimal solutions $\alpha^*$, $w^*$, $b^*$ must satisfy

$$\alpha_i^*\big[y_i(w^*\varphi(x_i) + b^*) - 1\big] = 0, \quad i = 1, 2, \ldots, n. \qquad (4)$$

The $\alpha_i^*$ are the solutions of the dual problem and are non-zero only for a subset of vectors $x_i^*$ called support vectors. The resulting SVM decision function becomes

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i^* y_i K(x, x_i^*) + b\Big) \qquad (5)$$

where $m$ is the number of support vectors. The SVM is a new paradigm of learning system; it is a powerful and widely used technique for solving supervised classification problems, owing to its generalization ability. In essence, SVM classifiers maximize the margin between the training data and the decision boundary (the optimal separating hyperplane), which can be formulated as a quadratic optimization problem in the feature space.
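To make Eqs. (3)-(5) concrete, the following is a minimal sketch that trains a standard RBF-kernel SVM and evaluates the decision function of Eq. (5) directly from the fitted support vectors and dual coefficients. It assumes scikit-learn and NumPy, which the paper does not prescribe; the toy data and the values of gamma and C are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# Toy binary problem: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

gamma = 0.5  # RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", gamma=gamma, C=5.0).fit(X, y)  # solves the dual (3)

def f(x):
    """Eq. (5): f(x) = sgn(sum_i alpha_i* y_i K(x, x_i*) + b); only the m
    support vectors contribute. dual_coef_ already stores alpha_i* y_i."""
    k = np.exp(-gamma * ((clf.support_vectors_ - x) ** 2).sum(axis=1))
    return np.sign(clf.dual_coef_[0] @ k + clf.intercept_[0])

x_new = np.array([0.5, -0.3])
assert f(x_new) == clf.predict([x_new])[0]  # manual Eq. (5) matches the library
```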
3. Fuzzy support vector machines

In this paper, by extending and developing the soft-margin algorithm, we introduce fuzzy memberships of the training points into the hyperplane definition, exploiting the fuzzy information inherent in the training set[9]. Consider the binary training set $S$ defined above. We use fuzzy memberships $[\mu_i^-, \mu_i^+]$ to weight the importance of the training point $x_i$ in the hyperplane computation, where $\mu_i^-$ is the membership value of the point $x_i$ toward the class $\{y_i = -1\}$ and $\mu_i^+$ is its membership value toward the positive class. First, we obtain all cluster centers of the training set $S$ by the Fuzzy C-Means (FCM) algorithm[10], whose objective is to minimize the FCM cost function

$$J = \sum_{j=1}^{t} \sum_{i=1}^{n} (\mu_{ij})^m \left\| x_i - \bar{x}_j \right\|^2 \qquad (6)$$

where $m$ is any real number greater than 1, $\mu_{ij} \in [0, 1]$ is the degree of membership of $x_i$ in cluster $j$, with $\sum_{j=1}^{t} \mu_{ij} = 1$ for all $i = 1, \ldots, n$ and $j = 1, \ldots, t$; $x_i$ is the $i$-th $d$-dimensional measured data point, $\bar{x}_j$ is the $d$-dimensional center of cluster $j$, and $\|\cdot\|$ is any norm expressing the similarity between a measured data point and a center.

Let $\bar{X}^+ = (\bar{x}_1^+, \ldots, \bar{x}_l^+)$ be the cluster centers of the class $\{y_i = +1\}$ and $\bar{X}^- = (\bar{x}_1^-, \ldots, \bar{x}_m^-)$ the cluster centers of the negative class.

We denote by $\mu_{ki}^+$ the degree of membership of $x_i$ in the cluster with center $\bar{x}_k^+$, defined as

$$\mu_{ki}^+ = e^{-\frac{1}{2}\left(\frac{\|x_i - \bar{x}_k^+\|}{\sigma}\right)^2}, \quad k = 1, 2, \ldots, l. \qquad (7)$$

The membership of $x_i$ in the class $\{y_i = +1\}$, $\mu_i^+$, is then defined as

$$\mu_i^+ = \max(\mu_{1i}^+, \mu_{2i}^+, \ldots, \mu_{li}^+) \qquad (8)$$

The membership of $x_i$ in the class $\{y_i = -1\}$, $\mu_i^-$, is defined analogously:

$$\mu_{ki}^- = e^{-\frac{1}{2}\left(\frac{\|x_i - \bar{x}_k^-\|}{\sigma}\right)^2}, \quad k = 1, 2, \ldots, m. \qquad (9)$$

$$\mu_i^- = \max(\mu_{1i}^-, \mu_{2i}^-, \ldots, \mu_{mi}^-) \qquad (10)$$

This gives a set of training points $S = \{(x_1, \mu_1^-, \mu_1^+), (x_2, \mu_2^-, \mu_2^+), \ldots, (x_n, \mu_n^-, \mu_n^+)\}$.
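Eqs. (6)-(10) can be computed directly. Below is a minimal NumPy sketch using the standard alternating FCM updates; the number of clusters t, the fuzzifier m = 2, the iteration count, and the width σ are assumptions, since the paper does not fix them.

```python
import numpy as np

def fcm_centers(X, t, m=2.0, iters=100, seed=0):
    """Plain Fuzzy C-Means: minimizes the cost of Eq. (6) by alternating
    the standard center and membership updates; returns t centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), t))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1 per point
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = d ** (-2.0 / (m - 1.0))                # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return centers

def class_membership(x, centers, sigma):
    """Eqs. (7)-(10): Gaussian membership to each center of one class,
    then the maximum over that class's centers."""
    d = np.linalg.norm(centers - x, axis=1)
    return float(np.exp(-0.5 * (d / sigma) ** 2).max())

# Usage (names hypothetical): for a positive-class sample matrix X_pos,
# mu_i_plus = class_membership(x_i, fcm_centers(X_pos, t=3), sigma=1.0)
```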
The optimal hyperplane problem is then regarded as the solution to

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n}\operatorname{sgn}(\mu_i^{+}-\mu_i^{-})\,(\mu_i^{+}\xi_i^{+}-\mu_i^{-}\xi_i^{-}) \\ \text{subject to} \quad & \operatorname{sgn}(\mu_i^{+}-\mu_i^{-})\,(w x_i + b) \ge 1 - \operatorname{sgn}(\mu_i^{+}-\mu_i^{-})\,(\mu_i^{+}\xi_i^{+}-\mu_i^{-}\xi_i^{-}) \\ & \xi_i^{+}\ge 0,\ \xi_i^{-}\ge 0,\ C\ge 0, \quad i = 1, 2, \ldots, n \end{aligned} \qquad (11)$$

where $C$ defines how much weight is given to minimizing the slack variables relative to the weight vector[7]. Hence, the following Lagrangian is obtained:

$$\begin{aligned} L(w,b,\alpha,\beta,\xi^{+},\xi^{-}) ={} & \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n} s_i(\mu_i^{+}\xi_i^{+}-\mu_i^{-}\xi_i^{-}) \\ & - \sum_{i=1}^{n}\alpha_i\big[s_i(w x_i + b + \mu_i^{+}\xi_i^{+}-\mu_i^{-}\xi_i^{-})-1\big] \\ & - \sum_{i=1}^{n}\beta_i s_i(\mu_i^{+}\xi_i^{+}-\mu_i^{-}\xi_i^{-}), \qquad \alpha_i\ge 0,\ \beta_i\ge 0,\ i = 1, 2, \ldots, n \end{aligned} \qquad (12)$$

where $s_i = \operatorname{sgn}(\mu_i^{+}-\mu_i^{-})$ and $\langle w, w\rangle = \sum_{i,j=1}^{n} s_i s_j \alpha_i \alpha_j K(x_i, x_j)$. The conditions for optimality are as follows:

$$\frac{\partial L}{\partial w}=0 \;\Rightarrow\; w-\sum_{i=1}^{n}\alpha_i s_i x_i = 0 \qquad (13)$$

$$\frac{\partial L}{\partial b}=0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i s_i = 0 \qquad (14)$$

$$\frac{\partial L}{\partial \alpha}=0 \;\Rightarrow\; \sum_{i=1}^{n} s_i(w x_i + b + \mu_i^{+}\xi_i^{+}-\mu_i^{-}\xi_i^{-})-1 = 0 \qquad (15)$$

$$\frac{\partial L}{\partial \beta}=0 \;\Rightarrow\; \sum_{i=1}^{n} s_i(\mu_i^{+}\xi_i^{+}-\mu_i^{-}\xi_i^{-}) = 0 \qquad (16)$$

According to the KKT conditions, the solutions $\alpha_i^{*}$ of Eq. (15) satisfy

$$\begin{aligned} & \alpha_i^{*}\big[s_i(w^{*} x_i + b^{*} + \mu_i^{+}\xi_i^{+}-\mu_i^{-}\xi_i^{-})-1\big]=0 \\ & \xi_i^{+}(C\mu_i^{+}-\alpha_i^{*})=0, \qquad \xi_i^{-}(C\mu_i^{-}-\alpha_i^{*})=0 \end{aligned} \qquad (17)$$
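Solving problem (11) exactly requires a general quadratic programming solver. A common shortcut in the fuzzy-SVM literature[3] is to scale the slack penalty C of each training point by its membership, which scikit-learn exposes as sample_weight; the sketch below uses that approximation rather than the paper's exact formulation. The label rule s_i = sgn(µ_i⁺ − µ_i⁻), the weight choice w_i = max(µ_i⁺, µ_i⁻), and the mapping γ = 1/(2σ²) are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def fit_fsvm(X, mu_plus, mu_minus, sigma=0.16, C=5.2):
    """Hedged approximation of problem (11) via per-point slack penalties."""
    s = np.sign(mu_plus - mu_minus)        # s_i = sgn(mu_i^+ - mu_i^-) picks the class
    w = np.maximum(mu_plus, mu_minus)      # membership toward the assigned class
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma**2), C=C)
    # sample_weight rescales C per point, so low-membership (likely noisy)
    # points pay a smaller slack penalty, mimicking the mu-weighted slacks.
    clf.fit(X, s, sample_weight=w)
    return clf
```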
4. The evaluation function of SVM's kernel parameters

Figure 1. SVM with RBF kernel (σ = 0.1, C = 5)

The kernel method has been demonstrated to be able to extract the complicated nonlinear information embedded in a data set. A kernel function corresponds to a nonlinear mapping from the input space $X \subseteq \Re^n$ onto a feature space $Z \subseteq \Re^N$, $\varphi: X \subseteq \Re^n \rightarrow Z \subseteq \Re^N$[11]. The kernel method provides a powerful and principled way of detecting nonlinear relations using well-understood linear algorithms in an appropriate feature space, and the approach decouples the design of the algorithm from the specification of the feature space. In real applications, the first problem is the selection of the SVM's kernel, which amounts to learning-model selection. Any prior knowledge we have of the problem can help in selecting the parameters.

For most classes of kernels, for example the polynomial kernel or the RBF kernel, it is always possible to find kernel parameters for which the data become separable. However, forcing separation of the data can easily lead to overfitting, particularly when noise is present in the data. On the other hand, the support points contain the information necessary to reconstruct the hyperplane; in general, the fewer the support vectors, the better the generalization that can be expected, but too few support vectors reduce the accuracy of the SVM classifier.

Fig. 1 and Fig. 2 show an SVM learning a data set using RBF kernels with different values of σ and C. In the two cases, the classification of the training set is consistent, but the larger the value of σ, the smaller the number of support vectors and the more classification errors appear.

Figure 2. SVM with RBF kernel (σ = 2, C = 10)

In this section, we construct a function to evaluate the parameters of the kernel. In the SVM, the solution depends only on a subset of the training data points, referred to as support vectors; using only the support vectors, the same solution is obtained as with all the training data points. The support points thus contain all the information necessary to reconstruct the hyperplane: even if all the other points were removed, the same maximal separating hyperplane would be found for the remaining subset, the support vectors. The support vectors determine the solution of the Lagrangian, which contains the kernel parameters; hence there is a direct relationship between the SVM classifier and the parameters.

The objective of the SVM is to maximize the margin of separation, and the solution of optimization problem (3) is equivalent to the hyperplane in the feature space implicitly defined by the kernel. The SVM's geometric margin is given by

$$\gamma = \langle w^*, w^* \rangle = \sum_{i,j=1}^{n} y_i y_j \alpha_i^* \alpha_j^* K(x_i, x_j) \qquad (18)$$

where $x_i, x_j$ are the support vectors. In general, one problem with the maximum-margin SVM is the choice of parameters: typically a range of values must be tried before the best choice for a particular training set can be selected. Different parameters produce different sets of support vectors; let $S = \{x_1, x_2, \ldots, x_l\}$ be the intersection of all these support vector sets, and define the margin function

$$w(\theta) = \sum_{i,j=1}^{l} y_i y_j \bar{\alpha}_i \bar{\alpha}_j K(x_i, x_j), \quad x_i, x_j \in S \qquad (19)$$

where $\theta$ is the kernel parameter vector and $\bar{\alpha}_i$ is the average, over all trial parameter settings, of the solutions $\alpha_i$ of the optimization problem for the support vector $x_i$. We then define $M$ as the maximum of $w(\theta_i)$, $M = \max_i w(\theta_i)$, where $\theta_i$ is the $i$-th of all the trial parameter settings, and define the margin membership of parameter $\theta_i$ as

$$\mu(\theta_i) = \frac{w(\theta_i)}{M} \qquad (20)$$

The kernel parameters evaluation function is defined as

$$g(\theta_i) = \exp\Big[-\frac{1}{2}\big(\mu(\theta_i) - 1\big)^2\Big] \qquad (21)$$

where $g(\theta_i) \in [0, 1]$ estimates the effect of the parameter setting $\theta_i$ on the SVM classifier. The parameter setting that maximizes the evaluation function is thus taken as the most suitable value of the kernel parameters.
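Eqs. (19)-(21) can be turned into a small model-selection routine. The sketch below is one reading of the procedure, under stated assumptions: binary labels in NumPy arrays, an RBF kernel parameterized as γ = 1/(2σ²), a non-empty intersection of the support-vector index sets, and ᾱᵢ taken as the average of |αᵢ| across all trial settings (scikit-learn's dual_coef_ stores αᵢyᵢ). None of these details are fixed by the paper.

```python
import numpy as np
from sklearn.svm import SVC

def evaluate_kernel_params(X, y, settings):
    """Return g(theta) of Eq. (21) for each (sigma, C) in `settings`."""
    fits, sv_sets = [], []
    for sigma, C in settings:
        clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma**2), C=C).fit(X, y)
        fits.append(clf)
        sv_sets.append(set(clf.support_))         # indices of this fit's support vectors
    common = sorted(set.intersection(*sv_sets))   # S: intersection of all SV sets
    # alpha_bar_i: average of |alpha_i| over all settings (dual_coef_ = alpha_i * y_i)
    alpha = np.zeros((len(settings), len(X)))
    for r, clf in enumerate(fits):
        alpha[r, clf.support_] = np.abs(clf.dual_coef_[0])
    a_bar = alpha.mean(axis=0)[common]
    ys, Xs = y[common], X[common]
    d2 = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1)
    w = np.array([
        (np.outer(a_bar * ys, a_bar * ys)
         * np.exp(-d2 / (2.0 * sigma**2))).sum()  # margin function, Eq. (19)
        for sigma, C in settings
    ])
    mu = w / w.max()                              # margin membership, Eq. (20)
    return np.exp(-0.5 * (mu - 1.0) ** 2)         # evaluation function, Eq. (21)
```

Taking the argmax of the returned values over a grid of (σ, C) settings is then the model selection used in Section 5.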

5. Performance Evaluation

This section presents a set of experiments carried out on the Allbp database obtained from the UCI repository of machine learning databases[12]. This dataset was chosen because it is commonly used by other classification systems, which lets us compare our system with them on the thyroid diagnosis problem. The selected Allbp dataset contains 972 samples, each consisting of 29 attributes (features), and poses a three-class classification problem. We chose 364 samples as the training set and 608 samples as the test set. As stated earlier, the RBF kernel was used because of its better performance.

To illustrate the performance of the kernel parameters evaluation function, Fig. 3 shows the evaluation value as the parameter σ ranges from 0.001 to 2 and C from 0.1 to 20. The maximal evaluation value is obtained at σ = 0.16 and C = 5.2. Table I compares the training and testing accuracy of SVM and FSVM, showing the effect of different selections of the kernel parameters, and Table II compares the CPU times and the numbers of support vectors of SVM and FSVM.

From Table I, we can easily conclude that the test accuracy of FSVM in the unclassifiable region is much higher than that of SVM; from Table II, the SVM classifier is more effective when the kernel parameters evaluation function is used. Compared with the standard SVM's classification results, the SVM using fuzzy theory not only improves the classifier accuracy but also yields an optimized kernel-parameters learning algorithm.

Table 1. Comparison of the two SVMs' classification accuracy

Method    Parameter (σ, C)    Training AC    Testing AC
LS-SVM    (0.01, 1)           91.2%          83.2%
LS-SVM    (0.16, 5.2)         95.5%          89.2%
LS-SVM    (1, 10)             87.9%          79.3%
FSVM      (0.01, 1)           92.1%          84.7%
FSVM      (0.16, 5.2)         98.6%          92.5%
FSVM      (1, 10)             88.4%          82.6%

Figure 3. Parameters evaluation value
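For concreteness, the experimental protocol of this section can be sketched as follows. The file name allbp.csv, the numeric encoding of the 29 attributes, and the grid resolution are hypothetical; the paper fixes only the 364/608 split and the ranges σ ∈ [0.001, 2] and C ∈ [0.1, 20]. The helper evaluate_kernel_params is the Section 4 sketch, which assumes binary labels, so for the three-class Allbp task it would be applied per one-vs-rest subproblem.

```python
import numpy as np

# Hypothetical local copy of the UCI Allbp data[12]: 972 rows, 29 numeric
# features plus a class label in the last column (encoding assumed).
data = np.loadtxt("allbp.csv", delimiter=",")
X, y = data[:, :-1], data[:, -1]

X_train, y_train = X[:364], y[:364]        # 364 training samples
X_test, y_test = X[364:], y[364:]          # 608 test samples

sigmas = np.linspace(0.001, 2.0, 10)       # sigma scanned over [0.001, 2]
Cs = np.linspace(0.1, 20.0, 10)            # C scanned over [0.1, 20]
settings = [(s, c) for s in sigmas for c in Cs]

g = evaluate_kernel_params(X_train, y_train, settings)   # Section 4 sketch
best_sigma, best_C = settings[int(np.argmax(g))]
print(best_sigma, best_C)                  # the paper reports (0.16, 5.2)
```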
6. Conclusion

In this paper, we proposed a fuzzy support vector machine for classification. The proposed FSVM resolves the unclassifiable regions caused by conventional support vector machines. We apply a fuzzy membership function to each data point and reformulate the SVM so that different input points make different contributions to the learning of the Lagrange function. In addition, the kernel parameters evaluation function provides suitable parameters for the SVM's kernel. The experimental results indicate that the proposed method enhances the SVM by reducing the effect of noise in applications, and FSVM is suitable for applications in which the data points have modeled characteristics. The experiments demonstrate the superiority of FSVM over the standard SVM, and the kernel parameters evaluation function improves the SVM classifier accuracy.

References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.

[2] V.N. Vapnik, An Overview of Statistical Learning Theory, IEEE Transactions on Neural Networks, 10(5): 988-999, 1999.

[3] C.F. Lin and S.D. Wang, Fuzzy Support Vector Machines, IEEE Transactions on Neural Networks, 13(3): 466-471, 2002.

[4] D. Tsujinishi and S. Abe, Fuzzy least squares support vector machines for multiclass problems, Neural Networks, 16: 785-792, 2003.

[5] T. Inoue and S. Abe, Fuzzy support vector machines for pattern classification, International Joint Conference on Neural Networks, 2: 1449-1454, July 2001.

[6] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.

[7] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.

[8] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.

[9] C.F. Lin and S.D. Wang, Training algorithms for fuzzy support vector machines with noisy data, Pattern Recognition Letters, 25(10): 1647-1656, 2004.

[10] C.Y. Yang, Support vector classifier with a fuzzy-value class label, Lecture Notes in Computer Science, Springer-Verlag, Berlin, 3173: 506-511, 2004.

[11] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, 2002.

[12] C.L. Blake and C.J. Merz, UCI repository of machine learning databases, Dept. Inform. Comput. Sci., Univ. California, Irvine, CA. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
applications in which data points have modeled characteris-
tics. It demonstrates the superiority of the method of FSVM
over standard SVM, and the kernel’s parameters evaluation
