
A Boost Voting Strategy for Knowledge

Integration and Decision Making

Haibo He¹, Yuan Cao¹, Jinyu Wen², and Shijie Cheng²

¹ Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030 USA
{hhe,ycao}@stevens.edu
² College of Electrical and Electronics Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
{jinyu.wen,sjcheng}@mail.hust.edu.cn

Abstract. This paper proposes a voting strategy for knowledge integration and decision making systems with information uncertainty. As ensemble learning methods have recently attracted growing attention from both academia and industry, it is critical to understand the fundamental problem of voting strategy for such learning methodologies. Motivated by the signal to noise ratio (SNR) concept, we propose a method that can vote optimally according to the knowledge level of each hypothesis. The mathematical framework based on gradient analysis is used to find the optimal weights, and a voting algorithm, BoostVote, is presented in detail in this paper. Simulation analyses based on synthetic data and real-world data sets with comparison to the existing voting rules demonstrate the effectiveness of this method.

Keywords: Voting strategy, Ensemble learning, Signal-to-noise ratio, Classification.

1 Introduction

Voting strategy is a fundamental and critical issue for ensemble learning systems. Generally speaking, a voting strategy provides a mechanism to integrate the knowledge from multiple and diversified voting hypotheses to potentially improve the final decision making process. In this paper, we focus on understanding this problem for ensemble classification systems. Our objective is to find an optimal voting weight for each individual hypothesis for improved performance over a given goal.
We begin by stating the problem discussed in this paper explicitly:
Definition: Given an ensemble learning system with multiple hypotheses over a classification target function Y = {1, ..., C} (C is the number of class labels): H = {h_j}, j = 1, ..., L, each developed by a learning method Φ : {A, T} based on the training data set D_tr, find the optimal voting strategy F for an improved final decision P(Y|x_t), where x_t is a testing data instance drawn from the testing data distribution D_te.


The learning method Φ includes two subsystems: the base algorithm A and the learning procedure T. Generally speaking, any kind of learning algorithm can be used as the base algorithm to develop the decision boundary from the training data, such as neural networks, decision trees, and others. A common scenario in the machine learning community is to use weak learning algorithms (WeakLearn) to develop a strong ensemble learning system [1-4]. For instance, theoretical analysis of weak learnability is discussed in detail in [1], where general bounds on the complexity of probably approximately correct (PAC) learning and important proofs on the equivalence of "strongly learnable" and "weakly learnable" are presented. The learning procedure T is used to obtain the multiple hypotheses H. For instance, a bootstrap sampling method can be used to sample the instance space to train different hypotheses. Bagging [2] and adaptive boosting (AdaBoost) [3,4] are two representative works in this domain. While bagging uses a uniformly distributed sampling function (bootstrap with replacement) across all training instances, AdaBoost adopts an adaptive iterative learning procedure that automatically shifts the decision boundary to focus more on difficult instances: examples that tend to be misclassified (hard instances) receive higher weights than those that tend to be correctly classified (easy instances) at each iteration. Subspace learning is another major category for building an ensemble of multiple hypotheses, such as the random subspace method [5], random forests [6], ranked subspace [7], and rotation forests [8]. Other major work on ensemble learning includes stacked generalization [9] and mixture of experts [10].
A fundamental problem in ensemble learning systems is the voting strategy, since multiple hypotheses are developed from different views of the training data. To this end, hypothesis diversity is an important criterion to assess the effectiveness of an ensemble learning system, which raises the essential problem of voting strategy. As each hypothesis carries a different knowledge level for the target function, it is natural to use different weights for different hypotheses to potentially improve the final decision: a highly confident hypothesis should carry a higher weight. However, in real-world applications, it is very difficult to evaluate the confidence level of each hypothesis over future testing instances [12].
In this paper, we propose a novel understanding to address this fundamental issue. Analogous to the signal and noise concept, we transform the decision making process into an optimization problem that aims to find the optimal weight for each hypothesis to maximize the combined knowledge level of the ensemble system. In this way, the final decision is boosted from individual hypotheses for knowledge integration and accumulation. A mathematical formulation of the problem followed by a voting algorithm, BoostVote, is presented in detail in this paper. To the best of our knowledge, this is the first time this idea has been presented in the community. We believe that this idea provides new insights into this fundamental problem and may motivate future theoretical and practical research developments in the field.
The rest of this paper is organized as follows. Section 2 briefly reviews the major voting strategies in this field. In Section 3, the detailed mathematical foundation and the BoostVote algorithm are presented. Section 4 presents the simulation analyses of the proposed method on synthetic data as well as real-world machine learning data sets. Comparative studies of the classification accuracy of the proposed method against those of existing methods are used to illustrate the effectiveness of this method. Finally, a conclusion and a brief discussion of future research work are given in Section 5.

2 Related Works

Considering the problem definition in Section 1, we represent the training data as D_tr with m instances {x_q, y_q}, where q = 1, ..., m, x_q is an instance in the n-dimensional feature space X, and y_q is the class label associated with x_q. Through the learning method Φ : {A, T}, a total of L hypotheses are obtained: H = {h_j}, j = 1, ..., L. Therefore, for each testing instance x_t ∈ D_te, each hypothesis h_j can produce an estimate of the a posteriori probability across all potential class labels: P_j(Y_i|x_t), where Y_i = 1, ..., C. In this way, the voting strategy can be defined as finding a mapping function F for an improved estimate of the final posteriori probability P(Y_i|x_t) from the individual P_j(Y_i|x_t):

$$x_t \rightarrow Y_i \ \text{satisfying} \ \max_{Y_i} P(Y_i|x_t), \qquad P(Y_i|x_t) = F\left( P_j(Y_i|x_t)\big|_{j=1}^{L},\, x_t,\, \Theta \right) \tag{1}$$

where Θ is a set of parameters used to adjust the contributions of each hypothesis.
Based on the information theoretic criteria and Bayesian rationale [11,12],
various voting strategies have been used in the research community. The most
commonly adopted voting rules include geometric average (GA), arithmetic av-
erage (AA), median value (MV), and majority voting (MajV) rule. In this re-
search, we will compare the proposed BoostVote strategy to all these existing
voting methods.
GA rule:

$$x_t \rightarrow Y_i \ \text{satisfying} \ \max_{Y_i} \prod_{j=1}^{L} P_j(Y_i|x_t). \tag{2}$$

AA rule:

$$x_t \rightarrow Y_i \ \text{satisfying} \ \max_{Y_i} \frac{1}{L}\sum_{j=1}^{L} P_j(Y_i|x_t). \tag{3}$$

MV rule:

$$x_t \rightarrow Y_i \ \text{satisfying} \ \max_{Y_i} \left\{ \underset{j}{\mathrm{median}}\big(P_j(Y_i|x_t)\big) \right\}. \tag{4}$$

MajV rule:

$$x_t \rightarrow Y_i \ \text{satisfying} \ \max_{Y_i} \sum_{j=1}^{L} \Delta_j(Y_i|x_t), \tag{5}$$

where

$$\Delta_j(Y_i|x_t) = \begin{cases} 1, & \text{if } h_j(x_t) = Y_i;\\ 0, & \text{otherwise.} \end{cases}$$
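For concreteness, these four baseline rules can be written down in a few lines of code. The sketch below is only illustrative: the function name `combine` and the array layout (one row per hypothesis, one column per class) are our own conventions rather than anything prescribed above.

```python
import numpy as np

def combine(posteriors, rule="AA"):
    """Combine per-hypothesis posteriors P_j(Y_i | x_t).

    posteriors: array of shape (L, C), one row per hypothesis,
    one column per class. Returns the index of the winning class.
    """
    P = np.asarray(posteriors, dtype=float)
    if rule == "GA":                      # geometric average, Eq. (2)
        score = P.prod(axis=0)
    elif rule == "AA":                    # arithmetic average, Eq. (3)
        score = P.mean(axis=0)
    elif rule == "MV":                    # median value, Eq. (4)
        score = np.median(P, axis=0)
    elif rule == "MajV":                  # majority voting, Eq. (5)
        votes = P.argmax(axis=1)          # each hypothesis votes for its top class
        score = np.bincount(votes, minlength=P.shape[1])
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(score.argmax())

# Example: three hypotheses, three classes; different rules can disagree.
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.2, 0.7]]
for r in ("GA", "AA", "MV", "MajV"):
    print(r, combine(P, rule=r))
```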

3 The Proposed Method

Fig. 1 shows the general scenario of knowledge integration and decision making for an ensemble classifier learning system. We assume each hypothesis, h_j, is associated with a signal, s_j, and noise, n_j, defined as measurements related to the posteriori probability P_j(Y_i|x_t):

$$s_j = \left| P_j(Y_i|x_t) - 0.5 \right| \tag{6}$$

$$n_j = 0.5 - \left| P_j(Y_i|x_t) - 0.5 \right| \tag{7}$$
In a two-class classification problem, P_j(Y_i|x_t) = 0.5 represents the lowest certainty, meaning that each of the two classes is equally likely. On the other hand, P_j(Y_i|x_t) = 1 or P_j(Y_i|x_t) = 0 represents full knowledge, meaning that the hypothesis, h_j, is certain about the class label. For multi-class classification problems, considering a class label Y_i, the classification of any given testing instance x_t can be represented in a Boolean form: x_t ∈ Y_i or x_t ∈ Ȳ_i, where Ȳ_i = {Y_k, k ≠ i}. That is to say, x_t either belongs to Y_i or to Ȳ_i, where Ȳ_i represents all other possible class labels in Y except Y_i. In this way, the multi-class classification problem can also be transformed into a form analogous to a two-class problem. Therefore, Equations (6) and (7) provide a uniform way to represent the signal and noise concept.
In order to maximize the signal level in the ensemble system as shown in Fig. 1, we define the combined signal and noise, with voting weights α_k assigned to the hypotheses, as [13]:

$$\hat{s}^2 = (\alpha_1 s_1 + \alpha_2 s_2 + \ldots + \alpha_L s_L)^2 = \Big( \sum_{k=1}^{L} \alpha_k s_k \Big)^2 \tag{8}$$

$$\hat{n}^2 = \alpha_1^2 n_1^2 + \alpha_2^2 n_2^2 + \ldots + \alpha_L^2 n_L^2 = \sum_{k=1}^{L} \alpha_k^2 n_k^2 \tag{9}$$

Fig. 1. Boost voting strategy for knowledge integration and decision making

Therefore, the combined signal to noise ratio (SNR) in the knowledge integration process can be defined as:

$$\mathrm{SNR}(\alpha_j) = \frac{\hat{s}^2}{\hat{n}^2} \tag{10}$$

To find the maximal value, we take the gradient of Equation (10) with respect to α_j:

$$\nabla \mathrm{SNR}(\alpha_j) = \frac{\dfrac{\partial \hat{s}^2}{\partial \alpha_j}\,\hat{n}^2 - \dfrac{\partial \hat{n}^2}{\partial \alpha_j}\,\hat{s}^2}{\hat{n}^4} = \frac{2 s_j \left(\sum_{k=1}^{L} \alpha_k^2 n_k^2\right)\left(\sum_{k=1}^{L} \alpha_k s_k\right) - 2 \alpha_j n_j^2 \left(\sum_{k=1}^{L} \alpha_k s_k\right)^2}{\hat{n}^4} \tag{11}$$

By setting $\nabla \mathrm{SNR}(\alpha_j) = 0$, one can get:

$$\frac{\alpha_j n_j^2}{s_j} = \frac{\sum_{k=1}^{L} (\alpha_k n_k)^2}{\sum_{k=1}^{L} \alpha_k s_k} \tag{12}$$

Since the right-hand side of Equation (12) is the same for every j, this leads to the following condition:

$$\frac{\alpha_j}{s_j / n_j^2} = \frac{\alpha_j}{\beta_j} = \text{constant} \tag{13}$$

where β_j = s_j / n_j². Equation (13) means that, from the signal to noise ratio point of view, each hypothesis h_j should vote proportionally to β_j in order to maximize the knowledge level in the ensemble system.
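As a quick numerical illustration of Equation (13), the following sketch (our own construction, with arbitrary signal and noise values) compares the combined SNR obtained with uniform weights against weights chosen proportionally to β_j = s_j/n_j²; the proportional choice should give the larger ratio.

```python
import numpy as np

def combined_snr(alpha, s, n):
    """Combined SNR from Eqs. (8)-(10): (sum a_k s_k)^2 / sum a_k^2 n_k^2."""
    alpha, s, n = map(np.asarray, (alpha, s, n))
    return (alpha @ s) ** 2 / np.sum(alpha ** 2 * n ** 2)

# Hypothetical per-hypothesis signal and noise levels (Eqs. 6-7).
s = np.array([0.45, 0.30, 0.10])
n = np.array([0.05, 0.20, 0.40])

equal = np.ones_like(s)          # uniform voting weights
beta  = s / n ** 2               # SNR-optimal weights from Eq. (13)

print("equal weights :", combined_snr(equal, s, n))
print("beta weights  :", combined_snr(beta, s, n))
```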
Based on this analysis, we now present the proposed BoostVote algorithm as
follows:

[Algorithm: BoostVote]
1. Apply the testing sample x_t to each hypothesis, h_j, and return the decision profile P_d(Y_i|x_t).
2. Calculate the signal and noise for each class label:

$$S_{Y_i} = \left| P_d(Y_i|x_t) - 0.5 \right| \tag{14}$$

$$N_{Y_i} = 0.5 - S_{Y_i} \tag{15}$$

3. Calculate P_{Y_i} and P'_{Y_i} for each potential class label:

$$P_{Y_i} = \mathrm{sign}\big(P_d(Y_i|x_t) - 0.5\big) \cdot S_{Y_i} \tag{16}$$

$$P'_{Y_i} = \frac{P_{Y_i}}{N_{Y_i}} \tag{17}$$

4. Calculate β_{Y_i} and γ_{Y_i}:

$$\beta_{Y_i} = \frac{S_{Y_i}}{(N_{Y_i})^2} \tag{18}$$

$$\gamma_{Y_i} = \frac{1}{1 + e^{-\lambda \beta_{Y_i}}} - 0.5 \tag{19}$$

5. Calculate P_out(Y_i) and P'_out(Y_i):

$$P_{out}(Y_i) = \frac{\sum_{k=1}^{L} \gamma_{Y_i}(k)\, P_{Y_i}(k)}{\sqrt{\sum_{k=1}^{L} \big(\gamma_{Y_i}(k)\big)^2 \big(N_{Y_i}(k)\big)^2}} \tag{20}$$

$$P'_{out}(Y_i) = \frac{P_{out}(Y_i)}{2\big(1 + |P_{out}(Y_i)|\big)} \tag{21}$$

6. Calculate the final voting probability P(Y_i|x_t):

$$P(Y_i|x_t) = P'_{out}(Y_i) + 0.5 \tag{22}$$

Output: voting strategy (mapping function F):

$$x_t \rightarrow Y_i \ \text{satisfying} \ \max_{Y_i} P(Y_i|x_t) \tag{23}$$
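To make the six steps concrete, a minimal Python sketch is given below. It is our own rendering under stated assumptions: the numerator of Equation (20) is taken to use the signed signal of Equation (16), Equation (22) is taken to use the scaled output of Equation (21), λ = 0.1 (the value used in our simulations), and a small ε guards against division by zero when N_{Y_i} = 0; none of these details should be read as verbatim from the original algorithm.

```python
import numpy as np

def boost_vote(decision_profile, lam=0.1, eps=1e-6):
    """BoostVote sketch.

    decision_profile: array of shape (L, C); entry (k, i) is hypothesis k's
    estimate P_d(Y_i | x_t). Returns (winning class index, combined P(Y_i|x_t)).
    """
    Pd = np.asarray(decision_profile, dtype=float)

    # Step 2: per-hypothesis, per-class signal and noise (Eqs. 14-15).
    S = np.abs(Pd - 0.5)
    N = 0.5 - S

    # Step 3: signed signal (Eq. 16).
    P_signed = np.sign(Pd - 0.5) * S

    # Step 4: SNR-based weight and its logistic squashing (Eqs. 18-19).
    beta = S / (N ** 2 + eps)
    gamma = 1.0 / (1.0 + np.exp(-lam * beta)) - 0.5

    # Step 5: combined output per class and its rescaling (Eqs. 20-21).
    num = np.sum(gamma * P_signed, axis=0)
    den = np.sqrt(np.sum((gamma ** 2) * (N ** 2), axis=0)) + eps
    P_out = num / den
    P_out_scaled = P_out / (2.0 * (1.0 + np.abs(P_out)))

    # Step 6: final voting probability (Eq. 22).
    P_final = P_out_scaled + 0.5
    return int(np.argmax(P_final)), P_final

# Usage: a hypothetical three-class, three-hypothesis profile where every class
# receives exactly one top vote, so majority voting would break the tie at random.
profile = [[0.90, 0.05, 0.05],
           [0.10, 0.55, 0.35],
           [0.05, 0.10, 0.85]]
label, probs = boost_vote(profile)
print(label, np.round(probs, 3))
```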

In the BoostVote algorithm, a decision profile P_d(Y_i|x_t) is defined, which provides a voting probability for each testing instance across all potential class labels. Such a decision profile can either be obtained directly from most off-the-shelf base learning algorithms or through slight modifications. For instance, for those base learning algorithms that can provide soft-type outputs (continuous values), such as neural networks, one can directly use a scaled output value to obtain the decision profile information to calculate the signal and noise values. Fig. 2 illustrates this idea for a neural network model with C output neurons, each representing a class identity label. In this case, the decision profile element can be decided by the normalized output value from each corresponding output neuron.

Fig. 2. Decision profile calculation based on the neural network model

Fig. 3. An example of the BoostVote algorithm

On the other hand, for hard-type (discrete class labels only) base learning algorithms, one can obtain the decision profile information based on the cross-validation method. We also want to point out that for many off-the-shelf base learning algorithms, it is generally very straightforward to transform a hard-type output into a soft-type output [3]. In the BoostVote algorithm, a modified logistic function is introduced in Equation (19) to adjust the voting sensitivity level of each hypothesis, and the value of the parameter λ can be decided by the cross-validation method. In this research, we set λ = 0.1 for all simulations.
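As a small illustration of the soft-output case just described, the sketch below (our own, assuming a generic network that returns one raw score per class) scales an output vector into a row of the decision profile; softmax is one reasonable choice of scaling, not one mandated here.

```python
import numpy as np

def decision_profile_row(raw_scores):
    """Scale a network's raw per-class outputs into P_d(Y_i | x_t).

    A simple softmax normalization; any scaling that yields values in [0, 1]
    summing to one would serve the same purpose.
    """
    z = np.asarray(raw_scores, dtype=float)
    z = z - z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

print(decision_profile_row([2.1, 0.3, -1.0]))   # e.g. three output neurons
```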
Fig. 3 shows an example of the BoostVote algorithm for a three-class classification problem with three hypotheses. From Fig. 3 one can see that the BoostVote algorithm votes the testing example x_t as a class 3 label in this case. Considering the decision profile information at Step 1, if the majority voting rule were used in this case, it would randomly select a class label for this testing instance, since each class receives the same number of votes (one vote for each class). This indicates that the proposed algorithm can potentially boost the final decision making process from differently knowledgeable voting hypotheses.

4 Simulation Analysis

To see how the proposed BoostVote algorithm can boost the knowledge level in the ensemble system, we present our first experiment on a synthetic data set. Assume a two-class (positive and negative) ensemble classification system including 25 hypotheses (L = 25) and 50 testing instances. After a training procedure, assume each hypothesis, h_j, votes on each testing example with a posteriori probability, P_j(Y_i|x_t), drawn from a uniform distribution in [0, 1]. Fig. 4 shows the final combined posteriori probability P(Y_i|x_t) for the positive class, where the x-mark represents the AA rule and the circle represents the BoostVote method. From Fig. 4 one can see that the BoostVote algorithm can boost the final combined knowledge level to be more deterministic, therefore increasing the separation margin to facilitate the final decision making process.
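This synthetic setting is easy to approximate in a few lines. The sketch below is our own simplified two-class rendering (positive class only, λ = 0.1, a small ε guard) and is meant to show the qualitative effect only, not to regenerate the exact values in Fig. 4.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, lam, eps = 25, 50, 0.1, 1e-6

# Each of L hypotheses gives the positive-class posterior for T test instances.
P = rng.uniform(0.0, 1.0, size=(L, T))

# AA rule: plain arithmetic average across hypotheses.
aa = P.mean(axis=0)

# BoostVote-style combination for the positive class (two-class case).
S = np.abs(P - 0.5)
N = 0.5 - S
signed = np.sign(P - 0.5) * S
gamma = 1.0 / (1.0 + np.exp(-lam * S / (N ** 2 + eps))) - 0.5
out = np.sum(gamma * signed, axis=0) / (np.sqrt(np.sum(gamma ** 2 * N ** 2, axis=0)) + eps)
bv = out / (2.0 * (1.0 + np.abs(out))) + 0.5

# The AA outputs cluster near 0.5, while the combined outputs are pushed
# toward 0 or 1, i.e. the decision is more deterministic.
print("AA spread       :", aa.min().round(3), aa.max().round(3))
print("BoostVote spread:", bv.min().round(3), bv.max().round(3))
```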
We now illustrate the application of BoostVote to real-world benchmark data sets from the UCI machine learning repository [14]. Table 1 summarizes the characteristics of the data sets used in this research.

Table 1. Data set characteristics used in this paper

Data set    # Examples    # Classes    # Attributes
ecoli              336            8               7
shuttle          59000            2               9
spectf             267            2              44
wdbc               569            2              30
wine               178            3              13
yeast             1484           10               8

Fig. 4. Final posteriori probability based on BoostVote and AA rule (y-axis: final posteriori probability; x-axis: testing instances, 1 to 50)

Table 2. Testing error performance (in percentage)

Voting method     GA       AA       MV       MajV     BoostVote
ecoli             13.21    13.17*   13.2     13.28    13.29
shuttle            7.38     7.46     8.85     8.94     7.35*
spectf            21.92    22.25    23.4     23.4     21.74*
wdbc               8.2*     8.22     8.48     8.48     8.22
wine              20.28    22.9     28.31    29.97    14.73*
yeast             39.88    39.89    39.87    39.89    39.86*
Winning times      1        1        0        0        4

For each data set, we randomly select half of the data for training and use the remaining half for testing. A neural network with a multilayer perceptron (MLP) structure is used as the base learning algorithm in our current study. The number of hidden layer neurons is set to 10, and the numbers of input and output neurons are set to the number of features and classes of each data set, respectively. The sigmoid function is used as the activation function, the learning rate is set to 0.05, and the number of learning iterations is 100. Bagging [2] is used to create the ensemble system, and 25 bootstrap sampling (with replacement) iterations are used, as suggested in [15] for general use of the bagging method.
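This experimental setup can be approximated with off-the-shelf tools. The sketch below uses scikit-learn and its bundled wine data rather than a download from the UCI repository; these are our choices, not the paper's implementation. Note that BaggingClassifier aggregates its members by averaging their predicted probabilities, i.e. the AA rule, so evaluating BoostVote would require combining the members' posteriors manually.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# One MLP hypothesis: 10 hidden neurons, sigmoid activation,
# learning rate 0.05, 100 training iterations (mirroring the setup above).
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    learning_rate_init=0.05, max_iter=100)

# Bagging with 25 bootstrap (with replacement) resamples of the training data.
ensemble = BaggingClassifier(mlp, n_estimators=25, bootstrap=True,
                             random_state=0).fit(X_tr, y_tr)

# Test error of the bagged ensemble under its built-in (AA-style) aggregation.
err = 1.0 - ensemble.score(X_te, y_te)
print(f"bagged MLP test error: {err:.3f}")
```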
Table 2 presents the testing error performance based on the average of 100 random runs. Here we compare the performance of BoostVote with those of the four popular voting strategies discussed in Section 2. For each data set, the winning strategy is marked with an asterisk. In addition, the total winning times for each method across all these data sets are also summarized in Table 2. These numerical results indicate that BoostVote can provide competitive voting results when multiple hypotheses are involved in a voting system.

5 Conclusion and Future Work

In this paper, a novel voting strategy is proposed for knowledge integration and decision making systems. By using a concept similar to the signal to noise ratio, the proposed method enables each hypothesis to vote optimally according to its knowledge level for the target function, thereby boosting the performance of the final decision. Mathematical analysis is presented in detail in this paper, and simulation analysis on various data sets is used to demonstrate the effectiveness of this method.
There are various interesting directions that can be further developed. For instance, large-scale empirical studies and the development of assessment metrics will be useful to fully justify the effectiveness of this method in different applications. In addition, it would be interesting to analyze the performance of this voting method under skewed data distributions (the imbalanced data learning problem). Since voting strategy plays a critical role in many machine learning methods, we hope the proposed research provides new insights into this fundamental problem and can potentially become a powerful method for a wide range of application domains.

References
1. Schapire, R.E.: The Strength of Weak Learnability. Machine Learning 5(2), 197–227 (1990)
2. Breiman, L.: Bagging Predictors. Machine Learning 24(2), 123–140 (1996)
3. Freund, Y., Schapire, R.E.: Experiments With a New Boosting Algorithm. In: International Conference on Machine Learning, pp. 148–156 (1996)
4. Freund, Y.: An Adaptive Version of the Boost by Majority Algorithm. Machine Learning 43(3), 293–318 (2001)
5. Ho, T.K.: Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
6. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
7. He, H., Shen, X.: A Ranked Subspace Learning Method for Gene Expression Data Classification. In: International Conference on Artificial Intelligence, pp. 358–364 (2007)
8. Rodríguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation Forest: A New Classifier Ensemble Method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(10), 1619–1630 (2006)
9. Wolpert, D.H.: Stacked Generalization. Neural Networks 5(2), 241–259 (1992)
10. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive Mixtures of Local Experts. Neural Computation 3(1), 79–87 (1991)
11. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
12. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 3rd edn. Elsevier, Academic Press (2006)
13. Starzyk, J.A., Ding, M., He, H.: Optimized Interconnections in Probabilistic Self-Organizing Learning. In: IASTED International Conference on Artificial Intelligence and Applications, pp. 14–16 (2005)
14. UCI Machine Learning Repository, http://mlean.ics.uci.edu/MLRepository.html
15. Opitz, D., Maclin, R.: Popular Ensemble Methods: An Empirical Study. J. Artificial Intelligence Research 11, 169–198 (1999)
