International Journal of Pattern Recognition and Artificial Intelligence
Vol. 21, No. 6 (2007) 1085–1101
© World Scientific Publishing Company

FIVE NEW FEATURE SELECTION METRICS IN TEXT CATEGORIZATION

FENGXI SONG∗,†,‡, DAVID ZHANG§, YONG XU† and JIZHONG WANG∗

∗New Star Research Institute of Applied Technology in Hefei City,
Department of Automation and Simulation,
451 Huang Shan Road, Hefei, Anhui 230031, P. R. China

†Shenzhen Graduate School, Harbin Institute of Technology

§Hong Kong Polytechnic University

‡Author for correspondence: songfengxi@yahoo.com

Feature selection has been extensively applied in statistical pattern recognition as a mechanism for cleaning up the set of features that are used to represent data and as a way of improving the performance of classifiers. Four schemes commonly used for feature selection are Exponential Searches, Stochastic Searches, Sequential Searches, and Best Individual Features. The most popular scheme in text categorization is Best Individual Features, as the extremely high dimensionality of text feature spaces renders the other three feature selection schemes prohibitively time-consuming.
This paper proposes five new metrics for selecting Best Individual Features for use in text categorization. Their effectiveness has been empirically tested on two well-known data collections, Reuters-21578 and 20 Newsgroups. Experimental results show that the performance of two of the five new metrics, Bayesian Rule and F-one Value, is not significantly below that of a good traditional text categorization selection metric, Document Frequency. The performance of another two of the five new metrics, Low Loss Dimensionality Reduction and Relative Frequency Difference, is equal to or better than that of good conventional feature selection metrics such as Mutual Information and Chi-square Statistic.

Keywords: Feature selection; text categorization; support vector machines; multiple comparative test; pattern recognition.

1. Introduction
Text categorization (TC), the assignment of free text documents to one or more
predefined categories based on their content, is a powerful tool for more effectively
finding, filtering, and managing text resources. Though TC goes back at least to
the early 1960s and to Maron’s seminal work,16 it is only in the last ten years that
it has attracted researchers in the field of machine learning. In the past decade a number of statistical classification and machine learning techniques have been successfully applied to text categorization.
Documents, which typically are strings of characters, have to be transformed into a representation suitable for the learning algorithm and the classification task. In the
information retrieval style, each document is usually represented by a feature vector
of n weighted word stems that occur in the document.19 Each distinct word stem
corresponds to a feature in the vector.
In text categorization, one is usually confronted with feature spaces containing
tens of thousands of dimensions, often exceeding the number of available training
examples. It is imperative to perform feature selection before training a classifier, as doing so makes it possible to use conventional learning methods, improves generalization accuracy, and prevents overfitting.
Feature selection has been extensively applied in statistical pattern recogni-
tion as a mechanism for cleaning up the set of features that are used to represent
data and as a way of improving the performance of classifiers.8 Four schemes com-
monly used for feature selection are Exponential Searches (e.g. Branch and Bound
Algorithm17 ), Stochastic Searches (e.g. Genetic Algorithm3 ), Sequential Searches
(e.g. Sequential Forward Floating Selection9 ), and Best Individual Features.13,21,26
The most popular scheme used for text categorization is Best Individual Features, as the extremely high dimensionality of text feature spaces renders the other three feature selection schemes prohibitively time-consuming.
The Best Individual Features selection scheme ignores the dependence of
featuresa and selects or retains an original feature if and only if, given r features
are going to be selected, it is one of the top r features with the highest “relevance”
or "goodness" for predicting categories. The absolute magnitude of a feature's relevance is not very important; what matters is its rank when features are sorted by relevance. Thus applying a monotonically increasing function to the feature relevance scores does not change the feature selection result.
Different methods for evaluating the relevance or goodness of features imply
different feature selection metrics. Feature selection metrics such as Mutual Infor-
mation (MI), Chi-square Statistic (CHI), Correlation Coefficient, Relevancy Score,
Odds Ratio, Simplified Chi-square Statistic (SCHI), and Document Frequency (DF)
have been well studied in the text categorization literature.6,13,21,26 Yang and
Pedersen26 found that MI, CHI, and DF were the most effective metrics. Galavotti6
showed that SCHI is also a promising feature selection metric.
This paper proposes five new feature selection metrics for use in text catego-
rization. Their effectiveness has been empirically tested on two benchmark data
collections, Reuters-21578 and 20 Newsgroups. Experimental results show that the
performance of two of the five new metrics, Bayesian Rule and F-one Value, is not
significantly below that of a good traditional text categorization selection metric,

a In certain pattern recognition problems, where there are fewer classes and features, dependence is sometimes taken into account.



Document Frequency. The performance of another two of these five new metrics,
Low Loss Dimensionality Reduction (LLDR) and Relative Frequency Difference
(RFD), is equal to or better than that of good conventional feature selection metrics such as Mutual Information and Chi-square Statistic.

2. Conventional Feature Selection Metrics


We will briefly review Mutual Information and other conventional feature selection metrics in this section.
Our description of the feature selection metrics will use the following notation.

• A — the number of documents that are from category c and contain feature x.
• B — the number of documents that are not from category c and contain
feature x.
• C — the number of documents that are from category c and do not contain
feature x.
• D — the number of documents that are not from category c and do not contain
feature x.
• M — the number of documents that are from category c.
• N — the number of documents.

It is obvious that A + C = M , and A + B + C + D = N .
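To make the notation concrete, the following Python sketch (illustrative code, not part of the original experiments) computes these counts for a single feature and a single category from Boolean per-document indicators:

```python
# A minimal sketch (not from the paper): computing the contingency counts
# A, B, C, D, together with M and N, for one feature x and one category c.
from typing import Sequence, Tuple

def contingency_counts(has_feature: Sequence[bool],
                       in_category: Sequence[bool]) -> Tuple[int, int, int, int, int, int]:
    """Return (A, B, C, D, M, N) as defined above."""
    assert len(has_feature) == len(in_category)
    N = len(has_feature)
    A = sum(1 for f, c in zip(has_feature, in_category) if f and c)
    B = sum(1 for f, c in zip(has_feature, in_category) if f and not c)
    C = sum(1 for f, c in zip(has_feature, in_category) if not f and c)
    D = N - A - B - C
    M = A + C
    return A, B, C, D, M, N
```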

2.1. Mutual information


Mutual Information (MI), sometimes called Information Gain, is frequently
employed as a feature-relevance criterion in the field of machine learning.2,11 It
measures the relative entropy of the joint distribution with respect to the product
distribution. The mutual information between feature x and category c, M I(x, c),
is calculated as follows.
MI(x, c) = p(x, c)·log[p(x, c)/(p(x)·p(c))] + p(x̄, c)·log[p(x̄, c)/(p(x̄)·p(c))]
           + p(x, c̄)·log[p(x, c̄)/(p(x)·p(c̄))] + p(x̄, c̄)·log[p(x̄, c̄)/(p(x̄)·p(c̄))].     (1)

Here the symbol p stands for probability. For example, p(x, c̄) is the probability that a given document contains feature x and is not from category c.
Although we cannot calculate MI(x, c) directly, we can approximate it as follows.

MI(x, c) ≈ (1/N)·[A·log A + B·log B + C·log C + D·log D
           − (A + B)·log(A + B) − (C + D)·log(C + D) − (A + C)·log(A + C)
           − (B + D)·log(B + D) + N·log N].     (2)
September 7, 2007 18:4 WSPC/115-IJPRAI SPI-J068 00583

1088 F. Song et al.

Since N is a constant and, for a given category c, A + C = M and B + D = N − M are also constants, we can use the following formula instead without altering the feature selection result.

MI(x, c) ≈ A·log A + B·log B + C·log C + D·log D − (A + B)·log(A + B) − (C + D)·log(C + D).     (3)
The larger the mutual information between feature x and category c is, the more
relevant the feature x is for predicting the category.
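For concreteness, a minimal sketch of the rank-equivalent score of Eq. (3); the 0·log 0 = 0 convention for empty cells is an implementation choice, since the paper does not discuss zero counts:

```python
import math

def mi_score(A: int, B: int, C: int, D: int) -> float:
    """Rank-equivalent mutual information score of Eq. (3),
    using the convention 0 * log 0 = 0."""
    def xlogx(v: int) -> float:
        return v * math.log(v) if v > 0 else 0.0
    return (xlogx(A) + xlogx(B) + xlogx(C) + xlogx(D)
            - xlogx(A + B) - xlogx(C + D))
```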

2.2. Chi-square statistic


The Chi-square Statistic (CHI) comes from the Contingency Table Test and mea-
sures the lack of independence between feature x and category c.20 It is given by

χ²(x, c) = N·{[p(x, c) − p(x)·p(c)]²/(p(x)·p(c)) + [p(x̄, c) − p(x̄)·p(c)]²/(p(x̄)·p(c))
           + [p(x, c̄) − p(x)·p(c̄)]²/(p(x)·p(c̄)) + [p(x̄, c̄) − p(x̄)·p(c̄)]²/(p(x̄)·p(c̄))}
         = N·[p(x, c)·p(x̄, c̄) − p(x, c̄)·p(x̄, c)]²/[p(x)·p(x̄)·p(c)·p(c̄)].     (4)
As an approximation, we use the formula
χ²(x, c) = N·(A·D − B·C)²/[(A + C)·(B + D)·(A + B)·(C + D)].     (5)

Again, without altering the feature selection result, χ²(x, c) can be further simplified as

χ²(x, c) = (A·D − B·C)²/[(A + B)·(C + D)].     (6)
The chi-square statistic has a natural value of zero if x and c are independent. The larger the chi-square statistic of feature x and category c is, the more relevant feature x is for predicting the category.
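A sketch of Eq. (6); the handling of the degenerate case in which the feature occurs in every document or in none is an assumption, not specified by the paper:

```python
def chi2_score(A: int, B: int, C: int, D: int) -> float:
    """Rank-equivalent chi-square score of Eq. (6). Returns 0.0 when the
    feature occurs in all documents or in none (degenerate denominator)."""
    denom = (A + B) * (C + D)
    return (A * D - B * C) ** 2 / denom if denom > 0 else 0.0
```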

2.3. Correlation coefficient and simplified chi-square statistic


Correlation Coefficient and Simplified Chi-square Statistic (SCHI) are two modified versions of the Chi-square Statistic. Ng et al.18 have observed that the use of the chi-square statistic for text feature selection is counterintuitive because squaring the numerator has the effect of equating those factors that indicate a positive correlation between the feature and the category (i.e. p(x, c) and p(x̄, c̄)) with those that indicate a negative correlation (i.e. p(x, c̄) and p(x̄, c)). The "correlation coefficient" CC(x, c),b

b In this paper, the optimal feature subset selected using CC(x, c) is exactly the same as that chosen using the chi-square statistic.


September 7, 2007 18:4 WSPC/115-IJPRAI SPI-J068 00583

Five New Feature Selection Metrics in Text Categorization 1089

they proposed, being the square root of χ²(x, c), emphasizes the former and de-emphasizes the latter, thus respecting intuitions. CC(x, c) is given by

CC(x, c) = √N·[p(x, c)·p(x̄, c̄) − p(x, c̄)·p(x̄, c)]/√(p(x)·p(x̄)·p(c)·p(c̄)).     (7)
Fuhr et al.5 refined the criterion by observing that in CC(x, c) (and a fortiori in χ²(x, c))

— the N factor at the numerator has no influence on the feature selection result, since it is equal for all pairs (x, c);
— the presence of √(p(x)·p(x̄)) at the denominator emphasizes extremely rare features, which Yang and Pedersen26 have clearly shown to be the least effective in TC;
— the presence of √(p(c)·p(c̄)) at the denominator emphasizes extremely rare categories, which is extremely counterintuitive.c

By eliminating these factors from CC(x, c), Fuhr et al. obtain the Simplified Chi-square Statistic given by

sχ²(x, c) = |p(x, c)·p(x̄, c̄) − p(x, c̄)·p(x̄, c)|,     (8)
which can be further changed to the form
sχ2 (x, c) = |A · D − B · C|. (9)

2.4. Document frequency


The document frequency (DF) for a feature is the number of documents that con-
tain the feature. The document frequency approach computes the document fre-
quency for each feature in the training corpus and removes those features having
a document frequency that is less than some predetermined threshold. The basic
assumption is that rare features are either non-informative for purposes of category prediction or not influential in global performance.1 Document
frequency is given by:
DF (x) = A + B. (10)
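Both SCHI (Eq. 9) and DF (Eq. 10) reduce to one-line scoring functions; a minimal sketch in the same style as the earlier snippets:

```python
def schi_score(A: int, B: int, C: int, D: int) -> int:
    """Simplified chi-square score of Eq. (9)."""
    return abs(A * D - B * C)

def df_score(A: int, B: int) -> int:
    """Document frequency of Eq. (10)."""
    return A + B
```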
Document Frequency is the simplest and one of the most effective text feature selection approaches. However, it is usually considered an ad hoc approach to efficiency
improvement, not a principled criterion for selecting predictive features. This is
because it lacks a sound theoretical base. One metric that has all of the advantages
of DF but without this disadvantage is Low Loss Dimensionality Reduction feature
selection, which is based on a Bayesian classifier. The following section discusses
LLDR in some detail.

c The square root of p(c)·p(c̄) is a constant that also has no influence on the feature selection result.

3. Several New Feature Selection Metrics


This section proposes five new feature selection metrics that have not previously
been used in text categorization.

3.1. Low Loss Dimensionality Reduction


The Low Loss Dimensionality Reduction (LLDR) feature selection metric is directly based on the so-called optimal classifier, i.e. a Bayesian classifier. A Bayesian classifier assigns a free pattern to category j of the l predefined categories, ω1, ω2, . . . , ωl, if and only if its observation x satisfies the condition

p(ωj|x) = max_{1≤i≤l} p(ωi|x).     (11)

Here p(ωj |x) is the conditional probability of ωj given x.


Let Rj = {x | p(ωj|x) = max_{1≤i≤l} p(ωi|x)}; then the probability of correct classification of a Bayesian classifier is

P_X = Σ_{j=1}^{l} Pj·p(Rj|ωj).     (12)

Here P1, P2, . . . , Pl are the class prior probabilities. The subscript in P_X emphasizes that the probability is derived from observations under the attribute set X.
Definition 1. Feature x is redundant if and only if P_{X−{x}} = P_X.

Definition 2. Feature x is independent of the attribute set X − {x} and the categories if and only if for any given value of x, x∗, and ω the following equation holds.

p(x, x∗, ω) = p(x)·p(x∗, ω).     (13)

Here x∗ denotes the observation under the attribute set X − {x}.

Theorem 1. If feature x is independent of the attribute set X − {x} and the categories, then it is redundant.

Proof. Let R∗j = {x∗ | p(ωj|x∗) = max_{1≤i≤l} p(ωi|x∗)}, and Rj = {x | p(ωj|x) = max_{1≤i≤l} p(ωi|x)}.
Since

x ∈ Rj ⇔ ∀i, 1 ≤ i ≤ l, p(ωj|x) ≥ p(ωi|x)
       ⇔ ∀i, 1 ≤ i ≤ l, p(x, x∗, ωj) ≥ p(x, x∗, ωi)   (multiplying both sides by the probability of the observation (x, x∗))
       ⇔ ∀i, 1 ≤ i ≤ l, p(x∗, ωj) ≥ p(x∗, ωi)   (by the independence assumption, Eq. (13))
       ⇔ x∗ ∈ R∗j.

Thus we have P_X = Σ_{j=1}^{l} Pj·p(Rj|ωj) = Σ_{j=1}^{l} Pj·p(R∗j|ωj) = P_{X−{x}}, i.e. feature x is redundant.

Corollary 1. If feature x takes a certain value almost everywhere, then it is redundant.

Redundant features contribute nothing to sample category prediction. LLDR tries to select predictive features by omitting features that are redundant or nearly redundant.
A feature x that both rarely appears in category c and seldom appears in the other categories is nearly redundant. In other words, either the conditional probability of the feature appearing in a given category, p(x|c), or the conditional probability of the feature appearing in the other categories, p(x|c̄), is a good index of the feature's relevance to the category. Thus the feature selection metric based on LLDR can be given by

LLDR(x, c) = max{p(x|c), p(x|c̄)},     (14)

which can be approximated by

LLDR(x, c) = max{A/M, B/(N − M)}.     (15)

The smaller the score of LLDR(x, c), the less relevant feature x is for predicting the category.
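A minimal sketch of Eq. (15), assuming the category is neither empty nor equal to the whole collection:

```python
def lldr_score(A: int, B: int, M: int, N: int) -> float:
    """LLDR score of Eq. (15): max of the estimates of p(x|c) and p(x|c-bar).
    Assumes 0 < M < N for the category under consideration."""
    return max(A / M, B / (N - M))
```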

3.2. Relative frequency difference


A feature is a representative feature for a given category only when most of the samples from that category contain it. A feature is a discriminating feature for a given category only when most samples from other categories do not contain it. Obviously, a feature is predictive when it is both representative and discriminating.
The probability that feature x occurs in category c, p(x|c), is a good measure of the representativeness of x for c. On the other hand, the negative of the probability that feature x occurs in the other categories (i.e. c̄), −p(x|c̄), is a good measure of the discriminative ability of x for c. Thus we can use |p(x|c) − p(x|c̄)| to evaluate the relevance of features. We have

RFD(x, c) = |p(x|c) − p(x|c̄)|,     (16)

which can be approximated by the absolute value of the relative frequency difference, i.e.

RFD(x, c) = |A/M − B/(N − M)| = |A·D − B·C|/[M·(N − M)].     (17)
In a way that is surprisingly similar to the Simplified Chi-square Statistic, the feature selection metric based on Relative Frequency Difference (RFD) can be rewritten as

RFD(x, c) = |A·D − B·C|.     (18)
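A sketch of Eq. (17), under the same assumption on M as before:

```python
def rfd_score(A: int, B: int, M: int, N: int) -> float:
    """RFD score of Eq. (17), i.e. |p(x|c) - p(x|c-bar)| estimated from counts.
    Rank-equivalent to |A*D - B*C| of Eq. (18). Assumes 0 < M < N."""
    return abs(A / M - B / (N - M))
```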

3.3. Bayesian rule


Bayesian Rule is another feature selection metric based on a Bayesian classifier.
When we use only one feature, say x, to predict the category, the error probability of a Bayesian classifier is
error = p(x)·min{p(c|x), p(c̄|x)} + p(x̄)·min{p(c|x̄), p(c̄|x̄)},     (19)

which can be approximated by

error = [min{A, B} + min{C, D}]/N.     (20)

The smaller the probability of error is, the more relevant the feature is. Thus the feature selection metric based on the Bayesian Rule (BR) is given by

BR(x, c) = −p(x)·min{p(c|x), p(c̄|x)} − p(x̄)·min{p(c|x̄), p(c̄|x̄)},     (21)

which can be approximated and simplified as

BR(x, c) = −min{A, B} − min{C, D}.     (22)
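A sketch of Eq. (22) (illustrative code, not from the paper):

```python
def br_score(A: int, B: int, C: int, D: int) -> int:
    """BR score of Eq. (22): the negated, unnormalized Bayes error of the
    one-feature classifier; larger values mean more relevant features."""
    return -(min(A, B) + min(C, D))
```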

3.4. F-one value


Similar to the Bayesian Rule, F-one Value (FV) is also a Bayesian classifier-based feature selection metric. The difference is that the Bayesian Rule chooses features that minimize the error probability of a Bayesian classifier whereas F-one Value chooses features that maximize the F1-value of a Bayesian classifier.
Let a denote the number of documents that are correctly assigned to the category (i.e. the number of documents truly in the category that are assigned "yes"), b the number of documents incorrectly assigned to the category, and c the number of documents incorrectly rejected from the category. The recall and precision are defined as follows:
r = a/(a + c),     (23)
p = a/(a + b).     (24)

Recall is the proportion of correctly assigned documents to documents in the category. Precision is the proportion of correctly assigned documents to documents that are assigned to the category.
When combining recall and precision, the F1-value is calculated by

F1 = 2·r·p/(r + p) = 2/(1/r + 1/p).     (25)
When using only one feature x to predict the category based on a Bayesian classifier, we have the following four contingency tables.

• if A > B and C > D, the contingency table is

                   Assigned YES    Assigned NO
  YES is correct      A + C             0
  NO is correct       B + D             0

• if A > B and C ≤ D, the contingency table is

                   Assigned YES    Assigned NO
  YES is correct        A               C
  NO is correct         B               D

• if A ≤ B and C > D, the contingency table is

                   Assigned YES    Assigned NO
  YES is correct        C               A
  NO is correct         D               B

• if A ≤ B and C ≤ D, the contingency table is

                   Assigned YES    Assigned NO
  YES is correct        0             A + C
  NO is correct         0             B + D

Corresponding to the above four contingency tables, the scores of the F1-value are 2/(2 + (B + D)/(A + C)), 2/(2 + (B + C)/A), 2/(2 + (A + D)/C), and 0, respectively. The larger the score of F1 is, the more relevant the feature is. Thus the feature selection metric based on F-one Value can be given by:

FV(x, c) =  −(N − M)/M,   if A > B and C > D
            −(B + C)/A,   if A > B and C ≤ D
            −(A + D)/C,   if A ≤ B and C > D
            −∞,           if A ≤ B and C ≤ D.     (26)
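A sketch of the case analysis of Eq. (26) (illustrative code, not from the paper):

```python
def fv_score(A: int, B: int, C: int, D: int) -> float:
    """FV score of Eq. (26); larger scores correspond to larger F1-values.
    Note that -(B + D)/(A + C) equals -(N - M)/M."""
    if A > B and C > D:
        return -(B + D) / (A + C)
    if A > B:             # and C <= D
        return -(B + C) / A
    if C > D:             # and A <= B
        return -(A + D) / C
    return float("-inf")  # A <= B and C <= D
```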

3.5. Fisher discriminant


Fisher Discriminant (FD) is a feature selection metric based on Fisher Discriminant Analysis. When using only one feature to predict the category, we hope that its between-scatter is as large as possible whereas its within-scatter is as small as possible.
When feature x is selected, the between-scatter of x between category c and the other categories is calculated by

bs = |E(x|c) − E(x|c̄)|.     (27)

Here E(x|c) and E(x|c̄) are conditional expectations of feature x, which can be approximated by A/M and B/(N − M) respectively.
The within-scatters of x in c and c̄ are the conditional variances D(x|c) and D(x|c̄), which can be approximated by A/M − (A/M)² and B/(N − M) − (B/(N − M))² respectively.

The feature selection metric based on the Fisher Discriminant is given by

FD(x, c) = bs/√(D(x|c)·D(x|c̄)) = |E(x|c) − E(x|c̄)|/√(D(x|c)·D(x|c̄)).     (28)

We can further approximate and simplify it as

FD(x, c) = |A·D − B·C|/√(A·B·C·D).     (29)

It is well known that the correlation coefficient between feature x and category c (which is different from CC(x, c)) can be calculated as

ρ(x, c) = [E(x·c) − E(x)·E(c)]/√(D(x)·D(c)).     (30)

Again, we can approximate it by

ρ(x, c) = M·(N − M)·(A·D − B·C)/√(N²·A·B·C·D).     (31)
Here E(x) and D(x) are the expectation and variance of feature x, respectively. The larger the score of |ρ(x, c)|, the more relevant feature x is.
Thus a feature selection metric based on the correlation coefficient can be written as

ρ(x, c) = |A·D − B·C|/√(A·B·C·D),     (32)

which is the same as FD.
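A sketch of Eq. (29); the vanishing denominator for rare terms (any zero count) must be handled somehow, and the eps guard below is only one hedged choice, not taken from the paper:

```python
import math

def fd_score(A: int, B: int, C: int, D: int, eps: float = 1e-12) -> float:
    """FD score of Eq. (29). The eps guard against zero counts in the
    denominator is an implementation choice, not specified in the paper."""
    return abs(A * D - B * C) / math.sqrt(A * B * C * D + eps)
```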

4. Comparative Study of Feature Selection Metrics


4.1. Classifier
Compared to other state-of-the-art methods, Support Vector Machines (SVMs) have shown superb effectiveness on text categorization.4,10,25 Moreover, linear SVMs can achieve very high performance. Linear SVMs are used as the basic classification algorithm throughout the experiments in this paper.
There are many SVM packages available on the Internet, such as Joachims's SVMlight and Platt's SMO. The LinearSVC in the OSU SVM Classifier Matlab Toolbox developed by Ma et al.15 has been used in the experiments in this paper. For simplicity, instead of fine-tuning the only parameter of the linear SVM, the misclassification penalty C, on a validation set, we let it take the default value of 1.

4.2. Data collection and performance measure


The empirical evaluation is done on two benchmark datasets: Reuters-21578 and
20 Newsgroups.

The Reuters-21578 dataset was compiled by Lewis14 and originally collected by the Carnegie Group from the Reuters newswire in 1987. The "ModApte" split is used, leading to a corpus of 9603 training documents and 3299 test documents. Of the 135 potential topic categories, only the 90 that have at least one training sample and one test sample are used. The text enclosed between the tags <TEXT> and </TEXT> in each training document is used for classification. Words occurring in the titles are not discriminated from those occurring in the bodies. The size of the vocabulary is 27,942.
Previous studies22,23 showed that when text documents are represented by indexing with term frequency, scaling with inverse document frequency, normalizing the feature vector to unit length, retaining stop-words, and not applying word stemming, the linear SVM can achieve the highest performance on Reuters-21578.
Since Reuters poses a multilabel problem, it is broken into 90 binary classification tasks using the one-versus-rest approach.7 The micro-averaged precision-recall break-even point21,26 is used to measure the performance of a classifier.
The 20 Newsgroups collection was first collected as a text corpus by Lang.12 It contains 19,997 email documents evenly distributed across 20 categories. The first 800 documents in each newsgroup are used as training samples and the remaining documents are used as test samples. By skipping all headers and UU-encoded blocks, only the body of a document is used for classification. The size of the vocabulary is 98,225.
The text representation scheme for this dataset is the same as in Reuters except that term frequency is replaced by a binary value.23 Since 20 Newsgroups is a multiclass, single-label problem, the Directed Acyclic Graph SVM7 and multiclass classification accuracy are used.
Unlike in Reuters, a global feature selection scheme is used for 20 Newsgroups. To select the top r features based on a given feature selection metric, the relevance of each feature with respect to each category is calculated first. The total goodness of a feature is the sum of its category-dependent relevance scores. The r features with the highest goodness are then chosen, as sketched below.
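A sketch of this global selection procedure, written for any metric expressed in terms of the counts (A, B, C, D), such as the chi-square score sketched in Sec. 2.2 (function and variable names are illustrative, not from the paper):

```python
from typing import Callable, List, Sequence

def select_global_top_r(score_fn: Callable[[int, int, int, int], float],
                        doc_features: Sequence[Sequence[bool]],
                        doc_labels: Sequence[int],
                        n_categories: int,
                        r: int) -> List[int]:
    """Global feature selection as described above: sum each feature's
    per-category relevance and keep the r features with the highest total."""
    n_docs = len(doc_labels)
    n_features = len(doc_features[0])
    goodness = [0.0] * n_features
    for cat in range(n_categories):
        in_cat = [label == cat for label in doc_labels]
        M = sum(in_cat)
        for j in range(n_features):
            A = sum(1 for d in range(n_docs) if doc_features[d][j] and in_cat[d])
            B = sum(1 for d in range(n_docs) if doc_features[d][j] and not in_cat[d])
            C, D = M - A, (n_docs - M) - B
            goodness[j] += score_fn(A, B, C, D)
    # indices of the r features with the highest total goodness
    return sorted(range(n_features), key=lambda j: goodness[j], reverse=True)[:r]
```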

4.3. Results
The effectiveness of the various feature selection metrics on Reuters is displayed in Table 1. The box plot of the data in Table 1 (Fig. 1) shows that the eight tested feature selection metrics can be approximately divided into four distinguishable groups: (1) MI, CHI, LLDR, RFD; (2) DF; (3) BR, FV; and (4) FD.
Further multiple comparative tests (see Fig. 2) indicate that

• There is no significant difference between any two feature selection metrics within Group 1, which consists of MI, CHI, LLDR and RFD.
• There is no significant difference between any two feature selection metrics within Group 2, which consists of DF, BR and FV.

Table 1. Microaverage BEPs of the tested feature selection metrics over the 90 Reuters categories.

Feature Selection                        Number of Features
Metric       200    400    600    800    1000   2000   3000   4000   5000   6000
MI          0.863  0.869  0.870  0.871  0.873  0.876  0.879  0.879  0.878  0.877
CHI         0.846  0.859  0.865  0.868  0.870  0.876  0.878  0.877  0.876  0.876
DF          0.669  0.725  0.761  0.786  0.800  0.854  0.867  0.873  0.875  0.877
LLDR        0.865  0.871  0.877  0.878  0.878  0.879  0.881  0.879  0.879  0.880
RFD         0.868  0.873  0.876  0.875  0.878  0.879  0.880  0.880  0.880  0.880
BR          0.770  0.774  0.769  0.768  0.771  0.774  0.777  0.778  0.800  0.812
FV          0.763  0.763  0.763  0.761  0.763  0.765  0.770  0.771  0.794  0.809
FD          0.577  0.564  0.560  0.558  0.561  0.560  0.564  0.568  0.591  0.597

• There is a significant difference between any two of Groups 1–3, where Group 3 contains only the feature selection metric FD.

Figure 1 shows the box plots of the effectiveness of the tested feature selection metrics on Reuters.
The lines of a box correspond to the lower quartile, median and upper quartile
values. The whiskers, lines extending from each end of the box, show the extent of
the rest of the data. The plus symbols are data with values beyond the ends of the
whiskers.
Figure 2 shows the results of multiple comparative tests on the data in Table 1. In the graph, each line and the circle at its middle point correspond to the 95% confidence interval and the mean effectiveness of a feature selection metric.
Fig. 1. The box plots of the data in Table 1. (Horizontal axis: feature selection method; vertical axis: microaverage BEPs over the 90 Reuters categories.)



Fig. 2. Results of multiple comparative tests on the data in Table 1. (Horizontal axis: microaverage BEPs over the 90 Reuters categories; vertical axis: feature selection method.)

Fig. 3. The effectiveness of the tested feature selection metrics on 20 Newsgroups.

There is a significant difference between any two feature selection metrics if and
only if their corresponding confidence intervals do not overlap.
Figure 3 shows the effectiveness of various feature selection metrics on the
20 Newsgroups. The tested feature selection metrics can be approximately divided
into three distinguishable groups: (1) MI, CHI, DF, LLDR, RFD, (2) BR, FV, and
(3) FD.
It can be concluded that any feature selection metric in Group 1 is more effective than any metric in Group 2, and that FD, the only metric in Group 3, is neither the most effective nor the least effective of all the metrics.
LLDR outperforms DF because of its sound theoretical background and use of
category information.

5. Conclusions and Discussion


Table 2 shows the characteristics (H, M, L, Y, and N, for high, medium, low, yes, and no) of the tested feature selection metrics in six areas:

• effectiveness;
• efficiency;
• having a sound theoretical background;
• favoring the rare terms;
• using category information;
• being supported by intuition.

Based on the experimental results on Reuters-21578 and 20 Newsgroups, and the characteristics of the various feature selection metrics listed in Table 2, it can be concluded that:

• The newly proposed feature selection metrics LLDR and RFD are at least as good as MI and CHI, which are the two best traditional feature selection metrics in text categorization, and better than DF, which is another good conventional feature selection metric.
• The newly proposed feature selection metrics BR and FV are less effective than LLDR, RFD, MI, and CHI but not significantly inferior to DF.

The new feature selection metrics LLDR and RFD are both easier to calculate than MI and CHI. This makes LLDR and RFD more suitable for selecting features in text categorization.
There is no general conclusion that can be derived about FD in this paper. Its
effectiveness must be further assessed in future studies.

Table 2. The characteristics of various feature selection metrics.

Algorithm                            MI   CHI  RFD  DF   LLDR  BR   FV   FD
Effectiveness                        H    H    H    M    H     L    L    L
Efficiency                           L    L    M    H    H     L    M    H
With sound theoretical background    H    H    M    L    H     H    H    H
Favoring the rare terms              N    N    N    N    N     N    N    H
Using category information           Y    Y    Y    N    Y     Y    Y    Y
Supported by the intuition           N    N    Y    Y    Y     N    N    N

Acknowledgments
This work is partially supported by the National Science Foundation of China
under grant nos. 60620160097 and 60602038 and Natural Science Foundation of
Guangdong province under grant no. 06300862.

References
1. C. Apte, F. Damerau and S. Weiss, Automated learning of decision rules for text
categorization, ACM Trans. Inform. Syst. 12(3) (1994) 233–251.
2. R. Battiti, Using mutual information for selecting features in supervised neural net
learning, IEEE Trans. Neural Networks 5(4) (1994).
3. T. J. Downey and D. J. Meyer, Genetic algorithm for feature selection, Intell. Engin.
Syst. Through Artif. Neur. Networks 4 (1994) 363–368.
4. S. Dumais, J. Platt, D. Heckerman and M. Sahami, Inductive learning algorithms and
representations for text categorization, Proc. CIKM-98, 7th ACM Int. Conf. Infor-
mation and Knowledge Management (1998), pp. 148–155.
5. N. Fuhr, N. Govert, M. Lalmas and F. Sebastiani, Categorization tool: final prototype.
Deliverable 4.3, Project LE4-8303 “EUROSEARCH”, Commission of the European
Communities (1999).
6. L. Galavotti, Un sistema modulare per la classificatione di testi basato sull’ apprendi-
mento automatico, Master’s thesis, Dipartimento di Informatica, Universita di Pisa,
Pisa, IT (1999).
7. C. Hsu and C. Lin, A comparison of methods for multiclass support vector machines,
IEEE Trans. Neural Networks 13(2) (2002) 415–425.
8. A. K. Jain, R. P. W. Duin and J. Mao, Statistical pattern recognition: a review, IEEE
Trans. PAMI 22(1) (2000) 4–37.
9. A. Jain and D. Zongker, Feature selection: evaluation, application, and small sample
performance, IEEE Trans. Patt. Anal. Mach. Intell. 19(2) (1997) 153–158.
10. T. Joachims, Text categorization with support vector machines: learning with many
relevant features, Proc. 10th European Conf. Machine Learning (ECML) (Springer-
Verlag, 1998).
11. N. Kwak and C. H. Choi, Input feature selection for classification problems, IEEE
Trans. Neural Networks 13(1) (2002).
12. K. Lang, Newsweeder: learning to filter netnews, Proc. Twelfth Int. Conf. Machine
Learning (1995), pp. 331–339.
13. D. D. Lewis, Naive (Bayes) at forty: the independence assumption in information
retrieval, Proc. ECML-98, 10th European Conf. Machine Learning, Chemnitz, DE
(1998), pp. 4–15.
14. D. Lewis, Reuters-21578, Distribution 1.0 http://www.research.att.com/∼lewis/
reuters21578.html.
15. J. Ma, Y. Zhao and S. Ahalt, OSU SVM Classifier Matlab Toolbox (ver 3.00),
http://eewww.eng.ohio-state.edu/∼maj/osu svm/
16. M. Maron, Automatic indexing: an experimental inquiry, J. Assoc. Comput. Mach.
8(3) (1961) 404–417.
17. P. M. Narendra and K. Fukunaga, A branch and bound algorithm for feature subset
selection, IEEE Trans. Comput. 26(9) (1977) 917–922.
18. H. T. Ng, W. B. Goh and K. L. Low, Feature selection, perceptron learning, and
a usability case study for text categorization, Proc. SIGIR-97, 20th ACM Int.
Conf. Research and Development in Information Retrieval (Philadelphia, US, 1997),
pp. 67–73.

19. G. Salton, A. Wong and C. Yang, A vector space model for automatic indexing,
Commun. ACM 18(11) (1975) 613–620.
20. H. Schutze, D. A. Hull and J. O. Pedersen, A comparison of classifiers and docu-
ment representations for the routing problem, Proc. SIGIR-95, 18th ACM Int. Conf.
Research and Development in Information Retrieval, pp. 229–237.
21. F. Sebastiani, Machine learning in automated text categorization, ACM Comput.
Surv. 34(1) (2002) 1–47.
22. F. Song, C. Cheng, S. Liu and J. Yang, Impact of text representations on performance
of linear support vector machines, Patt. Recogn. Artif. Intell. 17(2) (2004) 161–166
(in Chinese).
23. F. Song, S. Liu and J. Yang, A comparative study on text representation schemes in
text categorization, Patt. Anal. Appl. 8(1–2) (2005) 199–209.
24. Y. Yang, An evaluation of statistical approaches to text categorization, Inform. Retr.
1(1–2) (1999) 69–90.
25. Y. Yang and X. Liu, A re-evaluation of text categorization methods, Proc. SIGIR-99,
22nd ACM Int. Conf. Research and Development in Information Retrieval (1999),
pp. 42–49.
26. Y. Yang and J. O. Pedersen, A comparative study on feature selection in text
categorization, Machine Learning: Proc. Fourteenth Int. Conf. (ICML’97) (1997),
pp. 412–420.

Fengxi Song received the B.S. degree in mathematics at Anhui University, China, in 1984. He received the M.S. degree in applied mathematics at Changsha Institute of Technology, China, in 1987 and the Ph.D. in pattern recognition and intelligence systems at Nanjing University of Science & Technology, China, in 2004. Now, he is a professor at the New Star Research Inst. of Applied Tech. in Hefei City, China, and a postdoctoral research fellow at the Bio-Computing Research Center, Shenzhen Graduate School, Harbin Institute of Technology, China.
His research interests include computer vision, machine learning and automatic text categorization.

Yong Xu received his B.S. and M.S. degrees in 1994 and 1997, respectively. He received the Ph.D. in pattern recognition and intelligence system at NUST (China) in 2005. Now he works at Shenzhen Graduate School, Harbin Institute of Technology.
His current interests include biometrics, character recognition and image processing.

David Zhang graduated in computer science from Peking University, China, in 1974. He received the M.S. degree in computer science and engineering in 1983 and the Ph.D. in 1985, both from Harbin Institute of Technology, China. In 1994, he received the Ph.D. in electrical and computer engineering from the University of Waterloo, Canada. He is currently a professor at the Hong Kong Polytechnic University. He is the author of more than 140 journal papers, 20 book chapters, and ten books. He holds several patents in both the U.S. and China.
His research interests include automated biometrics-based authentication, pattern recognition and biometric technology and systems.

Jizhong Wang received the B.S. degree in 1999 and the M.S. degree in 2005, both from the Department of Automation and Simulation at the New Star Research Inst. of Applied Tech. in Hefei City. Now, he is an assistant professor at the New Star Research Inst. of Applied Tech. in Hefei City, China.
His current research interests are in the areas of object recognition, computer vision and virtual simulation.
