1. Introduction
Text categorization (TC), the assignment of free text documents to one or more
predefined categories based on their content, is a powerful tool for more effectively
finding, filtering, and managing text resources. Though TC goes back at least to
the early 1960s and to Maron’s seminal work,16 it is only in the last ten years that
it has attracted researchers in the field of machine learning. In the past decade
September 7, 2007 18:4 WSPC/115-IJPRAI SPI-J068 00583
a In certain pattern recognition problems, where there are fewer classes and features, dependence
Document Frequency. The performance of two more of these five new metrics, Low Loss Dimensionality Reduction (LLDR) and Relative Frequency Difference (RFD), is equal to or better than that of good conventional feature selection metrics such as Mutual Information and the Chi-square Statistic.
• A — the number of documents that are from category c and contain feature x.
• B — the number of documents that are not from category c and contain
feature x.
• C — the number of documents that are from category c and do not contain
feature x.
• D — the number of documents that are not from category c and do not contain
feature x.
• M — the number of documents that are from category c.
• N — the number of documents.
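The six counts above can be tallied directly from a labeled corpus. The following Python sketch shows one way to do it, treating each document as a set of terms; the toy corpus, the term "trade" and the labels are illustrative assumptions, not data from the paper:

```python
# Contingency counts A, B, C, D, M, N for a feature x (a term)
# and a category c, tallied from a toy labeled corpus.
def contingency(docs, labels, x, c):
    A = sum(1 for d, l in zip(docs, labels) if l == c and x in d)      # in c, has x
    B = sum(1 for d, l in zip(docs, labels) if l != c and x in d)      # not in c, has x
    C = sum(1 for d, l in zip(docs, labels) if l == c and x not in d)  # in c, no x
    D = sum(1 for d, l in zip(docs, labels) if l != c and x not in d)  # not in c, no x
    M = sum(1 for l in labels if l == c)  # documents in category c
    N = len(docs)                         # all documents
    return A, B, C, D, M, N

# Hypothetical four-document corpus with two categories.
docs = [{"trade", "tariff"}, {"trade", "export"}, {"wheat", "crop"}, {"crop"}]
labels = ["trade", "trade", "grain", "grain"]
A, B, C, D, M, N = contingency(docs, labels, "trade", "trade")
```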
b In this paper, the optimal feature subset selected using CC(x, c) is exactly the same as that
they proposed, being the square root of χ²(x, c), emphasizes the former and de-emphasizes the latter, thus respecting intuitions. CC(x, c) is given by
CC(x, c) = √N · [p(x, c) · p(x̄, c̄) − p(x̄, c) · p(x, c̄)] / √(p(x) · p(x̄) · p(c) · p(c̄)).   (7)
Fuhr et al.5 refined the criterion by observing that in CC(x, c) (and a fortiori in χ²(x, c)):
— the √N factor in the numerator has no influence on the feature selection result, since it is equal for all pairs (x, c);
— the presence of p(x) · p(x̄) in the denominator emphasizes extremely rare features, which Yang and Pedersen26 have clearly shown to be the least effective in TC;
— the presence of p(c) · p(c̄) in the denominator emphasizes extremely rare categories, which is extremely counterintuitive.c
By eliminating these factors from CC(x, c), Fuhr et al. obtain the Simplified Chi-square Statistic given by
sχ²(x, c) = |p(x, c) · p(x̄, c̄) − p(x̄, c) · p(x, c̄)|,   (8)
which, dropping the constant factor 1/N², can be rewritten in count form as
sχ²(x, c) = |A · D − B · C|.   (9)
c The square root of p(c) · p(c̄) is a constant that also has no influence on the feature selection result.
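Both CC(x, c) of Eq. (7) and sχ²(x, c) of Eq. (9) can be evaluated directly from the contingency counts: substituting p(x, c) ≈ A/N and so on, CC reduces to √N · (AD − BC) / √((A+B)(C+D)(A+C)(B+D)). The Python sketch below is an illustration of that count form (the sample counts are invented, and no guard against zero marginal counts is included):

```python
import math

def cc(A, B, C, D):
    # Correlation coefficient CC(x, c), Eq. (7), in count form:
    # sqrt(N) * (A*D - B*C) / sqrt((A+B)(C+D)(A+C)(B+D)).
    # Assumes every marginal count is nonzero.
    N = A + B + C + D
    return math.sqrt(N) * (A * D - B * C) / math.sqrt(
        (A + B) * (C + D) * (A + C) * (B + D))

def s_chi2(A, B, C, D):
    # Simplified Chi-square Statistic, Eq. (9): |A*D - B*C|.
    return abs(A * D - B * C)
```

Note that sχ² drops both the √N factor and the marginal-probability denominator, exactly the terms Fuhr et al. identified as uninformative or harmful.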
Proof. Let Rj* = {x* | p(ωj|x*) = max1≤i≤l p(ωi|x*)} and Rj = {x | p(ωj|x) = max1≤i≤l p(ωi|x)}.
Since
x ∈ Rj ⇔ ∀i, 1 ≤ i ≤ l, p(ωj|x) ≥ p(ωi|x)
⇔ ∀i, 1 ≤ i ≤ l, p(x, x*, ωj) ≥ p(x, x*, ωi)
⇔ ∀i, 1 ≤ i ≤ l, p(x*, ωj) ≥ p(x*, ωi)
⇔ x* ∈ Rj*.
Thus we have PX = Σj=1..l Pj · p(Rj|ωj) = Σj=1..l Pj · p(Rj*|ωj) = PX−{x}, i.e. feature x is redundant.
The probability of error of the Bayesian classifier is
error = p(x) · min{p(c|x), p(c̄|x)} + p(x̄) · min{p(c|x̄), p(c̄|x̄)},   (19)
which can be approximated by
error = (min{A, B} + min{C, D}) / N.   (20)
The smaller the probability of error is, the more relevant the feature is. Thus the feature selection metric based on the Bayesian Rule (BR) is given by
BR(x, c) = −p(x) · min{p(c|x), p(c̄|x)} − p(x̄) · min{p(c|x̄), p(c̄|x̄)},   (21)
which can be approximated and simplified as
BR(x, c) = −min{A, B} − min{C, D}.   (22)
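Eqs. (20) and (22) are direct to compute from the counts. A minimal Python sketch (the sample counts used for checking are invented):

```python
def bayes_error(A, B, C, D):
    # Approximate probability of error of the one-feature Bayesian
    # classifier, Eq. (20): (min{A, B} + min{C, D}) / N.
    N = A + B + C + D
    return (min(A, B) + min(C, D)) / N

def br(A, B, C, D):
    # Bayesian Rule metric, Eq. (22): BR = -min{A, B} - min{C, D}.
    # Larger values (closer to 0) mean a smaller error and hence a
    # more relevant feature.
    return -min(A, B) - min(C, D)
```

BR is simply the negated (unnormalized) error count, so ranking features by BR is the same as ranking them by increasing Bayesian error.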
−(N − M )/M, if A > B, and C >D
−(B + C)/A, if A > B, and C ≤D
F V (x, c) = . (26)
−(A + D)/C, if A ≤ B, and C >D
−∞, if A ≤ B, and C ≤D
Here E(x|c) and E(x|c̄) are the conditional expectations of feature x, which can be approximated by A/M and B/(N − M) respectively.
The within-class scatters of x in c and c̄ are the conditional variances D(x|c) and D(x|c̄), which can be approximated by A/M − (A/M)² and B/(N − M) − (B/(N − M))² respectively.
4.3. Results
The effectiveness of the various feature selection metrics on the Reuters corpus is displayed in Table 1. The box plots of the data in Table 1 show that the eight tested feature selection metrics can be approximately divided into four distinguishable groups: (1) MI, CHI, LLDR and RFD, (2) DF, (3) BR and FV, and (4) FD.
Further multiple comparative tests (see Fig. 2) indicate that
• There is no significant difference between any two feature selection metrics among
Group 1 that consists of MI, CHI, LLDR and RFD.
• There is no significant difference between any two of DF, BR and FV (i.e. Groups 2 and 3 above).
• There is a significant difference between feature selection metrics drawn from any two different groups; the last group contains the single metric FD.

Table 1. Microaverage BEPs of the tested feature selection metrics over the 90 Reuters categories.
Figure 1 shows the box plots of the effectiveness of the tested feature selection metrics on Reuters.
The lines of a box correspond to the lower quartile, median and upper quartile values. The whiskers, lines extending from each end of the box, show the extent of the rest of the data. The plus symbols mark data points whose values lie beyond the ends of the whiskers.
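The box-plot quantities described above can be computed as follows; this Python sketch assumes the conventional 1.5 × IQR whisker rule (MATLAB's default), which the paper does not state explicitly, and the sample data are invented:

```python
import statistics

def box_stats(data, k=1.5):
    # Lower quartile, median, upper quartile, whisker ends and outliers
    # of a box plot. Whiskers extend to the farthest data point within
    # k * IQR of the box; points beyond that are outliers (the "plus
    # symbols"). k = 1.5 is an assumed convention.
    q1, med, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo = min(v for v in data if v >= q1 - k * iqr)
    hi = max(v for v in data if v <= q3 + k * iqr)
    outliers = [v for v in data if v < lo or v > hi]
    return q1, med, q3, lo, hi, outliers
```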
Figure 2 shows the results of multiple comparative tests on the data in Table 1.
In the graph a line and the circle at the middle point of it correspond to 95%
confidence interval and mean for the effectiveness of each feature selection metric.
[Fig. 1: box plots of the microaverage BEPs over the 90 Reuters categories; y-axis: microaverage BEP (0.55 to 0.85); x-axis: feature selection method (MI, CHI, DF, LLDR, RFD, BR, FV, FD).]
[Fig. 2: multiple-comparison plot of the same data; y-axis: feature selection method (MI, CHI, DF, LLDR, RFD, BR, FV, FD); x-axis: 95% confidence intervals of the microaverage BEP.]
There is a significant difference between any two feature selection metrics if and
only if their corresponding confidence intervals do not overlap.
Figure 3 shows the effectiveness of the various feature selection metrics on the 20 Newsgroups corpus. The tested feature selection metrics can be approximately divided
into three distinguishable groups: (1) MI, CHI, DF, LLDR and RFD, (2) BR and FV, and (3) FD.
It can be concluded that any feature selection metric in Group 1 is more effective than any metric in Group 2, and that FD, the only metric in Group 3, is neither the most effective nor the least effective of all the metrics.
LLDR outperforms DF because of its sound theoretical background and use of
category information.
• effectiveness;
• efficiency;
• having a sound theoretical background;
• favoring the rare terms;
• using category information;
• agreement with intuition.
• The newly proposed feature selection metrics LLDR and RFD are at least as good as MI and CHI, the two best traditional feature selection metrics in text categorization, and better than DF, another good conventional feature selection metric.
• The newly proposed feature selection metrics BR and FV are less effective than LLDR, RFD, MI and CHI, but not significantly inferior to DF.
The new feature selection metrics LLDR and RFD are both easier to calculate than MI and CHI. This makes LLDR and RFD more suitable for selecting features in text categorization.
No general conclusion about FD can be drawn in this paper; its effectiveness must be further assessed in future studies.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China under grant nos. 60620160097 and 60602038 and the Natural Science Foundation of Guangdong Province under grant no. 06300862.
References
1. C. Apte, F. Damerau and S. Weiss, Automated learning of decision rules for text
categorization, ACM Trans. Inform. Syst. 12(3) (1994) 233–251.
2. R. Battiti, Using mutual information for selecting features in supervised neural net
learning, IEEE Trans. Neural Networks 5(4) (1994).
3. T. J. Downey and D. J. Meyer, Genetic algorithm for feature selection, Intell. Engin.
Syst. Through Artif. Neur. Networks 4 (1994) 363–368.
4. S. Dumais, J. Platt, D. Heckerman and M. Sahami, Inductive learning algorithms and
representations for text categorization, Proc. CIKM-98, 7th ACM Int. Conf. Infor-
mation and Knowledge Management (1998), pp. 148–155.
5. N. Fuhr, N. Govert, M. Lalmas and F. Sebastiani, Categorization tool: final prototype.
Deliverable 4.3, Project LE4-8303 “EUROSEARCH”, Commission of the European
Communities (1999).
6. L. Galavotti, Un sistema modulare per la classificazione di testi basato sull'apprendimento automatico, Master's thesis, Dipartimento di Informatica, Università di Pisa, Pisa, IT (1999) (in Italian).
7. C. Hsu and C. Lin, A comparison of methods for multiclass support vector machines,
IEEE Trans. Neural Networks 13(2) (2002) 415–425.
8. A. K. Jain, R. P. W. Duin and J. Mao, Statistical pattern recognition: a review, IEEE
Trans. PAMI 22(1) (2000) 4–37.
9. A. Jain and D. Zongker, Feature selection: evaluation, application, and small sample
performance, IEEE Trans. Patt. Anal. Mach. Intell. 19(2) (1997) 153–158.
10. T. Joachims, Text categorization with support vector machines: learning with many
relevant features, Proc. 10th European Conf. Machine Learning (ECML) (Springer-
Verlag, 1998).
11. N. Kwak and C. H. Choi, Input feature selection for classification problems, IEEE
Trans. Neural Networks 13(1) (2002).
12. K. Lang, Newsweeder: learning to filter netnews, Proc. Twelfth Int. Conf. Machine
Learning (1995), pp. 331–339.
13. D. D. Lewis, Naive (Bayes) at forty: the independence assumption in information
retrieval, Proc. ECML-98, 10th European Conf. Machine Learning, Chemnitz, DE
(1998), pp. 4–15.
14. D. Lewis, Reuters-21578, Distribution 1.0, http://www.research.att.com/~lewis/reuters21578.html.
15. J. Ma, Y. Zhao and S. Ahalt, OSU SVM Classifier Matlab Toolbox (ver 3.00), http://eewww.eng.ohio-state.edu/~maj/osu_svm/.
16. M. Maron, Automatic indexing: an experimental inquiry, J. Assoc. Comput. Mach.
8(3) (1961) 404–417.
17. P. M. Narendra and K. Fukunaga, A branch and bound algorithm for feature subset
selection, IEEE Trans. Comput. 26(9) (1977) 917–922.
18. H. T. Ng, W. B. Goh and K. L. Low, Feature selection, perceptron learning, and
a usability case study for text categorization, Proc. SIGIR-97, 20th ACM Int.
Conf. Research and Development in Information Retrieval (Philadelphia, US, 1997),
pp. 67–73.
19. G. Salton, A. Wong and C. Yang, A vector space model for automatic indexing,
Commun. ACM 18(11) (1975) 613–620.
20. H. Schütze, D. A. Hull and J. O. Pedersen, A comparison of classifiers and docu-
ment representations for the routing problem, Proc. SIGIR-95, 18th ACM Int. Conf.
Research and Development in Information Retrieval, pp. 229–237.
21. F. Sebastiani, Machine learning in automated text categorization, ACM Comput.
Surv. 34(1) (2002) 1–47.
22. F. Song, C. Cheng, S. Liu and J. Yang, Impact of text representations on performance
of linear support vector machines, Patt. Recogn. Artif. Intell. 17(2) (2004) 161–166
(in Chinese).
23. F. Song, S. Liu and J. Yang, A comparative study on text representation schemes in
text categorization, Patt. Anal. Appl. 8(1–2) (2005) 199–209.
24. Y. Yang, An evaluation of statistical approaches to text categorization, Inform. Retr.
1(1–2) (1999) 69–90.
25. Y. Yang and X. Liu, A re-evaluation of text categorization methods, Proc. SIGIR-99,
22nd ACM Int. Conf. Research and Development in Information Retrieval (1999),
pp. 42–49.
26. Y. Yang and J. O. Pedersen, A comparative study on feature selection in text
categorization, Machine Learning: Proc. Fourteenth Int. Conf. (ICML’97) (1997),
pp. 412–420.