
Text Document Classification

by Jana Novovičová

During the last twenty years the number of text documents in digital form has grown
enormously. As a consequence, it is of great practical importance to be able to
organize and classify documents automatically. Research into text classification
aims to partition unstructured sets of documents into groups that describe their
contents. There are two main variants: text clustering, which is concerned with
finding a latent group structure in the set of documents, and text categorization
(often itself called text classification), which can be seen as the task of
structuring the repository of documents according to a group structure that is
known in advance.

Document classification appears in many applications, including e-mail filtering,
mail routing, spam filtering, news monitoring, selective dissemination of
information to information consumers, automated indexing of scientific articles,
automated population of hierarchical catalogues of Web resources, identification of
document genre, authorship attribution, survey coding and so on. Automated text
categorization is attractive because manually organizing text document bases can be
too expensive, or infeasible given the time constraints of the application or the
number of documents involved.
The task of text categorization (TC) is the focus of a research project at the
Institute of Information Theory and Automation, at the Academy of Sciences of the
Czech Republic (UTIA). The dominant approach to TC is based on machine learning
techniques. We can roughly distinguish three different phases in the design of TC
systems: document representation, classifier construction and classifier
evaluation. Document representation denotes the mapping of a document into a
compact form of its content. A text document is typically represented as a vector
of term weights (word features) from a set of terms (the dictionary), where each
term occurs in at least a certain minimum number (k) of documents. A major
characteristic of the TC problem is the extremely high dimensionality of text data.
The number of potential features often exceeds the number of training documents.
Dimensionality reduction (DR) is a very important step in TC, because irrelevant
and redundant features often degrade the performance of classification algorithms
both in speed and classification accuracy.
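The representation step described above can be sketched in a few lines. This is a minimal, illustrative version that assumes raw term-frequency weights and whitespace tokenization; a real TC system would typically add TF-IDF weighting, stemming and stop-word removal:

```python
from collections import Counter

def build_dictionary(docs, k=2):
    # Keep only terms that occur in at least k documents
    # (the minimum document frequency mentioned in the text).
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return sorted(t for t, c in df.items() if c >= k)

def to_vector(doc, dictionary):
    # Represent a document as a vector of term weights,
    # here simply the raw frequency of each dictionary term.
    counts = Counter(doc.lower().split())
    return [counts[t] for t in dictionary]

docs = ["the cat sat", "the dog sat", "a dog barked"]
vocab = build_dictionary(docs, k=2)   # terms appearing in >= 2 documents
vector = to_vector("the dog sat", vocab)
```

Even on this toy corpus the effect of the threshold k is visible: singleton terms such as "cat" or "barked" are dropped, which is the simplest form of the dimensionality reduction discussed next.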

DR in TC often takes the form of feature selection. Feature subset selection
methods for TC typically use an evaluation function that is applied to each
feature individually. The best individual features (BIF) method evaluates all words individually
according to a given criterion, sorts them and selects the best subset of words.
Since the vocabulary usually contains several thousand or tens of thousands of
words, BIF methods are popular in TC. However, such methods evaluate each word
separately, and completely ignore the existence of other words and the manner in
which the words work together.
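A BIF-style selector can be sketched as follows. The scoring criterion here is a simple pointwise mutual information between term presence and class, estimated from document counts; the corpus, labels and exact formula are illustrative assumptions, not the criteria or data used in the UTIA experiments:

```python
import math

def mi_score(term, docs, labels, cls):
    # Pointwise mutual information between term presence and class cls:
    # log( P(term, cls) / (P(term) * P(cls)) ), estimated from document counts.
    n = len(docs)
    n_t = sum(term in d.split() for d in docs)
    n_c = sum(l == cls for l in labels)
    n_tc = sum(term in d.split() and l == cls for d, l in zip(docs, labels))
    return math.log(n_tc * n / (n_t * n_c)) if n_tc else float("-inf")

def bif_select(docs, labels, m):
    # BIF: score every word independently (by its best class),
    # sort the words by score, and keep the top m.
    vocab = sorted({t for d in docs for t in d.split()})
    classes = set(labels)
    score = {t: max(mi_score(t, docs, labels, c) for c in classes) for t in vocab}
    return sorted(vocab, key=score.get, reverse=True)[:m]

docs = ["price rises today", "stock price falls", "match today", "goal match win"]
labels = ["econ", "econ", "sport", "sport"]
selected = bif_select(docs, labels, 4)
```

Note that each word is scored in isolation, which is exactly the weakness described above: a pair of individually strong but mutually redundant words can both be selected, while a word that is only useful in combination with others is overlooked.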

UTIA proposed the use of the sequential forward selection (SFS) method based on
novel improved mutual information measures as criteria for reducing the
dimensionality of text data. These criteria take into consideration how features
work together. The performance of the proposed criteria using SFS compared to
mutual information, information gain, chi-square statistics and odds ratio using
the BIF method has been investigated. Experiments using a naive Bayes classifier
based on the multinomial model, a linear support vector machine (SVM) and k-nearest
neighbour classifiers on the Reuters data sets were performed and the results were
analysed from various perspectives, including precision, recall and F1-measure.
Preliminary experimental results on the Reuters data indicate that SFS methods
significantly outperform BIF based on the above-mentioned evaluation functions.
Furthermore, the SVM on average outperforms both the naive Bayes and k-nearest neighbour
classifiers on the test data.
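The contrast with BIF can be sketched generically: SFS grows the feature subset greedily, scoring each candidate in the context of the features already chosen. The toy coverage-based criterion below is purely illustrative and stands in for the improved mutual information measures developed at UTIA:

```python
def sfs(candidates, criterion, m):
    # Sequential forward selection: start from the empty subset and
    # repeatedly add the candidate that most improves the criterion
    # evaluated on the subset as a whole.
    selected, remaining = [], list(candidates)
    while len(selected) < m and remaining:
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy criterion: how many distinct documents the chosen terms jointly cover.
# "price" and "stock" are redundant (same documents); "match" is complementary.
coverage = {"price": {0, 1}, "stock": {0, 1}, "match": {2, 3}}

def joint_coverage(subset):
    return len(set().union(*(coverage[f] for f in subset)))

chosen = sfs(["price", "stock", "match"], joint_coverage, 2)
# SFS avoids the redundant pair and picks complementary terms
```

Because the criterion is applied to the growing subset rather than to single features, SFS rejects the redundant second term that a BIF-style ranking would happily keep; this is the "features working together" property the text refers to.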
Currently, text classification research at UTIA is heading in two directions.
First, investigation of sequential floating search methods and oscillating
algorithms (developed at UTIA) for reducing the dimensionality of text data; and
second, design of a new probabilistic model for document modelling based on
mixtures for simultaneously solving the problems of feature selection and
classification. These phases of the project rely on the involvement of PhD students
from the Faculty of Mathematics and Physics at Charles University in Prague.

This research is partially supported by GAAV No A2075302, the EC MUSCLE project
FP6-507752 and the project 1M6798555601 DAR.

Please contact:
Jana Novovičová, CRCIM (UTIA), Czech Republic
Tel: +420 2 6605 2224
E-mail: novovic@utia.cas.cz
