Introduction
What is Text Categorization? "Text Categorization (TC) is a supervised learning task, defined as assigning category labels to new documents based on the likelihood suggested by a training set of labelled documents" (Yang and Liu).
Examples:
News stories are typically organized by subject categories or topics; they are among the most commonly investigated application domains in the TC literature.
Spam filtering, where email messages are classified into the two categories of spam and non-spam.
Approaches
An increasing number of learning approaches have been applied, including:
regression models
nearest neighbor classification
Bayesian probabilistic approaches
decision trees
inductive rule learning
neural networks
on-line learning
support vector machines
Issues
While the rich literature on each individual method provides valuable information, clear conclusions from cross-method comparisons have been difficult to draw, because published results are often not directly comparable. For example, one cannot tell whether the performance of neural networks (NNet) by Wiener is statistically better or worse than the performance of the Support Vector Machines (SVM) by Joachims, because different data collections were used in the evaluations of those methods.
F1 measure: the harmonic mean of recall (r) and precision (p),

F1 = 2rp / (r + p),

which balances the two quantities in a single effectiveness score.
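As a minimal sketch (the function and counts below are illustrative, not from the paper), F1 can be computed from the contingency counts of a binary classifier:

def f1_score(tp, fp, fn):
    # precision and recall from true positives, false positives, false negatives
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

print(f1_score(80, 20, 10))  # p = .80, r ~ .889, F1 ~ .842

Micro-averaging pools the counts over all category/document pairs before applying the formula; macro-averaging computes F1 per category and then averages the scores.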
Classifiers
SVM
Introduced by Vapnik in 1995 for solving two-class pattern recognition problems, the method is defined over a vector space where the problem is to find a decision surface that "best" separates the data points into two classes. To define the "best" separation, the authors introduce the "margin" between the two classes: the distance between the separating surface and the nearest data points on either side.
SVM (continued)
The SVM problem is to find the decision surface that maximizes the margin between the data points in a training set. For a linearly separable space, the decision surface found by SVM is a hyperplane, which can be written as

w . x + b = 0

where x is an arbitrary data point, and the vector w and constant b are learned from a training set of linearly separable data.
SVM (continued)
Letting D = {(y_i, x_i)} denote the training set, with y_i in {+1, -1} the classification of x_i, the SVM problem is to find the w and b that satisfy the constraints

y_i (w . x_i + b) >= 1 for all i

while minimizing the 2-norm of w. This can be solved using quadratic programming techniques.
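Most SVM libraries solve this quadratic program internally. As a hedged sketch (scikit-learn is not the software used in the paper), a linear SVM on toy two-dimensional data:

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two classes in two dimensions
X = np.array([[0., 0.], [1., 1.], [3., 3.], [4., 4.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C approximates the hard-margin problem
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # the learned vector w and constant b

With separable data and a large C, the learned hyperplane approaches the maximum-margin solution described above.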
kNN
kNN stands for k-nearest neighbor classification. The algorithm is simple: given a test document, the system finds the k nearest neighbors among the training documents and uses the categories of those k neighbors to weight the category candidates. The rule can be written as

y(x, c_j) = sum over the k nearest neighbors d_i of sim(x, d_i) . y(d_i, c_j) - b_j

where y(d_i, c_j) in {0, 1} is the classification of training document d_i with respect to category c_j; sim(x, d_i) is the similarity between test document x and training document d_i; and b_j is the category-specific threshold for binary decisions.
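A minimal NumPy sketch of this rule, assuming tf-idf document vectors and cosine similarity (the function name and data layout below are my own, not the paper's):

import numpy as np

def knn_category_scores(x, train_docs, train_labels, k, b):
    # train_docs: (n, d) tf-idf matrix; train_labels: (n, m) 0/1 category matrix
    # b: (m,) category-specific thresholds
    sims = train_docs @ x / (np.linalg.norm(train_docs, axis=1) * np.linalg.norm(x) + 1e-12)
    top = np.argsort(sims)[-k:]              # indices of the k nearest neighbors
    scores = sims[top] @ train_labels[top]   # similarity-weighted category votes
    return scores - b                        # assign category j wherever the score is positive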
LLSF
LLSF stands for Linear Least Squares Fit. A multivariate regression model is automatically learned from a training set of documents and their categories: the input vector is a document in the conventional vector space model, and the output vector consists of the categories of the corresponding document.
LLSF (continued)
By solving an LLSF on the training pairs of vectors, one obtains a matrix of word-category regression coefficients:

F_LS = arg min over F of ||FA - B||^2

where matrices A and B represent the training data (the columns of A are document vectors and the columns of B the corresponding category vectors), and the solution matrix F_LS defines a mapping from an arbitrary document to a vector of weighted categories.
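A sketch of that fit using NumPy's least-squares solver (the matrix sizes and random data below are illustrative, not the paper's):

import numpy as np

A = np.random.rand(2415, 500)                          # words x training documents
B = np.random.randint(0, 2, (90, 500)).astype(float)   # categories x training documents

# min_F ||F A - B||^2 is equivalent to solving A^T F^T ~= B^T column by column
F_T, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
F_LS = F_T.T                                           # (categories x words) coefficient matrix

x = np.random.rand(2415)                               # an arbitrary document vector
category_weights = F_LS @ x                            # vector of weighted categories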
NNet
NNet techniques have been intensively studied in artificial intelligence. In this experiment, a practical consideration is the training cost: training an NNet is usually much more time-consuming than training the other classifiers. It would be too costly to train one NNet per category, so the authors trained a single NNet for all 90 categories of Reuters.
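The paper does not detail the architecture beyond using one network that scores all categories at once. As a sketch of that multi-label set-up (scikit-learn and the random data here are my own choices, not the paper's), MLPClassifier accepts a 0/1 label matrix:

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(200, 1000)            # 1000 selected features (illustrative data)
Y = np.random.randint(0, 2, (200, 90))   # 90 Reuters categories, multi-label targets

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
net.fit(X, Y)                            # one network scores all 90 categories at once
scores = net.predict_proba(np.random.rand(1, 1000))  # per-category membership scores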
NB
Naive Bayes (NB) probabilistic classifiers are commonly studied in machine learning. The basic idea in NB approaches is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document.
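A minimal sketch with scikit-learn's multinomial NB (one implementation of the idea, not necessarily the paper's variant; the data is illustrative):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.random.randint(0, 5, (200, 2000))  # word-count features, 2000 selected terms
y = np.random.randint(0, 2, 200)          # binary decision for one category

nb = MultinomialNB()                      # combines per-word likelihoods with category priors
nb.fit(X, y)
print(nb.predict_proba(X[:1]))            # P(category | document)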
Significance Tests
The P-value indicates the significance level of the observed evidence against the null hypothesis, i.e., evidence that system A performs better or worse than system B.
The test hypothesis and the P-value computation are the same as those in the micro sign test (s-test).
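The s-test is a standard sign test on pooled binary decisions: count the trials where exactly one of the two systems is correct, and check the winner count against a Binomial(n, 0.5) null. A sketch with SciPy (the counts below are illustrative, not from the paper):

from scipy.stats import binomtest

n = 500   # decisions where exactly one of systems A and B is correct
k = 280   # of those, the number won by system A

result = binomtest(k, n, p=0.5)  # sign test under H0: A and B are equally good
print(result.pvalue)             # small P-value -> the observed difference is significant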
Evaluation
Experiment set-up
1000 features were selected for NNet, 2000 for NB, 2415 for kNN and LLSF, and 10000 for SVM.
Experiment set-up (continued)
Other settings were:
Results
The micro-averaged F1 score for SVM (.8599) is slightly lower than Joachims' results (.860-.864).
The micro-averaged F1 score for kNN (.8567) is higher than Joachims' (.823).
The micro-averaged F1 score for NB (.7956) is higher than Joachims' (.720).
Cross-classifier Comparison
The micro-level analysis on pooled binary decisions suggests SVM > kNN >> {LLSF, NNet} >> NB, where classifiers whose performances are not significantly different are grouped into one set. The macro-level analysis on the F1 scores suggests {SVM, kNN, LLSF} >> {NB, NNet}.
Conclusions