N. Poongavanam [1], R. Giritharan, M.C.A., M.E. [2]
[1] II Year, M.E. Computer Science and Engineering
[2] Asst. Prof./CSE
Department of Computer Science & Engineering,
Thiruvalluvar College of Engineering & Tech., Vandavasi.
ABSTRACT—This paper describes Olex, a novel method for the automatic induction of rule-based text classifiers. Olex supports a hypothesis language of the form "if T1 or ... or Tn occurs in document d, and none of Tn+1, ..., Tn+m occurs in d, then classify d under category c," where each Ti is a conjunction of terms. The proposed method is simple and elegant. Despite this, the results of a systematic experimentation performed on the REUTERS-21578, OHSUMED, and ODP data collections show that Olex provides classifiers that are accurate, compact, and comprehensible.
Index Terms—Data mining, text mining, clustering, classification and association rules, mining methods and algorithms.
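The hypothesis language described in the abstract can be sketched in Python as follows (a minimal illustration; the function name, data layout, and the example rule for an invented "grain" category are assumptions, not the paper's actual implementation):

```python
def matches(document_terms, positive, negative):
    """Olex-style rule: classify a document under the category iff at
    least one positive conjunction Ti is fully contained in the document
    and no negative conjunction is.  Each conjunction is a set of terms.
    (Illustrative sketch only.)"""
    d = set(document_terms)
    fires = any(ti <= d for ti in positive)      # some Ti occurs in d
    blocked = any(ti <= d for ti in negative)    # some negative Ti occurs
    return fires and not blocked

# Hypothetical rule for an invented category "grain":
positive = [{"wheat"}, {"corn", "export"}]   # T1, or (T2 = corn AND export)
negative = [{"football"}]                    # none of these may occur in d

print(matches({"wheat", "price"}, positive, negative))              # True
print(matches({"corn", "export", "football"}, positive, negative))  # False
```

In Olex terms, each positive conjunction votes the document into the category, while the negative conjunctions act as exceptions that veto the classification.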
Proceedings of the National Conference on Recent Trends in Engineering Sciences 2k11
• a background knowledge B as a set of ground logical facts of the form t ∈ d, meaning that term t occurs in document d (other ground predicates may occur in B as well), and
• a set P of positive examples consisting of ground logical facts of the form d ∈ c, meaning that document d belongs to category c (ideal classification); given P, the set N of negative examples consists of the facts d ∈ c that are not in P

constructs a hypothesis Hc (the classifier of c) that, combined with the background knowledge B, is (possibly) consistent with all positive and negative examples, i.e., B ∧ Hc ⊨ P and B ∧ Hc ⊭ N. The induced rules will allow prediction about the belonging of a document to a category on the basis of the presence or absence of some terms in that document.

The above induction problem is essentially an instance of Inductive Logic Programming (ILP), which deals with the general problem of inducing logic programs from examples in the presence of background knowledge. It is well known that ILP problems are computationally intractable, so a main topic is that of identifying classes of programs that are efficiently learnable. The theory of PAC-learnability [9] provides a model of approximated polynomial learning, where the polynomially bounded amount of resources (both number of examples and computational time) is traded off against the accuracy of the induced hypothesis. It is well known that, if a learning problem is PAC-learnable, the related consistency problem is in the randomized polynomial complexity class RP [10]. In [9], Valiant identifies a subset of propositional logic which is PAC-learnable. More recently, several authors have investigated the identification of PAC-learnable subclasses of first-order Horn clauses, providing a number of both positive and negative results. For instance, in [11], Dzeroski et al. show that a restricted class of function-free clauses, namely k-discriminative nonrecursive ij-determinate predicate definitions, is PAC-learnable. In [19], Kietz and Dzeroski prove that relaxing some syntactic restrictions on the above class of programs implies the loss of PAC-learnability (unless RP = NP). Finally, in [13], Gottlob et al. show that learning a function-free Horn clause with no more than k literals from a set of both positive and negative examples expressed as function-free ground Horn clauses is NP-hard (notably, the respective ILP-consistency problem is Σ2P-complete) and, thus, not PAC-learnable (unless RP = NP). An overview of some important techniques for deriving complexity results has been given by Cohen and Page in [8]. However, while in ILP it is assumed that the input sample is consistent with some hypothesis in the hypothesis space, this is not necessarily true in TC; indeed, the relationship between terms and categories is nondeterministic, i.e., it is not possible, in general, to correctly categorize a document under a category only on the basis of the terms occurring in it. For this reason, the expected induced hypothesis is in general one which maximally satisfies the examples (both positive and negative).

3. NOTATION AND PRELIMINARY DEFINITIONS
Throughout this paper, we assume the existence of
1. a finite set C of categories, called classification scheme;
2. a finite set D of documents (i.e., sequences of words), called corpus; D is partitioned into a training set TS, a validation set, and a test set; the training set along with the validation set represent the so-called seen data, used to induce the model, while the test set represents the unseen data, used to assess the performance of the induced model; and
3. a relation I ⊆ C × D (ideal classification) which assigns each document d ∈ D to a number of categories in C.
We denote by TSc the subset of the training set TS whose documents belong to category c according to I (the training set of c). In the following, we will concentrate on a single category c ∈ C. Once a classifier for category c has been constructed, its capability to take the right categorization decision is tested by applying it to the documents of the test set and then comparing the resulting classification to the ideal one. The effectiveness of the predicted classification is measured in terms of the classical notions of Precision, Recall, and F-measure.

4. SELECTION OF DISCRIMINATING TERMS: PROBLEM DEFINITION AND COMPLEXITY
In this section, we provide a description of the optimization problem aimed at generating a best set of discriminating terms (d-terms, for short) for category c ∈ C. In particular, we give a formal statement of the problem and show its complexity. To this end, some preliminary definitions are needed (Table 1). A term (or n-gram) is a sequence of one or more words, or variants obtained by using word stems, consecutively occurring within a document. A scoring function σ (or feature
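The notion of a term (an n-gram of consecutively occurring words) defined in Section 4 can be sketched as follows (a simple enumeration; stemming and stop-word handling are omitted, and the function name is an assumption):

```python
def ngrams(words, max_n):
    """Enumerate all terms (n-grams) of length 1..max_n, i.e., sequences
    of consecutively occurring words, per Section 4's definition."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            out.append(tuple(words[i:i + n]))
    return out

doc = ["acquisition", "of", "shares"]
print(ngrams(doc, 2))
# [('acquisition',), ('of',), ('shares',), ('acquisition', 'of'), ('of', 'shares')]
```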
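Precision, Recall, and F-measure, the effectiveness notions used throughout the experiments, can be sketched as follows. The α-weighted F shown here is the common parameterized form (an assumption; the paper's Eq. (2) may differ), and α = 0.5 reduces it to the usual harmonic mean of P and R:

```python
def precision_recall_f(tp, fp, fn, alpha=0.5):
    """Precision, Recall, and alpha-weighted F-measure computed from
    true positives, false positives, and false negatives.
    alpha = 0.5 yields the standard F1 (harmonic mean of P and R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = p * r / (alpha * r + (1 - alpha) * p) if p and r else 0.0
    return p, r, f

p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
print(p, r, f)   # P = R = F = 0.8 (up to float rounding)
```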
(notice that the current maximum value of Fα is stored in Fmax, a global variable which is modified by Generate-New-Set at line p8). The symbol ⊤ at line p4 represents the logical constant true (taken both positively and negatively). This formal device is needed in order for the algorithm to also capture d-terms of degree 1. The evaluation of Fα is based on (2). Once the best term t has been selected, it is returned to the main program through the global variable topt (which is updated at line p8) and then removed from V(φ, v) (line 8). The do-while loop iterates as long as the vocabulary V(φ, v) is not empty and a new d-term is generated. It is straightforward to realize that the size of Xc

solution Xc of problem DT-GEN generated by algorithm Greedy-Olex, for given values of φ, v, and α. Now, to learn a "best" classifier Hc of c, Olex essentially repeatedly induces Hc(φ, v, α) for different input vocabularies, each time validating it over the validation set. Here, the input consists of a number of vocabularies (obtained by setting different values of φ and v), along with a value for α (which depends on the needs of the application at hand; usually, α = 0.5). Whenever a classifier Hc(φ, v, α) is generated by the induction step (lines 4-5), it is validated over the validation set (line 7). Finally, the classifier Hc with the maximum value of the F-measure over the validation set is output (line 9). Hc is assumed to be the "best classifier" of c. Notice that, of the three model parameters φ, v, and α, only the former two are actually used for "driving" the search of Hc.
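The selection loop just described, which induces Hc(φ, v, α) for each candidate vocabulary and keeps the classifier with the highest validation F-measure, can be sketched as follows (the callbacks `induce` and `validate_f` are hypothetical stand-ins for the induction and validation steps, not the paper's API):

```python
def select_best_classifier(vocabularies, induce, validate_f, alpha=0.5):
    """Induce a classifier for each input vocabulary V(phi, v) and return
    the one with the maximum F-measure over the validation set."""
    best, best_f = None, float("-inf")
    for vocab in vocabularies:
        hc = induce(vocab, alpha)   # induction step
        f = validate_f(hc)          # validation over the validation set
        if f > best_f:
            best, best_f = hc, f
    return best, best_f

# Toy stand-ins: a "classifier" is just its vocabulary size, and the
# validation score peaks at size 30.
best, f = select_best_classifier([10, 20, 30, 40],
                                 induce=lambda v, a: v,
                                 validate_f=lambda h: -abs(h - 30))
print(best)   # 30
```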
represented as sets of word stems. Second, we proceeded to the partitioning of the training corpora.
As far as REUTERS-21578 and OHSUMED are concerned, we segmented each corpus into five equal-sized partitions for cross validation. During each run, four partitions are used for training and one for validation (note that validation and test sets coincide in this case). Each of the five combinations of one training set and one validation set is a fold.
Concerning ODP-S25 (for which the holdout method was used), we segmented the corpus into two partitions: the seen data (70 percent of the corpus documents) and the unseen data (the remaining 30 percent). The former is used to induce the model (according to the learning process of Fig. 2), while the latter is used for testing. The seen data were then randomly split into a training set (70 percent), on which to run algorithm Greedy-Olex, and a validation set, on which to tune the model parameters. We performed both of the above splits in such a way that each category was proportionally represented in both sets (stratified holdout).
Finally, for every corpus and training set, we scored all terms occurring in the documents of the training set TSc of c, for each c ∈ C.

6.3 Performance Metrics
Classification effectiveness was measured in terms of the classical notions of Precision, Recall, and F-measure, as defined in Section 3. To obtain global estimates relative to experiments performed over a set of categories, the standard definition of microaveraged Precision and Recall was used.

6.4 Results with REUTERS (Cross Validation)
The first data set we considered is the REUTERS-21578, and the task was to assign documents to one or more categories of R90. As already mentioned, performance evaluation was based on fivefold cross validation. At each fold, we conducted a number of experiments for the induction of the best classifier of each category in R90, according to the algorithm sketched in Fig. 2. In particular, the algorithm was executed with input vocabularies.

6.4.1 Performance
The F-measure and BEP obtained at each of the five folds, and the respective means, are reported. Table 3, in turn, provides a picture of the results for the 10 most frequent categories of R90 (hereafter referred to as R10), averaged over the five folds. In particular, besides F-measure and BEP, for each category we report the average characteristics of the respective best classifiers (i.e., number of rules and number of negative literals occurring in each rule; recall that the sum of these two values equals the number of induced d-terms).

6.4.2 Effect of Category Size on Performance
We partitioned the categories in R90 with more than seven documents into four intervals, based on their size. Then, we evaluated the mean F-measure over the categories of each group, averaged over the five folds. Results are summarized in Table 5. As we can see, the F-measure values indicate that performance is substantially constant across the various subsets, i.e., there is no correlation between category size and predictive accuracy (this is not the case for other machine learning techniques, e.g., decision tree induction classifiers, which are biased toward frequent classes [15]).

6.5 Results with OHSUMED (Cross Validation)
The second data set we considered is OHSUMED, and the task was to assign documents to one or more categories of the 23 MeSH diseases. As for REUTERS-21578, we performed a fivefold cross validation by running, at each fold, Algorithm 2 with input vocabularies V(CHI, v), with v ∈ {10, 20, 30, ..., 100} (also in this case, α was set to 0.5). Performance results are reported in Table 6. As we can see, the average μ-F and μ-BEP are equal to 66.08 and 66.31, respectively.
In Table 7, we summarize the performance values for the five most frequent MeSH categories, along with the characteristics of the respective best classifiers, averaged over the five folds. Again, we notice the compactness of the classifiers (at most 40 rules, for category C23).

TABLE 7
OHSUMED: Cross-Validation Results for the Five Most Frequent MeSH Categories
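The fivefold cross-validation layout used in these experiments, five equal-sized partitions with each fold training on four and validating on one, can be sketched as follows (a plain round-robin split for illustration; the actual splits described above were stratified):

```python
def five_fold(docs):
    """Partition a corpus into five equal-sized parts and yield the five
    (training, validation) combinations ("folds"): four parts vs. one."""
    parts = [docs[i::5] for i in range(5)]
    for k in range(5):
        train = [d for i, p in enumerate(parts) if i != k for d in p]
        yield train, parts[k]

docs = list(range(10))
folds = list(five_fold(docs))
print(len(folds))                                  # 5
print(sorted(folds[0][0] + folds[0][1]) == docs)   # True
```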