
Text Mining

Combining Information Extraction with Genetic Algorithms for Text Mining

John Atkinson-Abutridy, Chris Mellish, and Stuart Aitken, University of Edinburgh

An evolutionary approach that combines information extraction technology and genetic algorithms can produce a new, integrated model for text mining.

Text mining (TM) discovers unseen patterns in textual databases. But these discoveries are useless unless they contribute valuable knowledge for users who make strategic decisions. Confronting this issue can lead to knowledge discovery from texts (KDT), a complicated activity that involves both discovering unseen knowledge (through TM) and evaluating this potentially valuable knowledge. KDT can benefit from techniques that have been useful in data mining or knowledge discovery from databases (KDD).1 However, you can't immediately apply data mining techniques to text data for TM because they assume a structure in the source data that isn't present in free text. You must therefore use new representations for text data.

In many TM applications, you can use more structured representations than just keywords to perform analysis to uncover unseen patterns. Early research on such an approach was based on seminal work on exploratory analysis of article titles stored in the Medline medical database.2 Other approaches have exploited these ideas by combining more elaborate patterns and general lexical resources such as WordNet3 or specific concept resources such as thesauri.4 Another approach, relying on information extraction (IE) patterns, uses linguistic resources such as WordNet to assist the discovery and evaluation of patterns to extract basic information from general documents.5

In this context, we propose a new KDT approach that combines fragments of key information extracted from text documents and then uses a multicriteria optimization strategy to produce explanatory knowledge without using external resources.

Although researchers have tackled KDD tasks as learning problems, the nature of KDD suggests that data mining is a continuous task of searching for and optimizing potential hypotheses that can maximize quality criteria. A significant number of successful, practical search-and-optimization techniques exist,6 but some techniques are more appealing for KDD tasks than others. In particular, genetic algorithms look promising. Compared with classical search-and-optimization algorithms, GAs are much less susceptible to getting stuck in local suboptimal regions of the search space because they perform global searches by exploring multiple solutions in parallel. Being robust, GAs can cope with noisy and missing data. However, to use GAs effectively in KDT, we must tackle several problems first, including devising high-level representations and tailoring new genetic operations.

Evolutionary knowledge discovery

We've brought together the benefits of GAs for data mining and genre-based IE technology to propose a new approach for high-level knowledge discovery. Unlike previous KDT approaches, our model doesn't rely on external resources or conceptual descriptions. Instead, it performs the discovery using only information from the original corpus of text documents and from training data computed from them. The GA that produces the hypotheses is strongly guided by semantic constraints, which means that several specifically defined metrics evaluate the hypotheses' quality and plausibility.

Figure 1 shows our approach's working model divided into two general levels of processing. The input is a corpus of technical and scientific natural language documents; the output is a small set of the hypotheses that the GA discovered.

22 1094-7167/04/$20.00 © 2004 IEEE IEEE INTELLIGENT SYSTEMS


Published by the IEEE Computer Society
The first level involves a preprocessing step that produces both the training data for further automatic evaluation of the hypotheses and the initial population of the GA from information extracted from the documents. The IE task applies genre-based extraction patterns and then generates a rule-like representation for each document in a domain corpus. That is, after processing n documents, the extraction stage will produce n rules, each one representing the document's content in terms of a cause-and-effect relation. These rules, along with previously generated training data, will become the semantic model that guides the GA-based discovery. To create the initial population, we create a random set of hypotheses by combining random units from the extracted rules.

The second level is the GA-based knowledge discovery, which aims to produce the explanatory hypotheses. The GA runs for several generations until achieving a maximum number of learning generations. At the end, we obtain a small set of the best K hypotheses, where K is a fixed user-defined value, 1 < K < p.

Figure 1. Our genetic-algorithm-based approach for knowledge discovery from texts. [Diagram: in the preprocessing and training stage, each document in the domain corpus passes through a part-of-speech tagger and role and predicate recognition (the information extraction task), producing one rule per document; latent semantic analysis over the corpus provides training data. The preprocessed data and an initial population of p hypotheses (p << n) feed the genetic algorithm learning stage, which outputs the k discovered novel hypotheses (k << p).]

Genre-based IE

To extract key information from the texts, you must define the representation and specify how it's constructed from the extracted information. To this end, we used a genre-based approach. Specifically, we used the genre of technical abstracts. Typically, abstracts have a well-defined structure that authors use to summarize their ideas and state key facts concisely. This makes abstracts suitable for further shallow analysis and avoids many conceptual-level ambiguities related to the restricted use of concepts in specific contexts.

Linguistic evidence shows that an abstract in a given domain follows a prototypical and even modular organization—the genre-dependent rhetorical structure—that its author uses to express the background information, methods, achievements, and conclusions.7 From a scientific viewpoint, there are also claims that important findings could be searched for by linking this kind of information across documents.

Unlike other researchers who exploit this structure in their work, we use the rhetorical structure and the semantic information in it differently. As Figure 2 shows, starting from an abstract, the IE task extracts information at a rhetorical and semantic level, then uses this information in a rule-like form to represent the documents' key facts.

Abstract:

The current study aims to provide the basic information about the fertilizer system, especially in its nutrient dynamics. [GOAL: Provide the basic information ...]
For this, long-term trends of the soil's chemical and physical fertility were analyzed. [OBJECT: Long-term trends of the soil's ...]
The methodology is based on the study of land plots using different records of usage of crop rotation with fertilizers in order to detect long-term changes. [METHOD: Study of land plots ...]
Finally, a deep checking of data allowed us to conclude that soils have improved after 12 years of continuous rotation. [CONCLUSION: Soils have improved after 12 ...]

Extracted rule:

IF
    GOAL(provide("the basic information about ..."))
    & OBJECT(analyze("long-term trends of the soil ..."))
    & METHOD(study("land plot using ..."))
THEN
    CONCLUSION(improve("soils ..."))

Figure 2. Rule representation from the semantic and rhetorical information extracted from an abstract.
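For concreteness, a rule like the one in Figure 2 can be held in a small data structure. The following Python sketch is illustrative only; the class names are assumptions, not the authors' implementation (which was written in Prolog):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Predicate:
    role: str       # rhetorical role: "goal", "object", "method", or "conclusion"
    action: str     # predicate action, e.g. "provide"
    argument: str   # the sequence of terms forming the argument

@dataclass
class Rule:
    antecedent: List[Predicate]  # the IF part (goal/object/method predicates)
    consequent: Predicate        # the THEN part (conclusion predicate)

# The rule of Figure 2, encoded in this structure:
rule = Rule(
    antecedent=[
        Predicate("goal", "provide", "the basic information about ..."),
        Predicate("object", "analyze", "long-term trends of the soil ..."),
        Predicate("method", "study", "land plot using ..."),
    ],
    consequent=Predicate("conclusion", "improve", "soils ..."),
)
```

Encoding the antecedent as an ordered list matters here, because (as discussed later) the order of the rhetorical roles carries meaning in this model.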




Because we assume that what's stated in the document follows a form of antecedent-consequent reasoning drawn by the author, we convert the intermediate template into a rule-like form that aims to capture the author's scientific evidence and conclusion.

To produce this information, the IE task takes the set of tagged documents and produces a template representation for every document. We then easily convert this representation into an if-then rule. For this purpose, we wrote a set of domain-independent extraction patterns so that we could match them against the input documents. Each extraction pattern constructs an output representation that involves two levels of linguistic knowledge: the rhetorical information expressed in the abstract and the semantic information contained in it, which we later convert into a predicate-like form. We specify these extraction patterns as follows:

    con el proposito de VP ...: goal(&ACTION[VP])
    (In order to ACTION (OBJECT) ...)

The left-hand expression states the pattern to be identified (con el proposito de), and the right-hand side (following the colon) states the corresponding semantic action to be produced. We decompose verb phrase (VP) components into two elements: the predicate action and the sequence of terms that represent its argument. For example, if the input tagged text looks like

    con/p el/art proposito/n de/p producir/inf tomates/n en/p epoca/s de/p invierno/s ...

(roughly, "with the purpose of producing tomatoes in the winter season"), the intermediate representation will be

    goal(producir([tomates, en, epoca, de, invierno, ...])),

where goal is the rhetorical role, producir is the predicate, and the bracketed sequence of terms is its argument (predicate and argument together correspond to the VP). The system will then transform this representation into goal(producir('tomates en epoca de ...')). Although we might not know the set of predicates, we must specify the rhetorical roles in advance because they're common across the technical genre.

Generating training data

Rules extracted this way constitute key elements for producing and evaluating new hypotheses as the GA goes on. However, the rules themselves don't suffice. Specifically, to make similarity judgments in producing hypotheses, we must use the whole corpus of documents to obtain initial knowledge at the lexical-semantics level provided by latent semantic analysis (LSA).8 We augment this knowledge with syntactic information so that the predicates and their arguments are converted into vectors that represent their meaning according to the context similarities learned by LSA.

The initial extracted rules also convey underlying data and associations that can guide evolutionary discovery. First, we build the initial population of hypotheses randomly by combining rhetorical and semantic information. We keep these basic units in a separate database to provide the genetic operations with new information. Second, information about the associations between rhetorical roles and predicate actions provides insights for discovering coherent hypotheses. Indeed, some predicate actions might be more likely to happen with specific rhetorical information than others. The preprocessing takes this into account through a Bayesian approach that accounts for this kind of association by computing the conditional probabilities of predicates p, given some attached rhetorical role r—namely, Prob(p | r).

In producing plausible hypotheses, the rhetorical roles' organization, as in a text discourse, is worth exploring because the meaning of the scientific evidence stated in the abstract can subtly change if the facts' order changes. Indeed, changing the order of the rhetorical information can also alter the coherence between the paragraphs of a text. This suggests that in generating valid hypotheses, some rule structures will be more desirable than others. Therefore, the creation of hypotheses must be fed with information concerning a good structure. To generate this information, we can think of a rule's p roles as a sequence of tags, <r1, r2, ..., rp>, such that ri precedes ri+1. So, we generate the conditional probabilities Prob(rp | rq) for every pair of roles p, q—that is, the probability that rq precedes rp—which we'll further use to evaluate the new hypotheses.

All the training information, the obtained associations, and the semantic knowledge provided by the semistructured LSA will guide how the GA produces hypotheses. In other words, this constitutes the model the GA uses to guide the search for plausible hypotheses in the whole search space.

Figure 3. A semantically constrained and multicriteria genetic algorithm. [Flowchart: from the training data and rules, initialize the population of hypotheses and set generation = 0; then, each generation, perform fitness evaluation against the Pareto set, selection, semantic crossover, mutation, and population update, incrementing the generation counter until the maximum number of generations is reached.]

Hypotheses discovery and evaluation

Once we obtain the training data and the semantic information provided by LSA, the GA starts off from the initial population by searching for optimal hypotheses (as shown in Figure 3) that satisfy multiple quality criteria. Next, we apply semantically constrained genetic operations and evaluate the hypotheses according to the ranking obtained from evaluating these criteria.

We designed the genetic operations to guide the GA-based search and avoid incoherent knowledge. We developed three semantically constrained operations: selection, crossover, and mutation.

Selection picks a small number of the best hypotheses of every generation to be reproduced according to their fitness.

Crossover recombines two selected hypotheses and takes place with some probability; the two hypotheses swap their elements at some random position to produce new offspring. However, because we want to restrict the operation to preserve semantic plausibility, we defined two kinds of recombination (see Figure 4).

On the basis of Don Swanson's inference between the titles of documents,2 we propose a recombination operation (which we call Swanson's crossover) to allow any kind of relationship to hold between hypotheses AB and BC (see Figure 4a). This operation is more flexible than Swanson's patterns. If this transitivity-like operator doesn't apply, we perform the recombination in the usual way by swapping conditions of the hypotheses (see Figure 4b), as long as both hypotheses meet some minimum semantic similarity. In Swanson's crossover, the semantic similarity between parts of the parent hypotheses is kept for further evaluation. This is crucial because the higher the similarity, the better the accuracy of the offspring of this kind of Swanson-like inference.

Figure 4. Two kinds of recombination: (a) Swanson's crossover, which combines hypotheses AB and BC into a new hypothesis AC; and (b) the default semantic crossover, which swaps the parents' conditions at a random crosspoint.

Mutation aims to make small random changes in the hypotheses to explore new possibilities in the search space. We've developed three kinds of constrained mutations. Role mutation selects one rhetorical role (including its contents) and randomly replaces it with one from the initial database of roles and predicates. Predicate mutation selects one predicate action and argument and randomly replaces them with others, which modifies the association between semantics and rhetoric. Argument mutation, because we have no information about the arguments' semantic types, follows a constrained procedure that randomly chooses a new argument from those predicates that have the same name and number of arguments as the current one.

These operators' role is only to produce new hypotheses that might become part of the new generation. However, we must evaluate these individuals' goodness to establish whether they provide plausible hypotheses. Consequently, the fittest individuals will survive to the next learning generation, and others will be eliminated. For this to happen, we must develop appropriate evaluation criteria to measure the hypotheses' quality (from a KDD viewpoint) and design an optimization strategy that trades off between these multiple criteria to create a pool of better hypotheses.

In developing evaluation metrics, we must consider two different issues. The first is metrics related to the hypotheses' meaning and semantics, which aim at ensuring that the hypotheses are coherent and meaningful. The second is metrics related to quality from a KDD viewpoint. We designed a set of eight domain-independent evaluation criteria. The GA's goal is to explore new hypotheses that maximize these criteria. This process produces the set of vectors representing the criteria values for every hypothesis; that is, each hypothesis contains an eight-dimensional objective vector. We then obtain the best hypotheses of every generation through a multicriteria optimization strategy. We define the different proposed criteria in the following sections.

Relevance

The relevance model aims to not only discover novel and interesting knowledge but also provide an explanation for the discovered knowledge. When comparing this approach to bag-of-words approaches, you'd expect the model to provide a meaningful relationship between different concepts. For example, these concepts might be the terms of a produced cluster in a text-mining approach that searches for an unknown relationship.

Because we aim to produce explanatory hypotheses about the novel discovered knowledge, these target concepts can also be part of a major request for the model: to find the set of the best novel hypotheses that help us understand the relationship between a pair of user-defined target concepts. With this in mind, the relevance criterion measures how relevant a hypothesis is to the user-defined target concepts. Specifically, this criterion evaluates the semantic closeness between the predicates of a hypothesis and the target concepts to provide more information about the unknown relationship between these concepts. Because the LSA-based method we use to evaluate this similarity doesn't consider the context to represent the predicate information, we measure relevance in terms of these target concepts' influence on the hypothesis.
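The semantically constrained operators described above can be sketched compactly. The following toy Python fragment is not the authors' Prolog implementation; the hypothesis encoding (an antecedent list plus a consequent) and the precomputed LSA similarity value are assumptions made for illustration:

```python
import random

def swanson_crossover(h1, h2, similarity, threshold=0.5):
    """Transitivity-like recombination: from parents AB and BC, derive AC.

    h1, h2: hypotheses as (antecedent_list, consequent) pairs.
    similarity: semantic similarity between the linking parts, assumed to be
    precomputed by an LSA-based measure.
    Returns (offspring, similarity), storing the similarity so the
    plausibility-of-origin criterion can later reward precise inferences,
    or None when the operator doesn't apply.
    """
    if similarity < threshold:
        return None  # caller falls back to the default semantic crossover
    antecedent_ab, _ = h1
    _, consequent_bc = h2
    return (antecedent_ab, consequent_bc), similarity

def role_mutation(hypothesis, role_db):
    """Replace one randomly chosen condition of the antecedent with a unit
    drawn from the initial database of roles and predicates."""
    antecedent, consequent = hypothesis
    mutated = list(antecedent)
    mutated[random.randrange(len(mutated))] = random.choice(role_db)
    return (mutated, consequent)
```

The key design point, which the real system enforces through its semantic constraints, is that recombination and mutation only ever reuse material from the extracted rules rather than inventing arbitrary conditions.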


We perform this mainly by using a variation of the strength concept (proposed by Walter Kintsch) between a predicate and the surrounding terms in the semantic neighborhood.9 We compute the relevance of a hypothesis H with predicates Pi(Ai) and target concepts (<term1> and <term2>) as the average semantic similarity of every predicate (and argument) of H to the target concepts. That is,

    relevance(H) = (1 / (2|H|)) * Sum_{i=1..|H|} [strength(Pi, Ai, <term1>) + strength(Pi, Ai, <term2>)],

where |H| denotes the length of H (that is, the number of predicates).

Structure and cohesion

The structure criterion addresses the question of how good the rhetorical roles' structure is, which we can approximate by determining how much of the initial extracted rules' structure is exhibited in the current hypothesis. To this end, this metric uses the training information provided at the beginning to compute the structure's quality according to a bigram model in which the roles ri are a sequence of tags. That structure's quality is

    Structure(H) = Prob(r1) * Prod_{i=2..|H|} Prob(ri | ri-1).

In this equation, ri represents the ith role of the hypothesis H, Prob(ri | ri-1) denotes the conditional probability that role ri-1 immediately precedes ri, and Prob(r1) denotes the probability that no role precedes r1 (the beginning of the structure).

The cohesion criterion addresses the question of how likely a predicate action will be associated with some specific rhetorical role. The underlying issue here is that some predicate relations Pi will be more likely than others to be associated with some rhetorical role ri. For this reason, hypotheses containing this kind of association should be "rewarded" in the search-optimization phase. Using the conditional probabilities provided by the training data, we express H's cohesion as

    Cohesion(H) = (Sum_{ri,Pi in H} Prob(Pi | ri)) / |H|,

where Prob(Pi | ri) is the conditional probability of the predicate Pi given a rhetorical role ri.

Coherence and coverage

The coherence metric addresses the question of whether the elements of the current hypothesis relate to each other in a semantically coherent way. Unlike rules produced by data mining techniques, in which the order of the conditions is not an issue, the hypotheses produced in our model rely on pairs of adjacent elements that should be semantically sound, a property that has long been dealt with in the linguistic domain in the context of text coherence. Because we have semantic information provided by the LSA analysis, complemented with rhetorical and predicate-level knowledge, we developed a simple method to measure coherence, following work on measuring text coherence.10

We calculate coherence by considering the average semantic similarity between consecutive elements of the hypothesis. However, we compute this closeness only on the semantic information that the predicates and their arguments convey, because we considered the role structure in a previous criterion. Accordingly, we express the criterion as

    Coherence(H) = (Sum_{i=1..|H|-1} SemanticSimilarity(Pi(Ai), Pi+1(Ai+1))) / (|H| - 1),

where (|H| - 1) denotes the number of adjacent pairs.

The coverage metric addresses the question of how much the model supports the hypothesis. KDD approaches have usually measured coverage of a hypothesis by considering some data structuring that isn't in textual information. In addition, most KDD approaches have assumed the use of linguistic or conceptual resources to measure coverage. We designed a hybrid method that combines semantic constraints and organization-related aspects to obtain the coverage. In general terms, a hypothesis H covers a rule Ri only if Ri contains the predicates of H.

The first step involves establishing the semantic similarity between H and Ri. However, because this relation is symmetrical, the second step analyzes whether the hypothesis contains elements of the rules. Formally, we define it as

    RulesCovered(H) = {RUi in RuleSet | for all HPk in HP there exists Pj in RUi:
        (SemanticSimilarity(HPk, Pj) >= threshold AND predicateName(HPk) = predicateName(Pj))}.

In this equation, SemanticSimilarity(HPk, Pj) represents the LSA-based similarity between predicates HPk and Pj, threshold defines a minimum fixed value, RuleSet denotes the whole set of rules, HP represents the list of predicates (with arguments) of H, and Pj denotes a predicate (with arguments) contained in RUi. Once we compute this set of rules, we can compute the criterion as

    Coverage(H) = |RulesCovered(H)| / |RuleSet|,

where |RulesCovered(H)| denotes the size of the set of rules covered by H, and |RuleSet| denotes the size of the initial set of extracted rules.

Simplicity and interestingness

The simplicity criterion addresses the question of how simple the hypothesis is. For this, we focus on hypothesis length. Because the criterion must be maximized, and shorter or easy-to-interpret hypotheses are preferable, the evaluation is simply

    Simplicity(H) = 1 - (|H| / <MaxElems>),

where <MaxElems> denotes a fixed maximum number of elements allowed for any hypothesis.

The interestingness criterion captures the degree of surprisingness and/or unexpectedness in what the hypothesis conveys. Unlike another approach that uses a linguistic resource for this purpose,11 we measure this criterion in terms of the unexpectedness of the relation between the antecedent and consequent of the hypothesis. We can evaluate this criterion from the semistructured information provided by the LSA analysis. Because we're looking for interesting or unexpected connections, we assess this criterion through the semantic dissimilarity between antecedent and consequent—that is,

    Interestingness(H) = 1 - SemanticSimilarity(An(H), Co(H)),

where An(H) and Co(H) represent the antecedent and consequent of H.

Plausibility of origin

As we previously mentioned, Swanson's crossover encourages the production of potentially novel hypotheses in terms of a transitivity-like inference whose precision (the degree of similarity of the parts of the parent hypotheses being recombined) might hopefully be higher for the cases considered worth exploring. Thus, the plausibility-of-origin criterion measures the potential plausibility of the current hypothesis by remembering the quality of the inference when this hypothesis was created. This is necessary because other operations might have created the current hypothesis. In other words, keeping this similarity from Swanson's crossover provides a simple mechanism to ensure that plausibility of origin considers only hypotheses that this operator has recombined. Accordingly, the criterion for a hypothesis H is simply

    Plausibility(H) = Sp, if H was created by Swanson's crossover;
    Plausibility(H) = 0, if H is in the original population or is the result of another operation,

where Sp is the semantic similarity kept from the crossover.

Steady-state strategy

Once we evaluate the hypotheses' criteria, we must determine the best and worst individuals to choose which ones will survive and which ones won't. Consequently, we'll modify the current population so that (hopefully) we can consider better individuals in the next generation. How do we establish the set of best hypotheses in every generation? In a traditional GA, the notion of best or worst is clear because it involves picking the individuals with higher or lower fitness values. However, because we're dealing with multiple criteria, no unique fitness value exists. So, we must redefine the notion of optimum, which is usually a relation of dominance.6

A hypothesis H1 dominates another hypothesis H2 if both these conditions are true:

1. H1 isn't worse than H2 in any of the criteria fj, or fj(H1) >= fj(H2) for all j = 1, 2, ..., N criteria.
2. H1 is strictly better than H2 in at least one criterion, or fj(H1) > fj(H2) for at least one j in {1, 2, ..., N}.

In other words, we trade off the vector of criteria of each hypothesis against the vectors of the other competing hypotheses to determine the best ones. The set of nondominated individuals is usually referred to as the Pareto set. Computing this dominance relation to determine which hypotheses are part of the Pareto set doesn't say anything about their fitness. To obtain the "quality" of each created hypothesis, we must compute a fitness measure.

To get an individual fitness value from several evaluation criteria, the multiobjective optimization strategy must complement the Pareto set. We use one such strategy, called the strength Pareto evolutionary algorithm (SPEA),12 which deals with the diversity of the solutions and the fitness assignment as a whole. An important aspect of SPEA is that some individuals remain unmodified from one generation to another. Because we're handling rule-like hypotheses, we must avoid losing some hypotheses' good material, so we must keep the offspring only if they're better than the population's worst parent. To do so, we adopt a steady-state strategy in which we replace a small portion of the worst parents with the offspring (from the best parents) only if the latter are better than the former.

The whole strategy's outcome is the computed Pareto set for every generation, which we incrementally update. We compute the fitness for the nondominated and dominated individuals differently. For nondominated individuals, the fitness is a proportion of the dominated individuals in the population. For dominated individuals, the fitness is the accumulated sum of the fitness of all the individuals that dominate the current solution. The algorithm will produce fitness values (such as strength) between 0 and 1 for the nondominated individuals, and fitness values greater than 1 for the dominated individuals. We prefer low fitness values because they represent good solutions.

If the number of Pareto members exceeds some user-defined threshold, we reduce the Pareto set by clustering its elements in terms of the similarity between their criteria vectors. Because of this clustering, the actual set maintained only approximates the true Pareto set. Because the optimization strategy aims to improve the solutions in the Pareto set as the GA goes on, we've explored two main choices:

- Improving the solutions in the Pareto set through the genetic operations
- Improving the dominated solutions by bringing them into the Pareto set or by modifying their genetic material to improve individual fitness values

The best individuals are those with low fitness values because they contribute positively to improving the dominated individuals. Once we've computed the fitness and produced the offspring, we update the population by replacing the worst parents with the offspring through the steady-state strategy.13

Evaluation and results

To evaluate our approach's search ability, we implemented a prototype of our model for GA-based KDT in Prolog and integrated it with the rest of the system as shown in Figure 1. Then we selected and cleaned up a corpus of documents in an example domain (agriculture). We used one-third of the documents for setting parameters and making general adjustments; the rest we used for the GA in the evaluation stage.



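The dominance relation and Pareto-set computation described in the steady-state strategy are compact enough to sketch directly. This illustrative Python fragment (again, not the authors' Prolog code) treats each hypothesis simply as its vector of criteria values to be maximized:

```python
def dominates(h1, h2):
    """h1 dominates h2 if it is no worse on every criterion and strictly
    better on at least one (criteria vectors to be maximized)."""
    return (all(a >= b for a, b in zip(h1, h2))
            and any(a > b for a, b in zip(h1, h2)))

def pareto_set(population):
    """Return the nondominated criteria vectors of the population."""
    return [h for h in population
            if not any(dominates(other, h)
                       for other in population if other is not h)]
```

Note that this naive filter is quadratic in the population size; SPEA-style implementations additionally assign strength-based fitness and cluster the set when it grows too large, as the text describes.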

Table 1. Pairs of user-defined target IE task extracted the corresponding 1,000 huge workload, each expert assessed no more
terms used for each run. rules and training information, which we than five hypotheses, and three experts
Run Term 1 Term 2 used to create an initial population of 100 assessed each hypothesis. We asked the
1 enzyme zinc
semirandom hypotheses. experts to assess each hypothesis in terms of
2 glycocide inhibitor We then ran five versions of the GA, each four KDD criteria of quality: Interestingness,
3 antinutritious cyanogenics one using the same global parameters but a Novelty, Usefulness, and Sensibleness. We
4 degradation erosive different pair of target terms (see Table 1). added the Additional Information criterion to
5 cyanogenics inhibitor These pairs of terms were regarded as rele- determine whether (according to the target
vant by a domain expert and therefore concepts of the corresponding run from
Table 2. Overall evaluation.
deserved further attention. Each run pro- which the hypothesis was produced) the
duced the overall best five hypotheses—that hypothesis contributes additional information
Criterion Confidence (95%) is, the best 25 hypotheses that contain opti- to help the experts understand the unseen rela-
Additional Information 2.60 ± 0.168 mum criteria according to the system. tionship between the previously defined tar-
Interestingness 2.60 ± 0.173 We then designed a Web-based experiment get concepts.
Novelty 2.30 ± 0.205
Sensibleness 2.51 ± 0.237
in which we converted the different hypothe- Unlike other TM and KDT approaches,
Usefulness 2.56 ± 0.228 ses into a readable natural language form to both the system evaluation and the expert’s
be assessed by 20 domain experts. To avoid a assessment consider multiple features of
quality for the discovered knowledge. We can
regard the hypotheses’ overall quality as
assessed by the experts as an average of these
criteria. We performed the whole assessment
Additional information in a scale between 1 (worst) to 5 (best). Fig-
Interestingness ure 5 shows the average resulting scores in
5.0 Novelty assessing 25 hypotheses for each criterion.
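The run configuration described above (an initial population of 100 semirandom hypotheses evolved under a steady-state strategy, keeping the best five hypotheses per run) can be sketched as follows. This is a generic illustration, not the article's implementation: the bit-string encoding, the one-max fitness, and all parameter values except the population size and result count are invented.

```python
import random

POP_SIZE = 100    # initial population of semirandom hypotheses (as in the text)
GENOME_LEN = 20   # invented bit-string encoding; the real system evolves rules
STEPS = 2000      # invented number of steady-state iterations

def fitness(h):
    # Invented single objective (count of 1 bits); the real system scores
    # hypotheses on eight criteria such as relevance and coherence.
    return sum(h)

def tournament(pop, k=3):
    # Pick the fittest of k randomly sampled individuals.
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    cut = random.randrange(1, len(a))  # one-point crossover
    return a[:cut] + b[cut:]

def mutate(h, rate=0.05):
    return [g ^ 1 if random.random() < rate else g for g in h]

def steady_state_ga():
    pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
           for _ in range(POP_SIZE)]
    for _ in range(STEPS):
        child = mutate(crossover(tournament(pop), tournament(pop)))
        # Steady-state replacement: a single child displaces the current
        # worst individual, rather than rebuilding a whole generation.
        worst = min(range(POP_SIZE), key=lambda i: fitness(pop[i]))
        if fitness(child) > fitness(pop[worst]):
            pop[worst] = child
    return sorted(pop, key=fitness, reverse=True)[:5]  # best five per run
```

The steady-state choice matters here: because only the current worst individual is ever replaced, good hypotheses survive across the whole run instead of being discarded at generation boundaries.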
The assessment of individual criteria (see Table 2) illustrates that some hypotheses did well, with scores above the average (3). This is the case for Hypotheses 11, 16, and 19 in terms of Interestingness (Hypotheses 7, 17, and 23 are just at the average); Hypotheses 14 and 19 in terms of Sensibleness (Hypotheses 3, 11, and 17 are just at the average); Hypotheses 1, 5, 11, 17, and 19 in terms of Usefulness (Hypotheses 3, 10, and 15 are just at the average); and Hypothesis 24 in terms of Novelty (Hypotheses 11, 19, and 23 are just at the average). The assessment seems to be consistent for individual hypotheses across the criteria: Hypothesis 19 is well above the average for almost all the criteria (except for Novelty), while Hypothesis 18 always received a score below 2 (25 percent), except for Additional Information, in which its score is slightly higher.
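Table 2's per-criterion figures (a mean over the 1-to-5 ratings together with a 95 percent confidence half-width) can be reproduced mechanically. The sketch below uses a normal approximation (1.96 standard errors) and invented ratings; the article does not state which interval formula produced Table 2, so this is only one plausible reading of its "±" column.

```python
import math
import statistics

def mean_ci95(ratings):
    # Mean of the 1-5 expert ratings plus a 95 percent half-width under a
    # normal approximation: 1.96 * (sample std dev) / sqrt(n).
    n = len(ratings)
    mean = statistics.fmean(ratings)
    half = 1.96 * statistics.stdev(ratings) / math.sqrt(n)
    return round(mean, 2), round(half, 3)

# Invented ratings for one criterion (several experts x hypotheses):
result = mean_ci95([2, 3, 2, 3])  # a (mean, half-width) pair such as 2.60 ± 0.168
```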
Although there are very good individual hypotheses, the average scores show that, except for Novelty and Usefulness, the assessments are below average. The average scores for Additional Information (along with Interestingness) are slightly above the rest at 2.60 (40 percent), with an expected mean with the lowest variation of all the criteria (0.168). One reason for the assessment of Additional Information is that its quality depends on how much information in one hypothesis is relevant to the target terms.
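The system compares hypotheses multiobjectively: one criteria vector dominates another when it is at least as good on every criterion and strictly better on at least one (Pareto dominance, following the multiobjective evolutionary literature the article cites). The test below is our generic sketch with invented two-criterion vectors, not the system's code, and it assumes higher is better on every criterion.

```python
def dominates(a, b):
    # Pareto dominance for criteria vectors where higher is better:
    # a is at least as good as b everywhere and strictly better somewhere.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def nondominated(vectors):
    # Keep only the vectors that no other vector dominates (the Pareto front).
    return [v for v in vectors
            if not any(dominates(u, v) for u in vectors if u is not v)]

# Invented two-criterion vectors, e.g. (relevance, coherence):
hypotheses = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.4, 0.4)]
front = nondominated(hypotheses)  # (0.4, 0.4) is dominated by (0.5, 0.5)
```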
Figure 5. Experts' assessment of hypotheses. (The original figure charts the average score, on the 0-to-5 axis, that each of the 25 hypotheses received for Additional Information, Interestingness, Novelty, Usefulness, and Sensibleness.)

However, because dominance conditions must be met, the relevance values aren't

28 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS


always high. The semantic similarity also affects the contribution of a hypothesis to explaining the relation between the concepts. Indeed, although LSA can regard a hypothesis as very close to the terms, the similarity doesn't ensure that these terms will appear as they are: some close semantic neighbor might appear instead. Details of the evaluation suggest that the experts made little effort to realize that, while in most cases the target concepts didn't appear, some close concepts try to make a hypothesis about the target concepts understandable. Although some hypotheses show scores below the average, several hypotheses look encouraging in that this is a demanding evaluation in terms of a very complex human task. The model shows promise in terms of finding (and filtering) good hypotheses rather than discovering overall high results, which might not be possible.

On the other hand, we could discern no direct evidence on how a human would perform on the same task. For such a complex task, humans might perform poorly when analyzing large amounts of text data. For the same reason, humans might not be able to find the hypotheses that the system found. To address this issue, we measured the correlation between the scores of the human subjects and the model evaluation. Because both the experts' assessment and the system's model evaluated the results considering several criteria, we first performed a normalization aimed at producing a single quality value for each hypothesis. For the expert assessment, we averaged the scores of the different criteria for every hypothesis (values between 1 and 5). For the system evaluation, we considered both the objective values and the fitness of every hypothesis to come up with a single value between 0 and 2. For visualization, we scaled these values up to the same range as the expert assessment (1 to 5). We then calculated the pair of values for every hypothesis and obtained a (Spearman) correlation r = 0.43 (t-test = 18, df = 25, p < 0.001). From this result, we see that the correlation indicates a promising level of correspondence between the system and the human judgments. On the other hand, it suggests that an expert wouldn't perform much better than the system for such a complex task with such a large amount of information.

We collected and summarized the domain expert information to produce factors identified during the experiment that, we believe, might explain somewhat the low scores in the assessment. In this closer analysis, some of the issues included comprehensibility, inconsistency between paragraphs, specificity of topics, domain-specific issues, and incomplete hypotheses.

To show what the final hypotheses look like, we picked some of the average best and worst hypotheses as assessed by the experts (out of the 25 best hypotheses). Specifically, we highlight two of the best hypotheses and one of the worst. Because we do not have domain knowledge to analyze the content of these hypotheses, we provide brief descriptions of the predicates' arguments to give a flavor of the knowledge involved.

Hypothesis 65 of Run 4. (Table 1 shows the target concepts used for the runs.) The hypothesis is represented by this rule:

IF goal(perform(19311)) and goal(analyze(20811))
THEN establish(111)

(Numerical values represent internal identifiers for the arguments and their semantic vectors.) This hypothesis has a criteria vector [0.92, 0.09, 0.50, 0.005, 0.7, 0.00, 0.30, 0.25] (the vector's elements represent the values for the criteria relevance, structure, coherence, cohesion, interestingness, plausibility, coverage, and simplicity). It obtained an average expert assessment of 3.74. In natural language text, this rule can roughly be interpreted as

IF work aims at performing the genetic grouping of populations …
AND to analyze the vertical integration for elaborating Pinus timber …
AND to establish the setting values in native timbers
THEN the best agricultural use for land lots of organic agriculture must be established …

The hypothesis appears to be more relevant and coherent than the others (its relevance is 92 percent). However, it isn't complete in terms of cause and effect. For instance, the methods are missing.

Hypothesis 88 of Run 3. This is represented by this rule:

IF goal(present(11511)) and method(use(25511))
THEN conclusion(effect(1931,1932))

The hypothesis has the criteria vector [0.29, 0.18, 0.41, 0.030, 0.28, 0.99, 0.30, 0.50] and obtained an average expert assessment of 3.20. In natural language text, this can roughly be interpreted as

IF the goal is to present the forest restoration …
AND the method is based on the use of microenvironments for capturing farm mice …
THEN digestibility "in vitro" should have an effect on the bigalta cuttings …

This hypothesis looks more complete (goal, methods, and so forth) but is less relevant than the previous hypothesis despite its close coherence (50 percent versus 41 percent). Also, its plausibility is much higher than for Hypothesis 65, but the other criteria seemed to be a key factor for the experts.

Hypothesis 52 of Run 5. This is represented by this rule:

IF object(perform(20611)) and object(perform(2631))
THEN effect(1931,1932)

The hypothesis has a criteria vector [0.29, 0.48, 0.49, 0.014, 0.2, 0.00, 0.30, 0.50] and obtained an average expert assessment of 1.53. In natural language text, it can roughly be interpreted as

IF the object of the work is to perform the analysis of the fractioned honey …
AND to carry out observations for the study of Pinus hartwegii
THEN digestibility "in vitro" should have an effect on the bigalta cuttings …

The structure (48 percent) is better than


for Hypothesis 88 (18 percent). However, because Hypothesis 52 isn't complete, it received a lower score than Hypothesis 88. This might be because the difference in structure between object-object and goal-method is not significant. Because both hypotheses became final solutions, the experts scored best those that better explained the facts. Because the model relies on the training data, it doesn't ensure that every hypothesis is complete. In fact, the training data show that only 26 percent of the 326 rules contain some sort of method.

Evaluating the model shows that it's plausible to produce new knowledge by combining shallow text analysis with evolutionary techniques without using external resources. Our work also contributes to the evaluation and assessment of quality criteria that most other approaches have neglected, by proposing new evaluation criteria to measure the plausibility of the hypotheses as they are produced. In addition, our proposed semantic representation can help capture key information concerning the creation of hypotheses in a way that goes beyond the structural information in the rules. Our approach handles the problem of the diversity of solutions by having semantic and rhetorical constraints in mind. Unlike traditional methods, this helps deal with the underlying text knowledge without needing to perform further deep analyses.

On the evolutionary-learning side, further research must investigate how the time the evolutionary system spends affects the results' quality. We could have traded off some criteria to test how to improve the results. Other complex issues require more extensive experimental testing. Overall, the model shows a good level of prediction in terms of its correlation with human judgments, which is comparable to or even better than related approaches.5 However, the most outstanding feature is how our approach achieves this without using external resources.

The Authors

John Atkinson-Abutridy is an associate professor at the Departamento de Ingeniería Informática, Universidad de Concepción, Chile. His research interests include natural language processing, knowledge discovery from texts, artificial intelligence, and evolutionary computation. He received his PhD in artificial intelligence from the University of Edinburgh. He's a member of the ACM, AAAI, and the Association for Computational Linguistics. Contact him at the Departamento de Ingeniería Informática, Universidad de Concepción, Concepción, Chile; atkinson@inf.udec.cl.

Chris Mellish holds a chair in the Department of Computing Science at the University of Aberdeen. His research interests include natural language processing and logic programming. He's especially interested in natural language generation. He received his PhD in artificial intelligence from the University of Edinburgh. Contact him at the Dept. of Computing Science, Univ. of Aberdeen, King's College, Aberdeen AB24 3UE, UK; cmellish@csd.abdn.ac.uk.

Stuart Aitken is a member of the University of Edinburgh's Artificial Intelligence Applications Institute. His research interests include ontology, bioinformatics, intelligent tools for knowledge acquisition, and machine learning. He received his PhD in engineering from the University of Glasgow. Contact him at the Artificial Intelligence Applications Inst., Univ. of Edinburgh, Appleton Tower, Room 4.10, Crichton St., Edinburgh EH8 9LE, UK; stuart@aiai.ed.ac.uk.

References

1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

2. D. Swanson, "On the Fragmentation of Knowledge, the Connection Explosion, and Assembling Other People's Ideas," Bull. of the Am. Soc. for Information Science and Technology, vol. 27, no. 3, Feb./Mar. 2001, pp. 12–14; www.asis.org/Bulletin/Mar-01/swanson.html.

3. M. Hearst, "Automated Discovery of WordNet Relations," WordNet: An Electronic Lexical Database, MIT Press, 1998, pp. 131–151.

4. C. Jacquemin, "Syntagmatic and Paradigmatic Representation of Terms Variation," Proc. 37th Ann. Meeting Assoc. for Computational Linguistics, Assoc. for Computational Linguistics, 1999, pp. 341–348.

5. S. Basu et al., "Using Lexical Knowledge to Evaluate the Novelty of Rules Mined from Text," Proc. NAACL 2001 Workshop WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Assoc. for Computational Linguistics, 2001, pp. 144–149.

6. K. Deb, Multiobjective Optimization Using Evolutionary Algorithms, John Wiley & Sons, 2001.

7. S. Teufel and M. Moens, "Discourse-Level Argumentation in Scientific Articles: Human and Automatic Annotation," Proc. ACL 1999 Workshop towards Standards and Tools for Discourse Tagging, Assoc. for Computational Linguistics, 1999.

8. T. Landauer, P. Foltz, and D. Laham, "An Introduction to Latent Semantic Analysis," Discourse Processes, vol. 10, no. 25, 1998, pp. 259–284.

9. W. Kintsch, "Predication," Cognitive Science, vol. 25, no. 2, 2001, pp. 173–202.

10. P. Foltz, W. Kintsch, and T. Landauer, "The Measurement of Textual Coherence with Latent Semantic Analysis," Discourse Processes, vol. 25, no. 2, 1998, pp. 259–284.

11. S. Basu et al., "Evaluating the Novelty of Text-Mined Rules Using Lexical Knowledge," Proc. 7th Int'l Conf. Knowledge Discovery and Data Mining, ACM Press, Aug. 2001, pp. 233–238.

12. E. Zitzler and L. Thiele, An Evolutionary Algorithm for Multiobjective Optimisation: The Strength Pareto Approach, tech. report 43, Swiss Federal Inst. of Technology (ETH), 1998.

13. M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
