Combining Information Extraction with Genetic Algorithms for Text Mining

John Atkinson-Abutridy, Chris Mellish, and Stuart Aitken, University of Edinburgh
An evolutionary approach that combines information extraction technology and genetic algorithms can produce a new, integrated model for text mining.

Text mining (TM) discovers unseen patterns in textual databases. But these discoveries are useless unless they contribute valuable knowledge for users who make strategic decisions. Confronting this issue can lead to knowledge discovery from texts (KDT), a complicated activity that involves both discovering unseen knowledge (through TM) and evaluating this potentially valuable knowledge. KDT can benefit from techniques that have been useful in data mining or knowledge discovery from databases (KDD).1 However, you can't immediately apply data mining techniques to text data for TM because they assume a structure in the source data that isn't present in free text. You must therefore use new representations for text data.

In many TM applications, you can use more structured representations than just keywords to perform analysis that uncovers unseen patterns. Early research on such an approach was based on seminal work on exploratory analysis of article titles stored in the Medline medical database.2 Other approaches have exploited these ideas by combining more elaborate information extraction (IE) patterns with general lexical resources such as WordNet3 or specific concept resources such as thesauri.4 Another approach, relying on IE patterns, uses linguistic resources such as WordNet to assist the discovery and evaluation of patterns to extract basic information from general documents.5

In this context, we propose a new KDT approach that combines fragments of key information extracted from text documents and then uses a multicriteria optimization strategy to produce explanatory knowledge without using external resources.

Although researchers have tackled KDD tasks as learning problems, the nature of KDD suggests that data mining is a continuous task of searching for and optimizing potential hypotheses that can maximize quality criteria. A significant number of successful, practical search-and-optimization techniques exist,6 but some techniques are more appealing for KDD tasks than others. In particular, genetic algorithms (GAs) look promising. Compared with classical search-and-optimization algorithms, GAs are much less susceptible to getting stuck in local suboptimal regions of the search space because they perform global searches by exploring multiple solutions in parallel. Being robust, GAs can also cope with noisy and missing data. However, to use GAs effectively in KDT, we must first tackle several problems, including devising high-level representations and tailoring new genetic operations.

Evolutionary knowledge discovery
We've brought together the benefits of GAs for data mining and genre-based IE technology to propose a new approach for high-level knowledge discovery. Unlike previous KDT approaches, our model doesn't rely on external resources or conceptual descriptions. Instead, it performs the discovery using only information from the original corpus of text documents and from training data computed from them. The GA that produces the hypotheses is strongly guided by semantic constraints: several specifically defined metrics evaluate each hypothesis's quality and plausibility.

Figure 1 shows our approach's working model, divided into two general levels of processing. The input is a corpus of technical and scientific natural-language documents; the output is a small set of the hypotheses that the GA discovered.

The first level involves a preprocessing step that
Abstract:
  "The current study aims to provide the basic information about the fertilizer system, especially in its nutrient dynamics." → GOAL: provide the basic information ...
  "For this, long-term trends of the soil's chemical and physical fertility were analyzed." → OBJECT: long-term trends of the soil's ...
  "The methodology is based on the study of land plots using different records of usage of crop rotation with fertilizers in order to detect long-term changes." → METHOD: study of land plots ...
  "Finally, a deep checking of data allowed us to conclude that soils have improved after 12 years of continuous rotation." → CONCLUSION: soils have improved after 12 ...

Extracted rule:
  IF
    GOAL(provide("the basic information about ..."))
    & OBJECT(analyze("long-term trends of the soil ..."))
    & METHOD(study("land plot using ..."))
  THEN
    CONCLUSION(improve("soils ..."))

Figure 2. Rule representation from the semantic and rhetorical information extracted from an abstract.
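The rule of Figure 2 can be transcribed into a simple data structure. The following sketch is illustrative only: the role names and predicate structure come from Figure 2, but the class and function names are our own, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Predicate:
    """A predicate with its rhetorical role, e.g. GOAL(provide("..."))."""
    role: str        # rhetorical role: GOAL, OBJECT, METHOD, or CONCLUSION
    action: str      # predicate name, e.g. "provide"
    argument: str    # (possibly truncated) argument text

@dataclass
class Rule:
    """An IF/THEN rule: a conjunction of condition predicates and a conclusion."""
    conditions: list      # list[Predicate], joined by '&'
    conclusion: Predicate

# The rule of Figure 2, transcribed:
rule = Rule(
    conditions=[
        Predicate("GOAL", "provide", "the basic information about ..."),
        Predicate("OBJECT", "analyze", "long-term trends of the soil ..."),
        Predicate("METHOD", "study", "land plot using ..."),
    ],
    conclusion=Predicate("CONCLUSION", "improve", "soils ..."),
)

def render(rule):
    """Pretty-print a rule in the IF/THEN form used in Figure 2."""
    conds = "\n  & ".join(f'{p.role}({p.action}("{p.argument}"))'
                          for p in rule.conditions)
    c = rule.conclusion
    return f'IF\n  {conds}\nTHEN\n  {c.role}({c.action}("{c.argument}"))'

print(render(rule))
```

In this representation, a GA individual (hypothesis) has the same shape as an extracted rule, so genetic operators can recombine conditions while preserving the IF/THEN structure.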
We perform this mainly by using a variation of the strength concept (proposed by Walter Kintsch) between a predicate and the surrounding terms in the semantic neighborhood.9 We compute the relevance of a hypothesis H with predicates P_i(A_i) and target concepts (<term1> and <term2>) as the average semantic similarity of every predicate (and argument) of H to the target concepts. That is,

  relevance(H) = ( Σ_{i=1}^{|H|} [ strength(P_i, A_i, <term1>) + strength(P_i, A_i, <term2>) ] ) / ( 2 |H| ),

where |H| denotes the length of H (that is, the number of predicates).

Structure and cohesion
The structure criterion addresses the question of how good the rhetorical roles' structure is, which we can approximate by determining how much of the initial extracted rules' structure is exhibited in the current hypothesis. To this end, this metric uses the training information provided at the beginning to compute the structure's quality according to a bigram model in which the roles r_i form a sequence of tags. That structure's quality is

  Structure(H) = Prob(r_1) × Π_{i=2}^{|H|} Prob(r_i | r_{i−1}).

In this equation, r_i represents the ith role of the hypothesis H, Prob(r_i | r_{i−1}) denotes the conditional probability that role r_{i−1} immediately precedes r_i, and Prob(r_1) denotes the probability that no role precedes r_1 (the beginning of the structure).

The cohesion criterion addresses the question of how likely it is that a predicate action will be associated with some specific rhetorical role. The underlying issue here is that some predicate relations P_i will be more likely than others to be associated with some rhetorical role r_i. For this, hypotheses containing this kind of association should be "rewarded" in the search-optimization phase. Using the conditional probabilities provided by the training data, we express H's cohesion as

  Cohesion(H) = ( Σ_{r_i, P_i ∈ H} Prob(P_i | r_i) ) / |H|,

where Prob(P_i | r_i) is the conditional probability of the predicate P_i given a rhetorical role r_i.

Coherence and coverage
The coherence metric addresses the question of whether the elements of the current hypothesis relate to each other in a semantically coherent way. Unlike rules produced by data mining techniques, in which the order of the conditions is not an issue, the hypotheses produced in our model rely on pairs of adjacent elements that should be semantically sound, a property that has long been dealt with in the linguistic domain in the context of text coherence. Because we have semantic information provided by the LSA analysis, complemented with rhetorical and predicate-level knowledge, we developed a simple method to measure coherence, following work on measuring text coherence.10

We calculate coherence by considering the average semantic similarity between consecutive elements of the hypothesis. However, we compute this closeness only on the semantic information that the predicates and their arguments convey, because we considered the role structure in a previous criterion. Accordingly, we express the criterion as

  Coherence(H) = ( Σ_{i=1}^{|H|−1} SemanticSimilarity(P_i(A_i), P_{i+1}(A_{i+1})) ) / ( |H| − 1 ),

where (|H| − 1) denotes the number of adjacent pairs.

The coverage metric addresses the question of how much the model supports the hypothesis. KDD approaches have usually measured coverage of a hypothesis by considering some data structuring that isn't in textual information. In addition, most KDD approaches have assumed the use of linguistic or conceptual resources to measure coverage. We designed a hybrid method that combines semantic constraints and organization-related aspects to obtain the coverage. In general terms, a hypothesis H covers a rule R_i only if R_i contains the predicates of H.

The first step involves establishing the semantic similarity between H and R_i. However, because this relation is symmetrical, the second step analyzes whether the hypothesis contains elements of the rules. Formally, we define it as

  RulesCovered(H) = { RU_i ∈ RuleSet | ∀ HP_k ∈ HP ∃ P_j ∈ RU_i :
      ( SemanticSimilarity(HP_k, P_j) ≥ threshold ∧ predicateName(HP_k) = predicateName(P_j) ) }.

In this equation, SemanticSimilarity(HP_k, P_j) represents the LSA-based similarity between predicates HP_k and P_j, threshold defines a minimum fixed value, RuleSet denotes the whole set of rules, HP represents the list of predicates with arguments of H, and P_j denotes a predicate (with arguments) contained in RU_i. Once we compute the set of covered rules, we can compute the criterion as

  Coverage(H) = |RulesCovered(H)| / |RuleSet|,

where |RulesCovered(H)| denotes the size of the set of rules covered by H and |RuleSet| denotes the size of the initial set of extracted rules.

Simplicity and interestingness
The simplicity criterion addresses the question of how simple the hypothesis is. For this, we focus on hypothesis length. Because the criterion must be maximized, and shorter or easy-to-interpret hypotheses are preferable, the evaluation is simply

  Simplicity(H) = 1 − ( |H| / <MaxElems> ),

where <MaxElems> denotes a fixed maximum number of elements allowed for any hypothesis.

The interestingness criterion captures the degree of surprisingness and/or unexpectedness in what the hypothesis conveys. Unlike another approach that uses a linguistic resource for this purpose,11 we measure this criterion in terms of the unexpectedness of the rela-
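Several of the criteria above reduce to simple arithmetic once the training statistics are available. The sketch below illustrates Structure, Cohesion, Coherence, and Simplicity for a hypothesis given as a sequence of (role, predicate) pairs. It is a minimal sketch, assuming that probability tables and a similarity function already exist; the toy dictionaries and the callback stand in for the paper's training data and LSA model.

```python
def structure(roles, p_start, p_bigram):
    """Structure(H) = Prob(r_1) * product over i=2..|H| of Prob(r_i | r_{i-1})."""
    score = p_start.get(roles[0], 0.0)
    for prev, cur in zip(roles, roles[1:]):
        score *= p_bigram.get((cur, prev), 0.0)  # Prob(cur | prev)
    return score

def cohesion(pairs, p_pred_given_role):
    """Cohesion(H): average of Prob(P_i | r_i) over all (role, predicate) pairs."""
    return sum(p_pred_given_role.get((pred, role), 0.0)
               for role, pred in pairs) / len(pairs)

def coherence(elements, similarity):
    """Coherence(H): average semantic similarity of consecutive elements."""
    adjacent = list(zip(elements, elements[1:]))
    return sum(similarity(a, b) for a, b in adjacent) / len(adjacent)

def simplicity(h_len, max_elems):
    """Simplicity(H) = 1 - |H| / <MaxElems>; shorter hypotheses score higher."""
    return 1.0 - h_len / max_elems

# Toy usage with made-up training statistics:
hypothesis = [("GOAL", "provide"), ("METHOD", "study"), ("CONCLUSION", "improve")]
roles = [r for r, _ in hypothesis]
p_start = {"GOAL": 0.8}
p_bigram = {("METHOD", "GOAL"): 0.5, ("CONCLUSION", "METHOD"): 0.6}
p_pred_given_role = {("provide", "GOAL"): 0.4, ("study", "METHOD"): 0.3,
                     ("improve", "CONCLUSION"): 0.5}

print(structure(roles, p_start, p_bigram))       # 0.8 * 0.5 * 0.6
print(cohesion(hypothesis, p_pred_given_role))   # (0.4 + 0.3 + 0.5) / 3
print(simplicity(len(hypothesis), max_elems=8))  # 1 - 3/8
```

Because each criterion is already normalized (probabilities or ratios), the GA can combine them in a multicriteria fitness without further rescaling.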
IE task extracted the corresponding 1,000 rules and training information, which we used to create an initial population of 100 semirandom hypotheses. We then ran five versions of the GA, each one using the same global parameters but a different pair of target terms (see Table 1). These pairs of terms were regarded as relevant by a domain expert and therefore deserved further attention. Each run produced the overall best five hypotheses: in total, the best 25 hypotheses that contain optimum criteria according to the system.

Table 1. Pairs of user-defined target terms used for each run.

  Run   Term 1          Term 2
  1     enzyme          zinc
  2     glycocide       inhibitor
  3     antinutritious  cyanogenics
  4     degradation     erosive
  5     cyanogenics     inhibitor

We then designed a Web-based experiment in which we converted the different hypotheses into a readable natural-language form to be assessed by 20 domain experts. To avoid a huge workload, each expert assessed no more than five hypotheses, and three experts assessed each hypothesis. We asked the experts to assess each hypothesis in terms of four KDD criteria of quality: Interestingness, Novelty, Usefulness, and Sensibleness. We added the Additional Information criterion to determine whether (according to the target concepts of the corresponding run from which the hypothesis was produced) the hypothesis contributes additional information to help the experts understand the unseen relationship between the previously defined target concepts.

Unlike other TM and KDT approaches, both the system evaluation and the experts' assessment consider multiple features of quality for the discovered knowledge. We can regard the hypotheses' overall quality, as assessed by the experts, as an average of these criteria. We performed the whole assessment on a scale from 1 (worst) to 5 (best). Figure 5 shows the average resulting scores in assessing the 25 hypotheses for each criterion.

Figure 5. Average score per criterion (Additional Information, Interestingness, Novelty, Usefulness, Sensibleness) for the 25 assessed hypotheses.

Table 2. Overall evaluation.

  Criterion               Confidence (95%)
  Additional Information  2.60 ± 0.168
  Interestingness         2.60 ± 0.173
  Novelty                 2.30 ± 0.205
  Sensibleness            2.51 ± 0.237
  Usefulness              2.56 ± 0.228

The assessment of individual criteria (see Table 2) illustrates that some hypotheses did well, with scores above the average (3). This is the case for Hypotheses 11, 16, and 19 in terms of Interestingness (Hypotheses 7, 17, and 23 are just at the average), and Hypotheses 14 and 19 in terms of Sensibleness (Hypotheses 3, 11, and 17 are just at the average),